Validate, profile, and document your data
I used to work for a retail analytics company where we provided analytical solutions, such as inventory and allocation optimization, demand forecasting, and dynamic pricing, to help retailers improve their businesses.
A typical workflow starts from a daily feed from the customer, which is the raw data used as input for our solutions. After a series of data cleaning, manipulation, analysis, and modeling steps, results are created and sent to the customer.
One of the main challenges in such processes is the validation of data coming from the customer. If it contains some unexpected or absurd values, the results will not be useful. In fact, they might do more harm than good.
If these problems are only detected at the results step, the impact compounds. You will probably need to rerun the pipeline, which means extra cost and wasted time. A worse scenario is sending the flawed results to the customer, who then uses them in their operations.
Luckily, we have a lot of tools to prevent such disasters from happening. Great Expectations is one of them. It is a Python library for validating, documenting, and profiling your data to maintain quality and improve communication between teams.
Great Expectations allows for asserting what you expect from the data, which helps catch data issues quickly and at an early step.
The main component of the library is Expectation, which is a declarative statement that can be evaluated by a computer. Expectations are basically unit tests for your data.
Expectations are given intuitive names that clearly tell us what they are about. Here is an example:

df.expect_column_values_to_be_between(
    column="price", min_value=1, max_value=10000
)
What this Expectation does is check whether the values in the column fall between the specified minimum and maximum values.
There are a lot of Expectations defined in the core library. However, we are not limited to these: the Great Expectations community has contributed many more.
We can install it via pip as follows:
pip install great_expectations
Then, we can import it:
import great_expectations as ge
Let’s do some examples using a sales dataset I prepared with mock data. You can download it from the datasets repository on my GitHub page. It’s called “sales_data_with_stores”.
In order to use the Expectations, we need a Great Expectations dataset. We have two different ways to create it:
- From a Pandas DataFrame using the from_pandas function
- From a CSV file using the read_csv function of Great Expectations
import great_expectations as ge

df = ge.read_csv("datasets/sales_data_with_stores.csv")
type(df)
In order to catch an unexpected value in a column with distinct values, we can use the expect_column_distinct_values_to_be_in_set expectation. It checks if all the values in the column are in the given set.
Let’s use it on the store column.
The expectation fails (i.e. success: false) because we have a value (Daisy) in the store column that is not in the given list.
In addition to indicating success and failure, the output of an Expectation contains some other pieces of information such as the observed values, number of values, and missing values in the column.
We can check if the maximum value of a column falls within a specific range (the bounds below are illustrative):

max_check = df.expect_column_max_to_be_between(
    "price", min_value=1, max_value=10000
)

The output is in a dictionary format, so we can easily extract a specific part of it and use it in our pipelines.
Uniqueness of value is important for some features such as an id column. We can check if all the values in a column are unique.
# for a single column
df.expect_column_values_to_be_unique("product_code")

# for a combination of columns (column names here are illustrative)
df.expect_compound_columns_to_be_unique(["product_code", "store"])
The outputs of these Expectations are quite long, so I'm not showing them here, but they include valuable insights such as the number of unexpected values and a partial list of them.
A simple yet useful expectation is to check if a particular column exists in the dataset.
This comes in handy when you want to make sure the daily data feed contains all the necessary columns.
We have gone through only a few examples, but there are currently 297 Expectations in the library, and the number keeps growing. One of the things I really like about these Expectations is that their names are self-explanatory, so it's easy to understand what they do.
You may argue that these checks could be written in pure Python or with some other packages. You are right, but the Great Expectations library has some advantages:

- It has a standard and highly intuitive syntax.
- Some Expectations are not trivial and would require many lines of code if you wrote them yourself.
- Last but not least, Great Expectations also creates data documentation and data quality reports from those Expectations.
Thank you for reading. Please let me know if you have any feedback.
Great Expectations: Automated Testing for Data Science and Engineering Teams. Republished from Towards Data Science: https://towardsdatascience.com/great-expectations-automated-testing-for-data-science-and-engineering-teams-1e7c78f1d2d5