🐳 Dataframe.ai — A comprehensive Data Context Management tool for modern data teams
One of the most important aspects of data science and data engineering is understanding and managing the context of the disparate data assets in an organization. Data context 🔎 is about using the right data for the right purpose — and that generally matters more than building sophisticated models, because models are only as good as the data you feed them. To use the right data for the right occasion, you need context on a variety of questions: how was the data created, how is it commonly used in the organization, and is the data pipeline currently in a healthy and up-to-date state, so that the resulting dataset can be trusted?
Data context 🔎 can be categorized into three buckets: structural context, operational context, and social context.
- Structural context 🔎: the schema (table and column schema, data types) and lineage (the order and structure of data flows) of data assets.
- Operational context 🔎: data quality metrics, operational metadata about pipelines, SLA times, query logs.
- Social context 🔎: who is using certain data assets, and for what purpose.
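To make the three buckets concrete, here is a minimal sketch of what a per-asset context record might look like. All class and field names here are illustrative assumptions, not Dataframe.ai's actual data model:

```python
from dataclasses import dataclass

@dataclass
class StructuralContext:
    """Schema and lineage of a data asset."""
    columns: dict[str, str]       # column name -> data type
    upstream_tables: list[str]    # lineage: tables this asset is derived from

@dataclass
class OperationalContext:
    """Pipeline health and freshness signals."""
    last_refreshed: str           # e.g. ISO-8601 timestamp of last successful run
    row_count: int
    sla_hours: int                # expected maximum refresh interval

@dataclass
class SocialContext:
    """Who uses the asset, and for what."""
    top_queriers: list[str]
    documented_use_cases: list[str]

@dataclass
class DataContext:
    asset_name: str
    structural: StructuralContext
    operational: OperationalContext
    social: SocialContext

# A hypothetical context record for one warehouse table.
ctx = DataContext(
    asset_name="analytics.daily_orders",
    structural=StructuralContext(
        columns={"order_id": "BIGINT", "amount": "DECIMAL"},
        upstream_tables=["raw.orders"],
    ),
    operational=OperationalContext(
        last_refreshed="2024-06-01T06:00:00Z", row_count=1_204_332, sla_hours=24
    ),
    social=SocialContext(
        top_queriers=["alice"], documented_use_cases=["revenue dashboard"]
    ),
)
```

A data scientist deciding whether to trust `analytics.daily_orders` would consult all three buckets: the structural record for schema, the operational record for freshness, and the social record for how peers already use it.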
If you are a data scientist, you want data context as you create and modify queries and ML models. And if you are a data engineer, you want data context as you create and modify data pipelines. These are two data developer personas with overlapping goals. We believe that a single tool should comprehensively address the needs of both data scientists and data engineers.
At Dataframe.ai, we are committed to the productivity of data developers. We believe that the right pattern for data context management will be as impactful as the Git-Github pattern in code management. We are breaking down the problem of data context management into four feature sets: 1/ data discovery, 2/ data documentation, 3/ data schema and lineage, and 4/ data quality validation and monitoring.
- Data Discovery: personal and collaborative search over all data from disparate sources (data lakes and data warehouses).
- Data Documentation: free-form Markdown documentation à la GitHub.
- Data Schema & Lineage: automated construction and syncing of data schema and lineage from data warehouse metadata.
- Data Quality Validation and Monitoring: low-code validation of deterministic quality checks and monitoring of statistical quality metrics (time-series anomalies).
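To illustrate the fourth feature set, here is a minimal sketch of the two kinds of quality checks named above: a deterministic validation and a statistical monitor that flags time-series anomalies. The function names and the z-score approach are illustrative assumptions, not Dataframe.ai's implementation:

```python
import statistics

def check_not_null(rows, column):
    """Deterministic check: the given column contains no NULLs."""
    return all(row.get(column) is not None for row in rows)

def is_anomalous(history, latest, z_threshold=3.0):
    """Statistical monitor (illustrative): flag the latest value if it
    deviates from the historical mean by more than z_threshold
    sample standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold

# Deterministic check on a hypothetical orders table.
rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 12.5}]
print(check_not_null(rows, "amount"))        # True: no NULL amounts

# Statistical monitor on daily row counts.
row_counts = [1000, 1020, 980, 1010, 990]    # recent history
print(is_anomalous(row_counts, 1005))        # False: within normal range
print(is_anomalous(row_counts, 250))         # True: sudden drop is flagged
```

In a low-code tool, a user would configure checks like these declaratively (pick a table, a column, and a rule) rather than writing the functions themselves; the sketch just shows what runs underneath such a rule.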
We are eating the apple one bite at a time, so to speak, starting with data discovery & documentation. We believe that every table in an analytical data warehouse should be documented and easily discoverable. It’s the 2020s, folks, and it’s about time ⌚️.