Today, most companies identify data management as a critical part of their data strategy, but often only because poor data management is risky. It is worth going a step further: to improve how data is used and managed, these processes need to be democratized and automated. This may require organizational change, but in the long term there is no alternative.
Moving from raw data to information (knowing what data we have, along with its automated description and context) is an important step toward applying artificial intelligence and machine learning to data use. However, we should not stop at a "statistical" understanding of the data. It is also about whether the data is trustworthy in its business context. This can be illustrated by the problem of interpreting threshold values: while we can generate various statistics and even perform anomaly detection, correct business interpretation remains a challenge.
For example, sales figures typically should not increase by more than 5% per week. A 100% increase in sales should raise an alert and stop the flow of data, rather than feed the usual report a CEO reads. Another example: CPU usage may increase by 300% and still not signal a failure, because the starting consumption was very low. Not every anomaly is an error once the business context is understood.
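The two examples above can be sketched as simple business-driven rules. This is a minimal illustration, not an implementation from the article; the function names and thresholds are hypothetical:

```python
# Hypothetical business-driven data quality rules (names and
# thresholds are illustrative assumptions, not fixed standards).

def check_sales_growth(previous: float, current: float,
                       max_weekly_growth: float = 0.05) -> str:
    """Return 'halt' when week-over-week growth exceeds the business threshold."""
    if previous <= 0:
        return "halt"  # growth is undefined; stop the flow for review
    growth = (current - previous) / previous
    return "halt" if growth > max_weekly_growth else "ok"

def check_cpu_usage(previous_pct: float, current_pct: float) -> str:
    """A large relative jump (e.g. +300%) is fine if absolute usage stays low."""
    if previous_pct > 0 and current_pct / previous_pct >= 4 and current_pct > 80:
        return "alert"
    return "ok"
```

The point of such rules is that they encode business knowledge (acceptable growth, what "high" CPU actually means) that pure statistical anomaly detection cannot infer on its own.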
This need for intelligent alerts has prompted organizations to involve business teams in writing data quality controls. The goal is intelligent solutions that can automatically generate business-driven rules from correlations and trends, including through Machine Learning (ML).
Traditional data management is thus enriched with building and managing data models using ML.
The development and use of machine learning models in a production environment requires clear, unambiguous rules, roles, standards, and metrics.
A robust program for managing machine learning models aims to answer questions such as:
- Who is responsible for the performance and maintenance of production machine learning models?
- How are machine learning models updated and/or retrained (responding to model drift or deterioration)?
- What performance indicators are measured during model development and selection, and what level of performance is acceptable to the business?
- How are models monitored over time to detect deterioration or unexpected, anomalous data and forecasts?
- How are models audited, and can they be explained to people outside the team that developed them?
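One common way to answer the drift-detection question is the Population Stability Index (PSI), which compares the distribution of live input data with the distribution seen at training time. The sketch below is a simplified, dependency-free version; the bin count and the usual "PSI above ~0.2 means significant drift" rule of thumb are conventions, not values from this article:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and live (actual) distribution.

    Values above roughly 0.2 are commonly treated as significant drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")  # catch out-of-range values

    def bin_fractions(data):
        counts = [0] * bins
        for x in data:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        n = len(data)
        # Smooth empty bins so the log term stays defined.
        return [(c if c else 0.5) / n for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A monitoring job might compute this weekly against each model's training sample and raise a governance ticket when the index crosses the agreed threshold.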
The quality of a machine learning model plays a particularly important role in how effectively artificial intelligence is applied, including, at the initial stage, in convincing the people implementing such a strategy in the company.
Issues related to the arrangement of the process
Centralization
A centralized, controlled environment in which all data work takes place is a must. It makes managing data and artificial intelligence significantly simpler.
Speaking of MLOps
Models must be constantly monitored, refreshed, and tested to ensure that their performance matches the needs of the business. To that end, MLOps adapts the best DevOps practices from software development and applies Continuous Integration to data science. Just as the DevOps approach has drastically improved the quality and agility of software delivery, the same should happen with data management. This starts with reproducible, repeatable workflows and models (which requires versioning not only the code, but also the data and the models), continues with a strategy for testing and validating models, and ends with a feedback loop that further improves the model. Model monitoring detects drops in performance, and continuous training responds automatically to any deterioration.
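The monitoring-plus-continuous-training loop described above can be sketched as a small rolling-window monitor. This is an illustrative sketch under assumed names and thresholds, not a prescribed MLOps component:

```python
from collections import deque

class ModelMonitor:
    """Rolling-window performance monitor for a deployed model.

    The window size and tolerance are illustrative assumptions; in
    practice they are agreed with the business (see the governance
    questions above).
    """

    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.05):
        self.baseline = baseline      # accuracy validated at deployment time
        self.tolerance = tolerance    # acceptable drop before retraining
        self.outcomes = deque(maxlen=window)

    def record(self, correct: bool) -> str:
        """Record one prediction outcome and return the recommended action."""
        self.outcomes.append(1 if correct else 0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return "warming_up"
        live = sum(self.outcomes) / len(self.outcomes)
        # Continuous training kicks in when live accuracy degrades.
        return "retrain" if self.baseline - live > self.tolerance else "ok"
```

In a real pipeline the "retrain" signal would trigger an automated training job against versioned data, followed by validation against the same baseline before the refreshed model is promoted.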
Sharing
Democratization means faster and better access to data, but also understanding it. This requires:
- ensuring cooperation between teams,
- faster model building; running models without re-coding; continuous training,
- raw data and model resource management and model monitoring.
Ultimately, this leads us to build a mechanism that enables real-time decision making based on centralized analytics.
New roles in MLOps
Organizations are increasingly realizing the need for a central team responsible for creating data platforms that help the rest of the organization do their jobs better. Naturally, this team needs a leader.
Data platform leader
In the past, this responsibility fell to more traditional positions such as data warehouse specialists and data architects. It is now common to have a data leader who drives the data initiative across the organization. Data platform leaders typically oversee the modernization of the company's data stack (or its build from scratch, in the case of start-ups). A leader who can convince people and teams to adopt data and data platforms in their daily work connects those who decide which data products to invest in with the motivation of the people who end up using those products.
Analytics engineer
Analysts point out the limitations of relying on data engineers to produce and maintain data models. Thanks to modern technology and tools, analysts can work with data more easily, with the entire data transformation process in their own hands. Today, the analyst has also become an Analytics Engineer: the owner of the entire data stack, from acquisition and transformation to the delivery of useful data sets to other departments of the company.
As mentioned above, to improve how data is used and managed, these processes should be democratized and automated, preferably through an MLOps (or MLDevOps) methodology. Thanks to the democratization of data, companies will be able to make decisions primarily on the basis of data.
The most successful companies are those able to tap the potential hidden in their data and change their business processes accordingly.