A stage that checks the data in a given source. It consists of connecting to the data source, taking a sample of the data (the number of records can be pre-defined, e.g. 200,000), and generating reports on the source data. Profiling gives us a precise picture of the data distribution and also reveals exceptions to the rules. At this stage, the system enables us:
- to present the state of key data. For instance, in the "Clients" table, the "Date of birth" column holds data in the yyyy-mm-dd format with 80% frequency; 10% is in the dd-mm-yyyy format and the remaining 10% are empty values. The most frequent value is the date 17-01-2009. In the "numbers" sheet of the "telefony.xls" file there are 10,500 numbers, of which 10,100 are in the +48xxxxxxxxx format and 400 in the xxxxxxxxx format. The number +48111111111 occurs 75 times (which may mean that the person entering the data does not know the correct numbers and types "ones" instead),
- to identify rules for cleansing the data.
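The kind of column profile described above can be sketched in a few lines. This is a minimal illustration of the technique, not the vendor's code: each value is classified by pattern, and the distribution, empty-value share and most frequent value are reported (a suspiciously frequent value like +48111111111 then stands out immediately).

```python
import re
from collections import Counter

# Patterns used to classify date strings; extend as needed for other columns.
PATTERNS = {
    "yyyy-mm-dd": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "dd-mm-yyyy": re.compile(r"^\d{2}-\d{2}-\d{4}$"),
}

def profile_column(values):
    """Return the percentage distribution of formats and the most frequent value."""
    formats = Counter()
    non_empty = Counter()
    for v in values:
        v = (v or "").strip()
        if not v:
            formats["empty"] += 1
            continue
        non_empty[v] += 1
        for name, rx in PATTERNS.items():
            if rx.match(v):
                formats[name] += 1
                break
        else:
            formats["other"] += 1
    total = sum(formats.values())
    distribution = {k: round(100 * n / total, 1) for k, n in formats.items()}
    most_common = non_empty.most_common(1)
    return {"distribution": distribution,
            "most_frequent": most_common[0] if most_common else None}

sample = ["2009-01-17", "17-01-2009", "", "2009-01-17", "2010-05-02"]
print(profile_column(sample))
```

On the sample above the profile shows 60% yyyy-mm-dd, 20% dd-mm-yyyy and 20% empty values, with "2009-01-17" as the most frequent value.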
Thanks to data profiling we can generate reports on the state of the data. These reports may be general, but may also concern single records that fulfil (or do not fulfil) the validation rules.
During the process, when an exception to the rules is found, the system suggests specific data-cleansing actions. Each suggestion is accompanied by a detailed description of the steps to be taken. Additionally, the system generates code that may be used in the data cleansing process.
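To give an idea of what such generated cleansing code might look like (the code the platform actually produces is tool-specific; this is only a hedged sketch for the phone-number case above): normalize numbers to the +48xxxxxxxxx format and discard the placeholder value flagged during profiling.

```python
import re

# Suspicious repeated value identified during profiling (see text above).
PLACEHOLDER = "+48111111111"

def cleanse_phone(raw):
    """Normalize a Polish phone number to +48xxxxxxxxx, or return None."""
    digits = re.sub(r"\D", "", raw or "")
    if digits.startswith("48") and len(digits) == 11:
        number = "+" + digits
    elif len(digits) == 9:           # bare 9-digit number: add country code
        number = "+48" + digits
    else:
        return None                  # cannot be repaired automatically
    return None if number == PLACEHOLDER else number

print(cleanse_phone("601 234 567"))  # → +48601234567
```

A real cleansing step would also log the rejected records so they can be reviewed rather than silently dropped.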
At the end of this stage we have a data catalogue, and knowledge of the data and their quality:
- all sources are defined,
- for each source we have identified the key tables to be dealt with at further stages,
- we know the data models within the sources,
- we know all tables fulfilling the semantic rules (complete with the percentage distribution of data in each column). For instance, we receive information that data fulfilling the ID card pattern (3 letters and 6 digits) can be found in the Alpha system database in the "clients" table and in kli_tmp, in the Excel sheet "windykacje.xlsx" in the kli_wind column, and in the SAP system in the "customers" table,
- optionally, the data were also tagged,
- we know the data quality exactly. Thanks to profiling we know the distribution of data in tables, columns, flat files etc.; we know the number of empty values and the distribution of the remaining data; and we know the statistics (e.g. the longest and shortest character string, the largest and smallest value). The quality is presented in the form of reports or is available in the Precisely Spectrum platform structures,
- we know which data fulfil the validation rules and which do not,
- we have ready-made data-improvement scenarios, complete with the code that may be used in further stages of the data quality improvement process.
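The ID card rule mentioned in the list above can serve as a concrete example of a validation rule. The regular expression below is an assumption based on the description in the text (3 letters followed by 6 digits); it splits records into those that fulfil the rule and those that do not, which is exactly what the record-level reports contain.

```python
import re

# Assumed encoding of the "3 letters and 6 digits" ID card pattern.
ID_CARD = re.compile(r"^[A-Z]{3}\d{6}$")

def split_by_rule(values, rule=ID_CARD):
    """Return (records fulfilling the rule, records violating it)."""
    valid, invalid = [], []
    for v in values:
        (valid if rule.match(v or "") else invalid).append(v)
    return valid, invalid

valid, invalid = split_by_rule(["ABC123456", "AB123456", "XYZ000001"])
print(valid)    # → ['ABC123456', 'XYZ000001']
print(invalid)  # → ['AB123456']
```

The same mechanism scales to any column-level semantic rule: only the pattern changes, the split-and-report logic stays the same.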
In subsequent stages, we physically improve the data quality.
The Precisely Discovery tool also offers KPI charts, which provide a graphic representation of the data state. They help to measure and track data quality improvement. The tool allows for creating score charts and attributing them to data, with respect to parameters such as accuracy, integrity and completeness.
With the help of Discovery Scorecards we can define our own parameters and rules. KPIs are attributed on the basis of configured rules and threshold limits set for the defined data. The system also exposes an API through which we can upload data profiling information (e.g. statistics for chosen columns and models, profiling configurations and result charts).
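As an illustration of how a threshold-based KPI works (the rule name and thresholds below are our own assumptions, not Discovery's configuration format), a completeness score can be computed for a column and mapped to a traffic-light status:

```python
def completeness(values):
    """Percentage of non-empty values in a column."""
    filled = sum(1 for v in values if v not in (None, ""))
    return 100.0 * filled / len(values)

def kpi_status(score, green=95.0, amber=80.0):
    # Thresholds are illustrative; Scorecards make such limits configurable.
    if score >= green:
        return "green"
    if score >= amber:
        return "amber"
    return "red"

col = ["a", "b", "", "c", None]      # 3 of 5 values filled -> 60.0%
print(kpi_status(completeness(col)))  # → red
```

Recomputing such scores after each cleansing run is what makes the KPI charts useful for tracking improvement over time.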