Data management - data cleansing

30.07.2021 | Ewa Suszek

The next stage of data management is data cleansing. Depending on the process design, the data may be corrected in their sources, in dedicated database structures, in the files or in the reports.

We suggest creating temporary, dedicated data structures in which the data are cleansed on the basis of data types, and then taking the following steps:

automatic change of the data in the sources, e.g. correcting postal codes, changing data formats, filling in missing data. Such a change is possible when a source contains incorrect data (e.g. in the wrong format) and it has been decided that they may be changed (the original data are stored in the archive). The list of records subject to change, together with the original and target values, is recorded as a report and/or as dedicated entries in the Precisely platform database,
"manual" change of data: data stewards or customer service personnel are supplied with information about the current status of the data and their target shape. Customer service is tasked with contacting the client and obtaining their consent to the change. If the client agrees, the data in the source system are changed accordingly.

The data cleansing process consists of several elements:

normalisation,
standardisation,
deduplication,
enrichment.

Normalisation is the analysis of columns that store several types of data at once. The data are then separated and written to dedicated structures. For instance, in many systems a name and surname (or a whole address) can be found in a single field. Normalisation writes the names and surnames to dedicated columns. For addresses stored in a single field, normalisation puts the individual elements (country, city, street, house number, apartment number, postal code, voivodeship etc.) in their respective fields. This process is based on Precisely's advanced algorithms and draws on dictionaries built into the system or supplied by external providers.
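To make the idea concrete, here is a minimal sketch of splitting a combined "name and surname" field into two dedicated columns. It assumes the simplest possible rule (the last word is the surname), and the names `PersonName` and `split_full_name` are hypothetical. It is not the Precisely algorithm, which relies on dictionaries rather than on word position.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Dedicated 'columns' for a value that used to live in one field."""
    first_name: str
    surname: str


def split_full_name(full_name: str) -> PersonName:
    """Split a combined name field into first name(s) and surname.

    Assumption: the last whitespace-separated word is the surname and
    everything before it is the first name(s). A single word is treated
    as a first name. Real data needs dictionary support.
    """
    parts = full_name.split()  # also collapses repeated spaces
    if not parts:
        return PersonName(first_name="", surname="")
    if len(parts) == 1:
        return PersonName(first_name=parts[0], surname="")
    return PersonName(first_name=" ".join(parts[:-1]), surname=parts[-1])


if __name__ == "__main__":
    print(split_full_name("  Anna  Maria Kowalska "))
```

A positional rule like this breaks on multi-word surnames or reversed order ("Kowalska Anna"), which is exactly why dictionary-based normalisation is used in practice.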

With the help of Discovery Scorecards we can define our own parameters and rules. KPIs are assigned on the basis of the configured rules and the thresholds set for the defined data. The system also provides an API, with which we can upload information about data profiling (e.g. statistics for chosen columns and models, profiling configurations and result charts).

Standardisation is about matching the data with specific standards. Once the standards have been defined, one needs to verify whether the data meet them. In case of any deviation, the data have to be modified so that they meet the earlier assumptions.

Examples of the standardising actions are:

dates: in any organisation, and across its various systems, we can come across dates written according to different patterns. After the profiling process, the system will report that e.g. 60% of dates have the ‘YYYY-MM-DD’ format, 20% ‘DD-MM-YYYY’, 10% ‘YY-MM-DD’ and 10% ‘YYYYDDMM’. The system will suggest converting the formats found in the respective fields to the one defined at the format cataloguing stage. Code that converts the other formats to the target one will be generated,
e-mail addresses: verification that the validation rules are met (e.g. at least 3 characters, the @ character, then at least 3 characters including a ‘.’),
names: no diminutive forms and only in the nominative case, as per the dictionary,
surnames: only in accordance with the dictionary,
addresses: countries, cities, streets, provinces and postal codes must be in accordance with the dictionary,
and many, many more…
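Two of the rules above can be sketched in a few lines. This is an illustration only, not the code the platform generates: the function names `standardise_date` and `email_meets_rules` are hypothetical, the target format is assumed to be ‘YYYY-MM-DD’, and the e-mail rule is read as "at least 3 characters before the @, and at least 3 characters containing a dot after it".

```python
from datetime import datetime
from typing import Optional

# The four patterns from the profiling example, tried in this order.
# Assumption: the first pattern that parses is taken as the right one.
KNOWN_DATE_FORMATS = [
    "%Y-%m-%d",  # YYYY-MM-DD (assumed target format)
    "%d-%m-%Y",  # DD-MM-YYYY
    "%y-%m-%d",  # YY-MM-DD
    "%Y%d%m",    # YYYYDDMM
]


def standardise_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known pattern fits."""
    text = value.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # pattern did not match, try the next one
    return None


def email_meets_rules(email: str) -> bool:
    """Check the example rule: 3+ chars, '@', then 3+ chars with a '.'."""
    local, at_sign, domain = email.strip().partition("@")
    return (
        bool(at_sign)
        and len(local) >= 3
        and len(domain) >= 3
        and "." in domain
        and "@" not in domain
    )


if __name__ == "__main__":
    for raw in ["2021-07-30", "30-07-2021", "21-07-30", "20213007", "n/a"]:
        print(raw, "->", standardise_date(raw))
    print(email_meets_rules("ewa@example.com"))
```

Values that match none of the patterns come back as `None`, so they can be routed to the "manual" correction path rather than silently changed.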

Deduplication: after the normalisation and standardisation procedures it is much easier to find similar records. By using so-called "fuzzy logic", i.e. mechanisms that compare records field by field and weight the results, we are able to find similar records and, by analysing them, identify duplicates (even when the data contain mistakes such as typos or missing letters). Deduplication is a very important stage, as eliminating duplicates leads to better customer service and to measurable savings in the company's expenses (each person is contacted once, not once for each of their numerous "impersonations").
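The weighted comparison can be illustrated with Python's standard `difflib` module. This is a sketch under stated assumptions, not the matching algorithm used by the platform: the field names, the weights in `WEIGHTS` and the 0.85 threshold are all hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the score.
WEIGHTS = {"first_name": 0.2, "surname": 0.4, "city": 0.1, "street": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in the range 0.0 to 1.0."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0  # a missing value is no evidence of a match
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    score = sum(
        weight * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, weight in weights.items()
    )
    return score / total


def find_duplicate_pairs(records: list, threshold: float = 0.85) -> list:
    """Return index pairs of records whose score reaches the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    customers = [
        {"first_name": "Anna", "surname": "Kowalska",
         "city": "Warszawa", "street": "Marszalkowska 10"},
        {"first_name": "Anna", "surname": "Kowalsla",  # typo in surname
         "city": "Warszawa", "street": "Marszalkowska 10"},
        {"first_name": "Piotr", "surname": "Nowak",
         "city": "Gdansk", "street": "Dluga 5"},
    ]
    print(find_duplicate_pairs(customers))
```

Comparing every pair is quadratic in the number of records, so on real volumes candidates are usually narrowed down first (for example by postal code) before the fuzzy comparison is run.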

Enrichment: the data can be enriched with dictionary information collected from other sources within the organisation or from outside it. Examples include adding geolocation data, navigation to and from a given point, statistical data etc. The Precisely platform offers very elaborate mechanisms for enrichment with geographical and navigation data.
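In its simplest form, enrichment is a lookup against a reference dictionary. The sketch below assumes a hypothetical dictionary keyed by postal code. The function name `enrich_with_geolocation`, the field names and the sample coordinates are illustrative only and do not come from the Precisely platform.

```python
def enrich_with_geolocation(records: list, geo_dictionary: dict) -> list:
    """Return copies of the records with dictionary fields added.

    Existing values in a record are never overwritten, and records
    whose postal code is not in the dictionary are returned unchanged.
    """
    enriched = []
    for record in records:
        extra = geo_dictionary.get(record.get("postal_code"), {})
        # Record values come last, so they win over dictionary values.
        enriched.append({**extra, **record})
    return enriched


if __name__ == "__main__":
    # Illustrative reference data (approximate, for the example only).
    geo_dictionary = {
        "00-001": {"city": "Warszawa", "lat": 52.23, "lon": 21.01},
    }
    customers = [
        {"name": "Anna Kowalska", "postal_code": "00-001"},
        {"name": "Piotr Nowak", "postal_code": "99-999"},
    ]
    for row in enrich_with_geolocation(customers, geo_dictionary):
        print(row)
```

Because the function returns new dictionaries, the original records stay intact, in line with the practice described earlier of keeping the original data alongside the changed values.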