Building a Universal Data Mapper: Part 1

Giri Venkatesan
14 min read · Apr 22, 2019

One of the fun things about working in a start-up is that you get to explore new things — constantly challenging yourself and channeling creativity into unexplored areas. Unsurprisingly, the effort and its outcome create excitement and accelerate the path to success.

Hi, this is Giri Venkatesan — a member of numberz, a fintech startup providing a SaaS solution for Accounts Receivables automation and management, with clients in India, the Middle East, and the US. My primary focus was on strategies around Ecosystem Growth & Integration; however, it extended to GTM activities, Solution Architecture and other areas of priority and strategic interest.

Beyond that role, I had the opportunity to build solutions and prototypes and explore ideas that filled a gap or gave the product a head-start in the right direction. Here I am sharing the journey of one such engagement: a simple idea that evolved from concept to realization in a short span of time, resulting in a working solution and giving me the opportunity to dabble in the fun domain of solution and application building.

After spending time sifting through data from various ERPs, I realized that a general-purpose data mapper tool capable of supporting complex mapping and transformations could expedite data ingestion from source systems. It might sound like a classic ETL problem — true, but the problem is not the E or the L of ETL; it is more around the T, transformation. Essentially, the business processes, ERP customizations and data entry practices that exist in a business permeate into the data domain, making it complex and unwieldy for a uniform data mapping and transformation strategy. Standard tools could help, but they lacked flexibility, and the overhead they added around integration made them unattractive. So, the idea of a general-purpose, reusable mapper-transformer tool supporting standard file formats (Excel, CSV, TSV) with rule-based mapping and transformation was born — a Universal Data Mapper tool. Such a tool could offer a wide range of capabilities, forming a framework for reusing mappers with the potential for automated execution (RPA).

The goal was to build mappers for a set of source and target systems that could be reused across customer installations with minimal or no re-work. If accomplished, this could become a tool of great help for ERP implementers and Data Migration specialists.

For example, an SAP to SFDC migration, or vice versa, could become a simpler and quicker affair, saving time on redundant effort. Oft-repeated bulk transformation activities become simpler too.

So, the exercise started with defining a clear set of requirements for a UDM tool that can process data, mapping records from a source format to a target format.

This article is organized in two parts: the first part covers the problem, the requirements and a concept for a solution; the second part covers the prototype (reference implementation) that came out of the exercise.

Universal Data Mapper (UDM)

UDM is a workbench with support for design and run-time activities while working with data files. As a design tool, it enables users to build mappers to produce a map definition. The outcome of the design activity is a named mapper, capturing details of the source and target data templates, mapping and transformation rules and post-processing rules as configuration metadata. At run-time, a mapper can be executed with appropriate input files to produce mapped output. The mapper should be reusable and the execution may be tied to a schedule — e.g., monitoring the arrival of a file on an FTP endpoint or an email on a designated inbox to trigger mapper execution. Such automation can extend to direct loading of mapped data if the target system supports relevant interfaces. The mapper would also be available for an on-demand execution via manual invocation on a web portal or command-line interface.
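
To make the design-time/run-time split concrete, here is a minimal sketch (in Python, used throughout this article purely for illustration) of what a mapper's configuration metadata and run-time entry point could look like. Every field name and the run_mapper signature are hypothetical; they are not the actual UDM metadata schema.

```python
# Hypothetical mapper definition captured at design time; every field name
# here is an illustrative assumption, not the actual UDM metadata schema.
mapper_definition = {
    "name": "sap_to_ar_invoices",
    "description": "Map an SAP invoice dump to the target AR layout",
    "source_template": {"sheets": {"Invoices": {"header_row": 1}}},
    "target_template": {"sheets": {"AR_Invoices": {"header_row": 1}}},
    "lookups": {"currency_names": "currency_names.csv"},
    "source_validation": [
        {"sheet": "Invoices", "attribute": "Invoice No", "rule": "required"},
    ],
    "mapping_rules": [
        {"target_sheet": "AR_Invoices", "target_attribute": "InvoiceNumber",
         "source_sheet": "Invoices", "source_attribute": "Invoice No",
         "type": "simple"},
    ],
    "post_processing": {"chained_mappers": []},
}


def run_mapper(definition: dict, input_path: str, output_path: str) -> None:
    """Run-time entry point: load the input file, apply the validation,
    mapping and transformation rules, and write the mapped output."""
    raise NotImplementedError("left out of this sketch")
```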

UDM — Mapper Design

A guided, step-by-step approach is desired to make the design process intuitive and user-friendly, with each stage refining and preparing the source data for mapping and transformation and supporting more complex pre- and post-processing rules.

Mapper Stages

Input and output data shall be put through various validation rules over and above the key mapping and transformations.

At each stage, data errors shall be captured and returned, presenting an opportunity to fix the errors and retry.

A mapper shall be uniquely identified with an id and can be modified iteratively.

NOTE: The terms row and record are used interchangeably throughout this article.

Data Subsetting

One of the fundamental requirements is to support filter and reduce activities on the source data, represented as conditions/checks. Such a mapper is ideal for data subsetting, where no transformation is needed to arrive at the final dataset. The mapper should support a flexible framework for defining data filtering, reduce and aggregate rules at the appropriate stages.

Stage 1: Mapper Specification

The design process starts by setting the stage for the mapper:

  • Specify a name and a brief description of the mapper
  • Upload a source data template, an Excel file representing the data layout — sheet names and, within each sheet, the column names and positions
  • In the same fashion, upload a target data template representing the desired layout of the output data. If the mapper is all about in-place tweaking or filtering of the source data, there should be an option to reuse the source template as the target without uploading it again.
  • Optionally, incorporate one or more lookup sources in the mapper. A named lookup source is a simple list of <key, value> pairs captured in a file that can be looked up at run-time during mapping and transformation activities.

The lookup functionality allows mapping and transformation rules to look up data from an external source at runtime.
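
As a minimal sketch of the idea, a lookup source could be a two-column CSV loaded into a dictionary and consulted while assigning a target attribute. The load_lookup helper, the fallback to the original value and the sample data are all assumptions for illustration.

```python
import csv
import io


def load_lookup(stream) -> dict:
    """Load a <key, value> lookup source from two-column CSV content."""
    return {row[0]: row[1] for row in csv.reader(stream) if row}


# In a real mapper the stream would come from the uploaded lookup file;
# an in-memory sample is used here so the sketch runs standalone.
sample = io.StringIO("IN,India\nAE,United Arab Emirates\nUS,United States\n")
country_names = load_lookup(sample)

source_value = "IN"
# Fall back to the original value when no lookup entry exists (an assumption).
target_value = country_names.get(source_value, source_value)
print(target_value)  # -> India
```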

Stage 2: Source Validation Rules

It is always a good idea to put the source data through validation checks to ensure the validity of the incoming data. Based on the source file template, each sheet is presented with the attributes extracted from a designated header row within the sheet (typically the first row, but it can be different). You should be able to specify validation rules at the attribute level on each sheet. Here are a few sample validation rules:

a) Ignore Sheet — Not every sheet of the source file is of interest for the mapping process. You should be able to ignore a source sheet so that its data is not considered as a source for the mapping process.

b) Identify the Header Row — Each sheet might have a different data format and hence may have its header positioned in a row other than the conventional first row. For the mapping process, it is critical to identify the header row on each sheet, as the attributes from the designated row define the structure/layout of the data on that sheet. A wrong selection may result in erroneous tagging of attributes as the header.

Data Filtering — The mapper should be able to weed out the garbage (data that is of no interest) from processing. This helps identify and operate on the *right* subset, derived by applying appropriate filter conditions that can be contextual, logical or otherwise. The filtering rules are stateless in nature, allowing individual records to be either skipped or retained based on a condition.

c) Preprocessors — Oftentimes, data captured in files is in a report format. A report may contain information and attribution that are not necessarily mappable data. In order to build a focused dataset, a subset of records needs to be extracted for processing, and preprocessors offer a means to achieve that. At run time, the preprocessors deduce the subset of data for further processing. A rich set of preprocessors helps identify the subset of processable data. Following are a few potential preprocessors, with a minimal sketch after the list:

  • Duplicate removal — Identify and remove duplicates
  • Skip first/last N rows — Skip a number of rows
  • Skip after N rows — Skip content after a number of rows
  • Skip row if equals/not equals to — Skip rows matching the condition
  • Skip row if empty/not empty — Skip rows matching the condition
  • Skip row if greater than/greater than or equals to — Skip rows matching the condition
  • Skip row if less than/less than or equals to — Skip rows matching the condition
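
Here is a minimal sketch of how a few of these preprocessors could be applied, assuming each row is represented as a dictionary keyed by attribute name; the function names and the row representation are illustrative assumptions, not the UDM design.

```python
# Minimal sketch: apply skip-style preprocessors to a list of rows,
# where each row is a dict keyed by attribute name (an assumption).

def skip_first_last(rows, first=0, last=0):
    """Skip the first/last N rows (e.g. report headers and footers)."""
    return rows[first: len(rows) - last if last else None]


def skip_if_empty(rows, attribute):
    """Skip rows whose attribute is empty or missing."""
    return [r for r in rows if r.get(attribute) not in (None, "")]


def remove_duplicates(rows):
    """Identify and remove duplicate rows, preserving the first occurrence."""
    seen, result = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            result.append(r)
    return result


rows = [
    {"Invoice No": "Report: March"},          # report title row
    {"Invoice No": "1001", "Amount": 250.0},
    {"Invoice No": "1001", "Amount": 250.0},  # duplicate
    {"Invoice No": "", "Amount": 75.0},       # missing key value
]
rows = skip_first_last(rows, first=1)
rows = remove_duplicates(rows)
rows = skip_if_empty(rows, "Invoice No")
print(rows)  # -> [{'Invoice No': '1001', 'Amount': 250.0}]
```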

d) Type Validation — UDM supports three fundamental data types: String, Date, and Number. Identifying attributes with the type Date or Number enables type-based validation and transformations. At runtime, these validation rules will identify and report issues, if found.

e) Required Validation — A check for missing values on required attributes. At runtime, these validation rules will identify and report issues, if found.

f) Unique Validation — A check to verify the uniqueness of value on an attribute on the entire dataset. At runtime, these validation rules will identify and report issues, if found.
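
A rough sketch of how type, required and unique checks could be evaluated and reported per attribute is shown below; the rule shape and the error format are assumptions made for illustration.

```python
from datetime import datetime

# Minimal sketch of type/required/unique checks over rows (dicts keyed by
# attribute name). Rule shapes and error format are illustrative assumptions.

def validate(rows, rules):
    errors = []
    seen = {}  # attribute -> values already seen (for unique checks)
    for i, row in enumerate(rows, start=1):
        for rule in rules:
            attr, value = rule["attribute"], row.get(rule["attribute"])
            if rule["check"] == "required" and value in (None, ""):
                errors.append(f"row {i}: {attr} is required")
            elif rule["check"] == "number":
                try:
                    float(value)
                except (TypeError, ValueError):
                    errors.append(f"row {i}: {attr} is not a Number")
            elif rule["check"] == "date":
                try:
                    datetime.strptime(str(value), rule.get("format", "%Y-%m-%d"))
                except ValueError:
                    errors.append(f"row {i}: {attr} is not a Date")
            elif rule["check"] == "unique":
                if value in seen.setdefault(attr, set()):
                    errors.append(f"row {i}: duplicate value on {attr}")
                seen[attr].add(value)
    return errors


rows = [
    {"Invoice No": "1001", "Date": "2019-04-01", "Amount": "250"},
    {"Invoice No": "1001", "Date": "April 1",    "Amount": ""},
]
rules = [
    {"attribute": "Invoice No", "check": "unique"},
    {"attribute": "Date", "check": "date"},
    {"attribute": "Amount", "check": "required"},
]
print(validate(rows, rules))
# -> ['row 2: duplicate value on Invoice No', 'row 2: Date is not a Date',
#     'row 2: Amount is required']
```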

Stage 3: Target Validation Rules

Based on the mapping and transformation rules, a mapper execution will produce the mapped output. A sanity check on the output data can be enforced by defining validation rules at the attribute level.

Based on the target file template, each sheet is presented with the identified header attributes. You can specify validation rules at the attribute level on a target sheet.

a) Ignore Sheet — Not all sheets of the target file are required to be populated by the mapper. If a target sheet is not expected to be mapped, it shall be marked as ignored. Ignored target sheets will not be presented as targets while defining mapping rules in the later phase.

b) Type Validation — UDM supports three fundamental data types: String, Date, and Number. Identifying attributes with the type Date or Number enables type validation on the mapper output. At runtime, these validation rules will identify and report issues, if found.

c) Required Validation — A check for missing values on required attributes. At runtime, these validation rules shall identify data issues on the output, if found.

d) Unique Validation — A check to verify the uniqueness of value on an attribute on the entire dataset. At runtime, these validation rules will identify and report issues, if found.

Stage 4: Mapping & Transformation Rules

With the source and target data layouts configured appropriately, the necessary rules for mapping (associating a target attribute with a source attribute) and transformation (massaging and manipulation) shall be defined at the attribute level.

This is the most critical activity in a mapper design. For each of the target sheets, the mapping process involves configuring attribute association and transformation rules.

Data Reduction — At times, a decision to skip or retain one or more records may be based on an empirical value derived from a group of records. The grouping shall be based on data distinctness, either at an attribute level or on a composite value derived from multiple attributes. It serves as a means to reduce the group into a single record by numerically aggregating values on an attribute, while retaining the other attribute values from the first record. This is a stateful operation that takes the entire dataset into account to identify the groups of records for the reduce operation.

The data consideration settings at the sheet level determine the rules for grouping records.

a) Sheet Mapping — A step that identifies and associates the source sheet from which data will be picked up for assignment to a target sheet

b) Preprocessors — Depending on the target context, the focus of mapping is on a subset of the data found on the source sheet. Preprocessors provide a set of rules (filters) to locate the appropriate data subset to put through the mapping process. Following are a few potential preprocessors:

a. Skip row if equals/not equals to — Skip the rows matching the condition on the specified attribute

b. Skip row if empty/not empty — Skip the rows matching the condition on the specified attribute

c. Skip row if greater than/greater than or equals to — Skip the rows matching the condition on the specified attribute

d. Skip row if less than/less than or equals to — Skip the rows matching the condition on the specified attribute

e. Skip row if starts/ends with — Skip the rows matching the condition on the specified attribute

f. Skip row if empty — Skip rows where the specified attribute contains no value

c) Data Consideration — Specify which rows are to be considered for processing (from the preprocessed output). A sketch of the distinct modes follows the options below.

a. All Rows — Consider all rows for mapping

b. Distinct Rows — Consider rows with distinct values on the specified attribute. This presents a group of records that share the same value (distinct across the whole dataset) on that attribute for mapping purposes. An attribute-level mapping considers the first record from the group and ignores the rest; however, it is possible to numerically aggregate values on an attribute from the grouped records and assign the result to a target attribute.

Example: Building a customer master from records where customer info is repeated across multiple records of the same customer.

c. Distinct Composite Rows — Consider rows with distinct composite values composed from one or more attributes of the dataset. This presents a group of records that share the same composite value for mapping purposes. An attribute-level mapping considers the first record from the group and ignores the rest; however, it is possible to numerically aggregate values on an attribute from the grouped records and assign the result to a target attribute.

Example: Building a customer-location master from records where multiple records are found for a customer, each capturing a distinct location.
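
Here is a minimal sketch of the Distinct and Distinct Composite modes, keeping the first record per distinct (composite) key; the distinct_rows helper and the rows-as-dictionaries representation are assumptions for illustration.

```python
# Minimal sketch of the Distinct / Distinct Composite data consideration:
# keep the first record per distinct (composite) key.

def distinct_rows(rows, key_attributes):
    seen, result = set(), []
    for row in rows:
        key = tuple(row.get(a) for a in key_attributes)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"Customer": "Acme", "Location": "Mumbai", "Amount": 100},
    {"Customer": "Acme", "Location": "Mumbai", "Amount": 250},
    {"Customer": "Acme", "Location": "Dubai",  "Amount": 75},
]

# Distinct Rows on one attribute -> customer master (one row per customer).
print(distinct_rows(rows, ["Customer"]))

# Distinct Composite Rows on two attributes -> customer-location master.
print(distinct_rows(rows, ["Customer", "Location"]))
```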

d) Attribute Mapping — Association of a source attribute with a target attribute. For each source record, the value found on the specified attribute is copied over to the mapped target attribute.

  • A mapping can be augmented with transformation rules that modify the value before assignment, using appropriate computations and logical conditions. Supported operations range from simple arithmetic and conditional/logical operations to complex aggregations.
  • A transformation supports appropriate data massaging/manipulation rules based on the source attribute's data type (String, Number or Date). Complex transformations involving multiple computations and conditions on one or more attributes of the source record, consolidated logically, should also be supported.

Here are a few potential mapping rules:

a. Simple Mapping — A simple association of a source and target attribute (without any transformation)

b. Conditional Mapping — Association of a target attribute with source attributes based on a logical condition. This gives an opportunity to choose the value to be assigned to a target attribute based on the TRUE or FALSE outcome of the condition evaluation.

NOTE: Equivalent to an IF-ELSE condition found in any programming language.

A condition has three parts:

  1. The source attribute to be considered,
  2. The operator, and
  3. The target attribute to be considered.

At runtime, a condition is evaluated to a TRUE or FALSE outcome. Based on the outcome, the mapping specified for the resulting branch is executed. Following are a few potential condition operators, with a sketch of conditional mapping after the list:

i. Is Empty/Is Not Empty — Evaluate the value on a source attribute for the specified condition

ii. Is Equals to/Is Not Equals to — Evaluate the value on a source attribute for the specified condition

iii. Is Less than/Is Less than or Equals to — Evaluate the value on a source attribute for the specified condition

iv. Is Greater than/Is Greater than or Equals to — Evaluate the value on a source attribute for the specified condition

v. Substring — Evaluate the substring value on a source attribute for the specified condition
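
Here is a minimal sketch of a conditional mapping that copies the value from one of two source attributes based on the outcome of a condition; the rule structure and operator names are illustrative assumptions, not the actual UDM schema.

```python
# Minimal sketch of a conditional mapping: choose which source attribute
# supplies the target value based on the TRUE/FALSE outcome of a condition.

def evaluate_condition(row, attribute, operator, operand=None):
    value = row.get(attribute)
    if operator == "is_empty":
        return value in (None, "")
    if operator == "equals":
        return value == operand
    if operator == "greater_than":
        return float(value) > float(operand)
    raise ValueError(f"unsupported operator: {operator}")


def conditional_mapping(row, rule):
    """Copy the value from one of two source attributes based on the outcome."""
    outcome = evaluate_condition(row, rule["attribute"],
                                 rule["operator"], rule.get("operand"))
    source_attr = rule["if_true"] if outcome else rule["if_false"]
    return row.get(source_attr)


rule = {"attribute": "Billing Name", "operator": "is_empty",
        "if_true": "Legal Name", "if_false": "Billing Name"}

print(conditional_mapping({"Billing Name": "", "Legal Name": "Acme Corp"}, rule))
# -> Acme Corp (billing name is empty, so the legal name is used instead)
```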

c. Switch Mapping — Switch mapping is an enhanced conditional mapping with support for multiple conditions. Each condition follows the same semantics as the conditional mapping specified above. At runtime, the first condition that evaluates to TRUE is executed and the rest of the conditions are ignored. If no condition is satisfied, the DEFAULT condition, if defined, is executed (a sketch follows the note below).

NOTE: Equivalent to a SWITCH-CASE-DEFAULT construct found in any programming language.
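
A sketch of the switch semantics, where the first matching case supplies the value and a default applies otherwise; modelling cases as (predicate, value) pairs is an assumption made purely for illustration.

```python
# Minimal sketch of a switch mapping: the first case whose predicate evaluates
# to TRUE supplies the value; otherwise the DEFAULT value is used.

def switch_mapping(row, cases, default=None):
    for predicate, value in cases:
        if predicate(row):
            return value
    return default


cases = [
    (lambda r: r.get("Country") == "IN", "INR"),
    (lambda r: r.get("Country") == "AE", "AED"),
]

print(switch_mapping({"Country": "IN"}, cases, default="USD"))  # -> INR
print(switch_mapping({"Country": "US"}, cases, default="USD"))  # -> USD
```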

d. Computed Mapping — Assigning a target attribute the result of a series of computations (arithmetic and string operations) on the source attributes. Essentially, a Computed Mapping employs one or more computing conditions, with the result of each condition passed as input to the next, and a chaining operator to consolidate the results. At runtime, the final result of these computations is assigned to the target attribute (a sketch follows the operator lists below).

Following are a few potential computing operators:

i. Uppercase/Lowercase — Convert the input attribute of the condition as appropriate

ii. Concatenate — Concatenate the input and output attribute of the condition

iii. Substring — Extract the substring from the input attribute of the condition

iv. Add/Subtract/Divide/Multiply — Apply arithmetic operation on the input and output attributes as appropriate

Following are a few potential chaining operators:

i. Concatenate — Concatenate results of the current condition and next condition

ii. Add/Subtract/Divide/Multiply — Apply the arithmetic operator on the results of the current condition and next condition
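
Here is one possible reading of computed mapping as a small sketch: each computing step produces a result and a chaining operator consolidates the results. The step and operator names are assumptions, not the UDM vocabulary.

```python
# Minimal sketch of a computed mapping: a series of computing steps whose
# results are consolidated by a chaining operator.

def compute_step(row, step):
    value = row.get(step["attribute"], "")
    if step["operation"] == "uppercase":
        return str(value).upper()
    if step["operation"] == "substring":
        return str(value)[step["start"]:step["end"]]
    if step["operation"] == "multiply":
        return float(value) * step["by"]
    raise ValueError(f"unsupported operation: {step['operation']}")


def computed_mapping(row, steps, chain_operator="concatenate"):
    results = [compute_step(row, s) for s in steps]
    if chain_operator == "concatenate":
        return "".join(str(r) for r in results)
    if chain_operator == "add":
        return sum(float(r) for r in results)
    raise ValueError(f"unsupported chaining operator: {chain_operator}")


row = {"Country": "in", "Invoice No": "INV-2019-1001"}
steps = [
    {"attribute": "Country", "operation": "uppercase"},
    {"attribute": "Invoice No", "operation": "substring", "start": 3, "end": 8},
]
# e.g. build a composite reference for the target attribute.
print(computed_mapping(row, steps))  # -> IN-2019
```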

e. Aggregated Mapping — Assigning a target attribute the result of an aggregate operation applied to the source attribute across the grouped records. This is applicable only to attributes identified with the Number data type, and only when the Distinct or Distinct Composite data consideration mode is chosen (a sketch follows the operator list below).

Following are a few potential aggregation operators:

i. Sum — Compute the sum of values found on the source attribute in the grouped dataset

ii. Average — Compute the average of values found on the source attribute in the grouped dataset

iii. Min/Max — Identify the Min/Max of values found on the source attribute in the grouped dataset
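
A minimal sketch of an aggregated mapping over grouped records follows; the aggregated_mapping helper, the grouping representation and the operator names are assumptions for illustration.

```python
# Minimal sketch of an aggregated mapping: group records by the distinct
# (composite) key and assign an aggregate of a Number attribute per group.
from collections import defaultdict


def aggregated_mapping(rows, group_by, attribute, operator="sum"):
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row.get(a) for a in group_by)
        groups[key].append(float(row.get(attribute, 0)))
    ops = {"sum": sum,
           "average": lambda v: sum(v) / len(v),
           "min": min,
           "max": max}
    return {key: ops[operator](values) for key, values in groups.items()}


rows = [
    {"Customer": "Acme", "Amount": 100},
    {"Customer": "Acme", "Amount": 250},
    {"Customer": "Globex", "Amount": 75},
]
print(aggregated_mapping(rows, ["Customer"], "Amount", "sum"))
# -> {('Acme',): 350.0, ('Globex',): 75.0}
```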

Stage 5: Post-processing Rules

Occasionally, the desired mapping and transformation cannot be accomplished in a single pass over the source data, mainly due to different processing requirements for different groups/categories of data within the source dataset. In such cases, multiple mappers can be created, each focusing on a specific aspect, and the mappers can be chained. The chaining can be designed to feed the output of one mapper as input to the next, or to consolidate the output of all chained mappers into a single output. At runtime, such a chained mapper can be executed as a single unit producing the desired output.

Chained Mappers

Chaining is optional and should be employed based on the need for multi-pass processing to accomplish the desired mapping.
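
To close, here is a minimal sketch of the two chaining styles, modelling each mapper as a function from rows to rows; the representation and the toy mappers are assumptions made purely for illustration.

```python
# Minimal sketch of mapper chaining, with each mapper modelled as a function
# from a list of rows to a list of rows.

def chain_sequential(mappers, rows):
    """Feed the output of one mapper as input to the next (multi-pass)."""
    for mapper in mappers:
        rows = mapper(rows)
    return rows


def chain_consolidated(mappers, rows):
    """Run each mapper on the same input and consolidate all outputs."""
    output = []
    for mapper in mappers:
        output.extend(mapper(rows))
    return output


def keep_domestic(rows):
    """Toy mapper: keep only domestic (Country == 'IN') records."""
    return [r for r in rows if r.get("Country") == "IN"]


def tag_high_value(rows):
    """Toy mapper: tag each record with a value segment."""
    return [{**r, "Segment": "HIGH" if r["Amount"] > 1000 else "STANDARD"}
            for r in rows]


rows = [
    {"Country": "IN", "Amount": 2500},
    {"Country": "US", "Amount": 400},
]
print(chain_sequential([keep_domestic, tag_high_value], rows))
# -> [{'Country': 'IN', 'Amount': 2500, 'Segment': 'HIGH'}]
```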

Part 1 of this article on Building a Universal Data Mapper is now complete. The primary focus was to set the stage for an ideal UDM tool and the requirements that would make it complete and usable for potentially any mapping need. In the next part, I will follow up with the details of a reference implementation (prototype) of UDM and dive into a few use cases.

Also, I am happy to share that the UDM reference implementation will be open-sourced under the MIT license on GitHub soon.


Giri Venkatesan

A technology enthusiast, passionate about learning with an emphasis on the "why" of everything. A strong believer in collaborative knowledge building and sharing.