Track 4 Topic 2: Design of Parser

The parser we will create is named “payment-parser.” Start by setting “Parser” as the current page in the Calabash GUI. Then select “lake_finance” from the list of data systems and click on the big green “Create Parser” button. Your screen will look like the following.

Enter “payment-parser” as the parser name and select “Text” as the parser type. You will see an error message complaining about “field definition for source.”

Next, we will define fields for the source text lines. Click on the small blue plus sign to bring up the input field editor:

Enter the parameters exactly as shown above to define the first field in the input text line. Why do we define the first field as fixed width? Examining the actual input data helps:

05/07/2021 06:45:30 account0000003, 2000

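The timestamp occupies 19 characters, followed by a space, then the account ID. Because the timestamp itself contains a space, splitting on a space delimiter would cut it in two, whereas a fixed width of 19 captures it whole.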
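The following Python snippet is a small illustration of that reasoning, not part of Calabash. It uses the sample line above; the name TS_WIDTH is made up for this sketch.

```python
# Illustration only: why the first field is fixed width rather than delimited.
line = "05/07/2021 06:45:30 account0000003, 2000"

# Splitting on spaces separates the date from the time.
space_split = line.split(" ")
print(space_split)

# A fixed width of 19 characters keeps the whole timestamp together.
TS_WIDTH = 19  # hypothetical constant name for this sketch
timestamp = line[:TS_WIDTH]
print(repr(timestamp), len(timestamp))
```

The space-based split puts the date and the time into two separate pieces, while the fixed-width slice returns them as one value.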

Click on the “OK” button. The GUI now looks like this:

In the above screenshot, you can find two small blue plus buttons. One is next to the “Define fields on source text line” label, and the other is on the field entry. The top one adds a new field as the first field, and the second adds a new field below the current one.

Click on the second blue plus button to add a new field below the current one. Do this twice to add the two remaining fields. The GUI becomes:

The two additional fields are both delimited by commas. With these, we are done defining the structure of the input lines.
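As a rough mental model of these three field definitions, the splitting can be sketched in Python. This is only an approximation: the function name split_fields is hypothetical, and the tutorial does not state how Calabash handles the space between the fixed-width field and the next one, so the sketch simply leaves it in place.

```python
TS_WIDTH = 19  # width of the fixed-width first field (the timestamp)


def split_fields(line: str) -> list[str]:
    """Return [field0, field1, field2, ...] for one source text line.

    field0 is the first TS_WIDTH characters; the rest of the line is
    split on commas. Surrounding whitespace is kept as-is (assumption).
    """
    field0 = line[:TS_WIDTH]
    remainder = line[TS_WIDTH:]
    return [field0] + remainder.split(",")


fields = split_fields("05/07/2021 06:45:30 account0000003, 2000")
print(fields)
```

In this sketch the second and third fields keep their leading spaces, which is where the trim options on the output fields become relevant.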

Next, we need to define the output schema. The parser takes in a string and produces an object defined here. Click on the small blue button beside the “Output record” title to bring up the output field editor.

Enter the name and datatype as shown above. Then choose “field0” from the source field name drop-down list.

The output field “ts” is defined to pull data from “field0” of the input. This is called “mapping” in the ETL world.

You may optionally trim the field data before pushing it to the output. The field “ts” does not need trimming, but trimming it does no harm, either.

Click on the “OK” button, then add two more output fields. The GUI should look like the following.

Here we map “field1” and “field2” from the input to the output fields “account_id” and “amount,” respectively. The data type of the “amount” field is integer. One word of caution: make sure the “trim left” and “trim right” boxes are selected for the “amount” field. In the sample line, a space follows the comma, so the raw value of field2 is “ 2000” with a leading space, which a strict integer conversion may reject.

There are several ways an input string can fail parsing. The output schema requires three input fields, so a line with fewer than three fields causes an exception. If the data in field2 (a string) cannot be converted to an integer, there will also be an exception.
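The mapping, trimming, and the two failure cases can be pictured with the Python sketch below. It is an approximation under stated assumptions, not Calabash's implementation: the names Payment, ParseError, and parse_payment are hypothetical, “ts” is kept as a plain string, and trimming “account_id” is a choice of this sketch. Note also that Python's int() tolerates surrounding whitespace, so the explicit strip() is there to mirror the “trim left” and “trim right” boxes rather than to prevent a Python error.

```python
from dataclasses import dataclass

TS_WIDTH = 19  # width of the fixed-width timestamp field


class ParseError(Exception):
    """Raised when a source line cannot be turned into an output record."""


@dataclass
class Payment:
    ts: str          # mapped from field0 (kept as text in this sketch)
    account_id: str  # mapped from field1
    amount: int      # mapped from field2, converted to integer


def parse_payment(line: str) -> Payment:
    # Split into the fixed-width field and the comma-delimited fields.
    fields = [line[:TS_WIDTH]] + line[TS_WIDTH:].split(",")
    if len(fields) < 3:
        raise ParseError(f"expected 3 fields, found {len(fields)}")

    # Trim left and right before conversion, like the GUI checkboxes.
    amount_text = fields[2].strip()
    try:
        amount = int(amount_text)
    except ValueError as exc:
        raise ParseError(f"amount {amount_text!r} is not an integer") from exc

    # Trimming account_id is an assumption of this sketch.
    return Payment(ts=fields[0], account_id=fields[1].strip(), amount=amount)


samples = [
    "05/07/2021 06:45:30 account0000003, 2000",  # parses
    "05/07/2021 06:45:30 account0000003",        # too few fields
    "05/07/2021 06:45:30 account0000003, abc",   # amount not an integer
]
for sample in samples:
    try:
        print(parse_payment(sample))
    except ParseError as err:
        print("failed:", err)
```

The first sample yields a record; the other two raise the sketch's ParseError, one for each failure case described above.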

Click on the “Save” button to save the parser to the Calabash repository.

There is no need to deploy the parser. A data pipeline will call it (we will see how in the next tutorial topic).