Tip - Add additional columns (--additional_headers) and values (--additional_content) inline / on-the-fly without modifying your source data

Issue

There are cases where you need to add additional data, perhaps metadata, to the destination dataset being loaded into Snowflake without modifying the original source data file. For example, you may want to include the source file name, the source file's created timestamp, or any other details useful for enriching the dataset.

Solution

The Data Loader has two switches that, when used together, provide this capability:

  • --additional_headers accepts a space-delimited list (e.g. file_name file_date file_size) and is used to add column(s) to your destination dataset
  • --additional_content accepts a space-delimited list (e.g. data1.csv 11-01-2021 4GB) and is used to add the corresponding value(s) to your destination dataset

When the switches are used together, the number of list values provided to each switch must match: if you provide 3 columns to the --additional_headers switch, you must also provide 3 values to the --additional_content switch, as in the sketch below.
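For illustration, a minimal sketch of a matched invocation with three additional columns and three values, reusing the connection switches from Example 1 below (the column names and values are hypothetical placeholders; a value containing spaces would presumably need to be quoted):


flowobjects_data_loader.exe --file data1.csv --sf_username FLOWOBJECTS --sf_database FLOWOBJECTS --sf_schema PARKING_TICKET_EDW --sf_warehouse DEMO_WH --sf_role SYSADMIN --sf_table FACT_PARKING_TICKET --sf_file_format PARKING_TICKET_EDW.FF_CSV --sf_stage_type TABLE --synchronous --additional_headers FILE_NAME FILE_DATE FILE_SIZE --additional_content data1.csv 11-01-2021 4GB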

Example 1 - running from command prompt / shell


flowobjects_data_loader.exe --file data.csv --sf_username FLOWOBJECTS --sf_database FLOWOBJECTS --sf_schema PARKING_TICKET_EDW --sf_warehouse DEMO_WH --sf_role SYSADMIN --sf_table FACT_PARKING_TICKET --sf_file_format PARKING_TICKET_EDW.FF_CSV --sf_stage_type TABLE --synchronous --additional_headers FILE_NAME --additional_content 'data1.csv'
Note: the --sf_account and --sf_password switches are not provided in the example CLI usage above because they are read from their environment variable equivalents.
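
As a sketch, setting those environment variables beforehand might look like the following on Windows; the variable names SF_ACCOUNT and SF_PASSWORD are assumptions here, so confirm the exact names in the command line parameters reference:


rem Hypothetical variable names - confirm against the command line parameters reference
set SF_ACCOUNT=<your_account_identifier>
set SF_PASSWORD=<your_password>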

Example 2 - a Python application wrapper shelling out to run the Data Loader


#!/usr/bin/python
#
# Python snippet to shell out to run the Data Loader

import subprocess

# Build the Data Loader command string from config and per-table DataFrame values
command = 'flowobjects_data_loader.exe --file ' + source_file \
    + ' --sf_account ' + config_data['SNOWFLAKE_TGT_DB']['ACCOUNT'] \
    + ' --sf_username ' + config_data['SNOWFLAKE_TGT_DB']['USERNAME'] \
    + ' --sf_password ' + config_data['SNOWFLAKE_TGT_DB']['PASSWORD'] \
    + ' --sf_database ' + df['Target_DB'][i] + ' --sf_schema ' \
    + df['Target_schema'][i] + ' --sf_warehouse ' \
    + config_data['SNOWFLAKE_TGT_DB']['WAREHOUSE'] + ' --sf_role ' \
    + config_data['SNOWFLAKE_TGT_DB']['ROLE'] + ' --sf_table ' \
    + df['Table_Name'][i] + ' --sf_file_format ' + df['Target_DB'][i] \
    + '.CONTROL.PIPEDELIMITED_NOHEADER --sf_stage_type TABLE --remove_split_dir --use_upload_dir --synchronous' \
    + ' --additional_headers FILE_NAME FILE_CREATE_DT  --additional_content ' \
    + var_file_name + ' ' + var_date

# Run the command and capture stdout and stderr
result = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(out, err) = result.communicate()
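
Building the command by string concatenation with shell=True works, but it is fragile when any value (for example an --additional_content entry) contains spaces or shell metacharacters. A minimal alternative sketch, assuming the same source_file, config_data, df, i, var_file_name and var_date variables as above, passes the switches as an argument list instead:


import subprocess

# Argument-list form: no shell is involved, so values with spaces need no quoting
args = [
    'flowobjects_data_loader.exe',
    '--file', source_file,
    '--sf_account', config_data['SNOWFLAKE_TGT_DB']['ACCOUNT'],
    '--sf_username', config_data['SNOWFLAKE_TGT_DB']['USERNAME'],
    '--sf_password', config_data['SNOWFLAKE_TGT_DB']['PASSWORD'],
    '--sf_database', df['Target_DB'][i],
    '--sf_schema', df['Target_schema'][i],
    '--sf_warehouse', config_data['SNOWFLAKE_TGT_DB']['WAREHOUSE'],
    '--sf_role', config_data['SNOWFLAKE_TGT_DB']['ROLE'],
    '--sf_table', df['Table_Name'][i],
    '--sf_file_format', df['Target_DB'][i] + '.CONTROL.PIPEDELIMITED_NOHEADER',
    '--sf_stage_type', 'TABLE',
    '--remove_split_dir', '--use_upload_dir', '--synchronous',
    '--additional_headers', 'FILE_NAME', 'FILE_CREATE_DT',
    '--additional_content', var_file_name, var_date,
]

# subprocess.run with capture_output requires Python 3.7+
result = subprocess.run(args, capture_output=True, text=True)
out, err = result.stdout, result.stderr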

Resulting COPY INTO query

When the --additional_headers and --additional_content switches are used, the Data Loader automatically generates the COPY INTO SQL differently from the COPY INTO SQL it normally generates, because additional information is now being injected inline / on-the-fly without any changes to your source data file(s).

Sample generated COPY INTO query


COPY INTO FACT_PARKING_TICKET (FILE_NAME, DATE_KEY, VIOLATION_KEY, REGISTRATION_KEY, VEHICLE_KEY, ISSUER_KEY, SUMMONS_NUMBER, PLATE_ID, VIOLATION_COUNT, IMPORT_TIME_STAMP, LOAD_PROCESS_ID)
FROM
  (SELECT 'data.csv', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
   FROM '@%FACT_PARKING_TICKET/') FILE_FORMAT = 'PARKING_TICKET_EDW.FF_CSV' ON_ERROR = CONTINUE PURGE = TRUE RETURN_FAILED_ONLY = TRUE
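
For comparison, with the two additional columns from Example 2 the generated statement would presumably take the same shape, with both literal values injected ahead of the positional columns; the table and trailing column names below are hypothetical placeholders:


COPY INTO MY_TARGET_TABLE (FILE_NAME, FILE_CREATE_DT, COL_1, COL_2)
FROM
  (SELECT 'data1.csv', '2021-11-01', $1, $2
   FROM '@%MY_TARGET_TABLE/') FILE_FORMAT = 'MY_DB.CONTROL.PIPEDELIMITED_NOHEADER' ON_ERROR = CONTINUE PURGE = TRUE RETURN_FAILED_ONLY = TRUE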

See also: FlowObjects Data Loader for Snowflake - command line parameters