Issue
There are cases where you need to add extra data, such as metadata, to the destination dataset being loaded into Snowflake without modifying the original source data file. For example, you may need to include the source file name, the source file created timestamp, or any other details useful for enriching the dataset.
Solution
The Data Loader has two switches that, when used together, provide this capability:
- --additional_headers accepts a space-delimited list (e.g. file_name file_date file_size) and is used to add column(s) to your destination dataset
- --additional_content accepts a space-delimited list (e.g. data1.csv 11-01-2021 4GB) and is used to add the corresponding value(s) to your destination dataset
When the switches are used together, the number of values provided to each switch must match: if you provide 3 columns to --additional_headers, you must also provide 3 values to --additional_content.
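If you generate the switch values dynamically, it can help to validate the two lists before invoking the Data Loader. The following is a minimal sketch; the build_additional_switches helper and its arguments are illustrative only and are not part of the Data Loader.

# Minimal sketch: confirm the headers and values line up before building the
# --additional_headers / --additional_content switches. The helper name is
# illustrative only; it is not part of the Data Loader.
def build_additional_switches(headers, values):
    if len(headers) != len(values):
        raise ValueError(
            'additional_headers has ' + str(len(headers)) + ' entries but '
            'additional_content has ' + str(len(values)) + '; the counts must match')
    return ('--additional_headers ' + ' '.join(headers)
            + ' --additional_content ' + ' '.join(values))

# Example: three columns require three values
print(build_additional_switches(
    ['FILE_NAME', 'FILE_CREATE_DT', 'FILE_SIZE'],
    ['data1.csv', '11-01-2021', '4GB']))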
Example 1 - running from command prompt / shell
flowobjects_data_loader.exe --file data.csv --sf_username FLOWOBJECTS --sf_database FLOWOBJECTS --sf_schema PARKING_TICKET_EDW --sf_warehouse DEMO_WH --sf_role SYSADMIN --sf_table FACT_PARKING_TICKET --sf_file_format PARKING_TICKET_EDW.FF_CSV --sf_stage_type TABLE --synchronous --additional_headers FILE_NAME --additional_content 'data1.csv'
Note: the --sf_account and --sf_password switches are not provided in the example CLI usage above because they are read from their environment variable equivalents.
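When launching the Data Loader from a wrapper, the credentials can be supplied through the child process environment instead of on the command line. The sketch below uses placeholder environment variable names (SF_ACCOUNT, SF_PASSWORD); substitute the variable names documented for your Data Loader version.

# Sketch: supply the account and password via the child process environment
# instead of on the command line. SF_ACCOUNT and SF_PASSWORD are placeholder
# names only -- use the environment variable names from your Data Loader
# documentation.
import os
import subprocess

env = os.environ.copy()
env['SF_ACCOUNT'] = 'my_account_identifier'   # placeholder value
env['SF_PASSWORD'] = 'my_password'            # placeholder value

# Build the command as in Example 2 below, but without the --sf_account and
# --sf_password switches, then pass env to Popen.
command = 'flowobjects_data_loader.exe --file data.csv --sf_username FLOWOBJECTS'
result = subprocess.Popen(command, shell=True, env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(out, err) = result.communicate()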
Example 2 - a Python application wrapper shelling out to run the Data Loader
#!/usr/bin/python
#
# Python snippet to shell out to run the Data Loader.
# Assumes source_file, config_data, df, i, var_file_name and var_date are
# defined elsewhere in the wrapper application.
import subprocess

# Build the Data Loader command line, including the two additional switches.
command = 'flowobjects_data_loader.exe --file ' + source_file \
    + ' --sf_account ' + config_data['SNOWFLAKE_TGT_DB']['ACCOUNT'] \
    + ' --sf_username ' + config_data['SNOWFLAKE_TGT_DB']['USERNAME'] \
    + ' --sf_password ' + config_data['SNOWFLAKE_TGT_DB']['PASSWORD'] \
    + ' --sf_database ' + df['Target_DB'][i] + ' --sf_schema ' \
    + df['Target_schema'][i] + ' --sf_warehouse ' \
    + config_data['SNOWFLAKE_TGT_DB']['WAREHOUSE'] + ' --sf_role ' \
    + config_data['SNOWFLAKE_TGT_DB']['ROLE'] + ' --sf_table ' \
    + df['Table_Name'][i] + ' --sf_file_format ' + df['Target_DB'][i] \
    + '.CONTROL.PIPEDELIMITED_NOHEADER --sf_stage_type TABLE --remove_split_dir --use_upload_dir --synchronous' \
    + ' --additional_headers FILE_NAME FILE_CREATE_DT --additional_content ' \
    + var_file_name + ' ' + var_date

# Run the Data Loader and capture its output.
result = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(out, err) = result.communicate()
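The wrapper above builds a single shell string. If any value may contain spaces or shell metacharacters, an alternative is to pass the arguments as a list and check the exit code. This is a standard Python 3 pattern, not a Data Loader requirement; the switch values are the same as in Example 2.

# Alternative sketch: pass the arguments as a list (no shell) and check the
# exit code. Standard Python 3 usage; the switches mirror Example 2 above.
import subprocess

args = ['flowobjects_data_loader.exe',
        '--file', source_file,
        '--sf_account', config_data['SNOWFLAKE_TGT_DB']['ACCOUNT'],
        '--sf_username', config_data['SNOWFLAKE_TGT_DB']['USERNAME'],
        '--sf_password', config_data['SNOWFLAKE_TGT_DB']['PASSWORD'],
        '--sf_database', df['Target_DB'][i],
        '--sf_schema', df['Target_schema'][i],
        '--sf_warehouse', config_data['SNOWFLAKE_TGT_DB']['WAREHOUSE'],
        '--sf_role', config_data['SNOWFLAKE_TGT_DB']['ROLE'],
        '--sf_table', df['Table_Name'][i],
        '--sf_file_format', df['Target_DB'][i] + '.CONTROL.PIPEDELIMITED_NOHEADER',
        '--sf_stage_type', 'TABLE',
        '--remove_split_dir', '--use_upload_dir', '--synchronous',
        '--additional_headers', 'FILE_NAME', 'FILE_CREATE_DT',
        '--additional_content', var_file_name, var_date]

result = subprocess.run(args, capture_output=True, text=True)
if result.returncode != 0:
    # Surface the Data Loader's stderr if the load did not succeed.
    raise RuntimeError('Data Loader failed: ' + result.stderr)
print(result.stdout)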
Resulting COPY INTO query
When the --additional_headers and --additional_content switches are used, the Data Loader automatically generates the COPY INTO SQL differently from the COPY INTO SQL it normally generates, because the additional information is injected inline / on-the-fly without making any changes to your source data file(s).
Sample generated COPY INTO query
COPY INTO FACT_PARKING_TICKET (FILE_NAME, DATE_KEY, VIOLATION_KEY, REGISTRATION_KEY, VEHICLE_KEY, ISSUER_KEY, SUMMONS_NUMBER, PLATE_ID, VIOLATION_COUNT, IMPORT_TIME_STAMP, LOAD_PROCESS_ID)
FROM
(SELECT 'data.csv', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
FROM '@%FACT_PARKING_TICKET/') FILE_FORMAT = 'PARKING_TICKET_EDW.FF_CSV' ON_ERROR = CONTINUE PURGE = TRUE RETURN_FAILED_ONLY = TRUE