In Part 2, the entire pipeline to ingest data into Snowflake was automated using an Azure Logic App and SnowPipe. The JSON data was loaded into a Snowflake landing table with a single column called JSON_DATA.

Ideally, there should also be a datetime column containing the date and time when the data was loaded into the landing table. However, due to a limitation in SnowPipe, no additional columns are allowed when using the JSON format. If you try, you will get the following error.
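(As a rough illustration only: the sketch below uses assumed stage and pipe names, not the actual ones from Part 2. It shows the kind of plain pipe definition that hits the limitation once the target table has more than the single VARIANT column.)

//Hypothetical sketch; pipe and stage names are assumptions for illustration
CREATE OR REPLACE PIPE VMS_AZURE_BLOB_PIPE AS
  COPY INTO VMS_AZURE_BLOB_LZ1    //imagine LZ1 had an extra LOAD_DTTM column here
  FROM @VMS_AZURE_BLOB_STAGE      //assumed external stage pointing at the Blob container
  FILE_FORMAT = (TYPE = 'JSON');  //a plain JSON COPY can only target a single VARIANT column,
                                  //so the extra column triggers the error mentioned above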

Snowflake’s Streams and Tasks features can be leveraged to move this data into a second landing table with additional columns such as LOAD_DTTM (load date time).

A Snowflake Stream helps with CDC (change data capture). It works somewhat like a Kafka topic and contains one row per change in its base table, which in this case is VMS_AZURE_BLOB_LZ1 (landing zone 1).

//Create a stream on VMS_Azure_Blob_LZ1 table
CREATE OR REPLACE STREAM VMS_AZURE_BLOB_LZ1_STREAM ON TABLE "VMS"."PUBLIC"."VMS_AZURE_BLOB_LZ1";

//Verify using
SHOW STREAMS;
//Verify that the stream works by invoking the REST API to load some sample data into LZ1 and then running a SELECT on the stream
SELECT * FROM VMS_AZURE_BLOB_LZ1_STREAM;
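Besides the base table columns, a stream exposes standard CDC metadata columns (METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID) that can be selected explicitly; a quick way to inspect them:

//Inspect the CDC metadata columns the stream adds on top of the base table columns
SELECT
  JSON_DATA,
  METADATA$ACTION,   //INSERT or DELETE
  METADATA$ISUPDATE, //TRUE when the change is part of an UPDATE
  METADATA$ROW_ID    //unique, immutable row identifier
FROM VMS_AZURE_BLOB_LZ1_STREAM;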

The next step is to insert the data present in the stream into the Landing Zone 2 table. It is a simple SQL INSERT, shown below.

//Create a 2nd Landing table (a sequence is used to generate auto-incremented IDs)
//The sequence referenced in the DEFAULT below must exist first
CREATE OR REPLACE SEQUENCE VMS.PUBLIC.VMS_AZURE_BLOB_LZ2_SEQ;

create or replace TABLE VMS_AZURE_BLOB_LZ2 (
 SEQ_ID NUMBER(38,0) NOT NULL DEFAULT VMS.PUBLIC.VMS_AZURE_BLOB_LZ2_SEQ.NEXTVAL,
 LOAD_DTTM TIMESTAMP_NTZ(9) NOT NULL DEFAULT CURRENT_TIMESTAMP(),
 JSON_DATA VARIANT NOT NULL
) COMMENT='Data will be inserted from stream and task'
;
//Test and verify that the SELECT from the stream works as intended before using it in the Task
INSERT INTO "VMS"."PUBLIC"."VMS_AZURE_BLOB_LZ2" (JSON_DATA) (
  SELECT 
    JSON_DATA
  FROM
    VMS_AZURE_BLOB_LZ1_STREAM
  WHERE
    "METADATA$ACTION" = 'INSERT'
);

Remember, the whole idea is to automate, so the INSERT needs to run automatically. This can be done using a Snowflake Task, which essentially works as a task scheduler using the “cron” time format. In this case the schedule is set to 30 20 * * * so that the task runs at 8:30 PM PT, after all the dealerships close.

CREATE OR REPLACE TASK VMS_AZURE_BLOB_MOVE_LZ1_TO_LZ2_TASK
  WAREHOUSE = TASKS_WH //A dedicated warehouse used only for Tasks, so it shows up on the bill as a separate line item
  SCHEDULE  = 'USING CRON 30 20 * * * America/Vancouver' //Process new records every night at 20:30HRS
WHEN
  SYSTEM$STREAM_HAS_DATA('VMS_AZURE_BLOB_LZ1_STREAM')
AS
  INSERT INTO "VMS"."PUBLIC"."VMS_AZURE_BLOB_LZ2" (JSON_DATA) (
  SELECT 
    JSON_DATA
  FROM
    VMS_AZURE_BLOB_LZ1_STREAM
  WHERE
    "METADATA$ACTION" = 'INSERT'
);
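One more step: Snowflake creates tasks in a suspended state, so the task needs to be resumed before the schedule takes effect (the role running this also needs the EXECUTE TASK privilege):

//Tasks are created suspended; resume it so the CRON schedule kicks in
ALTER TASK VMS_AZURE_BLOB_MOVE_LZ1_TO_LZ2_TASK RESUME;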

Notice that in the above DDL the warehouse specified is TASKS_WH. It was created specifically to run Tasks, so in the monthly bill that Snowflake generates it will show up as a separate line item, making it easier to monitor and track the cost of Tasks. Scheduling related Tasks on the same dedicated warehouse also lets the compute handle several of them together, resulting in cost savings.
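To confirm the task actually fires every night, its run history can be checked with the TASK_HISTORY table function (filtered here to just this task):

//Check recent runs of the task, newest first
SELECT NAME, STATE, SCHEDULED_TIME, COMPLETED_TIME, ERROR_MESSAGE
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'VMS_AZURE_BLOB_MOVE_LZ1_TO_LZ2_TASK'))
ORDER BY SCHEDULED_TIME DESC;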

#snowflake #blob #tasks #azure
