Ignite Your Insights


Course Overview
Full Course Description
About this Course
This hands-on, self-paced course will provide you with a basic understanding of how to properly build a Data Vault model. It is meant to be a companion course to CDVP2 training. If you are enrolled in CDVP2 training, we strongly suggest that you listen to the lectures first before attempting to run the hands-on exercises. Why? Because this class does not explain what the Data Vault model is, or why it is different from a Kimball model. This class is built to demonstrate a working knowledge of a Data Vault model; in other words, this class teaches you how to build. The CDVP2 course explains why, and what the Data Vault solution is.
Modifications Made:
- 27 APR 2022 - Removed UPPER case on Hash Differences in Stage Level 2 Views (replaced the ZIPPED downloads)
Suggested Prerequisites:
Not enrolled in CDVP2?
If you've never built a Data Vault before, then we highly encourage you to keep a copy of the book Building a Scalable Data Warehouse with Data Vault 2.0 alongside your efforts. The book explains all the details and nuances of how and why Data Vault modeling is different, and how many of the objects in a Data Vault model are defined.
Enrolled in CDVP2?
We encourage you to wait before engaging with the hands-on work in this course. This is a companion course to the lectures provided in CDVP2 training. This course teaches you how to build; the CDVP2 lectures explain why, the pros and cons, and the differences between Kimball star schemas and Data Vault modeling approaches. We encourage you to sit through the lectures first, then engage in this course for the hands-on how-to.
About the SQL
Our SQL is about 98% ANSI-SQL compliant; roughly 2% needs to change to run on other platforms. The IDE we use in this course is IDERA Aqua Data Studio Ultimate. Feel free to contact IDERA for more information, and please tell them you saw their tool in this course.
- For PostgreSQL: download and use the PostgreSQL-named zip.
- For Snowflake: download and use the Snowflake-named zip. We recorded the Snowflake videos using Snowflake's browser-based management tool.
- Download and install PostgreSQL on your local machine or have DBA privileges to an existing PostgreSQL database.
- Download and unzip the .ZIP file attached to this course.
- The instructions given are demonstrated in the videos. You are expected to watch the videos in a place where you can easily view the steps you are being walked through while they are being executed. This course is not designed to be an audio only course; video display is required.
- Use whatever tool you wish to view the database schemas, tables, views, etc., and to execute the downloaded scripts. Dan uses IDERA's Aqua Data Studio product in these lessons, but choose your own tool.
- If you are a VTCP participant, please pay attention to the schemas that must be created in order to work with this BASELINE model.
- This .zip file contains the BASELINE model developed in PostgreSQL. The source model is designed to illustrate all of the possible objects that we typically encounter in a DV2 implementation.
- The BASELINE download that you will be using throughout this course contains all of the DDL, DML, and data sets needed to build, load, and query the model.
What is VTCP and Why is it Included?
This course also covers participants in our VTCP (Vendor Tool Certification Program). The entire course is meant to teach CDVP2 students what a core DV2 solution looks like, and to help vendors understand what they must generate before submitting artifacts to the certification process. Any technical resource in the VTCP program must complete this course and align their tool to produce artifacts to these specifications (exact match). Once a tool vendor can produce artifacts that match these specifications, they will be provided a new model to generate. If you are not part of the VTCP program, no problem; you can ignore the references to the VTCP lessons. By the end of this course you will have built out core DV2 objects (hubs, links, satellites) as well as specialized objects such as a non-historized link, a multi-active satellite, effectivity satellites, and hierarchical and same-as links. In addition, we've included objects for getting data out of the Data Vault through PIT and Bridge tables. All of these objects have been built according to the Data Vault 2.0 standards. VTCP PARTICIPANTS WILL ALSO HAVE A SECOND CLASS ENABLED: DV101 Baseline Hands On Meta Config. You MUST watch those videos as well to complete the baseline efforts for the VTCP program.
Learning Maximized!
Achieve expertise quickly
High quality content, self-paced video for your maximized learning.
Extensive training with focused topics leading to your success.
Join other students currently engaged in your learning journey.
Course Lessons
Section Header
Welcome!
Just a few house-keeping notes for you before you get started:
- This course follows every standard that we define in Data Vault 2.0
- There is a difference between a standard and a best practice.
- Standards are enforceable; when broken, they compromise one or more of the following: scalability, ease of use, optimization, maintainability, and more.
- Best Practices are not enforceable; these are strongly encouraged practices - methods and designs for you to apply.
Not using end-dating is a best practice, NOT a standard. We highly discourage the use of end-dating; however, people see a need and have implemented end-dating in the Data Vault since the beginning of their journey. We therefore demonstrate the use of end-dating in business-driven satellites, prefixed with "bv_" in the data model. Because these are business vault objects, we are free to execute business rule calculations that set these end-dates and update the fields. End-dates are not used in Hubs, Links, or raw Satellites, and we have at least one (if not two) Effectivity Satellite cases that are designed as 100% insert-driven. This way you can see both patterns and learn both approaches.
We recommend the 100% insert-driven best practice, especially for scalability into the trillions of records.
Standard: All Hubs carry business key attributes.
Best Practice: Leverage Hash Keys as "surrogate keys" or replacement keys for the primary keys in your Hub structures. Going forward, this practice may vary depending on the platform and its physical capability to execute without hashing in place. For now, most database platforms do well with the hashing practices.
Lastly: please don't forget to fill in the survey at the end of the course - it is the last lesson. Why? Your feedback is important to us, and it's the only way we can know how to improve the course going forward.
We hope you enjoy this course, Thank You!
Lesson Purpose
This first lesson starts off by walking you through the BASELINE model that is included as a course download. You should have already installed PostgreSQL (or have access to it) and unzipped the files provided in the course download.
In this lesson Dan covers the source BASELINE model. The model is also available in Snowflake, is part of the CDVP2 class, and is offered at no cost to CDVP2-registered students.
During this lesson, you will understand that the source model is designed to illustrate all of the possible objects that we typically encounter in a DV2 implementation. The data sets represent two different source systems to emphasize the importance of integration within the data sets. These systems are exemplified by Organization A and Organization B.
The download includes a set of generated, made-up data sets that will be loaded into the model throughout the course. The data sets are not intended to be realistic, but only to illustrate the load processes used by the ELT scripts and views contained in the BASELINE file set. These fabricated data sets intentionally contain broken data; again, it's about presenting the student with a real-world experience.
You will want to pay attention to details like missing foreign key constraints in the source model. Note, these foreign keys are missing by design. The BASELINE model is a 3NF model that includes concepts around Country, Currency Rates, Organization, and Employee. Take some time to review the source model.
Lesson Purpose
Dan will discuss the exact names of the schemas that must be created in the database in order for the BASELINE scripts to execute properly. Our VTCP participants must create these schemas exactly as Dan describes or their tool will fail BASELINE validation testing. You will want to name the PostgreSQL database and each of the schemas to match the names in the walk-through or the SQL scripts included in the zip file.
Each student must create the following schemas in their PostgreSQL (or Snowflake) databases, keeping in mind that with PostgreSQL, case matters. Execute the SQL in file 00_create_schemas.ddl to create all of the schemas required by the course. Note: the Meta_config* schema is for VTCP participants only.
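For orientation only, here is the general shape of what that schema script does. Apart from baseline_stage (which the Load Process ID lesson later places a table in), every name below is a placeholder; use the exact, case-sensitive names from 00_create_schemas.ddl.
  -- Placeholder names only - run the real statements from /20_DDL/00_create_schemas.ddl.
  -- PostgreSQL folds unquoted identifiers to lower case, so quote the names only if
  -- the course script quotes them.
  CREATE SCHEMA IF NOT EXISTS baseline_stage;     -- staging area (referenced later in the course)
  CREATE SCHEMA IF NOT EXISTS baseline_source;    -- hypothetical: the 3NF source model
  CREATE SCHEMA IF NOT EXISTS baseline_rdv;       -- hypothetical: the raw Data Vault
  CREATE SCHEMA IF NOT EXISTS baseline_mart;      -- hypothetical: PITs, Bridges, and marts
  -- VTCP participants only:
  CREATE SCHEMA IF NOT EXISTS meta_config;        -- the Meta_config* schema mentioned above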
Dan explains how the scripts found in the downloadable zip file are organized through their naming conventions to follow an order of execution. When the zip file is extracted, all of the directories and scripts will reflect their order of execution within each directory and subdirectory.
Once Dan explains the nuances of the lesson, he steps through which DDL files to open, copy, and paste into your SQL tool to begin to build out the database starting with the source system model objects. The objects built include the foreign key constraints.
After the source system objects are built, Dan then walks through which DML files will be opened and executed to populate data into every one of the source system model objects as the lessons progress.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /20_DDL/00_create_schemas.ddl
- Folder/File: /20_DDL/01_source.ddl
- Folder/File: /40_control_data/01_Control_Data.sql
Lesson Purpose
This lesson walks the student through creating the Stage Level 1 tables. The tables will be reverse-engineered into an Entity-Relationship model for review and walk-through.
Dan will discuss the purpose of the Stage level 1 tables and that they are intended to be an exact copy of the source system tables or data files which were built during the previous lesson, Create Source Database. During this lesson, you will begin to understand the importance of entity naming conventions like table suffix and table prefix abbreviations. Dan explains what has been used as a best practice in the BASELINE model and why defining your naming conventions is so critical. Additionally, pay attention to the fact that the stage level 1 tables allow NULLS in the table structure.
For VTCP participants, understand that your tool may not create a physical level 1 stage table; however, to pass the BASELINE tests, you must generate the DDL for the level 1 and level 2 staging tables. Dan discusses why the level 1 tables may not be necessary, especially when data is arriving in real time. For purposes of the VTCP certification process, you must produce the DDL for these physical tables; it is considered a requirement.
Next, Dan discusses the Stage level 2 tables, what the level 2 tables are, and how they are intended to extract the data from the stage level 1 tables into the stage level 2 tables. Pay attention to the fact that Stage level 2 tables include the DV system fields like hash keys, hash differences, record source, and load dates. The naming conventions on these tables are important as part of the repeatable pattern and identity.
After building the physical staging tables, Dan will walk the student through creating the views used to load the stage level 1 tables from the source. Students should pay attention to the naming conventions used for the views. As a standard best practice, Dan discusses why a view should be named according to the source to target nature of the data movement achieved by the view.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /20_DDL/02_stage_level1.ddl
- Folder/File: /20_DDL/03_stage_level2.ddl
- Folder/File: /30_Views/01_stage1_views.sql
- Folder/File: /50_Insert-Into-Scripts/01_insert_stage1.sql
Lesson Purpose
During this lesson, I will talk to you about loading the Stage level 2 structures. There are a number of DV2 concepts, standards, and best practices mentioned in this lesson.
The video says: Views are required. We have reconsidered this statement, and redacted that requirement. Remember: For the VTCP Program and for using automation tools: Views are not required. You may stick with 100% insert statements combined with the selects if you so choose.
Just as in the previous lesson, we learn that views are used to load data from the Stage Level 1 structures to the Stage Level 2 structures. The views needed are named according to the target structure that will be populated. The views, just like in the previous lesson, are called by the INSERT INTO statements of the load scripts. This aligns with the consistent, repeatable, pattern-based approach of DV2. The view names should match the target structure name where the INSERT INTO statement lands the data.
As part of the best practices, the student will learn that the views created to populate the Stage Level 2 structures should compute the hash keys and hash differences stored in the target Stage Level 2 structures. Dan touches on the key concepts of NULL Business Keys, Satellites, Link Hashes, and Hash Differences, including the naming conventions used for these system fields. The lesson begins to expose one of the key concepts of DV2: Divide and Conquer.
You will create the Stage Level 2 views required to move the data from the Stage Level 1 tables into the Stage Level 2 tables. The views are part of the repeatable, pattern-based DV2 templates taught as part of the standards as covered in the CDVP2 certified practitioner course.
You will also be walked through the View construction, including the creation of the hash key and hash difference values, the use of coalesce functions to replace null business keys with consistent values, and extracting the record source value. Each Stage Level 2 table also carries a sequence number that is auto-generated by the database as a best practice. The views ensure that the hash key function follows the DV2 standard for properly generating a hash key result.
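To make the pattern concrete, here is a minimal sketch of what a Stage Level 2 view of this kind can look like. The table, column, substitute value, and delimiter choices are hypothetical; the real definitions are in /30_Views/02_stage2_views.sql.
  -- Hypothetical names throughout; the auto-generated sequence number lives on the
  -- Stage Level 2 table itself, so it is not selected here.
  CREATE VIEW baseline_stage.stgv_country_exchange_rate_l2 AS
  SELECT
      -- hash key: business key trimmed, upper-cased, NULL replaced with a fixed token
      MD5(UPPER(TRIM(COALESCE(country_code, '-1'))))          AS hk_country,
      -- hash difference over the descriptive payload (no UPPER, per the 27 APR 2022 note)
      MD5(TRIM(COALESCE(currency_code, '')) || ';' ||
          TRIM(COALESCE(exchange_rate::text, '')))            AS hash_diff,
      CURRENT_TIMESTAMP                                       AS load_dts,
      'ORG_A.COUNTRY_EXCHANGE_RATE'                           AS record_source,  -- illustrative value
      country_code,
      currency_code,
      exchange_rate
  FROM baseline_stage.stg_country_exchange_rate_l1;
The matching load script then simply issues an INSERT INTO the Stage Level 2 table selecting from a view like this one.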
For our VTCP participants, this lesson will cover certain specifics about the program - exactly what is expected and the critical nature of field/column naming conventions to your success during testing.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/02_stage2_views.sql
- Folder/File: /50_Insert-Into-Scripts/02_insert_stage2.sql
Lesson Purpose
Picking up from the previous two lessons regarding Stage Level 1 and Stage Level 2 structures and data movement, you will learn about the concept of a run cycle ID. Run cycle ID, also known as a process ID, is taught in the CDVP2 class. In this lesson, you will be taught what a run cycle ID is, how it is used in the Data Vault 2.0 standard, how to create a run cycle ID table, and how to implement the run cycle ID inside of a DV2 structure.
We've changed the name of the table FROM load_process_history TO stg_load_process_history. We added the "STG_" prefix to the table name to make it compliant with the naming conventions.
You will learn the concept of run cycle ID by building Data Vault structures that contain a system field called LOAD_GROUP_ID. You will construct a table in the baseline_stage schema that is designed to store a run cycle ID that will be used to create a unique composite key set for each record in the target structure.
You will create the Load Process History table in this lesson. The load process ID that is generated for each specific load is used in specific Data Vault 2.0 objects such as Multi-Active Satellites. The technique you will learn in this lesson related to load process IDs ensures that insert processes into structures like multi-active satellites are 100% restart-able without introducing duplicate records into the structure if the load process fails and has to be restarted for the same data set.
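As a rough mental model, the control table can be as simple as the sketch below. The column names other than the run cycle ID itself are hypothetical; the actual DDL and seed data are in the BASELINE scripts referenced below.
  -- A minimal sketch, assuming a very simple control table in baseline_stage.
  CREATE TABLE baseline_stage.stg_load_process_history (
      load_group_id   BIGINT    NOT NULL,                    -- the run cycle / process id
      load_start_dts  TIMESTAMP DEFAULT CURRENT_TIMESTAMP    -- hypothetical audit column
  );

  -- Seed the first run cycle; each batch load stamps its rows with the current value,
  -- which is what makes a failed multi-active load safely re-runnable.
  INSERT INTO baseline_stage.stg_load_process_history (load_group_id) VALUES (1);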
BASELINE FILES REFERENCED IN THIS LESSON:
- Folder/File: /30_Views/02_stage2_views.sql
- Folder/File: /50_Insert-Into-Scripts/02_insert_stage2.sql
Section Header
Lesson Purpose
Starting with this lesson, Dan will begin to break down the Data Vault 2.0 structures object-by-object.
The first thing you will build in this lesson are the Raw Vault structures. You can open the DDL script that is used to create all of the structures used by the Raw Data Vault.
The second thing you will build in this lesson are the Views that are used to load the Raw Data Vault.
In each subsequent lesson, Dan will build upon the Views that you created during this lesson. He will begin to walk you through the DDL for each View and explain what each type of view does. In the next series of lessons, you will be populating the Raw Vault tables that you built during this lesson. You will learn about all of the Hubs, Links, and Satellites that were created during this lesson.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /20_DDL/04_raw_vault.ddl
- Folder/File: /30_Views/03_dv_views.sql
Lesson Purpose
Before diving into the details regarding all of the structures and views you created in the previous lesson, Dan slows down the process here to walk you through the Data Vault 2.0 model itself. In this lesson, please pay close attention to the naming conventions of the objects that are presented in this model. You will learn best practices surrounding raw data vault structure naming conventions.
You will learn that there are three basic or core Data Vault structures: hubs, links, and satellites. You will also learn that a logical Data Vault model is what is known as 3NF-ish, or Third Normal Form-ish.
Dan walks through the model pointing out the entity naming conventions used to differentiate between a raw vault table and a business vault table, the various types of links such as single-level and multi-level hierarchies, and other aspects such as the fact that there is no temporality in a Link table. He points out that temporality around a link relationship is tracked in an Effectivity Satellite. You will learn that there are various techniques for loading Effectivity Satellites: a 100% insert-only approach (the optimal best practice) and the more legacy update approach.
You will also be introduced to the Data Vault 2.0 Driving Key concept, the naming convention used to manage it, and a high-level description of what a Driving key is.
Lesson Purpose
The focus of this lesson is to discuss the concepts behind Ghost Records and NULL Business Keys. Dan explains that NULL Business Keys - records that arrive from the source with missing business keys or where the Business Key is a NULL value - should still be captured in the Data Vault. He also talks about temporal data, or data over time, where a record arrives with a date that is earlier than any date the satellite records contain. You will learn how to insert the data values that act as substitutes for the Ghost Record values that come through.
Dan walks through the control data load process to the Hubs where the Ghost Record and NULL Business Key values are inserted into every hub in the Raw Vault data model. You will learn that Ghost Records and NULL Business Key records will not be inserted into a Non-Historized Link. You will learn to use your common sense when it comes to selecting the value for a Ghost Record or a NULL Business Key. You will also learn that once you select a value for a Ghost Record and a NULL Business Key, then you will use that exact value across your entire Data Vault implementation consistently.
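The idea in miniature looks something like the statement below. The hub, column, and substitute values are hypothetical; the real statements are in the control-data script listed below, and whatever values you choose must be used consistently across the entire vault (and skipped for non-historized links).
  -- Hypothetical names and values; see /40_control-data/00_GhostRecords.sql for the real ones.
  INSERT INTO baseline_rdv.hub_country (hk_country, country_code, load_dts, record_source)
  VALUES
      (MD5('GHOST'), 'GHOST', CURRENT_TIMESTAMP, 'SYSTEM'),   -- ghost record
      (MD5('-1'),    '-1',    CURRENT_TIMESTAMP, 'SYSTEM');   -- NULL business key record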
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /40_control-data/00_GhostRecords.sql
Lesson Purpose
During this lesson, you will be introduced to loading a Hub. Dan walks you through the Views that were created in the previous lesson - Create and Load Raw Vault - to move data from the Stage Level 2 tables into the Raw Data Vault. Dan will explain the rationale behind the View naming conventions as a best practice.
As a reminder: VIEWS ARE NOW OPTIONAL - No longer required for the VTCP program or for the code-generators. Your tool may generate the select statements directly inside the INSERT statements.
The focus of this lesson is to walk through loading the Raw Vault Country hub from the country exchange rate staging table. Notice that each view selects the business key for the Country Hub from every Stage Level 2 table that contains the country business key.
For our VTCP participants, Dan points out that the view selects the columns out of the Stage Level 2 table specifically through column name identification rather than using an asterisk (*). This is by design. The SELECT statement should specify exactly which columns are being included in the View to be mapped to the Raw Data Vault target for purposes of data integrity. Dan talks about the restart-ability test that is used in the VTCP program to ensure that load processes are restart-able, recoverable, fault tolerant, and parallel-load enabled.
You will learn that each view selects from its source stage table to the same Hub. Dan warns you against using a UNION statement and trying to combine views together. Using a UNION is not condoned by Data Vault standards and results in non-compliance for automation engines that do so. Dan points out that building your Views in accordance with this standard ensures restart-ability, fault tolerance, and recoverability of your load code.
Finally, Dan talks about how this approach of loading business keys into a single hub table, wherever the business key appears in the source systems, provides full integration across multiple source systems within the Data Vault.
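For reference, a hub load of this kind has roughly the shape below. The names are hypothetical and the view and INSERT are collapsed into one statement for brevity; the course pattern keeps them separate, and the real code is in the script listed below. Note the explicit column list, the anti-join so only new business keys land, and the absence of any UNION: each source view gets its own INSERT.
  -- A minimal sketch with hypothetical names; real code also de-duplicates within the batch.
  INSERT INTO baseline_rdv.hub_country (hk_country, country_code, load_dts, record_source)
  SELECT DISTINCT s.hk_country, s.country_code, s.load_dts, s.record_source
  FROM   baseline_stage.stgv_country_exchange_rate_l2 s
  WHERE  NOT EXISTS (
           SELECT 1
           FROM   baseline_rdv.hub_country h
           WHERE  h.hk_country = s.hk_country        -- only business keys not already in the hub
         );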
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
In the previous lesson, you learned about the consistent, repeatable pattern for loading data from Stage Level 2 tables into a single hub, focused on the business concept of Country. We will expand on those load concepts, patterns, and best practices during this lesson as Dan loads the remaining Hub tables that are in our Raw Data Vault model.
Once the remaining hubs are loaded, Dan queries the hub structures to review and explain the data that is now inserted into the hub structures. Dan shows the Ghost Record and NULL Business Key records that were inserted into the hub. Dan walks through the hash key value matches.
Dan explains why the hash keys are the Primary Keys of the Hubs. He talks about what you should and should not see in the values of the Hub Business Key values.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
In this lesson, Dan discusses the Data Vault best practice of loading one source to one target - or, put another way, one map per target per source. The views in this lesson will load one hierarchical link, a couple of standard links, the non-historized link, and the Same-As link.
Dan reviews the resulting data set from each of the links to point out the hash keys and the data values associated with the hash keys.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
The focus of this lesson is on what we call Standard Satellites. The naming convention shows that the satellite is a standard satellite hanging off of a hub.
Dan points out the hash_difference column and provides a high level explanation of what the hash difference is comprised of and what it is used for.
The Load Date is identified during this lesson as part of the satellite's composite Primary Key. This is a standard in Data Vault 2.0. Every satellite must have, at the base level, the parent primary key (meaning the hash key of the parent hub or the hash key of the parent link) AND the Load Date as the Primary Key of the Satellite. You will learn that the Load Date is a system-driven span or timeline; it does not come from the source system but is generated during the load to the satellite.
As you will learn, satellites are delta-driven; in other words, records are only inserted into the satellite when there is a change in the underlying data from load to load. Dan walks through the SQL within the View used to select the data from the staging table where the hash keys match, the comparison is made against the most current record that already exists in the satellite, and the hash difference values are compared as well.
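The delta check behind that description can be sketched roughly as follows. The names are hypothetical and the construct is simplified; the course view in the script below is the reference.
  -- Insert only rows whose hash difference differs from the most current satellite row.
  INSERT INTO baseline_rdv.sat_country_details
         (hk_country, load_dts, hash_diff, record_source, currency_code, exchange_rate)
  SELECT s.hk_country, s.load_dts, s.hash_diff, s.record_source, s.currency_code, s.exchange_rate
  FROM   baseline_stage.stg_country_exchange_rate_l2 s
  LEFT JOIN LATERAL (                                   -- most current satellite row per key
           SELECT c.hash_diff
           FROM   baseline_rdv.sat_country_details c
           WHERE  c.hk_country = s.hk_country
           ORDER  BY c.load_dts DESC
           LIMIT  1
         ) cur ON TRUE
  WHERE  cur.hash_diff IS NULL                          -- brand-new key
     OR  cur.hash_diff <> s.hash_diff;                  -- payload changed since the last load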
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
In this lesson, you will learn about Link Effectivity Satellites. Loading Effectivity Satellites is a complex concept and is the method used to address the problem of data temporality - meaning knowing when business key relationships have changed over time, being able to track those changes, and being able to identify the current relationship among business keys based on the values tracked in the Effectivity Satellite.
You will learn that there are two different styles of Effectivity Satellites. With regard to satellites in general, all satellites are treated the same way; however, Effectivity Satellites may be loaded using two different approaches.
In this lesson, you will learn about something called Style One, which is a 100% insert-driven technique. Style Two is an update-in-place approach, which means updating the old record, keeping an Is Current Flag around, and then using that Is Current Flag downstream. You will see the differences between the two styles as exemplified in the use of the Start Date column in the 100% insert-driven approach.
You will learn how to read the naming convention of the satellite and understand why it is considered a business vault object instead of a Raw Vault object. You will see that the use of the Start Date as a system computed date means that there are business rules or business logic applied in order to load data to this table. By association, whenever business rules are used to populate a table, then that table is considered a business vault object.
Here Dan talks about the concept of the Driving Key and what the driving key means. You will see that the view used to populate the satellite is complex, and in fact, the concept of Driving Key is a complex set of logic. Dan discusses the join between the employee and the organization.
For VTCP participants, Dan points out the order of the left join. You will learn that the manner in which the view is constructed matters to the performance of the join.
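A deliberately over-simplified sketch of the driving-key comparison is shown below; the names are hypothetical, and the course view handles the nuances Dan covers (join ordering, the full driving-key logic, and so on).
  -- Style One: 100% insert-driven, system-computed Start Date (hence the bv_ prefix).
  INSERT INTO baseline_rdv.bv_lsat_employee_organization_eff
         (hk_employee_organization, hk_employee, hk_organization, start_dts, load_dts, record_source)
  SELECT s.hk_employee_organization, s.hk_employee, s.hk_organization,
         CURRENT_TIMESTAMP AS start_dts,                      -- computed, not sourced
         CURRENT_TIMESTAMP AS load_dts,
         s.record_source
  FROM   baseline_stage.stg_employee_organization_l2 s        -- hk_employee acts as the driving key
  LEFT JOIN LATERAL (
           SELECT e.hk_organization
           FROM   baseline_rdv.bv_lsat_employee_organization_eff e
           WHERE  e.hk_employee = s.hk_employee                -- latest effectivity row for this driving key
           ORDER  BY e.start_dts DESC
           LIMIT  1
         ) cur ON TRUE
  WHERE  cur.hk_organization IS NULL                           -- first sighting of the driving key
     OR  cur.hk_organization <> s.hk_organization;             -- the relationship has moved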
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
You will walk through a complicated concept in this lesson because it deals with loading Effectivity Satellites using an approach known as update-in-place. In this lesson, the Effectivity Satellite is a business vault object, which is an indicator that the Start Date and the End Date values are computed or system-driven.
Dan starts by explaining that these complex concepts are really all based on the same patterns. He then discusses the differences between using a nested sub-select exemplified in the previous lesson and the more simplified left join condition found in this View. You should also recognize the difference between what makes an object a Raw Vault object versus a Business Vault object after listening to this lesson.
You will hear Dan talk about the fact that Effectivity Satellites are generally difficult concepts to understand. Dan takes these difficult concepts and provides you with the code to ensure that the pattern you are using to load the data is optimized. To achieve this optimization, Dan uses a CTE (Common Table Expression). You will learn what a CTE is, how it optimizes the operation, and how set logic is implemented in the execution.
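As a minimal sketch of the update-in-place idea using a CTE (hypothetical names; the course script contains the full, optimized logic):
  -- Close out the superseded effectivity rows for each driving key.
  WITH latest AS (
      SELECT e.hk_employee,                              -- driving key
             MAX(e.start_dts) AS max_start_dts
      FROM   baseline_rdv.bv_lsat_employee_organization_eff e
      GROUP  BY e.hk_employee
  )
  UPDATE baseline_rdv.bv_lsat_employee_organization_eff t
  SET    end_dts    = latest.max_start_dts,              -- system-computed End Date
         is_current = FALSE                              -- the "Is Current Flag" from Style Two
  FROM   latest
  WHERE  t.hk_employee = latest.hk_employee
    AND  t.start_dts  <  latest.max_start_dts            -- every older row for the driving key
    AND  t.is_current = TRUE;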
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
During this lesson, you will learn more about how to populate a Link Effectivity Satellite for a Hierarchical Link. Unlike the last example, this example is another 100% insert-driven approach. You will learn that the Effectivity Satellite follows the same pattern based on a driving key match. Dan highlights how to potentially change the view to utilize a control date table.
Once again, this Effectivity Satellite depends upon the driving key found in the Hierarchical Link.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: 50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
In this lesson Dan discusses what a multi-active satellite is and how it behaves like a fact table in dimensional modeling. When it comes to loading a multi-active satellite, you will want to use a sequence number so that you can identify true duplicate rows. You will learn what this means and that the sequence number never leaves the staging area to move into the raw vault.
You will learn how to use the Load Group ID that we discussed briefly in the previous lesson on Load Process IDs. You should have already built out the Load Process History table. You will learn that the Load Group ID is used to batch all of the data together that is being processed in a single load batch. The Load Group ID is used to assist with the restart-ability of the load process. Dan explains how the Load Group ID is used to ensure that the data is not duplicated. He goes into the need and use of the subsequence within this satellite. The subsequence is used to ensure uniqueness.
You will hear a reference for a status tracking satellite. Status tracking satellites are not covered in this course.
You will learn that when loading a multi-active satellite, you must load every row into the multi-active satellite.
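Putting those pieces together, a multi-active satellite load can be sketched as below. The table and payload columns are hypothetical; only the load group ID and subsequence roles follow the lesson.
  -- Every incoming row is loaded (no delta check); LOAD_GROUP_ID plus a subsequence keep
  -- the composite key unique and the load restart-able.
  INSERT INTO baseline_rdv.msat_organization_contacts
         (hk_organization, load_dts, load_group_id, sub_seq, record_source, contact_name, contact_phone)
  SELECT s.hk_organization,
         s.load_dts,
         (SELECT MAX(load_group_id)
          FROM baseline_stage.stg_load_process_history)            AS load_group_id,   -- current run cycle
         ROW_NUMBER() OVER (PARTITION BY s.hk_organization
                            ORDER BY s.contact_name)               AS sub_seq,         -- uniqueness per parent key
         s.record_source, s.contact_name, s.contact_phone
  FROM   baseline_stage.stg_organization_contacts_l2 s;            -- full set, not a delta
Because every row of a failed batch carries the same load group ID, that batch can be cleanly removed and re-run without creating duplicates.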
Lesson Purpose
In this lesson, Dan talks about the Same-As link being populated from the pattern of using a hash difference. You will learn about why this Effectivity Satellite is a Raw Data Vault object based on the source of the Start and End Date temporality. The hash difference is calculated from the start and end dates coming off of the operational system.
Dan walks you through the View that is used to select the data and the complex logic needed to understand what is happening to the driving key.
You will learn why you need to adhere to the standards of the Data Vault and what happens if you don't.
Section Header
Lesson Purpose
You will learn what happens on the second, third, fourth, etc., load cycles after the initial data load cycle occurs.
BEFORE WATCHING THE VIDEO: We have already adjusted the scripts for LOAD_GROUP_ID to operate properly without changes. We no longer use the DD formula shown in the video. Follow the SCRIPTS in this example for the correct execution, and go back to the Stage Load Process ID lesson to review the proper stg_load_history_id and LOAD GROUP information.
The fabricated set of data is found in the control load files. Dan walks through the data that has been loaded and gives you the opportunity to review the updated set of data that will be loaded.
This lesson describes change data that comes into the satellite as a delta data set. The changes coming in also exemplify what happens when a new data set is introduced into your Data Vault.
You will learn how to manually increment the Load Group ID of the Load Process History table by issuing an update statement.
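Roughly, that statement looks like the following, assuming the control table described in the Stage Load Process ID lesson; check the lesson's own script for the exact form.
  -- Start a new run cycle before re-running the stage and vault loads.
  UPDATE baseline_stage.stg_load_process_history
  SET    load_group_id = load_group_id + 1;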
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /40_control_data/02_ControlData_Update.sql
- Folder/File: /50_Insert-Into_Scripts/01_insert-stage1.sql
- Folder/File: /50_Insert-Into_Scripts/02_insert-stage2.sql
Lesson Purpose
Dan teaches you about the organization of the load processes in Data Vault. You will learn why you only need one set of code for your load insert processes. Unlike traditional approaches to loading data, Data Vault uses a single set of code that can handle the initial load as well as every subsequent incremental load that may occur after the initial load.
You will learn how to understand the record source column and what it might imply. Dan explains the nature of the driving key and why it's important. He shows you how to review the data and understand what is happening in your data sets. The multi-active satellite is not delta-driven, which means the entire data set is loaded every time the data arrives. You will learn that a multi-active satellite is really a snapshot-based table.
Dan warns you about databases and hard update logic. You need to understand and be careful with using hard update logic.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /50_Insert-Into_Scripts/03_insert-rdv.sql
Lesson Purpose
You will learn about something we call the Flip-Flop Effect. This lesson pulls from the Business Vault. You will come to understand the importance of the Effectivity Satellite and the Start and End dates.
Dan walks you through the various control update data sets. By previewing the data in the effectivity satellite, you can see the changes in the data. As Dan walks through the changes that are evident in the Effectivity Satellites after running the restored data load, you will understand why Effectivity Satellites are important to time spans and data temporality.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /40_control_data/03_ControlData_Restoredeffectivity.sql
- Folder/File: /50_Insert-Into_Scripts/01_insert_stage1.sql
- Folder/File: /50_Insert-Into_Scripts/02_insert_stage2.sql
- Folder/File: /50_Insert-Into_Scripts/03_insert_rdv.sql
Lesson Purpose
This lesson will walk you through getting data out of the Data Vault and what to do if you don't have an End Date in your Satellite Effectivity structure. It uses a Style One, 100% insert-driven Effectivity Satellite, which does not contain an End Date. What if you don't have an End Date but need one in an outcome object? To handle this situation, Dan walks you through how to create a View that will instantiate an accurate End Date.
You will use a Lead/Lag function in this lesson. We highly recommend that you do some reading on what the Lead and Lag functions do. The importance of using the Start Date to order the records in the Lead and Lag functions is emphasized with this approach.
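The core of the technique can be sketched with a single LEAD() window function, hypothetical names aside; the course view in the mart script below is the reference implementation.
  -- Derive an End Date from the next Start Date within each driving key.
  CREATE VIEW baseline_mart.v_employee_organization_eff_enddated AS
  SELECT hk_employee,
         hk_organization,
         start_dts,
         LEAD(start_dts) OVER (PARTITION BY hk_employee       -- driving key
                               ORDER BY start_dts)            -- ordering by Start Date matters
             AS end_dts                                       -- NULL means "still current"
  FROM   baseline_rdv.bv_lsat_employee_organization_eff;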
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/04_mart_views.sql
Lesson Purpose
Following on from the last lesson, you will learn about how to load a PIT / Bridge hybrid. We are going to use a view that pulls history using the calculated End Dates previously discussed.
To abide by naming convention standards, we have changed the field name snap_dts to snapshot_dts.
Dan walks through the effectivity satellite data of the hierarchical link with focus on the driving key. The consistency of the driving key is critical.
You will learn about the snapshot_date table and how it is used to create a cartesian product for a particular date or range of dates for each snapshot date.
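The cartesian product itself is simple; a sketch with hypothetical names (apart from snapshot_dts) is shown below, and the real logic lives in the mart views listed next.
  -- Every key appears once per snapshot date in the chosen range.
  SELECT d.snapshot_dts,
         h.hk_employee
  FROM   baseline_mart.snapshot_date d                  -- one row per snapshot date
  CROSS  JOIN baseline_rdv.hub_employee h
  WHERE  d.snapshot_dts BETWEEN DATE '2022-01-01'       -- hypothetical range filter
                            AND DATE '2022-12-31';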
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/04_mart_views.sql
Lesson Purpose
In this lesson you will learn about loading the raw PIT table. A PIT table is focused on a single hub and its two satellites.
The hub in this lesson becomes the driving table in the query used to load the PIT table.
Dan talks about something called an Applied Date as an introduction to that concept. This lesson provides an example of using a computed hash difference in a PIT table to detect deltas.
You are also introduced to how a Ghost Record is used in PIT table logic to ensure that you are able to equal join down to the Satellite without dropping records out of the query.
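A compact sketch of that idea follows, showing one satellite branch only (the second satellite is handled the same way). The names and the ghost values are hypothetical placeholders for whatever the control data actually uses; the real load is in the scripts below.
  -- The hub drives the query; ghost values guarantee that downstream equal joins never drop rows.
  INSERT INTO baseline_mart.pit_country (hk_country, snapshot_dts, hk_country_sat1, load_dts_sat1)
  SELECT h.hk_country,
         d.snapshot_dts,
         COALESCE(s1.hk_country, MD5('GHOST'))        AS hk_country_sat1,   -- fall back to the ghost record
         COALESCE(s1.load_dts, DATE '1900-01-01')     AS load_dts_sat1
  FROM   baseline_rdv.hub_country h
  CROSS  JOIN baseline_mart.snapshot_date d
  LEFT   JOIN LATERAL (
           SELECT s.hk_country, s.load_dts
           FROM   baseline_rdv.sat_country_details s
           WHERE  s.hk_country = h.hk_country
             AND  s.load_dts  <= d.snapshot_dts       -- latest satellite row as of the snapshot
           ORDER  BY s.load_dts DESC
           LIMIT  1
         ) s1 ON TRUE;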
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/04_mart_views.sql
- Folder/File: /50_Insert-Into_Scripts/04_insert_mart.sql
Lesson Purpose
In this lesson you will learn about loading a raw Bridge, meaning a structure that contains any relationship that existed at any moment in time, and usually spans more than one link table. The raw Bridge does not have temporality associated with it. You will also learn that PITs and Bridges are used for high-speed joins for a particular use case.
Dan walks you through the entire build process.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/04_mart_views.sql
- Folder/File: /50_Insert-Into_Scripts/04_insert_mart.sql
Lesson Purpose
In this lesson you will cover a snapshot-based temporal table.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/04_mart_views.sql
- Folder/File: /50_Insert-Into_Scripts/04_insert_mart.sql
Lesson Purpose
In this lesson, Dan will run the inserts for the various snapshot dates.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/04_mart_views.sql
- Folder/File: /50_Insert-Into_Scripts/04_insert_mart.sql
Lesson Purpose
We now move into the BI Views. This lesson focuses on the BI Views that sit on top of the PITs and Bridges.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /30_Views/05_bi_views.sql
Section Header
Lesson Purpose
This lesson covers the deltas/changes made to the script files in order to run them on Snowflake instead of PostgreSQL. ALL of our SQL is about 98% ANSI-SQL compliant, meaning it will run on nearly all databases with only minor changes. The changes come in the form of handling SEQUENCE (identity) column datatypes, changing MD5 hashing function calls, and setting hash result datatypes in specific columns. The only other change is from the UPSERT (CTE) to the MERGE SQL statement in Snowflake. Hopefully the vendors will be able to utilize all this information to gear their tools to properly produce the artifacts needed for verification.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: ALL SCRIPTS
Lesson Purpose
This is a short lesson that focuses on how to convert a PostgreSQL UPSERT command (used for updating Effectivity Satellite End Dates) to a Snowflake MERGE statement.
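To give a feel for the shape of the conversion only: the course's PostgreSQL side reportedly uses a CTE-based upsert, and the names below are hypothetical; the real statement is in 03_insert_rdv.sql.
  -- PostgreSQL-style upsert (requires a unique constraint on the conflict columns).
  INSERT INTO baseline_rdv.bv_lsat_employee_organization_eff
         (hk_employee_organization, hk_employee, start_dts, end_dts, load_dts, record_source)
  SELECT hk_employee_organization, hk_employee, start_dts, end_dts, load_dts, record_source
  FROM   baseline_stage.stgv_employee_organization_eff
  ON CONFLICT (hk_employee_organization, start_dts)
  DO UPDATE SET end_dts = EXCLUDED.end_dts;

  -- The equivalent idea expressed as a Snowflake MERGE.
  MERGE INTO baseline_rdv.bv_lsat_employee_organization_eff t
  USING baseline_stage.stgv_employee_organization_eff s
     ON t.hk_employee_organization = s.hk_employee_organization
    AND t.start_dts = s.start_dts
  WHEN MATCHED THEN UPDATE SET t.end_dts = s.end_dts
  WHEN NOT MATCHED THEN INSERT
       (hk_employee_organization, hk_employee, start_dts, end_dts, load_dts, record_source)
       VALUES (s.hk_employee_organization, s.hk_employee, s.start_dts, s.end_dts, s.load_dts, s.record_source);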
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /50_Insert-Into-Scripts/03_insert_rdv.sql
Lesson Purpose
This is a short lesson that focuses on how to convert a Snowflake MERGE command (used for updating Effectivity Satellite End Dates) back to a PostgreSQL UPSERT statement.
BASELINE FILES USED IN THIS LESSON:
- Folder/File: /50_Insert-Into-Scripts/03_insert_rdv.sql