<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Semantic Layer | Capstone Analytics</title>
	<atom:link href="https://capstoneanalytics.com.au/tag/semantic-layer/feed/" rel="self" type="application/rss+xml" />
	<link>https://capstoneanalytics.com.au</link>
	<description>Analytics Simplified</description>
	<lastBuildDate>Mon, 30 Jan 2023 10:48:00 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>Musings from developing and deploying enterprise grade semantic models &#8211; Part 3: Technical learnings</title>
		<link>https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part3-technical-learnings/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=musings-from-developing-and-deploying-enterprise-grade-semantic-models-part3-technical-learnings</link>
					<comments>https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part3-technical-learnings/#respond</comments>
		
		<dc:creator><![CDATA[Abhijith DSouza]]></dc:creator>
		<pubDate>Mon, 30 Jan 2023 10:43:33 +0000</pubDate>
				<category><![CDATA[Power BI]]></category>
		<category><![CDATA[DAX]]></category>
		<category><![CDATA[Enterprise Semantic Model]]></category>
		<category><![CDATA[Semantic Layer]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://capstoneanalytics.com.au/?p=2888</guid>

					<description><![CDATA[In Part 1 of this series we introduced the concept of enterprise grade semantic models and in Part 2 we discussed some general learnings from developing and deploying enterprise grade semantic models. In Part 3 we will discuss the technical learnings. You won&#8217;t find a treatise on how to do each thing in detail but [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>In <a href="https://capstoneanalytics.com.au/learnings-from-developing-enterprise-grade-semantic-models/">Part 1</a> of this series we introduced the concept of enterprise grade semantic models and in <a href="https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part2-general-learnings/">Part 2</a> we discussed some general learnings from developing and deploying enterprise grade semantic models. In Part 3 we will discuss the technical learnings. You won&#8217;t find a treatise on how to do each thing in detail but hopefully there is enough content in here to start thinking about applying some of these learnings in your project.</p>
<p>The learnings are divided into 4 sections: IMPORT, MODEL, REFRESH, SUPPORT. These correspond to roughly the four main tasks of a semantic modeller.</p>
<h3>IMPORT</h3>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Apply transformations as left as possible</h5>
<p>All reusable logic should be shifted as far left as possible, as close to the source as practical. This ensures that the databases do all the heavy lifting (extract, transform and load) and your semantic model becomes a layer that aggregates metrics via measures rather than an ETL layer. This diagram from Part 1 shows where the various transformations and measures are defined:</p>
<p><img fetchpriority="high" decoding="async" class="alignnone size-full wp-image-2943" src="https://capstoneanalytics.com.au/wp-content/uploads/2022/10/Picture1.png" alt="" width="3922" height="1468" srcset="https://capstoneanalytics.com.au/wp-content/uploads/2022/10/Picture1.png 3922w, https://capstoneanalytics.com.au/wp-content/uploads/2022/10/Picture1-1280x479.png 1280w, https://capstoneanalytics.com.au/wp-content/uploads/2022/10/Picture1-980x367.png 980w, https://capstoneanalytics.com.au/wp-content/uploads/2022/10/Picture1-480x180.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) and (max-width: 1280px) 1280px, (min-width: 1281px) 3922px, 100vw" /></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Star schema model is preferred</h5>
<p>A star schema is the best design for faster reports and lower memory consumption. An enterprise model may have multiple star schemas in the same model, with fact tables linked to conformed dimensions. This is the first step in getting the semantic model right: if the model is not a star schema, performance will suffer.</p>
<p><img decoding="async" class="" src="https://learn.microsoft.com/en-us/power-bi/guidance/media/star-schema/star-schema-example2.png" alt="Understand star schema and the importance for Power BI - Power BI | Microsoft Learn" width="853" height="602" /><br />
<em>Image source</em>: <a href="https://learn.microsoft.com/en-us/power-bi/guidance/star-schema">MS Learn</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Use views instead of tables</h5>
<p>Use views to import data into your Power BI model. Using views has three advantages:<br />
1. Tables generally do not have business-friendly names. A column might be called MemberAccountBalance_Total, which is not user friendly; in a view you can expose it as Member Account Balance.<br />
2. You can remove or add columns in views. Tables often contain columns such as Record_From_Date and Record_To_Date which are required for testing purposes but are not needed in the semantic model, so you can exclude them in the views. You can also add new columns, commonly report-level logic specific to certain reports; if you want to bin your data, for example, you can write that logic in the view.<br />
3. You can join multiple tables in views. Sometimes the reporting requirements are not straightforward and you might need to join multiple tables to form one view that meets them.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
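<p>A minimal sketch of such a view (schema, table and column names are hypothetical) showing all three advantages at once:</p>

```sql
-- Hypothetical semantic-layer view: business-friendly names,
-- audit columns dropped, report-level binning added, tables joined once
CREATE VIEW semantic.MemberBalance AS
SELECT
    m.MemberKey,
    m.MemberName                  AS [Member Name],
    b.MemberAccountBalance_Total  AS [Member Account Balance],
    CASE WHEN b.MemberAccountBalance_Total >= 100000
         THEN '100K and over' ELSE 'Under 100K' END AS [Balance Band]
FROM dw.MemberBalanceSnapshot AS b
JOIN dw.DimMember AS m ON m.MemberKey = b.MemberKey;
-- Record_From_Date / Record_To_Date are deliberately excluded
```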
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Simplify views &#8211; resulting in simpler DAX</h5>
<p>Simplify views as much as possible so that the DAX becomes simpler and the queries perform faster. The goal of the semantic model is not to show off your DAX; in fact, it is the exact opposite. The semantic model should only contain simple aggregations like SUM, DISTINCTCOUNT and AVERAGE, and/or simple filters in CALCULATE. If you are traversing multiple tables with DAX, it&#8217;s too complex; move the logic back to the tables/views so that the DAX stays simple.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
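<p>To illustrate the target level of complexity (table and column names are hypothetical), measures in this style stay thin aggregations over columns the view has already prepared:</p>

```dax
-- A simple aggregation over a pre-shaped view
Total Balance := SUM ( 'Member Balance'[Member Account Balance] )

-- A simple filter in CALCULATE, not multi-table traversal;
-- the "Active" flag was derived upstream in the view
Active Member Balance :=
CALCULATE ( [Total Balance], 'Member Balance'[Account Status] = "Active" )
```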
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Do not import all data into desktop</h5>
<p>This is the key to making your model lighter in Power BI Desktop and faster to publish. A model I&#8217;m currently working on is ~12GB in the Power BI service (in memory) while it occupies only 4MB of disk space on my laptop. How is that possible? The trick is to import only a fraction of the data into your desktop model. You can import only 1,000 rows for each table and/or use a &#8216;Start Date&#8217; parameter to import only from a certain date onwards. Use parameters to achieve this, as explained below.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Parameterize everything</h5>
<p>Use parameters in Power Query. Parameters can be used to store the server name, database name, schema name and a Start Date.</p>
<p>The advantages of using parameters are twofold:<br />
1. You can have different names for servers and databases in different environments. This is crucial, as you would publish your model to a UAT/SIT environment first and then push it to production, and non-production environments normally have a different set of servers and databases.<br />
2. You can quickly change the names of servers, databases and schemas if they are renamed at the source. Without parameters you would have to manually edit each query, which would be time consuming.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
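<p>As a sketch (parameter names and values are hypothetical), a Power Query parameter is just a named M value carrying parameter metadata, which your source queries then reference:</p>

```m
// The 'Start Date' parameter as it appears in the Advanced Editor
#date(2020, 1, 1) meta
    [IsParameterQuery = true, Type = "Date", IsParameterQueryRequired = true]

// A source query referencing ServerName/DatabaseName/SchemaName parameters,
// so switching environments means changing parameters, not every query
let
    Source = Sql.Database(ServerName, DatabaseName),
    Data   = Source{[Schema = SchemaName, Item = "MemberBalance"]}[Data]
in
    Data
```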
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Utilise query folding in power query</h5>
<p>One of the core principles of an enterprise model is that all columns are defined in the views, not in Power BI. However, you can certainly remove some columns in Power Query that you do not want to import into your model. An example would be fact table primary keys: these keys are normally created for testing purposes but don&#8217;t serve a modelling or reporting purpose, so you can remove them in Power Query. Another step you might apply in Power Query is a filter to limit the rows being imported. Ensure that these two steps still utilise query folding (in my experience they should).</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
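<p>Both of those steps can be written so Power Query folds them into a single SELECT at the source (table, column and parameter names are hypothetical); right-click the last step and check &#8220;View Native Query&#8221; to confirm folding:</p>

```m
let
    Source  = Sql.Database(ServerName, DatabaseName),
    Fact    = Source{[Schema = "semantic", Item = "FactMemberBalance"]}[Data],
    // Folds to a WHERE clause at the source
    Rows    = Table.SelectRows(Fact, each [Snapshot Date] >= StartDate),
    // Folds to a narrower column list; the fact PK never reaches the model
    Columns = Table.RemoveColumns(Rows, {"FactMemberBalanceKey"})
in
    Columns
```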
<p>&nbsp;</p>
<h3>MODEL</h3>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Use desktop for DEV</h5>
<p>Since you are only bringing a fraction of the data into the desktop model, it cannot be used to test the data. Use it as a development (DEV) workspace where you apply all column and DAX formatting, table naming, column ordering etc. Since it&#8217;s a lighter model, you can easily push changes to the UAT/SIT workspace using external tools (see the Use External Tools section).</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Give descriptions for tables and fields</h5>
<p>It&#8217;s not enough to import all the data and write good DAX; it is also important to provide descriptions for all entities used by the business users in the model (tables, columns, measures, hierarchies) so that they know exactly what those entities are and how they can be used.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Follow the eleven rules of DAX management</h5>
<p>DAX is the semantic model&#8217;s window to the outside world: it is via the DAX measures that users interact with the model. Hence the measures need to be appropriately named, fast to query, and properly formatted. Go through this <a href="https://capstoneanalytics.com.au/the-eleven-rules-of-dax-management/">article</a> to apply all eleven rules of DAX management before pushing the measures to production.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Use external tools</h5>
<p>The use of external tools greatly increases productivity and agility. There are three main tools which are of importance, and they should be in the arsenal of every semantic modeler. They are all free to use and if you do not have them, get your IT department to install them for you.</p>
<p><a href="https://docs.tabulareditor.com/index.html">Tabular Editor</a></p>
<p>Tabular Editor is a tool that lets you easily manipulate and manage measures, calculated columns, display folders, perspectives and translations in Analysis Services Tabular and Power BI models. You can also automate much of the modelling process by writing C# scripts that interact with the Tabular Model and edit it. There are plenty of custom C# scripts freely available to get you started. Once you get to know the objects you can manipulate in the Tabular Object Model, you can easily write scripts that perform formatting, column ordering, relationship creation etc. Modelling using Tabular Editor is a breeze, and once you start using it you will never go back to the clunky desktop GUI.</p>
<p><a href="https://daxstudio.org/">DAX Studio</a></p>
<p>This is the ultimate tool for working with DAX queries. If you are writing lots of DAX, be it in the semantic model or report-level measures, you can test your queries for speed, server timings, query plans etc. You can also use DMVs (Dynamic Management Views) to query the model and get important insights like table sizes, row counts, measure expressions etc.</p>
<p><a href="http://alm-toolkit.com/">ALM Toolkit</a></p>
<p>This is a great toolkit for managing and publishing your datasets to different workspaces. You can compare two datasets, see what is different, and push only the changes you want to the workspace of your choice. By using this toolkit you push only the metadata changes to a workspace, not the entire dataset. So if you have added 10 new measures to your desktop model, instead of publishing it the manual way with Power BI Desktop and refreshing the whole model in the service again, you update just the measures with this toolkit, with no need to refresh the entire model. A must-have external tool for agile delivery.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
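<p>As a taste of the scripting mentioned above, a small Tabular Editor script (run from its Advanced Scripting window, where the <code>Model</code> object is provided; the format string and folder name here are arbitrary choices) can standardise measures in one pass:</p>

```csharp
// Tabular Editor advanced script: give every measure without a
// format string a default, and group all measures into one folder
foreach (var m in Model.AllMeasures)
{
    if (string.IsNullOrEmpty(m.FormatString))
        m.FormatString = "#,0.00";
    m.DisplayFolder = "Measures";
}
```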
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Apply incremental refresh where possible</h5>
<p>Incremental refresh is a great way to reduce the memory footprint of your models while refreshing. When you apply an incremental refresh policy to a fact table in your model, you lock in the historical data and only refresh the most recent data. Say you have a large fact table with 200 million rows covering five years of data and you want to reduce the number of rows refreshed every day: you could apply an incremental refresh policy that locks in the first four years and refreshes only the current year. Every day the fact table then loads far fewer than 200 million rows, which means the model consumes less memory during refresh and you stay well within your memory limit (in the case of a Power BI Premium subscription).</p>
<p>However, be mindful that if a table reload is required at the source, you will need to disable incremental refresh in the model and bring in all the data once again. Be very careful when applying incremental refresh to a table; only do so when you are 100% sure that no table reloads will occur in the future.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
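<p>Mechanically, incremental refresh in Power BI is driven by two reserved datetime parameters, RangeStart and RangeEnd; the fact table query filters on them (column and source names here are hypothetical) and the service substitutes the partition boundaries when the policy runs:</p>

```m
let
    Source = Sql.Database(ServerName, DatabaseName),
    Fact   = Source{[Schema = "semantic", Item = "FactMemberBalance"]}[Data],
    // Inclusive lower bound, exclusive upper bound, so partitions don't overlap
    Rows   = Table.SelectRows(
                 Fact, each [Load Date] >= RangeStart and [Load Date] < RangeEnd)
in
    Rows
```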
<p>&nbsp;</p>
<h3>REFRESH</h3>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Use pipelines in service</h5>
<p>Use deployment pipelines in the Power BI service to push your models from UAT to PROD. Deployment pipelines enable semantic modelers to manage the testing and publishing of their semantic models. You can also publish reports, paginated reports, dashboards and dataflows. You can automate this with the <a href="https://learn.microsoft.com/en-us/power-bi/create-reports/deployment-pipelines-automation#use-the-power-bi-automation-tools-extension">Power BI Automation Tools Extension</a> in DevOps, which is an efficient way of deploying models, especially if there are multiple models.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Automate refresh</h5>
<p>Automating refresh of semantic models doesn&#8217;t just mean setting up a refresh schedule. A fixed schedule will not work: if the underlying tables are reloaded, say each night in the data warehouse, the exact time the reloads complete cannot be determined, so a scheduled refresh may start before the reloads have finished. This results in timeout issues, on top of not refreshing the latest data.</p>
<p>To avoid this, use the <a href="https://learn.microsoft.com/en-us/power-bi/connect-data/asynchronous-refresh">Power BI REST API</a> to refresh the models after the reload is complete. Using this API you can also fine-tune the refresh by specifying the maximum parallelism and the tables and partitions to refresh.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
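<p>A sketch of triggering such a refresh from your orchestration (the workspace and dataset IDs, table names and token acquisition are placeholders; the endpoint and body shape follow the enhanced refresh API):</p>

```python
import json
import urllib.request

def build_refresh_body(tables, max_parallelism=6):
    """Enhanced-refresh request body: refresh only the listed tables,
    capping how many partitions are processed in parallel."""
    return {
        "type": "full",
        "commitMode": "transactional",
        "maxParallelism": max_parallelism,
        "objects": [{"table": t} for t in tables],
    }

def trigger_refresh(group_id, dataset_id, token, body):
    # POST /groups/{groupId}/datasets/{datasetId}/refreshes
    url = (f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
           f"/datasets/{dataset_id}/refreshes")
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # 202 Accepted on success

# Built by the orchestrator once the warehouse reload has finished
body = build_refresh_body(["FactMemberBalance"], max_parallelism=10)
```

<p>Because the call runs only after the warehouse load signals completion (for example as the last activity of an ADF pipeline), the model never refreshes against half-loaded tables.</p>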
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Monitor refresh</h5>
<p>Use the <a href="https://learn.microsoft.com/en-us/power-bi/enterprise/service-premium-gen2-metrics-app">Gen2 Metrics App</a> to monitor Power BI Premium Gen2 capacities. Among the features of this dashboard is the ability to monitor refresh times, peak memory consumption, CPU usage and the number of users utilising your semantic models.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3>SUPPORT</h3>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Use source control</h5>
<p>Source control is a tricky subject in the Power BI world as there is no built-in source control within Power BI. You would need to use a combination of SharePoint/DevOps to manage your pbix/template files. A good place to start on an automated way to keep track of pbix files is here:</p>
<p><a href="https://mutt0-ds.github.io/posts/2022/11/turbulent-journey-power-bi-source-control/">A turbulent journey through Power BI source control &#8211; Mutt0-ds Notes</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Document the model</h5>
<p>It is important to document the model in production so that everyone, from managers to developers, knows what entities are available in the model. You can do this manually via SharePoint/Confluence, but that becomes cumbersome to maintain and error-prone. An automated data glossary dashboard which extracts the metadata from your models in production is a much better option. Such a glossary gives you details on:<br />
&#8211; M expressions of each query<br />
&#8211; Number of tables and columns in the model<br />
&#8211; Number of measures and their expressions<br />
&#8211; BUS matrix<br />
&#8211; Number of rows in the tables<br />
&#8211; Size of tables and columns in the model<br />
&#8211; Lineage from data warehouse tables/views to measures</p>
<p>Talk to us if you need help in implementing such a data glossary for your models.</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
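<p>Much of this metadata can be pulled with DMV queries run from DAX Studio against the deployed model; for example, measure names and expressions come from the TMSCHEMA views:</p>

```sql
-- Lists every measure with its DAX expression and description
SELECT [Name], [Expression], [Description]
FROM $SYSTEM.TMSCHEMA_MEASURES
```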
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Optimize the model</h5>
<p>Models in production should always be optimised to ensure they consume less memory and refresh and query faster. Memory is at a premium in Power BI (forgive the pun), so you should always strive to make sure the models use only the bare minimum necessary to meet the reporting requirements. Here are some steps to make models refresh faster and consume less memory:<br />
&#8211; Remove unnecessary columns from tables, especially fact table primary key columns<br />
&#8211; Filter out rows which are not required for reporting; if there is only a reporting requirement for the last five years, do not include data from 2010 onwards<br />
&#8211; Decrease the precision of decimal fields to 2 decimal places<br />
&#8211; Set IsAvailableInMDX to false for non-attribute columns (PK, DK, Record_Effective)<br />
&#8211; Set the right encoding for columns: dimension tables should use hash encoding, while decimal fields benefit from value encoding<br />
&#8211; Create custom partitions on tables<br />
&#8211; Reduce cardinality<br />
&#8211; Do not use calculated columns<br />
&#8211; Use aggregated tables if the detailed grain is not required for reporting</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Automate semantic model testing</h5>
<p>How do you ensure that the measures you have written are correct? How do you know that the business can trust the numbers coming from the measures? One way is to test your semantic model against the source of the measures, which in most cases is the views. You can use the <a href="https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries">Execute Queries API</a> to query your models in production, extract the measure results and compare them against an equivalent SQL statement run on the views. If you have a tester, they can set up a testing framework in ADF, and it can be automated to make sure the measures always give correct results.</li>
</ul>
</li>
</ul>
<p>So there it is: the end of this series of musings from developing and deploying enterprise grade semantic models. As a parting note, developing enterprise grade semantic models is an exciting role, as you get to collaborate with a wide variety of stakeholders while having ultimate control over the semantic model. Treat it like a product and make sure you listen deeply to the users before adding any new features. Finally, your job as a semantic modeler is to make it easier for the end users to create reports. Provide as much detail as possible on how to use the model and evangelise it to everyone you meet in the organisation. All the best!</li>
</ul>
</li>
</ul>
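<p>A sketch of that comparison step (the dataset ID, token, query text and tolerance are placeholders; the endpoint is the Execute Queries one): evaluate a measure through the API, then assert it matches the warehouse figure:</p>

```python
import json
import urllib.request

def execute_dax(dataset_id, token, dax_query):
    """Run a DAX query against a published dataset and return the
    first value of the first row of the first result table."""
    url = (f"https://api.powerbi.com/v1.0/myorg/datasets/"
           f"{dataset_id}/executeQueries")
    body = {"queries": [{"query": dax_query}]}
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    return list(result["results"][0]["tables"][0]["rows"][0].values())[0]

def numbers_match(model_value, sql_value, tolerance=0.01):
    """True when the measure agrees with the warehouse figure."""
    return abs(model_value - sql_value) <= tolerance

# e.g. numbers_match(
#     execute_dax(dataset_id, token, 'EVALUATE ROW("v", [Total Balance])'),
#     value_returned_by_the_equivalent_sql_query)
```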
]]></content:encoded>
					
					<wfw:commentRss>https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part3-technical-learnings/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Musings from developing and deploying enterprise grade semantic models &#8211; Part 2: General learnings</title>
		<link>https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part2-general-learnings/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=musings-from-developing-and-deploying-enterprise-grade-semantic-models-part2-general-learnings</link>
					<comments>https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part2-general-learnings/#respond</comments>
		
		<dc:creator><![CDATA[Abhijith DSouza]]></dc:creator>
		<pubDate>Mon, 31 Oct 2022 05:43:11 +0000</pubDate>
				<category><![CDATA[Power BI]]></category>
		<category><![CDATA[DAX]]></category>
		<category><![CDATA[Enterprise Semantic Model]]></category>
		<category><![CDATA[Semantic Layer]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://capstoneanalytics.com.au/?p=2873</guid>

					<description><![CDATA[In Part 1 of this series, I gave a brief introduction to the concept of semantic models and why they should be designed. In this article I discuss the general learnings from developing and deploying semantic models If you are doing it all by yourself, it is not enterprise grade As discussed in Part 1 [&#8230;]]]></description>
<content:encoded><![CDATA[<p>In <a href="https://capstoneanalytics.com.au/learnings-from-developing-enterprise-grade-semantic-models/">Part 1</a> of this series, I gave a brief introduction to the concept of semantic models and why they should be designed. In this article I discuss the general learnings from developing and deploying semantic models.</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>If you are doing it all by yourself, it is not enterprise grade</h5>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">As discussed in Part 1, the following are the people I collaborated with while building enterprise grade semantic models for a major superannuation company in Australia:</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li style="list-style-type: none;">
<ul>
<li>Information Architects</li>
<li>Solution Architects</li>
<li>Data Stewards</li>
<li>Product Champions</li>
<li>Data Modelers</li>
<li>Business Analysts</li>
<li>Business Intelligence Developers</li>
<li>Testers</li>
<li>Delivery Managers</li>
<li>Report Developers</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">It truly does take a team to ideate, design, test, validate and support the build of enterprise grade semantic models. In the past, I have developed pseudo enterprise grade models while donning multiple hats, and that resulted in a suboptimal data warehouse and, subsequently, a semantic model that wasn&#8217;t fit for reporting. Hence it is important to have specialists for the above roles. If you have built a semantic model by yourself, it is probably not enterprise grade. This leads to the next point.</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Designing a report for an enterprise is not the same as semantic modeling</h5>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">Many Power BI consultants make the mistake of treating a report built for a business unit in an enterprise as an enterprise grade semantic model. They are not one and the same. Firstly, the report has probably been built by a single person, meaning the person building it is the architect, data modeler, data engineer, semantic modeler and report developer all rolled into one. It is akin to building the Burj Khalifa all by yourself: simply not possible when you deal with enterprise data.</p>
<p style="padding-left: 80px;">That doesn&#8217;t mean you should not build Power BI models for enterprises. But due to the localised nature of it, the models will only be used by a section of the end users and the rest of them won&#8217;t have a clue about it. Enterprise models should be available to be consumed (via reports) by all employees within the organisation. Only then can you call the models truly enterprise grade.</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Semantic model is not the same as a data warehouse</h5>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">A data warehouse is built to standards conforming to the business units from which the data is sourced. It is usually designed using dimensional modelling techniques such as conformed dimensions, slowly changing dimensions and an enterprise bus matrix. However, it may not always be usable as-is by the business for reporting. For starters, the field names in the data warehouse may be something like &#8220;MemberAccountBalance_Total&#8221;, which may not mean much to the business; in the semantic model you can change it to something like &#8220;Member Account Balance&#8221;, which is understandable to everyone.</p>
<p style="padding-left: 80px;">The other reason is more important: in most cases the fact tables in the data warehouse cannot be used directly to derive the corporate measures. For example, in the superannuation and banking industries one of the most important measures is the Member Count, both current and historical. The Member Count represents members who have an account with the organisation, and it should increase over time, as that means the company is growing and able to make a profit. This measure is normally not modelled as a field in a fact table in the data warehouse. Hence a semantic layer view needs to be created so that the measure can be derived by, say, joining the Member dimension (which has information on all member attributes) with fact tables like the MemberBalanceSnapshot table (which has information on members who hold accounts), along with the required dimensions, so that it can be sliced appropriately.</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Be prepared for multiple iterations</h5>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">If you are building a semantic model for the first time, you will probably need a few iterations to get it right. The model you ultimately choose will be the one which has the right granularity for corporate reporting, business-approved field names, simpler DAX expressions, faster refreshes, faster queries and good documentation. Start the iteration process by focusing on the correct field, table and measure names and definitions, and design a high-fidelity prototype with the important fact and dimension tables and measures. The prototype should be parameterised so it can be deployed to different environments easily. This way you will have a version ready to scale when you deploy the final model to production.</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Semantic modeling is not technically difficult</h5>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">While the job of a semantic modeler is not that technically demanding compared to, say, a data engineer&#8217;s, it is probably the role with the most exposure to stakeholders. You are a conduit between the business and the technical team. You are constantly fielding questions on the semantic model from the report developers and the business, and it is your job to make sure their questions are answered satisfactorily.</p>
<p style="padding-left: 80px;">Requirements for new tables, columns, measures may come thick and fast and that is why it is critical to ask the right questions so that these requirements can be formalised through a demand portal and taken up by the technical team based on their priorities. Which brings us to the next point&#8230;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Have an excellent relationship with all relevant stakeholders</h5>
</li>
</ul>
</li>
</ul>
<p style="padding-left: 80px;">Because a semantic modeler sits between business and the technical team, they should maintain excellent relationships with all parties involved. They should not only be adept at talking to the data modelers and architects on things like measure definitions, table grain, memory usage, processing times etc. they should also be skillful at explaining the model in simple terms while talking to business.</p>
<p style="padding-left: 80px;">You need to be persuasive to make your point while also seeking clarity to better inform your decision-making process. These are some of the soft skills which you need to master in order to be successful in this role.</p>
<p>&nbsp;</p>
]]></content:encoded>
					
					<wfw:commentRss>https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part3-technical-learnings/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Musings from developing and deploying enterprise grade semantic models &#8211; Part 1: Introduction</title>
		<link>https://capstoneanalytics.com.au/learnings-from-developing-enterprise-grade-semantic-models/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=learnings-from-developing-enterprise-grade-semantic-models</link>
					<comments>https://capstoneanalytics.com.au/learnings-from-developing-enterprise-grade-semantic-models/#respond</comments>
		
		<dc:creator><![CDATA[Abhijith DSouza]]></dc:creator>
		<pubDate>Fri, 21 Oct 2022 01:12:36 +0000</pubDate>
				<category><![CDATA[Power BI]]></category>
		<category><![CDATA[DAX]]></category>
		<category><![CDATA[Enterprise Semantic Model]]></category>
		<category><![CDATA[Semantic Layer]]></category>
		<category><![CDATA[SQL]]></category>
		<guid isPermaLink="false">https://capstoneanalytics.com.au/?p=2834</guid>

					<description><![CDATA[For the past year, I have been fortunate enough to work on a digital transformation project for one of Australia&#8217;s largest superannuation funds to develop enterprise grade semantic models. I worked with a wonderful team comprising Solution Architects, Information Architects, Data Modelers, Data Stewards, Product Champions, Business Intelligence Analysts, Testers, Report Developers, and Delivery [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>For the past year, I have been fortunate enough to work on a digital transformation project for one of Australia&#8217;s largest superannuation funds to develop enterprise grade semantic models. I worked with a wonderful team comprising Solution Architects, Information Architects, Data Modelers, Data Stewards, Product Champions, Business Intelligence Analysts, Testers, Report Developers, and Delivery Managers to ideate, design, test, deploy, and document semantic models which contained superannuation data for the fund&#8217;s members. It was an exciting project with plenty of challenges and plenty of learnings, so I thought I would write a blog post to share those learnings with everybody.</p>
<p>Firstly, what is a semantic model? Before we get to the semantic model, we need to understand its relationship to the semantic layer. According to Wikipedia:</p>
<p style="text-align: center;"><em>A <b>semantic layer</b> is a business representation of corporate data that helps end users access data autonomously using common business terms. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization</em></p>
<p>The above definition is as good as it gets without unnecessary jargon, so we will stick with it going forward. Where does the semantic layer sit in a modern BI/analytics landscape? The simplified architecture below answers the question. This is for a Power BI landscape but would look similar with other tools.</p>
<p><img decoding="async" class=" wp-image-2847 alignnone" src="https://capstoneanalytics.com.au/wp-content/uploads/2022/09/model-framework.png" alt="" width="1141" height="271" srcset="https://capstoneanalytics.com.au/wp-content/uploads/2022/09/model-framework-980x233.png 980w, https://capstoneanalytics.com.au/wp-content/uploads/2022/09/model-framework-480x114.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) and (max-width: 980px) 980px, (min-width: 981px) 1141px, 100vw" /></p>
<p>&nbsp;</p>
<p>As you can see, the semantic layer sits between the data warehouse (which is the output of the transformation layers) and the reporting layer. The semantic layer consists of two parts:</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>Semantic Layer DB &#8211; These are the SQL views created in the database from the fact and dimension tables. The views can be a one-to-one copy of the tables, or they can be enhanced with fields containing report-level logic. It is in these views that business-friendly names are given to the fields; so instead of &#8216;MemberName&#8217; we would have &#8216;Member Name&#8217; as a field. It is important that no business transformations are applied in the views: all transformations are to be applied in the transformation layers and the data loaded into the data warehouse.</li>
</ul>
</li>
</ul>
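<p>As a sketch of what such a view might look like (the schema, table and column names here are illustrative, not from the actual project), a one-to-one view that only applies business-friendly names could be:</p>
<pre><code>-- Illustrative semantic-layer view: a one-to-one projection of a
-- dimension table that only renames fields (no business transformations)
CREATE VIEW semantic.[Member] AS
SELECT
    MemberKey  AS [Member Key],
    MemberName AS [Member Name],
    AgeBracket AS [Age Bracket]
FROM dw.DimMember;</code></pre>
<p>A view carrying report-level logic would add computed fields here, but anything reusable across reports still belongs upstream in the transformation layers.</p>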
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>Semantic Model PBI &#8211; The views are imported into a Power BI Desktop model, which is called the semantic model. Corporate measures, which have been defined by data stewards, are then calculated in the model. Most of these measures are simple aggregations of fields in the tables (SUM, DISTINCTCOUNT). The semantic model is then published to the PBI service, where report developers connect to it, design reports, and share them via apps. The semantic model is the &#8216;single version of truth&#8217; and is used as the primary data source for enterprise reports. In this example, all superannuation-related reports would connect to this model to generate insights. The design and deployment of this semantic model is the focus of this series of blogs.</li>
</ul>
</li>
</ul>
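<p>As a hedged illustration of such corporate measures (the table and column names are assumed for the example, not actual project definitions), simple aggregations in DAX might look like:</p>
<pre><code>-- Illustrative corporate measures: simple aggregations over view columns
Member Count = DISTINCTCOUNT ( 'Member'[Member Key] )
Member Account Balance = SUM ( 'Account Balance'[Balance Amount] )</code></pre>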
<p>So in essence, in the PBI world, the semantic model is a dataset with corporate measures, used to query organisational metrics. The reports which query the dataset are in fact &#8216;Thin Reports&#8217;, with the report developers having the ability to create report-level measures. In the simplest architecture, the dataset contains many datamarts and the data is imported into the model. All datamarts have a star schema associated with them.</p>
<p>So why build the semantic model at all? There are several reasons why an organisation should implement one. The key reasons are given below:</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Single Version of Truth (SVOT)</h5>
<p>When implemented correctly, the semantic model provides accurate and trustworthy data for key organisational metrics, serving as the SVOT no matter which tool is used to query it. Because the metrics have been defined in consultation with the relevant stakeholders from the business, any report connecting to the model and querying the metrics (in the form of measures) produces the same result. This is quite powerful, as there is no longer confusion about what the metrics mean, and everyone will be on the same page when discussing the reports.</p>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Seamless collaboration</h5>
<p>A semantic model provides a data model which is a visual description of the business for analyzing, understanding, and clarifying the data and the associated relationships. This model can not only be used by the business to generate insights but can also be used by data scientists to complement the raw data they use in their models. Composite models, created by combining the semantic model with business-unit-specific data (say, salary data), can also be built and shared with the relevant stakeholders. Thus the semantic model enables easy authoring, sharing, and collaboration on data models and insights.</p>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Reduce computing costs</h5>
<p>With most businesses running their data warehouses in the cloud, ad-hoc querying of the data warehouse for every report leads to poor workload management and long-running queries being run multiple times. With a cached semantic model, only optimised queries (via views), at the grain required for business reporting, are run, which improves query performance and reduces costs.</p>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Simplify DAX calculations</h5>
<p>The semantic model is structured so that measures in the model are simple aggregations of the columns in the views. Since all transformations are done before the data is loaded into the data warehouse, and report-specific logic is implemented in the views, writing DAX to create either the corporate measures or report-level measures becomes easier. This is why it is important to have solution architects and data modelers in your team who make sure that the views are at the right grain and fit for purpose for reporting.</p>
</li>
</ul>
</li>
</ul>
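<p>For example (a hypothetical report-level measure, not one from the project), when the heavy lifting is done upstream, even a derived measure composed from corporate base measures stays trivial:</p>
<pre><code>-- Hypothetical report-level measure built from corporate base measures
Exited Member % = DIVIDE ( [Exited Member Count], [Member Count] )</code></pre>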
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Improved security</h5>
<p>Users can be authenticated to the semantic model with Azure AD groups and, further, Row-Level Security can be implemented at both the database level and the dataset level to protect sensitive data and limit users&#8217; access to data based on their roles in the organisation.</p>
</li>
</ul>
</li>
</ul>
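<p>At the dataset level, an RLS role is just a DAX filter expression on a table. A minimal sketch (the table and column names are hypothetical) that limits each user to rows matching their signed-in identity:</p>
<pre><code>-- Hypothetical RLS role filter: each user sees only their employer's rows
'Employer'[Employer Contact Email] = USERPRINCIPALNAME ()</code></pre>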
<p>&nbsp;</p>
<p>So what constitutes an enterprise grade semantic model? In my opinion, the following conditions should be satisfied for a model to be classified as enterprise grade:</p>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Represent data from a business unit</h5>
<p>The semantic models which I designed enabled end users to query data related to the members of the superannuation fund. Metrics such as Member Count, Exited Member Count, Member Account Balance, New Join Count etc. could be queried from the model by slicing them with dimensions such as Member, Age Bracket, Employer, Payment Institution, Amount Bracket etc. In short, the relevant business unit in this case was superannuation, and the semantic model contained almost all the data required by the business to attract and retain members in the fund. Normally there is an executive sponsor who funds the project to set the ball rolling.</p>
<p>&nbsp;</p>
</li>
<li>
<h5>Governed corporate measures</h5>
<p>It doesn&#8217;t make sense to model the data from a business unit if there is no consensus on the definitions of the corporate metrics. It is important that even before conceptualising the data platform, a thorough process is undertaken to define the most important corporate metrics and get them approved by the relevant stakeholders. This can be facilitated by setting up a Data Office in the organisation which is responsible for signing off on the corporate measures based on the approved definitions from the stakeholders.</p>
<p>&nbsp;</p>
</li>
<li>
<h5>Architecting the solution</h5>
<p>So, you have identified which business unit&#8217;s data to model and have sign-offs on the corporate metrics which need to be the output of the model. It is then critical to start brainstorming on what the solution will look like. This is where the architects come into play. Most enterprise data resides in databases in different source systems. The architects will perform a detailed analysis of the current state of the source systems, the best way to extract the data from them, the appropriate cloud platform for the solution, and the conceptual and logical design, in tandem with the data modelers. The data warehouse is then built by the data engineers based on the specifications given to them by the data modeler. The semantic model is then designed from the data in the data warehouse based on the reporting requirements of the business. One of the core architectural principles is that any reusable logic (columns) should be applied as close to the source as practical, i.e. there should be no columns defined in Power Query or in the Power BI model. If the solution is not architected, then it is not an enterprise grade model.</p>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<ul>
<li style="list-style-type: none;">
<ul>
<li>
<h5>Deliver fast queries</h5>
<p>Once the solution is architected and the corporate measures have been defined in the model based on the business definitions, it is time to deploy the model. If the reports connecting to the semantic model cannot query a measure in less than 3 seconds (there is nothing scientific about 3 seconds; it was just a number which felt good enough at the time of deployment), then it is time to go back to the drawing board. The semantic model should be fast enough to respond to the needs of the business. A slow model will impede adoption, and people will lose trust in its efficacy.</p>
</li>
</ul>
</li>
</ul>
<p>&nbsp;</p>
<p>Hopefully this article has piqued your interest in all things enterprise semantic modelling using Power BI &#8211; the Why&#8217;s and the What&#8217;s. In the next articles in this series, I will share the learnings I gained from developing and deploying enterprise grade semantic models. Some will be <a href="https://capstoneanalytics.com.au/musings-from-developing-and-deploying-enterprise-grade-semantic-models-part2-general-learnings/">general</a>, and others will be technical. I hope you find them useful.</p>
<p>&nbsp;</p>
]]></content:encoded>
					
					<wfw:commentRss>https://capstoneanalytics.com.au/learnings-from-developing-enterprise-grade-semantic-models/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
