From SAS Viya 4.0 you have the possibility to use the SAS Information Catalog (7 Ways to Use the New SAS Information Catalog). This feature is especially useful at the start of your analytics project. It enables you to easily assess data quality, it helps to better understand the context of the data and it give quick insights into the metrics per column.
However, fewer people know you can also build your own Information Catalog using SAS Studio and SAS Visual Analytics! This is useful for the following reasons:
You can give your own touch to the dashboards
You can automatically refresh the dashboards using jobs, avoiding the need to manually re-import the data into the Information Catalog when the source changes (this is especially useful for incrementally loaded data, for example when having a direct connection to a data lake)
The functionality explained below is also available in Viya 3.4 and Viya 3.5 (and not limited to Viya 4.0 onwards).
With only one procedure you can get started yourself rather quickly. Read on below for:
Introduction to the Data Discovery Action Set
Using the Data Discovery Action Set on SASHELP.CARS
Visualizing the results in SAS VA
The Data Discovery Action set provides the profile Action for performing data discovery services. This profile Action can be used to quickly gather metrics on a table, resulting in a fast way to show basic statistics, to identify scarcity in the data and to recognize user-defined patterns in the data fields.
The exact syntax can be found here:
SAS Help Center: profile Action
An important aspect of the action is the coding of the Metric Values. In order to be able to produce a manageble table, independent of the number of rows and columns you want to analyze, the profile Action produces a long table with all metrics in a concise way:
ColumnId: an ID for the original data column (first column will become ‘1', second '2’ etc.)
RowId: an ID for the metric (e.g. minimum value='1010', mean='1008', complete mapping for all Metric Values can be found here: SAS Help Center: Data Discovery Action Set - Details)
Then three columns follow, of which at least one* is filled out every row, dependent on the metric data type:
Count: integer metric (e.g. counts)
CharValue: character metric (e.g. data type)
DoubleValue: double metric (e.g. average, percentage of missings)
*Sometimes two columns of one row are filled: e.g. for metric ‘1004’, which indicates the most frequently occuring values for each column (column CharValue) and a Count indicating how many time these most frequently occuring values occur.
Read on to see how you run this profile using only 7 lines of code!
See below a code example of how we would call this action set. Code lines 17 to 23 perform the actual profile.
/*
Author: Paul Koot
Created on: 2022/05/27
Goal script: Show how to use Data Discovery Action Set
Dependencies: - */
/* Initiate CAS session to be able to access the CASLIBS */
cas casauto;
caslib _all_ assign;
/* Load SASHELP.CARS dataset */
data casuser.CARS;
set sashelp.CARS;
run;
/* Use the Data Discovery Action Set */
proc cas;
loadactionset / actionSet="dataDiscovery"; /* 1: use dataDiscovery actionSet */
run;
dataDiscovery.profile result=r / /* 2: generate output dataset */
table={name="CARS"},
casOut={name="CARS_Profiled"};
run;
table.fetch / /* 3: check output for ColumnId==0*/
table={name="CARS_Profiled"
where="0 = ColumnId"
}
to=9999;
run;
quit;
/* Promote resulting profile to CASLIB in order to visualize results in VA */
data shared.CARS_Profiled (promote=yes);
set casuser.CARS_Profiled;
run;
/* Terminate CAS session */
cas casauto terminate;
Although the resulting table is concise, it is not easily readable. But with some filters in Visual Analytics we can split the information from this table into all relevant metrics again. Below you see two screenshots of the page we created using solely this output table from the profile Action. The first screenshot shows the behaviour of a numeric variable and the second screenshot shows the behaviour when selecting a character variable:
The main mechanisms behind this dashboard are the following:
We filter each object by a RowId that corresponds with the value we want to show in that object. E.g., ‘1025’ is the RowId for the metric ‘Unique count’ (count all unique values). We use this to show the Number of distinct values in a key value by filtering on this RowId:
As can be seen in the image above, we also need to filter on the ColumnId. To do this, we create a filter based on a parameter:
Make sure the variable ‘ColumnId’ is a categorical variable
Create a button bar and assign ‘ColumnId’ as Category, and a Parameter ‘ColumnId Parameter’ (you can create this easily after the assignment of ‘ColumnId’ as Category by clicking ‘Add’ below the Parameter role:
In the options of this button bar object, enable the field ‘Required’:
Create a drop-down list object in which a variable can be selected. Assign ‘CharValue’ to the Category Role and ‘ColumnId’ to the Hidden role:
Filter this drop-down list object on RowId=1000 to only obtain the original column names
In the Actions pane, create an Object Link between this drop-down list and the button bar to filter the button bar. This results in a mechanism in which selection a variable in the drop-down list also changes the parameter to the correspondiong ColumnID:
Then all other objects can be filtered on this ColumnId parameter to show e.g. the Count of unique values of the selected variable.
This mechanism is the backbone of this whole Detail Page as shown in the first two images. Of course can choose to make two detail pages, one for character variables and one for numeric variables, which ensures you never have empty objects in your report. You could even consider creating a separate page for all of your datetime fields. Data type is also one of the metrics that is available in the mapping, so this split can be easily made by filtering on a page level.
To create an overview page with statistics of multiple variables, we need some more coding, this will be discussed in a seperate How-To!