Data Analysis with Redshift Serverless and Quicksight - Part 2
Ryan Nazareth
Posted on May 13, 2023
In the first part of this blog, we have introduced Redshift Serverless offering and setup a workgroup and namespace configured with datasharing to allow access to AWS Marketplace Data Exchange. In the second part of this blog, we will focus on the accessing the data from Amazon Quicksight to generate interactive visualisations and explore some other features it has to offer.
Amazon Quicksight is a fully managed business intelligence (BI) service which allows users to publish dashboards and share amongst team members. Since it is a serverless offering, it scales to tens of thousands of users without having to worry about managing underlying infrastructure. It can connect to a wide variety of data sources in the cloud (S3, RDS, Redshift, Athena to name o few) and on-prem.
It also provides some more advanced features such as integrating machine learning insights with dashboards in the form of forecasting, anomaly detection and natural language querying. We will explore some of these in this blog.
Setting up a Quicksight subscription
If this is the first time using Quicksight, you will need to set up an account. Sign into your IAM user account and then navigate to Quicksight service. Follow the instructions here and choose the options to setup an enterprise account with method of authentication as federated users and QuickSight-managed role and grant access to Redshift. You will get a free trial subscription for 30 days for a Standard or Enterprise subscription. If your free trial has expired, then you can sign up for one of the cheaper pay as you go Reader subscriptions which is only charged for an active session and can be stopped after this tutorial is complete.
Once you have setup the subcription, you should be able to sign into Quicksight as an IAM user and manage your account by choosing the user icon at the upper right of the page and select Manage Quicksight
. You can now check your active subscription or as admin user, invite users to your account if required and manage permissions accordingly.
Quicksight also uses SPICE to run fast in-memory computations on data for visual analytics. For Enterprise subscriptions, data is also encrypted at rest by default. By default, we get a total 11GB SPICE capacity per region per subscription account, which can be shared amongst Quicksight users added to the account. We will be loading data from the Redshift into SPICE for this tutorial and the capacity would be more than enough.
Connecting to Redshift
- On the Amazon QuickSight start page choose Datasets from the options on the left and on the Datasets page, choose the New data set option on the top right (screenshot below).
- In the new window, choose the Redshift Manual connect icon. A new window will pop up requiring the the connection information for the data source to be filled in.
- For Data source name, enter a name for the data source.
- For Database server, you will need to retrieve the endpoint of the cluster. You can get the endpoint value from the Endpoint field on in the general information section when clicking on the cluster workgroup in the Redshift Serverless dashboard. The server address is the first part of the endpoint before the colon as highlighted in yellow below.
- The port will be default port for redshift (5439) unless this was set differently in setup, in which case confirm from the endpoint address (the number following the first colon).
- Enter the name of the database (after the second colon in the endpoint). In my case, it is dev.
- For UserName and Password, enter the user name and password you configured in part1 of this blog when setting up the redshift cluster
- Click on
Validate Connection
. If successful, you should see a green tick, saying validated. If this has failed check that you have done the following things:- Check that the security group attached to the Redshift cluster allows inbound traffic from IP address range associated with the region Quicksight was setup in as explained in the previous blog.
- Did you forget to make the VPC that the Redshift cluster resides in publicly available ?
- Check that you are using the correct username and/or password (this can be reset from the Redshift dashboard).
- Assuming everything worked, click Create DataSource.
- You will be presented with the schema and set of tables to connect to. The view
worldwide_events_vw
created in the previous blog , should be visible. Select this and click next.
- In the next pop up, we need to select whether we want to directly query the dataset from source or use the table data as-is and import into SPICE. The latter is the recommended method as it improves performance and quicker analytics, provided you have enough SPICE capacity. Select the
Import to SPICE
option - If you do not want to be emailed when a refresh fails, then untick the box. Then choose
Visualize
.
Accept the default settings for creating a new sheet and you should now be presented with the dashboard for creating the charts.
Data Insights
Quicksight offers a number of visual types which can be selected from the visual types pane using the representative visual icon. The AWS docs on creating quicksight visuals goes through the steps for adding a visual to the dashboard. First we will create a line chart from the visual types to plot the fields caldate and totalprice from the fields list.
Quicksight allows non technical users to generate forecasts using the built in Random Cut Forest algorithm to analyse historical data and generate forecast for the a specified period with a prediction interval of required confidence level.
- For forecast length, we set the periods forward to 14
- Set the prediction interval to 90.
- Set the seasonality to
auto
and leave the other settings as the default values.
We get a wide confidence interval which suggests that the forecast could lie anywhere within that range. A smaller value of prediction interval will generate a narrower band but will give less confidence in the forecast.
We can also generate a forecast for a period in history and compare it to actual data. To do this, edit the forecast and for the forecast length setting, set the periods forward option to 0 and the periods backward setting to 100.
Amazon Quicksight also provides users with ML powered anomaly insights by analysing a number of combinations of metrics and trends in data. The concepts for detecting outliers are based on whether an extreme data point occurs by random chance or is a significant event. Quicksight notifies users when there are any anomalies in the visuals and whether they are worth investigating. Click on the bulb icon in the top right hand corner of the chart. You will see the largest anomaly detected in the time series via ML insights. Click on the more options and then view more details. On the left hand panel, you should see a list of anomalies with additional statistics on the percentage change from average expected total price. Click 'Add anomaly to sheet'
This will open an insight widget in the same sheet. Click get started in the widget. You are now taken to a configuration screen with preview
Amazon Quicksight provides contributions analysis (key drivers) that contribute to the anomalous outcomes. Expand the top contributors option and tick upto 4 features to use as key drivers for running contribution analysis. The screenshot below shows the results for day, eventname, month and venuecity.
Choose Save to confirm your choices. You are taken back to the
insight widget, where you can select Run now to run the anomaly detection and view your insight. This will take a few minutes to complete. Once complete you should see an update in the widget with the latest anomaly detected and an option to explore anomalies, which you can click.
This will open the anomalies screen as in the screenshot below. Select SHOW ANOMALIES BY DATE
to display the The Number of anomalies chart which shows outliers detected over time. We can see two outliers detected end of May and end of June. On the left pane, we can re-run contribution analysis if required with different set of key drivers. In the screenshot below, I have run this between May 26, 2008 and May 27, 2008 (corresponding to the first anomaly) and selected eventname
and eventcity
. We can also explore anomalies per category or dimension.
The dashboard below uses the vertical bar chart,histogram and boxplot visual types. The bar chart shows the total number of tickets for each quarter in the year split by each of the even days in the week. We can see that the first quarter (Jan-March) has the least number of tickets sold and quarter 3 has a larger variation in tickets sold across the week, with Sunday being the most day for events. In the last 3 months of the year, we have more tickets sold between Friday-Monday compared to rest of the week.
Jan is the month in the first quarter where the range and median of total transactions for an event were the least possibly due to fewer tickets sold. We can see a strong right skew in February with a long upper whisker. For the rest of the months, the median remains consistent between £12k - £15k. November showed a slight left skew and the maximum transaction values for a given event of just over £32k were seen for Dec, Feb and May.
This sheet contains tree map with venuename
dimension arranged by total_tickets
(rectangle size) and color encoded by venueseats
. The larger the venue size the darker the shade of green (e.g FedEx field, New York Giants stadium,,Arrowhead stadium) whilst the smaller the venue the brighter shade of yellow (e.g. Shoreline Ampitheatre). We can see that some of the events smaller venue sizes between 20k-50k sold a larger number of ticket (possibly because more events held at these venues during this timeframe). The pie chart shows the proportion of total transactions for the top eventname
.Here the top 6 events are represented and the rest grouped in 'others' category. Greg Kihn and Yaz (Yazoo) bands accounted for more than 65% of total transaction sales.
Deleting Resources
Finally, remember to delete all resources in Redshift and stop the Quicksight subscription created in both parts of this blog to avoid being charged further. Note that for Redshift Serverless although you do not pay for compute capacity when you do not run any queries, you still pay for storage (more details can be found here).
- To delete the Quicksight Enterprise subscription follow the instructions here. You can also export the dashboard to pdf and delete the dashboard if required.
- The Redshift Serverless workgroup and associated namespace can be deleted by following these instructions.
Posted on May 13, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.