Setup Kinesis Data Firehose, S3, and Athena

User events are recorded in Amazon Kinesis Data Streams. We are going to send data to Amazon S3 using Amazon Kinesis Data Firehose, and use AWS Glue, Amazon Athena, and Amazon QuickSight to catalog, analyze, and visualize the data, respectively.

Create S3 buckets

We are going to create two buckets: one to store events from Kinesis Data Firehose, and another to store the results of Athena queries. Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.

  1. Use the AWS CLI to create a new bucket, which we are going to use to store events from Kinesis Data Firehose:
aws s3 mb s3://traveldeals-12142020-kdf
  2. Use the AWS CLI to create a new bucket, which we are going to use to store the results of Athena queries:
aws s3 mb s3://traveldeals-12142020-athena
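To verify that both buckets exist, you can list the buckets in your account (traveldeals-12142020 is just the example naming used throughout this section):

aws s3 ls | grep traveldeals-12142020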

Setup Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

  1. Open the Athena Console.

Before you run your first query, you need to set up a query result location in Amazon S3.

  2. Choose the link to set up a query result location in Amazon S3.
Set up a query result location in Amazon S3 in Athena
  3. In the Settings dialog box, enter the path to the bucket that you created in Amazon S3 for your query results, e.g. s3://traveldeals-12142020-athena/ (a CLI alternative is sketched after the next screenshot).
  4. Choose Connect data source.
Connect data source in Athena
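If you prefer to script this step, the same query result location can be set on the default Athena workgroup from the CLI. This is a minimal sketch, assuming the primary workgroup and the example bucket name from above:

aws athena update-work-group \
  --work-group primary \
  --configuration-updates 'ResultConfigurationUpdates={OutputLocation=s3://traveldeals-12142020-athena/}'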
Step 1: Choose a data source
  1. Choose Query data in Amazon S3.
  2. Choose AWS Glue data catalog.
Choose a data source in Athena
Step 2: Connection details
  1. Choose Add a table and enter schema information manually.
  2. Choose Continue to add table.
Connection details

Now we are going to configure the database and table.

Step 1: Name & Location
  1. Choose Create a new database.
  2. Set the name of the database, e.g. traveldealsDb
  3. Set the Table Name, e.g. dealViews
  4. Set Location of Input Data Set to the bucket we created in the Create S3 buckets section, e.g. s3://traveldeals-12142020-kdf/
  5. Choose Next.
Set Name and Location of a data source
Step 2: Data Format
  1. For Data Format choose Parquet.
  2. Choose Next.
Select Data Format
Step 3: Columns
  1. Choose Bulk add columns.
Add columns
  2. Add these values:
eventtype string, userid string, itemid string, itemname string, itemcategory string, timestamp timestamp, pagevieworigin string
Bulk add columns
  3. Choose Add.
  4. Review the created columns and choose Next.
Save added columns
Step 4: Partitions
  • Keep default settings and choose Create table.
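The wizard generates and runs DDL for you. If you would rather create the database and table from the command line, roughly equivalent statements can be submitted with the Athena CLI. This is a sketch assuming the example names used above; note that timestamp is a reserved word in DDL and must be escaped with backticks:

aws athena start-query-execution \
  --result-configuration OutputLocation=s3://traveldeals-12142020-athena/ \
  --query-string "CREATE DATABASE IF NOT EXISTS traveldealsdb"

aws athena start-query-execution \
  --result-configuration OutputLocation=s3://traveldeals-12142020-athena/ \
  --query-string "CREATE EXTERNAL TABLE IF NOT EXISTS traveldealsdb.dealviews (
    eventtype string,
    userid string,
    itemid string,
    itemname string,
    itemcategory string,
    \`timestamp\` timestamp,
    pagevieworigin string
  )
  STORED AS PARQUET
  LOCATION 's3://traveldeals-12142020-kdf/'"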

Configure Kinesis Delivery Stream

Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt your data streams before loading, minimizing the amount of storage used and increasing security.

We are going to use Kinesis Data Firehose to deliver data from Kinesis Data Streams to S3 (a CLI equivalent is sketched after these steps):

  1. Open the Amazon Kinesis Data Firehose Console.
  2. Choose Create Delivery Stream.
Step 1: Name and source
  1. Set Delivery stream name, e.g. traveldeals-delivery-stream
  2. Set Source to Kinesis Data Stream.
  3. Choose the Kinesis data stream that was recently created in the Setup Kinesis Data Streams section, e.g. traveldealsKinesis-dev.
  4. Choose Next.
Create Kinesis delivery stream
Step 2: Process records
  1. Set Record format conversion to Enabled.
  2. Set Output format to Apache Parquet.
  3. Set AWS Glue region to the region where you created resources with Amplify.
  4. Set AWS Glue database to the name of the created Glue database, e.g.: traveldealsdb
  5. Set AWS Glue table to the name of the created Glue table, e.g.: dealviews
  6. Choose Next.
Setup record processing

Step 3: Choose a destination

  1. Under S3 destination, set S3 bucket to the bucket that was created in the Create S3 buckets section, e.g.: traveldeals-12142020-kdf
  2. Choose Next.
Choose destination
Step 4: Configure settings
  1. Under S3 buffer conditions, set Buffer interval to 60 seconds.
  2. Choose Next.
Configure settings
Step 5: Review
  • Review settings and choose Create delivery stream.
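For reference, the same delivery stream can be created from the CLI. The sketch below is illustrative rather than exact: the account ID, region, and the firehose-delivery-role IAM role are placeholders you would replace with your own (the role must allow Firehose to read the stream, write to the bucket, and read the Glue table). Note that when Parquet conversion is enabled, the buffer size must be at least 64 MB:

# Write the S3 destination configuration to a file.
# All ARNs below are placeholders: substitute your account ID, region, and role.
cat > s3-destination.json <<'JSON'
{
  "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
  "BucketARN": "arn:aws:s3:::traveldeals-12142020-kdf",
  "BufferingHints": { "SizeInMBs": 64, "IntervalInSeconds": 60 },
  "DataFormatConversionConfiguration": {
    "Enabled": true,
    "SchemaConfiguration": {
      "DatabaseName": "traveldealsdb",
      "TableName": "dealviews",
      "Region": "us-east-1",
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role"
    },
    "InputFormatConfiguration": { "Deserializer": { "OpenXJsonSerDe": {} } },
    "OutputFormatConfiguration": { "Serializer": { "ParquetSerDe": {} } }
  }
}
JSON

aws firehose create-delivery-stream \
  --delivery-stream-name traveldeals-delivery-stream \
  --delivery-stream-type KinesisStreamAsSource \
  --kinesis-stream-source-configuration 'KinesisStreamARN=arn:aws:kinesis:us-east-1:123456789012:stream/traveldealsKinesis-dev,RoleARN=arn:aws:iam::123456789012:role/firehose-delivery-role' \
  --extended-s3-destination-configuration file://s3-destination.json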

Test Kinesis Delivery Stream

  1. Now open the web application and log in as a user.
  2. Open several Deals to record related Kinesis events.
  3. Check the S3 bucket after 60 seconds:
aws s3 ls --recursive s3://traveldeals-12142020-kdf

It will have content similar to this:

2020-12-16 15:34:44       1990 2020/12/16/23/traveldeals-delivery-stream-1-2020-12-16-23-33-37-9aaa31e6-18b0-4d36-bb70-019314f67191.parquet
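If you want to generate a test event without the web application, you can also put a record on the source stream directly. This is a sketch only: the field values are made up to match the table columns, and the exact payload shape (including the timestamp format the Parquet conversion accepts) depends on what the application actually emits:

# --cli-binary-format is needed with AWS CLI v2 so the JSON payload is not treated as base64; omit it for CLI v1.
aws kinesis put-record \
  --stream-name traveldealsKinesis-dev \
  --partition-key testuser \
  --cli-binary-format raw-in-base64-out \
  --data '{"eventtype":"dealView","userid":"testuser","itemid":"deal-1","itemname":"Sample Deal","itemcategory":"Beach","timestamp":"2020-12-16 23:33:37","pagevieworigin":"Home"}'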

Setup Glue Crawler

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

  1. Open the Glue console.
  2. In the navigation pane, choose Crawlers.
  3. Choose Add crawler.
List of Glue crawlers
Crawler info
  1. Set Crawler name, e.g. traveldealsDb-crawler
  2. Choose Next.
Set a name for a new Glue crawler
Crawler source type
  1. For Crawler source type choose Existing catalog tables.
  2. Choose Next.
Specify crawler source type
Catalog tables
  1. Under Available tables, choose Add for the table that was created in the Setup Athena section, e.g. dealviews
  2. Choose Next.
Choose catalog tables
IAM Role
  1. Choose Create an IAM role.
  2. Add a suffix to the role name, e.g. traveldeals-12142020-kdf
  3. Choose Next.
Choose an IAM Role
Schedule
  1. Set Frequency to Run on demand.
  2. Choose Next.
Create a schedule
Output
  1. Expand Configuration options (optional)
  2. Set How should AWS Glue handle deleted objects in the data store? to Ignore the change and don’t update the table in the data catalog.
  3. Choose Next.
Configure crawler's output
Review all steps
  • Review configuration and choose Finish.
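For reference, an equivalent crawler can be created from the CLI. This sketch assumes the example names above and the role name the console wizard typically generates (AWSGlueServiceRole- plus the suffix you entered; treat it as a placeholder). For crawlers that target existing catalog tables, the delete behavior must be LOG, which corresponds to the "ignore the change" option chosen above:

# The role name below is a placeholder for the role created in the IAM Role step.
aws glue create-crawler \
  --name traveldealsDb-crawler \
  --role AWSGlueServiceRole-traveldeals-12142020-kdf \
  --targets 'CatalogTargets=[{DatabaseName=traveldealsdb,Tables=[dealviews]}]' \
  --schema-change-policy 'UpdateBehavior=UPDATE_IN_DATABASE,DeleteBehavior=LOG'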

Run Glue Crawler

  1. Open the Glue console.
  2. In the navigation pane, choose Crawlers.
  3. Choose the crawler that was created in the previous section, e.g. traveldealsDb-crawler
  4. Choose Run crawler.
Run a crawler
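The crawler can also be started and polled from the CLI; a small sketch assuming the example crawler name:

aws glue start-crawler --name traveldealsDb-crawler
# Poll until the state returns to READY.
aws glue get-crawler --name traveldealsDb-crawler --query 'Crawler.State' --output text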
  1. Wait until the crawler's status changes to Ready.
  2. In the navigation pane, open Databases.
  3. Choose the database created in the Setup Athena section, e.g. traveldealsdb
List of Glue databases
  4. Choose Tables in traveldealsdb.
Details of a Glue database
  5. Choose the table created in the Setup Athena section, e.g. dealviews
Tables in a selected database
Tables in a selected database

On this page you can see table information, including the number of records crawled and the schema.

Details of a table
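The same metadata can be inspected from the CLI; a sketch assuming the example names (crawlers typically record the row count in the table's recordCount parameter):

aws glue get-table --database-name traveldealsdb --name dealviews \
  --query 'Table.{Columns:StorageDescriptor.Columns,RecordCount:Parameters.recordCount}'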

Use Athena to query data

  1. Open the Athena Console.
  2. Choose the database, e.g. traveldealsdb
  3. Query data from the created table, e.g. dealviews:
SELECT * FROM "traveldealsdb"."dealviews" limit 10;
  4. Choose Run Query.

The output will be similar to this:

Run an Athena query
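The same query can also be run from the command line; a small sketch assuming the example database and result bucket from this section (the query runs asynchronously, so wait for it to complete before fetching results):

QUERY_ID=$(aws athena start-query-execution \
  --query-string 'SELECT * FROM "traveldealsdb"."dealviews" limit 10' \
  --result-configuration OutputLocation=s3://traveldeals-12142020-athena/ \
  --query QueryExecutionId --output text)

# Check status with: aws athena get-query-execution --query-execution-id "$QUERY_ID"
aws athena get-query-results --query-execution-id "$QUERY_ID"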