Generate Sample Data

Amazon Personalize uses data that you provide to train a model. When you import data, you can choose to import records in bulk, incrementally, or both. With incremental imports, you can add individual historical records, data from live events, or both, depending on your business requirements.

Service quotas for Amazon Personalize:

  • Minimum number of unique combined historical and event interactions required to train a model: 1000
  • Minimum number of unique users, with at least 2 interactions each, required to train a model: 25

In this section we are going to create sample Deals, Users, and interactions between Users and Deals. We will use this data to train a model, which will then be used for recommendations.

Create sample Deals

The web application allows you to create Deals through its UI. Log in as a User and click the Create new Deal button to start creating items:

  • All fields will be prepopulated with randomly generated values.
  • You can click Reset to get another set of random values, or fill in the values manually.
  • Click Save to save a Deal.
  • Repeat as many times as necessary (e.g. 12-15 items); there are no lower or upper limits on the number of Deals for training.

Create Users

You can import users into an Amazon Cognito user pool. The user information is imported from a specially formatted .csv file. The import process sets values for all user attributes except password. Password import is not supported, because security best practices require that passwords are not available as plain text, and Cognito does not support importing password hashes. This means that your users must change their passwords the first time they sign in. As a result, your users will be in a RESET_REQUIRED state when imported using this method.
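
If you later want to sign in as one of the generated test users, one option is to give it a usable password through the Cognito admin API after the import. The sketch below is a minimal, optional example of that approach, assuming the aws-sdk and minimist packages (installed later in this section for the data scripts), admin credentials, and placeholder values for the file name, pool id, username, and password:

/**
 * Set a permanent password for an imported (RESET_REQUIRED) test user (hypothetical helper).
 * Usage: node set-user-password.js --userPoolId us-west-2_xxxxxxxx --username johndoe --password 'S0me-Passw0rd!' --awsRegion us-west-2
 */
const AWS = require('aws-sdk');
const args = require('minimist')(process.argv.slice(2));

(async () => {
    const cognito = new AWS.CognitoIdentityServiceProvider({ region: args['awsRegion'] });
    await cognito.adminSetUserPassword({
        UserPoolId: args['userPoolId'],
        Username: args['username'],
        Password: args['password'],
        Permanent: true // false would force a password change on first sign-in
    }).promise();
    console.log(`Password set for ${args['username']}`);
})();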

Generate Cognito users

We are going to create a script that generates a CSV file ready for import.

  1. Create a separate directory for data scripts, e.g. data-scripts/
  2. Install required NPM dependencies in this directory:
npm install faker minimist
  3. Create a new file generate-cognito-users.js in this directory.
  4. Use this code for the file:
/**
 * 1. Generate users in a format ready for Cognito import.
 */
const args = require('minimist')(process.argv.slice(2))
const fs = require('fs');
const faker = require('faker');

const USER_COUNT = args['number'] || 30;

const BASE_EMAIL = args['baseEmail'];
if (!BASE_EMAIL || BASE_EMAIL.length === 0) {
    throw new Error('Missing "baseEmail" parameter.');
}
const emailSplit = BASE_EMAIL.split('@');
const emailusername = emailSplit[0];
const emaildomain = emailSplit[1];

const csvHeader = "name,given_name,family_name,middle_name,nickname,preferred_username,profile,picture,website,email,email_verified,gender,birthdate,zoneinfo,locale,phone_number,phone_number_verified,address,updated_at,cognito:mfa_enabled,cognito:username";

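// Builds one CSV row in the same column order as csvHeader above.
// Only name, email, email_verified, phone_number_verified, cognito:mfa_enabled,
// and cognito:username are populated; all other attributes are left empty.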
function generateCognitoUser() {
    let s = "";
    const name = faker.name;
    const firstName = name.firstName().replace(/\W/g, '');
    const lastName = name.lastName().replace(/\W/g, '');
    const username = `${firstName + lastName}`.toLowerCase();
    const email = `${emailusername}+${username}@${emaildomain}`;

    s += firstName + " " + lastName + ","; // name,
    s += ","; // given_name,
    s += ","; // family_name,
    s += ","; // middle_name,
    s += ","; // nickname,
    s += ","; // preferred_username,
    s += ","; // profile,
    s += ","; // picture,
    s += ","; // website,
    s += email + ","; // email,
    s += "true,"; // email_verified,
    s += ","; // gender,
    s += ","; // birthdate,
    s += ","; // zoneinfo,
    s += ","; // locale,
    s += ","; // phone_number,
    s += "false,"; // phone_number_verified,
    s += ","; // address,
    s += ","; // updated_at,
    s += "false,"; // cognito:mfa_enabled,
    s += username; // cognito:username

    return s;
}

function writeToFile(fileName, header, items) {
    let fd;
    try {
        fd = fs.openSync(fileName, 'w');
        fs.writeFileSync(fd, `${header}\n`);
        items.forEach(item => {
            fs.writeFileSync(fd, `${item}\n`);
        });
        console.log(`Data saved to: ${fileName}`);
    } catch (err) {
        console.log(err, err.stack);
    } finally {
        if (fd !== undefined) {
            fs.closeSync(fd);
        }
    }
}

const users = [];
for (let i = 0; i < USER_COUNT; i++) {
    users.push(generateCognitoUser());
}

writeToFile(
    './cognito-users.csv', 
    csvHeader,
    users
);
  5. Run the script to generate users:
node generate-cognito-users.js --baseEmail name@example.com --number 30

Parameters:

  • baseEmail - every generated user will get a unique username and email. The email is generated as an alias of the provided baseEmail. For example, if baseEmail is name@example.com, and the username of a generated user is johndoe, then the unique generated email for this user will be name+johndoe@example.com
  • number - the number of users to generate

Amazon SES includes a mailbox simulator that you can use to test how your application handles different email sending scenarios. The mailbox simulator is useful when, for example, you need to test an email sending application without creating fictitious email addresses, or when you need to find your system’s maximum throughput without impacting your daily sending quota.

Output of the script:

Data saved to: ./cognito-users.csv

Example of generated Users in ./cognito-users.csv:

name,given_name,family_name,middle_name,nickname,preferred_username,profile,picture,website,email,email_verified,gender,birthdate,zoneinfo,locale,phone_number,phone_number_verified,address,updated_at,cognito:mfa_enabled,cognito:username
Bennett Carter,,,,,,,,,name+bennettcarter@example.com,true,,,,,,false,,,false,bennettcarter
Berneice Mante,,,,,,,,,name+berneicemante@example.com,true,,,,,,false,,,false,berneicemante
Alda Bartell,,,,,,,,,name+aldabartell@example.com,true,,,,,,false,,,false,aldabartell
Reva Dare,,,,,,,,,name+revadare@example.com,true,,,,,,false,,,false,revadare
Zachary Ledner,,,,,,,,,name+zacharyledner@example.com,true,,,,,,false,,,false,zacharyledner

Import Cognito Users

  1. Open the Amplify Console.
  2. On the All apps page, choose the project you are currently working on, e.g. traveldeals
  3. Under Backend Environments, in the list of Categories added, choose Authentication.
List of enabled categories for application in Amplify Console
  4. Under Users, choose View in Cognito.
Authentication category in Amplify Console
  5. Copy and save the Pool Id value; we will need it later to run the script.
  6. In the navigation pane, choose Users and groups.
General settings of a User pool in Cognito
  7. Choose Import users.
List of users in Cognito
  8. Choose Create import job:
    1. Set Job name, e.g. import-users-121420
    2. Keep IAM name as Create role.
    3. Set IAM role name, e.g. Cognito-UserImport-Role-121420
    4. Choose Select file and locate the file that was generated by the script in the previous step, e.g. ./cognito-users.csv
    5. Choose Create job.
    6. Choose Start and wait until the job changes its status to Succeeded.

You can find more information about import job statuses on the Creating and Running the Amazon Cognito User Pool Import Job page. (A sketch for checking the job status programmatically follows this list.)

Create import job in Cognito
  9. In the navigation pane, open Users and groups to see the imported users.
List of imported users in Cognito
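
If you prefer to check the import job status from code rather than the console, the sketch below polls it with the AWS SDK for JavaScript (v2), matching the scripts in this section; the aws-sdk dependency is installed in the next subsection. The script file name, pool id, and job id are placeholders; take the real values from the console.

/**
 * Check the status of a Cognito user import job (hypothetical helper).
 * Usage: node check-import-job.js --userPoolId us-west-2_xxxxxxxx --jobId import-xxxxxxxxxx --awsRegion us-west-2
 */
const AWS = require('aws-sdk');
const args = require('minimist')(process.argv.slice(2));

(async () => {
    const cognito = new AWS.CognitoIdentityServiceProvider({ region: args['awsRegion'] });
    const result = await cognito.describeUserImportJob({
        UserPoolId: args['userPoolId'],
        JobId: args['jobId']
    }).promise();

    const job = result.UserImportJob;
    // Statuses include Created, Pending, InProgress, Succeeded, Failed, and Stopped.
    console.log(`${job.JobName}: ${job.Status} (imported: ${job.ImportedUsers}, failed: ${job.FailedUsers})`);
})();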

Generate random interactions

An interaction in our scenario means that a User views a Deal.

  1. Install NPM dependencies:
npm install aws-sdk

We are going to create a script that:

  • Pulls existing items from the DynamoDB table.
  • Pulls users from Cognito with all necessary meta information, including the user Id.
  • Generates interaction output in a format ready for Personalize import.

For usernames that start with vowels, interactions will be generated with items that have categories starting with vowels; usernames that start with consonants will similarly interact only with categories starting with consonants. E.g. “johndoe” will interact with the “Cities” category, but not with “Outdoors”. You can use this later to verify the recommendations created for those users.

  2. Create a file generate-personalize-datasets.js in the data-scripts directory and paste this code:
/**
 * 1. Pulls existing items from DynamoDB table.
 * 2. Pulls users from Cognito.
 * 3. Generate interaction output in format ready for Personalize import.
 * 
 * For usernames that start with vowels, interactions will be generated with items that have categories starting with vowels.
 * E.g. "johndoe" will interact with "Cities" category, but not with "Outdoors".
 */
const AWS = require('aws-sdk');
const args = require('minimist')(process.argv.slice(2))
const fs = require('fs');
const faker = require('faker');

const INTERACTIONS_COUNT = args['number'] || 1000;

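// Scans the Deal table and returns the id, category, and name of every item.
// Note: a single Scan call returns at most 1 MB of data, which is enough for this sample dataset.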
async function scanDynamoDbItems(awsRegion, dealTableName) {
    if (!dealTableName || dealTableName.length === 0) {
        throw new Error('Missing "dealTableName" parameter.');
    }

    var awsConfig = new AWS.Config({region: awsRegion});
    const dynamodb = new AWS.DynamoDB(awsConfig);
    try {
        const items = await dynamodb.scan({
            TableName: dealTableName,
            AttributesToGet: [
                'id',
                'name',
                'category',
            ],
        }).promise();

        return items.Items.map(item => { return { id: item.id.S, category: item.category.S, name: item.name.S } });
    } catch (e) {
        console.error(e);
    }
}

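// Lists users in the Cognito user pool and returns each user's id (the 'sub' attribute) and username.
// Note: listUsers returns a single page of up to 60 users; add pagination for larger pools.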
async function listCognitoUsers(awsRegion, userPoolId) {
    if (!userPoolId || userPoolId.length === 0) {
        throw new Error('Missing "userPoolId" parameter.');
    }

    var awsConfig = new AWS.Config({region: awsRegion});
    const cognito = new AWS.CognitoIdentityServiceProvider(awsConfig);
    try {
        const users = await cognito.listUsers({
            UserPoolId: userPoolId
        }).promise();

        return users.Users.map(user => ({ id: user.Attributes.find(el => el.Name == 'sub').Value, username: user.Username}));
    } catch (e) {
        console.error(e);
    }
}

const vowels = ['a', 'e', 'i', 'o', 'u', 'y'];
function startsWithVowel(str) {
    return vowels.includes(str.toLowerCase().charAt(0));
}

function randomElement(arr) {
    return arr[Math.floor(Math.random() * arr.length)];
}

function randomInteraction(users, items) {
    const randomUser = randomElement(users);
    const vowelUsername = startsWithVowel(randomUser.username);
    const filteredItems = items.filter(item => {
        return vowelUsername === startsWithVowel(item.category);
    });

    const randomItem = randomElement(filteredItems.length > 0 ? filteredItems : items);    
    return `${randomUser.id},${randomItem.id},${faker.date.recent(14).getTime()}`;
}

function generateAllInteractions(users, items, n) {
    const interactions = []
    for (let i = 0; i < n; i++) {
        interactions.push(randomInteraction(users, items));
    }    
    return interactions;
}

function writeToFile(fileName, header, items) {
    let fd;
    try {
        fd = fs.openSync(fileName, 'w');
        fs.writeFileSync(fd, `${header}\n`);
        items.forEach(item => {
            fs.writeFileSync(fd, `${item}\n`);
        });
        console.log(`Data saved to: ${fileName}`);
    } catch (err) {
        console.log(err, err.stack);
    } finally {
        if (fd !== undefined) {
            fs.closeSync(fd);
        }
    }
}

(async () => {
    const awsRegion = args['awsRegion'];
    if (!awsRegion || awsRegion.length === 0) {
        throw new Error('Missing "awsRegion" parameter.');
    }

    const items = (await scanDynamoDbItems(awsRegion, args['dealTableName']));
    writeToFile(
        './personalize-items.csv', 
        'ITEM_ID,CATEGORY,NAME', 
        items.map(item => `${item.id},${item.category},"${item.name}"`)
    );

    const users = await listCognitoUsers(awsRegion, args['userPoolId']);
    const interactions = generateAllInteractions(users, items, INTERACTIONS_COUNT);
    writeToFile(
        './personalize-interactions.csv', 
        'USER_ID,ITEM_ID,TIMESTAMP',
        interactions
    );
})();
  3. Run the script to generate CSV files for Personalize:
node generate-personalize-datasets.js --userPoolId us-west-2_xxxxxxxx --dealTableName Deal-xxxxxxxxxxxxxxxxxxxxxxxxxx-dev --awsRegion us-west-2 --number 1000

Parameters:

  • userPoolId - the id of the User Pool, available on the General settings page for the User Pool. You can find it as described in Import Cognito Users
  • dealTableName - the name of the table in DynamoDB. You can find it by navigating through: Amplify Console / Backend Environments / API category / Data Sources / DealTable - View; use the value of Table name
  • awsRegion - the AWS region where the resources were created
  • number - the number of interactions to generate
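
If you are not sure which values to pass for userPoolId and dealTableName, a small sketch like the one below (using the same aws-sdk and minimist dependencies, with a hypothetical file name) can list the candidates in a region; pick the pool and table that belong to your Amplify app:

/**
 * List Cognito user pools and DynamoDB tables to help locate the
 * userPoolId and dealTableName parameters.
 * Usage: node list-aws-resources.js --awsRegion us-west-2
 */
const AWS = require('aws-sdk');
const args = require('minimist')(process.argv.slice(2));

(async () => {
    const region = args['awsRegion'];

    const cognito = new AWS.CognitoIdentityServiceProvider({ region });
    const pools = await cognito.listUserPools({ MaxResults: 60 }).promise();
    pools.UserPools.forEach(pool => console.log(`User pool: ${pool.Name} (${pool.Id})`));

    const dynamodb = new AWS.DynamoDB({ region });
    const tables = await dynamodb.listTables().promise();
    tables.TableNames.forEach(name => console.log(`DynamoDB table: ${name}`));
})();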

The script will generate two files:

Data saved to: ./personalize-items.csv
Data saved to: ./personalize-interactions.csv

Sample ./personalize-items.csv:

ITEM_ID,CATEGORY,NAME
2ffb2a39-5dfa-4724-a52b-4f6b931573fa,Cities,"North Clementine"
2bfa788e-b31b-4659-8e01-d956c81f2319,Cities,"Bechtelarland"
b4aa747c-16ed-4dc9-aeff-956fd6792072,Outdoors,"Greenfelderport"
f193d277-da8e-460c-8987-3a60d76f5074,Cities,"Shaneburgh"
213291cf-2413-48af-a287-e1bfeedaeee0,Cities,"Allisontown"
b098f6d3-88f0-4b6f-b1a7-fc0dd44d3b0f,Cities,"Williamsonland"

Sample ./personalize-interactions.csv:

USER_ID,ITEM_ID,TIMESTAMP
8aeb2a63-9497-4dc1-a1c2-702e7eb7857c,b4aa747c-16ed-4dc9-aeff-956fd6792072,1607831539230
7c00ead2-b3f5-467d-930a-e2c44501fcec,b4aa747c-16ed-4dc9-aeff-956fd6792072,1607844858510
51a14f5c-06c3-4e47-acea-86377e23e050,2ffb2a39-5dfa-4724-a52b-4f6b931573fa,1607318935104
26509fe8-a481-4c9a-b2f2-736864b9b077,2bfa788e-b31b-4659-8e01-d956c81f2319,1606984541487
2fea49db-eb5d-4044-ad05-6fc74780bee0,f193d277-da8e-460c-8987-3a60d76f5074,1607107287704
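
Before moving on, you can quickly verify that the generated interactions satisfy the Personalize minimums listed at the top of this section (at least 1000 interactions and at least 25 users with 2 or more interactions each). A minimal sketch that reads the generated file:

/**
 * Sanity-check personalize-interactions.csv against the Personalize training minimums.
 */
const fs = require('fs');

const lines = fs.readFileSync('./personalize-interactions.csv', 'utf8').trim().split('\n').slice(1); // skip header
const interactionsPerUser = {};
lines.forEach(line => {
    const userId = line.split(',')[0];
    interactionsPerUser[userId] = (interactionsPerUser[userId] || 0) + 1;
});
const usersWithTwoOrMore = Object.values(interactionsPerUser).filter(count => count >= 2).length;

console.log(`Interactions: ${lines.length} (minimum 1000)`);
console.log(`Users with at least 2 interactions: ${usersWithTwoOrMore} (minimum 25)`);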

Upload Sample data to S3

We are going to import bulk records stored in an Amazon S3 bucket using a dataset import job. To prepare for this step, we will upload generated data to S3.

  1. Create an S3 bucket with a unique name, e.g. traveldeals-12142020
aws s3 mb s3://traveldeals-12142020
  2. Update the S3 bucket policy to allow Personalize to access its content:
    1. Create a file for the S3 bucket policy: personalize-s3-bucket-policy.json
    2. Set it to the content below, and set the Resource values to the name of the bucket that was created, e.g. traveldeals-12142020
{
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::traveldeals-12142020",
                "arn:aws:s3:::traveldeals-12142020/*"
            ]
        }
    ]
}
  3. Update the S3 bucket with this policy:
aws s3api put-bucket-policy --bucket traveldeals-12142020 --policy file://personalize-s3-bucket-policy.json
  4. Upload the generated Items to the S3 bucket with the prefix item:
aws s3 cp personalize-items.csv s3://traveldeals-12142020/item/
  5. Upload the generated User-item interactions to the S3 bucket with the prefix user-item:
aws s3 cp personalize-interactions.csv s3://traveldeals-12142020/user-item/
  6. Run aws s3 ls --recursive s3://traveldeals-12142020/ to check the bucket structure with the data for training; the output will look similar to this:
2020-12-15 13:16:10        385 item/personalize-items.csv
2020-12-15 13:16:17      88026 user-item/personalize-interactions.csv
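
As an optional check, the sketch below uses the same aws-sdk dependency to confirm from code that the bucket policy is in place and that both CSV files were uploaded. The bucket name is the example used above; replace it with yours. Credentials and region come from your local AWS configuration.

/**
 * Verify the Personalize bucket policy and the uploaded training files.
 */
const AWS = require('aws-sdk');

const BUCKET = 'traveldeals-12142020'; // replace with your bucket name

(async () => {
    const s3 = new AWS.S3();

    const policy = await s3.getBucketPolicy({ Bucket: BUCKET }).promise();
    console.log('Bucket policy:', policy.Policy);

    const objects = await s3.listObjectsV2({ Bucket: BUCKET }).promise();
    objects.Contents.forEach(obj => console.log(`${obj.Key} (${obj.Size} bytes)`));
})();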