Abstract

Goal

The goal of this project is to leverage Amazon Web Services (AWS) machine learning services to create a dataset that automatically adds and updates files on IEEE DataPort's S3 storage. Through this process, we sought to learn and demonstrate how an ongoing data collection script can create a shared, living dataset by streaming data to our IEEE DataPort dataset storage. We also hoped to gain further insight into several areas, including:

Generating data for machine learning
Using headless Chrome to collect data/images (via Pyppeteer, the Python port of Puppeteer)
Variables that may impact facial recognition in machine learning
Combining existing compute resources (e.g. our own AWS account) with IEEE DataPort S3 storage
Learning more about IEEE DataPort AWS integration
Exploring AWS ML services

Approach

In pursuit of this goal, we created a Python script (see handler.py in the aws-celebrity-recognition repository) designed to:

Capture screenshots of the homepages of a list of top celebrity news/paparazzi websites. The mobile view of each website is used, and the entire scrollable page is captured.
Process the screenshots so that they can be used by Amazon Rekognition, the AWS image recognition service. Image processing keeps each file below the maximum file size allowed by the service (5 MB). We also found that Amazon Rekognition returned relatively few matches from the full-page screenshots, which have large dimensions and contain many celebrities. To address both issues, each screenshot is split into smaller slices, which are then sent to Amazon Rekognition for scanning.
Send the image slices to the Amazon Rekognition API for facial recognition. The API returns a list of the celebrities detected in each image.
Add the list of celebrities detected on each site to a CSV file, along with the date and time the screenshot was taken and the name of the website it came from (see Instructions for more details).
Save the screenshots to the images folder in the S3 bucket and the CSV to the root directory of the bucket.
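The slicing and recognition steps above can be sketched roughly as follows. This is a minimal illustration, not the code in handler.py: the function names, the slice height, and the exact byte value used for the 5 MB limit are assumptions made here. The only AWS call used is `recognize_celebrities`, which in boto3 accepts `Image={"Bytes": ...}` and returns matches under `CelebrityFaces`.

```python
from typing import List, Tuple

# Assumption: the 5 MB limit mentioned above, expressed as 5 * 1024 * 1024 bytes.
MAX_IMAGE_BYTES = 5 * 1024 * 1024

# (left, upper, right, lower), the order Pillow's Image.crop() expects.
Box = Tuple[int, int, int, int]


def slice_boxes(width: int, height: int, slice_height: int) -> List[Box]:
    """Split a tall full-page screenshot into stacked horizontal bands.

    The last band is shorter when height is not a multiple of slice_height.
    """
    if slice_height <= 0:
        raise ValueError("slice_height must be positive")
    return [
        (0, top, width, min(top + slice_height, height))
        for top in range(0, height, slice_height)
    ]


def celebrity_names(client, image_bytes: bytes) -> List[str]:
    """Return the celebrity names matched in one encoded image slice.

    `client` is expected to behave like boto3.client("rekognition").
    """
    if len(image_bytes) > MAX_IMAGE_BYTES:
        raise ValueError("slice exceeds the size limit; use a smaller slice")
    response = client.recognize_celebrities(Image={"Bytes": image_bytes})
    return [face["Name"] for face in response.get("CelebrityFaces", [])]
```

In practice, each box would be passed to an image library (for example Pillow's `Image.crop`) and the cropped slice encoded as JPEG before being handed to `celebrity_names`. Passing the client in as a parameter keeps the function easy to exercise without AWS credentials.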

For more information on how everything was done, see https://github.com/Wlntr/aws-celebrity-recognition/blob/master/README.md (also available for download on IEEE DataPort).

Instructions

Celebs.csv contains the celebrity names returned by Amazon Rekognition from a series of website screenshots. Data collection began on June 13, 2023 and is ongoing. Our script, which runs twice a day, adds a new row for each website scanned, listing the celebrity matches. Celebs.csv contains the following columns:

Column A: Datetime (e.g. Jun_13_2023_14H_38M_45S)
Column B: Source (e.g. TMZ)
Columns C - ZZ: Celebrity Name (e.g. Jennifer Lawrence)
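
Because each row has a variable number of celebrity columns, the file is easier to read with Python's csv module than with a fixed-width table loader. The sketch below is one way to do it, assuming the datetime pattern is `%b_%d_%Y_%HH_%MM_%SS` (inferred from the example above, with English month abbreviations) and that any row whose first cell does not match that pattern, such as a possible header row, can be skipped. The function name and the sample row are illustrative only.

```python
import csv
from datetime import datetime
from typing import Iterable, Iterator, List, Tuple

# Assumption: inferred from the example "Jun_13_2023_14H_38M_45S".
DATETIME_FORMAT = "%b_%d_%Y_%HH_%MM_%SS"


def parse_rows(lines: Iterable[str]) -> Iterator[Tuple[datetime, str, List[str]]]:
    """Yield (captured_at, source, celebrity_names) for each data row."""
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            captured_at = datetime.strptime(row[0].strip(), DATETIME_FORMAT)
        except ValueError:
            continue  # e.g. a header row, if the file has one
        # Columns C onward hold names; drop empty trailing cells.
        names = [name.strip() for name in row[2:] if name.strip()]
        yield captured_at, row[1].strip(), names


# Illustrative row built from the examples above, not actual dataset content.
sample = ["Jun_13_2023_14H_38M_45S,TMZ,Jennifer Lawrence"]
for captured_at, source, names in parse_rows(sample):
    print(captured_at.isoformat(), source, names)
```

To read the real file, pass an open file object instead of the sample list, for example `open("Celebs.csv", newline="")`.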

Images

The images folder contains the original screenshots used for "celebrity recognition". These are kept for project reproducibility and for potential comparison with other machine learning approaches and explorations. The naming convention is SOURCE_DATETIME.jpeg (e.g. DLISTED_Jun_13_2023_14H_38M_45S.jpeg).
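
To match a screenshot to its row in Celebs.csv, the file name can be split back into its source and timestamp. The following is a small sketch under stated assumptions: the datetime part always has six underscore-separated fields, as in the example above, and the day and time fields are zero-padded. The helper names are hypothetical.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Tuple

# Assumption: inferred from the example "Jun_13_2023_14H_38M_45S".
DATETIME_FORMAT = "%b_%d_%Y_%HH_%MM_%SS"


def image_name(source: str, captured_at: datetime) -> str:
    """Build a SOURCE_DATETIME.jpeg file name (English locale assumed for %b)."""
    return f"{source}_{captured_at.strftime(DATETIME_FORMAT)}.jpeg"


def parse_image_name(key: str) -> Tuple[str, datetime]:
    """Split an image key such as 'images/TMZ_<datetime>.jpeg' into its parts."""
    stem = PurePosixPath(key).stem
    # The datetime occupies the last six underscore-separated fields, so a
    # source name that itself contains underscores is preserved intact.
    source, *stamp = stem.rsplit("_", 6)
    if len(stamp) != 6:
        raise ValueError(f"unexpected file name: {key!r}")
    return source, datetime.strptime("_".join(stamp), DATETIME_FORMAT)
```

For example, `parse_image_name("images/DLISTED_Jun_13_2023_14H_38M_45S.jpeg")` yields the source `DLISTED` and a datetime for June 13, 2023 at 14:38:45.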

Tip: IEEE DataPort subscribers may access these files using the platform's direct AWS S3 access feature.