MaliciousML: MovieLens dataset attacked with diverse profile injection attacks

Citation Author(s):
Santiago Alonso (Dpto. Sistemas Informáticos, ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid, Madrid, Spain)
Jesús Bobadilla (Dpto. Sistemas Informáticos, ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid, Madrid, Spain)
Fernando Ortega (Dpto. Sistemas Informáticos, ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid, Madrid, Spain)
Ricardo Moya (Telefónica Investigación y Desarrollo, S.A. 28050 Madrid, Spain)
Submitted by: Fernando Ortega
Last updated: Wed, 02/27/2019 - 19:00
DOI: 10.21227/rcsd-h160

Abstract 

The MovieLens 1M dataset has been extended with diverse shilling profiles designed to push or nuke a target item. The shilling profiles were generated using different shilling attack methods: random, average, bandwagon, reverse-bandwagon, love-hate and perfect-knowledge. Each file contains the 1M original MovieLens ratings plus the added votes of the shilling profiles. Each shilling profile rates as many items as the mean number of ratings per user in the original dataset. The dataset is divided into four quartiles based on the number of ratings for the target item.

Instructions: 
The files "nuke.zip" and "push.zip" together contain the whole set of data files.

As their names suggest, "nuke.zip" contains all the datasets used for the "nuke" attacks, while "push.zip" contains all the datasets used for the "push" attacks.

The full set of nuke attacks used to test the system was:

  • random attack (filler items receive random ratings drawn from a normal distribution around the mean rating across the whole database; the target item receives the minimum rating, r_min)
  • average attack (filler items receive random ratings drawn from a normal distribution around the mean rating of each filler item; the target item receives r_min)
  • reverse bandwagon random attack (the selected items are the least popular items from q4 and are rated r_min; filler items receive random ratings drawn from a normal distribution around the mean rating across the whole database; the target item receives r_min)
  • reverse bandwagon average attack (the selected items are the least popular items from q4 and are rated r_min; filler items receive random ratings drawn from a normal distribution around the mean rating of each filler item; the target item receives r_min)
  • love/hate attack (random filler items are rated with the maximum rating, r_max, and the target item with r_min)
  • perfect knowledge attack (each new user profile matches a randomly selected existing profile, except for the target item, which receives r_min)

 

The full set of push attacks used to test the system was:

  • random attack (filler items receive random ratings drawn from a normal distribution around the mean rating across the whole database; the target item receives the maximum rating, r_max)
  • average attack (filler items receive random ratings drawn from a normal distribution around the mean rating of each filler item; the target item receives r_max)
  • bandwagon random attack (the selected items are popular items from q1 and are rated r_max; filler items receive random ratings drawn from a normal distribution around the mean rating across the whole database; the target item receives r_max)
  • bandwagon average attack (the selected items are popular items from q1 and are rated r_max; filler items receive random ratings drawn from a normal distribution around the mean rating of each filler item; the target item receives r_max)
  • love/hate attack (random filler items are rated with the minimum rating, r_min, and the target item with r_max)
  • perfect knowledge attack (each new user profile matches a randomly selected existing profile, except for the target item, which receives r_max)
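
As a rough illustration of the random and average attacks listed above, the following Python sketch builds a single shilling profile. It is not the generator used to produce this dataset: the function name make_profile, the standard deviation of 1.0, and the rounding and clipping of ratings to the 1 to 5 MovieLens scale are all assumptions made for the example.

```python
import random

R_MIN, R_MAX = 1, 5  # MovieLens 1M uses whole-star ratings from 1 to 5


def make_profile(target_item, filler_items, item_means, global_mean,
                 attack="random", push=True, std=1.0, rng=None):
    """Return {item_id: rating} for one injected profile (illustrative only).

    attack="random":  filler ratings are centred on the global mean rating.
    attack="average": filler ratings are centred on each filler item's mean.
    std, rounding and clipping are assumptions, not taken from the dataset.
    """
    rng = rng or random.Random()
    profile = {}
    for item in filler_items:
        if item == target_item:
            continue  # the target is rated separately below
        centre = global_mean if attack == "random" else item_means[item]
        rating = round(rng.gauss(centre, std))
        profile[item] = min(R_MAX, max(R_MIN, rating))  # clip to valid range
    # Push attacks give the target r_max, nuke attacks give it r_min.
    profile[target_item] = R_MAX if push else R_MIN
    return profile
```

The bandwagon and reverse bandwagon variants would additionally rate a set of selected items with r_max or r_min before adding the filler ratings.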

 

The initial dataset (MovieLens 1M) was extended in each attack with the required number of votes by introducing new users, which are identified by an id above 100000 (the MovieLens dataset contains 6040 different reviewers). Each file therefore contains the original 1M votes plus the votes added for the injected users. Each injected user casts as many votes as the mean number of votes per user in the original dataset.
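
Because injected users are identified by their id, genuine and shilling ratings can be separated after loading a file. The sketch below assumes the ratings have already been parsed into (user_id, item_id, rating) tuples; the function name split_ratings and the strict "greater than 100000" comparison are assumptions based on the description above, and the on-disk field layout is not covered here.

```python
INJECTED_ID_THRESHOLD = 100000  # injected users have ids above this value


def split_ratings(ratings):
    """Split (user_id, item_id, rating) tuples into genuine and injected lists."""
    genuine, injected = [], []
    for row in ratings:
        user_id = row[0]
        if user_id > INJECTED_ID_THRESHOLD:
            injected.append(row)
        else:
            genuine.append(row)
    return genuine, injected
```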

The dataset is divided into four quartiles and tests are run for each of them, so the first level of directories below each attack corresponds to the quartile.

For each quartile, datasets were generated for 25 target items, and for each target item 11 files with a different number of injected users were generated. The first file contains no new users (so it is the original dataset) and the subsequent files contain from 50 to 500 new users in steps of 50.
In total, this makes 11 files per target item, 275 files per quartile and 1100 files per kind of attack.
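
The file counts follow directly from the figures given above; a short Python check (variable names are illustrative) makes the arithmetic explicit:

```python
# Number of injected users per file: 0 (original dataset), then 50..500 in steps of 50.
user_counts = list(range(0, 501, 50))

targets_per_quartile = 25
quartiles = 4

files_per_target = len(user_counts)                          # 11
files_per_quartile = targets_per_quartile * files_per_target # 275
files_per_attack = quartiles * files_per_quartile            # 1100
```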

The name of each file, which encodes all the corresponding information, is:

datasets/attacktype/attackname/quartile/m1lm_attacktype_attackname_quartile_it_targetitem_unumberofusers.dat



where:

  • datasets is the name of the main directory
  • attacktype may be nuke or push
  • attackname may be random, average, q1-bandawagon-random, q1-bandwagon-average, q4-reverse-bandwagon-random, q4-reverse-bandwagon-average, love-hate, perfect-knoledge
  • quartile may be q1, q2, q3 or q4
  • targetitem is the id for the target item in that dataset
  • numberofusers is the number of profiles introduced in the dataset for the specific attack
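
The naming pattern above can be turned into a small helper for locating a file. This is only a sketch: the function name dataset_path is hypothetical, the attack name is passed through unchanged exactly as it appears in the directory tree, and it is assumed that the target item id and the number of users are written as plain integers without zero padding.

```python
def dataset_path(attack_type, attack_name, quartile, target_item, n_users):
    """Build the relative path of one dataset file from the documented pattern."""
    if attack_type not in ("nuke", "push"):
        raise ValueError("attack_type must be 'nuke' or 'push'")
    if quartile not in ("q1", "q2", "q3", "q4"):
        raise ValueError("quartile must be one of q1, q2, q3, q4")
    directory = f"datasets/{attack_type}/{attack_name}/{quartile}"
    # Assumption: ids and user counts are plain integers, not zero padded.
    filename = (f"m1lm_{attack_type}_{attack_name}_{quartile}"
                f"_it_{target_item}_u{n_users}.dat")
    return f"{directory}/{filename}"
```

For example, dataset_path("push", "random", "q1", 1, 50) builds the path for a push random attack in quartile q1 with 50 injected profiles, where the target item id 1 is a placeholder rather than a known target in the dataset.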