Text Mining

Biographies of literature writers

The biographies_EN dataset contains 1000 biographies of literature writers retrieved from the English version of Wikipedia.

Categories:: Machine Learning

93 Views

Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people

Several fields of study can benefit from a large, structured, and accurate dataset of historical figures. Because no such dataset exists, in this paper we aim to use machine learning and text mining models to collect, predict, and cleanse online data, with a focus on age and gender. We developed a five-step method and inferred birth and death years, binary gender, and occupation from community-submitted data across all language versions of the Wikipedia project.

Categories:: Social Sciences
Demographic
Age
Health

831 Views

Korean stock trading app review dataset

This dataset contains Android app user reviews crawled from https://play.google.com/store/apps between 2022/4/2 and 2022/4/14. User reviews of 24 Korean trading apps were collected from Google Play Store, for a total of 41,705 reviews. The app name, user ID, review content, rating, and date were collected for each review by web crawling. The entire dataset is in Korean.

Categories:: Artificial Intelligence

146 Views

Job-Skills

This dataset contains jobs and their associated skills extracted from job advertisements.

Categories:: Artificial Intelligence

1733 Views

RetroRevMatchEvalICIP16: A retrospective reviewer matching dataset and evaluation for IEEE ICIP 2016

The "RetroRevMatchEvalICIP16" dataset provides a retrospective reviewer recommendation dataset and evaluation for IEEE ICIP 2016. The methodology via which the recommendations were obtained and the evaluation was performed is described in the associated paper.

Y. Zhao, A. Anand, and G. Sharma, “Reviewer recommendations using document vector embeddings and a publisher database: Implementation and evaluation,” IEEE Access, vol. 10, pp. 21798–21811, 2022. https://doi.org/10.1109/ACCESS.2022.3151640

Categories:: Artificial Intelligence
Machine Learning

258 Views

USA Nov.2020 Election 20 Mil. Tweets (with Sentiment and Party Name Labels) Dataset

This dataset includes 24,201,654 tweets related to the US Presidential Election on November 3, 2020, collected between July 1, 2020, and November 11, 2020. Each tweet is labeled with the related party name and a sentiment score, along with the words that affect the score.

Categories:: Artificial Intelligence
Machine Learning
Social Sciences

6448 Views