WebRTC-QoE: A Dataset of Quality of Experience in Audio-Video Communications

Citation Author(s):
Gülnaziye Bingöl, DIEE, University of Cagliari, Italy
Luigi Serreli, DIEE, University of Cagliari, Italy
Simone Porcu, DIEE, University of Cagliari, Italy
Alessandro Floris, DIEE, University of Cagliari, Italy
Luigi Atzori, DIEE, University of Cagliari, Italy
Submitted by:
Alessandro Floris
Last updated:
Fri, 03/29/2024 - 11:23
DOI:
10.21227/mb47-hf44

Abstract 

In the realm of real-time communications, WebRTC-based multimedia applications are increasingly prevalent because they can be smoothly integrated within Web browsing sessions. The browsing experience is thus significantly improved compared with scenarios where browser add-ons and/or plug-ins are required; still, the end user's Quality of Experience (QoE) in WebRTC sessions may be affected by network impairments, such as delays and losses. Because user perception varies across communication scenarios, understanding and enhancing the resulting service quality is a complex endeavour. To address this, we present a dataset that provides a comprehensive perspective on the conversational quality of a two-party WebRTC-based audiovisual telemeeting service. The dataset was gathered through subjective evaluations involving 20 subjects across 15 different test conditions (TCs). A dedicated system was developed to induce controlled network disruptions, namely delay, jitter, and packet loss, which degraded the communication between the parties; this methodology offered insight into user perception under various network impairments. The dataset encompasses a blend of objective and subjective data, including Absolute Category Rating (ACR) subjective scores, webrtc-internals parameters, facial expression features, and speech features. Consequently, it serves as a substantial contribution to the improvement of WebRTC-based video call systems, offering practical, real-world data that can drive the development of more robust and efficient multimedia communication systems, thereby enhancing the user's experience.

The published article is available here:

https://www.sciencedirect.com/science/article/pii/S1389128624001889

 

 

Instructions: 

In the following, we describe each of the provided dataset files in detail.

 

Subjective_results_dataset.csv: This dataset contains the subjective evaluation results of 20 subjects (users) who assessed the quality of WebRTC-based video calls under 15 distinct test conditions (TCs), i.e., combinations of 3 network impairments (delay, jitter, packet loss) injected to disturb the communication. Users rated the perceived QoE on the single discrete Absolute Category Rating (ACR) scale with five category labels (1-Bad, 2-Poor, 3-Fair, 4-Good, and 5-Excellent). A total of 300 ACR scores were obtained (20 participants x 15 TCs). The size of this dataset is 4.00 KB. A minimal loading sketch follows the column descriptions below.

The significance of each column is explained as follows:

  • Test Condition (TC): It enumerates the TC numbers, which span from 1 to 15.
  • Delay [ms]: It refers to the time it takes for a signal to travel from one point to another and is represented at three levels: 0 ms (no delay), 500 ms (moderate delay), and 1000 ms (significant delay).
  • Jitter [ms]: It refers to the variability in delay and is represented at two levels: 0 ms (no jitter) and 500 ms (moderate jitter).
  • Packet Loss Rate [%]: It refers to the loss of data packets during transmission and is represented at three levels: 0% (no packet loss), 15% (moderate packet loss), and 30% (significant packet loss).
  • User: the identifier (1 to 20) of the user who participated in the 15 video calls and rated their quality on a scale from 1 to 5.
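
The following minimal Python/pandas sketch loads the file and computes a Mean Opinion Score (MOS) per TC. The column names used below ("Test Condition (TC)", "ACR") and the one-row-per-rating layout are assumptions based on the descriptions above, not guaranteed by the file; inspect the actual header first and adjust.

    import pandas as pd

    # Load the subjective scores and inspect the actual header, since the
    # column names used below are assumptions.
    df = pd.read_csv("Subjective_results_dataset.csv")
    print(df.columns.tolist())

    # Hypothetical long layout: one row per (TC, user), rating in a column "ACR".
    mos = df.groupby("Test Condition (TC)")["ACR"].mean()
    print(mos)  # one MOS per TC, averaged over the 20 users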

  

Webrtc_internals_dataset.zip: This dataset contains the text files collected with the webrtc-internals tool during the video calls. The zip is organized into 20 distinct folders, labelled from ‘User1’ to ‘User20’. Each user folder contains 15 text files (.txt) named following the pattern ‘webrtc_internals_dump-TCx_y-z-t.txt’, where ‘x’ denotes the TC number (1 to 15) and the ‘y-z-t’ segment encodes the impairments of that TC: ‘y’ is the delay value (0, 500, or 1000), ‘z’ is the jitter value (0 or 500), and ‘t’ is the packet loss rate (0, 15, or 30). Each text file contains application-level statistics of the WebRTC session in JSON format. A total of 300 webrtc-internals dump text files were obtained (20 participants x 15 TCs). The size of this zip dataset is 28.7 MB (396 MB uncompressed).
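
As an example of how these dumps can be processed, the Python sketch below loads one dump and extracts the time series of a statistic. The ‘PeerConnections’/‘stats’ nesting and the JSON-encoded ‘values’ field follow Chrome's usual webrtc-internals export layout, which may vary across browser versions; the file path and the ‘packetsLost’ filter are illustrative.

    import json

    # Illustrative path; substitute an actual user folder and TC file name.
    path = "User1/webrtc_internals_dump-TC1_0-0-0.txt"
    with open(path) as f:
        dump = json.load(f)

    # Chrome's export typically nests per-connection statistics under
    # "PeerConnections" -> <connection id> -> "stats"; each entry stores its
    # samples as a JSON-encoded list in the "values" field.
    for pc_id, pc in dump.get("PeerConnections", {}).items():
        for stat_key, stat in pc.get("stats", {}).items():
            if "packetsLost" in stat_key:
                samples = json.loads(stat["values"])
                print(pc_id, stat_key, samples[-5:])  # last few samples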

 

Facial_expression_features_dataset.zip: This dataset contains the facial expression features extracted with the OpenFace toolkit from the face images of the recorded videos. The zip is organized into 20 distinct folders, labelled from ‘User1’ to ‘User20’. Each user folder contains 15 .csv files named following the pattern ‘TCx_y-z-t.csv’, where ‘x’ denotes the TC number (1 to 15) and the ‘y-z-t’ segment encodes the impairments of that TC: ‘y’ is the delay value (0, 500, or 1000), ‘z’ is the jitter value (0 or 500), and ‘t’ is the packet loss rate (0, 15, or 30). A total of 300 facial expression feature files in .csv format were obtained (20 participants x 15 TCs). For each face image of each TC, OpenFace outputs 6 gaze direction features, 280 eye region landmarks, and 35 Action Units (AUs). The size of this zip dataset is 547 MB (2.14 GB uncompressed). A loading sketch follows the column descriptions below.

Each column carries a specific significance, which is elaborated as follows:

  • frame: the frame number in the context of sequences.
  • face_id: the identifier assigned to each face when multiple faces are present.
  • timestamp: the elapsed time in seconds during the processing of a video sequence.
  • confidence: the level of confidence the tracker has in the current landmark detection estimate.
  • success: whether a face was successfully detected and tracked in the frame (1) or not (0).
  • gaze_0_x, gaze_0_y, gaze_0_z: the normalized eye gaze direction vector in world coordinates for eye 0, which is the eye on the left in the image.
  • gaze_1_x, gaze_1_y, gaze_1_z: the normalized eye gaze direction vector in world coordinates for eye 1, which is the eye on the right in the image.
  • gaze_angle_x, gaze_angle_y: the eye gaze direction averaged over both eyes, expressed in radians in world coordinates; a more convenient format than the per-eye gaze vectors.
  • eye_lmk_x_0, eye_lmk_x_1, ..., eye_lmk_x_55, eye_lmk_y_0, ..., eye_lmk_y_55: the pixel coordinates of the 2D landmarks in the eye region.
  • eye_lmk_X_0, eye_lmk_X_1, ..., eye_lmk_X_55, eye_lmk_Y_0, ..., eye_lmk_Z_55: the position of the eye-region landmarks in 3D space, measured in millimeters.
  • 17 AUr: the activation intensity of a particular facial action unit, on a continuous scale from 0 (absent) to 5 (maximum intensity). These are: AU01_r, AU02_r, AU04_r, AU05_r, AU06_r, AU07_r, AU09_r, AU10_r, AU12_r, AU14_r, AU15_r, AU17_r, AU20_r, AU23_r, AU25_r, AU26_r, AU45_r.
  • 18 AUc: the presence of a particular facial action unit (0 for absent, 1 for present). These are: AU01_c, AU02_c, AU04_c, AU05_c, AU06_c, AU07_c, AU09_c, AU10_c, AU12_c, AU14_c, AU15_c, AU17_c, AU20_c, AU23_c, AU25_c, AU26_c, AU28_c, AU45_c.
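
A minimal Python/pandas sketch for aggregating these per-frame features into per-call descriptors is shown below. The file path is illustrative, the 0.8 confidence threshold is an arbitrary choice for this example, and the header-stripping step accounts for the leading spaces OpenFace commonly writes in its column names.

    import pandas as pd

    # Illustrative path; substitute an actual user folder and TC file name.
    df = pd.read_csv("User1/TC1_0-0-0.csv")
    df.columns = df.columns.str.strip()  # OpenFace headers often carry leading spaces

    # Keep only frames where the face was tracked reliably; the 0.8 confidence
    # threshold is an arbitrary illustrative choice.
    ok = df[(df["success"] == 1) & (df["confidence"] > 0.8)]

    # Average each AU intensity (AUxx_r) over the call as a per-call descriptor.
    au_r_cols = [c for c in ok.columns if c.startswith("AU") and c.endswith("_r")]
    print(ok[au_r_cols].mean())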

 

Speech_features_dataset.csv: This dataset contains the speech features extracted from the recorded audio files with the openSMILE toolkit, augmented with the Absolute Category Rating (ACR) score assigned by each subject for every TC. In total, it comprises 1,911,900 speech feature values (15 TCs x 20 subjects x 6373 speech features per subject). The total size of this dataset is 18.9 MB. A loading sketch follows the column descriptions below.

Each column carries a specific significance, which is elaborated as follows:

  • file: the name of the audio file from which the speech features were extracted.
  • Users: the identifier (1 to 20) of the user who participated in the 15 video calls and rated their quality on a scale from 1 to 5.
  • Test Condition (TC): It enumerates the Test Condition numbers, which span from 1 to 15.
  • Delay [ms]: It refers to the time it takes for a signal to travel from one point to another and is represented at three levels: 0 ms (no delay), 500 ms (moderate delay), and 1000 ms (significant delay).
  • Jitter [ms]: It refers to the variability in delay and is represented at two levels: 0 ms (no jitter) and 500 ms (moderate jitter).
  • Packet Loss Rate [%]: It refers to the loss of data packets during transmission and is represented at three levels: 0% (no packet loss), 15% (moderate packet loss), and 30% (significant packet loss).
  • openSMILE Speech Feature Columns: the 6373 distinct speech features extracted for each user under each TC from the processed .wav audio file; in the CSV, they occupy the columns from G to IKI.
  • ACR Score: The Absolute Category Rating (ACR) scale ranges from 1, representing the lowest quality, to 5, indicating the highest quality evaluated by the subjects.
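
The Python sketch below shows one way to split the file into a feature matrix and ACR targets, e.g. for QoE-model training. The metadata and "ACR Score" column names are taken from the descriptions above but should be verified against the actual header; the 6373 features match the size of openSMILE's ComParE 2016 feature set.

    import pandas as pd

    df = pd.read_csv("Speech_features_dataset.csv")

    # Metadata columns as described above; verify against the actual header.
    meta = ["file", "Users", "Test Condition (TC)", "Delay [ms]",
            "Jitter [ms]", "Packet Loss Rate [%]"]
    X = df.drop(columns=meta + ["ACR Score"])  # the 6373 openSMILE features
    y = df["ACR Score"]

    # Example: rank features by absolute linear correlation with the ACR score.
    corr = X.corrwith(y).abs().sort_values(ascending=False)
    print(corr.head(10))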

 

If you make use of this dataset, please consider citing the following publication:

Bingol, G., Porcu, S., Floris, A., & Atzori, L. (2024). WebRTC-QoE: A dataset of QoE assessment of subjective scores, network impairments, and facial & speech features. Computer Networks, 244, 110356.

BibTex format:

@article{bingol2024datasetwebrtc,
  title     = {WebRTC-QoE: A dataset of QoE assessment of subjective scores, network impairments, and facial \& speech features},
  author    = {Bingol, Gulnaziye and Porcu, Simone and Floris, Alessandro and Atzori, Luigi},
  journal   = {Computer Networks},
  volume    = {244},
  pages     = {110356},
  year      = {2024},
  publisher = {Elsevier},
  doi       = {10.1016/j.comnet.2024.110356}
}

 

Funding Agency: 
PON “Ricerca e Innovazione” 2014-2020 (PON R&I)
Grant Number: 
"Azione IV.4 Dottorati e contratti di ricerca su tematiche dell’innovazione” with D.M. 1062 on 10.08.2021