Abstract

Register allocation is an important phase in compiler optimization. Often, its resolution involves graph coloring, which is an NP-complete problem. Because of their significance, numerous heuristics have been suggested for their resolution. Heuristic development is a complex process that requires specialized domain expertise. Recently, several machine learning based approaches have been proposed to solve compiler optimization problems. Still, due to the complexity of the problem and the lack of specialized datasets for training models applied to register allocation, few works researched this topic. The deficiency in sufficient adequate test cases is a recurring issue when working with register allocation, even beyond the scope of machine learning applications. In an effort to address this problem and facilitate forthcoming research in this domain, we present this dataset for training Machine Learning models focused on register allocation. The dataset consists of PBQP interference graphs from real C/C++ codes, and it is easily adaptable to various register allocation techniques.

Instructions:

Each JSON file contains an object in which the keys are the names of functions, and the values are their respective graphs. The graphs consist of JSON objects separated into vertices (nodes) and edges. The value related to the key nodes is an object containing all the vertices, each associated with the name of the virtual register linked to a variable from LLVM intermediate code representation. The vertices have the following keys and values:

type: indicates the type of value stored in the register;

uses : an array with a size equal to the number of uses of the virtual register. Each value indicates a line where the variable is used;

uses deepness: an array with a size equal to the size of uses. The values in the array refer to the depth in loop structures of each virtual register usage. For example, if a value is not inside any loop, its depth value will be zero. If it is inside one loop, its depth will be one, if inside two nested loops, it will be two, and so on;

cost array: an array of cost for each allocation option. It provides the costs of associating the virtual register with other virtual registers, along with the cost of spilling the value. The graph generation program offers an option to omit the cost array from the output file. This functionality reduces the memory space occupied by the dataset, as the cost array can be added independently from the code. Additionally, it allows generating graphs of common interference.

The value related to the key edges is an array of edges. Each edge contains attributes node 1 and node 2, which indicate the interference represented by the edge. Note that the order of vertices does not matter. Moreover, the edges also include the cost matrix as a third attribute. Like the cost array, it is possible to create graphs without the cost matrix in the edges. Below is an example of a output file.

Funding Agency:

capes

Grant Number:

88887.901379/2023-00

Dataset Files

basic_graphs.zip (5.56 MB)
codenet_graphs.zip (43.18 MB)
speccpu_graphs.zip (13.35 MB)
basic_codes.zip (3.65 MB)
speccpu_codes.zip (6.73 MB)
codenet_codes.zip (29.08 MB)
Inteference-Graph-Generator.zip (8.60 kB)

Datasets

Standard Dataset

Interference graph dataset for machine learning based register allocation

Abstract

Dataset Files

QUESTIONS?