Data set class

The DataSet class contains all the information needed to build our model. Each column represents a variable, and each row corresponds to a sample.

Throughout this tutorial, we will use the Iris data set to illustrate the most important methods of the DataSet class.

Sepal length   Sepal width   Petal length   Petal width   Iris flower
5.1            3.5           1.4            0.2           Setosa
4.9            3.0           1.4            0.2           Setosa
4.7            3.2           1.3            0.2           Setosa
4.6            3.1           1.5            0.2           Setosa
5.0            3.6           1.4            0.2           Setosa
5.4            3.9           1.7            0.4           Setosa

Iris data set

You can download the iris data set here.

The DataSet class offers a wide variety of constructors. The simplest and most common way to create a data set object is to use the default constructor, which creates an empty data set.

// Create an empty data set
Dataset data_set;

Once we have created the dataset object, the next step is to fill it with data:

// Load dataset from CSV file

data_set.set_data_path("../data/iris_plant.csv");
data_set.set_separator(Dataset::Separator::Space);
data_set.read_csv();
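
To make the role of the separator concrete, here is a minimal standalone sketch of how one space-separated line could be turned into numeric inputs and a categorical target. It does not use OpenNN; the names Row, split_line, and parse_row are hypothetical and only illustrate the idea, with the last field assumed to be the target.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one parsed sample (not an OpenNN type).
struct Row {
    std::vector<double> inputs;  // numeric fields
    std::string target;          // last field, kept as text
};

// Split a line on the given separator, skipping empty fields so that
// repeated spaces do not produce empty tokens.
std::vector<std::string> split_line(const std::string& line, char separator)
{
    std::vector<std::string> tokens;
    std::istringstream stream(line);
    std::string token;

    while (std::getline(stream, token, separator)) {
        if (!token.empty())
            tokens.push_back(token);
    }
    return tokens;
}

// Treat every field except the last as a numeric input and the last
// field as the target, mirroring the default roles described in the text.
Row parse_row(const std::string& line, char separator)
{
    const std::vector<std::string> tokens = split_line(line, separator);
    assert(tokens.size() >= 2);  // at least one input and one target

    Row row;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i)
        row.inputs.push_back(std::stod(tokens[i]));
    row.target = tokens.back();
    return row;
}
```

For example, parse_row("5.1 3.5 1.4 0.2 Setosa", ' ') yields four numeric inputs and the target "Setosa".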

By default, the last column is set as the target, and the remaining columns are set as inputs. Note that, in this case, the target column is categorical and contains three different classes (the excerpt above shows only Setosa).

Each sample in the data set can be assigned a use: training, selection, or testing. For example, the samples can be split randomly or sequentially as follows:

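// Split dataset into training, selection (validation), and testing sets.
// Normally only one of the two calls below is needed.

// Random split
data_set.split_samples_random(type(0.7), type(0.1), type(0.2));
// Sequential split
data_set.split_samples_sequential(type(0.7), type(0.1), type(0.2));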

The first number is the ratio of training samples, the second is the ratio of selection (validation) samples, and the last is the ratio of testing samples. By default, these values are set to 0.6, 0.2, and 0.2.
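
The difference between a sequential and a random split can be sketched in standard C++ without OpenNN. The SampleUse enum and the split_samples function below are hypothetical names chosen for illustration; the library's internal implementation may differ, for instance in how it rounds the counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical label for how a sample is used (not an OpenNN type).
enum class SampleUse { Training, Selection, Testing };

// Assign a use to each sample according to the three ratios.
// With shuffle == false the split is sequential; otherwise the sample
// indices are permuted first, using the given seed.
std::vector<SampleUse> split_samples(std::size_t samples_number,
                                     double training_ratio,
                                     double selection_ratio,
                                     double testing_ratio,
                                     bool shuffle,
                                     unsigned seed = 0)
{
    assert(training_ratio >= 0.0 && selection_ratio >= 0.0 && testing_ratio >= 0.0);
    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    // Normalize by the total so the ratios need not sum exactly to one.
    // Counts are truncated; leftover samples go to the testing set.
    const auto training_number =
        static_cast<std::size_t>(samples_number * training_ratio / total);
    const auto selection_number =
        static_cast<std::size_t>(samples_number * selection_ratio / total);

    std::vector<std::size_t> indices(samples_number);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    if (shuffle) {
        std::mt19937 generator(seed);
        std::shuffle(indices.begin(), indices.end(), generator);
    }

    std::vector<SampleUse> uses(samples_number, SampleUse::Testing);
    for (std::size_t i = 0; i < training_number; ++i)
        uses[indices[i]] = SampleUse::Training;
    for (std::size_t i = training_number; i < training_number + selection_number; ++i)
        uses[indices[i]] = SampleUse::Selection;

    return uses;
}
```

In the sequential case the first samples become training samples, the next block selection samples, and the rest testing samples. In the random case the same counts are kept but spread across the whole data set, which matters when the file is sorted by class, as Iris files often are.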

The DataSet class also implements some useful preprocessing methods. Below, we present some of them:

  • Scaling methods:

Scaling the data usually improves the neural network's training. The following method scales the input variables to the specified minimum and maximum values and returns, for each variable, a structure with its descriptives: minimum, maximum, mean, and standard deviation.

// Scale input variables and retrieve descriptives

const vector<Descriptives> inputs_descriptives =
    data_set.scale_variables("Input");

cout << inputs_descriptives << endl;

The scaled data are then fed to the neural network during training.
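
As an illustration of what minimum-maximum scaling and the returned descriptives involve, here is a minimal sketch in standard C++ for a single variable. ColumnDescriptives, describe, and scale_minimum_maximum are hypothetical names, and the choices made here (sample standard deviation, descriptives computed before scaling) are assumptions rather than a description of the library's behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical summary of one variable; OpenNN's Descriptives may differ.
struct ColumnDescriptives {
    double minimum;
    double maximum;
    double mean;
    double standard_deviation;
};

// Compute minimum, maximum, mean, and sample standard deviation.
ColumnDescriptives describe(const std::vector<double>& column)
{
    assert(!column.empty());

    const auto bounds = std::minmax_element(column.begin(), column.end());

    double sum = 0.0;
    for (double value : column)
        sum += value;
    const double mean = sum / static_cast<double>(column.size());

    double squared_deviations = 0.0;
    for (double value : column)
        squared_deviations += (value - mean) * (value - mean);

    // Assumption: sample standard deviation (divide by n - 1).
    const double standard_deviation = column.size() > 1
        ? std::sqrt(squared_deviations / static_cast<double>(column.size() - 1))
        : 0.0;

    return {*bounds.first, *bounds.second, mean, standard_deviation};
}

// Scale the column in place to [new_minimum, new_maximum] and return
// the descriptives of the unscaled values.
ColumnDescriptives scale_minimum_maximum(std::vector<double>& column,
                                         double new_minimum,
                                         double new_maximum)
{
    const ColumnDescriptives descriptives = describe(column);
    const double range = descriptives.maximum - descriptives.minimum;

    for (double& value : column) {
        // A constant column has no range; map it to the midpoint.
        value = range > 0.0
            ? new_minimum + (value - descriptives.minimum) * (new_maximum - new_minimum) / range
            : 0.5 * (new_minimum + new_maximum);
    }
    return descriptives;
}
```

Keeping the descriptives of the unscaled data is what makes it possible to apply the same transformation to new inputs later, or to undo it.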

  • Correlation methods:

Correlations show which input variables are most strongly related to the targets. The following method calculates the correlation between every input and every target.

// Analyze input–target correlations

const Tensor<Correlation, 2> input_target_correlations =
    data_set.calculate_input_target_raw_variable_pearson_correlations();

data_set.print_input_target_raw_variables_correlations();
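
For reference, the Pearson correlation between two numeric variables can be sketched in a few lines of standard C++. The function name pearson_correlation is hypothetical and the sketch covers only the numeric case; how the library handles categorical columns such as the Iris target is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two equally sized samples.
// Returns 0 when either variable is constant, since the coefficient
// is undefined in that case (a convention chosen for this sketch).
double pearson_correlation(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    assert(x.size() == y.size());
    assert(x.size() > 1);

    const double n = static_cast<double>(x.size());

    double mean_x = 0.0;
    double mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;

    double covariance = 0.0;
    double variance_x = 0.0;
    double variance_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;
        covariance += dx * dy;
        variance_x += dx * dx;
        variance_y += dy * dy;
    }

    const double denominator = std::sqrt(variance_x * variance_y);
    return denominator > 0.0 ? covariance / denominator : 0.0;
}
```

A value close to 1 or -1 indicates a strong linear relationship between the input and the target, while a value close to 0 indicates a weak one.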

For more information on the DataSet class, see the DataSet Class Reference.