Data set class

The DataSet class contains all the information needed to build a model. Each column represents a variable, and each row corresponds to one sample.

Throughout this tutorial, we will use the Iris data set to illustrate the most important methods of the DataSet class.

Sepal length  Sepal width  Petal length  Petal width  Iris flower
5.1           3.5          1.4           0.2          Setosa
4.9           3.0          1.4           0.2          Setosa
4.7           3.2          1.3           0.2          Setosa
4.6           3.1          1.5           0.2          Setosa
5.0           3.6          1.4           0.2          Setosa
5.4           3.9          1.7           0.4          Setosa

Iris data set

You can download the Iris data set here.

The DataSet class offers a wide variety of constructors. The easiest and most common way to create a DataSet object is to use the default constructor, which creates an empty data set.

DataSet data_set;

Once the DataSet object has been created, the next step is to fill it with data:

data_set.set_data_file_name("../data/iris_plant.csv");
data_set.set_separator("Space");
data_set.read_csv();
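To make the expected file layout concrete, the following is a minimal standalone sketch of how one space-separated row of the Iris file could be parsed. It does not use OpenNN and is not how read_csv is implemented; IrisSample and parse_iris_line are hypothetical names introduced only for this illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one row of the Iris file; not an OpenNN type.
struct IrisSample
{
    std::vector<double> measurements; // sepal length, sepal width, petal length, petal width
    std::string category;             // for example "Setosa"
};

// Splits one space-separated line into four numbers followed by a category.
// Returns false if the line does not have that shape.
bool parse_iris_line(const std::string& line, IrisSample& sample)
{
    std::istringstream stream(line);

    sample.measurements.assign(4, 0.0);

    for(double& value : sample.measurements)
    {
        if(!(stream >> value)) return false;
    }

    // The last token is kept as text, since it names a category.
    return static_cast<bool>(stream >> sample.category);
}
```

Calling parse_iris_line("5.1 3.5 1.4 0.2 Setosa", sample) fills four measurements and the category "Setosa", while a line with a non-numeric value in one of the first four positions is rejected.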

By default, the last column is set as the target and the remaining columns as inputs. Note that, in this case, the last column contains three different categories. The name of each column can be set through the set_column_name method of the DataSet class as follows:

data_set.set_column_name(0, "sepal_length");
data_set.set_column_name(1, "sepal_width");
data_set.set_column_name(2, "petal_length");
data_set.set_column_name(3, "petal_width");
data_set.set_column_name(4, "iris_type");

The names of the categories in the last column are set automatically during the loading process.
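As an illustration of what handling a categorical column involves, the sketch below collects the distinct category names of a column and encodes each sample as a one-hot vector. This is a standalone example under assumptions: the function names are hypothetical, and ordering categories by first appearance is a choice made here, not necessarily what OpenNN does when loading the file.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the distinct labels in order of first appearance (an assumption of this sketch).
std::vector<std::string> collect_categories(const std::vector<std::string>& labels)
{
    std::vector<std::string> categories;

    for(const std::string& label : labels)
    {
        if(std::find(categories.begin(), categories.end(), label) == categories.end())
        {
            categories.push_back(label);
        }
    }

    return categories;
}

// One row per sample and one column per category, with 1.0 in the matching column.
// Labels that are not in the category list produce a row of zeros.
std::vector<std::vector<double>> one_hot_encode(const std::vector<std::string>& labels,
                                                const std::vector<std::string>& categories)
{
    std::vector<std::vector<double>> encoded(labels.size(),
                                             std::vector<double>(categories.size(), 0.0));

    for(std::size_t i = 0; i < labels.size(); i++)
    {
        const auto position = std::find(categories.begin(), categories.end(), labels[i]);

        if(position != categories.end())
        {
            encoded[i][static_cast<std::size_t>(position - categories.begin())] = 1.0;
        }
    }

    return encoded;
}
```

With the three Iris species, each sample becomes a vector of three values in which exactly one entry is 1.0.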

It is also possible to set the use of each sample in the data set. For example, the samples can be split randomly or sequentially as follows:

data_set.split_samples_random(type(0.7), type(0.1), type(0.2));
data_set.split_samples_sequential(type(0.7), type(0.1), type(0.2));

The first number is the ratio of training samples, the second is the ratio of selection samples, and the third is the ratio of testing samples. By default, these values are set to 0.6, 0.2 and 0.2.
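The effect of the three ratios can be sketched without OpenNN as follows. This is a simplified illustration, not the library's implementation: SampleUse and split_samples are hypothetical names, the ratios are normalized by their sum, and any rounding remainder is given to the training set, all of which are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical labels for how a sample is used; these are not OpenNN types.
enum class SampleUse { Training, Selection, Testing };

// Assigns a use to each sample from the three ratios.
// With shuffle == false the split is sequential; otherwise the sample order is randomized.
std::vector<SampleUse> split_samples(std::size_t samples_number,
                                     double training_ratio,
                                     double selection_ratio,
                                     double testing_ratio,
                                     bool shuffle,
                                     unsigned seed = 0)
{
    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    // Normalize by the total so the ratios need not sum exactly to one.
    const std::size_t selection_number = static_cast<std::size_t>(samples_number * selection_ratio / total);
    const std::size_t testing_number = static_cast<std::size_t>(samples_number * testing_ratio / total);
    const std::size_t training_number = samples_number - selection_number - testing_number;

    std::vector<std::size_t> indices(samples_number);
    std::iota(indices.begin(), indices.end(), 0);

    if(shuffle)
    {
        std::mt19937 generator(seed);
        std::shuffle(indices.begin(), indices.end(), generator);
    }

    // The first block of indices trains, the next block selects, the rest tests.
    std::vector<SampleUse> uses(samples_number, SampleUse::Testing);

    for(std::size_t i = 0; i < samples_number; i++)
    {
        if(i < training_number) uses[indices[i]] = SampleUse::Training;
        else if(i < training_number + selection_number) uses[indices[i]] = SampleUse::Selection;
    }

    return uses;
}
```

For 150 samples and ratios 0.7, 0.1 and 0.2, this yields 105 training, 15 selection and 30 testing samples. The sequential variant keeps them in file order, while the random variant spreads them across the data set.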

Finally, the DataSet class implements some useful preprocessing methods. Below we present some of them:

  • Scaling methods:

Scaling the data lets the neural network train under better conditions. The following method scales the input data between a minimum and a maximum value and returns a structure with the descriptives of each input variable: the minimum, the maximum, the mean and the standard deviation.

const Tensor<Descriptives, 1> inputs_descriptives = data_set.scale_input_variables();
cout << inputs_descriptives << endl;

The scaled data are then fed to the neural network during the training phase.
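For readers who want to see the arithmetic, here is a minimal standalone sketch of minimum-maximum scaling for a single column. It is an illustration under assumptions rather than OpenNN's code: ColumnDescriptives is a hypothetical stand-in for the library's Descriptives structure, the target range is passed explicitly, and the standard deviation uses the sample formula with n - 1 in the denominator.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical summary of one variable; OpenNN's own Descriptives type may differ.
struct ColumnDescriptives
{
    double minimum;
    double maximum;
    double mean;
    double standard_deviation;
};

ColumnDescriptives calculate_descriptives(const std::vector<double>& column)
{
    assert(!column.empty());

    const auto extremes = std::minmax_element(column.begin(), column.end());

    double sum = 0.0;
    for(const double value : column) sum += value;
    const double mean = sum / column.size();

    double squared_deviations = 0.0;
    for(const double value : column) squared_deviations += (value - mean) * (value - mean);

    // Sample standard deviation; defined as zero when there is a single value.
    const double standard_deviation = column.size() > 1
        ? std::sqrt(squared_deviations / (column.size() - 1))
        : 0.0;

    return {*extremes.first, *extremes.second, mean, standard_deviation};
}

// Rescales the column in place to [new_minimum, new_maximum] and returns the
// descriptives computed before scaling.
ColumnDescriptives scale_minimum_maximum(std::vector<double>& column,
                                         double new_minimum,
                                         double new_maximum)
{
    const ColumnDescriptives descriptives = calculate_descriptives(column);
    const double range = descriptives.maximum - descriptives.minimum;

    for(double& value : column)
    {
        // A constant column has zero range; map it to the midpoint instead of dividing by zero.
        value = range > 0.0
            ? new_minimum + (value - descriptives.minimum) * (new_maximum - new_minimum) / range
            : 0.5 * (new_minimum + new_maximum);
    }

    return descriptives;
}
```

Applied to the six sepal lengths in the table above with a target range of -1 to 1, the smallest value (4.6) maps to -1 and the largest (5.4) maps to 1. Returning the descriptives matters because the same minimum and maximum are needed later to scale new inputs or to unscale outputs.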

  • Correlation methods:

Correlations indicate which input variables are most strongly related to the targets. The following method calculates the correlations between all targets and all inputs.

const Tensor<Correlation, 2> input_target_columns_correlations = data_set.calculate_input_target_columns_correlations();
data_set.print_input_target_columns_correlations();
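To give an idea of what an input-target correlation matrix contains, the sketch below computes the Pearson linear correlation between every input column and every target column. It is a standalone illustration with hypothetical function names. The library may use other correlation types, particularly for categorical targets, so treating every column as numeric is an assumption of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson linear correlation coefficient between two equally sized columns.
// Returns 0 when either column is constant, where the coefficient is undefined.
double linear_correlation(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size() && !x.empty());

    const double n = static_cast<double>(x.size());

    double mean_x = 0.0;
    double mean_y = 0.0;

    for(std::size_t i = 0; i < x.size(); i++)
    {
        mean_x += x[i];
        mean_y += y[i];
    }

    mean_x /= n;
    mean_y /= n;

    double covariance = 0.0;
    double variance_x = 0.0;
    double variance_y = 0.0;

    for(std::size_t i = 0; i < x.size(); i++)
    {
        const double dx = x[i] - mean_x;
        const double dy = y[i] - mean_y;

        covariance += dx * dy;
        variance_x += dx * dx;
        variance_y += dy * dy;
    }

    const double denominator = std::sqrt(variance_x * variance_y);

    return denominator > 0.0 ? covariance / denominator : 0.0;
}

// result[i][j] is the correlation between input column i and target column j.
std::vector<std::vector<double>> input_target_correlations(
    const std::vector<std::vector<double>>& inputs,
    const std::vector<std::vector<double>>& targets)
{
    std::vector<std::vector<double>> result(inputs.size(),
                                            std::vector<double>(targets.size(), 0.0));

    for(std::size_t i = 0; i < inputs.size(); i++)
    {
        for(std::size_t j = 0; j < targets.size(); j++)
        {
            result[i][j] = linear_correlation(inputs[i], targets[j]);
        }
    }

    return result;
}
```

Values close to 1 or -1 indicate a strong linear relation between an input and a target, while values close to 0 indicate a weak one.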

If you need more information about the DataSet class, visit the DataSet Class Reference.