The DataSet class contains all the information needed to build our model. Each column represents a particular variable, and each row corresponds to one sample.
Throughout this tutorial, we will use the Iris data set to give a notion of how to use the most important methods of the DataSet class.
Iris data set
You can download the Iris data set here.
The DataSet class offers a wide variety of constructors. The easiest and most common way to create a DataSet object is through the default constructor, which creates an empty data set.
Once we have created the DataSet object, the next step is to fill it with data:
By default, the last column is set as the target and the remaining columns as inputs. Note that, in this case, the last column contains three different categories. It is possible to set the name of each column by means of the DataSet class member called columns, as follows:
The name of the categories of the last column is set automatically during the loading process.
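To give an idea of what detecting category names while reading a column can involve, here is a minimal standalone C++ sketch. It does not use OpenNN: `collect_categories` and `encode_one_hot` are hypothetical helper names, and one-hot encoding is an assumed representation, not necessarily the one the library applies internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: distinct category names, in order of first appearance.
std::vector<std::string> collect_categories(const std::vector<std::string>& column)
{
    std::vector<std::string> categories;

    for (const std::string& value : column)
    {
        if (std::find(categories.begin(), categories.end(), value) == categories.end())
        {
            categories.push_back(value);
        }
    }

    return categories;
}

// Hypothetical helper: one row per sample, one binary column per category.
std::vector<std::vector<double>> encode_one_hot(const std::vector<std::string>& column,
                                                const std::vector<std::string>& categories)
{
    std::vector<std::vector<double>> encoded(column.size(),
                                             std::vector<double>(categories.size(), 0.0));

    for (std::size_t i = 0; i < column.size(); ++i)
    {
        const auto position = std::find(categories.begin(), categories.end(), column[i]);

        // Every value must belong to the list of known categories.
        assert(position != categories.end());

        encoded[i][static_cast<std::size_t>(position - categories.begin())] = 1.0;
    }

    return encoded;
}
```

The category labels passed to these helpers are whatever strings appear in the data file, so a column with three distinct labels yields three names and three binary columns.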
It is also possible to set the use of each instance of the data set. For example, the instances can be split randomly or sequentially as follows:
data_set.split_instances_random(0.7, 0.1, 0.2);
data_set.split_instances_sequential(0.7, 0.1, 0.2);
The first number is the ratio of training instances, the second is the ratio of selection instances, and the last one is the ratio of testing instances. By default, these values are set to 0.6, 0.2 and 0.2.
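The effect of these ratios can be illustrated with a small standalone C++ sketch. It is not OpenNN's implementation: the `InstanceUse` enumeration, the function names and the explicit `seed` parameter are assumptions made for illustration, and the rounding rule is one possible choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical labels for the use of each instance (not OpenNN's own types).
enum class InstanceUse { Training, Selection, Testing };

// The first block of instances is used for training, the next one for
// selection, and whatever remains for testing.
std::vector<InstanceUse> split_sequential(std::size_t instances_number,
                                          double training_ratio,
                                          double selection_ratio,
                                          double testing_ratio)
{
    assert(training_ratio >= 0.0 && selection_ratio >= 0.0 && testing_ratio >= 0.0);

    // Normalize so that the three ratios do not need to sum exactly to one.
    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    const double n = static_cast<double>(instances_number);
    std::size_t training_number =
        static_cast<std::size_t>(std::llround(n * training_ratio / total));
    std::size_t selection_number =
        static_cast<std::size_t>(std::llround(n * selection_ratio / total));

    // Guard against rounding pushing the counts past the number of instances.
    training_number = std::min(training_number, instances_number);
    selection_number = std::min(selection_number, instances_number - training_number);

    std::vector<InstanceUse> uses(instances_number, InstanceUse::Testing);

    for (std::size_t i = 0; i < instances_number; ++i)
    {
        if (i < training_number)
        {
            uses[i] = InstanceUse::Training;
        }
        else if (i < training_number + selection_number)
        {
            uses[i] = InstanceUse::Selection;
        }
    }

    return uses;
}

// Same counts as the sequential split, but the uses are shuffled.
std::vector<InstanceUse> split_random(std::size_t instances_number,
                                      double training_ratio,
                                      double selection_ratio,
                                      double testing_ratio,
                                      unsigned seed)
{
    std::vector<InstanceUse> uses =
        split_sequential(instances_number, training_ratio, selection_ratio, testing_ratio);

    std::mt19937 generator(seed);
    std::shuffle(uses.begin(), uses.end(), generator);

    return uses;
}
```

With the 150 instances of the Iris data set and the ratios 0.7, 0.1 and 0.2, this sketch assigns 105 instances to training, 15 to selection and 30 to testing. Only the order of the assignments differs between the two functions.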
Finally, the DataSet class implements some useful preprocessing methods. Some of them are presented below:
Scaling the data lets the neural network work under better conditions. The following method scales the input data between a minimum and a maximum value, and also returns a structure with information about the inputs: the maximum value, the minimum value, the mean and the standard deviation of each variable.
const Vector< Descriptives<double> > inputs_descriptives = data_set.scale_inputs_minimum_maximum();
cout << inputs_descriptives << endl;
These scaled data are introduced into the neural network in the training phase.
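For readers who want to see the arithmetic behind minimum-maximum scaling, the following standalone C++ sketch scales a single column and returns its statistics. It is independent of OpenNN: the `ColumnDescriptives` structure and the function names are hypothetical, the target range of [-1, 1] is an assumed default, and the standard deviation uses the sample formula (division by n - 1), which is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the statistics reported for each variable.
struct ColumnDescriptives
{
    double minimum;
    double maximum;
    double mean;
    double standard_deviation;
};

ColumnDescriptives calculate_descriptives(const std::vector<double>& column)
{
    assert(!column.empty());

    const auto range = std::minmax_element(column.begin(), column.end());

    double sum = 0.0;
    for (const double value : column)
    {
        sum += value;
    }
    const double mean = sum / static_cast<double>(column.size());

    double squared_deviations = 0.0;
    for (const double value : column)
    {
        squared_deviations += (value - mean) * (value - mean);
    }

    // Sample standard deviation; defined as zero for a single value.
    const double standard_deviation = column.size() > 1
        ? std::sqrt(squared_deviations / static_cast<double>(column.size() - 1))
        : 0.0;

    return ColumnDescriptives{*range.first, *range.second, mean, standard_deviation};
}

// Scales the column in place to [new_minimum, new_maximum] and returns the
// statistics of the original, unscaled values.
ColumnDescriptives scale_minimum_maximum(std::vector<double>& column,
                                         double new_minimum = -1.0,
                                         double new_maximum = 1.0)
{
    const ColumnDescriptives descriptives = calculate_descriptives(column);
    const double range = descriptives.maximum - descriptives.minimum;

    for (double& value : column)
    {
        // A constant column has no range, so it is mapped to the lower bound.
        value = (range == 0.0)
            ? new_minimum
            : new_minimum + (value - descriptives.minimum) * (new_maximum - new_minimum) / range;
    }

    return descriptives;
}
```

Returning the statistics of the unscaled values matters because the same minimum and maximum are needed later to scale new inputs consistently or to undo the transformation.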
Correlations give you better knowledge about which variables are most strongly related to the targets. This method calculates the correlations between all targets and all inputs.
const Matrix<CorrelationResults> input_target_columns_correlations = data_set.calculate_input_target_columns_correlations();
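As an illustration of what a single input-target correlation measures, here is a minimal standalone C++ sketch of the Pearson linear correlation coefficient. This is an assumption about the kind of correlation involved: the library may use other correlation types depending on the variables, and `linear_correlation` is a hypothetical name, not part of OpenNN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson linear correlation between one input column and one target column.
// Returns a value in [-1, 1]; by convention here, 0 if either column is constant.
double linear_correlation(const std::vector<double>& input,
                          const std::vector<double>& target)
{
    assert(input.size() == target.size());
    assert(!input.empty());

    const double n = static_cast<double>(input.size());

    double input_mean = 0.0;
    double target_mean = 0.0;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        input_mean += input[i];
        target_mean += target[i];
    }
    input_mean /= n;
    target_mean /= n;

    double covariance_sum = 0.0;
    double input_squares = 0.0;
    double target_squares = 0.0;
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const double input_deviation = input[i] - input_mean;
        const double target_deviation = target[i] - target_mean;

        covariance_sum += input_deviation * target_deviation;
        input_squares += input_deviation * input_deviation;
        target_squares += target_deviation * target_deviation;
    }

    const double denominator = std::sqrt(input_squares * target_squares);

    return denominator == 0.0 ? 0.0 : covariance_sum / denominator;
}
```

Values close to 1 or -1 indicate a strong linear relation between the input and the target, while values close to 0 indicate a weak one.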
For more information about the DataSet class, see the DataSet Class Reference.