Data set class
The DataSet class contains all the information needed to build our model. Every column represents a particular variable, and each row corresponds to one sample.
Throughout this tutorial, we will use the Iris data set to illustrate the most important methods of the DataSet class.
Sepal length | Sepal width | Petal length | Petal width | Iris flower |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Setosa |
4.9 | 3.0 | 1.4 | 0.2 | Setosa |
4.7 | 3.2 | 1.3 | 0.2 | Setosa |
4.6 | 3.1 | 1.5 | 0.2 | Setosa |
5.0 | 3.6 | 1.4 | 0.2 | Setosa |
5.4 | 3.9 | 1.7 | 0.4 | Setosa |
4.6 | 3.4 | 1.4 | 0.3 | Setosa |
5.0 | 3.4 | 1.5 | 0.2 | Setosa |
4.4 | 2.9 | 1.4 | 0.2 | Setosa |
4.9 | 3.1 | 1.5 | 0.1 | Setosa |
5.4 | 3.7 | 1.5 | 0.2 | Setosa |
4.8 | 3.4 | 1.6 | 0.2 | Setosa |
4.8 | 3.0 | 1.4 | 0.1 | Setosa |
4.3 | 3.0 | 1.1 | 0.1 | Setosa |
5.8 | 4.0 | 1.2 | 0.2 | Setosa |
5.7 | 4.4 | 1.5 | 0.4 | Setosa |
5.4 | 3.9 | 1.3 | 0.4 | Setosa |
5.1 | 3.5 | 1.4 | 0.3 | Setosa |
5.7 | 3.8 | 1.7 | 0.3 | Setosa |
5.1 | 3.8 | 1.5 | 0.3 | Setosa |
5.4 | 3.4 | 1.7 | 0.2 | Setosa |
5.1 | 3.7 | 1.5 | 0.4 | Setosa |
4.6 | 3.6 | 1.0 | 0.2 | Setosa |
5.1 | 3.3 | 1.7 | 0.5 | Setosa |
4.8 | 3.4 | 1.9 | 0.2 | Setosa |
5.0 | 3.0 | 1.6 | 0.2 | Setosa |
5.0 | 3.4 | 1.6 | 0.4 | Setosa |
5.2 | 3.5 | 1.5 | 0.2 | Setosa |
5.2 | 3.4 | 1.4 | 0.2 | Setosa |
4.7 | 3.2 | 1.6 | 0.2 | Setosa |
4.8 | 3.1 | 1.6 | 0.2 | Setosa |
5.4 | 3.4 | 1.5 | 0.4 | Setosa |
5.2 | 4.1 | 1.5 | 0.1 | Setosa |
5.5 | 4.2 | 1.4 | 0.2 | Setosa |
4.9 | 3.1 | 1.5 | 0.2 | Setosa |
5.0 | 3.2 | 1.2 | 0.2 | Setosa |
5.5 | 3.5 | 1.3 | 0.2 | Setosa |
4.9 | 3.6 | 1.4 | 0.1 | Setosa |
4.4 | 3.0 | 1.3 | 0.2 | Setosa |
5.1 | 3.4 | 1.5 | 0.2 | Setosa |
5.0 | 3.5 | 1.3 | 0.3 | Setosa |
4.5 | 2.3 | 1.3 | 0.3 | Setosa |
4.4 | 3.2 | 1.3 | 0.2 | Setosa |
5.0 | 3.5 | 1.6 | 0.6 | Setosa |
5.1 | 3.8 | 1.9 | 0.4 | Setosa |
4.8 | 3.0 | 1.4 | 0.3 | Setosa |
5.1 | 3.8 | 1.6 | 0.2 | Setosa |
4.6 | 3.2 | 1.4 | 0.2 | Setosa |
5.3 | 3.7 | 1.5 | 0.2 | Setosa |
5.0 | 3.3 | 1.4 | 0.2 | Setosa |
7.0 | 3.2 | 4.7 | 1.4 | Versicolor |
6.4 | 3.2 | 4.5 | 1.5 | Versicolor |
6.9 | 3.1 | 4.9 | 1.5 | Versicolor |
5.5 | 2.3 | 4.0 | 1.3 | Versicolor |
6.5 | 2.8 | 4.6 | 1.5 | Versicolor |
5.7 | 2.8 | 4.5 | 1.3 | Versicolor |
6.3 | 3.3 | 4.7 | 1.6 | Versicolor |
4.9 | 2.4 | 3.3 | 1.0 | Versicolor |
6.6 | 2.9 | 4.6 | 1.3 | Versicolor |
5.2 | 2.7 | 3.9 | 1.4 | Versicolor |
5.0 | 2.0 | 3.5 | 1.0 | Versicolor |
5.9 | 3.0 | 4.2 | 1.5 | Versicolor |
6.0 | 2.2 | 4.0 | 1.0 | Versicolor |
6.1 | 2.9 | 4.7 | 1.4 | Versicolor |
5.6 | 2.9 | 3.6 | 1.3 | Versicolor |
6.7 | 3.1 | 4.4 | 1.4 | Versicolor |
5.6 | 3.0 | 4.5 | 1.5 | Versicolor |
5.8 | 2.7 | 4.1 | 1.0 | Versicolor |
6.2 | 2.2 | 4.5 | 1.5 | Versicolor |
5.6 | 2.5 | 3.9 | 1.1 | Versicolor |
5.9 | 3.2 | 4.8 | 1.8 | Versicolor |
6.1 | 2.8 | 4.0 | 1.3 | Versicolor |
6.3 | 2.5 | 4.9 | 1.5 | Versicolor |
6.1 | 2.8 | 4.7 | 1.2 | Versicolor |
6.4 | 2.9 | 4.3 | 1.3 | Versicolor |
6.6 | 3.0 | 4.4 | 1.4 | Versicolor |
6.8 | 2.8 | 4.8 | 1.4 | Versicolor |
6.7 | 3.0 | 5.0 | 1.7 | Versicolor |
6.0 | 2.9 | 4.5 | 1.5 | Versicolor |
5.7 | 2.6 | 3.5 | 1.0 | Versicolor |
5.5 | 2.4 | 3.8 | 1.1 | Versicolor |
5.5 | 2.4 | 3.7 | 1.0 | Versicolor |
5.8 | 2.7 | 3.9 | 1.2 | Versicolor |
6.0 | 2.7 | 5.1 | 1.6 | Versicolor |
5.4 | 3.0 | 4.5 | 1.5 | Versicolor |
6.0 | 3.4 | 4.5 | 1.6 | Versicolor |
6.7 | 3.1 | 4.7 | 1.5 | Versicolor |
6.3 | 2.3 | 4.4 | 1.3 | Versicolor |
5.6 | 3.0 | 4.1 | 1.3 | Versicolor |
5.5 | 2.5 | 4.0 | 1.3 | Versicolor |
5.5 | 2.6 | 4.4 | 1.2 | Versicolor |
6.1 | 3.0 | 4.6 | 1.4 | Versicolor |
5.8 | 2.6 | 4.0 | 1.2 | Versicolor |
5.0 | 2.3 | 3.3 | 1.0 | Versicolor |
5.6 | 2.7 | 4.2 | 1.3 | Versicolor |
5.7 | 3.0 | 4.2 | 1.2 | Versicolor |
5.7 | 2.9 | 4.2 | 1.3 | Versicolor |
6.2 | 2.9 | 4.3 | 1.3 | Versicolor |
5.1 | 2.5 | 3.0 | 1.1 | Versicolor |
5.7 | 2.8 | 4.1 | 1.3 | Versicolor |
6.3 | 3.3 | 6.0 | 2.5 | Virginica |
5.8 | 2.7 | 5.1 | 1.9 | Virginica |
7.1 | 3.0 | 5.9 | 2.1 | Virginica |
6.3 | 2.9 | 5.6 | 1.8 | Virginica |
6.5 | 3.0 | 5.8 | 2.2 | Virginica |
7.6 | 3.0 | 6.6 | 2.1 | Virginica |
4.9 | 2.5 | 4.5 | 1.7 | Virginica |
7.3 | 2.9 | 6.3 | 1.8 | Virginica |
6.7 | 2.5 | 5.8 | 1.8 | Virginica |
7.2 | 3.6 | 6.1 | 2.5 | Virginica |
6.5 | 3.2 | 5.1 | 2.0 | Virginica |
6.4 | 2.7 | 5.3 | 1.9 | Virginica |
6.8 | 3.0 | 5.5 | 2.1 | Virginica |
5.7 | 2.5 | 5.0 | 2.0 | Virginica |
5.8 | 2.8 | 5.1 | 2.4 | Virginica |
6.4 | 3.2 | 5.3 | 2.3 | Virginica |
6.5 | 3.0 | 5.5 | 1.8 | Virginica |
7.7 | 3.8 | 6.7 | 2.2 | Virginica |
7.7 | 2.6 | 6.9 | 2.3 | Virginica |
6.0 | 2.2 | 5.0 | 1.5 | Virginica |
6.9 | 3.2 | 5.7 | 2.3 | Virginica |
5.6 | 2.8 | 4.9 | 2.0 | Virginica |
7.7 | 2.8 | 6.7 | 2.0 | Virginica |
6.3 | 2.7 | 4.9 | 1.8 | Virginica |
6.7 | 3.3 | 5.7 | 2.1 | Virginica |
7.2 | 3.2 | 6.0 | 1.8 | Virginica |
6.2 | 2.8 | 4.8 | 1.8 | Virginica |
6.1 | 3.0 | 4.9 | 1.8 | Virginica |
6.4 | 2.8 | 5.6 | 2.1 | Virginica |
7.2 | 3.0 | 5.8 | 1.6 | Virginica |
7.4 | 2.8 | 6.1 | 1.9 | Virginica |
7.9 | 3.8 | 6.4 | 2.0 | Virginica |
6.4 | 2.8 | 5.6 | 2.2 | Virginica |
6.3 | 2.8 | 5.1 | 1.5 | Virginica |
6.1 | 2.6 | 5.6 | 1.4 | Virginica |
7.7 | 3.0 | 6.1 | 2.3 | Virginica |
6.3 | 3.4 | 5.6 | 2.4 | Virginica |
6.4 | 3.1 | 5.5 | 1.8 | Virginica |
6.0 | 3.0 | 4.8 | 1.8 | Virginica |
6.9 | 3.1 | 5.4 | 2.1 | Virginica |
6.7 | 3.1 | 5.6 | 2.4 | Virginica |
6.9 | 3.1 | 5.1 | 2.3 | Virginica |
5.8 | 2.7 | 5.1 | 1.9 | Virginica |
6.8 | 3.2 | 5.9 | 2.3 | Virginica |
6.7 | 3.3 | 5.7 | 2.5 | Virginica |
6.7 | 3.0 | 5.2 | 2.3 | Virginica |
6.3 | 2.5 | 5.0 | 1.9 | Virginica |
6.5 | 3.0 | 5.2 | 2.0 | Virginica |
6.2 | 3.4 | 5.4 | 2.3 | Virginica |
5.9 | 3.0 | 5.1 | 1.8 | Virginica |
Iris data set
You can download the Iris data set here.
The DataSet class offers a wide variety of constructors. The easiest and most common way to create a DataSet object is to use the default constructor, which creates an empty data set.
DataSet data_set;
Once we have created the DataSet object, the next step is to fill it with data:
data_set.set_data_path("../data/iris_plant.csv");
data_set.set_separator(DataSet::Separator::Space);
data_set.read_csv();
By default, the last column is set as the target and the remainder as inputs. Note that, in this case, the last column contains three different categories.
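To make the default column roles concrete, here is a minimal sketch in standard C++ that splits one delimited line into inputs and a target, treating the last field as the target. It does not show how `read_csv` works internally: `ParsedSample` and `parse_sample` are hypothetical names, and the example line is assumed from the first row of the table above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type, not part of the DataSet class.
struct ParsedSample
{
    std::vector<double> inputs;  // every field except the last one
    std::string target;          // last field, kept as a category label
};

// Splits one line on `separator`; the last field becomes the target.
ParsedSample parse_sample(const std::string& line, char separator)
{
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;

    while (std::getline(stream, field, separator))
    {
        if (!field.empty()) fields.push_back(field);  // skip repeated separators
    }

    assert(fields.size() >= 2);  // at least one input and one target

    ParsedSample sample;

    for (std::size_t i = 0; i + 1 < fields.size(); ++i)
    {
        sample.inputs.push_back(std::stod(fields[i]));
    }

    sample.target = fields.back();

    return sample;
}
```

Calling `parse_sample("5.1 3.5 1.4 0.2 Setosa", ' ')` yields four numeric inputs and the target label `Setosa`.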
Each instance of the data set can be assigned a use: training, selection, or testing. For example, the instances can be split either randomly or sequentially, as follows:
data_set.split_samples_random(type(0.7), type(0.1), type(0.2));
data_set.split_samples_sequential(type(0.7), type(0.1), type(0.2));
The first number is the ratio of training instances, the second is the ratio of selection instances, and the last is the ratio of testing instances. By default, these values are set to 0.6, 0.2 and 0.2.
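The difference between the two splitting strategies can be sketched in standard C++. This is an illustration under assumptions, not the library's implementation: `SampleUse`, `split_samples`, the rounding rule, and the `seed` parameter are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical enumeration mirroring the three uses described in the text.
enum class SampleUse { Training, Selection, Testing };

// Assigns a use to each of `samples_number` samples.
// With shuffle == false the split is sequential; otherwise it is random.
std::vector<SampleUse> split_samples(std::size_t samples_number,
                                     double training_ratio,
                                     double selection_ratio,
                                     double testing_ratio,
                                     bool shuffle,
                                     unsigned seed = 0)
{
    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    // Assumption: counts are rounded to the nearest integer and
    // whatever remains is assigned to testing.
    std::size_t training_number = static_cast<std::size_t>(
        std::lround(samples_number * training_ratio / total));
    training_number = std::min(training_number, samples_number);

    std::size_t selection_number = static_cast<std::size_t>(
        std::lround(samples_number * selection_ratio / total));
    selection_number = std::min(selection_number, samples_number - training_number);

    // Sample indices in file order, optionally shuffled.
    std::vector<std::size_t> indices(samples_number);
    std::iota(indices.begin(), indices.end(), std::size_t(0));

    if (shuffle)
    {
        std::mt19937 generator(seed);
        std::shuffle(indices.begin(), indices.end(), generator);
    }

    std::vector<SampleUse> uses(samples_number, SampleUse::Testing);

    for (std::size_t i = 0; i < training_number; ++i)
        uses[indices[i]] = SampleUse::Training;

    for (std::size_t i = training_number; i < training_number + selection_number; ++i)
        uses[indices[i]] = SampleUse::Selection;

    return uses;
}
```

In this sketch, 150 samples with ratios 0.7, 0.1 and 0.2 give 105 training, 15 selection and 30 testing samples. Because the Iris rows are ordered by species, a sequential split would place only Virginica samples in the testing subset, which is one reason a random split may be preferable for this data set.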
The DataSet class also implements some useful preprocessing methods. Below, we present some of them:
- Scaling methods:
Scaling puts all the inputs in a comparable range, which generally makes training the neural network better conditioned. The following method scales the input data between a minimum and a maximum value and returns the descriptives of each input variable: its minimum, maximum, mean, and standard deviation.
const vector<Descriptives> inputs_descriptives = data_set.scale_variables(DataSet::VariableUse::Input);
cout << inputs_descriptives << endl;
Scaled data are introduced into the neural network during the training phase.
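As an illustration of minimum-maximum scaling, the following standard C++ sketch scales a single column in place and returns its descriptives. The names `ColumnDescriptives` and `scale_minimum_maximum`, the default target range of 0 to 1, and the use of the sample standard deviation (dividing by n - 1) are assumptions made for this example, not the library's definitions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical structure holding the four statistics named in the text.
struct ColumnDescriptives
{
    double minimum;
    double maximum;
    double mean;
    double standard_deviation;
};

// Scales `column` in place to [new_minimum, new_maximum] and returns
// the descriptives computed on the original (unscaled) values.
ColumnDescriptives scale_minimum_maximum(std::vector<double>& column,
                                         double new_minimum = 0.0,
                                         double new_maximum = 1.0)
{
    assert(!column.empty());

    const auto extremes = std::minmax_element(column.begin(), column.end());
    const double minimum = *extremes.first;
    const double maximum = *extremes.second;

    double sum = 0.0;
    for (const double value : column) sum += value;
    const double mean = sum / column.size();

    double squared_deviations = 0.0;
    for (const double value : column)
        squared_deviations += (value - mean) * (value - mean);

    // Assumption: sample standard deviation; zero for a single value.
    const double standard_deviation = column.size() > 1
        ? std::sqrt(squared_deviations / (column.size() - 1))
        : 0.0;

    const double range = maximum - minimum;

    for (double& value : column)
    {
        // A constant column has no range, so map it to the lower bound.
        value = range > 0.0
            ? new_minimum + (value - minimum) * (new_maximum - new_minimum) / range
            : new_minimum;
    }

    return {minimum, maximum, mean, standard_deviation};
}
```

Keeping the returned descriptives matters because the same minimum and maximum are needed later to scale new inputs consistently and to unscale results.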
- Correlation methods:
Correlations show which input variables are most strongly related to the targets. The following method calculates the correlations between all the inputs and all the targets.
const Tensor<Correlation, 2> input_target_correlations = data_set.calculate_input_target_raw_variable_pearson_correlations();
data_set.print_input_target_raw_variables_correlations();
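For reference, the Pearson correlation between one numeric input column and one numeric target column can be sketched in standard C++ as below. The function name `pearson_correlation` is hypothetical, and the sketch covers only the linear, numeric case; how the library handles categorical targets such as the Iris species is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation coefficient between two columns of equal length.
// Returns 0 when either column is constant (the coefficient is undefined).
double pearson_correlation(const std::vector<double>& x,
                           const std::vector<double>& y)
{
    assert(x.size() == y.size());
    assert(!x.empty());

    const double n = static_cast<double>(x.size());

    double sum_x = 0.0;
    double sum_y = 0.0;

    for (std::size_t i = 0; i < x.size(); ++i)
    {
        sum_x += x[i];
        sum_y += y[i];
    }

    const double mean_x = sum_x / n;
    const double mean_y = sum_y / n;

    double covariance = 0.0;  // sum of products of deviations
    double variance_x = 0.0;  // sum of squared deviations of x
    double variance_y = 0.0;  // sum of squared deviations of y

    for (std::size_t i = 0; i < x.size(); ++i)
    {
        const double deviation_x = x[i] - mean_x;
        const double deviation_y = y[i] - mean_y;

        covariance += deviation_x * deviation_y;
        variance_x += deviation_x * deviation_x;
        variance_y += deviation_y * deviation_y;
    }

    const double denominator = std::sqrt(variance_x * variance_y);

    return denominator > 0.0 ? covariance / denominator : 0.0;
}
```

A value close to 1 or -1 indicates a strong linear relationship between the input and the target, while a value close to 0 indicates little or no linear relationship.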
For more information on the DataSet class, visit the DataSet Class Reference.