DataSet Class Reference

This class represents the concept of a data set for data modelling problems, such as function regression, classification, time series prediction, image approximation and image classification.

#include <data_set.h>

Classes

struct  Column
 This structure represents the columns of the DataSet.
 

Public Types

enum  Separator { Space, Tab, Comma, Semicolon }
 Enumeration of available separators for the data file.
 
enum  MissingValuesMethod { Unuse, Mean, Median }
 Enumeration of available methods for missing values in the data.
 
enum  ScalingUnscalingMethod {
  NoScaling, NoUnscaling, MinimumMaximum, MeanStandardDeviation,
  StandardDeviation, Logarithmic
}
 Enumeration of available methods for scaling and unscaling the data.
 
enum  ProjectType {
  Approximation, Classification, Forecasting, ImageApproximation,
  ImageClassification
}
 Enumeration of the learning tasks.
 
enum  InstanceUse { Training, Selection, Testing, UnusedInstance }
 
enum  VariableUse { Input, Target, Time, UnusedVariable }
 
enum  ColumnType { Numeric, Binary, Categorical, DateTime }
 

Public Member Functions

 DataSet ()
 
 DataSet (const Eigen::MatrixXd &)
 
 DataSet (const Matrix< double > &)
 
 DataSet (const size_t &, const size_t &)
 
 DataSet (const size_t &, const size_t &, const size_t &)
 
 DataSet (const tinyxml2::XMLDocument &)
 
 DataSet (const string &, const char &, const bool &)
 
 DataSet (const DataSet &)
 
virtual ~DataSet ()
 Destructor.
 
size_t get_instances_number () const
 
size_t get_training_instances_number () const
 Returns the number of instances in the data set which will be used for training.
 
size_t get_selection_instances_number () const
 Returns the number of instances in the data set which will be used for selection.
 
size_t get_testing_instances_number () const
 Returns the number of instances in the data set which will be used for testing.
 
size_t get_used_instances_number () const
 
size_t get_unused_instances_number () const
 
Vector< size_t > get_training_instances_indices () const
 Returns the indices of the instances which will be used for training.
 
Vector< size_t > get_selection_instances_indices () const
 Returns the indices of the instances which will be used for selection.
 
Vector< size_t > get_testing_instances_indices () const
 Returns the indices of the instances which will be used for testing.
 
Vector< size_t > get_used_instances_indices () const
 Returns the indices of the used instances (those which are not set unused).
 
Vector< size_t > get_unused_instances_indices () const
 Returns the indices of the instances set unused.
 
InstanceUse get_instance_use (const size_t &) const
 
const Vector< InstanceUse > & get_instances_uses () const
 Returns the use of every instance (training, selection, testing or unused) in a vector.
 
Vector< size_t > get_instances_uses_numbers () const
 
Vector< double > get_instances_uses_percentages () const
 
Vector< Column > get_columns () const
 
Vector< Column > get_used_columns () const
 
size_t get_columns_number () const
 
size_t get_input_columns_number () const
 
size_t get_target_columns_number () const
 
size_t get_time_columns_number () const
 
size_t get_unused_columns_number () const
 
size_t get_used_columns_number () const
 
size_t get_column_index (const string &) const
 
Vector< size_t > get_input_columns_indices () const
 Returns a vector with the indices of the input columns.
 
Vector< size_t > get_target_columns_indices () const
 Returns a vector with the indices of the target columns.
 
Vector< size_t > get_unused_columns_indices () const
 Returns a vector with the indices of the unused columns.
 
Vector< size_t > get_used_columns_indices () const
 Returns a vector with the indices of the used columns.
 
Vector< string > get_columns_names () const
 Returns a string vector that contains the names of the columns.
 
Vector< string > get_input_columns_names () const
 
Vector< string > get_target_columns_names () const
 
Vector< string > get_used_columns_names () const
 
ColumnType get_column_type (const size_t &index) const
 
VariableUse get_column_use (const size_t &) const
 Returns the use of the given column, without taking into account the categories.
 
Vector< VariableUse > get_columns_uses () const
 
size_t get_variables_number () const
 
size_t get_input_variables_number () const
 Returns the number of input variables of the data set.
 
size_t get_target_variables_number () const
 Returns the number of target variables of the data set.
 
size_t get_unused_variables_number () const
 Returns the number of variables which will neither be used as input nor as target.
 
size_t get_used_variables_number () const
 Returns the number of variables which are either input or target.
 
string get_variable_name (const size_t &) const
 
Vector< string > get_variables_names () const
 
Vector< string > get_input_variables_names () const
 
Vector< string > get_target_variables_names () const
 
size_t get_variable_index (const string &) const
 
Vector< size_t > get_variable_indices (const size_t &) const
 
Vector< size_t > get_unused_variables_indices () const
 Returns the indices of the unused variables.
 
Vector< size_t > get_input_variables_indices () const
 Returns the indices of the input variables.
 
Vector< size_t > get_target_variables_indices () const
 Returns the indices of the target variables.
 
VariableUse get_variable_use (const size_t &) const
 
Vector< VariableUse > get_variables_uses () const
 
Vector< size_t > get_input_variables_dimensions () const
 Returns the dimensions of the input variables.
 
Vector< size_t > get_target_variables_dimensions () const
 Returns the dimensions of the target variables.
 
size_t get_batch_instances_number ()
 
Vector< Vector< size_t > > get_training_batches (const bool &=true) const
 
Vector< Vector< size_t > > get_selection_batches (const bool &=true) const
 
Vector< Vector< size_t > > get_testing_batches (const bool &=true) const
 
const Matrix< double > & get_data () const
 
const Eigen::MatrixXd get_data_eigen () const
 
const Matrix< double > & get_time_series_data () const
 
Matrix< double > get_training_data () const
 
Eigen::MatrixXd get_training_data_eigen () const
 
Matrix< double > get_selection_data () const
 
Eigen::MatrixXd get_selection_data_eigen () const
 
Matrix< double > get_testing_data () const
 
Eigen::MatrixXd get_testing_data_eigen () const
 
Matrix< double > get_input_data () const
 
Eigen::MatrixXd get_input_data_eigen () const
 
Matrix< double > get_target_data () const
 
Eigen::MatrixXd get_target_data_eigen () const
 
Tensor< double > get_input_data (const Vector< size_t > &) const
 
Tensor< double > get_target_data (const Vector< size_t > &) const
 
Matrix< float > get_input_data_float (const Vector< size_t > &) const
 
Matrix< float > get_target_data_float (const Vector< size_t > &) const
 
Tensor< double > get_training_input_data () const
 
Eigen::MatrixXd get_training_input_data_eigen () const
 
Tensor< double > get_training_target_data () const
 
Eigen::MatrixXd get_training_target_data_eigen () const
 
Tensor< double > get_selection_input_data () const
 
Eigen::MatrixXd get_selection_input_data_eigen () const
 
Tensor< double > get_selection_target_data () const
 
Eigen::MatrixXd get_selection_target_data_eigen () const
 
Tensor< double > get_testing_input_data () const
 
Eigen::MatrixXd get_testing_input_data_eigen () const
 
Tensor< double > get_testing_target_data () const
 
Eigen::MatrixXd get_testing_target_data_eigen () const
 
Vector< double > get_instance_data (const size_t &) const
 
Vector< double > get_instance_data (const size_t &, const Vector< size_t > &) const
 
Tensor< double > get_instance_input_data (const size_t &) const
 
Tensor< double > get_instance_target_data (const size_t &) const
 
Matrix< double > get_column_data (const size_t &) const
 
Matrix< double > get_column_data (const Vector< size_t > &) const
 
Matrix< double > get_column_data (const string &) const
 
Vector< double > get_variable_data (const size_t &) const
 
Vector< double > get_variable_data (const string &) const
 
Vector< double > get_variable_data (const size_t &, const Vector< size_t > &) const
 
Vector< double > get_variable_data (const string &, const Vector< size_t > &) const
 
MissingValuesMethod get_missing_values_method () const
 Returns the method used for missing values.
 
const string & get_data_file_name () const
 Returns the name of the data file.
 
const bool & get_header_line () const
 Returns true if the first line of the data file has a header with the names of the variables, and false otherwise.
 
const bool & get_rows_label () const
 Returns true if the data file has row labels, and false otherwise.
 
const Separator & get_separator () const
 Returns the separator to be used in the data file.
 
char get_separator_char () const
 Returns the character which will be used as separator in the data file.
 
string get_separator_string () const
 Returns the string which will be used as separator in the data file.
 
const string & get_missing_values_label () const
 Returns the string which will be used as label for the missing values in the data file.
 
const size_t & get_lags_number () const
 Returns the number of lags to be used in a time series prediction application.
 
const size_t & get_steps_ahead () const
 Returns the number of steps ahead to be used in a time series prediction application.
 
const size_t & get_time_index () const
 Returns the index of the time variable in the data set.
 
int get_gmt () const
 
const bool & get_display () const
 
void set ()
 Sets zero instances and zero variables in the data set.
 
void set (const Matrix< double > &)
 
void set (const Eigen::MatrixXd &)
 
void set (const size_t &, const size_t &)
 
void set (const size_t &, const size_t &, const size_t &)
 
void set (const DataSet &)
 
void set (const tinyxml2::XMLDocument &)
 
void set (const string &)
 
void set_default ()
 
void set_instances_number (const size_t &)
 
void set_training ()
 Sets all the instances in the data set for training.
 
void set_selection ()
 Sets all the instances in the data set for selection.
 
void set_testing ()
 Sets all the instances in the data set for testing.
 
void set_training (const Vector< size_t > &)
 
void set_selection (const Vector< size_t > &)
 
void set_testing (const Vector< size_t > &)
 
void set_instances_unused ()
 Sets all the instances in the data set as unused.
 
void set_instances_unused (const Vector< size_t > &)
 
void set_instance_use (const size_t &, const InstanceUse &)
 
void set_instance_use (const size_t &, const string &)
 
void set_instances_uses (const Vector< InstanceUse > &)
 
void set_instances_uses (const Vector< string > &)
 
void set_testing_to_selection_instances ()
 Changes the testing instances to selection instances.
 
void set_selection_to_testing_instances ()
 Changes the selection instances to testing instances.
 
void set_batch_instances_number (const size_t &)
 
void set_k_fold_cross_validation_instances_uses (const size_t &, const size_t &)
 
void set_default_columns_uses ()
 
void set_default_columns_names ()
 
void set_columns_uses (const Vector< string > &)
 
void set_columns_uses (const Vector< VariableUse > &)
 
void set_columns_unused ()
 
void set_input_columns_unused ()
 
void set_column_use (const size_t &, const VariableUse &)
 
void set_column_use (const string &, const VariableUse &)
 
void set_columns_names (const Vector< string > &)
 
void set_columns_number (const size_t &)
 
void set_variables_names (const Vector< string > &)
 
void set_variable_name (const size_t &, const string &)
 
void set_input ()
 Sets all the variables in the data set as input variables.
 
void set_target ()
 Sets all the variables in the data set as target variables.
 
void set_variables_unused ()
 Sets all the variables in the data set as unused variables.
 
void set_input_variables_dimensions (const Vector< size_t > &)
 Sets new input dimensions in the data set.
 
void set_target_variables_dimensions (const Vector< size_t > &)
 Sets new target dimensions in the data set.
 
void set_data (const Matrix< double > &)
 
void set_instance (const size_t &, const Vector< double > &)
 
void set_data_file_name (const string &)
 
void set_has_columns_names (const bool &)
 Sets if the data file contains a header with the names of the columns.
 
void set_has_rows_label (const bool &)
 Sets if the data file contains row labels.
 
void set_separator (const Separator &)
 
void set_separator (const string &)
 
void set_separator (const char &)
 
void set_missing_values_label (const string &)
 
void set_missing_values_method (const MissingValuesMethod &)
 
void set_missing_values_method (const string &)
 
void set_lags_number (const size_t &)
 
void set_steps_ahead_number (const size_t &)
 
void set_time_index (const size_t &)
 
void set_gmt (int &)
 
void set_display (const bool &)
 
bool is_binary_classification () const
 Returns true if the data set is a binary classification problem, false otherwise.
 
bool is_multiple_classification () const
 Returns true if the data set is a multiple classification problem, false otherwise.
 
bool is_empty () const
 Returns true if the data matrix is empty, and false otherwise.
 
bool is_instance_used (const size_t &) const
 
bool is_instance_unused (const size_t &) const
 
bool has_data () const
 
bool has_categorical_variables () const
 
bool has_time_variables () const
 
void split_instances_sequential (const double &training_ratio=0.6, const double &selection_ratio=0.2, const double &testing_ratio=0.2)
 
void split_instances_random (const double &training_ratio=0.6, const double &selection_ratio=0.2, const double &testing_ratio=0.2)
 
Vector< string > unuse_constant_columns ()
 
Vector< size_t > unuse_repeated_instances ()
 
Vector< size_t > unuse_non_significant_input_columns ()
 
Vector< size_t > unuse_uncorrelated_columns (const double &=0.25)
 
Vector< size_t > unuse_most_populated_target (const size_t &)
 
void initialize_data (const double &)
 
void randomize_data_uniform (const double &minimum=-1.0, const double &maximum=1.0)
 
void randomize_data_normal (const double &mean=0.0, const double &standard_deviation=1.0)
 
Vector< Descriptives > calculate_columns_descriptives () const
 
Matrix< double > calculate_columns_descriptives_matrix () const
 
Eigen::MatrixXd calculate_columns_descriptives_eigen () const
 
Vector< Descriptives > calculate_columns_descriptives_positive_instances () const
 
Vector< Descriptives > calculate_columns_descriptives_negative_instances () const
 
Vector< Descriptives > calculate_columns_descriptives_classes (const size_t &) const
 
Vector< Descriptives > calculate_columns_descriptives_training_instances () const
 
Vector< Descriptives > calculate_columns_descriptives_selection_instances () const
 
Vector< Descriptives > calculate_columns_descriptives_testing_instances () const
 
Vector< Descriptives > calculate_input_variables_descriptives () const
 
Vector< Descriptives > calculate_target_variables_descriptives () const
 
Vector< double > calculate_variables_means (const Vector< size_t > &) const
 
Descriptives calculate_inputs_descriptives (const size_t &) const
 
Vector< double > calculate_training_targets_mean () const
 Returns the mean values of the target variables on the training instances.
 
Vector< double > calculate_selection_targets_mean () const
 Returns the mean values of the target variables on the selection instances.
 
Vector< double > calculate_testing_targets_mean () const
 Returns the mean values of the target variables on the testing instances.
 
size_t calculate_training_negatives (const size_t &) const
 
size_t calculate_selection_negatives (const size_t &) const
 
size_t calculate_testing_negatives (const size_t &) const
 
Vector< Histogram > calculate_columns_histograms (const size_t &=10) const
 
Vector< BoxPlot > calculate_columns_box_plots () const
 
Matrix< double > calculate_inputs_correlations () const
 
void print_inputs_correlations () const
 Prints to the screen the correlation between variables in the data set.
 
void print_top_inputs_correlations (const size_t &=10) const
 
Matrix< CorrelationResults > calculate_input_target_columns_correlations () const
 
Matrix< double > calculate_input_target_columns_correlations_double () const
 
Eigen::MatrixXd calculate_input_target_columns_correlations_eigen () const
 
void print_input_target_columns_correlations () const
 Prints to the screen the correlation between targets and inputs.
 
void print_top_input_target_columns_correlations (const size_t &=10) const
 
Matrix< double > calculate_covariance_matrix () const
 
Matrix< double > perform_principal_components_analysis (const double &=0.0)
 
Matrix< double > perform_principal_components_analysis (const Matrix< double > &, const Vector< double > &, const double &=0.0)
 
void transform_principal_components_data (const Matrix< double > &)
 
void subtract_inputs_mean ()
 Subtracts the mean from each of the input variables.
 
Vector< size_t > filter_column (const size_t &, const double &, const double &)
 
Vector< size_t > filter_column (const string &, const double &, const double &)
 
Vector< size_t > filter_data (const Vector< double > &, const Vector< double > &)
 
Vector< string > calculate_default_scaling_methods () const
 
void scale_data_minimum_maximum (const Vector< Descriptives > &)
 
void scale_data_mean_standard_deviation (const Vector< Descriptives > &)
 
Vector< Descriptives > scale_data_minimum_maximum ()
 
Vector< Descriptives > scale_data_mean_standard_deviation ()
 
void scale_inputs_mean_standard_deviation (const Vector< Descriptives > &)
 
Vector< Descriptives > scale_inputs_mean_standard_deviation ()
 
void scale_input_mean_standard_deviation (const Descriptives &, const size_t &)
 
Descriptives scale_input_mean_standard_deviation (const size_t &)
 
void scale_input_standard_deviation (const Descriptives &, const size_t &)
 
Descriptives scale_input_standard_deviation (const size_t &)
 
void scale_inputs_minimum_maximum (const Vector< Descriptives > &)
 
Vector< Descriptives > scale_inputs_minimum_maximum ()
 
Eigen::MatrixXd scale_inputs_minimum_maximum_eigen ()
 
Eigen::MatrixXd scale_targets_minimum_maximum_eigen ()
 
void scale_input_minimum_maximum (const Descriptives &, const size_t &)
 
Descriptives scale_input_minimum_maximum (const size_t &)
 
Vector< Descriptives > scale_inputs (const string &)
 
void scale_inputs (const string &, const Vector< Descriptives > &)
 
void scale_inputs (const Vector< string > &, const Vector< Descriptives > &)
 
void scale_targets_minimum_maximum (const Vector< Descriptives > &)
 
Vector< Descriptives > scale_targets_minimum_maximum ()
 
void scale_targets_mean_standard_deviation (const Vector< Descriptives > &)
 
Vector< Descriptives > scale_targets_mean_standard_deviation ()
 
void scale_targets_logarithmic (const Vector< Descriptives > &)
 
Vector< Descriptives > scale_targets_logarithmic ()
 
Vector< Descriptives > scale_targets (const string &)
 
void scale_targets (const string &, const Vector< Descriptives > &)
 
void unscale_data_minimum_maximum (const Vector< Descriptives > &)
 
void unscale_data_mean_standard_deviation (const Vector< Descriptives > &)
 
void unscale_inputs_minimum_maximum (const Vector< Descriptives > &)
 
void unscale_inputs_mean_standard_deviation (const Vector< Descriptives > &)
 
void unscale_targets_minimum_maximum (const Vector< Descriptives > &)
 
void unscale_targets_mean_standard_deviation (const Vector< Descriptives > &)
 
Vector< size_t > calculate_target_distribution () const
 
Vector< size_t > balance_binary_targets_distribution (const double &=100.0)
 
Vector< size_t > balance_multiple_targets_distribution ()
 
Vector< size_t > balance_approximation_targets_distribution (const double &=10.0)
 
Vector< size_t > calculate_Tukey_outliers (const size_t &, const double &=1.5) const
 
Vector< Vector< size_t > > calculate_Tukey_outliers (const double &=1.5) const
 
void unuse_Tukey_outliers (const double &=1.5)
 
void transform_columns_time_series ()
 
Matrix< double > calculate_autocorrelations (const size_t &=10) const
 
Matrix< Vector< double > > calculate_cross_correlations (const size_t &=10) const
 Calculates the cross-correlation between all the variables in the data set.
 
Matrix< double > calculate_lag_plot () const
 
Matrix< double > calculate_lag_plot (const size_t &)
 
void generate_constant_data (const size_t &, const size_t &)
 
void generate_random_data (const size_t &, const size_t &)
 
void generate_sequential_data (const size_t &, const size_t &)
 
void generate_paraboloid_data (const size_t &, const size_t &)
 
void generate_Rosenbrock_data (const size_t &, const size_t &)
 
void generate_inputs_selection_data (const size_t &, const size_t &)
 
void generate_sum_data (const size_t &, const size_t &)
 
void generate_data_binary_classification (const size_t &, const size_t &)
 
void generate_data_multiple_classification (const size_t &, const size_t &, const size_t &)
 
string object_to_string () const
 Returns a string representation of the current data set object.
 
void print () const
 Prints to the screen in text format the members of the data set object.
 
void print_summary () const
 Prints to the screen in text format the main numbers from the data set object.
 
tinyxml2::XMLDocument * to_XML () const
 Serializes the data set object into an XML document of the TinyXML library.
 
void from_XML (const tinyxml2::XMLDocument &)
 
void write_XML (tinyxml2::XMLPrinter &) const
 Serializes the data set object into an XML document of the TinyXML library without keeping the DOM tree in memory.
 
void save (const string &) const
 
void load (const string &)
 
void print_columns_types () const
 
void print_data () const
 Prints to the screen the values of the data matrix.
 
void print_data_preview () const
 
void print_data_file_preview () const
 
void save_data () const
 Saves to the data file the values of the data matrix.
 
void read_csv ()
 
void load_data_binary ()
 This method loads the data from a binary data file.
 
void load_time_series_data_binary ()
 This method loads data from a binary data file for time series prediction methods.
 
void transform_time_series ()
 Arranges an input-target matrix from a time series matrix, according to the number of lags.
 
void transform_association ()
 
void fill_time_series (const size_t &)
 
void delete_unused_instances ()
 
void numeric_to_categorical (const size_t &)
 
void print_missing_values_information () const
 
void impute_missing_values_unuse ()
 Sets all the instances with missing values to "Unused".
 
void impute_missing_values_mean ()
 Substitutes all the missing values with the mean of the corresponding variable.
 
void impute_missing_values_median ()
 Substitutes all the missing values with the median of the corresponding variable.
 
void scrub_missing_values ()
 
Vector< string > unuse_columns_missing_values (const double &)
 

Static Public Member Functions

static Vector< string > get_default_columns_names (const size_t &)
 
static ScalingUnscalingMethod get_scaling_unscaling_method (const string &)
 

Private Member Functions

void read_csv_1 ()
 
void read_csv_2_simple ()
 
void read_csv_3_simple ()
 
void read_csv_2_complete ()
 
void read_csv_3_complete ()
 
void check_separators (const string &) const
 

Private Attributes

string data_file_name
 Data file name.
 
Separator separator = Comma
 Separator character.
 
string missing_values_label = "NA"
 Missing values label.
 
size_t lags_number
 Number of lags.
 
size_t steps_ahead
 Number of steps ahead.
 
Matrix< double > data
 
Matrix< double > time_series_data
 
Vector< Column > time_series_columns
 
bool display = true
 Display messages to screen.
 
size_t time_index
 Index where time variable is located for forecasting applications.
 
MissingValuesMethod missing_values_method = Unuse
 Missing values method object.
 
Vector< InstanceUse > instances_uses
 
size_t batch_instances_number = 1000
 Number of batch instances. It is used to optimize the training strategy.
 
bool has_columns_names = false
 True if the data file has a header with the names of the variables.
 
Vector< size_t > inputs_dimensions
 
Vector< size_t > targets_dimensions
 
Vector< Column > columns
 
bool has_rows_labels = false
 True if the data file contains row labels.
 
Vector< string > rows_labels
 
int gmt = 0
 
Vector< Vector< string > > data_file_preview
 

Detailed Description

This class represents the concept of a data set for data modelling problems, such as function regression, classification, time series prediction, image approximation and image classification.

It basically consists of a data matrix organized by columns. These columns can be of different types depending on the data they hold.

With the OpenNN DataSet class you can edit the data to prepare your model, for example by eliminating missing values, calculating correlations between variables (inputs and targets), or leaving certain variables or instances unused.

Definition at line 57 of file data_set.h.
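
As a rough illustration of the idea described above, the following standalone C++ sketch stores a data matrix together with one use per instance and one use per column, and extracts the input values of the training instances. DataSetSketch and its members are hypothetical names written for this example only. The sketch relies on the standard library instead of the OpenNN Matrix and Vector classes, and it is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Uses mirror the InstanceUse and VariableUse enumerations listed in this page.
enum class InstanceUse { Training, Selection, Testing, UnusedInstance };
enum class VariableUse { Input, Target, Time, UnusedVariable };

// Hypothetical, simplified stand-in for a data set: not the OpenNN class.
struct DataSetSketch
{
    std::vector<std::vector<double>> data;     // one row per instance
    std::vector<InstanceUse> instances_uses;   // one use per row
    std::vector<VariableUse> columns_uses;     // one use per column

    // Returns the input columns of the rows marked for training.
    std::vector<std::vector<double>> get_training_input_data() const
    {
        assert(data.size() == instances_uses.size());

        std::vector<std::vector<double>> result;

        for (std::size_t i = 0; i < data.size(); ++i)
        {
            if (instances_uses[i] != InstanceUse::Training) continue;

            assert(data[i].size() == columns_uses.size());

            std::vector<double> row;
            for (std::size_t j = 0; j < data[i].size(); ++j)
            {
                if (columns_uses[j] == VariableUse::Input) row.push_back(data[i][j]);
            }
            result.push_back(row);
        }

        return result;
    }
};
```

Marking a row as UnusedInstance or a column as UnusedVariable leaves the stored matrix untouched and only changes what the getter returns, which is the behaviour the uses in this page suggest.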

Member Enumeration Documentation

◆ ColumnType

enum ColumnType

This enumeration represents the data type of a column (numeric, binary, categorical or date-time).

Definition at line 115 of file data_set.h.

◆ InstanceUse

This enumeration represents the possible uses of an instance (training, selection, testing or unused).

Definition at line 105 of file data_set.h.

◆ VariableUse

This enumeration represents the possible uses of a variable (input, target, time or unused).

Definition at line 110 of file data_set.h.

Constructor & Destructor Documentation

◆ DataSet() [1/8]

DataSet ( )
explicit

Default constructor. It creates a data set object with zero instances and zero inputs and target variables. It also initializes the rest of class members to their default values.

Definition at line 21 of file data_set.cpp.

◆ DataSet() [2/8]

DataSet ( const Eigen::MatrixXd &  data)
explicit

Eigen matrix constructor. It creates a data set object from an Eigen data matrix. It also initializes the rest of class members to their default values.

Parameters
data  Data MatrixXd.

Definition at line 33 of file data_set.cpp.

◆ DataSet() [3/8]

DataSet ( const Matrix< double > &  data)
explicit

Data constructor. It creates a data set object from a data matrix. It also initializes the rest of class members to their default values.

Parameters
data  Data matrix.

Definition at line 45 of file data_set.cpp.

◆ DataSet() [4/8]

DataSet ( const size_t &  new_instances_number,
const size_t &  new_variables_number 
)
explicit

Instances and variables number constructor. It creates a data set object with given instances and variables numbers. All the variables are set as inputs. It also initializes the rest of class members to their default values.

Parameters
new_instances_number  Number of instances in the data set.
new_variables_number  Number of variables.

Definition at line 60 of file data_set.cpp.

◆ DataSet() [5/8]

DataSet ( const size_t &  new_instances_number,
const size_t &  new_inputs_number,
const size_t &  new_targets_number 
)
explicit

Instances number, input variables number and target variables number constructor. It creates a data set object with the given numbers of instances, input variables and target variables. It also initializes the rest of class members to their default values.

Parameters
new_instances_number  Number of instances in the data set.
new_inputs_number  Number of input variables.
new_targets_number  Number of target variables.

Definition at line 75 of file data_set.cpp.

◆ DataSet() [6/8]

DataSet ( const tinyxml2::XMLDocument &  data_set_document)
explicit

Sets the data set members from an XML document.

Parameters
data_set_document  TinyXML document containing the member data.

Definition at line 86 of file data_set.cpp.

◆ DataSet() [7/8]

DataSet ( const string &  data_file_name,
const char &  separator,
const bool &  new_has_columns_names 
)
explicit

File and separator constructor. It creates a data set object by loading the object members from a data file. It also sets a separator. Mind the file format, which is specified in the User's Guide.

Parameters
data_file_name  Data file name.
separator  Data file separator.
new_has_columns_names  True if the data file contains a header with the names of the columns.

Definition at line 100 of file data_set.cpp.

◆ DataSet() [8/8]

DataSet ( const DataSet &  other_data_set)

Copy constructor. It creates a copy of an existing data set object.

Parameters
other_data_set  Data set object to be copied.

Definition at line 121 of file data_set.cpp.

Member Function Documentation

◆ balance_approximation_targets_distribution()

Vector< size_t > balance_approximation_targets_distribution ( const double &  percentage = 10.0)

This method balances the target distribution of a data set for a function regression problem. It returns a vector with the indices of the instances set unused. It unuses a given percentage of the instances.

Parameters
percentage  Percentage of the instances to be unused.
Todo:
Low priority.

Definition at line 7548 of file data_set.cpp.

◆ balance_binary_targets_distribution()

Vector< size_t > balance_binary_targets_distribution ( const double &  percentage = 100.0)

This method balances the targets distribution of a data set with only one target variable by unusing instances whose target variable belongs to the most populated target class. It returns a vector with the indices of the instances set unused.

Parameters
percentage  Percentage of instances to be unused.
Todo:
Low priority. "total unbalanced instances" needs target class distribution function.

Definition at line 7270 of file data_set.cpp.
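
The balancing idea described above can be sketched in standalone C++ as follows. This is only an illustration under stated assumptions, not the code in data_set.cpp: the function name balance_binary_targets_sketch is hypothetical, the percentage is assumed to apply to the surplus of the most populated class, and surplus instances are assumed to be unused in index order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of OpenNN.
// targets: class (0 or 1) of every instance.
// used:    true for the instances currently in use; updated in place.
// Returns the indices of the instances that were set unused.
std::vector<std::size_t> balance_binary_targets_sketch(const std::vector<int>& targets,
                                                       std::vector<bool>& used,
                                                       double percentage = 100.0)
{
    assert(targets.size() == used.size());
    assert(percentage >= 0.0 && percentage <= 100.0);

    // Count the used instances of each class.
    std::size_t positives = 0;
    std::size_t negatives = 0;
    for (std::size_t i = 0; i < targets.size(); ++i)
    {
        if (!used[i]) continue;
        if (targets[i] == 1) ++positives; else ++negatives;
    }

    const int majority_class = positives > negatives ? 1 : 0;
    const std::size_t surplus = positives > negatives ? positives - negatives
                                                      : negatives - positives;

    // Assumption: the percentage applies to the surplus of the majority class.
    const std::size_t to_unuse = static_cast<std::size_t>(surplus * percentage / 100.0);

    // Assumption: surplus instances are unused in index order.
    std::vector<std::size_t> unused_indices;
    for (std::size_t i = 0; i < targets.size() && unused_indices.size() < to_unuse; ++i)
    {
        if (used[i] && targets[i] == majority_class)
        {
            used[i] = false;
            unused_indices.push_back(i);
        }
    }

    return unused_indices;
}
```

With four negatives and two positives, a percentage of 100 unuses two negatives in this sketch, leaving two used instances per class.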

◆ balance_multiple_targets_distribution()

Vector< size_t > balance_multiple_targets_distribution ( )

This method balances the targets distribution of a data set with more than one target variable by unusing instances whose target variable belongs to the most populated target class. It returns a vector with the indices of the instances set unused.

Todo:
"total unbalanced instances" needs target class distribution function

Definition at line 7323 of file data_set.cpp.

◆ calculate_autocorrelations()

Matrix< double > calculate_autocorrelations ( const size_t &  maximum_lags_number = 10) const

Returns a matrix with the values of autocorrelation for every variable in the data set. The number of rows is equal to the number of variables. The number of columns is the maximum lags number.

Parameters
maximum_lags_number  Maximum lags number for which autocorrelation is calculated.

Definition at line 7715 of file data_set.cpp.
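
For reference, one row of such a matrix could be computed for a single variable as in the following standalone sketch. It uses the common sample autocorrelation estimator (lagged covariance sum divided by the total variance sum), which may differ from what data_set.cpp computes. The function name autocorrelations_sketch is hypothetical, and starting at lag 1 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of OpenNN.
// Returns the sample autocorrelation of one series for lags 1..maximum_lags_number.
std::vector<double> autocorrelations_sketch(const std::vector<double>& series,
                                            std::size_t maximum_lags_number = 10)
{
    assert(series.size() > maximum_lags_number);

    const std::size_t n = series.size();

    double mean = 0.0;
    for (double value : series) mean += value;
    mean /= static_cast<double>(n);

    // Sum of squared deviations, used as the common denominator.
    double variance_sum = 0.0;
    for (double value : series) variance_sum += (value - mean) * (value - mean);

    std::vector<double> result(maximum_lags_number, 0.0);

    // A constant series has no defined autocorrelation; zeros are returned here.
    if (variance_sum == 0.0) return result;

    for (std::size_t lag = 1; lag <= maximum_lags_number; ++lag)
    {
        double covariance_sum = 0.0;
        for (std::size_t t = 0; t + lag < n; ++t)
        {
            covariance_sum += (series[t] - mean) * (series[t + lag] - mean);
        }
        result[lag - 1] = covariance_sum / variance_sum;
    }

    return result;
}
```

Calling the function once per variable and stacking the results gives a matrix with one row per variable and one column per lag.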

◆ calculate_columns_box_plots()

Vector< BoxPlot > calculate_columns_box_plots ( ) const

Returns a vector of subvectors with the values of a box and whiskers plot. The size of the vector is equal to the number of used variables. The size of the subvectors is 5 and they consist of:

  • Minimum
  • First quartile
  • Second quartile
  • Third quartile
  • Maximum

Definition at line 4300 of file data_set.cpp.
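
The five values listed above can be obtained for one column as in this standalone sketch. BoxPlotSketch, quantile_sorted and box_plot_sketch are hypothetical names, and the quartiles are computed by linear interpolation between closest ranks, which is one of several common conventions and not necessarily the one used in data_set.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the library's BoxPlot type.
struct BoxPlotSketch
{
    double minimum;
    double first_quartile;
    double median;          // second quartile
    double third_quartile;
    double maximum;
};

// Quantile of an already sorted sample, by linear interpolation between closest ranks.
double quantile_sorted(const std::vector<double>& sorted, double probability)
{
    assert(!sorted.empty());
    assert(probability >= 0.0 && probability <= 1.0);

    const double position = probability * static_cast<double>(sorted.size() - 1);
    const std::size_t lower = static_cast<std::size_t>(position);
    const std::size_t upper = std::min(lower + 1, sorted.size() - 1);
    const double fraction = position - static_cast<double>(lower);

    return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
}

// Five-number summary of one column; the argument is copied so the caller's data stays unsorted.
BoxPlotSketch box_plot_sketch(std::vector<double> values)
{
    assert(!values.empty());

    std::sort(values.begin(), values.end());

    return { values.front(),
             quantile_sorted(values, 0.25),
             quantile_sorted(values, 0.5),
             quantile_sorted(values, 0.75),
             values.back() };
}
```

Applying box_plot_sketch to every used column would give one five-value summary per variable, matching the shape described above.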

◆ calculate_columns_descriptives()

Vector< Descriptives > calculate_columns_descriptives ( ) const

Returns a vector of vectors containing some basic descriptives of all the variables in the data set. The size of this vector is four. The subvectors are:

  • Minimum.
  • Maximum.
  • Mean.
  • Standard deviation.

Definition at line 4439 of file data_set.cpp.
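
As a minimal sketch of the four descriptives listed above, the following standalone C++ code computes them for one column and then for a set of columns. DescriptivesSketch and both function names are hypothetical. The standard deviation uses the sample (n - 1) denominator, which is an assumption about the convention and not a statement about data_set.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the library's Descriptives type.
struct DescriptivesSketch
{
    double minimum;
    double maximum;
    double mean;
    double standard_deviation;
};

// Minimum, maximum, mean and standard deviation of one column.
DescriptivesSketch column_descriptives_sketch(const std::vector<double>& column)
{
    assert(!column.empty());

    const auto extremes = std::minmax_element(column.begin(), column.end());

    double sum = 0.0;
    for (double value : column) sum += value;
    const double mean = sum / static_cast<double>(column.size());

    double squared_deviations = 0.0;
    for (double value : column) squared_deviations += (value - mean) * (value - mean);

    // Assumption: sample standard deviation (n - 1); zero for a single value.
    const double standard_deviation = column.size() > 1
        ? std::sqrt(squared_deviations / static_cast<double>(column.size() - 1))
        : 0.0;

    return { *extremes.first, *extremes.second, mean, standard_deviation };
}

// One DescriptivesSketch per column.
std::vector<DescriptivesSketch> columns_descriptives_sketch(
    const std::vector<std::vector<double>>& columns)
{
    std::vector<DescriptivesSketch> result;
    for (const std::vector<double>& column : columns)
    {
        result.push_back(column_descriptives_sketch(column));
    }
    return result;
}
```

Values of this kind are what the scaling methods in this class take as arguments, for example scale_inputs_minimum_maximum (const Vector< Descriptives > &).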

◆ calculate_columns_descriptives_classes()

Vector< Descriptives > calculate_columns_descriptives_classes ( const size_t &  class_index) const

Returns a matrix with the data set descriptive statistics.

Parameters
class_index  Index of the class for which the descriptive statistics are calculated.
Todo:
Low priority.

Definition at line 4623 of file data_set.cpp.

◆ calculate_columns_descriptives_eigen()

Eigen::MatrixXd calculate_columns_descriptives_eigen ( ) const

Returns the descriptives of all the variables in a single Eigen MatrixXd. The number of rows is the number of used variables. The number of columns is four (minimum, maximum, mean and standard deviation).

Todo:
Low priority.

Definition at line 4475 of file data_set.cpp.

◆ calculate_columns_descriptives_matrix()

Matrix< double > calculate_columns_descriptives_matrix ( ) const

Returns the descriptives of all the variables in a single matrix. The number of rows is the number of used variables. The number of columns is four (minimum, maximum, mean and standard deviation).

Definition at line 4449 of file data_set.cpp.

◆ calculate_columns_descriptives_negative_instances()

Vector< Descriptives > calculate_columns_descriptives_negative_instances ( ) const

Calculate the descriptives of the instances with negative targets in binary classification problems.

Todo:
Low priority.

Definition at line 4555 of file data_set.cpp.

◆ calculate_columns_descriptives_positive_instances()

Vector< Descriptives > calculate_columns_descriptives_positive_instances ( ) const

Calculate the descriptives of the instances with positive targets in binary classification problems.

Todo:
Low priority.

Definition at line 4488 of file data_set.cpp.

◆ calculate_columns_descriptives_selection_instances()

Vector< Descriptives > calculate_columns_descriptives_selection_instances ( ) const

Returns a vector of vectors containing some basic descriptives of all variables on the selection instances. The size of this vector is four. The subvectors are:

  • Selection data minimum.
  • Selection data maximum.
  • Selection data mean.
  • Selection data standard deviation.

Definition at line 4698 of file data_set.cpp.

◆ calculate_columns_descriptives_testing_instances()

Vector< Descriptives > calculate_columns_descriptives_testing_instances ( ) const

Returns a vector of vectors containing some basic descriptives of all variables on the testing instances. The size of this vector is four. The subvectors are:

  • Testing data minimum.
  • Testing data maximum.
  • Testing data mean.
  • Testing data standard deviation.

Definition at line 4717 of file data_set.cpp.

◆ calculate_columns_descriptives_training_instances()

Vector< Descriptives > calculate_columns_descriptives_training_instances ( ) const

Returns a vector of vectors containing some basic descriptives of all variables on the training instances. The size of this vector is four. The subvectors are:

  • Training data minimum.
  • Training data maximum.
  • Training data mean.
  • Training data standard deviation.

Definition at line 4679 of file data_set.cpp.

◆ calculate_columns_histograms()

Vector< Histogram > calculate_columns_histograms ( const size_t &  bins_number = 10) const

Returns a histogram for each variable with a given number of bins. The default number of bins is 10.

Parameters
bins_numberNumber of bins.

Definition at line 4264 of file data_set.cpp.
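A histogram with a given number of bins can be sketched for one column as below. This is a standalone example under stated assumptions: the bins have equal width between the column minimum and maximum, only the counts are returned, and the name histogram_counts is hypothetical. The library's Histogram structure is not reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts per equal-width bin between minimum and maximum.
std::vector<std::size_t> histogram_counts(const std::vector<double>& values,
                                          std::size_t bins_number = 10)
{
    assert(!values.empty() && bins_number > 0);

    const auto extremes = std::minmax_element(values.begin(), values.end());
    const double minimum = *extremes.first;
    const double width = (*extremes.second - minimum) / static_cast<double>(bins_number);

    std::vector<std::size_t> counts(bins_number, 0);
    for (double v : values)
    {
        // A constant column has zero width; all its values go to the first bin.
        std::size_t bin = width > 0.0 ? static_cast<std::size_t>((v - minimum) / width) : 0;
        // The maximum lies on the upper edge, so clamp it into the last bin.
        bin = std::min(bin, bins_number - 1);
        ++counts[bin];
    }
    return counts;
}
```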

◆ calculate_covariance_matrix()

Matrix< double > calculate_covariance_matrix ( ) const

Returns the covariance matrix for the input data set. The number of rows of the matrix is the number of inputs. The number of columns of the matrix is the number of inputs.

Definition at line 5196 of file data_set.cpp.

◆ calculate_default_scaling_methods()

Vector< string > calculate_default_scaling_methods ( ) const

Returns a vector of strings containing the scaling method that best fits each of the input variables.

Todo:
Low priority.

Definition at line 5522 of file data_set.cpp.

◆ calculate_input_target_columns_correlations()

Matrix< CorrelationResults > calculate_input_target_columns_correlations ( ) const

Calculates the linear correlations between all targets and all inputs. It returns a matrix whose number of rows is the number of targets and whose number of columns is the number of inputs. Each element contains the linear correlation between a single target and a single input.

Definition at line 4849 of file data_set.cpp.
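Each element of such a matrix is a linear correlation between two columns. The sketch below shows the Pearson coefficient for one input column and one target column. It is a standalone illustration with a hypothetical function name, not the library's CorrelationResults computation, and returning 0 for a constant column is a convention chosen here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: Pearson linear correlation between two columns.
double linear_correlation(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());

    double sum_x = 0.0;
    double sum_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) { sum_x += x[i]; sum_y += y[i]; }
    const double mean_x = sum_x / n;
    const double mean_y = sum_y / n;

    double sxy = 0.0;
    double sxx = 0.0;
    double syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        sxy += (x[i] - mean_x) * (y[i] - mean_y);
        sxx += (x[i] - mean_x) * (x[i] - mean_x);
        syy += (y[i] - mean_y) * (y[i] - mean_y);
    }

    // The coefficient is undefined for a constant column; return 0 by convention.
    if (sxx == 0.0 || syy == 0.0) return 0.0;
    return sxy / std::sqrt(sxx * syy);
}
```

Calling it once for every target and input pair fills a matrix of the shape described above.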

◆ calculate_input_target_columns_correlations_eigen()

Eigen::MatrixXd calculate_input_target_columns_correlations_eigen ( ) const

Calculates the linear correlations between all targets and all inputs. It returns an Eigen MatrixXd whose number of rows is the number of targets and whose number of columns is the number of inputs. Each element contains the linear correlation between a single target and a single input.

Definition at line 4956 of file data_set.cpp.

◆ calculate_input_variables_descriptives()

Vector< Descriptives > calculate_input_variables_descriptives ( ) const

Returns a Vector of Descriptives structures with some basic statistics of the input variables on the used instances. This includes the minimum, maximum, mean and standard deviation. The size of this vector is the number of inputs.

Definition at line 4731 of file data_set.cpp.

◆ calculate_inputs_correlations()

Matrix< double > calculate_inputs_correlations ( ) const

Calculate the correlation between each variable in the data set. Returns a matrix with the correlation values between variables in the data set.

Definition at line 5050 of file data_set.cpp.

◆ calculate_inputs_descriptives()

Descriptives calculate_inputs_descriptives ( const size_t &  input_index) const

Returns a vector with some basic descriptives of the given input variable on all instances. The size of this vector is four:

  • Input variable minimum.
  • Input variable maximum.
  • Input variable mean.
  • Input variable standard deviation.

Definition at line 4791 of file data_set.cpp.

◆ calculate_lag_plot() [1/2]

Matrix< double > calculate_lag_plot ( ) const
Todo:
, check

Definition at line 7769 of file data_set.cpp.

◆ calculate_lag_plot() [2/2]

Matrix< double > calculate_lag_plot ( const size_t &  maximum_lags_number)
Todo:
, check

Definition at line 7787 of file data_set.cpp.

◆ calculate_selection_negatives()

size_t calculate_selection_negatives ( const size_t &  target_index) const

Counts the number of negatives of the selected target in the selection data.

Parameters
target_index  Index of the target to evaluate.

Definition at line 4363 of file data_set.cpp.

◆ calculate_target_distribution()

Vector< size_t > calculate_target_distribution ( ) const

Returns a vector containing the number of instances of each class in the data set. If the number of target variables is one then the number of classes is two. If the number of target variables is greater than one then the number of classes is equal to the number of target variables.

Todo:
Low priority. Return class_distribution is wrong

Definition at line 7207 of file data_set.cpp.

◆ calculate_target_variables_descriptives()

Vector< Descriptives > calculate_target_variables_descriptives ( ) const

Returns a vector of vectors with some basic descriptives of the target variables on all instances. The size of this vector is four. The subvectors are:

  • Target variables minimum.
  • Target variables maximum.
  • Target variables mean.
  • Target variables standard deviation.

Definition at line 4750 of file data_set.cpp.

◆ calculate_testing_negatives()

size_t calculate_testing_negatives ( const size_t &  target_index) const

Counts the number of negatives of the selected target in the testing data.

Parameters
target_index  Index of the target to evaluate.

Definition at line 4398 of file data_set.cpp.

◆ calculate_training_negatives()

size_t calculate_training_negatives ( const size_t &  target_index) const

Counts the number of negatives of the selected target in the training data.

Parameters
target_index  Index of the target to evaluate.

Definition at line 4328 of file data_set.cpp.

◆ calculate_Tukey_outliers() [1/2]

Vector< Vector< size_t > > calculate_Tukey_outliers ( const double &  cleaning_parameter = 1.5) const

Calculates the outliers in the data set using Tukey's test.

Parameters
cleaning_parameter  Parameter used to detect outliers.
Todo:
Low priority.

Definition at line 7631 of file data_set.cpp.

◆ calculate_Tukey_outliers() [2/2]

Vector< size_t > calculate_Tukey_outliers ( const size_t &  column_index,
const double &  cleaning_parameter = 1.5 
) const

Calculates the outliers of a single variable in the data set using Tukey's test.

Parameters
column_index  Index of the variable for which the outliers are calculated.
cleaning_parameter  Parameter used to detect outliers.
Todo:
Low priority.

Definition at line 7587 of file data_set.cpp.
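Tukey's test flags the values that fall outside the fences Q1 - k * IQR and Q3 + k * IQR, where IQR = Q3 - Q1 and k plays the role of the cleaning parameter. The sketch below applies it to one column. It is standalone and illustrative: the function names are hypothetical and the quartile convention (linear interpolation) is an assumption, so results may differ slightly from the library's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: quantile of sorted data by linear interpolation.
double sorted_quantile(const std::vector<double>& sorted, double p)
{
    const double position = p * static_cast<double>(sorted.size() - 1);
    const std::size_t lower = static_cast<std::size_t>(position);
    const std::size_t upper = std::min(lower + 1, sorted.size() - 1);
    return sorted[lower] + (position - static_cast<double>(lower)) * (sorted[upper] - sorted[lower]);
}

// Returns the indices of the values outside Tukey's fences.
std::vector<std::size_t> tukey_outliers(const std::vector<double>& values,
                                        double cleaning_parameter = 1.5)
{
    assert(!values.empty());

    std::vector<double> sorted(values);
    std::sort(sorted.begin(), sorted.end());

    const double q1 = sorted_quantile(sorted, 0.25);
    const double q3 = sorted_quantile(sorted, 0.75);
    const double iqr = q3 - q1;
    const double lower_fence = q1 - cleaning_parameter * iqr;
    const double upper_fence = q3 + cleaning_parameter * iqr;

    std::vector<std::size_t> outliers;
    for (std::size_t i = 0; i < values.size(); ++i)
        if (values[i] < lower_fence || values[i] > upper_fence) outliers.push_back(i);
    return outliers;
}
```

A larger cleaning parameter widens the fences and therefore flags fewer values.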

◆ calculate_variables_means()

Vector< double > calculate_variables_means ( const Vector< size_t > &  variables_indices) const

Returns a vector containing the means of a set of given variables.

Parameters
variables_indices  Indices of the variables.

Definition at line 4763 of file data_set.cpp.

◆ filter_column() [1/2]

Vector< size_t > filter_column ( const size_t &  variable_index,
const double &  minimum,
const double &  maximum 
)

Filters a data set variable using a range. The values within the variable must be between minimum and maximum. Returns a vector of indices.

Parameters
variable_index  Index of the variable to be filtered.
minimum  Value that determines the lower limit.
maximum  Value that determines the upper limit.

Definition at line 8117 of file data_set.cpp.

◆ filter_column() [2/2]

Vector< size_t > filter_column ( const string &  variable_name,
const double &  minimum,
const double &  maximum 
)

Filters a data set variable using a range. The values within the variable must be between minimum and maximum. Returns a vector of indices.

Parameters
variable_name  Name of the variable to be filtered.
minimum  Value that determines the lower limit.
maximum  Value that determines the upper limit.

Definition at line 8151 of file data_set.cpp.

◆ filter_data()

Vector< size_t > filter_data ( const Vector< double > &  minimums,
const Vector< double > &  maximums 
)

Unuses those instances with values outside a defined range.

Parameters
minimums  Vector of minimum values in the range. The size must be equal to the number of variables.
maximums  Vector of maximum values in the range. The size must be equal to the number of variables.
Todo:
Low priority.

Definition at line 8046 of file data_set.cpp.
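The range check can be pictured with the standalone sketch below, which returns the indices of the rows having at least one value outside its column's range. The function name rows_outside_range, the nested-vector data layout and the interpretation of the returned indices are assumptions for illustration. The documented method additionally changes the use of the affected instances, which is not modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: indices of the rows with at least one value outside
// [minimums[j], maximums[j]] for its column j.
std::vector<std::size_t> rows_outside_range(const std::vector<std::vector<double>>& data,
                                            const std::vector<double>& minimums,
                                            const std::vector<double>& maximums)
{
    assert(minimums.size() == maximums.size());

    std::vector<std::size_t> filtered;
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        assert(data[i].size() == minimums.size());
        for (std::size_t j = 0; j < data[i].size(); ++j)
        {
            if (data[i][j] < minimums[j] || data[i][j] > maximums[j])
            {
                filtered.push_back(i);
                break;  // one violation is enough to filter the row
            }
        }
    }
    return filtered;
}
```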

◆ from_XML()

void from_XML ( const tinyxml2::XMLDocument data_set_document)

Definition at line 6622 of file data_set.cpp.

◆ generate_constant_data()

void generate_constant_data ( const size_t &  instances_number,
const size_t &  variables_number 
)

Generates an artificial dataset with a given number of instances and number of variables using constant data.

Parameters
instances_number  Number of instances in the dataset.
variables_number  Number of variables in the dataset.

Definition at line 7818 of file data_set.cpp.

◆ generate_data_binary_classification()

void generate_data_binary_classification ( const size_t &  instances_number,
const size_t &  inputs_number 
)

Generate artificial data for a binary classification problem with a given number of instances and inputs.

Parameters
instances_number  Number of instances to generate.
inputs_number  Number of variables that the data set will have.

Definition at line 7970 of file data_set.cpp.

◆ generate_data_multiple_classification()

void generate_data_multiple_classification ( const size_t &  instances_number,
const size_t &  inputs_number,
const size_t &  outputs_number 
)
Todo:
Low priority.

Definition at line 8003 of file data_set.cpp.

◆ generate_paraboloid_data()

void generate_paraboloid_data ( const size_t &  instances_number,
const size_t &  variables_number 
)

Generates an artificial dataset with a given number of instances and number of variables using paraboloid data.

Parameters
instances_number  Number of instances in the dataset.
variables_number  Number of variables in the dataset.

Definition at line 7872 of file data_set.cpp.

◆ generate_random_data()

void generate_random_data ( const size_t &  instances_number,
const size_t &  variables_number 
)

Generates an artificial dataset with a given number of instances and number of variables using random data.

Parameters
instances_number  Number of instances in the dataset.
variables_number  Number of variables in the dataset.

Definition at line 7840 of file data_set.cpp.

◆ generate_Rosenbrock_data()

void generate_Rosenbrock_data ( const size_t &  instances_number,
const size_t &  variables_number 
)

Generates an artificial dataset with a given number of instances and number of variables using the Rosenbrock function.

Parameters
instances_number  Number of instances in the dataset.
variables_number  Number of variables in the dataset.

Definition at line 7896 of file data_set.cpp.
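To make the idea concrete, the sketch below builds a small Rosenbrock data set with standard containers. Several details are assumptions rather than the library's behaviour: the inputs are drawn uniformly from [-1, 1], the last column holds the target, variables_number counts the target column, and the names rosenbrock and rosenbrock_data are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Rosenbrock function of an input vector with at least two components.
double rosenbrock(const std::vector<double>& x)
{
    assert(x.size() >= 2);
    double value = 0.0;
    for (std::size_t i = 0; i + 1 < x.size(); ++i)
    {
        const double a = x[i + 1] - x[i] * x[i];
        const double b = 1.0 - x[i];
        value += 100.0 * a * a + b * b;
    }
    return value;
}

// Hypothetical generator: each row holds variables_number - 1 random inputs
// followed by one target, the Rosenbrock value of those inputs.
std::vector<std::vector<double>> rosenbrock_data(std::size_t instances_number,
                                                 std::size_t variables_number,
                                                 unsigned seed = 0)
{
    assert(variables_number >= 3);

    std::mt19937 generator(seed);
    std::uniform_real_distribution<double> uniform(-1.0, 1.0);

    std::vector<std::vector<double>> data(instances_number);
    for (auto& row : data)
    {
        std::vector<double> inputs(variables_number - 1);
        for (double& v : inputs) v = uniform(generator);
        row = inputs;
        row.push_back(rosenbrock(inputs));
    }
    return data;
}
```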

◆ generate_sequential_data()

void generate_sequential_data ( const size_t &  instances_number,
const size_t &  variables_number 
)

Generates an artificial dataset with a given number of instances and number of variables using sequential data.

Parameters
instances_number  Number of instances in the dataset.
variables_number  Number of variables in the dataset.

Definition at line 7853 of file data_set.cpp.

◆ get_data()

const Matrix< double > & get_data ( ) const

Returns a reference to the data matrix in the data set. The number of rows is equal to the number of instances. The number of columns is equal to the number of variables.

Definition at line 2636 of file data_set.cpp.

◆ get_data_eigen()

const Eigen::MatrixXd get_data_eigen ( ) const

Returns the data matrix in the data set as an Eigen MatrixXd. The number of rows is equal to the number of instances. The number of columns is equal to the number of variables.

Definition at line 2646 of file data_set.cpp.

◆ get_display()

const bool & get_display ( ) const

Returns true if messages from this class can be displayed on the screen, or false if messages from this class can't be displayed on the screen.

Definition at line 139 of file data_set.cpp.

◆ get_input_data() [1/2]

Matrix< double > get_input_data ( ) const

Returns a matrix with the input variables in the data set. The number of rows is the number of instances. The number of columns is the number of input variables.

Definition at line 2919 of file data_set.cpp.

◆ get_input_data() [2/2]

Tensor< double > get_input_data ( const Vector< size_t > &  instances_indices) const

Returns a tensor with the input variables in the data set. The number of rows is the number of instances. The number of columns is the number of input variables.

Definition at line 2982 of file data_set.cpp.

◆ get_input_data_eigen()

Eigen::MatrixXd get_input_data_eigen ( ) const

Returns an Eigen MatrixXd with the input variables in the data set. The number of rows is the number of instances. The number of columns is the number of input variables.

Definition at line 2935 of file data_set.cpp.

◆ get_input_data_float()

Matrix< float > get_input_data_float ( const Vector< size_t > &  instances_indices) const

Returns a matrix of float type with the input variables in the data set. The number of rows is the number of instances. The number of columns is the number of input variables.

Definition at line 3010 of file data_set.cpp.

◆ get_input_variables_names()

Vector< string > get_input_variables_names ( ) const

Returns the names of the input variables in the data set. The size of the vector is the number of input variables.

Definition at line 1693 of file data_set.cpp.

◆ get_instance_data() [1/2]

Vector< double > get_instance_data ( const size_t &  index) const

Returns the inputs and target values of a single instance in the data set.

Parameters
index  Index of the instance.

Definition at line 3256 of file data_set.cpp.

◆ get_instance_data() [2/2]

Vector< double > get_instance_data ( const size_t &  instance_index,
const Vector< size_t > &  variables_indices 
) const

Returns the inputs and target values of a single instance in the data set.

Parameters
instance_index  Index of the instance.
variables_indices  Indices of the variables.

Definition at line 3285 of file data_set.cpp.

◆ get_instance_input_data()

Tensor< double > get_instance_input_data ( const size_t &  instance_index) const

Returns the inputs values of a single instance in the data set.

Parameters
instance_index  Index of the instance.

Definition at line 3311 of file data_set.cpp.

◆ get_instance_target_data()

Tensor< double > get_instance_target_data ( const size_t &  instance_index) const

Returns the target values of a single instance in the data set.

Parameters
instance_index  Index of the instance.

Definition at line 3323 of file data_set.cpp.

◆ get_instance_use()

DataSet::InstanceUse get_instance_use ( const size_t &  index) const

Returns the use of a single instance.

Parameters
index  Instance index.

Definition at line 854 of file data_set.cpp.

◆ get_instances_uses_numbers()

Vector< size_t > get_instances_uses_numbers ( ) const

Returns a vector with the number of training, selection, testing and unused instances. The size of that vector is therefore four.

Definition at line 680 of file data_set.cpp.

◆ get_scaling_unscaling_method()

DataSet::ScalingUnscalingMethod get_scaling_unscaling_method ( const string &  scaling_unscaling_method)
static

Returns a value of the scaling-unscaling method enumeration from a string containing the name of that method.

Parameters
scaling_unscaling_method  String with the name of the scaling and unscaling method.

Definition at line 2781 of file data_set.cpp.

◆ get_selection_batches()

Vector< Vector< size_t > > get_selection_batches ( const bool &  shuffle_batches_instances = true) const

Returns a vector, where each element is a vector that contains the indices of the different batches of the selection instances.

Parameters
shuffle_batches_instances  Boolean. If true, the indices are shuffled into batches; otherwise they are not.

Definition at line 887 of file data_set.cpp.

◆ get_selection_data()

Matrix< double > get_selection_data ( ) const

Returns a matrix with the selection instances in the data set. The number of rows is the number of selection instances. The number of columns is the number of variables.

Definition at line 2856 of file data_set.cpp.

◆ get_selection_data_eigen()

Eigen::MatrixXd get_selection_data_eigen ( ) const

Returns an Eigen MatrixXd with the selection instances in the data set. The number of rows is the number of selection instances. The number of columns is the number of variables.

Definition at line 2872 of file data_set.cpp.

◆ get_selection_input_data()

Tensor< double > get_selection_input_data ( ) const

Returns a tensor with selection instances and input variables. The number of rows is the number of selection instances. The number of columns is the number of input variables.

Definition at line 3137 of file data_set.cpp.

◆ get_selection_input_data_eigen()

Eigen::MatrixXd get_selection_input_data_eigen ( ) const

Returns an Eigen MatrixXd with selection instances and input variables. The number of rows is the number of selection instances. The number of columns is the number of input variables.

Definition at line 3151 of file data_set.cpp.

◆ get_selection_target_data()

Tensor< double > get_selection_target_data ( ) const

Returns a tensor with selection instances and target variables. The number of rows is the number of selection instances. The number of columns is the number of target variables.

Definition at line 3167 of file data_set.cpp.

◆ get_selection_target_data_eigen()

Eigen::MatrixXd get_selection_target_data_eigen ( ) const

Returns an Eigen MatrixXd with selection instances and target variables. The number of rows is the number of selection instances. The number of columns is the number of target variables.

Definition at line 3181 of file data_set.cpp.

◆ get_target_data() [1/2]

Matrix< double > get_target_data ( ) const

Returns a matrix with the target variables in the data set. The number of rows is the number of instances. The number of columns is the number of target variables.

Definition at line 2951 of file data_set.cpp.

◆ get_target_data() [2/2]

Tensor< double > get_target_data ( const Vector< size_t > &  instances_indices) const

Returns a tensor with the target variables in the data set. The number of rows is the number of instances. The number of columns is the number of target variables.

Definition at line 2996 of file data_set.cpp.

◆ get_target_data_eigen()

Eigen::MatrixXd get_target_data_eigen ( ) const

Returns an Eigen MatrixXd with the target variables in the data set. The number of rows is the number of instances. The number of columns is the number of target variables.

Definition at line 2966 of file data_set.cpp.

◆ get_target_data_float()

Matrix< float > get_target_data_float ( const Vector< size_t > &  instances_indices) const

Returns a matrix of float type with the target variables in the data set. The number of rows is the number of instances. The number of columns is the number of target variables.

Definition at line 3041 of file data_set.cpp.

◆ get_target_variables_names()

Vector< string > get_target_variables_names ( ) const

Returns the names of the target variables in the data set. The size of the vector is the number of target variables.

Definition at line 1721 of file data_set.cpp.

◆ get_testing_batches()

Vector< Vector< size_t > > get_testing_batches ( const bool &  shuffle_batches_instances = true) const

Returns a vector, where each element is a vector that contains the indices of the different batches of the testing instances.

Parameters
shuffle_batches_instances  Boolean. If true, the indices within batches are shuffled; otherwise they are not.

Definition at line 901 of file data_set.cpp.

◆ get_testing_data()

Matrix< double > get_testing_data ( ) const

Returns a matrix with the testing instances in the data set. The number of rows is the number of testing instances. The number of columns is the number of variables.

Definition at line 2888 of file data_set.cpp.

◆ get_testing_data_eigen()

Eigen::MatrixXd get_testing_data_eigen ( ) const

Returns an Eigen MatrixXd with the testing instances in the data set. The number of rows is the number of testing instances. The number of columns is the number of variables.

Definition at line 2903 of file data_set.cpp.

◆ get_testing_input_data()

Tensor< double > get_testing_input_data ( ) const

Returns a tensor with testing instances and input variables. The number of rows is the number of testing instances. The number of columns is the number of input variables.

Definition at line 3197 of file data_set.cpp.

◆ get_testing_input_data_eigen()

Eigen::MatrixXd get_testing_input_data_eigen ( ) const

Returns an Eigen MatrixXd with testing instances and input variables. The number of rows is the number of testing instances. The number of columns is the number of input variables.

Definition at line 3211 of file data_set.cpp.

◆ get_testing_target_data()

Tensor< double > get_testing_target_data ( ) const

Returns a tensor with testing instances and target variables. The number of rows is the number of testing instances. The number of columns is the number of target variables.

Definition at line 3227 of file data_set.cpp.

◆ get_testing_target_data_eigen()

Eigen::MatrixXd get_testing_target_data_eigen ( ) const

Returns an Eigen MatrixXd with testing instances and target variables. The number of rows is the number of testing instances. The number of columns is the number of target variables.

Definition at line 3241 of file data_set.cpp.

◆ get_time_series_data()

const Matrix< double > & get_time_series_data ( ) const

Returns a reference to the time series data matrix in the data set. Only for time series problems.

Definition at line 2662 of file data_set.cpp.

◆ get_training_batches()

Vector< Vector< size_t > > get_training_batches ( const bool &  shuffle_batches_instances = true) const

Returns a vector, where each element is a vector that contains the indices of the different batches of the training instances.

Parameters
shuffle_batches_instances  Boolean. If true, the indices are shuffled into batches; otherwise they are not.
Todo:
In forecasting must be false.

Definition at line 873 of file data_set.cpp.
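The idea of splitting instance indices into batches, with optional shuffling, can be sketched as follows. This is a standalone illustration: the function make_batches, the explicit batch_size and seed parameters, and the handling of a smaller final batch are assumptions of the example and are not taken from the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: splits instance indices into consecutive batches,
// optionally shuffling the indices first. Any remainder forms a smaller last batch.
std::vector<std::vector<std::size_t>> make_batches(std::vector<std::size_t> indices,
                                                   std::size_t batch_size,
                                                   bool shuffle = true,
                                                   unsigned seed = 0)
{
    assert(batch_size > 0);

    if (shuffle)
    {
        std::mt19937 generator(seed);
        std::shuffle(indices.begin(), indices.end(), generator);
    }

    std::vector<std::vector<std::size_t>> batches;
    for (std::size_t start = 0; start < indices.size(); start += batch_size)
    {
        const std::size_t end = std::min(start + batch_size, indices.size());
        batches.emplace_back(indices.begin() + start, indices.begin() + end);
    }
    return batches;
}
```

With shuffling disabled the batches keep the original order of the indices, which is the behaviour the Todo note above asks for in forecasting problems.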

◆ get_training_data()

Matrix< double > get_training_data ( ) const

Returns a matrix with the training instances in the data set. The number of rows is the number of training instances. The number of columns is the number of variables.

Definition at line 2824 of file data_set.cpp.

◆ get_training_data_eigen()

Eigen::MatrixXd get_training_data_eigen ( ) const

Returns an Eigen MatrixXd with the training instances in the data set. The number of rows is the number of training instances. The number of columns is the number of variables.

Definition at line 2840 of file data_set.cpp.

◆ get_training_input_data()

Tensor< double > get_training_input_data ( ) const

Returns a tensor with training instances and input variables. The number of rows is the number of training instances. The number of columns is the number of input variables.

Definition at line 3074 of file data_set.cpp.

◆ get_training_input_data_eigen()

Eigen::MatrixXd get_training_input_data_eigen ( ) const

Returns an Eigen MatrixXd with training instances and input variables. The number of rows is the number of training instances. The number of columns is the number of input variables.

Definition at line 3090 of file data_set.cpp.

◆ get_training_target_data()

Tensor< double > get_training_target_data ( ) const

Returns a tensor with training instances and target variables. The number of rows is the number of training instances. The number of columns is the number of target variables.

Definition at line 3107 of file data_set.cpp.

◆ get_training_target_data_eigen()

Eigen::MatrixXd get_training_target_data_eigen ( ) const

Returns an Eigen MatrixXd with training instances and target variables. The number of rows is the number of training instances. The number of columns is the number of target variables.

Definition at line 3121 of file data_set.cpp.

◆ get_unused_instances_number()

size_t get_unused_instances_number ( ) const

Returns the number of instances in the data set which will not be used for training, selection or testing.

Definition at line 986 of file data_set.cpp.

◆ get_used_instances_number()

size_t get_used_instances_number ( ) const

Returns the total number of training, selection and testing instances, i.e. those which are not "Unused".

Definition at line 974 of file data_set.cpp.

◆ get_variable_data() [1/4]

Vector< double > get_variable_data ( const size_t &  index) const

Returns all the instances of a single variable in the data set.

Parameters
index  Index of the variable.

Definition at line 3414 of file data_set.cpp.

◆ get_variable_data() [2/4]

Vector< double > get_variable_data ( const size_t &  variable_index,
const Vector< size_t > &  instances_indices 
) const

Returns a given set of instances of a single variable in the data set.

Parameters
variable_index  Index of the variable.
instances_indices  Indices of the instances.

Definition at line 3482 of file data_set.cpp.

◆ get_variable_data() [3/4]

Vector< double > get_variable_data ( const string &  variable_name) const

Returns all the instances of a single variable in the data set.

Parameters
variable_name  Name of the variable.

Definition at line 3440 of file data_set.cpp.

◆ get_variable_data() [4/4]

Vector< double > get_variable_data ( const string &  variable_name,
const Vector< size_t > &  instances_indices 
) const

Returns a given set of instances of a single variable in the data set.

Parameters
variable_name  Name of the variable.
instances_indices  Indices of the instances.

Definition at line 3509 of file data_set.cpp.

◆ get_variable_index()

size_t get_variable_index ( const string &  name) const

Returns a variable index in the data set with given name.

Parameters
name  Name of the variable.

Definition at line 2153 of file data_set.cpp.

◆ get_variable_name()

string get_variable_name ( const size_t &  variable_index) const

Returns the name of a single variable in the data set.

Parameters
variable_index  Index of the variable.

Definition at line 1605 of file data_set.cpp.

◆ get_variable_use()

DataSet::VariableUse get_variable_use ( const size_t &  index) const

Returns the use of a single variable.

Parameters
index  Index of the variable.

Definition at line 1543 of file data_set.cpp.

◆ get_variables_names()

Vector< string > get_variables_names ( ) const

Returns a string vector with the names of all the variables in the data set. The size of the vector is the number of variables.

Definition at line 1664 of file data_set.cpp.

◆ get_variables_uses()

Vector< DataSet::VariableUse > get_variables_uses ( ) const

Returns a vector containing the use of each column, including the categories. The size of the vector is equal to the number of variables.

Definition at line 1575 of file data_set.cpp.

◆ has_data()

bool has_data ( ) const

Returns true if the data matrix is not empty (it has been loaded), and false otherwise.

Definition at line 8026 of file data_set.cpp.

◆ initialize_data()

void initialize_data ( const double &  new_value)

Initializes the data matrix with a given value.

Parameters
new_value  Initialization value.

Definition at line 6276 of file data_set.cpp.

◆ is_instance_unused()

bool is_instance_unused ( const size_t &  index) const

Returns true if a given instance is to be unused, and false otherwise.

Parameters
index  Instance index.

Definition at line 663 of file data_set.cpp.

◆ is_instance_used()

bool is_instance_used ( const size_t &  index) const

Returns true if a given instance is to be used for training, selection or testing, and false if it is to be unused.

Parameters
index  Instance index.

Definition at line 647 of file data_set.cpp.

◆ load()

void load ( const string &  file_name)

Loads the members of a data set object from an XML-type file:

  • Instances number.
  • Training instances number.
  • Training instances indices.
  • Selection instances number.
  • Selection instances indices.
  • Testing instances number.
  • Testing instances indices.
  • Input variables number.
  • Input variables indices.
  • Target variables number.
  • Target variables indices.
  • Input variables name.
  • Target variables name.
  • Input variables description.
  • Target variables description.
  • Display.
  • Data.

Note that the file must follow the format specified in the User's Guide.

Parameters
file_name  Name of data set XML-type file.

Definition at line 6994 of file data_set.cpp.

◆ numeric_to_categorical()

void numeric_to_categorical ( const size_t &  variable_index)

This method converts a numerical variable into categorical. Note that this method resizes the dataset.

Parameters
variable_index  Index of the variable to be converted.

Definition at line 8183 of file data_set.cpp.

◆ perform_principal_components_analysis() [1/2]

Matrix< double > perform_principal_components_analysis ( const double &  minimum_explained_variance = 0.0)

Performs the principal components analysis of the inputs. It returns a matrix containing the principal components arranged in rows. This method deletes the unused instances of the original data set.

Parameters
minimum_explained_variance  Minimum percentage of variance used to select a principal component.

Definition at line 5231 of file data_set.cpp.

◆ perform_principal_components_analysis() [2/2]

Matrix< double > perform_principal_components_analysis ( const Matrix< double > &  covariance_matrix,
const Vector< double > &  explained_variance,
const double &  minimum_explained_variance = 0.0 
)

Performs the principal components analysis of the inputs. It returns a matrix containing the principal components arranged in rows. This method deletes the unused instances of the original data set.

Parameters
covariance_matrix  Matrix of covariances.
explained_variance  Vector of the explained variances of the variables.
minimum_explained_variance  Minimum percentage of variance used to select a principal component.

Definition at line 5314 of file data_set.cpp.

◆ print_data_preview()

void print_data_preview ( ) const

Prints to the screen a preview of the data matrix, i.e., the first, second and last instances.

Definition at line 7040 of file data_set.cpp.

◆ print_top_input_target_columns_correlations()

void print_top_input_target_columns_correlations ( const size_t &  number = 10) const

Prints on screen the correlations between inputs and targets.

Parameters
number  Number of variables to be printed.

Definition at line 5014 of file data_set.cpp.

◆ print_top_inputs_correlations()

void print_top_inputs_correlations ( const size_t &  number = 10) const

Prints to the screen the correlations between the input variables.

Parameters
numberNumber of variables to be printed.
Todo:
Low priority.

Definition at line 5159 of file data_set.cpp.

◆ randomize_data_normal()

void randomize_data_normal ( const double &  mean = 0.0,
const double &  standard_deviation = 1.0 
)

Initializes the data matrix with random values chosen from a normal distribution with given mean and standard deviation.

Definition at line 6294 of file data_set.cpp.

◆ randomize_data_uniform()

void randomize_data_uniform ( const double &  minimum = -1.0,
const double &  maximum = 1.0 
)

Initializes the data matrix with random values chosen from a uniform distribution with given minimum and maximum.

Definition at line 6285 of file data_set.cpp.
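The two randomization methods can be approximated outside the library with the standard <random> facilities. The following is a minimal sketch, not the DataSet implementation: the flat row-major storage, the function names and the seed parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Standalone sketch. The matrix is stored row-major in a flat vector; the
// seed parameter is an assumption added so that runs are reproducible.
std::vector<double> random_uniform_data(std::size_t rows, std::size_t columns,
                                        double minimum = -1.0, double maximum = 1.0,
                                        unsigned seed = 0)
{
    assert(minimum < maximum);

    std::mt19937 generator(seed);
    std::uniform_real_distribution<double> distribution(minimum, maximum);

    std::vector<double> data(rows * columns);
    for (double& value : data) value = distribution(generator);

    return data;
}

// Same idea with values drawn from a normal distribution.
std::vector<double> random_normal_data(std::size_t rows, std::size_t columns,
                                       double mean = 0.0, double standard_deviation = 1.0,
                                       unsigned seed = 0)
{
    assert(standard_deviation > 0.0);

    std::mt19937 generator(seed);
    std::normal_distribution<double> distribution(mean, standard_deviation);

    std::vector<double> data(rows * columns);
    for (double& value : data) value = distribution(generator);

    return data;
}
```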

◆ save()

void save ( const string &  file_name) const

Saves the members of a data set object to an XML file.

Parameters
file_nameName of the data set XML file.

Definition at line 6954 of file data_set.cpp.

◆ scale_data_mean_standard_deviation() [1/2]

Vector< Descriptives > scale_data_mean_standard_deviation ( )

Scales the data using the mean and standard deviation method, and the mean and standard deviation values calculated from the data matrix. It also returns the descriptives from all columns.

Definition at line 5478 of file data_set.cpp.

◆ scale_data_mean_standard_deviation() [2/2]

void scale_data_mean_standard_deviation ( const Vector< Descriptives > &  data_descriptives)

Scales the data matrix with given mean and standard deviation values. It updates the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the variables in the data set. The size of that vector must be equal to the number of variables.

Definition at line 5422 of file data_set.cpp.
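Mean and standard deviation scaling replaces each value x of a column by (x - mean) / standard_deviation. A minimal standalone sketch follows; ColumnDescriptives is a hypothetical stand-in for the library's Descriptives structure, and leaving constant columns unchanged is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Descriptives structure.
struct ColumnDescriptives
{
    double mean;
    double standard_deviation;
};

// Scales every column of the matrix in place. Columns with zero standard
// deviation are left unchanged (an assumption of this sketch).
void scale_mean_standard_deviation(std::vector<std::vector<double>>& data,
                                   const std::vector<ColumnDescriptives>& descriptives)
{
    for (std::vector<double>& row : data)
    {
        // One descriptives entry is expected per variable.
        assert(row.size() == descriptives.size());

        for (std::size_t j = 0; j < row.size(); ++j)
        {
            if (descriptives[j].standard_deviation > 0.0)
            {
                row[j] = (row[j] - descriptives[j].mean) / descriptives[j].standard_deviation;
            }
        }
    }
}
```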

◆ scale_data_minimum_maximum() [1/2]

Vector< Descriptives > scale_data_minimum_maximum ( )

Scales the data using the minimum and maximum method, and the minimum and maximum values calculated from the data matrix. It also returns the descriptives from all columns.

Definition at line 5464 of file data_set.cpp.

◆ scale_data_minimum_maximum() [2/2]

void scale_data_minimum_maximum ( const Vector< Descriptives > &  data_descriptives)

Scales the data matrix with given minimum and maximum values. It updates the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the variables in the data set. The size of that vector must be equal to the number of variables.

Definition at line 5559 of file data_set.cpp.
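Minimum and maximum scaling maps each column linearly onto a fixed range. The sketch below handles a single column, computes the extremes from the data and returns them, in the spirit of the overload without arguments. The target range [-1, 1] and the function name are assumptions of this sketch, not statements about the library.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Scales one column in place to [-1, 1] (assumed target range) and returns
// the (minimum, maximum) pair that was used. A constant column is left as is.
std::pair<double, double> scale_minimum_maximum(std::vector<double>& column)
{
    assert(!column.empty());

    const auto extremes = std::minmax_element(column.begin(), column.end());
    const double minimum = *extremes.first;
    const double maximum = *extremes.second;

    if (maximum > minimum)
    {
        for (double& value : column)
        {
            value = 2.0 * (value - minimum) / (maximum - minimum) - 1.0;
        }
    }

    return {minimum, maximum};
}
```

Returning the extremes makes it possible to apply the same transformation to new data, or to undo it later.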

◆ scale_input_mean_standard_deviation() [1/2]

void scale_input_mean_standard_deviation ( const Descriptives input_statistics,
const size_t &  input_index 
)

Scales the given input variable with the given mean and standard deviation values. It updates the input variable of the data matrix.

Parameters
input_statisticsDescriptives structure for the input variable.
input_indexIndex of the input to be scaled.

Definition at line 5643 of file data_set.cpp.

◆ scale_input_mean_standard_deviation() [2/2]

Descriptives scale_input_mean_standard_deviation ( const size_t &  input_index)

Scales the given input variable with the mean and standard deviation values calculated from the data matrix. It updates the input variable of the data matrix. It also returns the descriptives of that variable.

Parameters
input_indexIndex of the input to be scaled.

Definition at line 5658 of file data_set.cpp.

◆ scale_input_minimum_maximum() [1/2]

void scale_input_minimum_maximum ( const Descriptives input_statistics,
const size_t &  input_index 
)

Scales the given input variable with the given minimum and maximum values. It updates the input variable of the data matrix.

Parameters
input_statisticsDescriptives structure for the input variable.
input_indexIndex of the input to be scaled.

Definition at line 5800 of file data_set.cpp.

◆ scale_input_minimum_maximum() [2/2]

Descriptives scale_input_minimum_maximum ( const size_t &  input_index)

Scales the given input variable with the minimum and maximum values calculated from the data matrix. It updates the input variable of the data matrix. It also returns the descriptives of that variable, which contain its minimum and maximum values.

Definition at line 5814 of file data_set.cpp.

◆ scale_input_standard_deviation() [1/2]

void scale_input_standard_deviation ( const Descriptives input_statistics,
const size_t &  input_index 
)

Scales the given input variable with the given standard deviation value. It updates the input variable of the data matrix.

Parameters
input_statisticsDescriptives structure for the input variable.
input_indexIndex of the input to be scaled.

Definition at line 5688 of file data_set.cpp.

◆ scale_input_standard_deviation() [2/2]

Descriptives scale_input_standard_deviation ( const size_t &  input_index)

Scales the given input variable with the standard deviation value calculated from the data matrix. It updates the input variable of the data matrix. It also returns the descriptives of that variable.

Parameters
input_indexIndex of the input to be scaled.

Definition at line 5703 of file data_set.cpp.

◆ scale_inputs() [1/3]

Vector< Descriptives > scale_inputs ( const string &  scaling_unscaling_method)

Calculates the input and target variables descriptives. Then it scales the input variables with those values. The method to be used is that in the scaling and unscaling method variable. Finally, it returns the descriptives.

Definition at line 5844 of file data_set.cpp.

◆ scale_inputs() [2/3]

void scale_inputs ( const string &  scaling_unscaling_method,
const Vector< Descriptives > &  inputs_descriptives 
)

Scales the input variables with the given descriptives. The method to be used is that in the scaling and unscaling method variable.

Definition at line 5874 of file data_set.cpp.

◆ scale_inputs() [3/3]

void scale_inputs ( const Vector< string > &  scaling_unscaling_methods,
const Vector< Descriptives > &  inputs_descriptives 
)

It scales every input variable with the given method. The method to be used is that in the scaling and unscaling method variable.

Definition at line 5913 of file data_set.cpp.
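Scaling every input with its own method amounts to a per-variable dispatch on the method name. The sketch below is an illustration under assumptions: the strings are taken to match the ScalingUnscalingMethod enumerators, the minimum and maximum method is assumed to map onto [-1, 1], and VariableDescriptives, scale_value and scale_row are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the Descriptives structure.
struct VariableDescriptives
{
    double minimum;
    double maximum;
    double mean;
    double standard_deviation;
};

// Scales a single value with the method named by the string.
double scale_value(double value, const std::string& method, const VariableDescriptives& d)
{
    if (method == "NoScaling") return value;
    if (method == "MinimumMaximum") return 2.0 * (value - d.minimum) / (d.maximum - d.minimum) - 1.0;
    if (method == "MeanStandardDeviation") return (value - d.mean) / d.standard_deviation;
    if (method == "StandardDeviation") return value / d.standard_deviation;

    throw std::invalid_argument("Unknown scaling method: " + method);
}

// Scales one instance in place, using one method and one descriptives entry per input.
void scale_row(std::vector<double>& row,
               const std::vector<std::string>& methods,
               const std::vector<VariableDescriptives>& descriptives)
{
    assert(row.size() == methods.size() && row.size() == descriptives.size());

    for (std::size_t j = 0; j < row.size(); ++j)
    {
        row[j] = scale_value(row[j], methods[j], descriptives[j]);
    }
}
```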

◆ scale_inputs_mean_standard_deviation() [1/2]

Vector< Descriptives > scale_inputs_mean_standard_deviation ( )

Scales the input variables with the mean and standard deviation values calculated from the data matrix. It updates the input variables of the data matrix. It also returns a vector with the descriptives of the input variables.

Definition at line 5613 of file data_set.cpp.

◆ scale_inputs_mean_standard_deviation() [2/2]

void scale_inputs_mean_standard_deviation ( const Vector< Descriptives > &  inputs_descriptives)

Scales the input variables with given mean and standard deviation values. It updates the input variables of the data matrix.

Parameters
inputs_descriptivesVector of descriptives structures for the input variables. The size of that vector must be equal to the number of inputs.

Definition at line 5601 of file data_set.cpp.

◆ scale_inputs_minimum_maximum() [1/2]

Vector< Descriptives > scale_inputs_minimum_maximum ( )

Scales the input variables with the minimum and maximum values calculated from the data matrix. It updates the input variables of the data matrix. It also returns a vector with the descriptives of the input variables, which contain their minimum and maximum values.

Definition at line 5745 of file data_set.cpp.

◆ scale_inputs_minimum_maximum() [2/2]

void scale_inputs_minimum_maximum ( const Vector< Descriptives > &  inputs_descriptives)

Scales the input variables with given minimum and maximum values. It updates the input variables of the data matrix.

Parameters
inputs_descriptivesVector of descriptives structures for all the inputs in the data set. The size of that vector must be equal to the number of input variables.

Definition at line 5733 of file data_set.cpp.

◆ scale_inputs_minimum_maximum_eigen()

Eigen::MatrixXd scale_inputs_minimum_maximum_eigen ( )

Scales the input variables with the minimum and maximum values calculated from the data matrix. It updates the input variables of the data matrix. It also returns the minimum and maximum values of the input variables as an Eigen matrix.

Definition at line 5775 of file data_set.cpp.

◆ scale_targets() [1/2]

Vector< Descriptives > scale_targets ( const string &  scaling_unscaling_method)

Calculates the input and target variables descriptives. Then it scales the target variables with those values. The method to be used is that in the scaling and unscaling method variable. Finally, it returns the descriptives.

Definition at line 6116 of file data_set.cpp.

◆ scale_targets() [2/2]

void scale_targets ( const string &  scaling_unscaling_method,
const Vector< Descriptives > &  targets_descriptives 
)

Scales the target variables with the given descriptives. The method to be used is that in the scaling and unscaling method variable.

Definition at line 6157 of file data_set.cpp.

◆ scale_targets_logarithmic() [1/2]

Vector< Descriptives > scale_targets_logarithmic ( )

Scales the target variables with the logarithmic scale using the minimum and maximum values calculated from the data matrix. It updates the target variables of the data matrix. It also returns a vector with the descriptives of the target variables.

Definition at line 6101 of file data_set.cpp.

◆ scale_targets_logarithmic() [2/2]

void scale_targets_logarithmic ( const Vector< Descriptives > &  targets_descriptives)

Scales the target variables with the logarithmic scale using the given minimum and maximum values. It updates the target variables of the data matrix.

Parameters
targets_descriptivesVector of descriptives structures for all the targets in the data set. The size of that vector must be equal to the number of target variables.

Definition at line 6073 of file data_set.cpp.

◆ scale_targets_mean_standard_deviation() [1/2]

Vector< Descriptives > scale_targets_mean_standard_deviation ( )

Scales the target variables with the calculated mean and standard deviation values from the data matrix. It updates the target variables of the data matrix. It also returns a vector of descriptives structures with the basic descriptives of all the variables.

Definition at line 5977 of file data_set.cpp.

◆ scale_targets_mean_standard_deviation() [2/2]

void scale_targets_mean_standard_deviation ( const Vector< Descriptives > &  targets_descriptives)

Scales the target variables with given mean and standard deviation values. It updates the target variables of the data matrix.

Parameters
targets_descriptivesVector of descriptives structures for all the targets in the data set. The size of that vector must be equal to the number of target variables.

Definition at line 5965 of file data_set.cpp.

◆ scale_targets_minimum_maximum() [1/2]

Vector< Descriptives > scale_targets_minimum_maximum ( )

Scales the target variables with the minimum and maximum values calculated from the data matrix. It updates the target variables of the data matrix. It also returns a vector with the descriptives of the target variables.

Definition at line 6034 of file data_set.cpp.

◆ scale_targets_minimum_maximum() [2/2]

void scale_targets_minimum_maximum ( const Vector< Descriptives > &  targets_descriptives)

Scales the target variables with given minimum and maximum values. It updates the target variables of the data matrix.

Parameters
targets_descriptivesVector of descriptives structures for all the targets in the data set. The size of that vector must be equal to the number of target variables.

Definition at line 6007 of file data_set.cpp.

◆ scale_targets_minimum_maximum_eigen()

Eigen::MatrixXd scale_targets_minimum_maximum_eigen ( )

Scales the target variables with the minimum and maximum values calculated from the data matrix. It updates the target variables of the data matrix. It also returns the descriptives of the target variables as an Eigen matrix.

Definition at line 6048 of file data_set.cpp.

◆ scrub_missing_values()

void scrub_missing_values ( )

General method for dealing with missing values. It switches among the different scrubbing methods available, according to the corresponding value in the missing values object.

Definition at line 8282 of file data_set.cpp.

◆ set() [1/7]

void set ( const DataSet other_data_set)

Sets the members of this data set object with those from another data set object.

Parameters
other_data_setData set object to be copied.

Definition at line 3708 of file data_set.cpp.

◆ set() [2/7]

void set ( const Eigen::MatrixXd &  new_data)

Sets all variables from a data matrix.

Parameters
new_dataData matrix.

Definition at line 3582 of file data_set.cpp.

◆ set() [3/7]

void set ( const Matrix< double > &  new_data)

Sets all variables from a data matrix.

Parameters
new_dataData matrix.

Definition at line 3562 of file data_set.cpp.

◆ set() [4/7]

void set ( const size_t &  new_instances_number,
const size_t &  new_variables_number 
)

Sets new numbers of instances and variables in the data set. All the instances are set for training. All the variables are set as inputs.

Parameters
new_instances_numberNumber of instances.
new_variables_numberNumber of variables.

Definition at line 3609 of file data_set.cpp.

◆ set() [5/7]

void set ( const size_t &  new_instances_number,
const size_t &  new_inputs_number,
const size_t &  new_targets_number 
)

Sets new numbers of instances and inputs and target variables in the data set. The variables in the data set are the number of inputs plus the number of targets.

Parameters
new_instances_numberNumber of instances.
new_inputs_numberNumber of input variables.
new_targets_numberNumber of target variables.

Definition at line 3665 of file data_set.cpp.

◆ set() [6/7]

void set ( const string &  file_name)

Sets the data set members by loading them from an XML file.

Parameters
file_nameName of the data set XML file.

Definition at line 3740 of file data_set.cpp.

◆ set() [7/7]

void set ( const tinyxml2::XMLDocument data_set_document)

Sets the data set members from an XML document.

Parameters
data_set_documentTinyXML document containing the member data.

Definition at line 3729 of file data_set.cpp.

◆ set_column_use()

void set_column_use ( const size_t &  index,
const VariableUse new_use 
)

Sets the use of a single column.

Parameters
indexIndex of column.
new_useUse for that column.

Definition at line 2373 of file data_set.cpp.

◆ set_columns_names()

void set_columns_names ( const Vector< string > &  new_names)

Sets new names for the columns in the data set from a vector of strings. The size of that vector must be equal to the total number of variables.

Parameters
new_namesName of variables.

Definition at line 2499 of file data_set.cpp.

◆ set_columns_number()

void set_columns_number ( const size_t &  new_variables_number)

Sets a new number of variables in the variables object. All variables are set as inputs except the last one, which is set as target.

Parameters
new_variables_numberNumber of variables.

Definition at line 2559 of file data_set.cpp.

◆ set_data()

void set_data ( const Matrix< double > &  new_data)

Sets a new data matrix. The number of rows must be equal to the number of instances. The number of columns must be equal to the number of variables. Indices of all training, selection and testing instances and inputs and target variables do not change.

Parameters
new_dataData matrix.

Definition at line 3788 of file data_set.cpp.

◆ set_data_file_name()

void set_data_file_name ( const string &  new_data_file_name)

Sets the name of the data file. It also loads the data from that file. Moreover, it sets the variables and instances objects.

Parameters
new_data_file_nameName of the file containing the data.

Definition at line 3811 of file data_set.cpp.

◆ set_default()

void set_default ( )

Sets the default member values:

  • Display: True.

Definition at line 3762 of file data_set.cpp.

◆ set_display()

void set_display ( const bool &  new_display)

Sets a new display value. If it is set to true messages from this class are to be displayed on the screen; if it is set to false messages from this class are not to be displayed on the screen.

Parameters
new_displayDisplay value.

Definition at line 3751 of file data_set.cpp.

◆ set_instance()

void set_instance ( const size_t &  instance_index,
const Vector< double > &  instance 
)

Sets new inputs and target values of a single instance in the data set.

Parameters
instance_indexIndex of the instance.
instanceNew inputs and target values of the instance.

Definition at line 4017 of file data_set.cpp.

◆ set_instance_use() [1/2]

void set_instance_use ( const size_t &  index,
const InstanceUse new_use 
)

Sets the use of a single instance.

Parameters
indexIndex of instance.
new_useUse for that instance.

Definition at line 1122 of file data_set.cpp.

◆ set_instance_use() [2/2]

void set_instance_use ( const size_t &  index,
const string &  new_use 
)

Sets the use of a single instance from a string.

Parameters
indexIndex of instance.
new_useString with the use name ("Training", "Selection", "Testing" or "Unused").

Definition at line 1133 of file data_set.cpp.
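Setting a use from a string requires a mapping from the accepted names to the InstanceUse enumerators. A minimal standalone sketch of such a mapping is shown below. The scoped enumeration, the function name instance_use_from_string and the choice of throwing std::invalid_argument on an unknown name are assumptions of the sketch; how the library reports an invalid string is not stated here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mirrors the InstanceUse enumerators listed for the class; declared here
// as a scoped enum so that the sketch is self-contained.
enum class InstanceUse { Training, Selection, Testing, UnusedInstance };

// Hypothetical helper: converts a use name into the corresponding enumerator.
InstanceUse instance_use_from_string(const std::string& new_use)
{
    if (new_use == "Training") return InstanceUse::Training;
    if (new_use == "Selection") return InstanceUse::Selection;
    if (new_use == "Testing") return InstanceUse::Testing;
    if (new_use == "Unused") return InstanceUse::UnusedInstance;

    throw std::invalid_argument("Unknown instance use: " + new_use);
}
```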

◆ set_instances_number()

void set_instances_number ( const size_t &  new_instances_number)

Sets a new number of instances in the data set. All instances are also set for training. The indices of the inputs and target variables do not change.

Parameters
new_instances_numberNumber of instances.

Definition at line 4005 of file data_set.cpp.

◆ set_instances_unused()

void set_instances_unused ( const Vector< size_t > &  indices)

Sets the instances with the given indices as unused.

Parameters
indicesVector with the indices of the instances to be set as unused.

Definition at line 1107 of file data_set.cpp.

◆ set_instances_uses() [1/2]

void set_instances_uses ( const Vector< InstanceUse > &  new_uses)

Sets new uses to all the instances from a single vector.

Parameters
new_usesVector of use structures. The size of given vector must be equal to the number of instances.

Definition at line 1168 of file data_set.cpp.

◆ set_instances_uses() [2/2]

void set_instances_uses ( const Vector< string > &  new_uses)

Sets new uses to all the instances from a single vector of strings.

Parameters
new_usesVector of use strings. Possible values for the elements are "Training", "Selection", "Testing" and "Unused". The size of given vector must be equal to the number of instances.

Definition at line 1201 of file data_set.cpp.

◆ set_k_fold_cross_validation_instances_uses()

void set_k_fold_cross_validation_instances_uses ( const size_t &  k,
const size_t &  fold_index 
)

Splits the data set into k groups (folds) in order to validate a model with limited data.

Parameters
kNumber of folds into which the data set is split.
fold_indexIndex of the fold.
Todo:
Low priority

Definition at line 1456 of file data_set.cpp.
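The idea behind k-fold cross-validation can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the folds are taken to be contiguous blocks of instances, the fold selected by fold_index is marked for selection and all other instances for training, and the last fold absorbs any remainder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class InstanceUse { Training, Selection, Testing, UnusedInstance };

// Hypothetical helper: returns one use per instance for the given fold.
std::vector<InstanceUse> k_fold_uses(std::size_t instances_number,
                                     std::size_t k,
                                     std::size_t fold_index)
{
    assert(k > 0 && fold_index < k);

    const std::size_t fold_size = instances_number / k;
    const std::size_t start = fold_index * fold_size;

    // The last fold also takes the instances left over by the integer division.
    const std::size_t end = (fold_index == k - 1) ? instances_number : start + fold_size;

    std::vector<InstanceUse> uses(instances_number, InstanceUse::Training);

    for (std::size_t i = start; i < end; ++i)
    {
        uses[i] = InstanceUse::Selection;
    }

    return uses;
}
```

Calling the helper once for each fold_index from 0 to k - 1 gives the k train and validate rounds of the procedure.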

◆ set_lags_number()

void set_lags_number ( const size_t &  new_lags_number)

Sets a new number of lags to be defined for a time series prediction application. When loading the data file, the time series data will be modified according to this number.

Parameters
new_lags_numberNumber of lags (x-1, ..., x-l) to be used.

Definition at line 3975 of file data_set.cpp.
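The effect of the lags number on a time series can be illustrated with a univariate example: each new instance holds the l previous values followed by the current one. The sketch below is standalone; the function name build_lagged_rows and the column ordering (oldest lag first) are assumptions, and the library's layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds rows of the form [x(t-l), ..., x(t-1), x(t)] from a univariate series.
// The first l values of the series have no complete history and produce no row.
std::vector<std::vector<double>> build_lagged_rows(const std::vector<double>& series,
                                                   std::size_t lags_number)
{
    assert(lags_number > 0);

    std::vector<std::vector<double>> rows;

    for (std::size_t t = lags_number; t < series.size(); ++t)
    {
        const auto first = series.begin() + static_cast<std::ptrdiff_t>(t - lags_number);
        const auto last = series.begin() + static_cast<std::ptrdiff_t>(t + 1);

        rows.emplace_back(first, last);
    }

    return rows;
}
```

A series of n values with l lags therefore yields n - l instances of l + 1 columns each.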

◆ set_missing_values_label()

void set_missing_values_label ( const string &  new_missing_values_label)

Sets a new label for the missing values.

Parameters
new_missing_values_labelLabel for the missing values.

Definition at line 3914 of file data_set.cpp.

◆ set_missing_values_method()

void set_missing_values_method ( const MissingValuesMethod new_missing_values_method)

Sets a new method for the missing values.

Parameters
new_missing_values_methodMethod for the missing values.

Definition at line 3938 of file data_set.cpp.

◆ set_selection()

void set_selection ( const Vector< size_t > &  indices)

Sets instances with given indices in the data set for selection.

Parameters
indicesIndices vector with the index of instances in the data set for selection.

Definition at line 1062 of file data_set.cpp.

◆ set_separator() [1/3]

void set_separator ( const char &  new_separator)

Sets a new separator from a char.

Parameters
new_separatorChar with the separator value.

Definition at line 3845 of file data_set.cpp.

◆ set_separator() [2/3]

void set_separator ( const Separator new_separator)

Sets a new separator.

Parameters
new_separatorSeparator value.

Definition at line 3836 of file data_set.cpp.

◆ set_separator() [3/3]

void set_separator ( const string &  new_separator_string)

Sets a new separator from a string.

Parameters
new_separator_stringString with the separator value.

Definition at line 3879 of file data_set.cpp.

◆ set_steps_ahead_number()

void set_steps_ahead_number ( const size_t &  new_steps_ahead_number)

Sets a new number of steps ahead to be defined for a time series prediction application. When loading the data file, the time series data will be modified according to this number.

Parameters
new_steps_ahead_numberNumber of steps ahead to be used.

Definition at line 3985 of file data_set.cpp.

◆ set_testing()

void set_testing ( const Vector< size_t > &  indices)

Sets instances with given indices in the data set for testing.

Parameters
indicesIndices vector with the index of instances in the data set for testing.

Definition at line 1078 of file data_set.cpp.

◆ set_time_index()

void set_time_index ( const size_t &  new_time_index)

Sets the new position where the time data is located in the data set.

Parameters
new_time_indexPosition where the time data is located.

Definition at line 3994 of file data_set.cpp.

◆ set_training()

void set_training ( const Vector< size_t > &  indices)

Sets instances with given indices in the data set for training.

Parameters
indicesIndices vector with the index of instances in the data set for training.

Definition at line 1046 of file data_set.cpp.

◆ set_variable_name()

void set_variable_name ( const size_t &  variable_index,
const string &  new_variable_name 
)

Sets the name of a single variable.

Parameters
variable_indexIndex of variable.
new_variable_nameName of variable.

Definition at line 2391 of file data_set.cpp.

◆ set_variables_names()

void set_variables_names ( const Vector< string > &  new_variables_names)

Sets new names for the variables in the data set from a vector of strings. The size of that vector must be equal to the total number of variables.

Parameters
new_variables_namesNames of the variables.

Definition at line 2451 of file data_set.cpp.

◆ split_instances_random()

void split_instances_random ( const double &  training_instances_ratio = 0.6,
const double &  selection_instances_ratio = 0.2,
const double &  testing_instances_ratio = 0.2 
)

Creates new training, selection and testing indices at random.

Parameters
training_instances_ratioRatio of training instances in the data set.
selection_instances_ratioRatio of selection instances in the data set.
testing_instances_ratioRatio of testing instances in the data set.

Definition at line 1257 of file data_set.cpp.
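A random split of this kind can be sketched by shuffling the instance indices and assigning uses by position. The code below is a standalone illustration, not the library's implementation: the normalization of the ratios by their sum, the rounding rule, the seed parameter and the name split_random are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

enum class InstanceUse { Training, Selection, Testing, UnusedInstance };

std::vector<InstanceUse> split_random(std::size_t instances_number,
                                      double training_ratio = 0.6,
                                      double selection_ratio = 0.2,
                                      double testing_ratio = 0.2,
                                      unsigned seed = 0)
{
    assert(training_ratio >= 0.0 && selection_ratio >= 0.0 && testing_ratio >= 0.0);

    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    // Rounded group sizes, clamped so that they never exceed the number of instances.
    const std::size_t selection_number = std::min(instances_number,
        static_cast<std::size_t>(std::llround(instances_number * selection_ratio / total)));
    const std::size_t testing_number = std::min(instances_number - selection_number,
        static_cast<std::size_t>(std::llround(instances_number * testing_ratio / total)));
    const std::size_t training_number = instances_number - selection_number - testing_number;

    // Shuffle the instance indices, then assign uses by position in the shuffled order.
    std::vector<std::size_t> indices(instances_number);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    std::mt19937 generator(seed);
    std::shuffle(indices.begin(), indices.end(), generator);

    std::vector<InstanceUse> uses(instances_number, InstanceUse::Training);

    for (std::size_t i = training_number; i < training_number + selection_number; ++i)
        uses[indices[i]] = InstanceUse::Selection;

    for (std::size_t i = training_number + selection_number; i < instances_number; ++i)
        uses[indices[i]] = InstanceUse::Testing;

    return uses;
}
```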

◆ split_instances_sequential()

void split_instances_sequential ( const double &  training_instances_ratio = 0.6,
const double &  selection_instances_ratio = 0.2,
const double &  testing_instances_ratio = 0.2 
)

Creates new training, selection and testing indices with sequential indices.

Parameters
training_instances_ratioRatio of training instances in the data set.
selection_instances_ratioRatio of selection instances in the data set.
testing_instances_ratioRatio of testing instances in the data set.

Definition at line 1352 of file data_set.cpp.
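In a sequential split the order of the instances is preserved: the first block is used for training, the next for selection and the last for testing. A minimal standalone sketch, with the function name and the boundary rule as assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class InstanceUse { Training, Selection, Testing, UnusedInstance };

std::vector<InstanceUse> split_sequential(std::size_t instances_number,
                                          double training_ratio = 0.6,
                                          double selection_ratio = 0.2,
                                          double testing_ratio = 0.2)
{
    const double total = training_ratio + selection_ratio + testing_ratio;
    assert(total > 0.0);

    // Group boundaries expressed as cumulative positions along the instances.
    const double training_end = instances_number * training_ratio / total;
    const double selection_end = instances_number * (training_ratio + selection_ratio) / total;

    std::vector<InstanceUse> uses(instances_number, InstanceUse::Testing);

    for (std::size_t i = 0; i < instances_number; ++i)
    {
        // Comparing the midpoint of each instance avoids rounding problems
        // when a boundary falls exactly on an integer.
        const double position = static_cast<double>(i) + 0.5;

        if (position < training_end) uses[i] = InstanceUse::Training;
        else if (position < selection_end) uses[i] = InstanceUse::Selection;
    }

    return uses;
}
```

This kind of split is the usual choice for time series, where shuffling would leak future values into the training instances.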

◆ transform_association()

void transform_association ( )

Arranges the data set for association.

Todo:
Low priority. Variables and instances.

Definition at line 7121 of file data_set.cpp.

◆ transform_principal_components_data()

void transform_principal_components_data ( const Matrix< double > &  principal_components)

Transforms the data according to the principal components.

Parameters
principal_componentsMatrix containing the principal components.

Definition at line 5383 of file data_set.cpp.
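With the principal components arranged in rows, transforming the data amounts to taking the dot product of every instance with every component. The sketch below shows that projection in isolation. It assumes that the data have already been centred or scaled as required, and the alias Matrix and the name project_onto_components are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Projects each instance (row of data) onto each principal component
// (row of principal_components). The result has one column per component.
Matrix project_onto_components(const Matrix& data, const Matrix& principal_components)
{
    Matrix projected(data.size(), std::vector<double>(principal_components.size(), 0.0));

    for (std::size_t i = 0; i < data.size(); ++i)
    {
        for (std::size_t c = 0; c < principal_components.size(); ++c)
        {
            assert(principal_components[c].size() == data[i].size());

            projected[i][c] = std::inner_product(data[i].begin(), data[i].end(),
                                                 principal_components[c].begin(), 0.0);
        }
    }

    return projected;
}
```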

◆ unscale_data_mean_standard_deviation()

void unscale_data_mean_standard_deviation ( const Vector< Descriptives > &  data_descriptives)

Unscales the data matrix with given mean and standard deviation values. It updates the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the variables in the data set. The size of that vector must be equal to the number of variables.

Definition at line 6204 of file data_set.cpp.

◆ unscale_data_minimum_maximum()

void unscale_data_minimum_maximum ( const Vector< Descriptives > &  data_descriptives)

Unscales the data matrix with given minimum and maximum values. It updates the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the variables in the data set. The size of that vector must be equal to the number of variables.

Definition at line 6215 of file data_set.cpp.
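Unscaling inverts the scaling transformation using the same descriptives. The following standalone sketch assumes that the scaled values lie in [-1, 1], so that x = 0.5 (x' + 1) (maximum - minimum) + minimum; both that range and the names Range and unscale_minimum_maximum are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the minimum and maximum stored in Descriptives.
struct Range
{
    double minimum;
    double maximum;
};

// Restores the original values of every column in place, assuming the
// scaled values were mapped onto [-1, 1].
void unscale_minimum_maximum(std::vector<std::vector<double>>& data,
                             const std::vector<Range>& ranges)
{
    for (std::vector<double>& row : data)
    {
        // One range is expected per variable.
        assert(row.size() == ranges.size());

        for (std::size_t j = 0; j < row.size(); ++j)
        {
            row[j] = 0.5 * (row[j] + 1.0) * (ranges[j].maximum - ranges[j].minimum)
                   + ranges[j].minimum;
        }
    }
}
```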

◆ unscale_inputs_mean_standard_deviation()

void unscale_inputs_mean_standard_deviation ( const Vector< Descriptives > &  data_descriptives)

Unscales the input variables with given mean and standard deviation values. It updates the input variables of the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the variables in the data set. The size of that vector must be equal to the number of variables.

Definition at line 6226 of file data_set.cpp.

◆ unscale_inputs_minimum_maximum()

void unscale_inputs_minimum_maximum ( const Vector< Descriptives > &  data_descriptives)

Unscales the input variables with given minimum and maximum values. It updates the input variables of the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the data in the data set. The size of that vector must be equal to the number of variables.

Definition at line 6239 of file data_set.cpp.

◆ unscale_targets_mean_standard_deviation()

void unscale_targets_mean_standard_deviation ( const Vector< Descriptives > &  targets_descriptives)

Unscales the target variables with given mean and standard deviation values. It updates the target variables of the data matrix.

Parameters
targets_descriptivesVector of descriptives structures for all the targets in the data set. The size of that vector must be equal to the number of target variables.

Definition at line 6252 of file data_set.cpp.

◆ unscale_targets_minimum_maximum()

void unscale_targets_minimum_maximum ( const Vector< Descriptives > &  data_descriptives)

Unscales the target variables with given minimum and maximum values. It updates the target variables of the data matrix.

Parameters
data_descriptivesVector of descriptives structures for all the variables. The size of that vector must be equal to the number of variables.

Definition at line 6265 of file data_set.cpp.

◆ unuse_columns_missing_values()

Vector< string > unuse_columns_missing_values ( const double &  missing_ratio)

Returns a vector with the variables that are set as unused by the missing values criterion.

Parameters
missing_ratioRatio to find the missing variables.

Definition at line 4201 of file data_set.cpp.
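A criterion of this kind can be sketched by counting the missing entries of each column and comparing their fraction with the ratio. The code is a standalone illustration under assumptions: missing values are represented as NaN, a column is flagged when its fraction is greater than or equal to the ratio, and column indices are returned instead of names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the indices of the columns whose fraction of NaN entries is at
// least missing_ratio. The matrix is assumed to be rectangular.
std::vector<std::size_t> columns_with_missing_ratio(const std::vector<std::vector<double>>& data,
                                                    double missing_ratio)
{
    assert(missing_ratio >= 0.0 && missing_ratio <= 1.0);

    std::vector<std::size_t> flagged;
    if (data.empty()) return flagged;

    const std::size_t columns_number = data.front().size();

    for (std::size_t j = 0; j < columns_number; ++j)
    {
        std::size_t missing_number = 0;

        for (const std::vector<double>& row : data)
        {
            if (std::isnan(row[j])) ++missing_number;
        }

        if (static_cast<double>(missing_number) / static_cast<double>(data.size()) >= missing_ratio)
        {
            flagged.push_back(j);
        }
    }

    return flagged;
}
```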

◆ unuse_constant_columns()

Vector< string > unuse_constant_columns ( )

Removes the input or target indices of those variables with zero standard deviation. It might change the size of the vectors containing the inputs and targets indices.

Definition at line 4057 of file data_set.cpp.
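A column has zero standard deviation exactly when all its values are equal, so constant columns can be detected without computing any statistics. A minimal standalone sketch, which returns column indices instead of names and assumes a rectangular matrix:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices of the columns in which every value equals the first one.
std::vector<std::size_t> constant_columns(const std::vector<std::vector<double>>& data)
{
    std::vector<std::size_t> constant;
    if (data.empty()) return constant;

    const std::size_t columns_number = data.front().size();

    for (std::size_t j = 0; j < columns_number; ++j)
    {
        bool is_constant = true;

        for (std::size_t i = 1; i < data.size(); ++i)
        {
            assert(data[i].size() == columns_number);

            if (data[i][j] != data[0][j])
            {
                is_constant = false;
                break;
            }
        }

        if (is_constant) constant.push_back(j);
    }

    return constant;
}
```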

◆ unuse_most_populated_target()

Vector< size_t > unuse_most_populated_target ( const size_t &  instances_to_unuse)

This method unuses a given number of instances of the most populated target. If the given number is greater than the number of used instances which belong to that target, it unuses all the instances in that target. If the given number is lower than 1, it unuses 1 instance.

Parameters
instances_to_unuseNumber of instances to set unused.
Todo:
Low priority. instance frequency

Definition at line 7435 of file data_set.cpp.

◆ unuse_non_significant_input_columns()

Vector< size_t > unuse_non_significant_input_columns ( )

Unuses those binary inputs whose positives do not correspond to any positive in the target variables.

Todo:
Low priority.

Definition at line 4147 of file data_set.cpp.

◆ unuse_repeated_instances()

Vector< size_t > unuse_repeated_instances ( )

Removes the training, selection and testing indices of those instances which are repeated in the data matrix. It might change the size of the vectors containing the training, selection and testing indices.

Definition at line 4095 of file data_set.cpp.
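Repeated instances can be found by remembering every row already seen. The sketch below keeps the first occurrence of each row and reports the indices of the later copies. It is a standalone illustration that assumes exact equality of values and no NaN entries; the name repeated_instances is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Returns the indices of the rows that duplicate an earlier row.
std::vector<std::size_t> repeated_instances(const std::vector<std::vector<double>>& data)
{
    std::set<std::vector<double>> seen;
    std::vector<std::size_t> repeated;

    for (std::size_t i = 0; i < data.size(); ++i)
    {
        // insert() reports through .second whether the row was new.
        if (!seen.insert(data[i]).second)
        {
            repeated.push_back(i);
        }
    }

    return repeated;
}
```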

◆ unuse_Tukey_outliers()

void unuse_Tukey_outliers ( const double &  cleaning_parameter = 1.5)

Calculates the outliers in the data set using Tukey's test and sets them as unused in the instances object.

Parameters
cleaning_parameterParameter used to detect outliers.

Definition at line 7700 of file data_set.cpp.
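Tukey's test flags a value as an outlier when it lies below Q1 - k IQR or above Q3 + k IQR, where Q1 and Q3 are the first and third quartiles, IQR = Q3 - Q1 and k is the cleaning parameter. The standalone sketch below applies the test to a single column. The quartile convention (linear interpolation between order statistics) and the function names are assumptions; the library may compute quartiles differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantile by linear interpolation between order statistics (one of several
// common conventions). The vector is taken by value because it is sorted.
double quantile(std::vector<double> values, double probability)
{
    assert(!values.empty());
    assert(probability >= 0.0 && probability <= 1.0);

    std::sort(values.begin(), values.end());

    const double position = probability * static_cast<double>(values.size() - 1);
    const std::size_t lower = static_cast<std::size_t>(std::floor(position));
    const std::size_t upper = std::min(lower + 1, values.size() - 1);

    return values[lower] + (position - static_cast<double>(lower)) * (values[upper] - values[lower]);
}

// Returns the indices of the values outside the Tukey fences.
std::vector<std::size_t> tukey_outliers(const std::vector<double>& column,
                                        double cleaning_parameter = 1.5)
{
    const double first_quartile = quantile(column, 0.25);
    const double third_quartile = quantile(column, 0.75);
    const double interquartile_range = third_quartile - first_quartile;

    const double lower_fence = first_quartile - cleaning_parameter * interquartile_range;
    const double upper_fence = third_quartile + cleaning_parameter * interquartile_range;

    std::vector<std::size_t> outliers;

    for (std::size_t i = 0; i < column.size(); ++i)
    {
        if (column[i] < lower_fence || column[i] > upper_fence) outliers.push_back(i);
    }

    return outliers;
}
```

A larger cleaning parameter widens the fences and flags fewer instances.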

◆ unuse_uncorrelated_columns()

Vector< size_t > unuse_uncorrelated_columns ( const double &  minimum_correlation = 0.25)

Sets as unused the variables whose correlation is below the given minimum, and returns them.

Parameters
minimum_correlationMinimum correlation between variables.
nominal_variablesVector containing the classes of each categorical variable.

Definition at line 4231 of file data_set.cpp.
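A filter of this kind can be sketched with the Pearson correlation coefficient. The code below is a standalone illustration under assumptions: each input column is compared with a single target column, the absolute value of the coefficient is compared with minimum_correlation, and a constant column is given a correlation of zero. The correlation measure used by the library is not stated here and may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Pearson correlation coefficient of two equally sized samples.
double pearson_correlation(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size() && !x.empty());

    const double n = static_cast<double>(x.size());
    const double mean_x = std::accumulate(x.begin(), x.end(), 0.0) / n;
    const double mean_y = std::accumulate(y.begin(), y.end(), 0.0) / n;

    double sum_xy = 0.0, sum_xx = 0.0, sum_yy = 0.0;

    for (std::size_t i = 0; i < x.size(); ++i)
    {
        sum_xy += (x[i] - mean_x) * (y[i] - mean_y);
        sum_xx += (x[i] - mean_x) * (x[i] - mean_x);
        sum_yy += (y[i] - mean_y) * (y[i] - mean_y);
    }

    const double denominator = std::sqrt(sum_xx * sum_yy);

    return denominator > 0.0 ? sum_xy / denominator : 0.0;
}

// Returns the indices of the input columns whose absolute correlation with
// the target is below the minimum. Each element of inputs is one column.
std::vector<std::size_t> uncorrelated_inputs(const std::vector<std::vector<double>>& inputs,
                                             const std::vector<double>& target,
                                             double minimum_correlation = 0.25)
{
    std::vector<std::size_t> uncorrelated;

    for (std::size_t j = 0; j < inputs.size(); ++j)
    {
        if (std::abs(pearson_correlation(inputs[j], target)) < minimum_correlation)
        {
            uncorrelated.push_back(j);
        }
    }

    return uncorrelated;
}
```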

Member Data Documentation

◆ data

Matrix<double> data
private

Data Matrix. The number of rows is the number of instances. The number of columns is the number of variables.

Definition at line 755 of file data_set.h.

◆ time_series_data

Matrix<double> time_series_data
private

Time series data matrix. The number of rows is the number of instances before time series transformation. The number of columns is the number of variables before time series transformation.

Definition at line 761 of file data_set.h.


The documentation for this class was generated from the following files: