Diagnose breast cancer from fine-needle aspirate images using OpenNN
This example aims to determine whether a breast lump is malignant (cancerous) or benign (non-cancerous) from digitized images of a fine-needle aspiration biopsy.
The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
Contents:
1. Application type.
2. Data set.
3. Neural network.
4. Training strategy.
5. Testing analysis.
6. Model deployment.
7. Full code.
1. Application type
The variable to be predicted can have two values (malignant or benign tumor). Therefore, this is a binary classification project.
The goal is to model the probability of a malignant tumor, conditioned on the fine needle aspiration test features.
2. Data set
The first step is to prepare the data set, the source of information for the classification problem. For that, we need to configure the following concepts:
- Data source.
- Variables.
- Instances.
The data source is the file breast_cancer.csv. It contains the data for this example in CSV format (here with fields separated by semicolons) and can be loaded as:
DataSet data_set("path_to_source/breast_cancer.csv", ';', true, false);
There are 10 columns and 683 rows. The variables in this problem are:
- clump_thickness: (1-10). Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
- cell_size_uniformity: (1-10). Cancer cells tend to vary in size, which makes this parameter valuable in determining whether cells are cancerous.
- cell_shape_uniformity: (1-10). As with cell size, cancer cells tend to vary in shape.
- marginal_adhesion: (1-10). Normal cells tend to stick together, while cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.
- single_epithelial_cell_size: (1-10). Related to the uniformity mentioned above; epithelial cells that are significantly enlarged may be malignant.
- bare_nuclei: (1-10). A term used for nuclei not surrounded by cytoplasm (the rest of the cell), typically seen in benign tumors.
- bland_chromatin: (1-10). Describes a uniform "texture" of the nucleus seen in benign cells; in cancer cells, the chromatin tends to be coarser.
- normal_nucleoli: (1-10). Nucleoli are small structures in the nucleus. In normal cells, the nucleolus is usually very small, if visible at all; in cancer cells, the nucleoli become more prominent, and sometimes there are more of them.
- mitoses: (1-10). Cancer is essentially a disease of uncontrolled mitosis.
- diagnose: (0 or 1). Benign (non-cancerous) or malignant (cancerous) lump in a breast.
Once we have the data ready, we will get the information on the variables, such as names and statistical descriptives.
const vector<string> inputs_names = data_set.get_variable_names(DataSet::VariableUse::Input);

const vector<string> targets_names = data_set.get_variable_names(DataSet::VariableUse::Target);
The instances are divided into training, selection, and testing subsets. They represent 60% (409), 20% (137), and 20% (137) of the original instances, respectively, and are split randomly using the following command:
data_set.split_samples_random();
To get the number of input variables and target variables, we use the following commands:

const Index input_variables_number = data_set.get_variables_number(DataSet::VariableUse::Input);

const Index target_variables_number = data_set.get_variables_number(DataSet::VariableUse::Target);
For more information about the data set methods, see the DataSet class.
3. Neural network
The second step is to choose the correct neural network architecture. For classification problems, it is usually composed of:
- A scaling layer.
- Two perceptron layers.
- A probabilistic layer.
The NeuralNetwork class is responsible for building the neural network and properly organizing the layers of neurons. The following constructor builds the architecture above; if you need more complex architectures, see the NeuralNetwork class.

const Index hidden_neurons_number = 6;

NeuralNetwork neural_network(NeuralNetwork::ModelType::Classification,
    {input_variables_number}, {hidden_neurons_number}, {target_variables_number});
Once the neural network has been created, we can set the names of its inputs and outputs:

neural_network.set_inputs_names(inputs_names);

neural_network.set_outputs_names(targets_names);
The model is now defined, so we proceed to the learning process with the TrainingStrategy class.
4. Training strategy
The third step is to set the training strategy, which is composed of:
- Loss index.
- Optimization algorithm.
First, we construct the training strategy object:
TrainingStrategy training_strategy(&neural_network, &data_set);
then set the error term:
training_strategy.set_loss_method(TrainingStrategy::LossMethod::CROSS_ENTROPY_ERROR);
and finally the optimization algorithm
training_strategy.set_optimization_method(TrainingStrategy::OptimizationMethod::QUASI_NEWTON_METHOD);
We can now start the training process by using the following command:
training_strategy.perform_training();
For more information on the training strategy methods, see the TrainingStrategy class.
5. Testing analysis
The fourth step is to evaluate our model. For that purpose, we need to use the testing analysis class, whose goal is to validate the model’s generalization performance. Here, we compare the neural network outputs to the corresponding targets in the testing instances of the data set.
As in previous cases, we start by building the testing analysis object
TestingAnalysis testing_analysis(&neural_network, &data_set);
and perform the testing. In our case, we use binary classification tests:
testing_analysis.print_binary_classification_tests();
For more information about the testing analysis methods, see the TestingAnalysis class.
6. Model deployment
Once our model is trained, the neural network is ready to predict outputs for inputs that it has never seen. This process is called model deployment.

To generate predictions for new data, use the calculate_outputs method, which takes a tensor of inputs:

neural_network.calculate_outputs(inputs);
For instance, the new inputs are:
- Clump thickness (1-10): 4
- Cell size uniformity (1-10): 3
- Cell shape uniformity (1-10): 3
- Marginal adhesion (1-10): 2
- Single epithelial cell size (1-10): 3
- Bare nuclei (1-10): 4
- Bland chromatin (1-10): 3
- Normal nucleoli (1-10): 2
- Mitoses (1-10): 1
In OpenNN, we can write this as:

Tensor<type, 2> inputs(1, 9);

inputs.setValues({{type(4), type(3), type(3), type(2), type(3), type(4), type(3), type(2), type(1)}});

neural_network.calculate_outputs(inputs);
We can also save the model's mathematical expression in different programming languages:

neural_network.save_expression(NeuralNetwork::ProgrammingLanguage::C, "../data/expression.txt");

neural_network.save_expression(NeuralNetwork::ProgrammingLanguage::Python, "../data/expression.py");
The exported expression allows the model to be implemented in Python, PHP, and other languages.
7. Full Code
Joining all steps, we obtain the following code:
// Data set
DataSet data_set("../data/breast_cancer.csv", ';', true, false);

const Index input_variables_number = data_set.get_variables_number(DataSet::VariableUse::Input);
const Index target_variables_number = data_set.get_variables_number(DataSet::VariableUse::Target);

// Neural network
const Index hidden_neurons_number = 6;

NeuralNetwork neural_network(NeuralNetwork::ModelType::Classification,
    {input_variables_number}, {hidden_neurons_number}, {target_variables_number});

// Training strategy
TrainingStrategy training_strategy(&neural_network, &data_set);
training_strategy.set_loss_method(TrainingStrategy::LossMethod::CROSS_ENTROPY_ERROR);
training_strategy.set_optimization_method(TrainingStrategy::OptimizationMethod::QUASI_NEWTON_METHOD);
training_strategy.perform_training();

// Testing analysis
TestingAnalysis testing_analysis(&neural_network, &data_set);
testing_analysis.print_binary_classification_tests();

// Model deployment
Tensor<type, 2> inputs(1, 9);
inputs.setValues({{type(4), type(3), type(3), type(2), type(3), type(4), type(3), type(2), type(1)}});
neural_network.calculate_outputs(inputs);

// Save expression
neural_network.save_expression(NeuralNetwork::ProgrammingLanguage::C, "../data/breast_cancer.txt");
neural_network.save_expression(NeuralNetwork::ProgrammingLanguage::Python, "../data/breast_cancer.py");
This code can be incorporated into your own C++ project.
References:
- The data for this problem has been taken from the UCI Machine Learning Repository.
- Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193-9196.
- Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470–479). Aberdeen, Scotland: Morgan Kaufmann.