I've previously discussed the Reduced Coulomb Energy Neural Net algorithm on this site. I wrote the algorithm in Matlab, which uses index-based logic to select, filter, wrangle and process data. I thought it would be a fun exercise to refactor the Matlab code to Tidyverse. Tidyverse uses forward pipe operators to flow data through the data processing steps.

The example RCE algorithm assigns a class to data based on whether or not the data points live inside "footprints" of training data. If you are interested in a visual walk-through of the RCE algorithm, take a minute to read my post titled A Graphical Introduction to Probabilistic Neural Networks.

The following graphic captures an animation of the RCE NN Algorithm in action.

You can find the original Matlab script and the new R script on GitHub.

# Loading Data

The algorithm loads the BUPA liver disorders database from the University of California, Irvine (UCI) machine learning repository.

## Matlab

In Matlab, we simply encode the CSV as a matrix using brackets and assignment.

```
data = [85,92,45,27,31,0.0,1
85,64,59,32,23,0.0,2
86,54,33,16,54,0.0,2
91,78,34,24,36,0.0,2
87,70,12,28,10,0.0,2
...
98,77,55,35,89,15.0,1
91,68,27,26,14,16.0,1
98,99,57,45,65,20.0,1];
```

## Tidyverse

Tidyverse allows us to read the raw CSV and store it in a Tibble.

In addition to storing the CSV data in a Tibble, we use the **readr** library to add column names (**col_names**) and an ID column (*rowid_to_column*). I discuss the definitions of the column names in the next section.

```
library("readr")
library("dplyr")
library("magrittr")
library("purrr")
library("tidyr")
columns <- c( "mcv", "alkphos", "sgpt", "sgot", "gammagt", "drinks_num", "select")
Bupa.Tib <- read_csv( "bupa.data", col_names = columns ) %>%
  tibble::rowid_to_column("id")
```
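If you want to experiment without downloading **bupa.data**, note that **read_csv** also accepts a literal string wrapped in `I()` (readr 2.0+). A minimal, self-contained sketch using the first two data rows shown above:

```
library("readr")
library("dplyr")
# Two rows of BUPA data supplied inline, instead of reading "bupa.data"
csv_text <- "85,92,45,27,31,0.0,1
85,64,59,32,23,0.0,2"
columns <- c( "mcv", "alkphos", "sgpt", "sgot", "gammagt", "drinks_num", "select")
Toy.Tib <- read_csv( I(csv_text), col_names = columns ) %>%
  tibble::rowid_to_column("id")
# Toy.Tib now has 2 rows and 8 columns (the 7 named columns plus id)
```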

# Selecting Features

The BUPA data has six features and two classes (one for alcohol related liver disorder, and one for alcohol unrelated liver disorder).

The six (6) BUPA features include:

- mean corpuscular volume (mcv)
- Four chemical markers:
  - alkaline phosphatase (alkphos)
  - alanine aminotransferase (sgpt)
  - aspartate aminotransferase (sgot)
  - gamma-glutamyl transpeptidase (gammagt)
- half-pint equivalents of alcohol per day (drinks)

I discussed the salient features in my RCE writeup. Three features, "alkphos", "sgpt" and "gammagt" stand out in terms of the algorithm's classification performance. We still would like to provide the Data Scientist with flexibility in selecting the features, for "what if" scenarios, so we write our code to accommodate an arbitrary number of features.

## Matlab

In Matlab, we use the column index to select the features. In this case, we use **alkphos == 2**, **sgpt == 5**, and **gammagt == 6**.

```
feats = [2 5 6];
```

## Tidyverse

Tidyverse allows us to name the columns and then select by name.

When we created **Bupa.Tib** we named the columns, so now we can *select* columns by name.

We create a list that records the column names that we intend to keep.

```
features <- c( "alkphos", "sgpt", "gammagt")
```
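A caveat worth noting: newer versions of **dplyr** warn when you pass a bare character vector to **select**; wrapping it in **all_of()** is the current idiom and errors loudly if a named column is missing. A small sketch with a made-up one-row tibble:

```
library("dplyr")
features <- c( "alkphos", "sgpt", "gammagt")
# Made-up single observation, just to illustrate selection by name
Toy.Tib <- tibble( mcv = 85, alkphos = 92, sgpt = 45, sgot = 27, gammagt = 31 )
Toy.Tib %>% select( all_of(features) )  # keeps alkphos, sgpt and gammagt only
```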

# Creating the training set

## Matlab

In Matlab, we sort the data by the **class**, which the matrix stores in column seven (7).

We then use index operations to select all features, excluding the class.

Then we select the desired features using the **feats** array. A function named *prepare_uncoded* wraps this operation.

```
function [data] = prepare_uncoded(data,feats)
  data = sortrows(data,7);
  data = data(:,1:6);
  data = data(:,feats);
end
```

We create two separate matrices, one that includes the first seventy-two (72) rows, and one that includes the bottom seventy-two (72) rows. Since we sorted by class in the function above, we produce one matrix of train patterns that contains class one, and one that contains train patterns of class two.

```
data = prepare_uncoded(data,feats);
class1 = data(73:144,:)';
class2 = data(145:216,:)';
```

## Tidyverse

The **magrittr** package of Tidyverse enables a pipe forward operator. The pipe forward operator makes the feature selection steps much more readable.

We use *filter* to filter points of each class, *select* to select the features, and *slice* to pull specific rows.

NOTE: Just to disambiguate, Irvine named the class column *select*, so we *filter* based on the value of the *select* column.

```
class_1_training_patterns <- c( 73:144 )
class_2_training_patterns <- c( 1:72 )
Class.1.Train.Tib <- Bupa.Tib %>%
  filter( select == 1 ) %>%
  select( id, features ) %>%
  slice( class_1_training_patterns )
Class.2.Train.Tib <- Bupa.Tib %>%
  filter( select == 2 ) %>%
  select( id, features ) %>%
  slice( class_2_training_patterns )
```

# Find Radii

The RCE NN algorithm requires us to find, for each training point, the radius to the nearest training point of the opposite class.

We compute the Euclidean distance to all training points of the other class, and store the distance (named **lambda**) of the closest one.
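Before looking at either implementation, here is a tiny numeric sketch of the radius rule with made-up values: the radius (*lambda*) is the distance to the nearest opposite-class point, minus a small *epsilon*, capped at *lambda max*.

```
# One class-1 point and three class-2 points (made-up values)
p      <- c(70, 30, 50)
others <- rbind( c(75, 28, 60),
                 c(90, 80, 10),
                 c(71, 31, 52) )
epsilon    <- 0.001
lambda_max <- 25
# Euclidean distance from p to each opposite-class point
dists  <- apply( others, 1, function(x) sqrt( sum( (x - p)^2 ) ) )
lambda <- min( min(dists) - epsilon, lambda_max )
# The nearest point is the third one, at distance sqrt(6), so lambda is about 2.45
```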

## Matlab

In Matlab, we create a function that ingests both the **Class 1** and *Class 2* training matrices, along with *epsilon* and *lambda max*.

*Lambda max* provides an upper bound in terms of the maximum radius the algorithm will consider. *Epsilon* is a very small value that we subtract from the calculated *lambda*. For more details, see my writeup of the RCE NN algorithm.

The Matlab code performs matrix operations via nested functions to calculate the Euclidean distance to all other points and then record the minimum.

In addition, the code uses a **for loop** to iterate through every training pattern.

The function iterates through each training point, calculates the distance to every other training point (stored in the two **Class** matrices) and keeps the minimum.

It then returns two arrays that contain **lambda**, one array per class.

```
function [lambda_1, lambda_2] = rce_train(class1,class2,eps,lambda_max)
  %Find number of train patterns (columns)
  n_c1p = size(class1,2);
  n_c2p = size(class2,2);
  for i=1:n_c1p
    %Replicate class 1 point i across the class 2 column count
    x_hat = min(sqrt(sum((class2-class1(:,i)*ones(1,n_c2p)).^2)));
    lambda_1(i) = min(x_hat - eps, lambda_max);
  end
  for i=1:n_c2p
    x_hat = min(sqrt(sum((class1-class2(:,i)*ones(1,n_c1p)).^2)));
    lambda_2(i) = min(x_hat - eps, lambda_max);
  end
end
```

We apply the function to the training matrices as follows:

```
[lambda_1 lambda_2] = rce_train(class1,class2,eps,lambda_max);
```

## Tidyverse

R best practices do not encourage **for loops**, since R follows a *functional* programming convention. In addition, the **magrittr** pipes allow us to avoid *nested functions*.

We first create a function, **find_lambda**. I decided to process the data one class at a time, so this function only calculates the distance to training points of the *other* class, and not all data points as in the Matlab function above.

The **find_lambda** function takes a single observation (row of data) for a particular *class*, along with the entire *Tibble* that contains all data points of the *other class*. The function also ingests *epsilon*, *lambda max* and the *features* vector.

Not to overload terms too much, but the function includes a **lambda function** that calculates the Euclidean distance between two vectors.

The **lambda function** takes two vectors: the *observation* vector and a row from the *other class* Tibble, represented as *x*.

```
function(x) sqrt( sum( ( x - observation )^2 ) )
```

The lambda function can perform calculations on vectors of any length, which gives a data scientist flexibility in choosing which features to include.

The ** find_lambda** function follows, and I will explain it quickly line by line.

```
find_lambda <- function( observation, Other.Class.Tib, lambda_max, epsilon, features ) {
  Other.Class.Tib %>%
    select( features ) %>%
    mutate( euclid_dist = apply( . , 1, function(x) sqrt( sum( ( x - observation )^2 ) ) ) ) %>%
    select( euclid_dist ) %>%
    min() %>%
    min( . - epsilon, lambda_max )
}
```

We start with the Tibble that contains **all** observations of the *other* class, stored in *Other.Class.Tib*.

The function pipes the Tibble in its entirety to a **select** statement that selects all of the *desired* features.

We then use the **mutate** operator to create a new column named *euclid_dist*. This column stores the distance from the current *observation* (single vector) to every data point (row) in the *Other.Class.Tib*.

The **apply** operator applies the Euclidean distance *lambda function* to every row in *Other.Class.Tib* and stores the result for each row in the *euclid_dist* column.

Since we must accommodate vectors of arbitrary length, we tell **apply** to input *row wise* data via the *1* in the second parameter of the *function signature*.

Once the **apply** operation completes, we have a column that records the distance to each data point in the *Other.Class.Tib*. We are only interested in the *nearest* data point of the *other class*, so we *select* the *euclid_dist* column and find the *min()*. We then ensure that the *minimum distance* does not exceed *lambda max*.

In summary, we supply the function with a single observation for a class, along with a **Tibble** that includes *all* observations of the *other class*. The function then returns a single value: the distance between the current observation and the nearest data point of the *other class*.

We are not done yet. We must apply this function to every training point in the class under observation.

```
# Find Lambda for Class 1 Training patterns
Class.1.Train.Tib %<>%
  select( features ) %>%
  mutate( lambda = apply( . , 1, function(x) find_lambda( x,
                                                          Class.2.Train.Tib,
                                                          lambda_max,
                                                          epsilon,
                                                          features ) ) ) %>%
  mutate( id = Class.1.Train.Tib$id )
# Find Lambda for Class 2 Training patterns
Class.2.Train.Tib %<>%
  select( features ) %>%
  mutate( lambda = apply( . , 1, function(x) find_lambda( x,
                                                          Class.1.Train.Tib,
                                                          lambda_max,
                                                          epsilon,
                                                          features ) ) ) %>%
  mutate( id = Class.2.Train.Tib$id )
```

As you can see, we pipe the entire **Class.1.Train.Tib** to a *select* function and then use the *apply* operation to execute *find_lambda* on every row of *Class.1.Train.Tib*. Although each iteration (application) of *find_lambda* inputs the entire *Tibble* of *Class.2.Train.Tib*, it returns a single value for lambda.

NOTE: The magrittr *%<>%* operation pipes data forward and stores the final result of all chained operations back into the initial variable.

The following output tibble depicts what **Class.1.Train.Tib** looks like after application of *find_lambda*.

```
> Class.1.Train.Tib
# A tibble: 72 x 5
alkphos sgpt gammagt lambda id
<dbl> <dbl> <dbl> <dbl> <int>
1 67 77 114 29.1 175
2 71 29 52 10.5 176
3 93 22 123 19.4 182
4 77 86 31 26.8 183
5 77 39 108 20.4 189
6 83 81 201 58.3 190
7 75 25 14 3.16 191
8 56 23 12 6.48 192
9 91 27 15 7.87 194
10 62 17 5 5.00 195
# ... with 62 more rows
```

For example, the closest Class 2 data point to the first Class 1 observation lies 29.1 units away.

# Classify the Data

We first take the remaining BUPA data to create test patterns for each class.

## Matlab

In Matlab:

```
test_class1 = data(1:72,:)';
test_class2 = data(217:288,:)';
```

## Tidyverse

In Tidyverse, I decided to create one **Tibble** for all Test Patterns, via the *bind_rows* operation.

```
Test.Patterns <- Bupa.Tib %>%
  filter( select == 1 ) %>%
  slice( class_1_test_patterns ) %>%
  bind_rows( Bupa.Tib %>%
               filter( select == 2 ) %>%
               slice( class_2_test_patterns ) )
```
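The post does not show where **class_1_test_patterns** and **class_2_test_patterns** come from. Based on the Matlab indices above (test_class1 takes rows 1:72 of the class-sorted matrix, test_class2 takes rows 217:288), a plausible reconstruction is:

```
# Hypothetical reconstruction (check the GitHub script for the actual values):
# class 1 tests are the first 72 class-1 rows; class 2 tests are class-2
# rows 73:144, since class-2 rows 1:72 were used for training.
class_1_test_patterns <- c( 1:72 )
class_2_test_patterns <- c( 73:144 )
```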

Once we have test data, we need to classify it.

## Matlab

In Matlab, I wrote a function named **rce_classify**. As you can see, the function contains a ton of nested functions and a for loop.

Each training pattern includes a circular "footprint" around it that extends to the nearest point of the **other** class, with radius equal to the *lambda* we calculated above.

The **rce_classify** function finds which *footprint* each test observation lies in.

```
function [cl] = rce_classify(class1,lambda_1,class2,lambda_2,test_patterns)
  %Test Patterns in form: num_features x num_patterns
  ind1 = []; ind2 = [];
  %Find number of train patterns (columns)
  n_c1p = size(class1,2);
  n_c2p = size(class2,2);
  num_test_patterns = size(test_patterns,2);
  for i = 1:num_test_patterns
    test_x = test_patterns(:,i);
    dist1 = test_x*ones(1,n_c1p)-class1;
    dist1 = sqrt(diag(dist1'*dist1))';
    dist2 = test_x*ones(1,n_c2p)-class2;
    dist2 = sqrt(diag(dist2'*dist2))';
    ind1 = find(dist1 < lambda_1);
    ind2 = find(dist2 < lambda_2);
    %Class 3 means ambiguous: inside footprints of both classes, or none
    p = 3;
    if ~isempty(ind1)
      p = 1;
    end
    if ~isempty(ind2)
      p = 2;
    end
    if (~isempty(ind1) && ~isempty(ind2))
      p = 3;
    end
    cl(i) = p;
  end
end
```

## Tidyverse

In the **Tidyverse** classification approach, we nest functions only in the logical sense, since our code exclusively uses pipes.

We create a generic function to discover how many "footprints" the given observation lives in.

Similar to the **Matlab** code above, we calculate the distance between an observation of the Test data and all of the training samples of a given class.

We then use the **lambda** values of the training samples to identify the *count* (nrow) of footprints the test data lives in.

```
rce_classify <- function( observation, Data.Tib, features ) {
  Data.Tib %>%
    select( features ) %>%
    mutate( euclid_dist = apply( . , 1, function(x) sqrt( sum( ( x - observation )^2 ) ) ) ) %>%
    filter( euclid_dist < Data.Tib$lambda ) %>%
    nrow
}
```

Without getting too complicated, we pass the Test data to a function that uses **rce_classify** to detect the number of *hits* against each class of Training data. First it finds the *hits* against *Class.2.Train.Tib*, and then it finds the hits against *Class.1.Train.Tib*.

The new function **rce_classify_tib** then uses the number of hits for each class to classify the data. In this example, we use a voting approach, although you can tailor the algorithm to classify a test point as ambiguous if it hits either *zero* or *more than one* class.

```
rce_classify_tib <- function( Test.Data.Tib, Class.One.Train.Tib, Class.Two.Train.Tib, features ) {
  Test.Data.Tib %<>%
    select( features ) %>%
    mutate( class.2.hits = apply( . , 1, function(x) rce_classify( x, Class.Two.Train.Tib, features ) ) ) %>%
    mutate( id = Test.Data.Tib$id )
  Test.Data.Tib %<>%
    select( features ) %>%
    mutate( class.1.hits = apply( . , 1, function(x) rce_classify( x, Class.One.Train.Tib, features ) ) ) %>%
    mutate( class.2.hits = Test.Data.Tib$class.2.hits,
            id = Test.Data.Tib$id ) %>%
    mutate( rce_class = ifelse( test = class.1.hits > class.2.hits,
                                yes = 1,
                                no = ifelse( test = class.2.hits > class.1.hits,
                                             yes = 2,
                                             no = 3 ) ) )
  return(Test.Data.Tib)
}
```
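As an aside, the nested **ifelse** at the end of **rce_classify_tib** can be expressed more readably with **dplyr::case_when**. A sketch on a made-up tibble of hit counts:

```
library("dplyr")
# Made-up hit counts standing in for the real class.1.hits / class.2.hits
Toy.Tib <- tibble( class.1.hits = c(3, 0, 2), class.2.hits = c(1, 4, 2) )
Toy.Tib %>%
  mutate( rce_class = case_when( class.1.hits > class.2.hits ~ 1,
                                 class.2.hits > class.1.hits ~ 2,
                                 TRUE                        ~ 3 ) )  # tie: ambiguous
```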

We then apply these functions to our data.

## Matlab

In Matlab:

```
cl1 = rce_classify(class1,lambda_1,class2,lambda_2,test_class1);
cl2 = rce_classify(class1,lambda_1,class2,lambda_2,test_class2);
```

## Tidyverse

In Tidyverse:

```
Test.Patterns %<>%
rce_classify_tib(Class.1.Train.Tib, Class.2.Train.Tib, features)
```

# Graphing RCE NN

We can graph the RCE NN in action by creating a uniform data grid and running **rce_classify** against every point.

First, create the data grid. We find the highest valued observation in the data set, in order to ensure that our graph includes this data point.

```
max_obs <- Class.1.Train.Tib %>%
  bind_rows( Class.2.Train.Tib ) %>%
  select( features ) %>%
  max
test_grid <- expand.grid( seq( 0, max_obs * 1.1, length.out = 50 ),
                          seq( 0, max_obs * 1.1, length.out = 50 ),
                          seq( 0, max_obs * 1.1, length.out = 50 ) )
names( test_grid ) <- features
```
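Note the scale here: 50 points per axis over three features produces 50^3 = 125,000 grid rows, which is why the classification step takes so long. If you prefer to stay in the Tidyverse, **tidyr::crossing** builds the same grid directly as a tibble with named columns; a sketch assuming the *max_obs* computed above:

```
library("tidyr")
# Axis values shared by all three features; max_obs comes from the chunk above
axis_vals <- seq( 0, max_obs * 1.1, length.out = 50 )
test_grid <- crossing( alkphos = axis_vals,
                       sgpt    = axis_vals,
                       gammagt = axis_vals )  # 125,000 rows, columns already named
```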

Now we apply the classification to every test data point, in order to blanket the entire canvas.

Note: This will take a long time. If you don't want to wait, you can execute `test_grid = readxl::read_xlsx("NinetyK.xlsx")` instead.
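The classification call for the 3D grid is not shown in the post; presumably it mirrors the two-feature call later in the script, something like:

```
# Hypothetical: classify the 3D grid with the functions defined above
test_grid %<>%
  as_tibble %>%
  tibble::rowid_to_column("id") %>%
  rce_classify_tib( Class.1.Train.Tib, Class.2.Train.Tib, features )
```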

You can then use **Plotly** to generate a three-dimensional image that you can rotate.

```
library("plotly")
test_grid %>% filter( rce_class != 3 ) %>%
mutate( rce_class = ifelse( test = rce_class == 1,
yes = "one",
no = "two") ) %>%
plot_ly( x = ~alkphos, y = ~sgpt, z = ~gammagt , color = ~rce_class )
```

If you would like to see a 2D graph, then re-run the script using two features.

```
features <- c( "alkphos", "sgpt")
```

Create a test grid using two dimensions and classify.

```
test_grid <- expand.grid( seq( 0, max_obs * 1.1, length.out = 300 ),
                          seq( 0, max_obs * 1.1, length.out = 300 ) )
names( test_grid ) <- features
test_grid %<>%
as_tibble %>%
tibble::rowid_to_column("id") %>%
rce_classify_tib( Class.1.Train.Tib, Class.2.Train.Tib, features)
```

You can just load the pre-processed data instead of waiting.

```
test_grid = readxl::read_xlsx("NinetyK.xlsx")
```

Here I plot using the Grammar of Graphics (**ggplot2**).

```
ggplot( ) +
geom_point( data = test_grid %>% filter( rce_class == 1 ), aes(x = alkphos, y = sgpt, color = class.1.hits )) +
geom_point( data = test_grid %>% filter( rce_class == 2 ), aes(x = alkphos, y = sgpt, color = class.2.hits ))
```

# Conclusion

This blog post described how to convert a Matlab script that uses for loops and nested functions into a functional, pipe-based Tidyverse script.

If you enjoyed this, you may enjoy these other Machine Learning posts.