Machine Learning Classification and Ecological Data

I’ve been toying around with various machine learning algorithms in R, trying to expand my knowledge of data science and see how these techniques can be applied to my long-running projects. Mostly, I’ve been playing with the data from my thesis, where we examined beaver foraging preferences in the Adirondacks and the impact those preferences have on the surrounding ecosystem.

One of the most interesting parts of that thesis was the work we did to predict which stems beaver would harvest across the landscape. We had some pretty decent success building logistic regression models, both for the most common species harvested in our dataset and for the dataset as a whole.

However, we didn’t dive too far into other classification algorithms - partially due to time constraints on our analyses, and partially because past work from other regions had relied on logistic models. As such, I figured I’d use the techniques I was learning to classify our beaver data, predicting which stems in our dataset were most likely to be harvested.

Since we haven’t gotten published quite yet, this post is going to use a bootstrapped sample of our data, with the values jittered slightly so they don’t match the real field data. This dataset (which I’ve stored here in ForestData) looks something like this:

Distance Diameter DiameterSquared Harvested
1.386294      1.5            2.25         1
1.386294      1.5            2.25         0

Those columns contain the variables we found to be most predictive of which stems beaver would harvest:

  • The distance of the stem from the beaver impoundment, since beavers don’t like to wander far from their pond’s relative safety. We use the log-transformed distance: while the selectivity trends for individual species are linear, the overall trend is curved, since it combines multiple linear trends.
  • The diameter of each stem, since certain sizes are harvested more often.
  • The squared diameter of each stem, since beaver won’t harvest stems that are too small (they offer little energy) or too large (they take too much energy to drag back to the pond). Introducing a quadratic term lets us capture this preference for intermediate sizes.

Additionally, our dataframe has a binary variable representing whether any given stem has been harvested by beaver - coded as 0 for a standing stem and 1 for a harvested one. That’s our target variable.
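
One housekeeping note before we start modeling: caret’s classifiers and its confusionMatrix() function expect the target to be a factor rather than a numeric column. If your copy of the data stores Harvested as 0/1 integers, convert it first - a one-liner, assuming ForestData is already loaded:

## Convert the 0/1 target into a factor for caret's classification tools
ForestData$Harvested <- factor(ForestData$Harvested)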

To compare the different methods, we’re going to evaluate each model’s accuracy on a held-out test dataset. We’ll use the caret package to split the data, with 3/4 used to train our models and 1/4 reserved for testing:

## Make sure our work is reproducible!
set.seed(42)
library(caret)

## Create a vector identifying row numbers to include in the training dataset
index <- createDataPartition(ForestData$Harvested, p = 0.75, list = FALSE)

## Split the full dataset into training and testing portions
BeaverTrain <- ForestData[index, ]
BeaverTest <- ForestData[-index, ]
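
As an optional sanity check: createDataPartition() samples within each class, so the proportion of harvested stems should come out roughly the same in both portions:

## The stratified split should preserve the proportion of harvested stems
prop.table(table(BeaverTrain$Harvested))
prop.table(table(BeaverTest$Harvested))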


In the course of writing the thesis, we found that the best model formula looked something like this:

LogMod <- glm(Harvested ~ 
                Distance + 
                Diameter + 
                DiameterSquared + 
                Distance:Diameter + 
                Distance:DiameterSquared, 
              data=BeaverTrain, 
              family="binomial")


We’re able to assess the accuracy of this model by generating predictions for our hold-out test dataset, then using caret’s built-in confusionMatrix() function:

## Generate predictions for our test dataset
LogPred <- predict(LogMod, BeaverTest, type = "response") 

## Convert the numeric probabilities generated by predict() into 1-or-0
## predictions (round() amounts to a 0.5 probability cutoff)
LogFactor <- factor(round(LogPred))

## Generate a confusion matrix comparing predictions to the actual dataset
(LCM <- confusionMatrix(LogFactor, BeaverTest$Harvested))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 111  41
##          1  30  62
##                                           
##                Accuracy : 0.709           
##                  95% CI : (0.6477, 0.7652)
##     No Information Rate : 0.5779          
##     P-Value [Acc > NIR] : 1.576e-05       
##                                           
##                   Kappa : 0.3949          
##  Mcnemar's Test P-Value : 0.2353          
##                                           
##             Sensitivity : 0.7872          
##             Specificity : 0.6019          
##          Pos Pred Value : 0.7303          
##          Neg Pred Value : 0.6739          
##              Prevalence : 0.5779          
##          Detection Rate : 0.4549          
##    Detection Prevalence : 0.6230          
##       Balanced Accuracy : 0.6946          
##                                           
##        'Positive' Class : 0               
## 


As we can see, this model does a semi-decent job of predicting which stems a beaver will harvest, with an accuracy of 70.9% - comfortably above the 57.8% no-information rate we’d get by always predicting a stem was left standing.

Another common metric used in evaluating logistic models is the c-statistic (also known as the AUC, the area under the receiver operating characteristic curve), which we can calculate using the pROC package’s roc() function. To do so, I’m also going to have to make numeric versions of our factor variables:

library(pROC)

## as.numeric() on a factor returns the level index, which is why the
## output below reports 1s and 2s rather than 0s and 1s
LogFactorNum <- as.numeric(LogFactor)
BeaverNum <- as.numeric(BeaverTest$Harvested)

(LogROC <- roc(BeaverNum, LogFactorNum))
## 
## Call:
## roc.default(response = BeaverNum, predictor = LogFactorNum)
## 
## Data: LogFactorNum in 141 controls (BeaverNum 1) < 103 cases (BeaverNum 2).
## Area under the curve: 0.6946


So this model has a c-statistic of 0.695. A random model would give us a c-statistic of 0.5, while models with c-statistics of 0.7 or higher are usually considered good fits to the data - meaning, taken together with the accuracy value above, that our model is performing pretty alright!
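
One caveat worth flagging: because we handed roc() the thresholded 0/1 predictions rather than the fitted probabilities, the AUC above is identical to the balanced accuracy from the confusion matrix (0.6946). To get the conventional c-statistic - which measures how well the model ranks stems by harvest probability - we could pass the raw probabilities instead, something like:

## AUC computed from the raw fitted probabilities, not the 0/1 predictions
roc(BeaverNum, LogPred)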

Of course, I already knew that - this is the best model I found while writing my thesis, after all. What I’m more interested in is how other algorithms perform on the same dataset.

One of the most popular machine learning algorithms these days is the random forest classifier, which aggregates a number of decision trees to make its eventual classification predictions. We can fit this model using the train() function from caret, taking advantage of the random forest implementation built into the ranger package:

library(ranger)
ForestFit <- train(Harvested ~ ., data = BeaverTrain, method = "ranger")
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
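
By default, train() chooses the tuning values itself (hence the truncated-grid note above). If you wanted more control, you could hand it an explicit grid over ranger’s three tuning parameters - a sketch, with grid values picked arbitrarily:

## Optional: search an explicit tuning grid rather than caret's default
ForestTuned <- train(Harvested ~ ., data = BeaverTrain, method = "ranger",
                     tuneGrid = expand.grid(mtry = 1:3,
                                            splitrule = c("gini", "extratrees"),
                                            min.node.size = c(1, 5, 10)))

(We’ll stick with the default ForestFit for the comparisons below.)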


Once we do that, we can assess this algorithm using the same metrics we used for the logistic regression:

ForestPredict <- predict(ForestFit, BeaverTest)
confusionMatrix(ForestPredict, BeaverTest$Harvested)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 109  36
##          1  32  67
##                                           
##                Accuracy : 0.7213          
##                  95% CI : (0.6605, 0.7766)
##     No Information Rate : 0.5779          
##     P-Value [Acc > NIR] : 2.421e-06       
##                                           
##                   Kappa : 0.4258          
##  Mcnemar's Test P-Value : 0.716           
##                                           
##             Sensitivity : 0.7730          
##             Specificity : 0.6505          
##          Pos Pred Value : 0.7517          
##          Neg Pred Value : 0.6768          
##              Prevalence : 0.5779          
##          Detection Rate : 0.4467          
##    Detection Prevalence : 0.5943          
##       Balanced Accuracy : 0.7118          
##                                           
##        'Positive' Class : 0               
## 
ForestPredictNum <- as.numeric(ForestPredict)
roc(BeaverNum, ForestPredictNum)
## 
## Call:
## roc.default(response = BeaverNum, predictor = ForestPredictNum)
## 
## Data: ForestPredictNum in 141 controls (BeaverNum 1) < 103 cases (BeaverNum 2).
## Area under the curve: 0.7118


It looks like we have a slight improvement with both metrics - the random forest is (just barely) a better fit to our data.

The last algorithm I want to apply to the dataset is the support vector machine method. The best implementation I’ve found for this in R is from the e1071 package, which lets us build support vector machines via the svm() command:

library(e1071)
ForestSVM <- svm(Harvested ~ ., data = BeaverTrain)
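
Note that svm() quietly picks a radial kernel and default cost and gamma values here. e1071 also ships a tune.svm() helper if you’d rather grid-search those hyperparameters - a sketch, with arbitrary grid values (the default fit is what we’ll evaluate below):

## Optional: grid-search the radial kernel's cost and gamma parameters
SVMTune <- tune.svm(Harvested ~ ., data = BeaverTrain,
                    gamma = 10^(-3:0), cost = 10^(0:2))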


And just like before, we’re able to assess this algorithm’s performance using a confusion matrix and the area under the curve:

SVMPred <- predict(ForestSVM, BeaverTest)
confusionMatrix(SVMPred, BeaverTest$Harvested)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 111  41
##          1  30  62
##                                           
##                Accuracy : 0.709           
##                  95% CI : (0.6477, 0.7652)
##     No Information Rate : 0.5779          
##     P-Value [Acc > NIR] : 1.576e-05       
##                                           
##                   Kappa : 0.3949          
##  Mcnemar's Test P-Value : 0.2353          
##                                           
##             Sensitivity : 0.7872          
##             Specificity : 0.6019          
##          Pos Pred Value : 0.7303          
##          Neg Pred Value : 0.6739          
##              Prevalence : 0.5779          
##          Detection Rate : 0.4549          
##    Detection Prevalence : 0.6230          
##       Balanced Accuracy : 0.6946          
##                                           
##        'Positive' Class : 0               
## 
SVMPredNum <- as.numeric(SVMPred)
roc(BeaverNum, SVMPredNum)
## 
## Call:
## roc.default(response = BeaverNum, predictor = SVMPredNum)
## 
## Data: SVMPredNum in 141 controls (BeaverNum 1) < 103 cases (BeaverNum 2).
## Area under the curve: 0.6946


Funnily enough, this model performed exactly the same as the logistic regression I used in my thesis! In fact, the two models made the same prediction for every single observation in our test dataset. We can prove this by counting the observations where the two predictions disagree - which gives us zero, since the two are always in agreement.

## Count the observations where the SVM and logistic predictions differ
sum(SVMPredNum != LogFactorNum)
## [1] 0


Since these two models make identical predictions, and logistic regressions are easier to understand, more commonly used in the literature, and more useful for wildlife managers, we can eliminate the SVM from further consideration. That leaves us choosing between the random forest and the logistic regression. While the random forest did perform marginally better, I’m interested in knowing exactly where the differences between these two models lie.

For that, I’m going to visualize each model’s predictions, showing which sizes and distances of stems it believes are most likely to be harvested by beaver. To do that, I’m going to create a dataframe containing every possible combination of diameter and distance we found in the field. The method I use here does a bit of data wrangling that I didn’t show above - I have to create the log-distance and squared-diameter columns separately, then remove the linear distance column, since the models were fit on the log-transformed distance and have never seen the raw values:

library(tidyverse)
## Every combination of diameter (cm) and raw distance (m) seen in the field
GraphData <- expand.grid(Diameter = seq(0, 15, by = 0.1),
                         PlotCode = seq(4, 65, by = 0.1)) %>%
  mutate(DiameterSquared = Diameter^2,
         Distance = log(PlotCode))

## Set the raw distances aside; the models only know the transformed columns
PlotDistance <- select(GraphData, PlotCode)
GraphData <- select(GraphData, -PlotCode)


We’ll then refit our random forest model in a way that lets us see the probabilities it assigns to any given stem being harvested, rather than the eventual prediction:

## caret will only return class probabilities when the factor levels are
## valid R variable names ("0" and "1" are not), so relabel them first
levels(BeaverTrain$Harvested) <- c("No", "Yes")
ForestFit <- train(Harvested ~ ., 
                   data = BeaverTrain, 
                   method = "ranger", 
                   trControl = trainControl(classProbs = TRUE))
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .


We then add predictions from our random forest and logistic regression models to the graph dataframe, and restore the linear distance column:

GraphData$ForestPredict <- predict(ForestFit, GraphData, type = "prob")$Yes
GraphData$LogPredict <- predict(LogMod, GraphData, type = "response")
GraphData <- cbind(GraphData, PlotDistance)


And now we can go ahead and make our tile plots, showing the probability that a stem of a given size and distance will be harvested. I make use of the viridis and cowplot packages here to make the final plot look a little more appealing:

GraphData %>%
  rename(`Logistic Model` = LogPredict,
         `Random Forest` = ForestPredict) %>%
  gather(key = "Type", value = "Predict", `Logistic Model`, `Random Forest`) %>%
  ggplot(aes(PlotCode, Diameter, fill = Predict)) + 
    geom_tile() +
    viridis::scale_fill_viridis(name = "Probability of \nStem Harvest \nby Beaver") + 
    cowplot::theme_cowplot() + 
    scale_y_continuous(expand = c(0,0), 
                       breaks = c(0, 15),
                       labels = c("Small", "Large")) + 
    labs(x = "Distance from the Impoundment (m)", 
         y = "Diameter (cm)") + 
    scale_x_continuous(breaks = c(4, 64),
                       labels = c("Near", "Far")) + 
    facet_wrap(~ Type) + 
    theme(strip.background = element_blank()) + 
    coord_fixed(ratio = 4)
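
If you want the figure as a file rather than just in the plot pane, ggplot2’s ggsave() will write out the most recent plot - the filename and dimensions below are arbitrary:

## Save the most recently displayed plot to disk
ggsave("beaver_harvest_predictions.png", width = 8, height = 4)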


We can now clearly identify the differences between our two models - while the logistic regression predicts stems of intermediate sizes are preferred at all distances, the random forest predicts a few “sweet-spot” distances.

This is where some knowledge about our data comes in handy. For instance, I know that the general trend of what percent of stems were harvested in my plots looks like this:

[Figure: bar chart of the percentage of stems harvested in each distance band from the impoundment]

But I also know that the second bar is artificially low, due to a quirk in how we sampled. You see, we’d only sample as far from the impoundment as beaver were harvesting - once we found we were outside the area they foraged, we’d stop sampling. Since beavers like to forage as close to the impoundment as possible, a lot of our sampling ended at the distances represented by that second bar, where we’d measure plenty of live stems and no stumps.

As such, it looks like the random forest performs better because it adjusts to that anomaly in our data - which is to say, the random forest is a better fit to our sampling method, but not necessarily to how beaver actually forage. So I wound up back where I started, using logistic regression to predict what beaver will forage - but at least now I’m using it because it’s the best option, rather than just because it’s what I know!

I do think, though, that I should have seen this result coming. After all - it doesn’t take an algorithm to let you know that beaver like logs!
