Simulating tournament results with R

In R, how does one run a tournament simulation?
I have the probability of each team winning against each of the other teams, for example:
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
Which would mean something like this:
1 2 3 4 5 6 7 8
1 0 0.76 0.35 0.81 0.95 0.08 0.47 0.26
2 0 0.00 0.24 0.34 0.54 0.48 0.53 0.54
3 0 0.00 0.00 0.47 0.51 0.68 0.50 0.80
4 0 0.00 0.00 0.00 0.52 0.59 0.38 0.91
5 0 0.00 0.00 0.00 0.00 0.05 0.88 0.64
6 0 0.00 0.00 0.00 0.00 0.00 0.23 0.65
7 0 0.00 0.00 0.00 0.00 0.00 0.00 0.77
8 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The next step would be to run a set of simulations, say n = 100000
First the quarter-finals (best out of 3):
1 vs 8
2 vs 7
3 vs 6
4 vs 5
And then the winners of each pair face off in the semi-finals:
1-8 winner VS 4-5 winner
2-7 winner VS 3-6 winner
The winners move on to the final. Every round is best of 3.
What approach/package could I use to run bracket simulations? I did find a package called mRchmadness but it's too specific to handle this simulation.

Here is some dummy code that should help you figure out how to do it. It is not optimized at all, but it is deliberately linear so that each step is easy to follow.
prob_res <- matrix(round(runif(64),2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0
prob_res <- as.data.frame(prob_res)
colnames(prob_res) <- 1:8
rownames(prob_res) <- 1:8
prob_res
## All possible pairings of the 8 teams
posscombi<-t(combn(1:8, 2))
## This function plays repetitionMatches matches for every possible pairing of teams and counts each team's wins.
## It "reproduces" a whole league, assuming the winning probabilities are static.
League <- function(repetitionMatches, posscomb , prob_res)
{
TotalVect<-integer(0)
for(i in 1:nrow(posscomb)){
pair <- posscomb[i,]
Vect<-sample(pair,
size = repetitionMatches,
prob = c(prob_res[pair[1], pair[2]], 1-prob_res[pair[1], pair[2]]),
replace = TRUE)
TotalVect <- c(TotalVect, Vect)
}
return(table(TotalVect))
}
Result<-League(100,posscomb = posscombi, prob_res= prob_res)
Myorder<-order(Result)
### Quarters
pair1<- c(names(Result)[Myorder[c(1,8)]])
pair2<- c(names(Result)[Myorder[c(2,7)]])
pair3<- c(names(Result)[Myorder[c(3,6)]])
pair4<- c(names(Result)[Myorder[c(4,5)]])
## This function gives you the results of n matches (3 in the example).
## prob_res is only filled above the diagonal, so the pair is looked up with the lower-numbered team first.
PlayMatch<-function(pairs, numMatches){
teams <- sort(as.integer(pairs))
p <- prob_res[teams[1], teams[2]]
Res <-sample(teams, size = numMatches,
prob = c(p, 1-p),
replace = TRUE)
return(table(Res))
}
# Results of the matches
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
winner3<-PlayMatch(pairs = pair3, 3)
winner4<-PlayMatch(pairs = pair4, 3)
## Semis
#Choosing the winning teams
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
pair2<- c(names(winner3)[which.max(winner3)],names(winner4)[which.max(winner4)])
winner1<-PlayMatch(pairs = pair1, 3)
winner2<-PlayMatch(pairs = pair2, 3)
## Final
# Same as before
pair1<- c(names(winner1)[which.max(winner1)],names(winner2)[which.max(winner2)])
winner1<-PlayMatch(pairs = pair1, 3)
paste0( "team ",names(winner1)[which.max(winner1)], " is the winner!")
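The question also asks for many simulated runs of a fixed bracket (1 vs 8, 4 vs 5, 2 vs 7, 3 vs 6). As a rough sketch of how that could look, the code below wraps one bracket in a function and repeats it with replicate. The helper names play_series and sim_bracket, the seed, and the choice of n are assumptions made for illustration; they are not part of the answer above. A best-of-3 series is modelled as three independent games, which gives the same series winner as stopping after two wins.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same construction as in the question: only the upper triangle is filled.
prob_res <- matrix(round(runif(64), 2), 8, 8)
prob_res[lower.tri(prob_res, diag = TRUE)] <- 0

# Hypothetical helper: play one best-of-3 series and return the winner's number.
play_series <- function(a, b, probs, best_of = 3) {
  lo <- min(a, b)
  hi <- max(a, b)
  # Number of games won by the lower-numbered team.
  wins_lo <- rbinom(1, best_of, probs[lo, hi])
  if (wins_lo > best_of / 2) lo else hi
}

# Hypothetical helper: one full bracket with the seeding from the question.
sim_bracket <- function(probs) {
  q1 <- play_series(1, 8, probs)
  q2 <- play_series(4, 5, probs)
  q3 <- play_series(2, 7, probs)
  q4 <- play_series(3, 6, probs)
  s1 <- play_series(q1, q2, probs)  # 1-8 winner vs 4-5 winner
  s2 <- play_series(q3, q4, probs)  # 2-7 winner vs 3-6 winner
  play_series(s1, s2, probs)        # final
}

n <- 10000
champions <- replicate(n, sim_bracket(prob_res))

# Estimated probability of each team winning the tournament.
prop.table(table(factor(champions, levels = 1:8)))
```

The table at the end estimates each team's chance of winning the whole tournament; increasing n tightens the estimate at the cost of run time.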

Related

R: Efficient Rolling Calculations by Group

I have some asset data in the middle of a dplyr pipeline similar to this:
fcast <- data.frame(group = rep(c('a','b'),each=12),
yr = rep(2018:2019,each=6,times=2),
mo = rep(c(7:12,1:6),times=2),
book_value = c(10000,rep(0,times=11),15000,rep(0,times=11)),
accum_depr = c(200,rep(0,times=11),700,rep(0,times=11)),
depr_rate = .02,
depr_expense = c(10,rep(0,times=11),15,rep(0,times=11)),
book_addn = c(0,0,0,0,80,0,0,40,0,0,0,0,0,0,100,70,0,0,0,0,0,0,0,0),
book_growth = 1.01
)
I need to apply some (ideally, tidy) rolling function to each group like the one below, which does not work at the moment.
roll_depr <- function(.data) {
r_d <- .data$depr_rate[1]
r_g <- .data$book_growth[1]
for(i in 2:length(.data$depreciation_rate)) {
.data$book_value[i] <- .data$book_value[i-1]*r_g + .data$book_addn[i]
.data$depr_expense[i] <- (.data$book_value[i] - .data$accum_depr[i-1])*r_d
.data$accum_depr[i] <- .data$accum_depr[i-1]+.data$depr_expense[i]
}
return(.data)
}
To further complicate things, this calculation will be performed in a shiny dashboard repeatedly as users input new values for book_addn. The actual dataset is much larger, and for loops don't cut it.
I know a better solution must exist with data.table or apply, but I haven't been able to figure it out. Bonus points if this can be accomplished from within the pipeline!
EDIT: I'm expecting the code to output the following table. Basically, the book_value grows at 1% of the previous value, plus any additions in the period. The depr_expense takes the book_value net of the previous accum_depr, and multiplies by the depr_rate. Finally, accum_depr updates to account for the newly-calculated depr_expense.
group yr mo book_value accum_depr depr_rate depr_expense book_addn book_growth
a 2018 7 10000.00 200.00 0.02 10.00 0 1.01
a 2018 8 10100.00 398.00 0.02 198.00 0 1.01
a 2018 9 10201.00 594.06 0.02 196.06 0 1.01
a 2018 10 10303.01 788.24 0.02 194.18 0 1.01
a 2018 11 10486.04 982.20 0.02 193.96 80 1.01
a 2018 12 10590.90 1174.37 0.02 192.17 0 1.01
a 2019 1 10696.81 1364.82 0.02 190.45 0 1.01
a 2019 2 10843.78 1554.40 0.02 189.58 40 1.01
a 2019 3 10952.22 1742.35 0.02 187.96 0 1.01
a 2019 4 11061.74 1928.74 0.02 186.39 0 1.01
a 2019 5 11172.35 2113.61 0.02 184.87 0 1.01
a 2019 6 11284.08 2297.02 0.02 183.41 0 1.01
b 2018 7 15000.00 700.00 0.02 15.00 0 1.01
b 2018 8 15150.00 989.00 0.02 289.00 0 1.01
b 2018 9 15401.50 1277.25 0.02 288.25 100 1.01
b 2018 10 15625.52 1564.22 0.02 286.97 70 1.01
b 2018 11 15781.77 1848.57 0.02 284.35 0 1.01
b 2018 12 15939.59 2130.39 0.02 281.82 0 1.01
b 2019 1 16098.98 2409.76 0.02 279.37 0 1.01
b 2019 2 16259.97 2686.76 0.02 277.00 0 1.01
b 2019 3 16422.57 2961.48 0.02 274.72 0 1.01
b 2019 4 16586.80 3233.99 0.02 272.51 0 1.01
b 2019 5 16752.67 3504.36 0.02 270.37 0 1.01
b 2019 6 16920.19 3772.68 0.02 268.32 0 1.01
This can actually be done at decent speed with two simple functions that implement for loops, and using them within mutate.
The key is to recognize that book_value can be calculated independently in its own loop. Once that has been done, accum_depr[i] is only a function of accum_depr[i-1] and book_value[i]. The depr_expense can be extracted as the difference between accum_depr and its lag, but I don't need it for my purposes.
depr_expense[i] = (book_value[i] - accum_depr[i-1])*depr_rate
accum_depr[i] = accum_depr[i-1] + depr_expense[i]
Which implies
accum_depr[i] = accum_depr[i-1]*(1-depr_rate) + book_value[i]*depr_rate
The code:
roll_book <- function(book_val,addn,g_rate) {
z <- rep(0,length(book_val))
z[1] <- book_val[1]
for(i in 2:length(book_val)) {
z[i] <- z[i-1]*g_rate[1] + addn[i]
}
return(z)
}
roll_depr <- function(accum_depr,book_val,depr_rate) {
r_d <- depr_rate[1]
z <- rep(0, length(accum_depr))
z[1] <- accum_depr[1]
for(i in 2:length(accum_depr)) {
z[i] <- book_val[i]*r_d + z[i-1]*(1-r_d)
}
return(z)
}
library(dplyr)
fcast <- fcast %>%
group_by(group) %>%
mutate(book_value = roll_book(book_value,book_addn,book_growth),
accum_depr = roll_depr(accum_depr,book_value,depr_rate))
On my dataset with ~110,000 rows and ~450 groups:
Unit: milliseconds
min lq mean median uq max neval
65.01492 67.14825 70.80178 69.85741 72.53611 98.75224 100
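Both loops above are first-order linear recurrences of the form z[i] = z[i-1]*a + b[i], so they can also be expressed with base R's Reduce. The following is only a sketch: roll_linear is a hypothetical helper name, the inputs are the group 'a' values from the question, and no claim is made that this is faster than the loops.

```r
# Hypothetical helper: z[1] = z1, z[i] = z[i-1] * a + b[i] for i >= 2.
roll_linear <- function(z1, a, b) {
  if (length(b) < 2) return(z1)
  Reduce(function(prev, add) prev * a + add, b[-1],
         accumulate = TRUE, init = z1)
}

# Group 'a' from the question.
book_addn   <- c(0, 0, 0, 0, 80, 0, 0, 40, 0, 0, 0, 0)
depr_rate   <- 0.02
book_growth <- 1.01

# First pass: book value depends only on its own previous value.
book_value <- roll_linear(10000, book_growth, book_addn)

# Second pass: accum_depr[i] = accum_depr[i-1]*(1 - r) + book_value[i]*r.
accum_depr <- roll_linear(200, 1 - depr_rate, book_value * depr_rate)

round(cbind(book_value, accum_depr), 2)
```

The rounded result can be compared row by row with the expected table in the question.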

Column Mean for rows with unique values

How can I compute the mean of the R, R1, R2 and R3 values over the rows that share the same lon, lat pair? I'm sure this question exists multiple times, but I could not easily find it.
lon lat length depth R R1 R2 R3
1 147.5348 -35.32395 13709 1 0.67 0.80 0.84 0.83
2 147.5348 -35.32395 13709 2 0.47 0.48 0.56 0.54
3 147.5348 -35.32395 13709 3 0.43 0.29 0.36 0.34
4 147.4290 -35.27202 12652 1 0.46 0.61 0.60 0.58
5 147.4290 -35.27202 12652 2 0.73 0.96 0.95 0.95
6 147.4290 -35.27202 12652 3 0.77 0.92 0.92 0.91
I'd recommend the split-apply-combine strategy: split by BOTH lon and lat, apply mean to each group, then recombine into a single data frame.
This is straightforward with dplyr:
library(dplyr)
mydata %>%
group_by(lon, lat) %>%
summarize(
mean_r = mean(R)
, mean_r1 = mean(R1)
, mean_r2 = mean(R2)
, mean_r3 = mean(R3)
)
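For readers who prefer to stay in base R, the same split-apply-combine step can be sketched with aggregate. The data frame below is retyped from the question; the name mydata follows the answer above, and the result name res is an assumption for illustration.

```r
# Data retyped from the question.
mydata <- data.frame(
  lon    = rep(c(147.5348, 147.4290), each = 3),
  lat    = rep(c(-35.32395, -35.27202), each = 3),
  length = rep(c(13709, 12652), each = 3),
  depth  = rep(1:3, times = 2),
  R      = c(0.67, 0.47, 0.43, 0.46, 0.73, 0.77),
  R1     = c(0.80, 0.48, 0.29, 0.61, 0.96, 0.92),
  R2     = c(0.84, 0.56, 0.36, 0.60, 0.95, 0.92),
  R3     = c(0.83, 0.54, 0.34, 0.58, 0.95, 0.91)
)

# One row per (lon, lat) pair, with the mean of each R column.
res <- aggregate(cbind(R, R1, R2, R3) ~ lon + lat, data = mydata, FUN = mean)
res
```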

Indexing certain values in a function

I have a data frame that looks like this:
df <-
ID TIME AMT k10 k12 k21
1.00 0.00 50.00 0.10 0.40 0.01
1.00 1.00 0.00 0.10 0.40 0.01
1.00 2.00 0.00 0.10 0.40 0.01
1.00 3.00 50.00 0.10 0.40 0.01
1.00 4.00 0.00 0.10 0.40 0.01
2.00 0.00 100.00 0.25 0.50 0.06
2.00 1.00 0.00 0.25 0.50 0.06
2.00 2.00 0.00 0.25 0.50 0.06
I am using the values of k10, k12 and k21 in the calculations in the function below. Each of these values is specific to a subject ID and doesn't change with time. My question is: how can I write the function so that it uses only the first value for each subject ID? As you may notice in the function below, this is what I am currently using:
k10 <- d$k10
k12 <- d$k12
k21 <- d$k21
Each of these gives a vector holding the same value at every time point, which is obviously unnecessary; I just need one value for each. I think that is one reason why I am getting warnings saying number of items to replace is not a multiple of replacement length
#This is the function that I am using:
TwoCompIVbolus <- function(d){
#set initial values in the compartments
d$A1[d$TIME==0] <- d$AMT[d$TIME==0] # drug amount in the central compartment at time zero.
d$A2[d$TIME==0] <- 0 # drug amount in the peripheral compartment at time zero.
k10 <- d$k10
k12 <- d$k12
k21 <- d$k21
k20 <- 0
E1 <- k10+k12
E2 <- k21+k20
#calculate hybrid rate constants
lambda1 <- 0.5*(k12+k21+k10+sqrt((k12+k21+k10)^2-4*k21*k10))
lambda2 <- 0.5*(k12+k21+k10-sqrt((k12+k21+k10)^2-4*k21*k10))
for(i in 2:nrow(d))
{
t <- d$TIME[i]-d$TIME[i-1]
A1last <- d$A1[i-1]
A2last <- d$A2[i-1]
A1term = (((A1last*E2+A2last*k21)-A1last*lambda1)*exp(-t*lambda1)-((A1last*E2+A2last*k21)-A1last*lambda2)*exp(-t*lambda2))/(lambda2-lambda1)
d$A1[i] = A1term + d$AMT[i] #Amount in the central compartment
A2term = (((A2last*E1+A1last*k12)-A2last*lambda1)*exp(-t*lambda1)-((A2last*E1+A1last*k12)-A2last*lambda2)*exp(-t*lambda2))/(lambda2-lambda1)
d$A2[i] = A2term #Amount in the peripheral compartment
}
d
}
#to apply it for each subject
library(plyr)
simdf <- ddply(df, .(ID), TwoCompIVbolus)
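You can just use k10 <- d$k10[1] (and likewise for k12 and k21). With the full columns, lambda1 and lambda2 become vectors, so A1term and A2term are vectors as well, and assigning them to the single element d$A1[i] is what triggers the warning.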
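A tiny illustration of why the full column causes the warning while the first element does not. The data frame and the variable names here are made up for the demonstration and are much simpler than the function in the question.

```r
# Made-up data: three time points, constant k10.
d <- data.frame(TIME = 0:2, k10 = 0.1)
x <- numeric(3)

k_all <- d$k10      # length 3, one copy per row
k_one <- d$k10[1]   # length 1

# Assigning a length-3 result into a single slot raises a warning.
res <- tryCatch({
  x[2] <- k_all * 2
  "no warning"
}, warning = function(w) conditionMessage(w))
res

# With the scalar, the assignment is clean.
x[2] <- k_one * 2
x
```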

Fast(er) way of indexing matrix in R

Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:
for (i in 1:99000) {
subset.data <- data[index[, i], ]
}
Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon
this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset
the original data matrix to draw resamples of the data.
# The basic setup
B <- 1000 # no. of bootstrap replications
n <- 250 # no. of observations
m <- 100 # no. of models/data series
# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).
boot.index <- matrix(sample(1:n, n * B, replace=TRUE), nrow=n, ncol=B)
# Make matrix with m data series of length n.
sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)
subsetMatrix <- function(data, index) { # fn definition for timing
subset.data <- data[index, ]
return(subset.data)
}
# check how long it takes.
Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) {
for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000
boot.data <- subsetMatrix(sample.data, boot.index[, b])
# do some other stuff
}
# do some more stuff
}
Rprof()
summaryRprof("subsetMatrix.out")
# > summaryRprof("subsetMatrix.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 9.96 100 9.96 100
# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 1.98 15.76 2.20 17.52
# makeIndex 1.08 8.60 2.12 16.88
# makeStats 0.66 5.25 9.66 76.91
# runif 0.60 4.78 0.72 5.73
# apply 0.30 2.39 0.42 3.34
# is.data.frame 0.22 1.75 0.22 1.75
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# estimateMCS 0.10 0.80 12.56 100.00
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
#
# $by.total
# total.time total.pct self.time self.pct
# estimateMCS 12.56 100.00 0.10 0.80
# makeStats 9.66 76.91 0.66 5.25
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 2.20 17.52 1.98 15.76
# makeIndex 2.12 16.88 1.08 8.60
# runif 0.72 5.73 0.60 4.78
# doTest 0.68 5.41 0.00 0.00
# apply 0.42 3.34 0.30 2.39
# aperm 0.26 2.07 0.00 0.00
# is.data.frame 0.22 1.75 0.22 1.75
# sweep 0.20 1.59 0.00 0.00
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# unlist 0.06 0.48 0.00 0.00
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
# mean 0.02 0.16 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56
Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several
parameter constellations, it would take something like 40 days. Using parallel processing and better CPU power it's possible to do faster, but
still not very pleasing :/
Is there a better way to resample the data / get rid of the loop?
Can apply, Vectorize, replicate etc. come in anywhere?
Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?
Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!
related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix
from there
mapply(function(row) return(sample.data[row,]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))
didn't really do it for me.
I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:
makeStats <- function(data, index) {
data.mean <- colMeans(data)
m <- nrow(data)
n <- ncol(index)
tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m))
weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
boot.data.mean <- t(data) %*% weights - data.mean
return(list(data.mean = data.mean,
boot.data.mean = boot.data.mean))
}
makeIndex <- function(B, blocks){
n <- ncol(blocks)
l <- nrow(blocks)
z <- ceiling(n/l)
start.points <- sample.int(n, z * B, replace = TRUE)
index <- blocks[, start.points]
keep <- c(rep(TRUE, n), rep(FALSE, z*l - n))
boot.index <- matrix(as.vector(index)[keep],
nrow = n, ncol = B)
return(boot.index)
}
This brought down the computation times from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above.)
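The key idea in makeStats is that a bootstrap column mean does not need the resampled matrix at all: counting how often each row is drawn and taking one matrix product gives the same numbers. The sketch below checks that equivalence on simulated data. The smaller B, the seed, and the names direct, counts and weighted are assumptions for illustration only.

```r
set.seed(42)
n <- 250  # no. of observations
m <- 100  # no. of data series
B <- 50   # fewer replications than in the question, to keep the check quick

sample.data <- matrix(rnorm(n * m), nrow = n, ncol = m)
boot.index  <- matrix(sample.int(n, n * B, replace = TRUE), nrow = n, ncol = B)

# Direct approach: subset the rows, then average each column.
direct <- sapply(seq_len(B), function(b) colMeans(sample.data[boot.index[, b], ]))

# Weight approach: count how often each row is drawn (n x B),
# then a single matrix product gives all bootstrap means (m x B).
counts   <- apply(boot.index, 2, tabulate, nbins = n)
weighted <- crossprod(sample.data, counts) / n

all.equal(direct, weighted)
```

This only covers statistics that are linear in the rows, such as means; other statistics may still need the explicit resample.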

Speed up `strsplit` when possible outputs are known

I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
## Time Location Class Replicate Population
##1 1 1 LIDE.1.S 1 0.03859605
##2 2 1 LIDE.1.S 1 0.03852957
##3 3 1 LIDE.1.S 1 0.03846853
##4 4 1 LIDE.1.S 1 0.03841260
##5 5 1 LIDE.1.S 1 0.03836147
##6 6 1 LIDE.1.S 1 0.03831485
Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)
head(data.df)
## Time Location Species SizeClass Infected Replicate Population
##1 1 1 LIDE 1 S 1 0.03859605
##2 2 1 LIDE 1 S 1 0.03852957
##3 3 1 LIDE 1 S 1 0.03846853
##4 4 1 LIDE 1 S 1 0.03841260
##5 5 1 LIDE 1 S 1 0.03836147
##6 6 1 LIDE 1 S 1 0.03831485
summaryRprof("str.out")
$by.self
self.time self.pct total.time total.pct
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"do.call" 0.04 1.49 2.54 94.78
"unique.default" 0.04 1.49 0.04 1.49
"data.frame" 0.02 0.75 0.12 4.48
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
$by.total
total.time total.pct self.time self.pct
"do.call" 2.54 94.78 0.04 1.49
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"cbind" 0.14 5.22 0.00 0.00
"data.frame" 0.12 4.48 0.02 0.75
"as.data.frame.matrix" 0.08 2.99 0.00 0.00
"as.data.frame" 0.08 2.99 0.00 0.00
"as.factor" 0.08 2.99 0.00 0.00
"factor" 0.06 2.24 0.00 0.00
"unique.default" 0.04 1.49 0.04 1.49
"unique" 0.04 1.49 0.00 0.00
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
"[.data.frame" 0.02 0.75 0.00 0.00
"[" 0.02 0.75 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.68
Is there any way to speed up this operation? I note that there are a small (<5) number of each of the categories "Species", "SizeClass", and "Infected", and I know what these are in advance.
Notes:
stringr::str_split_fixed performs this task, but not any faster
The data frame is actually initially generated by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
data.rds at http://dl.getdropbox.com/u/3356641/data.rds
This should probably offer quite an increase:
library(data.table)
DT <- data.table(data.df)
DT[, c("Species", "SizeClass", "Infected")
:= as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
The reasons for the increase:
data.table pre-allocates memory for columns
every column assignment in data.frame reassigns the entirety of the data (data.table in contrast does not)
the by statement allows you to run the strsplit task only once per unique value.
Here is a nice quick method for the whole process.
# Save the new col names as a character vector
newCols <- c("Species", "SizeClass", "Infected")
# split the string, then convert the new cols to columns
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]
# remove the old column. This is instantaneous.
DT[, Class := NULL]
## Have a look:
DT[, lapply(.SD, class)]
# Time Location Replicate Population Species SizeClass Infected
# 1: integer integer integer numeric factor factor factor
DT
You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:
data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)
# using `strsplit`
system.time({
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
})
user system elapsed
3.349 0.062 3.411
#using `gsub`
system.time({
data.df$Class <- as.character(data.df$Class)
data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
perl = TRUE)
data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class,
perl = TRUE)
data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class,
perl = TRUE)
})
user system elapsed
0.812 0.037 0.848
Looks like you have a factor, so work on the levels and then map back. Use fixed=TRUE in strsplit, adjusting to split=".".
Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
cbind(data.df, df0)
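To see the levels-based approach in isolation, here is a small self-contained sketch. The toy factor stands in for data.df$Class; its values, and the names parts and expanded, are assumptions for illustration.

```r
# Toy stand-in for data.df$Class (assumed values; the real data has far more rows).
Class <- factor(c("LIDE.1.S", "LIDE.1.I", "LIDE.2.S", "LIDE.1.S", "LIDE.2.S"))

# Split each distinct level once ...
parts <- do.call(rbind, strsplit(levels(Class), ".", fixed = TRUE))
colnames(parts) <- c("Species", "SizeClass", "Infected")

# ... then expand back to one row per observation via the integer codes.
expanded <- as.data.frame(parts[as.integer(Class), , drop = FALSE],
                          stringsAsFactors = TRUE)
expanded
```

The string work is done once per level (three times here) rather than once per row, which is where the saving comes from when there are millions of rows but only a handful of levels.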
