Cut dendrogram / cluster: Error in function 'cutree': tree incorrect (composante 'merge')

I have a dendrogram which I want to cut into fewer clusters, because right now there are too many for interpretation.
My dataframe looks like this:
> head(alpha)
locs 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
1 A1 12 14 15 15 14 21 10 18 18 20
2 A2 11 11 12 13 9 16 20 18 18 11
3 B1 12 13 20 17 21 20 27 14 22 25
4 B2 15 18 18 25 21 17 27 23 28 23
5 B3 22 22 26 24 28 23 31 25 32 25
6 B4 18 21 25 20 20 14 23 22 20 26
library("ggplot2") #for the heatmap
library("ggdendro") #for the dendrogram
library("reshape2") #for data wrangling
library("grid") #to combine the two plots heatmap and dendrogram
# Read in data
setwd("C:/Users/data")
alpha <- read.csv2("alpha.csv", header=TRUE, check.names=FALSE)
str(alpha) #structure of the dataset: locations (locs) = factor, values = integer
#scale the data variables (the year columns, 2:11)
alpha.scaled <- alpha
alpha.scaled[, c(2:11)] <- scale(alpha.scaled[, 2:11])
# run clustering
alpha.matrix <- as.matrix(alpha.scaled[, -c(1)])
rownames(alpha.matrix) <- alpha.scaled$locs
alpha.dendro <- as.dendrogram(hclust(d = dist(x = alpha.matrix), method="complete" ))
# Create dendrogram (=cluster)
dendro.plot <- ggdendrogram(data = alpha.dendro, rotate = TRUE)
alphacut <- cutree(alpha.dendro, h=3)
Error in cutree(alpha.dendro, h = 3) :
'tree' incorrect (composante 'merge')
alphacut <- cutree(as.dendrogram(hclust(d = dist(x = alpha.matrix), method = "complete")), k = 5)
Error in cutree(as.dendrogram(hclust(d = dist(x = alpha.matrix), method = "complete")), :
'tree' incorrect (composante 'merge')
I haven't found a solution to this. When I look at 'alpha.dendro' there is a list of 2 but no merge component, so this seems to be the problem. Does somebody know what to do?
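A likely fix (a sketch, not tested on the original data): stats::cutree() needs the hclust object itself, whose merge component is lost when converting with as.dendrogram(), so cut the hclust tree directly and keep the dendrogram only for plotting:
alpha.hclust <- hclust(d = dist(x = alpha.matrix), method = "complete")
alpha.dendro <- as.dendrogram(alpha.hclust) # still works for ggdendrogram()
alphacut <- cutree(alpha.hclust, k = 5)     # or cutree(alpha.hclust, h = 3)
Alternatively, the dendextend package provides a cutree() method that accepts dendrogram objects directly.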


Making a "Race" Between Two Variables

I would like to make two variables ("a" and "b") that keep taking random values that are always smaller than their current value (i.e. a1 > a2 > a3 > ... > an and b1 > b2 > b3 > ... > bn, always), until one of them is less than or equal to 0. A demo is shown below:
#iteration 1
a1 = 100 - rnorm(1, 5, 10)
b1 = 100 - rnorm(1, 5, 10)
a2 = a1 - rnorm(1, 5, 10)
b2 = b1 - rnorm(1, 5, 10)
a3 = a2 - rnorm(1, 5, 10)
b3 = b2 - rnorm(1, 5, 10)
#etc.
I would then like to repeat this many times. In the end, this would look something like one long table collecting all the iterations.
Currently, I am doing this manually, and then using the bind_rows() command to "pile" each iteration on top of each other. Can someone please show me a faster way to do this?
Thank you!
You could write a small recursive function:
fun <- function(x){
  if(any(x < 0)) x
  else rbind(x, fun(x - abs(rnorm(length(x), 5, 10))))
}
Now for 1 draw of A and B:
set.seed(1)
fun(c(A=100, B=100))
A B
x 100.00000 100.000000
x 98.73546 93.163567
x 95.37918 72.210759
x 87.08410 69.006075
x 77.20981 56.622828
x 66.45199 54.676712
x 46.33418 45.778279
x 45.12178 28.631280
x 28.87247 24.080617
x 24.03437 9.642254
10.82216 -1.296759
We can use this within a function to replicate the process. I will stick to base R, although this can be simplified with the tidyverse:
random_seq <- function(n, start){
  fun <- function(x){
    if(any(x < 0)) c(x)
    else rbind(x, fun(x - abs(rnorm(length(x), 5, 10))))
  }
  R <- replicate(n, data.frame(fun(start), row.names = NULL), simplify = FALSE)
  S <- do.call(rbind, Map(cbind, id = seq(R), R))
  U <- transform(S, time = ave(id, id, FUN = seq_along))
  reshape(U, dir = 'wide', idvar = 'id', sep = '')
}
set.seed(1)
random_seq(4, c(A=20,B=20))
id A1 B1 A2 B2 A3 B3 A4 B4
1 1 20 20 18.7354619 13.163567 15.379176 -7.789241 NA NA
4 2 20 20 11.7049223 16.795316 1.830632 4.412069 -8.927182 2.465953
8 3 20 20 -0.1178117 11.101568 NA NA NA NA
10 4 20 20 18.7875942 2.853001 2.538285 -1.697663 NA NA
BONUS: if you're interested, fun can also produce the incremented names directly:
fun <- function(x){
  nms <- as.numeric(sub('\\D+', '', names(x))) + 1
  names(x) <- paste0(sub("\\d+", '', names(x)), nms)
  if(any(x < 0)) c(x)
  else c(x, Recall(x - abs(rnorm(length(x), 5, 10))))
}
fun(c(A0=20, B0=30))
A1 B1 A2 B2 A3 B3
20.000000 30.000000 11.234808 23.323201 -9.611483 1.544311
Here's a nicely configurable function that runs a single starting value down to 0; we can use replicate to run it as many times as needed, returning a list.
to_0 = function(start = 100, fun = runif, ..., n = 1000) {
  if(start <= 0) stop("Must start greater than 0")
  result = start - c(0, cumsum(fun(n, ...)))
  if(all(result > 0)) stop("Didn't reach 0, set a higher n or check inputs.")
  first_0 = match(TRUE, result < 0)
  result[seq_len(first_0)]
}
I used runif as the default instead of your rnorm because you say you want the series to be strictly decreasing, but rnorm is sometimes positive and sometimes negative so it will sometimes lead to increases.
I cut off the series at the first negative value. Since the lengths of each run are different, a data.frame seems like a bad choice; keeping them in a list is better. We can use lengths() to see how long each vector in the list is.
The function is parametrized, so you can easily try out other distributions or custom functions, e.g., to_0(start = 100, fun = rexp, rate = 0.1). Below I demonstrate with the uniform distribution starting at 10.
set.seed(47)
race = replicate(n = 100, to_0(start = 10))
head(race)
# [[1]]
# [1] 10.00000000 9.02303800 8.64912196 7.88761993 7.06512831 6.49158390 5.80017147 5.41110962 4.94216364 4.39885390 3.47396185
# [12] 3.33516427 2.63317707 2.47098343 1.87167641 1.36564030 0.46366678 0.06316398 0.03221901 -0.03913915
#
# [[2]]
# [1] 10.00000000 9.27320918 8.54814801 7.77974923 7.34440424 7.27499236 6.76825217 6.75134855 6.20214287 5.43031741 4.56633348
# [12] 3.59288910 3.24547860 2.60269295 1.75639299 1.73279651 1.72371866 1.38211688 0.71933800 0.04916749 -0.40714758
#
# [[3]]
# [1] 10.00000000 9.08923490 9.06189460 8.69397353 8.30179409 8.11077841 7.96295850 7.49701585 6.52812608 6.26480567 5.34558158
# [12] 5.31801508 4.90573089 3.98774633 3.89046321 3.70358854 3.61482042 3.53824450 3.36900151 2.86522484 2.23295349 1.80544403
# [23] 0.82311022 0.73664857 -0.09385818
#
# [[4]]
# [1] 10.0000000 9.2172681 8.4175584 8.1672679 7.3683421 7.3373712 7.0319788 6.6512214 5.7210315 5.2732412 4.6817849 4.1065416
# [13] 3.9452541 3.4009742 2.5018050 1.5316136 0.7175295 0.4410275 -0.1859260
#
# [[5]]
# [1] 10.00000000 9.91914621 9.90238843 9.82993154 9.33156028 8.90827720 8.44160294 7.46348397 6.76539075 6.27298443 5.97401412
# [12] 5.03395592 4.55537992 3.75737919 2.82175869 2.75045000 2.70081885 2.67523320 2.20266408 2.12695183 1.25880525 0.57011279
# [23] 0.03173135 -0.79275633
#
# [[6]]
# [1] 10.0000000 9.9292630 9.6154147 9.0754730 8.7814754 8.5273701 7.6998567 6.8127609 5.9944598 5.6232599 5.1505038 4.8676191
# [13] 4.6337121 4.5868438 4.0435219 3.0981151 2.2621741 1.9925101 1.2104707 0.9334569 0.7574446 0.1643009 -0.5220925
lengths(race)
# [1] 20 21 25 19 24 23 21 24 23 22 25 24 19 19 23 17 19 23 25 21 24 25 18 22 24 25 19 19 23 22 19 26 20 23 24 24 22 21 25 23 21 28 19 20 16 20
# [47] 22 25 20 22 23 23 24 22 19 23 23 23 22 18 22 23 24 21 21 23 21 22 20 25 22 23 21 17 20 20 16 25 21 21 21 20 20 19 24 19 23 24 26 25 20 21
# [93] 23 17 27 18 30 24 21 23

Barplot in R fill with certain values

I have a dataset which contains the data of pairs playing a game. I have a barplot that shows the total games played by the pairs. But now I want those bars ('number') to be filled with the number of games each pair successfully completed ('sum'). I can't get it to work. The barplot is created like this:
barplot(height = game_count$number, xlab = 'Pairs', ylim = c(0,35), ylab='Games played')
The data looks like this:
participants sum number
1 06104873220647518670 30 32
2 06105747340637377404 23 24
3 06113978630633565020 28 32
4 06121794480617858550 25 27
5 06122613960611857952 23 26
6 06123139380653583516 25 28
7 06123650620648276595 28 32
8 06124453210624910109 32 34
9 06127993700610846968 24 26
10 06128440030639764541 19 24
11 06132461300624244572 26 30
12 06137611390651588167 25 28
13 06145014400637290807 16 19
14 06163181050611257617 30 30
15 06172024240651919112 21 23
One option is ggplot2:
library(ggplot2)
#Code
game_count$Freq <- game_count$sum / game_count$number
#Plot
ggplot(game_count, aes(x = 1:nrow(game_count), y = Freq)) +
  geom_col(fill = 'cyan3', color = 'black') +
  xlab('')
Output: (plot omitted)
This worked for me:
barplot(t(game_correct[c('number', 'sum')]), beside = TRUE, ylim = c(0, 35),
        col = c('black', 'green'),
        main = 'Games played and successful games by the pairs',
        xlab = 'Pairs', ylab = 'Games')
Result: (grouped bar chart omitted)
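If a legend is wanted as well, barplot() also accepts legend.text and args.legend (a sketch based on the call above; the labels and legend position are my own choices):
barplot(t(game_correct[c('number', 'sum')]), beside = TRUE, ylim = c(0, 35),
        col = c('black', 'green'), legend.text = c('Played', 'Completed'),
        args.legend = list(x = 'topright'),
        xlab = 'Pairs', ylab = 'Games')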

Test for in-tile for Dirichlet tile, using R

So I can take points and use the R libraries deldir or spatstat::dirichlet to find the Dirichlet tessellation of those points.
Now I have a point that is not in the set, and I want to know the indices of the points forming the Dirichlet tile that my not-in-set point is interior to. I can get there by knowing the tile label (or index).
Are there any libraries or methods to do this? I'm thinking spatstat, but I haven't found anything there yet.
The function cut.ppp() can take a point pattern and find which tessellation tile each point in the pattern belongs to. Below is the code for a simple example of a point pattern that contains only a single point, (0.5, 0.5).
library(spatstat)
dd <- dirichlet(cells)
plot.tess(dd, do.labels = TRUE)
xx <- ppp(.5, .5, window = Window(dd))
plot(xx, add = TRUE, col = "red", cex = 2, pch = 20)
yy <- cut(xx, dd)
yy
#> Marked planar point pattern: 1 point
#> Multitype, with levels =
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
#> 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
#> window: rectangle = [0, 1] x [0, 1] units
marks(yy)
#> [1] 18
#> 42 Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ... 42
Created on 2018-12-03 by the reprex package (v0.2.1)
If X is a point pattern and B is a tessellation, then
M <- marks(cut(X, B))
returns a factor (vector of categorical values) identifying which tile contains each of the points of X. Alternatively,
M <- tileindex(X$x, X$y, B)
or
f <- as.function(B)
M <- f(X)
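For instance, continuing the reprex above (a sketch; dd and xx are the tessellation and the one-point pattern defined earlier):
M <- tileindex(xx$x, xx$y, dd) # should agree with marks(cut(xx, dd)) above
f <- as.function(dd)
f(xx)                          # the same tile index via the function interface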

Interpolate missing climate data from similar datasets

I am working with climate data from several (5) data loggers I used in a field experiment. I have some missing climate data, partly because the loggers were installed after the start of the experiment, and partly because of defective loggers.
For interpolation of my missing data, I have climate data from two nearby weather stations.
I'm trying to find a rapid and safe way to replace the missing data in this dataset, because the way I have come up with is very tedious and quite convoluted.
df <- data.frame(date = c("2015-06-17","2015-06-18","2015-06-19","2015-06-20","2015-06-21"),
                 meantemp1 = c(15,17,19,15,19), maxtemp1 = c(18,25,27,25,28),
                 meantemp2 = c(13,12,12,18,14), maxtemp2 = c(22,25,25,24,26),
                 meantemp3 = c(NA,NA,21,17,21), maxtemp3 = c(NA,NA,29,25,29),
                 meantemp4 = c(NA,14,14,20,16), maxtemp4 = c(NA,27,27,26,28)) #create toy dataset
df$date <- as.Date(df$date)
View(df)
date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17 15 18 13 22 NA NA NA NA
2 2015-06-18 17 25 12 25 NA NA 14 27
3 2015-06-19 19 27 12 25 21 29 14 27
4 2015-06-20 15 25 18 24 17 25 20 26
5 2015-06-21 19 28 14 26 21 29 16 28
Let's say the first four temperature columns correspond to weather station data, and the last four columns correspond to data logger data. In reality, I look at many more climatic factors.
Now, I shorten the dataset to a relevant timescale (in real life, a month) without any NA values, in order to calculate correlations.
#create a short dataset without NA values
dfshort <- df[df$date >= "2015-06-19" & df$date <= "2015-06-21", ]
dfshort$date <- as.numeric(dfshort$date) #date needs to be numeric for cor()
corrmatrix <- cor(dfshort)
library(reshape)
m <- melt(corrmatrix) #show the correlation matrix as a list
m <- m[order(-abs(m$value)), ] #order by absolute correlation value
View(m)
X1 X2 value
1 date date 1
11 meantemp1 meantemp1 1
15 meantemp3 meantemp1 1
16 maxtemp3 meantemp1 1
21 maxtemp1 maxtemp1 1
31 meantemp2 meantemp2 1
35 meantemp4 meantemp2 1
10 date meantemp1 0
46 date meantemp3 0
55 date maxtemp3 0
To get a better overview of which factors are most correlated, I shorten this list to the cases where the 'logger data' depend on the 'weather station data'.
m1 <- subset(m, X1 %in% c('meantemp3', 'maxtemp3', 'meantemp4', 'maxtemp4'))
#select logger data for the first ("dependent") column
m2 <- subset(m1, X2 %in% c('meantemp1', 'maxtemp1', 'meantemp2', 'maxtemp2'))
#select weather station data for the second ("reference") column
View(m2)
X1 X2 value
15 meantemp3 meantemp1 1.0000000
16 maxtemp3 meantemp1 1.0000000
35 meantemp4 meantemp2 1.0000000
45 maxtemp4 maxtemp2 1.0000000
27 maxtemp4 maxtemp1 0.9819805
17 meantemp4 meantemp1 -0.9449112
24 meantemp3 maxtemp1 0.9449112
26 meantemp4 maxtemp1 -0.7857143
36 maxtemp4 meantemp2 -0.6546537
44 meantemp4 maxtemp2 -0.6546537
Now I mark down the highest correlations for the 'logger data', and create lm's following this post:
#formulate linear models
model.meantemp3 <- lm(meantemp3 ~ meantemp1, data = df)
model.maxtemp3 <- lm(maxtemp3 ~ meantemp1, data = df)
model.meantemp4 <- lm(meantemp4 ~ meantemp2, data = df)
model.maxtemp4 <- lm(maxtemp4 ~ maxtemp2, data = df)
#predict values as column
df$predict.meantemp3 = predict(model.meantemp3, newdata = df)
df$predict.maxtemp3 = predict(model.maxtemp3, newdata = df)
df$predict.meantemp4 = predict(model.meantemp4, newdata = df)
df$predict.maxtemp4 = predict(model.maxtemp4, newdata = df)
# replace (only) NAs with predictions
df$meantemp3 = ifelse(is.na(df$meantemp3), df$predict.meantemp3, df$meantemp3)
df$maxtemp3 = ifelse(is.na(df$maxtemp3), df$predict.maxtemp3,df$maxtemp3)
df$meantemp4 = ifelse(is.na(df$meantemp4), df$predict.meantemp4, df$meantemp4)
df$maxtemp4 = ifelse(is.na(df$maxtemp4), df$predict.maxtemp4, df$maxtemp4)
#tadaa!
df <- df[c(-10:-13)] #drop the prediction columns we are no longer interested in
head(df) #dataset without NA's
date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17 15 18 13 22 17 25 15 24
2 2015-06-18 17 25 12 25 19 27 14 27
3 2015-06-19 19 27 12 25 21 29 14 27
4 2015-06-20 15 25 18 24 17 25 20 26
5 2015-06-21 19 28 14 26 21 29 16 28
There has to be a more concise and less error-prone way to do this, and I can't be the only one with this issue, as this unanswered stackexchange question suggests.
I have looked for packages to do this (e.g. the "mice" package), but they tend to produce quite complicated outputs. Geographers seem to agree that simple linear models are too primitive for temperature data imputation; however, my climate data are highly correlated, so I want to do it this way for the sake of simplicity.
Help is highly appreciated!
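One way to make this more concise (a sketch of the same lm-based imputation, using the toy df and the column names above; the helper vectors are my own): loop over the logger columns, pick the station column with the strongest absolute correlation, and replace only the NAs with fitted values.
station_cols <- c("meantemp1", "maxtemp1", "meantemp2", "maxtemp2")
logger_cols  <- c("meantemp3", "maxtemp3", "meantemp4", "maxtemp4")
for (lc in logger_cols) {
  cors <- sapply(station_cols, function(sc)
    cor(df[[lc]], df[[sc]], use = "complete.obs"))
  best <- station_cols[which.max(abs(cors))]       # best-correlated reference
  fit  <- lm(reformulate(best, response = lc), data = df)
  pred <- predict(fit, newdata = df)
  df[[lc]] <- ifelse(is.na(df[[lc]]), pred, df[[lc]])
}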

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)),
                  re = c(rep(seq(1, 10, 1), 2)),
                  age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
                  h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row in the data set (age, taken from age1 and age2, regressed on h, taken from h and h2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
  age <- c(exp$age1[i], exp$age2[i])
  h <- c(exp$h[i], exp$h2[i])
  model <- lm(age ~ h)
  exp$modelh[i] <- coef(model)[1] + 100 * coef(model)[2]
}
It works well, but takes some time with very large files. I would be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try rowwise() and do. Inside the do, we concatenate (c) 'age1' and 'age2' to create 'age'; likewise we create 'h', apply lm, and extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
  rowwise() %>%
  do({
    age <- c(.$age1, .$age2)
    h <- c(.$h, .$h2)
    model <- lm(age ~ h)
    data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
  })
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table) #v1.9.5+
modelh <- melt(setDT(exp, keep.rownames = TRUE), measure = patterns('^age', '^h'),
               value.name = c('age', 'h'))[, {model <- lm(age ~ h)
                 coef(model)[1] + 100 * coef(model)[2]}, rn]$V1
exp[, modelh := modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis, since you mentioned "it's an example of a bigger problem": if you are really interested in building models rowwise, then you'll create more and more columns as your age and h observations increase. If you get N observations, you'll need 2×N columns for those two variables alone.
I'd suggest using a long data format in order to grow your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
        varying = list(c("age1","age2"),
                       c("h","h2")),
        v.names = c("age_value","h_value"),
        direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
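From here, each id group can be modelled in a single pass (a sketch; it assumes exp as originally created, before any modelh column is added):
long <- reshape(exp, varying = list(c("age1","age2"), c("h","h2")),
                v.names = c("age_value","h_value"), direction = "long")
modelh <- sapply(split(long, long$id), function(d) {
  fit <- lm(age_value ~ h_value, data = d)  # same model as the rowwise loop
  coef(fit)[1] + 100 * coef(fit)[2]
})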
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the function sprintf can help us create formulas, and lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {
  model <- lm(as.formula(x))
  coef(model)[1] + 100 * coef(model)[2]
})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1 = sample(30, 1e4, TRUE),
                      age2 = sample(30:50, 1e4, TRUE),
                      h = runif(1e4, 10, 40),
                      h2 = 40 + runif(1e4, 4, 9))
microbenchmark(
  plafort = {
    strings <- sprintf("c(%f,%f) ~ c(%f,%f)",
                       big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
    lst <- lapply(strings, function(x) {
      model <- lm(as.formula(x))
      coef(model)[1] + 100 * coef(model)[2]
    })
    big.exp$modelh <- unlist(lst)
  },
  akdplyr = {
    big.exp %>%
      rowwise() %>%
      do({
        age <- c(.$age1, .$age2)
        h <- c(.$h, .$h2)
        model <- lm(age ~ h)
        data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
      })
  },
  times = 5)
Unit: seconds
    expr      min       lq     mean   median       uq      max neval cld
 plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366     5  a
 akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940     5   b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its rounding (sprintf's %f keeps only six decimal places) may be an issue. Improvements may be possible, perhaps by using apply after converting to a matrix to increase speed and efficiency.
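Since each row's model is just a straight line through two points, the coefficients also have a closed form, so the whole calculation can be vectorised without calling lm at all (a sketch, assuming the exp data frame from the question; results should match the loop up to floating-point error):
slope <- (exp$age2 - exp$age1) / (exp$h2 - exp$h) # two-point slope of age ~ h
intercept <- exp$age1 - slope * exp$h             # line passes through (h, age1)
exp$modelh <- intercept + 100 * slope             # prediction at h = 100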
