Applying a function for multiple groups using dplyr - r

I have some data for multiple locations and years:
big.data <- data.frame(loc.id = rep(1:3, each = 10*3),
                       year = rep(rep(1981:1983, each = 10), times = 3),
                       day = rep(1:10, times = 3*3),
                       CN = rep(c(50, 55, 58), each = 10*3),
                       top.FC = rep(c(72, 76, 80), each = 10*3),
                       DC = rep(c(0.02, 0.5, 0.8), each = 10*3),
                       WAT0 = rep(c(20, 22, 26), each = 10*3),
                       Precp = sample(1:100, 90, replace = TRUE),
                       ETo = sample(1:10, 90, replace = TRUE))
I have a function, water.model, which internally uses a second function called water.update:
water.model <- function(dat){
  top.FC <- unique(dat$top.FC)
  # initialise the three output columns
  dat$WAT <- -9.9
  dat$RO <- -9.9
  dat$DR <- -9.9
  dat$WAT[1] <- top.FC/2 # WAT.i is a constant
  dat$RO[1] <- NA
  dat$DR[1] <- NA
  for(d in 1:(nrow(dat) - 1)){
    # columns 10:12 are WAT, RO and DR, so column order matters here
    dat[d + 1, 10:12] <- water.update(WAT0 = dat$WAT[d],
                                      RAIN.i = dat$Precp[d + 1],
                                      ETo.i = dat$ETo[d + 1],
                                      CN = unique(dat$CN),
                                      DC = unique(dat$DC),
                                      top.FC = unique(dat$top.FC))
  }
  return(dat)
}
water.update <- function(WAT0, RAIN.i, ETo.i, CN, DC, top.FC){
  # SCS curve-number runoff
  S <- 25400/CN - 254
  IA <- 0.2*S
  if (RAIN.i > IA) {
    RO <- (RAIN.i - 0.2 * S)^2/(RAIN.i + 0.8 * S)
  } else {
    RO <- 0
  }
  # drainage only once storage exceeds field capacity
  if (WAT0 + RAIN.i - RO > top.FC) {
    DR <- DC * (WAT0 + RAIN.i - RO - top.FC)
  } else {
    DR <- 0
  }
  # daily water balance, floored at zero
  dWAT <- RAIN.i - RO - DR - ETo.i
  WAT1 <- WAT0 + dWAT
  WAT1 <- ifelse(WAT1 < 0, 0, WAT1)
  return(list(WAT1, RO, DR))
}
If I run the above function for a single location × year combination:
big.data.sub <- big.data[big.data$loc.id == 1 & big.data$year == 1981,]
water.model(big.data.sub)
loc.id year day CN top.FC DC WAT0 Precp ETo WAT RO DR
1 1 1981 1 50 72 0.02 20 52 5 36.0000 NA NA
2 1 1981 2 50 72 0.02 20 12 9 39.0000 0.0000000 0.000000
3 1 1981 3 50 72 0.02 20 3 2 40.0000 0.0000000 0.000000
4 1 1981 4 50 72 0.02 20 81 9 107.8750 3.2091485 0.915817
5 1 1981 5 50 72 0.02 20 37 10 133.4175 0.0000000 1.457501
6 1 1981 6 50 72 0.02 20 61 7 184.5833 0.3937926 2.440475
7 1 1981 7 50 72 0.02 20 14 10 186.0516 0.0000000 2.531665
8 1 1981 8 50 72 0.02 20 9 6 186.5906 0.0000000 2.461032
9 1 1981 9 50 72 0.02 20 77 9 248.3579 2.4498216 3.782815
10 1 1981 10 50 72 0.02 20 18 6 256.4708 0.0000000 3.887159
How do I run this for all locations and years?
big.data %>% group_by(loc.id, year) %>% # apply my function here.
My final data should look like the above, with the three new columns WAT, RO and DR generated when the function is run.

We can split the data and apply water.model by looping over the list with map:
library(tidyverse)
split(big.data, big.data[c('loc.id', 'year')], drop = TRUE) %>%
  map_df(water.model)
Or apply the function within do after group_by:
big.data %>%
  group_by(loc.id, year) %>%
  do(data.frame(water.model(.)))
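do() is superseded in recent dplyr; a more current spelling (my addition, assuming dplyr >= 0.8 with purrr) uses group_split, which keeps every column — relevant here because water.model fills columns 10:12 by position:
library(dplyr)
library(purrr)
big.data %>%
  group_split(loc.id, year) %>% # one data frame per loc.id/year combination, all columns kept
  map_dfr(water.model)          # run the model on each piece and row-bind the results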

Related

time between max and min of cycles

I have a time series of 60,000 data points, part of which looks like figure 1 (the whole curve is not as nice and uniform as this image; other parts look like the second image), and there are many cycles with different periods in my data.
I need to calculate the durations marked by the red, green and purple rectangles for each of the cycles (the time between each maximum and minimum, and the total time of each cycle).
Can you give me some ideas on how to do this in R? Is there any special command or package that I can use?
The premise is that the mean of the data range is used to split the data into peak and non-peak categories. A running id is then generated to group each set of points so an appropriate min or max value can be determined. half_cycle provides the red and green boxes, while full_cycle provides the purple box (max-to-max and min-to-min). There is likely room for improvement, but this gives a method that can be adjusted as needed.
This sample uses random data since no sample data was provided.
set.seed(7)
wave <- c(seq(20, 50, 10), seq(50, 60, 0.5), seq(50, 20, -10))
df1 <- data.frame(time = seq_len(length(wave) * 5),
                  data = as.vector(replicate(5, wave + rnorm(length(wave), sd = 5))))
library(dplyr)
df1 %>%
  mutate(peak = data > mean(range(df1$data))) %>%
  mutate(run = cumsum(peak != lag(peak, default = TRUE))) %>%
  group_by(run) %>%
  mutate(max = max(data), min = min(data)) %>%
  filter((peak == TRUE & data == max) | (peak == FALSE & data == min)) %>%
  mutate(max = if_else(data == max, max, NA_real_),
         min = if_else(data == min, min, NA_real_)) %>%
  ungroup() %>%
  mutate(half_cycle = time - lag(time), full_cycle = time - lag(time, n = 2L))
# A tibble: 11 x 8
time data peak run max min half_cycle full_cycle
<int> <dbl> <lgl> <int> <dbl> <dbl> <int> <int>
1 2 24.0 FALSE 1 NA 24.0 NA NA
2 12 67.1 TRUE 2 67.1 NA 10 NA
3 29 15.1 FALSE 3 NA 15.1 17 27
4 54 68.5 TRUE 4 68.5 NA 25 42
5 59 20.8 FALSE 5 NA 20.8 5 30
6 80 70.6 TRUE 6 70.6 NA 21 26
7 87 18.3 FALSE 7 NA 18.3 7 28
8 108 63.1 TRUE 8 63.1 NA 21 28
9 117 13.8 FALSE 9 NA 13.8 9 30
10 140 64.5 TRUE 10 64.5 NA 23 32
11 145 22.4 FALSE 11 NA 22.4 5 28
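If single summary numbers per box are wanted, the durations can be aggregated afterwards. A sketch (my addition, assuming the result of the pipeline above has been assigned to cycles):
cycles %>%
  mutate(direction = if_else(peak, "rise (min to max)", "fall (max to min)")) %>%
  group_by(direction) %>%
  summarise(mean_half_cycle = mean(half_cycle, na.rm = TRUE),
            mean_full_cycle = mean(full_cycle, na.rm = TRUE))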

Remove link between time series and add minor date tick on x_axis in ggplot

I was trying to plot a time series composed of weekly averages. Here is the plot that I have obtained:
[weekly averages A]: https://i.stack.imgur.com/XMGMs.png
As you can see, the time series does not cover every year completely, so where I have no data ggplot links two subsequent years. I think I have to group the data in some way, but I do not understand how. Here is the code:
df4 <- data.frame(df$Date, df$A)
colnames(df4) <- c("date", "A")
df4$date <- as.Date(df4$date, "%Y/%m/%d")
df4$week_day <- as.numeric(format(df4$date, format = '%w'))
df4$endofweek <- df4$date + (6 - df4$week_day)
week_aveA <- df4 %>%
  group_by(endofweek) %>%
  summarise_all(list(mean = mean), na.rm = TRUE) %>%
  na.omit()
g1 = ggplot() +
  geom_step(data = week_aveA, aes(group = 1, x = endofweek, y = A_mean), colour = "gray25") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2500)) +
  scale_x_date(breaks = "year", labels = date_format("%Y")) +
  labs(y = expression(A~ ~index),
       x = NULL) +
  theme(axis.text.x = element_text(size = 10),
        axis.title = element_text(size = 10))
Here is an extract (the first years) of the dataset:
endofweek date_mean A_mean week_day_mean
1 20/03/2010 17/03/2010 939,2533437 3
2 27/03/2010 24/03/2010 867,3620121 3
3 03/04/2010 31/03/2010 1426,791222 3
4 10/04/2010 07/04/2010 358,5698314 3
5 17/04/2010 13/04/2010 301,1815352 2
6 24/04/2010 21/04/2010 273,4922895 3,333333333
7 01/05/2010 28/04/2010 128,5989633 3
8 08/05/2010 05/05/2010 447,8858881 3
9 15/05/2010 12/05/2010 387,9828891 3
10 22/05/2010 19/05/2010 138,0770986 3
11 29/05/2010 26/05/2010 370,2147933 3
12 05/06/2010 02/06/2010 139,0451791 3
13 12/06/2010 09/06/2010 217,1286356 3
14 19/06/2010 16/06/2010 72,36972411 3
15 26/06/2010 23/06/2010 282,2911902 3
16 03/07/2010 30/06/2010 324,3215936 3
17 10/07/2010 07/07/2010 210,568691 3
18 17/07/2010 14/07/2010 91,76930829 3
19 24/07/2010 21/07/2010 36,4211218 3,666666667
20 31/07/2010 28/07/2010 37,53981103 3
21 07/08/2010 04/08/2010 91,33282642 3
22 14/08/2010 11/08/2010 28,38587352 3
23 21/08/2010 18/08/2010 58,72836406 3
24 28/08/2010 24/08/2010 102,1050612 2,5
25 04/09/2010 02/09/2010 13,45357513 4,5
26 11/09/2010 08/09/2010 51,24017212 3
27 18/09/2010 15/09/2010 159,7395663 3
28 25/09/2010 21/09/2010 62,71136678 2
29 02/04/2011 31/03/2011 1484,661164 4
30 09/04/2011 06/04/2011 656,1827964 3
31 16/04/2011 13/04/2011 315,3097313 3
32 23/04/2011 20/04/2011 293,2904042 3
33 30/04/2011 26/04/2011 255,7517519 2,4
34 07/05/2011 04/05/2011 360,7035289 3
35 14/05/2011 11/05/2011 342,0902797 3
36 21/05/2011 18/05/2011 386,1380421 3
37 28/05/2011 24/05/2011 418,9624807 2,833333333
38 04/06/2011 01/06/2011 112,7568 3
39 11/06/2011 08/06/2011 85,17855619 3,2
40 18/06/2011 15/06/2011 351,8714638 3
41 25/06/2011 22/06/2011 139,7936898 3
42 02/07/2011 29/06/2011 68,57716191 3,6
43 09/07/2011 06/07/2011 62,31823822 3
44 16/07/2011 13/07/2011 80,7328917 3
45 23/07/2011 20/07/2011 114,9475331 3
46 30/07/2011 27/07/2011 90,13118758 3
47 06/08/2011 03/08/2011 43,29372258 3
48 13/08/2011 10/08/2011 49,39935204 3
49 20/08/2011 16/08/2011 133,746822 2
50 03/09/2011 31/08/2011 76,03928942 3
51 10/09/2011 05/09/2011 27,99834637 1
52 24/03/2012 23/03/2012 366,2625797 5,5
53 31/03/2012 28/03/2012 878,8535513 3
54 07/04/2012 04/04/2012 1029,909052 3
55 14/04/2012 11/04/2012 892,9163416 3
56 21/04/2012 18/04/2012 534,8278693 3
57 28/04/2012 25/04/2012 255,1177585 3
58 05/05/2012 02/05/2012 564,5280546 3
59 12/05/2012 09/05/2012 767,5018168 3
60 19/05/2012 16/05/2012 516,2680148 3
61 26/05/2012 23/05/2012 241,2113073 3
62 02/06/2012 30/05/2012 863,6123397 3
63 09/06/2012 06/06/2012 201,2019288 3
64 16/06/2012 13/06/2012 222,9955486 3
65 23/06/2012 20/06/2012 91,14166632 3
66 30/06/2012 27/06/2012 26,93145693 3
67 07/07/2012 04/07/2012 67,32183278 3
68 14/07/2012 11/07/2012 46,25297513 3
69 21/07/2012 18/07/2012 81,34359825 3,666666667
70 28/07/2012 25/07/2012 49,59130851 3
71 04/08/2012 01/08/2012 44,13438077 3
72 11/08/2012 08/08/2012 30,15773151 3
73 18/08/2012 15/08/2012 57,47256772 3
74 25/08/2012 22/08/2012 31,9109555 3
75 01/09/2012 29/08/2012 52,71058484 3
76 08/09/2012 04/09/2012 24,52495229 2
77 06/04/2013 01/04/2013 1344,388042 1,5
78 13/04/2013 10/04/2013 1304,838687 3
79 20/04/2013 17/04/2013 892,620141 3
80 27/04/2013 24/04/2013 400,1720434 3
81 04/05/2013 01/05/2013 424,8473083 3
82 11/05/2013 08/05/2013 269,2380208 3
83 18/05/2013 15/05/2013 238,9993749 3
84 25/05/2013 22/05/2013 128,4096151 3
85 01/06/2013 29/05/2013 158,5576121 3
86 08/06/2013 05/06/2013 175,2036942 3
87 15/06/2013 12/06/2013 79,20250839 3
88 22/06/2013 19/06/2013 126,9065428 3
89 29/06/2013 26/06/2013 133,7480108 3
90 06/07/2013 03/07/2013 218,0092943 3
91 13/07/2013 10/07/2013 54,08460936 3
92 20/07/2013 17/07/2013 91,54285041 3
93 27/07/2013 24/07/2013 44,64567928 3
94 03/08/2013 31/07/2013 229,5067999 3
95 10/08/2013 07/08/2013 49,70729373 3
96 17/08/2013 14/08/2013 53,38618335 3
97 24/08/2013 21/08/2013 217,2800997 3
98 31/08/2013 28/08/2013 49,43590136 3
99 07/09/2013 04/09/2013 64,88783029 3
100 14/09/2013 11/09/2013 11,04300773 3
So in the end I have one main question: how can I eliminate the connection between the years? And an aesthetic question: how can I add minor ticks on the x-axis? At least one every 6 months, just to make the plot easier to read.
Thanks in advance for any suggestion!
Edit
This is the code I tried with the suggestion; maybe I mistyped some part of it.
library(tidyverse)
library(dplyr)
library(lubridate)
df4 <- data.frame(df$Date, df$A)
colnames(df4) <- c("date", "A")
df4$date <- as.Date(df4$date, "%Y/%m/%d")
df4$week_day <- as.numeric(format(df4$date, format = '%w'))
df4$endofweek <- df4$date + (6 - df4$week_day)
week_aveA <- df4 %>%
  group_by(endofweek) %>%
  summarise_all(list(mean = mean), na.rm = TRUE) %>%
  na.omit()
week_aveA$endofweek <- as.Date(week_aveA$endofweek, "%d/%m/%Y")
week_aveA$A_mean <- as.numeric(gsub(",", ".", week_aveA$A_mean))
week_aveA$week_day_mean <- as.numeric(gsub(",", ".", week_aveA$week_day_mean))
week_aveA$year <- format(week_aveA$endofweek, "%Y")
library(ggplot2)
library(methods)
library(scales)
mylabel <- function(x) {
  ifelse(grepl("-07-01$", x), "", format(x, "%Y"))
}
ggplot() +
  geom_step(data = week_aveA, aes(x = endofweek, y = A_mean, group = year), colour = "gray25") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2500)) +
  scale_x_date(breaks = "6 month", labels = mylabel) +
  labs(y = expression(A~ ~index),
       x = NULL) +
  theme(axis.text.x = element_text(size = 10),
        axis.title = element_text(size = 10))
You have to group by year:
1. Add a variable with the year to your dataset.
2. Map the year variable on the group aesthetic.
For the ticks: increase the number of breaks. If you want only ticks but not labels, you can use a custom labelling function to get rid of the unwanted labels; my approach below sets the breaks to "6 month" but replaces the mid-year labels with an empty string:
week_aveA$endofweek <- as.Date(week_aveA$endofweek, "%d/%m/%Y")
week_aveA$A_mean <- as.numeric(gsub(",", ".", week_aveA$A_mean))
week_aveA$week_day_mean <- as.numeric(gsub(",", ".", week_aveA$week_day_mean))
week_aveA$year <- format(week_aveA$endofweek, "%Y")
library(ggplot2)
mylabel <- function(x) {
  ifelse(grepl("-07-01$", x), "", format(x, "%Y"))
}
ggplot() +
  geom_step(data = week_aveA, aes(x = endofweek, y = A_mean, group = year), colour = "gray25") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2500)) +
  scale_x_date(breaks = "6 month", labels = mylabel) +
  labs(y = expression(A~ ~index),
       x = NULL) +
  theme(axis.text.x = element_text(size = 10),
        axis.title = element_text(size = 10))
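As a side note (my addition, not part of the original answer): recent ggplot2 (3.5.0+, which added minor-tick support to guide_axis) can draw true minor ticks from the minor breaks, avoiding the empty-label workaround altogether:
ggplot(week_aveA, aes(x = endofweek, y = A_mean, group = year)) +
  geom_step(colour = "gray25") +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y",
               date_minor_breaks = "6 months",          # one minor tick mid-year
               guide = guide_axis(minor.ticks = TRUE)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2500))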

How to resample and remodel n times by vectorization?

Here's my for-loop version of resampling and remodelling:
B <- 999
n <- nrow(butterfly)
estMat <- matrix(NA, B + 1, 2)
estMat[B + 1, ] <- model$coef
for (i in 1:B) {
  resample <- butterfly[sample(1:n, n, replace = TRUE), ]
  re.model <- lm(Hk ~ inv.alt, resample)
  estMat[i, ] <- re.model$coef
}
I tried to avoid the for loop:
B <- 999
n <- nrow(butterfly)
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE), ], simplify = FALSE)
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
re.model.coef <- sapply(re.model, coef)
estMat <- cbind(re.model.coef, model$coef)
It worked but didn't improve efficiency. Is there an approach where I can use vectorization?
Sorry, I'm not quite familiar with Stack Overflow. Here's the dataset butterfly:
colony alt precip max.temp min.temp Hk
pd+ss 0.5 58 97 16 98
sb 0.8 20 92 32 36
wsb 0.57 28 98 26 72
jrc+jrh 0.55 28 98 26 67
sj 0.38 15 99 28 82
cr 0.93 21 99 28 72
mi 0.48 24 101 27 65
uo+lo 0.63 10 101 27 1
dp 1.5 19 99 23 40
pz 1.75 22 101 27 39
mc 2 58 100 18 9
hh 4.2 36 95 13 19
if 2.5 34 102 16 42
af 2 21 105 20 37
sl 6.5 40 83 0 16
gh 7.85 42 84 5 4
ep 8.95 57 79 -7 1
gl 10.5 50 81 -12 4
(Assuming butterfly$inv.alt <- 1/butterfly$alt)
You got the error because resample was not a list of resampled data.frames; such a list can be obtained with:
resample <- replicate(B, butterfly[sample(1:n, replace = TRUE),], simplify = FALSE)
Then the following should work:
re.model <- lapply(resample, lm, formula = Hk ~ inv.alt)
To extract coefficients from a list of models, re.model$coef does not work. The correct paths to the coefficients are re.model[[1]]$coef, re.model[[2]]$coef, .... You can get all of them with the following code:
re.model.coef <- sapply(re.model, coef)
Then you can combine them with the observed coefficients:
estMat <- cbind(re.model.coef, model$coef)
In fact, you can put all of this into a single replicate call:
re.model.coef <- replicate(B, {
bf.rs <- butterfly[sample(1:n, replace = TRUE),]
coef(lm(formula = Hk ~ inv.alt, data = bf.rs))
})
estMat <- cbind(re.model.coef, model$coef)
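As an aside (my addition, not from the original answers): the boot package is the idiomatic tool for this. It still loops internally, so it is not strictly vectorised, but it packages the resampling cleanly and offers parallelism through its parallel/ncpus arguments:
library(boot)
# statistic(data, indices) refits the model on each bootstrap index vector
boot_coef <- function(data, idx) coef(lm(Hk ~ inv.alt, data = data[idx, ]))
bt <- boot(butterfly, boot_coef, R = 999)
estMat <- rbind(bt$t, coef(model)) # 999 bootstrap rows plus the observed fit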

Function with a for loop to create a column with values 1:n conditioned by intervals matched by another column

I have a data frame like the following
my_df = data.frame(x = runif(100, min = 0, max = 60),
                   y = runif(100, min = 0, max = 60)) # x and y in cm
With this I need a new column with values from 1 to 36 that index x and y in 10 cm steps. For example, if 0<=x<=10 & 0<=y<=10, put 1; then if 10<=x<=20 & 0<=y<=10, put 2, and so on up to 6; then 0<=x<=10 & 10<=y<=20 starts at 7, up to 12, etc. I tried to write a function with an if, repeating the interval for x 6 times and increasing the interval for y by 10 on every iteration. Here is the function:
# my miscarried function 'zones'
zones <- function(x, y) {
  i <- vector(length = 6)
  n <- vector(length = 6)
  z <- vector(length = 36)
  i[1] <- 0
  z[1] <- 0
  n[1] <- 1
  for (t in 1:6) {
    if (0 <= x & x < 10 & i[t] <= y & y < i[t] + 10) { z[t] <- n[t] } else
    if (10 <= x & x < 20 & i[t] <= y & y < i[t] + 10) { z[t] <- n[t] + 1 } else
    if (20 <= x & x < 30 & i[t] <= y & y < i[t] + 10) { z[t] <- n[t] + 2 } else
    if (30 <= x & x < 40 & i[t] <= y & y < i[t] + 10) { z[t] <- n[t] + 3 } else
    if (40 <= x & x < 50 & i[t] <= y & y < i[t] + 10) { z[t] <- n[t] + 4 } else
    if (50 <= x & x <= 60 & i[t] <= y & y < i[t] + 10) { z[t] <- n[t] + 5 }
    else {
      i[t + 1] <- i[t] + 10
      n[t + 1] <- n[t] + 6
    }
  }
  return(z)
}
xy$z <- zones(x = xy$x, y = xy$y)
and I got
There were 31 warnings (use warnings() to see them)
xy$z
[1] 0 0 0 0 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Please, help me before I die alone!
I think this does the trick.
a <- cut(my_df$x, (0:6) * 10)
b <- cut(my_df$y, (0:6) * 10)
z <- interaction(a, b)
levels(z)
[1] "(0,10].(0,10]" "(10,20].(0,10]" "(20,30].(0,10]" "(30,40].(0,10]"
[5] "(40,50].(0,10]" "(50,60].(0,10]" "(0,10].(10,20]" "(10,20].(10,20]"
[9] "(20,30].(10,20]" "(30,40].(10,20]" "(40,50].(10,20]" "(50,60].(10,20]"
[13] "(0,10].(20,30]" "(10,20].(20,30]" "(20,30].(20,30]" "(30,40].(20,30]"
[17] "(40,50].(20,30]" "(50,60].(20,30]" "(0,10].(30,40]" "(10,20].(30,40]"
[21] "(20,30].(30,40]" "(30,40].(30,40]" "(40,50].(30,40]" "(50,60].(30,40]"
[25] "(0,10].(40,50]" "(10,20].(40,50]" "(20,30].(40,50]" "(30,40].(40,50]"
[29] "(40,50].(40,50]" "(50,60].(40,50]" "(0,10].(50,60]" "(10,20].(50,60]"
[33] "(20,30].(50,60]" "(30,40].(50,60]" "(40,50].(50,60]" "(50,60].(50,60]"
If these levels aren't to your taste, change them as below:
levels(z) <- 1:36
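Since the factor codes already run from 1 to 36 in the order of the levels printed above, you can also attach the result as a plain integer column (a small addition to the answer):
my_df$z <- as.integer(z) # codes follow the printed level order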
Is this what you're after? The resulting numbers are in column res:
# Get bin index for x values and y values
my_df$bin1 <- as.numeric(cut(my_df$x, breaks = seq(0, max(my_df$x) + 10, by = 10)))
my_df$bin2 <- as.numeric(cut(my_df$y, breaks = seq(0, max(my_df$y) + 10, by = 10)))
# Combine the bin indices into a single zone id (6 x-bins per y-row)
my_df$res <- my_df$bin1 + (my_df$bin2 - 1) * 6
> head(my_df)
          x         y bin1 bin2 res
1 49.887499 47.302849    5    5  29
2 43.169773 50.931357    5    6  35
3 10.626466 43.673533    2    5  26
4 43.401454  3.397009    5    1   5
5  7.080386 22.870539    1    3  13
6 39.094724 24.672907    4    3  16
I've broken down the steps for illustration purposes; you probably don't want to keep the intermediate columns bin1 and bin2.
We probably need a table showing the relationship between x, y, and z; after that, we can define a function to do the join.
The solution is related to and inspired by this post (R dplyr join by range or virtual column). You may also find the other solutions there useful.
# Set seed for reproducibility
set.seed(1)
# Create example data frame
my_df <- data.frame(x = runif(100, min = 0, max = 60),
                    y = runif(100, min = 0, max = 60))
# Load the dplyr package
library(dplyr)
# Create a table to show the relationship between x, y, and z
r <- expand.grid(x_from = seq(0, 50, 10), y_from = seq(0, 50, 10)) %>%
  mutate(x_to = x_from + 10, y_to = y_from + 10, z = 1:n())
# Define a function for dynamic join
dynamic_join <- function(d, r){
  if (!("z" %in% colnames(d))){
    d[["z"]] <- NA_integer_
  }
  d <- d %>%
    mutate(z = ifelse(x >= r$x_from & x < r$x_to & y >= r$y_from & y < r$y_to,
                      r$z, z))
  return(d)
}
re_dynamic_join <- function(d, r){
  r_list <- split(r, r$z)
  for (i in 1:length(r_list)){
    d <- dynamic_join(d, r_list[[i]])
  }
  return(d)
}
# Apply the function
re_dynamic_join(my_df, r)
x y z
1 15.930520 39.2834357 20
2 22.327434 21.1918363 15
3 34.371202 16.2156088 10
4 54.492467 59.5610437 36
5 12.100916 38.0095959 20
6 53.903381 12.7924881 12
7 56.680516 7.7623409 6
8 39.647868 28.6870821 16
9 37.746843 55.4444682 34
10 3.707176 35.9256580 19
11 12.358474 58.5702417 32
12 10.593405 43.9075507 26
13 41.221371 21.4036147 17
14 23.046223 25.8884214 15
15 46.190485 8.8926936 5
16 29.861955 0.7846545 3
17 43.057110 42.9339640 29
18 59.514366 6.1910541 6
19 22.802111 26.7770609 15
20 46.646713 38.4060627 23
21 56.082314 59.5103172 36
22 12.728551 29.7356147 14
23 39.100426 29.0609715 16
24 7.533306 10.4065401 7
25 16.033240 45.2892567 26
26 23.166846 27.2337294 15
27 0.803420 30.6701870 19
28 22.943277 12.4527068 9
29 52.181451 13.7194886 12
30 20.420940 35.7427198 21
31 28.924807 34.4923319 21
32 35.973950 4.6238628 4
33 29.612478 2.1324348 3
34 11.173056 38.5677295 20
35 49.642399 55.7169120 35
36 40.108004 35.8855453 23
37 47.654392 33.6540449 23
38 6.476618 31.5616634 19
39 43.422657 59.1057134 35
40 24.676466 30.4585093 21
41 49.256778 40.9672847 29
42 38.823612 36.0924731 22
43 46.975966 14.3321207 11
44 33.182179 15.4899556 10
45 31.783175 43.7585774 28
46 47.361374 27.1542499 17
47 1.399872 10.5076061 7
48 28.633804 44.8018962 27
49 43.938824 6.2992584 5
50 41.563893 51.8726969 35
51 28.657177 36.8786983 21
52 51.672569 33.4295723 24
53 26.285826 19.7266391 9
54 14.687837 27.1878867 14
55 4.240743 30.0264584 19
56 5.967970 10.8519817 7
57 18.976302 31.7778362 20
58 31.118056 4.5165447 4
59 39.720305 16.6653560 10
60 24.409811 12.7619712 9
61 54.772555 17.0874289 12
62 17.616202 53.7056462 32
63 27.543944 26.7741194 15
64 19.943680 46.7990934 26
65 39.052228 52.8371421 34
66 15.481007 24.7874526 14
67 28.712715 3.8285088 3
68 45.978640 20.1292495 17
69 5.054815 43.4235568 25
70 52.519280 20.2569200 18
71 20.344376 37.8248473 21
72 50.366421 50.4368732 36
73 20.801009 51.3678999 33
74 20.026496 23.4815569 15
75 28.581075 22.8296331 15
76 53.531900 53.7267256 36
77 51.860368 38.6589458 24
78 23.399373 44.4647189 27
79 46.639242 36.3182068 23
80 57.637080 54.1848967 36
81 26.079569 17.6238093 9
82 42.750881 11.4756066 11
83 23.999662 53.1870566 33
84 19.521129 30.2003691 20
85 45.425229 52.6234526 35
86 12.161535 11.3516173 8
87 42.667273 45.4861831 29
88 7.301515 43.4699336 25
89 14.729311 56.6234891 32
90 8.598263 32.8587952 19
91 14.377765 42.7046321 26
92 3.536063 23.3343060 13
93 38.537296 6.0523876 4
94 52.576153 55.6381253 36
95 46.734881 16.9939500 11
96 47.838530 35.4343895 23
97 27.316467 6.6216363 3
98 24.605045 50.4304219 33
99 48.652215 19.0778211 11
100 36.295997 46.9710802 28
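For completeness (my addition): because the zones form a regular 6 × 6 grid, the id can also be computed directly, with no join at all, assuming x and y lie in (0, 60]:
# ceiling(x/10) gives the x-bin 1..6; each y-bin above the first adds 6
my_df$z <- ceiling(my_df$x / 10) + 6 * (ceiling(my_df$y / 10) - 1)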

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)),
                  re = c(rep(seq(1, 10, 1), 2)),
                  age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
                  h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row in the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
  age = c(exp$age1[i], exp$age2[i])
  h = c(exp$h[i], exp$h2[i])
  model = lm(age ~ h)
  exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
It works well but takes some time with very large files. I'd be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try rowwise() and do. Inside the do, we concatenate (c) 'age1' and 'age2' to create 'age'; likewise we create 'h', apply lm, and extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
  rowwise() %>%
  do({
    age <- c(.$age1, .$age2)
    h <- c(.$h, .$h2)
    model <- lm(age ~ h)
    data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
  })
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table, i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), creating a column 'rn' with the option keep.rownames=TRUE. We melt the dataset, specifying the patterns in measure, to convert from 'wide' to 'long' format. Grouped by 'rn', we fit the lm and extract the coef. The result can be assigned as a new column in the original dataset ('exp'), while the unwanted 'rn' column is removed by assigning (:=) it to NULL.
library(data.table) # v1.9.5+
modelh <- melt(setDT(exp, keep.rownames = TRUE), measure = patterns('^age', '^h'),
               value.name = c('age', 'h'))[, {model <- lm(age ~ h)
                 coef(model)[1] + 100 * coef(model)[2]}, rn]$V1
exp[, modelh := modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis, since you mentioned "it's an example of a bigger problem": if you are really interested in building models rowwise, you'll create more and more columns as your age and h observations increase. With N observations you'll need 2×N columns for those two variables alone.
I'd suggest using a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
varying = list(c("age1","age2"),
c("h","h2")),
v.names = c("age_value","h_value"),
direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
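To close the loop (my addition, not part of the original answer): once the data are in long format, the per-row fits become ordinary grouped fits, e.g. with dplyr:
library(dplyr)
exp_long <- reshape(exp, varying = list(c("age1", "age2"), c("h", "h2")),
                    v.names = c("age_value", "h_value"), direction = "long")
exp_long %>%
  group_by(id) %>%
  summarise(modelh = {m <- lm(age_value ~ h_value); coef(m)[1] + 100 * coef(m)[2]})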
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
In base R, the function sprintf can help us create the formulas, and lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply call, the expression as.formula(x) converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1 = sample(30, 1e4, TRUE),
                      age2 = sample(30:50, 1e4, TRUE),
                      h = runif(1e4, 10, 40),
                      h2 = 40 + runif(1e4, 4, 9))
microbenchmark(
  plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
             lst <- lapply(strings, function(x) {model <- lm(as.formula(x)); coef(model)[1] + 100 * coef(model)[2]})
             big.exp$modelh <- unlist(lst)},
  akdplyr = {big.exp %>%
               rowwise() %>%
               do({
                 age <- c(.$age1, .$age2)
                 h <- c(.$h, .$h2)
                 model <- lm(age ~ h)
                 data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
               })},
  times = 5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its 5-digit rounding may be an issue. Improvements may be possible; perhaps using apply after converting to a matrix would increase speed and efficiency.
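One further note (my addition): with only two points per row the fitted line passes through them exactly, so the coefficients have closed forms and the whole computation can be vectorised with no model fitting at all:
# line through (h, age1) and (h2, age2): slope and intercept, row by row
slope <- (exp$age2 - exp$age1) / (exp$h2 - exp$h)
intercept <- exp$age1 - slope * exp$h
exp$modelh <- intercept + 100 * slope
For the first row this gives 68.85506, matching the outputs above.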
