Linear model and dplyr - a better solution? - r

I got a lot of good feedback on a question I recently asked and was guided to use dplyr to transform some data. I'm having an issue with lm() and trying to find a slope from this transformed data and thought I'd open up a new question.
First I have data that looks like this:
Var1 Var2 Var3 Time Temp
a w j 9/9/2014 20
a w j 9/9/2014 15
a w k 9/20/2014 10
a w j 9/10/2014 0
b x L 9/12/2014 30
b x L 9/12/2014 10
b y k 9/13/2014 20
b y k 9/13/2014 15
c z j 9/14/2014 20
c z j 9/14/2014 10
c z k 9/14/2014 11
c w l 9/10/2014 45
a d j 9/22/2014 20
a d k 9/15/2014 4
a d l 9/15/2014 23
a d k 9/15/2014 11
And I want it in the form of this (values for Slope and Pearson simulated for illustration):
V1 V2 V3 Slope Pearson
a w j -3 -0.9
a w k 2 0
a d j 1.5 0.6
a d k 0 0.5
a d l -0.5 -0.6
b x L 12 0.7
b y k 4 0.6
c z j -1 -0.5
c z k -3 -0.4
c w l -10 -0.9
The slope being a linear-least-squares slope. In theory, the script would look like so:
library(dplyr)
data <- read.table("clipboard",sep="\t",quote="",header=T)
newdata = summarise(group_by(data
,Var1
,Var2
,Var3
)
,Slope = lm(Temp ~ Time)$coeff[2]
,Pearson = cor(Time, Temp, method="pearson")
)
But R throws an error like it can't find Time or Temp. It can run lm(data$Temp ~ data$Time)$coeff[2], but returns the slope for the entire data set and not the subsetted form that I'm looking for. cor() seems to run just fine in the group_by section, so is there a specific syntax I need to pass to lm() to have it run in a similar manner or use a different function entirely to get a slope passed from the subset?

You have several issues here.
If you group your data by three variables (or even two), you don't have enough distinct values within each group to fit a linear regression model.
Pearson correlation requires two numeric variables, but Time is a factor here, and converting it to numeric wouldn't make much sense.
You will need to use do in order to run your linear model within each group.
Here's an illustration for grouping only on V1
data %>%
group_by(Var1) %>% # You can add here additional grouping variables if your real data set enables it
do(mod = lm(Temp ~ Time, data = .)) %>%
mutate(Slope = summary(mod)$coeff[2]) %>%
select(-mod)
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Var1 Slope
# 1 a 12.66667
# 2 b -2.50000
# 3 c -31.33333
If you do have two numeric variables, you can use do in order to calculate correlation too, for example (I will create some dummy numeric variables for illustration)
data %>%
mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
test2 = sample(1:3, n(), replace = TRUE)) %>%
group_by(Var1) %>%
do(mod = lm(Temp ~ Time, data = .),
mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
mutate(Slope = summary(mod)$coeff[2],
Pearson = mod2[1]) %>%
select(-mod, -mod2)
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# Var1 Slope Pearson
# 1 a 12.66667 0.25264558
# 2 b -2.50000 -0.09090909
# 3 c -31.33333 0.30151134
Bonus solution: you can do this quite efficiently/easily with data.table package too
library(data.table)
setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
# Var1 Slope
# 1: a 12.66667
# 2: b -2.50000
# 3: c -31.33333
Or if we want to create some dummy variables too
library(data.table)
setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE),
test2 = sample(1:3, .N, replace = TRUE))][,
list(Slope = summary(lm(Temp ~ Time))$coeff[2],
Pearson = cor(test1, test2, method = "pearson")), Var1]
# Var1 Slope Pearson
# 1: a 12.66667 -0.02159168
# 2: b -2.50000 -0.81649658
# 3: c -31.33333 -1.00000000
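If Time is meant as a calendar date rather than a category, one way around the factor problem noted above is to parse it before fitting. The following is a minimal base R sketch under that assumption; the toy data frame and the Day column are invented for illustration and are not the question's data.

```r
# Toy data, invented for illustration: two groups observed on three consecutive days
toy <- data.frame(
  Var1 = rep(c("a", "b"), each = 3),
  Time = rep(c("9/9/2014", "9/10/2014", "9/11/2014"), times = 2),
  Temp = c(20, 15, 10, 10, 12, 14)
)

# Assumption: Time is month/day/year; convert it to days since 1970-01-01
toy$Day <- as.numeric(as.Date(toy$Time, format = "%m/%d/%Y"))

# Fit one regression per group and keep the slope and the Pearson correlation
per_group <- lapply(split(toy, toy$Var1), function(g) {
  data.frame(
    Var1    = g$Var1[1],
    Slope   = unname(coef(lm(Temp ~ Day, data = g))["Day"]),
    Pearson = cor(g$Day, g$Temp, method = "pearson")
  )
})
result <- do.call(rbind, per_group)
result
```

With Day measured in days, Slope reads as the change in Temp per day. Groups with fewer than two distinct dates would still give an NA slope, which is the first issue listed above.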

Related

Aggregate including all levels?

I have an R data frame with factor and numeric columns, and I would like to calculate certain summaries of the numeric columns by various groupings of the factors. In particular, I would like to be able to summarise by multiple combinations of the factors at once, and return the results in a single object.
For example, say I'm using the warpbreaks data frame, which has columns breaks (integer), wool (factor with levels "A", "B") and tension (factor with levels "L", "M" and "H"). If I want to get the average number of breaks for each combination of wool and tension, I know I can use aggregate(breaks ~ wool + tension, data = warpbreaks, mean) and it will give me something like:
wool tension breaks
1 A L 44.55556
2 B L 28.22222
...
6 B H 18.77778
But I'd like to also calculate the means across just wool, and just tension, and for the dataset as a whole, and return something like:
wool tension breaks
1 NA NA 28.14815
2 NA L 36.38889
...
5 A NA 31.03704
...
7 A L 44.55556
...
12 B H 18.77778
I tried a few variations of formulas in the aggregate function and couldn't find anything suitable, is this something that can be done simply?
If you have an arbitrary number of inputs that you want to treat as a whole + individually, you can set up a loop via Map:
## set formula inputs
rhs <- c("wool","tension")
lhs <- "breaks"
## map it
Map(
\(d,l,r) aggregate(d[l], d[r], FUN=mean),
list(warpbreaks),
list(lhs),
c(list(rhs), rhs)
)
#[[1]]
# wool tension breaks
#1 A L 44.55556
#2 B L 28.22222
#3 A M 24.00000
#4 B M 28.77778
#5 A H 24.55556
#6 B H 18.77778
#
#[[2]]
# wool breaks
#1 A 31.03704
#2 B 25.25926
#
#[[3]]
# tension breaks
#1 L 36.38889
#2 M 26.38889
#3 H 21.66667
You can extend the rhs inputs however you see fit for many combinations, e.g.:
rhs <- 1:3
c(list(rhs), combn(rhs, 2, simplify=FALSE), rhs)
#[[1]]
#[1] 1 2 3
#
#[[2]]
#[1] 1 2
#
#[[3]]
#[1] 1 3
#
#[[4]]
#[1] 2 3
#
#[[5]]
#[1] 1
#
#[[6]]
#[1] 2
#
#[[7]]
#[1] 3
You can calculate aggregate separately and combine them with bind_rows.
dplyr::bind_rows(aggregate(breaks ~ wool + tension, data = warpbreaks, mean),
aggregate(breaks ~ wool, data = warpbreaks, mean),
aggregate(breaks ~ tension, data = warpbreaks, mean))
# wool tension breaks
# <fct> <fct> <dbl>
# 1 A L 44.556
# 2 A M 24
# 3 A H 24.556
# 4 B L 28.222
# 5 B M 28.778
# 6 B H 18.778
# 7 A NA 31.037
# 8 B NA 25.259
# 9 NA L 36.389
#10 NA M 26.389
#11 NA H 21.667
Or in dplyr -
library(dplyr)
bind_rows(
warpbreaks %>% group_by(wool, tension) %>% summarise(breaks = mean(breaks), .groups = 'drop'),
warpbreaks %>% group_by(wool) %>% summarise(breaks = mean(breaks)),
warpbreaks %>% group_by(tension) %>% summarise(breaks = mean(breaks))
)
For any number of inputs.
library(dplyr)
library(purrr)
cols <- c('wool', 'tension')
map_df(seq_along(cols), function(x) combn(cols, x, function(y) {
warpbreaks %>% group_by(across(all_of(y))) %>% summarise(breaks = mean(breaks))
}, simplify = FALSE) %>% bind_rows())
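The answers above return the marginal means but not the grand mean, and the Map version returns a list rather than one object. As a hedged base R sketch of the single-table layout the question asks for, the code below loops over every subset of the grouping columns, including the empty set, and fills the columns that were averaged over with NA. The names vars, subsets, pieces and all_means are made up for this illustration.

```r
vars <- c("wool", "tension")

# Every subset of the grouping columns, starting with the empty set (grand mean)
subsets <- c(
  list(character(0)),
  unlist(lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
         recursive = FALSE)
)

pieces <- lapply(subsets, function(by) {
  if (length(by) == 0) {
    out <- data.frame(breaks = mean(warpbreaks$breaks))
  } else {
    out <- aggregate(warpbreaks["breaks"], warpbreaks[by], FUN = mean)
    out[by] <- lapply(out[by], as.character)  # plain character columns rbind cleanly
  }
  # Columns that were averaged over are filled with NA
  for (v in setdiff(vars, by)) out[[v]] <- NA_character_
  out[c(vars, "breaks")]
})

all_means <- do.call(rbind, pieces)
all_means
```

The grouping columns come back as character rather than factor; that is a simplification chosen here so that the NA rows combine without level bookkeeping.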

calculate means and occurrences from multiple matrices

I have a number of matrices that all hold the same type of elements but have different lengths. The columns are the same in all files (let's call them "A" and "B"), but the rows are mostly, though not always, the same elements across files.
Here are some example data (in the form of dataframes)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
As you can see, "alpha", "beta" and "gamma" are always present, but many of the other rows are not.
I would like to calculate two things:
The average values of the A and B columns across all matrices, ideally as an ave.matr that has all row names and the mean values of columns "A" and "B":
A B
alpha 1 7
beta 2 6
delta 3 5
gamma 4 4
zeta 5 3
theta 6 2
epsilon 7 1
(where the above numbers are the mean values of all matrices)
And then an occurrence matrix, let's call it occur.matr, that counts the number of occurrences of each row name across all matrices. It should look like this:
A B
alpha 3
beta 3
delta 2
gamma 3
zeta 2
theta 1
epsilon 1
I have started working on this today but I cannot figure out how to do it.
I started by creating a list and a matrix with the unique rownames from all matrices
list=c(rownames(df1),rownames(df2),rownames(df3))
unique=unique(list)
avematr<-matrix(NA,nrow=length(unique),ncol=2)
My next step would be to make the row names of all matrices identical. I tried match but cannot figure it out, and at this point I don't even know whether this is the best strategy.
All the similar questions I found are about merging the matrices (which is not what I want to do).
Any help is greatly appreciated.
Here is a tidyverse approach:
library(tidyverse)
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
dat <- list(df1, df2, df3) %>%
map_dfr(rownames_to_column, var = "id") # store the row names in a column called id
avg_dat <- dat %>%
group_by(id) %>%
summarise(A = mean(A),
B = mean(B))
#> `summarise()` ungrouping output (override with `.groups` argument)
avg_dat
#> # A tibble: 7 x 3
#> id A B
#> <chr> <dbl> <dbl>
#> 1 alpha 1 5
#> 2 beta 2 4
#> 3 delta 3 4
#> 4 epsilon 7 1
#> 5 gamma 3.67 2.33
#> 6 theta 6 2
#> 7 zeta 5 2
occ_dat <- dat %>% count(id)
occ_dat
#> id n
#> 1 alpha 3
#> 2 beta 3
#> 3 delta 2
#> 4 epsilon 1
#> 5 gamma 3
#> 6 theta 1
#> 7 zeta 2
Created on 2021-01-27 by the reprex package (v0.3.0)
If you want to stick to base R:
For the averaging task it makes things easier when you add your rowname as a column. This prevents autonumbering of rownames when combining the dataframes. You then can simply loop over every unique rowname and construct the averages. A quick and dirty solution could look like this:
df1 <- data.frame(A = 1:3, B = 3:1)
rownames(df1)=c("alpha","beta","gamma")
df2 <- data.frame(A = 1:5,B = 5:1)
rownames(df2)=c("alpha","beta","delta","gamma","zeta")
df3 <- data.frame(A = 1:7, B = 7:1)
rownames(df3)=c("alpha","beta","delta","gamma","zeta","theta","epsilon")
add_row_names_to_df <- function(df) {
df$rn <- rownames(df)
return(df)
}
new_df <- rbind(add_row_names_to_df(df1),
add_row_names_to_df(df2),
add_row_names_to_df(df3))
avg_df <- as.data.frame(matrix(unique(new_df$rn),
nrow = length(unique(new_df$rn)),
ncol = 3))
for (i in seq_len(nrow(avg_df))) {
avg_df[i, ] <- c(avg_df[i, 1],
mean(new_df$A[new_df$rn == avg_df[i, 1]]),
mean(new_df$B[new_df$rn == avg_df[i, 1]]))
}
colnames(avg_df) <- c("rowname", "avgA", "avgB")
avg_df
results in:
rowname avgA avgB
1 alpha 1 5
2 beta 2 4
3 gamma 3.66666666666667 2.33333333333333
4 delta 3 4
5 zeta 5 2
6 theta 6 2
7 epsilon 7 1
For the occurrence matrix you can use the table() function from R:
as.matrix(table(c(rownames(df1),rownames(df2),rownames(df3))))
yields:
[,1]
alpha 3
beta 3
delta 2
epsilon 1
gamma 3
theta 1
zeta 2
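The same idea (row names as an ordinary column, then one summary per name) can also be written without an explicit loop. This is a sketch rather than a replacement for the answer above: it uses aggregate() so that the averaged columns stay numeric, and the names stacked, ave.matr and occur.matr are chosen here to match the question's wording.

```r
df1 <- data.frame(A = 1:3, B = 3:1, row.names = c("alpha", "beta", "gamma"))
df2 <- data.frame(A = 1:5, B = 5:1,
                  row.names = c("alpha", "beta", "delta", "gamma", "zeta"))
df3 <- data.frame(A = 1:7, B = 7:1,
                  row.names = c("alpha", "beta", "delta", "gamma",
                                "zeta", "theta", "epsilon"))

# Stack the data frames, keeping each row name as an ordinary column
stacked <- do.call(rbind, lapply(list(df1, df2, df3), function(d) {
  data.frame(id = rownames(d), d, row.names = NULL)
}))

# Mean of A and B per row name; the numeric columns stay numeric
ave.matr <- aggregate(cbind(A, B) ~ id, data = stacked, FUN = mean)

# Number of data frames in which each row name appears
occur.matr <- as.matrix(table(stacked$id))

ave.matr
occur.matr
```

Both results are sorted alphabetically by row name, so the order differs from the order in which the names first appear in the inputs.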

forloop inside dplyr mutate

I would like to do a few column operations using mutate in a more elegant way, as my table has more than 200 columns that I would like to transform.
Here is an example.
Sample data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20),c2_y=rnorm(20),c3_z=rnorm(20),
c4_x=rnorm(20),c5_y=rnorm(20),c6_z=rnorm(20),
c7_x=rnorm(20),c8_y=rnorm(20),c9_z=rnorm(20),
c10_x=rnorm(20),c11_y=rnorm(20),c12_z=rnorm(20),
c_n=rnorm(20))
sample code:
dfm<-df %>%
mutate(cx=(c1_x*c4_x/c_n+c7_x*c10_x/c_n),
cy=(c2_y*c5_y/c_n+c8_y*c11_y/c_n),
cz=(c3_z*c6_z/c_n+c9_z*c12_z/c_n))
Despite the tangent, the initial recommendations to use tidyr functions are where you need to go. This pipe of functions seems to do the job based on what you've provided.
Your data:
df <- data.frame(treatment=rep(letters[1:2],10),
c1_x=rnorm(20), c2_y=rnorm(20), c3_z=rnorm(20),
c4_x=rnorm(20), c5_y=rnorm(20), c6_z=rnorm(20),
c7_x=rnorm(20), c8_y=rnorm(20), c9_z=rnorm(20),
c10_x=rnorm(20), c11_y=rnorm(20), c12_z=rnorm(20),
c_n=rnorm(20))
library(dplyr)
library(tidyr)
This first auxiliary data.frame is used to translate your c#_[xyz] variable into a unified one. I'm sure there are other ways to handle this, but it works and is relatively easy to reproduce and extend based on your 200+ columns.
variableTransform <- tibble( # tibble() replaces the deprecated data_frame()
cnum = paste0("c", 1:12),
cvar = rep(paste0("a", 1:4), each = 3)
)
head(variableTransform)
# Source: local data frame [6 x 2]
# cnum cvar
# <chr> <chr>
# 1 c1 a1
# 2 c2 a1
# 3 c3 a1
# 4 c4 a2
# 5 c5 a2
# 6 c6 a2
Here's the pipe all at once. I'll explain the steps in a sec. What you're looking for is likely a combination of the treatment, xyz, and ans columns.
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
tidyr::spread(cvar, value) %>%
mutate(
ans = a1 * (a2/c_n) + a3 * (a4/c_n)
) %>%
head
# treatment c_n xyz a1 a2 a3 a4 ans
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419 1.15801448
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979 -0.01828831
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878 -2.03197283
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493 0.15759418
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839 0.65270681
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095 0.06136036
First, we take the original data and turn all (except two) columns into two columns of "column name" and "column values" pairs:
df %>%
tidyr::gather(cnum, value, -treatment, -c_n) %>%
# treatment c_n cnum value
# 1 a 0.20745647 c1_x -0.1250222
# 2 b 0.01015871 c1_x -0.4585088
# 3 a 1.65671028 c1_x -0.2455927
# 4 b -0.24037137 c1_x 0.6219516
# 5 a -1.16092349 c1_x -0.3716138
# 6 b 1.61191700 c1_x 1.7605452
It will be helpful to split c1_x into c1 and x in order to translate the first and preserve the latter:
tidyr::separate(cnum, c("cnum", "xyz"), sep = "_") %>%
# treatment c_n cnum xyz value
# 1 a 0.20745647 c1 x -0.1250222
# 2 b 0.01015871 c1 x -0.4585088
# 3 a 1.65671028 c1 x -0.2455927
# 4 b -0.24037137 c1 x 0.6219516
# 5 a -1.16092349 c1 x -0.3716138
# 6 b 1.61191700 c1 x 1.7605452
From here, let's translate the c1, c2, and c3 variables into a1 (repeat for other 9 variables) using variableTransform:
left_join(variableTransform, by = "cnum") %>%
select(-cnum) %>%
# treatment c_n xyz value cvar
# 1 a 0.20745647 x -0.1250222 a1
# 2 b 0.01015871 x -0.4585088 a1
# 3 a 1.65671028 x -0.2455927 a1
# 4 b -0.24037137 x 0.6219516 a1
# 5 a -1.16092349 x -0.3716138 a1
# 6 b 1.61191700 x 1.7605452 a1
Since we want to deal with multiple variables simultaneously (with a simple mutate), we need to bring some of the variables back into columns. (The reason we gathered and will now spread helps me with keeping things organized and named well. I'm confident somebody can come up with another way to do it.)
tidyr::spread(cvar, value) %>% head
# treatment c_n xyz a1 a2 a3 a4
# 1 a -1.535934 x -0.3276474 1.45959746 -1.2650369 1.02795419
# 2 a -1.535934 y -1.3662388 -0.05668467 0.4867865 -0.10138979
# 3 a -1.535934 z -2.5026018 -0.99797169 0.5181513 1.20321878
# 4 a -1.363584 x -0.9742016 -0.12650863 1.3612361 -0.24840493
# 5 a -1.363584 y -0.9795871 1.52027017 0.5510857 1.08733839
# 6 a -1.363584 z 0.2985557 -0.22883439 0.1536078 -0.09993095
From here, we just need to mutate to get the right answer.
Similar to r2evans's answer, but with more manipulation instead of the joins (and less explanation).
library(tidyr)
library(stringr)
library(dplyr)
# get it into fully long form
gather(df, key = cc_xyz, value = value, c1_x:c12_z) %>%
# separate off the xyz and the c123
separate(col = cc_xyz, into = c("cc", "xyz")) %>%
# extract the number
mutate(num = as.numeric(str_replace(cc, pattern = "c", replacement = "")),
# mod it by 4 for groupings and add a letter so its a good col name
num_mod = paste0("v", (num %% 4) + 1)) %>%
# remove unwanted columns
select(-cc, -num) %>%
# go into a reasonable data width for calculation
spread(key = num_mod, value = value) %>%
# calculate
mutate(result = v1 + v2/c_n + v3 + v4 / c_n)
# treatment c_n xyz v1 v2 v3 v4 result
# 1 a -1.433858289 x 1.242153708 -0.985482158 -0.0240414692 1.98710285 0.51956295
# 2 a -1.433858289 y -0.019255516 0.074453615 -1.6081599298 1.18228939 -2.50389188
# 3 a -1.433858289 z -0.362785313 2.296744655 -0.0610463292 0.89797526 -2.65188998
# 4 a -0.911463819 x -1.088308527 -0.703388193 0.6308253909 0.22685013 0.06534405
# 5 a -0.911463819 y 1.284513516 1.410276163 0.5066869590 -2.07263912 2.51790289
# 6 a -0.911463819 z 0.957778345 -1.136532104 1.3959561507 -0.50021647 4.14947069
# ...
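Whichever reshaping route is used, it is cheap to check the result against the formula written out by hand in the question. The sketch below does that with a base R alternative that needs no reshaping at all. It assumes, as in the sample data, that the columns for each suffix pair up in numeric order (c1 with c4, c7 with c10) and that each suffix has an even number of columns; pair_sum is a hypothetical helper named for this example only.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Same layout as the question: c1_x, c2_y, c3_z, ..., c12_z, plus c_n
vals <- as.data.frame(matrix(rnorm(20 * 12), nrow = 20))
names(vals) <- paste0("c", 1:12, "_", c("x", "y", "z"))
df <- data.frame(treatment = rep(letters[1:2], 10), vals, c_n = rnorm(20))

# Hypothetical helper: for one suffix, multiply consecutive pairs of columns
# (1st * 2nd, 3rd * 4th, ...), add the products and divide by c_n
pair_sum <- function(suffix) {
  cols <- grep(paste0("_", suffix, "$"), names(df), value = TRUE)
  # sort by the numeric part so that c10_x comes after c7_x
  cols <- cols[order(as.numeric(sub("^c([0-9]+)_.*$", "\\1", cols)))]
  m <- as.matrix(df[cols])
  first <- seq(1, ncol(m), by = 2)
  rowSums(m[, first, drop = FALSE] * m[, first + 1, drop = FALSE]) / df$c_n
}

res <- sapply(c(cx = "x", cy = "y", cz = "z"), pair_sum)

# Compare against the formula written out by hand in the question
direct_cx <- with(df, c1_x * c4_x / c_n + c7_x * c10_x / c_n)
all.equal(unname(res[, "cx"]), direct_cx)
```

res is a matrix with one column per suffix, so cbind(df["treatment"], res) would attach the results to the treatment labels.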

How to subset only the rows that have multiple, different values in another column in R?

I have a dataset similar to that below:
zz <- "Session Rater
1 A X
2 A X
3 A X
4 B Y
5 B Y
6 B Z
7 B Z
8 C X
9 C Y
10 C Z"
Data <- read.table(text=zz, header = TRUE)
I'd like to only subset the session rows for which the session has multiple raters, even though that data is stored in another column. Therefore, I'd like to end up with a dataset that looks like this:
zz2 <- "Session Rater
1 B Y
2 B Y
3 B Z
4 B Z
5 C X
6 C Y
7 C Z"
Data2 <- read.table(text=zz2, header = TRUE)
Where Session A rows were removed from the dataset because Session A only had one rater, "X," but Sessions B and C (and all of their rows) were retained because they had more than one rater (Y & Z for Session B, and X, Y, & Z for Session C).
I've played around with dplyr, but with no success. Many thanks.
We can use filter with n_distinct
library(dplyr)
Data %>%
group_by(Session) %>%
filter(n_distinct(Rater)>1)
# Session Rater
# <fctr> <fctr>
#1 B Y
#2 B Y
#3 B Z
#4 B Z
#5 C X
#6 C Y
#7 C Z
Or using data.table
library(data.table)
setDT(Data)[, if(uniqueN(Rater)>1) .SD, by = Session]
Or with base R
i1 <- rowSums(!!table(Data))
subset(Data, Session %in% names(i1)[i1 >1])
... or using ave() and subscripting (assuming Rater is a factor, which was the read.table default for character data before R 4.0.0)
Data[with(Data,ave(unclass(Rater),Session,
FUN = function(x)length(unique(x)))) > 1,]
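For readers who want to see the counting step on its own, here is a small base R sketch that does not depend on whether Rater is a factor or a character vector. It rebuilds the question's data directly with data.frame(); n_raters and multi are names made up for this example.

```r
Data <- data.frame(
  Session = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  Rater   = c("X", "X", "X", "Y", "Y", "Z", "Z", "X", "Y", "Z")
)

# Number of distinct raters in each session
n_raters <- tapply(Data$Rater, Data$Session, function(r) length(unique(r)))

# Keep every row of the sessions that have more than one distinct rater
multi <- names(n_raters)[n_raters > 1]
Data2 <- Data[Data$Session %in% multi, ]
Data2
```

Data2 keeps the original row names (4 to 10); rownames(Data2) <- NULL would renumber them if that matters.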

Ranking data frame

Here is my data:
x<- data.frame(P=c("M","C","M","C","C","M","C","M"),
Q=c(13,12,12,14,19,15,12,11),
R=c(15,13,21,32,32,21,13,32),
T=c(15,12,12,14,12,11,19,15))
I want to calculate means for variables within each category.
For example, the means for Q are M = (13+12+15+11)/4 = 12.75 and C = (12+14+19+12)/4 = 14.25, and so on.
Next, I want to rank the means for each variable and get the following table:
P Q R T
M 2 2 2
C 1 1 1
I want tied values to get equal ranks in my real data.
For example, if I have three 12s, all of them should get the same rank.
You can try
aggregate(.~P, x, mean)
Or
library(dplyr)
x1 <- x %>%
group_by(P) %>%
summarise(across(everything(), mean)) # across() needs dplyr >= 1.0.0
x1
# P Q R T
#1 C 14.25 22.50 14.25
#2 M 12.75 22.25 13.25
For ranking
x1 %>%
mutate(across(Q:T, ~ rank(-.x))) # across() needs dplyr >= 1.0.0
# P Q R T
#1 C 1 1 1
#2 M 2 2 2
Update
Suppose if there are ties,
x1$Q[2] <- x1$Q[1]
rank will get the ties averaged by default. You can specify the ties.method to min or use min_rank
x1 %>%
mutate(across(Q:T, ~ min_rank(-.x))) # across() needs dplyr >= 1.0.0
# P Q R T
#1 C 1 1 1
#2 M 1 2 2
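To make the difference between the tie-handling options concrete, here is a short base R sketch on three made-up group means (the third group, K, is invented so that a gap after the tie is visible). The last line builds a dense rank by hand with match(), which is one possible way to do it rather than a dedicated function.

```r
means <- c(C = 14.25, M = 14.25, K = 12.75)  # made-up means with a tie

avg_rank   <- rank(-means)                         # ties share the average rank: 1.5 1.5 3.0
min_rank_  <- rank(-means, ties.method = "min")    # ties share the smallest rank: 1 1 3
dense_rank <- match(-means, sort(unique(-means)))  # no gap after a tie: 1 1 2

avg_rank
min_rank_
dense_rank
```

The "min" variant is what dplyr's min_rank computes; the dense variant corresponds to ties.method = "dense" in data.table's frank.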
For completeness, here's a possible data.table solution using frank from the devel version on GitHub
For means
library(data.table) ## v >= 1.9.5
(Res <- setDT(x)[, lapply(.SD, mean), by = P])
# P Q R T
# 1: M 12.75 22.25 13.25
# 2: C 14.25 22.50 14.25
For ranks
Res[, 2:4 := lapply(-.SD, frank, ties.method = "dense"), .SDcols = 2:4]
Res
# P Q R T
# 1: M 2 2 2
# 2: C 1 1 1
To get the means you can do this:
do.call(rbind,lapply(split(x, f = x$P), function(x) data.frame(P = unique(x$P), Q = mean(x$Q), R = mean(x$R), T = mean(x$T))))
