I'm trying to evaluate multiple functions on my dataset using the purrr library. I tried splitting the data and then applying map with the list of functions, but I can't find the right way to pass my variables as arguments. What am I missing?
library(tidyverse)
df <- tibble(f1 = c('a','a','a','b','b','c','c','c','d','d'),
f2 = c('z','x','y','z','x','z','x','y','z','x'),
obs = sample(1:10, 10),
pred = sample(1:10,10))
#RMSE
#------------------------------------------------------------------------------
rmse <- function(obs, model){
n = length(obs)
diff = (model - obs)**2
rmse = sqrt(sum(diff)) / n
return(rmse)
}
#Mean Bias
#------------------------------------------------------------------------------
mean_bias <- function(obs, model){
n =length(obs)
diff=model-obs
mean_bias=sum(diff)/n
return(mean_bias)
}
#Mean error
#------------------------------------------------------------------------------
mean_error <- function(obs, model){
n =length(obs)
abs=abs(model-obs)
mean_error=sum(abs)/n
return(mean_error)
}
mod_eval <- list(rmse, mean_bias,mean_error)
test <- df%>%
split(list(c(.$f1,.$f2)))%>%
invoke_map(.f=mod_eval)##????
I would expect to get something like this:
f1 f2 rmse mean_bias mean_error
1 a x 17 11 5
2 b x 17 11 5
3 c x 17 11 5
4 d x 17 11 5
5 a y 17 11 5
6 b y 17 11 5
You can pass a named list of functions and use map to apply each function to each combination.
library(dplyr)
library(purrr)
mod_eval <- lst(rmse, mean_bias,mean_error)
df %>%
split(.$f1) %>%
map_df(function(x) map_dbl(mod_eval, ~.x(x$obs, x$pred)), .id = 'f1')
# f1 rmse mean_bias mean_error
# <chr> <dbl> <dbl> <dbl>
#1 a 2.33 -1.67 3.67
#2 b 3.91 5.5 5.5
#3 c 2.92 -2.33 4.33
#4 d 2.5 0.5 3.5
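A side note on the metric itself: the rmse() in the question divides by n outside the square root, so it returns values smaller than the conventional root-mean-square error by a factor of sqrt(n). If the usual definition, sqrt(mean((model - obs)^2)), is what was intended, a minimal sketch of the difference could look like this. The names rmse_question and rmse_standard and the toy vectors are made up for this illustration.

```r
# Version from the question: square root of the summed squares, divided by n
rmse_question <- function(obs, model) {
  sqrt(sum((model - obs)^2)) / length(obs)
}

# Conventional RMSE: square root of the *mean* squared difference
rmse_standard <- function(obs, model) {
  sqrt(mean((model - obs)^2))
}

# Toy data (assumption, not from the question)
obs   <- c(1, 2, 3, 4)
model <- c(2, 2, 5, 8)

rmse_question(obs, model)  # sqrt(21) / 4, about 1.15
rmse_standard(obs, model)  # sqrt(21 / 4), about 2.29
```

The two agree only when each group holds a single observation, which is why the choice matters most for larger groups.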
We could use tidyverse approaches
library(dplyr)
library(tidyr)
mod_eval <- dplyr::lst(rmse, mean_bias, mean_error)
df %>%
nest_by(f1) %>%
crossing(mod_eval) %>%
mutate(nm1 = names(mod_eval)) %>%
rowwise %>%
transmute(f1, nm1, out = list(mod_eval(data$obs, data$pred))) %>%
unnest(out) %>%
pivot_wider(names_from = nm1, values_from = out)
-output
f1 rmse mean_bias mean_error
<chr> <dbl> <dbl> <dbl>
1 a 1.80 0.333 3
2 b 3.35 -1.5 4.5
3 c 3.09 4 4.67
4 d 4.53 -5 5
If it is grouped by 'f1' and 'f2':
df %>%
nest_by(f1, f2) %>%
crossing(mod_eval) %>%
mutate(nm1 = names(mod_eval)) %>%
rowwise %>%
transmute(f1, f2, nm1, out = list(mod_eval(data$obs, data$pred))) %>%
unnest(out) %>%
pivot_wider(names_from = nm1, values_from = out)
-output
# A tibble: 10 x 5
f1 f2 rmse mean_bias mean_error
<chr> <chr> <dbl> <dbl> <dbl>
1 a x 4 -4 4
2 a y 2 2 2
3 a z 3 3 3
4 b x 3 3 3
5 b z 6 -6 6
6 c x 1 -1 1
7 c y 7 7 7
8 c z 6 6 6
9 d x 9 -9 9
10 d z 1 -1 1
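If the set of metrics is small and fixed, a plainer alternative is to call each function directly inside summarise(). This is only a sketch: it restates the question's three functions in compact form, and the set.seed() call is an assumption added here so the toy data is reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sampled values are reproducible
df <- tibble(f1 = c('a','a','a','b','b','c','c','c','d','d'),
             f2 = c('z','x','y','z','x','z','x','y','z','x'),
             obs = sample(1:10, 10),
             pred = sample(1:10, 10))

# Same definitions as in the question, written compactly
rmse       <- function(obs, model) sqrt(sum((model - obs)^2)) / length(obs)
mean_bias  <- function(obs, model) sum(model - obs) / length(obs)
mean_error <- function(obs, model) sum(abs(model - obs)) / length(obs)

# One row per f1/f2 combination, one column per metric
res <- df %>%
  group_by(f1, f2) %>%
  summarise(rmse       = rmse(obs, pred),
            mean_bias  = mean_bias(obs, pred),
            mean_error = mean_error(obs, pred),
            .groups = "drop")
res
```

This trades the flexibility of a function list for readability; with many metrics, the list-based approaches above scale better.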
I have a data frame like this:
X1 X2 X3 X4 X5 class
1 1 7 3 9 5 n
2 2 8 4 10 6 n
3 3 9 5 1 7 n
4 4 10 6 2 8 p
5 5 1 7 3 9 p
6 6 2 8 4 10 p
I would like to run a t-test on every column, comparing the groups defined by the variable class.
I know I can use a for loop for this, but I wonder whether there is a better way.
m1 <- data.frame(matrix(1:10, nrow = 6, ncol = 5))  # 1:10 is recycled to fill the 30 cells
m1 <- data.frame(m1,c("n","n","n","p","p","p"))
names(m1)[6] = "class"
# work for one column
t.test(X1~class, data= m1)
# What I'm looking for
# t.test(X_i~class, data= m1)
library(dplyr)
library(tidyr)
library(broom)
df <- data.frame(
x1 = 0:9,
x2 = 10:19,
x3 = 20:29,
class = rep(c("a", "b"), each = 5)
)
# Conduct tests and store in nested data.frame
nested <- df %>%
group_by(class) %>%
summarise(across(everything(), ~ list(
t.test(.) %>%
tidy() %>%
select(estimate, statistic,
p.value, conf.low, conf.high)
)))
# Unnest and turn into long structure
long <- nested %>%
unnest(cols = starts_with("x"), names_sep = "_") %>%
pivot_longer(cols = starts_with("x"),
names_to = "quantity",
values_to = "value")
# Split variables into multiple columns and widen
long %>%
separate(col = quantity, into = c("variable", "quantity"), sep = "_") %>%
pivot_wider(names_from = "quantity")
#> # A tibble: 6 x 7
#> class variable estimate statistic p.value conf.low conf.high
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a x1 2 2.83 0.0474 0.0368 3.96
#> 2 a x2 12 17.0 0.0000707 10.0 14.0
#> 3 a x3 22 31.1 0.00000636 20.0 24.0
#> 4 b x1 7 9.90 0.000584 5.04 8.96
#> 5 b x2 17 24.0 0.0000178 15.0 19.0
#> 6 b x3 27 38.2 0.00000281 25.0 29.0
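Note that the code above runs a one-sample t-test within each class. If the goal is instead the two-sample comparison from the question (each X_i against class), one possible base R sketch is to build the formula for each column with reformulate() and loop with lapply(). The object names value_cols, tests and p_values are chosen here for illustration.

```r
# Data from the question
m1 <- data.frame(matrix(1:10, nrow = 6, ncol = 5))
m1$class <- c("n", "n", "n", "p", "p", "p")

# Every column except the grouping variable
value_cols <- setdiff(names(m1), "class")

# Run t.test(X_i ~ class) for each column; reformulate() builds the formula
tests <- lapply(value_cols, function(col) {
  t.test(reformulate("class", response = col), data = m1)
})
names(tests) <- value_cols

# Collect one p-value per column
p_values <- sapply(tests, function(tt) tt$p.value)
p_values
```

Each element of tests is a full htest object, so other quantities (statistic, confidence interval) can be pulled out the same way.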
I found solutions for simple vectors, but is there a way to make all pairwise differences using dplyr or base R for all the elements in a category?
library(tidyverse)
x = 1:10
y = rep(letters[1:5],each=2)
z = rep(1:2,length.out =10)
df = data.frame(x,y, z)
df = rbind(df,c(11,"e",3))
df$verif = paste0(df$y,df$z)
df$x = as.numeric(df$x)
df %>%
group_by(y) %>%
summarise(Diff = abs(x - lag(x)))
gives:
`summarise()` regrouping output by 'y' (override with `.groups` argument)
# A tibble: 11 x 2
# Groups: y [5]
y Diff
<chr> <dbl>
1 a NA
2 a 1
3 b NA
4 b 1
5 c NA
6 c 1
7 d NA
8 d 1
9 e NA
10 e 1
11 e 1
In this example, it's only using the previous value in the data frame, therefore missing pairwise differences (look at 9, 10 and 11 for group "e" ).
Is there a way to get all the pairwise differences in each category? Keeping track of the pairwise differences would be useful as well (e.g., e1 with e2 = 1, e2 with e3 is = 1 and e1 with e3 is =2)
I tried the outer() function but wasn't able to make it work, and I had no more luck with the dist() function.
I continued to try and found this:
my.df=df %>%
group_by(y) %>%
summarise(Diff = combn(x,2,diff))
my.df
# A tibble: 7 x 2
# Groups: y [5]
y Diff
<chr> <dbl>
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 e 2
7 e 1
I just now need to get which pairwise difference was calculated...
Continued again and got this mess:
my.df=df %>%
group_by(y) %>%
summarise(Diff = combn(x,2,diff),
test = combn(verif,2,paste, simplify = FALSE)) %>%
mutate(test2 = paste0(test, collapse = "-"))
my.df
> my.df
# A tibble: 7 x 4
# Groups: y [5]
y Diff test test2
<chr> <dbl> <list> <chr>
1 a 1 <chr [2]> "c(\"a1\", \"a2\")"
2 b 1 <chr [2]> "c(\"b1\", \"b2\")"
3 c 1 <chr [2]> "c(\"c1\", \"c2\")"
4 d 1 <chr [2]> "c(\"d1\", \"d2\")"
5 e 1 <chr [2]> "c(\"e1\", \"e2\")-c(\"e1\", \"e3\")-c(\"e2\", \"e3\")"
6 e 2 <chr [2]> "c(\"e1\", \"e2\")-c(\"e1\", \"e3\")-c(\"e2\", \"e3\")"
7 e 1 <chr [2]> "c(\"e1\", \"e2\")-c(\"e1\", \"e3\")-c(\"e2\", \"e3\")"
Got it:
library(tidyverse)
x = 1:10
y = rep(letters[1:5],each=2)
z = rep(1:2,length.out =10)
df = data.frame(x,y, z)
df = rbind(df,c(11,"e",3))
df$verif = paste0(df$y,df$z)
df$x = as.numeric(df$x)
my.df=df %>%
group_by(y) %>%
summarise(Diff = combn(x,2,diff),
test = combn(verif,2,paste, simplify = FALSE)) %>%
mutate(test2 = unlist(lapply(test, function(x)paste(x,collapse="-")))) %>%
select(-test)
Here is the output
my.df
# A tibble: 7 x 3
# Groups: y [5]
y Diff test2
<chr> <dbl> <chr>
1 a 1 a1-a2
2 b 1 b1-b2
3 c 1 c1-c2
4 d 1 d1-d2
5 e 1 e1-e2
6 e 2 e1-e3
7 e 1 e2-e3
You could do:
library(tidyverse)
df %>%
group_by(y) %>%
summarise(result = combn(seq_along(x), 2, function(i)
list(test1 = diff(x[i]), #The difference
test2 = paste0(verif[i], collapse = '-')), # The pairs
simplify = FALSE),
.groups = 'drop') %>%
unnest_wider(result)
# A tibble: 7 x 3
y test1 test2
<chr> <dbl> <chr>
1 a 1 a1-a2
2 b 1 b1-b2
3 c 1 c1-c2
4 d 1 d1-d2
5 e 1 e1-e2
6 e 2 e1-e3
7 e 1 e2-e3
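For comparison, the same result can be sketched in base R by splitting on the group and taking index pairs with combn(). Newer dplyr versions (1.1.0 and later) warn when summarise() returns several rows per group and offer reframe() for that case, so a base R version avoids that concern. The helper name pairwise_diffs is made up here, and the sketch assumes every group has at least two rows.

```r
# Same data as in the question, built directly with numeric x
df <- data.frame(x = 1:11,
                 y = c(rep(letters[1:5], each = 2), "e"),
                 z = c(rep(1:2, length.out = 10), 3))
df$verif <- paste0(df$y, df$z)

# For one group: all index pairs, their labels and their differences
# (assumes the group has at least two rows, otherwise combn() fails)
pairwise_diffs <- function(d) {
  idx <- combn(nrow(d), 2)  # 2 x k matrix of row-index pairs
  data.frame(y    = d$y[1],
             pair = paste(d$verif[idx[1, ]], d$verif[idx[2, ]], sep = "-"),
             Diff = d$x[idx[2, ]] - d$x[idx[1, ]])
}

result <- do.call(rbind, lapply(split(df, df$y), pairwise_diffs))
rownames(result) <- NULL
result
```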
#Function that takes df1,group_vars as input and return df1 with seq columns as output
get_seq <- function(df1,group_vars) {
df1<-df1[ with( df1, do.call(order, mget(group_vars)) ), ]
df1<-df1 %>%
group_by(.dots=group_vars) %>%
mutate(seq=row_number())
return(df1)
}
Try using this function :
library(dplyr)
get_seq <- function(df1, group_vars) {
df1 %>%
arrange(across(all_of(group_vars))) %>%
group_by(across(all_of(group_vars))) %>%
mutate(seq=row_number())
}
You can call this function as :
df2 <- get_seq(df1, 'col1')
df2 <- get_seq(df1, c('col1', 'col2'))
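As a quick check of what the function returns, here is a small sketch on made-up data. The data frame toy and its columns col1, col2 and value are assumptions for illustration only.

```r
library(dplyr)

get_seq <- function(df1, group_vars) {
  df1 %>%
    arrange(across(all_of(group_vars))) %>%
    group_by(across(all_of(group_vars))) %>%
    mutate(seq = row_number())
}

# Hypothetical toy data
toy <- data.frame(col1  = c("b", "a", "b", "a", "a"),
                  col2  = c(1, 1, 1, 2, 1),
                  value = 1:5)

# Rows are sorted by col1, col2 and numbered within each col1/col2 group
res <- get_seq(toy, c("col1", "col2"))
res$seq
```

The groups after sorting are (a, 1) with two rows, (a, 2) with one row and (b, 1) with two rows, so the counter restarts at each change of group.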
It's really not clear what you're trying to do here. If you want to pass a variable number of column names to a function, sort the data frame according to these columns, then group_by the columns, then add a row number within each subgroup, you would do:
get_seq <- function(df1, ...)
{
group_vars <- enquos(...)
df1 %>%
arrange(!!!group_vars) %>%
group_by(!!!group_vars) %>%
mutate(seq = row_number())
}
So if we had a data frame like this:
df <- data.frame(a = rep(1:3, each = 4),
b = rep(LETTERS[4:1], each = 3),
c = rnorm(12))
We could do:
get_seq(df, a, b)
#> # A tibble: 12 x 4
#> # Groups: a, b [6]
#> a b c seq
#> <int> <fct> <dbl> <int>
#> 1 1 C 0.779 1
#> 2 1 D 0.318 1
#> 3 1 D -0.0710 2
#> 4 1 D 0.183 3
#> 5 2 B -0.351 1
#> 6 2 B 0.401 2
#> 7 2 C -1.26 1
#> 8 2 C 1.99 2
#> 9 3 A -0.0723 1
#> 10 3 A -0.602 2
#> 11 3 A 2.05 3
#> 12 3 B 2.13 1
To perform an ANOVA I am looking to merge this data in a new variable called CompensationGroup. Thereby, people who have been in "Compensationproject1" should be displayed as 1, people who have been in "Compensationproject2" should be displayed as 2...
library(tidyverse)
data %>%
mutate(Compensationproject2 = case_when(
Compensationproject2 == 1 ~ 2,
T ~ NA_real_
)) %>%
mutate(Compensationproject3 = case_when(
Compensationproject3 == 1 ~ 3,
T ~ NA_real_
)) %>%
unite("CompensationGroup",c(Compensationproject1,Compensationproject2,Compensationproject3),remove = F,na.rm = T) %>%
mutate(CompensationGroup = str_extract(CompensationGroup,'\\d'))
Here is an option that does not require pivoting. However, it does rely on several other tidyverse functions, namely mutate, case_when, unite and str_extract.
library(tidyverse)
df <- tribble(~id,~Comp1,~Comp2,~Comp3,
1,1,NA,NA,
2,NA,1,NA,
3,NA,NA,1)
df %>%
mutate(Comp2 = case_when(
Comp2 == 1 ~ 2,
T ~ NA_real_
)) %>%
mutate(Comp3 = case_when(
Comp3 == 1 ~ 3,
T ~ NA_real_
)) %>%
unite("group",c(Comp1,Comp2,Comp3),remove = F,na.rm = T) %>%
mutate(group = str_extract(group,'\\d'))
Good luck!
In the absence of a reproducible example, here is an example with toy data using the pivot_longer function from tidyr (part of the tidyverse).
library(tidyverse)
data <- tibble::tribble(
~Category, ~MeasureA, ~MeasureB, ~MeasureC, ~MeasureD,
1, 0.0930158825381708, 0.0138786762728455, 0.0659906858706141, 0.0677226540871513,
2, 0.103201113378404, 0.0149293889876177, 0.0644022070960172, 0.0605524137103402,
3, 0.12028743617311, 0.0209951412575897, 0.0598004419601402, 0.0584817396677436,
4, 0.0996307145670469, 0.016288452837476, 0.0624144782432749, 0.0538275028212587
)
data
# A tibble: 4 x 5
Category MeasureA MeasureB MeasureC MeasureD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.0930 0.0139 0.0660 0.0677
2 2 0.103 0.0149 0.0644 0.0606
3 3 0.120 0.0210 0.0598 0.0585
4 4 0.0996 0.0163 0.0624 0.0538
The following tells R to take columns 2:5 which hold the values, put the values into a column called Value and to put the label from the column names into a column called Measurement whilst removing the Measure label that is prefixed in the Measure columns.
data %>% pivot_longer(cols = 2:5, names_to = "Measurement", names_prefix = "Measure", values_to = "Value")
# A tibble: 16 x 3
Category Measurement Value
<dbl> <chr> <dbl>
1 1 A 0.0930
2 1 B 0.0139
3 1 C 0.0660
4 1 D 0.0677
5 2 A 0.103
6 2 B 0.0149
7 2 C 0.0644
8 2 D 0.0606
9 3 A 0.120
10 3 B 0.0210
11 3 C 0.0598
12 3 D 0.0585
13 4 A 0.0996
14 4 B 0.0163
15 4 C 0.0624
16 4 D 0.0538
Found an answer myself:
data[, "CompensationGroup"] <- 1
for(i in seq(2,3,1)){
data[which(is.na(data[,paste0("Compensationproject",i)]) == F), "CompensationGroup"] <- as.numeric(i)
}
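The loop can also be written as a single vectorised step. The following is a sketch with dplyr::case_when() on made-up data; it assumes, as the loop does, that a non-missing value in a Compensationproject column marks membership. Unlike the loop, it leaves CompensationGroup as NA when all three columns are missing rather than defaulting to 1.

```r
library(dplyr)

# Hypothetical toy data with the column names from the question
data <- data.frame(Compensationproject1 = c(1, NA, NA),
                   Compensationproject2 = c(NA, 1, NA),
                   Compensationproject3 = c(NA, NA, 1))

# The first condition that matches determines the group number
data <- data %>%
  mutate(CompensationGroup = case_when(
    !is.na(Compensationproject1) ~ 1,
    !is.na(Compensationproject2) ~ 2,
    !is.na(Compensationproject3) ~ 3
  ))
data$CompensationGroup
```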
I have a data set containing groups of data, and I ran a regression on each group. I used dplyr to do the regression and got a tbl_df object with all the results. Now I want to extract the vector of fitted values from each group's regression and put them in a data frame. I normally use summarise() to extract the relevant information, but it only works for scalars. Here is some sample code using lapply to extract the information, which feels rather cumbersome:
library(dplyr)
library(reshape2)
df1 = data.frame(type1 = c(rep('a',5),rep('b',5)),
x = 1:10,
y = 11:20)
df1 %>%
group_by(type1) %>%
do(model = lm(y~x,.)) -> model1
names(model1$model) = model1$type1
lapply(model1$model,function(mod) mod$fit) %>%
melt
library(broom)
model1 %>% augment(model)
# A tibble: 10 x 10
# Groups: type1 [2]
type1 y x .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
<fctr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 11 1 11 2.482534e-16 3.567051e-19 0.6 3.925229e-16 2.322633e-06 0.001759785
2 a 12 2 12 1.755417e-16 3.026750e-16 0.3 2.977199e-16 2.730293e-01 1.128776594
3 a 13 3 13 1.433292e-16 -3.857170e-16 0.2 2.471607e-16 2.263176e-01 -1.345563357
4 a 14 4 14 1.755417e-16 -1.380180e-16 0.3 3.747906e-16 5.677113e-02 -0.514715401
5 a 15 5 15 2.482534e-16 2.207032e-16 0.6 3.052655e-16 8.891591e-01 1.088827560
6 b 16 6 16 1.709167e-15 -2.416065e-15 0.6 8.008132e-17 2.248024e+00 -1.731290167
7 b 17 7 17 1.208563e-15 2.359219e-15 0.3 1.824137e-15 3.499565e-01 1.277939838
8 b 18 8 18 9.867878e-16 1.265324e-15 0.2 2.510473e-15 5.138141e-02 0.641132787
9 b 19 9 19 1.208563e-15 5.595623e-17 0.3 2.702016e-15 1.968677e-04 0.030310330
10 b 20 10 20 1.709167e-15 -1.264434e-15 0.6 2.303179e-15 6.157097e-01 -0.906060815
Use the tidyverse package, which contains dplyr, purrr, tidyr
library(tidyverse)
Use nest and map
df1 %>%
group_by(type1) %>%
nest() %>%
mutate(data = map(data, ~lm(y~x,.x)$fit)) %>% # combined lm with $fit
unnest(cols = data) # name the list column explicitly; recent tidyr warns otherwise
Output
type1 data
1 a 11
2 a 12
3 a 13
4 a 14
5 a 15
6 b 16
# etc
We can use modelr together with tidyverse. The add_predictions function is handy. Here is an example.
# Load package
library(tidyverse)
library(modelr)
# Create example data frame
df1 = data.frame(type1 = c(rep('a',5),rep('b',5)),
x = 1:10,
y = 11:20)
# Create nested data frame
df2 <- df1 %>%
group_by(type1) %>%
nest()
# A function to fit lm model to y ~ x
lm_model <- function(df) {
lm(y ~ x, data = df)
}
# Fit model
df3 <- df2 %>%
mutate(model = map(data, lm_model))
# Add prediction
df4 <- df3 %>%
mutate(Pred = map2(data, model, add_predictions))
# Unnest the data frame
df5 <- df4 %>% unnest(Pred)
df5
# A tibble: 10 x 4
type1 x y pred
<fctr> <int> <int> <dbl>
1 a 1 11 11
2 a 2 12 12
3 a 3 13 13
4 a 4 14 14
5 a 5 15 15
6 b 6 16 16
7 b 7 17 17
8 b 8 18 18
9 b 9 19 19
10 b 10 20 20
I'd like to provide a concise answer using the data.table package:
library(data.table)
# Convert directly instead of piping, so no pipe package needs to be loaded
as.data.table(df1)[,
.(x,
y,
fit = fitted(lm(y ~ x))),
by = type1]
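For completeness, the same idea can be sketched in base R without any extra packages: fit one model per group with split() and lapply(), then put the fitted values back in the original row order with unsplit(). The names fits and fit are chosen here for illustration.

```r
# Example data from the question
df1 <- data.frame(type1 = c(rep('a', 5), rep('b', 5)),
                  x = 1:10,
                  y = 11:20)

# One lm per group, keeping only the fitted values
fits <- lapply(split(df1, df1$type1), function(d) fitted(lm(y ~ x, data = d)))

# unsplit() reverses split(), so the values line up with the original rows
df1$fit <- unsplit(fits, df1$type1)
df1
```

Because y is an exact linear function of x in this toy data, the fitted values reproduce y up to floating-point error.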