I have a dataset with three columns that are grouped by two variables.
df <- tibble(paper = rep(c("A_2012", "B_2019"), each = 5),
question = rep(c(1,2,3,4,5), 2),
rate = c(4.545455, 4.010000, 4.672727, 4.100000, 3.418182, 3.060000,
4.563636, 3.760000, 4.636364, 4.000000))
> df %>% group_by(question) %>%
select(question, paper, rate) %>%
arrange(question)
# A tibble: 10 x 3
# Groups: question [5]
question paper rate
<dbl> <chr> <dbl>
1 1 A_2012 4.55
2 1 B_2019 3.06
3 2 A_2012 4.01
4 2 B_2019 4.56
5 3 A_2012 4.67
6 3 B_2019 3.76
7 4 A_2012 4.1
8 4 B_2019 4.64
9 5 A_2012 3.42
10 5 B_2019 4
I need to perform an operation on the 'rate' values within each group, but I do not know how to write the code in tidyverse style. In this example, I'd like to get the difference (paperB - paperA) for each question:
> df_result
# A tibble: 5 x 2
question diff_rate
<dbl> <dbl>
1 1 -1.49
2 2 0.55
3 3 -0.91
4 4 0.54
5 5 0.58
I've tried using pivot_wider and then some operations, but I actually have 54 different values for the variable paper, so it is not efficient.
Any help is truly appreciated.
You can do
df %>% group_by(question) %>%
select(question, paper, rate) %>%
arrange(question) %>% mutate(
diff_rate=diff(rate)
)
If you want the same format as your df_result, you can do
df %>% group_by(question) %>%
select(question, paper, rate) %>%
arrange(question) %>% mutate(
diff_rate=diff(rate)
) %>% select(question, diff_rate) %>% distinct()
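If each question has exactly one row per paper, the same result can be sketched without relying on row order, by picking the two rates by label inside summarise(). The labels "A_2012" and "B_2019" come from the example data; with 54 papers you would still have to decide which pair to compare, which the question does not specify.

```r
library(dplyr)

df <- tibble(paper = rep(c("A_2012", "B_2019"), each = 5),
             question = rep(c(1, 2, 3, 4, 5), 2),
             rate = c(4.545455, 4.010000, 4.672727, 4.100000, 3.418182,
                      3.060000, 4.563636, 3.760000, 4.636364, 4.000000))

# Assumption: exactly one row per (question, paper) pair
df_result <- df %>%
  group_by(question) %>%
  summarise(diff_rate = rate[paper == "B_2019"] - rate[paper == "A_2012"],
            .groups = "drop")

df_result
```

Selecting by label makes the sign of the difference explicit (B minus A), whereas diff() depends on how the rows happen to be ordered within each group.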
I have a list of lists where I would like to extract and combine those with the same name (in the example below I'd like to separate cof and pred). (I would prefer a tidyverse solution.)
Example data:
outputlist <- list(
list(cof=0.12),
list(pred=c(1, 2, 3)),
list(cof=0.34),
list(pred=c(4, 5, 6)),
list(cof=0.56),
list(pred=c(7, 8, 9))
)
I would like to separate these so that I have one vector/dataframe with all cofs; and then another dataframe/vector with all predictions.
I've tried this (but it does not separate them):
outputlist %>% bind_cols()
Thanks in advance
You can try :
library(dplyr)
library(purrr)
outputlist %>% bind_rows() %>% split.default(names(.)) %>% map(na.omit)
#$cof
# A tibble: 3 x 1
# cof
# <dbl>
#1 0.12
#2 0.34
#3 0.56
#$pred
# A tibble: 9 x 1
# pred
# <dbl>
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
Not a tidyverse solution, but a base R one-liner:
lapply(split(outputlist, sapply(outputlist, names)), as.data.frame)
#> $cof
#> cof cof.1 cof.2
#> 1 0.12 0.34 0.56
#>
#> $pred
#> pred pred.1 pred.2
#> 1 1 4 7
#> 2 2 5 8
#> 3 3 6 9
We can also do
library(dplyr)
library(purrr)
library(tidyr)
outputlist %>%
map_dfr(~ enframe(.x) %>%
unnest(c(value))) %>%
{split(.$value, .$name)}
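Another possible sketch, assuming plain vectors are acceptable as output: purrr's map() accepts a name as its second argument and extracts that element from every sub-list, giving NULL where the name is absent. The helper name collect_by_name is made up for this illustration.

```r
library(purrr)

outputlist <- list(
  list(cof = 0.12),
  list(pred = c(1, 2, 3)),
  list(cof = 0.34),
  list(pred = c(4, 5, 6)),
  list(cof = 0.56),
  list(pred = c(7, 8, 9))
)

# Hypothetical helper: pull the element called `nm` out of every sub-list.
# map() with a string extracts by name and yields NULL where the name is
# missing; compact() drops those NULLs before unlist() flattens the rest.
collect_by_name <- function(lst, nm) {
  lst %>% map(nm) %>% compact() %>% unlist()
}

cofs  <- collect_by_name(outputlist, "cof")
preds <- collect_by_name(outputlist, "pred")
```

This flattens all predictions into one vector; if each prediction set should stay separate, drop the unlist() step and keep the list.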
I'm still new to the group and R.
I had some really helpful feedback on my last query so hoping I can get
some more support with the following:
I am working on a horse racing database that at this stage has 4 variables:
horse number, race id, distance of race and the rating (DaH) assigned for the horse's
performance in the race.
The dataset:
horse_ratings <- tibble(
horse=c(1,1,1,2,2,2,3,3,3),
raceid=c(1,2,3,1,2,3,1,2,3),
Dist=c(9.47,9.47,10,10.1,10.2,9,11,9.47,10.5),
DaH=c(101,99,103,101,94,87,102,96,62)
)
Giving:
> horse_ratings
# A tibble: 9 x 4
horse raceid Dist DaH
<dbl> <dbl> <dbl> <dbl>
1 1 1 9.47 101
2 1 2 9.47 99
3 1 3 10 103
4 2 1 10.1 101
5 2 2 10.2 94
6 2 3 9 87
7 3 1 11 102
8 3 2 9.47 96
9 3 3 10.5 62
I will perform a number of calculations on the dataset, such as mean rating, max rating etc.,
which I'd like to result in a number of vectors of equal length.
I'm using the filter function to look at the performance ratings achieved for different
race distances (i.e. distance greater than 10 to begin with). However, if one of the horses has not
run a race at that distance, then I've noticed that the result does not include that
horse in the output, i.e.:
> horse_ratings %>%
+ group_by(horse) %>%
+ filter(Dist>10) %>%
+ summarise(mean_rating=mean(DaH))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
horse mean_rating
<dbl> <dbl>
1 2 97.5
2 3 82
So horse 1 has disappeared as it has not run a race of distance greater than 10.
I need to keep the output vector of length 3 ideally so I can put all the calculations
in to a dataframe of same length (for my final data output/print out).
I'm hoping there's a way of assigning an NA or similar to an output for horse 1
Giving:
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NA
2 2 97.5
3 3 82
Or a similar solution.
Help would be much appreciated!!
You can use the .drop = FALSE parameter in group_by():
horse_ratings %>%
group_by(horse, .drop = FALSE) %>%
filter(Dist > 10) %>%
summarise(mean_rating = mean(DaH))
horse mean_rating
<dbl> <dbl>
1 1 NaN
2 2 97.5
3 3 82
Don't filter first, do it in summarise so you don't drop groups (horse).
library(dplyr)
horse_ratings %>%
group_by(horse) %>%
summarise(mean_rating = mean(DaH[Dist>10], na.rm = TRUE))
# A tibble: 3 x 2
# horse mean_rating
# <dbl> <dbl>
#1 1 NaN
#2 2 97.5
#3 3 82
library(tidyverse)
Method 1:
horse_stats <-
horse_ratings %>%
mutate(raceid = as.factor(raceid)) %>%
filter(Dist > 10) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
left_join(horse_ratings %>%
select(horse) %>%
distinct(),
., by = "horse")
Method 2 :
horse_stats <-
horse_ratings %>%
mutate(raceid = factor(raceid),
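Dist = ifelse(Dist <= 10, 0, Dist),
# Dist is already zeroed for races of 10 or less, so zero the matching DaH too.
# Note: the zero-filled rows still count toward mean and min.
DaH = ifelse(Dist == 0, 0, DaH)) %>%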
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
mutate_if(is.numeric, list(~na_if(., 0)))
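If NA (rather than NaN) is wanted for horses with no qualifying race, one further option is to filter and summarise as in the question and then restore the missing horses with tidyr::complete(). This is only a sketch; the result name mean_over_10 is made up for the illustration.

```r
library(dplyr)
library(tidyr)

horse_ratings <- tibble(
  horse  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist   = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH    = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)

mean_over_10 <- horse_ratings %>%
  filter(Dist > 10) %>%
  group_by(horse) %>%
  summarise(mean_rating = mean(DaH), .groups = "drop") %>%
  # Re-add every horse present in the full data; absent ones get NA
  complete(horse = unique(horse_ratings$horse)) %>%
  arrange(horse)

mean_over_10
```

Because the missing horses are filled in after summarising, every summary column added in the same summarise() call comes back as NA for them, which keeps the output at one row per horse.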
I'm working with large datasets that have countless rows and am trying to automate some of my analyses. I mostly use tidyverse to reduce the need of adding more packages, but I'm open to all suggestions. Consider the following tibble:
id <- rep(1:3, each = 48) # 3 individuals
time <- rep(seq(0, 23.5, by = .5), 3)
count <- runif(48*3)
df <- tibble(id, time, count)
I'm trying to filter a 2-hour interval around the time of max count.
I can identify the time of max count using:
df %>%
group_by(id) %>%
filter(count == max(count))
# OR
df$time[which.max(df$count)] # Only for 1 id, though
I am struggling to filter a range around the time of max count. I can identify the time correctly as a vector using Base R, but I can't filter for entire rows. I have not prepared for potential negative or missing values yet.
df$time[(which.max(df$count) - 2):(which.max(df$count) + 2)]
I'm calculating a few different variables using mutate(), so I want to incorporate this filter() into a pipe. I've attempted to use between(), match(), lead(), and lag(). which.max() has been the closest I've gotten to filtering the correct time duration. The following are a dead end and my closest, correct attempt:
# Listed max(count) in a new column; maybe use for matching?
df %>%
group_by(id) %>%
mutate(peak = max(count))
# Partially selects time around max count, but not accurately.
df %>%
group_by(id) %>%
filter(time == time[(which.max(count) - 1.5):(which.max(count)+1.5)])
I've been coding for about a year now, but I think I'm missing some basic functions that I just don't know. Similar questions have been posted for SQL, but I have not found any regarding R or tidyverse. If you can help, I'd really appreciate it. Let me know if there's any clarification needed.
We could use slice after the grouping step
library(dplyr)
df %>%
group_by(id) %>%
slice({i1 <- which.max(count)
(i1 -2):(i1 + 2)})
# A tibble: 15 x 3
# Groups: id [3]
# id time count
# <int> <dbl> <dbl>
# 1 1 6.5 0.447
# 2 1 7 0.785
# 3 1 7.5 0.984
# 4 1 8 0.133
# 5 1 8.5 0.433
# 6 2 14.5 0.266
# 7 2 15 0.501
# 8 2 15.5 0.965
# 9 2 16 0.214
#10 2 16.5 0.492
#11 3 14 0.894
#12 3 14.5 0.0388
#13 3 15 0.947
#14 3 15.5 0.776
#15 3 16 0.293
Or it can be made more compact
df %>%
group_by(id) %>%
slice(which.max(count) + (-2:2))
An alternative solution using row_number()
library(dplyr)
df %>%
group_by(id) %>%
filter(abs(row_number() - which.max(count)) <= 2)
which gives
# A tibble: 15 x 3
# Groups: id [3]
id time count
<int> <dbl> <dbl>
1 1 5 0.574
2 1 5.5 0.763
3 1 6 0.985
4 1 6.5 0.701
5 1 7 0.281
6 2 21 0.0563
7 2 21.5 0.274
8 2 22 0.978
9 2 22.5 0.560
10 2 23 0.726
11 3 12 0.889
12 3 12.5 0.767
13 3 13 0.999
14 3 13.5 0.157
15 3 14 0.896
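The answers above select by row position, which equals a 2-hour window only because the readings are evenly spaced at 0.5 h. A sketch that filters on the time column itself, assuming the window means one hour either side of the peak, could look like this (set.seed() and the name peak_window are additions for the example):

```r
library(dplyr)

set.seed(1)  # only so the random counts are reproducible
id    <- rep(1:3, each = 48)            # 3 individuals
time  <- rep(seq(0, 23.5, by = .5), 3)
count <- runif(48 * 3)
df    <- tibble(id, time, count)

peak_window <- df %>%
  group_by(id) %>%
  # keep rows whose time lies within 1 hour of the time of the group's max count
  filter(abs(time - time[which.max(count)]) <= 1) %>%
  ungroup()

peak_window
```

A peak near the start or end of the day simply yields fewer rows, so no negative indices can occur; the window does not wrap around midnight.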
I tried to calculate the cumsum with a depreciation rate.
I have a grouped dataframe with a column number.
I want to add the number one by one with depreciation.
If the rate is 1, then the cumsum function in base r is good enough.
But if not, say with a rate of 0.5 (meaning the running total is multiplied by 0.5 before the next number is added), cumsum is not enough.
I tried to write my own function to work with dplyr, but it fails.
library(tidyverse)
# dataframe
id=sample(1:5,25,replace=TRUE)
num=rnorm(25)
df=data.frame(id,num)
# my custom function
depre=function(data){
rate=0.5
r=nrow(data)
sl=data$num
nl=data$num
for (i in 2:r){
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
# work with one group
df %>% filter(id==1) %>% depre(.)
# failed to work with dplyr
df %>% group_by(id) %>% mutate(sl=depre(.))
I expect the first element of column sl to be the same as in column num.
Each following element should be the previous sl value multiplied by 0.5, plus the next num.
It works for one group, but fails on a multi-group data frame.
The error message is: "Error: Column sl must be length 6 (the group size) or one, not 25".
I have no idea why. Does anyone have a clue?
Thanks
Your function would work if you pass a vector to it instead of a data frame
depre <- function(num){
rate = 0.5
r= length(num)
sl = num
nl = num
for (i in seq_len(r)[-1]){ # 2:r would run backwards (2, 1) for a group with a single row
sl[i]=sl[i-1]*rate+nl[i]
}
return(sl)
}
and then apply it by group.
library(dplyr)
df %>% group_by(id) %>% mutate(sl = depre(num))
We can split by 'id' and use the OP's function without any changes
library(dplyr)
library(purrr)
df %>%
group_split(id) %>%
map_df(~ tibble(id = .$id, sl = depre(.)))
# id sl
# <int> <dbl>
# 1 1 1.07
# 2 1 -0.776
# 3 1 -0.518
# 4 1 0.628
# 5 1 0.601
# 6 1 1.10
# 7 2 -0.734
# 8 2 -0.583
# 9 2 -0.437
#10 2 -3.45
# … with 15 more rows
or an option would be accumulate from purrr which would be more compact
out <- df %>%
group_by(id) %>%
mutate(sl = accumulate(num, ~ .y + .x * 0.5))
out
# A tibble: 25 x 3
# Groups: id [5]
# id num sl
# <int> <dbl> <dbl>
# 1 3 -0.784 -0.784
# 2 2 -0.734 -0.734
# 3 2 -0.216 -0.583
# 4 3 -0.335 -0.727
# 5 5 -1.09 -1.09
# 6 4 -0.0854 -0.0854
# 7 1 1.07 1.07
# 8 2 -0.145 -0.437
# 9 3 -1.17 -1.53
#10 5 -0.819 -1.36
# … with 15 more rows
out %>%
filter(id == 1)
# A tibble: 6 x 3
# Groups: id [1]
# id num sl
# <int> <dbl> <dbl>
#1 1 1.07 1.07
#2 1 -1.31 -0.776
#3 1 -0.129 -0.518
#4 1 0.887 0.628
#5 1 0.287 0.601
#6 1 0.800 1.10
The issue with the OP's function is that inside mutate() the dot refers to the whole data frame that was piped in, not the current group, so nrow(data) is the total number of rows (25) and the function returns 25 values for every group. Within a grouped verb, the dplyr way to get the group size is n(). With group_split, the input data.frame is split into one data.frame per id, so nrow() on each piece is the group size and the function works as written.
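For reference, the recurrence sl[i] = sl[i-1] * rate + num[i] can also be sketched in base R with Reduce(), which needs no special case for a group with a single row. The function name depre_cumsum and the rate argument are made up for this illustration.

```r
# Hypothetical helper: cumulative sum where the running total is
# multiplied by `rate` before each new value is added.
depre_cumsum <- function(num, rate = 0.5) {
  Reduce(function(acc, x) acc * rate + x, num, accumulate = TRUE)
}

depre_cumsum(c(1, 2, 3))
# 1.00 2.50 4.25   (1;  1 * 0.5 + 2;  2.5 * 0.5 + 3)

depre_cumsum(c(1, 2, 3), rate = 1)
# 1 3 6   (same as cumsum when rate is 1)
```

Since it takes a vector, it can be dropped into the grouped pipeline as mutate(sl = depre_cumsum(num)).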
I have the following 5 columns in a data frame in R:
ID A B C Revenue
1 0 1 0 2.33
1 1 1 0 3.1
2 1 0 1 4
2 0 0 1 5.22
2 1 1 0 6.45
3 0 0 0 3
3 0 0 0 2
4 1 1 1 7.22
4 0 0 0 1.22
4 1 1 0 4.55
4 0 1 1 1
A, B, and C are categorical values.
I want to create 3 data frames with 3 columns with columns names: ID, 0, 1. In column 0 I want avg. of Revenue for A = 0 rows and in column 1 average of Revenue for A = 1 for each distinct ID. Likewise for B and C in two other data frames.
I am unable to figure out how to do it with dplyr or any package for that matter.
Thanks in advance.
You can also write a custom function that does what you want using tidy evaluation.
The syntax takes a bit of getting used to, but it's very useful once you get the hang of it.
require(tidyverse)
df <- tibble(ID = c(1,1,2,2,2,3,3,4,4,4,4),
A = c(0,1,1,0,1,0,0,1,0,1,0),
B = c(1,1,0,0,1,0,0,1,0,1,1), C = c(0,0,1,1,0,0,0,1,0,0,1),
Revenue = c(2.33,3.1,4,5.22,6.45,3,2,7.22,1.22,4.55,1))
create_df_mean <- function(df, mean_var, pos_spread, ...){
group_var <- enquos(...) # get the grouping columns
spread_var <- group_var[[pos_spread]] # get the column used as key to spread df
mean_var <- enquo(mean_var) # get the column used to calculate mean
df <- df %>%
group_by(!!!group_var) %>%
summarise(mean = mean(!!mean_var)) %>%
spread(!!spread_var, mean)
return(df)
}
# arguments are:
# 1. data frame
# 2. column for calc. mean
# 3. the position of the spread key in grouping columns
# 4. grouping columns
create_df_mean(df, Revenue, 2, ID, A)
You can customise this function even further following these tutorials: 1 and 2.
One way using dplyr and tidyr could be to gather data to long format, get mean value for each ID, value and key and spread it to wide format.
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -ID, -Revenue) %>%
group_by(ID, value, key) %>%
summarise(mean_rev = mean(Revenue)) %>%
spread(value, mean_rev, fill = 0)
# ID key `0` `1`
# <dbl> <chr> <dbl> <dbl>
# 1 1 A 2.33 3.1
# 2 1 B 0 2.72
# 3 1 C 2.72 0
# 4 2 A 5.22 5.22
# 5 2 B 4.61 6.45
# 6 2 C 6.45 4.61
# 7 3 A 2.5 0
# 8 3 B 2.5 0
# 9 3 C 2.5 0
#10 4 A 1.11 5.88
#11 4 B 1.22 4.26
#12 4 C 2.88 4.11
If you need them in separate dataframes with only three columns we can use group_split
df %>%
gather(key, value, -ID, -Revenue) %>%
group_by(ID, value, key) %>%
summarise(mean_rev = mean(Revenue)) %>%
spread(value, mean_rev, fill = 0) %>%
ungroup() %>%
group_split(key, keep = FALSE)
#[[1]]
# A tibble: 4 x 3
# ID `0` `1`
# <dbl> <dbl> <dbl>
#1 1 2.33 3.1
#2 2 5.22 5.22
#3 3 2.5 0
#4 4 1.11 5.88
#[[2]]
# A tibble: 4 x 3
# ID `0` `1`
# <dbl> <dbl> <dbl>
#1 1 0 2.72
#2 2 4.61 6.45
#3 3 2.5 0
#4 4 1.22 4.26
#[[3]]
# A tibble: 4 x 3
# ID `0` `1`
# <dbl> <dbl> <dbl>
#1 1 2.72 0
#2 2 6.45 4.61
#3 3 2.5 0
#4 4 2.88 4.11
To get the output into separate dataframe, we can do
df1 <- df %>%
dplyr::select(ID, A, B, C, Revenue) %>%
gather(key, value, -ID, -Revenue) %>%
group_by(ID, value, key) %>%
summarise(mean_rev = mean(Revenue)) %>%
spread(value, mean_rev, fill = 0) %>%
ungroup() %>%
group_split(key, keep = FALSE)
names(df1) <- LETTERS[seq_along(df1)]
list2env(df1, .GlobalEnv)
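gather() and spread() are superseded in current tidyr, so the same reshaping can be sketched with pivot_longer() and pivot_wider(). The names means_by_flag and per_flag are made up for this example, and filling missing combinations with 0 simply mirrors fill = 0 above (NA may be the more honest choice when an ID has no rows for a level).

```r
library(dplyr)
library(tidyr)

df <- tibble(ID = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4),
             A = c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0),
             B = c(1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1),
             C = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1),
             Revenue = c(2.33, 3.1, 4, 5.22, 6.45, 3, 2, 7.22, 1.22, 4.55, 1))

means_by_flag <- df %>%
  pivot_longer(c(A, B, C), names_to = "key", values_to = "value") %>%
  group_by(ID, key, value) %>%
  summarise(mean_rev = mean(Revenue), .groups = "drop") %>%
  # one column per level (0 / 1); absent combinations are filled with 0
  pivot_wider(names_from = value, values_from = mean_rev, values_fill = 0)

# One data frame per original column, as a named list: per_flag$A, per_flag$B, per_flag$C
per_flag <- split(select(means_by_flag, -key), means_by_flag$key)

per_flag$A
```

Keeping the three results in a named list avoids writing objects into the global environment, though list2env() can still be applied to per_flag if separate objects are needed.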