I do have time series with months in rows instead of columns. It's quite a large dataset and I am looking for a way to get the mean for every 12 rows, in this case for temperature so that a smaller dataset will emerge.
This can be done with group_by and summarize from dplyr. First you have to create "groups", variable that will be used for grouping data.
library(dplyr)
dta <- data.frame(temp = rnorm(60, 0, 1))
dta$group <- sort(rep(1:12, 60/12))
dta %>% group_by(group) %>% summarize(mean_temp = mean(temp))
Result
# A tibble: 12 x 2
group mean_temp
<int> <dbl>
1 1 -0.582
2 2 0.490
3 3 -0.197
Related
I'm working with a dataset that has two level grouping. Here I give an example:
set.seed(123)
example=data.frame(
id = c(rep(1,20),rep(2,20)), # grouping
Grp = rep(c(rep('A',10),rep('B',10)),2), # grouping
target = c(rep(1:10,2), rep(20:11,2)),
var1 = sample(1:100,40,replace=TRUE),
var2 =sample(1:100,40,replace=TRUE)
)
In this case, the data is grouped by id and by Grp. I want to calculate the correlation of target with var1 and var2. However, I don't know which is the most efficient way to apply this using a tidy approach and based on the groups.
I tried with dplyr approach. Like using:
example %>% group_by(id,Grp) %>%
summarise(cor(target,c(var1,var2))) # length error
Or even creating a custom function and applying it. But this last only summarise all the data without grouping:
corr_analisis_e = function(df){
return( cor(df[,'target'] , df[,c('var1','var2')]) )
}
example %>% group_by(id,Grp) %>% corr_analisis_e() # get all the data at once
As output, I would expect to get something like a matrix or dataframe of 4 rows and 2 columns, where each row is a group (id and Grp) and columns are var1 and var2. Every value the result from the cor() method.
example %>%
group_by(id, Grp) %>%
summarise(across(c(var1, var2), ~ cor(.x, target)))
# A tibble: 4 x 4
# Groups: id [2]
id Grp var1 var2
<dbl> <chr> <dbl> <dbl>
1 1 A -0.400 0.532
2 1 B -0.133 -0.187
3 2 A -0.655 -0.103
4 2 B -0.580 0.445
The grouping columns can then be removed with %>% ungroup %>% select(var1:var2).
I analyse a data set from an experiment and would like to calculate effect sizes for each variable. My dataframe consist of multiple variables (= columns) for 8 treatments t (= rows), with t1 - t4 being the control for t5 - t8, respectively (t1 control for t5, t2 for t6, ... ). The original data set is way larger, so I would like to solve the following two tasks::
I would like to calculate the log(treatment/control) for each t5 - t8 for one variable, e.g. effect size for t5 = log(t5/t1), effect size for t6 = log(t6/t2), ... . The name of the resulting column should be variablename_effect and the new column would only have 4 rows instead of 8.
The most tricky part is, that I need to implement the combination of specific rows into my code, so that the correct control is used for each treatment.
I would like to calculate the effect sizes for all my variables within one code, so create multiple new columns with the correct names (variablename_effect).
I would prefer to solve the problem in dplyr or base R to keep it simple.
So far, the only related question I found was /r-dplyr-mutate-refer-new-column-itself (shows combination of multiple if else()). I would be very thankful for either a solution, links to similar questions or which packages I should use in cast it's not possible within dplyr / base R!
Sample data:
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
Edit: this is the df_effect I would expect to receive as an output, thanks #Martin_Gal for the hint!
df_effect <- data.frame("treatment" = c(5:8), "Var1_effect" = c(log(13/9), log(14/10), log(15/11), log(16/12)), "Var2_effect" = c(log(21/17), log(22/18), log(23/19), log(24/20)))
My ideas so far:
For calculating the effect size:
mutate() and for function:
# 1st option:
for (i in 5:8) {
dt_effect <- df %>%
mutate(Var1_effect = log(df[i, "Var1"]/df[i - 4, "Var1"]))
}
#2nd option:
for (i in 5:8){
dt_effect <- df %>%
mutate(Var1_effect = log(df[treatment == i , "Var1"]/df[treatment == i - 4 , "Var1"]))
}
problem: both return the result for i = 8 for every row!
mutate() and ifelse():
df_effect <- df %>%
mutate(Var1_effect = ifelse(treatment >= 5, log(df[, "Var1"]/df[ , "Var1"]), NA))
seems to work, but so far I couldn't implement which row to pick for the control, so it returns NA for t1 - t4 (correct) and 0 for t5 - t8 (mathematically correct as I calculate log(t5/t5), ... but not what I want).
maybe I should use summarise() instead of mutate() because I create fewer rows than in my original dataframe?
Make this work for every variable at the same time
My only idea would be to index the columns within a second for function and use the paste() to create the new column names, but I don't know exactly how to do this ...
I don't know if this will solve your problem, but I want to make a suggestion similar to Limey:
library(dplyr)
library(tidyr)
df %>%
mutate(control = 1 - (treatment-1) %/% (nrow(.)/2),
group = ifelse(treatment %% (nrow(.)/2) == 0, nrow(.)/2, treatment %% (nrow(.)/2))) %>%
select(-treatment) %>%
pivot_wider(names_from = c(control), values_from=c(Var1, Var2)) %>%
group_by(group) %>%
mutate(Var1_effect = log(Var1_0/Var1_1))
This yields
# A tibble: 4 x 6
# Groups: group [4]
group Var1_1 Var1_0 Var2_1 Var2_0 Var1_effect
<dbl> <int> <int> <int> <int> <dbl>
1 1 9 13 17 21 0.368
2 2 10 14 18 22 0.336
3 3 11 15 19 23 0.310
4 4 12 16 20 24 0.288
What happend here?
I expected the first half of your data.frame to be the control variables for the second half. So I created an indicator variable and a grouping variable based on the treatment id's/numbers.
Now the treatment id isn't used anymore, so I dropped it.
Next I used pivot_wider to create a dataset with Var1_1 (i.e. Var1 for your control variable) and Var1_0 (i.e. Var1 for your "ordinary" variable).
Finally I calculated Var1_effect per group.
In response to OP's comment to #MartinGal 's solution (which is perfectly fione in its own right):
First convert the input data to a more convenient form:
# Original input dataset
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
# Revised input dataset
revisedDF <- df %>%
select(-treatment) %>%
add_column(
Treatment=rep(c("Control", "Test"), each=4),
Experiment=rep(1:4, times=2)
) %>%
pivot_longer(
names_to="Variable",
values_to="Value",
cols=c(Var1, Var2)
) %>%
arrange(Experiment, Variable, Treatment)
revisedDF %>% head(6)
Giving
# A tibble: 6 x 4
Treatment Experiment Variable Value
<chr> <int> <chr> <int>
1 Control 1 Var1 9
2 Test 1 Var1 13
3 Control 1 Var2 17
4 Test 1 Var2 21
5 Control 2 Var1 10
6 Test 2 Var1 14
I like this format because it makes the analysis code completely independent of the number of variables, the number of experiements and the number of Treatments.
The analysis is straightforward, too:
result <- revisedDF %>% pivot_wider(
names_from=Treatment,
values_from=Value
) %>%
mutate(Effect=log(Test/Control))
result
Giving
Experiment Variable Control Test Effect
<int> <chr> <int> <int> <dbl>
1 1 Var1 9 13 0.368
2 1 Var2 17 21 0.211
3 2 Var1 10 14 0.336
4 2 Var2 18 22 0.201
5 3 Var1 11 15 0.310
6 3 Var2 19 23 0.191
7 4 Var1 12 16 0.288
8 4 Var2 20 24 0.182
pivot_wider and pivot_longer are relatively new dplyr verbs. If you're unable to use the most recent version of the package, spread and gather do the same job with slightly different argument names.
I'm building a dataset and am looking to be able to add a week count to a dataset starting from the first date, ending on the last. I'm using it to summarize a much larger dataset, which I'd like summarized by week eventually.
Using this sample:
library(dplyr)
df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
Week = nrow/7)
# A tibble: 93 x 2
Date Week
<date> <dbl>
1 1944-06-01 0.143
2 1944-06-02 0.286
3 1944-06-03 0.429
4 1944-06-04 0.571
5 1944-06-05 0.714
6 1944-06-06 0.857
7 1944-06-07 1
8 1944-06-08 1.14
9 1944-06-09 1.29
10 1944-06-10 1.43
# … with 83 more rows
Which definitely isn't right. Also, my real dataset isn't structured sequentially, there are many days missing between weeks so a straight sequential count won't work.
An ideal end result is an additional "week" column, based upon the actual dates (rather than hard-coded with a seq_along() type of result)
Similar solution to Ronak's but with lubridate:
library(lubridate)
(df <- tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"),
week = interval(min(Date), Date) %>%
as.duration() %>%
as.numeric("weeks") %>%
floor() + 1))
You could subtract all the Date values with the first Date and calculate the difference using difftime in "weeks", floor all the values and add 1 to start the counter from 1.
df$week <- floor(as.numeric(difftime(df$Date, df$Date[1], units = "weeks"))) + 1
df
# A tibble: 93 x 2
# Date week
# <date> <dbl>
# 1 1944-06-01 1
# 2 1944-06-02 1
# 3 1944-06-03 1
# 4 1944-06-04 1
# 5 1944-06-05 1
# 6 1944-06-06 1
# 7 1944-06-07 1
# 8 1944-06-08 2
# 9 1944-06-09 2
#10 1944-06-10 2
# … with 83 more rows
To use this in your dplyr pipe you could do
library(dplyr)
df %>%
mutate(week = floor(as.numeric(difftime(Date, first(Date), units = "weeks"))) + 1)
data
df <- tibble::tibble(Date = seq(as.Date("1944/06/1"), as.Date("1944/09/1"), "days"))
This question is based on the following post with additional requirements (Iterate through columns in dplyr?).
The original code is as follows:
df <- data.frame(col1 = rep(1, 15),
col2 = rep(2, 15),
col3 = rep(3, 15),
group = c(rep("A", 5), rep("B", 5), rep("C", 5)))
for(col in c("col1", "col2", "col3")){
filt.df <- df %>%
filter(group == "A") %>%
select_(.dots = c('group', col))
# do other things, like ggplotting
print(filt.df)
}
My objective is to output a frequency table for each unique COL by GROUP combination. The current example specifies a dplyr filter based on a GROUP value A, B, or C. In my case, I want to iterate (loop) through a list of values in GROUP (list <- c("A", "B", "C") and generate a frequency table for each combination.
The frequency table is based on counts. For Col1 the result would look something like the table below. The example data set is simplified. My real dataset is more complex with multiple 'values' per 'group'. I need to iterate through Col1-Col3 by group.
group value n prop
A 1 5 .1
B 2 5 .1
C 3 5 .1
A better example of the frequency table is here: How to use dplyr to generate a frequency table
I struggled with this for a couple days, and I could have done better with my example. Thanks for the posts. Here is what I ended up doing to solve this. The result is a series of frequency tables for each column and each unique value found in group. I had 3 columns (col1, col2, col3) and 3 unique values in group (A,B,C), 3x3. The result is 9 frequency tables and a frequency table for each group value that is non-sensical. I am sure there is a better way to do this. The output generates some labeling, which is useful.
# Build unique group list
group <- unique(df$group)
# Generate frequency tables via a loop
iterate_by_group <- function(x)
for (i in 1:length(group)){
filt.df <- df[df$group==group[i],]
print(lapply(filt.df, freq))
}
# Run
iterate_by_group(df)
We could gather into long format and then get the frequency (n()) by group
library(tidyverse)
gather(df, value, val, col1:col3) %>%
group_by(group, value = parse_number(value)) %>%
summarise(n = n(), prop = n/nrow(.))
# A tibble: 9 x 4
# Groups: group [?]
# group value n prop
# <fct> <dbl> <int> <dbl>
#1 A 1 5 0.111
#2 A 2 5 0.111
#3 A 3 5 0.111
#4 B 1 5 0.111
#5 B 2 5 0.111
#6 B 3 5 0.111
#7 C 1 5 0.111
#8 C 2 5 0.111
#9 C 3 5 0.111
Is this what you want?
df %>%
group_by(group) %>%
summarise_all(funs(freq = sum))
I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lenghts don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seem to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 3
Then the result would be:
Month Average
1 8/3
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want. It's regular R. The sapply function keeps the months separated by "name". The sum function applied to each dataframe will not keep the column sums separate. (Correction # 2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581