I want to summarise several columns of a data.frame. I do the grouping and summarising with dplyr, as in the example below.
df = data.frame (time = rep(c("day", "night"), 10) ,
who =rep(c("Paul", "Simon"), each=10) ,
var1 = runif(20, 5, 15), var2 = runif(20, 10, 12), var3 = runif(20, 2, 7), var4 = runif(20, 1, 3))
Writing the function I need
quantil_x = function (var, num) {
quantile(var, num, na.rm=T)
}
Using it on var1 and exporting
percentiles = df %>% group_by(time, who) %>% summarise(
P0 = quantil_x (var1, 0),
P25 = quantil_x (var1, .25),
P75 = quantil_x (var1, .75)
)
write.table(percentiles, file = "summary_var1.csv",row.names=FALSE, dec=",",sep=";")
What I want is to repeat this same task for 'var2', 'var3' and 'var4'. I tried to run a loop to do this, with no success: I couldn't find a way to refer to a different variable on each pass. Within the loop I tried summarise_(), get() inside the function quantil_x() or inside summarise(), and also as.name(), but none of these worked.
I'm pretty sure this is a bad coding skill issue, but that's all I came up with so far. Here is an example of what I tried to do:
list = c("var1", "var2", "var3", "var4")
for (i in list){
percentiles = df %>% group_by(time, who) %>% summarise(
P0 = quantil_x (get(i), 0),
P25 = quantil_x (get(i), .25),
P75 = quantil_x (get(i), .75)
)
write.table(percentiles, file = paste0("summary_",i,".csv"),row.names=FALSE, dec=",",sep=";")
}
I read this post, but it didn't help much in my case.
Thanks in advance.
You can do this with summarise_each()
df %>%
group_by(time, who) %>%
summarise_each(funs (`0` = quantile(., 0, na.rm=T),
`25`= quantile(., .25, na.rm = T),
`75`= quantile(., .75, na.rm = T)))
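In current dplyr releases summarise_each() and funs() are deprecated, so the same idea is usually written with across(). The following is only a sketch on a cut-down version of the question's data (two variables instead of four, and a fixed seed, both my assumptions); the output column names such as var1_P25 come from across()'s default naming.

```r
library(dplyr)

# Cut-down copy of the question's data (assumption: two variables are enough to illustrate)
set.seed(1)
df <- data.frame(time = rep(c("day", "night"), 10),
                 who  = rep(c("Paul", "Simon"), each = 10),
                 var1 = runif(20, 5, 15),
                 var2 = runif(20, 10, 12))

# One call computes all three percentiles for every column starting with "var"
percentiles <- df %>%
  group_by(time, who) %>%
  summarise(across(starts_with("var"),
                   list(P0  = ~ quantile(.x, 0,   na.rm = TRUE, names = FALSE),
                        P25 = ~ quantile(.x, .25, na.rm = TRUE, names = FALSE),
                        P75 = ~ quantile(.x, .75, na.rm = TRUE, names = FALSE))),
            .groups = "drop")

names(percentiles)
```

The result has one row per time/who combination and one column per variable and percentile.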
You can do this with gather()
percentiles = df %>%
gather(Var,Value,var1,var2,var3,var4) %>%
group_by(Var,time, who) %>%
summarise(
P0 = quantil_x (Value, 0),
P25 = quantil_x (Value, .25),
P75 = quantil_x (Value, .75)
)
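If one file per variable is still wanted, as in the question, the long summary can be split afterwards. This is a rough sketch, not the answerer's code: it uses pivot_longer() (the newer replacement for gather()), a cut-down data set, and tempdir() as a stand-in output folder, all of which are my assumptions.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the question's data (assumption)
set.seed(1)
df <- data.frame(time = rep(c("day", "night"), 10),
                 who  = rep(c("Paul", "Simon"), each = 10),
                 var1 = runif(20, 5, 15),
                 var2 = runif(20, 10, 12))

# Long format: one row per variable, time and who
long <- df %>%
  pivot_longer(starts_with("var"), names_to = "Var", values_to = "Value") %>%
  group_by(Var, time, who) %>%
  summarise(P0  = quantile(Value, 0,   na.rm = TRUE, names = FALSE),
            P25 = quantile(Value, .25, na.rm = TRUE, names = FALSE),
            P75 = quantile(Value, .75, na.rm = TRUE, names = FALSE),
            .groups = "drop")

# Write one file per variable; out_dir is a placeholder location
out_dir <- tempdir()
for (v in unique(long$Var)) {
  write.table(filter(long, Var == v),
              file = file.path(out_dir, paste0("summary_", v, ".csv")),
              row.names = FALSE, dec = ",", sep = ";")
}
```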
I need to sum the values for about 40 variables by the same group.
This is an example dataset. So I wanted to sum the values of score1-score5 by region and department.
region <- rep(c("south", "east", "west", "north"),times=10)
department <- rep(c("A", "B","C","D","E"),times=8)
score1 <- rnorm(n = 40, mean = 0, sd = 1)
score2 <-rnorm(n = 40, mean = 3, sd = 1.5)
score3 <-rnorm(n = 40, mean = 2, sd = 1)
score4 <-rnorm(n = 40, mean = 1, sd = 1.5)
score5 <-rnorm(n = 40, mean = 5, sd = 1.5)
df <- data.frame(region, department, score1, score2, score3, score4, score5)
This is the code that gives the results I want, but is there an easier way to do this?
df %>% group_by(region, department) %>%
summarise(score1=sum(score1),
score2=sum(score2),
score3=sum(score3),
score4=sum(score4),
score5=sum(score5))
I tried to use a loop but this didn't work:
vlist<-c("score1", "score2", "score3", "score4", "score5")
for (var in vlist) {
df<-df %>% group_by(region, department) %>%
summarise(var=sum(.[[var]]))
}
Is there another way to do this, or what is wrong with my loop?
Thanks!
Use across() to loop over the columns whose names start with 'score' and take the sum of each
library(dplyr)
out1 <- df %>%
group_by(region, department) %>%
summarise(across(starts_with('score'), sum), .groups = 'drop')
In the for loop, the issue is that df is getting updated (df <-..) in each iteration and summarise returns only the columns provided in the group by and the summarised output. Thus, after the first iteration, 'df' wouldn't have the 'score' columns at all. If we want to use a for loop, get the output in a list and then reduce with a join
library(purrr)
out_list <- vector('list', length(vlist))
names(out_list) <- vlist
for (var in vlist) {
out_list[[var]] <- df %>%
group_by(region, department) %>%
summarise(!!var := sum(cur_data()[[var]]), .groups = 'drop')
}
out2 <- reduce(out_list, full_join, by = c('region', 'department'))
Checking the outputs:
> identical(out1, out2)
[1] TRUE
I have a code in R where I work with multiple dataframes.
Example of a dataframe format :
ClientID Group CountC
X1 A 3
R3 B 2
D4 A 1
T5 A 7
H0 B 5
The other dataframes have the same 2 columns, but CountC differs.
For each of the dataframes, I run the same code, which calculates quantiles by group and then pivots the result into long format:
quantileByGroup <-
df %>%
group_by(Group) %>%
summarize(Q25 = quantile(CountC, probs = .25),
Q50 = quantile(CountC, probs = .5),
Q75 = quantile(CountC, probs = .75),
Q100 = quantile(CountC, probs = 1))
quantileByGroupFinal <- pivot_longer(quantileByGroup,
cols = c(2,3,4,5),
names_to = "name",
values_to = "value")
To avoid repeating the same code every time, I want to put this code in a function.
However, when I try, this part in particular is hard to generalise:
quantileByGroup <-
df %>%
group_by(Group) %>%
summarize(Q25 = quantile(CountC, probs = .25),
Q50 = quantile(CountC, probs = .5),
Q75 = quantile(CountC, probs = .75),
Q100 = quantile(CountC, probs = 1))
I could not find a way to pass the column names Group and CountC as parameters to the function.
Is there any way to do this?
Thank you
f <- function(.data, .group, .summarize)
{
.data %>%
dplyr::group_by({{.group}}) %>%
dplyr::summarise( "{{.summarize}}_Q25" := quantile({{.summarize}}, probs = .25),
"{{.summarize}}_Q50" := quantile({{.summarize}}, probs = .5),
"{{.summarize}}_Q75" := quantile({{.summarize}}, probs = .75),
"{{.summarize}}_Q100" := quantile({{.summarize}}, probs = 1)) %>%
dplyr::ungroup() %>%
tidyr::pivot_longer(-{{.group}}) %>%
return()
}
and call it like this:
df %>%
f(.group = Group, .summarize = CountC)
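Since the question involves several dataframes with the same columns, the function can be applied to all of them at once by keeping them in a named list. A minimal sketch, assuming a made-up second dataframe (the CountC values in `second` are invented for illustration) and a slightly simplified version of f() with fixed output names:

```r
library(dplyr)
library(tidyr)

# Simplified variant of the function above: fixed Q25..Q100 output names
f <- function(.data, .group, .summarize) {
  .data %>%
    group_by({{ .group }}) %>%
    summarise(Q25  = quantile({{ .summarize }}, probs = .25, names = FALSE),
              Q50  = quantile({{ .summarize }}, probs = .5,  names = FALSE),
              Q75  = quantile({{ .summarize }}, probs = .75, names = FALSE),
              Q100 = quantile({{ .summarize }}, probs = 1,   names = FALSE),
              .groups = "drop") %>%
    pivot_longer(-{{ .group }})
}

# `first` is the question's example; `second` is hypothetical
dfs <- list(
  first  = data.frame(ClientID = c("X1", "R3", "D4", "T5", "H0"),
                      Group    = c("A", "B", "A", "A", "B"),
                      CountC   = c(3, 2, 1, 7, 5)),
  second = data.frame(ClientID = c("X1", "R3", "D4", "T5", "H0"),
                      Group    = c("A", "B", "A", "A", "B"),
                      CountC   = c(1, 4, 2, 2, 6))
)

# Apply the same summary to every dataframe in the list
results <- lapply(dfs, function(d) f(d, .group = Group, .summarize = CountC))
results$first
```

Each element of `results` is a long table with the columns Group, name and value.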
I am trying to compute the upper and lower quartile of the two variables in my data.frame across the time period of my interest. The code below gives me only a single value for the upper and lower quartile.
library(tidyverse)
library(lubridate)
set.seed(50)
FakeData <- data.frame(seq(as.Date("2001-01-01"), to= as.Date("2003-12-31"), by="day"),
A = runif(1095, 0,10),
D = runif(1095,5,15))
colnames(FakeData) <- c("Date", "A","D")
statistics <- FakeData %>%
gather(-Date, key = "Variable", value = "Value") %>%
mutate(Year = year(Date), Month = month(Date)) %>%
filter(between(Month,3,5)) %>%
mutate(NewDate = ymd(paste("2020", Month,day(Date), sep = "-"))) %>%
group_by(Variable, NewDate) %>%
summarise(Upper = quantile(Value,0.75, na.rm = T),
Lower = quantile(Value, 0.25, na.rm = T))
I would like an output like the one below (Final_Output is what I am interested in)
Output1 <- data.frame(seq(as.Date("2000-03-01"), to= as.Date("2000-05-31"), by="day"),
Upper = runif(92, 0,10), lower = runif(92,5,15), Variable = rep("A",92))
colnames(Output1)[1] <- "Date"
Output2 <- data.frame(seq(as.Date("2000-03-01"), to= as.Date("2000-05-31"), by="day"),
Upper = runif(92, 2,10), lower = runif(92,5,15), Variable = rep("D",92))
colnames(Output2)[1] <- "Date"
Final_Output<- bind_rows(Output1,Output2)
I can propose a data.table solution. In fact there are several ways to do that.
The final steps (apply quartile by group on the Value variable) could be translated into (if you want, as in your example, two columns):
statistics[,.('p25' = quantile(get('Value'), probs = 0.25), 'p75' = quantile(get('Value'), probs = 0.75)),
by = c("Variable", "NewDate")]
If you prefer long-formatted output:
library(data.table)
setDT(statistics)
statistics[, .(quantile = c("p25", "p75"), value = quantile(Value, probs = c(.25, .75))),
by = c("Variable", "NewDate")]
All steps together
If you choose data.table, it's probably better to do all the steps with data.table verbs. I will assume your data have a structure similar to the dataframe you generated and arranged, i.e.
statistics <- FakeData %>%
gather(-Date, key = "Variable", value = "Value")
In that case, mutate and filter steps would become
library(data.table)
setDT(statistics)  # convert the data.frame to a data.table by reference
statistics[,`:=`(Year = year(Date), Month = month(Date))]
statistics <- statistics[Month %between% c(3,5)]
statistics[, NewDate := ymd(paste("2020", Month,day(Date), sep = "-"))]
And choose the final step you prefer, e.g.
statistics[,.('p25' = quantile(get('Value'), probs = 0.25), 'p75' = quantile(get('Value'), probs = 0.75)),
by = c("Variable", "NewDate")]
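Put together, the whole pipeline might look like the sketch below. It is one possible arrangement, not the only one: it reshapes with data.table's melt() instead of gather(), and it builds NewDate with sprintf() and data.table's month() and mday() instead of lubridate, which are my substitutions.

```r
library(data.table)

set.seed(50)
dt <- data.table(Date = seq(as.Date("2001-01-01"), as.Date("2003-12-31"), by = "day"),
                 A = runif(1095, 0, 10),
                 D = runif(1095, 5, 15))

# Wide to long: one row per Date and Variable
long <- melt(dt, id.vars = "Date", variable.name = "Variable", value.name = "Value")

# Keep March to May and map every year onto the same reference year
long <- long[month(Date) %between% c(3, 5)]
long[, NewDate := as.Date(sprintf("2020-%02d-%02d", month(Date), mday(Date)))]

# Quartiles per variable and calendar day, pooled over the three years
stats <- long[, .(Lower = quantile(Value, 0.25, na.rm = TRUE),
                  Upper = quantile(Value, 0.75, na.rm = TRUE)),
              by = .(Variable, NewDate)]
```

`stats` has one row per variable and calendar day between 1 March and 31 May.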
I have a tibble which contains values for different variables at a daily level.
library(lubridate)
library(tidyverse)
df <- tibble::tibble(date = seq.Date(ymd('2019-01-01'), ymd('2019-03-31'), by = 1),
high = sample(-5:100, 90, replace = T),
low = sample(-25:50, 90, replace = T),
sd = sample(5:25, 90, replace = T))
These variables need to be bound by certain min and max values which are found in another tibble as:
cutoffs <- tibble::tibble(var_name = c('high', 'low', 'sd'),
min = c(0, -5, 10),
max = c(75, 15, 15))
Now I want to go through my original df and change it so that every value below min is changed to min and every value above max is changed to max, where min and max are found in the cutoffs.
I currently do it in a for loop but I feel like a function like map could be used here, but I am not sure how to use it.
for (i in 1:3){
a <- cutoffs$var_name[[i]]
print(a)
min <- cutoffs$min[[i]]
max <- cutoffs$max[[i]]
df <- df %>%
mutate(!!a := ifelse(!!as.name(a) < min, min, !!as.name(a)),
!!a := ifelse(!!as.name(a) > max, max, !!as.name(a)))
}
I would appreciate your help in creating a solution that does not use a for loop.
Thanks :)
Try this. It pivots your dataframe long-wise, joins to the cutoffs, and then uses case_when to replace value where applicable:
library(lubridate)
library(tidyverse)
df <- tibble::tibble(date = seq.Date(ymd('2019-01-01'), ymd('2019-03-31'), by = 1),
high = sample(-5:100, 90, replace = T),
low = sample(-25:50, 90, replace = T),
sd = sample(5:25, 90, replace = T)) %>%
pivot_longer(-date, names_to = "var_name", values_to = "value")
df
cutoffs <- tibble::tibble(var_name = c('high', 'low', 'sd'),
min = c(0, -5, 10),
max = c(75, 15, 15))
df %>%
left_join(cutoffs, by = "var_name") %>%
mutate(value_new = case_when(value > max ~ max,
value < min ~ min,
TRUE ~ as.double(value))) %>%
select(date, var_name, value, value_new, min, max)
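If the data should stay in wide format, another option is to clamp each column in place with across(), looking up that column's limits via cur_column(). This is a sketch on a tiny hand-made df (the three rows are invented so the result is easy to check), not part of the answer above:

```r
library(dplyr)

# Small hand-made example (assumption) with the question's column names
df <- tibble(high = c(-5, 50, 100),
             low  = c(-25, 0, 50),
             sd   = c(5, 12, 25))

cutoffs <- tibble(var_name = c("high", "low", "sd"),
                  min = c(0, -5, 10),
                  max = c(75, 15, 15))

# For each listed column, look up its row in cutoffs and clamp to [min, max]
clamped <- df %>%
  mutate(across(all_of(cutoffs$var_name), function(x) {
    lim <- cutoffs[cutoffs$var_name == cur_column(), ]
    pmin(pmax(x, lim$min), lim$max)
  }))

clamped
```

pmax() raises values below the minimum and pmin() lowers values above the maximum, so no ifelse() or loop is needed.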
I have multiple observations from each of a few groups and I'd like to make a matrix of QQ plots (or another type of plot), comparing each group to every other group.
Here's an example of what I'm talking about:
library(tidyverse)
set.seed(27599)
n <- 30
d <- tibble(person = c(rep('Alice', n),
rep('Bob', n),
rep('Charlie', n),
rep('Danielle', n)),
score = c(rnorm(n = n),
rnorm(n = n, mean = 0.1),
rnorm(n = n, sd = 2),
rnorm(n = n, mean = 0.3, sd = 1.4)))
by_hand <- tibble(a = sort(d$score[d$person == 'Alice']),
b = sort(d$score[d$person == 'Bob']),
c = sort(d$score[d$person == 'Charlie']),
d = sort(d$score[d$person == 'Danielle']))
pairs(x = by_hand,
lower.panel = function(x, y) { points(x, y); abline(0, 1);})
Here, I've manipulated the data by hand and used graphics::pairs() to make the plot. Can the same be done inside the tidyverse?
Here's what I've tried.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
glimpse()
This seems promising.
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
spread(key = person, value = score)
This gives the 'duplicate identifiers' error.
Maybe reshape2 would be better to use here?
d %>%
group_by(person) %>%
mutate(score = sort(score)) %>%
dcast(formula = score ~ person)
This creates a data.frame with 120 rows, and most of the values (90 per person) are NA. How can I create a wide data.frame without introducing so many NA?
You need a variable that links the row position for each person. Try
by_tidyverse <- d %>%
group_by(person) %>%
mutate(rowID=1:n(),
score=sort(score)
) %>%
spread(key = person, value = score) %>%
select(-rowID)
pairs(x = by_tidyverse, lower.panel = function(x, y) { points(x, y); abline(0, 1);})
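With newer versions of tidyr, spread() is superseded by pivot_wider(), and the same row-index trick carries over. A minimal sketch with only two people to keep it short (the reduced data set and the name `n_obs` are my assumptions):

```r
library(dplyr)
library(tidyr)

set.seed(27599)
n_obs <- 30
d <- tibble(person = rep(c("Alice", "Bob"), each = n_obs),
            score  = c(rnorm(n_obs), rnorm(n_obs, mean = 0.1)))

# Sort within person, add a per-person row index, then spread to one column per person
wide <- d %>%
  group_by(person) %>%
  mutate(score = sort(score), rowID = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = person, values_from = score) %>%
  select(-rowID)

dim(wide)
```

Because rowID identifies each row uniquely within a person, no NA padding is introduced.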