In R, is there a way to use group_by() to convert daily data to yearly data by taking last() of all variables in one go, rather than writing a summarise() call for each variable?
(adf <- data.frame(
year=rep(1:3,each=3),
month=rep(1:3,times=3),
var1= letters[1:9],
var2= -9:-1
))
#solution
library(tidyverse)
group_by(adf,
year) |>
summarise(across(.cols = everything(),
.fns = last))
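As an alternative sketch (assuming dplyr 1.0.0 or later), slice_tail() keeps the last row of each group, so every column comes along without being named. For this data it gives the same values as the across() call, because the last value of every column sits in the same row.

```r
library(dplyr)

adf <- data.frame(
  year  = rep(1:3, each = 3),
  month = rep(1:3, times = 3),
  var1  = letters[1:9],
  var2  = -9:-1
)

# Keep the last row of each year; all other columns are carried along.
yearly <- adf |>
  group_by(year) |>
  slice_tail(n = 1) |>
  ungroup()

yearly
```

Note that this relies on the rows already being in chronological order within each year.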
Related
I have a data frame with 58 columns labeled SD1 through to SD58 along with columns for date info (Date, Year, Month, Day).
I'm trying to find the date of the maximum value of each of the SD columns each year using the following code:
maxs<-aggregate(SD1~Year, data=SDtime, max)
SDMax<-merge(maxs,SDtime)
I only need the dates so I made a new df and relabeled the column as below:
SD1Max = subset(SDMax, select = c(Year, Date))
SD1Max %>%
rename(
SD1=Date
)
I want to do the same thing for every SD column but I don't want to have to repeat these steps 58 times. Is there a way to loop the process?
Assuming there are no ties (multiple days on which the variable reached its maximum), this probably does what you want:
library('tidyverse')
SDtime %>%
pivot_longer(
cols = matches('^SD[0-9]{1,2}$')
) %>%
group_by(Year, name) %>%  # one maximum per year and per SD column
filter(value == max(value, na.rm = TRUE)) %>%
ungroup()
You might want to pivot_wider afterwards.
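A minimal sketch of the whole round trip, including the pivot_wider step, might look like this. The small SDtime below is a made-up stand-in with only two SD columns, and slice_max(with_ties = FALSE) is swapped in for the filter() so that ties cannot produce duplicate rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for SDtime: two SD columns, two years.
SDtime <- data.frame(
  Date = as.Date(c("2020-01-01", "2020-01-02", "2021-01-01", "2021-01-02")),
  Year = c(2020, 2020, 2021, 2021),
  SD1  = c(1, 5, 7, 2),
  SD2  = c(9, 3, 4, 8)
)

max_dates <- SDtime %>%
  pivot_longer(cols = matches("^SD[0-9]{1,2}$")) %>%
  group_by(Year, name) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%  # one row per Year/column
  ungroup() %>%
  select(Year, name, Date) %>%
  pivot_wider(names_from = name, values_from = Date)

max_dates
```

The result has one row per Year and one column per SD variable, each holding the date on which that variable peaked.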
I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
mutate_at and summarise_at are superseded and you should use across instead.
Your code failed because weighted.mean(., df$Population) was not written as a formula (there is no ~ at the beginning) and because you used df$Population instead of Population. When you write Population, summarise knows you mean the column Population which, at that point, is grouped like the rest of the data frame. When you use df$Population you are calling the column of the original data frame, without grouping. Not only is that wrong, it also raises an error, because the length of the variable you are trying to average and the length of the weights provided by df$Population do not match.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
.groups = "drop")
If you really need to use summarise_at (for example because you are on a dplyr version older than 1.0.0), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I considered df and vlist like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
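To make the point about Population versus df$Population concrete, here is a minimal sketch on made-up data. The names toy and spend are hypothetical, chosen so the weighted means can be checked by hand.

```r
library(dplyr)

# Hypothetical toy data: two commuting zones, three rows each.
toy <- data.frame(
  cz         = c("a", "a", "a", "b", "b", "b"),
  Population = c(1, 1, 2, 3, 1, 1),
  spend      = c(10, 20, 30, 40, 50, 60)
)

# Inside summarise(), `Population` is the current group's slice (3 values),
# so it lines up with the column being averaged.
by_cz <- toy %>%
  group_by(cz) %>%
  summarise(across(spend, ~ weighted.mean(.x, Population)), .groups = "drop")

# `toy$Population` always has all 6 values, which cannot match a 3-row group,
# so weighted.mean() stops with the "same length" error.
mismatch <- tryCatch(
  toy %>%
    group_by(cz) %>%
    summarise(spend = weighted.mean(spend, toy$Population)),
  error = function(e) "error"
)

by_cz
mismatch
```

For zone a the weighted mean is (10 + 20 + 2 * 30) / 4 = 22.5, and for zone b it is (3 * 40 + 50 + 60) / 5 = 46.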
I have CSVs in a clunky format (National Weather Service 'DATACARD' format - sample monthly data on page 3 here) and I'm hoping to find a better way of transposing and filtering out NAs. I think there may be something along the lines of gather() from the tidyverse, but I'm open to all approaches.
a <- c(10.5,14,16,20,23)
b <- c(11,15,17,21,24)
c <- c(12,NA,18,22,25.2)
d <- c(13,NA,19,NA,26)
rawcsv <- data.frame(a,b,c,d)
rawcsv_singlecolumn <- data.frame(singlecolumn=c(t(rawcsv)))
rawcsv_NAsremoved_thedesiredvector <- na.omit(rawcsv_singlecolumn)
desiredvector <- c(10.5,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25.2,26)
We can extract the single column from the dataset
rawcsv_NAsremoved_thedesiredvector[[1]]
If we need to use tidyverse
library(tidyverse)
rownames_to_column(rawcsv, 'rn') %>%
gather(key, value, -rn, na.rm = TRUE) %>%
arrange(as.integer(rn)) %>%
pull(value)
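Since gather() is superseded in current tidyr, a sketch with pivot_longer() may be preferable (assuming tidyr 1.0.0 or later). pivot_longer() emits the values row by row, so no row index or explicit sort is needed here.

```r
library(dplyr)
library(tidyr)

rawcsv <- data.frame(
  a = c(10.5, 14, 16, 20, 23),
  b = c(11, 15, 17, 21, 24),
  c = c(12, NA, 18, 22, 25.2),
  d = c(13, NA, 19, NA, 26)
)

# Reshape to long form in reading order (left to right, top to bottom),
# dropping the NA cells on the way.
values <- rawcsv %>%
  pivot_longer(everything(), values_drop_na = TRUE) %>%
  pull(value)

values
```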
I have a large data frame in which every row holds enough data to calculate a correlation between two specific sets of columns. I would like to add a new column containing the correlation calculated for each row.
Here is a summary of what I would like to do (this one using dplyr):
example_data %>%
mutate(pearsoncor = cor(x = X001_F5_000_A:X030_F5_480_C, y = X031_H5_000_A:X060_H5_480_C))
Obviously it does not work this way, as I get only NAs in the pearsoncor column. Does anyone have a suggestion? Is there an easy way to do this?
Best,
Example data frame
With tidyr, you can separately gather all the x- and y-variables you'd like to compare. You get a tibble containing the correlation coefficients and their p-values for every combination you provided.
library(dplyr)
library(tidyr)
example_data %>%
gather(x_var, x_val, X001_F5_000_A:X030_F5_480_C) %>%
gather(y_var, y_val, X031_H5_000_A:X060_H5_480_C) %>%
group_by(x_var, y_var) %>%
summarise(cor_coef = cor.test(x_val, y_val)$estimate,
p_val = cor.test(x_val, y_val)$p.value)
edit, update some years later:
library(tidyr)
library(purrr)
library(broom)
library(dplyr)
longley %>%
pivot_longer(GNP.deflator:Armed.Forces, names_to="x_var", values_to="x_val") %>%
pivot_longer(Population:Employed, names_to="y_var", values_to="y_val") %>%
nest(data=c(x_val, y_val)) %>%
mutate(cor_test = map(data, ~cor.test(.x$x_val, .x$y_val)),
tidied = map(cor_test, tidy)) %>%
unnest(tidied)
Here is a solution using the reshape2 package to melt() the data frame into long form so that each value has its own row. The original wide-form data has 60 values per row for each of the 6 genes, while the melted long-form data frame has 360 rows, one for each value. Then we can easily use summarize() from dplyr to calculate the correlations without loops.
library(reshape2)
library(dplyr)
names1 <- names(example_data)[4:33]
names2 <- names(example_data)[34:63]
example_data_longform <- melt(example_data, id.vars = c('Gene','clusterFR','clusterHR'))
example_data_longform %>%
group_by(Gene, clusterFR, clusterHR) %>%
summarize(pearsoncor = cor(x = value[variable %in% names1],
y = value[variable %in% names2]))
You could also generate more detailed results, as in Eudald's answer, using do():
detailed_r <- example_data_longform %>%
group_by(Gene, clusterFR, clusterHR) %>%
do(cor = cor.test(x = .$value[.$variable %in% names1],
y = .$value[.$variable %in% names2]))
This outputs a tibble with the cor column being a list with the results of cor.test() for each gene. We can use lapply() to extract output from the list.
lapply(detailed_r$cor, function(x) c(x$estimate, x$p.value))
I had the same problem a few days back, and I know loops are not optimal in R but that's the only thing I could think of:
# cols_A and cols_B are assumed to hold the names (or indices) of the two column sets
df$r = rep(0,nrow(df))
df$cor_p = rep(0,nrow(df))
for (i in seq_len(nrow(df))){
ct = cor.test(as.numeric(df[i,cols_A]),as.numeric(df[i,cols_B]))
df$r[i] = ct$estimate
df$cor_p[i] = ct$p.value
}
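If a loop-free version is preferred, one possible sketch uses rowwise() with c_across() (assuming dplyr 1.0.0 or later). The toy columns a1:a3 and b1:b3 are hypothetical stand-ins for the two column sets.

```r
library(dplyr)

# Hypothetical toy data: each row holds three "A" and three "B" measurements.
df <- data.frame(
  a1 = c(1, 2), a2 = c(2, 4), a3 = c(3, 7),
  b1 = c(2, 9), b2 = c(4, 5), b3 = c(6, 1)
)

# rowwise() makes each row its own group; c_across() collects the selected
# columns of that row into a plain numeric vector for cor().
res <- df %>%
  rowwise() %>%
  mutate(r = cor(c_across(a1:a3), c_across(b1:b3))) %>%
  ungroup()

res
```

In the first row the B values are exactly twice the A values, so the correlation is 1. In the second row they move in opposite directions, so it is negative. On a large data frame this is convenient rather than fast, since cor() is still called once per row.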
I am currently learning purrr in R. I have code which does the following:
Uses the psych package in R to get the mean, SD, range, etc. for a list of questions
Returns those statistics in a single data frame where the list item is added to the table as a column. In the case below, it's schools.
Below is an example where I think I'm about 90% there. All I want to do is add the names of the schools to the data frame as a column so that I can chart them afterwards. Can anyone help? The method below loses the names as soon as the bind_rows() command is run.
library(lavaan)
library(tidyverse)
# function pulls the mean, sd, range, kurtosis and skew
get_stats <- function(x){
row_names <- rownames(x)
mydf_temp <- x %>%
dplyr::select(mean, sd, range, kurtosis, skew) %>%
mutate_if(is.numeric, round, digits=2) %>%
filter(complete.cases(.))
mydf_temp
}
# Generate the data for the reproducible example
mydf <- HolzingerSwineford1939 %>%
select(school, starts_with("x")) %>%
psych::describeBy(., group=.$school, digits = 2)
# Gets the summary statistics per school
stats_summ <- mydf %>%
map(get_stats) %>%
bind_rows()
We can use the .id argument from bind_rows
mydf %>%
map(get_stats) %>%
bind_rows(., .id = 'group')
Using a reproducible example with the iris dataset
mydf <- iris %>%
psych::describeBy(., group=.$Species, digits = 2)
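The effect of .id can be seen without psych as well. In this sketch a named list of small per-group summaries stands in for the describeBy() output. The column names mean and sd and the name per_species are illustrative choices, not part of the original code.

```r
library(dplyr)
library(purrr)

# A named list of per-group summaries; the list names play the role of schools.
per_species <- iris %>%
  split(.$Species) %>%
  map(~ data.frame(mean = mean(.x$Sepal.Length), sd = sd(.x$Sepal.Length)))

# .id turns the list names into a column instead of dropping them.
stats <- bind_rows(per_species, .id = "group")

stats
```

The group column is what makes it possible to colour or facet by school (here, species) when charting afterwards.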