Lagged values multiple columns with function in R - r

I would like to create lagged values for multiple columns in R.
First, I used a function to create lead/lag like this:
mleadlag <- function(x, n, ts_id) {
  pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
  x[pos]
}
Second, I would like to apply this function to several columns. firm.characteristics is the vector of column names for which I would like to compute lagged values.
library(dplyr)
firm.characteristics <- colnames(df)[4:6]
for(i in 1:length(firm.characteristics)){
  df <- df %>%
    group_by(company) %>%
    mutate(!!paste0("lag_", i) := mleadlag(df[[i]] ,-1, fye)) %>%
    ungroup()
}
However, I didn't get the correct values. The output for all companies in year t is the last row in year t-1. It didn't group by company and compute the lagged values.
Can anyone tell me what is wrong in the loop? Or what should I do to get the correct lagged values?
Thank you so much for your help.
Reproducible sample could be like this:
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(company=1:n,
fye=2009,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat2 <- data.frame(company=1:n,
fye=2010,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat3 <- data.frame(company=1:n,
fye=2011,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
df <- rbind(dat,dat2,dat3)

I would try to stay away from loops in the tidyverse. Most operations that would traditionally require a loop already have a fast tidyverse equivalent, which makes for more efficient and (in my opinion) more intuitive code. This is a great use case for dplyr's across() functionality. I first changed the df to a tibble.
df %>%
  as_tibble() %>%
  group_by(company) %>%
  mutate(
    across(all_of(firm.characteristics), ~lag(.x, 1L), .names = "lag_{.col}")
  ) %>%
  ungroup()
This generates the required lagged values. For more information see dplyr's across documentation.
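As for why the loop in the question misbehaves: inside mutate(), df[[i]] pulls the i-th column of the whole, ungrouped data frame (and with i = 1 that is the company column, not the first firm characteristic), so group_by() has no effect on it. The sketch below keeps the asker's mleadlag() and applies it per group through across(). The toy data and the lag_ prefix are assumptions made for illustration only. Unlike lag(), which shifts by row position and therefore relies on the rows being sorted by fye, mleadlag() matches on the year itself.

```r
library(dplyr)

# Lead/lag helper from the question: looks up the value whose
# time id equals (current id + n); returns NA when no such id exists.
mleadlag <- function(x, n, ts_id) {
  pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
  x[pos]
}

# Small illustrative panel (assumed data): 3 companies, 3 years.
set.seed(42)
df <- data.frame(
  company = rep(1:3, times = 3),
  fye     = rep(2009:2011, each = 3),
  x       = rnorm(9),
  y       = rnorm(9)
)
firm.characteristics <- c("x", "y")

lagged <- df %>%
  group_by(company) %>%
  mutate(across(all_of(firm.characteristics),
                ~ mleadlag(.x, -1, fye),   # .x and fye are both per-group slices
                .names = "lag_{.col}")) %>%
  ungroup()
```

Here .x is the current column restricted to the current company, and fye is resolved inside the grouped data as well, so both vectors always have the same length.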

Related

How to group by one column then convert the other column into vectors?

For example, my df now is:
person <- c("a","a","a","b","b","b","c","c","c")
score <- c(31,2,13,5,6,7,8,9,4)
df <- data.frame(person,score)
What I want to get is a two-column table with three rows:
[1,1]="a", [1,2]= a vector of c(31,2,13)
[2,1]="b", [2,2]= a vector of c(5,6,7)
[3,1]="c", [3,2]= a vector of c(8,9,4)
Actually, I just want the three vectors so that I can pass them to another function. I tried something like the following code, but it didn't work (the actual function is much more complex, but it takes two vectors of the same length, one of which is provided).
f <- function(x,y){x-y}
df <- df %>%
  group_by(person) %>%
  summarise(diff = f(c(1,2,3), score))
Thanks so much in advance!
Base R solution:
aggregate(
  score ~ person,
  df,
  list
)
Tidyverse solution:
library(dplyr)
df %>%
  group_by(person) %>%
  summarise(score = list(score))
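If the goal is only to feed each group's vector into a function, the same list() wrapper can hold the function's result directly. A minimal sketch, reusing the f and the data from the question (the column name diff is taken from the asker's attempt):

```r
library(dplyr)

person <- c("a","a","a","b","b","b","c","c","c")
score  <- c(31,2,13,5,6,7,8,9,4)
df <- data.frame(person, score)

f <- function(x, y) x - y

# Wrapping the result in list() stores one vector per group
# in a list-column, so each person keeps a single row.
res <- df %>%
  group_by(person) %>%
  summarise(diff = list(f(c(1, 2, 3), score)))

res$diff[[1]]  # result for person "a": c(1,2,3) - c(31,2,13)
```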

How to calculate weighted mean using mutate_at in R?

I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
You probably mean summarise instead of mutate, because mutate would just repeat the group result on every row.
mutate_at and summarise_at are superseded; you should use across instead.
Your code did not work for two reasons. First, you did not write the function as a formula (there is no ~ at the beginning). Second, you used df$Population instead of Population. When you write Population, summarise knows you mean the column Population, which at that point is grouped like the rest of the data frame. df$Population refers to the column of the original, ungrouped data frame. That is not only wrong: it also raises an error, because the length of the variable being averaged (one group) does not match the length of the weights in df$Population (all rows).
Here is how you could do it:
library(dplyr)
df %>%
  group_by(cz) %>%
  summarise(across(all_of(vlist), ~ weighted.mean(.x, Population)),
            .groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
  group_by(cz) %>%
  summarise_at(vlist, ~weighted.mean(., Population)) %>%
  ungroup()
I assumed df and vlist look like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
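To check that the grouped weights behave as described, here is a small sketch on a toy data frame whose weighted means are easy to verify by hand. The names toy, cols, exp1 and exp2 are made up for this illustration and do not come from the question.

```r
library(dplyr)

# Toy data (assumed): two commuting zones, two rows each.
toy <- data.frame(
  cz         = c("a", "a", "b", "b"),
  Population = c(1, 3, 2, 2),
  exp1       = c(10, 20, 30, 50),
  exp2       = c(1, 1, 4, 8)
)
cols <- c("exp1", "exp2")

# Population is looked up inside each group, so it always has
# the same length as the column slice being averaged.
out <- toy %>%
  group_by(cz) %>%
  summarise(across(all_of(cols), ~ weighted.mean(.x, Population)),
            .groups = "drop")

# By hand for cz "a": (10 * 1 + 20 * 3) / (1 + 3) = 17.5
out
```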

R - How to write a loop that applies a function already built into a package to all columns of a df and populates another df?

So far I have split the df into a list of data frames, because I want to do this operation separately for each sample.
TestData <- split(df, df$SAMPLE)
output_Data <- data.frame()
for(i in 1:length(TestData)){
  TestData[[i]] <- TestData[[i]] %>%
    select(contains("strings")) ### subset the columns I need to apply the function on
}
This should do it:
library(dplyr)
library(purrr)
(data <- df %>%
  split(.$SAMPLE) %>%
  map(~ select(.x, contains("strings"))))
This approach is closer to R's functional programming philosophy and follows the tidyverse style of doing things.
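The question also asks how to collect the per-sample results in another data frame. A possible sketch follows; since the question does not say which package function is meant, mean() stands in for it, and the toy df with its column names is an assumption.

```r
library(dplyr)
library(purrr)

# Toy input (assumed): two samples, two "strings" columns, one unrelated column.
df <- data.frame(
  SAMPLE    = rep(c("s1", "s2"), each = 3),
  strings_a = 1:6,
  strings_b = c(2, 4, 6, 8, 10, 12),
  other     = letters[1:6]
)

output_Data <- df %>%
  split(.$SAMPLE) %>%                               # one data frame per sample
  map(~ select(.x, contains("strings"))) %>%        # keep the columns of interest
  map(~ summarise(.x, across(everything(), mean))) %>%  # placeholder function: mean()
  bind_rows(.id = "SAMPLE")                         # stack results, keep the sample name
```

bind_rows() takes the names of the list as the values of the SAMPLE column, so no empty data frame has to be created and filled row by row.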

Calculate pairwise correlation in R using dplyr::mutate

I have a large data frame in which every row holds enough data to calculate a correlation between two specific sets of columns. I would like to add a new column containing the correlation calculated for each row.
Here is a summary of what I would like to do (this one using dplyr):
example_data %>%
mutate(pearsoncor = cor(x = X001_F5_000_A:X030_F5_480_C, y = X031_H5_000_A:X060_H5_480_C))
Obviously it does not work this way, as I get only NAs in the pearsoncor column. Does anyone have a suggestion? Is there an easy way to do this?
With tidyr, you can gather all the x-variables and all the y-variables you would like to compare separately. The result is a tibble containing the correlation coefficient and p-value for every combination you provided.
library(dplyr)
library(tidyr)
example_data %>%
  gather(x_var, x_val, X001_F5_000_A:X030_F5_480_C) %>%
  gather(y_var, y_val, X031_H5_000_A:X060_H5_480_C) %>%
  group_by(x_var, y_var) %>%
  summarise(cor_coef = cor.test(x_val, y_val)$estimate,
            p_val = cor.test(x_val, y_val)$p.value)
Edit, updated some years later:
library(tidyr)
library(purrr)
library(broom)
library(dplyr)
longley %>%
  pivot_longer(GNP.deflator:Armed.Forces, names_to="x_var", values_to="x_val") %>%
  pivot_longer(Population:Employed, names_to="y_var", values_to="y_val") %>%
  nest(data=c(x_val, y_val)) %>%
  mutate(cor_test = map(data, ~cor.test(.x$x_val, .x$y_val)),
         tidied = map(cor_test, tidy)) %>%
  unnest(tidied)
Here is a solution using the reshape2 package to melt() the data frame into long form so that each value has its own row. The original wide-form data has 60 values per row for each of the 6 genes, while the melted long-form data frame has 360 rows, one for each value. Then we can easily use summarize() from dplyr to calculate the correlations without loops.
library(reshape2)
library(dplyr)
names1 <- names(example_data)[4:33]
names2 <- names(example_data)[34:63]
example_data_longform <- melt(example_data, id.vars = c('Gene','clusterFR','clusterHR'))
example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  summarize(pearsoncor = cor(x = value[variable %in% names1],
                             y = value[variable %in% names2]))
You could also generate more detailed results, as in Eudald's answer, using do():
detailed_r <- example_data_longform %>%
  group_by(Gene, clusterFR, clusterHR) %>%
  do(cor = cor.test(x = .$value[.$variable %in% names1],
                    y = .$value[.$variable %in% names2]))
This outputs a tibble with the cor column being a list with the results of cor.test() for each gene. We can use lapply() to extract output from the list.
lapply(detailed_r$cor, function(x) c(x$estimate, x$p.value))
I had the same problem a few days back. I know loops are not optimal in R, but that's the only thing I could think of (cols_A and cols_B are the two sets of column names to correlate):
df$r = rep(0,nrow(df))
df$cor_p = rep(0,nrow(df))
for (i in seq_len(nrow(df))){
  # correlate the two sets of columns within row i
  ct = cor.test(as.numeric(df[i,cols_A]),as.numeric(df[i,cols_B]))
  df$r[i] = ct$estimate
  df$cor_p[i] = ct$p.value
}
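The same row-by-row calculation can also be written without an explicit loop using dplyr's rowwise() and c_across(). This is only a sketch on a tiny made-up data frame (columns a1:a3 and b1:b3 stand in for the two column ranges of the question), not a drop-in replacement for the real data.

```r
library(dplyr)

# Toy data (assumed): each row holds two series of three values.
example_data <- data.frame(
  Gene = c("g1", "g2"),
  a1 = c(1, 1), a2 = c(2, 2), a3 = c(3, 3),
  b1 = c(2, 6), b2 = c(4, 4), b3 = c(6, 2)
)

res <- example_data %>%
  rowwise() %>%                       # treat every row as its own group
  mutate(pearsoncor = cor(c_across(a1:a3), c_across(b1:b3))) %>%
  ungroup()

res$pearsoncor
```

c_across() collects the selected columns of the current row into one vector, which is what cor() needs; a bare range such as a1:a3 inside mutate() is not interpreted as a column selection.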

grouping in dplyr with missing columns

I have a complex dplyr pipeline within a function call. The input is a data frame which can have an extra column called s. If this column is available, I want to group by it in addition to the standard grouping.
At the moment I solve this with an if statement that checks whether the column is in the data frame and groups accordingly. After the grouping, the code is the same for both kinds of data.
Is there a more elegant way of doing this? In my original function, there are several variables I calculate in the summarise function and I don't want to maintain both parts separately.
Here is an example.
library(dplyr)
df1 <- data.frame(s=rep(c('a','b'), each=10),
p=rep(letters[1:5], 4),
v=runif(20))
df2 <- data.frame(p=rep(letters[1:5], each=4),
v=runif(20))
avgP <- function(df) {
  if('s' %in% names(df)) {
    df %>%
      group_by(s, p) %>%
      summarise(avg=mean(v))
  } else {
    df %>%
      group_by(p) %>%
      summarise(avg=mean(v))
  }
}
avgP(df1)
avgP(df2)
My preferred solution would be for group_by to simply ignore the missing column and group only by p when I work on df2.
We can use intersect to keep only the grouping columns that are actually present in the data frame:
avgP1 <- function(df){
  df %>%
    group_by(across(all_of(intersect(names(df), c("s", "p"))))) %>%
    summarise(avg=mean(v))
}
avgP1(df1)
avgP1(df2)
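An alternative sketch uses tidyselect's any_of(), which silently skips names that are not present, so the intersect() step is not needed. The function name avgP2 is made up here, and v is set to 1:20 instead of runif(20) only so that the results are reproducible.

```r
library(dplyr)

# any_of() selects whichever of "s" and "p" exist in df
# and ignores the ones that are missing.
avgP2 <- function(df) {
  df %>%
    group_by(across(any_of(c("s", "p")))) %>%
    summarise(avg = mean(v), .groups = "drop")
}

df1 <- data.frame(s = rep(c("a", "b"), each = 10),
                  p = rep(letters[1:5], 4),
                  v = 1:20)
df2 <- data.frame(p = rep(letters[1:5], each = 4),
                  v = 1:20)

r1 <- avgP2(df1)  # grouped by s and p
r2 <- avgP2(df2)  # grouped by p only
```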
