R: apply function per group using dplyr

I have a function which detrends data:
detrender <- function(Data,Vars,timevar){
Data_detrend <- Data
for (v in seq_along(Vars)){
ff <- as.formula(paste0(Vars[[v]]," ~ ",timevar))
fit <- lm(ff, data = Data_detrend)
if (anova(fit)$P[1] < 0.05){
message(paste("Detrending variable",v))
Data_detrend[[Vars[v]]][!is.na(Data_detrend[[Vars[[v]]]])] <- residuals(fit)
}
}
Data_detrend
}
useVar <- c("depression_sum")
inertia <- detrender(inertia,useVar, "timestamp")
This works fine and detrends my data. However, my data file contains many groups (Rid), and the detrending is currently done across all groups at once. I want to detrend within each group (Rid) and have tried to apply my function using group_by in dplyr. However, the code below adds thousands of new rows to the data. This is wrong: the number of rows should stay the same, with only the values detrended.
inertia_2<- inertia %>%
group_by(Rid) %>%
do(detrender(inertia,useVar, "timestamp")) %>%
ungroup()
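A likely cause of the extra rows: inside do(), the call still passes the full inertia data frame, so every group returns a detrended copy of the whole data set and do() stacks those copies. Below is a minimal sketch of the usual fix, which passes the per-group placeholder `.` instead. The toy data and its values are made up for illustration; only the detrender logic and the column names come from the question.

```r
library(dplyr)

# Same logic as the question's detrender, with the p-value column named explicitly
detrender <- function(Data, Vars, timevar) {
  Data_detrend <- Data
  for (v in seq_along(Vars)) {
    ff  <- as.formula(paste0(Vars[[v]], " ~ ", timevar))
    fit <- lm(ff, data = Data_detrend)
    if (anova(fit)[["Pr(>F)"]][1] < 0.05) {
      message("Detrending variable ", v)
      keep <- !is.na(Data_detrend[[Vars[[v]]]])
      Data_detrend[[Vars[[v]]]][keep] <- residuals(fit)
    }
  }
  Data_detrend
}

# Hypothetical toy data: two groups with different linear trends
set.seed(1)
toy <- data.frame(
  Rid       = rep(c("a", "b"), each = 20),
  timestamp = rep(1:20, times = 2)
)
toy$depression_sum <- ifelse(toy$Rid == "a", 2, -3) * toy$timestamp + rnorm(40)

useVar <- c("depression_sum")

# `.` is the data for the current group, not the whole data frame
toy_detrended <- toy %>%
  group_by(Rid) %>%
  do(detrender(., useVar, "timestamp")) %>%
  ungroup()

# The row count is unchanged by the per-group detrending
nrow(toy_detrended) == nrow(toy)
```

With `.` in place of the data frame name, each group is fitted and detrended on its own rows only.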

Related

Lagged values multiple columns with function in R

I would like to create lagged values for multiple columns in R.
First, I used a function to create lead/lag like this:
mleadlag <- function(x, n, ts_id) {
pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
x[pos]
}
Second, I would like to apply this function to several columns. firm.characteristics is the vector of column names for which I would like to compute lagged values.
library(dplyr)
firm.characteristics <- colnames(df)[4:6]
for(i in 1:length(firm.characteristics)){
df <- df %>%
group_by(company) %>%
mutate(!!paste0("lag_", i) := mleadlag(df[[i]] ,-1, fye)) %>%
ungroup()
}
However, I didn't get the correct values. The output for all companies in year t is the last row of year t-1. It didn't group by company and compute the lagged values within each group.
Can anyone tell me what is wrong in the loop? Or what should I do to get the correct lagged values?
Thank you so much for your help.
Reproducible sample could be like this:
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(company=1:n,
fye=2009,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat2 <- data.frame(company=1:n,
fye=2010,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat3 <- data.frame(company=1:n,
fye=2011,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
df <- rbind(dat,dat2,dat3)
I would try to stay away from loops in the tidyverse. Most tidyverse tasks that would traditionally require a loop already have a dedicated, fast function, which makes for more efficient and (in my opinion) more intuitive code. This is a great use case for dplyr's across(). I first converted df to a tibble.
df %>%
  as_tibble() %>%
  group_by(company) %>%
  mutate(
    # all_of() selects the columns named in the character vector
    across(all_of(firm.characteristics), ~lag(., 1L))
  ) %>%
  ungroup()
This generates the required lagged values, provided the rows are ordered by fye within each company (as they are in the sample data). For more information see dplyr's across documentation.
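If the original columns should be kept next to the lagged ones (the question's loop was creating lag_ columns), across() takes a .names argument. Here is a small sketch on made-up data; the toy values and the lag_ prefix are only for illustration.

```r
library(dplyr)

# Hypothetical toy data: two companies, three fiscal years each
toy <- data.frame(
  company = rep(1:2, each = 3),
  fye     = rep(2009:2011, times = 2),
  x       = c(1, 2, 3, 10, 20, 30),
  y       = c(4, 5, 6, 40, 50, 60)
)
vars <- c("x", "y")

lagged <- toy %>%
  arrange(company, fye) %>%          # lag() depends on row order
  group_by(company) %>%
  mutate(across(all_of(vars), ~ dplyr::lag(.x, 1L), .names = "lag_{.col}")) %>%
  ungroup()

lagged
```

The first year of each company gets NA in lag_x and lag_y, because there is no earlier row within that group.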

Run `dlply` with a different function for some subsets of data in `R`

I have a large data set in which some subsets fit a logistic growth model and others fit an exponential growth model. Using dlply, I have successfully calculated the nls regression parameters for each subset of my data, defined by three factors that can each take multiple values. However, I would like to use one nls form for some subsets, as determined by another variable, and a different form for the rest. I thought I could use an if … else construct, but it does not seem to work.
I would really like an answer using dlply specifically. If it is not possible with dlply, that would be helpful to know.
Here is a dummy set to illustrate what I would like to do:
library(plyr)
data(iris)
iris$form<-"b"
iris[iris$Species=="setosa",]$form<-"a"
diris<-dlply(iris, as.quoted(.(Species)),
  function(x){
    if(x$form=="a"){
      mean(x$Sepal.Length)
    }else{
      median(x$Sepal.Length)
    }
  })
Splitting it into two separate dlply calls does work, but I would rather have it all tidily together:
diris_mean<-dlply(iris[iris$form=="a",], as.quoted(.(Species)),
function(x){
mean(x$Sepal.Length)
}
)
diris_med<-dlply(iris[iris$form!="a",], as.quoted(.(Species)),
function(x){
median(x$Sepal.Length)
}
)
UPDATE:
It seems my dummy example was too simple to convey what I needed. I do not understand how summarise or mutate works (from the posted answer) to be able to translate to my function. Here is the function I am using:
NLmodels <- dlply(cum[form=="logistic growth",], as.quoted(.(region, climate, size)),
function(x) {
essai=try(logis<-nls(freq~1/(1+b*exp(-(c*mid_point))),
start=list(b=170,c=0.1),data=x,control=list(maxiter=200),trace=FALSE))
#if the nls was successful, then calculate values
if(class(essai)!="try-error"){
nls_values<-summary(nls(x$freq~1/(1+b*exp(-(c*mid_point))),
start=list(b=170,c=0.1),
data=x, control=list(maxiter=200)))$parameters
}else {
print("error")
}
}
)
Here is another example of what I want using an accessible dataset for just one form of a regression equation (I have two forms in my data).
data(mtcars)
mtcars$a_cat<-rep(c("a", "b", "c", "d"), each=8)
mtcars$b_cat<-rep(c("a", "b"), each=16)
mtcars_A<-dlply(mtcars[mtcars$b_cat=="a",], as.quoted(.(a_cat)),
function(x){
values<-summary(lm(mpg~hp, data=x))$coefficients[,1]
}
)
We can use dplyr and purrr. After splitting by 'Species' and 'form' into a list of data frames with group_split, loop through the list with map_dfr, which row-binds the results into a single data frame. For each piece, check whether the first element of the 'form' column is 'a'; if so, create the 'new' column as the mean of 'Sepal.Length', otherwise as the median.
library(dplyr)
library(purrr)
iris %>%
group_split(Species, form) %>%
map_dfr(~ if(first(.x$form) == 'a') {
.x %>% mutate(new = mean(Sepal.Length)) }
else {.x %>% mutate(new = median(Sepal.Length))})
For the regression based new example,
map_dfr(c('a_cat', 'b_cat'), ~
mtcars %>%
group_by_at(.x) %>%
summarise(Coef = lm(mpg ~hp)$coefficients[1]))
Or
map_dfr(c('a_cat', 'b_cat'), ~
mtcars %>%
group_by_at(.x) %>%
do(data.frame(Coef = lm(mpg ~ hp, data = .)$coefficients[1])))
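Since the question asks for dlply specifically, here is a hedged sketch of how the if/else could stay inside a single dlply call. In the original attempt x$form is a whole column, so if() receives a vector condition (an error in R 4.2.0 and later); testing only the first element avoids that. This assumes form is constant within each Species, as in the dummy data. The name iris2 is just a local copy used for illustration.

```r
library(plyr)

# Local copy of iris with the 'form' flag from the question
iris2 <- iris
iris2$form <- ifelse(iris2$Species == "setosa", "a", "b")

diris <- dlply(iris2, .(Species), function(x) {
  # 'form' is constant within a group, so its first element is enough
  if (x$form[1] == "a") {
    mean(x$Sepal.Length)
  } else {
    median(x$Sepal.Length)
  }
})

diris$setosa      # mean for the "a" group
diris$versicolor  # median for a "b" group
```

The same pattern should carry over to the nls case: branch on x$form[1] and call the logistic or the exponential nls inside the corresponding branch.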

stratified sampling R - more directions needed

To reduce the dataset, I have been advised to use "stratified sampling".
Because I'm very new to R programming, the existing articles on Stack are hard for me to follow; they contain very little explanation.
I have a data set of over 60000 obs. and 24 variables. Out of all variables, 21 are quantitative (numbers).
How do I draw a sample from this data set? Also:
- Where do I specify the dataset name?
- Do I need to name the new "reduced" dataset so that I can use it in further analysis?
ADDED CODE (this is what I used for sampling):
# Sample a percentage of values from each stratum (10% in this case)
DB.quant.sample = lapply(split(DB.quant, DB.quant$group_size), function(DB.quant) {
DB.quant[sample(1:nrow(DB.quant), ceiling(nrow(DB.quant) * 0.1)), ]
})
DB.quant[sample(1:nrow(DB.quant), 6000), ]
# DB.quant is the dataset and group_size is one of the variables. I'm not sure which variable I should use.
I'm having trouble illustrating graphically and intuitively how a cluster algorithm works.
I started with:
DB <- na.omit(DB)
DB.quant <- DB[c(2,3,4,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24)]
And then:
d <- dist(DB.quant.sample) # but I'm getting an error:
Error in dist(DB.quant.sample, method = "euclidean") : (list) object cannot be coerced to type 'double'
[Image: first few rows of the data]
I'm not sure exactly how you want to sample, but here's a simple example using the built-in iris data frame. Below are two ways to do it: one using base R and the other using the dplyr package.
Base R
Split the data frame into three separate smaller data frames, one for each Species.
Randomly sample 5 rows per Species.
# Sample 5 rows from each stratum
df.sample = lapply(split(iris, iris$Species), function(df) {
df[sample(1:nrow(df), 5), ]
})
# Sample a percentage of values from each stratum (10% in this case)
df.sample = lapply(split(iris, iris$Species), function(df) {
df[sample(1:nrow(df), ceiling(nrow(df) * 0.1)), ]
})
This gives us a list containing three data frames, one for each unique value of Species.
Combine the three samples into a single data frame:
df.sample = do.call(rbind, df.sample)
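This combining step is also what the dist() error in the question points to: lapply() returns a list, and dist() needs a numeric matrix or data frame. A minimal sketch on iris, assuming only the four numeric columns go into the distance calculation:

```r
set.seed(1)

# Stratified sample: 5 rows per Species, returned as a list of data frames
parts <- lapply(split(iris, iris$Species), function(df) {
  df[sample(nrow(df), 5), ]
})

# Row-bind the list into one data frame before calling dist()
combined <- do.call(rbind, parts)

# Distances on the numeric columns only (column 5 is the Species factor)
d <- dist(combined[, 1:4], method = "euclidean")
```

Here `combined` is an ordinary data frame with 15 rows, so it can be passed on to further analysis like any other data set.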
dplyr package
Do the grouping and sampling in a single chain of functions using the pipe (%>%) operator:
library(dplyr)
# Sample 5 values from each stratum
df.sample = iris %>%
group_by(Species) %>%
sample_n(5)
# Sample a percentage of values from each stratum (10% in this case)
df.sample = iris %>%
group_by(Species) %>%
sample_frac(0.1)
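In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(). A short sketch of the equivalent calls (the object names are just for illustration):

```r
library(dplyr)

set.seed(1)

# Sample 5 rows from each stratum
sample_n5 <- iris %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Sample 10% of the rows from each stratum
sample_10pct <- iris %>%
  group_by(Species) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

# Number of sampled rows per Species
table(sample_n5$Species)
```

Both results are regular data frames, so the reduced data set can be named and reused in later steps.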

How to add up data frames from a loop within a function?

I have a function whose purpose is to predict revenue from cost. The twist is that I have many different data frames as inputs, each with its own model to predict with. The function loops through the data frames, predicts on each with its corresponding model, and outputs a prediction with confidence intervals. Now I need a way to add all of these prediction outputs up.
Here's a simplified example of what I'm doing. Feel free to skip it if you don't need it to answer the question (it might be hard to read). Note that each prediction output isn't a data frame of cost and revenue, but a summary of the revenue you can expect at a given cost.
predictions <- function(df_list, model_list) {
for(i in 1:length(df_list)) {
sapply(seq(1, 2, .25), function(x) {
df_list[[i]]$cost <- df_list[[i]]$cost * x
predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
temp <- cbind(df_list[[i]]$cost, predictions)
output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
output
}) -> output
output %>% t %>% as.data.frame -> output
}
}
With the output for each index looking like this:
     Cost Lower_Rev  Fit_Rev Upper_Rev
1 2048884  18114566 20898884  24145077
2 2561105  21684691 25085853  29064495
3 3073326  25092823 29122421  33853693
4 3585547  28369901 33038060  38539706
5 4097768  31537704 36853067  43140547
I need some way to add together each output into one master output, whose cost and revenue values will be the sum of all others. Any ideas?
Your output is overwritten on every iteration of the loop. You need to assign it to something, or simply use lapply/sapply. I also pass i as an explicit argument instead of relying on R's scoping rules to pick up i from an enclosing environment.
L = lapply(1:length(df_list),
  function(i) sapply(seq(1, 2, .25), function(x, i) {
    df_list[[i]]$cost <- df_list[[i]]$cost * x
    predictions <- predict(model_list[[i]], df_list[[i]], interval = "confidence")
    temp <- cbind(df_list[[i]]$cost, predictions)
    output <- summarise(temp, Cost = sum(cost), Low = sum(lwr), Fit = sum(fit), Upper = sum(upr))
    output %>% t %>% as.data.frame
  }, i = i)  # pass i through to the inner function explicitly
)
Found a solution: Kind of wonky, but it works.
At the beginning of the function, I establish the following vectors:
costs <- c()
lows <- c()
fits <- c()
highs <- c()
Those are in accordance with the four columns of my output. Then, at the end of the loop, after the -> output statement, I run this:
costs[i] <- output[1]
lows[i] <- output[2]
fits[i] <- output[3]
highs[i] <- output[4]
For some reason it wouldn't work to simply assign each full df to a list of dfs; I had to do it vector by vector. Then, once each vector is stored with the full extent of my index, I ran this:
costs <- sapply(costs, unlist) %>% rowSums %>% as.data.frame
lows <- sapply(lows, unlist) %>% rowSums %>% as.data.frame
fits <- sapply(fits, unlist) %>% rowSums %>% as.data.frame
highs <- sapply(highs, unlist) %>% rowSums %>% as.data.frame
output <- cbind(costs, lows, fits, highs)
output
...to end the function. So really weird all in all, but it works.
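As an alternative to collecting each column in its own vector, a list of data frames with the same shape and column names can be summed element by element with Reduce(). This is only a sketch on made-up numbers; it assumes every per-model output has identical dimensions and all-numeric columns.

```r
# Hypothetical per-model outputs with the same columns and row order
outputs <- list(
  data.frame(Cost = c(1, 2), Low = c(5, 6),  Fit = c(10, 20), Upper = c(15, 25)),
  data.frame(Cost = c(3, 4), Low = c(7, 8),  Fit = c(30, 40), Upper = c(35, 45)),
  data.frame(Cost = c(5, 6), Low = c(9, 10), Fit = c(50, 60), Upper = c(55, 65))
)

# `+` on data frames adds matching cells, so Reduce() accumulates the total
total <- Reduce(`+`, outputs)
total
```

Inside the function, the same idea would mean storing each iteration's output in a list (for example with lapply) and calling Reduce() once at the end.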

dplyr: manipulate function with multiple arguments by groups

With the robCompositions package, I need to impute missing values on a group basis. For example, with the iris dataset.
library(robCompositions)
library(dplyr)
data(iris)
# Insert random NAs
for (i in 1:4) {
n_NA = sample(0:10, 1)
index_NA = sample(1:nrow(iris), n_NA)
iris[index_NA, i] = NA
}
This is where I have no idea which manip to use...
impfunc <- function(x) x %.%
regroup(list(...)) %.%
mutate(impKNNa(x[,-5], k=6, metric="Euclidean"))
impfunc(iris, "Species")
iris %.% group_by(Species) %.% mutate(impKNNa(iris[,-5], k=6, metric="Euclidean"))
Any idea?
Thanks.
Use the do() function. It allows you to apply any arbitrary function to a grouped data frame.
You'll also want to extract not just the output from impKNNa but specifically impKNNa$xImp, which is the altered data frame.
The other issue is that impKNNa doesn't want any variables except the numeric variables of interest, and do() won't remove the categorical variables. So perhaps a solution is to write a wrapper function for impKNNa that removes the categorical variables and returns xImp, and then use do() to apply that wrapper to a grouped data frame.
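A rough sketch of such a wrapper follows. To keep it runnable without robCompositions, the imputation step is passed in as a function and a simple group-mean imputer stands in for impKNNa; impute_group, mean_impute and iris_na are hypothetical names used only for illustration.

```r
library(dplyr)

# Wrapper: impute only the numeric columns, leave the other columns untouched
impute_group <- function(df, impute_fun) {
  df <- as.data.frame(df)
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- as.data.frame(impute_fun(df[is_num]))
  df
}

# Stand-in imputer (an assumption, not impKNNa): replace NAs with the column mean.
# With robCompositions installed, the call from the question could be plugged in
# instead, e.g. function(x) impKNNa(x, k = 6, metric = "Euclidean")$xImp
# (not tested here).
mean_impute <- function(x) {
  x[] <- lapply(x, function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  x
}

# Insert a few random NAs into a copy of iris
set.seed(42)
iris_na <- iris
iris_na[sample(nrow(iris_na), 10), 1] <- NA

imputed <- iris_na %>%
  group_by(Species) %>%
  do(impute_group(., mean_impute)) %>%
  ungroup()
```

Because the wrapper runs inside do(), each Species is imputed using only its own rows.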
