I need to calculate a rolling sum by group.
library(zoo)  # for rollapply() / rollsum()

y <- 1:10
tmp <- data.frame(y)
tmp$roll <- NA
tmp$roll[2:10] <- rollapply(y, 2, sum)
tmp$g <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
tmp$roll calculates the rolling sum for tmp$y; I need to do this within each level of tmp$g. I think I may need to split the data frame into a list of data frames by group and then bind them back together, but this seems like a long route (a sketch of that approach follows the expected output below). The result would be an additional column with the rolling sum within groups a and b (this is a simplified version of the actual data frame):
roll_group
NA
3
5
7
9
NA
13
15
17
19
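For reference, the split/apply/combine route mentioned above might look like this; a minimal sketch assuming zoo is installed and the rows are already ordered by group, as they are here:
library(zoo)
# rolling sums per group, then stitched back together in group order
tmp$roll_group <- unlist(lapply(split(tmp$y, tmp$g),
                                function(v) rollsum(v, 2, align = "right", fill = NA)))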
Here is the data.table way:
library(data.table)
tmp.dt <- data.table(tmp)
tmp.dt <- tmp.dt[, .(y = y, roll = cumsum(y)), by = g]
You can do it with dplyr package as well.
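For instance, a sketch of the dplyr equivalent of the data.table answer above:
library(dplyr)
tmp %>%
  group_by(g) %>%
  mutate(roll = cumsum(y)) %>%
  ungroup()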
Thanks, but the answers provided in this post use cumsum, whereas I need a rolling sum that yields NA when there aren't enough lagged values. I solved it this way:
library(zoo)    # rollsum()
library(dplyr)

# function to calculate the rolled sum; returns a vector the same length as x
roll <- function(x, lags) {
  if (length(x) < lags) {
    tmp <- rep(NA, length(x))   # not enough lagged values in this group
  } else {
    tmp <- rollsum(x, lags, align = "right", fill = NA)
  }
  as.numeric(tmp)
}

tmp1 <- tmp %>%
  group_by(g) %>%
  mutate(roll_group = roll(y, 2)) %>%
  ungroup()
How about wrapping it in tapply() (or lapply() with split()):
tapply(y, tmp$g, cumsum)
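The lapply()-with-split() spelling of the same grouped cumulative sum, returning a named list:
lapply(split(y, tmp$g), cumsum)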
Consider this base solution with sapply(), which combines a running group mask with a running sum:
tmp$roll <- sapply(1:nrow(tmp), function(i) {
  # sum of y over rows 1..i that belong to the same group as row i
  sum((tmp$g[1:i] == tmp$g[i]) * tmp$y[1:i])
})
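If a grouped running total is what you are after, ave() produces the same column without the explicit loop; a base-R equivalent of the sapply() above:
tmp$roll <- ave(tmp$y, tmp$g, FUN = cumsum)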
I've been using some code to compute means for specific variable values (demographic breaks); however, I now have data with a weight variable and need to calculate weighted means instead. I was wondering if it's possible to change or adjust the function to calculate the weighted mean. Here is some code to generate sample data:
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))
So far I've been using this code to get sample means:
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) mean(x, na.rm = TRUE))))
> Gender_Profile_1
sapply.subset.df..gender....1...FUN...function.x..mean.x..na.rm...TRUE..
gender 1.000
agegroup 4.200
attitude_1 4.000
attitude_2 4.000
attitude_3 2.300
income 77274.700
weight 3.016
As you can see it generates Gender_Profile_1 with the means for all variables.
In my attempt to calculate the weighted mean, I've tried to change the "FUN=" part to this
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
but I get the following error message:
Error in weighted.mean.default(x, w = weight, na.rm = TRUE) :
'x' and 'w' must have the same length
I've been trying all kinds of permutations of df$weight and df$x, but nothing seems to work.
Any help or ideas would be great. Many thanks
Base R
If you want to stick to base R, you can do the following:
# define a function to return all weighted means
all_wmeans <- function(data_subset) {
  # which cols to summarise? all but gender and weight
  summ_cols <- setdiff(names(data_subset), c('gender', 'weight'))
  # for each col, calc the weighted mean with weights from the 'weight' column
  result <- lapply(data_subset[, summ_cols],
                   weighted.mean, w = data_subset$weight)
  # squeeze the resulting list back into a data.frame and return
  return(data.frame(result))
}
# now, split the df on gender, and apply the func to each chunk
lapply(split(df, df$gender), all_wmeans)
The result is a list of two data frames, for each value of gender:
$`1`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.397546 4.027851 3.950597 1.962202 74985.25
$`2`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.092234 3.642666 3.676287 2.388872 64075.23
The fabulous data.table
If you don't mind using packages, dplyr and data.table are great packages that make this kind of stuff much simpler. Here's data.table:
# load library and create a data.table object
library(data.table)
my_dt <- data.table(df)
# now it's a one liner:
my_dt[, lapply(.SD, weighted.mean, w=.SD$weight), by=gender]
which returns:
gender agegroup attitude_1 attitude_2 attitude_3 income weight
1: 2 4.092234 3.642666 3.676287 2.388872 64075.23 4.099426
2: 1 4.397546 4.027851 3.950597 1.962202 74985.25 3.904483
The data.table code also groups the rows by gender and uses lapply() to apply a function, plus an extra argument, to each Subset of Data (that is what .SD refers to). Conceptually it's exactly the same as the base R code, just more compact and faster.
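One nuance: the output above also "weights" the weight column itself, which is meaningless. A sketch that restricts .SD to the columns of interest (w = weight still resolves, because j is evaluated inside the data.table):
my_dt[, lapply(.SD, weighted.mean, w = weight), by = gender,
      .SDcols = setdiff(names(my_dt), c('gender', 'weight'))]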
You can do the whole lot at once like this:
sapply(1:2, function(y)
  sapply(subset(df, df$gender == y), function(x)
    weighted.mean(x, df$weight[df$gender == y])))
#> [,1] [,2]
#> gender 1.000000 2.000000
#> agegroup 4.397546 4.092234
#> attitude_1 4.027851 3.642666
#> attitude_2 3.950597 3.676287
#> attitude_3 1.962202 2.388872
#> income 74985.247679 64075.232966
#> weight 3.904483 4.099426
I think the main problem with your code is that you refer to the weight column inside the sapply() loop, but that column has not been subsetted the way df has. You can fix this by subsetting the weights before the sapply() and then using that subsetted vector inside the loop.
Using the code you posted:
weight <- subset(df, gender==1)[,"weight"]
#Exactly the same code you posted
assign("Gender_Profile_2",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
Here is another solution using apply, that might be easier to implement:
#Apply the desired function by columns
apply(subset(df, gender==1), 2, FUN = function(x) mean(x, na.rm = TRUE))
#Get the weights of the rows that have gender == 1
weight <- subset(df, gender==1)[,7]
#Apply the weighted mean function
apply(subset(df[,-7], gender==1), 2, FUN = function(x) weighted.mean(x, w = weight, na.rm = TRUE))
I have a categorical variable with three levels (A, B, and C).
I also have a continuous variable with some missing values on it.
I would like to replace the NA values with the mean of its group. This is, missing observations from group A has to be replaced with the mean of group A.
I know I can just calculate each group's mean and replace the missing values manually, but I'm sure there's a more efficient way to do so, perhaps with a loop.
A <- subset(data, group == "A")
mean(A$variable, na.rm = TRUE)
A$variable[which(is.na(A$variable))] <- mean(A$variable, na.rm = TRUE)
Now, I understand I could do the same for group B and C, but perhaps a for loop (with if and else) might do the trick?
require(dplyr)
data %>%
  group_by(group) %>%
  mutate(variable = ifelse(is.na(variable), mean(variable, na.rm = TRUE), variable))
For a faster, base-R version, you can use ave():
data$variable <- ave(data$variable, data$group,
                     FUN = function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
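An equivalent ave() formulation using replace(), which some find easier to read than ifelse():
data$variable <- ave(data$variable, data$group,
                     FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))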
You could use the data.table package to achieve this:
library(data.table)
# columns whose NAs should be replaced
tomean <- c("var1", "var2")
setDT(dat)
dat[, (tomean) := lapply(tomean, function(x) {
  x <- get(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})]
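As written, this fills each NA with the overall column mean; for the per-group means the question asks about, add a by clause. A hypothetical toy input (the g, var1, and var2 names are illustrative, matching tomean above):
dat <- data.frame(g = c("A", "A", "B", "B"),
                  var1 = c(1, NA, 3, 5),
                  var2 = c(NA, 2, 4, NA))
setDT(dat)
dat[, (tomean) := lapply(tomean, function(x) {
  x <- get(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}), by = g]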
I have a dataset with 4 columns: the first is a date, the other 3 are numeric. I am trying to get the % difference from the previous row for the 3 numeric columns. I know there are already posts about this kind of question, e.g. df %>% mutate_each(funs(. - lag(.))) %>% na.omit(), but most of them don't handle the date: I want the date to stay unchanged, and I need the % difference rather than the raw difference.
Here is the dataset:
date=c('2018-01-01', '2018-02-01', '2018-03-01')
a=c(1,3,2)
b=c(89,56,47)
c=c(1872,7222,2930)
x=data.frame(date,a,b,c)
I wish the final dataset to look like this:
a=c(NA, 2, -0.333)
b=c(NA, -0.371, -0.161)
c=c(NA,2.857, -0.594)
x=data.frame(date,a,b,c)
which means for column a, 2 = 3/1 - 1 and -0.333 = 2/3 - 1;
for column b, -0.371 = 56/89 - 1, etc.
Thank you so much for your help!
A solution using package data.table:
library(data.table)
x <- as.data.table(x)
cols <- c("a", "b", "c")
x[, (paste0(cols, "_pctChange")) := lapply(.SD, function(col) {
  (col - shift(col, 1, type = "lag")) / shift(col, 1, type = "lag")
}), .SDcols = cols]
The quantmod package has a very useful function for exactly this, called Delt().
All you would need to do is the following:
library(quantmod)
x[-1] <- sapply(x[-1], Delt)
I'm not sure how familiar you are with sapply, but if you wanted to access Delt()'s parameters to tweak your calculation, you could try something like:
x[-1] <- sapply(x[-1], function(x) { Delt(x, k=2) })
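For comparison, a dplyr sketch of the lag() idea mentioned in the question, leaving the date column untouched (across() assumes dplyr >= 1.0):
library(dplyr)
x %>% mutate(across(c(a, b, c), ~ .x / lag(.x) - 1))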
Say I have some data of the following kind:
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
I want a new dataframe that keeps the 10 original columns, but for every column retains only the highest 10 and lowest 10 values. Importantly, the rows have names corresponding to id values that need to be kept in the new data frame.
Thus, the end result data.frame is going to have dimensions m by 10, where m is very likely more than 20, but every column should contain only 20 valid values.
The only way I can think of doing this is manually, column by column: using dplyr's arrange, grabbing the top and bottom rows, and then creating a matrix from all the individual vectors. Clearly this is inefficient. Help?
Assuming you want to keep all the rows from the original dataset, where there is at least one value satisfying your condition (value among ten largest or ten smallest in the given column), you could do it like this:
# create a data frame
df<-as.data.frame(matrix(rnorm(10*10000, 1, .5), ncol=10))
# function to keep the lowest 10 and highest 10 values, setting the rest to NA
# (note: rank(), not order(), gives a positional mask)
lowHigh <- function(x) {
  test <- x
  test[!(rank(x) <= 10 | rank(x) >= (length(x) - 9))] <- NA
  test
}
# apply the function defined above
test2 <- apply(df, 2, lowHigh)
# use the original rownames
rownames(test2) <- rownames(df)
# keep only rows with at least one retained value
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
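A quick sanity check that the filter behaved as intended (every kept row has at least one non-NA entry):
stopifnot(all(rowSums(!is.na(finalData)) >= 1))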
Please note that there is definitely some smarter way of doing it...
Here is the data matrix with the 10 highest and 10 lowest values in each column:
x <- apply(df, 2, function(k) k[order(k, decreasing = TRUE)[c(1:10, (length(k) - 9):length(k))]])
x is your 20 by 10 matrix.
Your rownames requirement conflicts column by column: this matrix has only 20 rows altogether, and the source rows cannot be the same for all 10 columns. Instead, here is your order matrix:
x_roworder <- apply(df, 2, function(k) order(k, decreasing = TRUE)[c(1:10, (length(k) - 9):length(k))])
This gives you the corresponding rows of the original data matrix within each column.
I offer a couple of answers to this.
A base R implementation (I have used %>% to make it easier to read):
library(magrittr)  # just for %>%
ix <- lapply(df, function(x) order(x)[-(1:(length(x) - 20) + 10)]) %>%
  unlist %>% unique %>% sort
df[ix, ]
This abuses the fact that data frames are lists: it finds the row indices satisfying the condition for each column, then keeps the unique ones, in order, as the row indices to retain. This should preserve any row names attached to df.
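The index arithmetic is easy to misread, so here it is in isolation: positions 11 through n - 10 of the sorted order are dropped, leaving the 10 smallest and 10 largest.
n <- 10000
drop_idx <- 1:(n - 20) + 10            # sorted positions 11 .. n - 10
length(order(rnorm(n))[-drop_idx])     # 20: the 10 smallest and 10 largest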
An alternative using dplyr (since you mentioned it), which if I remember correctly doesn't particularly like row names:
library(dplyr)
library(tidyr)
# add id as a variable
df$id <- 1:nrow(df)  # or row names
df %>%
  gather("col", value, -id) %>%
  group_by(col) %>%
  filter(min_rank(value) <= 10 | min_rank(desc(value)) <= 10) %>%
  ungroup %>%
  select(id) %>%
  distinct %>%
  left_join(df)
I'm not entirely sure what you're expecting as the return / output, but this will get you the appropriate indices:
# example data
set.seed(41234L)
N <- 1000
df<-data.frame(id= 1:N, matrix(rnorm(10*N, 1, .5), ncol=10))
# for each column, extract IDs of the top 10 and bottom 10 values
l1 <- lapply(df[, 2:11], function(x, y, n) {
  xy <- data.frame(x, y)
  xy <- xy[order(xy[, 1]), ]
  return(xy[c(1:10, (n - 9):n), 2])
}, y = df[, 1], n = N)
# check:
xx <- sort(df[,2])
all.equal(sort(df[l1[[1]], 2]), xx[c(1:10, 991:1000)])
[1] TRUE
If you want an m * 10 matrix with these unique values, where m is the number of unique indices, you could do:
l2 <- do.call("c", l1)
l2 <- unique(l2)
df2 <- df[l2,] # in this case, m == 189
This doesn't zero out or NA the columns you're not searching on for each row, but it's unclear whether your question needs that.
Note
This isn't as efficient as using data.table since you're going to get a copy of the data in xy <- data.frame(x,y)
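A hedged data.table sketch of the same idea, using the id column from the example above and avoiding the per-column data.frame copies (frank() is data.table's fast rank):
library(data.table)
dt <- as.data.table(df)
long <- melt(dt, id.vars = "id")
idx <- long[, .(id = id[frank(value) <= 10 | frank(-value) <= 10]), by = variable]
df2 <- df[unique(idx$id), ]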
Benchmark
library(microbenchmark)
microbenchmark(ira= {
test2 <- apply(df[,2:11], 2, lowHigh);
rownames(test2) <- rownames(df);
finalData <- test2[apply(apply(test2, 2, is.na), 1, sum) < 10, ]
},
alex= {
l1 <- lapply(df[,2:11], function(x,y, n) {
xy <- data.frame(x,y)
xy <- xy[order(xy[,1]),]
return(xy[c(1:10, (n-9):n),2])
}, y= df[,1], n = N);
l2 <- unique(do.call("c", l1));
df2 <- df[l2,]
}, times= 50L)
Unit: milliseconds
expr min lq mean median uq max neval cld
ira 4.360452 4.522082 5.328403 5.140874 5.560295 8.369525 50 b
alex 3.771111 3.854477 4.054388 3.936716 4.158801 5.654280 50 a
I have a dataframe with something like 90 variables and over 1 million observations, and I want to calculate the percentage of NA values in each variable. I have the following code:
sum(is.na(dataframe$variable)) / nrow(dataframe) * 100
My question is: how can I apply this to all 90 variables without having to type every variable name in the code?
Use lapply() with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
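Since the mean of a logical vector is the proportion of TRUEs, sapply() gives the same numbers as a named vector:
sapply(df, function(x) mean(is.na(x)) * 100)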
If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise_each from the dplyr package:
library(dplyr)
df %>%
  summarise_each(funs(sum(is.na(.)) / length(.)))
or, even more concisely:
df %>% summarise_each(funs(mean(is.na(.))))
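In recent dplyr versions, where summarise_each is deprecated, across() expresses the same thing (a sketch assuming dplyr >= 1.0):
df %>% summarise(across(everything(), ~ mean(is.na(.x))))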
data
df <- data.frame(
x = 1:10,
y = 1:10,
z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA