Split and Diff function in R - r

I have a data frame called data. I am splitting the data using split function by an attribute called KEY.
data <- split(data, data$KEY);
After splitting the dataframe by KEY, what we get is data for individual firms. dataframe data had the data for all the firms in the universe. After the split, each individual split has two columns, year and sales. For each split, I have to calculate incremental sales corresponding to each year. For instance, if we have data 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20. What I am interested in getting would be 2003-2, 2004 -3, 2005 - 5, for each split.
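As a small illustration of the arithmetic described above, base R's diff returns the change between consecutive values of a vector. The vector names below are made up for the example, and the years are assumed to be sorted already.

```r
# Hypothetical toy vectors mirroring the 2002-2005 example above
year  <- c(2002, 2003, 2004, 2005)
sales <- c(10, 12, 15, 20)

# diff() returns n - 1 differences, so the first year is dropped
incremental <- data.frame(year = year[-1], sales_diff = diff(sales))
incremental
```

The first year has no predecessor, which is why each split ends up one row shorter.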
I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
  # Sort by year so diff() compares consecutive years
  data <- data[order(data$year), ]
  sale_data <- diff(data$SALE)
  data <- data[-1, ]
  data$SALE <- sale_data
  return(data)
}
Currently, I am using for loop:
mod_data <- data.frame()
for (key in names(data)) {
  a <- try(mod_sale(data[[key]]))
  if (inherits(a, "try-error")) next
  mod_data <- rbind(mod_data, a)
}
I think there is some way to use sapply (and maybe plyr too). Can someone help me improve this R code? I am not sure how the sapply code would go.
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
a <- data.frame();
key <- c(1,1,1,1,2,2,2,2,2,3,3,3);
sales <- c(12,12,15,8,3,6,3,9,9,12,3,7);
year <- c(2002,2003,2004,2005,2001,2002,2003,2004,2005,2003,2004,2005);
ovar <- runif(12,5.0,7.5);
a <- data.frame(key,sales,year,ovar)
In the resulting data.frame, I expect incremental sales rather than actual sales. We will lose 3 data points, one for the starting year of each of the 3 keys, because we are taking differences. So the resulting data.frame will have three fewer rows, with columns key, diff(sales), year, and ovar.

This is what I would have done:
a$diffsales <- ave( a$sales, a$key, FUN=function(x) c(NA, diff(x) ) )
a
   key sales year     ovar diffsales
1    1    12 2002 6.845177        NA
2    1    12 2003 6.328153         0
3    1    15 2004 6.872669         3
4    1     8 2005 6.098920        -7
5    2     3 2001 7.154824        NA
6    2     6 2002 6.110810         3
7    2     3 2003 5.906624        -3
8    2     9 2004 5.214369         6
9    2     9 2005 5.818218         0
10   3    12 2003 5.354354        NA
11   3     3 2004 6.728992        -9
12   3     7 2005 7.412213         4
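The question asked for a result without the first year of each key. A minimal sketch of that last step follows. It rebuilds the example data without the random ovar column and assumes the rows are already ordered by year within each key. The name res is only illustrative.

```r
a <- data.frame(
  key   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  sales = c(12, 12, 15, 8, 3, 6, 3, 9, 9, 12, 3, 7),
  year  = c(2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005,
            2003, 2004, 2005)
)

# Per-key differences, padded with NA for each key's first year
a$diffsales <- ave(a$sales, a$key, FUN = function(x) c(NA, diff(x)))

# Drop the padded rows: one per key, so three fewer rows in total
res <- a[!is.na(a$diffsales), ]
nrow(res)
```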

I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
sale = sample(20,10))
df <- rbind(df,df,df)
df$key <- rep(letters[1:3],each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster+better answers you'll get.
I'd recommend sorting before splitting:
#Sort first (already sorted, but you get the idea)
df <- df[order(df$key,df$year),]
df_split <- split(df,df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split,function(x) {x$sale_diff <- c(NA,diff(x$sale)); x[-1,]})
You'd put it all together again using:
do.call(rbind,out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
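Putting the sort, split, lapply and rbind steps together, here is a minimal sketch on deterministic data. The data frame and its values are made up for illustration, with rows deliberately out of year order to show why sorting first matters.

```r
# Hypothetical data: two firms, rows deliberately out of year order
df <- data.frame(
  key  = c("a", "a", "a", "b", "b", "b"),
  year = c(2003, 2001, 2002, 2002, 2001, 2003),
  sale = c(15, 10, 12, 7, 4, 9)
)

# Sort by key and year so diff() compares consecutive years
df <- df[order(df$key, df$year), ]

# One data frame per key; compute the change and drop each first year
out <- lapply(split(df, df$key), function(x) {
  x$sale_diff <- c(NA, diff(x$sale))
  x[-1, ]
})

# Stack the pieces back into a single data frame
res <- do.call(rbind, out)
res
```

lapply always returns a plain list of data frames, which is what do.call(rbind, ...) expects. sapply would first try to simplify that list into another structure.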

Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
#     key sales year     ovar sale_diff
#  1:   1    12 2002 7.416857        NA
#  2:   1    12 2003 5.625818         0
#  3:   1    15 2004 5.018934         3
#  4:   1     8 2005 6.671986        -7
#  5:   2     3 2001 6.242739        NA
#  6:   2     6 2002 6.297763         3
#  7:   2     3 2003 6.482124        -3
#  8:   2     9 2004 6.724256         6
#  9:   2     9 2005 5.071265         0
# 10:   3    12 2003 6.136681        NA
# 11:   3     3 2004 6.974392        -9
# 12:   3     7 2005 6.517553         4

Related

R- Referencing different dataframes in a loop

I am brand new to R, so if I'm thinking about this completely wrong, feel free to tell me. I have a series of imported data frames on power plants, one for each year (Plant1987, Plant1988, etc.), that I am ultimately trying to combine into one data frame. Before doing so, I'd like to add a "year" variable to each data frame. I could do this for each data frame individually, but I would like to formalize it and do it in one step. I know how to do it in Stata, but I'm struggling here.
I was thinking something along the lines of:
for (y in 1987:2008) {
paste("Plant",y,sep="")$year <- y
}
which doesn't work because paste is obviously not the right function. Is there a smart, quick way to do this? Thanks
Try this:
year=seq(1987,2008,by=1)
list_object_names = sprintf("Plant%s", 1987:2008)
list_DataFrame = lapply(list_object_names, get)
for (i in 1:length(list_DataFrame ) ){
list_DataFrame[[i]][,'Year']=year[i]
}
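The same idea can be written without an index loop and finished with the combining step the question asked about. This is only a sketch: the two small data frames are hypothetical stand-ins for the imported ones, and mget is assumed to find them in the calling environment.

```r
# Hypothetical stand-ins for the imported data frames
Plant1987 <- data.frame(plantID = 1:2, x = c(0.1, 0.2))
Plant1988 <- data.frame(plantID = 1:2, x = c(0.3, 0.4))

years  <- 1987:1988
plants <- mget(paste0("Plant", years))   # named list of data frames

# Add the matching year to each data frame, then stack them
plants   <- Map(function(d, y) { d$year <- y; d }, plants, years)
combined <- do.call(rbind, plants)
combined
```

Keeping the data frames in a named list means later steps can use lapply or Map instead of building object names as strings.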
Here is some code to give you some ideas. I used the mtcars data frame as an example to create a list with three data frames. After that, I used two solutions to add the year (2000 to 2002) to each data frame. You will need to modify the code for your data.
# Load the mtcars data frame
data(mtcars)
# Create a list with three data frames
ex_list <- list(mtcars, mtcars, mtcars)
# Create a vector with three years: 2000 to 2002
year_list <- 2000:2002
Solution 1: Use lapply from base R
ex_list2 <- lapply(1:3, function(i) {
dt <- ex_list[[i]]
dt[["Year"]] <- year_list[[i]]
return(dt)
})
Solution 2: Use map2 from purrr
library(purrr)
ex_list3 <- map2(ex_list, year_list, .f = function(dt, year){
dt$Year <- year
return(dt)
})
ex_list2 and ex_list3 are the final output.
Let's say you have data.frames
Plant1987 <- data.frame(plantID=1:4, x=rnorm(4))
Plant1988 <- data.frame(plantID=1:4, x=rnorm(4))
Plant1989 <- data.frame(plantID=1:4, x=rnorm(4))
You could put a $year column in each with
year <- 1987:1989
for(yeari in year) {
eval(parse(text=paste0("Plant",yeari,"$year<-",yeari)))
}
Plant1987
# plantID x year
# 1 1 0.67724230 1987
# 2 2 -1.74773250 1987
# 3 3 0.67982621 1987
# 4 4 0.04731677 1987
# ...etc for other years...
...and either bind them together into one data.frame with
df <- Plant1987
for(yeari in year[-1]) {
df <- rbind(df, eval(parse(text=paste0("Plant",yeari))))
}
df
# plantID x year
# 1 1 0.677242300 1987
# 2 2 -1.747732498 1987
# 3 3 0.679826213 1987
# 4 4 0.047316768 1987
# 5 1 1.043299473 1988
# 6 2 0.003758675 1988
# 7 3 0.601255190 1988
# 8 4 0.904374498 1988
# 9 1 0.082030356 1989
# 10 2 -1.409670456 1989
# 11 3 -0.064881722 1989
# 12 4 1.312507736 1989
...or in a list as
itsalist <- list()
for(yeari in year) {
eval(parse(text=paste0("itsalist$Plant",yeari,"<-Plant",yeari)))
}
itsalist
# $Plant1987
# plantID x year
# 1 1 0.67724230 1987
# 2 2 -1.74773250 1987
# 3 3 0.67982621 1987
# 4 4 0.04731677 1987
#
# $Plant1988
# plantID x year
# 1 1 1.043299473 1988
# 2 2 0.003758675 1988
# 3 3 0.601255190 1988
# 4 4 0.904374498 1988
#
# $Plant1989
# plantID x year
# 1 1 0.08203036 1989
# 2 2 -1.40967046 1989
# 3 3 -0.06488172 1989
# 4 4 1.31250774 1989
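If you would rather not build code as strings, get and assign can do the same in-place update that the loop in the question was reaching for. A minimal sketch, again with hypothetical stand-in data frames:

```r
# Hypothetical stand-ins for the imported data frames
Plant1987 <- data.frame(plantID = 1:2, x = c(0.1, 0.2))
Plant1988 <- data.frame(plantID = 1:2, x = c(0.3, 0.4))

for (y in 1987:1988) {
  nm  <- paste0("Plant", y)   # object name as a string
  tmp <- get(nm)              # fetch the data frame by name
  tmp$year <- y
  assign(nm, tmp)             # write the modified copy back
}
Plant1988
```

paste only produces a character string. get and assign are what turn that string into a lookup or an assignment of the object with that name.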

Calculate row sum value in R

Hi, I am new to R and would like some advice on how to compute sums within a data frame.
      year value
Row 1 2001    10
Row 2 2001    20
Row 3 2002    15
Row 4 2002    NA
Row 5 2003     5
How can I use R to return the total sum value by year? Many thanks!
      year sum value
Row 1 2001        30
Row 2 2002        15
Row 3 2003         5
There are lots of ways to do that.
One of them is using the function aggregate like this:
year <- c(2001,2001,2002,2002,2003)
value <- c(10,20,15,NA,5)
mydf<-data.frame(year,value)
mytable <- aggregate(mydf$value, by=list(year), FUN=sum, na.rm=TRUE)
colnames(mytable) <- c('Year','sum_values')
> mytable
  Year sum_values
1 2001         30
2 2002         15
3 2003          5
There is also rowsum, which is quite efficient
with(mydf, rowsum(value, year, na.rm=TRUE))
# [,1]
# 2001 30
# 2002 15
# 2003 5
Or tapply
with(mydf, tapply(value, year, sum, na.rm=TRUE))
# 2001 2002 2003
# 30 15 5
Or as.data.frame(xtabs(...))
as.data.frame(xtabs(mydf[2:1]))
# year Freq
# 1 2001 30
# 2 2002 15
# 3 2003 5
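Another base R spelling is the formula interface of aggregate, sketched here on the same example data. The name totals is only illustrative.

```r
mydf <- data.frame(year  = c(2001, 2001, 2002, 2002, 2003),
                   value = c(10, 20, 15, NA, 5))

# The formula interface drops rows with NA by default (na.action = na.omit)
totals <- aggregate(value ~ year, data = mydf, FUN = sum)
totals
```

One caveat of this form: because rows with NA are removed before grouping, a year whose values are all NA would be missing from the result rather than showing a sum of 0.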
LyzandeR has provided a working answer in base R. If you want to use dplyr which is a great data management tool you could do:
year <- c(2001,2001,2002,2002,2003)
value <- c(10,20,15,NA,5)
mydf<-data.frame(year,value)
library(dplyr)
mydf %>%
  group_by(year) %>%
  summarise(sum_values = sum(value, na.rm = TRUE))
An advantage of dplyr here is that, for larger datasets, it is often faster than the base R aggregate approach. I also believe it's much more readable.

Counting unique values across variables (columns) in R

I have a large dataset with repeated measures over 5 time periods.
2012 2009 2006 2003 2000
3 1 4 4 1
5 3 2 2 3
6 7 3 5 6
I want to add a new column, which is the number of unique values among years 2000 to 2012. e.g.,
2012 2009 2006 2003 2000 nunique
3 1 4 4 1 3
5 3 2 2 3 3
6 7 3 5 6 4
I am working in R and, if it helps, there are only 14 possible different values of the measured value at each time period.
I found this page: Count occurrences of value in a set of variables in R (per row) and tried the various solutions offered on it. What it gives me however is a count of each value, not the number of unique values.
Other similar questions on here seem to ask about counting number of unique values within a variable /column, rather than across each row.
Any suggestions would be appreciated.
Here's one alternative
> df$nunique <- apply(df, 1, function(x) length(unique(x)))
> df
2012 2009 2006 2003 2000 nunique
1 3 1 4 4 1 3
2 5 3 2 2 3 3
3 6 7 3 5 6 4
If you have a large dataset, you may want to avoid looping over the rows, but use a faster framework, like S4Vectors:
df <- data.frame('2012'=c(3,5,6),
'2009'=c(1,3,7),
'2006'=c(4,2,3),
'2003'=c(4,2,5),
'2000'=c(1,3,6))
dup <- S4Vectors:::duplicatedIntegerPairs(as.integer(as.matrix(df)), row(df))
dim(dup) <- dim(df)
rowSums(!dup)
Or, the matrixStats package:
m <- as.matrix(df)
mode(m) <- "integer"
rowSums(matrixStats::rowTabulates(m) > 0)
The trick is to use 'apply' and assign each row to a variable (e.g. x). You can then write a custom function, in this case one that uses 'unique' and 'length' to get the answer that you want.
df <- data.frame('2012'=c(3,5,6), '2009'=c(1,3,7), '2006'=c(4,2,3), '2003'=c(4,2,5), '2000'=c(1,3,6))
df$nunique = apply(df, 1, function(x) {length(unique(x))})
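Two details can trip this up in practice, so here is a hedged variant. By default data.frame renames columns such as 2012 to X2012, and applying over the whole data frame a second time would also count the nunique column. The year_cols vector below is a name introduced for this sketch.

```r
# check.names = FALSE keeps "2012" etc. as column names instead of "X2012"
df <- data.frame("2012" = c(3, 5, 6), "2009" = c(1, 3, 7),
                 "2006" = c(4, 2, 3), "2003" = c(4, 2, 5),
                 "2000" = c(1, 3, 6), check.names = FALSE)

year_cols <- c("2012", "2009", "2006", "2003", "2000")

# Only the year columns are passed to apply, so re-running this line
# does not count the nunique column itself
df$nunique <- apply(df[year_cols], 1, function(x) length(unique(x)))
df$nunique
```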
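Try this one out (note that sapply on a data frame iterates over columns, so this counts unique values per column, not per row):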
sapply(data, function(x) length(unique(x)))

get z standardized score within each group

Here is the data.
set.seed(23)
data <- data.frame(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group, so I tried:
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, this is not what I want, because I would like to add the result as a new variable following 'value'.
How can I do that? Kindly provide some suggestions (sample code). Any help is greatly appreciated.
Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
No need to call split nor lapply.
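For readers without the weights package, the same ave pattern works with a hand-written z-score. In this sketch zscore is a hypothetical helper standing in for stdz. It is an unweighted (x - mean) / sd, and treating that as equivalent to stdz's defaults is an assumption.

```r
set.seed(23)
data <- data.frame(ID = 1:12, group = rep(1:3, times = 4),
                   value = rnorm(12, mean = 0.5, sd = 0.3))

# Hand-written z-score (hypothetical helper), applied within each group by ave()
zscore <- function(x) (x - mean(x)) / sd(x)
data$stan <- ave(data$value, data$group, FUN = zscore)

# Each group should now have mean ~0 and standard deviation ~1
tapply(data$stan, data$group, mean)
tapply(data$stan, data$group, sd)
```

Because ave returns a vector aligned with the original rows, the new column lands next to value without any split or rbind step.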
One way using data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
The result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632
I tried Ferdinand.Kraft's solution but it didn't work for me. I think the stdz function isn't included in the basic R install. Moreover, the within part troubled me in a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
Add the new column while in your function, and have the function return the whole data frame.
stanL<-lapply(data_split, function(x) {
x$stan <- stdz(x$value)
x
})
stan <- do.call(rbind, stanL)

Producing a rolling average of ALL the previous observations per ID in an unbalanced panel data set

I am trying to compute rolling means of an unbalanced data set. To illustrate my point I have produced this toy example of my data:
ID year Var RollingAvg(Var)
1 2000 2 NA
1 2001 3 2
1 2002 4 2.5
1 2003 2 3
2 2001 2 NA
2 2002 5 2
2 2003 4 3.5
The column RollingAvg(Var) is what I want, but can't get. In words, I am looking for the rolling average of ALL the previous observations of Var for each ID. I have tried using rollapply and ddply in the zoo and the plyr package, but I can't see how to set the rolling window length to use ALL the previous observations for each ID. Maybe I should use the plm package instead? Any help is appreciated.
I have seen other posts on rolling means on BALANCED panel data set, but I can't seem to extrapolate their answers to unbalanced data.
Thanks,
M
Using data.table:
library(data.table)
d = data.table(your_df)
d[, RollingAvg := {avg = cumsum(Var)/seq_len(.N);
c(NA, avg[-length(avg)])},
by = ID]
(or even simplified)
d[, RollingAvg := c(NA, head(cumsum(Var)/(seq_len(.N)), -1)), by = ID]
Assuming that years are contiguous within each ID (which is the case in the example data) and DF is the input data frame, here is a solution using just base R. cumRoll is a function that performs the required operation on one ID and ave then performs it by ID:
cumRoll <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))
DF$Roll <- ave(DF$Var, DF$ID, FUN = cumRoll)
The result is:
> DF
ID year Var Roll
1 1 2000 2 NA
2 1 2001 3 2.0
3 1 2002 4 2.5
4 1 2003 2 3.0
5 2 2001 2 NA
6 2 2002 5 2.0
7 2 2003 4 3.5
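To reproduce this end to end, here is a self-contained sketch that rebuilds the toy data from the question. The explicit sort is an added safeguard, since the cumulative mean only makes sense when rows are in time order within each ID.

```r
DF <- data.frame(ID   = c(1, 1, 1, 1, 2, 2, 2),
                 year = c(2000, 2001, 2002, 2003, 2001, 2002, 2003),
                 Var  = c(2, 3, 4, 2, 2, 5, 4))

# Make sure rows are in time order within each ID before accumulating
DF <- DF[order(DF$ID, DF$year), ]

# Running mean up to each row, shifted down one so row i sees rows 1..i-1
cumRoll <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))
DF$Roll <- ave(DF$Var, DF$ID, FUN = cumRoll)
DF$Roll
```

Nothing here depends on every ID having the same number of years, which is why the approach carries over to an unbalanced panel.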
