R: Referencing different data frames in a loop

I am brand new to R, so if I'm thinking about this completely wrong, feel free to tell me. I have a series of imported data frames on power plants, one for each year (Plant1987, Plant1988, etc.), that I am ultimately trying to combine into one data frame. Before doing so, I'd like to add a "year" variable to each data frame. I could do this for each data frame individually, but I would like to formalize it and do it in one step. I know how to do it in Stata, but I'm struggling here.
I was thinking something along the lines of:
for (y in 1987:2008) {
paste("Plant",y,sep="")$year <- y
}
which doesn't work because paste is obviously not the right function. Is there a smart, quick way to do this? Thanks

Try this ..
year=seq(1987,2008,by=1)
list_object_names = sprintf("Plant%s", 1987:2008)
list_DataFrame = lapply(list_object_names, get)
for (i in 1:length(list_DataFrame ) ){
list_DataFrame[[i]][,'Year']=year[i]
}
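A minimal sketch of the same idea that also keeps the object names and stacks the result into one data frame. The three small Plant data frames are hypothetical stand-ins for the imported ones; mget, Map, and do.call(rbind, ...) are base R.

```r
# Hypothetical stand-ins for the imported data frames
Plant1987 <- data.frame(plantID = 1:2, x = c(0.5, 1.5))
Plant1988 <- data.frame(plantID = 1:2, x = c(2.5, 3.5))
Plant1989 <- data.frame(plantID = 1:2, x = c(4.5, 5.5))

years <- 1987:1989
plant_names <- paste0("Plant", years)

# mget() fetches several objects by name and returns them as a named list
plants <- mget(plant_names)

# Map() walks the data frames and the years in parallel and adds the column
plants <- Map(function(df, y) {
  df$Year <- y
  df
}, plants, years)

# Stack the list into a single data frame
all_plants <- do.call(rbind, plants)
rownames(all_plants) <- NULL
```

Note that this works on copies held in the list; the original Plant1987, Plant1988, ... objects are left unchanged.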

Here is some code to give you some ideas. I used the mtcars data frame as an example to create a list with three data frames. After that, I used two solutions to add the year (2000 to 2002) to each data frame. You will need to modify the code for your data.
# Load the mtcars data frame
data(mtcars)
# Create a list with three data frames
ex_list <- list(mtcars, mtcars, mtcars)
# Create a list with three years: 2000 to 2002
year_list <- 2000:2002
Solution 1: Use lapply from base R
ex_list2 <- lapply(1:3, function(i) {
dt <- ex_list[[i]]
dt[["Year"]] <- year_list[[i]]
return(dt)
})
Solution 2: Use map2 from purrr
library(purrr)
ex_list3 <- map2(ex_list, year_list, .f = function(dt, year){
dt$Year <- year
return(dt)
})
ex_list2 and ex_list3 are the final output.

Let's say you have data.frames
Plant1987 <- data.frame(plantID=1:4, x=rnorm(4))
Plant1988 <- data.frame(plantID=1:4, x=rnorm(4))
Plant1989 <- data.frame(plantID=1:4, x=rnorm(4))
You could put a $year column in each with
year <- 1987:1989
for(yeari in year) {
eval(parse(text=paste0("Plant",yeari,"$year<-",yeari)))
}
Plant1987
# plantID x year
# 1 1 0.67724230 1987
# 2 2 -1.74773250 1987
# 3 3 0.67982621 1987
# 4 4 0.04731677 1987
# ...etc for other years...
...and either bind them together into one data.frame with
df <- Plant1987
for(yeari in year[-1]) {
df <- rbind(df, eval(parse(text=paste0("Plant",yeari))))
}
df
# plantID x year
# 1 1 0.677242300 1987
# 2 2 -1.747732498 1987
# 3 3 0.679826213 1987
# 4 4 0.047316768 1987
# 5 1 1.043299473 1988
# 6 2 0.003758675 1988
# 7 3 0.601255190 1988
# 8 4 0.904374498 1988
# 9 1 0.082030356 1989
# 10 2 -1.409670456 1989
# 11 3 -0.064881722 1989
# 12 4 1.312507736 1989
...or in a list as
itsalist <- list()
for(yeari in year) {
eval(parse(text=paste0("itsalist$Plant",yeari,"<-Plant",yeari)))
}
itsalist
# $Plant1987
# plantID x year
# 1 1 0.67724230 1987
# 2 2 -1.74773250 1987
# 3 3 0.67982621 1987
# 4 4 0.04731677 1987
#
# $Plant1988
# plantID x year
# 1 1 1.043299473 1988
# 2 2 0.003758675 1988
# 3 3 0.601255190 1988
# 4 4 0.904374498 1988
#
# $Plant1989
# plantID x year
# 1 1 0.08203036 1989
# 2 2 -1.40967046 1989
# 3 3 -0.06488172 1989
# 4 4 1.31250774 1989
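If the goal is to modify the named objects in place, get and assign can do the same job as eval(parse(...)) without building code as text. This is a sketch under the assumption that the data frames live in the current environment; the two small data frames are hypothetical examples.

```r
# Hypothetical example data frames
Plant1987 <- data.frame(plantID = 1:2, x = c(0.1, 0.2))
Plant1988 <- data.frame(plantID = 1:2, x = c(0.3, 0.4))

for (y in 1987:1988) {
  nm <- paste0("Plant", y)   # object name as a string, e.g. "Plant1987"
  df <- get(nm)              # look the object up by name
  df$year <- y               # add the year column
  assign(nm, df)             # write the modified copy back under the same name
}
```

Keeping the data frames in a list is usually easier to work with, but this version mirrors the loop in the question most closely.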

Related

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then, I need to move to the second to the last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater or equal to 2 that is the closest to 2) and return the day of that value, which is 4, as well. Then, move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and return the day of that value, which is 3, as well. Then, move on to 40 and do the same thing over again, move on to 34, then 30, then 23, then 12, the last value (or Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2)
I would like to store the results of these divisions but only the most recent result that is greater than or equal to 2. I would also like to return the day that result was achieved. Basically, I am trying to calculate doubling time for each day. I would also need this to be grouped by the Group. Normally, I would use dplyr for this but I am not sure how to link up a loop with dplyr to take advantage of group_by. Also, I could be overlooking lapply or some variation thereof. My expected dataframe with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
obs_n =c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division but I am lost as to how to move on to the next last value within each group. Right now, this is ignoring the group, though I obviously have not told it to group as I am unclear as to how to do this outside of dplyr.
for(i in 1:nrow(data))
data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i] , NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x){
if(x>=2){
return(x)
} else {
return(FALSE)
}
}
for (i in 1:nrow(data)){
data$test[i]<- division(data$test[i])
}
Now, this approach works, but only if I need to run it once on the last observation and only if I apply it to one group. I have 209 groups and many days that I would need to run this over. I am not sure how to combine the first for loop with the division function, and I am also totally lost as to how to do this by group and move on to the next-to-last values. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a data frame with two columns, divres and ind. The latter is the row index, which is used to calculate obs_n as shown below:
division <- function(x){
lenx <- length(x)
y <- vector(mode="numeric", length = lenx)
z <- vector(mode="numeric", length = lenx)
for (i in lenx:1){
y[i] <- ifelse(length(which(x[i]/x[1:i]>=2))==0,NA,x[i]/x[1:i] [max(which(x[i]/x[1:i]>=2))])
z[i] <- ifelse(is.na(y[i]),NA,max(which(x[i]/x[1:i]>=2)))
}
df <- data.frame(divres = y, ind = z)
return(df)
}
Check the output of division function created above using data$Num as input
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the data frame data, then use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired data frame data2:
data2 <- cbind.data.frame(data, division(data$Num)) %>% mutate(obs_n = Days[ind]) %>% select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
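The call above passes the whole Num column at once, so earlier groups can leak into later ones when their values happen to be small enough. A minimal base R sketch that applies the same search within each group is given below; the helper name doubling is hypothetical, and the data frame is the one from the question.

```r
data <- data.frame(
  Group = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Days  = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4),
  Num   = c(10, 12, 23, 30, 34, 40, 50, 60, 2, 4, 8, 12)
)

# Hypothetical helper: for each position i, find the most recent earlier
# position j with x[i] / x[j] >= 2 and report that ratio and its day.
doubling <- function(x, d) {
  n <- length(x)
  divres <- rep(NA_real_, n)
  obs_n <- rep(NA_real_, n)
  for (i in seq_len(n)[-1]) {
    prev <- seq_len(i - 1)
    hits <- prev[x[i] / x[prev] >= 2]
    if (length(hits) > 0) {
      j <- max(hits)
      divres[i] <- x[i] / x[j]
      obs_n[i] <- d[j]
    }
  }
  data.frame(divres = divres, obs_n = obs_n)
}

# Apply the helper to each group separately, then stack the pieces
pieces <- lapply(split(data, data$Group), function(g) {
  cbind(g, doubling(g$Num, g$Days))
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

Because split handles each group separately, a value in one group is never compared with values from another group.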
You can create a function with a for loop to get the desired day as given below. Then use that to get the divres in a dplyr mutation.
obs_n <- function(x, days) {
lst <- list()
for(i in length(x):1){
obs <- days[which(rev(x[i]/x[(i-1):1]) >= 2)]
if(length(obs)==0)
lst[[i]] <- NA
else
lst[[i]] <- max(obs)
}
unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
group_by(Group) %>%
mutate(obs_n=obs_n(Num, Days), divres=Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA  1  2  2  3  4
Compare with rank (default method="average"). Note that NAs are included at the end by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0
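For readers without dplyr, the dense ranks shown above can be reproduced in base R. This is a small sketch, not how dplyr implements it: each value is matched against the sorted unique non-missing values.

```r
x <- c(NA, NA, 1, 2, 2, 4, 6)

# sort() drops NAs, so missing values find no match and stay NA
dense <- match(x, sort(unique(x)))
```

Tied values share a rank and the next distinct value gets the next integer, which is the definition of dense ranking quoted above.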

Calculate row sum value in R

Hi, I am new to R and would like some advice on how to calculate sums within a data frame.
year value
Row 1 2001 10
Row 2 2001 20
Row 3 2002 15
Row 4 2002 NA
Row 5 2003 5
How can I use R to return the total sum value by year? Many thanks!
year sum value
Row 1 2001 30
Row 2 2002 15
Row 3 2003 5
There are lots of ways to do that.
One of them is using the function aggregate like this:
year <- c(2001,2001,2002,2002,2003)
value <- c(10,20,15,NA,5)
mydf<-data.frame(year,value)
mytable <- aggregate(mydf$value, by=list(mydf$year), FUN=sum, na.rm=TRUE)
colnames(mytable) <- c('Year','sum_values')
> mytable
Year sum_values
1 2001 30
2 2002 15
3 2003 5
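The same aggregation can also be written with the formula interface of aggregate, which keeps the column names so no renaming step is needed. A minimal sketch using the data from the question:

```r
mydf <- data.frame(
  year  = c(2001, 2001, 2002, 2002, 2003),
  value = c(10, 20, 15, NA, 5)
)

# The formula method drops rows with NA before summing (na.action = na.omit)
totals <- aggregate(value ~ year, data = mydf, FUN = sum)
```

One caveat: because incomplete rows are dropped first, a year whose values are all NA would disappear from the result rather than show a sum of 0.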
There is also rowsum, which is quite efficient
with(mydf, rowsum(value, year, na.rm=TRUE))
# [,1]
# 2001 30
# 2002 15
# 2003 5
Or tapply
with(mydf, tapply(value, year, sum, na.rm=TRUE))
# 2001 2002 2003
# 30 15 5
Or as.data.frame(xtabs(...))
as.data.frame(xtabs(mydf[2:1]))
# year Freq
# 1 2001 30
# 2 2002 15
# 3 2003 5
LyzandeR has provided a working answer in base R. If you want to use dplyr which is a great data management tool you could do:
year <- c(2001,2001,2002,2002,2003)
value <- c(10,20,15,NA,5)
mydf<-data.frame(year,value)
library(dplyr)
mydf %>%
group_by(year) %>%
summarise(sum_values = sum(value, na.rm = TRUE))
The advantage of dplyr in this case is for larger datasets it will be much, much faster than base R. I also believe it's much more readable.

adding row/column total data when aggregating data using plyr and reshape2 package in R

I create aggregate tables most of the time during my work using the flow below:
library(plyr)
library(reshape2)
set.seed(1)
temp.df <- data.frame(var1=sample(letters[1:5],100,replace=TRUE),
var2=sample(11:15,100,replace=TRUE))
temp.output <- ddply(temp.df,
c("var1","var2"),
function(df) {
data.frame(count=nrow(df))
})
temp.output.all <- ddply(temp.df,
c("var2"),
function(df) {
data.frame(var1="all",
count=nrow(df))
})
temp.output <- rbind(temp.output,temp.output.all)
temp.output[,"var1"] <- factor(temp.output[,"var1"],levels=c(letters[1:5],"all"))
temp.output <- dcast(temp.output,formula=var2~var1,value.var="count",fill=0)
I am starting to feel silly writing this "boilerplate" code to include the row/column totals every time I create a new aggregate table. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
fun.aggregate=length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
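A minimal sketch of the addmargins route, assuming a plain count table is all that is needed. The data mirror temp.df from the question, with the columns stored as factors so every level appears even if it was never sampled.

```r
set.seed(1)
temp.df <- data.frame(
  var1 = factor(sample(letters[1:5], 100, replace = TRUE), levels = letters[1:5]),
  var2 = factor(sample(11:15, 100, replace = TRUE), levels = 11:15)
)

# Cross-tabulate the two variables: var2 in rows, var1 in columns
counts <- table(var2 = temp.df$var2, var1 = temp.df$var1)

# By default addmargins() appends a "Sum" row and a "Sum" column
counts_with_totals <- addmargins(counts)
```

The result is a table rather than a data frame, so as.data.frame.matrix(counts_with_totals) may be needed if later steps expect a data frame.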

Split and Diff function in R

I have a data frame called data. I am splitting the data using split function by an attribute called KEY.
data <- split(data, data$KEY);
After splitting the dataframe by KEY, what we get is data for individual firms. dataframe data had the data for all the firms in the universe. After the split, each individual split has two columns, year and sales. For each split, I have to calculate incremental sales corresponding to each year. For instance, if we have data 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20. What I am interested in getting would be 2003-2, 2004 -3, 2005 - 5, for each split.
I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
  # Sort by year so that diff() compares consecutive years
  data <- data[order(data$year), ]
  sale_data <- diff(data$SALE)
  data <- data[-1, ]
  data$SALE <- sale_data
  return(data)
}
Currently, I am using for loop:
for(key in names(data)){
a <- try(mod_sale(data[[key]]))
if(class(a) == "try-error") next;
mod_data <- rbind(mod_data,a)};
I think there is some way I can use sapply (and maybe plyr too). Can someone help me improve this R code? I am not sure how the sapply code would go.
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
a <- data.frame();
key <- c(1,1,1,1,2,2,2,2,2,3,3,3);
sales <- c(12,12,15,8,3,6,3,9,9,12,3,7);
year <- c(2002,2003,2004,2005,2001,2002,2003,2004,2005,2003,2004,2005);
ovar <- runif(12,5.0,7.5);
a <- data.frame(key,sales,year,ovar)
In the resultant data.frame, I am expecting incremental sales rather than real sales. Obviously, we will lose 3 data points for 3 key; one for each starting year, as we are taking difference. So there will be three less rows in the resultant data.frame, which would have columns key,diff(sales),year, and ovar.
This is what I would have done:
a$diffsales <- ave( a$sales, a$key, FUN=function(x) c(NA, diff(x) ) )
a
key sales year ovar diffsales
1 1 12 2002 6.845177 NA
2 1 12 2003 6.328153 0
3 1 15 2004 6.872669 3
4 1 8 2005 6.098920 -7
5 2 3 2001 7.154824 NA
6 2 6 2002 6.110810 3
7 2 3 2003 5.906624 -3
8 2 9 2004 5.214369 6
9 2 9 2005 5.818218 0
10 3 12 2003 5.354354 NA
11 3 3 2004 6.728992 -9
12 3 7 2005 7.412213 4
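The ave call assumes the rows are already in year order within each key, and it keeps the first year of each key as NA. A short sketch that sorts first and then drops those rows, which gives the nine-row result described in the question (the ovar column is left out here because it is random):

```r
a <- data.frame(
  key   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  sales = c(12, 12, 15, 8, 3, 6, 3, 9, 9, 12, 3, 7),
  year  = c(2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005, 2003, 2004, 2005)
)

# Sort so that diff() compares consecutive years within each key
a <- a[order(a$key, a$year), ]

# Year-over-year change within each key; the first year of a key has no predecessor
a$diffsales <- ave(a$sales, a$key, FUN = function(x) c(NA, diff(x)))

# Drop the first year of each key
incremental <- a[!is.na(a$diffsales), ]
```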
I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
sale = sample(20,10))
df <- rbind(df,df,df)
df$key <- rep(letters[1:3],each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster+better answers you'll get.
I'd recommend sorting before splitting:
#Sort first (already sorted, but you get the idea)
df <- df[order(df$key,df$year),]
df_split <- split(df,df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split,function(x) {x$sale_diff <- c(NA,diff(x$sale)); x[-1,]})
You'd put it all together again using:
do.call(rbind,out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
# key sales year ovar sale_diff
# 1: 1 12 2002 7.416857 NA
# 2: 1 12 2003 5.625818 0
# 3: 1 15 2004 5.018934 3
# 4: 1 8 2005 6.671986 -7
# 5: 2 3 2001 6.242739 NA
# 6: 2 6 2002 6.297763 3
# 7: 2 3 2003 6.482124 -3
# 8: 2 9 2004 6.724256 6
# 9: 2 9 2005 5.071265 0
#10: 3 12 2003 6.136681 NA
#11: 3 3 2004 6.974392 -9
#12: 3 7 2005 6.517553 4

Creating a function to split data frames multiple times then recombine

I'm working on a large dataset in R with 3 factors: FY (6 levels), Region (10 levels), and Service (24 levels). I need to sum my numeric vector, SumOfUnits, at all three levels, and the only way I can think to do this is to split the data frames up into first: 6 data frames, split by FY, then split those 6 into 10 data frames, split on region, then those 10 into the 24 Services, then I can finally take the sum of the numeric vector and recombine all of the data frames into one. This data frame would then have 6*10*24 (1440) rows and 4 columns. The way I'm currently doing it involves a lot of splitting, so I thought there might be a function I could write that I could use at each level of the split, but I haven't used "function" very much in R so I'm not sure what to write (if there even is something). I also imagine there is probably a more efficient way to get the formatted data set, so I welcome all suggestions.
Here are a few lines from my data frame:
FY Region Service SumOfUnits
1 2006 1 Medication 13
2 2006 1 Medication 1
3 2006 1 Screening & Assessment 38
4 2006 1 Screening & Assessment 13
5 2006 1 Screening & Assessment 41
6 2006 1 Screening & Assessment 67
7 2006 1 Screening & Assessment 222
8 2006 1 Residential Treatment 38
9 2006 1 Residential Treatment 1558
This is the code I've been using for my splits:
# Creating a data frame by year
X <- split(MIC, MIC$FY)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
#Assign the dataframes in the list Y to individual objects
A <- Y[[1]]
B <- Y[[2]]
C <- Y[[3]]
D <- Y[[4]]
E <- Y[[5]]
Q <- Y[[6]]
#Creating 10 dataframes from 2006 split by region
X <- split(A, A$Region)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
Reg1 <- Y[[1]]
Reg2 <- Y[[2]]
Reg3<- Y[[3]]
Reg4 <- Y[[4]]
Reg5<- Y[[5]]
Reg6 <- Y[[6]]
Reg7 <- Y[[7]]
Reg8 <- Y[[8]]
Reg9 <- Y[[9]]
Reg10<- Y[[10]]
#Creating 24 dataframes: for 2006, region 1
X <- split(Reg1, Reg1$Service)
Y <- lapply(seq_along(X), function(x) as.data.frame(X[[x]])[, ])
Serv1 <- Y[[1]]
Serv2 <- Y[[2]]
Serv3<- Y[[3]]
Serv4 <- Y[[4]]
Serv5<- Y[[5]]
#etc...
I would want a sample of my data to look something like this:
FY Region Service SumOfUnits
2006 1 Medication 4300
2006 2 Medication 3299
2006 3 Medication 2198
2007 1 Medication 5467
2007 2 Medication 3214
2007 3 Medication 9807
Here is quite a nice function for this:
library(plyr)
ddply(MIC, .(FY, Region, Service), summarize, sumOfUnits=sum(SumOfUnits))
It gives back exactly what you need.
For MIC =
FY Region Service SumOfUnits
1 2006 1 A 1
2 2006 2 B 4
3 2007 1 C 3
4 2007 2 D 2
5 2007 2 E 7
6 2006 1 A 3
7 2007 1 D 3
8 2007 2 B 4
9 2007 2 B 6
returns:
FY Region Service sumOfUnits
1 2006 1 A 4
2 2006 2 B 4
3 2007 1 C 3
4 2007 1 D 3
5 2007 2 B 10
6 2007 2 D 2
7 2007 2 E 7
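The ddply result only contains combinations that occur in the data. If every FY/Region/Service combination is wanted (the 6*10*24 rows mentioned in the question), one base R option is xtabs followed by as.data.frame. This is a sketch on the small MIC example above, not on the real data:

```r
MIC <- data.frame(
  FY         = c(2006, 2006, 2007, 2007, 2007, 2006, 2007, 2007, 2007),
  Region     = c(1, 2, 1, 2, 2, 1, 1, 2, 2),
  Service    = c("A", "B", "C", "D", "E", "A", "D", "B", "B"),
  SumOfUnits = c(1, 4, 3, 2, 7, 3, 3, 4, 6)
)

# xtabs() sums SumOfUnits over every combination of the three factors;
# combinations that never occur get a total of 0
all_combos <- as.data.frame(
  xtabs(SumOfUnits ~ FY + Region + Service, data = MIC),
  responseName = "SumOfUnits"
)
```

Here the result has 2 * 2 * 5 = 20 rows, and FY, Region, and Service come back as factors.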
