R: getting rid of nested for loops

I did quite a bit of searching on how to simplify the code for the problem below but was not successful. I assume that with some kind of apply magic one could speed things up a little, but so far I still have difficulties with those kinds of functions.
I have a data.frame data, structured as follows:
year iso3c gdppc elec solid liquid heat
2010 USA 1567 1063 1118 835 616
2015 USA 1571 NA NA NA NA
2020 USA 1579 NA NA NA NA
... USA ... NA NA NA NA
2100 USA 3568 NA NA NA NA
2010 ARG 256 145 91 85 37
2015 ARG 261 NA NA NA NA
2020 ARG 270 NA NA NA NA
... ARG ... NA NA NA NA
2100 ARG 632 NA NA NA NA
As you can see, I have a historical starting value for 2010 and a complete scenario for gdppc up to 2100. I want to let values for elec, solid, liquid and heat grow according to some elasticity with respect to the development of gdppc, but separately for each country (coded in iso3c).
I have the elasticities defined in a separate data.frame parameters:
item value
elec 0.5
liquid 0.2
solid -0.1
heat 0.1
So far I am using a nested for loop:
for (e in 1:length(levels(parameters$item))) {
  for (c in 1:length(levels(data$iso3c))) {
    tmp <- subset(data,
                  select = c("year", "iso3c", "gdppc", parameters[e, "item"]),
                  subset = iso3c == levels(data$iso3c)[c])
    tmp[tmp$year %in% seq(2015, 2100, 5), parameters[e, "item"]] <-
      tmp[tmp$year == 2010, parameters[e, "item"]] *
      cumprod((1 + (tmp[tmp$year %in% seq(2015, 2100, 5), "gdppc"] /
                    tmp[tmp$year %in% seq(2010, 2095, 5), "gdppc"] - 1) * parameters[e, "value"]))
    data[data$iso3c == levels(data$iso3c)[c] & data$year %in% seq(2015, 2100, 5),
         parameters[e, "item"]] <- tmp[tmp$year > 2010, parameters[e, "item"]]
  }
}
The outer loop runs over the value columns and the inner one over the countries (I have 180+ countries). First, a subset containing the data for a single country and the variable of interest is selected. Then I let that variable grow with a certain elasticity to the growth in gdppc and finally put the subset back into place in data.
I have already tried to let the outer loop run in parallel using foreach but was not successful in recombining the results. Since I have to run similar calculations quite often, I would be very grateful for any help.
Thanks

Here's one way. Note that I renamed your parameters data.frame to p.
library(data.table)
library(reshape2)
dt <- data.table(data)
dt.melt <- melt(dt, id = 1:3)
dt.melt[, value := as.numeric(value)]   # coerce value column to numeric
dt.melt[, value := head(value, 1) + (gdppc - head(gdppc, 1)) * p[p$item == variable, ]$value,
        by = "iso3c,variable"]
result <- dcast(dt.melt, iso3c + year + gdppc ~ variable)
result
# iso3c year gdppc elec solid liquid heat
# 1 ARG 2010 256 145.0 91.0 85.0 37.0
# 2 ARG 2015 261 147.5 90.5 86.0 37.5
# 3 ARG 2020 270 152.0 89.6 87.8 38.4
# 4 ARG 2100 632 333.0 53.4 160.2 74.6
# 5 USA 2010 1567 1063.0 1118.0 835.0 616.0
# 6 USA 2015 1571 1065.0 1117.6 835.8 616.4
# 7 USA 2020 1579 1069.0 1116.8 837.4 617.2
# 8 USA 2100 3568 2063.5 917.9 1235.2 816.1
The basic idea is to use the melt(...) function to reshape your original data into "long" format, where the values in the four columns solid, liquid, elec, and heat are all in one column, value, and the column variable indicates which metric the value refers to. Now, using data.table, you can fill in the values easily. Then, reshape the result back into wide format using dcast(...).
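Note that the data.table expression above applies a linear adjustment (starting value plus the change in gdppc times the parameter), rather than the compound elasticity growth used in the original loop. If you want the compound version, here is a hedged sketch of the same long/wide idea using dplyr and tidyr (assuming those packages are available, that the data are sorted by year within each country, and using the original parameters data frame; elast, elasticity and result2 are just illustrative names):
library(dplyr)
library(tidyr)

elast <- parameters %>%
  mutate(item = as.character(item)) %>%   # make sure the join key is character
  rename(elasticity = value)

result2 <- data %>%
  pivot_longer(c(elec, solid, liquid, heat), names_to = "item", values_to = "value") %>%
  left_join(elast, by = "item") %>%
  group_by(iso3c, item) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(value = first(value) *            # 2010 starting value times compounded growth
           cumprod(c(1, 1 + diff(gdppc) / head(gdppc, -1) * first(elasticity)))) %>%
  ungroup() %>%
  select(-elasticity) %>%
  pivot_wider(names_from = item, values_from = value)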

Related

Calculation of rolling standard deviation by group

I have a long dataset in the following format:
Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
: : :
: : :
2019-06-30 Australia 57
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10
: : :
: : :
2019-06-30 Austria 21
I want to calculate a 90-day rolling standard deviation of the Score for each country. I have tried using the rollapply function (package zoo) and roll_sd (package RcppRoll), but I could not get them to work for group-wise standard deviations. Can anyone please suggest a possible way to calculate the rolling standard deviation?
Thanks!
In general, grouping is done separately from the base operation in R, so it is not that those functions can't be used for grouped data; you just need to embed them within a grouping operation. Here we use ave to do the grouping and rollapplyr to perform the rolling sd.
Now, can we assume that at each point the last 90 days are the last 90 rows? Assuming yes, and using a window width of 2 so that the example works with the selected rows of the posted data shown reproducibly in the Note at the end:
library(zoo)
roll <- function(x) rollapplyr(x, 2, sd, fill = NA)
transform(DF, roll = ave(Score, Country, FUN = roll))
giving:
Date Country Score roll
1 1995-01-01 Australia 100 NA
2 1995-01-02 Australia 99 0.7071068
3 1995-01-03 Australia 85 9.8994949
4 1995-01-01 Austria 67 NA
5 1995-01-02 Austria 12 38.8908730
6 1995-01-03 Austria 10 1.4142136
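For the full data set, assuming the last 90 days really are the last 90 rows within each country, the same pattern works with a window width of 90 (this is a sketch of the code above with only the width changed; adding partial = TRUE to rollapplyr would return the sd of the shorter leading windows instead of NA):
roll90 <- function(x) rollapplyr(x, 90, sd, fill = NA)
DF$roll90 <- ave(DF$Score, DF$Country, FUN = roll90)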
Wide form approach
Another approach is to convert the data to wide form and then perform the rolling operation:
library(zoo)
z <- read.zoo(DF, split = "Country")
zr <- rollapplyr(z, 2, sd, fill = NA)
zr
giving this zoo series:
Australia Austria
1995-01-01 NA NA
1995-01-02 0.7071068 38.890873
1995-01-03 9.8994949 1.414214
You can then just leave it as a zoo series in order to take advantage of the other time series functions in that package, or convert it back to a data frame using fortify.zoo(zr) or fortify.zoo(zr, melt = TRUE, names = names(DF)), depending on what you need.
Note
The input used, in reproducible form:
Lines <- "Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10"
DF <- read.table(text = Lines, header = TRUE)
DF$Date <- as.Date(DF$Date)

Removing "outer rows" to allow for interpolation (and prevent extrapolation)

I have left-joined two data frames by country-year:
df <- left_join(df, df2, by = "country-year")
leading to the following example output:
country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I initially wanted to remove all values for which both of the added columns (a,b) were NA.
df<-df[!is.na( df$a | df$b ),]
However, I have since decided that I want to interpolate the data I have (but not extrapolate). So instead I would like to remove all the rows for which I cannot interpolate; in the example:
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I believe there are two options. First, I somehow adapt this function:
library(tidyerse)
TRcomplete<-TRcomplete%>%
group_by(country) %>%
mutate_at(a:b,~na.fill(.x,"extend"))
to interpolate only, and then apply df <- df[!is.na(df$a | df$b), ],
or I write code to remove the "outer" rows first and then use "extend" as normal. Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 1000 1000
8 UK UK2002 1000 1000
Any suggestions?
There are options in na.fill to specify what is done. If you look at ?na.fill, you see that fill can specify the left, interior and right fills, so if you specify that the left and right are NA and the interior is "extend", it will only fill the interior data. You can then filter out the rows that are still NA.
library(tidyverse)
library(zoo)
df %>%
  group_by(country) %>%
  mutate_at(vars(a:b), ~na.fill(.x, c(NA, "extend", NA))) %>%
  filter(!is.na(a) | !is.na(b))
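A hedged alternative sketch uses na.approx (also from zoo) instead of na.fill; with na.rm = FALSE it interpolates only between known values and leaves the leading and trailing NAs in place, assuming each country has at least two non-missing values:
df %>%
  group_by(country) %>%
  mutate_at(vars(a:b), ~na.approx(.x, na.rm = FALSE)) %>%
  filter(!is.na(a) | !is.na(b))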
By the way, you have a typo in your library(tidyverse) statement; you are missing the v.

extracting values of a column into a string and replacing values in a data frame column

More than the programming, I am lost on the right approach for this problem. I have two data frames with a market name column. Unfortunately the names vary by a few characters/symbols between the two, e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.
I want to standardize the market names in both data frames so I can perform merge operations later.
I thought of tackling the problem in two steps:
1) Create a vector of the unique market names from both tables and use that to create a lookup table. Something that looks like:
Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .
Table2 Markets > "SPOKANE" , "BOSTON" . . .
I tried marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ",")) but that doesn't produce the desired output.
2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.
I know I could use a looping function like the one below, but I think I still need a lookup table.
replacefunc <- function(data, oldvalue, newvalue) {
  newdata <- data
  for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i]
  newdata
}
Table 1: This table is 90 rows x 2 columns and has 90 unique market names.
Market Leads Investment Leads1 Leads2 Leads3
1 Albany.Schenectady.Troy NA NA NA NA NA
2 Albuquerque.Santa.Fe NA NA NA NA NA
3 Atlanta NA NA NA NA NA
4 Austin NA NA NA NA NA
5 Baltimore NA NA NA NA NA
Table 2 : This table is 150K rows x 20 columns and has 89 unique market names.
> df
Spot.ID Date Hour Time Local.Date Broadcast.Week Local.Hour Local.Time Market
2 13072765 6/30/14 0 12:40 AM 2014-06-29 1 21 9:40 PM SPOKANE
261 13072946 6/30/14 5 5:49 AM 2014-06-30 1 5 5:49 AM BOSTON
356 13081398 6/30/14 10 10:52 AM 2014-06-30 1 7 7:52 AM SPOKANE
389 13082306 6/30/14 11 11:25 AM 2014-06-30 1 8 8:25 AM SPOKANE
438 13082121 6/30/14 8 8:58 AM 2014-06-30 1 8 8:58 AM BOSTON
469 13081040 6/30/14 9 9:17 AM 2014-06-30 1 9 9:17 AM ALBANY
482 13080104 6/30/14 12 12:25 PM 2014-06-30 1 9 9:25 AM SPOKANE
501 13082120 6/30/14 9 9:36 AM 2014-06-30 1 9 9:36 AM BOSTON
617 13080490 6/30/14 13 1:23 PM 2014-06-30 1 10 10:23 AM SPOKANE
Assume that the data is in data frames df1 and df2. The goal is to make the market names the same; they are currently slightly different.
First, list the unique market names in each data frame and compare them:
mk1 <- sort(unique(df1$Market))
mk2 <- sort(unique(df2$Market))
dmk12 <- setdiff(mk1, mk2)
dmk21 <- setdiff(mk2, mk1)
Use dmk12 and dmk21 to identify the markets whose names differ. Decide which names to use and how they match up; say, for example, df1 uses "Atlanta, GA" and df2 uses "Atlanta", and you want to keep the df1 spelling. Then use
df2[df2$Market == "Atlanta", "Market"] <- "Atlanta, GA"
The format is
df_to_change[df_to_change[, "column"] == "old data", "column"] <- "new data"
If you only have 90 names to correct, I would write out 90 change lines like the one above.
After adjusting all the names, recompute the sorted unique names and use setdiff twice to confirm the two sets now match.
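If you would rather drive the renaming from a lookup table, as the question suggests, one hedged sketch is a named vector mapping the df2 spellings to the df1 spellings. The two pairs below are the ones given in the question; the remaining mappings would still have to be filled in by hand (lookup and idx are just illustrative names):
lookup <- c("ALBANY" = "Albany.Schenectady.Troy",
            "BOSTON" = "Boston.Manchester")      # extend with the remaining pairs

df2$Market <- as.character(df2$Market)           # in case Market is stored as a factor
idx <- match(df2$Market, names(lookup))
# replace only the names that have a mapping; everything else keeps its value
df2$Market <- ifelse(is.na(idx), df2$Market, unname(lookup[idx]))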

Creating lag variables for matched factors

I have a question about creating lag variables depending on a time factor.
Basically I am working with a baseball dataset in which each player appears in several seasons between 2002 and 2012. Obviously I only want lag variables computed within the same player, so that I can build a career arc to predict the current stat. For example, I want to use the 2003 and 2004 averages to try to predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is in row n-1), checks whether the name is the same, and if so grabs the value from the previous row.
Here is my loop:
for (i in 2:6264) {   # start at row 2, since row 1 has no previous row
  if (TS$name[i] == TS$name[i-1]) {
    TS$runvalueL1[i] <- TS$Run_Value[i-1]
  } else {
    TS$runvalueL1[i] <- NA
  }
}
Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!
Sample data won't help a bunch, but here is some:
Edit: the sample data wasn't producing usable results, so I just attached the first 10 rows of my dataset. Thanks!
TS[(6:10),c('name','Season','Run_Value')]
name Season Run_Value
321 Abad Andy 2003 -1.05
3158 Abercrombie Reggie 2006 27.42
1312 Abercrombie Reggie 2007 7.65
1069 Abercrombie Reggie 2008 5.34
4614 Abernathy Brent 2002 46.71
707 Abernathy Brent 2003 -2.29
1297 Abernathy Brent 2005 5.59
6024 Abreu Bobby 2002 102.89
6087 Abreu Bobby 2003 113.23
6177 Abreu Bobby 2004 128.60
Thank you!
Something along these lines should do it:
names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)
library(data.table)
dt = data.table(names, years, Run_value)
dt[, lag1 := c(NA, Run_value), by = names]
# names years Run_value lag1
#1: Adams 2002 10 NA
#2: Adams 2003 15 10
#3: Adams 2004 15 15
#4: Adams 2005 20 15
#5: Bobby 2004 10 NA
#6: Bobby 2005 5 10
#7: Charlie 2010 5 NA
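With more recent data.table versions (1.9.6 and later), the built-in shift() function expresses the same lag a little more clearly:
dt[, lag1 := shift(Run_value), by = names]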
An alternative would be to split the data by name, use lapply with the lag function of your choice, and then combine the split data again:
TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))
or
TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))
There is probably also a nice way to do this with plyr, but since you did not provide a reproducible example, that is all for now.
Better:
TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
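To make that concrete with one possible lag helper (a hedged sketch: it splits just the Run_Value column, and it relies on TS already being sorted by name and then year, since split() returns the groups in factor-level order):
lag_one <- function(x) c(NA, head(x, -1))   # previous value; NA for a player's first season
TS$runvalueL1 <- unlist(lapply(split(TS$Run_Value, TS$name), lag_one), use.names = FALSE)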
This is obviously not a problem where you want to create a matrix with cbind, so this is a better data structure:
full=data.frame(names, years, Run_value)
The ave function is quite useful for constructing new columns within categories of other columns:
full$Lag1 <- ave(full$Run_value, full$names,
                 FUN = function(x) c(NA, x[-length(x)]))
full
names years Run_value Lag1
1 Adams 2002 10 NA
2 Adams 2003 15 10
3 Adams 2004 15 15
4 Adams 2005 20 15
5 Bobby 2004 10 NA
6 Bobby 2005 5 10
7 Charlie 2010 5 NA
I think it's safer to construct the lag with NA, since that will surface errors in logic that using 0 for the prior year in a player's first season would not alert you to.

How to take the mean of last 10 values in a column before a missing value using R?

I am new to R and having trouble figuring out how to go about this. I have data on tree growth rates from dead trees, organized by year. My first column is year and the columns to the right are growth rates for individual trees, ending in the year each tree died. After a tree died, the values are NA for the remaining years in the dataset. I need to take the mean growth for the 10 years preceding each tree's death, but each tree died in a different year. Does anyone have an idea for how to do this? Here is an example of what a dataset might look like:
Year Tree1 Tree2 Tree3
1989 53.00 84.58 102.52
1990 63.68 133.16 146.07
1991 90.37 103.10 233.58
1992 149.24 127.61 245.69
1993 96.20 54.78 417.96
1994 230.64 60.92 125.31
1995 150.81 60.98 100.43
1996 124.25 42.73 75.43
1997 173.42 67.20 50.34
1998 119.60 73.40 32.43
1999 179.97 61.24 NA
2000 114.88 67.43 NA
2001 82.23 55.23 NA
2002 49.40 NA NA
2003 93.46 NA NA
2004 104.67 NA NA
2005 44.14 NA NA
2006 88.40 NA NA
So, the averages I need to calculate are:
Tree1: mean(1997-2006) = 105.01
Tree2: mean(1992-2001) = 67.15
Tree3: mean(1989-1998) = 152.98
Since I need to do this for a large number of trees, it would be helpful to have a method of automating the calculation. Thank you very much for any help! Katie
You can use sapply and tail together with na.omit as follows:
sapply(mydf[-1], function(x) mean(tail(na.omit(x), 10)))
# Tree1 Tree2 Tree3
# 105.017 67.152 152.976
mydf[-1] says to drop the first column. tail has an argument, n, that lets you specify how many values you want from the end (tail) of your data. Here, we've set it to 10 since you want the last 10 values. Then, assuming that there are no NA values in your actual data for the years while the trees were alive, you can safely use na.omit on your data.
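If some trees do have NA gaps in years when they were still alive, a hedged variant keys off the position of the last non-NA value (the year of death) instead of dropping every NA first:
sapply(mydf[-1], function(x) {
  last <- max(which(!is.na(x)))                  # last year with a growth value
  mean(x[seq(max(1, last - 9), last)], na.rm = TRUE)
})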
