Conditional calculation based on other columns lagged values - r

Newbie: I have a dataset where I want to calculate the y-o-y growth of sales of a company. The dataset contains approx. 1000 companies with each different number of years listed on a public stock exchange. The data looks like this:
# gvkey fyear at company name
#22 17436 2010 59393 BASF SE
#23 17436 2011 61175 BASF SE
#24 17436 2012 64327 BASF SE
...
#30 17436 2018 86556 BASF SE
#31 17828 1989 62737 DAIMLER AG
#32 17828 1990 67339 DAIMLER AG
#33 17828 1991 75714 DAIMLER AG
...
#60 17828 2018 281619 DAIMLER AG
I would like to create a new column growth where I calculate the percentage increase of at from e.g. BASF SE (gvkey 17436) from 2010 to 2011, to 2012 and so on. In row #31 the conditional statement is supposed to work that it would not calculate the increase based on values that belong to BASF but rather have a NA value. Therefore the next value in this new column "growth" in row 32 would be the percentage increase of DAIMLER (gvkey 17828) from 62727 to 67339
So far I tried:
if TA$gvkey == lag(TA$gvkey) {mutate(TA, growth = (at - lag(at))/lag(at))} else {NULL}
Basically I tried to condition the calculation on the change of the gvkey identifier as this makes the most sense to me. I believe there is a nicer way of maybe running a loop until the gvkey changes and the continue with the next set of values - but I simply don't know how to code that.
I am very new to R and quite lost. I would appreciate every support! Thank you, guys :)

I do not see a way to do this in one line. Assuming you data is called data you may try:
for(i in data$gvkey){
a = subset(data,data$gvkey==i) # a now contains the data of one company
# calculate pairwise relative difference (assumes sorted years!)
rel_diff = diff(a)/head(a,-1) #diff computes pariwise difference and divide by a ( head(a,-1) removes the last element)
a$growth = c(0,rel_diff) # extend data frame by result, first difference is 0
#output tro somewhere
}
This is a solution with r-base. There might be more efficient ways but this is easy to understand.

In this case the group_by function in dplyr is a good tool to use.By group_by() ing your gv column you will segment out your mutate() call to apply separately for each distinct value of gv. Here is a quick example I made with some dummy data and your same column values:
library(dplyr)
dummyData =
data.frame(gvkey = c(111,111,111,222,222,222),
fyear = c(2010,2012,2011,2010,2011,2013),
at =c(2,4,2,4,5,10)
)
dummyDataTransformed = dummyData %>%
group_by(gvkey) %>%
arrange(fyear) %>% #to make sure we are chronologically in order
mutate(growth = at/lag(at,1) -1) %>% #subtract 1 to get year over year change
ungroup() #I like to ungroup just to make sure i'm not bugging out any calculations I might add further down the line

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset describes an id variable that identifies a person and the date when his or her unemployment benefits starts.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year denotes a dummy variable, which is equal to unity in case someone build up unemployment benefits rights in a particular year (i.e. if someone worked). If this is not the case, this variable is equal to zero.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information from the above two datasets. We see that person R005 worked in the years 2010 and 2011. In 2012 this person filed for Unemployment insurance. Thereafter person R005 works again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he got unemployed. Hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014 as he only filed for unemployment benefits in December of that year. Therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits and that is something I do not want. Does anyone have an efficient way how to combine the information from the two above datasets and calculate the unemployment history?
I appreciate any help.
base R
You should use Reduce with accumulate = T.
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
dplyr
Or use the built-in dplyr::cumany function:
df2 %>%
pivot_longer(-id) %>%
group_by(id) %>%
summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

Updating Values within a Simulation in R

I am working on building a model that can predict NFL games, and am looking to run full season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week based on whether or not a team lost. For example, lets say the Bills and Ravens each started Sundays game with a rating of 100, after the Ravens win, their rating now increases to 120 and the Bills decrease to 80.
While running the simulation, I would like to update the teams rating throughout in order to get a more accurate representation of the number of ways a season could play out, but am not sure how to include something like this within the loop.
My loop for the 2017 season.
full.sim <- NULL
for(i in 1:10000){
nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
winningteams <- table(nflpredictions$winner)
projectedwins <- data.frame(Team=names(winningteams), Wins=as.numeric(winningteams))
full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by= list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017 where I already have the full seasons worth of data, but I am having trouble adapting for a model to simulate 2018.
My first idea is to create another for loop within the loop that iterates through the rows and updates the ratings for each week, something along the lines of
full.sim <- NULL
for(i in 1:10000){
for(i in 1:nrow(nflpredictions)){
The idea being to update a teams rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and then continue through the entire dataframe. The only thing really holding me back is not knowing how to add a value to a row based on a row that is not directly above. So what would be the easiest way to update the ratings each week based on the result of the last game that team played in?
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1 BAL BUF 105 85 .60 .40
1 NE HOU 120 90 .65 .35
2 BUF LAC NA NA NA NA
2 JAX NE NA NA NA NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!

How do I replace values in an R dataframe column with a corresponding value?

Ok, so I have a dataframe that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (word document). I'm new to R and while I've been searching for a few hours for the best way to go about this, I can't seem to find a method that would work (or I just don't understand the explanation). Can anybody suggest a method to me?
If you have a vector of the state names as strings called statevec whose ith element corresponds to cregion i, and your data frame is named dat, just do
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
# cregion stuff
#1 25 0.665843896
#2 11 0.144631131
#3 13 0.691616240
#4 28 0.507454243
#5 9 0.416535139
#6 30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
# cregion stuff
#1 Missouri 0.665843896
#2 Hawaii 0.144631131
#3 Illinois 0.691616240
#4 Nevada 0.507454243
#5 Florida 0.416535139
#6 New Jersey 0.004196311

How to deal with non-consecutive (non-daily) dates in R, while looping?

I am trying to write a script that loops through month-end dates and compares associated fields, but I am unable to find a way to way to do this.
I have my data in a flatfile and subset based on 'TheDate'
For instance I have:
date.range <- subset(raw.data, observation_date == theDate)
Say TheDate = 2007-01-31
I want to find the next month included in my data flatfile which is 2007-02-28. How can I reference this in my loop?
I currently have:
date.range.t1 <- subset(raw.data, observation_date == theDate+1)
This doesnt work obviously as my data is not daily.
EDIT:
To make it more clear, my data is like below
ticker observation_date Price
ADB 31/01/2007 1
ALS 31/01/2007 2
ALZ 31/01/2007 3
ADB 28/02/2007 2
ALS 28/02/2007 5
ALZ 28/02/2007 1
I am using a loop so I want to skip from 31/01/2007 to 29/02/2007 by recognising it is the next date, and use that value to subset my data
First get unique values of date like so:
unique_dates<-unique(raw.data$observation_date)
The sort these unique dates:
unique_dates_ordered<-unique_dates[order(as.Date(unique_dates, format="%Y-%m-%d"))]
Now you can subset based on the index of unique_dates_ordered i.e.
subset(raw.data, raw.data$observation_date == unique_dates_ordered[i])
Where i = 1 for the first value, i = 2 for the second value etc.

R adding rows of data and summarize them by group

After looking at my notes from a recent R course and here in the Q and As, the most probable function I need to use to get what I need would seem to be colsum, and groupby but no idea how to do it, can you can help me out.
( first I tried to look into summarize and group by but did not get far )
What I Have
player year team rbi
a 2001 NYY 56
b 2001 NYY 22
c 2001 BOS 55
d 2002 DET 77
Results wanted
year team rbi
2001 NYY 78
2001 BOS 55
2002 DET 77
The players name is lost, why ?
I want to add up the RBI for each team for each year using the individual players RBI's
So for each year there should be lets say 32 teams and for each of these teams there should be an RBI number which is the sum of all the players that batted for each of the teams that particular year.
Thank you
As per #bunk 's comment you can use the aggregate function
aggregate(df$rbi, list(df$team, df$year), sum )
# Group.1 Group.2 x
#1 BOS 2001 55
#2 NYY 2001 78
#3 DET 2002 77
As per #akrun's comment to keep the column names as it is, you can use
aggregate(rbi ~ team + year, data = df, sum)
A data.table approach would be to convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'year' and 'team', we get the sum of 'rbi'.
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
# year team rbi
#1: 2001 NYY 78
#2: 2001 BOS 55
#3: 2002 DET 77
NOTE: The 'player' name is lost because we are not using that variable in the summarizing step.
Assume df contains your player data, then you can get the result you want by
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
The players' names are lost because the column player is not included in the group_by clause, and so is not used by summarise to aggregate the data in the rbi column.
Thank you for your help resolving my issue, something that could have been done easier in a popular spreadsheet program, but I decided to do it in R, I love this program and it{s libraries although with a learning curve
There were 4 proposals to resolve my question and 3 of them worked fine, when I evaluate the answer by the number of rows the final run has because I know what the answer should be from a related dataframe.
1) Arun"s proposal worked fine and its using a novel library(data.table) I read a little more on this library and looks interesting
library(data.table)
setDT(df)[, .(rbi=sum(rbi)), by= .(year, team)]
2) Also Alexs proposal worked fine too, it was
library(dplyr)
df %>%
group_by(year, team) %>%
summarise(rbi = sum(rbi))
3) Akruns solution was also good. This is the one I liked the most because the team column came already in alphabetical order, it came sorted by year and team, while the previous two solutions you need to specify you wanted sorted by year and then team
aggregate(list(rbi=df$rbi), list(team=df$team, year=df$year), sum )
4 ) Solution by Ronak almost worked, out of the 2775 rows that the results had to have this solution only brought 2761 The code was:
aggregate(rbi ~ team + year, data = df, sum)
Thanks again to everybody
Javier

Resources