I have a variable called Depression which has 40 observations and goes from 2004 to 2013 quarterly (e.g. 2004 Q1, 2004 Q2, etc.). I would like to make a new column that takes differences with respect to the 27th row/observation, which corresponds to 2010 Q3, so that the value in that row is 0. Any help is greatly appreciated!
If I understand your question correctly, this would do it:
# generate sample data
dat <- data.frame(id=paste0("Obs.",1:40),depression=as.integer(runif(40,0,20)))
# Create new var that calculates difference with 27th observation on depression score
dat$diff <- dat$depression - dat$depression[27]
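If the data carries quarter labels, the reference row can be looked up by label instead of hard-coding 27. A minimal sketch, assuming a label column named quarter (a hypothetical name, not in the question) and random sample scores:

```r
# Build quarter labels "2004 Q1" ... "2013 Q4" (40 rows); `quarter` is an assumed column name
quarters <- paste(rep(2004:2013, each = 4), paste0("Q", 1:4))

set.seed(1)  # make the random sample reproducible
dat <- data.frame(quarter = quarters,
                  depression = as.integer(runif(40, 0, 20)))

# Locate the reference row by its label rather than by position
ref <- which(dat$quarter == "2010 Q3")

# Difference every observation against the reference; the reference row becomes 0
dat$diff <- dat$depression - dat$depression[ref]
```

This keeps working if rows are later added or removed, as long as the label exists exactly once.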
I have a data frame and would like to calculate the growth rate of nominal GDP in R. I know how to do it in Excel with the formula ((gdp of this year) - (gdp of last year)) / (gdp of last year) * 100. What kind of command could be used in R to calculate it?
year nominal gdp
2003 7696034.9
2004 8690254.3
2005 9424601.9
2006 10520792.8
2007 11399472.2
2008 12256863.6
2009 12072541.6
2010 13266857.9
2011 14527336.9
2012 15599270.7
2013 16078959.8
You can also use the lag() function from dplyr. It returns the previous values in a vector. Here is an example:
data <- data.frame(year = c(2003:2013),
gdp = c(7696034.9, 8690254.3, 9424601.9, 10520792.8,
11399472.2, 12256863.6, 12072541.6, 13266857.9,
14527336.9, 15599270.7, 16078959.8))
library(dplyr)
growth_rate <- function(x)(x/lag(x)-1)*100
data$growth_rate <- growth_rate(data$gdp)
It's probably best for you to get familiar with data tables, and do something like this:
library(data.table)
dt_gdp <- data.table(df)
dt_gdp[, growth_rate_of_gdp := 100 * (Producto.interno.bruto..PIB. - shift(Producto.interno.bruto..PIB.)) / shift(Producto.interno.bruto..PIB.)]
A base-R solution:
with(data,
c(NA, ## augment results (growth rate unknown in year 1)
diff(gdp)/ ## this is gdp(t) - gdp(t-1)
head(gdp, -1)) ## gdp(t-1)
*100) ## scale to percentage growth
head(gdp, -1) is perhaps a little too clever. gdp[-length(gdp)] (i.e. "gdp, excluding the last value") would be slightly more idiomatic.
Or
(gdp/c(NA,gdp[-length(gdp)])-1)*100
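A quick way to convince yourself that the two base-R forms above are the same calculation is to run both on the question's figures and compare. This is only a sanity-check sketch; the names g_diff and g_ratio are made up for illustration:

```r
# Nominal GDP values from the question, 2003 to 2013
gdp <- c(7696034.9, 8690254.3, 9424601.9, 10520792.8,
         11399472.2, 12256863.6, 12072541.6, 13266857.9,
         14527336.9, 15599270.7, 16078959.8)

# Form 1: (gdp(t) - gdp(t-1)) / gdp(t-1), padded with NA for the first year
g_diff <- c(NA, diff(gdp) / head(gdp, -1)) * 100

# Form 2: gdp(t) / gdp(t-1) - 1, using a manually shifted copy of gdp
g_ratio <- (gdp / c(NA, gdp[-length(gdp)]) - 1) * 100

# TRUE if both forms agree up to floating-point tolerance
all.equal(g_diff, g_ratio)
```

Both vectors have the same length as gdp, so either can be assigned straight into a new data frame column.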
Newbie: I have a dataset where I want to calculate the y-o-y growth of sales of a company. The dataset contains approx. 1000 companies with each different number of years listed on a public stock exchange. The data looks like this:
# gvkey fyear at company name
#22 17436 2010 59393 BASF SE
#23 17436 2011 61175 BASF SE
#24 17436 2012 64327 BASF SE
...
#30 17436 2018 86556 BASF SE
#31 17828 1989 62737 DAIMLER AG
#32 17828 1990 67339 DAIMLER AG
#33 17828 1991 75714 DAIMLER AG
...
#60 17828 2018 281619 DAIMLER AG
I would like to create a new column growth where I calculate the percentage increase of at for, e.g., BASF SE (gvkey 17436) from 2010 to 2011, from 2011 to 2012, and so on. In row #31 the conditional statement should not calculate the increase based on values that belong to BASF but should instead return an NA value. The next value in this new column "growth", in row #32, would then be the percentage increase of DAIMLER (gvkey 17828) from 62737 to 67339.
So far I tried:
if TA$gvkey == lag(TA$gvkey) {mutate(TA, growth = (at - lag(at))/lag(at))} else {NULL}
Basically I tried to condition the calculation on the change of the gvkey identifier as this makes the most sense to me. I believe there is a nicer way of maybe running a loop until the gvkey changes and the continue with the next set of values - but I simply don't know how to code that.
I am very new to R and quite lost. I would appreciate every support! Thank you, guys :)
I do not see a way to do this in one line. Assuming your data is called data, you may try:
result <- list()
for (i in unique(data$gvkey)) {            # loop over companies, not over rows
  a <- subset(data, gvkey == i)            # a now contains the data of one company
  a <- a[order(a$fyear), ]                 # sort by year so consecutive rows are consecutive years
  # pairwise relative difference: diff() gives at(t) - at(t-1), head(x, -1) drops the last element
  rel_diff <- diff(a$at) / head(a$at, -1)
  a$growth <- c(NA, rel_diff)              # the first year of each company has no predecessor
  result[[as.character(i)]] <- a           # collect the per-company result
}
data <- do.call(rbind, result)             # stack the companies back into one data frame
This is a base-R solution. There might be more efficient ways, but this is easy to understand.
In this case the group_by function in dplyr is a good tool to use. By grouping on your gvkey column with group_by(), your mutate() call is applied separately for each distinct value of gvkey. Here is a quick example I made with some dummy data and your same column names:
library(dplyr)
dummyData =
data.frame(gvkey = c(111,111,111,222,222,222),
fyear = c(2010,2012,2011,2010,2011,2013),
at =c(2,4,2,4,5,10)
)
dummyDataTransformed = dummyData %>%
group_by(gvkey) %>%
arrange(fyear) %>% #to make sure we are chronologically in order
mutate(growth = at/lag(at,1) -1) %>% #subtract 1 to get year over year change
ungroup() #I like to ungroup just to make sure i'm not bugging out any calculations I might add further down the line
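For comparison, the same grouped calculation can be sketched in base R with ave(), which applies a function within each group and returns a vector in the original row order. The rows below are a subset of the question's sample; the data frame name TA follows the asker's attempt, and the column name growth is the one the asker asked for:

```r
# A few rows per company, taken from the question's sample data
TA <- data.frame(
  gvkey = c(17436, 17436, 17436, 17828, 17828, 17828),
  fyear = c(2010, 2011, 2012, 1989, 1990, 1991),
  at    = c(59393, 61175, 64327, 62737, 67339, 75714)
)

# Sort so that, within each company, rows are in chronological order
TA <- TA[order(TA$gvkey, TA$fyear), ]

# Within each gvkey: NA for the first year, then (at(t) - at(t-1)) / at(t-1)
TA$growth <- ave(TA$at, TA$gvkey,
                 FUN = function(x) c(NA, diff(x) / head(x, -1)))
```

The first row of each company gets NA, so the DAIMLER series never borrows a value from BASF.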
I've a panel dataset of several banks, each from 1997 to 2015, with annual observations s.t.:
CODE COUNTRY YEAR LOANS_NET ...other variables
671405 AT 1997 39028938
671405 AT 1998 41033237
671405 AT 1999 35735062
...
...
671405 AT 2015 130701872
...
30885R DE 2004 200024673
...
...
Using R, I need to compute additional two columns:
1) LOANS_NET growth rate at 1-year horizon
2) LOANS_NET growth rate at 3-years horizon, which must be annualized, once calculated.
E.g.:
Loan Growth 3-year = [Bank's(i) LOANS_NET Year(t) / Bank's(i) LOANS_NET Year(t-3)] -1
nb: data contains lots of missing values, code must consider that issue! :)
@Dan Do you use any packages? I recommend using the zoo and data.table packages and transforming the dates in the following way:
DT[, YearNumeric := as.numeric(YEAR)]
DT[, PreviousYearLoanNet := .SD[match(YearNumeric - 1, .SD$YearNumeric), LOANS_NET], by=CODE]
Here, you create a column with previous (-1 year) loan values. Then you create a new column with growth:
DT[, Growth1Y := (LOANS_NET - PreviousYearLoanNet) / PreviousYearLoanNet]
And then you do whatever you want:) Cheers!
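The same lookup-by-year idea extends to the 3-year horizon. Below is a base-R sketch under some loud assumptions: lag_by_year is a hypothetical helper, the values for 2000, 2001, 2005 and 2007 are invented to show gaps, and "annualized" is taken to mean the geometric rate (ratio^(1/3) - 1), which the question does not specify.

```r
# Toy panel: 1997-1999 and 2004 values come from the question;
# the 2000, 2001, 2005 and 2007 rows are hypothetical, added to show gaps
banks <- data.frame(
  CODE      = c(rep("671405", 5), rep("30885R", 3)),
  YEAR      = c(1997, 1998, 1999, 2000, 2001, 2004, 2005, 2007),
  LOANS_NET = c(39028938, 41033237, 35735062, NA, 45000000,
                200024673, 210000000, 230000000)
)

# Hypothetical helper: LOANS_NET of the same bank k years earlier,
# NA when that bank-year row does not exist
lag_by_year <- function(df, k) {
  idx <- match(paste(df$CODE, df$YEAR - k), paste(df$CODE, df$YEAR))
  df$LOANS_NET[idx]
}

# 1-year growth: NA whenever either year is missing
banks$growth_1y <- banks$LOANS_NET / lag_by_year(banks, 1) - 1

# 3-year growth, annualized geometrically (an assumed convention)
banks$growth_3y_ann <- (banks$LOANS_NET / lag_by_year(banks, 3))^(1 / 3) - 1
```

Matching on the actual year rather than on row position is what makes this robust to missing years: a gap yields NA instead of silently comparing non-adjacent years.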
MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join two datasets together?
It says that you can join using the combined key of Month and Year.
But I have been struggling with this question for several days.
There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data=my.df)
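To see that merge really aligns on the Month/Year key and not on row position, here is a small self-contained sketch. It uses the four months from the question, with the data frame labels swapped as assumed in this answer, and deliberately shuffles the row order of one frame:

```r
# Marriage licenses, in calendar order (values from the question)
MarriageLicen <- data.frame(
  Month = c("Jan", "Feb", "Mar", "Apr"),
  Year  = 2011,
  MARRIAGE_LICENSES = c(754, 2706, 2689, 738)
)

# Business licenses, rows deliberately out of order
BusinessLicen <- data.frame(
  Year   = 2011,
  Month  = c("Apr", "Jan", "Mar", "Feb"),
  Amount = c(1376, 742, 1210, 796)
)

# Join on the combined key; row order in the inputs does not matter
my.df <- merge(BusinessLicen, MarriageLicen, by = c("Month", "Year"))

# Regress marriage licenses (Y) on business licenses (X)
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data = my.df)
coef(my.model)
```

With only four rows the fit itself means little; the point is that each Amount ends up next to the MARRIAGE_LICENSES value for the same month.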
I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on stock trading days, and sometimes we have Nx observations on non-trading days. We want to place an "NA" where we do not have any Nx observations on trading days but eliminate Nx observations where we have them on non-trading days, since without trading data for the same day, Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R Documentation for merge.zoo regarding the use of "all": that its use "allows intersection, union and left and right joins to be expressed". But trying all combinations of the following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged
(otherwise expanded)
and you want to keep all rows from the first argument but not the second so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason for the change in all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments whereas merge.data.frame only handles two so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
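The difference between the two format codes is easy to check in base R: %Y expects a full year with century, while %y expects a two-digit year. A minimal sketch on one of the question's date strings (the variable names wrong and right are just for illustration):

```r
# "%Y" treats "13" as a full year, so the result is not a 2013 date
wrong <- as.Date("1/2/13", format = "%m/%d/%Y")

# "%y" treats "13" as a two-digit year and maps it to 2013
right <- as.Date("1/2/13", format = "%m/%d/%y")

format(right)  # ISO form of the correctly parsed date
```

Since the zoo index is built from these dates, parsing with %y is what puts the series in 2013.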
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034
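One caveat with merging on the character dates: merge sorts the key as text, so a date like "1/10/13" would sort before "1/2/13". A sketch of the same left join with the key converted to Date first, cut down to the ABC columns for brevity and using a shared column name Date (an assumption, the question uses Date_Stk and Date_Nx):

```r
# Stock prices on trading days (ABC only, values from the question)
Stk_data <- data.frame(
  Date    = as.Date(c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13"),
                    format = "%m/%d/%y"),
  ABC_Stk = c(65.73, 66.85, 66.92, 66.60, 66.07)
)

# Nx observations, including two non-trading days (Jan 5 and 6)
Nx_data <- data.frame(
  Date   = as.Date(c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13"),
                   format = "%m/%d/%y"),
  ABC_Nx = c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564)
)

# Left join: keep every trading day, drop Nx rows without a trading day
merged <- merge(Stk_data, Nx_data, by = "Date", all.x = TRUE)
```

The result has one row per trading day in chronological order, with NA in ABC_Nx for Jan 3 and no rows for Jan 5 or 6.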