create new variable based on year for time series [duplicate] - r

gvkey = id
data = dataframe
data$t <- NA;
data[data$year = 2005, "t"] <- 1
data[data$year = 2006, "t"] <- 2
data[data$year = 2007, "t"] <- 3
data[data$year = 2008, "t"] <- 4
data[data$year = 2009, "t"] <- 5
data[data$year = 2010, "t"] <- 6
I want to create variable "t":
gvkey year t
1004 2005 1
1004 2006 2
1004 2007 3
1004 2008 4
1004 2009 5
1004 2010 6
1013 2005 1
1013 2006 2
1013 2007 3
1013 2008 4
1013 2009 5
1013 2010 6
.....
Somehow my code does not work. Do you have any idea why?
Is there a more efficient way to run this code?
I am new to R and would really appreciate your help.

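Your code does not work because = is not the comparison operator in R; equality is tested with ==, as in data[data$year == 2005, "t"] <- 1. For a more compact version that avoids one line per year, you can try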
data$t <- data$year - min(data$year) + 1
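
A minimal sketch of the suggestion above, run on a small data frame that mirrors the layout in the question. The sample rows and the column name t_by_id are made up for illustration. The second variant uses ave() to count from each gvkey's own first year, on the assumption that some ids might start later than others:

```r
# Hypothetical sample data shaped like the question's table
data <- data.frame(
  gvkey = rep(c(1004, 1013), each = 6),
  year  = rep(2005:2010, times = 2)
)

# Offset from the earliest year in the whole data set
data$t <- data$year - min(data$year) + 1

# Offset from the earliest year within each gvkey
data$t_by_id <- ave(data$year, data$gvkey,
                    FUN = function(y) y - min(y) + 1)

print(data)
```

When every id covers the same years, as in the question, both columns are identical.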

Related

Counting the unique values of two combined columns [duplicate]

I have a data.table as follows
library(data.table)
library(haven)
df1 <- fread(
"A B C iso year
0 B 1 NLD 2009
1 A 2 NLD 2009
0 Y 3 AUS 2011
1 Q 4 AUS 2011
0 NA 7 NLD 2008
1 0 1 NLD 2008
0 1 3 AUS 2012",
header = TRUE
)
I want to count the unique combinations of iso and year (which would be NLD 2009, AUS 2011, NLD 2008 and AUS 2012, so 4).
I tried df1[,uniqueN(.(iso, year))] and df1[,uniqueN(c("iso", "year"))]
The first one gives an error, and the second one gives the answer 2, where I am looking for 4 unique combinations.
What am I doing wrong here?
(As I am doing this with a big dataset of strings, I would prefer not to combine the columns and then test.)
You can solve it as follows using data.table package.
df1[, uniqueN(.SD), .SDcols=c("iso", "year")]
or
uniqueN(df1, by=c("iso", "year"))
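As an alternative to the data.table approach, count from dplyr lists each iso/year combination with its frequency, so the number of rows in the result is the number of unique combinations: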
library(dplyr)
df1 %>% count(iso, year)
Output:
iso year n
1: AUS 2011 2
2: AUS 2012 1
3: NLD 2008 2
4: NLD 2009 2
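
For comparison, here is a base R sketch of the same count that needs no extra packages. It rebuilds only the iso and year columns from the question, and the names n_combos and n_strings are made up for illustration. It also suggests why the second attempt in the question returned 2: c("iso", "year") is just a character vector holding two distinct strings, not a reference to the columns.

```r
# Only the two relevant columns from the question's table
df1 <- data.frame(
  iso  = c("NLD", "NLD", "AUS", "AUS", "NLD", "NLD", "AUS"),
  year = c(2009, 2009, 2011, 2011, 2008, 2008, 2012)
)

# Number of distinct (iso, year) rows
n_combos <- nrow(unique(df1[, c("iso", "year")]))

# What counting the literal vector c("iso", "year") amounts to
n_strings <- length(unique(c("iso", "year")))

print(n_combos)
print(n_strings)
```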

Conversion of monthly data to yearly data in a dataframe in r

I have a dataframe showing monthly mgpp from 2000-2010:
dataframe1
Year Month mgpp
1: 2000 1 0.01986404
2: 2000 2 0.011178429
3: 2000 3 0.02662008
4: 2000 4 0.05034293
5: 2000 5 0.23491388
---
128: 2010 8 0.13234501
129: 2010 9 0.10432369
130: 2010 10 0.04329537
131: 2010 11 0.04343289
132: 2010 12 0.09494946
I am trying to convert dataframe1 into a raster that shows the variable mgpp. However, I first want to reshape the data frame so that it holds only the yearly mgpp. The expected outcome is shown below:
dataframe1
Year mgpp
1: 2000 0.01986704
2: 2001 0.01578429
3: 2002 0.02662328
4: 2003 0.05089593
5: 2004 0.07491388
6: 2005 0.11229201
7: 2006 0.10318569
8: 2007 0.07129537
9: 2008 0.04373689
10: 2009 0.02885386
11: 2010 0.74848348
I want to aggregate the months by mean. For instance, the value for 2000 should be the mean of Jan-Dec in that year. How can I achieve this? Help would be appreciated.
Here is a data.table approach.
library(data.table)
setDT(dataframe1)[,.(Yearly.mgpp = mean(mgpp)),by=Year]
Year Yearly.mgpp
1: 2000 0.06858387
2: 2010 0.08366928
Or if you prefer dplyr.
library(dplyr)
dataframe1 %>%
group_by(Year) %>%
summarise(Yearly.mgpp = mean(mgpp))
# A tibble: 2 x 2
Year Yearly.mgpp
<dbl> <dbl>
1 2000 0.0686
2 2010 0.0837
Or base R.
result <- sapply(split(dataframe1$mgpp,dataframe1$Year),mean)
data.frame(Year = as.numeric(names(result)),Yearly.mgpp = result)
Year Yearly.mgpp
2000 2000 0.06858387
2010 2010 0.08366928
Sample Data
dataframe1 <- structure(list(Year = c(2000, 2000, 2000, 2000, 2000, 2010, 2010,
2010, 2010, 2010), Month = c(1, 2, 3, 4, 5, 8, 9, 10, 11, 12),
mgpp = c(0.01986404, 0.011178429, 0.02662008, 0.05034293,
0.23491388, 0.13234501, 0.10432369, 0.04329537, 0.04343289,
0.09494946)), class = "data.frame", row.names = c(NA, -10L
))
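
Another base R option, sketched here on the same sample data, is aggregate() with a formula, which returns a data frame directly. The name yearly is made up for illustration:

```r
# Same sample data as above
dataframe1 <- data.frame(
  Year  = c(2000, 2000, 2000, 2000, 2000, 2010, 2010, 2010, 2010, 2010),
  Month = c(1, 2, 3, 4, 5, 8, 9, 10, 11, 12),
  mgpp  = c(0.01986404, 0.011178429, 0.02662008, 0.05034293, 0.23491388,
            0.13234501, 0.10432369, 0.04329537, 0.04343289, 0.09494946)
)

# Mean of mgpp within each Year
yearly <- aggregate(mgpp ~ Year, data = dataframe1, FUN = mean)
print(yearly)
```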

Merge two data frames from a national survey with panel and not panel individuals of two different years (in r)

I tried to search on the website but I didn't find the answer to my question; if there is already one please write the link.
I have two data frames from a national survey: each year some families have already been interviewed and others are new. I want to merge the data frames so that only the families present in both are kept, and match them so that, for each individual, the 2014 values are in one row and the 2012 values in the next (for the sake of simplicity I omitted the other social variables present in the survey).
For example: df1 and df2
> df1 <- data.frame(nquest=c(173, 526, 1066, 1066), nord=c(1,1,1,2), year=c(2014, 2014, 2014, 2014))
> structure(df1)
nquest nord year
1 173 1 2014
2 526 1 2014
3 1066 1 2014
4 1066 2 2014
> df2 <- data.frame(nquest=c(173, 526, 3456, 3456), nord=c(1,1,1,2), year=c(2012, 2012, 2012, 2012))
> structure(df2)
nquest nord year
1 173 1 2012
2 526 1 2012
3 3456 1 2012
4 3456 2 2012
where nquest is the number of the family and nord the component of the family (ex. 1 father, 2 mother).
I want to merge them in this way:
> df <- data.frame(nquest=c(173, 173, 526,526), nord=c(1,1,1,1), year=c(2014, 2012, 2014, 2012))
> structure(df)
nquest nord year
1 173 1 2014
2 173 1 2012
3 526 1 2014
4 526 1 2012
I tried to merge them:
tot <- merge(df1, df2, by=c("nquest", "nord"))
structure(tot)
nquest nord year.x year.y
1 173 1 2014 2012
2 526 1 2014 2012
and I tried the rbind function:
> tot <- rbind(df1, df2)
> structure(tot)
nquest nord year
1 173 1 2014
2 526 1 2014
3 1066 1 2014
4 1066 2 2014
5 173 1 2012
6 526 1 2012
7 3456 1 2012
8 3456 2 2012
Thank you
This is an approach using "dplyr"; there is probably a better way to do the filtering, though.
library(dplyr)
bind_rows(df1, df2) %>%
filter( nquest %in% df1$nquest & nquest %in% df2$nquest) %>%
arrange(nquest, desc(year))
The second condition in the "arrange" call, the one that specifies year, is not necessary in this case, but I am putting it there for completeness.
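
The filter above matches on nquest only. If individuals should be matched on both nquest and nord, one possible base R sketch is to find the key pairs common to both data frames first and then stack the matching rows. The names keys and both are made up for illustration:

```r
df1 <- data.frame(nquest = c(173, 526, 1066, 1066),
                  nord = c(1, 1, 1, 2), year = 2014)
df2 <- data.frame(nquest = c(173, 526, 3456, 3456),
                  nord = c(1, 1, 1, 2), year = 2012)

# (nquest, nord) pairs that occur in both years
keys <- merge(df1[c("nquest", "nord")], df2[c("nquest", "nord")])

# Keep only those individuals from each year, then stack the rows
both <- rbind(merge(df1, keys), merge(df2, keys))

# 2014 row first, 2012 row next, for each individual
both <- both[order(both$nquest, both$nord, -both$year), ]
rownames(both) <- NULL
print(both)
```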

Grouping and Std. Dev in R

I have a data frame called dt. dt looks like this.
Year Sale
2009 6
2008 3
2007 4
2006 5
2005 12
2004 3
I am interested in getting the standard deviation of sales over the past four years. Where four years of data are not available, as for 2006, 2005, and 2004, I want NA. How can I create a new column with the value corresponding to each year? The new data would look like this:
Year Sale std.
2009 6 std(05,06,07,08)
2008 3 std(07,06,05,04)
2007 4 NA
2006 5 NA
2005 12 NA
2004 3 NA
I tried this a lot, but because I am a novice at R, I couldn't do it. Someone please help. Thanks.
Edit :
Here is the data with GVKEY.
GVKEY FYEAR IBC
1 1004 2003 3.504
2 1004 2004 18.572
3 1004 2005 35.163
4 1004 2006 59.447
5 1004 2007 75.745
Regards
Edit:
I am using the suggested rollapply function in this manner:
dt <- ddply(dt, .(GVKEY), function(x){x$ww <- rollapply(x$Sale,4,sd, fill =NA, align="right"); x});
But I am getting following error.
Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument
Not sure what I am doing wrong. The data with GVKEY is mentioned at the top.
You can use rollapply from package zoo:
require(zoo)
rollapply(df$Sale, 4, sd, fill=NA, align="right")
[edit] I used your data frame as sorted by year. If you have it in original order, you will probably need to use align="left"
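This is how I solved the problem:
library(sqldf) # provides sqldf()
library(TTR) # provides runSD()
dt <- dt[order(dt$GVKEY, dt$FYEAR), ]
dt <- sqldf("select GVKEY, FYEAR, IBC from dt")
# per GVKEY: rolling 4-year sd, shifted down one row so each year only sees earlier years
dt$STDEARN <- ave(dt$IBC, dt$GVKEY, FUN = function(x) {if (length(x) > 3) c(NA, head(runSD(x, 4), -1)) else rep(NA, length(x))})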
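
For readers without zoo or TTR, the "standard deviation of the previous four years" from the question can be sketched in base R as below. This assumes one row per year with no gaps, uses the Year/Sale table from the question, and the column name std is made up for illustration:

```r
dt <- data.frame(Year = 2004:2009, Sale = c(3, 12, 5, 4, 3, 6))
dt <- dt[order(dt$Year), ]  # oldest year first

# For row i, take the sd of the four rows before it; NA if fewer than four exist
dt$std <- sapply(seq_len(nrow(dt)), function(i) {
  if (i <= 4) NA_real_ else sd(dt$Sale[(i - 4):(i - 1)])
})

print(dt)
```

With grouped data such as the GVKEY table, the same function could be applied within each group, for example through ave().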

Simple filtering in R, but with more than one value

I am well aware of how to extract some data based on a condition, but whenever I try multiple conditions, a struggle ensues. I have some data and I only want to extract certain years from the df. Here is an example df:
year value
2006 3
2007 4
2007 3
2008 5
2008 4
2008 4
2009 5
2009 9
2010 2
2010 8
2011 3
2011 8
2011 7
2012 3
2013 4
2012 6
Now let's say I just want 2008, 2009, 2010, and 2011. I try
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
doesn't work, so then:
df<-df[df$year == "2008" & df$year == "2009"
& df$year == "2010" & df$year == "2011",]
No error messages, just an empty df. What am I missing?
You need to use %in% and not ==:
df[df$year %in% c(2008, 2009, 2010, 2011),]
year value
4 2008 5
5 2008 4
6 2008 4
7 2009 5
8 2009 9
9 2010 2
10 2010 8
11 2011 3
12 2011 8
13 2011 7
As answered, %in% works, but so would |. The & is AND logic, meaning that the year would need to be equal to 2008, 2009, 2010 AND 2011 all at once, which can never be true. What you want is the OR operator:
df<-df[df$year == "2008" | df$year == "2009" | df$year == "2010" | df$year == "2011",]
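
A small sketch of the difference, using a made-up vector of years and illustrative mask names: the & version is FALSE for every element, while the | version selects exactly what %in% selects.

```r
year <- c(2006, 2007, 2008, 2009, 2010, 2011, 2012)

# AND: a single year can never equal two different values at once
and_mask <- year == 2008 & year == 2009

# OR: TRUE when the year equals any of the wanted values
or_mask <- year == 2008 | year == 2009 | year == 2010 | year == 2011

# %in%: the same test written compactly
in_mask <- year %in% 2008:2011

print(and_mask)
print(or_mask)
print(in_mask)
```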
If you don't like %in%, try the function is.element. You might find it more intuitive.
df[is.element(el=df[,"year"], table=c(2008:2011)),]
Careful, though... switching el and table gives different results, and it can be confusing which way you want it. For this example, just remember that table holds the set of years that you want.
The question has been answered, but I wanted to add a comment about why your first try gives an unexpected result. This is a good example of R's vector recycling.
I'm guessing you got
year value
5 2008 4
12 2011 8
Why has R done this? What happens is R recycles the vector c("2008", "2009", "2010", "2011") like the below.
year value compare
2006 3 2008
2007 4 2009
2007 3 2010
2008 5 2011
2008 4 2008
2008 4 2009
2009 5 2010
2009 9 2011
2010 2 2008
2010 8 2009
2011 3 2010
2011 8 2011
2011 7 2008
2012 3 2009
2013 4 2010
2012 6 2011
Do you see what's about to happen? When you run
df<-df[df$year == c("2008", "2009", "2010", "2011"),]
it will return the rows where the year column and the compare column are equal. You didn't get a warning because the length of your comparison vector (4) happens to divide the number of rows (16) exactly, so R recycled it silently.
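
The recycling can be made visible with a short sketch on the example data. Here rep() builds the compare column explicitly, and the names wanted, compare, recycled, explicit and correct are made up for illustration:

```r
df <- data.frame(
  year  = c(2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009,
            2010, 2010, 2011, 2011, 2011, 2012, 2013, 2012),
  value = c(3, 4, 3, 5, 4, 4, 5, 9, 2, 8, 3, 8, 7, 3, 4, 6)
)
wanted <- c(2008, 2009, 2010, 2011)

# What == does implicitly: repeat the short vector to the length of df$year
compare <- rep(wanted, length.out = nrow(df))

recycled <- df[df$year == wanted, ]    # element-by-element comparison
explicit <- df[df$year == compare, ]   # same thing, recycling written out
correct  <- df[df$year %in% wanted, ]  # membership test

print(recycled)
print(nrow(correct))
```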
This is essentially the same as @Metrics' answer:
subset(df, year %in% c(2008, 2009, 2010, 2011))
And if you need help with %in%, see ?match
