R: Insert and fill missing periods in panel data

I'm trying to learn R coming from Stata, but have run into the following two problems which I cannot seem to find elegant solutions for in R:
1) I have a panel dataset with gaps in my time variable. I would like to expand my time variable to include the gaps despite having no observed data for these rows.
In Stata I would usually go about this by setting my ID and time variables with xtset and then expanding the dataset based on this with tsfill. Is there an equivalently elegant way in R?
2) I would like to fill some of the new, blank cells with data for constant variables.
In Stata I would do this by copying data from previous (relative to my time variable) observations using the l.-prefix; for example using replace Con = l.Con.
In other words I'm asking how to go from something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Jun 15 B
To something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 Mar A
1 Apr A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Apr B
2 May B
2 Jun 15 B
Hopefully that makes sense. Thanks in advance.

You can try merge from base R or the data.table join
library(data.table)
DT2 <- setDT(df1)[, {tmp <- match(Time, month.abb)
list(Time=month.abb[min(tmp):max(tmp)])}, .(ID,Con)]
setkey(df1[, c(1,4,2,3), with=FALSE], ID, Con, Time)[DT2]
# ID Con Time Num
# 1: 1 A Jan 10
# 2: 1 A Feb 15
# 3: 1 A Mar NA
# 4: 1 A Apr NA
# 5: 1 A May 20
# 6: 2 B Feb 12
# 7: 2 B Mar 14
# 8: 2 B Apr NA
# 9: 2 B May NA
#10: 2 B Jun 15
NOTE: It may be better to keep missing value as NA
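For comparison, here is a base R sketch of the same idea that needs no extra packages. It assumes, as the answer above does, that the data frame is called df1, that Time holds English month abbreviations within a single year, and that Con is constant within each ID. The helper column m and the names full and filled are introduced only for illustration.

```r
# Example data from the question (df1 is the name the answer assumes)
df1 <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  Time = c("Jan", "Feb", "May", "Feb", "Mar", "Jun"),
  Num  = c(10, 15, 20, 12, 14, 15),
  Con  = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Convert month abbreviations to integers so that ranges can be generated
df1$m <- match(df1$Time, month.abb)

# One row per ID for every month between that ID's first and last observation;
# Con is copied from the first row because it is assumed constant within ID
full <- do.call(rbind, lapply(split(df1, df1$ID), function(d) {
  data.frame(ID = d$ID[1], m = seq(min(d$m), max(d$m)), Con = d$Con[1])
}))

# Left join the observed Num values onto the complete grid; gaps become NA
filled <- merge(full, df1[, c("ID", "m", "Num")], by = c("ID", "m"), all.x = TRUE)
filled <- filled[order(filled$ID, filled$m), ]
filled$Time <- month.abb[filled$m]
filled <- filled[, c("ID", "Time", "Num", "Con")]
rownames(filled) <- NULL
filled
```

Because Con is written while the grid is built, no separate "copy from the previous row" step is needed here. For a variable that changes over time, a last-observation-carried-forward fill would be required instead.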

Related

Taking variance of some rows above in panel structure (R data.table)

# Example of a panel data
library(data.table)
panel<-data.table(expand.grid(Year=c(2017:2020),Individual=c("A","B","C")))
panel$value<-rnorm(nrow(panel),10) # The value I am interested in
I want to take the variance of prior two years value by Individual.
For example, if I were to sum the value of prior two years I would do something like:
panel[,sum_of_past_2_years:=shift(value)+shift(value, 2),Individual]
I thought this would work.
panel[,var(c(shift(value),shift(value, 2))),Individual]
# This doesn't work of course
Ideally the answer should look like
a<-c(NA,NA,var(panel$value[1:2]),var(panel$value[2:3]))
b<-c(NA,NA,var(panel$value[5:6]),var(panel$value[6:7]))
c<-c(NA,NA,var(panel$value[9:10]),var(panel$value[10:11]))
panel[,variance_past_2_years:=c(a,b,c)]
# NAs when there is no value for 2 prior years
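You can use frollapply to perform a rolling operation over every 2 values. Applying shift(value) first lags the series by one row, so each window covers the two prior years rather than the current one.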
library(data.table)
panel[, var := frollapply(shift(value), 2, var), Individual]
# Year Individual value var
# 1: 2017 A 9.416218 NA
# 2: 2018 A 8.424868 NA
# 3: 2019 A 8.743061 0.49138739
# 4: 2020 A 9.489386 0.05062333
# 5: 2017 B 10.102086 NA
# 6: 2018 B 8.674827 NA
# 7: 2019 B 10.708943 1.01853361
# 8: 2020 B 11.828768 2.06881272
# 9: 2017 C 10.124349 NA
#10: 2018 C 9.024261 NA
#11: 2019 C 10.677998 0.60509700
#12: 2020 C 10.397105 1.36742220
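If data.table is not available, the same result can be sketched in base R with ave(). This is only an illustration of the idea, not the answer's method: the helper var_prev2 is a hypothetical name, and set.seed(1) is added solely to make the example reproducible.

```r
set.seed(1)  # assumption: fixed seed for reproducibility only
panel <- expand.grid(Year = 2017:2020, Individual = c("A", "B", "C"))
panel$value <- rnorm(nrow(panel), 10)

# Variance of the two values preceding each position;
# NA when fewer than two prior values exist
var_prev2 <- function(x) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i > 2) out[i] <- var(x[(i - 2):(i - 1)])
  }
  out
}

# ave() applies the helper within each Individual and keeps row order
panel$variance_past_2_years <- ave(panel$value, panel$Individual, FUN = var_prev2)
panel
```

The explicit loop is slower than frollapply on large panels, but it makes the window definition (rows i-2 and i-1) easy to check against the expected output in the question.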

How to insert a value in a table

I have aggregated a table from my data file using this syntax:
sumtab <- as.data.frame(table(S$MONTH))
colnames(sumtab) <- c("Month", "Frq")
rownames(sumtab) <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug",
"Sep","Oct","Dec")
Resulting in this table sumtab:
Month Frq
Jan 1 3
Feb 2 5
Mar 3 16
Apr 4 45
May 5 11
Jun 6 16
Jul 7 99
Aug 8 101
Sep 9 45
Oct 10 456
Dec 12 112
And this script produces a ggplot:
ggplot(sumtab, aes(x=Month,y=Frq),width=1.5) +
scale_y_continuous(limit=c(0,17),expand=c(0, 0)) +
geom_bar(stat='identity',fill="lightgreen",colour="black") +
xlab("Month") + ylab("No of bears killed") +
theme_bw(base_size = 11) +
theme(axis.text.x=element_text(angle=0,size=9))
The problem is that there are no values for November in my data, and I need to somehow enter a zero for November in the table. It is probably a simple thing for most of you. I have searched other questions, googled, and read the books, but have been unable to find the correct syntax. I need a little help.
Adding rbind into the script:
sumtab <- as.data.frame(table(S$MONTH))
sumtab <- rbind(sumtab, c(11, 0))
produced this error message:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = 11) :
invalid factor level, NA generated
and this table:
Var1 Freq
1 1 3
2 2 5
3 3 6
4 4 14
5 5 7
6 6 2
7 7 13
8 8 12
9 9 3
10 10 1
11 12 4
12 <NA> 0
So thanks @PaulH for your help, but I've probably used it in the wrong way.
You could use the rbind command to add the November row:
sumtab <- rbind(sumtab, Nov = c(11, 0))
Good luck!
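An alternative that avoids the "invalid factor level" warning shown in the question is to declare all twelve months as factor levels before tabulating, so that table() reports a zero for any month with no rows. The sketch below uses a hypothetical vector month as a stand-in for S$MONTH, since the original data is not shown.

```r
# Hypothetical stand-in for S$MONTH: month numbers with no November (11)
month <- c(1, 1, 2, 3, 3, 3, 10, 12, 12)

# Declaring all 12 levels makes table() count absent months as zero
month_f <- factor(month, levels = 1:12, labels = month.abb)

sumtab <- as.data.frame(table(month_f))
colnames(sumtab) <- c("Month", "Frq")
sumtab
```

Because Month is a factor with its levels in calendar order, ggplot should also draw the bars from Jan to Dec, with an empty slot for Nov.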

Clean way to calculate both group and overall statistics

I would like to calculate the median not only for different groups of my data, but also the median over all groups, and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.
Right now, what I'm doing is calculate both statistics separately; then join the results; then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the median delay per day and per month:
library(dplyr)
library(hflights)
data(hflights)
# Calculate both statistics separately
per_day <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Daily")
per_month <- hflights %>%
group_by(Year, Month) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Monthly", DayofMonth = NA)
# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))
my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
#
# Year Month DayofMonth Delay Interval
# 1 2011 1 1 10.067642 Daily
# 2 2011 1 2 10.509745 Daily
# 3 2011 1 3 6.038627 Daily
# 4 2011 1 4 7.970740 Daily
# 5 2011 1 5 4.172650 Daily
# 6 2011 1 6 6.069909 Daily
# 7 2011 1 7 3.907295 Daily
# 8 2011 1 8 3.070140 Daily
# 9 2011 1 9 17.254325 Daily
# 10 2011 1 10 11.040388 Daily
# .. ... ... ... ... ...
Are there better ways to do this?
(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)
As a one-off table. This is fairly straightforward in data.table:
require(data.table)
setDT(hflights)[,{
mo_del <- mean(ArrDelay,na.rm=TRUE)
.SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
# Year Month DayofMonth DailyDelay MonthlyDelay
# 1: 2011 1 1 10.0676417 4.926065
# 2: 2011 1 2 10.5097451 4.926065
# 3: 2011 1 3 6.0386266 4.926065
# 4: 2011 1 4 7.9707401 4.926065
# 5: 2011 1 5 4.1726496 4.926065
# ---
# 361: 2011 12 14 1.0293610 5.013244
# 362: 2011 12 17 -0.1049822 5.013244
# 363: 2011 12 24 -4.1457490 5.013244
# 364: 2011 12 25 -2.2976827 5.013244
# 365: 2011 12 31 46.4846491 5.013244
How it works. The basic syntax is DT[i,j,by].
With by=.(Year,Month), all operations in j are done per "by group."
We can nest another "by group" using the data.table of the current Subset of Data, .SD.
To return columns in j we use .(colname1=col1,colname2=col2,...).
Creating new variables. Alternately, we could create new variables in hflights using := in j.
hflights[,DailyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]
Then we can view the summary table:
hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
# Year Month DayofMonth DailyDelay MonthlyDelay .GRP
# 1: 2011 1 1 10.0676417 4.926065 1
# 2: 2011 1 2 10.5097451 4.926065 2
# 3: 2011 1 3 6.0386266 4.926065 3
# 4: 2011 1 4 7.9707401 4.926065 4
# 5: 2011 1 5 4.1726496 4.926065 5
# ---
# 361: 2011 12 14 1.0293610 5.013244 361
# 362: 2011 12 17 -0.1049822 5.013244 362
# 363: 2011 12 24 -4.1457490 5.013244 363
# 364: 2011 12 25 -2.2976827 5.013244 364
# 365: 2011 12 31 46.4846491 5.013244 365
(Something needed to be put in j here, so I used the "by group" code, .GRP.)
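The "create new variables, then collapse" pattern can also be sketched in base R with ave() on a small made-up table. The data frame flights and its values below are hypothetical and only stand in for hflights; mean_na is a helper name introduced for illustration.

```r
# Hypothetical toy data standing in for hflights
flights <- data.frame(
  Month    = c(1, 1, 1, 2, 2, 2),
  Day      = c(1, 1, 2, 1, 2, 2),
  ArrDelay = c(10, 20, 30, 5, 15, 25)
)

mean_na <- function(x) mean(x, na.rm = TRUE)

# Group-level statistic written back to every row of its group
flights$DailyDelay   <- ave(flights$ArrDelay, flights$Month, flights$Day, FUN = mean_na)
flights$MonthlyDelay <- ave(flights$ArrDelay, flights$Month, FUN = mean_na)

# Collapse to one row per (Month, Day), like the .GRP summary above
summary_tab <- unique(flights[, c("Month", "Day", "DailyDelay", "MonthlyDelay")])
summary_tab
```

Swapping mean_na for a median-based helper would give the medians the question asks about, since each statistic is computed directly from the raw rows rather than rolled up from another summary.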

copy as long as value does not change in R

I have a data frame which contains multiple rows, see example.
df <- data.frame(rbind(c('1','CAR','Jan'),
c('2','3','4'),
c('5','6','7'),
c('8','CAR','Feb'),
c('9','10', '11'),
c('12','13','14')))
I would like to copy the value that comes after CAR (Jan and Feb) into a new column X4, repeating it until CAR appears again. The number of rows between the CARs varies, but the number of columns is always the same.
The data should look like this
data.frame(rbind(c('1','CAR','Jan','Jan' ),
c('2','3','4','Jan'),
c('5','6','7','Jan'),
c('8','CAR','Feb','Feb'),
c('9','10','11','Feb'),
c('12','13','14','Feb')))
I have tried different options (ifelse, if, for loop), but none of them provides the right result.
Would you have any hints on how to solve this?
Thanks in advance
Eric
Here's another data.table solution
library(data.table)
setDT(df)[, X4 := X3[1L], by = cumsum(X2 == "CAR")]
df
# X1 X2 X3 X4
# 1: 1 CAR Jan Jan
# 2: 2 3 4 Jan
# 3: 5 6 7 Jan
# 4: 8 CAR Feb Feb
# 5: 9 10 11 Feb
# 6: 12 13 14 Feb
We could also do a similar thing using dplyr (but it will add an indx column too)
library(dplyr)
df %>%
group_by(indx = cumsum(X2 == "CAR")) %>%
mutate(X4 = X3[1L])
You can try
library(data.table)
library(zoo)
setDT(df)[X2=='CAR', X4:= X3][, X4:= na.locf(X4)]
# X1 X2 X3 X4
#1: 1 CAR Jan Jan
#2: 2 3 4 Jan
#3: 5 6 7 Jan
#4: 8 CAR Feb Feb
#5: 9 10 11 Feb
#6: 12 13 14 Feb
Here's an uglier, base-R version of David's answer:
df$X4 <- unlist(tapply(
df$X3,
cumsum(df$X2=="CAR"),
function(x) rep(as.character(x[1]), length(x)) # as.character() works for both factor and character columns
))

insert NA into a time series object in r

I want to sum the months for all the years in a time series that looks like
Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec
2006 4 4 3 4 4 5 5 3 3
2007 3 3 2 2 4 3 3 2 2 5 5
2008 3 3 3 2 2 4 4 3
by using
window(the_time_series_object, start = c(2006, 3), end = c(2008, 3), frequency = 1)
This line gives you a new ts object with just the March values for each year in the range. However, it does not work when the month does not have any values in it. Is there any way to replace the gaps with NA? I have seen questions like this before, but I don't think they answer it for a ts object.
Assuming that
the_time_series_object <- ts(1:31, frequency = 12, start = c(2006, 3))
then:
window(the_time_series_object, start = c(2006, 3), end = c(2008, 3), frequency = 12)
Your frequency should be 12 instead of 1. There's no NA problem; it's just that one argument that you have wrong.
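If the goal is still to sum one calendar month across years, one possible approach (not the one given in the answer above) is to keep the series at frequency 12 with explicit NA for the missing months and select by cycle(). The sketch below reuses the answer's example series and sets one March to NA purely for illustration.

```r
# The answer's example series, with March 2007 set to NA for illustration
vals <- 1:31
vals[13] <- NA  # position 13 is March 2007 when the series starts in March 2006
the_time_series_object <- ts(vals, frequency = 12, start = c(2006, 3))

# cycle() gives the position within the year (1 = Jan, ..., 12 = Dec)
march <- the_time_series_object[as.integer(cycle(the_time_series_object)) == 3]

# Sum the March values, skipping the missing one
march_total <- sum(march, na.rm = TRUE)
march_total
```

Storing the gaps as NA keeps every month at its correct position, which is what makes selection by cycle() reliable.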
