Imputing observations to make cross section a time series in R - r

I'm hoping to take a dataset with cross section salary data for employees and create a large uninterrupted time series, imputing values along the way. Suppose I have:
name <- c("carl","carl","bob","rick","rick","rick","rick")
sex <- c(rep("M",7))
salary <- c(18000, 14000, 34000, 11000, 23000, 23000, 25000)
date <- as.Date(c("2007-04-30","2007-07-30","2009-12-09","2006-01-01",
"2008-01-01","2009-12-09", "2010-01-01"))
salaries <- data.frame(name,sex,salary,date)
salaries
name sex salary date
carl M 18000 2007-04-30
carl M 14000 2007-07-30
bob M 34000 2009-12-09
rick M 11000 2006-01-01
rick M 23000 2008-01-01
rick M 23000 2009-12-09
rick M 25000 2010-01-01
As we can see, poor carl got his salary cut by 4k in July. Prior to that, he was earning 18k. This was the case for 3 months before he got the cut, but my data doesn't reflect this. I'd like to make a nice picture showing this trend, but first I need to change the data to look like this (where * denotes imputed values):
head(salaries)
name sex salary date change
carl M 18000 2007-04-30 0
carl M 18000 2007-05-30* 0
carl M 18000 2007-06-30* 0
carl M 14000 2007-07-30 1
bob M 34000 2009-12-09 0
rick M 11000 2006-01-01 0
rick M 11000 2006-02-01* 0
... .. ....... ...... ....
rick M 11000 2007-12-01* 0
rick M 23000 2008-01-01 1
rick M 23000 2008-02-01* 1
.... ...... ...... ........
rick M 23000 2009-12-09 1
rick M 25000 2010-01-01 2
So I'd like to impute the in-between values and also mark when a change occurs. A guy like bob, who never had a salary change, just stays at 0. But rick, who has had multiple salary changes, gets marked each time so we know when the change occurred and which number it is. I'm only interested in the month as the unit of analysis, but it would be useful to know how to impute daily as well.

If you have a single time series,
you can use na.locf to replace missing values with the last available value
or approx if you only want to interpolate between values.
To create those individual time series, you can convert the data between your "tall" (normalized) format and a "wide" format with dcast and melt.
To count the number of changes, you can use ddply and cumsum.
library(reshape2)
library(plyr)
library(zoo)
# Convert to wide format
d <- dcast( salaries, date ~ name, value.var = "salary" )
# Add all the dates you want
dates <- seq.Date( from = min(d$date), max(d$date), by="month" )
d <- merge( d, data.frame(date=dates), all=TRUE )
# Fill in the missing values
# If you want the last non-missing value:
#d <- as.data.frame(lapply(d, na.locf, na.rm=FALSE))
# If you only want to interpolate between values:
d <- as.data.frame(lapply(d,
function(x) approx( seq_along(x), x, seq_along(x), method="constant" )$y
))
# Convert back to the tall format
d <- melt(d, id.vars="date", value.name="salary", variable.name="name", na.rm=TRUE)
# Add the number of changes
d <- ddply(
d, "name", transform,
change = cumsum(c(0, diff(salary) != 0))
)
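To make the difference between the two fill strategies concrete, here is a minimal base R sketch on a toy vector. The `locf` helper is hypothetical, written here only as a stand-in for `zoo::na.locf(x, na.rm = FALSE)` so the example needs no packages. The last lines apply the same change counter as above to rick's four records.

```r
# A short salary history with gaps (NA = a month with no record).
salary <- c(NA, 18000, NA, NA, 14000, NA)

# Hypothetical helper: base R stand-in for zoo::na.locf(x, na.rm = FALSE).
locf <- function(x) {
  seen <- cumsum(!is.na(x))        # number of observed values so far
  c(NA, x[!is.na(x)])[seen + 1]    # nothing observed yet -> NA
}

# Last observation carried forward: also fills after the final record.
carried <- locf(salary)
# carried: NA 18000 18000 18000 14000 14000

# approx() with method = "constant": fills only between observed records.
between <- approx(seq_along(salary), salary, seq_along(salary),
                  method = "constant")$y
# between: NA 18000 18000 18000 14000 NA

# Change counter on a gap-free series (rick's records from the question).
rick <- c(11000, 23000, 23000, 25000)
change <- cumsum(c(0, diff(rick) != 0))
# change: 0 1 1 2
```

The repeated 23000 does not increase the counter, which matches the output requested in the question.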

Elaborating on @Vincent's advice:
name <- c("carl","carl","bob","rick","rick","rick","rick")
sex <- c(rep("M",7))
salary <- c(18000, 14000, 34000, 11000, 23000, 23000, 25000)
office <- c('melbourne','sydney','adelaide','perth','perth','melbourne','melbourne')
date <- as.Date(c("2007-04-30","2007-07-30","2009-12-09","2006-01-01",
"2008-01-01","2009-12-09", "2010-01-01"))
salaries <- data.frame(name,sex,salary,date, office)
salaries
library(reshape2)
library(plyr)
library(zoo)
Dealing with numeric vector using approx
# Convert to wide format
d <- dcast( salaries, date ~ name, value.var = "salary" )
# Add all the dates you want
dates <- seq.Date( from = min(d$date), max(d$date), by="month" )
d <- merge( d, data.frame(date=dates), all=TRUE )
# Fill in the missing values
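# If you want to fill with na.locf instead (fromLast = TRUE fills backward from
# the next value; omit it to carry the last non-missing value forward):
#d <- as.data.frame(lapply(d, na.locf, na.rm=FALSE, fromLast = TRUE))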
#If you only want to interpolate between values:
d <- as.data.frame(lapply(d,
function(x) approx( seq_along(x), x, seq_along(x), method="constant" )$y
))
# Convert back to the tall format
d <- melt(d, id.vars="date", value.name="salary", variable.name="name", na.rm=TRUE)
# Add the number of changes
d <- ddply(
d, "name", transform,
change = cumsum(c(0, diff(salary) != 0))
)
Convert character vector with na.locf
# Convert to wide format
a <- dcast( salaries, date ~ name, value.var = "office" )
# Add all the dates you want
dates <- seq.Date( from = min(a$date), max(a$date), by="month" )
a <- merge( a, data.frame(date=dates), all=TRUE )
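# Fill in the missing values using na.locf
# (fromLast = TRUE carries the next observation backward; drop it to carry the
# last observation forward, which is what approx does for salary above)
a <- as.data.frame(lapply(a, na.locf, na.rm=FALSE, fromLast = TRUE))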
# Convert back to the tall format
a <- melt(a, id.vars="date", value.name="office", variable.name="name", na.rm=TRUE)
Merge results
# approx() returned the dates as plain numbers; convert them back to Date
d$date <- as.Date(d$date, origin = "1970-01-01")
out = merge(a,d, by = c('name','date'))
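The question also asks about imputing at daily resolution. A minimal base R sketch for a single employee, assuming the same "hold the previous salary until the next record" rule, could look like the following. The names `carl`, `days` and `daily` are illustrative only; for the whole data set, the wide-format approach above should work the same way with by="day" in seq.Date.

```r
# carl's two records from the question.
carl <- data.frame(
  date   = as.Date(c("2007-04-30", "2007-07-30")),
  salary = c(18000, 14000)
)

# One row per calendar day between the first and last record.
days <- seq(min(carl$date), max(carl$date), by = "day")

# Step interpolation: each day takes the most recent recorded salary.
daily <- data.frame(
  date   = days,
  salary = approx(as.numeric(carl$date), carl$salary,
                  xout = as.numeric(days), method = "constant")$y
)

# Running count of salary changes.
daily$change <- cumsum(c(0, diff(daily$salary) != 0))
```

Here the first 91 days carry 18000 with change 0, and the final day (2007-07-30) carries 14000 with change 1.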

Related

Extracting the frequencies in percentage

Here is a data frame with a numeric volume column:
df1 <- data.frame(class = c("a","b","a","b"), date = c(2009,2009,2010,2010), volume=c(1,1,2,0))
How can I convert the volume column to a percentage of the total for the same year (date) across the different classes?
data.frame(class = c("a","b","a","b"), date = c(2009,2009,2010,2010), volumepercentage=c("50.00%","50.00%","100.00%","0.00%"))
Here is a base R approach:
df1.spl <- split(df1, df1$date)
df1.lst <- lapply(df1.spl, function(x) data.frame(x, pct=prop.table(x$volume)*100))
df2 <- do.call(rbind, df1.lst)
df2
# class date volume pct
# 2009.1 a 2009 1 50
# 2009.2 b 2009 1 50
# 2010.3 a 2010 2 100
# 2010.4 b 2010 0 0
Note the change in the row names. The command rownames(df2) <- NULL will remove them.
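If the percentages should come out as formatted strings, as in the question, one possible variant avoids the split/rbind step by using ave to compute each year's total in place. This is a sketch on the question's data, with the data frame named df1 as in the answer above:

```r
df1 <- data.frame(class = c("a", "b", "a", "b"),
                  date = c(2009, 2009, 2010, 2010),
                  volume = c(1, 1, 2, 0))

# ave() returns, for every row, the sum of volume within that row's year.
year_total <- ave(df1$volume, df1$date, FUN = sum)

# Percentage of the yearly total, formatted with two decimals and a % sign.
df1$volumepercentage <- sprintf("%.2f%%", 100 * df1$volume / year_total)
# df1$volumepercentage: "50.00%" "50.00%" "100.00%" "0.00%"
```

Because ave keeps the original row order, the row names are left untouched.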

Combining two time series with different ranges, when column headings are the dates

I am stuck trying to combine two time series datasets that have different ranges and both are stored with item# in column1 and date as column headings. For example:
df1
#ITEM 1/1/16 1/2/16 1/3/16 ... 3/24/17
#1 350 365 370 ... 400
#2 100 95 101 ... 95
#3 5 8 9 ... 15
The other dataset's range is smaller; it's in the same format, and both are daily frequency.
How can I append the rows of df2 to df1 despite the different ranges, while making sure the dates are aligned when merged? I'm happy with NA in the new dataframe where df2 didn't have values for dates in df1.
Should I create these as xts objects so that once they are merged I can easily pull data for item 1 on date X? Or is there an easy way to do that with this format as well?
Thanks in advance for your help.
One option is to use data.table::rbindlist(list(df1, df2), fill = TRUE),
which fills missing columns with NAs.
Example:
library(data.table)
dt1 <- data.table(item=c(1,2,3),"d1/1/16" = c(350,100,5) ,"d1/2/16" = c(360,120,7))
dt2 <- data.table(item=c(3,4,5),"d1/2/16" = c(50,50,2) ,"d1/3/16" = c(460,150,9))
l = list(dt1,dt2)
data.table::rbindlist(l, use.names= TRUE, fill=TRUE, idcol=TRUE )
Normally in R, time series are represented in columns, not rows. Assuming we have DF1 and DF2 shown reproducibly in the Note at the end, here are some alternatives.
1) zoo: We can create a zoo series from each by transposing, then merge them:
library(zoo)
fmt <- "%m/%d/%y"
z1 <- setNames(zoo(t(DF1[-1]), as.Date(names(DF1[-1]), fmt)), DF1[[1]])
z2 <- setNames(zoo(t(DF2[-1]), as.Date(names(DF2[-1]), fmt)), DF2[[1]])
z <- merge(z1, z2)
It is probably best to leave this as the zoo series z, but if you want to transform it to a data frame then use: fortify.zoo(z)
2) base: Alternatively, without zoo, using fmt from above:
d1 <- data.frame(as.Date(names(DF1[-1]), fmt), t(DF1[-1]))
names(d1) <- c("Index", DF1[[1]])
d2 <- data.frame(as.Date(names(DF2[-1]), fmt), t(DF2[-1]))
names(d2) <- c("Index", DF2[[1]])
merge(d1, d2, by = "Index", all = TRUE)
Note: The input in reproducible form is assumed to be:
Lines <- "ITEM 1/1/16 1/2/16 1/3/16 3/24/17
1 350 365 370 400
2 100 95 101 95
3 5 8 9 15"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)
DF1 <- DF[1:2, 1:3]
DF2 <- DF[3, -3]
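The question also asks how to pull the value for a given item on a given date. With dates in rows, as in the base approach above, that becomes an ordinary row and column lookup. The sketch below repeats the Note's input so it runs on its own; the helper `dates_in_rows` is a hypothetical name introduced here for illustration.

```r
Lines <- "ITEM 1/1/16 1/2/16 1/3/16 3/24/17
1 350 365 370 400
2 100 95 101 95
3 5 8 9 15"
DF <- read.table(text = Lines, header = TRUE, check.names = FALSE)
DF1 <- DF[1:2, 1:3]
DF2 <- DF[3, -3]

# Hypothetical helper: transpose so that each row is a date and each
# column is an item.
dates_in_rows <- function(df, fmt = "%m/%d/%y") {
  out <- data.frame(as.Date(names(df)[-1], fmt), t(df[-1]),
                    check.names = FALSE)
  names(out) <- c("Index", as.character(df[[1]]))
  rownames(out) <- NULL
  out
}

merged <- merge(dates_in_rows(DF1), dates_in_rows(DF2),
                by = "Index", all = TRUE)

# Item 1 on 2016-01-02 (present in DF1).
item1 <- merged[merged$Index == as.Date("2016-01-02"), "1"]
# Item 3 on 2016-01-02 (DF2 has no column for that date, so this is NA).
item3 <- merged[merged$Index == as.Date("2016-01-02"), "3"]
```

Here item1 is 365 and item3 is NA, since the merge keeps all dates from both inputs.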

Using lapply to output values between date ranges within different factor levels

I have 2 dataframes, one representing daily sales figures of different stores (df1) and one representing when each store has been audited (df2). I need to create a new dataframe displaying sales information from each site taken 1 week before each audit (i.e. the information in df2). Some example data, firstly for the daily sales figures from different stores across a certain period:
Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3))
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")
And for the dates of each audit across different stores:
Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")
Of note is that there will be an uneven number of dates within each output (i.e. there may not be a full week's worth of information prior to some store audits). I have previously asked a question addressing a similar problem: "Creating a dataframe from an lapply function with different numbers of rows". Below is an answer from that question, which would work if I were to consider information from only one store:
library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1))
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))
##Output
Store.A_output<- lapply(Store.A_df2, function(x) {Store.A_df1[difftime(Store.A_df1[,1], x - days(7)) >= 0 & difftime(Store.A_df1[,1], x) <= 0, ]})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output, function(x) x[seq_len(n1),]))
But I don't know how I would get this for multiple sites.
Try this:
# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")
library(tidyr)
library(dplyr)
# Gather df1 so that df1 and df2 have the same format:
df1 = gather(df1, Store, Sales, -t)
head(df1)
t Store Sales
1 2015-12-30 Store.A 16
2 2015-12-31 Store.A 24
3 2016-01-01 Store.A 8
4 2016-01-02 Store.A 42
5 2016-01-03 Store.A 7
6 2016-01-04 Store.A 46
# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:
r <- lapply(1:nrow(df2), function(i) {
audit.t = df2[i, "t"] #time of audit
audit.s = df1[, "Store"] == df2[i, "Store"] #store audited
df = df1[audit.s, ] #data from audited store
df[, "audited"] = audit.t #add extra column with audit date
week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
week_audit = difftime(df[, "t"], audit.t) <= 0
df[week_before & week_audit, ]
})
Does this give you the proper subsets?
Also, to summarise your results:
r = do.call("rbind", r) %>%
group_by(audited, Store) %>%
summarise(sales = sum(Sales))
r
audited Store sales
<time> <chr> <int>
1 2016-01-04 Store.A 97
2 2016-02-01 Store.B 156
3 2016-02-01 Store.C 226
4 2016-03-01 Store.A 115
5 2016-03-01 Store.C 187
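For comparison, here is a base R sketch of the same idea that stores both the sales dates and the audit dates as Date, so the window is plain date arithmetic with no time zone conversion between Date and POSIXct. The data are simulated under an assumed seed, and the object names (`sales`, `audits`, `windows`) are illustrative rather than taken from the question.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Long-format daily sales for three stores.
dates <- seq(as.Date("2015-12-30"), as.Date("2016-04-07"), by = "day")
sales <- data.frame(
  t     = rep(dates, times = 3),
  Store = rep(c("Store.A", "Store.B", "Store.C"), each = length(dates)),
  Sales = sample(0:50, 3 * length(dates), replace = TRUE)
)

# Audit dates per store, stored as Date.
audits <- data.frame(
  Store = c("Store.A", "Store.A", "Store.B", "Store.C", "Store.C"),
  t = as.Date(c("2016-01-04", "2016-03-01", "2016-02-01",
                "2016-02-01", "2016-03-01"))
)

# For each audit, keep that store's rows from 7 days before up to the audit.
windows <- lapply(seq_len(nrow(audits)), function(i) {
  keep <- sales$Store == audits$Store[i] &
    sales$t >= audits$t[i] - 7 &
    sales$t <= audits$t[i]
  cbind(sales[keep, ], audited = audits$t[i])
})
res <- do.call(rbind, windows)

# Total sales per audit window.
totals <- aggregate(Sales ~ Store + audited, data = res, FUN = sum)
```

The first Store.A audit has only 6 rows because the data start on 2015-12-30; the other windows have 8 rows each, which is the uneven-length case mentioned in the question.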

jumping average column at every n-th rows

Please help me with this.
I have daily observations (a data frame) for a 32-year period (around 11659 rows in total; some rows are missing).
I want to calculate the average of each column at every 365th interval (i.e. every Jan-01 over the 32-year period, every Jan-02 over the 32-year period, etc.).
So the output would have 365 rows in total, and each row is the average of 32 rows taken at a 365-row interval.
Any suggestions? I found a similar case, tried its solution and modified it a bit, but the output is not correct. In particular, I don't understand the sapply part below.
df <-data.frame(x=c(1:10000),y=c(1:10000))
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nr <- nrow(x)
split.index <- rep(1:ceiling(nr / by), each = by, length.out = nr)
} else
{
nr <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nr), split.index)
# Pass index list to fun using sapply() and return object #this is where I am lost
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
thank you for your time..
How about using the plyr package:
require(plyr) # for aggregating data
series<-data.frame(date=as.Date("1964-01-01")+(1:10000),
obs=runif(10000),
obs2=runif(10000),
obs3=runif(10000))
ddply(series, # run on series df
.(DOY=format(date,"%j")), # group by day of year as a string, "001" to "366" (call col DOY)
summarise, # tell the function to summarise by group (day of year)
daymean=mean(obs), # calculate the mean
daymean2=mean(obs2), # calculate the mean
daymean3=mean(obs3) # calculate the mean
)
# DOY daymean daymean2 daymean3
#1 001 0.4957763 0.4882559 0.4944281
#2 002 0.5184197 0.4970996 0.4720893
#3 003 0.5192313 0.5185357 0.4878891
#4 004 0.4787227 0.5150596 0.5317068
#5 005 0.4972933 0.5065012 0.4956527
#6 006 0.5112484 0.5276013 0.4785681
#...
Although there's possibly a special function, which does exactly what you need, here is a solution using ave:
set.seed(1)
dates = seq(from=as.Date("1970-01-01"), as.Date("2000-01-01"), by="day")
df <- data.frame(val1=runif(length(dates)),
val2=rchisq(length(dates), 10))
day <- format(dates, "%j") # day of year as a string ("001" to "366")
df <- cbind(df, setNames(as.data.frame(sapply(df, function(x) {
ave(x, day) # calculate mean by day for df$val1 and df$val2
})), paste0(names(df), "_mean")))
head(df[1:365, 3:4], 3)
# val1_mean val2_mean
# 1 0.5317151 10.485001
# 2 0.5555664 10.490968
# 3 0.6428217 10.763027
That is, if I understood your task correctly.
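One caveat with grouping on "%j": it numbers days within each year, so from March onward the day numbers in leap years are shifted by one relative to other years (1 March is day 060 in a common year and 061 in a leap year). If the goal is really "every Jan-01, every Jan-02, ...", grouping on month and day may be closer to what is wanted. A base R sketch on simulated data, where the column names and the 1970 to 2001 range are assumptions:

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# 32 years of simulated daily data.
dates <- seq(as.Date("1970-01-01"), as.Date("2001-12-31"), by = "day")
df <- data.frame(date = dates,
                 val1 = runif(length(dates)),
                 val2 = runif(length(dates)))

# Group key: calendar month and day, e.g. "01-01", "02-29", "12-31".
df$month_day <- format(df$date, "%m-%d")

# Mean of every value column for each calendar day across all years.
daily_means <- aggregate(cbind(val1, val2) ~ month_day, data = df, FUN = mean)
```

The result has 366 rows rather than 365, because "02-29" forms its own group averaged over the leap years only; that row can be dropped if it is not needed.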

Find range of values in each unique day

I have the following example:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
t = rnorm(length(Date1)))
I would like to find the range of values in a given day (i.e. maximum - minimum).
First, I've defined additional columns which define the unique days in terms of the date and in terms of the day of year (doy).
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d") # find the unique days
Dat$doy <- as.numeric(format(Dat$DateTime, format="%j")) # find the unique days
To then find the range I tried
by(Dat$t, Dat$doy, function(x) range(x))
but this returns the range as two values, not a single value. So my question is: how do I calculate the range for each day and return the results in a data.frame of the form
new_data <- data.frame(date = unique(Dat$date),
range = ...)
Can anyone suggest a method for doing this?
I tend to use tapply for this kind of thing. ave is also useful sometimes. Here:
> dr = tapply(Dat$t,Dat$doy,function(x){diff(range(x))})
Always check tricksy stuff:
> dr[1]
121
3.084317
> diff(range(Dat$t[Dat$doy==121]))
[1] 3.084317
Use the names attribute to get the day-of-year and the values to make a data frame:
> new_data = data.frame(date=names(dr),range=dr)
> head(new_data)
date range
121 121 3.084317
122 122 4.204053
Did you want to convert the number day-of-year back to a date object?
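If dates rather than day-of-year numbers are wanted in the result, one option is to group on the date string directly and convert the group names back to Date afterwards. This is a sketch using the same tapply idea; the fixed seed and the explicit tz = "UTC" are assumptions added so that day boundaries do not depend on the local time zone.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

Date1 <- seq(from = as.POSIXct("2010-05-01 02:00", tz = "UTC"),
             to = as.POSIXct("2010-10-10 22:00", tz = "UTC"), by = 3600)
Dat <- data.frame(DateTime = Date1, t = rnorm(length(Date1)))
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d")

# One value per day: maximum minus minimum.
dr <- tapply(Dat$t, Dat$date, function(x) diff(range(x)))

# The names of dr are the date strings, so they convert straight to Date.
new_data <- data.frame(date = as.Date(names(dr)), range = as.vector(dr))
```

Grouping on the full date also keeps days from different years apart, which grouping on doy alone would not.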
# Use the data.table package
require(data.table)
# Set seed so data is reproducible
set.seed(42)
# Create data.table
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"), to = as.POSIXct("2010-10-10 22:00"), by = 3600)
DT <- data.table(date = as.IDate(Date1), t = rnorm(length(Date1)))
# Set key on data.table so that it is sorted by date
setkey(DT, "date")
# Make a new data.table with the required information (can be used as a data.frame)
new_data <- DT[, diff(range(t)), by = date]
# date V1
# 1: 2010-05-01 4.943101
# 2: 2010-05-02 4.309401
# 3: 2010-05-03 4.568818
# 4: 2010-05-04 2.707036
# 5: 2010-05-05 4.362990
# ---
# 159: 2010-10-06 2.659115
# 160: 2010-10-07 5.820803
# 161: 2010-10-08 4.516654
# 162: 2010-10-09 4.010017
# 163: 2010-10-10 3.311408

Resources