Merging irregular time series - r

I have two time series: one is daily and the other is irregular (it only has a row when something changes). In my case these are share prices and ratings. I need to merge them so that the result keeps the daily dates of the stock prices and each day carries the rating that applies to that ticker on that date.
A simple merge would only match on the exact ticker and date and put NA everywhere else. I would like to keep the exact matches and fill the dates in between with the last rating.
Daily time series:
ticker date stock.price
AA US Equity 2004-09-06 1
AA US Equity 2004-09-07 2
AA US Equity 2004-09-08 3
AA US Equity 2004-09-09 4
AA US Equity 2004-09-10 5
AA US Equity 2004-09-11 6
Discrete time series
ticker date Rating Last_Rating
AA US Equity 2004-09-08 A A+
AA US Equity 2004-09-11 AA A
AAL LN Equity 2005-09-08 BB BB
AAL LN Equity 2007-09-09 AA AA-
ABE SM Equity 2006-09-10 AA AA-
ABE SM Equity 2009-09-11 AA AA-
Required Output:
ticker date stock.price Rating
AA US Equity 2004-09-06 1 A+
AA US Equity 2004-09-07 2 A+
AA US Equity 2004-09-08 3 A
AA US Equity 2004-09-09 4 A
AA US Equity 2004-09-10 5 A
AA US Equity 2004-09-11 6 AA
I would be very grateful for your help.

Maybe this is the solution you want.
The function na.locf in the time series package zoo can be used to carry values forward (or backward).
library(zoo)
library(plyr)
options(stringsAsFactors=FALSE)
daily_ts=data.frame(
ticker=c('A','A','A','A','B','B','B','B'),
date=c(1,2,3,4,1,2,3,4),
stock.price=c(1.1,1.2,1.3,1.4,4.1,4.2,4.3,4.4)
)
discrete_ts=data.frame(
ticker=c('A','A','B','B'),
date=c(2,4,2,4),
Rating=c('A','AA','BB','BB-'),
Last_Rating=c('A+','A','BB+','BB')
)
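# Outer-join on ticker and date, then fill the gaps ticker by ticker.
res=ddply(
  merge(daily_ts,discrete_ts,by=c("ticker","date"),all=TRUE),
  "ticker",
  function(x)
    data.frame(
      x[,c("ticker","date","stock.price")],
      # carry the current rating forward to later dates
      Rating=na.locf(x$Rating,na.rm=FALSE),
      # carry the previous rating backward to dates before a change
      Last_Rating=na.locf(x$Last_Rating,na.rm=FALSE,fromLast=TRUE)
    )
)
# Where no rating has been set yet, fall back to Last_Rating, then drop that column.
res=within(
  res,
  Rating<-ifelse(
    is.na(Rating),
    Last_Rating,Rating
  )
)[,setdiff(colnames(res),"Last_Rating")]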
res
Gives
# ticker date stock.price Rating
#1 A 1 1.1 A+
#2 A 2 1.2 A
#3 A 3 1.3 A
#4 A 4 1.4 AA
#5 B 1 4.1 BB+
#6 B 2 4.2 BB
#7 B 3 4.3 BB
#8 B 4 4.4 BB-
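The same fill can also be sketched in base R without zoo or plyr. This is only an illustration on the toy data above: `rating_on` is a hypothetical helper name, and the sketch assumes each ticker's rating changes can be sorted by date and that dates before the first change should take that change's Last_Rating.

```r
# Toy data mirroring the example above (integer dates for brevity).
daily_ts <- data.frame(
  ticker = c("A", "A", "A", "A", "B", "B", "B", "B"),
  date = c(1, 2, 3, 4, 1, 2, 3, 4),
  stock.price = c(1.1, 1.2, 1.3, 1.4, 4.1, 4.2, 4.3, 4.4),
  stringsAsFactors = FALSE
)
discrete_ts <- data.frame(
  ticker = c("A", "A", "B", "B"),
  date = c(2, 4, 2, 4),
  Rating = c("A", "AA", "BB", "BB-"),
  Last_Rating = c("A+", "A", "BB+", "BB"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rating in force on each daily date for one ticker.
rating_on <- function(dates, changes) {
  changes <- changes[order(changes$date), ]
  # Number of rating changes on or before each date (0 = before the first change).
  idx <- findInterval(dates, changes$date)
  # Position 1 holds the rating that applied before the first change.
  c(changes$Last_Rating[1], changes$Rating)[idx + 1]
}

res <- do.call(rbind, lapply(split(daily_ts, daily_ts$ticker), function(d) {
  d$Rating <- rating_on(d$date, discrete_ts[discrete_ts$ticker == d$ticker[1], ])
  d
}))
rownames(res) <- NULL
res
```

On this toy data the Rating column should agree with the output shown above.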

Related

unnesting list inside of a list, inside of a list, inside of a list... while preserving id in R

I imported a JSON file with below structure:
link
I would like to transform it to a dataframe with 3 columns: ID group_name date_joined,
where ID is a element number from "data" list.
It should look like this:
ID group_name date_joined
1 aaa dttm
1 bbb dttm
1 ccc dttm
1 ddd dttm
2 eee dttm
2 aaa dttm
2 bbb dttm
2 fff dttm
2 ggg dttm
3 bbb dttm
3 ccc dttm
3 ggg dttm
3 mmm dttm
Running the code below a few times, I get a data frame with just two columns, group_name and date_joined:
train2 <- do.call("rbind", train2)
sample file link
The following should work:
library(jsonlite)
train2 <- fromJSON("sample.json")
train2 <- train2[[1]]$groups$data
# Repeat each element's position once per group so ID lines up with the unlisted columns.
df <- data.frame(
  ID = unlist(lapply(seq_along(train2), function(i) rep.int(i, length(train2[[i]]$group_name)))),
  group_name = unlist(lapply(seq_along(train2), function(i) train2[[i]]$group_name)),
  date_joined = unlist(lapply(seq_along(train2), function(i) train2[[i]]$date_joined)))
output:
> df
ID group_name date_joined
1 1 Let's excercise together and lose a few kilo quicker - everyone is welcome! (Piastow) 2008-09-05 09:55:18.730066
2 1 Strongman competition 2008-05-22 21:25:22.572365
3 1 Fast food 4 life 2012-02-02 05:26:01.293628
4 1 alternative medicine - Hypnosis and bioenergotheraphy 2008-07-05 05:47:12.254848
5 2 Tom Cruise group 2009-06-14 16:48:28.606142
6 2 Babysitters (Sokoka) 2010-09-25 03:21:01.944684
7 2 Work abroad - join to find well paid work and enjoy the experience (Sokoka) 2010-09-21 23:44:39.499240
8 2 Tennis, Squash, Badminton, table tennis - looking for sparring partner (Sokoka) 2007-10-09 17:15:13.896508
9 2 Lost&Found (Sokoka) 2007-01-03 04:49:01.499555
10 3 Polish wildlife - best places 2007-07-29 18:15:49.603727
11 3 Politics and politicians 2010-10-03 21:00:27.154597
12 3 Pizza ! Best recipes 2010-08-25 22:26:48.331266
13 3 Animal rights group - join us if you care! 2010-11-02 12:41:37.753989
14 4 The Aspiring Writer 2009-09-08 15:49:57.132171
15 4 Nutrition & food advices 2010-12-02 18:19:30.887307
16 4 Game of thrones 2009-09-18 10:00:16.190795
17 5 The ultimate house and electro group 2008-01-02 14:57:39.269135
18 5 Pirates of the Carribean 2012-03-05 03:28:37.972484
19 5 Musicians Available Poland (Osieczna) 2009-12-21 13:48:10.887986
20 5 Housekeeping - looking for a housekeeper ? Join the group! (Osieczna) 2008-10-28 23:22:26.159789
21 5 Rooms for rent (Osieczna) 2012-08-09 12:14:34.190438
22 5 Counter strike - global ladderboard 2008-11-28 03:33:43.272435
23 5 Nutrition & food advices 2011-02-08 19:38:58.932003
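The ID is lost with do.call("rbind", ...) because the position in the list is not stored in the pieces themselves. A minimal sketch of tagging each piece before binding follows. The `groups` list is a made-up stand-in for train2[[1]]$groups$data, assuming each element is a data frame with group_name and date_joined columns.

```r
# Hypothetical stand-in for the nested "data" list: one data frame per member.
groups <- list(
  data.frame(group_name = c("aaa", "bbb"),
             date_joined = c("2008-09-05", "2008-05-22"),
             stringsAsFactors = FALSE),
  data.frame(group_name = "eee",
             date_joined = "2009-06-14",
             stringsAsFactors = FALSE),
  data.frame(group_name = c("bbb", "ccc", "ggg"),
             date_joined = c("2007-07-29", "2010-10-03", "2010-08-25"),
             stringsAsFactors = FALSE)
)

# Add the list position as an ID column to every piece, then bind the pieces.
tagged <- Map(function(g, i) cbind(ID = rep(i, nrow(g)), g),
              groups, seq_along(groups))
df <- do.call(rbind, tagged)
rownames(df) <- NULL
df
```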

Working with times in R - categorising time intervals by an ID

I'd really appreciate some help on a problem I'm struggling to resolve in R.
I have a data frame with a series of IDs, dates and treatments. My end goal is to count the number of events that happen to an ID by treatment within a given timeframe.
For example, an ID might have treatment A twice within three months and four times within six months. I'd like a series of conditional columns that count these occurrences.
The data frame follows a similar structure to:
ID date treatment
1A 20/09/2015 A
1A 21/09/2015 B
1A 22/10/2015 A
2A 22/09/2015 C
2A 20/10/2015 C
My end goal would be to have something like...
ID date treatment
1A 01/01/2016 A
1A 01/03/2016 A
1A 01/04/2016 A
1A 01/05/2016 A
1A 01/11/2016 A
2A 01/01/2016 A
2A 01/09/2016 A
Grouping to...
ID a_within_3_months a_within_6_months...
1A 3 1
2A 0 0
I'm sure this must be possible in data.table, but I'm struggling to figure out how to calculate this over rows by the conditions I want.
I hope this is clear - happy to provide more detail if helpful.
Really appreciate any help with this issue! Thank you for your time.
This might be what you are looking for:
> first_date <- as.Date(as.character(20140612), "%Y%m%d")
> data <- data.frame(
    ID = c(rep(1, 5), rep(2, 5)),
    date = seq(first_date, by = "1 day", length.out = 10),
    trtm = c(rep("a", 3), rep("b", 2), rep("c", 3), rep("d", 2)))
> data
ID date trtm
1 2014-06-12 a
1 2014-06-13 a
1 2014-06-14 a
1 2014-06-15 b
1 2014-06-16 b
2 2014-06-17 c
2 2014-06-18 c
2 2014-06-19 c
2 2014-06-20 d
2 2014-06-21 d
> library(data.table)
> data <- data.table(data)
> data[,.( within=max(date)-min(date),
n_of_trtm=length(date) ),
by=.(ID,trtm)]
ID trtm within n_of_trtm
1: 1 a 2 days 3
2: 1 b 1 days 2
3: 2 c 2 days 3
4: 2 d 1 days 2
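The summary above reports the span between the first and last date per group. The windowed counts asked for in the question could be sketched in base R as below. The window definition is an assumption, since the question leaves it open: each window starts at an ID's first treatment A, runs for a number of calendar months, and does not count the first event itself. The names `events` and `count_within` are made up for the illustration.

```r
# Hypothetical events, with dates in dd/mm/yyyy as in the question.
events <- data.frame(
  ID = c("1A", "1A", "1A", "1A", "1A", "2A", "2A"),
  date = as.Date(c("01/01/2016", "01/03/2016", "01/04/2016", "01/05/2016",
                   "01/11/2016", "01/01/2016", "01/09/2016"),
                 format = "%d/%m/%Y"),
  treatment = "A",
  stringsAsFactors = FALSE
)

# Assumed rule: count events after the first one, up to and including
# the date that lies `months` calendar months after the first event.
count_within <- function(dates, months) {
  first <- min(dates)
  cutoff <- seq(first, by = paste(months, "months"), length.out = 2)[2]
  sum(dates > first & dates <= cutoff)
}

a <- events[events$treatment == "A", ]
by_id <- split(a$date, a$ID)  # one Date vector per ID
res <- data.frame(
  ID = names(by_id),
  a_within_3_months = unname(sapply(by_id, count_within, months = 3)),
  a_within_6_months = unname(sapply(by_id, count_within, months = 6)),
  stringsAsFactors = FALSE
)
res
```

Changing the rule, for example counting the first event or using non-overlapping windows, only requires editing the comparison inside count_within.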

reshape data wide to long for multiple variables in R

I have a dataset that shows each bank's investment and dollar value associated with this investment. Currently the data looks like this. I have inv and amt variables stretching from 1 to 43.
bankid year location inv1 amt1 inv2 amt2 ... inv43 amt43
1 1990 NYC AIG 2000 GM 4000 Ford 6000
but I want the data to look like this
bankid year location inv number amt
1 1990 NYC AIG 1 2000
1 1990 NYC GM 2 4000
...
1 1990 NYC Ford 43 6000
In Stata, I would use this code
reshape long inv amt, i(bankid location year) j(number)
What would be the equivalent code in R?
reshape can do this. Here I am using the posted subset of your data, where you have time variables 1, 2, and 43:
x <- read.table(header=TRUE, text='bankid year location inv1 amt1 inv2 amt2 inv43 amt43
1 1990 NYC AIG 2000 GM 4000 Ford 6000 ')
x
## bankid year location inv1 amt1 inv2 amt2 inv43 amt43
## 1 1 1990 NYC AIG 2000 GM 4000 Ford 6000
v <- outer(c('inv', 'amt'), c(1,2,43), FUN=paste0)
v
## [,1] [,2] [,3]
## [1,] "inv1" "inv2" "inv43"
## [2,] "amt1" "amt2" "amt43"
reshape(x, direction='long', varying=c(v), sep='')
## bankid year location time inv amt id
## 1.1 1 1990 NYC 1 AIG 2000 1
## 1.2 1 1990 NYC 2 GM 4000 1
## 1.43 1 1990 NYC 43 Ford 6000 1
For your full table, the varying argument would be c(outer(c('inv', 'amt'), 1:43, FUN=paste0)) (but that won't work for the small example, as columns are missing).
Here, reshape infers the 'time' variable by inspecting the varying argument and finding common elements (inv and amt) on the left, and other elements on the right (1, 2, and 43). The sep argument says that there is no separator character (default sep character is .).
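To get closer to the Stata output, reshape's timevar and idvar arguments can name the counter column and the identifying columns. The following is a sketch on the same three-period subset. The mapping to Stata's j() and i() is an analogy, not an exact equivalence.

```r
x <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "bankid year location inv1 amt1 inv2 amt2 inv43 amt43
1 1990 NYC AIG 2000 GM 4000 Ford 6000")

# Column names to stack: inv1, amt1, inv2, amt2, inv43, amt43.
v <- c(outer(c("inv", "amt"), c(1, 2, 43), FUN = paste0))

long <- reshape(
  x,
  direction = "long",
  varying = v,
  sep = "",
  timevar = "number",                      # roughly Stata's j(number)
  idvar = c("bankid", "year", "location")  # roughly Stata's i(bankid location year)
)
rownames(long) <- NULL
long
```

Because the idvar columns already exist in the data, no extra id column is added, and the counter column is called number instead of time.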

Best Way to Subset a Large Dataframe into a List of Variables?

I currently have a data frame of ~83000 rows (13 columns) with crime data for the years 2000-2012. Each row is a crime with its zip code reported, so the same zip code can appear in several years (for example, zip code xxxxx in 2001, 2003, and 2007).
Here is an example of my data:
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020
Right now, I am able to assign a global variable for each of my zip codes. However, I am using RStudio, and the resulting list of objects is very long and has significantly slowed the program.
Here is how I have assigned global variables to all of my zip codes:
for (n in all.data$Zip) {
x <- subset(all.data, n == all.data$Zip) #subsets the data
u <- x[1,3] #gets the zip code value
assign(paste0("Zip", u), x, envir = .GlobalEnv) #assigns it to a global environment
#need something here, MasterList <<- ?
}
I would like to contain all of these variables in a list. For example, if all my zip code variables were stored in list "MasterList":
MasterList["Zip11111"]
would yield the data frame:
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020
Is this possible? What would be an alternative, faster, or better way to do this? I was hoping that storing these variables in a list would be more efficient.
Bonus points: I know in my for loop I am reassigning variables that already exist to the exact same thing, wasting processing time. Any quick line I could add to speed this up?
Thanks in advance for your help!
With only base R:
> dat <- read.table(text = "Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
+ 2000 1 99502 1 3 5 2 9479
+ 2009 2 99502 2 3 4 3 3220
+ 2000 1 11111 1 3 5 2 3479
+ 2004 2 11111 2 3 4 3 1020",header = TRUE,sep = "")
> dats <- split(dat,dat$Zip)
> dats
$`11111`
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
3 2000 1 11111 1 3 5 2 3479
4 2004 2 11111 2 3 4 3 1020
$`99502`
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
1 2000 1 99502 1 3 5 2 9479
2 2009 2 99502 2 3 4 3 3220
> names(dats) <- paste0('Zip',names(dats))
> dats
$Zip11111
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
3 2000 1 11111 1 3 5 2 3479
4 2004 2 11111 2 3 4 3 1020
$Zip99502
Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
1 2000 1 99502 1 3 5 2 9479
2 2009 2 99502 2 3 4 3 3220
You could change for (n in all.data$Zip) to for (n in unique(all.data$Zip)). That would cut down on redundancy. You could also create a list before the loop with MasterList <- list() and then add to it inside the loop with
MasterList[[paste0("Zip", n)]] <- x
I used n for the zip code because the loop assigns n each value of the vector you give it (all.data$Zip in your case, unique(all.data$Zip) in mine).
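Put together, the loop suggestion might look like the sketch below, using the four sample rows from the question as a stand-in for all.data.

```r
all.data <- read.table(header = TRUE,
  text = "Year Quarter Zip MissingZip BusCode LossCode NumTheftsPQ DUL
2000 1 99502 1 3 5 2 9479
2009 2 99502 2 3 4 3 3220
2000 1 11111 1 3 5 2 3479
2004 2 11111 2 3 4 3 1020")

MasterList <- list()
# Visit each distinct zip code once instead of once per row.
for (n in unique(all.data$Zip)) {
  # Store the subset under a name such as "Zip11111".
  MasterList[[paste0("Zip", n)]] <- all.data[all.data$Zip == n, ]
}
MasterList[["Zip11111"]]
```

Note that MasterList[["Zip11111"]] returns the data frame itself, whereas MasterList["Zip11111"] returns a list of length one containing it.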
Probably the easiest way to make your list is with plyr's dlply function, like so:
> set.seed(2)
> dat <- data.frame(zip=as.factor(sample(11111:22222,1000,replace=TRUE)),var1=rnorm(1000),var2=rnorm(1000))
> head(dat)
zip var1 var2
1 13165 -0.4597894 -0.84724423
2 18915 0.6179261 0.07042928
3 17481 -0.7204224 1.58119491
4 12978 -0.5835119 0.02059799
5 21598 0.2163245 -0.12337051
6 21594 1.2449912 -1.25737890
> library(plyr)
> MasterList <- dlply(dat,.(zip))
> MasterList[["13165"]]
zip var1 var2
1 13165 -0.4597894 -0.8472442
However, it sounds like speed is your motivation here. If so, you'd probably be much better off converting your data frame to a data.table rather than storing the data in a separate list object:
> library(data.table)
> dat.dt <- data.table(dat)
> dat.dt[zip==13165]
zip var1 var2
1: 13165 -0.4597894 -0.8472442

Aggregate column R

I am new here and have a problem
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
3 1990 EFG Pear 0.1350
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
For each year and separately for the two markets (i.e., ABC, EFG), examine the combined data for Apple and Pear on the bid price variable BID for presence of potential outliers. Identify instances where you observe the presence of potential outliers.
I managed to separate the data by year only
y <- c(1, seq(300))
year1991 <- subset(X, y < 39)
year1991
Year1991 <- year1991[, c(1,2,3,5)]
Year1991
Now I need help with the right R command to select (view) only the rows where Market is ABC, while keeping the values of the other columns.
Is it possible to do several such splits at once, or does it have to be done step by step?
Could you give me a tip on how to exclude rows if I want to view the data like this:
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
Year Market Winner BID
1 1990 EFG Pear 0.1350
In other words, I am trying to split on 'Market' but still see all the values.
Thanks in advance :)
> df
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
3 1990 EFG Pear 0.1350
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
library(plyr)
# Then you can break up the data into chunks of year x market.
# I split your data.frame into a list. You can do further things with that list.
# alternatively, you can use ddply and add a function to do your hw bit and collate all
# results back into a final data.frame. This should be a helpful start.
> dlply(df, .(Year,Market))
$`1990.ABC`
Year Market Winner BID
1 1990 ABC Apple 0.1260
2 1990 ABC Apple 0.1395
$`1990.EFG`
Year Market Winner BID
3 1990 EFG Pear 0.135
$`1991.EFG`
Year Market Winner BID
4 1991 EFG Apple 0.1113
5 1991 EFG Orange 0.1094
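A base R sketch of the same Year x Market split, restricted to Apple and Pear and followed by a simple outlier check, could look like this. Using boxplot.stats (values more than 1.5 times the interquartile range beyond the hinges) as the outlier rule is an assumption. The exercise may intend a different criterion.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = "Year Market Winner BID
1990 ABC Apple 0.1260
1990 ABC Apple 0.1395
1990 EFG Pear 0.1350
1991 EFG Apple 0.1113
1991 EFG Orange 0.1094")

# Keep only the winners the exercise asks about.
fruit <- df[df$Winner %in% c("Apple", "Pear"), ]

# One data frame per Year x Market combination; drop = TRUE omits empty combinations.
chunks <- split(fruit, list(fruit$Year, fruit$Market), drop = TRUE)

# Assumed outlier rule: points flagged by boxplot.stats() within each chunk.
outliers <- lapply(chunks, function(d) boxplot.stats(d$BID)$out)
names(chunks)
outliers
```

With only five sample rows no value is flagged. On the full data, any non-empty element of outliers marks a year and market worth inspecting.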
