Cut time series into specific bins and label each chunk - r

I have some experimental data on CO2 values over a few days in a room which are time and date-stamped. I would like to break it up into a series of "experiments" based on an experiment list of when each experiment happened.
e.g.
Data
df<-data.frame(CO2.ppm.=runif(10), Date.time.=as.POSIXct(" 2019-2-08 07:00:00") + runif(n=10, min=0, max=3600))
List of experiments with start and stop times:
ExpertimentList<- data.frame(StartTime=c("2019-2-08 07:10:00", "2019-2-08 07:15:00", "2019-2-08 08:30:00"), StopTime=c("2019-2-08 07:12:00","2019-2-08 07:16:00","2019-2-08 08:15:00"),ExptID=c(1,2,3))
Note there is time when CO2 is measured but no experiment is happening. E.g. between 07:12:00 and 07:15:00.
I would like to split df$Date.time. by ExperimentList's StartTime and StopTime
So far I've converted everything to integers
df$Date.time.<-as.integer(df$Date.time.)
ExperimentList$StartTime<-as.integer(ExperimentList$StartTime)
ExperimentList$StopTime<-as.integer(ExperimentList$StopTime)
Then looking at cut
breakz<-dplyr::arrange(paste(Experiment_List$StartTime,Experiment_List$StopTime)%>%as_tibble())
cut(df$Dev.Date.Time,breaks=unique(breakz$value))
But I can't filter out the data when no experiment was taking place. Any thoughts are much appreciated.
Expected output:
set.seed(143)
data.frame(CO2.ppm.=runif(10), Date.time.=sort(as.POSIXct(" 2019-2-08 07:00:00") + runif(n=10, min=0, max=3600)),ExptID=c(NA,NA,NA,1,NA,NA,NA,NA,NA,NA))
ANSWER:
I found that I would run out of memory with @Ronak's answer, so I chunked the data.frame into 10000-row segments:
df<-split(df, (as.numeric(rownames(df))-1) %/% 10000)
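Then, based on @Ronak's answer, I put the code into a function and used mclapply from the parallel package (rbindlist comes from data.table).
library(fuzzyjoin)   # fuzzy_left_join
library(data.table)  # rbindlist
library(parallel)    # mclapply
#A left join keeps every row; rows outside any experiment get NA for ExptID
fuzzyJoinFunction<-function(a){
a<-fuzzy_left_join(a, Experiment_List,
by = c('Dev.Date.Time' = 'StartTime', 'Dev.Date.Time'= 'StopTime'),
match_fun = c(`>=`, `<=`))
a
}
#Join each chunk in parallel, then stack the results back into one table
df<-rbindlist(mclapply(X=df,FUN=fuzzyJoinFunction,mc.cores=4))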

We can use fuzzyjoin::fuzzy_inner_join to keep only the rows which are in range.
library(dplyr)
library(fuzzyjoin)
#All the datetime values should be of type POSIXct.
ExpertimentList %>%
mutate(across(c(StartTime, StopTime), lubridate::ymd_hms)) -> ExpertimentList
fuzzy_inner_join(df, ExpertimentList,
by = c('Date.time.' = 'StartTime', 'Date.time.'= 'StopTime'),
match_fun = c(`>=`, `<=`))
To get all df values in the final output with NA for ExptID use fuzzy_left_join.
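If installing fuzzyjoin is not an option, the same labelling can be sketched in base R. This is only a minimal illustration, not the method from the answer above: the data frames `readings` and `experiments` are made-up stand-ins for df and the experiment list, and both sides are given an explicit UTC time zone so the comparisons line up.

```r
# Hypothetical stand-ins for df and the experiment list (explicit UTC on both sides)
readings <- data.frame(
  CO2.ppm. = c(410, 415, 420, 425),
  Date.time. = as.POSIXct(c("2019-02-08 07:05:00", "2019-02-08 07:11:00",
                            "2019-02-08 07:13:00", "2019-02-08 07:15:30"), tz = "UTC")
)
experiments <- data.frame(
  StartTime = as.POSIXct(c("2019-02-08 07:10:00", "2019-02-08 07:15:00"), tz = "UTC"),
  StopTime  = as.POSIXct(c("2019-02-08 07:12:00", "2019-02-08 07:16:00"), tz = "UTC"),
  ExptID    = c(1, 2)
)

# Start with NA everywhere, then fill in the ID for rows inside each window
readings$ExptID <- NA_real_
for (i in seq_len(nrow(experiments))) {
  in_window <- readings$Date.time. >= experiments$StartTime[i] &
    readings$Date.time. <= experiments$StopTime[i]
  readings$ExptID[in_window] <- experiments$ExptID[i]
}

print(readings)
```

Rows that fall outside every window keep NA, which mirrors what the left join gives; dropping them with `readings[!is.na(readings$ExptID), ]` mirrors the inner join.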

Related

tsibble -- how do you get around implicit gaps when there are none

I am new to the tsibble package. I have monthly data that I coerced to a tsibble to use the fable package. A few issues I am having:
It appears the index variable (from my testing) is not of class date even though I applied lubridate's ymd function to it.
has_gaps function returns FALSE but when I model on the data, I get the error that ".data contains implicit gaps in time"
library(dplyr)
library(fable)
library(lubridate)
library(tsibble)
test <- data.frame(
YearMonth = c(20160101, 20160201, 20160301, 20160401, 20160501, 20160601,
20160701, 20160801, 20160901, 20161001, 20161101, 20161201),
Claims = c(13032647, 1668005, 24473616, 13640769, 17891432, 11596556,
23176360, 7885872, 11948461, 16194792, 4971310, 18032363),
Revenue = c(12603367, 18733242, 5862766, 3861877, 15407158, 24534258,
15633646, 13720258, 24944078, 13375742, 4537475, 22988443)
)
test_ts <- test %>%
mutate(YearMonth = ymd(YearMonth)) %>%
as_tsibble(
index = YearMonth,
regular = FALSE #because it picks up gaps when I set it to TRUE
)
# Are there any gaps?
has_gaps(test_ts, .full = T)
model_new <- test_ts %>%
model(
snaive = SNAIVE(Claims))
Warning messages:
1: 1 error encountered for snaive
[1] .data contains implicit gaps in time. You should check your data and convert implicit gaps into explicit missing values using `tsibble::fill_gaps()` if required.
Any help will be appreciated.
You have a daily index, but you want a monthly index. The simplest way is to use the tsibble::yearmonth() function, but you will need to convert the date to character first.
library(dplyr)
library(tsibble)
test <- data.frame(
YearMonth = c(20160101, 20160201, 20160301, 20160401, 20160501, 20160601,
20160701, 20160801, 20160901, 20161001, 20161101, 20161201),
Claims = c(13032647, 1668005, 24473616, 13640769, 17891432, 11596556,
23176360, 7885872, 11948461, 16194792, 4971310, 18032363),
Revenue = c(12603367, 18733242, 5862766, 3861877, 15407158, 24534258,
15633646, 13720258, 24944078, 13375742, 4537475, 22988443)
)
test_ts <- test %>%
mutate(YearMonth = yearmonth(as.character(YearMonth))) %>%
as_tsibble(index = YearMonth)
Looks like as_tsibble isn't able to recognize the interval properly in the YearMonth column because it is a Date class object. It's hidden in the 'Index' section of the help page that this might be a problem:
For a tbl_ts of regular interval, a choice of index representation has to be made. For example, a monthly data should correspond to time index created by yearmonth or zoo::yearmon, instead of Date or POSIXct.
Like that excerpt suggests you can get around the problem with yearmonth(). But that requires a little string manipulation first to get it into a format that will parse properly.
test_ts <- test %>%
mutate(YearMonth = gsub("(.{2})01$", "-\\1", YearMonth) %>%
yearmonth()
) %>%
as_tsibble(
index = YearMonth
)
Now the model should run error free! Not sure why the has_gaps() test is saying everything is okay in your example...
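A quick way to see why a Date index is awkward for monthly data, using base R only: consecutive first-of-month dates are not a constant number of days apart. The variable names here are made up for illustration, and this only shows the uneven spacing; it does not reproduce what tsibble does internally.

```r
# First day of each month in 2016, stored as Date (a day-resolution class)
month_starts <- seq(as.Date("2016-01-01"), by = "month", length.out = 12)

# Spacing between consecutive observations, in days
gaps_in_days <- as.integer(diff(month_starts))
print(gaps_in_days)

# More than one distinct spacing means the series is not regular when counted in days
print(length(unique(gaps_in_days)) > 1)
```

A month-resolution index such as the one produced by yearmonth() avoids this, because each step is exactly one month.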

SMA using R & TTR Package

Afternoon! I'm just starting out with R and learning about data frames, packages, etc... read a lot of the messages here but couldn't find an answer.
I have a table I'm accessing with R that has the following fields:
[Symbol],[Date],[Open],[High],[Low],[Close],[Volume]
And, I'm calculating SMAs on the close prices:
sqlQuery <- "Select * from [dbo].[Stock_Data]"
conn <- odbcDriverConnect(connectionString)
dfSMA <- sqlQuery(conn, sqlQuery)
sma20 <- SMA(dfSMA$Close, n = 20)
dfSMA["SMA20"] <- sma20
When I look at the output, it appears to be calculating the SMA without any regard for what the symbol is. I haven't tried to replicate the calculation, but I would suspect it's just doing it by 20 moving rows, regardless of date/symbol.
How do I restrict the calculation to a given symbol?
Any help is appreciated - just need to be pointed in the right direction.
Thanks
You're far more likely to get answers if you provide reproducible examples. First, let's replicate your data:
library(quantmod)
library(stringr) # for str_replace()
symbols <- c("GS", "MS")
getSymbols(symbols)
# Create example data:
dGS <- data.frame("Symbol" = "GS", "Date" = index(GS), coredata(OHLCV(GS)))
names(dGS) <- str_replace(names(dGS), "GS\\.", "")
dMS <- data.frame("Symbol" = "MS", "Date" = index(MS), coredata(OHLCV(MS)))
names(dMS) <- str_replace(names(dMS), "MS\\.", "")
dfSMA <- rbind(dGS, dMS)
> head(dfSMA)
  Symbol       Date   Open   High    Low  Close  Volume Adjusted
1     GS 2007-01-03 200.60 203.32 197.82 200.72 6494900 178.6391
2     GS 2007-01-04 200.22 200.67 198.07 198.85 6460200 176.9748
3     GS 2007-01-05 198.43 200.00 197.90 199.05 5892900 177.1528
4     GS 2007-01-08 199.05 203.95 198.10 203.73 7851000 181.3180
5     GS 2007-01-09 203.54 204.90 202.00 204.08 7147100 181.6295
6     GS 2007-01-10 203.40 208.44 201.50 208.11 8025700 185.2161
What you want to do is subset your long data object, and then apply technical indicators to each symbol in isolation. Here is one approach to guide you toward achieving your desired result.
You could do this using a list, building the indicators on an xts object for each symbol rather than on a data.frame as in your example. (You can apply the TTR functions to columns of a data.frame, but it is ugly; working with xts objects is much cleaner.) This is a template for how you could do it. The final output l.data should be intuitive to work with: each symbol is kept in a separate "container" (an element of the list) rather than all symbols being combined in one data.frame, which isn't easy to work with.
make_xts_from_long_df <- function(x) {
# Subset the symbol you desire
res <- dfSMA[dfSMA$Symbol == x, ]
#Create xts, then allow easy merge of technical indicators
x_res <- xts(OHLCV(res), order.by = res$Date)
merge(x_res, SMA(Cl(x_res), n = 20))
}
l.data <- setNames(lapply(symbols, make_xts_from_long_df), symbols)
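If you would rather keep everything in one long data.frame, a per-symbol moving average can also be sketched with base R's ave(). Note this swaps in stats::filter() as a plain trailing mean instead of TTR's SMA(), and the tiny `prices` table and the window length of 3 are made up so the result is easy to check by hand.

```r
# Hypothetical long-format data: two symbols, four closes each
prices <- data.frame(
  Symbol = rep(c("AAA", "BBB"), each = 4),
  Close  = c(1, 2, 3, 4, 10, 20, 30, 40)
)

n <- 3  # window length (the question uses 20)

# Trailing mean of the last n values; the first n - 1 entries are NA
trailing_mean <- function(x) as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

# ave() applies the function within each Symbol and returns values in the original row order
prices$SMA3 <- ave(prices$Close, prices$Symbol, FUN = trailing_mean)
print(prices)
```

The point is that the window restarts for every symbol, so the first n - 1 rows of each symbol are NA instead of borrowing prices from the previous symbol. Each group needs at least n rows, otherwise stats::filter() stops with an error.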

R Programming Random Stock Pick

I am stuck on a problem with R programming.
My aim is to randomly select 2 stocks out of the Swiss Market Index, which consists of 30 stocks.
So far I have solved the random pick of the 2 stocks with the following code:
SMI_components <- cbind("ABB (ABBN.VX)", "ADECCO (ADEN.VX)", "ACTELION (ATLN.VX)", "JULIUS BAER GRP (BAER.VX)", "RICHEMONT (CFR.VX)", "CREDIT SUISSE (CSGN.VX)", "GEBERIT (GEBN.VX)", "GIVAUDAN (GIVN.VX)", "HOLCIM (HOLN.VX)", "NESTLE (NESN.VX)", "NOVARTIS (NOVN.VX)", "TRANSOCEAN (RIGN.VX)", "ROCHE HOLDING (ROG.VX)", "SWISSCOM (SCMN.VX)", "SGS (SGSN.VX)", "SWISS RE (SREN.VX)", "SYNGENTA (SYNN.VX)", "UBS (UBSG.VX)", "SWATCH GROUP (UHR.VX)", "ZURICH INSURANCE GROUP (ZURN.VX)")
for(i in 1:1){
print(sample(SMI_components, 2))
}
How do I continue my code if I want to download the historical data for these two randomly picked stocks?
For example, if the random selection is:
"NOVARTIS (NOVN.VX)" and "ZURICH INSURANCE GROUP (ZURN.VX)"
how do I get to ...
SMI_NOVARTIS <- yahooSeries ("NOVN.VX", from = "2005-01-01", to = "2015-07-30", frequency = "daily")
SMI_ZURICH <- yahooSeries ("ZURN.VX", from = "2005-01-01", to = "2015-07-30", frequency = "daily")
I would really appreciate your help
Regards
print outputs to the console but doesn't store anything. So the first thing to do is assign the output of sample into a variable.
my_picks <- sample(SMI_components, 2)
Extract ticker symbol between parens (courtesy the comment below):
my_picks <- sub(".*\\((.*)\\).*", "\\1", my_picks)
Then you can use lapply, to call a function (yahooSeries) for each value in my_picks.
series_list <- lapply(my_picks, yahooSeries, from = "2005-01-01", to = "2015-07-30", frequency = "daily")
Then you'll get the output in a list. series_list[[1]] will have the output of yahooSeries for the first value of my_picks, and series_list[[2]] for the second.
Lastly, I'm not sure why you bothered with the single-iteration for loop, but you don't need it.
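The pick-and-extract steps can be tried on their own, without any download, using a shortened component list. The vector name `components` and the seed are arbitrary choices for this sketch.

```r
# A shortened, illustrative version of the component list
components <- c("ABB (ABBN.VX)", "NESTLE (NESN.VX)",
                "NOVARTIS (NOVN.VX)", "ZURICH INSURANCE GROUP (ZURN.VX)")

set.seed(1)                         # only so the pick is repeatable
my_picks <- sample(components, 2)   # store the result instead of printing it

# Keep only the text between the parentheses, i.e. the ticker symbol
tickers <- sub(".*\\((.*)\\).*", "\\1", my_picks)
print(tickers)
```

The `tickers` vector is what you would then hand to lapply() together with your download function.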

R subsetting by date range

seems simple enough and I've been through all similar questions and applied them all... I'm either getting nothing or everything...
Trying to look at water temperatures (WTEMP) for a specific date range (SAMPLE_DATE), 2007-06-01 to 2007-09-30, from (allconmon)
here is my code so far...
bydate<-subset(allconmon, allconmon$SAMPLE_DATE > as.Date("2007-06-01") & allconmon$SAMPLE_DATE < as.Date("2007-09-30"))
I've also tried this but get errors
bydate2<- as.xts(allconmon$WTEMP,order.by=allconmon$SAMPLE_DATE)
bydate2['2007-06-01/2007-09-30']
Error in xts(x, order.by = order.by, frequency = frequency, .CLASS = "double", :
order.by requires an appropriate time-based object
not sure what I'm doing wrong here... seems to work for other people
I highly recommend using the zoo package in R when dealing with time series data.
The operation you mentioned is actually what the window function does, and zoo supports it.
Here is the example from ?window:
Examples
window(presidents, 1960, c(1969,4)) # values in the 1960's
window(presidents, deltat = 1) # All Qtr1s
window(presidents, start = c(1945,3), deltat = 1) # All Qtr3s
window(presidents, 1944, c(1979,2), extend = TRUE)
pres <- window(presidents, 1945, c(1949,4)) # values in the 1940's
window(pres, 1945.25, 1945.50) <- c(60, 70)
window(pres, 1944, 1944.75) <- 0 # will generate a warning
window(pres, c(1945,4), c(1949,4), frequency = 1) <- 85:89
pres
Here is a list of papers from JSS demonstrating the usage of the zoo package, and also how to reshape your data, which I found very inspiring.
I figured it out! on multiple levels... first off I didn't notice that R did something funky with my sample date label when I uploaded from text file... probably my fault...
here is a small sample of the data set. it's 5,573,301 observations of 30 variables
notice the funky symbol in front of sample date.... not sure why R did that...
ï..SAMPLE_DATE SampleTime STATION SONDE Layer TOTAL_DEPTH TOTAL_DEPTH_A BATT BATT_A WTEMP WTEMP_A SPCOND SPCOND_A SALINITY SALINITY_A DO_SAT DO_SAT_A
however what I did.... (i changed the name to x as allconmon was a bit excessive)
x <- read.csv(file = "C:/Users/Desktop/cmon2001-08.txt",quote = "",header = TRUE,sep = "\t", na.strings = c("","NULL"))
library(chron)
x$month <- months(as.Date(x$ï..SAMPLE_DATE, "%Y-%m-%d"))
x$year <- substr(as.character(x$ï..SAMPLE_DATE), 1, 4)
y <- x[x$month == 'June' | x$month == 'July' | x$month == 'August' | x$month == 'September' ,]
so now I was able to subset all my data by those 4 months and then later by year, station, and water temp....
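For what it's worth, a prefix like ï.. on the first column name usually means the file starts with a UTF-8 byte-order mark. Assuming that is the case here, a rough sketch of reading such a file cleanly and then filtering on an actual date range could look like the following; the temporary file and its four rows are made up for the illustration.

```r
# Write a tiny tab-separated file that begins with a UTF-8 byte-order mark (made-up data)
path <- tempfile(fileext = ".txt")
con <- file(path, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw(paste0("SAMPLE_DATE\tWTEMP\n",
                          "2007-05-31\t18.2\n2007-06-15\t22.4\n",
                          "2007-09-30\t21.0\n2007-10-01\t17.5\n")), con)
close(con)

# fileEncoding = "UTF-8-BOM" tells R to skip the mark, so the column name stays clean
x <- read.csv(path, sep = "\t", header = TRUE, fileEncoding = "UTF-8-BOM")
print(names(x))

# Convert to Date once, then compare against Date bounds (inclusive on both ends here)
x$SAMPLE_DATE <- as.Date(x$SAMPLE_DATE, format = "%Y-%m-%d")
bydate <- x[x$SAMPLE_DATE >= as.Date("2007-06-01") &
              x$SAMPLE_DATE <= as.Date("2007-09-30"), ]
print(bydate)
```

With a real Date column, the date-range comparison from the question picks out the June to September rows for whichever year the bounds name, instead of every June to September in the data set.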

How can I apply factors to different subsets of a larger time series using a custom function?

I'm measuring a physiological variable with a millisecond timestamp on a number of patients. For each patient I want to apply a factor to a subset of the timestamped rows describing their posture at that exact moment.
I've tried creating the following function, which works fine when describing the first posture. When trying to apply the next "posture-factor," the previously registered posture is deleted.
TestPatient <- data.frame(Time=seq(c(ISOdatetime(2011,12,22,12,00,00)), by = "sec", length.out = 100),Value=rnorm(100, 9, 3))
patientpositionslice <- function(patient,positiontype,timestart,timestop) {
patient$Position[
format(patient$Time, "%Y-%m-%d %H:%M:%S") >= timestart &
format(patient$Time, "%Y-%m-%d %H:%M:%S") < timestop] <- positiontype
patient
}
TestPatientNew <- patientpositionslice(TestPatient,"Horizontal","2011-12-22 12:00:05","2011-12-22 12:00:10")
TestPatientNew <- patientpositionslice(TestPatient,"Vertical","2011-12-22 12:00:15","2011-12-22 12:00:20")
How do I modify the function so I can apply it repeatedly on the same patient with different postures such as "Horizontal", "Vertical", "Sitting" etc.?
Here's your solution. Probably there are more elegant ways but this is mine ;)
TestPatient <- data.frame(Time=seq(c(ISOdatetime(2011,12,22,12,00,00)), by = "sec", length.out = 100),Value=rnorm(100, 9, 3))
#Included column with position
TestPatient$position <- NA
patientpositionslice <- function(patient,positiontype,timestart,timestop) {
#changed the test to ifelse() function
new<-ifelse(
format(patient$Time, "%Y-%m-%d %H:%M:%S") >= timestart &
format(patient$Time, "%Y-%m-%d %H:%M:%S") < timestop , positiontype, patient$position)
patient$position <- new
patient
}
TestPatientNew <- patientpositionslice(TestPatient,"Horizontal","2011-12-22 12:00:05","2011-12-22 12:00:10")
#For repeated insertion use the previous object
TestPatientNew <- patientpositionslice(TestPatientNew ,"Vertical","2011-12-22 12:00:15","2011-12-22 12:00:20")
I commented the changes. Hope it is what you wanted; if not, just correct me.
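If there are many postures per patient, one option is to keep them in a small table and loop over its rows, feeding each result into the next call. This is only a sketch: the `postures` table and the function name `label_posture` are made up, and it compares POSIXct values directly rather than formatted strings.

```r
# Thirty seconds of made-up data for one patient
patient <- data.frame(
  Time  = seq(as.POSIXct("2011-12-22 12:00:00", tz = "UTC"), by = "sec", length.out = 30),
  Value = seq_len(30)
)
patient$position <- NA_character_

# One row per posture interval (start inclusive, stop exclusive)
postures <- data.frame(
  positiontype = c("Horizontal", "Vertical"),
  timestart = as.POSIXct(c("2011-12-22 12:00:05", "2011-12-22 12:00:15"), tz = "UTC"),
  timestop  = as.POSIXct(c("2011-12-22 12:00:10", "2011-12-22 12:00:20"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Only rows inside the interval are overwritten, so earlier labels survive
label_posture <- function(patient, positiontype, timestart, timestop) {
  hit <- patient$Time >= timestart & patient$Time < timestop
  patient$position[hit] <- positiontype
  patient
}

# Reuse the updated object on every pass
for (i in seq_len(nrow(postures))) {
  patient <- label_posture(patient, postures$positiontype[i],
                           postures$timestart[i], postures$timestop[i])
}
print(table(patient$position, useNA = "ifany"))
```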

Resources