It's probably really simple.
In the first case, using the presidential data, I can filter by either years or years2 and get the same result.
However, when I use posixct data and try to filter in a similar way I run into problems.
When I write
school_hours2<-as.character(c(07:18))
I can see the values in school_hours2 are
"7", "8","9" etc
whereas in
school_hours they are
"07" "08" "09" etc
EDIT: I think this explains that difference then?
EDIT: I can see the problem comparing integer:character, and even when I write the vector as.character the values in the vector do not match what I want.
What I'd like is to be able to filter by school_hours2, as that would mean I could think "I'd like to filter between these two times" and put in the upper and lower bounds, rather than having to write out all the values in between. How do I get this?
Why is filtering by "Y" easier than filtering by "H"?
library (tidyverse)
#some data - filtering works
data(presidential)
head(presidential)
str(presidential)
presidential%>%filter(format(as.Date(start),"%Y")<=2005)
years<-c('1979', '1980', '1981', '1982',
'1983', '1984', '1985', '1986',
'1987', '1988', '1989', '1990'
)
years2<-c(1950:1990)
presidential%>%filter(format(as.Date(start),"%Y")%in% years2)
presidential%>%filter(format(as.Date(start),"%Y")%in% years)
#some date time data - filtering.
test_data<-sample(seq(as.POSIXct('2013/01/01'), as.POSIXct('2017/05/01'), by="day"), 1000)
td<-as.data.frame(test_data)%>%mutate(id = row_number())
school_hours<-c('07', '08', '09', '10',
'11', '12', '13', '14',
'15', '16', '17', '18'
)
school_hours2<-c(07:18)
school_years<-c(2015,2016,2017)
school_years2<-c(2015:2017)
str(td)
test1<-td%>%
filter(id >=79)
schools<-td%>%
filter(format(test_data,'%H') %in% school_hours)
schools2<-td%>%
filter(format(test_data,'%H') %in% school_hours2)
schools3<-td%>%
filter(format(test_data,'%Y')==2017)
schools4<-td%>%
filter(format(test_data,'%Y') %in% school_years)
schools5<-td%>%
filter(format(test_data,'%Y') %in% school_years2)
Here's my question:
In the code above, when I try to filter td (which contains posixct data) using school_hours or school_hours2 I get zero data returned.
Why?
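One likely reason for the empty result, sketched below: seq(..., by="day") steps in whole days starting from midnight, so the sampled timestamps carry no school-hour times at all. The tz = "UTC" argument is an assumption added here (it is not in the original code) so that the outcome does not depend on the local time zone.

```r
# Assumption: tz = "UTC" is added so the result is the same on every machine.
by_day <- seq(as.POSIXct("2013-01-01", tz = "UTC"),
              as.POSIXct("2017-05-01", tz = "UTC"), by = "day")
hours_by_day <- unique(format(by_day, "%H"))   # only midnight appears

# Stepping by hour instead produces timestamps across the whole day.
by_hour <- seq(as.POSIXct("2013-01-01", tz = "UTC"),
               as.POSIXct("2017-05-01", tz = "UTC"), by = "hour")
hours_by_hour <- sort(unique(format(by_hour, "%H")))
length(hours_by_hour)
```

Sampling test_data from something like by_hour would give the hour filters rows to keep.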
What I'd like to be able to do is instead of writing
school_hours<-c('07', '08', '09', '10',
'11', '12', '13', '14',
'15', '16', '17', '18'
)
I'd write
school_hours2<-c(07:18)
Just like I have for school_years and the filtering would work.
This doesn't work
schools2<-td%>%
filter(format(test_data,'%H') %in% school_hours2)
This does work
schools5<-td%>%
filter(format(test_data,'%Y') %in% school_years2)
WHY?
I ask because:
I've used something similar to filter my real data, which I can't share, and I get a discrepancy.
When I use school_hours (which is a character) I generate 993 records and the first time is 07:00.
When I use school_hours2 (which is an integer) I generate 895 records and the first time is 10:00.
I know - "without the data we can't make any evaluation" but what I can't work out is why the two different vector filters work differently. Is it because school_hours contains characters and school_hours2 integers?
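A small sketch of what seems to be going on: %in% is built on match(), which converts the integer vector to character before comparing. The integer 7 becomes "7", which never equals the "07" produced by format(). Hours 10 to 18 have no leading zero and still match, which would be consistent with the first time being 10:00.

```r
# Typing 07:18 gives the same integers as 7:18; the leading zero is not stored.
school_hours2 <- 07:18
hours_as_text <- c("07", "09", "10", "18")

# The integers are converted to "7", "8", ..., "18" before matching.
matched <- hours_as_text %in% school_hours2
matched

# Four-digit years have no leading zero, so the same comparison succeeds.
year_matched <- "2016" %in% 2015:2017
year_matched
```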
EDIT:
I changed the test_data line to:
#some date time data - filtering.
test_data<-as.POSIXct(sample(seq(1127056501, 1127056501), 1000),origin = "1899-12-31",tz="UTC")
it's still problematic:
schools<-td%>%
filter(format(test_data,'%H') %in% school_hours)
generates 510 rows
schools2<-td%>%
filter(format(test_data,'%H') %in% school_hours2)
generates 379 rows
All of the data I'm really interested in looks like this
1899-12-31 23:59:00
(where the last 6 digits represent a 24 hr clock time)
All I'm really trying to do is convert the time from this
1899-12-31 07:59:00
to
the hour (7)
and then
use
school_hours2<-c(07:18)
as a filter.
But will the hour generated by the conversion of
1899-12-31 07:59:00
be
07
or
7
Because if it's 07, then
school_hours2<-c(07:18)
generates
7
and
school_hours2<-as.character(c(07:18))
generates
'7'
How do I get around this?
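Two possible ways around the leading zero, as a minimal sketch. The vector x and the tz = "UTC" setting are illustrative assumptions, not part of the original data.

```r
# Illustrative timestamps in the same shape as the real data.
x <- as.POSIXct(c("1899-12-31 07:59:00", "1899-12-31 23:59:00"), tz = "UTC")

# Option 1: zero-pad the integers so they look like the output of format(x, "%H").
school_hours2 <- sprintf("%02d", 7:18)
keep_text <- format(x, "%H") %in% school_hours2

# Option 2: convert the formatted hour to an integer and compare numbers.
keep_int <- as.integer(format(x, "%H")) %in% 7:18
```

Either way only the lower and upper bound (7 and 18) have to be typed.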
EDIT:
LIKE THIS:
R: how to filter a timestamp by hour and minute?
library(lubridate) # provides ymd_hms() and hour()
td1<-td%>%mutate(timestamp_utc = ymd_hms(test_data,tz="UTC"))%>%
mutate(hour = hour(timestamp_utc))%>%
filter(hour(timestamp_utc) %in% school_hours)
td2<-td%>%mutate(timestamp_utc = ymd_hms(test_data,tz="UTC"))%>%
mutate(hour = hour(timestamp_utc))%>%
filter(hour(timestamp_utc) %in% school_hours2)
td3<-td%>%
mutate(hour = hour(test_data))%>%
filter(hour(test_data) %in% school_hours2)
After a lot of mucking around and talking to myself in my question
I found this thread:
filtering a dataset by time stamp
and it helped me to realise how to isolate the hour in the time stamp and then use that to filter the data properly.
The final answer is to isolate the hour like this:
filter(hour(timestamp_utc) %in% school_hours2)
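For filtering "between two times" with only a lower and an upper bound, here is a base R sketch. filter_by_hour is a hypothetical helper, and it uses as.POSIXlt()$hour in place of lubridate's hour(); the small td below is made-up data in UTC.

```r
# Hypothetical helper: keep rows whose hour lies between lower and upper (inclusive).
filter_by_hour <- function(data, column, lower, upper) {
  hrs <- as.POSIXlt(data[[column]])$hour   # integer hour, 0 to 23
  data[hrs >= lower & hrs <= upper, , drop = FALSE]
}

# Made-up data for illustration only.
td <- data.frame(
  id = 1:4,
  test_data = as.POSIXct(c("2016-03-01 06:59:00", "2016-03-01 07:00:00",
                           "2016-03-01 18:30:00", "2016-03-01 19:00:00"),
                         tz = "UTC")
)
schools <- filter_by_hour(td, "test_data", 7, 18)
schools$id
```

Because the hour is an integer here, no zero padding is involved.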
I have some experimental data on CO2 values over a few days in a room which are time and date-stamped. I would like to break it up into a series of "experiments" based on an experiment list of when each experiment happened.
e.g.
Data
df<-data.frame(CO2.ppm.=runif(10), Date.time.=as.POSIXct(" 2019-2-08 07:00:00") + runif(n=10, min=0, max=3600))
List of experiments with start and stop times:
ExpertimentList<- data.frame(StartTime=c("2019-2-08 07:10:00", "2019-2-08 07:15:00", "2019-2-08 08:30:00"), StopTime=c("2019-2-08 07:12:00","2019-2-08 07:16:00","2019-2-08 08:15:00"),ExptID=c(1,2,3))
Note there is time when CO2 is measured but no experiment is happening. E.g. between 07:12:00 and 07:15:00.
I would like to split df$Date.time. by ExperimentList's StartTime and StopTime
So far I've converted everything to integers
df$Date.time.<-as.integer(df$Date.time.)
ExpertimentList$StartTime<-as.integer(as.POSIXct(ExpertimentList$StartTime))
ExpertimentList$StopTime<-as.integer(as.POSIXct(ExpertimentList$StopTime))
Then looking at cut
breakz<-dplyr::arrange(paste(ExpertimentList$StartTime,ExpertimentList$StopTime)%>%as_tibble())
cut(df$Date.time.,breaks=unique(breakz$value))
But I can't filter out the data when no experiment was taking place. Any thoughts are much appreciated.
Expected output:
set.seed(143)
data.frame(CO2.ppm.=runif(10), Date.time.=sort(as.POSIXct(" 2019-2-08 07:00:00") + runif(n=10, min=0, max=3600)),ExptID=c(NA,NA,NA,1,NA,NA,NA,NA,NA,NA))
ANSWER:
I found that I would run out of memory with @Ronak's answer so I chunked the data.frame into 10000 row segments:
df<-split(df, (as.numeric(rownames(df))-1) %/% 10000)
Then, based on @Ronak's answer, I popped the code into a function and used mclapply from the parallel package.
#Do a left join to remove any rows not belonging to an experiment
fuzzyJoinFunction<-function(a){
a<-fuzzy_left_join(a, Experiment_List,
by = c('Dev.Date.Time' = 'StartTime', 'Dev.Date.Time'= 'StopTime'),
match_fun = c(`>=`, `<=`))
a
}
df<-rbindlist(mclapply(X=df,FUN=fuzzyJoinFunction,mc.cores=4))
We can use fuzzyjoin::fuzzy_inner_join to keep only the rows which are in range.
library(dplyr)
library(fuzzyjoin)
#All the datetime values should be of type POSIXct.
ExpertimentList %>%
mutate(across(c(StartTime, StopTime), lubridate::ymd_hms)) -> ExpertimentList
fuzzy_inner_join(df, ExpertimentList,
by = c('Date.time.' = 'StartTime', 'Date.time.'= 'StopTime'),
match_fun = c(`>=`, `<=`))
To get all df values in the final output with NA for ExptID use fuzzy_left_join.
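If installing fuzzyjoin is not an option, or memory is tight, the same labelling can be sketched in base R by looping over the experiments. assign_experiment and to_time are hypothetical helpers, the data is made up, and UTC is assumed throughout.

```r
# Hypothetical helper: label each timestamp with the ID of the interval containing it.
assign_experiment <- function(times, starts, stops, ids) {
  out <- rep(NA_real_, length(times))
  for (i in seq_along(ids)) {
    inside <- times >= starts[i] & times <= stops[i]
    out[inside] <- ids[i]
  }
  out
}

to_time <- function(x) as.POSIXct(x, tz = "UTC")   # assumption: all times are UTC

# Made-up measurements and experiment windows for illustration.
df <- data.frame(
  CO2.ppm. = c(0.4, 0.5, 0.6, 0.7),
  Date.time. = to_time(c("2019-02-08 07:05:00", "2019-02-08 07:11:00",
                         "2019-02-08 07:13:00", "2019-02-08 07:15:30"))
)
df$ExptID <- assign_experiment(
  df$Date.time.,
  starts = to_time(c("2019-02-08 07:10:00", "2019-02-08 07:15:00")),
  stops  = to_time(c("2019-02-08 07:12:00", "2019-02-08 07:16:00")),
  ids    = c(1, 2)
)

# Rows measured while no experiment was running keep ExptID = NA and can be dropped.
in_experiment <- df[!is.na(df$ExptID), ]
```

Keeping the NA rows mirrors the left join; dropping them mirrors the inner join.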
Through an API I accessed weather information in JSON format, and I want to convert this data to a data frame. The problem is that the API does not return weather conditions for every date-city combination, so a few rows are empty. Second, not every combination that does return data gives the same aspects of the weather. My goal is to convert the JSON to a data frame in which the empty rows are still shown (which does not happen when I unlist them) and the different aspects of the weather appear under the right variable, with NA where there is no record for that variable. I've tried unlisting them and putting the result into a data frame, flattening the table, etc. (getting the error: arguments imply differing number of rows: 0, 1). I've searched for this topic, but none of the answers worked for my case (or maybe I applied them wrong because I'm not that experienced). Every tip is welcome!
The input looks like this:
reviewid dateofwriting lon lat
98338143 28-02-11 11,41693611 22,3193039
58929813 18-03-10 -3,7037902 40,4167754
65945346 31-05-10 -3,188267 55,953252
The output looks like this (the second observation returns 36 columns and the third one 38. the first entry is missing because there was no observation for that day and is not displayed)
[{},
{"daily":
[{"time":"2010-03-18",
"summary":"Partly cloudy throughout the day.",
"icon":"partly-cloudy-day",
"sunriseTime":"2010-03-18 07:22:51",
"sunsetTime":"2010-03-18 19:25:28",
"moonPhase":0.08,
"precipIntensity":0,
"precipIntensityMax":0,
"precipProbability":0,
"temperatureHigh":63.14,
"temperatureHighTime":1268928000,
"temperatureLow":45.16,
"temperatureLowTime":1268971200,
"apparentTemperatureHigh":63.14,
"apparentTemperatureHighTime":1268928000,
"apparentTemperatureLow":45.16,
"apparentTemperatureLowTime":1268971200,
"dewPoint":36.97,
"humidity":0.58,
"pressure":1025.96,
"windSpeed":1.24,
"windGust":7.87,
"windGustTime":1268866800,
"windBearing":48,
"cloudCover":0.54,
"uvIndex":5,
"uvIndexTime":1268913600,
"visibility":6.19,
"temperatureMin":43.97,
"temperatureMinTime":"2010-03-18 07:00:00",
"temperatureMax":63.14,
"temperatureMaxTime":"2010-03-18 17:00:00",
"apparentTemperatureMin":42.03,
"apparentTemperatureMinTime":"2010-03-18 08:00:00",
"apparentTemperatureMax":63.14,
"apparentTemperatureMaxTime":"2010-03-18 17:00:00"}]},
{"daily":
[{"time":"2010-05-30 01:00:00",
"summary":"Mostly cloudy until evening.",
"icon":"partly-cloudy-day",
"sunriseTime":"2010-05-30 05:38:39",
"sunsetTime":"2010-05-30 22:44:55",
"moonPhase":0.58,
"precipIntensity":0.0038,
"precipIntensityMax":0.0766,
"precipIntensityMaxTime":"2010-05-30 04:00:00",
"precipProbability":1,
"precipType":"rain",
"temperatureHigh":58.99,
"temperatureHighTime":1275242400,
"temperatureLow":36.62,
"temperatureLowTime":1275278400,
"apparentTemperatureHigh":58.99,
"apparentTemperatureHighTime":1275242400,
"apparentTemperatureLow":36.62,
"apparentTemperatureLowTime":1275278400,
"dewPoint":43.61,
"humidity":0.76,
"pressure":1011.52,
"windSpeed":4.65,
"windGust":21.4,
"windGustTime":1275224400,
"windBearing":350,
"cloudCover":0.61,
"uvIndex":5,
"uvIndexTime":1275213600,
"visibility":5.85,
"temperatureMin":45.99,
"temperatureMinTime":"2010-05-30 07:00:00",
"temperatureMax":58.99,
"temperatureMaxTime":"2010-05-30 20:00:00",
"apparentTemperatureMin":43.31,
"apparentTemperatureMinTime":"2010-05-30 06:00:00",
"apparentTemperatureMax":58.99,
"apparentTemperatureMaxTime":"2010-05-30 20:00:00"}]}]
The goal is to add these rows to the input excel above.
icon sunrisetime sunsettime etc.
NA NA NA etc.
partly-cloudy-day 18-03-10 07:22 18-03-10 19:25 etc.
partly-cloudy-day 30-05-10 05:38 30-05-10 22:44 etc.
There is a problem dealing with the responses that return NULL. To simplify the issue, it is easier to remove these non responses and then parse the remaining JSON response. If desired, one can go back and add the empty rows for the non responses.
library(jsonlite)
library(dplyr)
#test<- result from converting the JSON response.
#vector of reviewid, used to make the initial request to the API
reviewid<-c(98338143, 58929813, 65945346)
#find only the responses that are not Null or blank
valid<-which(sapply(1:nrow(test), function(j) {length(test[[1]][[j]])}) >0)
NullResponses<-which(sapply(1:nrow(test), function(j) {length(test[[1]][[j]])}) == 0)
#create a list of data frames, one per valid row of the response
dflist<-lapply( valid, function(j) {
temp<-t(as.matrix(unlist(test[j,])))
df<-data.frame(reviewid=reviewid[j], temp, stringsAsFactors = FALSE)
df
})
#bind the rows together.
answer<-bind_rows(dflist)
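To keep the empty responses as NA rows instead of removing them, something along these lines might work. It is a base R sketch under the assumption that each response has already been reduced to a flat named list; the responses object below is a made-up stand-in for the parsed API result.

```r
# Made-up stand-in for the parsed API result: one flat list per request,
# empty when the API returned nothing.
responses <- list(
  list(),
  list(icon = "partly-cloudy-day", humidity = 0.58),
  list(icon = "partly-cloudy-day", humidity = 0.76, precipType = "rain")
)
reviewid <- c(98338143, 58929813, 65945346)

# Collect every field name that appears in any response.
all_fields <- unique(unlist(lapply(responses, names)))

# Build one row per response, using NA where a field is absent.
rows <- lapply(responses, function(r) {
  vals <- lapply(all_fields, function(f) if (is.null(r[[f]])) NA else r[[f]])
  as.data.frame(setNames(vals, all_fields), stringsAsFactors = FALSE)
})
answer <- cbind(reviewid = reviewid, do.call(rbind, rows))
```

The first row then holds only the reviewid and NA values, so it lines up with the input table.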
I have a function to create a datatable in Shiny based on department numbers and how many times an event happened in that department during a time period. My issue is that if the date range is short enough, no departments will have had the event occur. In those instances, I get the error Error in rowSums(x) : 'x' must be an array of at least two dimensions which initially just appeared within the Shiny app and you could just ignore it. Now, the app crashes and you have to go back to R to look at it.
I understand why the error is occurring but I don't know if there's a way around it for my situation because I don't know if the events occur until the data is subset. The function is called a number of times in my code, so I don't want to write an if statement outside the function each time it is used.
I tried adding if(length(b$Department <= 1)){tab<-renderDataTable({datatable(NULL)})} right after defining b and then had an else statement around the remainder of the function, but I get the message Warning: Error in [.data.frame: undefined columns selected
I have also tried other if statements such as creating a dataframe full of NAs but this returned the original error message.
dept.table<-function(df, date1, date2){
a<-df[which(df$Date >= as.Date(date1) & df$Date <= as.Date(date2)),]
b<-as.data.frame(table((a[,c("Event", "Department")])))
d<-reshape(b, direction="wide", idvar="Event", timevar="Department")
names(d)<-sub('^Freq\\.', '', names(d))
d$Total<-round(rowSums(d[,-1]), 0)
levels(d$Event)<-c(levels(d$Event), "Total")
d<-rbind(d, c("Total", colSums(d[,-1])))
tab<-DT::renderDataTable({
datatable(d, extensions="FixedColumns", options=list(dom='t', scrollX=T, fixedColumns=list(leftColumns=1, rightColumns=1)), rownames=FALSE)
})
}
Sample data
df<-data.frame(Department=rep(100:109, 3), Event=rep(c("A", "B", "C"),10),
Date=sample(seq(as.Date('2018/01/01'), as.Date('2018/09/01'), by="day"), 30))
It's not pretty, but I figured out a solution. There were two different issues: one when there was no data and another when there were only 2 departments, so I needed two if statements.
dept.table<-function(df, date1, date2) {
  a<-df[df$Date >= as.Date(date1) & df$Date <= as.Date(date2),]
  b<-as.data.frame(table((a[,c("Event", "Department")])))
  if(nrow(b)==0){
    # no events in the date range: render an empty table
    tab<-DT::renderDataTable(NULL)
  } else {
    d<-reshape(b, direction="wide", idvar="Event", timevar="Department")
    names(d)<-sub('^Freq\\.', '', names(d))
    if(ncol(d)>3){
      # several department columns: add row and column totals
      d$Total<-round(rowSums(d[,-1]), 0)
      levels(d$Event)<-c(levels(d$Event), "Total")
      d<-rbind(d, c("Total", colSums(d[,-1])))
      tab<-DT::renderDataTable({
        datatable(d, extensions="FixedColumns", options=list(dom='t', scrollX=T, fixedColumns=list(leftColumns=1, rightColumns=1)), rownames=FALSE)})
    } else {
      tab<-DT::renderDataTable(datatable(d))
    }
  }
  tab
}
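A different way to avoid both special cases, sketched without Shiny: build the counts with table() and let addmargins() add the totals, since a table keeps two dimensions even with a single department. dept_counts is a hypothetical helper and the sample data is made up.

```r
# Hypothetical helper: count events per department within a date range,
# returning NULL instead of failing when the range contains no rows.
dept_counts <- function(df, date1, date2) {
  a <- df[df$Date >= as.Date(date1) & df$Date <= as.Date(date2), ]
  if (nrow(a) == 0) return(NULL)
  # factor() rebuilds the levels from the subset, so only events and
  # departments that actually occur get a row or column.
  tab <- table(Event = factor(a$Event), Department = factor(a$Department))
  # addmargins() appends a "Sum" row and column and keeps two dimensions.
  addmargins(tab)
}

# Made-up sample data.
df <- data.frame(Department = rep(100:102, 2),
                 Event = rep(c("A", "B"), 3),
                 Date = as.Date("2018-01-01") + 0:5)
none <- dept_counts(df, "2019-01-01", "2019-02-01")   # no rows in range
one  <- dept_counts(df, "2018-01-01", "2018-01-01")   # a single event
```

The result could then be converted with as.data.frame.matrix() before it is handed to datatable(), with a NULL check in one place inside the function.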
My Question is divided into 2 parts:
1st part:
I have a function, get_data(), which I use to pull information for a date range.
get_data <- function (fac_num, start_date, end_date) {
if (!(is.null(fac_num) | is.null(start_date) | is.null(end_date))) {
if(end_date - start_date > 7) {
start_date <- end_date - 7
#start_date <- as.Date('2017-07-05')
#end_date <- as.Date('2017-07-06')
#fac_num <- "005"
}
new_start_date <- paste0(start_date,' 05:00:00')
new_end_date <- paste0(end_date + 1,' 05:00:00')
qry <- paste0("SELECT FAC_NUM, USER_ID, APPL_ID, FUNC_ID, ST_ID, NXT_ST_ID, RESP_PRMT_DATA,
ST_DT_TM, END_DT_TM, RESP_PRMT_TY_CDE,
REQ_INP_DATA FROM OPSDBA.STG_RFS_INTERACTION WHERE TRANS_ST_DT_TM >= DATE'",
start_date,"' AND TRANS_ST_DT_TM BETWEEN TO_TIMESTAMP('",new_start_date,"', 'YYYY-MM-DD HH:MI:SS') AND TO_TIMESTAMP('",new_end_date,"', 'YYYY-MM-DD HH:MI:SS')
AND APPL_ID='CTS' AND FAC_NUM='",fac_num,"'")
and then I perform calculations on it.
Further on in my program, I use this get_data() function to pull data for a new set of analyses.
rf_log_perform <- get_data(display_facility_decode(input$facNum2),
input$dateRange2, input$dateRange2 + 1)
Here, since I am using just a single date instead of a range, I have added one to the date so that the get_data() function works.
I then wanted to modify the date range in such a way that, it does not show anything past 11:59 for the selected date.
rf_log_perform$date <- ifelse(strftime(rf_log_perform$st_dt_tm, format="%H:%M:%S")<'05:00:00',
format(as.POSIXct(strptime(rf_log_perform$st_dt_tm - 1*86400 , '%Y-%m-%d %H:%M:%S')),format = '%Y-%m-%d'),
format(as.POSIXct(strptime(rf_log_perform$st_dt_tm , '%Y-%m-%d %H:%M:%S')),format = '%Y-%m-%d'))
By using the get_data() function, I can pull data for the date range 08/29/2017, 05:00:00 to 08/30/2017, 05:00:00, which is considered to be a day in my example.
But for my calculations, I want to discard everything which is beyond 08/29/2017, 11:59:59 PM, for more accurate results.
For this purpose, I have added an ifelse statement to sort that out. But it isn't behaving as I expect, and I am confused about why not.
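One possible source of the confusion, as a sketch: the ifelse above relabels rows before 05:00 with the previous date, which keeps them rather than discarding them. The two operations are separated below on made-up data; the column name st_dt_tm follows the question, while POSIXct input and UTC are assumptions.

```r
# Made-up data; assumption: st_dt_tm is POSIXct in UTC.
rf_log_perform <- data.frame(
  st_dt_tm = as.POSIXct(c("2017-08-29 05:30:00", "2017-08-29 23:59:59",
                          "2017-08-30 00:10:00", "2017-08-30 04:59:00"),
                        tz = "UTC")
)
selected_date <- as.Date("2017-08-29")

# Operational day: anything before 05:00 is counted with the previous calendar day.
rf_log_perform$date <- as.Date(rf_log_perform$st_dt_tm - 5 * 3600, tz = "UTC")

# Calendar-day filter: drop rows after 23:59:59 on the selected date.
is_same_day <- as.Date(rf_log_perform$st_dt_tm, tz = "UTC") == selected_date
same_day <- rf_log_perform[is_same_day, ]
```

With the first approach all four rows belong to 2017-08-29; with the second only the first two remain.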
Unfortunately I still can not comment on the main question.
I encourage you to make two adjustments to your question to improve the chances of getting an answer:
1) Please make your example reproducible e.g. provide date ranges, wrap your code in a well defined function etc.
2) Explain what you are trying to achieve. What is your intention and expected result.
I am new to R and struggling with the fact that functions are able to operate on whole vectors without having to explicitly specify this.
My goal
I have a data frame calls with multiple columns, one of which is a “date” column. Now I want to add a new column, “daytime”, that labels the daytime the particular entry’s date falls into:
> calls
call_id length date direction daytime
1 258 531 1400594572974 outgoing afternoon
2 259 0 1375555528144 unanswered evening
3 260 778 1385922648396 incoming evening
What I have done so far
I have already implemented methods that return a vector of booleans like that:
# Operates on POSIXlt timestamps
is.earlymorning <- function(date) {
hour(floor_date(date, "hour")) >= 5 & hour(floor_date(date, "hour")) < 9
}
The call is.earlymorning("2014-05-20 16:02:52", "2013-08-03 20:45:28", "2013-12-01 19:30:48") would thus return (FALSE, FALSE, FALSE). What I am currently struggling with is implementing a function that actually returns labels. What I would like the function to do is the following:
# rawDate is a long value of the date as ms since 1970
Daytime <- function(rawDate) {
date <- as.POSIXlt(as.numeric(rawDate) / 1000, origin = "1970-01-01")
if (is.earlymorning(date)) {
"earlymorning"
} else if (is.morning(date)) {
"morning"
} else if (is.afternoon(date)) {
"afternoon"
} else if (is.evening(date)) {
"evening"
} else if (is.earlynight(date)) {
"earlynight"
} else if (is.latenight(date)) {
"latenight"
}
}
The problem
Obviously, my above approach does not work since the if-conditions would operate on whole vectors in my example. Is there an elegant way to solve this problem? I am sure I am confusing or missing some important points, but as I mentioned I am pretty new to R.
In short, what I want to implement is a function that returns a vector of labels according to a vector of date values:
# Insert new column with daytime labels
calls$daytime <- Daytime(calls$date)
# or something like that:
calls$daytime <- sapply(calls$date, Daytime)
# Daytime(1400594572974, 1375555528144, 1385922648396) => ("afternoon", "evening", "evening")
One approach would be to use cut rather than ifelse. I am not entirely sure how you want to label hours, but this will give you the idea. foo is your data (i.e., calls).
library(dplyr)
# Following your idea
ana <- transform(foo, date = as.POSIXlt(as.numeric(date) / 1000, origin = "1970-01-01"))
ana %>%
mutate(hour = cut(as.numeric(format(date, "%H")),
breaks = c(0,4,8,12,16,20,24),
right = FALSE, # intervals are [0,4), [4,8), ... so that hour 0 is not dropped
labels = c("late night", "early morning",
"morning", "afternoon",
"evening", "early night")
)
)
# call_id length date direction daytime hour
#1 258 531 2014-05-20 23:02:52 outgoing afternoon early night
#2 259 0 2013-08-04 03:45:28 unanswered evening late night
#3 260 778 2013-12-02 03:30:48 incoming evening late night
There is no need to have 6 different functions to establish which period of the day a given date is. It suffices to define a vector which matches the hour with the daytime. For instance:
Daytime<-function(rawDate) {
#change the vector according to your definition of the daytime.
#the first value corresponds to hour 0 and the last to hour 23
hours<-c(rep("latenight",5),rep("earlymorning",4),rep("morning",4),rep("afternoon",4),rep("evening",4),rep("earlynight",3))
hours[as.POSIXlt(as.numeric(rawDate) / 1000, origin = "1970-01-01")$hour+1]
}
Given Thomas' hint, I solved my problem in the following (admittedly inelegant) way:
Daytime <- function(rawDates) {
dates <- as.POSIXlt(as.numeric(rawDates) / 1000, origin = "1970-01-01")
ifelse(is.earlymorning(dates), "earlymorning",
ifelse(is.morning(dates), "morning",
ifelse(is.afternoon(dates), "afternoon",
ifelse(is.evening(dates), "evening",
ifelse(is.earlynight(dates), "earlynight",
ifelse(is.latenight(dates), "latenight",
"N/A")
)
)
)
)
)
}
With more labels this approach will soon become unmaintainable. Right now it serves my purposes and I will leave it at that, since I must focus on analysing the data as soon as possible. But I will let you know if I have time left and find a less complicated solution! Thank you for your quick response, Thomas.
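For what it's worth, a version that should scale to more labels is a lookup table combined with findInterval(). This is only a sketch: the boundary hours are assumptions (latenight from 0, earlymorning from 5, and so on), and tz = "UTC" is assumed so that the result does not depend on the local time zone.

```r
# Hypothetical lookup table: each label starts at the given hour.
# The boundaries are assumptions; adjust them to your own definition.
starts <- c(0, 5, 9, 13, 17, 21)
labels <- c("latenight", "earlymorning", "morning",
            "afternoon", "evening", "earlynight")

Daytime <- function(rawDates) {
  # rawDates are milliseconds since 1970; UTC is an assumption.
  dates <- as.POSIXlt(as.numeric(rawDates) / 1000, origin = "1970-01-01", tz = "UTC")
  # findInterval() returns, for each hour, the index of the last start <= hour.
  labels[findInterval(dates$hour, starts)]
}

daytime <- Daytime(c(1400594572974, 1375555528144, 1385922648396))
daytime
```

Adding a label then means adding one entry to starts and one to labels.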