I've got a GPS dataset with about 5600 rows of coordinates from 5 GPS devices ('nodes') collected over several days, and I want to reduce it to just one point per hour. Because the number of points per hour fluctuates, a simple for-loop isn't straightforward.
A simplified structure of the table would be this:
ID node easting northing year month day hour minute time
The column 'time' is class "POSIXlt" "POSIXt".
Trying my first approach, a multiply nested for-loop, I learned about the second circle of the R Inferno.
Does anyone have an idea how to reduce multiple rows (per hour) to one row per hour, separately for each device, in R?
Assuming that the year, month, day, and hour columns are consistent with the time column, the solution could be as follows:
# Generate data
md <- data.frame(
  node = rep(1:5, each = 4)
  , easting = sample(1:10, size = 20, replace = TRUE)
  , northing = sample(1:10, size = 20, replace = TRUE)
  , year = 2017
  , month = "June"
  , day = 6
  , hour = rep(1:2, each = 2, times = 5)
  , minute = NA
  , time = NA
)
# Solution
library(dplyr)

md %>%
  group_by(node, year, month, day, hour) %>%
  summarize(
    easting = mean(easting),
    northing = mean(northing)
  )
You can create a new column "Unix_hour": the UNIX timestamp divided by 3600 and rounded down.
That way, you get a unique id for each hour.
To do this, use as.numeric to convert a POSIXct date into a Unix timestamp (in seconds), then divide by 3600 and take the floor:
floor(as.numeric(POSIXct_variable) / 3600)
This returns the number of whole hours since the Unix epoch.
Then you just group by this new "Unix_hour" column:
aggregate(. ~ Unix_hour, df, mean)
(Change the aggregation function "mean" if you want to aggregate other variables in another way.)
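Put together, a minimal sketch of this approach on the GPS table from the question (assuming the time column has been filled in as POSIXct; the data frame and column names are illustrative, base R only):
md$Unix_hour <- floor(as.numeric(md$time) / 3600)  # whole hours since the Unix epoch

# One averaged point per device ('node') per hour
aggregate(cbind(easting, northing) ~ node + Unix_hour, data = md, FUN = mean)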
You could convert your multiple date-time columns into one, e.g.:
DateTimeUTCmin5 <- ISOdate(year = tmp$Year,
month = tmp$Month,
day = tmp$Day,
hour = tmp$Hour,
min = tmp$Min,
sec = tmp$Sec,
tz = "America/New_York")
Then add an hour floor using floor_date from lubridate:
df$HourFloor = floor_date(df$DateTimeUTCmin5, unit = "hour")
Then decide how you want to summarise the data within each hour: mean, first, max?
library(dplyr)

Hourstats <- df %>%
  group_by(HourFloor) %>%
  summarise(meanEast = mean(easting, na.rm = TRUE),
            firstNorth = first(northing)) %>%
  ungroup()
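Since the question asks for one point per hour per device, a possible extension (just a sketch, assuming df also carries the node column from the question) would be to group by the device as well:
# Sketch: one summary row per device per hour (assumes df has a node column)
Hourstats <- df %>%
  group_by(node, HourFloor) %>%
  summarise(meanEast = mean(easting, na.rm = TRUE),
            firstNorth = first(northing),
            .groups = "drop")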
I want to count the number of patients waiting to be seen by a doctor every 15 minutes over a 3.5-year time frame.
I have a first data.frame (dates) which has 122880 rows of dates (one per 15-minute interval).
I have another data.frame (episode) which has 225000 rows with each patient's ID, the time they came to the ER, and the time they left.
I have a third data.frame (care) with the patients' IDs and the time the doctor saw them.
Here is my code:
for(hour in 1:122880){
  for(patient in 1:nrow(episode)){
    if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour]){
      no_episode = episode$id[patient]
      if(care$begin[care$id == no_episode] > dates[hour]){
        nb_wait = nb_wait + 1
        delay = delay + dates[hour] - episode$begin[patient]
      }
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
The first loop is to set the date and write the results.
The second loop plus the first if statement searches the episode data.frame for patients who are in the ER during that time frame.
The second if counts the patients who haven't seen the doctor yet; I also sum how long those patients have been waiting.
It is extremely slow (estimated 30 days) and, from the tests I have done, the slowest line is this one:
if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour])
Any idea how to clean my code so it doesn't take so much time?
I have tried cutting the episode data.frame into 4, which speeds up the process but would still take 7 days!
Thank you!
Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!
Here is the code after modification:
for(hour in 1:122880){
  temp <- episode$id[episode$begin < dates[hour] & episode$end > dates[hour]]
  for(element in 1:length(temp)){
    no_episode = temp[element]
    if(care$begin[care$id == no_episode] > dates[hour]){
      nb_wait = nb_wait + 1
      delay = delay + dates[hour] - episode$begin[episode$id == no_episode]
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
Getting rid of the first if statement and the 2nd (longest) loop did the trick!
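For what it's worth, the remaining inner loop could probably be removed as well; the following is only a sketch of that idea, assuming dates can be treated as a vector of POSIXct timestamps and that care has exactly one row per episode id:
# Sketch: count waiting patients and sum delays at each timestamp without the
# inner loop, by matching each episode to its doctor-arrival time once
dates_vec <- if (is.data.frame(dates)) dates[[1]] else dates
dr_begin  <- care$begin[match(episode$id, care$id)]   # doctor arrival per episode

results <- lapply(dates_vec, function(d) {
  waiting <- which(episode$begin < d & episode$end > d & dr_begin > d)
  data.frame(nb_wait = length(waiting),
             delay   = sum(difftime(d, episode$begin[waiting], units = "mins")))
})
wait_summary <- do.call(rbind, results)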
Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15 minute interval in the date range in under 1 second.
First, here's some fake data. I assume here that "id" + "date" are unique identifiers in both the "care" and "episode" tables, and that there's a 1:1 match between them. I also assume the patient always arrives before the doctor does, and that the episode ends after the doctor arrives.
library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in 230k data
care <- tibble(
id = sample(1:99999, epi_n, replace = TRUE),
begin = ymd_h(2010010100) + runif(epi_n, max = 3.5*365*24*60*60),
date = as_date(begin)) %>%
filter(hour(begin) >= 7, hour(begin) <= 17) %>%
distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers
episode <- care %>%
transmute(id, date,
begin = begin - rpois(nrow(care), 30)*60,
end = begin + rgamma(nrow(care), 1)*600) %>%
arrange(begin)
Matching the data between the two data sets takes 0.2 sec.
tictoc::tic()
combined <- left_join(
episode %>% rename("patient_arrival" = "begin"),
care %>% rename("dr_arrival" = "begin")
)
tictoc::toc()
Counting how many patients had arrived but doctors hadn't at each 15 minute interval takes another 0.2 sec.
Here, I isolate each patient arrival and each doctor arrival; a patient arrival adds one to the wait count at that moment, and a doctor arrival reduces it by one. We can sort those events and take the cumulative sum to get the number of waiting patients over time. Finally, I collate this with a table like your "dates" table, fill in the wait counts that were in effect prior to each 15-minute interval, and show only those rows.
tictoc::tic()
combined %>%
select(id, patient_arrival, dr_arrival) %>%
pivot_longer(-id, values_to = "hour") %>%
mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
arrange(hour) %>%
mutate(wait_count = cumsum(wait_chg)) %>%
bind_rows(.,
tibble(name = "15 minute interval",
hour = seq.POSIXt(ymd_h(2010010100),
to = ymd_h(2013070112),
by = "15 min")) %>%
filter(hour(hour) >= 6, hour(hour) < 20)
) %>%
arrange(hour) %>%
fill(wait_count) %>%
replace_na(list(wait_count = 0)) %>%
filter(name == "15 minute interval") %>%
select(hour,wait_count)
tictoc::toc()
I am really new to R and this is probably a really basic question: let's say I have a dataset with a column that contains date values in the format "y-m-d H:M:S" as a Factor.
How do I split that one column into separate year, month, day, hour, minute and second columns?
Given example:
library(lubridate)
x <- as.factor(c("2018-01-03 12:34:32.92382", "2018-01-03 12:50:40.00040"))
x <- as_datetime(x) # convert to POSIXct date-time
x <- x %>%
dplyr::mutate(year = lubridate::year(x),
month = lubridate::month(x),
day = lubridate::day(x),
hour = lubridate::hour(x),
minute = lubridate::minute(x),
second = lubridate::second(x))
I get the error: no applicable method for 'mutate' applied to an object of class "c('POSIXct', 'POSIXt')".
Change it into a data frame first; then the mutate part will work:
library(dplyr)

x %>%
as.data.frame() %>%
rename(x = '.') %>%
dplyr::mutate(year = lubridate::year(x),
month = lubridate::month(x),
day = lubridate::day(x),
hour = lubridate::hour(x),
minute = lubridate::minute(x),
second = lubridate::second(x))
x year month day hour minute second
1 2018-01-03 12:34:32 2018 1 3 12 34 32.92382
2 2018-01-03 12:50:40 2018 1 3 12 50 40.00040
You could also make your mutate a little bit cleaner utilizing the power of across:
library(dplyr)
library(lubridate)

x %>%
  data.frame(date = .) %>%
  mutate(across(date,
                list(year = year, month = month, day = day,
                     hour = hour, minute = minute, second = second),
                .names = "{.fn}"))
I have a data frame where each row is a different timestamp. The older data was collected at 30-minute intervals, while the more recent data was collected at 15-minute intervals. I would like to run a for loop (or maybe an ifelse statement) that calculates the time difference between consecutive rows; if the difference is 30 minutes (the example below uses 1800 seconds) the loop continues, but when it encounters a 15-minute difference (900 seconds) it stops and tells me the row where this first occurred.
x <- as.POSIXct("2000-01-01 01:00", tz = "", "%Y-%m-%d %H:%M")
y <- as.POSIXct("2000-01-10 12:30", tz = "", "%Y-%m-%d %H:%M")
xx <- as.POSIXct("2000-01-10 12:45", tz = "", "%Y-%m-%d %H:%M")
yy <- as.POSIXct("2000-01-20 23:45", tz = "", "%Y-%m-%d %H:%M")
a.30 <- as.data.frame(seq(from = x, to = y, by = 1800))
names(a.30)[1] <- "TimeStamp"
a.15 <- as.data.frame(seq(from = xx, to = yy, by = 900))
names(a.15)[1] <- "TimeStamp"
dat <- rbind(a.30,a.15)
In the example dat data frame, the time difference switches from 30-minute to 15-minute intervals at row 457. I would like to automate the process of identifying the row where this change in time difference first occurs.
We can use difftime to calculate the difference in time in mins and create a logical vector based on the difference
library(dplyr)
dat %>%
  summarise(ind = which.max(abs(as.numeric(difftime(TimeStamp,
        lag(TimeStamp, default = TimeStamp[2]), units = 'mins'))) < 30))
# ind
#1 457
Here's another way that uses slightly different logic. Calculate the difference, and create a column with the row number. Then filter to where the difference is 15, and take the first row.
library(tidyverse)
dat %>%
  mutate(Diff = TimeStamp - lag(TimeStamp), rownum = row_number()) %>%
  filter(Diff == 15) %>%
  slice(1)
TimeStamp Diff rownum
1 2000-01-10 12:45:00 15 mins 457
I have a data frame which consists of a date column and the temperature of 34 different systems, with each system in a different column. I need to calculate every system's average hourly temperature. I use the code below to calculate the average for one system, but to do the same for the other 33 systems I have to repeat the code again and again. Is there a better way to find the hourly average in all columns at once?
dat$ut_ms <- dat$ut_ms/1000
dat[ ,1]<- as.POSIXct(dat[,1], origin="1970-01-01")
dat$ut_ms <- strptime(dat$ut_ms, "%Y-%m-%d %H:%M")
dat$ut_ms <- cut(dat$ut_ms, breaks = 'hour')
meanNPWD2401<- aggregate(NPWD2401 ~ ut_ms, dat, mean)
I added a picture of the data for a better understanding of what I want.
You can split your data per hour and iterate:
list1 <- split(dat, cut(strptime(dat$ut_ms, format = '%Y-%m-%d %H:%M'), 'hour'))
lapply(list1, function(x) colMeans(x[, -1]))  # drop the timestamp column before averaging
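To collect those per-hour means back into a single table, one possible follow-up (a sketch, assuming the first column of dat is the timestamp and the remaining 34 columns are the systems):
# Bind the per-hour column means into one data frame, one row per hour
hourly_means <- do.call(rbind,
                        lapply(list1, function(x) colMeans(x[, -1], na.rm = TRUE)))
hourly_means <- data.frame(hour = names(list1), hourly_means, row.names = NULL)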
When you rearrange the data into a long format, things get much easier
n.system <- 34
n.time <- 100
temp <- rnorm(n.time * n.system)
temp <- matrix(temp, ncol = n.system)
seconds <- runif(n.time, max = 3 * 3600)
time <- as.POSIXct(seconds, origin = "1970-01-01")
dataset <- data.frame(time, temp)
library(dplyr)
library(tidyr)
dataset %>%
gather(key = "system", value = "temperature", -time) %>%
mutate(hour = cut(time, "hour")) %>%
group_by(system, hour) %>%
summarise(average = mean(temperature))
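If the original wide layout (one column per system) is needed afterwards, the hourly averages can be spread back out; a sketch consistent with the gather call above:
library(dplyr)
library(tidyr)

hourly_wide <- dataset %>%
  gather(key = "system", value = "temperature", -time) %>%
  mutate(hour = cut(time, "hour")) %>%
  group_by(system, hour) %>%
  summarise(average = mean(temperature)) %>%
  ungroup() %>%
  spread(system, average)   # one column of hourly averages per system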
I am using the NAPM ISM data set from the FRED database. The data is at monthly frequency. I would like to create another data frame at daily frequency, where the value on each business day is the most recent monthly release. So if the last release was 49.5 on 02/01/16, then every day in February has a value of 49.5.
Code Sample
library(quantmod)

start_date <- as.Date("1970-01-01")
end_date <- Sys.Date()
US_PMI <- getSymbols("NAPM", auto.assign = FALSE, src ="FRED", from = start_date, to = end_date)
test <- data.frame(date=index(US_PMI), coredata(US_PMI))
I do not know which packages and data I need to reproduce your example but you can use a sequence of daily dates, the merge function and the NA filler in the zoo package to create the daily data frame:
library(zoo)
# Date range
sd = as.Date("1970-01-01")
ed = Sys.Date()
# Create daily and monthly data frame
daily.df = data.frame(date = seq(sd, ed, "days"))
monthly.df = data.frame(date = seq(sd, ed, "months"))
# Add some variable to the monthly data frame
monthly.df$v = rnorm(nrow(monthly.df))
# Merge
df = merge(daily.df, monthly.df, by = "date", all = TRUE)
# Fill up NA's
df = transform(df, v = na.locf(v))
This might not be the fastest way to obtain the data frame but it should work.
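Applied to the actual series from the question, a sketch of the same idea (assuming the test data frame built above has columns date and NAPM; the weekend filter relies on English day names):
library(zoo)

# Daily calendar covering the question's date range
daily <- data.frame(date = seq(as.Date("1970-01-01"), Sys.Date(), by = "day"))

# Merge the monthly releases onto the daily calendar and carry the last
# observation forward
daily_pmi <- merge(daily, test, by = "date", all.x = TRUE)
daily_pmi$NAPM <- na.locf(daily_pmi$NAPM, na.rm = FALSE)

# Optionally keep business days only (locale-dependent day names)
daily_pmi <- daily_pmi[!weekdays(daily_pmi$date) %in% c("Saturday", "Sunday"), ]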