In R I have data
USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
...
I want to create a new data set data_new that contains all USER values whose BIRTH time is between 10 o'clock and 11 o'clock.
USER and BIRTH are both stored as strings (characters). I tried this:
data_new= data$BIRTH > as.POSIXct("10:00:00", format="%H:%M:%S")
& data$BIRTH < as.POSIXct("11:00:00", format="%H:%M:%S")
but R returns FALSE for every entry, so this doesn't work.
How can I solve this?
Update
Say I want to find the number of users for every hour of the day. I used the answer below and tried this:
u=c()
for(j in 1:24) {
data_new=data[times > "00:00:00"+(j-1) & times < "01:00:00"+j ,]
#saving the number of users in vector u
u[j]=dim(data_new)[1]
}
but R can't figure out the term "00:00:00"+(j-1).
If df is your data frame:
df <- read.table(text = 'USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
121 "2014-12-26 10:07:35"
121 "2014-12-26 11:07:35"
121 "2014-12-26 10:38:35"', header = T)
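library(lubridate)  # provides ymd_hms()
df$BIRTH <- ymd_hms(df$BIRTH)  # parsed as UTC
# format() keeps the time zone stored on BIRTH; strftime() would convert to the local zone
times <- format(df$BIRTH, format = "%H:%M:%S")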
df[times > "10:00:00" & times < "11:00:00",]
Output:
USER BIRTH
3 121 2014-12-26 10:07:35
5 121 2014-12-26 10:38:35
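For the update in the question (counting users for every hour), the loop fails because "00:00:00"+(j-1) tries to add a number to a character string. A minimal base R sketch that avoids string arithmetic is to extract the hour as an integer and tabulate it. The names birth, hour and u below are only illustrative, and the time zone is assumed to be UTC:

```r
# Toy data mirroring the example above; BIRTH is stored as character
df <- data.frame(
  USER  = c(11, 121, 121, 121, 121),
  BIRTH = c("2013-01-11 22:31:11", "2014-12-26 04:07:35",
            "2014-12-26 10:07:35", "2014-12-26 11:07:35",
            "2014-12-26 10:38:35"),
  stringsAsFactors = FALSE
)

# Parse the strings (UTC is an assumption) and extract the hour as an integer
birth <- as.POSIXct(df$BIRTH, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
hour  <- as.integer(format(birth, "%H"))

# Count users per hour; the factor levels keep hours with no users as 0
u <- table(factor(hour, levels = 0:23))

# Rows from 10:00:00 up to 10:59:59
between_10_11 <- df[hour == 10, ]
```

Note that hour == 10 also includes a timestamp of exactly 10:00:00, which the strict > comparison above excludes.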
One way to do something to each subset of your data is to use the split-lapply paradigm. In this case, you would convert data$BIRTH to POSIXlt and split by the hour component of the POSIXlt object. That will give you a list where each list element contains all the data for a specific hour.
data <- read.csv(text = "USER,BIRTH
11,2013-01-11 22:31:11
12,2014-12-26 04:07:35
21,2014-12-26 10:07:35
121,2014-12-26 11:07:35
112,2014-12-26 10:38:35")
data_by_hour <- split(data, as.POSIXlt(data$BIRTH)$hour)
Then you can use lapply (or sapply) to do whatever you want to each of those subsets. To count the number of observations per hour:
# number of observations for each hour
sapply(data_by_hour, nrow)
4 10 11 22
1 2 1 1
You can also do this with xts.
library(xts)
# Create xts object from 'data' data.frame
# Note: xts objects are based on a matrix, so you cannot have columns with
# mixed types like you can with a data.frame.
x <- xts(data["USER"], as.POSIXct(data$BIRTH))
period.apply(x, endpoints(x, "hours"), nrow)
# USER
# 2013-01-11 22:31:11 1
# 2014-12-26 04:07:35 1
# 2014-12-26 10:38:35 2
# 2014-12-26 11:07:35 1
Note that you can do time-of-day subsetting with xts. It avoids potential locale-related collation order issues caused by using comparison operators on character strings.
x["T10:00/T11:00"]
# USER
# 2014-12-26 10:07:35 21
# 2014-12-26 10:38:35 112
I have measured temperature at a high time resolution of 10 minutes on different urban tree species, whose reactions I want to compare. I am therefore especially interested in periods of heat. The task I fail to do on my dataset is to select complete days based on a maximum value: days with at least one measurement above 30 °C should be extracted from my data frame in full.
Below you find a reproducible example that should illustrate my problem:
In my Measurings data frame I have calculated a column indicating whether each individual measurement is above or below 30 °C. I wanted to use that column to tell other functions whether they should pick a day or not when producing a new data frame. If the value is above 30 °C at any time of the day, I want to include that whole day, from 00:00 to 23:59, in the new data frame for further analyses.
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")
Measurings <- data.frame(
Time = tseq,
Temp = sample(20:35,1000, replace = TRUE),
Variable1 = sample(1:200,1000, replace = TRUE),
Variable2 = sample(300:800,1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")
Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")
The example yields a data frame analogous in structure to my data:
head(Measurings)
Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00 28 56 377 normal 0
2 2018-05-18 01:00:00 23 65 408 normal 0
3 2018-05-18 02:00:00 29 78 324 normal 0
4 2018-05-18 03:00:00 24 157 432 normal 0
5 2018-05-18 04:00:00 32 129 794 heat 1
6 2018-05-18 05:00:00 25 27 574 normal 0
So how do I subset to get a new data frame that contains all the days where at least one entry is marked as "heat"?
I know that, for example, dplyr::filter could filter the individual entries (row 5 in the head of the example). But how can I tell it to take the whole day 2018-05-18?
I am quite new to analyzing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for quite a few tasks, but I am open to whatever works.
Thanks a lot, Konrad
Create a variable that specifies the day (dropping hours, minutes, etc.). Iterate over the unique dates and keep only those subsets whose heat30 column contains "heat" at least once:
library(dplyr)
# add a day column (the date without the time of day)
Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))
newdf <- lapply(unique(Measurings$Time2), function(x){
rr <- Measurings %>% filter(Time2 == x) # all rows for date x
# keep the day only if its heat30 vector contains "heat" at least once
if(any(rr$heat30 == "heat")) rr else NULL
}) %>% bind_rows() # NULL entries (days without heat) are dropped
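The same idea can also be written without an explicit loop, as a grouped filter. This is only a sketch on a small made-up data frame (toy, with heat placed on day 1 and day 3) rather than the random data from the question. The column name Day is an assumption, and the time zone is passed explicitly so the day boundaries follow CET:

```r
library(dplyr)

# Three days of hourly data; only day 1 and day 3 contain a value above 30
start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
toy <- data.frame(
  Time = seq(from = start, by = "hour", length.out = 72),
  Temp = 25
)
toy$Temp[c(5, 60)] <- 32  # 2018-05-18 04:00 and 2018-05-20 11:00

# Keep every row of any day that has at least one measurement above 30
hot_days <- toy %>%
  group_by(Day = as.Date(Time, tz = "CET")) %>%
  filter(any(Temp > 30)) %>%
  ungroup()
```

Here hot_days keeps all 24 rows of 2018-05-18 and of 2018-05-20 and drops 2018-05-19 entirely.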
Below is one possible solution using the dataset provided in the question. Please note that this is not a great example as all days will probably include at least one observation marked as over 30 °C (i.e. there will be no days to filter out in this dataset but the code should do the job with the actual one).
# import packages
library(dplyr)
# break the time stamp into Day and Hour
# (format() is used instead of splitting the printed time stamp, because
# recent R versions may drop "00:00:00" when converting midnight to character)
time_df <- data.frame(
Day = format(Measurings$Time, "%Y-%m-%d"),
Hour = format(Measurings$Time, "%H:%M:%S")
)
# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])
# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])
To be more precise, you are creating a random sample of 1000 hourly observations, with temperatures between 20 and 35, spanning roughly 42 days. As a result, it is very likely that every single day will have at least one observation marked as over 30 °C in your example. Additionally, it is always good practice to set a seed to ensure reproducibility.
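To put a rough number on "very likely", here is a small sketch of the arithmetic, assuming the hourly draws are independent and uniform over 20:35 as in the question's sample() call. The variable names are only illustrative:

```r
# Probability that a single draw from 20:35 is above 30 (5 of 16 values)
p_heat <- mean(20:35 > 30)

# Probability that a full day of 24 hourly draws has no value above 30
p_day_without_heat <- (1 - p_heat)^24
p_day_without_heat  # on the order of 1e-4

# Setting a seed before sampling makes the example reproducible
set.seed(123)  # the seed value itself is arbitrary
Temp <- sample(20:35, 1000, replace = TRUE)
```

With a chance that small per day, a run of about 42 days will almost never contain a day to filter out.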
I have a very large data set keyed on an id and a date. The dataset has several hundred million rows and about 10 million ids. I am running in a non-Windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment I'm working with multidplyr, though I am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call. The order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So, the first date for the id gets a gap of zero. The second date gets the value of the second date minus the date in the prior row. And so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
# x$gap <- NA
for(i in 1:nrow(x)) {
x[i, ]$gap <- ifelse(i == 1, 0, x[i,]$date - x[i-1, ]$date)
}
return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap){
ep <- 1
for(i in 1:nrow(x)){
ifelse(x[i,]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
x[i, ]$episode <- ep
}
return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. calculate gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)) , by = id ]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't found a solution for the second part of your question yet, but I would encourage others to complement this answer if they find a solution.
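One possible way to fill in the second step, offered as a sketch rather than a tested solution for the full data set: since a new episode starts whenever a gap exceeds the allowed value, the episode number is a running count of such gaps within each id. The code below uses the 11 rows shown in the question and assumes allowable_gap is 30:

```r
library(data.table)

# The example rows from the question (one id, already sorted by date)
df <- data.table(
  id   = 100000019L,
  date = as.Date(c("2015-01-24", "2015-02-20", "2015-03-31", "2015-04-29",
                   "2015-05-27", "2015-06-24", "2015-07-24", "2015-08-23",
                   "2015-09-21", "2015-10-22", "2015-12-30"))
)
allowable_gap <- 30

# Step 1: days since the previous date within each id, 0 for the first row
df[, gap := c(0L, as.integer(diff(date))), by = id]

# Step 2: every gap above the allowed value starts a new episode,
# so the episode is a cumulative count of such gaps, starting at 1
df[, episode := cumsum(gap > allowable_gap) + 1L, by = id]
```

On these rows this gives the same gap and episode columns as in the question.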
Here is my date range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames, each dividing the dates into different fake years of 365 days. This means that each of the 10,000 data.frames needs a different start and end of year.
In total, df has 14,965 days, which divided by 365 days gives 41 years. In other words, df needs to be grouped 10,000 different ways into 41 years (of 365 days each).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the dates left over at the end of df need to wrap around to the start.
The grouped fake years need to appear in a third column of each data.frame.
I would put all the data.frames into a list, but I don't know how to write a function that generates 10,000 different start dates and then groups each data.frame into 41 windows of 365 days.
Can anyone help me?
@gringer gave a good answer, but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and still produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, each result column is a subset of the original data frame (where the arithmetic is already done): start at one position in the first half of the frame and take the next 14,965 values. I'm using nrow(df)/2 rather than the literal number to keep the code more generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
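The question also asks for a column that groups each rotated series into 41 fake years of 365 days. Here is a minimal sketch of one way to add it, using modular indexing for the wraparound instead of a doubled vector. The helper make_fake_years and the column name fake_year are made up for illustration, and only 3 frames are generated here instead of 10,000:

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(from = start_day, to = end_day, by = "day")
n <- length(date_vec)  # 14965 days = 41 blocks of 365

# Hypothetical helper: rotate the dates to begin at start_pos,
# then label consecutive blocks of 365 days as fake years 1..41
make_fake_years <- function(start_pos) {
  idx <- ((start_pos - 1 + seq_len(n) - 1) %% n) + 1  # wrap past the end
  data.frame(
    date      = date_vec[idx],
    fake_year = rep(seq_len(n / 365), each = 365)
  )
}

set.seed(1)  # arbitrary seed, only for reproducibility
starts <- sample.int(n, 3)  # use 10000 for the real application
fake_list <- lapply(starts, make_fake_years)
```

Each element of fake_list contains every date exactly once, in rotated order, together with its fake year.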
I have here a really silly output format from observations, which I have to read in with scan.
Here's a snippet from the file (data.dat), where I've marked header and data blocks:
06.02.2014 # header
PNP
-0,005
00:05#587 # values
00:15#591
23:50#587
23:55#587
07.02.2014 # header
PNP
-0,005
00:10#587 # values
00:15#590
23:55#590
24:00#593
08.02.2014 # header
PNP
-0,005
00:05#590 # value
00:10#595
00:15#600
23:50#600
23:55#607
The problems are:
I've got data for several years in 5-minute resolution,
each day has its own header (constant length), beginning with the date and two additional entries,
the length of the time series (format HH:MM#value) for each day is not constant; data gaps exist (not shown in the example).
My aim is a data.frame of the form date, time, value.
So, I need a loop or something similar that analyses the single elements (the output from scan(file = "data.dat", what = " "), read as character). Since the time blocks have different lengths, I'd like to subset my daily data beginning with the date, skip the further header elements, and then strsplit the time#value elements of the output of
crap <- scan(file = "data.dat", what = " ") # import as a character vector
The strsplit works well with
tmp <- strsplit(crap[4:8], split="#")
df <- data.frame(date=as.Date(crap[1],format = "%d.%m.%Y"), time=sapply(tmp, "[[", 1), W=sapply(tmp, "[[", 2))
However, I have no idea how to check whether the elements (as characters) have a valid date format.
Cheers!
I have a solution but it may be very specific to the question you asked and what I interpreted.
First read the data and remove the PNP and -0,005 from the data.
crap <- read.table(file = "data.dat",comment.char = " ")
a <- as.vector(crap$V1)
a <- a[-grep("PNP|-0,005",x = a)]
Now I extract the dates contained in the vector a
dateId <- grep(".",x=a,fixed=T)
uniquedate <- as.matrix(a[dateId])
> uniquedate
[,1]
[1,] "06.02.2014"
[2,] "07.02.2014"
[3,] "08.02.2014"
Now I create a vector of dates of same length as no. of values in the dataset by repeating the dates for the number of values present in the corresponding date.
len <- length(dateId)
dateRepVal <- c(diff(dateId)-1,(length(a) - dateId[len]))
dates <- unlist(sapply(1:len,FUN = function(x){rep(uniquedate[x],dateRepVal[x])}))
All other elements except the dates in our dataset "a" are time-value pairs. Using this information, I now get the time and val with the strsplit function and then create the data frame.
timeVal <- strsplit(a[-dateId],split = "#")
time <- sapply(timeVal, "[[", 1)
val <- sapply(timeVal, "[[", 2)
DF <- data.frame(date = dates,time=time,val=val)
The final required output looks like below.
>DF
date time val
1 06.02.2014 00:05 587
2 06.02.2014 00:15 591
3 06.02.2014 23:50 587
4 06.02.2014 23:55 587
5 07.02.2014 00:10 587
6 07.02.2014 00:15 590
7 07.02.2014 23:55 590
8 07.02.2014 24:00 593
9 08.02.2014 00:05 590
10 08.02.2014 00:10 595
11 08.02.2014 00:15 600
12 08.02.2014 23:50 600
13 08.02.2014 23:55 607
Hope this solves the problem.
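A slightly more general variant, sketched in base R under the assumption that header dates always look like DD.MM.YYYY and values always look like HH:MM#value: classify each element with a regular expression, then use cumsum on the date flags to know which day each value belongs to. The input below is a shortened, made-up excerpt typed in as a vector instead of being read from data.dat:

```r
# Shortened sample of the scanned elements (comments and some values omitted)
lines <- c("06.02.2014", "PNP", "-0,005", "00:05#587", "00:15#591",
           "07.02.2014", "PNP", "-0,005", "00:10#587", "24:00#593")

# Flag header dates and time#value entries; everything else is ignored
is_date  <- grepl("^[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}$", lines)
is_value <- grepl("^[0-9]{2}:[0-9]{2}#", lines)

# Running count of dates seen so far = index of the day each element belongs to
block <- cumsum(is_date)

parts <- strsplit(lines[is_value], "#", fixed = TRUE)
DF <- data.frame(
  date  = as.Date(lines[is_date][block[is_value]], format = "%d.%m.%Y"),
  time  = sapply(parts, `[[`, 1),
  value = as.numeric(sapply(parts, `[[`, 2))
)
```

Because the header lines are identified by pattern rather than by position or literal content, this also copes with days of different lengths. Entries such as 24:00 are kept as plain text in the time column here and would need extra handling before converting to a date-time.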
I have a CSV file that looks like this, where "time" is a UNIX timestamp:
time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14
... and so on
I am reading it into R and converting the time column into POSIXct like so:
data <- read.csv(file="data.csv",head=TRUE,sep=",")
data[,1] <- as.POSIXct(data[,1], origin="1970-01-01")
Great so far, but now I would like to build a histogram with each bin corresponding to the average hourly count. I'm stuck on selecting by hour and then counting. I've looked through ?POSIXt and ?cut.POSIXt, but if the answer is in there, I am not seeing it.
Any help would be appreciated.
Here is one way:
R> lines <- "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14"
R> con <- textConnection(lines); df <- read.csv(con); close(con)
R> df$time <- as.POSIXct(df$time, origin="1970-01-01")
R> df$hour <- as.POSIXlt(df$time)$hour
R> df
time count hour
1 2011-03-15 05:13:52 5 5
2 2011-03-11 13:23:52 0 13
3 2011-03-13 09:48:52 1 9
4 2011-03-16 04:18:52 4 4
5 2011-03-12 12:28:52 1 12
6 2011-03-14 08:53:52 1 8
7 2011-03-10 17:03:52 9 17
8 2011-03-09 20:43:52 14 20
R> tapply(df$count, df$hour, FUN=mean)
4 5 8 9 12 13 17 20
4 5 1 1 1 0 9 14
R>
Your data doesn't actually yet have multiple entries per hour-of-the-day but this would average over the hours, properly parsed from the POSIX time stamps. You can adjust with TZ info as needed.
You can calculate the hour "bin" for each time by converting to a POSIXlt and subtracting away the minute and seconds components. Then you can add a new column to your data frame that would contain the hour bin marker, like so:
date.to.hour <- function (vec)
{
as.POSIXct(
sapply(
vec,
function (x)
{
lt = as.POSIXlt(x)
x - 60*lt$min - lt$sec
}),
tz="GMT",
origin="1970-01-01")
}
data$hour <- date.to.hour(as.POSIXct(data[,1], origin="1970-01-01"))
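A shorter way to get the same kind of hour bin, shown here only as a sketch, is base R's trunc() method for date-times, which drops the minutes and seconds directly. The example uses the first three timestamps from the question and assumes UTC:

```r
# First three rows of the example CSV, interpreted in UTC (an assumption)
time  <- as.POSIXct(c(1300162432, 1299849832, 1300006132),
                    origin = "1970-01-01", tz = "UTC")
count <- c(5, 0, 1)

# trunc() returns POSIXlt, so convert back to POSIXct for use in a data frame
hour_bin <- as.POSIXct(trunc(time, units = "hours"))

# Average count per hour bin
avg <- tapply(count, format(hour_bin, "%Y-%m-%d %H:00"), mean)
```

Unlike the hour-of-day approach with as.POSIXlt(...)$hour, these bins keep the date, so 04:00 on two different days stays in two separate bins.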
There's a good post on this topic on Mages' blog. To get the bucketed data:
aggregate(. ~ cut(time, 'hours'), data, mean)
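If you just want a quick graph, ggplot2 is your friend (the call below uses the current stat_summary interface, since qplot and the fun.y argument are deprecated in recent ggplot2 versions):
library(ggplot2)
ggplot(data, aes(cut(time, "hours"), count)) + stat_summary(fun = mean, geom = "point")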
Unfortunately, because cut returns a factor, the x axis won't work properly. You may want to write your own, less awkward bucketing function for time, e.g.
timebucket = function(x, bucketsize = 1,
units = c("secs", "mins", "hours", "days", "weeks")) {
secs = as.numeric(as.difftime(bucketsize, units=units[1]), units="secs")
structure(floor(as.numeric(x) / secs) * secs, class=c('POSIXt','POSIXct'))
}
qplot(timebucket(time, units="hours"), ...)