My data consists of time points, in hours, starting from the start point of the experiment.
Experiments usually take over a week, so the number of hours easily exceeds 24.
To be precise, data is in the following format:
162:43:33.281
hhh:mm:ss.msecs
At the start of the experiment, data points may have just 1-2 digits for the hour instead of the 3 shown here.
When I try to subtract two time points I get an error stating that the numerical expression has, for example, 162:43 elements, which obviously refers to the colons used in the time annotation.
Any ideas on how to be able to treat time variables that consist of hour values over 24?
I tried the strptime function with %H as the format, but that limits me to 24 hours.
Here is some example data:
V1 V2 V3 V4 V5
75:45:32.487 NA 17 ####revFalsePoke is 112 TRUE
75:45:32.487 NA 17 ####totalwindow is 5 TRUE
75:46:32.713 NA 1 ####Criteria not met TRUE
75:46:49.846 NA 6 ####revCorrectPoke is 37 TRUE
75:46:52.336 NA 9 ####revDeliberateLick is 34 TRUE
75:46:52.351 NA 9 ####totalwindow is 5 TRUE
75:46:52.598 NA 1 ####Criteria not met TRUE
75:47:21.332 NA 6 ####revCorrectPoke is 38 TRUE
75:47:23.440 NA 9 ####revDeliberateLick is 35 TRUE
75:47:23.455 NA 9 ####totalwindow is 6 TRUE
75:47:23.657 NA 1 ####rev Criteria not met TRUE
75:47:44.731 NA 17 ####revFalsePoke is 113 TRUE
75:47:44.731 NA 17 ####totalwindow is 6 TRUE
Unfortunately, you're going to have to roll your own converter function for this. I suggest converting the timestamps to difftime objects (which represent a duration rather than a point in time). You can then add them to some starting datetime to arrive at a final datetime for each timestamp. Here's one approach:
f <- function(start, timestep) {
  result <- mapply(function(part, units) as.difftime(as.numeric(part), units = units),
                   unlist(strsplit(timestep, ':')),
                   c('hours', 'mins', 'secs'),
                   SIMPLIFY = FALSE)
  start + do.call(sum, result)
}
start <- as.POSIXct('2013-1-1')
timesteps <- c('162:43:33.281', '172:34:28.33')
lapply(timesteps, f, start=start)
# [[1]]
# [1] "2013-01-07 18:43:33.280 EST"
#
# [[2]]
# [1] "2013-01-08 04:34:28.32 EST"
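If every timestamp keeps the hours:minutes:seconds.fraction shape, the same conversion can also be done with plain vectorized arithmetic, since adding a number to a POSIXct adds seconds. A sketch under that assumption (to_seconds is a hypothetical helper, not part of the answer above):

```r
# Sketch: convert "hhh:mm:ss.frac" strings straight to seconds,
# then add them to the start time (POSIXct + numeric adds seconds).
to_seconds <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  as.numeric(parts[, 1]) * 3600 +  # hours (may exceed 24)
    as.numeric(parts[, 2]) * 60 +  # minutes
    as.numeric(parts[, 3])         # seconds, including the fraction
}

start <- as.POSIXct('2013-1-1')
timesteps <- c('162:43:33.281', '172:34:28.33')
start + to_seconds(timesteps)
```

Being vectorized, this handles a whole column of timestamps at once instead of one element per lapply call.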
I have a dataset consisting of user, time, and condition. I want to replace the time for sequences beginning with a FALSE followed by two or more consecutive TRUEs with the time of the last consecutive TRUE.
Let's say df is:
df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)
My desired result: the time of row 6 is copied to rows 3 to 6, because the consecutive TRUEs run from row 4 to row 6. The same applies to the last three records.
user time condition
11 1:05 FALSE
11 1:10 TRUE
11 1:25 FALSE
11 1:25 TRUE
11 1:25 TRUE
11 1:25 TRUE
11 1:40 FALSE
22 2:20 FALSE
22 2:40 FALSE
22 2:40 TRUE
22 2:40 TRUE
How can I do this in R?
Here's one option using rle:
## Run length encoding of df
df_rle <- rle(df$condition)
## Locations of 2 or more consecutive TRUEs among the runs
seq_changes <- which(df_rle$lengths >= 2 & df_rle$values)
## End-point index of each run in the original data frame
df_ind <- cumsum(df_rle$lengths)
## Loop over the runs to change
for (i in seq_changes) {
  i1 <- df_ind[i - 1]
  i2 <- df_ind[i]
  df$time[i1:i2] <- df$time[i2]
}
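Put together as a self-contained run on the question's df, the rle approach produces the desired times (a sketch; note it assumes the data never starts with a qualifying TRUE run, since i - 1 would then index before the first run):

```r
df <- read.csv(text = "user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header = TRUE, stringsAsFactors = FALSE)

df_rle <- rle(df$condition)                                # runs of FALSE/TRUE
seq_changes <- which(df_rle$lengths >= 2 & df_rle$values)  # TRUE runs of length >= 2
df_ind <- cumsum(df_rle$lengths)                           # last row of each run
for (i in seq_changes) {
  df$time[df_ind[i - 1]:df_ind[i]] <- df$time[df_ind[i]]
}
df$time
# "1:05" "1:10" "1:25" "1:25" "1:25" "1:25" "1:40" "2:20" "2:40" "2:40" "2:40"
```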
This solution should do the trick; see the comments in the code for more details:
false_positions <- which(!c(df$condition, FALSE)) # Flag the position of each FALSE occurrence
                                                  # (a dummy FALSE is appended to mark the end of the data frame)
false_differences <- diff(false_positions, 1)     # Calculate how far each FALSE occurrence is from the last
false_starts <- which(false_differences > 2)      # Grab the FALSE differences that are more than 2 apart:
                                                  # greater than 2 indicates 2 or more TRUEs, as the first FALSE
                                                  # counts as one position.
                                                  # false_starts stores the beginning of each chain we want to update

# Go through each FALSE start that has more than one consecutive TRUE
for (false_start in false_starts) {
  false_first <- false_positions[false_start]        # Position of the start of our chain
  true_last <- false_positions[false_start + 1] - 1  # Position of the end of our chain: the item before (thus
                                                     # the -1) the FALSE after (thus the +1) our initial FALSE
  time_override <- df$time[true_last]                # The last TRUE of the chain gives the time we want to use
  df$time[false_first:true_last] <- time_override    # Update all times from the start to the end of the chain
}
> df
user time condition
1 11 1:05 FALSE
2 11 1:10 TRUE
3 11 1:25 FALSE
4 11 1:25 TRUE
5 11 1:25 TRUE
6 11 1:25 TRUE
7 11 1:40 FALSE
8 22 2:20 FALSE
9 22 2:40 FALSE
10 22 2:40 TRUE
11 22 2:40 TRUE
I would like to parallelize that bottom loop if possible, but off the top of my head I was struggling to do so.
The gist is to identify where all our FALSEs are, then identify where each of our chains starts; since we only have TRUEs and FALSEs, we can do this by looking at how far apart our FALSEs are.
Once we know where our chains start (the first FALSE wherever FALSEs are far enough apart), we can get the end of each chain by looking at the element before the next FALSE in the list of FALSE positions we already created.
With the beginning and end of each chain known, we can just look at the end of the chain to get the time we want, then fill in the time values.
I hope this presents a relatively fast way of doing what you want, though :)
Here is a data.table solution which should be faster in runtime.
library(data.table)
setDT(df)
df[, time := if (.N > 2) time[.N] else time,
by=cumsum(!shift(c(condition, FALSE))[-1L])]
# user time condition
# 1: 11 1:05 FALSE
# 2: 11 1:10 TRUE
# 3: 11 1:25 FALSE
# 4: 11 1:25 TRUE
# 5: 11 1:25 TRUE
# 6: 11 1:25 TRUE
# 7: 11 1:40 FALSE
# 8: 22 2:20 FALSE
# 9: 22 2:40 FALSE
#10: 22 2:40 TRUE
#11: 22 2:40 TRUE
The idea is to cut the data into sequences, each starting with a FALSE.
[-1L] removes the leading NA before doing the cumsum.
I would recommend running some of the by code within j to take a look at the groups.
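Because shift lags the padded vector by one and [-1L] then drops the leading NA, the grouping expression effectively says "each FALSE opens a new group". A base-R sketch of the group ids it produces for the question's condition column:

```r
condition <- c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
               FALSE, FALSE, FALSE, TRUE, TRUE)
# every FALSE starts a new group; the TRUEs that follow share its id
grp <- cumsum(!condition)
grp
# 1 1 2 2 2 2 3 4 5 5 5  -> groups 2 and 5 have .N > 2, so their time gets replaced
```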
data:
df <- read.csv(text="user,time,condition
11,1:05,FALSE
11,1:10,TRUE
11,1:10,FALSE
11,1:15,TRUE
11,1:20,TRUE
11,1:25,TRUE
11,1:40,FALSE
22,2:20,FALSE
22,2:30,FALSE
22,2:35,TRUE
22,2:40,TRUE", header=TRUE)
I have 2 successive zoo time series (the dates of one begin after the other finishes). They have the following form (but much longer, and not only NA values):
a:
1979-01-01 1979-01-02 1979-01-03 1979-01-04 1979-01-05 1979-01-06 1979-01-07 1979-01-08 1979-01-09
NA NA NA NA NA NA NA NA NA
b:
1988-08-15 1988-08-16 1988-08-17 1988-08-18 1988-08-19 1988-08-20 1988-08-21 1988-08-22 1988-08-23 1988-08-24 1988-08-25
NA NA NA NA NA NA NA NA NA NA NA
All I want to do is combine them into one time series as a zoo object. It seems like a basic task, but I am doing something wrong. I use the function "merge":
combined <- merge(a, b)
but the result is something in the form:
a b
1980-03-10 NA NA
1980-03-11 NA NA
1980-03-12 NA NA
1980-03-13 NA NA
1980-03-14 NA NA
1980-03-15 NA NA
1980-03-16 NA NA
.
.
which is not a single time series, and the lengths don't fit:
> length(a)
[1] 10957
> length(b)
[1] 2557
> length(combined)
[1] 27028
How can I just combine them into one time series with the form of the original ones?
Assuming the series shown reproducibly in the Note at the end, the result of merging the two series has 20 times and 2 columns (one for each series). The individual series have 9 and 11 elements, and since there are no intersecting times, the merged series is a zoo object with 9 + 11 = 20 rows and 2 columns (one for each input), hence length 40 (= 20 * 2). Note that the length of a multivariate series is the number of elements in it, not the number of time points.
length(z1)
## [1] 9
length(z2)
## [1] 11
m <- merge(z1, z2)
class(m)
## [1] "zoo"
dim(m)
## [1] 20 2
nrow(m)
## [1] 20
length(index(m))
## [1] 20
length(m)
## [1] 40
If what you want is to string them out one after another, then use c:
length(c(z1, z2))
## [1] 20
The above are consistent with how merge, c and length work in base R.
Note:
library(zoo)
z1 <- zoo(rep(NA, 9), as.Date(c("1979-01-01", "1979-01-02", "1979-01-03",
"1979-01-04", "1979-01-05", "1979-01-06", "1979-01-07", "1979-01-08",
"1979-01-09")))
z2 <- zoo(rep(NA, 11), as.Date(c("1988-08-15", "1988-08-16", "1988-08-17",
"1988-08-18", "1988-08-19", "1988-08-20", "1988-08-21", "1988-08-22",
"1988-08-23", "1988-08-24", "1988-08-25")))
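As a quick check of the c() route, rebuilding the two series compactly (a sketch equivalent to the Note's z1 and z2):

```r
library(zoo)

# same series as in the Note, built with date arithmetic for brevity
z1 <- zoo(rep(NA, 9), as.Date("1979-01-01") + 0:8)
z2 <- zoo(rep(NA, 11), as.Date("1988-08-15") + 0:10)

z <- c(z1, z2)
class(z)          # still a zoo object
length(z)         # 20: one element per time point
length(index(z))  # 20: a single univariate series
```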
Here is my first question to the Stack Overflow community. First of all, thanks a lot for all the answers I have managed to find here over the past 5 years. You have all been very helpful, but this time I failed to find my answer.
So, here is my situation. In a bigger data frame, one variable is causing me trouble: Weather. It is composed of factors describing the weather, such as "Rainy", "Cloudy", "Sunny", etc. My problem is that some entries are defined by more than one factor (e.g. "rainy,foggy"). Thus, R considers these combinations of factors as new independent factors, which I don't want.
Here is an example of the data frame:
df <- read.table(text =
'"Date.Time","Year","Month","Day","Weekday","Hour","Temperature","Rel.humidity","Wind.dir","Wind.dir2","Wind.speed","Atm.pressure","Weather"
2015-04-01 00:00:00,"2015","4","1","Wednesday","00:00",-3.4,44,30,"NW",10,100.83,"Clear"
2015-04-02 23:00:00,"2015","4","2","Thursday","23:00",3.4,94,36,"N",2,99.8,"Rain,Fog"
2015-05-11 12:00:00,"2015","5","11","Monday","12:00",9.5,93,3,"NE",27,101.5,"Mist,Shower,Fog"',
header = TRUE, stringsAsFactors = FALSE, sep = ",")
My ultimate goal is to be able to, for instance, select only the entries labeled Fog including those that have both Rain and Fog.
My idea for a solution is to apply a character split and insert the results as lists in the Weather variable, but I have not managed to do it yet, and maybe there is a simpler and fancier way.
Here is my naive try to do it:
for (i in dim(df)[1]){
df[i,] <- as.factor(list(strsplit(dda[i,], ",")))
}
tl;dr: I want to convert a factor such as "A,B,C" into multiple factors "A", "B", "C" within the same element (same column, same row of the data frame).
Thanks in advance for your time and do not hesitate to comment the format of my question.
df <- read.table(text =
'"Date.Time","Year","Month","Day","Weekday","Hour","Temperature","Rel.humidity","Wind.dir","Wind.dir2","Wind.speed","Atm.pressure","Weather"
2015-04-01 00:00:00,"2015","4","1","Wednesday","00:00",-3.4,44,30,"NW",10,100.83,"Clear"
2015-04-02 23:00:00,"2015","4","2","Thursday","23:00",3.4,94,36,"N",2,99.8,"Rain,Fog"
2015-05-11 12:00:00,"2015","5","11","Monday","12:00",9.5,93,3,"NE",27,101.5,"Mist,Shower,Fog"',
header = TRUE, stringsAsFactors = FALSE, sep = ",")
Fixing your for loop:
df[["Weather_split"]] <- as.list(rep(NA, nrow(df)))
for (i in seq_len(nrow(df))) {
df[["Weather_split"]][[i]] <- strsplit(df[["Weather"]][[i]], ",")[[1]]
}
Same thing, simpler:
df[["Weather_split"]] <- strsplit(df[["Weather"]], ",")
str(df$Weather)
# chr [1:3] "Clear" "Rain,Fog" "Mist,Shower,Fog"
str(df$Weather_split)
# List of 3
# $ : chr "Clear"
# $ : chr [1:2] "Rain" "Fog"
# $ : chr [1:3] "Mist" "Shower" "Fog"
One step further with #Stephen Henderson's idea:
Weather_levels <- unique(unlist(df[["Weather_split"]]))
for (lvl in Weather_levels) {
df[[lvl]] <- unlist(lapply(df$Weather_split, "%in%", x = lvl))
}
df
# Date.Time Year Month Day Weekday Hour Temperature Rel.humidity Wind.dir Wind.dir2 Wind.speed Atm.pressure Weather Weather_split Clear Rain Fog Mist Shower
# 1 2015-04-01 00:00:00 2015 4 1 Wednesday 00:00 -3.4 44 30 NW 10 100.83 Clear Clear TRUE FALSE FALSE FALSE FALSE
# 2 2015-04-02 23:00:00 2015 4 2 Thursday 23:00 3.4 94 36 N 2 99.80 Rain,Fog Rain, Fog FALSE TRUE TRUE FALSE FALSE
# 3 2015-05-11 12:00:00 2015 5 11 Monday 12:00 9.5 93 3 NE 27 101.50 Mist,Shower,Fog Mist, Shower, Fog FALSE FALSE TRUE TRUE TRUE
Edit:
If, as per your question, you really need factor rather than character vectors, it is entirely feasible:
df$Weather_split <- lapply(df$Weather_split, factor, levels = Weather_levels)
df$Weather_split
# [[1]]
# [1] Clear
# Levels: Clear Rain Fog Mist Shower
#
# [[2]]
# [1] Rain Fog
# Levels: Clear Rain Fog Mist Shower
#
# [[3]]
# [1] Mist Shower Fog
# Levels: Clear Rain Fog Mist Shower
str(df$Weather_split)
# List of 3
# $ : Factor w/ 5 levels "Clear","Rain",..: 1
# $ : Factor w/ 5 levels "Clear","Rain",..: 2 3
# $ : Factor w/ 5 levels "Clear","Rain",..: 4 5 3
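With the split list in place, the original goal (select every entry that includes Fog, alone or in combination) becomes a one-liner. A self-contained sketch with a trimmed-down df:

```r
df <- data.frame(Date.Time = c("2015-04-01 00:00:00",
                               "2015-04-02 23:00:00",
                               "2015-05-11 12:00:00"),
                 Weather = c("Clear", "Rain,Fog", "Mist,Shower,Fog"),
                 stringsAsFactors = FALSE)
df$Weather_split <- strsplit(df$Weather, ",")

# TRUE for rows whose Weather includes "Fog", whatever else it contains
has_fog <- vapply(df$Weather_split, function(w) "Fog" %in% w, logical(1))
df[has_fog, c("Date.Time", "Weather")]
# rows 2 and 3: "Rain,Fog" and "Mist,Shower,Fog"
```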
I have a 168-row, 3-column data frame in R. The columns are weekday, hour, and value: the weekday column gives the day of the week, the hour column the hour on that day, and the value column the value I am concerned with.
I am hoping to transform this data such that it exists in a 24x7 matrix, with a row (or column) corresponding to a particular day, and a column (or row) corresponding to a particular hour.
What is the most efficient way to do this in R? I've been able to throw together some messy strings of commands to get something close, but I have a feeling there is a very efficient solution.
Example starting data:
> print(data)
weekday hour value
1 M 1 1.11569683
2 M 2 -0.44550495
3 M 3 -0.82566259
4 M 4 -0.81427790
5 M 5 0.08277568
6 M 6 1.36057839
...
156 SU 12 0.12842608
157 SU 13 0.44697186
158 SU 14 0.86549961
159 SU 15 -0.22333317
160 SU 16 1.75955163
161 SU 17 -0.28904472
162 SU 18 -0.78826607
163 SU 19 -0.78520233
164 SU 20 -0.19301032
165 SU 21 0.65281161
166 SU 22 0.37993619
167 SU 23 -1.58806896
168 SU 24 -0.26725907
I'd hope to get something of the type:
M .... SU
1 1.11569683
2 -0.44550495
3 -0.82566259
4 -0.81427790
5
6
.
.
.
19
20
21 0.65281161
22 0.37993619
23 -1.58806896
24 -0.26725907
You can get some actual sample data this way:
weekday <- rep(c("M","T","W","TH","F","SA","SU"),each=24)
hour <- rep(1:24,7)
value <- rnorm(24*7)
data <- data.frame(weekday=weekday, hour=hour, value=value)
Thanks!
This is pretty trivial with the reshape2 package:
# Sample data - please include some with your next question!
x <- data.frame(day = c(rep("Sunday", 24),
rep("Monday", 24),
rep("Tuesday", 24),
rep("Wednesday", 24),
rep("Thursday", 24),
rep("Friday", 24),
rep("Saturday", 24)),
hour = rep(1:24, 7),
value = rnorm(n = 24 * 7)
)
library(reshape2)
# For rows representing hours
acast(x, hour ~ day)
# For rows representing days
acast(x, day ~ hour)
# If you want to preserve the ordering of the days, just make x$day a factor
# unique(x$day) conveniently gives the right order here, but you'd always want
# check that (and make sure the factor reflects the original value - that's why
# I'm making a new variable instead of overwriting the old one)
x$day.f <- factor(x$day, levels = unique(x$day))
acast(x, hour ~ day.f)
acast(x, day.f ~ hour)
The three-column dataset you have is an example of what's called "molten data" - each row represents a single result (x$value) with one or more identifiers (here, x$day and x$hour). The little formula inside of acast lets you express how you'd like your new dataset to be configured - variable names to the left of the tilde are used to define rows, and variable names to the right to define columns. In this case, there's only one column left - x$value - so it's automatically used to fill in the result matrix.
It took me a while to wrap my brain around all of that, but it's an incredibly powerful way to think about reshaping data.
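A minimal run showing the two cast orientations on a toy molten frame (a sketch; value.var defaults to the one remaining column, "value"):

```r
library(reshape2)

x <- data.frame(day = rep(c("Mon", "Tue"), each = 2),
                hour = rep(1:2, 2),
                value = c(10, 20, 30, 40))

acast(x, hour ~ day)  # 2 x 2 matrix: rows are hours, columns are days
acast(x, day ~ hour)  # 2 x 2 matrix: rows are days, columns are hours
```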
Something like this (assuming dfrm is the data object):
M <- matrix( NA, nrow=24, ncol=2,
dimnames = list(Hours = 1:24, Days=unique(dfrm$weekday) ) )
M[ cbind(dfrm$hour, dfrm$weekday) ] <- dfrm$value
> M
Days
Hours M SU
1 1.11569683 NA
2 -0.44550495 NA
3 -0.82566259 NA
4 -0.81427790 NA
5 0.08277568 NA
6 1.36057839 NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
11 NA NA
12 NA 0.1284261
13 NA 0.4469719
14 NA 0.8654996
15 NA -0.2233332
16 NA 1.7595516
17 NA -0.2890447
18 NA -0.7882661
19 NA -0.7852023
20 NA -0.1930103
21 NA 0.6528116
22 NA 0.3799362
23 NA -1.5880690
24 NA -0.2672591
Or you could just "fold the values" if they are "dense":
M <- matrix(dfrm$value, 24, 7)
And then rename your dimensions accordingly. Tested code provided when actual test cases provided.
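The "fold and rename" route in full, using the question's own data generator (a sketch; it relies on the rows being ordered hour-within-day exactly as generated):

```r
weekday <- rep(c("M", "T", "W", "TH", "F", "SA", "SU"), each = 24)
hour <- rep(1:24, 7)
value <- rnorm(24 * 7)
dfrm <- data.frame(weekday = weekday, hour = hour, value = value)

# column-major fill: the first 24 values become column "M", and so on
M <- matrix(dfrm$value, nrow = 24, ncol = 7,
            dimnames = list(Hours = 1:24, Days = unique(dfrm$weekday)))
M[5, "SU"] == dfrm$value[dfrm$weekday == "SU" & dfrm$hour == 5]  # TRUE
```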
This is pretty straightforward with xtabs in base R:
output <- as.data.frame.matrix(xtabs(value ~ hour + weekday, data))
head(output)
# SU M T W TH F SA
# 1 -0.56902302 -0.4434357 -1.02356300 -0.38459296 0.7098993 -0.54780300 1.5232637
# 2 0.01023058 -0.2559043 -2.79688932 -1.65322029 -1.5150986 0.05566206 -0.6706817
# 3 0.18461405 1.2783761 -0.02509352 -1.36763623 -0.4978633 0.20300678 1.4211054
# 4 0.54194889 0.5681317 0.69391876 -1.35805959 0.4208977 1.65256590 0.3622756
# 5 -1.68048536 -1.9274994 0.24036908 -0.21959772 0.7654983 1.62773579 0.6760743
# 6 -1.39398673 1.7251476 0.36563174 0.04554249 -0.2991433 -1.47331314 -0.7647513
To get the days in the right order (as above), use factor on your "weekday" variable before the xtabs step:
data$weekday <- factor(data$weekday,
levels = c("SU", "M", "T", "W", "TH", "F", "SA"))
I have a data.frame with 2 columns: dates (the dates of observation for each station) and values (the observation data).
> head(dataset.a)
dates values
1 1976-01-01 7.5
2 1976-01-02 NA
3 1976-01-03 NA
4 1976-01-04 NA
5 1976-01-05 NA
6 1976-01-06 10.2
(...)
I have to multiply each row by a value that I have already from another data.frame:
> head(dataset.b)
dates values
1 1976-01-01 0.23
2 1976-01-02 NA
3 1976-01-03 NA
4 1976-01-04 NA
5 1976-01-05 NA
6 1976-01-06 1.23
(...)
Both datasets use the Gregorian calendar; however, dataset.a includes leap years (which add a 29th day to February) while dataset.b always has 28 days in February. I want to ignore all February 29ths in dataset.a and then do the multiplication.
I should be able to make a basic subset using both indices:
which(strftime(dataset.a[,1],"%d")!="29")
which(strftime(dataset.a[,1],"%m")!="02")
However, once I combine them with a logical AND inside which(), I lose the positions in the data.frame where I have YEAR-02-29; it just returns the row numbers where the combination of both indices is TRUE.
I guess this is a very basic question, but I am lost.
Try a logical index:
idx <- !(strftime(ws.hb1.dataset[d,1],"%d") == "29" & strftime(ws.hb1.dataset[d,1],"%m") == "02")
Note: I'm assuming ws.hb1.dataset[d,1] is basically dataset.a[,1] here?
Then you'll get a vector of TRUE TRUE ... TRUE FALSE TRUE TRUE ..., with the FALSE coinciding with 29/Feb. (Note the negated AND: combining the two != tests directly with & would also drop every other day of February and every 29th of other months.)
Then you can just do dataset.a[idx,] to get the non 29/Feb dates.
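Putting it together with tiny made-up frames (a sketch; it assumes dataset.b holds exactly the non-leap dates of dataset.a, in order):

```r
dataset.a <- data.frame(dates = as.Date(c("1976-02-28", "1976-02-29", "1976-03-01")),
                        values = c(7.5, 3.0, 10.2))
dataset.b <- data.frame(dates = as.Date(c("1976-02-28", "1976-03-01")),
                        values = c(0.23, 1.23))  # no Feb 29 row

# keep rows that are NOT a February 29th: negate the AND of both tests
idx <- !(strftime(dataset.a$dates, "%d") == "29" &
           strftime(dataset.a$dates, "%m") == "02")
dataset.a$values[idx] * dataset.b$values
# 1.725 12.546
```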