I am an R newbie and am finding the conversion from MATLAB rather tricky, so apologies in advance for what could be a very simple question.
I am analyzing some time series data, and the example below demonstrates the problem I am having in R:
Dat1 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:00","2012-05-03 02:00",
"2012-05-03 02:30","2012-05-03 05:00",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat2 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 01:00","2012-05-03 01:30",
"2012-05-03 02:30","2012-05-03 06:00",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat3 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:15","2012-05-03 02:20",
"2012-05-03 02:40","2012-05-03 06:25",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat4 <- data.frame(dateTime = as.POSIXct(c("2010-05-03 00:15","2010-05-03 02:20",
"2010-05-03 02:40","2010-05-03 06:25",
"2010-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
So, here I have four data frames where all of the data are measured at similar times. I am now trying to ensure that all of the data frames have an identical time step, i.e. that they only contain rows measured at the same times. I can do this for two data frames:
idx1 <- (Dat1[,1] %in% Dat2[,1])
which will tell me the index of the consistent times in these two data frames. I can then re-define the data frame by
newDat1 <- Dat1[idx1,]
to get the data desired.
My question now is: how do I apply this to all of the data frames, i.e. more than two? I have tried:
idx1 <- (Dat1[,1] %in% (Dat2[,1] %in% (Dat3[,1] %in% Dat4[,1])))
but I can see that this is completely wrong. Any suggestions? Please keep in mind that I have many data frames (more than five), each containing different variables.
EDIT:
I may have found one way this can be done:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
which will give the index, and can be used to define a new variable:
Dat1 <- Dat1[idx1,]
Dat2 <- Dat2[idx1,]
Dat3 <- Dat3[idx1,]
Dat4 <- Dat4[idx1,]
Although this works for this example, I was hoping to find a way of making it work for any number of data frames without having to repeat the line once per data frame.
To identify timestamps that are common to all data frames, create a function that returns the intersection of multiple vectors:
intersectMulti <- function(x=list()){
for(i in 2:length(x)){
if(i==2) foo <- x[[i-1]]
foo <- intersect(foo,x[[i]]) #find intersection between ith and previous
}
return(x[[1]][match(foo, x[[1]])]) #get original to retain format
}
Note that there are no common timestamps among the four dataframes in the question
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1],Dat4[,1]))
character(0)
But there is one common timestamp in the first three dataframes
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
[1] "2012-05-03 07:00:00 UTC"
Use the result from the function to subset rows of each dataframe with common timestamp:
m <- intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
Dat1 <- Dat1[Dat1$dateTime %in% m,]
Dat2 <- Dat2[Dat2$dateTime %in% m,]
Dat3 <- Dat3[Dat3$dateTime %in% m,]
Dat4 <- Dat4[Dat4$dateTime %in% m,]
> Dat1
dateTime x1
5 2012-05-03 07:00:00 -0.1607363
> Dat2
dateTime x1
5 2012-05-03 07:00:00 -0.2662494
> Dat3
dateTime x1
5 2012-05-03 07:00:00 -0.1917905
If this works for you:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
then try this; it works on a list of vectors and is more elegant:
idx1 <- Dat1[,1] %in% Reduce(intersect, list(Dat1[,1], Dat2[,1], Dat3[,1], Dat4[,1]))
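To extend this to any number of data frames without writing one line per data frame, one option is to keep them in a list. This is only a minimal sketch with made-up example data; the helper mk and the names dfs, common and aligned are illustrative and not from the question. Timestamps are compared as seconds since the epoch so that the result does not depend on how intersect() treats the POSIXct class in your R version.

```r
# Hypothetical example data: three data frames that share some timestamps
mk <- function(times) {
  data.frame(dateTime = as.POSIXct(times, tz = "UTC"),
             x1 = seq_along(times))
}
dfs <- list(
  mk(c("2012-05-03 00:00", "2012-05-03 02:30", "2012-05-03 07:00")),
  mk(c("2012-05-03 01:00", "2012-05-03 02:30", "2012-05-03 07:00")),
  mk(c("2012-05-03 02:30", "2012-05-03 06:25", "2012-05-03 07:00"))
)

# Timestamps present in every data frame, as seconds since the epoch
common <- Reduce(intersect, lapply(dfs, function(d) as.numeric(d$dateTime)))

# Keep, in every data frame, only the rows whose timestamp is common to all
aligned <- lapply(dfs, function(d) d[as.numeric(d$dateTime) %in% common, ])
```

Each data frame is filtered with its own logical index, so the row order and row count of one data frame never get applied to another.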
I have the following dataset:
A B
2007-11-22 2004-11-18
<NA> 2004-11-10
when the value of column A is NA, I want this value to be replaced by the date in B, except with an additional 25 days added.
Here is what the outcome should look like:
A B
2007-11-22 2004-11-18
2004-12-05 2004-11-10
So far, I have tried the following if else formula, but with no success.
library(lubridate)
data$A<- ifelse(is.na(data$A),data$B+days(25),data$A)
Could anyone tell me what's wrong with it or give me an alternate solution? The code to build my dataset is below.
A<-c("2007-11-22 01:00:00", NA)
B<-c("2004-11-18","2004-11-10")
data<-data.frame(A,B)
data$A<-as.Date(data$A);data$B<-as.Date(data$B)
The cause can be traced to the source code of ifelse. If you type View(ifelse), you will see these lines near the bottom of the source:
ans <- test
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
where test is a logical vector and ans is initialized as a copy of test. When ans[ypos] <- rep(yes, length.out = len)[ypos] runs, the Date values are assigned into that logical vector, so ans is coerced to plain numeric and the Date class is lost. That is why column A contains numbers after using ifelse.
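The coercion is easy to see on a small vector. A minimal sketch, where the names d, fallback, res and fixed are made up for illustration:

```r
d <- as.Date(c("2007-11-22", NA))
fallback <- as.Date("2004-11-10") + 25

# ifelse() builds its result from the logical test vector, so the
# Date class of 'yes' and 'no' does not survive
res <- ifelse(is.na(d), fallback, d)
class(res)  # "numeric": days since 1970-01-01

# Converting back restores the Date class
fixed <- as.Date(res, origin = "1970-01-01")
```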
You can try the code below
data$A <- as.Date(ifelse(is.na(data$A), data$B + days(25), data$A), origin = "1970-01-01")
which gives
> data
A B
1 2007-11-22 2004-11-18
2 2004-12-05 2004-11-10
Assuming the data given reproducibly in the Note at the end -- in particular we assume both columns are of Date class -- compute a logical vector is_na which indicates which entries are NA and then set those from B.
is_na <- is.na(data$A)
data$A[is_na] <- data$B[is_na] + 25
This would also work and has the advantage that it does not overwrite data:
transform(data, A = replace(A, is.na(A), B[is.na(A)] + 25))
Note
Lines <- "
A B
2007-11-22 2004-11-18
NA 2004-11-10"
data <- read.table(text = Lines, header = TRUE)
data[] <- lapply(data, as.Date) # convert to Date class
Instead of ifelse you could use coalesce
library(tidyverse)
library(lubridate)
A <- c("2007-11-22 01:00:00", NA)
B <- c("2004-11-18","2004-11-10")
data <-data.frame(A,B)
data <- data %>%
mutate(A = as_date(A),
B = as_date(B),
A = coalesce(A,B+days(25)))
I have a huge dataset which I have to subset by a range of time and write the subsets into new dataframes. My problem is to subset the dataset between 12PM and 12PM the next day.
Here is a small dummy example that subsets by calendar day:
dfrm <- data.frame(a=rnorm(240),dtm=as.POSIXct("2007-03-27 05:00", tz="GMT")+3600*(1:240))
dfrm
## Create list of dates in dfrm
date.start<-format(min(dfrm$dtm),"%Y-%m-%d")
date.end<-format(max(dfrm$dtm),"%Y-%m-%d")
datum<-seq(as.Date(date.start),as.Date(date.end),by="days")
## Get Date and Time from dfrm
dfrm$day<-as.POSIXlt(as.character(dfrm$dtm),format="%Y-%m-%d")
dfrm$clock<-as.POSIXlt(as.character(dfrm$dtm))
dfrm$clock<-format(dfrm$clock,format="%H:%M:%S")
## write dfrm daywise
j<-1
while (j<=length(datum))
{
name <- paste("day", datum[j], sep = "")
assign(name,dfrm[which(dfrm$day==format(datum[j],"%Y-%m-%d")),])
j<-j+1
}
Thank you for your help.
You can do
dfrm <- data.frame(a = rnorm(240), dtm = as.POSIXct("2007-03-27 05:00") + 3600 * (1:240))
# breaks at noon, from the day before the first observation to the day after the last
breaks <- seq(as.POSIXct(paste0(as.Date(min(dfrm$dtm)) - 1, " 12:00:00")),
              as.POSIXct(paste0(as.Date(max(dfrm$dtm)) + 1, " 12:00:00")),
              by = "1 day")
lst <- split(dfrm, cut(dfrm$dtm, breaks = breaks))
Now, take e.g. a subset of a few days:
lst2 <- lst[as.character(seq(as.POSIXct("2007-04-04 12:00:00"), as.POSIXct("2007-04-06 12:00:00"), "1 day"))]
And create a separate data frame for each day:
list2env(lst2, envir = .GlobalEnv)
head(`2007-04-04 12:00:00`)
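An alternative that avoids building the breaks by hand is to shift every timestamp back by 12 hours, so that each noon-to-noon window falls on a single calendar date, and then split on that date. This is a minimal sketch under the assumption that the timestamps are in GMT, as in the question; window_start and by_window are illustrative names.

```r
dfrm <- data.frame(a = rnorm(240),
                   dtm = as.POSIXct("2007-03-27 05:00", tz = "GMT") + 3600 * (1:240))

# A timestamp between noon of day D and noon of day D + 1 lands on
# calendar date D once it is shifted back by 12 hours
window_start <- as.Date(dfrm$dtm - 12 * 3600, tz = "GMT")

# One data frame per noon-to-noon window, named by the date the window starts
by_window <- split(dfrm, window_start)
```

Keeping the pieces in a list such as by_window is usually easier to work with than creating one global variable per day.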
I have a function called getWeatherForMonth that takes a start date and an end date and returns a data frame of the results for that month. I have another function, getWeatherForRange, that takes a data frame of date ranges. I need to call getWeatherForMonth for each row in "dates" and combine the results into one data frame. I was using mapply as shown below, but it's not combining the resulting data frames.
library(RJSONIO)
getWeatherForMonth <- function(start.date, end.date) {
url <- "http://api.worldweatheronline.com/premium/v1/past-weather.ashx?key=PUT-YOUR-KEY-HERE&q=London&format=json&date=%s&enddate=%e&tp=24"
url <- gsub("%s", start.date, url)
url <- gsub("%e", end.date, url)
data <- fromJSON(url)
weather <- data$data$weather
GMT <- sapply(weather, function(x){as.character(x[1])})
Max.TemperatureC <- sapply(weather, function(x){as.numeric(x[3])})
Min.TemperatureC <- sapply(weather, function(x){as.numeric(x[4])})
Wind.SpeedKm.h <- sapply(weather, function(x){as.numeric(x$hourly[[1]]$windspeedKmph[1])})
Precipitationmm <- sapply(weather, function(x){as.numeric(x$hourly[[1]]$precipMM[1])})
DewPointC <-sapply(weather, function(x){as.numeric(x$hourly[[1]]$DewPointC[1])})
Wind.Chill <-sapply(weather, function(x){as.numeric(x$hourly[[1]]$WindChillC[1])})
Cloud.Cover <-sapply(weather, function(x){as.numeric(x$hourly[[1]]$cloudcover[1])})
Description <-sapply(weather, function(x){as.character(x$hourly[[1]]$weatherDesc[1])})
Humidity <- sapply(weather, function(x){as.numeric(x$hourly[[1]]$humidity[1])})
Feels.LikeC <- sapply(weather, function(x){as.numeric(x$hourly[[1]]$FeelsLikeC[1])})
df <- data.frame(GMT, Max.TemperatureC, Min.TemperatureC, Wind.SpeedKm.h, Precipitationmm, DewPointC, Wind.Chill, Cloud.Cover, Description, Humidity, Feels.LikeC)
return(df)
}
getWeatherForRange <- function(dates) {
df <- mapply(getWeatherForMonth, dates$start.date, dates$end.date)
return(df)
}
start.date <- seq(as.Date("2015-01-01"), length=12, by="1 month")
end.date <- seq(as.Date("2015-02-01"),length=12,by="months") - 1
dates.2015 <- data.frame(start.date, end.date)
data <- getWeatherForRange(dates.2015)
View(data)
The output looks like this
[Screenshot of the current output]
Consider using Map(). Specifically, in your getWeatherForRange function, use Map(), which is a wrapper for the non-simplifying version of mapply(), equivalent to mapply(..., SIMPLIFY = FALSE). By default, mapply() tries to simplify its result to a vector, matrix, or higher-dimensional array, but you need each data frame (i.e., a list object) returned intact.
This updated function returns a list of data frames that you can then pass to do.call(rbind, ...), assuming the columns are consistent across the data frames, to stack them into one final data frame.
getWeatherForRange <- function(dates) {
# EQUIVALENT LINES
dfList <- Map(getWeatherForMonth, dates$start.date, dates$end.date)
# dfList <- mapply(getWeatherForMonth, dates$start.date, dates$end.date, SIMPLIFY = FALSE)
return(dfList)
}
start.date <- seq(as.Date("2015-01-01"), length=12, by="1 month")
end.date <- seq(as.Date("2015-02-01"), length=12, by="months") - 1
dates <- data.frame(start.date, end.date)
datalist <- getWeatherForRange(dates) # DATAFRAME LIST
data <- do.call(rbind, datalist) # FINAL DATA FRAME
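The pattern can be tried without an API key by swapping in a stand-in function. In this sketch, fakeWeatherForMonth is a hypothetical replacement for getWeatherForMonth that makes no network call and returns one row per day with placeholder values:

```r
# Hypothetical stand-in for getWeatherForMonth(): same arguments, no network call
fakeWeatherForMonth <- function(start.date, end.date) {
  days <- seq(start.date, end.date, by = "day")
  data.frame(GMT = format(days), Max.TemperatureC = seq_along(days))
}

start.date <- seq(as.Date("2015-01-01"), length.out = 3, by = "1 month")
end.date <- seq(as.Date("2015-02-01"), length.out = 3, by = "1 month") - 1
dates <- data.frame(start.date, end.date)

# Map() returns a plain list with one data frame per row of 'dates'
dfList <- Map(fakeWeatherForMonth, dates$start.date, dates$end.date)

# Stack the monthly data frames into a single data frame
weather <- do.call(rbind, dfList)
```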
I'm currently struggling with a beginner's issue regarding the calculation of a time difference between two events.
I want to take a column consisting of date and time (both values in one column) into consideration and calculate a time difference with the value of the previous/next row with the same ID (A or B in this example).
ID = c("A", "A", "B", "B")
time = c("08.09.2014 10:34","12.09.2014 09:33","13.08.2014 15:52","11.09.2014 02:30")
d = data.frame(ID,time)
My desired output is in the format Hours:Minutes
time_difference = c("94:59","94:59","682:38","682:38")
The format Days:Hours:Minutes or anything similar would also work, as long as it could be conveniently implemented. I am flexible regarding the format of the output, the above is just an idea that crossed my mind.
For each single ID, I always have two rows (in the example 2xA and 2xB). I don't have a convincing idea of how to avoid the repetition of the difference.
I've tried some examples before, which I found on stackoverflow. Most of them used POSIXt and strptime. However, I didn't manage to apply those ideas to my data set.
Here's my attempt using dplyr
library(dplyr)
d %>%
mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
group_by(ID) %>%
mutate(diff = paste0(gsub("[.].*", "", diff(time)*24), ":",
round(as.numeric(gsub(".*[.]", ".", diff(time)*24))*60)))
# Source: local data frame [4 x 3]
# Groups: ID
#
# ID time diff
# 1 A 2014-09-08 10:34:00 94:59
# 2 A 2014-09-12 09:33:00 94:59
# 3 B 2014-08-13 15:52:00 682:38
# 4 B 2014-09-11 02:30:00 682:38
A very (to me) hack-ish base solution:
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52","11.09.2014 02:30")
d <- data.frame(ID, time)
d$time <- as.POSIXct(d$time, format="%d.%m.%Y %H:%M")
unlist(unname(lapply(split(d, d$ID), function(d) {
sapply(abs(diff(c(d$time[2], d$time))), function(x) {
sprintf("%s:%s", round(((x*24)%/%1)), round(((x*24)%%1 *60)))
})
})))
## [1] "94:59" "94:59" "682:38" "682:38"
I have to believe this function exists somewhere already, tho.
Similar to the attempts of David and hrmbrmstr, I found that this solution using difftime works.
I use a rowShift function I found on Stack Overflow:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M")
d$time.prev <- rowShift(d$time.c,-1)
d$diff <- difftime(d$time.c,d$time.prev, units="hours")
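In the result, d$diff alternates between rows: the second row of each ID holds the positive difference I want, while the rows in between compare across IDs (negative here). I remove the rows with negative values, which leaves the difference between the first and the last time for every ID.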
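If the Hours:Minutes text from the question is wanted on every row, the grouping and formatting can also be done in base R without the row shift. A minimal sketch, assuming the timestamps can be treated as UTC (so daylight saving does not affect the difference); mins is an illustrative name:

```r
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52", "11.09.2014 02:30")
d <- data.frame(ID, time, stringsAsFactors = FALSE)
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Minutes between the earliest and latest timestamp of each ID,
# repeated on every row of that ID
mins <- ave(as.numeric(d$time.c), d$ID, FUN = function(s) (max(s) - min(s)) / 60)

# Format as total hours and remaining minutes, e.g. "94:59"
d$diff <- sprintf("%d:%02d", as.integer(mins %/% 60), as.integer(round(mins %% 60)))
```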
I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell of dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- as.numeric(range[1])
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
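As a middle ground that does not expand every range into individual positions, the strings can be parsed into chromosome, start, end and position columns, joined on chromosome, and then filtered by range. This is a rough sketch with made-up data in the same layout as the question, chosen so that one row matches; cand and merged are illustrative names. It still builds every same-chromosome pair, so it is not a true interval join.

```r
# Hypothetical data in the question's layout, with one position inside a range
dat <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                  Pos = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
                  stringsAsFactors = FALSE)
dat2 <- data.frame(VAR = c("chr15:35086900", "chr1:116242855"),
                   REF.ALT = c("T/A", "A/G"),
                   stringsAsFactors = FALSE)

# Parse "chr:start..end" into chromosome, start and end
pos <- do.call(rbind, strsplit(dat$Pos, ":|\\.\\."))
dat$chr <- pos[, 1]
dat$start <- as.numeric(pos[, 2])
dat$end <- as.numeric(pos[, 3])

# Parse "chr:pos" into chromosome and position
var <- do.call(rbind, strsplit(dat2$VAR, ":", fixed = TRUE))
dat2$chr <- var[, 1]
dat2$pos <- as.numeric(var[, 2])

# Pair rows on the same chromosome, then keep positions inside the range
cand <- merge(dat, dat2, by = "chr")
merged <- cand[cand$pos >= cand$start & cand$pos <= cand$end, ]
```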
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive, so a single-position range can match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# pair each row of dat with every row of dat2 whose position falls in its range
# (only positions are compared; chromosome names are not checked)
idx <- which(matches, arr.ind = TRUE)
merged <- cbind(dat[idx[, "row"], ], dat2[idx[, "col"], ])
merged