Replace NA value in column with modified date in other column [duplicate] - r

This question already has answers here:
How to prevent ifelse() from turning Date objects into numeric objects
(7 answers)
Closed 2 years ago.
I have the following dataset:
A B
2007-11-22 2004-11-18
<NA> 2004-11-10
When the value in column A is NA, I want it replaced by the date in B plus an additional 25 days.
Here is what the outcome should look like:
A B
2007-11-22 2004-11-18
2004-12-05 2004-11-10
So far, I have tried the following if else formula, but with no success.
library(lubridate)
data$A<- ifelse(is.na(data$A),data$B+days(25),data$A)
Could anyone tell me what's wrong with it or give me an alternate solution? The code to build my dataset is below.
A<-c("2007-11-22 01:00:00", NA)
B<-c("2004-11-18","2004-11-10")
data<-data.frame(A,B)
data$A<-as.Date(data$A);data$B<-as.Date(data$B)

The cause of the issue can be traced to the source code of ifelse. If you run View(ifelse), you will see the following lines near the bottom of the source:
ans <- test
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
Here test is a logical vector, and ans is initialized as a copy of test. When ans[ypos] <- rep(yes, length.out = len)[ypos] runs, the Date values are assigned into that logical vector, so ans is coerced to plain numeric and the Date class is dropped. That is why column A contains numbers (days since 1970-01-01) instead of dates after using ifelse.
You can try the code below
data$A <- as.Date(ifelse(is.na(data$A), data$B + days(25), data$A), origin = "1970-01-01")
which gives
> data
A B
1 2007-11-22 2004-11-18
2 2004-12-05 2004-11-10
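As a minimal sketch of the coercion described above, using base R only: here B + 25 stands in for B + days(25), and the vectors A and B are rebuilt just for illustration.

```r
# Rebuild the two columns as plain Date vectors (illustrative data from the question).
A <- as.Date(c("2007-11-22", NA))
B <- as.Date(c("2004-11-18", "2004-11-10"))

# ifelse() fills a copy of the logical test vector, so the Date class is lost.
raw <- ifelse(is.na(A), B + 25, A)
class(raw)    # "numeric": day counts since 1970-01-01

# Wrapping the result in as.Date() with an origin restores the class.
fixed <- as.Date(raw, origin = "1970-01-01")
class(fixed)  # "Date"
```

The numbers themselves are correct day counts, which is why converting them back with an explicit origin is enough.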

Assuming the data given reproducibly in the Note at the end -- in particular we assume both columns are of Date class -- compute a logical vector is_na which indicates which entries are NA and then set those from B.
is_na <- is.na(data$A)
data$A[is_na] <- data$B[is_na] + 25
This would also work and has the advantage that it does not overwrite data:
transform(data, A = replace(A, is.na(A), B[is.na(A)] + 25))
Note
Lines <- "
A B
2007-11-22 2004-11-18
NA 2004-11-10"
data <- read.table(text = Lines, header = TRUE)
data[] <- lapply(data, as.Date) # convert to Date class

Instead of ifelse you could use coalesce
library(tidyverse)
library(lubridate)
A <- c("2007-11-22 01:00:00", NA)
B <- c("2004-11-18","2004-11-10")
data <-data.frame(A,B)
data <- data %>%
mutate(A = as_date(A),
B = as_date(B),
A = coalesce(A,B+days(25)))

Related

Converting dates in column of a data frame R

I am having problems converting a column of imported dates in a data frame, represented as characters in a different date format, into date objects in that same data frame. Here is a toy example:
xx <- data.frame(A = c(10, 15, 20), B = c("10/15/2010", "9/8/2015", "8/5/2013"))
If I print xx,
A B
1 10 10/15/2010
2 15 9/8/2015
3 20 8/5/2013
I apply:
xx[, "B"] <- sapply(xx[, "B"], function(x) {as.Date(x,
format = "%m/%d/%Y", origin = "1970-01-01")})
and I get:
A B
1 10 14897
2 15 16686
3 20 15922
If I look at the mode of column B, it is numeric, not date. No matter what I try I cannot seem to get a result that converts column B to a date type. I can always add:
xx[, "B"] <- as.Date(xx[, "B"])
but there must be a way to do this in one statement.
If you have only one column to convert, you can do
xx$B <- as.Date(xx$B, "%m/%d/%Y")
If you have multiple columns use lapply instead of sapply
cols <- 2
xx[cols] <- lapply(xx[cols], as.Date, "%m/%d/%Y")
Or use lubridate, where you don't need to specify the format argument:
xx$B <- lubridate::mdy(xx$B)
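To see why sapply produced numbers while lapply does not, here is a small sketch on the toy data from the question. The name via_sapply is only illustrative. sapply simplifies its result to a plain vector, which drops the Date class; lapply returns a list whose elements keep it.

```r
xx <- data.frame(A = c(10, 15, 20),
                 B = c("10/15/2010", "9/8/2015", "8/5/2013"),
                 stringsAsFactors = FALSE)

# sapply() simplifies the per-element Dates into a bare numeric vector.
via_sapply <- sapply(xx$B, as.Date, format = "%m/%d/%Y")
class(via_sapply)  # "numeric"

# lapply() over the column keeps the Date class when assigned back.
xx["B"] <- lapply(xx["B"], as.Date, format = "%m/%d/%Y")
class(xx$B)        # "Date"
```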

Fill a data frame with increasing date objects in R

I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data-frame. However, the values inside df are non-dates. If I perform the date +7 calculation it works (e.g. once it goes to 2010-01-08), but if I assign it to the df it turns into weird numerical values.
Appreciate any help.
Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get in the habit of not giving your variables names that are already used by functions, such as data and date.
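The same column can also be built without a loop by adding multiples of 7 days to the start date. This is only a sketch: n is a hypothetical row count (the question has 398 rows), and the names start and dates are made up here to avoid masking date().

```r
n <- 5                              # hypothetical; use nrow(your_data) in practice
start <- as.Date("2010-01-01")

# Date arithmetic is vectorised: adding a numeric vector of day offsets
# returns a Date vector, so no per-row assignment is needed.
dates <- start + 7 * (seq_len(n) - 1)

df <- data.frame(ind = seq_len(n), date = dates)
class(df$date)  # "Date"
```

Because the column is created as a Date from the start, nothing is coerced to numeric the way it was when a Date was assigned into an existing numeric column row by row.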

find corresponding dateTime in several time series in R

I am an R newbie and am finding the conversion from matlab rather tricky, so apologies in advance for what could be a very simple question.
I am analyzing some time series data and the problem outlined below demonstrates the problem I am having in R:
Dat1 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:00","2012-05-03 02:00",
"2012-05-03 02:30","2012-05-03 05:00",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat2 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 01:00","2012-05-03 01:30",
"2012-05-03 02:30","2012-05-03 06:00",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat3 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:15","2012-05-03 02:20",
"2012-05-03 02:40","2012-05-03 06:25",
"2012-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
Dat4 <- data.frame(dateTime = as.POSIXct(c("2010-05-03 00:15","2010-05-03 02:20",
"2010-05-03 02:40","2010-05-03 06:25",
"2010-05-03 07:00"), tz = 'UTC'),x1 = rnorm(5))
So, here I have four data frames where all of the data are measured at similar times. I am now trying to ensure that all of the data frames have an identical time step, i.e. all measured at the same times. I can do this for two data frames:
idx1 <- (Dat1[,1] %in% Dat2[,1])
which will tell me the index of the consistent times in these two data frames. I can then re-define the data frame by
newDat1 <- Dat1[idx1,]
to get the data desired.
My question now is, how do I apply this to all of the data frames i.e. more than 2. I have tried:
idx1 <- (Dat1[,1] %in% (Dat2[,1] %in% (Dat3[,1] %in% Dat4[,1])))
but I can see that this is completely wrong. Any suggestions? Please keep in mind that I have many data frames (more than five), where each contain different variables.
EDIT:
I may have found one way this can be done:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
which will give the index, and can be used to define a new variable:
Dat1 <- Dat1[idx1,]
Dat2 <- Dat2[idx1,]
Dat3 <- Dat3[idx1,]
Dat4 <- Dat4[idx1,]
Although this works for this example, I was hoping to find a way of making it work for n data frames without having to repeat the code n times.
To identify timestamps that are common to all data frames, create a function to return the intersection of multiple vectors
intersectMulti <- function(x = list()) {
  common <- x[[1]]
  for (v in x[-1]) {
    common <- intersect(common, v)  # keep only values also present in v
  }
  x[[1]][match(common, x[[1]])]  # index the first vector to retain its class
}
Note that there are no common timestamps among the four dataframes in the question
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1],Dat4[,1]))
character(0)
But there is one common timestamp in the first three dataframes
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
[1] "2012-05-03 07:00:00 UTC"
Use the result from the function to subset rows of each dataframe with common timestamp:
m <- intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
Dat1 <- Dat1[Dat1$dateTime %in% m,]
Dat2 <- Dat2[Dat2$dateTime %in% m,]
Dat3 <- Dat3[Dat3$dateTime %in% m,]
Dat4 <- Dat4[Dat4$dateTime %in% m,]
> Dat1
dateTime x1
5 2012-05-03 07:00:00 -0.1607363
> Dat2
dateTime x1
5 2012-05-03 07:00:00 -0.2662494
> Dat3
dateTime x1
5 2012-05-03 07:00:00 -0.1917905
If this works for you:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
then try this; it works on a list of any number of vectors and is more elegant:
idx1 <- Dat1[,1] %in% Reduce(intersect, list(Dat1[,1], Dat2[,1], Dat3[,1], Dat4[,1]))
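For n data frames, the same idea can be applied to a list so nothing has to be repeated per frame. The sketch below is one possible arrangement, not the answerers' code: the list frames and its toy timestamps are made up for illustration, and the reducer a[a %in% b] is used in place of intersect so that the POSIXct class is kept regardless of R version.

```r
set.seed(1)
mk <- function(times) {
  data.frame(dateTime = as.POSIXct(times, tz = "UTC"), x1 = rnorm(length(times)))
}

# Hypothetical frames; 02:30 and 07:00 appear in all three.
frames <- list(
  Dat1 = mk(c("2012-05-03 00:00", "2012-05-03 02:30", "2012-05-03 07:00")),
  Dat2 = mk(c("2012-05-03 01:00", "2012-05-03 02:30", "2012-05-03 07:00")),
  Dat3 = mk(c("2012-05-03 02:30", "2012-05-03 06:25", "2012-05-03 07:00"))
)

# Fold over the timestamp vectors, keeping only values present in every one.
times  <- lapply(frames, function(d) d$dateTime)
common <- Reduce(function(a, b) a[a %in% b], times)

# Subset each frame by its own timestamps rather than by an index from Dat1.
aligned <- lapply(frames, function(d) d[d$dateTime %in% common, , drop = FALSE])
```

Each frame is filtered by testing its own dateTime column against common, so rows line up even when the matching timestamps sit at different positions in different frames.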

melt with chron

I'm trying to melt a data frame with chron class
library(chron)
x = data.frame(Index = as.chron(c(15657.00,15657.17)), Var1 = c(1,2), Var2 = c(9,8))
x
Index Var1 Var2
1 (11/13/12 00:00:00) 1 9
2 (11/13/12 04:04:48) 2 8
y = melt(x,id.vars="Index")
Error in data.frame(ids, variable, value, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 2, 4
I can work around it with as.numeric() as follows:
x$Index= as.numeric(x$Index)
y = melt(x,id.vars="Index")
y$Index = as.chron(y$Index)
y
Index variable value
1 (11/13/12 00:00:00) Var1 1
2 (11/13/12 04:04:48) Var1 2
3 (11/13/12 00:00:00) Var2 9
4 (11/13/12 04:04:48) Var2 8
But can this be done more simply? (I want to keep the chron class.)
(1) I assume you issued this command before running the code shown:
library(reshape2)
In that case you could use the reshape package instead. It doesn't result in this problem:
library(reshape)
Other solutions are to
(2) use R's reshape function:
reshape(direction = "long", data = x, varying = list(2:3), v.names = "Var")
(3) or convert the chron column to numeric, use melt from the reshape2 package and then convert back:
library(reshape2)
xt <- transform(x, Index = as.numeric(Index))
transform(melt(xt, id = 1), Index = chron(Index))
I'm not sure but I think this might be an "oversight" in chron (or possibly data.frame, but that seems unlikely).
The issue occurs when constructing the data frame in melt.data.frame in reshape2, which typically uses recycling, but that portion of data.frame:
for (j in seq_along(xi)) {
xi1 <- xi[[j]]
if (is.vector(xi1) || is.factor(xi1))
xi[[j]] <- rep(xi1, length.out = nr)
else if (is.character(xi1) && class(xi1) == "AsIs")
xi[[j]] <- structure(rep(xi1, length.out = nr), class = class(xi1))
else if (inherits(xi1, "Date") || inherits(xi1, "POSIXct"))
xi[[j]] <- rep(xi1, length.out = nr)
else {
fixed <- FALSE
break
}
seems to go wrong, as the chron variable doesn't inherit either Date or POSIXct. This removes the error but alters the date times:
x = data.frame(Index = as.chron(c(15657.00,15657.17)), Var1 = c(1,2), Var2 = c(9,8))
class(x$Index) <- c(class(x$Index),'POSIXct')
y = melt(x,id.vars="Index")
Like I said, this sorta smells like a bug somewhere. My money would be on the need for chron to add POSIXct to the class vector, but I could be wrong. The obvious alternative would be to use POSIXct date times instead.

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT","FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
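exploded.dat <- adply(dat, 1, function(x){
  # split "chr:start..end" into the chromosome and the two range bounds
  parts <- strsplit(x$Pos, ":")[[1]]
  chr <- parts[1]
  range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
  start <- as.numeric(range[1])
  end <- as.numeric(range[2])
  # one row per position in the range, keyed as "chr:position"
  data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})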
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLIT THE DATA
# take the part after ":" and split it on ".." into range start and end
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
# position of each variant in dat2 (base strsplit, so no extra package is needed)
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive, so a one-position range can still match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# each TRUE in matches pairs a row of dat (matrix row) with a row of dat2 (matrix column)
pairs <- which(matches, arr.ind = TRUE)
# bind the paired rows side by side; a variant inside several ranges appears once per range
cbind(dat[pairs[, "row"], ], dat2[pairs[, "col"], ])
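Neither example data set above contains a matching pair (dat is on chr15, dat2 on chr1), so here is a self-contained base R sketch with made-up data that does match. The data frames ranges and variants, the FUNC value "exonic" and the position 35086900 are hypothetical. Unlike the snippets above, this version also requires the chromosome to agree.

```r
ranges <- data.frame(
  Locus = c("ACTC1-001_1", "ACTC1-001_2"),
  Pos   = c("chr15:35087734..35087734", "chr15:35086890..35086919"),
  stringsAsFactors = FALSE
)
variants <- data.frame(
  VAR  = c("chr15:35086900", "chr1:116242719"),   # first one is a made-up hit
  FUNC = c("exonic", "intergenic"),
  stringsAsFactors = FALSE
)

# Parse "chr:start..end" into chromosome, start and end.
ranges$chr   <- sub(":.*$", "", ranges$Pos)
ranges$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", ranges$Pos))
ranges$end   <- as.numeric(sub("^.*\\.\\.", "", ranges$Pos))

# Parse "chr:position" into chromosome and position.
variants$chr <- sub(":.*$", "", variants$VAR)
variants$pos <- as.numeric(sub("^.*:", "", variants$VAR))

# Pair every range with every variant on the same chromosome,
# then keep the pairs whose position lies inside the inclusive range.
candidates <- merge(ranges, variants, by = "chr")
merged <- candidates[candidates$pos >= candidates$start &
                     candidates$pos <= candidates$end, ]
```

The intermediate candidates table grows with ranges times variants per chromosome, so this is only reasonable for small inputs, in line with the caveat in the first answer.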
