Fill missing Variables by Information from other date columns (R) - r

I have a Dataframe which looks similar to this:
set.seed(42)
start <- Sys.Date() + sort(sample(1:10, 5))
set.seed(43)
end <- Sys.Date() + sort(sample(1:10, 5))
end[4] <- NA
A <- c("10", "15", "NA", "4", "NA")
B <- rpois(n = 5, lambda = 10)
df <- data.frame(start, end, A, B)
Whenever there is an NA in column A, I would like to calculate the hours between start and end. Nothing should happen when either start or end is NA.
I tried something like this:
df[, df$A [is.na(df[, df$A])]] <- difftime(df$end, df$start, units = "hours")
but this gives me the error: undefined columns selected.
Does anyone have an idea? Thanks.

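Create a logical index of the NA entries in column 'A', subset 'start' and 'end' with it, compute the difftime, and assign the result back. Rows where either date is NA simply stay NA.
# the string "NA" becomes a real NA (with a coercion warning)
df$A <- as.numeric(as.character(df$A))
i1 <- is.na(df$A)
# difftime(time1, time2) returns time1 - time2, so 'end' goes first
df$A[i1] <- with(df, as.numeric(difftime(end[i1], start[i1], units = "hours")))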
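As a quick check of the index-based approach, here is a minimal sketch on fixed dates instead of the random ones from the question. The data frame `toy`, its dates, and the index name `fill` are illustrative assumptions, not part of the original post.

```r
# Toy data with fixed dates; row 2 has a missing end date
toy <- data.frame(
  start = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  end   = as.Date(c("2024-01-02", NA, "2024-01-05")),
  A     = c(10, NA, NA)
)

# Fill only where A is missing and both dates are present
fill <- is.na(toy$A) & !is.na(toy$start) & !is.na(toy$end)
toy$A[fill] <- as.numeric(
  difftime(toy$end[fill], toy$start[fill], units = "hours")
)

toy$A
```

Row 1 keeps its existing value, row 2 stays NA because its end date is missing, and row 3 is filled with the two-day gap expressed in hours.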

Related

Replace NA value in column with modified date in other column [duplicate]

I have the following dataset:
A B
2007-11-22 2004-11-18
<NA> 2004-11-10
When the value in column A is NA, I want it replaced by the date in B plus 25 days.
Here is what the outcome should look like:
A B
2007-11-22 2004-11-18
2004-12-05 2004-11-10
So far, I have tried the following if else formula, but with no success.
library(lubridate)
data$A<- ifelse(is.na(data$A),data$B+days(25),data$A)
Could anyone tell me what's wrong with it or give me an alternate solution? The code to build my dataset is below.
A<-c("2007-11-22 01:00:00", NA)
B<-c("2004-11-18","2004-11-10")
data<-data.frame(A,B)
data$A<-as.Date(data$A);data$B<-as.Date(data$B)
The cause can be traced to the source code of ifelse. If you run View(ifelse), you will see these lines near the bottom of the source:
ans <- test
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
where test is a logical vector and ans is initialized as a copy of test. When ans[ypos] <- rep(yes, length.out = len)[ypos] runs, the Date values are assigned into that plain logical vector, so their class attribute is dropped and ans becomes numeric rather than Date. That is why column A holds numbers after using ifelse.
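The class loss can be seen directly in a small sketch. The vectors `d` and `b` mirror the two columns of the question, and the names `res` and `restored` are made up for illustration.

```r
d <- as.Date(c("2007-11-22", NA))
b <- as.Date(c("2004-11-18", "2004-11-10"))

# ifelse() builds its result from the logical test vector,
# so the Date class of 'yes' and 'no' does not survive
res <- ifelse(is.na(d), b + 25, d)
class(res)

# Converting back restores the dates
restored <- as.Date(res, origin = "1970-01-01")
class(restored)
format(restored)
```

The intermediate result is a plain numeric vector of day counts since 1970-01-01, which is why converting it with that origin recovers the intended dates.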
You can try the code below
data$A <- as.Date(ifelse(is.na(data$A), data$B + days(25), data$A), origin = "1970-01-01")
which gives
> data
A B
1 2007-11-22 2004-11-18
2 2004-12-05 2004-11-10
Assuming the data given reproducibly in the Note at the end -- in particular we assume both columns are of Date class -- compute a logical vector is_na which indicates which entries are NA and then set those from B.
is_na <- is.na(data$A)
data$A[is_na] <- data$B[is_na] + 25
This would also work and has the advantage that it does not overwrite data:
transform(data, A = replace(A, is.na(A), B[is.na(A)] + 25))
Note
Lines <- "
A B
2007-11-22 2004-11-18
NA 2004-11-10"
data <- read.table(text = Lines, header = TRUE)
data[] <- lapply(data, as.Date) # convert to Date class
Instead of ifelse you could use coalesce
library(tidyverse)
library(lubridate)
A <- c("2007-11-22 01:00:00", NA)
B <- c("2004-11-18","2004-11-10")
data <-data.frame(A,B)
data <- data %>%
mutate(A = as_date(A),
B = as_date(B),
A = coalesce(A,B+days(25)))

How to insert dates from one data frame into another?

I have a data frame with two columns, one containing dates, the other numbers. My goal is to insert dates from another data frame into the date column. Here is an example:
df <- data.frame(rep(as.Date("2001-01-01", origin = "1970-01-01"), 3),
c(1, 2, 3),
stringsAsFactors = F)
ins <- data.frame(rep(as.Date("1999-01-01", origin = "1970-01-01"), 3),
c(1, 2, 3),
stringsAsFactors = F)
The data frame I want to obtain is:
> df_goal
dates numbers
1 1999-01-01 1
2 2001-01-01 2
3 2001-01-01 3
I tried df[1, ] <- c(ins[1, 1], ins[1, 2]), but I got the following error:
Error in as.Date.numeric(value) : 'origin' must be supplied
However, if I omit the numeric column in df, it works:
df <- data.frame(rep(as.Date("2001-01-01"), 3),
stringsAsFactors = F)
ins <- data.frame(rep(as.Date("1999-01-01"), 3),
c(1, 2, 3),
stringsAsFactors = F)
df[1, ] <- ins[1, 1]
How to get the first case (df with two columns) working?
Don't use c() here: it coerces its arguments so that they all have the same class.
In this case, c(ins[1, 1], ins[1, 2]) makes a date vector; and when this is assigned onto the second column of df, R tries to coerce that column to date to make sense of the assignment, like as.Date(c(1, 2, 3)).
You can instead do df[1,] <- ins[1, c(1,2)].
Side note: Don't do this sort of insertion based on row numbers; there must be a better way to achieve what you're after, like a join/merge.
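A minimal sketch of the difference, assuming the columns are named `dates` and `numbers` as in the goal output of the question (the original data frames were built without explicit column names):

```r
df  <- data.frame(dates = rep(as.Date("2001-01-01"), 3), numbers = c(1, 2, 3))
ins <- data.frame(dates = rep(as.Date("1999-01-01"), 3), numbers = c(1, 2, 3))

# c() dispatches on its first argument, so the number is turned into a Date
combined <- c(ins[1, 1], ins[1, 2])
class(combined)

# A one-row data frame keeps each column's own class,
# so the assignment works column by column
df[1, ] <- ins[1, ]
str(df)
```

Because the right-hand side is still a data frame, the date lands in the Date column and the number lands in the numeric column without any coercion between them.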
Alternatively:
df2 <- rbind(ins[1,], df[2:3,])
or
df2 <- df
df2[1,] <- ins[1,]

Create Value in final column of dataframe based on multiple columns

I have a dataframe that looks like this (but with a lot more variables/columns)
set.seed(5)
id<-seq(5)*floor(runif(5,min=1000, max=10000))
vals1<-c("Y","N","N","N","N")
vals2<-c("N","N","N","N","N")
vals3<-c("N","N","N","Y","N")
df<-data.frame(id,vals1,vals2,vals3)
I'd like to create a final column in the frame that holds a flag with the following logic: if any column has the value 'Y' for an id, the final flag is 'Y'; otherwise it is 'N'. So, for this data frame the 1st and 4th ids (2801, 14236) have a 'Y' in the final column and the rest have an 'N'. I tried a few approaches like apply and if...else, to no avail.
Initialize by assigning "N" to every row. In the next step, assign "Y" to the rows that contain a "Y" (checked using apply).
df$final = "N"
df$final[apply(df, 1, function(a) "Y" %in% a)] = "Y"
A solution for your letter encoding below.
set.seed(5)
id <- seq(5) * floor(runif(5, min=1000, max=10000))
vals1 <- c("Y","N","N","N","N")
vals2 <- c("N","N","N","N","N")
vals3 <- c("N","N","N","Y","N")
df <- data.frame(id, vals1, vals2, vals3)
# If you really want to use the letter encoding, my solution works as below
df$Final <- apply(df[,2:4], MARGIN = 1, FUN = function(x) {any(x == 'Y')})
However, I think you should use a boolean (TRUE/FALSE) for this.
It works well in combination with apply and any:
set.seed(5)
id <- seq(5) * floor(runif(5, min=1000, max=10000))
vals1 <- c("Y","N","N","N","N")
vals2 <- c("N","N","N","N","N")
vals3 <- c("N","N","N","Y","N")
df <- data.frame(id, vals1, vals2, vals3)
# Convert your labels into booleans:
df[,2:4] <- df[,2:4] == 'Y'
# Then summarise across rows
df$Final <- apply(df[,2:4], MARGIN = 1, FUN = function(x) {any(x)})
Somewhat similar to @d.b's answer:
df$final <- apply(df, 1, function(x) c("N","Y")[any(x == "Y")+1])
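The answers above all loop over rows with apply. As a possible alternative (not taken from any of the answers), the comparison can be vectorized with rowSums. This is a sketch in which the `id` values are placeholders and `flag_cols` and `has_y` are hypothetical helper names.

```r
df <- data.frame(
  id    = 1:5,  # placeholder ids, not the ones from the question
  vals1 = c("Y", "N", "N", "N", "N"),
  vals2 = c("N", "N", "N", "N", "N"),
  vals3 = c("N", "N", "N", "Y", "N")
)

# Restrict the check to the flag columns so the id column is never compared
flag_cols <- c("vals1", "vals2", "vals3")

# df[flag_cols] == "Y" is a logical matrix; count the TRUEs per row
has_y <- unname(rowSums(df[flag_cols] == "Y") > 0)

df$final <- ifelse(has_y, "Y", "N")
df
```

Naming the flag columns explicitly avoids an accidental match on other columns, and `has_y` is available as a plain logical vector if a TRUE/FALSE flag is preferred.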

Create TRUE/FALSE dataframe based on the presence/absence of specific variables

I have a data frame with samples taken from different seasons. What I would like is to summarize what sites have samples from spring (March-May) and autumn (September-November) across different years. For example, if Site A had a sample from Spring 2007, the cell reads 'TRUE'. Here is an example dataset:
Dates <- data.frame(c(as.Date("2007-9-1"),
rep(as.Date("2008-3-1"), times = 3) ,
rep(as.Date("2008-9-1"), times = 3)))
Sites <- as.data.frame(as.factor(c("SiteA",rep(c("SiteA","SiteB","SiteC"), 2))))
Values <- data.frame(matrix(sample(0:50, 3.5*2, replace=TRUE), ncol=1))
Dataframe <- cbind(Dates,Sites,Values)
colnames(Dataframe) <- c("date","site","value")
I have managed to create the factor 'season' within this dataframe based on these functions.
Dataframe$Months <- as.numeric(format(Dataframe$date, '%m'))
Dataframe$Season <- cut(Dataframe$Months,
breaks = c(1, 2, 5, 8, 11, 12),
labels = c("Winter", "Spring", "Summer", "Autumn", "Winter"),
right = FALSE)
But I am unsure where to go from here. Here is what the output should look like.
A <- rep("TRUE",times = 3)
B <- c("FALSE",rep("TRUE",times = 2))
C <- c("FALSE",rep("TRUE",times = 2))
Output <- as.data.frame(rbind(A,B,C))
colnames(Output) <- c("Autumn.07","Spring.07","Autumn.08")
Here is a proposition:
Dataframe$Samplings <- interaction(Dataframe$Season, unlist(lapply(strsplit(as.character(Dataframe$date), '-'), function(x) x[[1]]) ))
u1 <- unique(Dataframe$site)
u2 <- unique(Dataframe$Samplings)
output <- matrix(
matrix(levels(interaction(u1, u2)), nrow=length(unique(Dataframe$site))) %in%
interaction(Dataframe$site,Dataframe$Samplings),
nrow=length(unique(Dataframe$site))
)
colnames(output) <- levels(Dataframe$Samplings)
rownames(output) <- unique(Dataframe$site)
output # with all time interactions
# you can clear it with
output[, apply(output, 2, sum) != 0]
Using reshape2::dcast:
Dataframe$site <- gsub("Site","",Dataframe$site)
Dataframe$year <- format(Dataframe$date, "%y")
temp <- reshape2::dcast(Dataframe, site ~ Season + year, length)
(ans <- apply(data.frame(temp[,2:ncol(temp)], row.names=temp[,1]), 1:2, as.logical))
There is a warning with your Dataframe$Season due to duplicate labels. You might want to fix that.
I think that this is what you're looking for. The time label isn't exactly as in the question, but I hope it's still understandable.
library(reshape2)
# prepare the input, to have a handy label for the columns
Dataframe$Year <- as.numeric(format(Dataframe$date, '%Y'))
Dataframe$TimeLabel <- paste0(Dataframe$Season, '.', Dataframe$Year)
# This is in stages, to make it clear what's happening.
# create a data frame with the right structure, but cells holding NA / numbers
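# value.var is named explicitly so dcast does not have to guess the value column
df1 <- dcast(Dataframe, site ~ TimeLabel, value.var = "value")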
# turn NA / number into false/true, while ignoring the site column
df2 <- !is.na(df1[, -1])
# add back the site labels for rows
df3 <- cbind(as.data.frame(df1$site), df2)
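For comparison, the same presence/absence matrix can be sketched in base R with table, without reshape2. This is an alternative to the answers above rather than a restatement of them. The data frame `samples` and its `label` column are assumptions that hard-code the season and year combinations from the example data.

```r
samples <- data.frame(
  site  = c("SiteA", "SiteA", "SiteB", "SiteC", "SiteA", "SiteB", "SiteC"),
  label = c("Autumn.2007", rep("Spring.2008", 3), rep("Autumn.2008", 3))
)

# Fix the column order chronologically instead of alphabetically
samples$label <- factor(
  samples$label,
  levels = c("Autumn.2007", "Spring.2008", "Autumn.2008")
)

# Count samples per site and period, then turn counts into TRUE/FALSE
presence <- unclass(table(samples$site, samples$label)) > 0
presence
```

A cell is TRUE when at least one sample exists for that site and period, so repeated samples in the same period do not need any special handling.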

Aggregate an entire data frame with Weighted Mean

I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr but for only one variable, I want to aggregate all my variables that way.
You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
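# grab the column names we want, generalized for v1:vWhatever
# (named 'cols' so that base::c is not masked)
cols = colnames(dat)[-c(1,2)]
#get the weighted mean by date for each column
for(n in cols){
dat[,
# (n) := assigns to the column whose name is stored in n;
# this replaces the deprecated with = FALSE form
(n) := weighted.mean(get(n), nwords),
by = date]
}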
#keep only the unique dates and weighted means
wms = unique(dat[,nwords:=NULL])
Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case is 'date')
y <- c(rep("A", 7), rep("B", 3))
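# by() hands each group's rows to the function as a data frame,
# so the values and the weights have to be picked out explicitly
by(data.frame(mat), y, function(d) weighted.mean(d$x, d$ww))
If you want the results in a data frame, I suggest the plyr package:
plyr::ddply(data.frame(mat, y), "y", function(d) c(wm = weighted.mean(d$x, d$ww)))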
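The original aggregate call fails because w = dat$nWords has the full length of the data, while each group passed to weighted.mean is only a subset. A base R sketch that keeps the weights aligned with each group, and covers every value column at once, could look like the following. The toy values and the names `value_cols` and `wm` are assumptions for illustration.

```r
dat <- data.frame(
  date   = rep(c("2012-01-01", "2013-01-01"), each = 2),
  nWords = c(1, 3, 2, 2),
  v1     = c(10, 20, 1, 3),
  v2     = c(0, 4, 5, 7)
)
value_cols <- c("v1", "v2")

# Split the rows by date, then take the weighted mean of every value
# column using that group's own weights
wm <- t(sapply(split(dat, dat$date), function(g) {
  sapply(g[value_cols], weighted.mean, w = g$nWords)
}))
wm
```

The result is a matrix with one row per date and one column per variable; as.data.frame(wm) converts it if a data frame is needed.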

Resources