I have some trouble with a dataset I have in data.table. Basically, I have
2 columns: scheduled delivery date and rescheduled delivery date. However,
some values are left blank. An example:
Scheduled   Rescheduled
NA          NA
2016-11-14  2016-11-17
2016-11-15  NA
2016-11-13  2016-11-11
NA          2016-11-15
I want to create a new column, which indicates the most recent
date of both columns, for instance named max_scheduled_date.
Therefore, if Rescheduled is NA, the max_scheduled_date should
take the value of Scheduled, whilst max_scheduled_date should
take the value of Rescheduled if Scheduled is NA. When both
columns are NA, max_scheduled_date should obviously take NA.
When both columns have a date, it should take the most recent one.
I have tried several approaches to create this but do not get the results I want. The dates are stored as POSIXct, which unfortunately gives me some trouble.
Can someone help me out?
Thank you in advance,
Kind regards,
Amanda
As the question is tagged with data.table, here is also a data.table solution.
pmax() seems to work sufficiently well with POSIXct. Therefore, I see no reason to coerce the date columns from POSIXct to Date class.
setDT(DF)[, max_scheduled_date := pmax(Scheduled, Rescheduled, na.rm = TRUE)]
DF
Scheduled Rescheduled max_scheduled_date
1: <NA> <NA> <NA>
2: 2016-11-14 2016-11-17 2016-11-17
3: 2016-11-15 <NA> 2016-11-15
4: 2016-11-13 2016-11-11 2016-11-13
5: <NA> 2016-11-15 2016-11-15
Note that the new column is appended by reference, i.e., without copying the whole object.
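As a standalone sketch of the pmax() behaviour relied on above (plain base R, no data.table needed), using the sample dates:

```r
# pmax() keeps the POSIXct class and, with na.rm = TRUE, falls back to
# the non-NA value; only an all-NA pair stays NA
x <- as.POSIXct(c(NA, "2016-11-14", "2016-11-15"), tz = "UTC")
y <- as.POSIXct(c(NA, "2016-11-17", NA), tz = "UTC")
res <- pmax(x, y, na.rm = TRUE)
format(res, "%Y-%m-%d")
# [1] NA "2016-11-17" "2016-11-15"
```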
Data
DF <- setDF(fread(
"Scheduled Rescheduled
NA NA
2016-11-14 2016-11-17
2016-11-15 NA
2016-11-13 2016-11-11
NA 2016-11-15"
)[, lapply(.SD, as.POSIXct)])
str(DF)
'data.frame': 5 obs. of 2 variables:
$ Scheduled : POSIXct, format: NA "2016-11-14" "2016-11-15" ...
$ Rescheduled: POSIXct, format: NA "2016-11-17" NA ...
Assuming that both columns are Date class, we can use pmax to create the max of the dates for each row
df1[] <- lapply(df1, as.Date) #change to Date class initially
df1$max_scheduled_date <- do.call(pmax, c(df1, na.rm = TRUE))
df1$max_scheduled_date
#[1] NA "2016-11-17" "2016-11-15" "2016-11-13" "2016-11-15"
It can also be done with the tidyverse
library(dplyr)
df1 %>%
mutate_all(as.Date) %>%
mutate(max_scheduled_date = pmax(Scheduled, Rescheduled, na.rm = TRUE))
Related
I combine several large csv files with arrival (ATA) and departure (ATD) times of objects. After combining the files, I cannot remove <NA> rows using the familiar methods. The cause may be a difference between Windows and Unix files in newline and carriage-return handling, but I don't want to alter the csv files; I want to be able to correct the data frame in R.
I combine several large csv files containing the same variables, e.g.:
# read csv files
df1 <- read.csv("data_1.csv", stringsAsFactors = FALSE)
df2 <- read.csv("data_2.csv", stringsAsFactors = FALSE)
df3 <- read.csv("data_3.csv", stringsAsFactors = FALSE)
# combine csv files
combidat <- rbind(df1, df2, df3)
# remove duplicate entries
combidat <- combidat[!duplicated(combidat), ]
To remove entries with <NA> in ID (first column variable), I use one of several:
combidat <- combidat[!is.na(combidat$ID),]
combidat <- combidat[complete.cases(combidat[ , 1]),]
combidat <- combidat[rowSums(is.na(combidat)) != ncol(combidat),]
I also found:
combidat <- combidat[-which(apply(combidat,1,function(x)all(is.na(x)))),]
But this approach does not work for me either: if I use it, combidat becomes empty.
If I check the result:
combidat[is.na(combidat$ID),]
I get:
[1] ID ATA ATD object
<0 rows> (or 0-length row.names)
However, if I check on inconsistencies, i.e. departure times before arrival times:
combidat[(ATD<ATA),]
I get:
ID ATA ATD object
233 51586002 2016-03-14 09:44:00 2016-03-14 09:00:00 car718
798 54846070 2016-06-19 01:37:00 2016-04-07 23:59:00 car276
4126 56066767 2016-03-31 14:00:00 2016-03-30 07:00:00 car089
NA NA <NA> <NA> NA
NA.1 NA <NA> <NA> NA
NA.2 NA <NA> <NA> NA
NA.3 NA <NA> <NA> NA
NA.4 NA <NA> <NA> NA
NA.5 NA <NA> <NA> NA
NA.6 NA <NA> <NA> NA
NA.7 NA <NA> <NA> NA
What I hope to get is:
ID ATA ATD object
233 51586002 2016-03-14 09:44:00 2016-03-14 09:00:00 car718
798 54846070 2016-06-19 01:37:00 2016-04-07 23:59:00 car276
4126 56066767 2016-03-31 14:00:00 2016-03-30 07:00:00 car089
Any explanation what I do wrong and how to correct it, would be much appreciated.
[Addition June 28, 2019]
Something goes wrong with the imported csv files: somehow, newlines/carriage returns within a data field get interpreted as end-of-record markers. I have experimented with the quote argument:
df1 <- read.csv("data_1.csv", stringsAsFactors = FALSE, quote = "\"'")
And this has some impact, but I do not get it right.
As noted by gdevaux, your "NA" values may be character strings rather than real NAs. In that case you could filter your data with the dplyr package (you can also do it with base R, as you attempted):
library(dplyr)
combidat <- combidat %>% filter(ID != "NA")
The values may also contain spaces; in that case you can trim them on import, or trim the column and retry the code above.
library(stringr)
combidat <- combidat %>%
  mutate(ID = str_trim(ID, side = "both")) %>%
  filter(ID != "NA")
Lastly, it seems that your ID column is composed only of numeric IDs. In that case you could make the column numeric, which coerces the "NA" strings to real NA values.
combidat$ID <- as.numeric(combidat$ID)
combidat <- combidat %>% filter(!is.na(ID))
If any values were coerced, R will warn you.
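A minimal sketch of why "NA" strings slip through is.na() in the first place (the data here is illustrative, not from the question):

```r
# "NA" as a character string is not a missing value, so is.na() misses it
ids <- c("1", "NA", "3")
sum(is.na(ids))        # 0 -- nothing detected
# Coercing to numeric turns the string "NA" into a real NA (with a warning)
ids_num <- as.numeric(ids)
sum(is.na(ids_num))    # 1
```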
The two answers above should help you. Here is another approach, which filters empty observations and variables from the data frame.
#install.packages("janitor")
library(janitor)
combidat <- remove_empty(combidat, "rows")
combidat <- combidat[which(!is.na(combidat$ID)),]
I am looking for code, like a for-loop, which looks at all rows of certain columns to create a new variable for each row [i].
I have a data frame which basically includes three columns: interval start, interval end and measuring date. In all three columns the values have the format YearMonthDayHourMinute. The measuring dates form a continuous series in 10-minute steps. The intervals are just short periods, which leaves a lot of NAs in the interval columns where no interval is recorded.
The data frame looks like this:
interval_start interval_end measuring_date
1 NA NA 201805021210
2 201805021220 201805021250 201805021220
3 NA NA 201805021230
4 NA NA 201805021240
5 NA NA 201805021250
6 NA NA 201805021300
Now, I want to R to create a new column, that gives a "Yes" where the measuring period lies within the interval, and a "No" where it doesn't.
Like this:
interval_start interval_end measuring_date within_interval
1 NA NA 201805021210 No
2 201805021220 201805021250 201805021220 Yes
3 NA NA 201805021230 Yes
4 NA NA 201805021240 Yes
5 NA NA 201805021250 Yes
6 NA NA 201805021300 No
So I want R to take the measuring_date of row 1 and compare it to the interval_start and interval_end of rows 1, 2, 3, 4, 5 and 6; the same for the measuring_date of row 2, and so on.
The problem is that I've tried for-loops with if/else and nested for-loops (see below), but R does not seem to take the measuring_date of row 1 and compare it with all rows of interval_start and interval_end; it compares only within the same row. So all I can get is:
interval_start interval_end measuring_date within_interval
1 NA NA 201805021210 No
2 201805021220 201805021250 201805021220 Yes
3 NA NA 201805021230 No
4 NA NA 201805021240 No
5 NA NA 201805021250 No
6 NA NA 201805021300 No
Does anyone know a solution to this problem? Maybe there are solutions outside of a for-loop that I didn't come across.
I've been searching the whole internet but didn't find any solution, which has left me quite frustrated. Even my supervisor is stumped.
I hope my question is clear enough. Sorry, I am using Stack Overflow for the first time.
for (i in 1:nrow(masterX)) {
  masterX$Within_Searching_Period[i] <- NA
  for (j in 1:nrow(masterX)) {
    if (masterX$MESS_DATUM[i] >= masterX$time_date_start_min[j] &
        masterX$MESS_DATUM[i] <= masterX$time_date_end_min[j]) {
      masterX$Within_Searching_Period[i] <- "YES"
    } else {
      masterX$Within_Searching_Period[i] <- "NO"
    }
  }
}
Using data.table package, you can use non-equi join to find if measuring_date is within any interval:
DT[, within_interval :=
     DT[DT, .N > 0, on = .(interval_start <= measuring_date, interval_end >= measuring_date), by = .EACHI]$V1
]
output:
interval_start interval_end measuring_date within_interval
1: <NA> <NA> 2018-05-02 12:10:00 FALSE
2: 2018-05-02 12:20:00 2018-05-02 12:50:00 2018-05-02 12:20:00 TRUE
3: <NA> <NA> 2018-05-02 12:30:00 TRUE
4: <NA> <NA> 2018-05-02 12:40:00 TRUE
5: <NA> <NA> 2018-05-02 12:50:00 TRUE
6: <NA> <NA> 2018-05-02 13:00:00 FALSE
data:
library(data.table)
DT <- fread("interval_start,interval_end,measuring_date
NA,NA,201805021210
201805021220,201805021250,201805021220
NA,NA,201805021230
NA,NA,201805021240
NA,NA,201805021250
NA,NA,201805021300", colClasses="character")
DT[, (names(DT)) := lapply(.SD, as.POSIXct, format="%Y%m%d%H%M")]
Instead of focusing on the loop, it's better practice to try to write vectorized code. I've used some lubridate and tidyverse syntax. If you deal with dates a lot, it's worth looking at lubridate's documentation.
library(tidyverse)
library(lubridate)
data %>%
  mutate_at(vars(time_date_start_min, time_date_end_min, MESS_DATUM), ymd_hm) %>%
  mutate(within_interval = ifelse(
    MESS_DATUM %within% interval(time_date_start_min, time_date_end_min) %in% TRUE,
    "Yes", "No"))
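For comparison, a base-R sketch of the same idea without joins or lubridate: collect the non-NA interval bounds and test each measuring date against all of them. This assumes interval_start and interval_end are NA in the same rows, as in the example; the values below are illustrative:

```r
# Measuring dates every 10 minutes, one interval from 12:20 to 12:50
ms <- as.POSIXct(paste("2018-05-02", c("12:10", "12:20", "12:30",
                                       "12:40", "12:50", "13:00")), tz = "UTC")
starts <- as.POSIXct("2018-05-02 12:20", tz = "UTC")  # non-NA interval starts
ends   <- as.POSIXct("2018-05-02 12:50", tz = "UTC")  # non-NA interval ends
# For each measuring date, check whether it falls inside ANY interval
within <- ifelse(sapply(ms, function(m) any(m >= starts & m <= ends)),
                 "Yes", "No")
within
# [1] "No"  "Yes" "Yes" "Yes" "Yes" "No"
```

This is O(rows x intervals), so for large data the non-equi join above will scale better.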
I have a data frame yy and I want to aggregate the data. There is a time-stamp variable, and its values are repeated.
I want to find the unique values of the time stamp and aggregate all the other variables in this data frame with respect to each unique time-stamp value. Finally, I need the mean of the other variables.
Here is the data sample
temp yield density time
1 54 NA 30.23 2009-12-31 18
2 54 NA 30.22 2009-12-31 19
3 53 NA 30.20 2009-12-31 20
4 53 NA 30.19 2009-12-31 21
5 50 NA 30.18 2009-12-31 22
6 51 3 30.16 2009-12-31 23
.......
I run the following code:
aggdata = aggregate(yy ~ time, by = list(unique(time)), data = yy, FUN = mean, na.rm = TRUE)
I got this warning
argument is not numeric or logical: returning NA
If I run the aggregation one variable at a time, it works
aggdata=aggregate(temp~time, by= list(unique(time)),data=yy,FUN=mean)
But if I use the whole data frame yy, there are errors.
Could someone please explain this?
Using data.table, convert the 'data.frame' to 'data.table' (setDT(yy)), grouped by 'time', specify the columns to summarise in .SDcols, loop through them and get the mean.
library(data.table)
setDT(yy)[, lapply(.SD, mean, na.rm=TRUE), by = time, .SDcols = c("temp", "yield")]
This seems like something that could easily be done using the package dplyr
You could do something as follows:
library(dplyr)
yy <- yy %>%
  group_by(time) %>%
  summarize(meantemp = mean(temp, na.rm = TRUE),
            meanyield = mean(yield, na.rm = TRUE))
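For completeness, a base-R sketch of what the original aggregate() call was aiming at: name the columns with cbind() on the left-hand side of the formula, and pass na.action = na.pass so the NA-containing rows survive long enough for na.rm = TRUE to handle them (the sample data here is illustrative):

```r
# Illustrative data: repeated time stamps, yield partly NA
yy <- data.frame(temp  = c(54, 54, 53),
                 yield = c(NA, 2, 4),
                 time  = c("t1", "t1", "t2"))
aggregate(cbind(temp, yield) ~ time, data = yy, FUN = mean,
          na.rm = TRUE, na.action = na.pass)
#   time temp yield
# 1   t1   54     2
# 2   t2   53     4
```

Without na.action = na.pass, the formula method silently drops every row containing an NA before FUN ever sees it, which is a common source of surprising results.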
I have a data.table containing two date variables. The data set was read into R from a .csv file (was originally an .xlsx file) as a data.frame and the two variables then converted to date format using as.Date() so that they display as below:
df
id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 <NA> 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 <NA> <NA>
I then converted the data.frame to a data.table:
df <- data.table(df)
I then wanted to create a third variable, that would include "specdate" if present, but replace it with "recdate" if "specdate" was missing (NA). This is where I'm having some difficulty, as it seems that no matter how I approach this, data.table displays dates in date format only if a complete variable that is already in date format is copied. Otherwise, individual values are displayed as a number (even when using as.IDate) and I gather that an origin date is needed to correct this. Is there any way to avoid supplying an origin date but display the dates as dates in data.table?
Below is my attempt to fill the NAs of specdate with the recdate dates:
# Function to fill NAs:
fillnas <- function(dataref, lookupref, nacol, replacecol, replacelist=NULL) {
nacol <- as.character(nacol)
if(!is.null(replacelist)) nacol <- factor(ifelse(dataref==lookupref & (is.na(nacol) | nacol %in% replacelist), replacecol, nacol))
else nacol <- factor(ifelse(dataref==lookupref & is.na(nacol), replacecol, nacol))
nacol
}
# Fill the NAs in specdate with the function:
df[, finaldate := fillnas(dataref=id, lookupref=id, nacol=specdate, replacecol=as.IDate(recdate, format="%Y-%m-%d"))]
Here is what happens:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> NA
The display problem is compounded if I create the new variable from scratch by using ifelse:
df[, finaldate := ifelse(!is.na(specdate), specdate, recdate)]
This gives:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 16294
2: 2 2014-08-15 2014-08-20 16297
3: 3 2014-08-21 2014-08-26 16303
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 16307
6: 6 <NA> <NA> NA
Alternatively, if I try a find-and-replace type approach, I get an error about the number of items to replace not matching the replacement length (I'm guessing this is because that approach is not vectorised?); the values from recdate are recycled and end up in the wrong place:
> df$finaldate <- df$specdate
> df$finaldate[is.na(df$specdate)] <- df$recdate
Warning message:
In NextMethod(.Generic) :
number of items to replace is not a multiple of replacement length
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 2014-08-17
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> 2014-08-20
So in conclusion - the function I applied gets me closest to what I want, except that where NAs have been replaced, the replacement value is displayed as a number and not in date format. Once displayed as a number, the origin is required to again display it as a date (and I would like to avoid supplying the origin since I usually don't know it and it seems unnecessarily repetitive to have to supply it when the date was originally in the correct format).
Any insights as to where I'm going wrong would be much appreciated.
I'd approach it like this, maybe :
DT <- data.table(df)
DT[, finaldate := specdate]
DT[is.na(specdate), finaldate := recdate]
It seems you want to add a new column so you can retain the original columns as well. I do that a lot as well. Sometimes, I just update in place:
DT <- data.table(df)
DT[is.na(specdate), specdate := recdate]
setnames(DT, "specdate", "finaldate")
Using i like that avoids creating a whole new column's worth of data, which might be very large. It depends on how important retaining the original columns is to you, how many of them there are, and your data size. (Note that a whole column's worth of data is still created by the is.na() call, and then again by !, but at least there isn't a third column's worth for the new finaldate. It would be great to optimize i = !is.na() in future (#1386); if you use data.table this way now, you won't need to change your code later to benefit.)
It seems that you might have various "NA" strings that you're replacing. Note that fread in v1.9.6 on CRAN has a fix for that. From README :
correctly handles the na.strings argument for all types of columns - it detects possible NA values without coercion to character, as in base read.table. Fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.
Btw, you've made one of the top 3 mistakes mentioned here : https://github.com/Rdatatable/data.table/wiki/Support
Works for me. You may want to check that your NA values are not the strings or factor levels "NA"/"<NA>"; when printed, those can look just like real NA values:
dt[, finaldate := ifelse(is.na(specdate), recdate, specdate)][
  , finaldate := as.POSIXct(finaldate * 86400, origin = "1970-01-01", tz = "UTC")]
# id specdate recdate finaldate
# 1: 1 2014-08-12 2014-08-17 2014-08-12
# 2: 2 2014-08-15 2014-08-20 2014-08-15
# 3: 3 2014-08-21 2014-08-26 2014-08-21
# 4: 4 NA 2014-08-28 2014-08-28
# 5: 5 2014-08-25 2014-08-30 2014-08-25
# 6: 6 NA NA NA
Data
df <- read.table(text=" id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 NA 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 NA NA", header=T, stringsAsFactors=F)
dt <- as.data.table(df)
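As a side note (not part of the answers above, and requiring data.table 1.12.4 or later): fcoalesce() returns the first non-NA value per row and preserves the Date class, which sidesteps the ifelse() attribute-dropping problem and the need for an origin entirely:

```r
library(data.table)
dt <- data.table(specdate = as.Date(c("2014-08-12", NA, NA)),
                 recdate  = as.Date(c("2014-08-17", "2014-08-28", NA)))
# First non-NA per row; result is still Date class
dt[, finaldate := fcoalesce(specdate, recdate)]
dt$finaldate
# [1] "2014-08-12" "2014-08-28" NA
```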
I try to reshape my data using this code, but I get NA values.
require(reshape2)
dates = data.frame(dates = seq(as.Date("1988-01-01"), as.Date("2011-12-31"), by = "day"))
first = dates[, 1]
dates1 = cbind(dates[, 1], colsplit(first, pattern = "\\-", names = c("Year", "Month", "Day")))  # split by year/month/day
head(dates1)
dates[, 1] Year Month Day
1 1988-01-01 6574 NA NA
2 1988-01-02 6575 NA NA
3 1988-01-03 6576 NA NA
4 1988-01-04 6577 NA NA
5 1988-01-05 6578 NA NA
6 1988-01-06 6579 NA NA
We can use cSplit from splitstackshape to split the 'dates' column by the delimiter -.
library(splitstackshape)
cSplit(dates, 'dates', '-', drop=FALSE)
Or extract to create additional columns
library(tidyr)
extract(dates, dates, into=c('Year', 'Month', 'Day'),
'([^-]+)-([^-]+)-([^-]+)', remove=FALSE)
Or another option from tidyr (suggested by #Ananda Mahto)
separate(dates, dates, into = c("Year", "Month", "Day"), remove=FALSE)
Or use read.table from base R: specify the sep and the column names, and cbind with the original column.
cbind(dates[1],read.table(text=as.character(dates$dates),
sep='-', col.names=c('Year', 'Month', 'Day')))
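Or, as a base-R sketch without any splitting at all (the added column names here are just illustrative), format() can pull the parts straight out of the Date column:

```r
dates <- data.frame(dates = seq(as.Date("1988-01-01"), as.Date("1988-01-03"), by = "day"))
# Extract year/month/day directly from the Date values
dates$Year  <- as.integer(format(dates$dates, "%Y"))
dates$Month <- as.integer(format(dates$dates, "%m"))
dates$Day   <- as.integer(format(dates$dates, "%d"))
dates
#        dates Year Month Day
# 1 1988-01-01 1988     1   1
# 2 1988-01-02 1988     1   2
# 3 1988-01-03 1988     1   3
```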
Using reshape2 1.4.1, I could reproduce the error:
head(cbind(dates[,1],colsplit(first,pattern="-",
names=c("Year","Month","Day"))),2)
# dates[, 1] Year Month Day
#1 1988-01-01 6574 NA NA
#2 1988-01-02 6575 NA NA
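A likely explanation (my reading, not stated explicitly above): colsplit() only splits character or factor input, so a Date column arrives as its underlying day count (6574 days since 1970-01-01 for 1988-01-01, matching the output above) with nothing to split. Converting to character first appears to make colsplit() behave:

```r
library(reshape2)
first <- as.Date(c("1988-01-01", "1988-01-02"))
# Splitting the character representation works as intended
colsplit(as.character(first), pattern = "-", names = c("Year", "Month", "Day"))
#   Year Month Day
# 1 1988     1   1
# 2 1988     1   2
```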