I combine several large csv files with arrival (ATA) and departure (ATD) times of objects. After combining the files I cannot remove <NA> rows using the familiar methods. The cause may lie in the difference between Windows and Unix line endings (newline vs. carriage return plus newline). But I don't want to alter the csv files themselves; I want to correct the data frame in R.
I combine several large csv files containing the same variables, e.g.:
# read csv files
df1 <- read.csv("data_1.csv", stringsAsFactors = FALSE)
df2 <- read.csv("data_2.csv", stringsAsFactors = FALSE)
df3 <- read.csv("data_3.csv", stringsAsFactors = FALSE)
# combine csv files
combidat <- rbind(df1, df2, df3)
# remove duplicate entries
combidat <- combidat[!duplicated(combidat), ]
To remove entries with <NA> in ID (the first column variable), I use one of several approaches:
combidat <- combidat[!is.na(combidat$ID),]
combidat <- combidat[complete.cases(combidat[ , 1]),]
combidat <- combidat[rowSums(is.na(combidat)) != ncol(combidat),]
I also found:
combidat <- combidat[-which(apply(combidat,1,function(x)all(is.na(x)))),]
But I cannot use this approach: if I do, combidat becomes empty.
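As an aside, the emptying behaviour can be reproduced on a toy frame (a minimal sketch, invented data, not the asker's files): when no row is entirely NA, which() returns a zero-length vector, and negative indexing with integer(0) selects zero rows rather than all of them.

```r
df <- data.frame(ID = 1:3, object = c("a", "b", "c"))
idx <- which(apply(df, 1, function(r) all(is.na(r))))  # no all-NA rows here
length(idx)       # 0
nrow(df[-idx, ])  # 0 -- df[-integer(0), ] drops every row
# Safer: subset with a logical vector instead of -which(...)
nrow(df[!apply(df, 1, function(r) all(is.na(r))), ])  # 3
```

This is why the -which() variant empties the frame whenever there happen to be no all-NA rows, while the logical-vector variants behave as expected.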
If I check the result:
combidat[is.na(combidat$ID),]
I get:
[1] ID ATA ATD object
<0 rows> (or 0-length row.names)
However, if I check on inconsistencies, i.e. departure times before arrival times:
combidat[combidat$ATD < combidat$ATA, ]
I get:
ID ATA ATD object
233 51586002 2016-03-14 09:44:00 2016-03-14 09:00:00 car718
798 54846070 2016-06-19 01:37:00 2016-04-07 23:59:00 car276
4126 56066767 2016-03-31 14:00:00 2016-03-30 07:00:00 car089
NA NA <NA> <NA> NA
NA.1 NA <NA> <NA> NA
NA.2 NA <NA> <NA> NA
NA.3 NA <NA> <NA> NA
NA.4 NA <NA> <NA> NA
NA.5 NA <NA> <NA> NA
NA.6 NA <NA> <NA> NA
NA.7 NA <NA> <NA> NA
What I hope to get is:
ID ATA ATD object
233 51586002 2016-03-14 09:44:00 2016-03-14 09:00:00 car718
798 54846070 2016-06-19 01:37:00 2016-04-07 23:59:00 car276
4126 56066767 2016-03-31 14:00:00 2016-03-30 07:00:00 car089
Any explanation of what I am doing wrong, and how to correct it, would be much appreciated.
[Addition June 28, 2019]
Something goes wrong with the imported csv files: newlines/carriage returns within a data field somehow get interpreted as end-of-record markers. I have juggled with the 'quote' argument:
df1 <- read.csv("data_1.csv", stringsAsFactors = FALSE, quote = "\"'")
And this has some impact, but I cannot get it quite right.
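A minimal self-contained sketch of the underlying mechanics (file contents and column names invented for illustration): as long as an embedded newline sits inside a quoted field and quote includes the double quote, read.csv keeps the record intact.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c('ID,note',
             '1,"line one',
             'continues in row 1"',
             '2,ok'), tmp)
# The read.csv default quote = "\"" keeps the quoted newline inside the field:
df <- read.csv(tmp, stringsAsFactors = FALSE)
nrow(df)    # 2 -- still two records, not three
df$note[1]  # contains an embedded "\n"
```

If fields with embedded newlines are not quoted in the source files, no quote setting can rescue them, and the broken half-records surface as the NA rows seen above.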
As noted by gdevaux, maybe your "NA" values are character strings. In that case you can filter your data with the dplyr package (you can also do it with base R, as you attempted):
library(dplyr)
combidat <- combidat %>% filter(ID != "NA")
The values may also contain spaces; in that case you can strip them during import (e.g. strip.white = TRUE) or trim the column and retry the code above.
library(stringr)
combidat <- combidat %>%
  mutate(ID = str_trim(ID, side = "both")) %>%
  filter(ID != "NA")
Lastly, it seems that your ID column consists only of numeric IDs. In that case you can make the column numeric, which coerces the "NA" strings to real NA values:
combidat$ID <- as.numeric(combidat$ID)
combidat <- combidat %>% filter(!is.na(ID))
If any values were coerced, R will warn you.
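The coercion behaviour can be checked on a small vector (a sketch with invented values): as.numeric() turns the string "NA" into a real NA, with a coercion warning, and tolerates surrounding spaces.

```r
x <- c("123", "NA", " 456 ")
suppressWarnings(as.numeric(x))
# [1] 123  NA 456   -- the "NA" string became a real NA
```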
The two answers above should help you. Here is another approach, using the janitor package, to filter empty observations and variables from the data frame (note that remove_empty returns a new data frame, so the result must be assigned):
#install.packages("janitor")
library(janitor)
combidat <- remove_empty(combidat, "rows")
combidat <- combidat[!is.na(combidat$ID), ]
Related
I am looking for something like a for-loop that looks at all rows of certain columns to create a new variable for row [i].
I have a data frame that basically includes three columns: interval start, interval end, and measuring date. In all three columns the values have the format YearMonthDayHourMinute. The measuring dates run continuously every 10 minutes. The intervals are just short periods, which leaves a lot of NAs where the measuring date is not matched.
The data frame looks like this:
interval_start interval_end measuring_date
1 NA NA 201805021210
2 201805021220 201805021250 201805021220
3 NA NA 201805021230
4 NA NA 201805021240
5 NA NA 201805021250
6 NA NA 201805021300
Now I want R to create a new column that gives a "Yes" where the measuring date lies within an interval, and a "No" where it doesn't.
Like this:
interval_start interval_end measuring_date within_interval
1 NA NA 201805021210 No
2 201805021220 201805021250 201805021220 Yes
3 NA NA 201805021230 Yes
4 NA NA 201805021240 Yes
5 NA NA 201805021250 Yes
6 NA NA 201805021300 No
So I want R to take the measuring_date of row 1 and compare it to the interval_start and interval_end of rows 1 to 6, and the same for the measuring_date of row 2, and so on.
The problem is that I've tried for-loops with if/else and nested for-loops (see below), but R does not seem to take the measuring_date of row 1 and compare it with all rows of interval_start and interval_end; it compares only within the same row. So all I can get is:
interval_start interval_end measuring_date within_interval
1 NA NA 201805021210 No
2 201805021220 201805021250 201805021220 Yes
3 NA NA 201805021230 No
4 NA NA 201805021240 No
5 NA NA 201805021250 No
6 NA NA 201805021300 No
Does anyone know a solution to this problem? Maybe there are solutions outside of a for-loop that I didn't come across.
I've been searching the whole internet but didn't find any solution, which left me quite frustrated. Even my supervisor is helpless.
I hope my question is clear enough; sorry, I am using Stack Overflow for the first time.
for (i in 1:nrow(masterX)) {
  masterX$Within_Searching_Period[i] <- NA
  for (j in 1:nrow(masterX)) {
    if (masterX$MESS_DATUM[i] >= masterX$time_date_start_min[j] &
        masterX$MESS_DATUM[i] <= masterX$time_date_end_min[j]) {
      masterX$Within_Searching_Period[i] <- "YES"
    } else {
      masterX$Within_Searching_Period[i] <- "NO"
    }
  }
}
Using the data.table package, you can use a non-equi join to find whether measuring_date falls within any interval:
DT[, within_interval :=
DT[DT, .N > 0 ,on=.(interval_start <= measuring_date, interval_end >= measuring_date), by=.EACHI]$V1
]
output:
interval_start interval_end measuring_date within_interval
1: <NA> <NA> 2018-05-02 12:10:00 FALSE
2: 2018-05-02 12:20:00 2018-05-02 12:50:00 2018-05-02 12:20:00 TRUE
3: <NA> <NA> 2018-05-02 12:30:00 TRUE
4: <NA> <NA> 2018-05-02 12:40:00 TRUE
5: <NA> <NA> 2018-05-02 12:50:00 TRUE
6: <NA> <NA> 2018-05-02 13:00:00 FALSE
data:
library(data.table)
DT <- fread("interval_start,interval_end,measuring_date
NA,NA,201805021210
201805021220,201805021250,201805021220
NA,NA,201805021230
NA,NA,201805021240
NA,NA,201805021250
NA,NA,201805021300", colClasses="character")
DT[, (names(DT)) := lapply(.SD, as.POSIXct, format="%Y%m%d%H%M")]
Instead of focusing on the loop, it's better practice to write vectorized code. I've used some lubridate and tidyverse syntax; if you deal with dates a lot, it's worth looking at lubridate's documentation.
library(tidyverse)
library(lubridate)
data %>%
  mutate_at(vars(time_date_start_min, time_date_end_min, MESS_DATUM), ymd_hm) %>%
  mutate(within_interval =
           ifelse(MESS_DATUM %within% interval(time_date_start_min, time_date_end_min) %in% TRUE,
                  "Yes", "No"))
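For completeness, the "compare each measuring_date against every interval" step can also be sketched in base R (column names and values taken from the example above; the timestamps are compared as plain numbers, which works for this fixed-width YearMonthDayHourMinute format):

```r
df <- data.frame(
  interval_start = c(NA, 201805021220, NA, NA, NA, NA),
  interval_end   = c(NA, 201805021250, NA, NA, NA, NA),
  measuring_date = c(201805021210, 201805021220, 201805021230,
                     201805021240, 201805021250, 201805021300)
)
starts <- df$interval_start[!is.na(df$interval_start)]
ends   <- df$interval_end[!is.na(df$interval_end)]
# For each measuring_date, test membership in *any* interval, not just its own row
df$within_interval <- ifelse(
  vapply(df$measuring_date, function(m) any(m >= starts & m <= ends), logical(1)),
  "Yes", "No")
df$within_interval  # "No" "Yes" "Yes" "Yes" "Yes" "No"
```

The vapply() over measuring_date is what breaks the row-by-row pairing that limited the original loop.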
I have a data frame df with a column containing values (meter readings). Some values are sporadically missing (NA).
df excerpt:
row time meter_reading
1 03:10:00 26400
2 03:15:00 NA
3 03:20:00 27200
4 03:25:00 28000
5 03:30:00 NA
6 03:35:00 NA
7 03:40:00 30000
What I'm trying to do:
If there is only one consecutive NA, I want to interpolate (e.g. na.interpolation for row 2).
But if there are two or more consecutive NAs, I don't want R to interpolate; the values should stay NA (e.g. rows 5 and 6).
What I have tried so far is a for-loop with an if-condition. My approach:
library("imputeTS")
for (i in 1:nrow(df)) {
  if (!is.na(df$meter_reading[i]) & is.na(df$meter_reading[i-1]) & !is.na(df$meter_reading[i-2])) {
    na_interpolation(df$meter_reading)
  }
}
Giving me :
Error in if (!is.na(df$meter_reading[i]) & is.na(df$meter_reading[i - :
argument is of length zero
Any ideas how to do it? Am I completely wrong here?
Thanks!
I don't know what your na.interpolation does, but taking the mean of the previous and next rows, for example, you could do that with dplyr:
df %>% mutate(x = ifelse(is.na(meter_reading),
                         (lag(meter_reading) + lead(meter_reading)) / 2,
                         meter_reading))
# row time meter_reading x
#1 1 03:10:00 26400 26400
#2 2 03:15:00 NA 26800
#3 3 03:20:00 27200 27200
#4 4 03:25:00 28000 28000
#5 5 03:30:00 NA NA
#6 6 03:35:00 NA NA
#7 7 03:40:00 30000 30000
A quick look shows that your counter i starts at 1 and you then try to index at i-1 and i-2. For i = 1, df$meter_reading[0] has length zero, so the whole if() condition has length zero, which is exactly the error you see.
Just an addition here: in the current imputeTS package version there is also a maxgap option for each imputation algorithm, which solves this problem easily. It probably wasn't there yet when you asked this question.
Your code would look like this:
library("imputeTS")
na_interpolation(df, maxgap = 1)
This means gaps of 1 NA get imputed, while longer gaps of consecutive NAs remain NA.
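If you'd rather not depend on a package, the same single-gap rule can be sketched in base R (readings taken from the question; linear interpolation reduces to the neighbour mean here because the readings are evenly spaced):

```r
x <- c(26400, NA, 27200, 28000, NA, NA, 30000)
prev_ok <- !is.na(c(NA, head(x, -1)))   # is the value before each position present?
next_ok <- !is.na(c(tail(x, -1), NA))   # is the value after each position present?
single  <- which(is.na(x) & prev_ok & next_ok)  # isolated NAs only
x[single] <- (x[single - 1] + x[single + 1]) / 2
x  # 26400 26800 27200 28000 NA NA 30000
```

Runs of two or more NAs fail the neighbour checks, so they are left untouched, matching the maxgap = 1 behaviour.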
I have some trouble with a dataset I have in data.table. Basically, I have 2 columns: scheduled delivery date and rescheduled delivery date. However, some values are left blank. An example:
Scheduled Rescheduled
NA NA
2016-11-14 2016-11-17
2016-11-15 NA
2016-11-13 2016-11-11
NA 2016-11-15
I want to create a new column that indicates the most recent date of both columns, for instance named max_scheduled_date. Therefore, if Rescheduled is NA, max_scheduled_date should take the value of Scheduled, whilst max_scheduled_date should take the value of Rescheduled if Scheduled is NA. When both columns are NA, max_scheduled_date should obviously be NA. When both columns have a date, it should take the most recent one.
I have had a lot of problems creating this and do not get the results I want. The dates are POSIXct, which unfortunately gives me some trouble. Can someone help me out?
Thank you in advance,
Kind regards,
Amanda
As the question is tagged with data.table, here is also a data.table solution.
pmax() seems to work sufficiently well with POSIXct. Therefore, I see no reason to coerce the date columns from POSIXct to Date class.
setDT(DF)[, max_scheduled_date := pmax(Scheduled, Rescheduled, na.rm = TRUE)]
DF
Scheduled Rescheduled max_scheduled_date
1: <NA> <NA> <NA>
2: 2016-11-14 2016-11-17 2016-11-17
3: 2016-11-15 <NA> 2016-11-15
4: 2016-11-13 2016-11-11 2016-11-13
5: <NA> 2016-11-15 2016-11-15
Note that the new column is appended by reference, i.e., without copying the whole object.
Data
DF <- setDF(fread(
"Scheduled Rescheduled
NA NA
2016-11-14 2016-11-17
2016-11-15 NA
2016-11-13 2016-11-11
NA 2016-11-15"
)[, lapply(.SD, as.POSIXct)])
str(DF)
'data.frame': 5 obs. of 2 variables:
$ Scheduled : POSIXct, format: NA "2016-11-14" "2016-11-15" ...
$ Rescheduled: POSIXct, format: NA "2016-11-17" NA ...
Assuming that both columns are of Date class, we can use pmax to get the row-wise maximum of the dates:
df1[] <- lapply(df1, as.Date) #change to Date class initially
df1$max_scheduled_date <- do.call(pmax, c(df1, na.rm = TRUE))
df1$max_scheduled_date
#[1] NA "2016-11-17" "2016-11-15" "2016-11-13" "2016-11-15"
It can also be done with the tidyverse
library(dplyr)
df1 %>%
mutate_all(as.Date) %>%
mutate(max_scheduled_date = pmax(Scheduled, Rescheduled, na.rm = TRUE))
This might be a simple question, but I have tried a few things and they're not working.
I have a large data frame with date/time formats in. An example of my data frame is:
Index FixTime1 FixTime2
1 2017-05-06 10:11:03 NA
2 NA 2017-05-07 11:03:03
I want to remove all NAs from the data frame and make them "" (blank). I have tried:
df[is.na(df)]<-""
but this gives the error:
Error in as.POSIXlt.character(value) :
character string is not in a standard unambiguous format
Again, this is probably very simple to fix, but I can't find how to do it while keeping each of these columns in date/time format.
We can use replace
df[] <- replace(as.matrix(df), is.na(df), "")
df
# Index FixTime1 FixTime2
#1 1 2017-05-06 10:11:03
#2 2 2017-05-07 11:03:03
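One caveat worth knowing about this approach (a sketch on data shaped like the question's): as.matrix() coerces the whole frame to character, so after the assignment every column, including Index, is a character column. The values still print like dates, but the POSIXct class the asker wanted to keep is gone.

```r
df <- data.frame(Index = 1:2,
                 FixTime1 = as.POSIXct(c("2017-05-06 10:11:03", NA)),
                 FixTime2 = as.POSIXct(c(NA, "2017-05-07 11:03:03")))
df[] <- replace(as.matrix(df), is.na(df), "")
sapply(df, class)  # every column is now "character"; the POSIXct class is lost
```

If the date/time class must survive, the columns cannot hold "" at all, since "" is not a valid POSIXct value; keeping real NAs (or converting to character deliberately) are the two consistent options.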
Here is a possible solution on a toy dataset; adapt this code to your needs:
df<-data.frame(date=c("01/01/2017",NA,"01/02/2017"))
df
date
1 01/01/2017
2 <NA>
3 01/02/2017
Convert from factor to character, and then remove the NA:
df$date <- as.character(df$date)
df[is.na(df$date),]<-""
df
date
1 01/01/2017
2
3 01/02/2017
In your specific example, this could be fine:
df_2 <- data.frame(Index = c(1, 2),
                   FixTime1 = c("2017-05-06 10:11:03", NA),
                   FixTime2 = c(NA, "2017-05-07 11:03:03"))
df_2<-data.frame(lapply(df_2, as.character), stringsAsFactors=FALSE)
df_2[is.na(df_2$FixTime1),"FixTime1"]<-""
df_2[is.na(df_2$FixTime2),"FixTime2"]<-""
df_2
Index FixTime1 FixTime2
1 1 2017-05-06 10:11:03
2 2 2017-05-07 11:03:03
I have a data.table containing two date variables. The data set was read into R from a .csv file (originally an .xlsx file) as a data.frame, and the two variables were then converted to date format using as.Date() so that they display as below:
df
id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 <NA> 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 <NA> <NA>
I then converted the data.frame to a data.table:
df <- data.table(df)
I then wanted to create a third variable that would include "specdate" if present, but replace it with "recdate" if "specdate" was missing (NA). This is where I'm having some difficulty: it seems that no matter how I approach it, data.table displays dates in date format only when a complete variable that is already in date format is copied. Otherwise, individual values are displayed as numbers (even when using as.IDate), and I gather that an origin date is needed to correct this. Is there any way to avoid supplying an origin date but still display the dates as dates in data.table?
Below is my attempt to fill the NAs of specdate with the recdate dates:
# Function to fill NAs:
fillnas <- function(dataref, lookupref, nacol, replacecol, replacelist = NULL) {
  nacol <- as.character(nacol)
  if (!is.null(replacelist)) {
    nacol <- factor(ifelse(dataref == lookupref & (is.na(nacol) | nacol %in% replacelist), replacecol, nacol))
  } else {
    nacol <- factor(ifelse(dataref == lookupref & is.na(nacol), replacecol, nacol))
  }
  nacol
}
# Fill the NAs in specdate with the function:
df[, finaldate := fillnas(dataref=id, lookupref=id, nacol=specdate, replacecol=as.IDate(recdate, format="%Y-%m-%d"))]
Here is what happens:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> NA
The display problem is compounded if I create the new variable from scratch by using ifelse:
df[, finaldate := ifelse(!is.na(specdate), specdate, recdate)]
This gives:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 16294
2: 2 2014-08-15 2014-08-20 16297
3: 3 2014-08-21 2014-08-26 16303
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 16307
6: 6 <NA> <NA> NA
Alternately, if I try a find-and-replace type approach, I get an error about the number of items to replace not matching the replacement length (I'm guessing because that approach is not vectorised?); the values from recdate are recycled and end up in the wrong place:
> df$finaldate <- df$specdate
> df$finaldate[is.na(df$specdate)] <- df$recdate
Warning message:
In NextMethod(.Generic) :
number of items to replace is not a multiple of replacement length
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 2014-08-17
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> 2014-08-20
So in conclusion: the function I applied gets me closest to what I want, except that where NAs have been replaced, the replacement value is displayed as a number and not as a date. Once displayed as a number, the origin is required to display it as a date again (and I would like to avoid supplying the origin, since I usually don't know it, and it seems unnecessarily repetitive to have to supply it when the date was originally in the correct format).
Any insights as to where I'm going wrong would be much appreciated.
I'd approach it like this, maybe:
DT <- data.table(df)
DT[, finaldate := specdate]
DT[is.na(specdate), finaldate := recdate]
It seems you want to add a new column so you can retain the original columns as well. I do that a lot too. Sometimes, I just update in place:
DT <- data.table(df)
DT[!is.na(specdate), specdate:=recdate]
setnames(DT, "specdate", "finaldate")
Using i like that avoids creating a whole new column's worth of data, which might be very large. It depends on how important retaining the original columns is to you, how many of them there are, and your data size. (Note that a whole column's worth of data is still created by the is.na() call, and again by !, but at least there isn't a third column's worth for the new finaldate. It would be great to optimize i = !is.na() in future (#1386); if you use data.table this way now, you won't need to change your code to benefit.)
It seems that you might have various "NA" strings that you're replacing. Note that fread in v1.9.6 on CRAN has a fix for that. From README :
correctly handles the na.strings argument for all types of columns - it detects possible NA values without coercion to character, as in base read.table. Fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.
Btw, you've made one of the top 3 mistakes mentioned here : https://github.com/Rdatatable/data.table/wiki/Support
Works for me. You may want to check that your NA values are not the strings or factor levels "<NA>"; those look like real NA values but aren't:
dt[, finaldate := ifelse(is.na(specdate), recdate, specdate)][
,finaldate := as.POSIXct(finaldate*86400, origin="1970-01-01", tz="UTC")]
# id specdate recdate finaldate
# 1: 1 2014-08-12 2014-08-17 2014-08-12
# 2: 2 2014-08-15 2014-08-20 2014-08-15
# 3: 3 2014-08-21 2014-08-26 2014-08-21
# 4: 4 NA 2014-08-28 2014-08-28
# 5: 5 2014-08-25 2014-08-30 2014-08-25
# 6: 6 NA NA NA
Data
df <- read.table(text=" id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 NA 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 NA NA", header = TRUE, stringsAsFactors = FALSE)
# convert the date columns to Date class so the arithmetic above works
df[c("specdate", "recdate")] <- lapply(df[c("specdate", "recdate")], as.Date)
dt <- as.data.table(df)
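As a footnote on why the origin multiplication is needed at all: base ifelse() strips the Date/POSIXct class from its result, whereas subset-assignment preserves it. A minimal base-R sketch (dates invented):

```r
specdate <- as.Date(c("2014-08-12", NA))
recdate  <- as.Date(c("2014-08-17", "2014-08-28"))
class(ifelse(is.na(specdate), recdate, specdate))  # "numeric" -- class lost
# Subset-assignment keeps the Date class, so no origin is needed:
finaldate <- specdate
finaldate[is.na(specdate)] <- recdate[is.na(specdate)]
class(finaldate)  # "Date"
finaldate         # "2014-08-12" "2014-08-28"
```

This is the same reason the asker's own ifelse attempt printed numbers like 16294: ifelse() builds its result from the logical test vector and never copies the class attribute back on.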