This might be a simple question, but I have tried a few things and they're not working.
I have a large data frame containing date/time columns. An example of my data frame is:
Index FixTime1            FixTime2
1     2017-05-06 10:11:03 NA
2     NA                  2017-05-07 11:03:03
I want to remove all NAs from the data frame and make them "" (blank). I have tried:
df[is.na(df)]<-""
but this gives the error:
Error in as.POSIXlt.character(value) :
character string is not in a standard unambiguous format
Again, this is probably very simple to fix, but I can't find how to do it while keeping each of these columns in date/time format.
We can use replace
df[] <- replace(as.matrix(df), is.na(df), "")
df
# Index FixTime1 FixTime2
#1 1 2017-05-06 10:11:03
#2 2 2017-05-07 11:03:03
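Note that a data frame cannot hold "" inside a POSIXct column, so any solution necessarily coerces the date/time columns to character; that is also why the original df[is.na(df)] <- "" errors. A minimal sketch of the same idea done column by column (assuming FixTime1 and FixTime2 are POSIXct, as in the question):
# format() turns POSIXct into character with the same
# "YYYY-MM-DD HH:MM:SS" display and leaves NAs as NA
df$FixTime1 <- format(df$FixTime1)
df$FixTime2 <- format(df$FixTime2)
df[is.na(df)] <- ""   # now the blanking works without a POSIXct error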
Here is a possible solution on a toy dataset; adapt this code to your needs:
df<-data.frame(date=c("01/01/2017",NA,"01/02/2017"))
df
date
1 01/01/2017
2 <NA>
3 01/02/2017
Convert from factor to character, and then replace the NA:
df$date <- as.character(df$date)
df[is.na(df$date),]<-""
df
date
1 01/01/2017
2
3 01/02/2017
In your specific example, this could be fine:
df_2 <- data.frame(Index = c(1, 2),
                   FixTime1 = c("2017-05-06 10:11:03", NA),
                   FixTime2 = c(NA, "2017-05-07 11:03:03"))
df_2<-data.frame(lapply(df_2, as.character), stringsAsFactors=FALSE)
df_2[is.na(df_2$FixTime1),"FixTime1"]<-""
df_2[is.na(df_2$FixTime2),"FixTime2"]<-""
df_2
Index FixTime1 FixTime2
1 1 2017-05-06 10:11:03
2 2 2017-05-07 11:03:03
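The two NA-replacement lines could also be written more compactly with dplyr and tidyr (a sketch, assuming the FixTime columns are already character, as after the lapply() step above):
library(dplyr)
library(tidyr)

# replace_na() swaps NA for "" in every column whose name starts with "FixTime"
df_2 <- df_2 %>%
  mutate(across(starts_with("FixTime"), ~ replace_na(.x, "")))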
My question concerns lagging data in R, where R should be aware of the time index. I hope the question has not been asked in another thread. Let's consider a simple setup:
df <- data.frame(date=as.Date(c("1990-01-01","1990-02-01","1990-01-15","1990-03-01","1990-05-01","1990-07-01","1993-01-02")), value=1:7)
This code should generate a table like
date        value
1990-01-01      1
1990-02-01      2
1990-01-15      3
1990-03-01      4
1990-05-01      5
1990-07-01      6
My aim is now to lag "value" by e.g. one month, such that when I compute the lagged value for "1990-05-01" (which would come from 1990-04-01, a date not present in the data), the result is NA in that row. When I use the standard lag function, R is not aware of the time index and simply uses the value 4 of 1990-03-01, which is not what I want. Has anyone an idea what I could do here?
Thanks in advance! :)
All the best,
Leon
You can try lubridate's %m-% to step back one month, like below:
library(lubridate)
transform(
df,
value_lag = value[match(date %m-% months(1), date)]
)
which gives
date value value_lag
1 1990-01-01 1 NA
2 1990-02-01 2 1
3 1990-01-15 3 NA
4 1990-03-01 4 2
5 1990-05-01 5 NA
6 1990-07-01 6 NA
7 1993-01-02 7 NA
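An equivalent way to think about it (a sketch, not part of the answer above) is to self-join the frame onto itself on the date shifted back one month; rows whose previous-month date is absent pick up NA automatically:
library(dplyr)
library(lubridate)

df %>%
  mutate(prev_month = date %m-% months(1)) %>%            # the date one month earlier
  left_join(df %>% rename(value_lag = value),             # look that date up in df itself
            by = c("prev_month" = "date")) %>%
  select(-prev_month)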
For an example with multiple columns, let's consider:
df <- data.frame(date=as.Date(c("1990-01-01","1990-02-01","1990-01-15","1990-03-01","1990-05-01","1990-07-01","1993-01-02")), value=1:7, value2=7:13)
I recently found the following solution myself:
library(dplyr)
lags <- 1  # number of months to lag by
df %>%
  as_tibble() %>%
  mutate(across(2:ncol(df), .fns = function(x) x[match(date %m-% months(lags), date)], .names = "{.col}_lag"))
Thanks for your code, @ThomasisCoding. :)
I have been trying to convert a date from factor format to date format but I've been facing errors every time. The data is of the format
Mon/Yr
201701
201602
201506
Currently the values are of factor type. I want to convert them to date format. I've used the following code, but I've been getting NA values:
as.character(x$`Mon/Yr`)
as.POSIXct(x$`Mon/Yr`, format = '%y%m')
Output: [1] NA NA NA
I've followed example solutions from many posts but I'm not able to fix it. Can you please suggest a fix for this?
library(lubridate)
library(dplyr)
df <- data.frame("Mon/Yr" = c(201701, 201602, 201506))
df
> df
Mon.Yr
1 201701
2 201602
3 201506
df2 <- df %>%
dplyr::mutate(Mon.Yr = lubridate::parse_date_time(Mon.Yr, '%Y%m'),
Mon.Yr = base::format(Mon.Yr, "%Y-%m"))
df2
> df2
Mon.Yr
1 2017-01
2 2016-02
3 2015-06
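A base-R alternative (a minimal sketch, not taken from the answer above): a Date needs a day component, so append one before parsing:
x <- factor(c(201701, 201602, 201506))
# paste a day onto the year-month string, then parse as a Date
as.Date(paste0(as.character(x), "01"), format = "%Y%m%d")
# [1] "2017-01-01" "2016-02-01" "2015-06-01"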
I have a data frame df with a column containing values (meter readings). Some values are sporadically missing (NA).
df excerpt:
row time meter_reading
1 03:10:00 26400
2 03:15:00 NA
3 03:20:00 27200
4 03:25:00 28000
5 03:30:00 NA
6 03:35:00 NA
7 03:40:00 30000
What I'm trying to do:
If there is only one consecutive NA, I want to interpolate (e.g. na.interpolation for row 2).
But if there's two or more consecutive NA, I don't want R to interpolate and leave the values as NA. (e.g. row 5 and 6).
What I have tried so far is a for loop with an if condition. My approach:
library("imputeTS")
for(i in 1:(nrow(df))) {
if(!is.na(df$meter_reading[i]) & is.na(df$meter_reading[i-1]) & !is.na(df$meter_reading[i-2])) {
na_interpolation(df$meter_reading)
}
}
This gives me:
Error in if (!is.na(df$meter_reading[i]) & is.na(df$meter_reading[i - :
argument is of length zero
Any ideas how to do it? Am I completely wrong here?
Thanks!
I don't know what your na.interpolation does, but taking the mean of the previous and next rows, for example, you could do that with dplyr:
library(dplyr)
df %>% mutate(x = ifelse(is.na(meter_reading),
                         (lag(meter_reading) + lead(meter_reading)) / 2,
                         meter_reading))
# row time meter_reading x
#1 1 03:10:00 26400 26400
#2 2 03:15:00 NA 26800
#3 3 03:20:00 27200 27200
#4 4 03:25:00 28000 28000
#5 5 03:30:00 NA NA
#6 6 03:35:00 NA NA
#7 7 03:40:00 30000 30000
A quick look shows that your counter i starts at 1, and you then try to index at i-1 and i-2; df$meter_reading[0] is a zero-length vector, which is what triggers the "argument is of length zero" error in the if condition.
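A similar single-gap-only interpolation is also available in zoo (a sketch, not part of the answers here, assuming the same meter_reading column):
library(zoo)

# linear interpolation, but only across gaps of at most one NA;
# na.rm = FALSE keeps the untouched NAs in place
df$x <- na.approx(df$meter_reading, maxgap = 1, na.rm = FALSE)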
Just an addition here: in the current imputeTS package version there is also a maxgap option for each imputation algorithm, which easily solves this problem. It probably wasn't there yet when you asked this question.
Your code would look like this:
library("imputeTS")
na_interpolation(df, maxgap = 1)
This means gaps of 1 NA get imputed, while longer gaps of consecutive NAs remain NA.
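A minimal sketch of this on the excerpt above (the data is re-typed here from the question, and the column name meter_reading is assumed):
library(imputeTS)

df <- data.frame(
  time = c("03:10:00", "03:15:00", "03:20:00", "03:25:00",
           "03:30:00", "03:35:00", "03:40:00"),
  meter_reading = c(26400, NA, 27200, 28000, NA, NA, 30000)
)

# the single NA in row 2 is interpolated; the run of two consecutive
# NAs in rows 5-6 stays NA because it exceeds maxgap = 1
df$meter_reading <- na_interpolation(df$meter_reading, maxgap = 1)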
I have a data.table containing two date variables. The data set was read into R from a .csv file (originally an .xlsx file) as a data.frame, and the two variables were then converted to Date format using as.Date() so that they display as below:
df
id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 <NA> 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 <NA> <NA>
I then converted the data.frame to a data.table:
df <- data.table(df)
I then wanted to create a third variable that would contain "specdate" if present, but "recdate" if "specdate" was missing (NA). This is where I'm having some difficulty: it seems that no matter how I approach this, data.table displays dates in date format only when a complete variable that is already in date format is copied. Otherwise, individual values are displayed as numbers (even when using as.IDate), and I gather that an origin date is needed to correct this. Is there any way to avoid supplying an origin date and still display the dates as dates in data.table?
Below is my attempt to fill the NAs of specdate with the recdate dates:
# Function to fill NAs:
fillnas <- function(dataref, lookupref, nacol, replacecol, replacelist=NULL) {
nacol <- as.character(nacol)
if(!is.null(replacelist)) nacol <- factor(ifelse(dataref==lookupref & (is.na(nacol) | nacol %in% replacelist), replacecol, nacol))
else nacol <- factor(ifelse(dataref==lookupref & is.na(nacol), replacecol, nacol))
nacol
}
# Fill the NAs in specdate with the function:
df[, finaldate := fillnas(dataref=id, lookupref=id, nacol=specdate, replacecol=as.IDate(recdate, format="%Y-%m-%d"))]
Here is what happens:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> NA
The display problem is compounded if I create the new variable from scratch by using ifelse:
df[, finaldate := ifelse(!is.na(specdate), specdate, recdate)]
This gives:
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 16294
2: 2 2014-08-15 2014-08-20 16297
3: 3 2014-08-21 2014-08-26 16303
4: 4 <NA> 2014-08-28 16310
5: 5 2014-08-25 2014-08-30 16307
6: 6 <NA> <NA> NA
Alternatively, if I try a find-and-replace approach, I get a warning that the number of items to replace is not a multiple of the replacement length (I'm guessing this is because that approach is not vectorised?), and the values from recdate are recycled and end up in the wrong place:
> df$finaldate <- df$specdate
> df$finaldate[is.na(df$specdate)] <- df$recdate
Warning message:
In NextMethod(.Generic) :
number of items to replace is not a multiple of replacement length
> df
id specdate recdate finaldate
1: 1 2014-08-12 2014-08-17 2014-08-12
2: 2 2014-08-15 2014-08-20 2014-08-15
3: 3 2014-08-21 2014-08-26 2014-08-21
4: 4 <NA> 2014-08-28 2014-08-17
5: 5 2014-08-25 2014-08-30 2014-08-25
6: 6 <NA> <NA> 2014-08-20
So, in conclusion: the function I applied gets me closest to what I want, except that where NAs have been replaced, the replacement value is displayed as a number and not in date format. Once displayed as a number, the origin is required to display it as a date again (and I would like to avoid supplying the origin, since I usually don't know it and it seems unnecessarily repetitive to supply it when the date was originally in the correct format).
Any insights as to where I'm going wrong would be much appreciated.
I'd approach it like this, maybe:
DT <- data.table(df)
DT[, finaldate := specdate]
DT[is.na(specdate), finaldate := recdate]
It seems you want to add a new column so you can retain the original columns as well. I do that a lot as well. Sometimes, I just update in place:
DT <- data.table(df)
DT[is.na(specdate), specdate := recdate]
setnames(DT, "specdate", "finaldate")
Using i like that avoids creating a whole new column's worth of data, which might be very large. It depends on how important retaining the original columns is to you, how many of them there are, and your data size. (Note that a whole column's worth of data is still created by the is.na() call, but at least there isn't a third column's worth for the new finaldate. It would be great to optimize i = !is.na() in future (#1386), and if you use data.table this way now, you won't need to change your code in future to benefit.)
It seems that you might have various "NA" strings that you're replacing. Note that fread in v1.9.6 on CRAN has a fix for that. From the README:
correctly handles the na.strings argument for all types of columns - it detects possible NA values without coercion to character, like in base read.table. Fixes #504. Thanks to @dselivanov for the PR. Also closes #1314, which closes this issue completely, i.e., na.strings = c("-999", "FALSE") etc. also work.
Btw, you've made one of the top 3 mistakes mentioned here: https://github.com/Rdatatable/data.table/wiki/Support
Works for me. You may want to test to be sure that your NA values are not the strings or factor levels "<NA>", which can look like real NA values:
dt[, finaldate := ifelse(is.na(specdate), recdate, specdate)][
,finaldate := as.POSIXct(finaldate*86400, origin="1970-01-01", tz="UTC")]
# id specdate recdate finaldate
# 1: 1 2014-08-12 2014-08-17 2014-08-12
# 2: 2 2014-08-15 2014-08-20 2014-08-15
# 3: 3 2014-08-21 2014-08-26 2014-08-21
# 4: 4 NA 2014-08-28 2014-08-28
# 5: 5 2014-08-25 2014-08-30 2014-08-25
# 6: 6 NA NA NA
Data
df <- read.table(text=" id specdate recdate
1 1 2014-08-12 2014-08-17
2 2 2014-08-15 2014-08-20
3 3 2014-08-21 2014-08-26
4 4 NA 2014-08-28
5 5 2014-08-25 2014-08-30
6 6 NA NA", header=T, stringsAsFactors=F)
dt <- as.data.table(df)
# the columns are read in as character above, so convert them to Date
# first; otherwise the finaldate * 86400 arithmetic in the code above fails
dt[, c("specdate", "recdate") := lapply(.SD, as.Date), .SDcols = c("specdate", "recdate")]
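As a side note (a sketch, not from the original answers): recent versions of data.table provide fifelse(), which, unlike base ifelse(), preserves the Date class, so no origin or POSIXct round-trip is needed:
# fifelse() keeps the Date attributes of specdate/recdate
dt[, finaldate := fifelse(is.na(specdate), recdate, specdate)]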
I have the following data frame:
id<-c(1,2,3,4)
date<-c("23-01-08","01-11-07","30-11-07","17-12-07")
df<-data.frame(id,date)
df$date2<-as.Date(as.character(df$date), format = "%d-%m-%y")
In the 4th column of my table, I want to divide my data into calib and valid based on date, such that where date <= 2007-12-16 the fourth column should be "calib"; otherwise it should be "valid".
I have written the following lines:
for ( i in 1:4)
if (df[i,3]<=2007-12-16)(df[i,4]="calib")else (df[i,4]="valid")
The first problem is that executing this command makes every cell in the 4th column "valid", so it seems the date condition is not being processed appropriately (the unquoted 2007-12-16 is evaluated as the arithmetic expression 2007 - 12 - 16 = 1979, so the comparison is against the number 1979 rather than a date). My first question is how I can solve this problem.
The second problem is that my real data frame has 600,000 rows, and executing this command takes hours. I wonder if there is any way to perform this faster and with full CPU capacity.
Thank you!
R is vectorised so you can do that in a single statement:
R> df <- within(df,state <- ifelse(date2<=as.Date("2007-12-16"),"calib","valid"))
R> df
id date date2 state
1 1 23-01-08 2008-01-23 valid
2 2 01-11-07 2007-11-01 calib
3 3 30-11-07 2007-11-30 calib
4 4 17-12-07 2007-12-17 valid
R>
If within, with, or transform seem strange, you can also do it directly:
R> df$state <- ifelse(df$date2<=as.Date("2007-12-16"),"calib","valid")
R> df
id date date2 state
1 1 23-01-08 2008-01-23 valid
2 2 01-11-07 2007-11-01 calib
3 3 30-11-07 2007-11-30 calib
4 4 17-12-07 2007-12-17 valid
R>
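Because the comparison is fully vectorised, the 600,000-row case should only take a moment. Another equivalent sketch (the cutoff name is mine) that avoids ifelse() by indexing a two-element vector with the logical comparison:
cutoff <- as.Date("2007-12-16")
# FALSE + 1 = 1 -> "valid", TRUE + 1 = 2 -> "calib"
df$state <- c("valid", "calib")[(df$date2 <= cutoff) + 1L]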