R date column error using data[data==""] <- NA

I am working with a data set which has all kinds of column classes, including class "Date". I try to assign NA to all empty values in this data set the following way:
data[data==""] <- NA
The date column apparently causes a problem here, because I get the following error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
I do not really know why this error occurs, since there are no empty values in the date column, so nothing should happen there. The dates in the date column are in the standard format "%Y-%m-%d".
What is the problem here and how can I solve it?

You can create a logical index to subset columns other than the 'Date' class, and use that to replace the '' with NA
indx <- sapply(data, class)!='Date'
data[indx][data[indx]==''] <- NA
It is the 'Date' class that is creating the problem. Another option would be to convert the data to matrix so that all the columns will be character.
data[as.matrix(data)==''] <- NA
Or as suggested by @Frank (and using replace)
data[indx] <- lapply(data[indx], function(x) replace(x, which(x==''), NA))
data
set.seed(49)
data <- data.frame(Col1= sample(c('',LETTERS[1:3]), 10, replace=TRUE),
Col2=sample(c('',LETTERS[1:2]), 10, replace=TRUE),
Date=seq(as.Date('2010-01-01'),length.out=10, by='day'),
stringsAsFactors=FALSE)
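As a self-contained sketch of the column-restricted replacement, the snippet below uses a small made-up data frame (`dat` and its values are illustrative, not the asker's data) and selects the columns with `is.character` instead of excluding the 'Date' class. It assumes the empty strings occur only in character columns; factor columns would need separate handling.

```r
# Illustrative data: two character columns and one Date column.
dat <- data.frame(
  Col1 = c("", "A", "B", ""),
  Col2 = c("x", "", "y", "z"),
  Date = seq(as.Date("2010-01-01"), length.out = 4, by = "day"),
  stringsAsFactors = FALSE
)

# Logical index marking the columns stored as character.
is_chr <- vapply(dat, is.character, logical(1))

# Replace empty strings with NA in those columns only,
# so the Date column is never compared against "".
dat[is_chr] <- lapply(dat[is_chr], function(x) replace(x, which(x == ""), NA))

sum(is.na(dat$Col1))  # 2
class(dat$Date)       # "Date"
```

Selecting by `is.character` rather than by `class != 'Date'` also leaves numeric and logical columns untouched, which avoids needless comparisons against ''.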

Related

Convert character dates in r (weird format)

I have columns that are named "X1.1.21", "X12.31.20" etc.
I can get rid of all the "X"s by using the substring function:
names(df) <- substring(names(df), 2, 8)
I've been trying many different methods to change "1.1.21" into a date format in R, but I'm having no luck so far. How can I go about this?
Column names that start with a digit are not syntactically valid in R, which is why an X is prepended to them when the data is read. However, you can still keep column names that start with a number by using check.names = FALSE while reading the data.
If you want to include date format as column names, you can use :
df <- data.frame(X1.1.21 = rnorm(5), X12.31.20 = rnorm(5))
names(df) <- as.Date(names(df), 'X%m.%d.%y')
names(df)
#[1] "2021-01-01" "2020-12-31"
However, note that they look like dates but are still of type 'character'
class(names(df))
#[1] "character"
So if you are going to use the column names for some date calculation you need to change it to date type first.
as.Date(names(df))
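To tie the two suggestions together, here is a minimal sketch that reads data with `check.names = FALSE` and then uses the parsed dates for an actual calculation (ordering the columns chronologically). The CSV text and its values are made up for illustration, and the format string `"%m.%d.%y"` assumes the names are month.day.two-digit-year.

```r
# Illustrative CSV text; the values are made up.
csv_text <- "1.1.21,12.31.20\n0.5,1.2\n0.7,1.9"

# check.names = FALSE keeps the header as written (no "X" prefix).
df <- read.csv(text = csv_text, check.names = FALSE)
names(df)  # "1.1.21" and "12.31.20"

# Parse the names into real Date values for calculations.
col_dates <- as.Date(names(df), format = "%m.%d.%y")

# Reorder the columns chronologically and store ISO-formatted names.
df <- df[order(col_dates)]
names(df) <- format(sort(col_dates), "%Y-%m-%d")
names(df)  # "2020-12-31" and "2021-01-01"
```

Keeping the parsed dates in a separate vector such as `col_dates` avoids re-parsing the names, since the names themselves are always stored as character.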

Errors making a new date field conditional on other date field

I'm trying to make new date field based on two other columns. If 'R' is present in the Indicator column, I want the date to be the ReportDate. If 'R' is not present, I want the date to be IncidentDate. A working example:
IncidentDate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
ReportDate <- as.Date(c('2010-11-1','2008-5-25','2007-5-14'))
Indicator <- c('','R','')
incident_data <- data.frame(IncidentDate, ReportDate, Indicator)
typeof(IncidentDate) #double
incident_data$calculatedDate <- ifelse(incident_data$ReportDate=='R',as.Date(incident_data$ReportDate), as.Date(incident_data$IncidentDate))
This gives me an error:
Error in charToDate(x) :
character string is not in a standard unambiguous format
I've also tried:
incident_data$calculatedDate <- ifelse(incident_data$ReportDate=='R',as.Date(as.character(incident_data$ReportDate)), as.Date(as.character(incident_data$IncidentDate)))
Which gives me the same error. Why might this be happening?
In base R, it may be better to use assignment on a logical vector instead of ifelse for Date class as ifelse can coerce and remove the Date attribute.
i1 <- incident_data$Indicator=='R'
incident_data$calculatedDate <- incident_data$IncidentDate
incident_data$calculatedDate[i1] <- incident_data$ReportDate[i1]
The condition should be based on the Indicator column, not on ReportDate. Beyond that, ifelse drops the Date class and returns the underlying numeric values, so it may be better to use if_else or case_when. With if_else and case_when, there is a type check on the true and false cases.
library(dplyr)
if_else(incident_data$Indicator=='R',as.Date(incident_data$ReportDate),
as.Date(incident_data$IncidentDate))
#[1] "2010-11-01" "2008-05-25" "2007-03-14"
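The class-dropping behaviour of ifelse can be seen in a short base R sketch. The vectors below reuse the question's example values under shorter, illustrative names; converting back with `as.Date(..., origin = "1970-01-01")` is one possible workaround if ifelse has to be used, not the only one.

```r
incident  <- as.Date(c("2010-11-01", "2008-03-25", "2007-03-14"))
report    <- as.Date(c("2010-11-01", "2008-05-25", "2007-05-14"))
indicator <- c("", "R", "")

# ifelse() returns the day counts and loses the Date class.
raw <- ifelse(indicator == "R", report, incident)
class(raw)  # "numeric"

# Workaround 1: restore the class explicitly.
restored <- as.Date(raw, origin = "1970-01-01")

# Workaround 2: start from one column and overwrite the selected rows.
is_r <- indicator == "R"
calc <- incident
calc[is_r] <- report[is_r]

format(calc)  # "2010-11-01" "2008-05-25" "2007-03-14"
```

Both workarounds give the same dates; the indexed assignment never leaves the Date class, so it needs no origin.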

R - Removing a specific row

I have a data frame called df and I want to remove a specific row that contains NA.
As commented above, you should provide a reproducible R example. If I understand correctly, you can use the subset function.
# Generating some fake data:
set.seed(101)
df <- data.frame("StudyID" = paste("Study", seq(1:100), sep = "_"),
"Column" = sample(c(1:30, NA),100, replace = TRUE))
Use subset with !is.na() if your NA is a Not Available value
newdf <- subset(df, !is.na(Column))
If your NA is a character:
# Numeric to character conversion
df$Column<- as.character(df$Column)
# Replace missing values with "NA"
df$Column[is.na(df$Column)] <- "NA"
Thus, just subsetting:
newdf <- subset(df, Column != "NA")
Here is a solution using grepl from base R, considering NA as a character.
pattern<-"NA"
df <-df[!grepl(pattern, df$Column),]
If possible share sample data for better clarity on the data
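For comparison, here is a small sketch of the same row removal with plain logical indexing, using a made-up five-row data frame (the values are illustrative). It assumes the missing values are real NA values rather than the string "NA".

```r
# Illustrative data with two missing values.
df <- data.frame(StudyID = paste0("Study_", 1:5),
                 Column  = c(3, NA, 7, NA, 1))

# Drop every row where Column is NA.
kept <- df[!is.na(df$Column), ]
nrow(kept)  # 3

# complete.cases() does the same across all columns at once.
same <- df[complete.cases(df), ]

# Drop only one specific row: here, the first row that has an NA.
first_na  <- which(is.na(df$Column))[1]
one_fewer <- df[-first_na, ]
nrow(one_fewer)  # 4
```

Note that a pattern match such as grepl("NA", ...) also removes values that merely contain the letters NA, so `is.na` is the safer test when the values are truly missing.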

Converting data frame column from character to numeric

I have a data frame that I construct as such:
> yyz <- data.frame(a = c("1","2","n/a"), b = c(1,2,"n/a"))
> apply(yyz, 2, class)
a b
"character" "character"
I am attempting to convert the last column to numeric while still maintaining the first column as a character. I tried this:
> yyz$b <- as.numeric(as.character(yyz$b))
> yyz
a b
1 1
2 2
n/a NA
But when I run the apply class it is showing me that they are both character classes.
> apply(yyz, 2, class)
a b
"character" "character"
Am I setting up the data frame wrong? Or is it the way R is interpreting the data frame?
If we need only one column to be numeric
yyz$b <- as.numeric(as.character(yyz$b))
But if all the columns need to be changed to numeric, use lapply to loop over the columns and convert each to numeric by first converting it to character, since the columns were factors.
yyz[] <- lapply(yyz, function(x) as.numeric(as.character(x)))
Both columns in the OP's post are factors because of the string "n/a". This can be avoided when reading a file by using na.strings = "n/a" in read.table/read.csv, or, if we are using data.frame, we can keep character columns with stringsAsFactors=FALSE (the default was stringsAsFactors=TRUE before R 4.0.0 and is FALSE since then).
Regarding the usage of apply, it converts the dataset to matrix and matrix can hold only a single class. To check the class, we need
lapply(yyz, class)
Or
sapply(yyz, class)
Or check
str(yyz)
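The points above can be combined into one short sketch. It rebuilds the example with explicit character columns and marks "n/a" as NA before converting, which is one way (an assumption here, not part of the original answer) to avoid the coercion warning from as.numeric.

```r
# Same values as in the question, stored as character columns.
yyz <- data.frame(a = c("1", "2", "n/a"),
                  b = c("1", "2", "n/a"),
                  stringsAsFactors = FALSE)

# Mark the placeholder as missing, then convert only column b.
yyz$b[yyz$b == "n/a"] <- NA
yyz$b <- as.numeric(yyz$b)

# apply() goes through a character matrix, so it reports "character" twice.
apply(yyz, 2, class)

# sapply() inspects each column as stored.
sapply(yyz, class)
#           a           b
# "character"   "numeric"
```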

Calculate diff in data.frame

I'm trying to calculate the returns from a data.frame of prices.
diff(na.locf(precos_mes))
Some of the columns contain NA values, so I fill them with the na.locf function, but when I apply diff to the result, it returns the following error:
(list) object cannot be coerced to type 'double'
And when I try to unlist it, I lose all the information from each stock vector.
diff(as.numeric(unlist(na.locf(prices))))
Try
library(zoo)  # provides na.locf()
lapply(precos_mes, function(x) diff(na.locf(x)))
Or if you don't need to remove the NA values at the beginning
sapply(precos_mes, function(x) diff(na.locf(x, na.rm=FALSE)))
data
set.seed(24)
precos_mes <- as.data.frame(matrix(sample(c(NA,0:4), 20*5,
replace=TRUE), ncol=5))
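For readers without zoo installed, the column-wise idea can be sketched in base R. The helper `locf_base` and the `prices` data below are hypothetical and only illustrate the technique; the helper keeps leading NA values, similar to na.locf with na.rm=FALSE.

```r
# Hypothetical helper: carry the last observed value forward (base R only).
locf_base <- function(x) {
  idx <- cumsum(!is.na(x))      # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]  # idx 0 maps to NA, so leading NAs stay NA
}

# Illustrative prices with gaps.
prices <- data.frame(A = c(10, NA, 12, 15),
                     B = c(NA, 5, NA, 7))

# diff() needs a vector or matrix, so apply it to each column separately.
returns <- sapply(prices, function(x) diff(locf_base(x)))
returns[, "A"]  # 0 2 3
returns[, "B"]  # NA 0 2
```

Because every column keeps its full length here, sapply can bind the results into a matrix with one column per stock, so the per-stock structure is preserved.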
