Corrupt warning when trying to join in dplyr


Looking for some help with an error I have not seen before and can't seem to find any help around it here.
I am trying to join two datasets and then keep the non-duplicated entries. However, when I run the code below:
alllawsuits <- allfjccases %>%
inner_join(.,allprisonsnos) %>%
distinct(CASENAME,PLT,DEF,FILEDATE,TERMDATE,NOSedit,Docket,completename,.keep_all=T)
I receive the following error:
Joining, by = c("Docket", "NOSedit", "File.Year", "File.Month", "File.Day")
Error: Internal error in `date_validate()`: Corrupt `Date` with unknown type character.
The Date variable is not one of the merging variables and the code still does not run even if I exclude it from the two data frames. All the column types are the same across the two datasets and I have absolutely no idea how to fix this. Thoughts?
A couple other things: the code used to run fine in the past and merge works fine, just not any of the join functions from dplyr.

Your error refers to some unspecified tbl_df variable that has the Date class but contains character data. This may be, but is not necessarily, the variable named Date. You seem to be reading the error as being about that variable; that is a reasonable reading, but the message refers to the Date class, not to a column name.
This error is thrown by date_validate(), an internal function in newer versions of the vctrs package, which is a dependency of dplyr and tibble. vctrs is pickier about what constitutes a valid Date object than most operations in base R. It doesn't matter whether the Date class variable is part of the join key, because column validation is performed when the new tbl_df object is created as the result of the join.
Normally a Date object is a numeric vector with the attribute class = "Date". There is a special S3 method, base::is.numeric.Date(), whose only purpose is to make is.numeric(my_date_object) return FALSE for Date objects (it takes precedence over the primitive is.numeric function), but Dates ARE normally numeric values "under the hood", so to speak. If we strip off the class attribute we can verify this.
> test <- as.Date(c("2020-01-01", "2020-01-02"))
> test
[1] "2020-01-01" "2020-01-02"
> str(test)
Date[1:2], format: "2020-01-01" "2020-01-02"
> is.character(test)
[1] FALSE
> is.numeric(test)
[1] FALSE
> is.numeric(unclass(test))
[1] TRUE
However, it is also possible to create a Date object by explicitly assigning the Date class to a character vector whose elements are all coercible to numeric. The resulting object prints as if it were a normal Date object, but it is still a character vector:
> test <- structure(c("21424", "21425"), class = "Date")
> test
[1] "2028-08-28" "2028-08-29"
> str(test)
Date[1:2], format: "2028-08-28" "2028-08-29"
> is.character(test)
[1] TRUE
> is.numeric(test)
[1] FALSE
> is.numeric(unclass(test))
[1] FALSE
> tibble(a = test)
Error: Internal error in `date_validate()`: Corrupt `Date` with unknown type character.
And there's your error. Somewhere upstream of the code that you have shown, you performed some operation that created a nonstandard Date column. Either don't do that, or coerce the offending column to a normal date with something like
blah blah blah... %>%
mutate(my_column = as.Date(as.numeric(my_column), origin = "1970-01-01")) %>%
left_join(blah blah ...
A lot of base R methods don't care about this because they perform implicit conversion before anything gets passed to a compiled function. vctrs, and dplyr verbs and joins by extension, do care about this. They achieve better performance by skipping implicit conversion for many operations. But as a tradeoff they have to be pickier about object types.
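To track down which column is malformed before repairing it, a minimal base-R sketch along these lines may help. The data frame df, its column names, and the helper is_corrupt_date() are hypothetical and not part of the original question; the check simply reuses the is.numeric(unclass(x)) test shown above.

```r
# Hypothetical data frame with one well-formed and one malformed Date column
df <- data.frame(id = 1:2)
df$good <- as.Date(c("2020-01-01", "2020-01-02"))
df$bad  <- structure(c("21424", "21425"), class = "Date")

# A Date column is well formed when its underlying storage is numeric
# (is_corrupt_date is an illustrative helper, not a vctrs or dplyr function)
is_corrupt_date <- function(x) inherits(x, "Date") && !is.numeric(unclass(x))

# Names of the columns that would trip date_validate()
bad_cols <- names(df)[vapply(df, is_corrupt_date, logical(1))]
bad_cols

# Rebuild each offending column from its underlying values
df[bad_cols] <- lapply(df[bad_cols], function(x) {
  as.Date(as.numeric(unclass(x)), origin = "1970-01-01")
})
str(df)
```

Running the check on both inputs before the join should point to the column that needs fixing, which is not necessarily one of the join keys.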

Related

read.csv ;check.names=F; R;Look at the picture,why it works a treat?

Please look at the column named "if" (the second column). The difference is that with check.names=F, the "." after "if" disappears.
Sorry for not providing code: I tried to write some code to generate this data.frame as in the picture, but failed because of the "if". We know that "if" is a reserved word in R (like else, for, while, function). Here I deliberately used "if" as a column name (the 2nd column) to see whether R would do something unusual.
So, as another way, I typed "if" in Excel and saved the file as csv in order to use read.csv.
Question:
Why does "if." change to "if" after I use check.names=FALSE?

?read.csv describes check.names= in a similar fashion:
check.names: logical. If 'TRUE' then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
'make.names') so that they are, and also to ensure that there
are no duplicates.
The default action is to allow you to do something like dat$<column-name>, but unfortunately dat$if will fail with Error: unexpected 'if' in "dat$if", ergo check.names=TRUE changing it to something that the parser will not trip over. Note, though, that dat[["if"]] will work even when dat$if will not.
If you are wondering if check.names=FALSE is ever a bad thing, then imagine this:
dat <- read.csv(text = "a,a\n2,3")
dat
# a a.1
# 1 2 3
dat <- read.csv(text = "a,a\n2,3", check.names = FALSE)
dat
# a a
# 1 2 3
In the second case, how does one access the second column by name? dat$a returns 2 only. However, if you don't want to use $ or [[, and can instead select columns with a logical index, then dat[,colnames(dat) == "a"] does return both of them.
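The "if" case from the question can be reproduced without Excel. This is a small sketch using inline text instead of a file; the column name x and the values are made up for illustration.

```r
# check.names = FALSE keeps the header exactly as written in the file
dat <- read.csv(text = "x,if\n1,2", check.names = FALSE)
names(dat)            # "x" "if"

# With the default check.names = TRUE, names pass through make.names(),
# which appends "." to reserved words
make.names("if")      # "if."

# Two ways to reach a column whose name is a reserved word
dat[["if"]]
dat$`if`              # backticks let the parser accept the name after $
```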

R made my character variable into a factor, and now the function doesn't work

I have a data frame of hospital discharge data. It has 99,779 records with 263 variables, including up to 50 ICD-10-CM diagnosis codes per record. I loaded the file including the subcommand "stringsAsFactors= FALSE" and then copied just the diagnosis codes to another df to make it easier to look at the data in RStudio. My current goal is to assign injury severity codes using icdpictr. I ran that program successfully, then looked at the output. As documented in the author's site, when the 7th character of the ICD-10-CM code is "B" or "C", the program ignores it, although it should not. So I want to change the 7th character from "B" or "C" to the character that triggers the attention. Here is where I run into a problem. Setting aside that I don't know how to write a function that will do this for each of my 50 variables, I anticipate writing 50 nearly identical statements like this:
mutate(temp = if_else(substr(DIAG1,7,7) == 'B' | substr(DIAG1,7,7) == 'C',
paste(substr(DIAG1,1,6),'A',sep=""),
DIAG1),
DIAG1 = temp, ...
I ran the program with just this one mutate command. This is the error message that appears:
Error: Problem with mutate() input temp.
x false must be a character vector, not a factor object.
i Input temp is if_else(...).
Although I loaded the DIAG variables as character, when I copied them to the other table, R -- without my permission -- turned them into factors. That was very efficient, but now I can't handle them as character type.
How do I solve this problem?

This is probably because if_else, unlike ifelse, checks that its 'true' and 'false' values have the same type. In the OP's code, if 'DIAG1' is a factor, the 'false' case returns 'DIAG1' itself (a factor), while the 'true' case returns a character vector, because substr automatically coerces the factor to character. We can convert 'DIAG1' to character with as.character and it should work:
library(dplyr)
df2 <- df1 %>%
mutate(DIAG1 = as.character(DIAG1),
temp = if_else(substr(DIAG1, 7, 7) %in% c("B", "C"),
paste0(substr(DIAG1,1,6),'A'), DIAG1))
NOTE: When there is more than one value to compare against, instead of calling substr(DIAG1, 7, 7) twice and combining two elementwise == comparisons, we can use %in% with a single substr call.
NOTE2: From R 4.0, read.csv/read.table and data.frame construction calls default to stringsAsFactors = FALSE. Previously the default was TRUE, so it is worth checking the R version as well.
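The question also mentions not wanting to write 50 nearly identical statements. One possible base-R sketch is to loop over every column whose name starts with DIAG. The data frame below, its two columns, and the codes in it are invented for illustration, and the "^DIAG" name pattern is an assumption about how the real columns are named.

```r
# Hypothetical stand-in for the discharge data, with factor columns
df1 <- data.frame(
  DIAG1 = c("S52501B", "S52501A"),
  DIAG2 = c("S72001C", NA),
  stringsAsFactors = TRUE
)

# Assumed naming pattern for the diagnosis columns
diag_cols <- grep("^DIAG", names(df1), value = TRUE)

df1[diag_cols] <- lapply(df1[diag_cols], function(x) {
  x <- as.character(x)                        # undo any factor conversion
  hit <- !is.na(x) & substr(x, 7, 7) %in% c("B", "C")
  x[hit] <- paste0(substr(x[hit], 1, 6), "A") # replace the 7th character
  x
})
str(df1)
```

Because every column is converted with as.character first, the same code should work whether the columns arrived as factors or as character vectors.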

Data type double not converting to factor

Hi, I am trying to convert a column in my data frame from "double" to "factor", but it's not working.
Instead of a "factor", I appear to get an integer. I have tried a couple of other things from Stack Overflow but nothing seems to work. I have provided my code below along with the console output.
Task 1.5 - Change class type from Integer to Factor
typeof(iLPdf$class) #check type
iLPdf$class <- as.factor(iLPdf$class)
typeof(iLPdf$class) #check type
[1] "double"
iLPdf$class <- as.factor(iLPdf$class)
typeof(iLPdf$class) #check type
[1] "integer"

The issue here is that typeof checks the internal representation of an object. Factors are represented as integers. To check that something is actually a factor, use is.factor instead. From the docs:
typeof determines the (R internal) type or storage mode of any object
To verify this, you can check the well-known Species column of iris, which is a factor. typeof(iris$Species) nevertheless returns "integer", because R stores factors as integers.
Using is.factor is the better option. This ultimately boils down to the difference between types and classes in R.
is.factor(iris$Species)
[1] TRUE
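The same distinction can be seen on a small made-up vector (the values below are illustrative, not the OP's data):

```r
x <- c(10, 20, 20, 30)
f <- as.factor(x)

typeof(x)       # "double": storage of the original vector
typeof(f)       # "integer": factors store integer codes internally
class(f)        # "factor": the class is what determines behaviour
is.factor(f)    # TRUE
levels(f)       # "10" "20" "30"
as.integer(f)   # 1 2 2 3, the internal codes rather than the values

# To get the original numbers back, go through character
as.numeric(as.character(f))
```

So the conversion in the question did succeed; only the check was looking at the storage type instead of the class.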

Why is Date being returned as type 'double'?

I'm having some trouble working with the as.Date function in R. I have a vector of dates that I'm reading in from a .csv file that are coming in as a factor of integers or as character (depending on how I read in the file, but this doesn't seem to have anything to do with the issue), formatted as %m/%d/%Y.
I'm going through the file row by row, pulling out the date field and trying to convert it for use elsewhere using the following code:
tmpDtm <- as.Date(as.character(tempDF$myDate), "%m/%d/%Y")
This seems to give me what I want: for example, if I do this to a starting value of 12/30/2014, I get the value "2014-12-30" returned. However, if I examine this value using typeof(), R tells me that its data type is 'double'. Additionally, if I try to bind this to other values and store it in a data frame using c() or cbind(), it winds up being stored in the data frame as 16434, which looks to me like some sort of internal storage value of a date. I'm pretty sure that's what it is, too, because if I try to convert that value again using as.Date(), it throws an error asking for an origin.
So, two questions: Is this as expected? If so, is there a more appropriate way to convert a date so that I actually end up with a date-typed object?
Thank you

Dates are internally represented as double, as you can see in the following example:
> typeof(as.Date("09/12/16", "%m/%d/%y"))
[1] "double"
It is still marked as class Date, as in
> class(as.Date("09/12/16", "%m/%d/%y"))
[1] "Date"
and because it is a double, you can do computations with it. But because it is of class Date, these computations lead to Dates:
> as.Date("09/12/16", "%m/%d/%y") + 1
[1] "2016-09-13"
> as.Date("09/12/16", "%m/%d/%y") + 31
[1] "2016-10-13"
EDIT
I asked about c() and cbind() because they can be associated with some strange behaviour. See the following example, where switching the order within c() changes not the type but the class of the result:
> c(as.Date("09/12/16", "%m/%d/%y"), 1)
[1] "2016-09-12" "1970-01-02"
> c(1, as.Date("09/12/16", "%m/%d/%y"))
[1] 1 17056
> class(c(as.Date("09/12/16", "%m/%d/%y"), 1))
[1] "Date"
> class(c(1, as.Date("09/12/16", "%m/%d/%y")))
[1] "numeric"
EDIT 2 - c() and cbind force objects to be of one type. The first edit shows an anomaly of coercion, but generally, the vector must be of one shared type. cbind shares this behavior because it coerces to matrix, which in turn coerces to a single type.
For more help on typeof and class see this link
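As a small sketch of the consequence for the OP's workflow: cbind() yields a plain numeric matrix, while building the data frame with data.frame() keeps the Date class. The column names when and value are made up for illustration.

```r
d <- as.Date("12/30/2014", format = "%m/%d/%Y")
unclass(d)                  # 16434, days since 1970-01-01

# cbind() coerces to a numeric matrix, so the Date class is lost
m <- cbind(d, 1)
is.numeric(m)

# data.frame() stores each column separately, so the class survives
df <- data.frame(when = d, value = 1)
class(df$when)

# A bare day count can be turned back into a Date by supplying the origin
as.Date(16434, origin = "1970-01-01")
```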

This is as expected. You used typeof(); you probably should have used class():
R> Sys.Date()
[1] "2016-09-12"
R> typeof(Sys.Date()) # this more or less gives you how it is stored
[1] "double"
R> class(Sys.Date()) # whereas this gives you _behaviour_
[1] "Date"
R>
Minor advertisement: I have a new package anytime, currently in incoming at CRAN, which deals with this as it converts "anything" to POSIXct (via anytime()) or Date (via anydate()).
E.g.:
R> anydate("12/30/2014") # no format needed
[1] "2014-12-30"
R> anydate(as.factor("12/30/2014")) # converts from factor too
[1] "2014-12-30"
R>

How do you convert multiple columns to date format in R using lubridate?

I have a database with multiple columns of dates stored as character class. I want to use the lubridate package in R to convert them all at once. My trouble is not with parsing the date format, but with applying lubridate over multiple columns. Any suggestions?
crimes.df <- data.frame(offense.date = c('06102003', '05122006'), charge.date = c('07152003', '10012010'))
I have tried
crimes.df[,1:2]<-mdy(crimes.df[,1:2])
and
crimes.df[,1:2]<-lapply(crimes.df[,1:2], function(x) mdy(crimes.df[,1:2]))
both return this error:
Warning message:
All formats failed to parse. No formats found.
(and, inconveniently, wipe out all data in the columns.)

Using lapply, we loop over the columns of the dataset and apply the function mdy to each column.
crimes.df[] <- lapply(crimes.df, mdy)
In the OP's code, the anonymous function (function(x)) ignores its argument and calls mdy on the whole of crimes.df[,1:2]. If we use an anonymous function, then mdy should be applied on 'x':
crimes.df[] <- lapply(crimes.df, function(x) mdy(x))
Also, note that since there are only 2 columns, we don't need to specify crimes.df[,1:2]; assigning into crimes.df[] replaces every column while keeping the data frame structure.
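When only some of the columns hold dates, the same lapply pattern can be restricted to a named subset. The sketch below swaps in base R's as.Date with an explicit format in place of lubridate's mdy, so it is an alternative rather than the method above; the extra id column and the date_cols vector are made up for illustration.

```r
crimes.df <- data.frame(
  offense.date = c("06102003", "05122006"),
  charge.date  = c("07152003", "10012010"),
  id           = 1:2   # hypothetical non-date column that must be left alone
)

# Only these columns are converted
date_cols <- c("offense.date", "charge.date")

# "%m%d%Y" matches month, day and four-digit year with no separators
crimes.df[date_cols] <- lapply(crimes.df[date_cols], as.Date, format = "%m%d%Y")
str(crimes.df)
```

With lubridate loaded, replacing as.Date, format = "%m%d%Y" with mdy in the lapply call should give the same result.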
