I am new to R and trying to read a csv file. The documentation shows a function read.csv(). However, when I read the file and check the type of the resulting variable, it shows a list, while the documentation says the function returns a data.frame. Can someone explain why that happens?
My code so far:
mytable <- read.csv(InputFile, header = TRUE, stringsAsFactors = FALSE)
dim(mytable)
typeof(mytable)
Output:
dim(mytable)
[1] 500 20
typeof(mytable)
[1] "list"
As explained in this answer: https://stackoverflow.com/a/6258536/8900683.
In R, every object has a mode and a class. The former describes how the object is stored in memory (numeric, character, list, function), while the latter describes its abstract type.
For example:
d <- data.frame(V1=c(1,2))
class(d)
# [1] "data.frame"
mode(d)
# [1] "list"
typeof(d)
# [1] "list"
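To make the distinction concrete, here is a minimal sketch (with made-up column names) showing that a data.frame behaves as a table but is stored as a list of columns:

```r
# A data.frame is, internally, a list whose elements are the columns.
d <- data.frame(V1 = c(1, 2), V2 = c("a", "b"))

class(d)          # "data.frame"  -- abstract type, controls behaviour
typeof(d)         # "list"        -- internal storage
is.list(d)        # TRUE: list operations also work on it
sapply(d, typeof) # each column has its own storage type
```

So typeof() reporting "list" is not an error; it is simply reporting the storage, while class() reports the behaviour you usually care about.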
Looking for some help with an error I have not seen before and can't seem to find any help around it here.
I am trying to join two datasets and then keep the non-duplicated entries. However, when I run the code below:
alllawsuits <- allfjccases %>%
  inner_join(., allprisonsnos) %>%
  distinct(CASENAME, PLT, DEF, FILEDATE, TERMDATE, NOSedit, Docket, completename, .keep_all = TRUE)
I receive the following error:
Joining, by = c("Docket", "NOSedit", "File.Year", "File.Month", "File.Day")
Error: Internal error in `date_validate()`: Corrupt `Date` with unknown type character.
The Date variable is not one of the merging variables and the code still does not run even if I exclude it from the two data frames. All the column types are the same across the two datasets and I have absolutely no idea how to fix this. Thoughts?
A couple of other things: the code used to run fine in the past, and merge() still works fine; it is just the join functions from dplyr that fail.
Your error is referencing some unspecified tbl_df variable that has the Date class but contains character data. This may be, but is not necessarily, the variable named Date. (That seems to be how you are interpreting the error, and it is a reasonable but incorrect interpretation.)
This error is thrown by date_validate(), an internal function in newer versions of the vctrs package, which is a dependency of dplyr and of tibble (the package providing tbl_df). vctrs is pickier about what constitutes a valid Date object than most operations in base R are. It doesn't matter whether the Date-class variable is part of the join key, because the column validation is performed when the new tbl_df object is created as the result of the join.
Normally a Date object is a numeric vector with the attribute class = "Date". For some reason there is a special S3 method specifically to prevent Date objects from returning, e.g., is.numeric(my_date_object) = TRUE (look at the definition of base::is.numeric.Date(); this method supersedes dispatch to the primitive is.numeric() for Date objects), but Date objects ARE normally numeric values "under the hood", so to speak. If we strip off the class attribute we can verify this.
> test <- as.Date(c("2020-01-01", "2020-01-02"))
> test
[1] "2020-01-01" "2020-01-02"
> str(test)
Date[1:2], format: "2020-01-01" "2020-01-02"
> is.character(test)
[1] FALSE
> is.numeric(test)
[1] FALSE
> is.numeric(unclass(test))
[1] TRUE
However, it is also possible to create a Date object by explicitly assigning the Date class to a character vector whose elements are all coercible to numeric. The resulting object prints as if it were a normal Date, but it is still a character vector:
> test <- structure(c("21424", "21425"), class = "Date")
> test
[1] "2028-08-28" "2028-08-29"
> str(test)
Date[1:2], format: "2028-08-28" "2028-08-29"
> is.character(test)
[1] TRUE
> is.numeric(test)
[1] FALSE
> is.numeric(unclass(test))
[1] FALSE
> tibble(a = test)
Error: Internal error in `date_validate()`: Corrupt `Date` with unknown type character.
And there's your error. Somewhere upstream of the code that you have shown, you performed some operation that created a nonstandard Date column. Either don't do that, or coerce the offending column to a normal date with something like
... %>%
  mutate(my_column = as.Date(as.numeric(my_column), origin = "1970-01-01")) %>%
  left_join(...
A lot of base R methods don't care about this, because they perform implicit conversion before anything gets passed to a compiled function. vctrs (and, by extension, dplyr verbs and joins) does care: it achieves better performance by skipping implicit conversion for many operations, but the tradeoff is that it has to be pickier about object types.
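If you need to track down which column is the offender before fixing it, a quick sketch along these lines may help (the helper name is mine, not part of any package):

```r
# Hypothetical helper: list columns that carry class "Date" but are not
# numeric under the hood -- exactly the situation date_validate() rejects.
corrupt_date_cols <- function(df) {
  bad <- vapply(df, function(x) inherits(x, "Date") && !is.numeric(unclass(x)),
                logical(1))
  names(df)[bad]
}

df <- data.frame(ok = as.Date(c("2020-01-01", "2020-01-02")))
df$bad <- structure(c("21424", "21425"), class = "Date")  # nonstandard Date

corrupt_date_cols(df)  # "bad"
```

Run this on both data frames before the join; any column it names is a candidate for the as.Date(as.numeric(...)) repair shown above.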
I have a data frame with an id field containing values such as these two:
587739706883375310
587739706883375408
The problem is that, when I ask R to show these two numbers, the output that I get is the following:
587739706883375360
587739706883375360
which are not the real values of my ID field. How do I solve that?
For your information: I have already executed options(scipen = 999) so that R does not convert my numbers to scientific notation.
This problem also happens in the R console: if I enter these example numbers directly, I get the same output as shown above.
EDIT: someone asked
dput(yourdata$id)
I did that and the result was:
c(587739706883375360, 587739706883375360, 587739706883375488, 587739706883506560, 587739706883637632, 587739706883637632, 587739706883703040)
To compare, the original data in the csv file is:
587739706883375310,587739706883375408,587739706883375450,587739706883506509,587739706883637600,587739706883637629,587739706883703070
I also did the following test with one of these numbers:
> 587739706883375408
[1] 587739706883375360
> as.double(587739706883375408)
[1] 587739706883375360
> class(as.double(587739706883375408))
[1] "numeric"
> is.double(as.double(587739706883375408))
[1] TRUE
You can use the bit64 package to represent such large numbers:
library(bit64)
as.integer64("587739706883375408")
# integer64
# [1] 587739706883375408
as.integer64("587739706883375408") + 1
# integer64
# [1] 587739706883375409
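To see why the digits collapse in the first place: a double has a 53-bit mantissa (roughly 15-16 significant decimal digits), so 18-digit integers cannot be represented exactly. If you only ever use the column as an identifier, reading it as character is a simple alternative (the file and column names below are hypothetical):

```r
# Both 18-digit literals round to the same nearest double:
587739706883375310 == 587739706883375408  # TRUE

# 2^53 is the limit up to which every integer is exactly representable
# as a double -- fewer digits than these IDs have:
2^53  # 9007199254740992

# Keeping the IDs as text preserves them exactly:
# df <- read.csv("ids.csv", colClasses = c(id = "character"))
```

Character IDs compare and join fine; bit64 is the better choice only if you actually need arithmetic on them.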
I'm having some trouble working with the as.Date function in R. I have a vector of dates that I'm reading in from a .csv file; they come in as a factor or as character (depending on how I read the file, but that doesn't seem to have anything to do with the issue), formatted as %m/%d/%Y.
I'm going through the file row by row, pulling out the date field and trying to convert it for use elsewhere using the following code:
tmpDtm <- as.Date(as.character(tempDF$myDate), "%m/%d/%Y")
This seems to give me what I want. For example, applying it to the starting value 12/30/2014 returns "2014-12-30". However, if I examine this value using typeof(), R tells me its data type is 'double'. Additionally, if I try to bind it to other values and store it in a data frame using c() or cbind(), it winds up being stored as 16434, which looks to me like some sort of internal storage value of a date. I'm pretty sure that's what it is, too, because if I try to convert that value again using as.Date(), it throws an error asking for an origin.
So, two questions: Is this as expected? If so, is there a more appropriate way to convert a date so that I actually end up with a date-typed object?
Thank you
Dates are internally represented as double, as you can see in the following example:
> typeof(as.Date("09/12/16", "%m/%d/%y"))
[1] "double"
it is still marked with class Date, as in
> class(as.Date("09/12/16", "%m/%d/%y"))
[1] "Date"
and because it is a double, you can do computations with it. But because it is of class Date, these computations yield Dates:
> as.Date("09/12/16", "%m/%d/%y") + 1
[1] "2016-09-13"
> as.Date("09/12/16", "%m/%d/%y") + 31
[1] "2016-10-13"
EDIT
I have asked about c() and cbind() because they can be associated with some strange behaviour. See the following example, where switching the order within c() changes not the type but the class of the result:
> c(as.Date("09/12/16", "%m/%d/%y"), 1)
[1] "2016-09-12" "1970-01-02"
> c(1, as.Date("09/12/16", "%m/%d/%y"))
[1] 1 17056
> class(c(as.Date("09/12/16", "%m/%d/%y"), 1))
[1] "Date"
> class(c(1, as.Date("09/12/16", "%m/%d/%y")))
[1] "numeric"
EDIT 2 - c() and cbind() coerce objects to a single type. The first edit shows an anomaly of that coercion, but in general the resulting vector must be of one shared type. cbind() shares this behavior because it coerces to a matrix, which in turn must be of a single type.
For more help on typeof and class see this link
This is as expected. You used typeof(); you probably should have used class():
R> Sys.Date()
[1] "2016-09-12"
R> typeof(Sys.Date()) # this more or less gives you how it is stored
[1] "double"
R> class(Sys.Date()) # whereas this gives you _behaviour_
[1] "Date"
R>
Minor advertisement: I have a new package, anytime, currently in incoming at CRAN, which deals with this as it converts "anything" to POSIXct (via anytime()) or Date (via anydate()).
E.g.:
R> anydate("12/30/2014") # no format needed
[1] "2014-12-30"
R> anydate(as.factor("12/30/2014")) # converts from factor too
[1] "2014-12-30"
R>
I have a function which I want to extend with the ability to save its results to a csv file. The name of the csv file should be generated from the name of the data.frame passed to the function:
my.func1 <- function(dframe, ...){
  # PART OF CODE RESPONSIBLE FOR COMPUTATION
  # ...
  # PART OF CODE WHERE I WANT TO STORE RESULTS AS CSV
  csv <- deparse(substitute(dframe))
  csv
}
When I call this function the following way, the name of the dataset passed to it is interpreted correctly:
> my.func1(mtcars)
[1] "mtcars"
But I need to call this function for each data.frame in a list. If I call it for a particular data.frame from the list, it basically works (I get an ugly name that also contains the name of the list, but one workaround could be to trim it with a regular expression):
> LoDFs <- list(first=data.frame(y1=c(1,2,3), y2=c(4,5,6)), second=data.frame(yA=c(1,2,3), yB=c(4,5,6)))
> my.func1(LoDFs$first)
[1] "LoDFs$first"
The problem is when I want to call this function for all data.frames in the list. In that case the data.frame names are a mess:
> lapply(LoDFs, my.func1)
$first
[1] "X[[i]]"
$second
[1] "X[[i]]"
> lapply(seq_along(LoDFs), function(x) { my.func1(LoDFs[[x]]) })
[[1]]
[1] "LoDFs[[x]]"
[[2]]
[1] "LoDFs[[x]]"
What am I doing wrong, and how can I avoid the mentioned workaround with regular expressions and make the code more robust?
If each data frame in the list is named:
lapply(names(LoDFs), function(i) write.csv(my.func1(LoDFs[[i]]), paste0(i, '.csv')))
(On phone, so forgive small mistakes.)
The issue is that lapply() does not pass the name of each list element to the function; it passes only the element itself.
An alternative solution is to use mapply(), which IMO is more explicit about the inputs rather than relying on scoping:
mapply(function(L, N){ write.csv(L, paste0(N, ".csv")) }, L = LoDFs, N = names(LoDFs))
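For completeness, the same idea works with base R's Map(), which also iterates elements and names in step. A sketch using the LoDFs list from the question, writing into tempdir() so it doesn't clutter the working directory:

```r
# Map() pairs each data frame with its own name.
LoDFs <- list(first  = data.frame(y1 = c(1, 2, 3), y2 = c(4, 5, 6)),
              second = data.frame(yA = c(1, 2, 3), yB = c(4, 5, 6)))

invisible(Map(function(df, nm) {
  write.csv(df, file.path(tempdir(), paste0(nm, ".csv")), row.names = FALSE)
}, LoDFs, names(LoDFs)))

list.files(tempdir(), pattern = "\\.csv$")  # includes "first.csv" "second.csv"
```

Either way, the key change is the same: iterate over the names (or names plus elements), not over the anonymous elements alone.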
I am new to the R language. If I use us_stocks$"LNC" I can get the corresponding zoo series (us_stocks is a zoo object, from the zoo library). resB is a list with the following elements:
resB
# [[1]] LNC 7
# [[2]] GAM 62
# [[3]] CMA 7
class(resB)
# [1] "list"
names(resB[[1]])
# [1] "LNC"
but when I use us_stocks$names(resB[[1]]) I cannot get the zoo series. How can I fix this?
It often takes a while to understand what is meant by "... $ is a function which does not evaluate its second argument." Most R functions would take names(resB[[1]]), evaluate it, and then act on the value. But not $. It expects its second argument to be an actual column name, given as an unquoted string. This is an example of "non-standard evaluation". You will also see it operating in the functions library() and help(), as well as in many functions of what is known, perhaps flippantly, as the hadleyverse, which includes the packages 'ggplot2' and 'dplyr'. The names of data frame columns or of the nodes of R lists are character literals; however, they are not really R names in the sense that their values cannot be accessed by typing an unquoted sequence of letters at the top level of the R console.
So, as already stated, you should be using us_stocks[[ names(resB[[1]]) ]]. This is also much safer in programming, since there are often scoping problems involved in using the $ function in anything other than interactive console use.
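A minimal stand-in example (a plain named list instead of the real zoo object) makes the difference visible:

```r
# Stand-in for us_stocks; the real object is a zoo series, but $ and [[
# treat names the same way here.
us_stocks <- list(LNC = c(7, 8), GAM = c(62, 63))
nm <- "LNC"

us_stocks$nm    # NULL -- $ looks for an element literally named "nm"
us_stocks[[nm]] # 7 8  -- [[ evaluates nm first, then indexes by its value
```

With [[, any expression that evaluates to a name works, including names(resB[[1]]) from the question.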