Preventing column-class inference in fread()

Is there a way for fread to mimic the behaviour of read.table, whereby the class of each variable is set by the data that is actually read in?
I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. With read.table I can stop this behaviour by setting nrows. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks
An example
d <- data.frame(x=c(1:100, NA, NA, "fff"), y=c(1:100, NA,NA,NA))
write.csv(d, "test.csv", row.names=F)
in_d <- read.csv("test.csv", nrow=100, header=T)
in_dt <- data.table::fread("test.csv", nrow=100)
Which produces
> str(in_d)
'data.frame': 100 obs. of 2 variables:
$ x: int 1 2 3 4 5 6 7 8 9 10 ...
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ x: chr "1" "2" "3" "4" ...
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
As a workaround I thought I would be able to use read.csv to read in one line, get the classes, and pass them as colClasses, but I am misunderstanding something.
cl <- read.csv("test.csv", nrow=1, header=T)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrow=100, colClasses=cols)
str(in_dt)
Using Windows8.1
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

Option 1: Using a system command
fread() allows the use of a system command in its first argument. We can use it to remove the quotes in the first column of the file.
indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
The system command cat test.csv | tr -d '\"' explained:
cat test.csv reads the file to standard output
| is a pipe, using the output of the previous command as input for the next command
tr -d '\"' deletes (-d) all occurrences of double quotes ('\"') from the current input
Option 2: Coercion after reading
Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().
library(data.table)
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]
str(indt2)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,
x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1] 1 4 NA 6
# Warning message:
# NAs introduced by coercion
type.convert(x, as.is = TRUE)
# [1] 1 4 NA 6
But of course you can use as.numeric() as well.
Note: This answer assumes data.table dev v1.9.5
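If more than one column comes back as character, the same coercion can be applied to all of them at once. This is only a sketch: the table below is a toy stand-in for the result of fread(), and the names dt and char_cols are made up for illustration.

```r
library(data.table)

# Toy table standing in for the result of fread(); column names are illustrative
dt <- data.table(x = c("1", "2", "3"), y = 4:6, z = c("a", "b", "c"))

# Names of the columns that were read as character
char_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Re-run type detection on those columns only; as.is = TRUE keeps
# non-numeric strings as character instead of turning them into factors
dt[, (char_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = char_cols]

str(dt)
```

Columns that really contain text (z here) stay character, so the conversion is safe to apply to every character column.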

Ok, the customer is abusing CSV format to intentionally write out trailing string rows to an integer column, yet without those rows starting with a comment.char (#).
You then expect to override fread()'s type inference by using nrows to limit it to the integer rows. read.csv(..., nrows) allows this, but fread() uses all rows for type inference (not just the ones selected by nrows, skip and header), even rows that start with comment.char (that's a bug).
Sounds like an abuse of CSV. Your comment rows should be prepended with #
Yes, fread() needs a fix or enhancement so that it ignores comment rows during type inference.
For now, you can work around this with fread() by post-processing the data.table after it has been read in.
It's arguable whether fread() should be changed to support the behavior you want: using nrows to limit what gets exposed to type-inference. It might fix your (pretty unique) case and break some others.
I don't see why you (EDIT: the customer) can't write your comments to a separate .txt/README/data-dictionary file to accompany the .csv. The practice of using a separate data-dictionary file is pretty well-established.
I've never seen someone do this to a CSV file. At least move the comments to the header, not a footer.
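The post-processing workaround mentioned above could look roughly like the sketch below. It assumes you know how many data rows precede the trailing comments (100 in the question's example) and it writes the example file to a temporary path; tmp, n_data and cols are names made up for illustration.

```r
library(data.table)

# Recreate the question's example file in a temporary location
tmp <- tempfile(fileext = ".csv")
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, tmp, row.names = FALSE)

n_data <- 100  # assumption: number of real data rows is known in advance

# Read every column as character so type inference cannot be
# influenced by the trailing comment rows
dt <- fread(tmp, header = TRUE, colClasses = "character")

# Drop the trailing rows, then let type.convert() pick the classes
dt <- dt[seq_len(n_data)]
cols <- names(dt)
dt[, (cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = cols]

str(dt)
```

Reading everything as character first costs an extra conversion pass, but it leaves the raw file untouched.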

Related

Read in Excel column with numbers and characters to R

I'm trying to read an Excel file (an .xlsx file) into R using read_excel. Some columns contain values made of letters and numbers, for example P765876, alongside cells with just numbers, e.g. 234654. When the file is read into R, such a column is typed as unknown (neither character nor numeric), and every cell that has both a letter and a number gets the value NA. How can I read this in correctly?
My code at the moment is
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx")
I would also recommend using the col_types argument: by specifying it as "text" you should avoid getting NAs introduced by coercion. Your code would then be:
tenant<-read_excel("C:/Users/MPritchard/Repairs Projects/May 2017/Tenant Info/R data 1.xlsx", col_types = "text")
Please let me know if this solved your problem.
Regards,
/Michael
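One side effect of reading everything as text is that genuinely numeric columns come back as character too. A possible follow-up, sketched here in base R with a made-up data frame standing in for the read_excel() result (the column names ref and rent are hypothetical), is to re-run type detection column by column:

```r
# Stand-in for the result of read_excel(..., col_types = "text");
# the column names and values are made up for illustration
tenant <- data.frame(
  ref  = c("P765876", "234654"),
  rent = c("100", "250.5"),
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type that fits all its values;
# as.is = TRUE keeps mixed columns as character rather than factor
tenant[] <- lapply(tenant, type.convert, as.is = TRUE)

str(tenant)
```

The mixed column (ref) stays character because at least one value is not a number, while rent becomes numeric.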
Not really an answer but too much for a comment...
1:
> library(xlsx)
> tenant <- read.xlsx("returns.xlsx", sheetIndex = 1)
> str(tenant)
'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : Factor w/ 9 levels "2545","2a","2d",..: 6 4 9 3 5 1 7 2 8
$ only_char : Factor w/ 6 levels "af","dd","e",..: 2 1 5 6 3 2 4 3 1
2:
> library(readxl)
> tenant2 <- read_excel("returns.xlsx")
> str(tenant2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 9 obs. of 3 variables:
$ only_integer: num 1 2 34 5 546931 ...
$ int_char : chr "d5" "5" "ff2ad2f" "2d" ...
$ only_char : chr "dd" "af" "h" "ha" ...
The column int_char is a mixture of both, starting/ending with numbers or characters

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates (for example, "- " appears 2 times):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left) : "-" not meaningful for factors
I then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
As @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted in many functions, including read.csv, read.table, and most data.frame functions including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and removing the duplicates works as intended:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
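If the data frame has several factor columns, the conversion can be done for all of them before removing duplicates. A minimal sketch with made-up data (the values and the helper name is_fct are illustrative, and stringsAsFactors = TRUE is set explicitly to reproduce the factor column):

```r
# Made-up data with a factor column and one duplicated row
data1 <- data.frame(
  imageLikeCount = c(3, 27, 4, 4),
  userName = c("testblabla", "test_00", "frenchfries", "frenchfries"),
  stringsAsFactors = TRUE
)

# Convert every factor column to character, leaving other columns alone
is_fct <- vapply(data1, is.factor, logical(1))
data1[is_fct] <- lapply(data1[is_fct], as.character)

# Remove duplicated rows; userName stays character
data1 <- unique(data1)
str(data1)
```

Unlike data.matrix(), which replaces factor levels with their integer codes, this keeps the user names readable.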

Using "NA" as a legitimate nonmissing value

I'm working with a data set that includes first names entered in all capital letters. I need to work with the names as character variables, not as factors.
One person in the data set has the first name "NA". Can I get R to accept "NA" as a legitimate character value? My work-around solution was to rename that person NAA, but I am interested to see if there is a better way.
As a demonstration of my comment, consider the following sample CSV file:
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)
cat(readLines(x), sep = "\n")
# v1,v2
# NA,1
# AB,3
Here's the str of a basic read.csv. Note the NA is seen as NA
str(read.csv(x))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 1 level "AB": NA 1
# $ v2: int 1 3
Now, specify a different character as your na.strings argument:
str(read.csv(x, na.strings = ""))
# 'data.frame': 2 obs. of 2 variables:
# $ v1: Factor w/ 2 levels "AB","NA": 2 1
# $ v2: int 1 3
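Since the question asks for character variables rather than factors, the same idea can be combined with stringsAsFactors = FALSE. A small sketch using the same temporary file as above (df is just an illustrative name):

```r
x <- tempfile()
cat("v1,v2", "NA,1", "AB,3", sep = "\n", file = x)

# Treat only empty fields as missing and keep strings as character
df <- read.csv(x, na.strings = "", stringsAsFactors = FALSE)

str(df)
is.na(df$v1)     # the name "NA" is an ordinary string, not a missing value
df$v1 == "NA"    # so it can be compared like any other string
```

Note that with na.strings = "" any genuinely missing values in the file must be written as empty fields.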

R object of data.frame and data.table have same type?

I am still very new to R and recently came across something I don't understand. Do data.frame and data.table have the same type? Can an object have multiple types? After converting "cars" from a data.frame to a data.table, I obviously can't apply functions that apply to data.frames and not to data.tables, but class() shows that "cars" is still a data.frame. Does anyone know why?
> class(cars)
[1] "data.frame"
> cars<-data.table(cars)
> class(cars)
[1] "data.table" "data.frame"
It is not clear what you mean by your line "I obviously can't apply functions that apply to data.frames and not data.table".
Many functions work as you would expect, whether applied to a data.frame or to a data.table. In particular, if you read the help page to ?data.table, you would find this specific line in the first paragraph of the description:
Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame.
You can test this out yourself:
library(data.table)
CARS <- data.table(cars)
The following pairs should all give you the same results. They aren't the "data.table" way of doing things, but they are a few examples off the top of my head to show that many (most?) functions can be used with a data.table the same way you would use them with a data.frame (although at that point you miss out on all the great stuff that data.table has to offer).
with(cars, tapply(dist, speed, FUN = mean))
with(CARS, tapply(dist, speed, FUN = mean))
aggregate(dist ~ speed, cars, as.vector)
aggregate(dist ~ speed, CARS, as.vector)
colSums(cars)
colSums(CARS)
as.matrix(cars)
as.matrix(CARS)
t(cars)
t(CARS)
table(cut(cars$speed, breaks=3), cut(cars$dist, breaks=5))
table(cut(CARS$speed, breaks=3), cut(CARS$dist, breaks=5))
cars[cars$speed == 4, ]
CARS[CARS$speed == 4, ]
However, there are some cases in which this won't work. Compare:
cars[cars$speed == 4, 1]
CARS[CARS$speed == 4, 1]
For a better understanding of that, I recommend reading the FAQs. In particular, a couple of relevant points have been summarized at this question: what you can do with data.frame that you can't in data.table.
If your question is, more generally, "Can an object have more than one class?", then you've seen from your own exploration that, yes, it can. For more about that, you can read this page from Hadley's devtools wiki.
Classes also affect things like how objects are printed and how they interact with other functions.
Consider the rle function. If you look at the class, it returns "rle", and if you look at its structure, it shows that it is a list.
> x <- rev(rep(6:10, 1:5))
> y <- rle(x)
> x
[1] 10 10 10 10 10 9 9 9 9 8 8 8 7 7 6
> y
Run Length Encoding
lengths: int [1:5] 5 4 3 2 1
values : int [1:5] 10 9 8 7 6
> class(y)
[1] "rle"
> str(y)
List of 2
$ lengths: int [1:5] 5 4 3 2 1
$ values : int [1:5] 10 9 8 7 6
- attr(*, "class")= chr "rle"
As the length of each list item is the same, you might expect that you can conveniently use data.frame() to convert it to a data.frame. Let's try:
> data.frame(y)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""rle"" to a data.frame
> unclass(y)
$lengths
[1] 5 4 3 2 1
$values
[1] 10 9 8 7 6
> data.frame(unclass(y))
lengths values
1 5 10
2 4 9
3 3 8
4 2 7
5 1 6
Or, let's add another class to the object and try:
> class(y) <- c(class(y), "list")
> y ## Printing is not affected
Run Length Encoding
lengths: int [1:5] 5 4 3 2 1
values : int [1:5] 10 9 8 7 6
> data.frame(y) ## But interaction with other functions is
lengths values
1 5 10
2 4 9
3 3 8
4 2 7
5 1 6
data.table and data.frame are different classes, but they are related through inheritance: data.table inherits from data.frame and extends its capabilities. You can also see that the underlying type is unchanged after converting cars to the data.table class:
R> typeof(cars)
[1] "list" # similar to dataframe
R> mode(cars)
[1] "list" # idem
More information here or just google for "inheritance".
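To make the inheritance idea concrete, here is a small base R sketch of S3 dispatch on a class vector with two entries. Everything in it is hypothetical: describe() is a made-up generic and "my_table" is a made-up class that only mimics how data.table puts its own class in front of "data.frame".

```r
# Hypothetical generic with methods for data.frame and everything else
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.data.frame <- function(x, ...) "data.frame method"

# An object whose class vector mimics that of a data.table
tbl <- data.frame(a = 1:3)
class(tbl) <- c("my_table", "data.frame")

inherits(tbl, "data.frame")   # TRUE: the second class still counts
typeof(tbl)                   # "list": the storage type is unchanged

# No my_table method exists yet, so dispatch falls through to data.frame
first <- describe(tbl)

# A more specific method takes over, and can still call the inherited one
describe.my_table <- function(x, ...) {
  parent_result <- NextMethod()
  paste("my_table method, then", parent_result)
}
second <- describe(tbl)

first
second
```

R tries the classes from left to right, which is why functions written for data.frame keep working on an object that also carries a more specific class.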

R: Can't select a specific column in a data frame

I have a problem with a function to select a given column. I have a data frame called Volume from which I want to make a subset DateSearch:
DateSearch = subset(Volume,select=c("TRI",name))
For some reason it does not work. I have used browser(). I can select TRI or name, but I can't select both (either by name or by index). I have tried with and without quotes.
Does anyone know why that is?
Many thanks,
Vincent
I just did what (I think) you describe:
str(dfrm)
#'data.frame': 20 obs. of 8 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ factor1: Factor w/ 4 levels "Not at all","To a small extent",..: 3 2 3 NA 3 NA 3 NA 4 1 ...
## <snip>
name = "factor1"
subset(dfrm, select=c("ID", name))
No error, .... results as expected.
Examine the spelling carefully. My guess is that the string stored in name (the result of your as.character call) has a space at the beginning or end, or perhaps even a non-printing character. You can use nchar(name) to check.
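A quick way to test that guess, sketched with a made-up data frame and a deliberately padded name (clean and res are illustrative names):

```r
# Made-up data frame and a column name with a stray leading space
dfrm <- data.frame(ID = 1:3, factor1 = c("a", "b", "c"))
name <- " factor1"

nchar(name)              # one more than nchar("factor1")
name %in% names(dfrm)    # FALSE: the padded name matches no column

# Strip leading/trailing whitespace, then select as before
clean <- trimws(name)
res <- subset(dfrm, select = c("ID", clean))
names(res)
```

trimws() only removes ordinary whitespace, so if nchar() still looks too large afterwards, a non-printing character is the more likely culprit.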
