I have a 5 GB data file to load. fread seems to be a fast way to load it, but it reads all my data structures wrong. It looks like the quotes are causing the problem.
# Code. I don't know how to put the raw csv data here.
dt <- fread("data.csv", header = TRUE)
dt2 <- read.csv("data.csv", header = TRUE)
str(dt)
str(dt2)
This is the output. Every column read by fread is character, regardless of whether it should be numeric or character.
It's curious that fread didn't use numeric for the id column; maybe some entries contain non-numeric values?
The documentation suggests using the colClasses parameter.
dt <- fread("data.csv", header = T, colClasses = c("numeric", "character"))
The documentation has a warning for using this parameter:
A character vector of classes (named or unnamed), as read.csv. Or a named list of vectors of column names or numbers, see examples. colClasses in fread is intended for rare overrides, not for routine use. fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.
It looks as if the fread command will detect the type in a particular column and then assign the lowest type it can to that column based on what the column contains. From the fread documentation:
A sample of 1,000 rows is used to determine column types (100 rows
from 10 points). The lowest type for each column is chosen from the
ordered list: logical, integer, integer64, double, character. This
enables fread to allocate exactly the right number of rows, with
columns of the right type, up front once. The file may of course still
contain data of a higher type in rows outside the sample. In that
case, the column types are bumped mid read and the data read on
previous rows is coerced.
This means that if a column contains mostly numeric values, fread might assign it as numeric based on the sample, but if it then encounters character values later in the file, it bumps the column to character and coerces everything read up to that point.
You can read about these type conversions here, but the long and short of it seems to be that trying to convert a character column to numeric for values that are not numeric will result in those values being converted to NA, or a double might be converted to an integer, leading to a loss of precision.
You might be okay with this loss, but fread will not perform that downgrade for you through colClasses; you have to coerce such columns yourself after reading, or clean out the non-numeric values first.
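Putting the pieces together, a minimal sketch of that "coerce afterwards yourself" step (not code from the question; the column names in num_cols are assumptions):
library(data.table)
dt <- fread("data.csv", header = TRUE)
num_cols <- c("id", "value")  # assumed names of the columns that should be numeric
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]  # unparseable values become NA, with a warning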
Related
I want to upload an Excel file as a dataframe in R.
It is a large file with a lot of numbers and some #NV values.
The upload works well for the majority of columns (in total, there are 4,000 columns). But for some columns, R changes the values to "TRUE" or "FALSE", creating a boolean column.
I don't want that, since all of the columns are supposed to be numeric.
Do you know why R does that?
It would really help if you provided code snippets, because there are many different excel-to-dataframe libraries/methods/behaviors.
But assuming that you are using readxl, the read_excel function has a guess_max parameter for this kind of case. guess_max is 1000 by default.
Try df <- read_excel(path = filepath, sheet = sheet_name, guess_max = 100000)
Since dataframes cannot have different data types in the same column, read_excel has to read your excel file and guess what data type each column should be, before actually filling the dataframe. If a column happens to only have NA values in the first 1000 rows, read_excel will assume you have a column of booleans, and then all subsequent values encountered in future rows will be cast accordingly. So if you set guess_max to something huge, you make read_excel slower, but it might avoid the casting of numerics to booleans.
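If raising guess_max is too slow across 4,000 columns, a hedged variant of the same idea is to tell read_excel which strings mean missing, so the #NV cells stop steering the guesser toward logical (whether this helps depends on how those cells come through; filepath and sheet_name are placeholders):
library(readxl)
df <- read_excel(path = filepath, sheet = sheet_name,
                 na = c("", "#NV"), guess_max = 100000)  # treat blanks and "#NV" as NA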
I have a table in Excel with numeric, date, and character columns. I use the read_excel() function from the readxl library to load the data into R. For most of the columns, read_excel by default does a good job of recognizing the column type.
Problem:
As the number of columns in the table can increase or decrease, I don't want to define col_types in read_excel to load data.
Two Excel numeric columns, cost and revenue, have a '$' in front of the value, such as $200.0541. The dollar sign '$' seems to cause the function to mistakenly identify the cost and revenue columns as POSIXct type.
Since new numeric columns might be added later with '$', is it possible to change the column types after loading the data (without using df$cost <- as.numeric(df$cost) for each column) through a loop?
Edit: link to sample - https://ethercalc.org/ogiqi9s51o45
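As an illustration of the kind of post-load conversion being asked about (an assumption-laden sketch, not a tested solution; filepath, sheet_name, and the "^\\$" pattern are placeholders): read every column as text with a single recycled col_types value, so it keeps working as columns come and go, then coerce the columns that look like "$<number>" back to numeric.
library(readxl)
df <- read_excel(path = filepath, sheet = sheet_name, col_types = "text")
# find columns where some entry starts with "$", then strip "$"/"," and coerce
is_dollar <- vapply(df, function(x) any(grepl("^\\$", x), na.rm = TRUE), logical(1))
df[is_dollar] <- lapply(df[is_dollar], function(x) as.numeric(gsub("[$,]", "", x)))
The remaining columns would still need their own coercion afterwards (dates in particular), since everything was read in as text.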
I have a CSV file with all values in double quotes. One column is the id column and it contains values such as this:
01100170109835
The problem I am having is that no matter what options I specify (as.is=T, stringsAsFactors=F, or numerals='no.loss'), it always reads this id column in as numeric and drops the leading 0's. This is such a fundamental operation that I am really baffled that I can't find a solution.
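For what it's worth, the usual workaround is to declare the id column as character up front so the leading zeros survive; a sketch, assuming the column is literally named id:
dt  <- read.csv("data.csv", colClasses = c(id = "character"))               # base R
dt2 <- data.table::fread("data.csv", colClasses = list(character = "id"))   # data.table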
I have a dataset with nearly 30,000 rows and 1935 variables (columns). Among these, many are character variables (around 350). I can change the data type of an individual column using as.numeric, but it is painful to search for the columns that are of character type and then convert them one by one. I have tried writing a function using a loop, but since the data is huge, my laptop keeps crashing.
Please help.
Something like
take <- sapply(data, is.numeric)
which(take == FALSE)
identifies which variables are not numeric. I don't know how to extract those column numbers automatically, so plug them in yourself:
data[, putcolumnsnumbershere] <- apply(data[, putcolumnsnumbershere], 2, as.numeric)
use
sapply(your.data, typeof)
to create a vector of variable types, then use this vector to identify the character vector columns to be converted.
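Putting those two ideas together, a sketch (the object name data and the target type are taken from the question):
char_cols <- vapply(data, is.character, logical(1))     # which columns are character
data[char_cols] <- lapply(data[char_cols], as.numeric)  # non-numeric strings become NA, with a warning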
I have a simple problem. I have a data frame with 121 columns. Columns 9:121 need to be numeric, but when imported into R they come in as a mixture of numerics, integers, and factors. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following... The apply function allows you to loop over the rows, the columns, or both of a dataframe and apply any function, so to make sure all your columns from 9:121 are numeric, you can do the following:
table[, 9:121] <- apply(table[, 9:121], 2, function(x) as.numeric(as.character(x)))
table[, 1:8] <- apply(table[, 1:8], 2, as.character)
Where table is the dataframe you read into R.
Briefly: in the apply call I specify the table to loop over (in this case the subset of your table we want to change), then the number 2 to indicate columns, and finally the as.numeric or as.character function to apply. The assignment operator then replaces the old values in your table with the new ones in the correct format.
-EDIT: Just changed the first line, as I recalled that if you convert a factor directly to a number, what you get is the integer of the factor level and not the number you think you are getting. Factors first need to be converted to characters, then to numbers, which we can do by wrapping as.character inside as.numeric.
When you read in the table, use stringsAsFactors = FALSE and then there will not be any factors.
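For example (a sketch; the file name is a placeholder):
table <- read.csv("mytable.csv", stringsAsFactors = FALSE)  # nothing comes in as a factor
table[, 9:121] <- apply(table[, 9:121], 2, as.numeric)      # so no as.character detour is needed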