How can I specify only some colClasses in sqldf file.format? - r

I have some CSV files with problematic columns for sqldf, causing some numeric columns to be classed as character. How can I just specify the classes for those columns, and not every column? There are many columns, and I don't necessarily want to have to specify the class for all of them.
Much of the data in these problem columns are zeros, so sqldf reads them as integer, when they are numeric (or real) data type. Note that read.csv correctly assigns classes.
I'm not clever enough to generate a suitable data set with the right properties (first 50 values zero, then a value of, say, 1.45 in the 51st row), but here's an example call to load the data:
df <- read.csv.sql("data.dat", sql="select * from file",
                   file.format=list(colClasses=c("attr4"="numeric")))
which returns this error:
Error in sqldf(sql, envir = p, file.format = file.format, dbname = dbname, :
formal argument "file.format" matched by multiple actual arguments
Can I somehow use another read.table call to work out the data types?
Can I read all columns in as character, and then convert some to numeric? There are a small number that are character, and it would be easier to specify those than all of the numeric columns. I have come up with this ugly partial solution, but it still fails on the final line with the same error message:
df.head <- read.csv("data.dat", nrows=10)
classes <- lapply(df.head, class) # also fails to get classes correct
classes <- replace(classes, classes=="integer", "numeric")
df <- read.csv.sql("data.dat", sql="select * from file",
                   file.format=list(colClasses=classes))

Take a closer look at the documentation for read.csv.sql, specifically at the argument nrows:
nrows: Number of rows used to determine column types. It defaults to 50. Using -1 causes it to use all rows for determining column types.
Another thing you'll notice from looking at the documentation for read.csv.sql and sqldf is that there is no colClasses parameter. If you read the file.format documentation in sqldf, you'll see that parameters in the file.format list are not passed to read.table but rather to sqliteImportFile, which has no understanding of R's data types. If you don't like modifying the nrows parameter, you could read the entire data frame as character and then use whatever methods you like to figure out which class each column should be. You're always going to have the problem, however, of not knowing whether a column is integer or numeric until you have read the entire column. Also, if the speed issue is really hurting you here, you may want to consider moving away from CSVs.
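For what it's worth, a minimal sketch of both ideas is below. It assumes the data.dat file from the question and the nrows behaviour quoted above; the character-to-numeric promotion at the end is just an illustration of the read-as-character approach, not a sqldf feature.
library(sqldf)

# Sketch 1: let read.csv.sql look at every row when guessing column types,
# so a late non-integer value (e.g. 1.45 in row 51) is not missed.
df <- read.csv.sql("data.dat", sql = "select * from file", nrows = -1)

# Sketch 2: given a data frame whose columns arrived as character, promote
# any column whose values all parse cleanly as numbers (a single
# unparseable value keeps the column as character).
numeric_like <- vapply(df, function(x) {
  all(!is.na(suppressWarnings(as.numeric(as.character(x)))))
}, logical(1))
df[numeric_like] <- lapply(df[numeric_like], function(x) as.numeric(as.character(x)))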

Related

Why does read.csv in R convert fields to factors for some files, and not in others?

I have tables of weather data from several stations. When I import them separately using read.csv, the fields are factors, integers, and numerics. However, when I try to import one CSV file with all of the data combined, the resulting fields in the data frame are all factors. In the combined file the first field has several alphanumeric values, whereas in each individual file there is only one value (the name of the station).
This is a common behaviour of data.frame() from base R, and most of the time the result of read.csv() is stored in a data.frame. As @Duck suggested in the comment section, you can avoid this behaviour by setting the stringsAsFactors argument to FALSE.
read.csv('myfile.csv', stringsAsFactors = FALSE)
You can check this description below on the documentation page of the data.frame function. You can access this documentation with ?data.frame command.
Character variables passed to data.frame are converted to factor columns unless protected by I() or argument stringsAsFactors is false.
So in your case, this happens in your combined file because R is interpreting all variables as characters. Why? Probably because in one (or some) of your files, in the numeric and integer columns, some lines of data are out of format. For example, maybe in a row you have an "x" to represent a missing value. read.csv() uses the entire file to decide which format each column is, so as soon as the function hits this "x" value, it interprets the entire column as character. When this data is passed to data.frame(), the function converts those characters to factors. You said that, in the combined file, you have some alphanumeric values in the first field. Those values are probably the "x"s that are generating your problem.
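As a rough illustration of that advice (the file name and the "x" marker below are placeholders taken from the description above), you can read everything as plain character, declare the stray markers as missing values, and then let base R re-guess each column's type:
weather <- read.csv("combined.csv", stringsAsFactors = FALSE,
                    na.strings = c("NA", "", "x"))  # "x" stands in for the stray markers
# Re-detect each column's class now that the markers count as missing.
weather[] <- lapply(weather, type.convert, as.is = TRUE)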

Read huge csv file using `read.csv` by divide-and-conquer strategy?

I am supposed to read a big CSV file (5.4GB with 7m lines and 205 columns) in R. I have successfully read it using data.table::fread(). But I want to know: is it possible to read it using the basic read.csv()?
I tried just using brute force but my 16GB RAM cannot hold that. Then I tried to use the 'divide-and-conquer' (chunking) strategy as below, but it still didn't work. How should I do this?
dt1 <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = 1)
print(paste(1, 'th chunk completed'))
system.time(
  for (i in (1:9)) {
    tmp <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = i * 721900 + 1)
    dt1 <- rbind(dt1, tmp)
    print(paste(i + 1, 'th chunk completed'))
  }
)
Also, I would like to know how fread() manages to read all the data at once so efficiently, in terms of both memory and time.
Your issue is not fread(); it's the memory bloat caused by not defining colClasses for all your (205) columns. But be aware that trying to read all 5.4GB into 16GB of RAM is really pushing it in the first place: you almost surely won't be able to hold that whole dataset in memory, and even if you could, you'll blow out memory whenever you try to process it. So your approach is not going to fly; you seriously have to decide which subset you can handle, that is, which fields you absolutely need to get started (a short fread sketch follows at the end of this answer):
Define colClasses for your 205 columns: 'integer' for integer columns, 'numeric' for double columns, 'logical' for boolean columns, 'factor' for factor columns. Otherwise things get stored very inefficiently (e.g. millions of strings are very wasteful), and the result can easily be 5-100x larger than the raw file.
If you can't fit all 7m rows x 205 columns (which you almost surely can't), then you'll need to aggressively reduce memory by doing some or all of the following:
read in and process chunks (of rows) (use skip, nrows arguments, and search SO for questions on fread in chunks)
filter out all unneeded rows (e.g. you may be able to do some crude processing to form a row-index of the subset rows you care about, and import that much smaller set later)
drop all unneeded columns (use fread's select/drop arguments to specify vectors of column names to keep or drop)
Make sure option stringsAsFactors=FALSE, it's a notoriously bad default in R which causes no end of memory grief.
Date/datetime fields are currently read as character (which is bad news for memory usage: millions of unique strings). Either drop the date columns entirely to begin with, or read the data in chunks and convert them with the fasttime package or standard base functions.
Look at the args for NA treatment. You might want to drop columns with lots of NAs, or messy unprocessed string fields, for now.
Please see ?fread and the data.table doc for syntax for the above. If you encounter a specific error, post a snippet of say 2 lines of data (head(data)), your code and the error.
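For example, a minimal sketch of the advice above might look like the following; the column names and classes are hypothetical placeholders, not the real ss13hus layout:
library(data.table)

# Declare classes up front and keep only the columns you actually need.
types <- c(SERIALNO = "character", ST = "integer", VALP = "numeric")  # hypothetical
dt <- fread("./ss13hus.csv",
            select           = names(types),  # drop the other ~200 columns
            colClasses       = types,
            na.strings       = c("", "NA"),
            stringsAsFactors = FALSE)
If even the reduced table is too large, wrap a similar fread() call in a loop over nrows/skip and filter or aggregate each chunk before combining.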

R in BERT won't use na.rm=TRUE in sum function

I installed BERT (an R-language-to-Excel interface). In the functions.R file that is included, I modified the included Add function to use the na.rm argument, as follows:
Add <- function( ... ) {
  sum(..., na.rm=TRUE);
}
However, it appears that the na.rm argument is ignored. That is, the Add() function works fine in Excel if all values in the range are present.
[that is =R.Add(A1:A5) in Excel works fine if all of cells A1:A5 contain values]
But if I delete any value in the range (so the Excel cell is blank), I get #NULL! returned.
Is it possible to utilize the na.rm argument, using BERT so that for R-language functions that have the na.rm argument, it is taken into account and blank cells within the Excel range still compute on the remaining values and do not return #NULL!?
This is a little complicated because the behavior is different if you are passing in one argument (a range) or multiple arguments (individual cells). But in the case of a single argument, if you pass in a range that has empty cells, this will be passed as a list. In that case, you will need to call unlist, e.g.
Add <- function( ... ) {
  sum(unlist(...), na.rm=TRUE);
}
Excel can have ranges that include different types (e.g. strings and numbers), but R can't. So when BERT passes data from Excel to R that has mixed types, it uses a list of lists.
There is a hint about this in the console when it runs, which is a good place to start -- it says that the argument is an invalid type (list).
As I said, it's complicated because the three-dots argument could refer to multiple arguments, each of which could be a list or a single (scalar) value, and each of which could be of a different type. In that case you'd need to use one of the apply functions to unlist the different arguments. But try the above first.
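If you do end up needing to handle multiple arguments, a sketch along those lines (just one way to flatten whatever arrives, not BERT's own code) might be:
Add <- function( ... ) {
  # Each argument may be a scalar, a vector, or a list of lists coming from
  # Excel; flatten everything to one vector, coerce to numeric (non-numbers
  # become NA), and sum while ignoring the NAs.
  values <- suppressWarnings(as.numeric(unlist(list(...), use.names = FALSE)))
  sum(values, na.rm = TRUE)
}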

Converting Data Type from data.table package in R

This might be a dumb/obvious question, but unfortunately I haven't had much luck finding information about it online, so I thought I'd ask it here. Basically, I'm working with the data.table package in R, and I have imported a data set where, in a particular column, the values can be both numeric and character (and even blank/empty), and I want to be able to obtain a value from that column and use it for calculations.
The thing about the data.table package, though, is that when you import a file using the fread() function it automatically sets all values in that file to the character data type, which causes a few issues since all numbers then end up as characters as well. I have worked around this slightly by using the as.numeric() function, so that if a value obtained from that column is a number it can easily be converted to numeric type and used in calculations. However, since the column also contains other characters (specifically, it can also have \N or N as values) and since it can also contain blank/empty values, the as.numeric() function produces an error. For example, I initially wrote an if statement to detect whether a column cell had a character value or a numeric value, as follows:
if( as.numeric(..{Reference to column cell from file here}...) == NA ) {
  x <- 0
}
(where x is just some variable), but it did not work and instead gave the output:
Error in if ((as.numeric(.... :
missing value where TRUE/FALSE needed
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
(I should note that is.numeric() also did not work, since all values in the data set are read in as character values, so this function always gives FALSE regardless of a value's actual data type.)
So clearly I need a better function or method to work around this. Is there a function capable of reading a 'character' value from a column and detecting whether that value is truly numeric, character, or neither (in the case of an empty cell)? Thanks in advance.
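For what it's worth, a check along the lines the question describes could test whether the coercion produced NA, rather than comparing to NA directly (any comparison with NA returns NA, which is why the if() above fails). A minimal sketch, with val standing in for the cell value pulled from the column:
val_num <- suppressWarnings(as.numeric(val))  # "\N", "N" and "" all become NA
if (is.na(val_num)) {
  x <- 0         # not a usable number (character, \N, N or blank)
} else {
  x <- val_num   # a genuine numeric value
}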

Select a column from a dynamic variable

How can I select the second column of a dynamically named variable?
I create variables of the form "population.USA", "population.Mexico", "population.Canada". Each variable has a column for the year and another column for the population value. I would like to select the second column from each of these variables inside a loop.
I use this syntax:
sprintf("population.%s", country)[, 2]
R returns the error: Error in sprintf("population.%s", country)[, 2] : incorrect number of dimensions
Based on your sequence of questions over the last few minutes, I have two general recommendations for you as you get familiar with R:
Don't use sprintf.
Don't use assign.
Now, obviously, those functions are both useful at times. But you've learned about them too early, before you've mastered some basic stuff about R's data structures. Try to write code without those crutches (for the time being!), as they're just causing you problems.
Rather than creating separate individual variables for each nation's population, place them in a list.
population <- vector("list",3)
names(population) <- c('USA','Mexico','Russia')
Then you can access each using the string representation of the name of each country:
population[['USA']] <- 10000
Or,
region <- 'USA'
population[[region]]
In this example I've assigned a single value to a list element, but lists will hold any other data type, including matrices or data frames. It will be a lot less typing than using sprintf and assign, and a lot safer and more efficient as well.
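A sketch of how the loop from the question might then look, assuming each element of population holds a two-column table (year, value) rather than the single number used above for illustration:
for (region in names(population)) {
  values <- population[[region]][, 2]  # second column for that country
  # ... work with values here ...
}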
See ?get. Here is an example:
> country <- "FOO"
> assign(sprintf("population.%s", country), data.frame(runif(5), runif(5)))
>
> get(sprintf("population.%s", country))[,2]
[1] 0.2241105 0.5640709 0.5945869 0.1830719 0.1895938
It is critically important to look at the object returned by a function if you get an error. It is immediately clear why your example fails if you just look at what it returns:
> sprintf("population.%s", country)
[1] "population.FOO"
At that point it would be immediately clear, if you didn't already know or hadn't thought to read ?sprintf, that sprintf() returns a string, not the object with that name. Armed with that knowledge, you would have narrowed the problem down to how to retrieve an object from a computed name.
