I'm having trouble reading big CSV files in R, so I'm trying to use the sqldf package to read just some columns or rows from the CSV.
I tried this:
test <- read.csv.sql("D:\\X17065382\\Documents\\cad\\2016_mar\\2016_domicilio_mar.csv", sql = "select * from file limit 5", header = TRUE, sep = ",", eol = "\n")
but I got this error:
Error in connection_import_file(conn#ptr, name, value, sep, eol, skip) : RS_sqlite_import: D:\X17065382\Documents\cad\2016_mar\2016_domicilio_mar.csv line 198361 expected 1 columns of data but found 2
If you're not too fussy about which package you use, data.table has a great function for doing just what you need:
library(data.table)
file <- "D:\\X17065382\\Documents\\cad\\2016_mar\\2016_domicilio_mar.csv"
fread(file, nrows = 5)
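Since the original question also asks about reading just some columns, note that fread() can restrict columns at read time as well. A minimal sketch, where the column names are hypothetical:
library(data.table)
file <- "D:\\X17065382\\Documents\\cad\\2016_mar\\2016_domicilio_mar.csv"
# nrows limits how many data rows are read;
# select keeps only the named columns ("col1", "col2" are placeholders)
fread(file, nrows = 5, select = c("col1", "col2"))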
Like Shinobi_Atobe said, the fread() function from data.table works really well. If you prefer base R, you can also use read.csv() or read.csv2(), e.g.:
read.csv2(file_path, nrows = 5)
Also, what do you mean by "big files"? 1 GB, 10 GB, 100 GB?
This works for me.
require(sqldf)
df <- read.csv.sql("C:\\your_path\\CSV1.csv", "select * from file where Name='Asher'")
df
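You can also combine a row limit with a column selection in the sql statement itself, which addresses both parts of the original question. A hedged sketch, where "Name" and "Age" are placeholder column names:
require(sqldf)
# only the named columns and the first five rows are returned
df <- read.csv.sql("C:\\your_path\\CSV1.csv",
                   sql = "select Name, Age from file limit 5")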
I am using RStudio and the following R code to import a file into R. I need to exclude one specific column called "Approach".
Currently, my code to read the file stands as follows:
df1 <- read.csv("myfile.csv", check.names=FALSE, header = TRUE, fileEncoding="latin1")
I have tried something like this but it is not working:
excl_Approach_Col<-c("Approach")
df1 <- read.csv("myfile.csv", check.names=FALSE, header = TRUE, col.names!= excl_Approach_Col, fileEncoding="latin1")
I am getting the following error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote,
: object 'col.names' not found
I know I can import the full file as df1 and then proceed to drop that specific column. However, it would be nice if I could exclude the column during the read file step.
Is this possible? Do I need any specific package to perform this operation?
You can use 'fread' in 'data.table' to load only selected columns: 'select' lets you pick columns, while 'drop' lets you exclude them:
library(data.table)
a <- data.table::fread(
  "myfile.csv",
  drop = "Approach"
)
You can use column indexing to import only certain columns:
read.csv(file = "result1", sep = " ")[ , 1:2]
or, if the column names are known:
read.csv(file = "result1", sep = " ")[ , c('col1', 'col2')]
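Note that this subsetting approach still reads the whole file into memory before dropping columns. If you want base R to skip a column at read time, colClasses can mark it as "NULL"; a hedged sketch reusing the file and column name from the question:
# the string "NULL" as a column's class tells read.csv to skip it entirely
df1 <- read.csv("myfile.csv", check.names = FALSE, header = TRUE,
                colClasses = c(Approach = "NULL"), fileEncoding = "latin1")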
library(data.table)
a <- fread("myfile.csv", drop = "Approach")
For multiple column drops, use:
a <- fread("myfile.csv", drop = c("Approach", "OtherColumn"))
To select specific columns instead, use select in place of drop.
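For example, a minimal sketch with hypothetical column names:
# select keeps only the named columns ("Col1", "Col2" are placeholders)
a <- fread("myfile.csv", select = c("Col1", "Col2"))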
I have several hundred .pet files containing information organized by date code (e.g. 19960101, in YYYYMMDD format). I'm trying to add a column, NDate, with the date code:
for (pet.atual in files.pet) {
data.pet.atual <-
read.table(file = pet.atual,
header = FALSE,
sep = ",",
quote = "\"",
comment.char = ";");
data.pet.atual <- cbind(data.pet.atual, NDate= pet.atual)
}
What I'm trying to achieve, for example, is NDate = 19960101 for 01-01-1996, NDate = 19960102 for 02-01-1996, and so on. However, the for loop just overwrites data.pet.atual with the latest pet.atual every time it runs, so only the last file's data survives. Any ideas? Thanks
A small modification should do the trick:
data.pet.atual <- NULL
for (pet.atual in files.pet) {
tmp.data <-
read.table(file = pet.atual,
header = FALSE,
sep = ",",
quote = "\"",
comment.char = ";");
tmp.data <- cbind(tmp.data, NDate= pet.atual)
data.pet.atual <- rbind(data.pet.atual, tmp.data)
}
You can also replace the tmp.data <- cbind(...) line with tmp.data$NDate <- pet.atual
You may also try fread() and rbindlist() from the data.table package (untested due to lack of a reproducible example):
library(data.table)
result <- rbindlist(lapply(files.pet, fread), idcol = "NDate")
result[, NDate := anytime::anydate(files.pet[NDate])]
lapply() "loops" over all entries in files.pet executing fread() for each entry and returns a list with the data.tables fread has created from reading each file. rbindlist() is used to combine all pieces into one large data.table. The parameter idcol = NDate generates an index column named NDate to identify the origin of each row in the final output. The ids are integer numbers 1 to the length of the list (if the list is not named).
Finally, the id number is used to lookup the file name in files.pet which is directly converted to class Date using the anytime package.
EDIT: Perhaps it might be more efficient to convert the file names to Date first, before looking them up:
result[, NDate := anytime::anydate(files.pet)[NDate]]
Although fread() is pretty smart at analysing and guessing the right parameters for reading the files, it might be necessary (and perhaps faster as well) to supply additional parameters, e.g.:
result <- rbindlist(lapply(files.pet, fread, header = FALSE, sep = ","), idcol = "NDate")
Yes, lapply() will help, as Frank suggests, and you want to use rbind() to keep the dates different for each file. Something along the lines of the following (I'm assuming files.pet is a list of all the files you want to include):
my.fun <- function(file) {
  data <- read.table(file = file,
                     header = FALSE,
                     sep = ",",
                     quote = "\"",
                     comment.char = ";")
  data$NDate <- file
  return(data)
}
data.pet.atual <- do.call(rbind.data.frame, lapply(files.pet, FUN=my.fun))
I can't test this without a reproducible example, so you may need to play with it a bit, but the general approach should work!
Background:
I can successfully pull a particular dataset (shown in the code below) from the internet using the read.csv() function. However, when I try to use the sqldf package to speed up the process with read.csv.sql(), it produces errors. I've tried various solutions but can't seem to solve this problem.
I can successfully pull the data and create the data frame that I want with read.csv() using the following code:
ce_data <- read.csv("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
fill=TRUE, header=TRUE, sep="")
To test the functionality of sqldf on my machine, I successfully tested read.csv.sql() by reading in the data as one variable rather than the five desired, using the following code:
library(sqldf)
ce_data_sql1 <- read.csv.sql("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
sql = "select * from file")
To produce the result that I got using read.csv() but utilizing the speed of read.csv.sql(), I tried this code:
ce_data_sql2 <- read.csv.sql("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
fill=TRUE, header=TRUE, sep="", sql = "select * from file")
Unfortunately, it produced this error:
trying URL
'http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData' Content
type 'text/plain' length 24846571 bytes (23.7 MB) downloaded 23.7 MB
Error in sqldf(sql, envir = p, file.format = file.format, dbname =
dbname, : unused argument (fill = TRUE)
I have tried various methods to address the errors, using sqldf documentation and have been unsuccessful.
Question:
Is there a solution where I can read in this table with the five desired variables using read.csv.sql()?
The reason you are reading it in as a single variable is that you did not correctly specify the separator for the original file. Try the following, where sep = "\t" means tab-separated:
ce_data_sql2 <- read.csv.sql("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
sep = "\t", sql = "select * from file")
The error you are getting in the final example:
Error in sqldf(sql, envir = p, file.format = file.format, dbname =
dbname, : unused argument (fill = TRUE)
is due to the fact that read.csv.sql does not accept the fill argument.
I'm struggling to read a local CSV file with quantmod's getSymbols. The format of the file (wkn_541779.csv) I'm trying to read is like this:
Date;Open;High;Low;Close;Volume;Ajdusted
2012-09-06;104,62;105,95;104,62;105,95;1248065,00;105,95
2012-09-05;104,78;104,78;104,45;104,48;1176371,00;104,48
2012-09-04;104,73;104,73;104,26;104,26;13090,00;104,26
> getSymbols("wkn_541779", src="csv", header = TRUE, sep=";", dec=",")
gives me the error message "more columns than column names", though.
> count.fields("wkn_541779.csv", sep = ";", skip = 0, blank.lines.skip = TRUE)
Results in "7" for each line (including the header!), which is exactly the number of columns in the header.
Can anybody please help me tracking down the problem here?
getSymbols.csv calls read.csv with its defaults, i.e. sep = ",", so the sep = ";" and dec = "," you pass are not applied to your semicolon-separated file.
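As a possible workaround (an untested sketch, not the quantmod-native route), you could read the semicolon-separated file yourself and convert it to an xts object, which is roughly what getSymbols() would return:
library(quantmod)  # loads xts and zoo

# read the file with the correct separator and decimal mark
raw <- read.csv("wkn_541779.csv", sep = ";", dec = ",", header = TRUE)

# build an xts object indexed by the Date column (xts sorts by index)
wkn_541779 <- xts(raw[, -1], order.by = as.Date(raw$Date))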
When loading a .csv with sqldf, everything goes fine until I load data.table. For example:
library(sqldf)
write.table(trees, file="trees.csv", row.names=FALSE, col.names=FALSE, sep=",")
my.df <- read.csv.sql("trees.csv", "select * from file",
header = FALSE, row.names = FALSE)
works, while
library(data.table)
my.df <- read.csv.sql("trees.csv", "select * from file",
header = FALSE, row.names = FALSE)
# Error in list(...)[[1]] : subscript out of bounds
doesn't. When loaded, data.table informs you that
The following object(s) are masked from 'package:base':
cbind, rbind
So, I tried this:
rbind <- base::rbind # `unmask` rbind from base::
library(data.table)
my.df <- read.csv.sql("trees.csv", "select * from file",
header = FALSE, row.names = FALSE)
rbind <- data.table::rbind # `mask` rbind with data.table::rbind
which works. Before I litter all my code with this trick:
What is the best-practice solution for dealing with masking conflicts in R?
EDIT: There is a closely related thread here, but no general solution is suggested.
As per the comments, yes, please file a bug report:
bug.report(package="data.table")
That way it won't be forgotten, you'll get an automatic email each time the status changes and you can reopen it if the fix proves to be insufficient.
EDIT:
Now in v1.6.7 on R-Forge:
Compatibility with package sqldf (which can call do.call("rbind",...) on an empty ...) is fixed and test added. data.table was switching on list(...)[[1]] rather than ..1. Thanks to RYogi for reporting #1623.