masking conflicts - r

When loading a .csv with sqldf, everything goes fine until I load data.table. For example:
library(sqldf)
write.table(trees, file="trees.csv", row.names=FALSE, col.names=FALSE, sep=",")
my.df <- read.csv.sql("trees.csv", "select * from file",
header = FALSE, row.names = FALSE)
works, while
library(data.table)
my.df <- read.csv.sql("trees.csv", "select * from file",
header = FALSE, row.names = FALSE)
# Error in list(...)[[1]] : subscript out of bounds
does not. When it is loaded, data.table reports that
The following object(s) are masked from 'package:base':
cbind, rbind
So, I tried this
rbind <- base::rbind # `unmask` rbind from base::
library(data.table)
my.df <- read.csv.sql("trees.csv", "select * from file",
header = FALSE, row.names = FALSE)
rbind <- data.table::rbind # `mask` rbind with data.table::rbind
which works. Before I litter all my code with this trick:
What is the best-practice way to deal with masking conflicts in R?
EDIT: There is a closely related thread here, but no general solution is suggested.

As per the comments, yes, please file a bug report:
bug.report(package="data.table")
That way it won't be forgotten, you'll get an automatic email each time the status changes and you can reopen it if the fix proves to be insufficient.
EDIT:
Now in v1.6.7 on R-Forge:
Compatibility with package sqldf (which can call do.call("rbind",...) on an empty ...) is fixed and test added. data.table was switching on list(...)[[1]] rather than ..1. Thanks to RYogi for reporting #1623.
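For illustration, the failure mode described in that NEWS entry can be reproduced without either package. The sketch below uses a hypothetical function, masking_rbind, as a stand-in for a masking rbind that inspects its first argument via list(...)[[1]]; it is not data.table's actual code. It also shows that calling base::rbind with an explicit namespace prefix bypasses whatever is masking it, which is one general way to handle masking conflicts.

```r
# Hypothetical stand-in for an rbind replacement that looks at its first
# argument before delegating to base::rbind (not data.table's real code).
masking_rbind <- function(...) {
  first <- list(...)[[1]]  # errors when ... is empty
  base::rbind(...)
}

# Call it with no arguments, as do.call("rbind", list()) would
empty_call <- tryCatch(
  do.call(masking_rbind, list()),
  error = function(e) conditionMessage(e)
)

# The namespace-qualified base function handles the empty case
safe_call <- do.call(base::rbind, list())

print(empty_call)
print(is.null(safe_call))
```

Running conflicts(detail = TRUE) lists which attached packages mask which names, which can help decide where a pkg::fun prefix is worth adding.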

Related

Binding XLS files while getting rid of extra sheets

I have 774 XLS files that I would like to merge into one big CSV database. They are roughly similar, but I don't know how to handle the differences...
Some XLS files have more than one sheet, and the extra sheets are useless, so I need to get rid of them. The problem is that in some files these extra sheets were moved to the first position, while in others they were not. So I can't rely on the default sheet choice of the R functions that read XLS files, right?
Besides that, the names of the extra sheets (those I don't intend to keep) may vary.
Below I present the script that I know, hoping someone could help me adapt it to this situation.
setwd("D:/Folder")
library(readxl)
lst = list.files()
df = data.frame()
# Now comes the loop
for(table in lst){
dataFromExcel <- read_excel(table)
df <- rbind(df,dataFromExcel)
}
When I run the loop, I receive the message:
New names:
`` -> ...3
`` -> ...4
`` -> ...5
Error in as.POSIXlt.character(x, tz, ...) : character string is not in a standard unambiguous format
Can someone give me some help?
Try reading every column as text:
for(table in lst){
dataFromExcel <- read_excel(table, col_types = "text" ) # <- !!
df <- rbind(df,dataFromExcel)
}
You will have to convert the columns to the correct types afterwards.
Further, a perhaps more R-like version:
library(readxl)
library(data.table)
DT <- rbindlist(lapply(lst, read_excel, col_types = "text"),
use.names = TRUE, fill = TRUE)
should do the same as your for-loop (and has some nice extra features, see ?data.table::rbindlist ).
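As a rough sketch of the "read as text, bind, then convert" idea, the example below uses two small in-memory data frames (made-up data standing in for sheets already read with col_types = "text") instead of real XLS files. The column names id, date, and note are assumptions for illustration only.

```r
library(data.table)

# Made-up stand-ins for two sheets read as text; columns differ between them
a <- data.frame(id = c("1", "2"),
                date = c("2020-01-01", "2020-01-02"),
                stringsAsFactors = FALSE)
b <- data.frame(id = "3", note = "x", stringsAsFactors = FALSE)

# fill = TRUE pads columns that are missing from one input with NA
combined <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)

# Convert the text columns to their intended types after binding
combined[, id := as.integer(id)]
combined[, date := as.Date(date)]

print(combined)
```

Because everything is bound as character first, mismatched types between files cannot trigger a conversion error during the bind itself; any parsing problems surface in the explicit conversion step instead.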

How to exclude a specific column during the initial import of the file into R?

I am using RStudio and the following R code to import a file. I need to exclude one specific column called "Approach".
Currently, my code to read the file is as follows:
df1 <- read.csv("myfile.csv", check.names=FALSE, header = TRUE, fileEncoding="latin1")
I have tried something like this but it is not working:
excl_Approach_Col<-c("Approach")
df1 <- read.csv("myfile.csv", check.names=FALSE, header = TRUE, col.names!= excl_Approach_Col, fileEncoding="latin1")
I am getting the following error message:
Error in read.table(file = file, header = header, sep = sep, quote = quote,
: object 'col.names' not found
I know I can import the full file as df1 and then proceed to drop that specific column. However, it would be nice if I could exclude the column during the read file step.
Is this possible? Do I need any specific package to perform this operation?
You can use 'fread' in 'data.table' for loading select columns. 'select' allows you to pick columns, 'drop' allows you to exclude:
library( data.table)
a <- data.table::fread(
"myfile.csv" ,
drop = "Approach"
)
You can use the following to import only certain columns (the whole file is still read, then subset):
read.csv(file = "result1", sep = " ")[ ,1:2]
or, if the column names are known:
read.csv(file = "result1", sep = " ")[ ,c('col1', 'col2')]
library(data.table)
a <- fread("myfile.csv", drop = "Approach")
To drop multiple columns, use
a <- fread("myfile.csv", drop = c("Approach", "OtherColumn"))
and to keep only specific columns, use select in place of drop.
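To make the options above concrete, here is a small self-contained sketch. It writes a made-up three-column file to a temporary path (the columns ID and Score are assumptions, only "Approach" comes from the question), then skips the column at read time, once with fread's drop argument and once in base R by giving that column the class "NULL" in colClasses.

```r
library(data.table)

# Made-up sample file standing in for myfile.csv
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(ID = 1:3,
             Approach = c("a", "b", "c"),
             Score = c(0.1, 0.2, 0.3)),
  tmp, row.names = FALSE
)

# data.table: the dropped column is never loaded
dt <- fread(tmp, drop = "Approach")

# base R: a column whose class is "NULL" is skipped while reading
df <- read.csv(tmp, colClasses = c(Approach = "NULL"))

print(names(dt))
print(names(df))
unlink(tmp)
```

The base R variant needs no extra package, which may matter if read.csv options such as fileEncoding are already in use.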

Read csv with sqldf

I'm having trouble reading big csv files in R, so I'm trying to use the sqldf package to read just some columns or lines from the csv.
I tried this:
test <- read.csv.sql("D:\\X17065382\\Documents\\cad\\2016_mar\\2016_domicilio_mar.csv", sql = "select * from file limit 5", header = TRUE, sep = ",", eol = "\n")
but I got this error:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: D:\X17065382\Documents\cad\2016_mar\2016_domicilio_mar.csv line 198361 expected 1 columns of data but found 2
If you're not too fussy about which package you use, data.table has a great function for doing just what you need
library(data.table)
file <- "D:\\X17065382\\Documents\\cad\\2016_mar\\2016_domicilio_mar.csv"
fread(file, nrows = 5)
Like Shinobi_Atobe said, the fread() function from data.table works really well. If you prefer to use base R, you could also use read.csv() or read.csv2().
For example:
read.csv2(file_path, nrows = 5)
Also, what do you mean by "big files"? 1GB, 10GB, 100GB?
This works for me.
require(sqldf)
df <- read.csv.sql("C:\\your_path\\CSV1.csv", "select * from file where Name='Asher'")
df
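The "expected 1 columns of data but found 2" message suggests the header line contains no commas, so one possibility (an assumption, not confirmed in the thread) is that the file uses a different delimiter such as a semicolon. The sketch below builds a made-up semicolon-separated file and reads only the first rows and selected columns with fread, passing the separator explicitly.

```r
library(data.table)

# Made-up semicolon-separated file standing in for the large csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b;c", "1;x;10", "2;y;20", "3;z;30", "4;w;40"), tmp)

# Read only the first two data rows and two of the three columns
first_rows <- fread(tmp, sep = ";", nrows = 2, select = c("a", "c"))

print(first_rows)
unlink(tmp)
```

If the real file is indeed semicolon-separated, passing sep = ";" to read.csv.sql may resolve the original error as well.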

Writing a loop and initializing various class of objects

I am processing a large file: I read in chunks of it, process each chunk, and save what I extract. Then, after rm(list=ls()) to clear memory (sometimes I have to use .rs.restartR() as well, but that is not the concern of this post), I run the same script again after adding 1 to two numbers in it.
This seemed like an opportunity to try writing a loop, but between trying to initialize all the objects used in the loop and not being very good at writing loops, it got really confusing.
I posted this here to hear some suggestions; I apologize in advance if my question is too vague. Thanks.
####################### A:11
####################### B:12
# A I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*11, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## First use colSums(), which saves a numeric vector in `final_dfm_1`
## tib is the desired object I will save with a new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract 'the freq of each token'
# B Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq12.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below I will run the same script but change 11 to 12 in fread(), and change 12 to 13 in saveRDS() command.
####################### A:12
####################### B:13
# A I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*12, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 4, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## Using colSums() gives a numeric vector `final_dfm_1`
## tib is the desired object I will save with a new name each time.
final_dfm_1 <- colSums(dfm_1)
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract 'the freq of each token'
# B Here I change the name `tib` is saved under each time.
saveRDS(tib, file = "tiq13.Rda")
rm(list=ls(all=TRUE))
Sys.sleep(10)
gc()
Sys.sleep(10)
Below is a list of all the objects in my working environment (thanks to this post), which are cleared from the working environment before running the same chunk with A+1 and B+1.
Object       Type        Size       Rows     Columns
dfm_1        dfmSparse   174708600  166836   1731410
bi_tkn_one   tokens      152494696  166836   NA
tib          tbl_df      148109248  1731410  2
final_dfm_1  numeric     148108544  1731410  NA
text_tbl     data.table  22485264   166836   1
I spent some time trying to figure out how to write this loop, found a post on SO about how to initialize a data.table with a character column, but there are still other objects that I think I need to initialize. I am unsure of how plausible it is to write such a loop.
I have copied and pasted the same script back-to-back as shown above and run it all at once. It's silly, since I am just adding one in two places.
Feel free to comment on my approach; I would like to learn something from this. Best
On a side note: I read about adding .rs.restartR() to the loop, and came across a post that suggested using batch files or scheduling tasks in R. I will have to pass on learning those for now.
This was very simple: I didn't have to initialize any objects, which is what I had been trying to do. The only thing I had to do was load the required packages upon starting R and run the loop.
ls()
character(0)
From an empty environment, just a simple loop.
library(data.table)
library(quanteda)
library(dplyr)
for (i in 4:19){
# A I change the multiple each time here.
text_tbl <- fread("tlm_s_words", skip = 166836*i, nrows = 166836, header = FALSE, col.names = "text")
bi_tkn_one <- tokens(text_tbl$text, what = "fastestword", ngrams = 3, concatenator =" ", verbose = TRUE)
dfm_1 <- dfm(bi_tkn_one)
## Using colSums() gives a numeric vector `final_dfm_1`
## tib is the desired object I will save with a new name each time.
final_dfm_1 <- colSums(dfm_1)
print(setNames(length(final_dfm_1), "no. N-grams in this batch"))
# no. N-grams
tib <- tbl_df(final_dfm_1) %>% add_rownames()
# This is what I wanted to extract 'the freq of each token'
# B Here I change the name `tib` is saved under each time.
iplus = i+1
saveRDS(tib, file = paste0("titr",iplus,".Rda"))
rm(list=ls())
Sys.sleep(10)
gc()
Sys.sleep(10)
}
Without initializing any data.table or other objects, the above loop saved 16 files in my working directory.
That makes me wonder: when do we need to initialize vectors, matrices, and other objects that are used in a loop?
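One common rule of thumb, offered here as a sketch rather than a definitive answer: an object needs to exist before the loop only when results accumulate in it across iterations. Objects that are created fresh inside each iteration, like the ones in the loop above, need no initialization. The example below uses made-up stand-in data and temporary file names to show both cases side by side.

```r
n_batches <- 4

# Accumulates across iterations, so it is created (preallocated) up front
batch_sizes <- integer(n_batches)

for (i in seq_len(n_batches)) {
  # Created anew on every pass, so no initialization is needed
  chunk <- seq_len(i * 10)  # stand-in for one processed batch

  # Store one summary value per batch in the preallocated vector
  batch_sizes[i] <- length(chunk)

  # Each batch goes to its own file, named from the loop index
  out_file <- file.path(tempdir(), paste0("batch", i + 1, ".Rds"))
  saveRDS(chunk, out_file)
}

print(batch_sizes)
```

Note that a rm(list = ls()) inside such a loop would also delete batch_sizes, so accumulating results and clearing the whole environment on every pass do not combine well.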

Having trouble reading table using sqldf package (R)

Background:
I can successfully pull a particular dataset (shown in the code below) from the internet using the read.csv() function. However, when I try to utilize the sqldf package to speed up the process using read.csv.sql() it produces errors. I've tried various solutions but can't seem to solve this problem.
I can successfully pull the data and create the data frame that I want with read.csv() using the following code:
ce_data <- read.csv("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
fill=TRUE, header=TRUE, sep="")
To test the functionality of sqldf on my machine, I successfully tested read.csv.sql() by reading in the data as 1 variable rather than the 5 desired using the following code:
library(sqldf)
ce_data_sql1 <- read.csv.sql("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
sql = "select * from file")
To produce the result that I got using read.csv() but utilizing the speed of read.csv.sql(), I tried this code:
ce_data_sql2 <- read.csv.sql("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
fill=TRUE, header=TRUE, sep="", sql = "select * from file")
Unfortunately, it produced this error:
trying URL
'http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData' Content
type 'text/plain' length 24846571 bytes (23.7 MB) downloaded 23.7 MB
Error in sqldf(sql, envir = p, file.format = file.format, dbname =
dbname, : unused argument (fill = TRUE)
I have tried various methods to address the errors, using sqldf documentation and have been unsuccessful.
Question:
Is there a solution where I can read in this table with 5 variables desired using read.csv.sql()?
The reason you are reading it in as a single variable is because you did not correctly specify the separator for the original file. Try the following, where sep = "\t", for tab-separated:
ce_data_sql2 <- read.csv.sql("http://download.bls.gov/pub/time.series/cx/cx.data.1.AllData",
sep = "\t", sql = "select * from file")
The error you are getting in the final example:
Error in sqldf(sql, envir = p, file.format = file.format, dbname =
dbname, : unused argument (fill = TRUE)
is due to the fact that read.csv.sql does not accept the fill argument.
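A minimal sketch of the separator fix, using a small made-up tab-separated file in place of the BLS download (the column names series_id, year, and value are assumptions for illustration). It also filters rows in the SQL statement, which is the main reason to prefer read.csv.sql over read.csv here.

```r
library(sqldf)

# Made-up tab-separated file standing in for the downloaded data
tmp <- tempfile(fileext = ".txt")
writeLines(c("series_id\tyear\tvalue",
             "A1\t2020\t10",
             "A2\t2021\t20"), tmp)

# sep = "\t" splits the fields; the SQL filter keeps only matching rows
res <- read.csv.sql(tmp,
                    sql = "select * from file where year = 2021",
                    sep = "\t")

print(res)
unlink(tmp)
```

Arguments that belong to read.csv, such as fill, are not part of this interface, so they have to be left out.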

Resources