I'd like to import a bunch of large text files into a SQLite database using RSQLite. If my data were comma-delimited, I'd do this:
library(DBI)
library(RSQLite)
db <- dbConnect(SQLite(), dbname = 'my_db.sqlite')
dbWriteTable(conn=db, name='my_table', value='my_file.csv')
But what about '\t'-delimited data? I know I could read the data into an R data.frame and then create the table from that, but I'd like to go straight into SQLite, since there are lots of large files. When I try the above with my data, I get a single character field.
Is there a sep='\t' sort of option I can use? I tried just adding sep='\t', like this:
dbWriteTable(conn=db, name='my_table', value='my_file.csv', sep='\t')
EDIT: In fact that works great; a flaw in the file I was working with was producing the error. It is also good to add header=TRUE if you have headers, as I do.
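For reference, the combined call for a tab-delimited file with a header row might look like the sketch below (my_file.tsv is a hypothetical file name):
library(DBI)
library(RSQLite)
db <- dbConnect(SQLite(), dbname = 'my_db.sqlite')
# sep and header are passed through to the file reader that dbWriteTable uses
dbWriteTable(conn = db, name = 'my_table', value = 'my_file.tsv', sep = '\t', header = TRUE)
dbDisconnect(db)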
Try the following:
dbWriteTable(conn=db, name='my_table', value='my_file.csv', sep='\t')
Per the following toward the top of page 21 of http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf
When dbWriteTable is used to import data from a file, you may optionally specify header=,
row.names=, col.names=, sep=, eol=, field.types=, skip=, and quote=.
[snip]
sep= specifies the field separator, and its default is ','.
How do I import this data into R? It is so messy that I don't know whether I must clean it first and then import it, or what else to do. The first line contains the names of the columns.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
It is not messy but very clean. The file is a comma-separated values file (although the delimiter appears to be a semicolon). You can use read.delim for this:
df <- read.delim("winequality-red.csv", sep = ";")
Make sure that the file is stored in the working directory. You can check the working directory with getwd() and change it with setwd().
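If you would rather pull the file straight from the UCI directory, something like the following sketch should work (it assumes the winequality-red.csv file name from above, appended to the directory URL):
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
download.file(url, destfile = "winequality-red.csv")  # saves into the working directory
df <- read.delim("winequality-red.csv", sep = ";")    # header = TRUE is the default, so line 1 becomes the column names
str(df)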
I'm using the R RODBC library to import the results of a SQL stored procedure and save them in a data frame, then export that data frame with write.table to an XML file (the result coming from SQL is XML output).
Anyhow, R is truncating the string (the XML results imported from SQL).
I've tried to find a function or an option to expand the size/length of the R data frame cell, but didn't find any.
I also tried to use sqlQuery directly in the write.table statement to avoid using a data frame, but that didn't work either; the data imported from SQL is always truncated.
Does anyone have any suggestions or an answer that could help me?
Here is my code:
# Load the RODBC library and start the SQL Server connection
library(RODBC)
my_conn<-odbcDriverConnect('Driver={SQL Server};server=sql2014;database=my_conn;trusted_connection=TRUE')
#Create a folder and a path to save my output
x <- "C:/Users/ATM1/Documents/R/CSV/test"
dir.create(x, showWarnings=FALSE)
setwd(x)
Mpath <- getwd()
# Import the data from the stored procedure output
xmlcode1 <- sqlQuery(my_conn, "exec dbo.p_webDefCreate 'SA25'", stringsAsFactors = FALSE, as.is = TRUE)
#writing to a file
write.table(xmlcode1, file=paste0(Mpath,"/SA5b_def.xml"), quote = FALSE, col.names = FALSE, row.names = FALSE)
What I get is plain text that is not the full output.
The code below (using the stringi package) is how I check the current length of my string:
library(stringi)
stri_length(xmlcode1)
[1] 65534
I had a similar issue in our project: the data coming from the db was getting truncated to 257 characters, and I could not really get around it. Eventually I converted the column definition on the db table from varchar(max) to varchar(8000), and I got all the characters back. I did not mind changing the table definition.
In your case you can perhaps convert the column type in your proc output to varchar with some defined length, if possible.
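For illustration, if the XML came from a plain SELECT rather than a stored procedure, the cast could be pushed into the query passed to sqlQuery; a rough sketch (the table and column names here are hypothetical):
xmlcode1 <- sqlQuery(my_conn,
                     "SELECT CAST(xml_output AS varchar(8000)) AS xml_output FROM dbo.webDef WHERE code = 'SA25'",
                     stringsAsFactors = FALSE, as.is = TRUE)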
I am using PostgreSQL but experienced the same issue of truncation upon importing into R with the RODBC package. I used Michael Kassa's solution with a slight change: I set the data type to text, which can store a string of unlimited length per postgresqltutorial. This worked for me.
The TEXT data type can store a string with unlimited length.
varchar() also worked for me.
If you do not specify the n integer for the VARCHAR data type, it behaves like the TEXT data type. The performance of VARCHAR (without the size n) and TEXT is the same.
I have a .txt file with one column consisting of 1040 lines (including a header). However, when loading it into R using the read.table() command, it's showing 1044 lines (including a header).
A snippet of the file looks like this:
L*H
no
H*L
no
no
no
H*L
no
Might it be an issue with R?
When opened in Excel it doesn't show any errors either.
EDIT
The problem was that R read a line like L + H* as three separate lines: L, +, and H*.
I used
table <- read.table(file.choose(), header=T, encoding="UTF-8", quote="\n")
You can try readLines() to see how many lines there are in your file, and feel free to use read.csv() to import it again and see whether it gives the expected result. Sometimes a file is parsed differently because of an extra quote, an extra carriage return, or some other quirk.
Possible import steps:
1. Look at your data with a text editor or readLines() to figure out the delimiter and file type (see the sketch after this list).
2. Determine an import method (type read and press Tab; you will see the functions available for import. Also check out readr).
3. Customize your arguments: for example, whether you have a header or not, or whether you want to skip the first n lines.
4. Look at the data again in R with View(head(data)) or View(tail(data)), and determine whether you need to repeat steps 2-4.
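A minimal sketch of those steps, assuming the file is called mydata.txt (a hypothetical name) and sits in the working directory:
raw <- readLines("mydata.txt")
length(raw)     # step 1: how many physical lines does the file really have?
head(raw)       # eyeball the delimiter and any stray quote characters
dat <- read.table("mydata.txt", sep = "\n", stringsAsFactors = FALSE)  # steps 2-3
View(head(dat)) # step 4: check the parsed result
View(tail(dat))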
Based on the data you have provided, try using sep = "\n". By using sep = "\n" we ensure that each line is read as a single column value. Additionally, quote does not need to be used at all. There is no header in your example data, so I would remove that argument as well.
All that said, the following code should get the job done.
table <- read.table(file.choose(), sep = "\n")
Suppose I have variable s with this code:
s <- "foo\nbar"
Then I change it to a data.frame:
s2 <- data.frame(s)
Now s2 is a data.frame with one record; next I export it to a csv file with:
write.csv(s2, file = "out.csv", row.names = F)
Then I open it with Notepad: the "foo\nbar" has been split across two lines. With SAS import:
proc import datafile = "out.csv" out = out dbms = csv replace;
run;
I get two records: one is '"foo', the other is 'bar"', which is not what I expected.
After struggling for a while, I found that if I export from R with the foreign package like this:
library(foreign)
write.dbf(s2, 'out.dbf')
Then import with SAS:
proc import datafile = "out.dbf" out = out dbms = dbf replace;
run;
Everything works nicely and I get one record in SAS; the value appears to be 'foo bar'.
Does this mean csv is a bad choice when dealing with data, compared with dbf? Are there any other solutions or explanations for this?
CSV stands for comma-separated values. This means that each line in the file should contain a list of values separated by commas. SAS imported the file correctly based on the definition of a CSV file (i.e. 2 lines = 2 rows).
The problem you are experiencing is due to the \n character(s) in your string. This sequence of characters happens to represent a newline character, and this is why the R write.csv() call is creating two lines instead of putting it all on one.
I'm not an expert in R so I can't tell you how to either modify the call to write.csv() or mask the \n value in the input string to prevent it from writing out the newline character.
The reason you don't have this problem with .dbf is probably that it doesn't rely on commas or newlines to indicate where new variables or rows start; it must have its own special sequence of bytes that indicates this.
DBF is a database format; such formats are generally easier to work with because they have variable types/lengths embedded in their structure.
With a CSV or any other delimited file you have to have documentation included to know the file structure.
The benefit of CSV is smaller file size and compatibility across multiple operating systems and applications. For example, for a while Excel (2007?) no longer supported DBF.
As Robert says you will need to mask the new line value. For example:
replace_linebreak <- function(x, ...) {
  gsub('\n', '|n', x)  # mask the newline with a literal token
}
s3 <- replace_linebreak(s2$s)
This replaces \n with |n, which you would then need to replace back when you import again. Obviously what you choose to mask it with will depend on your data.
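A full round trip might then look like the sketch below (it assumes the masking token |n never occurs in the real data):
s <- "foo\nbar"
s2 <- data.frame(s)
s2$s <- replace_linebreak(s2$s)                       # mask the newline before export
write.csv(s2, file = "out.csv", row.names = FALSE)
back <- read.csv("out.csv", stringsAsFactors = FALSE)
back$s <- gsub("|n", "\n", back$s, fixed = TRUE)      # restore the newline after import
identical(back$s, "foo\nbar")                         # TRUE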
I have a requirement to fetch data from HIVE tables into a csv file and use it in RevoScaleR.
Currently we pull the data from HIVE and manually put it into a file on the Unix file system for ad hoc analysis. However, the requirement is to redirect the results directly into an HDFS location and use RevoScaleR from there.
How do I do that, or what sort of connection do I need to establish for this?
If I understand your question correctly, you could use a RevoScaleR ODBC connection to import the HIVE table and do further analysis from there.
Here is an example using the Hortonworks-provided ODBC driver:
OdbcConnString <- "DSN=Sample Hortonworks Hive DSN"
# Define the ODBC data source for the Hive "airline" table
odbcDS <- RxOdbcData(sqlQuery = "SELECT * FROM airline",
                     connectionString = OdbcConnString,
                     stringsAsFactors = TRUE,
                     useFastRead = TRUE,
                     rowsPerRead = 150000)
# Import into a local .xdf file for further RevoScaleR analysis
xdfFile <- "airlineHWS.xdf"
if (file.exists(xdfFile)) file.remove(xdfFile)
Flights <- rxImport(odbcDS, outFile = xdfFile, overwrite = TRUE)
rxGetInfo(data = "airlineHWS.xdf", getVarInfo = TRUE, numRows = 10)
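Once the data is in the .xdf file, downstream RevoScaleR functions can work from it on disk; for example (a sketch, and the ArrDelay column name is an assumption about the airline table):
rxSummary(~ ArrDelay, data = xdfFile)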
Chenwei's approach is OK, but there is one problem: the data is temporarily stored in memory as a data frame in the odbcDS object. If we have a huge table in Hive, then we are in trouble.
I would suggest keeping everything on disk by using external tables in Hive and then using the backend data directly in Revolution R.
Something along these lines:
Create an external table from the existing Hive table in text file (csv, tab, etc.) format:
CREATE EXTERNAL TABLE ext_table
LIKE your_original_table_name
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/your/hdfs/location';
Here we are creating an external table which is stored as a csv file in HDFS.
Next, copy the original table into the external table using the INSERT OVERWRITE command:
insert overwrite table ext_table select * from your_original_table_name;
If we want to check the backend data on HDFS, type:
hadoop fs -ls /your/hdfs/location/
We can see the part files stored at that location. Go ahead and cat them to be doubly sure.
Now we can use the RxTextData function to read the data from the step above:
hive_data <- RxTextData(file='/your/hdfs/location/', delimiter = ',')
Now you can create an .xdf file by passing hive_data as the inData argument to rxImport (writing to an .xdf outFile), which is more efficient for further processing; best of all, the data never has to be pulled into memory as a data frame.
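A sketch of that last step; it re-specifies hive_data with an explicit HDFS file system object, and the _xdf output path is just a placeholder:
hdfsFS <- RxHdfsFileSystem()
# Point at the delimited files the external table wrote to HDFS
hive_data <- RxTextData(file = '/your/hdfs/location/', delimiter = ',', fileSystem = hdfsFS)
# Import into an .xdf data source stored back on HDFS; nothing is pulled into a data frame
hive_xdf <- RxXdfData('/your/hdfs/location_xdf', fileSystem = hdfsFS)
rxImport(inData = hive_data, outFile = hive_xdf, overwrite = TRUE)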