Extracting and storing data from a very large file in R

I have a very large DAT file (16 GB). It contains some information on, let's say, 1000 customers. The data is laid out as below, where the first column is the customer ID:
9909814 246766 0 31/07/2012 7:00 0.03 0 0 0 0
8211675 262537 0 8/04/2013 3:00 0.52 0 0 0 0
However, the customers' rows are not stored in an organized way, so I want to extract the data for each customer and store it in a separate file. (I have a file that contains the customer IDs.)
For just one customer, I wrote the following code, which can search through the file and extract that customer's data. My problem is how to do this for all the customers while reading this big file into R.
con <- file('D:/CD_INTERVAL_READING.DAT')
open(con)
n  <- 20
nk <- 100000
B  <- 9909814  # customer ID for customer no. 1
customer1 <- read.table(con, sep = ",", nrow = 1)
for (i in 1:n) {
  conn <- read.table(con, sep = ",", skip = (i - 1) * nk, nrow = nk)
  ## extracts just those rows that belong to a specific customer ID
  temp1 <- conn[conn$V1 == B, ]
  customer1 <- rbind(customer1, temp1)
}
customer1 <- customer1[-1, ]
library(xlsx)
write.xlsx(customer1, "D:/customer1.xlsx")

The optimal solution would probably be to import the data into a proper database, but if you really want to split the file into multiple files based on the first token, you can use awk with this one-liner:
awk '/^/ {ofn=$1 ".txt"} ofn {print > ofn}' filetosplit.txt
It works as follows:
/^/ matches the start of every line.
{ofn=$1 ".txt"} sets the ofn variable to the first word (split by white space) with .txt appended.
ofn {print > ofn} prints each line to the file named by ofn.
It takes me just under two minutes on my laptop to split a 1 GB file with the same format as you listed above into multiple text files. I have no idea how well that scales or if it's fast enough for you. If you want an R solution you can always wrap it into a system() call ;o)
Addendum:
Oh ... I'm guessing you are on Windows based on the path you mentioned. In that case you may need to install Cygwin to get awk.
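If you want to drive it from R, here is a minimal sketch of the system() wrapper (it assumes awk is on your PATH, e.g. via Cygwin, and that R is run from a POSIX-style shell such as Cygwin bash or git-bash, since the awk program is protected by single quotes; the path is the one from the question):
dat_file <- "D:/CD_INTERVAL_READING.DAT"
# build the awk one-liner and run it; each customer ID ends up in its own <ID>.txt file
cmd <- paste0("awk '/^/ {ofn=$1 \".txt\"} ofn {print > ofn}' ", shQuote(dat_file, type = "sh"))
system(cmd)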

Related

How to save R DataFrame to a file in MSSQL backup format?

I need to feed external MSSQL server with a large amount of data calculated in R.
No direct access to DB is possible, so it must be an interim export file.
Excel format cannot be utilised due to number of data rows exceeding Excel capacity.
CSV would be fine, but there are many obstacles in the data itself, like semicolons used in names, special characters, unclosed quotations (an odd number of ") and so on.
I am looking for a versatile method of transporting data from R to an MSSQL database, independent of the data content. If I were able to save the DataFrame as a single-table database in MSSQL backup format, that would satisfy the need.
Any idea on how to achieve this? Any package available? Any suggestion would be appreciated.
I'm inferring you're hoping to bulk-insert the data using bcp or sqlcmd. While neither deals well with embedded commas and embedded quotes, you can work around this by using a different field separator (one that is not contained within the data).
Setup:
evil_data <- data.frame(
  id = 1:2,
  chr = c('evil"string ,;\n\'', '",";:|"'),
  stringsAsFactors = FALSE
)
# con <- DBI::dbConnect(...)
DBI::dbExecute(con, "create table r2test (id INT, chr nvarchar(64))")
# [1] 0
DBI::dbWriteTable(con, "r2test", evil_data, create = FALSE, append = TRUE)
DBI::dbGetQuery(con, "select * from r2test")
# id chr
# 1 1 evil"string ,;\n'
# 2 2 ",";:|"
First, I'll use \U001 as the field separator and \U002 as the row separator. Those two should be "good enough", but if you have non-printable characters in your data, then you might either change your separators to other values or look for encoding options for the data (e.g., base64, though it might need to be stored that way).
write.table(evil_data, "evil_data.tab", sep = "\U001", eol = "\U002", quote = FALSE)
# or data.table::fwrite or readr::write_delim
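As a quick sanity check before exporting, you can verify that the chosen separators really never occur in the data (a small sketch, assuming evil_data from above):
sep_used <- sapply(evil_data, function(x)
  any(grepl("\U001", as.character(x), fixed = TRUE) |
      grepl("\U002", as.character(x), fixed = TRUE)))
any(sep_used)  # should be FALSE before trusting the export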
Since I'm using bcp, it can use a "format file" to indicate separators and which columns on the source correspond with columns in the database. See references for how to create this file, but for this example I'll use:
fmt <- c("12.0", "2",
"1 SQLCHAR 0 0 \"\001\" 1 id \"\" ",
"2 SQLCHAR 0 0 \"\002\" 2 chr SQL_Latin1_General_CP1_CI_AS")
writeLines(fmt, "evil_data.fmt")
From here, assuming bcp is in your PATH (otherwise you'll need an absolute path for bcp), run this in a terminal (I'm using git-bash on Windows, but it should be the same in other shells). The second line is entirely specific to my database connection; you'll need to omit or change it for your own connection. The first line is the part that matters for your data.
$ bcp [db_owner].[r2test] in evil_data.tab -F2 -f evil_data.fmt -r '\002' \
-S '127.0.0.1,21433' -U 'myusername' -d 'mydatabase' -P ***MYPASS***
Starting copy...
2 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 235 Average : (8.51 rows per sec.)
Proof that it worked:
DBI::dbGetQuery(con, "select * from r2test")
# id chr
# 1 1 evil"string ,;\n'
# 2 2 ",";:|"
# 3 1 1\001evil"string ,;\r\n'
# 4 2 2\001",";:|"
References:
Microsoft pages for bcp: Windows and Linux
non-XML format files
bcp and quoted-CSV

Checking for number of items in a string in R

I have a very large csv file (1.4 million rows). It is supposed to have 22 fields and 21 commas in each row. It was created by taking quarterly text files and compiling them into one large text file so that I could import it into SQL. In the past, one field has occasionally been missing from the file. I don't have the time to go row by row and check for this.
In R, is there a way to verify that each row has 22 fields or 21 commas? Below is a small sample data set. The possibly missing field is the 0 in the 10th slot.
32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1
You can use the base R function count.fields to do this:
count.fields(tmp, sep=",")
[1] 22 22
The input for this function is the name of a file or a connection. Below, I supplied a textConnection. For large files, you would probably want to feed this into table:
table(count.fields(tmp, sep=","))
Note that this can also be used to count the number of rows in a file using length, similar to the output of wc -l in the *nix OSs.
data
tmp <- textConnection(
"32,01,01,01,01,01,000000,123,456,0,132,345,456,456,789,235,256,88,4,1,2,1
32,01,01,01,01,01,000001,123,456,0,132,345,456,456,789,235,256,88,5,1,2,1"
)
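If you want the offending line numbers rather than just the distribution of field counts, a minimal sketch along the same lines (the file name quarters_combined.csv is hypothetical):
n_fields <- count.fields("quarters_combined.csv", sep = ",")
# rows whose field count differs from the expected 22
bad_rows <- which(n_fields != 22)
length(n_fields)  # total number of rows, like wc -l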
Assuming df is your data frame:
apply(df, 1, length)
This will give you the number of fields in each row.

R readr package - written and read in file doesn't match source

I apologize in advance for the lack of reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
          row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
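One way to narrow this down is to look for the characters that typically break CSV round-trips (embedded newlines, carriage returns, or quotes) in the character columns before writing; a minimal sketch, assuming sa_all from above:
chr_cols <- names(sa_all)[vapply(sa_all, is.character, logical(1))]
# count values in each character column containing an embedded newline, CR, or double quote
sapply(chr_cols, function(col) sum(grepl("[\r\n\"]", sa_all[[col]])))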

Unix programming to subset every 1Mb and name the subset

I need a way to subset a large data set in Unix. I have > 50K SNPs, each with the genetic variance it explains and a location (chromosome and position). I need to subset the SNPs into 1-million-base-pair bins of position for each chromosome, to create what we call 1 Mb windows. I also need to name these windows, for instance CHR:WINDOW.
My data is structured as:
SNP CHR POS GenVar
BTB-00074935 1 157284336 2.306141e-06
BTB-01512420 8 72495155 1.958865e-06
Hapmap35555-SCAFFOLD20017_21254 18 29600313 1.876211e-06
BTB-01098205 3 68702409 1.222881e-06
ARS-BFGL-NGS-115531 11 74038177 9.597669e-07
ARS-BFGL-NGS-25658 2 119059379 7.953552e-07
BTB-00411452 20 47919708 6.827312e-07
ARS-BFGL-NGS-100532 18 63878550 6.115242e-07
Hapmap60823-rs29019235 1 10717144 5.400144e-07
ARS-BFGL-NGS-42256 10 50282066 4.864838e-07
.
.
.
A basic first try, assuming no spaces in any of the first fields (SNP), and that the "key" is (col2, first (length-6) digits of col3):
awk '{w=0+substr($3,1,length($3)-6); print >>sprintf("CHR%02d:WINDOW%03d",$2,w)}'
This prints to files named like CHR03:WINDOW456. If you only wanted something like 03:456 for filenames, edit out the CHR and WINDOW above.
Also note that subsequent runs will just keep appending to the existing files, so you may need an rm *:* between runs.
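If you would rather stay in R, here is a minimal sketch of the same idea (the input path snp.txt is hypothetical; it assumes the header shown above):
snp <- read.table("snp.txt", header = TRUE, stringsAsFactors = FALSE)
# window index = position truncated to 1 Mb, i.e. dropping the last six digits
snp$window <- as.integer(snp$POS %/% 1e6)
keys <- sprintf("CHR%02d:WINDOW%03d", snp$CHR, snp$window)
for (k in unique(keys)) {
  write.table(snp[keys == k, ], file = k, row.names = FALSE, quote = FALSE)
}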

Creating a vector from a file in R

I am new to R and my question should be trivial. I need to create a word cloud from a txt file containing the words and their occurrence numbers. For that purpose I am using the snippets package.
As can be seen at the bottom of the link, first I have to create a vector (is it right that words is a vector?) like below.
> words <- c(apple=10, pie=14, orange=5, fruit=4)
My problem is to do the same thing but create the vector from a file that contains the words and their occurrence numbers. I would be very happy if you could give me some hints.
Moreover, to understand the format of the file to be read in, I wrote the vector words to a file.
> write(words, file="words.txt")
However, the file words.txt contains only the values but not the names(apple, pie etc.).
$ cat words.txt
10 14 5 4
Thanks.
words is a named vector; the distinction is important in the context of the cloud() function, if I read the help correctly.
Write the data out correctly to a file:
write.table(words, file = "words.txt")
Create your word-occurrence file like the txt file just created. When you read it back into R, you need to do a little manipulation:
> newWords <- read.table("words.txt", header = TRUE)
> newWords
x
apple 10
pie 14
orange 5
fruit 4
> words <- newWords[,1]
> names(words) <- rownames(newWords)
> words
apple pie orange fruit
10 14 5 4
What we are doing here is reading the file into newWords, then subsetting it to take the one and only column (variable), which we store in words. The last step is to take the row names from the file read in and apply them as the "names" on the words vector. We do the last step using the names() function.
Yes, 'vector' is the proper term.
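The subsetting and naming can also be done in one step with setNames(), if you prefer (a small sketch using the same words.txt written above):
newWords <- read.table("words.txt", header = TRUE)
words <- setNames(newWords[, 1], rownames(newWords))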
EDIT:
A better method than write.table would be to use save() and load():
save(words, file = "svwrd.rda")
load(file="svwrd.rda")
The save/load combo preserves all the structure rather than doing coercion. The write.table approach followed by names()<- is kind of a hassle, as you can see in both Gavin's answer here and my answer on R-help.
Initial answer:
Suggest you use as.data.frame to coerce to a data frame and then write.table() to write to a file.
write.table(as.data.frame(words), file="savew.txt")
saved <- read.table(file="savew.txt")
saved
words
apple 10
pie 14
orange 5
fruit 4
