Efficient way to reshape a dataset from long to wide - r

I have a medical dataset which looks like this:
patient_id disease_id
1111111111 DISEASE:1
1111111111 DISEASE:2
1111111111 DISEASE:3
1111111111 DISEASE:4
1111111111 DISEASE:5
1111111111 DISEASE:6
1111111111 DISEASE:6
1111111112 DISEASE:1
1111111112 DISEASE:2
1111111112 DISEASE:4
1111111113 DISEASE:1
1111111113 DISEASE:5
which I need to feed into a neural network/random forest model. So the only natural data representation I could think of to feed into the models was:
patient_id DISEASE:1 DISEASE:2 DISEASE:3 DISEASE:4 DISEASE:5 DISEASE:6 ...
11111111111 1 1 1 1 1 1 ...
11111111112 1 1 0 1 0 0 ...
11111111113 1 0 0 0 1 0 ...
But my dataset is very big (~50 GB, 1.5 GB compressed) and has tons of disease_ids, so that reshaping this data in the most efficient way I could find in R requires 11.7 TB of space in compressed Rds format (I know this because I divided the dataset into 100 chunks and reshaping a single one resulted in a 117 GB Rds file; merging 100 of these would produce something larger than 11.7 TB).
Now, I have 5 datasets this big that I need to merge together, so I feel a bit stuck. I need to come up with a more efficient data representation but don't know how, as I am dealing with categorical variables which will require one-hot encoding. Can anyone suggest any alternative ways to deal with data like this?

Given the size of the input you will want to do stream processing, and R is not well suited to that, so here we use a simple gawk program instead.
gawk is available in Rtools on Windows and comes natively with most UNIX/Linux systems.
In the first pass the gawk program creates an associative array disease from the disease field, i.e. the second field, of the input. Presumably the number of distinct diseases is much smaller than the length of the file, so this would likely fit into memory.
Then in a second pass it reads each group of records corresponding to a patient, assuming that all records for a patient are consecutive. For each patient it outputs a single row with the patient id followed by a sequence of 0's and 1's such that the ith value indicates the absence or presence of the ith disease.
FNR == 1 { next }    # skip header on both passes

# first pass - create disease array
FNR == NR {
    disease[$2] = 0;
    next;
}

# second pass - create and output flattened records
{
    if ($1 != prevkey && FNR > 2) {
        printf("%s ", prevkey);
        for (d in disease) printf("%d ", disease[d]);
        printf("\n");
        for (d in disease) disease[d] = 0;
    }
    disease[$2] = 1;
    prevkey = $1;
}

END {
    if (FNR == NR) {
        for (d in disease) print d;
    } else {
        printf("%s ", prevkey);
        for (d in disease) printf("%d ", disease[d]);
        printf("\n");
    }
}
If we put the above gawk code in model_mat.awk then we can run it like this -- note that the file must be specified twice -- once for each of the two passes:
gawk -f model_mat.awk disease.txt disease.txt
The output is the following, where each disease is indicated by 1 if it is present and by 0 if it is not:
1111111111 1 1 1 1 1 1
1111111112 1 1 0 1 0 0
1111111113 1 0 0 0 1 0
If we run it with only one disease.txt argument then it will only run the first pass and then at the end list the diseases without duplicates:
gawk -f model_mat.awk disease.txt
giving:
DISEASE:1
DISEASE:2
DISEASE:3
DISEASE:4
DISEASE:5
DISEASE:6
Listing diseases
An alternative for listing diseases is this UNIX pipeline, which lists the diseases without duplicates and sorts them: sed removes the header, cut takes the third space-separated field (it is the third because there are two spaces between the two fields) and sort -u sorts it, keeping only unique elements.
sed 1d disease.txt | cut -f 3 -d " " | sort -u > diseases-sorted.txt
Sorting and Merging
The GNU sort utility can sort and merge files larger than memory and has a parallel option to speed it up. Also see the free cmsort utility (Windows only).
csvfix
Below are some scripts using the free csvfix command line utility. You may need to modify the quotes depending on the command line processor/shell you are using, and you will need to put each pipeline on a single line or appropriately escape the newlines (backslash for bash, caret for Windows cmd). We have shown each pipeline spread over separate lines for clarity.
The first pipeline below creates a one-column list of diseases in disease-list.txt. The first csvfix command in it removes the header, the second extracts the second field (i.e. drops the patient id) and the last reduces it to unique diseases.
The second pipeline below creates a file with one row per patient, containing the patient id followed by the diseases for that patient. The first csvfix command in it removes the header, the second converts it to csv format and the last flattens it.
csvfix remove -if "$line == 1" -smq disease.txt |
csvfix read_dsv -s " " -cm -f 2 |
csvfix uniq -smq > disease-list.txt
csvfix remove -if "$line == 1" -smq disease.txt |
csvfix read_dsv -s " " -cm -f 1,2 |
csvfix flatten -smq > flat.txt

You are raising an interesting question. Analyzing that volume of data with R will be a real challenge.
So, I can only give you general advice. First, I think you need to distinguish between RAM and disk storage. Using Rds won't help you with the efficiency of the reshaping, but it will produce smaller data on disk than csv.
Concerning the efficiency of the reshaping
data.table
If you want an in-memory approach, I don't see any other possibility than using data.table::dcast. In that case, follow @Ronak Shah's recommendation:
library(data.table)
setDT(df)
df[, n := 1]
dcast(unique(df), patient_id ~ disease_id, value.var = "n", fill = 0)
?data.table::dcast :
In the spirit of data.table, it is very fast and memory efficient, making it well-suited to handling large data sets in RAM. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage.
Other solutions
With data that voluminous, I think in-memory is not the most appropriate approach. You might have a look at database approaches (especially PostgreSQL) or Spark.
Databases
You have several options for using PostgreSQL from R. One of them is dbplyr: if you know the tidyverse syntax you will find the verbs familiar. The pivot operation is a little bit trickier for a database than for a standard R data frame, but you will find ways to do it. You won't have any difficulty finding people more expert on databases than me who can give you very interesting tricks.
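To give a rough idea, a dbplyr sketch could look like the following (the connection details and table name are placeholders, and I am assuming a dbplyr version recent enough to translate tidyr::pivot_wider to SQL):
library(DBI)
library(dplyr)
library(dbplyr)
library(tidyr)
# placeholder connection; adjust dbname/host/user for your setup
con <- dbConnect(RPostgres::Postgres(), dbname = "meddb")
diagnoses <- tbl(con, "diagnoses")           # lazy table, nothing is loaded into R yet
wide <- diagnoses %>%
  distinct(patient_id, disease_id) %>%       # drop duplicated rows such as DISEASE:6 above
  mutate(n = 1L) %>%
  pivot_wider(names_from = disease_id, values_from = n, values_fill = 0L)
# materialize inside the database rather than collecting 50 GB into RAM
wide_tbl <- compute(wide, name = "diagnoses_wide")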
Spark
Spark can be a very good candidate for performing the reshaping if you can spread your tasks among executors on a server. If you are on a personal computer (standalone mode) you can still parallelize tasks between your cores, but do not forget to change the spark.memory.fraction parameter of the session, otherwise I think you might run into out-of-memory problems. I am more used to pyspark than SparkR but I think the logic will be the same.
Since Spark 1.6, you can pivot your data (ex: pyspark doc). This enables a long to wide conversion. Something in this spirit (pyspark code):
df.withColumn("n", psf.lit(1)).groupBy("patient_id").pivot("disease_id").sum("n")
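If you prefer to stay in R, a hedged sparklyr equivalent might look like this (local mode; the memory fraction value, file path and delimiter are only illustrative):
library(sparklyr)
config <- spark_config()
config$spark.memory.fraction <- 0.8        # tune as discussed above
sc <- spark_connect(master = "local", config = config)
# illustrative read; adjust the path/delimiter to your file
sdf <- spark_read_csv(sc, name = "diagnoses", path = "disease.txt", delimiter = " ")
wide <- sdf_pivot(sdf, patient_id ~ disease_id, fun.aggregate = "count")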
Concerning the size on disk
You use Rds. There are more compressed formats, e.g. fst. Parquet files are also very compressed and are maybe one of the best options for storing voluminous data. You can read them with SparkR or with the arrow package.
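As a small sketch of those formats (file names are placeholders and df stands for the long table):
library(arrow)
write_parquet(df, "diagnoses.parquet")      # columnar, compressed on disk
df2 <- read_parquet("diagnoses.parquet")
library(fst)
write_fst(df, "diagnoses.fst", compress = 75)
# fst can read back a subset of columns (and rows) without loading everything
df3 <- read_fst("diagnoses.fst", columns = c("patient_id", "disease_id"))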

Related

modifying line names to avoid redundancies when files are merged in terminal

I have two files containing biological DNA sequence data. Each of these files is the output of a python script which assigns each DNA sequence to a sample ID based on a DNA barcode at the beginning of the sequence. The output of one of these .txt files looks like this:
>S066_1 IGJRWKL02G0QZG orig_bc=ACACGTGTCGC new_bc=ACACGTGTCGC bc_diffs=0
TTAAGTTCAGCGGGTATCCCTACCTGATCCGAGGTCAACCGTGAGAAGTTGAGGTTATGGCAAGCATCCATAAGAACCCTATAGCGAGAATAATTACTACGCTTAGAGCCAGATGGCACCGCCACTGATTTTAGGGGCCGCTGAATAGCGAGCTCCAAGACCCCTTGCGGGATTGGTCAAAATAGACGCTCGAACAGGCATGCCCCTCGGAATACCAAGGGGCGCAATGTGCGTCCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCAGCGTTCTTCATCGATGACGAGTCTAG
>S045_2 IGJRWKL02H5XHD orig_bc=ATCTGACGTCA new_bc=ATCTGACGTCA bc_diffs=0
CTAAGTTCAGCGGGTAGTCTTGTCTGATATCAGGTCCAATTGAGATACCACCGACAATCATTCGATCATCAACGATACAGAATTTCCCAAATAAATCTCTCTACGCAACTAAATGCAGCGTCTCCGTACATCGCGAAATACCCTACTAAACAACGATCCACAGCTCAAACCGACAACCTCCAGTACACCTCAAGGCACACAGGGGATAGG
The first line is the sequence ID, and the second line is the DNA sequence. S066 in the first part of the ID indicates that the sequence is from sample 066, and the _1 indicates that it's the first sequence in the file (not the first sequence from S066 per se). Because of the nuances of the DNA sequencing technology being used, I need to generate two files like this from the raw sequencing files, and the result is an output where I have two of these files, which I then merge together with cat. So far so good.
The next downstream step in my workflow does not allow identical sample names. Right now it gets halfway through, errors, and closes because it encounters some identical sequence IDs. So, it must be that the 400th sequence in both files belongs to the same sample, or something like that, generating identical sequence IDs (i.e. both files might have S066_400).
What I would like to do is use some code to insert a number (1000, 4971, whatever) immediately after the _ on every other line in the second file, starting with the first line. This way the IDs would no longer be confounded and I could proceed. So, it would cover S066_2 to S066_24971 or S066_49712. Part of the trouble is that the ID may be variable in length, so that it could begin as S066_ or as 49BBT1_.
Try:
awk '/^\>/ {$1=$1 "_13"} {print $0}' filename > tmp.tmp
mv tmp.tmp filename

Is there a vector length limit in Rstudio? [duplicate]

Hoping someone can help me understand why errant \n characters are showing up in a vector of strings that I'm creating in R.
I'm trying to import and clean up a very wide data file that's in fixed-width format
(http://www.state.nj.us/education/schools/achievement/2012/njask6/, 'Text file for data runs'). I followed the UCLA tutorial on using read.fwf and this excellent SO question to give the columns names after import.
Because the file is really wide, the column headers are LONG - all together, just under 29,800 characters. I'm passing them in as a simple vector of strings:
column_names <- c(...)
I'll spare you the ugly dump here but I dropped the whole thing on pastebin.
I was cleaning up and transforming some of the variables for analysis when I noticed that some of my subsets were returning 0 rows. After puzzling over it (did I misspell something?) I realized that somehow a bunch of '\n' newline characters had been introduced into my column headers.
If I loop over the column_names vector that I created
for (i in 1:length(column_names)) {
    print(column_names[i])
}
I see the first newline character in the middle of the 81st line -
SPECIAL\nEDUCATION SCIENCE Number Enrolled Science
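As an aside, a quicker way to locate the affected entries than printing the whole vector is grep with fixed = TRUE (just a sketch):
grep("\n", column_names, fixed = TRUE, value = TRUE)   # entries that contain a newline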
Avenues I tried to resolve this:
1) Is it something about my environment? I'm using the regular script editor in R, and my lines do wrap - but the breaks on my screen don't match the placement of the \n characters, which to me suggests that it's not the R script editor.
2) Is there a GUI setting? I did some searching, but couldn't find anything.
3) Is there a pattern? It seems like the newline characters get inserted about every 4000 characters. I did some reading on R/S primitives to try to figure out whether this had something to do with basic R data structures, but was pretty quickly in over my head.
I tried breaking up the long string into shorter chunks and then combining them, and that seemed to solve the problem:
column_names.1 <- c(...)
column_names.2 <- c(...)
column_names_combined <- c(column_names.1, column_names.2)
so I have an immediate workaround, but would love to know what's actually going on here.
Some of the posts that had to do with problems with character vectors suggested that I run memory profile:
memory.profile()
NULL symbol pairlist closure environment promise
1 9572 220717 4734 1379 5764
language special builtin char logical integer
63932 165 1550 18935 10302 30428
double complex character ... any list
2039 1 60058 0 0 20059
expression bytecode externalptr weakref raw S4
1 16553 725 150 151 1162
I'm running 64-bit R 2.15.1 on Windows 7 (Enterprise, SP 1, 8 GB RAM).
Thanks!
I doubt this is a bug. Instead, it looks like you're running into a known limitation of the console. As it says in Section 1.8 - R commands, case sensitivity, etc. of An Introduction to R:
Command lines entered at the console are limited[3] to about 4095 bytes (not characters).
[3] some of the consoles will not allow you to enter more, and amongst those which do some will silently discard the excess and some will use it as the start of the next line.
Either put the command in a file and source it, or break the code into multiple lines by inserting your own newlines at appropriate points (between commas). For example:
column_names <-
c("County Code/DFG/Aggregation Code", "District Code", "School Code",
"County Name", "District Name", "School Name", "DFG", "Special Needs",
"TOTAL POPULATION TOTAL POPULATION Number Enrolled LAL", ...)

append multiple large data.table's; custom data coercion using colClasses and fread; named pipes

[This is kind of multiple bug reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R, so apologies if I'm not following best practices in my code below. I'm trying.]
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
First I want to report that using rbindlist to append large data.tables causes R to segfault (Ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from CRAN from within R). The machine has 128 GiB of RAM, so I shouldn't be running out of memory given the size of the data.
My code:
append.tables <- function(files) {
    moves.by.year <- lapply(files, fread)
    move <- rbindlist(moves.by.year)
    rm(moves.by.year)
    move[, week_end := as.Date(as.character(week_end), format = "%Y%m%d")]
    return(move)
}
Crash message:
append.tables crashes with this:
> system.time(move <- append.tables(files))
*** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'
Traceback:
1: rbindlist(moves.by.year)
2: append.tables(files)
3: system.time(move <- append.tables(files))
There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.
2. Could fread accept multiple file names?
In any case, I think a better approach here would be to allow fread to take files as a vector of file names:
files <- c("my", "files", "to be", "appended")
dt <- fread(files)
Presumably you could be much more memory efficient under the hood, without having to keep all of these objects around at the same time as appears to be necessary for a user of R.
3. colClasses gives an error message
My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:
dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
Yes, in the case of dates, a simple:
dt[,date := as.Date(as.character(date), format="%Y%m%d")]
works.
However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
Now, I can get around this with some system() calls to append the files and pipe them through some sed magic (simplified here; tfile is another temporary file):
if (has_header) {
    tfile2 <- tempfile()
    system(paste("echo fakeline >>", tfile2))
    system(paste("head -q -n1", files[[1]], ">>", tfile2))
    system(paste("tail -q -n+2", tfile2, paste(files, collapse = " "),
                 " | sed 's/\\.//' >>", tfile), wait = wait)
    unlink(tfile2)
} else {
    system(paste("cat", paste(files, collapse = " "), ">>", tfile), wait = wait)
}
but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)
4. fread thinks named pipes are empty files
I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe (system(paste("mkfifo", tfile))), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999
In any case, this is a bit of a hack.
Simplified Code if I had my wish list
Ideally, I would be able to do something like this:
setClass("Int_Price")
setAs("character", "Int_Price",
function (from) {
return(as.integer(gsub("\\.", "", from)))
}
)
dt <- fread(files, colClasses=list(price="Int_Price"))
And then I'd have a nice long data.table with properly coerced data.
Update: The rbindlist bug has been fixed in commit 1100 of v1.8.11. From NEWS:
o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.
As mentioned in the comments, you're supposed to ask separate questions separately. But since they're such good points, and they link together into the wish at the end, OK, I'll answer them in one go.
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
Please run again changing the line :
moves.by.year <- lapply(files, fread)
to
moves.by.year <- lapply(files, fread, verbose=TRUE)
and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with long vector support in R3 yet, otherwise > 2^31 rows would be fine, too).
2. Could fread accept multiple file names?
Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.
I'll file this as a feature request and post the link here.
3. colClasses gives an error message
When colClasses is a list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is for this to be easier than colClasses in read.csv, which only accepts a vector, so you don't have to repeat "character" over and over. I could have sworn this was better documented in ?fread but it seems not. I'll take a look at that.
So, instead of
fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
the correct syntax is
fread(tfile, colClasses=list(myDate="date"))
Given what you go on to say in the question, if I understand correctly, you actually want:
fread(tfile, colClasses=list(character="date")) # just fread accepts list
or
fread(tfile, colClasses=c("date"="character")) # both read.csv and fread
Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, then I still have to implement that coercion automatically. You mentioned precision of numerics, so just a reminder that integer64 can be read directly by fread too.
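For completeness, a small sketch of that workflow with the price column from your wish list: read it as character, strip the decimal point, then coerce to integer64 via the bit64 package (tfile as in your question):
library(data.table)
library(bit64)
dt <- fread(tfile, colClasses = c(price = "character"))
dt[, price := as.integer64(gsub(".", "", price, fixed = TRUE))]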
4. fread thinks named pipes are empty files
Hopefully this goes away now, assuming the previous point is resolved. fread works by memory mapping its input. It can accept non-files such as http addresses and connections (tbc), and what it does first, for convenience, is write the complete input to ramdisk so that it can map it from there. The reason fread is fast goes hand in hand with seeing the entire input first.

Does `sqlite3` support loops?

I wrote the little bash script below and it works as intended, but I added a couple of comments and newlines for readability, which breaks the code. Removing the comments and newlines should make it a valid script.
### read all measurements from the database and list each value only once
sqlite3 -init /tmp/timeout /tmp/testje.sqlite \
'select distinct measurement from errors order by measurement;' |
### remove the first line of stdout as this is a notification rather than intended output
sed '1d' |
### loop through all found values
while read error; do
### count the number of occurrences in the original table and print that
sqlite3 -init /tmp/timeout /tmp/testje.sqlite \
"select $error,count( measurement ) from errors where measurement = '$error' ;"
done
The result is like this:
134 1
136 1
139 2
159 1
Question: Is it possible with sqlite3 to translate the while loop into SQL statements? In other words, does sqlite3 support some sort of for-loop to loop through the results of a previous query?
Now, I know sqlite3 is a very limited database and chances are that what I want is just too complex for it. I've been searching for it, but I'm really a database nitwit and the hits I get so far are either about a different database or solve an entirely different problem.
The easiest answer (which I hope is not the case, BTW) is 'sqlite3 does not support loops'.
SQLite does not support loops. Here is the entire language; you'll notice that structured programming is completely absent.
However, that's not to say that you can't get what you want without loops, using sets or some other SQL construct instead. In your case it might be as simple as:
select measurement, count( measurement ) from errors GROUP BY measurement
That will give you a list of all measurements in the errors table and a count of how often each one occurs.
In general, SQL engines are best utilized by expressing your query in a single (sometimes complex) SQL statement, which is submitted to the engine for optimization. In your example you have already codified some decisions about the strategy used to get the data from the database; it's a tenet of SQL that the engine is better able to make those decisions than the programmer.
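If you end up driving this from R, as in the other questions here, the same single statement can be submitted through DBI (a sketch; the database path is taken from your script):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "/tmp/testje.sqlite")
dbGetQuery(con, "SELECT measurement, COUNT(measurement) AS n
                 FROM errors
                 GROUP BY measurement
                 ORDER BY measurement")
dbDisconnect(con)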

Perform sequence of edits on a large text file

I am hoping to perform a series of edits to a large text file composed almost entirely of single letters, separated by spaces. The file is about 300 rows by about 400,000 columns, and about 250 MB.
My goal is to transform this table using a series of steps, for eventual processing with another language (R, probably). I don't have much experience working with big data files, but Perl has been suggested to me as the best way to go about this. Please let me know if there is a better way :).
So, I am hoping to write a Perl script that does the following:
Open file, edit or write to a new file the following:
remove columns 2-6
merge/concatenate pairs of columns, starting with column 2 (so, merge column 2-3,4-5, etc)
replace each character pair according to a sequential conditional algorithm running across each row:
[example PSEUDOCODE: if characters 1 and 2 of the cell are both 'a', set the cell to 1;
else if characters 1 and 2 of the cell are both 'b', set the cell to 2;
etc.] such that, except for the first column, the table is a numerical matrix
remove every nth column, or keep every nth column and remove all others
I am just starting to learn Perl, so I was wondering whether these operations are possible in Perl, whether Perl would be the best way to do them, and whether there are any suggestions for the syntax of these operations in the context of reading from and writing to a file.
I'll start:
use strict;
use warnings;

while (<>) {
    chomp;
    my @cols = split /\s+/;             # split on whitespace
    splice(@cols, 1, 5);                # remove columns 2-6
    my @transformed = ($cols[0]);       # keep the row label
    for (my $i = 1; $i < $#cols; $i += 2) {
        push @transformed, "$cols[$i]$cols[$i+1]";   # merge pairs of columns
    }
    # other transforms as required
    print join(' ', @transformed), "\n";
}
That should get you on your way.
You need to post some sample input and expected output or we're just guessing what you want, but maybe this will be a start:
awk '{
printf "%s ", $1
for (i=7;i<=NF;i+=2) {
printf "%s%s ", $i, $(i+1)
}
print ""
}' file
