Join Statement omitting entries - unix

Using:
Unix
2.6.18-194.el5
I am having an issue where this join statement is omitting values/indexes from the match. I found that the missing values are between 11 and 90 (out of about 3.5 million entries), and I have looked for foreign characters but may be overlooking something (I tried cat -v to see hidden characters).
Here is the join statement I am using (only simplified the output columns for security):
join -t "|" -j 1 -o 1.1 2.1 file1 file2> fileJoined
file1 contents (first 20 values):
1
3
7
11
12
16
17
19
20
21
27
28
31
33
34
37
39
40
41
42
file2 contents (first 50 values so you can see where it would match):
1|US
2|US
3|US
4|US
5|US
6|US
7|US
8|US
9|US
10|US
11|US
12|US
13|US
14|US
15|US
16|US
17|US
18|US
19|US
20|US
21|US
22|US
23|US
24|US
25|US
26|US
27|US
28|US
29|US
30|US
31|US
32|US
33|US
34|US
35|US
36|US
37|US
38|US
39|US
40|US
41|US
42|US
43|US
44|US
45|US
46|US
47|US
48|US
49|US
50|US
From my initial testing it appears that file2 is the culprit: when I create a new file with values 1-100, the join statement matches it completely against file1; however, the same file will not match against file2.
Another strange thing is that the file is 3.5 million records long, and at value 90 they start matching again. For example, the output of fileJoined looks like this (first 20 values only):
1|1
3|3
7|7
90|90
91|91
92|92
93|93
95|95
96|96
97|97
98|98
99|99
106|106
109|109
111|111
112|112
115|115
116|116
117|117
118|118
Other things I have tried are:
Using vi to manually enter a new line 11 (it still doesn't match in the join statement)
Copying the lines into Notepad, deleting them in vi, and then copying them back in (same result, no matches for 11-90)
Removing lines 11-90 to see if the problem then shifts to 90-170 (it does not shift)
I think that there may be some hidden values that I am missing, or that the 11-90 in file1 is not binary-identical to the 11-90 in file2?
I am lost here, any help would be greatly appreciated.

I tried this out, and I noticed a couple things.
First: this is minor, but I think you're missing a comma in your -o specifier. I changed it to -o 1.1,2.1.
But then, running it on just the fragments you posted, I got only three lines of output:
1|1
3|3
7|7
I think this is because join assumes alphabetical sorting, while your input files look like they're numerically sorted.
Rule #1 of join(1) is to make sure your inputs are sorted, and the same way join expects them to be!
When I ran the two input files through sort and then joined again, I got 18 rows of output. (Sorting was easy, since you're joining on the first column; I didn't have to muck around with sort's column specifiers.)
Beware that, these days, sort doesn't always sort the way you expect, due to locale issues. I tend to set LC_ALL=C to make sure I get the old-fashioned behavior I'm used to.
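For reference, here's a minimal sketch of that sort-then-join sequence (the .sorted file names are placeholders; explicitly sorting file2 on its first field is just the cautious way to sort on the join key):
export LC_ALL=C                                  # byte-wise collation; avoids locale surprises
sort file1 > file1.sorted                        # file1 is only the key column, so a plain sort is enough
sort -t "|" -k1,1 file2 > file2.sorted           # sort file2 on the join field
join -t "|" -j 1 -o 1.1,2.1 file1.sorted file2.sorted > fileJoined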

Related

Reading tab-delimited text file into R only expecting one column out of five

Here's my problem. I'm trying to read a tab-delimited text file into R, and I keep getting error messages; it only loads one column out of the five in the dataset.
Our professor is requiring us to use the read_csv() command for this, but I've tried read_tsv() as well, and neither has worked. I've looked everywhere and just can't find anything about what could be going wrong.
waste <- read_tsv("wasterunup.txt", col_names=TRUE, na=c("*"))
I can't seem to link the text file here, but it's a simple tab-delimited text file with 5 columns, column headers, 22 rows (not counting the headers). * is used for N/A results.
I have no clue how to do this "properly" with read_csv, the way my professor requires.
waste <- read_tsv("wasterunup.txt", col_names=TRUE, na=c("*"))
Parsed with column specification:
cols(
`PT1 PT2 PT3 PT4 PT5` = col_double()
)
Warning: 22 parsing failures.
row col expected actual file
1 -- 1 columns 5 columns 'wasterunup.txt'
2 -- 1 columns 5 columns 'wasterunup.txt'
3 -- 1 columns 5 columns 'wasterunup.txt'
4 -- 1 columns 5 columns 'wasterunup.txt'
5 -- 1 columns 5 columns 'wasterunup.txt'
... ... ......... ......... ................
See problems(...) for more details.
To clarify my errors:
When I use read_csv(), all of the data is there, but all five datapoints are crammed into one cell of each row.
When I use read_tsv(), only one column of the data is there.
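One way to narrow this down (just a diagnostic sketch; the only assumption is the file name from the question) is to look at the raw first lines and check whether the separator is really a tab character or just runs of spaces:
raw <- readLines("wasterunup.txt", n = 3)   # peek at the first few lines as stored on disk
cat(raw, sep = "\n")                        # print them verbatim
grepl("\t", raw)                            # TRUE for each line that contains a real tab
If the separator turns out to be runs of spaces rather than tabs, that would explain why read_tsv() sees a single column; readr's read_table() handles whitespace-separated columns.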

R bad row data not shown when read to data.table, but written to file

Sample input tab-delimited text file. Note that there is bad data in this source file: the closing " at the end of line 3 is two lines down. So there is one completely blank line, followed by a line containing just the double-quote character, and then the good data continues on the next line.
id ca cb cc cd
1 hi bye hey nope
2 ab cd ef "quoted text here"
3 gh ij kl "quoted text but end quote is 2 lines down
"
4 mn op qr lalalala
When I read this into R (I tried read.csv and fread, with and without blank.lines.skip = TRUE for fread), I get the following data table:
id ca cb cc cd
1 1 hi bye hey nope
2 2 ab cd ef quoted text here
3 3 gh ij kl quoted text but end quote is 2 lines down
4 4 mn op qr lalalala
The data table does not show the 'bad' lines. OK, good! However, when I go to write out this data table (I tried both write.table and fwrite), those two bad lines of nothing, and the double-quote line, are written out just as they appear in the input file!
I've tried doing:
dt[complete.cases(dt),],
dt[!apply(dt == "", 1, all),]
to clear out empty data before writing out, but it does nothing. The data table still only shows those 4 entries. Where is R keeping this 'missing' data? How can I clear out that bad data?
I hope this is a 'one-off' bad output from the source (good ol' US Govt!), but I think they saved this from an xls file, which had bad formatting in a column, causing the text file to contain this mistake, but they obviously did not check the output.
After sitting back and thinking through the reading functions: because that column's (cd) data is quoted, there are actually two newline characters at the end of the string, which are not shown when the data table element is printed! So writing out that element writes those two line breaks as well.
All I needed to do was:
dt$cd <- gsub("[\r\n]", "", dt$cd)
and that fixed it; the file written out now has the correct rows of data.
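For completeness, a minimal sketch of the whole round trip under the same assumptions (file names are placeholders):
library(data.table)
dt <- fread("input.txt", sep = "\t", quote = "\"")   # the embedded newlines survive inside the quoted cd field
dt$cd <- gsub("[\r\n]", "", dt$cd)                   # strip the stray line breaks before writing
fwrite(dt, "output.txt", sep = "\t")                 # output now has one line per row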
I wish I could remove my question...but maybe someday someone will come across the same "issue". I should have stepped back and thought about it before posting the question.

R readr package - written and read in file doesn't match source

I apologize in advance for the somewhat limited reproducibility here. I am doing an analysis on a very large (for me) dataset. It is from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source. It can take on only two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
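One way to dig further (a sketch only; the column names come from the warnings above, and row numbers may be offset where lines were split, so treat it as approximate) is to pull out the rows readr flagged and compare them with the same rows of the original data frame:
probs <- problems(sa_all2a)                  # row / col / expected / actual for each parsing failure
print(probs)
bad <- unique(probs$row)
sa_all[bad, c("src", "drug1", "drug2")]      # what was written out
sa_all2a[bad, c("src", "drug1", "drug2")]    # what came back in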

Unix programming to subset every 1Mb and name the subset

I need a way to subset a large data set in Unix. I have >50K SNPs, each with the genetic variance it explains and a location (chromosome and position). I need to subset the SNPs every 1 million base pairs (position) for each chromosome, to create what we call 1Mb windows. I also need to name these windows, for instance CHR:WINDOW.
My data is structured as:
SNP CHR POS GenVar
BTB-00074935 1 157284336 2.306141e-06
BTB-01512420 8 72495155 1.958865e-06
Hapmap35555-SCAFFOLD20017_21254 18 29600313 1.876211e-06
BTB-01098205 3 68702409 1.222881e-06
ARS-BFGL-NGS-115531 11 74038177 9.597669e-07
ARS-BFGL-NGS-25658 2 119059379 7.953552e-07
BTB-00411452 20 47919708 6.827312e-07
ARS-BFGL-NGS-100532 18 63878550 6.115242e-07
Hapmap60823-rs29019235 1 10717144 5.400144e-07
ARS-BFGL-NGS-42256 10 50282066 4.864838e-07
.
.
.
A basic first try, assuming no spaces in any of the first fields (SNP), that the "key" is (col2, first (length-6) digits of col3), and skipping the header line:
awk 'NR > 1 {w = 0 + substr($3, 1, length($3) - 6); print >> sprintf("CHR%02d:WINDOW%03d", $2, w)}'
This prints to files named like CHR03:WINDOW456. If you only wanted something like 03:456 for filenames, edit out the CHR and WINDOW above.
Also note, subsequent runs will just keep expanding existing files, so you may need a rm *:* between runs.
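If you would rather keep everything in one file and just label each SNP with its window, a variant of the same idea (input and output file names here are placeholders) is:
awk 'NR == 1 {print $0, "WINDOW"; next} {w = int($3 / 1000000); print $0, $2 ":" w}' snps.txt > snps_windowed.txt
This appends a CHR:WINDOW column, e.g. 1:157 for the SNP at position 157284336 on chromosome 1.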

Display vector as rows instead of columns in Octave

I'm new to Octave and running into a formatting issue which I can't seem to fix. If I display a variable with multiple columns, I get something along the lines of:
Columns 1 through 6:
0.75883 0.93290 0.40064 0.43818 0.94958 0.16467
However what I would really like to have is:
0.75883
0.93290
0.40064
0.43818
0.94958
0.16467
I've read the format documentation here but haven't been able to make the change. I'm running Octave 3.6.4 on Windows; however, I've used Octave 3.2.x on Windows and seen it produce the desired output by default.
To be specific, in case it matters, I am using the fir1 command as part of the signal package and these are sample outputs that I might see.
It sounds like, as Dan suggested, you want to display the transpose of your vector, i.e. a column vector rather than a row vector:
>> A = rand(1,20)
A =
Columns 1 through 7:
0.681499 0.093300 0.490087 0.666367 0.212268 0.456260 0.532721
Columns 8 through 14:
0.850320 0.117698 0.567046 0.405096 0.333689 0.179495 0.942469
Columns 15 through 20:
0.431966 0.100049 0.650319 0.459100 0.613030 0.779297
>> A'
ans =
0.681499
0.093300
0.490087
0.666367
0.212268
0.456260
0.532721
0.850320
0.117698
0.567046
0.405096
0.333689
0.179495
0.942469
0.431966
0.100049
0.650319
0.459100
0.613030
0.779297
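As a small follow-up: if you don't want to think about whether the variable is currently a row or a column, indexing with (:) always returns a column vector, so it always prints one value per line. For example, with the fir1 output mentioned in the question (the order and cutoff here are made-up example values):
pkg load signal        % fir1 comes from the signal package
b = fir1(5, 0.4);      % b is returned as a row vector of 6 coefficients
b(:)                   % (:) reshapes it into a column vector for display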
