How to join data from three different spreadsheets? - unix

I have three TSV files containing different data about my employees. I can join these data on the employees' last and first names, which appear in each file.
I would like to gather all the data for each employee in a single spreadsheet.
(I can't just copy/paste the columns, because some employees are not in file 2, for example, but are in file 3.)
So I think - I am a beginner - that a script could do it: for each employee (a row), gather as much data as possible from the files into a new TSV file.
Edit.
Example of what I have (in reality each file has approximately 300 rows, and some employees are not in all files).
file 1
john hudson 03/03 male
mary kate 34/04 female
harry loup 01/01 male
file 2
harry loup 1200$
file 3
mary kate atlanta
What I want:
column1 column2 column3 column4 column5 column6
john hudson 03/03 male
mary kate 34/04 female atlanta
harry loup 01/01 male 1200$
It would help me a lot!

Use this Python script:
import sys
import re

# one dictionary per input file, mapping the first column to the rest of the line
r = []
# first-column values, in the order they appear in the first file
keys = []

for i, fname in enumerate(sys.argv[1:]):
    r.append({})
    for line in open(fname):
        key, rest = re.split(r'\s+', line.rstrip(), maxsplit=1)
        r[i][key] = rest
        if i == 0:
            keys.append(key)

for key in keys:
    print(key, " ".join(d.get(key, '-') for d in r))
The script loads each file into a dictionary (the first column is used as the key).
Then it iterates over the keys from the first file and prints the corresponding
values from each file's dictionary, writing - wherever a key is missing from a file.
Example of usage:
$ cat 1.txt
user1 100
user2 200
user3 300
$ cat 2.txt
user2 2200
user3 2300
$ cat 3.txt
user1 1
user3 3
$ python3 1.py [123].txt
user1 100 - 1
user2 200 2200 -
user3 300 2300 3

If you're familiar with SQL, then you can use the Perl DBD::CSV module to do the job easily. But that also depends on whether you're comfortable writing Perl.
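For illustration, a minimal sketch of that approach (assumptions: DBD::CSV and its dependencies are installed, the files are named file1.tsv and file2.tsv in the current directory, each has a header row, and they share a name column; birthday and salary are hypothetical column names, and the exact join syntax accepted depends on the installed SQL::Statement version):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Treat the *.tsv files in the current directory as database tables;
# the first row of each file is taken as the column header.
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    f_dir        => ".",
    f_ext        => ".tsv/r",
    csv_sep_char => "\t",
    RaiseError   => 1,
});

# Join two of the files on the shared "name" column (hypothetical columns).
my $sth = $dbh->prepare(q{
    SELECT file1.name, file1.birthday, file2.salary
      FROM file1 LEFT JOIN file2 ON file1.name = file2.name
});
$sth->execute;
while (my @row = $sth->fetchrow_array) {
    print join("\t", map { defined $_ ? $_ : "-" } @row), "\n";
}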

Related

Select a dataset based on different column value but in the same row

I have a dataset with around 80 columns and 1000 rows; a sample of this dataset follows below:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
And I need to select all rows where gend.y and gend.x differ, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to keep the other 76 columns too.
I tried this command:
library(dplyr)
new.file = my.file %>%
  filter(gend.y != gend.x)
But it didn't work, and this message appeared:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different
As @divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors. The interpretation of a factor depends on both the codes and the 'levels' attribute. Be careful only to compare factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (converted them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
Then I ran my previous command with the new variables (now character):
library(dplyr)
new.file = my.file %>%
  filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
And now it worked as I expected. Credit to @divibisan.

Get frequency counts for a subset of elements in a column

I may be missing some elegant way in Stata to handle this example, which has to do with electrical parts, observed monthly failures, etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set, so that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort, etc. to create a separate data set, which I then merged back into the main set with an extra column. There must be something simpler using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, if I issue this follow-up command after running your code: tabdisp Type, c(Freq), it prints out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example get the first row of the table.
Table:
----------------------
     Type |       Freq
----------+-----------
        A |         -1
        B |         -1
        C |         -1
        D |         -3
        S |         -3
----------------------
I found this difficult to follow (see my comment on the question), but some technique is demonstrated here. The number of observations in the subsets of observations defined by by: is given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type, which I think is what you are after when breaking ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N            // negated count of each (PartID, FailType) combination
bysort PartID (Freq Type): gen ToShow = _n == 1   // first obs per PartID: highest count, lowest Type on ties
replace Freq = -Freq                              // negate back to positive counts
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+

Remove all instances of the same column value across multiple files using awk

I'm back again with another awk question.
I have multiple large files that contain data I want to dedupe against each other.
Let's say I have the following data for one month:
fruit number rand
apple 12 342
taco 19 264
tortilla 2234 53423
tortillas 2 3431
apricot 13221 23424
apricots 24234 3252
pineapple 2342 2342
radish 1 3
The following month I receive this data:
fruit number rand
pineapple 2 698
apple 34 472
taco 19 234
tortilla 16 58
tortillas 87 25
potato 234 2342
radish 1 55
grapes 9 572 422
apricot 13221 24
What I am trying to do is take the second file and check the values of its first column to see whether those items exist in the first file. If they do, I want to remove them from the second file, leaving only the items that are unique to the second file relative to the first one.
The desired outcome would leave me something like this:
fruit number rand DUPLICATE
pineapple 2 698 DUPE
apple 34 472 DUPE
taco 19 234 DUPE
tortilla 16 58 DUPE
tortillas 87 25 DUPE
potato 234 2342
radish 1 55 DUPE
grapes 9 572 422
apricot 13221 24 DUPE
Or, more clearly:
fruit number rand
potato 234 2342
grapes 9 572 422
I was trying to think of a way to do this without having to sort the files, so I tried to modify the answer from @karafka to a related question. Instead of passing the same file twice, I tried inputting the two different files. Obviously I'm doing something wrong.
awk 'BEGIN { FS = OFS = " " }
NR==FNR {a[$1]++; next}
FNR==1 {print $0, "DUPLICATE"; next}
$1 in a{if (a[$1]>1){print $(NF+1)="DUPE";delete a[$1]}}1' file{,}
I'm still learning awk, and any help the community can provide is greatly appreciated. I'll try to explain what I think the above program does.
The first line sets the delimiter and the output delimiter to be a tab character.
This line reads the first file and stores an array with a count of how many times an item appears in the list.
This outputs the first line, which is essentially the header, adding "DUPLICATE" after the last item in the row.
(This is where I'm stuck.) If the current value is found in the array "a", it should check whether the stored value is greater than one. If yes, it should print the word "DUPE" in the last column. Finally, it returns the entire line.
In the test files I keep getting everything marked as "DUPE" or nothing at all.
I've also thought of combining the files and deduping that way, but that would leave me with undesired left-over values from the first file.
What am I doing wrong?
I think what you're doing wrong is just trying to use a bunch of scripts that are unrelated to your current problem as your starting point.
It sounds like all you need is:
$ awk '
    NR==FNR { file1[$1]; next }   # first file: remember every value in column 1
    FNR==1 || !($1 in file1)      # second file: print the header and any line whose key was not seen
' file1 file2
fruit number rand
potato 234 2342
grapes 9 572 422

Join operation not working

I have a file as follows:
AKT3
ARRB1
ATF2
ATF4
BDNF
BRAF
C00076
C00165
TNF
TNFRSF1A
TP53
TRAF2
TRAF6
To me, it is perfectly sorted. Isn't it? Also, I have another file which contains AKT3, BRAF, TRAF6, etc. as its first-column elements. Since this file is too big, I am not putting it here. However, after I type:
LANG=en_EN join -j 1 file2 file1 > output -t $'\t'
output file contains these lines:
TRAF6 0 genome...
TRAF6 0 genome...
TRAF6 0 genome...
TRAF6 0 genome...
I should see other rows which start with AKT3, BRAF, etc. in this output as well, but there are only TRAF6 lines. What is the problem? How can I get the proper output? Thanks.
Edit: You can get the big file from this link:
https://www.dropbox.com/s/a2dmsq1tskpb9vg/sorted_mutation_data?dl=0
It is about 25 MB. I am sorry for this.
Edit (2): Let's say...
File1:
ADA
ADAM
BRUCE
GARY
File2:
AB 1
ABA 2
ABB 3
ADA 4
ADA 5
EVE 6
EVE 7
EVE 8
GARY 9
GARY 10
The output should be:
ADA 4
ADA 5
GARY 9
GARY 10
Edit: The problem was caused by non-printable ASCII characters that were hidden in the text. After removing them all, I could use "join".
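For reference, here is one way such characters could be spotted and removed beforehand (a sketch only, using Perl one-liners; the character class keeps tabs, newlines and printable ASCII and deletes everything else, and -i edits the file in place, so test on a copy first):
# show the lines that contain non-printable bytes
perl -ne 'print if /[^\t\n\x20-\x7E]/' file1
# delete every byte that is not a tab, newline, or printable ASCII, in place
perl -i -pe 's/[^\t\n\x20-\x7E]//g' file1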
So, I don't know what your environment is, but this works for me (I use an explicit sort to be sure it will work, and also to reveal what happens when you sort by the whole line in the default collating order instead of an explicit field).
Note also that I have no -t $'\t' on the join command. If your second file has tab-separated fields, then you'll need to express that correctly with a real tab character, with the option before the filenames, and you may also have to make sure your files are sorted with the same key and field separator.
#! /bin/sh
f1=$(mktemp -t jdata)
f2=$(mktemp -t jdata)
trap "RC=$?; rm -f $f1 $f2*; exit $RC" 0 1 2 3 15
sort > $f1 <<__EOF__
ADA
ADAM
BRUCE
GARY
__EOF__
sort > $f2 <<__EOF__
AB 1
ABA 2
ABB 3
ADA 4
ADA 5
EVE 6
EVE 7
EVE 8
GARY 9
GARY 10
__EOF__
join -j 1 $f1 $f2
sh ./tjoin-multi.sh
ADA 4
ADA 5
GARY 10
GARY 9

R - joining/merging rows in one dataset

I would like to know how to use R to merge rows in one set of data.
Currently my data looks like this:
Text 1 Text 2 Text 3 Text 4
Bob Aba Abb Abc
Robert Aba Abb Abc
Fred Abd Abe Abf
Martin Abg Abh Abi
If Text 2 and Text 3 are both the same for two rows (as in rows 1 & 2), I would like to make them into one row, with extra columns for the other data.
Text 1 Text 1a Text 2 Text 3 Text 4 Text 4a
Bob Robert Aba Abb Abc Abd
Fred NA Abd Abe Abf NA
Martin NA Abg Abh Abi NA
I did something similar with joining two separate sets of data and merging them using join
join=join(Data1, Data2, by = c('Text2'), type = "full", match = "all")
but I can't work out how to do it for duplicates within one set of data.
I think it might be possible to use aggregate but I have not used it before, my attempt was:
MyDataAgg=aggregate(MyData, by=list(MyData$Text1), c)
but when I try I am getting an output that looks like this on summary:
1 -none- numeric
1 -none- numeric
2 -none- numeric
or this on structure:
$ Initials :List of 12505
..$ 1 : int 62
..$ 2 : int 310
..$ 3 : int 504
I would also like to be able to combine rows using matching elements of two variables.
I don't think you can reshape or aggregate because:
You have duplicated rows that correspond to the same key.
You don't have the same number of values for each key: you have to fill the gaps with missing values.
Here is a manual attempt using by to process by key, and rbind.fill to aggregate the resulting list together. Each by step creates a one-row data.frame having (Text2, Text3) as its key.
do.call(plyr::rbind.fill, by(dat, list(dat$Text2, dat$Text3),
  function(d){
    ## turn all the other columns into a one-row data.frame
    dd <- as.data.frame(as.list(rapply(d[,-c(2,3)], as.character)))
    ## the tricky part: append 1 to a name like Text1 so it becomes Text11;
    ## this is important for joining the data.frames formed by by
    names(dd) <- gsub('(Text[0-9]$)', '\\11', names(dd))
    ## add the key to the row
    cbind(unique(d[,2:3]), dd)
  }))
Text2 Text3 Text11 Text12 Text41 Text42
1 Aba Abb Bob Robert Abc Abd
2 Abd Abe Fred <NA> Abf <NA>
3 Abg Abh Martin <NA> Abi <NA>
