find common rows between two dataframes based on two columns using bash - unix

I found this very difficult to solve in bash. I have two files and want to find the rows they have in common, based on two columns.
f1.csv:
col1,col2,col3,col4
Dalir,Cpne1,down,2174
Fendrr,Aco2,up,280
Cpne1,Tox1,down,8900
f2.csv:
col1,col2,col3,col4,col5,col6
Linc,Rmo,ch2,ch2,p,l
Tox1,Cpne1,ch1,ch2,l,p
So basically the code should look only at the first two columns of each file and check whether the pairs are the same (the order within a pair is not important). For example, the first file has Cpne1,Tox1 in the third row and the second file has Tox1,Cpne1 in the second row, so this pair should be printed in the output as it appears in the second file.
Desired output:
Tox1,Cpne1
Unfortunately, I have not been able to develop a bash command for this - it would be great if you could help me with this. Thanks

Just adding the explanation to oguz' fine answer in the comments above:
BEGIN{FS=OFS=","}: sets the comma as the field separator for both input and output.
NR==FNR{pair[$1,$2];next}: while the record number across the entire input equals the current file's record number (in other words, while reading the first file), add an element indexed by the first and second fields to the array pair, then skip to the next record.
($1,$2) in pair||($2,$1) in pair{print $1,$2}: for the second file, check whether fields one and two, in either order, are present as an index in the array pair, and print them if they are.
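For cross-checking the expected output, the same matching logic can be sketched in Python. This is not the awk one-liner from the comments, only a rendering of the steps explained above; the function name common_pairs, its skip_header parameter, and the in-memory sample data are made up for illustration. One difference is deliberate: the sketch skips the header line of each file by default, because the awk program as described would also print col1,col2 (both headers start with that pair).

```python
import csv
import io


def common_pairs(first, second, skip_header=True):
    """Return (col1, col2) pairs from `second` that occur in `first` in either order."""
    rows_first = csv.reader(first)
    rows_second = csv.reader(second)
    if skip_header:
        # Assumption: both files start with a header line, as in the question.
        next(rows_first, None)
        next(rows_second, None)

    # Counterpart of NR==FNR{pair[$1,$2];next}: remember every pair in the first file.
    seen = {(row[0], row[1]) for row in rows_first if len(row) >= 2}

    # Counterpart of the final awk pattern: keep pairs found in either order.
    matches = []
    for row in rows_second:
        if len(row) < 2:
            continue
        a, b = row[0], row[1]
        if (a, b) in seen or (b, a) in seen:
            matches.append((a, b))
    return matches


# Sample data copied from the question.
f1 = io.StringIO(
    "col1,col2,col3,col4\n"
    "Dalir,Cpne1,down,2174\n"
    "Fendrr,Aco2,up,280\n"
    "Cpne1,Tox1,down,8900\n"
)
f2 = io.StringIO(
    "col1,col2,col3,col4,col5,col6\n"
    "Linc,Rmo,ch2,ch2,p,l\n"
    "Tox1,Cpne1,ch1,ch2,l,p\n"
)

for a, b in common_pairs(f1, f2):
    print(f"{a},{b}")  # prints Tox1,Cpne1
```

Passing skip_header=False reproduces the behaviour of treating the header like any other row.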

Related

How to skip empty rows while reading multiple tabs in R?

I am trying to read an excel file with multiple tabs. For that, I use the code provided here.
The problem is that each tab has a different number of empty rows before the actual data begins. For example, the first tab has two empty rows, the second tab has three empty rows, and so on.
Normally, I would use the parameter skip in the read_excel function to indicate the number of empty lines to skip. But how do I do that for multiple tabs with different numbers of rows to skip?
Perhaps the easiest solution is to read each tab as it is and then remove the empty rows, e.g. yourdata <- yourdata[!is.na(yourdata$columname),]. This works if you don't expect any NAs in a particular column, such as an id column. If you have data gaps everywhere, you can instead test for rows in which several columns are all NA; let me know if that's what you need.

Read AGS type file in R

I am trying to read a special type of file (the format is called AGS), which looks like the one in the image:
This is basically a text file that contains many tables with different dimensions, separated by two (but sometimes more) empty rows. As you might guess, the problem is that these tables have different numbers of columns and, obviously, different column names.
The first row in each table (here tables are denoted as GROUP) shows the name of the table, e.g. LOCA, HDPH, etc. The second row shows the column names. The third row shows the units of each column. All the other rows show the observations. In each row, columns are separated by commas and values are inside double quotes.
I am struggling to read this type of file. The ideal output would be to have each of these tables in a separate data frame. Any help and ideas are much appreciated.
An example file can be downloaded here: example AGS file

Printing out R Dataframe - Single Character Between Columns While Maintaining Alignment (Variable Spacing)

In a previous question, I received output for an R dataframe that had two aligned columns. The answer gave me the following output:
While the post answered my initial question, it seems that the program I intend to use requires a text file in which the two columns are both aligned and separated by a single character (e.g. a tab). The previous solution instead produces a large and variable number of spaces between the first and second columns, depending on the length of the string in the first column of each row. Inserting a single character, however, misaligns the columns.
Is there any way in which I can replace a large number of spaces with a single character that has variable spacing to 'reach' to the second column?
If it helps, this webpage contains a .txt file that you may download to see the intended output. It does not suffer from the problem of variable name lengths in the first column, but it has a single 'space character' that separates the first and second columns. If I copy and paste this specific space character between columns 1 and 2, the program can successfully interpret the .txt file: the result is a single character separating the columns, with appropriate alignment.
As a further example, the first of the following pictures (note that the highlight is a single character) parses properly while the second does not:

How can I return a vector with a dataframe inside in R?

Here is a challenge for you: I am trying to build a tic-tac-toe game in R. First, the players have to enter their names, and the game should check whether each name already exists in a file called "Players.txt" (if the file does not exist, the game creates it); if the name exists, the game asks for a new one. Finally, the game should record the scores of the players (each chip used subtracts 5 points from the 100 points that a player has at the beginning of the game). The problem is that when a player wins, the game shows the following error: "Error in table[location_name1, 3]: Incorrect number of dimension in R".
A vector can be either atomic or a list. Atomic vectors can only contain elements of one and the same data type. That means you are "accidentally" creating a list with
vector=c(win,name1,name2,table)
with the result that each column of the data frame becomes a separate entry.
You can solve it with
vector <- list(win, name1, name2, table)
vector is still a list but now it has the format I believe you want.
Having done that you still get errors. The reason is that these assignments fail.
location_name1=which(grepl(name1,table$gamers))
location_name2=which(grepl(name2,table$gamers))
They return an empty vector because earlier in the code you set win=vector[1]... table=vector[4]. Since vector is now a list, you have to subset it accordingly. That means you have to change these statements to use double brackets, e.g. table=vector[[4]].
Now you are going to get another problem. The reason is that you treat the column table$scores as text. When you read the data, you need to make sure that this column is not interpreted as text. You also have to eliminate all statements that coerce the column into text. Otherwise table[location_name1,3]=table[location_name1,3]+pointsx will fail, because you cannot add a number to a string.
For example, you coerce the column into a character column with this statement:
name1 <- data.frame(gamers=name1,games="1",scores="100")
games and scores are strings, not numbers. Another example is the assignment after reading the table from the file. You can make sure that scores are numeric by doing this:
scores <- as.numeric(table[,3])
Please get familiar with RStudio's debugging capabilities (https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio). This way you can go through your code line by line and check the consequences of each assignment to the data frame.

How to Add Column (script) transform that queries another column for content

I’m looking for a simple expression that puts a ‘1’ in column E if ‘SomeContent’ is contained in column D. I’m doing this in Azure ML Workbench through their Add Column (script) function. Here are some examples they give:
row.ColumnA + row.ColumnB is the same as row["ColumnA"] + row["ColumnB"]
1 if row.ColumnA < 4 else 2
datetime.datetime.now()
float(row.ColumnA) / float(row.ColumnB - 1)
'Bad' if pd.isnull(row.ColumnA) else 'Good'
Any ideas on a 1 line script I could use for this? Thanks
Without really knowing what you want to look for in column 'D', I still think you can find all the information you need in the examples they give.
The script is being wrapped by a function that collects the value you calculate/provide and puts it in the new column. This assignment happens for each row individually. The value could be a static value, an arbitrary calculation, or it could be dependent on the values in the other columns for the specific row.
In the "Hint" section, you can see two different ways of obtaining the values from the other columns:
The current row is referenced using 'row' and then a column qualifier, for example row.colname or row['colname'].
In your case, you obtain the value for column 'D' either by row.D or row['D']
After that, all you need to do is come up with the specific logic for checking whether 'SomeContent' is contained in column 'D' for that specific row. In your case, the '1 line script' would look something like this:
1 if [logic ensuring 'SomeContent' is contained in row.D] else 0
If you need help with the logic, you need to provide more specific examples.
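If a plain substring match is what you mean, the placeholder could be filled with Python's in operator. The following is a minimal sketch, assuming column D holds text or is empty. The SimpleNamespace objects only stand in for the row that Workbench passes to the script, and flag_some_content is a hypothetical wrapper so that the expression can be run outside Workbench.

```python
from types import SimpleNamespace


def flag_some_content(row):
    # Candidate one-line expression for the Add Column (script) box.
    # str() guards against non-text values such as None or NaN.
    return 1 if 'SomeContent' in str(row.D) else 0


# Stand-in rows for illustration only (not the real Workbench row object).
rows = [
    SimpleNamespace(D='prefix SomeContent suffix'),
    SimpleNamespace(D='nothing relevant'),
    SimpleNamespace(D=None),
]

for row in rows:
    print(flag_some_content(row))  # prints 1, then 0, then 0
```

In Workbench itself, only the expression would be needed: 1 if 'SomeContent' in str(row.D) else 0. Note that this match is case-sensitive.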
You can read more in the Azure Machine Learning Documentation:
Sample of custom column transforms (Python)
Data Preparations Python extensions
Hope this helps
