Comparing Excel tables and keeping matching information in a data.frame in R

I have recently started using R again, and I'm trying to compare two Excel tables (let's call them table1 and table2) that hold very different data. The only thing they have in common is one column containing the gene ID.
My goal is to find and keep all the rows of table1 whose gene ID exactly matches a gene ID in table2.
For example, table1 has 10 columns, with the gene ID in col1. Table2 has only 5 columns, with the gene ID in col2. I want to compare those two columns and get a data.frame containing the complete rows of table1 that match.
I hope I'm clear? English is not my first language ^^
Thanks a lot !

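# Left join: all.x = TRUE keeps every row of table1, with NA where table2 has no match.
# Drop all.x (or set it to FALSE) to keep only the rows whose gene ID appears in both tables.
merge(x = table1,
      y = table2,
      by.x = "column_name_table1",
      by.y = "column_name_table2",
      all.x = TRUE)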
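If the goal is only to filter table1, without pulling in any columns from table2, a plain %in% filter may be simpler than merge(). The following is a minimal sketch, assuming both sheets have already been read into data frames (for example with readxl::read_excel()); the gene names and the expr and score columns are made up for illustration.

```r
# Toy stand-ins for the two Excel tables (all values are hypothetical)
table1 <- data.frame(col1 = c("geneA", "geneB", "geneC", "geneD"),
                     expr = c(1.2, 3.4, 5.6, 7.8),
                     stringsAsFactors = FALSE)
table2 <- data.frame(col2 = c("geneB", "geneD", "geneX"),
                     score = c(10, 20, 30),
                     stringsAsFactors = FALSE)

# Keep the whole rows of table1 whose gene ID also occurs in table2
kept <- table1[table1$col1 %in% table2$col2, ]
```

Unlike merge(), this never duplicates rows of table1 when a gene ID occurs several times in table2, and the result has exactly the columns of table1.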

Related

Data from one table to select data columns from another table, using r

My data table, tab, is 2000 x 500, with y1 = col1, y2 = col2, y3 = col3, ..., y500 = col500. See image.
I want to carry out some PCA work on a subset of these columns, e.g. y1 = col1, y22 = col22, y36 = col36, y41 = col41, and so on.
A separate data table, SM, contains a column ID that refers to the columns in the main data table (tab) I want to consider. There are 200 such entries.
Image of SM follows.
The following
fit.std <- prcomp(tab, scale. = TRUE)
pulls in all the column entries.
If I have 200 specific columns of data to consider, entering the column numbers manually would be very time consuming and error prone.
Can someone please tell me how to take the data from column ID (in data table SM), to select the corresponding columns in the data table tab, and then include in the fit.std line?
Is there a way to use the data in SM to select the required columns in the larger data table tab? In other words, SM col1 would correspond to tab col1, SM col22 would correspond to tab col22, and so on. Something like
fit.std <- prcomp(c(ID$*), scale = TRUE)
where ID$* contains the data table SM entries I want to match with columns in tab?
Thank you.
OK, based on your updated question, it looks like you want to subset the data frame tab, selecting only the columns listed in SM$ID.
You can do that with:
tab[, SM$ID]
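To feed that subset into the PCA, something along these lines should work. This is only a sketch on a small random table, assuming SM$ID holds column names such as "col1"; the sizes and names below are made up. The as.character() call guards against SM$ID being a factor, in which case tab[, SM$ID] would index by the factor's integer codes rather than by name.

```r
set.seed(1)

# Small stand-in for the 2000 x 500 table (hypothetical data)
tab <- as.data.frame(matrix(rnorm(20 * 6), nrow = 20))
names(tab) <- paste0("col", 1:6)

# Stand-in for SM: the ID column lists the columns of tab to keep
SM <- data.frame(ID = c("col1", "col3", "col6"))

# Select only the listed columns, then run the PCA on that subset
sub_tab <- tab[, as.character(SM$ID), drop = FALSE]
fit.std <- prcomp(sub_tab, scale. = TRUE)
```

If SM$ID holds column numbers instead of names, tab[, SM$ID] selects by position and the as.character() conversion should be dropped.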
I'm not really sure exactly what you're asking for, but I'll try to address your task to the best of my ability.
Suppose tab is a dataframe with 2000 rows and 500 columns. And SM is a dataframe where SM$ID refers to columns in tab.
Then you can get a list of the columns referenced by SM$ID using:
list_of_cols <- lapply(SM$ID, function(x) tab[,x])
If you want to collapse (or "flatten") this list of vectors into a single vector you can do:
single_vec <- unlist(list_of_cols)

Create a composite key and merge two tables in R

I am working on a project in R. I have two data frames with multiple entries for each employee ID in both data frames. For example, employee ID 1 has multiple entries in table 1 and in table 2. Therefore, there is no primary key in these tables.
I want to merge these two tables for better analysis. When I try to merge them, every entry for an ID in one table is paired with every entry for that ID in the other, which distorts the data in the resulting table.
Can anyone please suggest a way out?
You can merge two tables with the merge command.
by = "employeeid" lets you specify the key column. If the key consists of more than one column, use by = c("employeeid", "period").
table3 <- merge(table1, table2, by = "employeeid")
?merge will give you more options.
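A small sketch of the difference between a single key and a composite key, assuming a second column (here called period, with made-up values) distinguishes the repeated entries of each employee:

```r
# Hypothetical data: employee 1 has two entries in each table
table1 <- data.frame(employeeid = c(1, 1, 2),
                     period = c("Q1", "Q2", "Q1"),
                     hours = c(5, 3, 8))
table2 <- data.frame(employeeid = c(1, 1, 2),
                     period = c("Q1", "Q2", "Q1"),
                     sales = c(10, 20, 5))

# Single key: both entries of employee 1 pair with both entries in table2.
# The shared non-key column comes back as period.x and period.y.
by_id <- merge(table1, table2, by = "employeeid")

# Composite key: each (employeeid, period) pair matches exactly once
table3 <- merge(table1, table2, by = c("employeeid", "period"))
```

This only helps if the combination of key columns is unique in at least one of the tables; otherwise the rows still multiply.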
One idea is to wrangle your data so there are no more multiple entries.
Another is to summarize your data so there is only one row per employee in each table.
A third is to use a full join to connect all matching IDs:
https://dplyr.tidyverse.org/reference/join.html
library(dplyr)
full_join(df1, df2, by = "EmployeeID")
Check out the dplyr "Data Transformation Cheat Sheet": https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
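The second idea (one row per employee before joining) could look roughly like this in base R. It is a sketch under the assumption that summing is a sensible summary for your data; the hours and sales columns are invented for the example.

```r
# Hypothetical tables with repeated employee IDs
table1 <- data.frame(employeeid = c(1, 1, 2), hours = c(5, 3, 8))
table2 <- data.frame(employeeid = c(1, 1, 2, 2), sales = c(10, 20, 5, 15))

# Merging directly pairs every entry with every entry of the same ID
naive <- merge(table1, table2, by = "employeeid")

# Collapse each table to one row per employee first, then merge
sum1 <- aggregate(hours ~ employeeid, data = table1, FUN = sum)
sum2 <- aggregate(sales ~ employeeid, data = table2, FUN = sum)
table3 <- merge(sum1, sum2, by = "employeeid")
```

Here naive has one row per pairing of entries, while table3 has exactly one row per employee.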

How to combine items from two columns in two separate files?

I have two tables which I need to compare
Table 1:XLOC IDs
Column A: Xloc id
Column B: gene id
Table 2: Ensembl IDs
Column A: Ensembl id
Column B: gene Id
Both tables contain identical gene IDs (names, e.g. cpa6). Table 1 has 25000 entries and table 2 has 46000 entries.
I need to insert the Ensembl IDs from column A of table 2 into column C of table 1 wherever the gene IDs in column B match, and create an output file with the new data, e.g.
Table 1
ENS0002 cpa6
Table 2:
Xloc0014 cpa6
Output file, Table 3:
ENS0002 cpa6 Xloc0014
The columns are not in the same order and cannot be sorted alphabetically etc. The remaining 21000 entries without corresponding Xlocs I will get rid of (but can easily do this post-output).
Does anyone know how to do this relatively easily in R, Excel, or other software?
N.B. Both tables can not be sorted into the same order, so I really need to use a formula/script/bash to do this.
Try this. I have created example data frames to show how you can merge and keep only the values that exist in both tables.
The new table contains only the values that exist in both, and it now has 3 columns, including the value from the second table.
To keep only the rows that exist in both of your tables, merge on the gene ID column, for example newTable <- merge(tab1, tab2, by = "gen_id").
tab1 <- data.frame(col1 = c("id1", "id2", "id3", "id4"), col2 = c(1, 2, 3, 4))
tab2 <- data.frame(col1 = c("id1", "id2", "id3", "id5", "id7"), col2 = c(1, 3, 3, 5, 6))
newTable <- merge(tab1, tab2, by = "col1")
If you want to keep all rows from table 1, even those that do not exist in table 2, use this:
newTable <- merge(tab1, tab2, by = "col1", all.x = TRUE)
This keeps all the rows of table 1; col2.y holds the value from table 2 where there is a match, and NA otherwise.
In R I would use the merge function, e.g. merge(Table1, Table2, by = "gene_id"), where gene_id stands for the name of the gene ID column in both tables (not a gene name such as cpa6).
However, I have done this in Excel before using the VLOOKUP function, which worked well too. You just need an IF function in Excel with a nested VLOOKUP inside:
=IF(ISERROR(VLOOKUP(cell with gene name in Table1, array of cells that contain the gene names in Table2, number of the column in the array in Table2, FALSE so they match exactly)), output if true (no match), output if false (match)).
Example:
=IF(ISERROR(VLOOKUP(C4,List1!A$2:A$1000,1,FALSE)), "Does NOT exist in List 1","Exists in List 1")
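For comparison, the closest R analogue of an exact-match VLOOKUP is probably match(). The sketch below assumes the two tables have been read into data frames with the column names shown; apart from cpa6, Xloc0014 and ENS0002, all values are invented.

```r
# Hypothetical stand-ins for the XLOC table and the Ensembl table
table1 <- data.frame(xloc_id = c("Xloc0014", "Xloc0020", "Xloc0031"),
                     gene_id = c("cpa6", "abc1", "xyz9"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(ensembl_id = c("ENS0001", "ENS0002", "ENS0003"),
                     gene_id = c("abc1", "cpa6", "def2"),
                     stringsAsFactors = FALSE)

# For each gene in table1, find the row of table2 with the same gene ID
# (NA when there is none); the row order of the tables does not matter
idx <- match(table1$gene_id, table2$gene_id)
table1$ensembl_id <- table2$ensembl_id[idx]

# Drop the entries without a corresponding Ensembl ID
table3 <- table1[!is.na(table1$ensembl_id), ]
```

Like VLOOKUP, match() returns only the first hit, so if a gene ID occurs more than once in table 2, merge() is the safer choice.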

Joining two structurally similar dataframes on two index columns?

I have two structurally identical dataframes with the columns id-part1, id-part2 and data1.
id-part1 and id-part2 together are used as an index.
Now I want to calculate the difference in column data1 between the two dataframes, matched on the two id columns. It might happen that a combination of id-part1 and id-part2 does not exist in one of the dataframes...
So it is basically an SQL join operation, isn't it?
The merge() function is what you are looking for.
It works similarly to an SQL join operation. Given your description, a solution would be:
solution <- merge(DF1, DF2, by = c('id-part1', 'id-part2'), all.x = TRUE, all.y = TRUE)
DF1 and DF2 are your data frames. merge() uses x and y to refer to them, where x is the first (DF1) and y the second (DF2).
The by argument defines the column names to match on (with by.x and by.y you can even specify different names for each data frame).
all.x and all.y specify the kind of join to perform, depending on the data you want to keep. Setting both to TRUE gives a full outer join.
The result is a new data frame with separate columns for data1 (data1.x from DF1 and data1.y from DF2). You can then continue with your calculations.
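As a rough illustration of the follow-up calculation, here is a sketch with invented values. The key columns are named id_part1 and id_part2 here, because data.frame() rewrites names containing a hyphen unless check.names = FALSE is set.

```r
# Hypothetical data: (1, "b") exists only in DF1, (2, "b") only in DF2
DF1 <- data.frame(id_part1 = c(1, 1, 2),
                  id_part2 = c("a", "b", "a"),
                  data1 = c(10, 20, 30))
DF2 <- data.frame(id_part1 = c(1, 2, 2),
                  id_part2 = c("a", "a", "b"),
                  data1 = c(7, 25, 40))

# Full outer join on both id columns; all = TRUE is shorthand for
# all.x = TRUE, all.y = TRUE
solution <- merge(DF1, DF2, by = c("id_part1", "id_part2"), all = TRUE)

# The difference is NA wherever a combination is missing on one side
solution$diff <- solution$data1.x - solution$data1.y
```

If missing combinations should count as zero instead, replace the NA values in data1.x and data1.y before subtracting.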

merging two dataframes in R

I have data in a dataframe with 139104 rows, which is 96 x 1449. I also have a phenotype file which contains the phenotype information for the 96 samples. Each of the 1449 SNP names is repeated for the 96 samples. I have to merge the two dataframes based on sid and sen. This is what my two dataframes look like:
dat <- data.frame(
  snpname = rep(letters[1:12], 12),
  sid = rep(1:12, each = 12),
  genotype = rep(c('aa', 'ab', 'bb'), 12)
)
pheno <- data.frame(
  sen = 1:12,
  disease = rep(c('N', 'Y'), 6),
  wellid = 1:12
)
I have to merge or add the disease column and 3 other columns to the data file. I am unable to get merge to work in R. I have searched Google, but I am not hitting the correct terms to find the answer. I would appreciate any input on this issue.
Thanks, Sharad
You can specify the columns you want to match on directly with merge():
merge(dat, pheno, by.x = "sid", by.y = "sen")
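A quick way to convince yourself the merge did what you want is to check the row count and the new columns afterwards. This sketch reuses the example data from the question (with genotype written out to full length) and adds all.x = TRUE as an assumption, so that genotype rows without phenotype information would be kept rather than dropped.

```r
dat <- data.frame(
  snpname = rep(letters[1:12], 12),
  sid = rep(1:12, each = 12),
  genotype = rep(c("aa", "ab", "bb"), 48)
)
pheno <- data.frame(
  sen = 1:12,
  disease = rep(c("N", "Y"), 6),
  wellid = 1:12
)

# Match dat$sid against pheno$sen; the key column keeps the name from dat
merged <- merge(dat, pheno, by.x = "sid", by.y = "sen", all.x = TRUE)

# Every genotype row should still be there, now with phenotype columns
stopifnot(nrow(merged) == nrow(dat))
```

Note that merge() sorts the result by the key column by default, so the row order may differ from dat.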
