How to combine items from two columns in two separate files? - r

I have two tables which I need to compare
Table 1:XLOC IDs
Column A: Xloc id
Column B: gene id
Table 2: Ensembl IDs
Column A: Ensembl id
Column B: gene Id
Both tables contain identical gene IDs (names, e.g. cpa6). Table 1 has 25000 entries and Table 2 has 46000 entries.
I need to insert the Ensembl IDs from column A of Table 2 into column C of Table 1 wherever the gene IDs in column B match, and write the result to an output file, e.g.
Table 1:
Xloc0014 cpa6
Table 2:
ENS0002 cpa6
Output file, Table 3:
Xloc0014 cpa6 ENS0002
The columns are not in the same order and cannot be sorted alphabetically. The remaining 21000 entries without corresponding Xlocs I will discard (I can easily do this after the output is created).
Does anyone know how to do this relatively easily in R, Excel, or other software?
N.B. The two tables cannot be sorted into the same order, so I really need a formula or script (e.g. R or bash) to do this.

Try this. I have created example data frames to show how you can merge two tables and keep only the values that exist in both.
The new table contains only the values that exist in both tables, and it has 3 columns: the key, the value from the first table, and the value from the second table.
For your data, merge on the gene ID column so that only gene IDs present in both tables are kept, for example newTable <- merge(tab1, tab2, by = "gen_id").
tab1 <- data.frame(col1=c("id1","id2","id3","id4"),col2=c(1,2,3,4))
tab2 <- data.frame(col1=c("id1","id2","id3","id5","id7"),col2=c(1,3,3,5,6))
newTable <- merge(tab1,tab2,by = "col1")
If you want to keep every row of table 1, even those with no match in table 2, use this:
newTable <- merge(tab1, tab2, by = "col1", all.x = TRUE)
This keeps all the rows of table 1 and fills col2.y with the value from table 2 where a match exists; otherwise col2.y is NA.
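Applied to the XLOC/Ensembl tables from the question, the same merge might look like the sketch below. The data frames and the column names xloc_id, ensembl_id and gene_id are made up for illustration; real data would be read from the two files first (for example with read.table).

```r
# Hypothetical miniature versions of the two tables from the question
table1 <- data.frame(xloc_id = c("Xloc0014", "Xloc0020", "Xloc0031"),
                     gene_id = c("cpa6", "abc1", "xyz9"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(ensembl_id = c("ENS0002", "ENS0007", "ENS0011"),
                     gene_id = c("cpa6", "def2", "abc1"),
                     stringsAsFactors = FALSE)

# Inner join on the shared gene ID column; row order in the inputs does not matter
table3 <- merge(table1, table2, by = "gene_id")

# Put the Ensembl ID in the third column, as the question describes
table3 <- table3[, c("xloc_id", "gene_id", "ensembl_id")]
print(table3)
```

Gene IDs that occur in only one of the tables are dropped by this inner join, so the unmatched entries do not need to be removed afterwards. The result could then be written out with write.csv(table3, "table3.csv", row.names = FALSE), where the file name is again only a placeholder.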

In R I would use the merge function, e.g. merge(Table1, Table2, by = "gene_id"), where by names the shared gene ID column (the column name here is a placeholder), not an individual gene such as cpa6.
However, I have done this in Excel before using the VLOOKUP function, which worked well too. You just need an IF function in Excel with a nested VLOOKUP inside:
=IF(ISERROR(VLOOKUP(cell with gene name in Table1, array of cells that contain the gene names in Table2, number of the column in the array in Table2, FALSE so they match exactly)), output if no match is found, output if a match is found)
Example:
=IF(ISERROR(VLOOKUP(C4,List1!A$2:A$1000,1,FALSE)),"Does NOT exist in List 1","Exists in List 1")
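For readers who prefer to stay in R, a rough equivalent of the VLOOKUP approach is base R's match(), which returns the position of the first match for each lookup value. This is only a sketch with invented data and column names; unlike merge, it leaves the rows of table1 in their original order.

```r
# Hypothetical example tables; the column names are assumptions
table1 <- data.frame(xloc_id = c("Xloc0014", "Xloc0020", "Xloc0031"),
                     gene_id = c("cpa6", "abc1", "xyz9"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(ensembl_id = c("ENS0002", "ENS0007", "ENS0011"),
                     gene_id = c("cpa6", "def2", "abc1"),
                     stringsAsFactors = FALSE)

# For each gene in table1, find the row of its first occurrence in table2
# (NA when the gene is absent), much like an exact-match VLOOKUP
idx <- match(table1$gene_id, table2$gene_id)

# Pull the Ensembl ID from the matched rows into a new third column
table1$ensembl_id <- table2$ensembl_id[idx]
print(table1)
```

Rows without a match get NA in the new column and can be dropped afterwards with table1[!is.na(table1$ensembl_id), ].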

Related

Data from one table to select data columns from another table, using r

My data table, tab, is 2000 x 500, with y1 = col1, y2 = col2, y3 = col3, ..., y500 = col500. See image.
I want to carry out some PCA work on a subset of these columns, e.g. y1 = col1, y22 = col22, y36 = col36, y41 = col41, and so on.
A separate data table, SM, contains a column named ID that lists the columns of the main data table (tab) I want to consider. There are 200 such entries.
Image of SM follows.
The following
fit.std <- prcomp(tab, scale.=T)
Pulls in all the column entries.
If I have 200 specific columns of data to consider, entering the column numbers manually would be very time consuming and error prone.
Can someone please tell me how to take the data from column ID (in data table SM), to select the corresponding columns in the data table tab, and then include in the fit.std line?
Is there a way to use the data in SM to select the required columns in the larger data table tab? In other words, SM col1 would correspond to tab col1, SM col22 would correspond to tab col22, and so on. Something like
fit.std <- prcomp(c(ID$*), scale = TRUE)
where ID$* contains the entries of data table SM that I want to match with columns in tab?
Thank you.
OK, based on your updated question, it looks like you want to subset the data frame tab, selecting only the columns listed in SM$ID.
You can do that with:
tab[,SM$ID]
I'm not really sure exactly what you're asking for, but I'll try to address your task to the best of my ability.
Suppose tab is a dataframe with 2000 rows and 500 columns. And SM is a dataframe where SM$ID refers to columns in tab.
Then you can get a list of the columns referenced by SM$ID using:
list_of_cols <- lapply(SM$ID, function(x) tab[,x])
If you want to collapse (or "flatten") this list of vectors into a single vector you can do:
single_vec <- unlist(list_of_cols)
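Putting the subsetting and the PCA call together might look like the following sketch. The data here is random and much smaller than the real tables, and it assumes that SM$ID holds column names as text; if SM$ID holds column numbers instead, as.integer(SM$ID) would be the analogous conversion.

```r
set.seed(1)

# Hypothetical stand-in for tab: 20 rows, 6 numeric columns named col1..col6
tab <- as.data.frame(matrix(rnorm(20 * 6), nrow = 20))
names(tab) <- paste0("col", 1:6)

# Hypothetical stand-in for SM: the ID column names the columns of interest
SM <- data.frame(ID = c("col1", "col3", "col6"), stringsAsFactors = FALSE)

# as.character() guards against ID having been read in as a factor;
# drop = FALSE keeps a data frame even if only one column is selected
sub <- tab[, as.character(SM$ID), drop = FALSE]

# Run the PCA on the selected columns only
fit.std <- prcomp(sub, scale. = TRUE)
summary(fit.std)
```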

R function for simple lookup replacement of excel

I want to extract values from file 2 into file 1 by matching the values in the indicated columns. It is a simple lookup function in Excel,
but many of the solutions given are based on matching column names, and I don't want to change the column names in my data set.
The two files share a matching column, and a column from file2 is to be inserted into file1.
As your column names are different in the two data.frames you need to tell merge which columns correspond to each other:
merge(file1, unique(file2[, c("Symbol", "GeneID")]), by.x = "UniprotBlastGeneSymbol", by.y = "Symbol")
Your result column will be called GeneID, not Column4, of course. If file2 contains gene IDs that are not found in file1, they are dropped because all.y = FALSE is the default; set all.x = TRUE if you also want to keep file1 rows that have no match in file2.
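A small self-contained illustration of the by.x/by.y call is sketched below. The gene symbols, IDs and the Score and Other columns are invented; only the key column names UniprotBlastGeneSymbol, Symbol and GeneID come from the answer above.

```r
# Hypothetical file1: the key column is called UniprotBlastGeneSymbol
file1 <- data.frame(UniprotBlastGeneSymbol = c("geneA", "geneB", "geneC"),
                    Score = c(0.9, 0.7, 0.4),
                    stringsAsFactors = FALSE)

# Hypothetical file2: the key column is called Symbol; geneA appears twice
file2 <- data.frame(Symbol = c("geneA", "geneA", "geneC", "geneD"),
                    GeneID = c(101, 101, 103, 104),
                    Other  = c("x", "y", "z", "w"),
                    stringsAsFactors = FALSE)

# Keep only the lookup columns and drop repeated Symbol/GeneID pairs,
# so that a repeated symbol does not multiply rows in the result
lookup <- unique(file2[, c("Symbol", "GeneID")])

# all.x = TRUE keeps file1 rows without a match (GeneID becomes NA)
result <- merge(file1, lookup,
                by.x = "UniprotBlastGeneSymbol", by.y = "Symbol",
                all.x = TRUE)
print(result)
```

The key column keeps its file1 name in the result, so no column in the original data has to be renamed.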

Create a composite key and merge two tables in R

I am working on a project in R. I have two data frames with multiple entries for each employee ID in both of them. For example, employee ID 1 has multiple entries in Table 1 and Table 2. Therefore, there is no primary key in these tables.
I want to merge these two tables for better analysis. When I try to merge them, every entry for an ID in one table is paired with every entry for that ID in the other, which distorts the data in the resulting table.
Can anyone please suggest a way around this?
You can merge two tables with the merge command.
by = "employeeid" lets you specify the key column. If the key spans more than one column, use by = c("employeeid", "period").
table3 <- merge(table1, table2, by = "employeeid")
?merge will give you more options.
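The effect of a composite key can be seen in a small sketch. The tables below and the period, sales and hours columns are hypothetical; the point is only that merging on the ID alone pairs every row with every other row for the same employee, while merging on both columns does not.

```r
# Hypothetical tables: two rows per employee, one per period
table1 <- data.frame(employeeid = c(1, 1, 2, 2),
                     period     = c("Q1", "Q2", "Q1", "Q2"),
                     sales      = c(10, 20, 30, 40))
table2 <- data.frame(employeeid = c(1, 1, 2, 2),
                     period     = c("Q1", "Q2", "Q1", "Q2"),
                     hours      = c(5, 6, 7, 8))

# Merging on the ID alone: 2 x 2 combinations per employee
by_id <- merge(table1, table2, by = "employeeid")

# Merging on the composite key: one row per employee and period
by_key <- merge(table1, table2, by = c("employeeid", "period"))

nrow(by_id)
nrow(by_key)
```

This only helps if some combination of columns identifies a row uniquely in both tables; otherwise the data has to be summarised first.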
One idea is to wrangle your data so that there are no longer multiple entries per employee.
Another is to summarize your data so that there is only one row per employee in each table.
A third is to use a full join to connect all matching IDs:
https://dplyr.tidyverse.org/reference/join.html
library(dplyr)
full_join(df1, df2, by = "EmployeeID")
Check out the dplyr "Data Transformation Cheat Sheet": https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
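The second idea, one row per employee in each table before joining, could be sketched in base R with aggregate() as below. The tables and the sales and hours columns are invented, and summing is just one possible summary; the same could be done with dplyr's group_by() and summarise().

```r
# Hypothetical tables with repeated employee IDs
table1 <- data.frame(employeeid = c(1, 1, 2),
                     sales      = c(10, 20, 30))
table2 <- data.frame(employeeid = c(1, 2, 2),
                     hours      = c(5, 7, 8))

# Collapse each table to one row per employee (here: totals)
sales_per_emp <- aggregate(sales ~ employeeid, data = table1, FUN = sum)
hours_per_emp <- aggregate(hours ~ employeeid, data = table2, FUN = sum)

# employeeid is now unique in both tables, so the merge is one-to-one
summary_tbl <- merge(sales_per_emp, hours_per_emp, by = "employeeid")
print(summary_tbl)
```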

Comparing Excel tables and keep matching information in data.frame R

I have recently started using R again, and I'm trying to compare two Excel tables (let's call them table 1 and table 2) with very different data. The only thing they have in common is one column (let's name it col1), which holds the gene ID.
My goal is to find and keep all the rows of table 1 in which the data in col1 exactly matches the data in table 2.
For example, table1 contains 10 columns and its col1 contains the gene ID, while table2 contains only 5 columns and its col2 contains the gene ID. I want to compare those two columns and get a data.frame containing the whole rows of table1 that match.
I hope I'm clear? English is not my first language ^^
Thanks a lot !
merge(x = table1,
      y = table2,
      by.x = "column_name_table1",
      by.y = "column_name_table2",
      all.x = TRUE)
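Note that all.x = TRUE keeps every row of table1, with NA in table2's columns where no match exists. If only the matching rows of table1 are wanted and none of table2's columns are needed, an alternative (not the merge above) is to filter with %in%. The sketch below uses invented data, with the gene ID in col1 of table1 and col2 of table2 as in the question.

```r
# Hypothetical tables: gene ID in col1 of table1 and col2 of table2
table1 <- data.frame(col1  = c("g1", "g2", "g3", "g4"),
                     value = 1:4,
                     stringsAsFactors = FALSE)
table2 <- data.frame(other = c("a", "b", "c"),
                     col2  = c("g2", "g4", "g9"),
                     stringsAsFactors = FALSE)

# Keep the whole rows of table1 whose gene ID also occurs in table2
kept <- table1[table1$col1 %in% table2$col2, ]
print(kept)
```

The row order and all columns of table1 are preserved, and duplicated gene IDs in table2 do not duplicate rows in the result.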

Getting respective columns for the unique records in r

I have a large CSV file with millions of records and 6 columns. I want to get the unique records of one column, say "Name", together with the other columns associated with those unique records. Say I get 50,000 unique "Name" records: I want the other 5 columns associated with those 50,000 records. I know how to get the unique records in a column. In the code below I read only the Name column (the 1st column) into a separate data frame and then return the unique records using the unique function. But I am not sure how to get the other 5 columns for those unique records.
m <- read.csv(file="Test.csv", header=T, sep=",",
colClasses = c("character","NULL","NULL","NULL","NULL","NULL"))
names <- unique(m, incomparables = FALSE)
Yes, the other columns come along with the unique records. Run unique() on the whole data frame rather than on the Name column alone: if the same name is repeated with different entries in at least one of the other 5 columns, each such row counts as unique.
m <- read.csv(file = "Test.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE) # read all 6 columns; colClasses "NULL" would drop them
m <- unique(m) # remove fully duplicated rows
Subset <- m[1:50000, ] # subset the first 50000 rows
Refer to the following links for a better understanding:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html
Unique on a dataframe with only selected columns
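If exactly one row per unique Name is wanted, regardless of differences in the other columns, a common base R idiom is duplicated() on the Name column. The sketch below uses a small invented data frame instead of Test.csv and keeps the first row seen for each name; which row to keep per name is an assumption that the question does not settle.

```r
# Hypothetical stand-in for the CSV: Name plus one other column
m <- data.frame(Name = c("a", "b", "a", "c", "b"),
                V2   = 1:5,
                stringsAsFactors = FALSE)

# duplicated() flags the second and later occurrences of each Name;
# negating it keeps the first row for every unique Name, with all columns
first_per_name <- m[!duplicated(m$Name), ]
print(first_per_name)
```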
