How do I intersect two data.frames in R? [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I have two tables that are in the data.frame structure. Table 1 contains a column of 200 gene IDs (letters and numbers) and Table 2 contains a list of 4,000 gene IDs (in rows) as well as 20 additional columns. I want to intersect these two tables and generate a new Table 3 that contains the 200 gene IDs as well as the associated information in the 20 columns.
table3 <- table1%n%table2

You want something like
table3 <- merge(table1, table2, by = "id", all.x = TRUE, all.y = FALSE)
You might also be able to do subsetting with something like this:
table3 <- table2[table2$id %in% table1$id,]
A reprex would have made this post more likely to get a good response, but you should have been able to find something to help you with a little searching. If these don't work because you have a unique problem no one has asked before, give us a reprex and we can try to give you alternative solutions.
edit: for a little more context, here's a similar question I replied to last week and here's a great post on understanding merges.
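To make the difference between the two suggestions concrete, here is a minimal base R sketch. The tables, the gene IDs, and the expr and chrom columns are invented for illustration, and it assumes the key column is called id in both tables.

```r
# Hypothetical toy data: table1 holds the IDs of interest, table2 the annotations
table1 <- data.frame(id = c("G1", "G2", "G9"), stringsAsFactors = FALSE)
table2 <- data.frame(id = c("G1", "G2", "G3", "G4"),
                     expr = c(1.5, 2.0, 0.3, 4.1),
                     chrom = c("chr1", "chr2", "chr2", "chrX"),
                     stringsAsFactors = FALSE)

# Left join: every row of table1 is kept; G9 has no match, so its columns are NA
left <- merge(table1, table2, by = "id", all.x = TRUE)

# Inner join (the merge() default): only IDs present in both tables
inner <- merge(table1, table2, by = "id")

# Subsetting: rows of table2 whose id appears in table1
subset_rows <- table2[table2$id %in% table1$id, ]
```

With all.x = TRUE, IDs missing from table2 show up as rows of NA, which can be handy for spotting unmatched genes. The %in% subset and the default inner join only return the IDs found in both tables.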

I recommend the dplyr package. In my opinion, it works more intuitively than merge.
With dplyr loaded, you can just type:
table3 <- left_join(table1, table2, by = "unique_id")

Related

How to join two tables by a common column and filter another column to be in between the data of 2 other columns in R [duplicate]

I have two tables, 1 and 2, and I want to join them by a common column. Moreover, I want my data to be sorted so that another column of my first table is in between two other columns of my second table.
It's difficult to answer properly because you haven't provided any sample data, but something along these lines should work:
library(dplyr)
left_join(table1, table2, by = "commoncolumn") %>%
  select(col2, col1, col3) # reorders columns to the order listed
left_join() keeps every row of table1 and attaches the matching rows from table2. Rows in table2 with no match are discarded, and rows in table1 with no match get NA in the table2 columns. Depending on the behaviour you want, full_join() (keeps all rows from both tables) or inner_join() (only keeps rows which are in both tables) might work better for you.
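Here is a small sketch of how the three joins differ, plus one way to read the "in between" part of the question as a value filter rather than a column reordering. All column names (value, lower, upper) and the data are made up, since the question gives no sample data.

```r
library(dplyr)

# Hypothetical toy tables; only commoncolumn comes from the answer above
table1 <- data.frame(commoncolumn = c("a", "b", "c"),
                     value = c(5, 20, 7),
                     stringsAsFactors = FALSE)
table2 <- data.frame(commoncolumn = c("a", "b", "d"),
                     lower = c(1, 1, 1),
                     upper = c(10, 10, 10),
                     stringsAsFactors = FALSE)

left  <- left_join(table1, table2, by = "commoncolumn")   # keys a, b, c
inner <- inner_join(table1, table2, by = "commoncolumn")  # keys a, b
full  <- full_join(table1, table2, by = "commoncolumn")   # keys a, b, c, d

# Assumed reading of "in between": keep rows where value lies in [lower, upper]
in_range <- inner %>% filter(value >= lower, value <= upper)
```

Here only the row with key "a" survives the filter, because 20 falls outside the 1 to 10 range for key "b".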

How would I find matches between records in two lists based on a combination of variables in R? [duplicate]

I have two data frames...
> dim(df.x)
[1] 2120   5
> dim(df.y)
[1] 125    3
I'd like to identify records in data frame x that match data frame y for both variable 1 and variable 2 (but not for any other variables).
 
I suppose the typical way to do this in a lot of languages would be to do nested for statements and to compare each record in x to each record in y and stop and index the hits. But I'm wondering if there's a more efficient way to do this in R.
(I'd prefer to stick to base R or "out-of-the-box" R, if possible, rather than some of the higher-level packages.)
You can use merge() from base R, which performs an inner join by default. The code would be something like:
common <- merge(df.x, df.y, by = c("var1", "var2"))
Here var1 and var2 are the names of your two matching variables.
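A minimal base R sketch of that idea, with made-up data. The other column and the "|" separator are assumptions for illustration; the separator must not occur in the key values themselves.

```r
# Hypothetical data: both frames share var1, var2 and an unrelated column "other"
df.x <- data.frame(var1 = c(1, 1, 2, 3),
                   var2 = c("a", "b", "a", "c"),
                   other = c(10, 20, 30, 40),
                   stringsAsFactors = FALSE)
df.y <- data.frame(var1 = c(1, 3, 4),
                   var2 = c("a", "c", "a"),
                   other = c(99, 98, 97),
                   stringsAsFactors = FALSE)

# Inner join on the two keys only; the shared non-key column gets .x/.y suffixes
common <- merge(df.x, df.y, by = c("var1", "var2"))

# To index the hits in df.x instead of building a merged table,
# compare composite keys with %in%
key.x <- paste(df.x$var1, df.x$var2, sep = "|")
key.y <- paste(df.y$var1, df.y$var2, sep = "|")
hits <- key.x %in% key.y
df.x[hits, ]
```

Passing by explicitly matters here: without it, merge() would also try to match on other, because that name appears in both data frames.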

Merge two dataframes [duplicate]

I would like to merge two data frames. I tried the code below, but it's not working:
merg <- merge(companies, rounds2,
by.companies = "permalink",
by.rounds2 = "company_permalink", all = TRUE)
One data frame has more than 100,000 rows and 8 columns, and the other has 60,000+ rows and 6 columns. The permalink is the unique key in both data frames, but the column names differ. I'm not sure how the result will look when merging two data frames with different numbers of rows. We need to merge column-wise.
In by.x="" and by.y="" you have to put the name of the permalink identifier column in the first and second data frame, respectively. I do not know the exact names since I do not have a data example. Regarding the join, there are several options as well, for instance all.x=TRUE, all.y=TRUE, or all=TRUE or FALSE. Which one to use depends on how you want to join the data frames.
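# Toy data: the key columns have different names in the two data frames
companies <- data.frame(companies = rnorm(100), other1 = rnorm(100))
rounds2 <- data.frame(rounds2 = rnorm(100), other1 = rnorm(100))
companies
rounds2
# by.x names the key in the first data frame, by.y the key in the second
merge(companies, rounds2, by.x = "companies", by.y = "rounds2", all = TRUE)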
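Applied to the column names from the question, a sketch could look like the one below. The data values and the name and raised columns are invented. As far as I can tell, by.companies and by.rounds2 are not arguments of merge(), so they are not used to pick the key columns; by.x and by.y are the documented names.

```r
# Hypothetical data using the key column names from the question
companies <- data.frame(permalink = c("/org/a", "/org/b", "/org/c"),
                        name = c("A", "B", "C"),
                        stringsAsFactors = FALSE)
rounds2 <- data.frame(company_permalink = c("/org/a", "/org/a", "/org/b", "/org/d"),
                      raised = c(100, 250, 50, 75),
                      stringsAsFactors = FALSE)

# Full outer join: by.x is the key in companies, by.y the key in rounds2
merged <- merge(companies, rounds2,
                by.x = "permalink", by.y = "company_permalink",
                all = TRUE)
merged
```

The result has one row per matching pair, so a company with two funding rounds appears twice. With all = TRUE, keys found in only one data frame are kept and padded with NA. The key column takes its name from by.x.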

Writing data from one dataframe to a new column of another in R [duplicate]

I have two data frames in R. data, a frame with monthly sales per department in a store, looks like this:
While averages, a frame with the average sales over all months per department, looks like this:
What I'd like to do is add a column to data containing the average sales (column 3 of averages) for each department. So whereas now I have an avg column with all zeroes, I'd like it to contain the overall average sales for whatever department is listed in that row. This is the code I have now:
for(j in 1:nrow(avgs)){
for(i in 1:nrow(data)){
if(identical(data[i,4], averages[j,1])){
gd[i,10] <- avgs[j,3] } } }
After running the loop, the avg column in data is still all zeroes, which makes me think that if(identical(data[i,4], averages[j,1])) is always evaluating to FALSE... But why would this be? How can I troubleshoot this issue / is there a better way to do this?
Are you looking for the merge function?
merge(x = data, y = avgs, by = "departmentName", all.x=TRUE)
I would use dplyr by doing:
dplyr::full_join(data, averages, by = "departmentName")
The great thing about dplyr (besides being fast) is that it has a very simple syntax. Moreover, if the key variable has a different name in each table, that can also be specified. Imagine you have data_departmentName in the table data and averages_departmentName in the table averages:
dplyr::full_join(data, averages, by = c("data_departmentName" = "averages_departmentName"))
Then I would keep only the columns you need from the second dataset. If you know both tables are in the same row order and have the same length, you could just add the column directly:
data$avgs <- averages$avgs
But I'd rather join first then filter.
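For comparison, here is a base R sketch of two ways to attach the per-department average without a loop. The column names month, sales, and avgSales and all values are assumptions, since the original tables are not shown.

```r
# Hypothetical stand-ins for the two frames described in the question
data <- data.frame(month = c("Jan", "Jan", "Feb", "Feb"),
                   departmentName = c("toys", "books", "toys", "books"),
                   sales = c(10, 20, 30, 40),
                   stringsAsFactors = FALSE)
averages <- data.frame(departmentName = c("books", "toys"),
                       avgSales = c(30, 20),
                       stringsAsFactors = FALSE)

# Option 1: left join; note that merge() sorts the result by the key
joined <- merge(data, averages, by = "departmentName", all.x = TRUE)

# Option 2: look up each row's department in averages with match(),
# which keeps the original row order of data
data$avg <- averages$avgSales[match(data$departmentName, averages$departmentName)]
```

On the original loop: it writes to gd rather than data, which alone would leave data unchanged. Another possible cause, assuming the department columns are factors, is that identical() also compares attributes such as factor levels, so two factors with different level sets are not identical even when the labels look the same.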

Deleting rows from a data frame that are not present in another data frame in R [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
I'm new to R but from what I've been reading this one is a bit hard for me. I have two data frames, say DF1 and DF2, both of which have a variable of interest, say idFriends, and I want to create a new data frame where all the rows that do not appear in DF2 are deleted from DF1 based on the values of idFriends.
The thing is that in DF2 each value appears only once while DF1 has thousands of values, many of them repeated. BUT I don't want R to delete repetitions, I just want it to search DF2, see if EACH value of DF1 exists in DF2, and if it doesn't exist delete that row and if it exists leave it as is, and do the same for each row in DF1.
I hope it's clear.
dplyr has a semi_join function that does that.
DF1 %>% semi_join(DF2, by = "idFriends") # keep rows with matching ID
DF1 %>% anti_join(DF2, by = "idFriends") # keep rows without matching ID
Hard to say without a reproducible example, but %in% is probably what you are looking for:
DF1[DF1$idFriends %in% DF2$idFriends, ] # keep rows of DF1 whose idFriends occurs in DF2
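To keep the rows of DF1 whose idFriends value occurs in DF2, semi_join() and an un-negated %in% subset should select the same rows. Negating the %in% test gives the anti_join() result instead. A small check on made-up data (the score column is hypothetical):

```r
library(dplyr)

# Hypothetical data: DF1 has repeated IDs, DF2 lists each ID once
DF1 <- data.frame(idFriends = c(1, 2, 2, 3, 4, 4), score = 1:6)
DF2 <- data.frame(idFriends = c(2, 4, 5))

kept_semi <- DF1 %>% semi_join(DF2, by = "idFriends")  # rows with a match in DF2
kept_in   <- DF1[DF1$idFriends %in% DF2$idFriends, ]   # same rows via %in%
dropped   <- DF1 %>% anti_join(DF2, by = "idFriends")  # rows without a match
```

Repeated IDs in DF1 are left alone in both versions: each row is tested on its own, so both rows with ID 2 and both rows with ID 4 are kept.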
