Extra Observations When Merging Datasets [duplicate] - r

This question already has an answer here:
Why does merge result in more rows than original data?
(1 answer)
Closed 3 years ago.
I am trying to merge two datasets, A (407 observations) and B (1462 observations). I merged them with:
C <- merge(A, B, by = "ID", all.x = TRUE)
It creates 416 observations in Dataset C.
Is there a reason why?

See Why does the result from merge have more rows than original file?.
It looks like some IDs in A match more than one row in B. merge returns one row for every matching pair, so the result can have more rows than A. Since you didn't specify what your goal is, I recommend going through the full documentation:
https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/merge
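To illustrate the multiple-match explanation, here is a minimal sketch with made-up data. The data frames and the columns x and y are hypothetical, not the asker's data. An ID that appears twice in B produces two rows in the merged result for the single matching row of A.

```r
# Hypothetical toy data: ID 2 appears twice in B
A <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"))
B <- data.frame(ID = c(1, 2, 2, 4), y = c("p", "q", "r", "s"))

C <- merge(A, B, by = "ID", all.x = TRUE)
nrow(A)  # 3
nrow(C)  # 4: the row for ID 2 is repeated once per match in B

# List the IDs that occur more than once in B
dup_ids <- unique(B$ID[duplicated(B$ID)])
dup_ids  # 2

# One possible fix, assuming that keeping the first row per ID is acceptable
B_first <- B[!duplicated(B$ID), ]
C_first <- merge(A, B_first, by = "ID", all.x = TRUE)
nrow(C_first)  # 3
```

Whether dropping duplicates is appropriate depends on the data; aggregating B per ID before merging may be a better fit in some cases.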

Related

Extract rows from second Dataframe which are newly added compare to first Dataframe [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 2 years ago.
I have two data frames. The second data frame can contain some rows from the first data frame as well as newly added rows. I need to find the rows that are only in the second data frame, that is, the rows that are not in the first.
Below is an example with the expected output:
comp1<- data.frame(sector =c('Sector_123','Sector_456','Sector_789','Sector_101','Sector_111','Sector_113','Sector_115','Sector_117'), id=c(1,2,3,4,5,6,7,8) ,stringsAsFactors = FALSE)
comp2 <- data.frame(sector = c('Sector_456','Sector_789','Sector_000','Sector_222'), id=c(2,3,6,5), stringsAsFactors = FALSE)
The expected output should look like this:
sector id
Sector_000 6
Sector_222 5
I should not use any other libraries such as compare or data.table.
Any suggestions?
This assumes we are comparing entries in the sector column only. To compare on all columns, just remove the by restriction.
We could use dplyr:
library(dplyr)
anti_join(comp2, comp1, by="sector")
gives us
> anti_join(comp2, comp1, by="sector")
sector id
1 Sector_000 6
2 Sector_222 5
With base R we could use
comp2[!comp2$sector %in% comp1$sector,]
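For the all-columns case in base R, one possible sketch is to build a single key per row by pasting the columns together and then compare keys with %in%. The helper names key1, key2 and new_rows are hypothetical, and the separator is an assumption: it must not occur in the data.

```r
comp1 <- data.frame(
  sector = c("Sector_123", "Sector_456", "Sector_789", "Sector_101",
             "Sector_111", "Sector_113", "Sector_115", "Sector_117"),
  id = c(1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)
comp2 <- data.frame(
  sector = c("Sector_456", "Sector_789", "Sector_000", "Sector_222"),
  id = c(2, 3, 6, 5),
  stringsAsFactors = FALSE
)

# One key per row, built from every column.
# The separator "\r" is an assumption; pick one that never appears in the data.
key1 <- do.call(paste, c(comp1, sep = "\r"))
key2 <- do.call(paste, c(comp2, sep = "\r"))

# Rows of comp2 whose full row does not occur in comp1
new_rows <- comp2[!key2 %in% key1, ]
new_rows$sector  # "Sector_000" "Sector_222"
```

Here the result happens to match the sector-only comparison, because the two new sectors do not occur in comp1 at all.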

How to Extract Two Columns of Data in R [duplicate]

This question already has answers here:
Extracting specific columns from a data frame
(10 answers)
Closed 4 years ago.
I have to extract two columns from this data set (Cars93 on MASS) and create a separate folder consisting only of the two columns MPG.highway and EngineSize. How do I go about doing this?
You can look at Cars93 in MASS and print just the first ten rows to see it.
You can create a subset by indexing with the column names directly or, alternatively, by using the subset function:
library(MASS)  # provides the Cars93 data set
new_df <- Cars93[, c("MPG.highway", "EngineSize")]
# or
new_df <- subset(Cars93, select = c("MPG.highway", "EngineSize"))

setdiff two single column data frames [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 7 years ago.
I'm having a problem with a very simple issue and I don't know how to sort it out. Here's the deal. I have two one column data frames
a <- data.frame(C=c("c1","c2","c3","c4","c5","c6","c7","c8"))
b <- data.frame(C=c("c1","c4","c5","c8"))
I would like to get a one-column data frame with the entries that are in a but do NOT appear in b, i.e. a data frame with "c2","c3","c6","c7".
I tried
c <- setdiff(a,b)
but I got the a data frame back. I also tried
c <- merge(a,b,all.x=TRUE)
but that doesn't give me what I want either. Do you know where I am going wrong?
We can use anti_join
library(dplyr)
anti_join(a,b)
Or
data.frame(C= setdiff(a[,1], b[,1]))
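As a further base R sketch that does not need dplyr, the rows of a can be filtered with %in%. The name only_in_a is hypothetical, and stringsAsFactors = FALSE is set explicitly here so the column is character regardless of R version. Because a has a single column, drop = FALSE is needed to get a data frame back rather than a plain vector.

```r
a <- data.frame(C = c("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"),
                stringsAsFactors = FALSE)
b <- data.frame(C = c("c1", "c4", "c5", "c8"),
                stringsAsFactors = FALSE)

# Keep the rows of a whose value in C does not occur in b$C;
# drop = FALSE stops R from simplifying the one-column result to a vector
only_in_a <- a[!a$C %in% b$C, , drop = FALSE]

only_in_a$C              # "c2" "c3" "c6" "c7"
is.data.frame(only_in_a) # TRUE
```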

Deleting rows from a data frame that are not present in another data frame in R [duplicate]

This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 7 years ago.
I'm new to R but from what I've been reading this one is a bit hard for me. I have two data frames, say DF1 and DF2, both of which have a variable of interest, say idFriends, and I want to create a new data frame where all the rows that do not appear in DF2 are deleted from DF1 based on the values of idFriends.
The thing is that in DF2 each value appears only once while DF1 has thousands of values, many of them repeated. BUT I don't want R to delete repetitions, I just want it to search DF2, see if EACH value of DF1 exists in DF2, and if it doesn't exist delete that row and if it exists leave it as is, and do the same for each row in DF1.
I hope it's clear.
dplyr has a semi_join function that does that.
DF1 %>% semi_join(DF2, by = "idFriends") # keep rows with matching ID
DF1 %>% anti_join(DF2, by = "idFriends") # keep rows without matching ID
Hard to say without a reproducible example, but %in% is probably what you are looking for:
DF1[DF1$idFriends %in% DF2$idFriends, ]  # keep rows of DF1 whose idFriends appears in DF2
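Since the question has no reproducible example, here is a small sketch with made-up data. The values and the value column are hypothetical. It assumes the goal stated in the question: keep every row of DF1 whose idFriends value exists in DF2, repetitions included, and drop the rest.

```r
# Hypothetical data: DF1 has repeated ids, DF2 lists each id once
DF1 <- data.frame(idFriends = c(10, 20, 20, 30, 40, 40),
                  value = c("a", "b", "c", "d", "e", "f"),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(idFriends = c(20, 40))

# Row-wise membership test: repeated ids in DF1 are all kept
kept <- DF1[DF1$idFriends %in% DF2$idFriends, ]
kept$idFriends  # 20 20 40 40

# The complement: rows of DF1 whose id is missing from DF2
dropped <- DF1[!DF1$idFriends %in% DF2$idFriends, ]
dropped$idFriends  # 10 30
```

Unlike merge, this filter never adds rows or columns to DF1; it only decides which existing rows stay.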

In R, how do I select rows of entries in one dataframe by identifiers from a second dataframe [duplicate]

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
In R, how do I subset a data.frame by values from another data.frame?
I have two data.frames. The first (df1) is a single column of 100 entries with header - "names". The second (df2) is a dataframe containing hundreds of columns of metadata for tens of thousands of entries. The first column of df2 also has the header "names".
I simply want to select all the metadata in df2 by the subset of names found in df1.
Please help this novice R user. Thank you!
You can subset a data.frame with %in%, but it can be slow if you have many thousands of names to look up.
I would recommend using data.table because it sorts the index columns and can do an almost instantaneous database join even with millions of records. Read the data.table documentation for more information.
Suppose you have a big data.frame and little data.frame:
library(data.table)
big <- data.frame(names=1:5, data=1:5)
small <- data.frame(names=c(1, 3, 6))
Make them into data.table objects and set the key column to be names.
big <- data.table(big, key='names')
small <- data.table(small, key='names')
Now perform the join. [] in data.table allows a data.table to be indexed by the key column of another data.table. In this case, we get one row for each name in small: names that also exist in big carry their data, and names that are in small but not in big have NA in the data column.
big[small]
# names data
# 1: 1 1
# 2: 3 3
# 3: 6 NA
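For comparison, here is a base R sketch of the same lookup on the same toy data, without data.table. The names matched and joined are hypothetical. The %in% filter returns only the rows of big that are found in small, while merge with all.x = TRUE reproduces the NA row for the name that is missing from big.

```r
big <- data.frame(names = 1:5, data = 1:5)
small <- data.frame(names = c(1, 3, 6))

# Filter: only rows of big whose name occurs in small (no row for 6)
matched <- big[big$names %in% small$names, ]
matched$names  # 1 3

# Left join from small: one row per name in small, NA where big has no match
joined <- merge(small, big, by = "names", all.x = TRUE)
joined$names  # 1 3 6
joined$data   # 1 3 NA
```

Which form is preferable depends on whether unmatched names should appear in the result.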
