Dropping common rows in two different dataframes - r

I am a beginner in R. I have two dataframes, df-1 and df-2, shown below. I want to combine them and drop the rows they have in common, keeping only the rows whose ID appears in just one of the two.
The result I want is df-3.
A merge is not appropriate because I don't need common rows.
df-1
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 B67 200302034466 1 20031204 3 1
3 C15 200302034455 1 20031223 3 1
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-2
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 K99 200402034466 1 20041204 2 3
3 Z75 200502034455 2 20021222 1 6
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-3
ID NUMBER FORM DATE CD AD
1 B67 200302034466 1 20031204 3 1
2 C15 200302034455 1 20031223 3 1
3 K99 200402034466 1 20041204 2 3
4 Z75 200502034455 2 20021222 1 6

Use rbind to stack df1 and df2, then select the unique rows. Note that this keeps one copy of each common row rather than dropping it:
df3 <- unique(rbind(df1,df2))

Can you just use unique on df3 to keep only unique rows? Or, in one line,
df3 <- unique(merge(df1, df2))
Also, avoid using brackets when naming variables - df(1) looks like "apply function df to 1"

If I'm interpreting your question correctly you want a dataframe with records that are present in only one of the original dataframes.
With dplyr:
library(dplyr)
df1_anti <- anti_join(df1, df2)
df2_anti <- anti_join(df2, df1)
df3 <- bind_rows(df1_anti, df2_anti)
df1_anti contains rows present in df1 but not in df2.
df2_anti contains rows present in df2 but not in df1.
df3 is the union of df1_anti and df2_anti, i.e. the rows that appear in only one of the two dataframes.
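For readers without dplyr, the same idea can be sketched in base R. This is only a sketch: it assumes neither dataframe contains duplicate rows of its own, and it uses a shortened, hypothetical version of the question's data (only some rows and columns are kept).

```r
# Shortened sample data; the row and column selection is an assumption for illustration
df1 <- data.frame(ID = c("A15", "B67", "C15", "D67"),
                  FORM = c(1, 1, 1, 1),
                  DATE = c(20031219, 20031204, 20031223, 20030319),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("A15", "K99", "Z75", "D67"),
                  FORM = c(1, 1, 2, 1),
                  DATE = c(20031219, 20041204, 20021222, 20030319),
                  stringsAsFactors = FALSE)

# Stack both dataframes, then flag every row that occurs more than once,
# scanning from the top and from the bottom so that all copies are caught
stacked <- rbind(df1, df2)
is_common <- duplicated(stacked) | duplicated(stacked, fromLast = TRUE)

# Keep only the rows that appear in exactly one dataframe
df3 <- stacked[!is_common, ]
rownames(df3) <- NULL
df3$ID
```

If a dataframe may contain internal duplicates, running unique() on each one first avoids treating those rows as common.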

Related

Merge multiple data frames with partially matching rows

I have data frames with lists of elements such as NAMES. The names differ between data frames, but most of them match. I'd like to combine all of them into one table that shows which names are missing from which df.
DATA sample for df1:
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 RH_Type-Function-S
6 6 RH_REFERENT-S
and for df2
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 UCZESTNIK
6 6 COACH
and expected result would be:
NAME. df1 df2
1 COACH NA 6
2 lh_Structure/Focus_C 4 4
3 lh_Structure/Focus_S 3 3
4 RH_REFERENT-S 6 NA
5 rh_Structure/Focus_C 2 2
6 rh_Structure/Focus_S 1 1
7 RH_Type-Function-S 5 NA
8 UCZESTNIK NA 5
I can do that with merge.data.frame(df1, df2, by = "x", all = TRUE),
but I can't do it with more dfs of similar structure. Any help would be appreciated.
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
# df1 df2
# COACH 0 1
# lh_Structure/Focus_C 1 1
# lh_Structure/Focus_S 1 1
# RH_REFERENT-S 1 0
# rh_Structure/Focus_C 1 1
# rh_Structure/Focus_S 1 1
# RH_Type-Function-S 1 0
# UCZESTNIK 0 1
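If the wide layout from the question is wanted for more than two data frames, one possible sketch is to fold merge() over a list with Reduce(). The data below is a small hypothetical stand-in (three data frames with the same X and x columns), and the helper name renamed is an assumption, not part of the original answer.

```r
# Hypothetical sample data with the same two-column structure as in the question
df1 <- data.frame(X = 1:3, x = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(X = 1:3, x = c("a", "b", "d"), stringsAsFactors = FALSE)
df3 <- data.frame(X = 1:2, x = c("b", "e"), stringsAsFactors = FALSE)
dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Rename each data frame's columns to NAME and the data frame's own label,
# so the position column of every source survives the merge under its own name
renamed <- Map(function(d, nm) setNames(d[c("x", "X")], c("NAME", nm)),
               dfs, names(dfs))

# Full outer join across all data frames, one pair at a time
wide <- Reduce(function(a, b) merge(a, b, by = "NAME", all = TRUE), renamed)
wide
```

Names missing from a given data frame show up as NA in that data frame's column.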

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
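The loop above can also be wrapped in a small helper so that the column groups are named explicitly instead of being derived from column positions. This is a sketch only; the function name recode_shared is hypothetical.

```r
# Sample data from the question
df1 <- data.frame(IID = 1:4,
                  L1.1 = c("38V1", "36V1", "37Y1", "38V2"),
                  L1.2 = c("38V1", "38V2", "36V1", "36V2"),
                  L2.1 = c("48V1", "50V1", "50V2", "52V1"),
                  L2.2 = c("52V1", "48Y1", "48V1", "50V2"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replace the values in `cols` with integer IDs that are
# shared across all of those columns
recode_shared <- function(d, cols) {
  # Unique values across the whole group, in order of first appearance
  refs <- unique(unlist(d[cols], use.names = FALSE))
  # Look up each column's values in the shared reference vector
  d[cols] <- lapply(d[cols], match, table = refs)
  d
}

df2 <- recode_shared(df1, c("L1.1", "L1.2"))
df2 <- recode_shared(df2, c("L2.1", "L2.2"))
df2
```

Passing the column names explicitly means the helper does not depend on the paired columns being adjacent.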
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
# group by column prefix (e.g. "L1.") and value, so IDs are shared within each column pair
df_m[, id := .GRP, by = .(sub(".$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.

R two table merge

I have two data.frames, df1 and df2.
df1=data.frame(id=c(1,2,2),var1=c(3,5,5),var3=c(2,3,4))
df2=data.frame(id=c(1,1,2,2),var1=c('NONE','NONE','NONE','NONE'),var3=c(2,4,6,5))
Now I want to merge them into one data.frame. First, I need to recode df2$var1 with the value of df1$var1 wherever df2$id matches df1$id. For example, df1$id = 1 has df1$var1 = 3, so the rows with df2$id = 1 should get df2$var1 = 3. The result should look like this:
df1=data.frame(id=c(1,2,2),var1=c(3,5,5),var3=c(2,3,4))
df2=data.frame(id=c(1,1,2,2),var1=c(3,3,5,5),var3=c(2,4,6,5))
Second, I want to combine the two data.frames and drop the duplicated rows. The result should look like this:
df=data.frame(id=c(1,1,2,2,2,2),var1=c(3,3,5,5,5,5),var2=c(2,4,3,4,6,5))
Sorry, this is my first time using Stack Overflow, and English isn't my native language.
library(dplyr)
union_all(df1, df2) %>%
distinct() %>%
arrange(id, var1)
id var1 var3
1 1 3 2
2 1 3 4
3 2 5 3
4 2 5 4
5 2 5 6
6 2 5 5
First I used dplyr::union, but found that it changed the row order.
So in the end I used union_all, then sorted the result with arrange.
I think this is what you want.
library(sqldf)
sqldf("select b.id, a.var1, b.var3 from df1 a left join df2 b on a.id = b.id")
id var1 var3
1 1 3 2
2 1 3 4
3 2 5 5
4 2 5 6
5 2 5 5
6 2 5 6
This is the same as the example you gave of your desired result, except for the 3rd column of the 3rd and 4th row. I believe that is due to a typo in your example, however if I am mistaken about this please let me know (and just explain why those values would be different and I will update my answer accordingly).
By the way, there are multiple ways to do this, but I find this one to be quick and easy.
With merge:
df2$var1 <- df1$var1[match(df2$id, df1$id)]  # look up var1 by id, not by row position
df2
id var1 var3
1 1 3 2
2 1 3 4
3 2 5 6
4 2 5 5
df <- merge(df1, df2, by='id')[-2:-3]
df
id var1.y var3.y
1 1 3 2
2 1 3 4
3 2 5 6
4 2 5 5
5 2 5 6
6 2 5 5
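Both steps from the question (recode df2$var1 by id, then combine and drop duplicate rows) can also be sketched in base R alone. The sketch assumes every id in df2 also occurs in df1 and that var1 is constant within an id, as in the sample data.

```r
df1 <- data.frame(id = c(1, 2, 2), var1 = c(3, 5, 5), var3 = c(2, 3, 4))
df2 <- data.frame(id = c(1, 1, 2, 2), var1 = rep("NONE", 4), var3 = c(2, 4, 6, 5),
                  stringsAsFactors = FALSE)

# Step 1: replace df2$var1 with the var1 of the first df1 row that has the same id
df2$var1 <- df1$var1[match(df2$id, df1$id)]

# Step 2: stack both data frames, drop exact duplicate rows, then sort by id and var1
df <- unique(rbind(df1, df2))
df <- df[order(df$id, df$var1), ]
rownames(df) <- NULL
df
```

Because order() keeps tied rows in their original sequence, the rows that came from df1 stay ahead of the rows from df2 within each id.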

Frequency of Characters in Strings as columns in data frame using R

I have a data frame initial of the following format
> head(initial)
Strings
1 A,A,B,C
2 A,B,C
3 A,A,A,A,A,B
4 A,A,B,C
5 A,B,C
6 A,A,A,A,A,B
and the data frame I want is final
> head(final)
Strings A B C
1 A,A,B,C 2 1 1
2 A,B,C 1 1 1
3 A,A,A,A,A,B 5 1 0
4 A,A,B,C 2 1 1
5 A,B,C 1 1 1
6 A,A,A,A,A,B 5 1 0
to generate the data frames the following codes can be used to keep the number of rows high
initial<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100))
final<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100),A=rep(c(2,1,5),100),B=rep(c(1,1,1),100),C=rep(c(1,1,0),100))
What is the fastest way I can achieve this? Any help will be greatly appreciated
We can use base R methods for this task. We split the 'Strings' column (strsplit(...)), set the names of the output list with the sequence of rows, stack to convert to data.frame with key/value columns, get the frequency with table, convert to 'data.frame' and cbind with the original dataset.
cbind(df1, as.data.frame.matrix(
table(
stack(
setNames(
strsplit(as.character(df1$Strings),','), 1:nrow(df1))
)[2:1])))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
or we can use mtabulate after splitting the column.
library(qdapTools)
cbind(df1, mtabulate(strsplit(as.character(df1$Strings), ',')))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
Update
For the new dataset 'initial', the second method works. If we need to use the first method with the correct order, convert to factor class with levels specified as the unique elements of 'ind'.
df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels=unique(df1$ind))
cbind(initial, as.data.frame.matrix(table(df1[2:1])))
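As a compact illustration of the same split-and-count idea on the question's three distinct strings, here is a base R sketch that uses tabulate() instead of table(). The intermediate names parts, lv and counts are hypothetical, and no claim is made about relative speed.

```r
initial <- data.frame(Strings = c("A,A,B,C", "A,B,C", "A,A,A,A,A,B"),
                      stringsAsFactors = FALSE)

# Split every string on commas; the result is a list with one vector per row
parts <- strsplit(initial$Strings, ",", fixed = TRUE)

# All distinct characters, sorted, become the output columns
lv <- sort(unique(unlist(parts)))

# Count how often each character occurs in each row
# (vapply returns one column per row here, so the matrix is transposed)
counts <- t(vapply(parts,
                   function(p) tabulate(match(p, lv), nbins = length(lv)),
                   integer(length(lv))))
colnames(counts) <- lv

final <- cbind(initial, counts)
final
```

The row order of the input is preserved, because the counts are computed row by row rather than through a grouping key.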

Match dataframe rows according to two variables (Indexing)

I am essentially trying to get disorganized data into long form for linear modeling.
I have 2 data.frames "rec" and "book"
Each row in "book" needs to be pasted onto the end of several of the rows of "rec" according to two variables in the row: "MRN" and "COURSE" which match.
I have tried the following and variations thereon to no avail:
i=1
newlist=list()
colnames(newlist)=colnames(book)
for ( i in 1:dim(rec)[1]) {
mrn=as.numeric(as.vector(rec$MRN[i]));
course=as.character(rec$COURSE[i]);
get.vector<-as.vector(((as.numeric(as.vector(book$MRN))==mrn) & (as.character(book$COURSE)==course)))
newlist[i]<-book[get.vector,]
i=i+1;
}
I would appreciate any suggestions on:
1) getting this to work
2) making it more elegant (or perhaps just less clumsy)
If I have been unclear in any way, I beg your pardon.
I do understand that I haven't combined any data above; I think that if I can generate a long-format data.frame, I can combine them on my own.
Sounds like you need to merge the two data-frames. Try this:
merge(rec, book, by = c('MRN', 'COURSE'))
and do read the help for merge (by doing ?merge at the R console) for more options on how to merge these.
I've created a simple example that may help you. In my case I wanted to paste the 'value' column from df1 onto each row of df2, matching on the variables x1 and x2:
df1 <- read.table(textConnection("
x1 x2 value
1 2 12
1 3 56
2 1 35
2 2 68
"),header=T)
df2 <- read.table(textConnection("
test x1 x2
1 1 2
2 1 3
3 2 1
4 2 2
5 1 2
6 1 3
7 2 1
"),header=T)
library(sqldf)
sqldf("select df2.*, df1.value from df2 join df1 using(x1,x2)")
test x1 x2 value
1 1 1 2 12
2 2 1 3 56
3 3 2 1 35
4 4 2 2 68
5 5 1 2 12
6 6 1 3 56
7 7 2 1 35
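The same lookup can be sketched with base R's merge() on the two key columns, using the sample data from this answer. Note that merge() sorts its result by the key columns, so the sketch restores the original row order afterwards.

```r
df1 <- data.frame(x1 = c(1, 1, 2, 2),
                  x2 = c(2, 3, 1, 2),
                  value = c(12, 56, 35, 68))
df2 <- data.frame(test = 1:7,
                  x1 = c(1, 1, 2, 2, 1, 1, 2),
                  x2 = c(2, 3, 1, 2, 2, 3, 1))

# Join on both key columns; each df2 row picks up the matching 'value' from df1
out <- merge(df2, df1, by = c("x1", "x2"))

# Restore df2's row order and column layout
out <- out[order(out$test), c("test", "x1", "x2", "value")]
rownames(out) <- NULL
out
```

By default this is an inner join, so df2 rows without a match in df1 would be dropped; all.x = TRUE would keep them with value set to NA.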

Resources