R - merge by variable column with duplicated entry - r

I am trying to merge two data frames of different sizes by ID. However, for the IDs that match, both data frames contain duplicated entries, i.e., there may be three rows with ID #3 in Data A and three rows with ID #3 in Data B. When I try to merge the data, the result is much larger than both data frames combined.
C<-merge(A,B,by="ID",all.x=T,sort=F)
I want to merge the two data by the ID column, such that the first ID #3 in B pairs with the first ID #3 in A, and so on.
Also, I want the row order of Data A to remain the same. Setting sort=FALSE didn't help much: it places all the matching rows at the top and the unmatched rows at the bottom.
Thanks for your help!

Before merging, you'll need to add to each data.frame a column whose value records the index of each observation within its own ID group.
## Example data
A <- data.frame(ID=c(1,1,1,2), ht=1:4)
B <- data.frame(ID=c(1,1,2,2), wt=3:6)
## Add column with number of each observation within ID
A <- transform(A, ID2=ave(ID, ID, FUN=seq_along))
B <- transform(B, ID2=ave(ID, ID, FUN=seq_along))
## Now carry out the merge
merge(A, B, all.x=TRUE, sort=FALSE)
# ID ID2 ht wt
# 1 1 1 1 3
# 2 1 2 2 4
# 3 2 1 4 5
# 4 1 3 3 NA

Thanks for your help, it is really useful. I ended up adding a column of row numbers to the larger data frame, whose order I want to preserve.
Using @Josh O'Brien's example,
> ## Example data
> A <- data.frame(ID=c(1,1,1,2), ht=1:4)
> B <- data.frame(ID=c(1,1,2,2), wt=3:6)
>
> ## Add column with number of each observation within ID
> A <- transform(A, ID2=ave(ID, ID, FUN=seq_along))
> B <- transform(B, ID2=ave(ID, ID, FUN=seq_along))
>
> # Add a new column in A that numbers the row from 1 to number of row
> A$ORDER_DATA <- 1:nrow(A)
>
> ## Now carry out the merge
> C<-merge(A, B, all.x=TRUE, sort=FALSE)
>
> # Sort the merged data by ORDER_DATA column
> D<-C[with(C,order(ORDER_DATA)),]
> D
ID ID2 ht ORDER_DATA wt
1 1 1 1 1 3
2 1 2 2 2 4
4 1 3 3 3 NA
3 2 1 4 4 5
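
The two answers above can be combined into one function. This is only a sketch: merge_by_occurrence and the temporary column names .occ and .row are made up for illustration, and it assumes neither data frame already has columns with those names.

```r
# Hypothetical helper: pair the k-th occurrence of each ID in A with the
# k-th occurrence of the same ID in B, keeping A's original row order.
merge_by_occurrence <- function(A, B, id = "ID") {
  # Index of each row within its own ID group
  A$.occ <- ave(seq_len(nrow(A)), A[[id]], FUN = seq_along)
  B$.occ <- ave(seq_len(nrow(B)), B[[id]], FUN = seq_along)
  # Remember A's original row order
  A$.row <- seq_len(nrow(A))
  out <- merge(A, B, by = c(id, ".occ"), all.x = TRUE, sort = FALSE)
  # Restore A's order and drop the temporary columns
  out <- out[order(out$.row), ]
  out$.occ <- NULL
  out$.row <- NULL
  rownames(out) <- NULL
  out
}

A <- data.frame(ID = c(1, 1, 1, 2), ht = 1:4)
B <- data.frame(ID = c(1, 1, 2, 2), wt = 3:6)
res <- merge_by_occurrence(A, B)
print(res)
```

Rows of B with no partner in A (here the second ID 2 row) are dropped, because the merge uses all.x=TRUE as in the question.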

Related

Remove duplicate rows with certain value in specific column

I have a data frame and I want to remove rows that are duplicated in all columns except one, keeping the rows that do not have a certain value in that column.
In above example, the 3rd and 4th rows are duplicated in all columns except col3, so I want to keep only one of them. The complication is that I want to keep the 4th row instead of the 3rd, because the 3rd row has "excluded" in col3. In general, among the duplicated rows I want to keep only those that do not have "excluded".
My real data frame has lots of duplicated rows, and within each pair of duplicated rows one of them is "excluded" for sure.
Below is a reproducible example:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
Thank you so much!
Update: Or, when removing duplicate, only keep the second observation (4th row).
dplyr with some base R should work for this:
library(dplyr)
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)
d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")
Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded
The filter drops every row that is both a repeat of an earlier (a, b) pair and marked "excluded". I also added a non-unique row that is not excluded ('brown') to your example as a test.
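
Because duplicated() flags only the second and later occurrences, the filter above keeps an "excluded" row that happens to come first in its group, which is the case in the question's original data. A base R sketch that does not depend on row order is shown here. The names group_size and kept are made up for illustration, and it assumes every duplicated group contains at least one row that is not "excluded" (otherwise the whole group is dropped).

```r
a <- c(1, 2, 3, 3, 7)
b <- c(4, 5, 6, 6, 8)
c <- c("red", "green", "excluded", "orange", "excluded")
d <- data.frame(a, b, c)

# Size of each (a, b) group, repeated for every row of d
group_size <- ave(seq_len(nrow(d)), d$a, d$b, FUN = length)

# Keep rows that are not "excluded", plus "excluded" rows with no duplicate
kept <- d[d$c != "excluded" | group_size == 1, ]
print(kept)
```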
Here is an example with a loop:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d<- data.frame(a,b,c)
# Row indices of duplicated rows (only the second and later occurrences are returned)
duplicated_rows=which(duplicated(d[c("a","b")]))
to_remove=c()
# Loop over the duplicated rows
for(i in duplicated_rows){
  # Find all rows with the same a and b values
  selection=which(d$a==d$a[i] & d$b==d$b[i])
  # Store the indices of the rows in this duplicate set that are "excluded"
  to_remove=c(to_remove,selection[which(d$c[selection]=="excluded")])
}
# Remove rows (skipped when nothing was flagged, because d[-to_remove,] with an empty to_remove returns no rows)
if(length(to_remove)>0) d=d[-to_remove,]
print(d)
> a b c
> 1 1 4 red
> 2 2 5 green
> 4 3 6 orange
> 5 7 8 excluded
Here is a possibility ... I hope it can help :)
nquit <- (d %>%
  mutate(code= 1:nrow(d)) %>%
  group_by(a, b) %>%
  mutate(nDuplicate= n()) %>%
  filter(nDuplicate > 1) %>%
  filter(c == "excluded"))$code
# Drop the flagged rows (note the comma: without it, columns are dropped instead)
e <- d[-nquit, ]
Shortening the approach by @Klone a bit, another dplyr solution:
d %>% mutate(c = factor(c, ordered = TRUE,
                        levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
  arrange(c) %>% # Sort the data frame so that excluded comes last
  group_by(a, b) %>% # Group by the two columns that determine duplicates
  mutate(id = 1:n()) %>% # Assign IDs in each group
  filter(id == 1) # Only keep the first row in each group
Result:
# A tibble: 4 x 4
# Groups: a, b [4]
a b c id
<dbl> <dbl> <ord> <int>
1 1 4 red 1
2 2 5 green 1
3 3 6 orange 1
4 7 8 excluded 1
Regarding your edit at the end of the question:
Update: Or, when removing duplicate, only keep the second observation (4th row).
note that, if the ordering of the rows by col3 guarantees that the row to keep is always the last one within each group of duplicates, you can simply set fromLast=TRUE in duplicated(), so that duplicates are flagged counting from the last occurrence in each group rather than the first.
Using a slightly modified version of your data (where I added more duplicate groups to better show that the process works in a more general case):
a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)
a b c
1 1 4 excluded
2 1 4 red
3 2 5 green
4 3 6 excluded
5 3 6 excluded
6 3 6 orange
7 7 8 excluded
using:
ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])
we get:
a b c
2 1 4 red
3 2 5 green
6 3 6 orange
7 7 8 excluded
Note that this doesn't require the rows in each duplicate group to be all together in the original data. The only important thing is that you want to keep the record showing up last in the data from each duplicate group.

compute columns means after assigning one dataframe to a field of another data frame R

I have one data frame for example:
> df=data.frame(a=1:4,b=2:5)
> df
a b
1 1 2
2 2 3
3 3 4
4 4 5
Then I create another data frame and assign the data frame above to a field of the other one:
> df2=data.frame(c=3:6)
> df2$df1=df
> df2
c df1.a df1.b
1 3 1 2
2 4 2 3
3 5 3 4
4 6 4 5
When I compute the column means of the data frame, I got the error:
> colMeans(df2)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L) X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
Could anyone help to solve this problem?
Check ncol(df2) to see that there are only 2 "columns". The colMeans function cannot take the mean of the second element of the df2 list because it isn't a single column but two. Instead of df2$df1 = df, you can do df2 <- cbind(df2, df). If you want the column names to be the same as in your example you can do
sapply(1:ncol(df), function(i) df2[,paste0('df1','.',names(df)[i])] <<- df[,i])
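
A sketch of the cbind() route that keeps the df1.a / df1.b names without the <<- assignment. It uses only base R; the variable name means is made up for illustration.

```r
df  <- data.frame(a = 1:4, b = 2:5)
df2 <- data.frame(c = 3:6)

# Prefix df's column names with "df1." and bind them on as ordinary columns
df2 <- cbind(df2, setNames(df, paste0("df1.", names(df))))

# Every element of df2 is now a plain numeric column, so colMeans() works
means <- colMeans(df2)
print(means)
```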

R - if values match in column A, how often do their corresponding values in column B match?

I am a novice R user and have a 3 million row dataset. I am using R 3.0.1. I have a data frame in R that looks as follows:
A1 B1
1 50
1 50
1 45
2 20
2 20
3 15
4 30
4 30
I'd like to know, if there are multiple of the same value in A1, what % of the time do the corresponding values match in column B1?
In the example above, there are 7 rows that are a duplicate in A1 and their corresponding values match 6/7 times. How can I get this result for millions of rows?
Note: For a group of given values in A1, there will not be more than 2 unique values in column B.
Here's a data.table approach (assuming df is your data set)
library(data.table)
df2 <- as.data.table(df)[, list(Match = if(.N > 1) sum(B1[1] == B1),
                                Dups = if(.N > 1) .N), by = A1]
This will create a data set that will show you the duplicates and matched frequencies per A1
df2
## A1 Match Dups
## 1: 1 2 3
## 2: 2 2 2
## 3: 4 2 2
In order to reach your desired output, simply do
df2[, sum(Match)/sum(Dups)]
## [1] 0.8571429
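
For comparison, the same calculation can be sketched in base R with ave(). Like the data.table version, it counts a row as a match when its B1 equals the first B1 in its A1 group. The names n, first_match, dup and pct are made up for illustration, and this has not been timed on millions of rows.

```r
df <- data.frame(A1 = c(1, 1, 1, 2, 2, 3, 4, 4),
                 B1 = c(50, 50, 45, 20, 20, 15, 30, 30))

# Number of rows in each A1 group, repeated for every row
n <- ave(df$B1, df$A1, FUN = length)

# 1 if the row's B1 equals the first B1 of its A1 group, else 0
first_match <- ave(df$B1, df$A1, FUN = function(x) x == x[1])

# Only rows whose A1 value occurs more than once count
dup <- n > 1
pct <- sum(first_match[dup]) / sum(dup)
print(pct)
```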

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of an id if one of its rows contains the info a. This would mean, for example, that rows 2 and 3 should be removed, as row 1's column info contains the value a. Please note that the info values are not ordered (id 3/row 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
  temp_data <- data[which(data$id == i),]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    # seq_len() iterates over every row; for (ii in nrow(temp_data)) would visit only the last one
    for (ii in seq_len(nrow(temp_data))) {
      if (temp_data$info[ii] != "a") {
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
My solution works quite well. Nevertheless, it is very, very, very slow, as the original data contains more than 700k rows and more than 150k rows with an "a"-value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. @BenBarnes found out that this solution only works if the data frame is ordered according to id.
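
A one-pass variant that should not depend on the rows being ordered by id can be sketched with ave(), which writes each group's result back to that group's own row positions. The names has_a and res are made up for illustration.

```r
data <- read.table(text = "id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8", header = TRUE)

# TRUE for every row whose id has at least one "a", whatever the row order
has_a <- as.logical(ave(data$info == "a", data$id, FUN = any))

# Keep the "a" rows, plus all rows of ids that have no "a" at all
res <- data[data$info == "a" | !has_a, ]
print(res)
```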
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
if df is this (RE Sacha above) use match which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
Try taking a look at subset; it's quite easy to use and it will solve your problem.
You just need to specify the value of the column that you want to subset on; alternatively, you can condition on several columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html

aggregate over several variables in r

I have a rather large dataset in a long format where I need to count the number of instances of the ID due to two different variables, A & B. E.g. The same person can be represented in multiple rows due to either A or B. What I need to do is to count the number of instances of ID which is not too hard, but also count the number of ID due to A and B and return these as variables in the dataset.
Regards,
//Mi
The ddply() function from the package plyr lets you break data apart by identifier variables, perform a function on each chunk, and then assemble it all back together. So you need to break your data apart by identifier and A/B status, count how many times each of those combinations occur (using nrow()), and then put those counts back together nicely.
Using wkmor1's df:
library(plyr)
x <- ddply(.data = df, .variables = c("ID", "GRP"), .fun = nrow)
which returns:
ID GRP V1
1 1 a 2
2 1 b 2
3 2 a 2
4 2 b 2
And then merge that back on to the original data:
merge(x, df, by = c("ID", "GRP"))
OK, given the interpretations I see, then the fastest and easiest solution is...
df$IDCount <- ave(df$ID, df$group, FUN = length)
Here is one approach using 'table' to count rows meeting your criteria, and 'merge' to add the frequencies back to the data frame.
> df<-data.frame(ID=rep(c(1,2),4),GRP=rep(c("a","a","b","b"),2))
> id.frq <- as.data.frame(table(df$ID))
> colnames(id.frq) <- c('ID','ID.FREQ')
> df <- merge(df,id.frq)
> grp.frq <- as.data.frame(table(df$ID,df$GRP))
> colnames(grp.frq) <- c('ID','GRP','GRP.FREQ')
> df <- merge(df,grp.frq)
> df
ID GRP ID.FREQ GRP.FREQ
1 1 a 4 2
2 1 a 4 2
3 1 b 4 2
4 1 b 4 2
5 2 a 4 2
6 2 a 4 2
7 2 b 4 2
8 2 b 4 2
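
Both counts can also be added in place with ave(), which avoids the two merges and leaves the row order unchanged. This is a sketch on the same example data; the column names ID.FREQ and GRP.FREQ follow the answer above.

```r
df <- data.frame(ID = rep(c(1, 2), 4), GRP = rep(c("a", "a", "b", "b"), 2))

# Number of rows per ID, repeated on every row of that ID
df$ID.FREQ <- ave(seq_along(df$ID), df$ID, FUN = length)

# Number of rows per ID and GRP combination
df$GRP.FREQ <- ave(seq_along(df$ID), df$ID, df$GRP, FUN = length)
print(df)
```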
