Merge function generates duplicates

Merge function generates duplicates - r

This question is just to understand why this would happen.
I'm merging two databases:
bot.rep.geo <- merge(x = bot.rep, y = geo.2016, by = "cod.geo", all.x = TRUE)
The original databases have the following dimensions: bot.rep has 1634451 observations, geo.2016 has 1393.
After merging using all.x = TRUE, the new database emerges with 1727681, instead of the same size as bot.rep.
Why does this happen?
After a quick review, I realised it was creating some duplicates, but I don't understand the reason and if I'm doing something wrong while using the merge function.

There may be lines in the geo.2016 table where the cod.geo value appears twice or more.
if you have a bot.rep value of "X" in your bot.rep data, then 2 lines which contain "X" in the geo.2016 data, the merge will duplicate the line in bot.rep and join the 2 lines from geo.2016.

This happens because of one-to-many relationship, x has multiple rows matching in y.
See example below, where bot.rep cod.geo value of 1 has 2 matches in geo.2016 dataset. Hence, we have 2 rows for 1 id. Also, notice we are creating NA rows for non matched ids because of all.x = TRUE argument.
Now, you need to decide which row is a duplicate for cod.geo value 1.
#dummy data
bot.rep <- data.frame(cod.geo = 1:4)
geo.2016 <- data.frame(cod.geo = c(1,1,3,5,6), z = 1:5)
bot.rep.geo <- merge(x = bot.rep, y = geo.2016,
by = "cod.geo", all.x = TRUE)
# cod.geo z
# 1 1 1
# 2 1 2
# 3 2 NA
# 4 3 3
# 5 4 NA
You will find more info on different types of merge functions here.

Related

How to add and rename a row in a dataframe using the apply function

I am new to programming in R and from a data frame that I have previously created, I have to add a row that contains the mean of all the columns, this row has to be called 'Average'. It should be noted that the last column contains NA values.
So far I have managed to create the new row with the mean for each of the columns and in the last column, omitting the NA values. What I need is to name the row and put it 'Average', by default it appears 51 and I don't know how to change it, I have read about the function row.names () etc, but I can't get anything out.
Can someone help me out? I would be enormously grateful
I leave the dataframe here (state.df):
state.df = as.data.frame(state.x77)
This is what i've done:
apply = apply(state.df, MARGIN = 2, mean, na.rm = TRUE)
rbind(state.df,apply)
I have tried this as well and a few other things but it doesn't work for me.
rbind(state.df,apply, rownames(state.df, prefix = 'Average'))
In summary, with apply I already have what I basically want, I just want to change the name 51 that appears when I do the rbind and change it to 'Average'

rbind constructs row names from argument names only for arguments that are matrix-like. Both apply and colMeans return a vector that is not matrix-like. You can use t to coerce this vector to a matrix, so that the argument name (in this case Average) is actually used.
dd <- data.frame(x = rnorm(10L), y = c(NA, rnorm(9L)))
rbind(dd, Average = t(colMeans(dd, na.rm = TRUE)))
# x y
# 1 0.4070128 NA
# 2 1.2352564 0.5730119
# 3 -0.5842432 -0.2096068
# 4 0.1695935 -1.0667109
# 5 -0.7393369 -1.5895364
# 6 -0.7394052 -1.0886582
# 7 0.9922455 -0.2560118
# 8 -3.0080877 -2.1085712
# 9 -0.3629210 -1.9192967
# 10 -0.5564323 -0.5459473
# Average -0.3186318 -0.9123697
rbind(dd, Average = colMeans(dd, na.rm = TRUE))
# x y
# 1 0.4070128 NA
# 2 1.2352564 0.5730119
# 3 -0.5842432 -0.2096068
# 4 0.1695935 -1.0667109
# 5 -0.7393369 -1.5895364
# 6 -0.7394052 -1.0886582
# 7 0.9922455 -0.2560118
# 8 -3.0080877 -2.1085712
# 9 -0.3629210 -1.9192967
# 10 -0.5564323 -0.5459473
# 11 -0.3186318 -0.9123697
Of course, you can always modify row names after the fact with row.names<-, as pointed out in the comments.
You could replace colMeans(dd, na.rm = TRUE) with apply(dd, 2L, mean, na.rm = TRUE) and get the same results, but colMeans is faster. For operations other than mean, you may need apply.

One to Many Join in data.table

I am using data.table to do a one-to-many merge. Instead of matching with all the rows, the output is showing only the last matched row for each unique value of the key.
a <- data.table(x = 1:2L, y = letters[1:4])
b <- data.table(x = c(1L,3L))
setkey(a,x)
setkey(b,x)
I want to do a many to one (b to a) join based on column x.
c <- a[b,on=.(x)]
c
# x y
# 1: 1 a
# 2: 1 c
# 3: 3 NA
However, this approach creates a new data.table called c, instead of making a new data.table, I use the following code to add the column y with b.
b[a,y:=i.y]
Now b looks like,
b
# x y
# 1: 1 c
# 2: 3 NA
The desired output is the one in the first method (c). Is there a way of using := and output all the rows instead of the last matched row alone?
PS: The reason I want to use method 2 using := is because my data is huge and I do not want to make copies. The example I showed reflects what happens in my data.

How to merge lists of vectors based on one vector belonging to another vector?

In R, I have two data frames that contain list columns
d1 <- data.table(
group_id1=1:4
)
d1$Cat_grouped <- list(letters[1:2],letters[3:2],letters[3:6],letters[11:12] )
And
d_grouped <- data.table(
group_id2=1:4
)
d_grouped$Cat_grouped <- list(letters[1:5],letters[6:10],letters[1:2],letters[1] )
I would like to merge these two data.tables based on the vectors in d1$Cat_grouped being contained in the vectors in d_grouped$Cat_grouped
To be more precise, there could be two matching criteria:
a) all elements of each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_a <- data.table(
group_id1=c(1,2)
group_id2=c(1,1)
)
b) at least one of the elements in each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_b <- data.table(
group_id1=c(1,2,3,3),
group_id2=c(1,1,1,2)
)
How can I implement a) or b) ? Preferably in a data.table way.
EDIT1: added the expected results of a) and b)
EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions

So I think long form is better, though my answer feels a little roundabout. I bet someone whose a little sleeker with data table can do this in fewer steps, but here's what I've got:
first, let's unpack the vectors in your example data:
d1_long <- d1[, list(cat=unlist(Cat_grouped)), group_id1]
d_grouped_long <- d_grouped[, list(cat=unlist(Cat_grouped)), group_id2]
Now, we can merge on the individual elements:
result_b <- merge(d1_long, d_grouped_long, by='cat')
Based on our example, it seems you don't actually need to know which elements were part of the match...
result_b[, cat := NULL]
Finally, my answer has duplicated group_id pairs because it gets a join for each pairwise match, not just the vector-level matches. So we can unique them away.
result_b <- unique(result_b)
Here's my result_b:
group_id.1 group_id.2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
We can use b as an intermediate step to a, since having any elements in common is a subset of having all elements in common.
Let's merge the original tables to see what the candidates are in terms of subvectors and vectors
result_a <- merge(result_b, d1, by = 'group_id1')
result_a <- merge(result_a, d_grouped, by = 'group_id2')
So now, if the length of Cat_grouped.x matches the number of TRUEs about Cat_grouped.x being %in% Cat_grouped.y, that's a bingo.
I tried a handful of clean ways, but the weirdness of having lists in the data table defeated the most obvious attempts. This seems to work though:
Let's add a row column to operate by
result_a[, row := 1:.N]
Now let's get the length and number of matches...
result_a[, x.length := length(Cat_grouped.x[[1]]), row]
result_a[, matches := sum(Cat_grouped.x[[1]] %in% Cat_grouped.y[[1]]), row]
And filter down to just rows where length and matches are the same
result_a <- result_a[x.length==matches]

This answer focuses on part a) of the question.
It follows Harland's approach but tries to make better use of the data.table idiom for performance reasons as the OP has mentioned that his production data may contain millions of observations.
Sample data
library(data.table)
d1 <- data.table(
group_id1 = 1:4,
Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
group_id2 = 1:2,
Cat_grouped = list(letters[1:5], letters[6:10]))
Result a)
grp_cols <- c("group_id1", "group_id2")
unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])
group_id1 group_id2
1: 1 1
2: 2 1
Explanation
While expanding the list elements of d1 and d_grouped into long format, the number of list elements is determined for d1 using the lengths() function. lengths() (note the difference to length()) gets the length of each element of a list and was introduced with R 3.2.0.
After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the specal symbol .N) for each combination of grp_cols. Only those rows are considered where the count in the result set does match the original length of the list. Finally, the unique combinations of grp_cols are returned.
Result b)
Result b) can be derived from above solution by omitting the counting stuff:
unique(d1[, unlist(Cat_grouped), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, c("group_id1", "group_id2")])
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2

Another way:
Cross-join to get all pairs of group ids:
Y = CJ(group_id1=d1$group_id1, group_id2=d_grouped$group_id2)
Then merge in the vectors:
Y = Y[d1, on='group_id1'][d_grouped, on='group_id2']
# group_id1 group_id2 Cat_grouped i.Cat_grouped
# 1: 1 1 a,b a,b,c,d,e
# 2: 2 1 c,b a,b,c,d,e
# 3: 3 1 c,d,e,f a,b,c,d,e
# 4: 4 1 k,l a,b,c,d,e
# 5: 1 2 a,b f,g,h,i,j
# 6: 2 2 c,b f,g,h,i,j
# 7: 3 2 c,d,e,f f,g,h,i,j
# 8: 4 2 k,l f,g,h,i,j
Now you can use mapply to filter however you like:
Y[mapply(function(u,v) all(u %in% v), Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
Y[mapply(function(u,v) length(intersect(u,v)) > 0, Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 3 2

Append values from column 2 to values from column 1

In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should results in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge but that simply appends the columns (much like cbind)
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.

Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4

I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically what you had as your idea, but with a loop. The data sets are looked at to see if they are numeric (add them numerically) or a string or factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric and one for the character, but this is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that is assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as previous, but uses merge to make sure that the data is correctly aligned, and then works on columns (whose names are postfixed with ".x" and ".y") with dfNew. Additional steps are included to get rid of the separate columns after joining. Also has the bonus feature of carrying along any other columns not specified for joining together in cols.

Combine two data.frames in R with differing rows

I have two tables one with more rows than the other. I would like to filter the rows out that both tables share. I tried the solutions proposed here.
The problem, however, is that it is a large data-set and computation takes quite a while. Is there any simple solution? I know how to extract the shared rows of both tables using:
rownames(x1)->k
rownames(x)->l
which(rownames(x1)%in%l)->o
Here x1 and x are my data frames. But this only provides me with the shared rows. How can I get the unique rows of each table to then exclude them respectively? So that I can just cbind both tables together?

(I edit the whole answer)
You can merge both df with merge() (from Andrie's comment). Also check ?merge to know all the options you can put in as by parameter, 0 = row.names.
The code below shows an example with what could be your data frames (different number of rows and columns)
x = data.frame(a1 = c(1,1,1,1,1), a2 = c(0,1,1,0,0), a3 = c(1,0,2,0,0), row.names = c('y1','y2','y3','y4','y5'))
x1 = data.frame(a4 = c(1,1,1,1), a5 = c(0,1,0,0), row.names = c('y1','y3','y4','y5'))
Provided that row names can be used as identifier then we put them as a new column to merge by columns:
x$id <- row.names(x)
x1$id <- row.names(x1)
# merge by column names
merge(x, x1, by = intersect(names(x), names(x1)))
# result
# id a1 a2 a3 a4 a5
# 1 y1 1 0 1 1 0
# 2 y3 1 1 2 1 1
# 3 y4 1 0 0 1 0
# 4 y5 1 0 0 1 0
I hope this solves the problem.
EDIT: Ok, now I feel silly. If ALL columns have different names in both data frames then you don't need to put the row name as another column. Just use:
merge(x,x1, by=0)

If you only want the rows which are not repeated from each data set:
rownames(x1)->k
rownames(x)->l
which(k%in%l) -> o
x1.uniq <- x1[k[k != o],];
x.uniq <- x[l[l != o],];
And then you can join them with rbind:
x2 <- rbind(x1.uniq,x.uniq);
If you also wanted the repeated rows you can add them:
x.repeated <- x1[o];
x2 <- rbind(x2,x.repeated);

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Merge function generates duplicates - r

There may be lines in the geo.2016 table where the cod.geo value appears twice or more. if you have a bot.rep value of "X" in your bot.rep data, then 2 lines which contain "X" in the geo.2016 data, the merge will duplicate the line in bot.rep and join the 2 lines from geo.2016.

Related

How to add and rename a row in a dataframe using the apply function

One to Many Join in data.table

How to merge lists of vectors based on one vector belonging to another vector?

Append values from column 2 to values from column 1

Combine two data.frames in R with differing rows

Categories

Resources