Merging dataframes in R - resulting dataframe is too large

I am trying to merge two dataframes in R, joining them on the one column they share.
Here are screenshots of the two dataframes; I am merging on the column "INC_KEY".
This is the code I have written to merge the two dataframes:
dp <- inner_join(d, p, by = "INC_KEY")
d has 177156 observations and p has 1641137 observations, but the merged dataframe has 8416113 observations, which does not make sense to me. I have also tried swapping the inner_join call above for merge, but I get the same result. How can I fix this code so that the merged dataframe has a realistic number of observations? Thanks so much for any help!

You most probably have duplicates in d, in p, or in both. Try keeping only one row for each unique INC_KEY value before joining:
library(dplyr)
dp <- inner_join(d %>% distinct(INC_KEY, .keep_all = TRUE),
                 p %>% distinct(INC_KEY, .keep_all = TRUE),
                 by = "INC_KEY")
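Before dropping rows with distinct, it is worth checking which table actually contains the duplicated keys. A quick diagnostic sketch (not part of the original answer):
library(dplyr)
# Any n > 1 here is a key that will multiply rows in the join
d %>% count(INC_KEY) %>% filter(n > 1)
p %>% count(INC_KEY) %>% filter(n > 1)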

This can happen if your INC_KEY is not a unique identifier. Here is a simplified example:
library(dplyr)
df1 <- data.frame(key = c("A", "B", "C", "A"),
                  val1 = 1:4)
df2 <- data.frame(key = c("A", "B", "C", "C", "B"),
                  val2 = 1:5)
inner_join(df1, df2, by = "key")
Joining, by = "key"
  key val1 val2
1   A    1    1
2   B    2    2
3   B    2    5
4   C    3    3
5   C    3    4
6   A    4    1
Because there are two rows with "A" in the key column of df1, both of them match the one row of df2 with "A". Likewise, the one row in df1 with key "C" matches both rows with key "C" in df2. This is the expected behavior of an inner join with duplicated key values: the join returns all rows in the second data.frame that match each row in the first data.frame, and if there are multiple matches, they are all returned.
If you want one row per INC_KEY, then you need to do something to your original data before the join, especially if the rows are not complete duplicates.

The key column INC_KEY has duplicates in at least one of your tables. inner_join will then output a table with extra rows, one for each matching combination of duplicates, minus any rows whose INC_KEY is missing from either d or p.
If you expect your new table to have the same number of rows as table d, then you need to aggregate the information in table p first, grouped by INC_KEY. Then you can perform the inner_join, as sketched below.
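A minimal sketch of that aggregation step, assuming p has a numeric column value to summarise (the column name is a placeholder; summarise whichever columns of p you actually need):
library(dplyr)
# Collapse p to one row per INC_KEY so the join can no longer multiply rows
p_agg <- p %>%
  group_by(INC_KEY) %>%
  summarise(value = sum(value), .groups = "drop")
dp <- inner_join(d, p_agg, by = "INC_KEY")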

Related

In R, search for several unique IDs from one dataset in another

I am trying to create a new column in a dataset. I want this column to be a "yes" or "no" column. Let's say that I have one dataset that has 1000 rows including a unique ID and another dataset that has 200 rows including a unique ID. The unique ID's match between the datasets because both datasets are from the same database.
I want to create a column in the larger dataset based on a search criterion: search the unique IDs in the larger dataset, and have the new column say "yes" for any unique ID that also appears in the unique ID column of the smaller dataset. Basically, if the ID is found in both the small and the large dataset it should say Yes, and if not, No.
Example:
This is what I want. Except in my case the 2 columns will be in different datasets.
I've tried to do this in R and even in Excel. I've tried merging the 2 datasets by the ID column, but that doesn't get me what I want, which is a new column saying "yes" or "no" depending on whether the ID is found in both datasets. What should I do? I think I can use %>% to solve my problem, but I'm lost as to where to start.
The key here is checking which values are contained in the other dataset. Basically this is a conditional operation comparing the two vectors of IDs, which in this case can be easily solved using %in%:
Data:
# Dataset with 5 letters and values
dat <- data.frame(
  id = LETTERS[1:5],
  val = 1:5
)
# Subset
minidat <- dat[4, ]
Base R
new_dat <- dat # Or modify in place
new_dat$is_in_smaller <- ifelse(dat$id %in% minidat$id, "yes", "no")
new_dat
## id val is_in_smaller
## 1 A 1 no
## 2 B 2 no
## 3 C 3 no
## 4 D 4 yes
## 5 E 5 no
{dplyr} approach, identical output
library(dplyr)
new_dat2 <- dat %>%
  mutate(is_in_smaller = ifelse(id %in% minidat$id, "yes", "no"))
{data.table}
library(data.table)
new_dat3 <- as.data.table(dat) # Skip this conversion if you already have a data.table
new_dat3[, is_in_smaller := ifelse(id %in% minidat$id, "yes", "no")]

Better subsetting and counting values in a dataframe [duplicate]

This question already has answers here: Counting unique / distinct values by group in a data frame (12 answers). Closed 4 years ago.
I have a data frame with two columns and 70,000 rows. One column serves as an identifier for a household (column b in the example below). The other column numbers the individuals in the household from 1 to n, with some error (could be 1,2,3 or 1,4,5); this is column a in the example below.
I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in each household and puts them in the proper column and row; however, it takes several minutes with the actual data set I have, I assume due to its size. Is there a better way of getting this information?
fake.data <- data.frame(a = c(1, 1, 5, 6, 7, 1, 2, 3, 1, 2, 4),
                        b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b,
                               function(x) length(unique(subset(fake.data, fake.data$b == x)$a)))
Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.
library(dplyr)
fake.cluster <- fake.data %>%
  distinct() %>%
  count(b)
Here is an option using data.table
library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
# b members
#1: a 4
#2: b 3
#3: c 3
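For completeness, a base R one-liner (not from the original answers) that computes the same counts without loading any packages:
# Number of unique individuals (a) per household (b)
tapply(fake.data$a, fake.data$b, function(x) length(unique(x)))
# a b c
# 4 3 3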

Key retention when using different ways to select columns from data.table

I ran into an unexpected (to me) issue with subsetting columns in a data.table. I was trying to use unique on three columns of a data.table. The original data.table contained several (>5) columns of which 2 were set as keys. One of three columns on which I used unique was a key column. The surprising part was that the results differed when I selected columns using their character names versus when I selected them using .() notation. Here is a little example that replicates the issue that I encountered (NOTE: selecting 2 columns rather than 3 for simplicity).
library(data.table)
dt1 <- data.table(c1 = rep(c("a", "a", "b"), each = 3),
                  c2 = rep(1:3, each = 3),
                  c3 = rep(1:3, 3))
setkey(dt1, c1, c3)
unique(dt1[, c("c1", "c2"), with = FALSE])
c1 c2
1: a 1
2: b 3
unique(dt1[, .(c1, c2)])
c1 c2
1: a 1
2: a 2
3: b 3
It seems that column selection using the c("c1", "c2") notation retains the key on c1, but column selection using .(c1, c2) does not. Is there a way to control whether or not the key is retained while subsetting columns?
A little more context on my problem: I was trying to execute this code within a function that took the data and the column names to be selected as arguments, and it was easier to pass character column names.
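One way to get consistent behavior is to inspect the key of each subset and set or clear it explicitly. A small sketch, not from the original thread (note that the output above reflects older data.table versions, where unique() deduplicated by the key by default):
library(data.table)
key(dt1[, c("c1", "c2"), with = FALSE])  # "c1": the leading key column survives
key(dt1[, .(c1, c2)])                    # NULL: .() builds a fresh, unkeyed table
sub <- dt1[, c("c1", "c2"), with = FALSE]
setkey(sub, NULL)  # clear the key so unique() compares all columns
unique(sub)        # now matches unique(dt1[, .(c1, c2)])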

Equivalence of 'vlookup' in R for multiple columns?

I have a 9801 by 3 reference table.
The first 2 columns of this table is defined as follows.
x1 = x2 = seq(0.01,0.99,0.01)
x12 = data.matrix(expand.grid(x1,x2))
The 3rd column contains the outcome values.
Now I have another n by 3 matrix where the 1st and 2nd columns come from selected rows of the matrix 'x12' above, and the 3rd column is to be filled in. I would like to fill in the 3rd column of the 2nd table by looking up each combination of the 1st and 2nd columns in the 1st table and taking the corresponding value from its 3rd column.
How can I do this?
You can do this with the merge function:
# Original data frame
x1 = x2 = seq(0.01,0.99,0.01)
x12 = expand.grid(x1,x2)
# Add a fake "outcome"
x12$outcome = rnorm(nrow(x12))
# New data frame with 100 random rows and the first two columns of x12
x12new = x12[sample(1:nrow(x12), 100), c(1,2)]
# Merge the outcome values from x12 into x12new
x12new = merge(x12new, x12, by=c("Var1","Var2"), all.x=TRUE)
by tells merge which columns must match when comparing the two data frames. all.x=TRUE tells merge to keep all rows from the first data frame, x12new in this case, even if they don't have a match in the second data frame (not an issue here, but you'll often want to make sure you don't lose any rows when merging).
One other thing to note is that, unlike vlookup in Excel, merge will increase the number of rows in the new, merged data frame if there are multiple rows that match the criteria. For example, see what happens when you merge values from df2 into df1:
df1 = data.frame(x = c(1,2,3,4), z=c(10,20,30,40))
df2 = data.frame(x = c(1,1,1,2,3), y=c("a","b","c","a","c"))
merge(df1, df2, by="x", all.x=TRUE)
x z y
1 1 10 a
2 1 10 b
3 1 10 c
4 2 20 a
5 3 30 c
6 4 40 <NA>
You can also use left_join from the dplyr package (other types of joins are available as well):
library(dplyr)
left_join(df1, df2, by="x")
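If you want strict vlookup semantics, where the result keeps exactly one row per lookup row, a base R sketch using match() may help (not part of the original answer; it assumes x12new still holds just the two lookup columns and takes the first match only):
# paste() builds one composite key per row so match() can compare (Var1, Var2) pairs
idx <- match(paste(x12new$Var1, x12new$Var2),
             paste(x12$Var1, x12$Var2))
x12new$outcome <- x12$outcome[idx]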

How can I merge two data frames with 500+ columns each, and 200+ overlapping columns?

I have several data frames that I need to merge into the one data frame to rule them all. The master data frame will end up with thousands of columns. All of the data frames have an ID column to join on. One problem is that hundreds of columns are duplicated across data frames. Another problem is that a handful of those columns contain inconsistent values. I would like to find a way to:
1. Combine all data frames, keeping only one "master column" of data if there are duplicate column names and the values do not conflict between data frames.
2. Keep both columns of data if they share the same name but have conflicting values.
Are there any packages that can help automate this? Or am I going to be stuck writing a lot of code/manually checking data?
I wrote the package safejoin, which solves this very succinctly:
#devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
See the following data frames: A is identical in both, B differs between df1 and df2, and C and D each appear in only one data frame.
df1 <- data.frame(id = 1:2, A = 3:4, B= 5:6, C = 7:8)
df2 <- data.frame(id = 1:2, A = 3:4, B= 9:10, D = 11:12)
library(tidyverse)
safe_full_join(df1, df2, by = "id",
               conflict = ~ if (identical(.x, .y)) .x else
                 map2(.x, .y, ~ tibble(df1 = .x, df2 = .y))) %>%
  unnest(.sep = "_")
#   id A C  D B_df1 B_df2
# 1  1 3 7 11     5     9
# 2  2 4 8 12     6    10
