Add column to grouped dataframe dplyr - r

I have two different data frames:
DF1 = data.frame("A"= c("a","a","b","b","c","c"), "B"= c(1,2,3,4,5,6))
DF2 = data.frame("A"=c("a","b","c"), "C"=c(10,11,12))
I want to add the column C to DF1 grouping by column A
The expected result is
  A B  C
1 a 1 10
2 a 2 10
3 b 3 11
4 b 4 11
5 c 5 12
6 c 6 12
Note: in this example all the groups have the same size, but that won't necessarily be the case.

Welcome to Stack Overflow. As @KarthikS commented, what you want is a join.
'Joining' is the name of the operation for connecting two tables together. 'Grouping by' a column is mainly used when summarizing a table. For example, grouping by state and summing the number of votes gives the total number of votes for each state, whereas summing without grouping first gives the grand total number of votes.
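To make the grouping case concrete, here is a minimal sketch with dplyr. The `ballots` data frame and its values are made up for illustration and are not part of the question.

```r
library(dplyr)

# Hypothetical data: several vote counts reported per state
ballots <- data.frame(
  state = c("NY", "NY", "CA", "CA", "CA"),
  votes = c(10, 20, 5, 15, 30)
)

# Grouping first gives one total per state
by_state <- ballots %>%
  group_by(state) %>%
  summarise(total_votes = sum(votes))

# Summing without grouping gives a single grand total
grand_total <- ballots %>%
  summarise(total_votes = sum(votes))
```

Here `by_state` has one row per state, while `grand_total` has a single row.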
The syntax for joins in dplyr is:
output = left_join(df1, df2, by = "shared column")
or equivalently
output = df1 %>% left_join(df2, by = "shared column")
Key reference here.
In your example, the shared column is "A".
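Applied to the data frames from the question, the join might look like the sketch below. The name `result` is only illustrative.

```r
library(dplyr)

DF1 <- data.frame(A = c("a", "a", "b", "b", "c", "c"), B = c(1, 2, 3, 4, 5, 6))
DF2 <- data.frame(A = c("a", "b", "c"), C = c(10, 11, 12))

# Every row of DF1 is kept; C is looked up in DF2 by the shared column A
result <- left_join(DF1, DF2, by = "A")
result
```

Because the lookup is done per row, it does not matter whether the groups in DF1 have the same size.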

We can use merge from base R
merge(DF1, DF2, by = 'A', all.x = TRUE)

Related

Merge two dataframes with different number of rows

I am having some issues with my data. I have two datasets on football matches that cover the same games and share the columns "Match_ID" and "Country_ID", and I would like to merge them. However, I have two problems: 1. I can't find a way to merge the data by more than one column. 2. One of the datasets has a few more rows than the other. I would like to remove the rows containing a "Match_ID" that is not in both datasets. Any tips?
Since you didn't provide sample data, I don't know what your data look like, so I am only taking a stab. Here is some sample data:
# 5 matches
df1 <- data.frame(match_id = 1:5,
                  country_id = LETTERS[1:5],
                  outcome = c(0, 1, 0, 0, 1),
                  weather = c("rain", rep("dry", 4)))
# 10 matches (containing the same 5 as in df1)
df2 <- data.frame(match_id = 1:10,
                  country_id = LETTERS[1:10],
                  location = rep(c("home", "away"), 5))
You can simply use merge():
df3 <- merge(df1, df2, by = c("match_id", "country_id"))
# Note that in this case, merge(df1, df2, by = "match_id") matches the same
# rows because of the simplicity of the sample data, but country_id would then
# appear twice (as country_id.x and country_id.y); the above is how you merge
# by more than one column
Output:
#   match_id country_id outcome weather location
# 1        1          A       0    rain     home
# 2        2          B       1     dry     away
# 3        3          C       0     dry     home
# 4        4          D       0     dry     away
# 5        5          E       1     dry     home
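As a small sketch of why this also answers the second part of the question: merge() performs an inner join by default, so rows whose keys are not in both data frames are dropped. The variable names `inner`, `one_key` and `right` below are only illustrative, and the sample data is a trimmed copy of the data above.

```r
df1 <- data.frame(match_id = 1:5,
                  country_id = LETTERS[1:5],
                  outcome = c(0, 1, 0, 0, 1))
df2 <- data.frame(match_id = 1:10,
                  country_id = LETTERS[1:10],
                  location = rep(c("home", "away"), 5))

# Default (inner) merge: only the 5 matches present in both data frames remain
inner <- merge(df1, df2, by = c("match_id", "country_id"))
nrow(inner)

# Merging on one key only keeps both country_id columns, with suffixes
one_key <- merge(df1, df2, by = "match_id")
names(one_key)

# all.y = TRUE would instead keep the unmatched rows of df2, filled with NA
right <- merge(df1, df2, by = c("match_id", "country_id"), all.y = TRUE)
nrow(right)
```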

What kind of join is it? Is it well defined?

Here is my data:
my_df_1 <- data.frame(col_1 = c(1,2,3,4,5,15), col_2 = c(4,5,6,8,9,17))
my_df_2 <- data.frame(col_1 = c(1,6,3,4,4), col_2 = c(4,5,5,11,13), col_3 = c(7,8,9,10,11))
my_df_1
my_df_2
I would like to join my_df_1 and my_df_2 on col_1 and col_2 and get my_df_3
my_df_3 <- data.frame(col_1 = c(1,2,2,3,4,4,5,15), col_2 = c(4,5,5,6,8,8,9,17),
                      col_3 = c(7,8,9,9,10,11,NA, NA))
my_df_3
Here is the logic of the join.
We start with row one of my_df_1. If I can match the values in both columns with my_df_2, then I simply pull the value of col_3 from my_df_2. For example, the first row is matched completely, so we simply get col_3 = 7.
In the second row of my_df_1 we could only match the value in the second column (5). That value was found in both row 2 and row 3 of my_df_2, so we pulled col_3 = 8 from the second row and col_3 = 9 from the third row.
In the third row of my_df_1 we could only match the value in the first column, so we pulled the value 9.
Similarly, in the fourth row we only matched the value 4 in the first column, which appears in two rows of my_df_2, so we pulled 10 and 11.
The other rows were not matched, so we ended up with NA. This is a bit similar to a left join, but also very different.
What kind of join is it? What is the easiest way to accomplish it?
Update
Thank you everyone for the comments and suggestions. I am struggling to choose the right title for my question, and I also failed to come up with a minimal example. It looks like my example came out too abstract, so I am going to make it more concrete here (but still somewhat minimal).
I have a database of employees. There are three columns for each employee and there are plenty of nulls.
I also have a compensation table with the same columns.
For each employee I would like to compute the relevant compensation. If I can match all columns, as in the case of employee 5, then the answer is clear: 8. When I cannot match all columns, as in the case of employee 1, I would like to take the average over the matched values 2 and 8 => 5.
That is it. I agree that this does not look like any particular join. It looks like the solution is to take consecutive left joins over the power set of columns, descending from the largest to the smallest number of columns and stopping at the first match.
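The fallback described in this update can be sketched on the two-column example from the top of the question, since the employee tables themselves are not reproduced here. The helper `tiered_mean` is hypothetical, and it makes one assumption that the question leaves open: when no row matches on both columns, rows matching on either single column are pooled before averaging. Null handling is not covered.

```r
my_df_1 <- data.frame(col_1 = c(1, 2, 3, 4, 5, 15), col_2 = c(4, 5, 6, 8, 9, 17))
my_df_2 <- data.frame(col_1 = c(1, 6, 3, 4, 4), col_2 = c(4, 5, 5, 11, 13),
                      col_3 = c(7, 8, 9, 10, 11))

# Hypothetical helper: average col_3 over the best available tier of matches.
# Tier 1: both columns match. Tier 2 (assumption): either column matches.
tiered_mean <- function(c1, c2, lookup) {
  both   <- lookup$col_1 == c1 & lookup$col_2 == c2
  either <- lookup$col_1 == c1 | lookup$col_2 == c2
  if (any(both))   return(mean(lookup$col_3[both]))
  if (any(either)) return(mean(lookup$col_3[either]))
  NA_real_  # nothing matched at any tier
}

my_df_1$col_3 <- mapply(tiered_mean, my_df_1$col_1, my_df_1$col_2,
                        MoreArgs = list(lookup = my_df_2))
my_df_1
```

A row-wise helper like this is easy to read but will be slow on large tables; a series of joins scales better.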
I don't think there is a name for this type of operation, but you can achieve the desired output using a series of joins and combining them.
library(dplyr)
df1 <- my_df_1 %>%
  inner_join(my_df_2, by = c('col_1', 'col_2'))
df2 <- my_df_1 %>%
  inner_join(my_df_2, by = 'col_1') %>%
  rename(col_2 = col_2.x) %>%
  select(-col_2.y)
df3 <- my_df_1 %>%
  inner_join(my_df_2, by = 'col_2') %>%
  rename(col_1 = col_1.x) %>%
  select(-col_1.y)
bind_rows(df1, df2, df3) %>%
  distinct() %>%
  right_join(my_df_1, by = c('col_1', 'col_2'))
#  col_1 col_2 col_3
#1     1     4     7
#2     3     6     9
#3     4     8    10
#4     4     8    11
#5     2     5     8
#6     2     5     9
#7     5     9    NA
#8    15    17    NA

Merging Two Datasets Using Different Column names: left_Join

I am trying to merge two datasets on columns that have different names but share the same values. For instance, column A in dataset 1 contains xyzw, while in dataset 2 the column is named B but also contains xyzw.
However, the problem is that in dataset 2 the values in column B are firm names, and each one appears several times, depending on how many of that firm's employees exist in the dataset.
Essentially, I want to create a new column, let's call it C in dataset 1 telling me how many employees are in each firm.
I have tried the following:
## Counting how many teachers are in each matched school, using the "Matched" column from matching_file_V4, along with the school_name column from the sample11 dataset:
merged_dataset <- left_join(sample11,matched_datasets,by="school_name")
While this code works, it is not really providing me with the number of employees per firm.
If you could provide sample data and expected output, it would make it easier for others to help. That notwithstanding, I hope this gives you what you want:
Assuming we have these two data frames:
library(dplyr)
df_1 <- data.frame(
  A = letters[1:5],
  B = c('empl_1', 'empl_2', 'empl_3', 'empl_4', 'empl_5')
)
df_2 <- data.frame(
  C = sample(rep(c('empl_1', 'empl_2', 'empl_3', 'empl_4', 'empl_5'), 15), 50),
  D = sample(letters[1:5], 50, replace = TRUE)
)
# I suggest you first find the number of employees for each firm in the second data frame
df_2 %>%
  group_by(C) %>%
  summarise(num_empl = n()) %>%  # then do the left join
  left_join(df_1, ., by = c('B' = 'C'))  # this is how you join on columns with different names
# Output (the counts vary between runs because sample() is called without a seed):
#   A      B num_empl
# 1 a empl_1        8
# 2 b empl_2       11
# 3 c empl_3       10
# 4 d empl_4       10
# 5 e empl_5       11
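The same idea in a deterministic form, closer to the wording of the question: dataset 1 has firm names in column A, dataset 2 repeats firm names in column B once per employee, and the count ends up in a new column C. The data frames `firms` and `employees` and their values are made up for illustration.

```r
library(dplyr)

# Hypothetical dataset 1: one row per firm
firms <- data.frame(A = c("acme", "globex", "initech"))

# Hypothetical dataset 2: one row per employee, firm name repeated in column B
employees <- data.frame(
  B = c("acme", "acme", "globex", "acme", "globex"),
  employee = paste0("empl_", 1:5)
)

# Count the rows per firm, then attach the count to dataset 1
counts <- employees %>% count(B, name = "C")
result <- firms %>% left_join(counts, by = c("A" = "B"))
result
```

Firms with no employees in dataset 2 get NA in column C, because a left join keeps every row of dataset 1.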

Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
  Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order to find where the first 3 elements are unique.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between year and 1 called Weighting, and then when counting these unique combinations I would include each's weighting. This isn't as important because all weightings will be small positive integer values, so I can potentially duplicate the rows earlier to account for weighting, and then tabulate unique pairs.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
#     1     2     3     4     5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
Then you can get the count of each combination using table:
table(combos)
# ABC ACD BCD
#   2   1   2
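The exact strings depend on the sampled data. As a self-contained sketch, the block below types in the table printed in the question by hand and also covers the weighted counts the question asks about. The `Weighting` values are invented for illustration.

```r
# The wide table from the question, entered by hand; Weighting is hypothetical
df <- data.frame(
  Year = 2010:2014,
  Weighting = c(1, 2, 3, 1, 2),
  `1` = c("D", "A", "A", "D", "C"),
  `2` = c("B", "C", "B", "A", "A"),
  `3` = c("A", "D", "D", "C", "B"),
  `4` = c("C", "B", "C", "B", "D"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Sort the first three ranks within each row so that order no longer matters
combos <- apply(df[, c("1", "2", "3")], 1,
                function(x) paste0(sort(x), collapse = ""))

# Unweighted counts of each combination
table(combos)

# Weighted counts: sum the weights within each combination
weighted <- tapply(df$Weighting, combos, sum)
weighted
```

Using tapply() with the weights avoids having to duplicate rows before tabulating.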
This is the same solution as @Alex_A's, but using tidyverse functions:
library(reshape2)  # for dcast()
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank, value.var = "ID")
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
                           ~paste0(sort(c(...)), collapse = "")))

Merge 2 data frames, discard unmatched rows

I have two data frames--one is huge (over 2 million rows) and one is smaller (around 300,000 rows). The smaller data frame is a subset of the larger one. The only difference is that the larger one has an additional attribute that I need to add to the smaller one.
Specifically, the attributes for the large data frame are (Date, Time, Address, Flag) and the attributes for the small data frame are (Date, Time, Address). I need to get the correct corresponding Flag value somehow into the smaller data frame for each row. The final size of the "merged" data frame should be the same as my smaller one, discarding the unused rows from the large data frame.
What is the best way to accomplish this?
Update: I tested the merge function with the following:
new<-merge(data12, data2, by.x = c("Date", "Time", "Address"),
by.y=c("Date", "Time", "Address"))
and
new<-merge(data12, data2, by = c("Date", "Time", "Address"))
both return an empty data frame (new) with the right number of attributes as well as the following warning message:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(15640, 15843, 15843, 15161, : invalid factor level, NAs generated
R> df1 = data.frame(a = 1:5, b = rnorm(5))
R> df1
  a           b
1 1 -0.09852819
2 2 -0.47658118
3 3 -2.14825893
4 4  0.82216912
5 5 -0.36285430
R> df2 = data.frame(a = 1:10000, c = rpois(10000, 6))
R> head(df2)
  a c
1 1 2
2 2 4
3 3 5
4 4 3
5 5 3
6 6 8
R> merge(df1, df2)
  a           b c
1 1 -0.09852819 2
2 2 -0.47658118 4
3 3 -2.14825893 5
4 4  0.82216912 3
5 5 -0.36285430 3
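With the column names from the question, the same approach might look like the sketch below. The data frames `large` and `small` and their values are made up. Keeping the key columns as plain character vectors (via stringsAsFactors = FALSE) is an assumption made here so that keys compare as strings in both data frames.

```r
# Hypothetical large data frame with the extra Flag attribute
large <- data.frame(
  Date    = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"),
  Time    = c("10:00", "11:00", "10:00", "12:00"),
  Address = c("1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd"),
  Flag    = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical small data frame: a subset of the rows, without Flag
small <- large[c(1, 3), c("Date", "Time", "Address")]

# Inner merge on all three keys; unmatched rows of `large` are discarded
merged <- merge(small, large, by = c("Date", "Time", "Address"))
merged
```

The result has as many rows as `small`, provided each key combination appears only once in `large`.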
Perhaps plyr is a more intuitive package for this operation. What you need is a SQL inner join. I believe this approach is clearer than merge().
Here is a simple example of how you would use join() with data sets of your size.
library(plyr)
id = c(1:2000000)
rnormal <- rnorm(id)
rbinom <- rbinom(2000000, 5,0.5)
df1 <- data.frame(id, rnormal, rbinom)
df2 <- data.frame(id = id[1:300000], rnormal = rnormal[1:300000])
You would like to add rbinom to df2
joined.df <- join(df1, df2, type = "inner")
Here is the performance of join() vs merge()
system.time(joined.df <- join(df1, df2, type = "inner"))
Joining by: id, rnormal
   user  system elapsed
  22.44    0.53   22.80
system.time(merged.df <- merge(df1, df2))
   user  system elapsed
 26.212   0.605  30.201

Resources