Here is my data:
my_df_1 <- data.frame(col_1 = c(1,2,3,4,5,15), col_2 = c(4,5,6,8,9,17))
my_df_2 <- data.frame(col_1 = c(1,6,3,4,4), col_2 = c(4,5,5,11,13), col_3 = c(7,8,9,10,11))
my_df_1
my_df_2
I would like to join my_df_1 and my_df_2 on col_1 and col_2 and get my_df_3
my_df_3 <-data.frame(col_1 = c(1,2,2,3,4,4,5,15), col_2 = c(4,5,5,6,8,8,9,17),
col_3 = c(7,8,9,9,10,11,NA, NA))
my_df_3
Here is the logic of the join.
We start with row one of my_df_1: if I can match the values in both columns with my_df_2, then I simply pull the value of col_3 from my_df_2. For example, the first row is matched completely, so we simply get the value col_3 = 7.
In the second row of my_df_1 we could only match the value in the second column (5), so we got col_3 = 8. The 5 in the second column was also found in row 3 of my_df_2, so we also pulled col_3 = 9 from the third row.
In the third row of my_df_1 we could only match the value in the first column, so we pulled the value 9.
Similarly, in the 4th row we matched the 4 in two rows of my_df_2, so we pulled 10 and 11.
The other rows were not matched, so we ended up with NA. This is a bit similar to a left join, but also very different.
What kind of join is it? What is the easiest way to accomplish it?
Update
Thank you everyone for the comments and suggestions. I am struggling to choose the right title for my question, and I also failed to come up with a minimal example; it looks like my example came out too abstract. So I am going to make my example more concrete here (but still somewhat minimal).
I have a database of employees. There are three columns for each employee, and there are plenty of nulls.
I also have a compensation table with the same columns.
For each employee I would like to compute the relevant compensation. If I can match all columns, as in the case of employee 5, then the answer is clear: 8. When I cannot match all columns, as in the case of employee 1, I would like to take the average over the matched values 2 and 8 => 5.
That is it. I agree that this does not look like any particular named join; it looks like the solution is to take consecutive left joins over the power set of columns, descending from the largest to the smallest number of columns and stopping at the first match.
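That "descend through the key sets and stop at the first match" idea can be sketched per row in base R, using the toy tables from the top of the question and the averaging rule from this update. The order in which the single-column fallbacks are tried (col_1 before col_2) is an assumption; swap the list entries if the other priority is wanted.

```r
my_df_1 <- data.frame(col_1 = c(1, 2, 3, 4, 5, 15), col_2 = c(4, 5, 6, 8, 9, 17))
my_df_2 <- data.frame(col_1 = c(1, 6, 3, 4, 4), col_2 = c(4, 5, 5, 11, 13), col_3 = c(7, 8, 9, 10, 11))

# Key sets in decreasing order of specificity; the relative order of the
# two single-column fallbacks is an assumption.
key_sets <- list(c("col_1", "col_2"), "col_1", "col_2")

comp <- sapply(seq_len(nrow(my_df_1)), function(i) {
  for (keys in key_sets) {
    # merge() joins on the shared column names, which here are exactly `keys`
    hits <- merge(my_df_1[i, keys, drop = FALSE], my_df_2)
    if (nrow(hits) > 0) return(mean(hits$col_3))
  }
  NA_real_  # no key set matched
})

comp
# [1]  7.0  8.5  9.0 10.5   NA   NA
```

Row 2 averages the two col_2 = 5 matches (8 and 9 => 8.5), mirroring the employee-1 example above.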
I don't think there is a name for this type of operation, but you can achieve the desired output using a series of joins and combining them.
library(dplyr)
df1 <- my_df_1 %>%
  inner_join(my_df_2, by = c('col_1', 'col_2'))

df2 <- my_df_1 %>%
  inner_join(my_df_2, by = 'col_1') %>%
  rename(col_2 = col_2.x) %>%
  select(-col_2.y)

df3 <- my_df_1 %>%
  inner_join(my_df_2, by = 'col_2') %>%
  rename(col_1 = col_1.x) %>%
  select(-col_1.y)

bind_rows(df1, df2, df3) %>%
  distinct() %>%
  right_join(my_df_1, by = c('col_1', 'col_2'))
# col_1 col_2 col_3
#1 1 4 7
#2 3 6 9
#3 4 8 10
#4 4 8 11
#5 2 5 8
#6 2 5 9
#7 5 9 NA
#8 15 17 NA
Related
I have two different dataframes
DF1 = data.frame("A"= c("a","a","b","b","c","c"), "B"= c(1,2,3,4,5,6))
DF2 = data.frame("A"=c("a","b","c"), "C"=c(10,11,12))
I want to add the column C to DF1 grouping by column A
The expected result is
A B C
1 a 1 10
2 a 2 10
3 b 3 11
4 b 4 11
5 c 5 12
6 c 6 12
note: in this example all the groups have the same size, but that won't necessarily be the case
Welcome to Stack Overflow. As @KarthikS commented, what you want is a join.
'Joining' is the name of the operation for connecting two tables. 'Grouping by' a column is mainly used when summarizing a table: for example, grouping by state and summing the number of votes would give the total number of votes for each state (summing without grouping first would give the grand total).
The syntax for joins in dplyr is:
output = left_join(df1, df2, by = "shared column")
or equivalently
output = df1 %>% left_join(df2, by = "shared column")
In your example, the shared column is "A".
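With your example data, the whole operation is one call (a minimal sketch):

```r
library(dplyr)

DF1 <- data.frame(A = c("a", "a", "b", "b", "c", "c"), B = c(1, 2, 3, 4, 5, 6))
DF2 <- data.frame(A = c("a", "b", "c"), C = c(10, 11, 12))

# every row of DF1 is kept; C is repeated within each value of A
out <- left_join(DF1, DF2, by = "A")
out
#   A B  C
# 1 a 1 10
# 2 a 2 10
# 3 b 3 11
# 4 b 4 11
# 5 c 5 12
# 6 c 6 12
```

This also works when the groups have different sizes, because the join repeats C for however many rows of DF1 match each value of A.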
We can use merge from base R
merge(DF1, DF2, by = 'A', all.x = TRUE)
I am trying to merge two datasets on two separately named columns that share the same unique values. For instance, the column is named A in dataset 1 with value xyzw, while in dataset 2 the column's name is B but the value is also xyzw.
However, the problem is that in dataset 2 the value xyzw in column B refers to a firm name and appears several times, once for each employee of that firm present in the dataset.
Essentially, I want to create a new column in dataset 1, let's call it C, telling me how many employees are in each firm.
I have tried the following:
## Counting how many teachers are in each matched school, using the "Matched" column from matching_file_V4, along with the school_name column from the sample11 dataset:
merged_dataset <- left_join(sample11,matched_datasets,by="school_name")
While this code works, it is not really giving me the number of employees per firm.
If you could provide sample data and the expected output, it would be easier for others to help. Notwithstanding that, I hope this gives you what you want:
Assuming we have these two data frames:
df_1 <- data.frame(
A = letters[1:5],
B = c('empl_1','empl_2','empl_3','empl_4','empl_5')
)
df_2 <- data.frame(
C = sample(rep(c('empl_1','empl_2','empl_3','empl_4','empl_5'), 15), 50),
D = sample(letters[1:5], 50, replace=T)
)
# I suggest you find the number of employees for each firm in the second data frame
df_2 %>%
  group_by(C) %>%
  summarise(num_empl = n()) %>%
  ### Then do the left join
  left_join(df_1, ., by = c('B' = 'C'))  ## this is how you can join on two different column names
# A B num_empl
# 1 a empl_1 8
# 2 b empl_2 11
# 3 c empl_3 10
# 4 d empl_4 10
# 5 e empl_5 11
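dplyr's count() is shorthand for group_by() + summarise(n = n()), so the count-then-join above can be phrased a little more compactly. A sketch with small fixed data (rather than the random sample above, whose counts vary from run to run):

```r
library(dplyr)

df_1 <- data.frame(A = letters[1:3],
                   B = c('empl_1', 'empl_2', 'empl_3'))
df_2 <- data.frame(C = rep(c('empl_1', 'empl_2'), times = c(2, 3)))

# count(df_2, C) returns one row per firm with a count column `n`
out <- left_join(df_1, count(df_2, C), by = c('B' = 'C'))
out
#   A      B  n
# 1 a empl_1  2
# 2 b empl_2  3
# 3 c empl_3 NA
```

Firms absent from df_2 get NA rather than 0; wrap the result in a replace if you need zeros.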
I have a dataframe returned from a function that looks like this:
df <- data.frame(data = c(1,2,3,4,5,6,7,8))
rownames(df) <- c('firsta','firstb','firstc','firstd','seconda','secondb','secondc','secondd')
firsta 1
seconda 5
firstb 2
secondb 6
my goal is to turn it into this:
df_goal <- data.frame(first = c(1,2,3,4), second = c(5,6,7,8))
rownames(df_goal) <- c('a','b','c','d')
first second
a 1 5
b 2 6
Basically the problem is that there is information in the row names that I can't discard because there isn't otherwise a way to distinguish between the column values.
This is a simple long-to-wide conversion; the twist is that we need to generate the key variable from the rownames by splitting the string appropriately.
In the data you present, the rowname is the concatenation of a "position" (e.g. 'first', 'second') and an id (e.g. 'a', 'b') stuck on the end. This structure makes splitting complicated: ideally, you'd use a separator (e.g. first_a, first_b) to make the split unambiguous. Without a separator, our only option is to split by position, but that requires the split point to be a fixed distance from the start or end of the string.
In your example, the id is always the last single character, so we can pass -1 to the sep argument of separate to split off the last character as the id column. If that weren't always true, you would need to come up with a more complex way to resolve the rownames.
Once you have converted the rownames into a "position" and "id" column, it's a simple matter to use spread to spread the position column into the wide format:
library(tidyverse)
df %>%
rownames_to_column('row') %>%
separate(row, into = c('num', 'id'), sep = -1) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
If row ids could be of variable length, the above solution wouldn't work. If you have a known and limited number of "position" values, you could use a regex solution to split the rowname:
Here, we extract the position value by matching to a regex containing all possible values (| is the OR operator).
We match the "id" value by putting that same regex in a positive lookahead operator. This regex will match 1 or more lowercase letters that come immediately after a match to the position value. The downside of this approach is that you need to specify all possible values of "position" in the regex -- if there are many options, this could quickly become too long and difficult to maintain:
df2
data
firsta 1
firstb 2
firstc 3
firstd 4
seconda 5
secondb 6
secondc 7
secondd 8
secondee 9
df2 %>%
rownames_to_column('row') %>%
mutate(num = str_extract(row, 'first|second'),
id = str_match(row, '(?<=first|second)[a-z]+')) %>%
select(-row) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
5 ee NA 9
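In current tidyr, spread is superseded by pivot_wider; the first solution above can be written equivalently like this (a sketch, assuming tidyr >= 1.0):

```r
library(tidyverse)

df <- data.frame(data = c(1, 2, 3, 4, 5, 6, 7, 8))
rownames(df) <- c('firsta', 'firstb', 'firstc', 'firstd',
                  'seconda', 'secondb', 'secondc', 'secondd')

out <- df %>%
  rownames_to_column('row') %>%
  separate(row, into = c('num', 'id'), sep = -1) %>%
  # names_from/values_from replace spread's key/value arguments
  pivot_wider(names_from = num, values_from = data)
out
# a tibble with columns id, first, second
```

The result is the same wide table, returned as a tibble rather than a data frame.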
I have a script which produces a .csv output like this:
However, there is a problem which I have highlighted: the date-named columns aren't always in the correct order.
I have tried to sort the columns by name, but this affects the first three columns (retailer, department, type), which always have to stay in those first three positions. This happens because the names are sorted as plain character values, so the dates don't end up in date order either.
How can I reorder the columns so that the first three columns remain where they are and also get the dates in the correct order?
UPDATE:
I can order the columns like this, which is the first part of the solution:
output <- output[, sort(names(output))]
In this format, I now need to move the final three columns to the beginning (these three columns will be the same for every data frame that is generated, so this is fine).
How can I achieve this?
One option would be to convert to Date class and then order it
# using a pattern, get the column index
i1 <- grep("^\\d{2}", names(df1))
# sort the extracted the column names after converting to 'Date' class
nm1 <- names(df1)[i1][order(as.Date(names(df1)[i1], '%d/%m/%Y'))]
# get the names of the other columns
nm2 <- setdiff(names(df1), names(df1)[i1])
# concatenate the columns
df2 <- df1[c(nm2, nm1)]
df2
# retailer department type 22/03/2015 15/01/2017 25/07/2018 11/01/2019 12/01/2019
#1 1 a completed 4 1 2 4 1
#2 2 b completed 1 1 2 3 4
#3 3 c completed 5 1 2 2 3
data
df1 <- data.frame(retailer = 1:3, department = letters[1:3],
type = 'completed', `11/01/2019` = c(4, 3, 2),
`12/01/2019` = c(1, 4, 3), `15/01/2017` = 1,
`25/07/2018` = 2, `22/03/2015` = c(4, 1, 5), check.names = FALSE)
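A dplyr phrasing of the same idea (a sketch, assuming dplyr >= 1.0 for all_of()): name the three fixed columns first, then select the date columns sorted as real dates.

```r
library(dplyr)

df1 <- data.frame(retailer = 1:3, department = letters[1:3],
                  type = 'completed', `11/01/2019` = c(4, 3, 2),
                  `12/01/2019` = c(1, 4, 3), `15/01/2017` = 1,
                  `25/07/2018` = 2, `22/03/2015` = c(4, 1, 5),
                  check.names = FALSE)

# columns whose names start with two digits followed by a slash are dates
date_cols <- grep("^\\d{2}/", names(df1), value = TRUE)

df_sorted <- df1 %>%
  select(retailer, department, type,
         all_of(date_cols[order(as.Date(date_cols, '%d/%m/%Y'))]))
names(df_sorted)
```

This keeps the fixed columns in place and orders the rest chronologically, oldest first.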
Now I have a df that looks like the one below:
v1 v2 v3
1 2 3
4 5 6
What should I do with the rownames such that, if v2 of a row where rownames(df) %% 2 == 0 does not equal v2 of the row where rownames(df) %% 2 == 1, both rows get deleted?
Thank you all.
Update:
For the df below, you can see that rows 1 and 2 have the same ID, so I want to keep these two rows as a pair (CODE shows 1 and 4).
Similarly, I want to keep rows 10 and 11 because they have the same ID and form a pair.
What should I do to get a new df?
1) Create a dataframe with a column for the number of times each id occurs:
library(sqldf)
df2 <- sqldf("select count(id) as count, id from df group by id")
2) Merge it back onto the original df:
df3 <- merge(df, df2)
3) Keep only the rows where count > 1:
df3[df3$count > 1, ]
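The same filter can be done without sqldf using dplyr; a sketch with a hypothetical two-column df standing in for the data in the question:

```r
library(dplyr)

df <- data.frame(id = c(263733, 263733, 2913733, 3243733, 3583733, 3583733),
                 code = c(1, 4, 2, 3, 5, 6))

# keep only the ids that occur more than once
kept <- df %>% group_by(id) %>% filter(n() > 1) %>% ungroup()
kept
```

filter(n() > 1) evaluates n() per group, so it drops every id that appears only once in a single step.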
If what you are looking for is to keep paired IDs and delete the rest (I doubt it is as simple as this), then:
Extract your ids. I have written them out here; you should extract them from your df.
id = c(263733,263733,2913733,3243733,3723733,4493733,273733,393733,2953733,3583733,3583733)
Sort them, then find out which ones to keep:
id <- sort(id)
id1 <- cbind(id[1:(length(id) - 1)], id[2:length(id)])
chosenID <- id1[which(id1[, 1] == id1[, 2]), 1]
Then extract from your df the rows whose id is in chosenID.
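The sort-and-shift steps above can also be collapsed with base R's duplicated(), which needs no sorting or pairwise indexing; using the ids written out above:

```r
id <- c(263733, 263733, 2913733, 3243733, 3723733, 4493733,
        273733, 393733, 2953733, 3583733, 3583733)

# an id belongs to a pair if it is duplicated in either direction
paired <- duplicated(id) | duplicated(id, fromLast = TRUE)
chosenID <- unique(id[paired])
chosenID
# [1]  263733 3583733
```

Then df[df$id %in% chosenID, ] keeps the paired rows.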