Merging two datasets using different column names: left_join - r

I am trying to merge two datasets on two columns that have different names but share the same unique values. For instance, the column is named A in dataset 1 and B in dataset 2, but both contain the value xyzw.
However, the problem is that in dataset 2 the values in column B are firm names, and each name appears several times, once for every employee of that firm in the dataset.
Essentially, I want to create a new column in dataset 1, let's call it C, telling me how many employees are in each firm.
I have tried the following:
## Counting how many teachers are in each matched school, using the "Matched" column from matching_file_V4, along with the school_name column from the sample11 dataset:
merged_dataset <- left_join(sample11,matched_datasets,by="school_name")
While this code works, it is not really providing me with the number of employees per firm.

If you could provide sample data and the expected output, it would be easier for others to help. That notwithstanding, I hope this gives you what you want:
Assuming we have these two data frames:
library(dplyr)

df_1 <- data.frame(
  A = letters[1:5],
  B = c('empl_1','empl_2','empl_3','empl_4','empl_5')
)
df_2 <- data.frame(
  C = sample(rep(c('empl_1','empl_2','empl_3','empl_4','empl_5'), 15), 50),
  D = sample(letters[1:5], 50, replace = TRUE)
)
# I suggest you first count the employees for each firm in the second data frame
df_2 %>%
  group_by(C) %>%
  summarise(num_empl = n()) %>%
  # then do the left join; `.` stands for the summarised data frame
  left_join(df_1, ., by = c('B' = 'C'))  # this is how you join on two differently named columns
# (df_2 is sampled at random, so the counts differ between runs)
# A B num_empl
# 1 a empl_1 8
# 2 b empl_2 11
# 3 c empl_3 10
# 4 d empl_4 10
# 5 e empl_5 11
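The same count-then-join pattern can also be sketched in base R with the names from the question. The contents of `sample11` and `matched_datasets` below are invented for illustration, and the sketch assumes the key is called `school_name` in the first data frame and `Matched` in the second, as the comment in the question suggests.

```r
# Hypothetical stand-ins for the question's data frames
sample11 <- data.frame(school_name = c("East High", "North High", "South High"))
matched_datasets <- data.frame(
  Matched = c("North High", "North High", "South High", "North High", "South High"),
  teacher = c("t1", "t2", "t3", "t4", "t5")
)

# Count the rows (teachers) per school name
teacher_counts <- as.data.frame(table(Matched = matched_datasets$Matched),
                                responseName = "num_teachers",
                                stringsAsFactors = FALSE)

# by.x / by.y name the key column in each data frame;
# all.x = TRUE keeps schools without any match (their count is NA)
merged_dataset <- merge(sample11, teacher_counts,
                        by.x = "school_name", by.y = "Matched", all.x = TRUE)
merged_dataset
```

Joining against the pre-counted table rather than the raw one is what keeps the result at one row per school.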

Related

Merge two dataframes with different number of rows

I am having some issues with my data. I have two datasets on football matches that cover the same games and have the same "Match_ID" and "Country_ID", and I would like to merge them. However, I have two problems: 1. I can't seem to find a way of merging the data by more than one column, and 2. one of the datasets has a few more rows than the other. I would like to remove the rows that contain a "Match_ID" which is not in both datasets. Any tips?
Since you didn't provide sample data, I don't know what your data look like, so I'm only taking a stab. Here is some sample data:
# 5 matches
df1 <- data.frame(match_id = 1:5,
                  country_id = LETTERS[1:5],
                  outcome = c(0,1,0,0,1),
                  weather = c("rain", rep("dry", 4)))
# 10 matches (containing the same 5 as in df1)
df2 <- data.frame(match_id = 1:10,
                  country_id = LETTERS[1:10],
                  location = rep(c("home", "away"), 5))
You can simply use merge():
df3 <- merge(df1, df2, by = c("match_id", "country_id"))
# Note that in this case, merge(df1, df2, by = "match_id") will
# result in the same output because of the simplicity of the sample data, but
# the above is how you merge by more than one column
Output:
# match_id country_id outcome weather location
# 1 1 A 0 rain home
# 2 2 B 1 dry away
# 3 3 C 0 dry home
# 4 4 D 0 dry away
# 5 5 E 1 dry home
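The second part of the question (dropping rows whose ID is not in both datasets) is handled by the default behaviour of merge(), which is an inner join. A small sketch on cut-down versions of the sample data, contrasting it with `all = TRUE`:

```r
df1 <- data.frame(match_id = 1:5, country_id = LETTERS[1:5],
                  outcome = c(0, 1, 0, 0, 1))
df2 <- data.frame(match_id = 1:10, country_id = LETTERS[1:10],
                  location = rep(c("home", "away"), 5))

# Default (all = FALSE): keep only key combinations present in both data frames
inner <- merge(df1, df2, by = c("match_id", "country_id"))
# all = TRUE: keep unmatched rows too, filling the missing columns with NA
outer <- merge(df1, df2, by = c("match_id", "country_id"), all = TRUE)

nrow(inner)  # 5
nrow(outer)  # 10
```

Comparing nrow() before and after the merge is a quick way to see how many rows were dropped.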

How can I reorder specific columns based on their date values?

I have a script which produces a .csv output like this:
However, there is a problem which I have highlighted: the date-named columns aren't always in the correct order.
I have tried to sort the columns by name, but this also moves the first three columns (retailer, department, type), which must always stay in the first three positions. This happens because the names are ordered by date first, then by character values.
How can I reorder the columns so that the first three columns remain where they are and also get the dates in the correct order?
UPDATE:
I can order the columns like this, which is the first part of the solution:
sort(names(output))
In this format, I now need to move the final three columns to the beginning (this will always be the same for every data frame that is generated so will be fine).
How can I achieve this?
One option would be to convert the column names to Date class and then order them:
# using a pattern, get the column index
i1 <- grep("^\\d{2}", names(df1))
# sort the extracted column names after converting to 'Date' class
nm1 <- names(df1)[i1][order(as.Date(names(df1)[i1], '%d/%m/%Y'))]
# get the names of the other columns
nm2 <- setdiff(names(df1), names(df1)[i1])
# concatenate the columns
df2 <- df1[c(nm2, nm1)]
df2
# retailer department type 22/03/2015 15/01/2017 25/07/2018 11/01/2019 12/01/2019
#1 1 a completed 4 1 2 4 1
#2 2 b completed 1 1 2 3 4
#3 3 c completed 5 1 2 2 3
data
df1 <- data.frame(retailer = 1:3, department = letters[1:3],
type = 'completed', `11/01/2019` = c(4, 3, 2),
`12/01/2019` = c(1, 4, 3), `15/01/2017` = 1,
`25/07/2018` = 2, `22/03/2015` = c(4, 1, 5), check.names = FALSE)
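If the reordering has to be applied to every generated data frame, the steps above could be wrapped in a small function. This is only a sketch: `order_date_columns` is a hypothetical name, and instead of the regular expression it treats any column name that parses with the given format as a date column.

```r
# Hypothetical helper: non-date columns first (original order),
# then the date-named columns in chronological order
order_date_columns <- function(df, fmt = "%d/%m/%Y") {
  parsed <- as.Date(names(df), format = fmt)  # NA for names that are not dates
  is_date <- !is.na(parsed)
  df[c(names(df)[!is_date], names(df)[is_date][order(parsed[is_date])])]
}

output <- data.frame(retailer = 1:2, department = c("a", "b"), type = "completed",
                     `11/01/2019` = c(4, 3), `15/01/2017` = c(1, 1),
                     `22/03/2015` = c(4, 1), check.names = FALSE)
names(order_date_columns(output))
```

Because the non-date columns are never sorted, retailer, department and type stay in the first three positions without having to be moved back afterwards.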

Calculations across more than two different dataframes in R

I'm trying to transfer some work previously done in Excel into R. All I need to do is transform two basic count_if formulae into readable R script. In Excel, I would use three tables and calculate across those using 'point-and-click' methods, but now I'm lost in how I should address it in R.
My original dataframes are large, so for this question I've posted sample dataframes:
OperatorData <- data.frame(
Operator = c("A","B","C"),
Locations = c(850, 575, 2175)
)
AreaData <- data.frame(
Area = c("Torbay","Torquay","Tooting","Torrington","Taunton","Torpley"),
SumLocations = c(1000,500,500,250,600,750)
)
OperatorAreaData <- data.frame(
Operator = c("A","A","A","B","B","B","C","C","C","C","C"),
Area = c("Torbay","Tooting","Taunton",
"Torbay","Taunton","Torrington",
"Tooting","Torpley","Torquay","Torbay","Torrington"),
Locations = c(250,400,200,
100,400,75,
100,750,500,650,175)
)
What I'm trying to do is add two new columns to the OperatorData dataframe: one with the count of areas the operator operates in, and another with the count of those areas in which the operator owns more than 50% of the locations.
So the resulting dataframe would look like this:
Operator Locations AreaCount Own_GE_50percent
A        850       3         1
B        575       3         1
C        2175      5         4
So far, I've managed to calculate the first column using the table function and then appending:
OpAreaCount <- data.frame(table(OperatorAreaData$Operator))
names(OpAreaCount)[2] <- "AreaCount"
OperatorData$"AreaCount" <- cbind(OpAreaCount$AreaCount)
This is fairly straightforward, but I'm stuck on how to calculate the second column, with its 50% condition.
library(dplyr)
OperatorAreaData %>%
  inner_join(AreaData, by="Area") %>%
  group_by(Operator) %>%
  summarise(AreaCount = n_distinct(Area),
            Own_GE_50percent = sum(Locations > (SumLocations/2)))
# # A tibble: 3 x 3
# Operator AreaCount Own_GE_50percent
# <fct> <int> <int>
# 1 A 3 1
# 2 B 3 1
# 3 C 5 4
You can use AreaCount = n() if you're sure you have unique Area values for each Operator.
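The summary above has one row per operator but no longer carries the Locations column from OperatorData. One way to get the exact table the question asks for is sketched below in base R (a swapped-in technique, not the dplyr approach of the answer); the helper names `oa`, `area_count`, `majority_count` and `key` are made up for the illustration.

```r
OperatorData <- data.frame(Operator = c("A", "B", "C"),
                           Locations = c(850, 575, 2175))
AreaData <- data.frame(
  Area = c("Torbay", "Torquay", "Tooting", "Torrington", "Taunton", "Torpley"),
  SumLocations = c(1000, 500, 500, 250, 600, 750)
)
OperatorAreaData <- data.frame(
  Operator = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Area = c("Torbay", "Tooting", "Taunton",
           "Torbay", "Taunton", "Torrington",
           "Tooting", "Torpley", "Torquay", "Torbay", "Torrington"),
  Locations = c(250, 400, 200, 100, 400, 75, 100, 750, 500, 650, 175)
)

# Attach each area's total, then flag rows where the operator holds more than half
oa <- merge(OperatorAreaData, AreaData, by = "Area")
oa$majority <- oa$Locations > oa$SumLocations / 2

# Per-operator counts, as vectors named by operator
area_count <- tapply(oa$Area, oa$Operator, function(a) length(unique(a)))
majority_count <- tapply(oa$majority, oa$Operator, sum)

# Look the counts up by operator name so row order cannot get mixed up
key <- as.character(OperatorData$Operator)
OperatorData$AreaCount <- as.vector(area_count[key])
OperatorData$Own_GE_50percent <- as.vector(majority_count[key])
OperatorData
```

Looking the counts up by name, rather than binding columns by position as in the question's cbind() attempt, does not rely on both tables listing the operators in the same order.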

Merging columns of dataset when they have diff number of rows

I need to 'merge' two data.frames of unequal size that share the same unique identifier (ID), and I want the result to have the number of rows of the smaller data.frame.
More importantly, I want the values of variable x in data.frame.1 (the larger one) to be summed for each unique ID, so that in data.frame.3 (the merged dataset) each value of x is the sum of the observations with that ID in data.frame.1.
Essentially, I want my merged dataset to have the row dimensions of my smaller dataset (data.frame.2), i.e. the same number of observations, but with the column from the larger df (data.frame.1) merged onto it and its values aggregated (summed) as described above.
I hope the tables below make this clearer: there are three unique IDs in total (a, b, c), but in data.frame.1 they are repeated, and I want the repeated values summed when the merge takes place.
data.frame.1:
ID  x
a   1
a   8
a   10
b   2
b   1
c   4
data.frame.2:
ID  y
a   3
b   7
c   9
data.frame.3:
ID  y  x
a   3  19
b   7  3
c   9  4
data.frame1 <- data.frame(ID = c(rep("a",3), rep("b",2), "c"),
                          x = c(1,8,10,2,1,4))
data.frame2 <- data.frame(ID = c("a", "b", "c"),
                          y = c(3, 7, 9))
# sum x within each ID, collapsing data.frame1 to one row per ID
data.frame1 <- aggregate(x ~ ID, data.frame1, sum)
# join the summed x onto data.frame2 by ID
data.frame3 <- merge(data.frame2, data.frame1, by = "ID")
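One caveat with the merge step: by default merge() drops IDs that appear in only one of the data frames. If the result must always have exactly the rows of the smaller data frame, `all.x = TRUE` keeps them. A sketch, in which the extra ID "d" and the name `sums` are hypothetical additions to the example data:

```r
data.frame1 <- data.frame(ID = c("a", "a", "a", "b", "b", "c"),
                          x = c(1, 8, 10, 2, 1, 4))
# "d" is a made-up ID with no rows in data.frame1
data.frame2 <- data.frame(ID = c("a", "b", "c", "d"),
                          y = c(3, 7, 9, 5))

# One row per ID with the summed x
sums <- aggregate(x ~ ID, data = data.frame1, FUN = sum)

# all.x = TRUE keeps every row of data.frame2; x is NA where no match exists
data.frame3 <- merge(data.frame2, sums, by = "ID", all.x = TRUE)
data.frame3
```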

Append values from column 2 to values from column 1

In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should result in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge, but that simply appends the columns (much like cbind).
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
  switch(class(dfA[[n]]),
         "numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
         "factor" = , "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep = ",")})
}
This solution is basically what you had as your idea, but with a loop. Each column is checked to see whether it is numeric (add the values) or a string or factor (concatenate the strings). You could get a similar result by having two vectors of names, one for the numeric columns and one for the character columns, but this is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption:
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
  switch(class(dfA[[n]]),
         "numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
         "factor" = , "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep = ",")})
  dfNew[[paste0(n,".x")]] <- NULL
  dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as the previous one, but it uses merge to make sure that the data are correctly aligned, and then works on the columns (whose names are suffixed with ".x" and ".y") within dfNew. Additional steps are included to get rid of the separate columns after joining. It also has the bonus of carrying along any other columns not listed in cols.
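One thing to watch with switch(class(...)): a column stored as integer has class "integer", which matches none of the branches, so switch() does nothing and no combined column is created. A variant of the merge-based idea that tests is.numeric() instead is sketched below; `combine_by_type` is a hypothetical helper name, and dfB is deliberately put in a different row order to show that alignment comes from the merge.

```r
dfA <- data.frame(Name = c("John", "James", "Peter"),
                  Score = c(2L, 4L, 0L),   # integer on purpose
                  Response = c("1,0,0,1", "1,1,1,1", "0,0,0,0"))
dfB <- data.frame(Name = c("Peter", "John", "James"),
                  Score = c(4L, 3L, 1L),
                  Response = c("1,1,1,1", "0,1,1,1", "0,1,0,0"))

# Hypothetical helper: add numeric columns, comma-join everything else
combine_by_type <- function(a, b, key, cols) {
  m <- merge(a, b, by = key, suffixes = c(".x", ".y"))
  out <- m[key]
  for (n in cols) {
    x <- m[[paste0(n, ".x")]]
    y <- m[[paste0(n, ".y")]]
    # is.numeric() is TRUE for both integer and double columns
    out[[n]] <- if (is.numeric(x)) x + y else paste(x, y, sep = ",")
  }
  out
}

dfNew <- combine_by_type(dfA, dfB, key = "Name", cols = c("Score", "Response"))
dfNew
```

Unlike the loop above, this version returns only the key and the combined columns, so any other columns would have to be added to `cols` or merged back separately.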
