I have two different data frames:
DF1 = data.frame("A"= c("a","a","b","b","c","c"), "B"= c(1,2,3,4,5,6))
DF2 = data.frame("A"=c("a","b","c"), "C"=c(10,11,12))
I want to add column C to DF1, grouping by column A.
The expected result is:
A B C
1 a 1 10
2 a 2 10
3 b 3 11
4 b 4 11
5 c 5 12
6 c 6 12
Note: in this example all the groups have the same size, but that won't necessarily be the case.
Welcome to Stack Overflow. As @KarthikS commented, what you want is a join.
'Joining' is the name of the operation for connecting two tables together. 'Grouping by' a column is mainly used when summarizing a table: for example, grouping by state and summing the number of votes would give the total number of votes per state (summing without grouping first would give the grand total number of votes).
The syntax for joins in dplyr is:
output = left_join(df1, df2, by = "shared column")
or equivalently
output = df1 %>% left_join(df2, by = "shared column")
Key reference here.
In your example, the shared column is "A".
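Applied to your data, that would look roughly like this (a quick sketch; left_join keeps every row of DF1 and copies the matching C value onto each row of the group):
library(dplyr)
left_join(DF1, DF2, by = "A")
#   A B  C
# 1 a 1 10
# 2 a 2 10
# 3 b 3 11
# 4 b 4 11
# 5 c 5 12
# 6 c 6 12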
We can use merge from base R
merge(DF1, DF2, by = 'A', all.x = TRUE)
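On the example data this should reproduce the expected output (merge sorts the result by the join column by default):
merge(DF1, DF2, by = 'A', all.x = TRUE)
#   A B  C
# 1 a 1 10
# 2 a 2 10
# 3 b 3 11
# 4 b 4 11
# 5 c 5 12
# 6 c 6 12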
I am trying to merge two datasets on two columns that have different names but share the same unique values. For instance, in dataset 1 the column is named A and contains the value xyzw, while in dataset 2 the column is named B but contains the same value xyzw.
However, in dataset 2 the value xyzw in column B is a firm name and appears several times, once for every employee of that firm present in the dataset.
Essentially, I want to create a new column in dataset 1, let's call it C, telling me how many employees are in each firm.
I have tried the following:
## Counting how many teachers are in each matched school, using the "Matched" column from matching_file_V4, along with the school_name column from the sample11 dataset:
merged_dataset <- left_join(sample11, matched_datasets, by = "school_name")
While this code works, it is not really providing me with the number of employees per firm.
If you could provide sample data and the expected output, it would make it easier for others to help. That notwithstanding, I hope this gives you what you want:
Assuming we have these two data frames:
df_1 <- data.frame(
  A = letters[1:5],
  B = c('empl_1', 'empl_2', 'empl_3', 'empl_4', 'empl_5')
)
df_2 <- data.frame(
  C = sample(rep(c('empl_1', 'empl_2', 'empl_3', 'empl_4', 'empl_5'), 15), 50),
  D = sample(letters[1:5], 50, replace = TRUE)
)

library(dplyr)

# First count the number of employees for each firm in the second data frame,
# then do the left join onto the first data frame
df_2 %>%
  group_by(C) %>%
  summarise(num_empl = n()) %>%
  left_join(df_1, ., by = c('B' = 'C'))  # this is how you join on two differently named columns
# A B num_empl
# 1 a empl_1 8
# 2 b empl_2 11
# 3 c empl_3 10
# 4 d empl_4 10
# 5 e empl_5 11
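As a side note, the counting step can be written more compactly with dplyr::count, which is shorthand for group_by() plus summarise(n = n()) (a sketch on the same data; the count column is simply named n here):
df_2 %>%
  count(C) %>%
  left_join(df_1, ., by = c('B' = 'C'))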
I just started working with R for my master's thesis, and up to now all my calculations worked out as I read a lot of questions and answers here (it's a lot of trial and error, but that's ok).
Now I need to write some more sophisticated code and I can't find a way to do it.
This is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140,000 entries) where the 'ID' variable is the same.
This is the third day I've been trying to solve this; I already found and tested for and apply, but both run for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i, 4]
  SUM_Entries <- sum(Sub03$Source == Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source))  # the Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i, 13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i, 13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is of class character; we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
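The same pattern should carry over to your Sub02/Sub03 data. A rough sketch, assuming (from your example layout) that the ID column of Sub02 is called ID, that it matches Sub03$Source exactly, and that the column to update is VAL9:
library(dplyr)

# summarise Sub03 once: per Source, the number of entries plus the sums of VAL1 and VAL2
sub03_sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(add_on = n() + sum(as.numeric(VAL1)) + sum(as.numeric(VAL2)))

# join the summary onto Sub02 and update VAL9 in one vectorised step;
# coalesce() makes rows without a match in Sub03 add 0 instead of NA
Sub02 <- Sub02 %>%
  left_join(sub03_sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = as.numeric(VAL9) + coalesce(add_on, 0)) %>%
  select(-add_on)
This avoids the 50,000-iteration loop by computing all the group sums once and joining them in a single pass.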
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g. df$new <- df$old1 + df$old2
In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 holds a unique identifier and is the same in each data frame; columns 2 and 3 hold different information. I'm trying to merge these two data frames into one new data frame that has columns 1, 2 and 3, in which the values of columns 2 and 3 are combined: i.e. column 2 of the new data frame contains [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
This should result in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge, but that simply appends the columns (much like cbind).
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", "A", "B")
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
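As a side note, newer versions of splitstackshape deprecate concat.split.multiple in favour of cSplit, which covers the same use case (a sketch on the same df as above; note that cSplit returns a data.table):
library(splitstackshape)
df2 <- cSplit(df, splitCols = c("A", "B"), sep = ",", direction = "wide")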
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[, "Name", drop = FALSE]
for (n in cols) {
  switch(class(dfA[[n]]),
         "numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
         "factor" = , "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep = ",")})
}
This solution is basically your idea, but with a loop. Each column's class is checked: numeric columns are added, while character or factor columns are concatenated as strings. You could get a similar result by having two vectors of names, one for the numeric columns and one for the character columns, but the switch approach is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption.
dfNew <- merge(dfA, dfB, by = "Name")
for (n in cols) {
  switch(class(dfA[[n]]),
         "numeric" = {dfNew[[n]] <- dfNew[[paste0(n, ".x")]] + dfNew[[paste0(n, ".y")]]},
         "factor" = , "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n, ".x")]], dfNew[[paste0(n, ".y")]], sep = ",")})
  dfNew[[paste0(n, ".x")]] <- NULL
  dfNew[[paste0(n, ".y")]] <- NULL
}
Same general idea as before, but this version uses merge to make sure the data are correctly aligned, and then works on the columns of dfNew (whose names are suffixed with ".x" and ".y"). Additional steps remove those separate columns after joining. It also has the bonus of carrying along any other columns of dfA and dfB not listed in cols.
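With the example data, the merge-based version should end up with the following (merge sorts the rows by Name; the loop version keeps dfA's original row order instead):
dfNew
#    Name Score        Response
# 1 James     5 1,1,1,1,0,1,0,0
# 2  John     5 1,0,0,1,0,1,1,1
# 3 Peter     4 0,0,0,0,1,1,1,1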
I have two data frames--one is huge (over 2 million rows) and one is smaller (around 300,000 rows). The smaller data frame is a subset of the larger one. The only difference is that the larger one has an additional attribute that I need to add to the smaller one.
Specifically, the attributes for the large data frame are (Date, Time, Address, Flag) and the attributes for the small data frame are (Date, Time, Address). I need to get the correct corresponding Flag value somehow into the smaller data frame for each row. The final size of the "merged" data frame should be the same as my smaller one, discarding the unused rows from the large data frame.
What is the best way to accomplish this?
Update: I tested the merge function with the following:
new<-merge(data12, data2, by.x = c("Date", "Time", "Address"),
by.y=c("Date", "Time", "Address"))
and
new<-merge(data12, data2, by = c("Date", "Time", "Address"))
both return an empty data frame (new) with the right number of attributes as well as the following warning message:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(15640, 15843, 15843, 15161, :
  invalid factor level, NAs generated
R> df1 = data.frame(a = 1:5, b = rnorm(5))
R> df1
a b
1 1 -0.09852819
2 2 -0.47658118
3 3 -2.14825893
4 4 0.82216912
5 5 -0.36285430
R> df2 = data.frame(a = 1:10000, c = rpois(10000, 6))
R> head(df2)
a c
1 1 2
2 2 4
3 3 5
4 4 3
5 5 3
6 6 8
R> merge(df1, df2)
a b c
1 1 -0.09852819 2
2 2 -0.47658118 4
3 3 -2.14825893 5
4 4 0.82216912 3
5 5 -0.36285430 3
Perhaps plyr is a more intuitive package for this operation. What you need is a SQL inner join. I believe this approach is clearer than merge().
Here is a simple example of how you would use join() with data sets of your size.
library(plyr)
id = c(1:2000000)
rnormal <- rnorm(id)
rbinom <- rbinom(2000000, 5,0.5)
df1 <- data.frame(id, rnormal, rbinom)
df2 <- data.frame(id = id[1:300000], rnormal = rnormal[1:300000])
You would like to add rbinom to df2
joined.df <- join(df1, df2, type = "inner")
Here is the performance of join() vs merge()
system.time(joined.df <- join(df1, df2, type = "inner"))
Joining by: id, rnormal
user system elapsed
22.44 0.53 22.80
system.time(merged.df <- merge(df1, df2))
user system elapsed
26.212 0.605 30.201
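For completeness, a dplyr equivalent on the same simulated data (a sketch; left_join keeps every row of the smaller frame and pulls in the extra column):
library(dplyr)
joined_dplyr <- left_join(df2, df1[, c("id", "rbinom")], by = "id")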