I need to merge two data.frames of unequal size that share the same unique identifier (ID), and I want the merged result to keep the number of rows of the smaller data.frame.
More importantly, I want the values of variable x in data.frame.1 (the larger one) to be summed for each unique ID, so that in data.frame.3 (the merged dataset) each observation of x is the sum of all observations in data.frame.1 that share that ID.
Essentially, I want my merged dataset to have the row dimensions of my smaller dataset (data.frame.2), i.e. the same number of observations, but with the x column from the larger data.frame (data.frame.1) merged onto the smaller one (data.frame.2) and its values aggregated (summed) as described above.
I hope this is clear; the tables below should make it clearer: there are three unique IDs (a, b, c), but in data.frame.1 they are repeated, and I want the repeated values summed when the merge takes place.
data.frame.1:
ID  x
a   1
a   8
a  10
b   2
b   1
c   4

data.frame.2:
ID  y
a   3
b   7
c   9

data.frame.3 (desired merged result):
ID  y   x
a   3  19
b   7   3
c   9   4
data.frame1 <- data.frame(ID = c(rep("a", 3), rep("b", 2), "c"),
                          x = c(1, 8, 10, 2, 1, 4))
data.frame2 <- data.frame(ID = c("a", "b", "c"),
                          y = c(3, 7, 9))

# Sum x within each unique ID first, then merge onto the smaller data frame
data.frame1 <- aggregate(x ~ ID, data.frame1, sum)
data.frame3 <- merge(data.frame2, data.frame1, by = "ID")
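For comparison, here is a dplyr sketch of the same idea (not part of the original answer; it assumes the data.frame1 and data.frame2 as first created above, before the aggregate step):
library(dplyr)
data.frame3 <- data.frame1 %>%
  group_by(ID) %>%                        # sum x within each unique ID
  summarise(x = sum(x)) %>%
  right_join(data.frame2, by = "ID") %>%  # keep only the rows of the smaller data frame
  select(ID, y, x)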
I have two dataframes. The first one has information on individual id, period and the city of the workplace. The second contains information on individual id and the cities of the study degrees achieved throughout their lives. One individual can work at different places in the same period and may have multiple degrees. I wish to add a column to the first dataframe indicating whether the individual has a degree from the same city she is working in at the given period.
Consider the very simple example below. Dataframe mydf1 shows that (i) individual A works in cities x and y in both periods 1 and 2, (ii) individual B works in city w in periods 1 and 2 and in city k in period 1, and (iii) individual C works in city k in period 1. Dataframe mydf2 shows that (i) individual A has studied in cities x and w, (ii) individual B has studied in cities x and k, and (iii) individual C has studied in cities y and k.
mydf1 <- data.frame(id = c('A','A','A','A','B','B','B','C'),
                    period = c(1,1,2,2,1,1,2,1),
                    work_city = c('x','y','x','y','w','k','w','k'))
mydf2 <- data.frame(id = c('A','A','B','B','C','C'),
                    study_city = c('x','w','x','k','y','k'))
My output should be as below, where the indicator variable same_city is equal to 1 if the value of work_city for the respective row coincides with any of the values of variable study_city in dataset mydf2 for that particular individual. For instance: for individual A, variable same_city should be 1 if work_city is equal to 'x' or 'w', or 0 otherwise.
mydf_final <- data.frame(id = c('A','A','A','A','B','B','B','C'),
                         period = c(1,1,2,2,1,1,2,1),
                         work_city = c('x','y','x','y','w','k','w','k'),
                         same_city = c('1','0','1','0','0','1','0','1'))
Possible solution: aggregate mydf2 by id, putting all study cities in a list. After joining mydf1 and the aggregated mydf2_aggr, we check whether the work_city of each row appears in that row's study_cities list:
mydf1 <- data.frame(id = c('A','A','A','A','B','B','B','C'),
                    period = c(1,1,2,2,1,1,2,1),
                    work_city = c('x','y','x','y','w','k','w','k'))
mydf2 <- data.frame(id = c('A','A','B','B','C','C'),
                    study_city = c('x','w','x','k','y','k'))
Aggregate mydf2 by id and put all values of study_city in a list column. Now there is only one row per unique id.
library(dplyr)

mydf2_aggr <- mydf2 %>%
  group_by(id) %>%
  summarise(study_cities = list(study_city))
Join mydf1 and mydf2_aggr on id and use rowwise() so that we can apply a simple ifelse to each row's study_cities list. There might be solutions that avoid rowwise... The column study_cities_as_string is only added to illustrate the answer!
mydf_final <- mydf1 %>%
  left_join(mydf2_aggr, by = "id") %>%
  rowwise() %>%
  mutate(study_cities_as_string = paste(study_cities, collapse = ","),
         same_city = ifelse(work_city %in% study_cities, 1, 0)) %>%
  select(-study_cities)
mydf_final is now:
id period work_city study_cities_as_string same_city
<chr> <dbl> <chr> <chr> <dbl>
1 A 1 x x,w 1
2 A 1 y x,w 0
3 A 2 x x,w 1
4 A 2 y x,w 0
5 B 1 w x,k 0
6 B 1 k x,k 1
7 B 2 w x,k 0
8 C 1 k y,k 1
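As noted above, a solution without rowwise() likely exists. One sketch (my own suggestion, not part of the original answer, assuming the same mydf1 and mydf2 as in the question) is to flag the matching id/city pairs in mydf2 and join on both columns:
library(dplyr)
mydf_final2 <- mydf1 %>%
  left_join(mydf2 %>% distinct() %>% mutate(same_city = 1),  # one row per (id, study_city) pair
            by = c("id" = "id", "work_city" = "study_city")) %>%
  mutate(same_city = coalesce(same_city, 0))                 # unmatched rows get 0
This avoids rowwise() entirely; note that same_city comes out numeric rather than character here.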
I am trying to merge two datasets on two differently named columns that share the same unique values. For instance, column A in dataset 1 contains the value xyzw, while in dataset 2 the column is named B but also contains the value xyzw.
However, the problem is that in dataset 2 the value xyzw in column B refers to a firm name and appears several times, once for each employee of that firm present in the dataset.
Essentially, I want to create a new column in dataset 1, let's call it C, telling me how many employees are in each firm.
I have tried the following:
## Counting how many teachers are in each matched school, using the "Matched" column from matching_file_V4, along with the school_name column from the sample11 dataset:
merged_dataset <- left_join(sample11,matched_datasets,by="school_name")
While this code works, it is not really providing me with the number of employees per firm.
If you could provide sample data and the expected output, it would make it easier for others to help. That notwithstanding, I hope this gives you what you want:
Assuming we have these two data frames:
df_1 <- data.frame(
  A = letters[1:5],
  B = c('empl_1','empl_2','empl_3','empl_4','empl_5')
)
df_2 <- data.frame(
  C = sample(rep(c('empl_1','empl_2','empl_3','empl_4','empl_5'), 15), 50),
  D = sample(letters[1:5], 50, replace = TRUE)
)
# I suggest you find the number of employees for each firm in the second data frame,
# then do the left join
df_2 %>%
  group_by(C) %>%
  summarise(num_empl = n()) %>%
  left_join(df_1, ., by = c('B' = 'C'))  # this is how you join on two different column names
# A B num_empl
# 1 a empl_1 8
# 2 b empl_2 11
# 3 c empl_3 10
# 4 d empl_4 10
# 5 e empl_5 11
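A more compact sketch of the same idea (not in the original answer), using dplyr::count to tally the rows per value before the join:
library(dplyr)
df_1 %>%
  left_join(count(df_2, C, name = "num_empl"), by = c("B" = "C"))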
Imagine that I have a list
l <- list("a" = 1, "b" = 2)
and a data frame
id value
a 3
b 4
I want to match id with the list names and apply a function to the list element and the value in the data frame. For example, if I want the sum of the value in the data frame and the corresponding value in the list, I should get
id value
a 4
b 6
Does anyone have a clue?
Edit:
I just want to expand the question a little bit. Now I have more than one value in every element of the list.
l <- list("a" = c(1, 2), "b" =c(1, 2))
I still want the sum
id value
a 6
b 7
We can match the names of the list with the id of the dataframe, unlist the list accordingly and add it to value:
df$value <- unlist(l[match(df$id, names(l))]) + df$value
df
# id value
#1 a 4
#2 b 6
EDIT
If we have multiple entries in each list element, we need to sum each element after matching. We can do:
df$value <- df$value + sapply(l[match(df$id, names(l))], sum)
df
# id value
#1 a 6
#2 b 7
You just need
df$value = df$value + unlist(l)[df$id]  # a named vector can be indexed directly by name
df
id value
1 a 4
2 b 6
A caveat that also applies to Ronak's answer: indexing by name still works when the names of l are in a different order, but if id in df is a factor you need as.character:
l <- list("b" = 2, "a" = 1)
unlist(l)[as.character(df$id)]  # as.character is needed if id in df is a factor
a b
1 2
Update
df$value = df$value + unlist(lapply(l, sum))[df$id]
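A merge-based sketch of the same idea (my own addition, assuming the df and l defined above) that sidesteps the factor issue by joining on id:
sums <- data.frame(id = names(l), lsum = sapply(l, sum))  # per-id sums of the list elements
df2 <- merge(df, sums, by = "id")                         # join on id, works for factor or character
df2$value <- df2$value + df2$lsum
df2$lsum <- NULL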
Let's say we have two data.frames:
x <- data.frame(date=c(1,2,3,1,3), id=c("a", "a", "a", "b", "b"), sum=50:54)
y <- data.frame(date=c(1,2,1,3), id=c("a", "a", "b", "b"))
x
date id sum
1 1 a 50
2 2 a 51
3 3 a 52
4 1 b 53
5 3 b 54
y
date id
1 1 a
2 2 a
3 1 b
4 3 b
Now, I want to find the rows in x whose dates are not in y, within the same id. In x we have dates 1, 2 and 3 for id a, while in y we only have 1 and 2 for id a.
How do I identify (and, preferably, remove from x) row number 3 of x?
EDIT: I found a (very ugly and slow) solution, but there has to be a better and faster one. Currently I'm running it on two large data.frames, and the first run took more than an hour. I need to run it multiple times, so any help would be appreciated.
z <- data.frame()
for (f in 1:length(unique(x$id))) {          # run the iteration for all the unique ids in x
  id <- unique(x$id)[f]                      # find the id for this iteration
  a <- x[x$id == id, ]                       # subset x
  b <- y[y$id == id, ]                       # subset y
  x.new <- a[a$date %in% unique(b$date), ]   # keep the rows of x whose date also occurs in y
  z <- rbind(z, x.new)                       # bind the results together
}
It seems you want an inner join. You are conceptualizing the problem as "find rows in X that are not in Y, then remove them from X"; this is more commonly stated as "keep only rows in X that are also in Y."
There are many ways to do this; it is the default behaviour of base::merge:
merge(x, y, all = FALSE)
# date id sum
# 1 1 a 50
# 2 1 b 53
# 3 2 a 51
# 4 3 b 54
There are many other options detailed at the R-FAQ How to join (merge) data frames (inner, outer, left, right)?
If you do need to identify the removed rows for some other purpose, dplyr::anti_join is one way. anti_join(x, y) will return the rows in x that are not in y.
library(dplyr)
anti_join(x, y)
# Joining, by = c("date", "id")
# date id sum
# 1 3 a 52
If speed is an issue, the data.table method in this answer will be fastest; that answer also does some fairly comprehensive benchmarking. However, your code makes enough inefficient steps (growing a data frame inside a loop, recomputing the same unique values, sometimes unnecessarily) that my guess is even base::merge will be several orders of magnitude faster.
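For reference, a rough data.table sketch of that approach (my own illustration, assuming the same x and y as defined in the question):
library(data.table)
setDT(x); setDT(y)
x_kept    <- x[y, on = .(date, id), nomatch = 0]  # inner join: rows of x also present in y
x_removed <- x[!y, on = .(date, id)]              # anti join: rows of x with no match in y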
Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order, to find which rows share the same set of values in the first 3 rank columns.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between Year and 1 called Weighting, and when counting these unique combinations I would include each row's weighting. This isn't as important because all weightings will be small positive integer values, so I could potentially duplicate the rows beforehand to account for weighting and then tabulate the unique combinations.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
Then you can get the count of each combination using table:
table(combos)
# ABC ACD BCD
# 2 1 2
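If you also have the Weighting column described in the question, one hedged option (assuming it is stored as df$Weighting, which is not in the sample data here) is to sum the weights within each combination instead of counting rows:
tapply(df$Weighting, combos, sum)  # hypothetical: df$Weighting is not part of the sample data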
This is the same solution as #Alex_A but using tidyverse functions:
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
                           ~ paste0(sort(c(...)), collapse = "")))
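If you also need the counts rather than just the distinct combinations, a sketch that swaps distinct for count (assuming the same df as above):
library(dplyr)
library(purrr)
df %>%
  mutate(combo = pmap_chr(select(df, num_range("", 1:3)),
                          ~ paste0(sort(c(...)), collapse = ""))) %>%
  count(combo)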