Merge overlapping datasets by column identifier? - r

I am trying to merge/join two datasets which have different data about the same samples with no rows in common. I would like to be able to merge them by the column names and have that add the rows from the smaller dataset to the larger, filling in NA for all columns that do not have information from the smaller dataset. I feel like this is something super easy that I'm just somehow not able to figure out.
2 tiny sample datasets:
df1 <- data.frame(team=c('A', 'B', 'C', 'D'),
points=c(88, 98, 104, 100),
league=c('Alpha', 'Beta', 'Gamma', 'Delta'))
team points league
1 A 88 Alpha
2 B 98 Beta
3 C 104 Gamma
4 D 100 Delta
df2 <- data.frame(team=c('L', 'M','N', 'O', 'P', 'Q'),
points=c(43, 66, 77, 83, 12, 12),
league=c('Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
rebounds=c(22, 31, 29, 20, 33, 44),
fouls=c(1, 3, 2, 4, 5, 1))
team points league rebounds fouls
1 L 43 Epsilon 22 1
2 M 66 Zeta 31 3
3 N 77 Eta 29 2
4 O 83 Theta 20 4
5 P 12 Iota 33 5
6 Q 12 Kappa 44 1
the output I would like to get would be:
df3<- data.frame(team=c('A', 'B', 'C', 'D', 'L', 'M','N', 'O', 'P', 'Q' ),
points=c(88, 98, 104, 100, 43, 66, 77, 83, 12, 12),
league=c('Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
rebounds=c(NA, NA, NA, NA, 22, 31, 29, 20, 33, 44),
fouls= c(NA, NA, NA, NA, 1, 3, 2, 4, 5, 1))
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
I tried transposing the dfs, but because they have no rows in common that does not work either. I thought about making an index, but I'm just learning about those and I'm not sure how I would do it or if that's the correct move.

Use full_join and arrange
library(dplyr)
full_join(df2, df1) %>%
arrange(team)
-output
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1
Or with rows_upsert
rows_upsert(df2, df1, by = c("team", "points", "league"))

We could use bind_rows()
When row-binding, columns are matched by name, and any missing columns will be filled with NA:
library(dplyr)
bind_rows(df1, df2)
team points league rebounds fouls
1 A 88 Alpha NA NA
2 B 98 Beta NA NA
3 C 104 Gamma NA NA
4 D 100 Delta NA NA
5 L 43 Epsilon 22 1
6 M 66 Zeta 31 3
7 N 77 Eta 29 2
8 O 83 Theta 20 4
9 P 12 Iota 33 5
10 Q 12 Kappa 44 1

Using base R, you could add the missing columns in df1 using setdiff() and then rbind them together:
df1[, setdiff(names(df2), names(df1))] <- NA
rbind(df1, df2)
Output:
# team points league rebounds fouls
# 1 A 88 Alpha NA NA
# 2 B 98 Beta NA NA
# 3 C 104 Gamma NA NA
# 4 D 100 Delta NA NA
# 5 L 43 Epsilon 22 1
# 6 M 66 Zeta 31 3
# 7 N 77 Eta 29 2
# 8 O 83 Theta 20 4
# 9 P 12 Iota 33 5
# 10 Q 12 Kappa 44 1
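If you would rather not modify df1 in place, the same setdiff() idea can be wrapped in a small helper that pads both inputs before binding. This is only a sketch; the name bind_fill is made up for illustration and is not part of base R or any package:

```r
df1 <- data.frame(team = c('A', 'B', 'C', 'D'),
                  points = c(88, 98, 104, 100),
                  league = c('Alpha', 'Beta', 'Gamma', 'Delta'))
df2 <- data.frame(team = c('L', 'M', 'N', 'O', 'P', 'Q'),
                  points = c(43, 66, 77, 83, 12, 12),
                  league = c('Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
                  rebounds = c(22, 31, 29, 20, 33, 44),
                  fouls = c(1, 3, 2, 4, 5, 1))

# bind_fill() is a hypothetical helper: it works on copies of its arguments,
# adds any column missing from either side as NA, then row-binds by name.
bind_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a[all_cols], b[all_cols])
}

df3 <- bind_fill(df1, df2)
df3
```

Because the padding happens inside the function, df1 keeps its original three columns afterwards.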

Related

logical operator TRUE/FALSE in R

I wrote a simple function that produces all combinations of the input (a vector). Here the input vector is basically a sequence of 4 coordinates (x, y) as mentioned inside the function as a, b,c, and d.
intervals<-function(x1,y1,x2,y2,x3,y3,x4,y4){
a<-c(x1,y1)
b<-c(x2,y2)
c<-c(x3,y3)
d<-c(x4,y4)
union<-expand.grid(a,b,c,d)
union
}
intervals(2,10,3,90,6,50,82,7)
> intervals(2,10,3,90,6,50,82,7)
Var1 Var2 Var3 Var4
1 2 3 6 82
2 10 3 6 82
3 2 90 6 82
4 10 90 6 82
5 2 3 50 82
6 10 3 50 82
7 2 90 50 82
8 10 90 50 82
9 2 3 6 7
10 10 3 6 7
11 2 90 6 7
12 10 90 6 7
13 2 3 50 7
14 10 3 50 7
15 2 90 50 7
16 10 90 50 7
>
Now I want to find (max of x) and (min of y) for each row of the given output. E.g. row 2: we have 4 values (10, 3, 6, 82). Here (3,6,82) are from x (x2,x3,x4) and 10 is basically from y (y1). Thus max of x is 82, and the min of y is 10.
So what I want is two values from each row.
I do not actually know how to approach this kind of logical command. Any idea or suggestions?
You can pass x and y vector separately to the function. Use expand.grid to create all combinations of the vector and get max of x and min of y from each row.
intervals<-function(x, y){
tmp <- do.call(expand.grid, rbind.data.frame(x, y))
names(tmp) <- paste0('col', seq_along(tmp))
result <- t(apply(tmp, 1, function(p) {
suppressWarnings(c(max(p[p %in% x]), min(p[p %in% y])))
}))
result[is.infinite(result)] <- NA
result <- as.data.frame(result)
names(result) <- c('max_x', 'min_y')
result
}
intervals(c(2,3,6,82), c(10, 90, 50, 7))
# max_x min_y
#1 82 NA
#2 82 10
#3 82 90
#4 82 10
#5 82 50
#6 82 10
#7 82 50
#8 82 10
#9 6 7
#10 6 7
#11 6 7
#12 6 7
#13 3 7
#14 3 7
#15 2 7
#16 NA 7
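Note that the %in% test above decides whether a value came from x or y by its value, so it can misclassify when the two vectors share a number. A possible alternative, sketched here under the assumption that each position picks either x[i] or y[i], is to enumerate the picks as TRUE/FALSE flags instead (the name intervals_by_position is made up for illustration):

```r
intervals_by_position <- function(x, y) {
  n <- length(x)
  # One row per combination; TRUE in column i means y[i] was picked, FALSE means x[i]
  pick_y <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), n)))
  # Largest x among the positions that picked x (NA when no x was picked)
  max_x <- apply(pick_y, 1, function(p) if (all(p)) NA_real_ else max(x[!p]))
  # Smallest y among the positions that picked y (NA when no y was picked)
  min_y <- apply(pick_y, 1, function(p) if (any(p)) min(y[p]) else NA_real_)
  data.frame(max_x = max_x, min_y = min_y)
}

res <- intervals_by_position(c(2, 3, 6, 82), c(10, 90, 50, 7))
res
```

The rows come out in the same order as expand.grid(a, b, c, d) in the question, because the first position varies fastest and the x value is listed first.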

R create a column to identify the group that row belong to

Description of Data: Dataset contains information regarding users about their age, gender and membership they are holding.
Goal: Create a new column to identify the group/label for each user based on pre-defined conditions.
Age conditions: multiple age brackets :
18 <= age <= 24, 25 <= age <= 30, 31 <= age <= 41, 42 <= age <= 60, age >= 61
Gender: M/F
Membership: A,B,C,I
I created sample data frame to try out creation of new column to identify the group/label
df = data.frame(userid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12),
age = c(18, 61, 23, 35, 30, 25, 55, 53, 45, 41, 21, NA),
gender = c('F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', '<NA>', 'M'),
membership = c('A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'I', 'I', 'A', '<NA>'))
userid age gender membership
1 1 18 F A
2 2 61 M B
3 3 23 F A
4 4 35 F C
5 5 30 M C
6 6 25 M B
7 7 55 M A
8 8 53 M A
9 9 45 M I
10 10 41 F I
11 11 21 <NA> A
12 12 NA M <NA>
Based on above data there exist 4 * 2 * 5 options (combinations)
Final outcome:
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
11 11 21 <NA> A 43 (assuming it will auto-detect the combo)
12 12 NA M <NA> 46
I believe my calculation of the combinations is correct. If so, how can I use dplyr or any other option to get the above data frame?
Use multiple if conditions to confirm all the options?
In dplyr is there a way to actually provide conditions for each column to set the grouping conditions:
df %>% group_by(age, gender, membership)
Two options,
One, more automated;
# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
df %>% mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
mutate(GroupID = group_indices(., ageCat, gender, membership)) %>% select(-ageCat)
#> userid age gender membership GroupID
#> 1 1 18 F A 2
#> 2 2 61 M B 9
#> 3 3 23 F A 2
#> 4 4 35 F C 5
#> 5 5 30 M C 4
#> 6 6 25 M B 3
#> 7 7 55 M A 7
#> 8 8 53 M A 7
#> 9 9 45 M I 8
#> 10 10 41 F I 6
#> 11 11 21 <NA> A 1
#> 12 12 NA M <NA> 10
Two, more manual;
Here I illustrate the approach for categories 1 and 4 only; you have to code the rest yourself.
df %>% mutate(GroupID =
ifelse(age >= 18 & age <= 24 & gender == 'F' & membership == "A", 1,
ifelse(age >= 31 & age <= 41 & gender == 'F' & membership == "C", 4, NA)
))
#> userid age gender membership GroupID
#> 1 1 18 F A 1
#> 2 2 61 M B NA
#> 3 3 23 F A 1
#> 4 4 35 F C 4
#> 5 5 30 M C NA
#> 6 6 25 M B NA
#> 7 7 55 M A NA
#> 8 8 53 M A NA
#> 9 9 45 M I NA
#> 10 10 41 F I NA
#> 11 11 21 <NA> A NA
#> 12 12 NA M <NA> NA
the data structure in case others feel like giving it a go,
You can try this:
library(data.table)
setDT(df)
# Map each age to a bracket number (1 to 5); NA ages stay NA
df[, agegrp := ifelse(age >= 18 & age <= 24, 1,
ifelse(age >= 25 & age <= 30, 2,
ifelse(age >= 31 & age <= 41, 3,
ifelse(age >= 42 & age <= 60, 4, 5))))]
# Number each distinct (age bracket, gender, membership) combination
df[, group := .GRP, by = .(agegrp, gender, membership)]
If you want to use base R only, you could do something like this:
# 1
allcombos <- expand.grid(c("M", "F"), c("A", "B", "C", "I"), 1:5)
allgroups <- do.call(paste0, allcombos) # 40 unique combinations
# 2
agegroups <- cut(df$age,
breaks = c(17, 24, 30, 41, 60, Inf),
labels = c(1, 2, 3, 4, 5))
# 3
df$groupid <- paste0(df$gender, df$membership, agegroups)
df$groupid <- factor(df$groupid, levels=allgroups, labels=1:length(allgroups))
expand.grid gives you a data.frame with three columns where every row represents a unique combination of the three arguments provided. As you said, there are 40 combinations. The do.call(paste0, ...) line combines every row of the data frame into a single string, like "MA1", "FA1", "MB1", etc.
Then we use cut to assign each age to its relevant age group, labelled 1 to 5.
Finally, we create a column in df that contains the three-character combination of gender, membership and age group, which is then converted to a factor whose levels are all the possible combinations we found in allgroups.
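One detail worth checking with cut is which side of each interval is closed. By default the intervals are closed on the right, so a break at 24 puts age 24 in the first group and age 25 in the second. A quick sketch, using made-up boundary ages and assuming the brackets stated in the question (18-24, 25-30, 31-41, 42-60, 61 and over):

```r
# Boundary ages for each bracket (illustrative values, not the question's data)
ages <- c(18, 24, 25, 30, 31, 41, 42, 60, 61)

# Intervals are (17,24], (24,30], (30,41], (41,60], (60,Inf]
agegroups <- cut(ages, breaks = c(17, 24, 30, 41, 60, Inf), labels = 1:5)

# cut() returns a factor; convert via character to recover the group numbers
group_numbers <- as.integer(as.character(agegroups))
group_numbers
```

Each pair of boundary ages lands in the same group, and 61 falls into group 5.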

transforming & adding new column in r

I currently have a data frame taken from a data feed of events that happened in chronological order. I would like to add a new column to each row that holds the previous event's endx if the previous event's type is 1, and the previous event's x otherwise.
e.g
player_id <- c(12, 17, 26, 3)
event_type <- c(1, 3, 1, 10)
x <- c(65, 34, 43, 72)
endx <- c(68, NA, 47, NA)
df <- data.frame(player_id, event_type, x, endx)
df
player_id event_type x endx
1 12 1 65 68
2 17 3 34 NA
3 26 1 43 47
4 3 10 72 NA
so end result
player_id event_type x endx previous
1 12 1 65 68 NA
2 17 3 34 NA 68
3 26 1 43 47 34
4 3 10 72 NA 47
We can use if_else
library(dplyr)
df %>%
mutate(previous = if_else(lag(event_type)==1, lag(endx), lag(x)))
# player_id event_type x endx previous
#1 12 1 65 68 NA
#2 17 3 34 NA 68
#3 26 1 43 47 34
#4 3 10 72 NA 47
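I am sure this isn't the most succinct way, but you can use a loop and indexing.
df$previous <- NA
for (i in 2:nrow(df)) {
# use the previous row's endx after a type-1 event, otherwise its x
df[i, "previous"] <- if (df[i - 1, "event_type"] == 1) df[i - 1, "endx"] else df[i - 1, "x"]
}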
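The same rule can also be written without a loop or extra packages by shifting each column down one row by hand. This is a sketch in base R; the prev_* names are just illustrative helpers:

```r
player_id <- c(12, 17, 26, 3)
event_type <- c(1, 3, 1, 10)
x <- c(65, 34, 43, 72)
endx <- c(68, NA, 47, NA)
df <- data.frame(player_id, event_type, x, endx)

# Shift each column down by one row so row i sees the values from row i - 1
prev_type <- c(NA, head(df$event_type, -1))
prev_x    <- c(NA, head(df$x, -1))
prev_endx <- c(NA, head(df$endx, -1))

# Previous endx after a type-1 event, previous x otherwise; the first row stays NA
df$previous <- ifelse(prev_type == 1, prev_endx, prev_x)
df
```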

Create nominal variable from multiple columns R

My intention involves creating a variable based on the values of two numeric ones. I have not written any user-defined functions in R and need help getting started.
Dataset:
My dataset has over 3k stores, but I created a reproducible example of the first 10 rows. Deliveries per day of week show the total volume for that day through the year. Store_Num represents the store number and Total shows the total deliveries for a store throughout the year.
I want predominant delivery days created in a variable called Del_Sch using the following thresholds. If the first condition is TRUE (50-100%), then create the variable from the column name. If FALSE, test the second condition and create the variable from all column names between 32-50%, etc. If there are no days over 20%, no predominant delivery days are counted.
-Volume in a day between 50-100% of the total.
-Volume in a day between 32-50% of total
-Volume in a day between 25-32% of total.
-Volume in a day between 20-25% of total.
-Volume in a day less than 20% of total.
Reproducible Example:
Store_Num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#Total deliveries made per week
Sun_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Mon_Del <- c(10, 50, 51, 7, 80, 97, 21, 49, 30, 3)
Tue_Del <- c(7, NA, 2, 50, 5, 56, 1, 4, 35, 52)
Wed_Del <- c(49, 51, 1, 4, 51, 16, 2, 2, 1, 1)
Thu_Del <- c(3, 2, 47, 7, 40, 2, 6, 5, 1, 7)
Fri_Del <- c(50, 49, 3, 51, 53, 86, 9, 52, 25, 52)
Sat_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Total <- c(119, 152, 104, 119, 229, 257, 39, 112, 92, 115)
#Single dataset
Schedule <- data.frame(Store_Num, Sun_Del, Mon_Del, Tue_Del,
Wed_Del, Thu_Del, Fri_Del, Sat_Del, Total)
Schedule
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total
1 1 NA 10 7 49 3 50 NA 119
2 2 NA 50 NA 51 2 49 NA 152
3 3 NA 51 2 1 47 3 NA 104
4 4 NA 7 50 4 7 51 NA 119
5 5 NA 80 5 51 40 53 NA 229
6 6 NA 97 56 16 2 86 NA 257
7 7 NA 21 1 2 6 9 NA 39
8 8 NA 49 4 2 5 52 NA 112
9 9 NA 30 35 1 1 25 NA 92
10 10 NA 3 52 1 7 52 NA 115
Desired Output:
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total Del_Sch
1 1 NA 10 7 49 3 50 NA 119 WF
2 2 NA 50 NA 51 2 49 NA 152 MWF
3 3 NA 51 2 1 47 3 NA 104 MTh
4 4 NA 7 50 4 7 51 NA 119 TF
5 5 NA 80 5 51 40 53 NA 229 MWF
6 6 NA 97 56 16 2 86 NA 257 MTF
7 7 NA 21 1 2 6 9 NA 39 M
8 8 NA 49 4 2 5 52 NA 112 MF
9 9 NA 30 35 1 1 25 NA 92 MTF
10 10 NA 3 52 1 7 52 NA 115 TF
Using tidyr and dplyr. I used the first two letters of each column name as the day label, to avoid the Tuesday/Thursday ambiguity:
library(dplyr)
library(tidyr)
Schedule %>% gather(Day, del, -Store_Num, -Total) %>%
mutate(proportion = ifelse(del/Total >= 0.5, 1,
ifelse(del/Total >= 0.32, 2,
ifelse(del/Total >= 0.25, 3,
ifelse(del/Total >= 0.20, 4,
NA))))) %>%
group_by(Store_Num) %>%
summarise(days = paste0(substr(Day[which(
proportion == min(proportion, na.rm = TRUE))],
1, 2), collapse = "")) %>%
merge(Schedule, ., by = "Store_Num")
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total days
1 1 NA 10 7 49 3 50 NA 119 WeFr
2 2 NA 50 NA 51 2 49 NA 152 MoWeFr
3 3 NA 51 2 1 47 3 NA 104 MoTh
4 4 NA 7 50 4 7 51 NA 119 TuFr
5 5 NA 80 5 51 40 53 NA 229 Mo
6 6 NA 97 56 16 2 86 NA 257 MoFr
7 7 NA 21 1 2 6 9 NA 39 Mo
8 8 NA 49 4 2 5 52 NA 112 MoFr
9 9 NA 30 35 1 1 25 NA 92 MoTu
10 10 NA 3 52 1 7 52 NA 115 TuFr
Edit: my results differ from your desired output in rows 5, 6 and 9. Applying your rules as written, the desired output appears to contain mistakes for those rows.
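To see how the tier rule plays out for a single store, here is a small base R sketch. The function name predominant_days is made up for illustration, and the thresholds are the ones from the question (50%, 32%, 25%, 20%):

```r
predominant_days <- function(deliveries, total) {
  share <- deliveries / total
  # findInterval gives 0 below 20%, 1 for 20-25%, 2 for 25-32%, 3 for 32-50%, 4 for 50%+
  # Flip it so tier 1 is the strongest bracket (50%+) and tier 5 means under 20%
  tier <- 5 - findInterval(share, c(0.20, 0.25, 0.32, 0.50))
  tier[which(tier == 5)] <- NA
  # No day reaches 20%: no predominant delivery day
  if (all(is.na(tier))) return("")
  best <- min(tier, na.rm = TRUE)
  # Keep every day in the best tier, labelled by the first two letters of its name
  paste0(substr(names(deliveries)[which(tier == best)], 1, 2), collapse = "")
}

store5 <- c(Sun_Del = NA, Mon_Del = 80, Tue_Del = 5, Wed_Del = 51,
            Thu_Del = 40, Fri_Del = 53, Sat_Del = NA)
store9 <- c(Sun_Del = NA, Mon_Del = 30, Tue_Del = 35, Wed_Del = 1,
            Thu_Del = 1, Fri_Del = 25, Sat_Del = NA)

predominant_days(store5, 229)
predominant_days(store9, 92)
```

For store 5 only Monday (about 35% of the total) reaches the 32-50% bracket, while Wednesday and Friday sit in the 20-25% bracket, which is why the rules as written give a single day rather than three.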

dplyr- renaming sequence of columns with select function

I'm trying to rename my columns in dplyr, and I found that this can be done with the select function. However, when I try to rename a range of selected columns with a sequence, I cannot get the names in the format that I want.
test = data.frame(x = rep(1:3, each = 2),
group =rep(c("Group 1","Group 2"),3),
y1=c(22,8,11,4,7,5),
y2=c(22,18,21,14,17,15),
y3=c(23,18,51,44,27,35),
y4=c(21,28,311,24,227,225))
CC <- paste("CC",seq(0,3,1),sep="")
aa<-test%>%
select(AC=x,AR=group,CC=y1:y4)
head(aa)
AC AR CC1 CC2 CC3 CC4
1 1 Group 1 22 22 23 21
2 1 Group 2 8 18 18 28
3 2 Group 1 11 21 51 311
4 2 Group 2 4 14 44 24
5 3 Group 1 7 17 27 227
6 3 Group 2 5 15 35 225
The problem is that even though I set the CC values to CC0, CC1, CC2, CC3, the output automatically gives column names starting from CC1.
how can I solve this issue?
I think you'll have an easier time creating such an expression with the select_ function (note that select_ is deprecated in recent dplyr versions):
library(dplyr)
test <- data.frame(x=rep(1:3, each=2),
group=rep(c("Group 1", "Group 2"), 3),
y1=c(22, 8, 11, 4, 7, 5),
y2=c(22, 18, 21, 14, 17, 15),
y3=c(23, 18, 51, 44, 27, 35),
y4=c(21, 28, 311,24, 227, 225))
# build out our select "translation" named vector
DQ <- paste0("y", 1:4)
names(DQ) <- paste0("DQ", seq(0, 3, 1))
# take a look
DQ
## DQ0 DQ1 DQ2 DQ3
## "y1" "y2" "y3" "y4"
test %>%
select_("AC"="x", "AR"="group", .dots=DQ)
## AC AR DQ0 DQ1 DQ2 DQ3
## 1 1 Group 1 22 22 23 21
## 2 1 Group 2 8 18 18 28
## 3 2 Group 1 11 21 51 311
## 4 2 Group 2 4 14 44 24
## 5 3 Group 1 7 17 27 227
## 6 3 Group 2 5 15 35 225
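If you only need the renaming and not the column selection, a lookup vector in base R is another option. This is a sketch of an alternative, not the select_ approach above; the name new_names is just for illustration:

```r
test <- data.frame(x = rep(1:3, each = 2),
                   group = rep(c("Group 1", "Group 2"), 3),
                   y1 = c(22, 8, 11, 4, 7, 5),
                   y2 = c(22, 18, 21, 14, 17, 15),
                   y3 = c(23, 18, 51, 44, 27, 35),
                   y4 = c(21, 28, 311, 24, 227, 225))

# Lookup vector: the names are the old column names, the values are the new ones
new_names <- c(x = "AC", group = "AR",
               setNames(paste0("CC", 0:3), paste0("y", 1:4)))

# Translate every existing column name through the lookup
names(test) <- unname(new_names[names(test)])
head(test)
```

Every column must appear in the lookup for this to work; a column missing from new_names would end up with an NA name.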

Resources