Double left join in dplyr to recover values

I've checked this issue but couldn't find a matching entry.
Say you have 2 DFs:
df1: mode    df2: sex
        1           1
        2           2
        3
And a DF3 where most of the combinations are not present, e.g.
mode | sex | cases
   1 |   1 |     9
   1 |   1 |     2
   2 |   2 |     7
   3 |   1 |     2
   1 |   2 |     5
and you want to summarise it with dplyr, obtaining all combinations (with non-existent ones = 0):
mode | sex | cases
   1 |   1 |    11
   1 |   2 |     5
   2 |   1 |     0
   2 |   2 |     7
   3 |   1 |     2
   3 |   2 |     0
If you do a single left join (left_join(df1, df3)), you recover the modes not in df3, but sex appears as NA; the same happens with left_join(df2, df3).
So how can you do both left joins to recover all absent combinations, with cases = 0? dplyr preferred, but sqldf is an option.
Thanks in advance, p.
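To make the failure concrete, here is a minimal sketch with hypothetical data in which mode 3 is entirely absent from df3: the single join recovers the mode but leaves sex (and cases) as NA.
library(dplyr)

# Hypothetical data: mode 3 never occurs in df3
df1 <- data.frame(mode = 1:3)
df3 <- data.frame(mode = c(1, 1, 2), sex = c(1, 2, 2), cases = c(9, 2, 7))

left_join(df1, df3, by = "mode")
#   mode sex cases
# 1    1   1     9
# 2    1   2     2
# 3    2   2     7
# 4    3  NA    NA   <- mode 3 recovered, but sex and cases are NA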

The development version of tidyr, tidyr_0.2.0.9000, has a new function called complete() that I saw the other day; it seems like it was made for just this sort of situation.
The help page says:
This is a wrapper around expand(), left_join() and replace_na() that's useful for completing missing combinations of data. It turns implicitly missing values into explicitly missing values.
To add the missing combinations to df3 and fill the new rows with 0 instead of NA, you would do:
library(tidyr)
library(dplyr)
df3 %>% complete(mode, sex, fill = list(cases = 0))
  mode sex cases
1    1   1     9
2    1   1     2
3    1   2     5
4    2   1     0
5    2   2     7
6    3   1     2
7    3   2     0
You would still need to group_by and summarise to get the final output you want.
df3 %>% complete(mode, sex, fill = list(cases = 0)) %>%
  group_by(mode, sex) %>%
  summarise(cases = sum(cases))
Source: local data frame [6 x 3]
Groups: mode

  mode sex cases
1    1   1    11
2    1   2     5
3    2   1     0
4    2   2     7
5    3   1     2
6    3   2     0

First, here's your data in a more friendly, reproducible format:
df1 <- data.frame(mode=1:3)
df2 <- data.frame(sex=1:2)
df3 <- data.frame(mode=c(1,1,2,3,1), sex=c(1,1,2,1,2), cases=c(9,2,7,2,5))
I don't see an option for a cross join in dplyr (what's needed here is every mode/sex combination, i.e. the Cartesian product of df1 and df2), so I'm going to use base R's merge() for that step. Then I left-join the result to the data and replace NA values with zero.
mm <- merge(df1,df2) %>% left_join(df3)
mm$cases[is.na(mm$cases)] <- 0
mm %>% group_by(mode,sex) %>% summarize(cases=sum(cases))
which gives
  mode sex cases
1    1   1    11
2    1   2     5
3    2   1     0
4    2   2     7
5    3   1     2
6    3   2     0
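If you'd rather stay in the tidyverse for the cross join, tidyr::crossing() can stand in for merge(); a sketch, assuming a tidyr version that provides crossing(), with dplyr::coalesce() handling the NA step:
library(dplyr)
library(tidyr)

crossing(df1, df2) %>%                     # all mode/sex combinations
  left_join(df3) %>%                       # joins by mode and sex
  mutate(cases = coalesce(cases, 0)) %>%   # NA -> 0
  group_by(mode, sex) %>%
  summarise(cases = sum(cases))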

gather() per grouped variables in R for specific columns

I have a long data frame of the decisions of players who worked in groups.
I need to convert the data in such a way that each row (individual observation) contains all the group members' decisions (so we can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <- data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
  group_id player_id player_decision player_contribution
1        1         1              10                   6
2        1         2              20                   5
3        1         3              30                   4
4        2         1              40                   3
5        2         2              50                   2
6        2         3              60                   1
But I need to convert it to wide format per group, and only for some of these variables (in this example, specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id = c(1, 1),
           player_id = c(1, 2),
           player_decision = c(10, 20),
           player_1_contribution = c(6, 6),
           player_2_contribution = c(5, 5),
           player_3_contribution = c(4, 4))
  group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1        1         1              10                     6                     5                     4
2        1         2              20                     6                     5                     4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with one column per player's contribution, then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)

wide <- pivot_wider(df,
                    id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide)
answer
  group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1        1         1              10                     6                     5                     4
2        1         2              20                     6                     5                     4
3        1         3              30                     6                     5                     4
4        2         1              40                     3                     2                     1
5        2         2              50                     3                     2                     1
6        2         3              60                     3                     2                     1
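If the exact column names from the question (player_1_contribution and so on) are wanted, pivot_wider()'s names_glue argument can build them directly; a sketch, assuming tidyr >= 1.0 (where pivot_wider() gained names_glue):
wide <- pivot_wider(df,
                    id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_glue = "player_{player_id}_contribution")  # replaces names_prefix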

Sort across rows to obtain three largest values

There is an injury score called the ISS (Injury Severity Score).
I have a table of injury data in rows according to patient ID.
I would like to obtain the top three values across the 6 injury columns for each patient.
Column values range from 0-5.
pt_id head face abdo pelvis Extremity External
    1    4    0    0      1         0        3
    2    3    3    5      0         3        2
    3    0    0    2      1         1        1
    4    2    0    0      0         0        1
    5    5    0    0      2         0        1
My output for the above example would be:

pt_id n1 n2 n3
    1  4  3  1
    2  5  3  3
    3  2  1  1
    4  2  1  0
    5  5  2  1
Values can be in a list or in new columns, as calculating the score is simple from that point on.
I had thought that I would be able to create a list for the 6 injury columns and then apply a sort to each list taking the top three values. My code for that was:
ais$ais_list <- setNames(split(ais[,2:7], seq(nrow(ais))), rownames(ais))
But I struggled to apply the sort to the lists within the data frame, as unfortunately some of the data in my data set includes NA values.
We could use apply() row-wise to sort each row of the data frame and keep only the first three values.
cbind(df[1], t(apply(df[-1], 1, sort, decreasing = TRUE)[1:3, ]))
#  pt_id 1 2 3
#1     1 4 3 1
#2     2 5 3 3
#3     3 2 1 1
#4     4 2 1 0
#5     5 5 2 1
As some values may be NA, it is better to apply sort() using an anonymous function and then take the top 3 values using head().
cbind(df[1], t(apply(df[-1], 1, function(x) head(sort(x, decreasing = TRUE), 3))))
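One caveat: sort() drops NAs entirely, so a row with fewer than three non-NA scores would give apply() vectors of different lengths. A small sketch that pads such rows back to three values (assumption: padding with NA is acceptable; use 0 if that better matches the scoring):
top3 <- function(x, n = 3) {
  out <- sort(x, decreasing = TRUE)  # sort() silently drops NAs
  length(out) <- n                   # pad with NA up to n values
  out
}
cbind(df[1], t(apply(df[-1], 1, top3)))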
A tidyverse option is to first gather the data, arrange it in descending order and, for every patient, select only the first three values. We then replace the injury column with the ranks 1-3 and finally spread the data back to wide format.
library(tidyverse)
df %>%
  gather(injury, value, -pt_id) %>%
  arrange(desc(value)) %>%
  group_by(pt_id) %>%
  slice(1:3) %>%
  mutate(injury = 1:3) %>%
  spread(injury, value)
#  pt_id   `1`   `2`   `3`
#  <int> <int> <int> <int>
#1     1     4     3     1
#2     2     5     3     3
#3     3     2     1     1
#4     4     2     1     0
#5     5     5     2     1
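gather() and spread() have since been superseded; the same pipeline in current tidyverse style would be (a sketch, assuming dplyr >= 1.0 and tidyr >= 1.0):
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-pt_id, names_to = "injury", values_to = "value") %>%
  group_by(pt_id) %>%
  slice_max(value, n = 3, with_ties = FALSE) %>%  # top three per patient
  mutate(rank = row_number()) %>%
  select(-injury) %>%
  pivot_wider(names_from = rank, values_from = value)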

value of certain column based on multiple conditions in two data frames R

There are two data frames, df1 and df2 (shown as images in the original post).
If you look at btime in df1, there are NAs.
I want to fill the btime NAs for every unique where stnseq = 1, so only the first NA of each unique will be filled.
The value I would like to fill with is in df2: for each unique, where boardstation = 8501970, add up the values in the departure column.
I have tried the aggregate function but I do not know how to write the condition for only boardstation 8501970.
Thanks to anyone for any help.
If I understood the question correctly, then this might help.
library(dplyr)
df2 %>%
  group_by(unique) %>%
  summarise(departure_sum = sum(departure[boardstation == 8501970])) %>%
  right_join(df1, by = "unique") %>%
  mutate(btime = ifelse(is.na(btime) & stnseq == 1, departure_sum, btime)) %>%
  select(-departure_sum) %>%
  data.frame()
Since the sample data is in image format, I cooked up my own data as below:
df1
  unique stnseq btime
1      1      1    NA
2      1      2    NA
3      2      1    NA
4      2      2   200

df2
  unique boardstation departure
1      1      8501970         1
2      1      8501970         2
3      1          123         3
4      2      8501970         4
5      2          456         5
6      3          900         6
Output is:
  unique stnseq btime
1      1      1     3
2      1      2    NA
3      2      1     4
4      2      2   200
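An equivalent sketch that filters before summarising and uses if_else() (same logic, arguably easier to read; assumes the same df1/df2 cooked above):
library(dplyr)

lookup <- df2 %>%
  filter(boardstation == 8501970) %>%   # keep only the board station of interest
  group_by(unique) %>%
  summarise(departure_sum = sum(departure))

df1 %>%
  left_join(lookup, by = "unique") %>%
  mutate(btime = if_else(is.na(btime) & stnseq == 1, departure_sum, btime)) %>%
  select(-departure_sum)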

Filter ids with having count > 1 in data.table [duplicate]

I would like to subset my data frame to keep only groups that have 3 or more observations on DIFFERENT days. I want to get rid of groups that have fewer than 3 observations, or whose observations are not from 3 different days.
Here is a sample data set:
Group Day
1 1
1 3
1 5
1 5
2 2
2 2
2 4
2 4
3 1
3 2
3 3
4 1
4 5
So for the above example, group 1 and group 3 will be kept and group 2 and 4 will be removed from the data frame.
I hope this makes sense. I imagine the solution will be quite simple, but I can't work it out (I'm quite new to R and not very fast at coming up with solutions to things like this). I thought maybe the diff function could come in handy, but I didn't get much further.
With data.table you could do:
library(data.table)
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
   Group Day
1:     1   1
2:     1   3
3:     1   5
4:     1   5
5:     3   1
6:     3   2
7:     3   3
Or with dplyr:
library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)
which gives the same result.
One idea using dplyr
library(dplyr)
df %>%
  group_by(Group) %>%
  filter(length(unique(Day)) >= 3)
#Source: local data frame [7 x 2]
#Groups: Group [2]
#
#  Group   Day
#  (int) (int)
#1     1     1
#2     1     3
#3     1     5
#4     1     5
#5     3     1
#6     3     2
#7     3     3
We can use base R
# table(df1) is a Group x Day contingency table; a non-zero cell means that
# group was seen on that day, so rowSums counts the distinct days per group
i1 <- rowSums(table(df1) != 0) >= 3
subset(df1, Group %in% names(i1)[i1])
#   Group Day
#1      1   1
#2      1   3
#3      1   5
#4      1   5
#9      3   1
#10     3   2
#11     3   3
Or a base R one-liner would be
# ave() repeats each group's distinct-day test across all of its rows
df1[with(df1, as.logical(ave(Day, Group, FUN = function(x) length(unique(x)) >= 3))), ]

Add missing values in time series efficiently

I have 500 datasets (panel data). In each I have a time series (week) across different shops (store). Within each shop, I would need to add missing time series observations.
A sample of my data would be:
store week value
    1    1    50
    1    3    52
    1    4    10
    2    1     4
    2    4    84
    2    5     2
which I would like to look like:
store week value
    1    1    50
    1    2     0
    1    3    52
    1    4    10
    2    1     4
    2    2     0
    2    3     0
    2    4    84
    2    5     2
I currently use the following code (which works, but takes very, very long on my data):
stores <- unique(mydata$store)
for (i in 1:length(stores)) {
  mydata <- merge(
    expand.grid(week = min(mydata$week):max(mydata$week)),
    mydata, all = TRUE)
  mydata[is.na(mydata)] <- 0
}
Are there better and more efficient ways to do so?
Here's a dplyr/tidyr option you could try:
library(dplyr)
library(tidyr)

group_by(df, store) %>%
  complete(week = full_seq(week, 1L), fill = list(value = 0))
#Source: local data frame [9 x 3]
#
#  store  week value
#  (int) (int) (dbl)
#1     1     1    50
#2     1     2     0
#3     1     3    52
#4     1     4    10
#5     2     1     4
#6     2     2     0
#7     2     3     0
#8     2     4    84
#9     2     5     2
By default, if you don't specify the fill parameter, new rows will be filled with NA. Since you seem to have many other columns, I would advise leaving out the fill parameter so you end up with NAs and, if required, adding another step with mutate_each to turn the NAs into 0 (if that's appropriate).
group_by(df, store) %>%
  complete(week = full_seq(week, 1L)) %>%
  mutate_each(funs(replace(., which(is.na(.)), 0)), -store, -week)
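mutate_each() has since been deprecated; the same fill step in current dplyr style would use across() (a sketch, assuming dplyr >= 1.0, with tidyr::replace_na() doing the replacement):
group_by(df, store) %>%
  complete(week = full_seq(week, 1L)) %>%
  ungroup() %>%
  mutate(across(-c(store, week), ~ tidyr::replace_na(.x, 0)))  # NA -> 0 in all other columns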
