Multiple criteria filtering join using dplyr - R

I'm trying to accomplish the operation described below by creating a df named event_f.
Using the detail df as the filtering criterion, I want all event_id values that have type_id == 6, excluding those where 6 occurs in combination with 3 or with 7.
Note that other combinations can occur, but those are all to be included.
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.5.3
#> Warning: package 'purrr' was built under R version 3.5.3
event <- tibble(id = c("00_1", "00_2", "00_3", "00_4", "00_5", "00_6", "00_7"),
                type_id = c("A", "B", "C", "B", "A", "B", "C"))
detail <- tibble(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
                 event_id = c("00_1", "00_1", "00_2", "00_2", "00_3", "00_4", "00_4", "00_5", "00_6", "00_6", "00_7", "00_8"),
                 type_id = c(3L, 4L, 6L, 7L, 2L, 6L, 3L, 2L, 6L, 5L, 2L, 1L))
event_f <- event %>%
  semi_join(detail %>% filter(event_id %in% event$id,
                              type_id == 6,
                              type_id != (7 | 3)), by = c("id" = "event_id"))
Created on 2019-04-01 by the reprex package (v0.2.1)
I would like to have a df with one row: id = "00_6" and type_id = "B". I suppose the problem comes from the last two filter() conditions, but I'm not sure how to combine them.

I think you need
library(dplyr)
event %>%
  semi_join(detail %>%
              group_by(event_id) %>%
              filter(any(type_id == 6) & all(!type_id %in% c(3, 7))),
            by = c("id" = "event_id"))
#  id    type_id
#  <chr> <chr>
#1 00_6  B
Since we are trying to find the event_ids whose type_id values satisfy the criteria, we need to group_by() event_id. If we do not group, the filtering criteria are applied to the entire dataframe instead, which returns 0 rows because the values 3 and 7 appear somewhere in the data.
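To see why the grouping matters, run the same filter without it; a minimal sketch using the detail df from the question:
detail %>%
  filter(any(type_id == 6) & all(!type_id %in% c(3, 7)))
# any(type_id == 6) is TRUE, but all(!type_id %in% c(3, 7)) is FALSE
# across the whole column, so the condition is FALSE for every row: 0 rows.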

Related

Creating matched pairs based on condition

Suppose I have a table in the following format:
CowId DIM Type
1 13 Case
2 7 Case
3 3 Control
4 4 Control
5 9 Control
6 3 Control
7 5 Control
8 10 Control
9 1 Control
10 6 Control
11 7 Control
12 4 Control
I would like to randomly match Cases to Controls (1 to 1) based on +/- 3 DIM. Is there a convenient way to accomplish this task using dplyr? Any feedback would be appreciated.
Output from dput is appended:
structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
A way in base R:
#Get the index where Type == 'Case'
inds <- df$Type == 'Case'
#Get all the values within +/- 3 of each Case DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
#Select random Control rows within range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
#Combine Case and Control data
df[c(which(inds), result), ]
# CowId DIM Type
#1 1 13 Case
#2 2 7 Case
#5 5 9 Control
#10 10 6 Control
The "randomly" part could be tricky. Here is my approach:
For each Case ID, calculate the min/max DIM window.
Then randomly pick one (or half) of the Controls available within that window.
Mark the picked Control with a reference to the Case ID and exclude those rows from future picks.
Repeat this step until all Cases are done.
If no pick is available for a Case, a message will pop up.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)
# get list of case - assume the df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
  # calculate the min/max DIM
  current_case <- df %>% filter(CowId == i_case_id)
  expecting_DIM_min <- current_case$DIM - 3
  expecting_DIM_max <- current_case$DIM + 3
  # Pick with sample
  possible_sample <- df %>%
    filter(Type == "Control",
           DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
           Picked == FALSE)
  if (nrow(possible_sample) == 0) {
    message("There is no possible sample for Case ID: ", i_case_id)
    message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
  } else {
    max_sample <- nrow(possible_sample)
    # Maximum pick - in this case the OP asked for a 1-1 match
    # pick_number <- max(1, max_sample / 2)
    pick_number <- 1
    sample <- possible_sample %>%
      sample_n(size = pick_number)
    df$Picked[df$CowId %in% sample$CowId] <- TRUE
    df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
  }
}
Here is the output:
df %>% filter(Picked | Type == "Case")
#> CowId DIM Type Picked Case_ID
#> 1 1 13 Case FALSE -1
#> 2 2 7 Case FALSE -1
#> 3 8 10 Control TRUE 1
#> 4 10 6 Control TRUE 2
Updated: matching 1-1 only
Created on 2021-04-10 by the reprex package (v1.0.0)
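For the strict 1-to-1 case, the whole loop can also be condensed into a short greedy matcher; a sketch, not the answer's exact method (the set.seed() value is an arbitrary assumption, just for reproducibility):
set.seed(42)
cases    <- df[df$Type == "Case", ]
controls <- df[df$Type == "Control", ]
matches  <- data.frame(CaseId = integer(0), ControlId = integer(0))
for (i in seq_len(nrow(cases))) {
  # Controls within +/- 3 DIM of this Case that are still unmatched
  eligible <- controls[abs(controls$DIM - cases$DIM[i]) <= 3, ]
  if (nrow(eligible) == 0) next
  pick <- eligible[sample(nrow(eligible), 1), ]
  matches <- rbind(matches,
                   data.frame(CaseId = cases$CowId[i], ControlId = pick$CowId))
  # sampling without replacement: drop the picked Control from the pool
  controls <- controls[controls$CowId != pick$CowId, ]
}
matches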

Function to Apply Factor Levels & Labels to Numerous Columns at Once

I have a survey dataset that I imported from a SAS file, but it did not include the text labels associated with the numeric codes in the dataset.
I'm trying to apply the factor function to all variables and then have the respective levels and labels for each variable.
I have a main dataframe with the actual data, and then a second dataframe with the text labels corresponding to each value for each variable.
So, for example, the variable column names in the main dataset are A1, B1, C1, D1. The second dataframe with the labels is listed below with dummy text. And for each variable, there are varying numbers of values that need text labels.
labels_list <- structure(list(VariableName = c("A1", "A1", "A1", "B1", "B1",
"B1", "B1", "C1", "C1", "C1", "C1", "C1", "D1", "D1", "D1", "D1",
"D1", "D1"), Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = c("Red", "Blue", "Yellow",
"Up", "Down", "Left", "Right", "Boston", "Atlanta", "Dallas",
"New York", "Los Angeles", "John", "Jim", "Jake", "Bill", "Bob",
"Brian")), class = "data.frame", row.names = c(NA, -18L))
I'm trying to write a function to automatically label all the factor variables. The function reduces the data down so that both dataframes contain exactly the same variables in exactly the same order. I split the table above into a list using the split function, so each variable name above has its own list, but I'm encountering an error when I try to subset the list in the for loop.
Below is the for loop I have written.
# df = main dataset
# labels_list = list with the value and text labels
for(i in 1:ncol(df)) {
  for(j in labels_list) {
    if(names(x[,i]) == names(ahs_split[[j]])) {
      x[,i] <- factor(x[,i], levels = c(ahs_split[[j]][[2]]), labels = c(ahs_split[[j]][[3]]))
    }
  }
}
As I mentioned, my ultimate goal is to take this dataframe of text labels and corresponding values and apply it to each variable individually using the factor function. I've tried for almost a month now and am stuck. If anyone could recommend a better approach or point me in the right direction, I would greatly appreciate it.
If you don't mind some tidyverse verbs, you can reshape your data with tidyr::gather. Once it's in a long shape, you can join the data with the code lookup by variable name, and reshape it back into a wide format. This workflow scales for however many columns you need.
library(dplyr)
library(tidyr)
labels_list <- structure(list(Variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1",
"B1", "C1", "D1"), class = "factor"), Value = c(1L, 2L, 3L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = structure(c(15L,
3L, 18L, 17L, 8L, 12L, 16L, 5L, 1L, 7L, 14L, 13L, 11L, 10L, 9L,
2L, 4L, 6L), .Label = c("Atlanta", "Bill", "Blue", "Bob", "Boston",
"Brian", "Dallas", "Down", "Jake", "Jim", "John", "Left", "Los_Angeles",
"New_York", "Red", "Right", "Up", "Yellow"), class = "factor")), class = "data.frame", row.names = c(NA,
-18L))
df <- tibble(A1 = rep(1:3, 2),
             B1 = c(1:4, 1, 2),
             C1 = c(1:5, 1),
             D1 = 1:6)
A row number iterated over Variable will be necessary to spread the data, but you can drop it after it's no longer needed.
df %>%
  gather(key = Variable, value = Value) %>%
  left_join(labels_list, by = c("Variable", "Value")) %>%
  select(-Value) %>%
  group_by(Variable) %>%
  mutate(row = row_number()) %>%
  spread(key = Variable, value = Label)
#> Warning: Column `Variable` joining character vector and factor, coercing
#> into character vector
#> # A tibble: 6 x 5
#> row A1 B1 C1 D1
#> <int> <fct> <fct> <fct> <fct>
#> 1 1 Red Up Boston John
#> 2 2 Blue Down Atlanta Jim
#> 3 3 Yellow Left Dallas Jake
#> 4 4 Red Right New_York Bill
#> 5 5 Blue Up Los_Angeles Bob
#> 6 6 Yellow Down Boston Brian
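On current tidyr (>= 1.0.0), gather() and spread() are superseded by pivot_longer() and pivot_wider(); here is a sketch of the same reshape, assuming the df and labels_list defined above (the as.character() call just avoids the factor/character join warning):
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "Variable", values_to = "Value") %>%
  left_join(labels_list %>% mutate(Variable = as.character(Variable)),
            by = c("Variable", "Value")) %>%
  select(-Value) %>%
  pivot_wider(names_from = Variable, values_from = Label)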
One way is to convert your labels_list into a list of lists:
library(dplyr) # just using dplyr for the pipe %>%, otherwise everything is in base R
# Convert df to list of key:value pairs
labels_list <- labels_list %>%
  split(f = labels_list$VariableName) %>%
  lapply(function(x) list(key = x$Value, value = x$Label))
e.g.:
$A1
$A1$key
[1] 1 2 3
$A1$value
[1] "Red" "Blue" "Yellow"
This can be mapped onto your df col-wise with apply. This is a bit hacky as I put the column name as the first item of the vector passed to the function.
# Map labels onto sample data with factor()
apply(rbind(names(df), df),
      2,
      function(x) factor(x[2:length(x)],
                         levels = labels_list[[x[1]]]$key,
                         labels = labels_list[[x[1]]]$value)) %>%
  as.data.frame()
A1 B1 C1 D1
1 Blue Up Dallas Jake
2 Red Down New York Jake
3 Yellow Left Boston Jim
4 Yellow Right Boston John
5 Yellow Down Los Angeles Jake
6 Red Left Atlanta Jake
7 Blue Down New York John
8 Red Down Atlanta Brian
9 Blue Up New York Jim
10 Yellow Down Atlanta Bill
Sample Data
set.seed(1724)
df <- data.frame(A1 = floor(runif(10, 1, 4)),
                 B1 = floor(runif(10, 1, 5)),
                 C1 = floor(runif(10, 1, 6)),
                 D1 = floor(runif(10, 1, 7)))
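If the rbind() trick feels too hacky, the same column-wise mapping can be written with Map() over the columns and their names; a sketch, assuming the labels_list of key/value pairs built above:
# Replace each column with its labelled factor; df[] keeps the data.frame class
df[] <- Map(function(col, nm) {
  factor(col,
         levels = labels_list[[nm]]$key,
         labels = labels_list[[nm]]$value)
}, df, names(df))
head(df)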

How to calculate percentage of missing data in a time series in R dplyr

In the following sample data and script,
how can I calculate the % of missing data between the start date strtdt and end date enddt for each ID? What I want is to add the missing days with NA between strtdt and enddt separately for each ID, then calculate the % of NA.
I tried the following using dplyr, but with no luck. Any suggestion will be highly appreciated.
Note: I could achieve the same by calculating it individually for each ID, but that is not feasible because I have more than 10,000 IDs.
The ultimate goal is to get the % of NA between the start date and end date for each ID; if dates are missing completely, then I have to add the missing dates with NA values.
library(dplyr)
df<-structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L
), .Label = c("xx", "xyz", "yy", "zz"), class = "factor"), Date = structure(c(8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 19L, 20L, 21L, 22L, 23L), .Label = c("1989-09-12",
"1989-09-13", "1989-09-14", "1989-09-19", "1989-09-23", "1990-01-12",
"1990-01-13", "1996-09-12", "1996-09-13", "1996-09-16", "1996-09-17",
"1996-09-18", "1996-09-19", "2000-09-12", "2000-09-13", "2000-11-10",
"2000-11-11", "2000-11-12", "2001-09-07", "2001-09-08", "2001-09-09",
"2001-09-10", "2001-09-11"), class = "factor"), val = c(3, 5,
9, 3, 5, 6, 8, 7, 9, 5, 3, 2, 8, 8, 5, 3, 2, 1, 5, 7, NA, NA,
NA, NA)), .Names = c("ID", "Date", "val"), row.names = c(NA,
-24L), class = "data.frame")
df$Date<-as.Date(df$Date,format="%Y-%m-%d")
df
df_mis <- df %>%
  group_by(ID) %>%
  dplyr::mutate(strtdt = min(Date),
                enddt = max(Date))
df_mis
df_mis2 <- df_mis %>%
  group_by(ID) %>%
  dplyr::do(data.frame(., Date1 = seq(.$strtdt, .$enddt, by = '1 day')))
df_mis2
I assume from the sequence generation in the question's code that the expected observations are one per day between the first and last observed date per ID. Here's a clunky piece-by-piece calculation to count the % of missing data.
1. Make a data frame of all expected dates for each ID
library(dplyr)
# df as in the question, but coerce Date column
df$Date <- as.Date(df$Date)
# Data frame with date ranges per id
ranges_df <- df %>%
  group_by(ID) %>%
  summarize(min = min(Date), max = max(Date))
# Data frame with IDs and a date for every day expected.
alldays <- ranges_df %>%
  group_by(ID) %>%
  do(data.frame(
    Date = seq(.$min, .$max, by = '1 day')
  ))
2. JOIN the expected dates table with the observed dates table.
imputed_df <- left_join(alldays, df)
3. Count NAs
imputed_df %>%
  group_by(ID) %>%
  summarize(total = n(),
            missing = sum(is.na(val)),
            percent_missing = missing / total * 100)
result:
# A tibble: 4 x 4
ID total missing percent_missing
<fctr> <int> <int> <dbl>
1 xx 8 2 25.00000
2 xyz 4 4 100.00000
3 yy 62 57 91.93548
4 zz 4380 4371 99.794
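For what it's worth, newer tidyr can collapse steps 1-3 into a single pipeline with complete(), which expands each ID to one row per day in its observed range and fills val with NA; a sketch, assuming df with the Date column already coerced as above:
library(tidyr)
df %>%
  group_by(ID) %>%
  complete(Date = seq(min(Date), max(Date), by = "1 day")) %>%
  summarize(percent_missing = mean(is.na(val)) * 100)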
Assuming that NAs in the original data should also be counted as missing, this will do so.
Calculate the number of days between the min and max dates as an intermediate variable.
Then calculate the number of missing days as the number of days minus the number of non-NA observations, and turn that into a percentage.
df %>%
  group_by(ID) %>%
  mutate(numdays = as.numeric(max(Date) - min(Date)) + 1,
         pctmissing = (numdays - sum(!is.na(val))) / numdays)

Spread with duplicate identifiers (using tidyverse and %>%) [duplicate]

This question already has answers here:
Reshaping data in R with "login" "logout" times
(6 answers)
Closed 5 years ago.
My data looks like this (dput below):
I am trying to reshape it so that each id has its start and end dates as separate columns, one row per start/end pair.
I would like to do this in tidyverse using %>%-chaining.
df <-
structure(list(id = c(2L, 2L, 4L, 5L, 5L, 5L, 5L), start_end = structure(c(2L,
1L, 2L, 2L, 1L, 2L, 1L), .Label = c("end", "start"), class = "factor"),
date = structure(c(6L, 7L, 3L, 8L, 9L, 10L, 11L), .Label = c("1979-01-03",
"1979-06-21", "1979-07-18", "1989-09-12", "1991-01-04", "1994-05-01",
"1996-11-04", "2005-02-01", "2009-09-17", "2010-10-01", "2012-10-06"
), class = "factor")), .Names = c("id", "start_end", "date"
), row.names = c(3L, 4L, 7L, 8L, 9L, 10L, 11L), class = "data.frame")
What I have tried:
data.table::dcast( df, formula = id ~ start_end, value.var = "date", drop = FALSE ) # does not work because it summarises the data
tidyr::spread( df, start_end, date ) # does not work because of duplicate values
df$id2 <- 1:nrow(df)
tidyr::spread( df, start_end, date ) # does not work because the dataset now has too many rows.
These questions do not answer my question:
Using spread with duplicate identifiers for rows (because they summarise)
R: spread function on data frame with duplicates (because they paste the values together)
Reshaping data in R with "login" "logout" times (because not specifically asking for/answered using tidyverse and chaining)
We can use tidyverse. After grouping by 'start_end' and 'id', create a sequence column 'ind', then spread from 'long' to 'wide' format.
library(dplyr)
library(tidyr)
df %>%
  group_by(start_end, id) %>%
  mutate(ind = row_number()) %>%
  spread(start_end, date) %>%
  select(start, end)
# id start end
#* <int> <fctr> <fctr>
#1 2 1994-05-01 1996-11-04
#2 4 1979-07-18 NA
#3 5 2005-02-01 2009-09-17
#4 5 2010-10-01 2012-10-06
Or, using tidyr 1.0.0:
chop(df, date) %>%
spread(start_end, date) %>%
unnest(c(start, end))
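With current tidyr, pivot_wider() handles the duplicate-identifier case directly once an explicit row id is added; a sketch using the same df:
df %>%
  group_by(id, start_end) %>%
  mutate(ind = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = start_end, values_from = date) %>%
  select(id, start, end)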

reshape: cast oddity

Either it's late, or I've found a bug, or cast doesn't like colnames with "." in them. This all happens inside a function, but it "doesn't work" outside of a function just as much as it doesn't work inside of it.
x <- structure(list(df.q6 = structure(c(1L, 1L, 1L, 11L, 11L, 9L,
4L, 11L, 1L, 1L, 2L, 2L, 11L, 5L, 4L, 9L, 4L, 4L, 1L, 9L, 4L,
10L, 1L, 11L, 9L), .Label = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k"), class = "factor"), df.s5 = structure(c(4L,
4L, 1L, 2L, 4L, 4L, 4L, 3L, 4L, 1L, 2L, 1L, 2L, 4L, 1L, 3L, 4L,
2L, 2L, 4L, 4L, 4L, 2L, 2L, 1L), .Label = c("a", "b", "c", "d",
"e"), class = "factor")), .Names = c("df.q6", "df.s5"), row.names = c(NA,
25L), class = "data.frame")
cast(x, df.q6 + df.s5 ~., length)
No worky.
However, if:
colnames(x) <- c("variable", "value")
cast(x, variable + value ~., length)
Works like a charm.
I use a similar solution to what Spacedman points out.
#take your data.frame x with its two columns
#add a value column
x$value <- 1
#apply your cast verbatim
cast(x, df.q6 + df.s5 ~ ., length)
df.q6 df.s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
6 d a 1
7 d b 1
8 d d 3
9 e d 1
10 i a 1
11 i c 1
12 i d 2
13 j d 1
14 k b 3
15 k c 1
16 k d 1
Hopefully that helps!
Jay
Nothing to do with the dots in the colnames (easily shown!).
If your dataframe doesn't have a column called 'value', then cast() guesses which column is the value - in this case it guesses 'df.s5', as it is the last column. (This is what you get when you melt() data.) It then renames that column to 'value' before calling reshape1. Now the column 'df.s5' is no more, yet it's there on the left of your formula. Uh oh.
You are using the value column in the formula, which is an odd thing to do. None of the cast examples do that. What are you trying to do here?
You could add an ad-hoc column as a dummy value:
> cast(cbind(x, 1), df.q6 + df.s5 ~ ., length)
Using 1 as value column. Use the value argument to cast to override this choice
df.q6 df.s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
[etc]
But I suspect there's a better way to get the number of repeated observations (rows) in a data frame - which is your real question!
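For simply counting repeated rows, plain table() does it without any reshape machinery; a sketch (the Freq > 0 subset drops the empty combinations that table() also tabulates):
counts <- as.data.frame(table(df.q6 = x$df.q6, df.s5 = x$df.s5))
counts[counts$Freq > 0, ]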
If you are looking for an easy solution, dcast in the reshape2 package can help you:
library(reshape2)
dcast(x, df.q6 + df.s5 ~., length)
