Complete and fill data.frame with multiple conditions - r

I want to complete a data.frame with all combinations of two variables but with two conditions.
Here is my data.frame:
Data <-
data.frame(
A = rep(c("a", "b", "c", "d"), each = 2),
N = 1,
Type = c("i", "i", "I", "i", "i", "i", "I", "I")
)
> Data
A N Type
1 a 1 i
2 a 1 i
3 b 1 I
4 b 1 i
5 c 1 i
6 c 1 i
7 d 1 I
8 d 1 I
Now I want to complete that data.frame with all combinations of A and Type, but only if A != "a" and Type == "I". So there only has to be one additional row, the row with A == "c" and Type == "I". Furthermore, N should be filled with 0, see my desired output below. Is there an elegant way to achieve this? It would be great with tidyr's complete but it's OK if not. If the order was like here it would be even better.
> Data
A N Type
1 a 1 i
2 a 1 i
3 b 1 I
4 b 1 i
5 c 1 i
6 c 1 i
7 c 0 I
8 d 1 I
9 d 1 I

Perhaps, you can try this -
library(dplyr)
library(tidyr)
Data %>%
mutate(Type = factor(Type)) %>%
filter(!A %in% A[Type == "I"], A != 'a') %>%
complete(A, Type, fill = list(N = 0)) %>%
bind_rows(Data %>% filter(A %in% A[Type == "I"] | A == 'a')) %>%
arrange(A, -N)
# A Type N
# <chr> <chr> <dbl>
#1 a i 1
#2 a i 1
#3 b I 1
#4 b i 1
#5 c i 1
#6 c i 1
#7 c I 0
#8 d I 1
#9 d I 1
Filter for the A values you are interested in (filter(!A %in% A[Type == "I"], A != 'a')), complete those A values, bind the result to the remaining rows, and arrange.

You could do something like this
library(tidyverse)
Data <-
data.frame(
A = rep(c("a", "b", "c", "d"), each = 2),
N = 1,
Type = c("i", "i", "I", "i", "i", "i", "I", "I")
)
Data %>%
bind_rows(
complete(Data, A, nesting(Type)) %>%
filter(A != "a" & Type == "I" & is.na(N))
) %>%
mutate(N = replace_na(N, 0)) %>%
arrange(A, -N)
I filter on is.na(N) to ensure that only the newly created rows are added.
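For comparison, here is a minimal base R sketch of the same idea: build the candidate combinations that satisfy both conditions, keep only those not yet present, and append them with N = 0. The helper names wanted, present and missing_rows are illustrative and do not come from the answers above.

```r
Data <- data.frame(
  A = rep(c("a", "b", "c", "d"), each = 2),
  N = 1,
  Type = c("i", "i", "I", "i", "i", "i", "I", "I"),
  stringsAsFactors = FALSE
)

# Candidate combinations: every A except "a", paired with Type "I"
wanted <- expand.grid(A = setdiff(unique(Data$A), "a"), Type = "I",
                      stringsAsFactors = FALSE)

# Keep only the candidates that do not already occur in Data
present <- paste(Data$A, Data$Type)
missing_rows <- wanted[!paste(wanted$A, wanted$Type) %in% present, , drop = FALSE]
missing_rows$N <- rep(0, nrow(missing_rows))

# Append the new rows and sort like the desired output
result <- rbind(Data, missing_rows[names(Data)])
result <- result[order(result$A, -result$N), ]
rownames(result) <- NULL
result
```

With the example data, the only row added should be the one with A == "c" and Type == "I".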

Related

How to count the frequency of unique factor across each row in r dataframe

I have a dataset like the following:
Age Monday Tuesday Wednesday
6-9 a b a
6-9 b b c
6-9 c a
9-10 c c b
9-10 c a b
Using R, I want to get the following data set/ results (where each column represents the total frequency of each of the unique factor):
Age a b c
6-9 2 1 0
6-9 0 2 1
6-9 1 0 1
9-10 0 1 2
9-10 1 1 1
Note: My data also contains missing values
A couple of quick and dirty tidyverse solutions; there should be a way to reduce the number of steps, though.
library(tidyverse) # install.packages("tidyverse")
input <- tribble(
~Age, ~Monday, ~Tuesday, ~Wednesday,
"6-9", "a", "b", "a",
"6-9", "b", "b", "c",
"6-9", "", "c", "a",
"9-10", "c", "c", "b",
"9-10", "c", "a", "b"
)
# pivot solution
input %>%
rowid_to_column() %>%
# only character columns can hold "", so leave the integer rowid column alone
mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
pivot_longer(cols = -c(rowid, Age), values_drop_na = TRUE) %>%
count(rowid, Age, value) %>%
pivot_wider(id_cols = c(rowid, Age), names_from = value, values_from = n, values_fill = list(n = 0)) %>%
select(-rowid)
# manual solution (if only a, b, c are expected as options)
input %>%
unite(col = "combine", Monday, Tuesday, Wednesday, sep = "") %>%
transmute(
Age,
a = str_count(combine, "a"),
b = str_count(combine, "b"),
c = str_count(combine, "c")
)
In base R, we can replace empty values with NA, get unique values in the dataframe, and use apply row-wise and count the occurrence of values using table.
df[df == ''] <- NA
vals <- unique(na.omit(unlist(df[-1])))
cbind(df[1], t(apply(df, 1, function(x) table(factor(x, levels = vals)))))
# Age a b c
#1 6-9 2 1 0
#2 6-9 0 2 1
#3 6-9 1 0 1
#4 9-10 0 1 2
#5 9-10 1 1 1
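The base R snippet above assumes that a data frame named df already exists. A self-contained sketch, with df typed in from the question's table (the empty string stands in for the missing value, which is an assumption about how the data was read in), could look like this:

```r
df <- data.frame(
  Age = c("6-9", "6-9", "6-9", "9-10", "9-10"),
  Monday = c("a", "b", "", "c", "c"),
  Tuesday = c("b", "b", "c", "c", "a"),
  Wednesday = c("a", "c", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing values
df[df == ""] <- NA

# All distinct values across the day columns, in a stable order
vals <- sort(unique(na.omit(unlist(df[-1]))))

# Count each value per row; fixed levels give 0 for absent values
counts <- t(apply(df[-1], 1, function(x) table(factor(x, levels = vals))))
result <- cbind(df[1], counts)
result
```

Here only the day columns (df[-1]) are passed to apply, so the Age column never enters the counts.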

Replace column values using key value dataframe

I must imagine this question is not unique, but I was struggling with which words to search for so if this is redundant please point me to the post!
I have a dataframe
test <- data.frame(x = c("a", "b", "c", "d", "e"))
x
1 a
2 b
3 c
4 d
5 e
And I'd like to replace SOME of the values using a separate data frame
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
Resulting in:
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
A base R solution using match + replace
test <- within(test,x <- replace(as.character(x),match(metadata$a,x),as.character(metadata$b)))
such that
> test
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Importing your data with stringsAsFactors = FALSE and using dplyr and stringr, you can do:
test %>%
mutate(x = str_replace_all(x, setNames(metadata$b, metadata$a)))
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Or, using the basic idea from @Sotos (look up the replacement with match and fall back to the original value where there is none):
test %>%
mutate(x = coalesce(metadata$b[match(x, metadata$a)], x))
You can do,
test$x[test$x %in% metadata$a] <- na.omit(metadata$b[match(test$x, metadata$a)])
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Here's one approach, though I presume there are shorter ones:
library(dplyr)
test %>%
left_join(metadata, by = c("x" = "a")) %>%
mutate(b = coalesce(b, x))
# x b
#1 a a
#2 b b
#3 c REPLACE_1
#4 d REPLACE_2
#5 e e
Note: I have made the data types match by loading metadata as character, not factors:
metadata <- data.frame(stringsAsFactors = FALSE,
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
You can use match to make this update join.
i <- match(metadata$a, test$x)
test$x[i] <- metadata$b
# test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Or:
i <- match(test$x, metadata$a)
j <- !is.na(i)
test$x[j] <- metadata$b[i[j]]
test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Data:
test <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"), stringsAsFactors = FALSE)
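The two match variants behave differently when test$x contains repeated values: match(metadata$a, test$x) only returns the first position of each key, whereas match(test$x, metadata$a) looks up every row. A small sketch with a made-up vector that repeats "c" (the data here is hypothetical, not from the question):

```r
test <- data.frame(x = c("a", "c", "c", "d"), stringsAsFactors = FALSE)
metadata <- data.frame(
  a = c("c", "d"),
  b = c("REPLACE_1", "REPLACE_2"),
  stringsAsFactors = FALSE
)

# Variant 1: one position per key, so only the first "c" is replaced
first_only <- test
i <- match(metadata$a, first_only$x)
first_only$x[i] <- metadata$b

# Variant 2: one lookup per row, so every "c" is replaced
all_matches <- test
i <- match(all_matches$x, metadata$a)
j <- !is.na(i)
all_matches$x[j] <- metadata$b[i[j]]

first_only$x
all_matches$x
```

If the key column can contain duplicates, the second variant is the safer choice.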

How to fill a column based on a condition using sum() for matches in r

I have struggles filling a column based on a condition. Maybe my approach is not in the right direction. I don't know. My conditions are as follow:
2 "b"s and 1 "a" in a row, write in column "match" "B"
2 "c"s in a row, write in column "match" "C"
for anything else fill NA
So far I did the following but I see that this is not quite accurate since my new vector is not created from the rows but the entire column, and it still doesn't work.
set.seed(123)
df_letters <- data.frame(basket1 = sample(letters[1:3], 5, replace = TRUE, prob = c(0.85,0.10,0.5)),
basket2 = sample(letters[1:3], 5, replace = TRUE, prob = c(0.10,0.85,0.5)),
basket3 = sample(letters[1:3], 5, replace = TRUE, prob=c(0.5,0.10,0.85)),
stringsAsFactors = FALSE)
df_letters %>% mutate(match = ifelse(sum(as.character(as.vector(df_letters)) == "c")==2, "C",
ifelse((sum(as.character(as.vector(df_letters)) == "b")==2) & (sum(as.character(as.vector(df_letters)) == "a")==1) ,"B", NA )))
My desired output is:
> df_letters
basket1 basket2 basket3 match
1 a b b B
2 c b c C
3 a c a <NA>
4 c b c C
5 b b c <NA>
Many thanks in advance!
One dplyr option could be:
df_letters %>%
mutate(match = case_when(rowSums(select(., starts_with("basket")) == "b") == 2 & rowSums(select(., starts_with("basket")) == "a") == 1 ~ "B",
rowSums(select(., starts_with("basket")) == "c") == 2 ~ "C",
TRUE ~ NA_character_))
basket1 basket2 basket3 match
1 a b b B
2 c b c C
3 a c a <NA>
4 c b c C
5 b b c <NA>
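This is how to achieve this in base R:
df_letters$match <- apply(df_letters[c("basket1", "basket2", "basket3")], 1, function(x) {
  # fixed levels give a count of 0 for letters that are absent from a row
  count <- table(factor(x, levels = c("a", "b", "c")))
  if (count[["a"]] == 1 && count[["b"]] == 2) "B"
  else if (count[["c"]] == 2) "C"
  else NA_character_
})
The idea is to tabulate each row over the fixed levels a, b and c, so that every count can be accessed by name even when a letter does not occur in the row.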
Output
basket1 basket2 basket3 match
1 a b b B
2 c b c C
3 a c a <NA>
4 c b c C
5 b b c <NA>
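A vectorised base R sketch of the same rule, using rowSums as in the dplyr answer but without any packages. The data frame is typed in from the desired output in the question rather than regenerated with sample(), since the sampled values can differ between R versions; the names baskets, n_a, n_b and n_c are illustrative.

```r
df_letters <- data.frame(
  basket1 = c("a", "c", "a", "c", "b"),
  basket2 = c("b", "b", "c", "b", "b"),
  basket3 = c("b", "c", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Work on a character matrix of the basket columns only
baskets <- as.matrix(df_letters[c("basket1", "basket2", "basket3")])

# Per-row counts of each letter
n_a <- rowSums(baskets == "a")
n_b <- rowSums(baskets == "b")
n_c <- rowSums(baskets == "c")

# Two "b"s and one "a" give "B", two "c"s give "C", anything else NA
df_letters$match <- ifelse(n_b == 2 & n_a == 1, "B",
                           ifelse(n_c == 2, "C", NA_character_))
df_letters
```

Because the counts are computed for all rows at once, there is no need for apply or for handling letters that are absent from a row.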

R - building new variables from sequenced data

This is an update / follow-up on this question. The answer outlined there doesn't meet the new requirements.
I am looking for an efficient way (data.table?) to construct two new measures for each ID.
Measure 1 and Measure 2 need to meet the following conditions:
Condition 1:
Find a sequence of three rows for which:
the first count > 0
the second count > 1 and
the third count ==1.
Condition 2 for Measure 1:
takes the value of the elements in product of the third row of the sequence that are:
in the product of second row of sequence and
NOT in the stock of the first row in sequence.
Condition 2 for measure 2:
takes the value of the elements in product of the last row of the sequence that are:
NOT in the product of second row of sequence
NOT in the stock of the first row in sequence.
Data:
df2 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df2
ID seqs count product stock
1 1 1 2 A A
2 1 2 1 B A,B
3 1 3 3 C A,B,C
4 1 4 1 A,C,E A,B,C,E
5 1 5 1 A,B A,B,C,E
6 1 6 2 A,B,C A,B,C,E
7 1 7 3 D A,B,C,D,E
8 2 1 1 A A
9 2 2 2 B A,B
10 2 3 1 A A,B
11 3 1 3 A A
12 3 2 1 A,B,C A,B,C
13 3 3 4 D A,B,C,D
14 3 4 1 D A,B,C,D
The desired output looks like this:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
How would you code this?
A few things you need to know to be able to do this:
shift function to compare values in your groups
separate_rows function to split your strings to get to the normalised data view.
library(data.table)
dt <- data.table(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E", "A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
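# take the lead values within each ID so that a window never spans two IDs
dt[, count.2 := shift(count, type = "lead"), by = ID]
dt[, count.3 := shift(count, n = 2, type = "lead"), by = ID]
dt[, product.2 := shift(product, type = "lead"), by = ID]
dt[, product.3 := shift(product, n = 2, type = "lead"), by = ID]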
dt <- dt[count > 0 & count.2 > 1 & count.3 == 1]
dt <- unique(dt, by = "ID")
library(tidyr)
dt.measure <- separate_rows(dt, product.3, sep = ",")
dt.measure <- separate_rows(dt.measure, stock, sep = ",")
dt.measure <- separate_rows(dt.measure, product, sep = ",")
dt.measure[, measure.1 := (product.3 == product.2 & product.3 != stock)]
dt.measure[, measure.2 := (product.3 != product.2 & product.3 != stock)]
res <- dt.measure[,
.(
measure.1 = max(ifelse(measure.1, product.3, NA_character_), na.rm = TRUE),
measure.2 = max(ifelse(measure.2, product.3, NA_character_), na.rm = TRUE)
),
ID
]
dt <- merge(dt, res, by = "ID")
dt[, .(ID, measure.1, measure.2)]
# ID measure.1 measure.2
# 1: 1 C E
# 2: 2 <NA> <NA>
# 3: 3 D <NA>
I'm not sure what the criteria for efficient is, but here's an approach using embed and tidyverse style. It filters down so you are working with less and less.
Loading up the data and packages (note that setdiff and intersect, used later on, are from dplyr)
library(purrr)
library(dplyr)
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "A,C,E", "A,B",
"A,B,C", "D", "A", "B", "A", "A",
"A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C,E", "A,B,C,E",
"A,B,C,E", "A,B,C,D,E", "A", "A,B", "A,B", "A",
"A,B,C", "A,B,C,D", "A,B,C,D"),
stringsAsFactors = FALSE)
Define a helper function to evaluate condition 1
meetsCond1 <- function(rseg) {
seg <- rev(rseg)
all(seg[1] > 0, seg[2] > 1, seg[3] == 1)
}
The embed function warps a time series into a matrix where essentially each row is a window of the length of interest. Using apply, you filter down to which rows start relevant sequences.
cond1Match<- embed(df1$count, 3) %>%
apply(1, meetsCond1) %>%
which()
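To see what embed does here, a small sketch on the first five count values of ID 1 may help. Each row of the result is one window of length 3, stored in reverse order (newest value first), which is why the helper above calls rev before testing. The names counts, windows, in_order and starts are illustrative.

```r
counts <- c(2, 1, 3, 1, 1)   # first five count values of ID 1

# One row per window of length 3; columns run from newest to oldest
windows <- embed(counts, 3)
#      [,1] [,2] [,3]
# [1,]    3    1    2
# [2,]    1    3    1
# [3,]    1    1    3

# Reverse the columns to read each window in its original order
in_order <- windows[, 3:1, drop = FALSE]

# Condition 1: first > 0, second > 1, third == 1
starts <- which(in_order[, 1] > 0 & in_order[, 2] > 1 & in_order[, 3] == 1)
starts
```

The index returned by which is the position of the first row of a matching window, so adding 1 and 2 gives the second and third rows.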
You can translate that back to final products, the previous products, and stock rows of interest to determine the measures by adding offsets. Split them into a list of individual components.
finalProds <- df1$product[cond1Match + 2] %>%
strsplit(",")
prevProds <- df1$product[cond1Match + 1] %>%
strsplit(",")
initialStock <- df1$stock[cond1Match] %>%
strsplit(",")
For both measures, neither of them can be in the stock.
notStock <- map2(finalProds, initialStock, ~.x[!(.x %in% .y)])
Then generate your data.frame by retrieving the seqs and ID values of the window. The measures then are just the intersect and setdiff of the final products with those in the previous rows.
data.frame(ID = df1$ID[cond1Match],
seq1 = df1$seqs[cond1Match],
seq2 = df1$seqs[cond1Match + 1],
seq3 = df1$seqs[cond1Match + 2],
measure1 = imap_chr(notStock,
~intersect(.x, prevProds[[.y]]) %>%
# collapse (not sep) joins several elements into one string
{if(length(.) == 0) "" else paste(., collapse = ",")}
),
measure2 = imap_chr(notStock,
~setdiff(.x, prevProds[[.y]]) %>%
{if(length(.) == 0) "" else paste(., collapse = ",")}
),
stringsAsFactors = FALSE
) %>%
slice(match(unique(ID), ID))
which yields the desired output, with at most one line per ID. In the original post, you specified that you want all matches reported. Removing the slice call would then instead yield
#> ID seq1 seq2 seq3 measure1 measure2
#> 1 1 2 3 4 C E
#> 2 1 6 7 1
#> 3 2 1 2 3
#> 4 2 3 1 2 C
#> 5 3 2 3 4 D
If you're looking to really squeeze out efficiency, you might gain some by inlining the definitions of finalProds, prevProds, and initialStock instead of assigning them to variables first. Unless your set of matches is really large, I would expect the gain to be negligible.
A rolling window approach using data.table with base R code in j:
library(data.table)
cols <- c("product", "stock")
setDT(df2)[, (cols) := lapply(.SD, function(x) strsplit(as.character(x), split=",")), .SDcols=cols]
ans <- df2[,
transpose(lapply(1L:(.N-2L), function(k) {
if(count[k]>0 && count[k+1L]>1 && count[k+2L]==1) {
m1 <- setdiff(intersect(product[[k+2L]], product[[k+1L]]), stock[[k]])
m2 <- setdiff(setdiff(product[[k+2L]], product[[k+1L]]), stock[[k]])
c(seq1=seqs[k], seq2=seqs[k+1L], seq3=seqs[k+2L],
measure1=if(length(m1) > 0) paste(m1, collapse=",") else "",
measure2=if(length(m2) > 0) paste(m2, collapse=",") else "")
}
}), ignore.empty=TRUE),
ID]
setnames(ans, names(ans)[-1L], c(paste0("seq", 1:3), paste0("measure", 1:2)))
ans
output:
ID seq1 seq2 seq3 measure1 measure2
1: 1 2 3 4 C E
2: 2 1 2 3
3: 3 2 3 4 D
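The set logic behind the two measures can be checked by hand on a single window. The sketch below uses the rows of ID 1 with seqs 2 to 4 from the question's data; the variable names are illustrative.

```r
# One window from the question's data (ID 1, seqs 2 to 4)
stock_first    <- strsplit("A,B", ",")[[1]]     # stock of the first row
product_second <- strsplit("C", ",")[[1]]       # product of the second row
product_third  <- strsplit("A,C,E", ",")[[1]]   # product of the third row

# Measure 1: in the third product AND the second product, but NOT in the first stock
measure1 <- setdiff(intersect(product_third, product_second), stock_first)

# Measure 2: in the third product, NOT in the second product, NOT in the first stock
measure2 <- setdiff(setdiff(product_third, product_second), stock_first)

measure1
measure2
```

This gives "C" for measure 1 and "E" for measure 2, matching the first row of the desired output.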

Count number of duplicates in other dataframe

I have two data.frames, A and B. Both of them have a column called key.
Now I'd like to know how many duplicates for A$key there are in B$key.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
It should be A=2, B=3, C=0 and D=1. What's the easiest way to do this?
Use table
table(factor(B$key, levels = sort(unique(A$key))))
#A B C D
#2 3 0 1
factor is needed here such that we also 'count' entries that do not appear in B$key, that is C.
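A short sketch of the difference, using the data from the question: without the factor, table only reports the values that actually occur in B$key, so C is dropped instead of being counted as 0. The names plain and with_levels are illustrative.

```r
A <- data.frame(key = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
B <- data.frame(key = c("A", "A", "B", "B", "B", "D"), stringsAsFactors = FALSE)

# Plain table: only values present in B$key appear (A, B, D)
plain <- table(B$key)

# Table over fixed levels: every key from A appears, absent ones with 0
with_levels <- table(factor(B$key, levels = sort(unique(A$key))))

names(plain)
with_levels
```

A side effect of fixing the levels is that any value in B$key that is not a key in A becomes NA and is left out of the counts.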
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
library(dplyr)
library(tidyr)
B %>%
filter(key %in% A$key) %>% # keep values that appear in A
count(key) %>% # count values
complete(key = A$key, fill = list(n = 0)) # add any values from A that don't appear
# # A tibble: 4 x 2
# key n
# <chr> <dbl>
# 1 A 2
# 2 B 3
# 3 C 0
# 4 D 1
Using tidyverse you can do:
A %>%
left_join(B %>% #Merging df A with df B for which the count in "key" was calculated
group_by(key) %>%
tally(), by = c("key" = "key")) %>%
mutate(n = ifelse(is.na(n), 0, n)) #Replacing NA with 0
key n
1 A 2
2 B 3
3 C 0
4 D 1
Do you actually mean how many occurrences of each value of A$key there are in B$key?
You can obtain this by coding B$key as factor with the unique values of A$key as levels.
o <- table(factor(B$key, levels=unique(A$key)))
Yielding:
> o
A B C D
2 3 0 1
If you really want to count duplicates, do
dupes <- ifelse(o - 1 < 0, 0, o - 1)
Yielding:
> dupes
A B C D
1 2 0 0
