R: select top products in groups

I need to select the 3 top-selling products in each category, but if a category does not have 3 products I should add more products from the best available category ("a" being the best category, "c" the worst).
The products change every day, so I would like to do this automatically. Previously I chose the top 3 products and did not bother if fewer were available, but unfortunately the conditions have changed. For that I used the following code:
Selected <- items %>% group_by(Cat) %>% dplyr::filter(row_number() <= 3) %>% ungroup()
Sample data:
items <- data.frame(Cat = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results:
"a", "a", "a", "b", "b", "c", "c", "c", "c"
Sample data - 2:
items <- data.frame(Cat = c("a", "a", "a", "a", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results - 2:
"a", "a", "a", "a", "b", "c", "c", "c", "c"

Here is a possible answer. I'm not entirely sure if I'm getting what you are after - if not let me know.
library(dplyr)
items <- data.frame(Cat = c("a", "a", "a",
                            "b", "b",
                            "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
                    ranking = 1:15)
First we order the data from best to worst category and add a row number within each category.
Selected <- items %>% group_by(Cat) %>%
mutate(id = row_number()) %>%
ungroup() %>% arrange(Cat)
Then we can filter for the top 3 in each category and fill up with the remaining rows, from best to worst category.
Selected %>% filter(id<=3) %>% # Select top 3 in each group
bind_rows(Selected %>% filter(id>3)) %>% # Merge with the ones that weren't selected
mutate(id=row_number()) %>%
filter(id <= 3*length(unique(Cat))) # Extract the right number
This produces
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 4 4
5 b 5 5
6 c 6 6
7 c 7 7
8 c 8 8
9 c 9 9
The second data example yields
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 5 4
5 c 6 5
6 c 7 6
7 c 8 7
8 a 4 8
9 c 9 9
which seems to be what you were after.
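The same idea can be sketched in base R. This is only an illustration: select_top() is a hypothetical helper name, and it assumes that categories sort alphabetically from best to worst and that a lower ranking means a better product. It returns the rows sorted by category, as in the desired results of the question.

```r
# Second sample data set from the question
items <- data.frame(Cat = c("a", "a", "a", "a", "b",
                            "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
                    ranking = 1:15)

# Hypothetical helper: keep up to n rows per category, then top up with
# leftover rows (best category first) until n * (number of categories)
# rows are selected.
select_top <- function(items, n = 3) {
  items <- items[order(items$Cat, items$ranking), ]
  # position of each row within its category
  pos <- ave(seq_len(nrow(items)), items$Cat, FUN = seq_along)
  target <- n * length(unique(items$Cat))
  # top-n rows first, then the leftovers in category order
  picked <- c(which(pos <= n), which(pos > n))
  out <- items[head(picked, target), ]
  out[order(out$Cat, out$ranking), ]
}

selected <- select_top(items)
selected$Cat
```

If fewer rows exist than the target, head() simply returns everything that is available.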


Add new value to table() in order to be able to use chi square test

From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare all the features across the two datasets using a chi-square test. The problem is that one of the two datasets is much smaller than the other, so for some features it lacks values that are present in the larger one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".
How can I add the missing values to the smaller dataset so that I can use the chi-square test?
Example:
I want to use the chi-square test on the same feature in the two datasets:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
So what I would like to do is add the value d to df2 with a count of 0, so that I can use the chi-square test.
The table output for df2 can be extended by converting the column to a factor with the levels specified:
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
However, table() with two inputs requires vectors of the same length, which df1$var1 and df2$var1 are not. Instead, we can bind the datasets and then call table() on the result:
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
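As a minimal sketch of the final step, the combined table can be passed straight to chisq.test(). The vectors var1_a and var1_b below are stand-ins for df1$var1 and df2$var1, built to reproduce the counts from the question; the names are only for illustration.

```r
# Stand-ins for df1$var1 and df2$var1 with the counts from the question
var1_a <- rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18))
var1_b <- rep(c("a", "b", "c"), times = c(8, 1, 12))

# One row per dataset, one column per value; "d" gets a 0 for the second set
counts <- table(
  sample = rep(c("df1", "df2"), times = c(length(var1_a), length(var1_b))),
  var1   = c(var1_a, var1_b)
)
counts

# chisq.test() may warn that the approximation is unreliable when
# expected counts are small, as they are for some cells here
result <- chisq.test(counts)
result$p.value
```

With small expected counts like these, fisher.test(counts) or chisq.test(counts, simulate.p.value = TRUE) may be worth considering instead.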

R - Creating a new variable based on multiple observations

My dataset represents patients who have been treated multiple times. The dataset is in long format; patients get treatment A, C or S, or a combination. A and C are never combined.
Simply put, the data looks something like this:
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
             treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", NA))
I would like to create a new variable based on whether a patient had treatment A, treatment C, or neither, so that the end result looks something like:
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", "S"),
group = c("A", "A", "A", "C", "C", "C", "C", "C", "C", "C", "C", "A", "A", "A", "S", "S"))
How can I best approach this? I'm struggling with how to deal with multiple observations per ID.
Thank you!
You can use group_by() in combination with mutate() and case_when() to achieve this:
library(tidyverse)
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", NA))
df %>%
group_by(PatientID) %>%
mutate(groups = case_when("A" %in% treatment ~ "A",
"C" %in% treatment ~ "C",
TRUE ~ "S"))
#> # A tibble: 16 × 3
#> # Groups: PatientID [6]
#> PatientID treatment groups
#> <dbl> <chr> <chr>
#> 1 1 A A
#> 2 1 A A
#> 3 1 S A
#> 4 2 C C
#> 5 2 S C
#> 6 3 S C
#> 7 3 C C
#> 8 3 C C
#> 9 3 <NA> C
#> 10 4 C C
#> 11 4 <NA> C
#> 12 5 <NA> A
#> 13 5 S A
#> 14 5 A A
#> 15 6 S S
#> 16 6 <NA> S
Created on 2022-08-18 with reprex v2.0.2
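For comparison, the same per-patient rule (A takes precedence, then C, otherwise S) can be sketched in base R with ave(). This is just an alternative illustration using plain vectors in place of the tibble columns.

```r
PatientID <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6)
treatment <- c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA,
               NA, "S", "A", "S", NA)

# ave() applies the function to each patient's treatments and recycles
# the single returned label across all of that patient's rows
group <- ave(treatment, PatientID, FUN = function(x) {
  if ("A" %in% x) "A" else if ("C" %in% x) "C" else "S"
})
group
```

Because %in% never returns NA, missing treatments do not need special handling here.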

How to generate string counts in different samples by R

Let's say I have a data table as follows:
ID1 ID2 ID3
-------------
a a b
a b b
b b b
c c c
c c d
c d d
d e
d e
e
Then I want to convert it to the following structure:
Samples ID1 ID2 ID3
-------------------
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
Would any of you please help me with R or bash code to achieve such transformation?
Try the R code below
> table(stack(df))
ind
values ID1 ID2 ID3
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
data
> dput(df)
structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame", row.names = c(NA,
-9L))
An option with tidyverse - reshape to 'long' format with pivot_longer, get the count and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything(), values_drop_na = TRUE, values_to = 'Samples') %>%
count(name, Samples) %>%
pivot_wider(names_from = name, values_from = n, values_fill = 0)
-output
# A tibble: 5 × 4
Samples ID1 ID2 ID3
<chr> <int> <int> <int>
1 a 2 1 0
2 b 1 2 3
3 c 3 2 1
4 d 2 1 2
5 e 1 0 2
data
df <- structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame",
row.names = c(NA,
-9L))
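Another way to see what both answers are doing is to count each column against a shared set of levels. The sketch below uses only base R; lv and counts are illustrative names.

```r
df <- data.frame(
  ID1 = c("a", "a", "b", "c", "c", "c", "d", "d", "e"),
  ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA),
  ID3 = c("b", "b", "b", "c", "d", "d", "e", "e", NA)
)

# All distinct sample labels across the columns (sort() drops the NAs)
lv <- sort(unique(unlist(df)))

# Count every column against the same levels so that absent labels show as 0
counts <- sapply(df, function(col) table(factor(col, levels = lv)))
counts
```

The result is a matrix with one row per label and one column per ID; as.data.frame(counts) converts it if a data frame is needed.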

Tidy way of arranging data frame rows according to target sorting orders

Back in 2015, I asked a similar question on this, but I would like to find a tidy way of doing this.
This is the best that I could come up with so far. It works, but having to change column types just for sorting seems "wrong" somehow. Then again, so does resorting to dplyr::*_join(), and match() comes with its own catches (plus it's hard to use in tidy contexts).
So is there a good/recommended way of doing this in the tidyverse?
Define function
library(magrittr)
arrange_by_target <- function(
x,
targets
) {
x %>%
# Transform arrange-by columns to factors so we can leverage the order of
# the levels:
dplyr::mutate_at(
names(targets),
function(.x, .targets = targets) {
.col <- deparse(substitute(.x))
factor(.x, levels = .targets[[.col]])
}
) %>%
# Actual arranging:
dplyr::arrange_at(
names(targets)
) %>%
# Clean up by recasting factor columns to their original type:
dplyr::mutate_at(
.vars = names(targets),
function(.x, .targets = targets) {
.col <- deparse(substitute(.x))
vctrs::vec_cast(.x, to = class(.targets[[.col]]))
}
)
}
Test
x <- tibble::tribble(
~group, ~name, ~value,
"A", "B", 1,
"A", "C", 2,
"A", "A", 3,
"B", "B", 4,
"B", "A", 5
)
x %>%
arrange_by_target(
targets = list(
group = c("B", "A"),
name = c("A", "B", "C")
)
)
#> # A tibble: 5 x 3
#> group name value
#> <chr> <chr> <dbl>
#> 1 B A 5
#> 2 B B 4
#> 3 A A 3
#> 4 A B 1
#> 5 A C 2
x %>%
arrange_by_target(
targets = list(
group = c("B", "A"),
name = c("A", "B", "C") %>% rev()
)
)
#> # A tibble: 5 x 3
#> group name value
#> <chr> <chr> <dbl>
#> 1 B B 4
#> 2 B A 5
#> 3 A C 2
#> 4 A B 1
#> 5 A A 3
Created on 2019-11-06 by the reprex package (v0.3.0)
The easiest way to accomplish this is to convert your character columns to factors, like so:
library(dplyr)
x %>%
  mutate(
    group = factor(group, c("A", "B")),
    name = factor(name, c("C", "B", "A"))
  ) %>%
  arrange(group, name)
Another option that I frequently use is to utilize joins. For example:
x <- tibble::tribble(
~group, ~name, ~value,
"A", "B", 1,
"A", "C", 2,
"A", "A", 3,
"B", "B", 4,
"B", "A", 5,
"A", "A", 6,
"B", "C", 7,
"A", "B", 8,
"B", "B", 9
)
custom_sort <- tibble::tribble(
~group, ~name,
"A", "C",
"A", "B",
"A", "A",
"B", "B",
"B", "A"
)
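# Join from the sort table so the result follows its row order
# (since dplyr 1.0.0, right_join() no longer orders rows by the right-hand table)
custom_sort %>% left_join(x, by = c("group", "name"))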
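If changing column types is the main objection, one possibility is to sort on the positions returned by match() without touching the columns. The following base R sketch uses the data and targets from the question; keys and x_sorted are illustrative names.

```r
x <- data.frame(
  group = c("A", "A", "A", "B", "B"),
  name  = c("B", "C", "A", "B", "A"),
  value = c(1, 2, 3, 4, 5)
)

targets <- list(
  group = c("B", "A"),
  name  = c("A", "B", "C")
)

# For each target column, the position of every value in its target order
keys <- Map(function(col, lv) match(x[[col]], lv), names(targets), targets)

# order() sorts by the first key, breaking ties with the next one;
# values missing from a target get NA and are sorted last
x_sorted <- x[do.call(order, unname(keys)), ]
x_sorted
```

The columns keep their original types, so no cleanup step is needed afterwards.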

Count frequency of elements matching other elements of another column in R

Say I have
Name<- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate<- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")
I want to reproduce the following:
Nam fra frb frc
A 2 2 1
B 3 0 1
C 0 2 1
Where fra, frb and frc are the frequency values of a, b and c values respectively in Cate for each category (A,B,C) of Nam.
I am looking for a faster code than the one I am using (subsetting Nam in each category and then calculate the frequencies)
We can do a dcast from data.table which is very efficient and quick
library(data.table)
dcast(data.table(Name, Cate), Name ~paste0("fr", Cate))
# Name fra frb frc
#1: A 2 2 1
#2: B 3 0 1
#3: C 0 2 1
A simple base R option would be
table(Name, Cate)
data
Name <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate <- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")
You can also use the xtabs() function:
xtabs(~Name + Cate)
For completeness' sake, here's a Hadleyverse solution:
library(dplyr)
library(tidyr)
data.frame(Name, Cate) %>%
count(Name, Cate) %>%
spread(key = Cate, value = n, fill = 0)
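To get the fra/frb/frc column names from the question with base R only, the table can be relabelled and converted to a data frame. This is a small sketch; counts and freq are illustrative names.

```r
Name <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C")
Cate <- c("a", "a", "b", "b", "c", "a", "a", "a", "c", "b", "b", "c")

# Cross-tabulate with Name in rows, then prefix the category columns
counts <- table(Name, Cate)
colnames(counts) <- paste0("fr", colnames(counts))

# as.data.frame.matrix() keeps the wide layout (one row per Name)
freq <- as.data.frame.matrix(counts)
freq
```

The Name values end up as row names; they can be moved into a column with freq$Nam <- rownames(freq) if required.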
