R - Creating a new variable based on multiple observations

My dataset represents patients who have been treated multiple times. The dataset is in long format; each patient gets treatment A, C, or S, or a combination of them. A and C are never combined.
Simply put, the data looks something like this:
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
             treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", NA))
I would like to create a new variable based on whether a patient ever had treatment A, treatment C, or neither, so that the end result looks something like:
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", "S"),
group = c("A", "A", "A", "C", "C", "C", "C", "C", "C", "C", "C", "A", "A", "A", "S", "S"))
How can I best approach this? I'm struggling with how to deal with multiple observations per ID.
Thank you!

You can use group_by() in combination with mutate() and case_when() to achieve this:
library(tidyverse)
df <- tibble(PatientID = c(1,1,1,2,2,3,3,3,3,4,4,5,5,5,6,6),
treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA, NA, "S", "A", "S", NA))
df %>%
  group_by(PatientID) %>%
  mutate(groups = case_when("A" %in% treatment ~ "A",
                            "C" %in% treatment ~ "C",
                            TRUE ~ "S"))
#> # A tibble: 16 × 3
#> # Groups: PatientID [6]
#> PatientID treatment groups
#> <dbl> <chr> <chr>
#> 1 1 A A
#> 2 1 A A
#> 3 1 S A
#> 4 2 C C
#> 5 2 S C
#> 6 3 S C
#> 7 3 C C
#> 8 3 C C
#> 9 3 <NA> C
#> 10 4 C C
#> 11 4 <NA> C
#> 12 5 <NA> A
#> 13 5 S A
#> 14 5 A A
#> 15 6 S S
#> 16 6 <NA> S
Created on 2022-08-18 with reprex v2.0.2
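The same per-patient rule can also be sketched in base R, which may help when the tidyverse is not available. This is only an illustration under assumptions: the helper name `classify_patient` is hypothetical, and the sketch keeps the answer's behaviour of labelling every patient without an A or C as "S", including patients whose treatments are all NA.

```r
# Same sample data as in the question, as a plain data.frame.
df <- data.frame(
  PatientID = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6),
  treatment = c("A", "A", "S", "C", "S", "S", "C", "C", NA, "C", NA,
                NA, "S", "A", "S", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one label for all treatments of a single patient.
# A takes priority, then C, otherwise "S" (this also covers all-NA patients).
classify_patient <- function(x) {
  label <- if ("A" %in% x) "A" else if ("C" %in% x) "C" else "S"
  rep(label, length(x))
}

# ave() applies the helper within each PatientID and returns a vector
# aligned with the original rows.
df$group <- ave(df$treatment, df$PatientID, FUN = classify_patient)
print(df)
```

Because `ave()` returns one value per row, the new column lines up with the long-format data without any explicit join.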

Related

Add new value to table() in order to be able to use chi square test

From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare all the features of the two datasets using a chi-square test. The problem is that one of the two datasets is much smaller than the other, so some features have values that are not present in the smaller one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".
How can I add the missing values to the smaller dataset so that I can use the chi-square test?
Example:
I want to use the chi-square test on the same feature in the two datasets:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
So what I would like to do is add the value d to the df2 table and set its count to 0, in order to be able to use the chi-square test.
The table output of df2 can be modified if we convert the column to a factor with the levels specified:
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
However, when table is called with two inputs, both must have the same length, which is not the case here. Instead, we can bind the two datasets together and then call table on the result:
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
var1
grp a b c d
1 2 5 7 18
2 8 1 12 0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
col2 = c(df1$var1, df2$var1)))
col2
col1 a b c d
1 2 5 7 18
2 8 1 12 0
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
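To connect this back to the original goal, the sketch below builds the two-row contingency table from per-dataset counts and passes it to `chisq.test`. It is a minimal illustration: the data are rebuilt from the counts shown in the question, and the names `lev` and `counts` are assumptions introduced here.

```r
# Rebuild the example data from the counts shown in the question.
df1 <- data.frame(var1 = rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18)),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var1 = rep(c("a", "b", "c"), times = c(8, 1, 12)),
                  stringsAsFactors = FALSE)

# Use the union of observed values as the common set of levels,
# so a value missing from one dataset is counted as 0 there.
lev <- sort(union(df1$var1, df2$var1))

counts <- rbind(
  df1 = table(factor(df1$var1, levels = lev)),
  df2 = table(factor(df2$var1, levels = lev))
)
print(counts)

# chisq.test() on a matrix treats it as a contingency table.
# With counts this small, R is expected to warn that the chi-squared
# approximation may be incorrect.
res <- chisq.test(counts)
print(res)
```

When several expected counts are small, `chisq.test(counts, simulate.p.value = TRUE)` or `fisher.test(counts)` may be worth considering instead.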

How to generate string counts in different samples by R

Let's say I have a data table as follows:
ID1  ID2  ID3
-------------
a    a    b
a    b    b
b    b    b
c    c    c
c    c    d
c    d    d
d         e
d         e
e
Then I want to convert it to the following structure:
Samples ID1 ID2 ID3
-------------------
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
Could anyone please help me with R or bash code to achieve this transformation?
Try the R code below:
> table(stack(df))
ind
values ID1 ID2 ID3
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
data
> dput(df)
structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame", row.names = c(NA,
-9L))
An option with tidyverse: reshape to 'long' format with pivot_longer, get the counts, and reshape back to 'wide' format with pivot_wider:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything(), values_drop_na = TRUE, values_to = 'Samples') %>%
count(name, Samples) %>%
pivot_wider(names_from = name, values_from = n, values_fill = 0)
Output:
# A tibble: 5 × 4
Samples ID1 ID2 ID3
<chr> <int> <int> <int>
1 a 2 1 0
2 b 1 2 3
3 c 3 2 1
4 d 2 1 2
5 e 1 0 2
data
df <- structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame",
row.names = c(NA,
-9L))
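If the base R result from `table(stack(df))` is needed as a data frame with a `Samples` column, as in the requested structure, one possible sketch is shown below. The variable names `tab` and `wide` are assumptions made for this illustration.

```r
df <- data.frame(
  ID1 = c("a", "a", "b", "c", "c", "c", "d", "d", "e"),
  ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA),
  ID3 = c("b", "b", "b", "c", "d", "d", "e", "e", NA),
  stringsAsFactors = FALSE
)

# stack() reshapes to long form (values, ind); table() then counts each
# value per original column and drops NA by default.
tab <- table(stack(df))

# Convert the two-way table to a wide data frame and move the row names
# into an explicit Samples column.
wide <- as.data.frame.matrix(tab)
wide <- data.frame(Samples = rownames(wide), wide, row.names = NULL)
print(wide)
```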

Extract list of values from column based upon other column

The following code:
df <- data.frame(
"letter" = c("a", "b", "c", "d", "e", "f"),
"score" = seq(1,6)
)
Results in the following dataframe:
letter score
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
I want to get the scores for a sequence of letters, for example the scores of c("f", "a", "d", "e"). It should result in c(6, 1, 4, 5).
What's more, I want to get the scores for c("c", "o", "f", "f", "e", "e"). Now the o is not in the letter column so it should return NA, resulting in c(3, NA, 6, 6, 5, 5).
What is the best way to achieve this? Can I use dplyr for this?
We can use match to create an index and extract the corresponding 'score'. If there is no match, match returns NA by default, so the extracted score is NA as well:
df$score[match(v1, df$letter)]
#[1] 3 NA 6 6 5 5
df$score[match(v2, df$letter)]
#[1] 6 1 4 5
data
v1 <- c("c", "o", "f", "f", "e", "e")
v2 <- c("f", "a", "d", "e")
If you want to use dplyr I would use a join:
df <- data.frame(
  "letter" = c("a", "b", "c", "d", "e", "f"),
  "score" = seq(1, 6)
)
library(dplyr)
df2 <- data.frame(letter = c("c", "o", "f", "f", "e", "e"))
left_join(df2, df, by = "letter")
letter score
1 c 3
2 o NA
3 f 6
4 f 6
5 e 5
6 e 5
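Another base R way to express the same lookup is a named vector, indexed by the letters of interest. This is a small sketch rather than a replacement for the answers above; the names `lookup` and `wanted` are introduced here for illustration.

```r
df <- data.frame(
  letter = c("a", "b", "c", "d", "e", "f"),
  score = 1:6,
  stringsAsFactors = FALSE
)

# Named vector: names are the letters, values are the scores.
lookup <- setNames(df$score, df$letter)

# Indexing by a name that does not exist ("o") yields NA.
wanted <- c("c", "o", "f", "f", "e", "e")
scores <- unname(lookup[wanted])
print(scores)
```

This relies on each letter appearing only once in `df`; with duplicated letters, only the first match is returned, which is the same behaviour as `match`.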

Counting frequencies of each letter for multiple column [duplicate]

I have a data frame as below:
> dfnew
C1 C2 C3 C4 C5 C6
1 A A G A G A
2 A T T T G G
3 T A G A T A
4 C A A A A G
5 C A T T T C
6 C A A A T A
7 T C T G A A
8 G A G C T A
9 C T A T G A
10 G A A A G G
11 G G T T T A
12 G A C T T A
13 T T C T T T
14 A T A G C T
15 A C A A A A
16 A A C A A A
17 T G G A A T
18 A A A A G T
19 G T G G <NA> <NA>
I want to get answer as below in one line of code in R without looping:
A 6 10 7 9 5 10
C 4 2 3 1 1 1
G 5 2 5 3 5 3
T 4 5 4 6 7 4
We can use sapply to loop over the columns, convert it to factor with levels specified and get the frequency with table
sapply(dfnew, function(x) table(factor(x, levels = c("A", "C", "G", "T"))))
Or using tidyverse
library(dplyr)
library(tidyr)
dfnew %>%
gather(key, val, na.rm = TRUE) %>%
count(key, val) %>%
spread(key, n)
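To see why the sapply approach fixes the factor levels, here is a small sketch on a reduced data frame (the name `dfsmall` and its contents are made up for illustration). Without a shared set of levels, columns that lack a letter would produce tables of different lengths, and sapply could not simplify them into a matrix.

```r
# Reduced example: C2 contains no "C" or "G" and has one missing value.
dfsmall <- data.frame(
  C1 = c("A", "A", "T", "C"),
  C2 = c("A", "T", "A", NA),
  stringsAsFactors = FALSE
)

bases <- c("A", "C", "G", "T")

# Every column is counted against the same four levels, so each table has
# length 4 and sapply() returns a 4 x 2 matrix. NA values are not counted.
counts <- sapply(dfsmall, function(x) table(factor(x, levels = bases)))
print(counts)
```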
If you use stack to reshape everything to long form, you can call table on the result:
dfnew <- data.frame(C1 = c("A", "A", "T", "C", "C", "C", "T", "G", "C", "G", "G", "G", "T", "A", "A", "A", "T", "A", "G"),
C2 = c("A", "T", "A", "A", "A", "A", "C", "A", "T", "A", "G", "A", "T", "T", "C", "A", "G", "A", "T"),
C3 = c("G", "T", "G", "A", "T", "A", "T", "G", "A", "A", "T", "C", "C", "A", "A", "C", "G", "A", "G"),
C4 = c("A", "T", "A", "A", "T", "A", "G", "C", "T", "A", "T", "T", "T", "G", "A", "A", "A", "A", "G"),
C5 = c("G", "G", "T", "A", "T", "T", "A", "T", "G", "G", "T", "T", "T", "C", "A", "A", "A", "G", NA),
C6 = c("A", "G", "A", "G", "C", "A", "A", "A", "A", "G", "A", "A", "T", "T", "A", "A", "T", "T", NA),
stringsAsFactors = FALSE)
table(stack(dfnew))
#> ind
#> values C1 C2 C3 C4 C5 C6
#> A 6 10 7 9 5 10
#> C 4 2 3 1 1 1
#> G 5 2 5 3 5 3
#> T 4 5 4 6 7 4
Using data.table and its chained [ workflow:
library(data.table)
tab <- fread("
C1 C2 C3 C4 C5 C6
A A G A G A
A T T T G G
T A G A T A
C A A A A G
C A T T T C
C A A A T A
T C T G A A
G A G C T A
C T A T G A
G A A A G G
G G T T T A
G A C T T A
T T C T T T
A T A G C T
A C A A A A
A A C A A A
T G G A A T
A A A A G T
G T G G NA NA")
tab[, melt(.SD, measure.vars = paste0("C", 1:6), na.rm = TRUE)][
, dcast(.SD, value ~ variable, fun = length, drop = TRUE)
]
#> value C1 C2 C3 C4 C5 C6
#> 1: A 6 10 7 9 5 10
#> 2: C 4 2 3 1 1 1
#> 3: G 5 2 5 3 5 3
#> 4: T 4 5 4 6 7 4

R: select top products in groups

I need to select the 3 top-selling products in each category, but if a category does not have 3 products, I should add more products from the best available category ("a" being the best category, "c" the worst).
The products change every day, so I would like to do this automatically. Previously I chose the top 3 products per category and did not bother when fewer were available, but unfortunately the conditions have changed. For that I used the following code:
Selected <- items %>% group_by(Cat) %>% dplyr::filter(row_number() <= 3) %>% ungroup()
Sample data:
items <- data.frame(Cat = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results:
"a", "a", "a", "b", "b", "c", "c", "c", "c"
Sample data - 2:
items <- data.frame(Cat = c("a", "a", "a", "a", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
Desired results - 2:
"a", "a", "a", "a", "b", "c", "c", "c", "c"
Here is a possible answer. I'm not entirely sure if I'm getting what you are after - if not let me know.
items <- data.frame(Cat = c("a", "a", "a",
"b", "b",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c"),
ranking = 1:15)
First we order the data from best to worst category and add each row's position within its category.
Selected <- items %>% group_by(Cat) %>%
  mutate(id = row_number()) %>%
  ungroup() %>% arrange(Cat)
Then we can apply the filter and fill up with the remaining rows, from best to worst category:
Selected %>% filter(id <= 3) %>%                 # Select top 3 in each group
  bind_rows(Selected %>% filter(id > 3)) %>%     # Append the rows that weren't selected
  mutate(id = row_number()) %>%
  filter(id <= 3 * length(unique(Cat)))          # Keep 3 rows per category in total
This produces
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 4 4
5 b 5 5
6 c 6 6
7 c 7 7
8 c 8 8
9 c 9 9
The second data example yields
# A tibble: 9 x 3
Cat ranking id
<fctr> <int> <int>
1 a 1 1
2 a 2 2
3 a 3 3
4 b 5 4
5 c 6 5
6 c 7 6
7 c 8 7
8 a 4 8
9 c 9 9
which seems to be what you were after.
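The same top-up logic can be sketched as a reusable base R function. This is an illustration under assumptions, not the answerer's code: the function name `select_top` and its argument `n` are hypothetical, and category quality is assumed to follow alphabetical order of `Cat`, as in the sample data.

```r
# Hypothetical helper: take up to n rows per category, then fill the
# remaining slots with leftover rows, best category first.
select_top <- function(items, n = 3) {
  # Sort by category (best first) and by ranking within category.
  items <- items[order(items$Cat, items$ranking), ]
  # Position of each row within its category.
  pos <- ave(seq_len(nrow(items)), items$Cat, FUN = seq_along)
  target <- n * length(unique(items$Cat))
  picked <- items[pos <= n, ]
  spare <- items[pos > n, ]
  # Leftovers are appended after the regular picks and cut at the target.
  head(rbind(picked, spare), target)
}

items1 <- data.frame(Cat = rep(c("a", "b", "c"), times = c(3, 2, 10)),
                     ranking = 1:15, stringsAsFactors = FALSE)
items2 <- data.frame(Cat = rep(c("a", "b", "c"), times = c(4, 1, 10)),
                     ranking = 1:15, stringsAsFactors = FALSE)

out1 <- select_top(items1)
out2 <- select_top(items2)
print(out1)
print(out2)
```

As in the dplyr version, top-up rows are appended at the end, so sort the result by `Cat` if the desired ordering matters.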
