How to generate string counts in different samples in R

Let's say I have a data table as follows:
ID1 ID2 ID3
-------------
a   a   b
a   b   b
b   b   b
c   c   c
c   c   d
c   d   d
d       e
d       e
e
Then I want to convert it to the following structure:
Samples ID1 ID2 ID3
-------------------
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
Would any of you please help me with R or bash code to achieve such transformation?

Try the R code below
> table(stack(df))
ind
values ID1 ID2 ID3
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
data
> dput(df)
structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame", row.names = c(NA,
-9L))
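To see what table(stack(df)) does step by step, here is a minimal sketch, assuming the same df as above with character columns. The names long, counts and wide are illustrative only and not part of the original answer.

```r
# Same sample data as in the answer above
df <- data.frame(
  ID1 = c("a", "a", "b", "c", "c", "c", "d", "d", "e"),
  ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA),
  ID3 = c("b", "b", "b", "c", "d", "d", "e", "e", NA),
  stringsAsFactors = FALSE
)

# stack() puts all columns into one long data frame:
# 'values' holds the strings, 'ind' holds the source column name
long <- stack(df)

# table() cross-tabulates values against ind; NA values are dropped by default
counts <- table(long)

# Optional: turn the table into a plain data frame with a Samples column
wide <- as.data.frame.matrix(counts)
wide <- data.frame(Samples = rownames(wide), wide, row.names = NULL)
print(wide)
```

The last two lines are only needed if a regular data frame with a Samples column is preferred over a table object.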

An option with tidyverse - reshape to 'long' format with pivot_longer, get the count and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything(), values_drop_na = TRUE, values_to = 'Samples') %>%
count(name, Samples) %>%
pivot_wider(names_from = name, values_from = n, values_fill = 0)
-output
# A tibble: 5 × 4
Samples ID1 ID2 ID3
<chr> <int> <int> <int>
1 a 2 1 0
2 b 1 2 3
3 c 3 2 1
4 d 2 1 2
5 e 1 0 2
data
df <- structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame",
row.names = c(NA,
-9L))

Related

How to count swapped characters between two columns in R

I have a data frame that looks like this
df <- data.frame(col1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"),
col2 = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E",
"A", "B", "C", "D", "E"))
what I want is to have like this
df <- data.frame(col1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"),
col2 = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E",
"A", "B", "C", "D", "E"),
col3 = c("1","0","0","0","0","1","1","0","0","0","1","1","1","0","0"))
col3 marks duplicated characters as 1 and unique ones as 0. Row 6 is considered a duplicate because the swapped characters ("B", "A") were already counted in row 2 as unique ("A", "B"). I can easily do this in Excel using the IF and COUNTIF functions. Thanks in advance!
We can use pmin and pmax to sort the values from left to right by rows and apply duplicated to check the duplicates
transform(
df,
col3 = +(duplicated(paste(pmin(col1, col2), pmax(col1, col2))) | col1 == col2)
)
which gives
col1 col2 col3
1 A A 1
2 A B 0
3 A C 0
4 A D 0
5 A E 0
6 B A 1
7 B B 1
8 B C 0
9 B D 0
10 B E 0
11 C A 1
12 C B 1
13 C C 1
14 C D 0
15 C E 0
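The intermediate key that pmin and pmax build can be inspected on its own. This is a minimal sketch, assuming the two-column df from the question; the variable name key is illustrative.

```r
df <- data.frame(
  col1 = rep(c("A", "B", "C"), each = 5),
  col2 = rep(c("A", "B", "C", "D", "E"), times = 3),
  stringsAsFactors = FALSE
)

# Order-independent key: the alphabetically smaller value always comes first,
# so ("B", "A") and ("A", "B") both become "A B"
key <- paste(pmin(df$col1, df$col2), pmax(df$col1, df$col2))

# A row is flagged when its key was seen earlier or when both columns match
df$col3 <- as.integer(duplicated(key) | df$col1 == df$col2)
print(df)
```

Because the key ignores column order, duplicated() only has to look at a single character vector.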
Does this work:
library(dplyr)
library(stringr)
df %>% mutate(col4 = str_c(col1, col2)) %>%
mutate(col5 = lapply(col4, function(x) paste(sort(unlist(strsplit(x, ''))), collapse = ''))) %>%
mutate(col3 = +(duplicated(col5) | (col1 == col2))) %>%
select(col1, col2, col3)
col1 col2 col3
1 A A 1
2 A B 0
3 A C 0
4 A D 0
5 A E 0
6 B A 1
7 B B 1
8 B C 0
9 B D 0
10 B E 0
11 C A 1
12 C B 1
13 C C 1
14 C D 0
15 C E 0
Here is one option where we look for any duplicates or where col1 and col2 are the same. The + returns a binary for the logical.
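# restrict to col1 and col2 so an existing col3 does not affect the sorted pairs
df$col3 <- +(duplicated(t(apply(df[c("col1", "col2")], 1, sort))) | df$col1 == df$col2)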
Output
col1 col2 col3
1 A A 1
2 A B 0
3 A C 0
4 A D 0
5 A E 0
6 B A 1
7 B B 1
8 B C 0
9 B D 0
10 B E 0
11 C A 1
12 C B 1
13 C C 1
14 C D 0
15 C E 0
try this
column <- grepl("^[.0-9]+$", dat[,1])
column
dat2 <- data.frame(Sex = dat[cbind(seq_len(nrow(dat)),1+column)], Length =
dat[cbind(seq_len(nrow(dat)),2-column)])
dat2$Length <- as.numeric(dat2$Length)
dat2

Get Max value by category as a new column in R

I understand the question's wording is not clear, but it's a simple one; hopefully the image will convey more than words. I need a new column (New Col in the image) that takes the value from column B corresponding to the max value in column N within each group, in this case grouped by A.
Preferably tidyverse solution since I am piping a long command.
df <- structure(list(A = c("a", "a", "a", "a", "a", "b", "b", "b"),
B = c("b", "c", "d", "e", "f", "c", "d", "e"), N = c(1L,
2L, 3L, 4L, 5L, 5L, 4L, 3L), New.Col = c("f", "f", "f", "f",
"f", "c", "c", "c")), class = "data.frame", row.names = c(NA,
-8L))
Using data.table
library(data.table)
setDT(df)[, new_col := B[which.max(N)], A]
> df
A B N New.Col new_col
1: a b 1 f f
2: a c 2 f f
3: a d 3 f f
4: a e 4 f f
5: a f 5 f f
6: b c 5 c c
7: b d 4 c c
8: b e 3 c c
Here is a clean and neat version of my answer without ifelse:
library(dplyr)
df %>%
group_by(A) %>%
mutate(new_col = B[N==max(N)])
A B N New.Col new_col
<chr> <chr> <int> <chr> <chr>
1 a b 1 f f
2 a c 2 f f
3 a d 3 f f
4 a e 4 f f
5 a f 5 f f
6 b c 5 c c
7 b d 4 c c
8 b e 3 c c
First answer:
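We could use max in an ifelse statement, then fill the remaining rows within each group:
library(dplyr)
library(tidyr)
df %>%
group_by(A) %>%
mutate(NewCol = ifelse(N == max(N), B, NA_character_)) %>%
fill(NewCol, .direction = "downup") %>%
ungroup()
A B N New.Col NewCol
<chr> <chr> <int> <chr> <chr>
1 a b 1 f f
2 a c 2 f f
3 a d 3 f f
4 a e 4 f f
5 a f 5 f f
6 b c 5 c c
7 b d 4 c c
8 b e 3 c c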
You can just use normal group_by and mutate
library(tidyverse)
df <- structure(list(A = c("a", "a", "a", "a", "a", "b", "b", "b"),
B = c("b", "c", "d", "e", "f", "c", "d", "e"), N = c(1L,
2L, 3L, 4L, 5L, 5L, 4L, 3L), New.Col = c("f", "f", "f", "f",
"f", "c", "c", "c")), class = "data.frame", row.names = c(NA,
-8L))
df %>%
group_by(A) %>%
mutate(new_col = B[which.max(N)]) %>%
ungroup()
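If a base R version is ever needed, the same idea can be sketched with ave(). This is a minimal sketch and not one of the original answers; idx, tie_B and tie_N are illustrative names. It also shows why which.max() is the safer choice when a group has tied maxima.

```r
df <- data.frame(
  A = c("a", "a", "a", "a", "a", "b", "b", "b"),
  B = c("b", "c", "d", "e", "f", "c", "d", "e"),
  N = c(1L, 2L, 3L, 4L, 5L, 5L, 4L, 3L),
  stringsAsFactors = FALSE
)

# For each group in A, find the row index holding the largest N.
# which.max() returns only the first maximum, so the result has length 1.
idx <- ave(seq_len(nrow(df)), df$A,
           FUN = function(i) i[which.max(df$N[i])])

df$new_col <- df$B[idx]
print(df)

# On ties the two indexing styles differ:
tie_B <- c("x", "y", "z")
tie_N <- c(2L, 5L, 5L)
tie_B[tie_N == max(tie_N)]   # returns both tied values
tie_B[which.max(tie_N)]      # returns a single value
```

With tied maxima, B[N == max(N)] yields more than one value per group, which does not fit a single new column; B[which.max(N)] always yields exactly one.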

How to add a column to identify a specific combination of values in R?

I have a database with several columns (>20), and 2 of these columns contain the subject names. I would like to add another column containing a number that identifies the combination of the two subjects.
Here is an example with only the 2 columns of names (I don't include the others for convenience):
ID1 ID2
A B
A C
A B
B C
A B
B A
C B
And here is what I would like to create:
ID1 ID2 CODE
A B 1
A C 2
A B 1
B C 3
A B 1
B A 1
C B 3
I am kind of new to R and I think it can be done with stringr, but I am not sure how.
Thanks for the help!
Simo
df$CODE <- as.integer(
factor(
# use only the two name columns, since the real data has more than 20 columns
apply(df[c("ID1", "ID2")], 1, function(x) paste0(sort(x), collapse = ""))
)
)
# ID1 ID2 CODE
# 1 A B 1
# 2 A C 2
# 3 A B 1
# 4 B C 3
# 5 A B 1
# 6 B A 1
# 7 C B 3
Data
df <- data.frame(
ID1 = c("A", "A", "A", "B", "A", "B", "C"),
ID2 = c("B", "C", "B", "C", "B", "A", "B")
)
Try this:
library(dplyr)
#Code
new <- df %>% rowwise() %>%
mutate(Var = paste0(sort(c(ID1, ID2)), collapse = '')) %>%
group_by(Var) %>%
mutate(CODE=cur_group_id()) %>%
ungroup() %>%
select(-Var)
Output:
# A tibble: 7 x 3
ID1 ID2 CODE
<chr> <chr> <int>
1 A B 1
2 A C 2
3 A B 1
4 B C 3
5 A B 1
6 B A 1
7 C B 3
Some data used:
#Data
df <- structure(list(ID1 = c("A", "A", "A", "B", "A", "B", "C"), ID2 = c("B",
"C", "B", "C", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-7L))
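A vectorised variant without apply() or rowwise() can be sketched with pmin, pmax and match. This is a minimal sketch and not one of the original answers; key is an illustrative name, and it assumes the codes should be numbered in order of first appearance.

```r
df <- data.frame(
  ID1 = c("A", "A", "A", "B", "A", "B", "C"),
  ID2 = c("B", "C", "B", "C", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Order-independent key: ("B", "A") and ("A", "B") both become "A B"
key <- paste(pmin(df$ID1, df$ID2), pmax(df$ID1, df$ID2))

# Number each distinct key by the position of its first appearance
df$CODE <- match(key, unique(key))
print(df)
```

The factor-based answer numbers the combinations alphabetically instead; for this sample data both orderings give the same codes.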

swapping rows and columns in R

I have a table that looks like :
> head(test,10)
# A tibble: 10 x 16
Question_1 Question_2 Question_3 Question_4 Question_5 Question_6 Question_7 Question_8 Question_9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 B C C E C A C E C
2 C C C B C A E D C
3 B C C E C A C E C
4 C C C D C A C D C
5 B B C B A A A D C
6 C C C E BLANK A C E C
7 C C C E C A E E C
8 B C C E C A C D C
9 C C C E C A C D C
10 D C E B C A A D C
and want to transpose it so that I get one row per question, with 6 separate columns holding the counts of A, B, C, D, E and BLANK.
We can gather into 'long' format, get the count of the 'key' and 'value' columns, and spread it to 'wide' format
library(tidyverse)
gather(test) %>%
count(key, value) %>%
spread(value, n, fill = 0)
Or using melt/dcast
library(data.table)
dcast(melt(setDT(test), measure = patterns("^Question")), variable ~ value)
Or in base R with no looping, by replicating the column names of 'test' while unlisting 'test', and then getting the table
table(names(test)[col(test)], unlist(test))
# A B BLANK C D E
# Question_1 0 4 0 5 1 0
# Question_2 0 1 0 9 0 0
# Question_3 0 0 0 9 0 1
# Question_4 0 3 0 0 1 6
# Question_5 1 0 1 8 0 0
# Question_6 10 0 0 0 0 0
# Question_7 2 0 0 6 0 2
# Question_8 0 0 0 0 6 4
# Question_9 0 0 0 10 0 0
NOTE: There is no need to trick with a loop
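How the names(test)[col(test)] indexing lines up with unlist(test) is easier to see on a tiny input. This is a minimal sketch with a made-up two-question data frame; test_small, labels and values are illustrative names.

```r
test_small <- data.frame(
  Q1 = c("A", "B", "A"),
  Q2 = c("B", "B", "C"),
  stringsAsFactors = FALSE
)

# col() gives the column index of every cell; indexing the names with it
# repeats each column name once per row, in column-major order
labels <- names(test_small)[col(test_small)]

# unlist() flattens the data frame in the same column-major order
values <- unlist(test_small, use.names = FALSE)

# One row per question, one column per answer value
counts <- table(labels, values)
print(counts)
```

Since both vectors are flattened column by column, each value stays paired with the name of the question it came from.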
Benchmarks
df2 <- test[rep(seq_len(nrow(test)), 1e5), ]
system.time({
vals <- unique(unlist(df2))
t(sapply(df2, function(x) table(factor(x, levels = vals))))
})
# user system elapsed
# 6.987 0.367 7.293
system.time({
table(names(df2)[col(df2)], unlist(df2))
})
# user system elapsed
# 6.355 0.407 6.720
system.time({
gather(df2) %>%
count(key, value) %>%
spread(value, n, fill = 0)
})
# user system elapsed
# 0.567 0.125 0.695
system.time({
dcast(melt(setDT(df2), measure = patterns("^Question")), variable ~ value)
})
# user system elapsed
# 0.789 0.018 0.195
data
test <- structure(list(Question_1 = c("B", "C", "B", "C", "B", "C", "C",
"B", "C", "D"), Question_2 = c("C", "C", "C", "C", "B", "C",
"C", "C", "C", "C"), Question_3 = c("C", "C", "C", "C", "C",
"C", "C", "C", "C", "E"), Question_4 = c("E", "B", "E", "D",
"B", "E", "E", "E", "E", "B"), Question_5 = c("C", "C", "C",
"C", "A", "BLANK", "C", "C", "C", "C"), Question_6 = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A"), Question_7 = c("C",
"E", "C", "C", "A", "C", "E", "C", "C", "A"), Question_8 = c("E",
"D", "E", "D", "D", "E", "E", "D", "D", "D"), Question_9 = c("C",
"C", "C", "C", "C", "C", "C", "C", "C", "C")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
A base R trick could be to get all the unique values of the dataframe and use sapply and count frequency of each value in the column.
vals <- unique(unlist(test))
t(sapply(test, function(x) table(factor(x, levels = vals))))
# B C D E A BLANK
#Question_1 4 5 1 0 0 0
#Question_2 1 9 0 0 0 0
#Question_3 0 9 0 1 0 0
#Question_4 3 0 1 6 0 0
#Question_5 0 8 0 0 1 1
#Question_6 0 0 0 0 10 0
#Question_7 0 6 0 2 2 0
#Question_8 0 0 6 4 0 0
#Question_9 0 10 0 0 0 0

Sum mismatches in column-to-column comparison

I am quite new to R programming, and am having some difficulty with ANOTHER step of my project. I am not even sure at this point if I am asking the question correctly. I have a dataframe of actual and predicted values:
actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a
The issue that I am having is that I need to create a vector of mismatches between the actual value and each of the four predicted values. This should result in a single vector: c(2,1,2,4)
I am trying to use a boolean mask to sum over the TRUE values...but something is not working right. I need to do this sum for each of the four predicted values to actual value comparisons.
discordant_sums(df[,seq(1,ncol(df),2)]!=,df[,seq(2,ncol(df),2)])
Any suggestions would be greatly appreciated.
You can use apply to compare values in 1st column with values in each of all other columns.
apply(df[-1], 2, function(x)sum(df[1]!=x))
# predicted.1 predicted.2 predicted.3 predicted.4
# 2 1 2 4
Data:
df <- read.table(text =
"actual predicted.1 predicted.2 predicted.3 predicted.4
a a a a a
a a a b b
b b a b b
b a b b c
c c c c c
c d c c d
d d d c d
d d d d a",
header = TRUE, stringsAsFactors = FALSE)
We can replicate the first column to make the lengths equal between the comparison objects and do the colSums
as.vector(colSums(df[,1][row(df[-1])] != df[-1]))
#[1] 2 1 2 4
data
df <- structure(list(actual = c("a", "a", "b", "b", "c", "c", "d",
"d"), predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a")),
.Names = c("actual",
"predicted.1", "predicted.2", "predicted.3", "predicted.4"),
class = "data.frame", row.names = c(NA,
-8L))
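The same comparison can also be written with sapply over the predicted columns, which some readers may find easier to follow. This is a minimal sketch and not one of the original answers; mismatches is an illustrative name.

```r
df <- data.frame(
  actual      = c("a", "a", "b", "b", "c", "c", "d", "d"),
  predicted.1 = c("a", "a", "b", "a", "c", "d", "d", "d"),
  predicted.2 = c("a", "a", "a", "b", "c", "c", "d", "d"),
  predicted.3 = c("a", "b", "b", "b", "c", "c", "c", "d"),
  predicted.4 = c("a", "b", "b", "c", "c", "d", "d", "a"),
  stringsAsFactors = FALSE
)

# For every predicted column, count the rows that differ from 'actual'.
# Summing a logical vector counts its TRUE values.
mismatches <- sapply(df[-1], function(pred) sum(pred != df$actual))
print(mismatches)

# Drop the names to get the plain vector asked for in the question
unname(mismatches)
```

The named result makes it easy to see which predicted column each count belongs to.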
