How to count swapped characters between two columns in R

I have a data frame that looks like this:
df <- data.frame(col1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"),
col2 = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E",
"A", "B", "C", "D", "E"))
What I want is this:
df <- data.frame(col1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
"C", "C", "C", "C", "C"),
col2 = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E",
"A", "B", "C", "D", "E"),
col3 = c("1","0","0","0","0","1","1","0","0","0","1","1","1","0","0"))
In col3, duplicated pairs are marked 1 and unique pairs 0. Row 6 is considered a duplicate because the swapped pair ("B", "A") was already counted in row 2 as the unique pair ("A", "B"). I can easily do this in Excel using the IF and COUNTIF functions. Thanks in advance!

We can use pmin and pmax to order the two values within each row, then apply duplicated to the pasted pairs to flag the duplicates:
transform(
df,
col3 = +(duplicated(paste(pmin(col1, col2), pmax(col1, col2))) | col1 == col2)
)
which gives
col1 col2 col3
1 A A 1
2 A B 0
3 A C 0
4 A D 0
5 A E 0
6 B A 1
7 B B 1
8 B C 0
9 B D 0
10 B E 0
11 C A 1
12 C B 1
13 C C 1
14 C D 0
15 C E 0
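To see why the pmin/pmax approach catches swapped pairs, it can help to look at the intermediate key. The following is only a minimal sketch on a four-row subset of the question's data; the names `small` and `key` are illustrative and do not appear in the answer.

```r
# Illustrative subset of the question's data; `small` and `key` are made-up names.
small <- data.frame(col1 = c("A", "A", "B", "B"),
                    col2 = c("A", "B", "A", "B"),
                    stringsAsFactors = FALSE)

# pmin/pmax compare the two columns element-wise, so each pair is written
# in alphabetical order regardless of which column held which letter.
key <- paste(pmin(small$col1, small$col2), pmax(small$col1, small$col2))

# A row is flagged when its ordered pair was seen before,
# or when both columns hold the same letter.
small$col3 <- +(duplicated(key) | small$col1 == small$col2)
small
```

Because ("B", "A") and ("A", "B") both produce the key "A B", the second occurrence is reported by duplicated.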

Does this work:
library(dplyr)
library(stringr)
df %>% mutate(col4 = str_c(col1, col2)) %>%
# sort the characters of each pair so that "BA" and "AB" give the same key
mutate(col5 = sapply(col4, function(x) paste(sort(unlist(strsplit(x, ''))), collapse = ''))) %>%
mutate(col3 = +(duplicated(col5) | (col1 == col2))) %>%
select(col1, col2, col3)
col1 col2 col3
1 A A 1
2 A B 0
3 A C 0
4 A D 0
5 A E 0
6 B A 1
7 B B 1
8 B C 0
9 B D 0
10 B E 0
11 C A 1
12 C B 1
13 C C 1
14 C D 0
15 C E 0

Here is one option where we look for any duplicates or where col1 and col2 are the same. The + returns a binary for the logical.
df$col3 <- +(duplicated(t(apply(df, 1, sort))) | df$col1 == df$col2)
Output
col1 col2 col3
1 A A 1
2 A B 0
3 A C 0
4 A D 0
5 A E 0
6 B A 1
7 B B 1
8 B C 0
9 B D 0
10 B E 0
11 C A 1
12 C B 1
13 C C 1
14 C D 0
15 C E 0
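One thing to keep in mind with `apply(df, 1, sort)` is that it sorts every column in each row, so any extra column (or a `col3` left over from an earlier run) becomes part of the comparison. A hedged sketch of restricting the check to the two columns of interest; the `other` column and the name `pair` are assumptions added only for illustration:

```r
# Toy data with an extra column; `other` and `pair` are illustrative names.
df <- data.frame(col1 = c("A", "A", "B", "B"),
                 col2 = c("A", "B", "A", "B"),
                 other = c("x", "y", "z", "w"),
                 stringsAsFactors = FALSE)

# Sort only col1/col2 within each row; apply() returns the rows as columns,
# so t() turns the result back into one row per original row.
pair <- t(apply(df[c("col1", "col2")], 1, sort))

# duplicated() on a matrix compares whole rows.
df$col3 <- +(duplicated(pair) | df$col1 == df$col2)
df
```

With the extra column included in the sort, row 3 would no longer match row 2, so selecting the columns explicitly is the safer habit.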

try this
column <- grepl("^[.0-9]+$", dat[,1])
column
dat2 <- data.frame(Sex = dat[cbind(seq_len(nrow(dat)),1+column)], Length =
dat[cbind(seq_len(nrow(dat)),2-column)])
dat2$Length <- as.numeric(dat2$Length)
dat2

Related

How to generate string counts in different samples by R

Let's say I have a data table as follows:
ID1 ID2 ID3
-------------
a   a   b
a   b   b
b   b   b
c   c   c
c   c   d
c   d   d
d       e
d       e
e
I want to convert it to the following structure:
Samples ID1 ID2 ID3
-------------------
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
Would any of you please help me with R or bash code to achieve such transformation?
Try the R code below
> table(stack(df))
ind
values ID1 ID2 ID3
a 2 1 0
b 1 2 3
c 3 2 1
d 2 1 2
e 1 0 2
data
> dput(df)
structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame", row.names = c(NA,
-9L))
An option with tidyverse - reshape to 'long' format with pivot_longer, get the count and reshape back to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything(), values_drop_na = TRUE, values_to = 'Samples') %>%
count(name, Samples) %>%
pivot_wider(names_from = name, values_from = n, values_fill = 0)
-output
# A tibble: 5 × 4
Samples ID1 ID2 ID3
<chr> <int> <int> <int>
1 a 2 1 0
2 b 1 2 3
3 c 3 2 1
4 d 2 1 2
5 e 1 0 2
data
df <- structure(list(ID1 = c("a", "a", "b", "c", "c", "c", "d", "d",
"e"), ID2 = c("a", "b", "b", "c", "c", "d", NA, NA, NA), ID3 = c("b",
"b", "b", "c", "d", "d", "e", "e", NA)), class = "data.frame",
row.names = c(NA,
-9L))

how to add a column to identify specific combination of values in R?

I have a data set with several columns (>20), two of which hold the subject names. I would like to add another column containing a number that identifies the combination of the two subjects.
Here is an example with only the 2 columns of names (I don't include the others for convenience):
ID1 ID2
A B
A C
A B
B C
A B
B A
C B
And here is what I would like to create:
ID1 ID2 CODE
A B 1
A C 2
A B 1
B C 3
A B 1
B A 1
C B 3
I am fairly new to R; I think it can be done with stringr, but I am not sure how.
Thanks for the help!
Simo
df$CODE <- as.integer(
factor(
apply(df, 1, function(x) paste0(sort(x), collapse = ""))
)
)
# ID1 ID2 CODE
# 1 A B 1
# 2 A C 2
# 3 A B 1
# 4 B C 3
# 5 A B 1
# 6 B A 1
# 7 C B 3
Data
df <- data.frame(
ID1 = c("A", "A", "A", "B", "A", "B", "C"),
ID2 = c("B", "C", "B", "C", "B", "A", "B")
)
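Note that `as.integer(factor(...))` numbers the combinations by the sorted order of the factor levels, which happens to coincide with order of first appearance in the example above. If codes should follow first appearance in general, one possible variant is `match(key, unique(key))`. This is a sketch on made-up data; `pairs_df`, `key`, `by_level` and `by_first` are hypothetical names:

```r
# Made-up data where alphabetical order and first appearance differ.
pairs_df <- data.frame(ID1 = c("B", "A", "C"),
                       ID2 = c("C", "B", "B"),
                       stringsAsFactors = FALSE)

# Order-independent key for each row: smaller value first, larger second.
key <- paste(pmin(pairs_df$ID1, pairs_df$ID2), pmax(pairs_df$ID1, pairs_df$ID2))

# Codes by sorted factor level: "A B" is level 1, "B C" is level 2.
by_level <- as.integer(factor(key))

# Codes by first appearance: "B C" is seen first, so it gets 1.
by_first <- match(key, unique(key))

cbind(pairs_df, by_level, by_first)
```

Either numbering identifies the same groups; they differ only in which integer each group receives.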
Try this:
library(dplyr)
#Code
new <- df %>% rowwise() %>%
mutate(Var = paste0(sort(c(ID1, ID2)), collapse = '')) %>%
group_by(Var) %>%
mutate(CODE=cur_group_id()) %>%
ungroup() %>%
select(-Var)
Output:
# A tibble: 7 x 3
ID1 ID2 CODE
<chr> <chr> <int>
1 A B 1
2 A C 2
3 A B 1
4 B C 3
5 A B 1
6 B A 1
7 C B 3
Some data used:
#Data
df <- structure(list(ID1 = c("A", "A", "A", "B", "A", "B", "C"), ID2 = c("B",
"C", "B", "C", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-7L))

swapping rows and columns in R

I have a table that looks like :
> head(test,10)
# A tibble: 10 x 16
Question_1 Question_2 Question_3 Question_4 Question_5 Question_6 Question_7 Question_8 Question_9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 B C C E C A C E C
2 C C C B C A E D C
3 B C C E C A C E C
4 C C C D C A C D C
5 B B C B A A A D C
6 C C C E BLANK A C E C
7 C C C E C A E E C
8 B C C E C A C D C
9 C C C E C A C D C
10 D C E B C A A D C
and I want to transpose it so that I get one row per question, with six separate columns holding the counts of A, B, C, D, E and BLANK.
We can gather into 'long' format, get the count of the 'key', 'value' columns and spread it to 'wide' format
library(tidyverse)
gather(test) %>%
count(key, value) %>%
spread(value, n, fill = 0)
Or using melt/dcast
library(data.table)
dcast(melt(setDT(test), measure = patterns("^Question")), variable ~ value)
Or in base R with no looping: replicate the column names of 'test' to match the unlisted values of 'test', then get the table
table(names(test)[col(test)], unlist(test))
# A B BLANK C D E
# Question_1 0 4 0 5 1 0
# Question_2 0 1 0 9 0 0
# Question_3 0 0 0 9 0 1
# Question_4 0 3 0 0 1 6
# Question_5 1 0 1 8 0 0
# Question_6 10 0 0 0 0 0
# Question_7 2 0 0 6 0 2
# Question_8 0 0 0 0 6 4
# Question_9 0 0 0 10 0 0
NOTE: There is no need to resort to a loop
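The base R one-liner relies on `col()` returning, for every cell, the index of the column it sits in. A small sketch on a toy data frame may make this easier to follow; `toy`, `labels` and `tab` are illustrative names, not part of the answer:

```r
# Toy stand-in for `test`; the names here are illustrative only.
toy <- data.frame(Q1 = c("A", "B", "A"),
                  Q2 = c("B", "B", "C"),
                  stringsAsFactors = FALSE)

# col(toy) holds the column index of every cell, so indexing the column
# names with it repeats each name once per row, in column-major order.
labels <- names(toy)[col(toy)]

# unlist(toy) walks the cells in the same column-major order,
# so the two vectors line up cell by cell.
tab <- table(labels, unlist(toy))
tab
```

Each row of `tab` is one question and each column one answer value, with zero where a value never occurs for that question.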
Benchmarks
df2 <- test[rep(seq_len(nrow(test)), 1e5), ]
system.time({
vals <- unique(unlist(df2))
t(sapply(df2, function(x) table(factor(x, levels = vals))))
})
# user system elapsed
# 6.987 0.367 7.293
system.time({
table(names(df2)[col(df2)], unlist(df2))
})
# user system elapsed
# 6.355 0.407 6.720
system.time({
gather(df2) %>%
count(key, value) %>%
spread(value, n, fill = 0)
})
# user system elapsed
# 0.567 0.125 0.695
system.time({
dcast(melt(setDT(df2), measure = patterns("^Question")), variable ~ value)
})
# user system elapsed
# 0.789 0.018 0.195
data
test <- structure(list(Question_1 = c("B", "C", "B", "C", "B", "C", "C",
"B", "C", "D"), Question_2 = c("C", "C", "C", "C", "B", "C",
"C", "C", "C", "C"), Question_3 = c("C", "C", "C", "C", "C",
"C", "C", "C", "C", "E"), Question_4 = c("E", "B", "E", "D",
"B", "E", "E", "E", "E", "B"), Question_5 = c("C", "C", "C",
"C", "A", "BLANK", "C", "C", "C", "C"), Question_6 = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A"), Question_7 = c("C",
"E", "C", "C", "A", "C", "E", "C", "C", "A"), Question_8 = c("E",
"D", "E", "D", "D", "E", "E", "D", "D", "D"), Question_9 = c("C",
"C", "C", "C", "C", "C", "C", "C", "C", "C")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
A base R trick could be to get all the unique values in the data frame, then use sapply to count the frequency of each value in every column.
vals <- unique(unlist(test))
t(sapply(test, function(x) table(factor(x, levels = vals))))
# B C D E A BLANK
#Question_1 4 5 1 0 0 0
#Question_2 1 9 0 0 0 0
#Question_3 0 9 0 1 0 0
#Question_4 3 0 1 6 0 0
#Question_5 0 8 0 0 1 1
#Question_6 0 0 0 0 10 0
#Question_7 0 6 0 2 2 0
#Question_8 0 0 6 4 0 0
#Question_9 0 10 0 0 0 0

Tabulating list of values in third variable in R

I have the following data:
ddf2 = structure(list(col1 = c(3, 3, 2, 1, 1, 1, 3, 2, 1, 1, 3, 1, 1,
2, 1, 1, 1, 2, 3, 1, 1, 3, 2, 3, 3), col2 = c("c", "c", "b",
"b", "b", "a", "b", "c", "b", "b", "c", "c", "b", "b", "a", "c",
"c", "b", "a", "b", "b", "c", "a", "c", "a"), col3 = c("C", "E",
"E", "B", "D", "E", "C", "C", "E", "E", "C", "A", "D", "D", "C",
"E", "A", "A", "A", "D", "A", "A", "B", "A", "E")), .Names = c("col1",
"col2", "col3"), row.names = c(NA, 25L), class = "data.frame")
head(ddf2)
col1 col2 col3
1 3 c C
2 3 c E
3 2 b E
4 1 b B
5 1 b D
6 1 a E
For every combination of col1 and col2, there may be many values of col3:
with(ddf2, ddf2[col1==1 & col2=='b',])
col1 col2 col3
4 1 b B
5 1 b D
9 1 b E
10 1 b E
13 1 b D
20 1 b D
21 1 b A
with(ddf2, table(col1, col2))
col2
col1 a b c
1 2 7 3
2 1 3 1
3 2 1 5
I want to create a table/matrix of col1 and col2 as above, but each cell should hold the list of unique col3 entries for that combination of col1 and col2. I expect the following output:
col2
col1 a b c
1 E,C A,B,D,E A,E
2 B A,D,E C
3 A,E C A,C,E
I tried the following, but it does not work:
with(ddf2, tapply(col3, list(col1,col2), c))
a b c
1 Character,2 Character,7 Character,3
2 "B" Character,3 "C"
3 Character,2 "C" Character,5
How can this be done? Thanks for your help.
One option:
d <- with(ddf2, aggregate(col3 ~ col2 + col1, FUN = function(x) toString(unique(x))))
library(reshape2)
dcast(d, col1 ~ col2, value.var = "col3")
# col1 a b c
#1 1 E, C B, D, E, A A, E
#2 2 B E, D, A C
#3 3 A, E C C, E, A
Most likely it's possible to do both steps in one, but I'll generously leave it to someone else to figure this out ;)
Or
library(dplyr)
library(tidyr)
ddf2 %>%
group_by(col1, col2) %>%
summarise(col3 = paste(unique(col3), collapse = ", ")) %>%
spread(col2, col3)
#Source: local data frame [3 x 4]
#
# col1 a b c
#1 1 E, C B, D, E, A A, E
#2 2 B E, D, A C
#3 3 A, E C C, E, A
Edit after comment:
Just tested with tapply and this seems to work (the problem was apparently in calling c()):
with(ddf2, tapply(col3, list(col1,col2), FUN = function(x) paste(unique(x), collapse = ", ")))
# a b c
#1 "E, C" "B, D, E, A" "A, E"
#2 "B" "E, D, A" "C"
#3 "A, E" "C" "C, E, A"
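The values inside each cell above appear in order of first occurrence. If alphabetical order within a cell is preferred, one option is to sort before pasting. A minimal sketch on made-up data (`toy` and `res` are hypothetical names; the real `ddf2` would be used the same way):

```r
# Small made-up data set in the same shape as ddf2.
toy <- data.frame(col1 = c(1, 1, 1, 2),
                  col2 = c("b", "b", "b", "a"),
                  col3 = c("D", "B", "D", "A"),
                  stringsAsFactors = FALSE)

# For every col1/col2 cell: drop repeats, sort, and collapse to one string,
# so tapply() receives a single value per cell and can build a matrix.
res <- with(toy, tapply(col3, list(col1, col2),
                        function(x) paste(sort(unique(x)), collapse = ",")))
res
```

Combinations that never occur in the data come back as NA.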

R: unique combination (avoid a-b and b-a and identical such as a-a, b-b)

I have the following variable columns -
var1 <- c("a", "b", "a", "a", "c", "a", "b", "b", "c", "b", "c", "c", "d")
var2 <- c("a", "a", "b", "c", "a", "d", "b", "c", "b", "d", "c", "d", "d")
mydf <- data.frame(var1, var2)
I want to find the unique variable combinations, such that
(a) var1 a - var2 b and var1 b - var2 a are not both counted as unique;
(b) no identical combinations are present,
for example var1 a and var2 a, or var1 b and var2 b.
I used the following code, but it does not give what I expect:
unique(mydf)
var1 var2
1 a a
2 b a
3 a b
4 a c
5 c a
6 a d
7 b b
8 b c
9 c b
10 b d
11 c c
12 c d
13 d d
My expected output is:
var1 var2
1 a b
2 a c
3 a d
4 b c
5 b d
6 c d
thanks;
This should do it:
mydf = mydf[mydf[,1] != mydf[,2], ]
mydf = mydf[!duplicated(data.frame(t(apply(mydf, 1, sort)))), ]
> mydf
var1 var2
2 b a
4 a c
6 a d
8 b c
10 b d
12 c d
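The result above keeps the first row as b a rather than a b, because the original column order is preserved. If the pairs should also be written in alphabetical order, as in the expected output, one possible variant is to normalize with pmin and pmax first. This is a sketch under that assumption; `norm` is an illustrative name:

```r
var1 <- c("a", "b", "a", "a", "c", "a", "b", "b", "c", "b", "c", "c", "d")
var2 <- c("a", "a", "b", "c", "a", "d", "b", "c", "b", "d", "c", "d", "d")
mydf <- data.frame(var1, var2, stringsAsFactors = FALSE)

# Put the alphabetically smaller value in var1 and the larger in var2.
norm <- data.frame(var1 = pmin(mydf$var1, mydf$var2),
                   var2 = pmax(mydf$var1, mydf$var2),
                   stringsAsFactors = FALSE)

# Drop identical pairs (a-a, b-b, ...) and then repeated pairs.
norm <- unique(norm[norm$var1 != norm$var2, ])
rownames(norm) <- NULL
norm
```

Resetting the row names is optional; it only makes the printed result run from 1 to 6.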
More of an exercise to teach myself some sets package behavior:
require(sets)
mydf <- data.frame(var1, var2, stringsAsFactors=FALSE) # unneeded factors are a plague on R/S
dlis <- list()
for (i in seq(nrow(mydf))) {
  if (length(set(mydf[i,1], mydf[i,2])) == 2) {
    dlis <- c(dlis, list(set(mydf[i,1], mydf[i,2])))
  }
}
unique(dlis)
[[1]]
{"a", "b"}
[[2]]
{"a", "c"}
[[3]]
{"a", "d"}
[[4]]
{"b", "c"}
[[5]]
{"b", "d"}
[[6]]
{"c", "d"}
