how can I regroup multiple categorical variables into a new variable

how can I regroup multiple categorical variables into a new variable - r

I have a data.frame (df) with 2 columns (A, B):
A B
1 a TCRB
2 a TCRG
3 a TCRB
4 b TCRB
5 b TCRG
6 c TCRB
7 c TCRB
8 c TCRB
9 c TCRB
10 d TCRG
11 d TCRG
12 d TCRG
I want to create a new column "C" as bellow that tells me whether each unique variable in "A" has both TCRB and TCRG or either one of them (0= TCRB only, 1= TCRG only, 2= both) as follows:
A: a b c d
C: 2 2 0 1
Greatly appreciate any help!

Here's an approach with dplyr:
library(dplyr)
df %>%
group_by(A) %>%
dplyr::summarise(C = case_when("TCRB" %in% B & "TCRG" %in% B ~ 2,
"TCRB" %in% B ~ 0,
"TCRG" %in% B ~ 1,
TRUE ~ NA_real_))
# A tibble: 4 x 2
A C
<fct> <dbl>
1 a 2
2 b 2
3 c 0
4 d 1

An option with n_distinct
library(dplyr)
df %>%
group_by(A) %>%
summarise(C = n_distinct(B) *!all(B == 'TCRB'))
# A tibble: 4 x 2
# A C
# <chr> <int>
#1 a 2
#2 b 2
#3 c 0
#4 d 1
data
df <- structure(list(A = c("a", "a", "a", "b", "b", "c", "c", "c",
"c", "d", "d", "d"), B = c("TCRB", "TCRG", "TCRB", "TCRB", "TCRG",
"TCRB", "TCRB", "TCRB", "TCRB", "TCRG", "TCRG", "TCRG")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

In Base R, we can use aggregate :
aggregate(B~A, df, function(x) {
if(all(c('TCRB', 'TCRG') %in% x)) 2
else if(any(x == 'TCRG')) 1
else if(any(x == 'TCRB')) 0
else NA
})
# A B
#1 a 2
#2 b 2
#3 c 0
#4 d 1

Related

Creating a new column of 1s and 0s as a way to count unique values in R

I would like to add a helper column of 0s and 1s to keep track of unique values based on one or more variables in R Programming.
Sample data:
df<- matrix(c("A","A","A","B","B","C","D","D","D","D"))
and what I would like is:
structure(c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D",
"1", "0", "0", "1", "0", "1", "1", "0", "0", "0"), .Dim = c(10L,
2L))

I think you could use the following solution:
df <- as.data.frame(df)
df$Helper <- +!duplicated(df$V1)
df
V1 Helper
1 A 1
2 A 0
3 A 0
4 B 1
5 B 0
6 C 1
7 D 1
8 D 0
9 D 0
10 D 0

Using dplyr
library(dplyr)
library(data.table)
df %>%
mutate(Helper = +(rowid(id) == 1))
data
df <- structure(list(id = c("A", "A", "A", "B", "B", "C", "D", "D",
"D", "D")), class = "data.frame", row.names = c(NA, -10L))

Another base R option using ave
transform(
as.data.frame(df),
helper = +(ave(seq_along(V1),V1,FUN = seq_along)==1)
)
gives
V1 helper
1 A 1
2 A 0
3 A 0
4 B 1
5 B 0
6 C 1
7 D 1
8 D 0
9 D 0
10 D 0

A dplyr solution:
# Creating the dataframe:
df <- data.frame(id=c("A","A","A","B","B","C","D","D","D","D"))
library(dplyr)
df %>% group_by(id) %>% mutate(helper = ifelse(row_number()==1, 1,0))
# A tibble: 10 x 2
# Groups: id [4]
id helper
<chr> <dbl>
1 A 1
2 A 0
3 A 0
4 B 1
5 B 0
6 C 1
7 D 1
8 D 0
9 D 0
10 D 0

Here is another option using match -
library(dplyr)
df %>% mutate(result = as.integer(row_number() %in% match(unique(id), id)))
# id result
#1 A 1
#2 A 0
#3 A 0
#4 B 1
#5 B 0
#6 C 1
#7 D 1
#8 D 0
#9 D 0
#10 D 0
In base R -
transform(df, result = as.integer(seq(nrow(df)) %in% match(unique(id), id)))

how to split a dataframe by specific rows in r

I have a data look like this:
data <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))
data
> data
# A tibble: 8 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 A B C D
5 10 20 30 40
6 10 20 30 40
7 B C D NA
8 200 300 400 NA
It was wrong bind by rows and I wanted to split the data into 3 sub data(d1, d2 and d3) such like this:
NOTE: In my real situation, d1, d2 and d3 have different nrow(). I set nrow(d1) = 3, nrow(d2) = 2 and nrow(d3) = 1 just for simplify the question in this example.
d1 <- data.frame(A = rep(1,3), B = rep(2,3), C = rep(3,3), D = rep(4,3))
d2 <- data.frame(A = rep(10,2), B = rep(20,2), C = rep(30,2), D = rep(40,2))
d3 <- data.frame( B = 200, C = 300, D = 400)
> d1
A B C D
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
> d2
A B C D
1 10 20 30 40
2 10 20 30 40
> d3
B C D
1 200 300 400
And then I could bind them correctly using bind_rows from dplyr
bind_rows(d1, d2, d3) %>% as_tibble()
# A tibble: 6 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 10 20 30 40
5 10 20 30 40
6 NA 200 300 400
The problem is that I am troubled by how to get the d1, d2 and d3 from data.
Any help will be highly appreciated!

Here is a tidyverse solution.
process_df takes a data frame and sets the column names and removes the first row.
process_df <- function(df, ...) {
df %>%
set_names(slice(., 1)) %>%
select(which(!is.na(names(.)))) %>%
slice(-1)
}
Add a header row that just contains the column names.
Use rowwise() and c_across() to get the values of all columns by row. Use this to identify which rows are header rows.
group_map will apply a function over each group and bind_rows will combine the results.
data %>%
add_row(!!!set_names(names(.)), .before = 1) %>%
rowwise() %>%
mutate(
group = all(is.na(c_across()) | c_across() %in% names(.))
) %>%
ungroup() %>%
mutate(group = cumsum(group)) %>%
group_by(group) %>%
group_map(process_df) %>%
bind_rows()
#> # A tibble: 6 x 4
#> A B C D
#> <chr> <chr> <chr> <chr>
#> 1 1 2 3 4
#> 2 1 2 3 4
#> 3 1 2 3 4
#> 4 10 20 30 40
#> 5 10 20 30 40
#> 6 NA 200 300 400
Explanation of the usage of !!! in new_row
set_names(names(.)) creates a named vector that represents the row we want to add. However, add_row doesn't accept a named vector - it wants the values to be specified as arguments.
Here is a simplified example.
new_row <- c(speed = 1, dist = 2)
add_row doesn't accept a named vector, so this doesn't work.
cars %>% add_row(new_row, .before = TRUE)
# (Error)
!!! will unpack the vector as arguments to the function.
cars %>% add_row(!!!new_row, .before = TRUE)
# (Works)
!!! above essentially results in this:
cars %>% add_row(speed = 1, dist = 2, .before = TRUE)

Does this work:
data
# A tibble: 5 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 A B C D
3 10 20 30 40
4 B C D NA
5 200 300 400 NA
data <- rbind(LETTERS[1:4],data)
data
# A tibble: 6 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
3 A B C D
4 10 20 30 40
5 B C D NA
6 200 300 400 NA
split(data, rep(1:ceiling(nrow(data)/2), each = 2))
$`1`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
$`2`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 10 20 30 40
$`3`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 B C D NA
2 200 300 400 NA

Base R solution:
Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2)))
Including pushing separate data.frames to Global Environment:
list2env(setNames(Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2))),
paste0('d', seq_len(ceiling(nrow(df) / 2)))), .GlobalEnv)
Tidyverse Solution:
library(tidyverse)
df %>%
rbind(names(df), .) %>%
split(cumsum(seq_len(nrow(.)) %% 2)) %>%
Map(function(x){setNames(x[2,], x[1,])[,complete.cases(t(x))]}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Note solution adjusted to reflect edit to the question:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1]); y[,!is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
New solution including push to Global Env:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
dflist <- Map(function(x) {
y <-
setNames(t(x[, -1, drop = FALSE]), x[, 1])
y[, !is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
list2env(setNames(dflist, paste0('d', names(dflist))), .GlobalEnv)
Adjusted Tidyverse solution:
df %>%
rbind(names(.), .) %>%
t() %>%
data.frame() %>%
type.convert() %>%
split.default(cumsum(!sapply(., is.integer))) %>%
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1])
data.frame(y[,!is.na(colSums(y)), drop = FALSE])}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Data:
df <- structure(list(A = c("1", "A", "10", "B", "200"), B = c("2", "B", "20", "C", "300"), C = c("3", "C", "30", "D", "400"), D = c("4","D", "40", NA, NA)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
Updated Data:
df <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))

arrange a complicated data set in R

I have a large data set:
> ncol(d) [1] 1680 nrow(d) [1] 12
that it looks like this:
a b c e f g
3 2 5 1 3 6
a b c d e g
1 7 8 4 5 8
a c d e f h #in this row b does not exist
5 10 4 7 5 10
And I need that it looks like this:
a b c d e f g h
3 2 5 0 3 6 10 8
1 7 8 4 5 0 8 0
5 0 10 4 7 5 0 10 #and all the other columns ...
Since my data is really long and I have many corrections like this one to do over all the data set, it is hard to do it by hand. I would like to know if there is any way to do this using some sort of automatic way, like a logic function or a loop.
Any idea is welcome
Regards

Here's a possible approach using data.table:
library(data.table)
melt(
setDT(
setnames(
data.table::transpose(df1),
paste(rep(1:(nrow(df1)/2), each = 2), c("name", "value"), sep = "_"))),
measure = patterns("name", "value"))[
, dcast(.SD, variable ~ value1, value.var = "value2", fill = 0)]
# variable a b c d e f g h
# 1: 1 3 2 5 0 1 3 6 0
# 2: 2 1 7 8 4 5 0 8 0
# 3: 3 5 0 10 4 7 5 0 10

We could get the alternate rows with recycling logical vector, construct a data.frame and pivot it to wide format with pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
sub1 <- df1[c(TRUE, FALSE),]
sub2 <- df1[c(FALSE, TRUE),]
tibble(ind = c(row(sub1)), col1 = factor(unlist(sub1), levels = letters[1:8]),
col2 = as.integer(unlist(sub2))) %>%
pivot_wider(names_from = col1, values_from = col2,
values_fill = list(col2 = 0)) %>%
select(-ind)
#A tibble: 3 x 8
# a b c d e f g h
# <int> <int> <int> <int> <int> <int> <int> <int>
#1 3 2 5 0 1 3 6 0
#2 1 7 8 4 5 0 8 0
#3 5 0 10 4 7 5 0 10
Or using base R with reshape
out <- reshape(
data.frame(ind = c(row(sub1)),
col1 = factor(unlist(sub1), levels = letters[1:8]),
col2 = as.integer(unlist(sub2))),
idvar = 'ind', direction = 'wide', timevar = 'col1')[-1]
names(out) <- sub("col2\\.", "", names(out))
out[is.na(out)] <- 0
row.names(out) <- NULL
out
# a b c d e f g h
#1 3 2 5 0 1 3 6 0
#2 1 7 8 4 5 0 8 0
#3 5 0 10 4 7 5 0 10
data
df1 <- structure(list(v1 = c("a", "3", "a", "1", "a", "5"), v2 = c("b",
"2", "b", "7", "c", "10"), v3 = c("c", "5", "c", "8", "d", "4"
), v4 = c("e", "1", "d", "4", "e", "7"), v5 = c("f", "3", "e",
"5", "f", "5"), v6 = c("g", "6", "g", "8", "h", "10")), class = "data.frame",
row.names = c(NA,
-6L))

Count occurrences per entry in dataframe

I have the following kind of dataframe (this is simplified example):
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df
id bank
1 1 a
2 1 b
3 1 c
4 2 b
5 3 b
6 3 c
7 4 a
8 4 c
In this dataframe you can see that for some ids there are multiple banks, i.e. for id==1, bank=c(a,b,c).
The information I would like to extract from this dataframe is the overlap between id's within different banks and the count.
So for example for bank a: bank a has two persons (unique ids): 1 and 4. For these persons, I want to know what other banks they have
For person 1: bank b and c
For person 4: bank c
the total amount of other banks: 3, for which, b = 1, and c = 2.
So I want to create as output a sort of overlap table as below:
bank overlap amount
a b 1
a c 2
b a 1
b c 2
c a 2
c b 2

Took me a while to get a result, so I post it. Not as sexy as Ronak Shahs but same result.
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank)
df$bank <- as.character(df$bank)
resultlist <- list()
dflist <- split(df, df$id)
for(i in 1:length(dflist)) {
if(nrow(dflist[[i]]) < 2) {
resultlist[[i]] <- data.frame(matrix(nrow = 0, ncol = 2))
} else {
resultlist[[i]] <- as.data.frame(t(combn(dflist[[i]]$bank, 2)))
}
}
result <- setNames(data.table(rbindlist(resultlist)), c("bank", "overlap"))
result %>%
group_by(bank, overlap) %>%
summarise(amount = n())
bank overlap amount
<fct> <fct> <int>
1 a b 1
2 a c 2
3 b c 2

We may use data.table:
df = data.frame(id = c("1", "1", "1", "2", "3", "3", "4", "4"),
bank = c("a", "b", "c", "b", "b", "c", "a", "c"))
library(data.table)
setDT(df)[, .(bank = rep(bank, (.N-1L):0L),
overlap = bank[(sequence((.N-1L):1L) + rep(1:(.N-1L), (.N-1L):1))]),
by=id][,
.N, by=.(bank, overlap)]
#> bank overlap N
#> 1: a b 1
#> 2: a c 2
#> 3: b c 2
#> 4: <NA> b 1
Created on 2019-07-01 by the reprex package (v0.3.0)
Please note that you have b for id==2 which is not overlapping with other values. If you don't want that in the final product, just apply na.omit() on the output.

An option would be full_join
library(dplyr)
full_join(df, df, by = "id") %>%
filter(bank.x != bank.y) %>%
dplyr::count(bank.x, bank.y) %>%
select(bank = bank.x, overlap = bank.y, amount = n)
# A tibble: 6 x 3
# bank overlap amount
# <fct> <fct> <int>
#1 a b 1
#2 a c 2
#3 b a 1
#4 b c 2
#5 c a 2
#6 c b 2

Do you need to cover both banks in both the directions? Since a -> b is same as b -> a in this case here. We can use combn and create combinations of unique bank taken 2 at a time, find out length of common id found in the combination.
as.data.frame(t(combn(unique(df$bank), 2, function(x)
c(x, with(df, length(intersect(id[bank == x[1]], id[bank == x[2]])))))))
# V1 V2 V3
#1 a b 1
#2 a c 2
#3 b c 2
data
id = c("1", "1", "1", "2", "3", "3", "4", "4")
bank = c("a", "b", "c", "b", "b", "c", "a", "c")
df = data.frame(id, bank, stringsAsFactors = FALSE)

replacing multiple values in data frame in R

I want to reassign multiple different character strings with the same value in a single call. However, the following code only replaces some of values in each variable.
dat <-data.frame(x=c(rep("1=x",4),rep("b",4)),y=c(rep("1=z",4),rep("b",4)))
dat[] <- sapply(dat[], as.character)
dat[dat == c("1=x", "1=y")]<- 1
such that I get:
dat
x y
1 1 1=z
2 1=x 1=z
3 1 1=z
4 1=x 1=z
5 b b
6 b b
7 b b
8 b b
when I want is the following:
dat
x y
1 1 1
2 1 1
3 1 1
4 1 1
5 b b
6 b b
7 b b
8 b b

With dplyr:
library(dplyr)
dat <- mutate_all(dat, funs(replace(., grepl("1=", .), 1)))
With Base R:
dat[] <- lapply(dat, function(x) replace(x, grepl("1=", x), 1))
Result:
x y
1 1 1
2 1 1
3 1 1
4 1 1
5 b b
6 b b
7 b b
8 b b
Data:
dat <- structure(list(x = c("1=x", "1=x", "1=x", "1=x", "b", "b", "b",
"b"), y = c("1=z", "1=z", "1=z", "1=z", "b", "b", "b", "b")), .Names = c("x",
"y"), row.names = c(NA, -8L), class = "data.frame")

Another Base R option if you want to make an explicit replacement of certain strings would be:
dat[] <- lapply(dat,function(x) ifelse(x %in% c("1=x", "1=z"), 1, x))
Result:
x y
1 1 1
2 1 1
3 1 1
4 1 1
5 b b
6 b b
7 b b
8 b b
Data:
dat <- structure(list(x = c("1", "1", "1", "1", "b", "b", "b", "b"),
y = c("1", "1", "1", "1", "b", "b", "b", "b")), row.names = c(NA,
-8L), class = "data.frame")

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

how can I regroup multiple categorical variables into a new variable - r

Here's an approach with dplyr: library(dplyr) df %>% group_by(A) %>% dplyr::summarise(C = case_when("TCRB" %in% B & "TCRG" %in% B ~ 2, "TCRB" %in% B ~ 0, "TCRG" %in% B ~ 1, TRUE ~ NA_real_)) # A tibble: 4 x 2 A C <fct> <dbl> 1 a 2 2 b 2 3 c 0 4 d 1

In Base R, we can use aggregate : aggregate(B~A, df, function(x) { if(all(c('TCRB', 'TCRG') %in% x)) 2 else if(any(x == 'TCRG')) 1 else if(any(x == 'TCRB')) 0 else NA }) # A B #1 a 2 #2 b 2 #3 c 0 #4 d 1

Related

Creating a new column of 1s and 0s as a way to count unique values in R

how to split a dataframe by specific rows in r

arrange a complicated data set in R

Count occurrences per entry in dataframe

replacing multiple values in data frame in R

Categories

Resources