Random sampling in R within a categorical variable

Suppose I have a data frame with a categorical variable of n classes and a numerical variable. I need to randomize the numerical variable within each category. For example, consider the following table:
Col_1 Col_2
A 2
A 5
A 4
A 8
B 1
B 4
B 9
B 7
When I tried the sample() function in R, it shuffled the values across both categories instead of within each one. Is there a function that gives this kind of output? (With or without replacement, it doesn't matter.)
Col_1 Col_2
A 8
A 4
A 2
A 5
B 9
B 7
B 4
B 1

You could sample row numbers within groups. In base R, we can use ave
df[with(df, ave(seq_len(nrow(df)), Col_1, FUN = sample)), ]
# Col_1 Col_2
#2 A 5
#4 A 8
#1 A 2
#3 A 4
#7 B 9
#5 B 1
#8 B 7
#6 B 4
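In dplyr, we can use sample_n (superseded by slice_sample since dplyr 1.0.0)
library(dplyr)
df %>% group_by(Col_1) %>% sample_n(n())
# or, in dplyr >= 1.0.0:
# df %>% group_by(Col_1) %>% slice_sample(prop = 1)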
data
df <- structure(list(Col_1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("A", "B"), class = "factor"), Col_2 = c(2L, 5L,
4L, 8L, 1L, 4L, 9L, 7L)), class = "data.frame", row.names = c(NA, -8L))

Here's a dplyr solution:
library(dplyr)
set.seed(2)
dat %>%
group_by(Col_1) %>%
mutate(Col_2 = sample(Col_2)) %>%
ungroup()
# # A tibble: 8 x 2
# Col_1 Col_2
# <chr> <int>
# 1 A 2
# 2 A 4
# 3 A 5
# 4 A 8
# 5 B 7
# 6 B 9
# 7 B 1
# 8 B 4
A data.table method:
library(data.table)
datDT <- as.data.table(dat)
set.seed(2)
datDT[, Col_2 := sample(Col_2), by = "Col_1"]
datDT
# Col_1 Col_2
# 1: A 2
# 2: A 4
# 3: A 5
# 4: A 8
# 5: B 7
# 6: B 9
# 7: B 1
# 8: B 4
Data
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Col_1 Col_2
A 2
A 5
A 4
A 8
B 1
B 4
B 9
B 7")
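One caveat with the approaches above: when a group holds a single numeric value (or a single row number), sample(x) treats x as the upper bound of 1:x rather than as a one-element vector, as documented in ?sample. A minimal base R sketch of a safer shuffle follows. The helper name resample and the extra singleton group "C" are assumptions added for illustration; they are not part of the original data.

```r
# Permute positions with sample.int(), so a one-element group is returned
# unchanged instead of being expanded to 1:x.
resample <- function(x) x[sample.int(length(x))]

dat <- data.frame(
  Col_1 = c("A", "A", "A", "A", "B", "B", "B", "B", "C"),  # "C" is a hypothetical singleton group
  Col_2 = c(2L, 5L, 4L, 8L, 1L, 4L, 9L, 7L, 6L),
  stringsAsFactors = FALSE
)

set.seed(1)
shuffled <- dat
# ave() applies resample() to Col_2 separately within each level of Col_1
shuffled$Col_2 <- ave(dat$Col_2, dat$Col_1, FUN = resample)
shuffled
```

The same helper can be dropped into the dplyr or data.table versions in place of sample().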

Related

Using complete to fill groups with NA to have same length as the maximum group

I have this dataframe:
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L), var = c("A", "B",
"C", "B", "C", "C")), class = "data.frame", row.names = c(NA,
-6L))
id var
1 1 A
2 1 B
3 1 C
4 2 B
5 2 C
6 3 C
I would like to get this dataframe:
id var
1 1 A
2 1 B
3 1 C
4 2 <NA>
5 2 B
6 2 C
7 3 <NA>
8 3 <NA>
9 3 C
I would like to learn how to use complete or expand.grid in this situation.
I have tried several ways without success. One of my attempts:
df %>%
complete(id, var, fill=list(NA))
Create a duplicate of the 'var' column and call complete on the duplicate instead. The rows that complete adds then have NA in the original 'var' column. Finally, remove the duplicate column:
library(dplyr)
library(tidyr)
df %>%
mutate(var1 = var) %>%
complete(id, var1) %>%
select(-var1)
-output
# A tibble: 9 × 2
id var
<int> <chr>
1 1 A
2 1 B
3 1 C
4 2 <NA>
5 2 B
6 2 C
7 3 <NA>
8 3 <NA>
9 3 C
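Since the question also asks about expand.grid, here is a rough base R sketch of the same idea: build every id/value combination, then left-join the original rows onto it. The helper column name key and the object names grid and out are assumptions chosen for this illustration.

```r
df <- data.frame(
  id  = c(1L, 1L, 1L, 2L, 2L, 3L),
  var = c("A", "B", "C", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# All combinations of id and the observed values of var
grid <- expand.grid(id = unique(df$id), key = unique(df$var),
                    stringsAsFactors = FALSE)

# Duplicate var as a join key, so the original var becomes NA
# for combinations that are absent from df
df$key <- df$var
out <- merge(grid, df, by = c("id", "key"), all.x = TRUE)

# Order explicitly, then drop the helper column
out <- out[order(out$id, out$key), c("id", "var")]
rownames(out) <- NULL
out
```

This mirrors the duplicate-column trick: the join key carries the full set of values, while var keeps only what was actually observed.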

Counting occurrences in every column in R

Hello, I need to count the occurrences of every number in each column.
Example data-frame:
A B C
2 1 2
2 1 1
1 1 3
3 3 3
3 2 2
2 1 2
I want my output to look like this
how_much A B C
1 1 4 1
2 3 1 3
3 2 1 2
In tidyverse you could do:
library(tidyverse)
gather(df1) %>%
group_by(key,value) %>%
count() %>%
pivot_wider(id_cols = value, names_from = key, values_from = n, values_fill = 0)
value A B C
<int> <int> <int> <int>
1 1 1 4 1
2 2 3 1 3
3 3 2 1 2
We can use table
table(unlist(df1), names(df1)[c(col(df1))])
-output
A B C
1 1 4 1
2 3 1 3
3 2 1 2
Or loop over the columns with sapply, and apply table
sapply(df1, table)
A B C
1 1 4 1
2 3 1 3
3 2 1 2
data
df1 <- structure(list(A = c(2L, 2L, 1L, 3L, 3L, 2L), B = c(1L, 1L, 1L,
3L, 2L, 1L), C = c(2L, 1L, 3L, 3L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-6L))
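A possible pitfall with sapply(df1, table): it only simplifies to a matrix when every column contains the same set of values. If a value is missing from one column, the per-column tables differ in length and sapply returns a list. The sketch below fixes the levels up front; df2 is a hypothetical variant of the data in which 2 never occurs in column B.

```r
df2 <- data.frame(
  A = c(2L, 2L, 1L, 3L, 3L, 2L),
  B = c(1L, 1L, 1L, 3L, 3L, 1L),  # hypothetical: the value 2 is absent here
  C = c(2L, 1L, 3L, 3L, 2L, 2L)
)

# Common set of values across all columns
lev <- sort(unique(unlist(df2)))

# Tabulate each column against the same levels, so absent values count as 0
counts <- sapply(df2, function(col) table(factor(col, levels = lev)))
counts
```

Because every table now has one entry per level, the result is a matrix with one row per value and one column per variable.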
For a more flexible solution that works with any set of values, we can use functions from the purrr package.
library(dplyr)
library(purrr)
df1 %>%
map(~ unique(.x) %>% sort()) %>% reduce(~ union(..1, ..2)) %>%
bind_cols(map_dfr(., ~ map_dfc(df1, function(a) sum(a == .x)))) %>%
rename(what = ...1)
# A tibble: 3 x 4
what A B C
<int> <int> <int> <int>
1 1 1 4 1
2 2 3 1 3
3 3 2 1 2
A slightly verbose answer, but it will work on all data types.
set.seed(1234)
df1 <- data.frame(A = sample(letters[1:3], 8, T),
B = sample(letters[1:3], 8, T),
C = sample(letters[1:3], 8, T))
df1
#> A B C
#> 1 b c b
#> 2 b b a
#> 3 a b c
#> 4 c b c
#> 5 a c c
#> 6 a b a
#> 7 b b b
#> 8 b b a
library(tidyverse)
unique(unlist(apply(df1, 1, unique))) %>% as.data.frame() %>% setNames('how_much') %>%
bind_cols(map_df(unique(unlist(apply(df1, 1, unique))), ~map_int(df1, \(x) sum(x %in% .x) ) ))
#> how_much A B C
#> 1 b 4 6 2
#> 2 c 1 2 3
#> 3 a 3 0 3
Created on 2021-06-23 by the reprex package (v2.0.0)

Sum a set of numeric columns and collapse string column by group

Suppose I have a data frame (df) like this:
Names ID Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1 10 5 10 5 10
2: Gen2 id2 1 2 3 4 5
3: Gen1 id3 10 5 10 5 10
4: Gen2 id4 1 2 3 4 5
5: Gen3 id5 7 7 7 7 7
For each 'Names', I would like to sum 'Thing' columns, and collapse the strings in 'ID':
Names ID Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1|id3 20 10 20 10 20
2: Gen2 id2|id4 2 4 6 8 10
3: Gen3 id5 7 7 7 7 7
I am able to achieve this via dplyr:
df1 <- df %>%
group_by(Names)%>%
summarise_each(funs(paste(unique(.), collapse='|')),matches('^\\D+$'))
df2 <- df %>%
group_by(Names)%>%
summarise_each(funs(sum = sum(., na.rm=TRUE)), starts_with('Thing' ))
bind_cols(df1, df2[-1])
However, this solution takes a very long time, since my data frame has more than 10k rows and more than 10k columns!
Is there any possible solution with data.table?
The closest I have gotten is this here:
> setDT(df)[, c(paste(df$ID,collapse = "-", sep = ""), lapply(.SD, sum, na.rm = TRUE)),
by = Names, .SDcols = !"ID"]
Names Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1-id2-id3-id4-id5 20 10 20 10 20
2: Gen2 id1-id2-id3-id4-id5 2 4 6 8 10
3: Gen3 id1-id2-id3-id4-id5 7 7 7 7 7
Obviously this is not what I am going for since it will collapse all IDs and not just the ones that were aggregated by summarizing via "Names".
I would very much appreciate your help!
Here is the example data:
df <- structure(list(Names = c("Gen1", "Gen2", "Gen1", "Gen2","Gen3"),
ID=c("id1","id2","id3","id4","id5"),
Thing1 = c(10L, 1L, 10L, 1L, 7L),
Thing2 = c(5L, 2L, 5L, 2L,7L),
Thing3 = c(10L, 3L, 10L, 3L, 7L),
Thing4 = c(5L, 4L, 5L,4L, 7L),
Thing5 = c(10L, 5L, 10L, 5L, 7L)),
.Names = c("Names","ID","Thing1", "Thing2", "Thing3", "Thing4", "Thing5"),
class = "data.frame", row.names = c(1:5L))
If you don't heavily rely on data.table you could use aggregate two times and merge the results.
merge(aggregate(.~Names, df[-2], sum), aggregate(ID ~ Names, df, paste, collapse="|"))
# Names Thing1 Thing2 Thing3 Thing4 Thing5 ID
# 1 Gen1 20 10 20 10 20 id1|id3
# 2 Gen2 2 4 6 8 10 id2|id4
# 3 Gen3 7 7 7 7 7 id5
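Try it this way, using the tidyverse:
library(tidyverse)
df %>%
group_by(Names) %>%
# lambdas instead of passing extra arguments through across(),
# which is deprecated in dplyr >= 1.1.0
summarise(across(where(is.character), ~ str_c(.x, collapse = "|")),
across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))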
# A tibble: 3 x 7
Names ID Thing1 Thing2 Thing3 Thing4 Thing5
<chr> <chr> <int> <int> <int> <int> <int>
1 Gen1 id1|id3 20 10 20 10 20
2 Gen2 id2|id4 2 4 6 8 10
3 Gen3 id5 7 7 7 7 7
use data.table
library(data.table)
dt <- copy(df)
setDT(dt)
out_sum <- dt[, lapply(.SD, sum), by = Names, .SDcols=!"ID"]
out_id <- dt[, .(id = paste(ID, collapse = "|")), by = Names]
merge(out_id, out_sum)
Names id Thing1 Thing2 Thing3 Thing4 Thing5
1: Gen1 id1|id3 20 10 20 10 20
2: Gen2 id2|id4 2 4 6 8 10
3: Gen3 id5 7 7 7 7 7
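The attempt in the question collapsed every ID because it referred to df$ID, which is the whole column, instead of ID, which data.table evaluates within each group. Under that reading, the two steps above can probably be combined into one grouped call. This is a sketch on a trimmed copy of the example data (two Thing columns only); the names thing_cols and out are assumptions for illustration.

```r
library(data.table)

dt <- data.table(
  Names  = c("Gen1", "Gen2", "Gen1", "Gen2", "Gen3"),
  ID     = c("id1", "id2", "id3", "id4", "id5"),
  Thing1 = c(10L, 1L, 10L, 1L, 7L),
  Thing2 = c(5L, 2L, 5L, 2L, 7L)
)

# Columns to sum, selected by name pattern
thing_cols <- grep("^Thing", names(dt), value = TRUE)

# One pass per group: collapse ID (group-local, not dt$ID) and sum .SD
out <- dt[, c(list(ID = paste(ID, collapse = "|")),
              lapply(.SD, sum, na.rm = TRUE)),
          by = Names, .SDcols = thing_cols]
out
```

Whether this is faster than two aggregations plus a merge on a very wide table is not something this sketch establishes; it would need to be timed on the real data.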

Transpose Rows in batches to Columns in R

My data.frame df looks like this:
A 1
A 2
A 5
B 2
B 3
B 4
C 3
C 7
C 9
I want it to look like this:
A B C
1 2 3
2 3 7
5 4 9
I have tried spread() but probably not in the right way. Any ideas?
We can use unstack from base R
unstack(df1, col2 ~ col1)
# A B C
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or with split
data.frame(split(df1$col2, df1$col1))
Or if we use spread or pivot_wider, make sure to create a sequence column
library(dplyr)
library(tidyr)
df1 %>%
group_by(col1) %>%
mutate(rn = row_number()) %>%
ungroup %>%
pivot_wider(names_from = col1, values_from = col2) %>%
# or use
# spread(col1, col2) %>%
select(-rn)
# A tibble: 3 x 3
# A B C
# <int> <int> <int>
#1 1 2 3
#2 2 3 7
#3 5 4 9
Or using dcast
library(data.table)
dcast(setDT(df1), rowid(col1) ~ col1)[, .(A, B, C)]
data
df1 <- structure(list(col1 = c("A", "A", "A", "B", "B", "B", "C", "C",
"C"), col2 = c(1L, 2L, 5L, 2L, 3L, 4L, 3L, 7L, 9L)),
class = "data.frame", row.names = c(NA,
-9L))
In data.table, we can use dcast :
library(data.table)
dcast(setDT(df), rowid(col1)~col1, value.var = 'col2')[, col1 := NULL][]
# A B C
#1: 1 2 3
#2: 2 3 7
#3: 5 4 9
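The base R answers above assume every group has the same number of rows; with unequal groups, data.frame(split(...)) fails because the columns differ in length. A minimal sketch that pads shorter groups with NA is shown below. The unequal data here is hypothetical (group B has only two rows), and the names parts, n_max and wide are chosen for illustration.

```r
df1 <- data.frame(
  col1 = c("A", "A", "A", "B", "B", "C", "C", "C"),  # hypothetical: B has 2 rows
  col2 = c(1L, 2L, 5L, 2L, 3L, 3L, 7L, 9L),
  stringsAsFactors = FALSE
)

# One vector of values per group
parts <- split(df1$col2, df1$col1)

# Extend every vector to the longest group's length; `length<-` pads with NA
n_max <- max(lengths(parts))
wide <- as.data.frame(lapply(parts, `length<-`, n_max))
wide
```

The pivot_wider and dcast versions handle this case on their own, since the per-group row number leaves missing cells as NA.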

How to use column indices to collect values from columns in R

x y z column_indices
6 7 1 1,2
5 4 2 3
1 3 2 1,3
I have a column holding the indices of the columns whose values I would like to collect, as shown above. What I want to create is something like this:
x y z column_indices values
6 7 1 1,2 6,7
5 4 2 3 2
1 3 2 1,3 1,2
What is the simplest way to do this in R?
Thanks!
In base R, we can use apply, split the column_indices on ',', convert them to integer and get the corresponding value from the row.
df$values <- apply(df, 1, function(x) {
inds <- as.integer(strsplit(x[4], ',')[[1]])
toString(x[inds])
})
df
# x y z column_indices values
#1 6 7 1 1,2 6, 7
#2 5 4 2 3 2
#3 1 3 2 1,3 1, 2
data
df <- structure(list(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L,
2L, 2L), column_indices = structure(c(1L, 3L, 2L), .Label = c("1,2",
"1,3", "3"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
One solution involving dplyr and tidyr could be:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-column_indices) %>%
group_by(column_indices) %>%
mutate(values = toString(value[1:n() %in% unlist(strsplit(column_indices, ","))])) %>%
pivot_wider(names_from = "name", values_from = "value")
column_indices values x y z
<chr> <chr> <int> <int> <int>
1 1,2 6, 7 6 7 1
2 3 2 5 4 2
3 1,3 1, 2 1 3 2
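A note on the apply version: apply first converts the data frame to a character matrix, which can pad numbers with spaces when a column mixes one-digit and two-digit values. A sketch that indexes the data frame directly avoids that conversion. It assumes column_indices is stored as character (hence as.character below for the factor case), and the name idx is illustrative.

```r
df <- data.frame(
  x = c(6L, 5L, 1L),
  y = c(7L, 4L, 3L),
  z = c(1L, 2L, 2L),
  column_indices = c("1,2", "3", "1,3"),
  stringsAsFactors = FALSE
)

# Parse "1,2" into integer column positions, one vector per row
idx <- lapply(strsplit(as.character(df$column_indices), ","), as.integer)

# For each row, pull the values at its column positions and join them
df$values <- mapply(
  function(cols, row) paste(unlist(df[row, cols]), collapse = ","),
  idx, seq_len(nrow(df))
)
df
```

Unlike the grouped tidyr approach, this does not rely on each column_indices string being unique across rows.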
