More efficient way to purrr::map2 for a large dataframe - r

Is there a faster way to do the following, where in the real application, df has many rows (and therefore list_of_colnames has the same number of elements):
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")
map2(split(df, seq(nrow(df))), list_of_colnames, function(row, colnames) {
row$indicator <- ifelse(any(row[, colnames] %in% some_vector), 1, 0)
return(row)
})
While this current implementation works, it takes centuries for the big df. In fact I think split() is a major bottleneck.
Thank you!

One option is to make use of row/column matrix indexing, pairing each row with only its own columns:
rowind <- rep(seq_len(nrow(df)), lengths(list_of_colnames))
colind <- match(unlist(list_of_colnames), names(df))
df$indicator <- as.integer(tapply(as.matrix(df)[cbind(rowind, colind)] %in% some_vector,
    rowind, FUN = any))
-output
> df
      A   B indicator
1  fish   A         1
2 hello cat         0
data
df <- data.frame(A = c('fish', 'hello'), B = c('A', 'cat'))
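If many rows share the same column set, another possibility is to group the rows by set and run one vectorised %in% check per group rather than one per row. The following is a minimal base R sketch under that assumption, not a benchmarked solution. The names set_key and indicator are chosen here for illustration, and the sketch assumes every element of list_of_colnames is non-empty and that no column name contains "|".

```r
df <- data.frame(A = c("fish", "hello"), B = c("A", "cat"))
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")

# One key per row describing its column set (assumes no column name contains "|").
set_key <- vapply(list_of_colnames, paste, character(1), collapse = "|")

indicator <- integer(nrow(df))
# Rows that share a column set are checked together, one column at a time.
for (rows in split(seq_len(nrow(df)), set_key)) {
  cols <- list_of_colnames[[rows[1]]]
  hits <- lapply(df[rows, cols, drop = FALSE], function(col) col %in% some_vector)
  # Element-wise OR across the columns of this set.
  indicator[rows] <- as.integer(Reduce(`|`, hits))
}
df$indicator <- indicator
df
```

The loop runs once per distinct column set, so the gain depends on how few distinct sets there are compared with the number of rows.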

You can avoid splitting your data frame into a list altogether and instead apply your condition across the rows using rowwise and c_across from dplyr:
library(dplyr)
library(purrr)
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")
map(list_of_colnames, ~
df %>%
rowwise() %>%
mutate(indicator = as.numeric(any(c_across(all_of(.x)) %in% some_vector))) %>%
ungroup()
)
Output
Note that this applies each column set to every row, so mapping over list_of_colnames still returns a list with one tibble per column set:
[[1]]
# A tibble: 3 x 4
  A     B     C     indicator
  <chr> <chr> <chr>     <dbl>
1 fish  dog   bird          1
2 dog   cat   bird          1
3 bird  lion  cat           0
[[2]]
# A tibble: 3 x 4
  A     B     C     indicator
  <chr> <chr> <chr>     <dbl>
1 fish  dog   bird          1
2 dog   cat   bird          0
3 bird  lion  cat           0
Data
df <- structure(list(A = c("fish", "dog", "bird"), B = c("dog", "cat",
"lion"), C = c("bird", "bird", "cat")), class = "data.frame", row.names = c(NA,
-3L))

Related

Move subgroup under repeated main group while keeping main group once in data.frame R

I'm aware that the question is awkward. If I could phrase it better, I'd probably find the solution in another thread.
I have this data structure...
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"))
... and would like to turn it into this...
data.frame(group = c("X", "F", "camel", "horse", "dog", "cat", "C", "orange", "banana"))
... which is surprisingly confusing. Also, I would prefer not using a loop.
EDIT: I updated the example to clarify that solutions that depend on sorting unfortunately do not do the trick.
Here is an (edited) answer with the new data.
Using data.table helps a lot here. The idea is to split df into groups and lapply() the transformation to each group. We have to take care of a few things along the way.
library(data.table)
# convert to a data.table by reference
setDT(df)
# to maintain the original ordering, convert group to a factor;
# its levels carry the ordering information used by split()
df[, group := factor(group, levels = unique(group))]
# split df into a list with one element per group
df_list <- split(df, df$group, sorted = FALSE)
# for each group, put the group name on top of its subgroups
df_list <- lapply(df_list, function(x) data.frame(group = unique(c(as.character(x$group), x$subgroup))))
# bind into a single data.table and remove the NAs
rbindlist(df_list)[!is.na(group)]
group
1: X
2: F
3: camel
4: horse
5: dog
6: cat
7: C
8: orange
9: banana
With the edited data we need to add another column (here row_number) to sort by:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  mutate(r_n = row_number()) %>%
  group_by(value) %>% slice(1) %>%
  arrange(r_n) %>%
  filter(!is.na(value))
#output
# A tibble: 9 × 3
# Groups: value [9]
name value r_n
<chr> <chr> <int>
1 group X 1
2 group F 3
3 subgroup camel 4
4 subgroup horse 6
5 subgroup dog 8
6 subgroup cat 10
7 group C 11
8 subgroup orange 12
9 subgroup banana 14
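For comparison, the same result can be obtained in base R without a loop by interleaving the two columns. This is only a sketch: header, stacked and result are names made up for the example, and it assumes the group label should appear once, at the first row where it occurs.

```r
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
                 subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"),
                 stringsAsFactors = FALSE)

# Keep the group label only on the first row where it occurs.
header <- ifelse(duplicated(df$group), NA, df$group)
# rbind() builds a 2-row matrix; c() reads it column by column,
# which interleaves header and subgroup in the original row order.
stacked <- c(rbind(header, df$subgroup))
result <- data.frame(group = stacked[!is.na(stacked)], stringsAsFactors = FALSE)
result
```

Because nothing is sorted, the original order of the groups is preserved.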

Collapsing Columns in R using tidyverse with mutate, replace, and unite. Writing a function to reuse?

Data:
ID  B   C
1   NA  x
2   x   NA
3   x   x
Results:
ID  Unified
1   C
2   B
3   B_C
I'm trying to combine columns B and C using mutate and unite, but how would I scale this up so that I can reuse it for many columns (think 100+) instead of having to write out the variables each time? Or is there a built-in function that already does this?
My current solution is this:
library(tidyverse)
Data %>%
mutate(B = replace(B, B == 'x', 'B'), C = replace(C, C == 'x', 'C')) %>%
unite("Unified", B:C, na.rm = TRUE, remove= TRUE)
We may use across to loop over the columns and replace each value equal to 'x' with the column name (cur_column()):
library(dplyr)
library(tidyr)
Data %>%
mutate(across(B:C, ~ replace(., .== 'x', cur_column()))) %>%
unite(Unified, B:C, na.rm = TRUE, remove = TRUE)
-output
ID Unified
1 1 C
2 2 B
3 3 B_C
data
Data <- structure(list(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA,
"x")), class = "data.frame", row.names = c(NA, -3L))
Here are a couple of options.
Using dplyr -
library(dplyr)
cols <- names(Data)[-1]
Data %>%
rowwise() %>%
mutate(Unified = paste0(cols[!is.na(c_across(all_of(cols)))], collapse = '_')) %>%
ungroup -> Data
Data
# ID B C Unified
# <int> <chr> <chr> <chr>
#1 1 NA x C
#2 2 x NA B
#3 3 x x B_C
Base R
Data$Unified <- apply(Data[cols], 1, function(x)
paste0(cols[!is.na(x)], collapse = '_'))
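Since the question asks about reuse, the base R option can be wrapped in a small function. This is a sketch only: unify_columns is a hypothetical helper name, not a function from any package, and it assumes that any non-NA value in a column counts as a mark for that column.

```r
# Hypothetical helper: for each row, join the names of the columns in `cols`
# whose value is not NA.
unify_columns <- function(data, cols, sep = "_") {
  present <- !is.na(as.matrix(data[cols]))
  unname(apply(present, 1, function(p) paste(cols[p], collapse = sep)))
}

Data <- data.frame(ID = 1:3, B = c(NA, "x", "x"), C = c("x", NA, "x"),
                   stringsAsFactors = FALSE)
Data$Unified <- unify_columns(Data, c("B", "C"))
Data[c("ID", "Unified")]
```

For 100+ columns, cols can be built programmatically, for example with setdiff(names(Data), "ID"). A row with no marks yields an empty string.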

Splitting values in a column

sorry I'm new to R but I've got some data that looks like the following:
I'd like to count the number of times each object is mentioned in the findings. So the result would look like this:
I've tried tidyverse and separate but can't seem to get the hang of it, any help would be amazing, thanks in advance!
To recreate my data:
df <- data.frame(
col_1 = paste0("image", 1:5),
findings = c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")
)
You can use separate_rows() and then count().
library(tidyverse)
df %>%
separate_rows(findings) %>%
count(findings)
# # A tibble: 5 x 2
# findings n
# <chr> <int>
# 1 cat 4
# 2 dog 2
# 3 fish 1
# 4 rock 1
# 5 sun 3
Data
df <- structure(list(col_1 = c("image_1", "image_2", "image_3", "image_4",
"image_5"), findings = c("rock|cat|sun", "cat", "cat|dog|fish|sun",
"sun", "dog|cat")), class = "data.frame", row.names = c(NA, -5L))
In base R:
as.data.frame(table(unlist(strsplit(df$col_2, "|", fixed = TRUE))))
# Var1 Freq
# 1 cat 4
# 2 dog 2
# 3 fish 1
# 4 rock 1
# 5 sun 3
Reproducible data (please provide it in your next post):
df <- data.frame(
col_1 = paste0("image", 1:5),
col_2 = c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")
)
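A note on why fixed = TRUE is used above: in a regular expression, "|" means alternation, so it has to be either treated as a literal string or escaped. The sketch below compares the two forms on the same findings data (the variable names are illustrative only).

```r
findings <- c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")

# fixed = TRUE splits on the literal "|" character.
literal <- strsplit(findings, "|", fixed = TRUE)
# Equivalent regular-expression form: the bar must be escaped.
escaped <- strsplit(findings, "\\|")

identical(literal, escaped)  # TRUE
counts <- table(unlist(literal))
counts[["cat"]]              # 4
```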
An option with cSplit
library(splitstackshape)
cSplit(df, 'col_2', 'long', sep="|")[, .N, col_2]
# col_2 N
#1: rock 1
#2: cat 4
#3: sun 3
#4: dog 2
#5: fish 1
data
df <- structure(list(col_1 = c("image1", "image2", "image3", "image4",
"image5"), col_2 = c("rock|cat|sun", "cat", "cat|dog|fish|sun",
"sun", "dog|cat")), class = "data.frame", row.names = c(NA, -5L
))
Using tidyverse:
df %>%
separate_rows(findings) %>%
group_by(findings) %>%
summarize(total_count_col=n())
First we convert the data into a long format using separate_rows, then group and count the number of rows with each finding.
Example:
df<-data.frame(col1=c(rep(letters[1:3],3),"d"),col2=c(rep("moose|cat|dog",9),"rock"), stringsAsFactors = FALSE)
df %>% separate_rows(col2) %>% group_by(col2) %>% summarize(total_count_col=n())
# A tibble: 4 x 2
col2 total_count_col
<chr> <int>
1 cat 9
2 dog 9
3 moose 9
4 rock 1

How to merge rows in a dataframe and combine factor-values in cells

I have a dataframe in R in which I want to merge certain rows and combine the values of certain cells in these rows. Imagine the following data frame:
Col.1<-c("a","b","b","a","c","c","c","d")
Col.2<-c("mouse", "cat", "dog", "bird", "giraffe", "elephant", "zebra", "worm")
df<-data.frame(Col.1, Col.2)
df
Col.1 Col.2
a mouse
b cat
b dog
a bird
c giraffe
c elephant
c zebra
d worm
I would like to merge all adjacent rows in which the values in Col.1 are the same and combine the values in Col.2 accordingly.
The final result should look like this:
Col.1 Col.2
a mouse
b cat dog
a bird
c giraffe elephant zebra
d worm
I have tried a plyr solution (like: ddply(df, .(Col.1), summarize, Col.2 = sum(Col.2))), but sum doesn't work for factor values.
We can do a group-by paste. To group adjacent identical elements, rleid from data.table can be used, and the values of 'Col.2' are then summarised by pasting them together:
library(dplyr)
library(data.table)
library(stringr)
df %>%
group_by(Col.1, grp = rleid(Col.1)) %>%
summarise(Col.2 = str_c(Col.2, collapse=' ')) %>%
ungroup %>%
select(-grp)
# A tibble: 5 x 2
# Col.1 Col.2
# <fct> <chr>
#1 a mouse
#2 a bird
#3 b cat dog
#4 c giraffe elephant zebra
#5 d worm
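NOTE: The merged groups match the output shown in the OP's post, but the rows come out sorted by Col.1. Grouping by grp first, as in group_by(grp = rleid(Col.1), Col.1), keeps the original row order.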
EDIT: missed the "adjacent" bit. See the solution using base function rle below from this question.
library(dplyr)
Col.1 <- c("a","b","b","a","c","c","c","d")
Col.2 <- c("mouse", "cat", "dog", "bird", "giraffe", "elephant", "zebra", "worm")
df <- tibble(Col.1, Col.2)
# lengths of the runs of consecutive identical values in Col.1
rlel <- rle(df$Col.1)$lengths
df %>%
  mutate(adj = rep(seq_along(rlel), rlel)) %>%  # id of the run each row belongs to
  group_by(Col.1, adj) %>%
  summarize(New.Col.2 = paste(Col.2, collapse = " ")) %>%
  ungroup %>% arrange(adj) %>% select(-adj)
# A tibble: 5 x 2
Col.1 New.Col.2
<chr> <chr>
1 a mouse
2 b cat dog
3 a bird
4 c giraffe elephant zebra
5 d worm
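The same run-based grouping can also be written in base R only. This is a minimal sketch, assuming Col.1 and Col.2 are character vectors (convert with as.character() first if they are factors); runs, run_id and merged are names chosen for the example.

```r
df <- data.frame(Col.1 = c("a", "b", "b", "a", "c", "c", "c", "d"),
                 Col.2 = c("mouse", "cat", "dog", "bird", "giraffe", "elephant", "zebra", "worm"),
                 stringsAsFactors = FALSE)

# rle() finds runs of adjacent identical values; run_id labels each row with its run.
runs <- rle(df$Col.1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

merged <- data.frame(
  Col.1 = runs$values,
  Col.2 = as.vector(tapply(df$Col.2, run_id, paste, collapse = " ")),
  stringsAsFactors = FALSE
)
merged
```

Because the run ids increase with row position, the result stays in the original order without an extra sorting step.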

Organize subgroup strings (text)

I am trying to convert something like this df format:
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df
first words
1 a about
2 a among
3 b blue
4 b but
5 b both
6 c cat
into the following format:
df1
first words
1 a about, among
2 b blue, but, both
3 c cat
I have tried
aggregate(words ~ first, data = df, FUN = list)
first words
1 a 1, 2
2 b 3, 5, 4
3 c 6
and tidyverse:
df %>%
group_by(first) %>%
group_rows()
Any suggestions would be appreciated!
A data.table solution:
library(data.table)
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
words =c("about", "among", "blue", "but", "both", "cat"))
df <- setDT(df)[, lapply(.SD, toString), by = first]
df
# first words
# 1: a about, among
# 2: b blue, but, both
# 3: c cat
# convert back to a data.frame if you want
setDF(df)
Using tidyverse, after the group_by, use summarise to either paste the words into a single string
library(dplyr)
df %>%
group_by(first) %>%
summarise(words = toString(words))
# A tibble: 3 x 2
# first words
# <fct> <chr>
#1 a about, among
#2 b blue, but, both
#3 c cat
or keep it as a list column
df %>%
group_by(first) %>%
summarise(words = list(words))
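The aggregate() attempt from the question can probably be made to work by swapping FUN = list for toString. A small sketch, assuming words is stored as character (set explicitly below; with factors, convert with as.character() first to be safe):

```r
df <- data.frame(first = c("a", "a", "b", "b", "b", "c"),
                 words = c("about", "among", "blue", "but", "both", "cat"),
                 stringsAsFactors = FALSE)

# toString() collapses each group's words into one comma-separated string.
df1 <- aggregate(words ~ first, data = df, FUN = toString)
df1
```

The words keep their original order within each group.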
