Splitting values in a column - r

Sorry, I'm new to R, but I've got some data that looks like the following:
I'd like to count the number of times each object is mentioned in the findings. So the result would look like this:
I've tried tidyverse and separate() but can't seem to get the hang of it. Any help would be amazing, thanks in advance!
To recreate my data:
df <- data.frame(
col_1 = paste0("image", 1:5),
findings = c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")
)

You can use separate_rows() and then count().
library(tidyverse)
df %>%
  separate_rows(findings) %>%
  count(findings)
# # A tibble: 5 x 2
# findings n
# <chr> <int>
# 1 cat 4
# 2 dog 2
# 3 fish 1
# 4 rock 1
# 5 sun 3
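The call above relies on the default sep of separate_rows(), which splits on any run of non-alphanumeric characters. A minimal sketch, assuming some findings could contain a space (the "polar bear" value is made up for illustration), that passes the pipe as an explicit, escaped separator instead:

```r
library(tidyr)
library(dplyr)

# Hypothetical data: same shape as the question, plus one value with a space
df_demo <- data.frame(
  col_1 = paste0("image", 1:3),
  findings = c("rock|cat|sun", "cat", "polar bear|cat")
)

counts <- df_demo %>%
  # "|" is a regex metacharacter, so it is escaped here
  separate_rows(findings, sep = "\\|") %>%
  count(findings)

counts
```

With the explicit separator, "polar bear" should stay a single finding rather than being split at the space.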
Data
df <- structure(list(col_1 = c("image_1", "image_2", "image_3", "image_4",
"image_5"), findings = c("rock|cat|sun", "cat", "cat|dog|fish|sun",
"sun", "dog|cat")), class = "data.frame", row.names = c(NA, -5L))

In base R:
as.data.frame(table(unlist(strsplit(df$col_2, "|", fixed = TRUE))))
# Var1 Freq
# 1 cat 4
# 2 dog 2
# 3 fish 1
# 4 rock 1
# 5 sun 3
Reproducible data (please provide it in your next post):
df <- data.frame(
col_1 = paste0("image", 1:5),
col_2 = c("rock|cat|sun", "cat", "cat|dog|fish|sun", "sun", "dog|cat")
)
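The fixed = TRUE argument matters here because strsplit() treats its split argument as a regular expression by default, and "|" means alternation in a regex. A small sketch comparing the literal and escaped forms (the vector name findings is only for illustration):

```r
findings <- c("rock|cat|sun", "cat")

# With fixed = TRUE the pipe is taken as a literal separator
literal <- strsplit(findings, "|", fixed = TRUE)

# Equivalent regex form: escape the pipe
escaped <- strsplit(findings, "\\|")

# An unescaped "|" is an empty alternation, so the strings are
# broken into single characters instead of words
per_char <- strsplit(findings, "|")

identical(literal, escaped)
literal[[1]]
```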

An option with cSplit
library(splitstackshape)
cSplit(df, 'col_2', 'long', sep="|")[, .N, col_2]
# col_2 N
#1: rock 1
#2: cat 4
#3: sun 3
#4: dog 2
#5: fish 1
data
df <- structure(list(col_1 = c("image1", "image2", "image3", "image4",
"image5"), col_2 = c("rock|cat|sun", "cat", "cat|dog|fish|sun",
"sun", "dog|cat")), class = "data.frame", row.names = c(NA, -5L
))

Using tidyverse:
df %>%
  separate_rows(findings) %>%
  group_by(findings) %>%
  summarize(total_count_col = n())
First we convert the data into a long format using separate_rows, then group and count the number of rows with each finding.
Example:
df<-data.frame(col1=c(rep(letters[1:3],3),"d"),col2=c(rep("moose|cat|dog",9),"rock"), stringsAsFactors = FALSE)
df %>% separate_rows(col2) %>% group_by(col2) %>% summarize(total_count_col=n())
# A tibble: 4 x 2
col2 total_count_col
<chr> <int>
1 cat 9
2 dog 9
3 moose 9
4 rock 1

Related

More efficient way to purrr::map2 for a large dataframe

Is there a faster way to do the following, where in the real application, df has many rows (and therefore list_of_colnames has the same number of elements):
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")
map2(split(df, seq(nrow(df))), list_of_colnames, function(row, colnames) {
  row$indicator <- ifelse(any(row[, colnames] %in% some_vector), 1, 0)
  return(row)
})
While this current implementation works, it takes centuries for the big df. In fact I think split() is a major bottleneck.
Thank you!
One option may be to make use of row/column indexing
rowind <- rep(seq_len(nrow(df)), lengths(list_of_colnames) * nrow(df))
df$indicator <- +(tapply(c(t(df[unlist(list_of_colnames)])) %in% some_vector,
rowind, FUN = any))
-output
> df
A B indicator
1 fish A 1
2 hello cat 1
data
df <- data.frame(A = c('fish', 'hello'), B = c('A', 'cat'))
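A minimal sketch of the row/column indexing idea in which row i is checked only against the columns named in list_of_colnames[[i]], as in the question's map2 call. It assumes all columns are character (as.matrix() would coerce mixed types to character anyway); the helper names rows, cols and hits are made up for illustration.

```r
df <- data.frame(A = c("fish", "hello"), B = c("A", "cat"))
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")

# One (row, column) pair per name in list_of_colnames
rows <- rep(seq_len(nrow(df)), lengths(list_of_colnames))
cols <- match(unlist(list_of_colnames), names(df))

# Matrix indexing pulls all requested cells in one vectorised step
hits <- as.matrix(df)[cbind(rows, cols)] %in% some_vector

# Collapse back to one flag per row; rows without a hit stay 0
indicator <- integer(nrow(df))
indicator[unique(rows[hits])] <- 1L
df$indicator <- indicator

df
```

With this two-row example the first row gets 1 ("fish" is in column A) and the second gets 0, because only column A is checked for that row and "hello" is not in some_vector.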
You can avoid splitting your data frame into a list altogether and instead apply your condition across the rows using rowwise() and c_across() from dplyr:
library(dplyr)
library(purrr)
list_of_colnames <- list(c("A", "B"), c("A"))
some_vector <- c("fish", "cat")
map(list_of_colnames, ~
  df %>%
    rowwise() %>%
    mutate(indicator = as.numeric(any(c_across(all_of(.x)) %in% some_vector))) %>%
    ungroup()
)
Output
Mapping over list_of_colnames still returns a list:
[[1]]
# A tibble: 3 x 4
  A     B     C     indicator
  <chr> <chr> <chr>     <dbl>
1 fish  dog   bird          1
2 dog   cat   bird          1
3 bird  lion  cat           0
[[2]]
# A tibble: 3 x 4
  A     B     C     indicator
  <chr> <chr> <chr>     <dbl>
1 fish  dog   bird          1
2 dog   cat   bird          0
3 bird  lion  cat           0
Data
df <- structure(list(A = c("fish", "dog", "bird"), B = c("dog", "cat",
"lion"), C = c("bird", "bird", "cat")), class = "data.frame", row.names = c(NA,
-3L))
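When the same set of columns applies to every row, the row-wise step can also be replaced by a column-wise one. A base R sketch under that assumption, using the three-row example data; the names cols and flags are illustrative only:

```r
df <- data.frame(
  A = c("fish", "dog", "bird"),
  B = c("dog", "cat", "lion"),
  C = c("bird", "bird", "cat")
)
some_vector <- c("fish", "cat")
cols <- c("A", "B")

# One logical vector per column, then OR them together element-wise
flags <- lapply(df[cols], function(col) col %in% some_vector)
df$indicator <- as.numeric(Reduce(`|`, flags))

df
```

This touches each selected column once instead of each row once, which tends to matter when the data frame has many more rows than columns.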

R: extract text after ")"

A simple question, but I can't find a solution. In R with dplyr, how do I extract the text after a ")" and then split it on "/"?
My data looks like this:
# A tibble: 3 x 2
id Group
<dbl> <chr>
1 1 (aa1) red/yellow
2 2 (bb1) blue/yellow
3 3 (cc1) green/orange
structure(list(id = c(1, 2, 3), group = c("(aa1) red/yellow",
"(bb1) blue/yellow", "(cc1) green/orange")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
And I would simply like:
It seems simple, but I am new to R and cannot figure this out. Thanks.
You can use regmatches() in combination with regexpr().
library(dplyr)
df = data.frame(id = c(1,2,3), group = c("(aa1) red/yellow","(dd1) blue/yellow","(cc1) green/orange"))
df %>%
  mutate(x1 = regmatches(group, regexpr("^\\(.{3}\\)", group)),
         x2 = regmatches(group, regexpr("(?<= )\\w+(?=/)", group, perl = TRUE)),
         x3 = regmatches(group, regexpr("(?<=/)\\w+$", group, perl = TRUE)))
output is:
id group x1 x2 x3
1 1 (aa1) red/yellow (aa1) red yellow
2 2 (dd1) blue/yellow (dd1) blue yellow
3 3 (cc1) green/orange (cc1) green orange
If you don't know how to use regular expressions you can read this, it is a helpful intro to regular expressions
First separate the values in group, splitting on whitespace (\\s) or /. Then remove the parentheses in x1 using sub(), keeping only the alphanumeric part (\\w+) via the backreference \\1 in the replacement:
library(tidyr)
library(dplyr)
df %>%
  separate(., col = "group", into = paste0("x", 1:3), sep = "\\s|/") %>%
  mutate(x1 = sub(".(\\w+).", "\\1", x1))
# A tibble: 3 x 4
id x1 x2 x3
<dbl> <chr> <chr> <chr>
1 1 aa1 red yellow
2 2 bb1 blue yellow
3 3 cc1 green orange
EDIT:
If your input data is more complex, as suggested in a comment, such as this:
df <- structure(list(id = c(1, 2, 3), group = c("(aa1) red bus/yellow",
"(bb1) blue/yellow", "(cc1) green/orange apple")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
then this will work (the separator now consumes the closing parenthesis, so only the opening one has to be stripped from x1):
df %>%
  separate(., col = "group", into = paste0("x", 1:3), sep = "\\) |/") %>%
  mutate(x1 = sub("^\\(", "", x1))
# A tibble: 3 x 4
     id x1    x2      x3
  <dbl> <chr> <chr>   <chr>
1     1 aa1   red bus yellow
2     2 bb1   blue    yellow
3     3 cc1   green   orange apple
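For the literal question (keep only the text after the ")" and split it on "/"), a base R sketch may be enough. It assumes exactly one "/" per value; the column names x2 and x3 follow the answers above, and after_paren and parts are illustrative names.

```r
df <- data.frame(
  id = c(1, 2, 3),
  group = c("(aa1) red bus/yellow", "(bb1) blue/yellow", "(cc1) green/orange apple")
)

# Drop everything up to and including the first ")" plus any following spaces
after_paren <- sub("^[^)]*\\)\\s*", "", df$group)

# Split the remainder on a literal "/" and bind the pieces into a matrix
parts <- do.call(rbind, strsplit(after_paren, "/", fixed = TRUE))
df$x2 <- parts[, 1]
df$x3 <- parts[, 2]

df
```

Because the split happens only on "/", multi-word values such as "red bus" stay intact.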

Split df column of integers into individual digits in R

I have a df where one variable is an integer. I'd like to split this column into its individual digits. See my example below:
Group Number
A 456
B 3
C 18
To
Group Number Digit1 Digit2 Digit3
A 456 4 5 6
B 3 3 NA NA
C 18 1 8 NA
We can use read.fwf from base R. Find the maximum number of characters (nchar) in the 'Number' column (mx). Then read the 'Number' column after converting it to character (as.character), specify the 'widths' as 1 repeated mx times, and assign the output to new 'Digit' columns in the data:
mx <- max(nchar(df1$Number))
df1[paste0("Digit", seq_len(mx))] <- read.fwf(textConnection(
as.character(df1$Number)), widths = rep(1, mx))
-output
df1
# Group Number Digit1 Digit2 Digit3
#1 A 456 4 5 6
#2 B 3 3 NA NA
#3 C 18 1 8 NA
data
df1 <- structure(list(Group = c("A", "B", "C"), Number = c(456L, 3L,
18L)), class = "data.frame", row.names = c(NA, -3L))
Another base R option (I think @akrun's approach using read.fwf is much simpler)
cbind(
df,
with(
df,
type.convert(
`colnames<-`(do.call(
rbind,
lapply(
strsplit(as.character(Number), ""),
`length<-`, max(nchar(Number))
)
), paste0("Digit", seq(max(nchar(Number))))),
as.is = TRUE
)
)
)
which gives
Group Number Digit1 Digit2 Digit3
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Using splitstackshape::cSplit
splitstackshape::cSplit(df, 'Number', sep = '', stripWhite = FALSE, drop = FALSE)
# Group Number Number_1 Number_2 Number_3
#1: A 456 4 5 6
#2: B 3 3 NA NA
#3: C 18 1 8 NA
Updated
I realized I could use max() to get the character limit for each row and include it directly in my map2 call, which saves a few lines of code. Thanks to @ThomasIsCoding, whose accidental find inspired this.
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
df %>%
rowwise() %>%
mutate(map2_dfc(Number, 1:max(nchar(Number)), ~ str_sub(.x, .y, .y))) %>%
unnest(cols = !c(Group, Number)) %>%
rename_with(~ str_replace(., "\\.\\.\\.", "Digit"), .cols = !c(Group, Number)) %>%
mutate(across(!c(Group, Number), as.numeric, na.rm = TRUE))
# A tibble: 3 x 5
Group Number Digit1 Digit2 Digit3
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 456 4 5 6
2 B 3 3 NA NA
3 C 18 1 8 NA
Data
df <- tribble(
~Group, ~Number,
"A", 456,
"B", 3,
"C", 18
)
Two base R methods:
no_cols <- max(nchar(as.character(df1$Number)))
# Using `strsplit()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(strsplit(as.character(df1$Number), ""),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
# Using `regmatches()` and `gregexpr()`:
cbind(df1, setNames(data.frame(do.call(rbind,
lapply(regmatches(df1$Number, gregexpr("\\d", df1$Number)),
function(x) {
length(x) <- no_cols
x
}
)
)
), paste0("Digit", seq_len(no_cols))))
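As far as I can tell, both variants above leave the digit columns as character, since they are built from split strings. A sketch of the same strsplit() idea that converts to integer while padding, assuming non-negative whole numbers; the names digits, padded and mat are illustrative only:

```r
df1 <- data.frame(Group = c("A", "B", "C"), Number = c(456L, 3L, 18L))
no_cols <- max(nchar(as.character(df1$Number)))

digits <- strsplit(as.character(df1$Number), "")

# Indexing past the end of a vector yields NA, which pads the short numbers
padded <- lapply(digits, function(x) as.integer(x)[seq_len(no_cols)])

# One row per number, one column per digit position
mat <- matrix(unlist(padded), ncol = no_cols, byrow = TRUE,
              dimnames = list(NULL, paste0("Digit", seq_len(no_cols))))

result <- cbind(df1, mat)
result
```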

Number of occurrences in a dataframe

I have the following data frame, and I want to count the occurrences of each row by the first column and append the result to the data frame as another column, say "freq":
df:
gene a b c
abc 1 NA 1
bca NA 1 1
cba 1 2 1
My real df is bigger; this is only a small example that should scale.
The desired data frame is:
gene a b c freq
abc 1 NA 1 2
bca NA 1 1 2
cba 1 2 1 3
The code I have tried is:
g <- df %>% mutate(numtwos = rowSums(. > 0))
or
df$freq <- apply(df , 1, function(x) length(which(x>0)))
But this does not work: even if a row should have, for example, 150 repetitions, I get only 2 for every row.
Any help or other point of view is welcome!
Thanks
We can first convert the "Na" strings to real NA values:
library(dplyr)
df %>%
  # as.character() keeps na_if() type-stable for the integer column c
  mutate(across(a:c, ~ as.numeric(na_if(as.character(.x), "Na")))) %>%
  mutate(freq = rowSums(select(., a:c), na.rm = TRUE))
# gene a b c freq
#1 abc 1 NA 1 2
#2 bca NA 1 1 2
#3 cba 1 1 1 3
Here, the values are all 1s, so it is the same as counting the non-NA values:
df %>%
  mutate(across(a:c, ~ as.numeric(na_if(as.character(.x), "Na")))) %>%
  mutate(freq = rowSums(!is.na(select(., a:c))))
data
df <- structure(list(gene = c("abc", "bca", "cba"), a = c("1", "Na",
"1"), b = c("Na", "1", "1"), c = c(1L, 1L, 1L)),
class = "data.frame", row.names = c(NA,
-3L))
I haven't used R for a while, so I won't paste in the code, but you can create a new df by grouping the initial one by gene, and then merge/join it to your initial df in another line of code.
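A minimal base R sketch of that group-and-join idea, assuming "occurrences" means how many rows share the same gene. The example data with repeated genes is made up, since the question's sample has each gene only once.

```r
# Hypothetical data in which some genes appear more than once
df <- data.frame(
  gene = c("abc", "bca", "abc", "cba", "abc"),
  a = c(1, NA, 1, 1, NA)
)

# Count rows per gene, then join the counts back onto the original rows
counts <- as.data.frame(table(gene = df$gene), responseName = "freq",
                        stringsAsFactors = FALSE)
merged <- merge(df, counts, by = "gene")

# Same counts without a join, keeping the original row order
df$freq <- ave(seq_along(df$gene), df$gene, FUN = length)

df
```

merge() sorts the result by gene, whereas the ave() variant leaves the rows where they were.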

Binding data frames from a list with different column types

Trying to figure out a way in purrr to bind rows over different elements of lists where the column types are not consistent. For example, my data looks a little like this...
d0 <- list(
data_frame(x1 = c(1, 2), x2 = c("a", "b")),
data_frame(x1 = c("P1"), x2 = c("c"))
)
d0
# [[1]]
# # A tibble: 2 x 2
# x1 x2
# <dbl> <chr>
# 1 1 a
# 2 2 b
#
# [[2]]
# # A tibble: 1 x 2
# x1 x2
# <chr> <chr>
# 1 P1 c
I can use a for loop and then map_df with bind_rows to get the output I want (map_df will not work if the columns are of different types)...
for(i in 1:length(d0)){
d0[[i]] <- mutate_if(d0[[i]], is.numeric, as.character)
}
map_df(d0, bind_rows)
# # A tibble: 3 x 2
# x1 x2
# <chr> <chr>
# 1 1 a
# 2 2 b
# 3 P1 c
but I think I am missing a trick somewhere that would allow me to avoid the for loop. My attempts along these lines...
d0 %>%
map(mutate_if(., is.numeric, as.character)) %>%
map_df(.,bind_rows)
# Error in UseMethod("tbl_vars") :
# no applicable method for 'tbl_vars' applied to an object of class "list"
... do not seem to work (still getting my head around purrr)
You can use rbindlist() from data.table in this case:
data.table::rbindlist(d0) %>%
  dplyr::as_tibble()
# A tibble: 3 x 2
x1 x2
<chr> <chr>
1 1 a
2 2 b
3 P1 c
There may be circumstances where you will want to make sure the fill argument is TRUE, for example when the data frames do not all share the same columns.
Documentation reference:
If column i of input items do not all have the same type; e.g, a
data.table may be bound with a list or a column is factor while others
are character types, they are coerced to the highest type (SEXPTYPE).
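A small sketch of both behaviours together, assuming a second data frame that has a column the first one lacks (x3 is made up for illustration):

```r
library(data.table)

# Hypothetical inputs: x1 differs in type, and the column sets differ
parts <- list(
  data.frame(x1 = c(1, 2), x2 = c("a", "b")),
  data.frame(x1 = "P1", x3 = TRUE)
)

# fill = TRUE matches columns by name and pads missing ones with NA;
# x1 is coerced to character because one input holds strings
combined <- rbindlist(parts, fill = TRUE)

combined
sapply(combined, class)
```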
How about this?
library(purrr)
map_df(lapply(d0, function(x) data.frame(lapply(x, as.character))), bind_rows)
Output is:
x1 x2
1 1 a
2 2 b
3 P1 c
Sample data:
d0 <- list(structure(list(x1 = c(1, 2), x2 = c("a", "b")), .Names = c("x1",
"x2"), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
)), structure(list(x1 = "P1", x2 = "c"), .Names = c("x1", "x2"
), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"
)))
With tidyverse, the option would be
library(tidyverse)
d0 %>%
map_df(~ .x %>%
mutate_if(is.numeric, as.character))
# A tibble: 3 x 2
# x1 x2
# <chr> <chr>
#1 1 a
#2 2 b
#3 P1 c
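The attempt in the question fails because map(mutate_if(., ...)) evaluates mutate_if() on the whole list (the pipe's `.`) before map() ever runs; wrapping the call in a lambda defers it to each element. A sketch of the same idea with across(), which recent dplyr versions recommend over mutate_if(); the result name out is illustrative only:

```r
library(dplyr)
library(purrr)
library(tibble)

d0 <- list(
  tibble(x1 = c(1, 2), x2 = c("a", "b")),
  tibble(x1 = "P1", x2 = "c")
)

out <- d0 %>%
  # The ~ makes this a function applied to each tibble in turn
  map(~ mutate(.x, across(where(is.numeric), as.character))) %>%
  bind_rows()

out
```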
It's a good opportunity to use purrr::modify_depth():
library(purrr)
library(dplyr)
bind_rows(modify_depth(d0,2,as.character))
# # A tibble: 3 x 2
# x1 x2
# <chr> <chr>
# 1 1 a
# 2 2 b
# 3 P1 c
