splitting a string in a column and adding duplicate rows in R - r

Suppose I have the following dataframe, called 'example':
a <- c("rs123|rs246|rs689653", "rs9753", "rs00334")
b <- c(1,2,9)
c <- c(234534523, 67345634, 536423)
example <- data.frame(a,b,c)
I want the dataframe to look like this:
a b c
rs123 1 234534523
rs246 1 234534523
rs689653 1 234534523
rs9753 2 67345634
rs00334 9 536423
Where if we split column a on the | delimiter, the other columns are duplicated. Any help would be greatly appreciated!!

We can use separate_rows from the tidyr package (part of the tidyverse package).
library(tidyverse)
example2 <- example %>%
separate_rows(a, sep = "\\|")
example2
# a b c
# 1 rs123 1 234534523
# 2 rs246 1 234534523
# 3 rs689653 1 234534523
# 4 rs9753 2 67345634
# 5 rs00334 9 536423
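Here is one way to convert example2 back to the original format. The factor() step assumes column a was created as a factor, which was the data.frame() default before R 4.0.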
example3 <- example2 %>%
group_by(b, c) %>%
summarize(a = str_c(a, collapse = "|")) %>%
ungroup() %>%
select(names(example2)) %>%
mutate(a = factor(a)) %>%
as.data.frame()
identical(example, example3)
# [1] TRUE
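For readers who would rather avoid a package dependency, the same expansion can be sketched in base R. This is only an illustration of the idea (split, then repeat each row once per piece); the names parts and expanded are made up for the example, and stringsAsFactors = FALSE is an assumption so that a stays a character column.

```r
# Toy data from the question, with `a` kept as character
example <- data.frame(
  a = c("rs123|rs246|rs689653", "rs9753", "rs00334"),
  b = c(1, 2, 9),
  c = c(234534523, 67345634, 536423),
  stringsAsFactors = FALSE
)

# Split every string on a literal "|" (fixed = TRUE avoids regex escaping)
parts <- strsplit(example$a, "|", fixed = TRUE)

# Repeat each row index once per piece, then overwrite `a` with the pieces
expanded <- example[rep(seq_len(nrow(example)), lengths(parts)), ]
expanded$a <- unlist(parts)
rownames(expanded) <- NULL
expanded
```

Because the row indices are repeated before a is overwritten, every other column is duplicated automatically, however many columns there are.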

Related

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way of achieving this than using rep() on the first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
as.integer(str_extract_all(first(first), '\\d+')[[1]]))) %>%
ungroup() %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can use the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # needed for str_c()
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
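The idea shared by the answers above (repeat each letter by its trailing count, then pair it with the split second column) can also be sketched in a few lines of base R. This is a minimal sketch for a single row, assuming every token is some non-digit text followed by a count; tokens, labels_only and counts are illustrative names.

```r
first  <- "a3,b1,c2"
second <- "1,2,3,4,5,6"

# "a3,b1,c2" -> c("a3", "b1", "c2")
tokens <- strsplit(first, ",", fixed = TRUE)[[1]]

# Separate the label from its trailing count
labels_only <- sub("[0-9]+$", "", tokens)              # "a" "b" "c"
counts      <- as.integer(sub("^[^0-9]+", "", tokens)) # 3 1 2

# Repeat each label by its count and pair it with the split second column
result <- data.frame(
  first  = rep(labels_only, times = counts),
  second = strsplit(second, ",", fixed = TRUE)[[1]],
  stringsAsFactors = FALSE
)
result
```

data.frame() will raise an error if the counts do not add up to the number of values in second, which acts as a cheap consistency check on the input.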

Create new columns to indicate column name's position inside another string vector (with dplyr, purrr, and stringr)

Given this example data:
require(stringr)
require(tidyverse)
labels <- c("foo", "bar", "baz")
n_rows <- 4
df <- 1:n_rows %>%
map(~ data.frame(
block_order=paste(sample(labels, size=length(labels), replace=FALSE),
collapse="|"))) %>%
bind_rows()
df
block_order
1 foo|bar|baz
2 baz|bar|foo
3 foo|baz|bar
4 foo|bar|baz
I want to generate a column for each string in labels, which takes the value of the position of that string in the |-separated sequence in each row.
Desired output:
block_order foo bar baz
1 foo|bar|baz 1 2 3
2 baz|bar|foo 3 2 1
3 foo|baz|bar 1 3 2
4 foo|bar|baz 1 2 3
I've been trying different variations in a dplyr/purrr setup, like this example, where I map in each value of label, and then attempt to get its position in block_order using match on str_split:
labels %>%
map(~ df %>%
transmute(!!.x := match(!!.x, str_split(block_order,
"\\|",
simplify=TRUE)))) %>%
bind_cols(df, .)
But that produces unexpected output:
block_order foo bar baz
1 foo|bar|baz 1 5 2
2 baz|bar|foo 1 5 2
3 foo|baz|bar 1 5 2
4 foo|bar|baz 1 5 2
I'm not really sure what these numbers represent, or why they're all the same.
If anyone can help me figure out (a) how to achieve my desired output in a dplyr/purrr framework and (b) why the proposed solution here gives the output it does, I'd be very appreciative.
We can split 'block_order' by |, loop through the resulting list of vectors with lapply, look up the position of each label in the split row with match, rbind the results, and assign them to new columns:
labels <- c("foo", "bar", "baz")
df[labels] <- do.call(rbind, lapply(strsplit(df$block_order, "|",
fixed = TRUE), function(x) match(labels, x)))
Or a similar idea with the tidyverse:
library(tidyverse)
str_split(df$block_order, "[|]") %>%
map(~ match(labels, .x)) %>% # position of each label within the row
do.call(rbind, .) %>%
as.data.frame() %>%
set_names(labels) %>%
bind_cols(df, .)
# block_order foo bar baz
#1 foo|bar|baz 1 2 3
#2 baz|bar|foo 3 2 1
#3 foo|baz|bar 1 3 2
#4 foo|bar|baz 1 2 3
Another option would be to use separate_rows, number the pieces within each original row, and spread the result back to 'wide' format:
df %>%
mutate(rn = row_number()) %>%
separate_rows(block_order, sep = "\\|") %>%
group_by(rn) %>%
mutate(ind = row_number()) %>% # position of each piece within its row
ungroup() %>%
spread(block_order, ind) %>%
select(all_of(labels)) %>%
bind_cols(df, .)
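A quick way to sanity-check any of these approaches is to try an ordering that is not its own inverse, such as bar|baz|foo. For such a row, looking up the labels in the split row and looking up the split row in the labels give different answers, and only the first is what the question asks for. A minimal base R sketch (the variable names are illustrative):

```r
labels <- c("foo", "bar", "baz")
row <- strsplit("bar|baz|foo", "|", fixed = TRUE)[[1]]

# Position of each label within the row: foo is 3rd, bar is 1st, baz is 2nd
pos_of_label <- match(labels, row)
setNames(pos_of_label, labels)

# The reverse lookup answers a different question:
# which label (by index in `labels`) sits at each position of the row
label_at_pos <- match(row, labels)
label_at_pos
```

The three orderings in the question's sample data happen to give the same result either way, so a test row like this one is needed to tell the two lookups apart.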
Unless you need it for other reasons, you don't have to fully split the string: it is enough to locate the first match for each value of labels, which regexpr will give you. Mapping over labels gives a list with one element per label (so it's a quick iteration), which you can then pmap rank over to get the positions. Using the *_dfr version to simplify the results to a data frame and binding the columns to the original:
library(tidyverse)
set.seed(47)
labels <- c("foo", "bar", "baz")
df <- tibble(block_order = replicate(10, paste(sample(labels), collapse = "|")))
labels %>%
map(~regexpr(.x, df$block_order)) %>%
pmap_dfr(~set_names(as.list(rank(c(...))), labels)) %>%
bind_cols(df, .)
#> # A tibble: 10 x 4
#> block_order foo bar baz
#> <chr> <dbl> <dbl> <dbl>
#> 1 baz|foo|bar 2. 3. 1.
#> 2 baz|bar|foo 3. 2. 1.
#> 3 bar|foo|baz 2. 1. 3.
#> 4 baz|foo|bar 2. 3. 1.
#> 5 foo|bar|baz 1. 2. 3.
#> 6 baz|foo|bar 2. 3. 1.
#> 7 foo|baz|bar 1. 3. 2.
#> 8 bar|baz|foo 3. 1. 2.
#> 9 baz|foo|bar 2. 3. 1.
#> 10 foo|bar|baz 1. 2. 3.
If you prefer stringr/stringi to base regex, you could do the same thing by changing the regexpr call to str_locate(df$block_order, .x)[, "start"] or stringi::stri_locate_first_fixed in the same arrangement.
I think this might work:
library(tidyr)
library(purrr)
position_counter <- function(...) {
row = list(...)
row %>% map(~which(row == .)) %>% setNames(row)
}
df %>%
separate(block_order, labels) %>%
pmap_df(position_counter)
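On part (b) of the question, a plausible explanation for the 1 5 2 output: inside transmute, block_order is the whole column, so str_split(..., simplify = TRUE) returns one character matrix with a row per observation. match() then treats that matrix as a single vector read column by column and returns only the first hit, which is recycled to every row. The sketch below reproduces this in base R, using do.call(rbind, strsplit(...)) as a stand-in for the stringr call (an assumption made so the example runs without packages).

```r
block_order <- c("foo|bar|baz", "baz|bar|foo", "foo|baz|bar", "foo|bar|baz")

# One row per observation, one column per position (4 x 3 character matrix)
m <- do.call(rbind, strsplit(block_order, "|", fixed = TRUE))

# match() sees the matrix as a plain vector, read column by column:
# "foo" "baz" "foo" "foo" "bar" "bar" "baz" "bar" "baz" "foo" "bar" "baz"
flat <- as.vector(m)

# First occurrence of each label in that flattened vector
first_hit <- match(c("foo", "bar", "baz"), m)
first_hit
```

So the numbers are positions in the flattened matrix for the whole data frame, not positions within each row, which is why every row shows the same values.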

two factor group_by then add row number R dplyr [duplicate]

I have a data frame (df):
a <- c("up","up","up","up","down","down","down","down")
b <- c("l","r","l","r","l","l","r","r")
df <- data.frame(a,b)
I would like to add a third column (c) which contains the order of entries, grouped by columns a and b that looks something like this:
a b c
1 up l 1
2 up r 1
3 up l 2
4 up r 2
5 down l 1
6 down l 2
7 down r 1
8 down r 2
I have tried solutions using dplyr that have not worked:
order <- df %>%
group_by(a) %>%
group_by(b) %>%
mutate(c = row_number()) # This counts the order based on `b`, ignoring `a`
order <- df %>%
group_by(a) %>%
group_by(b) %>%
mutate(c = seq_len(n())) # This counts the order based on `b`, ignoring `a`
I would prefer to keep using dplyr and pipes if possible, but other suggestions are welcome
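You need to combine a and b in the same group_by statement. By default, a second group_by() call replaces the existing grouping instead of adding to it, which is why your attempts counted within b only.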
order <- df %>%
group_by(a, b) %>%
mutate(c = row_number())
order
# Source: local data frame [8 x 3]
# Groups: a, b [4]
#
# a b c
# <fctr> <fctr> <int>
# 1 up l 1
# 2 up r 1
# 3 up l 2
# 4 up r 2
# 5 down l 1
# 6 down l 2
# 7 down r 1
# 8 down r 2
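If a base R alternative is of interest, ave() can number rows within each combination of a and b. This is a sketch of that approach rather than a replacement for the dplyr answer; it assumes the same toy data as in the question.

```r
a <- c("up", "up", "up", "up", "down", "down", "down", "down")
b <- c("l", "r", "l", "r", "l", "l", "r", "r")
df <- data.frame(a, b)

# ave() splits the row indices by every (a, b) combination, applies
# seq_along() to each piece, and puts the results back in the original order
df$c <- ave(seq_len(nrow(df)), df$a, df$b, FUN = seq_along)
df
```

The original row order is preserved, so the new column lines up with the desired output without any sorting step.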

Count occurrence across multiple columns using R & dplyr

This should be a simple solution... I just can't wrap my head around this. I'd like to count the occurrences of a factor across multiple columns of a data frame. There are 13 columns, ranging from abx.1 to abx.13, and a huge number of rows.
Sample data frame:
library(dplyr)
abx.1 <- c('Amoxil', 'Cipro', 'Moxiflox', 'Pip-tazo')
start.1 <- c('2012-01-01', '2012-02-01', '2013-01-01', '2014-01-01')
abx.2 <- c('Pip-tazo', 'Ampicillin', 'Amoxil', NA)
start.2 <- c('2012-01-01', '2012-02-01', '2013-01-01', NA)
abx.3 <- c('Ampicillin', 'Amoxil', NA, NA)
start.3 <- c('2012-01-01', '2012-02-01', NA,NA)
worksheet <- data.frame(abx.1, start.1, abx.2, start.2, abx.3, start.3)
Result I'd like:
name count
Amoxil 3
Ampicillin 2
Pip-tazo 2
Cipro 1
Moxiflox 1
I've tried :
worksheet %>% group_by (abx.1, abx.2, abx.3) %>% summarise(count = n())
This doesn't give me my desired output. Any thoughts would be greatly appreciated.
If you want a dplyr solution, I'd suggest combining it with tidyr in order to convert your data to a long format first
library(tidyr)
worksheet %>%
select(starts_with("abx")) %>%
gather(key, value, na.rm = TRUE) %>%
count(value)
# Source: local data frame [5 x 2]
#
# value n
# 1 Amoxil 3
# 2 Ampicillin 2
# 3 Cipro 1
# 4 Moxiflox 1
# 5 Pip-tazo 2
Alternatively, with base R, it's just
as.data.frame(table(unlist(worksheet[grep("^abx", names(worksheet))])))
# Var1 Freq
# 1 Amoxil 3
# 2 Cipro 1
# 3 Moxiflox 1
# 4 Pip-tazo 2
# 5 Ampicillin 2
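Neither result above is sorted by count, which the desired output in the question appears to be. A small base R sketch that adds the sorting and the requested column names follows; the start.* columns are left out because they play no part in the count, and abx_cols and res are illustrative names.

```r
abx.1 <- c("Amoxil", "Cipro", "Moxiflox", "Pip-tazo")
abx.2 <- c("Pip-tazo", "Ampicillin", "Amoxil", NA)
abx.3 <- c("Ampicillin", "Amoxil", NA, NA)
worksheet <- data.frame(abx.1, abx.2, abx.3, stringsAsFactors = FALSE)

# Pool all abx.* columns into one vector; table() drops NA by default
abx_cols <- grep("^abx", names(worksheet), value = TRUE)
counts <- table(unlist(worksheet[abx_cols], use.names = FALSE))

# Most frequent first, with the column names asked for in the question
res <- as.data.frame(sort(counts, decreasing = TRUE), stringsAsFactors = FALSE)
names(res) <- c("name", "count")
res
```

With the dplyr version, the analogous step would be count(value, sort = TRUE).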

splitting text in column and add row number [duplicate]

I would like to split some text in a data frame column and save it into a data frame together with the row number or an id column.
I normally used plyr to do that, but the same approach no longer works in dplyr.
If I understand correctly, my plyr code only works because of a bug in plyr.
So I am looking for the correct way to do this.
This is a minimal example in plyr:
library(plyr)
set.seed(1)
df <- data.frame(a=seq(2),
b=c(paste(sample(letters,3), collapse=';'),
paste(sample(letters,3), collapse=';')),
stringsAsFactors=FALSE)
ddply(df,.(a),summarise,unlist(strsplit(b,';')))
It turns the original data frame:
a b
1 1 g;j;n
2 2 x;f;v
Into this:
a ..1
1 1 g
2 1 j
3 1 n
4 2 x
5 2 f
6 2 v
What would be the correct dplyr solution?
I'm biased in favor of cSplit from the "splitstackshape" package, but you might be interested in unnest from "tidyr" in conjunction with "dplyr":
library(dplyr)
library(tidyr)
df %>%
mutate(b = strsplit(b, ";")) %>%
unnest(b)
# a b
# 1 1 g
# 2 1 j
# 3 1 n
# 4 2 x
# 5 2 f
# 6 2 v
You could do this using cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'b', ';', 'long')
# a b
#1: 1 g
#2: 1 j
#3: 1 n
#4: 2 x
#5: 2 f
#6: 2 v
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
separate(df, b, c('b1', 'b2', 'b3'), sep=";") %>%
gather(Var, b, -a) %>%
select(-Var) %>%
arrange(a)
Or another option would be to use do
df %>%
group_by(a) %>%
do(data.frame(b=unlist(strsplit(.$b, ';'))))
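For completeness, here is a hedged base R sketch that needs no packages: name each split vector by its id and let stack() build the long table. The input strings are hard-coded to the values shown in the question, because sample() can return different letters for the same seed across R versions; pieces, long and result are illustrative names.

```r
df <- data.frame(a = 1:2, b = c("g;j;n", "x;f;v"), stringsAsFactors = FALSE)

# A list of character vectors, named by the id column
pieces <- setNames(strsplit(df$b, ";", fixed = TRUE), df$a)

# stack() returns the columns `values` (the pieces) and `ind` (the list names)
long <- stack(pieces)

# Restore the original column names and the integer id
result <- data.frame(
  a = as.integer(as.character(long$ind)),
  b = long$values,
  stringsAsFactors = FALSE
)
result
```

This carries only the id column along; when more columns have to be duplicated, repeating the row indices as in the tidyr and cSplit answers is usually the more convenient route.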
