I would like to split some text in a data frame column and save it into a data frame together with the row number or an id column.
I used to do this with plyr, but the same approach no longer works in dplyr.
As far as I understand, my plyr code only worked because of a bug in plyr.
So I am looking for the correct way to do this.
This is a minimal example in plyr:
library(plyr)
set.seed(1)
df <- data.frame(a=seq(2),
                 b=c(paste(sample(letters,3), collapse=';'),
                     paste(sample(letters,3), collapse=';')),
                 stringsAsFactors=FALSE)
ddply(df,.(a),summarise,unlist(strsplit(b,';')))
It turns the original data frame:
  a     b
1 1 g;j;n
2 2 x;f;v
Into this:
  a ..1
1 1   g
2 1   j
3 1   n
4 2   x
5 2   f
6 2   v
What would be the correct dplyr solution?
I'm biased in favor of cSplit from the "splitstackshape" package, but you might be interested in unnest from "tidyr" in conjunction with "dplyr":
library(dplyr)
library(tidyr)
df %>%
  mutate(b = strsplit(b, ";")) %>%
  unnest(b)
# a b
# 1 1 g
# 2 1 j
# 3 1 n
# 4 2 x
# 5 2 f
# 6 2 v
You could do this using cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'b', ';', 'long')
# a b
#1: 1 g
#2: 1 j
#3: 1 n
#4: 2 x
#5: 2 f
#6: 2 v
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
separate(df, b, c('b1', 'b2', 'b3'), sep=";") %>%
  gather(Var, b, -a) %>%
  select(-Var) %>%
  arrange(a)
Another option is to use do, which evaluates the expression once per group:
df %>%
  group_by(a) %>%
  do(data.frame(b=unlist(strsplit(.$b, ';'))))
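For comparison, the same split can be sketched in base R with no packages. This is only an illustration and not part of the answers above: it assumes the two-row df shown in the question (typed in directly rather than regenerated with sample), and the names parts and out are made up for the example.

```r
# The data frame from the question, entered literally so the result
# does not depend on the random number generator.
df <- data.frame(a = 1:2,
                 b = c("g;j;n", "x;f;v"),
                 stringsAsFactors = FALSE)

# Split every string in b into its pieces; the result is a list with
# one character vector per row of df.
parts <- strsplit(df$b, ";", fixed = TRUE)

# Repeat each id once per piece and flatten the pieces into one column.
out <- data.frame(a = rep(df$a, lengths(parts)),
                  b = unlist(parts),
                  stringsAsFactors = FALSE)
out
```

Because rep() uses lengths(parts), rows with different numbers of pieces are handled without any extra work.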
I have a tibble whose column names contain spaces and special characters, which makes them a hassle to work with. I want to change these column names to easier-to-use names while I'm working with the data, and then change them back to the original names at the end for display. Ideally, I want to do this as part of a pipe, but I haven't figured out how to do it with rename_with().
Sample data:
df <- tibble(oldname1 = seq(1:10),
oldname2 = letters[seq(1:10)],
oldname3 = LETTERS[seq(1:10)])
cols_lookup <- tibble(old_names = c("oldname4", "oldname2", "oldname1"),
new_names = c("newname4", "newname2", "newname1"))
Desired output:
> head(df_renamed)
# A tibble: 6 x 3
newname1 newname2 oldname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
Some columns are removed and reordered during this work, so when converting the names back there will be entries in the cols_lookup table that are no longer in df. There are also new columns created in df whose names I want to leave unchanged.
I am aware that similar questions have already been asked, but the answers either don't work well with tibbles or in a pipe (e.g. those using match()), or don't work if the columns aren't all present in the same order in both tables.
We can use rename_at. From the master lookup table, keep only the rows whose old_names occur among the column names of df (filtered_lookup), then pass the 'old_names' to vars and replace them with the 'new_names':
library(dplyr)
filtered_lookup <- cols_lookup %>%
  filter(old_names %in% names(df))
df %>%
  rename_at(vars(filtered_lookup$old_names), ~ filtered_lookup$new_names)
Or using rename_with, use the same logic
df %>%
rename_with(.fn = ~filtered_lookup$new_names, .cols = filtered_lookup$old_names)
Or another option is rename with splicing (!!!) from a named vector
library(tibble)
df %>%
rename(!!! deframe(filtered_lookup[2:1]))
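You can use rename_ with setNames (rename_ has been deprecated since dplyr 0.7.0). Note that the lookup table is redefined here so that every old name exists in df: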
cols_lookup <- tibble(old_names = c("oldname3", "oldname2", "oldname1"),
new_names = c("newname3", "newname2", "newname1"))
df
rename_(df, .dots=setNames(cols_lookup$old_names, cols_lookup$new_names))
Output:
# A tibble: 10 x 3
newname1 newname2 newname3
<int> <chr> <chr>
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
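The lookup idea above can also be sketched as a small base R helper that fits into a pipe and ignores lookup entries with no matching column. The function name rename_from_lookup and the named-vector form of the lookup are assumptions made for this illustration; they are not part of dplyr, and a plain data.frame stands in for the tibble.

```r
# Sample data modelled on the question (data.frame instead of tibble).
df <- data.frame(oldname1 = 1:10,
                 oldname2 = letters[1:10],
                 oldname3 = LETTERS[1:10],
                 stringsAsFactors = FALSE)
cols_lookup <- data.frame(old_names = c("oldname4", "oldname2", "oldname1"),
                          new_names = c("newname4", "newname2", "newname1"),
                          stringsAsFactors = FALSE)

# Hypothetical helper: `lookup` is a character vector whose names are the
# current column names and whose values are the replacement names.
rename_from_lookup <- function(data, lookup) {
  hit <- names(data) %in% names(lookup)          # columns that have an entry
  names(data)[hit] <- unname(lookup[names(data)[hit]])
  data
}

# Forward direction: old -> new. "oldname4" is not in df and is skipped.
forward <- setNames(cols_lookup$new_names, cols_lookup$old_names)
df_renamed <- rename_from_lookup(df, forward)

# Reverse direction: new -> old, for display at the end.
backward <- setNames(cols_lookup$old_names, cols_lookup$new_names)
df_restored <- rename_from_lookup(df_renamed, backward)

names(df_renamed)
names(df_restored)
```

Since the helper takes the data as its first argument, it can be used as df %>% rename_from_lookup(forward) in a pipe, and column order or missing columns do not affect the result.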
I have a data frame with three columns. Each row contains three unique numbers between 1 and 5 (inclusive).
df <- data.frame(a=c(1,4,2),
b=c(5,3,1),
c=c(3,1,5))
I want to use mutate to create two additional columns that, for each row, contain in ascending order the two numbers between 1 and 5 that do not appear in the first three columns. The desired data frame in the example would be:
df2 <- data.frame(a=c(1,4,2),
b=c(5,3,1),
c=c(3,1,5),
d=c(2,2,3),
e=c(4,5,4))
I tried the mutate call below, which uses setdiff, but it returned NAs rather than the values I was looking for:
df <- df %>% mutate(d=setdiff(c(a,b,c),c(1:5))[1],
e=setdiff(c(a,b,c),c(1:5))[2])
I can get around this by looping through each row (or using an apply function) but would prefer a mutate approach if possible.
Thank you for your help!
Base R:
cbind(df, t(apply(df, 1, setdiff, x = 1:5)))
# a b c 1 2
# 1 1 5 3 2 4
# 2 4 3 1 2 5
# 3 2 1 5 3 4
Warning: if there are any non-numerical columns, apply will happily up-convert things (converting to a matrix internally).
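The warning above can be illustrated with a short base R sketch. The extra id column and the names mixed, num_cols and extra are hypothetical, added only to show the conversion and one way to avoid it by restricting apply to the numeric columns.

```r
df <- data.frame(a = c(1, 4, 2),
                 b = c(5, 3, 1),
                 c = c(3, 1, 5))

# Hypothetical text column added next to the numeric ones.
mixed <- cbind(df, id = c("x", "y", "z"))

# apply() converts its input to a matrix first; with a text column
# present, the whole matrix becomes character.
is.character(as.matrix(mixed))   # TRUE

# Restrict apply() to the numeric columns so the rows stay numeric.
num_cols <- vapply(mixed, is.numeric, logical(1))
extra <- t(apply(mixed[num_cols], 1, setdiff, x = 1:5))
colnames(extra) <- c("d", "e")

result <- cbind(mixed, extra)
result
```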
We can use pmap to loop over the rows, create a list column, and then unnest it to create two new columns:
library(dplyr)
library(purrr)
library(tidyr)
df %>%
  mutate(out = pmap(., ~ setdiff(1:5, c(...)) %>%
                      as.list %>%
                      set_names(c('d', 'e')))) %>%
  unnest_wider(c(out))
# A tibble: 3 x 5
# a b c d e
# <dbl> <dbl> <dbl> <int> <int>
#1 1 5 3 2 4
#2 4 3 1 2 5
#3 2 1 5 3 4
Or using base R
df[c('d', 'e')] <- do.call(rbind, lapply(asplit(df, 1), function(x) setdiff(1:5, x)))
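As a possible explanation of the NAs in the original attempt, here is a base R sketch. It assumes the cause is twofold: inside an ungrouped mutate the names a, b and c refer to whole columns rather than single row values, and the setdiff arguments were in the opposite order to what was intended. The names all_values, attempt and missing_vals are made up for the illustration.

```r
df <- data.frame(a = c(1, 4, 2),
                 b = c(5, 3, 1),
                 c = c(3, 1, 5))

# What the attempted call effectively computed: all nine values pooled,
# minus 1:5. Every pooled value lies in 1:5, so nothing is left.
all_values <- c(df$a, df$b, df$c)
attempt <- setdiff(all_values, 1:5)
length(attempt)   # 0
attempt[1]        # NA, which is what ended up in every row

# Row-wise version with the arguments the other way round:
# for each row, the values in 1:5 that are not in that row.
missing_vals <- t(mapply(function(a, b, c) setdiff(1:5, c(a, b, c)),
                         df$a, df$b, df$c))
df$d <- missing_vals[, 1]
df$e <- missing_vals[, 2]
df
```

setdiff keeps the order of its first argument, so taking 1:5 first yields the missing numbers in ascending order without an extra sort.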
Suppose I have the following dataframe, called 'example':
a <- c("rs123|rs246|rs689653", "rs9753", "rs00334")
b <- c(1,2,9)
c <- c(234534523, 67345634, 536423)
example <- data.frame(a,b,c)
I want the dataframe to look like this:
a b c
rs123 1 234534523
rs246 1 234534523
rs689653 1 234534523
rs9753 2 67345634
rs00334 9 536423
That is, column a is split on the | delimiter and the values in the other columns are duplicated for each piece. Any help would be greatly appreciated!!
We can use separate_rows from the tidyr package (part of the tidyverse package).
library(tidyverse)
example2 <- example %>%
separate_rows(a)
example2
# a b c
# 1 rs123 1 234534523
# 2 rs246 1 234534523
# 3 rs689653 1 234534523
# 4 rs9753 2 67345634
# 5 rs00334 9 536423
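Here is one way to convert example2 back to the original format. The factor() step assumes that column a of example is a factor, which was the data.frame default before R 4.0; drop that step if a is a character column.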
example3 <- example2 %>%
group_by(b, c) %>%
summarize(a = str_c(a, collapse = "|")) %>%
ungroup() %>%
select(names(example2)) %>%
mutate(a = factor(a)) %>%
as.data.frame()
identical(example, example3)
# [1] TRUE
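A base R version of the same split is sketched below, mainly to show one pitfall: | is the alternation operator in regular expressions, so strsplit needs fixed = TRUE (or an escaped pattern) to split on the literal bar. The names parts and long are illustrative, and a is created as a character column here.

```r
example <- data.frame(a = c("rs123|rs246|rs689653", "rs9753", "rs00334"),
                      b = c(1, 2, 9),
                      c = c(234534523, 67345634, 536423),
                      stringsAsFactors = FALSE)

# fixed = TRUE makes "|" a literal separator instead of a regex operator.
parts <- strsplit(example$a, "|", fixed = TRUE)

# Repeat each row index once per piece, which duplicates b and c,
# then overwrite a with the individual pieces.
long <- example[rep(seq_len(nrow(example)), lengths(parts)), ]
long$a <- unlist(parts)
rownames(long) <- NULL
long
```

Indexing by repeated row numbers carries along every other column, so the approach does not change if more columns are added.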
I have a data frame (df):
a <- c("up","up","up","up","down","down","down","down")
b <- c("l","r","l","r","l","l","r","r")
df <- data.frame(a,b)
I would like to add a third column (c) that numbers the entries in order of appearance within each combination of columns a and b, so that the result looks something like this:
a b c
1 up l 1
2 up r 1
3 up l 2
4 up r 2
5 down l 1
6 down l 2
7 down r 1
8 down r 2
I have tried solutions using dplyr that have not worked:
order <- df %>%
group_by(a) %>%
group_by(b) %>%
mutate(c = row_number()) # This counts the order based on `b`, ignoring `a`
order <- df %>%
group_by(a) %>%
group_by(b) %>%
mutate(c = seq_len(n())) # This counts the order based on `b`, ignoring `a`
I would prefer to keep using dplyr and pipes if possible, but other suggestions are welcome
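You need to combine a and b in the same group_by statement. By default, a second group_by call replaces the existing grouping instead of adding to it, which is why your attempts grouped by b only.
order <- df %>%
  group_by(a, b) %>%
  mutate(c = row_number())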
order
# Source: local data frame [8 x 3]
# Groups: a, b [4]
#
# a b c
# <fctr> <fctr> <int>
# 1 up l 1
# 2 up r 1
# 3 up l 2
# 4 up r 2
# 5 down l 1
# 6 down l 2
# 7 down r 1
# 8 down r 2
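Since other suggestions were welcome, here is a base R sketch of the same within-group numbering using ave. It is an alternative to the dplyr answer rather than a replacement for it, and it assumes the rows should be numbered in their current order.

```r
a <- c("up", "up", "up", "up", "down", "down", "down", "down")
b <- c("l", "r", "l", "r", "l", "l", "r", "r")
df <- data.frame(a, b)

# ave() splits the row indices by every combination of a and b,
# applies seq_along() within each piece, and puts the results back
# in the original row order.
df$c <- ave(seq_len(nrow(df)), df$a, df$b, FUN = seq_along)
df
```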