R: reorder column values alphabetically

I have a data frame in R and I want to reorder the values in the second column, "Car", alphabetically, like this:
Car
Audi/BMW/VW
Audi/BMW
Audi/BMW/VW
Audi/BMW/Porsche/VW
There can be 0 to 15 cars per row, separated by "/".
My solution is a bit complicated: build a new data frame from this column, split it into multiple columns, sort each row alphabetically, paste the values back together, and insert the result into the original data frame.
Do you know a better, smarter solution?
Thanks a lot.

This is basically what you did but without creating new dataframe and new columns.
df$Car <- sapply(strsplit(as.character(df$Car), "/"), function(x)
paste(sort(x), collapse = "/"))
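As a minimal sketch of how this behaves on the question's sample values, the same split/sort/paste step can be wrapped in a small helper. The name sort_tokens and the sample vector are illustrative assumptions, not part of the original answer; vapply is used instead of sapply only so that the result is always a character vector.

```r
# Hypothetical helper: sort the "/"-separated tokens inside each string.
sort_tokens <- function(x, sep = "/") {
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  # paste() on an empty token vector yields "", so rows with 0 cars stay empty.
  vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
}

# Sample values based on the question; "" stands for a row with 0 cars.
cars <- c("BMW/VW/Audi", "Audi/BMW", "VW/BMW/Audi", "Audi/BMW/VW/Porsche", "")
sorted_cars <- sort_tokens(cars)
print(sorted_cars)
```

Note that sort() follows the collation rules of the current locale, so values with mixed upper and lower case may be ordered differently on different machines.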

We can use separate_rows to split the 'Car' column into one row per car, arrange by 'Name' and 'Car', and then paste the elements back together, grouped by 'Name' and 'zipcode'.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
separate_rows(Car) %>%
arrange(Name, Car) %>%
group_by(Name, zipcode) %>%
summarise(Car = str_c(Car, collapse="/"))
# A tibble: 4 x 3
# Groups: Name [4]
# Name zipcode Car
# <chr> <dbl> <chr>
#1 Frank 3456 Audi/BMW/VW
#2 Lilly 1333 Audi/BMW/Porsche/VW
#3 Marie 1416 Audi/BMW
#4 Peter 1213 Audi/BMW/VW
data
df1 <- structure(list(Name = c("Peter", "Marie", "Frank", "Lilly"),
Car = c("BMW/VW/Audi", "Audi/BMW", "VW/BMW/Audi", "Audi/BMW/VW/Porsche"
), zipcode = c(1213, 1416, 3456, 1333)),
class = "data.frame", row.names = c(NA,
-4L))

Related

extract component from a list of file names in a column to create a new column in R

I have a data frame with numerous columns and rows. One column in particular "Filename" has information that I would like to separate and make into a new column "ID"
A Filename
1 Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz
2 Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz
My new df2 I would like to create is
A ID
1 AA56789
2 AA52399
I have not written any code for this yet, as I am just beginning to understand sub and gsub. Any help would be appreciated, thank you!
We could use basename with trimws
df2 <- cbind(df1['A'], ID = trimws(basename(df1$Filename), whitespace = "_.*"))
-output
df2
A ID
1 1 AA56789
2 2 AA52399
data
df1 <- structure(list(A = 1:2, Filename = c("Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
"Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz"
)), class = "data.frame", row.names = c(NA, -2L))
You can use a regex to extract the value of interest. In base R, using sub, you can do:
cbind(df[1], ID = sub('.*/(\\w+)_.*', '\\1', df$Filename))
# A ID
#1 1 AA56789
#2 2 AA52399
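To see what the pattern is doing, it may help to compare it with a two-step version: take the file name with basename, then drop everything from the first underscore onward. This is only a sketch on the two paths from the question; the variable names are made up for illustration.

```r
# The two example paths from the question.
paths <- c(
  "Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
  "Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz"
)

# Step 1: keep only the part after the last "/".
file_names <- basename(paths)

# Step 2: remove the first "_" and everything after it.
ids_two_step <- sub("_.*$", "", file_names)

# One-step version: capture the word characters between the last "/"
# and an underscore, then replace the whole string with that capture.
ids_one_step <- sub(".*/(\\w+)_.*", "\\1", paths)

print(ids_two_step)
print(ids_one_step)
```

Both approaches assume the ID itself contains no underscore; if it could, the pattern would need to be tightened.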
A tidyverse-solution:
library(dplyr)
library(stringr)
df %>%
mutate(ID = str_replace(Filename, '.*/(\\w+)_.*', '\\1'), .keep = "unused")
which is basically the same as Ronak Shah's or
df %>%
mutate(ID = str_extract(Filename, "(?<=/)(\\w+)(?=_\\d+\\.clipped)"), .keep = "unused")
Both return
# A tibble: 2 x 2
A ID
<dbl> <chr>
1 1 AA56789
2 2 AA52399

Dplyr: Anonymising values up to a million rows with unique names

I have the following data:
library(dplyr)
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('John', 'Jane', 'Rich', 'Clive'),
surname = c('Smith', 'Jones', 'Smith', 'Jones'))
I would like to anonymise the values within the 'forename' and 'surname' variables so that the data looks like this.
d <- tibble(
region = c('all', 'one', 'eleven', 'six'),
forename = c('forename1', 'forename2', 'forename3', 'forename4'),
surname = c('surname1', 'surname2', 'surname3', 'surname4'))
I could just do this manually, but I have a df with millions of rows. What I would like is for the number in each replacement value to match the row number in the df. So the data on row 67, for example, would show:
d <- tibble(
region = c('all'),
forename = c('forename67'),
surname = c('surname67'))
Does anyone know how I would achieve this using dplyr if possible?
Thanks
As every row is a unique user, we can paste row_number to the column names.
library(dplyr)
d %>%
mutate(forename = paste0("forename", row_number()),
surname = paste0("surname", row_number()))
# A tibble: 4 x 3
# region forename surname
# <chr> <chr> <chr>
#1 all forename1 surname1
#2 one forename2 surname2
#3 eleven forename3 surname3
#4 six forename4 surname4
An option with stringr
library(dplyr)
library(stringr)
d %>%
mutate(forename = str_c("forename", row_number()),
surname = str_c("surname", row_number()))
Or with lapply from base R
d[c('forename', 'surname')] <- lapply(c('forename', 'surname'), function(x)
paste0(x, seq_len(nrow(d))))
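If more columns need the same treatment, the row-number idea can be wrapped in a small function. This is a rough base R sketch; anonymise_cols is a made-up name, and a plain data.frame is used instead of a tibble just to keep the example free of packages.

```r
# Hypothetical helper: replace each listed column with "<column name><row number>".
anonymise_cols <- function(data, cols) {
  ids <- seq_len(nrow(data))  # integer row numbers 1..n
  data[cols] <- lapply(cols, function(col) paste0(col, ids))
  data
}

d <- data.frame(
  region   = c("all", "one", "eleven", "six"),
  forename = c("John", "Jane", "Rich", "Clive"),
  surname  = c("Smith", "Jones", "Smith", "Jones"),
  stringsAsFactors = FALSE
)

anon <- anonymise_cols(d, c("forename", "surname"))
print(anon)
```

Because seq_len() returns integers, large row numbers are pasted in full (for example "forename100000") rather than in scientific notation.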

R data frame rearrangement

I have an R data frame (actually an excel sheet which I have read into R) in the format below:
ID Text
1 This is a red
car. Its electric
and has 4 wheels.
2 This is a van with
six wheels.
I want to reshape it into the following format
ID Text
1 This is a red car. Its electric and has 4 wheels.
2 This is a van with six wheels
Essentially between the two ID numbers my text has been broken into multiple lines. I want to combine it to look like the output above.
Using group_by a numeric ID did not work as it gets rid of lines w/o the ID#.
Any thoughts on how I can achieve this type of output?
Thanks!
Here is one option with tidyverse. Convert the blank ("") in 'ID' to NA (na_if), use fill from tidyr to replace the NA elements with the previous non-NA value, group by 'ID', then paste the 'Text' elements together, collapsing them into a single string.
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Text = str_c(Text, collapse=' '))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
Or index the non-blank 'ID' values by the cumulative sum of a logical vector (ID != "") to fill the 'ID', and use that as the grouping variable to summarise the 'Text' column.
df1 %>%
group_by(ID = ID[ID != ""][cumsum(ID != "")]) %>%
summarise(Text = str_c(Text, collapse=" "))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
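The cumsum trick may be easier to follow when written out step by step in base R. The sketch below rebuilds the example data and uses tapply for the grouping; the intermediate names (grp, filled_id, combined) are only for illustration.

```r
df1 <- data.frame(
  ID   = c("1", "", "", "2", ""),
  Text = c("This is a red", "car. Its electric", "and has 4 wheels.",
           "This is a van with", "six wheels."),
  stringsAsFactors = FALSE
)

# TRUE wherever a new ID starts; the running total numbers the blocks: 1 1 1 2 2
grp <- cumsum(df1$ID != "")

# Look up the non-blank ID that belongs to each block: "1" "1" "1" "2" "2"
filled_id <- df1$ID[df1$ID != ""][grp]

# Paste the text fragments of each block into one string.
combined <- tapply(df1$Text, filled_id, paste, collapse = " ")

result <- data.frame(ID = names(combined), Text = as.vector(combined),
                     stringsAsFactors = FALSE)
print(result)
```

This assumes the first row always carries an ID; tapply also orders the groups by sorted ID rather than by first appearance, which makes no difference for this data.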
data
df1 <- structure(list(ID = c("1", "", "", "2", ""), Text = c("This is a red",
"car. Its electric", "and has 4 wheels.", "This is a van with",
"six wheels.")), row.names = c(NA, -5L), class = "data.frame")

How to replace with only the part before the ":" in every row of a column in R

so in a dataset, I have a column named "Interventions", and each row looks like this:
row1: "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600"
row2: "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
I want to extract only the Intervention type, such as "Drug", "Biological", "Procedure", to remain in the column. Even better would be to keep only the unique Intervention types, instead of "Drug" 4 times as in the first row.
The expected output would look like this:
row1: "Drug"
row2: "Biological, Drug, Procedure"
I am just getting started with r, I have tidyverse installed and kinda used to playing with the %>%. If anyone can help me with this, much appreciated !
If we want to extract only the prefix part before the :
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df1 %>%
mutate(Interventions = map_chr(str_extract_all(Interventions,
"\\w+(?=:)"), ~ toString(sort(unique(.x)))))
# Interventions
#1 Drug
#2 Biological, Drug, Procedure
Or another option is to separate the rows based on the delimiters, slice the alternate rows and paste together the sorted unique values in 'Interventions'
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Interventions, sep="[:|]") %>%
group_by(rn) %>%
slice(seq(1, n(), by = 2)) %>%
distinct() %>%
summarise(Interventions = toString(sort(unique(Interventions)))) %>%
ungroup %>%
select(-rn)
# A tibble: 2 x 1
# Interventions
# <chr>
#1 Drug
#2 Biological, Drug, Procedure
data
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
Not as concise, and the same logic as akrun's answer, but in base R:
# Create df:
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
# Assign a row id vec:
df1$row_num <- 1:nrow(df1)
# Split string on | delim:
split_up <- strsplit(df1$Interventions, split = "[|]")
# Roll down the dataframe - keep uniques:
rolled_out <- unique(data.frame(row_num = rep(df1$row_num, sapply(split_up, length)),
Interventions = gsub("[:].*","", unlist(split_up))))
# Stack the dataframe:
df2 <- aggregate(Interventions~row_num, rolled_out, paste0, collapse = ", ")
# Drop id vec:
df2 <- within(df2, rm("row_num"))
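For comparison, the lookahead pattern "\\w+(?=:)" (a run of word characters that is immediately followed by a colon) can also be used directly in base R with gregexpr and regmatches. This is only a sketch on the question's two rows; perl = TRUE is needed because lookahead is not supported by the default regex engine.

```r
interventions <- c(
  "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
  "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)

# Find every word that sits directly in front of a ":" (one vector per row).
matches <- regmatches(interventions,
                      gregexpr("\\w+(?=:)", interventions, perl = TRUE))

# Keep the unique types per row, sort them, and join them with ", ".
types <- vapply(matches, function(m) toString(sort(unique(m))), character(1))
print(types)
```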

Replicate each row of data.frame when occurrence

I am facing a tricky question and would be glad to have some help.
I have a data frame with an ID name taking different structures. Something like this following :
ID
bbb-5p/mi-98/6134
abb-4p
bbb-5p/mi-98
Every time I have this "/" I would like to duplicate the row. Each row should be duplicated as many times as there are "/" in it.
Then the name of each duplicated row should be the root + the characters right after the "/".
For example, this:
ID
bbb-5p/mi-98/6134
should give :
ID
bbb-5p
bbb-5p-mi-98
bbb-5p-6134
Also my initial data frame have 5 variables :
[ID, varA, varB, varC, varD]
And every time I have this "/" I would like to duplicate the entire row. Then I am expecting to have a new data frame with something like
newID newvarA newvarB newvarC newvarD
bbb-5p varA(1) varB(1) varC(1) varD(1)
bbb-5p-mi-98 varA(1) varB(1) varC(1) varD(1)
bbb-5p-6134 varA(1) varB(1) varC(1) varD(1)
abb-4p varA(2) varB(2) varC(2) varD(2)
bbb-5p varA(3) varB(3) varC(3) varD(3)
bbb-5p-mi-98 varA(3) varB(3) varC(3) varD(3)
Any idea?
Thank you in advance
Peter
You can accomplish this in base R, using lapply() with a custom function. First, you split your character column on "/", resulting in a list of vectors:
l <- strsplit(df$ID,"/")
Then you apply a user defined function to each element of l using lapply():
l_stacked <- lapply(l, function(x)
if(length(x) > 1) {
c(x[1], paste0(x[1],"-",x[-1])) }
else { x })
The function first checks whether the vector has a length > 1. If so, it concatenates all elements with the first element, separated by "-". If length <= 1, it means the string didn't contain "/", hence it is returned as is. Finally we flatten our output using unlist() to be able to convert to data.frame.
data.frame(ID = unlist(l_stacked))
# ID
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
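The result above contains only the ID column, while the question also asks for the other variables to be carried along. One possible way to do that in base R is to repeat the row indices with rep() before replacing the ID. In this sketch the varA values are invented purely for illustration.

```r
# Example data; the varA values are made up for this sketch.
df <- data.frame(
  ID   = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"),
  varA = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Build the new IDs per row: the root, then root + "-" + each later part.
parts <- strsplit(df$ID, "/")
new_ids <- lapply(parts, function(x) {
  if (length(x) > 1) c(x[1], paste0(x[1], "-", x[-1])) else x
})

# Repeat each row once per new ID, then overwrite the ID column.
expanded <- df[rep(seq_len(nrow(df)), lengths(new_ids)), , drop = FALSE]
expanded$ID <- unlist(new_ids)
rownames(expanded) <- NULL
print(expanded)
```

Each original row is repeated once for every ID it produces, so all remaining columns stay aligned with their new IDs.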
One way to achieve this is the following:
library(dplyr)
library(tidyr)
res <- df %>% mutate(i=row_number(),
ID = strsplit(ID,split='/')) %>%
unnest(ID) %>%
group_by(i) %>%
mutate(ID=ifelse(ID==first(ID),first(ID),paste(first(ID),ID,sep='-'))) %>%
ungroup() %>% select(-i)
### A tibble: 6 x 1
## ID
## <chr>
##1 bbb-5p
##2 bbb-5p-mi-98
##3 bbb-5p-6134
##4 abb-4p
##5 bbb-5p
##6 bbb-5p-mi-98
Notes:
First, create an indexing column i to group by later so that we can group each "root".
Use strsplit to split each row by "/".
tidyr::unnest the result to separate rows.
group_by the created index i and then if the row is the first row, just return the root; otherwise, paste to prepend the root to the row with separator "-".
Finally, ungroup and remove the created index column i.
Data
df <- structure(list(ID = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"
)), .Names = "ID", row.names = c(NA, -3L), class = "data.frame")
ID
1 bbb-5p/mi-98/6134
2 abb-4p
3 bbb-5p/mi-98
Here is one option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1, ..)), creating a column of row names ('rn'). Grouped by 'rn', split the 'ID' by "/", then loop through the sequence of rows and paste the split elements based on the index.
library(splitstackshape)
library(data.table)
setDT(df1, keep.rownames=TRUE)[, unlist(strsplit(ID, "/")),
by = rn][, .(ID=sapply(seq_len(.N), function(i)
paste(V1[unique(c(1,i))], collapse="-"))) , rn]
Or an option with dplyr/tidyr/tibble. Create the row names column with tibble::rownames_to_column, separate the rows into long format with separate_rows, then, grouped by 'rn', mutate the 'ID' by pasting the elements based on the group length, and remove the 'rn' column.
library(dplyr)
library(tidyr)
library(tibble)
rownames_to_column(df1, var = "rn") %>%
separate_rows(ID, sep="/") %>%
group_by(rn) %>%
mutate(ID = if(n()>1) c(ID[1], paste(ID[1], ID[-1], sep="-")) else ID) %>%
ungroup() %>%
select(-rn)
# ID
# <chr>
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
