Separating Column Based on First Value of String - r

I have an ID variable that I am trying to separate into two separate columns based on their prefix being either a 1 or 2.
An example of my data is:
STR_ID
1434233
2343535
1243435
1434355
I have tried countless ways to separate these IDs into columns based on their prefixes, but cannot figure it out. Any ideas on how I would do this? Thank you in advance.

We create a grouping variable with substr by extracting the first character/digit of 'STR_ID', and spread it to 'wide' format
library(tidyverse)
df1 %>%
group_by(grp = paste0('grp', substr(STR_ID, 1, 1))) %>%
mutate(i = row_number()) %>%
spread(grp, STR_ID) %>%
select(-i)
# A tibble: 3 x 2
# grp1 grp2
# <int> <int>
#1 1434233 2343535
#2 1243435 NA
#3 1434355 NA
data
df1 <- structure(list(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L
)), class = "data.frame", row.names = c(NA, -4L))
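For comparison, here is a minimal base R sketch of the same idea, assuming the df1 defined above: split the IDs by their first digit, then pad the shorter group with NA so that both columns have the same length. The names groups, n_max, and wide are illustrative and not part of the original answer.

```r
# Same sample data as in the answer above
df1 <- data.frame(STR_ID = c(1434233L, 2343535L, 1243435L, 1434355L))

# First digit of each ID (substr coerces the integers to character)
prefix <- substr(df1$STR_ID, 1, 1)

# One vector of IDs per prefix, named grp1, grp2, ...
groups <- split(df1$STR_ID, paste0("grp", prefix))

# Pad every group with NA up to the length of the longest group
n_max <- max(lengths(groups))
wide <- as.data.frame(lapply(groups, function(x) c(x, rep(NA, n_max - length(x)))))
wide
```

This avoids the helper row index used in the spread approach, at the cost of doing the NA padding by hand.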

Related

Is there a way to group_by values in R?

I have a dataframe that looks something like this:
Column A        Column B
Accepted        Did not accept
Did not accept  Did not accept
and so on.
I would like to know whether there is a way to turn the data into a summary table like the one below, which counts the number of "Accepted" and "Did not accept" values in each column:
Accepted or not?  Column A  Column B
Accepted          10        2
Did not accept    5         5
In base R, use sapply with table -
sapply(df,function(x) table(factor(x, levels = c('Accepted', 'Did not accept'))))
# Column.A Column.B
#Accepted 1 0
#Did not accept 1 2
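The factor(x, levels = ...) step is what keeps the result rectangular. A small sketch, assuming the df from the data section of this answer, of what happens with and without fixed levels; the names lvls, no_levels, and with_levels are illustrative.

```r
df <- data.frame(Column.A = c("Accepted", "Did not accept"),
                 Column.B = c("Did not accept", "Did not accept"))
lvls <- c("Accepted", "Did not accept")

# Without fixed levels, each table only contains the values that occur in
# that column, so the tables can have different lengths
no_levels <- lapply(df, table)

# With fixed levels, every column yields a count for every level (possibly 0),
# so sapply can simplify the result to a matrix
with_levels <- sapply(df, function(x) table(factor(x, levels = lvls)))
with_levels
```

Here Column.B never contains "Accepted", so its table in no_levels has only one entry, while with_levels reports an explicit 0 for it.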
In tidyverse -
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
pivot_wider(names_from = name, values_from = name,
values_fn = length, values_fill = 0)
# value Column.A Column.B
# <chr> <int> <int>
#1 Accepted 1 0
#2 Did not accept 1 2
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Column.A = c("Accepted", "Did not accept"), Column.B = c("Did not accept",
"Did not accept")), row.names = c(NA, -2L), class = "data.frame")
Convert wide-to-long, then tabulate:
table(stack(df))
# ind
# values Column.A Column.B
# Accepted 1 0
# Did not accept 1 2

extract component from a list of file names in a column to create a new column in R

I have a data frame with numerous columns and rows. One column in particular "Filename" has information that I would like to separate and make into a new column "ID"
A Filename
1 Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz
2 Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz
My new df2 I would like to create is
A ID
1 AA56789
2 AA52399
I do not have any code written for this, as I am just beginning to understand sub and gsub. Any help would be appreciated, thank you!
We could use basename with trimws
df2 <- cbind(df1['A'], ID = trimws(basename(df1$Filename), whitespace = "_.*"))
-output
df2
A ID
1 1 AA56789
2 2 AA52399
data
df1 <- structure(list(A = 1:2, Filename = c("Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
"Sample.2020-03-15_2355_WES01_FF001_089-267/2345245_H345.FASTQs/AA52399_1.clipped.fastq.gz"
)), class = "data.frame", row.names = c(NA, -2L))
You can use a regex to extract the value of interest. In base R, using sub, you can do -
cbind(df[1], ID = sub('.*/(\\w+)_.*', '\\1', df$Filename))
# A ID
#1 1 AA56789
#2 2 AA52399
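One caveat with sub is that it returns the input unchanged when the pattern does not match, so a malformed filename would silently end up as its own ID. Below is a minimal sketch that returns NA instead, assuming the same pattern as above. extract_id is a hypothetical helper name, and the second path is made up for illustration.

```r
# Hypothetical helper: extract the ID, or NA when the pattern does not match
extract_id <- function(path) {
  pattern <- ".*/(\\w+)_.*"
  ifelse(grepl(pattern, path), sub(pattern, "\\1", path), NA_character_)
}

files <- c(
  "Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz",
  "no_directory_here.txt"  # made-up path without a "/" so the pattern fails
)
ids <- extract_id(files)
ids
```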
A tidyverse-solution:
library(dplyr)
library(stringr)
df %>%
mutate(ID = str_replace(Filename, '.*/(\\w+)_.*', '\\1'), .keep = "unused")
which is basically the same as Ronak Shah's or
df %>%
mutate(ID = str_extract(Filename, "(?<=/)(\\w+)(?=_\\d+\\.clipped)"), .keep = "unused")
Both return
# A tibble: 2 x 2
A ID
<dbl> <chr>
1 1 AA56789
2 2 AA52399
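The lookaround pattern also works in base R when perl = TRUE is set. A small sketch, assuming one of the filenames from the question; regexpr() and regmatches() are base functions, and the variable names are illustrative.

```r
path <- "Sample.2020-03-16_2345_WES01_FF001_089-267/2355245_H445.FASTQs/AA56789_1.clipped.fastq.gz"

# (?<=/) requires a "/" right before the match,
# (?=_\\d+\\.clipped) requires "_<digits>.clipped" right after it
m <- regexpr("(?<=/)\\w+(?=_\\d+\\.clipped)", path, perl = TRUE)
id <- regmatches(path, m)
id
```

Because the lookarounds are not part of the match itself, no capture group or replacement string is needed.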

adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (the real dataset has 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns so, for example, all the values in the first row between the first column and the column 100, all the values in the first row between the column 1 and the column 200, all the values in the second row between the first column and the column 100, etc.
Using the sample data, I've come up with this solution using rowSums.
dataset %>%
mutate_if(~!is.numeric(.x), as.numeric) %>%
mutate_all(funs(replace_na(., 0))) %>%
mutate(sum = rowSums(.[,paste("a", 1:3, sep="")])) %>%
mutate(sum1 = rowSums(.[,paste("a", 4:5, sep="")])) %>%
mutate(sum2 = rowSums(.[,paste("a", 6:8, sep="")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the columns a1 to a100 than .[,paste("a", 1:3, sep="")]? I want to select the columns by name rather than by position, because a100 is not necessarily the 100th column.
I am converting the NAs and NaNs to 0 so that I can sum the rows. I do this with mutate_all(funs(replace_na(., 0))), which also affects my first column (name), the one that contains the names of the values. What would be the best way to replace NA and NaN without turning the string values of that column into 0?
The columns I am adding are of type integer because I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach if I have dbl columns?
Thank you!
Here is one way to do this: transform the data to long format, then, for each name, create groups of n consecutive rows and take the sum of each group.
library(dplyr)
library(tidyr)
n <- 2 #No of columns to bucket. Change this to 100 for your case.
dataset %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name) %>%
group_by(grp = rep(seq_len(n()), each = n, length.out = n()), .add = TRUE) %>%
summarise(value = sum(value, na.rm = TRUE)) %>%
#If needed in wider format again
pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
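Note that the approach above buckets columns by position. If the buckets should instead follow the number in the column name (so that a missing a7 simply contributes nothing), one option is to derive the bucket from the name itself. This is a minimal base R sketch under two assumptions: all value columns are named "a" followed by a number, and the buckets have a fixed width n. The names value_cols, col_index, bucket, and result are illustrative.

```r
dataset <- data.frame(name = c("A", "B", "C", "D"),
                      a1 = 1:4, a2 = c(1, 2, NaN, 5), a3 = 1:4, a4 = 1:4,
                      a5 = c(1, 2, NA, 5), a6 = 1:4, a8 = 1:4)

n <- 3  # bucket width; use 100 for the real data

# Select the value columns by name and read the number from each name
value_cols <- grep("^a[0-9]+$", names(dataset), value = TRUE)
col_index <- as.integer(sub("^a", "", value_cols))

# a1..a3 -> bucket 1, a4..a6 -> bucket 2, a7..a9 -> bucket 3, ...
bucket <- (col_index - 1) %/% n + 1

# Row sums per bucket; na.rm = TRUE skips NA and NaN without editing the data
sums <- sapply(split(value_cols, bucket),
               function(cols) rowSums(dataset[cols], na.rm = TRUE))
colnames(sums) <- paste0("sum", colnames(sums))

result <- cbind(dataset["name"], sums)
result
```

Because only the columns matched by grep are summed, the name column is never converted or overwritten, and na.rm = TRUE makes the separate replace_na step unnecessary for the purpose of summing.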

How to replace with only the part before the ":" in every row of a column in R

so in a dataset, I have a column named "Interventions", and each row looks like this:
row1: "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600"
row2: "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
I want to extract only the intervention type, such as "Drug", "Biological", or "Procedure", and keep that in the column. Even better would be to keep only the unique intervention types, instead of "Drug" four times as in the first row.
The expected output would look like this:
row1: "Drug"
row2: "Biological, Drug, Procedure"
I am just getting started with R; I have tidyverse installed and am somewhat used to working with %>%. If anyone can help me with this, it would be much appreciated!
If we want to extract only the prefix part before the :
library(dplyr)
library(stringr)
library(tidyr)
library(purrr)
df1 %>%
mutate(Interventions = map_chr(str_extract_all(Interventions,
"\\w+(?=:)"), ~ toString(sort(unique(.x)))))
# Interventions
#1 Drug
#2 Biological, Drug, Procedure
Or another option is to separate the rows based on the delimiters, slice the alternate rows and paste together the sorted unique values in 'Interventions'
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Interventions, sep="[:|]") %>%
group_by(rn) %>%
slice(seq(1, n(), by = 2)) %>%
distinct() %>%
summarise(Interventions = toString(sort(unique(Interventions)))) %>%
ungroup %>%
select(-rn)
# A tibble: 2 x 1
# Interventions
# <chr>
#1 Drug
#2 Biological, Drug, Procedure
data
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
Not as concise, and the same logic as Akrun's, but in base R:
# Create df:
df1 <- structure(list(Interventions = c("Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
"Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
)), class = "data.frame", row.names = c(NA, -2L))
# Assign a row id vec:
df1$row_num <- 1:nrow(df1)
# Split string on | delim:
split_up <- strsplit(df1$Interventions, split = "[|]")
# Roll down the dataframe - keep uniques:
rolled_out <- unique(data.frame(row_num = rep(df1$row_num, sapply(split_up, length)),
Interventions = gsub("[:].*","", unlist(split_up))))
# Stack the dataframe:
df2 <- aggregate(Interventions~row_num, rolled_out, paste0, collapse = ", ")
# Drop id vec:
df2 <- within(df2, rm("row_num"))
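The same base R logic can also be condensed into a single pass over the split strings. This is a sketch, assuming the df1 from the data section above; the variable name types is illustrative.

```r
df1 <- data.frame(Interventions = c(
  "Drug: Rituximab|Drug: Utomilumab|Drug: Avelumab|Drug: PF04518600",
  "Biological: alemtuzumab|Biological: donor lymphocytes|Drug: carmustine|Drug: cytarabine|Drug: etoposide|Drug: melphalan|Procedure: allogeneic bone marroow"
))

# Split each row on "|", drop everything from ":" onwards,
# then join the sorted unique prefixes back into one string per row
types <- vapply(strsplit(df1$Interventions, "|", fixed = TRUE), function(parts) {
  toString(sort(unique(sub(":.*", "", parts))))
}, character(1))
types
```

Using fixed = TRUE treats "|" as a literal character, so it does not need to be wrapped in a character class as in split = "[|]".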

Return column names based on condition

I have a dataset with 18 columns from which I need to return the column names with the highest value(s) for each observation; a simple example is below. I came across this answer, and it almost does what I need, but in some cases I need to combine the names (like ab in maxcol below). How should I do this?
Any suggestions would be greatly appreciated! If it's possible it would be easier for me to understand a tidyverse based solution as I'm more familiar with that than base.
Edit: I forgot to mention that some of the columns in my data have NAs.
library(dplyr, warn.conflicts = FALSE)
#turn this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5)
#into this
Df <- tibble(a = 4:2, b = 4:6, c = 3:5, maxcol = c("ab", "b", "b"))
Created on 2018-10-30 by the reprex package (v0.2.1)
Continuing from the answer in the linked post, we can do
Df$maxcol <- apply(Df, 1, function(x) paste0(names(Df)[x == max(x)], collapse = ""))
Df
# a b c maxcol
# <int> <int> <int> <chr>
#1 4 4 3 ab
#2 3 5 4 b
#3 2 6 5 b
For every row, we check which position has max values and paste the names at that position together.
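The question's edit mentions that some columns contain NAs. With an NA in a row, max(x) is NA and the comparison selects nothing useful. A minimal sketch of one way to handle this, assuming NAs should simply be ignored; the NA in column a and the name value_cols are made up for illustration.

```r
# Example data with one missing value (assumed for illustration)
Df <- data.frame(a = c(4L, 3L, NA), b = 4:6, c = 3:5)
value_cols <- c("a", "b", "c")

Df$maxcol <- apply(Df[value_cols], 1, function(x) {
  # max(..., na.rm = TRUE) ignores NAs; which() drops the NA comparisons
  paste0(value_cols[which(x == max(x, na.rm = TRUE))], collapse = "")
})
Df
```

Restricting apply to Df[value_cols] also keeps the result stable if the code is run twice, since the maxcol column itself is never included in the comparison. A row consisting only of NAs would produce a warning from max and an empty string, so such rows may need separate handling.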
If you prefer the tidyverse approach
library(tidyverse)
Df %>%
mutate(row = row_number()) %>%
gather(values, key, -row) %>%
group_by(row) %>%
mutate(maxcol = paste0(values[key == max(key)], collapse = "")) %>%
spread(values, key) %>%
ungroup() %>%
select(-row)
# maxcol a b c
# <chr> <int> <int> <int>
#1 ab 4 4 3
#2 b 3 5 4
#3 b 2 6 5
We first convert dataframe from wide to long using gather, then group_by each row we paste column names for max key and then spread the long dataframe to wide again.
Here's a solution I found that loops through column names in case you find it hard to wrap your head around spread/gather (pivot_wider/longer)
out_df <- Df %>%
# calculate rowwise maximum
rowwise() %>%
mutate(rowmax = max(c_across(everything()))) %>%
# create empty maxcol column
mutate(maxcol = "")
# loop through column names
for (colname in colnames(Df)) {
out_df <- out_df %>%
# if the value at the specified column name is the maximum, paste it to the maxcol
mutate(maxcol = ifelse(.data[[colname]] == rowmax, paste0(maxcol, colname), maxcol))
}
# remove rowmax column if no longer needed
out_df <- out_df %>%
select(-rowmax)
