Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
For example, the column 'name' contains the following value: 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (the first 10 digits), a column 'timepoint' containing 1 (from -1), and a column 'paired_end' containing 2 (from R2).
What would the correct code for this be?

tidyr::extract
You can use extract() from the tidyr package.
library(tidyr)
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To tailor the solution to your needs, you should provide more examples, so that rare cases and exceptions can be handled.
Several regexes can work for you. This one, for example, extracts the first three numbers it finds, separated by non-digit characters:
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
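To check a pattern against more than one file name, a minimal sketch along these lines may help. The second file name is made up for illustration, and remove = FALSE (which keeps the original column) is an optional extra not used above. The ids are kept as character strings here, because converting them to numbers would drop the leading zero.

```r
library(tidyr)

# Hypothetical file names; only the first one comes from the question
files <- data.frame(
  name = c("0112200015-1_R2_001.fastq.gz", "0112200016-2_R1_001.fastq.gz"),
  stringsAsFactors = FALSE
)

parsed <- tidyr::extract(
  files, name,
  into = c("sample_id", "timepoint", "paired_end"),
  regex = "^(\\d{10})-(\\d+)_R(\\d+)",
  remove = FALSE  # keep the original 'name' column
)

# Convert only the small counters; sample_id stays character
parsed$timepoint  <- as.integer(parsed$timepoint)
parsed$paired_end <- as.integer(parsed$paired_end)
parsed
```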

I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should use the output of colnames() instead.
library(stringr)
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
                 timepoint = str_extract(vector, "(?<=-)."),
                 paired_end = str_extract(vector, "(?<=R)."))
All the str_* functions are from the stringr package.
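As an alternative to three separate calls, the pieces could probably be pulled out in one pass with str_match(), which returns one column per capture group. This is only a sketch: the file names are hypothetical and the pattern assumes the layout digits, "-", digits, "_R", digits.

```r
library(stringr)

# Hypothetical file names in the same format as the question
files <- c("0112200015-1_R2_001.fastq.gz", "0112200015-12_R1_001.fastq.gz")

# Column 1 of the result is the full match; columns 2 to 4 are the capture groups
m <- str_match(files, "^(\\d+)-(\\d+)_R(\\d+)")

parts <- data.frame(
  sample_id  = m[, 2],
  timepoint  = m[, 3],
  paired_end = m[, 4],
  stringsAsFactors = FALSE
)
parts
```

A file name that does not fit the pattern yields NA in every column, which makes malformed names easy to spot.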

This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" by "\\d+" returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <-
tibble(name = "0112200015-1_R2_001.fastq.gz")
df %>%
# Extract the 10 digit sample id
mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
# Extract the 1 digit timepoint which comes after "-" and before the first "_"
timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
# Extract the 1 digit paired_end which comes after "_R"
paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
#> # A tibble: 1 x 4
#>   name                         sample_id  timepoint paired_end
#>   <chr>                        <chr>      <chr>     <chr>
#> 1 0112200015-1_R2_001.fastq.gz 0112200015 1         2
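The difference between "\\d{1}" and "\\d+" mentioned above can be seen with a two-digit timepoint. The file name below is a made-up variant of the one in the question.

```r
library(stringr)

# Hypothetical file name with a two-digit timepoint (12)
x <- "0112200015-12_R2_001.fastq.gz"

# Exactly one digit between "-" and "_": there is none, so the result is NA
one_digit <- str_extract(x, "(?<=-)\\d{1}(?=_)")

# One or more digits between "-" and "_"
many_digits <- str_extract(x, "(?<=-)\\d+(?=_)")

one_digit    # NA
many_digits  # "12"
```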

Related

Using R Regex to identify two characters followed by a dash and two numbers

Very obnoxious regex question incoming! I have a column that I am trying to split into two based off a condition. I'd like a new column to be created when there are two characters, followed by a dash and two numbers (e.g., CA-01).
My code is:
mydf %>% extract(col = pilot_id, regex = "[a-z]{2}.d{2}", into = 'facility_test')
Where the column I'd like to identify the pattern in is pilot_id, and the new column I'd like to make is facility_test.
extract() needs a capture group in the regex: only the text matched inside the parentheses is written to the new column.
library(dplyr)
library(tidyr)
mydf %>%
extract(col = pilot_id, regex = ".*-([A-Z]{2}-\\d{2})\\s.*",
into = 'facility_test')
# A tibble: 1 x 1
# facility_test
# <chr>
#1 FL-03
data
mydf <- tibble(pilot_id = "TGT Track -FL-03 (Hilsborough County) 3/3/2021")
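If the two parts of the code are wanted separately, one capture group per new column should work. This is a sketch only: the second row and the column names state and number are invented for illustration.

```r
library(tidyr)

# The second row is a made-up example in the same format
pilots <- data.frame(
  pilot_id = c("TGT Track -FL-03 (Hilsborough County) 3/3/2021",
               "TGT Track -CA-01 (Example County) 1/1/2021"),
  stringsAsFactors = FALSE
)

# Two groups in the regex, two names in 'into'
res <- tidyr::extract(
  pilots, pilot_id,
  into = c("state", "number"),
  regex = "([A-Z]{2})-(\\d{2})",
  remove = FALSE
)
res
```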

Separate string into columns by extracting all groups that match regex

I have these strings in every row of one column.
example_df <- tibble(string = c("[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":19.843418733556412,\"volgorde\":10},{\"positieVergelekenMetSchooladvies\":\"op niveau\",\"percentage\":81.81818181818181,\"percentageVergelijking\":78.58821425834631,\"volgorde\":20},{\"positieVergelekenMetSchooladvies\":\"onder niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":1.5683670080972694,\"volgorde\":30}]"))
I'm only interested in the numbers. This regex works:
example_df %>%
.$string %>%
str_extract_all(., "[0-9]+\\.[0-9]+")
Instead of using the separate() function, I want to use extract(). My understanding is that extract() differs from separate() in that its regex matches the text you want to put in the new columns, whereas separate() matches the separator. But while separate() splits at every match of sep =, extract() seems to match only one group.
example_df %>%
extract(string,
into = c("boven_niveau_school",
"boven_niveau_verg",
"op_niveau_school",
"op_niveau_verg",
"onder_niveau_school",
"onder_niveau_verg"),
regex = "([0-9]+\\.[0-9]+)")
What am I doing wrong?
Instead of separate or extract I would extract all the numbers from the string and then use unnest_wider to create new columns.
library(tidyverse)
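example_df %>%
  mutate(temp = str_extract_all(string, "[0-9]+\\.[0-9]+")) %>%
  # names_sep is needed in recent tidyr versions because the extracted values are unnamed
  unnest_wider(temp, names_sep = "_")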
You can rename the columns as per your choice.
We can use regmatches/gregexpr from base R
out <- regmatches(example_df$string, gregexpr("\\d+\\.\\d+", example_df$string))[[1]]
example_df[paste0("new", seq_along(out))] <- as.list(out)
example_df
# A tibble: 1 x 7
# string new1 new2 new3 new4 new5 new6
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 "[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9… 9.09090909… 19.84341873… 81.8181818… 78.588214… 9.0909090… 1.56836700…
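Another way to get named numeric columns might be str_extract_all() with simplify = TRUE, which returns a character matrix with one column per match. The shortened input string and the column names below are assumptions made for illustration.

```r
library(stringr)

# Shortened, hypothetical version of the string in the question
json_like <- '[{"percentage":9.09,"percentageVergelijking":19.84},{"percentage":81.82,"percentageVergelijking":78.59}]'

# One row per input string, one column per number found
nums <- str_extract_all(json_like, "[0-9]+\\.[0-9]+", simplify = TRUE)

wide <- as.data.frame(nums, stringsAsFactors = FALSE)
names(wide) <- c("boven_school", "boven_verg", "op_school", "op_verg")

# The matches are text, so convert them to numbers
wide[] <- lapply(wide, as.numeric)
wide
```

This relies on every row containing the same number of matches; rows with fewer matches are padded with empty strings, which become NA after as.numeric().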

Regex for consecutive repeated word in R

Regular expression to find words and digits that are repeated back to back
Suppose I have a data frame
df<-data.frame(name=c("mike","mike","mike","bob","mike"),age=c(23,23,23,25,23))
How can I write a regular expression to check whether "mike" or any other word in the "name" column is repeated back to back (e.g., here mike is repeated 3 times), and whether a number in the "age" column is repeated back to back (e.g., here 23 is repeated 3 times)?
You can try this :
library(dplyr)
df %>%
  mutate(across(everything(), data.table::rleid, .names = '{.col}_grp')) %>%
  group_by(across(ends_with('grp'))) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(all_of(names(df)))
# name age
# <chr> <dbl>
#1 mike 23
#2 mike 23
#3 mike 23
For every column in df we use rleid to give a unique number to consecutive values and select those groups that have >= 3 rows in them.
Base R one-liner (no regex); note that it returns only the first run of identical names:
df[which(c(0, cumsum(abs(diff(as.integer(as.factor(df$name)))))) == 0),]
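A base R sketch with rle() could generalise this to any run of at least three identical values, wherever it occurs in the column. The threshold of 3 and the variable names are assumptions taken from the example.

```r
df <- data.frame(name = c("mike", "mike", "mike", "bob", "mike"),
                 age  = c(23, 23, 23, 25, 23))

# Run-length encoding: lengths 3, 1, 1 for values mike, bob, mike
runs <- rle(as.character(df$name))

# Expand the per-run test back to one logical value per row
keep <- rep(runs$lengths >= 3, times = runs$lengths)

repeated <- df[keep, ]
repeated
```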

How can I use R stringr to leave only the gene name?

I have a large spreadsheet with 3200 observations that has a list of genes in a column. The column however has a bunch of junk that I don't need (example below). How can I use stringr to remove the unnecessary junk and leave only the gene name?
Example: The gene names are TEM-126 and ykkD.
gb|AY628199|+|203-1064|ARO:3000988|TEM-126
gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD
If your gene names are always at the tail of your strings, you can try the code below
> gsub(".*\\|","",v)
[1] "TEM-126" "ykkD"
DATA
v <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
"gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")
Using stringr:
str_split_fixed(genes, '\\|', n = 6)[, 6]
As you said you have those names in a column and it seems that the gene name is the last "word", you could easily do that using just two packages from tidyverse, dplyr and stringr.
library(dplyr)
library(stringr)
df <- tibble::tribble(
~Text,
"gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
"gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD"
)
df %>%
mutate(gene = word(Text, start = -1, end = -1, sep = "\\|"))
#> # A tibble: 2 x 2
#> Text gene
#> <chr> <chr>
#> 1 gb|AY628199|+|203-1064|ARO:3000988|TEM-126 TEM-126
#> 2 gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD ykkD
If you have a vector genevec of gene names, you can vectorise the function:
stringr::str_split(pattern="\\|", string=genevec, simplify=TRUE)[,6]
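If the number of "|"-separated fields is not always six, a pattern anchored at the end of the string may be more robust. This is a sketch that assumes the gene name is always the last field and contains no "|".

```r
library(stringr)

v <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
       "gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")

# Keep the trailing run of characters that are not "|"
genes_extract <- str_extract(v, "[^|]+$")

# Equivalent: remove everything up to and including the last "|"
genes_remove <- str_remove(v, ".*\\|")

genes_extract
```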

Use case_when and startsWith to selectively mutate by row

I'm trying to create a new column based on another, using case_when to give different outputs based on the value of each row.
I start with df <- data.frame(a=c("abc", "123", "abc", "123"))
And want to generate a new column b like so
#> a b
#> 1 abc letter
#> 2 123 number
#> 3 abc letter
#> 4 123 number
I've tried df %>% mutate(b = case_when(startsWith(a, "a") ~ "letter", startsWith(a, "1") ~ "number")) but it only gives an error. Can someone show me how to get different values for column b based on the first letter of the row in column a?
According to ?startsWith
x: vector of character string whose “starts” are considered.
So, startsWith expects the class to be character and here it is factor class. Converting it to character class would solve the issue
library(dplyr)
df %>%
mutate(b = case_when(startsWith(as.character(a), "a") ~ "letter",
TRUE ~ "number"))
# a b
#1 abc letter
#2 123 number
#3 abc letter
#4 123 number
Before R 4.0.0, the default behavior of data.frame was stringsAsFactors = TRUE (since R 4.0.0 the default is FALSE). If we specify stringsAsFactors = FALSE, the 'a' column will be of character class.
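The effect of the column class can be checked directly. In this sketch stringsAsFactors is set explicitly in both cases so the result does not depend on the R version; the tryCatch() wrapper is only there to show that the factor column is rejected.

```r
df_factor <- data.frame(a = c("abc", "123"), stringsAsFactors = TRUE)
df_char   <- data.frame(a = c("abc", "123"), stringsAsFactors = FALSE)

class(df_factor$a)  # "factor"
class(df_char$a)    # "character"

# startsWith() accepts the character column
starts_chr <- startsWith(df_char$a, "a")

# It rejects the factor column, so the error is caught here for illustration
starts_fct <- tryCatch(startsWith(df_factor$a, "a"),
                       error = function(e) "error")
```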
Another option is str_detect to create a logical expression by checking if the character from the start (^) of the string is a digit ([0-9])
library(stringr)
library(dplyr)
df %>%
mutate(b = c("letter", "number")[1+str_detect(a, "^[0-9]")])
# a b
#1 abc letter
#2 123 number
#3 abc letter
#4 123 number
You can just use if_else() since there are only two cases here. A regex seems more appropriate, given the test you're trying to run; the key is that ^ specifies the start of the string, and [:alpha:] matches alphabetical letters in either upper or lower case.
library(tidyverse)
df <- data.frame(a=c("abc", "123", "abc", "123"))
df %>% mutate(
b = a %>% str_detect("^[:alpha:]") %>% if_else("letter", "number")
)
#> a b
#> 1 abc letter
#> 2 123 number
#> 3 abc letter
#> 4 123 number
Created on 2019-09-29 by the reprex package (v0.3.0)
As @akrun pointed out, there is an issue here with factors vs. characters: are you sure this is an appropriate example for your use case, i.e. is your real data stored as factors? Luckily, though, str_detect() works just as well either way.
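If more than two categories are needed, case_when() can be combined with str_detect() and a fallback branch. This is a sketch only: the fifth value "?x" and the label "other" are invented to show the fallback.

```r
library(dplyr)
library(stringr)

# "?x" is a made-up value that is neither a letter nor a number at the start
df <- data.frame(a = c("abc", "123", "abc", "123", "?x"),
                 stringsAsFactors = TRUE)

res <- df %>%
  mutate(
    a_chr = as.character(a),  # make the factor column explicit text
    b = case_when(
      str_detect(a_chr, "^[A-Za-z]") ~ "letter",
      str_detect(a_chr, "^[0-9]")    ~ "number",
      TRUE                           ~ "other"
    )
  )
res
```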
