How can I use R stringr to leave only the gene name?

I have a large spreadsheet with 3200 observations that has a list of genes in a column. The column however has a bunch of junk that I don't need (example below). How can I use stringr to remove the unnecessary junk and leave only the gene name?
Example: The gene names are TEM-126 and ykkD.
gb|AY628199|+|203-1064|ARO:3000988|TEM-126
gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD

If the gene names always sit at the end of the strings, you can try the code below:
> gsub(".*\\|","",v)
[1] "TEM-126" "ykkD"
DATA
v <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
       "gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")

Using stringr:
str_split_fixed(genes, '\\|', n = 6)[, 6]
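A minimal runnable sketch of the above, assuming genes holds the same strings as the v vector from the first answer: str_split_fixed() returns a character matrix with one row per input, so [, 6] keeps the sixth field. This relies on every string having exactly six pipe-delimited fields.
library(stringr)
genes <- v  # assumption: the vector from the DATA block above
str_split_fixed(genes, '\\|', n = 6)[, 6]
#> [1] "TEM-126" "ykkD"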

As you said, you have those names in a column, and since the gene name is the last "word", you can do this with just two tidyverse packages, dplyr and stringr.
library(dplyr)
library(stringr)
df <- tibble::tribble(
  ~Text,
  "gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
  "gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD"
)
df %>%
  mutate(gene = word(Text, start = -1, end = -1, sep = "\\|"))
#> # A tibble: 2 x 2
#>   Text                                           gene
#>   <chr>                                          <chr>
#> 1 gb|AY628199|+|203-1064|ARO:3000988|TEM-126     TEM-126
#> 2 gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD ykkD

If you have those strings in a vector genevec, note that str_split() is already vectorised, so one call handles the whole vector:
stringr::str_split(pattern = "\\|", string = genevec, simplify = TRUE)[, 6]
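If hardcoding column 6 feels brittle (it breaks as soon as the number of fields changes), newer stringr versions (>= 1.5.0) offer str_split_i(), which takes the i-th piece and counts from the end when i is negative; a sketch:
# i = -1 takes the last piece of each split, however many fields there are
stringr::str_split_i(genevec, "\\|", -1)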

Related

Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
e.g. the column 'name' contains the value 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (the first 10 digits), a column 'timepoint' containing 1 (from -1), and a column 'paired_end' containing 2 (from R2).
What would the correct code for this be?
tidyr::extract
You can use extract from the tidyr package.
library(tidyr)
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d{10})-(\\d)_R(\\d)")
#>    sample_id timepoint paired_end
#> 1 0112200015         1          2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To make the solution more tailored to your needs, you should provide more examples, so as to handle rare cases and exceptions.
Several regexes can work here. This one, for example, extracts the first three numbers it finds between non-numeric separators:
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#>    sample_id timepoint paired_end
#> 1 0112200015         1          2
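A hedged aside: extract() also accepts convert = TRUE, which runs type.convert() on the new columns so they come back numeric instead of character. Beware that it would strip the leading zero from sample_id by turning it into an integer, so only convert where that is safe:
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)", convert = TRUE)
# sample_id becomes the integer 112200015 here; keep it as character
# if the leading zero matters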
I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should be using the colnames output.
library(stringr)

vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
                 timepoint = str_extract(vector, "(?<=-)."),
                 paired_end = str_extract(vector, "(?<=R)."))
all the str functions are from the stringr package.
This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" by "\\d+" returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <- tibble(name = "0112200015-1_R2_001.fastq.gz")

df %>%
  # Extract the 10-digit sample id
  mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
         # Extract the 1-digit timepoint, which comes after "-" and before the first "_"
         timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
         # Extract the 1-digit paired_end, which comes after "_R"
         paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
  name                         sample_id  timepoint paired_end
  <chr>                        <chr>      <chr>     <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1         2

Separate a string into columns by extracting all groups that match a regex

I have these strings in every row of one column.
example_df <- tibble(string = c("[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":19.843418733556412,\"volgorde\":10},{\"positieVergelekenMetSchooladvies\":\"op niveau\",\"percentage\":81.81818181818181,\"percentageVergelijking\":78.58821425834631,\"volgorde\":20},{\"positieVergelekenMetSchooladvies\":\"onder niveau\",\"percentage\":9.090909090909092,\"percentageVergelijking\":1.5683670080972694,\"volgorde\":30}]"))
I'm only interested in the numbers. This regex works:
example_df %>%
.$string %>%
str_extract_all(., "[0-9]+\\.[0-9]+")
Instead of using the separate() function I want to use the extract() function. My understanding is that they differ in that extract() matches the regex whose capture groups populate the new columns, while separate() matches the separator string. But where separate() matches every string you supply to sep=, extract() matches only one group:
example_df %>%
  extract(string,
          into = c("boven_niveau_school",
                   "boven_niveau_verg",
                   "op_niveau_school",
                   "op_niveau_verg",
                   "onder_niveau_school",
                   "onder_niveau_verg"),
          regex = "([0-9]+\\.[0-9]+)")
What am I doing wrong?
Instead of separate or extract I would extract all the numbers from the string and then use unnest_wider to create new columns.
library(tidyverse)
example_df %>%
  mutate(temp = str_extract_all(string, "[0-9]+\\.[0-9]+")) %>%
  unnest_wider(temp)
You can rename the columns as per your choice.
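If you prefer predictable names over the automatic ones, unnest_wider() takes a names_sep argument; a small sketch:
example_df %>%
  mutate(temp = str_extract_all(string, "[0-9]+\\.[0-9]+")) %>%
  unnest_wider(temp, names_sep = "_")
# the new columns come out as temp_1 ... temp_6, easy to rename() afterwards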
We can use regmatches/gregexpr from base R:
out <- regmatches(example_df$string, gregexpr("\\d+\\.\\d+", example_df$string))[[1]]
example_df[paste0("new", seq_along(out))] <- as.list(out)
example_df
# A tibble: 1 x 7
# string new1 new2 new3 new4 new5 new6
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 "[{\"positieVergelekenMetSchooladvies\":\"boven niveau\",\"percentage\":9… 9.09090909… 19.84341873… 81.8181818… 78.588214… 9.0909090… 1.56836700…

Extract identically named vectors from nested lists, where the list names vary? Using purrr?

I have to work with some data that is in recursive lists like this (simplified reproducible example below):
groups
#> $group1
#> $group1$countries
#> [1] "USA" "JPN"
#>
#>
#> $group2
#> $group2$countries
#> [1] "AUS" "GBR"
Code for data input below:
chars <- c("USA", "JPN")
chars2 <- c("AUS", "GBR")
group1 <- list(countries = chars)
group2 <- list(countries = chars2)
groups <- list(group1 = group1, group2 = group2)
groups
I'm trying to work out how to extract the vectors that are in the lists without manually having to write a line of code for each group. The code below works, but my example has a large number of groups (and the number of groups will change), so it would be great to extract all of the vectors in a more efficient manner. This is the brute-force way, which works:
countries1 <- groups$group1$countries
countries2 <- groups$group2$countries
In the example, the bottom level vector I'm trying to extract is always called countries, but the lists they're contained in change name, varying only by numbering.
Would there be an easy purrr solution? Or tidyverse solution? Or other solution?
Add some additional cases to your list
groups[["group3"]] <- list()
groups[["group4"]] <- list(foo = letters[1:2])
groups[["group5"]] <- list(foo = letters[1:2], countries = LETTERS[1:2])
Here's a function that maps any list to just the elements named "countries"; it returns NULL if there are no elements
fun = function(x)
x[["countries"]]
Map your original list to contain just the elements you're interested in
interesting <- Map(fun, groups)
Then transform these into a data.frame using a combination of unlist() and rep()
df <- data.frame(
  country = unlist(interesting, use.names = FALSE),
  name = rep(names(interesting), lengths(interesting))
)
Alternatively, use tidy syntax, e.g.,
interesting %>%
  tibble(group = names(.), value = .) %>%
  unnest("value")
The output is
# A tibble: 6 x 2
  group  value
  <chr>  <chr>
1 group1 USA
2 group1 JPN
3 group2 AUS
4 group2 GBR
5 group5 A
6 group5 B
If there are additional problems parsing individual elements of groups, then modify fun, e.g.,
fun = function(x)
as.character(x[["countries"]])
This will put the output in a list, which will handle any number of groups:
countries <- unlist(groups, recursive = FALSE)
names(countries) <- sub("^\\w+(\\d+)\\.(\\w+)", "\\2\\1", names(countries), perl = TRUE)
> countries
$countries1
[1] "USA" "JPN"
$countries2
[1] "AUS" "GBR"
You can simply transform your nested list to a data.frame and then unnest the country column.
library(dplyr)
library(tidyr)
groups %>%
  tibble(group = names(groups),
         country = .) %>%
  unnest(country) %>%
  unnest(country)
#> # A tibble: 4 x 2
#>   group  country
#>   <chr>  <chr>
#> 1 group1 USA
#> 2 group1 JPN
#> 3 group2 AUS
#> 4 group2 GBR
Created on 2020-01-15 by the reprex package (v0.3.0)
Since the countries are hidden 2 layers deep, you have to run unnest twice. Otherwise I think this is straightforward.
If you actually want to have each vector as an object in your global environment, a combination of purrr::map2/walk and list2env will work. To make this work, we have to give the country entries in the list individual names first; otherwise list2env just overwrites the same object over and over again.
library(purrr)
groups <- map2(groups, seq_along(groups), ~ setNames(.x, paste0(names(.x), .y)))
walk(groups, ~ list2env(.x, envir = .GlobalEnv))
This creates exactly the results you describe in your question. I am not sure, though, whether it is the best solution for a smooth workflow, since I don't know where you are going with this.
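As a footnote on the purrr part of the question: map() also accepts an element name and extracts that element from every sublist, so the per-group vectors can be pulled in one line. A minimal sketch on the original (un-renamed) groups list:
library(purrr)
# a character scalar acts as an extractor, i.e. .x[["countries"]] for each group;
# groups lacking a "countries" element come back as NULL
country_list <- map(groups, "countries")
country_list$group1
#> [1] "USA" "JPN"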

Extracting numbers from text with stringr and regex in R

I have a problem where I'm trying to extract numbers from a string containing text and numbers and then create two new columns showing the Min and Max of the numbers.
For example, I have one column and a string of data like this:
Text
Section 12345.01 to section 12345.02
And I want to create two new columns from the data in the Text column, like this:
Min      Max
12345.01 12345.02
I'm using dplyr and stringr with regex, but the regex only extracts the first occurrence of the pattern (the first number).
df %>% dplyr::mutate(SectionNum = stringr::str_extract(Text, "\\d+\\.\\d+"))
If I try the stringr::str_extract_all function, it does extract both occurrences of the pattern, but it creates a list in the tibble, which I find a real hassle. So I'm stuck on the first step, just trying to get the numbers out into their own columns.
Can anyone recommend the most efficient way to do this? Ideally I'd like to extract the numbers from the string, convert them to numbers as.numeric and then run min() and max() functions.
With extract from tidyr. extract turns each regex capture group into its own column. convert = TRUE is convenient in that it coerces the resulting columns to the best format. remove = FALSE can be used if we want to keep the original column. The last mutate is optional to make sure that the first number extracted is really the minimum:
library(tidyr)
library(dplyr)
library(purrr)

df %>%
  extract(Text, c("Min", "Max"), "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
  mutate(Min = pmap_dbl(., min),
         Max = pmap_dbl(., max))
Output:
Min Max
1 12345.02 12345.03
Data:
df <- structure(list(Text = structure(1L, .Label = "Section 12345.03 to section 12345.02", class = "factor")), class = "data.frame", row.names = c(NA,
-1L), .Names = "Text")
Using some other tidyverse tools, you can either approach this by unnesting the list-column and using group_by and summarise semantics (the more dplyr way), or you can just deal with the list-col as-is and use map_dbl to extract the max and min from each row (a more purrr way). My benchmarks have map_dbl about 7 times faster than unnest and dplyr, and about 15% faster than extract, though this is only on the one row.
library(tidyverse)
df <- tibble(
  Text = c("Section 12345.01 to section 12345.02")
)
df %>%
  mutate(SectionNum = str_extract_all(Text, "\\d+\\.\\d+")) %>%
  unnest() %>%
  group_by(Text) %>%
  summarise(min = min(as.numeric(SectionNum)), max = max(as.numeric(SectionNum)))
#> # A tibble: 1 x 3
#> Text min max
#> <chr> <dbl> <dbl>
#> 1 Section 12345.01 to section 12345.02 12345. 12345.
df %>%
  mutate(
    SectionNum = str_extract_all(Text, "\\d+\\.\\d+"),
    min = map_dbl(SectionNum, ~ min(as.numeric(.x))),
    max = map_dbl(SectionNum, ~ max(as.numeric(.x)))
  )
#> # A tibble: 1 x 4
#> Text SectionNum min max
#> <chr> <list> <dbl> <dbl>
#> 1 Section 12345.01 to section 12345.02 <chr [2]> 12345. 12345.
Created on 2018-09-24 by the reprex package (v0.2.0).
There are already answers showing how to accomplish your final goal, but just to address how you can find the first or second match using the stringr package: use the str_match function and pick the specific match you are interested in by indexing into the matrix it returns.
library(stringr)
Text <- "Section 12345.01 to section 12345.02"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[2]
#> [1] "12345.01"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[3]
#> [1] "12345.02"
Created on 2018-09-24 by the reprex package (v0.2.0).
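A hedged alternative if you would rather not write the fully anchored pattern: str_match_all() returns a matrix of every match together with its capture groups, so both numbers come back at once and you index rows instead of building the regex around position:
str_match_all(Text, "(\\d+\\.\\d+)")[[1]][, 2]
#> [1] "12345.01" "12345.02"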

stringr: find rows where any column content matches a regex

Consider the following example
> data_text <- data.frame(text = c('where', 'are', 'you'),
+                         blob = c('little', 'nice', 'text'))
> data_text
# A tibble: 3 x 2
text blob
<chr> <chr>
1 where little
2 are nice
3 you text
I want to print the rows that contain the regex text (that is, row 3)
Problem is, I have hundreds of columns and I don't know which one contains this string. str_detect only works with one column at a time...
How can I do that using the stringr package?
Thanks!
With stringr and dplyr you can do this.
You should use filter_all from dplyr >= 0.5.0.
I have extended the data to get a better look at the result:
library(dplyr)
library(stringr)
data_text <- data.frame(text = c('text', 'where', 'are', 'you'),
                        one_more_text = c('test', 'test', 'test', 'test'),
                        blob = c('wow', 'little', 'nice', 'text'))

data_text %>%
  filter_all(any_vars(str_detect(., 'text')))
# output
text one_more_text blob
1 text test wow
2 you test text
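A side note, since filter_all() has since been superseded: in dplyr >= 1.0.4 the same row filter is usually written with if_any(), a sketch assuming the columns are character:
data_text %>%
  filter(if_any(everything(), ~ str_detect(.x, 'text')))
# same two rows as the filter_all() output above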
You can treat the data.frame as a list and use purrr::map to check each column, which can then be reduced into a logical vector that filter can handle. Alternatively, purrr::pmap can iterate over all the columns in parallel:
library(tidyverse)
data_text <- data_frame(text = c('where', 'are', 'you'),
                        blob = c('little', 'nice', 'text'))
data_text %>% filter(map(., ~.x == 'text') %>% reduce(`|`))
#> # A tibble: 1 x 2
#> text blob
#> <chr> <chr>
#> 1 you text
data_text %>% filter(pmap_lgl(., ~any(c(...) == 'text')))
#> # A tibble: 1 x 2
#> text blob
#> <chr> <chr>
#> 1 you text
matches <- apply(data_text, 1, function(x) sum(grepl("text", x))) > 0
result <- data_text[matches, ]
No other packages required. Hope this helps!
