Extracting numbers from text with stringr and regex in R

I have a problem where I'm trying to extract numbers from a string containing text and numbers and then create two new columns showing the Min and Max of the numbers.
For example, I have one column and a string of data like this:
Text
Section 12345.01 to section 12345.02
And I want to create two new columns from the data in the Text column, like this:
Min Max
12345.01 12345.02
I'm using dplyr and stringr with regex, but the regex only extracts the first occurrence of the pattern (the first number).
df%>%dplyr::mutate(SectionNum = stringr::str_extract(Text, "\\d+.\\d+"))
If I use the stringr::str_extract_all function instead, it seems to extract both occurrences of the pattern, but it creates a list column in the tibble, which I find a real hassle. So I'm stuck on the first step, just trying to get the numbers out into their own columns.
Can anyone recommend the most efficient way to do this? Ideally I'd like to extract the numbers from the string, convert them with as.numeric and then run the min() and max() functions.

You can use extract from tidyr, which turns each regex capture group into its own column. convert = TRUE is convenient because it coerces the resulting columns to the most suitable type. remove = FALSE can be used if we want to keep the original column. The final mutate is optional; it makes sure that Min really holds the smaller of the two numbers and Max the larger:
library(dplyr)
library(tidyr)
library(purrr)
df %>%
extract(Text, c("Min", "Max"), "([\\d.]+)[^\\d.]+([\\d.]+)", convert = TRUE) %>%
mutate(Min = pmap_dbl(., min),
Max = pmap_dbl(., max))
Output:
Min Max
1 12345.02 12345.03
Data:
df <- structure(list(Text = structure(1L, .Label = "Section 12345.03 to section 12345.02", class = "factor")), class = "data.frame", row.names = c(NA,
-1L), .Names = "Text")
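For comparison, the same idea can be sketched in base R without tidyr. This is only a sketch: it assumes every row contains at least one decimal number, and the helper name extract_range is made up for illustration.

```r
# Hypothetical helper (not from the answer above): pull every decimal
# number out of each string and return the row-wise minimum and maximum.
extract_range <- function(text) {
  text <- as.character(text)  # the example data stores Text as a factor
  # gregexpr() locates all matches; regmatches() returns them as a list
  hits <- regmatches(text, gregexpr("[0-9]+\\.[0-9]+", text))
  nums <- lapply(hits, as.numeric)
  data.frame(
    Min = vapply(nums, min, numeric(1)),
    Max = vapply(nums, max, numeric(1))
  )
}

res <- extract_range("Section 12345.03 to section 12345.02")
```

Because min and max are taken per row, the order in which the numbers appear in the text does not matter.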

Using some other tidyverse tools, you can either approach this by unnesting the list-column and using group_by and summarise semantics (the more dplyr way), or you can just deal with the list-col as-is and use map_dbl to extract the max and min from each row (a more purrr way). My benchmarks have map_dbl about 7 times faster than unnest and dplyr, and about 15% faster than extract, though this is only on the one row.
library(tidyverse)
df <- tibble(
Text = c("Section 12345.01 to section 12345.02")
)
df %>%
mutate(SectionNum = str_extract_all(Text, "\\d+\\.\\d+")) %>%
unnest(SectionNum) %>%
group_by(Text) %>%
summarise(min = min(as.numeric(SectionNum)), max = max(as.numeric(SectionNum)))
#> # A tibble: 1 x 3
#> Text min max
#> <chr> <dbl> <dbl>
#> 1 Section 12345.01 to section 12345.02 12345. 12345.
df %>%
mutate(
SectionNum = str_extract_all(Text, "\\d+\\.\\d+"),
min = map_dbl(SectionNum, ~ min(as.numeric(.x))),
max = map_dbl(SectionNum, ~ max(as.numeric(.x)))
)
#> # A tibble: 1 x 4
#> Text SectionNum min max
#> <chr> <list> <dbl> <dbl>
#> 1 Section 12345.01 to section 12345.02 <chr [2]> 12345. 12345.
Created on 2018-09-24 by the reprex package (v0.2.0).
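If every row is known to contain the same number of matches, the list column can be avoided altogether by asking str_extract_all for a matrix. The following is a minimal sketch under that assumption; the second input string is invented purely to show a row where the larger number comes first.

```r
library(stringr)

Text <- c("Section 12345.01 to section 12345.02",
          "Section 7.50 to section 3.25")  # second row is made-up test data

# simplify = TRUE returns a character matrix: one row per string,
# one column per match
m <- str_extract_all(Text, "\\d+\\.\\d+", simplify = TRUE)
storage.mode(m) <- "double"  # convert the matched strings to numbers

res <- data.frame(Text,
                  Min = apply(m, 1, min),
                  Max = apply(m, 1, max))
```

Rows with fewer matches than the widest row are padded with empty strings, which become NA after conversion, so this shortcut is best kept for data with a fixed layout.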

The other answers already show how to reach your final goal. To address the narrower question of how to get the first or second match with stringr: str_match returns a matrix whose first column is the full match and whose remaining columns are the capture groups, so you can pick a specific match by column.
library(stringr)
Text <- "Section 12345.01 to section 12345.02"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[, 2]
#> [1] "12345.01"
str_match(Text, "^[^0-9.]*([0-9.]*)[^0-9.]*([0-9.]*)[^0-9.]*$")[, 3]
#> [1] "12345.02"
Created on 2018-09-24 by the reprex package (v0.2.0).
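When str_match is given several strings it returns one row per string, so selecting whole columns keeps the results aligned with the input. A small sketch, assuming each string contains exactly two numbers; the shorter pattern and the second input string are illustrative choices, not part of the answer above.

```r
library(stringr)

Text <- c("Section 12345.01 to section 12345.02",
          "Section 7.50 to section 3.25")  # second row is made-up test data

# Column 1 is the full match; columns 2 and 3 are the capture groups
m <- str_match(Text, "([0-9.]+)[^0-9.]+([0-9.]+)")
first  <- as.numeric(m[, 2])
second <- as.numeric(m[, 3])

# pmin()/pmax() compare the two vectors element by element
Min <- pmin(first, second)
Max <- pmax(first, second)
```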

Related

Find the maximum number in an R dataframe column of strings

For each cell in a particular column of a dataframe (here simply named df), I want to find the maximum and minimum numbers that appear as text embedded in a string. Any commas present in the cell have no special significance. Percentages should not count, so if for example 50% appears then 50 is to be excluded from consideration. The relevant column of the dataframe looks something like this:
| particular_col_name |
| ------------------- |
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15 |
So two new columns should be created, titled 'maximum_number' and 'minimum_number'. For the first row they should contain 20 and 5 respectively; 70 is excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.
I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit), within the dplyr 'mutate' operator, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in an appropriate format for the maximum and minimum values to be easily identified.
Any help on this would be most appreciated please.

library(tidyverse)
tibble(
particular_col_name = c(
"First Row String10. This is also a string_5, and so is this 20, exclude70%",
"Second_Row_50%, number40. Number 4. number_15",
"20% 30%"
)
) %>%
mutate(
numbers = particular_col_name %>% map(~ {
.x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
}),
min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
) %>%
select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#> particular_col_name min max
#> <chr> <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e… 5 20
#> 2 Second_Row_50%, number40. Number 4. number_15 4 40
#> 3 20% 30% NA NA
Created on 2022-02-22 by the reprex package (v2.0.0)
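The warnings in the output above come from calling min() and max() on an empty vector. One way to avoid them, sketched here under the assumption that a row without usable numbers should simply yield NA, is to check the length first. The helper name safe_range is hypothetical.

```r
library(stringr)

# Hypothetical helper: drop percentages, extract the remaining integers,
# and return NA instead of Inf/-Inf when a row has no numbers left.
safe_range <- function(x) {
  cleaned <- str_remove_all(x, "[0-9]+%")
  nums <- lapply(str_extract_all(cleaned, "[0-9]+"), as.numeric)
  data.frame(
    min = vapply(nums, function(n) if (length(n)) min(n) else NA_real_, numeric(1)),
    max = vapply(nums, function(n) if (length(n)) max(n) else NA_real_, numeric(1))
  )
}

res <- safe_range(c(
  "First Row String10. This is also a string_5, and so is this 20, exclude70%",
  "Second_Row_50%, number40. Number 4. number_15",
  "20% 30%"
))
```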

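We could use str_extract_all in combination with sapply. Note that this pattern does not exclude numbers followed by %, so 70 and 50 appear as the maxima in the output below: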
library(stringr)
df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9]+"), function(x) max(as.integer(x)))
particular_col_name min max
<chr> <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70% 5 70
2 Second_Row_50%, number40. Number 4. number_15 4 50
data:
df <- structure(list(particular_col_name = c("First Row String10. This is also a string_5, and so is this 20, exclude70%",
"Second_Row_50%, number40. Number 4. number_15"), min = 5:4,
max = c(70L, 50L)), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
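The sapply approach can be adapted to skip percentages by adding a negative lookahead to the pattern. The sketch below uses base R's gregexpr with perl = TRUE; the lookahead (?![0-9%]) rejects a run of digits that is followed by another digit or by %, which prevents a partial match such as the 7 in 70%. It assumes every row keeps at least one number after the exclusion.

```r
x <- c("First Row String10. This is also a string_5, and so is this 20, exclude70%",
       "Second_Row_50%, number40. Number 4. number_15")

# Digits not followed by another digit or a percent sign
pattern <- "[0-9]+(?![0-9%])"
hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

min_val <- sapply(hits, function(h) min(as.integer(h)))
max_val <- sapply(hits, function(h) max(as.integer(h)))
```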

Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
eg. the column 'name' contains the following value: 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (first 10 digits), a column 'timepoint' containing 1 (from -1) and a column 'paired_end' containing 2 (from R2)
What would the correct code for this be?

tidyr::extract
You can use extract from tidyr package.
library(tidyr)
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To tailor the solution to your needs, you should provide more examples so that rare cases and exceptions can be handled.
Several regular expressions can work for you. This one, for example, extracts the first three numbers it finds between non-numeric separators:
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
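A base R alternative to extract is strcapture from the utils package, which also maps capture groups to columns and uses a prototype data frame to set the column names and types. This is a sketch that assumes every file name follows the pattern shown in the question.

```r
name <- "0112200015-1_R2_001.fastq.gz"

# The zero-row prototype defines the output column names and types
proto <- data.frame(sample_id  = character(),
                    timepoint  = integer(),
                    paired_end = integer(),
                    stringsAsFactors = FALSE)

res <- strcapture("^(\\d{10})-(\\d)_R(\\d)", name, proto)
```

Keeping sample_id as character preserves its leading zero, while the other two columns are converted to integers.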

I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should use the output of colnames instead.
library(stringr)
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
timepoint = str_extract(vector, "(?<=-)."),
paired_end = str_extract(vector, "(?<=R)."))
All the str_ functions are from the stringr package.

This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" by "\\d+" returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <-
tibble(name = "0112200015-1_R2_001.fastq.gz")
df %>%
# Extract the 10 digit sample id
mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
# Extract the 1 digit timepoint which comes after "-" and before the first "_"
timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
# Extract the 1 digit paired_end which comes after "_R"
paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
name sample_id timepoint paired_end
<chr> <chr> <chr> <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1 2

Why do I not get the renamed column in R with the title defined?

I am trying to create a function whose parameters are defined in the local environment, and I hope to use it with a tibble.
Here is a reproducible example, without the function yet:
tibble <- structure(list(standardised_existing_cond_rate = 1.44), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame"))
Next I define my title. Combining the parts with paste does not seem to be a problem:
comorbidity <- "asthma"
title <- "Standardised"
add_to_title <- comorbidity
add_last_word <- "rate in"
country = "India"
country = country
whole_title <- paste(title, add_to_title, add_last_word, country)
whole_title
But when I try to rename my column with the whole title, it does not work.
table <- tibble %>% rename(whole_title = standardised_existing_cond_rate)
table
Clearly, I am doing something odd, and I would welcome a more elegant way to rename the column with the title. It is important to keep the whole title as it is, because its parts will be passed in as parameters of a function.

rename() takes the new name literally, so the column ends up being called whole_title. Use !! with := to evaluate whole_title:
library(dplyr)
tibble %>% rename(!!whole_title := standardised_existing_cond_rate)
# `Standardised asthma rate in India`
# <dbl>
#1 1.44
There are also simpler ways that avoid non-standard evaluation, for example setNames:
setNames(tibble, whole_title)

We can use rename_at
library(dplyr)
tibble %>%
rename_at(vars(standardised_existing_cond_rate), ~ whole_title)
Output:
# A tibble: 1 x 1
# `Standardised asthma rate in India`
# <dbl>
#1 1.44
Or using rename_with
tibble %>%
rename_with(~ whole_title, standardised_existing_cond_rate)
# A tibble: 1 x 1
# `Standardised asthma rate in India`
# <dbl>
#1 1.44
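Since the goal is to pass the title parts as function parameters, the renaming can also be wrapped in a small function. The following base R sketch avoids non-standard evaluation entirely; the function name rename_rate_column and its argument names are hypothetical, and a plain data frame stands in for the tibble.

```r
# Hypothetical wrapper: build the title from its parts and rename
# the column whose current name is given as a string.
rename_rate_column <- function(data, old_name, comorbidity, country) {
  new_name <- paste("Standardised", comorbidity, "rate in", country)
  names(data)[names(data) == old_name] <- new_name
  data
}

rates <- data.frame(standardised_existing_cond_rate = 1.44)
renamed <- rename_rate_column(rates, "standardised_existing_cond_rate",
                              comorbidity = "asthma", country = "India")
```

Assigning through names() works the same way on a tibble, because the column names are ordinary strings in both cases.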

How can I use R stringr to leave only the gene name?

I have a large spreadsheet with 3200 observations that has a list of genes in a column. The column however has a bunch of junk that I don't need (example below). How can I use stringr to remove the unnecessary junk and leave only the gene name?
Example: The gene names are TEM-126 and ykkD.
gb|AY628199|+|203-1064|ARO:3000988|TEM-126
gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD

If your gene names are always at the tail of your strings, you can try the code below
> gsub(".*\\|","",v)
[1] "TEM-126" "ykkD"
DATA
v <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
"gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")

Using stringr:
str_split_fixed(genes, '\\|', n = 6)[, 6]

As you said you have those names in a column and it seems that the gene name is the last "word", you could easily do that using just two packages from tidyverse, dplyr and stringr.
library(dplyr)
library(stringr)
df <- tibble::tribble(
~Text,
"gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
"gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD"
)
df %>%
mutate(gene = word(Text, start = -1, end = -1, sep = "\\|"))
#> # A tibble: 2 x 2
#> Text gene
#> <chr> <chr>
#> 1 gb|AY628199|+|203-1064|ARO:3000988|TEM-126 TEM-126
#> 2 gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD ykkD

If you have a vector genevec of the full strings, you can split them all at once and take the sixth field:
stringr::str_split(pattern = "\\|", string = genevec, simplify = TRUE)[, 6]
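The fixed-index approaches above rely on every string having exactly six fields. If that is not guaranteed, a base R sketch that always takes the last field, whatever the field count, could look like this:

```r
v <- c("gb|AY628199|+|203-1064|ARO:3000988|TEM-126",
       "gb|AL009126|+|1376854-1377172|ARO:3003064|ykkD")

# fixed = TRUE treats "|" as a literal character, so no escaping is needed
parts <- strsplit(v, "|", fixed = TRUE)

# Take the final element of each split, regardless of how many fields exist
gene <- vapply(parts, function(p) p[length(p)], character(1))
```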

Extract class of each field in data.frame; summarize classes in new data.frame

I have a number of very similar .csv files that I want to check programmatically to determine whether their column types are the same.
Say I've imported a .csv as a data.frame and I want to check the column classes:
library(tidyverse)
test <- structure(list(Date = "6/15/2018", Time = structure(44255, class = c("hms",
"difftime"), units = "secs")), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"))
test
## A tibble: 1 x 2
# Date Time
# <chr> <time>
#1 6/15/2018 12:17
Checking the class of each column, I can see that the Time column has two classes:
map(test, class)
# $`Date`
# [1] "character"
# $Time
# [1] "hms" "difftime"
What I want is a data.frame that ideally would show:
Date Time
character hms, difftime
So that I can easily compare among different csvs.
I thought map_dfr or map_dfc might work but they return errors.
I also tried the following, but I haven't used summarize_all before and I can't get it to work:
test %>% data.frame() %>%
summarize_all(funs(paste0(collapse = ", ")))

You are very close. What is missing is that funs() needs you to mark where the column vector goes in the function call with a dot (.). So it would be:
test %>%
summarize_all(funs(paste0(class(.), collapse = ", ")))
However, funs() is soft-deprecated and throws a warning as of dplyr 0.8.0. Instead, you can use the formula notation like this:
library(tidyverse)
test <- structure(list(Date = "6/15/2018", Time = structure(44255, class = c("hms", "difftime"), units = "secs")), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
test %>%
summarise_all(~ class(.) %>% str_c(collapse = ", "))
#> # A tibble: 1 x 2
#> Date Time
#> <chr> <chr>
#> 1 character hms, difftime
If you want to try a purrr-style syntax, here is one way to get the result in long format with imap_dfr in one line. The function returns a one-row tibble for each column, and _dfr binds them into a data frame. (You could also have used gather to reshape the wide version.)
test %>%
imap_dfr(~ tibble(colname = .y, classes = class(.x) %>% str_c(collapse = ", ")))
#> # A tibble: 2 x 2
#> colname classes
#> <chr> <chr>
#> 1 Date character
#> 2 Time hms, difftime
Created on 2019-02-26 by the reprex package (v0.2.1)

You can use
lapply(test, function(x) paste0(class(x), collapse = ', ')) %>% data.frame()
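To compare several imported files, the per-column class strings can be collected into one table with a row per file. The sketch below uses base R only; the helper name class_signature and the two small data frames are invented for illustration and stand in for the imported csvs.

```r
# Hypothetical helper: one collapsed class string per column
class_signature <- function(d) {
  vapply(d, function(col) paste(class(col), collapse = ", "), character(1))
}

# Two made-up data frames standing in for imported csv files
a <- data.frame(Date = "6/15/2018", Count = 1L, stringsAsFactors = FALSE)
b <- data.frame(Date = as.Date("2018-06-15"), Count = 2L)

# One row per file, one column per field
sigs <- rbind(a = class_signature(a), b = class_signature(b))

# TRUE only when names and classes match in the same order
same <- identical(class_signature(a), class_signature(b))
```

rbind lines the vectors up by position, so this comparison assumes the files share the same column order.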
