R Extract partially matching string

R Extract partially matching string - r

I have a question about extracting a part of a string from several files that has these rows:
units = specified
- name 0 = prDM: Pressure, Digiquartz [db]
- name 1 = t090C: Temperature [ITS-90, deg C]
- name 2 = c0S/m: Conductivity [S/m]
- name 3 = t190C:Temperature, 2 [ITS-90, deg C]
- name 4 = c1S/m: Conductivity, 2 [S/m]
- name 5 = flSP: Fluorescence, Seapoint
- name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]
- name 7 = altM: Altimeter [m]
- name 8 = sal00: Salinity, Practical [PSU]
- name 9 = sal11: Salinity, Practical, 2 [PSU]
- span 0 = 1.000, 42.000
I need to extract only the information of the columns that start with "name" and extract everything between = and: .
For example, in the row "name 0 = prDM: Pressure, Digiquartz [db]" the desired result will be prDM.
Some files have different number of "name"rows (i.e. this example has 13 rows but other files has 16, and the number varies), so I want it to be as general as I can so I can allways extract the right strings independently the number of rows.Rows starts with # and a space before name.
I have tried this code but it only extract the first row. Can you please help me with this? Many thanks!
CNV<-NULL
for (i in 1:nro.files){
x <- readLines(all.files[i])
name.col<-grep("^\\# name", x)
df <- data.table::fread(text = x[name.col])
CNV[[i]]<-df
}

using stringr and the regex pattern "name \\d+ = (.*?):" which means in words "name followed by one or more digits followed by an equals sign followed by a space followed by a captured group containing any character (the period) zero or more times (the *) followed by a colon".
library(stringr)
strings <- c("name 0 = prDM: Pressure, Digiquartz [db]",
"name 1 = t090C: Temperature [ITS-90, deg C]",
"name 2 = c0S/m: Conductivity [S/m]",
"name 3 = t190C:Temperature, 2 [ITS-90, deg C]",
"name 4 = c1S/m: Conductivity, 2 [S/m]",
"name 5 = flSP: Fluorescence, Seapoint",
"name 6 = sbeox0ML/L: Oxygen, SBE 43 [ml/l]",
"name 7 = altM: Altimeter [m]",
"name 8 = sal00: Salinity, Practical [PSU]",
"name 9 = sal11: Salinity, Practical, 2 [PSU]")
result <- str_match(strings, "name \\d+ = (.*):")
result[,2]
[1] "prDM" "t090C" "c0S/m" "t190C" "c1S/m" "flSP" "sbeox0ML/L"
[8] "altM" "sal00" "sal11"
Or if you prefer base
pattern = "name \\d+ = (.*):"
result <- regmatches(strings, regexec(pattern, strings))
sapply(result, "[[", 2)
[1] "prDM" "t090C" "c0S/m" "t190C" "c1S/m" "flSP" "sbeox0ML/L"
[8] "altM" "sal00" "sal11"

Use str_extract from package stringr and positive lookahead and lookbehind:
str <- "name 0 = prDM: Pressure, Digiquartz [db]"
str_extract(str, "(?<== ).*(?=:)")
[1] "prDM"
Explanation:
(?<== )if you see =followed by white space on the left (lookbehind)
.* match anything until ...
(?=:)... you see a colon on the right (lookahead)

In Base R
test <- c("name 0 = prDM: Pressure, Digiquartz [db]","name 1 = t090C: Temperature [ITS-90, deg C]")
gsub("^name [0-9]+ = (.+):.+","\\1",test)
[1] "prDM" "t090C"
explanation
^name [0-9]+ Searches for a the beginning of a string ^ with name folowed by any length of number
= (.+): any length + of any character . found between = and : are stored ( ) to be later recalled by \\1

Related

How to manipulate digits in a character string in R?

I feel like I have a super easy question but for the life of me I can't find it when googling or searching here (or I don't know the correct terms to find a solution) so here goes.
I have a large amount of text in R in which I want to identify all numbers/digits, and add a specific number to them, for example 5.
So just as a small example, if this were my text:
text <- c("Hi. It is 6am. I want to leave at 7am")
I want the output to be:
> text
[1] "Hi. It is 11am. I want to leave at 12am"
But also I need the addition for each individual digit, so if this is the text:
text <- c("Hi. It is 2017. I am 35 years old.")
...I want the output to be:
> text
[1] "Hi. It is 75612. I am 810 years old."
I have tried 'grabbing' the numbers from the string and adding 5, but I don't know how to then get them back into the original string so I can get the full text back.
How should I go about this? Thanks in advance!

Here is how I would do the time. I would search for a number that is followed by am or pm and then sub in a math expression to be evaluated by gsubfn. This is pretty flexible, but would require whole hours in its current implementation. I added an am and pm if you wanted to swap those, but I didn't try to code in detecting if the number changes from am to pm. Also note that I didn't code in rolling from 12 to 1. If you add numbers over 12, you will get a number bigger than 12.
text1 <- c("Hi. It is 6am. I want to leave at 7am")
text2 <- c("It is 9am. I want to leave at 10am, but the cab comes at 11am. Can I push my flight to 12am?")
change_time <- function(text, hours, sign, am_pm){
string_change <- glue::glue("`(\\1{sign}{hours})`{am_pm}")
gsub("(\\d+)(?=am|pm)(am|pm)", string_change, text, perl = TRUE)|>
gsubfn::fn$c()
}
change_time(text = text1, hours = 5, sign = "+", am_pm = "am")
#> [1] "Hi. It is 11am. I want to leave at 12am"
change_time(text = text2, hours = 3, sign = "-", am_pm = "pm")
#> [1] "It is 6pm. I want to leave at 7pm, but the cab comes at 8pm. Can I push my flight to 9pm?"

text1 <- c("Hi. It is 2017. I am 35 years old.")
text2 <- c("Hi. It is 6am. I want to leave at 7am")
change_number <- function(text, change, sign){
string_change <- glue::glue("`(\\1{sign}{change})`")
gsub("(\\d)", string_change, text, perl = TRUE) %>%
gsubfn::fn$c() }
change_number(text = text1, change = 5, sign = "+")
#>[1] "Hi. It is 75612. I am 810 years old."
change_number(text = text2, change = 5, sign = "+")
#>[1] "Hi. It is 11am. I want to leave at 12am"
This works perfectly. Many thanks to #AndS., I tweaked (or rather, simplified) your code to fit my needs better. I was determined to figure out the other text myself haha, so thanks for showing me how!

Something quick and dirty with base R:
add_n = \(x, n, by_digit = FALSE) {
if (by_digit) ptrn = "[0-9]" else ptrn = "[0-9]+"
tmp = gregexpr(ptrn, x)
raw = regmatches(x, gregexpr(ptrn, x))
raw_plusn = lapply(raw, \(x) as.integer(x) + n)
for (i in seq_along(x)) regmatches(x[i], tmp[i]) = raw_plusn[i]
x
}
text = c(
"Hi. It is 6am. I want to leave at 7am",
"wow it's 505 dollars and 19 cents",
"Hi. It is 2017. I am 35 years old."
)
> add_n(text, 5)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 510 dollars and 24 cents"
# [3] "Hi. It is 2022. I am 40 years old."
> add_n(text, -2)
# [1] "Hi. It is 4am. I want to leave at 5am" "wow it's 503 dollars and 17 cents"
# [3] "Hi. It is 2015. I am 33 years old."
> add_n(text, 5, by_digit = TRUE)
# [1] "Hi. It is 11am. I want to leave at 12am"
# [2] "wow it's 10510 dollars and 614 cents"
# [3] "Hi. It is 75612. I am 810 years old."

Here's a tidyverse solution:
data.frame(text) %>%
# separate `text` into individual characters:
separate_rows(text, sep = "(?<!^)(?!$)") %>%
# add `5` to any digit:
mutate(
# if you detect a digit...
text = ifelse(str_detect(text, "\\d"),
# ... extract it, convert it to numeric, add `5`:
as.numeric(str_extract(text, "\\d")) + 5,
# ... else leave `text` as is:
text)
) %>%
# string the characters back together:
summarise(text = str_c(text, collapse = ""))
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 11am. I want to leave at 12am
Data 1:
text <- c("Hi. It is 6am. I want to leave at 7am")
Note that the same code works for the second text as well without any change:
# A tibble: 1 × 1
text
<chr>
1 Hi. It is 75612. I am 810 years old.
Data 2:
text <- c("Hi. It is 2017. I am 35 years old.")

Extracting gene name and ID number from a vector [duplicate]

This question already has answers here:
How do I separate a character column into two columns? [duplicate]
(2 answers)
Closed 1 year ago.
What gsub function can I use in R to get the gene name and the id number from a vector which looks like this?
head(colnames(cn), 20)
[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)"

1) Assuming the input s given in the Note at the end we can use read.table specifying that the fields are separated by ( and that ) is a comment character. We also strip white space around fields and give meaningful column names. No packages are used.
DF <- read.table(text = s, sep = "(", comment.char = ")",
strip.white = TRUE, col.names = c("Gene", "Id"))
DF
giving this data frame so DF$Gene is the genes and DF$Id is the id's.
Gene Id
1 A1BG 1
2 NAT2 10
3 ADA 100
4 CDH2 1000
5 AKT3 10000
6 GAGE12F 100008586
7 RNA5-8SN5 100008587
8 RNA18SN5 100008588
9 RNA28SN5 100008589
10 LINC02584 100009613
11 POU5F1P5 100009667
12 ZBTB11-AS1 100009676
13 MED6 10001
14 NR2E3 10002
15 NAALAD2 10003
16 DUXB 100033411
17 SNORD116-1 100033413
18 SNORD116-2 100033414
19 SNORD116-3 100033415
20 SNORD116-4 100033416
2) A variation of the above is to first remove the parentheses and then read it in giving the same result. Note that the second argument of chartr contains two spaces so that each parenthesis is translated to a space.
read.table(text = chartr("()", " ", s), col.names = c("Gene", "Id"))
Note
Lines <- '[1] "A1BG (1)" "NAT2 (10)" "ADA (100)" "CDH2 (1000)" "AKT3 (10000)" "GAGE12F (100008586)"
[7] "RNA5-8SN5 (100008587)" "RNA18SN5 (100008588)" "RNA28SN5 (100008589)" "LINC02584 (100009613)" "POU5F1P5 (100009667)" "ZBTB11-AS1 (100009676)"
[13] "MED6 (10001)" "NR2E3 (10002)" "NAALAD2 (10003)" "DUXB (100033411)" "SNORD116-1 (100033413)" "SNORD116-2 (100033414)"
[19] "SNORD116-3 (100033415)" "SNORD116-4 (100033416)" '
L <- Lines |>
textConnection() |>
readLines() |>
gsub(pattern = "\\[\\d+\\]", replacement = "")
s <- scan(text = L, what = "")
so s looks like this:
> dput(s)
c("A1BG (1)", "NAT2 (10)", "ADA (100)", "CDH2 (1000)", "AKT3 (10000)",
"GAGE12F (100008586)", "RNA5-8SN5 (100008587)", "RNA18SN5 (100008588)",
"RNA28SN5 (100008589)", "LINC02584 (100009613)", "POU5F1P5 (100009667)",
"ZBTB11-AS1 (100009676)", "MED6 (10001)", "NR2E3 (10002)", "NAALAD2 (10003)",
"DUXB (100033411)", "SNORD116-1 (100033413)", "SNORD116-2 (100033414)",
"SNORD116-3 (100033415)", "SNORD116-4 (100033416)")

First, in the future please share your data using the dput() command. See this for details.
Second, here is one solution for extracting the parts you need:
library(tidyverse)
g<-c("A1BG (1)","NAT2 (10)","ADA (100)" , "RNA18SN5 (100008588)", "RNA28SN5 (100008589)")
gnumber<-stringr::str_extract(g,"(?=\\().*?(?<=\\))")
gnumber
gname<-stringr::str_extract(g, "[:alpha:]+")
gname
# or, to get the whole first word:
gname<-stringr::word(g,1,1)
gname

Count the number of keywords based on a list

I have got a dataset called colours, in which I am interested to find some keywords based on a list (colouryellow , colourblue, colourwhite) that I have created. This is an example of the dataset:
USER
MESSAGE
23456
The colouryellow is very bright!
31245
Most girls like colourpink
99999
I am having a break
9877
The colouryellow is like the sun
Is there a way where I can obtain the number of times each keywords based on the list appear on the column MESSAGE?
For example, the output would be like:
Keyword
Frequency of Keywords
colouryellow
2
colourblue
0
colourwhite
0
I have tried the following code but it does not provide me the frequency for each keyword, instead displays them together.
colour= read.csv("C: xxxxxx")
keywordcount= dplyr::filter(colour, grepl("colouryellow|colourblue|colourwhite, MESSAGE))
Thank you in advance.

Some things you can do.
some_colours <- c("colouryellow", "colourblue", "colourwhite")
some_col_regex <- paste0("\\b(", paste(some_colours, collapse = "|"), ")\\b")
grepl(some_col_regex, colour$MESSAGE)
# [1] TRUE FALSE FALSE TRUE
lengths(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE)))
# [1] 1 0 0 1
table(unlist(regmatches(colour$MESSAGE, gregexpr(some_col_regex, colour$MESSAGE))))
# colouryellow
# 2
Data
colour <- structure(list(USER = c(23456L, 31245L, 99999L, 9877L), MESSAGE = c("The colouryellow is very bright!", "Most girls like colourpink", "I am having a break", "The colouryellow is like the sun")), class = "data.frame", row.names = c(NA, -4L))

How to separate one colmn into multiple based on a character vector of delimiters

I have a dataframe which consists of one column. I would like to separate the text into seperate columns based on a vector of delimiters.
Input:
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"
The intended output:
structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"),
HospitalNumber = " 233456 ", PatientName = " Jonny Begood",
DOB = " 13/01/77 ", GeneralPractitioner = NA_character_,
Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ",
Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ",
Histology = " These show chronic reflux and other bits n bobs\n ",
Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName",
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails",
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")
I was keen to use the separate function from tidyr but can't quite figure it out so that it will separate according to a list of delimiters
The list would be:
mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")
I then tried:
Mypath %>% separate(Mypath, mywords)
But I amd clearly mis-understanding the function which I guess can't take a list of delimiters
Error: `var` must evaluate to a single number or a column name, not a list
Is there a simple way of doing this using tidyr (or csplit with a list or any other way for that matter)

Maybe make sure that it's like a dcf file, and you can use read.dcf:
Notice that "mywords" is a little different from yours. I've added colons to "Hospital Number" and "Patient Name".
mywords<-c("Hospital Number:","Patient Name:","DOB:","General Practitioner:",
"Date of Procedure:","Clinical Details:","Macroscopic description:",
"Histology:","Diagnosis:")
Convert the relevant column to character, add a colon after "Hospital Number".
Mypath$PathReportWhole <- as.character(Mypath$PathReportWhole)
Mypath$PathReportWhole <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)
Make it such that each key: value pair is on its own line.
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", Mypath$PathReportWhole)
Use read.dcf to read it in:
out <- read.dcf(textConnection(temp))
Here's some sample data that makes it easier to see the resulting structure:
example <- c("var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
"var 1 xyz var 2: more text here var 5: not all values are there")
example <- data.frame(report = example)
example
# report
# 1 var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here
# 2 var 1 xyz var 2: more text here var 5: not all values are there
And, going through the same steps:
mywords <- c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:")
temp <- as.character(example$report)
temp <- gsub("var 1", "var 1:", temp)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)
read.dcf(textConnection(temp))
# var 1 var 2 var 3 var 4 var 5
# [1,] "abc" "some, text" "112" "value" "even more here"
# [2,] "xyz" "more text here" NA NA "not all values are there"
read.dcf(textConnection(temp), fields = c("var 1", "var 3", "var 5"))
# var 1 var 3 var 5
# [1,] "abc" "112" "even more here"
# [2,] "xyz" NA "not all values are there"

indented unordered list to nested list()

I've got a log file that looks as follows:
Data:
+datadir=/data/2017-11-22
+Nusers=5292
Parameters:
+outdir=/data/2017-11-22/out
+K=20
+IC=179
+ICgroups=3
-group 1: 1-1
ICeffects: 1-5
-group 2: 2-173
ICeffects: 6-10
-group 3: 175-179
ICeffects: 11-15
I would like to parse this logfile into a nested list using R so that the result will look like this:
result <- list(Data = list(datadir = '/data/2017-11-22',
Nusers = 5292),
Parameters = list(outdir = '/data/2017-11-22/out',
K = 20,
IC = 179,
ICgroups = list(list('group 1' = '1-1',
ICeffects = '1-5'),
list('group 2' = '2-173',
ICeffects = '6-10'),
list('group 1' = '175-179',
ICeffects = '11-15'))))
Is there a not-extremely-painful way of doing this?

Disclaimer: This is messy. There is no guarantee that this will work for larger/different files without some tweaking. You will need to do some careful checking.
The key idea here is to reformat the raw data, to make it consistent with the YAML format, and then use yaml::yaml.load to parse the data to produce a nested list.
By the way, this is an excellent example on why one really should use a common markup language for log-output/config files (like JSON, YAML, etc.)...
I assume you read in the log file using readLines to produce the vector of strings ss.
# Sample data
ss <- c(
"Data:",
" +datadir=/data/2017-11-22",
" +Nusers=5292",
"Parameters:",
" +outdir=/data/2017-11-22/out",
" +K=20",
" +IC=179",
" +ICgroups=3",
" -group 1: 1-1",
" ICeffects: 1-5",
" -group 2: 2-173",
" ICeffects: 6-10",
" -group 3: 175-179",
" ICeffects: 11-15")
We then reformat the data to adhere to the YAML format.
# Reformat to adhere to YAML formatting
ss <- gsub("\\+", "- ", ss); # Replace "+" with "- "
ss <- gsub("ICgroups=\\d+","ICgroups:", ss); # Replace "ICgroups=3" with "ICgroups:"
ss <- gsub("=", " : ", ss); # Replace "=" with ": "
ss <- gsub("-group", "- group", ss); # Replace "-group" with "- group"
ss <- gsub("ICeffects", " ICeffects", ss); # Replace "ICeffects" with " ICeffects"
Note that – consistent with your expected output – the value 3 from ICgroups doesn't get used, and we need to replace ICgroups=3 with ICgroups: to initiate a nested sub-list. This was the part that threw me off first...
Loading & parsing the YAML string then produces a nested list.
require(yaml);
lst <- yaml.load(paste(ss, collapse = "\n"));
lst;
#$Data
#$Data[[1]]
#$Data[[1]]$datadir
#[1] "/data/2017-11-22"
#
#
#$Data[[2]]
#$Data[[2]]$Nusers
#[1] 5292
#
#
#
#$Parameters
#$Parameters[[1]]
#$Parameters[[1]]$outdir
#[1] "/data/2017-11-22/out"
#
#
#$Parameters[[2]]
#$Parameters[[2]]$K
#[1] 20
#
#
#$Parameters[[3]]
#$Parameters[[3]]$IC
#[1] 179
#
#
#$Parameters[[4]]
#$Parameters[[4]]$ICgroups
#$Parameters[[4]]$ICgroups[[1]]
#$Parameters[[4]]$ICgroups[[1]]$`group 1`
#[1] "1-1"
#
#$Parameters[[4]]$ICgroups[[1]]$ICeffects
#[1] "1-5"
#
#
#$Parameters[[4]]$ICgroups[[2]]
#$Parameters[[4]]$ICgroups[[2]]$`group 2`
#[1] "2-173"
#
#$Parameters[[4]]$ICgroups[[2]]$ICeffects
#[1] "6-10"
#
#
#$Parameters[[4]]$ICgroups[[3]]
#$Parameters[[4]]$ICgroups[[3]]$`group 3`
#[1] "175-179"
#
#$Parameters[[4]]$ICgroups[[3]]$ICeffects
#[1] "11-15"
PS. You will need to test this on larger files, and make changes to the substitution if necessary.

Categories

HOME

functional-programming

postfix-mta

setup-project

artifactory

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R Extract partially matching string - r

Related

How to manipulate digits in a character string in R?

Extracting gene name and ID number from a vector [duplicate]

Count the number of keywords based on a list

How to separate one colmn into multiple based on a character vector of delimiters

indented unordered list to nested list()

Categories

Resources