Regular Expression in R (removing spacing and punctuation characters)

Suppose I have the following text:
text = c("Initial [kHz]","Initial Value [dB]",
"Min Accept X [kHz]","Min Accept [dB]",
"Cut-Off Frequency [kHz]",
"Min Bandwidth Limit [kHz]","y min [dB]",
"Max Bandwidth Limit [kHz]","y max [dB]",
"Iter: 1 [kHz]","Iter: 1","Value: 55 [dB]",
"Iter: 2 [kHz]","Iter: 2","Value: 59 [dB]")
But what I want is the following (with the spacing and punctuation characters removed):
text = c("InitialkHz","InitialValuedB",
"MinAcceptXkHz","MinAcceptdB",
"CutOffFrequencykHz",
"MinBandwidthLimitkHz","ymindB",
"MaxBandwidthLimitkHz","ymaxdB",
"Iter1kHz","Iter1","Value55dB",
"Iter2kHz","Iter2","Value59dB")
Can anyone help me? Please...

You can choose to keep only alphanumeric characters like this:
gsub('[^[:alnum:]]', '', text)

We can use gsub to remove all punctuation and spaces from text.
gsub("[[:punct:]]| ", "", text)
# [1] "InitialkHz" "InitialValuedB" "MinAcceptXkHz"
# [4] "MinAcceptdB" "CutOffFrequencykHz" "MinBandwidthLimitkHz"
# [7] "ymindB" "MaxBandwidthLimitkHz" "ymaxdB"
#[10] "Iter1kHz" "Iter1" "Value55dB"
#[13] "Iter2kHz" "Iter2" "Value59dB"
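As a quick check, the two answers agree on this input; a base-R sketch using a subset of the vector:

```r
text <- c("Initial [kHz]", "Cut-Off Frequency [kHz]", "Value: 55 [dB]")
a <- gsub("[^[:alnum:]]", "", text)   # keep only alphanumeric characters
b <- gsub("[[:punct:]]| ", "", text)  # drop punctuation and spaces
identical(a, b)
# [1] TRUE
```

Both rely on POSIX character classes; `[:punct:]` covers the brackets, colons, and hyphens here, so the only practical difference is how each pattern would treat characters that are neither alphanumeric, punctuation, nor a plain space.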

Related

Use regex to extract mixed fraction and text that may also contain mixed fractions with R (stringr)

Please see below for a sample of the data I am working with; I have 100K+ entries in total.
Note that the ... in the comments under UNIT is just to make them fit. For example, the full UNIT text for the first item is 4- to 5-mm-diameter and for the fifth item is 3 1/2- to 4-inch-diameter, etc.
library(tidyverse)
#i QTY UNIT
parts <- c("6 4- to 5-mm-diameter plugs", #1 6 4- to...diameter
"6 large bricks", #2 6 large
"1 1/3 shipment concrete", #3 1.33 shipment
"1 (14- to 15-oz) gold bars", #4 1 (14- to 15-oz)
"16 3 1/2- to 4-inch-diameter caps", #5 16 3 1/2- to...eter
"1 1/2 tons sand", #6 1.5 tons
"2 1 1/4- to 3-inch diameter caps", #7 2 1 1/4- to...eter
"1/3 shipment cement") #8 .333 shipment
I've had some moderate success working from some of the answers on SO but I run into problems when the UNIT text also contains mixed fractions as in items 1 and 5:
# Goal: extract QTY as mixed frac
parts %>%
str_extract("(\\d+[\\/\\d[ ]?]*|\\d*)")
# i=1, 5 broken
#[1] "6 4" "6 " "1 1/3 " "1 " "16 3 1/2" "1 1/2 "
# Goal: extract UNIT word
parts %>%
str_extract("[[:graph:]]{3,11}|[- to ].{5,21}")
# all i with some problem
# [1] " 4- to 5-mm-diameter p" " large bricks" " 1/3 shipment concrete"
# [4] " (14- to 15-oz) gold b" " 3 1/2- to 4-inch-diam" " 1/2 tons sand"
My goal is to extract QTY and UNIT as shown in the code comments, from first to last entry: 6, 6, 1 1/3, 1, 16, 1 1/2, 2, 1/3. In addition, I am trying to pull out the text under UNIT (abbreviated in the comments just so it fits in the code section); here it is in full: 4- to 5-mm-diameter, large, shipment, (14- to 15-oz), 3 1/2- to 4-inch-diameter, tons, 1 1/4- to 3-inch diameter, shipment.
My intuition suggests I should do this in two steps, but please let me know if there are better ways to achieve this.
Thank you.
Edit: added a critical example, number 8.
You may use
m <- str_match(parts, '^(\\d+(?:\\s+\\d+/\\d+)?|\\d+/\\d+)\\s+((?:\\d+(?:-?in(?:ch)?|")?\\s+)*\\S+(?:\\s+to\\s+(?:\\d+(?:-?in(?:ch)?|")?\\s+)*\\S+)?)')
qty <- m[,2]
# => [1] "6" "6" "1 1/3" "1" "16" "1 1/2" "2" "1/3"
unit <- m[,3]
# => [1] "4- to 5-mm-diameter" "large"
# [3] "shipment" "(14- to 15-oz)"
# [5] "3 1/2- to 4-inch-diameter" "tons"
# [7] "1 1/4- to 3-inch diameter" "shipment"
Details:
^ - start of string
(\d+(?:\s+\d+/\d+)?|\d+/\d+) - Group 1 (m[,2]): one or more digits, optionally followed by one or more whitespaces, one or more digits, / and one or more digits; or a fraction of the form digits/digits
\s+ - one or more whitespaces
((?:\d+(?:-?in(?:ch)?|")?\s+)*\S+(?:\s+to\s+(?:\d+(?:-?in(?:ch)?|")?\s+)*\S+)?) - Group 2 (m[,3]):
(?:\d+(?:-?in(?:ch)?|")?\s+)* - zero or more occurrences of one or more digits followed with an optional occurrence of a " or an optional -, in and then an optional ch substring and then one or more whitespaces
\S+ - one or more chars other than whitespace (a "word")
(?:\s+to\s+(?:\d+(?:-?in(?:ch)?|")?\s+)*\S+)? - an optional occurrence of:
\s+to\s+ - to surrounded by one or more whitespaces
(?:\d+(?:-?in(?:ch)?|")?\s+)* - see above
\S+ - one or more chars other than whitespace.
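If the extracted QTY strings then need to be numeric (as the 1.33 / 1.5 / .333 comments in the data suggest), a small base-R helper can convert them. qty_to_num is a hypothetical name, and the sketch assumes QTY takes only the forms "n", "a/b", or "n a/b":

```r
# Hypothetical helper: convert mixed-fraction strings such as "1 1/3" to numeric.
qty_to_num <- function(q) {
  vapply(strsplit(q, "\\s+"), function(parts) {
    # Each whitespace-separated part is either an integer or a fraction a/b;
    # sum the parts so "1 1/3" becomes 1 + 1/3.
    sum(vapply(parts, function(p) {
      if (grepl("/", p)) {
        nd <- as.numeric(strsplit(p, "/")[[1]])
        nd[1] / nd[2]
      } else {
        as.numeric(p)
      }
    }, numeric(1)))
  }, numeric(1))
}

qty_to_num(c("6", "1 1/3", "1/3"))  # 6, 1.3333..., 0.3333...
```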

Extract title from multiple lines

I have multiple files each one has a different title, I want to extract the title name from each file. Here is an example of one file
[1] "<START" "ID=\"CMP-001\"" "NO=\"1\">"
[4] "<NAME>Plasma-derived" "vaccine" "(PDV)"
[7] "versus" "placebo" "by"
[10] "intramuscular" "route</NAME>" "<DIC"
[13] "CHI2=\"3.6385\"" "CI_END=\"0.6042\"" "CI_START=\"0.3425\""
[16] "CI_STUDY=\"95\"" "CI_TOTAL=\"95\"" "DF=\"3.0\""
[19] "TOTAL_1=\"0.6648\"" "TOTAL_2=\"0.50487622\"" "BLE=\"YES\""
.
.
.
[789] "TOTAL_2=\"39\"" "WEIGHT=\"300.0\"" "Z=\"1.5443\">"
[792] "<NAME>Local" "adverse" "events"
[795] "after" "each" "injection"
[798] "of" "vaccine</NAME>" "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
[801] "</GROUP_LABEL_2>" "<GRAPH_LABEL_1>" "PDV</GRAPH_LABEL_1>"
the extracted expected title is
Plasma-derived vaccine (PDV) versus placebo by intramuscular route
Note that the title length differs from file to file.
Here is a solution using stringr. This first collapses the vector into one long string, and then captures all characters (other than a newline \n) between every pair of "<NAME>" and "</NAME>". In the future, people will be able to help you more easily if you make a reproducible example (e.g., using dput()). Hope this helps!
Note: if you just want the first title, you can use str_match() instead of str_match_all().
library(stringr)
str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine"
Data:
string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
"TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")
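A base-R equivalent of the stringr call, shown on a short excerpt of the data so it stands alone:

```r
# Short excerpt of the question's `string` vector
string <- c("<NAME>Plasma-derived", "vaccine", "(PDV)</NAME>", "<DIC",
            "<NAME>Local", "adverse", "events</NAME>")
joined <- paste(string, collapse = " ")
# Lazily match each <NAME>...</NAME> pair, then strip the tags
tagged <- regmatches(joined, gregexpr("<NAME>.*?</NAME>", joined, perl = TRUE))[[1]]
titles <- gsub("</?NAME>", "", tagged)
titles
# [1] "Plasma-derived vaccine (PDV)" "Local adverse events"
```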

Regex: Capturing Numbers at Beginning and Negating Numbers After Characters

I need to capture the 3.93, 4.63999..., and -5.35. I've tried all kinds of variations, but have been unable to grab the correct set of numbers.
Copay: 20.30
3.93
TAB 8.6MG Qty:60
4.6399999999999997
-5.35
2,000UNIT TAB Qty:30
AMOUNT
Qty:180
CAP 4MG
x = c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG");
grep("^[\\-]?\\d+[\\.]?\\d+$", x);
Output (see ?grep):
[1] 2 4 5
If leading/trailing spaces are allowed, change the regex to
"^\\s*[\\-]?\\d+[\\.]?\\d+\\s*$"
Try this
S <- c("Copay: 20.30", "3.93", "TAB 8.6MG Qty:60", "4.6399999999999997", "-5.35", "2,000UNIT TAB Qty:30", "AMOUNT", "Qty:180", "CAP 4MG")
library(stringr)
ans <- str_extract_all(S, "-?[[:digit:]]*(\\.|,)?[[:digit:]]+", simplify=TRUE)
clean <- ans[ans!=""]
Output
[1] "20.30" "3.93" "8.6"
[4] "4.6399999999999997" "-5.35" "2,000"
[7] "180" "4" "60"
[10] "30"

Ambiguity while using readLines() in R

The first line of my dataset contains the names of the columns.
It looks like this:
#"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude","Datum","Parameter Name","Sample Duration","Pollutant Standard","Metric Used","Method Name","Year","Units of Measure","Event Type","Observation Count","Observation Percent","Completeness Indicator","Valid Day Count","Required Day Count","Exceptional Data Count","Null Data Count","Primary Exceedance Count","Secondary Exceedance Count","Certification Indicator","Num Obs Below MDL","Arithmetic Mean","Arithmetic Standard Dev","1st Max Value","1st Max DateTime","2nd Max Value","2nd Max DateTime","3rd Max Value","3rd Max DateTime","4th Max Value","4th Max DateTime","1st Max Non Overlapping Value","1st NO Max DateTime","2nd Max Non Overlapping Value","2nd NO Max DateTime","99th Percentile","98th Percentile","95th Percentile","90th Percentile","75th Percentile","50th Percentile","10th Percentile","Local Site Name","Address","State Name","County Name","City Name","CBSA Name","Date of Last Change"
It is a csv file.
Since I am using Windows, I wrote
pm0 <-read.csv("C:/Users/Ad/Desktop/EDA/2010.csv",
comment.char="#", header=FALSE, sep=",", na.strings="")
to read this csv file except the first line. Now I want to read the first line so that I can use it to set the column names of my generated data frame. For this I wrote:
cnames<-readLines("C:/Users/Ad/Desktop/EDA/2010.csv",1)
But when I print cnames I get this --
[1] "\"State Code\",\"County Code\",\"Site Num\",\"Parameter Code\",\"POC\",\"Latitude\",\"Longitude\",\"Datum\",\"Parameter Name\",\"Sample Duration\",\"Pollutant Standard\",\"Metric Used\",\"Method Name\",\"Year\",\"Units of Measure\",\"Event Type\",\"Observation Count\",\"Observation Percent\".
I don't understand why \ appears at the start and end of every element of cnames.
Can someone help me remove this?
This is from the Exploratory Data Analysis (EDA) assignment on Coursera, right? I trust you are compliant with the honor code.
What you have in 'cnames' is ONE string enclosed in double quotes, within which the backslash character escapes the other quotation marks.
To get around this, try:
cnames1 <- strsplit(cnames, ",")
gsub("[\"]", "", cnames1[[1]], perl=TRUE)
This gives an array of names.
[1] "State Code" "County Code" "Site Num"
[4] "Parameter Code" "POC" "Latitude"
[7] "Longitude" "Datum" "Parameter Name"
[10] "Sample Duration" "Pollutant Standard" "Metric Used"
[13] "Method Name" "Year" "Units of Measure"
[16] "Event Type" "Observation Count" "Observation Percent"
What I did is this:
pm0<-read.csv("C:/Users/Ad/Desktop/EDA/2010.csv",comment.char="#",header=TRUE,sep=",",na.strings="")
Now the object pm0 contains the first row of csv file as the column names.
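Putting the strsplit/gsub steps together, a minimal sketch; the cnames string below stands in for the first line returned by readLines(), since the question's file isn't available here:

```r
# Simulated first line, as readLines() would return it for a quoted CSV header
cnames <- "\"State Code\",\"County Code\",\"Site Num\""
# Split on commas, then strip the literal double quotes
nm <- gsub("\"", "", strsplit(cnames, ",")[[1]])
nm  # "State Code" "County Code" "Site Num"
# names(pm0) <- nm  # pm0 as read with header=FALSE in the question
```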

Decimal places get rounded in line by line output with cat()

When I print a vector in R line by line with cat(), results are rounded differently than in the usual output:
> dbinom(0:10, 10, 0.95)
[1] 9.765625e-14 1.855469e-11 1.586426e-09 8.037891e-08 2.672599e-06
[6] 6.093525e-05 9.648081e-04 1.047506e-02 7.463480e-02 3.151247e-01
[11] 5.987369e-01
> options(scipen=999)
> dbinom(0:10, 10, 0.95)
[1] 0.00000000000009765625 0.00000000001855468750 0.00000000158642578125
[4] 0.00000008037890625000 0.00000267259863281252 0.00006093524882812524
[7] 0.00096480810644531680 0.01047505944140628489 0.07463479852001964066
[10] 0.31512470486230492739 0.59873693923837867370
> cat(dbinom(0:10, 10, 0.95), sep = "\n")
0.00000000000009765625
0.00000000001855469
0.000000001586426
0.00000008037891
0.000002672599
0.00006093525
0.0009648081
0.01047506
0.0746348
0.3151247
0.5987369
How can I preserve the decimal places?
Try this using sprintf:
> cat(sprintf("%.20f", dbinom(0:10, 10, 0.95)),sep="\n")
0.00000000000009765625
0.00000000001855468750
0.00000000158642578125
0.00000008037890625000
0.00000267259863281251
0.00006093524882812514
0.00096480810644531680
0.01047505944140628489
0.07463479852001966841
0.31512470486230481637
0.59873693923837867370
I should also mention that any precision beyond 15 digits is probably spurious, given floating-point arithmetic. Notice that 0.31512470486230492739 in your data and 0.31512470486230481637 in mine don't match beyond 15 digits.
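As an aside, formatC from base R produces the same fixed-notation strings as the sprintf call; a sketch:

```r
x <- dbinom(0:10, 10, 0.95)
# format = "f" forces fixed notation; digits = number of decimal places
cat(formatC(x, format = "f", digits = 20), sep = "\n")
```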