Extracting a number of a string of varying lengths [duplicate] - r

This question already has answers here:
Extracting numbers from vectors of strings
(12 answers)
Closed 6 years ago.
Pretend I have a vector:
testVector <- c("I have 10 cars", "6 cars", "You have 4 cars", "15 cars")
Is there a way to go about parsing this vector, so I can store just the numerical values:
10, 6, 4, 15
If the problem were just "15 cars" and "6 cars", I know how to parse that, but I'm having difficulty with the strings that have text in front too! Any help is greatly appreciated.

For this common task, tidyr has a helper function called extract_numeric (later tidyr releases deprecate it in favour of readr::parse_number()):
library(tidyr)
extract_numeric(testVector)
## [1] 10 6 4 15

We can use str_extract with the pattern \\d+, which matches one or more digits. It can also be written as [0-9]+.
library(stringr)
as.numeric(str_extract(testVector, "\\d+"))
#[1] 10 6 4 15
If a string contains multiple numbers, use str_extract_all, which returns a list.
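As a rough base R sketch of what a list result looks like when strings hold several numbers: the input vector x below is made up for illustration, and gregexpr() with regmatches() stands in for str_extract_all here.

```r
# Hypothetical input: one string with two numbers, one with a single number, one with none
x <- c("I have 10 cars and 2 bikes", "6 cars", "no numbers here")

# gregexpr() locates every run of digits; regmatches() returns one list element per string
all_nums <- lapply(regmatches(x, gregexpr("[0-9]+", x)), as.numeric)

# all_nums[[1]] holds 10 and 2, all_nums[[2]] holds 6, all_nums[[3]] is empty
print(all_nums)
```

Strings without any digits yield a zero-length element, so the list keeps the same length as the input.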
This can also be done in base R (no external packages needed):
as.numeric(regmatches(testVector, regexpr("\\d+", testVector)))
#[1] 10 6 4 15
Or using gsub from base R
as.numeric(gsub("\\D+", "", testVector))
#[1] 10 6 4 15
Incidentally, some helper functions are just thin wrappers around gsub. Here is the source of extract_numeric:
function (x)
{
    as.numeric(gsub("[^0-9.-]+", "", as.character(x)))
}
So, if we need a function, we can create one (without using any external packages)
ext_num <- function(x) {
    as.numeric(gsub("\\D+", "", x))
}
ext_num(testVector)
#[1] 10 6 4 15
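The two gsub patterns above behave differently once the input has signs, decimals, or more than one number per string. The following is only a sketch with a made-up vector x, comparing both patterns with a regexpr-based extraction of the first number:

```r
# Hypothetical input with a negative decimal and a string holding two numbers
x <- c("-3.5 degrees", "room 12, floor 3", "15 cars")

# Removing every non-digit drops the sign and decimal point and glues numbers together
digits_only <- as.numeric(gsub("\\D+", "", x))          # 35 123 15

# Keeping digits, "." and "-" preserves -3.5 but still glues 12 and 3 into 123
keep_sign_dot <- as.numeric(gsub("[^0-9.-]+", "", x))   # -3.5 123 15

# Extracting only the first number avoids the gluing
first_number <- as.numeric(
  regmatches(x, regexpr("-?[0-9]+(\\.[0-9]+)?", x))
)                                                       # -3.5 12 15
```

Note that regmatches() with regexpr() silently drops strings that have no match, so the result can be shorter than the input.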

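This might also come in handy. Note that letters are matched with [[:alpha:]]; a class such as [:A-z:] would also match ":" and the characters between "Z" and "a" ([, \, ], ^, _ and `).
testVector <- gsub("[[:alpha:]]", "", testVector)  # remove letters
testVector <- gsub(" ", "", testVector)            # remove spaces
testVector
[1] "10" "6" "4" "15"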

Related

extract number only after specific word by skipping other word in between it

I'm looking for a two-digit number that comes before the word "years" and a seven- or eight-digit number that comes after the word "years." An example of the data is shown below.
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
data <- as.list(data)
I tried this approach and was successful in getting two digit numbers before the word "years" :
stringr::str_extract_all(data,regex(".\\d{2}\\s(?:year)"))
I also tried this method to get the number after word "years" :
str_extract_all(data,regex(".\\d{2}\\s(?:year).\\d{7,8}"))
I managed to get the number that appear directly after the word years :
" 57 year 7654321"
However, I was unsuccessful in getting eight digit numbers following the word "years" that included other characters in between the number and the word "years".
How can I retrieve the number only after the word "years" by skipping this other word/character?
I really appreciate your help
We may use str_replace_all to collapse the non-digit text immediately before and after 'years' into a single space, and then extract the digits on either side together with 'years':
library(stringr)
str_extract_all(str_replace_all(data,
"(?<=years)\\D+|(\\D+)(?=years)", " "), "\\d{2}\\s+years\\s+\\d{7,8}")[[1]]
[1] "45 years 12345678" "57 years 7654321"
Another option is to capture the digits, along with the 'years' substring, using str_match_all and then paste the pieces together:
library(purrr)
library(dplyr)
str_match_all(data, "(\\d{2})\\D+(years)\\D+(\\d{7,8})")[[1]][,-1] %>%
as.data.frame %>%
invoke(str_c, sep =" ", .)
[1] "45 years 12345678" "57 years 7654321"
data
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"
Here is a base R approach:
Create a list with strsplit, splitting on ",".
Define a function my_func that takes a string, finds the number before "year" and the number after it, and pastes them together.
Use lapply to apply the function to the list.
Use toString() to get the expected output.
my_list <- strsplit(data, ",")
my_func <- function(x){
  a <- as.integer(sub(".*?(\\d+)\\s*year.*", "\\1", x))
  b <- as.integer(sub(".*?year.*?(\\d+).*", "\\1", x))
  paste(a, "year", b)
}
result <- lapply(my_list, my_func)
lapply(result, toString)
Output:
[[1]]
[1] "45 year 12345678, 57 year 7654321"
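If the two numbers are wanted as separate columns rather than as one pasted string, capture groups can be pulled out with regexec(). This is a sketch only: it assumes records are comma-separated as in the sample data, and the names parts, m and res are invented for the example.

```r
data <- "mr john is 45 years old his number is 12345678, mr doe is 57 years 7654321"

# Split into one record per person (assumes "," separates records)
parts <- trimws(strsplit(data, ",", fixed = TRUE)[[1]])

# Group 1: two digits before "years"; group 2: 7-8 digits after it,
# skipping any non-digit text in between
m <- regmatches(parts, regexec("(\\d{2})\\s+years\\D*(\\d{7,8})", parts))

# Each element of m is c(full match, group 1, group 2)
res <- do.call(rbind, lapply(m, function(g) {
  data.frame(age = as.integer(g[2]), number = g[3], stringsAsFactors = FALSE)
}))
print(res)
```

The number is kept as a character column so that any leading zeros would survive.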

plot function type

For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use [[:punct:]] to detect punctuation. This matches
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either with gregexpr
x = c("we are friends!, Good Friends!!")
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often each one appears
counts = stringi::stri_count_fixed(x, punct)
Name the counts after their punctuation marks
setNames(counts, punct)
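A base R sketch of the same per-mark count, without stringi: split the string into single characters, keep those matching [[:punct:]], and tabulate them. The names chars, punct_chars and counts_tbl are made up for the example.

```r
x <- "we are friends!, Good Friends!!"

# One element per character
chars <- strsplit(x, "", fixed = TRUE)[[1]]

# Keep only the punctuation characters
punct_chars <- chars[grepl("[[:punct:]]", chars)]

# Tabulate how often each punctuation mark occurs
counts_tbl <- table(punct_chars)
print(counts_tbl)
```

Unlike the fixed list of marks above, this only reports marks that actually occur in the string.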
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.

Convert Dollar Data from Character to Numeric

How would I change a column that has character data in the format "33 dollars 14 cents" to numeric data formatted like "33.14"?
Thanks for any help!
You can use the stringr library to extract the numeric components and then paste them together. This assumes that there are always only two numbers for the format you are looking for.
library(stringr)
s <- c("33 dollars 14 cents", "35 dollars 50 cents")
sapply(str_extract_all(s,"\\d+"), function(x) paste(x, collapse = "."))
[1] "33.14" "35.50"
You may use sub
x <- "33 dollars 14 cents"
as.numeric(sub("^(\\d+)\\s+dollars\\s+(\\d+)\\s+cents$", "\\1.\\2", x))
# [1] 33.14
as.numeric(sub("^(\\d+).*?(\\d+).*", "\\1.\\2", x))
# [1] 33.14
or
as.numeric(paste(str_extract_all(x, "\\d+")[[1]], collapse="."))
# [1] 33.14
Assuming your data is all the same format you can use gsub().
This is clumsy but it works:
as.numeric(gsub(" cents","",gsub(" dollars ",".",data)))
It's always worthwhile to write a simple function when a task needs several small steps. Here's an inelegant example that's easy to read:
numerify <- function(x) {      # convert a string like "33 dollars 14 cents" to numeric 33.14
  x <- gsub('[a-z]', '', x)    # remove letters
  x <- gsub(' $', '', x)       # remove trailing space
  x <- gsub(' +', '.', x)      # replace the remaining run of spaces with a decimal point
  return(as.numeric(x))        # convert to numeric
}
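One caveat with pasting the two numbers around a decimal point: a single-digit cent amount such as "35 dollars 5 cents" becomes 35.5 rather than 35.05. A sketch that treats cents as hundredths instead is shown below; the function name dollars_to_numeric and the assumption that the text is always "<n> dollar(s) <m> cent(s)" are illustrative, not from the answers above.

```r
# Hypothetical helper: parse "<n> dollar(s) <m> cent(s)" and return n + m / 100
dollars_to_numeric <- function(x) {
  pattern <- "^\\s*(\\d+)\\s+dollars?\\s+(\\d+)\\s+cents?\\s*$"
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(g) {
    if (length(g) != 3) return(NA_real_)       # string did not match the expected format
    as.numeric(g[2]) + as.numeric(g[3]) / 100  # cents are hundredths of a dollar
  }, numeric(1))
}

amounts <- dollars_to_numeric(
  c("33 dollars 14 cents", "35 dollars 5 cents", "1 dollar 1 cent", "n/a")
)
# values: 33.14, 35.05, 1.01, NA
```

Strings that do not fit the pattern come back as NA instead of producing a silently wrong number.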

Capitalizing letters. R equivalent of excel "PROPER" function [duplicate]

This question already has answers here:
Capitalize the first letter of both words in a two word string
(15 answers)
Closed 6 years ago.
Colleagues,
I'm looking at a data frame resembling the extract below:
Month Provider Items
January CofCom 25
july CofCom 331
march vobix 12
May vobix 0
I would like to capitalise first letter of each word and lower the remaining letters for each word. This would result in the data frame resembling the one below:
Month Provider Items
January Cofcom 25
July Cofcom 331
March Vobix 12
May Vobix 0
In short, I'm looking for R's equivalent of the PROPER function available in MS Excel.
With regular expressions:
x <- c('woRd Word', 'Word', 'word words')
gsub("(?<=\\b)([a-z])", "\\U\\1", tolower(x), perl=TRUE)
# [1] "Word Word" "Word" "Word Words"
(?<=\\b)([a-z]) says look for a lowercase letter preceded by a word boundary (e.g., a space or the beginning of a line). (?<=...) is called a "look-behind" assertion. \\U\\1 says replace that character with its uppercase version. \\1 is a back reference to the first group surrounded by () in the pattern. See ?regex for more details.
If you only want to capitalize the first letter of the first word, use the pattern "^([a-z])" instead.
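Applied to the data frame from the question, the regex approach might look like the sketch below. The helper name to_proper is made up, the pattern uses a plain \\b instead of the look-behind (which matches the same positions), and the data frame is rebuilt by hand from the question's extract.

```r
df <- data.frame(
  Month    = c("January", "july", "march", "May"),
  Provider = c("CofCom", "CofCom", "vobix", "vobix"),
  Items    = c(25, 331, 12, 0),
  stringsAsFactors = FALSE
)

# Lower-case everything, then upper-case the first letter of each word
to_proper <- function(x) gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)

# Apply only to character columns, leaving numeric columns untouched
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], to_proper)
print(df)
```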
The question is about an equivalent of Excel PROPER and the (former) accepted answer is based on:
proper=function(x) paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
It might be worth noting that:
proper("hello world")
## [1] "Hello world"
Excel PROPER would give, instead, "Hello World". For a 1:1 mapping with Excel, see @Matthew Plourde's answer.
If what you actually need is to set only the first character of a string to upper-case, you might also consider the shorter and slightly faster version:
proper=function(s) sub("(.)", "\\U\\1", tolower(s), perl=TRUE)
Another method uses the stringi package. The stri_trans_general function appears to lower case all letters other than the initial letter.
require(stringi)
x <- c('woRd Word', 'Word', 'word words')
stri_trans_general(x, id = "Title")
[1] "Word Word" "Word" "Word Words"
I don't think there is one, but you can easily write it yourself:
(dat <- data.frame(x = c('hello', 'frIENds'),
y = c('rawr','rulZ'),
z = c(16, 18)))
# x y z
# 1 hello rawr 16
# 2 frIENds rulZ 18
proper <- function(x)
  paste0(toupper(substr(x, 1, 1)), tolower(substring(x, 2)))
(dat <- data.frame(lapply(dat, function(x)
  if (is.numeric(x)) x else proper(x)),
  stringsAsFactors = FALSE))
# x y z
# 1 Hello Rawr 16
# 2 Friends Rulz 18
str(dat)
# 'data.frame': 2 obs. of 3 variables:
# $ x: chr "Hello" "Friends"
# $ y: chr "Rawr" "Rulz"
# $ z: num 16 18

Taking characters to the left of a character [duplicate]

This question already has answers here:
Splitting a file name into name,extension
(3 answers)
substring of a path variable
(2 answers)
Closed 9 years ago.
Given some data
hello <- c('13.txt','12.txt','14.txt')
I want to just take the numbers and convert to numeric, i.e. remove the .txt
You want file_path_sans_ext from the tools package
library(tools)
hello <- c('13.txt','12.txt','14.txt')
file_path_sans_ext(hello)
## [1] "13" "12" "14"
You can do this with regular expressions using the function gsub on the "hello" object in your original post.
hello <- c('13.txt','12.txt','14.txt')
as.numeric(gsub("([0-9]+).*","\\1",hello))
#[1] 13 12 14
Another regex solution
hello <- c("13.txt", "12.txt", "14.txt")
as.numeric(unlist(regmatches(hello, gregexpr("[0-9]+", hello))))
## [1] 13 12 14
If you know your extensions are all .txt then you can use substr() to drop the last four characters:
hello <- c('13.txt','12.txt','14.txt')
as.numeric(substr(hello, 1, nchar(hello) - 4))
#[1] 13 12 14
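When the extensions are not all the same length, a fixed substr() offset no longer works. A small sketch that strips whatever follows the last dot instead, using a made-up vector files with mixed extensions:

```r
# Hypothetical file names with extensions of different lengths
files <- c("13.txt", "12.csv", "14.json")

# Remove the last "." and everything after it
stems <- sub("\\.[^.]*$", "", files)

# Convert the remaining stems to numbers
file_nums <- as.numeric(stems)
# file_nums is 13 12 14
```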
