Count the number of words without white spaces [duplicate] - r

This question already has answers here:
Count the number of all words in a string
(19 answers)
Closed 2 years ago.
I have the following string:
str1<-" india hit milestone electricity wind solar"
Number of words contained in it is:
>sapply(strsplit(str1, " "), length)
[1] 7
This is wrong: the string has 6 words, but the leading space in str1 produces an extra empty element. I tried to trim the white space, but:
> stripWhitespace(str1) # by tm package
returns the same situation:
[1] " india hit milestone electricity wind solar"
Why?

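stripWhitespace only collapses runs of whitespace into a single space; it does not remove leading or trailing whitespace. You can just use the base function trimws: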
sapply(strsplit(trimws(str1), " "), length)
[1] 6
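Splitting on a single space still overcounts when words are separated by several spaces or by tabs. A minimal base-R sketch, assuming a word is any run of non-whitespace characters (count_words is a hypothetical helper name, not a function from any package):

```r
# Hypothetical helper: count whitespace-separated words in each string.
count_words <- function(x) {
  # trimws() drops leading/trailing whitespace; "\\s+" treats any run of
  # spaces or tabs as a single separator.
  lengths(strsplit(trimws(x), "\\s+"))
}

str1 <- " india hit milestone electricity wind solar"
count_words(str1)                  # 6
count_words("wind   and\tsolar")   # 3
```

Because lengths() works element-wise on the list returned by strsplit, the same helper also works on a character vector of several strings.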

Maybe you can try
> lengths(gregexpr("\\b\\w+\\b", str1))
[1] 6

You could try using stringr::str_trim and stringr::str_split like this:
length(stringr::str_split(stringr::str_trim(str1), pattern = " ", simplify = TRUE))

We can use str_count
library(stringr)
str_count(str1, '\\w+')
#[1] 6

Related

How to get a date out of a string? [duplicate]

This question already has answers here:
R Regexp - extract number with 5 digits
(4 answers)
Closed 1 year ago.
I have a file named "test_result_20210930.xlsx". I would like to extract "20210930" into a new variable date. How should I do that? I think I can use pattern="[0-9]+". What if there are more numbers in the file name and I only want the part that stands for the date (8 digits together)?
Any suggestions?
str1 <- "test_result_20210930.xlsx"
With gsub, \\D+ matches all non-digits; specify blank ("") as the replacement
gsub("\\D+", "", str1)
[1] "20210930"
If the name also includes other digits and we want to return only the 8 digits
sub(".*_(\\d{8})_.*", "\\1", "test_result_20210930_01.xlsx")
[1] "20210930"
Or use str_extract
library(stringr)
str_extract("test_result_20210930_01.xlsx", "(?<=_)\\d{8}(?=_)")
[1] "20210930"
If we need to automatically convert to a date-time object
library(parsedate)
parse_date(str1)
[1] "2021-09-30 UTC"
You can also use str_extract from the stringr package to obtain the desired result.
library(stringr)
str_extract("test_result_20210930.xlsx", "[0-9]{8}")
# [1] "20210930"
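If the goal is an actual Date rather than a string, the extracted digits can be parsed in base R. A minimal sketch, assuming the 8 digits are always in year-month-day order (file_date is just an illustrative variable name):

```r
fname <- "test_result_20210930.xlsx"

# regexpr() finds the first run of exactly 8 consecutive digits;
# regmatches() pulls out the matched text.
digits <- regmatches(fname, regexpr("[0-9]{8}", fname))

# "%Y%m%d" tells as.Date() how to read the digits: 4-digit year, month, day.
file_date <- as.Date(digits, format = "%Y%m%d")
file_date
# [1] "2021-09-30"
```

Note that "[0-9]{8}" also matches the first 8 digits of a longer digit run, so a name with, say, a 10-digit ID before the date would need a stricter pattern.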

Print a result without a preceding square bracket in R [duplicate]

This question already has an answer here:
R how to not display the number into brackets of the row count in output
(1 answer)
Closed 2 years ago.
x <- 5+2
print(x)
[1] 7
How to suppress [1] and only print 7?
Similarly for characters:
y <- "comp"
print(y)
[1] "comp"
I want to remove both [1] and " ". Any help is appreciated!
Thanks!
With cat, it is possible
cat(x, '\n')
7
Or for characters, if the surrounding quotes should be kept
cat(dQuote(letters[1], FALSE), '\n')
"a"

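To drop both the [1] and the quotes, as the question asks, cat can be applied to the string directly. A short sketch using the y from the question; writeLines is shown as an alternative that adds the newline itself:

```r
y <- "comp"

cat(y, "\n")     # prints: comp
writeLines(y)    # prints: comp

# print(y, quote = FALSE) drops the quotes but still shows the [1] index.
print(y, quote = FALSE)
```
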
Extract Text Starting and Ending with Punctuations in R [duplicate]

This question already has answers here:
Extracting a string between other two strings in R
(4 answers)
Closed 3 years ago.
I want to extract a group of strings between two punctuations using RStudio.
I tried to use str_extract command, but whenever I tried to use anchors (^ for starting char, and $ for ending char), it failed.
Here is the sample problem:
> text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"
Here is the sample code I used:
> str_extract(text,"(Name : )(.+)?( ;)")
> str_match(str_extract(text,"(Name : )(.+)?( ;)"),"(Name : )(.+)?( ;)")[3]
But it seemed too verbose, and not flexible.
I only want to extract "Dr. CHARLES DOWNING MAP".
Can anyone help with my problem?
Can I tell the regex to start with any non-white-space character after "Name : " and end before " ; POB"?
This seems to work. Note the space after the colon in the pattern; without it the result starts with a leading space.
> gsub(".*Name : (.*) ;.*", "\\1", text)
[1] "Dr. CHARLES DOWNING MAP"
With str_match
stringr::str_match(text, "^Name : (.*) ;")[, 2]
#[1] "Dr. CHARLES DOWNING MAP"
[, 2] is to get the contents from the capture group.
There is also qdapRegex::ex_between to extract string between left and right markers
qdapRegex::ex_between(text, "Name : ", ";")[[1]]
#[1] "Dr. CHARLES DOWNING MAP"
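The question also asks whether the match can start at the first non-white-space character after "Name :". A base-R sketch of one way to express that, assuming the field always ends at the first semicolon that follows it:

```r
text <- "Name : Dr. CHARLES DOWNING MAP ; POB : London; Age/DOB : 53 years / August 05, 1958;"

# \\S forces the capture to begin at a non-whitespace character;
# the lazy .*? stops at the first ";" instead of the last one;
# \\s* on either side keeps surrounding spaces out of the capture.
name <- sub("^Name :\\s*(\\S.*?)\\s*;.*$", "\\1", text, perl = TRUE)
name
# [1] "Dr. CHARLES DOWNING MAP"
```

Because the quantifier is lazy, this version still returns only the name if a later field happens to contain " ;" as well.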

How to split words in R while keeping contractions [duplicate]

This question already has an answer here:
strsplit on all spaces and punctuation except apostrophes [duplicate]
(1 answer)
Closed 7 years ago.
I'm trying to turn a character vector novel.lower.mid into a list of single words. So far, this is the code I've used:
midnight.words.l <- strsplit(novel.lower.mid, "\\W")
This produces a list of all the words. However, it splits everything, including contractions. The word "can't" becomes "can" and "t". How do I make sure those words aren't separated, or that the function just ignores the apostrophe?
We can use
library(stringr)
str_extract_all(novel.lower.mid, "\\b[[:alnum:]']+\\b")
Or
strsplit(novel.lower.mid, "(?!')\\W", perl=TRUE)
If you just want your current "\W" split to not include apostrophes, negate \w and ':
novel.lower.mid <- c("I won't eat", "green eggs and", "ham")
strsplit(novel.lower.mid, "[^\\w']", perl = TRUE)
# [[1]]
# [1] "I" "won't" "eat"
#
# [[2]]
# [1] "green" "eggs" "and"
#
# [[3]]
# [1] "ham"
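Splitting on a single non-word character leaves empty strings wherever two separators are adjacent, such as a comma followed by a space. A base-R sketch that matches the words instead of splitting on the gaps; the input below is made-up sample text, extended with a comma to show the difference:

```r
# Made-up sample input with adjacent separators (", ").
novel.lower.mid <- c("I won't eat, Sam", "green eggs and", "ham")

# gregexpr() locates every run of letters, digits, or apostrophes;
# regmatches() returns those runs as one character vector per input string.
words <- regmatches(novel.lower.mid,
                    gregexpr("[[:alnum:]']+", novel.lower.mid))
words[[1]]
# [1] "I"     "won't" "eat"   "Sam"
```

This is the base-R counterpart of the str_extract_all approach above, minus the \\b anchors.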

Taking characters to the left of a character [duplicate]

This question already has answers here:
Splitting a file name into name,extension
(3 answers)
substring of a path variable
(2 answers)
Closed 9 years ago.
Given some data
hello <- c('13.txt','12.txt','14.txt')
I want to just take the numbers and convert to numeric, i.e. remove the .txt
You want file_path_sans_ext from the tools package
library(tools)
hello <- c('13.txt','12.txt','14.txt')
file_path_sans_ext(hello)
## [1] "13" "12" "14"
You can do this with regular expressions using the function gsub on the "hello" object in your original post.
hello <- c('13.txt','12.txt','14.txt')
as.numeric(gsub("([0-9]+).*","\\1",hello))
#[1] 13 12 14
Another regex solution
hello <- c("13.txt", "12.txt", "14.txt")
as.numeric(regmatches(hello, regexpr("[0-9]+", hello)))
## [1] 13 12 14
If you know your extensions are all .txt (four characters, including the dot) then you can use substr()
> hello <- c('13.txt','12.txt','14.txt')
> as.numeric(substr(hello, 1, nchar(hello) - 4))
#[1] 13 12 14
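When the extension is not always .txt, the same idea can be written with sub so that it does not depend on the extension's length. A minimal sketch, assuming each name has one extension after its last dot and a purely numeric stem:

```r
hello <- c("13.txt", "12.txt", "14.txt")

# "\\.[^.]*$" matches the last dot and everything after it.
stems <- sub("\\.[^.]*$", "", hello)

nums <- as.numeric(stems)
nums
# [1] 13 12 14
```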
