How to recognise and extract alphanumeric characters - r

I want to extract alphanumeric characters from a particular sentence in R.
I have tried the following:
aa <- grep("[:alnum:]", "abc")
This should return integer(0), but it returns 1, which should not be the case since "abc" is not alphanumeric.
What am I missing here?
Essentially I am looking for a function that only matches strings that are combinations of both letters and numbers, for example "ABC-0112" or "PCS12SCH".
Thanks in advance for your help.

[[:alnum:]] matches letters or digits. (Note that without the outer brackets, [:alnum:] is just a bracket expression containing the literal characters :, a, l, n, u and m, which is why your pattern matches "abc".) To match strings that contain both letters and digits, you can use:
x <- c("ABC", "ABc12", "--A-1", "abc--", "89=A")
grep("(.*[[:alpha:]].*[[:digit:]]|.*[[:digit:]].*[[:alpha:]])", x)
# [1] 2 3 5
or
which(grepl("[[:alpha:]]", x) & grepl("[[:digit:]]", x))
# [1] 2 3 5
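If you want the matching strings themselves rather than their indices, you can subset with the same two grepl() calls (a minimal sketch, reusing the sample vector above):
x <- c("ABC", "ABc12", "--A-1", "abc--", "89=A")
# keep only the elements that contain both a letter and a digit
x[grepl("[[:alpha:]]", x) & grepl("[[:digit:]]", x)]
# [1] "ABc12" "--A-1" "89=A"
grep(..., value = TRUE) with the single combined pattern shown above would return the same elements.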

replacing word with same word + added characters R

I have a regex "^[0-9]\\.[0-9]|^§"
Now I want to replace the occurrences and add something.
Example
"foo" becomes "[[foo]]"
grep("^[0-9]\\.[0-9]|^§", Vector)
gives me all occurrences, but I am unsure how to continue.
You can use sub(). If you put parentheses around your pattern, you can then refer to it in the replacement string with \\1.
For example, if your vector is like this:
Vector <- c("2.9", "7.4", "A", "2.2")
And your regex is like this:
grep("^[0-9]\\.[0-9]|^§", Vector)
#> [1] 1 2 4
You can do
sub("(^[0-9]\\.[0-9]|^§)", "[[\\1]]", Vector)
#> [1] "[[2.9]]" "[[7.4]]" "A" "[[2.2]]"

Special character matching in r using grep

If I have sentences whose words are separated by spaces
s <- c("C java", "C++ java")
grep("C", s)
gives output as
[1] 1 2
while I only require
[1] 1
How do I do that? (I have used "C\\+\\+" to identify C++ separately, but matching with "C" gives both 1 and 2 as the output.)
If we want to match the first element only, we can anchor with the start (^) and end ($) of the string to say that there are no characters before or after 'C':
grep("^C$",s)
#[1] 1
data
s<- c("C","C++","java")
s<-c("C","C++","java")
which(s %in% "C")
%in% does exact matching of whole elements, while grep() gives a positive result for any match within a string.
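If the elements really are space-separated sentences, as in the original s <- c("C java", "C++ java"), anchoring the whole string will not help; a hedged alternative is to require a space or a string edge on each side of C:
s <- c("C java", "C++ java")
# match C only when it stands alone between spaces or string edges
grep("(^|\\s)C(\\s|$)", s)
# [1] 1
Note that \\bC\\b would still match "C++ java", because + is not a word character, so the explicit space check is the safer option here.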

Rename Dataframe Column Names in R using Previous Column Name and Regex Pattern

I am working in R for the first time and I have been having difficulty renaming column names in a dataframe (Grade.Data). I have a dataset imported from a csv file that has column names like this:
Student.ID
Grade
Interactive.Exercises.1..Health
Interactive.Exercises.2..Fitness
Quizzes.1..Week.1.Quiz
Quizzes.2..Week.2.Quiz
Case.Studies.1..Case.Study1
Case.Studies.2..Case.Study2
I would like to be able to change the variable names so that they are more simple, e.g. from Interactive.Exercises.1..Health to Interactive.Exercises.1, or from Quizzes.1..Week.1.Quiz to Quizzes.1.
So far, I have tried this:
grep(".*[0-9]", names(Grade.Data))
But I get this returned:
[1] 3 4 5 6 7 8 9 11 12 13 14 15 16 17 19 20 21 22 23 24 25
Can anyone help me figure out what is going on, and write a better regular expression? Thank you so much.
It seems you want to truncate the column names after the first chunk of digits.
You may use the following sub solution:
names(Grade.Data) <- sub("^(.*?\\d+).*$", "\\1", names(Grade.Data))
Details
^ - start of string
(.*?\\d+) - Group 1 (later referred to with \\1 in the replacement pattern) matching any 0+ chars, as few as possible (.*?), and then 1 or more digits (\\d+)
.* - any 0+ chars as many as possible
$ - end of string
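A quick check of that sub() call on a few of the sample column names (a sketch; the vector below just reuses names from the question):
nms <- c("Student.ID", "Grade", "Interactive.Exercises.1..Health",
         "Quizzes.1..Week.1.Quiz", "Case.Studies.1..Case.Study1")
sub("^(.*?\\d+).*$", "\\1", nms)
# "Student.ID" "Grade" "Interactive.Exercises.1" "Quizzes.1" "Case.Studies.1"
Names without any digit do not match the pattern, so sub() returns them unchanged, which is what you want for Student.ID and Grade.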
There is nothing wrong with your regex itself. What you are looking for is probably the combination of regexpr, which gives the start position and length of the match, and regmatches, which extracts the actual string corresponding to the output of regexpr:
start_end <- regexpr(".*[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1..Week.1" "Quizzes.2..Week.2"
# [5] "Case.Studies.1..Case.Study1"
Adding a question mark after the dot-star makes the regex match as few characters as possible, so it stops right after the first numeric value:
start_end <- regexpr(".*?[0-9]", names(Grade.Data))
regmatches(names(Grade.Data), start_end)
# [1] "Interactive.Exercises.1" "Interactive.Exercises.2"
# [3] "Quizzes.1" "Quizzes.2"
# [5] "Case.Studies.1"
Alternatively, you can simply assign a full vector of new names with the names() function. Here is a small example; the vector of names can be as long as you need:
names(Grade.Data) <- c("Col1_name", "Col2_name")
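If you only want to rename a few columns, you can also assign to a subset of names(); a small sketch in which the column positions are only assumptions:
# rename just columns 3 and 4, leaving the rest untouched
names(Grade.Data)[3:4] <- c("Interactive.Exercises.1", "Interactive.Exercises.2")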

extracting a word (of variable length) ending with 동 from a string in R

I have a data frame in R with one column containing an address in Korean. I need to extract one of the words (a word ending with 동), if it is there (it may be missing), and create a new column named "dong" that will contain this word. My data is shown in the column "address" and the desired output in the column "dong" below.
address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
dong <- c("탄방동","효동","오정동","자양동",NA)
data <- data.frame(address,dong, stringsAsFactors = FALSE)
I've tried using grep but it's not giving me exactly what I need.
grep(".+동\\s",data$address,value=T)
I think I have 2 issues: 1) I'm not sure how to write a proper regular expression to identify the word I need and 2) I'm not sure why grep returns the whole string rather than the word. I would appreciate any suggestions.
A regex to extract Korean whole words ending with a specific letter is
\b\w*동\b
Details:
\b - leading word boundary
\w* - 0+ word chars
동 - ending letter
\b - trailing word boundary
In R:
address <- c("대전광역시 서구 탄방동 홈플러스","대전광역시 동구 효동 주민센터","대전광역시 대덕구 오정동 한남마트","대전광역시 동구 자양동 87-3번지 성동경로당","대전광역시 유성구 용계로 128")
## matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address, perl=TRUE ))
matches <- regmatches(address, gregexpr("\\b\\w*동\\b", address ))
dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x))
data <- data.frame(address,dong, stringsAsFactors = FALSE)
Output:
address dong
1 대전광역시 서구 탄방동 홈플러스 탄방동
2 대전광역시 동구 효동 주민센터 효동
3 대전광역시 대덕구 오정동 한남마트 오정동
4 대전광역시 동구 자양동 87-3번지 성동경로당 자양동
5 대전광역시 유성구 용계로 128 <NA>
Note that the dong <- unlist(lapply(matches, function(x) if (length(x) == 0) NA else x)) line is necessary to put NA into the rows where no match was found.
grep() matches against whole strings and returns whole elements (or their indices), not just the matched part. In your case, the stringr library is useful.
library(stringr)
str_match(paste0(data$address, ' '), '([^\\s]+동)\\s')
[,1] [,2]
[1,] "탄방동 " "탄방동"
[2,] "효동 " "효동"
[3,] "오정동 " "오정동"
[4,] "자양동 " "자양동"
[5,] NA NA
Column 2 is what you want. Note that I added a space at the end of the strings so that the regex still matches when the word ending in 동 appears at the end of a string.
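A more compact variant of the same idea, assuming the stringr package is available, is str_extract(), which returns NA automatically when there is no match:
library(stringr)
# first space-delimited word ending in 동, with a word boundary after it
data$dong <- str_extract(data$address, "\\S*동\\b")
data$dong
# [1] "탄방동" "효동" "오정동" "자양동" NA
Because \\b also matches at the end of the string, there is no need to paste an extra space onto the addresses.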

Detecting and counting punctuation characters in R

For digits I have done this:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use [:punct:] (inside a bracket expression) to detect punctuation. This character class matches
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either in gregexpr:
x = c("we are friends!, Good Friends!!")
gregexpr("[[:punct:]]", x)
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice that the comma is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear:
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
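For the x used above this gives one named count per punctuation mark. A smaller sketch of the same idea, counting just a few marks of interest:
x <- "we are friends!, Good Friends!!"
marks <- c("!", ",", ".")
# stri_count_fixed() is vectorised over the pattern, so we get one count per mark
setNames(stringi::stri_count_fixed(x, marks), marks)
# ! , .
# 3 1 0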
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
The overview at https://en.wikipedia.org/wiki/Regular_expression might help too.
