R - Extract information from string following a general format

R - Extract information from string following a general format - r

This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank Code Name Club Class Time Points
9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167
10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
The order of the elements will always be the same
The Name data will consist of two or three words
The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub)
The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass)
The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?

Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s)
gsub("\\*", "", trimws(str_match(s, paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(", paste(sClub, collapse = "|"),")\\s*(", paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)"))[, -1]))
sapply(list(s1, s2), myfun)
# [,1] [,2]
# [1,] "9" "10"
# [2,] "9875" "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ" "ΔΕΖ"
# [5,] "Π/Π Β" "N"
# [6,] "00:54:05" "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167" "0"
The way it works is just taking into account all your additional information and constructing a long regex. It finishes with erasing * and removing leading/trailing whitespace.

Related

How to vectorize a for loop in R for a large dataset

I'm relatively new to R and I have a question about data processing. The main issue is that the dataset is too big, and I want to write a vectorized function that's faster than a for loop, but I don't know how. The data is about movies and user ratings, is formatted like this (below).
1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
The 1: and 2: represent movies, while the other lines represent a user id, user rating and dating of rating for that movie (in that order from left to right, separated by commas). I want to format the data as an edge list, like this:
Movie | User
1: | 5
1: | 1
1: | 3
2: | 2
2: | 3
2: | 5
I wrote the code below to perform this function. Basically, for every row, it check if its a movie id (containing ':') or if it's user data. It then combines the movie id and user id as two columns for every movie and user, and then rowbinds it to a new data frame. At the same time, it also only binds those users who rate a movie 5 out of 5.
el <- data.frame(matrix(ncol = 2, nrow = 0))
for (i in 1:nrow(data))
{
if (grepl(':', data[i,]))
{
mid <- data[i,]
} else(grepl(',', data[i,]))
{
if(grepl(',5,', data[i,]))
{
uid <- unlist(strsplit(data[i,], ','))[1]
add <- c(mid, uid)
el <- rbind(el, add)
}
}
}
However, I have about 100 million entries, and the for loop runs throughout the night without being able to complete. Is there a way to speed this up? I read about vectorization, but I can't figure out how to vectorize this function. Any help?

You can do this with a few regular expressions, for which I'll use the stringr package, as well as na.locf from the zoo package. (You'll have to install stringr and zoo first).
First we'll set up your data, which it sounds like is in a one-column data frame:
data <- read.table(textConnection("1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
"))
You can then follow the following steps (explanation in comments).
# Pull out the column as a character vector for simplicity
lines <- data[[1]]
library(stringr)
# Figure out which lines represent movie IDs, and extract IDs
movie_ids <- str_match(lines, "(\\d+):")[, 2]
# Fill the last observation carried forward (locf), to find out
# the most recent non-NA value
library(zoo)
movie_ids_filled <- na.locf(movie_ids)
# Extract the user IDs
user_ids <- str_match(lines, "(\\d+),")[, 2]
# For each line that has a user ID, match it to the movie ID
result <- cbind(movie_ids_filled[!is.na(user_ids)],
user_ids[!is.na(user_ids)])
This gets the result
[,1] [,2]
[1,] "1" "5"
[2,] "1" "1"
[3,] "1" "3"
[4,] "2" "2"
[5,] "2" "3"
[6,] "2" "5"
The most important part of this code is the use of regular expressions, particularly the capturing groups in parentheses of "(\\d+):" and (\\d+),. For more on using str_match with regular expressions, do check out this guide.

Create a new vector with text from strings in an old vector in R

Working with a data frame in R studio. One column, PODMap, has sentences such as "At my property there is a house at 38.1234, 123.1234 and also I have a car". I want to create new columns, one for the latitude and one for the longitude.
Fvalue is the data frame. So far I have
matches <- regmatches(fvalue[,"PODMap"], regexpr("..\\.....", fvalue[,"PODMap"], perl = TRUE))
Since the only periods in the text are in longitude and latitude, this returns the first lat or long listed in each string (still working on finding a regex to grab the longitude from after the latitude but that's a different question). The problem is, for instance, if my vector is c("test 38.1111", "x", "test 38.2222") then it returns (38.1111. 38.2222) which has the right values, but the vector won't be the right length for my data frame and won't match. I need it to return a blank or a 0 or NA for each string that doesn't have the value matching the regular expression, so that it can be put into the data frame as a column. If I'm going about this entirely wrong let me know about that too.

You can use regexecwhich returns a list of the same length so you don't loose the non-match spaces
PODMap<-c("At my property there is a house at 38.1234, 123.1234 and also I have a",
"Test TEst TEST Tes T 12.1234, 123.4567 test Tes",
"NO LONG HEre Here No Lat either",
"At my property there is a house at 12.1234, 423.1234 and also I have ")
Index<-c(1:4)
fvalue<-data.frame(Index,PODMap)
matches <- regmatches(fvalue[,"PODMap"], regexec("..\\.....", fvalue[,"PODMap"], perl
= TRUE))
> matches
[[1]]
[1] "38.1234"
[[2]]
[1] "12.1234"
[[3]]
character(0)
[[4]]
[1] "12.1234"
Using the package stringr, we can get both the long and lat.
library(stringr)
matches<-str_match_all(fvalue[,"PODMap"], ".\\d\\d\\.\\d\\d\\d\\d")
> matches
[[1]]
[,1]
[1,] " 38.1234"
[2,] "123.1234"
[[2]]
[,1]
[1,] " 12.1234"
[2,] "123.4567"
[[3]]
[,1]
[[4]]
[,1]
[1,] " 12.1234"
[2,] "423.1234"
The \\d checks for any digit 1:9, so that will keep out any words, and we use str_match_all to get all the matches from the string, as regmatches will only take the first match. str_match_all will set a value to NULL instead of character(0) though, which should not be a problem.
Check out this regex demo

Extracting multiple substrings that come after certain characters in a string using stringi in R

I have a large dataframe in R that has a column that looks like this where each sentence is a row
data <- data.frame(
datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
stringsAsFactors=FALSE)
I want to extract all the words that come after "wiki/" and put them in another column
So for the first row it should come out with "political_philosophy self-governance"
The second row should look like "hierarchy free_association_(communism_and_anarchism)"
The third row should be "state_(polity)"
And the fourth row should be "anti-statism"
I definitely want to use stringi because it's a huge dataframe. Thanks in advance for any help.
I've tried
stri_extract_all_fixed(data$datalist, "wiki")[[1]]
but that just extracts the word wiki

You can do this with a regex. By using stri_match_ instead of stri_extract_ we can use parentheses to make matching groups that let us extract only part of the regex match. In the result below, you can see that each row of df gives a list item containing a data frame with the whole match in the first column and each matching group in the following columns:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
match
[[1]]
[,1] [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance" "self-governance"
[[2]]
[,1] [,2]
[1,] "wiki/stateless_society" "stateless_society"
[2,] "wiki/hierarchy" "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"
[[3]]
[,1] [,2]
[1,] "wiki/state_(polity)" "state_(polity)"
[[4]]
[,1] [,2]
[1,] "wiki/anti-statism" "anti-statism"
You can then use apply functions to make the data into any form you want:
match <- stri_match_all_regex(df$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"

You can use a lookbehind in the regex.
library(dplyr)
library(stringi)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")
df <- data.frame(text, stringsAsFactors = FALSE)
df %>%
mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))

You may use
> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
See the online R code demo.
Details
wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token that matches any char, other than a line break char, 1+ occurrences, that does not start a wiki/+a non-whitespace char sequence.
If you need to get rid of redundant whitespace inside the result you may use another call to gsub:
> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"

finding potential duplications (spelling errors) in r [duplicate]

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data the names might be written down wrong. I am looking for a way to check in a vector of strings if two of them are similair.
For example:
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?

This can be done based on eg the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers of these questions:
agrep: only return best match(es)
In R, how do I replace a string that contains a certain pattern with another string?
Fast Levenshtein distance in R?
But most often agrep will do what you want :
> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3
$`Bush, G.W.`
[1] 2
$`Obama, B.H.`
[1] 1 3
$`Clinton, W.J.`
[1] 4

Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.
lapply(pres, agrep, pres, value = TRUE)
[[1]]
[1] " Obama, B." "Obama, B.H."
[[2]]
[1] "Bush, G.W."
[[3]]
[1] " Obama, B." "Obama, B.H."
[[4]]
[1] "Clinton, W.J."

Add another duplicate to show it works with more than one duplicate.
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")
adist shows the string distance between 2 character vectors
adist(" Obama, B.", pres)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 9 3 10 7
For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:
d <- adist(" Obama, B.", pres)
pres[min(d[d>0])]
# [1] "Obama, B.H."
To obtain unique names, taking into account spelling errors and inconsistencies, you can compare each string to all previous ones. Then if there is a similar one, remove it. I created a keepunique() function that performs this. keepunique() is then applied to all elements of the vector successively with Reduce().
keepunique <- function(previousones, x){
if(any(adist(x, previousones)<5)){
x <- NULL
}
return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B." "Bush, G.W." "Clinton, W.J."

Extracting Nouns and Verbs from Text

I was wondering if it is possible to extract nouns, verbs separately in R package openNLP?
I use the the tagPOS function which tags the sentence but what to do in case I want to extract verbs, nouns separately.

Using an example: (this is to extract words tagged as /VBx, where x is any single character)
library("openNLP")
acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter."
acqTag <- tagPOS(acq)
sapply(strsplit(acqTag,"[[:punct:]]*/VB.?"),function(x) sub("(^.*\\s)(\\w+$)", "\\2", x))
[,1]
[1,] "said"
[2,] "sold"
[3,] "engaged"
[4,] "said"
[5,] "is"
[6,] "did"
[7,] " not/RB explain./NN Reuter./."
Ok, my regular expression needs some improvement in order to get rid of the last line in the result.
EDIT
An alternative could be to ignore rows containing a space character
sapply(strsplit(acqTag,"[[:punct:]]*/VB.?"),function(x) {res = sub("(^.*\\s)(\\w+$)", "\\2", x); res[!grepl("\\s",res)]} )

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R - Extract information from string following a general format - r

Related

How to vectorize a for loop in R for a large dataset

Create a new vector with text from strings in an old vector in R

Extracting multiple substrings that come after certain characters in a string using stringi in R

finding potential duplications (spelling errors) in r [duplicate]

Extracting Nouns and Verbs from Text

Categories

Resources