Splitting string with '<U+FF0E>' in R - r

Hello I am trying to split a dataframe column test$Name that is in this format.
[1] "Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A"
[2] "Victoria Centre<U+FF0E>Block 3<U+FF0E>20/F<U+FF0E>Flat B"
[3] "Lei King Wan<U+FF0E>Sites B<U+FF0E>Block 6 Yat Hong Mansion<U+FF0E>3/F<U+FF0E>Flat H"
[4] "Island Place<U+FF0E>Block 3 (Three Island Place)<U+FF0E>9/F<U+FF0E>Flat G"
[5] "7A Comfort Terrace<U+FF0E>5/F<U+FF0E>Flat B"
[6] "Broadview Court<U+FF0E>Block 4<U+FF0E>38/F<U+FF0E>Flat E"
[7] "Chi Fu Fa Yuen<U+FF0E>Fu Ho Yuen (Block H-5)<U+FF0E>16/F<U+FF0E>Flat G"
[8] "City Garden<U+FF0E>Phase 2<U+FF0E>Block 10<U+FF0E>9/F<U+FF0E>Flat B"
[9] "Euston Court<U+FF0E>Tower 1<U+FF0E>12/F<U+FF0E>Flat H"
[10] "Garley Building<U+FF0E>10/F<U+FF0E>Flat C"
The structure of each entry is BuildingName<U+FF0E>FloorNumber<U+FF0E>Unit. I would like to extract the building name like the following example.
Name
Fung Yat Building
Victoria Centre
Lei King Wan
...
I have tested that <U+FF0E> is actually '.' by doing this.
grepl('.',"Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A")
[1] TRUE
Hence, I have tried the following, but none of these worked:
test %>% separate(Name, c('Name'), sep = '.') %>% head
gsub(".", " ", test$Name[1], fixed=TRUE)
sub("^\\s*<U\\+\\w+>\\s*", " ", test$Name[1])
Any suggestions please? Thanks!

The easiest way is to use < as the split pattern.
library(stringr)
word("Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A", 1, sep = "\\<")
# word("Fung Yat Building<U+FF0E>13/F<U+FF0E>Flat A", 1, sep = "\\<U\\+FF0E\\>") ## building is '1', FloorNumber is '2', Unit is '3'
Output:
[1] "Fung Yat Building"
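Note that grepl('.', x) returns TRUE for any non-empty string, because an unescaped . in a regex matches any character, so that test does not show that the separator is an ordinary full stop. If the column actually holds the character U+FF0E (FULLWIDTH FULL STOP) and <U+FF0E> is only how R prints it (an assumption about the data, not something the question confirms), a minimal base R sketch could split on that character literally. The sample vector below is a stand-in for test$Name:

```r
# Assumption: the data holds the real U+FF0E character,
# not the literal text "<U+FF0E>"
x <- c("Fung Yat Building\uFF0E13/F\uFF0EFlat A",
       "Victoria Centre\uFF0EBlock 3\uFF0E20/F\uFF0EFlat B")

# fixed = TRUE treats the separator as a literal string, not a regex
parts <- strsplit(x, "\uFF0E", fixed = TRUE)

# Keep the first piece of every entry, i.e. the building name
building <- vapply(parts, `[`, character(1), 1)
building
```

If the strings really do contain the literal text <U+FF0E>, the word() call above remains the way to go.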

Related

Remove everything in a string after the first " - " (multiple " - ")

I am struggling to only keep the part before the first " - ".
If I try this regex on regex101.com I get the expected output but when I try it in R I get a different output.
authors <- sub("\\s-\\s.*", "", authors)
Input:
[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org"
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier"
[3] "CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library"
Expected output:
[1] "T Dietz, RL Shwom, CT Whitley"
[2] "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
Actual output:
[1] "T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020"
[2] "L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011"
[3] "CD Thomas - Diversity and Distributions, 2010"
Thanks in advance!
It seems your input contains some Unicode whitespace characters.
In that case, the following will work:
sub("(*UTF)(*UCP)\\s-\\s.*", "", authors, perl=TRUE)
The (*UTF)(*UCP) prefix (or probably just (*UCP)) enables \s to match any Unicode whitespace character.
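To illustrate the idea, here is a small sketch with a made-up input in which the first separator is preceded by a no-break space (U+00A0) instead of a regular space. Whether the asker's data really contains that character is an assumption; it is simply one kind of Unicode whitespace that would produce this symptom.

```r
# Hypothetical input: a no-break space (U+00A0) sits before the first hyphen
authors <- "CD Thomas\u00a0- Diversity and Distributions, 2010 - Wiley Online Library"

# (*UCP) makes \s Unicode-aware, so the no-break space counts as whitespace
cleaned <- sub("(*UTF)(*UCP)\\s-\\s.*", "", authors, perl = TRUE)
cleaned
```

With the Unicode-aware \s, the match starts at the first separator, so only the author list is kept.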
You can use this regex and replace its matches with nothing, for example in Notepad++:
Regex
-(.*?)$
You can also just split the string on your delimiter (-) and take the first element:
sapply(strsplit(authors, " -", fixed = TRUE), `[[`, 1)
[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
You can also use greedy regex matching to remove your delimiter and everything after it. Because .* is greedy, it matches as much as possible:
stringr::str_remove(authors, " -.*")
[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson"
[3] "CD Thomas"
Too long for a comment at the moment, may delete later. When I run this code alone, I get your expected output:
authors <- c("T Dietz, RL Shwom, CT Whitley - Annual Review of Sociology, 2020 - annualreviews.org",
"L Berrang-Ford, JD Ford, J Paterson - Global environmental change, 2011 - Elsevier",
"CD Thomas - Diversity and Distributions, 2010 - Wiley Online Library")
sub("\\s-\\s.*", "", authors)
#[1] "T Dietz, RL Shwom, CT Whitley" "L Berrang-Ford, JD Ford, J Paterson" "CD Thomas"
This might have something to do with the fact that you reassign to authors every time you try subbing, which overwrites authors. You might have been doing that as you were developing the regex, and forgot to reassign the authors vector to the original.

Split String at First Occurrence of an Integer using R

Note I have already read Split string at first occurrence of an integer in a string however my request is different because I would like to use R.
Suppose I have the following example data frame:
> df = data.frame(name_and_address =
c("Mr. Smith12 Some street",
"Mr. Jones345 Another street",
"Mr. Anderson6 A different street"))
> df
name_and_address
1 Mr. Smith12 Some street
2 Mr. Jones345 Another street
3 Mr. Anderson6 A different street
I would like to split the string at the first occurrence of an integer. Notice that the integers are of varying length.
The desired output can be like the following:
[[1]]
[1] "Mr. Smith"
[2] "12 Some street",
[[2]]
[1] "Mr. Jones"
[2] "345 Another street",
[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"
I have tried the following, but I cannot get the regular expression right:
# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)
# Attempt 2 (Does not work)
library(stringr)
str_split(df, "\\d+")
I would use sub here:
df$name <- sub("(\\D+).*", "\\1", df$name_and_address)
df$address <- sub(".*?(\\d+.*)", "\\1", df$name_and_address)
You can use tidyr::extract:
library(tidyr)
df <- df %>%
extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
## name address
## 1 Mr. Smith 12 Some street
## 2 Mr. Jones 345 Another street
## 3 Mr. Anderson 6 A different street
The (\D*)(\d.*) regex matches the following:
(\D*) - Group 1: any zero or more non-digit chars
(\d.*) - Group 2: a digit and then any zero or more chars as many as possible.
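The same two-group pattern can also be tried without tidyr. The following is only a sketch using base R's regexec() and regmatches(); the anchors ^ and $ are an addition here, and x is a stand-in for df$name_and_address:

```r
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# regexec() records the full match plus each capture group
m <- regmatches(x, regexec("^(\\D*)(\\d.*)$", x))

# Drop the full match and keep only the two groups (name, address)
parts <- lapply(m, function(g) g[-1])
parts[[1]]
```

Each list element then holds the name in position 1 and the address in position 2, which is the list shape the question asks for.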
Another solution with stringr::str_split is also possible:
str_split(df$name_and_address, "(?=\\d)", n=2)
## => [[1]]
## [1] "Mr. Smith" "12 Some street"
## [[2]]
## [1] "Mr. Jones" "345 Another street"
## [[3]]
## [1] "Mr. Anderson" "6 A different street"
The (?=\d) positive lookahead finds a location before a digit, and n=2 tells stringr::str_split to only split into 2 chunks max.
A base R approach that returns empty strings if there is no digit in the string:
df = data.frame(name_and_address = c("Mr. Smith12 Some street", "Mr. Jones345 Another street", "Mr. Anderson6 A different street", "1 digit is at the start", "No digits, sorry."))
df$name <- sub("^(?:(\\D*)\\d.*|.+)", "\\1", df$name_and_address)
df$address <- sub("^\\D*(\\d.*)?", "\\1", df$name_and_address)
df$name
# => [1] "Mr. Smith" "Mr. Jones" "Mr. Anderson" "" ""
df$address
# => [1] "12 Some street" "345 Another street"
# [3] "6 A different street" "1 digit is at the start" ""
This also supports cases when the first digit is the first char in the string.

How to filter specific vector using matching condition in R

I have two kinds of input vector. Each one contains either a Policy Number or a Contract Number, but never both at the same time.
I need to extract whichever number is present and store it in the policy_nr variable. The code works fine if I extract each one explicitly.
When I try to combine the two with the OR operator, I get this error:
invalid 'x' type in 'x || y'
My Input Type 1
kk
[1] "DUPLICATE x"
[2] "ifa’-"
[3] "UPLICAT EXIDE Life"
[4] "Insurance"
[5] "02-01-2020"
[6] "Mr DEV WHITE T"
[7] "Contract Number : 123456"
My Input Type 2
kk
[1] "DUPLICATE x"
[2] "ifa’-"
[3] "UPLICAT EXIDE Life"
[4] "Insurance"
[5] "02-01-2020"
[6] "Mr CRAIG WHITE T"
[7] "Policy Number : 7890"
Code
policy_nr <- str_trim(sub(".*:", "", kk[grepl("Contract Number",kk)] ), side = c("both")) || str_trim(sub(".*:", "", kk[grepl("Policy Number",kk)] ), side = c("both"))
Just concatenate the character vector into a single string using paste, because the number might be in any element. You can express the OR as an alternation in a regular expression like this: "(Policy|Contract)"
library(tidyverse)
record1 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr DEV WHITE T",
"Contract Number : 123456"
)
record2 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr CRAIG WHITE T",
"Policy Number : 7890"
)
extract_number <- function(x) {
x %>%
paste0(collapse = " ") %>%
str_extract("(Policy|Contract) Number : [0-9]+") %>%
str_extract("[0-9]+") %>%
as.numeric()
}
extract_number(record1)
#> [1] 123456
extract_number(record2)
#> [1] 7890
Created on 2021-12-03 by the reprex package (v2.0.1)
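As for the error itself: || expects logical operands, and both sides of the original expression are character vectors, which is why R reports an invalid 'x' type. A minimal base R sketch that keeps the asker's sub()-based approach but moves the OR into the pattern might look like this (the shortened kk vector is illustrative only):

```r
# Shortened stand-in for the asker's kk vector
kk <- c("DUPLICATE x", "Insurance", "02-01-2020",
        "Mr DEV WHITE T", "Contract Number : 123456")

# The alternation selects whichever label is present
hit <- grepl("(Policy|Contract) Number", kk)

# Strip everything up to the colon, then trim surrounding whitespace
policy_nr <- trimws(sub(".*:", "", kk[hit]))
policy_nr
```

The result is still a character string; wrap it in as.numeric() if a number is needed.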
You can use this pattern to extract the number after Policy Number or Contract Number: '(?<=(Policy|Contract) Number : )\\d+'
library(tidyverse)
kk1 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr DEV WHITE T",
"Contract Number : 123456"
)
kk2 <- c(
"DUPLICATE x",
"ifa’-",
"UPLICAT EXIDE Life",
"Insurance",
"02-01-2020",
"Mr CRAIG WHITE T",
"Policy Number : 7890"
)
kk1 %>% paste(collapse = ' ') %>%
str_extract_all('(?<=(Policy|Contract) Number : )\\d+')
kk2 %>% paste(collapse = ' ') %>%
str_extract_all('(?<=(Policy|Contract) Number : )\\d+')

Scraping keywords on PHP page

I would like to scrape the keywords inside the dropdown table of this webpage https://www.aeaweb.org/jel/guide/jel.php
The problem is that the drop-down menu of each item prevents me from scraping the table directly because it only takes the heading and not the inner content of each item.
rvest::read_html("https://www.aeaweb.org/jel/guide/jel.php") %>%
rvest::html_table()
I thought of scraping each line that starts with Keywords:, but I do not see how to do that. It seems the HTML does not expose the items inside the table.
An RSelenium solution:
#Start the server
library(RSelenium)
driver = rsDriver(
browser = c("firefox"))
remDr <- driver[["client"]]
#Navigate to the url
remDr$navigate("https://www.aeaweb.org/jel/guide/jel.php")
#xpath of the table
remDr$findElement(using = "xpath",'/html/body/main/div/section/div[4]') -> out
#get text from the table
out <- out$getElementText()
out= out[[1]]
Split using stringr package
library(stringr)
str_split(out, "\n", n = Inf, simplify = FALSE)
[[1]]
[1] "A General Economics and Teaching"
[2] "B History of Economic Thought, Methodology, and Heterodox Approaches"
[3] "C Mathematical and Quantitative Methods"
[4] "D Microeconomics"
[5] "E Macroeconomics and Monetary Economics"
[6] "F International Economics"
[7] "G Financial Economics"
[8] "H Public Economics"
[9] "I Health, Education, and Welfare"
[10] "J Labor and Demographic Economics"
[11] "K Law and Economics"
[12] "L Industrial Organization"
[13] "M Business Administration and Business Economics; Marketing; Accounting; Personnel Economics"
[14] "N Economic History"
[15] "O Economic Development, Innovation, Technological Change, and Growth"
[16] "P Economic Systems"
[17] "Q Agricultural and Natural Resource Economics; Environmental and Ecological Economics"
[18] "R Urban, Rural, Regional, Real Estate, and Transportation Economics"
[19] "Y Miscellaneous Categories"
[20] "Z Other Special Topics"
To get the Keywords for History of Economic Thought, Methodology, and Heterodox Approaches
out1 <- remDr$findElement(using = 'xpath', value = '//*[@id="cl_B"]')
out1$clickElement()
out1 <- remDr$findElement(using = 'xpath', value = '/html/body/main/div/section/div[4]/div[2]/div[2]/div/div/div/div[2]')
out1$getElementText()
[[1]]
[1] "Keywords: History of Economic Thought"
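Once the expanded text has been collected and split into lines, picking out the lines that start with Keywords:, as the question suggests, is a simple filter. This is only a sketch on a hypothetical vector of lines; the real page text may be laid out differently:

```r
# Hypothetical lines, standing in for the split text of an expanded section
lines <- c("B History of Economic Thought, Methodology, and Heterodox Approaches",
           "Keywords: History of Economic Thought",
           "C Mathematical and Quantitative Methods")

# Keep only lines that begin with the label, then strip the label itself
is_kw <- startsWith(lines, "Keywords:")
keywords <- sub("^Keywords:\\s*", "", lines[is_kw])
keywords
```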

Create vertex and edge list from PubMed articles author list

I am trying to build an R igraph graph by fetching articles from PubMed via the RISmed package by building a vertex list dataframe as follows:
Name Count
Cardone L 22
... ...
Sassone-Corsi P 6
... ...
and a corresponding (undirected) edge list
V1 V2 NumberOfCoAuthorships
Cardone L Sassone-Corsi P 4
Here is the data I am fetching:
#install.packages("RISmed")
library(RISmed)
limit = 100
query = "Cardone L[AUTH]"
# build the Eutils object
search_query <- EUtilsSummary(query, type="esearch",db = "pubmed",mindate=1980, maxdate=2016, retmax=limit)
# now let's fetch authors
auths<-Author(EUtilsGet(search_query))
# to extract all the names in the LastName Initials format
Authors<-sapply(auths, function(x)paste(x$LastName,x$Initials))
but I cannot work out how to build the edge and vertex data frames for a future igraph graph object from the Authors list of character vectors:
> head(Authors,3)
[[1]]
[1] "Cardone L"
[[2]]
[1] "Carrella D" "Manni I" "Tumaini B" "Dattilo R" "Papaccio F" "Mutarelli M"
[7] "Sirci F" "Amoreo CA" "Mottolese M" "Iezzi M" "Ciolli L" "Aria V"
[13] "Bosotti R" "Isacchi A" "Loreni F" "Bardelli A" "Avvedimento VE" "di Bernardo D"
[19] "Cardone L"
[[3]]
[1] "Ventre S" "Indrieri A" "Fracassi C" "Franco B" "Conte I" "Cardone L"
[7] "di Bernardo D"
Thank you for any help
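One possible way to get from such a list to the two data frames, sketched in base R under some assumptions: the small Authors list below is a hypothetical stand-in for the RISmed result, Count is taken to mean the number of papers an author appears on, and each pair of co-authors is sorted so that an undirected edge is always stored in the same order.

```r
# Hypothetical stand-in for the Authors list returned by RISmed
Authors <- list(
  "Cardone L",
  c("Carrella D", "di Bernardo D", "Cardone L"),
  c("Ventre S", "Cardone L", "di Bernardo D")
)

# Vertex list: number of papers each author appears on
counts <- table(unlist(lapply(Authors, unique)))
vertices <- data.frame(Name = names(counts), Count = as.integer(counts),
                       stringsAsFactors = FALSE)

# Edge list: every unordered pair of co-authors on each paper
pairs <- lapply(Authors, function(a) {
  a <- sort(unique(a))
  if (length(a) < 2) return(NULL)  # single-author papers add no edges
  t(combn(a, 2))
})
pairs <- do.call(rbind, pairs)

# Count how often each pair occurs across papers
edges <- aggregate(n ~ V1 + V2,
                   data = data.frame(V1 = pairs[, 1], V2 = pairs[, 2], n = 1L,
                                     stringsAsFactors = FALSE),
                   FUN = sum)
names(edges)[3] <- "NumberOfCoAuthorships"
edges
```

With igraph installed, these two data frames should be usable as igraph::graph_from_data_frame(edges, directed = FALSE, vertices = vertices), which takes the first edge columns as endpoints and the first vertex column as the vertex name.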
