Extract keys and values from hstore string in R - r

I'm working in R with strings obtained from OpenStreetMap and stored using the hstore data type. For example:
"comment"=>"Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout","lit"=>"yes","maxspeed"=>"30 mph","oneway"=>"yes","surface"=>"asphalt"
I would like to create a regex (or any other approach is fine) to extract all keys and all values. Please note that keys and values could contain the characters =, >, ", or ,. The ideal approach should use only functions implemented in base-R packages.
EDIT - Expected output
If the input is
"comment"=>"Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout","lit"=>"yes","maxspeed"=>"30 mph","oneway"=>"yes","surface"=>"asphalt"
then the expected output should be something like
keys: "comment", "lit", "maxspeed", "oneway", "surface"
values: "Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout", "yes", "30 mph", "yes", "asphalt"
If the input is
"lit"=>"no","foot"=>"designated","horse"=>"designated","bicycle"=>"yes","surface"=>"gravel","old_name"=>"Freshwater, Yarmouth & Newport Railway","prow_ref"=>"F61"
then the output should be like
keys: "lit", "foot", "horse", "bicycle", "surface", "old_name", "prow_ref"
values: "no", "designated", "designated", "yes", "gravel", "Freshwater, Yarmouth & Newport Railway", "F61"

We can use tidyverse approaches
library(dplyr)
library(tidyr)
library(stringr)
tibble(str1) %>%
separate_rows(str1, sep = '"[^"]+"(*SKIP)(*FAIL)|,') %>%
separate(str1, into = c('key', 'value'), sep= '"=>"') %>%
mutate(across(everything(), str_remove_all, pattern = '"'))
-output
# A tibble: 12 x 2
# key value
# <chr> <chr>
# 1 comment Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout
# 2 lit yes
# 3 maxspeed 30 mph
# 4 oneway yes
# 5 surface asphalt
# 6 lit no
# 7 foot designated
# 8 horse designated
# 9 bicycle yes
#10 surface gravel
#11 old_name Freshwater, Yarmouth & Newport Railway
#12 prow_ref F61
data
str1 <- c("\"comment\"=>\"Removed junction=roundabout as some entrances have right of way. See http://wiki.openstreetmap.org/wiki/Tag:junction%3Droundabout\",\"lit\"=>\"yes\",\"maxspeed\"=>\"30 mph\",\"oneway\"=>\"yes\",\"surface\"=>\"asphalt\"",
"\"lit\"=>\"no\",\"foot\"=>\"designated\",\"horse\"=>\"designated\",\"bicycle\"=>\"yes\",\"surface\"=>\"gravel\",\"old_name\"=>\"Freshwater, Yarmouth & Newport Railway\",\"prow_ref\"=>\"F61\""
)

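# `string` is a single hstore string, here the second element of str1 from the data above
string <- str1[2]
read.table(text=gsub(',?([^,]+)=>',"\n\\1:", string, perl = TRUE), sep=":",
col.names = c("Key", "value"))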
Key value
1 lit no
2 foot designated
3 horse designated
4 bicycle yes
5 surface gravel
6 old_name Freshwater, Yarmouth & Newport Railway
7 prow_ref F61
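Since the question asks for base R only, here is a minimal sketch of a regex-based parser. It assumes every key and value is double-quoted (unquoted NULL values are not handled) and that any embedded quote is backslash-escaped. parse_hstore is a made-up helper name, not an existing function.

```r
# Hypothetical helper: parse one hstore string into a key/value data.frame.
# Assumes all keys and values are double-quoted and inner quotes are escaped as \".
parse_hstore <- function(x) {
  # a quoted token: any run of non-quote, non-backslash characters or escaped characters
  quoted <- '"((?:[^"\\\\]|\\\\.)*)"'
  pair <- paste0(quoted, "=>", quoted)
  # all complete "key"=>"value" pairs, left to right
  pairs <- regmatches(x, gregexpr(pair, x, perl = TRUE))[[1]]
  # capture groups of each pair: element 2 is the key, element 3 is the value
  parts <- regmatches(pairs, regexec(pair, pairs, perl = TRUE))
  data.frame(
    key = vapply(parts, `[`, character(1), 2),
    value = vapply(parts, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

x <- "\"lit\"=>\"no\",\"old_name\"=>\"Freshwater, Yarmouth & Newport Railway\",\"prow_ref\"=>\"F61\""
res <- parse_hstore(x)
# res$key holds lit, old_name, prow_ref; the comma inside old_name's value is kept
```

Because each match has to be a complete quoted pair, characters such as =, > or , inside a key or value do not split it. Escaped characters are returned as written (backslashes are not removed).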

Related

Select phrases found in dictionary and return dataframe of doc_id and phrase

I have a dictionary file of medical phrases and a corpus of raw texts. I'm trying to use the dictionary file to select the relevant phrases from the text. Phrases, in this case, are 1 to 5-word n-grams. In the end, I would like the selected phrases in a dataframe with two columns: doc_id, phrase
I've been trying to use the quanteda package to do this but haven't been successful. Below is some code to reproduce my latest attempt. I'd appreciate any advice you have...I've tried a variety of methods but keep getting back only single-word matches.
version R version 3.6.2 (2019-12-12)
os Windows 10 x64
system x86_64, mingw32
ui RStudio
Packages:
dbplyr 1.4.2
quanteda 1.5.2
library(quanteda)
library(dplyr)
raw <- data.frame("doc_id" = c("1", "2", "3"),
"text" = c("diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."))
term = c("diffuse intrinsic pontine glioma", "brain tumors", "brain", "pontine glioma", "mri", "medical imaging", "radiology", "anatomy", "physiological processes", "radiation therapy", "radiotherapy", "cancer treatment", "malignant cells")
medTerms = list(term = term)
dict <- dictionary(medTerms)
corp <- raw %>% group_by(doc_id) %>% summarise(text = paste(text, collapse=" "))
corp <- corpus(corp, text_field = "text")
dfm <- dfm(corp,
tolower = TRUE, stem = FALSE, remove_punct = TRUE,
remove = stopwords("english"))
dfm <- dfm_select(dfm, pattern = phrase(dict))
What I'd eventually like to get back is something like the following:
doc_id term
1 diffuse intrinsic pontine glioma
1 pontine glioma
1 brain tumors
1 brain
2 mri
2 medical imaging
2 radiology
2 anatomy
2 physiological processes
3 radiation therapy
3 radiotherapy
3 cancer treatment
3 malignant cells
If you want to match multi-word patterns from a dictionary, you can do so by constructing your dfm using ngrams.
library(quanteda)
library(dplyr)
library(tidyr)
raw$text <- as.character(raw$text) # you forgot to use stringsAsFactors = FALSE while constructing the data.frame, so I convert your factor to character before continuing
corp <- corpus(raw, text_field = "text")
dfm <- tokens(corp) %>%
tokens_ngrams(1:5) %>% # This is the new way of creating ngram dfms. 1:5 means to construct all from unigram to 5-grams
dfm(tolower = TRUE,
stem = FALSE,
remove_punct = TRUE) %>% # I wouldn't remove stopwords for this matching task
dfm_select(pattern = dict)
Now we just have to convert the dfm to a data.frame and bring it into a long format:
convert(dfm, "data.frame") %>%
pivot_longer(-document, names_to = "term") %>%
filter(value > 0)
#> # A tibble: 13 x 3
#> document term value
#> <chr> <chr> <dbl>
#> 1 1 brain 2
#> 2 1 pontine_glioma 1
#> 3 1 brain_tumors 1
#> 4 1 diffuse_intrinsic_pontine_glioma 1
#> 5 2 mri 1
#> 6 2 radiology 1
#> 7 2 anatomy 1
#> 8 2 medical_imaging 1
#> 9 2 physiological_processes 1
#> 10 3 radiotherapy 1
#> 11 3 radiation_therapy 1
#> 12 3 cancer_treatment 1
#> 13 3 malignant_cells 1
You could remove the value column but it might be of interest later on.
You could form all ngrams from 1 to 5 in length and then select the dictionary matches from them, but for large texts this would be very inefficient. Here's a more direct way. I've reproduced the entire problem here with a few modifications (such as stringsAsFactors = FALSE and skipping some unnecessary steps).
Granted, this does not double count the terms as in your expected example, but I submit that you probably did not want this. Why count "brain" if it occurred within "brain tumor"? You would be better counting "brain tumor" when it occurs as that phrase, and "brain" only when it occurs without "tumor". The code below does that.
library(quanteda)
## Package version: 2.0.1
raw <- data.frame(
"doc_id" = c("1", "2", "3"),
"text" = c(
"diffuse intrinsic pontine glioma are highly aggressive and difficult to treat brain tumors found at the base of the brain.",
"magnetic resonance imaging (mri) is a medical imaging technique used in radiology to form pictures of the anatomy and the physiological processes of the body.",
"radiation therapy or radiotherapy, often abbreviated rt, rtx, or xrt, is a therapy using ionizing radiation, generally as part of cancer treatment to control or kill malignant cells and normally delivered by a linear accelerator."
),
stringsAsFactors = FALSE
)
dict <- dictionary(list(
term = c(
"diffuse intrinsic pontine glioma",
"brain tumors", "brain", "pontine glioma", "mri", "medical imaging",
"radiology", "anatomy", "physiological processes", "radiation therapy",
"radiotherapy", "cancer treatment", "malignant cells"
)
))
Here's the key to the answer: using the dictionary first to select the tokens, then to concatenate them, then to reshape them one dictionary match per new "document". The last step creates the data.frame you want.
toks <- corpus(raw) %>%
tokens() %>%
tokens_select(dict) %>% # select just dictionary values
tokens_compound(dict, concatenator = " ") %>% # turn phrase into single "tokens"
tokens_segment(pattern = "*") # make one token per "document"
# make into data.frame
data.frame(
doc_id = docid(toks), term = as.character(toks),
stringsAsFactors = FALSE
)
## doc_id term
## 1 1 diffuse intrinsic pontine glioma
## 2 1 brain tumors
## 3 1 brain
## 4 2 mri
## 5 2 medical imaging
## 6 2 radiology
## 7 2 anatomy
## 8 2 physiological processes
## 9 3 radiation therapy
## 10 3 radiotherapy
## 11 3 cancer treatment
## 12 3 malignant cells
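If you do want the overlapping matches from the expected output in the question (for example "brain" reported even where it also occurs inside "brain tumors"), a plain base R loop over the dictionary is one option. This is only a sketch and does not use quanteda; match_terms is a hypothetical helper, and it assumes the terms contain no regex metacharacters.

```r
# Hypothetical helper: one row per (document, dictionary term) that occurs in the text.
match_terms <- function(doc_id, text, terms) {
  hits <- lapply(seq_along(text), function(i) {
    # TRUE for each term that occurs as a whole word or phrase in document i
    present <- vapply(
      terms,
      function(term) grepl(paste0("\\b", term, "\\b"), text[i], perl = TRUE),
      logical(1)
    )
    found <- terms[present]
    data.frame(doc_id = rep(doc_id[i], length(found)), term = found,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

docs <- c("brain tumors found at the base of the brain", "mri is used in radiology")
terms <- c("brain tumors", "brain", "mri", "anatomy")
out <- match_terms(c("1", "2"), docs, terms)
# out has three rows: (1, brain tumors), (1, brain), (2, mri)
```

Each term is reported at most once per document, so this gives presence rather than counts. It runs one regex per term per document, which is fine for a small dictionary but will be slow for a large one.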

Is there a more elegant way to collapse a variable with 88 levels to one with 5 levels?

I have a categorical variable with 88 levels (counties) and I want to aggregate those into five larger geographical regions. Is there a more elegant way to do this than a huge amount of ifelse statements (like below)?
survey.responses$admin<-ifelse(survey.responses$CNTY=="Lake","Northeast",
ifelse(survey.responses$CNTY=="Traverse","Northwest",
ifelse(survey.responses$CNTY=="Ramsey","Central",
ifelse(survey.responses$CNTY=="Cottonwood","South","out of state"))))
except imagine that CNTY has 88 levels! Any thoughts?
Two quick methods, I recommend the merge one for larger sets.
Data
dat <- data.frame(cnty = c("Lake", "Traverse", "Ramsey", "Cottonwood"),
stringsAsFactors = FALSE)
Merge/join. I prefer this for several reasons, most of all that it is quite easy to maintain a CSV of the matches and read.csv the CSV into the ref lookup table. I'll intentionally leave "Lake" out to show what happens with non-matches.
ref <- data.frame(cnty = c("Cottonwood", "Ramsey", "Traverse", "SomeOther"),
admin = c("South", "Central", "Northwest", "NeverNeverLand"),
stringsAsFactors = FALSE)
out <- merge(dat, ref, by = "cnty", all.x = TRUE)
out
# cnty admin
# 1 Cottonwood South
# 2 Lake <NA>
# 3 Ramsey Central
# 4 Traverse Northwest
The default value is assigned in this way:
out$admin[is.na(out$admin)] <- "out of state"
out
# cnty admin
# 1 Cottonwood South
# 2 Lake out of state
# 3 Ramsey Central
# 4 Traverse Northwest
If you're using other components of tidyverse, this can be done with
library(dplyr)
left_join(dat, ref, by = "cnty") %>%
mutate(admin = if_else(is.na(admin), "out of state", admin))
Lookup. This works fine for small things, but is perhaps not the best fit for your case. (Again, I've commented "Lake" out to show the non-match.)
c(Cottonwood="South", # Lake="Northeast",
Ramsey="Central", Traverse="Northwest")[dat$cnty]
# <NA> Traverse Ramsey Cottonwood
# NA "Northwest" "Central" "South"
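The lookup can also fill in the default value in the same way as the merge version. A small sketch, reusing the counties from the example data (the variable names lookup and admin are only for illustration):

```r
# Named vector: names are counties, values are regions ("Lake" left out on purpose)
lookup <- c(Cottonwood = "South", Ramsey = "Central", Traverse = "Northwest")
cnty <- c("Lake", "Traverse", "Ramsey", "Cottonwood")

admin <- unname(lookup[cnty])          # NA where the county is not in the lookup
admin[is.na(admin)] <- "out of state"  # default for non-matches
# admin is now: out of state, Northwest, Central, South
```

With 88 counties the named vector could be built from a two-column CSV instead of being typed inline.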
Unless CNTY has some pattern that you can turn into a rule, you need to list the levels manually. One way would be to use case_when from dplyr
library(dplyr)
survey.responses %>%
mutate(admin = case_when(CNTY %in% c("Lake","Northeast") ~ "GR1",
CNTY %in% c("Traverse","Northwest") ~ "GR2",
CNTY %in% c("Ramsey","Central") ~ "GR3",
TRUE ~ NA_character_))

Splitting strings in between 3rd and 4th characters in R

I'm grabbing information from Wikipedia on Canadian Forward Sortation Areas (FSAs - those are the first 3 digits of postal codes in Canada) and what cities/areas they belong to. Example of this information is below:
library(rvest)
library(tidyverse)
URL <- paste0("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_", "K")
FSAs <- URL %>%
read_html() %>%
html_nodes(xpath = "//td") %>%
html_text()
head(FSAs)
[1] "K1AGovernment of CanadaOttawa and Gatineau offices (partly in QC)\n" "K2AOttawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
[3] "K4AOttawa(Fallingbrook)\n" "K6AHawkesbury\n"
[5] "K7ASmiths Falls\n" "K8APembrokeCentral and northern subdivisions\n"
The problem I'm facing is that I would like to have a data frame with the first 3 characters of each string in one column, and the rest of the information in another. I've thought there would be a solution involving a stringr function like str_split(), but this removes the pattern of the first 3 characters, which I of course don't want. In effect, I'm looking to split the string in-between the 3rd and 4th character of each string.
I've figured out this solution, with the last bit borrowed from this answer, but it's incredibly hackish. My question is, is there a better way of doing this?
FSAs %>%
enframe(name = NULL) %>%
separate(value, c(NA, "Location"), sep = "^...", remove = FALSE) %>%
separate(value, c("FSA", NA), sep = "(?<=\\G...)")
# A tibble: 195 x 2
FSA Location
<chr> <chr>
1 K1A "Government of CanadaOttawa and Gatineau offices (partly in QC)\n"
2 K2A "Ottawa(Highland Park / McKellar Park /Westboro /Glabar Park /Carlingwood)\n"
3 K4A "Ottawa(Fallingbrook)\n"
4 K6A "Hawkesbury\n"
5 K7A "Smiths Falls\n"
6 K8A "PembrokeCentral and northern subdivisions\n"
7 K9A "Cobourg\n"
8 K1B "Ottawa(Blackburn Hamlet / Pine View / Sheffield Glen)\n"
9 K2B "Ottawa(Britannia /Whitehaven / Bayshore / Pinecrest)\n"
10 K4B "Ottawa(Navan)\n"
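Because the split point is a fixed character position, no regex is needed. A minimal base R sketch using substr() and substring(), with a two-element stand-in for the scraped FSAs vector (the sample values are shortened for illustration):

```r
# Stand-in for the scraped vector; the real one comes from html_text()
fsas <- c("K1AGovernment of Canada\n", "K6AHawkesbury\n")

out <- data.frame(
  FSA = substr(fsas, 1, 3),              # characters 1 to 3
  Location = trimws(substring(fsas, 4)), # character 4 onward, trailing newline removed
  stringsAsFactors = FALSE
)
# out$FSA is K1A, K6A; out$Location is Government of Canada, Hawkesbury
```

If you prefer to stay in the tidyverse, tidyr::separate() also accepts a numeric sep, which it interprets as a character position to split at, so sep = 3 should do the same job in a single call.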

How do I extract part of the data from a column and paste it into another column using R?

I want to extract part of the data from a column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
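You can use the separate function from tidyr, which is loaded with the tidyverse package.
Note that I use NA rather than NULL inside the mobile vector, because c() silently drops NULL elements and would leave mobile with only two values. I also store the numbers as strings so the leading + is kept.
Also, I use the option stringsAsFactors = F when creating the data frame to avoid working with factors.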
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
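Splitting on a single space would also split a multi-word place name at its first space. If that is a concern, one alternative is to match the phone number itself. The sketch below is base R and assumes a number is a + followed by digits at the end of the string, which is my reading of the sample data rather than something stated in the question.

```r
country <- c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")

# Assumed number format: a space, a plus sign, then digits, at the end of the string
has_number <- grepl(" \\+[0-9]+$", country)
number <- ifelse(has_number, sub("^.* (\\+[0-9]+)$", "\\1", country), NA_character_)

mobile <- ifelse(is.na(mobile), number, mobile)  # keep existing numbers, fill the gaps
country <- sub(" \\+[0-9]+$", "", country)       # strip the number from the country column

data <- data.frame(names = c("Sia", "Ryan", "J", "Ricky"), country, mobile,
                   stringsAsFactors = FALSE)
```

An existing value in mobile takes precedence over a number found in country, which matches the coalesce() call above.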

R: How to use grep() to find specific words?

I have a long data frame of words. I want to use a few specific words to find all of their part-of-speech variants.
For example:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply", "supplying cmp abrasive",
"chemical mechanical"))
word
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I want to extract the single words "clean" and "supply" in their different parts of speech. I tried to do this with grepl():
specific_word <- c("clean", "supply")
grep_onto <- df[grepl(paste(specific_word, collapse = "|"), df$word), ] %>%
data.frame(word = ., row.names = NULL) %>%
unique()
But the result is not what I want:
word
1 clean
2 grinding liquid cmp
3 cleaning
4 cleaning composition
5 supplying
6 supply
7 supplying cmp abrasive
8 chemical mechanical
I prefer to get
words
1 clean
2 cleaning
3 supplying
4 supply
I know maybe regular expression can solve my problem, but I don't know how to define it. Can anyone give me some advice?
There are various ways to do this, but generally if you want it to be a single word and you're using regex, you need to specify the beginning ^ and end $ of the line so as to limit what can come before or after your pattern. You seem to want it to be able to expand with more letters, so add in \\w* to allow it:
df <- data.frame(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
pattern <- paste0('^\\w*', specific_word, '\\w*$', collapse = '|')
pattern
#> [1] "^\\w*clean\\w*$|^\\w*supply\\w*$"
df[grep(pattern, df$word), , drop = FALSE] # drop = FALSE to stop simplification to vector
#> word
#> 1 clean
#> 3 cleaning
#> 5 supplying
#> 6 supply
Another interpretation of what you're looking for is to split each term into individual words, and search any of those for a match. tidyr::separate_rows can be used for such a split, which you can then filter with grepl:
library(tidyverse)
df <- tibble(word = c("clean", "grinding liquid cmp", "cleaning",
"cleaning composition", "supplying", "supply",
"supplying cmp abrasive", "chemical mechanical"))
specific_word <- c("clean", "supply")
df %>% separate_rows(word) %>%
filter(grepl(paste(specific_word, collapse = '|'), word)) %>%
distinct()
#> # A tibble: 4 x 1
#> word
#> <chr>
#> 1 clean
#> 2 cleaning
#> 3 supplying
#> 4 supply
For more robust word tokenization, try tidytext::unnest_tokens or another actual word tokenizer.
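The same split-then-filter idea can be sketched in base R with strsplit(), in case you would rather not load the tidyverse. It splits on whitespace only, which is an assumption that suits this sample but is cruder than a real tokenizer.

```r
word <- c("clean", "grinding liquid cmp", "cleaning", "cleaning composition",
          "supplying", "supply", "supplying cmp abrasive", "chemical mechanical")
specific_word <- c("clean", "supply")

# Split every entry into single words and drop duplicates
tokens <- unique(unlist(strsplit(word, "[[:space:]]+")))

# Keep the words that contain any of the specific words
matches <- tokens[grepl(paste(specific_word, collapse = "|"), tokens)]
# matches is: clean, cleaning, supplying, supply
```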
