I'm cleaning up a long list of noun-phrases for further text mining. They're supposed to be 1- or 2-word phrases, but some have / in a conjunction. Here's what I've got:
library(tidyverse)

conjuncts <- tibble(usecase = 1:3,
                    classes = c("Insulators/Insulation",
                                "Optic/light fiber",
                                "Magnets"))
And I want:
wanted <- tibble(usecase = c(1, 1, 2, 2, 3),
                 classes = c("Insulators/Insulation",
                             "Insulators/Insulation",
                             "Optic/light fiber",
                             "Optic/light fiber",
                             "Magnets"),
                 bigrams = c("Insulators", "Insulation",
                             "Optic fiber", "Light fiber", NA))
I've got something working, but it's horrible and non-extensible.
patternSplit <- function(class) {
  regexs <- c("(?x) ^ (\\w+) / (\\w+) $",
              "(?x) ^ (\\w+) / (\\w+) \\s+ (\\w+) $")
  if (str_detect(class, regexs[1])) {
    extr <- str_match(class, regexs[1])
    list(extr[1, 2],
         extr[1, 3])
  } else if (str_detect(class, regexs[2])) {
    extr <- str_match(class, regexs[2])
    list(paste(extr[1, 2], extr[1, 4]),
         paste(extr[1, 3], extr[1, 4]))
  } else {
    list(NA_character_)
  }
}
anx <- conjuncts %>%
  mutate(bigrams = map(classes, patternSplit)) %>%
  unnest(cols = "bigrams") %>%
  unnest(cols = "bigrams")
Which gives me what I wanted, but blecchh!
# A tibble: 5 x 3
usecase classes bigrams
<int> <chr> <chr>
1 1 Insulators/Insulation Insulators
2 1 Insulators/Insulation Insulation
3 2 Optic/light fiber Optic fiber
4 2 Optic/light fiber light fiber
5 3 Magnets NA
The top two problems: (1) I have to run each regex twice, once with str_detect to get the logical for the if/else and again with str_match to pull out the tokens. (2) I have to do the double unnest to unwind the list structure. And a smaller problem: (3) can I get out of if/else and into case_when or switch?
I'll eventually be extending this to about a dozen patterns and use-cases.
Here is a solution using / as a separator to detect word phrases, then using ifelse to get the desired result:
patternSplit <- function(x, p = "[A-Za-z]+[/][A-Za-z]+") {
  # [A-Za-z] rather than [A-z]: the latter also matches [, \, ], ^, _ and `
  x1 <- stringr::str_extract(x, p)      # the "a/b" pair, if present
  x2 <- stringr::str_replace(x, p, "")  # the remainder (keeps its leading space)
  cbind(val1 = x1, val2 = x2)
}
conjuncts <- cbind(conjuncts, patternSplit(conjuncts$classes)) %>%
  tidyr::separate_rows(val1, sep = '/') %>%
  dplyr::mutate(bigrams = ifelse(!is.na(val1), paste0(val1, val2), val1)) %>%
  dplyr::select(-contains("val"))
conjuncts
usecase classes bigrams
1 1 Insulators/Insulation Insulators
2 1 Insulators/Insulation Insulation
3 2 Optic/light fiber Optic fiber
4 2 Optic/light fiber light fiber
5 3 Magnets <NA>
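As an aside on the extensibility concern: one way to avoid both the double regex run and the double unnest is to keep each pattern next to a small builder function, try the patterns in order with a single str_match() per pattern, and return plain character vectors so a single unnest() suffices. A rough sketch of that idea (mine, not part of the answer above, using the original conjuncts tibble from the question):

library(tidyverse)

# each entry pairs a regex with a function that builds bigrams from its match
patterns <- list(
  list(rx = "^(\\w+)/(\\w+)$",
       make = function(m) c(m[2], m[3])),
  list(rx = "^(\\w+)/(\\w+)\\s+(\\w+)$",
       make = function(m) c(paste(m[2], m[4]), paste(m[3], m[4])))
)

splitClass <- function(class) {
  for (p in patterns) {
    m <- str_match(class, p$rx)                # run each regex once
    if (!is.na(m[1, 1])) return(p$make(m[1, ]))
  }
  NA_character_                                # no pattern matched
}

conjuncts %>%
  mutate(bigrams = map(classes, splitClass)) %>%
  unnest(cols = bigrams)                       # one unnest: vectors, not nested lists

Extending to a dozen use-cases then means appending to patterns rather than growing an if/else chain.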
I created this function to count the maximum number of consecutive characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
  unnest_tokens(word, text3) %>%
  mutate(
    num_letters = nchar(word),
    num_vowels = get_count(word),
    num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
  )
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.
The expression rle(unlist(strsplit(word, split = "")))$lengths is not vectorized: inside mutate() it is evaluated once over the entire column of words, so max() collapses everything to a single value, which is then recycled to every row.
You will need to use some type of loop (i.e. for, apply, purrr::map) to solve it.
library(dplyr)
library(tidytext)

text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"

text3_df <- tibble(line = 1:1, text3)

output <- text3_df %>%
  unnest_tokens(word, text3) %>%
  mutate(
    num_letters = nchar(word),
    # num_vowels = get_count(word),
  )

output$num_consec_char <- sapply(output$word, function(word) {
  max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1
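If you would rather keep everything inside mutate(), the same per-word loop can be written with purrr::map_int (one of the loop options mentioned above); a minimal sketch:

library(purrr)

text3_df %>%
  unnest_tokens(word, text3) %>%
  mutate(
    num_letters = nchar(word),
    num_consec_char = map_int(word, ~ max(rle(strsplit(.x, "")[[1]])$lengths))
  )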
I want to find the distribution of number of titles with 1 word, 2 words, 3 words, ... in my dataset "jnl.dt" in R.
one_word_title = 0
two_word_title = 0
three_word_title = 0
for (i in seq_len(nrow(jnl.dt))) {  # loop over every title
  if (str_count(jnl.dt[i]$`Full Title`, '\\w+') == 1) {
    one_word_title <- one_word_title + 1
  } else if (str_count(jnl.dt[i]$`Full Title`, '\\w+') == 2) {
    two_word_title <- two_word_title + 1
  } else if (str_count(jnl.dt[i]$`Full Title`, '\\w+') == 3) {
    three_word_title <- three_word_title + 1
  }
}
one_word_title
two_word_title
three_word_title
Is there a way to find the distribution of number of titles with different number of words without hardcoding the number of words in title?
Instead of counting each word-length separately, you can tabulate them all at once.
table(stringr::str_count(jnl.dt$`Full Title`, '\\w+'))
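For example, with a small made-up jnl.dt (hypothetical data, since the real dataset isn't shown):

jnl.dt <- data.frame(`Full Title` = c("Nature", "Applied Physics",
                                      "Journal of Ecology"),
                     check.names = FALSE)
table(stringr::str_count(jnl.dt$`Full Title`, '\\w+'))
#
# 1 2 3
# 1 1 1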
Here's a proposal, somewhat tentative given the absence of reproducible data:
Let's assume you have this kind of data and titles:
df <- data.frame(titles = c("The Great Gatsby", "That's the Story of my Life", "Love Story", "Alice in Wonderland", "Harry Potter"))
To get the "distribution" of number of words in the titlesyou can do this:
library(dplyr)
library(stringr)
df %>%
  mutate(N_w = str_count(titles, "\\S+")) %>%
  group_by(N_w) %>%
  summarise(Dist_N_w = n())
# A tibble: 3 x 2
N_w Dist_N_w
* <int> <int>
1 2 2
2 3 2
3 6 1
Note that using \\w+ and, respectively, \\S+ makes a difference: since the apostrophe is not contained in the \\w character class (letters, digits, and the underscore), That's will be counted as 2 words. \\S, by contrast, is a negated character class matching anything that is not whitespace (actual spaces as well as newline and return characters etc.), so with \\S+ the count for That's is 1.
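A quick illustration of the difference (a minimal check, not part of the original answer):

stringr::str_count("That's", "\\w+")  # 2 -- "That" and "s"
stringr::str_count("That's", "\\S+")  # 1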
We may use unnest_tokens:

library(tidytext)
library(dplyr)

df %>%
  mutate(rn = row_number()) %>%
  unnest_tokens(word, titles) %>%
  count(rn) %>%
  count(n)
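For the df above this yields the same distribution as before: two titles with 2 words, two with 3, and one with 6 (unnest_tokens keeps that's as a single token).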
I would like to count the number of English words in a string of text.
df.words <- data.frame(ID = 1:2,
                       text = c("frog friend fresh frink foot",
                                "get give gint gobble"))
df.words
ID text
1 1 frog friend fresh frink foot
2 2 get give gint gobble
I'd like the final product to look like this:
ID text count
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
I'm guessing I'll have to first separate based on spaces and then reference the words against a dictionary?
Building on @r2evans' suggestion of using strsplit() and a random English-word .txt dictionary found online, an example is below. This solution might not scale well if you have a large number of comparisons, because of the unnest step.
library(dplyr)
library(tidyr)

# text file with ~479k English words, ~4 MB
dict <- read.table(file = url("https://github.com/dwyl/english-words/raw/master/words_alpha.txt"),
                   col.names = "text2")

df.words <- data.frame(ID = 1:2,
                       text = c("frog friend fresh frink foot",
                                "get give gint gobble"),
                       stringsAsFactors = FALSE)

df.words %>%
  mutate(text2 = strsplit(text, split = "\\s")) %>%
  unnest(text2) %>%
  semi_join(dict, by = c("text2")) %>%
  group_by(ID, text) %>%
  summarise(count = length(text2))
Output
ID text count
<int> <chr> <int>
1 1 frog friend fresh frink foot 4
2 2 get give gint gobble 3
Base R alternative, using EJJ's great recommendation for dict:
sapply(strsplit(df.words$text, "\\s+"),
       function(z) sum(z %in% dict$text2))
# [1] 4 3
I thought that this would be a clear winner in speed, but apparently doing sum(. %in% .) one at a time can be a little expensive. (It is slower with this data.)
Faster but not necessarily simpler:
words <- strsplit(df.words$text, "\\s+")
# pad every word vector with NAs to a common length; sapply then returns a
# character matrix with one column per row of df.words
words <- sapply(words, `length<-`, max(lengths(words)))
# logical matrix of dictionary hits (the NA padding never matches)
found <- array(words %in% dict$text2, dim = dim(words))
colSums(found)
# [1] 4 3
It's a hair faster (~ 10-15%) than EJJ's solution, so likely only a good thing if you need to wring some performance out of it.
(Caveat: EJJ's is faster with this 2-row dataset. If the data is 1000x larger, then my first solution is a little faster, and my second solution is twice as fast. Benchmarks are benchmarks, though, don't optimize code beyond usability if speed/time is not a critical factor.)
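For reference, a sketch of how such a comparison might be run (this assumes the bench package is available; the objects are those defined above):

library(bench)

bench::mark(
  per_row = sapply(strsplit(df.words$text, "\\s+"),
                   function(z) sum(z %in% dict$text2)),
  matrix = {
    w <- strsplit(df.words$text, "\\s+")
    w <- sapply(w, `length<-`, max(lengths(w)))
    colSums(array(w %in% dict$text2, dim = dim(w)))
  },
  check = FALSE  # one result is integer, the other numeric
)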
My dataframe (df) contains a list of values which are labelled in the format 'Month' + 'Name of Site' + 'Camera No.'. E.g., for the value 'DECBUTCAM27': DEC is December, BUT is the name of the site, and CAM27 is the camera number.
I have 100 such values with 19 different site names.
I want to write if/else code such that only the site names are recognised and a corresponding number is added.
My initial idea was to add the corresponding number for all 100 values, but since nested ifelse does not work beyond 50 levels I couldn't use that option.
This is what I had written for the option that I had tried:
df <- df2 %>% mutate(Site_ID =
  ifelse(CT_Name == 'DECBUTCAM27', "1",
  ifelse(CT_Name == 'DECBUTCAM28', "1",
  ifelse(CT_Name == 'DECI2NCAM01', "2",
  ifelse(CT_Name == 'DECI2NCAM07', "2",
  ifelse(CT_Name == 'DECI5CAM39', "3",
  ifelse(CT_Name == 'DECI5CAM40', "3", "NoVal")))))))
I am looking for a code such that only the sites i.e., 'BUT', 'I2N' and 'I5' would be recognised and a corresponding number is added.
Any help would be greatly appreciated.
Extract the site name using regex and use match() + unique() to assign a unique number.
df2$site_name <- sub('...(.*)CAM.*', '\\1', df2$CT_Name)
df2$Site_ID <- match(df2$site_name, unique(df2$site_name))
For example, see this example :
CT_Name <- c('DECBUTCAM27', 'DECBUTCAM28', 'DECI2NCAM07', 'DECI2NCAM01',
'DECI5CAM39', 'DECI5CAM40')
site_name <- sub('...(.*)CAM.*', '\\1', CT_Name)
site_name
#[1] "BUT" "BUT" "I2N" "I2N" "I5" "I5"
Site_ID <- match(site_name, unique(site_name))
Site_ID
#[1] 1 1 2 2 3 3
Here is a tidyverse solution:
You haven't provided a reproducible example, but let's use the CT_Names that you have supplied to create a test dataframe:
data <- tribble(
  ~CT_Name,
  "DECBUTCAM27",
  "DECBUTCAM28",
  "DECI2NCAM01",
  "DECI2NCAM07",
  "DECI5CAM39",
  "DECI5CAM40"
)
Let's assume that the string format is 3 letters for months, 2 or more letters or numbers for site and CAM + 1 or more digits for camera number (adjust these as needed). We can use a regular expression in tidyr's extract() function to split up the string into its components:
data_new <- data %>%
  extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)",
          into = c("Month", "Site", "Camera"))
(add remove = FALSE if you want to keep the original CT_Name variable)
This yields:
# A tibble: 6 x 3
Month Site Camera
<chr> <chr> <chr>
1 DEC BUT CAM27
2 DEC BUT CAM28
3 DEC I2N CAM01
4 DEC I2N CAM07
5 DEC I5 CAM39
6 DEC I5 CAM40
We can then group by site and assign a group ID as your Site_ID:
data_new <- data %>%
  extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)",
          into = c("Month", "Site", "Camera")) %>%
  group_by(Site) %>%
  mutate(Site_ID = cur_group_id())
This produces:
# A tibble: 6 x 4
# Groups: Site [3]
Month Site Camera Site_ID
<chr> <chr> <chr> <int>
1 DEC BUT CAM27 1
2 DEC BUT CAM28 1
3 DEC I2N CAM01 2
4 DEC I2N CAM07 2
5 DEC I5 CAM39 3
6 DEC I5 CAM40 3
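An equivalent without grouping (an aside of mine, echoing the match() idea from the base R answer above):

data_new <- data %>%
  extract(CT_Name, regex = "(\\w{3})(\\w{2,})(CAM\\d+)",
          into = c("Month", "Site", "Camera")) %>%
  mutate(Site_ID = match(Site, unique(Site)))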
Here is a quick example using regex to find the site code and an apply function to return a vector of codes.
df <- data.frame(code = c('DECBUTCAM27', 'JANBUTCAM27', 'DECDUCCAM45'))
# strip the first 3 characters (the month), then everything from "CAM" onward
df$loc <- apply(df, 1, function(x) gsub("CAM.*$", "", gsub("^.{3}", "", x[1])))
unique(df$loc)                          # all the locations in the file
df$n <- as.numeric(as.factor(df$loc))   # get a number for each location
Mind that I use x[1] here because the codes are in the first column of my data.frame; this may vary for you.
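For the example above, the expected result is (a quick sanity check of mine):

unique(df$loc)  # [1] "BUT" "DUC"
df$n            # [1] 1 1 2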
--- EDIT --- This earlier answer also works, but it leaves more work to you. In return, it lets you choose the numeric (or text) code assigned to each location, for example if they need to follow a particular order.
It requires you to write out all the codes for each site, which I found heavy in terms of code, but it works. The switch part is roughly the same as an ifelse.
The regex consists in removing the first 3 characters and everything at the end after the 'CAM' sequence.
df <- data.frame(code = c('DECBUTCAM27', 'JANBUTCAM27', 'DECDUCCAM45'))
df$n <- apply(df, 1, function(x) switch(gsub("CAM.*$", "", gsub("^.{3}", "", x[1])),
                                        BUT = 1,
                                        DUC = 2))
What I did
I wrote a regex that matches all text strings containing "A" and "BV" with 0-10 words between them, using this tutorial: https://www.regular-expressions.info/near.html
df <- data.frame(text = c("ART 6 dasd asd NOT art 2 BV",
                          "NOT ART 6 ds as dd BV",
                          "ART 6 NO BV"),
                 id = c(1, 2, 3))

subset(df, grepl("(ART)(?:\\W+\\w+){0,10}?\\W+(\\bBV\\b)",
                 text,
                 perl = TRUE,
                 ignore.case = TRUE))
text id
1 ART 6 dasd asd NOT art 2 BV 1
2 NOT ART 6 ds as dd BV 2
3 ART 6 NO BV 3
What I am trying to get
Now I would like to rewrite the regex so that it does not match if any word from a given list (i.e. NOT and NO in the example data) occurs among the 0-10 words between "A" and "BV".
So the result would look like:
subset(df, grepl("NEWREGEX",
                 text,
                 perl = TRUE,
                 ignore.case = TRUE))
text id
1 NOT ART 6 ds as dd BV 2
I think I could use something like ?! (a negative lookahead), but I could not figure it out.
Thanks to akrun we have a really nice solution:
library(stringr)

str_extract(df$text, "(A\\w+\\b.*\\bBV\\b)") %>%  # pull out the A... to BV span
  str_detect("NOT?") %>%                          # does that span contain NO/NOT?
  '!' %>%                                         # negate: TRUE = rows to keep
  magrittr::extract(df, ., )                      # subset df with the logical vector
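A more conventional spelling of the same idea, without the pipe tricks (a sketch of mine; I add word boundaries around NOT? and an NA guard for rows where no A...BV span exists):

library(stringr)

span <- str_extract(df$text, "A\\w+\\b.*\\bBV\\b")    # NA where no span exists
df[!is.na(span) & !str_detect(span, "\\bNOT?\\b"), ]  # keep spans without NO/NOT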