Remove non-English strings from rows in R

I have a couple of variables whose data (rows) contain an English string followed by its non-English (Hindi) translation.
E.g. Carpenter (Hindi word for carpenter)
Is there a way to strip the rows so they contain only the English part? The Hindi is causing problems when applying functions, so I want it removed.

Here is an option using base R's iconv(), which strips only the non-Latin characters:
s <- 'Carpenter (बढ़ई)'
iconv(s, "latin1", "ASCII", sub="")
# [1] "Carpenter ()"
Applying to a data frame:
df <- data.frame(rbind('Carpenter (बढ़ई)',
                       'Cat (बिल्ली)'))
sapply(df, iconv, from = "latin1", to = "ASCII", sub = "")
# [1,] "Carpenter ()"
# [2,] "Cat ()"
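If the leftover empty parentheses are unwanted, a small follow-up with gsub() and trimws() (my addition, not part of the original answer) removes them:
clean <- sapply(df, iconv, from = "latin1", to = "ASCII", sub = "")  # as above
trimws(gsub("\\(\\s*\\)", "", clean))  # drop "()" and surrounding whitespace
# [1] "Carpenter" "Cat"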

I managed to extract the English part of the text using regular expressions (regex) and the stringr package. Below is an example data frame and the resulting output.
library(tidyverse)
library(stringr)
df <- tibble(
  complete_wrd = c(
    "carpenter (Hindi word for carpenter)",
    "cat (Hindi word for cat)",
    "dog (Hindi word for dog)"))
df %>%
  mutate(engl_wrd = stringr::str_extract(complete_wrd, "^.*?\\S*"))
# A tibble: 3 x 2
  complete_wrd                         engl_wrd
  <chr>                                <chr>
1 carpenter (Hindi word for carpenter) carpenter
2 cat (Hindi word for cat)             cat
3 dog (Hindi word for dog)             dog
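If the English part can span several words (e.g. "Construction Labourer" in the next answer's data), a hedged variant is to take everything before the opening parenthesis and trim the trailing space:
df %>%
  mutate(engl_wrd = str_trim(str_extract(complete_wrd, "^[^(]+")))  # everything before "("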

Another way you can try:
library(dplyr)
library(stringr)
df %>%
  mutate(hindi_text = str_remove(hindi_text, "\\(.*\\)"))
# hindi_text
# 1 Construction Labourer
# 2 Other
Data
df <- data.frame(hindi_text = c("Construction Labourer(सभी प्रकार के निर्माण मजदूर)",
                                "Other(उपरोक्त के अतिरिक्त) "))
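Note that the second row keeps a trailing space once the parenthetical is removed, because the original string ends with ") ". If that matters, wrapping the call in str_trim() is one fix (a sketch, not part of the original answer):
df %>%
  mutate(hindi_text = str_trim(str_remove(hindi_text, "\\(.*\\)")))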

Related

Str_split is returning only half of the string

I have a tibble whose column contains character strings mixing English and Mandarin characters. I want to split this column into two, one returning the English, the other returning the Mandarin. However, I had to resort to the following code to accomplish this:
library(tidyverse)
tb <- tibble(x = c("I我", "love愛", "you你"))  # create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = TRUE)  # split where a character is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = TRUE)   # split after a run of a-z characters
tb <- tb %>%
  mutate(EN = en[, 1],
         CH = ch[, 2]) %>%  # subset the matrices above; they also contain a column of "" values
  select(-x)                # remove the x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use
library(stringr)
tb$en <- str_extract(tb$x,"[[:alpha:]]+")
tb$ch <- str_extract(tb$x,"[^[:alpha:]]+")
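If a string contains several alternating runs, str_extract_all() collects all of them. A minimal sketch with a hypothetical multi-run input, using the explicit [A-Za-z] class:
library(stringr)
x <- "I我love愛you你"  # hypothetical input with more than one run per string
str_extract_all(x, "[A-Za-z]+")[[1]]   # English runs
# [1] "I"    "love" "you"
str_extract_all(x, "[^A-Za-z]+")[[1]]  # everything else (the Mandarin runs)
# [1] "我" "愛" "你"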
We can use str_match and get the English and the rest of the characters separately.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
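Since the asker wished for a single call returning both columns, tidyr::extract() with the same capture groups is a hedged alternative (not str_split, but one line):
library(tibble)
library(tidyr)
tb <- tibble(x = c("I我", "love愛", "you你"))  # the original tibble from the question
extract(tb, x, into = c("EN", "CH"), regex = "([A-Za-z]+)(.*)", remove = FALSE)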
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x, "[A-Za-z]+")
tb$ch <- str_extract(tb$x, "[^A-Za-z]")
In case there's more than one Chinese character, just add + to [^A-Za-z]. (The class [A-Za-z] is used instead of the often-seen [A-z], which also matches the punctuation characters that sit between Z and a.)
Alternatively, use gsub and backreferences:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution matches the printable ASCII range with [ -~]+ and everything outside it (here, the Chinese characters) with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
  x      en    ch
  <chr>  <chr> <chr>
1 I我    I     我
2 love愛 love  愛
3 you你  you   你

R tidytext Remove word if part of relevant bigrams, but keep if not

Using unnest_tokens, I want to create a tidy text tibble which combines two different tokens: single words and bigrams.
The reasoning behind this is that sometimes single words are the more reasonable unit to study and sometimes higher-order n-grams are.
If two words show up as a "sensible" bigram, I want to store the bigram and not the individual words. If the same words show up in a different context (i.e. not as a bigram), then I want to save them as single words.
In the toy example below, "of the" is an important bigram. Thus, I want to remove the single words "of" and "the" if they actually appear as "of the" in the text, but keep them if they show up in other combinations.
library(janeaustenr)
library(dplyr)
library(tidytext)
# make unigrams
tide <- unnest_tokens(austen_books(), output = word, input = text)
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)
# keep only the most frequent bigrams (in reality use a more sensible metric)
keepbigram <- names(sort(table(tide2$bigrams), decreasing = TRUE)[1:10])
keepbigram
tide2 <- tide2[tide2$bigrams %in% keepbigram, ]
# this removes all unigrams which show up in relevant bigrams
biwords <- unlist(strsplit(keepbigram, " "))
biwords
tide[!(tide$word %in% biwords), ]
# want to keep biwords in tide if they are not part of bigrams
You could do this by replacing the bigrams you're interested in with compounds in the text before tokenisation (i.e. before unnest_tokens):
keepbigram_new <- stringi::stri_replace_all_regex(keepbigram, "\\s+", "_")
keepbigram_new
#> [1] "of_the" "to_be" "in_the" "it_was" "i_am" "she_had"
#> [7] "of_her" "to_the" "she_was" "had_been"
Using _ instead of whitespace is common practice for this. stringi::stri_replace_all_regex is pretty much the same as gsub or stringr's str_replace_all, but a little faster and with more features.
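For reference, a roughly equivalent base R version loops gsub over the patterns, which is essentially what vectorize_all = FALSE does internally; a minimal sketch with a toy input:
txt <- c("one of the best", "she had to be there")  # toy input
for (i in seq_along(keepbigram)) {
  txt <- gsub(paste0("\\b", keepbigram[i], "\\b"), keepbigram_new[i], txt)
}
txt
# [1] "one of_the best"    "she_had to_be there"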
Now replace the bigrams in text with these new compounds before tokenisation. I use word boundary regular expressions (\\b) at the beginning and end of the bigrams to not accidentally capture e.g., "of them":
topwords <- austen_books() %>%
  mutate(text = stringi::stri_replace_all_regex(text,
                                                paste0("\\b", keepbigram, "\\b"),
                                                keepbigram_new,
                                                vectorize_all = FALSE)) %>%
  unnest_tokens(output = word, input = text) %>%
  count(word, sort = TRUE) %>%
  mutate(rank = seq_along(word))
Looking at the most common words, the first bigram now appears at rank 40:
topwords %>%
  slice(1:4, 39:41)
#> # A tibble: 7 x 3
#>   word       n  rank
#>   <chr>  <int> <int>
#> 1 and    22515     1
#> 2 to     20152     2
#> 3 the    20072     3
#> 4 of     16984     4
#> 5 they    2983    39
#> 6 of_the  2833    40
#> 7 from    2795    41
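If you later want the compounds displayed as ordinary bigrams again, a small follow-up (my addition) puts the whitespace back after counting:
library(stringr)
topwords %>%
  mutate(word = str_replace_all(word, "_", " "))  # "of_the" becomes "of the" again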

Replace words/phrases within longer strings if they are found in lookup table

I have a data frame of sentences and a data frame of key words and their synonyms. I would like to look through each row of the sentences and replace any synonyms found with the appropriate key word. I've been struggling with this for the last couple of days without much luck, so any advice you can provide would be much appreciated!
Sample data:
sentences <- data.frame(ID = c("1", "2", "3", "4"),
                        text = c("the kitten in the hat",
                                 "a dog with a bone",
                                 "this is a category",
                                 "their cat has no hat"),
                        stringsAsFactors = FALSE)
lookup <- data.frame(key = c("cat", "a", "has"),
                     synonym = c("kitten", "the", "with"),
                     stringsAsFactors = FALSE)
I'd like to get the data back as a data frame much like the original "sentences" only with the synonyms replaced. For example:
ID                 text
 1       a cat in a hat
 2     a dog has a bone
 3   this is a category
 4 their cat has no hat
The actual data consists of 2016 sentences of between 200 and 500 words each. The lookup table contains about 200,000 rows of words and phrases. I've figured out how to replace individual words and phrases without much trouble, but I can't figure out how to do it with a lookup table.
One other note that's causing me grief: I need to match exact words/phrases including special characters. For example "adison's disease" should match "adison's disease" but not "adisons disease". "cotton-roll" should match "cotton-roll" but it should not match "cottonroll" or "cotton roll".
I'm using
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Here is an option with str_replace_all:
library(stringr)
str_replace_all(sentences$text,
                setNames(lookup$key, str_c("\\b(", lookup$synonym, ")\\b")))
#[1] "a cat in a hat" "a dog has a bone" "this is a category" "their cat has no hat"
Or with dplyr:
library(dplyr)
sentences %>%
  mutate(text = str_replace_all(text,
                                setNames(lookup$key,
                                         str_c("\\b(", lookup$synonym, ")\\b"))))
#  ID                 text
#1  1       a cat in a hat
#2  2     a dog has a bone
#3  3   this is a category
#4  4 their cat has no hat
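With 200,000 real lookup rows, some synonyms may contain regex metacharacters (dots, parentheses, plus signs); whether they do is an assumption about the data. A sketch that quotes each synonym literally with ICU's \Q...\E so such characters match as-is:
library(stringr)
pats <- str_c("\\b\\Q", lookup$synonym, "\\E\\b")  # \Q...\E treats the synonym as a literal
str_replace_all(sentences$text, setNames(lookup$key, pats))
#[1] "a cat in a hat"       "a dog has a bone"     "this is a category"   "their cat has no hat"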
Mostly the same as @akrun's answer, but I personally prefer the stringi version of stringr's str_replace_all, which avoids the somewhat unusual named-vector syntax. So here's an alternative:
sentences$text <- stringi::stri_replace_all_regex(
  str = sentences$text,
  pattern = paste0("\\b", lookup$synonym, "\\b"), # add word boundaries
  replacement = lookup$key,
  vectorize_all = FALSE,
  opts_regex = stringi::stri_opts_regex(case_insensitive = TRUE) # set additional options
)
sentences
#>   ID                 text
#> 1  1       a cat in a hat
#> 2  2     a dog has a bone
#> 3  3   this is a category
#> 4  4 their cat has no hat
Using gsubfn, create the translation list trans; then each word (defined by the regular expression below, where \y means word boundary and \w is a word character) is replaced via trans whenever there is a match in text:
library(gsubfn)
trans <- with(lookup, setNames(as.list(key), synonym))
transform(sentences, text = gsubfn("\\y\\w+\\y", trans, text))
giving:
  ID                 text
1  1       a cat in a hat
2  2     a dog has a bone
3  3   this is a category
4  4 their cat has no hat
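One caveat: \y\w+\y matches a single word at a time, so multi-word synonyms (the phrases mentioned in the question) are never seen by this approach. A hedged check with a hypothetical phrase entry:
trans2 <- list("cotton roll" = "cotton-roll")  # hypothetical phrase entry
gsubfn("\\y\\w+\\y", trans2, "use a cotton roll")
# [1] "use a cotton roll"
The string comes back unchanged because no single-word token equals the phrase; for phrase lookups, the str_replace_all or stri_replace_all_regex approaches above are a better fit.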

grepl for finding words

I am trying in R to find the Spanish words among a number of words. I have all the Spanish words in an Excel file that I don't know how to attach to the post (it has more than 80,000 words), and I am trying to check whether some words appear in it or not.
For example:
words = c("Silla", "Sillas", "Perro", "asdfg")
I tried to use this solution:
grepl(paste(spanish_words, collapse = "|"), words)
But there are too many Spanish words, and this gives an error.
So... how can I do it? I also tried this:
toupper(words) %in% toupper(spanish_words)
As you can see, this option only gives TRUE for exact matches, but I need "Sillas" to come back TRUE as well (it is the plural of "silla"). That was the reason I tried grepl first, to catch plurals too.
Any idea?
With the text in a df:
library(tidyverse)
df <- tibble(text = c("some words",
                      "more words",
                      "Perro",
                      "And asdfg",
                      "Comb perro and asdfg"))
Vector of words:
words <- c("Silla", "Sillas", "Perro", "asdfg")
words <- tolower(paste(words, collapse = "|"))
Then use mutate and str_detect:
df %>%
  mutate(
    text = tolower(text),
    spanish_word = str_detect(text, words)
  )
Returns:
  text                 spanish_word
  <chr>                <lgl>
1 some words           FALSE
2 more words           FALSE
3 perro                TRUE
4 and asdfg            TRUE
5 comb perro and asdfg TRUE
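To also catch plurals without building one enormous regex (the likely cause of the original error), a hedged base R sketch compares each word, and the word with a trailing "s" stripped, against the dictionary:
words <- c("Silla", "Sillas", "Perro", "asdfg")
w <- tolower(words)
dict <- tolower(spanish_words)            # the 80,000-word list from the question
w %in% dict | sub("s$", "", w) %in% dict  # crude plural handling: drop a final "s"
# [1]  TRUE  TRUE  TRUE FALSE   (assuming "silla" and "perro" are in the list)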

Removing text containing non-English characters

This is my sample dataset:
Name <- c("apple firm", "苹果 firm", "Ãpple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank)
I would like to delete the Name entries containing non-English characters; for this sample, only "apple firm" should stay.
I tried the tm package, but it only helped me delete the non-English characters rather than the whole entries.
I would check out this related Stack Overflow post for doing the same thing in JavaScript: Regular expression to match non-English characters?
To translate this into R, you could do (to match non-ASCII):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
#         Name Rank
# 1 apple firm    1
And the same match written with Unicode escapes, per that same SO post:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
#         Name Rank
# 1 apple firm    1
Note - we had to take out the NUL character for this to work, so instead of starting at \u0000 or \x00 we start at \u0001 and \x01.
The stringi package has the convenience function stri_enc_isascii:
library(stringi)
stri_enc_isascii(data$Name)
# [1] TRUE FALSE FALSE
As the name suggests, the function checks whether all bytes in a string are in the ASCII set 1, 2, ..., 127 (from ?stri_enc_isascii).
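Applied to the example data, the logical vector can index the rows directly; a minimal sketch:
library(stringi)
data[stri_enc_isascii(as.character(data$Name)), ]  # keep only all-ASCII names
#         Name Rank
# 1 apple firm    1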
An alternative to regex would be to use iconv and then filter for non-NA entries:
library(dplyr)
data <- data %>%
  mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
  filter(!is.na(Name))
What happens in the mutate statement is that the strings are converted from latin1 (aka ISO 8859-1) to ASCII. When a string contains a character that has no ASCII representation, the conversion fails and the whole entry becomes NA, which the filter then drops.
