Removing text containing non-English characters - R

This is my sample dataset:
Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)
I would like to delete rows whose Name contains non-English characters. For this sample, only "apple firm" should stay.
I tried to use the tm package, but it can only help me delete the non-English characters themselves rather than the whole entries.

I would check out this related Stack Overflow post, which does the same thing in JavaScript: Regular expression to match non-English characters?
To translate this into R, you could do (to match non-ASCII):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
#         Name Rank
# 1 apple firm    1
And the same match written with Unicode escapes, per that same SO post:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
#         Name Rank
# 1 apple firm    1
Note - we had to take out the NUL character for this to work. So instead of starting at \u0000 or \x00 we start at \u0001 and \x01.

The stringi package has the convenience function stri_enc_isascii:
library(stringi)
stri_enc_isascii(data$Name)
# [1] TRUE FALSE FALSE
As the name suggests, the function "checks whether all bytes in a string are in the [ASCII] set 1,2,...,127" (from ?stri_enc_isascii).
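To use that logical vector for the original task, you can subset the data frame directly and keep only the all-ASCII rows (a short sketch using the sample data from the question):
library(stringi)
res <- data[stri_enc_isascii(data$Name), ]
res
#         Name Rank
# 1 apple firm    1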

An alternative to regex would be to use iconv and then filter out the NA entries:
library(dplyr)
data <- data %>%
  mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
  filter(!is.na(Name))
What happens in the mutate statement is that the strings are converted from latin1 to ASCII. Here's a list of the characters covered by latin1, aka ISO 8859-1. When a string contains a character that is not in the ASCII set, it cannot be converted and the whole string becomes NA.
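To see why the filter works, it helps to look at the intermediate result of the iconv() call on the sample names; the NA entries are exactly the ones that filter(!is.na(Name)) then drops:
iconv(c("apple firm", "苹果 firm", "Ãpple firm"), from = "latin1", to = "ASCII")
# [1] "apple firm" NA           NA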

Related

readr issue with read_delim not including decimals

I'm trying to read a csv file that looks like this (let's call this test1.csv):
test_1;test_2;test_3;test_4
Test with Ö Ä;20;10,45;15,34
As you can see, the values are separated by ; and not , (in fact , is the decimal separator). I've added "Ö" and "Ä" because my data has German letters in it, which requires ISO-8859-1 in the locale() of read_delim(). Note that this isn't as important; it just explains why I want to use read_delim().
Now I would read all this using read_delim():
read_delim("test1.csv", delim = ";", locale = locale(encoding = 'ISO-8859-1',
decimal_mark = ","))
Giving me this:
# A tibble: 1 x 4
test_1 test_2 test_3 test_4
<chr> <dbl> <dbl> <dbl>
1 "Test with Ö Ä" 20 10.4 15.3
And indeed, I can get the 10.45 value out by using pull(test_3):
[1] 10.45
But now if I simply add five 0s to the 10.45, making it 1000000.45, like so (let's call this test2.csv):
test_1;test_2;test_3;test_4
Test with Ö Ä;20;1000000,45;15,34
and then repeat everything, I completely lose the .45 behind the 1000000:
read_delim("test2.csv", delim = ";",locale = locale(encoding = 'ISO-8859-1',decimal_mark = ",")) %>% pull(test_3)
Rows: 1 Columns: 4
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ";"
chr (1): test_1
dbl (3): test_2, test_3, test_4
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] 1000000
I must be able to retain this information, no? Or control this behaviour? Is this a bug?
This is a printing issue.
If you add %>% print(digits = 22) to the end of your workflow you get:
[1] 1000000.449999999953434
This is not exactly 1000000.45 because what's shown is the closest approximation available in the standard floating-point system.
The default getOption("digits") value is 7; you can set it however you like with options(digits = <your_choice>). In this case anything between digits = 10 and digits = 17 will get you a printed result of "1000000.45"; digits = 18 starts to reveal the underlying approximation.
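So nothing is lost during parsing; only the printed representation is shortened. A small sketch, reusing the test2.csv read from above, to confirm the full value is there:
library(readr)
library(dplyr)
x <- read_delim("test2.csv", delim = ";",
                locale = locale(encoding = "ISO-8859-1", decimal_mark = ",")) %>%
  pull(test_3)
sprintf("%.2f", x)
# [1] "1000000.45"
options(digits = 12)
x
# [1] 1000000.45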

Remove non-English string from rows: R

I have a couple of variables whose data (rows) contain an English string followed by a non-English translation (Hindi).
E.g. Carpenter (Hindi word for carpenter)
Is there a way to strip the rows so they contain only the English part? The Hindi text is causing problems when applying functions, so I want it removed.
Here is another option using base R's iconv(), which strips out only the non-ASCII (here Devanagari) characters:
s <- 'Carpenter (बढ़ई)'
iconv(s, "latin1", "ASCII", sub="")
# [1] "Carpenter ()"
Applying to a data frame:
df <- data.frame(rbind('Carpenter (बढ़ई)',
                       'Cat (बिल्ली)'))
sapply(df, iconv, from = "latin1", to = "ASCII", sub = "")
# [1,] "Carpenter ()"
# [2,] "Cat ()"
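If you also want to drop the empty parentheses that iconv() leaves behind, a small follow-up sketch (it simply assumes the parentheses are empty after the conversion):
x <- c('Carpenter (बढ़ई)', 'Cat (बिल्ली)')
cleaned <- iconv(x, from = "latin1", to = "ASCII", sub = "")
gsub("\\s*\\(\\)", "", cleaned)
# [1] "Carpenter" "Cat"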
I managed to extract the English part of the text by using regular expressions (regex) and the stringr package. Below is an example data frame and the resulting output.
library(tidyverse)
library(stringr)
df <- tibble(
  complete_wrd = c(
    "carpenter (Hindi word for carpenter)",
    "cat (Hindi word for cat)",
    "dog (Hindi word for dog)"))
df %>%
  mutate(engl_wrd = stringr::str_extract(complete_wrd, "^.*?\\S*"))
# A tibble: 3 x 2
complete_wrd engl_wrd
<chr> <chr>
1 carpenter (Hindi word for carpenter) carpenter
2 cat (Hindi word for cat) cat
3 dog (Hindi word for dog) dog
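A slightly more explicit pattern for grabbing just the leading English word is to anchor on letters only (a variant sketch of the same idea):
df %>%
  mutate(engl_wrd = stringr::str_extract(complete_wrd, "^[A-Za-z]+"))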
Here is another way you can try:
library(dplyr)
library(stringr)
df %>%
  mutate(hindi_text = str_remove(hindi_text, "\\(.*\\)"))
# hindi_text
# 1 Construction Labourer
# 2 Other
Data
df <- data.frame(hindi_text = c("Construction Labourer(सभी प्रकार के निर्माण मजदूर)", "Other(उपरोक्त के अतिरिक्त) "))

Str_split is returning only half of the string

I have a tibble, and the vectors within it are character strings with a mix of English and Mandarin characters. I want to split this into two columns, with one returning the English and the other returning the Mandarin. However, I had to resort to the following code to accomplish this:
tb <- tibble(x = c("I我", "love愛", "you你")) #create tibble
en <- str_split(tb[[1]], "[^A-Za-z]+", simplify = T) #split string when R reads a character that is not a-z
ch <- str_split(tb[[1]], "[A-Za-z]+", simplify = T) #split string after R reads all the a-z characters
tb <- tb %>%
  mutate(EN = en[, 1],   # take the first/second columns of the matrices created above,
         CH = ch[, 2]) %>% # since the matrices also contain a column of blank "" values
  select(-x)             # and remove the original x column
tb
I'm guessing there's something wrong with my RegEx that's causing this to occur. Ideally, I would like to write one str_split line that would return both of the columns.
We can use strsplit from base R
do.call(rbind, strsplit(tb$x, "(?<=[A-Za-z])(?=[^A-Za-z])", perl = TRUE))
Or we can use
library(stringr)
tb$en <- str_extract(tb$x,"[[:alpha:]]+")
tb$ch <- str_extract(tb$x,"[^[:alpha:]]+")
We can use str_match and capture the English and the rest of the characters as separate groups.
stringr::str_match(tb$x, "([A-Za-z]+)(.*)")[, -1]
# [,1] [,2]
#[1,] "I" "我"
#[2,] "love" "愛"
#[3,] "you" "你"
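To put these back into the tibble as the two desired columns, the matrix can be assigned directly (a small sketch continuing from the above; column 1 of the full str_match result is the whole match, so the capture groups sit in columns 2 and 3):
m <- stringr::str_match(tb$x, "([A-Za-z]+)(.*)")
tb$en <- m[, 2]
tb$ch <- m[, 3]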
A simple solution using str_extract from package stringr:
library(stringr)
tb$en <- str_extract(tb$x,"[A-z]+")
tb$ch <- str_extract(tb$x,"[^A-z]")
In case there's more than one Chinese character, just add + to [^A-z]. (Note that [A-z] also covers a few punctuation characters that sit between Z and a in ASCII; [A-Za-z] is the stricter range.)
Alternatively, use gsub and a backreference:
tb$en <- gsub("(\\w+).$", "\\1", tb$x)
tb$ch <- gsub("\\w+(.$)", "\\1", tb$x)
Yet another solution matches the printable ASCII characters with [ -~]+ and everything outside that range (here the Chinese characters) with [^ -~]+:
tb$en <- str_extract(tb$x, "[ -~]+")
tb$ch <- str_extract(tb$x, "[^ -~]+")
Result:
tb
# A tibble: 3 x 3
x en ch
<chr> <chr> <chr>
1 I我 I 我
2 love愛 love 愛
3 you你 you 你

R: parse_number fails if the string contains a dot

parse_number from readr fails if the character string contains a .
It works well with other special characters.
library(readr)
#works
parse_number("%ç*%&23")
#does not work
parse_number("art. 23")
Warning: 1 parsing failure.
row col expected actual
1 -- a number .
[1] NA
attr(,"problems")
# A tibble: 1 x 4
row col expected actual
<int> <int> <chr> <chr>
1 1 NA a number .
Why is this happening?
Update:
The expected result would be 23.
There is a space after the dot, which is causing the failure. What is the expected number from this sequence (0.23 or 23)?
parse_number seems to look for decimal and grouping separators as defined by your locale; see the documentation here: https://www.rdocumentation.org/packages/readr/versions/1.3.1/topics/parse_number
You can opt to change the locale using the following (here the grouping_mark is set to a dot followed by a space):
parse_number("art. 23", locale=locale(grouping_mark=". ", decimal_mark=","))
Output: 23
or remove the space first (note this yields 0.23, because the dot is then read as a decimal mark):
parse_number(gsub(" ", "" , "art. 23"))
Output: 0.23
Edit: To handle dots both in abbreviations and in numbers, use the following:
library(stringr)
> as.numeric(str_extract("art. 23", "\\d+\\.*\\d*"))
[1] 23
> as.numeric(str_extract("%ç*%&23", "\\d+\\.*\\d*"))
[1] 23
The above uses regular expressions to identify number patterns within strings.
\\d+ finds one or more digits
\\.* finds an optional decimal point
\\d* finds any remaining digits
Note: I am no expert on regex but there are plenty of other resources that will make you one
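If you need this across a whole column, one way is to wrap the extraction in a small helper (a sketch; the name extract_num is just illustrative):
library(stringr)
extract_num <- function(x) as.numeric(str_extract(x, "\\d+\\.*\\d*"))
extract_num(c("art. 23", "%ç*%&23"))
# [1] 23 23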

Using regular expression in string replacement

I have a broken csv file that I am attempting to read into R and repair using a regular expression.
The reason it is broken is that it contains some fields which include a comma but are not wrapped in double quotes. So I have to use a regular expression to find these fields and wrap them in double quotes.
Here is an example of the data source:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00
So you can see that in the third row, the Price field contains a comma but it is not wrapped in double quotes. This breaks the read.table function.
My approach is to use readLines and str_replace_all to wrap the prices that contain commas in double quotes. But I am not good at regular expressions and am stuck.
vector <- readLines(file)
vector_temp <- str_replace_all(vector, ",\\$[0-9]+,\\d{3}\\.\\d{2}", ",\"\\$[0-9]+,\\d{3}\\.\\d{2}\"")
I want the output to be:
DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,"$1,250.00"
With this format, I can read into R.
Appreciate any help!
lines <- readLines(textConnection(object="DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"))
library(stringi)
library(tidyverse)
stri_split_regex(lines, ",", n = 3, simplify = TRUE) %>%
  as_data_frame() %>%
  docxtractr::assign_colnames(1)
## DataField1 DataField2 Price
## 1 ID1 Value1
## 2 ID2 Value2 $500.00
## 3 ID3 Value3 $1,250.00
From there you can use readr::write_csv() or write.csv().
The extra facilities in the stringi or stringr packages do not seem needed. gsub seems perfectly suited for this. You just need to understand capture groups, formed with paired parentheses (brackets to Brits), and the double-backslash-number convention (\\1, \\2, ...) for referring to capture-group matches in the replacement argument:
txt <- "DataField1,DataField2,Price, extra
ID1,Value1, ,
ID2,Value2,$500.00,
ID3,Value3,$1,250.00, o"
vector <- gsub("([$][0-9]{1,3}([,]([0-9]{3})){0,10}([.][0-9]{0,2}))",
               "\"\\1\"",
               readLines(textConnection(txt)))
> read.csv(text=vector)
DataField1 DataField2 Price extra
1 ID1 Value1
2 ID2 Value2 $500.00
3 ID3 Value3 $1,250.00 o
You are putting quotes around a specific sequence: one to three digits, possibly followed by repeated groups of a comma and three digits, and an optional period with up to two digits. There might be earlier SO questions about formatting as "currency".
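For completeness, the str_replace_all attempt from the question can be made to work with the same capture-group idea: capture the price once and refer back to it with \\1 in the replacement instead of repeating the pattern there (a sketch along those lines):
library(stringr)
txt <- "DataField1,DataField2,Price
ID1,Value1,
ID2,Value2,$500.00
ID3,Value3,$1,250.00"
vector <- readLines(textConnection(txt))
vector_temp <- str_replace_all(vector,
                               "(\\$[0-9]{1,3}(,[0-9]{3})*\\.[0-9]{2})",
                               "\"\\1\"")
read.csv(text = vector_temp)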
Here are some solutions:
1) read.pattern This uses read.pattern in the gsubfn package to read in a file (assumed to be called sc.csv) such that the capture groups, i.e. the parenthesized portions of the pattern, are the fields. This will read in the file and process it all in one step, so it is not necessary to use readLines first.
^(.*?), at the beginning of the pattern matches everything from the start up to the first comma. Then (.*?), matches up to the next comma, and finally (.*)$ matches everything else to the end. Normally * is greedy, i.e. it matches as much as it can, but the question mark after it makes it ungreedy. We need to specify perl = TRUE so that it uses Perl regular expressions, since by default gsubfn uses tcl regular expressions based on Henry Spencer's regex parser, which does not support *?. If you would rather have character columns instead of factor columns, add the as.is = TRUE argument to read.pattern.
The final line of code removes the $ and , characters from the Price column and converts it to numeric. (Omit this line if you actually want it formatted.)
library(gsubfn)
DF <- read.pattern("sc.csv", pattern = "^(.*?),(.*?),(.*)$", perl = TRUE, header = TRUE)
DF$Price <- as.numeric(gsub("[$,]", "", DF$Price)) ##
giving:
> DF
DataField1 DataField2 Price
1 ID1 Value1 NA
2 ID2 Value2 500
3 ID3 Value3 1250
2) sub This uses a very simple regular expression (just a single character match) and no packages. Using vector as defined in the question, this replaces the first two commas with semicolons. Then it can be read in using sep = ";":
read.table(text = sub(",", ";", sub(",", ";", vector)), header = TRUE, sep = ";")
Add the line marked ## in (1) if you want numeric prices.
