R string: UTF-8 encoding and Swedish character treatment

I have a problem getting the Swedish characters ä, ö, å to display presentably in R.
I got my data directly from an MS SQL database.
Here are some examples:
markets <- c("Caf\xe9 ","Restaurang kv\xe4ll ","Barnomsorg tillagningsk\xf6k ","Folkh\xf6gskola ")
Then I use gsub to remove the trailing space:
market <- gsub(" ", "", markets, fixed = TRUE)
I got this error:
Error in gsub(" ", "", market, fixed = TRUE) :
input string 3 is invalid UTF-8
Then I use this command:
markets_new=gsub(" ", "", markets)
then I get strange Chinese-looking characters in the strings:
"Caf攼㸹"
"Restauranglunch+kv攼㸴ll"
"Barnomsorgtillagningsk昼㸶k"
"Folkh昼㸶gskola"
I tried changing the default encoding settings of RStudio by following:
https://yihui.name/en/2018/11/biggest-regret-knitr/?fbclid=IwAR2E5Lp0zjS51fcdjgZ1tej0sg5EBxfG8sNitt-cUA2XEshnT3lNCHNQ3Do
It does not help. I also tried using gsub() to substitute the characters, but that does not seem to work either.
One more thing: if I use
write.csv(markets, 'submarket product view.csv', row.names = F)
then in my csv file I see the following:
"Caf<e9> "
"Restaurang kv<e4>ll "
"Barnomsorg tillagningsk<f6>k "
"Folkh<f6>gskola "
"Sm<f6>rg<e5>s/salladsrestaurang "
I think <e9> is é (e with an acute accent), <e4> is ä, <f6> is ö, and <e5> is å.
Any suggestions for how to treat this?

Thanks to @Wiktor Stribiżew, this solution works best:
df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"), fixed = TRUE)

Try this (the \xe9/\xe4/\xf6 bytes are latin1, so mark the strings accordingly):
Encoding(markets) <- "latin1"
markets <- trimws(markets)
#[1] "Café" "Restaurang kväll" "Barnomsorg tillagningskök" "Folkhögskola"

Related

Within a column, I'd like to gsub each row of string values and remove any value that matches a list of values I created

Context
I am working with a messy data file right now. I have a list of comments that I'd like to sort through to grab the most common combinations of phrases. Example phrases would be "Did not qualify because of X and Y" and "Did not qualify because of Y and X". I am trying to remove stop words so I can match X and Y as a common phrase. I was able to do this easily for common single words, but phrases are a little more difficult. Below is my code, for context.
Create Datafile
library(dplyr)    # for %>% and filter
library(stringi)  # for stri_trans_general
library(tidytext) # for stop_words and unnest_tokens
dat1 <- dat %>% filter(Action != "Exclude")
Remove problem characters
dat1$Comments <- stri_trans_general(dat1$Comments, "latin-ascii")
dat1$Comments <- gsub(pattern='<[^<>]*>', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern='\n', replacement=" ", x=dat1$Comments)
dat1$Comments <- gsub(pattern="[[:punct:]]", replacement=" ", x=dat1$Comments)
Remove stop words (Where my problem is)
sw <- paste0("\\b(", paste0(stop_words$word, collapse="|"), ")\\b")
dat1$Comments <- lapply(dat1$Comments, function(x) (gsub(pattern=sw, replacement=" ", x)))
Remove extra spaces between words
dat1$Comments <- trimws(gsub("\\s+", " ", dat1$Comments))
dat1$Comments <- gsub("(^[[:space:]]*)|([[:space:]]*$)", "", dat1$Comments)
Sweet Data
top_phrases <- data.frame(text = dat1$Comments) %>%
unnest_tokens(bigram, text, 'ngrams', n = Length, to_lower = TRUE) %>%
count(bigram, sort = TRUE)
Issue
This is the error that pops up; it traces back to the gsub call:
Error in gsub(pattern = sw, replacement = " ", x) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
If anyone is curious, here is what is stored in "sw"
"\\b(a|a's|able|about|above|according|accordingly|across|actually|after|afterwards|again|against|ain't|all|allow|allows|almost|alone|along|already|also|although|always|am|among|amongst|an|and|another|any|anybody|anyhow|anyone|anything|anyway|anyways|anywhere|apart|appear|appreciate|appropriate|are|aren't|around|as|aside|ask|asking|associated|at|available|away|awfully|b|be|became|because|become|becomes|becoming|been|before|beforehand|behind|being|believe|below|beside|besides|best|better|between|beyond|both|brief|but|by|c|c'mon|c's|came|can|can't|cannot|cant|cause|causes|certain|certainly|changes|clearly|co|com|come|comes|concerning|consequently|consider|considering|contain|containing|contains|corresponding|could|couldn't|course|currently|d|definitely|described|despite|did|didn't|different|do|does|doesn't|doing|don't|done|down|downwards|during|e|each|edu|eg|eight|either|else|elsewhere|enough|entirely|especially|et|etc|even|ever|every|everybody|everyone|everything|everywhere|ex|exactly|example|except|f|far|few|fifth|first|five|followed|following|follows|for|former|formerly|forth|four|from|further|furthermore|g|get|gets|getting|given|gives|go|goes|going|gone|got|gotten|greetings|h|had|hadn't|happens|hardly|has|hasn't|have|haven't|having|he|he's|hello|help|hence|her|here|here's|hereafter|hereby|herein|hereupon|hers|herself|hi|him|himself|his|hither|hopefully|how|howbeit|however|i|i'd|i'll|i'm|i've|ie|if|ignored|immediate|in|inasmuch|inc|indeed|indicate|indicated|indicates|inner|insofar|instead|into|inward|is|isn't|it|it'd|it'll|it's|its|itself|j|just|k|keep|keeps|kept|know|knows|known|l|last|lately|later|latter|latterly|least|less|lest|let|let's|like|liked|likely|little|look|looking|looks|ltd|m|mainly|many|may|maybe|me|mean|meanwhile|merely|might|more|moreover|most|mostly|much|must|my|myself|n|name|namely|nd|near|nearly|necessary|need|needs|neither|never|nevertheless|new|next|nine|no|nobody|non|none|noone|nor|normally|not|nothing|novel|now|nowhere|o|obviously|of|off|often|oh|ok|okay|old|on|once|one|ones|only|onto|or|other|others|otherwise|ought|our|ours|ourselves|out|outside|over|overall|own|p|particular|particularly|per|perhaps|placed|please|plus|possible|presumably|probably|provides|q|que|quite|qv|r|rather|rd|re|really|reasonably|regarding|regardless|regards|relatively|respectively|right|s|said|same|saw|say|saying|says|second|secondly|see|seeing|seem|seemed|seeming|seems|seen|self|selves|sensible|sent|serious|seriously|seven|several|shall|she|should|shouldn't|since|six|so|some|somebody|somehow|someone|something|sometime|sometimes|somewhat|somewhere|soon|sorry|specified|specify|specifying|still|sub|such|sup|sure|t|t's|take|taken|tell|tends|th|than|thank|thanks|thanx|that|that's|thats|the|their|theirs|them|themselves|then|thence|there|there's|thereafter|thereby|therefore|therein|theres|thereupon|these|they|they'd|they'll|they're|they've|think|third|this|thorough|thoroughly|those|though|three|through|throughout|thru|thus|to|together|too|took|toward|towards|tried|tries|truly|try|trying|twice|two|u|un|under|unfortunately|unless|unlikely|until|unto|up|upon|us|use|used|useful|uses|using|usually|uucp|v|value|various|very|via|viz|vs|w|want|wants|was|wasn't|way|we|we'd|we'll|we're|we've|welcome|well|went|were|weren't|what|what's|whatever|when|whence|whenever|where|where's|whereafter|whereas|whereby|wherein|whereupon|wherever|whether|which|while|whither|who|who's|whoever|whole|whom|whose|why|will|willing|wish|with|within|without|won't|wonder|would|would|wouldn't|x|y|yes|yet|you|you'd|you'll|you're|you've
|your|yours|yourself|yourselves|z|zero|i|me|my|myself|we|our|ours|ourselves|you|your|yours|yourself|yourselves|he|him|his|himself|she|her|hers|herself|it|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|would|should|could|ought|i'm|you're|he's|she's|it's|we're|they're|i've|you've|we've|they've|i'd|you'd|he'd|she'd|we'd|they'd|i'll|you'll|he'll|she'll|we'll|they'll|isn't|aren't|wasn't|weren't|hasn't|haven't|hadn't|doesn't|don't|didn't|won't|wouldn't|shan't|shouldn't|can't|cannot|couldn't|mustn't|let's|that's|who's|what's|here's|there's|when's|where's|why's|how's|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|around|as|ask|asked|asking|asks|at|away|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|best|better|between|big|both|but|by|came|can|cannot|case|cases|certain|certainly|clear|clearly|come|could|did|differ|different|differently|do|does|done|down|down|downed|downing|downs|during|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|had|has|have|having|he|her|here|herself|high|high|high|higher|highest|him|himself|his|how|however|i|if|important|in|interest|interested|interesting|interests|into|is|it|its|itself|just|keep|keeps|kind|knew|know|known|knows|large|largely|last|later|latest|least|less|let|lets|like|likely|long|longer|longest|made|make|making|man|many|may|me|member|members|men|might|more|most|mostly|mr|mrs|much|must|my|myself|necessary|need|needed|needing|needs|never|new|new|newer|newest|next|no|nobody|non|noone|not|nothing|now|nowhere|number|numbers|of|off|often|old|older|oldest|on|once|one|only|open|opened|opening|opens|or|order|ordered|ordering|orders|other|others|our|out|over|part|parted|parting|parts|per|perhaps|place|places|point|pointed|pointing|points|possible|present|presented|presenting|presents|problem|problems|put|puts|quite|rather|really|right|right|room|rooms|said|same|saw|say|says|second|seconds|see|seem|seemed|seeming|seems|sees|several|shall|she|should|show|showed|showing|shows|side|sides|since|small|smaller|smallest|some|somebody|someone|something|somewhere|state|states|still|still|such|sure|take|taken|than|that|the|their|them|then|there|therefore|these|they|thing|things|think|thinks|this|those|though|thought|thoughts|three|through|thus|to|today|together|too|took|toward|turn|turned|turning|turns|two|under|until|up|upon|us|use|used|uses|very|want|wanted|wanting|wants|was|way|ways|we|well|wells|went|were|what|when|where|whether|which|while|who|whole|whose|why|will|with|within|without|work|worked|working|works|would|year|years|yet|you|young|younger|youngest|your|yours)\\b"
Both TRE (the default regex engine used in base R regex functions) and PCRE (the engine used in base R regex functions with perl=TRUE) have quite hard limits on pattern length.
In your case, the stringr regex functions will work better, as they use the ICU regex engine, which supports much longer regex patterns.
So, you may replace
gsub(pattern=sw, replacement=" ", x)
with
stringr::str_replace_all(x, sw, " ")
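In context, the lapply() wrapper can also go, since str_replace_all() is vectorised over its input. A minimal sketch, assuming sw and dat1 are built as in the question:
library(stringr)
dat1$Comments <- str_replace_all(dat1$Comments, sw, " ")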

How can I replace emojis with text and treat them as single words?

I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.
A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.
Terms like "heart" can have very different meanings, as can be seen with "red heart ufef" and "broken heart".
The function replace_emoji_identifier() doesn't help either, as the identifiers make analysis hard.
Dummy data set, reproducible by using dput() (including the step to force lower case):
Emoji_struct <- c(
list(content = "🔥🔥 wow", "😮 look at that", "😤this makes me angry😤", "😍❤\ufe0f, i love it!"),
list(content = "😍😍", "😊 thanks for helping", "😢 oh no, why? 😢", "careful, challenging ❌❌❌")
)
Current coding (data_orig is a list of several files):
library(textclean) # replace_contraction, replace_emoji, replace_emoticon, replace_hash, replace_word_elongation
library(tm)        # stripWhitespace; the rest is base R
#pre-processing:
data <- gsub("'", "", data)
data <- replace_contraction(data)
data <- replace_emoji(data) # replace emoji with words
data <- replace_emoticon(data) # replace emoticon with words
data <- replace_hash(data, replacement = "")
data <- replace_word_elongation(data)
data <- gsub("[[:punct:]]", " ", data) #replace punctuation with space
data <- gsub("[[:cntrl:]]", " ", data)
data <- gsub("[[:digit:]]", "", data) #remove digits
data <- gsub("^[[:space:]]+", "", data) #remove whitespace at beginning of documents
data <- gsub("[[:space:]]+$", "", data) #remove whitespace at end of documents
data <- stripWhitespace(data)
Desired output:
[1] list(content = c("fire fire wow",
"facewithopenmouth look at that",
"facewithsteamfromnose this makes me angry facewithsteamfromnose",
"smilingfacewithhearteyes redheart \ufe0f, i love it!"),
content = c("smilingfacewithhearteyes smilingfacewithhearteyes",
"smilingfacewithsmilingeyes thanks for helping",
"cryingface oh no, why? cryingface",
"careful, challenging crossmark crossmark crossmark"))
Any ideas? Lower case would work, too.
Best regards. Stay safe. Stay healthy.
Answer
Replace the default conversion table used by replace_emoji with a version where spaces and punctuation are removed:
hash2 <- lexicon::hash_emojis
hash2$y <- gsub("[[:space:]]|[[:punct:]]", "", hash2$y)
replace_emoji(unlist(Emoji_struct), emoji_dt = hash2)
Example
Single character string:
replace_emoji("wow!😮 that is cool!", emoji_dt = hash2)
#[1] "wow! facewithopenmouth that is cool!"
Character vector:
replace_emoji(c("1: 😊", "2: 😍"), emoji_dt = hash2)
#[1] "1: smilingfacewithsmilingeyes "
#[2] "2: smilingfacewithhearteyes "
List:
list("list_element_1: 🔥", "list_element_2: ❌") %>%
lapply(replace_emoji, emoji_dt = hash2)
#[[1]]
#[1] "list_element_1: fire "
#
#[[2]]
#[1] "list_element_2: crossmark "
Rationale
To convert emojis to text, replace_emoji uses lexicon::hash_emojis as a conversion table (a hash table):
head(lexicon::hash_emojis)
# x y
#1: <e2><86><95> up-down arrow
#2: <e2><86><99> down-left arrow
#3: <e2><86><a9> right arrow curving left
#4: <e2><86><aa> left arrow curving right
#5: <e2><8c><9a> watch
#6: <e2><8c><9b> hourglass done
This is an object of class data.table. We can simply modify the y column of this hash table to remove all spaces and punctuation. Note that this also allows you to add new byte representations with an accompanying replacement string.
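As a sketch of that last point, a custom mapping can be appended as a new row, where x holds the emoji's UTF-8 bytes in the <hex> notation shown above and y holds the replacement text. The byte sequence below is for U+1F916 (robot face) and is only useful if that sequence is missing from the shipped table:
hash2 <- rbind(hash2, data.frame(x = "<f0><9f><a4><96>", y = "robotface"))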

How do I remove line breaks from a string?

I am extracting tweets using the Twitter API in R.
I have been saving my results to CSV using write.csv2, which is fine, but carriage returns in the tweet text are causing multiple rows in the spreadsheet for a single tweet.
I've tried using str_replace_all, but it doesn't seem to work for me, and I can't find anything explaining why.
Here is my code:
library(twitteR) # searchTwitter, twListToDF
library(stringr) # str_replace_all
searchTags = c("Galwaybikeshare", "Corkbikeshare", "dublinbikes", "BelfastBikes", "SantanderCycles", "CitiBikeNYC", "obike", "Hubway", "bicing")
additionalParams = c("-rt -http")
searchString <- paste((paste(searchTags[1:9], collapse = " OR ")), additionalParams, collapse = "")
tweets_list <- searchTwitter(searchString, n=20, lang = "en", resultType = 'recent')
str_replace_all(tweets_list, "[\r\n]" , "")
tweets.df <- twListToDF(tweets_list)
todayDate <- Sys.Date()
tweetArchive <- paste("BikeShareTweets ", todayDate, ".csv", sep ="")
write.csv2(tweets.df, file = tweetArchive)
The text below is an example of a tweet that causes the issue:
"TransitNinja205: 0.01% of the budget for 5-borough #CitiBikeNYC,\nand 0.2% for #FairFares. #NYCmayor #NYCmayorsOffice #progressive"
Why isn't my str_replace_all removing the \n from the text?
stringr::str_replace_all works; you’re just ignoring the result. To fix it:
tweets_list = str_replace_all(tweets_list, "[\r\n]" , "")
stringr::str_remove_all will also do this for you.
tweets_list = str_remove_all(tweets_list, "[\r\n]")
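Since str_replace_all() coerces the list of status objects to strings, it is arguably cleaner to run the cleanup on the text column after twListToDF(); a sketch assuming the variables from the question:
tweets.df <- twListToDF(tweets_list)
tweets.df$text <- stringr::str_remove_all(tweets.df$text, "[\r\n]")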

String: extract wanted character instead of removing unwanted

I was wondering if there is a function in R, like KeepChar("abcde....xyz", some_text), that you feed all the characters you want to keep and that returns the string with only those characters left in it. Here the function would keep only the lower-case letters of the alphabet (and the space). I would like something that looks like this:
some_text <- "Hel-_l0o W#oRr^ld"
some_text <- KeepChar("abcdefghijklmnopqrstuvwxyz ", some_text)
some_text
> "hello world"
I feel that the removal methods I am currently using, gsub("#\\w+", "", some_text), tm_map(some_text, stripWhitespace), or str_replace_all(some_text, "[^[:graph:]]", " "), take a lot of time and lines of code, with a constant risk of forgetting to remove a specific character, especially when you already know exactly what you want to keep.
I ask because I am coding a platform to run sentiment analysis on texts from various sources like Twitter, and I want to make sure I do not forget to remove any unwanted characters.
To handle a pattern without using regex, you can try this:
string <- "Hel-_l0o W#oRr^ld"
pattern <- "abcdefghijklmnopqrstuvwxyz"
KeepChar = function(pattern, string){
  splitted_string <- unlist(strsplit(string, ""))
  splitted_pattern <- unlist(strsplit(pattern, ""))
  ids_string <- splitted_string %in% splitted_pattern
  return(paste(splitted_string[ids_string], sep = "", collapse = ""))
}
some_text <- KeepChar(pattern = pattern, string = string)
You can try this:
some_text <- "Hel-_l0o W#oRr^ld"
gsub("[^[:alpha:] ]", "", some_text)#will return all characters
gsub("[^[:lower:] ]", "", some_text)#will return only lower characters alongwith space
gsub("[^[:upper:] ]", "", some_text)#will return higher case characters alongwith space
You can also look at https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html to see the character classes available in R.
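For completeness, the same keep-list idea can be written with a single negated character class; a sketch that assumes the kept characters include no character-class metacharacters such as ^, ], or -:
KeepChar <- function(chars, string) gsub(paste0("[^", chars, "]"), "", string)
KeepChar("abcdefghijklmnopqrstuvwxyz ", "Hel-_l0o W#oRr^ld")
#[1] "ello orld"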

Search and replace a character

I have a string like this:
x <- c("This is a test (120)")
I need to remove the space between test and ( so that the text looks like this:
x <- c("This is a test(120)")
I tried this:
s <- gsub("\t\v\(", "", x)
It is not working; any input would be appreciated.
Using a lookahead:
gsub("\\s+(?=\\()", "", x, perl=TRUE)
[1] "This is a test(120)"
The answer depends on your exact requirements, though. Do you want to remove all spaces in front of opening brackets? Or just one? Or only in front of brackets containing numbers?
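Sketches for those variants (all rely on perl = TRUE for the lookahead):
gsub("\\s+(?=\\()", "", x, perl = TRUE)        # any whitespace run before "("
gsub("\\s(?=\\()", "", x, perl = TRUE)         # exactly one whitespace character before "("
gsub("\\s+(?=\\(\\d+\\))", "", x, perl = TRUE) # only before brackets containing numbers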
One simple approach is to use fixed = TRUE, as in:
gsub(" (", "(", x, fixed = TRUE)
or:
gsub(" \\(", "\\(", x)
You have to "double escape" things in R: once for R and once for the regex:
s <- gsub('\\s\\(', '(', x)
That said, depending on your specific use case, you might want this to be more robust:
s <- gsub('(.+) \\((.+)\\)', '\\1(\\2)', x)
