I have used adist to calculate the number of characters that differ between two strings:
a <- "#IvoryCoast TENNIS US OPEN Clément «Un beau combat» entre Simon et Cilic"
b <- "Clément «Un beau combat» entre Simon et Cilic"
adist(a,b) # result 27
Now I would like to extract all the occurrences of those characters that differ. In my example, I would like to get the string "#IvoryCoast TENNIS US OPEN ".
I tried and used:
paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")
But the obtained result is not what I expected!
#IvysTENOP
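In this case you can use gsub. Pass fixed = TRUE so that b is matched literally rather than interpreted as a regular expression:
> a <- "#IvoryCoast TENNIS US OPEN Clément «Un beau combat» entre Simon et Cilic"
> b <- "Clément «Un beau combat» entre Simon et Cilic"
> gsub(b, "", a, fixed = TRUE)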
[1] "#IvoryCoast TENNIS US OPEN "
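As a small sketch of why literal matching matters here: if the shared part contains regex metacharacters such as parentheses, a plain regex match can silently fail and return the input unchanged. The strings below are made up for illustration (they are not the ones from the question).

```r
# Hypothetical strings: the shared part contains "(" and ")"
a <- "#IvoryCoast TENNIS US OPEN Simon vs. Cilic (final)"
b <- "Simon vs. Cilic (final)"

# As a regex, "(final)" is a capture group that matches "final",
# so the pattern does not match the literal parentheses in a
as_regex <- sub(b, "", a)

# With fixed = TRUE the whole of b is treated as plain text
as_literal <- sub(b, "", a, fixed = TRUE)

as_regex    # a is returned unchanged
as_literal  # "#IvoryCoast TENNIS US OPEN "
nchar(as_literal)  # 27
```

This only works when b occurs in a as one contiguous piece, which is the situation described in the question.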
Building on your paste/Reduce attempt, split on spaces instead of on individual characters, so that whole words are compared:
paste(Reduce(setdiff, strsplit(c(a, b), split = " ")), collapse = " ")
#[1] "#IvoryCoast TENNIS US OPEN"
Or, if you want to get separated items, with setdiff and strsplit:
setdiff(strsplit(a," ")[[1]],strsplit(b," ")[[1]])
#[1] "#IvoryCoast" "TENNIS" "US" "OPEN"
The following are the results I expect:
> title = "La La Land (2016/I)"
[1]"(2016" #result
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
[1]"(2013" #result
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
[1]"(2015" #result
==================================================================
The following is what I got by applying sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1"):
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(1500-1800) (#1.1)" #result. However, I expected it to be "(2013)"
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1] "(2016/I)" #result as I expect
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
[1]"(2017)" # result. However, I expect it to be "(2015)"
The following is what I got by applying sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1"):
> title = "La La Land (2016/I)"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "La La Land (2016/I)" #result. However, I expect it to be "(2016)"
> title = "dfajfj(2015)asdfjuwer f(2017)fa.erewr6"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2017)" #result. However, I expect it to be "(2015)"
> title = "_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_"
> sub(pattern=".*(\\(\\d{4}\\)).*", title, replacement="\\1")
[1] "(2013)" #result as I expect
I checked the documentation of sub, which says "sub performs replacement of the first match". In this case, the first match should be (2013).
In short, I am trying to write a sub() command that returns the first occurrence of a year in a string.
I guess there is something wrong with my code, but I couldn't find it. I would appreciate any help.
==================================================================
In fact, my ultimate goal is to extract the year of every movie. However, I don't know how to do it in one step, so I decided to first find the year in the form (dddd and then use sub(pattern="\\((\\d{4}).*", a, replacement="\\1") to get just the digits of the year.
For example:
> a= "(2015"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
> a= "(2015)"
> sub(pattern="\\((\\d{4}).*", a, replacement="\\1")
[1] "2015"
=================updated 05/29/2017 22:51PM=======================
The str_extract in akrun's answer works well with my dataset.
However, sub() doesn't work for all of my data. Below is what I did; the code fails for some of my 500 records. I would really appreciate it if anyone could point out the mistake in my code, because I cannot figure it out myself. Thank you very much.
> t1
[1] "Man Who Fell to Earth (Remix) (2010) (TV)"
> t2
[1] "Manual pr\u0087ctico del amigo imaginario (abreviado) (2008)"
> title = c(t1,t2)
> x=gsub(pattern=".*(\\(\\d{4}.*\\)).*", title, replacement="\\1")
> x
[1] "(2010) (TV)" "(2008)"
> sub(pattern="\\((.*)\\).*", x, replacement="\\1")
[1] "2010) (TV" "2008"
However, my goal is to get 2010 and 2008. My code works with t2 but fails with t1
We can match zero or more characters that are not a ( ([^(]*) from the start (^) of the string, followed by a ( and four digits (\\([0-9]{4}), which we capture as a group ((...)), followed by any remaining characters (.*), and replace the whole match with the backreference (\\1) to the captured group:
sub("^[^(]*(\\([0-9]{4}).*", "\\1", title)
#[1] "(2016" "(2013" "(2015"
If we need to remove the (, then capture only the digits that follow the \\( as a group:
sub("^[^(]*\\(([0-9]{4}).*", "\\1", title)
#[1] "2016" "2013" "2015"
Or with str_extract, we use a regex lookbehind to extract the four digits that follow the (:
library(stringr)
str_extract(title, "(?<=\\()[0-9]{4}")
#[1] "2016" "2013" "2015"
Or with regmatches/regexpr
regmatches(title, regexpr("(?<=\\()([0-9]{4})", title, perl = TRUE))
#[1] "2016" "2013" "2015"
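The attempts in the question return the last year because the leading .* is greedy: it consumes as much of the string as it can before the capture group gets a chance to match. Here is a minimal sketch of two ways around that, using a lazy prefix with perl = TRUE or a lookbehind with regexpr. It also covers titles like the one in the update, where the first parenthesis does not hold a year. The vector name titles and the variable names are only for illustration.

```r
# Sample titles taken from the question and its update
titles <- c("La La Land (2016/I)",
            "dfajfj(2015)asdfjuwer f(2017)fa.erewr6",
            "Man Who Fell to Earth (Remix) (2010) (TV)")

# Lazy prefix (.*?): stop at the first "(" that is followed by four digits
year_sub <- sub("^.*?\\(([0-9]{4}).*$", "\\1", titles, perl = TRUE)

# Same idea with a lookbehind; regexpr() reports only the first match
year_rm <- regmatches(titles,
                      regexpr("(?<=\\()[0-9]{4}", titles, perl = TRUE))

year_sub  # "2016" "2015" "2010"
year_rm   # "2016" "2015" "2010"
```

One difference to keep in mind: sub() returns the input unchanged when nothing matches, whereas regmatches() drops non-matching elements, so the result can be shorter than the input.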
data
title <- c("La La Land (2016/I)",
"_The African Americans: Many Rivers to Cross with Henry Louis Gates, Jr._ (2013) _The Black Atlantic (1500-1800) (#1.1)_",
"dfajfj(2015)asdfjuwer f(2017)fa.erewr6")
I have a dataset in R that lists a number of company names, and I want to remove words like "Inc", "Company", "LLC", etc. as part of a clean-up effort. I have the following sample data:
sampleData
      Location             Company
1 New York, NY         XYZ Company
2  Chicago, IL Consulting Firm LLC
3    Miami, FL         Smith & Co.
Words I do not want to include in my output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.
removeWords <- function(str, stopwords) {
x <- unlist(strsplit(str, " "))
paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company,stopwords)
The output for the above function looks like this:
[1] "XYZ Company Consulting Firm Smith"
The output should be:
      Location         Company
1 New York, NY     XYZ Company
2  Chicago, IL Consulting Firm
3    Miami, FL           Smith
Any help would be appreciated.
We can use the 'tm' package:
library(tm)
stopwords = readLines('stopwords.txt') # your stop words file
x = df$Company # Company column data
x = removeWords(x, stopwords) # remove the stop words
df$company_new <- x # add the result as a new column and check
With a little care over the stopwords (escaping the "." in "Co." so that it is not treated as a regex wildcard, and adding spaces where needed), gsub also works. The previous answer is preferable if you don't want to keep an eye on the stopwords yourself:
stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")
gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company" "Consulting Firm " "Smith "
df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
# Location Company
#1 New York, NY XYZ Company
#2 Chicago, IL Consulting Firm
#3 Miami, FL Smith
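The function in the question collapses everything into one string because unlist(strsplit(...)) merges the words of all rows before pasting. Here is a minimal sketch that keeps the same whole-word logic but applies it to each row separately; the name remove_stopwords is hypothetical, and the data frame is rebuilt from the sample shown in the question.

```r
# Sample data rebuilt from the question
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)
stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# Hypothetical helper: split each string into words, drop the stopwords,
# and paste the remaining words back together, one result per input string
remove_stopwords <- function(strings, stopwords) {
  vapply(strsplit(strings, " ", fixed = TRUE),
         function(words) paste(words[!words %in% stopwords], collapse = " "),
         character(1))
}

sampleData$Company <- remove_stopwords(sampleData$Company, stopwords)
sampleData$Company  # "XYZ Company" "Consulting Firm" "Smith"
```

Because whole words are compared with %in%, "Company" is left alone even though it starts with "Co", and no trailing spaces are left behind.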
Example: I have a string,
> myString <- c("Regal bar Opp New Market Esplanade", "Netaji Indoor Stadium, Opp Eden Gardens, Kolkata")
I need to remove the word "Opp" and the two words that follow it from each string; that is, I am looking for the following output:
[1] "Regal bar Esplanade" "Netaji Indoor Stadium, Kolkata"
Can anyone help please?
Just use lapply:
myString <- c("Regal bar Opp New Market Esplanade",
"Netaji Indoor Stadium, Opp Eden Gardens, Kolkata")
lapply(strsplit(myString, split = " "),
function(x) x[-c(match("Opp", x),
match("Opp", x)+1,
match("Opp", x)+2)])
Or use sapply directly to get back the format you started with:
sapply(strsplit(myString, split = " "),
function(x) paste(x[-c(match("Opp", x),
match("Opp", x)+1,
match("Opp", x)+2)],
collapse = " "))
# [1] "Regal bar Esplanade" "Netaji Indoor Stadium, Kolkata"
Cheers,
Mike
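The index-based version above assumes that every string contains "Opp". As an alternative sketch, a single regular expression can remove "Opp" plus the next two whitespace-separated words and simply leave strings without "Opp" untouched. The third string below is made up to show that case.

```r
myString <- c("Regal bar Opp New Market Esplanade",
              "Netaji Indoor Stadium, Opp Eden Gardens, Kolkata",
              "No marker here")  # hypothetical string without "Opp"

# \\bOpp       the word "Opp" starting at a word boundary
# \\s+\\S+     one following word (twice), with the whitespace before it
# \\s*         also swallow the whitespace in front of "Opp"
cleaned <- gsub("\\s*\\bOpp\\s+\\S+\\s+\\S+", "", myString, perl = TRUE)

# trimws() tidies up the case where "Opp" is the first word
cleaned <- trimws(cleaned)
cleaned
# "Regal bar Esplanade" "Netaji Indoor Stadium, Kolkata" "No marker here"
```

A "word" here means any run of non-whitespace characters, so "Gardens," is removed together with its comma, just as in the strsplit approach.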
How do I handle/get rid of emoticons so that I can sort tweets for sentiment analysis?
Getting:
Error in sort.list(y) :
invalid input
Thanks
This is how the emoticons look after coming from Twitter into R:
\xed��\xed�\u0083\xed��\xed��
\xed��\xed�\u008d\xed��\xed�\u0089
This should get rid of the emoticons, using iconv as suggested by ndoogan.
Some reproducible data:
require(twitteR)
# note that I had to register my twitter credentials first
# here's the method: http://stackoverflow.com/q/9916283/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem")
# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))
# inspect, yes there are some odd characters in row five
head(df)
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via #tweetsmania ;-)
2 “#teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s
Here's the key line that will remove the emoticons:
# Clean text to remove odd characters
df$text <- sapply(df$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
Now inspect again, to see if the odd characters are gone (see row 5)
head(df)
text
1 ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via #tweetsmania ;-)
2 #teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons & \nall the other stuff i cant see on android!" \n#Emoticons
3 E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4 #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5 I use emoticons too much. #addicted #admittingit #emoticons haha
6 What you text What I see #Emoticons http://t.co/BKowBSLJ0s
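For a quick check without Twitter credentials, here is a small sketch of the same iconv idea on a hand-made vector. It assumes the input is UTF-8 (the \u and \U escapes below produce UTF-8 strings), so from = "UTF-8" is used instead of "latin1"; the sample tweets are invented.

```r
# Hypothetical sample tweets built with Unicode escapes
tweets <- c("I use emoticons too much \U0001F600 haha",
            "caf\u00e9 ol\u00e9",
            "plain ascii")

# Anything that cannot be represented in ASCII is replaced by sub = ""
cleaned <- iconv(tweets, from = "UTF-8", to = "ASCII", sub = "")
cleaned
```

Note that this removes every non-ASCII character, not only emoticons, so accented letters such as the é above disappear as well. That may or may not be acceptable for sentiment analysis on non-English text.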
I recommend the function:
ji_replace_all(string, replacement)
From the package:
install_github("hadley/emo") # install_github() comes from devtools (or remotes)
I needed to remove emojis from tweets written in Spanish. I tried several options, but some of them mangled the text. This one, however, works perfectly:
library(emo)
text="#VIDEO 😢💔🙏🏻,Alguien sabe si en Afganistán hay cigarro?"
ji_replace_all(text,"")
Result:
"#VIDEO ,Alguien sabe si en Afganistán hay cigarro?"
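You can use a regular expression to find the words that contain at least one letter or digit and drop the rest (for example, tokens made up only of symbols). Sample code:
rmNonAlphabet <- function(str) {
  # split the string into space-separated words
  words <- unlist(strsplit(str, " "))
  # indices of the words that contain at least one letter or digit
  in.alphabet <- grep("[a-z0-9]", words, ignore.case = TRUE)
  # paste the kept words back together
  paste(words[in.alphabet], collapse = " ")
}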
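Since unlist() merges the words of all input strings, the function above is best called on one tweet at a time. Here is a sketch of a vectorised variant with a usage example; the name drop_symbol_tokens and the sample strings are hypothetical.

```r
# Hypothetical vectorised variant: one cleaned string per input string
drop_symbol_tokens <- function(strings) {
  vapply(strsplit(strings, " ", fixed = TRUE),
         function(words) {
           # keep only words with at least one ASCII letter or digit
           keep <- grepl("[A-Za-z0-9]", words)
           paste(words[keep], collapse = " ")
         },
         character(1))
}

drop_symbol_tokens(c("good :-) day !!! 42", "^_^ nice ~~"))
# "good day 42" "nice"
```

This works at the word level: a token is dropped only if it contains no letter or digit at all, so a word with an emoticon glued to it would be kept unchanged.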