Removing different words from a document using R console - r

I have managed to read in a text file, but I want to remove a set of words from it. I looked at read.table and have no clue how to use it to remove certain words. I have 300 words, and these are some of them. How can I remove all of these words using the R console? I have two files: sk.text, which is the whole document, and bash.txt, which contains just the words. I want to remove every word in sk.text that matches a word in bash.txt.
with
within
without
work
worked
working
works
would

A simple way would be to use
gsub(paste0('\\b',
YOURVECTOROFWORDSTOREMOVE,
'\\b', collapse = '|'),'',YOURSTRING)
which replaces every occurrence of the words in the vector that sits between word boundaries (\b in the regex, written '\\b' in an R string) with an empty string. The words are deleted, and the whitespace around them is left in place.
However, you might want to look at the tm package and work with a corpus object if you have many files like this. There you can remove the words simply with
tm_map(YOURCORPUS, removeWords, YOURVECTOROFWORDSTOREMOVE)
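As a rough sketch of the gsub approach above in base R: the sample sentence and the variable names are made up for illustration, and the commented readLines calls assume that the word file holds one word per line, which the question does not confirm.

```r
# Hypothetical inputs. With the files from the question this might be:
#   words <- readLines("bash.txt")                       # assumes one word per line
#   text  <- paste(readLines("sk.text"), collapse = " ")
words <- c("with", "within", "without", "work", "worked", "working", "works", "would")
text  <- "She worked without a break and would not stop working within the hour"

# Build one alternation wrapped in word boundaries, so "with" does not
# match inside a longer word such as "withdraw".
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# Delete the words, then collapse the leftover runs of whitespace.
cleaned <- gsub(pattern, "", text, perl = TRUE)
cleaned <- gsub("\\s+", " ", trimws(cleaned))

print(cleaned)
```

The second gsub is optional; it only tidies the double spaces that deleting a word leaves behind.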

Related

RemoveWords command not removing some weird words

I'm trying to remove some weird words (like <U+0001F399><U+FE0F>) from my text corpus to do some Twitter analysis.
There are many words like that which I just can't remove by using tm_map(X, removeWords, ...).
I have plenty of tweets aggregated in a dataset. Then I use the following code:
corpus_tweets <- tm_map(corpus_tweets, removeWords, c("<U+0001F339>", "<U+0001F4CD>"))
If I swap those weird words for regular ones (like "life" or "animal") that also appear in my dataset, the regular ones are removed without any trouble.
Any idea how to solve this?
As these are Unicode characters, you need to figure out how to enter them properly in R.
The escape syntax for Unicode in R is probably not <U+xxxx>, but rather something like \Uxxxx. See the manual for details. (I don't use R; I am too annoyed by its inconsistencies. This is even an example of such an inconsistency, where the string is apparently printed differently from what R would accept as input.)
corpus_tweets <- tm_map (corpus_tweets, removeWords, c("\U0001F339", "\U0001F4CD","\uFE0F","\uFE0E"))
NOTE: Use a backslash and lowercase u followed by up to 4 hex digits to specify a character from Unicode plane 0; use a backslash and uppercase U followed by up to 8 hex digits for the other planes (which is where most emoji live, given you are working with tweets).
BTW, see "Some emojis (e.g. ☁) have two unicode, u'\u2601' and u'\u2601\ufe0f'. What does u'\ufe0f' mean? Is it the same if I delete it?" for why you are getting the FE0F in there: it is a variation selector, added when a specific variation of an emoji is wanted, e.g. the colour version. FE0E is its partner (it asks for the plain text glyph).
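A minimal base R sketch of the escape syntax described above, using plain gsub rather than tm. The sample tweets are invented. My understanding, which is an assumption about tm's internals, is that removeWords wraps its words in \b word boundaries, and those may not match around emoji; a plain alternation does not depend on word boundaries.

```r
# Made-up tweets containing the code points mentioned in the question.
tweets <- c("Rose \U0001F339 in bloom",
            "Pinned \U0001F4CD here",
            "Mic \U0001F399\uFE0F check")

# \U takes up to 8 hex digits (needed beyond plane 0); \u takes up to 4.
unwanted <- c("\U0001F339", "\U0001F4CD", "\U0001F399", "\uFE0F", "\uFE0E")
pattern  <- paste(unwanted, collapse = "|")

# Delete the characters, then tidy the whitespace they leave behind.
cleaned <- gsub(pattern, "", tweets)
cleaned <- gsub("\\s+", " ", trimws(cleaned))

print(cleaned)
```

Inside a tm pipeline the same pattern could probably be applied with tm_map(corpus_tweets, content_transformer(function(x) gsub(pattern, "", x))), though that call is untested here.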

Cleaning a column with non-breaking spaces that contains last, first names so I can filter my data frame

I'm stumped. I want to grab specific names from a given column. However, when I try to filter on them I get most of the names but not all, even though I can clearly see the missing names in the original Excel file. I think it has to do with some sort of special character or spacing in the name column, and I am confused about how to fix this.
I have tried applying Excel's CLEAN() function to the column, and I have tried building an Alteryx flow to clean the data. Neither step has helped. I am starting to wonder if this is an R issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
In R, the first line shows a red box with a white dot between the comma and the first name. I got this character by copying the name from the data frame into Notepad and then pasting it into R. That version works and returns what I want. The second line uses a standard space and does not return what I want. How can I fix this without having to copy each name from the data frame to Notepad and then from Notepad to R?
The expected result is that I get the rows attached to whatever name I filter by.
I was able to find the answer. It turns out the space is actually a non-breaking space (U+00A0) rather than the normal space (U+0020). The non-breaking space is not part of the American Standard Code for Information Interchange (ASCII). R's filter() therefore couldn't match some names, because they contained non-breaking spaces where my filter string had normal ones. I fixed this by substituting a normal space for the non-breaking space throughout the column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) # replace the non-breaking space with a normal space in the column of interest
surveyData$`Completed By Clean` = space_fix
Once I applied this, I could easily filter on any name!
Thanks everyone!
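For anyone hitting the same problem, here is a small sketch of how the hidden character could be spotted without the Notepad round trip. The two sample names are invented; only the U+00A0 versus U+0020 difference comes from the answer above.

```r
# Two names that look identical on screen; the first contains U+00A0.
names_raw <- c("Spencer,\u00A0Amy", "Spencer, Amy")

# Flag the values that contain a non-breaking space.
has_nbsp <- grepl("\u00A0", names_raw, fixed = TRUE)

# utf8ToInt exposes the code points: 160 is U+00A0, 32 is a normal space.
code_points <- utf8ToInt(names_raw[1])

# Same substitution as in the answer, applied to the sample vector.
names_clean <- gsub("\u00A0", " ", names_raw, fixed = TRUE)

print(has_nbsp)
print(names_clean == "Spencer, Amy")
```

Checking the code points first makes it clear which character to substitute, since other invisible characters would need a different pattern.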

Removing part of strings within a column

I have a column within a data frame with a series of identifiers in, a letter and 8 numbers, i.e. B15006788.
Is there a way to remove all instances of B15.... to make them empty cells (there’s thousands of variations of numbers within each category) but keep B16.... etc?
I know that if there was just one thing I wanted to remove, like the B15, I could do:
sub("B15", "", df$col)
But I'm not sure how to remove a set number of characters/numbers (or even all subsequent characters after B15).
Thanks in advance :)
Welcome to SO! This is a case for regex. You can use base R as I show here, or look into the stringr package for handy tools that are easier to understand. You can also look up regex rules to help define what you want to match. For what you ask, the following code example should help:
testStrings <- c("KEEPB15", "KEEPB15A", "KEEPB15ABCDE")
gsub("B15.{2}", "", testStrings)
gsub is the base R function that replaces every match of a pattern with something else in one input or a vector of inputs. To test the regex I created the testStrings vector with different examples.
Breaking down the regex, "B15" is the literal text you're looking for. The "." means any character, and the "{2}" says how many of those characters to grab after "B15" (here exactly two). You can change it as you need. If you want to remove everything after "B15", replace the pattern with "B15.*"; the "*" means zero or more of the preceding item, so ".*" matches everything up to the end of the string.
edit: If you want to specify that "B15" must be at the start of the string, you can add "^" to the start of the pattern, like so: "^B15.{2}"
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf has info on the different regexes you can write to be more particular.
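Putting the pieces above together for the identifiers in the question, a minimal sketch might look like this. The sample values are invented to match the described format (a letter followed by 8 digits).

```r
# Made-up identifiers in the format described in the question.
ids <- c("B15006788", "B16001234", "B15999999", "B17000001")

# "^" anchors the match at the start of the string and ".*" consumes the rest,
# so any identifier starting with B15 becomes an empty string.
blanked <- sub("^B15.*", "", ids)

# Stricter variant: only blank values that are exactly B15 plus six digits.
blanked_strict <- sub("^B15[0-9]{6}$", "", ids)

print(blanked)
# [1] ""          "B16001234" ""          "B17000001"
```

On a data frame this would presumably become df$col <- sub("^B15.*", "", df$col), with df and col as named in the question.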

R removes spaces in read.table

I came across some surprising behavior today that doesn't seem right to me. I have a CSV file with several columns, some numeric and some text. One of my text columns contains extra spaces between some words. When I read this file into R using read.csv (or more generally read.table), it removes the extra spaces. I am not talking about leading or trailing whitespace, but spaces inside the string.
I have looked through the docs and nowhere can I find an option to turn off this behavior. Surely there must be a way to tell R to read the data as it is and not remove these spaces. Or is there?

Stray commas when importing CSV into R

I have a large CSV file (170k rows), which I'm importing into R. Each entry in the file is comma-delimited - however, in some of the columns (particularly those with a collection of URLs stuck together), there are commas in the strings. An example below:
Will Smith,25/09/68,null,male,08/10/14,450109,TRUE,http://commons.wikimedia.org/wiki/Special:FilePath/Will_Smith_2011,_2.jpg?width=300http://upload.wikimedia.org/wikipedia/commons/thumb/5/51/Will_Smith_2011,_2.jpg/200px-Will_Smith_2011,_2.jpghttp:.....
The added comma has a knock-on effect: it makes R (and Excel) treat the text after it as a separate column, which then spills over into the other columns and destroys the formatting. Given that roughly 10% of the data is affected, is there a quick way to get around this?
If the rule suggested by this limited example is to remove the commas that appear before underscores, then this succeeds:
gsub("[,][_]", "_", s)
Without some rule for when commas should be ignored, no.
If you have some consistent rule, then use str_replace_all (from the stringr package) with a regex to find the exceptions.
If you're the one making the csv I'd suggest you delimit with a different character.
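A small sketch of the comma-before-underscore rule from the first answer, applied to the raw lines before they are parsed. The three-column layout, the header names, and the example.org URL are simplified stand-ins and not the real file.

```r
# Hypothetical, shortened version of the data: a header plus one affected row.
raw_lines <- c(
  "name,dob,url",
  "Will Smith,25/09/68,http://example.org/Will_Smith_2011,_2.jpg"
)

# Drop the comma in every ",_" pair so it is no longer read as a delimiter.
# Note that this changes the URL text itself.
fixed_lines <- gsub(",_", "_", raw_lines, fixed = TRUE)

# Parse the repaired lines; each element of the vector is one line of the file.
df <- read.csv(text = fixed_lines, stringsAsFactors = FALSE)

print(ncol(df))
print(df$url)
```

For the real file, the lines would come from readLines on the CSV path. The rule only helps if every stray comma is followed by an underscore, which this limited example suggests but does not guarantee.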

Resources