I am scraping comments from Reddit and trying to remove empty rows/comments.
A number of rows appear empty, yet I cannot seem to remove them, and when I check them with is_empty they do not register as empty.
> Reddit[25,]
[1] ""
> is_empty(Reddit$text[25])
[1] FALSE
> Reddit <- subset(Reddit, text != "")
> Reddit[25,]
[1] ""
Am I missing something? I've tried a couple of other methods to remove these rows and they haven't worked either.
Edit:
Included a dput example in response to the comments:
RedditSample <- data.frame(text=
c("I liked coinbase, used it before. But the fees are simply too much. If they were to take 1% instead 2.5% I would understand. It's much simpler and long term it doesn't matter as much.",
"But Binance only charges 0.1% so making the switch is worth it fairly quickly. They also have many more coins. Approval process took me less than 10 minutes, but always depends on how many register at the same time.",
"", "Here's a 10%/10% referal code if you chose to register: KHELMJ94",
"What is a spot wallet?"))
Actually the data you shared doesn't contain an empty string; it contains a Unicode zero-width space character (U+200B). You can see that with
charToRaw(RedditSample$text[3])
# [1] e2 80 8b
You could make sure each row contains at least one "word" character using a regular expression:
subset(RedditSample, grepl("\\w", text))
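If you would rather clean the column in place, here is a minimal sketch, assuming the zero-width space (U+200B) is the only invisible character involved: strip it with gsub(), then drop the rows whose text is left empty.
# strip zero-width spaces (U+200B); assumes no other invisible characters
RedditSample$text <- gsub("\u200b", "", RedditSample$text)
# keep only rows with non-empty text
RedditSample <- subset(RedditSample, nzchar(text))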
You could use the string length functions, for example str_length() from the stringr package (loaded as part of the tidyverse). Note that a zero-width space still counts as one character, so strip it first (for instance with the gsub() call above) before filtering on length:
library(tidyverse)
Reddit %>%
filter(str_length(text) > 0)
Or base R:
Reddit[nchar(Reddit$text) > 0, ]
Related
I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.
I have a collection of html documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase. i.e. '732,234 people own these' - I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always surrounded by the same HTML tag. I tried using XPath, but that seemed even harder than a regex. Any help or advice is greatly appreciated!
example string: >742,811 people own these<
-> 742,811
Could you please try the following.
val <- "742,811 people own these"
gsub(' [a-zA-Z]+',"",val)
Output will be as follows.
[1] "742,811"
Explanation: this uses R's gsub() (global substitution) function to replace every occurrence of a space followed by one or more letters, lowercase or uppercase, with the empty string, leaving only the number in val. (sub() would stop after the first such occurrence; gsub() replaces them all.)
Try using str_extract_all from the stringr library:
str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
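As a runnable sketch on the example string from the question (the lookahead (?= people own these) matches the number only when that phrase immediately follows it):
library(stringr)
# input taken from the example string in the question
val <- ">742,811 people own these<"
str_extract_all(val, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
# [[1]]
# [1] "742,811"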
So I scanned in a physical document, converted it to a TIFF image, and used the tesseract package to import it into R. However, I need R to look for specific keywords, find them in the text file, and return the entire line that each keyword is in.
For example, if I had the text file:
This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”.
And I tell R to search for the keyword "straightforward", how do I get it to return "This is also straightforward...see if that matches the"?
Here is a solution using the quanteda package that breaks the text into sentences, and then uses grep() to return the sentence containing the word "straightforward".
aText <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
library(quanteda)
aCorpus <- corpus(aText)
theSentences <- tokens(aCorpus,what="sentence")
grep("straightforward",theSentences,value=TRUE)
and the output:
> grep("straightforward",theSentences,value=TRUE)
text1
"This is also straightforward."
To search for multiple keywords, combine them in the grep() pattern with the OR operator |.
grep("straightforward|exceeds",theSentences,value=TRUE)
...and the output:
> grep("straightforward|exceeds",theSentences,value=TRUE)
text1
"This is also straightforward."
<NA>
"It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a \"5\"."
Here is one base R option:
text <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
lst <- unlist(strsplit(text, "(?<=[a-z]\\.\\s)", perl=TRUE))
lst[grepl("\\bstraightforward\\b", lst)]
I am splitting your text on the pattern (?<=[a-z]\\.\\s), which is a lookbehind for a lowercase letter followed by a full stop and a space. This should work well most of the time. There is the issue of abbreviations, but most of the time they take the form of a capital letter followed by a dot, and most of the time they do not end sentences.
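Running the two lines above on the sample text returns the matching sentence (the trailing space is kept because the split pattern is a lookbehind, so the delimiter stays attached):
lst[grepl("\\bstraightforward\\b", lst)]
# [1] "This is also straightforward. "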
I have parts of links pertaining to baseball players in my character vector:
teamplayerlinks <- c(
"/players/i/iannech01.shtml",
"/players/l/lindad01.shtml",
"/players/c/canoro01.shtml"
)
I would like to isolate the letters/numbers after the 3rd / sign and before the .shtml portion. I want my resulting string to read:
desiredlinks
# [1] "iannech01" "lindad01" "canoro01"
I assume this is a job for sub, but after much trial and error I'm having a very tough time learning the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions, but I'd rather solve it in one step that dynamically handles other links.
Thank you in advance to anyone who replies - I'm still learning R and trying to get better everyday.
You could try
gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01" "canoro01"
Here we have
.*/ remove everything up to and including the last /
| or
\\..*$ remove everything after the ., starting from the end of the string
By the way, these look a bit like the player IDs used in the Lahman baseball data sets. If so, you can use the Lahman package in R instead of scraping the web; it has numerous baseball data sets and can be installed with install.packages("Lahman"). I also wrote a package, retrosheet, for downloading data sets from retrosheet.org. It's also on CRAN. Check it out!
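As a quick, hedged sketch of that route (assuming a recent version of the Lahman package, in which the main table is People; older versions call it Master):
library(Lahman)
# player IDs live in the People table
head(People$playerID)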
The basename function is useful here.
gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01" "canoro01"
This can also be done without regex:
tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01" "canoro01"
I am using the following code to find the number of occurrences of the word "memory" in a file, and I am getting the wrong result. Can you please help me see what I am missing?
NOTE1: The question asks for exact occurrences of the word "memory"!
NOTE2: What I have realized is that they are looking for exactly "memory", and even something like "memory," is not accepted! That was the part that caused the confusion, I guess. I tried it for the word "action" and the correct answer is 7! You can try it as well.
#names=scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
Read 28230 items
> length(grep("memory",names))
[1] 9
Here's the file
The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function scan encounters an apostrophe, it assumes it is the start of a quoted string and reads all characters up to the next apostrophe into a single entry of your names vector. One of these long entries happens to include two instances of the word "memory", which reduces the total number of matches by one.
You can fix the problem by telling scan to regard all quotation marks as normal characters and not treat them specially:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
Be careful when using the R implementation of grep: it does not behave exactly like the usual GNU/Linux program. In particular, the way you have used it here will count the number of matching words (one per element of names), not just the number of matching lines as some people have suggested.
As pointed out by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on other answers/comments, this one seems OK:
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL )
idxs = grep("memory", names, ignore.case = TRUE)
length(idxs)
# [1] 10
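If, as NOTE2 says, only exact occurrences of "memory" should count (so a token like "memory," is rejected), a hedged variant is to anchor the pattern to the whole element:
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what = character(), quote = NULL)
# count only the elements that are exactly "memory"
sum(grepl("^memory$", names))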
I have
str=c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
and I want to get
"00005.profit" "00006.profit"
How can I achieve this using grep in R?
Here is one way:
R> s <- c("00005.profit", "00005.profit-in","00006.profit","00006.profit-in")
R> unique(gsub("([0-9]+\\.profit).*", "\\1", s))
[1] "00005.profit" "00006.profit"
We define a regular expression as digits followed by .profit, which we capture by wrapping the expression in parentheses. The \\1 then recalls the first capture group -- and as we recall nothing else, that is all we get. The unique() then reduces the four items to two unique ones.
Dirk's answer is pretty much the ideal generalisable answer, but here are a couple of other options based on the fact that your example always has a - character starting the part you wish to chop off:
1: gsub to return everything prior to the -
gsub("(.+)-.+","\\1",str)
2: strsplit on - and keep only the first part.
sapply(strsplit(str,"-"),head,1)
Both return:
[1] "00005.profit" "00005.profit" "00006.profit" "00006.profit"
which you can then wrap in unique() to drop the duplicates:
unique(gsub("(.+)-.+","\\1",str))
unique(sapply(strsplit(str,"-"),head,1))
These will then return:
[1] "00005.profit" "00006.profit"
Another non-generalisable solution would be to just take the first 12 characters (assuming string length for the part you want to keep doesn't change):
unique(substr(str,1,12))
[1] "00005.profit" "00006.profit"
I'm actually interpreting your question differently. I think you might want
grep("[0-9]+\\.profit$",str,value=TRUE)
That is, if you only want the strings that end with profit. The $ special character stands for "end of string", so it excludes cases that have additional characters at the end ... The \\. means "I really want to match a dot, not any character at all" (a . by itself would match any character). You weren't entirely clear about your target pattern -- you might prefer "0+[1-9]\\.profit$" (any number of zeros followed by a single non-zero digit), or even "0{4}[1-9]\\.profit$" (4 zeros followed by a single non-zero digit).
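Applying that pattern to the original vector keeps just the strings that end in profit:
grep("[0-9]+\\.profit$", str, value = TRUE)
# [1] "00005.profit" "00006.profit"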