Cleaning forum post with multiple quotations in rvest + stringr - r

I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text
The clean text should be each user's post, without the quotations if they are replying to anyone. if the post is not a reply, I would leave it as NA. The following is an invented post, with invented user, to illustrate what I have managed to do so far:
post<-"Meow1 wrote: »\noday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"
Then I try to pull out the quoted user (Meow1) and it works:
QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA)
QuotedUser_1
[1] "Meow1"
Then I created this codes for pulling out the quoted text, and the clean text:
Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)
It works when there is only one quoted text, but otherwise, it only gives the last quoted bit (in the example, 'And after all, you´re my wonderwall')
And same for the clean text, it only returns the last reply:
Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)
If anyone has a suggestion to improve the code, so that I can have a vector with all the quotations, and a vector with all the replies, I would be very grateful...
Cheers

Are you sure you cannot scrape the author and text information separately? Without a source it's difficult to know, but I guess they can be obtained by different css-selectors making it much easier to split the data.
If not, it might be helpful to look into str_locate_all which allows you to locate all occurences of e.g. "wrote:" and split the string accordingly.

Related

"How to skip a line between strings in R?"

I'm doing a swirl lesson.
This is the problem:
Edit the string inside writeLines() so that it correctly displays
(with the line breaks in these positions)
This is a really
really really
long string
I tried typing
writeLines("This is a really\n\nreally really\n\nlong string")
but the swirl lesson keeps telling me that it is incorrect. Is there a different way to write the same thing?
Swirl is generally very strict about the answer, as it would be time consuming and difficult to put in ways to check for all the potentially correct answers.
As a matter of fact the answer is writelines("This is a really \nreally really \nlong string") (see here). You have the newline \n doubled, so Swirl won't accept that as an answer.

Improperly formatted CSV, how to repair?

I have a csv, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it to wordpress and map it to current fields, but it isn't functioning properly. Any suggestions?
Just use search and replace in this case. remove the commas at the end and then replace the remaining commas with ",".
Should anyone else have the same issue. Know that this solution will only work with data much like the example giving. If data has a lot of text and there are commas within the text that need kept. Then search replacing comma will not work. Using regex would be the next option and that can be done in Notepad ++
However I think the regex pattern depends on the data so not much point creating an example.
PHP could be used to explode each line also. Remove values that match a regex out of many i.e. URL, money. Then what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns with a lot of text

String continuation across multiple lines, no newline characters

Am using the RODBC library to bring data into R. I have a long query that I want to pass a variable to, much like this SO user.
Problem is that R interprets the whitespace/carriage returns in my query as a newline '\n'.
The accepted solution for this question suggests to simply break up the text into chunks and then paste() together - which works, but ideally I'd like to keep the whitespace intact - makes it easier to test/verify the behavior of the query over in the database before pasting into R.
In other languages I'm familiar with there's a simple line continuation character - indeed, several of the comments on the accepted answer are looking for an approach similar to python's \.
I found an aside to a workaround using strwrap deep in the bowels of an R discussion lists, so in the interest of making the internet better I will post it here. However, if someone can point the direction toward a more elegant/straightforward solution, I will happily accept your answer.
I don't know if you will find this helpful or not, but I have eventually gravitated towards keeping my SQL separate from my R scripts. Keeping the query in my R script, except for very very short ones, I find gets unreadable very quickly.
These days, I tend to keep queries that are more than a single line in their own separate .sql file. Then I can keep them nice and formatted and readable in a nice text editor, and read them into R as needed via something like this:
read_sql <- function(path){
stopifnot(file.exists(path))
sql <- readChar(path,nchar = file.info(path)$size)
sql
}
For binding parameters into the queries, I just keep a %s where the parameter will go in the .sql file, and then add in the parameters in R using sprintf.
I've been much happier this way, as I was finding that cluttering up my R scripts with really long paste statements and multi-line character objects was making my code really hard to read.
R's strwrap will destroy whitespace, including newline characters, per the documentation.
Essentially, you can get the desired behavior by initially letting R introduce line breaks/newline \ns, and then immediately stripping them out.
#make query using PASTE
query_1 <- paste("SELECT map.ps_studentid
,students.first_name || ' ' || students.last_name AS full_name
,map.testritscore
,map.termname
,map.measurementscale
FROM map$comprehensive_with_growth map
JOIN students
ON map.ps_studentid = students.id
WHERE map.termname = '",map_term,"'", sep='')
#remove newline characters introduced above.
#width is an arbitrary big number-
#it just needs to be longer than your string.
query_1 <- strwrap(query_1, width=10000, simplify=TRUE)
#execute the query
map_njask <- sqlQuery(XE, query_1)
query <- gsub(pattern='\\s',replacement="",x=query)
Try using sprintf to get variable substitution, and then replacing all newlines and whitespace.
See my answer to a similar question for details.

Trying to parse a file that looks partically hex encoded

I'm trying to parse a file that looks sort of hex encoded but mostly not. I contacted support for the vendor who created the file and they said that they it can be parsed using "an 0x116 offset"
What is a 0x116 offset?
It took me 2 weeks to get an answer from the vendor on my first question, so I wanted to see if someone here could help me make sense of! Thank you!
"0x116 offset" means nothing. It could be a value that needs to be added to words or subtracted to remove some naive encoding, or anything else for that matter.
Could you post a part of the file? Is it binary or text? Could you define "mostly not"?
What vendor/software package/device does this file come from?

Insert unicode strings into CleverCSS

How can one insert a Unicode string CSS into CleverCSS?
In particular, how could one produce the following CSS using CleverCSS:
li:after {
content: "\00BB \0020";
}
I've figured out CleverCSS's parsing rules, but suffice that the permutations I've thought sensible have failed, for example:
li:
content: "\\00BB \\0020" // becomes content: 'BB 0'
EDIT: My other examples and the rest of my post weren't saved. Suffice to say that I had a longer list of examples that's missing.
I'd be grateful for any thoughts and input.
Brian
EDIT: I noted that inserting the unicode was one of the problems (once you start uploading CSS with utf-8 encoding it's fine). The wrapping of quote characters is another, which I solved that with something crazy likeso:
content: "'".string() + " ".string() ».string() + "'".string()
Hope that helps someone else.
This may be silly, but why still bother with escape sequences when you can just type/paste the actual characters? "A CSS style sheet is a sequence of characters from the Universal Character Set".
That is a lot easier on the eye, and is especially useful when maintaining existing code.
Or is CleverCSS not Unicode-enabled?
In looking at the code (CleverCSS 0.1) it would appear that the partial regular expression _r_string (defined on line 414) is where you would need to start. This is used to define several other REs, including _string_re which is used in the parsing rules (line 1374). This leads us to process_string() (line 1359) which looks like it was meant to accept Unicode.
Unfortunately, hand-built parsers tend to get a bit strange and the code is not exactly swimming in comments. If you really need to do this, I would focus on process_string() and put a bunch of before/after print statements in there and see if you can understand the goes-intos and goes-outofs.
You might also try bribing the original author with beer or ??? Good luck.

Resources