I'm grabbing the following page and storing it in R with this code:
library(RCurl)  # getURL() comes from the RCurl package
gQuery <- getURL("https://www.google.com/#q=mcdimalds")
Within the result, there's the following snippet of HTML:
Showing results for</span> <a class="spell" href="/search?rlz=1C1CHZL_enUS743US743&q=mcdonalds&spell=1&sa=X&ved=0ahUKEwj9koqPx_TTAhUKLSYKHRWfDlYQvwUIIygA"><b><i>mcdonalds</i></b></a>
Everything other than "Showing results for" and the italics tags encasing the name I want to extract is subject to change from query to query.
What I want to do is use a regex to extract the "mcdonalds" that occurs here: <b><i>mcdonalds</i></b>, i.e. the second instance of "mcdonalds". However, I'm not too sure how to write the regex to do so.
Any help accomplishing this would be greatly appreciated. As always, please let me know if any additional information should be added to clarify the question.
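A minimal sketch of one way this could be done in R, assuming the page source is stored in gQuery and that the suggested spelling sits in the first <i>…</i> after "Showing results for" (if the pattern isn't found, sub() returns its input unchanged):
# key on "Showing results for" and the <i> tags, since everything else changes
suggestion <- sub('(?s).*Showing results for.*?<i>([^<]+)</i>.*', '\\1',
                  gQuery, perl = TRUE)
suggestion
# [1] "mcdonalds"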
I want to search for multiple codes appearing in a cell. There are so many codes that I'd like to write parts of the pattern on succeeding lines. For example, let's say I am looking for "^a11", "^b12", "^c67$" or "^d13[[:blank:]]". I am using:
^a11|^b12|^c67$|^d13[[:blank:]]
This seems to work. Now, I tried:
^a11|^b12|
^c67$|^d13[[:blank:]]
That also seemed to work. However, when I tried:
^a11|^b12|^c67$|
^d13[[:blank:]]
It did not count the last one.
Note that my code is wrapped in a function, so the pattern above is an argument that I feed to the function. I'm thinking that's the problem, but I still don't know why one line break works while the other does not.
I realized the answer today. The problem is that, because the regex is fed in as an argument written across several lines, the line break itself becomes part of the pattern and attaches to whatever alternative follows it.
Thus, the code below was only "working" because ^c67$ matches nothing in my data anyway, so corrupting it made no difference.
^a11|^b12|
^c67$|^d13[[:blank:]]
And the code below was not working because ^d13 does match something, but this setup looks for (newline)^d13[[:blank:]] instead of just ^d13[[:blank:]]:
^a11|^b12|^c67$|
^d13[[:blank:]]
So an inelegant fix is:
^a11|^b12|^c67$|
^nothinghere|^d13[[:blank:]]
This inserts a burner alternative that matches nothing and absorbs the line break, leaving ^d13[[:blank:]] intact.
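A minimal sketch of what's going on, in R; the test string is invented, and the paste0() variant is just one alternative to the burner pattern:
# breaking the string literal across lines puts a real newline inside the pattern
pattern_broken <- "^a11|^b12|^c67$|
^d13[[:blank:]]"
grepl(pattern_broken, "d13 x")   # FALSE: the last alternative now starts with "\n"

# building the pattern with paste0() keeps the line break out of it
pattern_ok <- paste0("^a11|^b12|^c67$|",
                     "^d13[[:blank:]]")
grepl(pattern_ok, "d13 x")       # TRUE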
I have the following XML structure. I am trying to extract the StartDate and EndDate elements of the relationship period, but only if rr:PeriodType is RELATIONSHIP_PERIOD.
However, the nodes for "relationship" and "accounting" have exactly the same name, and I am not sure how to proceed.
<rr:RelationshipPeriods>
  <rr:RelationshipPeriod>
    <rr:StartDate>2018-01-01T00:00:00.000Z</rr:StartDate>
    <rr:EndDate>2018-12-31T00:00:00.000Z</rr:EndDate>
    <rr:PeriodType>ACCOUNTING_PERIOD</rr:PeriodType>
  </rr:RelationshipPeriod>
  <rr:RelationshipPeriod>
    <rr:StartDate>2019-01-02T00:00:00.000Z</rr:StartDate>
    <rr:PeriodType>RELATIONSHIP_PERIOD</rr:PeriodType>
  </rr:RelationshipPeriod>
</rr:RelationshipPeriods>
I tried using this code:
ldply(xpathApply(xmlData, '//rr:RelationshipPeriod/rr:StartDate', getChildrenStrings), rbind)
But it doesn't work well, because there's no way to tell whether it is extracting the accounting period or the relationship period.
Any help would be greatly appreciated!
For rr:StartDate use XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']/rr:StartDate
But it is probably better to first find the correct rr:RelationshipPeriod using the XPath:
//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']
See this answer on how to reuse the result of an XPath query.
But don't use // in front of rr:StartDate and rr:EndDate; query them with relative paths so the lookup stays inside the node you found.
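A minimal sketch of that approach with R's XML package; the file name and the rr namespace URI below are placeholders and must match the real document:
library(XML)

doc <- xmlParse("relationships.xml")       # placeholder file name
ns  <- c(rr = "http://example.com/rr")     # placeholder: use the document's real rr namespace URI

# find the period node whose rr:PeriodType is RELATIONSHIP_PERIOD
node <- xpathApply(doc,
                   "//rr:RelationshipPeriod[rr:PeriodType='RELATIONSHIP_PERIOD']",
                   namespaces = ns)[[1]]

# reuse that node with relative paths (no leading //)
start_date <- xpathSApply(node, "./rr:StartDate", xmlValue, namespaces = ns)
end_date   <- xpathSApply(node, "./rr:EndDate",   xmlValue, namespaces = ns)  # character(0) here: the sample has no EndDate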
I wonder whether someone can help me please.
I have the following URI in GA: /invite/accept-invitation/accepted/B
Which I'd like to change to: /invite/accept-invitation/accepted
I've tried a 'Search and Replace' filter as follows:
Search String - /invite/accept-invitation/accepted/*
Replace String - /invite/accept-invitation/accepted
But the result I get is:
/inviteaccept-invitation/accepted/B
Could someone tell me where I've gone wrong with this please?
Many thanks and kind regards
Chris
Google Analytics "Search and replace" filter uses regular expressions. More precisely:
Replace string is either a regular string or it can refer to group
patterns in the search expression using backslash-escaped single
digits like (\0 to \9).
More details are available on the filter settings UI, which also refers to this link.
So in your case, the search string would be something like this:
\/invite\/accept-invitation\/accepted\/\w+
In this expression each / is escaped with a backslash. The last part of your string is captured by \w+, which matches any word character ([a-zA-Z0-9_]) one or more times, as many times as possible.
The Replace string doesn't have to be a regular expression. So in your case, your original version could be used:
/invite/accept-invitation/accepted/
Putting this together, the search string above combined with the replace string /invite/accept-invitation/accepted/ gives the desired output in my test view.
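The filter itself runs inside Google Analytics, but the regex can be sanity-checked in R, for example (note that the backslashes before / are not needed here):
sub("/invite/accept-invitation/accepted/\\w+",
    "/invite/accept-invitation/accepted",
    "/invite/accept-invitation/accepted/B",
    perl = TRUE)
# [1] "/invite/accept-invitation/accepted"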
I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text
The clean text should be each user's post without the quotations, if they are replying to anyone. If the post is not a reply, I would leave it as NA. The following is an invented post, with an invented user, to illustrate what I have managed to do so far:
post<-"Meow1 wrote: »\noday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"
Then I try to pull out the quoted user (Meow1) and it works:
QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA)
QuotedUser_1
[1] "Meow1"
Then I created this code for pulling out the quoted text and the clean text:
Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)
It works when there is only one quoted text, but otherwise it only gives the last quoted bit (in the example, 'And after all, you're my wonderwall').
And the same goes for the clean text; it only returns the last reply:
Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)
If anyone has a suggestion to improve the code, so that I can have a vector with all the quotations, and a vector with all the replies, I would be very grateful...
Cheers
Are you sure you cannot scrape the author and text information separately? Without seeing the source it's difficult to know, but I'd guess they can be obtained with different CSS selectors, which would make it much easier to split the data.
If not, it might be helpful to look into str_locate_all, which allows you to locate all occurrences of e.g. "wrote:" and split the string accordingly.
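A minimal sketch along those lines, assuming the stringr package and the post string from the question; it uses str_match_all and str_split rather than str_locate_all, but the splitting idea is the same:
library(stringr)

# (?s) lets "." match newlines; each quote block is "<user> wrote: »\n<quoted text>\n\n\n"
quote_pattern <- "(?s)(\\S+) wrote: »\\n(.*?)\\n\\n\\n"

m <- str_match_all(post, quote_pattern)[[1]]
quoted_users <- m[, 2]   # "Meow1" "Meow1" "Meow1"
quoted_text  <- m[, 3]   # all three quoted passages, not just the last one

# whatever is left between the quote blocks is the clean (reply) text
clean_text <- str_trim(str_split(post, quote_pattern)[[1]])
clean_text <- clean_text[clean_text != ""]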
I am trying to use regex generators to create an expression, but I can't seem to get it right.
What I need to do is find the following type of string in a string:
community_n
For example, the string might be
community community_1 community_new_1 community_1_new
and from that I just want to extract community_1.
I have tried /(community_\\d+)/, but that is clearly not right.
Try adding word boundaries, so:
/(\\bcommunity_\\d+\\b)/
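In R, for example, this could look like the following (a sketch; the question doesn't say which language is being used):
x <- "community community_1 community_new_1 community_1_new"
regmatches(x, gregexpr("\\bcommunity_\\d+\\b", x, perl = TRUE))[[1]]
# [1] "community_1"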
Try using the regex (community_\d+).
Though I could be incorrect since I don't know which language you are using.
(For some reason I cannot add comments, I can only answer questions).