I have many text files. In each text file, there is a section of interest (below):
<tr>
<td ><b>发起时间</b></td>
<td colspan="2" style="text-align: left">2015-04-08</td>
<td style="width: 25%;"><b>回报机制</b></td>
<td colspan="2" style="text-align: left">使用者付费</td>
</tr>
The information that varies across files is the date only. In this case, the date is 2015-04-08.
I want to extract the date. I am an R user, and I normally would use str_match from the stringr package. I would indicate the following as the start of the string:
<td ><b>发起时间</b></td>
<td colspan="2" style="text-align: left">
However, I am not sure what to do given that this string is spread over two lines. What can I do? (It also contains Chinese characters, but that's a separate issue)
Doing it with Regex
It's not advisable to use a regex to parse HTML because of all the obscure edge cases that can crop up, but it seems that you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Proposed solution with Regex
You can use \s+ where the carriage return and newline would be. The resulting regex would look like this:
<td ><b>发起时间<\/b><\/td>\s+<td colspan="2" style="text-align: left">([0-9]{4}-[0-9]{2}-[0-9]{2})<\/td>
Based on your sample text, the first capture group would then contain the string of characters that resembles the date. Note that the regex does not actually validate the date; it only matches the format.
Explained
The \s+ regex will do the following:
\s matches any white space character
+ allows the preceding regex to match 1 or more times
Since we know there will be a carriage return, a newline, and what appears to be a tab or multiple spaces, all of those will be matched. However, if these whitespace characters are optional in your source files, you could use \s* instead. In that case the * will match zero or more whitespace characters.
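As a rough illustration of the same idea outside R, here is a minimal Python sketch that applies the pattern above to the sample rows from the question. The names html, DATE_PATTERN and extract_date are made up for this example; the point is only that \s+ absorbs the line break between the two <td> lines.

```python
import re

# Sample markup copied from the question. The newline between the two
# <td> lines is what \s+ has to absorb.
html = (
    '<tr>\n'
    '<td ><b>发起时间</b></td>\n'
    '<td colspan="2" style="text-align: left">2015-04-08</td>\n'
    '</tr>\n'
)

# Same pattern as in the answer, split over two raw strings for readability.
DATE_PATTERN = re.compile(
    r'<td ><b>发起时间</b></td>\s+'
    r'<td colspan="2" style="text-align: left">([0-9]{4}-[0-9]{2}-[0-9]{2})</td>'
)

def extract_date(text):
    """Return the first capture group (the date-shaped string), or None."""
    match = DATE_PATTERN.search(text)
    return match.group(1) if match else None

print(extract_date(html))
```

In R, the same pattern should work with str_match as long as the whole file is read into a single string rather than a vector of lines.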
Example
Please see this live example
In my dataset, I have a column that contains strings like this:
id <- c(1:4)
# Single-quoted strings so the double quotes in the HTML need no escaping;
# the few inner single quotes are escaped as \'.
colstr <- c(
  '<div class="rich-text-field-label"><p>107. <span style="font-weight: normal;">Did the </span>Goodie bag<span style="font-weight: normal;"> encourage you to go back for your month one PrEP refill?</span></p></div>',
  '<div class="rich-text-field-label"><p>110. Have you ever seen the <span style="color: #3598db;">brochure</span> that is contained in the \'Goodie Bag\'?</p></div>',
  '<div class="rich-text-field-label"><p>116. <span style="font-weight: normal;">Have you ever used the </span>call-in line<span style="font-weight: normal;"> phone number on the brochure</span>?</p></div>',
  '<div class=\'box-body\'><b><p style="text-transform:uppercase; border:1px solid black;padding:2px;color:blue"><span style="display:block;border:1px solid grey;padding:10px">Review the data entered and make sure there is <i style="color:red">*no missing data*</i>.<br/>Thereafter, click on <i style="color:red">save & exit record</i> to save this interview</span></p></b></div>'
)
df <- data.frame(id, colstr)
For the column colstr, I only want to keep the words outside of the "<xxxx>" tags. The ideal result would look like this:
id colstr
1 107. Did the Goodie bag encourage you to go back for your month one PrEP refill?
2 110. Have you ever seen the brochure that is contained in the 'Goodie Bag'?
....
As in the example, I need to retrieve a whole sentence from pieces of a string that are separated by irregular <xxxx> tags. How should I write the R code and set up the pattern to retrieve the words I want? Thanks a lot!
Update:
Based on the help below, the question can now be simplified to: how do I use either gsub or str_replace to remove every <xxxx> in a string?
The code df$colstr<-gsub("</?.*?>", "", df$colstr) generates an error message when I put it into my pipeline. When I use it as mutate(colstr=str_replace(df$colstr, "</?.*?>", "")), it only removes the > in the string. Does anyone happen to know how to fix it? Thanks a lot!
One approach, assuming the HTML tags are not nested, would be to simply strip off all opening and closing tags:
df$colstr <- gsub("</?.*?>", "", df$colstr)
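For comparison, a minimal Python sketch of the same tag-stripping pattern, using one of the strings from the question. The helper names strip_tags and strip_first_tag are hypothetical. The second helper shows what happens when only the first match is replaced, which is how stringr's str_replace behaves (str_replace_all replaces every match).

```python
import re

# Same pattern as in the gsub call: "<", an optional "/", then as few
# characters as possible up to the next ">".
TAG = re.compile(r"</?.*?>")

def strip_tags(text):
    """Remove every tag-like match (like gsub or str_replace_all)."""
    return TAG.sub("", text)

def strip_first_tag(text):
    """Remove only the first match (like sub or str_replace)."""
    return TAG.sub("", text, count=1)

sample = (
    '<div class="rich-text-field-label"><p>110. Have you ever seen the '
    '<span style="color: #3598db;">brochure</span> that is contained in '
    "the 'Goodie Bag'?</p></div>"
)

print(strip_tags(sample))
print(strip_first_tag(sample))
```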
Your text really looks like HTML code.
Have you looked into the rvest package?
You could read your HTML code directly and keep all of the information, then extract the text from it when needed. This would be a much cleaner and easier way to do what you want.
An example would be:
library(rvest)
colstr <- read_html("https://www.youwebsite.html") %>%
  html_text2()
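The parser-based idea can also be sketched in Python with only the standard library. This is an assumption-laden illustration, not an equivalent of html_text2(): the class TextExtractor and the function html_to_text are invented for this example, and they simply concatenate the text nodes that the parser reports.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

def html_to_text(fragment):
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)

sample = (
    '<div class="rich-text-field-label"><p>107. '
    '<span style="font-weight: normal;">Did the </span>Goodie bag'
    '<span style="font-weight: normal;"> encourage you to go back for your '
    'month one PrEP refill?</span></p></div>'
)

print(html_to_text(sample))
```

Unlike the regex approach, a parser also decodes character references such as &amp; while it extracts the text.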
I have a website from which I want to download an XLS file. The site has 3 filters, start date, final date and currency. I checked the source code and the final step before downloading the XLS:
<td width="142" valign="top" >
<!--onClick="return validar_fechas();"-->
<input name="Consultar" type="submit" id="Consultar" class="APLI_boton" value="Consultar" onClick="return validar_fechas();"></td>
<td width="476">
</td>
In the last line it seems to call some arguments like date and currency, however when I try to use this website to download the XLS without entering the dates manually it gives me an empty XLS file.
The urls I've tried are:
http://www.sbs.gob.pe/app/pp/vectorprecios/Vector_Lista_historica.asp?fec_cons="04/03/2016"&fec_cons2="04/03/2016"&tip_cur="x"
http://www.sbs.gob.pe/app/pp/vectorprecios/Vector_Lista_historica.asp?"04/03/2016"&"04/03/2016"&"x"
But none of them gave me the XLS file needed.
What am I missing?
Try this:
www.sbs.gob.pe/app/pp/vectorprecios/Vector_Lista_historica.asp?as_fec_cons=04/03/2016&as_fec_cons2=04/03/2016&as_tip_curva=x
You were close, but a couple of things were wrong. First, don't use quotation marks around the values. Second, the names of the query parameters weren't quite right. After adjusting those two things, it works.
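If the URL is built in a script rather than typed by hand, the same two points apply: unquoted values and the parameter names from the answer. A minimal Python sketch, assuming the endpoint and parameter names shown above are still valid (build_url is a hypothetical helper and no request is actually sent here):

```python
from urllib.parse import urlencode

BASE_URL = "http://www.sbs.gob.pe/app/pp/vectorprecios/Vector_Lista_historica.asp"

def build_url(start_date, end_date, curve):
    """Assemble the query string with the parameter names from the answer."""
    params = {
        "as_fec_cons": start_date,
        "as_fec_cons2": end_date,
        "as_tip_curva": curve,
    }
    # safe="/" keeps the slashes in the dates literal, as in the answer's URL;
    # without it they would be percent-encoded as %2F.
    return BASE_URL + "?" + urlencode(params, safe="/")

print(build_url("04/03/2016", "04/03/2016", "x"))
```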
I have a table where I need to apply two different classes, using expressions.
1st class is applied based on following expression.
{'Up':'class-up', 'Down':'class-down'}[a.status]
and 2nd class is applied based on bold: !a.read
The classes here are class-up, class-down, bold.
So how should the expression be framed? I tried:
<tr ng-repeat="a in all" ng-class="{{'Up':'class-up', 'Down':'class-down'}[a.status],bold: !a.read}">
<tr ng-repeat="a in all" ng-class="{'Up':'class-up', 'Down':'class-down'}[a.status],bold: !a.read">
But I keep getting errors in the console. What is the correct format to apply these classes based on the given expressions?
With the clarification from your comment:
<tr ng-repeat="a in all" ng-class="{'class-up': a.status=='up', 'class-down': a.status=='down', 'bold': !a.read}">hello world</tr>
reStructuredText has nice support for option lists. For example, rst2html.py translates this RST markup
Options:
  --foo           does a foo
  -b, --bar ABAR  bar something
into the following nicely formatted HTML table:
<dt>Options:</dt>
<dd><table class="first last docutils option-list" frame="void" rules="none">
<col class="option" />
<col class="description" />
<tbody valign="top">
<tr><td class="option-group">
<kbd><span class="option">--foo</span></kbd></td>
<td>does a foo</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-b</span>, <span class="option">--bar <var>ABAR</var></span></kbd></td>
<td>bar something</td></tr>
</tbody>
</table>
</dd>
This doesn't seem to extend naturally to positional arguments, however; for example
Arguments:
  foo       does a foo
  bar ABAR  bar something
renders as HTML completely lacking a table structure:
<dt>Arguments:</dt>
<dd>foo does a foo
bar ABAR bar something</dd>
Is there some way to produce an options list table for command line arguments that are not prefixed by dashes or slashes?
Yup. The rather limited syntax of option lists is not very well documented here:
http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#option-lists
Here's the really irritating thing. Say you are writing up a series of options and some of them fit the syntax of an "option" per the preceding link, but some do not. For example --opt==keyword does (and "keyword" will be italicized whether you want it or not), but --pot=BLACK|KETTLE doesn't. Docutils will put all the ones that fit their syntax into a nice option-list <table> template, but where they don't, it drops out of the table format and codes them as standard <dl>s. So right in the middle of your stack of options are a couple that don't look like the others.
I'm trying to limit the punctuation that a user can enter into a text box and am using this regex:
^[\w ,-–\[\\\^\$\.\|\?\*\+\(\)\{\}/!@#&\`\.'\n\r\f\t""’]*$
Why do > and < produce a match? They are not included in the regex.
NOTE: this is being used in an ASP.NET regular expression validator.
Edit: here's the asp.net source:
<input runat="server" type="text" id="txt_FName" class="textbox" maxlength="60" />
<asp:RegularExpressionValidator ID="rfvRegexFName" runat="server" ControlToValidate="txt_FName" ErrorMessage="<%$ Resources:Subscribe, inputValidationError %>" />
In the code behind I add the expression:
rfvRegexFName.ValidationExpression = @"^[\w ,-–\[\\\^\$\.\|\?\*\+\(\)\{\}/!@#&\`\.'\n\r\f\t""’]*$";
Why do > and < produce a match?
Probably because the - (hyphen) in ,-– defines a character range from , (U+002C) to – (U+2013), and both < (U+003C) and > (U+003E) fall inside that range. Either escape the hyphen, as in ,\-–, or place the hyphen at the very start or end of the class, which causes it to match a literal - instead.
Also note that you need not escape the $, ., |, ?, *, +, (, ), { and } inside a character class.
Edit: After seeing the other answers, it looks like there might have been a few things going on here. The main problem was the unescaped dash, though. For future reference of anyone reading this Q/A thread, see Bart Kiers' answer.
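The range behaviour is easy to check in isolation. The sketch below uses Python's re module rather than the .NET engine, so it only illustrates the character-class rule, which both engines share; the reduced classes and the names UNESCAPED and ESCAPED are made up for the example.

```python
import re

# Reduced versions of the class from the question: only the comma,
# the hyphen and the en dash are kept.
UNESCAPED = re.compile(r"[,-–]*")   # hyphen forms a range from "," to "–"
ESCAPED = re.compile(r"[,\-–]*")    # hyphen is a literal character

# Code points: "," is 0x2C, "<" is 0x3C, ">" is 0x3E, "–" is 0x2013,
# so "<" and ">" sit inside the range.
print([hex(ord(c)) for c in ",<>–"])

print(bool(UNESCAPED.fullmatch("<>")))   # range swallows < and >
print(bool(ESCAPED.fullmatch("<>")))     # rejected once the hyphen is escaped
print(bool(ESCAPED.fullmatch(",-–")))    # the three literals still match
```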
You don't want to escape the period. When it's inside the brackets, it matches a regular period by default, not any character like it does normally. I'm not positive, but that might be making it act as a special character again, therefore matching anything.
Try this:
^[\w ,-–\[\\\^\$.\|\?\*\+\(\)\{\}/!@#&\`'\n\r\f\t""’]*$
Try changing the last * to a +. You're matching zero or more instances, which always guarantees a match.
Edit to add: Are all of those characters regular ASCII? It looks like you might be using an em-dash or something, which might be related to your problem.