I am currently working on a data set which has two header rows (the first one acting as an overall category description and the second one containing subcategories), and it happens that both contain various <text> intervals. For example:
In the first row (the column names of the data frame), I have a cell that contains:
- text... <span style=\"text-decoration: underline;\">in the office</span> on the activities below. Total must add up to 100%. <br /><br />
The second row contains multiple cells with:
- text <strong>
- text </strong>
Now, I was able to work out how to remove all <text> intervals in the second row with:
data[1,] = gsub("<.*>", "", data[1,])
However, for the column names row, if I use:
colnames(data) = gsub("<.*>", "",colnames(data))
I end up with just "text", which I don't want, because I still want to have:
text... in the office on the activities below. Total must add up to 100%
If someone has an idea of how to solve this, I would really appreciate it.
Thanks!
You can get what you need by changing the regular expression you are using to the following:
colnames(data) <- gsub("<[^>]+>", "",colnames(data))
This removes each tag itself (everything from an opening < to the next closing >, brackets included) while leaving the text between tags intact. That should give you what you want.
Your current regex is greedy and consumes everything between the first opening bracket and the last closing bracket. One quick fix is to make your regex non-greedy by adding ?:
data[1,] = gsub("<.*?>", "", data[1,])
Note that using regex to parse HTML is generally not a good idea. If you plan on doing anything with nested content, you should consider using an R package that can parse HTML content.
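For example, a minimal sketch with the xml2 package (assuming each column name is a small HTML fragment; xml2 is just one such parser, and strip_tags is a made-up helper name):
library(xml2)
# parse each fragment and keep only its text content
strip_tags <- function(x) {
  vapply(x, function(s) xml_text(read_html(paste0("<p>", s, "</p>"))),
         character(1), USE.NAMES = FALSE)
}
colnames(data) <- strip_tags(colnames(data))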
I am reading a PDF file using R. I would like to transform the text so that whenever multiple spaces are detected, they are replaced by some value (for example "_"). I've come across questions where runs of one or more spaces are replaced using "\\s+" (Merge Multiple spaces to single space; remove trailing/leading spaces), but this will not work for me. I have a string that looks something like this:
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
When I apply the answers I found, replacing all runs of one or more spaces with a single space, I am no longer able to recognise separate addresses, because it would look like this:
gsub("\\s+", " ", str_trim(PDF))
"[1]This is the first address This is the second one
[2]This is the third one
[3]This is the fourth one This is the fifth"
So what I am looking for is something like this:
"[1]This is the first address_This is the second one
[2]This is the third one_
[3]This is the fourth one_This is the fifth"
However, if I rewrite the code used in the example, I get the following:
gsub("\\s+", "_", str_trim(PDF))
"[1]This_is_the_first_address_This_is_the_second_one
[2]This_is_the_third_one_
[3]This_is_the_fourth_one_This_is_the_fifth"
Would anyone know a workaround for this? Any help will be greatly appreciated.
Whenever I come across string and regex problems I like to refer to the stringr cheat sheet: https://raw.githubusercontent.com/rstudio/cheatsheets/master/strings.pdf
On the second page you can see a section titled "Quantifiers", which tells us how to solve this:
library(tidyverse)
s <- "This is the first address This is the second one"
str_replace(s, "\\s{2,}", "_")
(I am loading the complete tidyverse instead of just stringr here due to force of habit).
Any run of 2 or more whitespace characters will now be replaced with _.
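Note that str_replace() only changes the first match; applied to the whole string from the question, you would probably want str_replace_all(), roughly like this:
pdf <- "[1]This is the first address  This is the second one"
str_replace_all(pdf, "\\s{2,}", "_")
# should give something like: "[1]This is the first address_This is the second one"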
I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try to filter them I get most of the names except for a few, even though I can clearly see their names in the original Excel file. I think it has to do with some sort of special character or spacing in the name column. I am confused about how I can fix this.
I have tried applying Excel's CLEAN() function to the given column. I have tried building an Alteryx flow to clean the data. None of these steps has helped. I am starting to wonder if this is an R issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
In R, the first line has a red box with a white dot between the comma and the first name. I got this red box with a white dot by copying the name from the data frame into Notepad and then pasting it into R. This actually works and returns what I want. The second case uses a standard space, which doesn't return what I want. So how can I fix this issue without having to copy a name from the data frame into Notepad and then paste the result (which contains the red box with a white dot between the comma and the first name) back into R?
The expected result is that I get the rows attached to whatever name I filter by.
I was able to find the answer: it turns out the space is actually a non-breaking space with Unicode code point U+00A0, compared to the normal space's U+0020. The non-breaking space is not part of the American Standard Code for Information Interchange (ASCII). Thus R's filter() couldn't grab some names because they contained non-breaking spaces. I fixed this by substituting the non-breaking space with a normal space across the given column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE) # replace the non-breaking space (U+00A0) with a normal space in the column I am interested in
surveyData$`Completed By Clean` = space_fix
Once I applied this, I could easily filter any name!
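If you want to confirm beforehand which kind of space a value contains, a quick check is to look at its Unicode code points (a non-breaking space shows up as 160, i.e. U+00A0); this sketch assumes the surveyData column from the question:
utf8ToInt(surveyData$`Completed By`[1])   # 160 in the output indicates a non-breaking space
# the same substitution with stringr, if you prefer
library(stringr)
surveyData$`Completed By Clean` <- str_replace_all(surveyData$`Completed By`, "\u00A0", " ")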
Thanks everyone!
I want to extract information from downloaded HTML code. The HTML code is given as a string. The required information is stored between specific HTML expressions. For example, if I want every headline in the string, I have to search for "H1>" and "/H1>" and extract the text between these HTML expressions.
So far, I have used substr(), but I had to calculate the positions of "H1>" and "/H1>" first.
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])
The output is correct, but calculating every single start and stop position is a lot of work. Instead, I am looking for a function similar to substr() where you can use start and stop words instead of positions. For example, something like this:
function(htmlcode, startword = "H1>", stopword = "/H1>")
I'd agree that using a package built for HTML processing is probably the best way to handle the example you give. However, one potential way to substring a string based on character values would be the following.
Step 1: Define a simple function to return the position(s) of a character string within a string; in this example I am only using fixed character strings.
strpos_fixed <- function(string, char) {
  a <- gregexpr(char, string, fixed = T)   # positions of every fixed-string match
  b <- a[[1]][1:length(a[[1]])]            # drop the gregexpr attributes, keep the positions
  return(b)
}
Step 2: Define your new sub-string function using the strpos_fixed() function you just defined
char_substr <- function(string, start, stop) {
  x <- strpos_fixed(string, start) + nchar(start)   # first character after each start word
  y <- strpos_fixed(string, stop) - 1               # last character before each stop word
  z <- cbind(x, y)                                  # pair up the start/stop positions
  apply(z, 1, function(x) { substr(string, x[1], x[2]) })
}
Step 3: Test
htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
You have two options here. First, use a package that has been developed explicitly for the parsing of HTML structures, e.g., rvest. There are a number of tutorials online.
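For instance, a short rvest sketch might look like this (assuming rvest >= 1.0; older versions use html_nodes()/html_text() instead of html_elements()/html_text2()):
library(rvest)
htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
read_html(htmlcode) %>%
  html_elements("h1") %>%
  html_text2()
# should print something like: [1] "headline"  "headline2"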
Second, for edge cases where you may need to extract from strings that are not necessarily well-formatted HTML, you should use regular expressions. One of the simpler implementations for this is stringr::str_match:
# 1. the parentheses define regex groups
# 2. ".*?" means any character, non-greedy
# 3. so together we are matching the expression <H1>some text or characters of any length</H1>
str_match(htmlcode, "(<H1>)(.*?)(</H1>)")
This yields a matrix where the columns are (in order) the fully matched string followed by each regex group we specified. You would just want to pull the second group (the 3rd column) in this case if you want the text between the <H1> tags.
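Since the example string contains two headlines and str_match() only returns the first match, str_match_all() may be closer to what you need; a sketch:
library(stringr)
htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
matches <- str_match_all(htmlcode, "(<H1>)(.*?)(</H1>)")[[1]]
matches[, 3]   # the text captured between the <H1> tags
# should print something like: [1] "headline"  "headline2"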
I have several WordPress HTML pages to import through CSV/Excel. One of the fields is the content of the WordPress page. Since these pages are all the same except in 3 places (2 names, 1 IMG URL), I'm trying to be efficient and upload an Excel file with custom fields.
What I'd like to do is merge the IMG URLs and Product Names into the appropriate spot in the Excel cell text so it's imported as a complete page. I'm trying to avoid all the cutting and pasting when adding hundreds of similar pages with only a few different spots.
Any tips or advice on how I can accomplish this? I haven't been able to figure it out or find help online.
Cell Data Example:
<div id="productimage" style="float:left;width:380px;">
<img alt="alternate" src="imagesource" />
</div>
<div id="productspecs" style="float:left;padding-left:25px;">
<h2><strong>Product Name</strong></h2>
</div>
"Product Name", "alternate", and "imagesource" I have fields for in a spreadsheet .. I just don't know how to merge them into this Cell Data Example to auto-populate these new pages.
Thanks!
If I understand your question correctly, you have HTML in an Excel cell and you want to make parts of that HTML dynamic by referencing content in other cells of the workbook.
I assume that in your example you want to make the imagesource and the Product Name dynamic.
You can copy and paste the HTML into the Excel formula bar. You can increase its height so you see more than one line at a time; the formula bar can handle line breaks.
If you want to build a string that contains double quotes, you will need to use two double quotes if the quote is inside the string and three double quotes in a row if it is at the beginning or end of a string. You can use the ampersand to concatenate strings and cell references.
With your specific example above, the formula in Excel would read somewhere along these lines (replace Sheet2!A2 etc. with the cells that hold your data). Arrange that data in a table with a row for each product; then you can copy this formula down to get the desired result.
="<div id=""productimage"" style=""float:left;width:380px;"">
<img alt=""alternate"" src="""&Sheet2!A2&""" />
</div>
<div id=""productspecs"" style=""float:left;padding-left:25px;"">
<h2><strong>"&Sheet2!B2&"</strong></h2>
</div>"
Turn on "Wrap Text" in the cell format, otherwise you will see it all in one line of code. The screenshot below uses two rows of data with different texts for image source and product name in sheet 2.
EDIT: I tried to post this in a comment, but the double and triple quotes don't make it and get replaced with just one quote.
Also, you managed to delete some of the & signs that concatenate the different strings. Please look again at the original formula I've posted. Replace the cell references with yours, but don't mangle the code. The principle is this:
="First String"&A1&"Next String"
If the string has quotes inside, double them
="He said "Please" but nobody heard him"&A1&"next string"
If the string has quotes at the beginning of the string, then you need the opening quote for the string and the double quote for the quote inside the string. Likewise for quotes at the end of the string: duplicate the quote in the string and then add the closing quote.
="""Please" - he said"&A1&"and she answered "OK."""
I have a CSV, and each line reads as follows:
"http://www.videourl.com/video,video title,video duration,thumbnail,<iframe src=""http://embed.videourl.com/video"" frameborder=0 width=510 height=400 scrolling=no> </iframe>,tag 1,tag 2",,,,,,,,,,,,,,,,,,,,,,,,,,
Is there a program I can use to clean this up? I'm trying to import it into WordPress and map it to current fields, but it isn't functioning properly. Any suggestions?
Just use search and replace in this case. Remove the commas at the end and then replace the remaining commas with "," (quote, comma, quote), so that each field ends up wrapped in quotes.
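If the file is already being processed in R, the same clean-up can be sketched there (the file name videos.csv is just a placeholder, and this is only safe when no field itself contains a comma):
lines <- readLines("videos.csv")                  # hypothetical file name
lines <- gsub(",+$", "", lines)                   # drop the run of trailing commas
lines <- gsub(",", "\",\"", lines, fixed = TRUE)  # turn each remaining comma into ","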
Should anyone else have the same issue, know that this solution will only work with data much like the example given. If the data contains a lot of text and there are commas within the text that need to be kept, then simply search-replacing commas will not work. Using regex would be the next option, and that can be done in Notepad++.
However, the regex pattern depends on the data, so there is not much point in creating an example here.
PHP could also be used to explode each line: remove values that match one of several regexes (e.g., URL, money), and what is left could be (depending on the data again) just a block of text. That approach may not work if there are two or more columns containing a lot of text.