Let's say I have a list in R that contains other lists, and I want to remove certain characters (the comma in the following example) from the elements of each list.
my.list <- list(c("hello , world ", "hello world,,,"), c("123,456", "1,234"))
The following gets the job done:
gsub(",", "", my.list[[1]])
gsub(",", "", my.list[[2]])
but how do I do it more efficiently, as my actual list is long? I tried the following, but it gives me strange results:
lapply(my.list, function(x) gsub(",","",my.list))
Any help? Thanks!
You might be able to use the function filter_element from the textclean package.
I would suggest sanitizing your data before combining it into a list. Once the data has been sanitized, then start entering it into lists.
You can check out the documentation for filter_element on page 10 of the following PDF:
https://cran.r-project.org/web/packages/textclean/textclean.pdf
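That said, the strange results come from the anonymous function ignoring its argument: it applies gsub to the whole my.list instead of x. A minimal base R fix, with no extra package needed:
lapply(my.list, function(x) gsub(",", "", x))
Each element of my.list is a character vector and gsub is vectorized, so this returns a list of cleaned vectors, however long the list is.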
I am writing a program to generate whisker templates in a loop, and I want to paste them together. For each template, my code looks like this:
script[i] <- whisker.render(template_pdf[i], data = parameter)
Is there a way to paste all the script[i] together? I know I could paste all the chunks first and then call whisker.render just once, but that would cause some trouble in my particular case. If I can paste all the script[i] together, that will be convenient for me.
When script is a character vector, you can do
paste(script, collapse = "\n")
with \n (newline) as the character inserted between the script elements.
When script is a list, you can do
do.call(paste, c(script, sep = "\n"))
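A self-contained sketch of the character-vector case; template_pdf and parameter here are stand-ins for your actual templates and data:
library(whisker)
parameter <- list(name = "World")
template_pdf <- c("Hello, {{name}}!", "Goodbye, {{name}}!")
# Render each template, collecting the results in a character vector
script <- vapply(template_pdf, whisker.render, character(1), data = parameter)
paste(script, collapse = "\n")
# [1] "Hello, World!\nGoodbye, World!"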
I have a list of identifiers as follows:
url_num <- c('85054655', '85023543', '85001177', '84988480', '84978776', '84952756', '84940316', '84916976', '84901819', '84884081', '84862066', '84848942', '84820189', '84814935', '84808144')
And from each of these I'm creating a unique variable:
for (id in url_num){
  assign(paste('test_', id, sep = ""), FUNCTION GOES HERE)
}
This leaves me with my variables which are:
test_85054655, test_85023543, etc.
Each of them holds the correct output from the function (I've checked). However, my next step is to combine them into one big vector which holds each of these created variables as a separate element. This is easy enough via:
c(test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144)
However, as I update the original 'url_num' vector with new identifiers, I'd also have to come down to the above chunk and update this too!
Surely there's a more automated way I can set up the above chunk?
Maybe some sort of concat() function in the original for-loop which just adds each created variable straight into an empty vector right then and there?
So far I've just been trying to list all the variable names and somehow get the output to be in an acceptable format to get thrown straight into the c() function.
for (id in url_num){
  cat(as.name(paste('test_', id, ",", sep = "")))
}
...which results in:
test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144,
This is close to the output I'm looking for, but because it uses the cat() function it's essentially a print statement, and its output can't really be put anywhere. Not to mention I feel like this method is wrong to begin with and there must be something simpler I'm missing.
Thanks in advance for any help you guys can give me!
Troy
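One common answer to this pattern is to skip assign() entirely and build the vector directly. A sketch, where FUN stands in for whatever function the original loop applies to each id:
# Apply the (placeholder) function to each id, collecting results in one vector
test_results <- sapply(url_num, function(id) FUN(id))
names(test_results) <- paste('test_', url_num, sep = "")
Because test_results is built from url_num itself, adding new identifiers to url_num is the only update ever needed.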
I am working on a web scraping project using rvest.
html_text(html_nodes(url, CSS))
extracts data from url wherever the matching CSS selector is found. My problem is that the website I am scraping uses a unique CSS ID for each listed product (such as ListItem_001_Price), so one selector matches exactly one item's price, and automated scraping doesn't work.
I can create a vector
V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
for all the products' CSS IDs manually. Is it possible to pass its individual elements to the html_nodes() function in one go and collect the resulting data back as a single vector/data frame?
How to make it work?
You can try using lapply here:
V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
results <- lapply(V, function(x) html_text(html_nodes(url, x)))
I assume here that your nested call to html_text will in general return a character vector of the text corresponding to the matching nodes, for each item in V. This would leave you with a list of vectors which you can then access.
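If each ID matches exactly one node, the list can then be flattened into the single vector you asked for:
prices <- unlist(results)
unlist() drops the list structure, leaving one character vector with one entry per CSS ID.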
html_nodes() needs the initial "." to find your tags by CSS class. You could manually create
V <- c(".ListItem_001_Price", ".ListItem_002_Price", ".ListItem_003_Price")
as you suggest, but I recommend that you use a regex to match the classes, like 'ListItem_([0-9]{3})_Price', so you can avoid the manual labour. Make sure you run the regex on the actual string of your markup, and not on the html-node object (see below).
In R, apply(), lapply(), sapply() and the like work much like a short loop: they apply a function to every member of a data type that contains multiple values, such as lists, data frames, matrices or vectors.
In your case it's a vector, and a way to begin to understand how it works is to think of it like:
sapply(vector, function(x) THING-TO-DO-WITH-ITEM-IN-VECTOR)
In your case, the thing to do with each item in the vector is to fetch the html_text corresponding to it.
See the code below for an example:
library(rvest)
# An example piece of html
example_markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_031_Price\">Phone</li>
<li class=\"ListItem_002_Price\">Paper clip</li>
<li class=\"ListItem_012_Price\">Bucket</li>
</ul>"
html <- read_html(example_markup)
# Avoid manual creation of css with regex
regex <- 'ListItem_([0-9]{3})_Price'
# Note that ([0-9]{3}) will match three consecutive numeric characters
price_classes <- regmatches(example_markup, gregexpr(regex, example_markup))[[1]]
# Paste leading "." so that html_nodes() can find the class:
price_classes <- paste(".", price_classes, sep="")
# A single entry is found like so:
html %>% html_nodes(".ListItem_031_Price") %>% html_text()
# Use sapply to get a named character vector of your products
# Note how ".ListItem_031_Price" from the line above is replaced by x
# which will be each item of price_classes in turn.
products <- sapply(price_classes, function(x) html %>% html_nodes(x) %>% html_text())
The result in products is a named character vector. Use unname(products) to drop the names.
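With the example markup above, this yields the products in document order:
unname(products)
# [1] "Brush"      "Phone"      "Paper clip" "Bucket"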
I'm using base::paste in a for loop:
for (k in 1:length(summary$pro)) {
  if (k == 1) {
    mp <- summary$pro[k]
  } else {
    mp <- paste(mp, summary$pro[k], sep = ",")
  }
}
mp comes out as one big string, where the elements are separated by commas.
For example, mp is "1,2,3,4,5,6".
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI, I've also tried converting mp to a list and strsplit()-ing it, neither of which has worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).
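For instance, the whole loop above collapses to one vectorized call, and the follow-up question about summary$me can be handled with cat()'s append argument. A sketch, reusing recompdatafile from the question:
mp <- paste(summary$pro, collapse = ",")   # replaces the entire for loop
cat(summary$pro, sep = ",", file = recompdatafile)
cat("\n", file = recompdatafile, append = TRUE)   # end the first row
cat(summary$me, sep = ",", file = recompdatafile, append = TRUE)   # second row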
I'm all new to scraping and I'm trying to understand xpath using R. My objective is to create a vector of people from this website. I'm able to do it using:
r<-htmlTreeParse(e) ## e is after getURL
g.k<-(r[[3]][[1]][[2]][[3]][[2]][[2]][[2]][[1]][[4]])
l<-g.k[names(g.k)=="text"]
u<-ldply(l,function(x) {
w<-xmlValue(x)
return(w)
})
However, this is cumbersome, and I'd prefer to use xpath. How do I go about referencing the path detailed above? Is there a function for this, or can I submit my path somehow referenced as above?
I've come to
xpathApply( htmlTreeParse(e, useInt=T), "//body//text//div//div//p//text()", function(k) xmlValue(k))->kk
But this leaves me with a lot of cleaning up to do, and I assume it can be done better.
Regards,
//M
EDIT: Sorry for the lack of clarity, but I'm all new to this and rather confused. Unfortunately, the XML document is too large to be pasted. I guess my question is whether there is some easy way to find the names of these nodes and the structure of the document, besides using view source? I've come a little closer to what I'd like:
getNodeSet(htmlTreeParse(e, useInt=T), "//p")[[5]]->e2
gives me the list of what I want, although still as XML with br tags. I thought running
xpathApply(e2, "//text()", function(k) xmlValue(k))->kk
would provide a list that could later be unlisted. However, it provides a list with more garbage than e2 displays.
Is there a way to do this directly:
xpathApply(htmlTreeParse(e, useInt=T), "//p[5]//text()", function(k) xmlValue(k))->kk
Link to the web page below. I'm trying to get the names, and only the names, from the page.
getURL("http://legeforeningen.no/id/1712")
I ended up with
xml = htmlTreeParse("http://legeforeningen.no/id/1712", useInternalNodes=TRUE)
(no need for RCurl) and then
sub(",.*$", "", unlist(xpathApply(xml, "//p[4]/text()", xmlValue)))
(subsetting in xpath), which leaves a final line that is not a name. One could do the text processing in XPath, too, but then one would iterate at the R level.
n <- xpathApply(xml, "count(//p[4]/text())") - 1L
sapply(seq_len(n), function(i) {
xpathApply(xml, sprintf('substring-before(//p[4]/text()[%d], ",")', i))
})
Unfortunately, this does not pick up names that do not contain a comma.
Use a mixture of xpath and string manipulation.
#Retrieve and parse the page.
library(XML)
library(RCurl)
page <- getURL("http://legeforeningen.no/id/1712")
parsed <- htmlTreeParse(page, useInternalNodes = TRUE)
Inspecting the parsed variable, which contains the page's source, tells us that instead of sensibly using a list tag (like <ul>), the author just put a paragraph (<p>) of text split with line breaks (<br />). We use xpath to retrieve the <p> elements.
#Inspection tells us we want the fifth paragraph.
name_nodes <- xpathApply(parsed, "//p")[[5]]
Now we convert to character, split on the <br> tags and remove empty lines.
all_names <- as(name_nodes, "character")
all_names <- gsub("</?p>", "", all_names)
all_names <- strsplit(all_names, "<br />")[[1]]
all_names <- all_names[nzchar(all_names)]
all_names
Optionally, separate the names of people and their locations.
strsplit(all_names, ", ")
Or, more prettily, with stringr:
library(stringr)
str_split_fixed(all_names, ", ", 2)
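A quick sketch of what that returns, using hypothetical entries of the same shape:
str_split_fixed(c("Jane Doe, Oslo", "Ola Nordmann, Bergen"), ", ", 2)
#      [,1]           [,2]
# [1,] "Jane Doe"     "Oslo"
# [2,] "Ola Nordmann" "Bergen"
Names end up in the first column and locations in the second.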