I have written the following code:
library('XML')
library('rvest')
links <- c('https://www.google.com/',
           'https://www.youtube.com/?gl=US',
           'https://news.google.com/news/u/0/headlines?hl=en&ned=us')
for (i in 1:3) {
  html_object <- read_html(links[i])
  write_xml(html_object, file = "test.html")
}
I want to save each of these pages as an HTML file, but my current code only saves one. I am guessing that it keeps overwriting the same file three times. How can I make it stop overwriting the same file? Ideally, the file name for each HTML file would be its URL, but I am unable to figure out how to do that with multiple links. For example, my end result should be three HTML files titled 'https://www.google.com/', 'https://www.youtube.com/?gl=US', and 'https://news.google.com/news/u/0/headlines?hl=en&ned=us'.
What about using paste0() to create your filename in the for-loop?
for (i in seq_along(links)) {
  html_object <- read_html(links[i])
  somefilename <- paste0("filename_", i, ".html")
  write_xml(html_object, file = somefilename)
}
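If you want the file names to reflect the URLs themselves, one option (a sketch; the safe_name() helper below is hypothetical, not part of rvest) is to replace the characters that are not allowed in file names:
# Sketch: derive a file name from each URL by replacing characters
# that are illegal in file names (slashes, colons, ?, &) with "_".
# safe_name() is a hypothetical helper, not part of rvest.
safe_name <- function(url) {
  paste0(gsub("[^A-Za-z0-9.]+", "_", url), ".html")
}

for (i in seq_along(links)) {
  html_object <- read_html(links[i])
  write_xml(html_object, file = safe_name(links[i]))
}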
I need to download a lot of files from an "HTML" link using R.
The links look like:
http://bioinf-applied.charite.de/supernatural_new/src/download_mol.php?sn_id=SN00000001
with the number after id= incrementing for each subsequent file. I want to download the first 1000 files, from: ...id=SN00000001 to ...id=SN00001000
I'm trying to use a loop with a variable to download all those files, but I have no idea how to construct this code in R.
Something like this:
for (i in 1:1000) {
  x <- sprintf("%08d", i) # zero-pad the id to 8 digits (SN00000001 ... SN00001000)
  myPath <- paste0("http://bioinf-applied.charite.de/supernatural_new/src/download_mol.php?sn_id=SN", x)
  download.file(myPath, paste0("SN", x, ".mol"))
}
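If some ids do not exist on the server, download.file() will throw an error and stop the loop. A possible refinement (a sketch, assuming you would rather skip failures and pause between requests) wraps each download in tryCatch():
for (i in 1:1000) {
  x <- sprintf("%08d", i)
  myPath <- paste0("http://bioinf-applied.charite.de/supernatural_new/src/download_mol.php?sn_id=SN", x)
  # Skip ids that fail instead of aborting the whole loop
  tryCatch(download.file(myPath, paste0("SN", x, ".mol")),
           error = function(e) message("Failed: ", myPath))
  Sys.sleep(0.5) # be polite to the server
}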
I'm using R. I currently have a for loop to save text data from URLs to a csv file:
for (i in 1:9) {
  cancerdbdata <- paste0("http://annotation.dbi.udel.edu/CancerDB/record_CD_0000", i, ".txt")
  cancerdbdata1 <- download.file(cancerdbdata, destfile = "CancerDrugDBdestfile.csv")
}
However, as this loops, it does not append the data from each URL to the csv file, and I am left with a csv file that only contains the information from the last URL. I've tried to find a way to add the data sequentially from each URL but cannot. Sorry if this has already been asked; I looked around but couldn't find anything that made sense to me. Thanks in advance for an answer or for redirecting me to one!
download.file() has a mode parameter you can set to "a". This makes sure that the data from each download is appended to the file instead of overwriting it.
The following should do:
for (i in 1:9) {
  cancerdbdata <- paste0("http://annotation.dbi.udel.edu/CancerDB/record_CD_0000",
                         i,
                         ".txt")
  cancerdbdata1 <- download.file(cancerdbdata,
                                 destfile = "CancerDrugDBdestfile.csv",
                                 mode = "a")
}
I hope this helps.
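An alternative, if you ultimately want one clean table rather than raw appended text, is to read each URL into a data frame and combine the pieces before writing once. A sketch, assuming the .txt files are comma-separated and share the same columns:
# Sketch: read each URL, combine the pieces, write a single csv.
# Assumes the .txt files are comma-separated with a header row.
all_data <- do.call(rbind, lapply(1:9, function(i) {
  url <- paste0("http://annotation.dbi.udel.edu/CancerDB/record_CD_0000", i, ".txt")
  read.csv(url)
}))
write.csv(all_data, "CancerDrugDBdestfile.csv", row.names = FALSE)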
How can I append my R outputs to a single sheet of an xlsx file? I am currently working on web crawling, where I need to scrape user reviews from a website and save them to my desktop in xlsx format. Each time I need to change the website URL in my code (as the user reviews are on different pages) and save the output to one sheet of the xlsx file.
Can you please help me with code for appending outputs to a single sheet of an xlsx file? Below is the code I am using; every time, I change the website URL, run the same code, and save the corresponding output to a single sheet of mydata.xlsx:
library("rvest")
htmlpage <- html("http://www.glassdoor.com/GD/Reviews/Symphony-Teleca-Reviews-E28614_P2.htm?sort.sortType=RD&sort.ascending=false&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME&filter.employmentStatus=UNKNOWN")
proshtml <- html_nodes(htmlpage, ".pros")
pros <- html_text(proshtml)
pros
data=data.frame(pros)
library(xlsx)
write.xlsx(data, "D:/mydata.xlsx", append=TRUE)
A trivial, but super-slow way:
If you only need to add (a few) row(s) to an existing Excel file, and it only has one sheet to which you want to append, you can just do a simple read => overwrite step:
SHEET.NAME <- '...' # fill in with yours
existing.data <- read.xlsx(file, sheetName = SHEET.NAME)
new.data <- rbind(existing.data, data)
write.xlsx(new.data, file, sheetName = SHEET.NAME, row.names = FALSE, append = FALSE)
Note:
It's quite slow in general and will only work at small scale
read.xlsx is a slow function. Try read.xlsx2 to make it much faster (see the difference in the docs)
If your R process is run once and keeps working for a long time, obviously don't do it this way (reading and overwriting a file is ridiculous in that case)
Look at the xlsx package.
?write.xlsx will show you what you want. append=TRUE is the key.
========= EDIT TO CORRECT =========
As #Jakub pointed out, append=TRUE adds another worksheet to the file.
========= EDIT TO ADD: ANOTHER METHOD ==========
Another method is to save the data to a .csv file, which can easily be opened from Excel. In this case, append=T works as expected (adding to the existing sheet):
write.table(df, "D:/MyFile.csv", append = TRUE, sep = ",")
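One caveat: with the default col.names = TRUE, every append writes the header row again. A guard (a sketch, assuming the file may or may not exist yet) is:
# Only write column names when the file does not exist yet,
# so repeated appends do not duplicate the header row.
f <- "D:/MyFile.csv"
write.table(df, f, append = file.exists(f), sep = ",",
            col.names = !file.exists(f), row.names = FALSE)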
I need to download a few hundred Excel files and import them into R each day. Each one should become its own data frame. I have a .csv file with all the addresses (the addresses remain static).
The .csv file looks like this:
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%b
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a0
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%aa11
etc.
I can do it with a single file like this:
library(XLConnect)
my.url <- "http://www.somehomepage.com/chartserver/hometolotsoffiles%a"
loc.download <- "C:/R/lotsofdata/" # each file probably needs its own name here?
download.file(my.url, loc.download, mode = "wb")
df.import.x1 <- readWorksheetFromFile(loc.download, sheet = 2)
# This kind of import works on all the files, if you run them individually
But I have no idea how to download each file, place each one in its own folder, and then import them all into R as individual data frames.
It's hard to answer your question as you haven't provided a reproducible example and it isn't clear exactly what you want. Anyway, the code below should point you in the right direction.
You have a list of urls you want to visit:
urls <- c("http://www/chartserver/hometolotsoffiles%a",
          "http://www/chartserver/hometolotsoffiles%b")
In your example, you load this from a csv file.
Next we download each file and put it in a separate directory (you mentioned that in your question):
for (url in urls) {
  split_url <- strsplit(url, "/")[[1]]
  ## Extract the final part of the URL to use as a directory name
  dir <- split_url[length(split_url)]
  ## Create a directory
  dir.create(dir)
  ## Download into that directory; destfile must be a file path, not the directory
  download.file(url, file.path(dir, dir), mode = "wb")
}
Then we loop over the directories and files and store the results in a list.
## Read in files
l <- list(); i <- 1
dirs <- list.dirs("/data/", recursive = FALSE)
for (dir in dirs) {
  file <- list.files(dir, full.names = TRUE)
  ## Do something?
  ## Perhaps store sheets as a list
  l[[i]] <- readWorksheetFromFile(file, sheet = 2)
  i <- i + 1
}
We could of course combine steps two and three into a single loop. Or drop the loops and use sapply.
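For reference, a sapply() version of the reading step might look like this (a sketch; readWorksheetFromFile() is from XLConnect, as in your example):
# Sketch: read sheet 2 of the single file inside each directory,
# keeping the results as a named list of data frames.
dirs <- list.dirs("/data/", recursive = FALSE)
l <- sapply(dirs, function(dir) {
  readWorksheetFromFile(list.files(dir, full.names = TRUE), sheet = 2)
}, simplify = FALSE)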
In R, I would like to read in data from a file, then do a bunch of stuff, then write out data to another file. I can do that. But I'd like to have the two files have similar names automatically.
e.g., if I create a file params1.R, I can read it in with
source("c:\\personal\\consults\\ElwinWu\\params1.R")
then do a lot of stuff
then write out a resulting table with write.table and a filename similar to above, except with output1 instead of params1.
But I will be doing this with many different params files, and I can foresee making the careless mistake of not changing the output file name to match the params file. Is there a way to automate this?
That is, to set the number for the output to match the number for the params?
thanks
Peter
If your source file name always contains "params", which you want to change to "output", then you can easily do this with gsub:
source(file <- "c:\\personal\\consults\\ElwinWu\\params1.R")
### some stuff
write.table(youroutput, gsub("params", "output", file))
# Will write in "c:\\personal\\consults\\ElwinWu\\output1.R"
Edit:
Or, to get .txt as the file type (fixed = TRUE treats ".R" literally rather than as a regular expression):
write.table(youroutput, gsub(".R", ".txt", gsub("params", "output", file), fixed = TRUE))
# Will output in "c:\\personal\\consults\\ElwinWu\\output1.txt"
Edit2:
And a loop for 20 params files would then be:
n <- 20 # number of files
for (i in 1:n) {
  source(file <- paste0("c:\\personal\\consults\\ElwinWu\\params", i, ".R"))
  ### some stuff
  write.table(youroutput, gsub(".R", ".txt", gsub("params", "output", file), fixed = TRUE))
}
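Since the loop already builds the input name from i, an equally simple alternative (a sketch) is to build the output name the same way and skip gsub() entirely:
for (i in 1:n) {
  source(paste0("c:\\personal\\consults\\ElwinWu\\params", i, ".R"))
  ### some stuff
  write.table(youroutput,
              paste0("c:\\personal\\consults\\ElwinWu\\output", i, ".txt"))
}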
If the idea is just to make sure that all the outputs go in the same directory as the input, then try this:
source(file <- "c:\\personal\\consults\\ElwinWu\\params1.R")
old.dir <- setwd(dirname(file))
write.table(...whatever..., file = "output1.dat")
write.table(...whatever..., file = "output2.dat")
setwd(old.dir)
If you don't need to preserve the initial directory you can omit the last line.
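If you use this pattern often, a small helper that restores the directory even when an error occurs can be safer. A sketch using on.exit(); run_params() is a hypothetical name:
# Hypothetical helper: source a params file and write output beside it,
# restoring the working directory even if an error occurs.
run_params <- function(file) {
  old.dir <- setwd(dirname(file))
  on.exit(setwd(old.dir)) # runs when the function exits, error or not
  source(basename(file))
  ### some stuff that creates youroutput
  write.table(youroutput, file = "output1.dat")
}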