I have a link that serves data as an ".iqy" file, and I need to read that data for further cleaning. I can do it manually by opening the URL on the file's third line, which I extract with:
con <- file("ABC1.iqy", "r", blocking = FALSE)
readLines(con = con, n = -1L, ok = TRUE, warn = FALSE, encoding = 'unknown')
Output:
[1] "WEB"
[2] "1"
[3] "https:abc.../excel/execution/EPnx?view=vrs"
[4] ""
[5] ""
[6] "Selection=AllTables"
[7] "Formatting=None"
[8] "PreFormattedTextToColumns=True"
[9] "ConsecutiveDelimitersAsOne=True"
[10] "SingleBlockTextImport=False"
[11] "DisableDateRecognition=False"
[12] "DisableRedirections=False"
[13] ""
I need to automate this instead of doing it manually. Is there any option in R that I can use?
Simply use download.file() :)
con <- file("ABC1.iqy", "r", blocking = FALSE)
dest_path <- "ABC.file"
download.file(readLines(con = con, n = -1L, ok = TRUE, warn = FALSE, encoding = 'unknown')[3], destfile = dest_path)
If you can't read the file you get, try:
download.file(readLines(con = con, n = -1L, ok = TRUE, warn = FALSE, encoding = 'unknown')[3], destfile = dest_path, mode = "wb")
I am using R to download images from the Reptile-Database by filling in their search form to look for specific images. For that, I am following previous suggestions on submitting an online form from R, such as:
library(httr)
library(tidyverse)
POST(
  url = "http://reptile-database.reptarium.cz/advanced_search",
  encode = "json",
  body = list(
    genus = "Chamaeleo",
    species = "dilepis"
  )
) -> res
out <- content(res)[1]
This seems to work smoothly, but my problem now is to identify the link with the correct species name in the resulting out object.
This object should contain the following page:
https://reptile-database.reptarium.cz/species?genus=Chamaeleo&species=dilepis&search_param=%28%28genus%3D%27Chamaeleo%27%29%28species%3D%27dilepis%27%29%29
This page contains names with links. Thus, I would like to identify the link that takes me to the page with the correct species' table; however, I am unable to find the link, or even the species name, within the generated out object.
Here I only extract the links to the pictures. Simply map or apply a function over them to download each one with download.file().
library(tidyverse)
library(rvest)
genus <- "Chamaeleo"
species <- "dilepis"
pics <- paste0(
  "http://reptile-database.reptarium.cz/species?genus=", genus,
  "&species=", species) %>%
  read_html() %>%
  html_elements("#gallery img") %>%
  html_attr("src")
[1] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg"
[2] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
[3] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029987_01_t.jpg"
[4] "https://www.reptarium.cz/content/photo_rd_02/Chamaeleo-dilepis-03000029988_01_t.jpg"
[5] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035130_01_t.jpg"
[6] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035131_01_t.jpg"
[7] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035132_01_t.jpg"
[8] "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000035133_01_t.jpg"
[9] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036237_01_t.jpg"
[10] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036238_01_t.jpg"
[11] "https://www.reptarium.cz/content/photo_rd_06/Chamaeleo-dilepis-03000036239_01_t.jpg"
[12] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041048_01_t.jpg"
[13] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041049_01_t.jpg"
[14] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041050_01_t.jpg"
[15] "https://www.reptarium.cz/content/photo_rd_11/Chamaeleo-dilepis-03000041051_01_t.jpg"
[16] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042287_01_t.jpg"
[17] "https://www.reptarium.cz/content/photo_rd_12/Chamaeleo-dilepis-03000042288_01_t.jpg"
[18] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0070.jpeg"
[19] "https://calphotos.berkeley.edu/imgs/128x192/1338_3161/0662/0074.jpeg"
[20] "https://calphotos.berkeley.edu/imgs/128x192/9121_3261/2921/0082.jpeg"
[21] "https://calphotos.berkeley.edu/imgs/128x192/1338_3152/3386/0125.jpeg"
[22] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/1009/0136.jpeg"
[23] "https://calphotos.berkeley.edu/imgs/128x192/6666_6666/0210/0057.jpeg"
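The downloads themselves can then be scripted with Map() over the scraped URLs. A minimal sketch (the two URLs are from the output above; the pics/ output folder and the file-naming scheme are my own choices):

```r
# `pics` would normally be the full vector scraped above; shortened here
pics <- c(
  "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000034021_01_t.jpg",
  "https://www.reptarium.cz/content/photo_rd_05/Chamaeleo-dilepis-03000033342_01_t.jpg"
)
dir.create("pics", showWarnings = FALSE)

# name each local file after the last component of its URL,
# and download in binary mode so the JPEGs are not corrupted
dest <- file.path("pics", basename(pics))
Map(function(u, d) download.file(u, destfile = d, mode = "wb"), pics, dest)
```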
I already have these jpeg files of R plots and I want to change the title name. The problem is that I need to load these images and then change them (I only have the images; I don't have the code to make them again).
I would like to know if there is some way to change this.
Thank you
file.rename(from, to) will help.
wd <- "C:/Users/Public/r_img"
old_files <- list.files(wd, full.names = TRUE)
no_of_files <- length(old_files)
new_files <- paste0(wd, '/', 'NewFile', 1:no_of_files, '.jpg')
file.rename(from = old_files, to = new_files)
print(old_files)
new_files <- list.files(wd, full.names = TRUE)
print(new_files)
Output of print(old_files):
[1] "C:/Users/Public/r_img/OldFile1.jpg" "C:/Users/Public/r_img/OldFile10.jpg"
[3] "C:/Users/Public/r_img/OldFile2.jpg" "C:/Users/Public/r_img/OldFile3.jpg"
[5] "C:/Users/Public/r_img/OldFile4.jpg" "C:/Users/Public/r_img/OldFile5.jpg"
[7] "C:/Users/Public/r_img/OldFile6.jpg" "C:/Users/Public/r_img/OldFile7.jpg"
[9] "C:/Users/Public/r_img/OldFile8.jpg" "C:/Users/Public/r_img/OldFile9.jpg"
Output of print(new_files):
[1] "C:/Users/Public/r_img/NewFile1.jpg" "C:/Users/Public/r_img/NewFile10.jpg"
[3] "C:/Users/Public/r_img/NewFile2.jpg" "C:/Users/Public/r_img/NewFile3.jpg"
[5] "C:/Users/Public/r_img/NewFile4.jpg" "C:/Users/Public/r_img/NewFile5.jpg"
[7] "C:/Users/Public/r_img/NewFile6.jpg" "C:/Users/Public/r_img/NewFile7.jpg"
[9] "C:/Users/Public/r_img/NewFile8.jpg" "C:/Users/Public/r_img/NewFile9.jpg"
The following are the URLs I wish to extract:
> links
[1] "https://www.makemytrip.com/holidays-india/"
[2] "https://www.makemytrip.com/holidays-india/"
[3] "https://www.yatra.com/india-tour-packages"
[4] "http://www.thomascook.in/tcportal/international-holidays"
[5] "https://www.yatra.com/holidays"
[6] "https://www.travelguru.com/holiday-packages/domestic-packages.shtml"
[7] "https://www.chanbrothers.com/package"
[8] "https://www.tourmyindia.com/packagetours.html"
[9] "http://traveltriangle.com/tour-packages"
[10] "http://www.coxandkings.com/bharatdeko/"
[11] "https://www.sotc.in/india-tour-packages"
I have managed to do it using:
for (i in 1:10) {
  html <- getURL(links[i], followlocation = TRUE)
  # parse html
  doc <- htmlParse(html, asText = TRUE)
  plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
}
But the thing is, all the extracted data ends up in "plain.text", which is overwritten on every iteration. How do I keep a separate "plain.text" for each link?
Thank you.
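One way is to collect the results in a list with one element per link, instead of overwriting a single variable. A sketch of that pattern, assuming the links vector and the XPath query from the question (not run against those sites here):

```r
library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply()

# pre-allocate one slot per link so nothing gets overwritten
plain.text <- vector("list", length(links))
for (i in seq_along(links)) {
  html <- getURL(links[i], followlocation = TRUE)
  doc  <- htmlParse(html, asText = TRUE)
  plain.text[[i]] <- xpathSApply(
    doc,
    "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]",
    xmlValue
  )
}
# name each element by its URL, so plain.text[[links[1]]] works too
names(plain.text) <- links
```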
Is there a way to import data from a .pdf file into HTML format using R?
I tried with the following code:
library(tm)
filename = "file.pdf"
doc <- readPDF(control = list(text = "-layout"))(
  elem = list(uri = filename), language = "en", id = "id1"
)
head(doc)
Output in HTML displays as:
## $content
## [1] " sample data"
## [2] ""
## [3] " records"
## [4] ""
## [5] " 31 July 2017"
## [6] ""
## [7] ""
## [8] "R Markdown setup"
## [9] ""
## [10] ""
## [11] "R Markdown"
## [12] ""
## [13] "This is an R Markdown document. Markdown is a simple formatting syntax for"
## [14] "authoring HTML, PDF, and MS Word documents. For more details on using R"
## [15] "Markdown see http://rmarkdown.rstudio.com."
## [16] "When you click the Knit button a document will be generated that includes"
## [17] "both content as well as the output of any embedded R code chunks within the"
## [18] "document. You can embed an R code chunk like this:"
## [19] "{r cars} summary(cars)"
Please help!
I downloaded the PDF file available here: https://fie.org/competition/2022/152/results/pools/pdf?lang=en
With the following code, I have been able to convert the PDF file to an HTML file:
library(RDCOMClient)
path_PDF <- "C:\\pdf_with_table.pdf"
path_Html <- "C:\\temp.html"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Html, FileFormat = 9) # saves to html
From my point of view, it would be more straightforward to extract the tables directly from the PDF, or to convert the PDF to a Word file and extract the tables from the Word document.
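As a sketch of that direct route (using the pdftools package, which is my suggestion here, not part of the original workflow; the path is the one from the answer above):

```r
library(pdftools)

# pdf_text() returns one character string per page of the PDF
pages <- pdf_text("C:\\pdf_with_table.pdf")

# split each page into its lines; the table columns can then be
# reconstructed from the fixed-width layout of each line
rows <- strsplit(pages, "\n")
```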
I'm attempting to use the system function in R to run a program, which I expect to yield an error message in some cases. For this I want to write a tryCatch function.
system(command, intern = TRUE) only returns the values actually echoed by the program I'm running; it does not return the error.
In R, how can I get the error message that my system command produced?
My code:
test <- tryCatch({
  cmd <- paste0("../Scripts/Plink2/plink --file ../InputData/", prefix, " --bmerge ",
                "../InputData/fs --missing --out ../InputData/", prefix)
  print(cmd)
  system(cmd)
}, error = function(e) {
  # error handler picks up where error was generated
  print("EZEL")
  print(paste("MY_ERROR: ", e))
}, finally = {
  print("something")
})
[1] "../Scripts/Plink2/plink --file ../InputData/GS80Kdata --bmerge ../InputData/fs --missing --out ../InputData/GS80Kdata"
PLINK v1.90b3.37 64-bit (16 May 2016) https://www.cog-genomics.org/plink2
#....
#skipping some lines here to reduce size
#....
Of these, 1414410 are new, while 2462 are present in the base dataset.
Error: 1 variant with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
# Skipping some more lines here
[1] "something"
However, when using intern = TRUE and assigning the result of system() to a variable, the error is not captured in the vector and is still printed to the R console.
Edit: here is the content of that vector (using gsub to reduce its ridiculous size):
> gsub(pattern="\b\\d.*", replacement = "", x = tst)
[1] "PLINK v1.90b3.37 64-bit (16 May 2016) https://www.cog-genomics.org/plink2"
[2] "(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3"
[3] "Logging to ../InputData/GS80Kdata.log."
[4] "Options in effect:"
[5] " --bmerge ../InputData/fs"
[6] " --file ../InputData/GS80Kdata"
[7] " --missing"
[8] " --out ../InputData/GS80Kdata"
[9] ""
[10] "64381 MB RAM detected; reserving 32190 MB for main workspace."
[11] "Scanning .ped file... 0%\b"
[12] "2%\b\b"
[13] "%\b\b"
[14] "\b\b"
[15] "\b"
[16] ""
[17] "58%\b\b"
[18] "7%\b\b"
[19] "%\b\b"
[20] "\b\b"
[21] "\b"
[22] "Performing single-pass .bed write (42884 variants, 14978 people)."
[23] "0%\b"
[24] "../InputData/GS80Kdata-temporary.bim + ../InputData/GS80Kdata-temporary.fam"
[25] "written."
[26] "14978 people loaded from ../InputData/GS80Kdata-temporary.fam."
[27] "144 people to be merged from ../InputData/fs.fam."
[28] "Of these, 140 are new, while 4 are present in the base dataset."
[29] "42884 markers loaded from ../InputData/GS80Kdata-temporary.bim."
[30] "1416872 markers to be merged from ../InputData/fs.bim."
[31] "Of these, 1414410 are new, while 2462 are present in the base dataset."
attr(,"status")
[1] 3
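Note that the failure does leave a trace: the attr(,"status") at the end of the vector is plink's non-zero exit code. Since system(intern = TRUE) only captures stdout (the "Error: ..." line goes to stderr), one option is system2() with stderr = TRUE, which merges stderr into the returned vector; the status attribute can then be turned into a real R error that tryCatch() will see. A sketch, reusing the plink call and prefix from the question:

```r
cmd  <- "../Scripts/Plink2/plink"
args <- c("--file", paste0("../InputData/", prefix),
          "--bmerge", "../InputData/fs",
          "--missing", "--out", paste0("../InputData/", prefix))

# capture stdout AND stderr into one character vector
out <- system2(cmd, args, stdout = TRUE, stderr = TRUE)

# on a non-zero exit code, R attaches a "status" attribute to the result;
# raise a regular R condition so tryCatch() can handle it
status <- attr(out, "status")
if (!is.null(status) && status != 0) {
  stop("plink failed (exit code ", status, "): ",
       paste(grep("^Error", out, value = TRUE), collapse = "; "))
}
```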