I would like to download multiple text files ending in .txt from a website. I am trying to use the answers from Download All Files From a Folder on a Website
I changed the answers to target .txt files, but none of the solutions from that page worked for me. I think I have to change this line from the first answer
xpQuery = "//a/@href['.txt'=substring(., string-length(.) - 3)]"
for txt files but I don't know how.
The edited code is something like below:
## your base url
url <- "https://..."
## query the url to get all the file names ending in '.txt'
txts <- XML::getHTMLLinks(
url,
xpQuery = "//a/@href['.txt'=substring(., string-length(.) - 3)]"
)
## create a new directory 'mytxts' to hold the downloads
dir.create("mytxts")
## save the current directory path for later
wd <- getwd()
## change working directory for the download
setwd("mytxts")
## create all the new files
file.create(txts)
## download them all
lapply(paste0(url, txts), function(x) download.file(x, basename(x)))
## reset working directory to original
setwd(wd)
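The setwd dance can also be avoided by building full destination paths for download.file. A minimal sketch; the href values below are hypothetical stand-ins for what getHTMLLinks would return, and the download itself is commented out because the base url is a placeholder:

```r
## placeholder base url, as in the question; assumed to end with "/"
url <- "https://.../"

## hypothetical link names, standing in for the getHTMLLinks() result
txts <- c("a.txt", "b.txt")

## build destination paths under 'mytxts' instead of changing directories
dir.create("mytxts", showWarnings = FALSE)
dest <- file.path("mytxts", basename(txts))

## one download per (source, destination) pair
## (commented out here because the url above is only a placeholder)
# Map(function(src, dst) download.file(src, dst, mode = "wb"),
#     paste0(url, txts), dest)
```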
I am trying to read the latest SPSS file from a directory that holds several SPSS files. I want to read only the newest of the 3 files, a list that changes over time. Currently I have entered the filename manually (SPSS-1568207835.sav, for example), which works absolutely fine, but I want to make this dynamic and automatically fetch the latest file. Any help would be greatly appreciated.
setwd('/file/path/for/this/file/SPSS')
library(expss)
expss_output_viewer()
mydata = read_spss("SPSS-1568207835.sav",reencode = TRUE)
w <- data.frame(mydata)
args <- commandArgs(TRUE)
This should return a character string with the filename of the .sav file modified most recently:
# get all .sav files
all_sav <- list.files(pattern ='\\.sav$')
# use file.info to get the index of the file most recently modified
all_sav[with(file.info(all_sav), which.max(mtime))]
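Folded into the asker's script, the lookup can live in a small helper. The temp-file setup below is only there to make the sketch self-contained; in the real script only the commented read_spss line is needed:

```r
## return the file matching `pattern` that was modified most recently
latest_file <- function(pattern, path = ".") {
  files <- list.files(path, pattern = pattern, full.names = TRUE)
  files[with(file.info(files), which.max(mtime))]
}

## self-contained demo: two dummy .sav files with explicit mtimes
dir <- tempfile(); dir.create(dir)
old <- file.path(dir, "SPSS-1.sav"); new <- file.path(dir, "SPSS-2.sav")
file.create(old, new)
Sys.setFileTime(old, Sys.time() - 3600)  # make the first file an hour older
latest_file("\\.sav$", dir)              # picks SPSS-2.sav

## in the real script:
# mydata <- read_spss(latest_file("\\.sav$"), reencode = TRUE)
```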
Inside my working directory I have folders whose names end in "_txt", each containing files. I want to zip every folder under its original name, with its files inside. Everything works, but the .zip also contains the directory name, which I don't want: e.g. "1202_txt.zip\1202_txt\files" needs to be "1202_txt.zip\files".
dir.create("1202_txt") # create a folder inside the working directory
array <- list.files(pattern = "_txt$")
for (i in seq_along(array)) {
  name <- paste0(array[i], ".zip")
  zip(name, files = array[i])
}
The code above is from Creating zip file from folders in R
Note: empty folders can be skipped
Can you please try this? (using R 3.5.0, macOS High Sierra 10.13.6)
dir_array <- list.files(getwd(), "*_txt")
zip_files <- function(dir_name){
zip_name <- paste0(dir_name, ".zip")
zip(zipfile = zip_name, files = dir_name)
}
Map(zip_files, dir_array)
This should zip all the folders inside the current working directory with the specified name. The zipped folders are also housed in the current working directory.
Here is the approach I used to achieve my desired result; it is a bit tricky but it works:
setwd("c:/test")
dir.create("1202_txt") # a folder inside the working directory, with some CSV files in it
array <- list.files(pattern = "_txt$")
for (i in seq_along(array)) {
  name <- paste0(array[i], ".zip")
  # list the CSV files inside the current folder
  Zip_Files <- list.files(path = file.path("C:/test", array[i]), pattern = "\\.csv$")
  # move the working directory into the folder, so the archive holds no parent path
  setwd(file.path("C:/test", array[i]))
  # zip the files inside the directory
  zip::zip(zipfile = name, files = Zip_Files)
  # move the zip file from inside the folder back to the parent directory
  file.rename(name, file.path("C:/test", name))
  print(name)
  # return to the parent directory before the next iteration
  setwd("C:/test")
}
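An alternative that avoids changing the working directory at all: the zip package's zip() has a root argument that runs the operation from inside a given directory, so the archive stores only the file names. A sketch, assuming the zip package is installed; zip_flat is a made-up helper name:

```r
library(zip)  # assumes the 'zip' package is installed

## zip `folder` so the archive stores only the file names, no parent directory
zip_flat <- function(folder) {
  ## absolute zipfile path, so `root` does not relocate it into the folder
  zipfile <- file.path(normalizePath(dirname(folder)), paste0(basename(folder), ".zip"))
  ## `root = folder` makes zip() operate from inside the folder
  zip::zip(zipfile = zipfile, files = list.files(folder), root = folder)
  zipfile
}

## usage over all *_txt folders in the working directory:
# lapply(list.files(pattern = "_txt$"), zip_flat)
```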
I have a folder of PDFs that I am supposed to perform text analytics on within R. So far the best method has been to convert the files to text files with pdftotext, called from R. After this, however, I am unable to perform any analytics, because the text files are placed in the same folder as the PDFs from which they are derived.
I am achieving this through:
dest <- "C:/PDF"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/xpdfbin-win-3.04/bin64/pdftotext.exe"', paste0('"',i,'"')), wait= FALSE))
I was wondering the best method of retaining only the text files, whether it be saving them to a newly created folder in this step or if more must be done.
I have tried:
dir.create("C:/txtfiles")
new.folder <- "C:/txtfiles"
dest <- "C:/PDF"
list.of.files <-list.files(dest, ".txt$")
file.copy(list.of.files, new.folder)
However this only fills the new folder 'txtfiles' with blank text files named after the ones created by the first few lines of code.
Use the following code:
files <- list.files(path = "current folder location", pattern = "\\.txt$") # lists all .txt files
for (i in seq_along(files)) {
  file.copy(from = paste0("~/current folder location/", files[i]),
            to = "destination folder")
}
This should copy all text files in "current folder location" into a separate folder, "destination folder".
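A variant worth trying: with full.names = TRUE the listed paths are complete, so file.copy can find the sources regardless of the working directory, and the copy is a single vectorised call. The temp directories below are only stand-ins for C:/PDF and C:/txtfiles:

```r
## stand-in directories so the sketch is self-contained
src <- file.path(tempdir(), "pdfs"); dst <- file.path(tempdir(), "txtfiles")
dir.create(src, showWarnings = FALSE); dir.create(dst, showWarnings = FALSE)
writeLines("demo", file.path(src, "a.txt"))  # stand-in for pdftotext output

## full.names = TRUE returns full paths, so no setwd is needed
txt_files <- list.files(src, pattern = "\\.txt$", full.names = TRUE)
file.copy(txt_files, dst, overwrite = TRUE)  # vectorised, no loop needed
```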
I want to unzip the files inside a folder and rename them with the same name as their .zip file of origin, BUT keeping the original extension of the individual files. Any ideas on how to do this?
Reproducible example:
# Download zip files
ftppath1 <- "ftp://geoftp.ibge.gov.br/malhas_digitais/censo_2010/setores_censitarios/se/se_setores_censitarios.zip"
ftppath2 <- "ftp://geoftp.ibge.gov.br/malhas_digitais/censo_2010/setores_censitarios/al/al_setores_censitarios.zip"
download.file(ftppath1, "SE.zip", mode="wb")
download.file(ftppath2, "AL.zip", mode="wb")
What I had in mind was something as naive as this:
# unzip and rename files
unzip("SE.zip", file_name= paste0("SE",.originalextension))
unzip("AL.zip", file_name= paste0("AL",.originalextension))
In the end, these are the files I would have in my folder:
SE.zip
AL.zip
AL.shx
AL.shp
AL.prj
AL.dbf
SE.shx
SE.shp
SE.prj
SE.dbf
for (stem in c('SE','AL')) {
  zf <- paste0(stem, '.zip')             ## derive zip file name
  unzip(zf)                              ## extract all compressed files
  files <- unzip(zf, list = TRUE)$Name   ## get their original names
  for (file in files)
    file.rename(file, paste0(stem, '.', sub('.*\\.', '', file))) ## rename
}
system('ls')
## AL.dbf AL.prj AL.shp AL.shx AL.zip SE.dbf SE.prj SE.shp SE.shx SE.zip
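The sub('.*\\.', '', file) idiom extracts the extension; base R's tools::file_ext does the same and reads more clearly. A small sketch (new_name is a made-up helper name):

```r
## build "<stem>.<original extension>" from a zip stem and an extracted file name
new_name <- function(stem, file) {
  paste0(stem, ".", tools::file_ext(file))
}

new_name("SE", "se_setores_censitarios.shp")  # "SE.shp"
new_name("AL", "al_setores_censitarios.dbf")  # "AL.dbf"
```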
I need to download a few hundred Excel files each day and import them into R, each one as its own data frame. I have a .csv file with all the addresses (the addresses remain static).
The .csv file looks like this:
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%b
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%a0
http://www.www.somehomepage.com/chartserver/hometolotsoffiles%aa11
etc.....
I can do it with a single file like this:
library(XLConnect)
my.url <- "http://www.somehomepage.com/chartserver/hometolotsoffiles%a"
loc.download <- "C:/R/lotsofdata/" # each file probably needs its own name here?
download.file(my.url, loc.download, mode="wb")
df.import.x1 <- readWorksheetFromFile(loc.download, sheet = 2)
# This kind of import works on all the files, if you run them individually
But I have no idea how to download each file, place each one in its own folder, and then import them all into R as individual data frames.
It's hard to answer your question as you haven't provided a reproducible example and it isn't clear exactly what you want. Anyway, the code below should point you in the right direction.
You have a list of urls you want to visit:
urls = c("http://www/chartserver/hometolotsoffiles%a",
"http://www/chartserver/hometolotsoffiles%b")
In your example, you load this from a csv file.
Next we download each file and put it in a separate directory (you mentioned that in your question):
for(url in urls) {
  split_url = strsplit(url, "/")[[1]]
  ##Extract final part of URL
  dir = split_url[length(split_url)]
  ##Create a directory
  dir.create(dir)
  ##Download the file into that directory (destfile must be a file path, not a directory)
  download.file(url, file.path(dir, dir), mode="wb")
}
Then we loop over the directories and files and store the results in a list.
##Read in files
l = list(); i = 1
dirs = list.dirs(".", recursive=FALSE) ## the directories created in the previous step
for(dir in dirs){
file = list.files(dir, full.names=TRUE)
##Do something?
##Perhaps store sheets as a list
l[[i]] = readWorksheetFromFile(file, sheet=2)
i = i + 1
}
We could of course combine steps two and three into a single loop. Or drop the loops and use sapply.
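The sapply version hinted at above could look like the sketch below; the download-and-read body is commented out because it needs network access and the XLConnect package, so only the url-to-name step runs here:

```r
## derive a directory/file name from each url (the part after the last "/")
urls <- c("http://www/chartserver/hometolotsoffiles%a",
          "http://www/chartserver/hometolotsoffiles%b")
dirs <- basename(urls)

## download and read in one pass, one named list element per url
## (commented out: needs network access and the XLConnect package)
# sheets <- sapply(urls, function(url) {
#   dir <- basename(url)
#   dir.create(dir, showWarnings = FALSE)
#   dest <- file.path(dir, dir)
#   download.file(url, dest, mode = "wb")
#   readWorksheetFromFile(dest, sheet = 2)
# }, simplify = FALSE)
dirs
```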