I have a bunch of .avi files that I would like to load into R, break down each frame into an individual image, arrange the images, and save the result as a separate image. Despite a sincere effort to find a package that loads .avi files, I can't find anything.
1) Is it possible to load and work with AVI files in R?
2) How is this done?
3) Is there a specific library for this?
I've seen several examples using Linux, such as this post, but I'm hoping for an R solution.
Converting AVI Frames to JPGs on Linux
I figured this out. As indicated in the comments section, ffmpeg is the tool that various R packages call to load videos into the R environment. The imager package has a function called "load.video.internal" that works well and uses ffmpeg. I had to download the package from GitHub because this function was not available in the version I installed using "install.packages". I ended up copying the function from the source package and commenting out the call to the "has.ffmpeg" function because it kept hanging at that step. Using paste, I was able to indicate the file path and load an AVI file successfully.
Modified load.video.internal function:
load.video.internal <- function(fname,maxSize=1,skip.to=0,frames=NULL,fps=NULL,extra.args="",verbose=FALSE)
{
# if (!has.ffmpeg()) stop("Can't find ffmpeg. Please install.")
dd <- paste0(tempdir(),"/vid")
if (!is.null(frames)) extra.args <- sprintf("%s -vframes %i ",extra.args,frames)
if (!is.null(fps)) extra.args <- sprintf("%s -vf fps=%.4f ",extra.args,fps)
arg <- sprintf("-i %s %s -ss %s ",fname,extra.args,as.character(skip.to)) %>% paste0(dd,"/image-%d.bmp")
tryCatch({
dir.create(dd)
system2("ffmpeg",arg,stdout=verbose,stderr=verbose)
fls <- dir(dd,full.names=TRUE)
if (length(fls)==0) stop("No output was generated")
ordr <- stringr::str_extract(fls,"(\\d)+\\.bmp") %>% stringr::str_split(stringr::fixed(".")) %>% purrr::map_int(~ as.integer(.[[1]])) %>% order
fls <- fls[ordr]
#Check total size
imsz <- load.image(fls[[1]]) %>% dim %>% prod
tsz <- ((imsz*8)*length(fls))/(1024^3)
if (tsz > maxSize)
{
msg <- sprintf("Video exceeds maximum allowed size in memory (%.2f Gb out of %.2f Gb)",tsz,maxSize)
unlink(dd,recursive=TRUE)
stop(msg)
}
else
{
out <- map_il(fls,load.image) %>% imappend("z")
out
} },
finally=unlink(dd,recursive=TRUE))
}
Example of its use:
vid_in <- load.video.internal(paste("/home/phil/Documents/avi_files/Untitled.avi"))
UPDATED SIMPLER VERSION:
The magick package will read AVI files using the image_read function.
I have mainly been working with .xlsb files (the binary variant of .xlsx), which I would like to read and write using R. Could you please let me know whether a package is available for this, or do I need to create a package of my own?
RODBC did not work either.
Try the excel.link package. The xl.read.file function allows rectangular data sets to be read-in, though there are other options available.
You also need to (install and) call the RDCOMClient package before running the first excel.link function.
e.g.,
read_xlsb <- function(x){
require("RDCOMClient")
message(paste0("Reading ", x, "...\n"))
df <- excel.link::xl.read.file(filename = x, header = TRUE,
xl.sheet = Worksheet_name)
df$filename <- x
df <- as.data.frame(df)
return(df)
}
The only annoyance I've found is that I can't override Excel's "save on close" functionality, so these pop-ups need to be closed by hand.
BTW I think excel.link only works on Windows machines.
I am trying to read PDF files with the tm package. I have several folders with hundreds of documents each, and I have read all of them successfully except for one folder. The problem is that the PDFs in that specific folder have a sequence of images at the bottom of the first page that prevents me from reading them. I get the following error:
Error in strptime(d, fmt) : input string is too long
If I remove the first page, I manage to read them. I could do that without much loss of relevant information, but it is too much work.
I tried the xpdf and ghostscript engines, but both give me the same error.
My code is as follows:
library(rvest)
library(tm)
url<-paste0("http://www.tjrj.jus.br/search?q=acidente+de+transito+crianca+atropelamento&btnG=Pesquisar&processType=cnj&site=juris&client=juris&output=xml_no_dtd&proxystylesheet=juris&entqrm=0&oe=UTF-8&ie=UTF-8&ud=1&filter=0&getfields=*&partialfields=(ctd:1)&exclude_apps=1&ulang=en&lr=lang_pt&sort=date:D:S:d1&as_q=+&access=p&entqr=3&start=",seq(0,462,10))
css<-sprintf(".margin-top-10:nth-child(%.d) .outros .featured",1:10)
for (j in 1:1){ # There are 47 pages, but I only use one here
for (i in 1:10){ # there are 10 files per page.
a<-read_html(url[j]) %>% html_node(css=css[i]) %>% # parse the results page before selecting the link
html_attr("href")
download.file(a,paste0("doc",j,i,".pdf"))
}
}
files <- list.files(pattern = "pdf$")
Rpdf <- readPDF(control = list(text = "-layout"))
docs <- Corpus(URISource(files,encoding="UTF-8"),readerControl = list(reader = Rpdf,language="portuguese"))
Does someone have a suggestion? I use a Mac.
Late answer: I recently discovered that in current versions of tm (0.7-4), readPDF uses pdftools as the default engine to read PDFs.
library(tm)
directory <- getwd() # change this to directory where pdf-files are located
# read the pdfs with readPDF, default engine used is pdftools see ?readPDF for more info
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
readerControl = list(reader = readPDF))
I run an automated script to download 3 .xls files from 3 websites every hour. When I later try to read in the .xls files in R to further work with them, R produces the following error message:
"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"
When I manually open and save the .xls files, the problem no longer appears and everything works normally, but since the total number of files increases by 72 every day, this is not a practical workaround.
The script I use to download and save the files:
library(httr)
setwd("WORKDIRECTION")
orig_wd <- getwd()
FOLDERS <- c("NAME1","NAME2","NAME3") #representing folder names
LINKS <- c("WEBSITE_1", #the urls from which I download
"WEBSITE_2",
"WEBSITE_3")
NO <- length(FOLDERS)
for(i in 1:NO){
today <- as.character(Sys.Date())
if (!file.exists(paste(FOLDERS[i],today,sep="/"))){
dir.create(paste(FOLDERS[i],today,sep="/"))
}
setwd(paste(orig_wd,FOLDERS[i],today,sep="/"))
dat<-GET(LINKS[i])
bin <- content(dat,"raw")
now <- as.character(format(Sys.time(),"%X"))
now <- gsub(":",".",now)
writeBin(bin,paste(now,".xls",sep=""))
setwd(orig_wd)
}
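The download loop above switches the working directory for every file. As a minimal sketch of the same idea without setwd(), the hypothetical helper below (save_snapshot is not part of the original script) builds the dated folder with file.path() and writes the raw bytes with writeBin(). The bytes used here are placeholders standing in for content(GET(url), "raw").

```r
# Hypothetical helper (an assumption, not from the original script):
# writes raw bytes to <root>/<folder>/<YYYY-MM-DD>/<HH.MM.SS>.xls
save_snapshot <- function(bin, root, folder, when = Sys.time()) {
  day_dir <- file.path(root, folder, format(when, "%Y-%m-%d"))
  # recursive = TRUE creates the folder and the dated subfolder in one call
  dir.create(day_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(day_dir, paste0(format(when, "%H.%M.%S"), ".xls"))
  writeBin(bin, dest)
  dest
}

# Placeholder bytes instead of a real HTTP response body
root <- tempfile("downloads")
path <- save_snapshot(as.raw(c(0xD0, 0xCF, 0x11, 0xE0)), root, "NAME1")
```

Returning the destination path makes it easy to log or to pass straight to whatever reads the file later.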
I then read in the files with the following script:
require(gdata)
require(XLConnect)
require(xlsReadWrite)
wb = loadWorkbook("FILEPATH")
df = readWorksheet(wb, "Favourite List" , header = FALSE)
Does anybody have experience with this type of error, and knows a solution or workaround?
The problem is partly resolved by using the readxl package, available on CRAN. After installation, files can be read in with:
library(readxl)
read_excel("PathToFile")
The only problem is that the last column is omitted while reading in. If I find a solution for this, I'll update the answer.
I am trying to download zipped files from a website like http://cdo.ncdc.noaa.gov/qclcd_ascii/.
Since there are many files, is there a way to download them in batch instead of one by one? Ideally, the downloaded files can be unzipped in batch after downloading.
I tried to use system("curl http://cdo.ncdc.noaa.gov/qclcd_ascii/QCLCD") etc., but got many errors and status 127 warnings.
Any idea or suggestions?
Thanks!
This should work.
library(XML)
url<-c("http://cdo.ncdc.noaa.gov/qclcd_ascii/")
doc<-htmlParse(url)
#get <a> nodes.
Anodes<-getNodeSet(doc,"//a")
#get the ones with .zip's and .gz's
files<-grep("\\.(gz|zip)$",sapply(Anodes, function(Anode) xmlGetAttr(Anode,"href")),value=TRUE)
#make the full url
urls<-paste(url,files,sep="")
#Download each file.
mapply(function(x,y) download.file(x,y),urls,files)
It's not R, but you could easily use the program wget, ignoring robots.txt:
wget -r --no-parent -e robots=off --accept "*.gz" http://cdo.ncdc.noaa.gov/qclcd_ascii/
Here's my take on it:
### Load XML package, for 'htmlParse'
require(XML)
### Read in HTML contents, extract file names.
root <- 'http://cdo.ncdc.noaa.gov/qclcd_ascii/'
doc <- htmlParse(root)
fnames <- xpathSApply(doc, '//a[@href]', xmlValue)
### Keep only zip files, and create url paths to scrape.
fnames <- grep('zip$', fnames, value = T)
paths <- paste0(root, fnames)
Now that you have a vector of URLs and the corresponding file names in R, you can download them to your hard disk. You have two options: you can download in serial or in parallel.
### Download data in serial, saving to the current working directory.
mapply(download.file, url = paths, destfile = fnames)
### Download data in parallel, also saving to current working directory.
require(parallel)
cl <- makeCluster(detectCores())
clusterMap(cl, download.file, url = paths, destfile = fnames,
.scheduling = 'dynamic')
If you choose to download in parallel, I recommend considering 'dynamic' scheduling, which means that each core won't have to wait for others to finish before starting its next download. The downside to dynamic scheduling is the added communication overhead, but since the process of downloading ~50mb files is not very resource intensive, it will be worth it to use this option so long as files download at slightly varying speeds.
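A minimal sketch of dynamic scheduling, assuming two local workers are available. fake_download is a hypothetical stand-in for download.file() so that the example runs without network access. It also calls stopCluster(), which shuts the workers down once the work is done.

```r
library(parallel)

# Two workers keep the sketch light; the code above uses detectCores().
cl <- makeCluster(2)

# Hypothetical stand-in for download.file(): just pairs up its arguments.
fake_download <- function(url, destfile) paste(url, "->", destfile)

urls  <- c("u1", "u2", "u3", "u4")
dests <- c("f1", "f2", "f3", "f4")

# With dynamic scheduling each task is handed to whichever worker is free;
# the results still come back in the order of the inputs.
res <- clusterMap(cl, fake_download, url = urls, destfile = dests,
                  .scheduling = "dynamic")

# Release the worker processes when finished.
stopCluster(cl)
```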
Lastly, if you also want to include the tar files (.tar.gz), change the regular expression to
fnames <- grep('(zip|gz)$', fnames, value = TRUE)
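One detail worth checking when combining extensions is where the `$` anchor applies. In a pattern written as '(zip)|(gz)$', the anchor belongs only to the second alternative, so "zip" can match anywhere in the name. The sketch below compares that with a grouped pattern on a few made-up file names (the names are illustrative, not taken from the real directory listing).

```r
# Hypothetical file names, as they might appear in a directory listing
fnames <- c("QCLCD200705.tar.gz", "QCLCD201501.zip",
            "zip_readme.txt", "index.html")

# "zip anywhere" OR "gz at the end": also picks up zip_readme.txt
loose <- grep('(zip)|(gz)$', fnames, value = TRUE)

# Grouping the alternatives anchors both extensions to the end of the name
strict <- grep('\\.(zip|gz)$', fnames, value = TRUE)
```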
To download everything under that directory you can do this:
wget -r -e robots=off http://cdo.ncdc.noaa.gov/qclcd_ascii/
@EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
1) Create a temp. file name (e.g. tempfile())
2) Use download.file() to fetch the file into the temp. file
3) Use unz() to extract the target file from the temp. file
4) Remove the temp. file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
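To illustrate reading a gzipped file directly from a connection, here is a minimal sketch. It writes a small table to a temporary .gz file first (a stand-in for a downloaded file, so the example runs offline) and then reads it back through gzfile() without a separate decompression step.

```r
# Write a small table to a gzip-compressed temporary file
tmp_gz <- tempfile(fileext = ".gz")
con <- gzfile(tmp_gz, open = "wt")
write.table(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), con,
            row.names = FALSE)
close(con)

# read.table() opens and closes the gzfile() connection itself,
# decompressing on the fly
gz_data <- read.table(gzfile(tmp_gz), header = TRUE)

# Remove the temporary file
unlink(tmp_gz)
```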
Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
I used CRAN package "downloader" found at http://cran.r-project.org/web/packages/downloader/index.html . Much easier.
library(downloader)
# url holds the address of the zip file to fetch
download(url, dest="dataset.zip", mode="wb")
unzip("dataset.zip", exdir = "./")
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the command-line tool funzip in conjunction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
library(readxl)
url <-"https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp, mode = "wb")
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
# temp2 is a directory, so it has to be removed recursively
unlink(c(temp, temp2), recursive = TRUE)
To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
unlink(temp) # delete the downloaded zip file itself, not just the variable
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.
Using library(archive), one can also read in a particular CSV file within the archive without having to unzip it first:
read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
I find this more convenient, and it is faster.
It also supports all major archive formats & is quite a bit faster than the base R untar or unz - it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz & uuencoded files.
To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms & given the superior performance for me would be the preferred option.
Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")
The rio package would be very suitable for this. Its import() function uses the file extension of a file name to determine what kind of file it is, so it works with a large variety of file types. I've also used unzip() to list the file names within the zip file, so it's not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
# delete the downloaded zip file and the extracted files
unlink(c(tf, file.path(td, file_names$Name)))
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)