Use R to unzip and rename file - r

I have used R to download about 200 zip files. The zipped files are in mmyy.dat format. The next step is to use R to unzip all the files and rename them as yymm.txt. I know the function unzip can unpack the files, but I am not sure which argument of the function can also change the name and format of the unzipped files.
And when I unzip the files using
for (i in 1:length(destfile)){
unzip(destfile[i],exdir='C:/data/cps1')
}
The extracted files are named jan94pub.cps, but they are supposed to be jan94pub.dat. The code I use to download the files is here.
month_vec <- c('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec')
year_vec <- c('94','95','96','97','98','99','00','01','02','03','04','05','06','07','08','09','10','11','12','13','14')
url <- "http://www.nber.org/cps-basic/"
month_year_vec <- apply(expand.grid(month_vec, year_vec), 1, paste, collapse="")
bab <-'pub.zip'
url1 <- paste(url,month_year_vec,bab,sep='')
destfile <- paste('C:/data/cps1/',month_year_vec,bab,sep='')
for (i in seq_along(url1)){
  download.file(url1[i],destfile[i])
}
for (i in 1:length(destfile)){
unzip(destfile[i],exdir='C:/data/cps1')
}
When I use str(destfile), the filenames are correct, jan94pub.dat. I don't see where my code goes wrong.

I'd do something like:
file_list = list.files(pattern = '\\.zip$')
lapply(file_list, unzip)
Next you want to use the same kind of lapply trick in combination with strptime to convert the name of the file to a date:
t = strptime('010101.txt', format = '%d%m%y.txt') # Note I appended 01 (day) before the name, you can use paste for this (and its collapse argument)
[1] "2001-01-01"
You will need to tweak the filename a bit to get a reliable date, as the month and the year alone are not enough. Next you can use strftime to transform it back to your desired yymm.txt format:
strftime(t, format = '%y%m.txt')
[1] "0101.txt"
Then you can use file.rename to perform the actual moving. To get this functionality into one function call, create a function which performs all the steps:
unzip_and_move = function(path) {
# - Get file list
# - Unzip files
# - create output file list
# - Move files
}
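A minimal sketch of how the outline above could be filled in, assuming the extracted files are named like jan94pub.dat or jan94pub.cps (three-letter month, then two-digit year). The names yymm_name and unzip_and_rename are hypothetical helpers, not part of any package. The month is looked up in a fixed vector rather than parsed with strptime, so the result does not depend on the locale.

```r
month_abbr <- c("jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec")

# Turn names such as "jan94pub.dat" into "9401.txt" (NA if no month prefix is found).
yymm_name <- function(file_name) {
  base <- tolower(basename(file_name))
  mon <- match(substr(base, 1, 3), month_abbr)  # month number, 1..12
  yy <- substr(base, 4, 5)                      # two-digit year
  ifelse(is.na(mon), NA_character_, sprintf("%s%02d.txt", yy, mon))
}

# Unzip one archive into exdir and rename whatever was extracted.
unzip_and_rename <- function(zip_path, exdir) {
  extracted <- unzip(zip_path, exdir = exdir)   # unzip() returns the extracted paths
  new_base <- yymm_name(extracted)
  ok <- !is.na(new_base)                        # skip files that do not fit the pattern
  target <- file.path(exdir, new_base[ok])
  file.rename(extracted[ok], target)
  target
}
```

With the destfile vector from the question, this could be applied as lapply(destfile, unzip_and_rename, exdir = 'C:/data/cps1').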

Related

Renaming multiple files with multiple names

setwd("C:\\Users\\...\\Documents\\Main\\eml orders")
files <- list.files(pattern="\\.eml$")
newfiles <- gsub("\\.eml$", ".txt", files)
file.rename(files, newfiles)
eml_files <- list.files(pattern = "txt$")
I have this code to convert .eml files into .txt files. Now I want to rename the same files to a string that I make with a function.
Example of function
fetch_date <- function(x) {
date <- paste0(as.character(Sys.time()), ".txt")
file.rename(x, date)
}
Now I try map(eml_files, fetch_date)
And get this error:
cannot rename file '24 New order placed.txt' to '2020-11-14', reason 'The network path was not found'
No clue what's happening. Any help would be appreciated.
Be aware that your fetch_date function produces only one string (from Sys.time()), so it tries to give several .txt files the same name. On my Mac, this results in the last file being kept while the others are overwritten. Perhaps you use Windows, where the default behavior is different when files would be overwritten?
eml_files <- list.files(pattern = "txt$")
fetch_date <- function(x) {
date <- paste0(as.character(Sys.time()), ".txt")
file.rename(x, date)
}
map(eml_files, fetch_date)
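One possible way around both problems is sketched below. It rests on two assumptions: that the asker is on Windows, where file names may not contain ":" (and as.character(Sys.time()) does contain colons), and that a running index is an acceptable way to keep the names unique. The helper names stamped_names and rename_with_stamp are hypothetical.

```r
# Build one distinct, colon-free name per input file.
stamped_names <- function(files, when = Sys.time()) {
  stamp <- format(when, "%Y-%m-%d_%H-%M-%S")      # e.g. 2020-11-14_10-30-00
  sprintf("%s_%03d.txt", stamp, seq_along(files)) # index keeps the names unique
}

# file.rename() is vectorised, so no map() or lapply() is needed.
rename_with_stamp <- function(files) {
  file.rename(files, stamped_names(files))
}
```

Calling rename_with_stamp(eml_files) would then rename every file in one go and return a logical vector showing which renames succeeded.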

How do I apply the same action to all Excel Files in the directory?

I need to shape the data stored in Excel files and save it as new .csv files. I figured out what specific actions should be done, but can't understand how to use lapply.
All Excel files have the same structure. Each .csv file should have the name of its original file.
## the original actions successfully performed on a single file
library(readxl)
library("reshape2")
DataSource <- read_excel("File1.xlsx", sheet = "Sheet10")
DataShaped <- melt(subset(DataSource [-(1),], select = - c(ng)), id.vars = c ("itemname","week"))
write.csv2(DataShaped, "C:/Users/Ol/Desktop/Meta/File1.csv")
## my attempt to apply to the rest of the files in the directory
lapply(Files, function (i){write.csv2((melt(subset(read_excel(i,sheet = "Sheet10")[-(1),], select = - c(ng)), id.vars = c ("itemname","week"))))})
R returns the result to the console but doesn't create any files. The result resembles .csv structure.
Could anybody explain what I am doing wrong? I'm new to R, I would be really grateful for the help
Answer
Thanks to the prompt answer from @Parfait the code is working! So glad. Here it is:
library(readxl)
library(reshape2)
Files <- list.files(pattern = "\\.xlsx$", full.names = TRUE)
lapply(Files, function(i) {
  write.csv2(
    melt(subset(read_excel(i, sheet = "Decomp_Val")[-(1),],
                select = -c(ng)), id.vars = c("itemname","week")),
    file = sub("\\.xlsx$", ".csv", i))
})
It reads each Excel file in the directory, drops the first row (but keeps the headers) and the column named "ng", melts the data by the labels "itemname" and "week", and writes the result as a .csv to the working directory under the name of the original file. And then: rinse and repeat.
Simply pass an actual file path to write.csv2. Otherwise, as noted in the docs (?write.csv), the default value for the file argument is the empty string "":
file: either a character string naming a file or a connection open for writing. "" indicates output to the console.
The code below appends the Excel file stem, with a .csv extension, to the specified directory path:
path <- "C:/Users/Ol/Desktop/Meta/"
lapply(Files, function (i){
  write.csv2(
    melt(subset(read_excel(i, sheet = "Sheet10")[-(1),],
                select = -c(ng)),
         id.vars = c("itemname","week")),
    # basename() drops any directory prefix (e.g. "./") before the stem is appended to path
    file = paste0(path, sub("\\.xlsx$", ".csv", basename(i)))
  )
})
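The path handling can also be separated from the reshaping, which makes it easier to check on its own. The following is only a sketch in base R: csv_path_for and convert_all are hypothetical names, and read_fun and shape_fun stand in for the read_excel and melt/subset steps used above.

```r
# Map "some/dir/File1.xlsx" to "<out_dir>/File1.csv".
csv_path_for <- function(xlsx_path, out_dir) {
  stem <- tools::file_path_sans_ext(basename(xlsx_path))
  file.path(out_dir, paste0(stem, ".csv"))
}

# Read, reshape and write every file; returns the output paths invisibly.
convert_all <- function(files, out_dir, read_fun, shape_fun = identity) {
  out <- csv_path_for(files, out_dir)
  for (k in seq_along(files)) {
    shaped <- shape_fun(read_fun(files[k]))
    write.csv2(shaped, file = out[k], row.names = FALSE)
  }
  invisible(out)
}
```

For the Excel case, read_fun could be something like function(p) read_excel(p, sheet = "Sheet10"), with the melt/subset call passed as shape_fun.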

R: Read single file from within a tar.gz directory

Consider a tar.gz file of a directory that contains a lot of individual files.
From within R I can easily extract the names of the individual files with this command:
fileList <- untar("my_tar_dir.tar.gz", list=TRUE)
Using only R, is it possible to directly read/load a single one of those files into R (i.e. without first unpacking and writing the file to disk)?
It is possible, but I don't know of any clean implementation (it may exist). Below is some very basic R code that should work in many cases (e.g. file names with full path inside the archive should be less than 100 characters). In a way, it's just re-implementing "untar" in an extremely crude way, but in such a way that it will point to the desired file in a gzipped file.
The first problem is that you should only read a gzipped file from the start. Using "seek()" to re-position the file pointer to the desired file is, unfortunately, erratic in a gzipped file.
ParseTGZ<- function(archname){
# open tgz archive
tf <- gzfile(archname, open='rb')
on.exit(close(tf))
fnames <- list()
offset <- 0
nfile <- 0
while (TRUE) {
# go to beginning of entry
# never use "seek" to re-locate in a gzipped file!
if (seek(tf) != offset) readBin(tf, what="raw", n= offset - seek(tf))
# read file name
fName <- rawToChar(readBin(tf, what="raw", n=100))
if (nchar(fName)==0) break
nfile <- nfile + 1
fnames <- c(fnames, fName)
attr(fnames[[nfile]], "offset") <- offset+512
# read size, first skip 24 bytes (file permissions etc)
# again, we only use readBin, not seek()
readBin(tf, what="raw", n=24)
# file size is encoded as a length 12 octal string,
# with the last character being '\0' (so 11 actual characters)
sz <- readChar(tf, nchars=11)
# convert string to number of bytes
sz <- sum(as.numeric(strsplit(sz,'')[[1]])*8^(10:0))
attr(fnames[[nfile]], "size") <- sz
# cat(sprintf('entry %s, %i bytes\n', fName, sz))
# go to the next message
# don't forget entry header (=512)
offset <- offset + 512*(ceiling(sz/512) + 1)
}
# return a named list of characters strings with attributes?
names(fnames) <- fnames
return(fnames)
}
This will give you the exact position and length of all files in the tar.gz archive.
Now the next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(). This presumes your files fit into memory.
extractTGZ <- function(archfile, filename) {
# this function returns a raw vector
# containing the desired file
fp <- ParseTGZ(archfile)
offset <- attributes(fp[[filename]])$offset
fsize <- attributes(fp[[filename]])$size
gzf <- gzfile(archfile, open="rb")
on.exit(close(gzf))
# jump to the byte position, don't use seek()
# may be a bad idea on really large archives...
readBin(gzf, what="raw", n=offset)
# now read the data into a raw vector
result <- readBin(gzf, what="raw", n=fsize)
result
}
Now, finally:
ff <- rawConnection(extractTGZ("myarchive", "myfile"))
Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
One can read in a csv within an archive using library(archive) as follows (this should be a lot more elegant than the currently accepted answer, this package also supports all major archive formats - 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' & 'xz' and it works on all platforms):
library(archive)
library(readr)
read_csv(archive_read("my_tar_dir.tar.gz", file = 1), col_types = cols())
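If writing a short-lived temporary copy is acceptable (so this does not strictly answer the "without writing to disk" part), base R alone can pull out one member, because untar() accepts a files argument. This is a sketch under that assumption; read_member is a hypothetical helper name.

```r
# Extract a single member to a temporary directory, read it, then clean up.
read_member <- function(archive, member, reader = read.csv, ...) {
  tmp <- tempfile("untar_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE))       # remove the temporary copy afterwards
  untar(archive, files = member, exdir = tmp)  # only the requested member is unpacked
  reader(file.path(tmp, member), ...)
}
```

The member name has to match an entry returned by untar(archive, list = TRUE) exactly, including any directory prefix.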

How to insert text at a specific position in a directory path in R

I am looking for an elegant way to insert a character string (name) into a directory path and create a .csv file. I found one possible solution, but I am looking for another that works by "inserting" text between specific characters rather than "replacing".
#lets start
df <-data.frame()
name <- c("John Johnson")
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
#how to insert "name" vector between "Desktop/" and "." to get:
dir <- c("C:/Users/uzytkownik/Desktop/John Johnson.csv")
write.csv(df, file=dir)
#???
#I found the answer but it is not very elegant in my opinion
library(qdapRegex)
dir2 <- c("C:/Users/uzytkownik/Desktop/ab.csv")
dir2<-rm_between(dir2,'a','b', replacement = name)
> dir2
[1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
write.csv(df, file=dir2)
I like sprintf syntax for "fill-in-the-blank" style string construction:
name <- c("John Johnson")
sprintf("C:/Users/uzytkownik/Desktop/%s.csv", name)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
Another option, if you can't put the %s in the directory string, is to use sub. This is replacing, but it replaces .csv with <name>.csv.
dir <- c("C:/Users/uzytkownik/Desktop/.csv")
sub(".csv", paste0(name, ".csv"), dir, fixed = TRUE)
# [1] "C:/Users/uzytkownik/Desktop/John Johnson.csv"
This should get you what you need.
dir <- "C:/Users/uzytkownik/Desktop/.csv"
name <- "joe depp"
dirsplit <- strsplit(dir,"\\/\\.")
paste0(dirsplit[[1]][1],"/",name,".",dirsplit[[1]][2])
[1] "C:/Users/uzytkownik/Desktop/joe depp.csv"
I find that paste0() is the way to go, so long as you store your directory and extension separately:
path <- "some/path/"
file <- "file"
ext <- ".csv"
write.csv(myobj, file = paste0(path, file, ext))
For those unfamiliar, paste0() is shorthand for paste( , sep="").
Let's suppose you have a vector with the desired names for some data structures you want to save, for instance:
names <- c("file_1", "file_2", "file_3")
Now, you want to update the path in which you are going to save your files, adding the name plus the extension:
path <- "/Users/Documents/Test_Folder/"
extension <- ".csv"
A simple way to achieve this is to use paste0() to create the full path as input for write.csv() inside an lapply, as follows:
lapply(names, function(x) {
  # paste0() joins without a separator; paste() would insert spaces into the path
  write.csv(x = data,
            file = paste0(path, x, extension))
}
)
The good thing about this approach is that you can iterate over the vector that contains the names of your files, and the final path is updated automatically. One possible extension is to define a vector of extensions and update the path accordingly.
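Another base R option for this kind of path building is file.path(), which joins the directory and file name with "/" so the directory string needs no trailing separator. The sketch below assumes the data frames are held in a named list; write_named_csvs is a hypothetical helper name.

```r
# Write each element of a named list to "<dir>/<name>.csv".
write_named_csvs <- function(tables, dir) {
  paths <- file.path(dir, paste0(names(tables), ".csv"))
  for (k in seq_along(tables)) {
    write.csv(tables[[k]], file = paths[k], row.names = FALSE)
  }
  invisible(paths)
}
```

For example, write_named_csvs(list("John Johnson" = df), "C:/Users/uzytkownik/Desktop") would target C:/Users/uzytkownik/Desktop/John Johnson.csv.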

I have a zip folder which contains 332 csv file. I have to first unzip it using R and then save it to a directory. How do i do that?

I have tried-
read.zip(file ="C:/Users/dm/Downloads/rprog-data-specdata.zip")
and-
l = list.files("C:/Users/dm/Downloads/rprog-data-specdata")
read.csv(l[1:332])
But it's not working
Unless you really want them all extracted, you don't have to. You can read them all in directly from the archive:
# your zip file
zipped_csvs <- "rprog-data-specdata.zip"
# get data.frame of file info in the zip
fils <- unzip(zipped_csvs, list=TRUE)
# read them all into a list (or you can read individual ones)
dats <- lapply(fils$Name, function(x) {
  # unz() opens a connection to one member, so nothing is extracted to disk
  read.csv(unz(zipped_csvs, x), stringsAsFactors=FALSE)
})
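If the files should be extracted to a directory first, as the question asks, a sketch along the following lines may work. It assumes every .csv below the output directory belongs to the archive; read_csv_dir and extract_and_read are hypothetical helper names.

```r
# Read every .csv below dir into a list named by file name.
read_csv_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(tables) <- basename(files)
  tables
}

# Unzip into out_dir, then read what was extracted.
extract_and_read <- function(zip_path, out_dir) {
  unzip(zip_path, exdir = out_dir)
  read_csv_dir(out_dir)
}
```

The original attempt failed partly because read.csv() takes a single file, not a vector of 332 names, and partly because list.files() returns bare file names unless full.names = TRUE is set.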

Resources