Is it possible to test whether the contents of compressed archives are the same without needing to decompress them? What is the standard way of doing this in R? I was thinking of hashing them, with MD5 or something, but that takes extra time, and is it even necessary? I don't care about the times the archives were created or anything like that, only whether the contents of the files are the same.
Example (this creates some test files on your computer):
## Create some test files
dir.create("test1")
dir.create("test2")
writeLines(text="hi", con="test1/test1.txt")
writeLines(text="hi*2", con="test2/test2.txt")
## Make some compressed archives
tar("test.tar.gzip2", files="test1", compression="bzip2") # should be same as test1.tar.gzip2
tar("test1.tar.gzip2", files="test1", compression="bzip2")
tar("test2.tar.gzip2", files="test2", compression="bzip2")
I want to be able to test that "test.tar.gzip2" and "test1.tar.gzip2" are the same, but "test2.tar.gzip2" is different. How?
The following function extracts the raw bytes from a file, which you can then compare:
binRead <- function(fName) {
  f_s <- file.info(fName)$size
  f <- file(fName, "rb")
  res <- readBin(f, "raw", f_s)
  close(f)
  return(res)
}
t0 <- binRead("test.tar.gzip2")
t1 <- binRead("test1.tar.gzip2")
t2 <- binRead("test2.tar.gzip2")
all(t0 == t1) # TRUE
all(t0 == t2) # FALSE
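Rather than reading the bytes yourself, `tools::md5sum()` (which ships with base R) hashes files directly, and comparing digests is equivalent for an equality test. One caveat: this compares the archive bytes, not the logical contents, so two archives built from identical files could in principle still differ byte-for-byte (e.g. if the compressor embeds a timestamp, as gzip does; bzip2 does not). A sketch that rebuilds the example archives and compares them:

```r
library(tools)

## rebuild the example archives (same commands as above)
dir.create("test1", showWarnings = FALSE)
writeLines(text = "hi", con = "test1/test1.txt")
tar("test.tar.gzip2",  files = "test1", compression = "bzip2")
tar("test1.tar.gzip2", files = "test1", compression = "bzip2")

## md5sum() returns a named character vector of hex digests
h <- md5sum(c("test.tar.gzip2", "test1.tar.gzip2"))
unname(h[1] == h[2])  # TRUE when the archives are byte-identical
```

If you need "same contents" regardless of archive-level metadata, hashing the extracted files (or comparing `untar(..., list = TRUE)` listings plus per-file hashes) is the safer route.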
I have recently made my first R package with specific tools for processing a large set of data that I am working with. In this project, there are several paths and files that I have to call and access at various points.
Is it possible to write functions that, when called, will load a set of predefined paths or data into my global environment?
For example, the function
load_foo_paths()
would return
foo_path_1 <- "path/to/foo/1/"
foo_path_2 <- "path/to/foo/2/"
And the function
load_foo_data()
would return
foo_data_1 <- read.csv("foo_data_1.csv")
foo_data_2 <- read.csv("foo_data_2.csv")
How would I go about doing something like this?
Thanks
Maybe you can adapt the following to your use case:
loadHist <- function() {
  HistFile <- c(".Rhistory", ".Rsession")
  ruser <- Sys.getenv("R_USER")          # C:\cygwin64\home\xxxx
  whome <- Sys.getenv("HOME")            # C:\cygwin64\home\xxxx
  uprofile <- Sys.getenv("USERPROFILE")  # C:\Users\xxxx
  # Candidate history paths (to .Rhistory)
  hP1 <- c(getwd(), ruser, whome, uprofile)
  hP2 <- file.path(hP1, HistFile[1])
  fe <- file.exists(hP2)
  # Load the first match
  fen <- length(fe); i <- 1
  while (i <= fen) {
    if (fe[i]) {
      cat('\nLoaded history from:\n', hP2[i], '\n', sep = '')
      try(utils::loadhistory(file = hP2[i]))
      break
    }
    i <- i + 1
  }
}
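Applying the same idea directly to the original question: `assign()` can write an object into a chosen environment, so a function can populate the caller's workspace. A minimal sketch (the paths are the question's own placeholders; `load_foo_data()` would follow the same pattern with `read.csv()`):

```r
load_foo_paths <- function(envir = globalenv()) {
  # assign() places each object directly into the chosen environment,
  # the global environment by default
  assign("foo_path_1", "path/to/foo/1/", envir = envir)
  assign("foo_path_2", "path/to/foo/2/", envir = envir)
  invisible(NULL)
}

load_foo_paths()
foo_path_1
# [1] "path/to/foo/1/"
```

Passing `envir` explicitly also lets you load everything into a separate environment instead of the global one, which keeps the workspace tidy.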
Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
                "HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final", "HRA_PC_3HRA_Final",
                "SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store each quantile value
for (column_name in quant_cols) {
  datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty data frame with the new column names appended
newDatFm <- datFm[0, ]
# get the unique values for the HRA classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector, create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
  currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
  for (column_name in quant_cols) {
    currDatFm <- within(currDatFm, {
      CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
      CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
      HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
      HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
      HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
      SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
      Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
    })
  }
  newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth), ]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video I have read or watched recommends the 'apply' family of functions over the classic 'for' loop, stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is it really true that apply is much faster than for? The CSV file 'yellow_point_02.csv' has approx. 2,500 rows, and this script runs almost instantly on my MacBook Pro, which has 16 GB of memory.
Third, see the 'quant_cols' vector? I created it so that I could write a generic loop (for (column_name in quant_cols)), but I could not make it work. So I hard-coded the column names suffixed with '_quantile' and called 'perc.rank' many times. Is there a way this could be made dynamic? I tried the 'paste' approach that I have in my script, but that did not work.
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.
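On the third question, the loop can be made fully dynamic with `[[` indexing (which, unlike `$`, accepts a column name held in a variable) combined with `ave()`, which applies a function within groups, so the split-loop-rbind over classes isn't needed either. A sketch on a toy data frame standing in for `datFm` (your real column names go in `quant_cols`, and `grp` plays the role of `HRA_ClassHRA_Final`):

```r
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0

## toy stand-in for datFm
datFm <- data.frame(grp = c(1, 1, 2, 2),
                    v1  = c(3, 1, 4, 2))
quant_cols <- c("v1")

for (column_name in quant_cols) {
  new_name <- paste(column_name, "quantile", sep = "_")
  ## ave() applies perc.rank within each level of grp; [[ lets the
  ## column be addressed by the name stored in column_name
  datFm[[new_name]] <- ave(datFm[[column_name]], datFm$grp, FUN = perc.rank)
}
datFm$v1_quantile
# [1] 100  50 100  50
```

The likely reason the `paste` attempt failed inside `within()` is that `$` and `within()` work with literal column names, not with names stored in a character variable; `[[` is the standard way around that.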
One of the powers of R / Shiny is the possibility to source() another R file from within your R code. I am doing this dynamically, so in the end there are a lot of sourced files. So far so good.
FileToSource <- paste("Folder/",df$filename,".R", sep = "")
source(FileToSource, chdir=T)
unsource(......) ???
But at some point I want to clean up. I can delete variables etc., but can I "unsource" the previously sourced files?
I have been looking for a way to do this, but no luck up till now.
You may wonder whether it is necessary to "unsource" files, but I like to clean up once in a while, and this can be part of it. Less chance of conflicting code, etc.
Suggestions?
Thanks in advance; if I find a way I'll post it here too.
You might want to consider using a local environment. Let's say there is a file called ~/x.R that contains one line bb <- 10. You can create a new environment
envir <- new.env()
and then source the file in that environment by
source('~/x.R',local=envir)
Then you will be able to obtain the value of bb as envir$bb, and you won't see bb in your global environment. Afterwards, you can remove the environment with rm(envir) (or drop the reference with envir <- NULL and let it be garbage-collected).
Great, I did this test to find out if/how it works:
A.R:
xx <- function() {
  print("A print")
}
yy <- 11
B.R:
xx <- function() {
  print("B print")
}
yy <- 99
Main.R:
(remove the # characters to get Error: attempt to apply non-function)
A <- new.env()
B <- new.env()
source("A.R", local=A)
source("B.R", local=B)
A$xx()
print(A$yy)
B$xx()
print(B$yy)
A <- NULL
#A$xx()
#print(A$yy)
B$xx()
print(B$yy)
B <- NULL
#A$xx()
#print(A$yy)
#B$xx()
#print(B$yy)
So in the end Main.R is
EMPTY & CLEAN & TIDY
<< just what I wanted! >>
Thanks @Marat
I am writing a loop that takes two files per run, e.g. a0.txt and b0.txt. I am running this over 100 files, from a0.txt and b0.txt up to a999.txt and b999.txt. The pattern I use works perfectly if I do the run for file pairs a0/b0 to a9/b9 with only pairs 0-9 in the directory, but when I put more files in the directory and run over 0:10, the loop fails and mixes up vectors between files. I think this is because of the `pattern` I use, i.e.
list.files(pattern=paste('.', x, '\\.txt', sep=''))
This only looks for files matching '.', x, '\\.txt'. So if '.' matches a and x = 1, it finds file a1.txt. But I think it gets confused between a0 and a10 when I run over more files, and I cannot seem to find an appropriate pattern that will also match files up to a999 and b999.
Can anyone help with a better way to do this? Code below.
dostuff <- function(x)
{
  files <- list.files(pattern=paste('.', x, '\\.txt', sep=''))
  a <- read.table(files[1], header=FALSE)  # file a0.txt
  G <- a$V1 - a$V2
  b <- read.table(files[2], header=FALSE)  # file b0.txt
  as.factor(b$V2)
  q <- tapply(b$V3, b$V2, FUN=length)
  H <- b$V1 - b$V2
  model <- lm(G ~ H)
  return(list(model$coefficients[2], q))
}
results <- sapply(0:10,dostuff)
Error in tapply(b$V3, b$V2, FUN = length) : arguments must have same length
How about constructing the file names directly, without searching, i.e.
dostuff <- function(x)
{
a.filename <- paste('a', x, '.txt', sep='') # a<x>.txt
b.filename <- paste('b', x, '.txt', sep='') # b<x>.txt
a <- read.table(a.filename, header=FALSE)
# [...]
b <- read.table(b.filename, header=FALSE)
# [...]
}
But the error message says the problem is caused by the call to tapply rather than by incorrect file names, and I have no idea how that could happen, since a data frame (which read.table creates) always has the same number of rows in each column. Did you copy-paste that error message out of R? I have a feeling there might be a typo, and the actual call was, for example, q <- tapply(a$V3, b$V2, FUN=length). But I could easily be wrong.
Also, as.factor(b$V2) doesn't modify b$V2; it just returns a factor representing b$V2, so after the call b$V2 is still a plain vector. You need to assign the result to something, e.g.:
V2.factor <- as.factor(b$V2)
If the beginning of the two file names is always the same (a and b in your example), you can use this information in the pattern via a character class:
x <- 1
list.files(pattern=paste('[ab]', x, '\\.txt', sep=''))
# [1] "a1.txt" "b1.txt"
x <- 11
list.files(pattern=paste('[ab]', x, '\\.txt', sep=''))
# [1] "a11.txt" "b11.txt"
Edit: you should include the ^ as well, as Wojciech proposed. ^ matches the beginning of a line, or in this case the beginning of the filename.
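Another option is to build a fully anchored pattern with `sprintf()`, so a1.txt can never collide with a10.txt no matter how many files are in the directory. A sketch (`grepl()` stands in for `list.files()` here so the matching behaviour is visible without creating files):

```r
x <- 1
pattern <- sprintf("^[ab]%d\\.txt$", x)  # anchored at both ends
grepl(pattern, c("a1.txt", "b1.txt", "a10.txt", "ba1.txt"))
# [1]  TRUE  TRUE FALSE FALSE
```

The trailing `$` is what rules out a10.txt when x is 1; the leading `^` rules out names like ba1.txt. The same pattern string can be passed straight to `list.files(pattern = ...)`.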
I have a batch script (on a Windows machine; I would like this to be generalised in time) that opens and runs the following code in the background:
library(svDialogs)
library(ggplot2)
library(XML)
sharesID <- display(guiDlg("SciViews-R", "Please Enter Shares ID:"))
test.df <- xmlToDataFrame(sharesID)
test.df
sapply(test.df, class)
test.df$timeStamp <- strptime(as.character(test.df$timeStamp), "%H:%M:%OS")
test.df$Price <- as.numeric(as.character(test.df$Price))
sapply(test.df, class)
options("digits.secs" = 3)
summary(test.df)
with(test.df, plot(timeStamp, Price))
sd(test.df$Price)
mean(test.df$timeStamp)
test.df$timeStamp <- test.df[1, "timeStamp"] + cumsum(runif(7) * 60)
summary(test.df)
qplot(timeStamp, Price, data = test.df, geom = c("point", "line"))
Price <- summary(test.df$Price)
print(Price)
When it gets to
sharesID <- display(guiDlg("SciViews-R", "Please Enter Shares ID:"))
it brings up a dialogue box asking the user to enter a Shares ID. At present you have to type the full path of the file you want the rest of the code to execute on. Is there a way to let the user pick a file number from a list of files kept in a database or similar?
The other question I have is that it generates a PDF file of both plots only. While I like this, is there a way to specify the output type and location (i.e. as a graph on a webpage)?
I also want to include a printout of the summary of Price in the output, but this is not achieved using the above commands.
I've never seen the svDialogs package before, but it looks pretty awesome. Staying in base R, maybe something like this is what you're after (or at least maybe it'll spark an idea); just copy and paste, it's a self-contained example:
# set up the relations between file paths and the file names you want
f.names <- c("file_01", "file_02", "file_03")
f.paths <- c("C:\\text01.txt", "C:\\text02.txt", "C:\\text03.txt")
# ask the user to select one of your specified file names
choice <- select.list(choices = f.names, title = "Please Select Shares ID:")
# return the full file path based on the file name selected by the user
index <- grep(choice, f.names, fixed = TRUE)
sharesID <- f.paths[index]
The above will bring up a dialogue box with file choices as defined by you. The user then selects one of the choices and eventually you'll get the full file path:
> sharesID
[1] "C:\\text01.txt"
Hope that helps a little mate,
Tony Breyal
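On the remaining points about output type/location and including the summary of Price: you can direct plots to a file of your choosing with `pdf()` (or `png()`) instead of relying on the default Rplots.pdf, and `capture.output()` turns the printed summary into text that can be written anywhere (a file, a report, a web page). A sketch with hypothetical data and file names standing in for test.df$Price:

```r
## hypothetical stand-in for test.df$Price
Price <- c(10.1, 10.4, 9.9, 10.2)

## send the plot to an explicitly chosen file instead of the default device
plot_file <- file.path(tempdir(), "price_plot.pdf")
pdf(plot_file)
plot(Price, type = "l")
dev.off()

## capture the printed summary as character lines and write them out
summary_file <- file.path(tempdir(), "price_summary.txt")
writeLines(capture.output(summary(Price)), summary_file)
```

Swapping `pdf()` for `png()` or `svg()` changes the output type; the path argument controls the location.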
I'm addressing the second desire: a web page with the graph. Here is a template for how you can use gWidgetsWWW to provide that. You can use this package locally through the localServerStart() command (it uses the help-page web server). If you save the following to a file, say "makeGraph.R", you can load it from within R with localServerStart("makeGraph.R") (assuming you are in the right directory; otherwise adjust the path):
require(ggplot2)
## a simple web page
w <- gwindow("Make a neat graph")
g <- ggroup(cont=w, horizontal=FALSE)
glabel("Select a data frame to produce a graph", cont=g)
cb <- gcombobox(names(mapNameToFile), selected=-1, cont=g)
f <- gframe("Summary", cont=g)
t <- ghtml("", cont=g)
f <- gframe("Plot", cont=g)
ourDevice <- gsvg(width=500, height=500, cont=f)
addHandlerChanged(cb, handler=function(h, ...) {
  makePlot(svalue(h$obj))
})
visible(w) <- TRUE

## Below here you must change things to suit your application:
## example of a map from names to some other object
mapNameToFile <- list("mtcars" = mtcars,
                      "CO2" = CO2)

## your main function
makePlot <- function(nm) {
  df <- mapNameToFile[[nm]]
  if (is.null(df)) {
    galert(sprintf("Can't find file %s", nm))
    return()
  }
  ## your plot
  p <- qplot(df[, 1], df[, 2])
  ## put it into an SVG device
  f <- getStaticTmpFile(ext=".svg")
  require(RSVGTipsDevice, quietly=TRUE, warn=FALSE)
  devSVGTips(f)
  print(p)
  dev.off()
  svalue(ourDevice) <- f
  ## write a summary
  svalue(t) <- paste("<pre>",
                     paste(capture.output(summary(df)), collapse="<br>"),
                     "</pre>",
                     sep="")
}