I am moving from R, and I use the head() function a lot. I couldn't find a similar method in Julia, so I wrote one for Julia Arrays. There are a couple of other R functions that I'm porting to Julia as well.
I need these methods to be available for use in every Julia instance that launches, whether through IJulia or through command line. Is there a "startup script" of sorts for Julia? How can I achieve this?
PS: In case someone else is interested, this is what I wrote. A lot needs to be done for general-purpose use, but it does what I need it to for now.
function head(obj::Array; nrows=5, ncols=size(obj)[2])
    if (size(obj)[1] < nrows)
        println("WARNING: nrows is greater than actual number of rows in the obj Array.")
        nrows = size(obj)[1]
    end
    obj[[1:nrows], [1:ncols]]
end
You can make a ~/.juliarc.jl file; see the Getting Started section of the manual.
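For example, a minimal ~/.juliarc.jl could just pull in a file of helper definitions; the file name here is purely a hypothetical placeholder:
# ~/.juliarc.jl -- run automatically at the start of every Julia session
# "julia_utils.jl" is a hypothetical file holding head() and the other ported helpers
include(joinpath(homedir(), "julia_utils.jl"))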
As for your head function, here is how I'd do it:
function head(obj::Array; nrows=5, ncols=size(obj,2))
    if size(obj,1) < nrows
        warn("nrows is greater than actual number of rows in the obj Array.")
        nrows = size(obj,1)
    end
    obj[1:nrows, 1:ncols]
end
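A quick usage sketch, with a random matrix standing in for real data:
M = rand(10, 4)            # hypothetical 10x4 matrix
head(M)                    # first 5 rows, all columns
head(M, nrows=3, ncols=2)  # first 3 rows, first 2 columns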
I want to save a function in a JLD2 file. Is this even possible and if so how?
using JLD2
a = [1,2,3]
function foo(x, y)
    x .> y
end
foo(x) = foo(x, a)
save_object("stored.jld2", foo)
So I guess my point here is actually two things:
I want to save the function.
I want it to have the method foo(x) available, i.e. if the JLD2 file is opened somewhere else, it still has the method foo(x) with a = [1,2,3].
When someone builds a machine learning model and saves it, it must work something like this, right?
Really hope it's understandable what I mean here. If not please let me know.
OK, after some more research and some thinking, I have come to the conclusion (happy to be proven wrong...) that this is not possible and/or not sensible. The comments on the question suggest using machine learning packages to save models, and I think this approach explains the underlying point...
You can store data in JLD2 files, but you cannot store functions in them. What I tried here is not going to work, because
foo(x) = foo(x, a)
accesses the object a in the global environment; it does not magically store the data of a within the function. Furthermore, functions cannot be saved with JLD2 (who knows, that may change in the future).
Great, but what's the solution now??
What you can instead do is store the data...
using JLD2, FileIO
a = [1,2,3]
save("bar.jld2", Dict("a" => a))
...and make the function available through a module or a script you include().
include("module_with_foo_function.jl")
using module_with_foo_function
using JLD2
my_dict = load("bar.jld2")
a = my_dict["a"]
foo(x) = foo(x, a)
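For completeness, a minimal sketch of what module_with_foo_function.jl could look like; the module name comes from the snippet above, and the body is just the obvious two-argument definition:
# module_with_foo_function.jl -- hypothetical module providing the two-argument foo
module module_with_foo_function

export foo

# elementwise "greater than", as in the original example
foo(x, y) = x .> y

end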
And if you think about it, this is exactly what packages like MLJ are doing as well. There you store an object that contains the information, but the functionality comes from the package, not from your stored object.
Not sure anyone else will ever have a similar problem, but if so, I hope these thoughts help.
I am searching for a command to compute the Walsh-Hadamard Transform of an image in R, but I can't find anything. In MATLAB, fwht is used for this; it applies the Walsh-Hadamard Transform to each row of a matrix. Can anyone suggest a similar way to compute the Walsh-Hadamard Transform on the rows or columns of a matrix in R?
I found a package here:
http://www2.uaem.mx/r-mirror/web/packages/boolfun/boolfun.pdf
But why is this package not available when I try to install it?
Packages that are not maintained get put in the Archive. They get put there when they aren't updated to match changing requirements or start producing errors against the changing R code base. https://cran.r-project.org/web/packages/boolfun/index.html
It's possible that you might be able to extract useful code from the archive version, despite the relatively ancient version of R that package was written under.
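If you want to try that route, one option is to install the archived source tarball directly (this needs a build toolchain, and the exact file name below is an assumption, so check the Archive directory first):
# install the archived source package straight from CRAN's Archive
# (version number is an assumption -- see the Archive directory for what is actually there)
install.packages(
    "https://cran.r-project.org/src/contrib/Archive/boolfun/boolfun_0.2.7.tar.gz",
    repos = NULL, type = "source"
)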
The R code for walshTransform calls an object code routine:
walshTransform <- function ( truthTable ) # /!\ should check truthTable values are in {0,1}
{
    len <- log(length(truthTable), base=2)
    if( len != round(len) )
        stop("bad truth table length")
    res <- .Call( "walshTransform",
                  as.integer(truthTable),
                  as.integer(len))
    res
}
Installing the package succeeded on my Mac, but would require the appropriate toolchain on whatever OS you are working in.
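Alternatively, if all you need is the transform itself rather than the rest of boolfun, a plain-R fast Walsh-Hadamard transform is short enough to write yourself. A minimal sketch, unnormalized and in natural (Hadamard) order, so the scaling and ordering differ from MATLAB's fwht defaults:
# unnormalized fast Walsh-Hadamard transform (natural/Hadamard order)
# of a vector whose length is a power of two
fwht <- function(x) {
    n <- length(x)
    stopifnot(n >= 1, bitwAnd(n, n - 1L) == 0)   # length must be a power of two
    h <- 1
    while (h < n) {
        for (i in seq(1, n, by = 2 * h)) {
            a <- x[i:(i + h - 1)]
            b <- x[(i + h):(i + 2 * h - 1)]
            x[i:(i + h - 1)]           <- a + b
            x[(i + h):(i + 2 * h - 1)] <- a - b
        }
        h <- 2 * h
    }
    x
}

# apply the transform to every row of a matrix (note: no 1/N scaling, unlike MATLAB's default)
wht_rows <- function(m) t(apply(m, 1, fwht))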
In a nutshell, I am trying to parallelise my whole script over dates using snow and adply, but I continually get the error below.
Error in unserialize(socklist[[n]]) : error reading from connection
In addition: Warning messages:
1: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
2: <anonymous>: ... may be used in an incorrect context: ‘.fun(piece, ...)’
I have set up the parallelisation process in the following way:
Cores = detectCores(all.tests = FALSE, logical = TRUE)
cl = makeCluster(Cores, type="SOCK")
registerDoSNOW(cl)
clusterExport(cl, c("Var1","Var2","Var3","Var4"), envir = environment())
exposureDaily <- adply(.data = dateSeries, .margins = 1, .fun = MainCalcFunction,
                       .expand = TRUE, Var1, Var2, Var3,
                       Var4, .parallel = TRUE)
stopCluster(cl)
Where dateSeries might look something like
> dateSeries
marketDate
1 2016-04-22
2 2016-04-26
MainCalcFunction is a very long script with many of my own functions contained within it. As the script is so long, reproducing it wouldn't be practical, and a hypothetical small function would defeat the purpose, as I have already got this methodology to work with other, smaller functions. I can say that within MainCalcFunction I call all my libraries, necessary functions, and a file containing all other variables aside from those exported above, so that I don't have to export a long list of libraries and other objects.
MainCalcFunction can run successfully in its entirety over 2 dates using adply but without parallelisation, which tells me that it is not a bug in the code that is causing the parallelisation to fail.
Initially I thought (from experience) that the parallelisation over dates was failing because there was another function within the code that utilised parallelisation; however, I have since rebuilt the whole code to make sure that there was no such function.
I have pored over the script with a fine-tooth comb to see if there was any place where I accidentally didn't export something that I needed, and I can't find anything.
Some ideas as to what could be causing the code to fail are:
The use of various option valuation functions in fOptions and rquantlib
The use of type sock
I am aware of this question already asked and also this question, and while the first question has helped me, it hasn't yet helped solve the problem. (Note: that may be because I haven't used it correctly, having mainly used loginfo("text") to track where the code is. Potentially, there is a way to change that so that I log warning and/or error messages instead?)
Please let me know if there is any other information I can provide to help in solving this. I would be so appreciative if someone could provide some guidance, as the code takes close to 40 minutes to run for a day and I need to run it for close to a year, therefore parallelisation is essential!
EDIT
I have tried to implement the suggestion in the first question included above by utilising the outfile option. Given I am using Windows, I have done this by including the following lines before exporting the key objects and running MainCalcFunction:
reportLogName <- paste("logout_parallel.txt", sep="")
addHandler(writeToFile,
           file = paste(Save_directory, reportLogName, sep=""),
           level = 'DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting log file", getwd()))
mc<-detectCores()
cl<-makeCluster(mc, outfile="")
registerDoParallel(cl)
Similarly, at the beginning of MainCalcFunction, after having sourced my libraries and functions I have included the following to print to file:
reportLogName <- paste(testDate,"_logout.txt", sep="")
addHandler(writeToFile,
           file = paste(Save_directory, reportLogName, sep=""),
           level = 'DEBUG')
with(getLogger(), names(handlers))
loginfo(paste("Starting test function ",getwd(), sep = ""))
In the MainCalcFunction function I have then put loginfo("text") statements at key junctures to inform me of where the code is at.
This has resulted in some text files being available after the code fails due to the aforementioned error. However, these text files provide no more information on the cause of the error, aside from at what point it occurred. This is despite having a tryCatch statement embedded in MainCalcFunction where, at the end, on any instance of error, I have added the line logerror(e).
I am posting this answer in case it helps anyone else with a similar problem in the future.
Essentially, the error unserialize(socklist[[n]]) doesn't tell you a lot, so solving it is a matter of narrowing down the issue.
Firstly, be absolutely sure the code runs over several dates in non-parallel mode with no errors.
Secondly, ensure the parallelisation is set up correctly. There are some obvious initial errors that many other questions address, e.g., hidden parallelisation inside the code, which means parallelisation is occurring twice.
Once you are sure that there is no problem with the code and the parallelisation is set up correctly, start narrowing down. The issue is likely (unless something has been missed above) to be something in the code which isn't a problem when run in serial, but becomes a problem when run in parallel. The easiest way to narrow down is by setting outfile = "Log.txt" in whichever makeCluster call you use, e.g., cl <- makeCluster(cores-1, outfile="Log.txt"). Then add as many print("Point in code") statements in your function as needed to narrow down where the issue occurs, as in the sketch below.
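A minimal sketch of that debugging setup, based on the snow/plyr combination from the question; the cluster size, log file name, and the body of MainCalcFunction are just placeholders:
library(plyr)
library(doSNOW)   # loads snow as well

cl <- makeCluster(parallel::detectCores() - 1, type = "SOCK", outfile = "Log.txt")
registerDoSNOW(cl)

# print() output from the workers ends up in Log.txt, so you can see how far
# each worker got before the connection error occurred
MainCalcFunction <- function(piece, ...) {
    print("MainCalcFunction: start")
    # ... actual calculations ...
    print("MainCalcFunction: end")
    piece
}

exposureDaily <- adply(.data = dateSeries, .margins = 1, .fun = MainCalcFunction,
                       .parallel = TRUE)
stopCluster(cl)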
In my case, the problem was the line jj = closeAllConnections(). This line works fine in non-parallel but breaks the code when in parallel. I suspect it has something to do with the function closing all connections including socket connections that are required for the parallelisation.
Try running using plain R instead of running in RStudio.
Hi, I'm working with SparkR in distributed mode on a YARN cluster.
I have two questions:
1) If I write a script that contains both plain R code and SparkR code, will it distribute just the SparkR code, or the plain R too?
This is the script. I read a CSV and take just the first 100k records.
I clean it (with R functions), deleting NA values, and create a SparkR DataFrame.
This is what it does: for each LINESET, take each TimeInterval where that LINESET appears, sum some numeric attributes, and put them all into a matrix.
This is the script with R and SparkR code. It takes 7 hours in standalone mode and around 60 hours in distributed mode (killed by java.net.SocketException: Broken Pipe).
LineSmsInt<-fread("/home/sentiment/Scrivania/LineSmsInt.csv")
Short<-LineSmsInt[1:100000,]
Short[is.na(Short)] <- 0
Short$TimeInterval<-Short$TimeInterval/1000
ShortDF<-createDataFrame(sqlContext,Short)
UniqueLineSet<-unique(Short$LINESET)
UniqueTime<-unique(Short$TimeInterval)
UniqueTime<-as.numeric(UniqueTime)
Row<-length(UniqueLineSet)*length(UniqueTime)
IntTemp<-matrix(nrow =Row,ncol=7)
k<-1
colnames(IntTemp)<-c("LINESET","TimeInterval","SmsIN","SmsOut","CallIn","CallOut","Internet")
Sys.time()
for(i in 1:length(UniqueLineSet)){
    SubSetID<-filter(ShortDF,ShortDF$LINESET==UniqueLineSet[i])
    for(j in 1:length(UniqueTime)){
        SubTime<-filter(SubSetID,SubSetID$TimeInterval==UniqueTime[j])
        IntTemp[k,1]<-UniqueLineSet[i]
        IntTemp[k,2]<-as.numeric(UniqueTime[j])
        k3<-collect(select(SubTime,sum(SubTime$SmsIn)))
        IntTemp[k,3]<-k3[1,1]
        k4<-collect(select(SubTime,sum(SubTime$SmsOut)))
        IntTemp[k,4]<-k4[1,1]
        k5<-collect(select(SubTime,sum(SubTime$CallIn)))
        IntTemp[k,5]<-k5[1,1]
        k6<-collect(select(SubTime,sum(SubTime$CallOut)))
        IntTemp[k,6]<-k6[1,1]
        k7<-collect(select(SubTime,sum(SubTime$Internet)))
        IntTemp[k,7]<-k7[1,1]
        k<-k+1
    }
    print(UniqueLineSet[i])
    print(i)
}
This is the plain R script; the only thing that changes is the subset function and, of course, it works on a normal R data.frame, not a SparkR DataFrame.
It takes 1.30 minutes in standalone mode.
Why is it so fast in plain R and so slow in SparkR?
for(i in 1:length(UniqueLineSet)){
    SubSetID<-subset.data.frame(LineSmsInt,LINESET==UniqueLineSet[i])
    for(j in 1:length(UniqueTime)){
        SubTime<-subset.data.frame(SubSetID,TimeInterval==UniqueTime[j])
        IntTemp[k,1]<-UniqueLineSet[i]
        IntTemp[k,2]<-as.numeric(UniqueTime[j])
        IntTemp[k,3]<-sum(SubTime$SmsIn,na.rm = TRUE)
        IntTemp[k,4]<-sum(SubTime$SmsOut,na.rm = TRUE)
        IntTemp[k,5]<-sum(SubTime$CallIn,na.rm = TRUE)
        IntTemp[k,6]<-sum(SubTime$CallOut,na.rm = TRUE)
        IntTemp[k,7]<-sum(SubTime$Internet,na.rm=TRUE)
        k<-k+1
    }
    print(UniqueLineSet[i])
    print(i)
}
2) The first script, in distributed mode, was killed by:
java.net.SocketException: Broken Pipe
and this appears sometimes too:
java.net.SocketTimeoutException: Accept timed out
Could it come from a bad configuration? Any suggestions?
Thanks.
Don't take this the wrong way, but it is not a particularly well-written piece of code. It is already inefficient using core R, and adding SparkR to the equation makes it even worse.
If I write a script that contains both plain R code and SparkR code, will it distribute just the SparkR code, or the plain R too?
Unless you're using distributed data structures and functions which operate on these structures, it is just plain R code executed in a single thread on the master.
Why is it so fast in plain R and so slow in SparkR?
For starters, you execute a single job for each combination of LINESET, UniqueTime, and column. Each time, Spark has to scan all records and fetch the data to the driver.
Moreover, using Spark to handle data that can easily be handled in the memory of a single machine simply doesn't make sense. The cost of running a job in a case like this is usually much higher than the cost of the actual processing.
Any suggestions?
If you really want to use SparkR, just use group_by and agg:
agg(group_by(ShortDF, ShortDF$LINESET, ShortDF$TimeInterval),
    sum(ShortDF$SmsIn), sum(ShortDF$SmsOut), sum(ShortDF$CallIn),
    sum(ShortDF$CallOut), sum(ShortDF$Internet))
If you care about missing (LINESET, TimeInterval) pairs, fill these in using either join or unionAll.
In practice I would simply stick with data.table and aggregate locally:
Short[, lapply(.SD, sum, na.rm=TRUE), by=.(LINESET, TimeInterval)]
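Combined with the data-prep lines from the question, the whole computation collapses to a local aggregation. A sketch, assuming the column names from the question match the CSV headers:
library(data.table)

# data prep exactly as in the question
LineSmsInt <- fread("/home/sentiment/Scrivania/LineSmsInt.csv")
Short <- LineSmsInt[1:100000,]
Short[is.na(Short)] <- 0
Short$TimeInterval <- Short$TimeInterval/1000

# one grouped aggregation replaces the nested loops of per-cell Spark jobs
agg <- Short[, lapply(.SD, sum, na.rm = TRUE),
             by = .(LINESET, TimeInterval),
             .SDcols = c("SmsIn", "SmsOut", "CallIn", "CallOut", "Internet")]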
I've installed Julia version 0.3.10-pre+4 and am trying to run the examples contained in julia/examples, in particular plife.jl, which seems to be a parallel simulation of Conway's Game of Life. I get the error
julia> include("plife.jl")
plife (generic function with 1 method)
julia> plife(100,100)
ERROR: Window not defined
in plife at /x/faia/julia/examples/plife.jl:44
where the error is caused by the code
function plife(m, n)
    w = Window("parallel life", n, m)
    c = Canvas(w)
I have searched for a package that defines the Window function, but have failed to find it. Does anyone know how to run this example code?
That example seems to rely on an external package; it looks like it might be Tk.jl, but there have been some renames since then.
It has actually been removed on the development branch of Julia, but still lingers in the 0.3 series. I've filed a PR to remove it.