I am trying to read a text file in Julia, but when I specify a type while reading, it gives an error:
data = readdlm("data.txt",'\t', Float64)
at row 1, column 1 : ErrorException("file entry \" 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00\" cannot be converted to Float64")
If I don't use Float64, the data type is Array{Any,2}. I get results like the ones below, but the data actually has 14 separate columns:
" 0.27957 0.00 9.690 0 0.5850 5.9260 42.60 2.3817 6 391.0 19.20 396.90 13.59 24.50"
" 0.17899 0.00 9.690 0 0.5850 5.6700 28.80 2.7986 6 391.0 19.20 393.29 17.60 23.10"
" 0.28960 0.00 9.690 0 0.5850 5.3900 72.90 2.7986 6 391.0 19.20 396.90 21.14 19.70"
" 0.26838 0.00 9.690 0 0.5850 5.7940 70.60 2.8927 6 391.0 19.20 396.90 14.10 18.30"
" 0.23912 0.00 9.690 0 0.5850 6.0190 65.30 2.4091 6 391.0 19.20 396.90 12.92 21.20"
I recommend using the CSV library to parse delimited files. It has features, such as handling repeated delimiters, which should deal with your input file. Note that CSV.read treats the first row as a header by default, which is why only four of the five rows appear in the output below; pass header=false if every row is data.
julia> using Pkg
julia> Pkg.add("CSV")
julia> import CSV
julia> Array(CSV.read("data.txt"; delim=' ', ignorerepeated=true, type=Float64))
4×14 Array{Float64,2}:
0.17899 0.0 9.69 0.0 0.585 5.67 28.8 2.7986 6.0 391.0 19.2 393.29 17.6 23.1
0.2896 0.0 9.69 0.0 0.585 5.39 72.9 2.7986 6.0 391.0 19.2 396.9 21.14 19.7
0.26838 0.0 9.69 0.0 0.585 5.794 70.6 2.8927 6.0 391.0 19.2 396.9 14.1 18.3
0.23912 0.0 9.69 0.0 0.585 6.019 65.3 2.4091 6.0 391.0 19.2 396.9 12.92 21.2
I am essentially trying to write my own code for the nonpartest() function in the npmv package. I have a dataset:
Cattle <- read.table(text=" Treatment Replicate Weight_Loss Persistent Head_Size Salebarn_Q
'LA 200' 1 17.90 14.10 14.25 1.0
'LA 200' 2 19.30 15.30 2.56 1.0
'LA 200' 3 19.50 16.82 5.80 1.5
'LA 200' 4 18.94 12.70 7.51 1.5
Excede 1 19.60 11.20 14.52 1.0
Excede 2 19.50 10.54 9.83 1.0
Excede 3 19.10 10.83 3.82 0.5
Excede 4 20.40 11.00 0.04 1.0
Micotil 1 17.30 14.29 1.62 1.0
Micotil 2 20.00 11.65 0.13 3.0
Micotil 3 18.10 10.89 2.41 0.0
Micotil 4 19.50 12.43 5.93 2.0
Zoetis 1 18.50 25.48 10.08 1.0
Zoetis 2 17.60 20.12 11.93 1.0
Zoetis 3 19.70 23.29 7.93 2.5
Zoetis 4 18.50 28.32 13.08 3.0", header=TRUE)
Which I am trying to use to generate the matrices Rij, R̄i. and R̄.. in the equation in the paper below, so that I can calculate the test statistics G and H.
I attempted to do it using
R <- matrix(rank(Cattle, ties.method = "average"), N, p)
R_bar <- matrix(rank(Cattle, ties.method = "average"), 1, p)
H <- (1/(a-1)) * sum(n * (R - R_bar) * t(R - R_bar))
G <- (1/(N-a) * sum(sum(R - R_bar) * (R_prime - R_bar_prime)))
But that does not work, apparently. I'm not entirely sure what the paper is describing with regard to the dimensions of the R matrices. I know you should use the rank() function and then transpose with t() for the 'prime' (transposed) versions.
(Images show excerpts of the paper describing the different matrices, their dimensions, and how they enter the actual equations.)
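For what it's worth, here is a minimal sketch of one reading of those definitions (not verified against nonpartest()): ranks are computed within each response column (mid-ranks), R̄i. is the mean rank vector of treatment group i, R̄.. is the overall mean rank vector, and H and G are the between- and within-group sums-of-squares-and-cross-products matrices. One likely issue with the attempt above is that rank() has to be applied column by column, not to the whole data frame at once.

# Response matrix (N x p) and grouping factor
resp <- as.matrix(Cattle[, c("Weight_Loss", "Persistent", "Head_Size", "Salebarn_Q")])
grp  <- factor(Cattle$Treatment)
N <- nrow(resp); p <- ncol(resp); a <- nlevels(grp)
n_i <- as.vector(table(grp))                                  # group sizes

# Mid-ranks within each column: R[i, j] is the rank of observation i on variable j
R       <- apply(resp, 2, rank, ties.method = "average")      # N x p
R_bar   <- colMeans(R)                                        # overall mean rank vector, equals (N+1)/2
R_bar_i <- apply(R, 2, function(col) tapply(col, grp, mean))  # a x p matrix of group mean rank vectors

# H: between-group matrix, (1/(a-1)) * sum_i n_i (R̄i. - R̄..)(R̄i. - R̄..)'
H <- matrix(0, p, p)
for (i in seq_len(a)) {
  d <- R_bar_i[i, ] - R_bar
  H <- H + n_i[i] * tcrossprod(d)                             # d %*% t(d), a p x p matrix
}
H <- H / (a - 1)

# G: within-group matrix, (1/(N-a)) * sum_ij (Rij - R̄i.)(Rij - R̄i.)'
G <- matrix(0, p, p)
for (k in seq_len(N)) {
  d <- R[k, ] - R_bar_i[as.character(grp[k]), ]
  G <- G + tcrossprod(d)
}
G <- G / (N - a)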
I would like to figure out which unit number corresponds to each node in the kohonen plot.
library(kohonen)
set.seed(0)
data("wines")
wines <- scale(wines)
som_grid <- somgrid(8, 6, "hexagonal")
som_model <- som(wines, som_grid)
plot(som_model)
The plot will look like this (the default 'codes' view of the SOM grid). You can see which unit each observation lies in with:
head(data.frame(cbind(wines,unit= som_model$unit.classif)))
alcohol malic.acid ash ash.alkalinity magnesium tot..phenols flavonoids non.flav..phenols proanth col..int. col..hue
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04
5 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05
6 14.39 1.87 2.45 14.6 96 2.50 2.52 0.30 1.98 5.25 1.02
OD.ratio proline unit
1 3.40 1050 24
2 3.17 1185 46
3 3.45 1480 48
4 2.93 735 4
5 2.85 1450 48
6 3.58 1290 47
But I would like to show this unit information on the plot itself, i.e. put text in each node with its unit number, in the same way the identify() function does, but automatically. Thanks in advance!
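A minimal sketch of doing that automatically, assuming the node positions in the plot match the coordinates stored in som_model$grid$pts (the same grid coordinates identify() works from):

# Overlay each unit's index at its grid position on the current plot
plot(som_model)
coords <- som_model$grid$pts                     # x/y positions of the grid units
text(coords[, 1], coords[, 2],
     labels = seq_len(nrow(coords)), font = 2)

These indices follow the same unit numbering that som_model$unit.classif reports, so the labels line up with the unit column above.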
I'm quite new to R and I'm stuck on this problem. I'm trying to apply the Kruskal-Wallis test and Dunn's post-hoc test to a data frame that looks something like this:
## red blue yellow orange green
T1 1.01 0.78 0.90 0.57 0.81
T1 4.61 2.39 2.56 2.22 2.82
T1 5.20 5.23 5.10 8.74 7.34
T1 3.63 3.55 3.24 4.28 4.60
T2 1.06 0.46 0.53 0.25 0.33
T2 6.39 8.35 4.36 1.31 2.05
T2 0.48 0.18 0.67 0.13 0.15
T2 1.42 4.43 1.86 3.35 3.72
T3 4.63 2.93 2.41 2.45 2.31
T3 2.10 2.44 2.39 2.35 2.64
T3 3.52 2.06 1.72 1.63 1.38
T3 2.79 1.13 1.30 0.75 0.95
T4 1.05 1.26 0.86 1.10 1.33
T4 1.60 2.05 1.64 0.61 0.34
T4 3.60 3.70 3.37 3.36 2.38
T4 7.05 3.96 3.79 2.08 2.09
I have managed to perform the Kruskal-Wallis test using the apply function:
kruskal <- apply(df[, -1], 2, function(x) kruskal.test(x, df[, 1])$p.value)
I tried using the same instruction with posthoc.kruskal.dunn.test() from the PMCMR package, but it does not work. So I reshaped the data frame like this, in order to run the test on each column one by one:
## red blue yellow orange green treatment
a 1.01 0.78 0.90 0.57 0.81 T1
b 4.61 2.39 2.56 2.22 2.82 T1
c 5.20 5.23 5.10 8.74 7.34 T1
d 3.63 3.55 3.24 4.28 4.60 T1
e 1.06 0.46 0.53 0.25 0.33 T2
f 6.39 8.35 4.36 1.31 2.05 T2
g 0.48 0.18 0.67 0.13 0.15 T2
h 1.42 4.43 1.86 3.35 3.72 T2
i 4.63 2.93 2.41 2.45 2.31 T3
j 2.10 2.44 2.39 2.35 2.64 T3
k 3.52 2.06 1.72 1.63 1.38 T3
l 2.79 1.13 1.30 0.75 0.95 T3
m 1.05 1.26 0.86 1.10 1.33 T4
n 1.60 2.05 1.64 0.61 0.34 T4
o 3.60 3.70 3.37 3.36 2.38 T4
p 7.05 3.96 3.79 2.08 2.09 T4
using the following instruction:
red <- posthoc.kruskal.dunn.test(red ~ treatment, data = df, p.adjust = "BH")
It works fine, but my data frame is much bigger than this example, and running each column one by one would take a lot of time. I'm sure there must be a way to do all of them with one instruction, or maybe a loop.
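A minimal sketch of looping over all the response columns with lapply(), assuming the reshaped layout above with the grouping in a treatment column (posthoc.kruskal.dunn.test() also accepts a response vector and a grouping factor directly):

library(PMCMR)

# Every column except the grouping variable is a response
responses <- setdiff(names(df), "treatment")

# Run the Dunn post-hoc test once per response column
dunn_results <- lapply(responses, function(v) {
  posthoc.kruskal.dunn.test(df[[v]], g = factor(df$treatment),
                            p.adjust.method = "BH")
})
names(dunn_results) <- responses

dunn_results$red    # e.g. the result for the 'red' column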
I have recently been using R to run a generalised linear model (GLM) on a 100 MB csv file (9 million rows by 5 columns). The file's 5 columns are called depvar, var1, var2, var3, var4, and are all randomly distributed, containing small integer values (0, 1, 2 or 3, depending on the column). Basically I have used the biglm package to run the GLM on this data file, and R processed it in approximately 2 minutes. This was on a Linux machine using R version 2.10 (I am currently updating to 2.14), with 4 cores and 8 GB of RAM.

Basically I want to run the code faster, at around 30 to 60 seconds. One solution is adding more cores and RAM, but that would only be a temporary fix, as I realise datasets will only get bigger. Ideally I want to find a way to make the bigglm code itself faster.

I have run the R profiler on this code, adding the following line before the code whose speed I want to check:
Rprof('method1.out')
Then after this command I write my bigglm code, which looks something like this:
library(biglm)  # for bigglm()

x <- read.csv('file location of 100mb file')
form <- depvar ~ var1 + var2 + var3 + var4
a <- bigglm(form, data = x, chunksize = 800000, sandwich = FALSE, family = binomial())
summary(a)
AIC(a)
deviance(a)
After running this code, which takes around 2 to 3 minutes, I type the following to see the profiling results:
summaryRprof('method1.out')
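(One detail worth noting, sketched below: Rprof() keeps writing samples until it is switched off with Rprof(NULL), so the log should be closed before summarising it.)

Rprof('method1.out')            # start writing profiling samples
a <- bigglm(form, data = x, chunksize = 800000,
            sandwich = FALSE, family = binomial())
Rprof(NULL)                     # stop profiling before reading the log
summaryRprof('method1.out')     # per-function self/total time breakdown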
What I then get is a breakdown of the bigglm process and which individual calls are taking a long time. After viewing this, I was surprised to see a call to Fortran code that was taking a very long time (around 20 seconds). This code can be found in the source of the biglm package at:
http://cran.r-project.org/web/packages/biglm/index.html
in the biglm_0.8.tar.gz file.

Basically what I am asking the community is: can this code be made faster? For example, by changing how the code calls the Fortran routine that does the QR decomposition. Furthermore, there were other functions, like as.character and model.matrix, which also took a long time.

I have not attached the profiling file here, as I believe it can easily be reproduced given the information I have supplied, but basically I am pointing at the big problem of running GLMs on big data. This is a problem shared amongst the R community, and I think any feedback or help on this issue would be appreciated. You can probably replicate this example using a different dataset, look at what is taking so long in the bigglm code, and see whether it is the same things I found. If so, can someone please help me figure out how to make bigglm run faster?

After Ben requested it, I have uploaded the snippet of the profiling output I had, as well as the first 10 lines of my csv file:
CSV File:
var1,var2,var3,var4,depvar
1,0,2,2,1
0,0,1,2,0
1,1,0,0,1
0,1,1,2,0
1,0,0,3,0
0,0,2,2,0
1,1,0,0,1
0,1,2,2,0
0,0,2,2,1
This CSV output was copied from my text editor UltraEdit, and it can be seen that var1 takes values 0 or 1, var2 takes values 0 and 1, var3 takes values 0, 1, 2, var4 takes values 0, 1, 2, 3, and depvar takes values 1 or 0. This csv can be replicated in Excel using the RAND function for up to around 1 million rows, then copied and pasted several times over in a text editor like UltraEdit to get a larger number of rows. Basically, type RAND() into one column for 1 million rows, then take ROUND() of that column in the column beside it to get 1s and zeros; the same sort of thinking applies to the 0,1,2,3 columns.
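(As an alternative to the Excel approach, a minimal sketch of generating an equivalent random CSV directly in R; the file name sim.csv is arbitrary:)

# Simulate a CSV with the same column layout and value ranges as above
n <- 9e6
d <- data.frame(var1   = sample(0:1, n, replace = TRUE),
                var2   = sample(0:1, n, replace = TRUE),
                var3   = sample(0:2, n, replace = TRUE),
                var4   = sample(0:3, n, replace = TRUE),
                depvar = sample(0:1, n, replace = TRUE))
write.csv(d, 'sim.csv', row.names = FALSE)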
The profiling file is long, so I have attached the lines that took the most time:
summaryRprof('method1.out')
$by.self
self.time self.pct total.time total.pct
"model.matrix.default" 25.40 20.5 26.76 21.6
".Call" 20.24 16.3 20.24 16.3
"as.character" 17.22 13.9 17.22 13.9
"[.data.frame" 14.80 11.9 22.12 17.8
"update.bigqr" 5.72 4.6 14.38 11.6
"-" 4.36 3.5 4.36 3.5
"anyDuplicated.default" 4.18 3.4 4.18 3.4
"|" 3.98 3.2 3.98 3.2
"*" 3.44 2.8 3.44 2.8
"/" 3.18 2.6 3.18 2.6
"unclass" 2.28 1.8 2.28 1.8
"sum" 2.26 1.8 2.26 1.8
"attr" 2.12 1.7 2.12 1.7
"na.omit" 2.02 1.6 20.00 16.1
"%*%" 1.74 1.4 1.74 1.4
"^" 1.56 1.3 1.56 1.3
"bigglm.function" 1.42 1.1 122.52 98.8
"+" 1.30 1.0 1.30 1.0
"is.na" 1.28 1.0 1.28 1.0
"model.frame.default" 1.20 1.0 22.68 18.3
">" 0.84 0.7 0.84 0.7
"strsplit" 0.62 0.5 0.62 0.5
$by.total
total.time total.pct self.time self.pct
"standardGeneric" 122.54 98.8 0.00 0.0
"bigglm.function" 122.52 98.8 1.42 1.1
"bigglm" 122.52 98.8 0.00 0.0
"bigglm.data.frame" 122.52 98.8 0.00 0.0
"model.matrix.default" 26.76 21.6 25.40 20.5
"model.matrix" 26.76 21.6 0.00 0.0
"model.frame.default" 22.68 18.3 1.20 1.0
"model.frame" 22.68 18.3 0.00 0.0
"[" 22.14 17.9 0.02 0.0
"[.data.frame" 22.12 17.8 14.80 11.9
".Call" 20.24 16.3 20.24 16.3
"na.omit" 20.00 16.1 2.02 1.6
"na.omit.data.frame" 17.98 14.5 0.02 0.0
"model.response" 17.44 14.1 0.10 0.1
"as.character" 17.22 13.9 17.22 13.9
"names<-" 17.22 13.9 0.00 0.0
"<Anonymous>" 15.10 12.2 0.00 0.0
"update.bigqr" 14.38 11.6 5.72 4.6
"update" 14.38 11.6 0.00 0.0
"data" 10.26 8.3 0.00 0.0
"-" 4.36 3.5 4.36 3.5
"anyDuplicated.default" 4.18 3.4 4.18 3.4
"anyDuplicated" 4.18 3.4 0.00 0.0
"|" 3.98 3.2 3.98 3.2
"*" 3.44 2.8 3.44 2.8
"/" 3.18 2.6 3.18 2.6
"lapply" 3.04 2.5 0.04 0.0
"sapply" 2.44 2.0 0.00 0.0
"as.list.data.frame" 2.30 1.9 0.02 0.0
"as.list" 2.30 1.9 0.00 0.0
"unclass" 2.28 1.8 2.28 1.8
"sum" 2.26 1.8 2.26 1.8
"attr" 2.12 1.7 2.12 1.7
"etafun" 1.88 1.5 0.14 0.1
"%*%" 1.74 1.4 1.74 1.4
"^" 1.56 1.3 1.56 1.3
"summaryRprof" 1.48 1.2 0.02 0.0
"+" 1.30 1.0 1.30 1.0
"is.na" 1.28 1.0 1.28 1.0
">" 0.84 0.7 0.84 0.7
"FUN" 0.70 0.6 0.36 0.3
"strsplit" 0.62 0.5 0.62 0.5
I was mainly surprised by the .Call function that calls out to Fortran. Maybe I didn't understand it. It seems all the calculations are done once this function is used; I thought it was just a linking function that went off to the Fortran code. Furthermore, if Fortran is doing all the work, all the iteratively reweighted least squares/QR, why is the rest of the code taking so long?
My lay understanding is that biglm breaks the data into chunks and runs them sequentially.
So you could speed things up by optimising the chunk size to be just as big as your memory allows.
This also uses just one of your cores; biglm is not multi-threaded code. You'd need to do some magic to get that working.
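To find that chunk size empirically, one could time the same fit at a few candidate values; a minimal sketch (the sizes here are arbitrary examples):

library(biglm)

# Time the fit at a few chunk sizes and pick the fastest
for (cs in c(2e5, 8e5, 3e6)) {
  elapsed <- system.time(
    bigglm(depvar ~ var1 + var2 + var3 + var4, data = x,
           chunksize = cs, sandwich = FALSE, family = binomial())
  )["elapsed"]
  cat("chunksize =", cs, "->", elapsed, "seconds\n")
}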