Rounding in Excel vs R changes results of mixed models - r

Does anyone know what the difference is in how Excel stores decimals, and why values saved from R are slightly different when loaded back into R? It seems that Excel can store up to 15 decimal digits; what about R?
I have a dataset with a lot of values that display 6 decimal places in R, and I'm using them for an analysis in lme4. But I noticed that on the same dataset (saved in two different files) the models sometimes converge and sometimes don't. I was able to narrow the problem down to the way Excel changes the values, but I'm not sure what to do about it.
I have a dataframe like this:
head(Experiment1)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
logResponseTime was obtained by taking log10 of First.Key.
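In R that corresponds to something like:
Experiment1$logResponseTime <- log10(Experiment1$First.Key)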
I then save this to a CSV file, load the data frame again, and get
head(Experiment2)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
exactly the same values, to 6 decimal places
but then this happens
Experiment1$logResponseTime - Experiment2$logResponseTime
1 2.220446e-15
2 -2.664535e-15
3 4.440892e-16
4 -4.440892e-16
5 -2.220446e-15
6 8.881784e-16
These differences are tiny, but they make the difference between convergence and non-convergence in my lmer models, where logResponseTime is the DV, which is why I'm concerned.
Is there a way to save R data frames to Excel (I use write.csv) in a format that won't introduce these changes? And more importantly, why do such tiny differences matter in lmer?

These tiny bits of rounding are hard to avoid; most of the time, it's not worth trying to fix them (in general errors of this magnitude are ubiquitous in any computer system that uses floating-point values).
It's hard to say exactly what the differences are between the analyses with the rounded and unrounded numbers, but you should be aware that the diagnosis of a convergence problem is based on particular numerical thresholds for the magnitude of the gradient at the maximum likelihood estimate and other related quantities. Suppose the threshold is 0.002 and that running your model with unrounded values results in a gradient of 0.0019, while running it with the rounded values results in a gradient of 0.0021. Then your model will "converge" in one case and "fail to converge" in the other case. I can appreciate the potential inconvenience of getting slightly different values just by saving your data to a CSV (or XLSX) file and restoring them from there, but you should also be aware that even running the same models on a different operating system could produce equally large differences. My suggestions:
check to see how big the important differences are between the rounded/unrounded results ("important differences" are differences in estimates you care about for your analysis, of magnitudes that are large enough to change your conclusions)
if these are all small, you can increase the tolerance of the convergence checks slightly so they don't bother you, e.g. use control = lmerControl(check.conv.grad = .makeCC("warning", tol = 6e-3, relTol = NULL)) (the default tolerance is 2e-3, see ?lmerControl)
if these are large, that should concern you - it means your model fit is very unstable. You should probably also try running allFit() to see how big the differences are when you use different optimizers.
you might be able to make your read/write flow a little more precise, e.g. by controlling how many digits are written to the file.
if possible, you could save your data to a .rds or .rda file rather than CSV, which preserves full precision (a short sketch of these options follows below).
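For concreteness, here is a minimal sketch combining these suggestions; the fixed and random effects (condition, subject) and the file name are hypothetical stand-ins for your own model and data:
library(lme4)
# relax the gradient convergence check slightly (default tol is 2e-3; see ?lmerControl)
ctrl <- lmerControl(check.conv.grad = .makeCC("warning", tol = 6e-3, relTol = NULL))
m1 <- lmer(logResponseTime ~ condition + (1 | subject), data = Experiment1, control = ctrl)
# refit with all available optimizers to see how stable the estimates are
summary(allFit(m1))
# round-trip through .rds instead of CSV to preserve full precision
saveRDS(Experiment1, "Experiment1.rds")
Experiment1b <- readRDS("Experiment1.rds")
identical(Experiment1$logResponseTime, Experiment1b$logResponseTime)  # TRUE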

Related

Weka Expectation-Maximization (EM) clustering result explanation

I currently have a very large dataset with 2 attributes which contain only strings. The first attribute has search queries (single words) and the second attribute has their corresponding categories.
So the data is set up like this (a search query can have multiple categories):
Search Query | Category
X | Y
X | Z
A | B
C | G
C | H
Now I'm trying to use clustering algorithms to get an idea of the different groups my data is comprised of. I read somewhere that when clustering with just strings it is recommended to first use the Expectation-Maximization (EM) clustering algorithm to get a sense of how many clusters I need, and then use that number with k-means.
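(Outside of Weka, the same two-step idea can be sketched in R; x here stands for a hypothetical numeric encoding of the queries and categories, since both EM and k-means operate on numbers rather than raw strings:)
library(mclust)
em_fit <- Mclust(x, G = 1:20)    # EM with model selection over 1-20 components
k <- em_fit$G                    # number of clusters suggested by EM
km <- kmeans(x, centers = k)     # then run k-means with that number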
Unfortunately, I'm still very new to machine learning and Weka, so I'm constantly reading up on everything to teach myself. I might be making some very simple mistakes here so bear with me, please :)
So I imported a sample (100,000 lines out of 2.7 million) of my dataset into Weka and used the EM clustering algorithm, and it gives me the following results:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation: testrunawk1_weka_sample.txt
Instances: 100000
Attributes: 2
att1
att2
Test mode: split 66% train, remainder test
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross-validation: 2
Number of iterations performed: 14
[135,000-line table with strings, the 2 clusters and their values]
Time taken to build model (percentage split): 28.42 seconds
Clustered Instances
0 34000 (100%)
Log-likelihood: -20.2942
So should I infer from this that I should be using 2 or 34000 clusters with k-means?
Unfortunately, both seem unusable for me. What I was hoping for is, for example, 20 clusters which I could then look at individually to figure out what kind of groups can be found in my data. 2 clusters seems too low given the wide range of categories in my data, and 34,000 clusters would be way too many to inspect manually.
I am unsure whether I'm doing something wrong in the Weka EM algorithm settings (currently at their defaults) or whether my data is just a mess; if so, how would I go about making this work?
I am still very much learning how this all works, so any advice is much appreciated! If there is a need for more examples of my settings or anything else just tell me and I'll get it for you. I could also send you this dataset if that is easier, but it's too large to paste in here. :)

Is there a way to reduce computation time while using the "ltm" package for a dataset with NULL values?

I am using the "ltm" package to find the difficulty and discrimination values of questions used in a Computer Adaptive Test conducted among 2400 students. My question bank contains 410 questions, of which 50 are presented to a particular candidate; if the response is correct it is scored as "1" and if wrong it is scored as "0". For the other 360 questions the cells contain NULL values. The dataset has 2400 rows (candidate id) and 410 columns (questions).
The ltm function
values <- ltm(data ~ z1, IRT.param = TRUE)
takes almost 25 minutes to process this dataset (system specs: 64-bit quad-core i3, 3.30 GHz, 8 GB RAM).
But it takes only a minute to process a dataset of the same dimensions that does not contain NULL/missing values.
Is there a way to reduce the computation time?
Or can anyone suggest another package that does not have this issue?
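For reference, a minimal sketch of a response matrix with the structure described above, with unseen items coded as NA (the data are simulated and the object names are hypothetical; with this much missingness the fit will still be slow):
library(ltm)
set.seed(1)
responses <- matrix(NA_integer_, nrow = 2400, ncol = 410)   # 2400 candidates x 410 items
for (i in 1:2400) {
  presented <- sample(410, 50)                    # the 50 items shown to this candidate
  responses[i, presented] <- rbinom(50, 1, 0.6)   # simulated right/wrong scores
}
responses <- as.data.frame(responses)
fit <- ltm(responses ~ z1, IRT.param = TRUE)      # 2PL fit, as in the question
coef(fit)                                         # difficulty and discrimination estimates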

How to train neural network with big data set in R using neuralnet?

I have a data.table in R with 150,000 rows in it.
I use 9 features, and training takes more than 30 minutes; I didn't wait longer than that.
I also tried it on 500 rows (it takes 0.2 sec) and on 5000 rows (71.2 sec).
So how should I train my model on all the data, or can you give me any other advice?
Here is the console log:
> train1 <- train[1:5000, ] + 1
> f1 = as.formula("target~ v1+ v2+ v3+ v4+ v5+ v6+ v7+ v8+ v9")
> a=Sys.time()
> nn <- neuralnet(f1, data = train1, hidden = c(4, 2), err.fct = "ce", linear.output = TRUE)
Warning message:
'err.fct' was automatically set to sum of squared error (sse), because the response is not binary
> b=Sys.time()
> difftime(b,a,units = "secs")
Time difference of 71.2000401 secs
This is to be expected in my experience; there are a lot of calculations involved in neural nets. I personally have one written in Python (2 hidden layers, including a momentum term), with about 38,000 patterns of 56 inputs and 3 outputs. Splitting them into 8,000-pattern chunks took about 10 minutes to run and just under a week to learn to my satisfaction.
The whole set of 38,000 needed more hidden nodes to store all the patterns, and that took over 6 hours to go through one cycle and over 3 months to learn. Neural networks are a very powerful tool, but they come at a price in my experience. Others may have better implementations, but every comparison of classification algorithms I have seen mentions the time to learn as being significant.

Time-point data of two raters

My data consist of two raters marking the points in time at which they judge one specific phenomenon to occur. I have two questions:
1) What do I call these data? "Time-series data" seems too general and usually refers to metric data changing continuously over time (while I just have points along the timeline). Under "time-point data" I don't find problems of the kind described in question (2).
2) What indices for interrater reliability can I use - preferably in R? (If an index requires defining how much offset is tolerated, that could be 0.120 seconds.)
example data (in seconds)
rater1:
181.23
181.566
181.986
182.784
183.204
191.352
193.956
195.426
197.568
197.82
198.576
202.02
205.8
206.136
208.53
209.034
216.216
220.08
220.584
230.706
238.266
238.518
239.442
241.5
241.836
244.398
rater2:
181.902
182.784
183.204
193.956
195.384
197.694
197.82
198.576
199.5
202.146
205.8
206.136
208.53
216.258
219.576
220.542
222.096
222.558
226.002
228.312
229.11
230.244
230.496
230.832
231.504
232.554
238.266
238.518
238.602
238.938
241.5
241.836
244.272
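One way to lay these out in R and apply the 0.120 s tolerance, purely to illustrate the matching idea rather than to recommend a particular reliability index (only the first few values are shown; the full vectors would be filled in the same way):
rater1 <- c(181.230, 181.566, 181.986, 182.784, 183.204, 191.352)   # first few values
rater2 <- c(181.902, 182.784, 183.204, 193.956, 195.384, 197.694)   # first few values
tol <- 0.120   # tolerated offset in seconds
# for each rater-1 event, does rater 2 mark an event within the tolerance?
matched <- sapply(rater1, function(t) any(abs(rater2 - t) <= tol))
sum(matched)    # number of rater-1 events with a rater-2 counterpart
mean(matched)   # proportion of rater-1 events matched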

Interpreting the results of R Mclust package

I'm using the R package mclust to estimate the number of clusters in my data and get this result:
Clustering table:
2 7 8 9
205693 4465 2418 91
Warning messages:
1: In map(z) : no assignment to 1,3,4,5,6
2: In map(z) : no assignment to 1,3,4,5,6
It selects 9 clusters as the best, but there are no assignments to 5 of them.
So does this mean I want to use 9 or 5 clusters?
If the answer can be found somewhere online, a link would be appreciated. Thanks in advance.
Most likely, the method just did not work at all on your data...
You may try other seeds, because when you "lose" clusters (i.e., they become empty) this usually means your starting points were not chosen well. And your cluster 9 is pretty much gone, too.
However, if your data is actually generated by a mixture of Gaussians, it's hard to find such a bad starting point... so most likely, all of your results are bad, because the data does not satisfy your assumptions.
Judging from your cluster sizes, I'd say you have 1 cluster and a lot of noise...
Have you visualized and validated the results?
Don't blindly follow some number. Validate.
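A minimal sketch of that kind of check with mclust (the data here are simulated; substitute your own matrix or data frame):
library(mclust)
set.seed(42)
x <- rbind(matrix(rnorm(2000), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))   # simulated stand-in data
fit <- Mclust(x, G = 1:9)            # consider 1 to 9 components
summary(fit)                         # chosen model, mixing proportions, cluster sizes
plot(fit, what = "BIC")              # compare BIC across numbers of components
plot(fit, what = "classification")   # visualize the resulting assignment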
