Time-point data of two raters (R)

My data consist of two raters who each judge at which points in time one specific phenomenon occurs. I have two questions:
1) What do I call these data? "Time-series data" seems too general and usually refers to metric data changing continuously over time, whereas I only have points along the timeline. Searching for "time-point data" doesn't turn up problems of the kind described in question (2).
2) What indices of interrater reliability can I use, preferably in R? (If an index requires defining how much offset is tolerated, that could be 0.120 seconds; see the sketch after the example data below.)
example data (in seconds)
rater1:
181.23
181.566
181.986
182.784
183.204
191.352
193.956
195.426
197.568
197.82
198.576
202.02
205.8
206.136
208.53
209.034
216.216
220.08
220.584
230.706
238.266
238.518
239.442
241.5
241.836
244.398
rater2:
181.902
182.784
183.204
193.956
195.384
197.694
197.82
198.576
199.5
202.146
205.8
206.136
208.53
216.258
219.576
220.542
222.096
222.558
226.002
228.312
229.11
230.244
230.496
230.832
231.504
232.554
238.266
238.518
238.602
238.938
241.5
241.836
244.272
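For concreteness, a minimal R sketch of tolerance-based matching (an illustration only, not one of the established reliability indices asked about in question 2): it greedily pairs each rater-1 event with the closest unmatched rater-2 event and counts the pairs that agree within 0.120 s. The vectors are truncated here for brevity.
rater1 <- c(181.230, 181.566, 181.986, 182.784, 183.204, 191.352)  # first 6 of the 26 values above
rater2 <- c(181.902, 182.784, 183.204, 193.956, 195.384, 197.694)  # first 6 of the 33 values above
tol <- 0.120
used <- rep(FALSE, length(rater2))
matched <- 0
for (t in rater1) {
  d <- abs(rater2 - t)
  d[used] <- Inf                      # each rater-2 event may be matched only once
  j <- which.min(d)
  if (d[j] <= tol) {
    matched <- matched + 1
    used[j] <- TRUE
  }
}
# crude agreement rate: matched pairs relative to the average number of marked events
2 * matched / (length(rater1) + length(rater2))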

Related

Rounding in excel vs R changes results of mixed models

Does anyone know how Excel stores decimals differently, and why values saved from R are slightly different when loaded back into R? Excel seems to store up to 15 significant digits; what about R?
I have a dataset with a lot of values that R displays with 6 decimal places, and I'm using them for an analysis in lme4. But I noticed that on the same dataset (saved in two different files) the models sometimes converge and sometimes don't. I was able to narrow the problem down to the way Excel changes the values, but I'm not sure what to do about it.
I have a dataframe like this:
head(Experiment1)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
logResponseTime was obtained by taking log10 of First.Key.
I then save this to a CSV file, load the data frame again, and get
head(Experiment2)
Response First.Key logResponseTime
1 TREE 2345 3.370143
2 APPLE 927 2.967080
3 CHILD 343 2.535294
4 CAT 403 2.605305
5 ANGEL 692 2.840106
6 WINGS 459 2.661813
exactly the same values, to 6 decimal places
but then this happens
Experiment1$logResponseTime - Experiment2$logResponseTime
1 2.220446e-15
2 -2.664535e-15
3 4.440892e-16
4 -4.440892e-16
5 -2.220446e-15
6 8.881784e-16
These differences are tiny, but they make the difference between convergence and non-convergence in my lmer models, where logResponseTime is the DV, which is why I'm concerned.
Is there a way to save R data frames into Excel in a format that won't make these changes (I use write.csv)? And more importantly, why do such tiny differences make a difference in lmer?
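For reference, differences of this size are at the level of double-precision rounding error. A rough illustration (write.csv passes each double through a decimal text representation of about 15 significant digits, so the last binary digits can change on the way back in):
x <- log10(2345)                  # one of the values from the data above
y <- as.numeric(as.character(x))  # roughly what a CSV round trip does to a double
x - y                             # either 0 or a few multiples of machine epsilon
.Machine$double.eps               # about 2.2e-16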
These tiny bits of rounding are hard to avoid; most of the time, it's not worth trying to fix them (in general errors of this magnitude are ubiquitous in any computer system that uses floating-point values).
It's hard to say exactly what the differences are between the analyses with the rounded and unrounded numbers, but you should be aware that the diagnosis of a convergence problem is based on particular numerical thresholds for the magnitude of the gradient at the maximum likelihood estimate and other related quantities. Suppose the threshold is 0.002 and that running your model with unrounded values results in a gradient of 0.0019, while running it with the rounded values results in a gradient of 0.0021. Then your model will "converge" in one case and "fail to converge" in the other case.
I can appreciate the potential inconvenience of getting slightly different values just by saving your data to a CSV (or XLSX) file and restoring them from there, but you should also be aware that even running the same models on a different operating system could produce equally large differences. My suggestions:
check to see how big the important differences are between the rounded/unrounded results ("important differences" are differences in estimates you care about for your analysis, of magnitudes that are large enough to change your conclusions)
if these are all small, you can increase the tolerance of the convergence checks slightly so they don't bother you, e.g. use control = lmerControl(check.conv.grad = .makeCC("warning", tol = 6e-3, relTol = NULL)) (the default tolerance is 2e-3, see ?lmerControl)
if these are large, that should concern you - it means your model fit is very unstable. You should probably also try running allFit() to see how big the differences are when you use different optimizers.
you might be able to use the methods described here to make your read/write flow a little more precise.
if possible, you could save your data to a .rds or .rda file rather than CSV, which will keep the full precision.
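A minimal sketch of the .rds suggestion, using the data frame from the question (file names are made up):
saveRDS(Experiment1, "experiment1.rds")              # binary format, keeps full double precision
exp1_back <- readRDS("experiment1.rds")
identical(Experiment1$logResponseTime, exp1_back$logResponseTime)   # TRUE
# compare with the CSV round trip, which passes through decimal text:
write.csv(Experiment1, "experiment1.csv", row.names = FALSE)
exp1_csv <- read.csv("experiment1.csv")
all.equal(Experiment1$logResponseTime, exp1_csv$logResponseTime)    # TRUE (tolerance-based)
identical(Experiment1$logResponseTime, exp1_csv$logResponseTime)    # possibly FALSE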

adehabitat compana() doesn't work or returns lambda=NaN

I'm trying to do the compositional analysis of habitat use with the compana() function in the adehabitatHS package (I use adehabitat because I can't install adehabitatHS).
compana() needs two matrices: one of habitat use and one of available habitat.
When I try to run the function it doesn't work (it never stops), so I have to abort the RStudio session.
I read that one problem could be 0-values in some habitat types for some animals in the 'available' matrix, while other animals have positive values for the same habitat. As other people have done, I replaced the 0-values with small values (0.001), ran compana(), and it ran, BUT the lambda values it returned were NaN.
The problem is similar to the one found here
adehabitatHS compana test returns lambda = NaN?
They said they resolved it by using counts (integers) rather than proportions for the 'used' habitat matrix.
I also tried this approach, but nothing changed (it freezes when there are 0-values in the available matrix, or returns NaN for lambda if I replace the 0-values with small values).
I checked all the matrices and they are fine, so this is driving me crazy.
I have 6 animals and 21 habitat types.
Can you resolve this BIG problem?
PARTIALLY SOLVED: I asked some researchers, and they told me that the number of habitats shouldn't be higher than the number of animals.
So I merged some habitats to get six habitats for my six animals, and now the function works when I replace the 0-values in the 'available' matrix with small values (e.g. 0.001).
Unfortunately this is not what I wanted, because I needed values (rankings, log-ratios, etc.) for each of the original 21 habitat types.
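A minimal sketch of the workaround described above, with hypothetical matrices 'used' and 'avail' (rows = animals, columns = habitat types in the same order); the argument values are only indicative:
library(adehabitatHS)                        # compana() also exists in the older adehabitat package
avail[avail == 0] <- 0.001                   # replace zero availabilities with a small value
res <- compana(used, avail, test = "randomisation", nrep = 500)
res                                          # inspect lambda, the p-value and the habitat ranking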

Calculate Period Changes in Unevenly Sampled Time Series in R (or Matlab)

The heading says it all: I'm trying desperately to figure out a way to calculate the period of a time series that is unevenly sampled. I tried creating an evenly sampled time series with NAs for the times where there is no data, but there are just too many NAs for any imputation method to do a reasonable job. The main problem is that the sample times are much further apart than the average period (VERY roughly 0.5), which only becomes obvious with period-folding applied. Because I'm looking for a small change in period, I can't round the sampling times.
[Figure: time-folded period]
Here is a sample of the data:
HJD(time) Mag err
2088.91535 18.868 0.078
2090.87535 19.540 0.165
2103.92958 18.704 0.040
2104.94812 19.291 0.098
2106.84596 18.910 0.066
...
4864.56170 18.835 0.061
The data set has about 650 rows.
I've spent almost a week googling my problem and nothing has helped yet so any ideas would be greatly appreciated! I have some experience with Matlab too, so if it's possible to do it with Matlab rather than R, I'd be happy with that too.
I do not think that there is a way to do what you want. The Nyquist-Shannon theorem states that your average sampling frequency needs to be at least twice as high as the frequency of the events you want to capture (roughly speaking).
So if you want to extract information from events with a period of 0.5 [units] you will need a sample every 0.25 [units].
Note that this is a mathematical limitation, not one of R.
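A quick check of that limitation against the sample rows above (an illustration only, using just the listed times):
hjd <- c(2088.91535, 2090.87535, 2103.92958, 2104.94812, 2106.84596)  # times from the excerpt
median(diff(hjd))   # median gap of roughly 2 [units] between consecutive samples
# a period of roughly 0.5 [units] would require a sample at least every 0.25 [units],
# far denser than the data provide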

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain variable names and the third contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs whose score is 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with a data set of your size. Plus, the usual implementations need more than one copy of the matrix, so you may need about 1 TB of RAM: 2 * 8 * 250000 * 250000 bytes is a lot.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.
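A back-of-the-envelope check of that memory estimate (8 bytes per double, two in-memory copies of a dense 250,000 x 250,000 matrix):
2 * 8 * 250000^2 / 1e12    # = 1.0, i.e. about one terabyte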

Run nested logit regression in R

I want to run a nested logistic regression in R, but the examples I found online didn't help much. I read over an example from this website (Step by step procedure on how to run nested logistic regression in R) which is similar to my problem, but it seems it was never resolved (the questioner reported errors and I didn't see any more answers).
So I have 9 predictors (continuous scores), and 1 categorical dependent variable (DV). The DV is called "effect", and it can be divided into 2 general categories: "negative (0)" and "positive (1)". I know how to run a simple binary logit regression (using the general grouping way, i.e., negative (0) and positive (1)), but this is not enough. "positive" can be further grouped into two types: "physical (1)" and "mental (2)". So I want to run a nested model which includes these 3 categories (negative (0), physical (1), and mental (2)), and reflects the nature that "physical" and "mental" are nested in "positive". Maybe R can compare these two models (general vs. detailed) together? So I created two new columns, one is called "effect general", in which the individual scores are "negative (0)" and "positive (1)"; the other is called "effect detailed", which contains 3 values - negative (0), physical (1), and mental (2). I ran a simple binary logit regression only using "effect general", but I don't know how to run a nested logit model for "effect detailed".
From the example I searched and other materials, the R package "mlogit" seems right, but I'm stuck with how to make it work for my data. I don't quite understand the examples in R-help, and this part in the example from this website I mentioned earlier (...shape='long', alt.var='town.list', nests=list(town.list)...) makes me very confused: I can see that my data shape should be 'wide', but I have no idea what "alt.var" and "nests" are...
I also looked at page 19 of the mlogit manual for examples of nested logit model calls. But I still cannot decide what I need in terms of options. (http://cran.r-project.org/web/packages/mlogit/mlogit.pdf)
Could someone provide me with detailed steps and notes on how to do it? I'm sure this example (if well discussed and resolved) is also going to help me and others a lot!
Thanks for your help!!!
I can help you with understanding the mlogit structure. When using the mlogit.data() command, specify choice = yourchoicevariable (and id.var = respondentid if you have a panel dataset, i.e. multiple responses from the same individual), along with the shape = 'wide' argument. The new data.frame created will be in long format, with a line for each choice alternative: negative, physical, mental. So you will have 3 rows where you had only one in the wide data format. Whatever your multinomial choice variable is, it will now be a column of logical values, with TRUE for the row that the respondent chose. The row names will now be in the format observation#.level(choice variable). So if in the first row of your dataset the person responded "negative", you would see:
row.name | choice
1.negative | TRUE
1.physical | FALSE
1.mental | FALSE
Also note that the actual factor level for each choice is stored in an index called alt of the mlogit data frame, which you can see with index(your.data.frame), and the observation number (i.e. the row number from your wide-format data.frame) is stored in chid. This is in essence what the row.name is telling you, i.e. chid.alt. Also note that you DO NOT have to specify alt.var if your data is in wide format, only for long format. The mlogit.data function does that for you as I have just described: essentially, it takes unique(choice) when you specify your choice variable and creates the alt.var for you, so it is redundant if your data is in wide format.
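A minimal sketch of that reshaping step, with hypothetical names ('dat' is the wide data frame with one row per respondent, and 'effect_detailed' is the three-level factor):
library(mlogit)
dat_long <- mlogit.data(dat, choice = "effect_detailed", shape = "wide")
head(index(dat_long))    # the chid (observation) and alt (choice level) indices described above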
You then specify the nests by adding to the mlogit() command a named list of the nests like this, assuming your factor levels are just '0','1','2':
mlogit(..., nests = list(negative = c('0'), positive = c('1', '2')))
or if the factor levels were 'negative', 'physical', 'mental' it would be like this:
mlogit(..., nests = list(negative = c('negative'), positive = c('physical', 'mental')))
Also note that a nest of one alternative still MUST be specified with a c() argument, per the package documentation. The resulting model will then have a single inclusive-value (iv) estimate shared across nests if you specify the un.nest.el = TRUE argument, or nest-specific estimates if un.nest.el = FALSE.
You may find Kenneth Train's Examples useful
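Putting the pieces together, a sketch of the nested model for the setup described in the question; the column names (score1 ... score9, effect_detailed) are hypothetical, the nine individual-specific predictors go after the | in the formula, and the call may need adjusting for your mlogit version:
fit <- mlogit(effect_detailed ~ 0 | score1 + score2 + score3 + score4 + score5 +
                score6 + score7 + score8 + score9,
              data = dat_long,
              nests = list(negative = c('0'), positive = c('1', '2')),
              un.nest.el = TRUE)
summary(fit)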
