chisq.test() on transition matrix for point-of-gaze - r

All,
I am trying to do a chisq.test() for eye data in a transition matrix where each row represents the tally of gaze from one area of 7 areas of interest (AoIs) to each of the others. In this analysis, it makes no sense for there to be a transition from one AoI to itself. Hence, those fields contain NAs.
I have tried a variety of formats, from a basic tabular input of 8 columns and rows (with the top row being the headers and the left column being the "from's") to a simple three-column data frame (from, to, values).
My data.frame looks like this:
from <- c("frLS", "frLF", "frRF", "frRS", "frIns", "frEng", "frOthr")
frLS <- c(NA, 77, 3, 0, 17, 0, 1)
frLF <- c(18, NA, 14, 1, 56, 2, 9)
frRF <- c(1, 52, NA, 15, 16, 1, 14)
frRS <- c(0, 7, 35, NA, 13, 15, 30)
frIns <- c(3, 54, 2, 1, NA, 4, 37)
frEng <- c(0, 9, 0, 3, 27, NA, 61)
frOthr <- c(2, 60, 2, 5, 27, 4, NA)
aoi.df <- data.frame(from, frLS, frLF, frRF, frRS, frIns, frEng, frOthr)
(Note that this is not actual data, but example data taken from the eye-tracking textbook by Holmqvist et al.)
Note that I have also tried this as a matrix:
aoi.matrix <- matrix(c(frLS, frLF, frRF, frRS, frIns, frEng, frOthr), ncol=7)
I believe the problem is the NAs rather than the form of the data, but if that is the case, I am not sure how to handle them.

The NAs are indeed the problem. The error message is quite clear:
> chisq.test(aoi.matrix)
Error in chisq.test(aoi.matrix) :
all entries of 'x' must be nonnegative and finite
You need to substitute the NAs with something else, say 0, if that makes sense for your data.
That said, I don't quite understand your problem. Are you sure that a chisq.test is what you want to do? It doesn't make sense to me. Recall that you're testing for independence between rows and columns. If the diagonal elements are always zero or NA, the rows and columns cannot be independent.

Okay, here is how to handle a chisq.test with NAs. One thing I did not know when I asked this question is that the NAs in my matrix are what are called "structural zeros." They are neither observed counts of zero nor unexplained gaps in data collection. Rather, they arise from the structure of the data set: in a transition matrix, we do not allow a transition from object "A" to itself, only to other objects.
All of that said, it turns out that there is (of course) an R package for that! I refer you to the aylmer documentation for a more detailed explanation, but I pretty much got what I was hoping chisq.test() would give me from:
aylmer.test(aoi.df, alternative = "two.sided", simulate.p.value = TRUE)
Note that I did have to remove the first column of "from" names, but other than that things worked just fine.
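For readers without the aylmer package, a different and commonly used way to respect structural zeros is a quasi-independence log-linear model: fit a Poisson GLM with row and column effects to the off-diagonal cells only and use the residual deviance as the test statistic. The sketch below is an illustration in base R, not the method used above. The names `cells`, `row_aoi`, `col_aoi` and `fit` are made up for the example, and the chi-square approximation may be rough with counts this sparse.

```r
# Example data from the question; diagonal cells are structural zeros (NA).
frLS   <- c(NA, 77, 3, 0, 17, 0, 1)
frLF   <- c(18, NA, 14, 1, 56, 2, 9)
frRF   <- c(1, 52, NA, 15, 16, 1, 14)
frRS   <- c(0, 7, 35, NA, 13, 15, 30)
frIns  <- c(3, 54, 2, 1, NA, 4, 37)
frEng  <- c(0, 9, 0, 3, 27, NA, 61)
frOthr <- c(2, 60, 2, 5, 27, 4, NA)
aoi.matrix <- matrix(c(frLS, frLF, frRF, frRS, frIns, frEng, frOthr), ncol = 7)

# Long format: one row per cell, with factors for the matrix row and column.
cells <- data.frame(
  count   = as.vector(aoi.matrix),
  row_aoi = factor(as.vector(row(aoi.matrix))),
  col_aoi = factor(as.vector(col(aoi.matrix)))
)

# Drop the structural zeros so they do not enter the fit at all.
cells <- cells[!is.na(cells$count), ]

# Quasi-independence: main effects only, fitted to the remaining 42 cells.
fit <- glm(count ~ row_aoi + col_aoi, family = poisson, data = cells)

g2      <- deviance(fit)      # likelihood-ratio statistic
df_res  <- df.residual(fit)   # 42 cells - 13 parameters = 29
p_value <- pchisq(g2, df_res, lower.tail = FALSE)
print(c(G2 = g2, df = df_res, p = p_value))
```

A small p-value here would suggest that transitions depend on the source AoI beyond what the row and column totals explain.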

Related

How to find the index of an array, where the element has value x, in R

I have a very large array (RFO_2003; dim = c(360, 180, 13, 12)) of numeric data. I made the array using a for-loop that does some calculations based on another array. I am trying to check some samples of data in this array to ensure I have generated it properly.
To do this, I want to apply a function that returns the index of the array where that element equals a specific value. For example, I want to start by looking at a few examples where the value == 100.
I tried
which(RFO_2003 == 100)
That returned (first line of results)
[1] 459766 460208 460212 1177802 1241374 1241498 1241499 1241711 1241736 1302164 1302165
match gave the same results. What I was expecting was something more like
[8, 20, 3, 6], [12, 150, 4, 7], [16, 170, 4, 8]
Is there a way to get the indices in that format?
My searches have turned up solutions in other languages and plenty of material on vectors. In other cases the index is never output but is immediately fed into another part of a custom function, so I can't see which part would produce the index in a form I understand. This question is one example, although it also returns dimnames rather than an index.
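One way to get indices in that format is the `arr.ind = TRUE` argument of `which()`, which returns one row per match and one column per array dimension. The sketch below uses a small made-up array `arr` in place of RFO_2003; `arrayInd()` does the same conversion for linear indices you already have.

```r
# Small stand-in for RFO_2003 (hypothetical data, same number of dimensions).
arr <- array(0, dim = c(4, 3, 2, 2))
arr[2, 3, 1, 2] <- 100
arr[4, 1, 2, 1] <- 100

# Linear indices, as in the question.
lin <- which(arr == 100)
print(lin)   # 16 34

# Array indices: one row per matching element, one column per dimension.
idx <- which(arr == 100, arr.ind = TRUE)
print(idx)   # rows are (4, 1, 2, 1) and (2, 3, 1, 2)

# Convert existing linear indices after the fact.
idx2 <- arrayInd(lin, dim(arr))
```

Each row of `idx` can be used to look an element up again, for example `arr[idx[1, , drop = FALSE]]`.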

histograms using loop using R

I am trying to find a more efficient way to plot these five histograms. For example, how would I use a for loop for the plots below in R?
hist(dat$train[dat$train[,1]==7,10])
hist(dat$train[dat$train[,1]==7,2])
hist(dat$train[dat$train[,1]==7,17])
hist(dat$train[dat$train[,1]==7,200])
hist(dat$train[dat$train[,1]==7,56])
Preferably, for this kind of question, you should post some sample data for dat. In this case, only one value changes between the calls, so the for loop can iterate over a vector of these values. Conventionally, the loop variable is called i. I did not change your hist statement except for inserting the i:
for(i in c(10, 2, 17, 200, 56))
  hist(dat$train[dat$train[,1]==7, i])
Personally, I prefer descriptive variable names, so I would replace the i by breaks like so:
for(breaks in c(10, 2, 17, 200, 56))
  hist(dat$train[dat$train[,1]==7, breaks])
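Since no sample data was posted, here is a self-contained sketch of the same loop with made-up data. It assumes `dat$train` is a numeric matrix whose first column holds a label; the column numbers in `cols` are placeholders chosen to fit the toy matrix. Computing the row filter once and giving each plot a title makes the output easier to read.

```r
# Hypothetical stand-in for dat: 50 rows, 20 columns, labels in column 1.
set.seed(1)
train <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)
train[, 1] <- sample(c(3, 7), 50, replace = TRUE)
dat <- list(train = train)

cols <- c(10, 2, 17, 5, 8)      # columns to plot (placeholders)
rows <- dat$train[, 1] == 7     # row filter, computed once

for (col in cols) {
  hist(dat$train[rows, col],
       main = paste("Column", col, "where label == 7"),
       xlab = "value")
}
```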

R weird data frame subset formula vs. no formula

Sorry for a kind of newb-ish question; I've been using R for years, but I hadn't noticed this behavior until a student pointed it out to me, and I can't explain it. First, build a little data frame. x-values greater than 100 are supposed to be illegal, but some have snuck in here. We also have a "group" independent variable:
x = c(20, 30, 50, 60, 150, 35, 55, 75, 45, 145)
g = c(1,1,1,1,1,2,2,2,2,2)
df = data.frame(cbind(x,g))
Now, box plots, both grouped and ungrouped, which show all the data, including the illegal values, as they should:
boxplot(x~g)
boxplot(x)
So, we want to remove the illegal values by selecting only those rows in the frame with x-values less than 100. The grouped version works exactly as expected:
boxplot(x~g, data=df[x < 100,])
But the ungrouped one doesn't! All the data, including the values over 100, are plotted. Why does the previous one work and this one doesn't?
boxplot(x, data=df[x < 100,])
I'm sure I'm missing something simple, but for the life of me I can't figure out what it is, and I couldn't find the answer via Google or searching here.
boxplot is an S3 generic, which means that depending on what the first argument is, totally different functions are actually being called. boxplot.formula has different arguments than boxplot.default. Specifically, boxplot.default has no data argument at all; it's probably being sucked into ... and is then ignored as an unknown graphical parameter.
Try boxplot(x[x < 100]) instead.
The reason is that boxplot is reading x from the global environment, not from the data frame.
Note that this does not work either:
df1 = df[x < 100, ]
boxplot(x, data=df1)
However, this works:
boxplot(df[df$x < 100, 'x'])
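To check which values each call actually uses, you can ask boxplot for its statistics without drawing anything (`plot = FALSE`) and look at the `n` component. This is a small sketch using the data from the question; `ok` is just an illustrative name for the filtered data frame.

```r
x <- c(20, 30, 50, 60, 150, 35, 55, 75, 45, 145)
g <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
df <- data.frame(x, g)

# Filter once, referring to the column inside the data frame.
ok <- df[df$x < 100, ]

# Default method: pass the filtered vector itself.
n_all     <- boxplot(df$x, plot = FALSE)$n   # 10 observations
n_default <- boxplot(ok$x, plot = FALSE)$n   # 8 observations

# Formula method: the data argument is honoured here.
n_formula <- boxplot(x ~ g, data = ok, plot = FALSE)$n   # 4 per group
```

`with(ok, boxplot(x))` is another way to make the ungrouped call look x up inside the filtered data frame.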

discretize function of multiple columns

I have the following csv:
https://github.com/antonio1695/Python/blob/master/nearBPO/facturasprueba.csv
With it I want to use the apriori function to find association rules. However, I get the error:
Error in asMethod(object) :
column(s) 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 not logical or a factor. Discretize the columns first.
I have already bumped into this error before, and what I did was:
df$columnX <- discretize(df$columnX)
However, this only works if I manually select each column and discretize them one by one. I would like to do the same thing for approximately 3k columns. The example I linked has only 11, which I assume is enough to illustrate the problem.
I found the answer, thanks for everyone's help though. To select and discretize multiple columns:
for (i in 2:12) df[, i] <- discretize(df[, i])
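With around 3k columns it may be more convenient to pick the columns by type instead of hard-coding 2:12. The sketch below is one way to do that. `discretize_columns` is a hypothetical helper, and base R's `cut()` stands in for the discretizer so the example runs without extra packages; with arules loaded, `discretize` could be passed as `fun` instead.

```r
# Hypothetical helper: apply a discretizing function to the chosen columns.
discretize_columns <- function(df, cols, fun, ...) {
  df[cols] <- lapply(df[cols], fun, ...)
  df
}

# Toy data (made up): one id column and two numeric columns.
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  v1 = c(1, 5, 9, 13),
  v2 = c(2.5, 0.1, 7.7, 3.3)
)

# Select every numeric column by name rather than by position.
num_cols <- names(toy)[sapply(toy, is.numeric)]

# cut() with breaks = 3 splits each column into three equal-width bins.
toy_disc <- discretize_columns(toy, num_cols, cut, breaks = 3)
str(toy_disc)
```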

subsetting columns from a data frame in R

I have a huge data frame, but I only need some columns to work on. My code:
outcome_data <- read.csv("dat.csv", colClasses = "character")
interested_data <- outcome_data[, c(1, 2, 7, 11, 17, 23)]
is giving me this error when I run it in my function:
Error in data.frame(list(Provider.Number = c("450690", "450358", "450820", : arguments imply differing number of rows: 370, 0
But it works fine in interactive mode.
Is there an alternative, or a way to fix this?
data.table::fread(data, select, ...)
select: Vector of column names or numbers to keep; the rest are dropped.
etc.
fread(data, select=c("A","D"))
fread(data, select=c(1,4))
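If data.table is not available, base R can do something similar. The sketch below writes a small made-up CSV to a temporary file, then shows two options: subsetting by column name after reading, and skipping unwanted columns at read time by setting their `colClasses` entry to "NULL". The file contents and the names in `wanted` are placeholders, not the columns from the question.

```r
# Made-up CSV standing in for dat.csv.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(A = 1:3, B = c("p", "q", "r"), C = c(2.5, 3.5, 4.5), D = c("x", "y", "z")),
  tmp, row.names = FALSE
)

wanted <- c("A", "D")   # columns to keep (placeholders)

# Option 1: read everything, then subset by name.
# Checking for missing names first gives a clearer error inside a function.
all_cols <- read.csv(tmp, colClasses = "character")
missing_cols <- setdiff(wanted, names(all_cols))
if (length(missing_cols) > 0) stop("Missing columns: ", toString(missing_cols))
subset_df <- all_cols[, wanted, drop = FALSE]

# Option 2: skip unwanted columns while reading.
header  <- names(read.csv(tmp, nrows = 1))
classes <- ifelse(header %in% wanted, "character", "NULL")
only_wanted <- read.csv(tmp, colClasses = classes)

unlink(tmp)
```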

Resources