Subsetting columns from a data frame in R

I have a huge data frame, but I only need some columns to work on. My code:
outcome_data<- read.csv("dat.csv", colClasses= "character")
interested_data<- outcome_data[, c(1, 2, 7, 11, 17, 23)]
is giving me this error when I run it in my function:
Error in data.frame(list(Provider.Number = c("450690", "450358", "450820", : arguments imply differing number of rows: 370, 0
But it works fine in interactive mode.
Is there an alternative, or a way to fix this?

data.table::fread(data, select, ...)
select: a vector of column names or numbers to keep; the rest are dropped.
etc.
fread(data, select=c("A","D"))
fread(data, select=c(1,4))
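A minimal, self-contained sketch of the fread(select = ...) approach; it writes a small hypothetical CSV to a temp file first so the example runs on its own:

```r
library(data.table)

# Hypothetical four-column CSV, written to a temp file so the
# example is self-contained
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(A = 1:3, B = 4:6, C = 7:9, D = 10:12), tmp)

# Keep only the columns you need -- by name or by position
by_name <- fread(tmp, select = c("A", "D"))
by_pos  <- fread(tmp, select = c(1, 4))
```

Because the unwanted columns are never materialized, this is also much lighter on memory than reading everything and subsetting afterwards.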

Related

How to find the index of an array, where the element has value x, in R

I have a very large array (RFO_2003; dim = c(360, 180, 13, 12)) of numeric data. I made the array using a for-loop that does some calculations based on another array. I am trying to check some samples of data in this array to ensure I have generated it properly.
To do this, I want to apply a function that returns the index of the array where that element equals a specific value. For example, I want to start by looking at a few examples where the value == 100.
I tried
which(RFO_2003 == 100)
That returned (first line of results)
[1] 459766 460208 460212 1177802 1241374 1241498 1241499 1241711 1241736 1302164 1302165
match gave the same results. What I was expecting was something more like
[8, 20, 3, 6], [12, 150, 4, 7], [16, 170, 4, 8]
Is there a way to get the indices in that format?
My searches have turned up solutions in other languages, lots of material on vectors, or cases where the index is never output but immediately fed into another part of a custom function, so I can't see which part would produce the index in a form I understand (such as this question, although that one also returns dimnames rather than an index).
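which() can return exactly that kind of per-dimension index via its arr.ind argument. A small sketch on a stand-in array (the real RFO_2003 is 360 x 180 x 13 x 12):

```r
# Stand-in for RFO_2003; same idea, smaller dimensions
arr <- array(0, dim = c(4, 3, 2))
arr[2, 3, 1] <- 100
arr[1, 2, 2] <- 100

# arr.ind = TRUE returns one row of dimension indices per match,
# instead of flat linear positions
idx <- which(arr == 100, arr.ind = TRUE)
# idx[1, ] is c(2, 3, 1); idx[2, ] is c(1, 2, 2)
```

The matches come back in column-major (linear-index) order, one matrix row per hit, which is the `[8, 20, 3, 6]`-style output the question asks for.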

Basic question: How can I convert a list vector to a table with R?

I have a pretty basic question I haven't been able to find the answer to because I'm not 100% clear on exactly how to ask it. The tutorials I found that seem appropriate are all a little too simplified and missing some key info.
I have a list of processed data that I'm trying to convert to a table for downstream analysis, and I'm stuck.
Currently I have a Values list with 4,088 numerical datapoints. The metadata of 'Data' has my subtype information. I generated my list this way:
vec <- vector("list", 3)
vec[[1]] <- Values[which(colData(Data)$Type=="Type1")]
vec[[2]] <- Values[which(colData(Data)$Type=="Type2")]
vec[[3]] <- Values[which(colData(Data)$Type=="Type3")]
Now, [[1]] has just the values I care about for Type 1, [[2]] for Type 2, etc.
So how do I convert this into a data frame or table for downstream work such as t-tests between groups? The readout below tells me I need to define my data somehow, and that having a different number of points per group is a problem, but I'm completely lost here.
df <- as.data.frame(vec)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1461, 658, 1969
Thanks for any help, or just a direction to a tutorial that will help me answer this for myself!
It really depends on what your downstream steps are, but you can generate a data.frame with max(lengths(vec)) rows, which pads the shorter vectors with NA (make sure you know how you want to handle those NAs afterwards). Examples of how to build such a data.frame can be found in @AndyBrown's answer.
Specifically, to perform t-tests between groups you do not need to generate a data.frame. You could use pairwise combinations of your list elements, which can have different lengths.
Below is an example using all pairwise combinations of list elements of vec to perform t-tests:
lapply(combn(vec, 2, simplify=FALSE), function(x) t.test(x[[1]], x[[2]]))
There are actually a few options that you can use.
# base R
data.frame(lapply(vec, "length<-", max(lengths(vec))))
# data.table
library(data.table)
setDT(lapply(vec, "length<-", max(lengths(vec))))
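A small sketch of the padding approach with made-up, unequal-length groups (the Type labels are assumed, standing in for the real subtypes):

```r
# Hypothetical groups of unequal length, mimicking vec above
vec <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))

# Pad each element with NA up to the longest length, then bind as columns
padded <- data.frame(lapply(vec, "length<-", max(lengths(vec))))
names(padded) <- c("Type1", "Type2", "Type3")  # assumed labels
```

Each column now holds one group, padded with NA, so column-wise functions afterwards need na.rm = TRUE or similar.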

Applying the discretize function to multiple columns

I have the following csv:
https://github.com/antonio1695/Python/blob/master/nearBPO/facturasprueba.csv
With it I want to use the apriori function to find association rules. However, I get the error:
Error in asMethod(object) :
column(s) 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 not logical or a factor. Discretize the columns first.
I have already bumped into this error before, and what I did was:
dataframe$columnX <- discretize(df$columnX)
However, this only works if I select each column manually and discretize them one by one. I would like to do the same thing for approximately 3,000 columns. The case I gave you has only 11; I'm guessing a solution for 11 will generalize.
I found the answer; thanks for everyone's help, though. To select and discretize multiple columns:
for (i in 2:12) { df[, i] <- discretize(df[, i]) }
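The same one-step pattern can also be written with lapply instead of a loop. A self-contained sketch using base R's cut() as a stand-in for arules::discretize, so it runs without the package (swap in discretize for real association-rule work):

```r
# Hypothetical data frame; column 1 is an ID, the rest are numeric
df <- data.frame(id = 1:6,
                 x  = c(1.2, 3.4, 2.2, 5.6, 4.4, 0.9),
                 y  = c(10, 20, 15, 30, 25, 5))

# Bin every column except the first in one step; each becomes a factor
df[-1] <- lapply(df[-1], cut, breaks = 3)
```

The lapply version avoids hard-coding the column range, so it scales to 3,000 columns as easily as 11.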

Create a data.table containing the Nth digit of each of a list of file names

I have a list of files containing output from a large model.
I load these as a datatable using:
files <- list.files(path.expand("/XYZ/"), pattern = ".*\\.rds", full.names = TRUE)
dt<- as.data.table(files)
This data.table "dt" has just one column, the file name,
e.g. XZY_00_34234.rds.
The 50th and 51st characters of each file name are digits.
I want to create a data.table containing that two-digit number for each file.
I used:
index <- as.data.table(as.integer(substr(dt,50,51)))
This gives me the correct value for the first file.
I think I should be able to use apply to run this against each row of the table.
I tried:
integers <- as.data.table(apply(dt,1,as.integer(substr(50,51))))
But get:
Error in substr(50, 51) : argument "stop" is missing, with no default
Any suggestions gratefully accepted!
Try:
integers <- as.data.table(apply(dt, 1, function(x) as.integer(substr(x, 50, 51))))
The apply family of functions accepts another function and executes it over vectors and arrays. That function is sometimes already defined, but you can also write it inline as an anonymous function, right there on the line, which saves time and keystrokes.
A more verbose setup would define the function first:
fiftieth_char <- function(x) {
  as.integer(substr(x, 50, 51))
}
Next, that function could then be passed to the apply function.
apply(dt, 1, fiftieth_char)
But look how we were able to do those two steps in one.
If you have just one column, you can extract it as a vector and use substr directly on it instead of looping with apply. For a data.table, extract a column with the [[ or $ operators (see ?Extract).
as.data.table(as.integer(substr(dt[[1]], 50, 51)))
Or
as.data.table(as.integer(substr(dt$files, 50, 51)))
I noticed that you are creating 'dt' as a data.table from 'files'. The output of list.files() is a vector, so instead of creating the data.table first, you could substr the vector and wrap the result in as.data.table.
as.data.table(as.integer(substr(files, 50, 51)))
As an example,
files <- c('ABC_25', 'DEF_39')
dt <- as.data.table(files)
as.integer(substr(dt[[1]], 5, 6))
#[1] 25 39
as.integer(substr(files, 5, 6))
#[1] 25 39
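If you want to keep the file names alongside the extracted number, data.table's := adds the column by reference. A sketch on the same toy names (the digits sit at positions 5-6 here rather than 50-51):

```r
library(data.table)

# Toy file names; the two digits are at positions 5-6 in these short names
dt <- data.table(files = c("ABC_25.rds", "DEF_39.rds"))

# Add the extracted two-digit number as a new column by reference
dt[, index := as.integer(substr(files, 5, 6))]
```

This modifies dt in place, so there is no copy of the table and no separate index object to keep in sync with the file names.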

chisq.test() on transition matrix for point-of-gaze

All,
I am trying to do a chisq.test() for eye data in a transition matrix where each row represents the tally of gazes from one of 7 areas of interest (AoIs) to each of the others. In this analysis, a transition from one AoI to itself makes no sense, so those fields contain NAs.
I have tried a variety of formats, from a basic tabular input of 8 columns and rows (with the top row being the headers and the left column being the "from"s) to a simple three-column data frame (from, to, values).
My data.frame looks like this:
from <- c("frLS", "frLF", "frRF", "frRS", "frIns", "frEng", "frOthr")
frLS <- c(NA, 77, 3, 0, 17, 0, 1)
frLF <- c(18, NA, 14, 1, 56, 2, 9)
frRF <- c(1, 52, NA, 15, 16, 1, 14)
frRS <- c(0, 7, 35, NA, 13, 15, 30)
frIns <- c(3, 54, 2, 1, NA, 4, 37)
frEng <- c(0, 9, 0, 3, 27, NA, 61)
frOthr <- c(2, 60, 2, 5, 27, 4, NA)
aoi.df <- data.frame(from, frLS, frLF, frRF, frRS, frIns, frEng, frOthr)
(Note that this is not actual data, but example data taken from Holmqvist's et al., textbook on Eye Tracking.)
Note I have also tried this as a matrix
aoi.matrix <- matrix(c(frLS, frLF, frRF, frRS, frIns, frEng, frOthr), ncol=7)
But I believe the problem is the NAs, not the form of the data; if that is the case, I am not sure how to handle it.
The NAs are indeed the problem. The error message is quite clear:
> chisq.test(aoi.matrix)
Error in chisq.test(aoi.matrix) :
all entries of 'x' must be nonnegative and finite
You could substitute the NAs with something else, say 0, if that makes sense for your design.
Now, I don't quite understand your problem. But are you sure that a chisq.test is what you want to do? It doesn't make any sense to me. Recall that you're testing for independence. However, if the diagonal elements always are zero or NA, then they cannot be independent.
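A minimal sketch of that substitution on a toy transition matrix (replacing NA with 0 only makes sense if your design allows it; see the caveat above):

```r
# Toy 3x3 transition matrix with structural NAs on the diagonal
m <- matrix(c(NA, 5, 3,
              4, NA, 6,
              2, 7, NA), nrow = 3, byrow = TRUE)

# Zero out the structural NAs, then run the test; the warning about
# small expected counts (an artifact of this toy data) is suppressed
m[is.na(m)] <- 0
res <- suppressWarnings(chisq.test(m))
```

With the NAs gone, chisq.test runs, but the independence caveat raised above still applies to a matrix whose diagonal is forced to zero.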
Okay, here is how to handle a chisq.test with NAs. One thing I did not know when I asked this question is that the NAs in my matrix are what are called "structural zeros." They are not counts of zero, nor are they some unexplained blip in data collection; rather, they arise from the structure of the data set. In the case of the transition matrix, we do not allow a transition from object "A" to itself, only to other objects.
All of that said, it turns out that there is (of course) an R package for that! I need to refer you to the aylmer documentation for a more detailed explanation, but I pretty much got what I was hoping chisq.test would give me from:
aylmer.test(aoi.df, alternative = "two.sided", simulate.p.value = TRUE)
Note that I did have to remove the first column of "from" names, but other than that things worked just fine.
