matrix subseting by column's name using `subset` function - r

Consider the following simulation snippet:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),i) )
dsts <- do.call(rbind,dsts)
why does this code throws an error (dsts is matrix):
subset(dsts,i==1)
#Error in subset.matrix(dsts, i == 1) : object 'i' not found
Even this one:
colnames(dsts)[3] <- 'iii'
subset(dsts,iii==1)
But not this one (matrix coerced as dataframe):
subset(as.data.frame(dsts),i==1)
This one works either where x is already defined:
subset(dsts,x> 500)
The error occurs in subset.matrix() on this line:
else if (!is.logical(subset))
Is this a bug that should be reported to R Core?

The behavior you are describing is by design and is documented on the ?subset help page.
From the help page:
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
In R, data.frames and matrices are very different types of objects. If this is causing a problem, you are probably using the wrong data structure for your data. Matrices are really only necessary if you meed matrix arithmetic. If you are thinking of your columns as different attributes for a row observations, then you should be storing your data in a data.frame in the first place. You could store all your values in a simple vector where every three values represent one observation, but that would also be a poor choice of data structure for your data. I'm not sure if you were trying to be more efficient by choosing a matrix but it seems like just the wrong choice.
A data.frame is stored as a named list while a matrix is stored as a dimensioned vector. A list can be used as an environment which makes it easy to evaluate variable names in that context. The biggest difference between the two is that data.frames can hold columns of different classes (numerics, characters, dates) while matrices can only hold values of exactly one data.type. You cannot always easily convert between the two without a loss of information.
Thinks like $ only work with data.frames as well.
dd <- data.frame(x=1:10)
dd$x
mm <- matrix(1:10, ncol=1, dimnames=list(NULL, "x"))
mm$x # Error
If you want to subset a matrix, you are better off using standard [ subsetting rather than the sub setting function.
dsts[ dsts[,"i"]==1, ]
This behavior has been a part of R for a very long time. Any changes to this behavior is likely to introduce breaking changes to existing code that relies on variables being evaluated in a certain context. I think the problem lies with whomever told you to use a matrix in the first place. Rather than cbind(), you should have used data.frame()

Related

R scale function with character variable

I'm relatively new to R - I'm having challenges to figure out how to scale a dataset that contains a character variable.
However I when I try to use the scale function to create a dataframe, I'm getting an error:
df<-scale(USArrests)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Is there a way to create a dataframe with a character variable to later use it in a cluster analysis?
km.res<-kmeans(df,4,nstart=10)
?scale() says scale is desgined to center columns of numeric matrices, see the help entry for further details.
However, df <- USArrests is sufficient to store the required in-built dataset as object df (see environment), if you have to name it df.
Compare the following:
df <- USArrests
# compare
head(df, n=5)
# to
df1 <- scale(df)
head(df1, n=5)
As you can see, all numeric columns are now scaled while the row ids, Alabama, ..., Wyoming, of course, do not change. Btw, to check the class of all variables you can use lapply(df, class).
I think you shouldn't have problems to then call km.res <- kmeans(df1,4,nstart=10). To inspect the object type km.res.
To be honest, I think previous to running kmeans() you should again have a look on the help page (e.g. help(kmeans)) to get in touch with the arguments clusters, iter, ...
Further, I think it would be a good idea to investigate why or why not to center the data in previous step. In any case, it is possible to run kmeans() with centered (df1) and uncentered (df) data. Why one of those alternatives is more appropriate is of major importance.
EDIT: It is recommended to set a seed (e.g. set.seed(09102021)) before running the algorithm. By doing so you ensure the reproducibility of results.

R: Check for finite values in DataFrame

I need to check whether data frame is "empty" or not ("empty" in a sense that dataframe contain zero finite value. If there is mix of finite and non-finite value, it should NOT be considered "empty")
Referring to How to check a data.frame for any non-finite, I came up with one line code to almost achieve this objective
nrow(tmp[rowSums(sapply(tmp, function(x) is.finite(x))) > 0,]) == 0
where tmp is some data frame.
This code works fine for most cases, but it fails if data frame contains a single row.
For example, the above code would work fine for,
tmp <- data.frame(a=c(NA,NA), b=c(NA,NA)) OR tmp <- data.frame(a=c(3,NA), b=c(4,NA))
But not for,
tmp <- data.frame(a=NA, b=NA)
because I think rowSums expects at least two rows
I looked at some other posts such as https://stats.stackexchange.com/questions/6142/how-to-calculate-the-rowmeans-with-some-single-rows-in-data, but I still couldn't come up a solution for my problem.
My question is, are there any clean ways (i.e. avoid using loops and ideally one liner) to check for being "empty" for any dataframes?
Thanks
If you are checking all columns, then you can just do
all(sapply(tmp, is.finite))
Here we are using all rather than the rowSums trick so we don't have to worry about preserving matrices.

Selecting unique values from single column of a data frame

I have a data frame consisting of five character variables which represent specific bacteria. I then have thousands of observations of each variable that all begin with the letter K. eg
x <- c(K0001,K0001,K0003,K0006)
y <- c(K0001,K0001,K0002,K0003)
z <- c(K0001,K0002,K0007,K0008)
r <- c(K0001,K0001,K0001,K0001)
o <- c(K0003,K0009,K0009,K0009)
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character
I have tried to mutate the vectors using as.factor and as.numeric but neither of these approaches work as the first gives an equivalent error as above, and as.numeric returns NAs.
Thanks in advance
The reference that you cited recommended using setdiff. The only thing that you need to do to apply that solution is to convert the four columns into one, so that it can be treated as a set. You can do that with unlist
setdiff(data$x, unlist(data[,2:5]))
"K0006"

In-place list modification without for loop in R

I'm wondering whether there is a way to do in-place modification of objects in a list without using a for loop. This would be useful, for example, if the individual objects in the list are large and complex, so that we want to avoid making a temporary copy of the entire object. As an example, consider the following code, which creates a list of three data frames, then calculates the vector of maximums across all three data frames for one column of the data, and then assigns that vector to each original data frame. (Code like this is needed when aligning plots in ggplot2.)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
for( i in 1:length(data_list))
{
data_list[[i]]$x <- max_x
}
Is there any way to write the final part without a for loop?
Answers to some of the questions I'm getting:
What makes me think a copy would be made? I don't know for sure whether a copy would or would not be made. The actual scenario I'm working with deals with entire ggplot graphs (see e.g. here). Since they are rather large and complex, it's critical that no copy be made.
What's the problem with a for loop? I just would rather iterate directly over a list than have to introduce a counter. I don't like counters.
Why not use data.table? Because I'm actually manipulating ggplot graphs, not data frames. The code provided here is just a simplified example.
Base R data structures are copy-on-modify with sharing. Take your example of a data.frame with three numeric columns. Each data.frame is a length 3 "list" vector, each containing a reference to the numeric vectors of the underlying columns. If we modify/replace the first column, R creates a new length 3 data.frame "list" containing references to the new(ly modified) column and the other two unmodified columns.
Let's take a look using the address function*
set.seed(1)
data_list <- lapply(1:3, function(x) data.frame(x=rnorm(10), y=rnorm(10), z=rnorm(10)))
before <- rapply(data_list,address)
Now you want to replace the first column with
max_x <- do.call(pmax, lapply(data_list, function(d){d$x}))
How you do this doesn't much matter, but here's one way without an explicit loop-with-counter
data_list <- lapply(data_list,`[<-`,"x",value=max_x)
after <- rapply(data_list,address)
Now compare the addresses before and after. Note that the addresses for the y and z columns have not changed. Furthermore, all "after" x columns have the same address -- the address of max_x!
address(max_x)
[1] "05660600"
cbind(before,after)
before after
x "0565F530" "05660600"
y "0565F400" "0565F400"
z "05660AC0" "05660AC0"
x "05660A28" "05660600"
y "05660990" "05660990"
z "05660860" "05660860"
x "056607C8" "05660600"
y "05660730" "05660730"
z "05660698" "05660698"
This means you don't have to worry as much as you might think about making a change to a large data structure. In general, only the modified piece and the skeleton of the data structure will have to be replaced. In this example, the max_x vector had to be created anyway, so the only overhead is creating a new 3 cell data.frame "list" and populating it with 3 references**. This, however, could start to become inefficient if you are iteratively "banging on" changes or working with subvectors rather than entire columns. These are use cases for data.table that are not applicable to this example.
* The address function used here is exported from the data.table package.
** And, of course, in this example, the 3 cell outer list "list" containing the 3 data.frames themselves.

Evaluating dataframe and storing the result

My dataframe(m*n) has few hundreds of columns, i need to compare each column with all other columns (contingency table) and perform chisq test and save the results for each column in different variable.
Its working for one column at a time like,
s <- function(x) {
a <- table(x,data[,1])
b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multitesting problem), work with lists :
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As I show you in the last line, you can easily get all tests for a specific column, so there is no need to make a list for each column. That just takes longer and takes more space, but gives the same information. You can write a small convenience function to extract the data you need :
extract <- function(col,l){
l[grep(col,names(l))]
}
extract("^y$",my.results)
Which makes you can even loop over different column names of your dataframe and get a list of lists returned :
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
Fundamentally, you have a few problems here:
You're relying heavily on global arguments rather than local ones.
This makes the double usage of "data" confusing.
Similarly, you rely on a hard-coded value (column 1) instead of
passing it as an argument to the function.
You're not extracting the one value you need from the chisq.test().
This means your result gets returned as a list.
You didn't provide some example data. So here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.

Resources