I am getting an error while executing the exact same code given in my textbook on my machine. It is the simple pairs function code.
My code is:
pairs(college[, 1:10])
The error I am getting:
Error in pairs.default(college[, 1:10]) : non-numeric argument to 'pairs'
Your college[, 1:10] data frame contains columns that are not numeric.
Run:
str(college[,1:10])
And inspect the column types in your dataframe.
Since pairs() draws a matrix of scatterplots, one for each pair of columns, it expects a data frame with numeric columns only.
This makes sense if you consider that you wouldn't draw a scatterplot of Student Gender (a categorical, i.e. non-numeric, variable) against Student Age, for example.
In your case, this error:
Error in pairs.default(college[, 1:10]) : non-numeric argument to 'pairs'
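tells you that at least one of the 10 columns is non-numeric. Either drop those columns from your call to pairs() or coerce them explicitly with as.numeric(). Note that as.numeric() on a column of arbitrary text returns NA, so a categorical column stored as character should be converted to a factor first.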
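As a rough sketch of the first option, the non-numeric columns can be filtered out before plotting. The small college data frame below is a made-up stand-in (its column names and values are assumptions, not the textbook data), and is_num and numeric_cols are hypothetical names used only for illustration.

```r
# Toy stand-in for the textbook data; names and values are assumptions.
college <- data.frame(
  Private = c("Yes", "No", "Yes", "No"),
  Apps    = c(1660, 2186, 1428, 417),
  Accept  = c(1232, 1924, 1097, 349),
  stringsAsFactors = FALSE
)

# Show the type of each column; Private is character, the others numeric.
str(college)

# Flag the numeric columns, then keep only those.
is_num <- vapply(college, is.numeric, logical(1))
numeric_cols <- college[, is_num, drop = FALSE]

# pairs() now receives numeric columns only.
pairs(numeric_cols)
```

Filtering by type rather than by position keeps the call working if the column order in the data changes.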
I'm working, in RStudio, with data for patients that are either normal, have Crohn's disease, or ulcerative colitis. Now, the data is structured in such a way that patient information is in a separate data frame (called sampleInfo), and the data I want to use for analysis is in a different data frame (called expressionData). For my analysis, I would like to remove the patients that are 'normal' from the dataset and only keep those with Crohn's disease or ulcerative colitis.
So, I first ran the following command to make a new data frame from sampleInfo containing all the patients (i.e. rows) with the normal disease state:
bad_patients <- sampleInfo[sampleInfo$characteristics_ch1.3 == "disease state: normal", ]
bad_patients has a column called geo_accession, which contains the patient IDs; these IDs match the column names for the same patients in expressionData.
I save these IDs using
patient_names <- bad_patients$geo_accession
Now, I want to remove the columns with these names from expressionData. I looked at a lot of different StackOverflow posts, as well as posts on the R help forum, and found two main ways, both of which I have tried. The first is done with the following command:
newDataFrame <- expressionData[ , !names(expressionData) %in% patient_names]
Though this method does produce a new matrix called newDataFrame, attempting to view this matrix in RStudio gives the following error:
Error in View : 'names' attribute [1] must be the same length as the vector [0]
I also tried a second subset method with the following command:
newDataFrame <- subset(expressionData, -patient_names)
which raises the error: Error in -patient_names : invalid argument to unary operator
I also tried this subset method by explicitly typing out the columns I wanted to remove, as follows:
newDataFrame <- subset(expressionData, -c('ID090190', ...)) (where ... corresponds to the rest of the IDs) and got the exact same error.
Can someone tell me what I'm doing wrong, or how to work around this?
Couple of solutions:
Subsetting based on names
newDataFrame <- expressionData[!(names(expressionData) %in% patient_names)]
I've wrapped the %in% test in parentheses to make the intent explicit. They are not strictly required: in R, %in% has higher precedence than !, so !names(expressionData) %in% patient_names is already parsed as !(names(expressionData) %in% patient_names). If your original call still fails, the cause is more likely the class of expressionData than the missing parentheses.
I've subset with only one dimension (x[this] rather than x[,this]). You can do this with the columns of data frames because a data frame is a list of its columns. This subsetting method preserves the data.frame class of the returned object, whereas the two-dimensional subset will just return a vector if you select only one column. (Tibbles will return a tibble with both methods, which is one big advantage of tibbles)
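To make the difference between the two subsetting styles concrete, here is a minimal sketch on a toy data frame (df and its columns are invented for illustration, not taken from the question):

```r
# Toy data frame; names and values are made up for illustration.
df <- data.frame(a = 1:3, b = 4:6)

# One-dimensional (list-style) subsetting keeps the data.frame class.
one_dim <- df["a"]
class(one_dim)       # "data.frame"

# Two-dimensional subsetting of a single column drops to a plain vector.
two_dim <- df[, "a"]
class(two_dim)       # "integer"

# drop = FALSE keeps the data.frame class in the two-dimensional form.
two_dim_kept <- df[, "a", drop = FALSE]
class(two_dim_kept)  # "data.frame"
```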
Tidyverse solution: use dplyr::select with dplyr::all_of
newDataFrame <- dplyr::select(expressionData, -dplyr::all_of(patient_names))
Edit: Make sure your data really is a data.frame
If you're getting this error Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "c('matrix', 'array', 'double', 'numeric')", it's because your data is a matrix, rather than a data frame. You may have inadvertently coerced it in processing.
Use as.data.frame to return to a data frame object, which will be compatible with the methods above. If you wish to keep your data as a matrix, use colnames:
expressionData[ , !(colnames(expressionData) %in% patient_names)] to subset the columns.
If expressionData is a matrix, you'll need to subset the columns with colnames, rather than names. The names of a data.frame are identical to its colnames (because a df is a list of its columns), but the names of a matrix are the names of every element in the matrix, because a matrix is just an array with dimensionality. You'll want to check colnames(expressionData) to make sure that there are colnames to subset.
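The matrix case can be checked on a small example. This is only a sketch: expr_mat, drop_ids, and the ID values are hypothetical stand-ins for expressionData and patient_names.

```r
# Toy matrix standing in for expressionData; all names are hypothetical.
expr_mat <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("gene1", "gene2"), c("ID1", "ID2", "ID3"))
)

# A matrix with dimnames has no names attribute, so names() is NULL.
names(expr_mat)     # NULL
colnames(expr_mat)  # "ID1" "ID2" "ID3"

# Subset the columns by colnames; drop = FALSE keeps the matrix shape.
drop_ids <- c("ID2")
kept <- expr_mat[, !(colnames(expr_mat) %in% drop_ids), drop = FALSE]
colnames(kept)      # "ID1" "ID3"

# After conversion to a data frame, names() and colnames() agree.
as_df <- as.data.frame(expr_mat)
identical(names(as_df), colnames(as_df))  # TRUE
```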
You might want to try:
newDataFrame <- expressionData[ , !colnames(expressionData) %in% patient_numbers]
names(expressionData) is NULL, hence your error; you want the column names.
In your example, your list of sample names was called patient_numbers, not patient_names.
I am working on a data frame with all variables of numeric type
summary.default(pfnew)
ID 6016315 -none- numeric
iterator 6016315 -none- numeric
value 6016315 -none- numeric
CV 6016315 -none- numeric
I want to create a pivot table grouped by iterator and CV and summarize the count of ID. In essence, I want number of points in the data frame corresponding to a particular set of iterator and CV value. The code I have used is:
Code
install.packages("tidyr")
install.packages("dplyr")
install.packages("vctrs")
library(vctrs)
library(tidyr)
library(dplyr)
allow_lossy_cast(pivot<-pfnew%>%
select(pfnew$iterator,pfnew$CV,pfnew$ID)%>%
summarise(CT=count(pfnew$ID)))
But as discussed in other forums even after using allow_lossy_cast, I am getting the same error message.
Error: Must subset columns with a valid subscript vector. x Can't convert from <double> to <integer> due to loss of precision.
How can we resolve this? Or can we do the same job in any other manner?
I just came across the same error with a different dplyr function and realized that I included the name of the data frame after calling it. Try removing pfnew$ from select and summarise so it's select(c(iterator, CV,ID)).
select() throws this error when you pass columns as pfnew$column instead of bare column names. Use the column names directly, and this should resolve the issue. If the error persists, rename any column whose name contains spaces: a name with spaces cannot be used as a bare name in select(), so it is best to avoid spaces in column names.
Example - instead of pfnew$This is an example,
use pfnew$This_Is_an_example
and then directly use this name in select -
select(This_Is_an_example) %>%
....
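Putting the advice from these answers together, the count per iterator and CV combination might look like the sketch below. The pfnew data frame here is a small invented stand-in (the real one has about six million rows), and using group_by() with summarise(n()) in place of the original select()/count() call is an assumption about what the question intends.

```r
library(dplyr)

# Toy stand-in for pfnew; the values are invented for illustration.
pfnew <- data.frame(
  ID       = 1:6,
  iterator = c(1, 1, 1, 2, 2, 2),
  CV       = c(0.1, 0.1, 0.2, 0.1, 0.2, 0.2),
  value    = c(5, 3, 8, 1, 9, 4)
)

# Refer to columns by bare name (no pfnew$), group, then count rows.
pivot <- pfnew %>%
  group_by(iterator, CV) %>%
  summarise(CT = n(), .groups = "drop")

pivot
```

No call to allow_lossy_cast() is needed here, because nothing numeric is being used as a column position.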
I have a data frame consisting of five character variables which represent specific bacteria. I then have thousands of observations of each variable that all begin with the letter K. eg
x <- c("K0001", "K0001", "K0003", "K0006")
y <- c("K0001", "K0001", "K0002", "K0003")
z <- c("K0001", "K0002", "K0007", "K0008")
r <- c("K0001", "K0001", "K0001", "K0001")
o <- c("K0003", "K0009", "K0009", "K0009")
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"
I have tried to mutate the vectors using as.factor and as.numeric but neither of these approaches work as the first gives an equivalent error as above, and as.numeric returns NAs.
Thanks in advance
The reference that you cited recommended using setdiff. The only thing that you need to do to apply that solution is to convert the four remaining columns into a single vector, so that it can be treated as a set. You can do that with unlist:
setdiff(data$x, unlist(data[,2:5]))
"K0006"
I'm trying to use the softImpute command (from the softImpute package) for filling in missing values, and I'm trying to turn categorical variables in a large data frame into factor type before using the softImpute.
I've used as.factor command and factor command but they all yield the following
train[a]=factor(train[a])
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
a here is a vector like: c(1:92)
I tried as.character too but the softImpute command would not recognize the variables as character and would treat them as numeric, resulting in decimal values for categorical/indicator variables.
Try:
train[[a]]=factor(train[[a]])
This does assume, of course, that a is a single value: either a number in the range 1:length(train) or one of the values in the names(train) vector. If you index a data frame using "[", you get a list with one element. That element happens to be the vector you were hoping to "factorize", but the result itself is not a vector; it is a one-element list. The "[[" function instead gives you just the vector.
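The difference between "[" and "[[" can be seen on a toy data frame. Since a in the question is a vector of many column positions, the sketch also shows one possible way to convert several columns at once with lapply; that multi-column step is an addition of mine, not part of the answer above, and the train data here is invented.

```r
# Toy stand-in for train; names and values are made up for illustration.
train <- data.frame(
  g = c("a", "b", "a"),
  h = c(1, 2, 1),
  k = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# "[" returns a one-column data frame (a list), "[[" the vector itself.
class(train["g"])    # "data.frame"
class(train[["g"]])  # "character"

# Single column: "[[" on both sides works.
train[["g"]] <- factor(train[["g"]])

# Several columns: apply factor() to each selected column in turn.
a <- 1:2
train[a] <- lapply(train[a], factor)

# g and h are now factors; k is left numeric.
str(train)
```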
I have a data matrix (data) of 54675 obs. of 170 variables. And I want to perform
data.matrix.2 <- log2(data[,9:ncol(data)])
i.e. for values from the 9th column and beyond. The 8 columns before that are characters. I get the following error
Error in Math.data.frame(data.matrix[, 9:ncol(data)]) :
non-numeric variable in data frame:
Is there a way to treat a subset of the matrix as numeric (as.numeric) for the log transform?
Thanks
My first thought was that you had gotten a character matrix and needed:
as.numeric(data.matrix.2[ , -(1:8) ])
... but the data.matrix() function would already have coerced everything to 'numeric' mode. Then I noticed that you weren't calling the data.matrix function at all: "data.matrix" is just the name of your object. It would be better not to use that name, since it is also the name of an R function.
You are using "[,]" correctly, so your assumptions about your data object are probably flawed. At least one of the remaining 162 columns must have been created as factor or character. Run str(data.matrix) to see which one(s).
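One way to locate the offending columns without reading through the str() output for 162 columns is sketched below. The data frame is a tiny invented stand-in, the measurement block starts at column 2 rather than 9, and the conversion step assumes the non-numeric columns really contain numbers stored as text; block, non_numeric, and logged are hypothetical names.

```r
# Tiny stand-in for the real data; s2 was read in as text by mistake.
data <- data.frame(
  probe = c("p1", "p2"),
  s1    = c(2, 8),
  s2    = c("4", "16"),
  stringsAsFactors = FALSE
)

# Take the measurement block (columns 9:ncol(data) in the question).
block <- data[, 2:ncol(data)]

# Names of the columns in the block that are not numeric.
non_numeric <- names(block)[!vapply(block, is.numeric, logical(1))]
non_numeric  # "s2"

# Convert via as.character() so factor columns yield values, not codes.
block[non_numeric] <- lapply(
  block[non_numeric],
  function(col) as.numeric(as.character(col))
)

# Every column is numeric now, so the log transform goes through.
logged <- log2(as.matrix(block))
logged
```

If as.numeric() produces NA values with a warning, the column holds genuinely non-numeric entries and should be inspected before transforming.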