Selecting unique values from single column of a data frame - r

I have a data frame consisting of five character variables which represent specific bacteria. I then have thousands of observations of each variable that all begin with the letter K. eg
x <- c(K0001,K0001,K0003,K0006)
y <- c(K0001,K0001,K0002,K0003)
z <- c(K0001,K0002,K0007,K0008)
r <- c(K0001,K0001,K0001,K0001)
o <- c(K0003,K0009,K0009,K0009)
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character
I have tried to mutate the vectors using as.factor and as.numeric but neither of these approaches work as the first gives an equivalent error as above, and as.numeric returns NAs.
Thanks in advance

The reference that you cited recommended using setdiff. The only thing that you need to do to apply that solution is to convert the four columns into one, so that it can be treated as a set. You can do that with unlist
setdiff(data$x, unlist(data[,2:5]))
"K0006"

Related

How to use 'as.factor' with 'apply'?

I tried to convert the categorical features in a dataset to factors. However, using apply with as.factor did not work:
convert <- c(2:5, 7:9,11,16:17)
read_file[,convert] <- data.frame(apply(read_file[convert], 2, as.factor))
However, switching to lapply did work:
read_file[,convert] <- data.frame(lapply(read_file[convert], as.factor))
Can someone explain to me what's the difference and why second code works while the first fails?
apply returns a matrix and a matrix cannot contain a factor variable. Factor variables are coerced to character variables if you create a matrix from them. The documentation in help("apply") says:
In all cases the result is coerced by as.vector to one of the basic
vector types before the dimensions are set, so that (for example)
factor results will be coerced to a character array.
lapply returns a list and a list can contain (almost) anything. In fact, a data.frame is just a list with some additional attributes. You don't even need to call data.frame there. You can just subset-assign a list into a data.frame.

Dropping Columns of Specific Name in R

I'm working, in RStudio, with data for patients that are either normal, have Crohn's disease, or ulcerative colitis. Now, the data is structured in such a way that patient information is in a separate data frame (called sampleInfo), and the data I want to use for analysis is in a different data frame (called expressionData). For my analysis, I would like to remove the patients that are 'normal' from the dataset and only keep those with Crohn's disease or ulcerative colitis.
So, what I did was first run the following command to make a new data frame from sampleInfo containing all the patients (aka rows) with the normal disease state, using the following command:
bad_patients <- sampleInfo[sampleInfo$characteristics_ch1.3 == "disease state: normal", ]
bad_patients has a column called geoaccession, which contains the patient ID, which also corresponds with the column names for the same patient in expressionData.
I save the names of these IDs using
patient_names <- bad_patients$geo_accession.
Now, I want to remove the columns with these names from expressionData. I looked at a lot of different StackOverflow posts, as well as posts on the R help forum, and found two main ways, both of which I have tried. The first is done with the following command:
newDataFrame <- expressionData[ , !names(expressionData) %in% patient_names]
Though this method does produce a new matrix called newDataFrame, attempting to view this matrix in RStudio gives the following error:
Error in View : 'names' attribute [1] must be the same length as the vector [0]
I also tried a second subset method with the following command:
newDataFrame <- subset(expressionData, -patient_names)
which raises the error: Error in -patient_names : invalid argument to unary operator
I also tried this subset method by explicity typing out the columns I wanted to remove as follows:
newDataFrame <- subset(expressionData, -c('ID090190', ...) (where ... corresponds to the rest of the IDs) and got the same exact error.
Can someone tell me what I'm doing wrong, or how to work around this?
Couple of solutions:
Subsetting based on names
newDataFrame <- expressionData[!(names(expressionData) %in% patient_names)]
One problem with your attempt was that you hadn't wrapped the whole expression evaluated by ! in parentheses. As it was, you were looking for !names(expressionData) in patient_names. ! here would coerce names(expressionData) into a logical and likely return a vector full of FALSEs
I've subset with only one dimension (x[this] rather than x[,this]). You can do this with the columns of data frames because a data frame is a list of its columns. This subsetting method preserves the data.frame class of the returned object, whereas the two-dimensional subset will just return a vector if you select only one column. (Tibbles will return a tibble with both methods, which is one big advantage of tibbles)
Tidyverse solution: use dplyr::select with dplyr::all_of
newDataFrame <- dplyr::select(expressionData, -dplyr::all_of(patientnames))
Edit: Make sure your data really is a data.frame
If you're getting this error Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "c('matrix', 'array', 'double', 'numeric')", it's because your data is a matrix, rather than a data frame. You may have inadvertently coerced it in processing.
Use as.data.frame to return to a data frame object, which will be compabtible with the methods above. If you wish to keep your data as a matrix, use colnames:
expressionData[ , !(colnames(expressionData) %in% patient_names)] to subset the columns.
If expressionData is a matrix, you'll need to subset the columns with colnames, rather than names. The names of a data.frame are identical to its colnames (because a df is a list of its columns), but the names of a matrix are the names of every element in the matrix, because a matrix is just an array with dimensionality. You'll want to check colnames(expressionData) to make sure that there are colnames to subset.
You might want to try:
newDataFrame <- expressionData[ , !colnames(expressionData) %in% patient_numbers]
names(expressionData) is NULL, hence your error; you want the column names
in your example, your list of sample names was called patient_numbers, not patient_names

Convert several columns of data frame to numeric

I am reading a txt file into R and have several columns that should be numeric, but everything is interpreted as character. Now I would like to convert only a few columns within that matrix (I converted it to a matrix in a first step) to numeric, but I only managed to extract columns, but that way I got rid of the type matrix...
data <- as.numeric(data[,1])
Now, I've found similar questions here but none of the answers worked in the way that it conserved the type matrix.
For example, I've tried to store the affected columns in a vector and then perform the action on that vector with lapply
cols<- c("a","b","d")
data<- as.matrix(lapply(cols, as.numeric))
But this gives me only empty fields, and of course it only shows the columns I selected and not the rest of the matrix. I also got the error message
NAs introduced by coercion
As a last step I tried the following, but I ended up having a list and not a matrix anymore
data[1:25] <- as.matrix(lapply(data[1:25], as.numeric))
What I would like to have, is a matrix where several columns (not just 1:25 as in my example above but rather, say, columns 1,3 and 6) are converted to numeric and the rest stays the same.
Does someone have an answer and maybe even an explanation for why the things I've tried didn't work?

matrix subseting by column's name using `subset` function

Consider the following simulation snippet:
k <- 1:5
x <- seq(0,10,length.out = 100)
dsts <- lapply(1:length(k), function(i) cbind(x=x, distri=dchisq(x,k[i]),i) )
dsts <- do.call(rbind,dsts)
why does this code throws an error (dsts is matrix):
subset(dsts,i==1)
#Error in subset.matrix(dsts, i == 1) : object 'i' not found
Even this one:
colnames(dsts)[3] <- 'iii'
subset(dsts,iii==1)
But not this one (matrix coerced as dataframe):
subset(as.data.frame(dsts),i==1)
This one works either where x is already defined:
subset(dsts,x> 500)
The error occurs in subset.matrix() on this line:
else if (!is.logical(subset))
Is this a bug that should be reported to R Core?
The behavior you are describing is by design and is documented on the ?subset help page.
From the help page:
For data frames, the subset argument works on the rows. Note that subset will be evaluated in the data frame, so columns can be referred to (by name) as variables in the expression (see the examples).
In R, data.frames and matrices are very different types of objects. If this is causing a problem, you are probably using the wrong data structure for your data. Matrices are really only necessary if you meed matrix arithmetic. If you are thinking of your columns as different attributes for a row observations, then you should be storing your data in a data.frame in the first place. You could store all your values in a simple vector where every three values represent one observation, but that would also be a poor choice of data structure for your data. I'm not sure if you were trying to be more efficient by choosing a matrix but it seems like just the wrong choice.
A data.frame is stored as a named list while a matrix is stored as a dimensioned vector. A list can be used as an environment which makes it easy to evaluate variable names in that context. The biggest difference between the two is that data.frames can hold columns of different classes (numerics, characters, dates) while matrices can only hold values of exactly one data.type. You cannot always easily convert between the two without a loss of information.
Thinks like $ only work with data.frames as well.
dd <- data.frame(x=1:10)
dd$x
mm <- matrix(1:10, ncol=1, dimnames=list(NULL, "x"))
mm$x # Error
If you want to subset a matrix, you are better off using standard [ subsetting rather than the sub setting function.
dsts[ dsts[,"i"]==1, ]
This behavior has been a part of R for a very long time. Any changes to this behavior is likely to introduce breaking changes to existing code that relies on variables being evaluated in a certain context. I think the problem lies with whomever told you to use a matrix in the first place. Rather than cbind(), you should have used data.frame()

correlation of several columns need to be calculated

I'm trying to get the correlation coefficient for corresponding columns of two csv files. I simply use the followings but get errors. consider each csv file has 50 columns
first values <- read.csv("")
second values <- read.csv("")
correlation.csv <- cor(x= first values , y=second values, method="spearman)
But i get x' must be numeric error!
subset of one csv file
Thanks for your help
The read.table function and all of it's derivatives return a data.frame which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets and have the same number of rows and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated tahn that, then you need to fill in the missing details with example data by editing the question (not by responding in comments.)
There must be some categorical variable in X.So you can first separate that categorical variable from X and then use X in cor() function.

Resources