How to remove duplicated rows by a column in an R matrix - r

I am trying to remove duplicated rows by one column (e.g. the 1st column) in an R matrix. How can I extract the unique set by one column from a matrix? I've used
x_1 <- x[unique(x[,1]),]
While the size is correct, all of the values are NA. So instead, I tried
x_1 <- x[-duplicated(x[,1]),]
But the dimensions were incorrect.

I think you're confused about how subsetting works in R. unique(x[,1]) will return the set of unique values in the first column. If you then try to subset using those values R thinks you're referring to rows of the matrix. So you're likely getting NAs because the values refer to rows that don't exist in the matrix.
Your other attempt runs afoul of the fact that duplicated returns a logical vector, not a vector of indices. Putting a minus sign in front of it converts it to a vector of 0's and -1's, which R again interprets as row indices: the 0's are ignored and the -1's only drop the first row, so the dimensions come out wrong.
Try replacing the '-' with a '!' in front of duplicated, which is the boolean negation operator. Something like this:
m <- matrix(runif(100),10,10)
m[c(2,5,9),1] <- 1
m[!duplicated(m[,1]),]

As you need the indices of the unique rows, use duplicated as you tried. The problem was using - instead of !, so try:
x[!duplicated(x[,1]),]
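To make the difference between - and ! concrete, here is a minimal sketch on a small made-up matrix (the values are only for illustration, not the asker's data):

```r
# Illustrative 5 x 2 matrix; the first column contains repeated values.
x <- matrix(c(1, 2, 2, 3, 1,
              10, 20, 30, 40, 50), ncol = 2)

dup <- duplicated(x[, 1])   # logical: TRUE marks a repeat of an earlier value
neg <- -dup                 # numeric 0 / -1, no longer a logical mask

bad  <- x[neg, ]            # 0's are ignored, -1 drops row 1 only
good <- x[!dup, ]           # keeps the first occurrence of each value

dim(bad)
dim(good)
```

Here bad still has four rows (everything except row 1), while good has the three rows whose first-column values are 1, 2 and 3.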

Related

How to create a subset from a set of values within a column in R

I have a dataframe with 62 columns and 110 rows. In the column "date_observed" I have 57 dates with some of them having multiple records for the same date.
I am trying to extract only 12 dates out of this. They are not in any given order.
I tried this:
datesubset <- original %>% select (original$date_observed == c("13-Jun-21","21-Jun-21", "28-Jun-21", "13-Jul-21", "20-Jul-21", "8-Aug-21", "9-Aug-21", "25-Aug-21", "31-Aug-21", "8-Sep-21", "27-Sep-21"))
But, I got the following error:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type logical.
i It must be numeric or character.
I did try searching here and on google but I could find results only for how to subset a set of columns but not for specific values within columns. I am still new to R so please pardon me if this was a very simple question to ask.
In {dplyr}, the select() function is for selecting particular columns, but if you want to subset particular rows you want to use filter().
The == operator compares element by element and recycles the shorter vector, so each row is only compared against whichever date happens to line up with it by position, not against the whole set of dates. That is not what you are after.
What I think you are after is the %in% operator, which checks whether each value on the left appears anywhere on the right, and returns a single TRUE or FALSE for each row.
Also, inside of tidyverse functions you don't need the $, you can just input the column name as in the example below.
I don't have your original data to double check, but the example below should work with your original data frame.
specific_dates <- c(
"13-Jun-21",
"21-Jun-21",
"28-Jun-21",
"13-Jul-21",
"20-Jul-21",
"8-Aug-21",
"9-Aug-21",
"25-Aug-21",
"31-Aug-21",
"8-Sep-21",
"27-Sep-21"
)
datesubset <- original %>%
  filter(date_observed %in% specific_dates)
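If it helps to see why == fails here, below is a small base R sketch. The data frame and its count column are made up for illustration; only the column name date_observed comes from the question.

```r
# Hypothetical stand-in for the asker's data frame.
original <- data.frame(
  date_observed = c("8-Aug-21", "13-Jun-21", "1-Jan-21", "13-Jun-21"),
  count = 1:4
)
wanted <- c("13-Jun-21", "8-Aug-21")

# == recycles `wanted` and compares position by position,
# so no row happens to line up with its matching date.
eq_mask <- original$date_observed == wanted

# %in% asks, for each row, whether its date is anywhere in `wanted`.
in_mask <- original$date_observed %in% wanted

by_eq <- original[eq_mask, ]
by_in <- original[in_mask, ]
```

With this toy data by_eq is empty, while by_in contains rows 1, 2 and 4.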

R returns incorrect answers for logical operations?

I am trying to subset a dataframe based on values in a single column (Column_A), using code similar to the following:
new_df <- subset(df, df$Column_A<4)
I noticed that this code returns all rows where the value for Column_A is less than 4...as well as one row where the value is 12.4 (so, greater than 4).
I tried to look more closely at what R believes the value of this cell to be--df$Column_A[[2]] returned the expected value of 12.4.
I then tested several other variants of this logical operation--e.g.df$Column_A[[2]]<12 , df$Column_A[[2]]<11 , df$Column_A[[2]]<10 , df$Column_A[[2]]<9...
The first three expressions returned the expected answer ("FALSE"). However, df$Column_A[[2]]<9 and all variants of this expression with lower values (e.g. <8, <7...) return the answer ("TRUE"). This is clearly incorrect.
I have no idea what is causing this and would really appreciate any insight.
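This can happen if the class of the column is character. When a string is compared with a number, R converts the number to a string and compares the two as text, so "12.4" sorts before "4" (and before "9") because its first character is "1":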
"12.4" < 4
[1] TRUE
The remedy is to convert the column to numeric first and then subset:
df$Column_A <- as.numeric(df$Column_A)
subset(df, Column_A < 4)
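A quick way to confirm this diagnosis is to check the column's class before and after converting. The sketch below uses a made-up data frame; only the column name Column_A is taken from the question.

```r
# Hypothetical data: numbers stored as text, as can happen after a messy import.
df <- data.frame(Column_A = c("3.1", "12.4", "0.5", "7"),
                 stringsAsFactors = FALSE)

class(df$Column_A)                    # "character", so < compares text

before <- subset(df, Column_A < 4)    # text comparison: "12.4" slips through

df$Column_A <- as.numeric(df$Column_A)
after <- subset(df, Column_A < 4)     # numeric comparison
```

In this toy case before has three rows, including the "12.4" one, and after keeps only 3.1 and 0.5. If as.numeric() warns about NAs being introduced, some entries are not plain numbers and are worth inspecting first.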

Getting only the rownames containing a specific character - R

I have a Seurat R object. I would like to only select the data corresponding to a specific sample. Therefore, I want to get only the row names that contain a specific character. Example of my differences in row names: CTAAGCTT-1 and CGTAAAT-2. I want to differentiate based on 1 and 2. The code below shows what I already tried. But it just returns the total numbers of row. Not how many rows are matching the character.
length <- length(rownames(seuratObject@meta.data) %in% "1")
OR
length <- length(grepl("-1",rownames(seuratObj@meta.data)))
Idents(seuratObject, cells = 1:length)
Thanks for any input.
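You are just missing which(). grepl() returns one TRUE/FALSE per row name, so taking length() of it always gives the total number of rows; which() keeps only the positions of the matches:
length(which(grepl("-1", rownames(seuratObject@meta.data))))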
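As a rough illustration that does not need Seurat, the sketch below uses a plain data frame as a stand-in for the metadata table; the barcodes and the nCount column are invented. It also anchors the pattern with $, on the assumption that the sample number is always the suffix of the row name.

```r
# Hypothetical stand-in for the metadata: row names are cell barcodes.
meta <- data.frame(
  nCount = c(10, 20, 30, 40),
  row.names = c("CTAAGCTT-1", "CGTAAAT-2", "GGATCCAA-1", "TTAGC-11")
)

loose  <- grepl("-1",  rownames(meta))   # also matches the "-11" barcode
is_s1  <- grepl("-1$", rownames(meta))   # only names that end in "-1"

n_rows   <- length(is_s1)   # always the number of rows, whatever matched
n_s1     <- sum(is_s1)      # number of matches (TRUE counts as 1)
cells_s1 <- rownames(meta)[is_s1]
```

sum() on the logical vector gives the same count as length(which(...)), and cells_s1 holds the matching names themselves rather than just how many there are.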

Clarification in colnames function in R

I am new to R and wanted to ask about the colnames function. I noticed that it returns NULL when used on a single column of a matrix object, but works perfectly fine for more than one column. To illustrate, say I have a matrix test
> test <- matrix(0, ncol=4, nrow=5)
> colnames(test) <- c("A","B","C","D")
> colnames(test[,1])   # colnames(test[,c(1)]) gives the same output
NULL
whereas the following works fine,
colnames(test[,c(1:2)])
[1] "A" "B"
I understand that alternative way is to use colnames(test)[c(1:2)]. Am I missing something here in the case where I am getting NULL.
If you look at the description in ?colnames, you'll see that it takes an argument x, which is a matrix-like R object with at least two dimensions for colnames.
When you call colnames(test[,1]) you are giving colnames a plain vector, because selecting a single column drops the matrix dimensions. Compare class(test[,1]) vs. class(test[,c(1:2)]). Vectors don't have columns or rows and therefore no column or row names. You can have named elements within a vector, but that is definitely not equivalent to the column names of a matrix.
The best way to extract a single column name (or several) is to take the full vector of column names first and then select from it:
colnames(test) # gives you all column names
colnames(test)[1] # gives you the column name 1
colnames(test)[c(1,2)] # gives you column names 1 and 2
Does this clarify this issue for you?
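If you do want to keep the single column as a matrix, base R's drop = FALSE argument to [ prevents the dimensions from being dropped. A small sketch using the same test matrix:

```r
test <- matrix(0, ncol = 4, nrow = 5)
colnames(test) <- c("A", "B", "C", "D")

v <- test[, 1]                 # plain numeric vector, dimensions dropped
m <- test[, 1, drop = FALSE]   # still a 5 x 1 matrix

colnames(v)   # NULL, a vector has no columns
colnames(m)   # the column name survives
dim(m)
```

Whether this or colnames(test)[1] is more convenient depends on whether you need the data or only the name.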

How to code this if else clause in R?

I have a function that outputs a list containing strings. Now, I want to check whether the strings in this list are all 0's, or whether there is at least one string that is not all 0's (there can be more).
I have a large dataset, and I am going to execute my function on each of its rows.
Basically,
for each row of the dataset
mylst <- func(row[i])
if (mylst(contains strings containing all 0's)
process the next row of the dataset
else
execute some other code
Now, I can code the if-else clause but I am not able to code the part where I have to check the list for all 0's. How can I do this in R?
Thanks!
You can use this for loop:
for (i in seq_len(nrow(dat))) {
  # skip the row when every string in it consists of 0s only
  if (!all(grepl("^0+$", unlist(dat[i, ])))) {
    # execute some other code
  }
}
where dat is the name of your data frame.
Here, the regex "^0+$" matches a string that consists of 0s only.
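To see how the pattern behaves on its own, here is a short sketch with made-up strings. It also shows the difference between any() and all() when summarizing the matches; which one you want depends on whether a single all-zero string or only a fully all-zero set should count.

```r
# Illustrative strings, not taken from the asker's data.
strings <- c("000", "010", "0", "", "00a")

only_zeros <- grepl("^0+$", strings)   # TRUE only for "000" and "0"

has_some <- any(only_zeros)   # at least one string is all 0s
has_only <- all(only_zeros)   # every string is all 0s
```

Note that the empty string does not match, because + requires at least one 0.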
I'd like to suggest a solution that avoids an explicit for loop.
For a given data set df, one can find a logical vector that indicates the rows with all zeroes:
all.zeros <- apply(df,1,function(s) all(grepl('^0+$',s))) # grepl() pattern taken from Sven's solution
With this logical vector, it is easy to subset df to remove all-zero rows:
df[!all.zeros,]
and use it for any subsequent transformations.
'Toy' dataset
df <- data.frame(V1=c('00','01','00'),V2=c('000','010','020'))
UPDATE
If you'd like to apply the function to each row first and then analyze the resulting strings, you should slightly modify the all.zeros expression:
all.zeros <- apply(df,1,function(s) all(grepl('^0+$',func(s))))
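Putting the pieces in running order on the toy dataset gives the following sketch. It covers only the direct version, since func is the asker's own function and is not shown here.

```r
# Toy dataset from above.
df <- data.frame(V1 = c("00", "01", "00"),
                 V2 = c("000", "010", "020"))

# One TRUE/FALSE per row: TRUE when every string in the row is all 0s.
all.zeros <- apply(df, 1, function(s) all(grepl("^0+$", s)))

# Keep the rows that contain at least one string that is not all 0s.
kept <- df[!all.zeros, ]
```

Only the first row is flagged as all zeros, so kept holds rows 2 and 3.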
