Can't select row in R - can't figure out why

I am trying to remove two rows from my dataset with this simple line of code:
my_data_screen <- my_data [-influential]
However, I get the error message Error: Can't negate columns that don't exist.
(The "influential" variable simply contains two row numbers, which are the result of calculating outliers from my sample.)
Even when I try to do something as simple as targeting a specific row (i.e. my_data [37]), I get the same error message.
Why is R interpreting my command as targeting columns, rather than rows?

Hi, with a single index and no comma, R treats a data frame as a list of columns, so my_data[-influential] tries to drop columns rather than rows.
As @ThomasIsCoding suggests, you should use:
my_data_screen <- my_data[-influential,]
The index before the comma refers to rows. If you want to delete columns instead, put the index after the comma:
my_data_screen <- my_data[,-influential]
In summary, the position of the index relative to the comma tells R whether you want to delete rows or columns.
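A minimal sketch of the difference, using a made-up four-row data frame. The names my_data and influential mirror the question, but the values are invented for illustration, and influential is assumed to hold row positions:

```r
# Toy data; values are invented for illustration only
my_data <- data.frame(id = 1:4, score = c(10, 20, 30, 40))
influential <- c(2, 4)  # assumed to be row positions to drop

# One index, no comma: the data frame is indexed as a list of columns
one_index <- my_data[2]                   # column 2 ("score"), still a data frame

# Index before the comma: rows
rows_kept <- my_data[-influential, ]      # drops rows 2 and 4

# Index after the comma: columns
cols_kept <- my_data[, -2, drop = FALSE]  # drops column 2, stays a data frame

print(rows_kept)
```

The drop = FALSE argument is only there so that removing all but one column still returns a data frame instead of a bare vector.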

If my_data is a data.frame, then you should use
my_data[37, ]
since my_data[37] indexes my_data by column by default.
For more on indexing, please read https://rspatial.org/intr/4-indexing.html

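If you are familiar with the tidyverse, you can use:
The filter() function to remove rows by value: filter(!(column_name %in% specified_values)). To remove rows by position, as with the row numbers stored in influential, use slice(-influential).
The select() function to remove columns: select(-column_name)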
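As a rough sketch of the dplyr route, assuming the dplyr package is installed and that influential holds row positions (the toy tibble below is invented for illustration):

```r
library(dplyr)

# Toy data; values are invented for illustration only
my_data <- tibble(id = 1:4, score = c(10, 20, 30, 40))
influential <- c(2, 4)  # assumed to be row positions to drop

# Remove rows by position
rows_removed <- my_data %>% slice(-influential)

# The same result with filter(), comparing each row's position
same_rows <- my_data %>% filter(!(row_number() %in% influential))

# Remove a column by name
col_removed <- my_data %>% select(-score)
```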

Related

How to create a subset from a set of values within a column in R

I have a dataframe with 62 columns and 110 rows. In the column "date_observed" I have 57 dates with some of them having multiple records for the same date.
I am trying to extract only 12 dates out of this. They are not in any given order.
I tried this:
datesubset <- original %>% select (original$date_observed == c("13-Jun-21","21-Jun-21", "28-Jun-21", "13-Jul-21", "20-Jul-21", "8-Aug-21", "9-Aug-21", "25-Aug-21", "31-Aug-21", "8-Sep-21", "27-Sep-21"))
But, I got the following error:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type logical.
i It must be numeric or character.
I did try searching here and on google but I could find results only for how to subset a set of columns but not for specific values within columns. I am still new to R so please pardon me if this was a very simple question to ask.
In {dplyr}, the select() function is for selecting particular columns, but if you want to subset particular rows you want to use filter().
The logical operator == compares element by element: the first row with the first date on the right, the second row with the second date, and so on, recycling the shorter vector when the lengths differ. That does not tell you whether each row's date appears anywhere in your list, which is what you are after.
What I think you want is the operator %in%, which checks whether each value on the left appears anywhere on the right and returns a single TRUE or FALSE per row.
As was mentioned, inside of tidyverse functions you don't need the $, you can just input the column name as in the example below.
I don't have your original data to double check, but the example below should work with your original data frame.
specific_dates <- c(
"13-Jun-21",
"21-Jun-21",
"28-Jun-21",
"13-Jul-21",
"20-Jul-21",
"8-Aug-21",
"9-Aug-21",
"25-Aug-21",
"31-Aug-21",
"8-Sep-21",
"27-Sep-21"
)
datesubset <- original %>%
filter(date_observed %in% specific_dates)
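To see why == and %in% behave differently here, a small base R sketch may help. The data frame below is invented for illustration and only borrows the column name date_observed from the question:

```r
# Toy data; values are invented for illustration only
original <- data.frame(
  date_observed = c("13-Jun-21", "21-Jun-21", "13-Jun-21", "5-May-21"),
  count = c(3, 5, 2, 7),
  stringsAsFactors = FALSE
)
specific_dates <- c("21-Jun-21", "13-Jun-21")

# == pairs elements up position by position (recycling specific_dates),
# so row 1 is only compared with "21-Jun-21", row 2 with "13-Jun-21", etc.
by_equal <- original[original$date_observed == specific_dates, ]

# %in% asks, for each row, whether its date appears anywhere in specific_dates
by_in <- original[original$date_observed %in% specific_dates, ]

nrow(by_equal)  # no rows happen to line up positionally in this toy data
nrow(by_in)     # every row whose date is in the list
```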

Adding a new column to a DataFrame

I am trying to add a column for totals to a dataframe using R and am getting this error:
Error in rowSums(EurostatCrime2017[, 7:10]) : 'x' must be numeric.
Here is my code:
EurostatCrime2017$All_Theft <- rowSums(EurostatCrime2017[,7:10])
It could be a type issue. If we check the types of the columns with str
str(EurostatCrime2017[,7:10])
we will see whether any of the columns are not numeric or integer.
One option is to convert the columns to numeric
EurostatCrime2017[,7:10] <- lapply(EurostatCrime2017[,7:10], function(x)
as.numeric(as.character(x)))
Here, we go through as.character in case the columns are factors, because as.numeric applied directly to a factor returns its internal integer codes rather than the values it displays.
Then do the rowSums.
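A minimal sketch of that conversion on a toy data frame. The column names and values below are invented, and the columns are deliberately stored as factors to reproduce the problem:

```r
# Toy data; the numeric-looking columns are deliberately stored as factors
crime <- data.frame(
  country = c("A", "B"),
  theft = factor(c("10", "20")),
  burglary = factor(c("3", "4")),
  stringsAsFactors = FALSE
)

str(crime[, 2:3])  # shows the columns are factors, so rowSums() would fail

# Pitfall: as.numeric() on a factor gives the level codes, not the values
codes_only <- as.numeric(crime$theft)

# Convert via character first, then sum across the converted columns
crime[, 2:3] <- lapply(crime[, 2:3], function(x) as.numeric(as.character(x)))
crime$All_Theft <- rowSums(crime[, 2:3])
```

Any entry that is not a number (a placeholder string for a missing value, say) becomes NA with a warning during the conversion; rowSums(..., na.rm = TRUE) skips those if that is acceptable for the analysis.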
I tried the options and they don't seem to be working. Here is a link to the document I am working on.
https://drive.google.com/open?id=193JI7z41xvpDh88MWrKp52I3HiQ76LFb

Average of Multiple columns in dplyr getting error "must resolve to integer column positions, not a list"

gene HSC_7256.bam HSC_6792.bam HSC_7653.bam HSC_5852
That is what my data frame looks like. I can do this the ordinary way, by taking out the columns, making another data frame and averaging it, but I want to do it in dplyr and I'm having a hard time. I'm not sure what the problem is.
I am doing something like this
HSC<- EPIGENETIC_FACTOR_SEQMONK %>%
select(EPIGENETIC_FACTOR_SEQMONK,gene)
I get this error
Error: EPIGENETIC_FACTOR_SEQMONK must resolve to integer column positions, not a list
So what I need to do is take out all the HSC samples and average them.
Can anyone suggest what I am doing incorrectly? That would be helpful.
The %>% operator passes whatever is on its left as the first argument of the function on its right. If your data frame is EPIGENETIC_FACTOR_SEQMONK, then these two statements are equivalent:
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
select(gene)
HSC <- select(EPIGENETIC_FACTOR_SEQMONK, gene)
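In the first, we are passing EPIGENETIC_FACTOR_SEQMONK into select using %>%, which is generally used in dplyr chains because the first argument of dplyr functions is a data frame. Your original call supplied the data frame twice, once through the pipe and once explicitly, so select() tried to interpret EPIGENETIC_FACTOR_SEQMONK as a column selection, hence the error.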
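A short sketch of both points, assuming dplyr is installed. The data frame below is a made-up stand-in with the column names from the question, and HSC_mean is a hypothetical name for the averaged column:

```r
library(dplyr)

# Toy stand-in for the real data; values are invented for illustration
EPIGENETIC_FACTOR_SEQMONK <- data.frame(
  gene = c("g1", "g2"),
  HSC_7256.bam = c(1, 4),
  HSC_6792.bam = c(2, 5),
  HSC_7653.bam = c(3, 6)
)

# Piped and direct calls give the same result
piped <- EPIGENETIC_FACTOR_SEQMONK %>% select(gene)
direct <- select(EPIGENETIC_FACTOR_SEQMONK, gene)
identical(piped, direct)

# Average the HSC columns row by row, then attach the result to the gene names
hsc_mean <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(starts_with("HSC")) %>%
  rowMeans() %>%
  unname()

HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(gene) %>%
  mutate(HSC_mean = hsc_mean)
```

Selecting with starts_with("HSC") is one possible choice; it assumes every HSC sample column shares that prefix.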

How to create a "top ten" vector that keeps labels?

I have a data set that has 655 rows and 21 columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function, it doesn't keep the labels (they are names of bacteria, each column is a sample). Is there a way to create a sorted subset of data that sorts the row name along with it?
Right now I am doing
topten <- head(sort(genuscounts[,c(1,i)], decreasing = TRUE), n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it's not going to work with your subset genuscounts[,c(1,i)], because the subset has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[,c(1,i)]
topten <- head(thisColumn[order(thisColumn[,2],decreasing=T),],10)
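Put into the loop the question describes, the order() approach might look like the sketch below. The toy genuscounts data frame and the n_top value are invented for illustration; column 1 is assumed to hold the names and every other column one sample:

```r
# Toy data; genus names and counts are invented for illustration only
genuscounts <- data.frame(
  genus = c("Bacteroides", "Prevotella", "Escherichia", "Lactobacillus"),
  sample1 = c(12, 40, 7, 25),
  sample2 = c(3, 9, 30, 1),
  stringsAsFactors = FALSE
)
n_top <- 2  # would be 10 for a real "top ten"

# For each sample column, keep the name column alongside it and sort by count
top_by_sample <- lapply(2:ncol(genuscounts), function(i) {
  thisColumn <- genuscounts[, c(1, i)]
  head(thisColumn[order(thisColumn[, 2], decreasing = TRUE), ], n_top)
})
names(top_by_sample) <- names(genuscounts)[-1]

top_by_sample$sample1
```

Because order() returns row positions, indexing the two-column subset with it reorders the names and the counts together.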
You could also use arrange() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange(genuscounts[,c(1,i)],desc(.data[[names(genuscounts)[i]]])),10)
Because your column name is a string rather than a bare name, refer to it through the .data pronoun. (Older dplyr code used arrange_() for this, but it is deprecated.)
Hope this helps!!

Error in 'colsplit' function?

I am trying to split a column of a dataframe into 2 columns using transform and colsplit from the reshape package. I don't get what I am doing wrong. Here's an example...
library(reshape)
df1 <- data.frame(col1=c("x-1","y-2","z-3"))
Now I am trying to split col1 into col1.a and col1.b at the delimiter '-'. The following is my code...
df1 <- transform(df1,col1 = colsplit(col1,split='-',names = c('a','b')))
Now in my RStudio when I do View(df1) I do get to see col1.a and col1.b split the way I want to.
But when I run...
df1$col1.a or head(df1$col1.a) I get NULL. Apparently I am not able to make any further operations on these split columns. What exactly is wrong with this?
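colsplit returns a data frame (that is, a list of columns). Inside transform, that whole result is most likely stored as a single column named col1, so it displays as col1.a and col1.b but df1$col1.a does not exist. The easiest (and idiomatic) way to assign the pieces to multiple columns in the data frame is to use [<-
e.g.
df1[c('col1.a','col1.b')] <- colsplit(df1$col1,'-',c('a','b'))
It will be much harder to do this within transform (see Assign multiple new variables on LHS in a single line in R)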
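For readers without the reshape package, the same multi-column assignment can be sketched in base R. This swaps colsplit for strsplit(), which is a substitution and not what the answer above uses; it also assumes every value contains exactly one '-':

```r
df1 <- data.frame(col1 = c("x-1", "y-2", "z-3"), stringsAsFactors = FALSE)

# Split each string at the literal "-" and stack the pieces into a matrix
parts <- do.call(rbind, strsplit(df1$col1, "-", fixed = TRUE))

# Assign both new columns in one step with [<-
df1[c("col1.a", "col1.b")] <- list(parts[, 1], parts[, 2])

df1$col1.a  # now a real top-level column, so this is no longer NULL
```

Both new columns are character vectors here; wrap parts[, 2] in as.numeric() if the second piece should be numeric.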
