Removing rows causes "row.names" column to appear when displayed with View() - r

To remove rows from a data frame, I use the following command:
data <- data[-1, ]
for example to remove the first row. I need to remove the first 6 rows, so I used the following:
data <- data[-c(1,2,3,4,5,6), ]
OR
data <- data[-(1:6), ]
This works as far as removing the rows, but it introduces a new column called row.names that I cannot get rid of unless I use the command:
row.names(data) <- NULL
What is the reason for this? Is there a better way of removing a number of rows/columns with one command?
Example:
after the following code:
tquery <- tquery[-(1:6), ]
This is the data:

Although it seems as such, you are not actually adding a column to the data. What you are seeing is just a result of using View(). The function is showing the "row.names" attribute of the data frame as the first column, but you didn't really add the column.
This is expected and documented behavior. From the Details section of help(View)
If there are row names on the data frame that are not 1:nrow, they are displayed in a separate first column called row.names.
So since you subsetted the data, the row names are technically not 1:nrow any more and hence the new column is introduced in the viewer.
Print your data in the console and you'll see the difference.
View(mtcars) ## because the mtcars row names are not 1:nrow
versus
mtcars
Basically, don't trust View() to display an exact representation of the actual data. Instead use attributes(), *names(), dim(), length(), etc. or just peek at the data with head().
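As a rough illustration of the point above, the following sketch uses a small made-up data frame (df and its columns are hypothetical, not the asker's data) to show that subsetting changes the row names but adds no column:

```r
# Toy data frame (hypothetical); row names start out as 1:nrow.
df <- data.frame(x = 1:10, y = letters[1:10])

# Drop the first six rows; the surviving rows keep their old row names.
df <- df[-(1:6), ]
cols_after  <- names(df)      # "x" "y"; no extra column was added
names_after <- rownames(df)   # "7" "8" "9" "10"

# Resetting the row names restores 1:nrow, so View() hides them again.
rownames(df) <- NULL
names_reset <- rownames(df)   # "1" "2" "3" "4"
```

Because the row names after subsetting are no longer 1:nrow, View() shows them in its row.names column until they are reset.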

See the R help via "?row.names" for more info. From the documentation, "All data frames have a row names attribute"
?row.names ## get more information about row.names from the R help
row.names is not a new column, but rather an attribute of every single data frame. It is simply metadata, and most functions that operate on the columns ignore it. One caveat when exporting: write.csv() writes the row names as a first column by default, so pass row.names = FALSE if you do not want them in the file. This is similar to how Excel shows row numbers in the left margin, which are reference information for the application rather than part of the data.
str(your_dataframe) ## see that those columns don't exist
colnames(your_dataframe) ## see column names
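A minimal sketch of the attribute-versus-column distinction, again on a made-up data frame (the name df is an assumption for illustration). It also shows one caveat worth checking when exporting, namely that write.csv() includes row names unless told otherwise:

```r
# Hypothetical data frame with its first row removed.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))[-1, ]

# row.names lives among the attributes, not among the columns.
has_attr  <- "row.names" %in% names(attributes(df))   # TRUE
is_column <- "row.names" %in% colnames(df)            # FALSE

# Capture the CSV text instead of writing a file.
with_rn    <- capture.output(write.csv(df))                     # header has an empty first field
without_rn <- capture.output(write.csv(df, row.names = FALSE))  # header is just the column names
```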

Related

Dropping Columns of Specific Name in R

I'm working, in RStudio, with data for patients that are either normal, have Crohn's disease, or ulcerative colitis. Now, the data is structured in such a way that patient information is in a separate data frame (called sampleInfo), and the data I want to use for analysis is in a different data frame (called expressionData). For my analysis, I would like to remove the patients that are 'normal' from the dataset and only keep those with Crohn's disease or ulcerative colitis.
So, what I did was first run the following command to make a new data frame from sampleInfo containing all the patients (aka rows) with the normal disease state, using the following command:
bad_patients <- sampleInfo[sampleInfo$characteristics_ch1.3 == "disease state: normal", ]
bad_patients has a column called geo_accession, which contains the patient ID and corresponds to the column name for the same patient in expressionData.
I save these IDs using
patient_names <- bad_patients$geo_accession
Now, I want to remove the columns with these names from expressionData. I looked at a lot of different StackOverflow posts, as well as posts on the R help forum, and found two main ways, both of which I have tried. The first is done with the following command:
newDataFrame <- expressionData[ , !names(expressionData) %in% patient_names]
Though this method does produce a new matrix called newDataFrame, attempting to view this matrix in RStudio gives the following error:
Error in View : 'names' attribute [1] must be the same length as the vector [0]
I also tried a second subset method with the following command:
newDataFrame <- subset(expressionData, -patient_names)
which raises the error: Error in -patient_names : invalid argument to unary operator
I also tried this subset method by explicitly typing out the columns I wanted to remove as follows:
newDataFrame <- subset(expressionData, -c('ID090190', ...)) (where ... corresponds to the rest of the IDs) and got the same exact error.
Can someone tell me what I'm doing wrong, or how to work around this?
Couple of solutions:
Subsetting based on names
newDataFrame <- expressionData[!(names(expressionData) %in% patient_names)]
I've wrapped the whole %in% expression in parentheses so it is obvious that ! negates the result of the membership test. Strictly speaking this is for readability only: in R, %in% binds more tightly than !, so !names(expressionData) %in% patient_names already parses as !(names(expressionData) %in% patient_names).
I've subset with only one dimension (x[this] rather than x[,this]). You can do this with the columns of data frames because a data frame is a list of its columns. This subsetting method preserves the data.frame class of the returned object, whereas the two-dimensional subset will just return a vector if you select only one column. (Tibbles will return a tibble with both methods, which is one big advantage of tibbles)
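To make the one-dimensional versus two-dimensional difference concrete, here is a small sketch on invented data (df and drop_cols are hypothetical stand-ins for expressionData and patient_names):

```r
# Hypothetical stand-ins for expressionData and patient_names.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
drop_cols <- c("b", "c")
keep <- !(names(df) %in% drop_cols)

one_dim <- df[keep]      # list-style subsetting; stays a data frame
two_dim <- df[, keep]    # matrix-style subsetting; a single column collapses

class_one <- class(one_dim)   # "data.frame"
class_two <- class(two_dim)   # "integer"

# drop = FALSE keeps the data frame class with two-dimensional subsetting too.
two_dim_kept <- df[, keep, drop = FALSE]
```

If only one column can ever survive the filter, the one-dimensional form or drop = FALSE avoids surprises in later code that expects a data frame.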
Tidyverse solution: use dplyr::select with dplyr::all_of
newDataFrame <- dplyr::select(expressionData, -dplyr::all_of(patient_names))
Edit: Make sure your data really is a data.frame
If you're getting this error Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "c('matrix', 'array', 'double', 'numeric')", it's because your data is a matrix, rather than a data frame. You may have inadvertently coerced it in processing.
Use as.data.frame to return to a data frame object, which will be compatible with the methods above. If you wish to keep your data as a matrix, use colnames:
expressionData[ , !(colnames(expressionData) %in% patient_names)] to subset the columns.
If expressionData is a matrix, you'll need to subset the columns with colnames, rather than names. The names of a data.frame are identical to its colnames (because a data frame is a list of its columns), but the names of a matrix are the names of every element in the matrix, because a matrix is just a vector with a dim attribute. You'll want to check colnames(expressionData) to make sure that there are colnames to subset.
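A short sketch of the names() versus colnames() difference, using a made-up matrix (m, its sample IDs, and drop_ids are hypothetical):

```r
# Hypothetical expression matrix with sample IDs as column names.
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("s1", "s2", "s3")))
drop_ids <- c("s2")

matrix_names <- names(m)       # NULL: a matrix has no per-column names()
matrix_cols  <- colnames(m)    # "s1" "s2" "s3"

# Subset the matrix by column name, keeping it a matrix.
m_kept <- m[, !(colnames(m) %in% drop_ids), drop = FALSE]

# After conversion to a data frame, names() and colnames() agree.
df <- as.data.frame(m)
same_names <- identical(names(df), colnames(df))   # TRUE
```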
You might want to try:
newDataFrame <- expressionData[ , !colnames(expressionData) %in% patient_numbers]
names(expressionData) is NULL, hence your error; you want the column names
in your example, your list of sample names was called patient_numbers, not patient_names

Assigning Unnamed Columns To Another DataFrame

I'm in a very basic class that introduces R for genetic purposes. I'm encountering a rather peculiar problem in trying to follow the instructions given. Here is what I have along with the instructor's notes:
MangrovesRaw<-read.csv("C:/Users/esteb/Documents/PopGen/MangrovesSites.csv")
#i'm going to make a new dataframe now, with one column more than the mangrovesraw dataframe but the same number of rows.
View(MangrovesRaw)
Mangroves<-data.frame(matrix(nrow = 528, ncol = 23))
#next I want you to name the first column of Mangroves "pop"
colnames(Mangroves)<-c(col1="pop")
#i'm now assigning all values of that column to be 1
Mangroves$pop<-1
#assign the rest of the columns (2 to 23) to the entirety of the MangrovesRaw dataframe
#then change the names to match the mangroves raw names
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
I'm not really sure how to assign columns that haven't been named using the $ as we have in the past. A friend suggested I first run
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw
#X338 is the name of the first column from MangrovesRaw
But while this does transfer the data from MangrovesRaw, it comes at the cost of having my column names messed up with X338. added to every subsequent column. In an attempt to modify this I found the following "fix"
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw[,2]
#Mangroves$X338<-MangrovesRaw[,2:22]
#MangrovesRaw has 22 columns in total
While this transferred all the data I needed for the X338 column, it didn't transfer any data for the remaining 21 columns. The commented-out code just results in the same problem of having X338. show up in all my column names.
What am I doing wrong?
There are a few ways to solve this problem. It may be that your instructor wants it done a certain way, but here's one simple solution: just cbind() the Mangroves$pop column with the real data. Then the data and column names are already added. Naming the argument (pop = ...) keeps the first column's name as pop rather than Mangroves$pop.
Mangroves <- cbind(pop = Mangroves$pop, MangrovesRaw)
Here's another way:
Mangroves[, 2:23] <- MangrovesRaw
colnames(Mangroves)[2:23] <- colnames(MangrovesRaw)
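Both approaches can be tried on a tiny stand-in for the real file. In this sketch raw plays the role of MangrovesRaw with only two columns (the values and the second column name X339 are invented), so the index range is 2:3 instead of 2:23:

```r
# Hypothetical two-column stand-in for MangrovesRaw.
raw <- data.frame(X338 = c(1, 2, 3), X339 = c(4, 5, 6))

# Approach 1: pre-allocate, fill columns 2..n by position, then rename.
out <- data.frame(matrix(nrow = nrow(raw), ncol = ncol(raw) + 1))
colnames(out)[1] <- "pop"
out$pop <- 1
out[, 2:3] <- raw                    # positional assignment copies every column
colnames(out)[2:3] <- colnames(raw)

# Approach 2: bind a named constant column in front of the raw data.
out2 <- cbind(pop = 1, raw)
```

Assigning by position with out[, 2:3] avoids the name-prefixing seen when a whole data frame is assigned into a single column with $.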

'row.names' is not a character vector of length

I am simply trying to create a dataframe.
I read in data by doing:
>example <- read.csv(choose.files(), header=TRUE, sep=";")
The data contains 2 columns with 8736 rows plus a header.
I then simply want to combine this with the column of a dataframe with the same number of rows (!) by doing:
>data_frame <- as.data.frame(example$x, example$y, otherdata$z)
It produces the following error
Warning message:
In as.data.frame.numeric(example$x, example$y, otherdata$z) :
'row.names' is not a character vector of length 8736 -- omitting it. Will be an error!
I have never had this problem before. It seems so easy to tackle, but I can't figure it out at the moment.
Overview
As long as the nrow(example) equals length(otherdata$z), use cbind.data.frame to combine columns into one data frame. An advantage with cbind.data.frame() is that there is no need to call the individual columns within example when binding them with otherdata$z.
# create a new data frame that adds the 'z' field from another source
df_example <- cbind.data.frame(example, otherdata$z)
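For context, as.data.frame() converts a single object and its second argument is row.names, which is why the extra vectors in the question trigger the row.names warning. A minimal sketch with invented values (the contents of example and otherdata here are assumptions) shows two ways to combine the columns instead:

```r
# Hypothetical stand-ins for the asker's objects.
example   <- data.frame(x = 1:3, y = c(10, 20, 30))
otherdata <- data.frame(z = c("a", "b", "c"))

# Guard against silent recycling: the lengths must match.
stopifnot(nrow(example) == length(otherdata$z))

# Option 1: bind the extra column onto the whole data frame, naming it z.
combined <- cbind.data.frame(example, z = otherdata$z)

# Option 2: build the data frame column by column with data.frame().
combined2 <- data.frame(x = example$x, y = example$y, z = otherdata$z)
```

Naming the bound column (z = ...) is optional, but without it the new column is labelled otherdata$z.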

integer function converting row names in to numbers

I used this:
mydata3 <- data.frame(sapply(mydata2, as.integer))
But now I see that the row names, which are gene names, have been converted to numbers (like 1-200). I should point out that the same command worked well when I used it some time ago. So I thought there was a problem with my file, and I tried an old file on which this command had worked, but I see the same problem: the gene names are converted to numbers. Here is the full script:
countsTable<-read.table("JW.txt",header=TRUE,stringsAsFactors=TRUE,row.names=1)
mydata2 <- countsTable/1000
mydata3 <- data.frame(sapply(mydata2, as.integer))
str(mydata3)
Please let me know.
sapply works over the columns of your data.frame mydata2 and returns the respective output per column. As such, it does not return the row names of your data.frame, so you either have to re-assign those, or assign the new column data back into your original data.frame, like:
mydata2[] <- sapply(mydata2, as.integer)
Thus you can keep all of the original attributes.
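Here is a small sketch of the difference on invented counts (the gene names and values are hypothetical). It uses lapply() in place of the sapply() shown above, which is a substitution on my part that keeps each column as its own vector:

```r
# Hypothetical count table with gene names as row names.
counts <- data.frame(s1 = c(1500, 2500), s2 = c(3200, 4800),
                     row.names = c("geneA", "geneB"))
scaled <- counts / 1000

# Rebuilding the data frame from sapply() output loses the gene names.
lost <- data.frame(sapply(scaled, as.integer))
lost_names <- rownames(lost)      # "1" "2"

# Assigning into scaled[] replaces the columns but keeps all attributes.
scaled[] <- lapply(scaled, as.integer)
kept_names <- rownames(scaled)    # "geneA" "geneB"
```

Note that as.integer() truncates toward zero, so 1.5 becomes 1 and 4.8 becomes 4.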

Trying to predict in R

I created a data set using a random row generator:
training_data <- fulldata[sample(nrow(fulldata), 100), ]
I am under the impression that I can create a second data set of the rest of the data ... rest_data <- fulldata[-training_data] is the code I jotted down in my notes but I am getting
"Error in '[.default'(fulldata, -training_data) :
What part of my code is incorrect?
Assuming that fulldata is a data frame, you need a comma in the subscript to indicate that you want the rows of the data frame (i.e. fulldata[rows, columns]). But the positional indices of the new data frame training_data will run 1:100, so you need a different sort of indicator that corresponds between training_data and fulldata to show which rows of fulldata should not be included. What you might do is use the rownames, something like:
rest_data<-fulldata[-which(rownames(fulldata)%in%rownames(training_data)),]
which should tell R to drop the rows of fulldata whose row names occur in training_data. If you have something like an ID variable that is unique to each row, you could also use this:
rest_data<-fulldata[-which(fulldata$ID%in%training_data$ID),]
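An alternative, sketched here on made-up data, is to store the sampled row indices and reuse them with a minus sign. This is not the approach in the answer above, just a common variant; fulldata, its columns, and the sample size of 4 are assumptions for illustration:

```r
set.seed(42)  # only so the sketch is reproducible

# Hypothetical data set with a unique ID per row.
fulldata <- data.frame(ID = 1:10, value = (1:10)^2)

# Keep the sampled indices so they can be reused for the complement.
train_idx     <- sample(nrow(fulldata), 4)
training_data <- fulldata[train_idx, ]
rest_data     <- fulldata[-train_idx, ]

# Row-name matching gives the same complement; a logical mask avoids which().
rest_by_name <- fulldata[!(rownames(fulldata) %in% rownames(training_data)), ]
```

The row-name version works because subsetting keeps the original row names of fulldata on the sampled rows.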

Resources