How to subset the first column (rownames) in R [duplicate]

I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.
> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);
Here is a simplified example of the dataset I'm working with; the full dataset has 270 columns (tissue samples) and 7065 rows (gene names).
The first column is a list of gene names (A2M, AAAS, AACS etc.) and each subsequent column is a different tissue sample, thus showing the gene expression in each tissue sample.
The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names".
My thought process would be to subset the first column (gene names) and then use order() to sort alphabetically, after which I can use head() to print the first 20.
However when I try
> genes <- df[1]
It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.
Also
> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)
This appears to create a list of gene names in RStudio's viewer, but I cannot perform any manipulation on it.
I am unable to correctly locate the first column in the data.frame, since it does not have a column header, and I have the same problem when doing the same thing with row 1 (sample names).
I'm a complete novice at R and this is part of an assignment I'm working on. It seems I'm missing something fundamental, but I cannot figure out what.
Cheers guys

Please include a sample of your text file as text instead of an image.
I have created a dataset similar to yours:
X Y
1 a b
2 c d
3 d g
Note that your tissue columns have a header but your gene names do not. Therefore these will be interpreted as rownames, see ?read.table:
If row.names is not specified and the header line has one less entry
than the number of columns, the first column is taken to be the row
names.
Reading it in R:
df <- read.table(text = ' X Y
1 a b
2 c d
3 d g')
So your gene names are not in df[1] but in rownames(df). To extract them, use genes <- rownames(df); to add them to the existing data frame as a column, use df$gene <- rownames(df).
There are numerous ways to convert your row names to a column; see, for example, this question.
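As a minimal sketch of how this answers the original assignment question, the example below reads a small, made-up CSV (the gene names and values are hypothetical stand-ins for the real file) whose header has one fewer entry than the data rows, then sorts the row names. The object names csv_text, sorted_genes and df_sorted are illustrative only.

```r
# Hypothetical miniature of the dataset: the header line has one fewer
# entry than the data lines, so the first column becomes the row names.
csv_text <- "TissueA,TissueB
AACS,1.2,3.4
A2M,5.6,7.8
AAAS,9.1,2.3"
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The gene names live in the row names, not in a regular column.
genes <- rownames(df)

# Sort the names alphabetically and print (up to) the first 20.
sorted_genes <- sort(genes)
print(head(sorted_genes, 20))

# To reorder the whole data frame by gene name, index the rows with order().
df_sorted <- df[order(rownames(df)), ]
```

Note that sort() returns the sorted values themselves, while order() returns the positions needed to sort, which is why order() is the one to use when reordering the rows of the data frame.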

If you are asking what I think you are asking, you just need to subset inside the as.data.frame function, which will auto-generate a "header", as you call it. It will be called V1, the first variable of your new data frame.
genes <- as.data.frame(df[,1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B
As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.
genes <- df[1]

Related

How do I make a variable that represents the row number, when rows are not sorted in order?

Pic shows the row number order
I am trying to add a variable to my data set that represents the row number; however, every piece of code I've found adds the numbers in the rows' current order (1, 2, 3, 4, 5) rather than in the order the View option shows (129, 98, 21, 09). I need the order shown in the View option, as I am trying to merge with another data set and need the correct ("original") row number.
I cannot add row numbers before making changes to the data set, as the function doesn't work when I add the ID number.
Alternatively, being able to sort the data by row number would also help, but I don't know how to do that either (clicking on the arrow above the row number does nothing).
A bit of context
I am classifying network nodes in R. I made a matrix from the network's nodes and edges (using node2vec) and have to merge this matrix with a node labels data set (which contains one variable showing whether nodes are positive or negative). The picture above shows the created matrix; the original row numbers from the network data set are no longer in the original order. I need to add a variable to the matrix, which I converted to a data frame using:
netdf1 <- as.data.frame(network.node2vec)
that represents the original row number
what I tried
netdf1 <- netdf1 %>% mutate(id = row_number())
This just adds the row number as the rows are currently ordered so 1,2,3,4...
WHAT WORKED IN THE END == CORRECT ANSWER
db$ID <- rownames(db)
If I understand your question correctly, you have a data frame whose row names are not continuous, and you want those row names in an extra column as numeric values?
You can use the row.names() function and convert the result to numeric if you like:
# just creating a DF that might show what you mean:
testDF <- data.frame(x = 1:10, y = sample((1:1000), 10))
testDF <- testDF[testDF$y < 500,]
View(testDF)
# one possible way to get the row names
testDF$rowNum <- as.numeric(row.names(testDF))
Type ?sort at the console if you would like to learn about sorting vectors.
Let's say you have a data frame with row names that are out of order:
my_data <- data.frame(row.names = 5:1,
V1 = 1:5)
#> my_data
# V1
#5 1
#4 2
#3 3
#2 4
#1 5
dplyr::row_number() will add row numbers based on the current sorting, not based on the row names. (A general practice in the tidyverse is to eschew keeping useful data in the row names and to instead incorporate any sorts of row ID info into a variable.)
So you could follow @user2554330's advice and add my_data$ID <- row.names(my_data), or use the tidyverse equivalent, my_data %>% tibble::rownames_to_column(var = "ID"), and then sort by that column.
library(dplyr)  # provides %>% and arrange()
my_data %>%
  tibble::rownames_to_column(var = "ID") %>%
  arrange(ID)
ID V1
1 1 5
2 2 4
3 3 3
4 4 2
5 5 1
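One caveat worth sketching: row names are stored as character strings, so sorting an ID column taken directly from them is alphabetical ("129" comes before "21"). The base R example below, using made-up row names modeled on the ones in the question, converts the IDs to integers before sorting. It assumes every row name is a valid whole number; the name sorted is illustrative.

```r
# Hypothetical data frame whose row names are out of order, as in the question.
my_data <- data.frame(row.names = c("129", "98", "21", "9"), V1 = 1:4)

# Copy the row names into a column and convert them from character to integer,
# so that later sorting is numeric rather than alphabetical.
my_data$ID <- as.integer(row.names(my_data))

# Reorder the rows by the numeric ID.
sorted <- my_data[order(my_data$ID), ]
print(sorted)
```

With the ID stored as a regular column, it can also serve as the key when merging with another data set.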

Viewing single column of data frame in R [duplicate]

I am running a simulation model that creates a large data frame as its output, with each column corresponding to the time-series of a particular variable:
data5<-as.data.frame(simulation3$baseline)
Occasionally I want to look at subsets, especially particular columns, of this data frame in order to get an idea of the output. For this I am using the View-function like so
View(data5[1:100,1])
for instance, if I wish to see the first 100 rows of column 1. Alternatively, I also sometimes do something like this, using the names of the time series:
timeframe=1:100
toAnalyse=c("u","u_n","u_e","u_nw")
View(data5[timeframe,toAnalyse])
In either case, there is an annoying display problem when I am trying to view a single column on its own (as for instance with View(data5[1:100,1])), whereby what I get looks like this:
Example 1
As you can see, the top of the table which would usually contain the name of the variable in the dataset instead contains a string of all values that the variable takes. This problem does not appear if I select 2 or more columns:
Example 2
Does anyone know how to get rid of this issue? Is there some argument that I can feed to View to make sure that it behaves nicely when I ask it to just show a single column?
View(data5[1:100,1, drop=FALSE])
When you access a single column of a data frame with [rows, column] indexing, it is converted to a vector by default; drop=FALSE prevents that and retains the column name.
For instance:
> df
n s b
1 2 aa TRUE
2 3 bb TRUE
3 5 cc TRUE
> df[, 1]
[1] 2 3 5
> df[, 1, drop=FALSE]
n
1 2
2 3
3 5
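The difference can be checked directly, as in the short sketch below, which rebuilds the small example data frame above and inspects what each indexing form returns. The names v, one_col and also_df are illustrative.

```r
# Rebuild the example data frame shown above.
df <- data.frame(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, TRUE, TRUE))

# Matrix-style indexing of one column drops to a plain vector.
v <- df[, 1]
print(class(v))

# drop = FALSE keeps a one-column data frame with its column name.
one_col <- df[, 1, drop = FALSE]
print(names(one_col))

# List-style indexing with a single index also returns a one-column data frame.
also_df <- df[1]
print(is.data.frame(also_df))
```

Either of the last two forms can be passed to View() to get a properly labeled single column.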

R: create a new data frame whose rows are the column names of another data frame

Is there a simple one liner to create a new dataframe based on an original dataframe where the rownames (or at least first row) comes from the column names in the original dataframe?
for example:
Original <- data.frame("A"=c("apples", "aligator", "algebra"), "B"=c("Banana", "Beans", "Baby"))
Gives:
A B
1 apples Banana
2 aligator Beans
3 algebra Baby
What I want is:
A
B
Actually figured it out - was very simple.
NewDataFrame <- data.frame(colnames(Original))
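A small variation, sketched below, gives the resulting column an explicit name instead of the automatically generated one. The column name name is an arbitrary choice for illustration.

```r
# The example data frame from the question.
Original <- data.frame(A = c("apples", "aligator", "algebra"),
                       B = c("Banana", "Beans", "Baby"),
                       stringsAsFactors = FALSE)

# One row per column of Original; the column is explicitly called "name".
NewDataFrame <- data.frame(name = colnames(Original),
                           stringsAsFactors = FALSE)
print(NewDataFrame)
```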

sum across columns within rows for all columns that start with a specific character string in R [duplicate]

I have a data frame with a set of species IDs in the ID column, and sample IDs as separate columns with the motif CA_**. The data look like this:
ID <- c('A','B','C')
CA_01 <- c(3,9,54)
CA_56 <- c(2,7,12)
CA_92 <- c(45,4,47)
d<- data.frame(ID,CA_01,CA_56,CA_92)
ID CA_01 CA_56 CA_92
A 3 2 45
B 9 7 4
C 54 12 47
I want to sum across the columns within each row and generate a new column holding the total abundance of each species ID across the sample columns (final values 50, 20, 113). There are many other columns in my real data frame, and I only want to sum across columns that start with CA_**.
NOTE: this is different from the question asked here, as that asker knows the positions of the columns to sum. In my example I only know that the columns start with the motif CA_; I don't know their positions. It's also different from the question here, as I specifically ask how to sum across columns based on the grep command.
We can use grep to subset the columns having column names that start with CA_ and get the sum of the rows with rowSums.
d$newCol <- rowSums(d[grep('^CA\\_', names(d))])
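As a self-contained check of the approach, the sketch below rebuilds the example data, adds one made-up unrelated column (other) to mimic the "many other columns" in the real data, and selects the sample columns by prefix. It uses base R's startsWith() as an alternative to grep with a regular expression; the names ca_cols and total are illustrative.

```r
# Example data from the question, plus a hypothetical unrelated column.
d <- data.frame(ID = c("A", "B", "C"),
                CA_01 = c(3, 9, 54),
                CA_56 = c(2, 7, 12),
                CA_92 = c(45, 4, 47),
                other = c(100, 200, 300))

# Logical vector marking the columns whose names begin with "CA_".
ca_cols <- startsWith(names(d), "CA_")

# Sum only those columns within each row.
d$total <- rowSums(d[ca_cols])
print(d)
```

Because the selection is made by name, the result does not depend on where the CA_ columns sit in the data frame.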

How to subset a data frame by taking only the Non NA values of 2 columns in this data frame

I am trying to subset a data frame by keeping only the rows with non-NA values in 2 columns of my data frame:
Subs1<-subset(DATA,DATA[,2][!is.na(DATA[,2])] & DATA[,3][!is.na(DATA[,3])])
but it gives me an error: longer object length is not a multiple of shorter object length.
How can I construct a subset composed of the non-NA values of column 2 AND column 3?
Thanks a lot!
Try this:
Subs1<-subset(DATA, (!is.na(DATA[,2])) & (!is.na(DATA[,3])))
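The second argument of subset is a logical vector of length nrow(DATA), indicating whether to keep the corresponding row. Your version runs into trouble because DATA[,2][!is.na(DATA[,2])] drops the NA entries, so the two vectors can differ in length from each other and from nrow(DATA).
The na.omit function can also answer your question. Note that the call below returns only columns 2 and 3: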
Subs1 <- na.omit(DATA[2:3])
[https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html]
Here is an example.
a, b and c are 3 vectors; a and b each have a missing value.
Once they are created, I use cbind to bind them into one matrix, which is then converted to a data frame.
The result is a data frame in which 2 of the 3 columns have a missing value.
So we need to keep only the rows with complete cases. DATA[complete.cases(DATA), ] keeps only the rows that have no missing values in any column; the subset object holds these rows.
a <- c(1,NA,2)
b <- c(NA,1,2)
c <- c(1,2,3)
DATA <- as.data.frame(cbind(a,b,c))
subset <- DATA[complete.cases(DATA), ]
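If the goal is to test only columns 2 and 3 while keeping every column of the result, one possible sketch is to pass just those two columns to complete.cases() and use the result as a row filter. The data below are made up for illustration, with a fourth column that contains extra NA values; the names keep and Subs1 are illustrative.

```r
# Hypothetical data: NA values in columns 2 and 3, and more in column 4.
DATA <- data.frame(id = 1:4,
                   a  = c(1, NA, 2, 3),
                   b  = c(NA, 1, 2, 3),
                   c  = c(NA, NA, NA, 4))

# TRUE for rows where neither column 2 nor column 3 is NA.
keep <- complete.cases(DATA[, 2:3])

# Keep those rows, with all columns retained.
Subs1 <- DATA[keep, ]
print(Subs1)
```

Unlike na.omit(DATA), this does not discard a row just because some other column (here c) has a missing value.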
