How to remove ending from sample names - r

I am trying to remove endings from sample names in my data frame. There are about 200 samples so I was hoping there was a way to end the name before the first - (common to each sample).
Examples of names are:
Glyc.1.20C.1wk-ATGGTTCACCCG-CATCAGTACGCC-R1.fastq
Glyc.1.20C.2m-CACTACGCTAGA-GTTCCTCCATTA-R1.fastq
Glyc.1.20C.2wk-GCTCGAAGATTC-CGAGGGAAAGTC-R1.fastq
Glyc.1.20C.3m-GTAGGTGCTTAC-GCATAAACGACT-R1.fastq
Renaming them by hand with colnames(x) <- c("Glyc.1.20C.1wk", etc.) would take me forever.
Any ideas?

If df is your data frame, take the names, remove everything from the first - onward, and assign the shortened names back:
names(df) <- gsub("\\-.+","",names(df))
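As a quick check of the approach above, here is a minimal sketch on two of the sample names from the question. The data frame contents are hypothetical placeholders, and sub() with the pattern "-.*$" is used as an equivalent way to cut each name at its first hyphen.

```r
# Two sample names copied from the question
sample_names <- c(
  "Glyc.1.20C.1wk-ATGGTTCACCCG-CATCAGTACGCC-R1.fastq",
  "Glyc.1.20C.2m-CACTACGCTAGA-GTTCCTCCATTA-R1.fastq"
)

# Toy data frame (hypothetical values) whose columns carry those names
df <- data.frame(matrix(0, nrow = 2, ncol = 2))
names(df) <- sample_names

# Remove the first "-" and everything after it
names(df) <- sub("-.*$", "", names(df))
print(names(df))
```

If two samples share the same text before the first hyphen, this would produce duplicate column names, so it may be worth checking anyDuplicated(names(df)) afterwards.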

Related

Extracting information between special characters in a column in R

I'm sorry because I feel like versions of this question have been asked many times, but I simply cannot find code from other examples that works in this case. I have a column where all the information I want is stored in between two sets of "%%", and I want to extract the information between those two sets of "%%" and put it into a new column, in this case called df$empty.
This is a long column, but in all cases I just want the information between the two sets of "%%". Is there a way to code this out across the whole column?
To be specific, I want in this example a new column that will look like "information", "wanted".
empty <- c('NA', 'NA')
information <- c('notimportant%%information%%morenotimportant', 'ignorethis%%wanted%%notthiseither')
df <- data.frame(information, empty)
In this case you can do:
df$empty <- sapply(strsplit(df$information, '%%'), '[', 2)
# information empty
# 1 notimportant%%information%%morenotimportant information
# 2 ignorethis%%wanted%%notthiseither wanted
That is, split the text by '%%' and take second elements of the resulting vectors.
Or you can get the same result using sub():
df$empty <- sub('.*%%(.+)%%.*', '\\1', df$information)
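For reference, a self-contained sketch of the split-and-pick approach on the question's data. Two details here are assumptions rather than part of the original answer: stringsAsFactors = FALSE is set so the column is character even on older R versions, and fixed = TRUE treats "%%" as a literal string instead of a regular expression.

```r
information <- c("notimportant%%information%%morenotimportant",
                 "ignorethis%%wanted%%notthiseither")
df <- data.frame(information, stringsAsFactors = FALSE)

# Split each string on the literal "%%" delimiter
parts <- strsplit(df$information, "%%", fixed = TRUE)

# Take the second piece of each split; rows with no delimiter yield NA
df$empty <- vapply(parts, function(p) p[2], character(1))
print(df$empty)
```

vapply is used instead of sapply only so that the result is guaranteed to be a character vector.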

How to separate one column into many columns in a .txt file?

I've been given a data set for a project that I need to reformat in order to work with it.
The problem is that all of the column names and corresponding values are mashed into one column in the file, as shown in the picture.
I'm new to R so I hardly know how to work with complex commands.
My question:
Is there a simple way to separate this from 1 column into 12 columns?
For the desired output, I'll also need to remove the periods between the column names and the semicolons between the values.
I just need to be able to do basic statistical analysis on the table.
Thanks
Although your data is in one column, it is semicolon-separated. The read.csv function accepts a column separator:
df <- read.csv(file="path/to/your/file.txt", skip=1, header=FALSE, sep=";")
The above call will generate columns based on a ; separator. I skip the first line and set header=FALSE because that line is a single string. You may manually assign the column names via:
names(df) <- c("name1", "name2", ..., "name12")
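The picture is not available here, so the following is only a sketch on made-up data. It assumes a file whose first line holds the column names joined by periods and whose remaining lines hold semicolon-separated values, as the question describes. The three column names and the values are hypothetical, and text= stands in for the real file path.

```r
# Hypothetical file contents: period-joined header, then ";"-separated rows
lines <- c("id.height.group",
           "1;2.5;x",
           "3;4.5;y")

# Read the values, skipping the header line
df <- read.csv(text = paste(lines, collapse = "\n"),
               sep = ";", header = FALSE, skip = 1)

# Recover the column names by splitting the header line on literal periods
names(df) <- strsplit(lines[1], ".", fixed = TRUE)[[1]]
print(df)
```

With a real file, readLines("path/to/your/file.txt", n = 1) could supply the header line in place of lines[1]. This only works if no column name itself contains a period.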

integer function converting row names into numbers

I used this:
mydata3 <- data.frame(sapply(mydata2, as.integer))
But now I see that the row names, which are gene names, have been converted to numbers (1-200). I should point out that the same command worked well when I used it some time ago. I thought there might be a problem with my file, so I tried an old file on which this command had worked, but I see the same problem: the gene names are converted to numbers. Here is the full script:
countsTable<-read.table("JW.txt",header=TRUE,stringsAsFactors=TRUE,row.names=1)
mydata2 <- countsTable/1000
mydata3 <- data.frame(sapply(mydata2, as.integer))
str(mydata3)
Please let me know.
sapply works over the columns of your data.frame mydata2 and returns the respective output per column. As such, it does not return the row names of your data.frame, so you either have to re-assign those or assign the new column data back into your original data.frame, like:
mydata2[] <- sapply(mydata2, as.integer)
Thus you can keep all of the original attributes.
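A small sketch of that in-place assignment, using a made-up counts table (the gene names and values are hypothetical). It uses lapply rather than sapply, which is a common variant because it always returns one list element per column.

```r
# Hypothetical counts table with gene names as row names
mydata2 <- data.frame(sample1 = c(1.5, 2.7, 10.2),
                      sample2 = c(3.2, 4.9, 0.4),
                      row.names = c("geneA", "geneB", "geneC"))

# Assigning into mydata3[] replaces the column contents but keeps
# the data frame's attributes, including its row names
mydata3 <- mydata2
mydata3[] <- lapply(mydata3, as.integer)

str(mydata3)
print(rownames(mydata3))
```

Note that as.integer truncates toward zero rather than rounding, so 2.7 becomes 2.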

Column indexing based on row value

I have the data frame:
DT=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100))
The objective is to create a new column Quantity that selects the value for each row in the column equal to Price, such that:
DT.Objective=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100),
Quantity= c(400,200,200,100,100))
The dataset is very large, so efficiency is important. This is what I currently use, and I am looking to make it more efficient:
Names <- names(DT)
DT$Quantity<- DT[Names][cbind(seq_len(nrow(DT)), match(DT$Price, Names))]
For some reason the column names in the example come with an "X" in front of them, whereas in the actual data there is no X.
Cheers.
We can do this with row/column indexing after removing the prefix 'X' using sub or substring, and then do the match as shown in the OP's post:
DT$Quantity <- DT[cbind(1:nrow(DT), match(DT$Price, sub("^X", "", names(DT))))]
DT$Quantity
#[1] 400 200 200 100 100
The X is attached as a prefix when the column names start with numbers. One way to take care of this would be to use check.names=FALSE in the data.frame call or in read.csv/read.table.
@akrun is correct; check.names=TRUE is the default behavior for data.frame(). From the man page:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
If possible, you may want to make your column names a bit more descriptive.
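To illustrate the check.names point, here is a sketch on a cut-down version of the question's data (three rows and three price columns, with one Price changed to 2.0 for the demonstration). Building the lookup key with sprintf is an assumption added here, not part of the answers above: it formats each price with one decimal place so that it matches the column names exactly.

```r
# check.names = FALSE keeps "2.0", "2.1", ... instead of "X2.0", "X2.1", ...
DT <- data.frame(Row = 1:3, Price = c(2.1, 2.0, 2.2),
                 "2.0" = c(100, 300, 700),
                 "2.1" = c(400, 200, 100),
                 "2.2" = c(600, 700, 200),
                 check.names = FALSE)

# Format each price as text with one decimal place, e.g. 2 becomes "2.0"
key <- sprintf("%.1f", DT$Price)

# Two-column matrix of (row, column) positions, one pair per row of DT
idx <- cbind(seq_len(nrow(DT)), match(key, names(DT)))

# Matrix indexing picks one cell per row; all columns here are numeric
DT$Quantity <- as.matrix(DT)[idx]
print(DT$Quantity)
```

The sprintf step matters because match() converts the numeric price to text on its own, and a price of 2 would become "2", which does not match a column named "2.0".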

Select or extract a specific part of an element of a data frame

I want to get only a part of an element that is part of a data frame.
My dataframe has 1 column with 6000 rows looking like this:
chr5_122424840_122523745_NM_001136239_mRNA
chr17_38632079_38657854_NM_032865_mRNA
I want to obtain a new data frame with only
NM_001136239
NM_032865
I've tried split followed by paste, but that doesn't work because it drops the zeros when pasting (NM_1136239 instead of NM_001136239).
I've also tried stri_sub and substr, but the length before NM is not the same in each row. I also tried gsub, but I don't know how to do it.
Thank you very much for your help; I hope I've been specific enough.
This should work
Data
df <- data.frame(col=c("chr5_122424840_122523745_NM_001136239_mRNA",
"chr17_38632079_38657854_NM_032865_mRNA"))
Code
df$col <- sub(".*(NM.*)_mRNA", "\\1", df$col)
Just as long as the strings have NM and end in _mRNA
There are many ways to do this. Here is one that uses the stringr library. I recommend it simply because the code is easier to understand:
library(stringr)
patron <- "NM_[0-9]+" # pattern: NM_ followed by one or more digits
str_extract(your_data_frame$your_column, patron) # returns a character vector; assign it to a column if needed
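If stringr is not available, roughly the same extraction can be sketched in base R with regexpr and regmatches. The column name id in the result is a hypothetical choice.

```r
x <- c("chr5_122424840_122523745_NM_001136239_mRNA",
       "chr17_38632079_38657854_NM_032865_mRNA")

# regexpr locates the first match in each string;
# regmatches extracts the matched text as character, so leading zeros survive
ids <- regmatches(x, regexpr("NM_[0-9]+", x))

# New data frame holding only the accession IDs
result <- data.frame(id = ids, stringsAsFactors = FALSE)
print(result)
```

One difference from str_extract: regmatches drops strings with no match instead of returning NA, so the result can be shorter than the input when some rows lack an NM_ identifier.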
