I have a column of a data frame that has thousands of complicated sample names like this:
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
I am trying with no success to change the sample names to achieve the following sample names
16.3R1, 16.3R2, 2.3R1,2.3R2
I am thinking of solving the problem with gsub or stringr.
Any suggestions? I have tried gsub but could not get the desired names.
You can use sub to extract the parts:
sample <- c("16_3_S16_R1_001","16_3_S16_R2_001","2_3_S2_R1_001","2_3_S2_R2_001")
sub('(\\d+)_(\\d+)_.*(R\\d+).*', '\\1.\\2\\3', sample)
#[1] "16.3R1" "16.3R2" "2.3R1" "2.3R2"
\\d+ matches one or more digits. The values captured between () are called capture groups. Here we capture one or more digits (group 1), followed by an underscore and one or more digits (group 2), and finally "R" followed by one or more digits (group 3). The captured values are referred to with backreferences, so \\1 is the first group, \\2 the second, and so on.
If you split the string sample into substrings according to the pattern "_", you need only the 1st, 2nd and 4th parts:
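To see what each capture group actually holds, a minimal sketch using base R's regexec and regmatches can help. The two-element vector is just a shortened copy of the sample data, and the variable names pattern and groups are chosen here for illustration.

```r
# Shortened copy of the sample names from the question
sample <- c("16_3_S16_R1_001", "2_3_S2_R2_001")

# Same pattern as in the sub() call
pattern <- "(\\d+)_(\\d+)_.*(R\\d+).*"

# regexec() records the full match plus one entry per capture group
groups <- regmatches(sample, regexec(pattern, sample))

# Element 1 is the full match; elements 2 to 4 are groups \\1, \\2, \\3
groups[[1]]
```

Each list element lines up with one input string, so this is a quick way to debug a pattern before using it in sub.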
sample <- c("16_3_S16_R1_001",
"16_3_S16_R2_001",
"2_3_S2_R1_001",
"2_3_S2_R2_001")
x <- strsplit(sample, "_")
sapply(x, function(y) paste0(y[1], ".", y[2], y[4]))
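A slightly stricter variant of the same split-and-paste idea, as a sketch: vapply guarantees a character vector of the expected length, and fixed = TRUE treats "_" as a literal string rather than a regex. The names parts and new_names are illustrative only.

```r
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001",
            "2_3_S2_R1_001", "2_3_S2_R2_001")

# Split on the literal underscore
parts <- strsplit(sample, "_", fixed = TRUE)

# Keep pieces 1, 2 and 4; vapply fails loudly if a result is not one string
new_names <- vapply(parts,
                    function(p) paste0(p[1], ".", p[2], p[4]),
                    character(1))
new_names
```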
Here is one way you could do it.
It helps to create a data frame with a named column, which is what I did below; I called the column "cats".
trial <- data.frame( "cats" = character(0))
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
df <- data.frame("cats" = x)
The data needs to be in the right structure; in our case, a factor (as.factor()).
df$cats <- as.factor(df$cats)
levels(df$cats)[levels(df$cats)=="16_3_S16_R1_001"] <- "16.3R1"
levels(df$cats)[levels(df$cats)=="16_3_S16_R2_001"] <- "16.3R2"
levels(df$cats)[levels(df$cats)=="2_3_S2_R1_001"] <- "2.3R1"
levels(df$cats)[levels(df$cats)=="2_3_S2_R2_001"] <- "2.3R2"
And voilà
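Renaming levels one by one does not scale to thousands of sample names. As a sketch, the same factor-based idea can be combined with a regular expression applied to all levels at once; the pattern here is an assumption based on the four example names.

```r
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001",
       "2_3_S2_R1_001", "2_3_S2_R2_001")
df <- data.frame(cats = factor(x))

# Rewrite every level in a single step instead of one assignment per level
levels(df$cats) <- sub("(\\d+)_(\\d+)_.*(R\\d+).*", "\\1.\\2\\3",
                       levels(df$cats))
as.character(df$cats)
```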
I'd like to extract the first and second values from a list of lists. I was able to extract the first value with no issue. However, I get an error when I try to extract the second value, because not all lists in the suggestion column have more than one value. How can I extract the second value from the suggestion column in mydf_1 and generate NA for those with no second value?
Below is the code I wrote to get the first suggestion, but when I do
mydf_1$second_suggestion <- lapply(mydf_1$suggestion, `[[`, 2)
it gives this error:
Error in FUN(X[[i]], ...) : subscript out of bounds
Thanks.
# create a data frame contains words
mydf <- data.frame("words"=c("banna", "pocorn and drnk", "trael", "rabbitt",
"emptey", "ebay", "templete", "interne", "bing",
"methog", "tullius"), stringsAsFactors=FALSE)
# add a custom word to the dictionary
library(hunspell)
mydict_hunspell <- dictionary(lang="en_US", affix=NULL, add_words="bing",
cache=TRUE)
# use hunspell to identify misspelled words and create a row number column
# for later uses
mydf$words_checking <- hunspell(mydf$words, dict=mydict_hunspell)
mydf$row_num <- rownames(mydf)
# unlist the words_checking column and get suggestions for those misspelled
# words in another data frame
library(tidyr)
mydf_1 <- unnest(mydf, words_checking)
mydf_1$suggestion <- hunspell_suggest(mydf_1$words_checking)
# extract first suggestion from suggestion column
mydf_1$first_suggestion <- lapply(mydf_1$suggestion, `[[`, 1)
You can check the length of each list first before trying to extract the element of interest. Also, I recommend using sapply so that you have a character vector returned, as opposed to another list.
For the first suggestion:
index <- 1
sapply(mydf_1$suggestion, function(x) {if(length(x) < index) {NA} else {x[[index]]}})
And for the second suggestion and so on:
index <- 2
sapply(mydf_1$suggestion, function(x) {if(length(x) < index) {NA} else {x[[index]]}})
This could be wrapped into a larger function with a bit more code if you need to automate...
In theory, you could test with is.null (see "How to test if list element exists?"), but I still got the same error when trying that approach.
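The length check above can be wrapped into a small helper. This is only a sketch: nth_or_na is a hypothetical name, and the suggestions list is made-up stand-in data, so the example runs without hunspell.

```r
# Hypothetical helper: return the index-th element of each list entry,
# or NA when that entry is too short
nth_or_na <- function(lst, index) {
  vapply(lst,
         function(x) if (length(x) < index) NA_character_ else x[[index]],
         character(1))
}

# Stand-in for mydf_1$suggestion (made-up values)
suggestions <- list(c("banana", "bandana"), "popcorn", character(0))

nth_or_na(suggestions, 1)
nth_or_na(suggestions, 2)
```

Using vapply with NA_character_ keeps the result a plain character vector, which is easier to store in a data frame column than a list.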
I'd like to insert an underscore after the first three characters of all variable names in a data frame. Any help would be much appreciated.
Current data frame:
df1 <- data.frame("genCrc_b1"=c(1,1,1),"genprd"=c(1,1,1) ,"genopr_b1_b2"=c(1,1,1))
Desired data frame:
df2 <- data.frame("gen_Crc_b1"=c(1,1,1),"gen_prd"=c(1,1,1) ,"gen_opr_b1_b2"=c(1,1,1))
My attempts:
gsub('^(.{3})(.*)$', "_", names(df1))
gsub('^(.{3})(.*)$', '\\_\\2', names(df1))
We can use sub to capture the first 3 characters as a group ((.{3})) and in the replacement specify the backreference of the group (\\1) followed by underscore
names(df1) <- sub("^(.{3})", "\\1_", names(df1))
names(df1)
#[1] "gen_Crc_b1" "gen_prd" "gen_opr_b1_b2"
In the OP's second attempt there are two capture groups, but only one backreference (\\2) was used in the replacement. With both backreferences it works:
gsub('^(.{3})(.*)$', '\\1_\\2', names(df1))
BTW, gsub is not needed, as we are replacing only a single instance rather than multiple ones.
In the first attempt, none of the backreferences for the captured groups were used in the replacement.
If your variable names all begin with gen, we can also do the following.
colnames(df1) <- gsub("gen", "gen_", colnames(df1), fixed = TRUE)
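One caveat worth illustrating: gsub replaces every occurrence of "gen", not just the leading one. The sketch below uses a made-up third name, "genogen_b2", to show the difference from an anchored sub; the variable names are illustrative.

```r
# The third name is hypothetical, chosen to contain "gen" twice
nms <- c("genCrc_b1", "genprd", "genogen_b2")

# gsub with fixed = TRUE touches every "gen"
all_gen <- gsub("gen", "gen_", nms, fixed = TRUE)

# sub anchored with ^ only touches the prefix
first_gen <- sub("^gen", "gen_", nms)

all_gen
first_gen
```

For the column names in the question both calls give the same result, since "gen" appears only once in each name.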
You can also use regmatches<- to replace the sub-expressions.
regmatches(names(df1), regexpr("gen", names(df1), fixed=TRUE)) <- "gen_"
Now, check that the values have been properly changed.
names(df1)
[1] "gen_Crc_b1" "gen_prd" "gen_opr_b1_b2"
Here, regexpr finds the first position in each element of the character vector that matches the subexpression, "gen". These positions are fed to regmatches and the substitution is performed.
I've been profiting from SO for quite a while and have now decided to sign up and try to a) help others and b) get help from great guys :)
So, coming to my question: I have a vector extracted from a data frame that looks like this (just a little subset of the data):
cho <- c("[M-H]: C4H4O2",
"[M+Hac-H]: C5H10O6",
"[M-H]: C6H4O3",
"[M+Fa-H]: C7H6O",
"[M-H]: C9H8O3",
"[M-H]: C18H30O3");
Now from this vector I want to extract the numbers in order to get the number of "C", "H", and "O" atoms:
temp <- strsplit(cho, "[^[:digit:]]");
temp <- as.numeric(unlist(temp));
#remove NAs
temp <- temp[!is.na(temp)];
#split into three column matrix and convert to df to merge with original df
temp <- as.data.frame(matrix(temp, ncol = 3, byrow = T));
R recycles the data to fill the matrix here. With my larger data set the temp vector is long enough for the matrix to be generated, but the result is a mess. This is due to cases such as "[M+Fa-H]: C7H6O", where only two numbers can be extracted. How can I get a "1" after the "O" so that three numbers are extracted instead of two? Is there a workaround for this?
Thanks a lot in advance for your help!
We can use str_extract_all. Use a regex lookbehind to match one or more digits (\\d+) that follow a C, H, or O, extract those numbers into a list, and convert them to integer.
library(stringr)
lst <- lapply(str_extract_all(cho, "(?<=C|H|O)\\d+"), as.integer)
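The extraction above still returns only two numbers for "[M+Fa-H]: C7H6O". One possible way to treat an atom without a number as a count of 1 is sketched below. It assumes the formula always follows ": " and contains only the one-letter elements C, H and O; count_atom is a hypothetical helper name.

```r
cho <- c("[M-H]: C4H4O2", "[M+Hac-H]: C5H10O6", "[M-H]: C6H4O3",
         "[M+Fa-H]: C7H6O", "[M-H]: C9H8O3", "[M-H]: C18H30O3")

# Drop the adduct prefix so that the H in "[M-H]" is not counted
formula <- sub("^.*: ", "", cho)

# Hypothetical helper: 0 if the atom is absent, 1 if it has no number,
# otherwise the number that follows it
count_atom <- function(formula, atom) {
  hits <- regmatches(formula, regexec(paste0(atom, "(\\d*)"), formula))
  vapply(hits, function(h) {
    if (length(h) == 0) return(0L)
    if (h[2] == "") return(1L)
    as.integer(h[2])
  }, integer(1))
}

atoms <- data.frame(C = count_atom(formula, "C"),
                    H = count_atom(formula, "H"),
                    O = count_atom(formula, "O"))
atoms
```

Two-letter element symbols such as Cl would need a more careful pattern; that case is outside what this sketch handles.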
Or a base R option is
# C, H and O are required literals so that multi-digit counts such as H10 stay intact
read.csv(text=sub(".*C(\\d+)H(\\d+)O(\\d*).*",
"\\1,\\2,\\3", cho), header=FALSE, fill=TRUE)
I have a data.frame (PC) that looks like this:
http://i.stack.imgur.com/NWJKe.png
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
http://i.stack.imgur.com/vQ48u.png
I want to sort the columns (beginning with "GTEX.") in the data.frame such that they are ordered by the age indicated in the age matrix.
PC <- read.csv("protein_coding.csv")
age <- read.table("Annotations_SubjectPhenotypes_DS.txt")
I started by changing the names in the age matrix to replace the '-' by '.':
new_SUBJID <- gsub("-", ".", age$SUBJID, fixed = TRUE)
age[, "SUBJID"] <- new_SUBJID
Then, I ordered the rows of the age matrix (identified by SUBJID) by age:
sort.age <- with(age, age[order(AGE) , ])
sort.age <- na.omit(sort.age)
I then created a vector age.id containing the SUBJIDs in the right order (= how I want to order the columns from the PC matrix).
age.id <- sort.age$SUBJID
But now I am stuck, since the names in the PC matrix and the age matrix are not the same... Could someone please help me?
Thank you very much in advance!
Svalf
It would have been better to show the example without using an image. Suppose there are two character vectors,
str1 <- c('GTEX.N7MS.0007.SM.2D7W1', 'GTEX.PFPP.0007.SM.2D8W1', 'GTEX.N7MS.0008.SM.4E3J1')
str2 <- c('GTEX.N7MS', 'GTEX.PFPP')
representing the column names of 'PC' and the 'SUBJID' column of the 'age' dataset (after replacing the - with . and sorting by age). We remove the suffix by matching a . followed by 4 digits (\\d{4}) and any remaining characters up to the end of the string (.*$), and replace it with ''. Matching the shortened names against str2 then gives the desired order.
str1N <- sub('\\.\\d{4}.*$', '', str1)
str1[order(match(str1N, str2))]
#[1] "GTEX.N7MS.0007.SM.2D7W1" "GTEX.N7MS.0008.SM.4E3J1"
#[3] "GTEX.PFPP.0007.SM.2D8W1"
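To carry the same idea over to the data frame itself, a minimal sketch follows. The PC data frame and the gene column here are made up for illustration, and age.id stands in for the sorted SUBJID vector from the question.

```r
# Made-up stand-in for the real PC data frame
PC <- data.frame(gene = c("g1", "g2"),
                 GTEX.PFPP.0007.SM.2D8W1 = c(1, 2),
                 GTEX.N7MS.0007.SM.2D7W1 = c(3, 4),
                 GTEX.N7MS.0008.SM.4E3J1 = c(5, 6))

# Stand-in for the SUBJIDs already sorted by age
age.id <- c("GTEX.N7MS", "GTEX.PFPP")

# Only the GTEX columns are reordered; other columns stay in front
is_gtex <- startsWith(names(PC), "GTEX.")
gtex_cols <- names(PC)[is_gtex]

# Strip the sample suffix to recover the subject ID of each column
subj <- sub("\\.\\d{4}.*$", "", gtex_cols)

# Order columns by the position of their subject ID in age.id
ordered_cols <- gtex_cols[order(match(subj, age.id))]
PC_sorted <- PC[, c(names(PC)[!is_gtex], ordered_cols)]
names(PC_sorted)
```

Columns whose subject ID is missing from age.id get NA from match and end up last, since order places NA values at the end by default.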
I have a data.frame with one column of Sample.Names. The names contain information about where the samples are from, for instance c(RT4_4, RT3_6, RT4_2, RT3_9, RT5_5). RTx is the name of the site they are from, followed by a number.
I want a new column that tells me where they are from. If Sample.Names contains RT4 -> df$Site == RT4
I don't know if there is a function that allows you to look at part of the name. My idea was
df$Site <- with(df, ifelse(df$Sample.Name %in% "RT4","RT4", ifelse(df$Sample.Name %in% "RT3","RT3","RT5")))
This doesn't work.
You can use sub:
df$Site <- sub("_.+", "", df$Sample.Name)
This works with numbers consisting of multiple digits too.
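A small sketch of the sub approach on the example names, with one made-up multi-digit entry ("RT4_12") added. It also shows why the %in% attempt fails: %in% tests for exact equality, whereas startsWith checks a prefix.

```r
df <- data.frame(Sample.Name = c("RT4_4", "RT3_6", "RT4_2",
                                 "RT3_9", "RT5_5", "RT4_12"),
                 stringsAsFactors = FALSE)

# Remove the underscore and everything after it
df$Site <- sub("_.+", "", df$Sample.Name)
df$Site

# Exact matching never succeeds on the full sample name
exact <- df$Sample.Name %in% "RT4"

# Prefix matching does what the original attempt intended
prefix <- startsWith(df$Sample.Name, "RT4")
```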