Extracting three numbers from a string (missing number?)

I've been benefiting from SO for quite a while and have now decided to sign up to a) help others and b) get help from the great people here :)
On to my question: I have a vector extracted from a data frame that looks like this (just a small subset of the data):
cho <- c("[M-H]: C4H4O2",
"[M+Hac-H]: C5H10O6",
"[M-H]: C6H4O3",
"[M+Fa-H]: C7H6O",
"[M-H]: C9H8O3",
"[M-H]: C18H30O3");
Now from this vector I want to extract the numbers in order to get the number of "C", "H", and "O" atoms:
temp <- strsplit(cho, "[^[:digit:]]");
temp <- as.numeric(unlist(temp));
#remove NAs
temp <- temp[!is.na(temp)];
#split into three column matrix and convert to df to merge with original df
temp <- as.data.frame(matrix(temp, ncol = 3, byrow = T));
In this small example R recycles the data to fill the matrix. With my larger data set the temp vector happens to be long enough for the matrix to be generated, but its contents are a mess. The cause is entries such as "[M+Fa-H]: C7H6O", where only two numbers can be extracted. How can I get a "1" after the "O" so that three numbers are extracted instead of two? Is there a workaround for this?
Thanks a lot in advance for your help!

We can use str_extract_all. Use a regex lookbehind to match one or more digits (\\d+) that follow a C, H, or O, extract those numbers into a list, and convert them to integer. Note that an entry such as "C7H6O" still yields only two numbers here.
library(stringr)
lst <- lapply(str_extract_all(cho, "(?<=C|H|O)\\d+"), as.integer)
Or a base R option is
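# C, H and O are matched literally so the greedy .* cannot swallow leading digits;
# an omitted count (as in "C7H6O") leaves an empty third field, which is read as NA
read.csv(text=sub(".*C(\\d+)H(\\d+)O(\\d*).*",
"\\1,\\2,\\3", cho), header=FALSE, fill=TRUE)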

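Neither option above fills in the implicit count of 1 by itself. A minimal base R sketch that does, assuming every formula lists C, H and O in that order and that an omitted count means 1 (the names formula, parts and counts are illustrative, not from the original answers):

```r
cho <- c("[M-H]: C4H4O2", "[M+Hac-H]: C5H10O6", "[M-H]: C6H4O3",
         "[M+Fa-H]: C7H6O", "[M-H]: C9H8O3", "[M-H]: C18H30O3")

# Drop the adduct prefix so letters such as the H in "[M-H]" cannot interfere
formula <- sub("^.*: *", "", cho)

# Capture the (possibly empty) run of digits after each element symbol
parts <- regmatches(formula, regexec("C([0-9]*)H([0-9]*)O([0-9]*)", formula))

# Assumption: an empty capture stands for an implicit count of 1
counts <- t(sapply(parts, function(p) {
  n <- p[-1]          # drop the full match, keep the three groups
  n[n == ""] <- "1"
  as.integer(n)
}))
colnames(counts) <- c("C", "H", "O")
counts
```

The resulting three-column matrix has one row per entry of cho, so it can be column-bound to the original data frame with cbind().
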
Related

Change complicated strings in R with qsub or R-strings

I have a column of a data frame that contains thousands of complicated sample names like this:
sample <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
I am trying, without success, to change the sample names to the following:
16.3R1, 16.3R2, 2.3R1, 2.3R2
I am thinking of solving the problem with gsub or stringr.
Any suggestions? I have tried gsub but could not get the desired names.

You can use sub to extract the parts:
sample <- c("16_3_S16_R1_001","16_3_S16_R2_001","2_3_S2_R1_001","2_3_S2_R2_001")
sub('(\\d+)_(\\d+)_.*(R\\d+).*', '\\1.\\2\\3', sample)
#[1] "16.3R1" "16.3R2" "2.3R1" "2.3R2"
\\d+ matches one or more digits. The parts of the pattern enclosed in () are called capture groups. Here we capture one or more digits (group 1), followed by an underscore and another run of digits (group 2), and finally "R" with its digits (group 3). The captured values are referred to with backreferences, so \\1 is the first value, \\2 the second, and so on.
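To see what each capture group holds before substituting, the groups can be pulled out individually. A small sketch using base R's regexec() and regmatches() on two of the sample names (sample_names, groups and short are illustrative names):

```r
sample_names <- c("16_3_S16_R1_001", "2_3_S2_R2_001")
pattern <- "(\\d+)_(\\d+)_.*(R\\d+).*"

# regexec() records the full match plus one entry per capture group
groups <- regmatches(sample_names, regexec(pattern, sample_names))
groups[[1]]   # full match first, then groups 1 to 3: "16", "3", "R1"

# Rebuilding the short name by hand mirrors the '\\1.\\2\\3' replacement
short <- sapply(groups, function(g) paste0(g[2], ".", g[3], g[4]))
short
```
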

If you split each string in sample into substrings at "_", you need only the 1st, 2nd and 4th parts:
sample <- c("16_3_S16_R1_001",
"16_3_S16_R2_001",
"2_3_S2_R1_001",
"2_3_S2_R2_001")
x <- strsplit(sample, "_")
sapply(x, function(y) paste0(y[1], ".", y[2], y[4]))

Here is one way you could do it.
It helps to create a data frame with a named column, which is what I did below; I called the column "cats".
trial <- data.frame( "cats" = character(0))
x <- c("16_3_S16_R1_001", "16_3_S16_R2_001", "2_3_S2_R1_001","2_3_S2_R2_001")
df <- data.frame("cats" = x)
The data needs to be in the right structure, in our case a factor (via as.factor()), so that each level can be renamed:
df$cats <- as.factor(df$cats)
levels(df$cats)[levels(df$cats)=="16_3_S16_R1_001"] <- "16.3R1"
levels(df$cats)[levels(df$cats)=="16_3_S16_R2_001"] <- "16.3R2"
levels(df$cats)[levels(df$cats)=="2_3_S2_R1_001"] <- "2.3R1"
levels(df$cats)[levels(df$cats)=="2_3_S2_R2_001"] <- "2.3R2"
And voilà

How to extract numeric vectors from R data frame string columns and save as columns (lists) with vectors

Imagine an R data frame with numerous string columns, each row of which contains a sequence of numbers (in scientific notation) interspersed with some characters. Here is a simplified example:
df <- data.frame(id = 1:3,
vec1 = c("[a-4.16121967e-02 b4.51207198e-02 c-7.89282843e-02 d4.02516453e-03]",
"[a-7.52146867e-02 b3.78264938e-02 c-1.03749274e-02 d4.02516453e-03]",
"[a-2.13926377e-02 b9.27949827e-02 c-5.89836483e-02 d2.44455224e-03]"),
vec2 = c("[a-4.16121967e-02 b4.51207198e-02 c-7.89282843e-02 d4.02516453e-03]",
"[a-7.40210414e-02 b1.75862815e-02 c-1.03749274e-02 d4.02516453e-03]",
"[a-6.73705637e-02 b9.27949827e-02 c-8.35041553e-02 d2.44455224e-03]"))
I'm looking for a fast solution, preferably with dplyr, that converts the vector columns into list columns holding one numeric vector per row. (The data frame I'm working with contains many more, and much larger, vectors.)
So far I managed to remove the unnecessary characters and separate the vector elements by commas like this:
library(dplyr)
library(stringr)
mutate(df,
vec1 = str_replace_all(vec1, "\\[|\\]|a|b|c|d", ""),
vec1 = str_replace_all(vec1, " ", ","),
vec2 = str_replace_all(vec2, "\\[|\\]|a|b|c|d", ""),
vec2 = str_replace_all(vec2, " ", ","))
Maybe there's a better and more elegant solution for this step. While we're at it: I actually wonder how to do this with mutate_at() and starts_with("vec") in order to fix all my columns at once.
More importantly, I'm struggling with the conversion to numeric vectors. The result should be two list columns, each holding one numeric vector of four elements per row. I only managed to extract and convert single vectors like this:
as.numeric(unlist(strsplit(df[1,'vec1'], ",")))
However, I'd like to avoid a loop through all the vectors. Any help is highly appreciated.

We can use mutate_at to apply a function to multiple columns: remove the letters a-d and the square brackets ([]) with gsub, split on whitespace with strsplit, and convert each piece to numeric to get list columns.
library(dplyr)
df %>% mutate_at(vars(vec1:vec2),
~purrr::map(strsplit(gsub('[a-d]|\\[|\\]', '', .), "\\s+"), as.numeric))
# id vec1
#1 1 -0.04161220, 0.04512072, -0.07892828, 0.00402516
#2 2 -0.07521469, 0.03782649, -0.01037493, 0.00402516
#3 3 -0.02139264, 0.09279498, -0.05898365, 0.00244455
# vec2
#1 -0.04161220, 0.04512072, -0.07892828, 0.00402516
#2 -0.07402104, 0.01758628, -0.01037493, 0.00402516
#3 -0.06737056, 0.09279498, -0.08350416, 0.00244455
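For comparison, the same clean-split-convert steps can be sketched in base R without any packages. This is only an illustration on two rows of the example data: parse_vec is a hypothetical helper name, and the columns to convert are assumed to be exactly those whose names start with "vec". (In dplyr 1.0 and later, across() supersedes mutate_at(), so the dplyr call above could be written with mutate(across(...)) instead.)

```r
df <- data.frame(
  id = 1:2,
  vec1 = c("[a-4.16121967e-02 b4.51207198e-02 c-7.89282843e-02 d4.02516453e-03]",
           "[a-7.52146867e-02 b3.78264938e-02 c-1.03749274e-02 d4.02516453e-03]"),
  vec2 = c("[a-4.16121967e-02 b4.51207198e-02 c-7.89282843e-02 d4.02516453e-03]",
           "[a-7.40210414e-02 b1.75862815e-02 c-1.03749274e-02 d4.02516453e-03]"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip brackets and the labels a-d, split on spaces,
# and convert each row's pieces to a numeric vector
parse_vec <- function(x) {
  cleaned <- gsub("\\[|\\]|[a-d]", "", x)
  lapply(strsplit(cleaned, " +"), as.numeric)
}

# Replace every column whose name starts with "vec" by a list column
for (col in grep("^vec", names(df), value = TRUE)) {
  df[[col]] <- parse_vec(df[[col]])
}

df$vec1[[1]]   # numeric vector of length 4 for row 1
```
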

Comparison of character type and factor type in R

OK, so I am having this issue right now. I have a matrix A whose row names are the values of a field in another matrix B, and I want to find the indices of my row names in B. I am trying to do this with which(B$field == rowname_A). Unfortunately, two things get in the way. First, the rowname_A variable is of class character and has the format "X12345". Second, the values of B$field are of type factor. Is there a way to remove the leading X from the character value, convert it to a factor, and do the comparison? Or to convert the factor values of B$field to character and then do the comparison?
Any help will be appreciated.
Thanks.

This is fairly straightforward. The example below should help you out.
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(1:3))
# Remove "X" from rownames(A) and check equality
B$field %in% substr(rownames(A), 2, nchar(rownames(A)))
# Add "X" to B$field and check equality
paste0("X", B$field) %in% rownames(A)
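The %in% checks above return TRUE/FALSE values. If the positions in B are what is needed, as the question asks, match() is one option. A small sketch with the rows of B deliberately shuffled (ids and idx are illustrative names):

```r
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(c(3, 1, 2)))

# Strip the leading "X" and compare as character, not as factor codes
ids <- sub("^X", "", rownames(A))
idx <- match(ids, as.character(B$field))
idx   # position in B$field of each row name of A
```

Here idx[i] is the row of B whose field equals the i-th row name of A, or NA if there is no such row.
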

Order df columns according to a target vector (but the names match only partially)

I have a data.frame (PC) that looks like this:
http://i.stack.imgur.com/NWJKe.png
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
http://i.stack.imgur.com/vQ48u.png
I want to sort the columns (beginning with "GTEX.") in the data.frame such that they are ordered by the age indicated in the age matrix.
PC <- read.csv("protein_coding.csv")
age <- read.table("Annotations_SubjectPhenotypes_DS.txt")
I started by changing the names in the age matrix to replace the '-' by '.':
new_SUBJID <- gsub("-", ".", age$SUBJID, fixed = TRUE)
age[, "SUBJID"] <- new_SUBJID
Then I ordered the rows of the age matrix (identified by SUBJID) by age:
sort.age <- with(age, age[order(AGE) , ])
sort.age <- na.omit(sort.age)
I then created a vector age.id containing the SUBJIDs in the right order (= how I want to order the columns of the PC matrix).
age.id <- sort.age$SUBJID
But then I am blocked since the names on the PC matrix and the age matrix are not the same... Could someone please help me?
Thank you very much in advance!
Svalf

It would have been better to show the example without using an image. Suppose there are two character vectors,
str1 <- c('GTEX.N7MS.0007.SM.2D7W1', 'GTEX.PFPP.0007.SM.2D8W1', 'GTEX.N7MS.0008.SM.4E3J1')
str2 <- c('GTEX.N7MS', 'GTEX.PFPP')
representing the column names of 'PC' and the 'SUBJID' column of the 'age' dataset (after replacing the - with . and sorting). We remove the suffix by matching a . followed by 4 digits (\\.\\d{4}) and any remaining characters up to the end of the string (.*$), and replace it with ''. Matching the shortened names against str2 then gives the order.
str1N <- sub('\\.\\d{4}.*$', '', str1)
str1[order(match(str1N, str2))]
#[1] "GTEX.N7MS.0007.SM.2D7W1" "GTEX.N7MS.0008.SM.4E3J1"
#[3] "GTEX.PFPP.0007.SM.2D8W1"
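Applied to the data frame itself, the same idea might look like the following sketch. The toy PC and age.id below are made up for illustration (the real data is only shown as images), and any non-GTEX columns are assumed to stay in front:

```r
# Toy stand-ins for the real objects (hypothetical values)
PC <- data.frame(Name = c("g1", "g2"),
                 GTEX.N7MS.0007.SM.2D7W1 = 1:2,
                 GTEX.PFPP.0007.SM.2D8W1 = 3:4,
                 GTEX.N7MS.0008.SM.4E3J1 = 5:6)
age.id <- c("GTEX.PFPP", "GTEX.N7MS")   # subject IDs already sorted by age

is_gtex <- startsWith(names(PC), "GTEX.")
gtex_cols <- names(PC)[is_gtex]

# Shorten each column name to its subject ID, then rank it by its position in age.id
subj <- sub("\\.\\d{4}.*$", "", gtex_cols)
ordered_cols <- gtex_cols[order(match(subj, age.id))]

PC_sorted <- PC[c(names(PC)[!is_gtex], ordered_cols)]
names(PC_sorted)
```

Columns whose subject ID does not appear in age.id get NA from match() and are placed last by order().
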

Concatenate rows of a data frame

I would like to take a data frame with characters and numbers, and concatenate all of the elements of each row into a single string, which would be stored as a single element in a vector. As an example, I make a data frame of letters and numbers, and then I would like to concatenate the first row via the paste function, and hopefully return the value "A1".
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5)
df
## letters numbers
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## 5 E 5
paste(df[1,], sep =".")
## [1] "1" "1"
So paste is converting each element of the row into an integer that corresponds to the index of the corresponding level, as if it were a factor, and it keeps the result as a vector of length two. (I know/believe that factors coerced to character behave this way, but since R is not storing df[1,] as a factor at all, as tested by is.factor(), I can't verify that it is actually an index for a level.)
is.factor(df[1,])
## [1] FALSE
is.vector(df[1,])
## [1] FALSE
So if it is not a vector then it makes sense that it is behaving oddly, but I can't coerce it into a vector
> is.vector(as.vector(df[1,]))
[1] FALSE
Using as.character did not seem to help in my attempts
Can anyone explain this behavior?

While others have focused on why your code isn't working and how to improve it, I'm going to try and focus more on getting the result you want. From your description, it seems you can readily achieve what you want using paste:
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=FALSE)
paste(df$letters, df$numbers, sep="")
## [1] "A1" "B2" "C3" "D4" "E5"
You can change df$letters to character using df$letters <- as.character(df$letters) if you don't want to use the stringsAsFactors argument.
But let's assume that's not what you want. Let's assume you have hundreds of columns and you want to paste them all together. We can do that with your minimal example too:
df_args <- c(df, sep="")
do.call(paste, df_args)
## [1] "A1" "B2" "C3" "D4" "E5"
EDIT: Alternative method and explanation:
I realised the problem you're having is a combination of the fact that you're using a factor and that you're using the sep argument instead of collapse (as @adibender picked up). The difference is that sep gives the separator between two separate vectors and collapse gives the separator within a vector. When you use df[1,], you supply a single vector to paste and hence you must use the collapse argument. Using your idea of getting every row and concatenating them, the following line of code will do exactly what you want:
apply(df, 1, paste, collapse="")
Ok, now for the explanations:
Why won't as.list work?
as.list converts an object to a list. So it does work. It will convert your dataframe to a list and subsequently ignore the sep="" argument. c combines objects together. Technically, a dataframe is just a list where every column is an element and all elements have to have the same length. So when I combine it with sep="", it just becomes a regular list with the columns of the dataframe as elements.
Why use do.call?
do.call allows you to call a function using a named list as its arguments. You can't just throw the list straight into paste, because it doesn't like dataframes. It's designed for concatenating vectors. So remember that df_args is a list containing a vector of letters, a vector of numbers and sep, which is a length 1 vector containing only "". When I use do.call, the resulting paste call is essentially paste(letters, numbers, sep).
But what if my original dataframe had columns "letters", "numbers", "squigs", "blargs" after which I added the separator like I did before? Then the paste function through do.call would look like:
paste(letters, numbers, squigs, blargs, sep)
So you see it works for any number of columns.
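The sep versus collapse distinction described above can be seen in a few lines. A minimal sketch, with strings stored as character (row1, with_sep, with_collapse and all_rows are illustrative names):

```r
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = FALSE)

row1 <- c("A", "1")

# sep only separates *different* arguments, so one vector in means one vector out
with_sep <- paste(row1, sep = "")

# collapse joins the elements *within* the result into a single string
with_collapse <- paste(row1, collapse = "")

# Column-wise pasting over any number of columns, as in the do.call approach
all_rows <- do.call(paste, c(df, sep = ""))

with_sep        # still two separate strings
with_collapse   # one string
all_rows        # one string per row
```
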

For those using library(tidyverse), you can simply use the unite function.
new.df <- df%>%
unite(together, letters, numbers, sep="")
This will give you a new column called together with A1, B2, etc.

This is indeed a little weird, but this is also what is supposed to happen.
When you create the data.frame as you did, the column letters is stored as a factor. A factor is stored internally as integer codes that index its levels, so when as.numeric() is applied to a factor it returns those codes rather than the labels. For example:
> df[, 1]
[1] A B C D E
Levels: A B C D E
> as.numeric(df[, 1])
[1] 1 2 3 4 5
A is the first level of the factor df[, 1] therefore A gets converted to the value 1, when as.numeric is applied. This is what happens when you call paste(df[1, ]). Since columns 1 and 2 are of different class, paste first transforms both elements of row 1 to numeric then to characters.
When you want to concatenate both columns, you first need to transform the first column to character:
df[, 1] <- as.character(df[, 1])
paste(df[1,], collapse = "")
As @sebastian-c pointed out, you can also use stringsAsFactors = FALSE in the creation of the data.frame, then you can omit the as.character() step.
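The difference between a factor's integer codes and its labels is easy to check on a small vector. A short sketch (f and g are illustrative names; note that since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so the behaviour in the question only appears when factors are requested explicitly or in older versions):

```r
f <- factor(c("B", "A", "C"))

levels(f)         # the distinct labels, sorted: "A" "B" "C"
as.integer(f)     # codes, i.e. positions in levels(f), not the labels
as.character(f)   # the labels themselves

# The same trap with number-like labels: convert via character first
g <- factor(c("20", "10"))
as.integer(g)                  # codes, not 20 and 10
as.numeric(as.character(g))    # the actual numbers
```
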

If you want to start with
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors=TRUE)
... then there is no general rule about how df$letters will be interpreted by any given function. It's a factor for modelling functions, character for some and integer for some others. Even the same function, such as paste, may interpret it differently depending on how you use it:
paste(df[1,], collapse="") # "11"
apply(df, 1, paste, collapse="") # "A1" "B2" "C3" "D4" "E5"
No logic in it except that it will probably make sense once you know the internals of every function.
Factors seem to be converted to integers when an argument is converted to a vector. As you know, data frames are lists of vectors of equal length, so the first row of a data frame is also a list, and when it is forced to be a vector, something like this happens:
df[1,]
# letters numbers
# 1 A 1
unlist(df[1,])
# letters numbers
# 1 1
I don't know how apply achieves what it does (i.e., factors are represented by character values) -- if you're interested, look at its source code. It may be useful to know, though, that you can trust (in this specific sense) apply (in this specific occasion). More generally, it is useful to store every piece of data in a sensible format, that includes storing strings as strings, i.e., using stringsAsFactors=FALSE.
Btw, every introductory R book should have this idea in a subtitle. For example, my plan for retirement is to write "A (not so) gentle introduction to the zen of data fishery with R, the stringsAsFactors=FALSE way".
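One likely explanation for the difference, going by the documentation of apply: a data frame passed to apply is first coerced with as.matrix(), which turns factor columns into their character labels, whereas unlist() on a row that mixes a factor with an integer falls back to the factor's integer codes. A small sketch (m is an illustrative name):

```r
df <- data.frame(letters = LETTERS[1:5], numbers = 1:5, stringsAsFactors = TRUE)

# What apply() works on: a character matrix holding the factor labels
m <- as.matrix(df)
m[1, ]

# What forcing a single row into a vector gives: the integer codes
unlist(df[1, ])

paste(m[1, ], collapse = "")             # labels are kept
paste(unlist(df[1, ]), collapse = "")    # codes are pasted instead
```
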
