I have a dataframe with ~200,000 rows and I need to extract a portion of a string from one of the columns. I have tried split and str_split but am not getting what I need.
I need to cut/split/explode on the / forward slash and only return the first field.
From the following sample I need only subdomain.domain.org
subdomain.domain.org/Site/SiteXYZ/Users/Usergroup
I have tried:
splitA <- str_split(my_data$OrganizationalUnit, pattern = "/", simplify = TRUE)
which returns a dataframe with multiple columns, one for each string between the delimiters
head(splitA)
[,1] [,2] [,3] [,4] [,5]
[1,] "subdomain.domain.org" "Site" "SiteXYZ" "Users" "Usergroup"
and
splitB <- str_split(my_data$OrganizationalUnit, pattern = "/", n = '2', simplify = TRUE)
which returns a dataframe with multiple columns, one for each string between the delimiters
head(splitB)
[,1] [,2]
[1,] "subdomain.domain.org" "Site/SiteXYZ/Users/Usergroup"
and
splitC <- str_split(my_data$OrganizationalUnit, pattern = "/", n = '1', simplify = TRUE)
which returns a dataframe with one column, that looks like the original
head(splitC)
[,1]
[1,] "subdomain.domain.org/Site/SiteXYZ/Users/Usergroup"
My end goal is to either:
Extract this string as part of larger queries
Add a column with this field
or add this as a column upon csv import
Any help will be much appreciated.
Thank you,
-Jacob
You were very close, if I understood right what you are after. As stated in the help file of ?str_split(), the simplify argument means:
"If FALSE, the default, returns a list of character vectors. If TRUE returns a character matrix."
If you are dealing with a character vector of domains, you can use simplify=TRUE and extract the first column of the matrix with (...)[, 1]. In any case, it is more efficient to use a combination of str_sub() and str_locate().
library(tidyverse)
x <- c("subdomain.domain.org/Site/SiteXYZ/Users/Usergroup",
"othersubdomain.domain.com/Site/SiteXYZ/Users/Usergroup",
"yetanothersubdomain.domain.com/Site/SiteXYZ/Users")
str_sub(x, start = 1, str_locate(string = x, "/")[, 1]-1)
# -1 otherwise / is kept in resulting string
str_split(x, pattern = "/", n = 2, simplify = TRUE)[, 1]
[1] "subdomain.domain.org" "othersubdomain.domain.com" "yetanothersubdomain.domain.com"
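For comparison, here is a base R sketch (not part of the answer above) that gets the same first field with sub(), by deleting the first / and everything after it. The third element is a made-up value with no slash, included to show that such strings come back unchanged.

```r
# Sample vector; the last element is a hypothetical value with no slash
x <- c("subdomain.domain.org/Site/SiteXYZ/Users/Usergroup",
       "othersubdomain.domain.com/Site/SiteXYZ/Users/Usergroup",
       "nodelimiter.domain.net")

# Drop the first "/" and everything that follows it
first_field <- sub("/.*$", "", x)
print(first_field)
```

Whether an unchanged string is the right result for input without a delimiter depends on your data, so it may be worth checking for such rows first.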
In a data.frame
If it is a column in a data.frame, you can use mutate() from {dplyr}:
library(tidyverse)
df <- tibble(domain = x,
y = rnorm(length(x))
)
df_extracted_domain <- df %>% mutate(
  # use the column `domain`, not the global vector x
  domain_suffix = str_sub(domain, start = 1, end = str_locate(domain, "/")[, 1] - 1)
)
In case you are not familiar with the tidyverse, and especially the pipe operator %>%, just read it as "and then". So the line above reads:
take df, "and then" mutate df (by adding a new variable called domain_suffix). The mutated data.frame is assigned to a new object (or overwrites the original, if you give it the same name).
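The question also mentions adding the field as a column at CSV import. A minimal base R sketch of that, assuming a file with an OrganizationalUnit column: the file contents, the User column and the new Domain column name are all hypothetical and only there to make the example runnable.

```r
# Write a small hypothetical CSV to a temporary file for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("OrganizationalUnit,User",
             "subdomain.domain.org/Site/SiteXYZ/Users/Usergroup,jdoe",
             "other.domain.com/Site/SiteABC/Users,asmith"),
           csv_path)

# Import, then add the first "/"-delimited field as a new column
my_data <- read.csv(csv_path, stringsAsFactors = FALSE)
my_data$Domain <- sub("/.*$", "", my_data$OrganizationalUnit)
print(my_data[, c("Domain", "User")])
```

The extraction happens right after read.csv() rather than inside it, which keeps the import step itself unchanged.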
benchmarking
y <- rep(x, 1e3)
bm <- rbenchmark::benchmark(
str_split = {
str_split(y, pattern = "/", n = 2, simplify = TRUE)[,1]
},
str_sub = {
# locate the slash in y (the vector being benchmarked) and keep the start column only
str_sub(y, start = 1, end = str_locate(string = y, "/")[, 1] - 1)
},
replications = 1000
)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 str_split 1000 1.4 7.1 1.4 0.011 0 0
# 2 str_sub 1000 0.2 1.0 0.2 0.000 0 0
I am struggling to separate a single string input into a series of inputs. The user gives a list of FASTA formatted sequences (see example below). I'm able to separate the inputs into their own
ex:
">Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
"
[1] "Rosalind_6404CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG"
[2] "Rosalind_5959CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"
But I am struggling to write a function that splits "Rosalind_6404" from the gene sequence for an unknown number of FASTA sequences, creating new vectors for the split elements.
Ultimately, the result would look something such as:
.> "Rosalind_6404" "Rosalind_5959"
.> "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG","CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"
I was hoping the convert_entries function would allow me to iterate over all the elements of the prepped_s character vector and split the elements into two new vectors with the same index number.
s <- ">Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC"
split_s <- strsplit(s, ">")
ul_split_s<- unlist(split_s)
fixed_s <- gsub("\n","", ul_split_s)
prepped_s <- fixed_s[-1]
prepped_s
nchar(prepped_s[2])
print(prepped_s[2])
entry_tags <- list()
entry_seqs <- list()
entries <- length(prepped_s)
unlist(entries)
first <- prepped_s[1]
convert_entries <- function() {
for (i in entries) {
tag <- substr(prepped_s[i], start = 1, stop = 13)
entry_tags <- append(entry_tags, tag)
return(entry_tags)
}
}
entry_tags <- convert_entries()
print(entry_tags)
Please help in any way you can, thanks!
One option with tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tibble(col1 = s) %>%
separate_rows(col1, sep="\n") %>%
group_by(grp = cumsum(str_detect(col1, '^>'))) %>%
summarise(prefix = first(col1),
col1 = str_c(col1[-1], collapse=""), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 2 x 2
prefix col1
<chr> <chr>
1 >Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
2 >Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
Using seqinr package:
library(seqinr)
# example fasta file
write(">Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC", "myFile.fasta")
# read the fasta file
x <- read.fasta("myFile.fasta", as.string = TRUE, forceDNAtolower = FALSE)
# get the names
names(x)
# [1] "Rosalind_6404" "Rosalind_5959"
# get the seq
x$Rosalind_6404
# [1] "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG"
# attr(,"name")
# [1] "Rosalind_6404"
# attr(,"Annot")
# [1] ">Rosalind_6404"
# attr(,"class")
# [1] "SeqFastadna"
In base R you could do:
t(gsub('\n', '', regmatches(s, gregexec("([A-Z][a-z_0-9]+)\n([A-Z\n]+)", s))[[1]][-1,]))
[,1] [,2]
[1,] "Rosalind_6404" "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG"
[2,] "Rosalind_5959" "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC"
NOTE: I transposed the matrix so that you can view the results more easily. The call to t() is not essential.
Another base R solution:
read.table(text=sub('\n', ' ', gsub('(\\D)\n', '\\1', unlist(strsplit(s, '>')))))
V1 V2
1 Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
2 Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
or even
proto <- data.frame(name = character(), value = character())
new_s <- gsub('\n', '', unlist(strsplit(s, '>')))
strcapture("([A-Z][a-z_0-9]+)([A-Z]+)", grep('\\w', new_s, value = TRUE), proto)
name value
1 Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
2 Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
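For readers who would rather have a reusable function, as the question's convert_entries attempted, here is a base R sketch. The name parse_fasta and the short sequences are hypothetical; it assumes every record starts with > followed by a one-line tag, and that > does not occur anywhere else in the text.

```r
# Short made-up FASTA text with two records, each wrapped over two lines
s <- ">Rosalind_0001\nACGT\nTTGA\n>Rosalind_0002\nGGCC\nAT"

# Hypothetical helper: returns tags and sequences as parallel character vectors
parse_fasta <- function(txt) {
  # One chunk per record; the text before the first ">" is empty and is dropped
  records <- strsplit(txt, ">", fixed = TRUE)[[1]]
  records <- records[nzchar(records)]
  # The first line of each record is the tag, the remaining lines are sequence
  lines <- strsplit(records, "\n", fixed = TRUE)
  tags <- vapply(lines, function(l) l[1], character(1))
  seqs <- vapply(lines, function(l) paste(l[-1], collapse = ""), character(1))
  list(tags = tags, seqs = seqs)
}

parsed <- parse_fasta(s)
print(parsed$tags)
print(parsed$seqs)
```

Because the tag is taken as the whole first line, this does not rely on tags having a fixed width such as the 13 characters used in the question.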
I have 1 row of data and 50 columns in the row from a csv which I've put into a dataframe. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g., "DFGS" in the first one, "SGRE" in the second, etc.), count their occurrences and display the results?
I have tried using the strsplit function but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
First we create a function strng() and then we apply() it on every column of df. strsplit() splits a string by "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng = function(x) {
strsplit(x,"-")[[1]][2]
}
# table() outputs frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1
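Putting extraction and counting together on the question's own sample values, a base R sketch could look like the following. It assumes every cell has the form A-B-C, and it uses a regular expression with sub() in place of strsplit(); that substitution is a choice made here, not something taken from the answers above.

```r
# One-row data frame holding the question's three sample values
df <- data.frame(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS",
                 stringsAsFactors = FALSE)

# Flatten the row to a plain character vector
values <- unlist(df[1, ], use.names = FALSE)

# Capture the text between the first and second "-"
middle <- sub("^[^-]*-([^-]*)-.*$", "\\1", values)

# Frequency of each middle part
counts <- table(middle)
print(counts)
```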
I'm using the isplit command from the iterators package to loop over a data frame. Does anyone know if it's possible to get the number of elements in the iterator?
E.g.,
library(iterators)
df <- data.frame(a = sample(letters[1:26], 100, replace = TRUE), b = runif(100))
df.iter <- isplit(df, df$a, drop = TRUE)
One option would be to convert it to a list with as.list (similar to list(generator) in Python) and get its length
length(as.list(df.iter))
#[1] 26
which is equal to the length from split
length(split(df, df$a, drop = TRUE))
#[1] 26
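Converting an iterator to a list steps through its elements, so the iterator may no longer be usable for the loop afterwards. If only the count is needed, a sketch that works from the grouping column directly, without touching the iterator, is shown below. The set.seed() call is an addition to make the example reproducible.

```r
set.seed(1)  # added for reproducibility, not in the original example
df <- data.frame(a = sample(letters[1:26], 100, replace = TRUE), b = runif(100))

# Number of groups that splitting on df$a would produce:
# one per distinct value that actually occurs in the column
n_groups <- length(unique(df$a))
print(n_groups)
```

This counts the same groups that split(df, df$a, drop = TRUE) returns, since each observed value of a forms one group.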
I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select variables which starts with r5 - r12 and create a dataframe in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given that my experience with regular expressions spans a day, I was wondering if anyone could help me write better and more compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("^r([5-9]|1[0-2])([^0-9]|$)"))
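To see why the trailing ([^0-9]|$) matters, the two patterns can be compared on a plain character vector with grep(). The names r51z and r120 are hypothetical additions to the question's examples, and the ^ anchor is an assumption that the names always begin with r.

```r
# The question's variable names plus two made-up names that should be excluded
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51z", "r120")

# Short pattern: also matches r51z and r120, because nothing limits what follows
loose <- grep("^r([5-9]|1[0-2])", x, value = TRUE)

# Longer pattern: the number must be followed by a non-digit or the end of the name
strict <- grep("^r([5-9]|1[0-2])([^0-9]|$)", x, value = TRUE)

print(loose)
print(strict)
```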
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can generalize the code above into a function. It takes three arguments: the first is the vector of variable names, and the second and third are the lower and upper limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
I want to generate several random numbers, sampled from normal distribution, for several pairs of mean and standard deviation.
These pairs are stored in a data frame with three columns containing the identifier of the pair, the mean and the standard deviation, as in the following example:
ex <- data.frame("id" = c("id_1_0.1", "id_2_0.5"), "mean" = c(1, 2), "sd" = c(0.1, 0.5))
To create 10 random numbers for each pair, I used these two lines:
tmp <- by(cbind(ex$mean, ex$sd), ex$id, function(x) rnorm(10, mean = x[, 1], sd = x[, 2]))
tmp <- do.call(rbind, lapply(tmp, data.frame, stringsAsFactors = FALSE))
What I would like to do is to then merge both data frames ex and tmp to have all the information in one data frame.
With this method, I face a problem of naming due to incrementation so I cannot do a simple merge.
Should I try to solve this using a regex, or is there a simpler solution?
This code seems to work for you:
library(dplyr)
ex <- data.frame("id" = c("id_1_0.1", "id_2_0.5"), mean = c(1, 2), sd = c(0.1, 0.5))
random_list = apply(ex[,c("id","mean","sd")],1,function(x) {
data.frame(id=rep(x[1],10),
random= rnorm(10, mean = as.numeric(x[2]), sd = as.numeric(x[3])))})
ex = do.call(rbind,random_list) %>% left_join(ex)
Hope this helps!
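A sketch of an alternative that avoids the merge altogether, using only base R: repeat each row of ex ten times, then draw all the values in one rnorm() call, which accepts vectors for mean and sd. The names n, long and random are chosen here for illustration.

```r
ex <- data.frame(id = c("id_1_0.1", "id_2_0.5"), mean = c(1, 2), sd = c(0.1, 0.5))

n <- 10  # draws per (mean, sd) pair

# Repeat every row n times, so id, mean and sd stay attached to each draw
long <- ex[rep(seq_len(nrow(ex)), each = n), ]
rownames(long) <- NULL

# rnorm() is vectorised over mean and sd: one draw per row
long$random <- rnorm(nrow(long), mean = long$mean, sd = long$sd)
head(long)
```

Because the parameters are carried along with each draw, no row names need to be cleaned up afterwards.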
I was able to use some regex to delete the incrementation counters off your IDs, allowing them to merge with your original IDs. There may be a prettier way to do this, but this appears to work.
# Pull rownames in and delete counter
tmp$id <- gsub("(\\.[^.]*$)", "", rownames(tmp))
# Merge with original data
new <- merge(ex, tmp, by = "id")
head(new)
# id mean sd X..i..
# 1 id_1_0.1 1 0.1 1.1226943
# 2 id_1_0.1 1 0.1 1.0666694
# 3 id_1_0.1 1 0.1 0.8848397
# 4 id_1_0.1 1 0.1 0.9839212
# 5 id_1_0.1 1 0.1 0.9027086
# 6 id_1_0.1 1 0.1 0.9389538
Regex: select a . followed by any number of non-. characters ([^.]*), anchored at the end of the string ($).
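The effect of that pattern can be checked on a few row names of the kind rbind() produces here. The vector rn below is a hand-written stand-in for rownames(tmp), and the dot is escaped so that it matches only a literal ".".

```r
# Hand-written examples in the shape "<id>.<counter>"
rn <- c("id_1_0.1.1", "id_1_0.1.10", "id_2_0.5.3")

# Remove the last "." and the counter after it; dots inside the id are kept
ids <- sub("\\.[^.]*$", "", rn)
print(ids)
```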