gsub producing a range rather than desired expression in R

I would like to read in a table and then use gsub to return part of the text. I know gsub requires a character vector. Instead of getting the desired samp list of 'C516_A1_B1' etc. and pat list of 'C516' etc., I get '1:5'. What is the simplest way to fix this? Thanks!
bamlist <- read.table('pathtotxtfile.txt')
for (y in bamlist) {
samp <- gsub('EPICC_(C\\S+)_S1\\S+$','\\1', bamlist)
pat <- gsub('(C\\d+)_\\S+$','\\1', samp)
}
bamlist:
EPICC_C516_A1_B1_S1-GRCh38.bam
EPICC_C516_A1_G4_S1-GRCh38.bam
EPICC_C516_B1_G7_S1-GRCh38.bam
EPICC_C516_B1_G8_S1-GRCh38.bam
EPICC_C516_B3_B1_S1-GRCh38.bam

Why loop? sub is vectorized over x.
samp <- sub("^[^_]*_(.*)_[^_]*$", "\\1", bamlist)
pat <- sub("(^[^_]+)_.*$", "\\1", samp)
samp
#[1] "C516_A1_B1" "C516_A1_G4" "C516_B1_G7" "C516_B1_G8"
#[5] "C516_B3_B1"
pat
#[1] "C516" "C516" "C516" "C516" "C516"
Data
bamlist <- scan(what = character(), text = "
EPICC_C516_A1_B1_S1-GRCh38.bam
EPICC_C516_A1_G4_S1-GRCh38.bam
EPICC_C516_B1_G7_S1-GRCh38.bam
EPICC_C516_B1_G8_S1-GRCh38.bam
EPICC_C516_B3_B1_S1-GRCh38.bam
")
Edit
Following @akrun's comment, here is a way to apply the above code to a data.frame.
lapply(bamlist, function(y){
samp <- sub("^[^_]*_(.*)_[^_]*$", "\\1", y)
pat <- sub("(^[^_]+)_.*$", "\\1", samp)
data.frame(samp = samp, pat = pat)
})
#$X
# samp pat
#1 C516_A1_B1 C516
#2 C516_A1_G4 C516
#3 C516_B1_G7 C516
#4 C516_B1_G8 C516
#5 C516_B3_B1 C516
The data would now be
X <- scan(what = character(), text = "
EPICC_C516_A1_B1_S1-GRCh38.bam
EPICC_C516_A1_G4_S1-GRCh38.bam
EPICC_C516_B1_G7_S1-GRCh38.bam
EPICC_C516_B1_G8_S1-GRCh38.bam
EPICC_C516_B3_B1_S1-GRCh38.bam
")
bamlist <- data.frame(X)
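As for why the original loop returns '1:5': read.table() returns a data.frame, and gsub() coerces its x argument with as.character(). Coercing a whole data.frame deparses each column, so a factor column whose five values are distinct and already sorted typically collapses to the single string "1:5" (its integer codes). The sketch below reproduces the situation with an in-memory data frame and shows that extracting the column first avoids it. The column name V1 is the default that read.table assigns, and stringsAsFactors = TRUE is an assumption that mimics the default before R 4.0.

```r
files <- c("EPICC_C516_A1_B1_S1-GRCh38.bam",
           "EPICC_C516_A1_G4_S1-GRCh38.bam",
           "EPICC_C516_B1_G7_S1-GRCh38.bam",
           "EPICC_C516_B1_G8_S1-GRCh38.bam",
           "EPICC_C516_B3_B1_S1-GRCh38.bam")

# Stand-in for read.table('pathtotxtfile.txt') (assumed: one factor column named V1)
bamlist <- data.frame(V1 = files, stringsAsFactors = TRUE)

# Coercing the whole data frame loses the file names
as.character(bamlist)

# Extract the column as a character vector before calling sub()
bam  <- as.character(bamlist$V1)
samp <- sub("^[^_]*_(.*)_[^_]*$", "\\1", bam)
pat  <- sub("(^[^_]+)_.*$", "\\1", samp)
```

Because sub() is vectorized, no loop is needed once bam is a plain character vector.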

Related

How to insert a part of a value of a column into a new column

I have not been programming for that long and have now encountered a problem to which I have not yet been able to find a solution.
In my dataframe there is a column that contains several pieces of information. For example, one row looks like this:
sp|O94910|AGRL1_HUMAN
or like this
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN
Now I want to create a new column with the combination of digits between the two vertical bars.
For the upper example it would be O94910, for the lower Q13554; Q13555
I have already tried functions like str_extract_all, str_match or gsub. But nothing worked.
The "id" is the column I look at. It includes different combinations of digits. I need the one between the two |
> dput(head(anaDiff_PD_vs_CTRL$id, 10))
c("sp|O94910|AGRL1_HUMAN", "sp|P02763|A1AG1_HUMAN", "sp|P19652|A1AG2_HUMAN",
"sp|P25311|ZA2G_HUMAN", "sp|Q8NFZ8|CADM4_HUMAN", "sp|P08174|DAF_HUMAN",
"sp|Q15262|PTPRK_HUMAN", "sp|P78324|SHPS1_HUMAN;sp|Q5TFQ8|SIRBL_HUMAN;sp|Q9P1W8|SIRPG_HUMAN",
"sp|Q8N3J6|CADM2_HUMAN", "sp|P19021|AMD_HUMAN")

With dplyr and stringr you can try...
library(dplyr)
library(stringr)
dat %>%
rowwise() %>%
mutate(dig = str_extract_all(col, "(?<=sp\\|)[A-Z0-9]+(?=\\|)"),
dig = paste0(dig, collapse = "; "))
#> # A tibble: 4 x 2
#> # Rowwise:
#> col dig
#> <chr> <chr>
#> 1 sp|Q8NFZ8|CADM4_HUMAN Q8NFZ8
#> 2 sp|94910|AGRL1_HUMAN 94910
#> 3 sp|O94910|AGRL1_HUMAN O94910
#> 4 sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN Q13554; Q13555
data
dat <- data.frame(col = c("sp|Q8NFZ8|CADM4_HUMAN", "sp|94910|AGRL1_HUMAN", "sp|O94910|AGRL1_HUMAN", "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"))
Created on 2022-02-02 by the reprex package (v2.0.1)
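For comparison, the same lookaround pattern can be used in base R without rowwise(). This is only a sketch of an alternative, not part of the answer above; it relies on perl = TRUE because lookbehind is not supported by R's default regex engine, and the variable names ids and dig are illustrative.

```r
ids <- c("sp|O94910|AGRL1_HUMAN",
         "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")

# Accession: letters/digits preceded by "sp|" and followed by "|"
pattern <- "(?<=sp\\|)[A-Z0-9]+(?=\\|)"

# gregexpr() finds every match per element; regmatches() extracts them as a list
matches <- regmatches(ids, gregexpr(pattern, ids, perl = TRUE))

# Collapse each element's matches into one "; "-separated string
dig <- vapply(matches, paste, character(1), collapse = "; ")
dig
#[1] "O94910"         "Q13554; Q13555"
```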

Here is a solution without tidyverse:
dat <- read.table(text = "
sp|Q8NFZ8|CADM4_HUMAN
sp|94910|AGRL1_HUMAN
sp|O94910|AGRL1_HUMAN
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")
ids <- strsplit(dat$V1, ";")
ids <- lapply(ids, function(x) gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", x))
ids <- lapply(ids, function(x) paste(x, collapse="; "))
dat$newcol <- unlist(ids)
Even with tidyverse, I would define a helper function for more clarity:
library(dplyr) # for %>% and mutate
library(purrr) # for map
extract_ids <- function(x) {
ids <- strsplit(x, ";")
ids <- map(ids, ~ gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", .))
ids <- map(ids, ~ paste(., collapse="; "))
unlist(ids)
}
dat <- dat %>% mutate(ids = extract_ids(V1))

This solution should help if you want to change your column names in a similar fashion:
library(tidyverse)
# create test data frame with column names "sp|O94910|AGRL1_HUMAN" and "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
col1 <- c(1,2,3,4,5)
col2 <- c(6,7,8,9,10)
df <- data.frame(col1, col2)
names(df)[1] <- "sp|O94910|AGRL1_HUMAN"
names(df)[2] <- "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
names <- as.data.frame((str_split(colnames(df), "\\|", simplify = TRUE))) # split the column names on "|" into a data frame of pieces
# blank out every piece that has fewer digits than letters or punctuation characters
for(i in 1:nrow(names)) {
for(j in 1:ncol(names)){
if ( (str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[0-9]") >
str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[:alpha:]|[:punct:]") )){
names[i,j] <- names[i,j]
} else {
names[i,j] <- ""
}
}
}
# combine the split columns into a single column called "colnames"
names <- names %>% unite("colnames", 1:5, na.rm = TRUE, remove = TRUE, sep = ";")
# strip ";" separators at the start and end of each string, and collapse runs of ";" into a single ";"
for (i in 1:nrow(names)){
names[i,] <- str_replace(names[i,],"\\;+$", "") %>%
str_replace("^\\;+", "") %>%
str_replace_all("\\;{2,}", ";")
}
# convert column with new names into a vector
new_names <- as.vector(names$colnames)
# replace old names with new names
names(df) <- new_names

Flipping two sides of string

I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiar with R's colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.

Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach, which might be more efficient, using chartr, dirname and paste. We first need the indices of the columns that contain "NORM":
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
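Since the question is about column names, the capture-group approach can be applied directly to names() of a data frame. The sketch below uses a made-up data frame df with the sample column names; the pattern is slightly generalized to accept any prefix without an underscore, which is an assumption beyond the original answer.

```r
# Hypothetical data frame carrying the sample column names
df <- data.frame(X99_NORM = 1:2, X101_NORM = 3:4,
                 X76_110_T02_09747 = 5:6, X30_NORM = 7:8)

# \\1 = everything before the single underscore, \\2 = the literal NORM suffix;
# names that do not match the pattern are returned unchanged
names(df) <- sub("^([^_]+)_(NORM)$", "\\2_\\1", names(df))
names(df)
#[1] "NORM_X99"          "NORM_X101"         "X76_110_T02_09747" "NORM_X30"
```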

x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"

A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30

Multiple list objects as a character variable in df. How to convert into original df with R?

I have a problem with my dataset: it contains character variables that are actually lists of values, which I want to convert into a data frame. The original data frame has several thousand rows.
I would like to split these strings into list objects and then convert the lists into a data frame (long format), but I lack some skills with list objects and string splitting.
Reproducible example:
id <- c("112")
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}")
test <- data.frame(id, name, value)
test
I would like to have an outcome like this:
id <- c("112","112","112")
name <- c( "dog", "cat","attashee")
value <- c("21000", "23400", "26800")
test1 <- data.frame(id, name, value)
test1
I suppose, I need to start by erasing first and the last characters { and }:
test$name <- gsub("{", "", test$name, fixed=TRUE)
test$name <- gsub("}", "", test$name, fixed=TRUE)
I have tried the approaches from string-split-into-list-r, convert-a-list-formatted-as-string-in-a-list and convert-a-character-variable-to-a-list-of-list:
test$name <- strsplit(test$name, ',')[[1]]
but I get an error message (when I try this on the first row of my original data): "replacement has 91 rows, data has 1".
The fact is, I'm pretty lost here, as I need to convert the name and value columns simultaneously (and I don't know how to convert even one column).
Any help and advice is much appreciated.

This should work:
id <- c("112")
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}")
test <- data.frame(id, name, value)
test
id <- rep(test$id, length(name))
name <- gsub("\\{", '', name)
name <- gsub("\\}", '', name)
name <- gsub('"', '', name)
name <- gsub('\\s+', '', name)
name <- strsplit(name, ',')[[1]]
value <- gsub("\\{", '', value)
value <- gsub("\\}", '', value)
value <- gsub('"', '', value)
value <- gsub('\\s+', '', value)
value <- strsplit(value, ',')[[1]]
test1 <- data.frame(id, name, value)
test1

I would parse and evaluate:
id <- c("112", "113")
name <- c( "{\"dog\", \"cat\",\"attashee\"}", "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}", "{\"21001\", \"23401\", \"26801\"}")
test <- data.frame(id, name, value)
clean_parse_eval <- function(x) {
eval(parse(text = gsub("\\}", ")", gsub("\\{", "c\\(", x))))
}
We then need a split-apply-combine approach to do this for every row. Of course, this isn't very fast.
library(data.table)
setDT(test)
test[, lapply(.SD, clean_parse_eval), by = id]
# id name value
#1: 112 dog 21000
#2: 112 cat 23400
#3: 112 attashee 26800
#4: 113 dog 21001
#5: 113 cat 23401
#6: 113 attashee 26801
Obviously, it would be better to avoid producing such malformed data to begin with.
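If evaluating parsed text is a concern, the quoted tokens can also be pulled out with a regular expression. The following is a minimal base R sketch, assuming every value is wrapped in double quotes and that name and value hold the same number of entries in each row; extract_quoted and long are illustrative names, not part of the answer above.

```r
test <- data.frame(
  id    = c("112", "113"),
  name  = c("{\"dog\", \"cat\",\"attashee\"}", "{\"dog\", \"cat\",\"attashee\"}"),
  value = c("{\"21000\", \"23400\", \"26800\"}", "{\"21001\", \"23401\", \"26801\"}"),
  stringsAsFactors = FALSE
)

# Return, for each string, the contents of its double-quoted tokens
extract_quoted <- function(x) {
  tokens <- regmatches(x, gregexpr('"[^"]*"', x))
  lapply(tokens, function(tok) gsub('"', "", tok, fixed = TRUE))
}

names_l  <- extract_quoted(test$name)
values_l <- extract_quoted(test$value)

# Repeat each id once per extracted entry to get the long format
long <- data.frame(
  id    = rep(test$id, lengths(names_l)),
  name  = unlist(names_l),
  value = unlist(values_l),
  stringsAsFactors = FALSE
)
long
```

The values stay character here; wrap unlist(values_l) in as.numeric() if numbers are wanted.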

This will work for one column. I tried it on your name object and it does the job:
library(stringr)
library(dplyr)
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
x <- as.data.frame(name) %>% mutate(name = str_replace_all(name, "\"", "")) # mutate_each() and funs() are deprecated in current dplyr
result <- strsplit(x$name,"![a-z]")[[1]]
result <- gsub('\\{', '', result)
result <- gsub('\\}', '', result)
result <- strsplit(as.character(result), split = ',', fixed = TRUE)[[1]]
result <- gsub(" +", "", result)
str(result)
#chr [1:3] "dog" "cat" "attashee"

This does the job without any libraries; explanations are in the comments.
id <- c("112")
name <- c( "{\"dog\", \"cat\",\"attashee\"}")
value <- c("{\"21000\", \"23400\", \"26800\"}")
convert <- function(col, isFloat = FALSE) {
# remove the two {} characters
col <- gsub("{", "", gsub("}", "", col, fixed=TRUE), fixed=TRUE)
# create a vector with the right content
ans <- eval(parse(text = paste0('c(', col, ')')))
if (isFloat)
unlist(lapply(ans, as.numeric))
else
ans
}
test <- data.frame(convert(id), convert(name), convert(value, TRUE))
# optionally you can fix the names here
names(test) <- c('id', 'name', 'value')
# final result
test
Note that you can use the convert function to add more columns to your data frame if you need to.
If your name variable has, for example, 1000 more entries, collapse them into a single string with name <- paste0(name, collapse = ""), then replace each }{ with a comma using name <- gsub("}{", ",", name, fixed=TRUE). That reduces the problem to the original case, and the same logic applies.

How to extract the user ids for each element

How can I extract the user_id from the retweets collected using this function?
## get only first 8 words from each tweet
x <- lapply(strsplit(dat$text, " "), "[", 1:8)
x <- lapply(x, na.omit)
x <- vapply(x, paste, collapse = " ", character(1))
## get rid of hyperlinks
x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
## encode for search query (handles the non ascii chars)
x <- sapply(x, URLencode, USE.NAMES = FALSE)
## get up to first 100 retweets for each tweet
data <- lapply(x, search_tweets, verbose = FALSE)
I have 12 elements, each contains a list of user ids, how can I extract the user ids only?
here is the full code:
library(rtweet)
library(dplyr)
library(plyr)
require(reshape2)
## search for day of rage tweets, try to exclude rt here
dor <- search_tweets("#Newsnight -filter:retweets", n = 10000)
## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
users_data() %>%
unique() %>%
right_join(dor) %>%
filter(!is_retweet) %>%
dplyr::select(user_id, screen_name, retweet_count, followers_count, text) %>%
filter(retweet_count >=50 & retweet_count <100 & followers_count < 10000 & followers_count > 500)
dat
## get only first 8 words from each tweet
x <- lapply(strsplit(dat$text, " "), "[", 1:8)
x <- lapply(x, na.omit)
x <- vapply(x, paste, collapse = " ", character(1))
## get rid of hyperlinks
x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
## encode for search query (handles the non ascii chars)
x <- sapply(x, URLencode, USE.NAMES = FALSE)
## get up to first 100 retweets for each tweet
data <- lapply(x, search_tweets, verbose = FALSE)
There are 11 more elements like this
12 elements

OK, so you have a list of 12 data frames, each with a column called user_id. If the list is named, you can include the df_name = names(data)[x], part; if it isn't named, leave it out (it is commented out below).
lapply(1:12, function(x) {
df <- data[[x]]
data.frame(user_id = df$user_id,
# df_name = names(data)[x],
df_number = x, stringsAsFactors=FALSE) } ) %>%
dplyr::bind_rows()
That should give you a new dataframe with all of the userids and which previous dataframe they came from.
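The same idea can be written without hard-coding the number of data frames. Because the real data comes from search_tweets, the sketch below substitutes a small made-up list called data with the same shape (data frames that each have a user_id column); the ids and text values are invented for illustration.

```r
# Hypothetical stand-in for the list returned by lapply(x, search_tweets, ...)
data <- list(
  data.frame(user_id = c("101", "102"), text = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(user_id = "201", text = "c", stringsAsFactors = FALSE)
)

# Pair each data frame with its position in the list
per_df <- Map(function(df, i) {
  data.frame(user_id   = df$user_id,
             df_number = rep(i, nrow(df)),  # rep() keeps empty results valid
             stringsAsFactors = FALSE)
}, data, seq_along(data))

# Stack the pieces into one data frame
user_ids <- do.call(rbind, per_df)
user_ids
```

If only the ids are needed, unlist(lapply(data, `[[`, "user_id")) returns them as a single character vector.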

Isolate alphabetical strings within a larger string

Is there a way to isolate parts of a string that are in alphabetical order?
In other words, if you have a string like this: hjubcdepyvb
Could you just pull out the portion in alphabetical order?: bcde
I have thought about using the is.unsorted() function, but I'm not sure how to apply this to only a portion of a string.

Here's one way by converting to ASCII and back:
input <- "hjubcdepyvb"
spl_asc <- as.integer(charToRaw(input)) # Convert to ASCII
d1 <- diff(spl_asc) == 1 # Find sequences
filt <- spl_asc[c(FALSE, d1) | c(d1, FALSE)] # Only keep sequences (incl start and end)
rawToChar(as.raw(filt)) # Convert back to character
#[1] "bcde"
Note that this will concatenate all parts that are in alphabetical order.
For example, if the input is "abcxasdicfgaqwe", the output would be abcfg.
If you want a separate string for each sequential run, you could do the following:
input <- "abcxasdicfgaqwe"
spl_asc <- as.integer(charToRaw(input))
d1 <- diff(spl_asc) == 1
r <- rle(c(FALSE, d1) | c(d1, FALSE)) # Find boundaries
cm <- cumsum(c(1, r$lengths)) # Map these to string positions
substring(input, cm[-length(cm)], cm[-1] - 1)[r$values] # Extract matching strings
#[1] "abc" "fg"
Finally, I had to come up with a way to use regex:
input <- c("abcxasdicfgaqwe", "xufasiuxaboqdasdij", "abcikmcapnoploDEFgnm",
"acfhgik")
(rg <- paste0("(", paste0(c(letters[-26], LETTERS[-26]),
"(?=", c(letters[-1], LETTERS[-1]), ")", collapse = "|"), ")+."))
#[1] "(a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|
#k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|
#v(?=w)|w(?=x)|x(?=y)|y(?=z)|A(?=B)|B(?=C)|C(?=D)|D(?=E)|E(?=F)|F(?=G)|G(?=H)|
#H(?=I)|I(?=J)|J(?=K)|K(?=L)|L(?=M)|M(?=N)|N(?=O)|O(?=P)|P(?=Q)|Q(?=R)|R(?=S)|
#S(?=T)|T(?=U)|U(?=V)|V(?=W)|W(?=X)|X(?=Y)|Y(?=Z))+."
regmatches(input, gregexpr(rg, input, perl = TRUE))
#[[1]]
#[1] "abc" "fg"
#
#[[2]]
#[1] "ab" "ij"
#
#[[3]]
#[1] "abc" "nop" "DEF"
#
#[[4]]
#character(0)
This regular expression will identify consecutive upper or lower case letters (but not mixed case). As demonstrated, it works for character vectors and produces a list of vectors with all the matches identified. If no match is found, the output is character(0).
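One limitation of the diff/rle approach above is that two runs sitting directly next to each other (for example "abcabc") are reported as a single string. A possible variant, sketched here under the assumption that the input is one non-empty string of letters, starts a new group wherever the step between neighbouring characters is not exactly 1; alpha_runs is an illustrative name.

```r
alpha_runs <- function(s) {
  if (nchar(s) == 0) return(character(0))
  chars <- strsplit(s, "")[[1]]
  codes <- utf8ToInt(s)
  # A new group starts at the first character and after every step that is not +1
  grp  <- cumsum(c(TRUE, diff(codes) != 1))
  runs <- vapply(split(chars, grp), paste, character(1), collapse = "")
  # Keep only groups with at least two consecutive letters
  unname(runs[nchar(runs) > 1])
}

alpha_runs("hjubcdepyvb")
#[1] "bcde"
alpha_runs("abcabc")
#[1] "abc" "abc"
```

Like the code-point approaches above, this compares character codes, so it treats any consecutive codes (digits included) as a run.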

Using factor integer conversion:
input <- "hjubcdepyvb"
d1 <- diff(as.integer(factor(unlist(strsplit(input, "")), levels = letters))) == 1
filt <- c(FALSE, d1) | c(d1, FALSE)
paste(unlist(strsplit(input, ""))[filt], collapse = "")
# [1] "bcde"

myf = function(x){
x = unlist(strsplit(x, ""))
ind = charmatch(x, letters)
d = c(0, diff(ind))
d[d !=1] = 0
d = d + c(sapply(1:(length(d)-1), function(i) {
ifelse(d[i] == 0 & d[i+1] == 1, 1, 0)
}
), 0)
d = split(seq_along(d)[d!=0], with(rle(d), rep(seq_along(values), lengths))[d!=0])
return(sapply(d, function(a) paste(x[a], collapse = "")))
}
myf(x = "hjubcdepyvblltpqrs")
# 2 4
#"bcde" "pqrs"
