I have these vectors:
texts=c("AANAAA","NNAAAA","AAAAAA", "NAAANN")
letter=c("C","C","G","T","P","D")
I am trying to replace N for each element in texts and replace it with a character from second array letter by position.
Take "AANAAA", the N at third position should be replaced by the third element of letter, i.e. G
My expected output is:
texts=c("AAGAAA","CCAAAA","AAAAAA", "CAAAPD")
I was trying gsub and a for loop
for (i in 1:6) {
for (j in 1:4)
gsub("N",letter[i],texts[j][i])
}
but it didn't work.
You can do that in base R like this. No loop is required. gregexpr extracts a list of positions in texts where matched. We then replace characters at those positions with corresponding characters in letter using regmatches<-.
m <- gregexpr("N", texts, fixed = TRUE)
regmatches(texts, m) <- lapply(m, \(i, x) x[i], letter)
Output
> texts
[1] "AAGAAA" "CCAAAA" "AAAAAA" "CAAAPD"
Using loops, split texts into letters and replace "N" with corresponding letter:
sapply(strsplit(texts, ""), function(i){
ix <- which(i == "N")
i[ ix ] <- letter[ ix ]
paste(i, collapse = "")
})
# [1] "AAGAAA" "CCAAAA" "AAAAAA" "CAAAPD"
A solution based on tidyverse/purrr:
library(tidyverse)
texts=c("AANAAA","NNAAAA","AAAAAA", "NAAANN")
letter=c("C","C","G","T","P","D")
texts %>%
map(~ str_split(.x, "") %>% unlist %>%
map2_chr(letter, ~ if_else(.x == "N", .y, .x))) %>%
map_chr(~ str_c(.x, collapse = ""))
#> [1] "AAGAAA" "CCAAAA" "AAAAAA" "CAAAPD"
Related
I have not been programming for that long and have now encountered a problem to which I have not yet been able to find a solution.
In my dataframe there is a column that contains several pieces of information. For example, one row looks like this:
sp|O94910|AGRL1_HUMAN
or like this
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN
Now I want to create a new column with the combination of digits between the two vertical bars.
For the upper example it would be O94910, for the lower Q13554; Q13555
I have already tried functions like str_extract_all, str_match or gsub. But nothing worked.
The "id" is the column I look at. It includes different combinations of digits. I need the one between the two |
> dput(head(anaDiff_PD_vs_CTRL$id, 10))
c("sp|O94910|AGRL1_HUMAN", "sp|P02763|A1AG1_HUMAN", "sp|P19652|A1AG2_HUMAN",
"sp|P25311|ZA2G_HUMAN", "sp|Q8NFZ8|CADM4_HUMAN", "sp|P08174|DAF_HUMAN",
"sp|Q15262|PTPRK_HUMAN", "sp|P78324|SHPS1_HUMAN;sp|Q5TFQ8|SIRBL_HUMAN;sp|Q9P1W8|SIRPG_HUMAN",
"sp|Q8N3J6|CADM2_HUMAN", "sp|P19021|AMD_HUMAN")>
With dplyr and stringr you can try...
library(dplyr)
library(stringr)
dat %>%
rowwise() %>%
mutate(dig = str_extract_all(col, "(?<=sp\\|)[A-Z0-9]+(?=\\|)"),
dig = paste0(dig, collapse = "; "))
#> # A tibble: 4 x 2
#> # Rowwise:
#> col dig
#> <chr> <chr>
#> 1 sp|Q8NFZ8|CADM4_HUMAN Q8NFZ8
#> 2 sp|94910|AGRL1_HUMAN 94910
#> 3 sp|O94910|AGRL1_HUMAN O94910
#> 4 sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN Q13554; Q13555
data
dat <- data.frame(col = c("sp|Q8NFZ8|CADM4_HUMAN", "sp|94910|AGRL1_HUMAN", "sp|O94910|AGRL1_HUMAN", "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"))
Created on 2022-02-02 by the reprex package (v2.0.1)
Here is a solution without tidyverse:
dat <- read.table(text = "
sp|Q8NFZ8|CADM4_HUMAN
sp|94910|AGRL1_HUMAN
sp|O94910|AGRL1_HUMAN
sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN")
ids <- strsplit(dat$V1, ";")
ids <- lapply(ids, function(x) gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", x))
ids <- lapply(ids, function(x) paste(x, collapse="; "))
dat$newcol <- unlist(ids)
Even with tidyverse, I would define a helper function for more clarity:
extract_ids <- function(x) {
ids <- strsplit(x, ";")
ids <- map(ids, ~ gsub("sp\\|([[:alnum:]]*)\\|.*", "\\1", .))
ids <- map(ids, ~ paste(., collapse="; "))
unlist(ids)
}
dat <- dat %>% mutate(ids = extract_ids(V1))
This solution should help if you want to change your column names in a similar fashion:
library(tidyverse)
# create test data frame with column names "sp|O94910|AGRL1_HUMAN" and "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
col1 <- c(1,2,3,4,5)
col2 <- c(6,7,8,9,10)
df <- data.frame(col1, col2)
names(df)[1] <- "sp|O94910|AGRL1_HUMAN"
names(df)[2] <- "sp|Q13554|KCC2B_HUMAN;sp|Q13555|KCC2G_HUMAN"
names <- as.data.frame((str_split(colnames(df), "\\|", simplify = TRUE))) # split the strings representing the column names seperated by "|" into a list
# remove all strings that contain less digits than letters or special characters
for(i in 1:nrow(names)) {
for(j in 1:ncol(names)){
if ( (str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[0-9]") >
str_count(as.vector(str_split(names[i,j], "\\|", simplify = TRUE)), "[:alpha:]|[:punct:]") )){
names[i,j] <- names[i,j]
} else {
names[i,j] <- ""
}
}
}
# combine the list columns into a single column calles "colnames"
names <- names %>% unite("colnames", 1:5, na.rm = TRUE, remove = TRUE, sep = ";")
# remove all ";" separators at the start of the strings, the end of the strings, and series of ";" into a single ";"
for (i in 1:nrow(names)){
names[i,] <- str_replace(names[i,],"\\;+$", "") %>%
str_replace("^\\;+", "") %>%
str_replace("\\;{2}", ";")
}
# convert column with new names into a vector
new_names <- as.vector(names$colnames)
# replace old names with new names
names(df) <- new_names
I need to prepare a certain dataset for analysis. What I have is a table with column names (obviously). The column names are as follows (sample colnames):
"X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"
(this is a vector, for those not familiair with R colnames() function)
Now, what I want is simply to flip the values in front of, and after the underscore. e.g. X99_NORM becomes NORM_X99. Note that I want this only for the column names which contain NORM in their name.
Some other base R options
1)
Use sub to switch the beginning and end - we can make use of capturing groups here.
x <- sub(pattern = "(^X\\d+)_(NORM$)", replacement = "\\2_\\1", x = x)
Result
x
# [1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
2)
A regex-free approach that might be more efficient using chartr, dirname and paste. But we need to get the indices of the columns that contain "NORM" first
idx <- grep(x = x, pattern = "NORM", fixed = TRUE)
x[idx] <- paste0("NORM_", dirname(chartr("_", "/", x[idx])))
x
data
x <- c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
x = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM")
replace(x,
grepl("NORM", x),
sapply(strsplit(x[grepl("NORM", x)], "_"), function(x){
paste(rev(x), collapse = "_")
}))
#[1] "NORM_X99" "NORM_X101" "X76_110_T02_09747" "NORM_X30"
A tidyverse solution with stringr:
library(tidyverse)
library(stringr)
my_data <- tibble(column = c("X99_NORM", "X101_NORM", "X76_110_T02_09747", "X30_NORM"))
my_data %>%
filter(str_detect(column, "NORM")) %>%
mutate(column_2 = paste0("NORM", "_", str_extract(column, ".+(?=_)"))) %>%
select(column_2)
# A tibble: 3 x 1
column_2
<chr>
1 NORM_X99
2 NORM_X101
3 NORM_X30
I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select variables which starts with r5 - r12 and create a dataframe in R.
The best code that I could write to get this done is,
data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))
Given my experience with regular expressions span a day, I was wondering if anyone could help me write a better and compact code for this!
Here's a regex that gets all the columns at once:
data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))
The vertical bar represents an 'or'.
As the comments have pointed out, this will fail for items such as r51, and can also be shortened. Instead, you will need a slightly longer regex:
data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
Suppose that in the code below x represents your names(data). Then the following will do what you want.
# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
EDIT.
You can make a function with a generalization of the above code. This function has three arguments, the first is the vector of variables names, the second and the third are the limits of the numbers you want to keep.
var_names <- function(x, from = 1, to = Inf){
y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.integer(y[sapply(y, `!=`, "")])
x[from <= y & y <= to]
}
var_names(x, 5)
#[1] "r5e" "r7g" "r9i" "r11k" "r13g" "r15i"
Remove the non-digits, scan the remainder in and check whether each is in 5:12 :
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data
DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
Using magrittr it could also be written like this:
library(magrittr)
DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
## r5e r7g r9i r11k
## 1 3 4 5 6
How can I extract the user_id from the retweets collected using this function?
## get only first 8 words from each tweet
x <- lapply(strsplit(dat$text, " "), "[", 1:8)
x <- lapply(x, na.omit)
x <- vapply(x, paste, collapse = " ", character(1))
## get rid of hyperlinks
x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
## encode for search query (handles the non ascii chars)
x <- sapply(x, URLencode, USE.NAMES = FALSE)
## get up to first 100 retweets for each tweet
data <- lapply(x, search_tweets, verbose = FALSE)
I have 12 elements, each contains a list of user ids, how can I extract the user ids only?
here is the full code:
library(rtweet)
library(dplyr)
library(plyr)
require(reshape2)
## search for day of rage tweets, try to exclude rt here
dor <- search_tweets("#Newsnight -filter:retweets", n = 10000)
## merge tweets data with unique (non duplicated) users data
## exclude retweets
## select status_id, retweet count, followers count, and text columns
dat <- dor %>%
users_data() %>%
unique() %>%
right_join(dor) %>%
filter(!is_retweet) %>%
dplyr::select(user_id, screen_name, retweet_count, followers_count, text) %>%
filter(retweet_count >=50 & retweet_count <100 & followers_count < 10000 & followers_count > 500)
dat
## get only first 8 words from each tweet
x <- lapply(strsplit(dat$text, " "), "[", 1:8)
x <- lapply(x, na.omit)
x <- vapply(x, paste, collapse = " ", character(1))
## get rid of hyperlinks
x <- gsub("http[\\S]{1,}", "", x, perl = TRUE)
## encode for search query (handles the non ascii chars)
x <- sapply(x, URLencode, USE.NAMES = FALSE)
## get up to first 100 retweets for each tweet
data <- lapply(x, search_tweets, verbose = FALSE)
There are 11 more elements like this
12 elements
Ok, so you have a list of 12 dataframes, each has a column called user_id. if the list is named, then this will work, if it isn't named, then take out the df_name = names(data)[x], part.
lapply(1:12, function(x) {
df <- data[[x]]
data.frame(user_id = df$user_id,
# df_name = names(data)[x],
df_number = x, stringsAsFactors=FALSE) } ) %>%
dplyr::bind_rows()
That should give you a new dataframe with all of the userids and which previous dataframe they came from.
Is there a way to isolate parts of a string that are in alphabetical order?
In other words, if you have a string like this: hjubcdepyvb
Could you just pull out the portion in alphabetical order?: bcde
I have thought about using the is.unsorted() function, but I'm not sure how to apply this to only a portion of a string.
Here's one way by converting to ASCII and back:
input <- "hjubcdepyvb"
spl_asc <- as.integer(charToRaw(input)) # Convert to ASCII
d1 <- diff(spl_asc) == 1 # Find sequences
filt <- spl_asc[c(FALSE, d1) | c(d1, FALSE)] # Only keep sequences (incl start and end)
rawToChar(as.raw(filt)) # Convert back to character
#[1] "bcde"
Note that this will concatenate any parts that are in alphabetical order.
i.e. If input is "abcxasdicfgaqwe" then output would be abcfg.
If you wanted to get separate vectors for each sequential string, you could do the following
input <- "abcxasdicfgaqwe"
spl_asc <- as.integer(charToRaw(input))
d1 <- diff(spl_asc) == 1
r <- rle(c(FALSE, d1) | c(d1, FALSE)) # Find boundaries
cm <- cumsum(c(1, r$lengths)) # Map these to string positions
substring(input, cm[-length(cm)], cm[-1] - 1)[r$values] # Extract matching strings
Finally, I had to come up with a way to use regex:
input <- c("abcxasdicfgaqwe", "xufasiuxaboqdasdij", "abcikmcapnoploDEFgnm",
"acfhgik")
(rg <- paste0("(", paste0(c(letters[-26], LETTERS[-26]),
"(?=", c(letters[-1], LETTERS[-1]), ")", collapse = "|"), ")+."))
#[1] "(a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|
#k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|
#v(?=w)|w(?=x)|x(?=y)|y(?=z)|A(?=B)|B(?=C)|C(?=D)|D(?=E)|E(?=F)|F(?=G)|G(?=H)|
#H(?=I)|I(?=J)|J(?=K)|K(?=L)|L(?=M)|M(?=N)|N(?=O)|O(?=P)|P(?=Q)|Q(?=R)|R(?=S)|
#S(?=T)|T(?=U)|U(?=V)|V(?=W)|W(?=X)|X(?=Y)|Y(?=Z))+."
regmatches(input, gregexpr(rg, input, perl = TRUE))
#[[1]]
#[1] "abc" "fg"
#
#[[2]]
#[1] "ab" "ij"
#
#[[3]]
#[1] "abc" "nop" "DEF"
#
#[[4]]
#character(0)
This regular expression will identify consecutive upper or lower case letters (but not mixed case). As demonstrated, it works for character vectors and produces a list of vectors with all the matches identified. If no match is found, the output is character(0).
Using factor integer conversion:
input <- "hjubcdepyvb"
d1 <- diff(as.integer(factor(unlist(strsplit(input, "")), levels = letters))) == 1
filt <- c(FALSE, d1) | c(d1, FALSE)
paste(unlist(strsplit(input, ""))[filt], collapse = "")
# [1] "bcde"
myf = function(x){
x = unlist(strsplit(x, ""))
ind = charmatch(x, letters)
d = c(0, diff(ind))
d[d !=1] = 0
d = d + c(sapply(1:(length(d)-1), function(i) {
ifelse(d[i] == 0 & d[i+1] == 1, 1, 0)
}
), 0)
d = split(seq_along(d)[d!=0], with(rle(d), rep(seq_along(values), lengths))[d!=0])
return(sapply(d, function(a) paste(x[a], collapse = "")))
}
myf(x = "hjubcdepyvblltpqrs")
# 2 4
#"bcde" "pqrs"