Sort strings based on number in part of string


I have a huge dataset that I cannot upload here.
I have two types of columns: their names start with either T.H.L or T.H.L.varies..... Both types are numbered in the format So####, e.g., T.H.L.So1_P1_A2 through T.H.L.So10000_P1_A2.
For each T.H.L column there is a column named T.H.L.varies.... with the same ending.
I want to order the columns by the numbers after So, with first the T.H.L and then the corresponding T.H.L.varies.... version for each So number.
What I tried was to do
library(gtools)
mySorted<- df2[,mixedorder(colnames(df2))]
This is close: it sorts them correctly by number, but it puts all T.H.L columns first and then all T.H.L.varies columns instead of alternating them.
I have posted the column names to Github:

Okay, let's call the names of your data frame (the names you want to reorder) x:
x = names(df2)
# first remove the ones without numbers
# because we want to use the numbers for ordering
no_numbers = c("T.H.L", "T.H.L.varies....")
x = x[! x %in% no_numbers]
# now extract the numbers so we can order them
library(stringr)
x_num = as.numeric(str_extract(string = x, pattern = "(?<=So)[0-9]+"))
# calculate the order first by number, then alphabetically to break ties
ord = order(x_num, x)
# verify it is working
head(c(no_numbers, x[ord]), 10)
# [1] "T.H.L" "T.H.L.varies...." "T.H.L.So1_P1_A1"
# [4] "T.H.L.varies.....So1_P1_A1" "T.H.L.So2_P1_A2" "T.H.L.varies.....So2_P1_A2"
# [7] "T.H.L.So3_P1_A3" "T.H.L.varies.....So3_P1_A3" "T.H.L.So4_P1_A4"
# [10] "T.H.L.varies.....So4_P1_A4"
# finally, reorder your data frame columns
df2 = df2[, c(no_numbers, x[ord])]
And you should be done.
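The same idea can be sketched in base R without stringr. The column names below are made up for illustration (they are not the asker's real data), and the tie-break is made explicit with grepl() rather than relying on alphabetical order, which can depend on the locale:

```r
# Hypothetical column names standing in for names(df2)
x <- c("T.H.L.varies.....So2_P1_A2", "T.H.L.So10_P1_A1", "T.H.L.So2_P1_A2",
       "T.H.L.varies.....So10_P1_A1", "T.H.L.So1_P1_A1",
       "T.H.L.varies.....So1_P1_A1")

# Capture the digits that follow "So" (base R stand-in for str_extract)
x_num <- as.numeric(sub(".*So([0-9]+).*", "\\1", x))

# Sort by number first; within a number, FALSE (plain T.H.L) sorts
# before TRUE (the "varies" version)
sorted <- x[order(x_num, grepl("varies", x))]
sorted
```

With the real data, sorted would then be used to index the data frame columns in the same way as x[ord] above.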

Related

Find strings where the first half matches the second

I have a list of IP address pairs separated by "::".
ip_pairs <- c("104.124.199.136::192.168.1.67", "104.124.199.136::192.168.137.174", "192.168.1.67::104.124.199.136", "192.168.137.174::104.124.199.136")
As you can see, the third and fourth elements of the vector are the same as the first two, but reversed (my actual problem is to find all unique pairings of IPs, so the solution would drop the pair B::A if A::B is already present). This could be solved using stringr or regex, I'm guessing.

One option:
library(stringr)
split_function = function(x) {
x = sort(x)
paste(x, collapse="::")
}
pairs = str_split(ip_pairs, "::")
unique(sapply(pairs, split_function))
[1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"

Use read.table to create a two column data frame from the pairs, sort each row and find the duplicates using duplicated. Then extract out the non-duplicates. No packages are used.
DF <- read.table(text = ip_pairs, sep = ":")[-2]
ip_pairs[! duplicated(t(apply(DF, 1, sort)))]
## [1] "104.124.199.136::192.168.1.67" "104.124.199.136::192.168.137.174"
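A possible variant that combines both ideas: build an order-independent key for each pair, then keep the first element seen for each key. This is only a sketch; the names key and unique_pairs are introduced here for illustration.

```r
ip_pairs <- c("104.124.199.136::192.168.1.67", "104.124.199.136::192.168.137.174",
              "192.168.1.67::104.124.199.136", "192.168.137.174::104.124.199.136")

# Canonical key: the two addresses in sorted order, so A::B and B::A
# produce the same key
key <- vapply(strsplit(ip_pairs, "::", fixed = TRUE),
              function(p) paste(sort(p), collapse = "::"),
              character(1))

# Keep the first occurrence of each key, in its original orientation
unique_pairs <- ip_pairs[!duplicated(key)]
unique_pairs
```

Unlike unique() on the re-pasted strings, this returns the surviving pairs exactly as they appeared in the input.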

add running counter for semi-consecutive strings in vector

I would like to add a number indicating the x^th occurrence of a word in a vector. (So this question is different from Make a column with duplicated values unique in a dataframe , because I have a simple vector and try to avoid the overhead of casting it to a data.frame).
E.g. for the vector:
book, ship, umbrella, book, ship, ship
the output would be:
book, ship, umbrella, book2, ship2, ship3
I have solved this myself by transposing the vector to a dataframe and next using the grouping function. That feels like using a sledgehammer to crack nuts:
# add consecutive number for equal string
library(dplyr)  # provides %>%, group_by(), mutate() and row_number()
words <- c("book", "ship", "umbrella", "book", "ship", "ship")
# transpose word vector to data.frame for grouping
df <- data.frame(words = words)
df <- df %>% group_by(words) %>% mutate(seqN = row_number())
# combine columns and remove '1' for first occurrence
wordsVec <- paste0(df$words, df$seqN)
gsub("1", "", wordsVec)
# [1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Is there a cleaner solution, e.g. using the stringr package?

You can still utilize row_number() from dplyr but you don't need to convert to data frame, i.e.
sub('1$', '', ave(words, words, FUN = function(i) paste0(i, row_number(i))))
#[1] "book" "ship" "umbrella" "book2" "ship2" "ship3"
Another option is to use make.unique along with gsubfn to increment your values by 1, i.e.
library(gsubfn)
gsubfn("\\d+", function(x) as.numeric(x) + 1, make.unique(words))
#[1] "book" "ship" "umbrella" "book.2" "ship.2" "ship.3"
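For comparison, a base R sketch that needs no packages. It also avoids stripping a trailing "1" with a regex, which would turn an eleventh occurrence such as "book11" into "book1". The names n and labelled are introduced here for illustration.

```r
words <- c("book", "ship", "umbrella", "book", "ship", "ship")

# Running occurrence index of each element within its group of equal words
n <- ave(seq_along(words), words, FUN = seq_along)

# Append the index only from the second occurrence onwards
labelled <- ifelse(n == 1, words, paste0(words, n))
labelled
```

Because the suffix is only added when the index is greater than 1, no clean-up step is needed afterwards.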

exclude specific country phone numbers from a column with different type of phone numbers

I have a problem excluding one specific country's phone numbers from a column. The problem is that they are not all in the same format, and some countries have a 3-digit country code (e.g. "001") while others have a 4-digit country code (e.g. "0098").
sample:
00989121234567
009809121234567
989121234567
9121234567
09121234567
First I need to convert all of those formats into one format and then exclude them from that column. Output phone numbers must be in this format:
"989121234567"

You can use startsWith and substr (or gsub would do as well) for this. First though, you need an array with prefixes:
# variables
country_codes <- c('1', '98')
prefix <- union(country_codes, paste0('00', country_codes))
numbers <- c('00989121234567','009809121234567','989121234567','9121234567','09121234567')
# get rid of prefix
new_numbers <- character(length(numbers))
for (k in seq_along(prefix)) {
ind <- startsWith(numbers, prefix[k])
new_numbers[ind] <- substr(numbers[ind], nchar(prefix[k]) + 1, nchar(numbers[ind]))
}
new_numbers[new_numbers == ""] <- numbers[new_numbers == ""]
# results
new_numbers
# [1] "9121234567" "09121234567" "9121234567" "9121234567" "09121234567"
You can then add new country codes e.g. 44,31 etc. or you could also add paste0('+', country_codes) in prefix to deal with numbers of the form +1xxxx.
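To get all the way to the "989121234567" format asked for in the question, one possible sketch is a single regex that strips the optional pieces and then re-adds the country code. It assumes every number in the vector belongs to country code 98 and that no national number itself begins with "98"; both are assumptions, not something stated in the question.

```r
numbers <- c("00989121234567", "009809121234567", "989121234567",
             "9121234567", "09121234567")

# Strip, in order: an optional "00" international prefix, an optional "98"
# country code, and an optional leading trunk "0"
national <- sub("^(00)?(98)?0?", "", numbers)

# Re-attach the country code so every number has the same shape
normalized <- paste0("98", national)
normalized
```

Once the numbers share one format, excluding them from a column that mixes several countries reduces to a startsWith(x, "98") test on the normalized values.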

If you define the vector that holds the telephone numbers as numeric, the leading zeros are removed, and you are then free to remove the numbers that you don't want.
Using the numbers provided:
nr <- c(00989121234567,009809121234567,989121234567,9121234567,09121234567)
nr
[1] 9.891212e+11 9.809121e+12 9.891212e+11 9.121235e+09 9.121235e+09
subset(nr,!grepl("^98",nr))
[1] 9121234567 9121234567
EDIT: I see you added the requirement of returning a character vector. You can just use the as.character() function for that on the final vector.

Creating a palindrome function in r

Hi, I have a data frame df and want to find out whether there are any palindromes in its name column.
I have test data with 12 records. I know 2 of the records in the name column are palindromes.
The code below uses lapply to return a list of TRUE/FALSE values.
How do I return the names that are palindromes (the TRUE values), and how would I find out which is the most frequently occurring palindrome name?
is_palindrome = function(x){
charsplit = strsplit(x, "")[[1]]
revchar = rev(charsplit)
all(charsplit==revchar)
}
dfnamelc = tolower(as.character(df$Name))
listtest = as.list(dfnamelc)
lapply(listtest,is_palindrome)
example df
Linda,F,100
Mary,F,150
Patrick,M,200
Barbara,F,300
Susan,F,100
Norman,M,40
Deborah,F,500
Sandra,F,23
Conor,M,80
anna,F,40
Otto,M,30
anna,M,40

It will probably be more convenient to use sapply() to return the results as a vector, and incorporate the results back into the data frame.
df <- transform(df,
is_pal=sapply(tolower(Name),is_palindrome))
df$Name[df$is_pal] ## which names are palindromes?
paltab <- table(df$Name[df$is_pal]) ## count palindromic names
names(paltab)[which.max(paltab)] ## "anna"
I'm not sure what your third column signifies, so I'm ignoring it.
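A self-contained sketch of the same steps, using only the Name column from the example data. It counts on the lower-cased names so that differently capitalised spellings of one name are pooled, which is a choice made here rather than something the question specifies.

```r
df <- data.frame(
  Name = c("Linda", "Mary", "Patrick", "Barbara", "Susan", "Norman",
           "Deborah", "Sandra", "Conor", "anna", "Otto", "anna"),
  stringsAsFactors = FALSE
)

is_palindrome <- function(x) {
  chars <- strsplit(x, "")[[1]]
  all(chars == rev(chars))
}

# One logical per name; vapply() guarantees the result type
is_pal <- vapply(tolower(df$Name), is_palindrome, logical(1), USE.NAMES = FALSE)

pal_names <- df$Name[is_pal]            # names that are palindromes
counts <- table(tolower(pal_names))     # frequency of each palindrome
most_common <- names(counts)[which.max(counts)]
most_common
```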

How to subset data with advance string matching

I have the following data frame from which I would like to extract rows based on matching strings.
> GEMA_EO5
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
KNG1 3.433049 8.56e-28 NM_000893,NM_001102416 1.234245e-24
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
VPS29 3.827665 2.22e-25 NM_057180,NM_016226 2.560770e-22
CYP51A1 3.363149 5.95e-25 NM_000786,NM_001146152 6.239386e-22
TNPO2 4.707600 1.60e-23 NM_001136195,NM_001136196,NM_013433 1.538000e-20
NSDHL 2.703922 6.74e-23 NM_001129765,NM_015922 5.980454e-20
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
So I would like to extract e.g. two rows based on matching strings in $RefSeq_ID, that works fine with the following:
> list<-c("NM_001386", "NM_020385")
> GEMA_EO6<-subset(GEMA_EO5, GEMA_EO5$RefSeq_ID %in% list, drop = TRUE)
> GEMA_EO6
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
REXO4 3.245317 1.78e-27 NM_020385 2.281367e-24
DPYSL2 5.097382 1.29e-22 NM_001386 1.062868e-19
But some of the rows have several RefSeq_IDs separated with commas, so I am looking for a general way of telling if $RefSeq_ID contains a certain string pattern and then subset that row.

To do partial matching you'll need to use regular expressions (see ?grepl). Here's a solution to your particular problem:
##Notice that the first element appears in
##a row containing commas
l = c( "NM_013433", "NM_001386", "NM_020385")
To test one sequence at a time, we just select a particular seq id:
R> subset(GEMA_EO5, grepl(l[1], GEMA_EO5$RefSeq_ID))
gene_symbol fold_EO p_value RefSeq_ID BH_p_value
5 TNPO2 4.708 1.6e-23 NM_001136195,NM_001136196,NM_013433 1.538e-20
To test for multiple genes, we use the | operator:
R> paste(l, collapse="|")
[1] "NM_013433|NM_001386|NM_020385"
R> grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID)
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE
So
subset(GEMA_EO5, grepl(paste(l, collapse="|"),GEMA_EO5$RefSeq_ID))
should give you what you want.
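One caveat with grepl() is that it matches substrings, so a shorter ID (say a hypothetical "NM_00138") would also match "NM_001386". If exact IDs are wanted, a sketch along the following lines splits each cell on commas and tests whole IDs. Only two columns of the example data are rebuilt here, and the names ids, keep and hits are introduced for illustration.

```r
GEMA_EO5 <- data.frame(
  gene_symbol = c("KNG1", "REXO4", "VPS29", "CYP51A1", "TNPO2", "NSDHL", "DPYSL2"),
  RefSeq_ID = c("NM_000893,NM_001102416", "NM_020385", "NM_057180,NM_016226",
                "NM_000786,NM_001146152", "NM_001136195,NM_001136196,NM_013433",
                "NM_001129765,NM_015922", "NM_001386"),
  stringsAsFactors = FALSE
)
ids <- c("NM_013433", "NM_001386", "NM_020385")

# Split each cell into its individual IDs
id_list <- strsplit(GEMA_EO5$RefSeq_ID, ",", fixed = TRUE)

# Keep a row if any of its IDs is exactly one of the wanted IDs
keep <- vapply(id_list, function(v) any(v %in% ids), logical(1))
hits <- GEMA_EO5[keep, ]
hits$gene_symbol
```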

A different approach is to recognize the multiple entries in RefSeq_ID as an attempt to represent two database tables in a single data frame. So if the original table is csv, then normalize the data into two tables
Anno <- cbind(key = seq_len(nrow(csv)), csv[,names(csv) != "RefSeq_ID"])
key0 <- strsplit(csv$RefSeq_ID, ",")
RefSeq <- data.frame(key = rep(seq_along(key0), sapply(key0, length)),
ID = unlist(key0))
and recognize that the query is a subset (select) on the RefSeq table, followed by a merge (join) with Anno
l <- c( "NM_013433", "NM_001386", "NM_020385")
merge(Anno, subset(RefSeq, ID %in% l))[, -1]
leading to
> merge(Anno, subset(RefSeq, ID %in% l))[, -1]
gene_symbol fold_EO p_value BH_p_value ID
1 REXO4 3.245317 1.78e-27 2.281367e-24 NM_020385
2 TNPO2 4.707600 1.60e-23 1.538000e-20 NM_013433
3 DPYSL2 5.097382 1.29e-22 1.062868e-19 NM_001386
Perhaps the goal is to merge with a `Master' table, then
Master <- cbind(key = seq_len(nrow(csv)), csv)
merge(Master, subset(RefSeq, ID %in% l))[,-1]
or similar.
