I have some data that describes an ordered set of discrete events (or states). There are 34 possible states, which may occur in any order and may repeat. Each sequence of events can contain any number of events, and crucially there are more than 2 sequences of events. My eventual aim is to cluster these sequences into similar subsets, but my hunch is that this cannot be meaningful unless these sequences are aligned such that equivalent events occupy the same position in all sequences.
I'm very familiar with multiple alignment of biological sequences, but all the software I've come across for this (MUSCLE, MAFFT, T-COFFEE, Clustal*, etc.) requires DNA, RNA or AA sequences, and my alphabet of 34 states is larger than any of those (4 nucleotides or 20 amino acids), so I can't get them to work.
I've found various R implementations of pairwise alignment algorithms such as Needleman-Wunsch, but so far haven't come across a generic (non-biological) implementation of any multiple sequence alignment algorithm.
For example, say my data looks like this:
1: ABCDEFG
2: ACDGH
3: BDEFEGI
4: AH
5: DEGHI
My aim is to have it look like this:
1: ABCDEF-G--
2: A-CD---GH-
3: -B-DEFEG-I
4: A-------H-
5: ---DE--GHI
where the - symbol denotes the absence of an event in that sequence. This is a simplified example; in reality I'm looking for something that penalises the opening of gaps (-) in the same way biological sequence MSA algorithms do.
The only piece of software I've found that seems like it might do this is Alphamalig (http://alggen.lsi.upc.es/recerca/align/alphamalig/intro-alphamalig.html), but it's old and I can't get it working on my machine. Ideally I'd like something that can be implemented in R.
I would advise using MAFFT. It is typically used to align biological sequences, but its --anysymbol option lets it align arbitrary text. Note that MAFFT is a shell script run from the command line, and it requires input/output files.
input file (mafft_anysymbol_input.txt):
>Seq1
ABCDEFG
>Seq2
ACDGH
>Seq3
BDEFEGI
>Seq4
AH
>Seq5
DEGHI
R code to run the shell command:
# Make sure the input/output files are in R's working directory,
# otherwise specify their full paths in the mafft call.
x <- 'mafft --anysymbol mafft_anysymbol_input.txt > mafft_anysymbol_output.txt'
system(x)
Contents of output file (mafft_anysymbol_output.txt):
>Seq1
ABCDEFG--
>Seq2
-ACDGH---
>Seq3
--BDEFEGI
>Seq4
----AH---
>Seq5
---DEGHI-
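If you want the alignment back in R afterwards, here is a minimal sketch (my addition) for parsing the FASTA-style output above into a named character vector:
lines  <- readLines("mafft_anysymbol_output.txt")
is_hdr <- grepl("^>", lines)                       # header lines start with ">"
aln <- tapply(lines[!is_hdr], cumsum(is_hdr)[!is_hdr], paste0, collapse = "")
names(aln) <- sub("^>", "", lines[is_hdr])
aln
#      Seq1        Seq2        Seq3        Seq4        Seq5
# "ABCDEFG--" "-ACDGH---" "--BDEFEGI" "----AH---" "---DEGHI-"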
Edit - I see now that you are familiar with biological alignment tools. If you want a customized scoring matrix for your text alignments, check out the MAFFT options --text and --textmatrix. They require ASCII-code input (extra data type conversions), but give you the option of associating similar letters (however you choose to define similar) by score. For example, you could associate upper- and lowercase letters, or letters with and without accent marks.
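Since your events are 34 arbitrary states rather than letters, you would first need to encode each state as a single printable character before writing the FASTA input. A hypothetical sketch (the state names S1..S34 are placeholders for your real event types):
states <- paste0("S", 1:34)                             # placeholder state names
pool <- strsplit(rawToChar(as.raw(33:126)), "")[[1]]    # printable ASCII characters
pool <- setdiff(pool, c(">", "-"))                      # avoid FASTA header/gap symbols
to_chr <- setNames(pool[seq_along(states)], states)
paste(to_chr[c("S1", "S5", "S34")], collapse = "")      # one MAFFT-ready sequence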
Assuming that we need to match against LETTERS, one option is str_match, then change the NAs to - and paste:
library(stringr)
library(tidyr)   # replace_na() is from tidyr
f1 <- Vectorize(function(x) str_match(x, LETTERS))
out1 <- f1(v1)
do.call(paste0, as.data.frame(t(replace_na(out1[!!rowSums(!is.na(out1)),], '-'))))
#[1] "ABCDEFG--" "A-CD--GH-" "-B-DEFG-I" "A------H-" "---DE-GHI"
It can also be done with match after splitting:
lst <- strsplit(v1, "")
mx <- match(max(sapply(lst, tail, 1)), LETTERS)
sapply(lst, function(x) paste(replace_na(x[match(LETTERS[seq_len(mx)], x)], '-'),
                              collapse = ""))
data
v1 <- c("ABCDEFG", "ACDGH", "BDEFEGI", "AH", "DEGHI")
Task
I am attempting to use better functionality (a loop or vectorized approach) to break a larger list down into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second list those starting with B, ... and the possible 27th list contains all remaining entries that start with numbers or other characters).
I am then attempting to identify which names on each list are similar by using the adist function (for instance, I need to correct company names that are misspelled, e.g. Companyy A needs to be corrected to Company A).
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I have created a list of all the entries that begin with the letter A and then calculated the "distance" between each entry (appears in a matrix).
Ask One
I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
  # grepl() gives TRUE wherever uniqueID starts with the i-th letter (either case)
  stake <- subset(df, grepl(paste0('^[', letters[i], LETTERS[i], ']'), uniqueID))
  stakeDist <- adist(stake$uniqueID, ignore.case=TRUE)
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote=TRUE, sep=',',
              row.names=stake$uniqueID, col.names=stake$uniqueID)
}
Using a combination of paste0 and the built-in letters and LETTERS vectors, this creates your grep pattern for each letter of the alphabet.
Using subset (with grepl), the matching IDs are extracted.
paste0 also creates a unique filename for write.table().
And it is all tied together with a for() loop.
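For Ask Two, here is a hedged sketch (my addition) of flagging candidate misspelling pairs directly in R instead of eyeballing the matrix in Excel -- the 1-to-10 distance band comes from your description, so adjust it to taste:
flag_near_dupes <- function(ids, lo = 1, hi = 10) {
  d <- adist(ids, ignore.case = TRUE)
  d[lower.tri(d, diag = TRUE)] <- NA          # keep each pair only once
  idx <- which(d >= lo & d <= hi, arr.ind = TRUE)
  data.frame(name1 = ids[idx[, 1]], name2 = ids[idx[, 2]], dist = d[idx])
}
# e.g. flag_near_dupes(stakeA) lists likely misspelling pairs for manual review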
I have created a list (based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens (-) in them.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {
  assign(paste("df1.", i, sep=""), df[df$Dimension == i, ])
}
This works fine; however, when I come to aggregate the results to get some summary statistics, I can't reference the datasets, because R stops reading the name at the hyphen (I assume the hyphen is some special character).
If I use a different list without hyphens e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen, or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces around binary operators. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # ! & | ^ ( [ ' " in them. At ?quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice anyway. I completely agree with Axeman's comments: use split to divide up your data frame into a list, and keep it in a list rather than using assign -- it will be much easier to loop over or use lapply with that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
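From there, the summary statistics you mentioned become a simple lapply over the list. A minimal sketch (my addition; Value is a hypothetical numeric column, substitute your own):
stats <- lapply(nice_list, function(d) summary(d$Value))
stats[["Age_CareContactDate-Gender"]]   # hyphens are fine inside list names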
I have a text file to read into R (and store in a data.frame). The file is organized in several rows and columns, and both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotations), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I haven't been able to find a solution.
I've searched here ("[r] read eol" and other queries I don't remember) and found no solution: the only suggestion was to preprocess the file to change the eol, which is not possible in my case because some fields can contain \n, \r, \n\r, ", ... (that is the very reason for the customization).
Thanks!
You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan and split them into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text,"raw.txt"); rm(your_text)
eol_str = "!" # whatever character(s) the rows divide on
sep_str = "~" # whatever character(s) the columns divide on
# read and parse the text file
# scan() gives a character vector of row strings (one string per row);
# sapply(..., strsplit) then breaks each row into its columns
row_list <- sapply(scan("raw.txt", what = character(), sep = eol_str),
                   strsplit, split = sep_str)
df <- data.frame(do.call(rbind, row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]   # the first row holds the column headers
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
B. If A doesn't work, I agree with @BondedDust that you probably need an external utility -- but you can invoke it from R with system() and do a find/replace to reformat your file for read.table. The invocation will be specific to your OS; see for example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you already have \n and \r\n inside your fields, I recommend that you first find and replace those with temporary placeholders -- perhaps quoted versions of themselves -- and then convert them back after you have built your data.frame.
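Since your real eol is multi-character ("\t&nd"), which scan's single-character sep cannot handle, here is a hedged sketch (my addition) that reads the whole file in one gulp and splits manually -- it assumes the file fits in memory, "myfile.txt" is a placeholder name, and "~" stands in for your real sep:
raw  <- readChar("myfile.txt", file.info("myfile.txt")$size, useBytes = TRUE)
rows <- strsplit(raw, "\t&nd", fixed = TRUE)[[1]]   # split on the custom eol
cells <- strsplit(rows, "~", fixed = TRUE)          # split each row on the custom sep
df <- data.frame(do.call(rbind, cells[-1]), stringsAsFactors = FALSE)
names(df) <- cells[[1]]                             # first row = headers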
I am currently using the 'agrep' function with 'lapply' in a data.table code to link entries from a user-provided VIN# list to a DMV VIN# database. Please see the following two links for all data/code so far:
Accelerate performance and speed of string match in R
Imperfect string match using data.table in R
Is there a way to extract the "best" match from my list that is being generated by:
dt <- dt[lapply(car.vins, function(x) agrep(x,vin.vins, max.distance=c(cost=2, all=2), value=T)), list(NumTimesFound=.N), vin.names]
because as of now, the agrep function gives me multiple matches, even with a lot of modification of the cost, all, substitution, etc. arguments.
I have also tried using the adist function instead of agrep, but because adist does not have an option for value=TRUE like agrep, it throws the same error I was receiving with agrep before:
Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins, :
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
Character columns must join to factor or character columns.
Is there perhaps some other package I could use?
Thanks!
Tom, this isn't strictly a data.table problem. Also, it's hard to know exactly what you want without the data you are using. I tried to figure out what you want, and came up with this solution:
vin.match <- vapply(car.vins, function(x) which.min(adist(x, vin.vins)), integer(1L))
data.frame(car.vins, vin.vins=vin.vins[vin.match], vin.names=vin.names[vin.match])
# car.vins vin.vins vin.names
# 1 abcdekl abcdef NAME1
# 2 abcdeF abcdef NAME1
# 3 laskdjg laskdjf NAME2
# 4 blerghk blerghk NAME3
And here is the data:
vin.vins <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", 1:length(vin.vins))
car.vins <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")
This will find the closest match in vin.vins for every value in car.vins, as per adist. I'm not sure data.table is needed for this particular step. If you provide your actual data (or a representative sample), I can provide a more targeted answer.
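One caveat: which.min always returns some match, however distant. A hedged sketch (my addition) of rejecting best matches beyond a tolerance -- the cutoff of 2 is a guess, tune it for your VINs:
d <- adist(car.vins, vin.vins)                 # full distance matrix
best <- apply(d, 1, which.min)                 # closest database VIN per input VIN
ok <- d[cbind(seq_along(best), best)] <= 2     # keep only close-enough matches
data.frame(car.vins,
           vin.vins  = ifelse(ok, vin.vins[best], NA),
           vin.names = ifelse(ok, vin.names[best], NA))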
I have millions of keywords in a column labeled Keyword.text. Each factor (keyword) can contain multiple words (or shall we say tokens). Here is an example with 4 keywords:
Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9
I'd like to count the number of tokens in each Keyword, so as to obtain:
Keyword.length
5
4
1
4
I installed the tau package but I haven't gotten very far...
textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
returns an error I don't understand. Maybe it's due to the column being a factor; it worked fine when I practiced with a plain string.
I know how to do it in Excel, but that doesn't work for the last line. If A2 holds the keywords, then =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do it (it counts spaces, so it misses tokens separated by + rather than a space).
Edit: For a data.frame and the total number of tokens in each keyword, just use strsplit. There's no need for textcnt if you're not interested in the frequency of each individual token. That's where I misread your question:
tt <- data.frame(
  a = rnorm(3),
  b = rnorm(3),
  c = c("the quick fox lazy", "rbrown+fr even", "what what goes & around"),
  stringsAsFactors = FALSE
)
sapply(tt$c, function(n) {
  length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})
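Applied to the question's data frame, that would look something like this (my addition, assuming Mydf from the question). The sum(nzchar(...)) guard drops the empty string strsplit produces when a keyword starts with punctuation, like ".8 .crazy lazy dog":
Mydf$Keyword.length <- sapply(as.character(Mydf$Keyword.text), function(n) {
  sum(nzchar(strsplit(n, split = "[[:space:][:punct:]]+")[[1]]))
})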
To read the data, take a look at ?readLines and/or ?scan as well. These preserve the string format and allow you to process the file line by line (or row by row). If you use a file connection, you can even load the file in parts, which helps when you hit memory limits.
A simple example using readLines:
library(tau)   # for textcnt()
con <- textConnection("
The lazy fog+fog fog
never ended for fog jumping over the
fog whatever . $ plus.
")
# For a real file, you would use con <- file("myfile.txt") instead
Text <- readLines(con)
close(con)
sapply(Text, textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
On a side note, using the option Dirk mentioned (stringsAsFactors=FALSE) won't slow down performance compared to the usual read.table call -- on the contrary, actually. You should use the sapply as shown above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=FALSE option and drop the as.character()).
Please show the error.
Also try:
require(tau)
textcnt(as.character(Mydf$Keyword.text),
        split = "[[:space:][:punct:]]+", method = "string", n = 1L)
... to force character mode.
Or load your data with stringsAsFactors=FALSE -- the same question has come up here before.
What about a nice little function that lets us decide which kind of words we would like to count, and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts any whitespace-separated chunk;
  # otherwise only runs of alphabetic characters count as words
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6