Sorting by a slice of the text string - r

I want to sort my data, but the standard Excel "A to Z" sort function isn't cutting it. I was hoping someone knew how to make a custom sort that could suit my needs. Here is a sample:
chrPos count
chr1_10000598 10
chr1_10000647 10
chr1_10001370 30
chr1_10001390 30
chr1_10001392 30
chr1_10001414 30
chr1_10001418 30
chr1_10001473 10
chr1_10001505 10
chr1_10001516 20
chr1_1000156 30
As you can see, the last row is out of place when using the built-in sort function; it should be the first row, not the last one. I think adding a second layer of sorting would do the trick, but that layer would have to sort in ascending order by the number that follows the underscore.
Any ideas? Would this possibly be easier with R instead?
Edit to add details from comments:
Sorting is to be ascending on the numeric part after the underscore, within ascending on the chr numeric part (running from 1 to 22 both inclusive) and then chrM_, chrX_ and chrY_ in that order (also with their numeric parts sorted ascending).
The numeric part after the underscore may be up to 8 digits.

Assuming chrPos is in ColumnA, please try in a helper column:
=IF(FIND("_",A1)=5,CHAR(64+MID(A1,4,1)),CHAR(64+MID(A1,4,2)))&REPT("0",8-LEN(A1)+FIND("_",A1))&MID(A1,FIND("_",A1)+1,8)
OR, for additional requirements as mentioned in comments:
=IF(MID(A1,4,1)="M","W",IF(MID(A1,4,1)="X","X",IF(MID(A1,4,1)="Y","Y",IF(FIND("_",A1)=5,CHAR(64+MID(A1,4,1)),CHAR(64+MID(A1,4,2))))))&REPT("0",9-LEN(A1)+FIND("_",A1))&MID(A1,FIND("_",A1)+1,9)
then select the helper column, Copy, Paste Special, Values over the top and use that for sorting.
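On the question of whether R would be easier: here is a minimal sketch in base R, assuming the labels are held in a character vector. The sample values are made up for illustration, apart from the two taken from the question. It splits each label at the underscore, ranks the chromosome part in the order 1 to 22, M, X, Y, and orders by that rank and then by the numeric position.

```r
# Sample labels (illustrative; only the first two come from the question)
chrPos <- c("chr1_10000598", "chr1_1000156", "chr10_500",
            "chrX_42", "chr2_7", "chrM_3")

# Split "chr<name>_<position>" into its two parts
parts <- strsplit(chrPos, "_", fixed = TRUE)
chrom <- sub("^chr", "", vapply(parts, `[`, character(1), 1))
pos   <- as.numeric(vapply(parts, `[`, character(1), 2))

# Rank chromosomes in the requested order: 1..22, then M, X, Y
chrom_rank <- match(chrom, c(as.character(1:22), "M", "X", "Y"))

# Order by chromosome rank first, numeric position second
sorted <- chrPos[order(chrom_rank, pos)]
sorted
```

For a data frame, the same order() call can be used to index the rows instead of the vector.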

Related

converting rows to columns of a data frame in R

I have a data set like this
movieID title year country genre directorName Rating actorName1 actorName.2
1 hello 1995 USA action john smith 6 tom hanks charlie sheen
2 MI2 1997 USA action mad max 8 tom cruize some_body
3 MI2 1997 USA thriller mad max 8 tom cruize some_body
Basically, there are numerous rows that differ only in a user-given genre, and I would like to convert those into columns named genre1, genre2, ...
I tried reshape() but it would only convert based on some ID variable. If anyone has any ideas let me know
You can use reshape() to do this, if you understand the lens through which reshape() views data.
Background
First, consider the concept of a record in the context of the relational model of data management. Generally, in a table of data, each record should correspond to a well-defined unit of data, concisely termed the record unit, with one or more columns acting as identification or key variables that serve to differentiate between unique instances of the record unit.
Usually, units are described by a set of scalar variables. In other words, each record has associated with it one or more scalar values, each of which provides a single piece of information about the unit. In a nice simple world, all properties of units would be scalar, and thus you could represent each variable as a single column vector, with each element/cell corresponding to one record unit, and thereby providing the value of that particular property for that particular unit.
Further to the concept of properties, it is possible and very common to identify typing or grouping classifications of units. These are often represented as additional scalar properties of units.
When people talk about the long format vs. the wide format of tabular data, they are generally referring to how these kinds of type classifications are laid out in a table. This choice of data layout is directly related to the choice of unit that is represented by a single record in the table. These are actually one and the same choice.
For example, in an experiment with multiple measurements per individual, it would be possible to store one measurement per record, with individuals represented over multiple records, and a type column to distinguish between measurement type. Alternatively, it would be possible to store one individual per record, with each measurement represented by a single column. With respect to each other, the former format is long, and the latter format is wide. But now consider that, if each individual belonged to a single experimental group within the experiment, it would be possible to store one group per record, with each individual represented by a set of columns, and each measurement represented by one column within the set. This is yet a "wider" format, if you will. It's all relative.
Unfortunately, unit characteristics are sometimes more complex than simple scalar values. The most common case of this is a multivalue property, sometimes described as a many-to-one relationship (especially in the context of DBMSs). In other words, multiple values for the property can be associated with a single record unit. In cases like this, it is not possible to represent the multivalue property as a simple column vector within the data set. There are hacks that programmers often settle into when trying to deal with this complexity, such as:
Concatenating the multiple values into a single scalar value (such as a single comma-separated string, or a bit vector). Let's call this the "concatenation hack".
Duplicating the unit record for each value of the property. (This generally can only be plausible if only one of the properties in the data set is multivalue.) Let's call this the "duplication hack".
Separating the property into multiple "instances" of itself, each stored in its own column. Let's call this the "separation hack".
Simply trying to ignore all but one of the multiple values. Let's call this the "ignorance hack".
In some contexts, special data types can be used to more appropriately represent the data as a pseudo-column-vector. PostgreSQL, for example, provides an array column type, and even R data.frames can have list columns whose individual elements can hold any data type supported by R, including multielement vectors. These representations are usually preferable to the aforementioned hacks.
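As a small illustration of the list-column idea, here is a sketch in base R. The movies data frame and its values are assumptions modelled on the question's sample, not the asker's actual data.

```r
# One row per movie; scalar properties are ordinary columns
movies <- data.frame(title = c("hello", "MI2"),
                     year  = c(1995L, 1997L),
                     stringsAsFactors = FALSE)

# The multivalue property is stored as a list column:
# each element may hold any number of genres
movies$genre <- list("action", c("action", "thriller"))

# Number of genres attached to each movie
lengths(movies$genre)
```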
Probably the most widely used solution that I wouldn't qualify as a hack is to completely separate the multivalue property from the primary table of data, and instead store it as a separate table which is linked to the primary table on a key. Each record in the secondary table has a key to a record in the primary table, and stores alongside the key a single value of the multivalue property. This is the design advocated by the relational model.
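A minimal sketch of that two-table layout in base R follows. The table names, the movie_key column, and the values are hypothetical, chosen to mirror the question's sample.

```r
# Primary table: one record per movie, identified by a key
movies <- data.frame(movie_key = 1:2,
                     title = c("hello", "MI2"),
                     year  = c(1995L, 1997L),
                     stringsAsFactors = FALSE)

# Secondary table: one record per (movie, genre) pair
genres <- data.frame(movie_key = c(1L, 2L, 2L),
                     genre = c("action", "action", "thriller"),
                     stringsAsFactors = FALSE)

# Joining on the key reproduces the duplicated layout on demand
joined <- merge(movies, genres, by = "movie_key")
joined
```

The join is only performed when the combined view is needed, so the movie-level columns are stored once.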
These approaches all have their own tradeoffs, of course, and the analysis of which is optimal for a given situation can be very complex, nebulous, and even somewhat subjective. I won't go into more detail on this here.
Before I begin to talk about reshape(), it is important to emphasize that unit typing is a very different thing from multivalue properties. Reshaping data is generally supposed to be about managing typing and record unit selection. It is not supposed to be about managing multivalue property layout, but it can be used in this way, as we will see.
reshape()
At its most abstract, reshape() can be used to transform a set of typed scalar data columns from one row per type with a discriminator column to one column per type with a discriminator suffix in the column name, for every unique (possibly multicolumn) key, and vice-versa.
The key will generally correspond with a single record unit, to use the terminology introduced earlier. Each key uniquely identifies one record unit.
The data columns are the actual variables/properties which describe the record units, with the discriminator acting to distinguish between the different types of the data variables.
In the terminology of the reshape() documentation and interface, the key columns are "id" columns, the discriminator is the "time" column, and the data columns are "varying" columns.
It is important to understand that the key you specify as the idvar argument is always the unique key of the wide format, whether you are transforming to wide from long, or to long from wide. In the long format, the unique key is the idvar columns plus the discriminator column (timevar).
Here's a simple demo:
## define example long table
long <- data.frame(id1=rep(letters[1:2],each=4L),id2=rep(1:2,each=2L),type=1:2,x=1:8,y=9:16);
long;
## id1 id2 type x y
## 1 a 1 1 1 9
## 2 a 1 2 2 10
## 3 a 2 1 3 11
## 4 a 2 2 4 12
## 5 b 1 1 5 13
## 6 b 1 2 6 14
## 7 b 2 1 7 15
## 8 b 2 2 8 16
## convert to wide
idvar <- c('id1','id2');
timevar <- 'type';
wide <- reshape(long,dir='w',idvar=idvar,timevar=timevar);
attr(wide,'reshapeWide') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
wide;
## id1 id2 x.1 y.1 x.2 y.2
## 1 a 1 1 9 2 10
## 3 a 2 3 11 4 12
## 5 b 1 5 13 6 14
## 7 b 2 7 15 8 16
## convert back to long
long2 <- reshape(wide,dir='l',idvar=idvar,timevar=timevar,varying=names(wide)[!names(wide)%in%c(idvar,timevar)]);
attr(long2,'reshapeLong') <- NULL; ## remove "helper" attribute, which cannot always be relied upon
long2 <- long2[do.call(order,long2[c(idvar,timevar)]),]; ## better order, corresponding with original long
rownames(long2) <- NULL; ## remove useless row names
long2$type <- as.integer(long2$type); ## annoyingly, longifying interprets discriminator suffixes as doubles
identical(long,long2);
## [1] TRUE
The above code also demonstrates some of the quirks committed by reshape(), such as attribute assignments (that I've never seen anyone rely upon), unexpected row order, undesirable row names, and non-ideal vector type derivation. All of these quirks can be papered over with simple modifications, as I show above.
Also notice that the varying argument can be omitted when transforming from long to wide, in which case it is derived by reshape() by the process of elimination, but it cannot be omitted when transforming from wide to long.
Input
The situation you've gotten yourself into appears to be that you have a data.frame that is supposed to contain one row per movie, but each movie record has been duplicated for each genre that is associated with the movie. In other words, the movie is the record unit, and the genre is a multivalue property associated with the movie, which is currently being represented by the duplication hack.
Your objective seems to be to transform the data from the duplication hack into the separation hack.
I don't mean to sound too critical here; these hacks are widely used and are, in many cases, fairly effective at handling this kind of complexity in a relatively simple way. It's very likely this is a good solution for your application. But I'm going to call a spade a spade; these are hacks, and are far from the most appropriate or robust solutions for data processing. And I agree that the separation hack is better than the duplication hack.
Another confusing detail is that there is a movieID column which appears to be unique per row, and not unique per movie. IDs 2 and 3 both seem to be associated with movie MI2.
My interpretation is that, in the input, because the duplication hack has been used to deal with multiple genres, each row can be thought of as being unique per genre instance. In other words, each row represents a single instance of a genre as used in a single movie. Hence the movieID column is better thought of as a genre instance identifier, and has just been misnamed. (An alternative interpretation is that it was generated incorrectly, and should be unique per movie, in which case it should be fixed and treated identically to the key columns described later.)
Solution
We can solve this problem by calling reshape() to transform from long format to wide format.
Recall that reshaping is supposed to be used for type layout, for navigating between record unit representations. Here we're instead going to use it for transforming how the multivalue property currently stored in the genre column is laid out.
Now, the most important question is, which columns are keys (idvar), which is the discriminator (timevar), and which are data (varying)?
The easiest one is the genre column. It's a data column. It's not part of the key that will help uniquely identify each movie record in the wide format, and it's certainly not a discriminator of other data columns, so it must be a data column itself. We can also arrive at this answer by considering what must happen to it during the transformation; for each unique key, the genre values must be separated from one row per value to one column per value, which is what happens to all data columns when transforming from long to wide.
Now it's useful to consider the discriminator column. Which one is it? In actuality, it doesn't exist in the input. There's no column that says "this is genre type X, this is genre type Y". So what do we do? According to your required output, you want to associate with each genre a sequential index number, presumably in row order. This means we need to synthesize a new column with such a sequence when passing the data.frame to reshape(). However, we must be careful to ensure that the sequence starts anew for each movie, otherwise every record in the input table would see its genre occupy its own column in the output, due to its unique discriminator suffix. We can do this with ave() (grouping by the key columns) and transform(). We'll name the synthesized column time, which is the default assumption by reshape() if you don't specify the timevar argument. This will allow us to omit specification of that argument. (Note: I've always wished that reshape() would default to such a row-order sequence instead of looking for an input column named time, but it doesn't do that. Oh well.)
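To see the per-group sequence in isolation, here is a small sketch. The title vector is made up for illustration and stands in for the key columns.

```r
# Group labels standing in for the key columns (illustrative values)
title <- c("hello", "MI2", "MI2", "hello", "MI2")

# ave() applies FUN within each group and puts the results back in the
# original positions, so the counter restarts at 1 for every title
time <- ave(seq_along(title), title, FUN = seq_along)
time
```

Passing an integer vector as the first argument keeps the result an integer vector.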
Now let's deal with the movieID column. Being a unique identifier in the input table, the only way to include it in the output table would be to also treat it as a data column, so that it would be split by the discriminator into separate columns. I decided to make the assumption that you don't want to do this, so I just removed it from the input table before reshaping, by exploiting the same transform() call. If you want, you can excise the removal piece to see the effect of including movieID across the transformation.
That leaves the remaining columns of title, year, country, directorName, Rating, actorName1, and actorName.2. How should we treat these?
Technically speaking, conceptually, most of them should be data columns. They can't be discriminators (we already covered that), and there's no way most of them (Rating, for example) could be considered key columns. Again, conceptually.
But it would be incorrect to specify any of them as data columns. The reason is that we're not using reshape() in the normal way. We know the movie records have been duplicated for the genre duplication hack used by the input data.frame, and so all the columns I just listed are actually just duplicates within the movie record group. We need these columns to effectively collapse to a single record in the output, and that's exactly what happens with key columns that pass through a reshape() call. Hence, we must identify them all as key columns by passing them to the idvar argument.
Another way of thinking about this is that the key columns are left untouched by reshape(), other than deduplication (if going from long to wide) or duplication (if going from wide to long). It is only the discriminator column that is transferred from column to suffix (if going from long to wide) or vice-versa (if going from wide to long), and data columns that are transferred from single column to multiple columns (if going from long to wide) or vice-versa (if going from wide to long). We need these columns to remain untouched, other than deduplication. Hence we require all columns, other than the target multivalue property column genre and the synthesized time column (and, in this case, the extraneous movieID column) to be specified as key columns.
Note that this is true even if one or more of the key columns could serve as a true key for the movie records. For example, if title was known to be unique within the table by movie, it would still be incorrect to just specify title as the key, and all the other column names I listed as data columns, because they would then be widened in the output according to the synthesized discriminator, even though we know all values within each movie record group are identical.
So, here's the end result:
df <- data.frame(movieID=c(1L,2L,3L),title=c('hello','MI2','MI2'),year=c(1995L,1997L,1997L),country=c('USA','USA','USA'),genre=c('action','action','thriller'),directorName=c('john smith','mad max','mad max'),Rating=c(6L,8L,8L),actorName1=c('tom hanks','tom cruize','tom cruize'),actorName.2=c('charlie sheen','some_body','some_body'),stringsAsFactors=F);
idcns <- names(df)[!names(df)%in%c('movieID','genre')];
reshape(transform(df,movieID=NULL,time=ave(df$movieID,df[idcns],FUN=seq_along)),dir='w',idvar=idcns,sep='');
## title year country directorName Rating actorName1 actorName.2 genre1 genre2
## 1 hello 1995 USA john smith 6 tom hanks charlie sheen action <NA>
## 2 MI2 1997 USA mad max 8 tom cruize some_body action thriller
Note that it is irrelevant exactly which vector is passed as the first argument to ave(), since seq_along() ignores its argument, except for its length. But we do require an integer vector, since ave() tries to coerce its result to the same type as the argument. It is acceptable to use df$movieID because it is an integer vector; alternatively we could use df$year, df$Rating, or synthesize an integer vector with seq_len(nrow(df)) or integer(nrow(df)).
Try this with dplyr and tidyr:
library(tidyr)
library(dplyr)
df %>% mutate(yesno=1) %>% spread(genre, yesno, fill=0)
This creates a column yesno that just gives a value to fill in for each genre. We can then use spread from tidyr. fill=0 means to fill in those not in the genre with 0 instead of NA.
Before:
genre title yesno
1 action lethal weapon 1
2 thriller shining 1
3 action taken 1
4 scifi alien 1
After:
title action scifi thriller
1 alien 0 1 0
2 lethal weapon 1 0 0
3 shining 0 0 1
4 taken 1 0 0
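The same indicator layout can also be produced without extra packages. This is a minimal base R sketch, assuming a data frame shaped like the "Before" table; it is an alternative technique, not a description of what spread() does internally. table() cross-tabulates title against genre, and as.data.frame.matrix() turns the counts into a data frame.

```r
# Data shaped like the "Before" table
df <- data.frame(genre = c("action", "thriller", "action", "scifi"),
                 title = c("lethal weapon", "shining", "taken", "alien"),
                 stringsAsFactors = FALSE)

# Cross-tabulate: rows are titles, columns are genres, cells are counts
ind <- as.data.frame.matrix(table(df$title, df$genre))
ind
```

Here the titles end up as row names rather than as a column, and the cells are counts, so a title listed twice under the same genre would show 2 rather than 1.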

Using cbind on XTS object changes the dash (-) character in previous column names to a dot (.)

I have some R code that creates an XTS object and then performs various cbind operations over the lifetime of that object. Some of my columns have names such as "adx-1". That is fine until another cbind() operation is performed. At that point, any "-" character in a column name is changed to a ".". So "adx-1" becomes "adx.1".
To reproduce:
library(xts)
x = xts(order.by=as.Date(c("2014-01-01","2014-01-02")))
x = cbind(x,c(1,2))
x
..2
2014-01-01 1
2014-01-02 2
colnames(x) = c("adx-1")
x
adx-1
2014-01-01 1
2014-01-02 2
x = cbind(x,c(1,2))
x
adx.1 ..2
2014-01-01 1 1
2014-01-02 2 2
It doesn't just do this with numbers either. It changes "test-text" to "test.text" as well. Multiple dashes are changed too. "test-text-two" is changed to "test.text.two".
Can someone please explain why this happens and, if possible, how to stop it from happening?
I can of course change my naming schemes, but it would be preferred if I didn't have to.
Thanks!
merge.xts converts the column names into syntactic names, which cannot contain -. According to ?Quotes:
Identifiers consist of a sequence of letters, digits, the period
('.') and the underscore. They must not start with a digit nor
underscore, nor with a period followed by a digit.
There is currently no way to alter this behavior.
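The conversion itself can be reproduced in base R with make.names(), which turns arbitrary strings into syntactic names. Whether merge.xts calls exactly this function internally is an assumption here; the sketch only shows that the result matches what the question reports.

```r
# Column names from the question
old_names <- c("adx-1", "test-text", "test-text-two")

# Characters that are not allowed in a syntactic name are replaced by "."
syntactic <- make.names(old_names)
syntactic
```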
The reason for the behavior is precisely the one Joshua Ulrich highlighted. It's common across many data types in R: you need "valid" names. Here is a great discussion of this "issue".
For data frames, you can pass the option check.names = FALSE as a workaround, but this is not implemented for xts objects. That said, there are plenty of other workarounds available to you.
For instance, you could simply rename the columns of interest after every cbind. Using your code, simply add:
colnames(x)[1] <- c("adx-1")
to force back your desired column name.
Alternatively, you could consider this gsub solution if you wanted something potentially more systematic.

Determine when columns of a data.frame change value and return indices of the change

I am trying to find a way to determine when a set of columns changes value in a data.frame. Let me get straight to the point, please consider the following example:
x<-data.frame(cnt=1:10, code=rep('ELEMENT 1',10), val0=rep(5,10), val1=rep(6,10),val2=rep(3,10))
x[4,]$val0=6
The cnt column is a unique ID (could be a date, or time column, for simplicity it's an int here)
The code column is like a code for the set of rows (imagine several such groups but with different codes). The code and cnt are the keys in my data.table.
The val0,val1,val2 columns are something like scores.
The data.frame above should be read as: the scores for 'ELEMENT 1' started as 5,6,3, remained as is until the 4th iteration, when they changed to 6,6,3, and then changed back to 5,6,3.
My question, is there a way to get the 1st, 4th, and 5th row of the data.frame? Is there a way to detect when the columns change? (There are 12 columns btw)
I tried using the duplicated of data.table (which worked perfectly in the majority of the cases) but in this case it will remove all duplicates and leave rows 1 and 4 only (removing the 5th).
Do you have any suggestions? I would rather not use a for loop as there are approx. 2M lines.
In data.table version 1.8.10 (stable version in CRAN), there's a(n) (unexported) function called duplist that does exactly this. And it's also written in C and is therefore terribly fast.
require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5
If you're using the development version of data.table (1.8.11), then there's a more efficient version (in terms of memory) renamed as uniqlist, that does exactly the same job. Probably this should be exported for next release. Seems to have come up on SO more than once. Let's see.
require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
Totally unreadable, but:
c(1,which(rowSums(sapply(x[,grep('val',names(x))],diff))!=0)+1)
# [1] 1 4 5
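Basically, run diff on each column to find the changes between consecutive rows. If a change occurs in any column, then a change has occurred in the row. (Summing the differences will miss a row where changes in different columns cancel each other out, e.g. +1 in one column and -1 in another.)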
Also, without the sapply:
c(1,which(rowSums(diff(as.matrix(x[,grep('val',names(x))])))!=0)+1)
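A slightly more defensive variant of the same idea, as a sketch: it tests each difference against zero before summing, so that offsetting changes in different columns cannot cancel out. It uses the example data from the question and assumes the score columns all start with "val".

```r
# Example data from the question
x <- data.frame(cnt = 1:10, code = rep("ELEMENT 1", 10),
                val0 = rep(5, 10), val1 = rep(6, 10), val2 = rep(3, 10))
x$val0[4] <- 6

# Score columns as a matrix; diff() then works column by column
vals <- as.matrix(x[, grep("^val", names(x))])

# TRUE where at least one column differs from the previous row
changed <- rowSums(diff(vals) != 0) > 0

# Row 1 always starts a run; diff() row i compares rows i and i + 1
idx <- c(1L, unname(which(changed)) + 1L)
idx
```

x[idx, ] then returns the first row of every run of identical scores.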

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e. for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations.
i.e
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data on the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
What I actually intended to do was to stick to something I know (the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second location in the respective files), then cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer.
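For each input string, gregexpr returns a vector of match positions, and a failed match is reported as the single value -1. The length is therefore 1 both when there is no comma and when there is exactly one, which is why every entry comes back as 1. You need to look at the returned values, not their length.
Try foo <- sapply(gregexpr(',', testdat$genome_coordinates, fixed=TRUE), function(m) sum(m > 0)) to count the commas in each row; foo > 0 then flags the rows with a comma. Converting the positions directly with as.logical() would not work, since -1 also converts to TRUE.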
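For the second part of the question (one row per location), here is a minimal base R sketch. The feature column is a hypothetical stand-in for "the associated data in the adjacent columns", and the code assumes every entry has the form [build:chrom:start-end(,start-end...):strand].

```r
testdat <- data.frame(
  feature = c("f1", "f2"),  # hypothetical adjacent column
  genome_coordinates = c("[hg19:2:224840068-224840089:-]",
                         "[hg19:16:67000244-67000248,67000628-67000647:+]"),
  stringsAsFactors = FALSE
)

# Strip the surrounding brackets, then split into the four ":" fields
clean  <- gsub("^\\[|\\]$", "", testdat$genome_coordinates)
fields <- strsplit(clean, ":", fixed = TRUE)

# The third field may hold several comma-separated ranges
ranges <- lapply(fields, function(f) strsplit(f[3], ",", fixed = TRUE)[[1]])

# Index of the source row for every range, so other columns can be repeated
row_of <- rep(seq_along(ranges), lengths(ranges))
bounds <- do.call(rbind, strsplit(unlist(ranges), "-", fixed = TRUE))

# Helper: i-th field of each entry, repeated once per range
pick <- function(i) vapply(fields, `[`, character(1), i)[row_of]

expanded <- data.frame(
  feature = testdat$feature[row_of],
  build = pick(1), chrom = pick(2),
  start = as.numeric(bounds[, 1]), end = as.numeric(bounds[, 2]),
  strand = pick(4), stringsAsFactors = FALSE
)
expanded
```

Any further columns can be carried along the same way, by indexing them with row_of.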

Probability of 3-character string appearing in a randomly generated password

If you have a randomly generated password, consisting of only alphanumeric characters, of length 12, and the comparison is case insensitive (i.e. 'A' == 'a'), what is the probability that one specific string of length 3 (e.g. 'ABC') will appear in that password?
I know the number of total possible combinations is (26+10)^12, but beyond that, I'm a little lost. An explanation of the math would also be most helpful.
The string "abc" can appear in the first position, making the string look like this:
abcXXXXXXXXX
...where the X's can be any letter or number. There are (26 + 10)^9 such strings.
It can appear in the second position, making the string look like:
XabcXXXXXXXX
And there are (26 + 10)^9 such strings also.
Since "abc" can appear anywhere from the first through the 10th position, there are 10*36^9 such strings.
But this overcounts, because it counts (for instance) strings like this twice:
abcXXXabcXXX
So we need to count all of the strings like this and subtract them off of our total.
Since there are 6 X's in this pattern, there are 36^6 strings that match this pattern.
I get 7+6+5+4+3+2+1 = 28 patterns like this. (If the first "abc" is at the beginning, the second can be in any of 7 places. If the first "abc" is in the second place, the second can be in any of 6 places. And so on.)
So subtract off 28*36^6.
...but that subtracts off too much, because strings like this were counted three times and then subtracted three times (once for each pair of "abc"s), so they are no longer counted at all:
abcXabcXabcX
So we have to add back in the strings like this, once. I get 4+3+2+1 + 3+2+1 + 2+1 + 1 = 20 of these patterns, meaning we have to add back in 20*(36^3).
But that math has now counted this string 4 - 6 + 4 = 2 times:
abcabcabcabc
...so we have to subtract off 1.
Final answer:
10*36^9 - 28*36^6 + 20*(36^3) - 1
Divide that by 36^12 to get your probability.
See also the Inclusion-Exclusion Principle. And let me know if I made an error in my counting.
If A is not equal to C, the probability P(n) of ABC occurring in a string of length n (assuming every alphanumeric symbol is equally likely) is
P(n)=P(n-1)+P(3)[1-P(n-3)]
where
P(0)=P(1)=P(2)=0 and P(3)=1/(36)^3
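The counting argument and the recurrence can be cross-checked numerically. The sketch below works with probabilities rather than raw counts, since 36^12 is too large to be represented exactly as a double. It assumes the target string cannot overlap itself (true for "abc"), so k copies can be placed in choose(n - 2k, k) ways.

```r
n  <- 12
p3 <- 1 / 36^3          # probability that three given positions spell "abc"

# Inclusion-exclusion: alternating sum over k = 1..4 non-overlapping copies
k    <- 1:4
p_ie <- sum((-1)^(k + 1) * choose(n - 2 * k, k) * p3^k)

# Recurrence P(m) = P(m-1) + P(3) * (1 - P(m-3)), with P(0) = P(1) = P(2) = 0
P <- numeric(n + 1)     # P[m + 1] holds P(m)
for (m in 3:n) P[m + 1] <- P[m] + p3 * (1 - P[m - 2])
p_rec <- P[n + 1]

c(inclusion_exclusion = p_ie, recurrence = p_rec)
```

Both values should agree up to floating-point error, at roughly 0.000214.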
To expand on Paul R's answer. Probability (for equally likely outcomes) is the number of possible outcomes of your event divided by the total number of possible outcomes.
There are 10 possible places where a string of length 3 can be found in a string of length 12. And there are 9 more spots that can be filled with any other alphanumeric characters, which leads to 36^9 possibilities. So the number of possible outcomes of your event is 10 * 36^9.
Divide that by your total number of outcomes 36^12. And your answer is 10 * 36^-3 = 0.000214
EDIT: This is not completely correct. In this solution, some cases are double counted. However, they only form a very small contribution to the probability (about 1.3 x 10^-8), so this answer is still correct to about 7 decimal places. If you want the full answer, see Nemo's answer.
