Complex sorting by single column in R - r

I am trying to sort a data frame by codes contained in one column.
The logic behind these codes is:
S/number/number/number/letter (e.g. S120B). There are not always three digits (e.g. S10K), and the letter is not always present (e.g. S2).
The first code is S1, and the list goes up to S999, where it turns to S1A. Then it goes to S999A and then turns to S1B, and so on.
Furthermore, there are also codes in there that are totally different, such as W23, E100, etc., that should go together.
How can I order the data frame according to this pretty sick ordering scheme?
MWE: codes <- c("S1", "S20D", "S550C", "S88A", "S420K", "E44", "W22")

Following your directions, this is a customized function:
codes <- c("S1", "S20D", "E44", "S550C", "S88A", "S420K", "W22")
complex_order <- function(codes) {
  # Create empty order vector
  final_order <- rep(NA, length(codes))
  # First take into account the codes that do not match the S convention: they go last
  not_in_convention <- tolower(substr(codes, 1, 1)) != "s"
  if (any(not_in_convention)) {
    final_order[(length(codes) - sum(not_in_convention) + 1):length(codes)] <- which(not_in_convention)
  }
  # Then check the ones that have a letter at the end
  letter_at_end <- tolower(substr(codes, nchar(codes), nchar(codes))) %in% letters & !not_in_convention
  for (idx in which(letter_at_end)) {
    lettr <- tolower(substr(codes[idx], nchar(codes[idx]), nchar(codes[idx])))
    lettr_value <- which(lettr == letters) * 1000 # Every letter means 1000 positions ahead
    # Keep the numeric part and add the letter offset (e.g. S88A -> S1088)
    number <- as.numeric(substr(codes[idx], 2, nchar(codes[idx]) - 1))
    codes[idx] <- paste0("S", lettr_value + number)
  }
  # Now that all S codes are purely numeric, order the values
  values <- as.numeric(substr(codes[!not_in_convention], 2, nchar(codes[!not_in_convention])))
  final_order[seq_along(values)] <- which(!not_in_convention)[order(values)]
  final_order
}
codes[complex_order(codes)]
[1] "S1" "S88A" "S550C" "S20D" "S420K" "E44" "W22"
Hope it helps!

1. Create a minimal reproducible example ;)
mre <- data.frame(ID = c("S1", "S20D", "S550C", "S88A", "S420K", "E44", "W22"),
stringsAsFactors = FALSE)
Now, I am not sure what you mean by:
Furthermore, there are also codes in there that are totally different, such as W23, E100, etc., that should go together.
If you mean that "W23" should be read and sorted totally differently from "S999", we need some additional information on how to distinguish between the two cases. Otherwise this should work:
2. Suggested solution: alphabetical sorting
library(dplyr)
mre %>%
arrange(ID)
ID
1 E44
2 S1
3 S20D
4 S420K
5 S550C
6 S88A
7 W22
Or using only base R:
mre[order(mre$ID),]
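Note that plain alphabetical sorting puts S20D before S88A, which is not the order described in the question. Below is a minimal sketch of sorting by the parsed pieces instead. It assumes that S codes look like S + digits + an optional uppercase letter, and that all other codes should go last in alphabetical order; order_codes is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: returns the row order for a vector of codes.
# Assumptions: S codes match "S<digits><optional uppercase letter>";
# every other code is placed last, sorted alphabetically.
order_codes <- function(codes) {
  is_s <- grepl("^S[0-9]+[A-Z]?$", codes)
  # Numeric part of the S codes; set to 0 for all other codes
  number <- as.numeric(gsub("[^0-9]", "", codes))
  number[!is_s] <- 0
  # Rank of the trailing letter: 0 if absent, 1 for A, 2 for B, ...
  letter_rank <- match(sub("^S[0-9]+", "", codes), LETTERS, nomatch = 0L)
  letter_rank[!is_s] <- 0L
  # Sort keys: S codes first, then letter, then number, then the code itself
  order(!is_s, letter_rank, number, codes)
}

mre <- data.frame(ID = c("S1", "S20D", "S550C", "S88A", "S420K", "E44", "W22"),
                  stringsAsFactors = FALSE)
sorted <- mre[order_codes(mre$ID), , drop = FALSE]
```

With these assumptions, sorted$ID starts with the letterless S codes, continues with the A block, the B block and so on, and ends with the non-S codes.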

Related

How to batch process data frames with different dimensions but the same name pattern in R

In the R environment, I already have some variables, named:
id_01_r
id_02_l
id_05_l
id_06_r
id_07_l
id_09_1
id_11_l
So their pattern is id_ followed by two digits, then _ and either r or l.
Each of them is a data frame, but with a different dim() output.
There are also some other variables in the environment, so first I need to pick out these frames. For this, I was going to use:
> a <- list(ls()[grep("id*",ls())]) # a little sample for just id*, I know
But this puts all the names into a single list element, so I don't think it's a good way:
> length(a)
[1] 1
I know how to read them in as shown below, but I'm confused about how to extract them and run the same processing on each.
i_set <- Sys.glob(paths='mypath/////id*.txt')
for (i in i_set) {
  assign(substring(i, startx, endx), read.table(file = i, header = FALSE))
}
The key point is that I want to apply the same series of processing steps to each of these frames. How can I do that for all of them at once instead of one by one?
Thanks for your kind consideration.
Here is an example:
id_01_r <- iris
id_02_l <- mtcars
foo <- 42
vars <- grep("^id_\\d{2}_[rl]$", ls(), value = TRUE)
# [1] "id_01_r" "id_02_l"
process_data <- function(df) {
  dim(df)
}
processed_data <- lapply(
  mget(vars),
  process_data
)
# $id_01_r
# [1] 150 5
#
# $id_02_l
# [1] 32 11
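If the frames come from files anyway, one option is to skip assign() and read them straight into a named list, so there is nothing to fish out of the environment later. The following is only a sketch: the temporary directory and the two demo files are made up so that the example runs on its own, and the real glob pattern would have to match your paths.

```r
# Demo setup (hypothetical files): write two small tables to a temp directory
tmp <- tempfile("frames_")
dir.create(tmp)
write.table(head(iris[1:4], 3), file.path(tmp, "id_01_r.txt"),
            row.names = FALSE, col.names = FALSE)
write.table(head(mtcars[1:2], 5), file.path(tmp, "id_02_l.txt"),
            row.names = FALSE, col.names = FALSE)

# Read every matching file into one list, named after the file
files  <- sort(Sys.glob(file.path(tmp, "id_*.txt")))
frames <- lapply(files, read.table, header = FALSE)
names(frames) <- tools::file_path_sans_ext(basename(files))

# Apply the same processing step to every frame
dims <- lapply(frames, dim)
```

Each processing step then becomes another lapply() over frames, and the results stay keyed by the original names.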

Loop through df column, comparing to list and creating new column

I have a column of numbers, social security numbers for example. I would like to compare this column to a list of unacceptable values (like 11111111 or 12345678, for example). There are also some grepl operations I would like to perform, e.g. the first 3 digits can't be 000. Below is a skeleton of what I think the code could look like; I prefer for-loop logic.
ssns <- c(12343210,23454321,34565432,11111111)
badssns <- c(11111111,22222222)
for( i in 1:length(ssns)) {
if(ssns[i] %in% badssn_list) {
ssns$newcolumn==BADSSN
}
else if( grepl(first 3 numbers 0){
ssns$newcolumn==BADSSN
}
else{ssns$newcolumn==GOODSSN}
}
Just using a nested ifelse should do the job imo:
ssns$newcolumn <- ifelse(ssns$num %in% badssns, 'BADSSN',
ifelse(substr(ssns$num,1,3)=='000', 'BADSSN', 'GOODSSN'))
or shorter using an OR statement (|):
ssns$newcolumn <- ifelse(ssns$num %in% badssns| substr(ssns$num,1,3)=='000', 'BADSSN', 'GOODSSN')
which gives:
> ssns
num newcolumn
1 12343210 GOODSSN
2 23454321 GOODSSN
3 34565432 GOODSSN
4 11111111 BADSSN
5 00065432 BADSSN
Used data:
ssns <- data.frame(num = c('12343210','23454321','34565432','11111111','00065432'), stringsAsFactors = FALSE)
badssns <- c('11111111','22222222')
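Since the question asks for for-loop logic, here is a sketch of the same check written as an explicit loop over the rows. It assumes the character data shown above (the leading-zero test only makes sense on strings) and is typically slower than the vectorised ifelse on long columns.

```r
ssns <- data.frame(num = c("12343210", "23454321", "34565432", "11111111", "00065432"),
                   stringsAsFactors = FALSE)
badssns <- c("11111111", "22222222")

# Pre-allocate the result column, then fill it row by row
ssns$newcolumn <- NA_character_
for (i in seq_len(nrow(ssns))) {
  if (ssns$num[i] %in% badssns || startsWith(ssns$num[i], "000")) {
    ssns$newcolumn[i] <- "BADSSN"
  } else {
    ssns$newcolumn[i] <- "GOODSSN"
  }
}
```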
It seems like you have some experience with computer programming, but maybe are new to R. In most cases, the best R programs don't use for loops.
Here's a more R-ish way to accomplish what you've described. It will be much faster when ssns and badssns are long.
ssns<-c(12343210,23454321,34565432,11111111)
badssns<-c(11111111,22222222)
good.idxs <- is.na(match(ssns, badssns))
good.ssns <- ssns[good.idxs]
You might want to work with strings rather than numbers -- maybe you are concerned the letter "oh" was used in place of the number "zero". This approach works in that case as well. Somewhat unexpectedly (for me, anyway), it even works when ssns is a vector of characters and badssns is a vector of numbers or vice versa!
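The mixed-type behaviour comes from match() and %in% coercing both sides to a common type before comparing. A small sketch of that, together with one caveat that matters for the "first three digits are 000" rule: numbers cannot keep leading zeros. The variable names here are made up for illustration.

```r
ssns_num <- c(12343210, 23454321, 11111111)
bad_chr  <- c("11111111", "22222222")

# Numeric values are coerced to character before matching
mixed_hit <- ssns_num %in% bad_chr

# Caveat: a numeric literal drops its leading zeros ...
zero_lost <- as.character(00065432)
# ... so the string form no longer matches the numeric form
zero_hit <- "00065432" %in% 65432
```

Here mixed_hit flags only the third value, zero_lost is "65432", and zero_hit is FALSE, which is a reason to keep SSN-like identifiers as strings.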
If ssns and badssns are character vectors:
ssns<-c("12343210","23454321","34565432","11111111","00023456")
badssns<-c("11111111","22222222")
then you can use just one ifelse:
result <- ifelse(ssns %in% badssns | grepl("^0{3}",ssns), "BADSSNS", "GOODSSNS")
##[1] "GOODSSNS" "GOODSSNS" "GOODSSNS" "BADSSNS" "BADSSNS"

R - Refactor list of lists [duplicate]

I have a list which contains list entries, and I need to transpose the structure.
The original structure is rectangular, but the names in the sub-lists do not match.
Here is an example:
ax <- data.frame(a=1,x=2)
ay <- data.frame(a=3,y=4)
bw <- data.frame(b=5,w=6)
bz <- data.frame(b=7,z=8)
before <- list( a=list(x=ax, y=ay), b=list(w=bw, z=bz))
What I want:
after <- list(w.x=list(a=ax, b=bw), y.z=list(a=ay, b=bz))
I do not care about the names of the resultant list (at any level).
Clearly this can be done explicitly:
after <- list(x.w=list(a=before$a$x, b=before$b$w), y.z=list(a=before$a$y, b=before$b$z))
but this is ugly and only works for a 2x2 structure. What's the idiomatic way of doing this?
The following piece of code will create a list with the i-th element of every list in before:
lapply(before, "[[", i)
Now you just have to do
n <- length(before[[1]]) # assuming all lists in before have the same length
lapply(1:n, function(i) lapply(before, "[[", i))
and it should give you what you want. It's not very efficient (it traverses every list many times), and you can probably make it more efficient by keeping pointers to the current list elements, so please decide whether this is good enough for you.
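Applied to the example from the question, the nested lapply can be checked against the hand-written result. This is just a worked instance of the two lines above; seq_len(n) is used in place of 1:n so that an empty list does not produce the sequence 1, 0.

```r
ax <- data.frame(a = 1, x = 2)
ay <- data.frame(a = 3, y = 4)
bw <- data.frame(b = 5, w = 6)
bz <- data.frame(b = 7, z = 8)
before <- list(a = list(x = ax, y = ay), b = list(w = bw, z = bz))

n <- length(before[[1]])  # assuming all lists in before have the same length
after <- lapply(seq_len(n), function(i) lapply(before, "[[", i))

# The inner lists keep the outer names (a, b); the outer list is unnamed
first_matches  <- identical(after[[1]], list(a = ax, b = bw))
second_matches <- identical(after[[2]], list(a = ay, b = bz))
```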
The purrr package now makes this process really easy:
library(purrr)
before %>% transpose()
## $x
## $x$a
## a x
## 1 1 2
##
## $x$b
## b w
## 1 5 6
##
##
## $y
## $y$a
## a y
## 1 3 4
##
## $y$b
## b z
## 1 7 8
Here's a different idea - use the fact that data.table can store data.frame's (in fact, given your question, maybe you don't even need to work with lists of lists and could just work with data.table's):
library(data.table)
dt = as.data.table(before)
after = as.list(data.table(t(dt)))
While this is an old question, I found it while searching for the same problem, and the second hit on Google had a much more elegant solution in my opinion:
list_of_lists <- list(a=list(x="ax", y="ay"), b=list(w="bw", z="bz"))
new <- do.call(rbind, list_of_lists)
new is now a rectangular structure, a strange object: A list with a dimension attribute. It works with as many elements as you wish, as long as every sublist has the same length. To change it into a more common R-Object, one could for example create a matrix like this:
new.dims <- dim(new)
matrix(new,nrow = new.dims[1])
new.dims needed to be saved, as the matrix() function deletes the attribute of the list. Another way:
new <- do.call(c, new)
dim(new) <- new.dims
You can now for example convert it into a data.frame with as.data.frame() and split it into columns or do column wise operations. Before you do that, you could also change the dim attribute of the matrix, if it fits your needs better.
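To get from the list-matrix back to a transposed list of lists, one possibility is to pull out its columns one at a time. This is a sketch built on the example above, not part of the original answer.

```r
list_of_lists <- list(a = list(x = "ax", y = "ay"), b = list(w = "bw", z = "bz"))
new <- do.call(rbind, list_of_lists)  # 2 x 2 list-matrix, one row per sub-list

# Each column of the list-matrix is one element of the transposed structure
transposed <- lapply(seq_len(ncol(new)), function(j) new[, j])
```

Each element of transposed is itself a list holding the j-th entry of every original sub-list.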
I found myself with this problem but I needed a solution that kept the names of each element. The solution I came up with should also work when the sub lists are not all the same length.
invertList = function(l){
  # Collect the unique element names across all sub-lists
  elemnames = NULL
  for (i in seq_along(l)){
    elemnames = c(elemnames, names(l[[i]]))
  }
  elemnames = unique(elemnames)
  res = list()
  for (i in seq_along(elemnames)){
    res[[elemnames[i]]] = list()
    for (j in seq_along(l)){
      # Copy the element only if sub-list j actually has it
      if(exists(elemnames[i], l[[j]], inherits = FALSE)){
        res[[i]][[names(l)[j]]] = l[[names(l)[j]]][[elemnames[i]]]
      }
    }
  }
  res
}

How to refactor a vector?

I have this vector
v <- c("firstOne","firstTwo","secondOne")
I would like to factor the vector assigning c("firstOne","firstTwo") to the same level (i.e., firstOne). I have tried this:
> factor(v, labels = c("firstOne", "firstOne", "secondOne"))
[1] firstOne firstOne secondOne
Levels: firstOne firstOne secondOne
But I get a duplicate factor (and a warning message advising not to use it). Instead, I would like the output to look like:
[1] firstOne firstOne secondOne
Levels: firstOne secondOne
Is there any way to get this output without brutally substituting the character strings?
Here are a couple of options:
v <- factor(ifelse(v %in% c("firstOne", "firstTwo"), "firstOne", "secondOne"))
f <- factor(v,levels = c("firstOne","secondOne")); f[is.na(f)] <- 'firstOne'
A factor is just a numeric (integer) vector with labels, and so manipulating a factor is equivalent to manipulating integers, rather than character strings. Therefore, performance-wise it is perfectly OK to do
f <- as.factor(v)
f[f %in% c('firstOne', 'firstTwo')] <- 'firstOne'
f <- droplevels(f)
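Base R also lets you merge levels in a single step by assigning a named list to levels(): each list name becomes a new level, and its value lists the old levels that map onto it. A short sketch with the vector from the question:

```r
v <- c("firstOne", "firstTwo", "secondOne")
f <- factor(v)

# New level name = old level(s) to merge into it
levels(f) <- list(firstOne = c("firstOne", "firstTwo"), secondOne = "secondOne")
```

Any old level that is not mentioned in the list would become NA, so every level that should survive has to be listed.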
You could use the rec-function of the sjmisc-package:
rec(v, "firstTwo=firstOne;else=copy", as.fac = T)
> [1] firstOne firstOne secondOne
> Levels: firstOne secondOne
(the output is shortened; note that the sjmisc-package supports labelled data and thus adds label attributes to the vector, which you'll see in the console output as well)
Eventually I also found a solution that looks somewhat sloppy, but I don't see any major issues with it (I'd be glad to hear about possible problems, though):
v <- c("firstOne","firstTwo","secondOne")
factor(v)
factor(factor(v,labels = c("firstOne","firstOne","secondOne")))

Vector-version / Vectorizing a for which equals loop in R

I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
This is the very inefficient for loop I have below. Notice that there are many observations in X and each member of "match.ind" can have zero or more matches. Also, dat.fram has over 1 million observations. Is there any way to use a vector function in R to make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .
Code:
match.ind=list()
for(i in 1:150000){
  match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, rep=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]
And that's it! As a check, let's look at the first 3 rows of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the
numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list, but this is presumably what you want anyway as they're NULL!
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list with all the additional list stuff you need (the names are the values of X not featuring in names(mylist), and the actual entries in the list are simply NA). The final two lines firstly merge mylist and mylist_extras, and then perform a reordering so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,rep=TRUE), b = sample(1:10,n,rep=TRUE),
c = sample(1:10,n,rep=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})
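A closely related sketch uses split() instead of tapply() and then indexes the result by X, so the output has one entry per element of X, in the same order as the loop in the question. The small v and X below are made-up stand-ins for dat.fram[,3] and the real X.

```r
set.seed(1)
v <- sample(1:10, 100, replace = TRUE)  # stand-in for dat.fram[, 3]
X <- c(3, 7, 42)                        # 42 never occurs in v

# One pass over v: row indices grouped by value
idx_by_value <- split(seq_along(v), v)

# Look up each element of X by name; values absent from v come back as NULL
match.ind <- unname(idx_by_value[as.character(X)])

# Replace the NULLs with integer(0) to mirror what which() returns
match.ind[lengths(match.ind) == 0] <- list(integer(0))
```

The grouping is done once, so the cost no longer grows with length(X) times length(v) as it does in the loop.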
