Reordering rows in a dataframe by multiple column permutations - r

I am trying to reorder a data.frame that contains around 250,000 rows and 7 columns. The rows I want at the top of the data.frame are those where column 2 contains the lowest value and column 7 the highest, with the columns running from lowest to highest value in this sequence: 2,5,1,4,6,3,7 (so column 5 would have the second lowest value, etc.).
Once any rows that match this sequence are identified, it would move on to find rows where the column values go from lowest to highest in the sequence 2,5,1,4,6, and then 2,5,1,4, and so on, until only rows where column 2 is the lowest and the other column values are randomly assorted remain. Any row that does not have column 2 as the lowest value would be ignored and left unsorted below the sorted rows. I am struggling to come up with any workable solution to my problem - the best I can do in terms of providing similar data to that I am working with is this:
df<-data.frame(matrix(rnorm(70000), nrow=10000))
df<-abs(df)
If anyone has any ideas, I am all ears.
Thanks!

Given that you have a largish dataset of uniform type (numeric), I would suggest using a matrix rather than a data.frame:
tt <- abs(matrix(rnorm(70000), nrow=10000))
You have a desired order you wish to match against
desiredOrder <- c(2,5,1,4,6,3,7)
You need to find what order each of your rows is in. I think it is easiest here to ensure that you get back a list with an element for each row. I'd suggest something like this:
orders <- lapply(apply(tt, 1, function(x) list(order(x))), unlist)
You will then need to go through (from desiredOrder[seq_len(7)] down to desiredOrder[seq_len(1)]) and test when the relevant prefix of each row's order equals the corresponding prefix of the desired order (I'm thinking some combination of sapply with which and all).
Once you have identified all the rows that match your required result, you can use setdiff to find the unmatched ones, and then reorder tt using this new order vector.
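A rough sketch of that search, assuming the tt, desiredOrder and orders objects from above (written for clarity rather than speed):
matched <- integer(0)
for (k in 7:1) {
  prefix <- desiredOrder[seq_len(k)]
  # rows whose k lowest-valued columns appear in exactly the desired sequence
  hits <- which(sapply(orders, function(o) all(o[seq_len(k)] == prefix)))
  matched <- c(matched, setdiff(hits, matched))  # longer matches stay on top
}
# rows that never match (e.g. column 2 not the lowest) drop to the bottom, unsorted
tt.sorted <- tt[c(matched, setdiff(seq_len(nrow(tt)), matched)), ]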

One possible approach would be to weight rankings of the values in the columns. It would be something like rank regression. 7 columns of 250K rows is not that big. For the columns where you want the low values to have higher weight, you can subtract the rank from NROW(dfrm). If you want to scale the weighting across that column ordering scheme, then just multiply by a weighting vector: say c(1, .6, .3, 0, .3, .6, 1).
dmat <- matrix(sample(20, 20*7, repl=TRUE), 20, 7)
dfrm <- as.data.frame(dmat)
dfrm$wt <- sapply( dfrm[ c(2,5,1,4,6,3,7)] , rank); dfrm
dfrm$wt[,1:3] <- rep(NROW(dfrm),3) - dfrm$wt[ , 1:3]
dfrm$wt <- dfrm$wt*rep(c(1, .6, .3, 0, .3, .6, 1), each=NROW(dfrm))
dfrm[ order( apply( dfrm$wt, 1, FUN=sum), decreasing=TRUE ) , ]
This does not force the lowest value for V2 to be first, since you implied a multiple criterion. You still have the ability to re-weight if this is not exactly what you imagined.

Like this:
dat <- as.matrix(df)
# rank of each value within its own row
rnk <- t(apply(dat, 1, rank))
# the rank each column should have under the desired lowest-to-highest sequence 2,5,1,4,6,3,7
desiredRank <- order(c(2,5,1,4,6,3,7))
# TRUE where a row's rank for a column equals the desired rank for that column
rnk.match <- rnk == matrix(desiredRank, nrow(rnk), ncol(rnk), byrow = TRUE)
# position of the first mismatch minus 1, i.e. how many leading columns already match
match.score <- apply(rnk.match, 1, match, x = FALSE) - 1
# no mismatch at all means the row matches the full sequence
match.score[is.na(match.score)] <- ncol(dat)
out <- dat[order(match.score, decreasing = TRUE), ]
head(out)
# X1 X2 X3 X4 X5 X6 X7
#[1,] 0.7740246 0.19692680 1.5665696 0.9623104 0.2882492 1.367786 1.8644204
#[2,] 0.5895921 0.00498982 1.7143083 1.2698382 0.1776051 2.494149 1.4216615
#[3,] 0.1981111 0.11379934 1.0379619 0.2130251 0.1660568 1.227547 0.9248101
#[4,] 0.7507257 0.23353923 1.6502192 1.2232615 0.7497352 2.032547 1.4409475
#[5,] 0.5418513 0.06987903 1.8882399 0.6923557 0.3681018 2.172043 1.2215323
#[6,] 0.1731943 0.01088604 0.6878847 0.2450998 0.0125614 1.197478 0.3087192
In this example, the first row matches the whole rank sequence; the next rows match the first five ranks of the sequence:
head(match.score[order(match.score, decreasing = TRUE)])
# [1] 7 5 5 5 5 5

You can use the fact that order() returns the indices that would sort the values, which is exactly what you are trying to match.
For example, if we apply `order` to each row of
[1,] 23 17 118 57 20 66 137
[2,] 56 42 52 66 47 8 29
[3,] 35 5 76 35 29 217 89
We would get
[1,] 2 5 1 4 6 3 7
[2,] 6 7 2 5 3 1 4
[3,] 2 5 1 4 3 7 6
Then you simply need to check which rows match what you are looking for.
There are several ways to implement this; below is an example where we create a logical matrix, comparisons, which indicates whether each element of a row is in the "correct" position, as indicated by expectedOrder.
We then order the original rows by how many elements are in the "correct" column (using this phrase loosely, of course).
# assuming mydf is your data frame or matrix
# the expected order of the columns
expectedOrder <- c(2,5,1,4,6,3,7)
# apply the order function to each row
ordering <- apply(mydf, 1, function(r) order(r) )
# Recall that the output of apply is transposed relative to the input.
# We make use of this along with the recycling of vectors for the comparison
comparisons <- ordering == expectedOrder
# find all rows that match at least 2,5,1,4
topRows <- which(colSums(comparisons[1:4, ])==4)
# reorder the indices based on the total number of matches in comparisons
# i.e. first all 7-matches, then 5-matches, then 4-matches
topRows <- topRows[order(colSums(comparisons[,topRows]), decreasing=TRUE)]
# reorder the dataframe (or matrix)
mydf.ordered <- rbind(mydf[topRows, ],
                      mydf[-topRows, ])
head(mydf.ordered)
# X1 X2 X3 X4 X5 X6 X7
# 23 17 118 57 20 66 137
# 39 21 102 50 24 53 163
# 80 6 159 116 44 139 248
# 131 5 185 132 128 147 202
# 35 18 75 40 33 67 151
# 61 14 157 82 57 105 355

Related

Create frequency vector based on input vector

I have a variable test with the following structure:
> test <- c(9,87)
> names(test) <- c("VGP", "GGW")
> dput(test)
structure(c(9, 87), .Names = c("VGP", "GGW"))
> class(test)
[1] "numeric"
This is a very simplified version of the input vector, but I want an output as a vector of length 100 which contains the frequency of each number 1-100 inclusive. The real input vector is of length ~1000000, so I am looking for an approach that will work for a vector of any length, assuming only numbers 1-100 are in it.
In this example, all positions except 9 and 87 will show up as 0, and positions 9 and 87 will both say 50 (the frequency expressed as a percentage).
How can I generate this output?
If we are looking for a proportion, inclusive of the values that are not in the vector (with those values shown as 0), convert the vector to a factor with the levels specified and then use table and prop.table:
100*prop.table(table(factor(test, levels = 1:100)))
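Since only whole numbers 1-100 can occur, tabulate() is a compact alternative (just a sketch under that same assumption):
# counts of 1..100, scaled to percentages of the vector length
100 * tabulate(test, nbins = 100) / length(test)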
freq <- vector(mode = "numeric", length = 100)
for (i in X) {
  if (i >= 1 && i <= 100)
    freq[i] <- freq[i] + 1
}
freq
Here X is your input vector.
Adding an if condition could ensure that the values are in the range of [1,100].
Hope this helps.
If you have a numeric vector and just want to get a frequency table of the values, use the table function.
set.seed(1234)
d <- sample(1:10, 1000, replace = TRUE)
x <- table(d)
x
# 1 2 3 4 5 6 7 8 9 10
# 92 98 101 104 87 112 104 94 88 120
If there is a possibility of missing values, say 11 is a possibility in my example then I'd do the following:
y <- rep(0, 11)
names(y) <- as.character(1:11)
y[as.numeric(names(x))] <- x
y
# 1 2 3 4 5 6 7 8 9 10 11
# 92 98 101 104 87 112 104 94 88 120 0

How to pull out values corresponding to a random selection and get the cumulative summation of them?

Let's say I have a data frame with two columns for now:
df <- data.frame(scores_set1 = c(32,45,65,96,45,23,23,14),
                 scores_set2 = c(32,40,60,98,21,23,21,63))
I want to randomly select some rows
selected_indeces<- sample(c(1:8), 4, replace = FALSE)
Now I want to add up the values at selected_indeces sequentially: for the first selected index I just need the value of that specific row; for the second I want the second row's value + the first selected value; ... and for the nth index I want the sum of all values already selected + the value of the nth row. So, first I need a matrix to put the results in
cumulative_loss<-matrix(rep(NA,8*2),nrow=8,ncol=2)
and then one loop for each column and another for each selected_index
for (s in 1:ncol(df))  # for each column
{
  for (i in 1:length(selected_indeces))  # for each randomly selected index
  {
    if (i == 1)
    {
      cumulative_loss[i, s] <- df[selected_indeces[i], s]
    }
    if (i > 1)
    {
      cumulative_loss[i, s] <- df[selected_indeces[i], s] +
        df[selected_indeces[i - 1], s]
    }
  }
}
The script runs, and it might be a naive way of doing this, but the problem is that for i = 4 it only adds the values of the fourth and third selections, while I want it to add the first, second, third and fourth random selections and return that.
Conveniently, cumsum() works on data.frames directly, in which case it runs on each column independently. Thus we can index out the selected rows of df with an index operation and pass the result directly to cumsum() to get the required output:
set.seed(0L);
sel <- sample(1:8,4L);
sel;
## [1] 8 2 3 6
df[sel,];
## scores_set1 scores_set2
## 8 14 63
## 2 45 40
## 3 65 60
## 6 23 23
cumsum(df[sel,]);
## scores_set1 scores_set2
## 8 14 63
## 2 59 103
## 3 124 163
## 6 147 186
To select different indexes for each column, we can use apply():
set.seed(0L);
apply(df,2L,function(col) cumsum(col[sample(1:8,4L)]));
## scores_set1 scores_set2
## [1,] 14 63
## [2,] 59 103
## [3,] 124 126
## [4,] 147 147
If you want to compute the indexes in advance, it becomes slightly trickier. Here's one way of doing it:
set.seed(0L);
sels <- replicate(2L,sample(1:8,4L)); sels;
## [,1] [,2]
## [1,] 8 8
## [2,] 2 2
## [3,] 3 6
## [4,] 6 5
sapply(seq_len(ncol(df)),function(ci) cumsum(df[[ci]][sels[,ci]]));
## [,1] [,2]
## [1,] 14 63
## [2,] 59 103
## [3,] 124 126
## [4,] 147 147
Here's a way to do this with data.table (taking into account your comment on @bgoldst's answer):
library(data.table); setDT(df)
#sample 4 elements of each column (i.e., every element of .SD), then cumsum them
df[ , lapply(.SD, function(x) cumsum(sample(x, 4)))]
If you want to use different indices for each column, I would pre-choose them first:
set.seed(1023)
idx <- lapply(integer(ncol(df)), function(...) sample(nrow(df), 4))
idx
# [[1]] #indices for column 1
# [1] 2 8 6 3
#
# [[2]] #indices for column 2
# [1] 4 8 5 1
Then modify the above slightly:
df[ , lapply( seq_along(.SD), function(jj) cumsum(.SD[[jj]][ idx[[jj]] ]) )]
This is the craziest compendium of brackets/parentheses I've ever written in a functional line of code, so I guess it makes sense to break things down a bit:
seq_along .SD to pick out the index number of each column, jj
.SD[[jj]] selects the jj-th column, idx[[jj]] selects the indices for that column, and .SD[[jj]][ idx[[jj]] ] picks the idx[[jj]] rows of the jj-th column; this is equivalent to .SD[idx[[jj]], jj, with = FALSE]
Lastly, we cumsum the length(idx[[jj]]) rows we chose for column jj.
Result:
# V1 V2
# 1: 45 98
# 2: 59 161
# 3: 82 182
# 4: 147 214
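For comparison, the same per-column indexing can be written without data.table (a small sketch reusing the idx list chosen above; [[ extraction works the same on a data.table):
# one cumsum per column, each using that column's own sampled row indices
sapply(seq_along(idx), function(jj) cumsum(df[[jj]][ idx[[jj]] ]))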
With dplyr, if we want to sample each column separately and do the cumsum, we can use mutate_each and then select the first 4 with head.
library(dplyr)
df %>%
  mutate_each(funs(cumsum(sample(.)))) %>%
  head(., 4)
If the sample needs to be for the whole dataset
df %>%
  slice(sample(row_number(), 4)) %>%
  mutate_each(funs(cumsum))

Altering a data frame in R

I have a data frame that has the first column go from 1 to 365 like this
c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2...
and the second column has times that repeat over and over again like this
c(0,30,130,200,230,300,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
so for every 1 in the first column I have a corresponding time in the second column; then when I get to the 2's the times start over and each 2 has a corresponding time.
occasionally I will come across
c(3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4...
c(0,30,130,200,230,330,400,430,500,0,30,130,200,230,300,330,400,430,500...
Here one of the 3's is missing and the corresponding time of 300 is missing with it.
How can I go through my entire data frame and add these missing values? I need a way for R to go through and identify any missing values, then insert a row and put the appropriate value, 1 to 365, in column one and the appropriate time with it. So for the given example, R would add a row between 230 and 330 and then place a 3 in the first column and 300 in the second. There are parts of the column that are missing several consecutive values; it is not just one here and there.
EDIT: Solution with all 10 times clearly specified in advance and code tidy up/commenting
You need to create another data.frame containing every possible row and then merge it with your data.frame. The key aspect is the all.x = TRUE in the final merge which forces the gaps in your data to be highlighted. I simulated the gaps by sampling only 15 of the first 20 possible day/time combinations in your.dat
# create vectors for the days and times
the.days = 1:365
the.times = c(0,30,100,130,200,230,330,400,430,500) # the 10 times to repeat
# create a master data.frame with all the times repeated for each day, taking only the first 20 observations
dat.all = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365))[1:20,]
# mimic your data.frame with some gaps in it (only 15 of 20 observations are present)
your.sample = sample(1:20, 15)
your.dat = data.frame(x1=rep(the.days, each=10), x2 = rep(the.times,times = 365), x3 = rnorm(365*10))[your.sample,]
# left outer join merge to include ALL of the master set and all of your matching subset, filling blanks with NA
merge(dat.all, your.dat, all.x = TRUE)
Here is the output from the merge, showing all 20 possible records with the gaps clearly visible as NA:
x1 x2 x3
1 1 0 NA
2 1 30 1.23128294
3 1 100 0.95806838
4 1 130 2.27075361
5 1 200 0.45347199
6 1 230 -1.61945983
7 1 330 NA
8 1 400 -0.98702883
9 1 430 NA
10 1 500 0.09342522
11 2 0 0.44340164
12 2 30 0.61114408
13 2 100 0.94592127
14 2 130 0.48916825
15 2 200 0.48850478
16 2 230 NA
17 2 330 0.52789171
18 2 400 -0.16939587
19 2 430 0.20961745
20 2 500 NA
Here are a few NA handling functions that could help you get started.
For the inserting task, you should provide your own data using dput or a reproducible example.
df <- data.frame(x = sample(c(1, 2, 3, 4), 100, replace = T),
                 y = sample(c(0,30,130,200,230,300,330,400,430,500), 100, replace = T))
nas <- sample(NA, 20, replace = T)
df[1:20, 1] <- nas
df$y <- ifelse(df$y == 0, NA, df$y)
# Columns x and y have NA's in different places.
# Logical test for NA
is.na(df)
# Keep non-NA cases of one column
df[!is.na(df$x),]
df[!is.na(df$y),]
# Keep rows that are complete (no NA in any column)
df[complete.cases(df),]
# Gives the cases that are incomplete.
df[!complete.cases(df),]
# Returns the cases without NAs
na.omit(df)

Replace NA's by corresponding value in pair (2 rows is 1 pair)

I'm trying to find a loop that replaces NA's by designated values.
Say I have a data frame as follows (I actually have more rows):
a<-c(18,NA,12,33,32,14,15,55)
b<-c(18,30,12,33,32,14,15,NA)
c<-c(16,18,17,45,22,10,24,11)
d<-c(16,18,17,42,NA,10,24,11)
data<- data.frame(rbind(a,b,c,d))
names(data)<-rep(1:8)
All rows in my data frame are in pairs (rows 1 and 2 are the first pair, rows 3 and 4 are the second, and so on).
I wish to replace all NA's by the corresponding value of the pair, i.e. to replace the NA in the first pair by 30. Similarly, replace the NA in the 4th row by 22.
Is there a loop I can carry out to treat each 2 rows as a pair and replace any NAs found by its corresponding value in the same pair?
I'd use R's built-in vectorisation to find and replace NAs by the appropriate value. It seems like you want to replace from the row below when a row is odd numbered and from the row above when it is even numbered...
# Locate NAs in data
nas <- which( is.na( data ) , arr.ind = TRUE )
# row col
#a 1 2
#d 4 5
#b 2 8
# Where to get replacement value from: below on odd rows and value above on even rows
rows <- nas[,1] %% 2
rows[ rows == 0 ] <- -1
repl <- cbind( ( nas[,1] + rows ) , nas[ ,2] )
# Do replacement
data[ nas ] <- data[ repl ]
# 1 2 3 4 5 6 7 8
#a 18 30 12 33 32 14 15 55
#b 18 30 12 33 32 14 15 55
#c 16 18 17 45 22 10 24 11
#d 16 18 17 42 22 10 24 11
I'm sure the creation of the replacement locations matrix could be a little cleaner, but this should be fast as it only uses vectorised operations.
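For what it's worth, one slightly more compact way to build the same replacement matrix (a sketch along the same lines):
# the partner row is one below for odd rows and one above for even rows
repl <- cbind(nas[, 1] + ifelse(nas[, 1] %% 2 == 1, 1, -1), nas[, 2])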
Sure -- this does the trick:
for(i in 1:nrow(data)) {
  missing <- which(is.na(data[i, ]))
  if(i %% 2) {
    data[i, missing] <- data[(i+1), missing]
  } else {
    data[i, missing] <- data[(i-1), missing]
  }
}
It allows for missing observations in both the top and bottom row of each pair, and where there is a gap it fills in with the observation from the same column location in the other part of the pair.
Note there's no error checking or other nice stuff, so this is pretty raw.
Also, if they are truly pairs of data, there are better means of joining your observations than just sticking them all into a dataframe.
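For example, one such alternative (only a sketch, with illustrative names) is to keep the observations in long form with an explicit pair id, so each NA can be filled from its pair partner in a single grouped operation:
# reshape the 4 x 8 example into long form: one row per (member, position)
long <- data.frame(pair   = rep(c(1, 1, 2, 2), times = 8),  # a,b form pair 1; c,d form pair 2
                   member = rep(c("a", "b", "c", "d"), times = 8),
                   pos    = rep(1:8, each = 4),
                   value  = as.vector(as.matrix(data)))
# within each (pair, position) group, replace an NA with its partner's value
long$value <- ave(long$value, long$pair, long$pos,
                  FUN = function(v) ifelse(is.na(v), v[!is.na(v)][1], v))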

How do you apply a function to a nested list?

I need to get the maximum of a variable in a nested list. For a certain station number "s" and a certain member "m", mylist[[s]][[m]] is of the form:
station date.time member bias
6019 2011-08-06 12:00 mbr003 86
6019 2011-08-06 13:00 mbr003 34
For each station, I need to get the maximum of bias of all members. For s = 3, I managed to do it through:
library(plyr)
var1 <- mylist[[3]]
var2 <- lapply(var1, `[`, 4)
var3 <- laply(var2, .fun = max)
max.value <- max(var3)
Is there a way of avoiding the column number "4" in the second line and using the variable name $bias in lapply or a better way of doing it?
You can use [ with the names of columns of data frames as well as their index. So foo[4] will have the same result as foo["bias"] (assuming that bias is the name of the fourth column).
$bias isn't really the name of that column. $ is just another function in R, like [, that is used for accessing columns of data frames (among other things).
But now I'm going to go out on a limb and offer some advice on your data structure. If each element of your nested list contains the data for a unique combination of station and member, here is a simplified toy version of your data:
dat <- expand.grid(station = rep(1:3,each = 2),member = rep(1:3,each = 2))
dat$bias <- sample(50:100,36,replace = TRUE)
tmp <- split(dat,dat$station)
tmp <- lapply(tmp,function(x){split(x,x$member)})
> tmp
$`1`
$`1`$`1`
station member bias
1 1 1 87
2 1 1 82
7 1 1 51
8 1 1 60
$`1`$`2`
station member bias
13 1 2 64
14 1 2 100
19 1 2 68
20 1 2 74
etc.
tmp is a list of length three, where each element is itself a list of length three. Each element is a data frame as shown above.
It's really much easier to record this kind of data as a single data frame. You'll notice I constructed it that way first (dat) and then split it twice. In this case you can rbind it all together again using code like this:
newDat <- do.call(rbind,lapply(tmp,function(x){do.call(rbind,x)}))
rownames(newDat) <- NULL
In this form, these sorts of calculations are much easier:
library(plyr)
#Find the max bias for each unique station+member
ddply(newDat,.(station,member),summarise, mx = max(bias))
station member mx
1 1 1 87
2 1 2 100
3 1 3 91
4 2 1 94
5 2 2 88
6 2 3 89
7 3 1 74
8 3 2 88
9 3 3 99
#Or maybe the max bias for each station across all members
ddply(newDat,.(station),summarise, mx = max(bias))
station mx
1 1 100
2 2 94
3 3 99
Here is another solution using repeated lapply.
lapply(tmp, function(x) lapply(lapply(x, '[[', 'bias'), max))
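To reduce that to one maximum per station, which is what the question ultimately asks for, a small follow-up sketch (reusing the tmp list built above):
# maximum bias per station, taken across all of that station's members
sapply(tmp, function(x) max(unlist(lapply(x, '[[', 'bias'))))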
You may need to use [[ instead of [, but it should work fine with a string (don't use the $). Try:
var2 <- lapply( var1, `[`, 'bias' )
or
var2 <- lapply( var1, `[[`, 'bias' )
depending on if var1 is a list.
