How to find unsorted fragments of data frame - r

Let's assume that I've got a data.frame that is supposed to be sorted with respect to selected columns, and I want to make sure that this is indeed the case. I could try something like:
library(dplyr)
mpg2 <- mpg %>%
  arrange(manufacturer, model, year)
identical(mpg, mpg2)
[1] FALSE
but if identical returns FALSE, this only tells me that the dataset is in the wrong order.
What if I would like to inspect only those rows that are actually out of order? How can I filter them out of the whole dataset? (I need to avoid looping here if at all possible, as the dataset I work with is pretty large.)
If the remaining variables (those not used for ordering) differ for the same value of manufacturer, model, year, how does dplyr::arrange decide which observation comes first? Does it preserve the order from the original dataset (mpg here)?

As for the second question, I believe that dplyr::arrange is stable: it preserves the order of the rows when there are ties in the sorting columns.
This can be seen by comparing with the result from base::order. From its help page, section Details:
In the case of ties in the first vector, values in the second are
used to break the ties. If the values are still tied, values in the
later arguments are used to break the tie (see the first example).
The sort used is stable (except for method = "quick"), so any
unresolved ties will be left in their original ordering.
mpg2 <- mpg %>%
  arrange(manufacturer, model, year)
i <- with(mpg, order(manufacturer, model, year))
mpg3 <- mpg[i, ]
identical(as.data.frame(mpg2), as.data.frame(mpg3))
#[1] TRUE
The values are identical; only the classes of the two objects differ. So dplyr::arrange does preserve the original order in the case of ties.
As for the first question, maybe the code below answers it. It picks out the positions in the sorted order where the original row number decreases, i.e. the rows whose relative positions changed.
j <- which(diff(i) < 0)
mpg[i[j], ]
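If you prefer to flag the out-of-order rows without going through the full permutation, here is a minimal alternative sketch (mine, not part of the original answer): paste the sorting columns into a single key and compare each row's key with the key of the row that follows it. String comparison follows the session locale, so for a strict check the order()-based approach above is safer.
library(ggplot2)  # mpg lives in ggplot2
# build a comparable key; zero-pad the year so it sorts like a number
sort_key <- with(mpg, paste(manufacturer, model, sprintf("%04d", year), sep = "\001"))
# a row is out of order if its key is greater than the key of the row after it
out_of_order <- c(sort_key[-length(sort_key)] > sort_key[-1], FALSE)
mpg[out_of_order, ]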

I don't think this is something I've needed before. It's usually best practice not to rely on table ordering; the only time I would rely on it is when the ordering is contained within a single function, i.e. I wouldn't have function B depend on ordering that happens in function A.
I think this does what you ask for, using the data.table package. With this package you set keys, which are ordered from left to right as primary key, secondary key, etc. I'm not sure if concatenating the keys together is the best way, but it's simple.
# reproducible fake data
library(data.table)
set.seed(1)
dt <- data.table(a = rep(1:5, 2), b = letters[1:10], c = sample(1:3, 10, TRUE))

# scramble
dt <- dt[sample(1:.N)]

# make the ideal structure
keys <- c("a", "b")
dt_ideal <- copy(dt)
setkeyv(dt_ideal, keys)  # setkeyv sorts and keys by reference, no reassignment needed
key(dt_ideal)

# function to find rows whose pasted keys differ between the two tables
findBad <- function(dt, dt_ideal, keys){
  not_ok <- which(dt_ideal[, do.call(paste, c(.SD, sep = ">")), .SDcols = keys] !=
                  dt[, do.call(paste, c(.SD, sep = ">")), .SDcols = keys])
  not_ok
}

# index of bad rows - all bad in this case
not_ok <- findBad(dt, dt_ideal, keys)
dt[not_ok]

# better eg, swap rows 7 & 8
dt2 <- copy(dt_ideal)
dt2 <- dt2[c(1:6, 8, 7, 9:10)]
not_ok <- findBad(dt2, dt_ideal, keys)
dt2[not_ok]
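As a hedged aside (not part of the original answer), if you only need a yes/no check that the table is already in the desired key order, comparing against a sorted copy is enough:
# TRUE if dt is already ordered by the key columns; FALSE for the scrambled dt above
identical(dt, setorderv(copy(dt), keys))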

Related

Using dplyr/purrr instead of for loops to mask multiple columns and/or expand rows

Essentially it's about using bitmask/binary columns and row-oriented operations against a data table/frame. Firstly, to construct a logical vector from a combination of selected columns that can be used to mask a character vector, representing which columns are flagged. Secondly, row expansion: given a count in one column, produce a data table that contains the original row data replicated that number of times.
For summarising the flags using a row-wise bitmask, which uses purrr::reduce to concatenate the row-represented flags, I cannot find a succinct method to do this in a %>% chain rather than a separate for loop. I suspect purrr::map is required, but I cannot get the syntax right.
For the row expansion, the nested for loop has appalling performance and I cannot find a way for dplyr/purrr to replicate a row a given number of times, row-wise. A map would need to produce and append multiple rows per input row, which I don't think map is capable of.
The following code produces the required output, but, apart from the performance issues (especially regarding row expansion), I'd like to be able to do this as vectorised operations.
library(tidyverse)
library(data.table)
dt <- data.table(C1 = c(0,0,1,0,1,0),
                 C2 = c(1,0,0,0,0,1),
                 C3 = c(0,1,0,0,1,0),
                 C4 = c(0,1,1,0,0,0),
                 C5 = c(0,0,0,0,1,1),
                 N  = c(5,2,6,8,1,3),
                 Spurious = '')
flags <- c("Scratching Head","Screaming",
"Breaking Keyboard","Coffee Break",
"Giving up")
# Summarise states
flagSummary <- function(dt){
  interim <- dt %>%
    dplyr::mutate_at(vars(C1:C5), .funs = as.logical) %>%
    dplyr::mutate(States = c(""))
  for(i in 1:nrow(interim)){
    interim$States[i] <-
      flags[as.logical(interim[i, 1:5])] %>%
      purrr::reduce(~ paste(.x, .y, sep = ","), .init = "") %>%
      stringr::str_replace("^[,]", "")
  }
  dplyr::select(interim, States, N)
}
summary <- flagSummary(dt)
View(summary)
# Expand states
expandStates <- function(dt){
  interim <- dt %>%
    dplyr::mutate_at(vars(C1:C5), .funs = as.logical) %>%
    dplyr::select_at(vars(C1:C5, N)) %>%
    data.table::setnames(., append(flags, "Count"))
  expansion <- interim[0, 1:5]
  for(i in 1:nrow(interim)){
    for(j in 1:interim$Count[i]){
      expansion <- bind_rows(expansion, interim[i, 1:5])
    }
  }
  expansion
}
expansion <- expandStates(dt)
View(expansion)
As stated, the code produces the expected result. I'd 'like' to see the same without resorting to for loops, whilst still being able to chain the operations onto the initial mutate/selects.
As for the row expansion in the expandStates function, the answer is proffered here: Replicate each row of data.frame and specify the number of replications for each row?, by A5C1D2H2I1M1N2O1R2T1.
Essentially, the nested for loop is simply replaced by
interim[rep(rownames(interim[,1:5]),interim$Count),][1:5]
On my 'actual' data, this reduces the user time reported by system.time from 28.64 seconds to 0.06 to produce some 26,000 rows.
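For reference, here is a hedged sketch (mine, not from the linked answer) of the flag summary without an explicit row loop, plus the row expansion written with tidyr::uncount; both assume the dt and flags objects defined in the question.
library(tidyr)
# row-wise summary: keep the flag names wherever the 0/1 column is set
states <- apply(dt[, 1:5] == 1, 1, function(r) paste(flags[r], collapse = ","))
summary2 <- data.frame(States = states, N = dt$N)
# row expansion: replicate each flag row N times, dropping the weight column
expansion2 <- tidyr::uncount(as.data.frame(dt[, 1:6]), weights = N)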

R: Determine if all sets in a list appear in a data frame

I need to figure out if sets of item ID's are found within a data frame.
If I'm only looking for a single set of ID's, the below code works just fine:
set <- c( id1, id2, etc...)
all(set %in% df[, rangeOfColumns])
However, if the set is a list of various things I want to check, this code doesn't work as expected and I am unsure how to get this functionality.
Example of what I'm aiming for:
set <- list()
set[[1]] <- c(1, 2)
set[[2]] <- c(2, 3)
df <- as.data.frame(cbind(c(1:4),c(2:5)))
all(set %in% df)
#Returns TRUE
Maybe check each row against each set and return TRUE if any row matches; if there is a match for every set, the whole result is TRUE.
all(sapply(set, function(s)
  any(apply(df, 1, function(x) all(x == s)))))
This might not be easy to understand but it does the job. Data frames are organized by column, so doing things by row isn't always straightforward.
# Your setup had some unnecessary complications. Here it is again more simply:
set <- list(1:2, 2:3)
d_f <- data.frame(1:4, 2:5)  # df is already a function name, so best not to reuse it
all(
  sapply(seq_along(set),
         function(i) any(
           sapply(
             lapply(1:nrow(d_f), function(j) set[[i]] == d_f[j, ]),
             all)  # does each element of set[[i]] equal the elements in d_f[j, ]?
         )         # does that happen in any row of d_f?
  )
)                  # is it true for all elements of set?
EDIT: to address the question in the comment, "Well, if it's not straightforward, why not work with a transposed version of the df to make things easier?"
Because a data frame is a list, not a matrix.
Doing matrix things (like transpose with t or using apply) ruin (often without any warning to the user) what a data frame is supposed to be, which is a list of vectors of the same length.
When you use t or apply on a data frame, the first thing to happen is as.matrix gets applied to it. And if your data frame has a date, character, or factor variable, then the whole thing is coerced to "character", and it doesn't tell you this happens.
An answer for your specific problem can be crafted using apply (as someone did) and/or t, but it's going to be a bit fragile unless one is completely sure of the classes of the variables in the data frame.
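A tiny hedged illustration of that coercion (the data frame here is made up for the example):
df_mixed <- data.frame(x = 1:3, y = c("a", "b", "c"))
t(df_mixed)                # as.matrix() runs first, so every cell becomes character
apply(df_mixed, 1, class)  # "character" for every row, with no warning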

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled in. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id  = c(1,2,3,4,5),
                       key = c(1,2,3,4,5),
                       num = c(1,1,1,1,1),
                       v4  = c(1,5,5,5,7),
                       v5  = c(1,5,5,5,7))
My real dataset is bigger, mostly numerical with some character variables, but I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within its duplicates command, called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting data frame, I take rowSums and cbind the result to the original:
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure how to handle the duplicates.
Simply put, I'm looking for: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
# Order by the degree of completeness
Original <- Original[order(CompleteNess), ]
# Starting from the bottom, select the rows that are not duplicated
# based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This does rearrange your original data frame, so beware if there is additional processing later on.
You can aggregate your data and select the row with max score:
Original <- data.frame(id  = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4  = c(1,NA,5,5,NA,5,NA,7),
                       v5  = c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4  = v4[which.max(present)],
               v5  = v5[which.max(present)])
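As a hedged side note (not from either answer above), with a recent dplyr (1.0 or later) the same "keep the most complete row per id/key/num group" idea could be sketched like this, assuming the Original data frame as first defined in the question:
library(dplyr)
Finished2 <- Original %>%
  mutate(filled = rowSums(!is.na(across(everything())))) %>%  # non-missing cells per row
  group_by(id, key, num) %>%
  slice_max(filled, n = 1, with_ties = FALSE) %>%             # keep the most complete row
  ungroup() %>%
  select(-filled)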

Rowmeans with matching column names

How could I calculate the rowMeans of a data.frame based on matching column names?
Ex)
c1=rnorm(10)
c2=rnorm(10)
c3=rnorm(10)
out=cbind(c1,c2,c3)
out=cbind(out,out)
I realize that the values are the same, this is just for demonstration.
Each row is a specific measurement type (consider it a factor).
Imagine c1 = compound 1, c2 = compound 2, etc.
I want to group together all the c1's and average their rows together, then repeat for all unique(colnames(out)).
My idea was something like:
avg = rowMeans(out,by=(unique(colnames(out)))
but obviously this doesn't work...
Try this:
sapply(unique(colnames(out)), function(i)
  rowMeans(out[, colnames(out) == i]))
As #Laterow points out in the comments, having duplicate column names will lead to trouble at some point; if not here, elsewhere in your code. Best to nip it in the bud now.
If you are starting with duplicate column names, use make.unique on the colnames first to append .n where n increments for each duplicate starting at .1 for the first duplicate, leaving the initial unique names as is:
colnames(out) <- make.unique(colnames(out))
Once that's done (or, as the OP explained in the comments, if the column-creating function was already doing it silently), you can do your rowMeans operation with dplyr::select's starts_with() helper to group columns based on prefix:
library(dplyr)
avg_c1 <- rowMeans(select(as.data.frame(out), starts_with("c1")))  # out is a matrix, so convert before select()
If you have a large number of columns, instead of specifying them individually, you can use the code below to have it create a data frame of the rowMeans regardless of input size:
# number of copies of each variable, inferred from the ".n" suffix on the last column name
case_count <- as.integer(sub('^c\\d+\\.(\\d+)$', '\\1', colnames(out)[ncol(out)])) + 1L
var_count <- as.integer(ncol(out) %/% case_count)
# result: one row per variable, one column per observation
avg_c <- as.data.frame(matrix(nrow = var_count, ncol = nrow(out)))
for (i in 1:var_count) {
  avg_c[i, 1:nrow(out)] <- rowMeans(select(as.data.frame(out), starts_with(paste0("c", i))))
}
As #Tensibai points out in comments, this solution may not be efficient, and may be overkill depending on your actual data set. You may not need the flexibility it provides and there's probably a more succinct way to do it.
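For what it's worth, one hedged guess at that more succinct form (assuming the made-unique names c1, c1.1, c2, c2.1, ... from above): strip the suffix and average by prefix in a single sapply, which also scales with the number of variables.
prefixes <- unique(sub("\\..*$", "", colnames(out)))
avg_all  <- sapply(prefixes, function(p)
  rowMeans(out[, sub("\\..*$", "", colnames(out)) == p, drop = FALSE]))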
EDIT1: Based on OP comments
EDIT2: Based on comments, handle all rowMeans at once
EDIT3: Fixed code bugs and clarified starting point reasoning based on comments

How to find out erroneous values in one column based on values in another column in R?

I have two columns of data (say id and master_id) in R. It should be the case that all the values in id are present in master_id. I suspect that is not so, and I want to identify the erroneous values. I cannot simply eyeball the data, as I am dealing with on the order of 100k rows.
How do I go about finding the erroneous values?
The %in% operator may come in handy. It returns FALSE for the values that are in the first vector but not in the second.
E.g.
DF$id %in% DF$master_id
id should be a subset of master_id, so any id value without a counterpart in master_id will get a FALSE
or, to see how it works run (from R help file)
1:10 %in% c(1,3,5,9)
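A small hedged follow-up (using the column names from the question): to pull out the offending values themselves rather than a logical vector:
bad_ids <- unique(DF$id[!(DF$id %in% DF$master_id)])
bad_ids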
Here's an answer from 2 days ago:
library(data.table)
DF1 <- data.frame(x = 1:3, y = 4:6, t = 10:12)
DF2 <- data.frame(x = 3:5, y = 6:8, s = 1:3)
DF1 <- data.table(DF1, key = c("x", "y"))
DF2 <- data.table(DF2, key = c("x", "y"))
DF1[!DF2] # rows of DF1 whose key has no match in DF2 - maybe you want this?
DF2[!DF1] # rows of DF2 whose key has no match in DF1 - or maybe you want this?
