I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))
The output would be the following:
Finished <- data.frame(id  = c(1, 2, 3, 4, 5),
                       key = c(1, 2, 3, 4, 5),
                       num = c(1, 1, 1, 1, 1),
                       v4  = c(1, 5, 5, 5, 7),
                       v5  = c(1, 5, 5, 5, 7))
My real dataset is bigger and a mix of mostly numeric and some character variables, and I couldn't determine the best way to go about this. I've previously used a program that would do something similar within its duplicates command, called check.all.
So far, my thought has been to use grepl to determine where "anything" is present:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resultant data frame, I take rowSums and cbind the result to the original:
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps. I have a variable (CompleteNess) which tells me how many columns are filled in each row; however, I'm unsure how to handle the duplicates.
Put simply: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this, or can get me through the last little bit, I would greatly appreciate it. Thanks, all!
Here is a solution. It is not very pretty but it should work for your application:
# Order by the degree of completeness (computed above)
Original <- Original[order(CompleteNess), ]
# Starting from the bottom, select the non-duplicated rows
# based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This does rearrange your original data frame, so beware if there is additional processing later on.
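If you're open to dplyr, here's a minimal sketch of the same idea (assuming dplyr >= 1.0.0 for slice_max(); note that rowSums(!is.na(...)) gives the same completeness count as the grepl approach for this example):
library(dplyr)
Original$CompleteNess <- rowSums(!is.na(Original))
Finished <- Original %>%
  group_by(id, key, num) %>%
  slice_max(CompleteNess, n = 1, with_ties = FALSE) %>% # keep the fullest row per group
  ungroup() %>%
  select(-CompleteNess)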
You can aggregate your data and select the row with max score:
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])
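If there are many v columns, here is a sketch that keeps the whole best row per group without naming each column (same plyr idiom; note that which.max() keeps the first row on ties, which matches "either can be removed"):
Final <- ddply(Original, .(id.key.num), function(d) d[which.max(d$present), ])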
Let's assume that I've got a data.frame that is supposed to be sorted with respect to selected columns, and I want to make sure that this is indeed the case. I could try something like:
library(dplyr)
mpg2 <- mpg %>%
  arrange(manufacturer, model, year)
identical(mpg, mpg2)
#[1] FALSE
but if identical returns FALSE, this only tells me that the dataset is in the wrong order.
What if I would like to inspect only those rows that are in fact in the wrong order? How can I filter them out of the whole dataset? (I need to avoid looping if possible, as the dataset I work with is pretty large.)
If the remaining variables (those not used for ordering) differ for the same value of manufacturer, model, and year, how does dplyr::arrange decide which observation comes first? Does it preserve the order from the original dataset (mpg here)?
As for the second question, I believe that dplyr::arrange is stable: it preserves the order of the rows when there are ties in the sorting columns.
This can be seen by comparing with the result from base::order. From the help page, section Details (my emphasis):
In the case of ties in the first vector, values in the second are
used to break the ties. If the values are still tied, values in the
later arguments are used to break the tie (see the first example).
The sort used is stable (except for method = "quick"), so any
unresolved ties will be left in their original ordering.
mpg2 <- mpg %>%
  arrange(manufacturer, model, year)
i <- with(mpg, order(manufacturer, model, year))
mpg3 <- mpg[i, ]
identical(as.data.frame(mpg2), as.data.frame(mpg3))
#[1] TRUE
The values are identical, except for their classes. So dplyr::arrange does preserve the order in the case of ties.
As for the first question, maybe the code below answers it. It just gets the rows for which the next order number is smaller than the previous one. This means that those rows have changed relative positions.
j <- which(diff(i) < 0)
mpg[i[j], ]
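For a quick yes/no check of sortedness (a small sketch reusing i from above): the data frame is already sorted exactly when the ordering permutation is the identity.
identical(i, seq_len(nrow(mpg)))
#[1] FALSE, consistent with identical(mpg, mpg2) above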
I don't think this is something I've needed before. It's usually best practice not to rely on table ordering. The only times I would rely on it, the ordering would be contained within a function, i.e. I wouldn't have function B depend on ordering that happens in function A.
I think this does what you ask for, using the data.table package. With this package you set keys, which are ordered left to right as primary key, secondary key, etc. I'm not sure if concatenating the keys together is the best way, but it's simple.
# reproducible fake data
library(data.table)
set.seed(1)
dt <- data.table(a=rep(1:5, 2), b=letters[1:10], c=sample(1:3, 10, TRUE))
# scramble
dt <- dt[sample(1:.N)]
# make the ideal structure
keys <- c("a", "b")
dt_ideal <- copy(dt)
dt_ideal <- setkeyv(dt_ideal, keys)
key(dt_ideal)
# function to find rows whose pasted-together keys differ between the two tables
# (note: it reads `keys` from the enclosing environment)
findBad <- function(dt, dt_ideal){
  not_ok <- which(dt_ideal[, do.call(paste, c(.SD, sep=">")), .SDcols=keys] !=
                  dt[, do.call(paste, c(.SD, sep=">")), .SDcols=keys])
  not_ok
}
# index of bad rows - all bad in this case
not_ok <- findBad(dt, dt_ideal)
dt[not_ok]
# a better example: swap rows 7 and 8
dt2 <- copy(dt_ideal)
dt2 <- dt2[c(1:6, 8, 7, 9:10)]
not_ok <- findBad(dt2, dt_ideal)
dt2[not_ok]
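One design note: findBad() reads keys from the enclosing environment. Here is a sketch of a self-contained variant that takes the key columns explicitly (same logic otherwise):
findBad2 <- function(dt, dt_ideal, keys){
  which(dt_ideal[, do.call(paste, c(.SD, sep=">")), .SDcols=keys] !=
        dt[, do.call(paste, c(.SD, sep=">")), .SDcols=keys])
}
not_ok <- findBad2(dt2, dt_ideal, keys = c("a", "b"))
dt2[not_ok]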
It might be a trivial question (I am new to R), but I could not find an answer to it, either here on SO or anywhere else. My scenario is the following.
I have a data frame df and I want to update a subset of df$tag values. df is similar to the following:
id = rep( c(1:4), 3)
tag = rep( c("aaa", "bbb", "rrr", "fff"), 3)
df = data.frame(id, tag)
Then, I am trying to use match() to update the column tag from the subsets of the data frame, using a second data frame (e.g., aux) that contains two columns, namely, key and value. The subsets are defined by id = n, according to n in unique(df$id). aux looks like the following:
> aux
key value
"aaa" "valueAA"
"bbb" "valueBB"
"rrr" "valueRR"
"fff" "valueFF"
I have tried to loop over the data frame, as follows:
for (i in unique(df$id)) {
  indexer = df$id == i
  # here is how I tried to update the data frame:
  df[indexer, ]$tag <- aux[match(df[indexer, ]$tag, aux$key), ]$value
}
The expected result was the df[indexer,]$tag updated with the respective values from aux$value.
The actual result was df$tag filled with NAs. I got no errors, but the following warning message:
In `[<-.factor`(`*tmp*`, df$id == i, value = c(NA, :
  invalid factor level, NA generated
Before, I was using df$tag <- aux[match(df$tag, aux$key),]$value, which worked properly, but some duplicated df$tag values made match() produce misplaced updates in a number of rows. I also simulated the subsetting on its own and it works fine. Can someone suggest a solution for this update?
UPDATE (how the final dataset should look):
> df
id tag
1 "valueAA"
2 "valueBB"
3 "valueRR"
4 "valueFF"
(...) (...)
Thank you in advance.
Does this produce the output you expect?
df$tag <- aux$value[match(df$tag, aux$key)]
merge() would work too unless you have duplicates in aux.
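For completeness, a minimal sketch of why the loop generated NAs (assuming aux is built as shown in the question): df$tag is a factor, and assigning strings that are not among its levels yields NA plus the "invalid factor level" warning. Converting tag to character first sidesteps that:
aux <- data.frame(key   = c("aaa", "bbb", "rrr", "fff"),
                  value = c("valueAA", "valueBB", "valueRR", "valueFF"),
                  stringsAsFactors = FALSE)
df$tag <- as.character(df$tag) # drop the factor so new strings are allowed
df$tag <- aux$value[match(df$tag, aux$key)]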
It turned out that my data was breaking all the available built-in functions, giving me a wrong dataset in the end. So my solution (at least a preliminary one) was the following:
to process each subset individually;
add each data frame to a list;
use rbindlist(a.list, use.names = TRUE) to get a complete data frame with the results (a sketch follows below).
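A minimal sketch of that workflow (the per-subset processing shown is just the match() update from above; substitute your own):
library(data.table)
a.list <- list()
for (i in unique(df$id)) {
  subset.df <- df[df$id == i, ]
  subset.df$tag <- aux$value[match(as.character(subset.df$tag), aux$key)]
  a.list[[as.character(i)]] <- subset.df
}
result <- rbindlist(a.list, use.names = TRUE)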
How could I calculate the rowMeans of a data.frame based on matching column names?
For example:
c1=rnorm(10)
c2=rnorm(10)
c3=rnorm(10)
out=cbind(c1,c2,c3)
out=cbind(out,out)
I realize that the values are the same, this is just for demonstration.
Each row is a specific measurement type (consider it a factor).
Imagine c1 = compound 1, c2 = compound 2, etc.
I want to group together all the c1's and average their rows together, then repeat for all unique(colnames(out)).
My idea was something like:
avg = rowMeans(out,by=(unique(colnames(out)))
but obviously this doesn't work...
Try this:
sapply(unique(colnames(out)), function(i)
  rowMeans(out[, colnames(out) == i]))
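One caveat: if any column name occurs only once, out[, colnames(out) == i] drops to a plain vector and rowMeans() errors. Adding drop = FALSE guards against that:
sapply(unique(colnames(out)), function(i)
  rowMeans(out[, colnames(out) == i, drop = FALSE]))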
As #Laterow points out in the comments, having duplicate column names will lead to trouble at some point; if not here, elsewhere in your code. Best to nip it in the bud now.
If you are starting with duplicate column names, use make.unique on the colnames first; it appends .n, where n increments for each duplicate (starting at .1 for the first duplicate), and leaves the initially unique names as-is:
colnames(out) <- make.unique(colnames(out));
Once that's done (or, as the OP explained in the comments, if it was already being done silently by the column-creating function), you can do your rowMeans operation with dplyr::select's starts_with() helper to group columns by prefix:
library(dplyr);
avg_c1 <- rowMeans(select(as.data.frame(out), starts_with("c1")));
If you have a large number of columns, instead of specifying them individually, you can use the code below to have it create a data frame of the rowMeans regardless of input size:
case_count <- as.integer(sub('^c\\d+\\.(\\d+)$', '\\1', colnames(out)[ncol(out)])) + 1L;
var_count <- as.integer(ncol(out) %/% case_count);
avg_c <- as.data.frame(matrix(nrow = var_count, ncol = nrow(out)));
for (i in 1:var_count) {
  avg_c[i, 1:nrow(out)] <- rowMeans(select(as.data.frame(out), starts_with(paste0("c", i))));
}
As #Tensibai points out in comments, this solution may not be efficient, and may be overkill depending on your actual data set. You may not need the flexibility it provides and there's probably a more succinct way to do it.
EDIT1: Based on OP comments
EDIT2: Based on comments, handle all rowMeans at once
EDIT3: Fixed code bugs and clarified starting point reasoning based on comments
I have a large data set I am attempting to sample rows from. Each row has a family ID, and there may be one or multiple rows for each family ID. I want to parse the data set by randomly sampling one row for each family ID. I have attempted to accomplish this by using both tapply() and split() + lapply() functions, but to no avail. Below is code that reproduces my issue - the size and scope of the factor levels and data entries mirror the data set I am working with.
set.seed(63)
f1 <- factor(c(rep(30000:32000, times = 1),
               rep(30500:31700, times = 2),
               rep(30900:31900, times = 3)))
f2 <- factor(rep(sample(1:7, replace = TRUE), times = length(f1)/7))
x1 <- round(matrix(rnorm(length(f1)*300), nrow = length(f1), ncol = 300),3)
df <- data.frame(f1, f2, x1)
Next, I used tapply to sample one row per level of f1, and then checked for repeats. (f2 is a secondary factor that indexes another aspect of the observations, but is [hopefully] irrelevant here; I only include it for full disclosure of the structure of my data set.)
s1 <- tapply(1:nrow(df), df$f1, sample, size=1)
any(duplicated(s1))
The output for the second line of code using duplicated is TRUE, which means there are repeats. Stumped, I tried split to see if that was the problem.
df.split <- split(1:nrow(df), df$f1)
any(duplicated(df.split))
The output here for duplicated is FALSE, so the problem is not split. I then used the output df.split with lapply and sample to see if the problem was with tapply.
df.unique <- unlist(lapply(df.split, sample, size = 1, replace = FALSE,
                           prob = NULL))
any(duplicated(df.unique))
In the first line, I sampled one value from each element of df.split, which outputs a list; then I used unlist to convert it into a vector. The output of duplicated here is also TRUE.
Somewhere within sample and lapply there is funky stuff going on (since tapply merely calls lapply). I'm not sure how to fix the issue (I searched SO and Google and found nothing related to my issue), so any help would be greatly appreciated!
EDIT: I'm hoping someone could tell me why the above code using tapply and lapply is not working as intended. Arthur has provided a nice answer, and I have coded a loop for sample as well. I'm wondering why the above code is misbehaving.
I would do this:
library(data.table)
data.table(df)[, .SD[sample(.N, 1)], by = 'f1']
... but actually your original approach with tapply is faster if you just want an index and not the actual subset table; however, note that sample(n) actually samples from 1:n when length(n) == 1. See ?sample. This version is error-proof:
s1 <- tapply(1:nrow(df), list(df$f1), function(v) v[sample(1:length(v), 1)])
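To see the gotcha in isolation (a tiny illustration of the behaviour documented in ?sample):
sample(5, 1)       # samples from 1:5, not from the single value 5
sample(c(5, 7), 1) # length > 1: samples from the given values, as expected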
I have a data frame of 15 columns where the first column is an integer and the others are numeric. I have to generate a one-line summary: the sum of each column except the last one, plus the mean of the last column. So I am doing something like this:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then appears with values to two decimal places, even for the integer column (the first one). I have been trying the round function to fix this. I can understand it when different types are added, e.g. 1 + 1.0, but in this case shouldn't the summation maintain the data type?
Please let me know what I am missing.
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
# int num1 mean
#1 10 6 2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
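The reason the original one-liner showed decimals is c()'s type coercion: an atomic vector has a single type, so an integer sum combined with a numeric mean gets promoted to numeric, whereas a list keeps each element's own class. A tiny illustration:
class(c(sum(1:4), mean(c(1.5, 2.5)))) # "numeric": the integer sum was coerced
sapply(list(s = sum(1:4), m = mean(c(1.5, 2.5))), class) # "integer" "numeric"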
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with #Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
Perhaps an adaptation of this? Dividing the column sums by 1 leaves the first three as sums, while dividing the last one by nrow(iris) turns it into a mean:
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))