R: remove duplicate rows with full overlap of non-missing variables
Many previous questions highlight various ways to remove duplicate rows with missing values, but none deal with the following case. Example starting data:
df <- data.frame(x = c(1, NA, 1), y=c(NA, 1, 1), z=c(0, NA, NA))
print(df)
Desired output:
df2 <- data.frame(x = c(1, 1), y=c(NA, 1), z=c(0, NA))
print(df2)
In this case the second row was removed because it was a perfect subset of row 3. In the real application I want to remove rows that contain all redundant info in the non-missing columns, and keep the row that has fewer missing values overall.
I thought this might be accomplished using dplyr and a rowwise application of distinct(), but to no avail. I could do this with a very slow for loop, but with hundreds of columns and thousands of rows this is a poor option.
Here is another option using data.table:
library(data.table)
#convert into long format, discard NAs, and count non-NA values per original row (rn)
mDT <- melt(setDT(df)[, rn := .I], id.var="rn", na.rm=TRUE)[, cnt := .N, rn]
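#For the example df, mDT at this point looks like this (a sketch of the
#intermediate result; rn is the original row number):
#   rn variable value cnt
#1:  1        x     1   2
#2:  3        x     1   2
#3:  2        y     1   1
#4:  3        y     1   2
#5:  1        z     0   2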
#self join on (variable, value) and keep pairs of distinct rows that share a value
merged <- mDT[mDT, on=.(variable, value), {
    diffrow <- i.rn != x.rn
    .(irn=i.rn[diffrow], xrn=x.rn[diffrow], icnt=i.cnt[diffrow])
}]
#count matches per pair of rows; row irn is fully redundant when every one of its
#non-NA values is matched by the same other row (icnt == xcnt)
ix <- merged[, xcnt := .N, .(irn, xrn)][icnt == xcnt]$irn
#delete dupe rows
df[-ix]
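Note that setDT() added the helper column rn to df by reference, so it is still present in the result above; a quick follow-up to drop it again (assuming the code above has just run):
df[-ix][, !"rn"]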
I'm not sure how to do it with dplyr, but here is a solution with a loop. I'm also not sure that a dplyr solution would be faster than the loop (in the end it must use some loop), and here you can at least control the loop flow.
The subsetVector() function below determines whether vector a is a subset of vector b (returns 1), whether vector b is a subset of vector a (returns 2), or neither (returns 0). I then loop over all rows of the data.frame and remove the subset rows.
subsetVector <- function(a, b){
  na_a <- which(is.na(a))
  na_b <- which(is.na(b))
  #setdiff() (rather than negative indexing) also handles rows with no NAs at all
  if(all(na_a %in% na_b)){
    keep <- setdiff(seq_along(a), na_b)   #positions non-missing in b
    if(all(a[keep] == b[keep])) return(2) #b is a subset of a
  }else if(all(na_b %in% na_a)){
    keep <- setdiff(seq_along(a), na_a)   #positions non-missing in a
    if(all(b[keep] == a[keep])) return(1) #a is a subset of b
  }
  return(0)
}
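For intuition, a few example calls built from the rows of the question's df (a quick sketch; return values follow the convention above):
subsetVector(c(NA, 1, NA), c(1, 1, NA)) #1: the first vector is a subset of the second
subsetVector(c(1, 1, NA), c(NA, 1, NA)) #2: the second vector is a subset of the first
subsetVector(c(1, NA, 0), c(NA, 1, NA)) #0: neither subsumes the other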
i <- 1
while(i < nrow(df)){
  remove_rows <- NULL
  for(j in (i+1):nrow(df)){
    p <- subsetVector(df[i,], df[j,])
    if(p == 1){
      #row i is a subset of row j: mark row i and restart the scan
      remove_rows <- c(remove_rows, i)
      break
    }else if(p == 2){
      #row j is a subset of row i: mark row j and keep scanning
      remove_rows <- c(remove_rows, j)
    }
  }
  if(length(remove_rows) > 0)
    df <- df[-remove_rows,]
  #only advance when row i itself survived this pass
  if(!i %in% remove_rows)
    i <- i + 1
}
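On the example data this removes the second row and leaves rows 1 and 3, matching the desired df2 (assuming df is a fresh copy of the question's data.frame, not one already modified by the data.table answer above):
print(df)
#  x  y  z
#1 1 NA  0
#3 1  1 NA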
Related
Count the number of missing values in R
I'm working with the Pima Indians Diabetes data from Kaggle in RStudio, and instead of NAs the missing values are coded as 0s. How can I count the number of "0" values in each variable with a single loop, instead of typing table(data$variableName == 0) for each column? In other words, a single loop for the whole data frame.
We can use colSums on a logical matrix:
colSums(data == 0)
Or with sapply in a loop:
sapply(data, function(x) sum(x == 0))
or with apply over the columns:
apply(data, 2, function(x) sum(x == 0))
Or in a for loop:
count <- numeric(ncol(data))
for(i in seq_along(data)) count[i] <- sum(data[[i]] == 0)
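For a quick sanity check, a minimal sketch on a made-up two-column frame (Glucose and BMI are just stand-ins for the Kaggle column names):
data <- data.frame(Glucose = c(0, 120, 0), BMI = c(22.1, 0, 30.5))
colSums(data == 0)
#Glucose     BMI
#      2       1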
Try this:
library(dplyr)
data %>% summarise(across(.fns = ~ sum(. == 0, na.rm = TRUE), .names = "Zeros_in_{.col}"))
How to insert sequential rows in data.table in R (example given)?
df is a data.table and df_expected is the desired data.table. I want to add an hour column running from 0 to 23, with visits filled in as 0 for the newly added hours.
df <- data.table(customer=c("x","x","x","y","y"),
                 location_id=c(1,1,1,2,3),
                 hour=c(2,5,7,0,4),
                 visits=c(40,50,60,70,80))
df_expected <- data.table(
  customer=c("x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x","x",
             "y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y",
             "y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y","y"),
  location_id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
                3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3),
  hour=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23),
  visits=c(0,0,40,0,0,50,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
           70,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
           0,0,0,0,80,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
This is what I tried to obtain my result, but it did not work:
df1 <- df[, ':='(hour=seq(0:23)), by=(customer)]
Error in `[.data.table`(df, , `:=`(hour = seq(0L:23L)), by = (customer)) :
  Type of RHS ('integer') must match LHS ('double'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)
Here's an approach that creates the target table and then uses a join to add in the visits information. The ifelse statement just helps us clean up the NAs from the merge; you could also leave them in and replace them with := in the new data.table.
target <- data.table(
  customer = rep(unique(df$customer), each = 24),
  hour = 0:23)
df_join <- df[target, on = c("customer", "hour"),
              .(customer, hour, visits = ifelse(is.na(visits), 0, visits))]
all.equal(df_expected, df_join)
Edit: This addresses the request to include the location_id column. One way to do this is with by = location_id in the creation of the target. I've also added in some of the code from chinsoon12's answer.
target <- df[, .("customer" = rep(unique(customer), each = 24L),
                 "hour" = rep(0L:23L, times = uniqueN(customer))),
             by = location_id]
df_join <- df[target, on = .NATURAL,
              .(customer, location_id, hour, visits = fcoalesce(visits, 0))]
all.equal(df_expected, df_join)
Another option using CJ to generate your universe, on=.NATURAL for joining on identically named columns, and fcoalesce to handle NAs:
df[CJ(customer, hour=0L:23L, unique=TRUE), on=.NATURAL, allow.cartesian=TRUE,
   .(customer=i.customer, hour=i.hour, visits=fcoalesce(visits, 0))]
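For reference, the universe that CJ builds here, evaluated on its own (a quick sketch; unique=TRUE collapses the repeated customer values, giving 2 customers x 24 hours = 48 rows):
CJ(customer = df$customer, hour = 0L:23L, unique = TRUE)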
Here's a for-loop answer.
df_final <- data.table()
for(i in 0:23){
  if(i %in% df[,hour]){
    a <- df[hour==i]
  }else{
    a <- data.table(customer="x", hour=i, visits=0)
  }
  df_final <- rbind(df_final, a)
}
df_final
You can wrap this in another for-loop to handle your multiple customers x, y, etc. (the following loop isn't very clean but gets the job done):
df_final <- data.table()
for(j in unique(df[,customer])){
  for(i in 0:23){
    if(i %in% df[,hour]){
      if(df[hour==i,customer] %in% j){
        a <- df[hour==i]
      }else{
        a <- data.table(customer=j, hour=i, visits=0)
      }
    }else{
      a <- data.table(customer=j, hour=i, visits=0)
    }
    df_final <- rbind(df_final, a)
  }
}
df_final
How do I create a dataframe with only some rows from another dataframe, with column names?
I'm working on a Shiny R app in which I need to parse CSV files. From them, I build a dataframe. Then, I want to extract some rows from this dataframe and put them in another dataframe. I found a way to do that using rbind, but it's pretty ugly and seems inadequate.
function(set){ #set is the data.frame containing the data I want to extract
  newTable <- data.frame(
    name = character(1),
    value = numeric(1),
    columnC = character(1),
    stringsAsFactors=FALSE)
  threshold <- 0
  for (i in 1:nrow(set)){
    value <- calculateValue(set$Value[[i]])
    if (value >= threshold){
      name <- set[which(set$Name == "foo"), ]$Name
      columnC <- set[which(set$C == "bar"), ]$C
      v <- c(name, value, columnC)
      newTable <- rbind(newTable, v)
    }
  }
}
If I don't initialize my dataframe values with character(1) or numeric(1), I get an error:
Warning: Error in data.frame: arguments imply differing number of rows: 0, 1
75: stop
74: data.frame
But then it leaves me with an empty row in my dataframe (empty strings for the characters and 0s for the numerics). Since R is a cool language, I assume there's an easier and more efficient way to do this. Can anybody help me?
Rather than looping through each row, you can either subset:
function(set, threshold) {
  set[calculateValue(set$Value) >= threshold, c("name", "value", "columnC")]
}
Or use dplyr to filter rows and select columns to get the subset you want:
library(tidyverse)
function(set, threshold) {
  set %>%
    filter(calculateValue(Value) >= threshold) %>%
    select(name, value, columnC)
}
Then assign the result to a new variable if you want a new dataframe:
getValueOverThreshold <- function(set, threshold) {
  set %>%
    filter(calculateValue(Value) >= threshold) %>%
    select(name, value, columnC)
}
newDF <- getValueOverThreshold(set, 0)
You might want to check out https://r4ds.had.co.nz/transform.html
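One caveat: this assumes calculateValue() is vectorized over the whole column. If it only accepts scalars (as the row-by-row loop in the question suggests), a hedged sketch that applies it elementwise first:
getValueOverThreshold <- function(set, threshold) {
  vals <- vapply(set$Value, calculateValue, numeric(1))
  set[vals >= threshold, c("name", "value", "columnC")]
}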
Removing duplicate rows from a data frame in R, keeping those with a smaller/larger value
I am trying to remove duplicate rows in an R data frame, but I want to keep the row with a smaller or larger value (either one; it doesn't matter for the purposes of this question) in a certain column. I can remove duplicate rows normally (from either side) like this:
df = data.frame(
  x = c(1,1,2,3,4,5,5,6,1,2,3,3,4,5,6),
  y = c(rnorm(4), NA, rnorm(10)),
  id = c(rep(1,8), rep(2,7)))
splitID <- split(df, df$id)
lapply(splitID, function(x) x[!duplicated(x$x),])
How can I condition the removal of duplicate rows? Thanks!
Use ave() to return a logical index to subset your data.frame:
idx = as.logical(ave(df$y, df$x, df$id, FUN=fun))
df[idx,, drop=FALSE]
Some possible choices of fun include:
fun1 = function(x) !is.na(x) & !duplicated(x) & (x == min(x, na.rm=TRUE))
fun2 = function(x) {
  res = logical(length(x))
  res[which.min(x)] = TRUE
  res
}
The dplyr version of this might be:
df %>% group_by(x, id) %>% filter(fun2(y))
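For intuition, fun2 on a small vector (a quick sketch; which.min skips the NA and marks only the first minimum, so exactly one row per group is kept):
fun2(c(3, NA, 1, 2))
#[1] FALSE FALSE  TRUE FALSE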
We may need to order the rows before applying duplicated:
lapply(splitID, function(x) x[!duplicated(x[order(x$x, x$y),]$x),])
and for the reverse, i.e. keeping the larger values, order with decreasing = TRUE, as in the sketch below.
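Spelled out, a hedged sketch of that decreasing variant (the ordering puts the largest y first within each x, so the first, kept, occurrence is the largest):
lapply(splitID, function(x) {
  xo <- x[order(x$x, x$y, decreasing = TRUE),]
  xo[!duplicated(xo$x),]
})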
Change variable values based on preceding value
I have the following dataset:
df <- data.frame(
  subject = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
  time = c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,11),
  performance = c(1,0,-1,-1,0,1,1,-1,0,0,0,1,1,1,-1,0,1,1,-1,0,0,1,-1,1,1,0,1,1,-1,0,-1,-1,0))
What I would like to do is change some of the entries in the performance variable. More specifically, if a "-1" entry is preceded by a "1", I want to change the "-1" to "0". However, this should be done within subjects only, not across subjects (the subjects have varying numbers of sessions). So this is what I'd like to have in the end:
df2 <- data.frame(
  subject = c(1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3),
  time = c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,11),
  performance = c(1,0,-1,-1,0,1,1,0,0,0,0,1,1,1,0,0,1,1,0,0,0,1,-1,1,1,0,1,1,-1,0,-1,-1,0))
Does anyone have an idea how to do this? Thanks in advance! S.
Using dplyr:
df %>%
  group_by(subject) %>%
  mutate(performance = replace(performance,
                               which(performance + lag(performance) == 0 & performance == -1),
                               0))
Here's a data.table approach, where I first create a flag column which is then used to subset the data and update the performance column by reference.
library(data.table)
dt <- as.data.table(df) # or setDT(df)
dt[, flag := performance == -1 & shift(performance, 1L) == 1, by = subject]
dt[(flag), performance := 0][, flag := NULL]
I chose to do it with an intermediate flag column because I expect that to perform very well for large data sets. If performance is not your concern, you could of course use ifelse or replace instead, as in the sketch below.
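A minimal sketch of that one-step replace variant (the same condition as the flag; which() drops the NA that shift() produces for each subject's first row):
dt <- as.data.table(df)
dt[, performance := replace(performance,
                            which(performance == -1 & shift(performance, 1L) == 1),
                            0), by = subject]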
This is ugly, but should work:
dftest <- df
for (i in 2:nrow(dftest)) {
  if (dftest$performance[i] == -1 && dftest$performance[i - 1] == 1) {
    if (dftest$subject[i] == dftest$subject[i - 1]) {
      dftest$performance[i] <- 0
    }
  }
}
all.equal(df2, dftest) # ONE ERROR
This reports one mismatch, at row 29 - can you check whether your example df2 is correct there? If I understand the question correctly, df2$performance[29] should be 0.
A base R solution using by and sapply:
gr <- do.call(c, by(df, df$subject, function(x) {
  c(FALSE, unlist(sapply(1:length(x$performance), function(y)
    (x$performance[y] == -1) & (x$performance[y-1] == 1))))
}))
df[gr, 3] <- 0
cbind(df, df2)