Removing all rows from a dataframe that also occur in a second dataframe - r

I have two dataframes and am looking to drop all rows from the first that match rows in the second. I know there are similar questions out there, but the solutions didn't work for me.
# 20 animals observed every day from 2004 through 2020
dates <- rep(seq(as.Date("2004/01/01"), as.Date("2020/12/31"), "days"), each = 20)
Animal_id <- rep(1:20, times = length(unique(dates)))
df1 <- data.frame(dates = dates, id = Animal_id)
# a smaller frame containing only animals 1 and 2
dates2 <- rep(seq(as.Date("2004/01/01"), as.Date("2020/12/31"), "days"), each = 2)
Animal_id2 <- rep(1:2, times = length(unique(dates2)))
df2 <- data.frame(dates = dates2, id = Animal_id2)
# drop two rows so df2 does not contain every date
df2 <- df2[-4, ]
df2 <- df2[-6, ]
## I would like to ensure that any animal in df2 is removed from df1
df1$remove <- paste(df1$dates, df1$id, sep = "-")
df2$remove <- paste(df2$dates, df1$id, sep = "-")
dim(df1)
dim(df2)
anti_join(df1, df1, by = "remove")
I have also found and tried the following, but it does not work:
df1[!(df1$remove %in% df2$remove), ]
I do not get any error messages; it simply does not remove the rows (the dimensions of the data do not change). My actual dataset is quite large, and I am hoping to avoid typing out every date+ID combination I would like to filter out.
Is there a way I can get R to go through and remove matches between two dataframes when I need to do this over multiple columns (i.e. I can't just use ID because there will be differences in dates between the two)?

If I understand you correctly, this should be the correct code (as indicated by @Waldi in the comments). Note that your own attempt fails because df2$remove is built from df1$id rather than df2$id, and because anti_join(df1, df1, ...) joins df1 against itself. From the dplyr documentation:
anti_join() returns all rows from x where there are not matching values in y, keeping just columns from x.
library(dplyr)
anti_join(df1, df2, by = "id")
Output:
# A tibble: 111,780 x 2
dates id
<date> <int>
1 2004-01-01 3
2 2004-01-01 4
3 2004-01-01 5
4 2004-01-01 6
5 2004-01-01 7
6 2004-01-01 8
7 2004-01-01 9
8 2004-01-01 10
9 2004-01-01 11
10 2004-01-01 12
# ... with 111,770 more rows
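Since the question also asks about matching on multiple columns: anti_join accepts a vector of column names, so the full date+ID combination can be matched directly. A sketch using the objects defined above:

library(dplyr)
# remove only the exact date+ID combinations present in df2
anti_join(df1, df2, by = c("dates", "id"))

And the base-R %in% attempt from the question does work once both keys are built from their own data frame (the original pasted df1$id into df2$remove):

df2$remove <- paste(df2$dates, df2$id, sep = "-")  # use df2$id, not df1$id
df1[!(df1$remove %in% df2$remove), ]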

Related

Eliminate row from data.frame based on partially duplicate values

I have a relatively large data.frame with 205K observations and 54 variables. This data.frame is the result of appending three different data.frames. The original data.frames all have the columns date, time, lat and lon, but each data.frame carries accessory information which I need to retain. In the final data.frame I therefore have sets of three rows where date, time, lat and lon are exactly the same, but the values of var1, var2 and so forth are different, and some are NA. A simplified version of my data.frame could look like the following:
mydf
var1 date time var2 var3 var4 var5 var6 lat lon
1 A 1 2 3 4 5 6 7 8 9
2 B 1 2 <NA> <NA> <NA> 6 7 8 9
3 <NA> 1 2 <NA> <NA> <NA> <NA> <NA> 8 9
In particular, I would like to identify in my data.frame those sets of rows with the same date, time, lat and lon, but only retain the ones where, for instance, var1 is not NA, so that the final data.frame should look like:
var1 date time var2 var3 var4 var5 var6 lat lon
1 A 1 2 3 4 5 6 7 8 9
2 B 1 2 <NA> <NA> <NA> 6 7 8 9
I know that I can use
distinct(mydf, ..., .keep_all = TRUE)
but I can't figure out how to use the arguments properly. Any help is greatly appreciated.
With dplyr:
First identify duplicated rows from top to bottom and from bottom to top, so that all duplicates on the selected variables are flagged. Then keep the rows where this new variable is TRUE and var1 is not NA:
library(dplyr)
mydf %>%
  mutate(dup = duplicated(select(., date, time, lat, lon)) |
           duplicated(select(., date, time, lat, lon), fromLast = TRUE)) %>%
  filter(dup == TRUE & !is.na(var1))
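For reference, the same both-directions duplicated() logic can be written in base R (a sketch, not part of the original answer):

key <- mydf[c("date", "time", "lat", "lon")]            # columns that define a duplicate
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)  # flag every member of a duplicated set
mydf[dup & !is.na(mydf$var1), ]                          # keep duplicates whose var1 is not NA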
I have found a less direct and most likely less elegant solution to deal with the duplicate rows in my specific data.frame: first add a "datetime" column to each data.frame, then "subtract" one data.frame from the other, and finally append them into a final data.frame.
# my data.frames: df1, df2
library(plyr)   # for rbind.fill
library(dplyr)  # for mutate and anti_join
df1 <- mutate(df1, datetime = paste(df1$date, df1$time))  # add a "datetime" column by concatenating "date" and "time"
df2 <- mutate(df2, datetime = paste(df2$date, df2$time))
df1 <- anti_join(df1, df2, by = "datetime")  # delete from df1 those rows that occur in df2, based on "datetime"
df.f <- rbind.fill(df1, df2)  # append the two data.frames into "df.f"
Possibly not the fastest approach but it works fine with my data.

How to keep one instance or more of the values in one column when removing duplicate rows?

I'm trying to remove rows with duplicate values in one column of a data frame, while making sure all the existing values in that column remain represented: a value should appear more than once if its values in one other column are non-missing and not duplicated, and only once if the values in that other column are all missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7), Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7), Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and following the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: value 7 in the column 'Group' should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop the NA rows in toy and keep only the unique rows. We then left join that with the unique Group values, so groups whose Class is entirely NA come back as NA rows.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
# Group Class
#1 1 a
#2 2 a
#3 2 b
#4 3 <NA>
#5 4 <NA>
#6 5 a
#7 5 b
#8 6 a
#9 7 a
Same logic using dplyr functions:
library(dplyr)
toy %>%
  na.omit() %>%
  distinct() %>%
  right_join(toy %>% distinct(Group), by = "Group")
If you would like to try a tidyverse approach:
library(tidyverse)
toy %>%
  group_by(Group) %>%
  filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
  distinct()
Output
# A tibble: 9 x 2
# Groups: Group [7]
Group Class
<dbl> <chr>
1 1 a
2 2 a
3 2 b
4 3 NA
5 4 NA
6 5 a
7 5 b
8 6 a
9 7 a

Subsetting by counts [duplicate]

This question already has answers here: Select groups based on number of unique / distinct values.
I have a data.frame
library(dplyr)
ID <- c(1,1,1,1,2,2,3,3,3,3,4,4,5)
Score <- c(20,22,34,56,78,98,56,43,45,33,24,54,22)
Quarter <- c("Q1","Q2","Q3","Q4","Q1","Q2","Q1","Q2","Q3","Q4","Q1","Q2","Q1")
df <- data.frame(ID,Score,Quarter)
I only want to deal with the data that has all 4 quarters (Q1, Q2, Q3, Q4 in the column "Quarter"). One way I thought I could do this is to subset where the ID is present 4 times, since it is repeated in each quarter. I am having a hard time subsetting on the count of IDs. I tried:
filter(df, count(df, vars = ID)==4)
But it did not work, and guidance would be greatly appreciated. Thank you.
One way to do this is to use n_distinct to count the unique Quarter values for each ID and keep the groups which have all 4:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Quarter) == 4)
# ID Score Quarter
# <dbl> <dbl> <fct>
#1 1.00 20.0 Q1
#2 1.00 22.0 Q2
#3 1.00 34.0 Q3
#4 1.00 56.0 Q4
#5 3.00 56.0 Q1
#6 3.00 43.0 Q2
#7 3.00 45.0 Q3
#8 3.00 33.0 Q4
Equivalent base R implementation using ave would be
df[as.numeric(ave(df$Quarter, df$ID, FUN = function(x) length(unique(x)))) == 4, ]
Here are a few alternatives. The last three are base solutions.
#1 is an SQL solution which creates a one-column data frame df0 with only those IDs having 4 quarters which is then joined to df thereby eliminating all other IDs.
#2 is a dplyr solution which filters the groups retaining only those with 4 rows.
#3 is a data.table solution which returns the rows for those ID groups having 4 rows and NULL for the other groups. This has the effect of eliminating the other groups.
#4 is a zoo solution which converts df to a wide-form zoo object with quarters along the top and ID as the time index. It then removes any row having an NA and reshapes back to the original form using fortify.zoo, also reordering back to sorted order. The last line of the solution could be omitted if the row order does not matter. Interestingly, it does not use knowledge of the number 4.
#5 is a base solution which splits df into a list of data frames, one per ID, and then uses Filter to extract those having 4 rows. Finally it puts it all back together.
#6 is a base solution which creates a vector having one element per row of df containing the number of rows (including the current row) having the ID in that row. Then use subset to reduce df to those rows for which that vector equals 4.
#7 is a base solution which splits df into a list of data frames, one per ID, and then uses Reduce to iterate over it appending the current data frame to what we have so far if it has 4 rows or just keeping what we have so far if not.
# 1
library(sqldf)
sqldf("with df0 as (
select ID from df group by ID having count(*) = 4
)
select * from df join df0 using (ID)")
# 2
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 4) %>% ungroup
# 3
library(data.table)
as.data.table(df)[, if (nrow(.SD) == 4) .SD, by = ID]
# 4
library(zoo)
z <- read.zoo(df, split = "Quarter")
df2 <- fortify.zoo(na.omit(z), melt = TRUE, names = names(df)[c(1, 3:2)])
df2 <- df2[order(df2$ID, df2$Quarter), ]
# 5
do.call("rbind", Filter(function(x) nrow(x) == 4, split(df, df$ID)))
# 6
subset(df, ave(ID, ID, FUN = length) == 4)
# 7
Reduce(function(x, y) if (nrow(y) == 4) rbind(x, y) else x, split(df, df$ID))
Here is another base R method using table, rowSums and %in%. We get the frequency count of the 'ID' and 'Quarter' columns with table, convert it to a logical matrix in which 0 values become TRUE and all others FALSE (!table(...)), take the row-wise sums (rowSums), convert the result to a logical vector, extract the names of the elements that are TRUE, and compare them with ID using %in% to subset the dataset:
subset(df, ID %in% names(which(!rowSums(!table(df[c(1,3)])))))
# ID Score Quarter
#1 1 20 Q1
#2 1 22 Q2
#3 1 34 Q3
#4 1 56 Q4
#7 3 56 Q1
#8 3 43 Q2
#9 3 45 Q3
#10 3 33 Q4
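Unpacking that one-liner step by step (same logic; the intermediate names tab, miss and ids are only illustrative):

tab <- table(df[c(1, 3)])         # ID x Quarter contingency table
miss <- rowSums(!tab)             # number of missing quarters per ID (0 = all four present)
ids <- names(which(miss == 0))    # IDs observed in every quarter
subset(df, ID %in% ids)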
I just figured out I can do this as well:
df[df$ID %in% names(table(df$ID))[table(df$ID)==4],]
It gets the desired result using only the counts of ID.

How to pair rows with the same value in one column of a dataframe in R

I have data in the following form :
set.seed(1234)
data <- data.frame(cbind(runif(40,0,10), rep(seq(1,20,1), each = 2)))
data <- data[sample(nrow(data)),]
colnames(data) <- c("obs","subject")
head(data)
obs subject
1.5904600 12
8.1059855 13
5.4497484 6
0.3999592 12
2.5880982 19
2.6682078 9
... ...
Let's say that I have only two observations (column "obs") by subject (column "subject", where subjects are numbered from 1 to 20).
I would like to "group" rows by values of the "subject" column. More precisely, I would like to "order" data by subject, but conserving the order displayed above. Thus, final data would be something like this:
obs subject
1.5904600 12
0.3999592 12
8.1059855 13
2.3656473 13
5.4497484 6
7.2934746 6
Any ideas? I thought of maybe identifying the rows corresponding to each subject with:
which(data$subject == x)
and then rbind-ing those rows in a loop, but I am sure there is a simpler and faster way to do this, isn't there?
Convert to a factor whose levels follow the order of first appearance, then order by it:
data$group <- factor(data$subject, levels = unique(data$subject))
data[ order(data$group), ]
# obs subject group
# 1 1.59046003 12 12
# 4 0.39995918 12 12
# 2 8.10598552 13 13
# 30 2.18799541 13 13
# ...
Nest the data by obs and unnest again. The resulting tibble will have retained the original order but subject will be grouped.
library(tidyr)
data %>% nest(obs) %>% unnest()
# A tibble: 6 × 2
# subject obs
# <int> <dbl>
#1 12 1.5904600
#2 12 0.3999592
#3 13 8.1059855
#4 6 5.4497484
#5 19 2.5880982
#6 9 2.6682078
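Note that nest(obs) / unnest() use tidyr's pre-1.0 interface, which is now deprecated; a sketch of the equivalent under tidyr 1.0+:

library(tidyr)
# nest obs into a list-column per subject, then expand it again
data %>% nest(obs = obs) %>% unnest(obs)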
This is based on zx8754's answer, but it preserves the data type:
library(dplyr)  # for the arrange function
group <- factor(data[, "subject"], levels = unique(data[, "subject"]))
data <- cbind(data, group)
data <- arrange(as.data.frame(data), group)
data <- as.matrix(data[, -3])
dplyr is a great package with various useful verbs, one of which is arrange(variable), which does what you want here, and more elegantly (the result is generally also a data.frame, so you don't need to cbind):
require(dplyr)
as.data.frame(data) %>% arrange(subject)
# or, if you want reverse order:
as.data.frame(data) %>% arrange(-subject)
(For that matter, data.table is great too. In fact, you can get the two merged in the dtplyr package.)
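A minimal dtplyr sketch, assuming dtplyr >= 1.0: lazy_dt() wraps the data, the dplyr verbs are translated to data.table code, and the result is collected at the end.

library(dplyr)
library(dtplyr)
lazy_dt(as.data.frame(data)) %>%
  arrange(subject) %>%
  as_tibble()  # or as.data.table() to stay in data.table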

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on the column Initials). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt below failed completely. Does anybody have suggestions?
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))
attach(df)
maxrows <- max(table(Initials)) + 1
arr <- split(df, Initials)
lapply(arr, function(x) {
  toadd <- maxrows - dim(x)[1]
  replicate(toadd, x <- rbind(x, rep(NA, 1)))  # col 1 should keep the same Initials
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by Initials, add row numbers, ungroup, complete the row-number/Initials combinations, then remove the row-number column:
library(dplyr)
library(tidyr)
df %>%
  group_by(Initials) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  complete(Initials, row) %>%
  select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows needed per initial, combine those extra initials with NA values, and rbind them to the data frame.
max(table(df$Initials)) finds the count of the most-repeated initial, in this case 2. Subtracting each initial's own count, table(df$Initials), from that maximum gives a vector of the necessary additions. There's an added bonus to this method: by using table we also automatically get a named vector.
We use the names of that vector to know 1) which initials to repeat, and 2) how many times they should be repeated.
To preserve the class of the data column, you can add newdf$data <- as.numeric(newdf$data), as in the sketch below.
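Putting that together (newdf is just an illustrative name for the rbind result above):

to.add <- max(table(df$Initials)) - table(df$Initials)
newdf <- rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df) - 1)))
# rbind-ing a character vector coerces the data column to character,
# so restore its numeric class:
newdf$data <- as.numeric(newdf$data)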
