I have a dataset that looks something like this:
df <- structure(list(Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L,
100L, 200L, 300L), Amount = c(NA, 1000L, NA, 564L, 0L, 200L,
NA, 0L, NA), Company = structure(c(NA, 1L, NA, 4L, 2L, 3L, NA,
3L, NA), .Label = c("ATT", "Boeing", "Petco", "T Mobile"), class = "factor")), .Names =
c("Claim.Num", "Amount", "Company"), class = "data.frame", row.names = c(NA,
-9L))
I want to remove duplicate rows based on Claim.Num values, but only remove the duplicates that meet the following criteria: is.na(df$Company) | df$Amount == 0
In other words, remove records 1, 3, and 5.
I've gotten this far: df <- df[!duplicated(df$Claim.Num[which(df$Amount == 0 | is.na(df$Company))]), ]
The code runs without errors, but doesn't actually remove duplicate rows based on the required criteria. I think that's because I'm telling it to remove any duplicate Claim.Nums which match those criteria, rather than to remove any duplicate Claim.Num while treating certain Amounts and Companies preferentially for removal. Please note that I can't simply filter the dataset on those values, as there are other records with 0 or NA values that must be kept (e.g. records 8 and 9 shouldn't be excluded, because their Claim.Nums are not duplicated).
If you order your data frame first, then you can make sure duplicated keeps the ones you want:
# rows meeting the removal criteria get sort key 1, so the preferred row for each
# Claim.Num comes first and is the one duplicated() keeps
df.tmp <- with(df, df[order(ifelse(is.na(Company) | Amount == 0, 1, 0)), ])
df.tmp[!duplicated(df.tmp$Claim.Num), ]
# Claim.Num Amount Company
# 2 500 1000 ATT
# 4 600 564 T Mobile
# 6 700 200 Petco
# 7 100 NA <NA>
# 8 200 0 Petco
# 9 300 NA <NA>
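The same ordering trick reads a little more directly in dplyr, if you have it available (a sketch, not part of the original answer):
library(dplyr)
df %>%
  arrange(is.na(Company) | Amount %in% 0) %>%  # rows meeting the removal criteria sort last
  distinct(Claim.Num, .keep_all = TRUE)        # keep the first row per Claim.Num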
Slightly different approach
r <- merge(df,
           aggregate(df$Amount, by = list(Claim.Num = df$Claim.Num), length),
           by = "Claim.Num")
result <- r[!(r$x > 1 & (is.na(r$Company) | r$Amount == 0)), -ncol(r)]
result
# Claim.Num Amount Company
# 1 100 NA <NA>
# 2 200 0 Petco
# 3 300 NA <NA>
# 5 500 1000 ATT
# 7 600 564 T Mobile
# 9 700 200 Petco
This adds a column x holding the number of rows that share each Claim.Num, then filters the result based on your criteria. The use of -ncol(r) just drops the helper column x at the end.
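If you'd rather skip the merge, the same per-group count can be computed with ave() (a sketch in the same spirit, not from the original answer):
# number of rows sharing each Claim.Num (length() counts NAs as well)
n <- ave(df$Amount, df$Claim.Num, FUN = length)
# %in% instead of == keeps NA Amounts from turning the filter condition into NA
df[!(n > 1 & (is.na(df$Company) | df$Amount %in% 0)), ]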
Another way, based on subset and logical indices:
subset(df, !(duplicated(Claim.Num) | duplicated(Claim.Num, fromLast = TRUE)) |
         (!is.na(Amount) & Amount != 0))
Claim.Num Amount Company
2 500 1000 ATT
4 600 564 T Mobile
6 700 200 Petco
7 100 NA <NA>
8 200 0 Petco
9 300 NA <NA>
I have two data frames with different columns each
Dataframe 1 has a single column, GID_2:
GID_2
MEX.15.1_1
MEX.15.1_2
MEX.15.1_3
MEX.15.1_4
MEX.15.1_5
MEX.15.1_6
Dataframe 2 has columns ID_MUNICIPIO, B, C, and D:
ID_MUNICIPIO    B    C    D
           1  500  200  100
           2  200  300  100
           3  100  600  400
           4  200  400  700
           5  600  100  800
           6  700  100  200
I want to merge them like this:
GID_2         X
MEX.15.1_1  500
MEX.15.1_2  300
MEX.15.1_3  600
MEX.15.1_4  700
MEX.15.1_5  800
MEX.15.1_6  700
Sorry if this is a rookie question; I am fairly new to R.
The logic is to find the max in each row! Then we can use cbind, excluding the ID_MUNICIPIO column from the row-wise max:
cbind(df1, X = apply(df2[, -1], 1, max, na.rm = TRUE))
GID_2 X
1 MEX.15.1_1 500
2 MEX.15.1_2 300
3 MEX.15.1_3 600
4 MEX.15.1_4 700
5 MEX.15.1_5 800
6 MEX.15.1_6 700
data:
> dput(df1)
structure(list(GID_2 = c("MEX.15.1_1", "MEX.15.1_2", "MEX.15.1_3",
"MEX.15.1_4", "MEX.15.1_5", "MEX.15.1_6")), class = "data.frame", row.names = c(NA,
-6L))
> dput(df2)
structure(list(ID_MUNICIPIO = 1:6, B = c(500L, 200L, 100L, 200L,
600L, 700L), C = c(200L, 300L, 600L, 400L, 100L, 100L), D = c(100L,
100L, 400L, 700L, 800L, 200L)), class = "data.frame", row.names = c(NA,
-6L))
You can get the key values common to both data frames with the intersect function, as below:
common_rows <- generics::intersect(df1$GID_2, df2$ID_MUNICIPIO)
More information is clearly needed. I'll assume that the last digit in GID_2 is the unique key that can be used for a merge with ID_MUNICIPIO in dataset 2. That is a big assumption.
The pseudo-code to solve this:
Create a new column in Dataset1 called "ID_MUNICIPIO".
"ID_MUNICIPIO" will equal the last character in GID_2.
Merge Dataset1 and Dataset2 on "ID_MUNICIPIO".
Find the max in each row of the newly merged data set (see @TarJae's suggestion above); a code sketch follows below.
At least that's how I think it should go. But this is predicated on my understanding of GID_2.
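Under that assumption, here is a minimal sketch of those steps (the sub() pattern and the B/C/D column selection are guesses based on the data shown):
# assumes the digits after the final "_" in GID_2 line up with ID_MUNICIPIO
df1$ID_MUNICIPIO <- as.integer(sub(".*_", "", df1$GID_2))
merged <- merge(df1, df2, by = "ID_MUNICIPIO")
# row-wise max over the value columns only
merged$X <- apply(merged[, c("B", "C", "D")], 1, max, na.rm = TRUE)
merged[, c("GID_2", "X")]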
I need to delete some rows in my dataset based on the given condition.
Kindly go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. For IDs that appear in more than one row (123, 789, and 852), I need to delete the rows where Dur is not NA; that removes both rows of ID 123 and the first rows of IDs 789 and 852.
I don't want to delete IDs that have only a single row (564, 741, and 159), whether their Dur is a value or NA, nor any other rows where Dur is NA.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest code to solve the issue.
Thanks in advance!
One way would be to select the rows where the number of rows in the group is 1, or where Dur is NA.
This can be written in dplyr as :
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
Using data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R :
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
We can use .I in data.table, which collects the qualifying row indices instead of building .SD for each group:
library(data.table)
setDT(df)[df[, .I[.N == 1 | is.na(Dur)], ID]$V1]
Note: This question is a follow-up to a previous question: r - Finding closest coordinates between two large data sets.
I am aiming to identify the nearest entry in dataset 2 to each entry in dataset 1, based on the coordinates in both datasets. Dataset 1 contains 180,000 rows (only 1,800 unique coordinates) and dataset 2 contains 4,500 rows (all 4,500 coordinates unique).
The previously referenced post contains a solution to the problem, but it uses RANN::nn2, which computes Euclidean distance, whereas the aim here is ellipsoidal (Vincenty) distance.
Current code:
df1[ , c(4,5)] <- as.data.frame(RANN::nn2(df2[,c(2,3)],df1[,c(2,3)],k=1))
df1[,4] <- df2[df1[, 4], 1]
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 44 0.7990743
# 2 2 57.80945 -2.234544 5688 2.1676868
# 3 4 34.02335 -3.098445 61114 1.4758202
# 4 5 63.80879 -2.439163 23 4.2415854
# 5 6 53.68881 -7.396112 54 3.6445416
# 6 7 63.44628 -5.162345 23 2.3577811
# 7 8 21.60755 -8.633113 440 8.2123762
# 8 9 78.32444 3.813290 76 11.4936496
# 9 10 66.85533 -3.994326 55 1.9296370
# 10 3 51.62354 -8.906553 54 3.2180026
I suspect that the solution would involve geosphere::distVincentyEllipsoid but I am unsure as to how to integrate it into the existing code.
Data:
r details
platform x86_64-w64-mingw32
version.string R version 3.5.3 (2019-03-11)
data set 1 input (not narrowed down to unique coordinates)
df1 <- structure(list(id = c(1L, 2L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 3L),
HIGH_PRCN_LAT = c(52.881442267773, 57.8094538200198, 34.0233529,
63.8087900198, 53.6888144440184, 63.4462810678651, 21.6075544376207,
78.324442654172, 66.85532539759495, 51.623544596), HIGH_PRCN_LON = c(-2.87377812157822,
-2.23454414781635, -3.0984448341, -2.439163178635, -7.396111601421454,
-5.162345043546359, -8.63311254098095, 3.813289888829932,
-3.994325961186105, -8.9065532453272409), SRC_ID = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), distance = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 10L), class = "data.frame")
data set 2 input
df2 <- structure(list(SRC_ID = c(55L, 54L, 23L, 11L, 44L, 21L, 76L,
5688L, 440L, 61114L), HIGH_PRCN_LAT = c(68.46506, 50.34127, 61.16432,
42.57807, 52.29879, 68.52132, 87.83912, 55.67825, 29.74444, 34.33228
), HIGH_PRCN_LON = c(-5.0584, -5.95506, -5.75546, -5.47801, -3.42062,
-6.99441, -2.63457, -2.63057, -7.52216, -1.65532)), row.names = c(NA,
10L), class = "data.frame")
Using the distVincentyEllipsoid function:
library(geosphere)
t(
  apply(
    # inner apply: one column per df1 point, holding its distances to all df2 points
    # (geosphere takes coordinates in lon/lat order, hence the c(3, 2) selection)
    apply(df1[, c(3, 2)], 1, function(mrow) distVincentyEllipsoid(mrow, df2[, c(3, 2)])),
    # outer apply: for each df1 point, pick the closest df2 point and its distance
    2, function(x) c(SRC_ID = df2[which.min(x), 1], distance = min(x))
  )
)
SRC_ID distance
1 44 74680.48
2 5688 238553.51
3 61114 137385.18
4 23 340642.70
5 44 308458.73
6 23 256176.88
7 440 908292.28
8 76 1064419.47
9 55 185119.29
10 54 251580.45
Just use df1[, c(4, 5)] <- t(apply(... to assign the values to the columns of df1.
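Written out in full, that assignment looks like this (the same expression as above, just assigned; here SRC_ID is already the real ID, so no re-mapping step is needed):
# assign nearest SRC_ID and Vincenty distance (in metres) into columns 4 and 5 of df1
df1[, c(4, 5)] <- t(
  apply(
    apply(df1[, c(3, 2)], 1, function(mrow) distVincentyEllipsoid(mrow, df2[, c(3, 2)])),
    2, function(x) c(SRC_ID = df2[which.min(x), 1], distance = min(x))
  )
)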
Using rgeos::gDistance. This computes Cartesian distance, but starting from the solution below I managed to post the updated answer above:
library(sp); library(rgeos)
# convert to spatial data sets
df1rgsp <- SpatialPointsDataFrame(df1[, c(3, 2)], df1[, -c(3, 2)])
df2rgsp <- SpatialPointsDataFrame(df2[, c(3, 2)], data.frame(SRC_ID = df2[, 1]))
# apply over the pairwise distance matrix: find the minimum value and the
# corresponding row number, transpose the result into two columns, and
# assign them to the columns of `df1`
df1[, c(4, 5)] <- t(apply(gDistance(df1rgsp, df2rgsp, byid = TRUE), 1, function(x) {
  c(SRC_ID = which.min(x), distance = min(x))
}))
# replace row numbers with `SRC_ID`
df1[, 4] <- df2[as.integer(df1[, 4]), 1]  # same as what you have in the Q
# id HIGH_PRCN_LAT HIGH_PRCN_LON SRC_ID distance
# 1 1 52.88144 -2.873778 440 1.9296370
# 2 2 57.80945 -2.234544 61114 3.2180026
# 3 4 34.02335 -3.098445 21 2.3577811
# 4 5 63.80879 -2.439163 23 8.8794997
# 5 6 53.68881 -7.396112 55 0.7990743
# 6 7 63.44628 -5.162345 440 3.4316239
# 7 8 21.60755 -8.633113 5688 11.4936496
# 8 9 78.32444 3.813290 54 2.1676868
# 9 10 66.85533 -3.994326 23 6.1545391
# 10 3 51.62354 -8.906553 23 1.4758202
I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). At the same time dfold has more variables, and I will need them in the analysis phase.
I would like to merge the 2 dataframes into dfmerge so as to update the common variables from dfnew into dfold while retaining the variables that exist only in dfold. I have tried merge(), match(), and the dplyr and sqldf packages, but I either obtain a dfmerge with only the updated 75 variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I have found (here) is an elegant but pretty long (10-row) loop that eliminates the *.x variables after a merge by pid with all.x = TRUE. Might you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance.
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario will involve 75 variables.
Alright, this solution assumes that you don't really need a merge but only want to update NA values within your dfold with imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
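One caveat with the line above (not in the original answer): dfnew$pid == dfold$pid compares the two pid vectors element by element, so it only behaves as intended when both frames have the same row order. A match()-based lookup is order-independent:
# align dfnew rows to dfold rows by pid, regardless of row order
idx <- match(dfold$pid, dfnew$pid)
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[idx], dfold$Salary)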
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
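If dplyr 1.0.0 or later is available, rows_update() handles this update-by-key pattern directly; since dfnew contains no NAs, overwriting the matched columns is equivalent to the NA-filling above (a sketch, assuming every pid in dfnew also exists in dfold):
library(dplyr)
# overwrite the common columns (Age, Salary) in dfold with dfnew's values, matched by pid
dfmerge <- rows_update(dfold, dfnew, by = "pid")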
I have a large dataset of observations, with several observations in rows and several different variables for each ID.
e.g.
Data
ID V1 V2 V3 time
1 35 100 5.2 2015-07-03 07:49
2 25 111 6.2 2015-04-01 11:52
3 41 120 NA 2015-04-01 14:17
1 25 NA NA 2015-07-03 07:51
2 NA 122 6.2 2015-04-01 11:50
3 40 110 4.1 2015-04-01 14:25
I would like to extract the earliest (first) observation of each variable independently, based on the time column, for each unique ID. That is, I would like to combine the multiple rows of the same ID so that I end up with one row per ID holding the first available observation of each variable (the times will not be equal across variables).
The min() function will return the earliest time for a set of observations, but the problem is that I need to do this for each variable. I have tried using tapply with the minimum time:
tapply(Data, ID, min(time))
but I get an error saying:
"Error in match.fun(FUN) :
'min(Data$time)' is not a function, character or symbol"
I suspect that there is also a problem because many of the rows of observations have missing data.
Alternatively I have tried to just do each variable one at a time using aggregate, and select the min(time) this way:
firstV1 <-aggregate(V1[min(time)]~ID, data=Data, na.rm=T)
From the example dataset, what I would like to see is:
Data
ID V1 V2 V3
1 35 100 5.2
2 25 122 6.2
3 41 120 4.1
Note the '25' for ID2 V1 was from the later observation because the first observation was missing. Same for ID3 V3.
Input data
structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L), V1 = c(35L, 25L,
41L, 25L, NA, 40L), V2 = c(100L, 111L, 120L, NA, 122L, 110L),
V3 = c(5.2, 6.2, 4.2, NA, 6.2, 4.1), time = structure(c(1435906140,
1427885520, 1427894220, 1435906260, 1427885400, 1427894700
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("ID",
"V1", "V2", "V3", "time"), row.names = c(NA, -6L), class = "data.frame")
This should do what you need.
library(data.table)
# use the dput() data from the question (its time column is already POSIXct)
setDT(Data)
# for each ID, keep the values from the row with the earliest time
minTime.Data <- Data[, lapply(.SD, function(x) x[time == min(time)]), by = ID]
minTime.Data
The outcome will be
ID V1 V2 V3 time
1: 1 35 100 5.2 2015-07-03 07:49:00
2: 2 NA 122 6.2 2015-04-01 11:50:00
3: 3 41 120 4.2 2015-04-01 14:17:00
Let me know if this is what you were looking for, because there is a little ambiguity in your question.
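If what you actually want is the earliest non-missing value of each variable per ID (which is what your expected output shows), here is a sketch using the dput() data from the question (note that there ID 3's first V3 is 4.2 rather than NA, so that cell comes out as 4.2 instead of 4.1):
library(data.table)
setDT(Data)
setorder(Data, ID, time)  # earliest observations first within each ID
# first non-NA value of each variable per ID
Data[, lapply(.SD, function(x) x[!is.na(x)][1]), by = ID, .SDcols = c("V1", "V2", "V3")]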