How to remove duplicates if specific column has value in r

I need to delete some rows in my dataset based on a condition.
Kindly go through the sample data for reference.
ID Date Dur
123 01/05/2000 3
123 08/04/2002 6
564 04/04/2012 2
741 01/08/2011 5
789 02/03/2009 1
789 08/01/2010 NA
789 05/05/2011 NA
852 06/06/2015 3
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
My main concern is the Dur column. Within each ID group that has more than one row, I have to delete the rows where Dur is not NA.
i.e. IDs 123, 789 and 852 have more than one row and contain Dur values, so I need to remove their rows that have a Dur value: both rows of ID 123 and the first row of IDs 789 and 852.
I don't want to delete IDs that have a single row with a Dur value (564, 741), or any rows where Dur is NA.
Expected Output:
ID Date Dur
564 04/04/2012 2
741 01/08/2011 5
789 08/01/2010 NA
789 05/05/2011 NA
852 03/02/2016 NA
155 03/02/2008 NA
155 01/01/2009 NA
159 07/07/2008 NA
Kindly suggest a code to solve the issue.
Thanks in Advance!

One way is to keep the rows where the group has only one row or where Dur is NA.
This can be written in dplyr as:
library(dplyr)
df %>% group_by(ID) %>% filter(n() == 1 | is.na(Dur))
# ID Date Dur
# <int> <chr> <int>
#1 564 04/04/2012 2
#2 741 01/08/2011 5
#3 789 08/01/2010 NA
#4 789 05/05/2011 NA
#5 852 03/02/2016 NA
#6 155 03/02/2008 NA
#7 155 01/01/2009 NA
#8 159 07/07/2008 NA
Using data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | is.na(Dur)], ID]
and base R :
subset(df, ave(is.na(Dur), ID, FUN = function(x) length(x) == 1 | x))
data
df <- structure(list(ID = c(123L, 123L, 564L, 741L, 789L, 789L, 789L,
852L, 852L, 155L, 155L, 159L), Date = c("01/05/2000", "08/04/2002",
"04/04/2012", "01/08/2011", "02/03/2009", "08/01/2010", "05/05/2011",
"06/06/2015", "03/02/2016", "03/02/2008", "01/01/2009", "07/07/2008"
), Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
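
To make the condition explicit, here is a minimal base R sketch that builds the same logical mask step by step. The names group_size, keep and result are illustrative only, and the data frame is a cut-down copy of the sample data with just the ID and Dur columns.

```r
# Cut-down copy of the sample data (ID and Dur only)
df <- data.frame(
  ID  = c(123L, 123L, 564L, 741L, 789L, 789L, 789L, 852L, 852L, 155L, 155L, 159L),
  Dur = c(3L, 6L, 2L, 5L, 1L, NA, NA, 3L, NA, NA, NA, NA)
)

# Number of rows in each row's ID group
group_size <- ave(seq_along(df$ID), df$ID, FUN = length)

# Keep a row if its ID occurs once, or if its Dur is NA
keep <- group_size == 1 | is.na(df$Dur)

result <- df[keep, ]
result$ID
# 564 741 789 789 852 155 155 159
```

Note that an ID whose rows all carry a Dur value (such as 123) disappears entirely under this rule, which matches the expected output in the question.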

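We can use .I in data.table to get the row indices that satisfy the condition within each ID, and then subset with them (df is the data frame shown above)
library(data.table)
setDT(df)[df[, .I[.N == 1 | is.na(Dur)], ID]$V1]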

Related

selecting the column with the maximum value

I have a dataframe in the wide format such as below:
Subject Volume.1 Volume.2 Volume.3 Volume.4
1 77 22 1 NA
2 65 182 NA NA
3 98 NA NA NA
4 66 76 145 677
I want to keep Volume.1 and add a column holding the largest value among the remaining volume columns (Volume.2 to Volume.4), irrespective of which column it came from, but I am struggling to code this correctly. Some of the columns are NA when a subject does not have a recording at that point.
For instance with the above example the table would look like:
Subject Volume.1 Worst volume
1 77 22
2 65 182
3 98 NA
4 66 677
I was wondering if anyone could help?

We may use pmax
cbind(df[1:2], WorseVolume = do.call(pmax, c(df[3:5], na.rm = TRUE)))
-output
Subject Volume.1 WorseVolume
1 1 77 22
2 2 65 182
3 3 98 NA
4 4 66 677
data
df <- structure(list(Subject = 1:4, Volume.1 = c(77L, 65L, 98L, 66L
), Volume.2 = c(22L, 182L, NA, 76L), Volume.3 = c(1L, NA, NA,
145L), Volume.4 = c(NA, NA, NA, 677L)), class = "data.frame", row.names = c(NA,
-4L))
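
As a variation, the columns can be picked by name instead of by position, which may be safer if the column order changes. This is only a sketch: the regular expression and the names vol_cols and worst are assumptions for illustration, not part of the original answer.

```r
df <- data.frame(
  Subject  = 1:4,
  Volume.1 = c(77L, 65L, 98L, 66L),
  Volume.2 = c(22L, 182L, NA, 76L),
  Volume.3 = c(1L, NA, NA, 145L),
  Volume.4 = c(NA, NA, NA, 677L)
)

# Every Volume.N column except Volume.1
vol_cols <- grep("^Volume\\.[2-9]$", names(df), value = TRUE)

# Row-wise maximum; a row whose values are all NA stays NA even with na.rm = TRUE
worst <- do.call(pmax, c(df[vol_cols], na.rm = TRUE))

result <- cbind(df[c("Subject", "Volume.1")], WorstVolume = worst)
```

Subject 3 has no value in Volume.2 to Volume.4, so its WorstVolume is NA, as in the desired table.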

r data transform separate columns

I have a dataset with three columns. The third column has date values mixed with some strings.
ID Col1 Value
123 Start.Date 2011-06-18
123 Stem A1
123 Stem_1 A6
123 Stem_2 NA
321 Start.Date 2014-08-05
321 Stem C1
321 Stem_1 C4
321 Stem_2 NA
677 Start.Date NA
677 Stem NA
677 Stem_1 NA
677 Stem_2 NA
How can I separate out the dates and store them in a different column like this ?
ID Col1 Value Start.Date
123 Stem A1 2011-06-18
123 Stem_1 A6 2011-06-18
123 Stem_2 NA 2011-06-18
321 Stem C1 2014-08-05
321 Stem_1 C4 2014-08-05
321 Stem_2 NA 2014-08-05
677 Stem NA NA
677 Stem_1 NA NA
677 Stem_2 NA NA
Thanks.

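An alternative solution based solely on tidyr reshaping (the pipe %>% comes from dplyr; id_cols is named explicitly so the call does not rely on argument position):
library(dplyr)
library(tidyr)
df %>% pivot_wider(id_cols = ID, names_from = Col1, values_from = Value) %>%
pivot_longer(c("Stem", "Stem_1", "Stem_2"), names_to = "Col1", values_to = "Value")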

Create a new column that takes its value from the Value column where Col1 == 'Start.Date' and is NA otherwise. Then, for each ID, fill the NA values downward from the preceding date and remove the rows with 'Start.Date'.
library(dplyr)
library(tidyr)
df %>%
mutate(Start.Date = as.Date(replace(Value, Col1 != 'Start.Date', NA))) %>%
group_by(ID) %>%
fill(Start.Date) %>%
ungroup() %>%
filter(Col1 != 'Start.Date')
# ID Col1 Value Start.Date
# <int> <chr> <chr> <date>
#1 123 Stem A1 2011-06-18
#2 123 Stem_1 A6 2011-06-18
#3 123 Stem_2 NA 2011-06-18
#4 321 Stem C1 2014-08-05
#5 321 Stem_1 C4 2014-08-05
#6 321 Stem_2 NA 2014-08-05
#7 677 Stem NA NA
#8 677 Stem_1 NA NA
#9 677 Stem_2 NA NA
data
df <- structure(list(ID = c(123L, 123L, 123L, 123L, 321L, 321L, 321L,
321L, 677L, 677L, 677L, 677L), Col1 = c("Start.Date", "Stem",
"Stem_1", "Stem_2", "Start.Date", "Stem", "Stem_1", "Stem_2",
"Start.Date", "Stem", "Stem_1", "Stem_2"), Value = c("2011-06-18",
"A1", "A6", NA, "2014-08-05", "C1", "C4", NA, NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -12L))
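
The same result can be sketched in base R by treating the 'Start.Date' rows as a lookup table and matching on ID. The names is_start, lookup and out are hypothetical, and the sketch assumes each ID has at most one 'Start.Date' row.

```r
df <- data.frame(
  ID    = rep(c(123L, 321L, 677L), each = 4),
  Col1  = rep(c("Start.Date", "Stem", "Stem_1", "Stem_2"), times = 3),
  Value = c("2011-06-18", "A1", "A6", NA,
            "2014-08-05", "C1", "C4", NA,
            NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Rows holding the start date act as a per-ID lookup table
is_start <- df$Col1 == "Start.Date"
lookup   <- df[is_start, c("ID", "Value")]

# Drop the lookup rows and attach the matching date to every remaining row
out <- df[!is_start, ]
out$Start.Date <- as.Date(lookup$Value[match(out$ID, lookup$ID)])
```

Unlike the fill() approach, this does not depend on the 'Start.Date' row coming first within each ID.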

How to remove a list of observations from a dataframe with dplyr in R? [duplicate]

This is my dataframe x
ID Name Initials AGE
123 Mike NA 18
124 John NA 20
125 Lily NA 21
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
129 Oscar NA 32
I also have a numeric vector y (num [1:3]) with the IDs I want to remove from data frame x:
> print(y)
[1] 124 125 129
My goal is remove all the ID's in y from data frame x
This is my desired output
ID Name Initials AGE
123 Mike NA 18
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
I'm using the dplyr package and trying this, but it's not working:
FinalData <- x %>%
select(everything()) %>%
filter(ID != c(y))
Can anyone tell me what needs to be corrected?

The comparison ID != y recycles y along ID element by element instead of testing membership, so when 'y' has more than one element we can use %in% and negate it with !. The select step is not needed, as everything() just selects all the columns.
library(dplyr)
x %>%
filter(!ID %in% y)
# ID Name Initials AGE
#1 123 Mike NA 18
#2 126 Jasper NA 24
#3 127 Toby NA 27
#4 128 Will NA 19
Or another option is anti_join
x %>%
anti_join(tibble(ID = y))
In base R, subset can be used
subset(x, !ID %in% y)
data
y <- c(124, 125, 129)
x <- structure(list(ID = 123:129, Name = c("Mike", "John", "Lily",
"Jasper", "Toby", "Will", "Oscar"), Initials = c(NA, NA, NA,
NA, NA, NA, NA), AGE = c(18L, 20L, 21L, 24L, 27L, 19L, 32L)),
class = "data.frame", row.names = c(NA,
-7L))
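
To see why the original attempt removes nothing, here is a small base R sketch comparing the two tests on the ID values alone. The names ids, wrong and right are illustrative.

```r
ids <- 123:129
y   <- c(124, 125, 129)

# != recycles y: 123 != 124, 124 != 125, 125 != 129, 126 != 124, ...
# (R also warns that the lengths are not multiples of each other)
wrong <- suppressWarnings(ids != y)

# %in% tests membership of each ID in y, which is what is wanted here
right <- !(ids %in% y)

all(wrong)
# TRUE, so no ID would be dropped
ids[right]
# 123 126 127 128
```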

Aggregating Dataset to "ignore" categorical variable

I have this dataset, which is structured like this
Neighborhood, var1, var2, COUNTRY, DAY, categ 1, categ 2
1 700 724 AL 0 YES YES
1 500 200 FR 0 YES NO
....
1 701 659 IT 1 NO YES
1 791 669 IT 1 NO YES
....
2 239 222 GE 0 YES NO
and so on...
The hierarchy is "Neighborhood > DAY > COUNTRY": for every neighborhood, every day and every country I have an observation of var1, var2, categ1 and categ2.
For the moment I'm not interested in analyzing the country, so I want to aggregate it away by summing var1 and var2 over the COUNTRY field (the categorical variables categ1 and categ2 are not influenced by the country). The result should be a dataset that, for each Neighborhood and each DAY, gives me var1, var2, categ1 and categ2.
I'm quite new to R programming and don't know many packages (I would write a program in C++, but I'm forcing myself to learn R)...
So do you have any idea on how to do this?
Data
df1 <- structure(list(Neighborhood = c(1L, 1L, 1L, 1L, 2L),
var1 = c(700L, 500L, 701L, 791L, 239L),
var2 = c(724L, 200L, 659L, 669L, 222L),
COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
DAY = c(0L, 0L, 1L, 1L, 0L),
`categ 1` = c("YES", "YES", "NO", "NO", "YES"),
`categ 2` = c("YES", "NO", "YES", "YES", "NO")),
.Names = c("Neighborhood", "var1", "var2", "COUNTRY", "DAY", "categ 1", "categ 2"),
class = "data.frame", row.names = c(NA, -5L))
EDIT: @akrun
when I try your command, the result is:
aggregate(.~Neighborhood+DAY+COUNTRY, data= df1[!grepl("^categ", names(df1))], mean)
Neighborhood, DAY, COUNTRY, var1, var2
1 1 0 AL 700 724
2 1 0 FR 500 200
3 2 0 GE 239 222
4 1 1 IT 746 664
But (in this example) what I would like to have is:
Neighborhood, DAY, var1, var2
1 1 0 1200 924 // where var1=700+500....
2 1 1 1492 1328
3 2 0 239 222

If we are not interested in the 'categ' and COUNTRY columns, we can drop them with grepl and sum the remaining variables with aggregate
aggregate(.~Neighborhood+DAY, data= df1[!grepl("^(categ|COUNTRY)", names(df1))], sum)
# Neighborhood DAY var1 var2
#1 1 0 1200 924
#2 2 0 239 222
#3 1 1 1492 1328
Or using dplyr
library(dplyr)
df1 %>%
group_by(Neighborhood, DAY) %>%
summarise(across(matches("^var"), sum), .groups = "drop")
# Neighborhood DAY var1 var2
# (int) (int) (int) (int)
#1 1 0 1200 924
#2 1 1 1492 1328
#3 2 0 239 222
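
An equivalent way to write the aggregate call, sketched here as an alternative, is to name the summed variables on the left-hand side of the formula with cbind instead of dropping the unwanted columns first. The name res is illustrative.

```r
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1         = c(700L, 500L, 701L, 791L, 239L),
  var2         = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY      = c("AL", "FR", "IT", "IT", "GE"),
  DAY          = c(0L, 0L, 1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Sum var1 and var2 within each Neighborhood/DAY pair; COUNTRY is simply not mentioned
res <- aggregate(cbind(var1, var2) ~ Neighborhood + DAY, data = df1, FUN = sum)
res
#   Neighborhood DAY var1 var2
# 1            1   0 1200  924
# 2            2   0  239  222
# 3            1   1 1492 1328
```

Listing the variables explicitly avoids having to maintain a pattern of columns to exclude when new columns are added to the data.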

Remove duplicates based on specific criteria

I have a dataset that looks something like this:
df <- structure(list(Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L,
100L, 200L, 300L), Amount = c(NA, 1000L, NA, 564L, 0L, 200L,
NA, 0L, NA), Company = structure(c(NA, 1L, NA, 4L, 2L, 3L, NA,
3L, NA), .Label = c("ATT", "Boeing", "Petco", "T Mobile"), class = "factor")), .Names =
c("Claim.Num", "Amount", "Company"), class = "data.frame", row.names = c(NA,
-9L))
I want to remove duplicate rows based on Claim.Num values, but choose which duplicate to drop using the following criteria: df$Company == 'NA' | df$Amount == 0
In other words, remove records 1, 3, and 5.
I've gotten this far: df <- df[!duplicated(df$Claim.Num[which(df$Amount == 0 | df$Company == 'NA')]),]
The code runs without errors, but doesn't actually remove duplicate rows based on the required criteria. I think that's because I'm telling it to remove any duplicate Claim.Num values that match those criteria, rather than to remove duplicated Claim.Num values while treating certain Amounts and Companies preferentially for removal. Please note that I can't simply filter the dataset on the specified values, as there are other records with 0 or NA values that must be kept (e.g. records 8 and 9 shouldn't be excluded because their Claim.Nums are not duplicated).

If you order your data frame first, then you can make sure duplicated keeps the ones you want:
df.tmp <- with(df, df[order(ifelse(is.na(Company) | Amount == 0, 1, 0)), ])
df.tmp[!duplicated(df.tmp$Claim.Num), ]
# Claim.Num Amount Company
# 2 500 1000 ATT
# 4 600 564 T Mobile
# 6 700 200 Petco
# 7 100 NA <NA>
# 8 200 0 Petco
# 9 300 NA <NA>

Slightly different approach
r <- merge(df,
aggregate(df$Amount,by=list(Claim.Num=df$Claim.Num),length),
by="Claim.Num")
result <-r[!(r$x>1 & (is.na(r$Company) | (r$Amount==0))),-ncol(r)]
result
# Claim.Num Amount Company
# 1 100 NA <NA>
# 2 200 0 Petco
# 3 300 NA <NA>
# 5 500 1000 ATT
# 7 600 564 T Mobile
# 9 700 200 Petco
This adds a column x to indicate which rows have Claim.Num present more than once, then filters the result based on your criteria. The use of -ncol(r) just removes the column x at the end.

Another way based on subset and logical indices:
subset(df, !(duplicated(Claim.Num) | duplicated(Claim.Num, fromLast = TRUE)) |
(!is.na(Amount) & Amount))
Claim.Num Amount Company
2 500 1000 ATT
4 600 564 T Mobile
6 700 200 Petco
7 100 NA <NA>
8 200 0 Petco
9 300 NA <NA>
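
The rule shared by these answers can also be spelled out as two explicit masks in base R. This is a sketch under stated assumptions: Company is stored as character rather than factor, and the names n_per_claim, low_priority and result are hypothetical.

```r
df <- data.frame(
  Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L, 100L, 200L, 300L),
  Amount    = c(NA, 1000L, NA, 564L, 0L, 200L, NA, 0L, NA),
  Company   = c(NA, "ATT", NA, "T Mobile", "Boeing", "Petco", NA, "Petco", NA),
  stringsAsFactors = FALSE
)

# How many rows share each row's Claim.Num
n_per_claim <- ave(seq_along(df$Claim.Num), df$Claim.Num, FUN = length)

# Rows that should lose out when a claim is duplicated
low_priority <- is.na(df$Company) | (!is.na(df$Amount) & df$Amount == 0)

# Drop low-priority rows only inside duplicated claims
result <- df[!(n_per_claim > 1 & low_priority), ]
rownames(result)
# "2" "4" "6" "7" "8" "9"
```

One caveat: if every row of a duplicated claim were low priority, this rule would drop the claim entirely, whereas the order-then-duplicated approach would still keep one row.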
