Remove inconsistent duplicate entries from data frame with Base R

I want to remove duplicate entries from a data frame when they are inconsistent. The following gives a simplified example:
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Edgar 65
## 7 Edgar 55
## 8 Frank 90
## 9 George 120
## 10 George 120
## 11 George 120
## 12 Herbert 300
## 13 Iris 15
## 14 Iris 25
## 15 Iris 25
Version A)
Edgar and Iris each appear more than once, yet the amounts given for them are inconsistent, so I want to remove all of their entries:
#remove inconsistent duplicate entries
df2
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Frank 90
## 7 George 120
## 8 George 120
## 9 George 120
## 10 Herbert 300
Version B)
Another possibility would be to keep only one instance of the consistent entries:
#keep only one instance of consistent entries
df3
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 David 200
## 5 Frank 90
## 6 George 120
## 7 Herbert 300
I am interested in (elegant?) ways to solve both versions in Base R. Efficiency should not be a problem because the dataset I have is not that huge.

A base solution that solves both at once. This has the side effect of requiring row name changes.
A Remove "inconsistent" values
new_df<-do.call("rbind",
Filter(function(x) all(x$amount == x$amount[1]),
split(df,df$name)))
name amount
Andy Andy 100
Bert Bert 50
Cindy.3 Cindy 30
Cindy.4 Cindy 30
David David 200
Frank Frank 90
George.9 George 120
George.10 George 120
George.11 George 120
Herbert Herbert 300
The above needs further cleaning of row names (an unwanted side effect perhaps but we deal with that below)
B Remove duplicates
new_df<-new_df[!duplicated(new_df$name),]
row.names(new_df) <- 1:nrow(new_df)
Combined result
new_df
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
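As a complementary sketch (not part of the answer above), ave() can flag the consistent groups directly, which avoids split()/rbind() and with it the row-name side effect. The variable name n_amounts is made up for illustration; the data are the question's.

```r
df <- data.frame(
  name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank",
           "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"),
  amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25)
)

# For every row, the number of distinct amounts recorded for that row's name
n_amounts <- ave(df$amount, df$name, FUN = function(x) length(unique(x)))

# Version A: keep only rows whose name has a single distinct amount
df2 <- df[n_amounts == 1, ]
rownames(df2) <- NULL

# Version B: additionally collapse the repeated rows
df3 <- unique(df2)
rownames(df3) <- NULL
```

Because ave() returns one value per original row, the result can be used as a plain logical filter and the original row order is kept.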
The question specifically asks for a base R solution. In case someone later wants to use dplyr, I will leave this solution here.
Using dplyr, we can check if all values are equal to the first value of amount. If not, make them NA and delete them. Proceed with removing duplicates for what remains.
A Remove inconsistent ones
library(dplyr)
(df %>%
group_by(name) %>%
mutate(name = ifelse(!all(amount==first(amount)), NA, name)) %>%
na.omit() -> new_df)
# A tibble: 10 x 2
# Groups: name [7]
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
Remove duplicates
new_df %>%
filter(!duplicated(name)) %>%
ungroup()
# A tibble: 7 x 2
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300

A) First aggregate to apply the conditions, then filter the data and finally stack the result.
t <- aggregate( amount ~ name, df, function(x) c(unique(x),length(x)) )
t_m <- t[!sapply( t$amount, function(x) (length(x)>2) ),]
setNames( stack( setNames(lapply( t_m$amount, function(x)
rep(x[1],x[2]) ), t_m$name) )[,c("ind", "values")], colnames(df) )
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
B) Is a bit more straightforward. Just aggregate and filter.
t <- aggregate( amount ~ name, df, unique )
t[lengths(t$amount) == 1,]
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
6 Frank 90
7 George 120
8 Herbert 300

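You can use duplicated(), but after unique() you need to remove every row whose name still occurs more than once, not only the later occurrences (this gives your version B).
The names that remain can then be used to filter the original data frame for all of their rows (your version A).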
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df_unq <- unique(df)
df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]
df3
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 12 Herbert 300
df[df$name %in% df3$name, ]
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 4 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 10 George 120
#> 11 George 120
#> 12 Herbert 300
Created on 2021-12-12 by the reprex package (v2.0.1)

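To get rid of repeated entries in general, there is a built-in function in R called duplicated. Note that on its own it keeps the first row of every name, so the inconsistent Edgar and Iris entries are still present.
Here's the code:
df[!duplicated(df), ] # drops only rows that are identical in every column
df[!duplicated(df$name),] # keeps the first row per name
The output of the second line looks like this: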
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
5 David 200
6 Edgar 65
8 Frank 90
9 George 120
12 Herbert 300
13 Iris 15
To also drop the inconsistent names and keep one row per consistent name (version B), you'll need to do something like this:
df <- unique(df)
df <- split(df, df$name)
df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
Both versions use base R. You can do the same with the dplyr package.

Problem B is a sub-problem of problem A. To solve A we can use var() to find inconsistent values, utilizing the behavior of Filter() which always takes NAs as FALSE. To solve B we just need to get rid of duplicated rows in A applying unique().
Case A
with(df, df[!name %in% names(Filter(var, split(amount, name))), ])
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 4 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 10 George 120
# 11 George 120
# 12 Herbert 300
Case B
with(df, df[!name %in% names(Filter(var, split(amount, name))), ]) |>
unique()
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 12 Herbert 300
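To see why Filter(var, ...) singles out the inconsistent names, it may help to look at the per-name variances directly. This is only an illustrative sketch; the names amounts, variances and inconsistent are not from the answer.

```r
df <- data.frame(
  name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank",
           "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"),
  amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25)
)

# One numeric vector of amounts per name
amounts <- split(df$amount, df$name)

# NA for names with a single row, 0 for consistent repeats, > 0 for inconsistent ones
variances <- sapply(amounts, var)

# Filter() coerces the variances to logical: 0 becomes FALSE, NA is dropped,
# so only names with a non-zero variance survive
inconsistent <- names(Filter(var, amounts))
```

Here inconsistent holds "Edgar" and "Iris", which is exactly the set excluded by the !name %in% ... condition.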

Related

Joining rows with columns to create vertical table

I am trying to figure out how to join 2 data frames to create a vertical table of the data. Here is some sample data:
people <- data.frame(person = c("John","David","Peter"), company = c("A", "B", "C"))
grades <- data.frame(person1=c(10, 40, 50, 60), person2=c(60,70,80, 100), person3=c(33,44,55, 75))
NOTE: The order of the columns in grades is the same as the order of the person column in the people data frame.
I would like to get a data frame like the following but can't think of how to get there. Would prefer a solution using base R (am using an older version of R so some packages don't work for me):
person | company | grade
-------------------------
John | A | 10
John | A | 40
John | A | 50
John | A | 60
David | B | 60
David | B | 70
David | B | 80
David | B | 100
Peter | C | 33
Peter | C | 44
Peter | C | 55
Peter | C | 75
We change the column names of 'grades' with 'person' column from 'people', gather into 'long' format and then do a left_join
library(tidyverse)
setNames(grades, people$person) %>%
gather(person, grade) %>%
left_join(people)
# person grade company
#1 John 10 A
#2 John 40 A
#3 John 50 A
#4 John 60 A
#5 David 60 B
#6 David 70 B
#7 David 80 B
#8 David 100 B
#9 Peter 33 C
#10 Peter 44 C
#11 Peter 55 C
#12 Peter 75 C
Or using base R with merge
merge(stack(setNames(grades, people$person)),
people, all.x = TRUE, by.x = 'ind', by.y = 'person')
A base R option using cbind would be
idx <- rep(seq_along(people$person), each = dim(grades)[1])
cbind(people[idx,], stack(unlist(grades))["values"])
Result
# person company values
#1 John A 10
#1.1 John A 40
#1.2 John A 50
#1.3 John A 60
#2 David B 60
#2.1 David B 70
#2.2 David B 80
#2.3 David B 100
#3 Peter C 33
#3.1 Peter C 44
#3.2 Peter C 55
#3.3 Peter C 75
Use unlist and stack on grades to get
stack(unlist(grades))
values ind
1 10 person11
2 40 person12
3 50 person13
4 60 person14
5 60 person21
6 70 person22
7 80 person23
8 100 person24
9 33 person31
10 44 person32
11 55 person33
12 75 person34
Since "The order of the columns in grades is the same as the order of the person column in the people data frame." we can use cbind next, after we expanded people to have the correct number of rows.
(idx <- rep(seq_along(people$person), each = dim(grades)[1]))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3
Another option, probably a little faster would be
cbind(people[idx,], data.frame(grade = unlist(grades, use.names = FALSE)))
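Put together, the cbind approach can be sketched as one runnable piece. It relies on the assumption stated in the question that the columns of grades are in the same order as the rows of people; the name long is made up here.

```r
people <- data.frame(person = c("John", "David", "Peter"),
                     company = c("A", "B", "C"))
grades <- data.frame(person1 = c(10, 40, 50, 60),
                     person2 = c(60, 70, 80, 100),
                     person3 = c(33, 44, 55, 75))

# Repeat each row index of 'people' once per grade: 1 1 1 1 2 2 2 2 3 3 3 3
idx <- rep(seq_len(nrow(people)), each = nrow(grades))

# unlist() walks through 'grades' column by column, matching the order of idx
long <- cbind(people[idx, ], grade = unlist(grades, use.names = FALSE))
rownames(long) <- NULL  # replace row names such as 1.1, 1.2 with 1..12
```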

Add row with group sum in new column at the end of group category

I have been searching this information since yesterday but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in BaseR.
Thanks!!
In base R you can use ave to add the new column. We insert the group sum only in the last row of each group and NA everywhere else.
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
group_by(CODE, CONCEPT, PNR, NAME) %>%
mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE) ,NA))
For a base R option, you may try merging the original data frame and aggregate:
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
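For reference, here is a self-contained sketch of the ave() idea on the question's data. The data frame is rebuilt by hand, and the column names PNR and DEPTO are assumed syntactic stand-ins for "P. NR." and "DEPTO."; grouping by CODE and NAME follows the asker's aggregate() call.

```r
df <- data.frame(
  CODE    = c(1, 1, 1, 1, 2),
  CONCEPT = c("Lunch", "Lunch", "Lunch", "Lunch", "Internet"),
  PNR     = c(11, 11, 11, 13, 13),
  NAME    = c("John", "John", "John", "Frank", "Frank"),
  DEPTO   = c("SALES", "SALES", "SALES", "IT", "IT"),
  PRICE   = c(160, 120, 10, 200, 120)
)

# Within each CODE/NAME group, return NA everywhere except the last position,
# which receives the group total
df$TOTAL <- with(df, ave(PRICE, CODE, NAME, FUN = function(x) {
  out <- rep(NA_real_, length(x))
  out[length(x)] <- sum(x)
  out
}))
```

Unlike the merge() variant, this keeps every original row and its PRICE, and only adds the TOTAL column.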

Rank values based on individual users

I have a data set that looks like the following:
User Area
Sarah 123.4
Sarah 20.5
Sarah 43
Sam 86
Sam 101
Sam 32.6
Justin 45
Justin 125.8
Justin 39
Justin 88.4
Zac 21
Zac 4
Zac 111
I want to sort the greatest area to smallest, however I want separate top areas for each individual user.
I have tried: test$Ranking1 <- order(test$User, test$Area, decreasing = FALSE ), but this ranks them all together
I then tried: test$Ranking1 <- ave(test$User, test$Area, FUN= rank ), and while others seem to have said this will work, my results give the middle (average) value a ranking of 1 and then go up according to which value is closest to the average. I want 1 to be the largest area, not the average.
Any suggestions?
This can be done very easily with data.table:
library(data.table) # Load package
setDT(dat) # convert to data.table
dat[,max(Area),by=User] # compute
dat[,sort(Area),by=User] # Sort increasing
dat[,sort(Area,decreasing = TRUE),by=User] # Sort decreasing
Hope this helps!
Read the documentation of the package, it's very helpful.
I am assuming that you want to rank area within each individual, and also want to know the largest area for each individual:
## make up data
set.seed(1)
user <- rep(LETTERS[sample(26, 5)], each=sample(5, 1))
area <- rnorm(length(user), 100, 20)
d <- data.frame(user, area)
library(dplyr)
d %>%
group_by(user) %>%
mutate(ranking=rank(-area), top_area=max(area)) %>%
ungroup()
user area ranking top_area
1 G 131.90562 1 131.9056
2 G 106.59016 4 131.9056
3 G 83.59063 5 131.9056
4 G 109.74858 3 131.9056
5 G 114.76649 2 131.9056
6 J 111.51563 2 130.2356
7 J 93.89223 4 130.2356
8 J 130.23562 1 130.2356
9 J 107.79686 3 130.2356
10 J 87.57519 5 130.2356
...
I think this is your desired output? If you want the order reversed, so that 1 marks the largest area, rank the negated values, i.e. use rank(-Area) in place of rank(Area).
x = "User Area
Sarah 123.4
Sarah 20.5
Sarah 43
Sam 86
Sam 101
Sam 32.6
Justin 45
Justin 125.8
Justin 39
Justin 88.4
Zac 21
Zac 4
Zac 111"
rank.foo = function(x) {
z = numeric()
for (i in as.character(unique(x$User)))
{z = c(z, rank(subset(x, User == i)$Area))}
return(z)
}
df = read.table(text = x, header = TRUE) # parse the text above into a data frame
cbind(df, rank.foo(df))
User Area rank.foo(df)
Sarah 123.4 3
Sarah 20.5 1
Sarah 43.0 2
Sam 86.0 2
Sam 101.0 3
Sam 32.6 1
Justin 45.0 2
Justin 125.8 4
Justin 39.0 1
Justin 88.4 3
Zac 21.0 2
Zac 4.0 1
Zac 111.0 3
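A base R sketch of the same per-user ranking with ave(), for comparison. Note the argument order: the values to rank come first and the grouping variable second, which is the reverse of the asker's attempt. The data frame test is rebuilt from the question.

```r
test <- data.frame(
  User = rep(c("Sarah", "Sam", "Justin", "Zac"), times = c(3, 3, 4, 3)),
  Area = c(123.4, 20.5, 43, 86, 101, 32.6, 45, 125.8, 39, 88.4, 21, 4, 111)
)

# Rank Area within each User; negating makes the largest area rank 1
test$Ranking1 <- ave(test$Area, test$User, FUN = function(a) rank(-a))
```

The rows stay in their original order, so the ranking column lines up with the existing data without any sorting.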

Adding a ranking column to a dataframe

This seems like it must be a very common task, but I can't find a solution in google or SO. I want to add a column called 'rank' to 'dat1' based on the sequence that 'order.scores' applies to 'dat'. I tried using row.names(), but the rownames are based on 'dat', not 'dat1'. I also tried 'dat$rank <-rank(dat1)', but this produces an error message.
fname<-c("Joe", "Bob", "Bill", "Tom", "Sue","Sam","Jane","Ruby")
score<-c(500, 490, 500, 750, 550, 500, 210, 320)
dat<-data.frame(fname,score)
order.scores<-order(dat$score,dat$fname)
dat1<-dat[order.scores,]
You can compute a ranking from an ordering as follows:
dat$rank <- NA
dat$rank[order.scores] <- 1:nrow(dat)
dat
# fname score rank
# 1 Joe 500 5
# 2 Bob 490 3
# 3 Bill 500 4
# 4 Tom 750 8
# 5 Sue 550 7
# 6 Sam 500 6
# 7 Jane 210 1
# 8 Ruby 320 2
Try:
## dat, dat1, and order.scores as defined
dat <- data.frame(fname=c("Joe", "Bob", "Bill", "Tom", "Sue","Sam","Jane","Ruby"),
score=c(500, 490, 500, 750, 550, 500, 210, 320))
order.scores <- order(dat$score, dat$fname)
dat1 <- dat[order.scores,]
dat1$rank <- rank(dat1$score)
dat1
## fname score rank
## 7 Jane 210 1
## 8 Ruby 320 2
## 2 Bob 490 3
## 3 Bill 500 5
## 1 Joe 500 5
## 6 Sam 500 5
## 5 Sue 550 7
## 4 Tom 750 8
This shows the ties in rank based on $score. If you don't want ties in $rank, then you might as well say dat1$rank <- 1:nrow(dat1) since they are already in order.
You can also use arrange and mutate from dplyr:
library(dplyr)
dat <- arrange(dat, desc(score)) %>%
mutate(rank = 1:nrow(dat))
dat
You can use:
dat$Rank <- rank(dat$score)
dat$Rank
you could do:
dat$rank <- order(order.scores)
dat$rank
#[1] 5 3 4 8 7 6 1 2
For the given dataframe dat:
fname score
Joe 500
Bob 490
Bill 500
Tom 750
Sue 550
Sam 500
Jane 210
Ruby 320
We can also use dplyr as below; it assigns rank 1 to the smallest value, which is 210 in this case.
library(dplyr)
ranks = dat %>%
mutate(ranks = order(order(score)))
The output will be as below:
fname score ranks
Joe 500 4
Bob 490 3
Bill 500 5
Tom 750 8
Sue 550 7
Sam 500 6
Jane 210 1
Ruby 320 2
If the converse is required, i.e., rank 1 should be assigned to the highest value which is 750 in this case, then the code will be changed slightly as below:
ranks = dat %>%
mutate(ranks = order(order(score, decreasing = T)))
The output in this case will be as below:
fname score ranks
Joe 500 3
Bob 490 6
Bill 500 4
Tom 750 1
Sue 550 2
Sam 500 5
Jane 210 8
Ruby 320 7
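The answers above differ mainly in how the three tied scores of 500 are handled. A short sketch of the ties.method argument of base rank() may make the difference concrete; the variable names are illustrative only.

```r
score <- c(500, 490, 500, 750, 550, 500, 210, 320)

# Default: tied values share the average of the positions they occupy (here 5)
r_average <- rank(score)

# "first": ties are broken by position in the vector, same as order(order(score))
r_first <- rank(score, ties.method = "first")

# "min": tied values all get the lowest position they occupy (here 4)
r_min <- rank(score, ties.method = "min")

# Rank 1 for the highest score instead: rank the negated values
r_desc <- rank(-score, ties.method = "first")
```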
More generally, a rank function orders the numeric values of a column from lowest to highest.
For example, if Salary is a column holding four- and five-digit salaries, ranking it gives each salary's position among the others.
Note that the following line is Python (pandas) rather than R, and with ascending = False it gives rank 1 to the highest salary:
df['Salary'].rank(ascending = False).astype(int)

Merging data frames and combining columns into one

I've got the following three dataframes:
df1 <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
age=c(31, 26, 54, 48),
height=c(180, 175, 160, 168),
group=c("Student",3,5,"Employer"), stringsAsFactors=FALSE)
df2 <- data.frame(name=c("Anne", "Christine"),
age=c(26, 54),
height=c(175, 160),
group=c(3,5),
group2=c("Teacher",6), stringsAsFactors=FALSE)
df3 <- data.frame(name=c("Christine"),
age=c(54),
height=c(160),
group=c(5),
group2=c(6),
group3=c("Scientist"), stringsAsFactors=FALSE)
I'd like to combine them so that I get the following result:
df.all <- data.frame(name=c("John", "Anne", "Christine", "Andy"),
age=c(31, 26, 54, 48),
height=c(180, 175, 160, 168),
group=c("Student", "Teacher", "Scientist", "Employer"))
At the moment I'm doing it this way:
df.all <- merge(merge(df1[,c(1,4)], df2[,c(1,5)], all=TRUE, by="name"),
df3[,c(1,6)], all=TRUE, by="name")
row.ind <- which(df.all$group %in% c(6,5))
df.all[row.ind, c("group")] <- df.all[row.ind, c("group2")]
row.ind2 <- which(df.all$group2 %in% c(6))
df.all[row.ind2, c("group")] <- df.all[row.ind2, c("group3")]
This isn't generalisable and it is really messy. Maybe there would be a way to use merge_all or merge_recurse for the merging step (especially as there might be more than two dataframes to be merged), but I haven't figured out how. These two don't produce the right result:
df.all <- merge_all(list(df1, df2, df3))
df.all <- merge_recurse(list(df1, df2, df3), by=c("name"))
Is there a more general and elegant way to solve this problem?
Here is another possible approach, if I understand what you're ultimately after. (It is not clear what the numeric values in the "group" columns are, so I'm not sure this is exactly what you're looking for.)
Use Reduce() to merge your multiple data.frames.
temp <- Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
names(temp)[4] <- "group1" # Rename "group" to "group1" for reshaping
temp
# name age height group1 group2 group3
# 1 Andy 48 168 Employer <NA> <NA>
# 2 Anne 26 175 3 Teacher <NA>
# 3 Christine 54 160 5 6 Scientist
# 4 John 31 180 Student <NA> <NA>
Use reshape() to reshape your data from wide to long.
df.all <- reshape(temp, direction = "long", idvar="name", varying=4:6, sep="")
df.all
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# Anne.1 Anne 26 175 1 3
# Christine.1 Christine 54 160 1 5
# John.1 John 31 180 1 Student
# Andy.2 Andy 48 168 2 <NA>
# Anne.2 Anne 26 175 2 Teacher
# Christine.2 Christine 54 160 2 6
# John.2 John 31 180 2 <NA>
# Andy.3 Andy 48 168 3 <NA>
# Anne.3 Anne 26 175 3 <NA>
# Christine.3 Christine 54 160 3 Scientist
# John.3 John 31 180 3 <NA>
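Take advantage of the fact that as.numeric() coerces non-numeric strings to NA (with a warning), so is.na(as.numeric(...)) is TRUE for the text labels as well as for the genuinely missing values; na.omit() then removes the rows with NA values.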
na.omit(df.all[is.na(as.numeric(df.all$group)), ])
# name age height time group
# Andy.1 Andy 48 168 1 Employer
# John.1 John 31 180 1 Student
# Anne.2 Anne 26 175 2 Teacher
# Christine.3 Christine 54 160 3 Scientist
Again, this might be over-generalizing your problem--there might be NA values in other columns, for example--but it might help direct you towards a solution to your problem.
First step is to use merge_recurse with all.x = TRUE:
library(reshape)
merge.all <- merge_recurse(list(df1, df2, df3), all.x = TRUE)
# name age height group group2 group3
# 1 Anne 26 175 3 Teacher <NA>
# 2 Christine 54 160 5 6 Scientist
# 3 John 31 180 Student <NA> <NA>
# 4 Andy 48 168 Employer <NA> <NA>
Then you can use apply to get the last non-NA group from all the "group" columns:
group.cols <- grep("group", colnames(merge.all))
merge.all <- data.frame(merge.all[-group.cols],
group = apply(merge.all[group.cols], 1,
function(x)tail(na.omit(x), 1)))
# name age height group
# 1 Anne 26 175 Teacher
# 2 Christine 54 160 Scientist
# 3 John 31 180 Student
# 4 Andy 48 168 Employer
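The two answers can also be combined into a base-R-only sketch: Reduce() with merge() for the joining step, then the "last non-NA group" idea for collapsing the columns. The helper last_non_na and the names merged and result are hypothetical, and the third data frame is taken to be df3, as the question's own merge code implies.

```r
df1 <- data.frame(name = c("John", "Anne", "Christine", "Andy"),
                  age = c(31, 26, 54, 48), height = c(180, 175, 160, 168),
                  group = c("Student", 3, 5, "Employer"), stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Anne", "Christine"), age = c(26, 54),
                  height = c(175, 160), group = c(3, 5),
                  group2 = c("Teacher", 6), stringsAsFactors = FALSE)
df3 <- data.frame(name = "Christine", age = 54, height = 160, group = 5,
                  group2 = 6, group3 = "Scientist", stringsAsFactors = FALSE)

# Merge any number of data frames on their shared columns, keeping all rows
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))

# Hypothetical helper: the last entry of a vector that is not NA
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  x[length(x)]
}

# Collapse group, group2, group3 into one column holding the last non-NA value
group_cols <- grep("^group", names(merged))
result <- data.frame(merged[-group_cols],
                     group = unname(apply(merged[group_cols], 1, last_non_na)),
                     stringsAsFactors = FALSE)
```

This assumes every person has at least one non-NA group value and that later group columns always take precedence over earlier ones.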
