Removing duplicated data based on each group using R

I have a dataset containing employee id, name, and bank account information. Some employees share a name, either with the same employee id or with different employee ids. A few of these same-named employees also share the same bank account number, while others have different bank account numbers under the same name. The aim is to find the employees who have the same name but different bank account numbers. Here's a sample of the data:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 123 | Joan | 6758 |
| 134 | Karyn | 1244 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
| 235 | Larry | 5201 |
| 433 | Larry | 5201 |
| 231 | Larry | 5201 |
| 120 | Amy | 7890 |
| 135 | Amy | 7890 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
I had to find the employees who were duplicated based on their names, which I could do successfully. Once that was done, I had to identify the employees with the same name but different bank account numbers. Right now the issue is that my code is not grouping the employees by name and then searching for different bank accounts within each group. Instead, it compares account numbers across different individuals, and if it finds a match it removes one of the duplicate rows. For example, Chris and Cassy have the same bank account number '1280', so it identifies them as duplicates and automatically removes one of Chris's records (bank account no. 1280 is missing from the output). The output that I'm getting is shown below:
| Emp_id | Name | Bank Account |
|--------|:-----:|-------------:|
| 120 | Amy | 7890 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
This is the code that I have used:
library(dplyr)
sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873")
)
# Count how often each name occurs and keep the names appearing more than once
n_occur <- data.frame(table(sample$Name))
n_occur <- n_occur[n_occur$Freq > 1, ]
Duplicates <- sample[sample$Name %in% n_occur$Var1, ]
Duplicates <- Duplicates %>% arrange(Name)
# Problem line: duplicated() scans Bank_Account across the whole table, not within each name
Duplicates <- Duplicates[!duplicated(Duplicates$Bank_Account), ]
The actual output, however, should have considered the bank account numbers within each name (same name). The output should look something like this:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
Can someone please direct me towards the right code?

We can use n_distinct to filter:
library(dplyr)
sample %>%
  group_by(Name) %>%
  filter(n() > 1) %>%                       # keep names that occur more than once
  group_by(Id, add = TRUE) %>%              # group by Name and Id together (.add = TRUE in dplyr >= 1.0)
  filter(n_distinct(Bank_Account) > 1) %>%  # keep ids with more than one distinct account
  arrange(desc(Id))
# A tibble: 6 x 3
# Groups: Name, Id [3]
# Id Name Bank_Account
# <fct> <fct> <fct>
#1 900 Cassy 1280
#2 900 Cassy 9873
#3 150 Chris 1280
#4 150 Chris 6565
#5 143 Larry 4900
#6 143 Larry 5201

Step 1 - Identifying duplicate names:
step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup == TRUE])))
Step 2 - Identifying duplicate accounts for these names:
step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(dup == FALSE)
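The same two-step idea can be written more compactly with distinct(), which keeps the first row for each combination of the listed columns; a sketch of an equivalent, not part of the original answer:
step_2 <- step_1 %>%
  distinct(Name, Bank_Account, .keep_all = TRUE)  # one row per name/account pair, all columns kept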

Related

How to join dataframes by ID column?

I have 3 dataframes like the ones below, with IDs that may not necessarily occur in all of them.
DF1:
| ID | Name | Phone# | Country | State | Amount_month1 |
|------|---------|------------|---------|-------|---------------|
| 0210 | John K. | 8942829725 | USA | PA | 1300 |
| 0215 | Peter | 8711234566 | USA | KS | 50 |
| 2312 | Steven | 9012341221 | USA | TX | 1000 |
| 0005 | Haris | 9167456363 | USA | NY | 1200 |
DF2:
| ID | Name | Phone# | Country | State | Amount_month2 |
|------|---------|------------|---------|-------|---------------|
| 0210 | John K. | 8942829725 | USA | PA | 200 |
| 2312 | Steven | 9012341221 | USA | TX | 350 |
| 2112 | Jerry | 9817273794 | USA | CA | 100 |
DF3:
| ID | Name | Phone# | Country | State | Amount_month3 |
|------|---------|------------|---------|-------|---------------|
| 0210 | John K. | 8942829725 | USA | PA | 300 |
| 0005 | Haris | 9167456363 | USA | NY | 1250 |
| 1212 | Jerry | 9817273794 | USA | CA | 1200 |
| 1210 | Drew | 8012341234 | USA | TX | 1400 |
I would like to join these 3 dataframes by ID and add the varying amount columns as separate columns; the missing amount values can be either 0 or NA, such as:
| ID | Name | Phone# | Country | State | Amount_month1 | Amount_month2 | Amount_month3 |
|------|---------|------------|---------|-------|---------------|---------------|---------------|
| 0210 | John K. | 8942829725 | USA | PA | 1300 | 200 | 300 |
| 0215 | Peter | 8711234566 | USA | KS | 50 | 0 | 0 |
| 2312 | Steven | 9012341221 | USA | TX | 1000 | 350 | 0 |
| 0005 | Haris | 9167456363 | USA | NY | 1200 | 0 | 1250 |
| 1212 | Jerry | 9817273794 | USA | CA | 0 | 100 | 1200 |
| 1210 | Drew | 8012341234 | USA | TX | 0 | 0 | 1400 |
It can be done in a single line using Reduce and merge:
Reduce(function(x, y) merge(x, y, all=TRUE), list(DF1, DF2, DF3))
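The question allows either NA or 0 for the missing amounts; merge with all = TRUE produces NA. If you prefer 0, the amount columns can be filled in afterwards. A minimal sketch, assuming the merged result is stored in merged:
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(DF1, DF2, DF3))
amount_cols <- grep("^Amount_month", names(merged))   # positions of the three amount columns
merged[amount_cols][is.na(merged[amount_cols])] <- 0  # replace NA with 0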
You can use left_join from the dplyr package, first joining the first two dfs, then joining that result with the third df:
library(dplyr)
df_12 <- left_join(df1,df2, by = "ID")
df_123 <- left_join(df_12, df3, by = "ID")
Result:
df_123
ID Amount_month1 Amount_month2 Amount_month3
1 1 100 NA NA
2 2 200 50 NA
3 3 300 177 666
4 4 400 NA 77
Mock data:
df1 <- data.frame(
  ID = as.character(1:4),
  Amount_month1 = c(100, 200, 300, 400)
)
df2 <- data.frame(
  ID = as.character(2:3),
  Amount_month2 = c(50, 177)
)
df3 <- data.frame(
  ID = as.character(3:4),
  Amount_month3 = c(666, 77)
)
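One caveat, not in the original answer: left_join keeps only the IDs present in df1, which works for this mock data but would drop Jerry and Drew from the question's data. A sketch of a full_join chain that keeps IDs appearing in any of the frames:
library(dplyr)
# fold full_join over the list so no ID is lost
df_123 <- Reduce(function(x, y) full_join(x, y, by = "ID"), list(df1, df2, df3))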

Select max value in one column for every value in the other column

I have a dataframe competition with columns branch, phone and sales:
| branch | phone | sales|
|----------|---------|------|
| 123 | milky | 654 |
| 456 | lemon | 342 |
| 789 | blue | 966 |
| 456 | blue | 100 |
| 456 | milky | 234 |
| 123 | lemon | 874 |
| 789 | milky | 234 |
| 123 | blue | 332 |
| 789 | lemon | 865 |
I want to show the highest sales figure for every phone. The output should be a dataframe winners that looks like this:
| branch | phone | sales|
|----------|---------|------|
| 123 | milky | 654 |
| 789 | blue | 966 |
| 123 | lemon | 874 |
I tried ordering the dataframe by sales first, and then keeping only the top 3 rows:
competition <- competition[order(competition$sales, decreasing = TRUE ),]
winners <- head(competition, 3)
But the output shows the lemon phone two times, with 874 and 865 sales.
aggregate(sales ~ phone, competition, max)
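Note that aggregate() returns only the phone and sales columns, so the winning branch is lost. If branch is needed too, as in the expected output, one option (a sketch, not part of the original answer) is to merge the maxima back, or to use dplyr's slice_max (dplyr >= 1.0):
# base R: merge the per-phone maxima back to recover branch
maxima <- aggregate(sales ~ phone, competition, max)
winners <- merge(competition, maxima, by = c("phone", "sales"))

# dplyr alternative
library(dplyr)
winners <- competition %>%
  group_by(phone) %>%
  slice_max(sales, n = 1) %>%
  ungroup()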

R - Join two dataframes based on date difference

Let's consider two dataframes, df1 and df2. I would like to join the dataframes based on the date difference only. For example:
Dataframe 1 (df1):
| version_id | date_invoiced | product_id |
|------------|---------------|------------|
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2 (df2):
| validfrom | pricelist | pricelist_id |
|------------|-----------|--------------|
| 02-07-2020 | 10 | 101 |
| 01-07-2020 | 20 | 102 |
| 29-06-2020 | 30 | 103 |
| 28-07-2020 | 10 | 104 |
| 25-07-2020 | 5 | 105 |
I need to map the pricelist_id and the pricelist based on the validfrom column in df2: each row of df1 should be mapped to the df2 row with the least difference between date_invoiced (df1) and validfrom (df2).
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
|------------|---------------|------------|-----------|--------------|-----------|
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
The mapping should be based purely on the smallest difference: each date_invoiced in df1 should always be matched to the closest validfrom in df2. Thanks
Perhaps you might want to try using data.table and a rolling join with roll = 'nearest'. Here, the join is made on a common DATE column, built from date_invoiced in df1 and validfrom in df2.
library(data.table)
setDT(df1)
setDT(df2)

# parse the dates (%Y for the 4-digit year)
df1$date_invoiced <- as.Date(df1$date_invoiced, format = "%d-%m-%Y")
df2$validfrom <- as.Date(df2$validfrom, format = "%d-%m-%Y")

# key both tables on a shared DATE column
setkey(df1, date_invoiced)[, DATE := date_invoiced]
setkey(df2, validfrom)[, DATE := validfrom]

df2[df1, on = "DATE", roll = "nearest"]
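The expected output also has a date_diff column. Assuming the join result is stored in res, it can be added afterwards; a minimal sketch:
res <- df2[df1, on = "DATE", roll = "nearest"]
res[, date_diff := as.integer(abs(date_invoiced - validfrom))]  # days between the matched dates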

Combine dplyr mutate function with a search through the whole table

I'm quite new to R and especially to the tidyverse. I'm trying to write a script with which we can rewrite a list of taxa. We already have one that uses quite a lot of for and if loops, and I want to try to simplify it with the tidyverse, but I'm kind of stuck on how to do that.
What I have is a table that looks something like this (really simplified):
library(tibble)
taxon_file <- tibble(
  name = c("cockroach", "cockroach2", "grasshopper", "spider", "lobster", "insect", "crustacea", "arachnid"),
  Id = c(445, 448, 446, 778, 543, 200, 400, 300),
  parent_ID = c(200, 200, 200, 300, 400, 200, 400, 300),
  rank = c("genus", "genus", "genus", "genus", "genus", "order", "order", "order")
)
+-------------+-----+-----------+----------+
| name | Id | parent_ID | rank |
+=============+=====+===========+==========+
| cockroach | 445 | 200 | genus |
| cockroach2 | 448 | 200 | genus |
| grasshopper | 446 | 200 | genus |
| spider | 778 | 300 | genus |
| lobster | 543 | 400 | genus |
| insect | 200 | 200 | order |
| crustacea | 400 | 400 | order |
| arachnid | 300 | 300 | order |
+-------------+-----+-----------+----------+
Now I want to rearrange it so that I get a new column where I can add the order name that matches the parent_ID (so when parent_ID == Id, write that row's name in the column order). The end result should look kinda like this:
+-------------+------------+------+-----------+
| name | order | Id | parent_ID |
+=============+============+======+===========+
| cockroach | insect | 445 | 200 |
| cockroach2 | insect | 448 | 200 |
| grasshopper | insect | 446 | 200 |
| spider | arachnid | 778 | 300 |
| lobster | crustacea | 543 | 400 |
+-------------+------------+------+-----------+
I tried to combine mutate with an ifelse statement, but this just fills the whole order column with NAs. The tibble is named taxon_list:
taxon_list %>%
  mutate(order = ifelse(parent_ID == Id, name, NA))
I know this will not work because it doesn't search the whole dataset for the correct row (that's what I did before with all the for loops). Maybe someone can point me in the right direction?
One way is to filter each rank type into 2 separate dfs, subset using select, and merge the 2:
library(tidyverse)

df <- tibble(
  name = c("cockroach", "cockroach2", "grasshopper", "spider", "lobster", "insect", "crustacea", "arachnid"),
  Id = c(445, 448, 446, 778, 543, 200, 400, 300),
  parent_ID = c(200, 200, 200, 300, 400, 200, 400, 300),
  rank = c("genus", "genus", "genus", "genus", "genus", "order", "order", "order")
)

df_order <- df %>%
  filter(rank == "order") %>%
  select(order = name, parent_ID)

df_genus <- df %>%
  filter(rank == "genus") %>%
  select(name, Id, parent_ID) %>%
  merge(df_order, by = "parent_ID")
Result:
parent_ID name Id order
1 200 cockroach 445 insect
2 200 cockroach2 448 insect
3 200 grasshopper 446 insect
4 300 spider 778 arachnid
5 400 lobster 543 crustacea
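Since the question asks specifically for a mutate that searches the whole table, the same result can also be reached with match(), which looks up each parent_ID in the full Id column before filtering; a sketch, not part of the original answer:
df %>%
  mutate(order = name[match(parent_ID, Id)]) %>%  # look up each parent's name across the whole table
  filter(rank == "genus") %>%
  select(name, order, Id, parent_ID)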

copy command in cassandra execution order

I am copying a csv file into Cassandra. I have the below csv file, and the table is created as below:
CREATE TABLE UCBAdmissions(
id int PRIMARY KEY,
admit text,
dept text,
freq int,
gender text
)
When I use
copy UCBAdmissions from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.318 seconds.
cqlsh> select * from UCBAdmissions;
id | admit | dept | freq | gender
----+-------+------+------+--------
(0 rows)
When I instead list the columns explicitly,
copy UCBAdmissions(id, admit, gender, dept, freq) from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.364 seconds.
cqlsh> select * from UCBAdmissions;
id | admit | dept | freq | gender
----+----------+------+------+--------
23 | Admitted | F | 24 | Female
5 | Admitted | B | 353 | Male
10 | Rejected | C | 205 | Male
16 | Rejected | D | 244 | Female
13 | Admitted | D | 138 | Male
11 | Admitted | C | 202 | Female
1 | Admitted | A | 512 | Male
19 | Admitted | E | 94 | Female
8 | Rejected | B | 8 | Female
2 | Rejected | A | 313 | Male
4 | Rejected | A | 19 | Female
18 | Rejected | E | 138 | Male
15 | Admitted | D | 131 | Female
22 | Rejected | F | 351 | Male
20 | Rejected | E | 299 | Female
7 | Admitted | B | 17 | Female
6 | Rejected | B | 207 | Male
9 | Admitted | C | 120 | Male
14 | Rejected | D | 279 | Male
21 | Admitted | F | 22 | Male
17 | Admitted | E | 53 | Male
24 | Rejected | F | 317 | Female
12 | Rejected | C | 391 | Female
3 | Admitted | A | 89 | Female
UCBAdmissions.csv
"","Admit","Gender","Dept","Freq"
"1","Admitted","Male","A",512
"2","Rejected","Male","A",313
"3","Admitted","Female","A",89
"4","Rejected","Female","A",19
"5","Admitted","Male","B",353
"6","Rejected","Male","B",207
"7","Admitted","Female","B",17
"8","Rejected","Female","B",8
"9","Admitted","Male","C",120
"10","Rejected","Male","C",205
"11","Admitted","Female","C",202
"12","Rejected","Female","C",391
"13","Admitted","Male","D",138
"14","Rejected","Male","D",279
"15","Admitted","Female","D",131
"16","Rejected","Female","D",244
"17","Admitted","Male","E",53
"18","Rejected","Male","E",138
"19","Admitted","Female","E",94
"20","Rejected","Female","E",299
"21","Admitted","Male","F",22
"22","Rejected","Male","F",351
"23","Admitted","Female","F",24
"24","Rejected","Female","F",317
I see the output order differs from the csv file, as seen above.
Question: What is the difference between the first and the second COPY command? Should we follow the same column order as the csv file when creating the table in Cassandra?
Cassandra is designed to be distributed. To accomplish this, it takes the partition key of your table (id) and hashes it with the cluster's partitioner (probably Murmur3Partitioner) to produce an integer (actually a long), and then uses that integer to assign the row to a node in the ring.
What you're seeing are the results ordered by the resulting token, which is non-intuitive, but not necessarily wrong. There is no straightforward way to do a SELECT * FROM table ORDER BY primaryKey ASC in Cassandra; the distributed nature makes that difficult to do effectively.
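To see the ordering Cassandra actually uses, you can ask for the partition token directly in cqlsh; a minimal sketch:
SELECT token(id), id, admit FROM UCBAdmissions;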
