Removing duplicates for each ID - r

Suppose that there are three variables in my data frame (mydata): 1) id, 2) case, and 3) value.
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), case=c("a","b","c","c","b","a","b","c","c","a","b","c","c","a","b","c","a"), value=c(1,34,56,23,34,546,34,67,23,65,23,65,23,87,34,321,87))
mydata
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 1 b 34
6 2 a 546
7 2 b 34
8 2 c 67
9 2 c 23
10 3 a 65
11 3 b 23
12 3 c 65
13 3 c 23
14 4 a 87
15 4 b 34
16 4 c 321
17 4 a 87
For each id, the same 'case' character can appear more than once, and the corresponding values may be the same or different. If the values are identical, I only need to keep one row and remove the duplicate.
My final data then would be
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 2 a 546
6 2 b 34
7 2 c 67
8 2 c 23
9 3 a 65
10 3 b 23
11 3 c 65
12 3 c 23
13 4 a 87
14 4 b 34
15 4 c 321

To add to the other answers, here's a dplyr approach:
library(dplyr)
mydata %>% group_by(id, case, value) %>% distinct()
Or
mydata %>% distinct(id, case, value)
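Note that if the data had additional columns you wanted to carry along, `.keep_all = TRUE` keeps the first row of each duplicate set. A minimal sketch (the `extra` column is made up for illustration, not part of the original data):

```r
library(dplyr)

mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                     case=c("a","b","c","c","b","a","b","c","c","a","b","c","c","a","b","c","a"),
                     value=c(1,34,56,23,34,546,34,67,23,65,23,65,23,87,34,321,87))

# 'extra' is a hypothetical additional column, distinct in every row
mydata$extra <- seq_len(nrow(mydata))

# Deduplicate on id/case/value only, keeping the first row's 'extra'
out <- mydata %>% distinct(id, case, value, .keep_all = TRUE)
nrow(out)   # 15: rows 5 and 17 were duplicates
```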

You could try duplicated
mydata[!duplicated(mydata[,c('id', 'case', 'value')]),]
# id case value
#1 1 a 1
#2 1 b 34
#3 1 c 56
#4 1 c 23
#6 2 a 546
#7 2 b 34
#8 2 c 67
#9 2 c 23
#10 3 a 65
#11 3 b 23
#12 3 c 65
#13 3 c 23
#14 4 a 87
#15 4 b 34
#16 4 c 321
Or use unique with by option from data.table
library(data.table)
set.seed(25)
mydata1 <- cbind(mydata, value1=rnorm(17))
DT <- as.data.table(mydata1)
unique(DT, by=c('id', 'case', 'value'))
# id case value value1
#1: 1 a 1 -0.21183360
#2: 1 b 34 -1.04159113
#3: 1 c 56 -1.15330756
#4: 1 c 23 0.32153150
#5: 2 a 546 -0.44553326
#6: 2 b 34 1.73404543
#7: 2 c 67 0.51129562
#8: 2 c 23 0.09964504
#9: 3 a 65 -0.05789111
#10: 3 b 23 -1.74278763
#11: 3 c 65 -1.32495298
#12: 3 c 23 -0.54793388
#13: 4 a 87 -1.45638428
#14: 4 b 34 0.08268682
#15: 4 c 321 0.92757895

Id, case and value only? Easy:
> mydata[!duplicated(mydata[,c("id","case","value")]),]
Even if you have a ton more variables in the dataset, they won't be considered by the duplicated() call.
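As a quick sketch of that point, add a hypothetical extra column and note that `duplicated()` still only compares the three columns it is given:

```r
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                     case=c("a","b","c","c","b","a","b","c","c","a","b","c","c","a","b","c","a"),
                     value=c(1,34,56,23,34,546,34,67,23,65,23,65,23,87,34,321,87))

# hypothetical extra column; every value is unique, so deduplicating on
# the whole data frame would remove nothing
mydata$note <- seq_len(nrow(mydata))

dup <- duplicated(mydata[, c("id", "case", "value")])
sum(dup)         # 2 -- rows 5 and 17 repeat an (id, case, value) triple
result <- mydata[!dup, ]
nrow(result)     # 15; 'note' is carried along but never compared
```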

Related

dplyr creating new column based on some condition [duplicate]

I have the following df:
df<-data.frame(geo_num=c(11,12,22,41,42,43,77,71),
cust_id=c("A","A","B","C","C","C","D","D"),
sales=c(2,3,2,1,2,4,6,3))
> df
geo_num cust_id sales
1 11 A 2
2 12 A 3
3 22 B 2
4 41 C 1
5 42 C 2
6 43 C 4
7 77 D 6
8 71 D 3
I need to create a new column 'geo_num_new' that, for each 'cust_id' group, contains the first value of 'geo_num', as shown below:
> df_new
geo_num cust_id sales geo_num_new
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
thanks.
We could use first after grouping by 'cust_id'. The single value will be recycled for the entire grouping
library(dplyr)
df <- df %>%
group_by(cust_id) %>%
mutate(geo_num_new = first(geo_num)) %>%
ungroup
Output:
df
# A tibble: 8 x 4
geo_num cust_id sales geo_num_new
<dbl> <chr> <dbl> <dbl>
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
Or use data.table
library(data.table)
setDT(df)[, geo_num_new := first(geo_num), by = cust_id]
or with base R
df$geo_num_new <- with(df, ave(geo_num, cust_id, FUN = function(x) x[1]))
Or an option with collapse
library(collapse)
tfm(df, geo_num_new = ffirst(geo_num, g = cust_id, TRA = "replace"))
geo_num cust_id sales geo_num_new
1 11 A 2 11
2 12 A 3 11
3 22 B 2 22
4 41 C 1 41
5 42 C 2 41
6 43 C 4 41
7 77 D 6 77
8 71 D 3 77
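For completeness, a base R one-liner using `match()` gives the same result: `match(x, x)` returns, for each element, the index of its first occurrence, so indexing `geo_num` by it broadcasts each group's first value. A sketch:

```r
df <- data.frame(geo_num=c(11,12,22,41,42,43,77,71),
                 cust_id=c("A","A","B","C","C","C","D","D"),
                 sales=c(2,3,2,1,2,4,6,3))

# match() maps every cust_id to the row of its first occurrence
df$geo_num_new <- df$geo_num[match(df$cust_id, df$cust_id)]
df$geo_num_new   # 11 11 22 41 41 41 77 77
```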

add values of one group into another group in R

I have a question about how to add the value from one row of a group to the rest of the rows in that group, and then delete that row. For example:
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=rep(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99), 2),
Day=rep(c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1), 2),
value=rep(c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10), 2))
In the above example, the data is grouped by Year, Cluster, Seed and Day. Each seed=99 value needs to be added to the other rows that share the same (Year, Cluster, Day), and the seed=99 row then deleted. For example, row 16 belongs to the (Year=1, Cluster=a, Day=1, Seed=99) group, so its value of 55 should be added to row 1 (5+55), row 6 (6+55) and row 11 (2+55), and row 16 should then be deleted. Row 21, however, is in Cluster=c with seed=99 and has no matching Year+Cluster+Day combination, so it should remain in the data as is.
My actual data has about 1 million records, with 10 years, 80 clusters, 500 days and 10+1 seeds (1 to 10, plus 99), so I'm looking for an efficient solution. The expected output is:
Year Cluster Seed Day value
1 1 a 1 1 60
2 1 a 1 2 68
3 1 a 1 3 78
4 1 a 1 4 90
5 1 a 1 5 107
6 1 a 2 1 61
7 1 a 2 2 73
8 1 a 2 3 86
9 1 a 2 4 91
10 1 a 2 5 104
11 1 a 3 1 57
12 1 a 3 2 67
13 1 a 3 3 79
14 1 a 3 4 96
15 1 a 3 5 105
16 1 c 99 1 10
17 2 b 1 1 60
18 2 b 1 2 68
19 2 b 1 3 78
20 2 b 1 4 90
21 2 b 1 5 107
22 2 b 2 1 61
23 2 b 2 2 73
24 2 b 2 3 86
25 2 b 2 4 91
26 2 b 2 5 104
27 2 b 3 1 57
28 2 b 3 2 67
29 2 b 3 3 79
30 2 b 3 4 96
31 2 b 3 5 105
32 2 d 99 1 10
A data.table approach:
library(data.table)
df <- setDT(df)[, `:=` (value = ifelse(Seed != 99, value + value[Seed == 99], value),
                        flag = Seed == 99 & .N == 1),
                by = .(Year, Cluster, Day)][
                  !(Seed == 99 & flag == FALSE), ][, "flag" := NULL]
Output:
df[]
Year Cluster Seed Day value
1: 1 a 1 1 60
2: 1 a 1 2 68
3: 1 a 1 3 78
4: 1 a 1 4 90
5: 1 a 1 5 107
6: 1 a 2 1 61
7: 1 a 2 2 73
8: 1 a 2 3 86
9: 1 a 2 4 91
10: 1 a 2 5 104
11: 1 a 3 1 57
12: 1 a 3 2 67
13: 1 a 3 3 79
14: 1 a 3 4 96
15: 1 a 3 5 105
16: 1 c 99 1 10
17: 2 b 1 1 60
18: 2 b 1 2 68
19: 2 b 1 3 78
20: 2 b 1 4 90
21: 2 b 1 5 107
22: 2 b 2 1 61
23: 2 b 2 2 73
24: 2 b 2 3 86
25: 2 b 2 4 91
26: 2 b 2 5 104
27: 2 b 3 1 57
28: 2 b 3 2 67
29: 2 b 3 3 79
30: 2 b 3 4 96
31: 2 b 3 5 105
32: 2 d 99 1 10
Here's an approach using the tidyverse. If you're looking for speed with a million rows, a data.table solution will probably perform better.
library(tidyverse)
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=rep(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99), 2),
Day=rep(c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1), 2),
value=rep(c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10), 2))
seeds <- df %>%
filter(Seed == 99)
matches <- df %>%
filter(Seed != 99) %>%
inner_join(select(seeds, -Seed), by = c("Year", "Cluster", "Day")) %>%
mutate(value = value.x + value.y) %>%
select(Year, Cluster, Seed, Day, value)
no_matches <- anti_join(seeds, matches, by = c("Year", "Cluster", "Day"))
bind_rows(matches, no_matches) %>%
arrange(Year, Cluster, Seed, Day)
#> Year Cluster Seed Day value
#> 1 1 a 1 1 60
#> 2 1 a 1 2 68
#> 3 1 a 1 3 78
#> 4 1 a 1 4 90
#> 5 1 a 1 5 107
#> 6 1 a 2 1 61
#> 7 1 a 2 2 73
#> 8 1 a 2 3 86
#> 9 1 a 2 4 91
#> 10 1 a 2 5 104
#> 11 1 a 3 1 57
#> 12 1 a 3 2 67
#> 13 1 a 3 3 79
#> 14 1 a 3 4 96
#> 15 1 a 3 5 105
#> 16 1 c 99 1 10
#> 17 2 b 1 1 60
#> 18 2 b 1 2 68
#> 19 2 b 1 3 78
#> 20 2 b 1 4 90
#> 21 2 b 1 5 107
#> 22 2 b 2 1 61
#> 23 2 b 2 2 73
#> 24 2 b 2 3 86
#> 25 2 b 2 4 91
#> 26 2 b 2 5 104
#> 27 2 b 3 1 57
#> 28 2 b 3 2 67
#> 29 2 b 3 3 79
#> 30 2 b 3 4 96
#> 31 2 b 3 5 105
#> 32 2 d 99 1 10
Created on 2018-11-23 by the reprex package (v0.2.1)

Distinct in r within groups of data

How do I transform a dataframe (on the left) to dataframe (on the right)?
I am trying to do this via dplyr, by grouping on name and then calling distinct, but it gives only 3 rows:
df %>%
group_by(name) %>%
distinct(., .keep_all = TRUE) %>%
View()
There is a simple way to access all the cells you want to change:
data <- data.frame(name = c(rep("A", 5), rep("B", 5), rep("C", 5)), subject = c(rep(1:5, 3)), marks = sample(1:100, 15))
> data
name subject marks
1 A 1 31
2 A 2 12
3 A 3 29
4 A 4 67
5 A 5 99
6 B 1 77
7 B 2 3
8 B 3 92
9 B 4 69
10 B 5 42
11 C 1 52
12 C 2 66
13 C 3 98
14 C 4 23
15 C 5 72
duplicated(data$name) accesses the relevant cells. But R has no way to leave a cell "blank", so to speak.
You can either set them to NA, or fill them with an empty string:
data$name[duplicated(data$name)] <- NA
> data
name subject marks
1 A 1 31
2 <NA> 2 12
3 <NA> 3 29
4 <NA> 4 67
5 <NA> 5 99
6 B 1 77
7 <NA> 2 3
8 <NA> 3 92
9 <NA> 4 69
10 <NA> 5 42
11 C 1 52
12 <NA> 2 66
13 <NA> 3 98
14 <NA> 4 23
15 <NA> 5 72
data$name <- as.character(data$name)
data$name[duplicated(data$name)] <- ""
> data
name subject marks
1 A 1 30
2 2 52
3 3 5
4 4 48
5 5 99
6 B 1 14
7 2 20
8 3 34
9 4 55
10 5 53
11 C 1 38
12 2 27
13 3 67
14 4 12
15 5 77
To use the latter solution with a factor variable, you need to add "" as a factor label:
data$name <- factor(as.numeric(data$name), 1:4, c(levels(data$name), ""))
data$name[duplicated(data$name)] <- ""

Permutations and Decision Trees with R

I was wondering if there is a way to produce a decision tree that enumerates the permutations of selecting n objects from k classes. We have the set A={1,2,...,10}, and the subsets B={1,2,...,5}, C={6,7} and D={8,9,10}. The total number of permutations can be calculated by
x <- factorial(10)/(factorial(5)*factorial(2)*factorial(3))
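As a sanity check, the same multinomial count can be built from successive binomial coefficients: choose 5 of the 10 positions for B, then 2 of the remaining 5 for C, with D filling the last 3.

```r
# multinomial coefficient 10! / (5! * 2! * 3!)
x <- factorial(10) / (factorial(5) * factorial(2) * factorial(3))

# same count as a product of binomial coefficients
y <- choose(10, 5) * choose(5, 2) * choose(3, 3)

x   # 2520
y   # 2520
```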
I would like to produce a decision tree similar to an edge list, as the following:
1 2 5 B
1 3 2 C
1 4 3 D
2 5 4 B
2 6 2 C
2 7 3 D
3 8 5 B
3 9 1 C
3 10 3 D
4 11 5 B
4 12 2 C
4 13 2 D
5 14 3 B
5 15 2 C
5 16 3 D
6 17 4 B
6 18 1 C
6 19 3 D
7 20 4 B
7 21 2 C
7 22 2 D
8 23 4 B
8 24 1 C
8 25 3 D
9 26 5 B
9 27 3 D
10 28 5 B
10 29 1 C
10 30 2 D
11 31 4 B
11 32 2 C
11 33 2 D
12 34 5 B
12 35 1 C
12 36 2 D
13 37 5 B
13 38 2 C
13 39 1 D
14 40 2 B
14 41 2 C
14 42 3 D
15 43 3 B
15 44 1 C
15 45 3 D
16 46 3 B
16 47 2 C
16 48 2 D
17 49 3 B
17 50 1 C
17 51 3 D
18 52 4 B
18 53 3 D
19 54 4 B
19 55 1 C
19 56 2 D
20 57 3 B
20 58 2 C
20 59 2 D
21 60 4 B
21 61 1 C
21 62 2 D
22 63 4 B
22 64 2 C
22 65 1 D
23 66 3 B
23 67 1 C
23 68 3 D
24 69 4 B
24 70 3 D
25 71 4 B
25 72 1 C
25 73 2 D
26 74 4 B
26 75 3 D
27 76 5 B
27 77 2 D
28 78 4 B
28 79 1 C
28 80 2 D
29 81 5 B
29 82 2 D
30 83 5 B
30 84 1 C
30 85 1 D
31 86 3 B
31 87 2 C
31 88 2 D
32 89 4 B
32 90 1 C
32 91 2 D
33 92 4 B
33 93 2 C
33 94 1 D
34 95 4 B
34 96 1 C
34 97 2 D
. . . .
. . . .
. . . .
The first two columns form the edge list, the third column is the number of elements remaining in each subset (decreasing along each branch), and the fourth column is the subset name.
Once the edge list is computed, I'm thinking of plotting the graph with this command:
plot(g, layout = layout.reingold.tilford(g, root = 1))

Distinguishing the levels of a factor variable in R

Let's say my data set contains three columns: id (identification), case (character), and value(numeric). This is my dataset:
tdata <- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), case=c("a","b","c","c","a","b","c","c","a","b","c","c","a","b","c","c"), value=c(1,34,56,23,546,34,67,23,65,23,65,23,87,34,321,56))
tdata
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 2 a 546
6 2 b 34
7 2 c 67
8 2 c 23
9 3 a 65
10 3 b 23
11 3 c 65
12 3 c 23
13 4 a 87
14 4 b 34
15 4 c 321
16 4 c 56
If you notice, for each id we have two c's. How can I rename them c1 and c2? (I need to distinguish between them for further analysis.)
How about:
within(tdata, case <- ave(as.character(case), id, FUN=make.unique))
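Note that `make.unique` leaves the first occurrence untouched and appends `.1`, `.2`, ... to the repeats, so within each id the two c's become "c" and "c.1" rather than "c1" and "c2". A quick sketch of its behavior on a bare vector:

```r
# first occurrence is unchanged; later duplicates get ".1", ".2", ...
make.unique(c("a", "b", "c", "c"))
# [1] "a"   "b"   "c"   "c.1"
```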
How about this slightly modified approach:
library(dplyr)
tdata %>%
  group_by(id, case) %>%
  mutate(caseNo = paste0(case, row_number())) %>%
  ungroup() %>%
  select(-case)
#Source: local data frame [16 x 3]
#
# id value caseNo
#1 1 1 a1
#2 1 34 b1
#3 1 56 c1
#4 1 23 c2
#5 2 546 a1
#6 2 34 b1
#7 2 67 c1
#8 2 23 c2
#9 3 65 a1
#10 3 23 b1
#11 3 65 c1
#12 3 23 c2
#13 4 87 a1
#14 4 34 b1
#15 4 321 c1
#16 4 56 c2
I would suggest that rather than replacing the values in the "case" column, you just add a secondary "ID" column. This is easily done with getanID from my "splitstackshape" package.
library(splitstackshape)
getanID(tdata, c("id", "case"))[]
# id case value .id
# 1: 1 a 1 1
# 2: 1 b 34 1
# 3: 1 c 56 1
# 4: 1 c 23 2
# 5: 2 a 546 1
# 6: 2 b 34 1
# 7: 2 c 67 1
# 8: 2 c 23 2
# 9: 3 a 65 1
# 10: 3 b 23 1
# 11: 3 c 65 1
# 12: 3 c 23 2
# 13: 4 a 87 1
# 14: 4 b 34 1
# 15: 4 c 321 1
# 16: 4 c 56 2
The [] may or may not be required depending on which version of "data.table" you have installed.
If you really did want to collapse those columns, you could also do:
getanID(tdata, c("id", "case"))[, case := paste0(case, .id)][, .id := NULL][]
# id case value
# 1: 1 a1 1
# 2: 1 b1 34
# 3: 1 c1 56
# 4: 1 c2 23
# 5: 2 a1 546
# 6: 2 b1 34
# 7: 2 c1 67
# 8: 2 c2 23
# 9: 3 a1 65
# 10: 3 b1 23
# 11: 3 c1 65
# 12: 3 c2 23
# 13: 4 a1 87
# 14: 4 b1 34
# 15: 4 c1 321
# 16: 4 c2 56
