Distinguishing the levels of a factor variable in R

Let's say my data set contains three columns: id (identification), case (character), and value (numeric). This is my data set:
tdata <- data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
                    case = c("a","b","c","c","a","b","c","c","a","b","c","c","a","b","c","c"),
                    value = c(1,34,56,23,546,34,67,23,65,23,65,23,87,34,321,56))
tdata
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 2 a 546
6 2 b 34
7 2 c 67
8 2 c 23
9 3 a 65
10 3 b 23
11 3 c 65
12 3 c 23
13 4 a 87
14 4 b 34
15 4 c 321
16 4 c 56
If you notice, each id has two c's. How can I rename them c1 and c2? (I need to distinguish between them for further analysis.)

How about:
within(tdata, case <- ave(as.character(case), id, FUN=make.unique))
Note that make.unique marks duplicates with a ".1" suffix, so within each id this gives "c" and "c.1" rather than literal c1 and c2.
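If you want the literal c1/c2 labels (and a1, b1 on the single-occurrence rows, matching the dplyr output below), a base-R sketch in the same spirit could number the rows within each id/case group; the ave()/seq_along counter here is my own suggestion rather than part of the original answer:
within(tdata, case <- paste0(case, ave(seq_along(case), id, case, FUN = seq_along)))  # counter restarts per id/case group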

How about this slightly modified approach:
library(dplyr)
tdata %>%
  group_by(id, case) %>%
  mutate(caseNo = paste0(case, row_number())) %>%
  ungroup() %>%
  select(-case)
#Source: local data frame [16 x 3]
#
# id value caseNo
#1 1 1 a1
#2 1 34 b1
#3 1 56 c1
#4 1 23 c2
#5 2 546 a1
#6 2 34 b1
#7 2 67 c1
#8 2 23 c2
#9 3 65 a1
#10 3 23 b1
#11 3 65 c1
#12 3 23 c2
#13 4 87 a1
#14 4 34 b1
#15 4 321 c1
#16 4 56 c2

I would suggest that rather than replacing the values in the "case" column, you just add a secondary "ID" column. This is easily done with getanID from my "splitstackshape" package.
library(splitstackshape)
getanID(tdata, c("id", "case"))[]
# id case value .id
# 1: 1 a 1 1
# 2: 1 b 34 1
# 3: 1 c 56 1
# 4: 1 c 23 2
# 5: 2 a 546 1
# 6: 2 b 34 1
# 7: 2 c 67 1
# 8: 2 c 23 2
# 9: 3 a 65 1
# 10: 3 b 23 1
# 11: 3 c 65 1
# 12: 3 c 23 2
# 13: 4 a 87 1
# 14: 4 b 34 1
# 15: 4 c 321 1
# 16: 4 c 56 2
The [] may or may not be required depending on which version of "data.table" you have installed.
If you really did want to collapse those columns, you could also do:
getanID(tdata, c("id", "case"))[, case := paste0(case, .id)][, .id := NULL][]
# id case value
# 1: 1 a1 1
# 2: 1 b1 34
# 3: 1 c1 56
# 4: 1 c2 23
# 5: 2 a1 546
# 6: 2 b1 34
# 7: 2 c1 67
# 8: 2 c2 23
# 9: 3 a1 65
# 10: 3 b1 23
# 11: 3 c1 65
# 12: 3 c2 23
# 13: 4 a1 87
# 14: 4 b1 34
# 15: 4 c1 321
# 16: 4 c2 56
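If you would rather avoid an extra package, a plain data.table sketch of the same counter idea (my own rough equivalent of getanID, not from the original answer) would be:
library(data.table)
setDT(tdata)[, .id := seq_len(.N), by = .(id, case)][]  # .id counts occurrences within each id/case group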

Related

Merging two dataframes by keeping certain column values in R

I have two dataframes that I need to merge. The second one is missing values in certain columns and it also has some additional ids. Here is what the sample datasets look like:
df1 <- data.frame(id = c(1,2,3,4,5,6),
item = c(11,22,33,44,55,66),
score = c(1,0,1,1,1,0),
cat.a = c("A","B","C","D","E","F"),
cat.b = c("a","a","b","b","c","f"))
> df1
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
df2 <- data.frame(id = c(1,2,3,4,5,6,7,8),
item = c(11,22,33,44,55,66,77,88),
score = c(1,0,1,1,1,0,1,1),
cat.a = c(NA,NA,NA,NA,NA,NA,NA,NA),
cat.b = c(NA,NA,NA,NA,NA,NA,NA,NA))
> df2
id item score cat.a cat.b
1 1 11 1 NA NA
2 2 22 0 NA NA
3 3 33 1 NA NA
4 4 44 1 NA NA
5 5 55 1 NA NA
6 6 66 0 NA NA
7 7 77 1 NA NA
8 8 88 1 NA NA
The two datasets share the first 6 rows, and dataset 2 has two more rows. When I merge, I need to keep the cat.a and cat.b information from the first dataframe. I also want to keep id = 7 and id = 8, with the cat.a and cat.b columns missing.
Here is my desired output.
> df3
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
Any ideas?
Thanks!
We may use rows_update:
library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))
Output:
id item score cat.a cat.b
1 1 11 1 A a
2 2 22 0 B a
3 3 33 1 C b
4 4 44 1 D b
5 5 55 1 E c
6 6 66 0 F f
7 7 77 1 <NA> <NA>
8 8 88 1 <NA> <NA>
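If rows_update() (added in dplyr 1.0.0) is not available, a base-R sketch with merge() should give the same result; it assumes, as in the example, that id alone identifies the shared rows:
# hypothetical base-R alternative: left-join the category columns from df1 onto df2
df3 <- merge(df2[, c("id", "item", "score")],
             df1[, c("id", "cat.a", "cat.b")],
             by = "id", all.x = TRUE)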

Filling in gaps in a column based on matching rows in two columns in R

In df2, I would like to fill the gaps in column d based on matching records in columns b and c between the two dataframes. What would be a quick and elegant way to do that? It is important to mention that it should work even when matching rows are in different positions in the two dataframes.
df1 <- data.frame( a = c(1,1,1,1,1,2,2,2,2,2) ,b = rep(seq(41,45,1),each=2), c = c(101:105,101:105), d = LETTERS[seq( from = 1, to = 10 )])
df2 <- data.frame( a = c(1,1,1,1,1,2,2,2,2,2) ,b = rep(seq(41,45,1),each=2), c = c(101:105,101:105), d = c(LETTERS[seq( from = 1, to = 6 )],rep(NA,4)))
> df1
a b c d
1 1 41 101 A
2 1 41 102 B
3 1 42 103 C
4 1 42 104 D
5 1 43 105 E
6 2 43 101 F
7 2 44 102 G
8 2 44 103 H
9 2 45 104 I
10 2 45 105 J
> df2
a b c d
1 1 41 101 A
2 1 41 102 B
3 1 42 103 C
4 1 42 104 D
5 1 43 105 E
6 2 43 101 F
7 2 44 102 <NA>
8 2 44 103 <NA>
9 2 45 104 <NA>
10 2 45 105 <NA>
The result should be following:
a b c d
1 1 41 101 A
2 1 41 102 B
3 1 42 103 C
4 1 42 104 D
5 1 43 105 E
6 2 43 101 F
7 2 44 102 G
8 2 44 103 H
9 2 45 104 I
10 2 45 105 J
While you can do lookups with match and perhaps %in%, I'd think another (robust) way to do it is with a merge/join:
df2mod <- merge(df2, df1[,c('b','c','d')], by = c("b", "c"), all=TRUE)
df2mod
# b c a d.x d.y
# 1 41 101 1 A A
# 2 41 102 1 B B
# 3 42 103 1 C C
# 4 42 104 1 D D
# 5 43 101 2 F F
# 6 43 105 1 E E
# 7 44 102 2 <NA> G
# 8 44 103 2 <NA> H
# 9 45 104 2 <NA> I
# 10 45 105 2 <NA> J
In this case, d.x is the original df2$d. Because your data uses factors, some extra steps are necessary (as.character and re-creating the factor).
df2mod$d <- with(df2mod, ifelse(is.na(d.x), as.character(d.y), as.character(d.x)))
df2mod$d <- factor(df2mod$d, levels = levels(df1$d))
df2mod
# b c a d.x d.y d
# 1 41 101 1 A A A
# 2 41 102 1 B B B
# 3 42 103 1 C C C
# 4 42 104 1 D D D
# 5 43 101 2 F F F
# 6 43 105 1 E E E
# 7 44 102 2 <NA> G G
# 8 44 103 2 <NA> H H
# 9 45 104 2 <NA> I I
# 10 45 105 2 <NA> J J
df2mod[,c("d.x", "d.y")] <- NULL # cleanup unnecessary columns
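For completeness, the match() route mentioned at the start could look roughly like this; it assumes each (b, c) pair occurs at most once in df1 (my own sketch, not part of the original answer):
df2$d <- as.character(df2$d)                      # avoid factor-level issues (pre-R 4.0 factors)
idx <- match(paste(df2$b, df2$c), paste(df1$b, df1$c))
miss <- is.na(df2$d)
df2$d[miss] <- as.character(df1$d)[idx[miss]]     # fill only the gaps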

Add values of one group into another group in R

I have a question on how to add the value from one row of a group to the rest of the elements in the group and then delete that row. For example:
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
Day=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
value=c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
In the above example, my data is grouped by Year, Cluster, Seed and Day. The Seed = 99 value needs to be added to the rows above it, based on the (Year, Cluster, Day) group, and then that row should be deleted. For example, row 16 belongs to the (Year = 1, Cluster = a, Day = 1, Seed = 99) group, so its value of 55 should be added to row 1 (5 + 55), row 6 (6 + 55) and row 11 (2 + 55), and row 16 should then be deleted. Row 21, however, which is in Cluster = c with Seed = 99, should remain in the data as is, because there is no matching Year + Cluster + Day combination for it.
My actual data has about 1 million records with 10 years, 80 clusters, 500 days and 10+1 seeds (1 to 10 and 99), so I am looking for an efficient solution.
Year Cluster Seed Day value
1 1 a 1 1 60
2 1 a 1 2 68
3 1 a 1 3 78
4 1 a 1 4 90
5 1 a 1 5 107
6 1 a 2 1 61
7 1 a 2 2 73
8 1 a 2 3 86
9 1 a 2 4 91
10 1 a 2 5 104
11 1 a 3 1 57
12 1 a 3 2 67
13 1 a 3 3 79
14 1 a 3 4 96
15 1 a 3 5 105
16 1 c 99 1 10
17 2 b 1 1 60
18 2 b 1 2 68
19 2 b 1 3 78
20 2 b 1 4 90
21 2 b 1 5 107
22 2 b 2 1 61
23 2 b 2 2 73
24 2 b 2 3 86
25 2 b 2 4 91
26 2 b 2 5 104
27 2 b 3 1 57
28 2 b 3 2 67
29 2 b 3 3 79
30 2 b 3 4 96
31 2 b 3 5 105
32 2 d 99 1 10
A data.table approach:
library(data.table)
df <- setDT(df)[, `:=` (value = ifelse(Seed != 99, value + value[Seed == 99], value),
                        flag  = Seed == 99 & .N == 1),
                by = .(Year, Cluster, Day)][!(Seed == 99 & flag == FALSE), ][, "flag" := NULL]
Output:
df[]
Year Cluster Seed Day value
1: 1 a 1 1 60
2: 1 a 1 2 68
3: 1 a 1 3 78
4: 1 a 1 4 90
5: 1 a 1 5 107
6: 1 a 2 1 61
7: 1 a 2 2 73
8: 1 a 2 3 86
9: 1 a 2 4 91
10: 1 a 2 5 104
11: 1 a 3 1 57
12: 1 a 3 2 67
13: 1 a 3 3 79
14: 1 a 3 4 96
15: 1 a 3 5 105
16: 1 c 99 1 10
17: 2 b 1 1 60
18: 2 b 1 2 68
19: 2 b 1 3 78
20: 2 b 1 4 90
21: 2 b 1 5 107
22: 2 b 2 1 61
23: 2 b 2 2 73
24: 2 b 2 3 86
25: 2 b 2 4 91
26: 2 b 2 5 104
27: 2 b 3 1 57
28: 2 b 3 2 67
29: 2 b 3 3 79
30: 2 b 3 4 96
31: 2 b 3 5 105
32: 2 d 99 1 10
Here's an approach using the tidyverse. If you're looking for speed with a million rows, a data.table solution will probably perform better.
library(tidyverse)
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
Day=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
value=c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
seeds <- df %>%
filter(Seed == 99)
matches <- df %>%
filter(Seed != 99) %>%
inner_join(select(seeds, -Seed), by = c("Year", "Cluster", "Day")) %>%
mutate(value = value.x + value.y) %>%
select(Year, Cluster, Seed, Day, value)
no_matches <- anti_join(seeds, matches, by = c("Year", "Cluster", "Day"))
bind_rows(matches, no_matches) %>%
arrange(Year, Cluster, Seed, Day)
#> Year Cluster Seed Day value
#> 1 1 a 1 1 60
#> 2 1 a 1 2 68
#> 3 1 a 1 3 78
#> 4 1 a 1 4 90
#> 5 1 a 1 5 107
#> 6 1 a 2 1 61
#> 7 1 a 2 2 73
#> 8 1 a 2 3 86
#> 9 1 a 2 4 91
#> 10 1 a 2 5 104
#> 11 1 a 3 1 57
#> 12 1 a 3 2 67
#> 13 1 a 3 3 79
#> 14 1 a 3 4 96
#> 15 1 a 3 5 105
#> 16 1 c 99 1 10
#> 17 2 b 1 1 60
#> 18 2 b 1 2 68
#> 19 2 b 1 3 78
#> 20 2 b 1 4 90
#> 21 2 b 1 5 107
#> 22 2 b 2 1 61
#> 23 2 b 2 2 73
#> 24 2 b 2 3 86
#> 25 2 b 2 4 91
#> 26 2 b 2 5 104
#> 27 2 b 3 1 57
#> 28 2 b 3 2 67
#> 29 2 b 3 3 79
#> 30 2 b 3 4 96
#> 31 2 b 3 5 105
#> 32 2 d 99 1 10
Created on 2018-11-23 by the reprex package (v0.2.1)
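For comparison, here is a base-R sketch of the same logic, using the df defined just above (the adds/out/has_non99 names are hypothetical, and this is not tuned for a million rows):
adds <- subset(df, Seed == 99, select = c(Year, Cluster, Day, value))
names(adds)[4] <- "add"
out <- merge(df, adds, by = c("Year", "Cluster", "Day"), all.x = TRUE)
out$value <- with(out, ifelse(Seed != 99 & !is.na(add), value + add, value))
# drop the Seed == 99 rows whose group also contains non-99 rows; keep the unmatched ones
has_non99 <- ave(out$Seed != 99, out$Year, out$Cluster, out$Day, FUN = any)
out <- out[!(out$Seed == 99 & has_non99), setdiff(names(out), "add")]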

Merge (full join) recursively one data.table with each group of another data.table

I have 2 data.tables:
a.id <- c("a","a","a","b","b","c","c","c","c")
b.id <- c(1,2,3,4,5,1,3,4,5)
x <- seq(1:9)
dt1 <- data.table(a.id,b.id,x)
and
rp <- c("r","s")
t <- rep(rp, each=5)
b.id <- rep(1:5, 2)
y <- sample.int(50, 10)
dt2 <- data.table(t, b.id, y)
For each a.id of dt1, I would like to full-join each t of dt2, adding the values as new columns to dt1 and naming each new column after the corresponding value of t. As this is a full join, any x that is missing in dt1 for a given b.id is filled with NA.
Here is the desired output (for r and s, these are random values):
a.id b.id x r s
a 1 1 14 40
a 2 2 42 25
a 3 3 32 11
a 4 NA 33 3
a 5 NA 21 1
b 1 NA 14 40
b 2 NA 42 25
b 3 NA 32 11
b 4 4 33 3
b 5 5 21 1
c 1 6 14 40
c 2 NA 42 25
c 3 7 32 11
c 4 8 33 3
c 5 9 21 1
I have tried something like:
dt1[, merge(.SD, dt2, by = "b.id", all = TRUE), by = a.id]
But it does not work.
I would appreciate your help on that problem.
Thanks for your time.
Try something like:
f <- dcast(dt2, b.id ~ t)
dt1[f[rep(1:nrow(f), uniqueN(dt1$a.id)),
      c(.SD, list(a.id = rep(unique(dt1$a.id), each = nrow(f))))],
    on = c("a.id", "b.id")]
# a.id b.id x r s
# 1: a 1 1 40 28
# 2: a 2 2 4 17
# 3: a 3 3 11 13
# 4: a 4 NA 49 42
# 5: a 5 NA 29 37
# 6: b 1 NA 40 28
# 7: b 2 NA 4 17
# 8: b 3 NA 11 13
# 9: b 4 4 49 42
#10: b 5 5 29 37
#11: c 1 6 40 28
#12: c 2 NA 4 17
#13: c 3 7 11 13
#14: c 4 8 49 42
#15: c 5 9 29 37
The r and s values differ from the desired output above because no random seed had been set.
With a cross join one can do:
dcast(dt2, b.id~t, value.var = "y")[
dt1[CJ(a.id=a.id, b.id=b.id, unique=TRUE), on=.(a.id, b.id)], on="b.id"]
If not all possible values of b.id appear in dt1$b.id, then the CJ() part should look like:
CJ(a.id=a.id, b.id=dt2$b.id, unique=TRUE)
Here is another variant:
dt1[dcast(dt2, b.id~t, value.var = "y")[
CJ(a.id=dt1$a.id, b.id=dt2$b.id, unique=TRUE), on=.(b.id)], on=.(a.id, b.id)]
# a.id b.id x r s
# 1: a 1 1 46 24
# 2: a 2 2 50 33
# 3: a 3 3 14 6
# 4: a 4 NA 40 28
# 5: a 5 NA 30 29
# 6: b 1 NA 46 24
# 7: b 2 NA 50 33
# 8: b 3 NA 14 6
# 9: b 4 4 40 28
# 10: b 5 5 30 29
# 11: c 1 6 46 24
# 12: c 2 NA 50 33
# 13: c 3 7 14 6
# 14: c 4 8 40 28
# 15: c 5 9 30 29
data:
library("data.table")
set.seed(42)
dt1 <- data.table(a.id=rep(c("a", "b", "c"), c(3,2,4)), b.id=c(1:5,1,3,4,5), x=1:9)
dt2 <- data.table(t=rep(c("r","s"), each=5), b.id=1:5, y=sample.int(50, 10))
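Outside data.table, a tidyverse sketch of the same widen-then-cross-join idea (pivot_wider/crossing are my suggestion here, not from the original answers):
library(dplyr)
library(tidyr)
wide <- pivot_wider(dt2, names_from = t, values_from = y)       # columns b.id, r, s
crossing(a.id = unique(dt1$a.id), b.id = unique(dt2$b.id)) %>%  # all a.id/b.id combinations
  left_join(dt1, by = c("a.id", "b.id")) %>%
  left_join(wide, by = "b.id")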

Removing duplicates for each ID

Suppose that there are three variables in my data frame (mydata): 1) id, 2) case, and 3) value.
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4), case=c("a","b","c","c","b","a","b","c","c","a","b","c","c","a","b","c","a"), value=c(1,34,56,23,34,546,34,67,23,65,23,65,23,87,34,321,87))
mydata
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 1 b 34
6 2 a 546
7 2 b 34
8 2 c 67
9 2 c 23
10 3 a 65
11 3 b 23
12 3 c 65
13 3 c 23
14 4 a 87
15 4 b 34
16 4 c 321
17 4 a 87
For each id, the same 'case' character can occur more than once, and the values could be the same or different. So basically, if the values are the same, I only need to keep one row and remove the duplicate.
My final data then would be
id case value
1 1 a 1
2 1 b 34
3 1 c 56
4 1 c 23
5 2 a 546
6 2 b 34
7 2 c 67
8 2 c 23
9 3 a 65
10 3 b 23
11 3 c 65
12 3 c 23
13 4 a 87
14 4 b 34
15 4 c 321
To add to the other answers, here's a dplyr approach:
library(dplyr)
mydata %>% group_by(id, case, value) %>% distinct()
Or
mydata %>% distinct(id, case, value)
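If your real data has extra columns that you want to carry along, distinct() also takes a .keep_all argument:
mydata %>% distinct(id, case, value, .keep_all = TRUE)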
You could try duplicated:
mydata[!duplicated(mydata[,c('id', 'case', 'value')]),]
# id case value
#1 1 a 1
#2 1 b 34
#3 1 c 56
#4 1 c 23
#6 2 a 546
#7 2 b 34
#8 2 c 67
#9 2 c 23
#10 3 a 65
#11 3 b 23
#12 3 c 65
#13 3 c 23
#14 4 a 87
#15 4 b 34
#16 4 c 321
Or use unique with the by option from data.table:
library(data.table)
set.seed(25)
mydata1 <- cbind(mydata, value1=rnorm(17))
DT <- as.data.table(mydata1)
unique(DT, by=c('id', 'case', 'value'))
# id case value value1
#1: 1 a 1 -0.21183360
#2: 1 b 34 -1.04159113
#3: 1 c 56 -1.15330756
#4: 1 c 23 0.32153150
#5: 2 a 546 -0.44553326
#6: 2 b 34 1.73404543
#7: 2 c 67 0.51129562
#8: 2 c 23 0.09964504
#9: 3 a 65 -0.05789111
#10: 3 b 23 -1.74278763
#11: 3 c 65 -1.32495298
#12: 3 c 23 -0.54793388
#13: 4 a 87 -1.45638428
#14: 4 b 34 0.08268682
#15: 4 c 321 0.92757895
Deduplicating on just id, case and value? Easy:
> mydata[!duplicated(mydata[,c("id","case","value")]),]
Even if you have a ton more variables in the dataset, they won't be considered by the duplicated() call.
