r - count number of identical rows - r

I hope this is not a duplicate question (did my best to see if it was already asked). I have a data frame and would like to count how many rows are identical.
df = data.frame(ID = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9"),
Val1 = c("A", "B", "C", "A", "A", "B", "D", "C", "D"),
Val2 = c("B", "C", NA, "B", "B", "D", "E", "D", "E"),
Val3 = c("C", NA, NA, "C", "C", "B", NA, NA,NA),
Val4 = c("D", NA, NA, "E", "D", NA, NA, NA, NA))
> df
ID Val1 Val2 Val3 Val4
1 id1 A B C D
2 id2 B C <NA> <NA>
3 id3 C <NA> <NA> <NA>
4 id4 A B C E
5 id5 A B C D
6 id6 B D B <NA>
7 id7 D E <NA> <NA>
8 id8 C D <NA> <NA>
9 id9 D E <NA> <NA>
So for this example I expect the result to be A B C D 2, D E 2, B C <NA> <NA> 1, and so on.
I tried table, but I get Error in table(type_table) : attempt to make a table with >= 2^31 elements, even though my df has "only" ~140K rows. I want to apply this to a much larger dataset. I also tried summarise, but I probably do not know how to apply it correctly. Is aggregate an option? Thank you

table isn't working because it treats each column as a separate dimension and tries to count every combination of elements across columns, instead of counting the row combinations that actually occur.
You can use do.call(paste, ...) to paste the elements of each row together and then run table over the result
table(do.call(paste, df[-1]))
# A B C D  A B C E  B C NA NA  B D B NA  C D NA NA  C NA NA NA  D E NA NA
#       2        1          1         1          1           1          2
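To make the difference concrete, here is a minimal base R sketch on the example data. The names tab, keys and counts are only illustrative, and the space separator assumes that no value contains a space itself (otherwise two different rows could paste to the same key, and a separator that cannot occur in the data would be safer).

```r
df <- data.frame(
  ID   = paste0("id", 1:9),
  Val1 = c("A", "B", "C", "A", "A", "B", "D", "C", "D"),
  Val2 = c("B", "C", NA, "B", "B", "D", "E", "D", "E"),
  Val3 = c("C", NA, NA, "C", "C", "B", NA, NA, NA),
  Val4 = c("D", NA, NA, "E", "D", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# table() on several columns allocates one cell per combination of levels,
# so its size is the product of the number of distinct values per column.
tab <- table(df[-1])
dim(tab)      # 4 4 2 2
length(tab)   # 64 cells for only 9 rows

# Pasting each row into one key gives a one-dimensional table instead.
keys <- do.call(paste, df[-1])   # NA is pasted as the string "NA"
counts <- table(keys)
counts[["A B C D"]]              # 2
```

With many columns and many distinct values the product grows quickly, which is consistent with the 2^31 error in the question, while the pasted keys never need more cells than there are rows.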
If table isn't efficient enough, we can try with .N from data.table instead
library(data.table)
setDT(df)[, .N, by = c(names(df)[-1])]
# Val1 Val2 Val3 Val4 N
# 1: A B C D 2
# 2: B C NA NA 1
# 3: C NA NA NA 1
# 4: A B C E 1
# 5: B D B NA 1
# 6: D E NA NA 2
# 7: C D NA NA 1

With data.table
library(data.table)
setDT(df)
df[, dups := 1:.N, setdiff(names(df), "ID")]
df[, .SD[.N], setdiff(names(df), c("ID", "dups"))][dups != 1]
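Group by everything except ID, index items within groups of duplicates, then select the last row in each group (when the duplication index isn't 1). This returns only the combinations that occur more than once, and dups in the last row of each group equals the number of occurrences.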

Related

update names based on columns R

Following this question (update names based on columns), another thing that I want to ask
df <- data.frame(name1 = c("a", "a", "a", "a", 'a', NA, NA, NA,NA),
name2 = c("b", "b", "b", "b", "c", NA, NA, NA,NA),
name3 = c("b", "b", "b", "b", "c", "a", "a", "a", "f"))
df
name1 name2 name3
1 a b b
2 a b b
3 a b b
4 a b b
5 a c c
6 <NA> <NA> a
7 <NA> <NA> a
8 <NA> <NA> a
9 <NA> <NA> f
Now, I want to keep f while replacing a by b.
-Desired output
name1 name2 name3
1 a b b
2 a b b
3 a b b
4 a b b
5 a c c
6 <NA> <NA> b
7 <NA> <NA> b
8 <NA> <NA> b
9 <NA> <NA> f
The code from the comments of @Rui and @TarJae
df %>%
mutate(name3 = case_when(
any(name1 == "a") & is.na(name2) ~ "b",
TRUE ~ name3
))
However, in this case it does not work, because the condition tests for NA in name2 and the f row has NA there too.
Any suggestions?
If you just want to keep f, how about this?
Edit: values other than a and b will not be changed.
df %>%
mutate(name3 = case_when(
!(name3 %in% c("a", "b")) ~ name3,
any(name1 == "a") & is.na(name2) ~ "b",
TRUE ~ name3
))
name1 name2 name3
1 a b b
2 a b b
3 a b b
4 a b b
5 a c c
6 <NA> <NA> b
7 <NA> <NA> b
8 <NA> <NA> b
9 <NA> <NA> f
Generalizing the solution of @Tjn25 to every column, one can do the following
f <- function(x) ifelse(x == "a", "b", x)
data.frame(lapply(df, f))
If you just want to change the values in the name3 column that are a to b,
then you could use
df$name3[df$name3 == 'a'] <- 'b'
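If the replacement should also be limited to rows where name2 is missing, as in the question, the same subset assignment can carry a second condition. This is a small sketch on the question's data; the helper name to_fix is just for illustration.

```r
df <- data.frame(
  name1 = c("a", "a", "a", "a", "a", NA, NA, NA, NA),
  name2 = c("b", "b", "b", "b", "c", NA, NA, NA, NA),
  name3 = c("b", "b", "b", "b", "c", "a", "a", "a", "f"),
  stringsAsFactors = FALSE
)

# %in% never returns NA, so the logical index is safe to use for assignment.
to_fix <- is.na(df$name2) & df$name3 %in% "a"
df$name3[to_fix] <- "b"

df$name3
# "b" "b" "b" "b" "c" "b" "b" "b" "f"
```

Only the a values in rows 6 to 8 change; the f in row 9 is left alone because it does not match the second condition.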

update names based on columns

I would like to update the names based on two columns.
My example has 3 original columns
df <- data.frame(name1 = c("a", "a", "a", "a", 'a', NA, NA, NA),
name2 = c("b", "b", "b", "b", "c", NA, NA, NA),
name3 = c("b", "b", "b", "b", "c", "a", "a", "a"))
df
name1 name2 name3
1 a b b
2 a b b
3 a b b
4 a b b
5 a c c
6 <NA> <NA> a
7 <NA> <NA> a
8 <NA> <NA> a
I would like to update column name3 (or even create a new column) so that if name1 == a and name2 is NA, the a in name3 is replaced by b, the value found in column name2.
My desired output something like
name1 name2 name3
1 a b b
2 a b b
3 a b b
4 a b b
5 a c c
6 <NA> <NA> b
7 <NA> <NA> b
8 <NA> <NA> b
So far, I am using df %>% mutate(name3 = ifelse(name1 == "a" & is.na(name2), "b", name3)), but this produces NA in name3. Any suggestions?
Base R
df$name3 <- ifelse(any(df$name1 == "a") & is.na(df$name2), "b", df$name3)
dplyr
library(dplyr)
df %>%
mutate(name3 = case_when(
any(name1 == "a") & is.na(name2) ~ "b",
TRUE ~ name3
))
# name1 name2 name3
#1 a b b
#2 a b b
#3 a b b
#4 a b b
#5 a c c
#6 <NA> <NA> b
#7 <NA> <NA> b
#8 <NA> <NA> b
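We can replace == with %in% to eliminate the NAs, because R evaluates NA %in% x to FALSE but NA == x to NA. Note that in the example data name1 is NA in rows 6 to 8, so this condition is FALSE there and those rows keep a instead of becoming NA.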
df %>% mutate(name3 = ifelse(name1 %in% 'a' & is.na(name2), 'b', name3))
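The difference between the two operators can be checked directly in base R. The sketch below rebuilds the example data and compares both conditions; with_eq and with_in are illustrative names only.

```r
NA == "a"     # NA
NA %in% "a"   # FALSE

df <- data.frame(
  name1 = c("a", "a", "a", "a", "a", NA, NA, NA),
  name2 = c("b", "b", "b", "b", "c", NA, NA, NA),
  name3 = c("b", "b", "b", "b", "c", "a", "a", "a"),
  stringsAsFactors = FALSE
)

# With ==, rows 6 to 8 evaluate to NA & TRUE, which is NA, so ifelse() returns NA.
with_eq <- ifelse(df$name1 == "a" & is.na(df$name2), "b", df$name3)

# With %in%, the same rows evaluate to FALSE & TRUE, which is FALSE,
# so ifelse() falls through to the original name3 value.
with_in <- ifelse(df$name1 %in% "a" & is.na(df$name2), "b", df$name3)

with_eq   # "b" "b" "b" "b" "c" NA  NA  NA
with_in   # "b" "b" "b" "b" "c" "a" "a" "a"
```

So %in% removes the NAs, but on this data it does not by itself produce the b values in rows 6 to 8, since name1 is missing there.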
We could use a case_when or ifelse statement:
library(dplyr)
df %>%
mutate(name3 = case_when(any(name1 %in% "a") &
is.na(name2) ~ "b",
TRUE ~ name3))
or:
df %>%
mutate(name3 = ifelse(any(name1 %in% "a") &
is.na(name2), "b", name3))
name1 name2 name3
1 a b b
2 a b b
3 a b b
4 a b b
5 a c c
6 <NA> <NA> b
7 <NA> <NA> b
8 <NA> <NA> b

Count number of element for each row in a matrix [duplicate]

Hello, I have a matrix such as:
COL1 COL2 COL3
A "A" "B" NA
B "B" "B" "C"
C NA NA NA
D "B" "B" "B"
E NA NA "C"
F "A" "A" "C"
and I would like to get, for each row (A, B, C, D, etc.), the number of letters that are A or B.
Example:
Nb
A 2
B 2
C 0
D 3
E 0
F 2
Does someone have an idea?
another way is to use sapply:
df$n <- sapply(1:nrow(df), function(i) sum((df[i,] %in% c('A', 'B'))))
# COL1 COL2 COL3 n
# A A B <NA> 2
# B B B C 2
# C <NA> <NA> <NA> 0
# D B B B 3
# E <NA> <NA> C 0
# F A A C 2
You can achieve the same output by using purrr::map_dbl as well. Just replace sapply with map_dbl.
You can try a base R solution with apply():
#Base R
df$Var <- apply(df,1,function(x) length(which(!is.na(x) & x %in% c('A','B'))))
Output:
COL1 COL2 COL3 Var
A A B <NA> 2
B B B C 2
C <NA> <NA> <NA> 0
D B B B 3
E <NA> <NA> C 0
F A A C 2
Some data used:
#Data
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"), COL2 = c("B",
"B", NA, "B", NA, "A"), COL3 = c(NA, "C", NA, "B", "C", "C")), row.names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame")
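As a vectorised alternative to looping over rows, the membership test can be done on the whole matrix at once and summed with rowSums(). This is a sketch using the same data; m, is_ab and nb are illustrative names, and the approach assumes all columns can be treated as character.

```r
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"),
                     COL2 = c("B", "B", NA, "B", NA, "A"),
                     COL3 = c(NA, "C", NA, "B", "C", "C")),
                row.names = c("A", "B", "C", "D", "E", "F"),
                class = "data.frame")

m <- as.matrix(df)             # also works if the input is already a matrix
is_ab <- m %in% c("A", "B")    # plain logical vector; NA cells become FALSE
dim(is_ab) <- dim(m)           # restore the matrix shape dropped by %in%

nb <- rowSums(is_ab)
names(nb) <- rownames(m)
nb
# A B C D E F
# 2 2 0 3 0 2
```

Because %in% returns FALSE for NA, no separate !is.na() check or na.rm argument is needed here.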
Or if you feel curious about tidyverse:
library(tidyverse)
#Code
df %>% mutate(id=1:n()) %>%
left_join(df %>% mutate(id=1:n()) %>%
pivot_longer(cols = -id) %>%
filter(value %in% c('A','B')) %>%
group_by(id) %>%
summarise(Var=n())) %>% ungroup() %>%
replace(is.na(.),0) %>% select(-id)
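Output (note that replace(is.na(.), 0) also turns the NA cells in COL1 to COL3 into 0, and the row names are lost):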
COL1 COL2 COL3 Var
1 A B 0 2
2 B B C 2
3 0 0 0 0
4 B B B 3
5 0 0 C 0
6 A A C 2
library(dplyr)
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"), COL2 = c("B",
"B", NA, "B", NA, "A"), COL3 = c(NA, "C", NA, "B", "C", "C")), row.names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame")
df %>%
rowwise() %>%
mutate(sumVar = across(c(COL1:COL3),~ifelse(. %in% c("A", "B"),1,0)) %>% sum)
# A tibble: 6 x 4
# Rowwise:
COL1 COL2 COL3 sumVar
<chr> <chr> <chr> <dbl>
1 A B NA 2
2 B B C 2
3 NA NA NA 0
4 B B B 3
5 NA NA C 0
6 A A C 2

Replace certain columns in a data frame with the columns of another data frame

I have two data frames with the same columns names and the same size. Each of them has 40 columns and 5000 rows. I would like to replace certain columns in a data frame with those from the other df arranged by their common ID. The column ID is identical for both dfs but not necessarily in the same order for each df.
Let me provide an example for clarity.
df1 <- data.frame( ID = c("ID1", "ID2","ID3", "ID4","ID5", "ID6","ID7", "ID8", "ID9"),
A = c(1,2,3,4,5,6,7,8,9),
B = c(11,21,31,41,51,61,71,81,91),
C = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
D = c("a1","b1","c1", "d1","e1", "f1", "g1", "h1", "i1")
)
df1
df2 <- data.frame( ID = c("ID2", "ID1","ID3", "ID4","ID5", "ID6","ID9", "ID8", "ID7"),
A = sample(x = 1:20, size = 9),
B = sample(x = 1:50, size = 9),
C = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
D = c("A1","B1","C1", "D1","E1", "F1", "G1", "H1", "I1")
)
df2
This should be the df2 after replacing its columns, A, B with those from df1 while keeping the rest of the columns (C, D) unchanged.
df2_out <- data.frame( ID = c("ID2", "ID1","ID3", "ID4","ID5", "ID6","ID9", "ID8", "ID7"),
A = c(2,1,3,4,5,6,9,8,7),
B = c(21,11,31,41,51,61,91,81,71),
C = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
D = c("A1","B1","C1", "D1","E1", "F1", "G1", "H1", "I1")
)
As mentioned, the number of columns to be changed is large (30) in my data set:
changed_columns <- c("A", "B", ....)
Any help on how to do this?
Thank you
Using the data.table package, you can solve your problem as follows:
library(data.table)
setDT(df2)[df1, c("A", "B") := .(i.A, i.B), on = "ID"]
# ID A B C D
# 1: ID2 2 21 A A1
# 2: ID1 1 11 B B1
# 3: ID3 3 31 C C1
# 4: ID4 4 41 D D1
# 5: ID5 5 51 E E1
# 6: ID6 6 61 F F1
# 7: ID9 9 91 G G1
# 8: ID8 8 81 H H1
# 9: ID7 7 71 I I1
Another base R option by using merge + subset
df2_out <- subset(merge(df1[c("ID", "A", "B")], df2, all = TRUE, by = "ID"), select = -c(A.y, B.y))
such that
> df2_out
ID A.x B.x C D
1 ID1 1 11 B B1
2 ID2 2 21 A A1
3 ID3 3 31 C C1
4 ID4 4 41 D D1
5 ID5 5 51 E E1
6 ID6 6 61 F F1
7 ID7 7 71 I I1
8 ID8 8 81 H H1
9 ID9 9 91 G G1
We can use match to find the row in df2 that corresponds to each ID in df1, and then overwrite the changed_columns in those rows with the values from df1.
changed_columns <- c("A", "B")
df2[match(df1$ID, df2$ID), changed_columns] <- df1[changed_columns]
df2
# ID A B C D
#1 ID2 2 21 A A1
#2 ID1 1 11 B B1
#3 ID3 3 31 C C1
#4 ID4 4 41 D D1
#5 ID5 5 51 E E1
#6 ID6 6 61 F F1
#7 ID9 9 91 G G1
#8 ID8 8 81 H H1
#9 ID7 7 71 I I1
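The same idea can be written the other way round, looking up each df2 ID in df1, which makes it easy to check for IDs without a match before assigning. This is a sketch under the question's assumption that both data frames contain the same IDs; the placeholder values 0 in df2 and the name idx are made up for illustration.

```r
df1 <- data.frame(
  ID = paste0("ID", 1:9),
  A  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  B  = c(11, 21, 31, 41, 51, 61, 71, 81, 91),
  C  = letters[1:9],
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c("ID2", "ID1", "ID3", "ID4", "ID5", "ID6", "ID9", "ID8", "ID7"),
  A  = 0,                       # placeholder values to be overwritten
  B  = 0,
  C  = LETTERS[1:9],
  stringsAsFactors = FALSE
)

changed_columns <- c("A", "B")  # can be any number of shared column names

idx <- match(df2$ID, df1$ID)    # row of df1 that belongs to each row of df2
stopifnot(!anyNA(idx))          # fail early if an ID is missing from df1
df2[changed_columns] <- df1[idx, changed_columns]

df2$A
# 2 1 3 4 5 6 9 8 7
```

Columns not listed in changed_columns (here C) are left untouched, and the row order of df2 is preserved.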

Reshape data table

I have a data table like (data is not necessarily ordered by 'col1')
col0 col1 col2
1: abc 1 a
2: abc 2 b
3: abc 3 c
4: abc 4 d
5: abc 5 e
6: def 1 a
7: def 2 b
8: def 3 c
9: def 4 d
10: def 5 e
I want to reshape it the following way
col0 col1 col2 new_1 new_2 new_3 new_4
1: abc 1 a NA NA NA NA
2: abc 2 b a NA NA NA
3: abc 3 c b a NA NA
4: abc 4 d c b a NA
5: abc 5 e d c b a
6: def 1 a NA NA NA NA
7: def 2 b a NA NA NA
8: def 3 c b a NA NA
9: def 4 d c b a NA
10: def 5 e d c b a
Basically I want to get previously occurred values of col2 for each row in the same row as above and if there is none the corresponding new column should say NA.
I can of course do it by merge on col2 5 times but I need to do this on a large table (in that case I will have to merge 20-30 times).
What is the best way to achieve it in R in 1 or 2 lines?
We can use shift from the devel version of data.table, i.e. v1.9.5 (instructions to install the devel version are here). By default, the type in shift is lag. We can specify n as a vector, in this case 1:4. We assign (:=) the output to new columns.
library(data.table)#v1.9.5+
DT[, paste('new', 1:4, sep="_") := shift(col2, 1:4)]
DT
# col1 col2 new_1 new_2 new_3 new_4
#1: 1 a NA NA NA NA
#2: 2 b a NA NA NA
#3: 3 c b a NA NA
#4: 4 d c b a NA
#5: 5 e d c b a
For the new dataset 'DT2', we need to group by 'col0' and then do the shift on 'col2'
DT2[, paste('new', 1:4, sep="_") := shift(col2, 1:4), by = col0]
DT2
# col0 col1 col2 new_1 new_2 new_3 new_4
# 1: abc 1 a NA NA NA NA
# 2: abc 2 b a NA NA NA
# 3: abc 3 c b a NA NA
# 4: abc 4 d c b a NA
# 5: abc 5 e d c b a
# 6: def 1 a NA NA NA NA
# 7: def 2 b a NA NA NA
# 8: def 3 c b a NA NA
# 9: def 4 d c b a NA
#10: def 5 e d c b a
data
df1 <- structure(list(col1 = 1:5, col2 = c("a", "b", "c", "d", "e"),
new_1 = c(NA, "a", "b", "c", "d"), new_2 = c(NA, NA, "a",
"b", "c"), new_3 = c(NA, NA, NA, "a", "b"), new_4 = c(NA,
NA, NA, NA, "a")), .Names = c("col1", "col2", "new_1", "new_2",
"new_3", "new_4"), class = "data.frame", row.names = c(NA, -5L
))
DT <- as.data.table(df1)
df2 <- structure(list(col0 = c("abc", "abc", "abc", "abc", "abc",
"def",
"def", "def", "def", "def"), col1 = c(1L, 2L, 3L, 4L, 5L, 1L,
2L, 3L, 4L, 5L), col2 = c("a", "b", "c", "d", "e", "a", "b",
"c", "d", "e")), .Names = c("col0", "col1", "col2"),
class = "data.frame", row.names = c(NA, -10L))
DT2 <- as.data.table(df2)
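For readers who want to see what the grouped lag does without data.table, the same result can be sketched in base R. This is not the answer's method, only an illustration of the idea; lag_n is a hypothetical helper written here, and like the shift call it assumes the rows are already in the desired order within each col0 group.

```r
df2 <- data.frame(
  col0 = rep(c("abc", "def"), each = 5),
  col1 = rep(1:5, times = 2),
  col2 = rep(c("a", "b", "c", "d", "e"), times = 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift x down by n positions, padding the top with NA
# and keeping the original length.
lag_n <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# ave() applies the function within each col0 group and returns the values
# in the original row order.
for (n in 1:4) {
  df2[[paste0("new_", n)]] <- ave(df2$col2, df2$col0,
                                  FUN = function(x) lag_n(x, n))
}

df2$new_1
# NA  "a" "b" "c" "d" NA  "a" "b" "c" "d"
```

On a large table the data.table version is likely to be the better choice; the loop here is only meant to show how each new_n column relates to col2.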

Resources