Compare and count the frequency of pairs of entries in two columns - r

I have two columns (v5 & v6) in a matrix, where both columns have entries between 0 and 5:
head(matrix)
     v1 v2 ... v5 v6
[1,]            0  5
[2,]            1  3
[3,]            2  1
[4,]            4  1
[5,]            2  2
I want to construct a new 6 x 6 matrix that contains the number of occurrences of each pair of values across the two columns, like this:
new_matrix
     0    1    2   3   4 5
0 2326 2882 2587 734 341 0
1   50   17  103  14   0 6
2  ...
3  ...
4  ...
5  ...
I mean that I want to know how many of each pair (0,0), (0,1), ..., (0,5), ..., (5,5) there are across the two columns.
I tried it with library(plyr):
freq <- ddply(matrix, .(matrix$v5, matrix$v6), nrow)
names(freq) <- c("v5", "v6", "Freq")
But this does not give the needed result.

With the tidyverse, you can arrive at this answer using the usual group_by operations.
Sample data
I'm creating column names to make it easier to convert the matrix to a tibble.
set.seed(123)
M <- matrix(c(sample(0:5, 100, TRUE), sample(0:5, 100, TRUE)),
            ncol = 2,
            nrow = 100,
            dimnames = list(NULL, c("colA", "colB")))
Solution
library("tidyverse")
as_tibble(M) %>%
  arrange(colA, colB) %>%
  group_by(colA, colB) %>%
  summarise(num_pairs = n(), .groups = "drop") %>%
  pivot_wider(names_from = colB, values_from = num_pairs) %>%
  remove_rownames()
Preview
# A tibble: 6 x 7
   colA   `0`   `1`   `2`   `4`   `5`   `3`
  <int> <int> <int> <int> <int> <int> <int>
1     0     4     4     4     2     4    NA
2     1     2     2     4     6     2    NA
3     2     6     4    NA     2     6    NA
4     3     2    NA    NA     4     6     2
5     4    NA     2     6    NA     2     4
6     5     6     2     4     4     2     2
Comments
You have asked:
I mean that I want to know how many of each pair (0,0), (0,1), ..., (0,5), ..., (5,5) there are across the two columns.
This answer gives you that; the question is how important it is for you to have the results stored as a matrix. You can convert the result further into a matrix by using as.matrix on what you get. Personally, I would stop after summarise(num_pairs = n(), .groups = "drop"), as that gives a very usable result that is easy to subset, join, and so forth.
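For completeness, here is one possible way to get from the long summary to the full 6 x 6 matrix (a sketch building on the pipeline above; the complete() step and the 0:5 levels are assumptions about the desired output):
library(tidyverse)
counts <- as_tibble(M) %>%
  group_by(colA, colB) %>%
  summarise(num_pairs = n(), .groups = "drop") %>%
  # pad combinations that never occur so every 0-5 pair is present
  complete(colA = 0:5, colB = 0:5, fill = list(num_pairs = 0))
wide <- counts %>%
  pivot_wider(names_from = colB, values_from = num_pairs)
# turn the wide tibble into a plain 6 x 6 matrix with 0-5 as dimnames
new_matrix <- as.matrix(wide[, -1])
rownames(new_matrix) <- wide$colA
new_matrix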

We can also use table
table(as.data.frame(M))
Output:
# colB
#colA 0 1 2 3 4 5
# 0 4 4 4 0 2 4
# 1 2 2 4 0 6 2
# 2 6 4 0 0 2 6
# 3 2 0 0 2 4 6
# 4 0 2 6 4 0 2
# 5 6 2 4 2 4 2
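If some of the values 0-5 never occur in a column, table() will silently drop that row or column. A small sketch (assuming the same M as above) that forces the full 6 x 6 layout by converting to factors with explicit levels:
d <- as.data.frame(M)
table(factor(d$colA, levels = 0:5),
      factor(d$colB, levels = 0:5))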

Related

create multiple columns at once, depending on the number of columns of another df, with dplyr

I want to create, with dplyr, a data frame with n columns (depending on the number of columns of the df data), where the names share the root TIME., the first column is equal to 1 in all rows, the second equal to 2, and so on. The number of rows is the same as in data.
data <- data.frame(ID=c(1:6), VALUE.1=c(2,5,7,1,3,5), VALUE.2=c(1,7,2,4,5,4), VALUE.3=c(9,2,6,3,4,4), VALUE.4=c(1,2,3,2,3,8))
The first column of data should become the first column of the result. This is what I'd like to have:
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
6 1 2 3 4
Now I'm doing:
T1 <- data.frame(ID = unique(data$ID),
                 TIME.1 = rep(1, length(unique(data$ID))),
                 TIME.2 = rep(2, length(unique(data$ID))),
                 TIME.3 = rep(3, length(unique(data$ID))),
                 TIME.4 = rep(4, length(unique(data$ID))))
We can replace the column contents with the suffix in the column name, then rename the columns from VALUE.n to TIME.n.
library(dplyr)
data %>%
  mutate(across(starts_with("VALUE"), ~ sub("VALUE.", "", cur_column()))) %>%
  rename_with(~ sub("VALUE", "TIME", .x))
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 1 2 3 4
2 2 1 2 3 4
3 3 1 2 3 4
4 4 1 2 3 4
5 5 1 2 3 4
6 6 1 2 3 4
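One detail worth noting: sub() returns character strings, so the TIME columns above are character rather than numeric. If numeric columns are wanted, a small variation (a sketch using the same data) is to wrap the replacement in as.integer():
library(dplyr)
data %>%
  mutate(across(starts_with("VALUE"), ~ as.integer(sub("VALUE.", "", cur_column())))) %>%
  rename_with(~ sub("VALUE", "TIME", .x))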
Here is a base R approach that gives a similar result. It creates a matrix based on your data.frame data, using its dimensions to build the column names and determine the number of rows. We subtract 1 from the number of columns to account for the ID column.
nc <- ncol(data) - 1
nr <- nrow(data)
as.data.frame(cbind(
  ID = data$ID,
  matrix(1:nc, ncol = nc, nrow = nr, byrow = TRUE,
         dimnames = list(NULL, paste0("TIME.", 1:nc)))
))
Output
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 1 2 3 4
2 2 1 2 3 4
3 3 1 2 3 4
4 4 1 2 3 4
5 5 1 2 3 4
6 6 1 2 3 4
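A related base R idea (purely an illustrative sketch, not taken from the answers above) is to use col(), which returns the column index for every cell of a matrix of the right shape:
nc <- ncol(data) - 1                                  # number of VALUE columns
idx <- col(matrix(NA, nrow = nrow(data), ncol = nc))  # column index for each cell
out <- data.frame(ID = data$ID, idx)
names(out)[-1] <- paste0("TIME.", seq_len(nc))
out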

dplyr mean problems (argument is not numeric or logical: returning NA)

DF <- data.frame(id=c(1,1,2,2,3,3,4,4), A = c(1,2,10,4,8,NA,NA,2))
Why doesn't this work?
DF%>%mean(A,na.rm=T)
[1] NA
Warning message:
In mean.default(., A, na.rm = T) :
argument is not numeric or logical: returning NA
But this does:
> mean(DF$A,na.rm=T)
[1] 4.5
glimpse(DF)
Observations: 8
Variables: 2
$ id <chr> "1", "1", "2", "2", "3", "3", "4", "4"
$ A <dbl> 1, 2, 10, 4, 8, NA, NA, 2
The idea later on is to mutate() a new column with the mean for every id.
Best H
EDIT:
Additional question. Thanks for your answers. Now I want to calculate the mean in each group, but duplicated values should only be counted once. See the example.
I want this:
DF<-data.frame(id=c(1,1,1,2,2,2,3,3,3,4,4,4), A=c(2,2,1,1,2,3,4,4,1,NA,2,2))
> DF
id A
1 1 2
2 1 2
3 1 1
4 2 1
5 2 2
6 2 3
7 3 4
8 3 4
9 3 1
10 4 NA
11 4 2
12 4 2
To end like this:
id A mean
1 1 2 1.5
2 1 2 1.5
3 1 1 1.5
4 2 1 2
5 2 2 2
6 2 3 2
7 3 4 2.5
8 3 4 2.5
9 3 1 2.5
10 4 NA 2
11 4 2 2
12 4 2 2
mean expects a vector, but the pipe passes the whole data frame, so 'A' is not getting extracted. We can use .$:
library(dplyr)
DF %>%
  {mean(.$A, na.rm = TRUE)}
#[1] 4.5
Or if we want to avoid the {}:
DF %>%
  .$A %>%   # or use: pull(A)
  mean(na.rm = TRUE)
#[1] 4.5
The mean function takes vectors, not data frames, as its argument, so you can't just pipe in DF. You have to use summarize:
DF %>%
summarize(mean(A, na.rm = TRUE))
mean(A, na.rm = TRUE)
1 4.5
If you want a group-wise mean, you can use group_by:
DF %>%
group_by(id) %>%
summarize(mean(A, na.rm = TRUE))
id `mean(A, na.rm = TRUE)`
<dbl> <dbl>
1 1 1.5
2 2 7
3 3 8
4 4 2
And if you want to keep every row but add on the grouped means, you replace summarize with mutate:
DF %>%
group_by(id) %>%
mutate(mean(A, na.rm = TRUE))
# Groups: id [4]
id A `mean(A, na.rm = TRUE)`
<dbl> <dbl> <dbl>
1 1 1 1.5
2 1 2 1.5
3 2 10 7
4 2 4 7
5 3 8 8
6 3 NA 8
7 4 NA 2
8 4 2 2
EDIT:
If you want to keep all the rows but only count distinct ones for your average, you can use row_number to reset for each unique row and then weight your mean based on whether the row number is 1:
DF <- data.frame(id = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 A = c(2,2,1,1,2,3,4,4,1,NA,2,2))
DF %>%
  group_by(id, A) %>%
  mutate(count = row_number()) %>%
  group_by(id) %>%
  mutate(mean = weighted.mean(A, count == 1, na.rm = TRUE))
id A count mean
<dbl> <dbl> <int> <dbl>
1 1 2 1 1.5
2 1 2 2 1.5
3 1 1 1 1.5
4 2 1 1 2
5 2 2 1 2
6 2 3 1 2
7 3 4 1 2.5
8 3 4 2 2.5
9 3 1 1 2.5
10 4 NA 1 2
11 4 2 1 2
12 4 2 2 2
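An alternative sketch (assuming, as in the example, that duplicated A values within an id should only count once) is to take the mean of the distinct values directly:
library(dplyr)
DF %>%
  group_by(id) %>%
  mutate(mean = mean(unique(A), na.rm = TRUE)) %>%
  ungroup()
For the example data this gives the same means (1.5, 2, 2.5, 2) as the weighted.mean approach.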

How to use table() with dplyr group_by, map from purrr, and a list of data frames/tibbles? (in R)

Question
How can I create tables using table() with a list of data frames/tibbles, while grouping by two variables (for example, a sequence of days, e.g. {1,2,...,10}, and a factor {0,1,2,3,4})?
Data example
For example:
ldf <- lapply(1:30, function(x)
  as.data.frame(cbind(sample(1:3, 10, replace = TRUE),
                      sample(1:3, 10, replace = TRUE),
                      seq(1:5),
                      sample(0:4, 10, replace = TRUE))))
which gives something like:
[[1]]
V1 V2 V3 V4
1 3 1 1 4
2 1 3 2 2
3 2 2 3 3
4 3 1 4 1
5 1 1 5 3
6 1 1 1 4
7 1 1 2 2
8 3 3 3 3
9 2 2 4 1
10 1 1 5 3
[[2]]
V1 V2 V3 V4
1 2 1 1 2
2 3 1 2 0
3 1 1 3 4
4 3 1 4 0
5 2 1 5 0
6 2 2 1 2
7 2 2 2 0
8 2 2 3 4
9 2 1 4 0
10 2 3 5 3
...
V1 and V2 are transition states that I want to tabulate, e.g. table(df$V1, df$V2), while V3 (the day) and V4 (a factor between 0 and 4) are the variables I want to group by.
Expected output
I would like to get a table grouped by V3 and V4 for each data.frame/tibble in the list of data.frame/tibbles and save it back into another list of objects.
visual example (not actual data)
data.frame 1
group by v3=1 & v4=0
1 2 3
1 0 1 2
2 0 3 4
3 4 5 6
data.frame 1
group by v3=1 & v4=1
1 2 3
1 1 7 8
2 2 6 9
3 4 5 0
...
data.frame 1
group by v3=2 & v4=0
1 2 3
1 5 4 4
2 6 5 3
3 7 8 4
...
data.frame 2
...
data.frame 3
...
etc...
We can split the data.frame by 'V3' and 'V4' and get the table of 'V1' vs 'V2':
lst2 <- lapply(ldf[1:2], function(dat)
  lapply(split(dat[1:2], dat[3:4], drop = TRUE), function(x) {
    lvls <- sort(unique(unlist(x)))
    table(factor(x[[1]], levels = lvls), factor(x[[2]], levels = lvls))
  }))
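To inspect one of the results (a small usage sketch), index into the nested list; the inner names come from split() and combine the V3 and V4 values:
names(lst2[[1]])   # one (V3, V4) combination per name
lst2[[1]][[1]]     # the V1 x V2 table for the first combination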
With the tidyverse, here is an option:
library(purrr)
library(tidyr)
library(dplyr)
map(ldf[1:2], ~
  .x %>%
    group_split(V3, V4) %>%
    map(~ .x %>%
          unite(V3V4, V3, V4) %>%
          group_by_all %>%
          summarise(n = n()) %>%
          ungroup %>%
          complete(V1 = sort(unique(unlist(select(., V1, V2)))),
                   V2 = sort(unique(unlist(select(., V1, V2)))),
                   fill = list(n = 0)) %>%
          pivot_wider(names_from = V2, values_from = n,
                      values_fill = list(n = 0)) %>%
          fill(V3V4, .direction = "updown")))

New variable that indicates the first occurrence of a specific value

I want to create a new variable that indicates the first occurrence of a specific value of another variable.
In the following example dataset I want a new variable "firstna" that is 1 for the first NA in points for each player.
game_data <- data.frame(player = c(1,1,1,1,2,2,2,2), level = c(1,2,3,4,1,2,3,4), points = c(20,NA,NA,NA,20,40,NA,NA))
game_data
player level points
1 1 1 20
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 2 1 20
6 2 2 40
7 2 3 NA
8 2 4 NA
The resulting dataframe should look like this:
game_data_new <- data.frame(player = c(1,1,1,1,2,2,2,2), level = c(1,2,3,4,1,2,3,4), points = c(20,NA,NA,NA,20,40,NA,NA), firstna = c(0,1,0,0,0,0,1,0))
game_data_new
player level points firstna
1 1 1 20 0
2 1 2 NA 1
3 1 3 NA 0
4 1 4 NA 0
5 2 1 20 0
6 2 2 40 0
7 2 3 NA 1
8 2 4 NA 0
To be honest, I don't know how to do this. It would be perfect if there were a dplyr option for it.
A base R solution:
ave(game_data$points, game_data$player,
FUN = function(x) seq_along(x) == match(NA, x, nomatch = 0))
Another ave option to find the first NA by group (player):
game_data$firstna <- ave(game_data$points, game_data$player,
FUN = function(x) cumsum(is.na(x)) == 1)
game_data
# player level points firstna
#1 1 1 20 0
#2 1 2 NA 1
#3 1 3 NA 0
#4 1 4 NA 0
#5 2 1 20 0
#6 2 2 40 0
#7 2 3 NA 1
#8 2 4 NA 0
Here is a solution with data.table:
library("data.table")
game_data <- data.table(player = c(1,1,1,1,2,2,2,2), level = c(1,2,3,4,1,2,3,4), points = c(20,NA,NA,NA,20,40,NA,NA))
game_data[, firstna:=is.na(points) & !is.na(shift(points)), player][]
# > game_data[, firstna:=is.na(points) & !is.na(shift(points)), player][]
# player level points firstna
# 1: 1 1 20 FALSE
# 2: 1 2 NA TRUE
# 3: 1 3 NA FALSE
# 4: 1 4 NA FALSE
# 5: 2 1 20 FALSE
# 6: 2 2 40 FALSE
# 7: 2 3 NA TRUE
# 8: 2 4 NA FALSE
You can do this by grouping by player and then mutating to check whether a row has an NA value and the previous row doesn't:
game_data %>%
  group_by(player) %>%
  mutate(firstna = ifelse(is.na(points) & lag(!is.na(points)), 1, 0)) %>%
  ungroup()
Result:
# A tibble: 8 x 4
# Groups: player [2]
player level points firstna
<dbl> <dbl> <dbl> <dbl>
1 1 1 20 0
2 1 2 NA 1
3 1 3 NA 0
4 1 4 NA 0
5 2 1 20 0
6 2 2 40 0
7 2 3 NA 1
8 2 4 NA 0
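A small variant of the same idea (a sketch, not part of the answer above) uses the default argument of lag() so that an NA in the very first row of a player's group is also flagged rather than producing NA:
library(dplyr)
game_data %>%
  group_by(player) %>%
  # default = 0 means the "previous" value before the first row counts as non-NA
  mutate(firstna = as.numeric(is.na(points) & !is.na(lag(points, default = 0)))) %>%
  ungroup()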
library(tidyverse)
library(data.table)

data.frame(
  player = c(1,1,1,1,2,2,2,2),
  level = c(1,2,3,4,1,2,3,4),
  points = c(20,NA,NA,NA,20,40,NA,NA)
) -> game_data

game_data_base1 <- game_data
game_data_dt <- data.table(game_data)

microbenchmark::microbenchmark(
  better_base = game_data$first_na <- ave(
    game_data$points,
    game_data$player,
    FUN = function(x) seq_along(x) == match(NA, x, nomatch = 0)
  ),
  brute_base = do.call(
    rbind.data.frame,
    lapply(
      split(game_data, game_data$player),
      function(x) {
        x$firstna <- 0
        na_loc <- which(is.na(x$points))
        if (length(na_loc) > 0) x$firstna[na_loc[1]] <- 1
        x
      }
    )
  ),
  tidy = game_data %>%
    group_by(player) %>%
    mutate(firstna = as.numeric(is.na(points) & !duplicated(points))) %>%
    ungroup(),
  dt = game_data_dt[, firstna := as.integer(is.na(points) & !is.na(shift(points))), player]
)
## Unit: microseconds
##         expr     min       lq      mean    median        uq        max neval
##  better_base 125.188  156.861  362.9829  191.6385  355.6675   3095.958   100
##   brute_base 366.642  450.002 2782.6621  658.0380 1072.6475 174373.974   100
##         tidy 998.924 1119.022 2528.3687 1509.0705 2516.9350  42406.778   100
##           dt 330.428  421.211 1031.9978  535.8415 1042.1240   9671.991   100
game_data %>%
  group_by(player) %>%
  mutate(firstna = as.numeric(is.na(points) & !duplicated(points)))
Group by player, then create a boolean vector marking rows whose points value is NA and has not appeared in the player's earlier rows.
# A tibble: 8 x 4
# Groups: player [2]
player level points firstna
<dbl> <dbl> <dbl> <dbl>
1 1 1 20 0
2 1 2 NA 1
3 1 3 NA 0
4 1 4 NA 0
5 2 1 20 0
6 2 2 40 0
7 2 3 NA 1
8 2 4 NA 0
If you want the 1s on the last non-NA line before an NA, replace the mutate line with this:
mutate(lastnonNA=as.numeric(!is.na(points) & is.na(lead(points))))
First row of a block of NAs that runs all the way to the end of the player's group:
game_data %>%
group_by(player) %>%
mutate(firstna=as.numeric(is.na(points) & !duplicated(cbind(points,cumsum(!is.na(points))))))
Another way using base:
game_data$firstna <- unlist(
  tapply(game_data$points, game_data$player, function(x) {
    i <- which(is.na(x))[1]
    x[] <- 0
    x[i] <- 1
    x
  })
)
or as another ?ave clone:
ave(game_data$points, game_data$player, FUN = function(x) {
  i <- which(is.na(x))[1]
  x[] <- 0
  x[i] <- 1
  x
})
An option using diff
transform(game_data, firstna = ave(is.na(points), player, FUN = function(x) c(0,diff(x))))
# player level points firstna
# 1 1 1 20 0
# 2 1 2 NA 1
# 3 1 3 NA 0
# 4 1 4 NA 0
# 5 2 1 20 0
# 6 2 2 40 0
# 7 2 3 NA 1
# 8 2 4 NA 0
And its dplyr equivalent:
library(dplyr)
game_data %>% group_by(player) %>% mutate(firstna = c(0,diff(is.na(points))))
# # A tibble: 8 x 4
# # Groups: player [2]
# player level points firstna
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 20 0
# 2 1 2 NA 1
# 3 1 3 NA 0
# 4 1 4 NA 0
# 5 2 1 20 0
# 6 2 2 40 0
# 7 2 3 NA 1
# 8 2 4 NA 0

R: Join with duplicates only once

I need help joining two data frames by one key that has duplicates.
I want the value merged only once for each duplicated key, and I can't do it with dplyr::left_join.
Example:
ds1 <- data.frame(
  id = c(1,1,1,2,2),
  V2 = c(5,6,7,5,8)
)
ds2 <- data.frame(
  id = c(1,2),
  Value = c(56,98)
)
ds3 <- left_join(ds1, ds2, by = "id")
In this case I have:
# id V2 Value
1 1 5 56
2 1 6 56
3 1 7 56
4 2 5 98
5 2 8 98
But I need:
# id V2 Value
1 1 5 56
2 1 6
3 1 7
4 2 5 98
5 2 8
Keep your code and just add this:
ds3$Value[duplicated(ds3[c("Value","id")])] <- NA
# id V2 Value
# 1 1 5 56
# 2 1 6 NA
# 3 1 7 NA
# 4 2 5 98
# 5 2 8 NA
Here is another idea using slice, left_join, and then full_join.
ds3 <- ds1 %>%
  group_by(id) %>%
  slice(1) %>%
  left_join(ds2, by = "id") %>%
  full_join(ds1, by = c("id", "V2")) %>%
  ungroup() %>%
  arrange(id, V2)
ds3
# # A tibble: 5 x 3
# id V2 Value
# <dbl> <dbl> <dbl>
# 1 1. 5. 56.
# 2 1. 6. NA
# 3 1. 7. NA
# 4 2. 5. 98.
# 5 2. 8. NA
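Another way to express the same idea with a single left_join (a sketch, assuming only the first row of each id should keep Value) is to blank out Value on the duplicated id rows:
library(dplyr)
ds1 %>%
  left_join(ds2, by = "id") %>%
  # duplicated(id) is TRUE for every repeat of an id after its first row
  mutate(Value = replace(Value, duplicated(id), NA))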
