Calculating share of data.frame row pairs that match - r

I have a dataset with an id variable and several other variables, similar to this:
mydata <- tibble::tribble(
~idvar, ~age,
1, 18,
1, 18,
2, 27,
3, 89,
4, 89,
5, 12,
1, 17,
2, 27,
2, 28,
3, 41
)
For each value of idvar, I want to calculate the rate at which, given idvar is the same between a pair of rows, age is also the same. In other words, I want to know:
PR(age match | id match)
For example, there are three rows with idvar == 1, which form three pairs of rows. For one of those pairs, age also matches. So we would return .333 for idvar == 1.
Desired output:
1 .333
2 .333
3 0
4 NA
5 NA

You could use table from base R. From the manual for ?base::table:
table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
In other words, we can use it to count the number of entries for each unique value of age. Where the count is more than 1, we know we have a match (or repeated value) somewhere in age.
table(mydata$age)
12 17 18 27 28 41 89
1 1 2 2 1 1 2
For your given example, we will not do this for all of age at once. Instead, we will need to group by idvar first.
Additionally, we need to use the binomial coefficient on each element of table(age) to determine how many pairs are possible, and then sum them all up to get the total number of pairs in the numerator. In R, the choose(n,k) function is the binomial coefficient. The denominator is just choose(.N, 2) (in data.table, .N is the number of rows in the current group), which is the number of all possible pairs for the group.
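As a quick check of the arithmetic described above, here is a minimal sketch for a single group, using the three ages observed for idvar == 1 in the example data. The variable names (ages, counts, matching_pairs, total_pairs) are only illustrative.

```r
# Ages of the three rows with idvar == 1 in the example data
ages <- c(18, 18, 17)

# How often each distinct age occurs: 17 -> 1, 18 -> 2
counts <- as.vector(table(ages))

# Pairs of rows sharing an age: choose(1, 2) + choose(2, 2) = 0 + 1
matching_pairs <- sum(choose(counts, 2))

# All possible pairs of rows in the group: choose(3, 2) = 3
total_pairs <- choose(length(ages), 2)

matching_pairs / total_pairs
# [1] 0.3333333
```

Note that choose(1, 2) is 0, so ages that occur only once contribute nothing to the numerator.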
Putting it all together:
library(data.table)
setDT(mydata)
# Helper function
count_pairs <- function(x) {
  if (length(x) > 1) {                  # more than 1 row in the group
    counts <- table(x)
    repeated <- counts[counts > 1]      # values that occur more than once
    if (length(repeated) > 0) {         # there is at least 1 match
      sum(sapply(repeated, function(z) choose(z, 2)))
    } else {
      0                                 # no matches
    }
  } else {
    NA_real_                            # only 1 row
  }
}
mydata[, count_pairs(age) / choose(.N, 2), by = idvar]
idvar V1
1: 1 0.3333333
2: 2 0.3333333
3: 3 0.0000000
4: 4 NA
5: 5 NA
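If you would rather avoid the data.table dependency, the same calculation can be sketched in base R with split and sapply. This is an alternative sketch, not part of the answer above; pair_match_rate is a hypothetical helper name, and the data is rebuilt as a plain data.frame so the block runs on its own.

```r
# Same example data as in the question, as a plain data.frame
mydata <- data.frame(
  idvar = c(1, 1, 2, 3, 4, 5, 1, 2, 2, 3),
  age   = c(18, 18, 27, 89, 89, 12, 17, 27, 28, 41)
)

# Share of row pairs within one group whose values match
pair_match_rate <- function(x) {
  n <- length(x)
  if (n < 2) return(NA_real_)   # a single row forms no pairs
  sum(choose(as.vector(table(x)), 2)) / choose(n, 2)
}

# Split age by idvar and apply the helper to each group
rates <- sapply(split(mydata$age, mydata$idvar), pair_match_rate)
rates
```

The explicit n < 2 check matters here: without it, a single-row group would give 0 / 0, which is NaN rather than NA.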

Related

Identifying columns with high correlation in large dataset

I have two large dataframes (50+ columns, many of them long character variables) and I need to identify the "link" variable that I should use to merge them together. The problem is that the variable names don't match up. That is, I need to identify variables in the two datasets whose values have a high correlation.
As an example :
dta1 = data.frame(A = c(1 , 2,3, 4), B = c( 23, 45, 6, 8), C = c("001", "028", "076", "039"))
dta2 = data.frame(first = c(5, 6, 7, 8), second = c( 58, 32, 33, 45), third = c("008", "028", "076", "039"))
I would like the code to tell me that columns C and third have a very high correlation (they are not complete duplicates though!).
I have tried adding the two dataframes and running a cor() function, but this doesn't work with character variables.
Also tried union_all(x, y, ...) from dplyr but that requires the same column names.
At this point I am out of ideas.
Thanks very much.
To identify the columns most similar, try the following. It systematically compares the values from each column in dta1 with the columns in dta2. It returns a matrix.
sapply(dta1, function(x) sapply(dta2, function(y) sum(x == y)))
A B C
first 0 1 0
second 0 0 0
third 0 0 3
From here we can see that third and C have the most matches. Now you can join your two data.frames. To keep all rows and columns, you will want a full_join from the dplyr package.
library(dplyr)
full_join(dta1, dta2, by = c("C" = "third"))
A B C first second
1 1 23 001 NA NA
2 2 45 028 6 32
3 3 6 076 7 33
4 4 8 039 8 45
5 NA NA 008 5 58
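One caveat about the comparison above: x == y compares values position by position, so it only finds the link column when both data frames happen to be in the same row order. A hedged variation, assuming the rows may be ordered differently, is to measure what share of each column's values appears anywhere in the other column. share_found is a hypothetical helper name, and values are compared as character strings.

```r
dta1 <- data.frame(A = c(1, 2, 3, 4), B = c(23, 45, 6, 8),
                   C = c("001", "028", "076", "039"))
dta2 <- data.frame(first = c(5, 6, 7, 8), second = c(58, 32, 33, 45),
                   third = c("008", "028", "076", "039"))

# Share of the values in x that occur anywhere in y (order-independent)
share_found <- function(x, y) mean(as.character(x) %in% as.character(y))

# Rows are the columns of dta2, columns are the columns of dta1
overlap <- sapply(dta1, function(x) sapply(dta2, function(y) share_found(x, y)))
overlap
```

Here C and third still stand out, with three of the four values in C found in third. Short numeric columns can overlap by chance (B and first share the values 6 and 8), so treat the result as a hint rather than proof.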

How to organize based on specific data values

I have this data frame:
Data <- structure(list(ID = c(101, 102, 103, 104, 105, 106),
`1Var` = c(1, 3, 3, 1, 1, 1), `2Var` = c(1, 1, 1, 1, 1, 1),
`3Var` = c(3, 1, 1, 1, 1, 1), `4Var` = c(1, 1, 1, 1, 1, 1)),
row.names = c(NA, 6L), class = "data.frame")
I have been trying to subset based on values of 1 and 0. In this data table there are no 0 values but my full data has it.
I toyed around with this method:
Prime <- grep('$Var', names(Data))
DataPrime <- Data[rowSums(Data[Prime] <= 1),]
I am getting duplicated observations though. Another issue with this method is that it keeps all rows that have a 1 or 0 but not rows with ONLY 1 or 0. So, some rows that have 3 but the rest of the variables are value of 1 that row is still kept in my data.
I think my method will work but I'm not sure what else I need to specify in the argument. I tried a simple subset too but that removed everything from the data:
DataPrime <- subset(Data, '1Var' <=1, '2Var' <=1, '3Var' <=1, '4Var' <=1)
I essentially want my data to look something like this:
ID 1Var 2Var 3Var 4Var
4 104 1 1 1 1
5 105 1 1 1 1
6 106 1 1 1 1
We can use Reduce with & to create a logical vector for subsetting the rows
subset(Data, Reduce(`&`, lapply(Data[-1], `<=`, 1)))
-output
# ID 1Var 2Var 3Var 4Var
#4 104 1 1 1 1
#5 105 1 1 1 1
#6 106 1 1 1 1
Or another option is rowSums
subset(Data, !rowSums(Data[-1] > 1))
I think you're looking for something like:
Prime <- grep('\\dVar', names(Data))
Data[apply(Data[Prime], 1, function(x) !any(x > 1)),]
#> ID 1Var 2Var 3Var 4Var
#> 4 104 1 1 1 1
#> 5 105 1 1 1 1
#> 6 106 1 1 1 1
A few things to note are:
Your regex inside grep was wrong. The "$" symbol represents the end of a string, not a number. For numbers you can use \\d . Your Prime variable is therefore empty in the example.
It's best not to have column names (or any variable name) starting with numbers. These are not legal names in R. You can get round this by surrounding them with backticks, but this is easy to overlook and is a source of bugs.
rowSums adds up all the values in each row (on the raw values, the lowest row sum here is 4), whereas rowSums(Data[Prime] <= 1) counts the entries in each row that are one or less, giving a vector like c(3, 3, 3, 4, 4, 4). That is a vector of numbers, not a logical vector, so subsetting Data by it selects rows by position: three copies of row 3, then three copies of row 4, which clearly isn't what you want.
In subset, you need the logical conjunction of all your var <= 1 terms, so you should split these with &, not with commas.
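The points above can be checked with a short sketch. It rebuilds the example data with check.names = FALSE so the numeric-leading column names survive; n_small and keep are illustrative names, not part of the original code.

```r
Data <- data.frame(
  ID = c(101, 102, 103, 104, 105, 106),
  `1Var` = c(1, 3, 3, 1, 1, 1),
  `2Var` = c(1, 1, 1, 1, 1, 1),
  `3Var` = c(3, 1, 1, 1, 1, 1),
  `4Var` = c(1, 1, 1, 1, 1, 1),
  check.names = FALSE
)

# Columns whose names contain a digit followed by "Var"
Prime <- grep("\\dVar", names(Data))

# Number of entries per row that are <= 1: counts, not TRUE/FALSE
n_small <- rowSums(Data[Prime] <= 1)
n_small

# Turn the counts into a logical index: keep rows where every entry is <= 1
keep <- n_small == length(Prime)
Data[keep, ]

# The subset() version, with backticked names joined by & instead of commas
subset(Data, `1Var` <= 1 & `2Var` <= 1 & `3Var` <= 1 & `4Var` <= 1)
```

Both of the last two calls keep only rows 4 to 6 (IDs 104, 105 and 106).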

List within a data.frame column

So far I have the following data.frame, with an initial column filled with set values:
df <- data.frame(N=seq(10, 100, by=10))
Now, I want to have a second column here, which would be a list (or c()) of integers, such that the output of calling df would be as follows:
N I
1 10 2, 8, 1
2 20 4, 0, 99
.. .. ..
I tried doing the following, where df <- data.frame(N=seq(10, 100, by=10), I=logical(10)), which puts a FALSE in each row of the new column. But trying to test what I wanted to do using df$I[df$N == 10] <- list(2, 8, 1) throws the error:
number of items to replace is not a multiple of replacement length
Edit: I also tried using I(list(...)) to keep the list interpreted as is, but the same error was thrown.
We can create the list by wrapping with I in data.frame and then assign by extracting the list element that corresponds to the index provided by the logical vector
df <- data.frame(N=seq(10, 100, by=10), I= I(vector('list', 10)))
df$I[df$N == 10][[1]] <- list(2, 8, 1)
df
# N I
#1 10 2, 8, 1
#2 20
#3 30
#4 40
#5 50
#6 60
#7 70
#8 80
#9 90
#10 100
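As a small follow-up sketch, assuming you want each cell to hold a plain integer vector rather than a nested list, you can assign into a single cell with [[ and read it back the same way. The values for N == 20 are taken from the desired output in the question.

```r
df <- data.frame(N = seq(10, 100, by = 10), I = I(vector("list", 10)))

# Store an integer vector in the cell of the row where N == 20
df$I[[which(df$N == 20)]] <- c(4L, 0L, 99L)

# Read the cell back; [[ returns the vector itself
df$I[[2]]
# [1]  4  0 99

# Number of values stored in each row (0 for cells that are still empty)
lengths(df$I)
```

Using [[ addresses exactly one cell, which avoids the "number of items to replace" error that appears when [ is given a replacement of a different length.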

Combining/summing two positions in a vector of integers in R

I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have,2)) will sum two randomly selected positions (or sum(sample(ex.have,rpois(1,2))) will add some randomness in the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: The proportion of events. This is a number greater than 0 and at most 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc, rounded to the nearest integer.
sample_n: The candidate merge sizes. This is an integer vector in which every number is greater than or equal to 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 || event_perc <= 0){
    stop("event_perc should be greater than 0 and at most 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("sample_n values should be at least 2")
  }
  # Determine the number of events
  n <- round(length(x) * event_perc)
  # Determine the sample size of each event
  sample_vec <- sample(sample_n, size = n, replace = TRUE)
  # sprintf() returns character(0) when n is 0, unlike paste0("S", 1:n)
  names(sample_vec) <- sprintf("S%d", seq_len(n))
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n")
  }
  # Determine how many values will not be merged
  n2 <- length(x) - sum(sample_vec)
  # Create a vector of n2 ones, one per unmerged value
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- sprintf("N%d", seq_len(n2))
  # Combine sample_vec and non_merge_vec, then randomly reorder the vector
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- sample(combine_vec, size = length(combine_vec))
  # Expand the vector
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x,
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  # Sum the values and count the positions within each group;
  # selecting the columns by name keeps "number" and "index" as column names
  dat2 <- aggregate(dat[c("number", "index")],
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, remove the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example, 4 and 5 are merged, and 8 and 9 are also merged.
set.seed(123)
vec_merge(1:10)
# number index
# [1,] 1 1
# [2,] 2 1
# [3,] 3 1
# [4,] 9 2
# [5,] 6 1
# [6,] 7 1
# [7,] 17 2
# [8,] 10 1
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events)
  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)
  # Combine both groups and shuffle them so that the
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values),
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
The count of values that were included in a particular sum (your ex.indices).
The total after summing relevant values (your ex.want).
The positions of the original values from the input vector.
The original values themselves, in case you want to verify it later.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
## count total pos original
## 1: 1 4 1 4
## 2: 1 7 2 7
## 3: 2 2 3, 9 1, 1
## 4: 1 2 4 2
## 5: 3 40 5, 8, 10 11, 19, 10
## 6: 1 14 6 14
## 7: 1 18 7 18
fun(events = c(3, 4))
## count total pos original
## 1: 4 39 1, 4, 6, 8 4, 2, 14, 19
## 2: 3 36 2, 5, 7 7, 11, 18
## 3: 1 1 3 1
## 4: 1 1 9 1
## 5: 1 10 10 10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
## count total pos original
## 1: 7 92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2: 1 10 2 10
## 3: 3 71 6, 9, 14 20, 26, 25
## 4: 4 69 7, 12, 13, 16 28, 23, 6, 12
## 5: 2 45 8, 17 20, 25
## 6: 1 12 10 12
## 7: 1 8 15 8
## 8: 1 23 18 23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92
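If all you need is the two-column matrix from the question, the same grouping idea can be sketched in base R with tapply. This is an illustrative variation, not the function above: merge_positions is a hypothetical name, and the rows come out ordered by group id (merged groups first) rather than by position in the input.

```r
merge_positions <- function(x, events) {
  if (sum(events) > length(x)) stop("Too many events to merge")
  # One group id per merging event, repeated once per merged position
  merged <- rep(seq_along(events), events)
  # A separate group id for every position that is left alone
  alone <- length(events) + seq_len(length(x) - sum(events))
  # Shuffle the ids so merged positions need not be consecutive
  groups <- c(merged, alone)
  groups <- groups[sample.int(length(groups))]
  # Sum the values and count the positions within each group
  cbind(total = as.vector(tapply(x, groups, sum)),
        count = as.vector(tapply(x, groups, length)))
}

set.seed(42)
ex.have <- c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9)
res <- merge_positions(ex.have, events = c(2, 3))
res
```

Whatever the random draw, the result has 7 rows here (10 positions, minus the 5 merged ones, plus 2 merged groups), the total column sums to sum(ex.have), and the count column sums to length(ex.have).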

Sum of only consecutive number

I have a vector like this:
a <- c(1,2,3,7,8,14,17,18)
I want to sum only the runs of consecutive numbers, and get an answer like this using R:
"6, 15, 14, 35"
I would really appreciate your help.
Using tapply to group by consecutive values,
tapply(a, cumsum(c(FALSE, diff(a) != 1)), sum)
# 0 1 2 3
# 6 15 14 35
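To see why this works, it can help to look at the intermediate steps. The sketch below breaks the one-liner apart; breaks, run_id and sums are illustrative names.

```r
a <- c(1, 2, 3, 7, 8, 14, 17, 18)

# TRUE wherever a value does not follow its predecessor by exactly 1;
# the leading FALSE stands for the first element, which starts the first run
breaks <- c(FALSE, diff(a) != 1)

# The running count of breaks gives every run of consecutive numbers its own id
run_id <- cumsum(breaks)
run_id
# [1] 0 0 0 1 1 2 3 3

# Sum within each run, then format as the comma-separated string from the question
sums <- tapply(a, run_id, sum)
paste(sums, collapse = ", ")
# [1] "6, 15, 14, 35"
```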
