Related
I have a dataset with an id variable and several other variables, similar to this:
mydata <- tibble::tribble(
~idvar, ~age,
1, 18,
1, 18,
2, 27,
3, 89,
4, 89,
5, 12,
1, 17,
2, 27,
2, 28,
3, 41
)
For each value of idvar, I want to calculate the rate at which, given idvar is the same between a pair of rows, age is also the same. In other words, I want to know:
PR(age match | id match)
For example, there are three rows with idvar == 1, which form three pairs of rows. For one of those pairs, age also matches. So we would return .333 for idvar == 1.
Desired output:
1 .333
2 .333
3 0
4 NA
5 NA
You could use table from base R. From the manual for ?base::table:
table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.
In other words, we can use it to count the number of entries for each unique value of age. Where the count is more than 1, we know we have a match (or repeated value) somewhere in age.
table(mydata$age)
12 17 18 27 28 41 89
1 1 2 2 1 1 2
For your given example, we will not do this for all of age at once. Instead, we will need to group by idvar first.
Additionally, we need to use the binomial coefficient on each element of table(age) to determine how many pairs are possible, and then sum them all up to get the total number of pairs in the numerator. In R, the choose(n,k) function is the binomial coefficient. The denominator is just choose(.N, 2) (in data.table, .N is the number of rows in the current group), which is the number of all possible pairs for the group.
Putting it all together:
library(data.table)
setDT(mydata)
# Helper function
count_pairs <- function(x) {
if (length(x) > 1) { # if more than 1 row
if (length(table(x)[table(x) > 1]) > 0) { # if there is at least 1 match
sum(sapply(table(x)[table(x) > 1], function(z) choose(z, 2)))
} else {
0 # no matches
}
} else {
NA_real_ # only 1 row
}
}
mydata[, count_pairs(age) / choose(.N, 2), by = idvar]
idvar V1
1: 1 0.3333333
2: 2 0.3333333
3: 3 0.0000000
4: 4 NA
5: 5 NA
I have two large dataframes (50+ columns and many are long character vars) and I need to identify the "link" variable that I should use to merge them together. The problem is the name of the variables don't match up. That is I need to identify variables in the two datasets where the values have a high correlation.
As an example :
dta1 = data.frame(A = c(1 , 2,3, 4), B = c( 23, 45, 6, 8), C = c("001", "028", "076", "039"))
dta2 = data.frame(first = c(5, 6, 7, 8), second = c( 58, 32, 33, 45), third = c("008", "028", "076", "039"))
I would like the code to tell me that columns C and third have a very high correlation (they are not complete duplicates though!).
I have tried adding the two dataframes and running a cor() function, but this doesn't work with character variables.
Also tried union_all(x, y, ...) from dplyr but that requires the same column names.
At this point I am out of ideas.
Thanks very much.
To identify the columns most similar, try the following. It systematically compares the values from each column in dta1 with the columns in dta2. It returns a matrix.
sapply(dta1, function(x) sapply(dta2, function(y) sum(x == y)))
A B C
first 0 1 0
second 0 0 0
third 0 0 3
From here we can see that third and C have the most matches. Now you can join your two data.frames. To keep all rows and columns, you will want a full_join from the dplyr package.
library(dplyr)
full_join(dta1, dta2, by = c("C" = "third"))
A B C first second
1 1 23 001 NA NA
2 2 45 028 6 32
3 3 6 076 7 33
4 4 8 039 8 45
5 NA NA 008 5 58
I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum) in the vector. This process could happen multiple times, i.e. in a vector of 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
But in addition I also need to record how many vector positions were "merged," (including the value 1 if the position in the vector was not merged) - terming them indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have,2)) will sum two randomly selected positions (or sum(sample(ex.have,rpois(1,2)) will add some randomness in the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: The percentage of an event. This is a number of between 0 to 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc.
sample_n: The merge sample numbers. This is an integer vector with all numbers larger or at least equal to 2.
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
# Check if event_perc makes sense
if (event_perc > 1 | event_perc <= 0){
stop("event_perc should be between 0 to 1.")
}
# Check if sample_n makes sense
if (any(sample_n < 2)){
stop("sample_n should be at least larger than 2")
}
# Determine the event numbers
n <- round(length(x) * event_perc)
# Determine the sample number of each event
sample_vec <- sample(sample_n, size = n, replace = TRUE)
names(sample_vec) <- paste0("S", 1:n)
# Check if the sum of sample_vec is larger than the length of x
# If yes, stop the function and print a message
if (length(x) < sum(sample_vec)){
stop("Too many samples. Decrease event_perc or sampel_n")
}
# Determine the number that will not be merged
n2 <- length(x) - sum(sample_vec)
# Create a vector with replicated 1 based on m
non_merge_vec <- rep(1, n2)
names(non_merge_vec) <- paste0("N", 1:n2)
# Combine sample_vec and non_merge_vec, and then randomly sorted the vector
combine_vec <- c(sample_vec, non_merge_vec)
combine_vec2 <- sample(combine_vec, size = length(combine_vec))
# Expand the vector
expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
expand_vec <- inverse.rle(expand_list)
# Create a data frame with x and expand_vec
dat <- data.frame(number = x,
group = factor(expand_vec, levels = unique(expand_vec)))
dat$index <- 1
dat2 <- aggregate(cbind(dat$number, dat$index),
by = list(group = dat$group),
FUN = sum)
# # Convert dat2 to a matrix, remove the group column
dat2$group <- NULL
mat <- as.matrix(dat2)
return(mat)
}
Here is a test for the function. I applied the function to the sequence from 1 to 10. As you can see, in this example, 4 and 5 is merged, and 8 and 9 is also merged.
set.seed(123)
vec_merge(1:10)
# number index
# [1,] 1 1
# [2,] 2 1
# [3,] 3 1
# [4,] 9 2
# [5,] 6 1
# [6,] 7 1
# [7,] 17 2
# [8,] 10 1
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
if (sum(events) > length(vec)) stop("Too many events to merge")
# Create "groups" for the events
merge_events_seq <- rep(seq_along(events), events)
# Create "groups" for the rest of the data
remainder <- sequence((length(vec) - sum(events))) + length(events)
# Combine both groups and shuffle them so that the
# positions being combined are not necessarily consecutive
inds <- sample(c(merge_events_seq, remainder))
# Aggregate using `data.table`
temp <- data.table(values = vec, groups = inds)[
, list(count = length(values),
total = sum(values),
pos = toString(.I),
original = toString(values)), groups][, groups := NULL]
# Drop the other columns if required. Return the output.
if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
The count of values that were included in a particular sum (your ex.indices).
The total after summing relevant values (your ex.want).
The positions of the original values from the input vector.
The original values themselves, in case you want to verify it later.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
## count total pos original
## 1: 1 4 1 4
## 2: 1 7 2 7
## 3: 2 2 3, 9 1, 1
## 4: 1 2 4 2
## 5: 3 40 5, 8, 10 11, 19, 10
## 6: 1 14 6 14
## 7: 1 18 7 18
fun(events = c(3, 4))
## count total pos original
## 1: 4 39 1, 4, 6, 8 4, 2, 14, 19
## 2: 3 36 2, 5, 7 7, 11, 18
## 3: 1 1 3 1
## 4: 1 1 9 1
## 5: 1 10 10 10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
## count total pos original
## 1: 7 92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2: 1 10 2 10
## 3: 3 71 6, 9, 14 20, 26, 25
## 4: 4 69 7, 12, 13, 16 28, 23, 6, 12
## 5: 2 45 8, 17 20, 25
## 6: 1 12 10 12
## 7: 1 8 15 8
## 8: 1 23 18 23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92
I am trying to input nominal variables based on a column dedicated to age. Basically, if someone is between the ages of 1 to 5, indicated in the age column, then I want the age group column to have the value of 1, since they are in age group 1. I'm trying to do this in multiple columns since ages increase by one each year. I've tried doing this through a for loop that uses an if else function, but it does not work.
`my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
my_data_1<-cbind("AgeGroup2000"=NA, "AgeGroup2001"=NA, "AgeGroup2002"=NA, my_data_1)
my_data_1
#I'm basically trying to make the below code into a for loop
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 1:5]<-1
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 6:10]<-2
my_data_1$AgeGroup2000[my_data_1$Age2000 %in% 11:15]<-3
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 1:5]<-1
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 6:10]<-2
my_data_1$AgeGroup2001[my_data_1$Age2001 %in% 11:15]<-3
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 1:5]<-1
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 6:10]<-2
my_data_1$AgeGroup2002[my_data_1$Age2002 %in% 11:15]<-3`
Maybe it is better to use findInterval or cut here. We can use lapply to apply it for multiple columns
my_data_1[paste0("AgeGroup_", 2000:2002)] <- lapply(my_data_1, findInterval, c(1, 6, 11))
# Age2000 Age2001 Age2002 AgeGroup_2000 AgeGroup_2001 AgeGroup_2002
#Participant1 1 2 3 1 1 1
#Participant2 3 4 5 1 1 1
#Participant3 5 6 7 1 2 2
#Participant4 7 8 9 2 2 2
#Participant5 9 10 11 2 3 3
#Participant6 11 12 13 3 3 3
Or mutate_all from dplyr
library(dplyr)
my_data_1 %>% mutate_all(list(Group = ~findInterval(., c(1, 6, 11))))
data
my_vector_1<-c(1,3,5,7,9,11,2,4,6,8,10,12,3,5,7,9,11,13)
my_matrix_1<-matrix(data=my_vector_1, nrow=6, ncol=3)
colnames(my_matrix_1)<-c(paste0("Age", 2000:2002))
rownames(my_matrix_1)<-c(paste0("Participant", 1:6))
my_data_1<-data.frame(my_matrix_1)
I've got a data.frame of monthly values of a variable for many locations (so many rows) and I want to count the numbers of consecutive months (i.e consecutive cells) that have a value of zero. This would be easy if it was just being read left to right, but the added complication is that the end of the year is consecutive to the start of the year.
For example, in the shortened example dataset below (with seasons instead of months),location 1 has 3 '0' months, location 2 has 2, and 3 has none.
df<-cbind(location= c(1,2,3),
Winter=c(0,0,3),
Spring=c(0,2,4),
Summer=c(0,2,7),
Autumn=c(3,0,4))
How can I count these consecutive zero values? I've looked at rle but I'm still none the wiser currently!
Many thanks for any help :)
You've identified the two cases that the longest run can take: (1) somewhere int he middle or (2) split between the end and beginning of each row. Hence you want to calculate each condition and take the max like so:
df<-cbind(
Winter=c(0,0,3),
Spring=c(0,2,4),
Summer=c(0,2,7),
Autumn=c(3,0,4))
#> Winter Spring Summer Autumn
#> [1,] 0 0 0 3
#> [2,] 0 2 2 0
#> [3,] 3 4 7 4
# calculate the number of consecutive zeros at the start and end
startZeros <- apply(df,1,function(x)which.min(x==0)-1)
#> [1] 3 1 0
endZeros <- apply(df,1,function(x)which.min(rev(x==0))-1)
#> [1] 0 1 0
# calculate the longest run of zeros
longestRun <- apply(df,1,function(x){
y = rle(x);
max(y$lengths[y$values==0],0)}))
#> [1] 3 1 0
# take the max of the two values
pmax(longestRun,startZeros +endZeros )
#> [1] 3 2 0
Of course an even easier solution is:
longestRun <- apply(cbind(df,df),# tricky way to wrap the zeros from the start to the end
1,# the margin over which to apply the summary function
function(x){# the summary function
y = rle(x);
max(y$lengths[y$values==0],
0)#include zero incase there are no zeros in y$values
})
Note that the above solution works because my df does not include the location field (column).
Try this:
df <- data.frame(location = c(1, 2, 3),
Winter = c(0, 0, 3),
Spring = c(0, 2, 4),
Summer = c(0, 2, 7),
Autumn = c(3, 0, 4))
maxcumzero <- function(x) {
l <- x == 0
max(cumsum(l) - cummax(cumsum(l) * !l))
}
df$N.Consec <- apply(cbind(df[, -1], df[, -1]), 1, maxcumzero)
df
# location Winter Spring Summer Autumn N.Consec
# 1 1 0 0 0 3 3
# 2 2 0 2 2 0 2
# 3 3 3 4 7 4 0
This adds a column to the data frame specifying the maximum number of times zero has occurred consecutively in each row of the data frame. The data frame is column bound to itself to be able to detect consecutive zeroes between autumn and winter.
The method used here is based on that of Martin Morgan in his answer to this similar question.