Counting repetitions in R

I want to count the number of specific repetitions in my dataframe. Here is a reproducible example:
df <- data.frame(Date= c('5/5', '5/5', '5/5', '5/6', '5/7'),
First = c('a','b','c','a','c'),
Second = c('A','B','C','D','A'),
Third = c('q','w','e','w','q'),
Fourth = c('z','x','c','v','z'))
Giving this:
Date First Second Third Fourth
1 5/5 a A q z
2 5/5 b B w x
3 5/5 c C e c
4 5/6 a D w v
5 5/7 c A q z
I read a big file that holds 400,000 instances and I want to know different statistics about specific attributes. For example, here I'd like to know how many times a happens on 5/5. I tried using sum(df$Date == '5/5' & df$First == 'a', na.rm=TRUE), which gave me the right result here (1), but when I run it on the big data set, the numbers are not accurate.
Any idea why?
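A cross-tabulation of the two attributes is a useful sanity check against a one-off logical sum. A minimal sketch on the example data; the trimws() calls are an assumption, guarding against stray whitespace, a common cause of silently wrong counts in large imported files:

```r
df <- data.frame(Date = c('5/5', '5/5', '5/5', '5/6', '5/7'),
                 First = c('a', 'b', 'c', 'a', 'c'))

# trimws() strips leading/trailing whitespace the import may have added
counts <- table(trimws(df$Date), trimws(df$First))
counts['5/5', 'a']   # should agree with the sum() approach
```

If the cross-tabulated counts and the sum() counts disagree on the big file, whitespace or case differences in the raw data are a likely explanation.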


How to find the longest sequence of non-NA rows in R?

I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.
My intuition is to create a running counter which determines whether a row has NA or not, and then will determine the count for the number of consecutive rows without an NA. I would then put this in an if statement to keep restarting every time an NA is encountered, outputting a dataframe with the lengths of every sequence of non-NAs, which I could use to find the longest such sequence. This seems very inefficient so I'm wondering if there is a better way!
If I understand this phrase correctly:
[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column
You have a column of interest, call it WANT, and are looking to isolate all columns from the single row where the longest run of consecutive non-NA values in WANT ends.
Example data
df <- data.frame(A = LETTERS[1:10],
B = LETTERS[1:10],
C = LETTERS[1:10],
WANT = LETTERS[1:10],
E = LETTERS[1:10])
set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA
# A B C WANT E
#1 A A A A A
#2 B B B B B
#3 C C C <NA> C
#4 D D D D D
#5 E E E E E
#6 F F F F F
#7 G G G G G
#8 H H H H H
#9 I I I I I # want to isolate this row (#9) since most non-NA in WANT
#10 J J J <NA> J
Here you would want all I values as it is the row with the longest running non-NA values in WANT.
If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.
The benefit of this approach is that it counts consecutive non-NA runs across all columns (of any type, i.e., character or numeric), and you can then index on whatever column you want using which.max().
# from #jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
  r <- rle(x)
  s <- sapply(r$lengths, seq_len)
  s[r$values] <- lapply(s[r$values], `*`, 0)
  unlist(s)
}))
# index using which.max()
want_data <- df[which.max(res$WANT), ]
#> want_data
# A B C WANT E
#9 I I I I I
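If the goal is the whole longest stretch of rows rather than the single row where it ends (an extension, not part of the original answer), rle() on the NA flags of WANT recovers the full index range:

```r
df <- data.frame(A = LETTERS[1:10], B = LETTERS[1:10], C = LETTERS[1:10],
                 WANT = LETTERS[1:10], E = LETTERS[1:10])
df[c(3, 10), "WANT"] <- NA                 # same NA positions as the example

r <- rle(!is.na(df$WANT))                  # runs of non-NA / NA rows
best  <- which.max(ifelse(r$values, r$lengths, 0))
end   <- cumsum(r$lengths)[best]           # last row of the longest run
start <- end - r$lengths[best] + 1
df[start:end, ]                            # rows 4 through 9
```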
If this isn't correct, please edit your question for clarity.

Limiting Duplication of Specified Columns

I'm trying to find a way to add some constraints into a linear programme to force the solution to have a certain level of uniqueness to it. I'll try to explain what I mean here. Take the example below: the linear programme returns the max possible Score for a combination of 2 males and 1 female.
Looking at the Team/Grade/Rep columns however we can see that there is a lot of duplication from row to row. In fact Shana and Jason are identical.
Name<-c("Jane","Brad","Harry","Shana","Debra","Jason")
Sex<-c("F","M","M","F","F","M")
Score<-c(25,50,36,40,39,62)
Team<-c("A","A","A","B","B","B")
Grade<-c(1,2,1,2,1,2)
Rep<-c("C","D","C","D","D","D")
df<-data.frame(Name,Sex,Score,Team,Grade,Rep)
df
Name Sex Score Team Grade Rep
1 Jane F 25 A 1 C
2 Brad M 50 A 2 D
3 Harry M 36 A 1 C
4 Shana F 40 B 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"),as.numeric(df$Sex == "F"))
direction <- c("==","==")
rhs<-c(2,1)
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
4 Shana F 40 B 2 D
6 Jason M 62 B 2 D
What I am trying to work out is how to limit the level of duplication across those last three columns. For example, I would like there to be no more than, say, 2 columns the same across any two rows. This would mean that either the Shana row or the Jason row would be replaced in the model with an alternative.
I'm not sure if this is something that can be easily added into the Rglpk model? Appreciate any help that can be offered.
It sounds like you're asking how to prevent having a pair of individuals who are "too similar" from being returned by your optimization model. Once you have determined a rule for what makes a pair of people "too similar", you can simply add a constraint for each pair, limiting your solution to have no more than one of those two people.
For instance, if we use your rule of having no more than 2 columns the same, we could easily identify all pairs that we want to block:
pairs <- t(combn(nrow(df), 2))
(blocked <- pairs[rowSums(sapply(df[, c("Team", "Grade", "Rep")], function(x) {
  x[pairs[, 1]] == x[pairs[, 2]]
})) >= 3, ])
# [,1] [,2]
# [1,] 1 3
# [2,] 4 6
We want to block the pairs Jane/Harry and Shana/Jason. This is easy to do with linear constraints:
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"),
                outer(blocked[, 1], seq_len(num), "==") + outer(blocked[, 2], seq_len(num), "=="))
direction <- rep(c("==", "<="), c(2, nrow(blocked)))
rhs<-c(2, 1, rep(1, nrow(blocked)))
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
# Name Sex Score Team Grade Rep
# 2 Brad M 50 A 2 D
# 5 Debra F 39 B 1 D
# 6 Jason M 62 B 2 D
The approach of computing every pair to block is attractive because we could have a much more complicated rule for which pairs to block, since we don't need to encode the rule into the linear program. All we need to be able to do is to compute every pair that needs to be blocked.
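As an illustration, a stricter hypothetical rule, blocking any pair that shares at least 2 of the 3 columns, only changes the threshold in the pair computation; the constraint machinery stays the same (a sketch, not the rule the question asked about):

```r
df <- data.frame(Name = c("Jane","Brad","Harry","Shana","Debra","Jason"),
                 Team = c("A","A","A","B","B","B"),
                 Grade = c(1,2,1,2,1,2),
                 Rep = c("C","D","C","D","D","D"))

pairs <- t(combn(nrow(df), 2))             # all row pairs
same <- sapply(df[, c("Team", "Grade", "Rep")],
               function(x) x[pairs[, 1]] == x[pairs[, 2]])
blocked <- pairs[rowSums(same) >= 2, ]     # block pairs sharing >= 2 columns
blocked                                    # 6 pairs now, instead of 2
```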
For each group of rows having the same last 3 columns we construct a constraint such that at most one of those rows may appear. If a is an indicator vector of the rows of such a group then the constraint would look like this:
a'x <= 1
To do that, split the row numbers by the last 3 columns into a list of vectors s, each of whose components is a vector of row numbers for rows having the same last 3 columns. Only keep those components having more than 1 row number, giving s1. In this case the first component of s1 is c(1, 3), referring to the Jane and Harry rows, and the second component is c(4, 6), referring to the Shana and Jason rows. In this particular data there were 2 rows in each of the groups, but in other data there could be more than 2 rows in a group. excl has one row (constraint) for each element of s1.
The data in the question only has groups of size 2, but in general if there were k rows in some group one would need k choose 2 constraint rows to ensure that only one of the k were chosen if this were done pairwise, whereas the approach here only requires one constraint row for the entire group. For example, if k = 10 then choose(10, 2) = 45, so this uses 1 constraint in place of 45.
Finally rbind excl to matrix giving matrix2 and adjust the other Rglpk_solve_LP arguments accordingly giving:
nr <- nrow(df)
s <- split(1:nr, df[4:6])
s1 <- s[lengths(s) > 1]
excl <- t(sapply(s1, "%in%", x = 1:nr)) + 0
matrix2 <- rbind(matrix, excl)
direction2 <- c(direction, rep("<=", nrow(excl)))
rhs2 <- c(rhs, rep(1, nrow(excl)))
sol2 <- Rglpk_solve_LP(obj = obj, mat = matrix2,
dir = direction2, rhs = rhs2, types = "B", max = TRUE)
df[ sol2$solution == 1, ]
giving:
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D

Programming a randomization scheme, while loops

I'm trying to simulate a randomization process. I think I'll have to use a while loop, and I'm unfamiliar with how to best structure what I'm trying to accomplish in my R code.
Let's say I have 3 classes, a, b, and c, that I want to be distributed in a 3:2:1 ratio, respectively. A vector containing a minimally 'balanced' set of these classes in this ratio would look something like this:
class_1<-"a"
class_2<-"b"
class_3<-"c"
ratio_a<-3
ratio_b<-2
ratio_c<-1
min_set<-c(rep(class_1,ratio_a),rep(class_2,ratio_b),rep(class_3,ratio_c))
This minimum set would look something like this:
min_set
"a""a""a""b""b""c"
Let's then say I want to have k number of this minimally balanced set, I could create that like this:
block_1<-matrix(0,k,length(min_set))
for(i in 1:k)
block_1[i,]<-min_set
This would create a new matrix containing my min_set vector in each of its k rows.
Let's now say I want to sample from block_1 without replacement (a treatment allocation would be determined by the class (a, b, c) of the sample). This can be done as:
sample(as.vector(block_1),n,replace=F)
From here, I can enumerate all sampling outcome permutations of the min_set as (thanks to amonk):
myList <- permn(min_set)
all_out <- data.table(matrix(unlist(myList),byrow = T,ncol = 6))
all_out is a data frame whose rows represent each permutation of the min_set. Here's where I'd like help.
Let's create a second block
#Create inactive urn
block_2<-vector('numeric',length=dim(block_1)[1]*dim(block_1)[2])
I would like to sample from block_1 until I have sampled one permutation of min_set (one of the rows from all_out). My code would look something like this (not currently working):
while (block[2]!='any row of all_out'){
for (i in 1:(dim(block_1)[1]*dim(block_1)[2]))
block_2[i]<-sample(as.vector(block_1),i,replace=F)
}
Once I have achieved the min_set in block_2, I'd like to return the min_set back to block_1 from block_2, keeping p-6 samples (i.e. those not part of the min_set) in block_2.
Repeat until a prespecified number of allocations are made.
So for a given set of characters:
>min_set
[1] "a" "a" "a" "b" "b" "c"
all the permutations are generated (respecting the multiplicity of each character in the string):
library(combinat)
library(data.table)
myList <- permn(min_set)
myDT <- data.table(matrix(unlist(myList),byrow = T,ncol = 6))
> myDT
V1 V2 V3 V4 V5 V6
1: a a a b b c
2: a a a b c b
3: a a a c b b
4: a a c a b b
5: a c a a b b
---
716: a c a a b b
717: a a c a b b
718: a a a c b b
719: a a a b c b
720: a a a b b c
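The sample-until-min_set step from the question can also be sketched without enumerating permutations at all: since any permutation of min_set matches some row of myDT, it is enough to draw one element at a time and stop once block_2 contains the required 3:2:1 multiset. A sketch of one allocation round; variable names beyond those in the question are hypothetical:

```r
set.seed(42)
min_set <- c(rep("a", 3), rep("b", 2), "c")
k <- 4
block_1 <- rep(min_set, k)                 # active urn
block_2 <- character(0)                    # inactive urn

target <- table(min_set)                   # a = 3, b = 2, c = 1
repeat {
  i <- sample(length(block_1), 1)          # draw without replacement
  block_2 <- c(block_2, block_1[i])
  block_1 <- block_1[-i]
  have <- table(factor(block_2, levels = names(target)))
  if (all(have >= target)) break           # block_2 now holds a full min_set
}

# return one min_set to block_1; the surplus draws stay in block_2
block_1 <- c(block_1, min_set)
for (cls in names(target)) {
  drop <- which(block_2 == cls)[seq_len(target[[cls]])]
  block_2 <- block_2[-drop]
}
```

The repeat block can itself be wrapped in an outer loop that runs until the prespecified number of allocations has been made.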

Compare group of two columns and return index matches R

Many thanks for reading. Apologies for what I'm sure is a simple task.
I have a dataframe:
(Edited: Added extra column not to be included in comparison)
b = c(5, 6, 7, 8, 10, 11)
c = c('david','alan','pete', 'ben', 'richard', 'edd')
d = c('alex','edd','ben','pete','raymond', 'alan')
df = data.frame(b, c, d)
df
b c d
1 5 david alex
2 6 alan edd
3 7 pete ben
4 8 ben pete
5 10 richard raymond
6 11 edd alan
I want to compare the group of columns c and d with the group of columns d and c. That is, for one row, I want to compare the combined values in c and d with the combined values in d and c for all other rows.
(Note the values could either be characters or integers)
Where these match I want to return the index of those rows which match, preferably as a list of lists. I need to be able to access the indexes without referring to the values in column c or d.
I.e. for the above dataframe, my expected output would be:
list(c(2, 6), c(3, 4))
As:
Row 2: (c + d == alan + edd) = row 6: (d + c == edd + alan)
Row 3: (c + d == pete + ben) = row 4: (d + c == ben + pete)
I understand how to determine the match case for two separate columns using match or melt, but not if they are joined together and iterating over all possible row combinations.
I envision something like:
lapply(1:6, function(x), ifelse((df$a & df$b) == (df$b & df$a), index(x), 0))
But obviously that is incorrect and won't work.
I consulted the following questions but have been unable to formulate an answer. I have no idea where to begin.
Matching multiple columns on different data frames and getting other column as result
match two columns with two other columns
Comparing two columns in a data frame across many rows
R Comparing each value of all pairs of columns
How can I achieve the above?
You could do something like this. It splits the row indices 1:nrow(df) according to unique sorted strings formed from the c and d columns (column b has to be left out, or every key would be unique). The sorting ensures that A,B and B,A are treated identically.
duplist <- split(1:nrow(df), apply(df[, c("c", "d")], 1, function(r) paste(sort(r), collapse = " ")))
duplist
$`alan edd`
[1] 2 6
$`alex david`
[1] 1
$`ben pete`
[1] 3 4
$`raymond richard`
[1] 5
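To reduce the result to just the matching pairs the question asks for (a small addition to the answer above; column b is again left out of the keys):

```r
df <- data.frame(c = c('david', 'alan', 'pete', 'ben', 'richard', 'edd'),
                 d = c('alex', 'edd', 'ben', 'pete', 'raymond', 'alan'),
                 stringsAsFactors = FALSE)

duplist <- split(1:nrow(df),
                 apply(df, 1, function(r) paste(sort(r), collapse = " ")))
matches <- duplist[lengths(duplist) > 1]   # drop singleton groups
matches
# $`alan edd`
# [1] 2 6
# $`ben pete`
# [1] 3 4
```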

Match and Fill Values in R

I have a data set containing 3 columns. The first column contains the product name (A through E) and the corresponding 2 columns contain its nearest 2 neighbors (i.e., customers who own the product specified in the first column are most likely to buy those next 2 products).
m1 = data.frame(Product=c("A","B","C","D","E"), V1=c("C","A","A","A","D"),
V2=c("D","D","B","E","A"))
In the second data set, I have data at the user level. The first column contains user IDs and the corresponding 5 columns contain information on whether the user owns the product or not: 1 - owns it, 0 - doesn't own it.
m2 = data.frame(ID = c(1:7), A = rbinom(7,1,1/2), B = rbinom(7,1,1/2),
C = rbinom(7,1,1/2), D = rbinom(7,1,1/2), E = rbinom(7,1,1/2))
I want product recommendations at the user level. I want m1 data to be merged with m2 based on whether the user owns the product or not. The output should look like -
User - 1 A D
You haven't posted a reproducible example (no set.seed) or exact expected results, but this seems to do what you want.
set.seed(321)
m1 = data.frame(Product=c("A","B","C","D","E"), V1=c("C","A","A","A","D"),
V2=c("D","D","B","E","A"))
m2 = data.frame(ID = c(1:7), A = rbinom(7,1,1/2), B = rbinom(7,1,1/2),
C = rbinom(7,1,1/2), D = rbinom(7,1,1/2), E = rbinom(7,1,1/2))
recommended <- apply(m2, 1, function(x) {
  client.recommended <- m1[as.logical(x[-1]), -1]
  top <- names(sort(table(as.vector(t(client.recommended))),
                    decreasing = TRUE)[1:2])
  c(x[1], top)
})
recommended <- as.data.frame(t(recommended), stringsAsFactors = FALSE)
ID V2 V3
1 1 A B
2 2 A D
3 3 A B
4 4 A D
5 5 A D
6 6 A D
7 7 A B
What this code does:
For every row in m2 data.frame (every client), take that row
Take the subset of m1 data.frame corresponding to values found in the row (if the client owns "A" and "B", take rows "A" and "B" from m1)
Turn this subset into vector
Count occurrences of unique values in vector
Sort unique values by count
Take the two most common unique values
Return these values along with client ID
Turn everything into proper data.frame for further processing
It seems that you expect to obtain only two products for each client, and that is what this code does. For products with the same number of occurrences, the one that comes first alphabetically wins. You can get all recommended products by dropping the [1:2] part, but then you will need to figure out how to coerce uneven-length vectors into a single data.frame.
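One way around the uneven-length problem (a sketch, not part of the original answer; the deterministic one-client m2 is just for illustration) is to build a list with lapply() instead, so each client can carry any number of recommendations:

```r
m1 <- data.frame(Product = c("A","B","C","D","E"),
                 V1 = c("C","A","A","A","D"),
                 V2 = c("D","D","B","E","A"),
                 stringsAsFactors = FALSE)
m2 <- data.frame(ID = 1, A = 1, B = 0, C = 0, D = 1, E = 0)

rec_list <- lapply(seq_len(nrow(m2)), function(i) {
  owned <- names(m2)[-1][m2[i, -1] == 1]              # products client i owns
  neighbours <- unlist(m1[m1$Product %in% owned, -1], use.names = FALSE)
  names(sort(table(neighbours), decreasing = TRUE))   # all recommendations
})
rec_list[[1]]                                         # "A" "C" "D" "E" here
```

The list can then be trimmed or padded per client as needed before binding the results into a single data.frame.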
