R associative memory doesn't work as expected

I am trying to use associative memory and ddply to add a column to a data frame. For example:
First, I have defined an association and a function that uses the association to calculate the product of two elements of a row (property damage and multiplier) to get the actual damage in dollars. Here, "B" means billions, "m|M" means millions, etc.
validMultiplierLetter <- c("B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(1000000000, 100, 100, 1000, 1000, 1000000, 1000000)
names(Multiplier) <- validMultiplierLetter
The function ploss (property loss) is:
ploss <- function(pd, pm) {
  if (pm %in% validMultiplierLetter) pd * Multiplier[pm]
  else 0
}
Here is a sample data frame with columns pd (property damage) and pm (multiplier), and ddply code to create a pl (property loss) column, which is the product of property damage and the associated multiplier value. Invalid multipliers are treated as 0 (e.g., "+").
tdf <- data.frame(pd = c(5, 10, 15, 20, 25), pm = c("B", "m", "K", "+", "h"))
tldf <- ddply(tdf, .(pd, pm), transform, pl = ploss(pd,pm))
I get the following output when I execute the code above - you can see that the right multiplier was not used for the rows.
> tldf
pd pm pl
1 5 B 500
2 10 m 10000
3 15 K 15000
4 20 + 0
5 25 h 2500
Strangely though, when you pass a constant, the multiplier works correctly. But when you pass a variable (whose value is the same as the constant), for some reason you get an incorrect result.
> Multiplier["B"]
B
1e+09
> tdf$pm[1]
[1] B
Levels: + B h K m
> Multiplier[tdf$pm[1]]
h
100
Any explanation of why this happens and how to fix it is greatly appreciated. Thanks.

The problem is that tdf$pm is a factor. When presented with a factor, [ will use the factor's underlying integer codes rather than the character values:
x <- 10:15
names(x) <- LETTERS[1:6]
x
## A B C D E F
## 10 11 12 13 14 15
x[c('A','F')] # Lookup by name
## A F
## 10 15
x[factor(c('A','F'))] # Lookup by integer
## A B
## 10 11
This is fixed by using as.character around the factor, so that a character vector is presented to [:
x[as.character(factor(c('A','F')))]
## A F
## 10 15
For your problem, you can coerce to character in the transform function:
ddply(tdf, .(pd, pm), transform, pl = ploss(pd,as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
In addition, you could vectorize your ploss function in the obvious way and do the job directly with transform:
ploss <- function(pd, pm) {
  ifelse(pm %in% validMultiplierLetter, pd * Multiplier[pm], 0)
}
transform(tdf, pl=ploss(pd, as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
And of course, the as.character coercion could be within the function ploss, so it isn't required in the transform call:
ploss <- function(pd, pm) {
  ifelse(pm %in% validMultiplierLetter, pd * Multiplier[as.character(pm)], 0)
}
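With the coercion inside ploss, the transform call no longer needs it; a quick sketch using the tdf defined above (the result should match the output shown earlier):
transform(tdf, pl = ploss(pd, pm))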

The problem I see is that, if you're using the default R options, tdf$pm is a factor, not a character vector. You can check this with class(tdf$pm). What's happening here is that "B" is really a mask for 2 (following the order in the printout: Levels: + B h K m), so pm has the value of 2 as far as [ is concerned, and Multiplier[2] is 100, as you've assigned.
When you call data.frame (or read.table) you need to add the argument stringsAsFactors = FALSE, or change the corresponding global option with the options function.
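For example, a minimal sketch of that fix, recreating tdf so that pm stays a character vector (note that since R 4.0.0, stringsAsFactors = FALSE is already the default):
tdf <- data.frame(pd = c(5, 10, 15, 20, 25),
                  pm = c("B", "m", "K", "+", "h"),
                  stringsAsFactors = FALSE)
class(tdf$pm)
## [1] "character"
Multiplier[tdf$pm[1]]
##     B
## 1e+09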

permute dataframe but must have unique rows

Say I have a dataframe like this:
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this - e.g. say I needed 100 permutations, I could generate 500 and discard those that fail the criteria. However, in my real data I have hundreds of rows and just using sample almost always leads to a failure. I wonder if there is a better algorithm for doing this? Perhaps a birth-death algorithm?
Setup:
library(tidyverse)
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d <- rownames_to_column(d)
I want the rownames to put it back in order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
samp_uniq_n <- function(vec, n) {
  x <- vec
  out <- rep(NA, n)
  for(i in 1:n) {
    # Here would be a good place to make sure sampling is even possible.
    probs <- prop.table(table(x))
    vals <- sort(unique(x))  # same order as names(probs), so the weights line up
    # Sample an index rather than the values themselves: this keeps the
    # probabilities aligned with the values and avoids sample()'s 1:n shortcut
    # when only one candidate value remains.
    out[i] <- vals[sample(length(vals), 1, prob = probs)]
    x <- x[x != out[i]]
    vec <- vec[-min(which(vec == out[i]))]
  }
  return(list(out = out, vec = vec))
}
Now, we need to split the data into a list of rows that have the same time and side, and start the sampling with the largest such group:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
  samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
  this_id <- samp$out
  d_split_desc[[i]]$id <- this_id
  id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
  arrange(as.numeric(rowname)) %>%
  select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
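For what it's worth, here is a rough sketch of such a wrapper (the name permute_ids is made up; it expects the original three-column d, before rownames_to_column, and reuses samp_uniq_n from above):
permute_ids <- function(d) {
  d <- tibble::rownames_to_column(d)   # remember the original row order
  id <- d$id
  d_split <- split(d[setdiff(names(d), "id")], list(d$time, d$side), drop = TRUE)
  d_split <- d_split[order(-sapply(d_split, nrow))]
  for (i in seq_along(d_split)) {
    samp <- samp_uniq_n(id, nrow(d_split[[i]]))
    d_split[[i]]$id <- samp$out
    id <- samp$vec
  }
  out <- do.call(rbind, d_split)
  out <- out[order(as.numeric(out$rowname)), ]   # restore the original order
  out$rowname <- NULL
  rownames(out) <- NULL
  out
}
The same caveat applies as above: it assumes a valid permutation actually exists for the remaining ids at each step.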

Limiting Duplication of Specified Columns

I'm trying to find a way to add some constraints into a linear programme to force the solution to have a certain level of uniqueness to it. I'll try explain what I mean here. Take the example below, the linear programme returns the max possible Score for a combination of 2 males and 1 female.
Looking at the Team/Grade/Rep columns however we can see that there is a lot of duplication from row to row. In fact Shana and Jason are identical.
Name<-c("Jane","Brad","Harry","Shana","Debra","Jason")
Sex<-c("F","M","M","F","F","M")
Score<-c(25,50,36,40,39,62)
Team<-c("A","A","A","B","B","B")
Grade<-c(1,2,1,2,1,2)
Rep<-c("C","D","C","D","D","D")
df<-data.frame(Name,Sex,Score,Team,Grade,Rep)
df
Name Sex Score Team Grade Rep
1 Jane F 25 A 1 C
2 Brad M 50 A 2 D
3 Harry M 36 A 1 C
4 Shana F 40 B 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"),as.numeric(df$Sex == "F"))
direction <- c("==","==")
rhs<-c(2,1)
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
4 Shana F 40 B 2 D
6 Jason M 62 B 2 D
What I am trying to work out is how to limit, say, the level of duplication across those last three columns. For example, I would like there to be no more than, say, 2 columns the same across any two rows. So this would mean that either the Shana row or the Jason row would be replaced in the model with an alternative.
I'm not sure if this is something that can be easily added into the Rglpk model? Appreciate any help that can be offered.
It sounds like you're asking how to prevent a pair of individuals who are "too similar" from both being returned by your optimization model. Once you have determined a rule for what makes a pair of people "too similar", you can simply add a constraint for each such pair, limiting your solution to include no more than one of those two people.
For instance, if we use your rule of having no more than 2 columns the same, we could easily identify all pairs that we want to block:
pairs <- t(combn(nrow(df), 2))
(blocked <- pairs[rowSums(sapply(df[,c("Team", "Grade", "Rep")], function(x) {
x[pairs[,1]] == x[pairs[,2]]
})) >= 3,])
# [,1] [,2]
# [1,] 1 3
# [2,] 4 6
We want to block the pairs Jane/Harry and Shana/Jason. This is easy to do with linear constraints:
library(Rglpk)
num <- length(df$Name)
obj<-df$Score
var.types<-rep("B",num)
matrix <- rbind(as.numeric(df$Sex == "M"), as.numeric(df$Sex == "F"),
outer(blocked[,1], seq_len(num), "==") + outer(blocked[,2], seq_len(num), "=="))
direction <- rep(c("==", "<="), c(2, nrow(blocked)))
rhs<-c(2, 1, rep(1, nrow(blocked)))
sol <- Rglpk_solve_LP(obj = obj, mat = matrix, dir = direction, rhs = rhs,types = var.types, max = TRUE)
df[sol$solution==1,]
# Name Sex Score Team Grade Rep
# 2 Brad M 50 A 2 D
# 5 Debra F 39 B 1 D
# 6 Jason M 62 B 2 D
The approach of computing every pair to block is attractive because we could have a much more complicated rule for which pairs to block, since we don't need to encode the rule into the linear program. All we need to be able to do is to compute every pair that needs to be blocked.
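For instance, here's a hypothetical alternative rule (not from the question): block any pair on the same Team whose Scores are within 5 points of each other. The resulting matrix plugs into the same constraint-building code as blocked above:
blocked2 <- pairs[df$Team[pairs[, 1]] == df$Team[pairs[, 2]] &
                  abs(df$Score[pairs[, 1]] - df$Score[pairs[, 2]]) <= 5, , drop = FALSE]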
For each group of rows having the same last 3 columns, we construct a constraint such that at most one of those rows may appear. If a is an indicator vector of the rows of such a group, then the constraint would look like this:
a'x <= 1
To do that, split the row numbers by the last 3 columns into a list of vectors s, each of whose components is a vector of row numbers for rows having the same last 3 columns. Only keep those components having more than 1 row number, giving s1. In this case the first component of s1 is c(1, 3), referring to the Jane and Harry rows, and the second component is c(4, 6), referring to the Shana and Jason rows. In this particular data there were 2 rows in each of the groups, but in other data there could be more than 2 rows in a group. excl has one row (constraint) for each element of s1.
The data in the question only has groups of size 2, but in general, if there were k rows in some group, the pairwise approach would need k choose 2 constraint rows to ensure that only one of the k were chosen, whereas the approach here only requires one constraint row for the entire group. For example, if k = 10 then choose(10, 2) = 45, so this uses 1 constraint in place of 45.
Finally, rbind excl to matrix, giving matrix2, and adjust the other Rglpk_solve_LP arguments accordingly:
nr <- nrow(df)
s <- split(1:nr, df[4:6])
s1 <- s[lengths(s) > 1]
excl <- t(sapply(s1, "%in%", x = 1:nr)) + 0
matrix2 <- rbind(matrix, excl)
direction2 <- c(direction, rep("<=", nrow(excl)))
rhs2 <- c(rhs, rep(1, nrow(excl)))
sol2 <- Rglpk_solve_LP(obj = obj, mat = matrix2,
dir = direction2, rhs = rhs2, types = "B", max = TRUE)
df[ sol2$solution == 1, ]
giving:
Name Sex Score Team Grade Rep
2 Brad M 50 A 2 D
5 Debra F 39 B 1 D
6 Jason M 62 B 2 D

Calculating the mode or 2nd/3rd/4th most common value

Surely there has to be a function out there in some package for this?
I've searched and I've found this function to calculate the mode:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
But I'd like a function that lets me easily calculate the 2nd/3rd/4th/nth most common value in a column of data.
Ultimately I will apply this function to a large number of dplyr::group_by()s.
Thank you for your help!
Maybe you could try
f <- function (x) with(rle(sort(x)), values[order(lengths, decreasing = TRUE)])
This gives the unique values of the vector sorted by decreasing frequency. The first will be the mode, the 2nd will be the 2nd most common, etc.
Another method is based on table():
g <- function (x) as.numeric(names(sort(table(x), decreasing = TRUE)))
But this is not recommended, as the input vector x will be coerced to a factor first. If you have a large vector, this is very slow. Also, on exit, we have to extract the character names of the table and coerce them to numeric.
Example
set.seed(0); x <- rpois(100, 10)
f(x)
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
Let's compare with the contingency table from table:
tab <- sort(table(x), decreasing = TRUE)
# 11 12 7 9 8 13 10 14 5 15 6 2 3 16
# 14 14 11 11 10 10 9 7 5 4 2 1 1 1
as.numeric(names(tab))
# [1] 11 12 7 9 8 13 10 14 5 15 6 2 3 16
So the results are the same.
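Since the goal is to apply this across many dplyr::group_by() groups, here is a rough sketch of how f could be used per group (the data frame and its columns grp and val are made up for illustration):
library(dplyr)
dat <- data.frame(grp = rep(c("a", "b"), each = 50), val = rpois(100, 10))
dat %>%
  group_by(grp) %>%
  summarise(mode1 = f(val)[1],   # most common value in each group
            mode2 = f(val)[2])   # 2nd most common value in each group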
Here is an R function that I made (inspired by several other SO posts), which may work for your goal (and I use a local dataset on religious affiliation to illustrate it):
It's simple; only R base functions are involved: length, match, sort, tabulate, table, unique, which, as.character.
Find_Nth_Mode = function(d, N = 2) {
  maxN = function(x, N){
    len = length(x)
    if(N > len){
      warning('N greater than length(x). Setting N=length(x)')
      N = length(x)
    }
    sort(x, partial = len - N + 1)[len - N + 1]
  }
  (ux = unique(as.character(d)))
  (match(d, ux))
  (a1 = tabulate(match(d, ux)))
  (a2 = maxN(a1, N))
  (a3 = which(a1 == a2))
  (ux[a3])
}
Sample Output
> table(religion_data$relig11)
0.None 1.Protestant_Conservative 2.Protestant_Liberal 3.Catholic
34486 6134 19678 36880
4.Orthodox 5.Islam_Sunni 6.Islam_Shia 7.Hindu
20702 28170 668 4653
8.Buddhism 9.Jewish 10.Other
9983 381 6851
> Find_Nth_Mode(religion_data$relig11, 1)
[1] "3.Catholic"
> Find_Nth_Mode(religion_data$relig11, 2)
[1] "0.None"
> Find_Nth_Mode(religion_data$relig11, 3)
[1] "5.Islam_Sunni"
Reference:
I want to express my gratitude to these posts, from which I got the two functions and integrated them into one:
function to find the Nth largest value: Fastest way to find second (third...) highest/lowest value in vector or column
how to find the second largest mode value?

Using sapply and tapply

I am struggling to correct my code. Given this data frame:
S <- data.frame(Z1=c("A","A","A","D","D","A","A","A"),
Z2=c("A","A","A","D","D","C","C","D"),
K1=c(24,36,44,63,34,26,19,23),
K2=c(12,24,13,16,23,25,12,34))
I applied this transformation:
B <- sapply(1:2, function(x) {
  x1 <- S[c(x, x+2)]
  tapply(x1[,2], x1[,1], FUN=function(S) ceiling(median(S)))
})
colnames(B) <- c("G1","G2")
which I expected to set B to
G1 G2
C 0 19
B 0 0
A 25 13
D 49 23
but instead I get this error:
Error in `colnames<-`(`*tmp*`, value = c("G1", "G2")) :
attempt to set 'colnames' on an object with less than two dimensions
One of your problems is that R has no idea you consider Z1 and Z2 to be categorical variables that can take values A, B, C, D. The way you tell it this is with the factor type.
S <- data.frame(Z1=c("A","A","A","D","D","A","A","A"),
Z2=c("A","A","A","D","D","C","C","D"),
K1=c(24,36,44,63,34,26,19,23),
K2=c(12,24,13,16,23,25,12,34))
S$Z1 <- factor(S$Z1, levels=c("A", "B", "C", "D"))
S$Z2 <- factor(S$Z2, levels=c("A", "B", "C", "D"))
Notice how I have to explicitly spell out that all four of A, B, C, D are possible even though not all of them appear. Having done that, your transformation function produces a 2D matrix to which colnames can be applied.
B <- sapply(1:2, function(x) {
  x1 <- S[c(x, x+2)]
  tapply(x1[,2], x1[,1], FUN=function(S) ceiling(median(S)))
})
colnames(B) <- c("G1","G2")
However, you don't get zeroes where you expected them to be, you get NAs:
> B
G1 G2
A 25 13
B NA NA
C NA 19
D 49 23
This is because the median value of an empty set is undefined. You can paper over that with is.na:
> B[is.na(B)] <- 0
> B
G1 G2
A 25 13
B 0 0
C 0 19
D 49 23
Also, the S[c(x, x+2)] thing is extremely brittle and I would not rely on it in production code if I were you. The same goes for using sapply(1:2, function(x) ...) with a function that operates on global variables.
You may find the reshape2 package easier to persuade to do what you want.
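For example, a rough reshape2 sketch (assuming, as above, that Z1/Z2 are factors with levels A-D, and that the intended pairing is Z1 with K1 and Z2 with K2): stack the two Z/K pairs into long format, then dcast back into the desired shape. With drop = FALSE and fill = 0, the unused level B should come back as a row of zeroes.
library(reshape2)
long <- rbind(data.frame(G = "G1", Z = S$Z1, K = S$K1),
              data.frame(G = "G2", Z = S$Z2, K = S$K2))
dcast(long, Z ~ G, value.var = "K",
      fun.aggregate = function(k) ceiling(median(k)),
      fill = 0, drop = FALSE)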

Appending data frames based on a function in R

How do I append data frames one after the other to form another data frame?
Whether a data frame would be included or not will be decided by a criteria.
Here is an example data:
d1 <- data.frame(MyGroups =sample(LETTERS,100,replace=TRUE),
MyInt = sample(c(1:20),100,replace=TRUE))
Now, how should I choose groups (A,B,C...) from MyGroups that have mean of variable MyInt greater than 10?
I tried the following without success. Here, I am appending the data frame to a file based on the given criteria.
require("plyr")
keepGrp <- function(df0) {
if(max(df0$MyInt < 10)) {df0 <- NULL}
write.csv(df0,'mytable.txt',append=TRUE,sep=',')
}
ddply(d1,.(MyInt),function(x) keepGrp(x))
The desired data frame should be in file mytable.txt
I am fully sure there is a better way to do what I am trying to do.
I would be happy to clarify my question if I need to do so.
I will appreciate it if someone can (1) give me feedback on improving my programming approach and (2) give me a solution to my problem.
If I understand your question properly, you want to calculate the mean by group and only write those groups that meet a certain threshold to a pre-existing file. If so, why not calculate all the means at once, subset that, then write it out? Here's a one-liner that should probably be split into multiple steps, but I think you'll get the point:
write.table(
  subset(
    ddply(d1, "MyGroups", transform, meanval = mean(MyInt)),
    meanval > 10
  ),
  "yourcsv.csv", append = TRUE, sep = ",", col.names = FALSE
)
It is simpler than you are making it. The function called by ddply can either return the subset of data if the criteria are met, or an empty data.frame if not.
keepGrp <- function(df0) {
  if(mean(df0$MyInt) > 10) {
    df0
  } else {
    data.frame()
  }
}
res <- ddply(d1, .(MyGroups), keepGrp)
Note that your test inside keepGrp was wrong (it didn't test the mean of the MyInt values) and the grouping in the ddply call was wrong (it should be MyGroups, not MyInt).
Checking that this is right:
> ddply(d1, .(MyGroups), summarise, ave = mean(MyInt))
MyGroups ave
1 A 14.200000
2 B 9.600000
3 C 5.600000
4 D 5.600000
5 E 8.000000
6 F 10.500000
7 G 7.333333
8 H 12.000000
9 I 7.333333
10 J 9.500000
11 K 11.000000
12 L 12.375000
13 M 13.250000
14 N 12.000000
15 O 11.666667
16 P 8.625000
17 Q 13.000000
18 R 6.000000
19 S 16.000000
20 T 12.000000
21 U 12.000000
22 V 13.250000
23 W 17.666667
24 X 9.000000
25 Y 12.400000
26 Z 13.750000
> unique(res$MyGroup)
[1] A F H K L M N O Q S T U V W Y Z
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
So the ones that appear in res are those that have the appropriate mean value for MyInt.
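For comparison, here is a rough dplyr sketch of the same idea (not part of the original answer): keep all rows of the groups whose mean MyInt exceeds 10.
library(dplyr)
d1 %>%
  group_by(MyGroups) %>%
  filter(mean(MyInt) > 10) %>%
  ungroup()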
