Designing a function to output the smallest plurality winner possible

I'm trying to design a function/formula that takes two integer variables, say 5 and 100. The first number could represent 5 ice cream flavours on a survey, and the 100 the number of people being sampled, i.e. asked their favourite ice cream.
I want a function/formula which produces the combination of numbers where 1 of the 5 ice cream flavours wins by the smallest plurality (so, I'm guessing, in most cases by 1), with the winning count being the smallest possible given the number of flavours on the survey.
So with 5 ice cream flavours and 100 respondents, I would want R to produce a vector of (order not really important):
[1] 21 20 20 20 19
As 21 is the smallest count with which an ice cream flavour can win a plurality out of 100 respondents and 5 flavours. As a function it would also need to handle cases where the number of respondents doesn't divide neatly by the number of choices.
Desired output
combinations_function <- function(x, y) {
?????
}
combinations_function(5, 100)
[1] 21 20 20 20 19
combinations_function(5, 38)
[1] 9 8 7 7 7
combinations_function(7, 48)
[1] 8 7 7 7 7 6 6

Think I got it:
smallest_margin <- function(choices, respondents) {
  values <- rep(respondents %/% choices, choices)
  remainder <- respondents %% choices
  while (remainder != 0) {
    values[which.min(values)] <- values[which.min(values)] + 1
    remainder <- remainder - 1
  }
  if (length(which(values == max(values))) > 1)
    values[which(values == max(values))[1:2]] <-
      values[which(values == max(values))[1:2]] + c(-1, 1)
  return(sort(values))
}
smallest_margin(5, 100)
# [1] 19 20 20 20 21
smallest_margin(1, 100)
# [1] 100
smallest_margin(5, 99)
# [1] 19 19 20 20 21
smallest_margin(12, 758)
# [1] 63 63 63 63 63 63 63 63 63 63 63 65
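For comparison, here is a loop-free sketch of the same idea (my own variant, offered as an alternative rather than a correction): distribute the remainder with a comparison vector, then shift one unit between the first two tied maxima.
smallest_margin2 <- function(choices, respondents) {
  base <- respondents %/% choices
  r <- respondents %% choices
  values <- base + (seq_len(choices) <= r)  # spread the remainder evenly
  ties <- which(values == max(values))
  if (choices > 1 && length(ties) > 1) {
    values[ties[1:2]] <- values[ties[1:2]] + c(1, -1)  # create a unique winner
  }
  sort(values)
}
smallest_margin2(5, 100)
# [1] 19 20 20 20 21
smallest_margin2(7, 48)
# [1] 6 6 7 7 7 7 8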

Here is a code-golfian approach (note that it assumes at least three choices, since the final term hard-codes positions 1 and 2):
f <- function(x, y)
  rep(y %/% x, x) +
  ifelse(rep(y %% x > 0, x),
         c(1^(1:(y %% x)), 0 * ((y %% x + 1):x)), 0) +
  c((-1)^(0:1), 0 * (3:x))
Example
f(5, 100)
# [1] 21 19 20 20 20
f(5, 38)
# [1] 9 7 8 7 7
f(5, 48)
# [1] 11 9 10 9 9
f(7, 48)
# [1] 8 6 7 7 7 7 6

Related

Ceil and floor values in R

I have a data.table of integers with values between 1 and 60.
My question is about flooring or ceiling any number to the following values: 12 18 24 30 36 ... 60.
For example, let's say my data.table contains the number 13. I want R to "transform" this number into 12 and 18 as 13 lies in between those numbers. Moreover, if I have 18 I want R to keep it at 18.
If my data.table contains the value 50, I want R to convert that number into 48 and 54 and so on.
My goal is to get two different data.tables. One where the floored values are saved and one where the ceiled values are saved.
Any idea how one could do this in R?
EDIT: Numbers smaller than 12 should always be transformed to 12.
Example output:
If I have the following data.table: data.table(c(1,28,29,41,53,53,17,41,41,53))
I want the following two output data.tables. Floored values: data.table(c(12,24,24,36,48,48,12,36,36,48))
Ceiled values: data.table(c(12,30,30,42,54,54,18,42,42,54))
Here is a fairly direct way (edited to round up to 12 if any values are below):
df <- data.frame(nums = 10:20)
df$floors <- with(df, pmax(12, 6 * floor(nums / 6)))
df$ceils <- with(df, pmax(12, 6 * ceiling(nums / 6)))
Leading to:
> df
nums floors ceils
1 10 12 12
2 11 12 12
3 12 12 12
4 13 12 18
5 14 12 18
6 15 12 18
7 16 12 18
8 17 12 18
9 18 18 18
10 19 18 24
11 20 18 24
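Since the question is about a data.table, here is a hedged sketch applying the same pmax/floor arithmetic directly to the asker's example table (the column name v is my own placeholder):
library(data.table)
dt <- data.table(v = c(1, 28, 29, 41, 53, 53, 17, 41, 41, 53))
floored <- dt[, .(v = pmax(12, 6 * floor(v / 6)))]    # 12 24 24 36 48 48 12 36 36 48
ceiled  <- dt[, .(v = pmax(12, 6 * ceiling(v / 6)))]  # 12 30 30 42 54 54 18 42 42 54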
Here's a way we could do this, using the sapply and which.min functions. From your question, it's not immediately clear how values < 12 should be handled.
x <- 1:60
num_list <- seq(12, 60, 6)
floorr <- sapply(x, function(x) {
  diff_vec <- x - num_list
  diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
  num_list[which.min(diff_vec)]
})
ceill <- sapply(x, function(x) {
  diff_vec <- num_list - x
  diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec)
  num_list[which.min(diff_vec)]
})
tail(cbind(x, floorr, ceill))
x floorr ceill
[55,] 55 54 60
[56,] 56 54 60
[57,] 57 54 60
[58,] 58 54 60
[59,] 59 54 60
[60,] 60 60 60
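Another base-R idiom worth knowing for this kind of snapping is findInterval; here is a hedged sketch against the same breakpoints (left.open requires R >= 3.2.0):
num_list <- seq(12, 60, 6)
x <- 1:60
# floor: index of the largest breakpoint <= x, clamped so values below 12 map to 12
floorr <- num_list[pmax(1, findInterval(x, num_list))]
# ceiling: with left-open intervals an exact breakpoint stays put (18 -> 18, 13 -> 18)
ceill <- num_list[pmin(length(num_list), findInterval(x, num_list, left.open = TRUE) + 1)]
cbind(x, floorr, ceill)[c(1, 13, 18, 50), ]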

Flip Every Nth Coin in R [duplicate]

My friend gave me a brain teaser that I wanted to try on R.
Imagine 100 coins in a row, with heads facing up for all coins. Now every 2nd coin is flipped (thus becoming tails). Then every 3rd coin is flipped. How many coins are now showing heads?
To create the vector, I started with:
flips <- rep('h', 100)
levels(flips) <- c("h", "t")
Not sure how to proceed from here. Any help would be appreciated.
Try this:
coins <- rep(1, 100) # 1 = Head, 0 = Tail
n <- 3 # run until you've flipped every 3rd coin
invisible(sapply(2:n, function(i) {
  indices <- seq(i, 100, i)
  coins[indices] <<- (coins[indices] + 1) %% 2
}))
which(coins == 1)
# [1] 1 5 6 7 11 12 13 17 18 19 23 24 25 29 30 31 35 36 37 41 42 43 47 48 49 53 54 55 59 60 61 65 66 67 71 72 73 77 78 79 83 84 85 89 90 91 95 96 97
sum(coins==1)
#[1] 49
If you run till n = 100, only the coins at the positions which are perfect squares will be showing heads.
coins <- rep(1, 100) # 1 = Head, 0 = Tail
n <- 100
invisible(sapply(2:n, function(i) {indices <- seq(i, 100, i); coins[indices] <<- (coins[indices] + 1) %% 2}) )
which(coins == 1)
# [1] 1 4 9 16 25 36 49 64 81 100
sum(coins==1)
# [1] 10
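A loop-free variant (my own sketch, not from the answers above): Reduce threads the coin vector through each flip round, which avoids the <<- side effect.
flip_rounds <- function(n, coins = rep(1, 100)) {  # 1 = Head, 0 = Tail
  Reduce(function(cs, i) {
    idx <- seq(i, length(cs), i)  # every i-th coin
    cs[idx] <- 1 - cs[idx]       # flip
    cs
  }, 2:n, init = coins)
}
sum(flip_rounds(3) == 1)     # 49, as above
which(flip_rounds(100) == 1) # the perfect squares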

Use vector to make probability table

In the form of a probability table, I'd like to illustrate, for a vector of quantiles, which values are divisible by 7 and by 5 (the marginal probability distributions) and 5 given 7 (the conditional probability).
Let's assume this is my data:
> prop.table(table(x)) # discrete number and its probability
20 22 23 24 25 26 27 28 29 30 31
0.000152 0.000625 0.000796 0.001224 0.003138 0.003043 0.004549 0.006444 0.005938 0.009301 0.009456
32 33 34 35 36 37 38 39 40 41 42
0.013448 0.019839 0.018596 0.026613 0.028902 0.027377 0.035156 0.041379 0.041092 0.047733 0.055827
43 44 45 46 47 48 49 50 51 52 53
0.046099 0.051624 0.055131 0.049779 0.056992 0.049801 0.052912 0.031924 0.049114 0.022880 0.042279
54 55 56 57 58 59 61 63 65
0.013946 0.032340 0.003466 0.021240 0.001227 0.011734 0.005115 0.001491 0.000278
How can I turn this into a two-way probability table that shows which numbers are divisible by 7 and/or 5 for marginal and conditional probability?
This is what I'd hope the table to look like
        Yes      No      # columns: divisible by 7
Yes   0.02754  0.02886
No    0.02656  0.02831
# rows: divisible by 5
x <- sample(1:100, 100, replace = TRUE)
# %% is the mod operator, which gives the remainder after the division of the left-hand side by the right-hand side. x %% y == 0 therefore returns TRUE if x is divisible by y
db5 <- x %% 5 == 0
db7 <- x %% 7 == 0
table(db5, db7) / length(x)
# db7
# db5 FALSE TRUE
# FALSE 0.62 0.13
# TRUE 0.24 0.01
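If, as in the question, you start from a probability vector rather than raw data, the same divisibility trick can weight the table by those probabilities. A hedged sketch (x here is stand-in data, and tapply returns NA for any empty cell):
x <- sample(20:65, 1000, replace = TRUE)
p <- prop.table(table(x))
vals <- as.numeric(names(p))
joint <- tapply(as.numeric(p),
                list(db5 = vals %% 5 == 0, db7 = vals %% 7 == 0),
                sum)
joint                                   # joint probabilities
joint["TRUE", ] / sum(joint["TRUE", ])  # P(divisible by 7 | divisible by 5)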

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = rep(LETTERS[1:2], each = 10),
                  re = rep(seq(1, 10, 1), 2),
                  age1 = seq(10, 29, 1),
                  age2 = seq(30, 49, 1),
                  h = runif(20, 10, 40),
                  h2 = 40 + runif(20, 4, 9))
I'd like to fit an lm for each row in the data set (h and h2 ~ age1 and age2)
I do it by loop
exp$modelh <- 0
for (i in 1:length(exp$exp)) {
  age <- c(exp$age1[i], exp$age2[i])
  h <- c(exp$h[i], exp$h2[i])
  model <- lm(age ~ h)
  exp$modelh[i] <- coef(model)[1] + 100 * coef(model)[2]
}
and it works well, but takes some time with very large files. I'd be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try with rowwise() and do. Inside the do, we concatenate (c) the 'age1', 'age2' to create 'age', likewise, we can create 'h', apply lm, extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
  rowwise() %>%
  do({
    age <- c(.$age1, .$age2)
    h <- c(.$h, .$h2)
    model <- lm(age ~ h)
    data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
  })
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table) # v1.9.5+
modelh <- melt(setDT(exp, keep.rownames = TRUE),
               measure = patterns('^age', '^h'),
               value.name = c('age', 'h'))[, {
                 model <- lm(age ~ h)
                 coef(model)[1] + 100 * coef(model)[2]
               }, rn]$V1
exp[, modelh := modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis, as you mentioned "it's an example of a bigger problem". Obviously, if you are really interested in building models rowwise then you'll create more and more columns as your age and h observations increase. With N observations you'd need 2×N columns for those two variables alone.
I'd suggest using a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # what your first row (model-building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model-building info is transformed
        varying = list(c("age1", "age2"),
                       c("h", "h2")),
        v.names = c("age_value", "h_value"),
        direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the sprintf function can help us create the formulas, and lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {
  model <- lm(as.formula(x))
  coef(model)[1] + 100 * coef(model)[2]
})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1 = sample(30, 1e4, TRUE),
                      age2 = sample(30:50, 1e4, TRUE),
                      h = runif(1e4, 10, 40),
                      h2 = 40 + runif(1e4, 4, 9))
microbenchmark(
  plafort = {
    strings <- sprintf("c(%f,%f) ~ c(%f,%f)",
                       big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
    lst <- lapply(strings, function(x) {
      model <- lm(as.formula(x))
      coef(model)[1] + 100 * coef(model)[2]
    })
    big.exp$modelh <- unlist(lst)
  },
  akdplyr = {
    big.exp %>%
      rowwise() %>%
      do({
        age <- c(.$age1, .$age2)
        h <- c(.$h, .$h2)
        model <- lm(age ~ h)
        data.frame(., modelh = coef(model)[1] + 100 * coef(model)[2])
      })
  },
  times = 5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its 5-digit rounding may be an issue. Improvements may be possible; perhaps using apply after converting to a matrix would increase speed and efficiency.
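One further observation, offered as a hedged sketch of my own rather than another benchmark entry: each row's lm here fits exactly two points, so the line is determined algebraically and no model fitting is needed at all; the whole column vectorizes.
# Two points determine the line exactly, so slope and intercept have
# closed forms; this reproduces coef(lm(age ~ h)) row by row.
slope <- (exp$age2 - exp$age1) / (exp$h2 - exp$h)
intercept <- exp$age1 - slope * exp$h
exp$modelh <- intercept + 100 * slope
This avoids the per-row lm call entirely, though of course it only applies while each fit really has just two points.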

Ordering clustered points using Kmeans and R

I have set of data (of 5000 points with 4 dimensions) that I have clustered using kmeans in R.
I want to order the points in each cluster by their distance to the center of that cluster.
Very simply, the data looks like this (I am using a subset to test out various approaches):
id Ans Acc Que Kudos
1 100 100 100 100
2 85 83 80 75
3 69 65 30 29
4 41 45 30 22
5 10 12 18 16
6 10 13 10 9
7 10 16 16 19
8 65 68 100 100
9 36 30 35 29
10 36 30 26 22
Firstly, I used the following method to cluster the dataset into 2 clusters:
(result <- kmeans(data, 2))
This returns a kmeans object that has the following methods:
cluster, centers etc.
But I cannot figure out how to compare each point and produce an ordered list.
Secondly, I tried the seriation approach as suggested by another SO user here
I use these commands:
clus <- kmeans(scale(x, scale = FALSE), centers = 3, iter.max = 50, nstart = 10)
mns <- sapply(split(x, clus$cluster), function(x) mean(unlist(x)))
result <- x[order(order(mns)[clus$cluster]), ]
This seems to produce an ordered list, but if I bind it to the labelled clusters (using the following cbind command):
result <- cbind(x[order(order(mns)[clus$cluster]), ],clus$cluster)
I get the following result, which does not appear to be ordered correctly:
id Ans Acc Que Kudos clus
1 3 69 65 30 29 1
2 4 41 45 30 22 1
3 5 10 12 18 16 2
4 6 10 13 10 9 2
5 7 10 16 16 19 2
6 9 36 30 35 29 2
7 10 36 30 26 22 2
8 1 100 100 100 100 1
9 2 85 83 80 75 2
10 8 65 68 100 100 2
I don't want to be writing commands willy-nilly, but to understand how the approach works. If anyone could help out or shed some light on this, it would be really great.
EDIT:
As the clusters can be easily plotted, I'd imagine there is a more straightforward way to get and rank the distances between points and the center.
The centers for the above clusters (when using k = 2) are as follows. But I do not know how to get and compare this with each individual point.
Ans Accep Que Kudos
1 83.33333 83.66667 93.33333 91.66667
2 30.28571 30.14286 23.57143 20.85714
NB:
I don't need to use kmeans, but I want to specify the number of clusters and retrieve an ordered list of points from those clusters.
Here is an example that does what you ask, using the first example from ?kmeans. It is probably not terribly efficient, but is something to build upon.
# Taken straight from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)
x <- cbind(x, cl = cl$cluster)
# Function to apply to each cluster to do the ordering
orderCluster <- function(i, data, centers) {
  # Extract this cluster's points and its center
  dt <- data[data[, 3] == i, ]
  ct <- centers[i, ]
  # Squared distance of each point to the center; sweep subtracts the
  # center from each row (a plain `dt[, 1:2] - ct` would wrongly recycle
  # the center down the columns)
  dt <- cbind(dt, dist = rowSums(sweep(dt[, 1:2], 2, ct)^2))
  # Sort by distance
  dt[order(dt[, 4]), ]
}
do.call(rbind,lapply(sort(unique(cl$cluster)),orderCluster,data = x,centers = cl$centers))
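A more compact variant (my own sketch, reusing the same x and cl objects): index the centers matrix by the cluster vector so every point's squared distance to its own center is computed in one shot, then order by cluster and distance.
# cl$centers[cl$cluster, ] repeats each cluster's center once per point,
# so the subtraction lines up row by row
d2 <- rowSums((x[, c("x", "y")] - cl$centers[cl$cluster, ])^2)
x_ordered <- x[order(x[, "cl"], d2), ]
head(x_ordered)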
