I am new to R and have a particular fantasy sports team optimization problem I would like to solve. I have seen other posts use lpSolve for similar problems but I cannot seem to wrap my head around the code. Example data table below. Every player is on a team, plays a particular role, has a salary, and has an average number of points produced per game. The constraints are: I need exactly 8 players. No more than 3 players may come from any one team. There must be at least one player for each role (of 5). And cumulative salary must not exceed $10,000.
Team Player Role Avgpts Salary
Bears A T 22 930
Bears B M 19 900
Bears C B 30 1300
Bears D J 25 970
Bears E S 20 910
Jets F T 21 920
Jets G M 26 980
In R, I write the following:
> obj = DF$Avgpts
> con = rbind(t(model.matrix(~ Role + 0, DF)), rep(1,nrow(DF)), DF$Salary)
> dir = c(">=",">=",">=",">=",">=","==","<=")
> rhs = c(1,1,1,1,1,8,10000)
> result = lp("max", obj, con, dir, rhs, all.bin = TRUE)
This code works fine in producing the optimal fantasy team, but without the limitation that no more than 3 players may come from any one team. This is where I am stuck and I suspect it relates to the con argument. Any help is appreciated.

What if you added something similar to the way you did the roles to con?
If you add t(model.matrix(~ Team + 0, DF)) you'll have indicators for each team in your constraint. For the example you gave:
> con <- rbind(t(model.matrix(~ Role + 0,DF)), t(model.matrix(~ Team + 0, DF)), rep(1,nrow(DF)), DF$Salary)
> con
1 2 3 4 5 6 7
RoleB 0 0 1 0 0 0 0
RoleJ 0 0 0 1 0 0 0
RoleM 0 1 0 0 0 0 1
RoleS 0 0 0 0 1 0 0
RoleT 1 0 0 0 0 1 0
TeamBears 1 1 1 1 1 0 0
TeamJets 0 0 0 0 0 1 1
1 1 1 1 1 1 1
930 900 1300 970 910 920 980
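We now need to update dir and rhs to account for this:
dir <- c(">=",">=",">=",">=",">=",rep('<=',n_teams),"==","<=")
rhs <- c(1,1,1,1,1,rep(3,n_teams),8,10000)
With n_teams set to the number of distinct teams, i.e. the number of Team rows in con. The row for the total number of players keeps "==" because you need exactly 8 players.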
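To see the pieces fit together, here is a minimal sketch that builds con, dir and rhs in one go and derives n_teams from the data instead of typing it in. The 15-player roster is invented for illustration (the 7 rows in the question are too few to pick 8 players), and the lp call is only run if lpSolve is installed.

```r
# Hypothetical roster: 3 teams x 5 roles, numbers invented for illustration
DF <- data.frame(
  Team   = rep(c("Bears", "Jets", "Lions"), each = 5),
  Player = LETTERS[1:15],
  Role   = rep(c("T", "M", "B", "J", "S"), times = 3),
  Avgpts = c(22, 19, 30, 25, 20, 21, 26, 18, 24, 23, 27, 17, 28, 16, 29),
  Salary = c(930, 900, 1300, 970, 910, 920, 980, 890, 960, 950,
             1000, 880, 1100, 870, 1200),
  stringsAsFactors = FALSE
)

# One indicator row per role and per team
role_rows <- t(model.matrix(~ Role + 0, DF))
team_rows <- t(model.matrix(~ Team + 0, DF))
n_roles <- nrow(role_rows)
n_teams <- nrow(team_rows)

# Rows: roles, teams, total number of players, total salary
con <- rbind(role_rows, team_rows, rep(1, nrow(DF)), DF$Salary)
dir <- c(rep(">=", n_roles), rep("<=", n_teams), "==", "<=")
rhs <- c(rep(1, n_roles), rep(3, n_teams), 8, 10000)

# Solve only when lpSolve is available
if (requireNamespace("lpSolve", quietly = TRUE)) {
  result <- lpSolve::lp("max", DF$Avgpts, con, dir, rhs, all.bin = TRUE)
  print(DF[result$solution == 1, ])
}
```

Deriving n_roles and n_teams from the indicator matrices keeps dir and rhs the same length as the number of rows in con, which is the usual source of errors when constraints are added.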


Determine weights in multivariate weighted linear regression

I have a dataset containing insurance pricing and coverage information. The first column refers to the policy identifier, and the remaining columns refer to premium, limit, deductible, and further details as dummy variables (State and coverage).
Moreover, as it can be seen from the sample dataframe, several rows can refer to the same insurance product. In the original data frame, up to 40 rows may refer to a single policy, while other policies are described in a single row.
I am trying to conduct a multivariate regression
reg <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df)
The residuals of this regression do not follow a normal distribution, which is why I decided to take the log of the dependent variable. Moreover, my data frame contains several outliers and shows heteroscedasticity.
For these reasons I thought WLS regression could be a solution to my problem, because it lets me assign an appropriate weight to each error term. To understand how WLS works and the theory behind it, I tried a simple weighted regression as explained here
wt <- 1 / lm(abs(reg$residuals) ~ reg$fitted.values)$fitted.values^2  
wls_model <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df, weights=wt)
But looking at the results, I don't think this is the correct approach to my problem, especially since many rows are not considered when I do this.
From my understanding, since the weights argument of lm should be a vector, I could assign a specific weight to each policy. For instance, each row pertaining to POL1 would get a weight of 1/5. Despite having read the documentation and relevant posts, and searched for packages that could facilitate my work, it is not clear to me how to implement WLS in my case.
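The per-policy idea in the last paragraph (every row of a policy with 5 rows gets 1/5) can be sketched in base R as below. The toy data and the column name Policy are assumptions, since the identifier column is not named in the post, and the model is cut down to two predictors. Whether 1/n is the statistically right weight is a separate question; this only shows the mechanics.

```r
# Toy data; 'Policy' is an assumed name for the policy identifier column
df <- data.frame(
  Policy     = c("POL1", "POL1", "POL1", "POL1", "POL1", "POL2", "POL3", "POL3"),
  Premium    = c(1200, 1350, 1100, 1500, 1250, 800, 950, 1020),
  Limit      = c(1, 2, 1, 3, 2, 1, 1, 2),
  Deductible = c(5, 5, 10, 5, 10, 5, 10, 10)
)

# Number of rows per policy, repeated on every row of that policy
n_rows <- ave(rep(1, nrow(df)), df$Policy, FUN = sum)

# Each row gets 1/n, so every policy carries a total weight of 1
df$w <- 1 / n_rows

# 'weights' is looked up in 'data', so the vector always matches the rows used
wls_model <- lm(log(Premium) ~ Limit + Deductible, data = df, weights = w)
summary(wls_model)
```

Storing the weights as a column of df, rather than as a separate vector, means the weights stay aligned with the rows even if some rows are dropped for missing values.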

Calculating mean grade of students' peers

I have one dataset which includes all the points of students and other variables.
I further have a square matrix (with zeros on the diagonal) which includes information on which student is a peer of another student.
Now I would like to use the second matrix (network) to calculate the mean-peer-points for each student. Everyone can have different (number of) peers.
To calculate the mean, I recalculated the simple 0,1 matrix into percentages, whereby the denominator is the sum of the number of peers one student has.
The second matrix then would look something like this:
ID1 0 0 0 0 1
ID2 0 0 0.5 0.5 0
ID3 0 0.5 0 0 0.5
ID4 0 0.5 0 0 0.5
ID5 0.33 0 0.33 0.33 0
Each student's points are a simple variable in another dataset, and I would like to have the peers' average points as a second variable:
ID Points Peers
ID1 45 11
ID2 42 33.5
ID3 25 26.5
ID4 60 26.5
ID5 11 43.33
Are there any commands in Stata for that problem? I am currently looking into the Stata commands nwcommands, but I am unsure whether it can help. I could use solutions for Stata and R.
Without getting too creative, you can accomplish what you are trying to do with reshape, collapse and a couple of merges in Stata. Generally speaking, data in long format is easier to work with for this type of exercise.
Below is an example which produces the desired result.
/* Set-up data for example */
clear
input int(id points)
1 45
2 42
3 25
4 60
5 11
end
tempfile points
save `points'
clear
input int(StudentId id1 id2 id3 id4 id5)
1 0 0 0 0 1
2 0 0 1 1 0
3 0 1 0 0 1
4 0 1 0 0 1
5 1 0 1 1 0
end
/* End data set-up */
* Reshape peers data to long form
reshape long id, i(StudentId) j(PeerId)
drop if id == 0 // drop if student is not a peer of `StudentId`
* create id variable to use in merge
replace id = PeerId
* Merge to points data to get peer points
merge m:1 id using `points', nogen
* collapse data to the student level, sum peer points
collapse (sum) PeerPoints = points (count) CountPeers = PeerId, by(StudentId)
* merge back to points data to get student points
rename StudentId id
merge 1:1 id using `points', nogen
gen peers = PeerPoints / CountPeers
li id points peers
| id points peers |
1. | 1 45 11 |
2. | 2 42 42.5 |
3. | 3 25 26.5 |
4. | 4 60 26.5 |
5. | 5 11 43.33333 |
In the above code, I reshape your peer data into long form and keep only student-peer pairs. I then merge this data to the points data to get the points of each student's peers. From here, I collapse the data back to the student level, totaling peer points and peer count in the process. At this point, you have total points for the peers of each student and the number of peers each student has. Now, you simply have to merge back to the points data to get the subject student's points and divide total peer points (PeerPoints) by the number of peers the student has (CountPeers) for average peer points.
nwcommands is an outstanding package I have never used or studied, so I will just try the problem from first principles. This is all matrix algebra, but given a matrix and a variable, I would approach it like this in Stata.
scalar third = 1/3
mat M = (0,0,0,0,1\0,0,0.5,0.5,0\0,0.5,0,0,0.5\0,0.5,0,0,0.5\third,0,third,third,0)
clear
input ID Points Peers
1 45 11
2 42 33.5
3 25 26.5
4 60 26.5
5 11 43.33
end
gen Wanted = 0
* row i of M times the Points column gives the peer mean for student i
quietly forval i = 1/5 {
forval j = 1/5 {
replace Wanted = Wanted + M[`i', `j'] * Points[`j'] in `i'
}
}
list
| ID Points Peers Wanted |
1. | 1 45 11 11 |
2. | 2 42 33.5 42.5 |
3. | 3 25 26.5 26.5 |
4. | 4 60 26.5 26.5 |
5. | 5 11 43.33 43.33334 |
Small points: Using 0.33 for 1/3 doesn't give enough precision. You'll have similar problems for 1/6 and 1/7, for example.
Also, I get that the peers of 2 are 3 and 4 so their average is (25 + 60)/2 = 42.5, not 33.5.
EDIT: A similar approach starts with a data structure very like that imagined by @ander2ed
clear
input int(id points id1 id2 id3 id4 id5)
1 45 0 0 0 0 1
2 42 0 0 1 1 0
3 25 0 1 0 0 1
4 60 0 1 0 0 1
5 11 1 0 1 1 0
end
gen wanted = 0
* sum the points of every peer j of student i
quietly forval i = 1/5 {
forval j = 1/5 {
replace wanted = wanted + id`j'[`i'] * points[`j'] in `i'
}
}
* divide by the number of peers to get the mean
egen count = rowtotal(id1-id5)
replace wanted = wanted/count
list
| id points id1 id2 id3 id4 id5 wanted count |
1. | 1 45 0 0 0 0 1 11 1 |
2. | 2 42 0 0 1 1 0 42.5 2 |
3. | 3 25 0 1 0 0 1 26.5 2 |
4. | 4 60 0 1 0 0 1 26.5 2 |
5. | 5 11 1 0 1 1 0 43.33333 3 |
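Since the question also asked for R: the same matrix algebra is a one-liner there. This is only a sketch using the five students from the example, with the 0/1 peer matrix typed in by hand (A, W and peers are names chosen here). Dividing by the row sums instead of typing 0.33 avoids the precision problem mentioned above.

```r
points <- c(45, 42, 25, 60, 11)

# 0/1 peer matrix: row i marks the peers of student i
A <- matrix(c(0, 0, 0, 0, 1,
              0, 0, 1, 1, 0,
              0, 1, 0, 0, 1,
              0, 1, 0, 0, 1,
              1, 0, 1, 1, 0), nrow = 5, byrow = TRUE)

# Row-normalise so each row sums to 1 (exact 1/3 rather than 0.33)
W <- A / rowSums(A)

# Mean points of each student's peers
peers <- as.vector(W %*% points)
data.frame(ID = 1:5, Points = points, Peers = peers)
```

A student with no peers at all would have a row sum of 0 and end up with NaN, so that case would need separate handling.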

Optimization of an R loop taking 18 hours to run

I've got R code that works and does what I want, but it takes a very long time to run. Here is an explanation of what the code does, followed by the code itself.
I've got a vector of 200,000 lines containing street addresses (strings): data.
Example:
> data[150000,]
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
I also have a 131x2 matrix of strings holding 5-grams (parts of words) and the ids of the bags of n-grams they belong to (example of a bag of 5-grams: ["stack", "tacko", "ackov", "ckove", "overf", ... ]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I have also a 200000x31 numerical matrix initialized with 0 : idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop the string addresses and check whether it contains one of the n-grams in my list or not. If it does, I put one in the corresponding column which represents the id of the bag that contains the 5-gram.
Example :
In this address: "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...], whose id is 5. So I put 1 in the column named 5, and the corresponding row of the idv_x_bags matrix looks like this:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
library(sqldf)
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags)) {
  for (ngram in list_ngrams$ngram) {
    # only flag the bag when the address actually contains this n-gram
    if (grepl(ngram, data[i, ], fixed = TRUE)) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I want, but it takes about 18 hours, which is far too long. I tried to recode it in C++ using the Rcpp library but ran into many problems. I also tried to recode it using apply, but I couldn't get it to work.
Here is what I did:
I need some help rewriting my loop using apply or some other method that runs faster than the current one. Thank you very much.
Check this one and run the simple example step by step to see how it works.
My N-Grams don't make much sense, but it will work with actual N_Grams as well.
library(dplyr)    # pipes, joins and grouping
library(reshape2) # dcast
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = FALSE)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = FALSE)
# sentence dataset
1 this is a good thing
2 this is bad
#ngrams dataset
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
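If you would rather stay in base R, another option is to flip the loop: go over the 131 n-grams and let grepl test all addresses at once, since it is vectorised over the strings. This is only a sketch on made-up data. The two addresses and the four n-grams are invented, and addresses is assumed to be a plain character vector; the names list_ngrams, idSac and idv_x_bags follow the question.

```r
# Toy data; the second address and the n-gram list are invented
addresses <- c(
  "15 rue andre lalande residence marguerite yourcenar 91000 evry france",
  "10 downing street london"
)
list_ngrams <- data.frame(
  idSac = c(4, 4, 5, 5),
  ngram = c("stree", "treet", "resid", "eside"),
  stringsAsFactors = FALSE
)

# One column per bag, one row per address, all zeros to start with
bags <- sort(unique(list_ngrams$idSac))
idv_x_bags <- matrix(0L, nrow = length(addresses), ncol = length(bags),
                     dimnames = list(NULL, as.character(bags)))

# One vectorised grepl per n-gram instead of one call per address/n-gram pair
for (k in seq_len(nrow(list_ngrams))) {
  hit <- grepl(list_ngrams$ngram[k], addresses, fixed = TRUE)
  idv_x_bags[hit, as.character(list_ngrams$idSac[k])] <- 1L
}
idv_x_bags
```

This replaces 200,000 x 131 separate checks and SQL lookups with 131 vectorised calls. The fixed = TRUE argument treats each n-gram as a literal string rather than a regular expression.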

How to clean and re-code check-all-that-apply responses in R survey data?

I've got survey data with some multiple-response questions like this:
HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care
where multiple responses were entered with commas and are recorded as different levels i.e.:
[1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999
This is as much a data-cleaning protocol question as an R question...I'm doing the cleaning, but not the analysis, so everything needs to be transparent and user-friendly when I pass it back...and the PI doesn't use R. Basically I'd like to split the multiples into levels and re-name them while keeping them together as a single observation...not sure how to do this, or even if it's the right approach.
How do you generally deal with this issue? Is there an elegant way to process this for analysis in STATA (simple descriptives, regressions, odds ratios)?
Thanks everyone!!!
My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)
Your sample data here looks like it includes data that is not legal: 0, 888, and 999 are not listed in the options. It's possible/likely that these include DK/NR responses, but I can't be certain. As such:
Your data cleaning should be taking care of these anomalies before this step of converting 0+ length lists into indicator variables.
My code below arbitrarily ignores this fact and you will lose data. This is obviously not "A Good Thing™" in the long run. More robust checks are warranted (and not difficult). (I've added an other column to indicate something was lost.)
The code:
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
.Names = paste0('HS18.', lvls),
row.names = c(NA, -length(dat)), class = 'data.frame')
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)
## HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
The resulting data.frame can be cbinded (or even matrixized) to whatever other data you have.
(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)
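The "more robust checks" mentioned above could start out something like this. It is only a sketch on a handful of the responses from the question. It flags codes outside 1 to 8 and, as an assumption about what counts as a problem here, responses that combine option 8 ("None of these") with another option.

```r
# A few of the raw responses from the question
resp  <- c("888", "1", "3,5", "4,5,6", "999", "0", "7,8")
valid <- as.character(1:8)

# Split every response into its individual codes
codes <- strsplit(resp, ",", fixed = TRUE)

# Codes that are not among the listed options
bad        <- lapply(codes, setdiff, valid)
has_bad    <- lengths(bad) > 0
unexpected <- table(unlist(bad))

# "None of these" (8) ticked together with something else
contradictory <- vapply(codes,
                        function(x) "8" %in% x && length(x) > 1,
                        logical(1))

unexpected
data.frame(resp, has_bad, contradictory)
```

The table of unexpected codes is a compact thing to hand back to the PI, who can then decide what 0, 888 and 999 are supposed to mean before anything gets recoded.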

How can I calculate an inner product with an arbitrary number of columns using ddply?

I want to perform an inner product of the first D columns for each row in a data frame with a given array, W. I am trying the following:
W = c(1,2,3);
ddply(df, .(id), transform, inner_product=c(col1, col2, col3) %*% W);
This works but I typically may have an arbitrary number of columns. Can I generalize the above expression to handle that case?
This is an updated example as asked for in the comments:
library(plyr);
library(kernlab);
data(spam);
W = array();
W[1:3] = seq(1,3);
spamdf = head(spam);
spamdf$id = seq(1,nrow(spamdf));
df_out=ddply(spamdf, .(id), transform, inner_product=c(make, address, all) %*% W);
> W
[1] 1 2 3
> spamdf[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id
1 0 0 3.756 61 278 spam 1
> df_out[1,]
make address all num3d our over remove internet order mail receive will
1 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64
people report addresses free business email you credit your font num000
1 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0
money hp hpl george num650 lab labs telnet num857 data num415 num85
1 0 0 0 0 0 0 0 0 0 0 0 0
technology num1999 parts pm direct cs meeting original project re edu table
1 0 0 0 0 0 0 0 0 0 0 0 0
conference charSemicolon charRoundbracket charSquarebracket charExclamation
1 0 0 0 0 0.778
charDollar charHash capitalAve capitalLong capitalTotal type id inner_product
1 0 0 3.756 61 278 spam 1 3.2
The above example performs an inner product of the first three dimensions of the spam data set (available in the kernlab package) with an array W=(1,2,3). Here I have explicitly specified the first three dimensions as c(make, address, all).
Thus df_out[1,"inner_product"] = 3.2.
Instead I want to perform the inner product over all the dimensions without having to list all the dimensions. The conversion to a matrix and back to a data frame seems to be an expensive operation?
A strategy along the lines of the following should work:
Convert each chunk to a matrix
Perform a matrix multiplication
Convert results to data.frame
The code:
library(plyr)
df <- data.frame(
id=sample(1:5, 20, replace=TRUE),
col1 = runif(20),
col2 = runif(20),
col3 = runif(20),
col4 = runif(20)
)
W <- c(1,2,3,4)
ddply(df, .(id), function(x)as.data.frame(as.matrix(x[, -1]) %*% W))
The results:
id V1
1 1 4.924994
2 1 5.076043
3 2 7.053864
4 2 5.237132
5 2 6.307620
6 2 3.413056
7 2 5.182214
8 2 7.623164
9 3 5.194714
10 3 6.733229
11 4 4.122548
12 4 3.569013
13 4 4.978939
14 4 5.513444
15 4 5.840900
16 4 6.526522
17 5 3.530220
18 5 3.549646
19 5 4.340173
20 5 3.955517
If you want to append a column of cross-products, you could do this (assuming W has the right number of elements to match the non-"id" columns):
df2 <- cbind(df, as.matrix(df[, -grep("id", names(df))]) %*% W )
It does not appear that the .(id) serves any useful purpose, since you are not doing a sum of cross-products within id, and if you were then you wouldn't be using transform but some other aggregating function.
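For the "first D columns" wording in the question, here is a small base R sketch that takes D from the length of W, so nothing has to be listed by name. The data frame is made up, and D and value_cols are just names chosen here.

```r
# Made-up data: an id column plus four numeric columns
df <- data.frame(
  id   = 1:4,
  col1 = c(1, 0, 2, 1),
  col2 = c(0, 1, 1, 2),
  col3 = c(1, 1, 0, 3),
  col4 = c(2, 2, 2, 2)
)
W <- c(1, 2, 3)

# Use as many leading non-id columns as W has elements
D <- length(W)
value_cols <- setdiff(names(df), "id")[seq_len(D)]

# One matrix product gives the inner product for every row at once
df$inner_product <- as.vector(as.matrix(df[, value_cols]) %*% W)
df
```

The conversion to a matrix happens once for the whole data frame rather than once per id, so its cost should be small next to a ddply call per row.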
