In R, I am trying to optimize the following: choose the rows which maximize the number of columns whose sum exceeds a certain value that varies by column, plus some other basic constraints on the row selection.
Is there anything out there in R which allows you to incorporate logic into an objective function? I.e. maximize countif( sum(value column) > target value for column ) over ~10k columns, choosing 5 rows from ~500 row choices.
Simple example: grab the combination of 4 rows below whose column sums exceed the targets more frequently than any other combination of 4 rows.
+--------+------+------+------+------+------+------+------+------+------+-------+
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
+--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135 | 250 | 89 | 109 | 210 | 184 | 178 |
+--------+------+------+------+------+------+------+------+------+------+-------+
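(For a table this small, the objective can be checked by brute force, just to make the "countif" objective concrete; here m would be the 10x10 matrix of values above and tgt the target row.)
## enumerate every combination of 4 rows and count, for each, how many
## column sums meet or exceed the target
combos <- combn(nrow(m), 4)
hits   <- apply(combos, 2, function(rows) sum(colSums(m[rows, ]) >= tgt))
combos[, which.max(hits)]  # the best set of 4 rows
max(hits)                  # number of columns whose target is met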
EDIT + Update: I implemented the following using ompr, ROI, and some Big M logic.
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
library(dplyr)   # for the pipe

nr <- 10 # number of rows
nt <- 15 # number of target columns
vals <- matrix(sample.int(nr*nt, nr*nt), nrow=nr, ncol=nt)
targets <- rep(4*mean(vals), nt)
model <- MIPModel() %>%
add_variable(x[i], i = 1:nr, type = "binary") %>%
add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
add_variable(A[j], j = 1:nt, type = "binary") %>%
add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
add_constraint(s[j] <= 9999999*A[j], j =1:nt)%>%
add_constraint(s[j] >= A[j], j =1:nt)%>%
add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%
set_objective(sum_expr(-9999999*A[j], i = 1:nr, j = 1:nt), "max")
model <- solve_model(model,with_ROI(solver = "glpk"))
The model works great for small problems, including those where no solution exists which exceeds the target of every column.
However, the above returns Infeasible when I change the number of columns to even just 150. Given that I tested various scenarios on my smaller example, my hunch is that my model definition is OK...
Any suggestions as to why this is infeasible? Or maybe a more optimal way to define my model?
You could try a Local-Search algorithm. It may give you only a "good" solution; but in exchange it is highly flexible.
Here is a sketch. Start with an arbitrary valid solution x; for instance, for your example data:
x <- c(rep(TRUE, 4), rep(FALSE, 6))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Define an objective function:
obj_fun <- function(x, table, target, ...) {
-sum(colSums(table[x, ]) >= target)
}
Given a table and a target vector, it selects the rows
flagged in x and counts how many column sums
match or exceed the target. I write -sum
because I'll use an implementation that minimises an
objective function.
-obj_fun(x, table, target)
## [1] 7
So, for the chosen initial solution, 7 column sums are equal to or greater than the target.
Then you'll need a neighbourhood function. It takes a
solution x and returns a slightly changed version (a
"neighbour" of the original x). Here is a neighbour function
that changes a single row in x.
nb <- function(x, ...) {
true <- which( x)
false <- which(!x)
i <- true[sample.int(length( true), size = 1)]
j <- false[sample.int(length(false), size = 1)]
x[i] <- FALSE
x[j] <- TRUE
x
}
x
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
nb(x)
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## ^^^^^ ^^^^
Here is your data:
library("orgutils")
tt <- readOrg(text = "
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
|--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135 | 250 | 89 | 109 | 210 | 184 | 178 |
")
table <- tt[1:10, -1]
target <- tt[11, -1]
Run the search; in this case, with an algorithm called
"Threshold Accepting". I use the implementation in package NMOF (which I maintain).
library("NMOF")
x0 <- c(rep(TRUE, 4), rep(FALSE, 6))
sol <- TAopt(obj_fun,
list(neighbour = nb, ## neighbourhood fun
x0 = sample(x0), ## initial solution
nI = 1000, ## iterations
OF.target = -ncol(target) ## when to stop
),
target = target,
table = as.matrix(table))
rbind(Sums = colSums(table[sol$xbest, ]), Target = target)
## col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
## Sums 222 206 216 135 252 148 175 239 198 181
## Target 200 100 125 135 250 89 109 210 184 178
As I said, this is only a sketch, and depending on how large and important your actual problem is, there are a number of points to consider:
- Most importantly: nI sets the number of search iterations. 1000 is the default, but you'll definitely want to play around with this number.
- There may be cases (i.e. datasets) for which the objective function does not provide good guidance: if selecting different rows does not change the number of columns for which the target is met, the algorithm cannot judge whether a new solution is better than the previous one. Adding more-continuous guidance (e.g. via some distance-to-target measure) may help.
- Updating: the computation above does more work than necessary. When a new candidate solution is evaluated, there is no need to recompute the full column sums; instead, only adjust the previous solution's sums by the rows that changed (see the small sketch below). For a small dataset, this won't matter much.
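For illustration, a minimal sketch of that bookkeeping (not wired into TAopt; it just shows the idea, assuming table, target and x as defined above):
tab_m <- as.matrix(table)
tgt   <- as.numeric(target)

sums <- colSums(tab_m[x, ])    ## column sums of the current selection

## when row i_out is swapped out for row i_in, only two rows change
update_sums <- function(sums, tab_m, i_out, i_in)
    sums - tab_m[i_out, ] + tab_m[i_in, ]

## objective value recomputed from the cached sums
-sum(sums >= tgt)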
This isn't quite what you asked, as it is cast in Python, but perhaps it will show you the approach to doing this with Integer Programming. You should be able to replicate it in R, since R has bindings for several solvers suitable for Integer Programs, including CBC, which is the one I'm using below.
I'm also using pyomo to frame up the math model for the solver. I think with a little research you could find an equivalent way to do this in R. The syntax at the start just ingests the data (which I pasted into a .csv file); the rest should be readable.
The good/bad...
This solves almost immediately for your toy problem. It can be shown that 5 rows can exceed all column totals.
For many more columns, it can bog down greatly. I did a couple of tests with large matrices of random numbers. This is very challenging for the solver because it cannot identify "good" rows easily. For a 500x100 matrix of random values (with the total row also randomized and then multiplied by 5, the number of selections, just to make it challenging), I can only get it to solve in reasonable time by relaxing the tolerance on the solution.
If you really have 10K columns, there are only a couple of ways this could work: 1. several rows happen to cover all the column totals (the solver should discover this quickly); 2. there is some pattern (other than random noise) in the data/totals that can guide the solver; or 3. you use a large ratio-based gap (or a time limit).
import pyomo.environ as pyo
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", header=None) # this is the data from the post
# uncomment this below for a randomized set of data
# df = pd.DataFrame(
# data = np.random.random(size=(500,100)))
# df.iloc[-1] = df.iloc[-1]*5
# convert to dictionary
data = df.iloc[:len(df)-1].stack().to_dict()
col_sums = df.iloc[len(df)-1].to_dict()
limit = 5  # max number of rows selected
m = pyo.ConcreteModel('row picker')
### SETS
m.R = pyo.Set(initialize=range(len(df)-1))
m.C = pyo.Set(initialize=range(len(df.columns)))
### Params
m.val = pyo.Param(m.R, m.C, initialize=data)
m.tots = pyo.Param(m.C, initialize=col_sums)
### Variables
m.sel = pyo.Var(m.R, domain=pyo.Binary) # indicator for which rows are selected
m.abv = pyo.Var(m.C, domain=pyo.Binary) # indicator for which column is above total
### OBJECTIVE
m.obj = pyo.Objective(expr=sum(m.abv[c] for c in m.C), sense=pyo.maximize)
### CONSTRAINTS
# limit the total number of selections...
m.sel_limit = pyo.Constraint(expr=sum(m.sel[r] for r in m.R) <= limit)
# link the indicator variable to the column sum
def c_sum(m, c):
return sum(m.val[r, c] * m.sel[r] for r in m.R) >= m.tots[c] * m.abv[c]
m.col_sum = pyo.Constraint(m.C, rule=c_sum)
### SOLVE
print("...built... solving...")
solver = pyo.SolverFactory('cbc', options={'ratio': 0.05})
result = solver.solve(m)
print(result)
### Inspect answer ...
print("rows to select: ")
for r in m.R:
if m.sel[r]:
print(r, end=', ')
print("\ncolumn sums from those rows")
tots = [sum(m.val[r,c]*m.sel[r].value for r in m.R) for c in m.C]
print(tots)
print(f'percentage of column totals exceeded: {len([1 for c in m.C if m.abv[c]])/len(m.C)*100:0.2f}%')
Yields:
Problem:
- Name: unknown
Lower bound: -10.0
Upper bound: -10.0
Number of objectives: 1
Number of constraints: 11
Number of variables: 20
Number of binary variables: 20
Number of integer variables: 20
Number of nonzeros: 10
Sense: maximize
Solver:
- Status: ok
User time: -1.0
System time: 0.0
Wallclock time: 0.0
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Black box:
Number of iterations: 0
Error rc: 0
Time: 0.013128995895385742
Solution:
- number of solutions: 0
number of solutions displayed: 0
rows to select:
0, 2, 3, 8, 9,
column sums from those rows
[261.0, 203.0, 153.0, 174.0, 312.0, 194.0, 287.0, 246.0, 214.0, 321.0]
percentage of column totals exceeded: 100.00%
[Finished in 845ms]
Edit:
I see your edit follows a similar pattern to the above solution.
The reason you are getting "INFEASIBLE" for larger instantiations is that your Big-M is no longer big enough once the values are bigger and more of them are summed. You should pre-analyze your matrix and set BIG_M to the maximal value in your target row, which (by inspection) is big enough to cover any gap. That keeps you feasible without massively overshooting on BIG_M, which has consequences of its own.
I tweaked a few things in your R model. My R syntax is terrible, but try this out:
BIG_M <- max(targets)   # as suggested above: the largest target is a big-enough M

model <- MIPModel() %>%
add_variable(x[i], i = 1:nr, type = "binary") %>%
add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
add_variable(A[j], j = 1:nt, type = "binary") %>%
add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
add_constraint(s[j] <= BIG_M*A[j], j =1:nt)%>%
# NOT NEEDED: add_constraint(s[j] >= A[j], j =1:nt)%>%
# DON'T include A[j]: add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%
add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + s[j] >= targets[j], j=1:nt) %>%
# REMOVE unneeded indexing for i: set_objective(sum_expr(A[j], i = 1:nr, j = 1:nt), "min")
# and just minimize. No need to multiply by a large constant here.
set_objective(sum_expr(A[j], j = 1:nt), "min")
model <- solve_model(model,with_ROI(solver = "glpk"))
This is IMHO a linear programming modeling question: Can we formulate the problem as a "normalized" linear problem that can be solved by, for example, ompr or ROI (I would add lpSolveAPI)?
I believe it is possible, though I do not have the time to provide the full formulation. Here are some ideas:
As parameters, i.e. fixed values, we have
nr <- 10 # number of rows
nt <- 10 # number of target columns
vals <- matrix(sample.int(100, nr*nt), nrow=nr, ncol=nt)
targets <- sample.int(300, nt)
The decision variables we are interested in are x[1...nr] as binary variables (1 if the row is picked, 0 otherwise).
Obviously, one constraint would be sum(x[i],i)==4 -- the numbers of rows we pick.
For the objective, I would introduce auxiliary variables, such as
y[j] = 1, if sum_{i=1..nr} x[i]*vals[i,j]>= targets[j]
(and 0 otherwise) for j=1...nt. Now this definition of y is not compatible with linear programming and needs to be linearized. If we can assume that vals[i,j] and targets[j] are greater than or equal to zero, then we can define y[j] as binary variables like this:
x'vals[,j] - targets[j]*y[j] >= 0
(x'vals[,j] is meant as an inner product, i.e. sum(x[i]*vals[i,j], i).)
In the case x'vals[,j] >= targets[j], the value y[j]==1 is feasible. In the case x'vals[,j] < targets[j], y[j]==0 is enforced.
With the objective max sum(y[j],j), we should get a proper formulation of the problem. No big-M required, but additional non-negativity assumptions are introduced.
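A sketch of this formulation with ompr (my sketch, reusing nr, nt, vals and targets from above and assuming they are non-negative):
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
library(dplyr)

model <- MIPModel() %>%
  add_variable(x[i], i = 1:nr, type = "binary") %>%
  add_variable(y[j], j = 1:nt, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = 1:nr) == 4) %>%
  # y[j] can only be 1 if the selected rows reach the target in column j
  add_constraint(sum_expr(vals[i, j] * x[i], i = 1:nr) - targets[j] * y[j] >= 0, j = 1:nt) %>%
  set_objective(sum_expr(y[j], j = 1:nt), "max")

result <- solve_model(model, with_ROI(solver = "glpk"))
sol_x <- get_solution(result, x[i])
sol_x$i[sol_x$value > 0.5]   # the selected rows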
What you want to solve here is called a "mixed integer program", and there's lots of (mostly commercial) software designed around it.
Your typical R functions such as optim are hardly any good for this due to the kind of constraints, but you can use specialized software (such as CBC) as long as you can frame the problem in a standard MIP structure (in this case, the variables to optimize are binary indicators, one per row of your data).
As an alternative, you could also look at the nloptr package with its global derivative-free black-box optimizers, in which you can plug in a function like this (setting bounds on the variables) and let it optimize with some general-purpose heuristics.
I have some survey data. As an example, I use the Credit data from the ISLR package.
library(ISLR)
The distribution of Gender in the data looks like this
prop.table(table(Credit$Gender))
Male Female
0.4825 0.5175
and the distribution of Student looks like this.
prop.table(table(Credit$Student))
No Yes
0.9 0.1
Let's say that, in the population, the actual distribution of Gender is Male/Female (0.35/0.65) and the distribution of Student is Yes/No (0.2/0.8).
In SPSS it's possible to weight the sample by dividing the "population distribution" by the "distribution of the sample" to simulate the distribution of the population. This process is called "RIM weighting". The data will only be analyzed with crosstables (i.e. no regression, t-tests, etc.). What is a good method in R to weight a sample so that the data can later be analyzed with crosstables? (A small sketch of the basic idea follows below.)
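For a single variable, the basic idea looks like this (full RIM weighting/raking iterates this adjustment over all weighting variables until the margins converge; the targets are the ones stated above):
library(ISLR)

# target (population) shares; note the leading space in " Male",
# which matches the factor level in the Credit data
gender_target <- c(" Male" = 0.35, "Female" = 0.65)
gender_sample <- prop.table(table(Credit$Gender))

# each respondent's weight = target share / sample share of their category
w <- as.numeric(gender_target[as.character(Credit$Gender)] /
                gender_sample[as.character(Credit$Gender)])

# the weighted Gender distribution now matches the target
prop.table(tapply(w, Credit$Gender, sum))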
It is possible to calculate the RIM weights in R.
install.packages("devtools")
devtools::install_github("ttrodrigz/iterake")
credit_uni = universe(df = Credit,
category(
name = "Gender",
buckets = c(" Male", "Female"),
targets = c(.35, .65)),
category(
name = "Student",
buckets = c("Yes", "No"),
targets = c(.2, .8)))
credit_weighted = iterake(Credit, credit_uni)
-- iterake summary -------------------------------------------------------------
Convergence: Success
Iterations: 5
Unweighted N: 400.00
Effective N: 339.58
Weighted N: 400.00
Efficiency: 84.9%
Loss: 0.178
Here is the SPSS output (crosstables) of the weighted data
Student
No Yes
Gender Male 117 23 140
Female 203 57 260
320 80 400
and here is the output for the unweighted data (I exported both files and made the calculation in SPSS; I weighted the weighted sample by the calculated weights).
Student
No Yes
Gender Male 177 16 193
Female 183 24 207
360 40 400
In the weighted data set, I have the desired distribution Student: Yes/No(0.2/0.8) and Gender male/female(0.35/0.65).
Here is another example using SPSS of Gender and Married (weighted)
Married
No Yes
Gender Male 57 83 140
Female 102 158 260
159 241 400
and unweighted.
Married
No Yes
Gender Male 76 117 193
Female 79 128 207
155 245 400
This doesn't work in R (i.e. both crosstables look like the unweighted one).
library(expss)
cro(Credit$Gender, Credit$Married)
cro(credit_weighted$Gender, credit_weighted$Married)
| | | Credit$Married | |
| | | No | Yes |
| ------------- | ------------ | -------------- | --- |
| Credit$Gender | Male | 76 | 117 |
| | Female | 79 | 128 |
| | #Total cases | 155 | 245 |
| | | credit_weighted$Married | |
| | | No | Yes |
| ---------------------- | ------------ | ----------------------- | --- |
| credit_weighted$Gender | Male | 76 | 117 |
| | Female | 79 | 128 |
| | #Total cases | 155 | 245 |
With the expss package you need to explicitly provide your weight variable. As far as I understand, iterake adds a special variable, weight, to the dataset:
library(expss)
cro(Credit$Gender, Credit$Married) # unweighted result
cro(credit_weighted$Gender, credit_weighted$Married, weight = credit_weighted$weight) # weighted result
I'm trying to build a function in R with which I can subset my raw dataframe according to some specifications, and then convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, since for some particular specifications I have no data; hence no proportion table can be calculated. So, what I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty subsetted dataframe.
Here are some insights into my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which a number of individuals were measured (no_individuals). For example, from the first row we see that in the first quarter of 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm with age 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically subset the raw dataframe for a particular year, quarter and area combination, and from that subset calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand rows by the number of individuals, then build the proportion table
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area and year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2 with the same area and year), and fill the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year at all! In the example above, one can see this by looking at area 24 in 2007. In this case I cannot borrow the information from the nearest quarter and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I haven't made any progress.
So, any help here will be very much appreciated.
Here my LAK function which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of an empty (all-NA) subset combination
  #if(is.data.frame(sALK) && nrow(sALK)==0){
  if(nrow(sALK[rowSums(is.na(sALK)) > 0, ]) > 0){
    warning("Empty subset combination; data will be subsetted based on the
             nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of someone's interest:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    # all data for the same year and area
    sALK2 <- subset(df, year==syear & area==sarea)
    # pick a time_comb (year.quarter.area) that actually holds measurements
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    idx <- which(vals$Freq > 1)
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year==syear & area==sarea & time_comb==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year & area combination. Yet, I'm still struggling to figure out how to deal with the case where I do not have data for a particular year & area combination at all. In that case I need to borrow data from the closest year that contains data for all the quarters of the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
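For what it's worth, here is a hypothetical helper along those lines (my sketch, only checked against the example data above): within the same area and quarter, it looks for the closest year that actually has measurements.
# Hypothetical helper: same area and quarter, closest year with measurements
nearest_year <- function(df, Year, Quarter, Area) {
  cand <- subset(df, area == Area & quarter == Quarter & !is.na(no_individuals))
  if (nrow(cand) == 0) return(NULL)        # nothing to borrow for this area/quarter
  yrs  <- as.numeric(as.character(cand$year))
  dist <- abs(yrs - as.numeric(Year))
  cand[dist == min(dist), ]                # rows of the closest year
}

# e.g. nearest_year(df, Year = 2007, Quarter = 1, Area = 24) returns the
# year=2006, quarter=1, area=24 rows, which could then replace sALK inside LAK()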
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data are distributed, so that you can choose the method most suited to your problem. Check this brief explanation and the original package description.
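A minimal sketch of what that could look like with the example data (assuming df is the raw dataframe from the question; the method and column choice are just placeholders):
library(mice)

# impute using year/quarter/area as predictors; "pmm" (predictive mean
# matching) is mice's default method for numeric variables
imp <- mice(df[, c("year", "quarter", "area", "no_individuals", "lenCls", "age")],
            m = 5, method = "pmm", seed = 1)

# one completed dataset with the NAs filled in
df_complete <- complete(imp, 1)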
Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the gmodels package gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like Hmisc is very good at creating various summaries and exporting to LaTeX (ultimately what I need to do).
My questions are: 1) can you create the table below easily with Hmisc? 2) if so, can I interact variables (2 in the column)? and finally 3) can I access the p-values of the significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should provide you with an output just like the one you displayed on your question, very SPSS-y. :)
If you are coming from SPSS, you may be interested in the package Deducer ( http://www.deducer.org ). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can export the counts and the test to LaTeX or HTML using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)