(RIM) weighting samples in R

(RIM) weighting samples in R - r

I have some survey data. As an example, I use the credit data from the ÌSLR
package.
library(ISLR)
The distribution of Gender in the data looks like this
prop.table(table(Credit$Gender))
Male Female
0.4825 0.5175
and the distribution of Student looks like this.
prop.table(table(Credit$Student))
No Yes
0.9 0.1
Let´s say, in the population, the actual distribution of Gender is Male/Female(0.35/0.65) and the distribution of Student is Yes/No(0.2/0.8).
In SPSS it´s possible to weight the samples, by dividing the "population distribution" by the "distribution of the sample" to simulated the distribution of the population. This process is called "RIM Weighting". The data will be only analyzed by crosstables (i.e. no regression, t-test, etc.). What is a good method in R the weight a sample, in order to analyze the data by crosstables later on?
It is possible to calculate the RIM weights in R.
install.packages("devtools")
devtools::install_github("ttrodrigz/iterake")
credit_uni = universe(df = Credit,
category(
name = "Gender",
buckets = c(" Male", "Female"),
targets = c(.35, .65)),
category(
name = "Student",
buckets = c("Yes", "No"),
targets = c(.2, .8)))
credit_weighted = iterake(Credit, credit_uni)
-- iterake summary -------------------------------------------------------------
Convergence: Success
Iterations: 5
Unweighted N: 400.00
Effective N: 339.58
Weighted N: 400.00
Efficiency: 84.9%
Loss: 0.178
Here the SPSS output (crosstables) of the weighted data
Student
No Yes
Gender Male 117 23 140
Female 203 57 260
320 80 400
and here from the unweighted data (I export both files and made the calculation in SPSS. I weighted the weighted sample by the calculated weights).
Student
No Yes
Gender Male 177 16 193
Female 183 24 20
360 40 400
In the weighted data set, I have the desired distribution Student: Yes/No(0.2/0.8) and Gender male/female(0.35/0.65).
Here is another example using SPSS of Gender and Married (weighted)
Married
No Yes
Gender Male 57 83 140
Female 102 158 260
159 241 400
and unweighted.
Married
No Yes
Gender Male 76 117 193
Female 79 128 207
155 245 400
This doesn't work in R (i.e. both crosstables looks like the unweighted one).
library(expss)
cro(Credit$Gender, Credit$Married)
cro(credit_weighted$Gender, credit_weighted$Married)
| | | Credit$Married | |
| | | No | Yes |
| ------------- | ------------ | -------------- | --- |
| Credit$Gender | Male | 76 | 117 |
| | Female | 79 | 128 |
| | #Total cases | 155 | 245 |
| | | credit_weighted$Married | |
| | | No | Yes |
| ---------------------- | ------------ | ----------------------- | --- |
| credit_weighted$Gender | Male | 76 | 117 |
| | Female | 79 | 128 |
| | #Total cases | 155 | 245 |

With expss package you need to explicitly provide your weight variable. As far as I understand iterake adds special variable weight to the dataset:
library(expss)
cro(Credit$Gender, Credit$Married) # unweighted result
cro(credit_weighted$Gender, credit_weighted$Married, weight = credit_weighted$weight) # weighted result

Related

Is there a way to define a complex objective function in an R optimizer?

In R, I am trying to optimize the following : choose rows which maximize the number of columns whose sum exceeds a certain value which varies by column + some other basic constraints on the row selections.
Is there anything out there in R which allows you to incorporate logic into an objective function? ie maximize countif ( sum(value column) > target value for column ) over ~10k columns choosing 5 rows with ~ 500 row choices.
Simple example: Grab the combo of 4 rows below whose col sum exceeds the target more frequently than any other combo of 4 rows.
+--------+------+------+------+------+------+------+------+------+------+-------+
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
+--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135| 250 | 89 | 109 | 210| 184 | 178 |
+--------+------+------+------+------+------+------+------+------+------+-------+
EDIT + Update: I implemented the following using ompr, ROI, and some Big M logic.
nr <- 10 # number of rows
nt <- 15 # number of target columns
vals <- matrix(sample.int(nr*nt, nr*nt), nrow=nr, ncol=nt)
targets <- vector(length=nt)
targets[1:nt] <- 4*mean(vals)
model <- MIPModel() %>%
add_variable(x[i], i = 1:nr, type = "binary") %>%
add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
add_variable(A[j], j = 1:nt, type = "binary") %>%
add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
add_constraint(s[j] <= 9999999*A[j], j =1:nt)%>%
add_constraint(s[j] >= A[j], j =1:nt)%>%
add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%
set_objective(sum_expr(-9999999*A[j], i = 1:nr, j = 1:nt), "max")
model <- solve_model(model,with_ROI(solver = "glpk"))
The model works great for small problems, including those where no solution exists which exceeds the target of every column.
However, the above returns Infeasible when I change the number of columns to even just 150. Given that I tested various scenarios on my smaller example, my hunch is that my model definition is OK...
Any suggestions as to why this is infeasible? Or maybe a more optimal way to define my model?

You could try a Local-Search algorithm. It may give you only a "good" solution; but in exchange it is highly flexible.
Here is a sketch. Start with an arbitrary valid solution x, for instance
for your example data
x <- c(rep(TRUE, 4), rep(FALSE, 6))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Define an objective function:
obj_fun <- function(x, table, target, ...) {
-sum(colSums(table[x, ]) >= target)
}
Given a table and a target vector, it selects the rows
defined in x and calculates the number of rowsums that
match or exceed the target. I write -sum
because I'll use an implementation that minimises an
objective function.
-obj_fun(x, table, target)
## [1] 7
So, for the chosen initial solution, 7 column sums are equal to or greater than the target.
Then you'll need a neighbourhood function. It takes a
solution x and returns a slightly changed version (a
"neighbour" of the original x). Here is a neighbour function
that changes a single row in x.
nb <- function(x, ...) {
true <- which( x)
false <- which(!x)
i <- true[sample.int(length( true), size = 1)]
j <- false[sample.int(length(false), size = 1)]
x[i] <- FALSE
x[j] <- TRUE
x
}
x
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
nb(x)
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## ^^^^^ ^^^^
Here is your data:
library("orgutils")
tt <- readOrg(text = "
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
|--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135| 250 | 89 | 109 | 210| 184 | 178 |
")
table <- tt[1:10, -1]
target <- tt[11, -1]
Run the search; in this case, with an algorithm called
"Threshold Accepting". I use the implementation in package NMOF (which I maintain).
library("NMOF")
x0 <- c(rep(TRUE, 4), rep(FALSE, 6))
sol <- TAopt(obj_fun,
list(neighbour = nb, ## neighbourhood fun
x0 = sample(x0), ## initial solution
nI = 1000, ## iterations
OF.target = -ncol(target) ## when to stop
),
target = target,
table = as.matrix(table))
rbind(Sums = colSums(table[sol$xbest, ]), Target = target)
## col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
## Sums 222 206 216 135 252 148 175 239 198 181
## Target 200 100 125 135 250 89 109 210 184 178
As I said, this is a only sketch, and depending on how
large and important your actual problem is, there are a number
of points to consider:
most importantly: nI sets the number of search
iterations. 1000 is the default, but you'll definitely
want to play around with this number.
there may be cases (i.e. datasets) for which the
objective function does not provide good guidance: if
selecting different rows does not change the number
of columns for which the target is met, the algorithm
cannot judge if a new solution is better than the
previous one. Thus, adding more-continuous guidance
(e.g. via some distance-to-target) may help.
updating: the computation above actually does a lot
that's not necessary. When a new candidate solution
is evaluated, there would actually be no need to
recompute the full column sums. Instead, only adjust
the previous solution's sums by the changed
rows. (For a small dataset, this won't matter much.)

This isn't quite what you asked as it is cast in python, but perhaps it will show you the approach to doing this with Integer Programming. You should be able to replicate this in R as there are bindings in R for several solvers, including CBC, which is the one I'm using below, which is suitable for Integer Programs.
I'm also using pyomo to frame up the math model for the solver. I think with a little research, you could find equivalent way to do this in R. The syntax at the start is just to ingest the data (which I just pasted into a .csv file). The rest should be readable.
The good/bad...
This solves almost immediately for your toy problem. It can be shown that 5 rows can exceed all column totals.
For many more columns, it can bog down greatly. I did a couple tests with large matrices of random numbers.... This is very challenging for the solver because it cannot identify "good" rows easily. I can get it to solve for 500x100 with random values (and the total row randomized and multiplied by 5 (the number of selections....just to make it challenging) only in reasonable time by relaxing the tolerance on the solution.
If you really have 10K columns, there are only a couple ways this could work... 1. You have several rows that can cover all the column totals (solver should discover this quickly) or 2. there is some pattern (other than random noise) to the data/totals that can guide the solver, and 3. Using a large ratio based gap (or time limit)
import pyomo.environ as pyo
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv", header=None) # this is the data from the post
# uncomment this below for a randomized set of data
# df = pd.DataFrame(
# data = np.random.random(size=(500,100)))
# df.iloc[-1] = df.iloc[-1]*5
# convert to dictionary
data = df.iloc[:len(df)-1].stack().to_dict()
col_sums = df.iloc[len(df)-1].to_dict()
limit = 5 # max number or rows selected
m = pyo.ConcreteModel('row picker')
### SETS
m.R = pyo.Set(initialize=range(len(df)-1))
m.C = pyo.Set(initialize=range(len(df.columns)))
### Params
m.val = pyo.Param(m.R, m.C, initialize=data)
m.tots = pyo.Param(m.C, initialize=col_sums)
### Variables
m.sel = pyo.Var(m.R, domain=pyo.Binary) # indicator for which rows are selected
m.abv = pyo.Var(m.C, domain=pyo.Binary) # indicator for which column is above total
### OBJECTIVE
m.obj = pyo.Objective(expr=sum(m.abv[c] for c in m.C), sense=pyo.maximize)
### CONSTRAINTS
# limit the total number of selections...
m.sel_limit = pyo.Constraint(expr=sum(m.sel[r] for r in m.R) <= limit)
# link the indicator variable to the column sum
def c_sum(m, c):
return sum(m.val[r, c] * m.sel[r] for r in m.R) >= m.tots[c] * m.abv[c]
m.col_sum = pyo.Constraint(m.C, rule=c_sum)
### SOLVE
print("...built... solving...")
solver = pyo.SolverFactory('cbc', options={'ratio': 0.05})
result = solver.solve(m)
print(result)
### Inspect answer ...
print("rows to select: ")
for r in m.R:
if m.sel[r]:
print(r, end=', ')
print("\ncolumn sums from those rows")
tots = [sum(m.val[r,c]*m.sel[r].value for r in m.R) for c in m.C]
print(tots)
print(f'percentage of column totals exceeded: {len([1 for c in m.C if m.abv[c]])/len(m.C)*100:0.2f}%')
Yields:
Problem:
- Name: unknown
Lower bound: -10.0
Upper bound: -10.0
Number of objectives: 1
Number of constraints: 11
Number of variables: 20
Number of binary variables: 20
Number of integer variables: 20
Number of nonzeros: 10
Sense: maximize
Solver:
- Status: ok
User time: -1.0
System time: 0.0
Wallclock time: 0.0
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 0
Number of created subproblems: 0
Black box:
Number of iterations: 0
Error rc: 0
Time: 0.013128995895385742
Solution:
- number of solutions: 0
number of solutions displayed: 0
rows to select:
0, 2, 3, 8, 9,
column sums from those rows
[261.0, 203.0, 153.0, 174.0, 312.0, 194.0, 287.0, 246.0, 214.0, 321.0]
percentage of column totals exceeded: 100.00%
[Finished in 845ms]
Edit:
I see your edit follows similar pattern to the above solution.
The reason you are getting "INFEASIBLE" for larger instantiations is that your Big-M is no longer big enough when the values are bigger and more are summed. You should pre-analyze your matrix and set BIG_M to be the maximal value in your target row, which will be big enough to cover any gap (by inspection). That will keep you feasible without massive overshoot on BIG_M which has consequences also.
I tweaked a few things on your r model. My r syntax is terrible, but try this out:
model <- MIPModel() %>%
add_variable(x[i], i = 1:nr, type = "binary") %>%
add_constraint(sum_expr(x[i], i = 1:nr)==4)%>%
add_variable(A[j], j = 1:nt, type = "binary") %>%
add_variable(s[j], j = 1:nt, type = "continuous",lb=0) %>%
add_constraint(s[j] <= BIG_M*A[j], j =1:nt)%>%
# NOT NEEDED: add_constraint(s[j] >= A[j], j =1:nt)%>%
# DON'T include A[j]: add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j=1:nt) %>%
add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + s[j] >= targets[j], j=1:nt) %>%
# REMOVE unneded indexing for i: set_objective(sum_expr(A[j], i = 1:nr, j = 1:nt), "min")
# and just minimize. No need to multiply by a large constant here.
set_objective(sum_expr(A[j], j = 1:nt), "min")
model <- solve_model(model,with_ROI(solver = "glpk"))

This is IMHO a linear programming modeling question: Can we formulate the problem as a "normalized" linear problem that can be solved by, for example, ompr or ROI (I would add lpSolveAPI)?
I believe it is possible, though I do not have the time to provide the full formulation. Here are some ideas:
As parameters, i.e. fixed values, we have
nr <- 10 # number of rows
nt <- 10 # number of target columns
vals <- matrix(sample.int(100, nr*nt), nrow=nr, ncol=nt)
targets <- sample.int(300, nt)
The decision variables we are interested in are x[1...nr] as binary variables (1 if the row is picked, 0 otherwise).
Obviously, one constraint would be sum(x[i],i)==4 -- the numbers of rows we pick.
For the objective, I would introduce auxilliary variables, such as
y[j] = 1, if sum_{i=1..nr} x[i]*vals[i,j]>= targets[j]
(and 0 otherwise) for j=1...nt. Now this definition of y is not compatible with linear programming and needs to be linearized. If we can assume that val[i,j] and targets[j] are greater or equal to zero, then we can define y[j] as binary variables like this:
x'vals[,j]-t[j]*y[j] >= 0
(x'y is meant as inner product, i.e. sum(x[i]*y[i], i).)
In the case x'vals[,j]>=t[j], the value y[j]==1 is valid. In the case x'vals[,j]<t[j], y[j]==0 is enforced.
With the objective max sum(y[j],j), we should get a proper formulation of the problem. No big-M required. But additional assumptions on non-negativity introduced.

What you want to solve here is called a "mixed integer program", and there's lots of (mostly commercial) software designed around it.
Your typical R functions such as optim are hardly any good for it due to the kind of constraints, but you can use specialized software (such as CBC) as long as you are able to frame the problem in a standard MIP structure (in this case the variables to optimize are binary variables for each row in your data).
As an alternative, you could also look at the package nloptr with its global derivate-free black-box optimizers, in which you can enter a function like this (setting bounds on the variables) and let it optimize it with some general-purpose heuristics.

How to get the Standard Deviation based on the sample [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 3 years ago.
Improve this question
The situation is I have 96 sets of paper, 4 questions on every paper and categorized into 5 categories.Every options consisting marks of 0,1,2,3,4. What's the formula to calculate the Standard Deviation?
| | 0 | 1 | 2 | 3 | 4 |
|-----|-----|-----|-----|-----|-----|
| A | 5 | 42 | 71 | 116 | 150 |
| B | 7 | 43 | 94 | 136 | 104 |
| C | 0 | 47 | 118 | 175 | 140 |
| D | 0 | 13 | 40 | 123 | 112 |
| E | 0 | 148 | 183 | 175 | 70 |
Category A consists of 4 questions
Category B consists of 4 questions
Category C consists of 5 questions
Category D consists of 3 questions
Category E consists of 6 questions
For A:
Total: (5*0)+(42*1)+(71*2)+(116*3)+(150*4) = 1132
Avg: (1132/96 sets)/4 questions = 2.94
STDEV: ?

The variance of a distribution is the average sum of (value - average)^2
Then, the Standard deviation is the square root of the variance.
Here, it would be :
Sum : 5*(0-2.94)^2 + 42*(1-2.94)^2 + 71*(2-2.94)^2 + 116*(3-2.94)^2 + 150*(4-2.94)^2 = 432.9824
Variance : 432.9824 /96/4 = 1.127
STDEV : sqrt(1.127) = 1.06

R LpSolve How to optimize picks with Budget Restriction

I have a question about LpSolve in R. I have a panel with the following data: Football player ID (around 500 player), how many games each of them has already played, number of goals scored and cost of the player. I want to create a matrix from this data, but I do not know how this works with such a large amount of data (I have about 500 football players an therefore 500 rows).
The goal is to select the optimal number of players for a budget of 1,000,000. Each player can only be selected once, optimized by the number of scored goals.
In the end I want to have the optimal selection of players which scored the most goals, and the budget has to be (almost) used up.
Since I am relatively new with R I do not know how to solve this problem with LpSolve yet and I fail at the matrix production and the constraints.
I´m very grateful for your help !
My panel looks like this (example):
footballplayerID | gamesplayed | avggoals | costsperplayer
233276 | 120 | 80 | 50.000
474823 | 200 | 140 | 34.000
192834 | 150 | 90 | 14.000
192833 | 30 | 50 | 90.000
129834 | 204 | 129 | 70.000
347594 | 123 | 19 | 10.000
203845 | 129 | 57 | 43.000
128747 | 98 | 124 | 140.000
.
.
123749 | 128 | 182 | 100.000

First I create a df like this:
df <- read.table(text =
"footballplayerID | gamesplayed | avggoals | costsperplayer
233276 | 120 | 80 | 50000
474823 | 200 | 140 | 34000
192834 | 150 | 90 | 14000
192833 | 30 | 50 | 90000
129834 | 204 | 129 | 70000
347594 | 123 | 19 | 10000
203845 | 129 | 57 | 43000
128747 | 98 | 124 | 1400001",
header = TRUE,
stringsAsFactors = FALSE,
sep = "|"
)
library(lpSolve)
The coefficients for objective function is avggoals:
obj_fun <- df$avggoals
The constraints is the sum of costperplayer that needs to be less than or equal to 100.000.000
constraints <- matrix(df$costsperplayer, nrow = 1)
c_dir <- "<="
c_rhs <- 1000000
You can then solve that lp problem with lp(). the argument all.bin = TRUE makes sure you choose a player once or not at all.
lp <- lp("max",
obj_fun,
constraints,
c_dir,
c_rhs,
all.bin = TRUE)
You can than have a look at the selected players:
df[lp$solution == 1, ]

Normalize data frame by levels

Good evening,
I am still new to R, so sorry in advance if this question seems obvious to you.
I am currently working on a drug screening protocol and I created .csv table in Excel with the output of my analysis. I imported it as data frame as raw.data into R with the following structure:
| Sample | Group | Parameter Drug 1 | Parameter Drug 2 | Time Parameter Drug 1 (ms) |
|---------------|-------|------------------|------------------|----------------------------|
| Heart_Sample1 | Heart | 2.4 | 9.0 | 1.5 |
| Heart_Sample1 | Heart | 2.29 | 22.2 | 3.4 |
| Heart_Sample1 | Heart | 3.4 | 3.5 | 4.5 |
| Heart_Sample1 | Heart | 5.2 | 8.4 | 6.5 |
| Heart_Sample1 | Heart | 2.3 | 34.1 | 7.8 |
| ... | Organ | value | value | time |
| Heart_Sample2 | Heart | 10.4 | 10.2 | 1.5 |
| Heart_Sample2 | Heart | 8.4 | 2.45 | 3.6 |
| ... | Organ | value | value | time |
| Liver_Sample1 | Liver | 13.4 | 44.5 | 2.8 |
| ... | Organ | 2.3 | value | time |
Parameter indicates the value of a certain parameter I am experimentally measuring (e.g. neuronal spikes). Time of Parameter indicates the time of the recording at which the spikes occur.
I transformed raw.data into mod.data with gather with the following formula:
mod.data <- gather(raw.data, `Parameter Drug 1`, `Parameter Drug 2`, `Parameter Drug 3`, key = "Drug", value = "value")
| Sample | Group | Time Parameter Drug 1 (ms) | Drug | value |
|---------------|-------|----------------------------|-----------------|-------|
| Heart_Sample1 | Heart | | Baseline | |
| Heart_Sample1 | Heart | | Baseline | |
| Heart_Sample1 | Heart | | Concentration 1 | |
| Heart_Sample1 | Heart | | Concentration 1 | |
| Heart_Sample1 | Heart | | Concentration 2 | |
Then I generated the plots, separated by Sample and , in order to have a clear overview of what is happening to the parameter, over time, in all the samples. The results is a huge plot array, with ~200 plots.
Since different organs have different values, and also within the same organ I can find very different values, the scales have to be matched within each Sample to clearly understand what is going in the sample.
I then tried to normalize with the following function:
normalize <- function(x){
(x - min(x))/(max(x)-min(x))
}
Where x is my parameter of interest. Unfortunately, it takes as min and max the respective min and max of the whole Parameter, regardless of the Sample and the Group. I also try to subset, but it would mean to create a single subset for each Sample and then merge them together in a figure. I also try with group_by(Sample, Group), as described in the RStudio cheatsheet, but I was not able to apply the normalize function to the generated data frame.
tl;dr My question is: how can I normalize, from 0 to 1, within each Sample, my values?
Thank you in advance for the answers.
Regards

Here's another approach using dplyr and your normalize function. I had no issues applying it to the toy data I created.
library(dplyr)
set.seed(123)
df <- data.frame(Sample = sample(c("Sample1", "Sample2"), 20, replace = T),
Group = sample(c("Heart", "Liver"), 20, replace = T),
Time = sample(100:500, 20),
Value = sample(1000:5000, 20))
normalize <- function(x){
(x - min(x))/(max(x)-min(x))
}
df %>%
group_by(Sample, Group) %>%
mutate(Time_std = normalize(Time),
Value_std = normalize(Value)) %>%
arrange(Sample, Group, Time_std)
# Sample Group Time Value Time_std Value_std
# Sample1 Heart 317 2895 0.00000000 0.47500000
# Sample1 Heart 389 3441 0.57600000 1.00000000
# Sample1 Heart 436 2755 0.95200000 0.34038462
# Sample1 Heart 442 2401 1.00000000 0.00000000
# Sample1 Liver 149 2513 0.00000000 0.00000000
# Sample1 Liver 154 2792 0.01428571 0.24303136
# Sample1 Liver 157 3661 0.02285714 1.00000000
# Sample1 Liver 272 3510 0.35142857 0.86846690
# Sample1 Liver 499 2535 1.00000000 0.01916376
# Sample2 Heart 179 1877 0.00000000 0.15939905
# Sample2 Heart 204 4171 0.39062500 1.00000000
# Sample2 Heart 243 1442 1.00000000 0.00000000
# Sample2 Liver 117 4011 0.00000000 0.92470805
# Sample2 Liver 147 1002 0.10238908 0.00000000
# Sample2 Liver 160 4256 0.14675768 1.00000000
# Sample2 Liver 192 4236 0.25597270 0.99385372
# Sample2 Liver 246 2096 0.44027304 0.33620160
# Sample2 Liver 265 1379 0.50511945 0.11585741
# Sample2 Liver 283 4244 0.56655290 0.99631223
# Sample2 Liver 410 3832 1.00000000 0.86969883

Using data.table you could go about this using the following approach.
Toy example:
library(data.table)
normalize <- function(x){
(x - min(x))/(max(x)-min(x))
}
df <- data.table(group = c(1, 1, 1, 1, 2, 2, 2), measure = c(10, 20, 0, 2, 1, 1, 10))
df[, measure_normalized := normalize(measure), by = group]

Interpolate variables on subsets of dataframe

I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA | 2000 | 1 | 23 | 1.2 | 43
CA | 2001 | 2 | 43 | 1.4 | 52
CA | 2002 | 5 | 53 | 3.2 | 61
...
CA | 1998 | 3 | 12 | 2.3 | 20
CA | 1999 | 4 | 14 | 2.8 | 25
CA | 2003 | 5 | 19 | 4.3 | 29
...
ND | 2000 | 2 | 223 | 3.2 | 239
ND | 2001 | 4 | 233 | 4.2 | 321
ND | 2003 | 7 | 256 | 7.9 | 387
For each state/survey.year combination, I would like to interpolate obs2 so that it's time-location is lined up with (time1,obs1).
ie I would like to break up the dataframe into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this. But keeping getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
df[,"new.obs2"] <- approxExtrap(df[,"time1"],
df[,"obs1"],
xout = df[,"obs2"],
method="linear",
rule=2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially - I am just trying to interpolate "obs2", within each state/survey.year combination, so it's time references line up with those of "obs1".
Of course if there's a slick way to do this without invoking plyr functions, then I'd be open to that...
Thank you!

This should be as simple as,
ddply(df,.(state,survey.year),transform,
new.obs2 = approxExtrap(time1,obs1,xout = obs2,
method = "linear",
rule = 2))
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

(RIM) weighting samples in R - r

Related

Is there a way to define a complex objective function in an R optimizer?

How to get the Standard Deviation based on the sample [closed]

R LpSolve How to optimize picks with Budget Restriction

Normalize data frame by levels

Interpolate variables on subsets of dataframe

Categories

Resources