Normalize data frame by levels - r

Good evening,
I am still new to R, so sorry in advance if this question seems obvious to you.
I am currently working on a drug screening protocol and I created a .csv table in Excel with the output of my analysis. I imported it into R as a data frame called raw.data, with the following structure:
| Sample | Group | Parameter Drug 1 | Parameter Drug 2 | Time Parameter Drug 1 (ms) |
|---------------|-------|------------------|------------------|----------------------------|
| Heart_Sample1 | Heart | 2.4 | 9.0 | 1.5 |
| Heart_Sample1 | Heart | 2.29 | 22.2 | 3.4 |
| Heart_Sample1 | Heart | 3.4 | 3.5 | 4.5 |
| Heart_Sample1 | Heart | 5.2 | 8.4 | 6.5 |
| Heart_Sample1 | Heart | 2.3 | 34.1 | 7.8 |
| ... | Organ | value | value | time |
| Heart_Sample2 | Heart | 10.4 | 10.2 | 1.5 |
| Heart_Sample2 | Heart | 8.4 | 2.45 | 3.6 |
| ... | Organ | value | value | time |
| Liver_Sample1 | Liver | 13.4 | 44.5 | 2.8 |
| ... | Organ | 2.3 | value | time |
Parameter indicates the value of a certain parameter I am experimentally measuring (e.g. neuronal spikes). Time of Parameter indicates the time of the recording at which the spikes occur.
I transformed raw.data into mod.data with gather, using the following call:
mod.data <- gather(raw.data, `Parameter Drug 1`, `Parameter Drug 2`, `Parameter Drug 3`, key = "Drug", value = "value")
| Sample | Group | Time Parameter Drug 1 (ms) | Drug | value |
|---------------|-------|----------------------------|-----------------|-------|
| Heart_Sample1 | Heart | | Baseline | |
| Heart_Sample1 | Heart | | Baseline | |
| Heart_Sample1 | Heart | | Concentration 1 | |
| Heart_Sample1 | Heart | | Concentration 1 | |
| Heart_Sample1 | Heart | | Concentration 2 | |
Then I generated the plots, separated by Sample, in order to have a clear overview of what is happening to the parameter over time in all the samples. The result is a huge plot array with ~200 plots.
Since different organs have different values, and even within the same organ I can find very different values, the scales have to be matched within each Sample to clearly understand what is going on in the sample.
I then tried to normalize with the following function:
normalize <- function(x){
  (x - min(x))/(max(x) - min(x))
}
Where x is my parameter of interest. Unfortunately, it takes as min and max the respective min and max of the whole Parameter column, regardless of Sample and Group. I also tried subsetting, but that would mean creating a separate subset for each Sample and then merging them together into one figure. I also tried group_by(Sample, Group), as described in the RStudio cheatsheet, but I was not able to apply the normalize function to the resulting grouped data frame.
tl;dr My question is: how can I normalize my values from 0 to 1 within each Sample?
Thank you in advance for the answers.
Regards

Here's another approach using dplyr and your normalize function. I had no issues applying it to the toy data I created.
library(dplyr)
set.seed(123)
df <- data.frame(Sample = sample(c("Sample1", "Sample2"), 20, replace = TRUE),
                 Group = sample(c("Heart", "Liver"), 20, replace = TRUE),
                 Time = sample(100:500, 20),
                 Value = sample(1000:5000, 20))
normalize <- function(x){
  (x - min(x))/(max(x) - min(x))
}
df %>%
  group_by(Sample, Group) %>%
  mutate(Time_std = normalize(Time),
         Value_std = normalize(Value)) %>%
  arrange(Sample, Group, Time_std)
# Sample Group Time Value Time_std Value_std
# Sample1 Heart 317 2895 0.00000000 0.47500000
# Sample1 Heart 389 3441 0.57600000 1.00000000
# Sample1 Heart 436 2755 0.95200000 0.34038462
# Sample1 Heart 442 2401 1.00000000 0.00000000
# Sample1 Liver 149 2513 0.00000000 0.00000000
# Sample1 Liver 154 2792 0.01428571 0.24303136
# Sample1 Liver 157 3661 0.02285714 1.00000000
# Sample1 Liver 272 3510 0.35142857 0.86846690
# Sample1 Liver 499 2535 1.00000000 0.01916376
# Sample2 Heart 179 1877 0.00000000 0.15939905
# Sample2 Heart 204 4171 0.39062500 1.00000000
# Sample2 Heart 243 1442 1.00000000 0.00000000
# Sample2 Liver 117 4011 0.00000000 0.92470805
# Sample2 Liver 147 1002 0.10238908 0.00000000
# Sample2 Liver 160 4256 0.14675768 1.00000000
# Sample2 Liver 192 4236 0.25597270 0.99385372
# Sample2 Liver 246 2096 0.44027304 0.33620160
# Sample2 Liver 265 1379 0.50511945 0.11585741
# Sample2 Liver 283 4244 0.56655290 0.99631223
# Sample2 Liver 410 3832 1.00000000 0.86969883
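Applied to the reshaped data from the question, the same pattern would look roughly like this (a sketch; the column names Sample and value are assumed from the mod.data table above):
library(dplyr)
mod.data %>%
  group_by(Sample) %>%                       # or group_by(Sample, Group) if you also want to split by organ
  mutate(value_norm = normalize(value)) %>%  # normalize() as defined above
  ungroup()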

Using data.table, you could go about this with the following approach.
Toy example:
library(data.table)
normalize <- function(x){
  (x - min(x))/(max(x) - min(x))
}
df <- data.table(group = c(1, 1, 1, 1, 2, 2, 2), measure = c(10, 20, 0, 2, 1, 1, 10))
df[, measure_normalized := normalize(measure), by = group]
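Applied to the gathered data from the question, the grouping would go by Sample (and Group, if desired); a sketch, with column names assumed from the question:
library(data.table)
setDT(mod.data)   # convert the gathered data frame to a data.table by reference
mod.data[, value_norm := normalize(value), by = .(Sample, Group)]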

Related

Is there a way to define a complex objective function in an R optimizer?

In R, I am trying to optimize the following: choose rows which maximize the number of columns whose sum exceeds a certain value which varies by column, plus some other basic constraints on the row selections.
Is there anything out there in R which allows you to incorporate logic into an objective function? I.e. maximize countif( sum(value column) > target value for column ) over ~10k columns, choosing 5 rows out of ~500 row choices.
Simple example: grab the combo of 4 rows below whose column sums exceed the targets more frequently than any other combo of 4 rows.
+--------+------+------+------+------+------+------+------+------+------+-------+
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
+--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135 | 250 | 89 | 109 | 210 | 184 | 178 |
+--------+------+------+------+------+------+------+------+------+------+-------+
EDIT + Update: I implemented the following using ompr, ROI, and some Big M logic.
# packages used below (ompr to build the model, ompr.roi + ROI.plugin.glpk to solve it)
library(dplyr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

nr <- 10 # number of rows
nt <- 15 # number of target columns
vals <- matrix(sample.int(nr*nt, nr*nt), nrow = nr, ncol = nt)
targets <- vector(length = nt)
targets[1:nt] <- 4*mean(vals)
model <- MIPModel() %>%
  add_variable(x[i], i = 1:nr, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = 1:nr) == 4) %>%
  add_variable(A[j], j = 1:nt, type = "binary") %>%
  add_variable(s[j], j = 1:nt, type = "continuous", lb = 0) %>%
  add_constraint(s[j] <= 9999999*A[j], j = 1:nt) %>%
  add_constraint(s[j] >= A[j], j = 1:nt) %>%
  add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j = 1:nt) %>%
  set_objective(sum_expr(-9999999*A[j], i = 1:nr, j = 1:nt), "max")
model <- solve_model(model, with_ROI(solver = "glpk"))
The model works great for small problems, including those where no solution exists which exceeds the target of every column.
However, the above returns Infeasible when I change the number of columns to even just 150. Given that I tested various scenarios on my smaller example, my hunch is that my model definition is OK...
Any suggestions as to why this is infeasible? Or maybe a more optimal way to define my model?
You could try a Local-Search algorithm. It may give you only a "good" solution; but in exchange it is highly flexible.
Here is a sketch. Start with an arbitrary valid solution x, for instance
for your example data
x <- c(rep(TRUE, 4), rep(FALSE, 6))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Define an objective function:
obj_fun <- function(x, table, target, ...) {
  -sum(colSums(table[x, ]) >= target)
}
Given a table and a target vector, it selects the rows defined in x and calculates the number of column sums that match or exceed the target. I write -sum because I'll use an implementation that minimises an objective function.
-obj_fun(x, table, target)
## [1] 7
So, for the chosen initial solution, 7 column sums are equal to or greater than the target.
Then you'll need a neighbourhood function. It takes a
solution x and returns a slightly changed version (a
"neighbour" of the original x). Here is a neighbour function
that changes a single row in x.
nb <- function(x, ...) {
  true  <- which( x)
  false <- which(!x)
  i <- true[sample.int(length( true), size = 1)]
  j <- false[sample.int(length(false), size = 1)]
  x[i] <- FALSE
  x[j] <- TRUE
  x
}
x
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
nb(x)
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## ^^^^^ ^^^^
Here is your data:
library("orgutils")
tt <- readOrg(text = "
| x | col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8 | col9 | col10 |
|--------+------+------+------+------+------+------+------+------+------+-------+
| row1 | 82 | 73 | 50 | 11 | 76 | 12 | 46 | 64 | 5 | 44 |
| row2 | 2 | 33 | 35 | 55 | 52 | 18 | 13 | 86 | 72 | 39 |
| row3 | 94 | 5 | 10 | 21 | 90 | 62 | 54 | 54 | 7 | 17 |
| row4 | 27 | 10 | 28 | 87 | 27 | 83 | 62 | 56 | 54 | 86 |
| row5 | 17 | 50 | 34 | 30 | 80 | 7 | 96 | 91 | 32 | 21 |
| row6 | 73 | 75 | 32 | 71 | 37 | 1 | 13 | 76 | 10 | 34 |
| row7 | 98 | 13 | 87 | 49 | 27 | 90 | 28 | 75 | 55 | 21 |
| row8 | 45 | 54 | 25 | 1 | 3 | 75 | 84 | 76 | 9 | 87 |
| row9 | 40 | 87 | 44 | 20 | 97 | 28 | 88 | 14 | 66 | 77 |
| row10 | 18 | 28 | 21 | 35 | 22 | 9 | 37 | 58 | 82 | 97 |
| target | 200 | 100 | 125 | 135 | 250 | 89 | 109 | 210 | 184 | 178 |
")
table <- tt[1:10, -1]
target <- tt[11, -1]
Run the search; in this case, with an algorithm called
"Threshold Accepting". I use the implementation in package NMOF (which I maintain).
library("NMOF")
x0 <- c(rep(TRUE, 4), rep(FALSE, 6))
sol <- TAopt(obj_fun,
             list(neighbour = nb,            ## neighbourhood fun
                  x0 = sample(x0),           ## initial solution
                  nI = 1000,                 ## iterations
                  OF.target = -ncol(target)  ## when to stop
             ),
             target = target,
             table = as.matrix(table))
rbind(Sums = colSums(table[sol$xbest, ]), Target = target)
## col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
## Sums 222 206 216 135 252 148 175 239 198 181
## Target 200 100 125 135 250 89 109 210 184 178
As I said, this is only a sketch, and depending on how large and important your actual problem is, there are a number of points to consider:
- Most importantly: nI sets the number of search iterations. 1000 is the default, but you'll definitely want to play around with this number.
- There may be cases (i.e. datasets) for which the objective function does not provide good guidance: if selecting different rows does not change the number of columns for which the target is met, the algorithm cannot judge whether a new solution is better than the previous one. Thus, adding more-continuous guidance (e.g. via some distance-to-target) may help; see the sketch after this list.
- Updating: the computation above actually does a lot that's not necessary. When a new candidate solution is evaluated, there would be no need to recompute the full column sums; instead, only adjust the previous solution's sums by the changed rows. (For a small dataset, this won't matter much.)
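To illustrate that distance-to-target idea, a smoother objective could penalise the total shortfall below the targets. This is only a sketch (not part of the original answer); it reuses the table and target objects created above and, like obj_fun, is written for a minimiser, so it could be passed to TAopt unchanged:
obj_fun2 <- function(x, table, target, ...) {
  shortfall <- pmax(unlist(target) - colSums(table[x, ]), 0)
  sum(shortfall)  ## 0 once every column target is met; smaller is better
}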
This isn't quite what you asked, as it is cast in Python, but perhaps it will show you the approach to doing this with Integer Programming. You should be able to replicate this in R, as there are R bindings for several solvers, including CBC, which is the one I'm using below and which is suitable for Integer Programs.
I'm also using pyomo to frame up the math model for the solver. I think with a little research you could find an equivalent way to do this in R. The syntax at the start is just to ingest the data (which I just pasted into a .csv file). The rest should be readable.
The good/bad...
This solves almost immediately for your toy problem. It can be shown that 5 rows can exceed all column totals.
For many more columns, it can bog down greatly. I did a couple of tests with large matrices of random numbers. This is very challenging for the solver because it cannot identify "good" rows easily. I can only get it to solve a 500x100 instance with random values (and the total row randomized and multiplied by 5, the number of selections, just to make it challenging) in reasonable time by relaxing the tolerance on the solution.
If you really have 10K columns, there are only a few ways this could work: 1. you have several rows that can cover all the column totals (the solver should discover this quickly), 2. there is some pattern (other than random noise) to the data/totals that can guide the solver, or 3. you use a large ratio-based gap (or a time limit).
import pyomo.environ as pyo
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", header=None)  # this is the data from the post

# uncomment this below for a randomized set of data
# df = pd.DataFrame(
#     data = np.random.random(size=(500,100)))
# df.iloc[-1] = df.iloc[-1]*5

# convert to dictionary
data = df.iloc[:len(df)-1].stack().to_dict()
col_sums = df.iloc[len(df)-1].to_dict()

limit = 5  # max number of rows selected

m = pyo.ConcreteModel('row picker')

### SETS
m.R = pyo.Set(initialize=range(len(df)-1))
m.C = pyo.Set(initialize=range(len(df.columns)))

### Params
m.val = pyo.Param(m.R, m.C, initialize=data)
m.tots = pyo.Param(m.C, initialize=col_sums)

### Variables
m.sel = pyo.Var(m.R, domain=pyo.Binary)  # indicator for which rows are selected
m.abv = pyo.Var(m.C, domain=pyo.Binary)  # indicator for which column is above total

### OBJECTIVE
m.obj = pyo.Objective(expr=sum(m.abv[c] for c in m.C), sense=pyo.maximize)

### CONSTRAINTS
# limit the total number of selections...
m.sel_limit = pyo.Constraint(expr=sum(m.sel[r] for r in m.R) <= limit)

# link the indicator variable to the column sum
def c_sum(m, c):
    return sum(m.val[r, c] * m.sel[r] for r in m.R) >= m.tots[c] * m.abv[c]
m.col_sum = pyo.Constraint(m.C, rule=c_sum)

### SOLVE
print("...built... solving...")
solver = pyo.SolverFactory('cbc', options={'ratio': 0.05})
result = solver.solve(m)
print(result)

### Inspect answer ...
print("rows to select: ")
for r in m.R:
    if m.sel[r]:
        print(r, end=', ')
print("\ncolumn sums from those rows")
tots = [sum(m.val[r, c] * m.sel[r].value for r in m.R) for c in m.C]
print(tots)
print(f'percentage of column totals exceeded: {len([1 for c in m.C if m.abv[c]])/len(m.C)*100:0.2f}%')
Yields:
Problem:
- Name: unknown
  Lower bound: -10.0
  Upper bound: -10.0
  Number of objectives: 1
  Number of constraints: 11
  Number of variables: 20
  Number of binary variables: 20
  Number of integer variables: 20
  Number of nonzeros: 10
  Sense: maximize
Solver:
- Status: ok
  User time: -1.0
  System time: 0.0
  Wallclock time: 0.0
  Termination condition: optimal
  Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
  Statistics:
    Branch and bound:
      Number of bounded subproblems: 0
      Number of created subproblems: 0
    Black box:
      Number of iterations: 0
  Error rc: 0
  Time: 0.013128995895385742
Solution:
- number of solutions: 0
  number of solutions displayed: 0
rows to select:
0, 2, 3, 8, 9,
column sums from those rows
[261.0, 203.0, 153.0, 174.0, 312.0, 194.0, 287.0, 246.0, 214.0, 321.0]
percentage of column totals exceeded: 100.00%
[Finished in 845ms]
Edit:
I see your edit follows a similar pattern to the above solution.
The reason you are getting "INFEASIBLE" for larger instantiations is that your Big-M is no longer big enough when the values are bigger and more are summed. You should pre-analyze your matrix and set BIG_M to be the maximal value in your target row, which will be big enough to cover any gap (by inspection). That will keep you feasible without massive overshoot on BIG_M which has consequences also.
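In R that pre-analysis could be as simple as the following one-liner (a sketch, using the targets vector from your model):
BIG_M <- max(targets)  # with non-negative vals, the shortfall s[j] can never exceed targets[j]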
I tweaked a few things in your R model. My R syntax is terrible, but try this out:
model <- MIPModel() %>%
  add_variable(x[i], i = 1:nr, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = 1:nr) == 4) %>%
  add_variable(A[j], j = 1:nt, type = "binary") %>%
  add_variable(s[j], j = 1:nt, type = "continuous", lb = 0) %>%
  add_constraint(s[j] <= BIG_M*A[j], j = 1:nt) %>%
  # NOT NEEDED: add_constraint(s[j] >= A[j], j = 1:nt) %>%
  # DON'T include A[j]: add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + A[j] + s[j] >= targets[j], j = 1:nt) %>%
  add_constraint(sum_expr(vals[i,j]*x[i], i = 1:nr) + s[j] >= targets[j], j = 1:nt) %>%
  # REMOVE unneeded indexing for i: set_objective(sum_expr(A[j], i = 1:nr, j = 1:nt), "min")
  # and just minimize. No need to multiply by a large constant here.
  set_objective(sum_expr(A[j], j = 1:nt), "min")
model <- solve_model(model, with_ROI(solver = "glpk"))
This is IMHO a linear programming modeling question: Can we formulate the problem as a "normalized" linear problem that can be solved by, for example, ompr or ROI (I would add lpSolveAPI)?
I believe it is possible, though I do not have the time to provide the full formulation. Here are some ideas:
As parameters, i.e. fixed values, we have
nr <- 10 # number of rows
nt <- 10 # number of target columns
vals <- matrix(sample.int(100, nr*nt), nrow=nr, ncol=nt)
targets <- sample.int(300, nt)
The decision variables we are interested in are x[1...nr] as binary variables (1 if the row is picked, 0 otherwise).
Obviously, one constraint would be sum(x[i],i)==4 -- the numbers of rows we pick.
For the objective, I would introduce auxiliary variables, such as
y[j] = 1 if sum_{i=1..nr} x[i]*vals[i,j] >= targets[j]
(and 0 otherwise) for j = 1...nt. Now this definition of y is not compatible with linear programming and needs to be linearized. If we can assume that vals[i,j] and targets[j] are greater than or equal to zero, then we can define y[j] as binary variables like this:
x'vals[,j] - targets[j]*y[j] >= 0
(x'vals[,j] is meant as an inner product, i.e. sum(x[i]*vals[i,j], i).)
In the case x'vals[,j] >= targets[j], the value y[j] == 1 is valid. In the case x'vals[,j] < targets[j], y[j] == 0 is enforced.
With the objective max sum(y[j], j), we should get a proper formulation of the problem. No big-M required, but additional assumptions on non-negativity are introduced.
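A sketch of that formulation in ompr (assuming the parameters nr, nt, vals and targets defined above, and non-negative values; not a tested implementation):
library(dplyr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)
model <- MIPModel() %>%
  add_variable(x[i], i = 1:nr, type = "binary") %>%
  add_variable(y[j], j = 1:nt, type = "binary") %>%
  add_constraint(sum_expr(x[i], i = 1:nr) == 4) %>%
  add_constraint(sum_expr(vals[i, j]*x[i], i = 1:nr) - targets[j]*y[j] >= 0, j = 1:nt) %>%
  set_objective(sum_expr(y[j], j = 1:nt), "max")
result <- solve_model(model, with_ROI(solver = "glpk"))
get_solution(result, x[i])  # which rows were picked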
What you want to solve here is called a "mixed integer program", and there's lots of (mostly commercial) software designed around it.
Your typical R functions such as optim are hardly any good for it due to the kind of constraints, but you can use specialized software (such as CBC) as long as you are able to frame the problem in a standard MIP structure (in this case the variables to optimize are binary variables for each row in your data).
As an alternative, you could also look at the package nloptr with its global derivative-free black-box optimizers, in which you can enter a function like this (setting bounds on the variables) and let it optimize with some general-purpose heuristics.

icc on dataframe with row for each rater

Let me start off by saying I'm completely new to R and trying to figure out how to run icc on my specific dataset, which might be a bit different than usual.
The dataset looks as follows
+------------+------------------+--------------+--------------+--------------+
| date | measurement_type | measurement1 | measurement2 | measurement3 |
+------------+------------------+--------------+--------------+--------------+
| 25-04-2020 | 1 | 15.5 | 34.3 | 43.2 |
| 25-04-2020 | 2 | 21.2 | 12.3 | 2.2 |
| 25-04-2020 | 3 | 16.2 | 9.6 | 43.3 |
| 25-04-2020 | 4 | 27 | 1 | 6 |
+------------+------------------+--------------+--------------+--------------+
Now I want to run icc on all of those rows, since each row stands for a different rater. It should leave the date and measurement_type columns out.
Can someone point me in the right direction? I have absolutely no idea how to go about this.
------- EDIT -------
I exported the actual dataset that will come out, filled with some test data; it is available here.
The 2 important sheets here are the first and third.
The first contains all the participants of the research and the third contains all 4 different reports for each participant. The code I have so far just to tie each report to the correct participant;
library("XLConnect")
library("sqldf")
library("irr")
library("dplyr")
library("tidyr")
# Load in Workbook
wb = loadWorkbook("Measuring.xlsx")
# Load in Worksheet
# Sheet 1 = Study Results
# Sheet 3 = Meetpunten
records = readWorksheet(wb, sheet=1)
reports = readWorksheet(wb, sheet=3)
for (record in 1:nrow(records)) {
  recordId = records[record, 'Record.Id']
  participantReports = sqldf(sprintf("select * from reports where `Record.Id` = '%s'", recordId))
  baselineReport = sqldf("select * from participantReports where measurement_type = '1'")
  drinkReport = sqldf("select * from participantReports where measurement_type = '2'")
  regularReport = sqldf("select * from participantReports where measurement_type = '3'")
  exerciseReport = sqldf("select * from participantReports where measurement_type = '4'")
}
Since in your data each row stands for a different rater, but the icc function in the irr package needs the raters to be columns, you can ignore the first two columns of your table, transpose it, and run icc.
So, assuming this table:
+------------+------------------+--------------+--------------+--------------+
| date | measurement_type | measurement1 | measurement2 | measurement3 |
+------------+------------------+--------------+--------------+--------------+
| 25-04-2020 | 1 | 15.5 | 34.3 | 43.2 |
| 25-04-2020 | 2 | 21.2 | 12.3 | 2.2 |
| 25-04-2020 | 3 | 16.2 | 9.6 | 43.3 |
| 25-04-2020 | 4 | 27 | 1 | 6 |
+------------+------------------+--------------+--------------+--------------+
is stored in a variable called data, I would do it like this:
data2 = data.matrix(data[,-c(1,2)]) # generates the dataset without the first two columns
data2 is this table:
+--------------+--------------+--------------+
| measurement1 | measurement2 | measurement3 |
+--------------+--------------+--------------+
| 15.5 | 34.3 | 43.2 |
| 21.2 | 12.3 | 2.2 |
| 16.2 | 9.6 | 43.3 |
| 27 | 1 | 6 |
+--------------+--------------+--------------+
Then:
data2 = t(data2) # transpose data2 so as to have raters in the columns and their ratings in each line
icc(data2) # here i'm not bothering with the parameters, but you should explore the appropriate icc parameters for your needs.
should generate a correct run.
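For reference, a fully parameterised call might look like the following (the choices below are illustrative, not a recommendation; see ?icc to pick the model, type and unit that match your design):
library(irr)
icc(data2, model = "twoway", type = "agreement", unit = "single")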

How to read and process columns with sub columns from an excel/.csv/any file?

I tried reading an Excel file where I need to read sub columns too, but I have not found a way to resolve this.
The Excel file contains data as,
| Sl No. | Sales 1       | Sales 2       | % Change |
|        | 1 Qtr | % Qtr | 2 Qtr | % Qtr |          |
| 1      | 134   | 67    | 175   | 74    | 12.5     |
After importing I can see the data as
| Sl No. | Sales 1 | ...3  | Sales 2 | ...5  | % Change |
| NA     | 1 Qtr   | % Qtr | 2 Qtr   | % Qtr | NA       |
| 1      | 134     | 67    | 175     | 74    | 12.5     |
I tried several ways to merge "Sales 1 & ...3" and "Sales 2 & ...5" while keeping 1 Qtr, % Qtr, 2 Qtr, % Qtr as sub columns, but I was unable to do so.
I need it to be like this:
| Sl No. | Sales 1       | Sales 2       | % Change |
|        | 1 Qtr | % Qtr | 2 Qtr | % Qtr |          |
| 1      | 134   | 67    | 175   | 74    | 12.5     |
Unfortunately, R doesn't allow for multiple colnames. So probably the easiest thing you can do using base R is combining the colnames and then getting rid of the first line.
library(openxlsx)
x <- read.xlsx("your_file.xlsx")
# Sl.No Sales.1 X3 Sales.2 X5 %Change
# 1 NA 1 Qtr %Qtr 2 Qtr %Qtr NA
# 2 1 134 67 175 74 12.5
colnames(x) <- paste0(colnames(x),ifelse(is.na(x[1,]),"",paste0(" - ", x[1,])))
x <- x[-1,]
# Sl.No Sales.1 - 1 Qtr X3 - %Qtr Sales.2 - 2 Qtr X5 - %Qtr %Change
# 2 1 134 67 175 74 12.5
colnames(x)
# [1] "Sl.No" "Sales.1 - 1 Qtr" "X3 - %Qtr" "Sales.2 - 2 Qtr" "X5 - %Qtr" "%Change"

Random sample by group and filtering on the basis of result

I have a dataframe that is generated by the following code
l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)
dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)),
c("ids", "months", "calWeek", "value"))
and looks like this:
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 1 | 5 | 201710 | 6 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
| 2 | 88 | 201744 | 51 |
+----+-------+----------+-------+
I would like to randomly sample a calendar week from each id-month group (the months are not calendar months). Then, within each id-month combination, I would like to keep all rows up to and including the sampled calendar week.
An example output could be: suppose the sampling returned calendar week 201743 for the group id=2 and month=88, and 201709 for the group id=1 and month=5; then the final output should be
+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
| 1 | 5 | 201708 | 4.5 |
| 1 | 5 | 201709 | 5 |
| 2 | 88 | 201741 | 75 |
| 2 | 88 | 201742 | 89 |
| 2 | 88 | 201743 | 90 |
+----+-------+----------+-------+
I tried to work with dplyr's sample_n function (which is going to give me the random calendar week by id-month group), but then I do not know how to get all calendar weeks prior to that date. Can you help me with this? If possible, I would like to work with dplyr.
Please let me know in case you need further information.
Many thanks
require(dplyr)
set.seed(1) # when sampling please set.seed
sampled <- dat %>% group_by(ids) %>% do(., sample_n(.,1))
sampled_day <- sampled$calWeek
dat %>%
  group_by(ids) %>%
  mutate(max_day = which(calWeek %in% sampled_day)) %>%
  filter(row_number() <= max_day)
# You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)
# A tibble: 3 x 4
# Groups: ids [2]
ids months calWeek value
<dbl> <dbl> <dbl> <dbl>
1 1.00 5.00 201708 5.00
2 2.00 88.0 201741 99.0
3 2.00 88.0 201742 100
This depends on the row order! So make sure to arrange by day first. You'll need to think about ties, though. I have edited my previous answer and simply filtered with <=
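A more compact variant of the same idea, sampling and filtering within one grouped pipeline, might look like this (a sketch; it groups by ids and months as in the question and arranges by week first):
set.seed(1)
dat %>%
  arrange(ids, months, calWeek) %>%
  group_by(ids, months) %>%
  filter(calWeek <= calWeek[sample.int(n(), 1)]) %>%  # keep rows up to the sampled week
  ungroup()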
That should do the trick:
sample_and_get_below <- function(df, when, size){
  res <- filter(df, calWeek == when) %>%
    sample_n(size)
  filter(df, calWeek > when) %>%
    rbind(res, .)
}
sample_and_get_below(dat, 201741, 1)
ids months calWeek value
1 2 88 201741 99
2 2 88 201742 100
3 2 88 201743 1001
4 2 88 201744 1002

Interpolate variables on subsets of dataframe

I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA | 2000 | 1 | 23 | 1.2 | 43
CA | 2001 | 2 | 43 | 1.4 | 52
CA | 2002 | 5 | 53 | 3.2 | 61
...
CA | 1998 | 3 | 12 | 2.3 | 20
CA | 1999 | 4 | 14 | 2.8 | 25
CA | 2003 | 5 | 19 | 4.3 | 29
...
ND | 2000 | 2 | 223 | 3.2 | 239
ND | 2001 | 4 | 233 | 4.2 | 321
ND | 2003 | 7 | 256 | 7.9 | 387
For each state/survey.year combination, I would like to interpolate obs2 so that its time location is lined up with (time1, obs1).
I.e. I would like to break up the dataframe into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this, but I keep getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
  df[,"new.obs2"] <- approxExtrap(df[,"time1"],
                                  df[,"obs1"],
                                  xout = df[,"obs2"],
                                  method = "linear",
                                  rule = 2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially, I am just trying to interpolate "obs2", within each state/survey.year combination, so its time references line up with those of "obs1".
Of course if there's a slick way to do this without invoking plyr functions, then I'd be open to that...
Thank you!
This should be as simple as,
ddply(df, .(state, survey.year), transform,
      new.obs2 = approxExtrap(time1, obs1, xout = obs2,
                              method = "linear",
                              rule = 2))
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)
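Purely as an alternative, the same per-group transform with dplyr might look like this (a sketch, with the same assumptions about column names; note that approxExtrap() returns a list, so you probably want its y component):
library(Hmisc)
library(dplyr)
df %>%
  group_by(state, survey.year) %>%
  mutate(new.obs2 = approxExtrap(time1, obs1, xout = obs2,
                                 method = "linear", rule = 2)$y) %>%
  ungroup()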
