I have an optimization problem where I am trying to maximize the value of column z by picking one row for each unique value of column x, subject to the constraint that the sum of the corresponding y values must not exceed (in this example) 23.
For example, I have this sample data:
d=data.frame(x=c(1,1,1,2,2,2,3,3,3),y=c(9,7,5,9,7,5,9,7,5),z=c(25,20,5,20,10,5,10,5,3))
Which looks like this:
X Y Z
1 1 9 25
2 1 7 20
3 1 5 5
4 2 9 20
5 2 7 10
6 2 5 5
7 3 9 10
8 3 7 5
9 3 5 3
The result should look like this:
X Y Z
1 1 9 25
4 2 9 20
9 3 5 3
How do I set this problem up in the lpSolve::lp function?
You are trying to maximize the sum of the z values of the selected options subject to two types of constraints:
The sum of the y values for the selected options does not exceed 23
You select exactly one value for each unique x value
You can create a binary variable for each option and then solve with lpSolve:
d=data.frame(x=c(1,1,1,2,2,2,3,3,3),y=c(9,7,5,9,7,5,9,7,5),z=c(25,20,5,20,10,5,10,5,3))
library(lpSolve)
all.x <- unique(d$x)
d[lp(direction = "max",
     objective.in = d$z,
     const.mat = rbind(outer(all.x, d$x, "=="), d$y),
     const.dir = rep(c("==", "<="), c(length(all.x), 1)),
     const.rhs = rep(c(1, 23), c(length(all.x), 1)),
     all.bin = TRUE)$solution == 1, ]
# x y z
# 1 1 9 25
# 4 2 9 20
# 9 3 5 3
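As a quick sanity check (my addition, not part of the original answer), you can save the lp() result and confirm that the selection satisfies both types of constraint:
sol <- lp(direction = "max",
          objective.in = d$z,
          const.mat = rbind(outer(all.x, d$x, "=="), d$y),
          const.dir = rep(c("==", "<="), c(length(all.x), 1)),
          const.rhs = rep(c(1, 23), c(length(all.x), 1)),
          all.bin = TRUE)
sel <- d[sol$solution == 1, ]
sum(sel$y)   # 23, within the budget of 23
sum(sel$z)   # 48, the maximised objective (equal to sol$objval)
table(sel$x) # exactly one row selected per unique x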
I am quite new to R, but was wondering if there is a specific way to group/analyze integer values from my data frame i.e.,
Sample X : int 1 2 3 4 5
Sample Y : int 6 7 8 9 10
Sample Z : int 11 12 13 14 15
and assign these to my factor variable which has the corresponding number of levels (5 in this example) which are called in this example lvl 1, lvl 2, lvl 3, lvl 4, lvl 5. The goal is to be able to graph the observations at each level, for example lvl 1 had the observations 1, 6, and 11/ lvl 2 had 2, 7, and 12, etc.
I've found no clean way to do this. Other attempts have included individually typing out the name of each sample and manually linking it to the factor levels, but that has not gone well.
Any advice would be appreciated!
If I understood correctly, you want each x, y and z observation to be associated with a level, and then to plot by level.
library(ggplot2)
library(reshape2)
df = data.frame(x = 1:5, y = 6:10, z = 11:15)
df$level = factor(paste0("lvl",1:5))
df
# x y z level
# 1 1 6 11 lvl1
# 2 2 7 12 lvl2
# 3 3 8 13 lvl3
# 4 4 9 14 lvl4
# 5 5 10 15 lvl5
It's easier to use long-format data for plotting (with the ggplot2 package). I use reshape2::melt here, but you could get an equivalent result with tidyr::pivot_longer.
df <- reshape2::melt(df, id.vars = "level")
df
level variable value
1 lvl1 x 1
2 lvl2 x 2
3 lvl3 x 3
4 lvl4 x 4
5 lvl5 x 5
6 lvl1 y 6
7 lvl2 y 7
8 lvl3 y 8
9 lvl4 y 9
10 lvl5 y 10
11 lvl1 z 11
12 lvl2 z 12
13 lvl3 z 13
14 lvl4 z 14
15 lvl5 z 15
Finally, you can plot. Let's say you want points for each level:
ggplot(df, aes(x = level, y = value)) + geom_point()
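For reference, a rough tidyr equivalent of the reshape2::melt step (my sketch, assuming tidyr >= 1.0, where pivot_longer is available, and starting again from the wide data frame) would be:
library(tidyr)
df_wide <- data.frame(x = 1:5, y = 6:10, z = 11:15,
                      level = factor(paste0("lvl", 1:5)))
# pivot the x/y/z columns into long format, matching the melted layout above
df_long <- pivot_longer(df_wide, cols = c(x, y, z),
                        names_to = "variable", values_to = "value")
ggplot(df_long, aes(x = level, y = value)) + geom_point()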
First, the following data are split randomly into two groups according to the sl variable, and then the model is run for both groups using the for loop shown below the data set.
mydata
y x sl
1 5.297967 1 1
2 3.322833 2 1
3 4.969813 3 1
4 4.276666 4 1
5 5.972807 1 2
6 6.619440 2 2
7 8.045588 3 2
8 7.377759 4 2
9 6.907755 5 2
10 8.672486 6 2
11 8.283999 7 2
12 8.455318 8 2
13 7.414573 9 2
14 8.634087 10 2
15 7.356355 1 3
16 6.606247 2 3
17 6.396930 9 3
18 6.579251 10 3
19 5.521110 1 4
20 2.224221 2 4
21 6.742881 3 4
22 6.709304 4 4
23 6.875232 5 4
24 8.476371 6 4
25 7.360104 7 4
Running the model using the lme() function for both groups, then storing the beta coefficients as a matrix and theta (the random intercept terms) as a vector:
sl.no=unique(mydata$sl)
m=length(unique(mydata$sl))
ngrp=2
set.seed(125)
idx=sample(1:ngrp, size=m, replace = T)
beta=matrix(NA, nrow = ngrp, ncol=3, byrow=T) #null matrix to store coefficients from both groups
theta=rep(0,m) #null vector to store intercepts from both groups
library(nlme)
for (g in 1:ngrp){
  rg=sl.no[idx==g]
  mydata_rG=mydata[mydata$sl %in% rg,] #Data set belongs to group-g
  lme_mod=lme(y~x+I(x^2),random = ~ 1|sl,
              data = mydata_rG, method = "ML") #mixed effect model for each group
  beta[g,]=c(unlist(lme_mod$coefficients[1])[[1]],
             unlist(lme_mod$coefficients[1])[[2]],
             unlist(lme_mod$coefficients[1])[[3]])
  theta=c(unname(lme_mod$coefficients$random$sl))
}
I am expecting a theta vector of length m. Unfortunately, theta ends up with length one.
Any help is appreciated.
results of beta and theta
beta
[,1] [,2] [,3]
[1,] 4.895805 0.7954474 -0.05602771
[2,] 6.423533 -1.7441753 0.32049662
theta
[1] 4.264366e-21 # it should have length m
The problem is only in how you store the theta values: inside the loop you overwrite theta on each iteration instead of appending to it.
sl.no=unique(mydata$sl)
m=length(unique(mydata$sl))
ngrp=2
set.seed(125)
idx=sample(1:ngrp, size=m, replace = T)
beta=matrix(NA, nrow = ngrp, ncol=3, byrow=T)
theta=numeric() #Change here
library(nlme)
for (g in 1:ngrp){
  rg=sl.no[idx==g]
  mydata_rG=mydata[mydata$sl %in% rg,]
  lme_mod=lme(y~x+I(x^2),random = ~ 1|sl,
              data = mydata_rG, method = "ML")
  beta[g,]=c(unlist(lme_mod$coefficients[1])[[1]],
             unlist(lme_mod$coefficients[1])[[2]],
             unlist(lme_mod$coefficients[1])[[3]])
  theta=c(theta, unname(lme_mod$coefficients$random$sl)) #Change here
}
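If you also want to know which sl value each theta entry belongs to, rather than relying on the order in which the loop visits the groups, a possible variant (my sketch, not part of the original fix) keeps the row names of the random-effects matrix:
# Preallocate a named vector so each random intercept is written back to the
# position of its sl value, regardless of which group it was assigned to.
theta <- setNames(rep(NA_real_, m), sl.no)
for (g in 1:ngrp) {
  rg <- sl.no[idx == g]
  mydata_rG <- mydata[mydata$sl %in% rg, ]
  lme_mod <- lme(y ~ x + I(x^2), random = ~ 1 | sl,
                 data = mydata_rG, method = "ML")
  re <- lme_mod$coefficients$random$sl  # one row per sl level in this group
  theta[rownames(re)] <- re[, 1]
}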
I have a problem that I suspect has arisen from a dplyr update combined with my hacky code. Given a data frame in which every row is duplicated, I want to assign each row a unique id by combining the entries of two columns with either "_" or "a_" in the middle. I also want to assign a group id by combining the entries of one column with either "" or "a". Because these formats are important for lining up with another data frame, I can't use solutions based on interaction and factor that I've seen in other posts.
So I want to go from this:
Generation Identity
1 1 X
2 1 Y
3 1 Z
4 2 X
5 2 Y
6 2 Z
7 3 X
8 3 Y
9 3 Z
10 1 X
11 1 Y
12 1 Z
13 2 X
14 2 Y
15 2 Z
16 3 X
17 3 Y
18 3 Z
to this:
Generation Identity Unique_id Group_id
1 1 X 1_X X
2 1 Y 1_Y Y
3 1 Z 1_Z Z
4 2 X 2_X X
5 2 Y 2_Y Y
6 2 Z 2_Z Z
7 3 X 3_X X
8 3 Y 3_Y Y
9 3 Z 3_Z Z
10 1 X 1a_X Xa
11 1 Y 1a_Y Ya
12 1 Z 1a_Z Za
13 2 X 2a_X Xa
14 2 Y 2a_Y Ya
15 2 Z 2a_Z Za
16 3 X 3a_X Xa
17 3 Y 3a_Y Ya
18 3 Z 3a_Z Za
The minimal example below is based on code that previously worked for me and others for setting the unique id, but it now causes RStudio to crash with a seg fault (Exception Type: EXC_BAD_ACCESS (SIGSEGV)). When I call a function containing this code, it generates the message
Error in match(vector, df$Unique_id) : 'translateCharUTF8' must be
called on a CHARSXP
which I've read can be symptomatic of memory issues.
library(dplyr)
dff <- data.frame(Generation = rep(1:3, each = 3),
                  Identity = rep(LETTERS[24:26], times = 3))
dff <- rbind(dff, dff) # duplicate rows
dff <- group_by_(dff, ~Generation, ~Identity) %>%
  mutate(Unique_id = c(paste0(Identity[1], "_", Generation[1]),
                       paste0(Identity[1], "a", "_", Generation[1]))) %>%
  ungroup
I think the problem is related to an update of dplyr (I'm using the latest release versions of RStudio and all packages, on OSX Sierra). In any case, my solution above is something of a hack. I'd very much appreciate suggestions for improved code, preferably using either base R or dplyr (since the code is part of a package that currently depends on dplyr).
Here is how you can approach the problem:
First, find the duplicates in your data. I called my data A:
dup=duplicated(A)
Then add a counter column:
A$count=1:nrow(A)
n=ncol(A) # index of the added count column
Now build the two new columns needed and cbind them with the original data frame:
B=data.frame(t(apply(A,1,function(x)
  if(dup[as.numeric(x[n])]) c(paste0(x["Identity"],"a"),paste(x[-n],collapse="a_"))
  else c(x["Identity"],paste(x[-n],collapse="_")))))
`names<-`(cbind(A[-n],B),c(names(A[-n]),"Group_ID","Unique_ID"))
   Generation Identity Group_ID Unique_ID
1 1 X X 1_X
2 1 Y Y 1_Y
3 1 Z Z 1_Z
4 2 X X 2_X
5 2 Y Y 2_Y
6 2 Z Z 2_Z
7 3 X X 3_X
8 3 Y Y 3_Y
9 3 Z Z 3_Z
10 1 X Xa 1a_X
11 1 Y Ya 1a_Y
12 1 Z Za 1a_Z
13 2 X Xa 2a_X
14 2 Y Ya 2a_Y
15 2 Z Za 2a_Z
16 3 X Xa 3a_X
17 3 Y Ya 3a_Y
18 3 Z Za 3a_Z
Here's my amended version of Onyambu's solution, which refers to columns by name rather than number (and so can handle data frames that have additional columns):
dup <- duplicated(dff) # identify duplicates
dff$count <- 1:nrow(dff) # add count column to the dataframe
# create a new dataframe containing the unique and group ids:
B <- data.frame(t(apply(dff, 1, function(x)
  if (dup[as.numeric(x["count"])]) c(paste0(x["Identity"], "a"),
                                     paste(x["Identity"], x["Generation"], sep = "a_"))
  else c(x["Identity"], paste(x["Identity"], x["Generation"], sep = "_")))))
# combine the dataframes:
colnames(B) <- c("Group_id", "Unique_id")
dff <- cbind(dff[-ncol(dff)], B)
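For what it's worth, a shorter vectorised base R alternative (my sketch, assuming dff is the original two-column data frame and, as in the example, every Generation/Identity pair appears exactly twice) avoids apply() entirely and produces ids in the Generation-first format shown in the desired output:
# duplicated() flags the second copy of each Generation/Identity pair
dup <- duplicated(dff[c("Generation", "Identity")])
dff$Unique_id <- ifelse(dup,
                        paste0(dff$Generation, "a_", dff$Identity),
                        paste0(dff$Generation, "_", dff$Identity))
dff$Group_id <- ifelse(dup, paste0(dff$Identity, "a"), as.character(dff$Identity))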
Consider the following groupings:
> data.frame(x = c(3:5,7:9,12:14), grp = c(1,1,1,2,2,2,3,3,3))
x grp
1 3 1
2 4 1
3 5 1
4 7 2
5 8 2
6 9 2
7 12 3
8 13 3
9 14 3
Let's say I don't know the grp values but only have the vector x. What is the easiest way to generate the grp values, essentially an id field for groups of values that are within a threshold of each other? Is this a percolation algorithm?
One option would be to compare each value with the previous one, check whether the difference is greater than 1, and take the cumulative sum of the resulting logical vector (df1 here is the data frame from the question):
df1$grp <- cumsum(c(TRUE, diff(df1$x) > 1))
df1$grp
#[1] 1 1 1 2 2 2 3 3 3
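If the allowed gap is something other than 1, the same idea generalises directly (a small sketch, with thresh as a hypothetical threshold parameter):
x <- c(3:5, 7:9, 12:14)
thresh <- 1  # maximum gap allowed within a group
# start a new group whenever the gap to the previous value exceeds thresh
grp <- cumsum(c(TRUE, diff(x) > thresh))
grp
# [1] 1 1 1 2 2 2 3 3 3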
Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.
I'm used to using read.csv to read in files with a header row, but I have an Excel spreadsheet with two 'header' rows: a cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:
a b
x y z x y z
10 1 5 22 1 6
12 2 6 21 3 5
12 2 7 11 3 7
13 1 4 33 2 8
12 2 5 44 1 9
csv file below:
a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9
How can I get to a data.frame in R as shown below?
cell x y z
a 10 1 5
a 12 2 6
a 12 2 7
a 13 1 4
a 12 2 5
b 22 1 6
b 21 3 5
b 11 3 7
b 33 2 8
b 44 1 9
Use base R reshape():
temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5
Basically, you just need to skip the first row, which contains the letters a-g in every third column. Since the remaining sub-column names are duplicates of x, y and z, R automatically appends a suffix (.1, .2, ...) to every repeated name after the third column, so we only need to add a matching suffix (.0) to the first three columns ourselves so that reshape() can pair them up.
You can then either create an "id" variable or, as I've done here, just use the row names for the IDs.
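To make this concrete, here is roughly what the column names look like before and after the renaming step, assuming the read.delim call above:
names(temp)
# [1] "x.0" "y.0" "z.0" "x.1" "y.1" "z.1"
# (before the renaming step they were "x", "y", "z", "x.1", "y.1", "z.1",
#  because check.names made the duplicated trio unique on import)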
You can change the "time" variable to your "cell" variable as follows:
# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])
Then, drop the "time" column:
OUT$time = NULL
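At this point, if you want the data frame laid out exactly as in the question (cell, x, y, z), one final reordering step (my addition, not part of the original answer) should do it:
OUT <- OUT[order(OUT$cell), c("cell", "x", "y", "z")]
rownames(OUT) <- NULL
head(OUT, 3)
#   cell  x y z
# 1    a 10 1 5
# 2    a 12 2 6
# 3    a 12 2 7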
Update
To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
                  nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(OUT$time, labels=GROUPS)
OUT$time = NULL