How to allow user inputs in a min-max range in R? - r

Good afternoon.
My question is very simple.
I want to read n user inputs (a vector of length n). Only values between 0 and 1 should be accepted.
I know how to read values with the scan function, but I don't know how to force users to enter only values in the [min, max] interval.
Thank you for your help!
Code :
x <- scan(,n=3)

One way can be using a while loop:
stayInLoop <- TRUE
N <- 3 # number of elements in the vector
while (stayInLoop) {
  print("Please insert x")
  x <- scan(, n = N) #readLines(,n=N)
  if (any(x < 0) | any(x > 1)) {
    print("Re-enter the values, as valid values can be between 0 and 1")
    x <- scan(, n = N)
  }
  # keep looping while any value is still outside [0, 1]
  stayInLoop <- any(x < 0) | any(x > 1)
}
[1] "Please insert x"
1: 1
2: 2
3: 3
Read 3 items
[1] "Re-enter the values, as valid values can be between 0 and 1"
1: 0.2
2: 0.2
3: 0.4
Read 3 items
> x
[1] 0.2 0.2 0.4
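The same idea can be wrapped in a small function. This is only a sketch: the name read_in_range and its reader argument are hypothetical (not part of base R), and reader exists only so the function can be tried without typing at the console.

```r
# Hypothetical helper: keep asking until exactly n values lie inside [lo, hi].
# `reader` defaults to scan() on the console; it can be swapped for testing.
read_in_range <- function(n, lo = 0, hi = 1,
                          reader = function(n) scan(n = n, quiet = TRUE)) {
  repeat {
    x <- reader(n)
    # accept only a complete, non-missing vector with every value in range
    if (length(x) == n && !anyNA(x) && all(x >= lo & x <= hi)) {
      return(x)
    }
    message("Please enter ", n, " values between ", lo, " and ", hi, ".")
  }
}
```

At the console, x <- read_in_range(3) would then keep prompting until three values between 0 and 1 have been entered.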

Related

How to filter alphanumeric characters range?

I need to create dummy variables using ICD-10 codes. For example, chapter 2 starts with C00 and ends with D48X. Data looks like this:
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
LINHAA2 = c("U071", "C99", "D68X", "J061"),
LINHAA3 = c("D48X", "Y098", "X223", "D640"))
Then I need to create a column that receives 1 if it's between the C00-D48X range and 0 if it's not. The result I desire:
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
B342 U071 D48X 1
C000 C99 Y098 1
D450 D68X X223 1
O985 J061 D640 0
It needs to go through LINHAA1 to LINHAA3. Thanks in advance!
This should do it:
as.numeric(apply(apply(data, 1,
function(x) { x >="C00" & x <= "D48X" }), 2, any))
[1] 1 1 1 0
A little explanation: whether a code falls in the range can be checked using alphabetical order (which is what <=, >= etc. use for character strings). The inner apply checks each element and produces a matrix of logical values. The outer apply uses any to check whether any of the three logical values is TRUE. as.numeric changes the result from TRUE/FALSE to 1/0.
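The same string comparison can also be done on the whole table at once. The following is a sketch of an equivalent base R variant, not the answerer's code; note that string ordering in R depends on the locale, so the result is only as reliable as the collation in use.

```r
data <- data.frame(LINHAA1 = c("B342", "C000", "D450", "0985"),
                   LINHAA2 = c("U071", "C99", "D68X", "J061"),
                   LINHAA3 = c("D48X", "Y098", "X223", "D640"))

# character matrix of all codes, one row per record
codes <- as.matrix(data)

# element-wise range check using string ordering
in_range <- codes >= "C00" & codes <= "D48X"

# a row is flagged if at least one of its codes is in range
data$CHAPTER2 <- as.integer(rowSums(in_range) > 0)
print(data)
```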
This is the typical case for dplyr::if_any. if_any returns TRUE if a given condition is met in any of the tested columns, rowwise:
library(dplyr)
data %>%
mutate(CHAPTER2 = +if_any(starts_with("LINHAA"),
~.x >= 'C00' & .x <='D48X'))
LINHAA1 LINHAA2 LINHAA3 CHAPTER2
1 B342 U071 D48X 1
2 C000 C99 Y098 1
3 D450 D68X X223 1
4 0985 J061 D640 0
Using the dedicated icd package:
# remotes::install_github("jackwasey/icd")
library(icd)
#get the 2nd chapter start and end codes
ch2 <- icd::icd10_chapters[[ 2 ]]
# start end
# "C00" "D49"
#expand the codes to include all chapter2 codes
ch2codes <- expand_range(ch2[ "start" ], ch2[ "end" ])
# length(ch2codes)
# 2094
#check if codes in a row match
ix <- apply(data, 1, function(i) any(i %in% ch2codes))
# [1] FALSE TRUE FALSE FALSE
data$chapter2 <- as.integer(ix)
#data
# LINHAA1 LINHAA2 LINHAA3 chapter2
# 1 B342 U071 D48X 0
# 2 C000 C99 Y098 1
# 3 D450 D68X X223 0
# 4 0985 J061 D640 0
Note that you have some invalid codes:
#invalid
is_defined("D48X")
# [1] FALSE
explain_code("D48X")
# character(0)
#Valid
is_defined("D48")
# [1] TRUE
explain_code("D48")
# [1] "Neoplasm of uncertain behavior of other and unspecified sites"

how to count the number of distinct items in a linear programming problem

I am studying some linear programming problems with all-binary variables, where it is necessary to count (and then either constrain or maximise/minimise) the number of distinct items in the solution.
This is the post I could find that seemed closest to it:
https://stats.stackexchange.com/questions/136608/constrained-assignment-problem-linear-programming-genetic-algorithm-etc
The 'items' being counted in this case are the supply centers used. I am trying to understand if the approach suggested in the above post is correct for my purposes.
In the answer by user 'TAS', the example is 3 shops by 2 supply centers, and the idea is (A) to assign one (and only one) supply center to each shop, so that: (B) the distance travelled is minimal, (C) no supply center must supply more than a given maximal number of shops (in this case 3, i.e. no limit), and (D) the max total number of supply centers used is limited (in this case to 2).
I tried to reconstruct how the problem was set up, starting from a dataset like the one I would have in my case.
df <- cbind(expand.grid(shop=1:3,supply=1:2),distance=c(2.8,5.4,1.4,4.2,3.0,6.3))
df["Entry"] <- 1:dim(df)[[1]]
shop.mat <- table(df$shop,df$Entry)
shop.mat
1 2 3 4 5 6
1 1 0 0 1 0 0
2 0 1 0 0 1 0
3 0 0 1 0 0 1
supply.mat <- table(df$supply,df$Entry)
supply.mat
1 2 3 4 5 6
1 1 1 1 0 0 0
2 0 0 0 1 1 1
N_supply <- dim(supply.mat)[[1]]
N_shop <- dim(shop.mat)[[1]]
N_entry <- dim(df)[[1]]
The solution vector will have length N_entry + N_supply, and each row of the constraint matrix will need to have the same length.
constr.mat <- NULL
dir <- NULL
rhs <- NULL
(A) is addressed by constraining each line in the shop.mat to be == 1:
constr.mat <- rbind(constr.mat,cbind(shop.mat,matrix(0,N_shop,N_supply)))
dir <- c(dir,rep("==",N_shop))
rhs <- c(rhs,rep(1,N_shop))
(B) is addressed by setting the objective vector to the distance for each Entry, and 0 for each supply center (because there is no cost in adding one more supply center, although in reality there might be):
obj <- c(aggregate(distance~Entry,df,c)[["distance"]],rep(0,N_supply))
(C) is addressed by rearranging the equation and turning it into a <= 0 constraint:
constr.mat <- rbind(constr.mat,cbind(supply.mat,-diag(table(df$supply))))
dir <- c(dir,rep("<=",N_supply))
rhs <- c(rhs,rep(0,N_supply))
(D) is addressed by adding a constraint <= 2:
constr.mat <- rbind(constr.mat,c(rep(0,N_entry),rep(1,N_supply)))
dir <- c(dir,"<=")
rhs <- c(rhs,2)
The problem can then be solved using lpSolve:
require(lpSolve)
sol <- lp("min", obj, constr.mat, dir, rhs, all.bin = TRUE,num.bin.solns = 1, use.rw=FALSE, transpose.constr=TRUE)
sol$solution
[1] 1 0 1 0 1 0 1 1
sol$objval
[1] 7.2
selected_Entry <- dimnames(shop.mat)[[2]][as.logical(sol$solution[1:N_entry])]
selected_Entry
[1] "1" "3" "5"
df[df$Entry %in% selected_Entry,]
shop supply distance Entry
1 1 1 2.8 1
3 3 1 1.4 3
5 2 2 3.0 5
I can see that in this specific case the solution vector is forced (by constraints (C)) to have '1' in any of the 'supply' variables for which at least one corresponding Entry is selected. If this were not the case, the row sums for constraints (C) would be > 0.
But: suppose the distances were different and only supply center 1 were chosen for all 3 shops. What would stop the solution vector variable for supply center 2 from being set to '1'?
The current solution gives:
constr.mat %*% sol$solution
[,1]
1 1
2 1
3 1
1 -1
2 -2
2
But this alternative solution would still meet all the constraints:
constr.mat %*% c(1,1,1,0,0,0,1,1)
[,1]
1 1
2 1
3 1
1 0
2 -3
2
despite the fact that supply center 2 was not used.
In this case this would not affect the solution, because there is no cost associated to including the supply centers (the corresponding elements of the objective vector are 0).
But if I wanted to get from the solution the count of distinct supply centers used, I think this would not work.
A few years ago I asked for advice on this problem on another forum, and someone immediately gave me the solution, however saying that he/she 'was not sure it was the most efficient'.
It was the following: keep everything as above, then append to constr.mat two further copies of the block used for (C), i.e. supply.mat augmented by the negated diagonal matrix of the number of entries per supply center. The first N_supply added rows are constrained to be <= 0, and the last N_supply rows to be >= 1 minus the diagonal of that diagonal matrix.
constr.mat <- rbind(constr.mat,cbind(supply.mat,-diag(table(df$supply))),cbind(supply.mat,-diag(table(df$supply))))
dir <- c(dir,rep("<=",N_supply),rep(">=",N_supply))
rhs <- c(rhs,rep(0,N_supply),1-table(df$supply))
The addition of these constraints makes sure that the 'supply' variables in the solution vector are 1 if and only if the corresponding supply center has been used, and 0 if and only if it hasn't been used.
For instance, the original solution would still work:
paste(t(unlist(constr.mat %*% sol$solution)),dir,rhs)
[1] "1 == 1" "1 == 1" "1 == 1" "-1 <= 0"
[5] "-2 <= 0" "2 <= 2" "-1 <= 0" "-2 <= 0"
[9] "-1 >= -2" "-2 >= -2"
[BTW I would not know how to turn this into an evaluated logical vector; any idea?]
whereas the other solution, which erroneously set the variable for supply center 2 to 1 although this supply center wasn't used, would instead not be valid:
paste(t(unlist(constr.mat %*% c(1,1,1,0,0,0,1,1))),dir,rhs)
[1] "1 == 1" "1 == 1" "1 == 1" "0 <= 0"
[5] "-3 <= 0" "2 <= 2" "0 <= 0" "-3 <= 0"
[9] "0 >= -2" "-3 >= -2"
(the last constraint would not be met).
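On the side question of turning such strings into an evaluated logical vector, one possible approach is to look up each operator with match.fun and apply it row by row. This is a sketch using the numbers printed above; in the real problem lhs would be as.vector(constr.mat %*% solution). Exact == on non-integer values can be fragile, so a tolerance may be needed there.

```r
# Left-hand sides, directions and right-hand sides copied from the output above
lhs <- c(1, 1, 1, 0, -3, 2, 0, -3, 0, -3)
dir <- c("==", "==", "==", "<=", "<=", "<=", "<=", "<=", ">=", ">=")
rhs <- c(1, 1, 1, 0, 0, 2, 0, 0, -2, -2)

# match.fun() turns the operator name into the comparison function itself
satisfied <- mapply(function(l, op, r) match.fun(op)(l, r), lhs, dir, rhs)

print(satisfied)       # one TRUE/FALSE per constraint row
print(all(satisfied))  # FALSE here: the last row is violated
```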
Q1 Do you think the above makes sense, i.e. is it true that we need the additional constraint rows I mentioned to make sure that the 'supply' variables in the solution vector are appropriately set, or am I wrong?
Q2 Can you think of a more efficient way to count occurrences of distinct items in such problems (the example here is small, but I am often dealing with VERY large ones, where adding so many more constraints is not really helping, despite all the presolve in the world)?
Thanks!
Note: this question was originally posted in another community. I deleted it from there.
EDIT after consulting the Wikipedia page on 'Uncapacitated Facility Location Problem', mentioned in the original post I linked above.
In fact there is a cost associated with opening a new supply center, so the objective vector should not have 0's at the end, but some cost ($f_i$ in the Wikipedia formulation).
Only then does the issue of $\sum_iy_i$ not always being the number of open supply centers disappear. The $\sum_jx_{i,j} \le m \cdot y_i$ constraints still ensure that whenever a given center is used, the corresponding $y_i$ is 1, and there is no need for the other condition I imposed: because there is now a cost associated with setting each $y_i$ to 1, only the strictly necessary $y_i$'s will be set to 1.
So in short, if the objective vector is properly constructed, with costs for each supply center, I can do without several constraints.
In fact, depending on the value for the supply center opening cost, the constraint on the max total number of centers may even be superseded.
Then it would be interesting to evaluate the suggestion made in the Wikipedia discussion, namely to split the 'big M' constraints into several smaller ones. If it's true that it makes the problem easier to solve computationally, why not...
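For reference, the formulation described in this edit can be written out as below. It is a sketch in the edit's own notation ($y_i$, $x_{i,j}$, $f_i$, $m$); the symbol $d_{i,j}$ for the distance from center $i$ to shop $j$ is an assumption introduced here, and $m$ is taken to be the number of shops.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_i \sum_j d_{i,j}\, x_{i,j} + \sum_i f_i\, y_i \\
\text{s.t.}\quad & \sum_i x_{i,j} = 1 \quad \text{for every shop } j, \\
& \sum_j x_{i,j} \le m \cdot y_i \quad \text{for every center } i, \\
& x_{i,j} \in \{0,1\}, \quad y_i \in \{0,1\}.
\end{aligned}
```

With every $f_i > 0$, an optimal solution sets $y_i = 1$ only for centers that are actually used, so $\sum_i y_i$ is the number of distinct centers. The split form replaces each big-M row by the $m$ separate rows $x_{i,j} \le y_i$.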
EDIT: The algorithm is fixed.
I had a similar issue: I wanted to minimize the number of distinct values in a solution. The following is how I came up with a way to calculate the number of distinct items mathematically.
Suppose we have the following set:
11 12 13 11 11
We can see there are 3 distinct numbers in there (11, 12, and 13). The following is a way to compute this.
Write the numbers in a triangular matrix like this:
11 12 13 11 11   row=0
   12 13 11 11   row=1
      13 11 11   row=2
         11 11   row=3
If I take the difference between the first number a1 of each row and each later number a2 in that row, and set a binary variable to
1 if |a1 - a2| != 0
0 if |a1 - a2| == 0
then I have the following:
1 1 0 0 --> 0
  1 1 1 --> 1
    1 1 --> 1
      0 --> 0
1 + 1 + (extra 1) = 3
If a number is distinct from every number after it, then its row is all 1s.
So for the above case we have 2 rows of all 1s, and the last number always counts as one more (the extra 1). So, in total we have 3.
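The counting rule can be checked outside of any solver. The following R sketch (the function name count_distinct_triangle is made up for illustration) counts every element that differs from all later elements, plus the last element.

```r
# Count distinct values with the triangle rule: an element is counted when it
# differs from every element after it; the last element is always counted.
count_distinct_triangle <- function(a) {
  n <- length(a)
  if (n == 0) return(0L)
  count <- 1L  # the "extra 1" for the last element
  for (i in seq_len(n - 1)) {
    later <- a[(i + 1):n]
    if (all(a[i] != later)) count <- count + 1L
  }
  count
}

print(count_distinct_triangle(c(11, 12, 13, 11, 11)))  # 3
```

The result agrees with length(unique(a)), which is the direct way to get the same number when no optimisation model is involved.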
Now to translate into Linear Programming:
Assume:
Variables = a(1), a(2), a(3), a(4), ..., a(n)
Binary Variables b(i,j) where i,j in [0...n]
Binary Variable c(i) where i in [0...n]
The Linear Program would be:
obj = 1   # the last element always counts (the "extra 1")
for i in range(0, n-1):
    for j in range(i+1, n):
        # This is the |a(i) - a(j)| part: b(i,j) is forced to 1 whenever a(i) != a(j)
        addConstr( b(i,j) * BigM >= a(i) - a(j))
        addConstr( b(i,j) * BigM >= -(a(i) - a(j)))
    # Row i has n-1-i entries. c(i) can be 0 if there is any 0 in the row, otherwise it is forced to 1.
    addConstr(c(i) * BigM >= sum(b(i,j) for j in range(i+1, n)) - (n-i-2))
    obj = obj + c(i)
Minimize(obj)

looping between two vectors for meeting three conditions

I have a csv file containing 4 columns of data. I need to select the first column from the csv file which I do like this:
file1<-read.csv("file1.csv",header=TRUE)
x<-file1[,1]
The first column (x, here) contains row numbers.
x
5
10
54
177
178
182
183
184
185
203
204
205
206
207
208
Now there is another csv file which contains a single column of 365 rows of data
y<-read.csv("data.csv",header=TRUE)
y
0
2.3
0.5
21
0
.
.
.
9.5 #total 365 numbers
This is what I intend to do:
1) From x, chose the first number (which is 5)
2) In y, select the corresponding 5th data point (which is 0) and the 4 data points prior to it (which are 21, 0.5, 2.3, 0), then test the following conditions in turn:
Condition 1: If three out of the five data points are > 0, then print 5 (the result of step 1)
Condition 2: If four out of the five are > 0, then print 5 again
Condition 3: If all five are > 0, then print 5 again
However, if only the first two conditions are met and the third is not, then select the second number from x (10 in this case), choose the corresponding 10th data point in y and the four data points prior to it (6th, 7th, 8th and 9th), and evaluate them for the third condition only. I do not need to re-evaluate the first and second conditions, which were already met by the previous number from x. If all five numbers (6th to 10th) are > 0, then save 10 and stop.
This sounds quite complicated for a feeble mind like mine (as seen by my reputation), and I was hoping someone could tell me how to do this in R.
Thanks a lot
Sounds like you need a while loop.
file1 <- data.frame(x=seq(5, 205, by=5))
file1
x <- file1[, 1]
set.seed(123)
file2 <- data.frame(y=rnorm(365))
y <- file2[, 1]
# flags for each condition
cond1 <- FALSE
cond2 <- FALSE
cond3 <- FALSE
k <- 0
while(!cond3) {
  k <- k + 1
  # select first number
  num <- x[k]
  # select all y's up to data point
  all.y <- y[(num-4):num]
  # number of positive values
  chk.pos <- length(which(all.y > 0))
  # condition 1: check if 3 of 5 are positive
  cnt <- 0
  if (!cond1 & chk.pos >= 3) {
    cnt <- cnt + 1
    cond1 <- TRUE
    print(num)
  }
  # condition 2: check if 4 of 5 are positive
  if (!cond2 & chk.pos >= 4) {
    cnt <- cnt + 1
    cond2 <- TRUE
    print(num)
  }
  # condition 3: check if 5 of 5 are positive
  if (!cond3 & chk.pos == 5) {
    cnt <- cnt + 1
    cond3 <- TRUE
    print(num)
  }
}
for me returns
[1] 5
[1] 15
[1] 70
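The loop can also be expressed without explicit flags. The following is a sketch, not the answerer's code: first_hits is a hypothetical helper that counts the positive values in each 5-point window and returns, for each threshold (3, 4 and 5 positives), the first element of x that reaches it. It assumes every element of x is at least as large as the window length, and the toy y below is made up so the result can be checked by hand.

```r
first_hits <- function(x, y, window = 5, thresholds = 3:5) {
  # number of positive y values in the window that ends at each index in x
  n_pos <- vapply(x, function(i) as.numeric(sum(y[(i - window + 1):i] > 0)),
                  numeric(1))
  # first element of x whose window reaches each threshold (NA if none does)
  vapply(thresholds, function(t) as.numeric(x[which(n_pos >= t)[1]]),
         numeric(1))
}

# toy data: the window ending at 5 has 3 positives, the one ending at 10 has 5
y <- c(0, 2.3, 0.5, 21, 0, 1, 1, 1, 1, 1)
x <- c(5, 10)
print(first_hits(x, y))  # 5 10 10
```

Because the third condition implies the first two, the three returned values are the same numbers the while loop prints, in the same order.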

analyze by row groups in R [duplicate]

Possible Duplicate:
Averaging column values for specific sections of data corresponding to other column values
I would like to analyze a dataset by group. The data is set up like this:
Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0
I have a function that calculates the following
sumStats = function(obs, cens) {
  detects = obs[cens==0]
  nondetects = obs[cens==1]
  mean.detects = mean(detects)
  return(mean.detects)
}
This is of course a simple function, for illustration purposes only. Is there a function in R that will let me apply this home-made function, which needs 2 input variables, to the data by group?
I looked into the by function, but it seems to take only 1 column of data at a time.
Import your data:
test <- read.table(header=TRUE,textConnection("Group Result cens
A 1.3 1
A 2.4 0
A 2.1 0
B 1.2 1
B 1.7 0
B 1.9 0"))
Though there are many ways to do this, using by specifically you could do something like this (assuming your dataframe is called test):
by(test,test$Group,function(x) mean(x$Result[x$cens==1]))
which will give you the mean of all the Result values within each group which have cens==1
Output looks like:
test$Group: A
[1] 1.3
----------------------------------------------------------------------
test$Group: B
[1] 1.2
To help you understand how this might work with your function, consider this:
If you just ask the by statement to return the contents of each group, you will get:
> by(test,test$Group,function(x) return(x))
test$Group: A
Group Result cens
1 A 1.3 1
2 A 2.4 0
3 A 2.1 0
-----------------------------------------------------------------------
test$Group: B
Group Result cens
4 B 1.2 1
5 B 1.7 0
6 B 1.9 0
...which is actually 2 data frames with only the rows for each group, stored as a list.
This means you can access parts of the data.frames for each group as you would before they were split up. The x in the above functions refers to the whole sub-dataframe for each of the groups. That is, you can use individual variables of x to pass to functions. A basic example:
> by(test,test$Group,function(x) x$Result)
test$Group: A
[1] 1.3 2.4 2.1
-------------------------------------------------------------------
test$Group: B
[1] 1.2 1.7 1.9
Now, to finally get around to answering your specific query!
If you take an example function which gets the mean of two inputs separately:
sumStats = function(var1, var2) {
res1 <- mean(var1)
res2 <- mean(var2)
output <- c(res1,res2)
return(output)
}
You could call this using by to get the mean of both Result and cens like so:
> by(test,test$Group,function(x) sumStats(x$Result,x$cens))
test$Group: A
[1] 1.9333333 0.3333333
----------------------------------------------------------------------
test$Group: B
[1] 1.6000000 0.3333333
Hope that is helpful.
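Putting this together for the function in the question (the mean of the detects, i.e. rows with cens == 0), a minimal sketch using split and sapply instead of by could look like this. The data frame is rebuilt here so the block runs on its own.

```r
test <- data.frame(Group  = c("A", "A", "A", "B", "B", "B"),
                   Result = c(1.3, 2.4, 2.1, 1.2, 1.7, 1.9),
                   cens   = c(1, 0, 0, 1, 0, 0))

# two-argument function from the question: mean of the uncensored results
sumStats <- function(obs, cens) {
  detects <- obs[cens == 0]
  mean(detects)
}

# split() gives one sub-data-frame per group; both columns are passed on
res <- sapply(split(test, test$Group),
              function(g) sumStats(g$Result, g$cens))
print(res)  # A: 2.25, B: 1.80
```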
The aggregate function is designed for this.
aggregate(dfrm$cens, dfrm["Group"], FUN=mean)
You can get the mean value of several columns at once, each within 'Group':
aggregate(dfrm[ , c("Result", "cens") ], dfrm["Group"], FUN=mean)

How can I extract values from a data.frame based on a vector in R?

suppose I have a numeric vector like:
x <- c(1.0, 2.5, 3.0)
and data.frame:
df<-data.frame(key=c(0.5,1.0,1.5,2.0,2.5,3.0),
value=c(-1.187,0.095,-0.142,-0.818,-0.734,0.511))
df
key value
1 0.5 -1.187
2 1.0 0.095
3 1.5 -0.142
4 2.0 -0.818
5 2.5 -0.734
6 3.0 0.511
I want to extract all the rows of df whose key is equal to one of the values in x, with a result like:
df.x$value
[1] 0.095 -0.734 0.511
Is there an efficient way to do this please? I've tried data.frame, hash package and data.table, all with no success. Thanks for help!
Thanks guys. I actually tried similar thing but got df$key and x reversed. Is it possible to do this with the hash() function (in the 'hash' package)? I see hash can do things like:
h <- hash( keys=letters, values=1:26 )
h$a # 1
h$foo <- "bar"
h[ "foo" ]
h[[ "foo" ]]
z <- letters[3:5]
h[z]
<hash> containing 3 key-value pair(s).
c : 3
d : 4
e : 5
But seems like it doesn't take an array in its key chain, such as:
h[[z]]
Error in h[[z]] : wrong arguments for subsetting an environment
but I need the values only as in a vector rather than a hash. Otherwise, it would be perfect so that we can get rid of data.frame by using some 'real' hash concept.
Try,
df[df$key %in% x,"value"] # the values only
df[df$key %in% x,] # or the whole rows
Using an OR | condition you may modify it in such a way that your vector may occur in either of your columns. General tip: also have a look at which.
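If the result should follow the order of x (as a hash lookup would), match is an alternative to %in%. The sketch below also shows a named vector used as a simple lookup table; both are base R, and the reordered x is an assumption made only to show the ordering effect.

```r
df <- data.frame(key   = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
                 value = c(-1.187, 0.095, -0.142, -0.818, -0.734, 0.511))
x <- c(3.0, 1.0, 2.5)  # deliberately not in the order of df$key

# match() returns the position of each x in df$key, so order follows x
vals <- df$value[match(x, df$key)]
print(vals)  # 0.511 0.095 -0.734

# a named vector works like a small key-value table (names are character)
lookup <- setNames(df$value, df$key)
vals2 <- unname(lookup[as.character(x)])
print(vals2)
```

By contrast, df[df$key %in% x, "value"] returns the values in the row order of df.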
Have you tried testing which values of df$key are in x and extracting the corresponding entries of the value column? I only say this out loud because StackOverflow doesn't like one-line answers:
> x
[1] 1.0 2.5 3.0
> df
key value
1 0.5 -0.7398436
2 1.0 0.6324852
3 1.5 1.8699257
4 2.0 1.0038996
5 2.5 1.2432679
6 3.0 -0.6850663
> df[df$key %in% x,'value']
[1] 0.6324852 1.2432679 -0.6850663
>
BIG WARNING - comparisons with floating point numbers with == can be a bad idea - read R FAQ 7.31 for more info.
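To illustrate the warning: %in% compares doubles exactly, just like ==, so a key produced by arithmetic may not match. A tolerance-based check is one way around this. The helper name near_in and the tolerance of 1e-8 are assumptions for this sketch, not an established API.

```r
# TRUE for each key that lies within `tol` of at least one target
near_in <- function(keys, targets, tol = 1e-8) {
  vapply(keys, function(k) any(abs(k - targets) < tol), logical(1))
}

a <- 0.1 + 0.2
print(a %in% 0.3)       # FALSE: 0.1 + 0.2 is not exactly 0.3
print(near_in(a, 0.3))  # TRUE

df <- data.frame(key   = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
                 value = c(-1.187, 0.095, -0.142, -0.818, -0.734, 0.511))
x <- c(1.0, 2.5, 3.0)
print(df$value[near_in(df$key, x)])  # 0.095 -0.734 0.511
```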
