I'm trying to create a new column in my matrix that holds the rate of change from one point in time to the next. Using the following matrix, this is a 3-step process.
n <- 20
data <- matrix(rnorm(2 * n), nrow = n)
1) Focusing on column 1, I want to divide row 2 by row 1.
2) I want to create a new column to hold the answer in row 2.
3) Repeat this process down the rows (3/2, 4/3, 5/4, etc.).
I'm assuming a simple function like the following would be involved in step 1:
y <- data[1, 1]
z <- data[2, 1]
roc <- function(x) { z / y }
Step 2 is simple:
data$ROC[data[1, ] >= 0] <- roc
But I'm at a loss for step 3, and I'm not 100% sure that the function is correctly written.
Complete answer, based on Ryan's comment.
#### data matrix ####
n <- 20
data <- matrix(rnorm(2 * n), nrow = n)
#### math ####
y <- data[, 1] / data.table::shift(data[, 1])
#### new column ####
data <- cbind(data, ROC = y)
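If you'd rather not depend on data.table, the same lagged division works in base R; a minimal sketch, assuming data is still the n x 2 matrix from above:
# base-R alternative: divide each value in column 1 by the value above it;
# row 1 has no predecessor, so it gets NA
roc <- c(NA, data[-1, 1] / data[-n, 1])
data <- cbind(data, ROC = roc)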
In R, I'm attempting to find the best combination of 8 different columns of values, with the caveat that only one value can be selected from each row. It sounds relatively simple, but I'm trying to avoid a nasty looping scenario that evaluates all possible options, so I'm hopeful there is a function that makes this possible. There are scenarios where I will need to run this on datasets with over 2000 rows, so efficiency is really important.
Here is an example:
I've been racking my brain and searching forever, but every scenario and solution I'm able to find can maximize a series of columns but can't handle the condition of only allowing a single value per row. Are there any functions where this is possible?
I will take a risk here and assume that I interpreted you right: that you seek the group of 8 numbers in that table that has the maximum sum, given, of course, that no two of them share a column or a row.
There is no easy answer to this question. I am not a computer scientist, but I believe this is a variant of what is called the assignment problem; it can actually be solved exactly in polynomial time (the Hungarian algorithm is the classic method), whereas the brute-force approach below grows factorially with the number of columns. Fortunately, in practical terms, I think you can get an answer for a 2000+ row table in a matter of seconds, as long as the number of columns remains small.
The algorithm I used to attack this problem is essentially a depth-first search that takes advantage of existing R functions to stay fast. You can think of your problem as jumping from column to column, each time selecting the highest value, with a twist: every time you select a value, all cells in that row are set to zero. So in essence, when you get to the last column, there will be only one value left to choose.
However, because rows are excluded along the way, your results will differ depending on the order in which you visit the columns (let's call that a path). Thus, you have to test all paths.
So our code must do something of the sort:
1- Enumerate all paths (all permutations of the column numbers);
2- For each path, "walk" it, taking the maximum value of each column and setting the values in its row to 0. Store the values;
3- For each set of values, calculate its sum and select based on that.
Below is the code I have used to do it:
library(combinat) # loads the permn function, which enumerates all permutations
# Create fake data
data = sample(1:25)
data = matrix(data, 5, 5)
# Walking function
walker = function(path, data) {
  bestn = numeric(length(path))    # placeholder for the max value of each column
  usedrows = numeric(length(path)) # placeholder for the row of each max value
  data.reduced = data # copy data to a new object
  for (a in 1:length(path)) { # iterate through columns in the order given by path
    bestn[a] = max(data.reduced[, path[a]]) # find the maximum value
    usedrows[a] = which.max(data.reduced[, path[a]]) # find the maximum value's row
    data.reduced[usedrows[a], ] = 0 # set all values in that row to 0
    data.reduced[, path[a]] = 0 # set the current column to 0
  }
  return(bestn)
}
# Create all permutations, walk each one, get the sums, and choose based on that
paths = permn(1:5)
values = lapply(paths, walker, data)
values.sum = sapply(values, sum)
values[[which.max(values.sum)]]
The code can handle a 2000 x 5 matrix in under a second on a laptop. I did not add such an example here because, with more rows, the result becomes increasingly independent of the path taken, and it is harder to see what is happening with large numbers.
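As an aside, because this is the assignment problem, an exact polynomial-time solver already exists in the clue package; here is a minimal sketch, assuming clue is installed and data is the 5 x 5 matrix from above:
library(clue) # provides solve_LSAP, a Hungarian-style assignment solver
# solve_LSAP needs nrow <= ncol, so transpose: assign each original column
# to a distinct row, maximizing the total
assignment <- solve_LSAP(t(data), maximum = TRUE)
chosen <- data[cbind(as.integer(assignment), 1:5)]
sum(chosen) # the maximal sum under the one-per-row, one-per-column constraint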
This problem can be solved simply as a binary integer optimization problem, here using the ROI and ompr optimization packages. ompr is a formulation manager that calls ROI functions for the optimization and processing. Here is an example:
require(ROI)
require(ROI.plugin.glpk)
require(ompr)
require(ompr.roi)
set.seed(7)
n <- runif(77, 80, 120)
n <- c(n, rep(0, 179))
n <- sample(n)
m <- matrix(n, ncol = 8)
nrows <- nrow(m)
ncols <- ncol(m)
model <- MIPModel() %>%
add_variable(x[i, j], i=1:nrows, j=1:ncols, type='binary', lb=0) %>%
set_objective(sum_expr(colwise(m[i, j]) * x[i, j], i=1:nrows, j=1:ncols), 'max') %>%
add_constraint(sum_expr(x[i, j], i=1:nrows) <= 1, j=1:ncols) %>%
add_constraint(sum_expr(x[i, j], j=1:ncols) <= 1, i=1:nrows)
result <- solve_model(model, with_ROI(solver = "glpk", verbose = TRUE))
<SOLVER MSG> ----
GLPK Simplex Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
* 0: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
* 20: obj = 9.321807877e+002 infeas = 0.000e+000 (0)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
40 rows, 256 columns, 512 non-zeros
256 integer variables, all of which are binary
Integer optimization begins...
+ 20: mip = not found yet <= +inf (1; 0)
+ 20: >>>>> 9.321807877e+002 <= 9.321807877e+002 0.0% (1; 0)
+ 20: mip = 9.321807877e+002 <= tree is empty 0.0% (0; 1)
INTEGER OPTIMAL SOLUTION FOUND
<!SOLVER MSG> ----
solution <- get_solution(result, x[i, j])
solution <- subset(solution, value != 0)
solution
variable i j value
27 x 27 1 1
43 x 11 2 1
88 x 24 3 1
99 x 3 4 1
146 x 18 5 1
173 x 13 6 1
209 x 17 7 1
246 x 22 8 1
The first code chunk generates a 32 x 8 random matrix. The sample call produces roughly a 30% fill (77 nonzero cells out of 256). The constraints restrict each row and each column to at most one active variable. You can use this code directly for a matrix of any dimension.
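To pull the chosen cells back out of m and check them against the reported objective, something like the following should work (a small sketch using ompr's objective_value accessor):
# sum the selected cells directly and compare with the solver's objective
sum(m[cbind(solution$i, solution$j)])
objective_value(result) # should match, roughly 932.18 in this run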
This is my first post here, and I couldn't find the answer I was looking for.
I'm currently taking the edX course on Probability in Data Science, but I got stuck on section 1.
The task asks you to simulate a series of 6 games with random, independent outcomes of either a loss (0) or a win (1), and then use the sum function to determine whether a simulated series contained at least 4 wins.
Here's what I did:
l <- list(0:1)
n <- 6
games <- expand.grid(rep(l, n))
games <- paste(games$Var1, games$Var2, games$Var3, games$Var4, games$Var5, games$Var6)
sample(games, 1, replace = TRUE)
but I can't seem to use the sum function to sum the result of sample and check whether a series contains at least 4 wins. I've been trying to use
sum(sample(games, 1, replace = TRUE))
but can't seem to get anywhere with it.
Any light would be greatly appreciated!
Thanks a lot!
This is what one simulated series looks like:
sample(c(0, 1), 6, replace = TRUE)
To count the number of wins (i.e. 1s) you could use sum, like
sum(sample(c(0, 1), 6, replace = TRUE)) >= 4
Now you can generate such a series n times with replicate.
n <- 1000
replicate(n, sum(sample(c(0, 1), 6, replace = TRUE)) >= 4)
If you have to use games (the expand.grid data frame) for the calculation, you can use rowSums to count the number of 1s per row:
sum(rowSums(games) >= 4)
#[1] 22
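As a sanity check, 22 of the 64 equally likely series contain at least 4 wins, so the exact probability is 22/64 = 0.34375; a Monte Carlo estimate should land close to that (a quick sketch):
# exact probability of at least 4 wins in 6 fair games
sum(dbinom(4:6, 6, 0.5)) # 0.34375 = 22/64
# Monte Carlo estimate from 10,000 simulated series
mean(replicate(10000, sum(sample(c(0, 1), 6, replace = TRUE)) >= 4))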
The Fibonacci series is obtained by adding together the prior two integers in the series; it begins 1, 1, 2, 3, 5, 8. I used the following code to generate the series up to the 50th term.
y <- 50
fibvals <- numeric(y)
fibvals[1] <- 1
fibvals[2] <- 1
for (i in 3:y) {
  fibvals[i] <- fibvals[i - 1] + fibvals[i - 2]
}
Now I want to add the numbers at the even positions, i.e. 1, 3, 8, and so on, up to the 50th number. Please help.
Try using seq to select the even vector indices from 2 to 50, like this:
sum(fibvals[seq(2, 50, by = 2)])
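Equivalently, logical indexing with recycling picks out every second element without computing the positions explicitly (a small sketch):
# c(FALSE, TRUE) is recycled along fibvals, keeping positions 2, 4, 6, ...
sum(fibvals[c(FALSE, TRUE)])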
Also: there are R libraries that make working with series easier. You could use the numbers package, for example, to get the first 50 Fibonacci numbers:
fibvals <- sapply(1:50, numbers::fibonacci)
I am new to R, and I have R code that uses a for loop to calculate y = m*i + b. The starting value is negative, and I want to use it in my calculation and store the result in the first element of Trend.Line, and so forth.
I am not getting the results I expect. No matter what the starting number is, even when it is positive, I still want the first calculated value stored in the first element.
For example, if start = -5, I would like to store the value calculated by y <- m*i + b in Trend.Line[1] and the value calculated for -4 in Trend.Line[2]. If instead start = 6, I would like to store that calculated value in Trend.Line[1] and the value for 7 in Trend.Line[2].
Thanks for looking into this.
Here is my code:
Trend.Line <- numeric(0)
start <- -5
end <- 12
m <- 345.72
b <- 54454
for (i in start:end) {
  y <- m * i + b
  Trend.Line[i] <- y
}
Trend.Line
How about just doing
Trend.Line <- start:end
m * Trend.Line + b
It returns a numeric vector with everything at the index you want, and it makes use of the vectorization of functions in R: multiplication and addition operate on all elements of the vector Trend.Line at once.
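If you do want to keep the explicit loop, the underlying problem is that i runs from -5 to 12, and negative or zero values of i cannot be used as vector positions; shifting the index fixes it (a sketch of the original loop with that one change):
Trend.Line <- numeric(end - start + 1)
for (i in start:end) {
  # shift i so that i = start lands in Trend.Line[1]
  Trend.Line[i - start + 1] <- m * i + b
}
Trend.Line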
I would like to do two things to my fairly large data set, about 10K x 50K. The following is a smaller example of 200 x 10000.
First I want to generate 5% missing values, which is perhaps simple and can be done with a simple trick:
# dummy data
set.seed(123)
# matrix of X variables
xmat <- matrix(sample(0:4, 2000000, replace = TRUE), ncol = 10000)
colnames(xmat) <- paste("M", 1:10000, sep = "")
rownames(xmat) <- paste("sample", 1:200, sep = "")
Generate missing values at 5% of random places in the data:
N <- 2000000 * 0.05 # 5% random missing values
inds_miss <- round(runif(N, 1, length(xmat)))
xmat[inds_miss] <- NA
Now I would like to generate errors (meaning values different from those now in the matrix). The matrix above holds values from 0 to 4. So what I would like to do:
(1) I would like to replace a value x with another value that is not x. For example, 0 can be replaced by a random sample from the values that are not 0 (i.e. 1, 2, 3 or 4); similarly, 1 can be replaced by a value that is not 1 (i.e. 0, 2, 3 or 4). Indices where a random value can be placed are simply generated with:
inds_err <- round(runif(N, 1, length(xmat)))
If I randomly sample values from 0:4 and assign them at those indices, this will sometimes replace a value with the same value (0 with 0, 1 with 1, and so on), creating no error at all.
errorg <- sample(0:4, length(inds_err), replace = TRUE)
xmat[inds_err] <- errorg
(2) I want to introduce errors into the xmat that already contains missing values; however, I do not want the NAs generated in the step above to be replaced with a value (0 to 4). So inds_err should contain no member of the vector inds_miss.
So, in summary, the rules are:
(1) The missing values must not be replaced with error values.
(2) An existing value must be replaced with a different value (which is the definition of an error here); naive random sampling has a 1/5 probability of redrawing the same value.
How can this be done? I need a fast solution that can be used on my large dataset.
You can try this:
inds_err <- setdiff(round(runif(2 * N, 1, length(xmat))), inds_miss)[1:N]
xmat[inds_err] <- (xmat[inds_err] + sample(4, N, replace = TRUE)) %% 5
With the first line you generate 2*N candidate error indices, then you remove the ones belonging to inds_miss and keep the first N. With the second line you add a random number between 1 and 4 to each value you want to change and then take it mod 5. This way you are sure that the new value is different from the original and still in the 0-4 range.
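A quick way to convince yourself the modular trick works (a small sketch): adding any of 1 to 4 to a value and reducing mod 5 can produce every value except the original one.
# for v = 3, the possible outcomes are 0, 1, 2 and 4 -- never 3 itself
v <- 3
sort(unique((v + 1:4) %% 5))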
Here's an if/else solution that could work for you. It is a for loop, so I'm not sure whether that will be acceptable; possibly it can be vectorized in some way to make it faster.
# vector of options
vec <- 0:4
# simple logic-based solution if you just don't want the NAs changed
for (i in seq_along(inds_err)) {
  idx <- inds_err[i]
  if (is.na(xmat[idx])) {
    next # leave missing values untouched
  } else {
    # a value v sits at position v + 1 in vec, so drop that position
    # before sampling a replacement
    xmat[idx] <- sample(vec[-(xmat[idx] + 1)], 1)
  }
}
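For a large matrix, the loop can be avoided entirely by applying the modular trick from the previous answer to just the non-NA error positions (a vectorized sketch):
# keep only error positions that are not already NA
ok <- inds_err[!is.na(xmat[inds_err])]
# add 1..4 mod 5 so each value is guaranteed to change
xmat[ok] <- (xmat[ok] + sample(4, length(ok), replace = TRUE)) %% 5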