I need to adjust a variable repeatedly until none of its rows is higher than one specific value. Here is some context:
I have 2 vectors: 'a' and 'b'.
I normalize a and b and calculate their ratio 'c' (a_norm/b_norm).
No row of 'c' may be higher than a constant 'd'. Any 'c' row that is higher than d should be set to d.
After all the 'c' rows that need it are adjusted (let's call this new column c_adjusted), I must recalculate a_norm as c_adjusted*b_norm (note that the result no longer sums to 100, so let's call it a_adjusted).
I normalize a_adjusted to get the new a_norm: a_adjusted_norm = a_adjusted/sum(a_adjusted)*100.
I calculate c again to check whether all rows satisfy the condition after the adjustment. If any row is still higher than d, I have to repeat the process until the condition is satisfied. The result I want is the final a_adjusted_norm.
Does anybody know how to achieve this? Here is a reproducible example:
library(dplyr) # provides %>% and mutate()
set.seed(8)
#create dataframe
a<- runif(100, min = 0, max = 10)
b<- runif(100, min = 0, max = 10)
a_norm <- a/sum(a)*100
b_norm <- b/sum(b)*100
c <- a_norm / b_norm
c_cap <- 1 # c must not be higher than c_cap
df <- data.frame(a_norm, b_norm, c)
df <- df %>%
  mutate(c_adjusted = ifelse(c >= c_cap, c_cap, c), # Cap the c rows that are higher than c_cap
         a_adjusted = c_adjusted*b_norm, # Calculate the adjusted a from the adjusted c
         a_adjusted_norm = a_adjusted/sum(a_adjusted)*100) # Normalize adjusted a
# Calculate c again to see if it meets the condition
df <- df %>%
  mutate(c = a_adjusted_norm/b_norm) # check whether c satisfies the condition after the adjustment
# If any row of c is still higher than the cap, I must adjust it again and repeat the process until all rows meet the condition
Thanks in advance!
Generally you can do:
a <- runif(10, min = 0, max = 10)
b <- runif(10, min = 0, max = 10)
a_norm <- a/sum(a)*100
b_norm <- b/sum(b)*100
cap <- 1
c <- a_norm / b_norm
while (max(c) > cap) {
  c[c > cap] <- cap
  a_adjusted <- c * b_norm
  a_adjusted_norm <- a_adjusted/sum(a_adjusted)*100
  c <- a_adjusted_norm/b_norm
}
However, this never seems to terminate: capping pulls the high values down to the cap, but renormalising then scales every value up again, so the capped values (and values just below the cap) end up above it once more. The loop therefore never ends (at least, I stopped it manually after some time).
So you probably need to adjust the formula you use to recalculate your c values!
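One way to make a loop like this stop is to accept a small tolerance and limit the number of iterations. Here is a minimal sketch, assuming a tolerance `tol`, a limit `max_iter` and a helper name `cap_ratio`, none of which come from the question. Note that because a_norm and b_norm both sum to 100, a cap below 1 cannot be satisfied at all, and with a cap of exactly 1 the only possible result is a_adjusted_norm equal to b_norm.

```r
# Hypothetical helper (not from the original answer): cap the ratio
# a_norm / b_norm at `cap`, renormalise, and repeat until no ratio
# exceeds the cap by more than `tol`.
cap_ratio <- function(a, b, cap, tol = 1e-9, max_iter = 10000) {
  stopifnot(cap >= 1, all(b > 0))
  a_norm <- a / sum(a) * 100
  b_norm <- b / sum(b) * 100
  for (iter in seq_len(max_iter)) {
    ratio <- a_norm / b_norm
    if (max(ratio) <= cap + tol) {
      return(list(a_adjusted_norm = a_norm, c = ratio, iterations = iter - 1))
    }
    a_adjusted <- pmin(ratio, cap) * b_norm       # cap the ratio, rebuild a
    a_norm <- a_adjusted / sum(a_adjusted) * 100  # renormalise to sum to 100
  }
  stop("no convergence within max_iter iterations")
}

# Small deterministic example: a_norm = (60, 30, 10), b_norm = (20, 40, 40)
res <- cap_ratio(a = c(6, 3, 1), b = c(2, 4, 4), cap = 1.5)
round(res$a_adjusted_norm, 4)
round(res$c, 4)
```

In this small example only the first row ends up at the cap, and the result settles at roughly (30, 52.5, 17.5). How many iterations are needed depends on the data, so `max_iter` may have to be raised for other inputs.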
I'm trying to run a simulation with a combination of static variables and values within columns, sum the output, and store the individual outputs in a vector or dataframe.
mean1 <- 2.4
sd1 <- 0.5
df <- data.frame(x = c(2, 3, 4), y = c(5, 6, 7))
What I want to do is :
divide each row in column x by each row in column y
multiply by a normal distribution using mean1 and sd1
sum the resultant row values, so I'd have a single value per simulation.
I think I understand how I'd get the value if I wasn't going row by row, so for row 1 it'd be:
v1 <- replicate(n = 1, expr = rnorm(n = 100, mean = mean1, sd = sd1) * 2 / 5, simplify = TRUE)
But where I'm drawing a blank is how to run that for each row, then sum the results of each row for each simulation, in this case sum the three values from each of the three rows 100 times, so I'd have an output with 100 values.
Dividing x by y gives a constant, so you can do it once and save the result in a variable. You can then call replicate with n = 100, generating one random number per iteration, multiplying it by val and taking the sum.
val <- df$x/df$y
n <- 100
replicate(n, {
  sum(val * rnorm(n = 1, mean = mean1, sd = sd1))
})
Or you can also generate 100 random values together and sum them with sapply.
r_val <- rnorm(n, mean = mean1, sd = sd1)
sapply(r_val, function(x) sum(val * x))
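Both versions above draw one random number per simulation and apply it to every row. If each row should instead get its own independent draw in every simulation (an assumption about the intent, the question does not say so explicitly), a matrix of draws can be summed column-wise. The names `draws` and `sims` and the seed are only for illustration:

```r
mean1 <- 2.4
sd1 <- 0.5
df <- data.frame(x = c(2, 3, 4), y = c(5, 6, 7))

val <- df$x / df$y
n <- 100

set.seed(1)  # arbitrary seed, only for reproducibility
# One column per simulation, one row per row of df
draws <- matrix(rnorm(length(val) * n, mean = mean1, sd = sd1),
                nrow = length(val), ncol = n)
sims <- colSums(val * draws)  # val is recycled down each column
length(sims)                  # one value per simulation
```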
Ronak answered my question with:
val <- df$x/df$y
n <- 100
replicate(n, {
  sum(val * rnorm(n = 1, mean = mean1, sd = sd1))
})
I had to add back the df$column reference (df$x here) as opposed to creating a constant since the actual application had more variables and math that was more complicated than the example, but the structure worked perfectly.
Thank you!
Recently I have been trying to simulate a game.
I throw 2 dice at the same time. If the sum of the 2 dice is greater than or equal to 10, I win 1 point.
If it is lower than 10, I lose 1 point. I will do this 1000 times.
To begin, I draw 2000 random samples with set.seed(1234):
set.seed(1234)
d = sample(c(1:6), size = 2000, replace = T)
d
And then, I turn it into a matrix, and sum each row
a = matrix(d, nrow=1000, ncol=2, byrow=T)
t = rowSums(a)
t
Now I have 1000 elements (the sum of the two dice for each throw). I would like to create a vector X holding the points I get for each throw.
How can I apply an if statement to create the vector X?
Thank you very much
Do you mean this?
X <- ifelse(t>=10,1,-1)
or
X <- 2*(t>=10)-1
Using case_when
library(dplyr)
case_when(t >= 10 ~ 1, TRUE ~ -1)
You could assign a temporary variable and assign points by comparing the values.
tmp <- t
t[tmp >= 10] <- 1
t[tmp < 10] <- -1
Or without a temporary variable.
t1 <- c(-1, 1)[(t >= 10) + 1]
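As a quick check, here is a small sketch showing that the vectorised approaches give the same points and how to get the overall score. The names `x1`, `x2` and `x3` are just for illustration:

```r
set.seed(1234)
d <- sample(1:6, size = 2000, replace = TRUE)
t <- rowSums(matrix(d, nrow = 1000, ncol = 2, byrow = TRUE))

x1 <- ifelse(t >= 10, 1, -1)
x2 <- 2 * (t >= 10) - 1
x3 <- c(-1, 1)[(t >= 10) + 1]

stopifnot(all(x1 == x2), all(x2 == x3))  # all three give the same points
sum(x1)           # net score after 1000 throws
cumsum(x1)[1:10]  # running score over the first 10 throws
```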
I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so I'm supposed to get "FALSE" for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs, for example I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use package fuzzyjoin
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2,df[1:2,] +0.01)
df %>%
  fuzzy_left_join(df2, match_fun = function(x, y) y < x + 0.1 & y > x - 0.1) %>%
  mutate(found = !is.na(x.y)) %>%
  select(-4:-6)
# # A tibble: 5 x 4
#     x.x   y.x   z.x found
#   <dbl> <dbl> <dbl> <lgl>
# 1   1     0.5   0.5 TRUE
# 2   1     0.8   0.7 TRUE
# 3   0.3   0.1   1   FALSE
# 4   0.9   0.7   0.2 FALSE
# 5   0.7   0.7   0.5 FALSE
Find more info here: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes. Make the range for each column the same, zero to one or negative one to one. That equalizes the effect of the columns with large and small values. When more advanced approaches are used, one would center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation1 (row1) to observation2 (row2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
In your case, the selection criterion is simpler. Two observations are close if no attribute is more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order. I also assume that close observations are relatively rare. My approach is: once we discover that a pair is distant, stop investigating it. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
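The per-column criterion can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name `has_neighbor`, the small slack `eps` for floating-point error and the toy matrices are mine, not part of the question. It works through one row of the first data set at a time, so memory use stays at the size of the second data set.

```r
# Hypothetical helper: for each row of m1, is there a row of m2 that is
# within `tol` in every column?
has_neighbor <- function(m1, m2, tol = 0.1) {
  m1 <- as.matrix(m1)
  m2 <- as.matrix(m2)
  stopifnot(ncol(m1) == ncol(m2))
  eps <- 1e-9  # slack for floating-point error (an assumption)
  vapply(seq_len(nrow(m1)), function(i) {
    # absolute difference between row i of m1 and every row of m2
    diffs <- abs(sweep(m2, 2, m1[i, ]))
    # a row of m2 is a neighbor if no column differs by more than tol
    any(rowSums(diffs > tol + eps) == 0)
  }, logical(1))
}

m1 <- rbind(c(1, 0.5, 0.5), c(0.3, 0.1, 1))
m2 <- rbind(c(0.9, 0.6, 0.4), c(0.7, 0.7, 0.5))
has_neighbor(m1, m2)
```

The first row of `m1` is the example from the question, with (0.9, 0.6, 0.4) as its neighbor. The second row has no neighbor in `m2`.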
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do comparisons with tearful results. R wanted matrices of the same size. The kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring the results to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix describes the distance from each observation in df (rows in dfclose) to each observation in df2 (columns in dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df are close to those in df2.
The total score is zero. To that we add 1 for every column in which the absolute difference (the per-column term of the Manhattan distance) is .1 or more. Once the score is greater than zero, the pair is no longer close, and we needn't evaluate it any more.
closeCriterion <- function(origin, dest) {
  # 0 if the two values are within .1 of each other in this column, else 1
  if (abs(origin - dest) < .1) 0 else 1
}
convertScore <- function(x) x == 0 # TRUE when no column was distant
m1 <- as.matrix(df)  # plain matrices, so [i, j] always returns a single number
m2 <- as.matrix(df2)
for (j in 1:ncol(m1)) {
  print(paste("col =", j))
  for (i in 1:nrow(m1)) {
    print(paste("df row =", i))
    for (k in 1:nrow(m2)) {
      # add 1 to the score of the pair (i, k) if column j is distant
      dfdist[i, k] <- dfdist[i, k] + closeCriterion(m1[i, j], m2[k, j])
    }
  }
}
dfdist # have a look at the numerical results
dfclose <- matrix(sapply(dfdist, convertScore), ncol = nrow(df2))
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <-matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has a 2.2% chance of matching. The example data is not such a good test case for the algorithm.
Best of luck
Here's one way to calculate that column without fuzzyjoin
library(tidyverse)
found <-
  expand.grid(row.df = seq(nrow(df)),
              row.df2 = seq(nrow(df2))) %>%
  mutate(in.range = pmap_lgl(., ~ all(abs(df[.x,] - df2[.y,]) <= 0.1))) %>%
  group_by(row.df) %>%
  summarise_at('in.range', any) %>%
  select(in.range)
I have a simple question, so lets take some basic data
a <- rnorm(100, mean=1, sd = 0.1)
b <- rnorm(100, mean=5, sd = 2)
c <- data.frame(a,b)
Now I want to redefine c$b such that values below a limit are replaced by a value the user chooses, while values at or above the limit are left unchanged:
c$b <- with(c, ifelse(b < 2, 1, # leave as existing value #))
So when b < 2 we want to assign a value of 1, and otherwise keep the existing value.
If we are using ifelse, try
c$b <- with(c, ifelse(b < 2, 1, b))
This doesn't even require ifelse. We can get the logical index of values less than 2 in the 'b' column (c$b <2) and assign those values to 1.
c$b[c$b<2] <- 1
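A short sketch comparing the two approaches above, plus base R's `replace()` as a third option. The data frame name `dat`, the result names and the seed are only for illustration:

```r
set.seed(1)  # arbitrary seed for reproducibility
dat <- data.frame(a = rnorm(100, mean = 1, sd = 0.1),
                  b = rnorm(100, mean = 5, sd = 2))

via_ifelse <- with(dat, ifelse(b < 2, 1, b))

via_index <- dat$b
via_index[via_index < 2] <- 1

via_replace <- replace(dat$b, dat$b < 2, 1)

# All three give the same result
stopifnot(all(via_ifelse == via_index), all(via_index == via_replace))
```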
I have a data frame df with 2 variables A and B. I would like to split A into groups 1 and 2 so that mean(df$B[df$group==1]) is as close as possible to mean(df$B[df$group==2]).
Or, to put it another way, I would like to find a cut point (cutp) in df$A that minimizes abs(mean(df$B[df$A<cutp])-mean(df$B[df$A>=cutp])).
Any ideas?
If you want to find a threshold on variable A, to split the data into two groups, so that the means of B in those two groups be similar, you can compute these means for all possible cut-points, and check when the distance between those means is minimal.
# Sample data
n <- 10
d <- data.frame(
  A = rnorm(n),
  B = rnorm(n)
)
# The quantity to minimize
# (You can use a loop instead of apply.)
d$differences <- apply(
  d, 1,
  # Compute the difference of the means for each value of A
  function (u) {
    i <- d$A <= u[1];
    abs( mean( d$B[which(i)]) - mean(d$B[which(!i)] ) )
  }
)
# The mean of an empty vector is NaN: discard those values
d$differences[ ! is.finite( d$differences ) ] <- Inf
# Take the minimum
threshold <- d$A[ which.min( d$differences ) ]
# Build the groups
d$group <- ifelse( d$A <= threshold, "group 1", "group 2" )
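For larger data, the same search can be sketched with cumulative sums after sorting by A, which avoids recomputing both means for every candidate. This is only a sketch: the helper name `best_cut` is hypothetical, it assumes the values of A are distinct, and it follows the `A <= threshold` convention of the code above.

```r
# Hypothetical helper: returns the value of A such that splitting into
# A <= cutp and A > cutp minimises the absolute difference of the means of B.
best_cut <- function(A, B) {
  stopifnot(length(A) == length(B), length(A) >= 2)
  ord <- order(A)
  A <- A[ord]
  B <- B[ord]
  n <- length(A)
  k <- seq_len(n - 1)                        # size of the lower group
  low_mean  <- cumsum(B)[k] / k              # mean of B for the k smallest A
  high_mean <- (sum(B) - cumsum(B)[k]) / (n - k)
  diffs <- abs(low_mean - high_mean)
  list(cutp = A[which.min(diffs)], difference = min(diffs))
}

res <- best_cut(A = c(1, 2, 3, 4), B = c(10, 0, 6, 4))
res$cutp        # 2: groups {10, 0} and {6, 4} both have mean 5
res$difference  # 0
```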
I'm still not sure how column A factors into it. It seems you want to create a new column that has two levels which create ~= mean values for column B. Column A is obviously associated with the new column created, but does not directly factor into the calculation needed. Am I missing something?
Regardless, here's a start (note this can be made much more robust, but proof of concept should work). Define a tolerance that you find acceptable and then set up a while loop to create new groups until the condition is met, i.e.
FUN <- function(tol){
  df$groups <- sample(1:2, nrow(df), TRUE)
  while(abs(mean(df$B[df$groups == 1]) - mean(df$B[df$groups == 2])) > tol) {
    df$groups <- sample(1:2, nrow(df), TRUE)
  }
  return(df)
}
set.seed(101)
df <- data.frame(A=runif(20),B=runif(20))
#Test it. Means should be less than .02 different and have roughly equivalent sample sizes.
set.seed(101)
out <- FUN(.02)
library(plyr)
> ddply(out, "groups", summarize, n = length(B), mean = mean(B))
  groups  n      mean
1      1 11 0.5229024
2      2  9 0.5037279
I should note that you could create a runaway function if you set tol super low so don't blame me if your computer crashes.
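To avoid a runaway loop, the search can be given an upper bound on the number of tries. A minimal sketch, where the function name `split_random` and the argument `max_tries` are my own additions and not part of the answer above:

```r
# Hypothetical variant of FUN with an upper bound on the number of tries.
split_random <- function(df, tol, max_tries = 10000) {
  for (i in seq_len(max_tries)) {
    groups <- sample(1:2, nrow(df), replace = TRUE)
    m1 <- mean(df$B[groups == 1])
    m2 <- mean(df$B[groups == 2])
    # is.finite() skips draws where one group is empty (its mean is NaN)
    if (is.finite(m1) && is.finite(m2) && abs(m1 - m2) <= tol) {
      df$groups <- groups
      return(df)
    }
  }
  NULL  # give up instead of looping forever
}

set.seed(101)
df <- data.frame(A = runif(20), B = runif(20))
out <- split_random(df, tol = 0.02)
is.null(split_random(df, tol = -1, max_tries = 50))  # impossible tolerance: TRUE
```

Returning NULL lets the caller decide what to do when no acceptable split was found, for example by retrying with a larger tolerance.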