I'm trying to run a simulation that combines static variables with values from data frame columns, sum the output of each simulation, and store the individual results in a vector or data frame.
mean1 <- 2.4
sd1 <- 0.5
df <- data.frame(x = c(2, 3, 4), y = c(5, 6, 7))
What I want to do is:
divide each row in column x by each row in column y
multiply by a normal distribution using mean1 and sd1
sum the resultant row values, so I'd have a single value per simulation.
I think I understand how I'd get the value if I wasn't going row by row, so for row 1 it'd be:
v1 <- replicate(n = 1, expr = rnorm(n = 100, mean = mean1, sd = sd1) * 2 / 5, simplify = TRUE)
But where I'm drawing a blank is how to run that for each row and then sum the per-row results within each simulation. In this case that means summing the three values from the three rows, 100 times, so I'd end up with an output of 100 values.
Dividing x by y is constant, so you can do it once and save it in a variable. You can then use replicate 100 times, generating one random number at each iteration to multiply by and taking the sum.
val <- df$x/df$y
n <- 100
replicate(n, {
  sum(val * rnorm(n = 1, mean = mean1, sd = sd1))
})
Alternatively, you can generate all 100 random values at once and compute the sums with sapply.
r_val <- rnorm(n, mean = mean1, sd = sd1)
sapply(r_val, function(x) sum(val * x))
Ronak answered my question with:
val <- df$x/df$y
n <- 100
replicate(n, {
  sum(val * rnorm(n = 1, mean = mean1, sd = sd1))
})
I had to add back the df$column reference (df$x here) rather than pre-computing a constant, since the actual application had more variables and more complicated math than the example, but the structure worked perfectly.
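For reference, a minimal sketch of that structure, with the column references kept inside replicate (the real application has more variables and math, so this is only illustrative):
n <- 100
sims <- replicate(n, {
  # df$x / df$y stands in for the more complicated per-row math
  sum(df$x / df$y * rnorm(n = 1, mean = mean1, sd = sd1))
})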
Thank you!
Let me walk you through my workflow:
General idea
Based on data in a dataframe, select the appropriate distribution functions, combine them in all possible ways to get the mean of the combined distributions.
Starting position
I have a large data frame df. It contains several variables (var1, var2 and var3 in this example) whose values determine which distribution function to select.
I have several distribution functions per variable:
var1_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var1_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 6, sd = 1))
var1_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 2, sd = 2))
var2_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 3))
var2_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 3, sd = 1))
var2_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 2))
var3_distr1 <- pdqr::as_d(function(x)dnorm(x, mean = 4, sd = 1))
var3_distr2 <- pdqr::as_d(function(x)dnorm(x, mean = 5, sd = 1))
var3_distr3 <- pdqr::as_d(function(x)dnorm(x, mean = 7, sd = 2))
Select the right distribution
Using an if_else on each of the vars, I store the name of the appropriate distribution for each case in a new column. The if_else looks like this for var1 and has the same form for all vars:
df$distr_var1 <- if_else(df$info < 0, "var1_distr1",
                         if_else(df$info > 0 & df$info < 100, "var1_distr2", "var1_distr3"))
This results in the following df:
df <- data.frame(distr_var1 = c("var1_distr1", "var1_distr3", "var1_distr1", "var1_distr2", "var1_distr2", "var1_distr1", "var1_distr3"),
distr_var2 = c("var2_distr2", "var2_distr1", "var2_distr2", "var2_distr1", "var2_distr3", "var2_distr3", "var2_distr1"),
distr_var3 = c("var3_distr2", "var3_distr3", "var3_distr1", "var3_distr1", "var3_distr2", "var3_distr3", "var3_distr1"))
Combine distribution functions
To combine distribution functions into a new proportional distribution function, I created this function, based on this question:
foo <- function(...){
  # set x values
  x <- seq(1, 10, by = 1)
  # create y values
  y <- 1L
  for (fun in list(...)) y <- y * fun(x)
  # create new PDF
  p <- data.frame(x, y)
  pdqr::new_d(p, type = "continuous")
}
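A quick usage sketch of foo with two of the distributions defined above, taking the mean of the combined distribution:
combined <- foo(var1_distr1, var2_distr2)
pdqr::summ_mean(combined)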
And I have stored the PDFs in a list:
PDFS <- list(var1_distr1 = var1_distr1, var1_distr2 = var1_distr2, var1_distr3 = var1_distr3,
var2_distr1 = var2_distr1, var2_distr2 = var2_distr2, var2_distr3 = var2_distr3,
var3_distr1 = var3_distr1, var3_distr2 = var3_distr2, var3_distr3 = var3_distr3)
I would like to use the function foo on the df to generate proportional distributions for all combinations of distributions given in the df. So, for each case, the following combinations: var1_var2, var1_var3, var2_var3, var1_var2_var3.
Calculate mean over distributions
If I want to calculate a mean over the distributions individually, I can do this:
means <- sapply(PDFS, pdqr::summ_mean)
df$mean_var1 <- means[df$distr_var1]
Or:
df$mean_var2 <- sapply(mget(df$distr_var2), pdqr::summ_mean)
Both approaches work fine. But for the combinations var1_var2, var1_var3, var2_var3 and var1_var2_var3 I have not found a suitable approach; I tried things like this:
df$var1_var2_mean <- sapply(foo(mget(mapply(PDFS, sapply, df$distr_var1, df$distr_var2))), pdqr::summ_mean)
I tried to work around calling the functions by name by using a list, but things seem to get too complicated / nested to work nicely...
Question
How do I select the appropriate distributions given in distr_var1, distr_var2 and distr_var3, combine them using foo, and calculate the mean using pdqr::summ_mean?
I'm happy with any comments, also on the workflow in general.
A foreach loop works for me:
library(foreach)
library(pdqr)

df$var1_var2_mean <- foreach(i = 1:nrow(df), .combine = c) %do% {
  A <- as.name(df$distr_var1[i])
  B <- as.name(df$distr_var2[i])
  summ_mean(foo(get(A), get(B)))
}
And, for each combination I need to do this. At least I got it working...
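As an alternative sketch (assuming the PDFS list from the question), the distribution functions can be looked up by name in the list and combined with mapply, avoiding get():
# look each pair of functions up by name in PDFS, combine with foo, take the mean
df$var1_var2_mean <- mapply(
  function(a, b) pdqr::summ_mean(foo(PDFS[[a]], PDFS[[b]])),
  df$distr_var1, df$distr_var2
)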
I want to generate 599 numbers (with repeats) between 0 and 100 with a specified mean and sd; the generated numbers should all be multiples of 5. Pseudo-code for what I want would be:
df$Murphy_task_strategy<-rnorm(mean=57, sd= 25.30, n=599,by=5, min=0, max=100)
Thank you
It won't strictly be a normal distribution, but I'm assuming you're mainly looking for something that looks approximately bell-shaped in a histogram.
rdiscnorm <- function(n, mean, sd, min = 0, max = 100, by = 5){
  # generate the possible values we can take on
  vals <- seq(from = min, to = max, by = by)
  # use dnorm to get the density at each of those points
  unnormed_probabilities <- dnorm(vals, mean = mean, sd = sd)
  # normalize so that the probabilities sum to 1
  # - this isn't strictly necessary because we use sample
  #   but it makes sense when thinking about the process
  ps <- unnormed_probabilities / sum(unnormed_probabilities)
  # take a sample with replacement of the vals
  # using the generated probabilities
  output <- sample(vals, size = n, replace = TRUE, prob = ps)
  return(output)
}
out <- rdiscnorm(599, 57, 25.30)
df$Murphy_task_strategy <- out
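A quick sanity check of the output (sketch):
mean(out)           # roughly near the requested mean; truncation at 0/100 shifts it a little
sd(out)             # somewhat below 25.3 because of the truncation
all(out %% 5 == 0)  # TRUE: every value is a multiple of 5
range(out)          # stays within 0 and 100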
I would like to apply the kmeans function to a dataset.
I run it several times, increasing the number of centers each time. For each run I store the total within sum of squares in a vector, and then I plot it against the number of clusters like so:
# Dummy data
cluster1_x <- rnorm(1000, mean = 3.5, sd = .75)
cluster1_y <- rnorm(1000, mean = 4, sd = 1.13)
cluster1 <- cbind(cluster1_x, cluster1_y)
cluster2_x <- rnorm(1000, mean = 5.2, sd = .75)
cluster2_y <- rnorm(1000, mean = .9, sd = .64)
cluster2 <- cbind(cluster2_x, cluster2_y)
cluster3_x <- rnorm(1000, mean = .68, sd = .86)
cluster3_y <- rnorm(1000, mean = 0.8, sd = 1)
cluster3 <- cbind(cluster3_x, cluster3_y)
df <- rbind(cluster1, cluster2, cluster3)
# To see the dummy clusters
# plot(df, pch = 20)
# Applying kmeans
# Vector that will be filled with the total within-cluster sum of squares
tot.within.sum.square <- rep(NA, 20)
for (nb_center in 1:20){
  tps_start <- Sys.time()
  set.seed(13)
  res.kmeans <- kmeans(df, centers = nb_center, iter.max = 30)
  tot.within.sum.square[nb_center] <- res.kmeans$tot.withinss
  tps_exec <- Sys.time() - tps_start
  print(paste0("Iteration ", nb_center, " : ", tps_exec))
}
plot(1:20, tot.within.sum.square, type = 'b', pch=20)
I would like to repeat this process 4 times, each time using a different algorithm. There are 4 possible values, "Hartigan-Wong", "Lloyd", "Forgy" and "MacQueen", so I want to end up with 4 vectors of length 20, one per algorithm. Each element of a given vector is the value of res.kmeans$tot.withinss; for example, the 4th element is the total within sum of squares from a run of kmeans with 4 centers. I could copy and paste the previous code, but I am looking for a more elegant way to achieve this.
I can somewhat get what I want using this:
sapply(algos, function(x) {
  sapply(nb_centers, function(y) kmeans(df, centers = y, algorithm = x))
})
but I am not able to store each tot.withinss value from each iteration of each algorithm in a variable.
Any help will be appreciated!
As mentioned in the comments by @Parfait,
# the four algorithms and the numbers of centers described in the question
algos <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
nb_centers <- 1:20

tot.withinss <- sapply(algos, function(x) {
  sapply(nb_centers, function(y) kmeans(df, centers = y, algorithm = x)$tot.withinss)
})
will do the trick!
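If it helps, the result is a 20-by-4 matrix (rows = numbers of centers, columns = algorithms), which can be plotted directly; a minimal sketch, assuming the algos and nb_centers vectors defined above:
# one elbow curve per algorithm
matplot(nb_centers, tot.withinss, type = "b", pch = 20, lty = 1,
        xlab = "Number of centers", ylab = "Total within sum of squares")
legend("topright", legend = algos, col = 1:4, lty = 1, pch = 20)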
I have the following data frame:
library(dplyr)
set.seed(42)
df <- data_frame(x = sample(seq(0, 1, 0.1), 5, replace = T), y = sample(seq(0, 1, 0.1), 5, replace = T), z= sample(seq(0, 1, 0.1), 5, replace = T) )
For each row in df, I would like to find out whether there is a row in df2 which is close to it ("neighbor") in all columns, where "close" means that it is not different by more than 0.1 in each column.
So for instance, a proper neighbor to the row (1, 0.5, 0.5) would be (0.9, 0.6, 0.4).
The second data set is
set.seed(42)
df2 <- data_frame(x = sample(seq(0, 1, 0.1), 10, replace = T), y = sample(seq(0, 1, 0.1), 10, replace = T), z= sample(seq(0, 1, 0.1), 10, replace = T) )
In this case there is no "neighbor", so I'm supposed to get FALSE for all rows of df.
My actual data frames are much bigger than this (dozens of columns and hundreds of thousands of rows), so the naming has to be very general rather than "x", "y" and "z".
I have a sense that this can be done using mutate and funs; for example, I tried this line:
df <- df %>% mutate_all(funs(close = (. <= df2(, .)+0.1) & (. >= df2(, .)-0.1))
But got an error.
Any ideas?
You can use the fuzzyjoin package:
library(fuzzyjoin)
# adding two rows that match
df2 <- rbind(df2, df[1:2, ] + 0.01)
df %>%
  fuzzy_left_join(df2, match_fun = function(x, y) y < x + 0.1 & y > x - 0.1) %>%
  mutate(found = !is.na(x.y)) %>%
  select(-4:-6)
# # A tibble: 5 x 4
# x.x y.x z.x found
# <dbl> <dbl> <dbl> <lgl>
# 1 1 0.5 0.5 TRUE
# 2 1 0.8 0.7 TRUE
# 3 0.3 0.1 1 FALSE
# 4 0.9 0.7 0.2 FALSE
# 5 0.7 0.7 0.5 FALSE
Find more info here: Joining/matching data frames in R
The machine learning approach to finding a close entry in a multi-dimensional dataset is Euclidean distance.
The general approach is to normalize all the attributes: make the range for each column the same, zero to one or negative one to one. That equalizes the effect of columns with large and small values. In more advanced approaches one would also center the adjusted column values on zero. The test criterion is scaled the same way.
The next step is to calculate the distance of each observation from its neighbors. If the data set is small or computing time is cheap, calculate the distance from every observation to every other. The Euclidean distance from observation 1 (row 1) to observation 2 (row 2) is sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + ...). Choose your criterion and select.
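A minimal sketch of that idea (an assumption here: df and df2 are all-numeric with the same columns, and the scaling uses pooled ranges so both sets are rescaled the same way):
pooled <- rbind(as.matrix(df), as.matrix(df2))
mins <- apply(pooled, 2, min)
rngs <- apply(pooled, 2, max) - mins
a <- sweep(sweep(as.matrix(df), 2, mins), 2, rngs, "/")
b <- sweep(sweep(as.matrix(df2), 2, mins), 2, rngs, "/")
# all pairwise Euclidean distances between rows of df and rows of df2
d <- sqrt(pmax(outer(rowSums(a^2), rowSums(b^2), "+") - 2 * a %*% t(b), 0))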
In your case, the selection criterion is simpler: two observations are close if no attribute differs by more than 0.1 from the other observation. I assume that df and df2 have the same number of columns in the same order, and that close observations are relatively rare. My approach stops investigating a pair once we discover it is distant. If you have hundreds of thousands of rows, you will likely exhaust memory if you try to calculate all the combinations at the same time.
~~~~~
You have a big problem. If your data sets df and df2 are one hundred thousand rows each, and four dozen columns, the machine needs to do 4.8e+11 comparisons. The scorecard at the end will have 1e+10 results (close or distant). I started with some subsetting to do comparisons with tearful results. R wanted matrices of the same size. The kluge I devised was unsuccessful. Therefore I regressed to the days of FORTRAN and did it with loops. With the loop approach, you could subset the problem and finish without smoking your machine.
From the sample data, I did the comparisons by hand, all 150 of them: nrow(df) * nrow(df2) * ncol(df). There were no close observations in the sample data by the definition you gave.
Here is how I intended to present the results before transferring them to a new column in df.
dfclose <- matrix(TRUE, nrow = nrow(df), ncol = nrow(df2))
dfclose # Have a look
This matrix records whether each observation in df (rows of dfclose) is close to each observation in df2 (columns of dfclose). If close, the entry is TRUE.
Here is the repository of the result of the distance measures:
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
dfdist # have a look; it's the same format, but with numbers
We start with the assumption that all observations in df are close to all observations in df2, so the total score is zero. For each column we then add a distance score: 0 if that column's Manhattan distance is less than 0.1, and 1 otherwise. Once the total score is greater than zero, the pair is no longer close and we needn't evaluate it any more.
closeCriterion <- function(origin, dest) {
  manhattanDistance <- abs(origin - dest)
  # print(paste("manhattanDistance =", manhattanDistance))
  if (manhattanDistance < .1) ret <- 0 else ret <- 1
  ret
}
convertScore <- function(x) if (x>0) FALSE else TRUE
for (j in 1:ncol(df)) {
  print(paste("col =", j))
  for (i in 1:nrow(df)) {
    print(paste("df row =", i))
    for (k in 1:nrow(df2)) {
      # print(paste("df2 row (and dfclose column) =", k))
      distantScore <- closeCriterion(df[i, j], df2[k, j])
      # print(paste("df and dfdist row =", i, " df2 row (and dfclose column) =", k, " distantScore = ", distantScore))
      dfdist[i, k] <- dfdist[i, k] + distantScore
    }
  }
}
dfdist # have a look at the numerical results
dfclose <- matrix(lapply(dfdist, convertScore), ncol = nrow(df2))
I wanted to see what the process would look like at scale.
set.seed(42)
df <- matrix(rnorm(3000), ncol = 30)
set.seed(42)
df2 <- matrix(rnorm(5580), ncol = 30)
dfdist <- matrix(0, nrow = nrow(df), ncol = nrow(df2))
Then I ran the code block to see what would happen.
~ ~ ~
You might consider the problem definition. I ran the model several times, changing the criterion for closeness. If the entry in each of three dozen columns in df2 has a 90% chance of matching its correspondent in df, the row only has about a 2.2% chance of matching (0.9^36 is roughly 0.022). The example data is not such a good test case for the algorithm.
Best of luck
Here's one way to calculate that column without fuzzyjoin:
library(tidyverse)
found <-
  expand.grid(row.df = seq(nrow(df)),
              row.df2 = seq(nrow(df2))) %>%
  mutate(in.range = pmap_lgl(., ~ all(abs(df[.x, ] - df2[.y, ]) <= 0.1))) %>%
  group_by(row.df) %>%
  summarise_at('in.range', any) %>%
  select(in.range)
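If needed, the resulting logical column can be attached back to df (a sketch, assuming the pipeline above):
df$found <- found$in.range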
I have a vector of values that I want to sort in descending order, then split into bins of size 100, with the final bin containing all of the remaining values.
#generate random data
set.seed(1)
x <- rnorm(8366)
#In descending order
y <- x[order(-x)]
Now, I have used cut to bin by value before, but here I want the bins to be of a fixed size. So the first bin will have the first 100 values in y, the second bin the next hundred, and so on, until I have ten bins, with the final bin containing all of the remaining values. I am not sure how to go about this.
The below will return the bins as a list:
mylist <- split(y, c(rep(1:9, each = 100), rep(10, 8366 - 900)))
The first 9 elements contain 100 records each and the rest are stored in the 10th element.
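A quick check of the bin sizes (sketch):
sapply(mylist, length)  # nine bins of 100 values and a final bin of 7466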
I'm not sure what you mean by "bin". Do you want to summarize each 100 values in some way? For example, sum them? If so, here's one solution:
# generate random data
set.seed(1)
x <- rnorm(8366)
# pad to a multiple of 100 with zeros
n <- ceiling(length(x)/100) * 100
y <- rep(0, n)
# in descending order
y[1:length(x)] <- x[order(-x)]
# each row of X holds 100 consecutive values
X <- matrix(y, ncol = 100, byrow = TRUE)
# sum each bin of 100
apply(X, 1, sum)
You can use cut:
res <- cut(y, c(rev(y)[seq(1, 901, 100)], Inf), right = FALSE)
table(res)
# res
# [-3.67,-2.33) [-2.33,-2.05) [-2.05,-1.87) [-1.87,-1.72) [-1.72,-1.6)
# 100 100 100 100 100
# [-1.6,-1.5) [-1.5,-1.41) [-1.41,-1.34) [-1.34,-1.27) [-1.27,Inf)
# 100 100 100 100 7466