Weighted random sampling for Monte Carlo simulation in R

I would like to run a Monte Carlo simulation. I have a data.frame where rows are unique IDs which have a probability of association with one of the columns. The data entered into the columns can be treated as the weights for that probability. I want to randomly sample each row in the data.frame based on the weights listed for each row. Each row should only return one value per run. The data.frame structure looks like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 0.5, 0.5, 0
X33, 0.25, 0.25, 0.25, 0.25, 0
X55, 0, 0, 0, 0, 1
X77, 0.5, 0, 0, 0, 0.5
For weighting, "X11" should return either X2002 or X2003; "X33" should return X2000, X2001, X2002, or X2003 with equal probability, and have no chance of returning X2004. The only possible return for "X55" should be X2004.
The output data I am interested in are the IDs and the column that was sampled for that run, although it would probably be simpler to return something like this:
ID, X2000, X2001, X2002, X2003, X2004
X11, 0, 0, 1, 0, 0
X33, 1, 0, 0, 0, 0
X55, 0, 0, 0, 0, 1
X77, 1, 0, 0, 0, 0

Your data.frame is effectively transposed for this task: sample() takes a probability vector, but your probabilities run along rows, which makes them awkward to extract from a column-oriented data.frame.
To get around this, you can import your ID column as a row name, which keeps it available during an apply() call. Note that apply() coerces the data.frame to a matrix, which allows only one data type. That's why the IDs need to be row names - otherwise the whole matrix would be character, and the probability vectors would no longer be numeric.
mc_df <- read.table(
  text = 'ID X2000 X2001 X2002 X2003 X2004
X11 0 0 0.5 0.5 0
X33 0.25 0.25 0.25 0.25 0
X55 0 0 0 0 1
X77 0.5 0 0 0 0.5'
  , header = T
  , row.names = 1)
From there, you can use apply():
apply(mc_df, 1, function(x) sample(names(x), size = 200, replace = T, prob = x))
Or you could make it fancy:
apply(mc_df, 1, function(x) table(sample(names(x), size = 200, replace = T, prob = x)))
$X11
X2002 X2003
  102    98

$X33
X2000 X2001 X2002 X2003
   54    47    64    35

$X55
X2004
  200

$X77
X2000 X2004
  103    97
Fancier - sampling a factor keeps every year as a level, so each table() has the same categories and apply() can bind the counts into one matrix:
apply(mc_df, 1, function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
      X11 X33 X55 X77
X2000   0  51   0  99
X2001   0  50   0   0
X2002  91  57   0   0
X2003 109  42   0   0
X2004   0   0 200 101
Or fanciest:
prop.table(apply(mc_df
                 , 1
                 , function(x) table(sample(as.factor(names(x)), size = 200, replace = T, prob = x)))
           , 2)
       X11   X33 X55   X77
X2000 0.00 0.270   0 0.515
X2001 0.00 0.235   0 0.000
X2002 0.51 0.320   0 0.000
X2003 0.49 0.175   0 0.000
X2004 0.00 0.000   1 0.485
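And if you want exactly the indicator output shown in the question - one draw per ID per run, with a 1 in the sampled year's column - here is a minimal sketch building on mc_df above (the set.seed() is only there so the run is reproducible):
set.seed(42)  # reproducibility only; any seed works
one_hot <- t(apply(mc_df, 1, function(x) {
  pick <- sample(seq_along(x), size = 1, prob = x)  # one weighted draw per row
  replace(integer(length(x)), pick, 1L)             # mark the sampled column with a 1
}))
colnames(one_hot) <- names(mc_df)
one_hot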

Combine redundant row items in R

I have a dataset with the names of many different plant species (column MTmatch), some of which appear repeatedly. Each of these has a column (ReadSum) with a sum associated with it (as well as many other pieces of information). How do I combine/aggregate all of the redundant plant species and sum the associated ReadSum for each, while leaving the non-redundant rows alone?
I would like to take a dataset like this, and either have it transformed so that each sample has the aggregate of the combined rows, or at least an additional column showing the sum of the ReadSum column for the combined redundant species. Sorry if this is confusing, I'm not sure how to ask this question.
I have been messing about with dplyr, using group_by() and summarise(), but that seems to be summarizing across the whole column rather than just the new group.
> dput(samp5[1:9])
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Do either of these approaches produce your intended outcome?
Data:
df <- structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432,
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389,
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677,
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955,
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888,
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855,
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835,
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936,
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534,
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus",
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA,
-9L), class = "data.frame")
Create a new column "combined_ReadSum" (2nd col) which is the sum of "ReadSum" for each "SPECIES":
library(dplyr)
df %>%
  group_by(SPECIES) %>%
  summarise(combined_ReadSum = sum(ReadSum)) %>%
  left_join(df, by = "SPECIES")
#> # A tibble: 9 × 10
#> SPECIES combi…¹ ESVID S0264…² S0264…³ S0264…⁴ S0264…⁵ S0264…⁶ MAX ReadSum
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 3.56 ESV_… 0.222 0.0226 0.361 0.221 0.0261 0.401 3.55
#> 2 "Abies " 3.56 ESV_… 0 0 0 0 0 0.00993 0.0121
#> 3 "Acer" 0.203 ESV_… 0 0 0 0 0 0.124 0.203
#> 4 "Alnus" 0.0211 ESV_… 0 0 0.00815 0 0 0.00815 0.0211
#> 5 "Berber… 0.00982 ESV_… 0 0 0 0 0 0.00982 0.00982
#> 6 "Betula… 0.128 ESV_… 0 0 0 0 0 0.0865 0.128
#> 7 "Boykin… 0.978 ESV_… 0.0291 0.0164 0.0151 0.00515 0 0.155 0.860
#> 8 "Boykin… 0.978 ESV_… 0 0.00126 0 0 0 0.0156 0.0682
#> 9 "Boykin… 0.978 ESV_… 0 0 0.00115 0 0 0.00834 0.0503
#> # … with abbreviated variable names ¹​combined_ReadSum, ²​S026401.R1,
#> # ³​S026404.R1, ⁴​S026406.R1, ⁵​S026409.R1, ⁶​S026412.R1
Or, summarise columns by summing the values for each unique species:
library(dplyr)
df %>%
  group_by(SPECIES) %>%
  summarise(across(where(is.numeric), sum))
#> # A tibble: 6 × 8
#> SPECIES S026401.R1 S026404.R1 S026406.R1 S026409.R1 S0264…¹ MAX ReadSum
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 "Abies " 0.222 0.0226 0.361 0.221 0.0261 0.411 3.56
#> 2 "Acer" 0 0 0 0 0 0.124 0.203
#> 3 "Alnus" 0 0 0.00815 0 0 0.00815 0.0211
#> 4 "Berberis" 0 0 0 0 0 0.00982 0.00982
#> 5 "Betula " 0 0 0 0 0 0.0865 0.128
#> 6 "Boykinia" 0.0291 0.0176 0.0163 0.00515 0 0.179 0.978
#> # … with abbreviated variable name ¹​S026412.R1
Created on 2022-10-28 by the reprex package (v2.0.1)
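A third option, if you only want the extra column while keeping every row and column in place (no join needed), is a grouped mutate() - a sketch under the same dplyr (>= 1.0) assumptions:
library(dplyr)
df %>%
  group_by(SPECIES) %>%
  mutate(combined_ReadSum = sum(ReadSum), .after = ESVID) %>%  # .after places it as the 2nd column
  ungroup()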

How to convert long form to wide form based on category in R

I have the following data.
   name    x1    x2   x3  x4
1  V1_3  1     0     999 999
2  V2_3  1.12  0.044 25.4   0
3  V3_3  0.917 0.045 20.4   0
4  V1_15 1     0     999 999
5  V2_15 1.07  0.036 29.8   0
6  V3_15 0.867 0.039 22.5   0
7  V1_25 1     0     999 999
8  V2_25 1.07  0.034 31.1   0
9  V3_25 0.917 0.037 24.6   0
10 V1_35 1     0     999 999
11 V2_35 1.05  0.034 31.2   0
12 V3_35 0.994 0.037 26.6   0
13 V1_47 1     0     999 999
14 V2_47 1.03  0.031 33.6   0
15 V3_47 0.937 0.034 27.4   0
16 V1_57 1     0     999 999
17 V2_57 1.13  0.036 31.9   0
18 V3_57 1.03  0.037 28.1   0
I want to convert this data into the following form. Can someone give me some suggestions, please?
name est_3 est_15 est_25 est_35 est_47 est_57
1 V2 1.12 1.07 1.07 1.05 1.03 1.13
2 V3 0.917 0.867 0.917 0.994 0.937 1.03
Here is one approach for you. Your data is called mydf here. First, choose the necessary columns (name and x1) using select(). Then subset the rows using filter(), keeping only the rows whose name begins with V2 or V3; grepl() checks whether each string matches the pattern. Then split the name column into two columns (name and est) using separate(). Finally, convert the data to a wide format using pivot_wider().
library(dplyr)
library(tidyr)
select(mydf, name:x1) %>%
  filter(grepl(x = name, pattern = "^V[23]")) %>%
  separate(col = name, into = c("name", "est"), sep = "_") %>%
  pivot_wider(names_from = "est", values_from = "x1", names_prefix = "est_")
# name est_3 est_15 est_25 est_35 est_47 est_57
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 V2 1.12 1.07 1.07 1.05 1.03 1.13
#2 V3 0.917 0.867 0.917 0.994 0.937 1.03
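If you prefer data.table, a sketch of the same reshape (using the mydf from the DATA block below; tstrsplit() does the split and dcast() does the widening):
library(data.table)
dt <- as.data.table(mydf)[grepl("^V[23]", name)]
dt[, c("name", "est") := tstrsplit(name, "_", fixed = TRUE)]
# a factor keeps the est_3, est_15, ... columns in their original order
dt[, est := factor(paste0("est_", est), levels = unique(paste0("est_", est)))]
dcast(dt, name ~ est, value.var = "x1")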
For your reference, when you ask questions, you want to provide minimal sample data and code. If you do that, SO users can help you out much more easily - see the guidance on how to make a minimal reproducible example.
DATA
mydf <- structure(list(name = c("V1_3", "V2_3", "V3_3", "V1_15", "V2_15",
"V3_15", "V1_25", "V2_25", "V3_25", "V1_35", "V2_35", "V3_35",
"V1_47", "V2_47", "V3_47", "V1_57", "V2_57", "V3_57"), x1 = c(1,
1.122, 0.917, 1, 1.069, 0.867, 1, 1.066, 0.917, 1, 1.048, 0.994,
1, 1.03, 0.937, 1, 1.133, 1.032), x2 = c(0, 0.044, 0.045, 0,
0.036, 0.039, 0, 0.034, 0.037, 0, 0.034, 0.037, 0, 0.031, 0.034,
0, 0.036, 0.037), x3 = c(999, 25.446, 20.385, 999, 29.751, 22.478,
999, 31.134, 24.565, 999, 31.18, 26.587, 999, 33.637, 27.405,
999, 31.883, 28.081), x4 = c(999, 0, 0, 999, 0, 0, 999, 0, 0,
999, 0, 0, 999, 0, 0, 999, 0, 0)), row.names = c(NA, -18L), class = c("tbl_df",
"tbl", "data.frame"))

Search rows looking for 2 conditions (OR)

I have the following table, with ordered variables:
table <- data.frame(Ident = c("Id_01", "Id_02", "Id_03", "Id_04", "Id_05", "Id_06"),
                    X01 = c(NA, 18, 0, 14, 0, NA),
                    X02 = c(0, 16, 0, 17, 0, 53),
                    X03 = c(NA, 15, 20, 30, 0, 72),
                    X04 = c(0, 17, 0, 19, 0, NA),
                    X05 = c(NA, 29, 21, 23, 0, 73),
                    X06 = c(0, 36, 22, 19, 0, 55))
Ident X01 X02 X03 X04 X05 X06
Id_01  NA   0  NA   0  NA   0
Id_02  18  16  15  17  29  36
Id_03   0   0  20   0  21  22
Id_04  14  17  30  19  23  19
Id_05   0   0   0   0   0   0
Id_06  NA  53  72  NA  73  55
From a previous question, I have the following code, provided by a user here, which searches each row for one condition (1st and 2nd position > 0) and returns the position of the occurrence (the name of the variable at that position):
apply(table[-1], 1, function(x) {
  i1 <- x > 0 & !is.na(x)
  names(x)[which(i1[-1] & i1[-length(i1)])[1]]
})
I'm looking to add a second condition to the apply code, so the conditions need to be:
1st and 2nd occurrence (consecutive) > 0
OR
1st and 3rd occurrence > 0
Considering this change, the output of the evaluation for the table posted before should be:
For Id_01: never occurs (NA?)
For Id_02: 1st position (X01)
For Id_03: 3rd position (X03)
For Id_04: 1st position (X01)
For Id_05: never occurs (NA?)
For Id_06: 2nd position (X02)
Thanks in advance!
We can use lead from dplyr to pair each value with the values one and two positions ahead:
library(dplyr)
f1 <- function(x) {
  i1 <- x > 0 & !is.na(x)
  # hit when this value and the next are both > 0 (consecutive),
  # or when this value and the one two positions ahead are both > 0
  which((i1 & lead(i1, n = 1, default = FALSE)) |
          (i1 & lead(i1, n = 2, default = FALSE)))[1]
}
n1 <- apply(table[-1], 1, f1)
names(table)[-1][n1]
#[1] NA    "X01" "X03" "X01" NA    "X02"
Or use pmap from purrr:
library(purrr)
n1 <- pmap_int(table[-1], ~ c(...) %>% f1)
names(table)[-1][n1]
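The same check also works in base R, shifting the logical vector by hand - a sketch on the same table:
f2 <- function(x) {
  i1 <- x > 0 & !is.na(x)
  nxt1 <- c(i1[-1], FALSE)             # value one position ahead
  nxt2 <- c(i1[-(1:2)], FALSE, FALSE)  # value two positions ahead
  which((i1 & nxt1) | (i1 & nxt2))[1]
}
names(table)[-1][apply(table[-1], 1, f2)]
#[1] NA    "X01" "X03" "X01" NA    "X02"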

Why does Keras perform poorly on this simple toy dataset?

Here I've created a toy dataset by randomly sampling from two Bernoulli distributions whose success probabilities are given by the logistic functions
1 / (1 + exp(-0.2 * (x - 20)))
-1 / (1 + exp(-0.2 * (x - 80)))
My hope was that I could train a keras NNet with a 2-node hidden layer and a softmax activation function that would learn these two logistic functions, but the resulting model predicts probability of 1 for every x value.
library(keras)
train <- data.frame(
x = c(4.44, 8.25, 15.72, 17.53, 17.53, 17.86, 18.57, 20.22, 20.24, 20.57, 21.99, 25.06, 28.3, 31.1, 35.91, 37.29, 38.36, 39.58,
39.78, 40.1, 47.29, 51.67, 51.74, 53.52, 57.45, 62.69, 63.03, 69.03, 70.11, 74.44, 76.4, 79.81, 86.92, 87.59, 89.88),
y = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)
head(train, 10)
x y
1 4.44 0
2 8.25 0
3 15.72 0
4 17.53 0
5 17.53 0
6 17.86 0
7 18.57 0
8 20.22 0
9 20.24 1
10 20.57 1
# Build and fit model
model <- keras_model_sequential()
model <- layer_dense(object = model, input_shape = 1L, use_bias = TRUE, units = 2L, activation = 'sigmoid')
model <- layer_dense(object = model, units = 1L, activation = 'softmax', input_shape = 2L)
model <- compile(object = model, loss = 'binary_crossentropy', optimizer = 'sgd', metrics = c('accuracy'))
fit(object = model, x = train$x, y = train$y, epochs = 30)
# Evaluate
predict_proba(object = model, x = train$x)[, 1]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Why does Keras do such a poor job of fitting to the training data?
Keras is not doing a poor job - it is doing exactly the job you told it to do in your network architecture :)
You are using a softmax activation at the output with only one output neuron, meaning the softmax will always output 1.0, since the output is normalized across neurons. Don't do that; use at least two output neurons so the normalization can work correctly.
As you use binary cross-entropy loss, a better choice of activation would be sigmoid at the output, which will work with a single output neuron.
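A minimal sketch of that fix on the same train data (keeping the question's optimizer and epoch count; predict() is used here since predict_proba() is deprecated in recent keras versions):
library(keras)
model <- keras_model_sequential()
model <- layer_dense(object = model, input_shape = 1L, units = 2L, activation = 'sigmoid')
model <- layer_dense(object = model, units = 1L, activation = 'sigmoid')  # sigmoid, not softmax
model <- compile(object = model, loss = 'binary_crossentropy', optimizer = 'sgd',
                 metrics = c('accuracy'))
fit(object = model, x = train$x, y = train$y, epochs = 30)
predict(object = model, x = train$x)[, 1]  # no longer a constant 1.0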

Row by row application in R [duplicate]

I have my data in the form of a data.table given below
structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA,
NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94,
57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp",
"len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000320788>)
I need to form 4 new columns: csi_begin, csi_end, IRQ, and csi_order. When atp = 1, the values of csi_begin and csi_end depend directly on the inv and GU values.
But when atp is not equal to 1, csi_begin and csi_end depend on the inv and GU values and on the IRQ value of the previous row.
The value of IRQ depends on the csi_order of that row if atp == 1 and is 0 otherwise, and the csi_order value depends on the csi_begin value from two rows previous.
I have written the conditions with the help of a for loop. Below is the code:
lostsales <- function(transit) {
  for (i in seq_len(nrow(transit))) {
    if (transit$atp[i] == 1) {
      transit$csi_begin[i] <- transit$inv[i]
      transit$csi_end[i]   <- transit$csi_begin[i] - transit$GU[i]
    } else {
      transit$csi_begin[i] <- transit$inv[i] + transit$IRQ[i - 1]
      transit$csi_end[i]   <- transit$csi_begin[i] - transit$GU[i]
    }
    if (i > 2 && !is.na(transit$csi_begin[i - 2])) {
      transit$csi_order[i] <- transit$csi_begin[i - 2]
    } else {
      transit$csi_order[i] <- 0
    }
    if (transit$atp[i] == 1) {
      transit$IRQ[i] <- transit$csi_order[i] - transit$RUTL[i]
    } else {
      transit$IRQ[i] <- 0
    }
  }
  transit
}
Can anyone help me do this looping efficiently with data.tables, using setkey? My data set is very large, so I cannot use a for loop - the run time would be far too high.
Adding the desired outcome to your example would be very helpful, as I'm having trouble following the if/then logic. But I took a stab at it anyway:
library(data.table)
# Example data:
dt <- structure(list(atp = c(1, 0, 1, 0, 0, 1), len = c(2, NA, 3, NA, NA, 1), inv = c(593, 823, 668, 640, 593, 745), GU = c(36, 94, 57, 105, 48, 67), RUTL = c(100, NA, 173, NA, NA, 7)), .Names = c("atp", "len", "inv", "GU", "RUTL"), row.names = c(NA, -6L), class = c("data.table", "data.frame"), .internal.selfref = "<pointer: 0x0000000000320788>")
# Add a row number:
dt[,rn:=.I]
# Use this function to get the value from a previous (shiftLen is negative) or future (shiftLen is positive) row:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
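# Quick sanity check of rowShift() on a toy vector (illustration only):
#   rowShift(1:5, -1)  =>  NA  1  2  3  4   (previous row's values)
#   rowShift(1:5,  2)  =>   3  4  5 NA NA   (values two rows ahead)
# (Newer data.table versions also provide a built-in shift() for this.)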
# My attempt to follow the seemingly circular if/then rules:
lostsales2 <- function(transit) {
# If atp==1, set csi_begin to inv and csi_end to csi_begin - GU:
transit[atp==1, `:=`(csi_begin=inv, csi_end=inv-GU)]
# Set csi_order to the value of csi_begin from two rows prior:
transit[, csi_order:=rowShift(csi_begin,-2)]
# Set csi_order to 0 if csi_begin from two rows prior was NA
transit[is.na(csi_order), csi_order:=0]
# Initialize IRQ to 0
transit[, IRQ:=0]
# If ATP==1, set IRQ to csi_order - RUTL
transit[atp==1, IRQ:=csi_order-RUTL]
# If ATP!=1, set csi_begin to inv + IRQ value from previous row, and csi_end to csi_begin - GU
transit[atp!=1, `:=`(csi_begin=inv+rowShift(IRQ,-1), csi_end=inv+rowShift(IRQ,-1)-GU)]
return(transit)
}
lostsales2(dt)
## atp len inv GU RUTL rn csi_begin csi_end csi_order IRQ
## 1: 1 2 593 36 100 1 593 557 0 -100
## 2: 0 NA 823 94 NA 2 NA NA 0 0
## 3: 1 3 668 57 173 3 668 611 593 420
## 4: 0 NA 640 105 NA 4 640 535 0 0
## 5: 0 NA 593 48 NA 5 593 545 668 0
## 6: 1 1 745 67 7 6 745 678 640 633
Is this output close to what you were expecting?
