My question builds on another one previously posted by someone: mapply for all arguments' combinations [R]
I want to apply a function to multiple arguments using mapply, and this works with my code below. But I want to add a condition so that NOT ALL tmin and tmax values are combined; instead, the first tmin is paired only with the first tmax, the second tmin with the second tmax, and so on (i.e. tmin == 0.01 goes with tmax == 0.99 and tmin == 0.05 with tmax == 0.95, but tmin == 0.01 should not be combined with tmax == 0.95).
However, each tmin/tmax pair should still be combined with ALL variables (as in the expand.grid() call below).
In the end I should get a data frame like the one called "alltogether", but with 15 rows under the described condition instead of the 75 it has now.
I could just filter rows with dplyr::filter afterwards, but is there a nice way to include this condition in the function call itself?
Here is an example data frame:
dataframe <- data.frame(personID = 1:10,
Var1 = c(4, 6, 3, 3, 7, 1, 20, NA, 12, 2),
Var2 = c(5, 4, 5, 6, 9, 14, 14, 1, 0, NA),
Var3 = c(NA, 15, 12, 0, NA, NA, 2, 7, 6, 7),
Var4 = c(0, 0, 0, 0, 1, 0, 1, 4, 2, 1),
Var5 = c(12, 15, 11, 10, 10, 15, NA, 10, 13, 11))
and here is the code I have so far:
des <- function(var, tmin, tmax){
v <- var[var >= quantile(var, probs = tmin, na.rm = TRUE) &
var <= quantile(var, probs = tmax, na.rm = TRUE)]
d <- psych::describe(v)
df <- cbind(variable = deparse(substitute(var)), tmin = tmin, tmax = tmax, d)
print(df)
}
args <- expand.grid(var = dataframe[, c("Var2", "Var4", "Var5")],
                    tmin = c(0.01, 0.05, 0.1, 0.2, 0.25),
                    tmax = c(0.99, 0.95, 0.9, 0.8, 0.75))
alltogether <- do.call("rbind", mapply(FUN = des, var = args$var,
                                       tmin = args$tmin, tmax = args$tmax,
                                       SIMPLIFY = FALSE))
Thank you for helping!
Edit:
The expected output is the one obtained after filtering the "alltogether" data frame with the following code (15 obs. of 16 variables):
alltogether <- alltogether %>%
dplyr::filter((tmin == 0.01 & tmax == 0.99) |
(tmin == 0.05 & tmax == 0.95) |
(tmin == 0.1 & tmax == 0.9) |
(tmin == 0.2 & tmax == 0.8) |
(tmin == 0.25 & tmax == 0.75))
OK, here's a solution to both problems. Unfortunately, I couldn't get one using mapply, so I had to rely on a good old for loop (it's still faster, given that it doesn't have to do all the extra calculations). I also changed the function so it reports the variable names, as you wanted. The biggest difference is that I'm not using expand.grid but merge. Finally, it incorporates your comment from above.
des <- function(var, tmin, tmax, cor.var, cor.method = c("spearman", "pearson", "kendall")){
  # mask values outside the quantile range instead of dropping them,
  # so the variable keeps its length (and its name)
  var[var < quantile(var, probs = tmin, na.rm = TRUE) |
      var > quantile(var, probs = tmax, na.rm = TRUE)] <- NA
  d <- psych::describe(var)
  correlation <- cor(cor.var, var, use = "pairwise.complete.obs",
                     method = match.arg(cor.method))
  df <- cbind(variable = names(var), tmin = tmin, tmax = tmax, d, correlation)
  names(df)[length(names(df))] <- paste0("correlation_with_", names(cor.var))
  print(df)
}
minmax <- data.frame(tmin = c(0.01, 0.05, 0.1, 0.2, 0.25),
                     tmax = c(0.99, 0.95, 0.9, 0.8, 0.75))
args <- merge(c("Var2", "Var4", "Var5"), minmax)
args[, 1] <- as.character(args[, 1])
alltogether <- NULL
for (i in 1:nrow(args)) {
  alltogether <- rbind(alltogether, des(var = dataframe[args[i, 1]],
                                        tmin = args[i, 2], tmax = args[i, 3],
                                        cor.var = dataframe["Var1"]))
}
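For completeness, the same row-wise pairing can also be written without an explicit loop. This is a sketch using Map() over the columns of the args data frame built above; since Map() iterates over its arguments in parallel, only the matching tmin/tmax pairs are evaluated:
# sketch: Map() walks the three columns of args in lockstep, so each
# row of args becomes exactly one call to des()
results <- Map(function(v, tmin, tmax) {
  des(var = dataframe[v], tmin = tmin, tmax = tmax, cor.var = dataframe["Var1"])
}, args[, 1], args[, 2], args[, 3])
alltogether <- do.call(rbind, results)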
This is my first question on Stack Overflow, so please bear with me if I make any mistakes or omit necessary information.
I have a dataset consisting of a time series where I need to find the 5-day rolling average of a binary variable for each specific hour of the day. An example of my data can be created using:
library(dplyr)
library(zoo)
set.seed(69)
df <- data.frame(Hour = rep(1:24, times = 10),
                 Reg = round(runif(24 * 10, 0, 1)),
                 HumidityLevel = runif(24 * 10, 0, 100))
df_ranges <- data.frame(
  LowerRange = rep(rollapply(df$HumidityLevel, 24, min, by = 24), each = 24),
  UpperRange = rep(rollapply(df$HumidityLevel, 24, max, by = 24), each = 24))
df <- cbind(df, df_ranges)
I have computed the simple rolling average using the following code:
df <- df %>%
group_by(Hour) %>%
mutate(AvgReg = lag(rollapplyr(Reg, 5, mean, na.rm = T, partial = T), n = 1))
What I need to do is compute the rolling average of Reg using only previous rows whose HumidityLevel lies within the range for that specific day. The lower and upper boundaries of the range are given by two columns (LowerRange, UpperRange) and correspond to the lowest and highest HumidityLevel values of the day.
For instance, a day may have levels between 20 and 54. The rolling average for Hour 1 of that day should then be computed using previous Hour 1 observations with a HumidityLevel greater than or equal to 20 and less than or equal to 54.
I hope that my question makes sense.
This is my desired output:
desired_output <- data.frame(RowNum = c(1:10),
Hour = rep(1, times = 10),
Reg = c(1,0,0,1,0,1,0,0,0,0),
HumidityLevel = c(28.36, 65.02, 1.12, 49.61, 24.50, 98.16, 77.33, 97.03, 47.03, 85.71),
LowerBoundary = c(5.67, 7.50, 1.12, 19.32, 0.01, 6.94, 7.48, 0.71, 2.85, 1.59),
UpperBoundary = c(93.60, 89.37, 97.25, 99.63, 91.92, 98.16, 98.48, 99.98, 99.70, 98.86),
AvgReg = c(NA, 1, 0.5, 0.5, 0.5, 0.5, 0.6, 0.4, 0.4, 0.2))
Using data.table, you can use between to filter and shift + frollmean to calculate:
setDT(df)[
between(HumidityLevel, LowerRange, UpperRange),
new_col := shift(
frollmean(Reg, c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))), adaptive = TRUE)
),
by = Hour
]
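The adaptive window vector is the trickiest part, so here is a minimal, self-contained sketch (assuming only data.table) of how c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))) behaves for a group of 7 rows:
library(data.table)
x <- 1:7
# window widths 1 2 3 4 5 5 5: growing "partial" means at the start,
# then a fixed 5-row trailing mean once enough history exists
n <- c(seq_len(min(5, length(x))), rep(5, max(0, length(x) - 5)))
frollmean(x, n, adaptive = TRUE)
#[1] 1.0 1.5 2.0 2.5 3.0 4.0 5.0
# shift() then lags the result by one row, so each row's average
# uses only previous observations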
Say I have a simple DAG representing a confounding variable X = Smoking, a treatment T, and an outcome Y = Death, such that:
T ~ X
Y ~ T + X
Is it possible to produce a synthetic dataset of, say, 1M observations that follows some specified conditional probabilities:
library(dplyr)  # needed below for %>% and arrange()

# Pr(smoking):
smoking <- data.frame(
smoking = c(0, 1),
proba = c(0.7, 0.3)
)
# Pr(treatment | smoking):
treatment <- expand.grid(
smoking = c(0, 1),
treatment = c(0, 1)
) %>% arrange(smoking, treatment)
treatment$proba <- c(0.8, 0.2, 0.45, 0.55)
# Pr(death | treatment, smoking):
death <- expand.grid(
treatment = c(0, 1),
smoking = c(0,1),
dead = c(0,1)
) %>%
arrange(treatment, smoking, dead)
death$proba <- c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5)
I can do this manually here because it's a very basic DAG, but I was wondering if it can be done in another, more scalable way, using something like bnlearn.
Current solution:
db <- data.frame(
smoking = rbinom(n = 1000000, size = 1, prob = 0.3)
)
db$treatment[db$smoking == 0] <- rbinom(n = sum(db$smoking == 0), size = 1, prob = 0.2)
db$treatment[db$smoking == 1] <- rbinom(n = sum(db$smoking == 1), size = 1, prob = 0.55)
db$dead[db$treatment == 0 & db$smoking == 0] <- rbinom(
n = sum(db$treatment == 0 & db$smoking == 0),
size = 1, prob = 0.1
)
db$dead[db$treatment == 0 & db$smoking == 1] <- rbinom(
n = sum(db$treatment == 0 & db$smoking == 1),
size = 1, prob = 0.8
)
db$dead[db$treatment == 1 & db$smoking == 0] <- rbinom(
n = sum(db$treatment == 1 & db$smoking == 0),
size = 1, prob = 0.11
)
db$dead[db$treatment == 1 & db$smoking == 1] <- rbinom(
n = sum(db$treatment == 1 & db$smoking == 1),
size = 1, prob = 0.5
)
It will be easier to let an existing package like bnlearn do this for you. You can use custom.fit to specify the DAG and the CPTs, and then use rbn to draw samples from it.
An example:
library(bnlearn)
# Specify DAG
net <- model2network("[treatment|smoking][smoking][death|treatment:smoking]")
graphviz.plot(net)
# Define CPTs
smoking <- matrix(c(0.7, 0.3), ncol = 2, dimnames = list(NULL, c("no", "yes")))
treatment <- matrix(c(0.8, 0.2, 0.45, 0.55), ncol = 2,
                    dimnames = list(treatment = c("no", "yes"), smoking = c("no", "yes")))
# naming the dimnames (node first, then parents) lets custom.fit match
# each dimension of the CPT to the right variable
death <- array(c(0.9, 0.1, 0.2, 0.8, 0.89, 0.11, 0.5, 0.5), dim = c(2, 2, 2),
               dimnames = list(death = c("no", "yes"), smoking = c("no", "yes"),
                               treatment = c("no", "yes")))
# Build BN
fit <- custom.fit(net, dist = list(smoking = smoking, treatment = treatment, death = death))
# Draw samples
set.seed(69395642)
samples <- rbn(fit, n=1e6)
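As a quick sanity check (a sketch, using the samples object drawn above), the empirical conditional frequencies should be close to the specified CPTs:
# empirical Pr(treatment | smoking): rows should be near (0.8, 0.2)
# for smoking = "no" and (0.45, 0.55) for smoking = "yes"
prop.table(table(samples$smoking, samples$treatment), margin = 1)
# empirical Pr(death | smoking, treatment), to compare against `death`
prop.table(table(samples$smoking, samples$treatment, samples$death),
           margin = c(1, 2))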
I have a data table that looks like this:
library(data.table)
data <- data.table(time = c(0, 1, 2, 3, 4, 5, 6, 7),
                   anom = c(0, 0, 1, 1, 1, 0, 0, 0),
                   gier = c(0, 0, 4, 9, 7, 0, 0, 0))
Now I am calculating some statistical values of the column gier, grouped by the column anom, like this:
cols <- c("gier")
statFun <- function(x) list(mean = mean(x), median = median(x), std = sd(x))
statSum <- data[, unlist(lapply(.SD, statFun), recursive = FALSE), .SDcols = cols, by = anom]
This is fine, but I want to go a step further and add the start and end points of time for each run of anom values (0 and 1). In the end I would have something like a new time series, but only with the start and end points of time. The result should look like this:
res <- data.table(x.start = c(0, 2, 5),
x.end = c(1, 4, 7),
anom = c(0, 1, 0),
gier.mean = c(0, 6.666, 0),
gier.median = c(0, 7, 0),
gier.std = c(0, 2.516, 0))
How is it possible to achieve this?
Addition: is there a way to achieve the result for multiple columns, and not only one column like gier? For example, I am able to do the following, which at least gives an extra column rn holding the names of the columns the statistical values are calculated for, but I don't know how to extend it with the mentioned start/end columns.
res <- data[, setDT(do.call(rbind.data.frame, lapply(.SD, statFun)), keep.rownames = TRUE), .SDcols = cols, by = anom]
You can include additional calculations outside lapply:
library(data.table)
data[, unlist(c(lapply(.SD, statFun),
anom = first(anom), x.start = first(time), x.end = last(time)),
recursive = FALSE), rleid(anom), .SDcols = cols]
# rleid gier.mean gier.median gier.std anom x.start x.end
#1: 1 0.000000 0 0.000000 0 0 1
#2: 2 6.666667 7 2.516611 1 2 4
#3: 3 0.000000 0 0.000000 0 5 7
In dplyr we can do this similarly:
library(dplyr)
data %>%
group_by(grp = rleid(anom)) %>%
summarise(across(all_of(cols), list(mean = mean, median = median, std = sd)),
x.start = first(time),
x.end = last(time))
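To address the addition about multiple columns: the data.table call generalises by simply extending cols, since .SDcols controls which columns statFun() is applied to. A sketch, where gier2 is a hypothetical second measurement column added purely for illustration:
data[, gier2 := gier * 2]  # hypothetical extra column
cols <- c("gier", "gier2")
data[, unlist(c(lapply(.SD, statFun),
                anom = first(anom), x.start = first(time), x.end = last(time)),
       recursive = FALSE), by = rleid(anom), .SDcols = cols]
# produces gier.mean, gier.median, gier.std, gier2.mean, ... per run of anom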
I would like to calculate the "non-NA values interval" for different columns.
Here is the dataset:
temp <- data.frame(
date = seq(as.Date("2018-01-01"), by = 'month', length.out = 12),
X1 = c(100, NA, 23, NA, NA, 12, NA, NA, NA, NA, NA, 100),
X2 = runif(12, 50, 100),
X3 = c(24, NA, NA, NA, NA, 31, 1, NA, 44, NA, 100, NA),
X4 = NA
)
For example, X1 has non-NA intervals of 1, 2, 5: from 100 to 23 there is 1 NA between the two non-NA values, from 23 to 12 there are 2 NAs, and from 12 to 100 there are 5 NAs.
The expected result is:
result <- data.frame(
X1_inv_mean = mean(c(1, 2, 5)),
X1_inv_median = median(c(1, 2, 5)),
X1_inv_sd = sd(c(1, 2, 5)),
X2_inv_mean = mean(0),
X2_inv_median = median(0),
X2_inv_sd = sd(0),
X3_inv_mean = mean(c(4, 1, 1, 1)),
X3_inv_median = median(c(4, 1, 1, 1)),
X3_inv_sd = sd(c(4, 1, 1, 1)),
X4_inv_mean = NA,
X4_inv_median = NA,
X4_inv_sd = NA
)
> result
X1_inv_mean X1_inv_median X1_inv_sd X2_inv_mean X2_inv_median X2_inv_sd X3_inv_mean X3_inv_median X3_inv_sd
1 2.666667 2 2.081666 0 0 NA 1.75 1 1.5
X4_inv_mean X4_inv_median X4_inv_sd
1 NA NA NA
Thanks for the help!
A base R option:
out <- lapply(temp[-1], function(x) {
  if (all(is.na(x))) {
    tmp <- NA
  } else {
    tmp <- with(rle(is.na(x)), lengths[values])
    c(mean = mean(tmp),
      median = median(tmp),
      sd = sd(tmp))
  }
})
as.data.frame(out)
# X1 X2 X3 X4
#mean 2.666667 NaN 1.75 NA
#median 2.000000 NA 1.00 NA
#sd 2.081666 NA 1.50 NA
Using rle, the following line gives you the lengths of the NA runs for each column:
tmp <- with(rle(is.na(x)), lengths[values])
E.g. for column X1
with(rle(is.na(temp$X1)), lengths[values])
#[1] 1 2 5
Then we calculate your summary statistics for each tmp. If all values in a column are NA, the function returns NA.
Update:
For variable n columns:
command <- ""
summaryString <- ""
for (i in colnames(temp)) {
  if (i != "date") {
    summaryString <- paste(summaryString, i, "_inv_mean = mean(", i, ", na.rm = T),", sep = "")
    summaryString <- paste(summaryString, i, "_inv_median = median(", i, ", na.rm = T),", sep = "")
    summaryString <- paste(summaryString, i, "_inv_sd = sd(", i, ", na.rm = T),", sep = "")
  }
}
command <- paste("output <- temp %>% summarise(",
                 substr(summaryString, 1, nchar(summaryString) - 1), ")", sep = "")
eval(parse(text = command))
Using dplyr:
library(dplyr)
output <- temp %>%
  summarise(x1_inv_mean = mean(X1, na.rm = T),
            x1_inv_median = median(X1, na.rm = T),
            x1_inv_sd = sd(X1, na.rm = T),
            x2_inv_mean = mean(X2, na.rm = T),
            x2_inv_median = median(X2, na.rm = T),
            x2_inv_sd = sd(X2, na.rm = T),
            x3_inv_mean = mean(X3, na.rm = T),
            x3_inv_median = median(X3, na.rm = T),
            x3_inv_sd = sd(X3, na.rm = T),
            x4_inv_mean = mean(X4, na.rm = T),
            x4_inv_median = median(X4, na.rm = T),
            x4_inv_sd = sd(X4, na.rm = T))
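A tidier route than building the summarise() call as a string is across() (a sketch, assuming dplyr >= 1.0; it computes the same raw-value summaries as above for however many columns temp has):
library(dplyr)
output <- temp %>%
  summarise(across(-date,
                   list(inv_mean = ~ mean(.x, na.rm = TRUE),
                        inv_median = ~ median(.x, na.rm = TRUE),
                        inv_sd = ~ sd(.x, na.rm = TRUE))))
# column names follow the "{column}_{function}" pattern, e.g. X1_inv_mean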
When I use the code from this example:
library(deepnet)
Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
x <- matrix(c(Var1, Var2), nrow = 100, ncol = 2)
y <- c(rep(1, 50), rep(0, 50))
nn <- dbn.dnn.train(x, y, hidden = c(5))
it works. But when I use this code:
Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
x <- matrix(c(Var1, Var2), nrow = 100, ncol = 2)
y <- c(rep("1", 50), rep("0", 50))
nn <- dbn.dnn.train(x, y, hidden = c(5))
I receive the error:
Error in batch_y - nn$post[[i]] : non-numeric argument to binary operator
How can I use the deepnet package for a classification problem?
y1 <- c(rep("1", 50), rep("0", 50))
gives you a character vector, which is not acceptable to the package; that is why you get the error.
class(y1)
#[1] "character"
The right y should be numeric, as follows:
y <- c(rep(1, 50), rep(0, 50))
class(y)
#[1] "numeric"
If you look inside your y, you can see that it contains 1s and 0s, binary values suitable for classification:
> table(y)
#y
# 0 1
#50 50
If you want to train the model as described in the manual, you can do the following to train it and predict on a test set:
Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
x <- matrix(c(Var1, Var2), nrow = 100, ncol = 2)
y <- c(rep(1, 50), rep(0, 50))
If you now look at your x and y with str (simply run str(x) or str(y)), you can see that they are numeric (to make sure, you can also check them with class(x) and class(y)).
Once you have your x and y, you can build your model:
dnn <- dbn.dnn.train(x, y, hidden = c(5, 5))
If you have a test set to predict, you can predict it, for example, as mentioned in the manual:
test_Var1 <- c(rnorm(50, 1, 0.5), rnorm(50, -0.6, 0.2))
test_Var2 <- c(rnorm(50, -0.8, 0.2), rnorm(50, 2, 1))
test_x <- matrix(c(test_Var1, test_Var2), nrow = 100, ncol = 2)
nn.test(dnn, test_x, y)
#[1] 0.25
Again, your test_x must be numeric. If your problem is that you have the values as characters, you can convert them to numeric with as.numeric().
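For instance (a minimal sketch; y_chr is a stand-in for your character labels):
# converting character labels to the numeric 0/1 vector the package expects
y_chr <- c(rep("1", 50), rep("0", 50))
y <- as.numeric(y_chr)
class(y)
#[1] "numeric"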