data manipulation - R - r

I am struggling with data manipulation in R. My dataset consists of variables type(5 factors), intensity(3 factors), damage(continous). I want to calculate mean damage(demage1, demage2 and damage3 separately) with respect to intensity and type. In onther words I want to summarize the average damage by type and intensity. I have created this small reproducible example of my data:
type <- sample(seq(from = 1, to = 5, by = 1), size = 50, replace = TRUE)
intensity <- sample(seq(from = 1, to = 3, by = 1), size = 50, replace = TRUE)
damage1 <- sample(seq(from = 1, to = 50, by = 1), size = 50, replace = TRUE)
damage2 <- sample(seq(from = 1, to = 200, by = 1), size = 50, replace = TRUE)
damage3 <- sample(seq(from = 1, to = 500, by = 1), size = 50, replace = TRUE)
dat <- cbind(type, intensity, damage1, damage2, damage3)
then to manipulate the data I have used the pipe operator %>% buy my commands seem not to work very well:
dat <- as.data.frame(dat)
dat %>%
filter(type == 1) %>%
group_by(intensity, damage) %>%
summarise(mean_damage = mean(Value))
I have read about multiple usefull functions here:
efficient reshaping using data tables
manipulating data tables
Do Faster Data Manipulation using These 7 R Packages
But I wasnt able to make any progress here. My question are:
What is wrong with my code?
Am I even going in the right direction here?
Is there some alternative how to do this?

Related

Difference in differences placebo test plot

How do I make graphs like this in R?
Lets say I have a dataset like this:
data <- tibble(date=sample(seq(as.Date("2006-01-01"),
as.Date("2019-01-01"), by="day"),
10000, replace = T),
treatment=sample(c(0,1),10000, replace= T),
after=ifelse(date>as.Date("2015-03-01"), 1, 0),
score=rnorm(10000)+ifelse(treatment*after==1, 0.2, 0)
)
and is doing a difference in differences analysis:
did <- lm(score~treatment+after+treatment*after, data=data)
summary(did)
How can I make a plot with placebo tests?
Just using plot_model function in sjPlot.
data <- tibble(date=sample(seq(as.Date("2006-01-01"),
as.Date("2019-01-01"), by="day"),
10000, replace = T),
treatment=sample(c(0,1),10000, replace= T),
after=ifelse(date>as.Date("2015-03-01"), 1, 0),
score=rnorm(10000)+ifelse(treatment*after==1, 0.2, 0)
)
did <- lm(score~treatment+after+treatment*after, data=data)
summary(did)
sjPlot::plot_model(did,vline = 'black',show.values = T) + ylim(-.25, .5)
vline means to add a horizontal line at x = 1;
show.values means whether values should be plotted or not.
You can check the details of argument of plot_model from here.

Creating a function to loop over data frame to create distributions of significant correlations in R

I have trouble creating a function that is too complex for my R knowledge and I'd appreciate any help.
I have a data set (DRC_epi) consisting of ~800.000 columns of epigenetic data. I'd like to randomly draw 1000 samples consisting of 500 column names each:
set.seed(42)
y <- replicate(1000, {
names(DRC_epi[, sample(ncol(DRC_epi), 500, replace = TRUE)])
})
I want to use these samples to select samples of a different data frame (DRC_epi_pheno) from which I want to create correlations with the outcome variable of my interest (phenotype_aas). So for the first sub sample it would look like this:
library(tidyverse)
library(correlation)
DRC_cor_sign_1 <- DRC_epi_pheno %>%
select(phenotype_aas, any_of(y[,1])) %>%
correlation(method = "spearman", p_adjust = "fdr") %>%
filter(Parameter1 %in% "phenotype_aas") %>%
filter(p <= 0.05) %>%
select(Parameter1, Parameter2, p)
From this result, I want to store the percentage of significant results in an object:
percentage <- data.frame()
percentage() <- length(DRC_cor_sign_1)/500*100
The question I have now is, how can I put it all together and automate it, so that I don't have to run the analyses 1000 times manually?
So that you have an idea of my data, I create here a toy data set that is similar to my real data set:
set.seed(42)
DRC_epi <- data.frame("cg1" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg2" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg3" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg4" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg5" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg6" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg7" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg8" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg9" = rnorm(n = 10, mean = 1, sd = 1.5),
"cg10" = rnorm(n = 10, mean = 1, sd = 1.5))
DRC_epi_pheno <- cbind(DRC_epi, phenotype_aas = sample(x = 0:40, size = 10, replace = TRUE))

How to facet a plot by row and columns simultaneously from a data frame using ggpubr?

I'm not an R expert, and I'm having a little troubles plotting my data
Some context: An experiment using 2 coagulants, at 3 different velocities with 3 different solids concentrations was conducted to see the behavior of 3 parameters. The experiment results was recorded in the following data frame:
Coagulant <- c(rep("Chem_1", times = 9),
rep("Chem_2", times = 9))
Velocity <- rep(c(rep("low", times = 3),
rep("mid", times = 3),
rep("high", times = 3)),
times = 2)
Solids <- rep(c(0,50,100), times = 6)
Parameter_1 <- runif(n = 18,
min = 70,
max = 100)
Parameter_2 <- runif(n = 18,
min = 90,
max = 100)
Parameter_3 <- runif(n = 18,
min = 0,
max = 100)
Ex.data.frame <- data.frame(Coagulant,
Velocity,
Solids,
Parameter_1,
Parameter_2,
Parameter_3)
Ex.data.frame$Coagulant <- factor(Ex.data.frame$Coagulant,
levels = c("Chem_1", "Chem_2"),
labels = c("Chem_1", "Chem_2"))
Ex.data.frame$Velocity <- factor(Ex.data.frame$Velocity,
levels = c("low", "mid", "high"),
labels = c("low", "mid", "high"))
Ex.data.frame$Solids <- factor(Ex.data.frame$Solids,
levels = c(0, 50, 100),
labels = c(0, 50, 100))
I'd like to plot this data following this panel separation:
I tried a lot, but only can provide this code to make the plot:
require(ggpubr)
ggbarplot(data = Ex.data.frame,
y = "Parameter_1", #I'd like to use for all parameters (same units)
fill = "Solids",
position = position_dodge2(),
facet.by = c("Coagulant", "??")) #Facet by the parameters columns from the data frame
Here is a potential way towards your goal (you just have to switch in the facet to have the same columns and lines as in your graph) :
library(dplyr)
library(magrittr)
d <-rbind(Ex.data.frame[1:4] %>% rename(Parameter = Parameter_1) %>% mutate(param = "Parameter_1"),
Ex.data.frame[c(1:3,5)] %>% rename(Parameter = Parameter_2) %>% mutate(param = "Parameter_2"),
Ex.data.frame[c(1:3,6)] %>% rename(Parameter = Parameter_3) %>% mutate(param = "Parameter_3")
)
ggbarplot(data = d,
y = c("Parameter" ),
fill = "Solids",
position = position_dodge2(),
facet.by = c( "param", "Coagulant" ))

MXNet: sequence length in LSTM in a non-sequence data (R)

My data are not timeseries, but it has sequential properties.
Consider one sample:
data1 = matrix(rnorm(10, 0, 1), nrow = 1)
label1 = rnorm(1, 0, 1)
label1 is a function of the data1, but the data matrix is not a timeseries. I suppose that label is a function of not just one data sample, but more older samples, which are naturally ordered in time (not sampled randomly), in other words, data samples are dependent with one another.
I have a batch of examples, say, 16.
With that I want to understand how I can design an RNN/LSTM model which will memorize all 16 examples from the batch to construct the internal state. I am especially confused with the seq_len parameter, which as I understand is specifically about the length of the timeseries used as an input to a network, which is not case.
Now this piece of code (taken from a timeseries example) only confuses me because I don't see how my task fits in.
rm(symbol)
symbol <- rnn.graph.unroll(seq_len = 5,
num_rnn_layer = 1,
num_hidden = 50,
input_size = NULL,
num_embed = NULL,
num_decode = 1,
masking = F,
loss_output = "linear",
dropout = 0.2,
ignore_label = -1,
cell_type = "lstm",
output_last_state = F,
config = "seq-to-one")
graph.viz(symbol, type = "graph", direction = "LR",
graph.height.px = 600, graph.width.px = 800)
train.data <- mx.io.arrayiter(
data = matrix(rnorm(100, 0, 1), ncol = 20)
, label = rnorm(20, 0, 1)
, batch.size = 20
, shuffle = F
)
Sure, you can treat them as time steps, and apply LSTM. Also check out this example: https://github.com/apache/incubator-mxnet/tree/master/example/multivariate_time_series as it might be relevant for your case.

Creating imputation list for use with svyglm

Using the survey package, I am having issues creating an imputationList that svydesign will accept. Here is a reproducible example:
library(tibble)
library(survey)
library(mitools)
# Data set 1
# Note that I am excluding the "income" variable from the "df"s and creating
# it separately so that it varies between the data sets. This simulates the
# variation with multiple imputation. Since I am using the same seed
# (i.e., 123), all the other variables will be the same, the only one that
# will vary will be "income."
set.seed(123)
df1 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Data set 2
set.seed(123)
df2 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Data set 3
set.seed(123)
df3 <- tibble(id = seq(1, 100, by = 1),
gender = as.factor(rbinom(n = 100, size = 1, prob = 0.50)),
working = as.factor(rbinom(n = 100, size = 1, prob = 0.40)),
pweight = sample(50:500, 100, replace = TRUE))
# Create list of imputed data sets
impList <- imputationList(df1,
df2,
df3)
# Apply NHIS weights
weights <- svydesign(id = ~id,
weight = ~pweight,
data = impList)
I get the following error:
Error in eval(predvars, data, env) :
numeric 'envir' arg not of length one
To get it to work, I needed to directly add imputationList to svydesign as follows:
weights <- svydesign(id = ~id,
weight = ~pweight,
data = imputationList(list(df1,
df2,
df3))
the step by step instructions available at http://asdfree.com/national-health-interview-survey-nhis.html walk through exactly how to create a multiply-imputed nhis design, and the analysis examples below that include svyglm calls. avoid using library(data.table) and library(dplyr) with library(survey)

Resources