I am trying to compute the correlations between two data sets with corr.test and extract several statistics for each pair.
I have two data sets, df1 and df2:
df1<- structure(list(col1A = c(1.64, 0.03, 0, 4.202, 2.981, 0.055,
0, 0.002, 0.005, 0, 0.002, 0.649, 2.55, 2.762, 6.402), col2A = c(2.635,
0.019, 0, 5.638, 3.542, 0.793, 0.259, 0, 0.046, 0.004, 0.017,
0.971, 3.81, 3.104, 5.849), col3A = c(0.91, 0.037, 0, 5.757,
3.916, 0.022, 0, 0, 0.003, 0, 0.262, 0.136, 2.874, 3.466, 5.003
), col4A = c(1.027, 0.021, 0, 4.697, 2.832, 0.038, 0.032, 0.001,
0.003, 0, 0, 0.317, 2.743, 3.187, 6.455)), class = "data.frame", row.names = c(NA,
-15L))
The second data set looks like this:
df2<-structure(list(col1 = c(2.172, 0, 0, 4.353, 4.581, 0.001, 0.027,
0, 0.002, 0, 0, 0.087, 2.129, 4.317, 5.849), col2 = c(2.093,
0, 0, 4.235, 3.166, 0, 0, 0.006, 0.01, 0, 0, 0.475, 0, 2.62,
5.364), col3 = c(3.322, 0, 0, 4.332, 4.018, 0.049, 0.169, 0.004,
0.02, 0, 0.032, 1.354, 2.944, 4.323, 5.44), col4 = c(0.928, 0.018,
0, 3.943, 3.723, 0.02, 0, 0, 0, 0, 0.075, 0.136, 3.982, 3.875,
5.83)), row.names = c("A", "AA", "AAA", "Aab", "buy", "yuyn",
"gff", "fgd", "kil", "lilk", "hhk", "lolo", "fdd", "vgfh", "nghg"
), class = "data.frame")
I want to obtain all possible correlations between the two and extract all p-values and adjusted p-values. I use:
library(psych)
corr.test(df1, df2, use = "pairwise", method = "pearson", adjust = "holm", alpha = .05, ci = TRUE, minlength = 5)
It does not give me any p-values, and I cannot control any sort of permutation scheme to calculate the adjusted p-values. I was thinking of using the following:
x <- df1[, 1]
y <- df2[, 2]
corr_init <- cor(x, y)   # observed correlation
N <- 1000                # number of permutations
count <- 0               # permuted correlations greater than corr_init
for (i in 1:N) {
  y_perm <- sample(y)    # random permutation of y
  if (cor(y_perm, x) > corr_init) count <- count + 1
}
p <- count / N           # one-sided permutation p-value
But then I have to do it one pair at a time, and I still need to extract each column and test it separately. Is there a better way to calculate all correlations between the two data sets and get the R values, p-values, and adjusted p-values with a specific number of randomizations?
It could be done using the Hmisc package:
library(Hmisc)
df1_cor_matrix <- rcorr(as.matrix(df1), type = "pearson")
df2_cor_matrix <- rcorr(as.matrix(df2), type = "pearson")
You can then extract the coefficients using the following:
df1_coef <- df1_cor_matrix$r
df2_coef <- df2_cor_matrix$r
You can extract the p-values using the following:
df1_p_values <- df1_cor_matrix$P
df2_p_values <- df2_cor_matrix$P
You could get the adjusted p-values using the rcorr.adjust function (from the RcmdrMisc package); note that it takes the raw data rather than the rcorr output:
rcorr.adjust(as.matrix(df1), type = "pearson")
rcorr.adjust(as.matrix(df2), type = "pearson")
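Since the question asks for the correlations between df1 and df2 (not within each), note that rcorr() also accepts a second matrix; here is a minimal sketch of pulling out the cross-dataset block and Holm-adjusting those p-values:
library(Hmisc)

# rcorr(x, y) covers all columns of both matrices,
# so we take the df1-by-df2 block for the cross-dataset values
cross <- rcorr(as.matrix(df1), as.matrix(df2), type = "pearson")

r_cross <- cross$r[colnames(df1), colnames(df2)]  # correlation coefficients
p_cross <- cross$P[colnames(df1), colnames(df2)]  # unadjusted p-values

# Holm-adjust across all df1-df2 tests, then reshape back into a matrix
p_adj <- matrix(p.adjust(p_cross, method = "holm"),
                nrow = nrow(p_cross), dimnames = dimnames(p_cross))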
I can build very basic plots in R, but I'm trying to make my heatmap look more professional and I'm not sure how to do it. I have a data frame df of 11 observations of 11 variables:
> dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10+"), class = "data.frame")
So I transform df into a matrix to get a heatmap:
heatmap(data.matrix(df), Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
This is what the plot looks like:
I'm not sure how to:
Change the location of the labels for the rows and columns. I want them both to start from 0 in the top-left corner, and both row and column to continue ascending until 10+.
I'd also like more granularity in the colour. Right now you can't even tell the difference in values by looking at the colour.
Is heatmap from base R even the right tool for this? I looked up a few examples and I wasn't sure if there's a better library to achieve what I want.
There are several packages that offer heatmap functionality. IMO base heatmap and gplots::heatmap.2 have not aged well and are no longer the best options. Three good possibilities are ggplot2::geom_tile, pheatmap, and ComplexHeatmap.
Example data
Let's assume we have a matrix:
set.seed(1)  # for reproducibility
dta <- matrix(rnorm(25), nrow = 5)
rownames(dta) <- letters[1:5]
colnames(dta) <- LETTERS[1:5]
ggplot2::geom_tile
The ggplot2 version requires your data to be a tidy data frame, so we can reshape the matrix with tidyr::pivot_longer().
library(ggplot2)
library(tidyr)
library(dplyr)

dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level))
pheatmap
The pheatmap package is quite good at generating modern heatmaps. It takes a matrix as input. It can cluster the rows and columns and make a dendrogram, which is often a desired feature. It can also scale rows and columns (effectively plotting a Z-score).
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE)
Note that the positions of rows and columns are not the same as with ggplot. Also look at the options, which allow some useful customization, for example if our rows have classes defined elsewhere:
ann_df <- data.frame(row.names = rownames(dta),
classification = rep(c("first", "second"), times = c(2,3)))
pheatmap::pheatmap(dta,
scale = "none",
cluster_rows = FALSE,
cluster_cols = FALSE,
annotation_row = ann_df,
gaps_row = c(2))
Color scale
One of the big aspects that make your heatmap look professional is the color scale. On ggplot, you should check out scale_fill_gradient2().
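For example, here is a sketch adding a diverging scale, centered at 0 to suit the rnorm example data, to the geom_tile plot above:
dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level)) +
  scale_fill_gradient2(low = "navy", mid = "white", high = "firebrick",
                       midpoint = 0)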
On pheatmap, you can try these settings for color as a starting point (see the documentation of these functions; each palette call needs an integer number of colors, here 100):
color = scales::div_gradient_pal(low = "navy",
                                 mid = "green",
                                 high = "yellow")(seq(0, 1,
                                                      length.out = 100)),
color = colorRampPalette(RColorBrewer::brewer.pal(n = 9,
                                                  name = "Blues"))(100),
color = viridisLite::plasma(100),
ComplexHeatmap
Finally, a package that has gained popularity recently is ComplexHeatmap. It is based on pheatmap but offers many additional options. See the link in zx8754's comment for a detailed book full of examples. A minimal call is sketched below.
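A sketch, assuming the example matrix dta from above; Heatmap() is the package's main function, and name sets the legend title:
library(ComplexHeatmap)

Heatmap(dta,
        name = "level",           # legend title
        cluster_rows = FALSE,     # keep original row order
        cluster_columns = FALSE)  # keep original column order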
I have a dataframe that I want to plot a heatmap of:
dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10 or more` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5",
"6", "7", "8", "9", "10 or more"), class = "data.frame")
To plot the heatmap using ggplot2, this is how I approach the solution:
library(tidyverse)

df %>%
  as_tibble(rownames = "homeScore") %>%
  pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = homeScore, y = awayScore, fill = level))
The problem I face is that the columns and rows are being sorted as (0, 1, 10 or more, 2, ...) instead of (0, 1, 2, ..., 10 or more).
How do I sort the values so that 10 or more is the last level for both row and column, instead of the third?
As #Nate already mentioned, you have to convert your variables to factors and put the levels in the right order. Instead of converting via factor(as.numeric(.)), which turns "10 or more" into NA, I would recommend forcats::fct_relevel, which allows you to change the order of the levels. For example, forcats::fct_relevel(homeScore, "10 or more", after = 10) makes 10 or more the last level. Try this:
library(ggplot2)
library(tidyr)
library(dplyr)
library(forcats)
df %>%
as_tibble(rownames = "homeScore") %>%
pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
mutate_at(vars(homeScore, awayScore), ~forcats::fct_relevel(.x, "10 or more", after = 10)) %>%
ggplot() +
geom_tile(aes(x=homeScore, y=awayScore, fill = level))
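An alternative sketch, with the same libraries loaded, if you prefer to spell out the full level order explicitly with base factor() (this assumes the levels are exactly "0" through "9" plus "10 or more"):
df %>%
  as_tibble(rownames = "homeScore") %>%
  pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
  mutate(across(c(homeScore, awayScore),
                ~factor(.x, levels = c(as.character(0:9), "10 or more")))) %>%
  ggplot() +
  geom_tile(aes(x = homeScore, y = awayScore, fill = level))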
I would like to ask a data manipulation question in R.
all_matrix<-structure(list(V1 = c(0.012, 0, 0, 0, 0.037, 0, 0, 0, 0.007, 0, 0.104, 0.149, 0.164, 0.258, 3.986, 0, 0.002, 0, 0, 0), V2 = c(0, 0.07, 0, 0, 0.017, 0, 0, 0, 0.025, 0, 2.322, 0.327, 0.134, 1.035, 2.732, 0.01, 1.097, 0.388, 0, 0), V3 = c(0, 0, 0, 0.005, 0, 0, 0, 0, 0, 0, 0.007, 0, 0, 0, 1.777, 0, 0.241, 0, 0, 0), V4 = c(0, 0, 0, 0.001, 0.003, 0, 0, 0, 0, 0, 0.207, 0.002, 0.003, 0.015, 0.032, 0, 0.007, 0, 0, 0), V5 = c(0, 0, 0, 0, 0.026, 0, 0, 0, 0.001, 0, 0.101, 0, 0, 0.005, 0.01, 0, 0, 0, 0, 0), V6 = c(0, 0, 0, 0.003, 0.009, 0, 0, 0, 0.076, 0, 0.01, 0.006, 0, 0.091, 0.829, 0, 0.002, 0, 0, 0), V7 = c(0, 0, 0, 0, 0.026, 0, 0, 0, 0.351, 0, 1.849, 0.003, 0, 0.005, 0.998, 0.009, 0.18, 0, 0, 0), V8 = c(0, 0, 0.002, 0.047, 0.01, 0.003, 0, 0, 0.021, 0, 0.848, 0.007, 0.005, 0.206, 0.023, 0, 0.025, 0, 0, 0), V9 = c(0, 0, 0, 0.02, 0.013, 0, 0, 0, 0, 0, 0.008, 0, 0, 0, 0, 0, 0, 0, 0, 0), V10 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.012, 9.895, 4.362, 0, 0, 0, 0, 0), P_diff = c(-4.051, -3.769, -3.602, -7.563, -6.398, -1.816, 0.84, -1.91, 3.095, -1.544, 6.068, 16.469, 6.403, 9.9, 9.089, 18.977, 14.123, 3.103, 1.527, -2.902), PH_fold = c(0.144, 0.511, 0.686, 0.372, 0.582, 0.325, 1.312, 0.436, 1.061, 0.371, 1.119, 1.298, 1.134, 1.146, 1.123, 1.484, 1.204, 1.263, 1.843, 0.423), PNH_fold = c(6.933, 1.955, 1.459, 2.69, 1.718, 3.081, 0.762, 2.291, 0.943, 2.696, 0.894, 0.77, 0.882, 0.872, 0.891, 0.674, 0.831, 0.792, 0.543, 2.366)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "P_diff", "PH_fold", "PNH_fold"), class = "data.frame", row.names = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10", "S11", "S12", "S13", "S14", "S15", "S16", "S17", "S18", "S19", "S20"))
# PH_fold thresholds range over 1.1, 1.2, ..., 2 and P_diff thresholds over 1, 2, ..., 20
sig1 <- data.frame(subset(all_matrix, PH_fold >= 1.1 & P_diff >= 1))
# PNH_fold thresholds range over 1.1, 1.2, ..., 2 and P_diff thresholds over -1, -2, ..., -20
sig2 <- data.frame(subset(all_matrix, PNH_fold >= 1.1 & P_diff <= -1))
# Shannon-type index: -sum(x * log(x)) over the positive entries
d <- function(x) { -sum(log(x[x > 0]) * x[x > 0]) }

sh1 <- apply(sig1[, -c(5:7)] / 100, 2, d)
sh2 <- apply(sig2[, -c(5:7)] / 100, 2, d)
count1 <- apply(sig1[, -c(5:7)], 2, function(i) sum(i > 0))  # positive entries per column
count2 <- apply(sig2[, -c(5:7)], 2, function(i) sum(i > 0))
G1 <- sig1 * sh1
G2 <- sig2 * sh2
G <- data.frame(G1 / G2)
I would like to write R code that calculates "G" for every subset of all_matrix produced by each of the conditions mentioned above. So how can I get the subsets of the matrix, using a different condition for each subset, in a loop, for the further calculation of "G"? I want to use the subset() function in a loop.
Can anyone help me out? Thank you, I appreciate any replies!
If you have large data frames, don't use subset(); it is slower than the alternatives. Logical indexing is generally the fastest.
I first create two indices from the given conditions, then a list to hold the sub-data frames (the age conditions here are placeholders that illustrate the pattern; see the sketch after this example for the question's actual conditions).
i1 <- data1$age < 15
i2 <- data1$age > 20
sub_data1 <- list()
sub_data1[['less_15']] <- data1[i1, ]
sub_data1[['over_20']] <- data1[i2, ]
Note that you can skip the index-creation step and simply do
sub_data1 <- list()
sub_data1[['less_15']] <- data1[data1$age < 15, ]
sub_data1[['over_20']] <- data1[data1$age > 20, ]
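Applied to the question's all_matrix, a sketch of the same idea looped over all cutoff combinations (assuming every pairing of a PH_fold threshold with a P_diff threshold is wanted):
ph_cuts   <- seq(1.1, 2, by = 0.1)  # PH_fold thresholds
diff_cuts <- 1:20                   # P_diff thresholds

sig1_list <- list()
for (ph in ph_cuts) {
  for (dc in diff_cuts) {
    key <- paste0("PH", ph, "_diff", dc)
    sig1_list[[key]] <- all_matrix[all_matrix$PH_fold >= ph &
                                     all_matrix$P_diff >= dc, ]
  }
}
# each element of sig1_list can then go through the "G" calculation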
I have a dataset which contains an acceleration column and a time column. Within Excel I can use these to create a velocity metric. However, I can't seem to replicate the formula in R, as one step involves adding a cell to the previous cell in the same column.
Within Excel the formula is H5 = H4+(G5*(B5-B4)), which calculates the difference in time between readings (B5-B4), multiplies the result by the acceleration (G5*(B5-B4)), then adds the result to the previous velocity value, where the starting velocity is always zero.
The first two steps are fine, but I haven't found how to replicate the third:
library(dplyr)

data %>%
  mutate(
    Time_diff  = Time - lag(Time),
    Accel_Time = Accel * Time_diff
  )
Here is the dataset with the expected velocity column included. I've skipped ahead slightly in the data here, as the first 100 or so rows have a zero velocity reading.
> dput(head(data1, 20))
structure(list(Time = c(1.002, 1.004, 1.006, 1.008, 1.01, 1.012,
1.014, 1.016, 1.018, 1.02, 1.022, 1.024, 1.026, 1.028, 1.03,
1.032, 1.034, 1.036, 1.038, 1.04), Accel = c(-0.04, -0.04, -0.05,
-0.05, -0.04, -0.04, -0.05, -0.05, -0.05, -0.05, -0.05, -0.06,
-0.06, -0.06, -0.06, -0.06, -0.06, -0.06, -0.07, -0.06),
Velocity = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Time_diff = c(NA, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002,
0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002),
Accel_Time = c(NA, -0.0000800000000000001, -0.0001, -0.0001,
-0.0000800000000000001, -0.0000800000000000001, -0.0001, -0.0001,
-0.0001, -0.0001, -0.0001, -0.00012, -0.00012, -0.00012, -0.00012, -0.00012, -0.00012, -0.00012, -0.00014, -0.00012)),
row.names = c(NA, 20L), class = "data.frame")
Any advice on this would be appreciated, thanks
Constructing the example data frame:
data1 <- data.frame(Time = c(1.002, 1.004, 1.006, 1.008, 1.01, 1.012, 1.014, 1.016, 1.018, 1.02, 1.022, 1.024, 1.026, 1.028, 1.03, 1.032, 1.034, 1.036, 1.038, 1.04), Accel = c(-0.04, -0.04, -0.05, -0.05, -0.04, -0.04, -0.05, -0.05, -0.05, -0.05, -0.05, -0.06, -0.06, -0.06, -0.06, -0.06, -0.06, -0.06, -0.07, -0.06))
Computing the time difference, using 0 at the beginning and then simply subtracting the 1st value from the 2nd, the 2nd from the 3rd, and so on:
data1$Time_diff <- c(0,data1$Time[-1] - data1$Time[-length(data1$Time)])
Computing accel_time:
data1$accel_time <- data1$Time_diff * data1$Accel
Getting the cumulative sum, which is the velocity:
data1$velocity <- cumsum(data1$accel_time)
As a one-liner:
cumsum(c(0,data1$Time[-1] - data1$Time[-length(data1$Time)]) * data1$Accel)
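Tying this back to the dplyr pipeline in the question, here is a sketch of the same computation; coalesce() replaces the leading NA from lag() with 0 so that cumsum() works:
library(dplyr)

data1 %>%
  mutate(
    Time_diff  = Time - lag(Time),
    Accel_Time = Accel * Time_diff,
    Velocity   = cumsum(coalesce(Accel_Time, 0))  # running sum of increments
  )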
I am trying to run a multiple imputation using the mice function (from the package of the same name) in R. I get a Warning that events have been logged. Here is the output from mice(.)$loggedEvents from my MWE (see below):
it im dep meth out
1 1 X pmm H
I'm not sure what is causing this warning or what the implications are. From what I understand, it can be caused by collinearity amongst variables, but that should be prevented by using remove_collinear=FALSE; however, this isn't fixing the warning.
MWE:
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07),
Z = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90),
D = c( 0, 0, 0, 1, 0, 0),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Obviously my original issue involved many more rows and columns of data and a higher number of imputations and iterations, but I've managed to trim it down to this MWE.
Any help in figuring out what's causing this problem would be great. Is there some sort of cut-off that mice uses when deciding whether a covariate is collinear? If it's very high, would this override the remove_collinear=FALSE parameter?
This isn't a full answer, but I couldn't fit the reply in a comment.
The logged-events warning can arise from a variety of issues, and the specific issue can be identified from the "meth" column in the mice()$loggedEvents output.
The two issues I know of are collinearity and a predictor that is constant across all values (being constant across the missing/non-missing split may also satisfy this criterion). I added some variables to highlight these:
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07, NA),
Z1 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
Z2 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
D = c( 0, 0, 0, 1, 0, 0, 1),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
it im dep meth out
1 0 0 collinear Z2
2 1 1 X pmm H
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07, NA),
Z1 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
consvar = c( 0.83, 0.83, 0.83, 0.83, 0.83, 0.83, 0.83),
D = c( 0, 0, 0, 1, 0, 0, 1),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
it im dep meth out
1 0 0 constant consvar
2 1 1 X pmm H
Unfortunately I don't know what issue "pmm" refers to. Maybe it has something to do with predictive mean matching (the chosen imputation method) not being able to work in such a small dataset?
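If a logged event points at a specific predictor, one common workaround (a sketch, not a fix specific to this MWE) is to drop that predictor from the affected imputation model via the predictor matrix:
library(mice)

# build the default predictor matrix, then drop H as a predictor for X
pred <- make.predictorMatrix(Pop)
pred["X", "H"] <- 0

Pop.Imp <- mice(Pop, m = 1, maxit = 1, predictorMatrix = pred, print = TRUE)
Pop.Imp$loggedEvents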