Generating quintiles and recoding multiple variables in R

I have 33 columns/variables with different values. What I am trying to do: generate quintiles for all variables (done), then use the quintiles to recode each variable (-2, -1, 0, 1, 2) by quintile. I generated the quintiles using:
q <- apply(ndataframe[2:34], 2, quantile, c(.2, .4, .6, .8, 1), na.rm = T)
Each variable is on a different scale, which is why the quintile values differ. I assume there is a better and more efficient way to recode by quintile than what I have been doing, which is taking the quintile values and manually recoding each column one by one, e.g.:
n_df_quins$A_q <- recode(n_dataframe$A,
  "0:1529 = '-2'; 1530:2199 = '-1'; 2200:2999 = '0'; 3000:3999 = '1'; 4000:25000 = '2'")
Thanks very much for any assistance anyone can offer.

You can use percent_rank to create a new data set with percentile values for each observation in each column, and then write a function that recodes based on your criteria, which you can apply to the whole data set in one go using mutate_all. Below is the code:
library("dplyr")
df <- data.frame(var1 = 1:100, var2 = sample(1:1000, 100))
df1 <- mutate_all(df, percent_rank)

# map each percentile to the quintile codes -2, -1, 0, 1, 2
recode_new <- function(x) {
  ifelse(x <= .2, -2,
         ifelse(x <= .4, -1,
                ifelse(x <= .6, 0,
                       ifelse(x <= .8, 1, 2))))
}

df_final <- mutate_all(df1, recode_new)
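A quick alternative sketch, not part of the original answer: dplyr's ntile() assigns the quintile membership (1-5) directly, so subtracting 3 maps it to -2..2 (assumes dplyr >= 1.0 so across() is available; note that ntile() forms nearly equal-sized groups, which can differ slightly from percentile cut-offs when there are ties).
library(dplyr)
df_final2 <- df %>%
  mutate(across(everything(), ~ ntile(.x, 5) - 3))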
Let me know if you have any questions.

Related

Merging on row index in R (by = 0 and by = "row.names" not working)

Tl;dr - I'm trying to use the merge.data.table() function with row indexes and the suggestions given in the R documentation are not working.
My data is roughly:
library(data.table)
library(quantreg)
library(purrr)
foo <- expand.grid(seq(60001, 60050, by = 1),
                   c("18-30", "31-60", "61+"),
                   c("pre", "during", "after"))
foo <- as.data.table(foo)
setnames(foo, names(foo), c("zip", "agegroup", "period"))
foo <- cbind(foo,
             quartile = floor(runif(n = nrow(foo), 1, 4)),
             times = runif(n = nrow(foo), 18, 25))
I ran several quantile regressions on the data, subsetting by age group (at someone else's request).
v_tau <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
mq_age1 <- map(v_tau, ~rq(data = foo[agegroup == "18-30", ],
                          times ~ quartile + period + quartile*period,
                          tau = .x))
I'm trying to merge a vector of predicted (fitted) values from the rq() object with the original data table (I could also transform it into a data frame; it doesn't have to be a data table). This vector is shorter than the number of rows in the data table, so I've been trying to apply the answer given here for a plm() object, modifying it to account for the fact that my fitted values do not have multiple index attributes.
So, what I have been trying to do is join them by row index. I realize I can make another column with an explicit index, but I would like to avoid that because the fitted values are from a subset of the data and I am joining them to a subset of the data; adding an explicit index is possible, but not uniform or parsimonious, and will end up generating a lot of NAs that I don't want to deal with.
fitted <- mq_age1[[10]]$fitted.values
d_fitted <- cbind(attr(fitted, "index"),
                  fitted = fitted)
foo2 <- merge(foo[agegroup == "18-30",], d_fitted, by = 0, all.x = TRUE)
Looking at the merge() documentation, it says: "Columns to merge on can be specified by name, number or by a logical vector: the name "row.names" or the number 0 specifies the row names. If specified by name it must correspond uniquely to a named column in the input."
However, when I try this, it gives me the following error message:
Error in merge.data.table(foo[agegroup == "18-30", ], d_fitted, by = 0, :
  A non-empty vector of column names for `by` is required.
Similarly, when I try using "row.names":
foo2 <- merge(foo[agegroup == "18-30",], d_fitted, by = "row.names", all.x = TRUE)
Error in merge.data.table(foo[agegroup == "18-30", ], d_fitted, by = "row.names", :
  Elements listed in `by` must be valid column names in x and y
What is going on? Why can't I do this?
Found the answer: @r2evans kindly pointed out that base::merge has this functionality, while data.table's merge method does not.
Converting foo back to a data frame before the merge,
foo <- as.data.frame(foo)
foo2 <- merge(foo[foo$agegroup == "18-30", ], d_fitted, by = 0, all.x = TRUE)
did the trick. Thanks!
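If you'd rather keep foo as a data.table for the rest of the workflow, a minimal variant of the same fix (my own sketch, not from the original answer) is to convert only inside the merge call, since base R's data.frame method understands by = 0 (row names):
# assumes foo is still a data.table and d_fitted's row names line up with the subset
foo2 <- merge(as.data.frame(foo[agegroup == "18-30", ]), d_fitted,
              by = 0, all.x = TRUE)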

Add p-value column in qwraps2::summary_table

I want to make a little summary table for my colleagues in R Markdown using qwraps2::summary_table. The data.frame contains information on different exposures. All the variables are coded as binary.
library(qwraps2)
library(dplyr)
pop <- rbinom(n = 1000, size = 1, prob = runif(n = 10, min = 0, max = 1))
exp <- rbinom(n = 1000, size = 1, prob = .5)
ID <- c(1:500)
therapy <- factor(sample(x = pop, size = 500, replace = TRUE), labels = c("Control", "Intervention"))
exp_1 <- sample(x = exp, size = 500, replace = TRUE)
exp_2 <- sample(x = exp, size = 500, replace = TRUE)
exp_3 <- sample(x = exp, size = 500, replace = TRUE)
exp_4 <- sample(x = exp, size = 500, replace = TRUE)
df <- data.frame(ID, exp_1, exp_2, exp_3, exp_4, therapy)
head(df)
In the next step, I create a simple summary table as follows. In the table I want to have the groups (control vs. intervention) as columns and the exposures as rows:
my_summary <-
  list(list("Exposure 1" = ~ n_perc(exp_1 %in% 1),
            "Exposure 2" = ~ n_perc(exp_2 %in% 1),
            "Exposure 3" = ~ n_perc(exp_3 %in% 1),
            "Exposure 4" = ~ n_perc(exp_4 %in% 1)))
my_table <- summary_table(group_by(df, therapy), my_summary)
my_table
In the next step I wanted to add a further column containing p-values for the group differences between the control and intervention groups, e.g. from fisher.test. I read in ?qwraps2::summary_table that cbind is a suitable method for class qwraps2_summary_table, but to be honest, I'm struggling with it. I tried different approaches but failed, unfortunately.
Is there a convenient way to add individual columns to a qwraps2::summary_table, especially p-values corresponding to the grouped columns?
Thanks for your help!
Best,
Florian
[SOLVED]
Meanwhile, after a lot of research on this topic, I found a convenient and easy way to add a p-value column. Maybe it is not the smartest solution, but it worked, at least for me.
First, I calculated the p-values with a function that extracts the p-value from the output of fisher.test, and stored them in an object, in my case a simple numeric vector:
# function to extract the p-value from fisher.test
fisher.pvalue <- function(x) {
  value <- fisher.test(x)$p.value
  return(value)
}
# Fisher tests / generate p-values
p.vals <- round(sapply(list(
  table(df$exp_1, df$therapy),
  table(df$exp_2, df$therapy),
  table(df$exp_3, df$therapy),
  table(df$exp_4, df$therapy)), fisher.pvalue), digits = 2)
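An equivalent, slightly more compact variation (my own sketch): loop over the exposure columns by name instead of writing out each table() call.
p.vals <- round(sapply(paste0("exp_", 1:4), function(v) {
  fisher.pvalue(table(df[[v]], df$therapy))
}), digits = 2)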
In the following step I simply added an empty table column called "P-Value" and filled its cells with p.vals.
overall_table <- cbind(my_table, "P-Value" = "") # create empty column
overall_table[9:12] <- p.vals # add vals to empty column
# overall_table <- cbind(my_table, "P-Value" = p.vals) works the same way in one line of code
overall_table
In my case, I simply looked up the corresponding cell indices in overall_table (for the P-Value column, cells 9:12) and filled them using base syntax. In the qwraps2 vignette (https://cran.r-project.org/web/packages/qwraps2/vignettes/summary-statistics.html), the author uses regular expressions to identify the right cells (see section 3.2).
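If you want to avoid hard-coding the cells 9:12, a small sketch of my own (it relies only on the table indexing like the character matrix it is built on) is to fill the just-added last column by position:
# fill the last column (the empty "P-Value" column added above) in one go
overall_table[, ncol(overall_table)] <- p.vals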
If there are other methods to add individual columns to a qwraps2::summary_table, I would appreciate seeing how it is done.
Best,
Florian

T-Test For Genes using Apply Function in Dataframe

I’m trying to run a t.test on two data frames.
The data frames (which I carved out from a larger data.frame) have the data I need in rows 1:143. I've already created sub-variables, as I needed to calculate rowMeans:
c.mRNA <- rowMeans(c007[1:143, 4:9])
h.mRNA <- rowMeans(c007[1:143, 10:15])
I’m simply trying to run a t.test for each row, and then plot the p-values as a histogram. This is what I thought would work…
Pvals<-apply(mRNA143.data,1,function(x) {t.test(x[c.mRNA],x[h.mRNA])$p.value})
But I keep getting an error?
Error in t.test.default(x[c.mRNA], x[h.mRNA]) :
not enough 'x' observations
I’ve got something off in my syntax and cannot figure it out for the life of me!
EDIT: I've created a data.frame so it's now just two columns, I need a p-value for each row. Below is a sample of my data...
c.mRNA h.mRNA
1 8.224342 8.520142
2 9.096665 11.762597
3 10.698863 10.815275
4 10.666233 10.972130
5 12.043525 12.140297
I tried this...
pvals=apply(mRNA143.data,1,function(x) {t.test(mRNA143.data[,1],mRNA143.data[, 2])$p.value})
But I can tell from my plot that something is off (the points fall in a straight line, because every row ends up with the same p-value).
A reproducible example would go a long way. In preparing it, you might have realized that you are trying to subset columns using row means (c.mRNA and h.mRNA are numeric means, not column indices), which doesn't really make sense.
What you want to do is go through the rows of your data, subset the columns belonging to one group, repeat for the second group, and pass those to the t.test function.
This is how I would do it.
group1 <- matrix(rnorm(50, mean = 0, sd = 2), ncol = 5)
group2 <- matrix(rnorm(50, mean = 5, sd = 2), ncol = 5)
xy <- cbind(group1, group2)
# this is just a visualization of the test you're performing
plot(0, 0, xlim = c(-5, 11), ylim = c(0, 0.25), type = "n")
curve(dnorm(x, mean = 5, sd = 2), add = TRUE)
curve(dnorm(x, mean = 0, sd = 2), add = TRUE)
out <- apply(xy, MARGIN = 1, FUN = function(x) {
# x is a vector, e.g. xy[i, ] or xy[1, ]
t.test(x = x[1:5], y = x[6:10])$p.value
})
out
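Applying the same idea to the question's own objects (a sketch; c007 and the column ranges 4:9 / 10:15 are taken from the question, so adjust them if your layout differs):
# one t-test per row (gene), comparing columns 4:9 against 10:15 of c007
Pvals <- apply(c007[1:143, 4:15], 1, function(x) {
  t.test(x[1:6], x[7:12])$p.value
})
hist(Pvals)  # distribution of per-gene p-values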

R apply function to data based on index column value

Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of values in example$target for which example$index equals 1, 2, and 3, respectively. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for (i in 1:max(example$index)) {
  whats_left[[i]] = accumulating_cost(threshold, example$target[example$index == i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization when possible (including getting away from apply, which is essentially a for-loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.
All you are trying to do is accumulate the cost by index. Thus, you can use data.table's by argument, as in:
example[, accumulating_cost(threshold, target), by = index]
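The same grouped computation with the result column named explicitly (a small usage sketch of my own):
example[, .(whats_left = accumulating_cost(threshold, target)), by = index]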

R: t-test over all subsets over all columns

This is a follow-up question to R: t-test over all columns.
Suppose I have a huge data set from which I created numerous subsets based on certain conditions. The subsets all have the same columns. I want to run a t-test on two subsets at a time (outer loop), and for each combination of subsets go through all the columns one at a time (inner loop).
Here is what I have come up with based on the previous answer. It stops with an error.
C <- c("c1","c1","c1","c1","c1",
"c2","c2","c2","c2","c2",
"c3","c3","c3","c3","c3",
"c4","c4","c4","c4","c4",
"c5","c5","c5","c5","c5",
"c6","c6","c6","c6","c6",
"c7","c7","c7","c7","c7",
"c8","c8","c8","c8","c8",
"c9","c9","c9","c9","c9",
"c10","c10","c10","c10","c10")
X <- rnorm(n=50, mean = 10, sd = 5)
Y <- rnorm(n=50, mean = 15, sd = 6)
Z <- rnorm(n=50, mean = 20, sd = 5)
Data <- data.frame(C, X, Y, Z)
Data.c1 = subset(Data, C == "c1",select=X:Z)
Data.c2 = subset(Data, C == "c2",select=X:Z)
Data.c3 = subset(Data, C == "c3",select=X:Z)
Data.c4 = subset(Data, C == "c4",select=X:Z)
Data.c5 = subset(Data, C == "c5",select=X:Z)
Data.Subsets = c("Data.c1",
"Data.c2",
"Data.c3",
"Data.c4",
"Data.c5")
library(plyr)
combo1 <- combn(length(Data.Subsets),1)
adply(combo1, 1, function(x) {
combo2 <- combn(ncol(Data.Subsets[x]),2)
adply(combo2, 2, function(y) {
test <- t.test( Data.Subsets[x][, y[1]], Data.Subsets[x][, y[2]], na.rm=TRUE)
out <- data.frame("Subset" = rownames(Data.Subsets[x]),
, "Row" = colnames(x)[y[1]]
, "Column" = colnames(x[y[2]])
, "t.value" = round(test$statistic,3)
, "df"= test$parameter
, "p.value" = round(test$p.value, 3)
)
return(out)
} )
} )
First of all, you can define your dataset more easily using gl, and avoid creating individual variables for the columns.
Data <- data.frame(
C = gl(10, 5, labels = paste("c", 1:10, sep = "")),
X = rnorm(n = 50, mean = 10, sd = 5),
Y = rnorm(n = 50, mean = 15, sd = 6),
Z = rnorm(n = 50, mean = 20, sd = 5)
)
Convert this to "long" format using melt from the reshape package. (You can also use the base reshape function.)
longData <- melt(Data, id.vars = "C")
Now use pairwise.t.test to compute t-tests on all pairs of X/Y/Z for each level of C.
with(longData, pairwise.t.test(value, interaction(C, variable)))
Note that it is important to use pairwise.t.test rather than just lots of individual calls to t.test because you need to adjust your p values if you run lots of tests. (See, e.g., xkcd for explanation.)
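If you want a specific correction rather than the default (Holm), you can pass it explicitly; a small sketch of my own, using Benjamini-Hochberg:
with(longData, pairwise.t.test(value, interaction(C, variable),
                               p.adjust.method = "BH"))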
In general, pairwise t-tests are inferior to a regression, so be careful about their usage.
You can use get(Data.Subsets[x]), which will pick out the relevant data frame by name. But I don't think this should be necessary.
Explicitly subsetting that many times shouldn't be necessary either. You could create the subsets using something like
conditions = c("c1", "c2", "c3", "c4", "c5")
dfs <- lapply(conditions, function(x){subset(Data, C==x, select=X:Z)})
That should (I didn't test it) return a list of data frames, each subset on one of the conditions you passed in.
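A small usage note of my own: naming the list lets you pull each subset by its condition rather than by position.
names(dfs) <- conditions
head(dfs[["c3"]])  # e.g. the subset for condition c3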
However, as @Richie Cotton points out, it would be a much better idea to reshape your data frame and use pairwise t-tests.
I should point out that doing this many t-tests doesn't seem wise, even after correction for multiple testing, be it FDR, permutation, or otherwise. It would be better to figure out whether you can use an ANOVA of some sort, as they are designed for almost exactly this purpose.
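A rough sketch of that ANOVA route (my own addition, reusing the longData from the earlier answer and assuming a two-way layout of condition by variable is what you are after):
# two-way ANOVA: does value differ by condition, by variable, or by their interaction?
fit <- aov(value ~ C * variable, data = longData)
summary(fit)
TukeyHSD(fit)  # post-hoc pairwise comparisons with family-wise error control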
