Combining and appending columns of different lengths, by row number, in R

I'm working with biochemical data from subjects, analysing the results by sex. I have 19 biochemical tests to analyse for each sex, for each of two drugs (haematology and anatomy tests coming later).
For reasons of reproducibility of results and for preventing transcription errors, I am trying to summarise each test into one table. Included in the table output, I need a column for the Dunnett post hoc comparison p-values. Because the Dunnett test compares to the control results, with a control and 3 drug levels I only get 3 p-values. However, I have 4 mean and sd values.
Using ddply to get the mean and sd results (having limited the number of significant figures), I get a dataset that looks like this:
Sex <- c(rep("F",4), rep("M",4))
Druglevel <- c(rep(0:3,2))
Sample <- c(rep(10,8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
I have used glht in the package multcomp to perform the Dunnett tests on the aov object I constructed from undertaking a normal aov. I've extracted the p-values from the glht summary (I've rounded these to three decimal places). The male and female analyses have been run using separate ANOVA so I have one set of output for each sex. The female results are:
femaleR <- c(0.371, 0.973, 0.490)
and the male results are:
maleR <- c(0.862, 0.999, 0.738)
How can I append a column for the p-values to my original dataframe (Drug1Biochem1) so that both femaleR and maleR are in that final column, with row 1 and row 5 of that column empty (i.e. no p-values for the control)?
I wish to output the resulting combination to html, which can be inserted into a Word document so no transcription errors occur. I have set a seed value so that the results of the program are reproducible (when I finally stop debugging).
In summary, I would like a data frame (or table, or whatever I can output to html) that has the following format:
Sex Druglevel Sample Mean   sd p-value
F           0     10 0.44 0.07
F           1     10 0.50 0.07   0.371
F           2     10 0.46 0.09   0.973
F           3     10 0.49 0.12   0.490
M           0     10 0.48 0.18
M           1     10 0.55 0.19   0.862
M           2     10 0.47 0.13   0.999
M           3     10 0.57 0.41   0.738
For each test, I wish to reproduce this exact table. There will always be 4 groups per sex, and there will never be a p-value for the control, which will always be summarised in row 1 (F) and row 5 (M).

You could try merge
dN <- data.frame(Sex=rep(c('M', 'F'), each=3), Druglevel=1:3,
pval=c(maleR, femaleR))
merge(Drug1Biochem1, dN, by=c('Sex', 'Druglevel'), all=TRUE)
# Sex Druglevel Sample Mean sd pval
#1 F 0 10 0.44 0.07 NA
#2 F 1 10 0.50 0.07 0.371
#3 F 2 10 0.46 0.09 0.973
#4 F 3 10 0.49 0.12 0.490
#5 M 0 10 0.48 0.18 NA
#6 M 1 10 0.55 0.19 0.862
#7 M 2 10 0.47 0.13 0.999
#8 M 3 10 0.57 0.41 0.738
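If the rows really are always in the fixed order described in the question (control first within each sex, females before males), direct assignment by row is another option. The following is only a sketch under that assumption; the column name pval is borrowed from the merge example, and the data are rebuilt from the question so the snippet runs on its own.

```r
# Rebuild the example data from the question
Drug1Biochem1 <- data.frame(
  Sex       = rep(c("F", "M"), each = 4),
  Druglevel = rep(0:3, 2),
  Sample    = 10,
  Mean      = c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57),
  sd        = c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
)
femaleR <- c(0.371, 0.973, 0.490)
maleR   <- c(0.862, 0.999, 0.738)

# Start with NA everywhere, then fill only the non-control rows.
# ASSUMPTION: rows are ordered F then M, each by increasing Druglevel.
Drug1Biochem1$pval <- NA_real_
Drug1Biochem1$pval[Drug1Biochem1$Druglevel != 0] <- c(femaleR, maleR)
```

Unlike merge, this does not reorder rows, but it silently gives wrong results if the row order ever differs, so the merge on Sex and Druglevel is the safer choice when the order is not guaranteed.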

Related

Heatmap of effect sizes and p-values using different exposures and outcomes in ggplot2

I want to create a heat map that graphically shows effect sizes between different outcomes and exposures and if p-values were significant.
I have created one big dataframe containing all exposure-outcomes tests with p-values and effect sizes. The effect direction can be positive or negative. Now, there are great resources to create this for correlation matrices such as corrplot.
I don't get how to do this for effects sizes with different exposures and outcomes.
This would be the sample dataframe. There would be 20 exposures and 15 outcomes.
Here is a shortened example. The estimates and p-values are made up, so disregard the statistical nonsense in the values.
dat
# id Exposure Outcome beta p-value se x
# 1 a 1 0.02 0.04 0.001
# 1 a 2 0.52 0.001 0.02
# 1 a 3 0.001 0.54 0.001
# 1 b 1 -0.02 0.09 0.045
# 1 b 2 0.06 0.12 0.03
# 1 b 3 -0.1 0.41 0.09
# 1 c 1 -0.42 0.01 0.08
This is an example of a similar plot using correlation.

Compute stats for several columns at the same time using sapply

I have a dataframe as follows:
# A tibble: 6 x 4
Placebo High Medium Low
<dbl> <dbl> <dbl> <dbl>
1 0.0400 -0.04 0.0100 0.0100
2 0.04 0 -0.0100 0.04
3 0.0200 -0.1 -0.05 -0.0200
4 0.03 -0.0200 0.03 -0.00700
5 -0.00500 -0.0100 0.0200 0.0100
6 0.0300 -0.0100 NA NA
You could get the cohensD for two of the columns using the cohen.d() function from the effsize package:
df <- data.frame(Placebo = c(0.0400, 0.04, 0.0200, 0.03, -0.00500, 0.0300),
Low = c(-0.04, 0, -0.1, -0.0200, -0.0100, -0.0100),
Medium = c(0.0100, -0.0100, -0.05, 0.03, 0.0200, NA ),
High = c(0.0100, 0.04, -0.0200, -0.00700, 0.0100, NA))
library(effsize)
cohen.d(as.vector(na.omit(df$Placebo)), as.vector(na.omit(df$High)))
Interestingly enough, I'm getting the following error with this code:
Error in data[, group] : incorrect number of dimensions
However, I would like to create a function that allows you to obtain all the cohensd between one of the columns and the rest of them.
In order to get the cohensD of all columns against the Placebo we would use something like:
sapply(df, function(i) cohen.d(pull(df, as.vector(na.omit(!!Placebo))), as.vector(na.omit(i))))
But I'm not sure this would work anyway.
Edit: I don't want to erase the full row, as Cohen's d can be computed for vectors of different lengths. Ideally, I would like to get the stat with the NA removed for each column independently.
It may be better to remove the NA on each of the columns separately by creating a logical index along with 'Placebo'
library(dplyr)
library(effsize)
df %>%
summarise(across(Low:High, ~ list({
i1 <- complete.cases(Placebo)& complete.cases(.x)
cohen.d(Placebo[i1], .x[i1])})))
Or if we want to use lapply/sapply, loop over the columns other than Placebo
lapply(df[-1], function(x) {
x1 <- na.omit(cbind(df$Placebo, x))
cohen.d(x1[,1], x1[,2])
})
-output
$Low
Cohen's d
d estimate: 1.947312 (large)
95 percent confidence interval:
lower upper
0.3854929 3.5091319
$Medium
Cohen's d
d estimate: 0.9622504 (large)
95 percent confidence interval:
lower upper
-0.5782851 2.5027860
$High
Cohen's d
d estimate: 0.8884639 (large)
95 percent confidence interval:
lower upper
-0.6402419 2.4171697
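Both snippets above drop a Placebo value whenever the other column is NA in the same row. If the NAs should instead be removed from each column on its own, as the question's edit asks, one option is to compute the pooled-SD Cohen's d by hand in base R. This is a sketch, not the effsize implementation: cohens_d is a hypothetical helper name, and it returns only the point estimate (no confidence interval or magnitude label).

```r
# Example data from the question
df <- data.frame(
  Placebo = c(0.0400, 0.04, 0.0200, 0.03, -0.00500, 0.0300),
  Low     = c(-0.04, 0, -0.1, -0.0200, -0.0100, -0.0100),
  Medium  = c(0.0100, -0.0100, -0.05, 0.03, 0.0200, NA),
  High    = c(0.0100, 0.04, -0.0200, -0.00700, 0.0100, NA)
)

# Hypothetical helper: pooled-SD Cohen's d, dropping NAs from each vector separately
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  pooled_var <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
    (length(x) + length(y) - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

# Placebo against every other column; returns a named numeric vector
d_vs_placebo <- sapply(df[-1], function(col) cohens_d(df$Placebo, col))
```

For Low, which has no missing values, this agrees with the estimate shown in the output above. For Medium and High the result can differ from that output, because all six Placebo values are kept here.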

Remove leading zeros in numbers *within a data frame*

Edit: For anyone coming later: THIS IS NOT A DUPLICATE, since it explicitly concerns work on data frames, not single variables/vectors.
I have found several sites describing how to drop leading zeros in numbers or strings, including vectors. But none of the descriptions I found seem applicable to data frames.
Or the f_num function in the numform package. It treats "[a] vector of numbers (or string equivalents)", but does not seem to solve unwanted leading zeros in a data frame.
I am relatively new to R but understand that I could develop some (in my mind) complex code to drop leading zeros by subsetting vectors from a data frame and then combining those vectors into a full data frame. I would like to avoid that.
Here is a simple data frame:
df <- structure(list(est = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26,
-0.23), low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3,
-0.28), up2.5 = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17
)), row.names = c(NA, 8L), class = "data.frame")
Which gives
df
est low2.5 up2.5
1 0.05 0.01 0.09
2 -0.16 -0.20 -0.12
3 -0.02 -0.05 0.00
4 0.00 -0.03 0.04
5 -0.11 -0.20 -0.01
6 0.15 0.10 0.20
7 -0.26 -0.30 -0.22
8 -0.23 -0.28 -0.17
I would want
est low2.5 up2.5
1 .05 .01 .09
2 -.16 -.20 -.12
3 -.02 -.05 .00
4 .00 -.03 .04
5 -.11 -.20 -.01
6 .15 .10 .20
7 -.26 -.30 -.22
8 -.23 -.28 -.17
Is that possible with relatively simple code for a whole data frame?
Edit: An incorrect link has been removed.
I interpret your question as asking how to convert each numeric cell in the data.frame into a "pretty-printed" string, which is possible using string substitution and a simple regular expression (a good question, BTW, since I do not know of any way to configure the output of numeric data to suppress leading zeros without converting the numeric data into strings!):
df2 <- data.frame(lapply(df,
function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", as.character(x)))),
stringsAsFactors = FALSE)
df2
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.2 -.12
# 3 -.02 -.05 0
# 4 0 -.03 .04
# 5 -.11 -.2 -.01
# 6 .15 .1 .2
# 7 -.26 -.3 -.22
# 8 -.23 -.28 -.17
str(df2)
# 'data.frame': 8 obs. of 3 variables:
# $ est : chr ".05" "-.16" "-.02" "0" ...
# $ low2.5: chr ".01" "-.2" "-.05" "-.03" ...
# $ up2.5 : chr ".09" "-.12" "0" ".04" ...
If you want to get a fixed number of digits after the decimal point (as shown in the expected output but not asked for explicitly) you could use sprintf or format:
df3 <- data.frame(lapply(df,
function(x) gsub("^0\\.", "\\.", gsub("^-0\\.", "-\\.", sprintf("%.2f", x)))),
stringsAsFactors = FALSE)
df3
# est low2.5 up2.5
# 1 .05 .01 .09
# 2 -.16 -.20 -.12
# 3 -.02 -.05 .00
# 4 .00 -.03 .04
# 5 -.11 -.20 -.01
# 6 .15 .10 .20
# 7 -.26 -.30 -.22
# 8 -.23 -.28 -.17
Note: This solution is not robust against different decimal point character (different locales) - it always expects a decimal point...
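The two nested gsub calls can also be folded into one substitution by capturing the optional minus sign. The sketch below is a variation on the code above, not a different technique; drop_leading_zero is a hypothetical helper name and the digits argument is an added convenience. The same locale caveat applies.

```r
df <- data.frame(
  est    = c(0.05, -0.16, -0.02, 0, -0.11, 0.15, -0.26, -0.23),
  low2.5 = c(0.01, -0.2, -0.05, -0.03, -0.2, 0.1, -0.3, -0.28),
  up2.5  = c(0.09, -0.12, 0, 0.04, -0.01, 0.2, -0.22, -0.17)
)

# Hypothetical helper: format to a fixed number of decimals, then drop the
# zero before the decimal point while keeping an optional minus sign (\\1)
drop_leading_zero <- function(x, digits = 2) {
  sub("^(-?)0\\.", "\\1.", sprintf(paste0("%.", digits, "f"), x))
}

df_chr <- df
df_chr[] <- lapply(df, drop_leading_zero)  # [] keeps the data.frame structure
```

Assigning into df_chr[] keeps row names and column names as they are, so the result prints like the original data frame, only with character columns.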

Find where species accumulation curve reaches asymptote

I have used the specaccum() command to develop species accumulation curves for my samples.
Here is some example data:
site1<-c(0,8,9,7,0,0,0,8,0,7,8,0)
site2<-c(5,0,9,0,5,0,0,0,0,0,0,0)
site3<-c(5,0,9,0,0,0,0,0,0,6,0,0)
site4<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site5<-c(5,0,9,0,0,6,6,0,0,0,0,0)
site6<-c(5,0,9,0,0,0,6,6,0,0,0,0)
site7<-c(5,0,9,0,0,0,0,0,7,0,0,3)
site8<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site9<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site10<-c(5,0,9,0,0,0,0,0,0,0,1,6)
site11<-c(5,0,9,0,0,0,5,0,0,0,0,0)
site12<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site13<-c(5,1,9,0,0,0,0,0,0,0,0,0)
species_counts<-rbind(site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,site11,site12,site13)
accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)
To ensure I have sampled sufficiently, I need to make sure the species accumulation curve reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e. between sites 12 and 13).
results <- with(accum, data.frame(sites, richness, sd))
Produces this:
sites richness sd
1 1 3.46 0.9991916
2 2 4.94 1.6625403
3 3 5.94 1.7513054
4 4 7.05 1.6779918
5 5 8.03 1.6542263
6 6 8.74 1.6794660
7 7 9.32 1.5497149
8 8 9.92 1.3534841
9 9 10.51 1.0492422
10 10 11.00 0.8408750
11 11 11.35 0.7017295
12 12 11.67 0.4725816
13 13 12.00 0.0000000
I feel like I'm getting there. I could generate an lm with site vs richness and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data generation process a little bit:
species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
5,0,9,0,5,0,0,0,0,0,0,0, 5,0,9,0,0,0,0,0,0,6,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,0,9,0,0,6,6,0,0,0,0,0,
5,0,9,0,0,0,6,6,0,0,0,0, 5,0,9,0,0,0,0,0,7,0,0,3,
5,0,9,0,0,0,0,0,0,0,1,0, 5,0,9,0,0,0,0,0,0,0,1,0,
5,0,9,0,0,0,0,0,0,0,1,6, 5,0,9,0,0,0,5,0,0,0,0,0,
5,0,9,0,0,0,0,0,0,0,0,0, 5,1,9,0,0,0,0,0,0,0,0,0),
byrow=TRUE,nrow=13)
Always a good idea to set.seed() before running randomization tests (and let us know that specaccum is in the vegan package):
set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)
Extract the richness and sites components from within the returned object and compute d(richness)/d(sites) (note that the slope vector is one element shorter than the original site/richness vectors: be careful if you're trying to match up slopes with particular numbers of sites)
(slopes <- with(accum,diff(richness)/diff(sites)))
## [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31
In this case, the slope never actually goes below 0.3, so this code for finding the first time that the slope falls below 0.3:
which(slopes<0.3)[1]
returns NA.
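To make the off-by-one matching explicit, the slopes can be labelled by the pair of sites they connect. The sketch below uses the richness values printed in the question's results table rather than a fresh specaccum run, so it does not need vegan; the variable names last_slope, reached and first_below are illustrative only.

```r
# Richness values as printed in the question's results table
sites    <- 1:13
richness <- c(3.46, 4.94, 5.94, 7.05, 8.03, 8.74, 9.32,
              9.92, 10.51, 11.00, 11.35, 11.67, 12.00)

# Slope of each segment; element i is the step from site i to site i + 1
slopes <- diff(richness) / diff(sites)
names(slopes) <- paste(head(sites, -1), tail(sites, -1), sep = "-")

last_slope  <- unname(tail(slopes, 1))  # slope between sites 12 and 13
reached     <- last_slope < 0.3         # asymptote criterion from the question
first_below <- which(slopes < 0.3)[1]   # NA if the slope never drops below 0.3
```

With the question's numbers the last slope is 12.00 - 11.67 = 0.33, so the criterion is not met there either.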

Random sample in R when data is in long format

I need to randomly sample a dataset which is arranged in long format. In my dataset, each subject has 4 observations, so if I randomly sample a row I am randomly losing one or more observation per subject.
This is simulated data for illustration purposes; my real data is much bigger.
sub sex group dv1 dv2
P1 m A 0.66 0.94
P1 m B 0.98 0.26
P1 m C 0.02 0.03
P1 m D 0.60 0.30
P2 m A 0.92 0.99
P2 m B 0.82 0.09
P2 m C 0.44 0.67
P2 m D 0.53 0.80
P3 f A 0.29 0.22
P3 f B 0.46 0.20
P3 f C 0.37 0.77
P3 f D 0.76 0.54
P4 m A 0.28 0.99
P4 m B 0.16 0.57
P4 m C 0.46 0.75
P4 m D 0.28 0.21
In this example, I need to randomly select 2 males. I tried using the dplyr package (see below), but if I ask for a sample of 2, it just gives me 2 rows for sex="m" and 2 for sex="f": 4 randomly chosen rows in total. What I need is 8 rows, where 4 come from one male and 4 from another. Changing the grouping variable to sub doesn't work, as it complains that there are only 2 levels in the group (actually, it would work in this toy example as there are 4 levels for each sub, but note that I am choosing something like 50 samples from a bigger dataset). It would also just give me 2 random rows for each sub, which is not what I need.
library(dplyr)
subset <- data %>%
group_by(sex) %>%
sample_n(2)
Please do not suggest reshaping the data to wide format and sampling it there, as I know that I can do that. I am sure there must be a way to sample in long format.
I would sample from the patient names and then filter by those sampled names:
Look at all males
male_subset <- data %>% filter(sex == "m")
Look for unique male ID
male_IDs <- unique(male_subset$sub)
Sample from the unique IDs
sampled_IDs <- sample(male_IDs, 2)
Now you subset your data based on these sampled IDs:
data %>% filter(sub %in% sampled_IDs)
This should return all four rows for each of the 2 sampled individuals.
I'm not sure if I've quite understood what you want. Would this do it?
data %>% filter(sex == 'm') %>% filter(sub %in% sample(paste0('P',1:4), 2))
You'd have to change what's in the paste0 function for your real data, of course.
In base R,
set.seed(1)
subset <- sample(unique(data[data$sex == "m",]$sub), 2)
data_subset<-data[data$sub %in% subset,]
nrow(data_subset)
# [1] 8
Works, but not flashy.
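For reference, here is a self-contained base R sketch of the "sample the IDs, then filter" idea, rebuilt on the toy data from the question. The names male_ids, sampled_ids and data_subset are illustrative. The key step is unique(): without it, each subject appears four times in the pool, and the same subject could be drawn twice.

```r
# Toy data from the question, in long format (4 rows per subject)
data <- data.frame(
  sub   = rep(c("P1", "P2", "P3", "P4"), each = 4),
  sex   = rep(c("m", "m", "f", "m"), each = 4),
  group = rep(c("A", "B", "C", "D"), times = 4),
  dv1   = c(0.66, 0.98, 0.02, 0.60, 0.92, 0.82, 0.44, 0.53,
            0.29, 0.46, 0.37, 0.76, 0.28, 0.16, 0.46, 0.28),
  dv2   = c(0.94, 0.26, 0.03, 0.30, 0.99, 0.09, 0.67, 0.80,
            0.22, 0.20, 0.77, 0.54, 0.99, 0.57, 0.75, 0.21)
)

set.seed(1)
male_ids    <- unique(data$sub[data$sex == "m"])  # one entry per male subject
sampled_ids <- sample(male_ids, 2)                # draw 2 subjects, not 2 rows
data_subset <- data[data$sub %in% sampled_ids, ]  # keep all rows of those subjects
```

Because subjects are sampled rather than rows, every selected subject keeps all four of its observations, regardless of how many subjects are drawn.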
