How can I print the p-value with 2 significant figures?

When I print the p-value from my t.test by doing:
ttest_bb[3]
it returns the full p-value. How can I make it print only the first two decimal places? i.e. .03 instead of .034587297?

The output from t.test is a list. If you use single-bracket [ to grab the p-value, what is returned is a list with one element. Use double-bracket [[ to extract the element held at that spot in the list if you want to treat it as a numeric vector.
> ttest_bb <- t.test(rnorm(20), rnorm(20))
> ttest_bb
Welch Two Sample t-test
data: rnorm(20) and rnorm(20)
t = -2.5027, df = 37.82, p-value = 0.01677
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4193002 -0.1498456
sample estimates:
mean of x mean of y
-0.3727489 0.4118240
> # Notice that what is returned when subsetting like this is
> # a list with the name p.value
> ttest_bb[3]
$`p.value`
[1] 0.01676605
> # If we use double brackets then it extracts just the vector contained
> ttest_bb[[3]]
[1] 0.01676605
> # If you try to round that one-element list you get an error:
> round(ttest_bb[3])
Error in round(ttest_bb[3]) :
non-numeric argument to mathematical function
> # If you use double brackets you can use that value
> round(ttest_bb[[3]],2)
[1] 0.02
> # I prefer extracting by name to make it clearer what you're grabbing
> ttest_bb$p.value
[1] 0.01676605
> round(ttest_bb$p.value, 2)
[1] 0.02
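Side note: round(x, 2) keeps two decimal places, but the title asks for two significant figures; for that, base R's signif (or format.pval, for printing) is the better fit. With the value .034587297 from the question, signif would give .035 rather than .03. A quick check with the p-value above:
> signif(ttest_bb$p.value, 2)
[1] 0.017
> format.pval(ttest_bb$p.value, digits = 2)
[1] "0.017"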

Related

extract rownames and column names from correlation matrix using a specific value

My aim is to eliminate duplicates from a dataset. For that I wrote a program that calculates correlations.
I want to take the names of the variables that have a correlation higher than a specific value I determine.
Here's one of the results I got:
M926T709 M927T709_1 M927T709_2 M929T709
M926T709 1.0000000 0.9947082 0.9879702 0.8716944
M927T709_1 0.9947082 1.0000000 0.9955145 0.8785669
M927T709_2 0.9879702 0.9955145 1.0000000 0.8621052
M929T709 0.8716944 0.8785669 0.8621052 1.0000000
Let's say I want to obtain the names of the variables that have a correlation higher than 0.95,
so I should obtain this result:
M926T709, M927T709_1, M927T709_2
Edit: the answer given by Ronak Shah worked well, but I need to obtain the result as a vector so I can use the names afterwards.
Note that I shouldn't compare variables with themselves: those correlations always equal 1.
Please tell me if you need any clarification; also tell me if you want to see my entire program.
Using rowSums and colSums you can count how many values are more than 0.95 in each row and column respectively, and then return the names:
tmp <- mat > 0.95
diag(tmp) <- FALSE
names(Filter(function(x) x > 0, rowSums(tmp) > 0 | colSums(tmp) > 0))
#[1] "M926T709" "M927T709_1" "M927T709_2"
Sample data: the limit and the correlation matrix m (with an added negative correlation for demonstration purposes):
limit <- 0.95
m <- as.matrix( read.table(text = "
M926T709 M927T709_1 M927T709_2 M929T709
M926T709 1.0000000 -0.9947082 0.9879702 0.8716944
M927T709_1 -0.9947082 1.0000000 0.9955145 0.8785669
M927T709_2 0.9879702 0.9955145 1.0000000 0.8621052
M929T709 0.8716944 0.8785669 0.8621052 1.0000000"))
Create a subset of the desired matrix and extract the row/column names.
Target <- unique( # Remove any duplicates
unlist( # merge subvectors of the `dimnames` list into one
dimnames( # gives all names of rows and columns of the matrix below
# Create a subset of the matrix that ignores correlations < limit
m[rowSums(abs(m) * upper.tri(m) > limit) > 0, # Rows
colSums(abs(m) * upper.tri(m) > limit) > 0] # Columns
),
recursive = FALSE))
Target
#> [1] "M926T709" "M927T709_1" "M927T709_2"
Created on 2021-10-25 by the reprex package (v2.0.1)

How to simplify code in R (normality test): different sample sizes in 1 line or 2 lines of code?

I want to write my normality tests a little more cleanly and run a simulation (repeating each test 1000 times).
sample <- c(10,30,50,100,500)
shapiro.test(rnorm(sample))
Shapiro-Wilk normality test
data: rnorm(sample)
W = 0.90644, p-value = 0.4465
This only gives one output, as you can observe above. How do I get 5 outputs? Is there something I am missing here?
Using the replicate function gives me 1000 statistics per sample size, while I am only interested in the p-values and comparing them to a significance level. For the individual normality tests I used the following code (thanks to user StupidWolf, from my previously posted questions on Stack Overflow):
replicate_sw10 = replicate(1000,shapiro.test(rnorm(10)))
table(replicate_sw10["p.value",]<0.10)/1000
# which gave the following output
FALSE  TRUE
0.896 0.104
You may simply use $p.value. (As an aside, the reason shapiro.test(rnorm(sample)) gave a single result is that rnorm(n) with a vector n draws length(n) numbers, here just 5.) The code below yields a matrix with 1,000 rows for the repetitions and 5 columns for the sample sizes. If you want a list as the result, just use lapply instead of sapply.
smpl <- c(10, 30, 50, 100, 500)
set.seed(42) ## for sake of reproducibility
res <- sapply(smpl, function(x) replicate(1e3, shapiro.test(rnorm(x))$p.value))
head(res)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.43524553 0.5624891 0.02116901 0.8972087 0.8010757
# [2,] 0.67500688 0.1417968 0.03722656 0.7614192 0.7559309
# [3,] 0.52777713 0.6728819 0.67880178 0.1455375 0.7734797
# [4,] 0.55618980 0.1736095 0.69879316 0.4950400 0.5181642
# [5,] 0.93774782 0.9077292 0.58930787 0.2687687 0.8435223
# [6,] 0.01444456 0.1214157 0.07042380 0.4479121 0.7982574
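Since the end goal is comparing p-values against a significance level, the per-column rejection rates follow in one line (using the res matrix above and the 0.10 cutoff from the question):
colMeans(res < 0.10) # one proportion per sample size; each should hover around 0.10 under the null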
Using the purrr package:
library(purrr)
map(sample, function(x) shapiro.test(rnorm(x)))
which gives
[[1]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.92567, p-value = 0.4067
[[2]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.95621, p-value = 0.247
[[3]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.96144, p-value = 0.1021
[[4]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.98654, p-value = 0.4077
[[5]]
Shapiro-Wilk normality test
data: rnorm(x)
W = 0.99597, p-value = 0.2324
Edit: after your edit you are asking for a table like the one above. This doesn't work the way it did in your replicate_sw10 example, because that was a matrix, whereas map (or lapply, for that matter) returns a list. So again, use map to apply the same transformation to every part of the list.
replicate_swall <- map(sample, function(x) replicate(1000, shapiro.test(rnorm(x))))
replicate_pvalue_extract <- map(replicate_swall, function(x) x["p.value", ]) %>% unlist(recursive = FALSE)
table(replicate_pvalue_extract < 0.10) / length(replicate_pvalue_extract)
This will give you:
FALSE TRUE
0.896 0.104
Another option is using the magrittr package for the extraction; magrittr::extract is just [, and the empty third argument selects all columns, exactly like x["p.value", ]. Your code will then look like:
replicate_pvalue_extract <- map(replicate_swall, magrittr::extract, "p.value", ) %>% unlist(recursive = FALSE)
table(replicate_pvalue_extract < 0.10) / length(replicate_pvalue_extract)
In the code above I assumed that you want to divide the table by the total number of replicates, regardless of the input (by input I mean the sample size: 10, 30, 50, 100, or 500). If you do care about the input, you can keep the tables separate; I give the code below. Also note that I used length rather than your hardcoded /1000. This makes the code far more generic: if you change the number of replicates, the divisor updates automatically. Otherwise you would have to make the change in multiple places (especially if someone else uses your code), which easily leads to mistakes.
replicate_pvalue_extract <- map(replicate_swall, function(x) x["p.value", ])
map(replicate_pvalue_extract, function(x) table(x < 0.10) / length(x))
Or you can combine them:
map(map(replicate_swall, function(x) x["p.value",]), function(x) table(x < 0.10) / length(x))
This is why I gave you the magrittr option, as I do not like the function(x) twice. With magrittr it would look like:
map(map(replicate_swall, magrittr::extract, "p.value", ), function(x) table(x < 0.10) / length(x))
which would result in:
[[1]]
FALSE TRUE
0.896 0.104
[[2]]
FALSE TRUE
0.889 0.111
[[3]]
FALSE TRUE
0.904 0.096
[[4]]
FALSE TRUE
0.9 0.1
[[5]]
FALSE TRUE
0.891 0.109
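For what it's worth, the whole simulation can also be written in a single pass (a sketch; sample_sizes is a stand-in name chosen to avoid masking base::sample):
library(purrr)
sample_sizes <- c(10, 30, 50, 100, 500)
map(sample_sizes, function(n) {
  p <- map_dbl(1:1000, ~ shapiro.test(rnorm(n))$p.value) # 1000 p-values per sample size
  table(p < 0.10) / length(p) # rejection rate at the 10% level
})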

Conditional rounding with IF statement

I'm trying to round the numeric values in a data frame to the closest interval, where the interval depends on how big the number is. I've started with this (coming from an Excel mindset) but I'm stuck translating it to R code.
Note: round_any rounds a number to the closest interval (e.g. 5.13 -> 5, 5.85 -> 6).
library(plyr)
DataFrame <- sapply(DataFrame, function(x) {
if(x>1) round_any(x,0.25),
if(x>5) round_any(x,0.5),
if(x>10) round_any(x,1),
else x})
Could you please help me out?
When using sapply on a data frame, you are iterating over the column vectors rather than individual values. As such, you should be looking at vectorized conditional logic functions: just using the standard if control flow isn't terribly useful, as it can only take scalar (length 1) conditions.
In this case, plyr::round_any can take a vector as the accuracy argument; the dplyr function case_when could be useful here. From ?case_when:
This function allows you to vectorise multiple if and else if
statements. It is an R equivalent of the SQL CASE WHEN statement.
Here's an example for the case of a single vector to be rounded:
set.seed(11)
# Generate some raw numbers
x <- runif(8, max = 20)
print(x, digits = 4)
#> [1] 5.54500 0.01037 10.21217 0.28096 1.29380 19.09698 1.72992 5.79950
# Round to differing accuracy
plyr::round_any(
x,
dplyr::case_when(
x > 10 ~ 1.0,
x > 5 ~ 0.50,
x > 1 ~ 0.25,
TRUE ~ 0.001
)
)
#> [1] 5.500 0.010 10.000 0.281 1.250 19.000 1.750 6.000
Created on 2018-05-11 by the reprex package (v0.2.0).
Thank you all for your help. Based on your responses, the following code worked for my data frame:
library(plyr)
library(dplyr)
DataFrame[] <- lapply(DataFrame, function(x) {
  round_any(x,
            case_when(
              x > 10 ~ 1.0,
              x > 5 ~ 0.50,
              x > 1 ~ 0.25,
              TRUE ~ 0.001))
})
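As a quick sanity check, the same approach on a small made-up data frame (DataFrame and its columns here are purely illustrative):
library(plyr)
library(dplyr)
DataFrame <- data.frame(a = c(0.4567, 3.1416), b = c(7.77, 12.30))
DataFrame[] <- lapply(DataFrame, function(x) {
  round_any(x, case_when(x > 10 ~ 1.0,
                         x > 5 ~ 0.50,
                         x > 1 ~ 0.25,
                         TRUE ~ 0.001))
})
DataFrame
#       a  b
# 1 0.457  8
# 2 3.250 12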

Standard Chi Squared Test in R?

I have samples of observation counts for 4 genotypes in a single-copy region. What I want to do is calculate the allele frequencies of these genotypes, and then test whether these frequencies deviate significantly from the expected values of 25%:25%:25%:25% using a chi-squared test in R.
So far I have:
> a <- c(do.call(rbind, strsplit(as.character(gdr18[1,9]), ",")), as.character(gdr18[1,8]))
> a
[1] "27" "30" "19" "52"
Next I get total count:
> sum <- as.numeric(a[1]) + as.numeric(a[2]) + as.numeric(a[3]) + as.numeric(a[4])
> sum
[1] 128
Now frequencies:
> af1 <- as.numeric(a[1])/sum
> af2 <- as.numeric(a[2])/sum
> af3 <- as.numeric(a[3])/sum
> af4 <- as.numeric(a[4])/sum
> af1
[1] 0.2109375
> af2
[1] 0.234375
> af3
[1] 0.1484375
> af4
[1] 0.40625
Here I am lost now. I want to know if af1, af2, af3 and af4 deviate significantly from 0.25, 0.25, 0.25 and 0.25
How do I do this in R?
Thank you,
Adrian
EDIT:
Alright, I am trying chisq.test() as suggested:
> p <- c(0.25,0.25,0.25,0.25)
> chisq.test(af, p=p)
Chi-squared test for given probabilities
data: af
X-squared = 0.146, df = 3, p-value = 0.9858
Warning message:
In chisq.test(af, p = p) : Chi-squared approximation may be incorrect
What is the warning message trying to tell me? Why would the approximation be incorrect?
To test this methodology, I picked values far from expected 0.25:
> af=c(0.001,0.200,1.0,0.5)
> chisq.test(af, p=p)
Chi-squared test for given probabilities
data: af
X-squared = 1.3325, df = 3, p-value = 0.7214
Warning message:
In chisq.test(af, p = p) : Chi-squared approximation may be incorrect
In this case the H0 is still not rejected, even though the values are pretty far off from the expected 0.25 values.
The warning arises because chisq.test expects raw counts, not proportions: with af as input, the expected cell counts are around 0.25, far below the usual rule of thumb of at least 5, so the chi-squared approximation is poor (this is also why your deliberately far-off frequencies still looked insignificant). Test the observed counts directly:
observed <- c(27,30,19,52)
chisq.test(observed)
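With expected counts of 128/4 = 32 per cell, this gives X-squared = 598/32:
Chi-squared test for given probabilities
data: observed
X-squared = 18.688, df = 3, p-value = 0.0003172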
That is, counts this uneven, or ones more extreme, would arise by chance alone only about 0.03% of the time (p = 0.0003172).
If your null hypothesis is not a 25:25:25:25 distribution across the four categories, but, say, the question is whether these data depart significantly from a 3:3:1:9 expectation, you need to supply the expected proportions explicitly:
expected <- sum(observed) * c(3, 3, 1, 9) / 16 # expected counts under 3:3:1:9
chisq.test(observed, p = c(3, 3, 1, 9), rescale.p = TRUE)
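For these data the expected counts work out to 24, 24, 8, and 72; rescale.p = TRUE normalizes c(3, 3, 1, 9) into probabilities summing to 1:
expected
#[1] 24 24  8 72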

quick standard deviation with weights

I wanted a function that would quickly give me the standard deviation of a vector and allow me to include weights for the elements of the vector, i.e.
sd(c(1,2,3)) #weights all equal 1
#[1] 1
sd(c(1,2,3,3,3)) #weights equal 1,1,3 respectively
#[1] 0.8944272
For weighted means I can use wt.mean() from library(SDMTools), e.g.
> mean(c(1,2,3))
[1] 2
> wt.mean(c(1,2,3),c(1,1,1))
[1] 2
>
> mean(c(1,2,3,3,3))
[1] 2.4
> wt.mean(c(1,2,3),c(1,1,3))
[1] 2.4
but the wt.sd function does not seem to provide what I thought I wanted:
> sd(c(1,2,3))
[1] 1
> wt.sd(c(1,2,3),c(1,1,1))
[1] 1
> sd(c(1,2,3,3,3))
[1] 0.8944272
> wt.sd(c(1,2,3),c(1,1,3))
[1] 1.069045
I am expecting a function that returns 0.8944272 for my weighted sd. Preferably I would be using this on a data frame like:
data.frame(x = c(1, 2, 3), w = c(1, 1, 3))
Take the square root of Hmisc's weighted variance; by default wtd.var treats the weights as frequency weights (normwt = FALSE), which matches the sd(rep(x, w)) behaviour you expect:
library(Hmisc)
sqrt(wtd.var(1:3, c(1, 1, 3)))
#[1] 0.8944272
You can use rep to replicate the values according to their weights. Then, sd can be computed for the resulting vector.
x <- c(1, 2, 3) # values
w <- c(1, 1, 3) # weights
sd(rep(x, w))
[1] 0.8944272
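Note that both answers here assume whole-number frequency weights (rep cannot repeat a value a fractional number of times). A small helper covers fractional weights too; this is a sketch, and wsd is a made-up name:
# Frequency-weighted SD with Bessel's correction; agrees with sd(rep(x, w)) for integer w
wsd <- function(x, w) {
  m <- sum(w * x) / sum(w) # weighted mean
  sqrt(sum(w * (x - m)^2) / (sum(w) - 1))
}
wsd(c(1, 2, 3), c(1, 1, 3))
#[1] 0.8944272
# and on the data frame layout from the question:
df <- data.frame(x = c(1, 2, 3), w = c(1, 1, 3))
with(df, wsd(x, w))
#[1] 0.8944272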
