Getting estimate and p-value into dataframe - r

I am fairly new to R. My data looks something like this (only with 9000 columns and 66 rows)
Time <- c(0, 6.4, 8.6, 15.2, 19.4, 28.1, 42.6, 73, 73, 85, 88, 88, 88, 88, 88)
ID1 <- c(55030, 54539, 54937, 48897, 58160, 54686, 55393, 47191, 39805, 37601, 51328, 28882, 45587, 60061, 31892, 28670)
ID2 <- c(20485, 11907, 10571, 20974, 10462, 11149, 20970, NA, NA, 9295, NA, 8714, 24446, 10748, 9037, 11859)
ID3 <- c(93914, 44482, 43705, 51144, 49485, 43908, 44324, 37342, 18872, 39660, 61673, 43837, 36528, 44738, 41648, 11100)
DF <- data.frame (Time, ID1, ID2, ID3)
I want to get a data frame that looks like this:
ID1, rho, p-value
ID2, rho, p-value
...
The rho and the p-value would be the results of a Spearman cor.test between Time and each ID column.
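For a single ID, the call I have in mind looks like this (with tied Time values, cor.test warns that it cannot compute an exact p-value):
test <- cor.test(DF$Time, DF$ID1, method = "spearman")
test$estimate  # rho
test$p.value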
Among other things I've tried this:
results <- data.frame(ID = "", Estimate = "", P.value = "")
estimates = numeric(16)
pvalues = numeric(16)
for (i in 2:4){
  test <- cor.test(DF[,1], DF[,i])
  estimates[i] = test$estimate
  pvalues[i] = test$p.value
}
And R gives me the following error:
Error: object 'test' not found
I've also tried:
result <- do.call(rbind, lapply(2:4, function(x) {
  cor.result <- cor.test(DF[,1], DF[,x])
  pvalue <- cor.result$p.value
  estimate <- cor.result$estimate
  return(data.frame(pvalue = pvalue, estimate = estimate))
}))
And R gives me a similar error
Error: object 'cor.result' not found
I'm sure it's an easy fix but I can't seem to figure it out. Any help is more than welcome.
This is what I got after running
dput(head(SmallDataset[,1:5]))
structure(list(Species = c("Human.hsapiens", "Chimpanzee.ptroglodytes",
"Gorilla.ggorilla", "Orangutan.pabelii", "Gibbon.nleucogenys",
"Macaque.mmulatta"), Time = c(0, 6.4, 8.61, 15.2, 19.43, 28.1
), ID1 = c(55030, 54539, 54937, 48897, 58160, 54686), ID2 = c(20485,
11907, 10571, 20974, 10462, 11149), ID3 = c(93914, 44482, 43705,
51144, 49485, 43908)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))

My solution involves defining a function within a lapply call
library(dplyr)

## Create the example data frame
Time <- c(0, 6.4, 8.6, 15.2, 19.4, 28.1, 42.6, 73, 73, 85, 88, 88, 88, 88, 88, 89)
ID1 <- c(55030, 54539, 54937, 48897, 58160, 54686, 55393, 47191, 39805, 37601, 51328, 28882, 45587, 60061, 31892, 28670)
ID2 <- c(20485, 11907, 10571, 20974, 10462, 11149, 20970, NA, NA, 9295, NA, 8714, 24446, 10748, 9037, 11859)
ID3 <- c(93914, 44482, 43705, 51144, 49485, 43908, 44324, 37342, 18872, 39660, 61673, 43837, 36528, 44738, 41648, 11100)
DF <- data.frame(Time, ID1, ID2, ID3)

## Run the correlations
l2 <- lapply(2:4, function(i) cor.test(DF$Time, DF[, i]))

## Extract the estimate and p-value from each test
l3 <- lapply(l2, function(i){
  tibble(estimate = i$estimate,
         p_value = i$p.value)
})

## Combine into a single data frame with an ID column
l4 <- bind_rows(l3) %>% mutate(ID = paste0("ID", 1:3))
l4

Consider building a list of data frames with lapply (an iteration function similar to a for loop, but one that returns a list of the same length as its input). Afterwards, row-bind all the data frame elements together:
results <- lapply(2:4, function(i){
  test <- cor.test(DF[,1], DF[,i])
  data.frame(ID = names(DF)[i],
             estimate = unname(test$estimate),
             pvalues = unname(test$p.value))
})
final_df <- do.call(rbind, results)
final_df
# ID estimate pvalues
# 1 ID1 -0.6238591 0.009805341
# 2 ID2 -0.2270515 0.455676037
# 3 ID3 -0.4964092 0.050481533
NOTE: Your posted data for Time is missing an observation and so cannot immediately be combined with the other vectors in data.frame(). To resolve this, I appended a sixth 88 at the end:
Time <- c(0, 6.4, 8.6, 15.2, 19.4, 28.1, 42.6, 73, 73, 85, 88, 88, 88, 88, 88, 88)
Using posted SmallDataset:
SmallDataset <- structure(...)
results <- lapply(3:5, function(i){
  test <- cor.test(SmallDataset$Time, SmallDataset[[i]])  # [[ ]] extracts a plain vector, since SmallDataset is a tibble
  data.frame(ID = names(SmallDataset)[i],
             estimate = unname(test$estimate),
             pvalues = unname(test$p.value))
})
final_df <- do.call(rbind, results)
final_df
# ID estimate pvalues
# 1 ID1 0.03251407 0.9512461
# 2 ID2 -0.41733336 0.4103428
# 3 ID3 -0.60732484 0.2010166
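Note that cor.test() defaults to Pearson's correlation, so the estimates above are Pearson's r. Since the question asks for Spearman's rho, add method = "spearman" to the call inside lapply, e.g.:
test <- cor.test(DF[,1], DF[,i], method = "spearman")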

Related

Running my multivariable function for several vectors?

I have the following dataframe with 6 columns and several thousand rows.
Example:
Screenshot of example data
Each column represents a different timepoint (0, 1, 3, 6, 9, 12). I want to calculate the area under the curve for each row of values.
For example, for row 1 I would use the following function from the DescTools package:
x=c(0,1,3,6,9,12)
y=c(130, 125, 120, 115, 108, 115)
AUC(x, y, method = c("linear"), na.rm=FALSE)
Is there a way to create a new variable which is the AUC for each row from my dataframe?
Thanks!
We can use apply with MARGIN = 1 to do this rowwise:
library(DescTools)
# only rows with more than 2 non-NA values get an AUC
i1 <- rowSums(!is.na(df1)) > 2
df1$AUC[i1] <- apply(df1[i1, ], 1, FUN = function(y)
  AUC(x, y, method = "linear", na.rm = FALSE))
df1$AUC[i1]
[1] 1394 1518
data
df1 <- structure(list(col1 = c(130, 140), col2 = c(125, 137), col3 = c(120,
125), col4 = c(115, 120), col5 = c(108, 125), col6 = c(115, 130
)), class = "data.frame", row.names = c(NA, -2L))
x <- c(0,1,3,6,9,12)

Summing ranks for variable with fewest entries

I am learning R and want to manually compute the Mann-Whitney U statistic and p-value using a normal approximation (and not use wilcox.test or equivalent). My pensioner's brain struggles with coding so it has taken me hours to produce the same answers as the textbook. However, my code to sum the 'StateRank' for the state with the fewest values is convoluted. How can I replace the commented section with more efficient code? I've hunted high and low, both here and on Google, but I don't even know which search terms to use! It won't surprise me to hear that there is a one-line solution but I'm no nearer knowing what it is.
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
  Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
  Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
  pivot_longer(
    cols = c("Alaska", "Calif"),
    names_to = "State",
    values_to = "Value",
    values_drop_na = TRUE
  ) %>%
  mutate(StateRank = rank(Value, ties.method = "average"))
# clumsy code to sort, then sum ranks (StateRank) for group with fewest values (nA)
#--------------------------------------------------------------------------------
asc_or_desc <- as.matrix(count(a.df, State))
if (as.numeric(asc_or_desc[1,2]) > as.numeric(asc_or_desc[2,2])) {
  a.df <- arrange(a.df, desc(State))
} else {
  a.df <- arrange(a.df, State)
}
#--------------------------------------------------------------------------------
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
Please try this code and tell me if I am on the right track.
I replaced your 'clumsy' code with this:
... %>%
  group_by(State) %>%
  mutate(mx = max(Value)) %>%
  arrange(desc(mx), desc(Value)) %>%
  select(-mx)
The whole code:
library(tidyverse)
# Activity 9: aboriginal village size in Alaska and California
a.df <- data.frame(
  Alaska = c(23, 26, 30, 33, 42, 45, 45, 50, 50.5, 96, 113, 557, NA),
  Calif = c(39, 48, 53.5, 55, 57, 66, 77, 79, 108, 121, 162, 197, 309)
) %>%
  pivot_longer(
    cols = c("Alaska", "Calif"),
    names_to = "State",
    values_to = "Value",
    values_drop_na = TRUE
  ) %>%
  mutate(StateRank = rank(Value, ties.method = "average")) %>%
  group_by(State) %>%
  mutate(mx = max(Value)) %>%
  arrange(desc(mx), desc(Value)) %>%
  select(-mx)
-----------------------------------------------------------------------------
# nA and nB as defined in your original code, since they are still needed here
nA <- as.numeric(min(count(a.df, State, sort = TRUE)$n))
nB <- as.numeric(max(count(a.df, State, sort = TRUE)$n))
a.U <- sum(a.df$StateRank[1:nA])
a.E <- (nA*(nA+nB+1))/2 # Expectation of U
a.V <- (nA*nB*(nA+nB+1))/12 # Variance of U
a.Z <- (a.U - a.E)/sqrt(a.V)
a.P <- round((1 - round(pnorm(round(abs(a.Z), 2),
mean = 0, sd = 1) ,4)) * 2, 3)
# all the rounding is to mimic statistical tables (so that
# the answer is the same as in the textbook that I use)
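For reference, the rank sum for the smaller group can also be taken directly, without any reordering; a minimal sketch:
## sum the ranks of whichever State has the fewest observations
smaller <- names(which.min(table(a.df$State)))
a.U <- sum(a.df$StateRank[a.df$State == smaller])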

Intersecting row values from many R dataframes and calculating averages of the corresponding values

Below is the example:
df1 <- data.frame("names" = c('John','Peter','Jolie'), "value1" = c(21, 24, 26), "value2" = c(20, 23, 32))
df2 <- data.frame("names" = c('Sam','John','Jolie'), "value1" = c(35, 11, 10), "value2" = c(10, 28, 27))
df3 <- data.frame("names" = c('Louis','Jolie','John'), "value1" = c(42, 74, 26), "value2" = c(26, 53, 54))
df4 <- data.frame("names" = c('Ale','John','Jolie'), "value1" = c(61, 34, 76), "value2" = c(28, 63, 38))
df5 <- data.frame("names" = c('John','Jolie','peter'), "value1" = c(11, 84, 86), "value2" = c(50, 13, 68))
intersect_names <- Reduce(intersect, list(df1$names,df2$names,df3$names,df4$names,df5$names))
Using the Reduce and intersect commands, I can get the intersection of all the names. But I want the corresponding mean of value1 and value2 for each of these names across the dataframes.
Expected Output dataframe:
names Value1 Value2
John 20.6 43
Jolie 54 32.6
For example, the value 20.6 was obtained by taking mean(c(21, 11, 26, 34, 11)).
We can create a list of dataframes, extract the rows matching intersect_names, and take the mean for each name.
list_df <- mget(paste0('df', 1:5))
intersect_names <- Reduce(intersect, lapply(list_df, `[[`, 'names'))
aggregate(. ~ names, do.call(rbind, lapply(list_df, function(x)
  x[x$names %in% intersect_names, ])), mean)
The same using tidyverse functions:
library(dplyr)
library(purrr)
map_df(list_df, ~.x %>% filter(names %in% intersect_names)) %>%
  group_by(names) %>%
  summarise(across(.fns = mean))
# names value1 value2
# <chr> <dbl> <dbl>
#1 John 20.6 43
#2 Jolie 54 32.6

Obtain common values from the same column of 5 different dataframes

I am struggling to extract the common values from a specific column shared by 5 different dataframes. I know how to do this with two, but not with more.
df1$ID<-c(121, 122, 176)
df2$ID<-c(121, 88, 199)
df3$ID<-c(77, 121, 230)
df4$ID<-c(6, 88, 121)
df5$ID<-c(121, 122, 123)
In this example, my desired output would be:
result<-c(121)
Thanks!
We can get all the datasets in a list and then use intersect
Reduce(intersect, lapply(mget(paste0('df', 1:5)), `[[`, 'ID'))
#[1] 121
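Reduce() folds intersect() over the list pairwise, so the call above is equivalent to:
intersect(intersect(intersect(intersect(df1$ID, df2$ID), df3$ID), df4$ID), df5$ID)
#[1] 121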
Or using purrr
library(purrr)
library(stringr)
library(dplyr)
mget(paste0('df', 1:5)) %>%
  map(~ .x %>% pull(ID)) %>%
  reduce(intersect)
#[1] 121
data
df1 <- data.frame(ID = c(121, 122, 176))
df2 <- data.frame(ID = c(121, 88, 199))
df3 <- data.frame(ID = c(77, 121, 230))
df4 <- data.frame(ID = c(6, 88, 121))
df5 <- data.frame(ID = c(121, 122, 123))

Weighted mean calculation in R with missing values

Does anyone know if it is possible to calculate a weighted mean in R when some values are missing, with the weights for the existing values scaled upward proportionately?
To convey this clearly, I created a hypothetical scenario. It describes the root of the question: the scaling factor needs to be adjusted for each row, depending on which values are missing.
Image: Weighted Mean Calculation
File: Weighted Mean Calculation in Excel
Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way this could be done:
library(tidyverse)
scores <- tribble(
  ~student, ~test1, ~test2, ~test3,
  "Mark",       90,     91,     92,
  "Mike",       NA,     79,     98,
  "Nick",       81,     NA,     83)
weights <- tribble(
  ~test, ~weight,
  "test1",   0.2,
  "test2",   0.4,
  "test3",   0.4)
scores %>%
  gather(test, score, -student) %>%
  left_join(weights, by = "test") %>%
  group_by(student) %>%
  summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#> student result
#> <chr> <dbl>
#> 1 Mark 91.20000
#> 2 Mike 88.50000
#> 3 Nick 82.33333
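Note that weighted.mean() divides by the sum of whichever weights remain after dropping the missing scores, which is exactly the proportional rescaling described in the question. For Mike, for example:
(0.4 * 79 + 0.4 * 98) / (0.4 + 0.4)
#> [1] 88.5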
The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean from the base package stats together with sapply for this. Note that if your datasets of scores and weights are R objects of class matrix, you will not need unlist.
sapply(seq_len(nrow(dat)), function(i){
  weighted.mean(unlist(dat[i, ]), unlist(w[i, ]), na.rm = TRUE)
})
