Fisher's exact test on values from large dataframe and bypassing errors - r

I have a dataframe which is 214 columns long and many rows long, and I want to perform a fisher's exact test for each row using values from 4 columns.
An example subset of relevant information from my dataframe looks like:
Variant DB.count.1 DB.count.2 pop.count.1 pop.count.2
A 23 62 35 70
B 81 4 39 22
C 51 42 49 52
D NA NA 65 8
E 73 21 50 33
F 72 13 81 10
G 61 32 75 21
H NA NA 42 22
I NA NA 60 20
J 80 12 72 24
I am trying to use a for loop to:
create a contingency table for each row for the Fisher's exact test to compare DB.counts to pop.counts
run a Fisher's exact test using this contingency table to determine if there is a difference between DB.counts and pop.counts
output the p-value result to a new column on my dataframe
As you can see there are "NA" values in some positions and thus in some contingency tables, obviously this will cause an error, which is ok, but I would like for the code to output a value to the column when it encounters this error such as "." or "error" and skip to the next row/contingency table.
i.e. I would like an output which looks like this:
Variant DB.count.1 DB.count.2 pop.count.1 pop.count.2 fishers
A 23 62 35 70 0.4286
B 81 4 39 22 <0.0001
C 51 42 49 52 0.3921
D NA NA 65 8 error
E 73 21 50 33 0.0143
F 72 13 81 10 0.5032
G 61 32 75 21 0.0744
H NA NA 42 22 error
I NA NA 60 20 error
J 80 12 72 24 0.0425
The code I currently have (based on R loop over Fisher test - Error message) is:
df$fishers" <- for (i in 1:nrow(df))
{
table <- matrix(c(df[i,4], df[i,5], df[i,2], df[i,3]), ncol = 2, byrow = TRUE)
fisher.test(table, alternative="greater")
}
This seems to create the contingency tables the way I want but the problem of bypassing the errors and printing the p-vlaue to the new column remains. I have tried to use try and tryCatch but have been unsuccessful in doing so.
I am an R beginner so really appreciate any advice on how to improve my questions or any advice for my problem! Thank you!
Edit 1: I have now tried using the data.table package as below and have got what I need from data sets with no "NA" values but how do I skip the errors and make the code continue? Thanks!!!
library(data.table)
dt <- data.table(df)
dt[, p.val := fisher.test(matrix(c(pop.count.1, pop.count.2, DB.count.1, DB.count.2), ncol=2), workspace=1e9)$p.value, by=Variant]
df <- as.data.frame(dt)

You can include an if-else statement in your loop like this:
res <- NULL
for (i in 1:nrow(df)){
table <- matrix(c(df[i,4], df[i,5], df[i,2], df[i,3]), ncol = 2, byrow = TRUE)
# if any NA occurs in your table save an error in p else run the fisher test
if(any(is.na(table))) p <- "error" else p <- fisher.test(table, alternative="greater")$p.value
# save all p values in a vector
res <- c(res,p)
}
df$fishers <- res
Or put the code in a function and use apply instead of a loop:
foo <- function(y){
# include here as.numeric to be sure that your values are numeric:
table <- matrix(as.numeric(c(y[4], y[5], y[2], y[3])), ncol = 2, byrow = TRUE)
if(any(is.na(table))) p <- "error" else p <- fisher.test(table, alternative="greater")$p.value
p
}
df$fishers <- apply(df, 1, foo)

Related

Using a function and mapply in R to create new columns that sums other columns

Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other:
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(ff,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. The function in theory should do the following
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
The assignment is wrong here i.e. instead of having martial_arts$total_boxing as a string, it should be "total_boxing" alone and this should be on the lhs of the Map/mapply. As the OP already created the 'martial_arts$' in 'd' dataset as a column, we are removing the prefix part and do the assignment
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
-check the dataset now
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0

How to create a messy_impute() function that imputes NA values in messy data with mean or median?

I have the following data frame for a student with homework and exam scores.
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 NA 41 61 11 A
The Homework_10 score is missing, and I need to create a function to impute the NA value with mean or median.
The function messy_impute should have the following arguments:
data : data frame or tibble to be imputed.
center : whether to impute using mean or median.
margin : whether to use row or column to input value (1- use row 2-use column).
For example,
messy_impute(student1,mean,1) should print out
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 **62** 41 61 11 A
since the mean of the rest of the homework is 62.
And, if the mean of the columns (other students) in section A for homework 10 is 50, then
messy_impute(student1,mean,2) should print out
> student1
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Homework_6 Homework_7 Homework_8
10 582493224 59 99 88 10 66 90 50 80
Homework_9 Homework_10 Exam_1 Exam_2 Exam_3 Section
10 16 **50** 41 61 11 A
since the mean of columns in section A is 50.
Please note the if the margin is 2, then the calculation should be done with the same section.
I'm really stuck on this defining the function.
Base R solution:
# Define function to Impute a row-wise mean (assumes one observation per student):
row_wise_mean_impute <- function(df){
grade_df <- df[,names(df) != "studid"]
return(cbind(df[,c("studid"), drop = FALSE],
replace(grade_df, is.na(grade_df), apply(grade_df, 1, mean, na.rm = TRUE))))
}
# Apply function:
row_wise_mean_impute(student1)
Data:
x <- c(rnorm(85, 50, 3), rnorm(15, 50, 15))
student1 <- cbind(studid = 1010101, data.frame(t(x)))
student1[, 10] <- NA_real_

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). the problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData2, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completion I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)

saving vectors from a loop in a list in r

Hi I never edited a question of mine but I'll give it a try. It's not soo extremely important what the code means actually. For me only saving the vectors "liste" in a new list is relevant :D
test <- list()
test <- replicate(5, sample(1:100, 50), simplify = FALSE) # Creates a list of 5 vectors
> test[[1]]
[1] 90 96 20 86 32 77 83 33 64 29 88 97 78 81 40 60 89 19 31 59 26 38 34 71 5 80 85
[28] 3 70 87 41 50 6 18 37 58 9 76 91 62 12 30 42 94 72 95 100 10 68 82
S <- test[[1]]
x <- diff(S) # following algorythm creates "liste" (vector) for test [[1]]
trendtest <- list()
k <- NULL
d <- NULL
t <- vector("list",length(x))
A <- vector("list",length(x))
z <- vector("list",length(x)-2)
za <- vector("list",length(x)-2)
liste <- NULL
dreisum <- sapply(1:(length(x)-2), function(i) sum(x[c(i,(i+1))]))
dreisumi <- lapply(1:(length(x)-2), function(i) dreisum[i:(length(x)-2)])
zdreisumi<- lapply(1:(length(x)-4), function(i) dreisumi[[i]] [3:length(dreisumi[[i]])]<0)
zadreisumi<- lapply(1:(length(S)-4), function(i) dreisumi[[i]][3:length(dreisumi[[i]])]>0)
Si <- lapply(1:(length(x)-2), function(i) S[i:(length(x))])
i <- 1
h <- 1
while(i<(length(x)-3) & h!=Inf){
k <- c(k,k <- (S[i]-S[i+2])/(-2))
d <- c(d,d <- (S[i+2]*i-S[i]*(i+2))/(-2))
t[[i]] <- i:(length(x))
A[[i]] <- k[length(liste)+1]*t[[i]]+d[length(liste)+1]
A[[i]][3] <- S[i+2]
z[[i]] <- Si[[i]][3:length(Si[[i]])]<A[[i]][3:length(A[[i]])]
za[[i]] <- Si[[i]][3:length(Si[[i]])]>A[[i]][3:length(A[[i]])]
if(k[length(liste)+1]>0 & S[i+3]>A[[i]][4] & is.element(TRUE,z[[i]])){h <- (min(which(z[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]>0 & S[i+3]<A[[i]][4] & is.element(TRUE,za[[i]])){h <- (min(which(za[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]<0 & S[i+3]>A[[i]][4] & is.element(TRUE,z[[i]])){h <- (min(which(z[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]<0 & S[i+3]<A[[i]][4] & is.element(TRUE,za[[i]])){h <- (min(which(za[[i]]!=FALSE))+1)}else{
if(k[length(liste)+1]>0 & S[i+3]>A[[i]][4] & (all(z[[i]]==FALSE))){h <- (min(which(zdreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]>0 & S[i+3]<A[[i]][4] & (all(za[[i]]==FALSE))){h <- (min(which(zdreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]<0 & S[i+3]>A[[i]][4] & (all(z[[i]]==FALSE))){h <- (min(which(zadreisumi[[i]]!=FALSE))+2)}else{
if(k[length(liste)+1]<0 & S[i+3]<A[[i]][4] & (all(za[[i]]==FALSE))){h <- (min(which(zadreisumi[[i]]!=FALSE))+2)}}}}}}}}
liste <- c(liste,i)
i <- i+h-1
if((length(x)-3)<=i & i<=length(x)){liste <- c(liste,i)}}
> liste
[1] 1 3 7 10 12 16 18 20 24 27 30 33 36 39 41 46
Actually the whole code is not so interesting for my problem because it works! I made the example for test[[1]] now. BUT I want that a for-loop (or whatever) takes ALL vectors in "test" and saves ALL 5 vectors "liste" in a new list (lets call it "trendtest" ... whatever :D)
The following will do what you ask for:
Delete the line trendtest <- list().
Take the code from x <- diff(S) to last line (except the very last line that only prints liste) and insert it at the position indicated by the placeholder __CODE_HERE__.
trendtest <- lapply(test, FUN = function(S) {
__CODE_HERE__
return(liste)
})
This is the "R way" of doing what you want. Alternatively, you could do the following (which is closer to your initial approach, but less the "R way"):
trendtest <- vector("list", length(test))
for (u in 1:length(test)) { # better: u in seq_along(test)
S <- test[[u]]
__CODE_HERE__
trendtest[[u]] <- liste
}
Note that there will be an error message which is due to the sample data (which doesn't fit the algorithm provided) and is unrelated to saving liste in trendtest.

Apply over all columns and rows of two diffrent dataframes in R

I try to apply a function over all rows and columns of two dataframes but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way i tried to solve it. Any advice would be warmly appreciated! Please note, that the simplefunction is only intended to be an example function to keep it simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply on a single row
simplefunction(df1[1,2],df2[1,2])
# apply over all colums
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The dataframes will be passed in order to the function, so in this example df1 would be x, df2 would be y and z and etc would be some other dataframes that you specify in that order. Hopefully that makes sense. mapply will take the first row, first column values of all dataframes and apply the function, then the first row, second column of all data frames and apply the function and so on.
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do :
df1[,-1] + df2[,-1]
Which gives :
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45

Resources