How to sum all G-values?

After using the G.test on all rows of my data subset
apply(datamixG +1 , 1, G.test)
I get an output for each row that looks like this
[[1]]
G-test for given probabilities
data: newX[, i]
G = 3.9624, df = 1, p-value = 0.04653
I have 46 rows. I need to sum the df and G-values. Is there a way to have R report the G-values differently and/or sum all of the G-values and df?

I'll assume you're using the G.test function from the RVAideMemoire package:
# Sample data (always a good idea to post!)
dat <- matrix(1:4, nrow=2)
library(RVAideMemoire)
tests <- apply(dat, 1, G.test)
You can use unlist and lapply to extract a single value from each element in a list and to return a vector of the results:
dfs <- unlist(lapply(tests, "[[", "parameter"))
dfs
# df df
# 1 1
sum(dfs)
# [1] 2
Gs <- unlist(lapply(tests, "[[", "statistic"))
Gs
# G G
# 1.0464963 0.6795961
sum(Gs)
# [1] 1.726092
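A minimal sketch of the same extraction with vapply(), which checks that every extracted value is a single number. It assumes the test objects are ordinary "htest" lists with statistic and parameter elements, as the answer above suggests for G.test; chisq.test() from base R is used purely as a stand-in so the example runs without extra packages. The pooled p-value at the end is an illustration that only makes sense if the rows are independent.

```r
# Stand-in data: two rows of counts, large enough to avoid small-sample warnings
dat <- matrix(c(10, 20, 30, 40), nrow = 2)

# chisq.test() is only a stand-in here; it also returns "htest" objects
tests <- lapply(seq_len(nrow(dat)), function(i) chisq.test(dat[i, ]))

# Pull one number out of each test; vapply() enforces "one numeric per test"
stats <- vapply(tests, function(tst) unname(tst$statistic), numeric(1))
dfs   <- vapply(tests, function(tst) unname(tst$parameter), numeric(1))

total_stat <- sum(stats)
total_df   <- sum(dfs)

# Assumption: rows are independent, so the summed statistic is compared
# with a chi-squared distribution on the summed degrees of freedom
pooled_p <- pchisq(total_stat, df = total_df, lower.tail = FALSE)
```

With G.test the only change should be the function passed to lapply(), provided its return value has the same structure.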


Need Help writing a Loop function

I have a huge dataset and created a large correlation matrix. My goal is to clean this up and create a new data frame with all the correlations whose absolute value is greater than 0.25, with the variable names included.
For example, given this data set, how would I use a double nested loop over the rows and columns of the correlation table?
a <- rnorm(10, 0 ,1)
b <- rnorm(10,1,1.5)
c <- rnorm(10,1.5,2)
d <- rnorm(10,-0.5,1)
e <- rnorm(10,-2,1)
matrix <- data.frame(a,b,c,d,e)
cor(matrix)
(notice, that there is redundancy in the matrix. You only need to inspect the first 5
columns; and you don’t need to inspect all rows. If I’m looking at column 3, for example, I
only need to start looking at row 4, after the correlation = 1)
Thank you
Is your ultimate goal to create a 5x5 matrix in which every correlation with an absolute value below 0.25 is set to zero? That can be done on the correlation matrix itself via m <- cor(matrix); m[abs(m) < 0.25] <- 0. If your goal is simply to loop over the rows and columns, this can be done via:
m <- cor(matrix)
for (row in rownames(m)){
for (col in colnames(m)){
#your code here
#operating on m[row,col]
}
}
To avoid redundancy:
for (row in rownames(m)[1:(length(rownames(m))-1)]){
for (col in colnames(m)[(which(colnames(m) == row)+1):length(colnames(m))]){
#your code here
#operating on m[row,col]
print(m[row,col])
}
}
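For comparison, a loop-free sketch of the same idea in base R: upper.tri() selects each pair once, and which(..., arr.ind = TRUE) recovers the row and column positions so the variable names can be attached. The helper name strong_pairs and the hand-built matrix m are illustrative only; in practice m would be the result of cor().

```r
# Hypothetical helper: list each variable pair once, keeping |cor| > cutoff
strong_pairs <- function(m, cutoff = 0.25) {
  keep <- upper.tri(m) & abs(m) > cutoff   # upper triangle skips diagonal and duplicates
  idx  <- which(keep, arr.ind = TRUE)      # two-column matrix of (row, col) positions
  data.frame(var1 = rownames(m)[idx[, "row"]],
             var2 = colnames(m)[idx[, "col"]],
             cor  = m[idx],
             stringsAsFactors = FALSE)
}

# Small hand-built correlation matrix standing in for cor(matrix)
m <- matrix(c( 1.0,  0.5, -0.1,
               0.5,  1.0, -0.3,
              -0.1, -0.3,  1.0),
            nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

res <- strong_pairs(m)
res
```

The result is already a data frame with one row per pair, so no further cleanup of the matrix is needed.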
I'd suggest using the corrr package, in conjunction with tidyr and dplyr.
This allows you to generate a correlation data frame rather than a matrix and remove the duplicate values (where for example a-b is the same as b-a) using the shave function. You can then rearrange by pivoting, remove the NA values (the diagonal, e.g. a-a, and the shaved duplicates) and filter for absolute values greater than 0.25.
library(dplyr)
library(tidyr)
library(magrittr) # for the pipe %>% or just use library(tidyverse) instead of all 3
library(corrr)
# for reproducible values
set.seed(1001)
# no need to make a data frame from vectors
# and don't call it matrix, that's a function name
mydata <- data.frame(a = rnorm(10, 0 ,1),
b = rnorm(10, 1, 1.5),
c = rnorm(10, 1.5, 2),
d = rnorm(10, -0.5, 1),
e = rnorm(10, -2, 1))
mydata %>%
correlate() %>%
shave() %>%
pivot_longer(2:6) %>%
na.omit() %>%
filter(abs(value) > 0.25)
Result:
# A tibble: 4 x 3
term name value
<chr> <chr> <dbl>
1 c b -0.296
2 d b 0.357
3 e a -0.440
4 e d -0.280

Nested for loop leading to: Error in `[<-.data.frame`(`*tmp*`, ...) : replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for(i in dfs){
for(y in tax.dfs){
i[,1] <- y[,2]
}}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows; I actually used dfs to create tax.dfs by applying the tnrs_match_names function from the rotl package.
Any suggestions of how I could fix this error or that help me to find another way to do what I need to will be greatly appreciated.
Thank You!
For what it is worth, to iterate over two objects simultaneously, the following works:
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
e f
1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
if(nrow(one) != nrow(two)){return('Wrong dimensions')}
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
[1] "Wrong dimensions"
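As for why the loop in the question fails: the nested loop pairs every data frame with every taxonomy data frame (birds with mammals, birds with crocs, and so on), which is most likely where the row-count mismatch comes from, and assigning to the loop variable i only changes a copy, not the list. A minimal index-based sketch that pairs the i-th elements and writes back into the list; the two small data frames are made-up stand-ins for the real dfs and tax.dfs:

```r
# Made-up stand-ins: two data frames and their matching taxonomy tables
dfs <- list(data.frame(id = 1:2, v = c(10, 20)),
            data.frame(id = 1:3, v = c(1, 2, 3)))
tax.dfs <- list(data.frame(k = 1:2, name = c("Aotus trivirgatus", "Species two"),
                           stringsAsFactors = FALSE),
                data.frame(k = 1:3, name = c("Species a", "Species b", "Species c"),
                           stringsAsFactors = FALSE))

for (i in seq_along(dfs)) {
  # Pair the i-th data frame only with the i-th taxonomy table
  stopifnot(nrow(dfs[[i]]) == nrow(tax.dfs[[i]]))
  # Assign through dfs[[i]] so the list itself is updated
  dfs[[i]][, 1] <- tax.dfs[[i]][, 2]
}

dfs[[1]]
```

The stopifnot() call plays the same role as the dimension check in the mapply version: it stops at the first pair whose row counts differ.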

Conditions & Subtraction from Matrix in R

I've looked at R create a vector from conditional operation on matrix, and using a similar solution does not yield what I want (and I'm not sure why).
My goal is to evaluate df with the following condition: if df > 2, df -2, else 0
Take df:
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
df is simply:
a b
1 0
2 1
3 2
4 3
5 4
df_final should look like this after a suitable function:
a b
0 0
0 0
1 0
2 1
3 2
I applied the following function with the result, and I'm not sure why it doesn't work (further explanation of a solution would be appreciated)
apply(df,2,function(df){
ifelse(any(df>2),df-2,0)
})
Yielding the following:
a b
-1 -2
Thank you SO community!
Let's fix your function and understand why it didn't work:
apply(df, # apply to df
2, # to each *column* of df
function(df){ # this function. Call the function argument (each column) df
# (confusing because this is the same name as the data frame...)
ifelse( # Looking at each column...
any(df > 2), # if there are any values > 2
df - 2, # then df - 2
0 # otherwise 0
)
})
any() returns a single value. ifelse() returns something the same shape as the test, so by making your test any(df > 2) (a single value), ifelse() will also return a single value.
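The shape rule can be checked on a single column. In this small sketch x plays the role of column a; pmax() is included as an equivalent way to express "subtract 2 but never go below 0".

```r
x <- c(1, 2, 3, 4, 5)

# Length-1 test: only the first element of x - 2 is returned
scalar_result <- ifelse(any(x > 2), x - 2, 0)

# Element-wise test: one result per element of x
vector_result <- ifelse(x > 2, x - 2, 0)

# Same element-wise result without ifelse()
pmax_result <- pmax(x - 2, 0)

scalar_result   # -1
vector_result   # 0 0 1 2 3
```

The -1 here is exactly the value reported for column a in the question.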
Let's fix this by (a) giving the function argument a different name than the data frame (for readability) and (b) getting rid of the any:
apply(df, # apply to df
2, # to each *column* of df
function(x){ # this function. Call the function argument (each column) x
ifelse( # Looking at each column...
x > 2, # when x is > 2
x - 2, # make it x - 2
0 # otherwise 0
)
})
apply is made for working on matrices. When you give it a data frame, the first thing it does is convert it to a matrix. If you want the result to be a data frame, you need to convert it back to a data frame.
Or we can use lapply instead. lapply returns a list, and by assigning it to the columns of df with df[] <- lapply(), we won't need to convert. (And since lapply doesn't do the matrix conversion, it knows by default to apply the function to each column.)
df[] <- lapply(df, function(x) ifelse(x > 2, x - 2, 0))
As a side note, df <- cbind(a,b) %>% as.data.frame() is a more complicated way of writing df <- data.frame(a, b)
Create the 'out' dataset by subtracting 2, then set the values that are negative to 0
out <- df - 2
out[out < 0] <- 0
Or in a single step
(df-2) * ((df - 2) > 0)
Using apply
library(magrittr) # for the pipe %>%
library(tibble)   # for as_tibble()
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
# apply() returns a matrix
new_matrix <- apply(df, MARGIN=2,function(i)ifelse(i >2, i-2,0))
new_matrix
# if you want it to return a tibble/df
new_tibble <- apply(df, MARGIN=2,function(i)ifelse(i >2, i-2,0)) %>% as_tibble()

Removing all subsets from a list

I have a list that looks as follows:
a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
list(a, b, c, d)
From this list I would like to remove all subsets such that the list looks as follows:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
How do I do this? In my actual data I am working with a very long list (> 500k elements) so any suggestions for an efficient implementation are welcome.
Here is an approach.
lst <- list(a, b, c, d) # The list
First, remove all duplicates.
lstu <- unique(lst)
If the list still contains more than one element, we order the list by the lengths of its elements (decreasing).
lstuo <- lstu[order(-lengths(lstu))]
Then subsets can be filtered with this command:
lstuo[c(TRUE, !sapply(2:length(lstuo),
function(x) any(sapply(seq_along(lstuo)[-x],
function(y) all(lstuo[[x]] %in% lstuo[[y]])))))]
The result:
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6
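The same idea can be sketched as a single pass that compares each vector only against the vectors already kept, which avoids checking every pair when many elements are discarded early. This is an illustration, not a benchmarked solution for 500k elements; the function name remove_subsets is made up, and it assumes no vector contains repeated values (otherwise a longer vector could still be a subset of a shorter one).

```r
# Hypothetical helper: drop every vector that is a subset of another one
remove_subsets <- function(lst) {
  lst <- unique(lst)                      # drop exact duplicates
  lst <- lst[order(-lengths(lst))]        # longest first: supersets come before subsets
  keep <- list()
  for (el in lst) {
    # el is redundant if all of its values occur in some vector already kept
    is_sub <- any(vapply(keep, function(k) all(el %in% k), logical(1)))
    if (!is_sub) keep[[length(keep) + 1]] <- el
  }
  keep
}

a <- c(1, 3, 4)
b <- c(0, 2, 6)
c <- c(3, 4)
d <- c(0, 2, 6)
result <- remove_subsets(list(a, b, c, d))
result
```

Because the list is sorted by decreasing length first, a vector can only be a subset of something that was processed before it, so one pass is enough under the stated assumption.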
Alternative approach
Your data
lst <- list(a, b, c, d) # The list
lstu <- unique(lst) # remove duplicates, piggyback Sven's approach
Make matrix of values and index
m <- combn(lstu, 2) # 2-row matrix of non-self pairwise combinations of values
n <- combn(length(lstu), 2) # 2-row matrix of non-self pairwise combination of index
Determine if subset
issubset <- t(sapply(list(c(1,2),c(2,1)), function(z) mapply(function(x,y) all(x %in% y), m[z[1],], m[z[2],])))
Discard subset vectors from list
discard <- c(n*issubset)[c(n*issubset)>0]
ans <- lstu[setdiff(seq_along(lstu), discard)] # also safe when nothing is discarded
Output
[[1]]
[1] 1 3 4
[[2]]
[1] 0 2 6

How to extract 105 variables for calculation in R

I have 7 dataframes of experiments which are each subdivided into 15 repetitions (or iterations). I am now interested in all 105 instances of variable x for calculations later on in the analysis.
Imagine you have the following dataframes with randomized numbers and, for the sake of simplicity, pretend that all dataframes contain different numbers:
set.seed(2)
a <- runif(100, -1.5, 1.5)
b <- pnorm(rnorm(100))
c <- rnorm(100)
d <- rnorm(100)
e <- dnorm(rnorm(100))
iteration <- sort(sample(1:7, 100, replace=T), decreasing=F)
x <- f <- sample(1:1000, 100, replace=T)
df1 <- data.frame(a,b,c,d,e,iteration,x)
df2 <- data.frame(a,b,c,d,e,iteration,x)
df3 <- data.frame(a,b,c,d,e,iteration,x)
df4 <- data.frame(a,b,c,d,e,iteration,x)
df5 <- data.frame(a,b,c,d,e,iteration,x)
df6 <- data.frame(a,b,c,d,e,iteration,x)
df7 <- data.frame(a,b,c,d,e,iteration,x)
How can I break down all 105 combinations of variable x (df1$x of iteration 1, df1$x of iteration 2, ..., df7$x of iteration 7) so that I can calculate the following example nonsense equation for each of them?
mean(df1$x of iteration 1) - sd(df1$x of iteration 1)
mean(df1$x of iteration 2) - sd(df1$x of iteration 2)
...
mean(df7$x of iteration 7) - sd(df7$x of iteration 7)
I have the following command in order to "extract" variable df1$x of iteration 1 but this would involve 208 more lines to come for the remaining variables:
df_1 <- df1[which(df1$iteration=='1'),]
df_1_final <- df_1[grepl("1", df_1$iteration), c(6, 7)]
Does this make sense? Is there a better way to do that in GNU R?
A possibility using dplyr. It is probably easier to work with your data.frames in a list (from comments by @akrun)
library(dplyr)
bind_rows(mget(paste0('df', 1:7))) %>% # put your data.frames in a list -> data.frame
mutate(group=rep(1:7, each=100)) %>% # add a grouping column
group_by(group, iteration) %>% # group
summarise(mean(x) - sd(x)) # do your stuff
or in data.table
library(data.table)
rbindlist(mget(paste0('df', 1:7)))[, mean(x) - sd(x), .(gr = rep(1:7, each = 100), iteration)]
You could create a function for the nonsense equation and then use it in tapply(), with iteration as the INDEX argument, for each data frame. So for df1: tapply(df1$x, INDEX = df1$iteration, nonsenseFunction), which returns an array (or a list, if the function returns more than one value per group) with one result per group (iteration) of df1.
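A minimal sketch of the tapply() suggestion, extended to several data frames with lapply(). The two tiny data frames are made up so the results are easy to check by hand; nonsenseFunction is the hypothetical name used above.

```r
# The "nonsense equation" as a function of one group's x values
nonsenseFunction <- function(v) mean(v) - sd(v)

# Made-up data: two data frames with two iterations each
df1 <- data.frame(iteration = c(1, 1, 1, 2, 2, 2),
                  x         = c(2, 4, 6, 10, 20, 30))
df2 <- data.frame(iteration = c(1, 1, 1, 2, 2, 2),
                  x         = c(1, 2, 3, 5, 5, 5))

# One data frame: one value per iteration
res1 <- tapply(df1$x, INDEX = df1$iteration, FUN = nonsenseFunction)

# All data frames at once: a named list with one tapply() result each
all_res <- lapply(list(df1 = df1, df2 = df2),
                  function(d) tapply(d$x, INDEX = d$iteration, FUN = nonsenseFunction))
all_res
```

For the seven data frames in the question, the list passed to lapply() could be built with mget(paste0("df", 1:7)), as in the dplyr answer.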
