How to run Chisq test for multiple rows FASTER in R?

I have managed to run a chi-squared test using a loop in R, but it is very slow for large data. Could you help me do it faster, for example with dplyr? I tried dplyr but kept getting an error whose cause I could not work out.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chi-squared test between each row of df and cs, and get back the statistic and p-value along with the row names.
Here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.

I guess you want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
# One result row per column of df, labelled with the column names
value = data.frame(p_val = NA, stat = NA, exprs = colnames(df))
for (i in seq_len(ncol(df))) {
  # tbl <- table((df[i,]), cs) ### No use seen for this
  # I changed the indexing in the next line to compare columns to the standard `cs`.
  tst <- chisq.test(df[ ,i], cs) # chisq.test is not vectorized, so some sort of loop is needed
  value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w)
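The same idea can be written with lapply so that each test returns one row of results. This is only a sketch: chisq_by_column is a hypothetical helper name, the data are made up, and it assumes every column has the same length as the reference vector. It still calls chisq.test once per column, so it is tidier rather than fundamentally faster.

```r
# Hypothetical helper: run chisq.test() on each column of `dat` against `ref`
chisq_by_column <- function(dat, ref) {
  res <- lapply(seq_len(ncol(dat)), function(i) {
    # suppressWarnings() hides the small-sample approximation warning
    tst <- suppressWarnings(chisq.test(dat[, i], ref))
    data.frame(column = colnames(dat)[i],
               p_val  = tst$p.value,
               stat   = unname(tst$statistic))
  })
  do.call(rbind, res)  # stack the one-row data frames
}

# Made-up data, not the asker's
toy <- data.frame(a = rep(1:2, 10), b = rep(c(1, 1, 2, 2), 5))
ref <- rep(1:4, 5)
out <- chisq_by_column(toy, ref)
```

Collecting the rows in a list and binding them once at the end avoids growing a data frame inside the loop.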

Related

Rename column names of a dataframe with incrementation

I have a script generating a dataframe with multiple columns named with numbers 1, 2, 3 –> n
I want to rename the columns with the following names: "Cluster_1", "Cluster_2", "Cluster_3" –> "Cluster_n" (with incrementation).
As the number of columns in my dataframe can change accordingly to another part of my script, I would like to be able to have a kind of loop structure that would go through my dataframe and change columns accordingly.
I would like to do something like:
for (i in colnames(df)){
an expression that would change the column name to a concatenation of "Cluster_" + i
}
Outside the loop context, I generally use this expression to rename a column:
names(df)[names(df) == '1'] <- 'Cluster_1'
But I struggle to produce an adapted version of this expression that would properly integrate in my for loop with a concatenation of string and variable value.
How can I adjust the expression that renames the column of the dataframe to integrate in my for loop?
Or is there a better way than a for loop to do this?
A tidyverse solution: rename_with()
require(dplyr)
## '~' notation can be used for formulae in this context:
df <- rename_with(df, ~ paste0("Cluster_", .))
Using paste0.
names(df) <- paste0('cluster_', seq_len(length(df)))
If you really need a for loop, try
for (i in seq_along(names(df))) {
names(df)[i] <- paste0('cluster_', i)
}
df
# cluster_1 cluster_2 cluster_3 cluster_4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
Note: colnames()/rownames() are designed for class "matrix"; for a "data.frame" you might want to use names()/row.names() instead.
Data:
df <- data.frame(matrix(1:12, 3, 4))
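If the existing column names are themselves the numbers to keep (for example, when they are not a gap-free 1 to n), prefixing the current names may be closer to what the question asks than generating a fresh sequence. A minimal sketch, with made-up column names:

```r
df <- data.frame(matrix(1:12, 3, 4))
names(df) <- c("1", "2", "5", "9")   # assumed: numeric-looking names with gaps

# paste0() is vectorized, so all names are prefixed in one step
names(df) <- paste0("Cluster_", names(df))
names(df)
```

No loop is needed because paste0 works on the whole vector of names at once, whatever the number of columns.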

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%  # filter_all needs all_vars()/any_vars()
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
You should probably give it a better name, though.
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns integer(0), a vector of length zero. Comparing that with 1 gives logical(0) rather than TRUE or FALSE, and the error is telling you that if cannot work with a zero-length condition.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") >= 1) {print("no dice")} # Error: argument is of length zero
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was the 7th column! (This will end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
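One way to sidestep the shifting-index problem is to work out which rows and columns to keep first, then subset once. A minimal base R sketch on made-up data (the data frame x and its columns are hypothetical):

```r
x <- data.frame(id  = c("a", "b.DERIVED", "c", "d"),
                val = c(1, 2, 3, 4),
                lab = c("p", "q", "r.DERIVED", "s"),
                stringsAsFactors = FALSE)

# TRUE for each row in which any cell contains the literal text ".DERIVED"
has_derived <- apply(x, 1, function(r) any(grepl(".DERIVED", r, fixed = TRUE)))

# TRUE for each numeric column
num_cols <- vapply(x, is.numeric, logical(1))

# Subset once; drop = FALSE keeps a data frame even if one column remains
cleaned <- x[!has_derived, num_cols, drop = FALSE]
```

Because nothing is removed while iterating, the indices never go out of step with the data.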
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Note that you should consider using the regex "\\.DERIVED" rather than ".DERIVED", since the latter means "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric because the
# example code provided assumed that the column class was numeric. It will not
# detect a character column whose values happen to consist only of digits.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8

R: Using a vector to feed dataframe names for sapply

I'm quite new to R, and I'm trying to use it to organize and extract info from some tables into different but similar tables. Currently I repeat the same commands, changing only the name of the table:
#DvE, DvS, and EvS are dataframes
Sum.DvE <- data.frame(DvE$genes, DvE$FDR, DvE$logFC)
names(Sum.DvE) <- c("gene","FDR","log2FC")
Sum.DvS <- data.frame(DvS$genes, DvS$FDR, DvS$logFC)
names(Sum.DvS) <- c("gene","FDR","log2FC")
Sum.EvS <- data.frame(EvS$genes, EvS$FDR, EvS$logFC)
names(Sum.EvS) <- c("gene","FDR","log2FC")
I thought it would be easier to create a vector of the table names, and feed it into a for loop:
Sum.Comp <- c("DvE","DvS","EvS")
for(i in 1:3){
Sum.Comp[i] <- data.frame(i$genes, i$FDR, i$logFC)
names(Sum.Comp[i]) <- c("gene","FDR","log2FC")
}
But I get
>Error in i$genes : $ operator is invalid for atomic vectors
which I kind of expected because I was just trying it out, but can someone tell me if what I want to do can be done some other way, or if you have some suggestions for me, that would be much appreciated!
Clarification: Basically I'm trying to ask if there's a way to feed a dataframe name into a for loop through a vector, because I think I get the error because R doesn't realize "i" in the for loop stands for a dataframe name. This is a more simplified example:
DF1 <- data.frame(A=1:5, B=1:5, C=1:5, D=1:5)
DF2 <- data.frame(A=10:15, B=10:15, C=10:15, D=10:15)
DF3 <- data.frame(A=20:25, B=20:25, D=20:25, D=20:25)
DFs <- ("DF1", "DF2", "DF3")
for (i in 1:3){
New.i <- dataframe(i$A, i$D)
}
And I'd like it to make 3 new dataframes called "New.DF1", "New.DF2", "New.DF3" with example outputs like:
New.DF1
A D
1 1
2 2
3 3
4 4
5 5
New.DF2
A D
10 10
11 11
12 12
13 13
14 14
15 15
Thank you!
Not entirely sure I understand your problem, but the code below may do what you're asking. I've created simple values for the input data frames for testing.
DvE <- data.frame(genes=1:2, FDR=2:3, logFC=3:4)
DvS <- data.frame(genes=4, FDR=5, logFC=6)
EvS <- data.frame(genes=7, FDR=8, logFC=9)
df_names <- c("DvE","DvS", "EvS")
sum_df <- function(x) data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
for(df in df_names) {
assign(paste("Sum.",df,sep=""), do.call("sum_df", list(as.name(df)) ) )
}
Instead of operating on the names of variables, it would be easier to store the data frames you want to process in a list and then process them with lapply:
to.process <- list(DvE, DvS, EvS)
processed <- lapply(to.process, function(x) {
data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
})
Now you can access the new data frames with processed[[1]], processed[[2]], and processed[[3]].
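If the original names are needed afterwards, a named list keeps them attached to the results. A sketch, assuming data frames shaped like DvE, DvS, and EvS (recreated here with toy values):

```r
DvE <- data.frame(genes = 1:2, FDR = 2:3, logFC = 3:4)
DvS <- data.frame(genes = 4, FDR = 5, logFC = 6)
EvS <- data.frame(genes = 7, FDR = 8, logFC = 9)

# mget() fetches several objects by name and returns them as a named list
to.process <- mget(c("DvE", "DvS", "EvS"))

processed <- lapply(to.process, function(x) {
  data.frame(gene = x$genes, FDR = x$FDR, log2FC = x$logFC)
})

# Prefix the names so results can be reached as processed$Sum.DvE and so on
names(processed) <- paste0("Sum.", names(processed))
```

This gives the name-based access the question was after without creating separate variables via assign.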

removing duplicate subsets of rows

I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates that there was a change to the index
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- true
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output is unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thanks for your help, and any additional general advice on loops is appreciated!
It is not entirely clear what you want to do, but I guess it is something like this:
df[c(1,diff(df$num)) !=0,]
name num num2
1 Joe 1 1
2 Mary 2 1
4 Frank 1 1
5 Carol 2 1
8 Jay 3 1
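The one-liner above only looks at num. If a change in any of several columns should count, one possible extension is to compare each row with the row before it. This is a sketch using the question's example data; cols is an assumed choice of columns.

```r
name <- c("Joe", "Mary", "Sue", "Frank", "Carol", "Bob", "Kate", "Jay")
num  <- c(1, 2, 2, 1, 2, 2, 2, 3)
num2 <- c(1, 1, 1, 1, 1, 1, 1, 1)
df <- data.frame(name, num, num2)

cols <- c("num", "num2")   # assumed: the columns that define a "change"
n <- nrow(df)

# Compare rows 2..n with rows 1..(n-1); TRUE where any column differs
changed <- rowSums(df[-1, cols] != df[-n, cols]) > 0

# Always keep the first row, then every row that differs from its predecessor
dfnew <- df[c(TRUE, changed), ]
```

The comparison is done on whole columns at once, so no explicit loop or pre-allocated result matrix is required.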

Replace NA with 0 in R using a loop on a dataframe

I would like to run through specific columns in a dataframe and replace all NAs with 0s using a loop.
extract = read.csv("2013-09 Data extract.csv")
extract$Premium1[is.na(extract$Premium1)] <- 0
extract$Premium1
gives me the required result for Premium1 in dataframe extract, but I would like to loop through all 27 columns of premiums, so what I am trying is
extract = read.csv("2013-09 Data extract.csv")
for(i in 1:27) {
thispremium <- get(paste("extract$Premium", i, sep=""))
thispremium[is.na(thispremium)] <- 0
}
which gives
Error in get(paste("extract$Premium", i, sep = "")) :
object 'extract$Premium1' not found
Any idea on what is causing the error?
Do you need the loop because of other requirements? Because it works just fine without one:
extract[is.na(extract)] <- 0
If you want to do the replacement for some columns only, select those columns first, perform the replacement, and substitute the columns back into the original set:
first5 <- extract[, 1 : 5]
first5[is.na(first5)] <- 0
extract[, 1 : 5] <- first5
More generally, loops can (and should) mostly be avoided in R, especially when manipulating data frames. Often operations vectorise automatically (like above). When they don't, functions of the apply family can be used.
How about
for (colname in names(extract))
extract[[colname]][is.na(extract[[colname]])] <- 0
(or even extract[is.na(extract)] <- 0)
Or, if you are not doing it to all the columns (I think I misread your question):
for(i in 1:27) {
colname <- paste0("Premium",i)
extract[[colname]][is.na(extract[[colname]])] <- 0
}
Alternatively, you don't really need to know the number of such columns:
premium <- grep("^Premium[0-9]*$",names(extract))
extract[premium][is.na(extract[premium])] <- 0
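As for the cause of the original error: get() looks up an object by its name, and "extract$Premium1" is not the name of any object. It is an expression that would have to be parsed and evaluated. Even if the lookup had succeeded, modifying thispremium would only change a copy, not the column inside extract. A small sketch with made-up data (the column values are hypothetical):

```r
extract <- data.frame(Premium1 = c(1, NA, 3), Premium2 = c(NA, 5, 6))

# get() treats the whole string as a single object name, so this fails
res <- try(get("extract$Premium1"), silent = TRUE)
inherits(res, "try-error")

# Indexing by column name with [[ ]] works and modifies the data frame itself
for (i in 1:2) {
  colname <- paste0("Premium", i)
  extract[[colname]][is.na(extract[[colname]])] <- 0
}
```

Building the column name as a string and indexing with [[ ]] is the usual replacement for get() in this situation.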
