Rfast segmentation fault on independence test

I am having trouble using the g2Test function from the Rfast package in R: it produces a segmentation fault even though the input parameters look correct to me.
More specifically, I am able to run the example code from the manual page:
nvalues <- 3
nvars <- 10
nsamples <- 5000
data <- matrix( sample( 0:(nvalues - 1), nvars * nsamples, replace = TRUE ), nsamples, nvars )
dc <- rep(nvalues, nvars)
res <- g2Test(data, 1, 2, 3, c(3, 3, 3))
But I'm not able to make it run on my data. The function g2Test takes as input a matrix of numbers, three integers that stand for the columns involved in the test (in the example we study the dependence of the first column on the second, conditioned on the third), and a vector with the number of unique values per column.
My code follows the same principles, reading data from the ALARM CSV file:
library(readr)
library(Rfast)
# open the file
path <- "datasets/alarm.csv"
dataset <- read.csv(path)
# find the indexes of the columns I'm interested in and the number of unique values per column
c1 <- "PVS"
c2 <- "ACO2"
s <- c("VALV", "VLNG", "VTUB", "VMCH")
n <- colnames(dataset)
col_c1 <- match(c1, n)
col_c2 <- match(c2, n)
cols_c3 <- c()
uni <- c(length(unique(dataset[c1])[[1]])[[1]],length(unique(dataset[c2])[[1]])[[1]])
if (!s[1]=="()"){
  for(v in s){
    idx <- match(v, n)
    cols_c3 <- append(cols_c3,idx)
    uni <- append(uni,length(unique(dataset[v])[[1]])[[1]])
  }
}
# transforming the string data frame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}
ds <- as.matrix(dataset)
colnames(ds) <- NULL
# running the G2 test
res <- g2Test(ds, col_c1, col_c2, cols_c3, uni)
But it results in a segmentation fault:
*** caught segfault ***
address 0x1f103f96a, cause 'memory not mapped'
Traceback:
1: g2Test(ds, col_c1, col_c2, cols_c3, uni)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
The same happens if I condition on just one variable instead of several.
I really don't understand why this happens, since it seems to me that my case is the same as the example in the reference, just with different data. I would really appreciate any help debugging this issue; please tell me if I need to provide further info.

First, I'm sorry that I missed that you had originally included your data!
Alright, I wish I had realized this sooner (as you will, as well...). The columns used in the test have to be consecutive and the values must start at zero. So what does that mean? You have to rearrange the columns so that col_c1 is the first column, col_c2 is the second column, and so on. You also have to subtract one from all values (since the lowest value is 1).
This is what I did (and how I checked it):
# there was no PVS, I assume this was PVSAT
c1 <- "PVSAT"
# c1 <- "PVS"
# there was no ACO2, I assume this was ARTCO2
c2 <- "ARTCO2"
# c2 <- "ACO2"
# there are no columns with these names...
# for VALV - VENTALV; for VLNG - VENTLUNG; for VTUB - VENTTUBE; for VMCH - VENTMACH
s <- c("VENTALV", "VENTLUNG", "VENTTUBE", "VENTMACH")
# s <- c("VALV", "VLNG", "VTUB", "VMCH")
This next chunk is exactly as you wrote it:
n <- colnames(dataset)
col_c1 <- match(c1, n)
col_c2 <- match(c2, n)
cols_c3 <- c()
uni <- c(length(unique(dataset[c1])[[1]])[[1]],length(unique(dataset[c2])[[1]])[[1]])
if (!s[1]=="()"){
  for(v in s){
    idx <- match(v, n)
    cols_c3 <- append(cols_c3,idx)
    uni <- append(uni,length(unique(dataset[v])[[1]])[[1]])
  }
}
# transforming the string data frame into an integer matrix
for (nn in n){
  dataset[nn] <- unclass(as.factor(dataset[nn][[1]]))
}
ds <- as.matrix(dataset)
This is where I made the minimum zero:
# look at the number of unique values before changing, as a means of validation
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))
# look at the minimum, as a means of validation
sapply(1:ncol(ds), function(x) min(ds[,x]))
# the minimum value must be zero
ds <- ds - 1
# check
sapply(1:ncol(ds), function(x) min(ds[,x]))
sapply(1:ncol(ds), function(x) length(unique(ds[, x])))
# looked as expected
Next, I rearranged the columns. I did this before removing the names so I could use the names to ensure the order was correct.
# the data must be consecutive numbers
# catch names before and after
n2 <- dimnames(ds)
# some of the results from this:
# [[2]]
# [1] "HISTORY" "CVP" "PCWP" "HYPOVOLEMIA"
# create the list of column indices other than those used in g2Test
tellMe <- c(1:ncol(ds))
tellMe <- tellMe[-c(col_c1, col_c2, sort(cols_c3))]
# rearrange using the indices
ds <- ds[, c(col_c1, col_c2, sort(cols_c3), tellMe)]
# check it
(n3 <- dimnames(ds))
# some of the results from this
# [[2]]
# [1] "PVSAT" "ARTCO2" "VENTMACH" "VENTTUBE"
All that's left is removing the names (just as you did) and then calling the function. Since the indices changed, your objects won't work here, though.
colnames(ds) <- NULL
# running the G2 test
# res <- g2Test(ds, col_c1, col_c2, sort(cols_c3), uni)
res2 <- g2Test(ds, 1, 2, c(3,4,5,6), c(3, 3, 4, 4, 4, 4))
# $statistic
# [1] 19.78506
#
# $df
# [1] 1024
#
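To avoid repeating the rearranging by hand for every test, the whole recipe can be wrapped in a helper. This is just a sketch of my own (the name g2TestByName is mine, not part of Rfast); it assumes at least one conditioning column and that dataset is still the data frame read from the CSV:
g2TestByName <- function(dataset, c1, c2, cond) {
  # integer-code every column and make the minimum zero
  m <- sapply(dataset, function(x) as.integer(as.factor(x)) - 1L)
  # put the tested columns first: x, y, then the conditioning set
  ord <- match(c(c1, c2, cond), colnames(m))
  m <- cbind(m[, ord], m[, -ord])
  colnames(m) <- NULL
  # unique-value counts are only needed for the columns involved in the test
  k <- 2 + length(cond)
  uni <- apply(m[, 1:k, drop = FALSE], 2, function(x) length(unique(x)))
  g2Test(m, 1, 2, 3:k, uni)
}
res3 <- g2TestByName(dataset, "PVSAT", "ARTCO2",
                     c("VENTALV", "VENTLUNG", "VENTTUBE", "VENTMACH"))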

Related

Dataframe output from a for-loop

I am trying to populate the output of a for loop into a data frame. The loop repeats across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the data frame called "data2" have 2 columns when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
  if (i==34) {break}
  a[i] <- colnames(data[i])
  b <- data$cycle
  GM[i] <- geoMean(data[,i], na.rm=TRUE)
  GSD[i] <- geoSD(data[,i], na.rm=TRUE)
  data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame() help page, you'll see that it does not take arguments nrow and ncol; those are arguments for the matrix() function.
This is how you initialized data2, and you can see it starts with 2 columns: one column is named ncol, the second column is named nrow.
data2 <- data.frame(ncol=4, nrow=33)
data2
# ncol nrow
# 1 4 33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result, there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much outside of the loop as possible. This is just guesswork without sample data, but these changes seem like a start at improving your code.
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
  if (i==34) {break}
  GM[i] <- geoMean(data[,i], na.rm=TRUE)
  GSD[i] <- geoSD(data[,i], na.rm=TRUE)
  ## it's weird to assign a row of a data.frame at once...
  ## maybe you should keep it as a matrix?
  data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
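If the goal is one summary row per column holding the column name, geometric mean, and geometric SD, the loop can be dropped entirely. A sketch under those assumptions (geoMean() and geoSD() are taken to come from the EnvStats package, and b is left out since its role in a one-row-per-column summary is unclear):
library(EnvStats)  # assumed source of geoMean() and geoSD()
cols <- colnames(data)[1:33]  # the first 33 columns, as in the loop
data2 <- data.frame(
  variable = cols,
  GM  = sapply(cols, function(nm) geoMean(data[[nm]], na.rm = TRUE)),
  GSD = sapply(cols, function(nm) geoSD(data[[nm]], na.rm = TRUE)),
  row.names = NULL
)
data2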

Confusing non-numeric argument to binary operator error

I cannot even seem to create a reproducible example for this, as it works fine when I go through the code one line at a time.
The error message I get is as follows:
"Error in testData[, colCheck][length(testData[, colCheck])] - testData[, :
non-numeric argument to binary operator "
Both colCheck and testData$linearcorrd15N are numeric and, like I said, the calculation works fine when I run that line on its own. The error comes only when I run the function via QTest(df, colCheck).
Here is an example of what some of the code looks like. It will not produce an error, but maybe you can see something that I don't.
library(dplyr) # for arrange()
QTest <- function(testData, colCheck)
#%#
# testData <- This is the entire data frame for the std/ref that has too high
#             of a SD, this way the data frame can be returned without the outlier
# colCheck <- The column name for values that were flagged for having too high of a SD
# This Q test info provided by: https://www.statisticshowto.com/dixons-q-test/
#%#
{
  # Get the mean of the highest and lowest values
  testData <- arrange(testData, desc(testData[, colCheck]))
  len <- length(testData[,colCheck])-1
  high <- sapply(1:len, function(i) testData[,colCheck][i])
  meanhigh <- mean(high)
  testData <- arrange(testData, (testData[, colCheck]))
  low <- sapply(1:len, function(i) testData[,colCheck][i])
  meanlow <- mean(low)
  # If the mean of the lowest numbers is lower than the mean of the highest numbers, do this
  if(meanlow < meanhigh){
    QexpVal <- abs((testData[, colCheck][2] - testData[, colCheck][1])/
                   (testData[, colCheck][length(testData[, colCheck])] - testData[, colCheck][1]))
    outlier <- testData[,colCheck][1]
    closest <- testData[,colCheck][2]
  # else if the mean of the lowest numbers is higher than the mean of the highest numbers, do this
  } else {
    QexpVal <- abs((testData[, colCheck][length(testData[,colCheck])-1] - (testData[, colCheck][length(testData[,colCheck])])) /
                   (testData[,colCheck][length(testData[,colCheck])]) - (testData[,colCheck][1]))
    outlier <- testData[,colCheck][length(testData[,colCheck])]
    closest <- testData[,colCheck][length(testData[,colCheck])-1]
  }
  return(QexpVal)
}
df <- data.frame(Row = c(1, 2, 3, 4, 5), Identifier.2 = "36-UWSIF-UT Glut1", linearcorrd15N = c(-11.63433,
-22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTestCorrVals <- QTest(df, colCheck)
It seems you really overcomplicate this function by pushing the whole table into the function, looping over everything, and reading values back out of the whole table...
Just the part to get meanhigh and meanlow requires no more than this:
v <- df[, colCheck]
v <- v[order(v)]
n <- length(v)
meanhigh <- mean(v[2:n])
meanlow <- mean(v[1:(n-1)])
Or if you use the decreasing ordering this:
v <- df[, colCheck]
v <- v[order(v, decreasing = TRUE)]
n <- length(v)
meanhigh <- mean(v[1:(n-1)])
meanlow <- mean(v[2:n])
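A quick aside on the parentheses in v[1:(n-1)]: in R, the : operator binds tighter than binary -, so 1:n-1 is parsed as (1:n) - 1. That spelling happens to index the same elements only because an index of 0 is silently dropped:
n <- 5
1:n-1    # parsed as (1:n) - 1, i.e. 0 1 2 3 4
1:(n-1)  # 1 2 3 4
v <- c(10, 20, 30, 40, 50)
v[1:n-1] # the 0 index is dropped, so this happens to equal v[1:(n-1)]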
Full function
Here is the full code using this approach. I agree this is not the specific question you asked, but the way you coded it is highly inefficient and error-prone: every time you take the whole data.frame, subset it, and recalculate lengths. Also, you only have to order once; if the lowest value is on top, the highest is by definition at the bottom. Then play around with 1 for first, 2 for second, n for last, and n-1 for second to last.
df <- data.frame(Row = c(1, 2, 3, 4, 5), Identifier.2 = "36-UWSIF-UT Glut1", linearcorrd15N = c(-11.63433,
-22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTest <- function(v) {
  v <- v[order(v)]
  n <- length(v)
  meanhigh <- mean(v[2:n])
  meanlow <- mean(v[1:(n-1)])
  if(meanlow < meanhigh) {
    QexpVal <- abs((v[2]-v[1])/(v[n]-v[1]))
    outlier <- v[1]
    closest <- v[2]
  } else {
    QexpVal <- abs((v[n-1]-v[n])/(v[n]-v[1]))
    outlier <- v[n]
    closest <- v[n-1]
  }
  return(QexpVal)
}
QTestCorrVals <- QTest(df[, colCheck])
Side note
Using the column index number works slightly differently depending on whether your data is a data.frame or a data.table:
library(data.table)
class(df)
df[, colCheck]
dt <- data.table(df)
class(dt)
dt[, ..colCheck]
dt[, colCheck] # throws an error
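If you want indexing that reads the same for both classes, double-bracket extraction by position behaves identically (a small sketch of my own, not part of the answer above):
df[[colCheck]]  # numeric vector from the data.frame
dt[[colCheck]]  # the same numeric vector from the data.table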

R for loop to calculate wilcox.test

I am trying to write a code that would automatically calculate Wilcoxon test p-value for several comparisons.
Data used: 2 data sets with the same structure, representing two groups of participants who completed the same 5 tasks, which means that each table contains 5 columns (tasks) and X rows of task scores.
data_17_18_G2 # first data set (in data.table format)
data_18_20_G2 # second data set (in data.table format)
Both data sets have identical column names, which are to be used in the W-test in the following way:
wilcox.test(Group1Task1, Group2Task1, paired = F)
wilcox.test(Group1Task2, Group2Task2, paired = F)
and so on.
The inputs (e.g., Group1Task1) are two vectors of task scores (the first one will be from data_17_18_G2 and the other one from data_18_20_G2).
Desired output: a data table with a column of p-values
The problem I faced is that no matter how I manipulated the val1 and val2 empty objects, in the second and third lines the right side "as.numeric(unlist(data_17_18_G2[, ..i]))" gives a correct output (a numeric vector), but its left side "val1[i]" always keeps only one value from the vector. That gave me the idea that the main problem appeared at the step of creating an empty vector; however, I wasn't able to solve it.
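Here is a minimal illustration of what I mean (a toy example, not my real data): assigning a whole vector to a single element keeps only the first value and warns.
x <- numeric()
x[1] <- c(10, 20, 30)
# Warning: number of items to replace is not a multiple of replacement length
x
# [1] 10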
Empty objects:
result <- data.table(matrix(ncol=2))
val1 <- as.numeric() # here I also tried functions "numeric" and "vector"
val2 <- as.numeric()
res <- vector(mode = "list", length = 7)
For loop
for (i in 1:5) {
  val1[i] <- as.numeric(unlist(data_17_18_G2[ , ..i]))
  val2[i] <- as.numeric(unlist(data_18_20_G2[ , ..i]))
  res[i] <- wilcox.test(val1[i], val2[i], paired = F)
  result[i, 1] <- i
  result[i, 2] <- res$p.value
}
Output:
Error in `[<-.data.table`(`*tmp*`, i, 2, value = NULL) :
When deleting columns, i should not be provided
1: In val1[i] <- as.numeric(unlist(data_17_18_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
2: In val2[i] <- as.numeric(unlist(data_18_20_G2[, ..i])) :
number of items to replace is not a multiple of replacement length
3: In res[i] <- wilcox.test(val1[i], val2[i], paired = F) :
number of items to replace is not a multiple of replacement length
Alternative:
I changed the second and the third lines
for (i in 1:5) {
  val1[i] <- as.numeric(data_17_18_G2[ , ..i])
  val2[i] <- as.numeric(data_18_20_G2[ , ..i])
  res[i] <- wilcox.test(val1[i], val2[i], paired = F)
  result[i, 1] <- i
  result[i, 2] <- res$p.value
}
And got this
Error in as.numeric(data_17_18_G2[, ..i]) :
(list) object cannot be coerced to type 'double'
which means that as.numeric() cannot coerce the one-column list returned by data_17_18_G2[, ..i] into a numeric vector, so wilcox.test never even receives valid input.
How can I improve the code so that I get a data table of p-values?
There would appear to be some bugs in the code. I have rewritten it using the cars dataset as an example.
## use the cars dataset as an example (change with appropriate data)
library(data.table)
data(cars)
data_17_18_G2 <- as.data.table(cars)
data_18_20_G2 <- data_17_18_G2[,2:1]
## Fixed code
result <- data.table(matrix(NA_real_, nrow=ncol(data_17_18_G2), ncol=2))
res <- vector(mode = "list", length = ncol(data_17_18_G2))
for (i in 1:ncol(data_17_18_G2)) {
  val1 <- as.numeric(unlist(data_17_18_G2[ , ..i]))
  val2 <- as.numeric(unlist(data_18_20_G2[ , ..i]))
  res[[i]] <- wilcox.test(val1, val2, paired = F)
  result[i, 1] <- as.numeric(i)
  result[i, 2] <- as.numeric(res[[i]]$p.value)
}
Hope this gives you the output you are after.
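For what it's worth, the same table of p-values can be built without the explicit loop; this is just a sketch using the objects defined above:
p_values <- sapply(seq_len(ncol(data_17_18_G2)), function(i) {
  wilcox.test(unlist(data_17_18_G2[, ..i]),
              unlist(data_18_20_G2[, ..i]),
              paired = FALSE)$p.value
})
result <- data.table(task = seq_along(p_values), p.value = p_values)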

How to calculate statistics over sequences of non zeros in a dataframe in R

I have a dataframe containing sequences as follows:
r1=c(0,0,0,1.2,5,0.5,3.3,0,0,2.1,0.7,1,3.3,0,0,0,0,2.5,4.2,1,5.2,0,0,0,0)
r2=c(0,0,3.5,5.1,2.5,0,0,0,0.6,1.7,1.6,1.2,1.6,0,0,0,0,1.5,1.8,1.5,0,0,0,0,0)
r=as.data.frame(cbind(r1,r2))
My actual data contain more columns and rows. For each column, I'd like to get the minimum/maximum/average (basic statistics) of the maxima of its sequences of non-zero values. That is, for each column I extract the maximum value of each sequence of successive non-zero values and then compute the statistics over those maxima.
Here I've written some functions to break your vectors into the individual runs, extract the values you want (the maxes within the runs), and then apply the basic statistics you are asking for. There may be a more elegant or more efficient method.
r1=c(0,0,0,1.2,5,0.5,3.3,0,0, 2.1,0.7,1,3.3,0,0,0,0,2.5,4.2,1,5.2,0,0,0,0)
r2=c(0,0,3.5,5.1,2.5,0,0,0,0.6,1.7,1.6,1.2,1.6,0,0,0,0,1.5,1.8,1.5,0,0,0,0,0)
r=as.data.frame(cbind(r1,r2))
my.stats.fun <- function(col){
  # sub functions
  remove.successive.0s <- function(col){
    col <- c(col, 0)
    i0 <- which(col==0)
    i00 <- i0[which(diff(i0)==1)]
    col2 <- col[-i00]
    if(col2[1]==0){ col2 <- col2[-1] } # pops first 0
    return(col2)
  }
  run.indicator <- function(col){
    i0 <- which(col==0)
    lr <- length(i0)
    runs <- rep(1:lr, times=c(i0-c(0,i0[-lr])))
    col <- col[-i0]
    runs <- runs[-i0]
    return(list(values=col, index=runs))
  }
  basic.stats <- function(maxes){
    return(c(min=min(maxes), ave=mean(maxes), max=max(maxes)))
  }
  # apply functions
  col <- remove.successive.0s(col)
  runs <- run.indicator(col)
  maxes <- aggregate(runs$values, by=list(runs$index), max)[,2]
  stats <- basic.stats(maxes)
  return(stats)
}
sapply(r, my.stats.fun)
# r1 r2
# min 3.3 1.700000
# ave 4.5 2.866667
# max 5.2 5.100000
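For comparison, base R's rle() can identify the runs more directly. This is a sketch of my own, not part of the answer above; on this data it reproduces the same table:
run.maxes <- function(x){
  nz <- x != 0
  r <- rle(nz)
  # give every element the id of the run it belongs to
  id <- rep(seq_along(r$lengths), r$lengths)
  # max within each non-zero run
  tapply(x[nz], id[nz], max)
}
sapply(r, function(col){
  m <- run.maxes(col)
  c(min=min(m), ave=mean(m), max=max(m))
})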

Efficiency of transforming counts to percentages and index scores

I currently have the following code, which produces the results I want (Data_Index and Data_Percentages):
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # Proportion of all areas' population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for(u in 1:Vars) {
    # For every column value in the selected row
    # the percentage of that value compared to the
    # total population (TotalCount) for that row is calculated
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # Once the percentage is calculated the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in that
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # Binds results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # Binds results for completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem I have is that the code is very slow. I want to be able to use it on a dataset that has 200,000 rows and 200 columns (which, with the code as it stands, will take around 4 days). I am sure there must be a way of speeding this process up, but I am not sure how exactly.
What the code is doing is taking (in this example) a population counts table, divided into age bands and different areas, and turning it into percentages and index scores. Currently there are 2 loops, so every value in all the rows and columns is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly; are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
This is your entire code. The for-loop is not necessary, and neither is apply. The division can be applied to an entire matrix at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
Data_Per_Row <- df/matrix(rep(total_count, each=ncol(df)), ncol=ncol(df), byrow=TRUE)*100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop; use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x/tcs*100,
               tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop as well. (And some comments in your code would be appreciated.)
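Building on that suggestion, here is a sketch of what the fully vectorized version could look like, using the objects from the question after the total-count column has been removed from Input_Data (sweep() divides each row by the corresponding entry of the vector you pass it):
# divide every row of the variable columns by that row's total count
Data_Per_Row <- sweep(as.matrix(Input_Data), 1, TotalCount[, 1], "/") * 100
# then scale each row's percentages by its share of the overall population
Data_Index_Row <- sweep(Data_Per_Row, 1, OAPer, "/") * 100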
