R chi-squared statistic for two different distribution - r

I have two file.dat (random1.dat and random2.dat) which are generated from a random uniform distribution (changing the seed):
http://www.filedropper.com/random1_1: random1.dat
http://www.filedropper.com/random2 : random2.dat
I like to use R to make the X-squared to understand if the two distribution are statistically the same.
To do that i prove:
x1 -> read.table("random1.dat")
x2 -> read.table("random2.dat")
chisq.test(x1,x2)
but I receive an error message:
'x' and 'y' need to have the same length
Now the problem is that this two files are both 1000's rows. So I don't understand that. Another question is if I want to make this process automatic (iterate it) for istance 100 times with 100 different file, can i make something like:
DO i=1,100
x1 -> read.table("random'(i)'.dat")
x2 -> read.table("fixedfile.dat")
chisq.test(x1,x2)
save results from the chisq analys
END DO
Thanks so much for Your help.
ADDED:
#eipi10,
I try to use the first method You gave here and it works well for the data You generate here. Then, when I try it for my data (I put in a single file a 2-column matrix enter link description here of 1000 rows of two uniform distribution with a different seed) something do not work correctly:
I load the file with: dat = read.table("random2col.dat");
I use the command: csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x))) and a warning message appear;
finally I use: unlist(lapply(csq, function(x) x$p.value)) BUT the output is something like:
[...] 1 1 1 1 1 1 1 1 1 1 1 1 1
[963] 1 1 1 1 1.....1 1 1 1
[1000] 1

I don't think you need to use a loop. You can use lapply instead. Also, you're entering x1 and x2 as separate columns of data. When you do this, chisq.test computes a contingency table from these two columns, which wouldn't be meaningful for columns of real numbers. Instead, you need to feed chisq.test a single matrix or data frame whose columns are x1 and x2. But even then, the chisq.test is expecting count data, which isn't what you have here (although the "expected" frequency doesn't necessarily have to be an integer). In any case, here's some code that will make the test run the way you seem to be hoping:
# Simulate data: 5 columns of data, each from the uniform distribution
dat = data.frame(replicate(5, runif(20)))
# Chi-Square test of each column against column 1.
# Note use of cbind to combine the two columns into a single data frame,
# rather than entering each column as separate arguments.
csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x)))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
On the other hand, if you were intending your data to be two streams of values that would then be converted into a contingency table, here's an example of that:
# Simulate data of 5 factor variables, each with 10 different levels
dat = data.frame(replicate(5, sample(c(1:10), 1000, replace=TRUE)))
# Chi-Square test of each column against column 1. Here the two columns of data are
# entered as separate arguments, so that chisq.test will convert them to a two-way
# contingency table before doing the test.
csq = lapply(dat[,-1], function(x) chisq.test(dat[,1],x))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)

Related

Multivariate Cointegration Test?

I am currently performing an analysis on FDI and income inequality in a panel of 10 countries over 30 years. In this setting, I want to test for panel cointegration, unit roots etc. My data is currently in the format:
ID year var1 var2
1 1 3 4
1 2 4 NA
1 3 1 6
2 1 4 2
2 2 1 3
2 3 2 2`
and so on. I have used
data.frame(split(df$var1, df$ID))
to create a dataframe with the IDs as columns to use the purtest function from the plm package (note that my panel data is unbalanced). However, I have a lot of variables and this whole procedure seems excessively tedious.
Is there a way to perform a cointegration test on multiple variables at once? There appears to be an index specification in purtest but I don't fully understand how to use it. What format would my data have to be in and how would I get there? Would it be an issue that there is quite a few NAs in some variables?
df <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
var1 = c(3,5,6,2,NA,6,4, 2,3,4,5,8,NA,7),
var2 = c(5,6,3,8,NA,NA,2, NA,6,7,4,5,9,2))
The index argument to plm's purtest works like in the other functions of the package: it is used to specify the individual and the time dimension of your data. However, there is no need to use it directly in the purtest (or other functions) if data is converted to a pdata.frame first.
There is no need to split the data by individual on your end. purtest supports various ways/formats to input data. A user-driven split is just one option. If you already have a data frame in long-format, I suggest to convert it to pdata.frame and input a pseries to purtest (thus, no need to split data yourself). Something along these lines:
library(plm)
df <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
var1 = c(3,5,6,2,NA,6,4,2,3,4,5,8,NA,7),
var2 = c(5,6,3,8,NA,NA,2,NA,6,7,4,5,9,2))
pdf <- pdata.frame(df)
purtest(pdf$var1, test = "hadri", exo = "intercept")
To test multiple variables at once, you can put the purtest function into a loop, e.g., like this:
myvars <- as.list(pdf[ , -c(1:2)], keep.attributes = TRUE)
lapply(myvars, function(x) purtest(x, test = "ips", exo = "intercept", pmax = 1L))

Build loop to use increasing part of dataframe in R as input to function

I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time, the model updates and produces a new forecast based on the new observation included into the model. Since PCA uses data from all observations included in the model for its calculations, I need to run also the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA-result could reveal information about the future, and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp() from the stats-package) for all observations up to, and including, that observation. So I want to first run PCA for the two first observation
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observation as input
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observation as input
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the dataframe. For each of these "runs" of the PCA, I wish to extract the last value (though for pca2, I use both the first and second value) of the first principal component (PC1), and merge these into a final dataframe, where each monthly observation is the last value of the first principal component of PCA results for each of the runs.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a dataframe to look like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the two first values, but please don't pay too much attention to that. My point is that I wish to build a dataframe that consists of the last calculated value for the first principal component for each of the PCA runs.
I am thinking that a for.loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing amount of the dataframe in the calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
I had a very similar approach.
PCA <- vector("list", length=nrow(data)-1)
for(i in 1:(nrow(data)-1)) {
if(i==1) j <- 1:2 else j<-i+1
PCA[[i]] <- as.data.frame(prcomp(data[1:(1+i),], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration of the loop, run prcomp on data[1:i,]. You can directly create your pca data frame and extract PC1from it as a vector. Now you store it in the list at index i - 1
for(i in 2:nrow(data))
{
all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i,], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)

Combining row vectors for data frame after using quantile function

Novice problem. I ran following command:
CI_95_outcomes_male <- data.frame(do.call(cbind,lapply(1:ncol(outcomes_male_dt), function(r) quantile(outcomes_male_dt[,r],c(.95)))))
and end up with this output:
CI_95_outcomes_male
X1 X2 X3 X4
95% 9629902039 0 2.968924e+15 2.968924e+15
I would like to combine this vector with following vector to end up with 2X4 matrix:
#
mean_outcomes_male
ylg_smoking_simS deaths_averted total_cig total_tax_
9.62990 0.0000 2.78248 2.782480
I tried:
CI_95_outcomes_male<-colnames(mean_outcomes_male)
data.frame(mean_outcomes_male,CI_95_outcomes_male)
Error in data.frame(mean_outcomes_male, CI_95_outcomes_male) :
arguments imply differing number of rows: 4, 0
Any guidance appreciated, thanks!
CI_95_outcomes_male<-colnames(mean_outcomes_male)
I think you forgot to put colnames around CI_95_outcomes_male. But there's another problem here. I'm assuming that mean_outcomes_male is a vector, in which case colnames(mean_outcomes_male) is NULL.
data.frame(mean_outcomes_male,CI_95_outcomes_male)
Even if CI_95_outcomes_male was correct, the above command will result in a 4x5 data frame, with the first column being the mean_outcomes_male vector, second column being the CI_95_outcomes_male value for your first variable (repeated for each row),...,and the fifth column being the CI_95_outcomes_male value for your fourth variable (repeated for each row).
You need to do something like this:
set.seed(42)
# Generate a random dataset for outcomes_male_dt with 4 variables and n rows
n <- 100
outcomes_male_dt <- data.frame(x1=runif(n),x2=runif(n),x3=runif(n),x4=runif(n))
# I'm assuming you want the 95th percentile of each variable in outcomes_male_dt and store them in CI_95_outcomes_male
ptl <- .95 # if you want to add other percentiles you can replace this with something like "ptl <- c(.10,.50,.90,.95)"
CI_95_outcomes_male <- apply(outcomes_male_dt,2,quantile,probs=ptl)
# I'm going to assume that mean_outcomes_male is a vector of means for all the variables in outcomes_male_dt
mean_outcomes_male <- colMeans(outcomes_male_dt)
# You want to end up with a 2x4 matrix - I'm assuming you meant row 1 will be the means, and row 2 will be the 95th percentiles, and the columns will be the variables
want <- rbind(mean_outcomes_male, CI_95_outcomes_male)
colnames(want) <- colnames(outcomes_male_dt)
row.names(want) <- c('Mean',paste0("p",ptl*100)) # paste0("p",ptl*100) is equivalent to paste("p",ptl*100,sep="")
want # Resulting matrix

Chi-squared test of independence on all combinations of columns in a dataframe in R

this is my first time posting here and I hope this is all in the right place. I have been using R for basic statistical analysis for some time, but haven't really used it for anything computationally challenging and I'm very much a beginner in the programming/ data manipulation side of R.
I have presence/absence (binary) data on 72 plant species in 323 plots in a single catchment. The dataframe is 323 rows, each representing a plot, with 72 columns, each representing a species. This is a sample of the first 4 columns (some row numbers are missing because the 323 plots are a subset of a larger number of preassigned plots, not all of which were surveyed):
> head(plots[,1:4])
Agrostis.canina Agrostis.capillaris Alchemilla.alpina Anthoxanthum.odoratum
1 1 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
8 0 0 0 0
I want to to determine whether any of the plant species in this catchment are associated with any others, and if so, whether that is a positive or negative association. To do this I want to perform a chi-squared test of independence on each combination of species. I need to create a 2x2 contingency table for each speciesxspecies comparison, run a chi-squared test on each of those contingency tables, and save the output. Ultimately I would like to end up with a list or matrix of all species by species tests that shows whether that combination of species has a positive, negative, or no significant association. I'd also like to incorporate some code that only shows an association as positive if all expected values were greater than 5.
I have made a start by writing the following function:
CHI <- function(sppx, sppy)
{test <- chisq.test(table(sppx, sppy))
result <- c(test$statistic, test$p.value,
sign((table(sppx, sppy) - test$expected)[2,2]))
return(result)
}
This returns the following:
> CHI(plots$Agrostis.canina, plots$Agrostis.capillaris)
X-squared
1.095869e-27 1.000000e+00 -1.000000e+00
Warning message:
In chisq.test(chitbl) : Chi-squared approximation may be incorrect
Now I'm trying to figure out a way to apply this function to each speciesxspecies combination in the data frame. I essentially want R to take each column, apply the CHI function to that column and each other column in sequence, and so on through all the columns, subtracting each column from the dataframe as it is done so the same species pair is not tested twice. I have tried various methods trying to use "for" loops or "apply" functions, but have not been able to figure this out.
I hope that is clear enough. Any help here would be much appreciated. I have tried looking for existing solutions to this specific problem online, but haven't been able to find any that really helped. If anyone could link me to an existing answer to this that would also be great.
You need the combn function to find all the combinations of the columns and then apply them to your function, something like this:
apply(combn(1:ncol(plots), 2), 2, function(ind) CHI(plots[, ind[1]], plots[, ind[2]]))
I think you are looking for something like this. I used the iris dataset.
require(datasets)
ind<-combn(NCOL(iris),2)
lapply(1:NCOL(ind), function (i) CHI(iris[,ind[1,i]],iris[,ind[2,i]]))
The below R code run chisquare test for every categorical variable / every factor of a r dataframe, against a variable given (x or y chisquare parameter is kept stable, is explicitly defined):
Define your variable
Please - change df$variable1 to your desired factor variable and df to your desirable dataframe that contain all the factor variables tested against the given df$variable1
Define your Dataframe
A new dataframe is created (df2) that will contain all the chi square values / dfs, p value of the given variable vs dataframe comparisons
Code created / completed/ altered from similar posts in stackoverflow, neither that produced my desired outcome.
Chi-Square Tables statistic / df / p value for variable vs dataframe
"2" parameter define column wide comparisons - check apply (MARGIN) option.
df2 <- t(round(cbind(apply(df, 2, function(x) {
ch <- chisq.test(df$variable1, x)
c(unname(ch$statistic), ch$parameter, ch$p.value )})), 3))

Set values less than threshold to zero, with column-specific thresholds

I have two data frames. One of them contains 165 columns (species names) and almost 193.000 rows which in each cell is a number from 0 to 1 which is the percent possibility of the species to be present in that cell.
POINTID Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran
2 0.0279037 0.604687 0.0388309 0.0161980 0.0143966 0.240152
3 0.0294101 0.674846 0.0673055 0.0481405 0.0397423 0.231308
4 0.0292839 0.603869 0.0597947 0.0526606 0.0463431 0.188875
6 0.0331264 0.541165 0.0470451 0.0270871 0.0373348 0.256662
8 0.0393825 0.672371 0.0715808 0.0559353 0.0565391 0.230833
9 0.0376557 0.663732 0.0747417 0.0445794 0.0602539 0.229265
The second data frame contains 164 columns (species names, as the first data frame) and one row which is the threshold that above this we assume that the species is present and under of this the species is absent
Abie_Xbor Acer_Camp Acer_Hyrc Acer_Obtu Acer_Pseu Achi_Gran Acta_Spic
0.3155 0.2816 0.2579 0.2074 0.3007 0.3513 0.3514
What i want to do is to make a new data frame that will contain for every species in the presence possibility (my.data) the number of possibility if it is above the threshold (thres) and if it is under the threshold the zero number.
I know that it would be a for loop and if statement but i am new in R and i don't know for to do this.
Please help me.
I think you want something like this:
(Make up small reproducible example)
set.seed(101)
speciesdat <- data.frame(pointID=1:10,matrix(runif(100),ncol=10,
dimnames=list(NULL,LETTERS[1:10])))
threshdat <- rbind(seq(0.1,1,by=0.1))
Now process:
thresh <- unlist(threshdat) ## make data frame into a vector
## 'sweep' runs the function column-by-column if MARGIN=2
ss2 <- sweep(as.matrix(speciesdat[,-1]),MARGIN=2,STATS=thresh,
FUN=function(x,y) ifelse(x<y,0,x))
## recombine results with the first column
speciesdat2 <- data.frame(pointID=speciesdat$pointID,ss2)
It's simpler to have the same number of columns (with the same meanings of course).
frame2 = data.frame(POINTID=0, frame2)
R works with vectors so a row of frame1 can be directly compared to frame2
frame1[,1] < frame2
Could use an explicit loop for every row of frame1 but it's common to use the implicit loop of "apply"
answer = apply(frame1, 1, function(x) x < frame2)
This was all rather sloppy solution (especially changing frame2) but it hopefully demonstrates some basic R. Also, I'd generally prefer arrays and matrices when possible (they can still use labels but are generally faster).
This produces a logical matrix which can be used to generate assignments with "[<-"; (Assuming name of multi-row dataframe is "cols" and named vector is "vec":
sweep(cols[-1], 2, vec, ">") # identifies the items to keep
cols[-1][ sweep(cols[-1], 2, vec, "<") ] <- 0
Your example produced a warning about the mismatch of the number of columns with the length of the vector, but presumably you can adjust the length of the vector to be the correct number of entries.

Resources