Extract p values and r values for all pairwise variables in R

I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an r value and a p value for each pair of variables. I'm somewhat close: I have a minimum working example and an idea of what the end product should look like, but I'm having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than using packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData <- function(N = 200, NEACH = 20, SEED = 100) {
  set.seed(SEED)  # use the SEED argument so the simulated data are reproducible
  variableOne <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableOne[variableOne < 0] <- 0
  variableTwo <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableTwo[variableTwo < 0] <- 0
  variableThree <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableThree[variableThree < 0] <- 0
  geocodeNum <- factor(rep(seq(1, N / NEACH), each = NEACH))
  year <- rep(seq(2000, 2000 + NEACH - 1, 1), N / NEACH)
  # Putting it all together
  AllData <- data.frame(geocodeNum,
                        year,
                        variableOne,
                        variableTwo,
                        variableThree)
  return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between two manually selected variables, but am having some trouble applying it across the entire dataset and at a country level (rather than all at once).
# Example pvalue
corrP = cor.test(spreadMySimData$variableOne,spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne,spreadMySimData$variableTwo)
Finally, the end result should look something like this :
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)

This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # look up names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1,
             Var2 = v2,
             Pval = htest$p.value,
             Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
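The question also asks for correlations at a country level rather than across the whole dataset. A minimal sketch of one way to do that, assuming mySimData, myVariables, and ls_combos as defined above: split the data by geocodeNum and run the same pairwise loop within each country.
# For each country, run the same pairwise cor.test() and tag the result with its geocodeNum
by_country <- lapply(split(mySimData, mySimData$geocodeNum), function(d) {
  rows <- lapply(ls_combos, function(p) {
    v1 <- myVariables[p[1]]
    v2 <- myVariables[p[2]]
    htest <- cor.test(d[[v1]], d[[v2]])
    data.frame(geocodeNum = d$geocodeNum[1],
               Var1 = v1,
               Var2 = v2,
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  })
  dplyr::bind_rows(rows)
})
# One long dataframe: pairwise p and r values for every country
dplyr::bind_rows(by_country)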

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that can be converted to a factor by the split function.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
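A short usage sketch, assuming dat_list as created above: each piece can be pulled out by name, and the pieces can later be put back together with a single rbind().
# Work on one piece at a time, by name
head(dat_list$newdf_1)
# Recombine all the pieces into one data frame again
recombined <- do.call(rbind, dat_list)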
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
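For example, a minimal sketch of such a wrapper (the function name read_letter and its arguments are made up for illustration):
library(readr)
library(dplyr)
# Hypothetical wrapper: keep only the rows whose `string` column equals `letter`
read_letter <- function(file, letter, chunk_size = 1000) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
                   chunk_size = chunk_size)
}
partial_a <- read_letter("test.csv", "a")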
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

In R, how do I concatenate multiple columns within a list of lists

I have a function in R which returns a list with N columns, each of M rows, of multiple types (date, numeric and character). I am using sapply to create multiple copies of these lists, which then end up in a top-level list. I would like to concatenate the underlying lists together to produce a single list of N columns and M * (number of lists) rows.
I've been trying different combinations of do.call, sapply, rbind, c, etc, but I think I'm missing something pretty fundamental. Below is a simple script that mimics the problem and shows the desired outcome. I've used 3 variables here, but the number of variables is arbitrary.
# Set up test function
testfun <- function(varName) {
  currDate <- seq(as.Date('2018-12-31'), as.Date('2019-01-10'), "days")
  t1 <- runif(11)
  t2 <- runif(11)
  groupNum <- c(rep(1, 5), rep(2, 6))
  varName <- rep(varName, 11)
  dataout <- data.frame(currDate, t1, t2, groupNum, varName)
  return(dataout)
}
# create 3 test variables and run the data
varNames = c('test1', 'test2', 'test3')
tmp = sapply(varNames, testfun)
# I would like it to look like the following, but for any given number of variables
desiredAnswer <- rbind(as.data.frame(tmp[,1]),as.data.frame(tmp[,2]),as.data.frame(tmp[,3]))
The final answer will later be used to create a data table and feed ggplot with the varName as facets.
I'm happy to use any method to get the desired results; there's no reason the function needs to produce a list instead of, say, a data.frame. I'm certain I'm doing something dumb, so any help is appreciated.
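In case it helps, a minimal sketch of one way to get that result for any number of variables: build the list with lapply() (so each element stays a data frame rather than being simplified into a matrix of lists) and stack the pieces with do.call(rbind, ...).
# lapply keeps each result as a data.frame; sapply would simplify it into a matrix
tmp <- lapply(varNames, testfun)
# Stack the data frames, however many variables there are
desiredAnswer <- do.call(rbind, tmp)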

Use custom function on data.table

I would like to ask if it is possible to apply this function to a data.table approach:
myfunction <- function(i) {
  a <- test.dt[i, 1:21, with = FALSE]
  final <- t((t(b) == a) * value)
  final[is.na(final)] <- 0
  sum.value <- rowSums(final)
  final1 <- cbind(train.dt, sum.value)
  final1 <- final1[order(-sum.value), ]
  final1 <- final1[final1$sum.value > 0, ]
  suggestion <- unique(final1[, 22, with = FALSE])
  suggestion <- suggestion[1:5, ]
  return(suggestion)
}
This is a custom kNN function I made to be used on character columns. It gives the top 5 suggestions/predictions. However, it has performance issues on my end when it is run on large test data (I have not been able to tweak it myself so far).
The variables used are as follows:
train.dt -- the training data, includes 22 columns (21 features, 1 label column)
test.dt -- the test data, same structure as training data
value -- a vector that contains the weights/importance value of 21 features
sum.value -- sum of all the weights on value vector (sum(value))
b -- has the same data as the training data, but excluding the label column
a -- has the same data as the test data, but excluding the label column
suggestion -- the output
Also, I want to use lapply (or any appropriate apply-family function) on this function; the i variable in the function refers to the row number in the test data, meaning I want to apply it to each row of the test data. I haven't been able to make that work yet.
Hope you can understand and thank you in advance!
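For what it's worth, a minimal sketch of applying the function to every row of the test data with lapply(), assuming myfunction, test.dt, and the other objects described above are already defined:
# One set of top-5 suggestions per row of the test data
all.suggestions <- lapply(seq_len(nrow(test.dt)), myfunction)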

Converting cross-sectional data into an adjacency matrix in R

I am trying to convert cross-sectional data into an adjacency matrix, as I want to analyze how often certain variables are present together with social network analysis.
In case empirical examples would help with the logic, it's basically analogous to presenting 4 people with a choice of three objects; they can choose from 0 to 3 of the objects. I'd like to analyze how commonly different objects were chosen together and visualize this as a network of preferences.
The data is set up as cross-sectional data, below:
ID1 <- c(1,0,0)
ID2 <- c(1,0,1)
ID3 <- c(1,1,1)
ID4 <- c(0,0,0)
IDs <- c("1","2","3","4")
df <- data.frame(rbind(ID1, ID2, ID3, ID4))
df <- cbind(IDs, df)
colnames(df) <- c("ID", "Var1", "Var2", "Var3")
I'd like to create a weighted adjacency matrix for Var1, Var2 and Var3, with each cell containing the total number of times the two variables occur together among the observations.
So the basic procedure I was thinking about is to create a separate matrix for each row (each ID number) with a 1 or 0 for each cell indicating whether or not both variables are present for the ID. And then add these matrices together, so the final matrix gives the total number of joint appearances.
I've been looking around and haven't quite gotten it right. I thought of using outer, but it would need to work for each column in sequence. This answer was pretty close, but I wasn't exactly sure how they were adding the values together; I ended up with a list of matrices, but the values didn't correspond to the initial data:
Convert categorical data in data frame to weighted adjacency matrix. And this answer was also close, although it seemed to deal with a different type of data; it gave me an adjacency matrix based on the IDs:
http://r.789695.n4.nabble.com/Conversion-to-Adjacency-Matrix-td794102.html
Here is some very messy code that manually creates a matrix for one observation, just so you get a sense of what I'm going for (using a vector representing just the first ID observation):
ID1 <- c(1,0,0)
var1 <- ID1[[1]]
var2 <- ID1[[2]]
var3 <- ID1[[3]]
onetwo <- var1 * var2
onethree <- var1 * var3
twothree <- var2 * var3
oneone <- var1 * var1
twotwo <- var2 * var2
threethree <- var3 * var3
rows1 <- rbind(oneone, onetwo, onethree)
rows2 <- rbind(onetwo, twotwo, twothree)
rows3 <- rbind(onethree, twothree, threethree)
df2 <- cbind(rows1, rows2, rows3)
This obviously is not ideal, my actual dataset has 198 observations and 33 variables, so even with looping or the use of apply functions it would be very inefficient.
I can't tell if I'm making this more difficult than it needs to be, or if I'm trying to force my data to do something it wasn't meant to do. But if anyone has run into this sort of task before, please let me know. Is there a way to create the desired adjacency matrix directly? Should I transfer this into an edge list first, and is there a good way to do that? Is there code that would make the first step(creating a matrix for each row of the dataframe) more efficient?
Thanks for your help,
I'm not sure if I understand the question, but is this what you want?
nc <- 33
nr <- 198
m3 <- matrix(sample(0:1, nc * nr, replace = TRUE), nrow = nr)
df3 <- data.frame(m3)
m3b <- matrix(0, nrow = nc, ncol = nc)
for (i in seq(1, nc)) {
  for (j in seq(1, nc)) {
    t3 <- table(df3[, i], df3[, j])
    m3b[i, j] <- t3[2, 2]  # t3[2,2] contains the count of rows where df3[,i] = df3[,j] = 1
    # or
    # t3 <- sum(df3[,i] == df3[,j] & df3[,i] == 1)
    # m3b[i,j] <- t3
  }
}
Or, if you want the sum of the products, which gives the same result when everything is 1 or 0:
m3c <- matrix(0, nrow = nc, ncol = nc)
for (i in seq(1, nc)) {
  for (j in seq(1, nc)) {
    sv <- 0
    for (k in seq(1, nr)) {
      vi <- df3[k, i]
      vj <- df3[k, j]
      sv <- sv + vi * vj
    }
    m3c[i, j] <- sv
  }
}
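The same co-occurrence counts can also be computed without explicit loops: because the columns are 0/1, the cross-product t(X) %*% X counts, for every pair of variables, the rows in which both are 1. A minimal sketch using the small df from the question:
# Drop the ID column and coerce the 0/1 variables to a numeric matrix
X <- as.matrix(df[, c("Var1", "Var2", "Var3")])
# crossprod(X) is t(X) %*% X: entry [i, j] counts the rows where both variables are 1
adjacency <- crossprod(X)
adjacency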

Calculate statistics (e.g. average) across cells of identical data-frames

I have a list of identically sorted data frames. More specifically, these are the imputed data frames I get after doing multiple imputations with the Amelia II package. Now I want to create a new data frame that is identical in structure, but contains the mean values of the cells calculated across the data frames.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the mean of the matrices (abind() comes from the abind package)
mean.right <- apply(abind(a, b, c, d, e, along = 3), c(1, 2), mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose that there is a much better way of doing this by using apply, plyr or the like, but as a R Newbie I am really a bit lost here. Do you have any suggestions how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
# since all variables are numerical, use a three-dimensional array
tst2 <- array(do.call(c, lapply(tst, unlist)), dim = c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
# To see where you're at:
tst2
# rowMeans for a three-dimensional array with dims = 2 takes the mean over the last dimension
result <- data.frame(rowMeans(tst2, dims = 2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions of the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
  a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
  for (j in 1:dim(a)[2]) {
    imp.df[i, j] <- mean(as.numeric(a[i, j, ]))
  }
}