I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a P value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than using packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N=200, NEACH = 20, SEED=100){
  set.seed(SEED)  # use the SEED argument so the simulated data are reproducible
  variableOne <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableOne[variableOne < 0] <- 0
  variableTwo <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableTwo[variableTwo < 0] <- 0
  variableThree <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableThree[variableThree < 0] <- 0
  geocodeNum <- factor(rep(seq(1, N/NEACH), each = NEACH))
  year <- rep(seq(2000, 2000 + NEACH - 1, 1), N/NEACH)
  # Putting it all together
  AllData <- data.frame(geocodeNum,
                        year,
                        variableOne,
                        variableTwo,
                        variableThree)
  return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example pvalue
corrP = cor.test(spreadMySimData$variableOne,spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne,spreadMySimData$variableTwo)
Finally, the end result should look something like this :
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # lookup names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1,
             Var2 = v2,
             Pval = htest$p.value,
             Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
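The question also asks for results at the country level. One way to get those (a sketch reusing ls_combos and myVariables from above, and assuming mySimData still holds the simulated data with its geocodeNum column) is to repeat the same pair loop within each country via split():
by_country <- lapply(split(mySimData, mySimData$geocodeNum), function(d) {
  rows <- lapply(ls_combos, function(p) {
    v1 <- myVariables[p[1]]
    v2 <- myVariables[p[2]]
    htest <- cor.test(d[[v1]], d[[v2]])
    data.frame(geocodeNum = d$geocodeNum[1],
               Var1 = v1, Var2 = v2,
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  })
  dplyr::bind_rows(rows)
})
dplyr::bind_rows(by_country)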
CONTEXT:
I'm working with respondent-level survey data. Each row of my data frame represents an individual person's survey responses. My data frame consists of individual-level utility estimates from a Maximum Difference experiment and categorical variables indicating in which of several subgroups an individual survey respondent resides.
Each subgroup variable is a single categorical variable with exactly two levels. However, in my desired output, I'd like a data frame where each level of each subgroup has its own column.
OBJECTIVE:
I want to create a function that, for each user-defined subgroup, will run t tests over every Maximum Difference item in the data frame, extract elements of the t test output, and store those elements in a data frame.
Using T statistic results as an example, the end result should look like this:
Males_T_stat Females_T_stat
MD_item1 2.71 2.5
MD_item2 1.71 1.5
MD_item3 0.71 0.5
CURRENT CODE:
Right now, I'm focused on writing code to iteratively execute the t tests and store each test's entire output object in a list. The code from my (so far unsuccessful) attempt is below:
Create a test data frame:
dat <- data.frame(
  md1 = 1:60,
  gender = factor(rep(c("m", "f"), 30)),
  generation = factor(rep(c("a", "b"), 30)),
  md2 = 61:120
)
Specify the names of my respondent subgroups (i.e., the categorical variables).
groupnames <- c("gender", "generation")
library(dplyr)  # needed for %>% and select()
item_vec <- dat %>% select(contains("md"))
group_vec <- dat[groupnames]
Convert the subgroup name vectors to data frames. This step may be superfluous, but I'm more comfortable working with data frames.
item_vec <- data.frame(item_vec)
group_vec <- data.frame(group_vec)
So far, I've tried using nested for loops to run the T Tests and store each test output in a list. This code partially works; for each subgroup named in "group_vec", the code produces T Test results for the last item in "item_vec" only. However, I want the results for EVERY item in "item_vec", which is where I've currently stalled.
res <- list()
for (i in 1:length(group_vec)) {
  res[[i]] <- list(test)
  for (j in 1:length(item_vec)) {
    test <- (t.test(item_vec[[j]] ~ group_vec[[i]]))
    res[i] <- list(test)
  }
}
res
Thank you in advance for any help you can provide!
In the nested loop, replace
res[i] <- list(test)
with
res[[i]][[j]] <- list(test)
since 'j' loops over item_vec. If we just assign to res[[i]] or res[i], each pass over item_vec overwrites the previous result, and because nothing is assigned after the last item, only the last item's test remains for each element of 'group_vec'.
Also, it may be better to initialize res as
res <- vector('list', length(group_vec))
and then make the change inside the for loop:
for (i in 1:length(group_vec)) {
  res[[i]] <- vector('list', length(item_vec))  # one slot per item
  for (j in 1:length(item_vec)) {
    test <- t.test(item_vec[[j]] ~ group_vec[[i]])
    res[[i]][[j]] <- list(test)
  }
}
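If the eventual goal is the wide table of statistics shown under OBJECTIVE, one optional follow-up (a sketch, not part of the original answer) is to pull the statistic straight out of each test with sapply() instead of storing whole htest objects. Note that a two-sample t test yields a single statistic per grouping variable, so this produces one column per subgroup variable rather than one per level as in the mock-up above:
t_stats <- sapply(group_vec, function(g) {
  sapply(item_vec, function(x) unname(t.test(x ~ g)$statistic))
})
t_stats  # rows are the md items, columns are the grouping variables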
I am trying to subset this data frame by predetermined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set predetermined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
  # subset by row number, each row number in variables train.length.1, 2 etc..
  train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
  test.list[[i]] <- df[test.length.[i],]   # place test.length.n variable here..
  # save outcome to lists
}
My question is, if I have my row numbers stored in a variable, how do I place each [ith] one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
however that pastes it as a character string and doesn't read my train.length.n variable... as below:
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If I have the variable in there by itself, it works as intended. I just need it to work in a for loop.
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but it's cumbersome for lots of train/test sets... hence the for loop.
Consider a staggered seq() and pass the number sequences to lapply() to slice by rows. Also, for equal-length data frames, you likely intended the starts to be at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
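If you specifically want to look up the pre-named train.length.n / test.length.n variables inside the loop (the literal question above), get() with paste0() is one base R option; a minimal sketch, assuming the variables and empty lists defined in the question:
for (i in 1:length(train.length)) {
  train.list[[i]] <- df[get(paste0("train.length.", i)), , drop = FALSE]
  test.list[[i]]  <- df[get(paste0("test.length.", i)), , drop = FALSE]
}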
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
  n <- nrow(data_frame)
  r <- rep(1:ceiling(n/chunk_size), each=chunk_size)[1:n]
  sub_frames <- split(data_frame, r)
  return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame into chunks of 50:
chunked_frames <- split_frame_by_chunks(df, 50)
Decide number of train/test splits to create in the loop
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question (i.e. the first iteration creates a train and test set with rows 1-50 and 51-100 respectively):
for(i in 1:num_splits) {
  this_train <- chunked_frames[[i]]
  this_test <- chunked_frames[[i + 1]]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
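For example, to keep every split rather than overwrite it on each pass, the loop could store the chunks in lists (a sketch continuing from chunked_frames above):
train.list <- list()
test.list  <- list()
for (i in 1:num_splits) {
  train.list[[i]] <- chunked_frames[[i]]      # e.g. rows 1-50 on the first pass
  test.list[[i]]  <- chunked_frames[[i + 1]]  # e.g. rows 51-100 on the first pass
}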
Preliminaries: this question is mostly of educational value; the actual task at hand is completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly, perhaps using additional packages such as plyr or reshape. Run on the actual data it takes about 140 seconds, much longer than on the simulated data, since some of the original rows contain nothing but NAs and additional checks have to be made. For comparison, the simulated data are processed in about 30 seconds.
Conditions: the dataset contains 360 variables, 30 times the set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2 ... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status. For instance: work (yes/no), study (yes/no) and so forth, in total 12 statuses, repeated 30 times.
Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.
Data: I cannot post the actual dataset, but here is a good simulated approximation:
randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x=c(rep(0,9),1,NA,NA), size=12, replace=F)
}
# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA, ncol=12, nrow=1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}
mydata <- NULL
# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata, makeDf())
}
mydata <- as.data.frame(mydata) # example data ready
My solution:
# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30, function(i) {
  Z <- rep(1:30, each=12) # define selection vector
  mydata[Z==i]            # use selection vector to get groups of variables (x12)
})
recodeDf <- function(df) {
  result <- as.numeric(apply(df, 1, function(x) {
    if (any(!is.na(x))) which(x == 1) else NA # return the position of "1" per row
  })) # the if/else check is for the real data
  return(result)
}
# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind,lapply(S1,recodeDf)))
All in all, there are two nested *apply calls, one across the list, the other across the data frame rows. This makes it a bit slow. Any suggestions? Thanks in advance.
Here is an approach that is basically instantaneous (system.time = 0.1 seconds). It uses data.table's set(). The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.
library(data.table)
MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)), simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30,each = 12))
setattr(columnMatch, 'names', names(newDT))
# cycle through all new columns
# and assign the rows in the new data.table
## Arun: had to generate numeric indices for
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for(jj in seq_along(columnMatch)) {
  for(ii in seq_along(columnMatch[[jj]])) {
    set(newDT, j = jj, i = whiches[[ii + 12 * (jj-1)]], value = ii)
  }
}
This would work just as well adding columns by reference to the original.
Note set works on data.frames as well....
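As a quick sanity check (assuming final.df from the question has also been computed), the two results can be compared; with the simulated data, where every row contains exactly one 1, this should return TRUE:
all(unlist(final.df) == unlist(newDT))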
I really like #Arun's matrix multiplication idea. Interestingly, if you compile R against some OpenBLAS libraries, you could get this to operate in parallel.
However, I wanted to provide another solution, perhaps slower than matrix multiplication, that uses your original pattern but is much faster than your implementation:
# Match is usually faster than which, because it only returns the first match
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1)
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)
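Since lapply() returns a plain list of vectors here, wrapping it in data.frame() gives the same 1500 x 30 shape as the original final.df, if that is preferred:
final.df2 <- data.frame(lapply(S2, recodeDf2))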
If you had a very large data frame, and many processors, you may consider parallelizing this operation with:
library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores)
# Where numcores is your number of processors.
Having read #Arun and #mnel, I learned a lot about how to improve this function, by avoiding the coercion to an array, by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; OP should consider switching the checkbox to #mnel's answer.
I wanted, however, to share a solution that doesn't use data.table, and avoids for. It is still, however, slower than #mnel's solution, albeit slightly.
nograpes2 <- function(mydata) {
  test <- function(df) {
    l <- lapply(df, function(x) which(x==1))
    lens <- lapply(l, length)
    rep.int(seq.int(l), times=lens)[order(unlist(l))]
  }
  S2 <- split.default(mydata, rep(1:30, each=12))
  data.frame(lapply(S2, test))
}
I would also like to add that #Aaron's approach, using which with arr.ind=TRUE would also be very fast and elegant, if mydata started out as a matrix, rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.
IIUC, you have only one 1 per set of 12 columns; the rest are 0's or NA's. If so, the operation can be performed much faster with this idea.
The idea: Instead of going through each row and asking for the position of 1, you could use a matrix with dimensions 1500 * 12 where each row is just 1:12. That is:
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
Now, you can multiply this matrix with each of your subset data.frames (of the same dimensions, 1500*12 here) and then take their "rowSums" (which is vectorised) with na.rm = TRUE. This directly gives, for each row, the position of the 1 (because that 1 will have been multiplied by the corresponding value between 1 and 12).
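To make the idea concrete before the data.table version below, here is a minimal base R sketch on just the first block of 12 columns (assuming mydata from the question):
block   <- as.matrix(mydata[, 1:12])
mul.mat <- matrix(rep(1:12, nrow(block)), ncol = 12, byrow = TRUE)
pos1    <- rowSums(block * mul.mat, na.rm = TRUE)  # position of the 1 in each row
head(pos1)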
data.table implementation: Here, I'll use data.table to illustrate the idea. Since it creates columns by reference, I'd expect that the same idea used on a data.frame would be a tad slower, although it should drastically speed up your current code.
require(data.table)
DT <- data.table(mydata)
ids <- seq(1, ncol(DT), by=12)
# for multiplying with each subset and taking rowSums to get position of 1
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
for (i in ids) {
  sdcols <- i:(i+12-1)
  # keep appending the new columns by reference to the original data
  DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat,
                                            na.rm = TRUE), .SDcols = sdcols]
}
# delete all original 360 columns by reference from the original data
DT[, grep("V", names(DT), value=TRUE) := NULL]
Now, you'll be left with 30 columns that correspond to the position of 1's. On my system, this takes about 0.4 seconds.
all(unlist(final.df) == unlist(DT)) # not a fan of `identical`
# [1] TRUE
Another way this could be done with base R is by simply getting the values you want to put in the new matrix and filling them in directly with matrix indexing.
idx <- which(mydata==1, arr.ind=TRUE) # get indices of 1's
i <- ((idx[,2] - 1) %% 12) + 1 # get the position of the 1 within its block of 12 (1..12)
idx[,2] <- ((idx[,2] - 1) %/% 12) + 1 # get "group" and put in "col" of idx
out <- array(NA, dim=c(1500,30)) # make empty matrix
out[idx] <- i # and fill it in!
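As with the data.table answer, the result can be checked against the question's final.df (assuming it has been computed); with the simulated data this should return TRUE:
all(unlist(final.df) == as.vector(out))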
I have a list of identically sorted dataframes. More specifically, these are the imputed dataframes which I get after doing multiple imputations with the AmeliaII package. Now I want to create a new dataframe that is identical in structure, but contains the mean values of the cells calculated across the dataframes.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the Mean of the matrices (abind() stacks them into a 3-D array)
library(abind)
mean.right <- apply(abind(a,b,c,d,e,along=3), c(1,2), mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose that there is a much better way of doing this by using apply, plyr or the like, but as a R Newbie I am really a bit lost here. Do you have any suggestions how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
# since all variables are numerical, use a three-dimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
# rowMeans on a three-dimensional array with dims=2 takes the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame has the same dimensions as the original (pre-imputation) data frame,
# which 'df' stands for here
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
  a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
  for (j in 1:dim(a)[2]) {
    imp.df[i, j] <- mean(as.numeric(a[i, j, ]))
  }
}
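As a footnote, the two nested loops can be replaced by a single vectorised call, reusing the apply()/rowMeans() patterns already shown in the answers above (a sketch; it assumes the array a holds only numeric values, which the loop above already requires):
mean.mat <- apply(a, c(1, 2), mean, na.rm = TRUE)  # or: rowMeans(a, dims = 2, na.rm = TRUE)
imp.df[] <- as.data.frame(mean.mat)                # keeps imp.df's column names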