Calculate statistics (e.g. average) across cells of identical data-frames - r

I have a list of identically sorted data frames. More specifically, these are the imputed data frames I get after running multiple imputation with the AmeliaII package. Now I want to create a new data frame that is identical in structure, but contains the mean of each cell calculated across the imputed data frames.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the mean of the matrices (the abind package is needed here)
library(abind)
mean.right <- apply(abind(a, b, c, d, e, along = 3), c(1, 2), mean)
# recombine factors with values
mean.df <- cbind(left.side, mean.right)
I suppose that there is a much better way of doing this by using apply, plyr or the like, but as an R newbie I am really a bit lost here. Do you have any suggestions on how to go about this?

Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)

If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
# since all variables are numeric, use a three-dimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
# rowMeans on a three-dimensional array with dims=2 takes the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.

After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions of the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
  a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
  for (j in 1:dim(a)[2]) {
    imp.df[i, j] <- mean(as.numeric(a[i, j, ]))
  }
}
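For what it's worth, the two nested loops can be collapsed into a single apply() call over the first two array dimensions; a sketch, assuming the same a and imp.df as above:
# Equivalent one-liner: mean over the third ("imputation") dimension of a
imp.df[] <- apply(a, c(1, 2), function(v) mean(as.numeric(v)))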

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group (the formula interface needs R >= 4.1; split(dat, dat$group) also works)
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
  A = sample(letters, size = 70000000, replace = TRUE),
  B = rpois(70000000, lambda = 1)
)
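Since the goal is to recombine the pieces with rbind() afterwards, note that the whole list can be rebound in one call rather than one pair at a time; a sketch using the dat_list from above:
# Row-bind every sub data frame back together in one step
recombined <- do.call(rbind, dat_list)
# Drop the helper column once it's no longer needed
recombined$group <- NULL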
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
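For example, a minimal sketch of such a wrapper (the function name and target argument are made up for illustration):
# Hypothetical wrapper: read only the rows whose `string` column equals `target`
read_matching_rows <- function(path, target, chunk_size = 1000) {
  read_csv_chunked(path,
                   DataFrameCallback$new(function(x, pos) filter(x, string == target)),
                   chunk_size = chunk_size)
}
partial_data <- read_matching_rows("test.csv", "b")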
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Random sampling R

I am new to R and trying to accomplish a fairly simple task. I have a dataset composed of 20 obs of 19 variables and I want to generate three non-overlapping groups of 5 obs. I am using the slice_sample function from the dplyr package, but how do I reiterate, excluding the obs already picked in the first round?
library( "dplyr")
set.seed(123)
NF_1 <- slice_sample(NF, n = 5)
You can use the sample function from base R.
All you have to do is sample the rows with replace = FALSE, which means you won't have any overlap. You can also define the number of samples.
n_groups <- 3
observations_per_group <- 5
size <- n_groups * observations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)
# Now index those selected rows
NF_1 <- NF[selected_samples, ]
Now, according to your comment, if you want to generate N data frames, each with a number of samples, and also label them accordingly, you can use lapply (a function that "applies" another function to a set of values). The "l" in "lapply" means it returns a list. There are other types of apply functions; you can read more about them (and I highly recommend that you do!) here.
This code should solve your problem, or at least give you a good idea of where to go.
n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups
# First we'll get the row samples.
selected_samples <- sample(
  seq_len(nrow(NF)),
  size = size,
  replace = FALSE
)
# Now we split them between the number of groups
split_samples <- split(
  selected_samples,
  rep(1:n_groups, observations_per_group)
)
# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.
my_dataframes <- lapply(1:n_groups, function(x) {
  # our subset df will be the original df with the list of samples
  # for the group at position "x" (1, 2, 3, ..., n_groups)
  subset_df <- NF[split_samples[[x]], ]
  return(subset_df)
})
# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements
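If you also want each data frame labeled, you can name the list elements (the NF_ prefix here is just an example):
# Label the list elements so groups can be accessed by name
names(my_dataframes) <- paste0("NF_", 1:n_groups)
my_dataframes$NF_2 # same as my_dataframes[[2]]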

Extract p values and r values for all pairwise variables

I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a p value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulty actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than with packages like Hmisc, as that has created a number of other issues. I'd had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N=200, NEACH = 20, SEED=100){
  variableOne <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableOne[variableOne < 0] <- 0
  variableTwo <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableTwo[variableTwo < 0] <- 0
  variableThree <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableThree[variableThree < 0] <- 0
  geocodeNum <- factor(rep(seq(1, N/NEACH), each = NEACH))
  year <- rep(seq(2000, 2000 + NEACH - 1, 1), N/NEACH)
  # Putting it all together
  AllData <- data.frame(geocodeNum,
                        year,
                        variableOne,
                        variableTwo,
                        variableThree)
  return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example pvalue
corrP = cor.test(spreadMySimData$variableOne,spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne,spreadMySimData$variableTwo)
Finally, the end result should look something like this :
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # look up names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1,
             Var2 = v2,
             Pval = htest$p.value,
             Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
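Since you also asked about doing this per country rather than over the whole dataset, here is a sketch (reusing myVariables and ls_combos from above) that splits the data by geocodeNum and row-binds the per-country results:
# Run the same pairwise cor.test() within each geocodeNum group
by_country <- lapply(split(mySimData, mySimData$geocodeNum), function(d) {
  rows <- lapply(ls_combos, function(p) {
    v1 <- myVariables[p[1]]
    v2 <- myVariables[p[2]]
    htest <- cor.test(d[[v1]], d[[v2]])
    data.frame(geocode = d$geocodeNum[1], Var1 = v1, Var2 = v2,
               Pval = htest$p.value, Rval = unname(htest$estimate))
  })
  dplyr::bind_rows(rows)
})
dplyr::bind_rows(by_country)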

Calculating log returns over columns of a data frame + store the results in a new data frame

My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. It has 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run this code, but I am sure there is a much more elegant way to do the calculation in a short way.
# count the sum of rows in order to make the following formula work appropriately - (n-1)
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S8[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S9[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a way to make this code shorter, maybe with some kind of loop or a for statement? Since I am quite new to R, I am trying to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
What the code below does is: 1) take columns 2 to 22 from the data frame called df; 2) for each of these columns, take the logarithm and then the difference between neighboring rows; 3) when done, convert the result to a data frame called df2.
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
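If you want to keep the dates alongside the returns, drop the first date to match the differenced length; a small sketch:
# Reattach DATE, dropping the first row lost to differencing (4321 rows remain)
LogRet_df <- data.frame(DATE = df$DATE[-1], df2)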

Bootstrapping data frame columns independently in R

I have a data.frame where each column represents a different individual and each row represents different food items eaten.
My goal is to resample each column via bootstrapping and then calculate a metric score and C.I.s for each individual (data column) using a defined function.
I have done this successfully on a single vector but cannot figure out how to apply the bootstrapping and metric function to individual columns in a data frame. Below is the code I have to apply it to a single vector:
data.1 <- c(10, 50, 200, 54, 6) ## example vector
## create function
metric.function <- function(x){
  p <- x/sum(x)
  dap <- 1/sum(p^2)
  return(dap)
}
vect <- c() ## empty vector for bootstrap data
for (i in 1:1000){
  data.2 <- sample(data.1, replace = TRUE) ## bootstrap sample
  vect[i] <- metric.function(data.2) ## apply metric.function
}
summary(vect) ## summary
quantile(vect, probs = c(0.025, 0.975)) ## C.I.
This works fine for a single vector but I want to apply it independently to multiple columns in a data frame, for example in the example.df below I want to apply it to x1:x10 independently resulting in 10 metric scores and 10 C.I.s
example.df<-data.frame(replicate(10,sample(0:50,10,rep=TRUE)))
I have tried changing the vector item to a data.frame and messing around with apply and dplyr but cannot figure it out; can anyone suggest how to do it or point me in the direction of a useful guide/website?
This is a perfect chance to use replicate and sapply.
replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))
sapply will operate column-wise (given that a data.frame is in a sense a list of columns); once we've isolated a column within sapply, we need only resample it & apply our metric.
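To get the 10 metric scores and 10 C.I.s the question asks for, store the replicate() result (a 10 x 1000 matrix, one row per column of example.df) and summarise it row-wise:
boot_mat <- replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))
rowMeans(boot_mat) # one bootstrap metric score per individual
apply(boot_mat, 1, quantile, probs = c(0.025, 0.975)) # 95% C.I. per individual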
