I'm interested in arriving at a cross-tab of missing values across all columns of a SparkR DataFrame. The data I'm working with can be generated with the code below:
Data
set.seed(2)
# Create basic matrix
M <- matrix(
nrow = 100,
ncol = 100,
data = base::sample(x = letters, size = 1e4, replace = TRUE)
)
## Force missing values
M[base::sample(1:nrow(M), 10),
base::sample(1:ncol(M), 10)] <- NA
table(is.na(M))
SparkR
Following this answer, I would like to arrive at the desired solution using flatMap. The idea is to replace missing / non-missing values with TRUE/FALSE and then count the occurrences for each variable. First, it appears that flatMap is not exported by SparkR 2.1, so I had to dig it out with :::
# Import data to SparkR ---------------------------------------------------
# Feed data into SparkR
dtaSprkM <- createDataFrame(sqc, as.data.frame(M))
## Preview
describe(dtaSprkM)
# Missing values count ----------------------------------------------------
# Function to convert missing to T/F
convMiss <- function(x) {
ifelse(test = isNull(x),
yes = FALSE,
no = TRUE)
}
# Apply
dtaSprkMTF <- SparkR:::flatMap(dtaSprkM, convMiss)
## Derive data frame
dtaSprkMTFres <- createDataFrame(sqc, dtaSprkMTF)
Second, running the code fails with the following error message:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘isNull’ for signature ‘"list"’
Desired results
On an ordinary R data frame the desired results can be achieved in the following manner:
sapply(as.data.frame(M), function(x) {
prop.table(table(is.na(x)))
})
I like the flexibility that table and prop.table offer, and ideally I would like to achieve something similar via SparkR.
Compute the fraction of non-null values per column:
fractions <- select(dtaSprkM, lapply(columns(dtaSprkM), function(c)
alias(avg(cast(isNotNull(dtaSprkM[[c]]), "integer")), c)
))
This will create a single-row SparkR DataFrame which can be safely collected and easily reshaped locally, for example with tidyr:
library(tidyr)
fractions %>% as.data.frame %>% gather(variable, fraction_not_null)
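If you want counts of missing values per column rather than fractions (closer to the table output in the question), a variant of the same expression should work. The following is only a sketch, not part of the original answer, swapping isNotNull and avg for isNull and sum:
# Sketch: count NULLs per column by summing an integer indicator
# (assumes the same dtaSprkM DataFrame as above)
naCounts <- select(dtaSprkM, lapply(columns(dtaSprkM), function(c)
  alias(sum(cast(isNull(dtaSprkM[[c]]), "integer")), c)
))
as.data.frame(naCounts)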
Related
I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a P value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than using packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N=200, NEACH = 20, SEED=100){
set.seed(SEED)
variableOne<-rnorm(N,sample(1:100, NEACH),0.5)
variableOne[variableOne<0]<-0
variableTwo<-rnorm(N,sample(1:100, NEACH),0.5)
variableTwo[variableTwo<0]<-0
variableThree<-rnorm(N,sample(1:100, NEACH),0.5)
variableThree[variableThree<0]<-0
geocodeNum<-factor(rep(seq(1,N/NEACH),each=NEACH))
year<-rep(seq(2000,2000+NEACH-1,1),N/NEACH)
# Putting it all together
AllData<-data.frame(geocodeNum,
year,
variableOne,
variableTwo,
variableThree)
return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example pvalue
corrP = cor.test(spreadMySimData$variableOne,spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne,spreadMySimData$variableTwo)
Finally, the end result should look something like this :
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
# lookup names of variables
v1 <- myVariables[p[1]]
v2 <- myVariables[p[2]]
# perform the cor.test()
htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
# record pertinent info in a dataframe
data.frame(Var1 = v1,
Var2 = v2,
Pval = htest$p.value,
Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
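The question also asks for results at the country level. One possibility (a sketch reusing myVariables and ls_combos from above, not tested on the real data) is to split the data by geocodeNum and run the same pairwise routine within each group:
# split by country, run the pairwise cor.test within each group
by_country <- lapply(split(mySimData, mySimData$geocodeNum), function(d) {
  rows <- lapply(ls_combos, function(p) {
    v1 <- myVariables[p[1]]
    v2 <- myVariables[p[2]]
    htest <- cor.test(d[[v1]], d[[v2]])
    data.frame(Var1 = v1, Var2 = v2,
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  })
  dplyr::bind_rows(rows)
})
# stack, keeping the country identifier as a column
dplyr::bind_rows(by_country, .id = "geocodeNum")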
I have 24 variables for which I need to get frequencies when each is combined with another variable. In SAS this would be a simple macro:
%macro results (house_color,variable);
proc freq data = all_houses;
table &variable.*majority_&house_color. / missing out= &variable._by_&house_color.;
run;
%mend;
%results( house_color = purple, variable=pool_or_no_pool) ;
%results (house_color = blue, variable = upb);
I tried creating a function in R using a code provided in a different question I asked earlier:
library(arsenal)  # freqlist() comes from the arsenal package
library(dplyr)    # for %>% and select()
results <- function(x,y,d,z){
freqs <- freqlist(table(z[c(x,y)],useNA = "always"))
freq_df <- as.data.frame(freqs) %>% select(1:3)
colnames(freq_df)[3] <- d
freq_df
}
where x is the first variable, y is the second variable, d is what I want the Freq column to be renamed to, and z is the dataset. However, when I take the results from the function and try to use cbind to consolidate all of the information, I get the following error:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 10, 11
I was wondering if there is a simpler way to obtain the frequency tables for these 24 variables and either stack or cbind the information without creating too many NAs. I have used rbind.fill, but the data frame it produces is messy and has a lot of NAs.
Thank you in advance for the help.
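The cbind error comes from the per-pair tables having different numbers of rows. One way around it (a sketch, using all_houses, pool_or_no_pool, majority_purple, upb and majority_blue from the SAS macro calls as stand-in names) is to keep every table in long format with fixed column names and stack them instead of binding columns:
library(dplyr)
# build one two-way frequency table in long form with fixed column names
freq_pair <- function(x, y, d) {
  tab <- as.data.frame(table(d[[x]], d[[y]], useNA = "always"),
                       stringsAsFactors = FALSE)
  names(tab) <- c("level_x", "level_y", "Freq")
  tab$var_x <- x
  tab$var_y <- y
  tab
}
# stack all pairs; differing numbers of levels no longer matter
pairs <- list(c("pool_or_no_pool", "majority_purple"),
              c("upb", "majority_blue"))
stacked <- bind_rows(lapply(pairs, function(p) freq_pair(p[1], p[2], all_houses)))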
I used the following code to try to replace each variable's values that fall below the bottom 2.5% or above the top 97.5% quantile with those cutoff values. You can run the code yourself; it uses an openly available data file.
credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
fun <- function(x){
quantiles <- quantile( x, c(.025, .975 ) )
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
x
}
fun(credit)
But the following error message appeared:
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected
What's the problem? I'm grateful for any help!
Additional comment:
I found that the above function does not work on a data frame but only on a vector.
I can change the outlier of each variable in the data file with the following code:
credit$Duration.of.Credit..month. <- pmax(quantile(credit$Duration.of.Credit..month.,.025),
pmin(credit$Duration.of.Credit..month., quantile(credit$Duration.of.Credit..month.,.975)))
However, my data file has so many variables that it is inconvenient to enter this code one variable at a time. So how can I cap the outliers of all the variables at the specific cutoff values without writing pmax/pmin code for each one?
There's actually nothing wrong with your function as long as you apply it to a column. I'd use mutate_at or mutate_all (if you really want to apply it to all columns) of the dplyr package. Something like this:
library(dplyr)
credit_trunc <- credit %>%
mutate_at(vars(Credit.Amount, Creditability), funs(fun))
or
credit_trunc <- credit %>%
mutate_all(funs(fun))
or if you also have columns of another type (e.g. factor, character) in your data frame, you can use:
credit_trunc <- credit %>%
mutate_if(is.numeric, funs(fun))
This will give you back the data frame with the chosen / all columns / all numeric columns modified as you wanted it.
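As a side note, funs() has since been deprecated in dplyr; on dplyr 1.0 or later the same idea is usually written with across(). A minimal sketch assuming the same credit data frame and fun() from above:
library(dplyr)
# apply fun() to every numeric column with across()
credit_trunc <- credit %>%
  mutate(across(where(is.numeric), fun))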
I tried searching for the error which I am getting while using the mean function in R 3.1.2.
Purpose: Calculate Mean of datasets
Used Functions: sapply, summary to calculate mean as shown below:
sapply(data,mean,na.rm=TRUE)
summary(data)
Problem Faced: Now I am trying to use the mean function to calculate the mean of the complete dataset. I used the function like this:
> testingnew <-data[complete.cases(data),]
> mean(testingnew)
Popped Warning :
[1] NA
Warning message:
In mean.default(testingnew) :
argument is not numeric or logical: returning NA
Question: Can someone please tell me why this warning appears? I tried to remove NAs (missing values) using complete.cases.
# To eliminate rows with any missing values:
testingnew <- subset(data, complete.cases(data))
#Choose a column to calculate the mean:
#Make sure it is numeric or integer
class(testingnew$Col1)
mean(testingnew$Col1, na.rm=TRUE)
Maybe you can try to reproduce this workflow with your own dataset... It seems the only thing missing is referring to individual columns with the mean function, or using sapply as you did before.
Create a dataframe using random values
my.df <- data.frame(x1 = rnorm(n = 200), x2 = rnorm(n=200))
Spread NAs randomly into the data frame
is.na(my.df) <- matrix(sample(c(TRUE,FALSE), replace= TRUE, size = 400,
prob=c(0.10, 0.90)),
ncol = 2)
For getting means without using complete cases:
mean(my.df$x1, na.rm=TRUE) # mean(my.df[,1], na.rm=TRUE) is equivalent
mean(my.df$x2, na.rm=TRUE) # mean(my.df[,2], na.rm=TRUE) is equivalent
Complete-case approach (if this is what you really need):
my.df.complete <- my.df[complete.cases(my.df),]
Get means for both columns
sapply(X = my.df.complete, FUN = mean)
Get mean from individual columns
mean(my.df.complete$x1)
mean(my.df.complete$x2)
Creating a subset helped:
data3 <-subset(data, !is.na(Ozone))
mean(data3$Ozone)
I have a list of identically structured data frames. More specifically, these are the imputed data frames which I get after doing multiple imputations with the Amelia II package. Now I want to create a new data frame that is identical in structure, but contains the mean values of the cells calculated across the data frames.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
library(Amelia)
library(abind)
a.out <- amelia(merged, m=5, ts="Year", cs="GEO", polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the mean of the matrices
mean.right <- apply(abind(a, b, c, d, e, along=3), c(1,2), mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose that there is a much better way of doing this by using apply, plyr or the like, but as an R newbie I am really a bit lost here. Do you have any suggestions on how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
# since all variables are numeric, use a three-dimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
# rowMeans on a three-dimensional array with dims=2 averages over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions as the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim=c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation in each "slice" of the array
for (z in 1:length(a.out$imputations)) {
a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
for (j in 1:dim(a)[2]) {
imp.df[i, j] <- mean(as.numeric(a[i, j,]))
}}
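For reference, the two nested loops can usually be collapsed into a single apply call over the third dimension of the array, mirroring the abind approach shown earlier. A sketch assuming the a and imp.df objects built above (as.numeric is kept because as.matrix may have coerced the slices to character):
# mean of each cell across all imputations in one call
imp.means <- as.data.frame(apply(a, c(1, 2), function(v) mean(as.numeric(v))))
names(imp.means) <- names(imp.df)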