So I'm using the SpadeR package in R to test the similarity of pairs of abundance data for my thesis. Does anyone have an idea how to turn the output into a matrix for further analysis?
Here is the code I'm working with:
CompAB <- mydata %>% select(2, 3)
dataAB <- data.matrix(CompAB, rownames.force = NA)
SimilarityPair(dataAB, datatype = "abundance", nboot = 1000)
I want to do this so that I can later loop over the outputs from a randomisation run and collect them in a matrix for further analysis.
Have you tried to use the matrix() command?
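Building on that suggestion, here is a rough, untested sketch of one possible approach. SimilarityPair() returns a list-like object, so the idea is to inspect it with str(), pull out the numeric piece you need, and collect one value per randomisation run into a preallocated matrix. The component name (Empirical_richness), the number of runs, and the row-shuffling step below are all placeholders, not SpadeR specifics.
out <- SimilarityPair(dataAB, datatype = "abundance", nboot = 1000)
str(out)  # inspect the returned object to find the component holding the estimate you want
n_runs  <- 100                      # placeholder: number of randomisation runs
results <- matrix(NA_real_, nrow = n_runs, ncol = 1,
                  dimnames = list(NULL, "similarity_estimate"))
for (r in seq_len(n_runs)) {
  perm <- dataAB[sample(nrow(dataAB)), ]   # placeholder randomisation step
  est  <- SimilarityPair(perm, datatype = "abundance", nboot = 1000)
  results[r, 1] <- est$Empirical_richness[1, 1]   # replace with the component str(est) shows you need
}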
The end goal here is to run a random sample (without replacement) of J different candidate regression models, rather than all 2^k - 1 possible models as in a traditional All Subsets Regression (aka Best Subset Regression, also sometimes called Exhaustive Regression), on each of I different CSV-formatted datasets located within the same folder.
Here is my code (it is in the GitHub repository for this project, in a file called 'EER script'):
# Load all libraries needed for this script.
# The library specifically needed to run a basic ASR is the 'leaps' library.
library(dplyr)
library(tidyverse)
library(stats)
library(ggplot2)
library(lattice)
library(caret)
library(leaps)
library(purrr)
directory_path <- "~/DAEN_698/sample obs"
filepath_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(filepath_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
datasets <- lapply(filepath_list, read.csv)
# code to run a normal All Subsets Regression
ASR_fits <- lapply(datasets, function(i)
  regsubsets(x = as.matrix(select(i, starts_with("X"))),
             y = i$Y, data = i, nvmax = 15,
             intercept = TRUE, method = "exhaustive"))
ASR_fits_summary <- lapply(ASR_fits, summary)
This is the part where I am completely stuck. I got the above to run, and ASR_fits is a list with I elements, each of class 'regsubsets', which is exactly what I was hoping for; but that is still just the set of estimates made by a traditional ASR. What I need to figure out is where to insert a sample(, J) step within this lapply() call so that each regsubsets() run chooses the optimal model out of just J randomly selected models from the 2^k - 1 possible models, to make it computationally feasible.
I am guessing I will have to either nest another lapply() within my current lapply() call, or write a simple custom function that takes J random samples without replacement, but I just don't know at which step to put it.
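For what it's worth, here is a rough sketch of one way this could look. It does not use regsubsets() at all; instead it draws J random predictor subsets per dataset, fits each with lm(), and keeps the best fit by BIC. Names such as J and random_subsets_fit are illustrative only.
J <- 500   # illustrative number of randomly sampled candidate models per dataset
random_subsets_fit <- function(df, J) {
  predictors <- grep("^X", names(df), value = TRUE)       # same columns as starts_with("X")
  candidates <- lapply(seq_len(J), function(j) {
    k <- sample(seq_along(predictors), 1)                  # random model size
    sample(predictors, k)                                  # random subset, drawn without replacement
  })
  fits <- lapply(candidates, function(vars) {
    lm(reformulate(vars, response = "Y"), data = df)
  })
  fits[[which.min(vapply(fits, BIC, numeric(1)))]]         # best of the J candidate models
}
best_fits <- lapply(datasets, random_subsets_fit, J = J)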
I have created a Random Forest model using the randomForest package
model_rf <- randomForest(y ~ ., data = data_train, ntree = 1000, keep.forest = TRUE, importance = TRUE)
To calculate Shapley values for the different features based on this RF model, I first create an "explainer" object and then use the "shapper" package:
exp_rf <- DALEX::explain(model_rf, data = data_test[,-1], y = data_test[,1])
ive_rf <- shap(exp_rf, new_observation = data_test[1,-1])
To my knowledge, I can only apply the shap() function to one observation (the "new_observation").
But I am looking for a way to calculate the Shapley values for all of the respondents in my data file.
I know this is possible with the "shap" package in Python; but is it also possible with the "shapper" package in R?
At the moment, I use a loop to calculate the Shapley values for all respondents, but this will take days to run on my entire data file.
for (i in seq_len(nrow(data_test))) {
  ive_rf   <- shap(exp_rf, new_observation = data_test[i, -1])
  shapruns <- cbind(shapruns, ive_rf[, "_attribution_"])
}
Any help would be much appreciated.
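As a small aside (a sketch reusing the same exp_rf and data_test objects), the loop can at least be written so the result is not grown with cbind() on every iteration, although the shap() calls themselves remain the expensive part:
shap_list <- lapply(seq_len(nrow(data_test)), function(i) {
  shap(exp_rf, new_observation = data_test[i, -1])[, "_attribution_"]
})
shapruns <- do.call(cbind, shap_list)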
I recently published two R packages that are optimized for this kind of task: "kernelshap" (calculates SHAP values fast) and "shapviz" (plots SHAP values from any source). In your case, a working example would be:
library(randomForest)
library(kernelshap)
library(shapviz)
set.seed(1)
fit <- randomForest(Sepal.Length ~ ., data = iris)
# Step 1: Calculate Kernel SHAP values
# bg_X is usually a small (50-200 rows) subset of the data
s <- kernelshap(fit, iris[-1], bg_X = iris)
# Step 2: Turn them into a shapviz object
sv <- shapviz(s)
# Step 3: Gain insights...
sv_importance(sv, kind = "bee")
sv_dependence(sv, v = "Petal.Length", color_var = "auto")
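If the goal is the full matrix of SHAP values for every row, they should already be stored in the kernelshap result itself (if I recall the object structure correctly), e.g.:
head(s$S)   # assumed component: one row of SHAP values per observation in iris[-1]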
I've run my k-means analysis on data from an Excel source and now want to get the results out into Excel. I've tried the following code, but all I get is a blank worksheet. I'm relatively new to R, so I imagine it's something simple that I'm missing. Please help!
set.seed(123)
kmeansresults <- kmeans(df[, 7], 5, iter.max = 50, nstart = 100)
x <- kmeansresults$clusters
write.csv(x, "clustering results.csv")
Try the following:
data("USArrests")
m <- scale(USArrests)
set.seed(123)
km_res <- kmeans(m, 4, nstart=25)
x <- km_res$cluster
write.csv(x, "/Users/user/Desktop/foo.csv")
I suspect your problem was not writing a CSV file but calling kmeansresults$clusters, which should be kmeansresults$cluster. In R you can inspect the structure of an object with str(kmeansresults) to see what is "inside": there is no element named clusters, only cluster.
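A small extension of the same idea (a sketch reusing the objects above, with a made-up output path): binding the cluster assignment back onto the data before writing often makes the CSV more useful downstream.
out <- data.frame(USArrests, cluster = km_res$cluster)
write.csv(out, "/Users/user/Desktop/usarrests_clusters.csv", row.names = TRUE)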
I'm wondering if it's possible (using the built-in features of SparkR or any other workaround) to extract the class probabilities from some of the classification algorithms included in SparkR. The particular ones of interest are:
spark.gbt()
spark.mlp()
spark.randomForest()
Currently, when I use the predict function on these models I am able to extract the predictions, but not the actual probabilities or "confidence."
I've seen several other questions similar to this topic, but none specific to SparkR, and many have not been answered with regard to Spark's most recent updates.
I ran into the same problem, and following this answer I now use SparkR:::callJMethod to transform the probability DenseVector (which R cannot deserialize) into an Array (which R reads as a list). It's not very elegant or fast, but it does the job:
denseVectorToArray <- function(dv) {
SparkR:::callJMethod(dv, "toArray")
}
e.g.:
Start your Spark session:
library(SparkR)
sparkR.session(master = "local")
Generate toy data:
data <- data.frame(clicked = base::sample(c(0, 1), 100, replace = TRUE),
                   someString = base::sample(c("this", "that"), 100, replace = TRUE),
                   stringsAsFactors = FALSE)
trainidxs <- base::sample(nrow(data), nrow(data)*0.7)
traindf <- as.DataFrame(data[trainidxs,])
testdf <- as.DataFrame(data[-trainidxs,])
Train a random forest and run predictions:
rf <- spark.randomForest(traindf,
                         clicked ~ .,
                         type = "classification",
                         maxDepth = 2,
                         maxBins = 2,
                         numTrees = 100)
predictions <- predict(rf, testdf)
Collect your predictions:
collected <- SparkR::collect(predictions)
Now extract the probabilities:
collected$probabilities <- lapply(collected$probability, denseVectorToArray)
str(collected$probabilities)
Of course, the function wrapper around SparkR:::callJMethod is a bit of overkill. You can also use it directly, e.g. with dplyr:
withprobs <- collected %>%
  rowwise() %>%
  mutate(probabilities = list(SparkR:::callJMethod(probability, "toArray"))) %>%
  mutate(prob0 = probabilities[[1]], prob1 = probabilities[[2]])
I'm having a brain freeze, and hoping one of you can point me in the right direction. My end goal is the output of various regression coefficients (mainly interested in price elasticity), which I achieved via simple multiple regression, using the "by" function.
I am using the "by" function to loop through the regression formula for each iteration of the "State.UPC" variable. Since my data is quite large (~1MM rows), I had to subset my data into groups of 3-4 states (see mystates1...mystates10). I am then performing the regression on those subsets, each time changing my data source in the "datastep3" data frame. And this is where I need your help:
What is the best way to efficiently rewrite this, combining my existing "by" regression function with "for" loops, so I can bypass the steps of constantly changing the data frame name in "datastep3" and in "write.csv"? Essentially, I want R to loop through each "mystates" data subset and do the regression by the "State.UPC" attribute.
I have tried several combinations with no success. Pardon the amateurish question...still learning R. Here is my code:
data <-read.csv("PriceData.csv")
datastep1 <-subset(data, subset=c(X..Vol>0, Unit.Vol>0))
datastep2 <- transform(datastep1, State.UPC = paste(State,UPC, sep="."))
mystates1 <- c("AL","AR","AZ")
mystates2 <- c("CA","CO","FL")
mystates3 <- c("GA","IA","IL")
mystates4 <- c("IN","KS","KY")
mystates5 <- c("LA","MI","MN")
mystates6 <- c("MO","MS","NC")
mystates7 <- c("NJ","NM","NV")
mystates8 <- c("NY","OH","OK")
mystates9 <- c("SC","TN","TX")
mystates10 <- c("UT","VA","WI","WV")
datastep3 <- subset(datastep2, subset = State %in% mystates10)
datastep4 <- na.omit(datastep3)
PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
  lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
       Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
     data = df))
x <- do.call("rbind", lapply(PEbyItem, coef))
y <- data.frame(x)
write.csv(x, file = "mystates10.csv", row.names = TRUE)
Impossible to test this because you do not provide any data, but theoretically you could just combine the various mystatesN into a list and then run lapply(...) on that.
## Not tested...
get.PEbyItem <- function(i) {
  datastep3 <- subset(datastep2, subset = State %in% mystates[[i]])
  datastep4 <- na.omit(datastep3)
  PEbyItem <- by(datastep4, datastep4$State.UPC, function(df)
    lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
         Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
       data = df))
  x <- do.call("rbind", lapply(PEbyItem, coef))
  y <- data.frame(x)
  write.csv(x, file = paste(names(mystates[i]), "csv", sep = "."), row.names = TRUE)
}
mystates <- list(ms1 = mystates1, ms2 = mystates2, ..., ms10 = mystates10)
lapply(seq_along(mystates), get.PEbyItem)
There are lots of other things that could be improved but without the dataset it's pointless to try.
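One further simplification worth mentioning (a sketch, assuming the full datastep2 fits comfortably in memory): skip the manual mystatesN groups entirely, split by State.UPC, fit one model per group, and collect all coefficients into a single file. The output file name is illustrative.
datastep_all <- na.omit(datastep2)
fits <- lapply(split(datastep_all, datastep_all$State.UPC), function(df)
  lm(log(Unit.Vol) ~ log(Price) + Distribution + Independence.Day + Labor.Day +
       Memorial.Day + Thanksgiving + Christmas + New.Years + Year + Month,
     data = df))
coefs <- do.call(rbind, lapply(fits, coef))
write.csv(coefs, file = "all_states.csv", row.names = TRUE)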