I am trying to do something that I am sure is really simple in R. But I cannot figure it out. I want to run the same equation 6 times, changing the variables within the equation each time.
My data is something like this:
#Rename my data
mydata <- BSC_OnlineSurvey_Salient.Beliefs
summary (mydata)
View(mydata)
##Descriptive stats
sapply(mydata, mean, na.rm = TRUE)
sumstats <- sapply(mydata, mean, na.rm = TRUE)
sumstats
#1st: Rename columns
colnames (mydata)
colnames(mydata)=c("ID", "Understands restocking", "Restocking will increase the No. of crabs", "Increasing the No. of crabs is...", "Restocking will result in more crabs to catch", "More crabs to catch is...", "Restocking will result in more fishers fishing for crabs", "More fishers fishing for crabs is...", "Resocking will result in no change in abundance of crabs", "No change in the abundance of crabs is...","Restocking will increase the fishing pressure on crabs", "Increasing the fishing pressure on crabs is", "Restocking will have an impact on the environment and other species", "Having an impact on the environment and other species is...", "Overall views on restocking")
View(mydata)
#Replace Belief evaluation (very unlikely to very likely) from -3-3 to 0-6
Eval1 <- mydata$`Restocking will increase the No. of crabs`
...#Done for 6 "Eval"
Eval1
Eval1[Eval1 == 3] <- 6
Eval1[Eval1 == 2] <- 5
Eval1[Eval1 == 1] <- 4
Eval1[Eval1 == 0] <- 3
Eval1[Eval1 == -3] <- 0
Eval1[Eval1 == -2] <- 1
Eval1[Eval1 == -1] <- 2
...
Strength1 <- mydata$`Increasing the No. of crabs is...`
Strength2 <- mydata$`More crabs to catch is...`
Strength3 <- mydata$`More fishers fishing for crabs is...`
...#Done for 6 "Strength"
I do not want to write the same simple equation 6 times, but I cannot figure out how to avoid it. I have a vague idea that the solution probably involves one of the apply functions or a loop...
My data is a set of variables: Eval1, Eval2, Eval3, ... are on a scale from -3 to 3; Strength1, Strength2, ... are on a scale from 0 to 6.
I want to compute the product of each Eval/Strength pair for each row, and then take the mean of each set of products:
Eval1*Strength1
Eval2*Strength2
Ideally without writing, six times over:
crossprod1 <- mean(Eval1*Strength1, na.rm=TRUE)
crossprod1
If anyone could help with this I would really appreciate it!
Cheers!
Data screenshot: https://i.stack.imgur.com/jH9Zs.png
Hopefully this gives you some ideas. Cheers!
meanTotals <- c()
for (r in 1:nrow(dataset)) {
  rowTotals <- c()
  for (c in 1:(ncol(dataset) / 2)) { # parentheses matter: 1:ncol(dataset)/2 would divide the whole sequence by 2
    rowTotals <- c(rowTotals, dataset[r, 2 * c - 1] * dataset[r, 2 * c])
  }
  meanTotals <- c(meanTotals, rowTotals)
}
mean(meanTotals)
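For the six per-pair means the question actually asks for, there is also a vectorized route. This is only a sketch, assuming (as the loop above does) that the Eval and Strength columns alternate in pairs; the toy `dataset` below is a hypothetical stand-in for the real data:

```r
# Toy stand-in for the real data: 6 Eval/Strength pairs in alternating columns
set.seed(1)
dataset <- as.data.frame(matrix(sample(0:6, 10 * 12, replace = TRUE), nrow = 10))

# One mean per Eval/Strength pair, NA-safe:
# mean(Eval1*Strength1), mean(Eval2*Strength2), ...
pair.means <- sapply(seq(1, ncol(dataset) - 1, by = 2), function(j)
  mean(dataset[[j]] * dataset[[j + 1]], na.rm = TRUE))
pair.means # a numeric vector of length 6
```

This replaces both the six hand-written `crossprod` lines and the nested loop with a single `sapply` over the column-pair starting positions.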
I need to calculate each component of the time series for each X (50 levels) and Y (80 levels) from my dataset (df).
I wanted to go with something akin to the code below, where I tried to just get the seasonality. If I can get this it should be the same for the trend and random component of the decompose.
P <- df$X
for (y in 1:length(P)) {
  OneP <- P[y]
  AllS <- unique(df$Y[df$X == OneP])
  for (i in 1:length(AllS)) {
    OneS <- AllS[i]
    df$TS[df$Y == OneS & df$X == OneP] <- ts(df$Mean[df$Y == OneS & df$X == OneP],
                                             start = c(1999, 1), end = c(2015, 12), frequency = 12)
    df$Dec[df$Y == OneS & df$X == OneP] <- decompose(ts(df$TS[df$Y == OneS & df$X == OneP],
                                                        frequency = 12), type = c("additive"))
    df$Decomposition_seasonal[df$Y == OneS & df$X == OneP] <- df$Dec([df$Y == OneS & df$X == OneP], Dec$seasonal)
  }
}
But this is not working. Error message is:
Error: attempt to apply non-function
I understand that the problem might come from my attempt to put the decomposition output into a column. But how else should I do it? Make a new dataset for every decomposition in every X * Y combination?
I know that the first lines of the code work as I used it before for something else. And I know this will run and give me TS and decomposition. It's the individual components bit that I am struggling with. Any advice is deeply appreciated.
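For what it's worth, the usual way around the list-in-a-column problem is to keep one decompose() result per X/Y combination in a named list instead of forcing it into df columns. This is only a sketch under assumed data: the toy df below stands in for the real one, and the name "Tru.A" is just what interaction() happens to produce for that combination.

```r
# Toy stand-in for df: two X levels, one Y level, 36 monthly means each
df <- data.frame(X = rep(c("Tru", "Fle"), each = 36),
                 Y = "A",
                 Mean = rep(sin(2 * pi * (1:36) / 12), 2) + 10)

# One decomposition per X/Y combination, stored in a named list
decomps <- lapply(split(df, interaction(df$X, df$Y, drop = TRUE)), function(d)
  decompose(ts(d$Mean, start = c(1999, 1), frequency = 12), type = "additive"))

# The seasonal component for one combination:
seasonal.TruA <- decomps[["Tru.A"]]$seasonal
```

A "decomposed.ts" object is itself a list (trend, seasonal, random), so a list of them keeps every component accessible without any column assignment.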
Similar data:
X Y Mean Date(mY)
Tru A 35.6 02.2015
Fle A 15 05.2010
Srl C 67.1 05.1999
Tru A 13.2 08.2006
Srl B 89 08.2006
Tru B 14.8 12.2001
Fle A 21.5 11.2001
Lub D 34.8 03.2000
Dataset (simplified)
data <- data.frame()
data[1,1] <- "NO CB"
data[1,2] <- 1.13
data[1,3] <- 4.56
data[2,1] <- "NO CB"
data[2,2] <- 2.45
data[2,3] <- 7.54
data[3,1] <- "NO CB"
data[3,2] <- 3.56
data[3,3] <- 9.56
data[4,1] <- "NO CB"
data[4,2] <- 3.67
data[4,3] <- 7.89
data[5,1] <- "CB"
data[5,2] <- 1.18
data[5,3] <- 5.85
data[6,1] <- "CB"
data[6,2] <- 2.67
data[6,3] <- 7.86
colnames(data)[1] <- "Group"
colnames(data)[2] <- "Region.1"
colnames(data)[3] <- "Region.2"
In this dataset, I have an unbalanced amount of rows for the 'NO CB' group vs. the 'CB' group. What I want to achieve with my code is to randomly select 2 rows from the 'NO CB' group and use data from the selected 2 rows + data from the 'CB' group to train my randomforest model, and make predictions (I know 4 rows in total makes a bad predictive model, in my actual dataset I have hundreds of rows, but only a few are reproduced here for simplicity).
I wrote a function below, I hope to repeat the whole process 500 times: randomly select 2 rows from the 'NO CB' group for 500 times, and each time, repeat the procedure for the randomforest classification, extract the auc value from the test trial and store the auc value for each run.
library(caret)        # createDataPartition
library(randomForest)
library(pROC)         # roc

myfun <- function() {
  wocb.ROI <- subset(data, data$Group == "NO CB")
  wcb.ROI <- subset(data, data$Group == "CB")
  wocb.ROI <- wocb.ROI[sample(nrow(wocb.ROI), 2), ] # randomly sample 2 from the NO CB group
  same.ROI <- rbind(wocb.ROI, wcb.ROI)
  same.ROI <- as.data.frame(same.ROI)
  same.ROI$Group <- as.factor(same.ROI$Group)
  trains.same.ROI <- createDataPartition(
    y = same.ROI$Group,
    p = 0.5, # traindata proportion
    list = F
  )
  traindata.same.ROI <- same.ROI[trains.same.ROI, ]
  testdata.same.ROI <- same.ROI[-trains.same.ROI, ]
  form_cls.same.ROI <- as.formula(
    paste0(
      "Group~",
      paste(colnames(traindata.same.ROI)[2:3], collapse = "+")
    )
  )
  fit.rf.cls.same.ROI <- randomForest(
    form_cls.same.ROI,
    data = traindata.same.ROI,
    ntree = 50, # number of decision trees
    mtry = 6,
    importance = T
  )
  trainpredprob.same.ROI <- predict(fit.rf.cls.same.ROI, newdata = traindata.same.ROI, type = "prob")
  trainroc.same.ROI <- roc(response = traindata.same.ROI$Group,
                           predictor = trainpredprob.same.ROI[, 2])
  bestp.same.ROI <- trainroc.same.ROI$thresholds[
    which.max(trainroc.same.ROI$sensitivities + trainroc.same.ROI$specificities - 1)]
  trainpredlab.same.ROI <- as.factor(
    ifelse(trainpredprob.same.ROI[, 2] > bestp.same.ROI, "No CB", "CB")
  )
  testpredprob.same.ROI <- predict(fit.rf.cls.same.ROI, newdata = testdata.same.ROI, type = "prob")
  testpredlab.same.ROI <- as.factor(
    ifelse(testpredprob.same.ROI[, 2] > bestp.same.ROI, "No CB", "CB")
  )
  testroc.same.ROI <- roc(response = testdata.same.ROI$Group,
                          predictor = testpredprob.same.ROI[, 2])
  auc <- testroc.same.ROI$auc
  return(auc)
}
I then tried result <- replicate(500, myfun) but all I got was my code, instead of a dataframe containing the auc value.
I also tried to write loops, but I am a bit clueless about how I should adjust my code to make it run.
I have checked similar posts, in fact, repeating the function for 500 times was inspired by one of the similar posts, but my problem still could not be solved. May I ask why my result does not return the auc values but the complete code?
How should I adapt my code to repeat the whole process many times? Thanks in advance for your help!
First, why you got your code back: replicate(500, myfun) passes the function object itself, unevaluated, so you get the function printed 500 times; replicate(500, myfun()) — with parentheses, so the function is actually called — returns the auc values. That said, a clean solution is to use one of the apply functions, like lapply. This way you can also keep track of how many times you have run the function and, in the end result, see which run gave which output.
myfun <- function(i) {
message("Randomforest run ", i)
# do whatever you need to calculate 'auc' here
return(auc)
}
res <- lapply(1:500, myfun)
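To turn the resulting list into a plain vector or data frame of auc values, unlist it. A sketch, with a placeholder body (the runif() call is a hypothetical stand-in for the real randomForest/roc computation):

```r
# Placeholder myfun: runif() stands in for the real auc computation
myfun <- function(i) {
  message("Randomforest run ", i)
  auc <- runif(1, 0.5, 1)
  return(auc)
}

res <- lapply(1:5, myfun)   # a list of 5 auc values
auc.values <- unlist(res)   # a plain numeric vector
auc.df <- data.frame(run = seq_along(auc.values), auc = auc.values)
```

With the real function plugged in, `lapply(1:500, myfun)` followed by `unlist()` gives exactly the 500 stored auc values the question asks for.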
I recently asked a question about improving performance in my code (Faster method than "while" loop to find chain of infection in R).
Background:
I'm analyzing large tables (300 000 - 500 000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. For example, in the example pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.
In my original question, I asked how I could output a data.frame corresponding to animal "d2"'s "chain of infection" (see below, outlined in green, for an illustration of one chain). The suggested solution worked well for one animal.
In reality, I will need to calculate chains for about 400 animals, corresponding to a subset of all animals (allanimals table).
I've included a link to an example dataset that is large enough to play with.
Here is the code for one chain, starting with animal 5497370, and note that I've slightly changed column names from my previous question, and updated the code!
The code:
allanimals <- read.csv("https://www.dropbox.com/s/0o6w29lz8yzryau/allanimals.csv?raw=1",
stringsAsFactors = FALSE)
# Here's an example animal
ExampleAnimal <- 5497370
ptm <- proc.time()
allanimals_ID <- setdiff(unique(c(allanimals$ID, allanimals$InfectingAnimal_ID)), -1)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$ID, allanimals_ID)] <-
match(allanimals$InfectingAnimal_ID, allanimals_ID)
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match(ExampleAnimal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
chain <- allanimals[path[seq_len(i - 1)], ]
proc.time() - ptm
# check it out
chain
I'd like to output chains for each animal in "sel.set":
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
If possible, I'd like to store each "chain" data.frame as list with length = number of chains.
So I'll return the indices to access the data frame rather than all data frame subsets. You'll just need to use lapply(test, function(path) allanimals[path, ]) or with a more complicated function inside the lapply if you want to do other things on the data frame subsets.
One could think of just using lapply on the solution for one animal:
get_path <- function(animal) {
curOne <- match(animal, allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
path[i] <- curOne
i <- i + 1
curOne <- nextOne
}
path[seq_len(i - 1)]
}
sel.set <- allanimals %>%
filter(HexRow < 4 & Year == 130) %>%
pull("ID")
system.time(
test <- lapply(sel.set, get_path)
) # 0.66 seconds
We could rewrite this function as a recursive function (this will introduce my third and last solution).
system.time(
sel.set.match <- match(sel.set, allanimals_ID)
) # 0
get_path_rec <- function(animal.match) {
`if`(is.na(nextOne <- infected[animal.match]),
NULL,
c(animal.match, get_path_rec(nextOne)))
}
system.time(
test2 <- lapply(sel.set.match, get_path_rec)
) # 0.06
all.equal(test2, test) # TRUE
This solution is 10 times as fast. I don't understand why though.
Why did I want to write a recursive function? I thought you might have many cases where, for example, you want the path of animalX and of animalY, where animalY infected animalX; when computing the path of animalX, you would then recompute the whole path of animalY.
So I wanted to use memoization to store already-computed results, and memoization works well with recursive functions. My last solution:
get_path_rec_memo <- memoise::memoize(get_path_rec)
memoise::forget(get_path_rec_memo)
system.time(
test3 <- lapply(sel.set.match, get_path_rec_memo)
) # 0.12
all.equal(test3, test) # TRUE
Unfortunately, this is slower than the second solution. Hope it will be useful for the whole dataset.
I made a matrix-based population model; however, I would like to run more than one simultaneously, representing different groups of animals, so that dispersing individuals can move between matrices. I originally just repeated everything to get a second matrix, but then I realised that because I run the model using a for loop and break() under certain conditions (when that specific matrix should stop running, i.e. that group has died out), break() is, understandably, stopping the whole model rather than just that single matrix.
I was wondering if anyone had suggestions on the best way to code the model so that instead of breaking, and stopping the whole for loop, it just stops running that specific matrix. I'm a little stumped. I have included a single run of one matrix below.
Also, if anyone has a more efficient way of creating and running 9 matrices than writing everything out 9 times, advice is much appreciated.
n.steps <- 100
mats <- array(0,c(85,85,n.steps))
ns <- array(0,c(85,n.steps))
ns[1,1]<-0
ns[12,1]<-rpois(1,3)
ns[24,1]<-rpois(1,3)
ns[85,1] <- 1
birth<-4
nextbreed<-12
for (i in 2:n.steps) {
  # set up an empty matrix;
  mat <- matrix(0, nrow = 85, ncol = 85)
  surv.age.1 <- 0.95
  x <- 2:10
  diag(mat[x, (x - 1)]) <- surv.age.1
  surv.age.a <- 0.97
  disp <- 1:74
  disp <- disp * -0.001
  disp1 <- 0.13
  disp <- 1 - (disp + disp1)
  survdisp <- surv.age.a * disp
  x <- 11:84
  diag(mat[x, (x - 1)]) <- survdisp
  if (i == nextbreed) {
    pb <- 1
  } else {
    pb <- 0
  }
  if (pb == 1) {
    nextbreed <- nextbreed + 12
  }
  mat[1, 85] <- pb * birth
  mat[85, 85] <- 1
  death <- sample(c(replicate(1000,
    sample(c(1, 0), prob = c(0.985, 1 - 0.985), size = 1))), 1)
  if (death == 0) {
    break()
  }
  mats[,,i] <- mat
  ns[,i] <- mat %*% ns[,i-1]
}
group.size <- apply(ns[1:85,],2,sum)
plot(group.size)
View(mat)
View(ns)
As somebody else suggested on Twitter, one solution might be simply to turn the matrix into all 0s whenever death happens. It looks to me like death is the probability that a local population disappears, in which case it seems to make good biological sense to just turn the entire population matrix into 0s.
A few other small changes: I made a list of replicate simulations so I could summarize them easily.
If I understand correctly,
death<-sample(c(replicate(1000,sample(c(1,0), prob=c(0.985, 1-0.985), size =1))),1)
says "a local population dies completely with probability 1.5%", in which case I think you could replace it with rbinom(). I did that below and my plots look similar to those I made with your code.
Hope that helps!
lots <- replicate(100, simplify = FALSE, expr = {
  # re-initialise the state inside expr so each replicate is independent
  mats <- array(0, c(85, 85, n.steps))
  ns <- array(0, c(85, n.steps))
  ns[12, 1] <- rpois(1, 3)
  ns[24, 1] <- rpois(1, 3)
  ns[85, 1] <- 1
  nextbreed <- 12
  for (i in 2:n.steps) {
    # set up an empty matrix;
    mat <- matrix(0, nrow = 85, ncol = 85)
    surv.age.1 <- 0.95
    x <- 2:10
    diag(mat[x, (x - 1)]) <- surv.age.1
    surv.age.a <- 0.97
    disp <- 1:74
    disp <- disp * -0.001
    disp1 <- 0.13
    disp <- 1 - (disp + disp1)
    survdisp <- surv.age.a * disp
    x <- 11:84
    diag(mat[x, (x - 1)]) <- survdisp
    if (i == nextbreed) {
      pb <- 1
    } else {
      pb <- 0
    }
    if (pb == 1) {
      nextbreed <- nextbreed + 12
    }
    mat[1, 85] <- pb * birth
    mat[85, 85] <- 1
    death <- rbinom(1, size = 1, prob = 0.985) # survives with probability 98.5%
    if (death == 0) {
      mat[] <- 0 # zero in place, so mat stays an 85x85 matrix and %*% still conforms
    }
    mats[,,i] <- mat
    ns[,i] <- mat %*% ns[,i-1]
  }
  ns
})
lapply(lots, FUN = function(x) apply(x[1:85,],2,sum))
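On the "nine matrices without writing everything out nine times" point: one pattern is to keep per-group state in a list and skip dead groups with next instead of break, so other groups keep running. This is only a sketch with a toy 3-stage projection matrix standing in for the real 85x85 model:

```r
set.seed(42)
n.groups <- 9
n.steps <- 20

# Per-group state: a population vector plus an "alive" flag
groups <- lapply(seq_len(n.groups), function(g) list(n = c(10, 5, 2), alive = TRUE))

# Toy 3-stage projection matrix (fecundity on top row, survival below)
proj <- matrix(c(0,   0,   2,
                 0.5, 0,   0,
                 0,   0.6, 0.8), nrow = 3, byrow = TRUE)

for (i in seq_len(n.steps)) {
  for (g in seq_len(n.groups)) {
    if (!groups[[g]]$alive) next          # skip this group only, not the whole loop
    if (rbinom(1, 1, 0.985) == 0) {       # whole-group death event
      groups[[g]]$alive <- FALSE
      groups[[g]]$n <- groups[[g]]$n * 0
      next
    }
    groups[[g]]$n <- as.vector(proj %*% groups[[g]]$n)
  }
}
group.sizes <- sapply(groups, function(g) sum(g$n))
```

Because the matrices live in a list, adding a tenth group is one change to n.groups, and a dispersal step could move individuals between list elements inside the same loop.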
I have a data.frame, ordered by mean column that looks like this:
10SE191_2 10SE207 10SE208 mean
7995783 12.64874 13.06391 12.69378 12.73937
8115327 12.69979 12.52285 12.41582 12.50363
8108370 12.58685 12.87818 12.66021 12.45720
7945680 12.46392 12.26087 11.77040 12.36518
7923547 11.98463 11.96649 12.50666 12.33138
8016718 12.81610 12.71548 12.48164 12.32703
I would like to apply a t.test to each row, using as input the intensity values: df[i,1:3] and the mean values from the rows with lower intensities. For example, for the first row I want to compute a t.test for df[1,1:3] vs _mean values_ from row 2 to row 6. My code uses a for loop but my current data.frame has more than 20,000 rows and 24 columns and it takes a long time. Any ideas for improving the code?
Thanks
Code:
temp <- matrix(-9, nrow = dim(matrix.order)[1], ncol = 2) # create a result matrix
l <- dim(matrix.order)[1]
for (i in 1:l) {
  j <- 1 + i
  if (i < l | j + 2 == l) { # avoid not enough y observations
    mean.val <- matrix.order[j:l, 4]
    p <- t.test(matrix.order[i, 1:3], mean.val)
    temp[i, 1] <- p$p.value
  } else {
    temp[i, 1] <- 1
  }
}
dput for my df
structure(list(`10SE191_2` = c(12.6487418898415, 12.6997932097351,12.5868508174491, 12.4639169398277, 11.9846348627906, 12.8160978540904), `10SE207` = c(13.0639063105224, 12.522848114011, 12.8781769160682, 12.260865493177, 11.9664905651469, 12.7154788700468), `10SE208` = c(12.6937808736673, 12.4158248856386, 12.6602128982717, 11.7704045448312, 12.5066604109231, 12.4816357798965), mean = c(12.7393707471856, 12.5036313008127, 12.4572035036992, 12.3651842840775, 12.3313821056582, 12.3270331271091)), .Names = c("10SE191_2", "10SE207", "10SE208", "mean"), row.names = c("7995783", "8115327", "8108370", "7945680", "7923547", "8016718"), class = "data.frame")
You can obtain all p-values (if possible) with this command:
apply(df, 1, function(x) {
  y <- df$mean[df$mean < x[4]]
  if (length(y) > 1)
    t.test(x[1:3], y)$p.value
  else NA
})
The function will return NA if there are not enough values for y.
7995783 8115327 8108370 7945680 7923547 8016718
0.08199794 0.15627947 0.04993244 0.50885253 NA NA
Running 2E4 t.tests probably takes a lot of time no matter what. Try using Rprof to find the hot spots. You might also want to use mclapply (from the parallel package) or similar parallel-processing tools, since your analysis of each row is independent of all other data (which means this is a task well suited to multicore parallel processing).
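A sketch of what that parallel version could look like, with a toy data frame standing in for the real 20,000-row one (parallel::mclapply forks, so on Windows fall back to mc.cores = 1 or use parLapply):

```r
library(parallel)

# Toy stand-in for the real data: 3 intensity columns plus a mean column,
# rows ordered by decreasing mean as in the question
set.seed(1)
df <- data.frame(a = rnorm(100, 12), b = rnorm(100, 12), c = rnorm(100, 12))
df$mean <- rowMeans(df[, 1:3])
df <- df[order(-df$mean), ]

pvals <- unlist(mclapply(seq_len(nrow(df)), function(i) {
  y <- df$mean[df$mean < df$mean[i]] # means of all lower-intensity rows
  if (length(y) > 1) t.test(unlist(df[i, 1:3]), y)$p.value else NA
}, mc.cores = 2))
```

Each row's test only reads shared data, so the fork-based workers need no communication and the speedup should scale roughly with the number of cores.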