R for loop: perform iteration for every loop

I am trying to run a simulation in R, but I am having trouble writing the proper for loop.
The iteration I am trying to perform is
i=1
distance<-NULL
for(i in 1:48)
{
sample<-coordinates[sample(.N, i)]
meand = (dist(cbind(sample$x,sample$y)))
ppp<-sample
table<-as.matrix(dist(ppp))
table[table == 0] <- 1000
maxmin<-apply(table, 1, FUN=min)
distance.1<-mean(maxmin)
distance<-rbind(distance,distance.1)
}
The result is a 48-row data frame, one row for each i in 1:48.
What I would like to do is run about 1000 iterations for each i in the for loop, then store the average of those 1000 results for each i.
I think the replicate() function might be the solution, but I am having trouble using it.
So the expected output is something like
i=1 a (average of 1000 iteration)
i=2 b (average of 1000 iteration)
i=3 c (average of 1000 iteration)
.
.
.
i=48 d (average of 1000 iteration)
How should I rewrite my code to perform a fast iteration? I would sincerely appreciate some help.
EDIT
dput(coordinates)
structure(list(x = c(0.24, 0.72, 1.2, 3.675, 4.155, 4.635, 5.115,
5.595, 6.075, 8.55, 9.03, 9.51, 9.99, 10.47, 10.95, 13.425, 13.905,
14.385, 14.865, 15.345, 15.825, 18.3, 18.78, 19.26, 19.26, 18.78,
18.3, 15.825, 15.345, 14.865, 14.385, 13.905, 13.425, 10.95,
10.47, 9.99, 9.51, 9.03, 8.55, 6.075, 5.595, 5.115, 4.635, 4.155,
3.675, 1.2, 0.72, 0.24), y = c(0.24, 0.24, 0.24, 0.24, 0.24,
0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24,
0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 2.88, 2.88, 2.88,
2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88,
2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88, 2.88)), row.names = c(NA,
-48L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000027c2a7f1ef0>)

If I understood your question correctly, the apply family of functions should solve your problem. Below, I am just nesting one sapply() inside another to run the 1000 additional replicates within each i.
sapply(1:48, function(i){
mean(sapply(1:1000, function(x){
sample<-coordinates[sample(.N, i)]
meand = (dist(cbind(sample$x,sample$y)))
ppp<-sample
table<-as.matrix(dist(ppp))
table[table == 0] <- 1000
maxmin<-apply(table, 1, FUN=min)
mean(maxmin)
}))
})
It'd be easier with a sample of your data. Good luck!
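Since the question mentions replicate(), here is a minimal sketch of the same idea written with it. It assumes a plain data.frame of toy grid points rather than the asker's data.table, and the helper name mean_nn_dist is hypothetical. Unlike the code above, it returns NA for i = 1 (a single point has no neighbour) instead of using the 1000 placeholder.

```r
set.seed(1)

# Toy coordinates (assumption): a 6 x 2 grid with unit spacing
coords <- data.frame(x = rep(0:5, times = 2), y = rep(c(0, 1), each = 6))

# Hypothetical helper: mean nearest-neighbour distance among i sampled points
mean_nn_dist <- function(pts, i) {
  if (i < 2) return(NA_real_)            # no neighbour to measure against
  s <- pts[sample(nrow(pts), i), , drop = FALSE]
  d <- as.matrix(dist(s))
  diag(d) <- Inf                         # ignore each point's distance to itself
  mean(apply(d, 1, min))
}

n_rep <- 1000
sizes <- seq_len(nrow(coords))

# For each sample size, average the statistic over n_rep random draws
avg <- sapply(sizes, function(i) mean(replicate(n_rep, mean_nn_dist(coords, i))))
result <- data.frame(i = sizes, avg_distance = avg)
print(result)
```

Setting the diagonal to Inf only masks self-distances, whereas replacing every zero would also hide genuinely coincident points.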

Related

Extract all subsets in vector where elements are above a given threshold

I would like to know if there is an R way (one liner) to extract the coordinates of all subsets of a vector that are above a given threshold.
Suppose I have the following data:
v = c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28, -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31, 4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)
and let's say I have threshold = 0.7. The desired output would be:
left right
1 4
15 23
26 29
31 32
I can in principle write a while loop of some sort, subsetting v and juggling with left and right coordinates of these regions, something like:
left = which(subset >= threshold)[1] + right
right = which(subset[left:length(subset)] < threshold)[1] - 1 # -1 to get the last element above the threshold
subset = v[(right + 1):length(v)]
(not tested), but I am sure there is an R way that I can't seem to remember.
I had a look here but it's not really what I am after. Any help is appreciated.
You can use rle() to find the runs of values that exceed your threshold. Then you can turn that into your desired format
rle(v>.7) |>
with(
data.frame(start=1, end=cumsum(lengths)) |>
transform(start=c(1, head(end, -1) + 1)) |>
subset(values)
)
And that returns
start end
1 1 4
3 15 23
5 26 29
7 31 32
This is nearly identical to this existing question with the main difference of using rle() on your Boolean condition and then subsetting to only the TRUE values.
Same solution but using data.table
library(data.table)
v = c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28, -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31, 4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)
data.table(v)[, .(start = .I[1], end = .I[.N], keep = unique(v > 0.7)), by = rleid(v > 0.7)][keep == T, .(start, end)]
# start end
# 1: 1 4
# 2: 15 23
# 3: 26 29
# 4: 31 32
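The rle() idea from the answers above can also be wrapped in a small reusable function. This is only a sketch in base R: the function name runs_above is hypothetical, and it uses >= as in the question's pseudocode.

```r
# Hypothetical helper: left/right indices of each run where v >= threshold
runs_above <- function(v, threshold) {
  r <- rle(v >= threshold)
  right <- cumsum(r$lengths)        # last index of every run
  left <- right - r$lengths + 1     # first index of every run
  data.frame(left = left[r$values], right = right[r$values])
}

v <- c(3.48, 2.59, 1.73, 0.91, 0.13, -0.63, -1.34, -2.03, -2.67, -3.28,
       -3.04, -2.15, -1.20, -0.19, 0.84, 1.86, 2.84, 3.77, 4.60, 5.31,
       4.16, 2.87, 1.89, 0.51, 0.23, 0.78, 1.34, 2.63, 1.72, 0.62, 0.98, 1.45)

res <- runs_above(v, 0.7)
print(res)
```

Missing values in v are not handled here; they would need to be dropped or filled first.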

plotting threshold/piecewise/change point models with 95% confidence intervals in R

I would like to plot a threshold model with smooth 95% confidence interval lines between line segments. You would think this would be on the simple side but I have not been able to find an answer!
My threshold/breakpoints are known, it would be great if there were a way to visualize this data. I have tried the segmented package which produces the following plot:
The plot shows a threshold model with a breakpoint at 5.4. However, the confidence intervals are not smooth between regression lines.
If anyone knows of any way to produce smooth (i.e. without the jump between line segments) CI lines between segmented regression lines (ideally in ggplot) that would be amazing. Thank you so much.
I have included sample data and the code I have tried below:
x <- c(2.26, 1.95, 1.59, 1.81, 2.01, 1.63, 1.62, 1.19, 1.41, 1.35, 1.32, 1.52, 1.10, 1.12, 1.11, 1.14, 1.23, 1.05, 0.95, 1.30, 0.79,
0.81, 1.15, 1.10, 1.29, 0.97, 1.05, 1.05, 0.84, 0.64, 0.80, 0.81, 0.61, 0.71, 0.75, 0.30, 0.30, 0.49, 1.13, 0.55, 0.77, 0.51,
0.67, 0.43, 1.11, 0.29, 0.36, 0.57, 0.02, 0.22, 3.18, 3.79, 2.49, 2.44, 2.12, 2.45, 3.22, 3.44, 3.86, 3.53, 3.13)
y <- c(22.37, 18.93, 16.99, 15.65, 14.62, 13.79, 13.09, 12.49, 11.95, 11.48, 11.05, 10.66, 10.30, 9.96, 9.65, 9.35, 9.07, 8.81,
8.56, 8.32, 8.09, 7.87, 7.65, 7.45, 7.25, 7.05, 6.86, 6.68, 6.50, 6.32, 6.15, 5.97, 5.80, 5.63, 5.47, 5.30,
5.13, 4.96, 4.80, 4.63, 4.45, 4.28, 4.09, 3.90, 3.71, 3.50, 3.27, 3.01, 2.70, 2.28, 22.37, 16.99, 11.05, 8.81,
8.56, 8.32, 7.25, 7.05, 6.50, 6.15, 5.63)
library(segmented)
lin.mod <- lm(y ~ x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=2)
plot(x, y)
plot(segmented.mod, add=TRUE, conf.level = 0.95)
which produces the following plot (and associated jumps in 95% confidence intervals):
segmented plot
Background: The non-smoothness in existing change point packages is due to the fact that frequentist packages operate with a fixed change point value. But as with all inferred parameters, this is wrong because there is indeed uncertainty concerning the location of the change.
Solution: AFAIK, only Bayesian methods can quantify that and the mcp package fills this space.
library(mcp)
library(ggplot2) # for ggtitle() used below
model = list(
y ~ 1 + x, # Segment 1: Intercept and slope
~ 0 + x # Segment 2: Joined slope (no intercept change)
)
fit = mcp(model, data = data.frame(x, y))
Default plot (plot.mcpfit() returns a ggplot object):
plot(fit) + ggtitle("Default plot")
Each line represents a possible model that generated the data. The posterior for the change point is shown as a blue density. You can add a credible interval on top using plot(fit, q_fit = TRUE) or plot it alone:
plot(fit, lines = 0, q_fit = c(0.025, 0.975), cp_dens = FALSE) + ggtitle("Credible interval only")
If your change point is indeed known and if you want to model different residual scales for each segment (i.e., quasi-emulate segmented), you can do:
model2 = list(
y ~ 1 + x,
~ 0 + x + sigma(1) # Add intercept change in residual scale
)
fit = mcp(model2, data = data.frame(x, y), prior = list(cp_1 = 1.9)) # Note: prior is a fixed value - not a distribution.
plot(fit, q_fit = TRUE, cp_dens = FALSE)
Notice that the CI does not "jump" around the change point as in segmented. I believe that this is the correct behavior. Disclosure: I am the author of mcp.
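For readers who only need a quick look and are willing to treat the breakpoint as known, a joined two-segment line can also be sketched with plain lm() and a hinge term. This is not what mcp or segmented do internally. It uses simulated toy data and an assumed breakpoint bp, and the resulting band conditions on bp, so it ignores the uncertainty in the change point location discussed above.

```r
# Toy data with a kink at x = 2 (simulated; not the data from the question)
set.seed(42)
x <- seq(0, 4, length.out = 60)
y <- 2 + 3 * x - 2 * pmax(x - 2, 0) + rnorm(length(x), sd = 0.5)
bp <- 2  # assumed known breakpoint

# The hinge term pmax(x - bp, 0) lets the slope change at bp
# while the two segments stay joined
fit <- lm(y ~ x + pmax(x - bp, 0))

# Pointwise 95% confidence band on a fine grid
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
band <- cbind(grid, predict(fit, newdata = grid,
                            interval = "confidence", level = 0.95))

plot(x, y)
lines(band$x, band$fit, lwd = 2)
lines(band$x, band$lwr, lty = 2)
lines(band$x, band$upr, lty = 2)
```

Because every term in the design is continuous in x, the fitted line and both band edges are continuous across the breakpoint. The same band data frame could be passed to ggplot2's geom_ribbon() if a ggplot is preferred.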

Removing NAs from ggplot x-axis in ggplot2

I would like to get rid of the whole NA block (highlighted here ).
I tried na.omit and na.rm = TRUE unsuccessfully.
Here is the code I used :
library(readxl)
data <- read_excel("Documents/TFB/xlsx_geochimie/solfatara_maj.xlsx")
View(data)
data <- gather(data,FeO:`Fe2O3(T)`,key = "Element",value="Pourcentage")
library(ggplot2)
level_order <- factor(data$Element,levels = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O"))
ggplot(data=data,mapping=aes(x=level_order,y=data$Pourcentage,colour=data$Ech))+geom_point()+geom_line(group=data$Ech) +scale_y_log10()
And here is my original file
https://drive.google.com/file/d/1bZi7fPWebbpodD1LFScoEcWt5Bs-cqhb/view?usp=sharing
If I run your code and look at data that goes into ggplot:
table(data$Element)
Al2O3 CaO Fe2O3 Fe2O3(T) FeO K2O LOI LOI2 MgO MnO
12 12 12 12 12 12 12 12 12 12
Na2O P2O5 SiO2 SO4 TiO2 Total Total 2 Total N Total S
12 12 12 12 12 12 12 12 12
You have included the Total columns (and others such as LOI) in the melted data frame, which I guess is not intended. When you call factor() on these values and they are not among the levels, they become NA.
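A tiny illustration of that behaviour, using made-up values rather than the asker's file:

```r
# Values that are not listed in `levels` silently become NA
elements <- c("SiO2", "Total", "FeO", "LOI")
f <- factor(elements, levels = c("SiO2", "FeO"))

print(f)
print(is.na(f))  # FALSE  TRUE FALSE  TRUE
```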
So we can do it from scratch:
data <- read_excel("solfatara_maj.xlsx")
The data:
structure(list(Ech = c("AGN 1A", "AGN 2A", "AGN 3B", "SOL 4B",
"SOL 8Ag", "SOL 8Ab", "SOL 16A", "SOL 16B", "SOL 16C", "SOL 22 A",
"SOL 22D", "SOL 25B"), FeO = c(0.2, 0.8, 1.7, 0.3, 1.7, NA, 0.2,
NA, 0.1, 0.7, 1.3, 2), `Total S` = c(5.96, 45.3, 0.22, 17.3,
NA, NA, NA, NA, NA, NA, 2.37, 0.36), SO4 = c(NA, 6.72, NA, 4.08,
0.06, 0.16, 42.2, 35.2, 37.8, 0.32, 6.57, NA), `Total N` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 15.2, NA, NA), SiO2 = c(50.2,
31.05, 56.47, 62.14, 61.36, 75.66, 8.41, 21.74, 17.44, 13.52,
19.62, 56.35), Al2O3 = c(15.53, 7.7, 17.56, 4.44, 17.75, 10.92,
31.92, 26.38, 27.66, 0.64, 3.85, 17.28), Fe2O3 = c(0.49, 0.63,
2.06, NA, 1.76, 0.11, 0.64, 0.88, 1.71, NA, 1.32, 2.67), MnO = c(0.01,
0.01, 0.13, 0.01, 0.09, 0.01, 0.01, 0.01, 0.01, 0.005, 0.04,
0.12), MgO = c(0.06, 0.07, 0.88, 0.03, 0.97, 0.05, 0.04, 0.07,
0.03, 0.02, 1.85, 1.63), CaO = c(0.2, 0.09, 3.34, 0.09, 2.58,
0.57, 0.2, 0.26, 0.15, 0.06, 35.66, 4.79), Na2O = c(0.15, 0.14,
3.23, 0.13, 3.18, 2.04, 0.68, 0.68, 0.55, 0.05, 0.45, 3.11),
K2O = c(4.39, 1.98, 8, 1.26, 8.59, 5.94, 8.2, 6.97, 8.04,
0.2, 0.89, 7.65), TiO2 = c(0.42, 0.27, 0.46, 0.79, 0.55,
0.16, 0.09, 0.22, 0.16, 0.222, 0.34, 0.53), P2O5 = c(0.11,
0.09, 0.18, 0.08, 0.07, 0.07, 0.85, 0.68, 0.62, NA, 0.14,
0.28), LOI = c(27.77, 57.06, 6.13, 29.03, 1.38, 4.92, 42.58,
37.58, 38.76, NA, 26.99, 3.92), LOI2 = c(27.79, 57.15, 6.32,
29.06, 1.57, 4.93, 42.6, 37.59, 38.77, 0.08, 27.13, 4.15),
Total = c(99.52, 99.88, 100.2, 98.25, 99.99, 100.5, 93.81,
95.57, 95.23, 15.25, 92.45, 100.3), `Total 2` = c(99.54,
99.96, 100.3, 98.28, 100.2, 100.6, 93.83, 95.58, 95.24, 15.33,
92.59, 100.6), `Fe2O3(T)` = c(0.71, 1.52, 3.95, 0.27, 3.65,
0.22, 0.87, 0.99, 1.82, 0.61, 2.76, 4.9)), row.names = c(NA,
-12L), class = c("tbl_df", "tbl", "data.frame"))
First we set the plotting level like you did:
plotlvls = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O")
Then we select only these columns plus Ech. Note that I use pivot_longer() because gather() will supposedly be deprecated. We convert Element to a factor in the same step:
library(dplyr)
library(tidyr)
plotdf = data %>% select(all_of(c(plotlvls,"Ech"))) %>%
pivot_longer(-Ech,names_to = "Element",values_to = "Pourcentage") %>%
mutate(Element=factor(Element,levels=plotlvls))
Finally we plot, and there are no NAs:
ggplot(data=plotdf,mapping=aes(x=Element,y=Pourcentage,colour=Ech))+
geom_point()+geom_line(aes(group=Ech)) +scale_y_log10()
1. Create reproducible minimal data
data <- data.frame(Element = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O",NA),
Pourcentage = 1:10,
Ech = c("AGN 1A", "SOL 16"))
2. Set factor levels for variable 'Element'
data$Element <- factor(data$Element,levels = c("SiO2","TiO2","Al2O3","Fe2O3","FeO","MgO","CaO","Na2O","K2O"))
3. Remove rows containing NA in the variable 'Element'
data <- data[!is.na(data$Element), ]
4. Plot data using ggplot2 (ggplot2 syntax uses NSE (non-standard evaluation), which means you don't have to pass the variable names as strings or use the $ notation):
ggplot(data=data,aes(x=Element,y=Pourcentage,colour=Ech)) +
geom_point() +
geom_line(aes(group=Ech)) +
scale_y_log10()

How to calculate the average of a comma separated string of numbers in R

I have following file :
file 1
structure(list(Total_Gene_Symbol = c("5S_rRNA", "7SK", "A1BG-AS1"
), Test = c("1.02, 1.12, 1.11, 1.18, 1.12, 1.19, 1.25, 1.24, 1.24, 1.02",
"1.97, 2.27, 2.14, 1.15", "1.3, 1.01, 1.36, 1.42, 1.38, 1.01, 1.31, 1.34,
1.29, 1.34, 2.02, 1.12, 1.01, 1.31, 1.22"
)), .Names = c("Total_Gene_Symbol", "Test"), row.names = c(NA,
3L), class = "data.frame")
The Test column in file 1 holds numbers separated by ",".
I tried
mat <- stri_split_fixed(Down_FC, ',', simplify=T)
mat <- `dim<-`(as.numeric(mat), dim(mat)) # convert to numeric and save dims
rowMeans(mat, na.rm=T)->M
View(M)
but the above code averages the entire data set.
I want output like file 2 below
file 2
structure(list(Total_Gene_Symbol = c("5S_rRNA", "7SK", "A1BG-AS1"
), Test = c("1.02, 1.12, 1.11, 1.18, 1.12, 1.19, 1.25, 1.24, 1.24, 1.02",
"1.97, 2.27, 2.14, 1.15", "1.3, 1.01, 1.36, 1.42, 1.38, 1.01, 1.31, 1.34,
1.29, 1.34, 2.02, 1.12, 1.01, 1.31, 1.22"
), Average = c(11.49, 7.53, 19.44)), .Names = c("Total_Gene_Symbol",
"Test", "Average"), row.names = c(NA, 3L), class = "data.frame")
What you want is the sum, not the average! An average is a measure such as the mode, median, or mean.
library(magrittr)
library(stringr) # provides str_split()
df1$total_sum<-
df1$Test %>% str_split(",\\s+") %>% sapply(function(x) as.numeric(x) %>% sum(na.rm=TRUE))
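To make the distinction concrete, here is the arithmetic for the "7SK" row of file 1, written out in base R as a small check:

```r
# The four values from the "7SK" row of file 1
x <- c(1.97, 2.27, 2.14, 1.15)

total <- sum(x)    # 7.53, the value shown in the "Average" column of file 2
avg   <- mean(x)   # 1.8825, the actual arithmetic mean

print(c(sum = total, mean = avg))
```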
Using apply
d1$sum <- apply(d1,1,
function(x)(sum(as.numeric(unlist(strsplit(x['Test'],','))),na.rm = TRUE)))
You can use scan :
df$sum <- sapply(df$Test, function(x) sum(scan(text = x, what=numeric(),sep=","), na.rm=TRUE))
df$average <- sapply(df$Test, function(x) mean(scan(text = x, what=numeric(),sep=","), na.rm=TRUE))
# Total_Gene_Symbol Test sum average
# 1 5S_rRNA 1.02, 1.12, 1.11, 1.18, 1.12, 1.19, 1.25, 1.24, 1.24, 1.02 11.49 1.1490
# 2 7SK 1.97, 2.27, 2.14, 1.15 7.53 1.8825
# 3 A1BG-AS1 1.3, 1.01, 1.36, 1.42, 1.38, 1.01, 1.31, 1.34, \n 1.29, 1.34, 2.02, 1.12, 1.01, 1.31, 1.22 19.44 1.2960

Plotting longitudinal data using loess smoother

I need to plot a set of smoothed trajectories for individuals in a longitudinal (person-period) dataset. I can plot the individual trajectories across days using OLS regression but I would like to know how to plot the trajectories using a non-parametric smoother.
Sample data below. Same outcome variable measured five times for each individual at ages eleven, twelve, thirteen, fourteen and fifteen. Exposure is a predictor variable but we are not interested in it for this exercise.
id <- c(9, 45, 268, 314, 442, 514, 569, 624, 723, 918, 949, 978, 1105, 1542, 1552, 1653)
eleven <- c(2.23, 1.12, 1.45, 1.22, 1.45, 1.34, 1.79, 1.12, 1.22, 1.00, 1.99, 1.22, 1.34, 1.22, 1.00, 1.11)
twelve <- c(2.23, 1.12, 1.45, 1.22, 1.45, 1.34, 1.79, 1.12, 1.22, 1.00, 1.99, 1.22, 1.34, 1.22, 1.00, 1.11)
thirteen <- c(1.90, 1.45, 1.99, 1.55, 1.45, 2.23, 1.90, 1.22, 1.12, 1.22, 1.12, 2.12, 1.99, 1.99, 2.23, 1.34)
fourteen <- c(2.12, 1.45, 1.79, 1.12, 1.67, 2.12, 1.99, 1.12, 1.00, 1.99, 1.45, 3.46, 1.90, 1.79, 1.55, 1.55)
fifteen <- c(2.66, 1.99, 1.34, 1.12, 1.90, 2.44, 1.99, 1.22, 1.12, 1.22, 1.55, 3.32, 2.12, 2.12, 1.55, 2.12)
exposure <- c(1.54, 1.16, 0.90, 0.81, 1.13, 0.90, 1.99, 0.98, 0.81, 1.21, 0.93, 1.59, 1.38, 1.44, 1.04, 1.25)
df <- data.frame(id, eleven, twelve, thirteen, fourteen, fifteen, exposure)
Now we convert the person-level dataset to a person-period (i.e. long) dataframe and add a time variable
library(reshape2)
library(plyr)
dfPP <- melt(df, measure.vars = c("eleven", "twelve", "thirteen", "fourteen", "fifteen"), var = "age", value.name = "score")
dfPP <- dfPP[order(dfPP$id), ]
dfPP$time <- rep(0:4, 16)
We can plot the raw data using an interaction plot
interaction.plot(dfPP$age, dfPP$id, dfPP$score)
but what we really want are lines of best fit for each individual
fit <- by(dfPP, dfPP$id, function (bydata) fitted.values(lm(score ~ time, data = bydata)))
fit <- unlist(fit)
interaction.plot(dfPP$age, dfPP$id, fit, xlab = "age", ylab = "score")
It would also be good to plot the average change trajectory across subjects
ints <- by(dfPP, dfPP$id, function (data) coefficients(lm(score ~ time, data = data))[[1]])
ints1 <- unlist(ints)
slopes <- by(dfPP, dfPP$id, function (data) coefficients(lm(score ~ time, data = data))[[2]])
slopes1 <- unlist(slopes)
so to plot the mean trajectory we use an abline, superimposed over the existing graph
abline(a = mean(ints1), b = mean(slopes1), lwd = 2, col = "red")
Now this is all fine but does anyone know how to achieve a similar result except using a non-parametric smoother? Also with an average trajectory line superimposed. Using some kind of loess function perhaps?
