Interpolation of loess.smooth in R - r

I'm running a loess.smooth method after running the spline method on it.
The input given below is the data I get after running the spline method.
However I'm going wrong with the loess.smooth method. The entire first column is returning the output in float format but I need it in integer format with an increment of 1.
Any help would be much appreciated.
Thanks
**input:** spline_file
1 0.157587435
2 0.146704412
3 0.129899285
4 0.138925582
5 0.104085676
out <- loess.smooth(spline_file$x, spline_file$y, span = 1, degree = 1,
family = c("gaussian"), length.out = seq(1, max_exp, by = 1), surface=
"interpolate", normalize = TRUE, method="linear")
**OUTPUT:**
0 0.150404703
1.020408163 0.154413716
2.040816327 0.158458172
3.06122449 0.162515428
4.081632653 0.166562839
5.102040816 0.170577762
**OUTPUT REQUIRED:**
x y
1 0.225926707
2 0.226026551
3 0.226241194
4 0.2265471
5 0.226920733

not sure if the following fully answers your question but maybe it helps. Below some code, demonstrative plot and some explanations/recommendations.
You should not use a degree of 1, your data requires a higher degree.
You should check the allowed parameters via ?loess.smooth. I think you mixed up some parameters of scatter.smooth and loess.smooth and further used some parameters that do not exist for the function (e.g. normalize - please correct me if I have overseen something).
In any case it makes sense that the output of a spline smoothing function has more data points than the original data. To be ablet to plot a smooth curve additional points are generated between your data points by the smoothing function. Check the plot generated at the end of below code. If the fit is good, is another question...
spline_file <- read.table(text = "
1 0.157587435
2 0.146704412
3 0.129899285
4 0.138925582
5 0.104085676
", stringsAsFactors = FALSE)
colnames(spline_file) <- c("x", "y")
spline_loess <- loess.smooth(spline_file$x, spline_file$y, span = 1, degree = 2,
family = c("gaussian")
,surface= "interpolate"
, statistics = "exact"
)
spline_loess
# $x
# [1] 1.000000 1.081633 1.163265 1.244898 1.326531 1.408163 1.489796
# [8] 1.571429 1.653061 1.734694 1.816327 1.897959 1.979592 2.061224
# [15] 2.142857 2.224490 2.306122 2.387755 2.469388 2.551020 2.632653
# [22] 2.714286 2.795918 2.877551 2.959184 3.040816 3.122449 3.204082
# [29] 3.285714 3.367347 3.448980 3.530612 3.612245 3.693878 3.775510
# [36] 3.857143 3.938776 4.020408 4.102041 4.183673 4.265306 4.346939
# [43] 4.428571 4.510204 4.591837 4.673469 4.755102 4.836735 4.918367
# [50] 5.000000
#
# $y
# [1] 0.1586807 0.1571512 0.1556485 0.1541759 0.1527367 0.1513344
# [7] 0.1499721 0.1486533 0.1473813 0.1461595 0.1449911 0.1438795
# [13] 0.1428280 0.1417881 0.1406496 0.1394364 0.1381783 0.1369053
# [19] 0.1356473 0.1344341 0.1332957 0.1322619 0.1313626 0.1306278
# [25] 0.1300873 0.1297791 0.1297453 0.1299324 0.1302747 0.1307066
# [31] 0.1311626 0.1315769 0.1318839 0.1320181 0.1319138 0.1315054
# [37] 0.1307273 0.1295270 0.1281453 0.1266888 0.1251504 0.1235232
# [43] 0.1218002 0.1199744 0.1180388 0.1159866 0.1138105 0.1115038
# [49] 0.1090594 0.1064704
plot(spline_file)
lines(spline_loess)

Related

R: Plotting Multiple Graphs using a "for loop"

I am using the R programming language.
Using the following code, I am able to put two plots on the same page:
#load library
library(dbscan)
#specify number of plots per page
par(mfrow = c(1,2))
#load libraries
library(dbscan)
library(dplyr)
#generate data
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
### calculate LOF score
lof <- lof(x, k=3)
### distribution of outlier factors (first plot)
summary(lof)
hist(lof, breaks=10)
### point size is proportional to LOF (second plot)
plot(x, pch = ".", main = "LOF (k=3)")
points(x, cex = (lof-1)*3, pch = 1, col="red")
This produces the following plot:
Now, I am trying to make several plots (e.g. 6 plots, 2 pairs of 3) on the same page. I tried to implement this with a "for loop" (for k = 3, 4, 5):
par(mfrow = c(3,2))
vals <- 3:5
combine <- vector('list', length(vals))
count <- 0
for (i in vals) {
lof_i <- lof(x, k=i)
### distribution of outlier factors
summary(lof_i)
hist(lof_i, breaks=10)
### point size is proportional to LOF
plot(x, pch = ".", main = "LOF (k=i)")
points(x, cex = (lof_i-1)*3, pch = 1, col="red")
}
However, this seems to just repeat the same graph 6 times on the same page:
Can someone please show me how to correct this code?
Is it also possible to save the files "lof_3, lof_4, lof_5"? It seems that none of these files are created, only "lof_i" is created:
> lof_3
Error: object 'lof_3' not found
> head(lof_i)
[1] 1.223307 1.033424 1.077149 1.011407 1.040634 1.431029
Thanks
Looking at your plots you seem to have generated and plotted different plots, but to have the labels correct you need to pass a variable and not a fixed character to your title (e.g. using the paste command).
To get the calculated values out of your loop you could either generate an empty list and assign the results in the loop to individual list elements, or use something like lapply that will automatically return the results in a list form.
To simplify things a bit you could define a function that either plots or returns the calculated values, e.g. like this:
library(dbscan)
#generate data
set.seed(123)
n <- 100
x <- cbind(
x=runif(10, 0, 5) + rnorm(n, sd=0.4),
y=runif(10, 0, 5) + rnorm(n, sd=0.4)
)
plotLOF <- function(i, plot=TRUE){
lof <- lof(x, k=i)
if (plot){
hist(lof, breaks=10)
plot(x, pch = ".", main = paste0("LOF (k=", i, ")"))
points(x, cex = (lof-1)*3, pch = 1, col="red")
} else return(lof)
}
par(mfrow = c(3,2))
invisible(lapply(3:5, plotLOF))
lapply(3:5, plotLOF, plot=FALSE)
#> [[1]]
#> [1] 1.1419243 0.9551471 1.0777472 1.1224447 0.8799095 1.0377858 0.8416306
#> [8] 1.0487133 1.0250496 1.3183819 0.9896833 1.0353398 1.3088266 1.0123238
#> [15] 1.1233530 0.9685039 1.0589151 1.3147785 1.0488644 0.9212146 1.2568698
#> [22] 1.0086274 1.0454450 0.9661698 1.0644528 1.1107202 1.0942201 1.5147076
#> [29] 1.0321698 1.0553455 1.1149748 0.9341090 1.2352716 0.9478602 1.4096464
#> [36] 1.0519127 1.0507267 1.3199825 1.2525485 0.9361488 1.0958563 1.2131615
#> [43] 0.9943090 1.0123238 1.1060491 1.0377766 0.9803135 0.9627699 1.1165421
#> [50] 0.9796819 0.9946925 2.1576989 1.6015310 1.5670315 0.9343637 1.0033725
#> [57] 0.8769431 0.9783065 1.0800050 1.2768800 0.9735274 1.0377472 1.0743988
#> [64] 1.7583562 1.2662485 0.9685039 1.1662145 1.2491499 1.1131718 1.0085023
#> [71] 0.9636864 1.1538360 1.2126138 1.0609829 1.0679010 1.0490234 1.1403292
#> [78] 0.9638900 1.1863703 0.9651060 0.9503445 1.0098536 0.8440855 0.9052420
#> [85] 1.2662485 1.4447713 1.0845415 1.0661381 0.9282678 0.9380078 1.1414628
#> [92] 1.0407138 1.0942201 1.0589805 1.0370938 1.0147094 1.1067291 0.8834466
#> [99] 1.7027132 1.1766560
#>
#> [[2]]
#> [1] 1.1667311 1.0409009 1.0920953 1.0068953 0.9894195 1.1332413 0.9764505
#> [8] 1.0228796 1.0446905 1.0893386 1.1211637 1.1029415 1.3453498 0.9712910
#> [15] 1.1635936 1.0265746 0.9480282 1.2144437 1.0570346 0.9314618 1.3345561
#> [22] 0.9816097 0.9929112 1.0322014 1.2739621 1.2947553 1.0202948 1.6153264
#> [29] 1.0790922 0.9987830 1.0378609 0.9622779 1.2974938 0.9129639 1.2601398
#> [36] 1.0265746 1.0241622 1.2420568 1.2204376 0.9297345 1.1148404 1.2546361
#> [43] 1.0059582 0.9819820 1.0342491 0.9452673 1.0369500 0.9791091 1.2000825
#> [50] 0.9878844 1.0205586 2.0057587 1.2757014 1.5347815 0.9622614 1.0692613
#> [57] 1.0026404 0.9408510 1.0280687 1.3534531 0.9669894 0.9300601 0.9929112
#> [64] 1.7567871 1.3861828 1.0265746 1.1120151 1.3542396 1.1562077 0.9842179
#> [71] 1.0301098 1.2326327 1.1866352 1.0403814 1.0577086 0.8745912 1.0017905
#> [78] 0.9904356 1.0602487 0.9501681 1.0176457 1.0405430 0.9718224 1.0046821
#> [85] 1.1909982 1.6151918 0.9640852 1.0141963 1.0270237 0.9867738 1.1474414
#> [92] 1.1293307 1.0323945 1.0859417 0.9622614 1.0290635 1.0186381 0.9225209
#> [99] 1.6456612 1.1366753
#>
#> [[3]]
#> [1] 1.1299335 1.0122028 1.2077092 0.9485150 1.0115694 1.1190314 0.9989174
#> [8] 1.0145663 1.0357546 0.9783702 1.1050504 1.0661798 1.3571416 1.0024603
#> [15] 1.1484745 1.0162149 0.9601474 1.1310442 1.0957731 1.0065501 1.2687934
#> [22] 0.9297323 0.9725355 0.9876444 1.2314822 1.2209304 0.9906446 1.4249452
#> [29] 1.2156607 0.9959685 1.0304305 0.9976110 1.1711354 1.0048161 0.9813000
#> [36] 1.0128909 0.9730295 1.1741982 1.3317209 0.9708714 1.0994309 1.1900047
#> [43] 0.9960765 0.9659553 0.9744357 0.9556112 1.0508484 0.9669406 1.3919743
#> [50] 0.9467537 1.0596883 1.7396644 1.1323109 1.6516971 0.9922995 1.0223594
#> [57] 0.9917594 0.9542419 1.0672565 1.2274498 1.0589385 0.9649404 0.9953886
#> [64] 1.7666795 1.3111620 0.9860706 1.0576620 1.2547512 1.0038281 0.9825967
#> [71] 1.0104708 1.1739417 1.1884817 1.0199412 0.9956941 0.9720389 0.9601474
#> [78] 0.9898781 1.1025485 0.9797453 1.0086780 1.0556471 1.0150204 1.0339022
#> [85] 1.1174116 1.5252177 0.9721734 0.9486663 1.0161640 0.9903872 1.2339874
#> [92] 1.0753099 0.9819882 1.0439012 1.0016272 1.0122706 1.0536213 0.9948601
#> [99] 1.4693656 1.0274264
Created on 2021-02-22 by the reprex package (v1.0.0)
for i in vector
eval(parse(text = sprintf("plot(df$%s)",i)))
This is very powerful line of code...can be very handy to plot graphs with loops.
{
eval(parse(text= sprintf('lof_%s <- lof(x, k=%s)',i,i)))
### distribution of outlier factors
eval(parse(text=sprintf('summary(lof_%s)',i)))
eval(parse(text=sprintf('hist(lof_%s, breaks=10)',i)))
### point size is proportional to LOF
eval(parse(text=sprintf("plot(x, pch = '.', main = 'LOF (k=%s)')",i)))
eval(parse(text=sprintf("points(x, cex = (lof_%s-1)*3, pch = 1, col='red')",i)))
}```
Exaplaination-
eval() - it evaluates the expression
parse() - it parse the text for evaluation
sprintf() - it creates a string(text) by concatenating with the parameter parsed.
Your code is not working because inside the loop i is being treated as character. It is not holding the values from the iterator.In case you need to understand above function then i would suggest you to just run this function and see the output sprintf('lof_%s <- lof(x, k=%s)',i,i).

Partial Variances at each row of a Matrix

I generated a series of 10,000 random numbers through:
rand_x = rf(10000, 3, 5)
Now I want to produce another series that contains the variances at each point i.e. the column look like this:
[variance(first two numbers)]
[variance(first three numbers)]
[variance(first four numbers)]
[variance(first five numbers)]
.
.
.
.
[variance of 10,000 numbers]
I have written the code as:
c ( var(rand_x[1:1]) : var(rand_x[1:10000])
but I am only getting 157 elements in the column rather than not 10,000. Can someone guide what I am doing wrong here?
An option is to loop over the index from 2 to 10000 in sapply, extract the elements of 'rand_x' from position 1 to the looped index, apply the var and return a vector of variance output
out <- sapply(2:10000, function(i) var(rand_x[1:i]))
Your code creates a sequence incrementing by one with the variance of the first two elements as start value and the variance of the whole vector as limit.
var(rand_x[1:2]):var(rand_x[1:n])
# [1] 0.9026262 1.9026262 2.9026262
## compare:
.9026262:3.33433
# [1] 0.9026262 1.9026262 2.9026262
What you want is to loop over the vector indices, using seq_along to get the variances of sequences growing by one. To see what needs to be done, I show you first a (rather slow) for loop.
vars <- numeric() ## initialize numeric vector
for (i in seq_along(rand_x)) {
vars[i] <- var(rand_x[1:i])
}
vars
# [1] NA 0.9026262 1.4786540 1.2771584 1.7877717 1.6095619
# [7] 1.4483273 1.5653797 1.8121144 1.6192175 1.4821020 3.5005254
# [13] 3.3771453 3.1723564 2.9464537 2.7620001 2.7086317 2.5757641
# [19] 2.4330738 2.4073546 2.4242747 2.3149455 2.3192964 2.2544765
# [25] 3.1333738 3.0343781 3.0354998 2.9230927 2.8226541 2.7258979
# [31] 2.6775278 2.6651541 2.5995346 3.1333880 3.0487177 3.0392603
# [37] 3.0483917 4.0446074 4.0463367 4.0465158 3.9473870 3.8537925
# [43] 3.8461463 3.7848464 3.7505158 3.7048694 3.6953796 3.6605357
# [49] 3.6720684 3.6580296
The first element has to be NA because the variance of one element is not defined (division by zero).
However, the for loop is slow. Since R is vectorized we rather want to use a function from the *apply family, e.g. vapply, which is much faster. In vapply we initialize with numeric(1) (or just 0) because the result of each iteration is of length one.
vars <- vapply(seq_along(rand_x), function(i) var(rand_x[1:i]), numeric(1))
vars
# [1] NA 0.9026262 1.4786540 1.2771584 1.7877717 1.6095619
# [7] 1.4483273 1.5653797 1.8121144 1.6192175 1.4821020 3.5005254
# [13] 3.3771453 3.1723564 2.9464537 2.7620001 2.7086317 2.5757641
# [19] 2.4330738 2.4073546 2.4242747 2.3149455 2.3192964 2.2544765
# [25] 3.1333738 3.0343781 3.0354998 2.9230927 2.8226541 2.7258979
# [31] 2.6775278 2.6651541 2.5995346 3.1333880 3.0487177 3.0392603
# [37] 3.0483917 4.0446074 4.0463367 4.0465158 3.9473870 3.8537925
# [43] 3.8461463 3.7848464 3.7505158 3.7048694 3.6953796 3.6605357
# [49] 3.6720684 3.6580296
Data:
n <- 50
set.seed(42)
rand_x <- rf(n, 3, 5)

How to remove elements of a list in R?

I have an igraph object, what I have created with the igraph library. This object is a list. Some of the components of this list have a length of 2. I would like to remove all of these ones.
IGRAPH clustering walktrap, groups: 114, mod: 0.79
+ groups:
$`1`
[1] "OTU0041" "OTU0016" "OTU0062"
[4] "OTU1362" "UniRef90_A0A075FHQ0" "UniRef90_A0A075FSE2"
[7] "UniRef90_A0A075FTT8" "UniRef90_A0A075FYU2" "UniRef90_A0A075G543"
[10] "UniRef90_A0A075G6B2" "UniRef90_A0A075GIL8" "UniRef90_A0A075GR85"
[13] "UniRef90_A0A075H910" "UniRef90_A0A075HTF5" "UniRef90_A0A075IFG0"
[16] "UniRef90_A0A0C1R539" "UniRef90_A0A0C1R6X4" "UniRef90_A0A0C1R985"
[19] "UniRef90_A0A0C1RCN7" "UniRef90_A0A0C1RE67" "UniRef90_A0A0C1RFI5"
[22] "UniRef90_A0A0C1RFN8" "UniRef90_A0A0C1RGE0" "UniRef90_A0A0C1RGX0"
[25] "UniRef90_A0A0C1RHM1" "UniRef90_A0A0C1RHR5" "UniRef90_A0A0C1RHZ4"
+ ... omitted several groups/vertices
For example, this one :
> a[[91]]
[1] "OTU0099" "UniRef90_UPI0005B28A7E"
I tried this but it does not work :
a[lapply(a,length)>2]
Any help?
Since you didn't provide any reproducible data or example, I had to produce some dummy data:
# create dummy data
a <- list(x = 1, y = 1:4, z = 1:2)
# remove elements in list with lengths greater than 2:
a[which(lapply(a, length) > 2)] <- NULL
In case you wanted to remove the items with lengths exactly equal to 2 (question is unclear), then last line should be replaced by:
a[which(lapply(a, length) == 2)] <- NULL

R + ggplot2: plot time series with linear regression with changepoint

I have a time series data which has 2 variables (x,y) and I am currently using R base plot to generate a plot like this.
the red lines is a linear model fitted between 2 points.
The data looks likes this.
X
[1] 559.2 559.8 560.6 561.1 561.2 561.8
[7] 562.4 563.0 563.4 563.5 563.5 563.5
[13] 563.5 563.5 563.5 563.5 563.8 564.5
[19] 565.3 565.9 566.4 566.5 566.7 567.4
[25] 567.6 568.5 569.3 570.3 571.6 572.2
[31] 572.5 573.6 574.1 575.5 576.9 578.1
[37] 579.0 580.1 580.9 581.4 581.8 583.1
[43] 583.8 584.4 585.2 586.0 586.1 586.2
[49] 586.8 587.4
**y**
[1] 115.4375 115.3008 115.2069 115.3306 115.3900 115.1189 114.8619
[8] 114.7992 114.7117 114.4722 114.7031 115.1358 115.4811 115.4500
[15] 115.6347 115.8286 115.8361 115.7986 115.9169 116.1225 116.1803
[22] 116.3794 116.2872 116.2517 116.3411 116.4167 116.5108 116.2900
[29] 116.3456 116.3658 116.1547 116.2042 116.1517 116.2083 116.3642
[36] 116.4347 116.5428 116.5119 116.5925 116.3969 116.2614 116.3494
[43] 116.1242 116.1469 116.0872 116.1000 116.2319 116.1225 116.1069
[50] 116.1364
I am calculating the change point manually from X.
Is this kind of plot possible in ggplot2?i.e. using ggplot2 to loop through change points and fit linear model?
Any help would be appreciated. Thanks.
#create some fake data
segment1 = 100:1 + runif(100)*10
df1 = data.frame(value = segment1, time = 1:100, type="segment1")
segment2 = 75:1 + runif(75)*10
df2 = data.frame(value = segment2, time = 101:175, type="segment2")
segment3 = 50:1 + runif(50)*10
df3 = data.frame(value = segment3, time = 176:225, type="segment3")
data.complete = rbind(df1,df2,df3)
#create the plot
require(ggplot2)
g = ggplot(data.complete,aes(x=time,y=value))
g = g + geom_line()
g = g + geom_smooth(method = "lm",aes(group=type))
g
To have the underlying line graph connected the group aesthetic must be called in the smoother.

Understanding xgb.dump

I'm trying to understand the intuition about what is going on in the xgb.dump of a binary classification with an interaction depth of 1. Specifically how the same split is used twiced in a row (f38 < 2.5) (code lines 2 and 6)
The resulting output looks like this:
xgb.dump(model_2,with.stats=T)
[1] "booster[0]"
[2] "0:[f38<2.5] yes=1,no=2,missing=1,gain=173.793,cover=6317"
[3] "1:leaf=-0.0366182,cover=3279.75"
[4] "2:leaf=-0.0466305,cover=3037.25"
[5] "booster[1]"
[6] "0:[f38<2.5] yes=1,no=2,missing=1,gain=163.887,cover=6314.25"
[7] "1:leaf=-0.035532,cover=3278.65"
[8] "2:leaf=-0.0452568,cover=3035.6"
Is the difference between the first use of f38 and the second use of f38 simply the residual fitting going on? At first it seemed weird to me, and trying to understand exactly what's going on here!
Thanks!
Is the difference between the first use of f38 and the second use of f38 simply the residual fitting going on?
most likely yes - its updating the gradient after the first round and finding the same feature with split point in your example
Here's a reproducible example.
Note how I lower the learning rate in the second example and its finds the same feature, same split point again for all three rounds. In the first example it uses different features in all 3 rounds.
require(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
#high learning rate, finds different first split feature (f55,f28,f66) in each tree
bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1, nrounds = 3,nthread = 2, objective = "binary:logistic")
xgb.dump(model = bst)
# [1] "booster[0]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [3] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=1.71218"
# [5] "4:leaf=-1.70044" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [7] "5:leaf=-1.94071" "6:leaf=1.85965"
# [9] "booster[1]" "0:[f59<-9.53674e-07] yes=1,no=2,missing=1"
# [11] "1:[f28<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.784718"
# [13] "4:leaf=-0.96853" "2:leaf=-6.23624"
# [15] "booster[2]" "0:[f101<-9.53674e-07] yes=1,no=2,missing=1"
# [17] "1:[f66<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.658725"
# [19] "4:leaf=5.77229" "2:[f110<-9.53674e-07] yes=5,no=6,missing=5"
# [21] "5:leaf=-0.791407" "6:leaf=-9.42142"
## changed eta to lower learning rate, finds same feature(f55) in first split of each tree
bst2 <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = .01, nrounds = 3,nthread = 2, objective = "binary:logistic")
xgb.dump(model = bst2)
# [1] "booster[0]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [3] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.0171218"
# [5] "4:leaf=-0.0170044" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [7] "5:leaf=-0.0194071" "6:leaf=0.0185965"
# [9] "booster[1]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [11] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.016952"
# [13] "4:leaf=-0.0168371" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [15] "5:leaf=-0.0192151" "6:leaf=0.0184251"
# [17] "booster[2]" "0:[f28<-9.53674e-07] yes=1,no=2,missing=1"
# [19] "1:[f55<-9.53674e-07] yes=3,no=4,missing=3" "3:leaf=0.0167863"
# [21] "4:leaf=-0.0166737" "2:[f108<-9.53674e-07] yes=5,no=6,missing=5"
# [23] "5:leaf=-0.0190286" "6:leaf=0.0182581"

Resources