Improving missing-value removal code in R

I have to replace zero or negative values in a time-series vector. An example vector is given below:
c(-1856, -1770, -1787, 1322, 1605, 1417, 1423, 1371, 1200, 1295, 1441, 587, 189, 330, 1133, 1398, 1455, 100, 455, 59, 222, 330, 289, 251, 1717, 1258, 1732, -3258)
Negative and zero values are treated as missing, and each should be replaced by the mean of its nearest non-missing neighbors. For example, in the vector above the missing (negative) values will be replaced by the mean of 1322 and 1732.
The output will be as follows
1527, 1527, 1527, 1322, 1605, 1417, 1423, 1371, 1200, 1295, 1441, 587, 189, 330, 1133, 1398, 1455, 100, 455, 59, 222, 330, 289, 251, 1717, 1258, 1732, 1527
There may be multiple cases of missing values and edge cases should also be considered. Some of the examples are given below.
Vector ending with a negative value
x = c(1856, 1770, 1787, 1322, 1605, 1417, 1423, 1371, 1200, 1295, 441, 587, 189, 330, 1133, 1398, 1455, 100, 455, 59, 222, 330, 289, 251, 1717, 1258, 1732,-3258)
Vector with multiple missing values
x = c(1856, 1770, 1787, 1322, 1605, 1417, 1423, 1371, 1200, 1295, 441, 587, 189, 330, 1133, 1398, 1455, -100, -455, -59, -222, -330, -289, 251, 1717, 1258, 1732,-3258)
Vector starting with a negative value
x = c(-1856, 1770, 1787, 1322, 1605, 1417, 1423, 1371, 1200, 1295, 441, 587, 189, 330, 1133, 1398, 1455, 100, 455, 59, 222, 330, 289, 251, 1717, 1258, 1732,3258)
Vector starting and ending with negative values
I have managed to do the job, but my code is very slow. I have to run it on a very large data set, and it takes 2 hours for a single year. Any help improving my code would be appreciated.
My code
drop_remo = function(x){
if (sum(x<=0)>15){
finalx= x[x<=0 | x>0]=0
}else if (all(x>0)){
finalx = x
}else{
newx = c(x[1:length(x)],x)
nloc = which(newx <= 0)
Breaks <- c(which(diff(nloc) != 1), length(nloc))
vv = sapply(seq(length(Breaks) - 1),
function(n) nloc[(Breaks[n] + 1):Breaks[n+1]])
if (is.matrix(vv)){
b1<-apply(vv,2,list)
vv = lapply(b1,unlist)
}
for (i in vv){
#print(i)
backdata = newx[i[1]-1]
if(i[length(i)]==length(newx)){
smoothed = newx[length(newx)/2]
}else if((length(vv)==1) & (i[length(i)]==length(newx))){
forwarddata = newx[(i[length(i)]/2) +1]
}else{
forwarddata = newx[i[length(i)]+1]
smoothed = (backdata+forwarddata)/2
}
#print(backdata)
#print(forwarddata)
#print(smoothed)
for (m in i){
newx[m] = smoothed
}
}
finalx = newx[(length(x)+1):length(newx)]
}
}

One way, using base R, is to get the indices of the values where x > 0 and then take the mean of the values at the minimum and maximum index.
inds <- which(x > 0)
x[x <=0] <- mean(c(x[min(inds)] , x[max(inds)]))
Or a one-liner using range
x[x<=0] <- mean(x[range(which(x > 0))])
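For the updated data, we can use ave. Find the indices that need to be replaced (inds), create a grouping variable with diff and cumsum so that each run of consecutive indices forms one group, and compute the mean for each group depending on whether the run sits at the end, at the start, or in the middle of the vector: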
inds <- which(x <= 0)
x[inds] <- ave(inds, cumsum(c(0, diff(inds)) != 1), FUN = function(i) {
if((max(i) + 1) > length(x))
mean(c(x[min(i) - 1], x[which.max(x > 0)]))
else if ((min(i) - 1) <= 0)
mean(c(x[max(which(x > 0))], x[(max(i) + 1)]))
else
mean(c(x[min(i) - 1], x[(max(i) + 1)]))
})
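As a rough sketch, the same idea can be wrapped in a function and checked on a small vector. The name fill_nonpositive and the toy vector are made up for illustration; the wrap-around behavior (a run at the start borrows the last positive value, a run at the end borrows the first positive value) follows the ave code above.

```r
# Hypothetical helper: replace each run of values <= 0 by the mean of its
# nearest positive neighbours, wrapping around at the ends of the vector.
fill_nonpositive <- function(x) {
  inds <- which(x <= 0)
  pos <- which(x > 0)
  # Nothing to replace, or nothing to interpolate from
  if (length(inds) == 0 || length(pos) == 0) return(x)
  first_pos <- x[pos[1]]
  last_pos <- x[pos[length(pos)]]
  # A new run starts whenever consecutive indices are not adjacent
  run_id <- cumsum(c(0, diff(inds)) != 1)
  x[inds] <- ave(inds, run_id, FUN = function(i) {
    left <- if (min(i) == 1) last_pos else x[min(i) - 1]
    right <- if (max(i) == length(x)) first_pos else x[max(i) + 1]
    mean(c(left, right))
  })
  x
}

toy <- c(-5, 10, 20, 0, -3, 40, -1)
fill_nonpositive(toy)
# 25 10 20 30 30 40 25
```

Because the work is done once per run rather than once per element, this should scale much better than the nested loops in the question, although I have not timed it on the real data.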

Related

Calculate Latitudinal Range of species abundance

I'm trying to calculate the latitudinal ranges for species (ASVs) along a transect but cannot get my head around how to do this.
Basically, I want the maximum latitude minus the minimum latitude where a species is present, i.e. where abundance does not equal 0. For example, if Species 1 is present for the first time at -35 deg S and for the last time at -40 deg S, its latitudinal range would be 5 degrees. Thanks!
My data looks like this:
> dput(test[1:30, c(1:5)])
structure(list(Station_neat = c("001_DCM", "001_SA", "003_DCM",
"003_SA", "005_DCM", "005_SA", "007_DCM", "007_SA", "009_DCM",
"009_SA", "011_DCM", "011_SA", "013_DCM", "013_SA", "015_DCM",
"015_SA", "017_DCM", "017_SA", "019_DCM", "019_SA", "021_DCM",
"021_SA", "023_DCM", "023_SA", "025_DCM", "025_SA", "027_DCM",
"027_SA", "029_DCM", "029_SA"), Lat = c(-29.997, -29.997, -30.9975,
-30.9975, -31.9995, -31.9995, -32.99816667, -32.99816667, -34.00016667,
-34.00016667, -34.9995, -34.9995, -36.00083333, -36.00083333,
-36.9985, -36.9985, -38.00016667, -38.00016667, -38.99833333,
-38.99833333, -39.999, -39.999, -40.99783333, -40.99783333, -42.0005,
-42.0005, -42.99633333, -42.99633333, -43.9975, -43.9975), asv_3 = c(80,
0, 65, 0, 41, 0, 50, 0, 44, 0, 53, 0, 59, 0, 38, 0, 43, 0, 25,
0, 29, 51, 35, 22, 133, 35, 159, 83, 965, 414), asv_4 = c(766,
694, 286, 791, 421, 1202, 382, 431, 484, 684, 431, 529, 454,
722, 621, 370, 472, 439, 394, 243, 414, 518, 297, 300, 574, 396,
395, 1359, 1113, 541), asv_5 = c(1314, 2812, 729, 2874, 915,
3720, 1226, 2046, 940, 1783, 1220, 2627, 986, 3195, 1514, 566,
590, 1603, 325, 667, 748, 932, 616, 339, 1167, 1088, 988, 2333,
1563, 2146)), row.names = c(NA, 30L), class = "data.frame")
Edit: ASVs (e.g. asv_4) are my species. I have about 600 of these.
Edit 2: Scatterplot with mean latitudinal range and latitude (see comment):
"Maximum latitude minus the minimum latitude where a species is present, i.e. abundance does not equal 0"
A base solution:
lapply(test[grepl("asv", names(test))], \(x) diff(range(test$Lat[x > 0])))
# $asv_3
# [1] 14.0005
#
# $asv_4
# [1] 14.0005
#
# $asv_5
# [1] 14.0005
Its dplyr equivalent:
library(dplyr)
test %>%
summarise(across(starts_with("asv"), ~ diff(range(Lat[.x > 0]))))
# asv_3 asv_4 asv_5
# 1 14.0005 14.0005 14.0005
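With about 600 ASVs, some may be absent at every station, in which case range() is called on an empty vector and returns Inf and -Inf with a warning. Here is a small sketch of a guard for that case; the toy data frame and the helper name lat_range are invented for illustration.

```r
# Toy data: asv_c is never present
toy <- data.frame(
  Lat   = c(-30, -35, -40, -45),
  asv_a = c(0, 5, 2, 0),   # present at -35 and -40
  asv_b = c(1, 0, 0, 3),   # present at -30 and -45
  asv_c = c(0, 0, 0, 0)
)

# Hypothetical helper: latitudinal range, or NA if the species never occurs
lat_range <- function(abund, lat) {
  present <- lat[abund > 0]
  if (length(present) == 0) return(NA_real_)
  diff(range(present))
}

ranges <- sapply(toy[grepl("^asv", names(toy))], lat_range, lat = toy$Lat)
ranges
# asv_a asv_b asv_c
#     5    15    NA
```

sapply returns a named numeric vector here, which may be easier to plot or merge than the list produced by lapply.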

Histogram and density plots with multiple groups

I have a dataset consisting of 4 variables: CR, EN, LC and VU.
The first few values of my dataset:
CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200, …)
EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448, …)
LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968, …)
VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755,...)
I want to plot histograms of these variables, grouped in a single plot in R, with the normal distribution and density overlaid, similar to the plot below...
Could you please help me?
Here are the distributions, a clear-cut use of geom_density.
But first, to address "grouping", we need to pivot/reshape the data so that ggplot2 can automatically handle grouping. This will result in a column with a character (or factor) for each of the "CR", "EN", "LC", or "VU", and another column with the particular value. When pivoting, there is typically one or more columns that are preserved (an id, an x-value, a time/date, or something similar), but we don't have any data that would suggest something to preserve.
longdat <- tidyr::pivot_longer(dat, everything())
longdat
# # A tibble: 68 × 2
# name value
# <chr> <dbl>
# 1 CR 2
# 2 EN 24
# 3 LC 447
# 4 VU 75
# 5 CR 9
# 6 EN 52
# 7 LC 476
# 8 VU 96
# 9 CR 10
# 10 EN 86
# # … with 58 more rows
# # ℹ Use `print(n = ...)` to see more rows
library(ggplot2)
ggplot(longdat, aes(x = value, group = name, fill = name)) +
  geom_density(alpha = 0.2)
tidyr::pivot_longer works; one can also use melt from either reshape2 or data.table:
longdat <- reshape2::melt(dat, c())
## names are 'variable' and 'value' instead of 'name' and 'value'
Data
dat <- structure(list(CR = c(2, 9, 10, 14, 24, 27, 29, 30, 34, 43, 50, 74, 86, 105, 140, 155, 200), EN = c(24, 52, 86, 110, 144, 154, 206, 242, 300, 302, 366, 403, 422, 427, 427, 434, 448), LC = c(447, 476, 543, 580, 647, 685, 745, 763, 819, 821, 863, 904, 908, 926, 934, 951, 968), VU = c(75, 96, 97, 217, 297, 498, 511, 551, 560, 564, 570, 575, 609, 673, 681, 700, 755)), class = "data.frame", row.names = c(NA, -17L))
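If you would rather avoid extra packages for the reshaping step, base R's stack() does a similar job. The sketch below uses only the first five values of each column, and the histogram overlay is my guess at what the question is after; it assumes ggplot2 3.3.0 or later for after_stat().

```r
# First five values of each variable from the question
dat_small <- data.frame(
  CR = c(2, 9, 10, 14, 24),
  EN = c(24, 52, 86, 110, 144),
  LC = c(447, 476, 543, 580, 647),
  VU = c(75, 96, 97, 217, 297)
)

# stack() returns columns 'values' and 'ind'; rename to match pivot_longer
longdat_base <- stack(dat_small)
names(longdat_base) <- c("value", "name")

# Optional: histogram on the density scale with a density curve on top
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(longdat_base, aes(x = value, fill = name)) +
    geom_histogram(aes(y = after_stat(density)),
                   position = "identity", alpha = 0.3, bins = 10) +
    geom_density(alpha = 0.2)
  # print(p) draws the plot
}
```

Unlike pivot_longer, stack() keeps the rows grouped by variable (all CR values first, then EN, and so on), which makes no difference to the plot.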

Repositioning and increasing the weight of borders around dendrogram produced with R 'plot' function

I am trying to cut a dendrogram into three classes using the rect.hclust function, but when I export the graph, it cuts off the borders at the bottom of the graph. In addition, I would like to increase the weight of the borders, but I am not sure how to do this, as the lwd argument doesn't seem to exist for this function. What can I do to fix these parameters?
Data:
cluster <- data.frame(
  plot = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
  meas1 = c(443, 836, 903, 684, 94, 125, 733, 846, 625, 234,
            437, 775, 269, 774, 17, 502, 80, 51, 523, 229),
  meas2 = c(735, 574, 793, 261, 961, 136, 404, 138, 45, 935,
            698, 675, 594, 497, 152, 153, 30, 667, 547, 745),
  meas3 = c(23, 526, 36, 93, 708, 970, 399, 111, 456, 439,
            569, 503, 337, 213, 399, 850, 614, 491, 28, 452))
Code for hierarchical clustering and dendrogram generation:
#Generate distance matrix
dist_mat <- dist(cluster, method = 'euclidean')
#Hierarchical clustering
hclust_avg <- hclust(dist_mat, method = 'average')
#Cut into 3 classes
cut_avg <- cutree(hclust_avg, k = 3)
#Plot dendrogram
plot(hclust_avg, xlab = "", ylab = "Euclidean Distance", sub = "",
main = "Cluster Dendrogram", lwd = 2)
#Generate borders around each group
rect.hclust(hclust_avg, k=3, border = 2:6)
The exported graph looks as follows:
I need the red-green-blue borders to be thicker (higher border weights), and I also need them to not be cut off at the bottom of the graph.
I bypassed the problem of the missing lower edge of the rectangle by setting the figure margins to zero before drawing the rectangle.
The line width of the rectangles can be set by setting par(lwd), e.g. par(lwd=4), as in the example below:
cluster <- data.frame(
  plot = 1:20,
  meas1 = c(443, 836, 903, 684, 94, 125, 733, 846, 625, 234,
            437, 775, 269, 774, 17, 502, 80, 51, 523, 229),
  meas2 = c(735, 574, 793, 261, 961, 136, 404, 138, 45, 935,
            698, 675, 594, 497, 152, 153, 30, 667, 547, 745),
  meas3 = c(23, 526, 36, 93, 708, 970, 399, 111, 456, 439,
            569, 503, 337, 213, 399, 850, 614, 491, 28, 452))
#Generate distance matrix
dist_mat <- dist(cluster, method = 'euclidean')
#Hierarchical clustering
hclust_avg <- hclust(dist_mat, method = 'average')
#Cut into 3 classes
cut_avg <- cutree(hclust_avg, k = 3)
pars <- par()
#Plot dendrogram
plot(hclust_avg, xlab = "", ylab = "Euclidean Distance", sub = "",
main = "Cluster Dendrogram", lwd = 2)
par(lwd=4, mar=c(0,0,0,0))
#Generate borders around each group
rect.hclust(hclust_avg, k=3, border = 2:6)
# reset par
par(lwd=pars$lwd, mar=pars$mar)
Created on 2020-06-30 by the reprex package (v0.3.0)
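A slightly shorter way to save and restore the settings relies on par() returning the previous values of whatever it changes. The sketch below only demonstrates the line width part and uses the built-in USArrests data rather than the data from the question.

```r
# Cluster a built-in data set and draw the dendrogram
hc <- hclust(dist(USArrests), method = "average")
plot(hc, labels = FALSE, xlab = "", sub = "", main = "Cluster Dendrogram")

# par() invisibly returns the old values of the parameters it sets
op <- par(lwd = 4)

# rect.hclust() draws with rect(), which picks up par("lwd");
# it invisibly returns one element per highlighted cluster
groups <- rect.hclust(hc, k = 3, border = 2:4)

# Restore the previous line width
par(op)
```

Restoring with par(op) touches only the parameters that were changed, so there is no need to keep a full copy of par() around.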

How to change distance between breaks for continuous x-axis on ggplot?

I have a dataset with diversity indices on the y-axis and depth on the x-axis. I am looking at how diversity changes (increases or decreases) with depth. It is informative to visualize these changes over depth (so transforming isn't helpful); however, this is difficult because the number of samples differs between depths (more samples at shallower than at deeper depths). With the following code:
breaks_depth=c(0,50,100,150,250,350,450,500,1200)
ggplot(data=df, aes(x=Depth, y=Diversity)) +
geom_line()+
scale_y_continuous(breaks=seq(0,1400,200), limits=c(0,1400))+
scale_x_continuous(breaks=breaks_depth, limits=c(0,1200))
I get the following plot:
I would like to get a plot such that the distance between 500m and 1200m depth is smaller and the distance between the shallower depths (0-150m) is greater. Is this possible? I have tried expand and different break and limit variations. The dput() of this dataset can be found here. The rownames are the sample IDs and the columns I am using for the plot are: y-axis=invsimpson_rd, and x-axis=Depth_rd. TIA.
****EDIT*****
Winning code and plot, modified from Calum's answer below.
ggplot(data=a_div, aes(x=Depth_rd, y=invsimpson_rd)) +
geom_line()+
scale_y_continuous(breaks=seq(0,1400,200), limits=c(0,1400))+
scale_x_continuous(trans="log10",breaks = c(0,
15,25,50,100,150,200,250,300,350,400,450, seq(600, 1200, by = 200)))
Here's an example with the built-in economics dataset. You can specify the breaks however you want, as usual, but the "sqrt" transformation shifts the plotted positions so that there is more space near the beginning of the series. You can use other built-in transformations or define your own as well.
EDIT: updated with example data and a comparison of some common trans options.
library(tidyverse)
tbl <- structure(list(yval = c(742, 494, 919, 625, 124, 788, 583, 213, 715, 363, 15, 313, 472, 559, 314, 494, 388, 735, 242, 153, 884, 504, 267, 454, 325, 305, 746, 628, 549, 345, 327, 230, 271, 486, 971, 979, 857, 779, 394, 903, 585, 238, 702, 850, 611, 710, 694, 674, 1133, 468, 784, 634, 234, 61, 325, 505, 693, 1019, 766, 435, 407, 772, 925, 877, 187, 290, 782, 674, 1263, 1156, 935, 499, 791, 797, 537, 308, 761, 744, 674, 764, 560, 805, 540, 427, 711), xval = c(80, 350, 750, 100, 20, 200, 350, 50, 110, 20, 200, 350, 60, 100, 20, 40, 60, 100, 20, 40, 350, 50, 20, 40, 50, 30, 40, 260, 1000, 200, 200, 200, 500, 50, 350, 360, 380, 250, 60, 190, 40, 70, 70, 40, 40, 70, 180, 180, 440, 370, 130, 1200, 20, 20, 30, 80, 120, 200, 220, 120, 40, 80, 350, 750, 20, 80, 200, 320, 500, 220, 160, 80, 140, 350, 100, 40, 350, 100, 200, 340, 60, 40, 100, 60, 40)), .Names = c("yval", "xval"), row.names = c(NA, -85L), class = c("tbl_df", "tbl", "data.frame"))
ggplot(tbl) +
geom_line(aes(x = xval, y = yval)) +
scale_x_continuous(trans = "sqrt", breaks = c(0,50,100,150,250,350,450,500,1200))
ggplot(tbl) +
geom_line(aes(x = xval, y = yval)) +
scale_x_continuous(trans = "log10", breaks = c(0,50,100,150,250,350,450,500,1200))
Created on 2018-04-27 by the reprex package (v0.2.0).
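To get a feel for what each transformation does before plotting, you can compute where the breaks land on the transformed scale. This is only a back-of-the-envelope sketch: it assumes the axis runs exactly from the smallest to the largest break, and the helper name share is made up.

```r
breaks_depth <- c(0, 50, 100, 150, 250, 350, 450, 500, 1200)

# Position of each break under the three transformations
spacing <- data.frame(
  depth    = breaks_depth,
  identity = breaks_depth,
  sqrt     = sqrt(breaks_depth),
  log10    = log10(breaks_depth)  # log10(0) is -Inf, so 0 cannot be placed
)

# Hypothetical helper: fraction of the axis length between two breaks
share <- function(pos, from, to) {
  (pos[breaks_depth == to] - pos[breaks_depth == from]) / diff(range(pos))
}

share(spacing$identity, 0, 150)     # 0.125
share(spacing$identity, 500, 1200)  # about 0.58
share(spacing$sqrt, 0, 150)         # about 0.35
share(spacing$sqrt, 500, 1200)      # about 0.35
```

On the untransformed axis the 0 to 150 m stretch gets an eighth of the width, while under "sqrt" it gets roughly as much room as 500 to 1200 m. Under "log10" a break at 0 maps to -Inf, so 0 cannot be shown as a break or used as a limit.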

Assign a binary vector based on blocks of data within another vector

I have a data frame:
dat <- data.frame(diffsecs=(c(189, 245, 13988, 2396, 29601, 263, 297, 292, 230, 257, 192,
286, 236, 261, 286, 268, 294, 260, 286, 299, 514, 2287, 234,
195, 250, 519, 560, 3314, 12340, 186, 184, 180, 180, 180, 180,
180, 180, 180, 180, 180, 3072, 180, 180, 206, 180, 180, 180,
360, 180, 180, 180, 180, 5220, 180, 437, 246, 218, 212, 472,
2356, 2641, 363, 425, 757, 403, 181, 355, 192, 192, 784, 238,
250, 261, 272, 2554, 29524, 4482, 6762, 1252, 269, 303, 294,
286, 273, 289, 274, 216, 255, 180, 252, 322, 238, 583, 289, 317,
308, 305, 308, 312, 330)))
It has blocks of multiple consecutive rows equal to 180. I want to assign a binary vector which equals 1 when the value of diffsecs equals 180 and 0 otherwise. However, I only want it to equal 1 within a block of 5 or more consecutive instances of 180. So if there are only 3 consecutive values of 180, the binary vector will equal 0 for them.
I tried using the loop
total<- nrow(dat)
len<- 1:total
for(i in len){
temp<- dat[i:(i+5),]
xdiff<- ifelse(mean(temp$diffsecs)>178 & mean(temp$diffsecs)<182 ,1,0)
temp2<- cbind(dat[i,],xdiff)
if(i==1) {dat2 <- temp2}
else {dat2<- rbind(dat2,temp2)}
}
But it doesn't manage it and assigns shorter blocks than required.
You can take advantage of the great rle function and its inverse counterpart:
RLE <- rle(dat$diffsecs)
RLE$values <- ifelse(RLE$values == 180 & RLE$lengths >= 5,1,0)
dat2 <- cbind(dat,binarycol=inverse.rle(RLE))
As correctly pointed out by @Frank, you can shorten the second line to:
RLE$values <- as.integer(RLE$values == 180 & RLE$lengths >= 5)
or even :
RLE$values <- RLE$values == 180 & RLE$lengths >= 5
if a vector of FALSE/TRUE is ok for you instead of 0/1
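To see what rle and inverse.rle are doing, here is the same approach on a short made-up vector with one run of three 180s and one run of five:

```r
x <- c(180, 180, 180, 5, 180, 180, 180, 180, 180, 7)

# Run-length encoding: values 180, 5, 180, 7 with lengths 3, 1, 5, 1
r <- rle(x)

# Flag only the runs of 180 that are at least 5 long
r$values <- as.integer(r$values == 180 & r$lengths >= 5)

# Expand the per-run flags back to one flag per element
flag <- inverse.rle(r)
flag
# 0 0 0 0 1 1 1 1 1 0
```

The first three 180s stay at 0 because their run is too short, which is the part the moving-average loop in the question gets wrong.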
With data.table, you can use rleid:
library(data.table)
setDT(dat)
dat[, v :=
(diffsecs==180)*(.N >= 5)
, by = rleid(diffsecs == 180)][]
