tsclean not working on single vector data - r

I have a simple dataset (shown below). I see a clear outlier (Qty = 6), which should get corrected after processing through tsclean.
c(6, 187, 323, 256, 289, 387, 335, 320, 362, 359, 426, 481,
356, 408, 497, 263, 330, 521, 406, 350, 478, 320, 339)
What I have:
library(forecast)
library(readr)  # read_csv() comes from readr

data1 <- read_csv("sample.csv", col_names = FALSE)
count_qty <- ts(data1, frequency = 12)
data1$clean_qty <- tsclean(count_qty)
and this is what the data returns:
      X1 clean_qty[,"X1"]
   <dbl>            <dbl>
 1     6                6
 2   187              187
 3   323              323
 4   256              256
 5   289              289
 6   387              387
 7   335              335
 8   320              320
 9   362              362
10   359              359
# ... with 13 more rows
The first item should be removed.

You can remove outliers using boxplot:
vec1[! vec1 %in% boxplot(vec1, plot = F)$out]
# [1] 323 256 289 387 335 320 362 359 426 481 356 408 497 263 330 521 406 350 478 320 339
Note that 187 is also an outlier; as you said, 6 is the obvious one.
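If you'd rather have tsclean do the replacement, a minimal sketch (assuming the vector above, here called vec1, is the full series) is to build the ts from a plain numeric vector rather than a tibble, so the series is univariate:
library(forecast)

vec1 <- c(6, 187, 323, 256, 289, 387, 335, 320, 362, 359, 426, 481,
          356, 408, 497, 263, 330, 521, 406, 350, 478, 320, 339)
clean_qty <- tsclean(ts(vec1, frequency = 12))  # replaces outliers and fills NAs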


How to replace values in columns based on row values and column names?

I have previously posted a question on subsetting columns from row values on GIS StackExchange: here.
In short, I would like to set data to NA if the column name (e.g. 100) is less than the row value of s_mean (e.g. 101).
It worked for specific applications but now it does not work, and I get the following error:
Error: Can't subset columns that don't exist.
x Locations 304, 303, 302, 301, 300, etc. don't exist.
i There are only 197 columns.
Run `rlang::last_error()` to see where the error occurred.
Here is the data:
# A tibble: 2,937 x 197
ID doy FireID Year sE NAME L1NAME ID_2 area s_count s_mean s_median s_stdev s_min doydiff ID_E5 32 33 34 35
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2246 173 30048 2015 0 A T 30048 3.86e6 0 100 0 0 0 73 56 267. 265. 264. 265.
2 2275 174 30076 2015 0 A T 30076 2.15e6 0 100 0 0 0 74 533 266. 266. 263. 264.
3 704 294 28542 2015 1381 A T 28542 6.44e5 0 100 0 0 0 194 562 277. 277. 278. 279.
4 711 110 28549 2015 0 NA NA 28549 2.15e5 0 101 0 0 0 9 569 262. 264. 260. 262.
5 690 161 28528 2015 232 A T 28528 4.29e5 0 101 0 0 0 60 580 280. 279. 280. 279.
6 692 331 28530 2015 0 M M 28530 2.15e5 0 101 0 0 0 130 582 280. 279. 281. 280.
7 667 47 28506 2015 232 M M 28506 2.79e6 0 10 0 0 0 37 589 280. 282. 281. 280.
8 672 188 28511 2015 0 NA NA 28511 2.79e6 0 101 0 0 0 87 594 254. 261. 259. 254.
9 657 171 28496 2015 578 NA NA 28496 8.59e5 0 101 0 0 0 170 611 256. 263. 260. 254.
10 635 301 28474 2015 1084 M M 28474 1.50e6 0 101 0 0 0 200 621 282. 282. 282. 281.
The data columns continue up to column name 212; not all of them are shown here.
Here is the script:
library(readr)

polydata <- read_csv("path/E15.csv")
polydata$s_mean <- round(polydata$s_mean)
polydata <- polydata[order(polydata$s_mean), ]
# slice each row, and put each slice in a list
df_sub <- lapply(1:nrow(polydata), function(x) {
  polydata[x, c(1, 10, polydata$s_mean[x]:187 + 10)]  # + 10 because of the offset: doy_columns start at 11
})
Why do I get an error saying that I select too many columns when the largest index I specify is 187 + 10?
What should be changed?
I eventually want this to be the outcome (compare the column names to s_mean to better understand the desired output):
ID s_mean 32 33 34 35 36 ... 212
1 30 267 278 270 269 267 ... 298
2 100 NA NA NA NA NA ... 298
3 35 NA NA NA 242 246 ... 298
First, the error: in R, : binds tighter than +, so polydata$s_mean[x]:187 + 10 builds the sequence before adding 10. Whenever s_mean[x] is greater than 187, the sequence runs downward (e.g. 294:187), and adding 10 yields column locations up to 304, beyond the 197 columns that exist.
For the desired output, we can use across from dplyr and refer to column names using cur_column. From there, we can use an ifelse to replace the data with NA if the column name is less than s_mean. I created a toy dataset to illustrate the solution, which can be found at the end of this post.
library(dplyr)

pdat1 %>%
  mutate(across(`32`:`35`,
                ~ ifelse(s_mean > as.numeric(cur_column()), NA, .)))
#> ID s_mean 32 33 34 35
#> 1 2246 30 267 265 264 265
#> 2 2275 100 NA NA NA NA
#> 3 704 100 NA NA NA NA
#> 4 711 34 NA NA 260 262
#> 5 690 101 NA NA NA NA
#> 6 692 101 NA NA NA NA
#> 7 667 10 280 282 281 280
#> 8 672 101 NA NA NA NA
#> 9 657 101 NA NA NA NA
#> 10 635 101 NA NA NA NA
Toy Dataset:
pdat1 <- structure(list(ID = c(2246L, 2275L, 704L, 711L, 690L, 692L, 667L, 672L,
657L, 635L),
s_mean = c(30L, 100L, 100L, 34L, 101L, 101L, 10L, 101L,
101L, 101L),
`32` = c(267, 266, 277, 262, 280, 280, 280, 254, 256, 282),
`33` = c(265, 266, 277, 264, 279, 279, 282, 261, 263, 282),
`34` = c(264, 263, 278, 260, 280, 281, 281, 259, 260, 282),
`35` = c(265, 264, 279, 262, 279, 280, 280, 254, 254, 281)),
class = "data.frame",
row.names = c("1", "2", "3", "4","5", "6", "7", "8", "9", "10"))
#> ID s_mean 32 33 34 35
#> 1 2246 30 267 265 264 265
#> 2 2275 100 266 266 263 264
#> 3 704 100 277 277 278 279
#> 4 711 34 262 264 260 262
#> 5 690 101 280 279 280 279
#> 6 692 101 280 279 281 280
#> 7 667 10 280 282 281 280
#> 8 672 101 254 261 259 254
#> 9 657 101 256 263 260 254
#> 10 635 101 282 282 282 281
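For comparison, here is a base R sketch of the same replacement (assuming the pdat1 toy data above; doy_cols is a hypothetical helper listing the doy column names):
# For each doy column, blank out cells whose numeric column name is
# smaller than that row's s_mean.
doy_cols <- c("32", "33", "34", "35")
for (nm in doy_cols) {
  pdat1[[nm]][pdat1$s_mean > as.numeric(nm)] <- NA
}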

how to create weekly result from daily data using R?

I am trying to create weekly cumulative results from the daily data below:
date A B C D E F G H
16-Jan-22 227 441 3593 2467 9 6 31 2196
17-Jan-22 224 353 3555 2162 31 5 39 2388
18-Jan-22 181 144 2734 2916 0 0 14 1753
19-Jan-22 95 433 3610 3084 42 19 10 2862
20-Jan-22 141 222 3693 3149 183 19 23 2176
21-Jan-22 247 426 3455 4016 68 0 1 2759
22-Jan-22 413 931 4435 4922 184 2 39 3993
23-Jan-22 389 1340 5433 5071 200 48 27 4495
24-Jan-22 281 940 6875 5009 343 47 71 3713
25-Jan-22 314 454 5167 4555 127 1 68 3554
26-Jan-22 315 973 5789 3809 203 1 105 4456
27-Jan-22 269 1217 6776 4578 227 91 17 5373
28-Jan-22 248 1320 5942 3569 271 91 156 4260
29-Jan-22 155 1406 6771 4328 426 44 109 4566
Solution using data.table and lubridate
library(lubridate)
library(data.table)
setDT(df)
df[, lapply(.SD, sum), by = isoweek(dmy(date))]
# isoweek A B C D E F G H
# 1: 2 227 441 3593 2467 9 6 31 2196
# 2: 3 1690 3849 26915 25320 708 93 153 20426
# 3: 4 1582 6310 37320 25848 1597 275 526 25922
I wanted to provide a solution using tidyverse principles.
You will need the group_by() and summarize() functions, together with the very useful across() function.
# recreate the data in a tribble
library(tibble)    # for tribble()
library(dplyr)
library(lubridate)

df <- tribble(
~"date", ~"A", ~"B", ~"C", ~"D", ~"E", ~"F", ~"G", ~H,
"16-Jan-22",227, 441, 3593, 2467, 9, 6, 31, 2196,
"17-Jan-22",224, 353, 3555, 2162, 31, 5, 39, 2388,
"18-Jan-22",181, 144, 2734, 2916, 0, 0, 14, 1753,
"19-Jan-22",95, 433, 3610, 3084, 42, 19, 10, 2862,
"20-Jan-22",141, 222, 3693, 3149, 183, 19, 23, 2176,
"21-Jan-22",247, 426, 3455, 4016, 68, 0, 1, 2759,
"22-Jan-22",413, 931, 4435, 4922, 184, 2, 39, 3993,
"23-Jan-22",389, 1340, 5433, 5071, 200, 48, 27, 4495,
"24-Jan-22",281, 940, 6875, 5009, 343, 47, 71, 3713,
"25-Jan-22",314, 454, 5167, 4555, 127, 1, 68, 3554,
"26-Jan-22",315, 973, 5789, 3809, 203, 1, 105, 4456,
"27-Jan-22",269, 1217, 6776, 4578, 227, 91, 17, 5373,
"28-Jan-22",248, 1320, 5942, 3569, 271, 91, 156, 4260,
"29-Jan-22",155, 1406, 6771, 4328, 426, 44, 109, 4566)
# change date to Date format; these dates are day-month-year, so use dmy()
df$date <- dmy(df$date)
# create a new column "week_num" and group by it,
# then summarize each column except "date" with its sum
df %>%
  group_by(week_num = lubridate::isoweek(date)) %>%
  summarize(across(-c("date"), sum))
# Results (matching the data.table answer above):
  week_num     A     B     C     D     E     F     G     H
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1        2   227   441  3593  2467     9     6    31  2196
2        3  1690  3849 26915 25320   708    93   153 20426
3        4  1582  6310 37320 25848  1597   275   526 25922
group_by() and summarize() are relatively straightforward. across() is a fairly new verb that is super powerful: it lets you reference columns using tidy-selection principles (e.g. starts_with(), c(1:9), etc.), and if you give it a formula it applies that formula to each of the selected columns. Less typing!
Alternatively you would have to sum each column individually, e.g. A = sum(A), which is more typing. A quick sketch of that tidy selection follows.
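For instance, a minimal sketch (using the df above, after the dmy() conversion): where(is.numeric) selects every numeric column, so A through H never have to be named.
df %>%
  group_by(week_num = lubridate::isoweek(date)) %>%
  summarize(across(where(is.numeric), sum))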

Loop through and mutate multiple columns using an incrementing variable

My goal is to perform multiple column operations in one line of code without hard coding the variable names.
structure(list(Subject = 1:6, Congruent_1 = c(359, 391, 384,
316, 287, 403), Congruent_2 = c(361, 378, 322, 286, 276, 363),
Congruent_3 = c(342, 355, 334, 274, 297, 335), Congruent_4 = c(365,
503, 324, 256, 266, 388), Congruent_5 = c(335, 354, 320,
272, 260, 337), Incongruent_1 = c(336, 390, 402, 305, 310,
400), Incongruent_2 = c(366, 407, 386, 280, 243, 393), Incongruent_3 = c(323,
455, 317, 308, 259, 325), Incongruent_4 = c(361, 392, 357,
274, 342, 350), Incongruent_5 = c(300, 366, 378, 263, 258,
349)), row.names = c(NA, 6L), class = "data.frame")
My data looks like the structure above.
I need to do column subtraction and save those new values into new columns. For example, a new column named selhistB_1 should be computed as Incongruent_1 - Congruent_1. I tried to write a for loop that indexes the existing columns by their names and creates new columns using the same indexing variable:
for(i in 1:5)(
DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) = as.name(paste("Incongruent_",i,sep="")) - as.name(paste("Congruent_",i,sep="")))
)
but I received this error:
Error: unexpected '=' in: "for(i in 1:5)( DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) ="
I'd rather use this modular approach than hard-code and write out selhistB_i = Incongruent_i - Congruent_i five times inside mutate().
I also wonder if I could achieve the same goal on the long version of this data; maybe that would make more sense.
library(dplyr)
library(tidyr)  # for pivot_longer() / pivot_wider()

d %>%
  pivot_longer(-Subject,
               names_to = c(".value", "id"),
               names_sep = "_") %>%
  mutate(selhistB = Incongruent - Congruent) %>%
  pivot_wider(names_from = id,
              values_from = c(Congruent, Incongruent, selhistB))
Or just skip the last pivot, and keep everything long.
As long as you are already using tidyverse packages, the following code will do exactly what you need:
library(dplyr)
library(rlang)  # for sym() and the := operator

for(i in 1:5){
  DP4 <- DP4 %>%
    mutate(!!sym(paste0("selhistB_", i)) :=
             !!sym(paste0("Incongruent_", i)) - !!sym(paste0("Congruent_", i)))
}
DP4
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1 -23 5 -19 -4 -35
2 -1 29 100 -111 12
3 18 64 -17 33 58
4 -11 -6 34 18 -9
5 23 -33 -38 76 -2
6 -3 30 -10 -38 12
You can use split.default to split on the column-name suffix, then loop over the list and subtract the first column (Congruent) from the second (Incongruent), i.e.
lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))), function(i) i[2] - i[1])
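To attach those differences back onto the data, a short sketch (df is the data above; diffs and selhist are illustrative names):
# Incongruent - Congruent per suffix, renamed and bound back on.
diffs <- lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))),
                function(i) i[2] - i[1])
selhist <- setNames(do.call(cbind, diffs), paste0("selhistB_", seq_along(diffs)))
df_out <- cbind(df, selhist)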
Subtracting over all matching columns (Incongruent minus Congruent), then binding with cbind, try:
x <- df1[, grepl("^I", colnames(df1)) ] - df1[, grepl("^C", colnames(df1)) ]
names(x) <- paste0("selhistB_", seq_along(names(x)))
res <- cbind(df1, x)
res
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1 -23 5 -19 -4 -35
2 -1 29 100 -111 12
3 18 64 -17 33 58
4 -11 -6 34 18 -9
5 23 -33 -38 76 -2
6 -3 30 -10 -38 12

Spline Interpolation in R

I have some data with
h = c[157 144 80 106 124 46 207 188 190 208 143 170 162 178 155 163 162 149 135 160 149 147 133 146 126 120 151 74 122 145 160 155 173 126 172 93]
Then I get local maxima with diff function
max = [1 5 7 10 12 14 16 20 24 27 31 33 35 36]
I have simple matlab code to find spline interpolation
maxenv = spline(max,h(max),1:N);
This code will show result
maxenv = 157 86.5643152828762 67.5352696350679 84.9885891697257 124 169.645228239041 207 224.396380746179 223.191793341491 208 185.421032390413 170 172.173624690130 178 172.759468849065 163 158.147870987344 157.874968589889 159.414581897490 160 157.622863516083 153.308219179638 148.839465253375 146 146.051320982064 148.167322961480 151 153.474200222188 155.606188003845 157.685081783579 160 163.653263154551 173 186.027639098700 172 93
Now I'm trying with R but got some errors
maxenv <- spline(max, h(max), seq(N))
Then error
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
Your code and question are really not clear, so I'm not 100% sure this is what you are looking for:
# vector of values
h <- c(157, 144, 80, 106, 124, 46, 207, 188, 190, 208, 143, 170, 162, 178, 155,
163, 162, 149, 135, 160, 149, 147, 133, 146, 126, 120, 151, 74, 122, 145,
160, 155, 173, 126, 172, 93)
# local maxima
lmax <- h[c(1, which(diff(sign(diff(h)))==-2)+1, length(h))]
# spline calculation
spl <- spline(1:length(lmax), lmax)
# visual inspection
plot(spl)
lines(spl)
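If you want to mirror the Matlab call spline(max, h(max), 1:N) more literally, note that R subsets with brackets, so h(max) tries to call h as a function; also, max shadows base::max, so a name like idx is safer. A sketch under those assumptions:
# Interpolate through the maxima and evaluate at every index 1..N,
# as the Matlab code does.
N <- length(h)
idx <- c(1, which(diff(sign(diff(h))) == -2) + 1, N)  # maxima positions
maxenv <- spline(idx, h[idx], xout = 1:N)$y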

Counting how many values but the most common one there is in a data.frame, with R

Sample data:
> ind1
Ind Gb19a Gb19b Gb24a Gb24b Gb28a Gb28b Gb11a Gb11b
1 9-2-J1-N3 378 386 246 248 360 372 162 261
2 9-2-J1-N3 380 386 246 248 360 372 187 261
14 9-2-J1-N3 380 386 246 248 NA NA NA NA
15 9-2-J1-N3 NA 246 248 360 187 NA NA NA
16 9-2-J1-N3 380 386 380 386 378 386 380 386
17 9-2-J1-N3 380 386 246 248 360 372 187 261
19 9-2-J1-N3 360 372 360 372 360 372 360 372
20 9-2-J1-N3 187 261 187 261 162 261 187 261
21 9-2-J1-N3 380 386 240 246 360 372 187 NA
> class(ind1)
[1] "data.frame"
So, for every column, I need to count how many values other than the most common one there are. Expected output would be:
Gb19a 3
Gb19b 3
Gb24a 5
etc.
I have a solution given by folks here in a previous question I asked (thanks to them!) that does the calculation explicitly for each variable, but I don't think it's a workable solution for my situation.
> table(ind1$Gb19a)
187 360 378 380
1 1 1 5
counts1 <- as.data.frame(table(ind1$Gb19a), stringsAsFactors = FALSE)
modal_value1 <- which.max(counts1$Freq)
(sum(counts1$Freq)-counts1$Freq[modal_value1])
[1] 3
How can I apply this to the entire data.frame?
As always, thanks for any help!
You just say the word!
("How can I apply this to the entire data.frame?")
countValsButMostFreq <- function(values){
  counts1 <- as.data.frame(table(values), stringsAsFactors = FALSE)
  modal_value1 <- which.max(counts1$Freq)
  return(sum(counts1$Freq) - counts1$Freq[modal_value1])
}
ind1 <- rbind.data.frame(
c('9-2-J1-N3', 378, 386, 246, 248, 360, 372, 162, 261),
c('9-2-J1-N3', 380, 386, 246, 248, 360, 372, 187, 261),
c('9-2-J1-N3', 380, 386, 246, 248, NA, NA, NA, NA),
c('9-2-J1-N3', NA, 246, 248, 360, 187, NA, NA, NA),
c('9-2-J1-N3', 380, 386, 380, 386, 378, 386, 380, 386),
c('9-2-J1-N3', 380, 386, 246, 248, 360, 372, 187, 261),
c('9-2-J1-N3', 360, 372, 360, 372, 360, 372, 360, 372),
c('9-2-J1-N3', 187, 261, 187, 261, 162, 261, 187, 261),
c('9-2-J1-N3', 380, 386, 240, 246, 360, 372, 187, NA))
colnames(ind1) <- c('Ind', 'Gb19a', 'Gb19b', 'Gb24a', 'Gb24b', 'Gb28a', 'Gb28b', 'Gb11a', 'Gb11b')
res <- apply(X=ind1,MARGIN=2,FUN=countValsButMostFreq)
res
Result:
Ind Gb19a Gb19b Gb24a Gb24b Gb28a Gb28b Gb11a Gb11b
0 3 3 5 5 3 2 3 2
Here is an example for mtcars that gets the frequency of the most common value in each column (a variant after the explanation below converts this into the count the question asks for):
as.data.frame(
  lapply(mtcars,
         function(x) unname(tail(sort(table(x)), 1))
  )
)
mpg cyl disp hp drat wt qsec vs am gear carb
1 2 14 3 3 3 3 2 18 19 15 10
How does this work?
Set up a function to get the frequency count for a single column:
Use table to get your counts
Sort the results
Get the last value with tail
Use unname to drop the name
Then simply pass that to lapply and convert the results to a data.frame
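And to get the count the question actually asks for, a small variant of the same idea: subtract that modal frequency from the number of non-missing values in each column.
as.data.frame(
  lapply(mtcars,
         function(x) sum(!is.na(x)) - unname(tail(sort(table(x)), 1))
  )
)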
You're looking for the apply family. I'd probably use sapply here but that's your choice.
ind1 <- read.table(text="Ind Gb19a Gb19b Gb24a Gb24b Gb28a Gb28b Gb11a Gb11b
1 9-2-J1-N3 378 386 246 248 360 372 162 261
2 9-2-J1-N3 380 386 246 248 360 372 187 261
14 9-2-J1-N3 380 386 246 248 NA NA NA NA
15 9-2-J1-N3 NA 246 248 360 187 NA NA NA
16 9-2-J1-N3 380 386 380 386 378 386 380 386
17 9-2-J1-N3 380 386 246 248 360 372 187 261
19 9-2-J1-N3 360 372 360 372 360 372 360 372
20 9-2-J1-N3 187 261 187 261 162 261 187 261
21 9-2-J1-N3 380 386 240 246 360 372 187 NA", header=TRUE)
# hapax counts values that occur exactly once; in this data every
# non-modal value is unique, so it equals "all but the most common value"
hapax <- function(x) {x <- na.omit(x); length(setdiff(unique(x), x[duplicated(x)]))}
sapply(ind1, hapax)
apply(mymatrix, 2, myfunc) runs myfunc(onecolumn) on each column of a matrix or data frame.
myfunc would be the code you posted to calculate the sum, except ind1$Gb19a is replaced with onecolumn.
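A minimal sketch of that idea, wrapping the question's snippet in a function (count_non_modal is a hypothetical name):
# Count everything except the most frequent value; table() drops NAs.
count_non_modal <- function(onecolumn) {
  counts <- as.data.frame(table(onecolumn), stringsAsFactors = FALSE)
  sum(counts$Freq) - max(counts$Freq)
}
apply(ind1[-1], 2, count_non_modal)  # skip the Ind column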
