I know how to take the lagged difference:
delX = diff(x)
But the only way I know to take the lagged sum is:
sumY = apply(embed(c(0,y),2),1, sum)
Is there a function that can take the lagged sum? This way (or sliding the index in some other fashion) is not very intuitive.
You're looking for filter (written as stats::filter below, in case another package such as dplyr masks it):
x <- 1:10
stats::filter(x, filter=c(1,1), sides=1)
# [1] NA 3 5 7 9 11 13 15 17 19
You could also use head and tail:
head(x, -1) + tail(x, -1)
# [1] 3 5 7 9 11 13 15 17 19
Two more options:
x <- 1:10
x + dplyr::lag(x)
# [1] NA 3 5 7 9 11 13 15 17 19
x + data.table::shift(x)
# [1] NA 3 5 7 9 11 13 15 17 19
Note that you can easily change the number of lags in both functions through their n argument. Instead of lagging, you can also create a leading vector by using dplyr::lead() or data.table::shift(x, 1L, type = "lead"). Both functions also let you choose the padding value (NA by default): the argument is default in dplyr::lag() and fill in data.table::shift().
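As a rough base R sketch of how the number of lags could be varied without extra packages, the head/tail idea above generalizes to a lag of k. The helper name lagged_sum and its k argument are made up for this illustration, and it assumes 1 <= k < length(x).

```r
# Hypothetical helper: add each element to the one k positions before it.
lagged_sum <- function(x, k = 1L) {
  stopifnot(k >= 1, k < length(x))
  # head(x, -k) drops the last k values, tail(x, -k) drops the first k;
  # the leading NAs keep the result aligned with x
  c(rep(NA, k), head(x, -k) + tail(x, -k))
}

x <- 1:10
lagged_sum(x)     # lag of one, same values as x + dplyr::lag(x)
lagged_sum(x, 2)  # lag of two
```

The padding keeps the output the same length as the input, which makes it easy to store next to x in a data frame.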
I'm looking for a way to extract evenly spaced elements in a vector. I'd like a general way to do this because I am trying to specify the values that I want in a plotly chart. I tried using pretty but that only seems to work with ggplot2.
I'm pretty much looking for an R version of this question that was answered for Python.
Here's a sample set. This sample is a vector of 23 elements; 23 is prime, so the length cannot be divided into equal parts.
x <- 1:23
Ideally, there would be a function that takes a number for the spacing (n) and returns a subset of x whose elements are spaced n apart and that also includes the first and last elements. For example:
split_func(x, n = 4)
[1] 1 4 8 12 16 20 23
The output elements are centered between the first and last elements and are spaced by 4, with the exception of the first/second and second-to-last/last elements.
A couple other examples:
split_func(x, n = 5)
[1] 1 5 10 15 20 23 # either this
[1] 1 4 9 14 19 23 # or this would work
split_func(1:10, n = 3)
[1] 1 3 6 9 10 # either this
[1] 1 2 5 8 10 # or this would work
split_func(1:27, n = 6)
[1] 1 5 11 17 23 27
Is there a function that does this already?
Try this:
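split_func <- function(x, by) {
  r <- diff(range(x))
  # interior offsets, spaced `by` apart
  out <- seq(0, r - by - 1, by = by)
  # shift the interior points so they sit centered between the endpoints,
  # round to whole numbers, then append the last element
  c(round(min(x) + c(0, out - 0.51 + (max(x) - max(out)) / 2), 0), max(x))
}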
split_func(1:23, 4)
# [1] 1 4 8 12 16 20 23
split_func(1:23, 5)
# [1] 1 4 9 14 19 23
split_func(1:10, 3)
# [1] 1 4 7 10
split_func(1:27, 6)
# [1] 1 5 11 17 23 27
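If a fixed number of points is acceptable instead of a fixed spacing, one alternative is to round an evenly spaced sequence of indices. This is a different technique from split_func above, shown only as a sketch; the name pick_evenly and its argument k (the number of points, assumed to be at least 2) are hypothetical.

```r
# Hypothetical helper: pick about k elements of x, evenly spread by position,
# always including the first and last element.
pick_evenly <- function(x, k) {
  stopifnot(k >= 2, length(x) >= 2)
  # evenly spaced (fractional) positions, rounded to valid indices;
  # unique() guards against duplicates when k is close to length(x)
  idx <- unique(round(seq(1, length(x), length.out = k)))
  x[idx]
}

pick_evenly(1:23, 7)
```

Because the indices are rounded, neighboring gaps can differ by one when the length does not divide evenly.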
I have a column:
Y = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
I would like to split into multiple columns, based on the positions of the column values. For instance, I would like:
Y1=c(1,2,3,4,5)
Y2=c(6,7,8,9,10)
Y3=c(11,12,13,14,15)
Y4=c(16,17,18,19,20)
Since I am working with a big data time series set, the divisions will be arbitrary depending on the length of one time period.
You can use the base split to split this vector into vectors that are each 5 items long. You could also use a variable to store this interval length.
Using rep with each = 5, and creating a sequence programmatically, gets you a sequence of the numbers 1, 2, ... up to the length divided by 5 (in this case, 4), each 5 times consecutively. Then split returns a list of vectors.
It's worth noting that a variety of SO posts will recommend you store similar data in lists such as this, rather than creating multiple variables, so I'm leaving it in list form here.
Y <- 1:20
breaks <- rep(1:(length(Y) / 5), each = 5)
split(Y, breaks)
#> $`1`
#> [1] 1 2 3 4 5
#>
#> $`2`
#> [1] 6 7 8 9 10
#>
#> $`3`
#> [1] 11 12 13 14 15
#>
#> $`4`
#> [1] 16 17 18 19 20
Created on 2019-02-12 by the reprex package (v0.2.1)
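The interval length mentioned above can be kept in a variable, and the grouping can be built so that it also tolerates a vector whose length is not a multiple of that interval. This is only a sketch; the variable name period and the 23-element input are assumptions made for the example.

```r
Y <- 1:23
period <- 5  # hypothetical name for the length of one time period

# ceiling(position / period) gives 1 for the first `period` items,
# 2 for the next `period` items, and so on; the last group may be shorter
groups <- ceiling(seq_along(Y) / period)
chunks <- split(Y, groups)

chunks[[1]]    # first full period
lengths(chunks)  # size of every chunk
```

Using seq_along() instead of rep(..., each = ) avoids the length mismatch that rep would produce when the last period is incomplete.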
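Not a dplyr solution, but I believe the easiest way would involve using matrices.
foo <- function(data, sep.in = 5) {
  # matrix() fills column by column, so nrow = sep.in puts sep.in
  # consecutive values into each column
  # (length(data) should be a multiple of sep.in)
  data.matrix <- matrix(data, nrow = sep.in)
  data.df <- as.data.frame(data.matrix)
  return(data.df)
}
I have not tested it, but this function should create a data.frame that can be merged with an existing one using cbind().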
We can use split to divide the vector into a list of vectors.
lst <- split(Y, as.integer(gl(length(Y), 5, length(Y))))
lst
#$`1`
#[1] 1 2 3 4 5
#$`2`
#[1] 6 7 8 9 10
#$`3`
#[1] 11 12 13 14 15
#$`4`
#[1] 16 17 18 19 20
Here, gl creates a grouping index from the n, k and length parameters, where n is an integer giving the number of levels, k is an integer giving the number of replications, and length is an integer giving the length of the result.
In our case, we want 'k' to be 5.
as.integer(gl(length(Y), 5, length(Y)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
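To make the roles of n, k and length concrete, here is a small sketch that computes the number of levels from the data instead of passing length(Y) as n. The variable names k, n_groups and grp are chosen for this example only.

```r
Y <- 1:20
k <- 5                               # values per group
n_groups <- ceiling(length(Y) / k)   # number of levels actually needed (4 here)

# n levels, each repeated k times, cut off at length(Y)
grp <- as.integer(gl(n_groups, k, length(Y)))
grp

lst <- split(Y, grp)
```

Passing a larger n, as in the call above, gives the same index because the length argument truncates the result.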
If we want to have multiple objects in the global environment, use list2env
list2env(setNames(lst, paste0("Y", seq_along(lst))), envir = .GlobalEnv)
Y1
#[1] 1 2 3 4 5
Y2
#[1] 6 7 8 9 10
Y3
#[1] 11 12 13 14 15
Y4
#[1] 16 17 18 19 20
Or as the OP mentioned dplyr/tidyr in the question, we can use those packages as well
library(tidyverse)
tibble(Y) %>%
group_by(grp = (row_number()-1) %/% 5 + 1) %>%
summarise(Y = list(Y)) %>%
pull(Y)
#[[1]]
#[1] 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 10
#[[3]]
#[1] 11 12 13 14 15
#[[4]]
#[1] 16 17 18 19 20
data
Y <- 1:20
I have an annual record of temperature. I need to select particular rows (days) along with the five rows before each of them, and then take the mean of each selected group. Here is my data frame, followed by the code I tried, which didn't work.
Day T.m
1 22
2 21
3 34
4 28
5 14
6 7
7 12
8 22
9 11
10 12
11 14
12 3
13 4
14 11
15 16
a <- c(8, 12,14)
apply(DF [c((a-5):a),2], 1, mean)
We can use mapply
mapply(function(x, y) mean(DF[[2]][x:y]), a-5, a)
#[1] 19.500000 12.333333 9.166667
Or a vectorized approach would be
tapply(DF[[2]][rep(a-5 , each = 6) + 0:5], rep(1:3, each = 6), FUN = mean)
# 1 2 3
#19.500000 12.333333 9.166667
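For a self-contained run, the data frame from the question can be rebuilt and the window length kept in a variable. This is a sketch under the assumption that every selected day has at least five days before it; the names before and window_means are made up here.

```r
# Data copied from the question
DF <- data.frame(
  Day = 1:15,
  T.m = c(22, 21, 34, 28, 14, 7, 12, 22, 11, 12, 14, 3, 4, 11, 16)
)

a <- c(8, 12, 14)  # selected days
before <- 5        # hypothetical name: how many preceding days to include

# For each selected day, average that day together with the `before` days
# preceding it (six values per window)
window_means <- sapply(a, function(d) mean(DF$T.m[(d - before):d]))
window_means
```

Each window covers six rows, the selected day plus the five before it, which matches the mapply and tapply results above.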
I have a data frame from which I want to drop the columns whose NA rate is above 70%, or in which a single dominant value takes up over 99% of the rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) {//some code}
apply(dataframe, 2, isNARateLt70)
Then how can I use the resulting vector to subset the data frame?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks to @MrFlick for the advice to switch from colSums()/nrow(); it is shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
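The question also asks about columns where one value dominates more than 99% of the rows. A possible sketch combines both conditions into a single logical vector. The extra column w and the names na_rate, dominant_rate and keep are invented for this example, and the dominant share is measured against all rows, including NA rows, which is an assumption about what the asker wants.

```r
d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
                z = c(rep(NA, 3), 1, 2), w = rep(7, 5))

# Share of NA values per column
na_rate <- colMeans(is.na(d))

# Share of rows taken by the most frequent non-NA value per column
dominant_rate <- sapply(d, function(col) {
  counts <- table(col)  # table() ignores NA by default
  if (length(counts) == 0) 0 else max(counts) / length(col)
})

# Keep columns that pass both checks
keep <- na_rate <= 0.7 & dominant_rate <= 0.99
d[keep]
```

Here x is dropped for being entirely NA and w for being constant, so only y and z remain.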
Maybe this will help too. The 2 argument in apply() means the function is applied column-wise to the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
This may not be the best title, feel free to edit it.
x=c(NA,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA....)
or
x=c(NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,0,NA,NA,1,NA,NA,NA,NA,0,NA,NA,NA,1,NA....)
y=c(seq(1:length(x)))
I would like z to be a new vector that is equal to y except when NAs are between 0 and 1 (not 1 and 0) where it should repeat the value taken when x=0
[1] 1 2 3 4 5 6 7 8 9 9 9 9 13 14 15 16 16 16 16 16 21 22 23 24 25 25
or
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25 26
depending on x
I really don't know how to translate this condition in R.
My solution is clunkier than @James's (now deleted) answer but maybe (?) it's more flexible:
## identify strings of NAs preceded by 0
library(zoo)
na_following_zero <- na.locf(c(1,x))[-1]==0 & is.na(x)
## now identify the 'chunks' to reset
## (there may be a more elegant way to do this)
rr <- rle(na_following_zero)
startvals <- cumsum(c(0,rr$lengths))+1
endvals <- cumsum(rr$lengths)
values <- c(NA,y[startvals-1])
z <- y
## replace values in chunks
for (i in seq_along(rr$values)[rr$values])
z[startvals[i]:endvals[i]] <- values[i]
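The same carry-forward idea can be sketched in base R without zoo and without a loop. This uses only the 26 values of the first x that are visible in the question (the elided tail is left out), and the names last_idx, last_val and hold are made up for the example.

```r
x <- c(NA, NA, NA, 1, NA, NA, NA, NA, 0, NA, NA, NA, 1,
       NA, NA, 0, NA, NA, NA, NA, 1, NA, NA, NA, 0, NA)
y <- seq_along(x)

# Position of the most recent non-NA value at each index (0 if none yet)
last_idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))

# Value of x at that position; the leading NA covers the "none yet" case
last_val <- c(NA, x)[last_idx + 1]

# Hold the value when the current x is NA and the last seen value was 0
hold <- is.na(x) & !is.na(last_val) & last_val == 0

z <- y
z[hold] <- y[last_idx[hold]]
z
```

On this input the result matches the first expected output listed in the question.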
If time isn't prohibitive, you can just use a "for" loop:
z <- y
between.0.1 <- rep(FALSE, length(x))
for(i in 2:length(x)){
if(!is.na(x[i-1]) && x[i-1]==0){ # switch on after a 0
between.0.1[i] <- TRUE
value.at.0 <- y[i-1]
z[i] <- value.at.0
}
if(between.0.1[i-1]){ # if switched on, stay switched on
between.0.1[i] <- TRUE
z[i] <- value.at.0
}
if(!is.na(x[i]) && x[i]==1){ # switch off if at a 1
between.0.1[i] <- FALSE
value.at.0 <- NA
z[i] <- y[i] # the position of the 1 keeps its own value
}
}
z[between.0.1] # diagnostic check
Another approach:
y0 <- which(x==0)
y1<-which(x==1)
# need a kicker to start with first zero
y1<-y1[y1>y0[1]]
# and check for end of sequence
if(y1[length(y1)] < length(x)) y1[length(y1)+1] <- length(x)+1
#now y0 and y1 better be same length
z<-y
#now do a loop any way you want
for (jj in 1: length(y0) ) z[y0[jj]:(y1[jj]-1)]<-y[y0[jj]]
Rgames> z
[1] 1 2 3 4 4 4 4 4 9 10 11 12 13 13 13 16 17 18 19 20 21 21 21 21 25
[26] 26