I have time series data in a data.table (let's say it has columns "date" and "y"), and I would like to cut the non-zero values of y into quartiles by date, so that each quartile gets a label 1-4 and the zero values get the label 0. If I just wanted to do this for all values of y, I would run:
dt <- dt %>%
  group_by(date) %>%
  mutate(quartile = cut(y, breaks = 4, labels = 1:4))
But I can't figure out how to get labels 0-4, with 0 allocated to the zero values of y and 1-4 being the quartiles of the non-zero values.
Edit: To clarify, what I want to do is the following: for each date, I would like to divide the values of y on that date into five groups: 1) y = 0, 2) bottom 25% of the non-zero y values (on that date), 3) second 25%, 4) third 25%, 5) top 25%.
Edit 2:
So I have found 2 more solutions for this:
dt[, quartile := cut(y, quantile(dt[y > 0]$y, probs = 0:4/4),
                     labels = 1:4), by = date]
and
dt %>%
  group_by(date) %>%
  mutate(quartile = findInterval(y, quantile(dt[y > 0]$y,
                                             probs = 0:4/4)))
But both of these seem to first calculate the break points over the entire data and then cut the data by date. I want the break points to be calculated per date, since the distribution of observations can differ between dates.
You can pass the output of quantile to the breaks argument of cut. By default, quantile will produce quartile breaks.
x <- rpois(100,4)
table(x)
x
0 1 2 3 4 5 6 7 8 9 10 12
1 7 17 19 17 18 12 5 1 1 1 1
cut(x, breaks = quantile(x), labels = 1:4)
[1] 2 2 2 1 2 1 1 2 3 3 1 4 1 4 1
[16] 2 4 2 4 2 3 1 4 1 2 2 1 1 2 2
[31] 1 2 2 3 4 1 4 2 2 1 2 4 4 3 1
[46] 3 1 1 3 3 2 4 2 2 1 2 2 4 1 1
[61] 1 2 2 4 4 3 3 2 1 1 3 2 3 2 3
[76] 2 4 2 <NA> 2 3 2 4 2 1 4 4 3 4 1
[91] 2 4 3 2 2 3 4 4 3 2
Levels: 1 2 3 4
Note that the minimum value is excluded by default, which is why the single 0 in x appears as <NA> above. If you compute your breaks including the zeros, the zeros will be NA's; you can use this to your advantage and handle them separately with is.na afterwards.
However, if you want to exclude the zeros before computing the breaks, you will need to reduce the minimum break value slightly so that all non-zero values get a label, e.g. quantile(x[x > 0]) - c(1e-10, rep(0, 4)). The zeros will again appear as NA's in this case.
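Putting this together with per-date grouping, a minimal dplyr sketch (assuming each date has enough distinct non-zero values for quantile to return unique breaks) might look like:

library(dplyr)

dt <- dt %>%
  group_by(date) %>%
  mutate(
    # quartile breaks from this date's non-zero values only, with the
    # lowest break nudged down so the minimum non-zero value gets a label
    quartile = cut(y,
                   breaks = quantile(y[y > 0], probs = 0:4/4) - c(1e-10, rep(0, 4)),
                   labels = 1:4),
    # zeros fall outside the breaks and come back as NA; relabel them 0
    quartile = ifelse(is.na(quartile), 0, as.integer(quartile))
  ) %>%
  ungroup()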
I'm admittedly not sure what you mean by "cutting the non-zero values of y into quartiles by date", and I'm afraid I don't have enough reputation to ask.
If 'date' is an actual date column, and you mean, "the new variable 'quartile' should indicate what part of the year y occurred in, assuming y isn't 0, in which case it should be 0", I'd do it like this:
library(dplyr)
library(lubridate)
# create example
dt <- data.frame(y = c(0, 1, 3, 4),
                 date = c("01-02-18", "01-06-18", "01-12-16", "01-04-17"))

dt <- dt %>%
  ## change 'date' to an actual date
  mutate(date = as_date(date)) %>%
  ## extract the quarter
  mutate(quartile = quarter(date)) %>%
  ## replace all quarters with 0 where y was 0
  mutate(quartile = if_else(y == 0, 0, as.double(quartile)))
EDIT: I think I understand the problem now. This is probably a little verbose, but I think it does what you want:
library(dplyr)
dt <- tibble(y = c(20, 30, 40, 20, 30, 40, 0),
             date = c("01-02-16", "01-02-16", "01-02-16",
                      "01-08-18", "01-08-18", "01-08-18", "01-08-18"))
new_dt <- dt %>%
  # keep only the cases where y is greater than 0
  filter(y > 0) %>%
  # group by date
  group_by(date) %>%
  # cut the y values per date
  mutate(quartile = cut(y, breaks = 4, labels = 1:4))
dt <- dt %>%
  # take the original dt, add in the newly calculated quartiles
  full_join(new_dt, by = c("y", "date")) %>%
  # replace the NAs with 0
  mutate(quartile = ifelse(is.na(quartile), 0, quartile))
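For the example tibble this should produce:

dt$quartile
## [1] 1 2 4 1 2 4 0

Note that cut(y, breaks = 4) splits each date's range of y into four equal-width bins rather than quantile-based quartiles.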
I have a dataframe with 4 columns.
set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
B = rpois(1000, lambda = 3),
C = round(rnorm(1000, mean = -1)),
D = round(rnorm(1000, mean = 0)))
I would like to compute the differences for every possible combination of my columns (A-B, A-C, A-D, B-C, B-D, C-D) at every row of my dataframe.
This would be the equivalent of doing df$A - df$B for every combination.
Can we use the dist() function to compute this efficiently as I have a very large dataset? I would like to then convert the dist object into a data.frame to plot the results with ggplot2.
Unless there is a good tidy version of doing the above.
Many Thanks
The closest I got was the below, but I am not sure what the column names refer to.
d <- apply(as.matrix(df), 1, function(e) as.vector(dist(e)))
t(d)
dist compares every value in a vector to every other value in the same vector and returns absolute distances, so if you are looking for signed differences between columns row by row, this is not what you are looking for.
If you just want to calculate the difference between all columns pairwise, you can do:
df <- cbind(df,
            do.call(cbind, lapply(asplit(combn(names(df), 2), 2), function(x) {
              setNames(data.frame(df[x[1]] - df[x[2]]), paste(x, collapse = ""))
            })))
head(df)
#> A B C D AB AC AD BC BD CD
#> 1 0 1 -2 -1 -1 2 1 3 2 -1
#> 2 1 1 -1 1 0 2 0 2 0 -2
#> 3 3 1 -2 -1 2 5 4 3 2 -1
#> 4 1 3 0 -1 -2 1 2 3 4 1
#> 5 1 3 0 1 -2 1 0 3 2 -1
#> 6 3 3 1 0 0 2 3 2 3 1
Using base R:
df_dist <- t(apply(df, 1, dist))
colnames(df_dist) <- apply(combn(names(df), 2), 2, paste0, collapse = "_")
If you really want to use a tidy approach, you could go with c_across, but this also removes the names and is much slower if your data is huge.
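For reference, a minimal c_across sketch (assuming dplyr >= 1.0 and tidyr; columns A:D are from the example df above, and the resulting difference columns are positional, d_1 to d_6, since the pair names are lost):

library(dplyr)
library(tidyr)

df %>%
  rowwise() %>%
  # dist() on the row values gives the absolute pairwise differences;
  # wrap in list() so mutate can store one vector per row
  mutate(d = list(as.vector(dist(c_across(A:D))))) %>%
  ungroup() %>%
  # spread the six distances into columns d_1 ... d_6
  unnest_wider(d, names_sep = "_")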
Suppose you have a data frame with columns "a" and "b", holding the values shown below and generated with df <- data.frame(a = c(0, 1, 2, 2, 3), b = c(1, 3, 8, 9, 4)). Suppose you want to add a column "c" such that, if a value in "a" equals the value in the immediately preceding row of "a", the corresponding row values of "b" are summed; otherwise a 0 is shown. A column "c" is added below to illustrate what I'm trying to do:
a b add col c
1 0 1 0
2 1 3 0
3 2 8 0
4 2 9 17 (since the values in col "a" rows 3 and 4 are equal, add the values in col b rows 3 and 4)
5 3 4 0
Or in this scenario, whereby cols "a" and "b" are generated by df <- data.frame(a=c(0,1,2,2,2,3), b=c(1,2,3,4,5,6)):
a b add col c
1 0 1 0
2 1 2 0
3 2 3 0
4 2 4 7 (3+4 from col "b")
5 2 5 9 (4+5 from col "b")
6 3 6 0 (since 2 from prior row <> 3 from current row)
What is the easiest way to do this in native R?
As we are interested in adjacent values being equal, use rleid (from data.table) to create a grouping index, then create 'c' by adding 'b' to the lag of 'b', replacing the default first value of lag (NA) with 0:
library(dplyr)
library(data.table)
library(tidyr)
df %>%
  group_by(grp = rleid(a)) %>%
  mutate(c = replace_na(b + lag(b), 0)) %>%
  ungroup %>%
  select(-grp)
-output
# A tibble: 6 × 3
a b c
<dbl> <dbl> <dbl>
1 0 1 0
2 1 2 0
3 2 3 0
4 2 4 7
5 2 5 9
6 3 6 0
Or using base R: a similar approach uses rle to create the 'grp', then ave to add the previous value to the current one (by dropping the first and last elements) and prepend a 0 at the beginning.
grp <- with(rle(df$a), rep(seq_along(values), lengths))
df$c <- with(df, ave(b, grp, FUN = function(x) c(0, x[-1] + x[-length(x)])))
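As a quick check, on the second example (df <- data.frame(a = c(0, 1, 2, 2, 2, 3), b = c(1, 2, 3, 4, 5, 6))) this gives:

df$c
## [1] 0 0 0 7 9 0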
I followed this example to do a rolling min (rollmin) in R, similar to rollmax from the zoo package.
But the first few values are filled with NA's. How can I fill the NA's with the original values so that I don't lose data points?
We may use coalesce with the original vector to replace each NA with the corresponding element of the original vector:
library(dplyr)
library(zoo)
coalesce(rollmeanr(x, 3, fill = NA), x)
If it is a data.frame:
ctd %>%
  group_by(station) %>%
  mutate(roll_mean_beam = coalesce(rollmeanr(beam_coef, k = 5, fill = NA),
                                   beam_coef))
data
x <- 1:10
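With this x, the rolling mean leaves the first two positions NA and coalesce restores the original values there:

coalesce(rollmeanr(x, 3, fill = NA), x)
## [1] 1 2 2 3 4 5 6 7 8 9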
1) Using the original values seems a bit bizarre. Taking the rolling minimum of 1:10 using a width of 3 would give
1 2 1 2 3 4 5 6 7 8
I think what you really want is to apply min to however many points are available so that in this example we get
1 1 1 2 3 4 5 6 7 8
Now rollapplyr with partial=TRUE will use whatever number of points are available if fewer than width=3 exist at that point. At the first point only one point is available so it returns min(x[1]). At the second only two points are available so it returns min(x[1:2]). For all the rest it can use three points. Only zoo is used.
library(zoo)
x <- 1:10
rollapplyr(x, 3, min, partial = TRUE)
## [1] 1 1 1 2 3 4 5 6 7 8
2) The above seems more logical than filling the first two points with the first two input values, but if you really want to do that anyway, simply prefix the series with the original values using c, or use one of the other alternatives shown below. Only zoo is used.
c(x[1:2], rollapplyr(x, 3, min))
## [1] 1 2 1 2 3 4 5 6 7 8
pmin(rollapplyr(x, 3, min, fill = max(x)), x)
## [1] 1 2 1 2 3 4 5 6 7 8
replace(rollapplyr(x, 3, min, fill = NA), 1:2, x[1:2])
## [1] 1 2 1 2 3 4 5 6 7 8
Min <- function(x) if (length(x) < 3) tail(x, 1) else min(x)
rollapplyr(x, 3, Min, partial = TRUE)
## [1] 1 2 1 2 3 4 5 6 7 8
I have a list of participants with 10 measurement occasions each. A Boolean value is assigned to each measurement occasion. I want to scan the 'Boolean' column four measurements at a time (1->4, 2->5, 3->6, etc.) and figure out in which window I have the maximum number of '1' values, with the result stored in the 'MaxWindow' column.
An example of what it should look like is shown in the output below.
Note: if the maximum can be observed in more than one window, I want to take the window furthest along in the scan.
Any help is much appreciated.
You can use rollsumr from zoo to get a rolling sum for a particular window size, select the last window attaining the maximum using which + max, and create a maxwindow column for each Participant.
library(dplyr)
win_size <- 4
df %>%
  group_by(Participant) %>%
  mutate(sum = zoo::rollsumr(Boolean, win_size, fill = NA),
         maxwindow = {i <- max(which(sum == max(sum, na.rm = TRUE)))
                      paste(i - win_size + 1, i, sep = '-')})
# Participant Measurmeant Boolean sum maxwindow
# <dbl> <int> <dbl> <dbl> <chr>
# 1 1 1 1 NA 5-8
# 2 1 2 1 NA 5-8
# 3 1 3 0 NA 5-8
# 4 1 4 0 2 5-8
# 5 1 5 1 2 5-8
# 6 1 6 1 2 5-8
# 7 1 7 1 3 5-8
# 8 1 8 0 3 5-8
# 9 1 9 0 2 5-8
#10 1 10 0 1 5-8
data
df <- data.frame(Participant = 1, Measurmeant = 1:10,
Boolean = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0))
Using R, I want to count the number of occurrences in two variables by two other variables, IDS and year. One of the counted variables needs to be counted by unique value.
I have really looked around for an answer to this but I cannot seem to find it.
I have a dataset like this (though including many more variables):
IDS = c(1,1,1,1,1,1,2,2)
year = c(1,1,1,1,1,2,1,1)
x = c(5, 5, 5, 10, 2, NA, 3, 3)
y = c(1, 2, 4, 0, NA, 2, 0, NA)
dfxy = data.frame(IDS, year, x, y)
dfxy
IDS year x y
1 1 1 5 1
2 1 1 5 2
3 1 1 5 4
4 1 1 10 0
5 1 1 2 NA
6 1 2 NA 2
7 2 1 3 0
8 2 1 3 NA
I want a count of the number of occurrences in the two columns x and y for each IDS and each year. The count in x needs to be of unique values of x.
I want an output like this:
IDS year x y
1 1 1 3 4
2 1 2 0 1
3 2 1 1 1
It is similar to the answer with cbind in
Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)
which for me would look like
aggregate(cbind(x, y)~IDS+year, data=dfxy, ???)
NA counts as no occurrence; any number counts as an occurrence in y; in x each unique value must be counted (as long as it is not NA). There are no rows with NA in both x and y.
I have tried using length instead of sum, but this only seems to count the number of rows, equally for both x and y.
Any ideas, or a link where I can find an answer to this?
Thanks
In aggregate, you need to specify the na.action parameter, as with the formula interface it defaults to na.omit, which will exclude most of your data:
aggregate(cbind(x, y) ~ IDS + year, dfxy,
FUN = function(x){sum(!is.na(x))}, na.action = na.pass)
## IDS year x y
## 1 1 1 5 4
## 2 2 1 2 1
## 3 1 2 0 1
For the new question, add unique:
aggregate(cbind(x, y) ~ IDS + year, dfxy,
          FUN = function(x){sum(!is.na(unique(x)))}, na.action = na.pass)
## IDS year x y
## 1 1 1 3 4
## 2 2 1 1 1
## 3 1 2 0 1
or
aggregate(cbind(x, y) ~ IDS + year, dfxy,
          FUN = function(x){length(unique(na.omit(x)))}, na.action = na.pass)
## IDS year x y
## 1 1 1 3 4
## 2 2 1 1 1
## 3 1 2 0 1
We can try with dplyr
library(dplyr)
dfxy %>%
  group_by(IDS, year) %>%
  summarise(across(c(x, y), ~ sum(!is.na(.x))), .groups = "drop")
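Note this counts all non-NA values in both columns. To also count x by unique value, matching the desired output, a sketch using n_distinct with its na.rm argument could be:

dfxy %>%
  group_by(IDS, year) %>%
  summarise(x = n_distinct(x, na.rm = TRUE),  # unique non-NA values of x
            y = sum(!is.na(y)),               # count of non-NA values of y
            .groups = "drop")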