I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!
We can take the difference of the range (range returns the min and max as a vector) and divide it by the constant, i.e. 3, then pass the result as the breaks in cut. Then use the integer index (labels = FALSE) to get the corresponding letter from LETTERS, R's built-in constant.
library(dplyr)
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
group_by(grp = rleid(Condition)) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
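For comparison, here is a minimal base R sketch of the run-based grouping, assuming the goal is exactly 3 equal-width bins per consecutive run of Condition. When cut() is given a single number as breaks, it treats that number as the count of intervals, so 3 is passed directly here. The Time values are copied from the question's printout; the grp and bin names are only illustrative.

```r
# Data as printed in the question (Time values copied from the output above)
dataframe <- data.frame(
  Condition = rep(c(1, 2, 3), each = 5, times = 2),
  Time = c(1, 3, 4, 7, 9, 11, 12, 14, 16, 18, 19, 24, 25, 28, 30,
           31, 34, 35, 38, 39, 40, 42, 44, 47, 48, 49, 54, 55, 57, 59)
)

# Run id: a new group starts every time Condition changes
runs <- rle(dataframe$Condition)
dataframe$grp <- rep(seq_along(runs$lengths), runs$lengths)

# Within each run, split the Time range into 3 equal-width bins
bin <- ave(dataframe$Time, dataframe$grp,
           FUN = function(t) cut(t, breaks = 3, labels = FALSE))
dataframe$Trial <- LETTERS[bin]

head(dataframe, 10)
```

With this data the first two runs come out as A, A, B, C, C, which matches the desired output in the question.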
Here's a one-liner using my santoku package. The rleid line is the same as mentioned in @akrun's solution.
library(magrittr) # for %<>%
library(dplyr)
library(santoku)
dataframe %<>%
group_by(grp = data.table::rleid(Condition)) %>%
mutate(
Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
)
I have started using h2o for aggregating large datasets and I have found peculiar behaviour when trying to aggregate the maximum value using h2o's h2o.group_by function. My dataframe often has variables which comprise some or all NA's for a given grouping. Below is an example dataframe.
df <- data.frame("ID" = 1:16)
df$Group<- c(1,1,1,1,2,2,2,3,3,3,4,4,5,5,5,5)
df$VarA <- c(NA_real_,1,2,3,12,12,12,12,0,14,NA_real_,14,16,16,NA_real_,16)
df$VarB <- c(NA_real_,NA_real_,NA_real_,NA_real_,10,12,14,16,10,12,14,16,10,12,14,16)
df$VarD <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
ID Group VarA VarB VarD
1 1 1 NA NA 10
2 2 1 1 NA 12
3 3 1 2 NA 14
4 4 1 3 NA 16
5 5 2 12 10 10
6 6 2 12 12 12
7 7 2 12 14 14
8 8 3 12 16 16
9 9 3 0 10 10
10 10 3 14 12 12
11 11 4 NA 14 14
12 12 4 14 16 16
13 13 5 16 10 10
14 14 5 16 12 12
15 15 5 NA 14 14
16 16 5 16 16 16
In this dataframe Group == 1 is completely missing data for VarB (but this is important information to know, so the output for aggregating for the maximum should be NA), while for Group == 1 VarA only has one missing value so the maximum should be 3.
This link describes the behaviour of the na.methods argument: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/groupby.html
If I set the na.methods = 'all' as below then the aggregated output is NA for Group 1 for both Vars A and B (which is not what I want, but I completely understand this behaviour).
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "all"))
Group max_ID max_VarA max_VarB max_VarD
1 1 4 NaN NaN 16
2 2 7 12 14 14
3 3 10 14 16 16
4 4 12 NaN 16 16
5 5 16 NaN 16 16
If I set the na.methods = 'rm' as below then the aggregated output for Group 1 is 3 for VarA (which is the desired output and makes complete sense) but for VarB is -1.80e308 (which is not what I want, and I do not understand this behaviour).
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "rm"))
Group max_ID max_VarA max_VarB max_VarD
<int> <int> <int> <dbl> <int>
1 1 4 3 -1.80e308 16
2 2 7 12 1.4 e 1 14
3 3 10 14 1.6 e 1 16
4 4 12 14 1.6 e 1 16
5 5 16 16 1.6 e 1 16
Similarly, I get the same output if I set na.methods = 'ignore'.
h2o_agg <- h2o.group_by(data = df_h2o, by = 'Group', max(), gb.control = list(na.methods = "ignore"))
Group max_ID max_VarA max_VarB max_VarD
<int> <int> <int> <dbl> <int>
1 1 4 3 -1.80e308 16
2 2 7 12 1.4 e 1 14
3 3 10 14 1.6 e 1 16
4 4 12 14 1.6 e 1 16
5 5 16 16 1.6 e 1 16
I am not sure why something as common as completely missing data for a given variable within a specific group is being given a value of -1.80e308. I tried the same workflow in dplyr and got results which match my expectations (but this is not a solution, as I cannot process datasets of this size in dplyr, hence my need for a solution in h2o). I realise dplyr is giving me -Inf values rather than NA, and I can easily recode both -1.80e308 and -Inf to NA, but I am trying to make sure that this isn't a symptom of a larger problem in h2o (or that I am not doing something fundamentally wrong in my code when attempting to aggregate in h2o). I also have to aggregate normalised datasets which often have values close to -1.80e308, so I do not want to accidentally recode legitimate values to NA.
library(dplyr)
df %>%
group_by(Group) %>%
summarise(across(everything(), ~max(.x, na.rm = TRUE)))
Group ID VarA VarB VarD
<dbl> <int> <dbl> <dbl> <dbl>
1 1 4 3 -Inf 16
2 2 7 12 14 14
3 3 10 14 16 16
4 4 12 14 16 16
5 5 16 16 16 16
This is happening because H2O considers value -Double.MAX_VALUE to be the lowest possible representable floating-point number. This value corresponds to -1.80e308. I agree this is confusing and I would consider this to be a bug. You can file an issue in our bug tracker: https://h2oai.atlassian.net/ (PUBDEV project)
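To illustrate the point about the sentinel value, here is a small base R sketch (plain R, not the h2o API). It shows that the odd number is simply the most negative finite double, and that an all-NA check per group avoids having to recode by value at all. The safe_max helper is a hypothetical name, and only the Group and VarB columns from the question are reproduced.

```r
# The value reported by h2o is the most negative finite double
sentinel <- -.Machine$double.xmax
print(sentinel)  # -1.797693e+308

# Group and VarB columns from the question
df <- data.frame(
  Group = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5),
  VarB  = c(NA, NA, NA, NA, 10, 12, 14, 16, 10, 12, 14, 16, 10, 12, 14, 16)
)

# Hypothetical helper: NA when a group has no data, otherwise the max
safe_max <- function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)

max_VarB <- tapply(df$VarB, df$Group, safe_max)
max_VarB
```

Flagging the all-NA groups explicitly, rather than comparing against -1.80e308 afterwards, means legitimate extreme values cannot be recoded by accident.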
Not sure how to achieve that with h2o.group_by() – I get the same weird value when running your code. If you are open to a somewhat hacky workaround, you might want to try the following (I included the part on H2O initialization for future reference):
convert your frame to long format, i.e. key-value representation
split by group and apply aggregate function using h2o.ddply()
convert your frame back to wide format
## initialize h2o
library(h2o)
h2o.init(
nthreads = parallel::detectCores() * 0.5
)
df_h2o = as.h2o(
df
)
## aggregate per group
df_h2o |>
# convert to long format
h2o.melt(
id_vars = "Group"
, skipna = TRUE # does not include `NA` in the result
) |>
# calculate `max()` per group
h2o.ddply(
.variables = c("Group", "variable")
, FUN = function(df) {
max(df[, 3])
}
) |>
# convert back to wide format
h2o.pivot(
index = "Group"
, column = "variable"
, value = "ddply_C1"
)
# Group ID VarA VarB VarD
# 1 4 3 NaN 16
# 2 7 12 14 14
# 3 10 14 16 16
# 4 12 14 16 16
# 5 16 16 16 16
#
# [5 rows x 5 columns]
## shut down h2o instance
h2o.shutdown(
prompt = FALSE
)
I am trying to find the angle to the closest point from a line in multiple cases within and across time. I have a data set that looks something like this. Four points in group 1, four in group 2 and one in group 3.
x <- sample(1:50, 27)
y <- sample(1:50, 27)
group <- c(1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3,1,1,1,1,2,2,2,2,3)
id <- rep(seq(1,9,1), 3)
time <- rep(1:3, each = 9)
df <- as.data.frame(cbind(x, y, group, id, time))
x y group id time
1 25 36 1 1 1
2 49 35 1 2 1
3 41 27 1 3 1
4 28 47 1 4 1
5 7 3 2 5 1
6 46 25 2 6 1
7 15 7 2 7 1
8 32 15 2 8 1
9 38 29 3 9 1
10 19 4 1 1 2
11 18 14 1 2 2
12 8 37 1 3 2
13 29 8 1 4 2
14 6 1 2 5 2
15 30 6 2 6 2
16 10 19 2 7 2
17 45 49 2 8 2
18 40 43 3 9 2
19 17 48 1 1 3
20 27 21 1 2 3
21 26 20 1 3 3
22 33 50 1 4 3
23 16 16 2 5 3
24 23 46 2 6 3
25 21 26 2 7 3
26 13 31 2 8 3
27 11 41 3 9 3
The item in group 3 is used to identify which point is the base of all of the lines. In this example, for time 1, id 3 in group 1 is closest. This signals that a line should be drawn from it to all other points in group 1 (3-1, 3-2 and 3-4). I then need to identify which id in group 2 is closest to each of the 3 lines. For example, point 6 might be closest to the line 3-2, and from that I would calculate the angle at points 6-3-2. I need to calculate this for all other lines in this time, and then repeat it across all other times.
The following code identifies the base point for the lines (it is not optimal but I need the other data it calculates for other uses)
library(dplyr) # group_by, mutate, filter
library(purrr) # map
#### calculate distance between all points
distance = function(x1,x2,y1,y2) sqrt(((x2-x1)^2)+((y2-y1)^2)) #distance function
distance2 = function(x,y,.pred) distance(x, x[.pred], y, y[.pred]) #
distance3 = function(x, y, id){
dists = map(1:9, ~distance2(x,y, which(id == .x)))
}
#use distance formula
df2 <- df %>%
group_by(time) %>%
mutate(distances=distance3(x, y, id))
distances <- df2$distances # extract distance list
distances <- do.call(rbind.data.frame, distances) # change list to dataframe
colnames(distances) <- c(paste0("dist", 1:9)) # change column names
df <- cbind(df,distances) # merge dataframes
group3 <- df %>% filter(group == 3)
df <- df %>% filter(group == 1 | group == 2) #remove group 3 (id 9 from data as no longer needed)
#new columns with id and group as closest to position of id 9
df <- df %>% group_by(time) %>% mutate(closest = id[which.min(dist9)]) %>%
mutate(closest.group = group[which.min(dist9)]) %>% ungroup
This is about as far as I can get on my own. I have found the following formula on here, which I can use to calculate the distance from a point to a line in an individual case, but I have no idea how to integrate it across the multiple time periods and with conditions.
dist2d <- function(a,b,c) {
v1 <- b - c
v2 <- a - b
m <- cbind(v1,v2)
d <- abs(det(m))/sqrt(sum(v1*v1))
}
For clarification, the line only goes between the two points and does not extend to infinity.
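Since the line is a finite segment, one possible building block is a point-to-segment distance that clamps the projection to the segment's endpoints, plus a helper for the angle at a vertex. This is only a sketch under those assumptions; dist_to_segment and angle_at are hypothetical names, and points are passed as length-2 numeric vectors c(x, y).

```r
# Distance from point p to the finite segment a-b
dist_to_segment <- function(p, a, b) {
  ab <- b - a
  len2 <- sum(ab * ab)
  if (len2 == 0) return(sqrt(sum((p - a)^2)))  # a and b coincide
  # Position of the projection along a-b, clamped to [0, 1]
  t <- max(0, min(1, sum((p - a) * ab) / len2))
  proj <- a + t * ab
  sqrt(sum((p - proj)^2))
}

# Angle in degrees at vertex v between the rays v->p and v->q
angle_at <- function(p, v, q) {
  u <- p - v
  w <- q - v
  cosang <- sum(u * w) / sqrt(sum(u * u) * sum(w * w))
  acos(max(-1, min(1, cosang))) * 180 / pi  # clamp guards against rounding
}

# Time-1 points from the printed data: angle 6-3-2 with vertex at id 3
p6 <- c(46, 25); p3 <- c(41, 27); p2 <- c(49, 35)
dist_to_segment(p6, p3, p2)
angle_at(p6, p3, p2)
```

Within each time, one could then loop over the base point's segments, call dist_to_segment for every group 2 point, and take which.min to pick the closest id before computing the angle.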
I have a large dataset where I am trying to extract intervals (from the column Zone) where the Anom value is >1 for 5+ consecutive cells, and calculate the means of each interval. In the example below I would like to extract the information that Anom intervals include Zones = 5 to 11 and 17 to 26, but ignoring 28 to 29 (as the number of consecutive cells is <5). Any help is much appreciated.
df <- data.frame("Zone" = 1:30, "Anom" = 1:30)
df[,2] <- 0
df[5:11,2] <- 1
df[17:26,2] <- 1
df[28:29,2] <- 1
df
Zone Anom
1 1 0
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 0
13 13 0
14 14 0
15 15 0
16 16 0
17 17 1
18 18 1
19 19 1
20 20 1
21 21 1
22 22 1
23 23 1
24 24 1
25 25 1
26 26 1
27 27 0
28 28 1
29 29 1
30 30 0
The sort of output I would like to generate
Zone.From Zone.To Anom.Mean
5 11 1
17 26 1
One way, using dplyr and data.table's rleid, is to create a new group for each change in Anom. For each group, get the first and last value of Zone, the mean of Anom, the number of rows in it and the first value of Anom. We can then filter and keep only those groups that have at least 5 rows and where Anom is greater than 0.
library(dplyr)
df %>%
group_by(grp = data.table::rleid(Anom)) %>%
summarise(Zone.From = first(Zone),
Zone.To = last(Zone),
mean_anom = mean(Anom),
N = n(),
Anom = first(Anom)) %>%
filter(Anom > 0 & N >= 5) %>%
select(-c(grp, N, Anom))
# Zone.From Zone.To mean_anom
# <int> <int> <dbl>
#1 5 11 1
#2 17 26 1
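The same idea can be sketched in base R with rle(), in case data.table is not available. This assumes, as in the answer above, that an anomaly means Anom > 0; the runs, starts, ends and keep names are only illustrative.

```r
# Example data from the question
df <- data.frame(Zone = 1:30, Anom = 0)
df$Anom[c(5:11, 17:26, 28:29)] <- 1

# Run-length encode the anomaly flag
runs <- rle(df$Anom > 0)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep anomalous runs of at least 5 consecutive rows
keep <- runs$values & runs$lengths >= 5

intervals <- data.frame(
  Zone.From = df$Zone[starts[keep]],
  Zone.To   = df$Zone[ends[keep]],
  Anom.Mean = mapply(function(s, e) mean(df$Anom[s:e]),
                     starts[keep], ends[keep])
)
intervals
```

The short run at Zones 28 to 29 is dropped because its length is below 5.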
I have the following dataframe containing a variable "group" and a variable "number of elements per group"
group elements
1 3
2 1
3 14
4 10
.. ..
.. ..
30 5
Then I have a set of numbers going from 1 to (let's say) 30.
Summing "elements" gives 900. What I want is to randomly select a number from 1 to 30 and assign it to each group until I have filled the number of elements for that group. Each of the numbers should appear 30 times in total.
Thus, for group 1, I want to randomly select 3 numbers from 1 to 30,
for group 2, 1 number from 1 to 30, etc., until I have filled all of the groups.
the final table should look like this:
group number(randomly selected)
1 7
1 20
1 7
2 4
3 21
3 20
...
any suggestions on how I can achieve this?
In base R, if you have df like this...
df
group elements
1 3
2 1
3 14
Then you can do this...
data.frame(group = rep(df$group, #repeat group no...
df$elements), #elements times
number = unlist(sapply(df$elements, #for each elements...
sample.int, #...sample <elements> numbers
n=30, #from 1 to 30
replace = FALSE))) #without duplicates
group number
1 1 19
2 1 15
3 1 28
4 2 15
5 3 20
6 3 18
7 3 27
8 3 10
9 3 23
10 3 12
11 3 25
12 3 11
13 3 14
14 3 13
15 3 16
16 3 26
17 3 22
18 3 7
Give this a try:
df <- read.table(text = "group elements
1 3
2 1
3 14
4 10
30 5", header = TRUE)
# reproducibility
set.seed(1)
df_split2 <- do.call("rbind",
(lapply(split(df, df$group),
function(m) cbind(m,
`number(randomly selected)` =
sample(1:30, replace = TRUE,
size = m$elements),
row.names = NULL
))))
# remove the elements column
df_split2$elements <- NULL
head(df_split2)
#> group number(randomly selected)
#> 1.1 1 25
#> 1.2 1 4
#> 1.3 1 7
#> 2 2 1
#> 3.1 3 2
#> 3.2 3 29
The split function splits the df into chunks based on the group column. We then take those smaller data frames and add a column to each by sampling from 1:30 as many times as its elements value. We then use do.call on this list to rbind everything back together.
You have to generate a new dataframe that repeats each group elements times, and then, using sample, you can generate the exact number of random numbers:
data<-data.frame(group=c(1,2,3,4,5),
elements=c(2,5,2,1,3))
data.elements<-data.frame(group=rep(data$group,data$elements),
number=sample(1:30,sum(data$elements)))
The result:
group number
1 1 9
2 1 4
3 2 29
4 2 28
5 2 18
6 2 7
7 2 25
8 3 17
9 3 22
10 4 5
11 5 3
12 5 8
13 5 26
I solved it as follows:
random_sample <- rep(1:30, each=30)
random_sample <- sample(random_sample)
Then I created a data frame with this variable and a second variable holding each group, repeated once per element in that group.
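Put together, the approach above might look like the sketch below. The group sizes are hypothetical (the real ones are not shown in the question) and are chosen only so that they sum to 900, which is what lets every number from 1 to 30 appear exactly 30 times.

```r
set.seed(1)  # reproducibility

# Hypothetical group sizes: 30 groups whose elements sum to 900
groups <- data.frame(group = 1:30,
                     elements = c(rep(20, 15), rep(40, 15)))

# Pool with each number 1..30 repeated 30 times, then shuffled
pool <- sample(rep(1:30, each = 30))
stopifnot(length(pool) == sum(groups$elements))

# One row per element: group id alongside its randomly assigned number
result <- data.frame(group = rep(groups$group, groups$elements),
                     number = pool)
head(result)
```

Because the pool is shuffled once and then consumed in order, a number can repeat within a group, but the overall count of each number stays at 30.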
I'm using the aggregate function to calculate the difference between two variables for every observation, something like this (and then I want to save the result as a new variable):
data1
Group Points_Attempt1 Points_Attempt2
1 1 10 5
2 1 34 23
3 1 50 5
4 1 10 12
5 2 11 21
6 2 23 23
7 2 32 10
8 2 12 10
I'm able to do something like this:
aggregate(data1[c("Points_Attempt1","Points_Attempt2")],list(data1$Group),diff)
But I want it for every single observation, and I just do not know how to select the observations, i.e. the row numbers (here 1 to 8).
So I'm looking for the following fourth column (Difference), which I would then like to save as a new variable:
Group Points_Attempt1 Points_Attempt2 Difference
1 1 10 5 5
2 1 34 23 11
3 1 50 5 45
4 1 10 12 -2
5 2 11 21 -10
6 2 23 23 0
7 2 32 10 22
8 2 12 10 2
I would be highly thankful, if someone could help me with this.
We can use mutate_each
library(dplyr)
data1 %>%
group_by(Group) %>%
mutate_each(funs(c(NA, diff(.))), 2:3)
Or if we need to subtract between the variables,
data1 %>%
mutate(Difference = Points_Attempt1 - Points_Attempt2)
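The row-wise difference also works in base R without any packages, since subtraction is vectorised over columns. A minimal sketch, with data1 rebuilt from the values printed in the question:

```r
# Data from the question
data1 <- data.frame(
  Group = rep(1:2, each = 4),
  Points_Attempt1 = c(10, 34, 50, 10, 11, 23, 32, 12),
  Points_Attempt2 = c(5, 23, 5, 12, 21, 23, 10, 10)
)

# Element-wise difference per row, saved as a new column
data1$Difference <- data1$Points_Attempt1 - data1$Points_Attempt2
data1
```

No grouping is needed here because each difference only involves values from the same row.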