Re-bin a data frame in R

I have a data frame which holds activity (A) data across time (T) for a number of subjects (S) in different groups (G). The activity data were sampled every 10 minutes. What I would like to do is to re-bin the data into, say, 30-minute bins (either adding or averaging values), keeping the subject ID and group information.
Example. I have something like this:
S G T A
1 A 30 25
1 A 40 20
1 A 50 15
1 A 60 20
1 A 70 5
1 A 80 20
2 B 30 10
2 B 40 10
2 B 50 10
2 B 60 20
2 B 70 20
2 B 80 20
And I'd like something like this:
S G T A
1 A 40 20
1 A 70 15
2 B 40 10
2 B 70 20
Whether time is the average time (as in the example) or the first/last time point and whether the activity is averaged (again, as in the example) or summed is not important for now.
I would appreciate any help you can provide on this. I was thinking about writing a Python script to re-bin this particular data frame, but I suspect there is a way of doing it in R that could be applied to any data frame, regardless of the number of columns, etc.

There are several ways to arrive at the desired data frame.
I have reproduced your dataframe:
df <- data.frame(S = c(rep(1, 6), rep(2, 6)),
                 G = c(rep("A", 6), rep("B", 6)),
                 T = rep(seq(30, 80, 10), 2),
                 A = c(25, 20, 15, 20, 5, 20, 10, 10, 10, 20, 20, 20))
The classical way could be:
df[df$T == 40 | df$T == 70,]
The more modern tidyverse way is:
library(tidyverse)
df %>% filter(T == 40 | T == 70)
If you want the average of A within each group G after filtering for T == 40 and T == 70:
df %>% filter(T == 40 | T == 70) %>%
  group_by(G) %>%
  mutate(A = mean(A))
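Note that the snippets above only pick out the two time points from the example rather than re-binning arbitrary data. For the general case, here is a minimal sketch assuming dplyr and that all subjects share the same 10-minute time grid; swap mean(A) for sum(A) if you prefer adding over averaging:
library(dplyr)
bin_width <- 30  # assumed bin width in minutes
df %>%
  mutate(bin = (T - min(T)) %/% bin_width) %>%  # 0-based bin index
  group_by(S, G, bin) %>%
  summarise(T = mean(T), A = mean(A), .groups = "drop") %>%  # average time and activity per bin
  select(-bin)
For the example data this reproduces the desired output exactly (bins 30-50 and 60-80 become T = 40 and T = 70).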

Group column values by a set numeric difference in R (big dataset)

I want a script that puts column values into groups with a maximum range of 10, so that all values within a group differ by less than 10 from each other. The script should define the groups, count how many values are in each group, and find each group's mean.
So if I had this df:
cat1 = c(85, 60, 60, 55, 55, 15, 0, 35, 35)
cat2 = c("a","a","a","a","a","a","a","a","a")
df <- data.frame(cat1, cat2)
cat1 cat2
1 85 a
2 60 a
3 60 a
4 55 a
5 55 a
6 15 a
7 0 a
8 35 a
9 35 a
the output would be:
numValues cat1avg
1 85
4 57.5
1 15
1 0
2 35
I followed the top-rated answer here, but I'm getting weird outputs on some of my groups. Specifically, it doesn't seem like the script is properly counting the number of values in each group.
The only way I can think of to do it is with for loops and a million if statements, and I have 1,000+ of these little dfs that I need to summarise.
I was also thinking of doing a fuzzy count. But I haven't been able to find anything about that anywhere either.
I also can't just cut up the cat1 range into groups of 10 and just allocate them all into a bin because it's less about the level and more about how close they are to each other.
Thanks!
I'd suggest approaching this in the opposite order. Instead of assigning groups based on distance and seeing how many groups there are, we could specify a number of groups (k), ask R to pick the most distinct clusterings, and compare how well those clusterings fit our purpose. R has a built-in algorithm for this, kmeans.
Let's say we expect between 1 and 6 groups; unique(cat1) returns only 6 unique values, so it can't be more than that. We can then use map from purrr (part of the tidyverse) to run kmeans for each k in 1:6, and augment from broom to extract the output in a tidy way.
library(tidyverse); library(broom)
kclusts <- tibble(k = 1:6) %>%
  mutate(kclust = map(k, ~kmeans(cat1, .x)),
         augmented = map(kclust, augment, df))
This will create a nested table with the results we want inside the augmented column. Let's pull those out:
assignments <- kclusts %>%
  unnest(cols = c(augmented))
We could visualize these like so. Note that with k = 1, everything is in cluster 1. With k = 5, the 55s and 60s are grouped together; the trivial k = 6 case gives every unique value its own cluster (max_range of 0 below).
ggplot(assignments, aes(x = cat1, y = cat2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) +
  facet_wrap(~ k)
We can then check how much range each clustering produces and find the widest cluster for each k. Dividing into four groups leaves a cluster with a range of 15 (see the k = 4 facet in the chart above), but five groups keep every within-cluster range at 5 or less, satisfying the <10 requirement.
assignments %>%
  group_by(k, .cluster) %>%
  summarize(range = max(cat1) - min(cat1)) %>%
  summarize(max_range = max(range))
# A tibble: 6 × 2
k max_range
<int> <dbl>
1 1 85
2 2 35
3 3 30
4 4 15
5 5 5
6 6 0
And finally:
assignments %>%
  filter(k == 5) %>%
  group_by(.cluster) %>%
  summarize(numValues = n(),
            cat1avg = mean(cat1))
.cluster numValues cat1avg
<fct> <int> <dbl>
1 1 1 85
2 2 1 15
3 3 2 35
4 4 4 57.5
5 5 1 0
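Since the original requirement was that values within a group differ by less than 10, you could also pick k automatically: take the smallest k whose widest cluster stays under that tolerance. A sketch building on the assignments table above (assumes dplyr >= 1.0 for slice_min):
best_k <- assignments %>%
  group_by(k, .cluster) %>%
  summarize(range = max(cat1) - min(cat1), .groups = "drop_last") %>%
  summarize(max_range = max(range)) %>%  # one row per k
  filter(max_range < 10) %>%             # the question's tolerance
  slice_min(k, n = 1) %>%
  pull(k)
For this data best_k is 5, matching the choice above.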

In an R dataframe, identify/remove rows with at least two duplicate values

Suppose we have an R dataframe. How can we identify (and remove) any rows where some value occurs at least twice? After some searching, I still cannot find a solution on the web. A small code example illustrates what I am after:
> df <- data.frame(x = c(10, 20, 30, 50), y = c(30, 40, 40, 50),
z = c(40, 50, 10, 50), w = c(50, 40, 50, 50))
This gives the dataframe
> df
x y z w
1 10 30 40 50
2 20 40 50 40
3 30 40 10 50
4 50 50 50 50
So, df has duplicate values in row 2 and 4, and I want to remove those rows, to get the result:
> result
x y z w
1 10 30 40 50
3 30 40 10 50
For my application I can use a solution where one assumes there are just four columns, although of course a general solution would be better.
Here is one option in base R: loop over the rows with apply and check for duplicates with anyDuplicated (which returns the index of the first duplicate, or 0 if there are none), then negate with ! (so that 0 becomes TRUE and everything else FALSE) to subset the rows.
df[!apply(df, 1, anyDuplicated),]
Output:
x y z w
1 10 30 40 50
3 30 40 10 50
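If you prefer a tidyverse version, a rough equivalent of the same anyDuplicated test (a sketch assuming dplyr >= 1.0 for rowwise/c_across) would be:
library(dplyr)
df %>%
  rowwise() %>%
  filter(anyDuplicated(c_across(everything())) == 0) %>%  # keep only rows with no repeated value
  ungroup()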

How to create a new variable on condition of others in R

I have the following data frame:
ID Measurement A Measurement B Date of Measurements A and B Date of Measurement C
1 23 24 12 16
1 22 23 12 15
1 24 22 12 17
1 21 20 12 11
1 27 29 12 17
This is an example using one identifier (ID); in reality I have thousands.
I want to create a variable which encapsulates
"if this ID's Measurement A OR Measurement B is > xxx, before the date of Measurement C, ON MORE THAN TWO OCCASSIONS, then designate
them a 1 in a new column called new_var".
So far, I have removed all rows where the Date of Measurements A and B is after the Date of Measurement C:
measurements <- subset(measurements, dateofmeasurementsAandB < dateofmeasurementC)
and then added the cut-offs in an ifelse statement:
measurements$new_var<- ifelse(measurements$measurementA >= xxx | measurements$measurementB >= xxx, 1, 0)
But I can't factor in the 'on more than one occasion' bit (as you can see from the example, each ID has multiple rows/occasions).
Any help would be great, especially if it can be done simply!
If I understand what you're asking, I think I would use dplyr's count function:
# Starting from your dataframe
library(tidyverse)
df <- measurements %>%
  filter(dateofmeasurementsAandB < dateofmeasurementC,
         measurementA >= xxx | measurementB >= xxx)
This data frame should only contain the rows matching the conditions you're going for, so now we count them and filter the result:
df <- df %>% count(ID) %>% filter(n >= 2)
The vector df$ID should now only contain the IDs that have been measured more than once, which you can then feed back into your measurements data frame with ease. I'm partial to this:
measurements$new_var <- 0
measurements[measurements$ID %in% df$ID, ]$new_var <- 1
(Note the comma: without it, R would try to subset columns rather than rows.)
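For what it's worth, here is a sketch that does the same thing in a single grouped pipeline, so no second data frame is needed (xxx remains the question's placeholder cutoff):
library(dplyr)
measurements <- measurements %>%
  group_by(ID) %>%
  mutate(new_var = as.integer(
    # flag every row of an ID with at least two qualifying occasions
    sum(dateofmeasurementsAandB < dateofmeasurementC &
          (measurementA >= xxx | measurementB >= xxx)) >= 2
  )) %>%
  ungroup()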

Generate a dataframe of parameter values

I'm trying to generate a dataframe of parameter values for a sensitivity analysis where each row is a parameter space. I'd like to be able to automate the generation of the dataframe such that each parameter is varied by -10% and +10% whilst all the other values are kept the same (see the example of the desired df below). Does anyone know how I can do this? I feel like the answer is obvious, but I really can't see what it is!
Example of desired df:
a <- c(10,9,11,10,10,10,10,10,10)
b <- c(20,20,20,18,22,20,20,20,20)
c <- c(30,30,30,30,30,27,33,30,30)
d <- c(40,40,40,40,40,40,40,36,44)
parms <- data.frame(a,b,c,d)
I think the function expand.grid is what you are looking for.
a <- c(9,10,11)
b <- c(18,20,22)
c <- c(27,30,33)
d <- c(36,40,44)
test <- expand.grid(a,b,c,d)
To automate the first part (variation by 10% around the center value) you may use this approach:
library(magrittr)
vary_around_center <- function(center) {
  c(center * 0.9, center, center * 1.1)
}
c(10, 20, 30, 40) %>%
  lapply(vary_around_center) %>%
  expand.grid()
I think this will get you the "one parameter changing at a time" pattern you showed in your example.
library(tidyverse)  # for purrr's map functions and dplyr's bind_cols
params <- c(a = 10, b = 20, c = 30, d = 40)
builder_func <- function(params) {
  # one column per parameter, holding the center, -10% and +10% values
  # (map_dfc, not map_df: the three values per parameter are bound as a column)
  opts <- map_dfc(params, ~c(., . * 0.9, . * 1.1))
  # the same parameters held constant at their center values
  stocks <- map_dfc(params, ~rep(., 3))
  # for each parameter, combine its varied column with the others held constant
  map_df(names(opts),
         ~ bind_cols(
           opts[.],
           stocks[. != names(stocks)]
         )) %>%
    unique()
}
builder_func(params)
# A tibble: 9 x 4
a b c d
<dbl> <dbl> <dbl> <dbl>
1 10 20 30 40
2 9 20 30 40
3 11 20 30 40
4 10 18 30 40
5 10 22 30 40
6 10 20 27 40
7 10 20 33 40
8 10 20 30 36
9 10 20 30 44
Sorry, I missed that nuance the first time I read your question. Let me know if something isn't quite right.
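For reference, the same one-at-a-time pattern can also be sketched in base R without any packages (parms is the data frame name from the question):
params <- c(a = 10, b = 20, c = 30, d = 40)
perturbed <- lapply(seq_along(params), function(i) {
  down <- params; down[i] <- params[i] * 0.9  # -10% on parameter i only
  up <- params; up[i] <- params[i] * 1.1      # +10% on parameter i only
  rbind(down, up)
})
parms <- as.data.frame(rbind(params, do.call(rbind, perturbed)))
rownames(parms) <- NULL  # 9 rows: the baseline plus a -10%/+10% pair per parameter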

Sum of every n-th column of a data frame

Let's assume the data,
a <- c(10, 20, 30, 40, 50)
b <- c(100, 200, 300, 400, 500)
c <- c(1, 2, 3, 4, 5)
d <- c(5, 4, 3, 2, 1)
df <- data.frame(a, b, c, d)
df
a b c d
1 10 100 1 5
2 20 200 2 4
3 30 300 3 3
4 40 400 4 2
5 50 500 5 1
I want to sum alternate columns, i.e. a+c and b+d, and so on. The solution should be easy to apply or modify for other cases, like summing every second column (a+c, b+d, c+e, etc.). For the example above, the solution should look like this,
> dfsum
aplusc bplusd
1 11 105
2 22 204
3 33 303
4 44 402
5 55 501
Is there an easy way to do that? I have figured out how to do a pairwise sum, e.g. df[, c(T, F)] + df[, c(F, T)], but how do I sum every n-th column? Besides base R, is there a tidy solution for this problem?
Here is a more generic approach which, however, assumes that the number of columns in your data frame is divisible by n:
n = 2
Reduce(`+`, split.default(df, rep(seq(ncol(df) / n), each = n)))
# a b
#1 11 105
#2 22 204
#3 33 303
#4 44 402
#5 55 501
The above splits the dataframe into consecutive chunks of n = 2 columns, i.e. a and b, then c and d. Using Reduce, all first elements are added together, then all second elements, and so on. So for your case, a will be added to c, and b to d. If you want to take the sum every 3 columns, just set n to 3. However, note that the number of columns must be divisible by n.
One approach is to use mutate:
library(tidyverse)
df %>%
  mutate(aplusc = a + c,
         bplusd = b + d) %>%
  select(aplusc, bplusd)
#aplusc bplusd
#1 11 105
#2 22 204
#3 33 303
#4 44 402
#5 55 501
Edit
Here's an approach based on @Sotos's answer, so it could work on a larger dataset:
Reduce(`+`, split.default(df, (seq_along(df) - 1) %/% 2))
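To handle chunk sizes other than 2 with the same trick, the divisor simply becomes the chunk size (a sketch; assumes ncol(df) is divisible by n):
n <- 2  # chunk size: column i is summed with columns i + n, i + 2n, ...
Reduce(`+`, split.default(df, (seq_along(df) - 1) %/% n))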
