Group column values by a set numeric difference in R (big dataset)

I want a script that puts column values into groups with a maximum range of 10, so that all values in a group differ from each other by less than 10. The script should define the groups, count how many values are in each group, and find each group's mean.
So if I had this df:
cat1 = c(85, 60, 60, 55, 55, 15, 0, 35, 35)
cat2 = c("a", "a", "a", "a", "a", "a", "a", "a", "a")
df <- data.frame(cat1, cat2)
  cat1 cat2
1   85    a
2   60    a
3   60    a
4   55    a
5   55    a
6   15    a
7    0    a
8   35    a
9   35    a
the output would be:
numValues cat1avg
        1    85
        4    57.5
        1    15
        1     0
        2    35
I followed the top-rated answer here, but I'm getting weird outputs on some of my groups. Specifically, the script doesn't seem to be counting the number of values in each group correctly.
The only way I can think of to do it is through for loops with a million if statements, and I have 1,000+ of these little dfs that I need to summarise.
I was also thinking of doing a fuzzy count, but I haven't been able to find anything about that anywhere either. And I can't just cut the cat1 range into bins of width 10 and allocate values to them, because it's less about the absolute level and more about how close the values are to each other.
Thanks!

I'd suggest working at this from the opposite direction. Instead of assigning groups based on distance and seeing how many groups there are, we can specify a number of groups (k), ask R to pick the most distinct clusterings, and compare how well those clusterings fit our purpose. There is a built-in algorithm in R, kmeans, to do this for us.
Let's say we expect between 1 and 6 groups; there are only 6 unique values when I run unique(cat1), so it can't be more than that. We can then use map from purrr (in the tidyverse) to run kmeans for each k in 1:6, and use augment from broom to extract the output in a tidy way.
library(tidyverse); library(broom)

kclusts <- tibble(k = 1:6) %>%
  mutate(kclust = map(k, ~kmeans(cat1, .x)),
         augmented = map(kclust, augment, df))
This will create a nested table with the results we want inside the augmented column. Let's pull those out:
assignments <- kclusts %>%
  unnest(cols = c(augmented))
We could visualize these like so. Note that with k = 1, everything is in cluster 1, while with k = 5 the 55s and 60s are grouped together. I think the trivial k = 6 case (each unique value in its own cluster) is just left out.
ggplot(assignments, aes(x = cat1, y = cat2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) +
  facet_wrap(~ k)
We can then see how much range is in each cluster produced in each case, and find the widest-range cluster for each k. Dividing into four groups leaves a cluster with a range of at least 15 (see the k = 4 facet in the chart above), but five groups is enough to keep every within-cluster range at 5, comfortably under the required 10.
assignments %>%
  group_by(k, .cluster) %>%
  summarize(range = max(cat1) - min(cat1)) %>%
  summarize(max_range = max(range))
# A tibble: 6 × 2
      k max_range
  <int>     <dbl>
1     1        85
2     2        35
3     3        30
4     4        15
5     5         5
6     6         0
And finally:
assignments %>%
  filter(k == 5) %>%
  group_by(.cluster) %>%
  summarize(numValues = n(),
            cat1avg = mean(cat1))
  .cluster numValues cat1avg
  <fct>        <int>   <dbl>
1 1                1    85
2 2                1    15
3 3                2    35
4 4                4    57.5
5 5                1     0
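Since the original requirement was that values within a group differ by less than 10, you could also pick k programmatically instead of reading it off the table. A minimal sketch building on the assignments table above (best_k is my name, not part of the original answer):

best_k <- assignments %>%
  group_by(k, .cluster) %>%
  summarize(range = max(cat1) - min(cat1)) %>%
  summarize(max_range = max(range)) %>%
  filter(max_range < 10) %>% # keep only k values meeting the <10 requirement
  pull(k) %>%
  min()                      # smallest adequate number of clusters

best_k
# [1] 5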

Expand each row with specific value in tidyr

I have a dataset with grouped observations, one per row. I would like to expand each row from a single observation per replicate to a set number of observations (in this case, 20 observations each).
In the attached picture (not reproduced here), each replicate is a row. I would like to expand each row into 20, so "wellA" for "LS x SB" expands to 20 copies of the same line. As a bonus, I would also like to add a new column called "Replicate2" that lists 1 to 20 to index these 20 new rows per replicate.
The idea would then be to add the survival status per individual (reflected in the new columns "Status" and "Event").
I think the "expand" function in tidyr has potential but can't figure out how to just add a fixed number per replicate. Using the "Alive" column is adding a variable number of observations.
expand<-DF %>% expand(nesting(Date, Time, Cumulative.hrs, Timepoint, Treatment, Boat, Parentage, Well, Mom, Dad, Cone, NumParents, Parents), Alive)
Any help appreciated!
In base R, we can use rep to repeat the rows and transform to add the new column. With your data (DF) and n = 20:
n <- 20
transform(DF[rep(seq_len(nrow(DF)), each = n), ], Replicate2 = 1:n, row.names = NULL)
Using a reproducible example with n = 3:
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
n <- 3
transform(df[rep(seq_len(nrow(df)), each = n), ], Replicate2 = 1:n, row.names = NULL)
#  a b c Replicate2
#1 1 4 7          1
#2 1 4 7          2
#3 1 4 7          3
#4 2 5 8          1
#5 2 5 8          2
#6 2 5 8          3
#7 3 6 9          1
#8 3 6 9          2
#9 3 6 9          3
Using dplyr, we can use slice to repeat the rows and mutate to add the new column:
library(dplyr)

df %>%
  slice(rep(seq_len(n()), each = n)) %>%
  mutate(Replicate2 = rep(seq_len(n), length.out = n()))
Do a cross join between your existing data and the numbers 1:20.
tidyr::crossing(DF, replicate2 = 1:20)
If you want to add additional columns, use mutate:
... %>% mutate(status = 1, event = FALSE)
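Putting the pieces together on the small reproducible df from earlier (the column names Replicate2, status, and event are illustrative, not fixed by the question):

library(dplyr)
library(tidyr)

df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
crossing(df, Replicate2 = 1:3) %>%  # cross join: every row of df x 1:3
  mutate(status = 1, event = FALSE) # add the extra bookkeeping columns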

Re-bin a data frame in R

I have a data frame which holds activity (A) data across time (T) for a number of subjects (S) in different groups (G). The activity data were sampled every 10 minutes. What I would like to do is re-bin the data into, say, 30-minute bins (either summing or averaging the values), keeping the subject ID and group information.
Example. I have something like this:
S G T A
1 A 30 25
1 A 40 20
1 A 50 15
1 A 60 20
1 A 70 5
1 A 80 20
2 B 30 10
2 B 40 10
2 B 50 10
2 B 60 20
2 B 70 20
2 B 80 20
And I'd like something like this:
S G T A
1 A 40 20
1 A 70 15
2 B 40 10
2 B 70 20
Whether time is the average time (as in the example) or the first/last time point and whether the activity is averaged (again, as in the example) or summed is not important for now.
I will appreciate any help you can provide on this. I was thinking about writing a Python script to re-bin this particular dataframe, but I suspect there is a way of doing it in R that could be applied to any dataframe with differing numbers of columns, etc.
There are several ways to arrive at the desired dataframe.
I have reproduced your dataframe:
df <- data.frame(S = c(rep(1, 6), rep(2, 6)),
                 G = c(rep("A", 6), rep("B", 6)),
                 T = rep(seq(30, 80, 10), 2),
                 A = c(25, 20, 15, 20, 5, 20, 10, 10, 10, 20, 20, 20))
The classical base R way to pick out the two time points in your example (T == 40 and T == 70) could be:
df[df$T == 40 | df$T == 70, ]
The more modern tidyverse way is:
library(tidyverse)
df %>% filter(T == 40 | T == 70)
If you want the average of A for each group G across the filtered time points:
df %>%
  filter(T == 40 | T == 70) %>%
  group_by(G) %>%
  mutate(A = mean(A))
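The above keeps only two specific time points rather than re-binning every observation. For true 30-minute bins as described in the question, a sketch using integer division on T to define the bins (my own variation, assuming T is in minutes; swap mean for sum if you prefer to add):

df %>%
  mutate(bin = (T - min(T)) %/% 30) %>% # bin 0: T in 30-50, bin 1: T in 60-80
  group_by(S, G, bin) %>%
  summarise(T = mean(T), A = mean(A)) %>%
  ungroup() %>%
  select(-bin)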

(dplyr) Sum of N values most recent to a date

I'm trying to create a function that sums the closest n values to a given date. So if I had 5 weeks of data, and n=2, the value on week 1 would be the sum of weeks 2&3, the value on week 2 would be the sum of weeks 1&3, etc. Example:
library(dplyr)
library(data.table)
Week <- 1:5
Sales <- c(1, 3, 5, 7, 9)
frame <- data.table(Week, Sales)
The desired output (Recent is the column to be computed):
   Week Sales Recent
1:    1     1      8
2:    2     3      6
3:    3     5     10
4:    4     7     14
5:    5     9     12
I want to make a function that does this for me with an input for most recent n (not just 2), but for now I want to get 2 right. Here's my function using lag/lead:
RecentSum = function(Variable, Lags){
  Sum = 0
  for(i in 1:(Lags/2)){ # Lags/2 because I want half the values before and half after
    # Check to see if you can go backwards. If not, go forward (i.e. use lead).
    if(is.na(lag(Variable, i))){
      LoopSum = lead(Variable, i)
    } else {
      LoopSum = lag(Variable, i)
    }
    Sum = Sum + LoopSum
  }
  for(i in 1:(Lags/2)){
    # Check to see if you can go forward. If not, go backwards (i.e. use lag).
    if(is.na(lead(Variable, i))){
      LoopSum = lag(Variable, i)
    } else {
      LoopSum = lead(Variable, i)
    }
    Sum = Sum + LoopSum
  }
  Sum
}
When I do RecentSum(frame$Sales, 2) I get [1] 6 10 14 18 NA, which is wrong for a number of reasons:
My if statements only ever test the first element (week one), so lag is always NA and lead is always non-NA.
I need a way to know whether lag or lead was used the first time. The first value is 6 instead of 8 because the first for-loop sends it to lead(_, 1), but then the second for-loop does the same. I can't think of how I'd make my second for-loop recognize this.
Is there a function or library (Zoo?) that makes this task easy? I'd like to get my own function to work for the sake of practice/understanding, but at this point I'd rather just get it done.
Thanks!
To elaborate on my comment, lead and lag are vectorized functions meant to be used inside dplyr verbs such as mutate. Here is a way to do it within dplyr without writing your own function:
df <- tibble(week = Week, sales = Sales)

df %>%
  mutate(recent = case_when(
    is.na(lag(sales))  ~ lead(sales, n = 1) + lead(sales, n = 2),
    is.na(lead(sales)) ~ lag(sales, n = 1) + lag(sales, n = 2),
    TRUE               ~ lag(sales) + lead(sales)))
That gives you this:
# A tibble: 5 x 3
   week sales recent
  <int> <dbl>  <dbl>
1     1     1      8
2     2     3      6
3     3     5     10
4     4     7     14
5     5     9     12
1) Assuming that k is even, define to as a vector of indices such that for each element of to we sum the k+1 elements of Sales that end at that index, and from that sum subtract Sales:
k <- 2 # number of elements to sum
n <- nrow(frame)
to <- pmax(k + 1, pmin(1:n + k/2, n))
Sum <- function(to, Sales) sum(Sales[seq(to = to, length = k + 1)])

frame %>% mutate(recent = sapply(to, Sum, Sales) - Sales)
giving:
  Week Sales recent
1    1     1      8
2    2     3      6
3    3     5     10
4    4     7     14
5    5     9     12
Note that by replacing the last line of code above with the following line the solution can be done entirely in base R:
transform(frame, recent = sapply(to, Sum, Sales) - Sales)
2) This concatenates the appropriate elements before and after the Sales series so that an ordinary rolling sum gives the result.
library(zoo)

ix <- c(seq(to = k + 1, length = k/2), 1:n, seq(to = n - k, length = k/2))
frame %>% mutate(recent = rollsum(Sales[ix], k + 1) - Sales)
Note that for k = 2 this reduces to the following one-liner:
frame %>% mutate(recent = rollsum(Sales[c(3, 1:n(), n()-2)], 3) - Sales)
giving:
  Week Sales recent
1    1     1      8
2    2     3      6
3    3     5     10
4    4     7     14
5    5     9     12
Update: fixed for k > 2
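To package this as the reusable function the question asked for, approach (1) generalizes directly; a sketch (recent_sum is my own name; it assumes k is even and smaller than the number of rows):

recent_sum <- function(x, k) {
  n <- length(x)
  # index of the last element of each (k+1)-wide window, clamped to the ends
  to <- pmax(k + 1, pmin(seq_len(n) + k/2, n))
  # sum each window, then remove the focal value itself
  sapply(to, function(j) sum(x[seq(to = j, length = k + 1)])) - x
}

recent_sum(frame$Sales, 2)
# [1]  8  6 10 14 12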

How to get a data.frame with cases from a contingency table in r?

I would like to reproduce some calculations from a book (logit regression). The book gives a contingency table and the results.
Here is the table (reproduced in R below):
example <- matrix(c(21, 22, 6, 51), nrow = 2, byrow = TRUE)
# Labels:
rownames(example) <- c("Present", "Absent")
colnames(example) <- c(">= 55", "<55")
It gives me this:
        >= 55 <55
Present    21  22
Absent      6  51
But to use the glm() function, the data has to be arranged in the following way: two columns, one with "Age" and one with "Present", filled with 0/1.
age <- c(rep(0, 27), rep(1, 73))
present <- c(rep(0, 21), rep(1, 6), rep(0, 22), rep(1, 51))
data <- data.frame(present, age)
> data
present age
1 0 0
2 0 0
3 0 0
. . .
. . .
. . .
100 1 1
Is there a simple way to get this structure from the table/matrix?
reshape2::melt(example)
This will give you:
     Var1  Var2 value
1 Present >= 55    21
2  Absent >= 55     6
3 Present   <55    22
4  Absent   <55    51
which you can easily use for glm.
You could perhaps use the countsToCases function as defined here.
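For self-containment, the linked countsToCases function is commonly defined along these lines (a sketch of that definition, not code from this answer):

countsToCases <- function(x, countcol = "Freq") {
  # row indices, repeating each row of x by its count
  idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
  # drop the count column and expand the rows
  x[[countcol]] <- NULL
  x[idx, ]
}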
countsToCases(as.data.frame(as.table(example)))
# Var1 Var2
#1 Present >= 55
#1.1 Present >= 55
#1.2 Present >= 55
#1.3 Present >= 55
#1.4 Present >= 55
#1.5 Present >= 55
# ...
You can always recode the variables to numeric afterwards, if you prefer.
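For instance, a sketch of that recoding (the new column names are mine):

cases <- countsToCases(as.data.frame(as.table(example)))
cases$present <- as.numeric(cases$Var1 == "Present")
cases$age55 <- as.numeric(cases$Var2 == ">= 55")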
I would go for:
library(data.table)
tab <- data.table(AGED = c(1, 1, 0, 0),
                  CHD = c(1, 0, 1, 0),
                  Count = c(21, 6, 22, 51))
tabExp <- tab[rep(1:.N, Count), .(AGED, CHD)]
Edit: Quick explanation, as it took me some time to figure it out:
In data.table objects .N stores the number of rows of a group (if grouped with by) or just the number of rows of the whole data.table, so in this example:
tab[rep(1:.N, Count)]
and
tab[rep(1:4, Count)]
and finally
tab[rep(1:4, c(21, 6, 22, 51))]
are equivalent.
Same with base R:
tab2 <- data.frame(AGED = c(1, 1, 0, 0),
                   CHD = c(1, 0, 1, 0),
                   Count = c(21, 6, 22, 51))
tabExp2 <- tab2[rep(1:nrow(tab2), tab2$Count), c("AGED", "CHD")]
The code below might look long, but only the group_by() and do() instructions deal with expanding the data. All the rest is about reshaping the data to long format and encoding the character variables as 0 and 1. I tried to start from the exact matrix you gave in your question.
Load data manipulation packages
library(tidyr)
library(dplyr)
Create a data frame
Create a matrix as in your example, but avoid ">" signs in column names
example <- matrix(c(21, 22, 6, 51), nrow = 2, byrow = TRUE)
rownames(example) <- c("Present", "Absent")
colnames(example) <- c("above55", "below55")
Convert the matrix to a data frame
example <- data.frame(example) %>%
  add_rownames("chd")
Or simply create a data frame directly
data.frame(chd = c("Present", "Absent"),
           above55 = c(21, 6),
           below55 = c(22, 51))
Reshape data
data2 <- example %>%
  gather(age, nrow, -chd) %>%
  # Encode chd and age as 0 or 1
  mutate(chd = ifelse(chd == "Present", 1, 0),
         age = ifelse(age == "above55", 1, 0)) %>%
  group_by(chd, age) %>%
  # Expand each variable by nrow
  do(data.frame(chd = rep(.$chd, .$nrow),
                age = rep(.$age, .$nrow)))
head(data2)
# Source: local data frame [6 x 2]
# Groups: chd, age [1]
#
# chd age
# (dbl) (dbl)
# 1 0 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
tail(data2)
# Source: local data frame [6 x 2]
# Groups: chd, age [1]
#
# chd age
# (dbl) (dbl)
# 1 1 1
# 2 1 1
# 3 1 1
# 4 1 1
# 5 1 1
# 6 1 1
table(data2)
#    age
# chd  0  1
#   0 51  6
#   1 22 21
This is the same as your example, except for the age encoding issue mentioned in my comment above.
So, glm is not quite that inflexible. In part, ?glm reads:
For ‘binomial’ and ‘quasibinomial’ families the response can also
be specified as a ‘factor’ (when the first level denotes failure
and all others success) or as a two-column matrix with the columns
giving the numbers of successes and failures.
I'll assume you want to test the effect of age on Present/Absent.
The key is to specify the response as (in pseudo-code) c(successes, failures).
So you need data like data.frame(Age = ..., Present = ..., Absent = ...). The easiest way to get this from your example is to transpose, coerce to data.frame, and add an Age column:
example_t <- as.data.frame(t(example))
example_df <- data.frame(example_t, Age = factor(row.names(example_t)))
which gives you
      Present Absent   Age
>= 55      21      6 >= 55
<55        22     51   <55
Then, you can run the glm:
glm(cbind(Present, Absent) ~ Age, example_df, family = 'binomial')
to get
Call: glm(formula = cbind(Present, Absent) ~ Age, family = "binomial",
    data = example_df)

Coefficients:
(Intercept)       Age<55
      1.253       -2.094

Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
Null Deviance:      18.7
Residual Deviance: -1.332e-15   AIC: 11.99
Addendum
You could also get here via the answer by @therimalaya, but it's just the first step:
as.data.frame(as.table(example))
(only gets you part way there)
     Var1  Var2 Freq
1 Present >= 55   21
2  Absent >= 55    6
3 Present   <55   22
4  Absent   <55   51
but to actually have a column of successes and failures, you need to do something more. You could use tidyr to get there
as.data.frame(as.table(example)) %>% tidyr::spread(Var1, Freq)
is similar to my example_df above
   Var2 Present Absent
1 >= 55      21      6
2   <55      22     51
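From there, the same two-column glm call works on the spread table; a sketch (spread_df is my name, and Var2 plays the role of Age):

library(tidyr)
spread_df <- as.data.frame(as.table(example)) %>% spread(Var1, Freq)
glm(cbind(Present, Absent) ~ Var2, data = spread_df, family = binomial)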

Creating balanced groups based on three categorical variables

I'm creating a group assignment for a college class (~180 students) that I'm instructing. It's important that these groups be as heterogeneous as possible across three variables: field of study (FOS), sex, and division (i.e., newer/older students).
FOS has 5 levels, sex has 2, division has 2. Given the project, I'd like to create about 8-9 groups. In other words, I'd like groups of approximately 6 with a "good" balance of different fields of study, males/females, and new and older students. I'd then simply post the names with the automated assignments.
The instructor before did it all by hand, but I've tried playing around with R to see if there's a more systematic way of doing this, but only came up with repeated (and clunky) sorting. I expect the 5 FOS levels to vary in size, so I recognize that it will not be a perfect solution. Interested in people's clever solutions. Here's a reproducible sample:
dat <- data.frame(
  student = 1:180,
  gender = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.52, .48)),
                  labels = c("female", "male")),
  division = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.6, .4)),
                    labels = c("lower", "upper")),
  field = factor(sample(LETTERS[1:5], 180, replace = TRUE,
                        prob = c(.26, .21, .35, .07, .11)),
                 labels = c("humanities", "natural science",
                            "social science", "engineer", "other")))
This is what I was playing with, but it mostly increases the randomness of the assignment rather than the balance, as can be seen below:
library(dplyr)
dat$rand <- sample(1:180, 180)
dat1 <- arrange(dat, field, division, gender, rand)
dat1$grp <- 1:(nrow(dat1)/6) # issue if not divisible
Which does not result in adequate balance:
with(dat1, table(gender, grp)) #as a check
with(dat1, table(field, grp))
with(dat1, table(division, grp))
I know this is an old question, but I had a similar problem today and here's the solution I came up with. Basically, you assign groups randomly, then test for group differences on each variable, using a chi-squared test for categorical variables and an ANOVA for continuous ones. You set a minimum threshold for the p-values; the code reshuffles the groups until all p-values are above that threshold. If it goes through 10,000 iterations without reaching a grouping solution, it stops and suggests that you lower the threshold.
set.seed(905)

# Let's say you have a continuous variable you would also like to keep steady across groups
dat$age <- sample(18:35, nrow(dat), replace = TRUE)

dat$group <- rep_len(1:20, length.out = nrow(dat)) # if you wanted to make 20 groups
dat$group <- as.factor(dat$group)

a <- 0.1; b <- 0.1; c <- 0.1; d <- 0.1
thresh <- 0.85 # minimum threshold for p-values
z <- 1
while (a < thresh | b < thresh | c < thresh | d < thresh) {
  dat <- transform(dat, group = sample(group)) # shuffles the groups
  x <- summary(aov(age ~ group, dat))          # ANOVA for continuous variables
  a <- x[[1]]['group', 'Pr(>F)']
  x <- summary(table(dat$group, dat$gender))   # chi-squared for categorical variables
  b <- x[['p.value']]
  x <- summary(table(dat$group, dat$division))
  c <- x[['p.value']]
  x <- summary(table(dat$group, dat$field))
  d <- x[['p.value']]
  z <- z + 1
  if (z > 10000) {
    print('10,000 tries, no solution, reduce threshold')
    break
  }
}
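Once the loop exits, the checks from the question can be reused to confirm the balance across the new group column:

with(dat, table(gender, group))
with(dat, table(division, group))
with(dat, table(field, group))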
With enough datapoints per combination of the variables, you should be able to do this:
dat <- groupdata2::fold(dat, k = 8,
                        cat_col = c("gender", "division", "field"))
with(dat, table(gender, .folds))
##         .folds
## gender    1  2  3  4  5  6  7  8
##   female 11 12 11 12 12 11 12 12
##   male   10 11 11 11 11 11 11 11

with(dat, table(field, .folds))
##                  .folds
## field             1 2 3 4 5 6 7 8
##   humanities      5 8 9 7 9 6 6 5
##   natural science 2 3 4 6 3 9 2 4
##   social science  9 7 6 8 5 6 9 6
##   engineer        3 3 2 1 3 0 2 4
##   other           2 2 1 1 3 1 4 4

with(dat, table(division, .folds))
##          .folds
## division  1  2  3  4  5  6  7  8
##   lower  11 15 13 14 10 13 11 15
##   upper  10  8  9  9 13  9 12  8
