Combine rows based on ranges in a column - r

I have a fairly large dataset with a column for time in seconds, and I want to combine rows whose times are close together (within roughly 0.1-0.2 seconds of each other) by averaging them.
Here is an example of how the data looks:
BPM seconds
63.9 61.899
63.9 61.902
63.8 61.910
62.1 130.94
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4
So I would want to combine the first 3 rows, then the 2 rows after that, and the rest would stand alone. That means I would want the data to look like this:
BPM seconds
63.9 61.904
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4

We need to create groups; this is the important bit, the rest is standard aggregation:
cumsum(!c(0, diff(df1$seconds)) < 0.2)
# [1] 0 0 0 1 1 2 3 4
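To see why this works, here is a step-by-step breakdown (my own, using the reproducible df1 shown at the end of this answer):
gaps <- c(0, diff(df1$seconds))  # gap to the previous row; 0 for the first row
# gaps are roughly: 0 0.003 0.008 69.03 0.01 80.64 68.91 9.9
new_group <- !(gaps < 0.2)       # TRUE whenever the gap is 0.2 s or more, i.e. a new group starts
cumsum(new_group)                # the running count of group starts becomes the group id
# [1] 0 0 0 1 1 2 3 4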
Then aggregate using aggregate:
aggregate(df1[, 2], list(cumsum(!c(0, diff(df1$seconds)) < 0.2)), mean)
# Group.1 x
# 1 0 61.90367
# 2 1 130.94500
# 3 2 211.59000
# 4 3 280.50000
# 5 4 290.40000
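If you would rather average BPM as well (the expected output above keeps the first BPM of each group instead), a small variant of the aggregate() call, not part of the original answer, could be:
grp <- cumsum(!c(0, diff(df1$seconds)) < 0.2)
# cbind(df1, grp) just adds the group id as a column; both columns are then averaged per group
aggregate(cbind(BPM, seconds) ~ grp, data = cbind(df1, grp), FUN = mean)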
Or use dplyr:
library(dplyr)
df1 %>%
  group_by(myGroup = cumsum(!c(0, diff(seconds)) < 0.2)) %>%
  summarise(BPM = first(BPM),
            seconds = mean(seconds))
# # A tibble: 5 x 3
# myGroup BPM seconds
# <int> <dbl> <dbl>
# 1 0 63.9 61.9
# 2 1 62.1 131.
# 3 2 61.8 212.
# 4 3 63.8 280.
# 5 4 60.3 290.
Reproducible example data:
df1 <- read.table(text = "BPM seconds
63.9 61.899
63.9 61.902
63.8 61.910
62.1 130.94
62.1 130.95
61.8 211.59
63.8 280.5
60.3 290.4", header = TRUE)

Related

Modify the number of classes of a column to quantile groups with R

I'm having difficulty splitting the values in one of my columns into 4 groups corresponding to quantile percentages. Can someone help me out?
I have listed my unsuccessful attempts below.
Attempt number 1:
data$Temperatura <- cut(data$Temperatura, breaks = c(96.3, 97.8, 98.7, 100,8),
                        labels = c(1,2,3,4))
Attempt number 2:
data$Temperatura = data.frame(1 = c(96.3, 97.8, 98.7, 100,8))
data$Temperatura <- cut(Temperatura, c(96.3, 97.8, 98.7, 100,8))
Attempt number 3:
sapply(data, class)
range(Temperatura)
quantile(data$Temperatura)
Thank you in advance!
Does this give you what you want?
# example data
Temperatura <- runif(30, 90, 110)
# cut by quantile
cTemperatura <- cut(Temperatura,
                    breaks = quantile(Temperatura),
                    labels = as.character(1:4),
                    include.lowest = TRUE)
# display
setNames(round(Temperatura,1), cTemperatura)
# 1 4 3 4 2 1 1 4 3 2 1 1 3 3
# 92.6 107.2 99.2 108.4 97.5 94.6 92.1 108.9 101.2 95.2 91.0 94.4 104.8 104.0
# 2 2 1 4 4 2 3 3 4 3 2 2 1 4
# 96.5 97.5 90.3 107.7 107.0 95.6 106.0 102.9 109.8 98.6 98.4 95.3 90.7 106.7
# 4 1
#108.5 93.5
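Applied to the column from the question, a sketch could look like this (assuming your data frame is called data, as in the attempts above, and that Temperatura has no missing values or tied quantiles):
# cut the column at its quartiles into groups labelled 1-4
data$TempGroup <- cut(data$Temperatura,
                      breaks = quantile(data$Temperatura, probs = seq(0, 1, 0.25)),
                      labels = 1:4,
                      include.lowest = TRUE)
table(data$TempGroup)  # should show four roughly equal-sized groups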

filter by observation that cumulate X% of values

I would like to keep, within every group, the observations (after sorting each group in decreasing order) that cumulate up to X% of the values, in my case less than or equal to 80 percent of the group total.
So from this dataframe below:
Group<-c("A","A","A","A","A","B","B","B","B","C","C","C","C","C","C")
value<-c(c(2,3,6,3,1,1,3,3,5,4,3,5,3,4,2))
data1<-data.frame(Group,value)
data1 <- data1 %>%
  arrange(Group, desc(value)) %>%
  group_by(Group) %>%
  mutate(pct = round(100 * value / sum(value), 1)) %>%
  mutate(cumPct = cumsum(pct))
I would like to get the filtered dataframe below, according to the conditions I described above:
Group value pct cumPct
1 A 6 40.0 40.0
2 A 3 20.0 60.0
3 A 3 20.0 80.0
4 B 5 41.7 41.7
5 B 3 25.0 66.7
6 C 5 23.8 23.8
7 C 4 19.0 42.8
8 C 4 19.0 61.8
9 C 3 14.3 76.1
You can arrange the data in descending order of value, calculate pct and cum_pct for each Group, and select rows where cum_pct is less than or equal to 80.
library(dplyr)
data1 %>%
  arrange(Group, desc(value)) %>%
  group_by(Group) %>%
  mutate(pct = value / sum(value) * 100,
         cum_pct = cumsum(pct)) %>%
  filter(cum_pct <= 80)
# Group value pct cum_pct
# <chr> <dbl> <dbl> <dbl>
#1 A 6 40 40
#2 A 3 20 60
#3 A 3 20 80
#4 B 5 41.7 41.7
#5 B 3 25 66.7
#6 C 5 23.8 23.8
#7 C 4 19.0 42.9
#8 C 4 19.0 61.9
#9 C 3 14.3 76.2
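For completeness, a base R sketch of the same logic (my own, assuming data1 as built in the question; it uses the same arithmetic as the dplyr version above):
d <- data1[order(data1$Group, -data1$value), ]  # sort by group, then by decreasing value
d$pct     <- ave(d$value, d$Group, FUN = function(x) x / sum(x) * 100)  # per-group percentage
d$cum_pct <- ave(d$pct, d$Group, FUN = cumsum)                          # per-group running total
d[d$cum_pct <= 80, ]                                                    # keep rows up to 80%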

dplyr mutate find max of n next values in column

Given the following tibble :
library(tidyverse)
set.seed(1)
my_tbl = tibble(x = rep(words[1:5], 50) %>% sort(),
                y = 1:250,
                z = sample(seq(from = 30, to = 90, by = 0.1), size = 250, replace = T))
I'm trying to create a new column
which will be populated with the max value of the next 3 values in column z.
For example:
for row 1, max_3_next should be 84.5 (from row 4)
for row 5, max_3_next should be 86.7 (from row 7)
Here is what I tried:
my_tbl %>%
  mutate(max_3_next = max(.$z[(y + 1):(y + 3)]))
# A tibble: 250 x 4
x y z max_3_next
<chr> <int> <dbl> <dbl>
1 a 1 45.9 84.5
2 a 2 52.3 84.5
3 a 3 64.4 84.5
4 a 4 84.5 84.5
5 a 5 42.1 84.5
6 a 6 83.9 84.5
7 a 7 86.7 84.5
8 a 8 69.7 84.5
9 a 9 67.8 84.5
10 a 10 33.7 84.5
# ... with 240 more rows
Warning messages:
1: In (y + 1):(y + 3) :
numerical expression has 250 elements: only the first used
2: In (y + 1):(y + 3) :
numerical expression has 250 elements: only the first used
I get the above warnings.
How can I change the code to achieve the desired result?
My preference is for a dplyr solution,
but I'll be happy to learn other solutions as well, since performance is an issue:
the original dataset may have ~1M rows.
Thanks,
Rafael
We can use rollmax from the zoo library with align = "left", so that the window starts at the current observation and covers it together with the following two observations:
library(zoo)
my_tbl %>%
  mutate(max_3_next = rollmax(z, 3, fill = NA, align = "left"))
# A tibble: 250 x 4
x y z max_3_next
<chr> <int> <dbl> <dbl>
1 a 1 45.9 64.4
2 a 2 52.3 84.5
3 a 3 64.4 84.5
4 a 4 84.5 84.5
5 a 5 42.1 86.7
6 a 6 83.9 86.7
7 a 7 86.7 86.7
8 a 8 69.7 69.7
9 a 9 67.8 67.8
10 a 10 33.7 42.3
# ... with 240 more rows
Sorry, I believe I misunderstood the OP. So here is the correct solution, I hope, inspired by Joshua Ulrich's answer to this question. I will keep the previous answer in case future readers need it.
my_tbl %>%
  mutate(max_3_next = rollapply(z, list(1:3), max, fill = NA, align = "left", partial = TRUE))
# A tibble: 250 x 4
x y z max_3_next
<chr> <int> <dbl> <dbl>
1 a 1 45.9 84.5
2 a 2 52.3 84.5
3 a 3 64.4 84.5
4 a 4 84.5 86.7
5 a 5 42.1 86.7
6 a 6 83.9 86.7
7 a 7 86.7 69.7
8 a 8 69.7 67.8
9 a 9 67.8 42.3
10 a 10 33.7 71.2
# ... with 240 more rows
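If you want to stay purely in dplyr (the OP's stated preference), one possible sketch, not taken from the answer above, is to take the pairwise maximum of three lead() columns:
library(dplyr)
my_tbl %>%
  mutate(max_3_next = pmax(lead(z, 1), lead(z, 2), lead(z, 3), na.rm = TRUE))
# na.rm = TRUE lets the second- and third-to-last rows use whatever values remain;
# the very last row has nothing following it and stays NA.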

How best to index for max values in data frame?

The dataset in use here is genotype from the CRAN package MASS.
> names(genotype)
[1] "Litter" "Mother" "Wt"
> str(genotype)
'data.frame': 61 obs. of 3 variables:
$ Litter: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 1 1 1 1 1 ...
$ Mother: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 2 2 2 3 3 ...
$ Wt : num 61.5 68.2 64 65 59.7 55 42 60.2 52.5 61.8 ...
This was the given question from a tutorial:
Exercise 6.7. Find the heaviest rats born to each mother in the genotype() data.
tapply, splitting by the factor genotype$Mother, gives:
> tapply(genotype$Wt, genotype$Mother, max)
A B I J
68.2 69.8 61.8 61.0
Also:
> out <- tapply(genotype$Wt, genotype[,1:2],max)
> out
Mother
Litter A B I J
A 68.2 60.2 61.8 61.0
B 60.3 64.7 59.0 51.3
I 68.0 69.8 61.3 54.5
J 59.0 59.5 61.4 54.0
The first tapply gives the heaviest rats from each mother, and the second (out) gives a table that lets me identify which type of litter of each mother was heaviest. Is there another way to find which Litter has the most weight for each mother, for instance when the two-dimensional table is really large?
We could use data.table: convert the 'data.frame' to a 'data.table' with setDT(genotype), create an index using which.max, and subset the rows of the dataset grouped by 'Mother'.
library(data.table) # v1.9.5+
setDT(genotype)[, .SD[which.max(Wt)], by = Mother]
# Mother Litter Wt
#1: A A 68.2
#2: B I 69.8
#3: I A 61.8
#4: J A 61.0
If we are only interested in the max of 'Wt' by 'Mother'
setDT(genotype)[, list(Wt=max(Wt)), by = Mother]
# Mother Wt
#1: A 68.2
#2: B 69.8
#3: I 61.8
#4: J 61.0
Based on the last tapply code shown by the OP, if we need similar output, we can use dcast from the development version of 'data.table':
dcast(setDT(genotype), Litter ~ Mother, value.var='Wt', max)
# Litter A B I J
#1: A 68.2 60.2 61.8 61.0
#2: B 60.3 64.7 59.0 51.3
#3: I 68.0 69.8 61.3 54.5
#4: J 59.0 59.5 61.4 54.0
data
library(MASS)
data(genotype)
From stats:
aggregate(. ~ Mother, data = genotype, max)
or
aggregate(Wt ~ Mother, data = genotype, max)
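To answer the indexing part of the question directly from the two-dimensional tapply table itself, a base R sketch (my own) would be:
out <- tapply(genotype$Wt, list(Litter = genotype$Litter, Mother = genotype$Mother), max)
# which.max down each Mother column picks the row (Litter) holding the maximum weight
data.frame(Mother = colnames(out),
           Litter = rownames(out)[apply(out, 2, which.max)],
           Wt     = apply(out, 2, max, na.rm = TRUE))
This matches the first data.table result above.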

Reshaping a data frame with more than one measure variable

I'm using a data frame similar to this one:
df <- data.frame(student = c(rep(1, 5), rep(2, 5)), month = c(1:5, 1:5),
                 quiz1p1 = seq(20, 20.9, 0.1), quiz1p2 = seq(30, 30.9, 0.1),
                 quiz2p1 = seq(80, 80.9, 0.1), quiz2p2 = seq(90, 90.9, 0.1))
print(df)
student month quiz1p1 quiz1p2 quiz2p1 quiz2p2
1 1 1 20.0 30.0 80.0 90.0
2 1 2 20.1 30.1 80.1 90.1
3 1 3 20.2 30.2 80.2 90.2
4 1 4 20.3 30.3 80.3 90.3
5 1 5 20.4 30.4 80.4 90.4
6 2 1 20.5 30.5 80.5 90.5
7 2 2 20.6 30.6 80.6 90.6
8 2 3 20.7 30.7 80.7 90.7
9 2 4 20.8 30.8 80.8 90.8
10 2 5 20.9 30.9 80.9 90.9
It describes grades received by students over five months, in two quizzes divided into two parts each.
I need to get the two quizzes into separate rows – so that each student in each month will have two rows, one for each quiz, and two columns – for each part of the quiz.
When I melt the table:
library(reshape2)
dfL <- melt(df, id.vars = c("student", "month"))
I get the two parts of the quiz in separate lines too, and
dcast(dfL, student + month ~ variable)
of course gets me right back where I started. I can't find a way to cast the table back into the required form.
Is there a way to make the melt command function something like:
melt.data.frame(df, measure.var1 = c("quiz1p1", "quiz2p1"),
                measure.var2 = c("quiz1p2", "quiz2p2"))
Here's how you could do this with reshape(), from base R:
df2 <- reshape(df, direction = "long",
               idvar = 1:2, varying = list(c(3, 5), c(4, 6)),
               v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
## Checking the output
rbind(head(df2, 3), tail(df2, 3))
# student month time p1 p2
# 1.1.quiz1 1 1 quiz1 20.0 30.0
# 1.2.quiz1 1 2 quiz1 20.1 30.1
# 1.3.quiz1 1 3 quiz1 20.2 30.2
# 2.3.quiz2 2 3 quiz2 80.7 90.7
# 2.4.quiz2 2 4 quiz2 80.8 90.8
# 2.5.quiz2 2 5 quiz2 80.9 90.9
You can also use column names (instead of column numbers) for idvar and varying. It's more verbose, but seems like better practice to me:
## The same operation as above, using just column *names*
df2 <- reshape(df, direction = "long", idvar = c("student", "month"),
               varying = list(c("quiz1p1", "quiz2p1"),
                              c("quiz1p2", "quiz2p2")),
               v.names = c("p1", "p2"), times = c("quiz1", "quiz2"))
I think this does what you want:
# Break variable into two columns, one for the quiz and one for the part of the quiz
dfL <- transform(dfL, quiz = substr(variable, 1, 5),
                 part = substr(variable, 6, 7))
# Adjust your dcast call:
dcast(dfL, student + month + quiz ~ part)
#-----
student month quiz p1 p2
1 1 1 quiz1 20.0 30.0
2 1 1 quiz2 80.0 90.0
3 1 2 quiz1 20.1 30.1
...
18 2 4 quiz2 80.8 90.8
19 2 5 quiz1 20.9 30.9
20 2 5 quiz2 80.9 90.9
There was a very similar question asked about half a year ago, in which I wrote the following function:
melt.wide = function(data, id.vars, new.names) {
  require(reshape2)
  require(stringr)
  data.melt = melt(data, id.vars = id.vars)
  new.vars = data.frame(do.call(
    rbind, str_extract_all(data.melt$variable, "[0-9]+")))
  names(new.vars) = new.names
  cbind(data.melt, new.vars)
}
You can use the function to "melt" your data as follows:
dfL <- melt.wide(df, id.vars = 1:2, new.names = c("Quiz", "Part"))
head(dfL)
# student month variable value Quiz Part
# 1 1 1 quiz1p1 20.0 1 1
# 2 1 2 quiz1p1 20.1 1 1
# 3 1 3 quiz1p1 20.2 1 1
# 4 1 4 quiz1p1 20.3 1 1
# 5 1 5 quiz1p1 20.4 1 1
# 6 2 1 quiz1p1 20.5 1 1
tail(dfL)
# student month variable value Quiz Part
# 35 1 5 quiz2p2 90.4 2 2
# 36 2 1 quiz2p2 90.5 2 2
# 37 2 2 quiz2p2 90.6 2 2
# 38 2 3 quiz2p2 90.7 2 2
# 39 2 4 quiz2p2 90.8 2 2
# 40 2 5 quiz2p2 90.9 2 2
Once the data are in this form, you can much more easily use dcast() to get whatever form you desire. For example
head(dcast(dfL, student + month + Quiz ~ Part))
# student month Quiz 1 2
# 1 1 1 1 20.0 30.0
# 2 1 1 2 80.0 90.0
# 3 1 2 1 20.1 30.1
# 4 1 2 2 80.1 90.1
# 5 1 3 1 20.2 30.2
# 6 1 3 2 80.2 90.2
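For readers on a current tidyverse, here is a hedged sketch of the same reshape with tidyr's pivot_longer() (not part of the original answers; it assumes df as defined in the question):
library(tidyr)
# split names like "quiz1p1" into a quiz id and a part id; ".value" turns p1/p2 into columns
pivot_longer(df,
             cols = starts_with("quiz"),
             names_pattern = "(quiz\\d)(p\\d)",
             names_to = c("quiz", ".value"))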
