Here is a part of my data:
dat <- read.table(text = "
Flower A1 A2 A3 TM MN B1 B2 B3
F1 12 9 11 12 0.56 19 1 12
F2 11 16 13 13 0.65 22 4 12
F3 10 12 14 11 0.44 29 9 12
", header=TRUE)
I want to add a column Max computed from column MN as max(MN, 1 - MN). For example, for the value 0.44, the result is max(0.44, 1 - 0.44) = 0.56.
I am struggling to do this with a data frame.
Here is the desired outcome:
Flower A TM B MN Max
F1 12 12 19 0.56 0.56
F2 11 13 22 0.65 0.65
F3 10 11 29 0.44 0.56
F1 9 12 1 0.56 0.56
F2 16 13 4 0.65 0.65
F3 12 11 9 0.44 0.56
F1 11 12 12 0.56 0.56
F2 13 13 12 0.65 0.65
F3 14 11 12 0.44 0.56
Try the code below:
transform(
  reshape(
    setNames(dat, gsub("(\\d+)", ".\\1", names(dat))),
    direction = "long",
    idvar = c("Flower", "TM", "MN"),
    varying = -c(1, 5, 6)
  ),
  Max = pmax(MN, 1 - MN)
)
which gives
Flower TM MN time A B Max
F1.12.0.56.1 F1 12 0.56 1 12 19 0.56
F2.13.0.65.1 F2 13 0.65 1 11 22 0.65
F3.11.0.44.1 F3 11 0.44 1 10 29 0.56
F1.12.0.56.2 F1 12 0.56 2 9 1 0.56
F2.13.0.65.2 F2 13 0.65 2 16 4 0.65
F3.11.0.44.2 F3 11 0.44 2 12 9 0.56
F1.12.0.56.3 F1 12 0.56 3 11 12 0.56
F2.13.0.65.3 F2 13 0.65 3 13 12 0.65
F3.11.0.44.3 F3 11 0.44 3 14 12 0.56
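The step that matters in the answer above is pmax() rather than max(). A minimal sketch of the difference, using only the three MN values from the question:

```r
MN <- c(0.56, 0.65, 0.44)

# max() collapses all of its arguments into one number
overall <- max(MN, 1 - MN)

# pmax() compares element by element and returns one value per row
rowwise <- pmax(MN, 1 - MN)

overall  # 0.65
rowwise  # 0.56 0.65 0.56
```

This is why max() inside transform() would fill the whole column with a single value, while pmax() gives the per-row result the question asks for.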
Using reshape and pmax.
reshape(dat, varying=list(2:4, 7:9), direction='long', idvar='Flower') |>
  transform(Max=pmax(MN, 1 - MN))
# Flower TM MN time A1 B1 Max
# F1.1 F1 12 0.56 1 12 19 0.56
# F2.1 F2 13 0.65 1 11 22 0.65
# F3.1 F3 11 0.44 1 10 29 0.56
# F1.2 F1 12 0.56 2 9 1 0.56
# F2.2 F2 13 0.65 2 16 4 0.65
# F3.2 F3 11 0.44 2 12 9 0.56
# F1.3 F1 12 0.56 3 11 12 0.56
# F2.3 F2 13 0.65 3 13 12 0.65
# F3.3 F3 11 0.44 3 14 12 0.56
Note: the native pipe |> requires R >= 4.1.
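For comparison, here is a sketch of the same reshape with tidyr::pivot_longer(). Neither answer uses it, and it assumes the tidyr package is installed. The regular expressions are my own choice for picking out the A1..A3 and B1..B3 columns.

```r
library(tidyr)

dat <- read.table(text = "
Flower A1 A2 A3 TM MN B1 B2 B3
F1 12 9 11 12 0.56 19 1 12
F2 11 16 13 13 0.65 22 4 12
F3 10 12 14 11 0.44 29 9 12
", header = TRUE)

# ".value" tells pivot_longer to keep the letter (A or B) as a column name
# and to move the digit into a new "time" column
long <- pivot_longer(
  dat,
  cols = matches("^[AB][0-9]$"),
  names_to = c(".value", "time"),
  names_pattern = "([AB])([0-9])"
)

# Element-wise maximum of MN and 1 - MN
long$Max <- pmax(long$MN, 1 - long$MN)
long
```

The rows come out grouped by Flower rather than by time, so sort by time if the exact row order of the desired output matters.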
library(psych)
library(mokken)
library(magrittr)  # provides %>% and %T>%
bfi[1:3] %>%
na.omit() %>%
mokken::check.monotonicity() %T>%
summary %>%
{.$Hi[.$Hi<0]}
A1
-0.3873723
The script above works well. I get the final output, but I would also like to review the output of summary.
How can I get the summary output from this pipe as well?
If we want the summary as well, place it in a list
library(psych)
library(mokken)
library(magrittr)
out <- bfi[1:3] %>%
na.omit() %>%
mokken::check.monotonicity() %>%
{list(summary(.), .$Hi[.$Hi < 0])}
out
#[[1]]
# ItemH #ac #vi #vi/#ac maxvi sum sum/#ac zmax #zsig crit
#A1 -0.39 75 54 0.72 0.52 9.79 0.1305 16.75 51 550
#A2 0.06 50 8 0.16 0.14 0.63 0.0126 4.76 7 128
#A3 0.09 30 6 0.20 0.12 0.45 0.0149 4.63 6 134
#[[2]]
# A1
#-0.3873723
You can use %T>% with print() to display the result of summary() without returning it.
bfi[1:3] %>%
na.omit() %>%
mokken::check.monotonicity() %T>%
{print(summary(.))} %>%
{.$Hi[.$Hi<0]}
# ItemH #ac #vi #vi/#ac maxvi sum sum/#ac zmax #zsig crit
# A1 -0.39 75 54 0.72 0.52 9.79 0.1305 16.75 51 550
# A2 0.06 50 8 0.16 0.14 0.63 0.0126 4.76 7 128
# A3 0.09 30 6 0.20 0.12 0.45 0.0149 4.63 6 134
#
# A1
# -0.3873723
If you assign the pipeline to a variable, only the final value is stored, not the result of summary().
out <- ...
out
# A1
# -0.3873723
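To see the tee pipe in isolation, here is a toy sketch that does not need psych or mokken. The vector x is made up for illustration; only magrittr is assumed to be installed.

```r
library(magrittr)

x <- c(2, 4, 6)

# %T>% runs the right-hand side for its side effect (printing the summary)
# and then passes the ORIGINAL left-hand side on to the next step
total <- x %T>%
  {print(summary(.))} %>%
  sum()

total  # 12
```

The summary is printed to the console, but total only holds the value of the last step, which matches the behaviour described above.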
I have a data frame like this:
For each gill, I would like to find the maximum time for which the diameter is different from 0. I have tried the aggregate function and the dplyr package, but this did not work. A combination of for, if and aggregate would probably work, but I could not figure out how to do it.
I'm not sure of the best way to approach this. I'd appreciate any help.
After grouping by 'Gill', subset the 'Time' where 'Diametre' is not 0 and get the max (assuming 'Time' is numeric class)
library(dplyr)
df1 %>%
group_by(Gill) %>%
summarise(Time = max(Time[Diametre != 0]))
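One edge case to be aware of: if a gill has no non-zero diameter at all, max() is called on an empty vector and returns -Inf with a warning. A possible guard is sketched below. The toy df1 is hypothetical because the original data was only posted as an image, and returning NA is my assumption about what is wanted in that case.

```r
library(dplyr)

# Hypothetical toy data; column names follow the answer above
df1 <- data.frame(
  Gill     = c(1, 1, 2, 2, 3, 3),
  Time     = c(0.16, 0.32, 0.16, 0.32, 0.16, 0.32),
  Diametre = c(0, 0, 1.0, 0, 73.36, 80.08)
)

res <- df1 %>%
  group_by(Gill) %>%
  summarise(
    # NA when the gill never has a non-zero diameter, otherwise the max time
    Time = if (any(Diametre != 0)) max(Time[Diametre != 0]) else NA_real_
  )
res
```

Here gill 1 gets NA, gill 2 gets 0.16 and gill 3 gets 0.32.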
Here is how you can use aggregate:
> df<- data.frame(
Gill = rep(1:11, each = 2),
diameter = c(0,0,1,0,0,0,73.36, 80.08,1,25.2,53.48,61.21,28.8,28.66,71.2,80.25,44.55,53.50,60.91,0,11,74.22),
time = 0.16
)
> df
Gill diameter time
1 1 0.00 0.16
2 1 0.00 0.16
3 2 1.00 0.16
4 2 0.00 0.16
5 3 0.00 0.16
6 3 0.00 0.16
7 4 73.36 0.16
8 4 80.08 0.16
9 5 1.00 0.16
10 5 25.20 0.16
11 6 53.48 0.16
12 6 61.21 0.16
13 7 28.80 0.16
14 7 28.66 0.16
15 8 71.20 0.16
16 8 80.25 0.16
17 9 44.55 0.16
18 9 53.50 0.16
19 10 60.91 0.16
20 10 0.00 0.16
21 11 11.00 0.16
22 11 74.22 0.16
> # Remove diameter == 0 before aggregate
> dfnew <- df[df$diameter != 0, ]
> aggregate(dfnew$time, list(dfnew$Gill), max )
Group.1 x
1 2 0.16
2 4 0.16
3 5 0.16
4 6 0.16
5 7 0.16
6 8 0.16
7 9 0.16
8 10 0.16
9 11 0.16
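The same aggregate call can also be written with the formula interface, which keeps the original column names in the result. A small sketch with made-up data follows. I vary the time values here so that the maximum is visible; the example above uses a constant 0.16.

```r
# Hypothetical toy data with varying times
df <- data.frame(
  Gill     = rep(1:3, each = 2),
  diameter = c(0, 0, 1, 0, 73.36, 80.08),
  time     = c(0.16, 0.32, 0.16, 0.32, 0.16, 0.32)
)

# Drop the zero-diameter rows first, then take the max time per Gill
res <- aggregate(time ~ Gill, data = df[df$diameter != 0, ], FUN = max)
res
```

Gill 1 has only zero diameters, so it is dropped from the result, just as Gills 1 and 3 are in the output above.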
Here is a different approach from the elegant solution that akrun suggested. It creates the MaxTime column shown in your image.
library(dplyr)
# Split df1 into a list of data frames, one per gill.
list.df <- split(df1, df1$Gill)
Then use lapply to find the maximum Time for each Gill and store that value in a new column called MaxTime. The result has to be assigned back to list.df, because lapply returns a new list and leaves the original untouched.
list.df <- lapply(list.df, function(x) mutate(x, MaxTime = max(x$Time[x$Diametre != 0])))
Finally, combine the split data frames back together using bind_rows().
df1 <- bind_rows(list.df)
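If dplyr is loaded anyway, the split / lapply / bind_rows sequence can probably be collapsed into a grouped mutate(). This is a sketch on hypothetical toy data, since the original data frame was only shown as an image, and it assumes every gill has at least one non-zero diameter.

```r
library(dplyr)

# Hypothetical toy data; column names follow the answers above
df1 <- data.frame(
  Gill     = c(1, 1, 2, 2),
  Time     = c(0.16, 0.32, 0.16, 0.32),
  Diametre = c(1.0, 0, 73.36, 80.08)
)

# mutate() on a grouped data frame computes the max within each Gill
# and recycles it to every row of that group
df1 <- df1 %>%
  group_by(Gill) %>%
  mutate(MaxTime = max(Time[Diametre != 0])) %>%
  ungroup()
df1
```

Gill 1 gets MaxTime 0.16 on both rows and gill 2 gets 0.32.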
I have two large dataframes, one is called Dates_only and the other Values
**Dates_only:**
ID Quart_y Quart
1 1118 2017Q3 0.25
2 1118 2017Q4 0.50
3 1118 2018Q1 0.75
4 1118 2018Q2 1.00
5 1118 2018Q3 1.25
6 1118 2018Q4 1.50
7 1118 2019Q1 1.75
8 1118 2019Q2 2.00
9 1119 2017Q3 0.25
10 1119 2017Q4 0.50
11 1119 2018Q1 0.75
12 1119 2018Q2 1.00
13 1119 2018Q3 1.25
14 1119 2018Q4 1.50
15 1119 2019Q1 1.75
16 1119 2019Q2 2.00
17 13PP 2017Q3 0.25
18 13PP 2017Q4 0.50
19 13PP 2018Q1 0.75
20 13PP 2018Q2 1.00
21 13PP 2018Q3 1.25
22 13PP 2018Q4 1.50
23 13PP 2019Q1 1.75
24 13PP 2019Q2 2.00
And the second dataset:
**Values**
ID Day Value
1 1118 0 7.6
2 1119 0 6.2
3 13PP 0 6.8
4 1118 0.14 7.1
5 1119 0.13 6.2
6 13PP 0.13 5.9
7 1118 0.20 6.8
8 1119 0.23 5.8
9 13PP 0.24 4.6
10 1118 0.27 6.5
11 1119 0.28 5.4
12 13PP 0.32 4.2
13 1118 0.32 6.3
14 1119 0.32 4.8
15 13PP 0.44 4.0
16 1118 0.47 6.0
17 1119 0.49 4.3
18 13PP 0.49 3.8
19 1118 0.59 5.9
20 1119 0.64 4.0
21 13PP 0.61 3.6
22 1118 0.72 5.6
23 1119 0.71 3.8
24 13PP 0.73 3.4
25 1118 0.95 5.4
26 1119 0.86 3.2
27 13PP 0.78 3.0
28 1118 1.10 5.0
29 1119 0.93 2.9
30 13PP 1.15 2.9
What I want to do is create a fourth column in Dates_only called Value_average, containing averages of the Value column from the Values data frame.
Specifically, as you can see in Dates_only, Quart_y represents the quarter and year, and Quart expresses this as a number from 0.25 to 2.
So the pattern goes like this: Q3 - x.25, Q4 - x.50, Q1 - x.75, Q2 - x.00.
In the second data frame, Values, the Day column holds scores that represent days of the year. The idea is that days with 0<Day<=0.25 belong to 2017Q3, days with 0.25<Day<=0.50 belong to 2017Q4, and days with 1.00<Day<=1.25 belong to 2018Q3.
For each ID in the Dates_only data frame, I want to find the average of the Values$Value numbers that belong to the appropriate time frame:
For ID=1118 and 2017Q3, the Values$Day elements in the range 0<Day<=0.25 are (0, 0.14, 0.20) and the corresponding Values$Value are (7.6, 7.1, 6.8), so Dates_only$Value_average is going to be 7.16. The next row averages the values for days with 0.25<Day<=0.50, and so on.
**Dates_only:**
ID Quart_y Quart Value_average
1 1118 2017Q3 0.25 7.16
2 1118 2017Q4 0.50 6.27
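As a quick check of the two averages in the table above, using only the numbers from the Values data for ID 1118:

```r
# Days 0, 0.14 and 0.20 fall in the first quarter for ID 1118
q1 <- mean(c(7.6, 7.1, 6.8))

# Days 0.27, 0.32 and 0.47 fall in the second quarter
q2 <- mean(c(6.5, 6.3, 6.0))

round(q1, 4)  # 7.1667 (shown truncated as 7.16 in the table)
round(q2, 4)  # 6.2667
```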
The code that I have used is:
Dates_only$Value_average <- 0
for (i in 1:length(Dates_only$ID)){
  id <- as.character(Dates_only$ID[i])
  quart <- as.numeric(Dates_only$Quart[i])
  quart_prev <- quart-0.25
  count_d <- 0
  sum_val <- 0
  for (k in 1:length(Values$ID)){
    if (id==as.character(Values$ID[k])
        && quart>=as.numeric(Values$Day[k])
        && as.numeric(Values$Day[k])>quart_prev){
      sum_val <- as.numeric(Values$Value[k]) + sum_val
      count_d <- count_d + 1
    }
  }
  av_value <- sum_val/count_d
  Dates_only$Value_average[i] <- av_value
}
Is there a more efficient way to do this for very large datasets (over 300K observations)? I am pretty sure there is, but my novice R skills are not helping much.
To replicate the two dataframes:
Dates_only <- data.frame(ID=c('1118','1118','1118','1118','1118',
'1118','1118','1118','1119','1119',
'1119','1119','1119','1119','1119',
'1119','13PP','13PP','13PP','13PP',
'13PP','13PP','13PP','13PP'),
Quart_y=c('2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2'),
Quart=c(0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00))
Values <- data.frame(ID=c('1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP'),
Day=c(0,0,0,0.14,0.13,0.13,0.2,0.23,0.24,0.27,0.28,
0.32,0.32,0.32,0.44,0.47,0.49,0.49,0.59,0.64,
0.61,0.72,0.71,0.73,0.95,0.86,0.78,1.1,0.93,1.15),
Value=c(7.6,6.2,6.8,7.1,6.2,5.9,6.8,5.8,4.6,6.5,5.4,
4.2,6.3,4.8,4,6,4.3,3.8,5.9,4,3.6,5.6,3.8,
3.4,5.4,3.2,3,5,2.9,2.9))
We can accomplish almost all of this using the dplyr package:
library(dplyr)
Values %>%
  mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
  mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
  full_join(., Dates_only, by = c("ID", "Quart")) %>%
  group_by(ID, Quart, Quart_y) %>%
  summarise(Value_average = mean(Value, na.rm = TRUE))
Which gives you:
ID Quart Quart_y Value_average
<fctr> <dbl> <fctr> <dbl>
1 1118 0.25 2017Q3 7.166667
2 1118 0.50 2017Q4 6.266667
3 1118 0.75 2018Q1 5.750000
4 1118 1.00 2018Q2 5.400000
5 1118 1.25 2018Q3 5.000000
6 1118 1.50 2018Q4 NaN
7 1118 1.75 2019Q1 NaN
8 1118 2.00 2019Q2 NaN
9 1119 0.25 2017Q3 6.066667
10 1119 0.50 2017Q4 4.833333
# ... with 14 more rows
Here is a breakdown of each line of code:
# Start with your `Values` data frame
Values %>%
# Recode `Day` that are '0.00', as they currently will be excluded from
# the rule 2017Q3: 0<Day<=0.25
# I picked 0.01 arbitrarily to fit this rule
mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
# Now round all `Day` values up to the nearest 0.25
mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
# Now join the two data frames using a `full_join`
# A left_join may also be used if you are uninterested in NA's
full_join(., Dates_only, by = c("ID", "Quart")) %>%
# Finally, designate groupings to calculate the mean values
# for each ID for each quarter
group_by(ID, Quart, Quart_y) %>%
summarise(Value_average = mean(Value, na.rm = TRUE))
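The same idea can be sketched in base R on a small subset of the question's data. Two things differ from the answer above and are my own choices. First, pmax(ceiling(Day / 0.25), 1) puts Day == 0 into the first quarter without recoding it to 0.01. Second, aggregate() and merge() stand in for the dplyr verbs.

```r
# Small subset of the question's data for ID 1118
Values <- data.frame(
  ID    = c("1118", "1118", "1118", "1118"),
  Day   = c(0, 0.14, 0.20, 0.27),
  Value = c(7.6, 7.1, 6.8, 6.5)
)
Dates_only <- data.frame(
  ID      = "1118",
  Quart_y = c("2017Q3", "2017Q4", "2018Q1"),
  Quart   = c(0.25, 0.50, 0.75)
)

# Round Day up to the next multiple of 0.25; pmax(..., 1) keeps Day == 0
# in the first bin instead of giving it Quart == 0
Values$Quart <- pmax(ceiling(Values$Day / 0.25), 1) * 0.25

# Average Value per ID and quarter
avg <- aggregate(Value ~ ID + Quart, data = Values, FUN = mean)
names(avg)[names(avg) == "Value"] <- "Value_average"

# Left join so that quarters without any values are kept as NA
out <- merge(Dates_only, avg, by = c("ID", "Quart"), all.x = TRUE)
out
```

Because the binning is vectorised and the averaging happens once per group, this avoids the nested loop over every pair of rows. Quarters with no matching days come out as NA here, where the dplyr version shows NaN.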