Classify table based on value 'moving window' range and proportions?

I have a dataset of forest stands, each containing several tree layers of different age and volume.
I want to classify the stands as even- or uneven-aged, combining volume and age data. The forest is considered even-aged if more than 80% of the volume is allocated to age classes that are within 20 years of each other. I wonder how to implement the 'within 20 years apart' condition. I can easily calculate the sum of volume and its share for individual tree layers (strat). But how do I check 'how many years apart' they are? Is it some sort of moving window?
Dummy example:
# investigate volume by age classes?
library(dplyr)
df <- data.frame(stand = c("id1", "id1", "id1", "id1",
                           "id2", "id2", "id2"),
                 strat = c(1, 2, 3, 4,
                           1, 2, 3),
                 v = c(4, 10, 15, 20,
                       11, 15, 18),
                 age = c(5, 10, 65, 80,
                         10, 15, 20))
# even-aged = more than 80% of the volume is allocated to layers within a 20-year range
df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v)) %>%
  mutate(V_share = v / V_tot * 100)
Expected outcome:
  stand strat     v   age V_tot V_share quality
  <fct> <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>
1 id1       1     4     5    49    8.16 uneven-aged
2 id1       2    10    10    49   20.4  uneven-aged
3 id1       3    15    65    49   30.6  uneven-aged
4 id1       4    20    80    49   40.8  uneven-aged  # because age classes 65 and 80, which are less than 20 years apart, hold only ~71% of the total volume
5 id2       1    11    10    44   25    even-aged
6 id2       2    15    15    44   34.1  even-aged
7 id2       3    18    20    44   40.9  even-aged

Another tidyverse solution, implementing a moving-window sum:
library(tidyverse)
df <- structure(list(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"), strat = c(1, 2, 3, 4, 1, 2, 3), v = c(4, 10, 15, 20, 11, 15, 18), age = c(5, 10, 65, 80, 10, 15, 20), V_tot = c(49, 49, 49, 49, 44, 44, 44), V_share = c(8.16326530612245, 20.4081632653061, 30.6122448979592, 40.8163265306122, 25, 34.0909090909091, 40.9090909090909)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))
df %>%
  group_by(stand) %>%
  mutate(range20 = map_dbl(age, ~ sum(V_share[which(abs(age - .x) <= 20)])),
         quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))
#> # A tibble: 7 × 8
#> # Groups: stand [2]
#> stand strat v age V_tot V_share range20 quality
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 id1 1 4 5 49 8.16 28.6 uneven-aged
#> 2 id1 2 10 10 49 20.4 28.6 uneven-aged
#> 3 id1 3 15 65 49 30.6 71.4 uneven-aged
#> 4 id1 4 20 80 49 40.8 71.4 uneven-aged
#> 5 id2 1 11 10 44 25 100 even-aged
#> 6 id2 2 15 15 44 34.1 100 even-aged
#> 7 id2 3 18 20 44 40.9 100 even-aged
Created on 2021-09-08 by the reprex package (v2.0.1)
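The same ±20-year window check can also be written without purrr; a minimal base R sketch of the idea (the helper name is_even_aged is mine, not from the original post):

# Hedged sketch: for each observed age, sum the volume shares of all layers
# no more than 20 years away, and call the stand even-aged if any such
# window holds more than 80% of the volume.
is_even_aged <- function(age, share, width = 20, threshold = 80) {
  any(sapply(age, function(a) sum(share[abs(age - a) <= width])) > threshold)
}

df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v),
         V_share = v / V_tot * 100,
         quality = if (is_even_aged(age, V_share)) "even-aged" else "uneven-aged")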

Interesting issue. I think I have a solution using the runner package:
library(runner)

df %>%
  group_by(stand) %>%
  mutate(
    V_tot = sum(v),
    V_share = v / V_tot * 100,
    test = sum_run(
      V_share,
      k = 20L,
      idx = age,
      na_rm = TRUE,
      na_pad = FALSE
    ),
    quality = if_else(any(test >= 80), "even-aged", "uneven-aged")
  ) %>%
  select(-test)
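One caveat, based on my reading of the runner documentation (so treat it as an assumption): sum_run() with idx uses a trailing window of width k ending at each age, whereas the map_dbl() answer above uses a symmetric ±20-year window centred on each age, so the two can disagree near the 80% cutoff. A quick way to compare them on your data (assumes tidyverse and runner are loaded as above):

df %>%
  group_by(stand) %>%
  mutate(V_tot = sum(v),
         V_share = v / V_tot * 100,
         trailing20 = sum_run(V_share, k = 20L, idx = age),
         centred20 = map_dbl(age, ~ sum(V_share[abs(age - .x) <= 20]))) %>%
  # keep only stands where the two window definitions would classify differently
  filter(any(trailing20 >= 80) != any(centred20 > 80))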


Replacing rowwise() operations in grouped data

Anonymised example subset of a much larger dataset (now edited to show an option with multiple competing types):
structure(list(`Sample File` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"),
Marker = c("X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Z",
"Z", "Z", "Z", "Z", "q", "q", "q", "q"), Allele = c(19, 20,
22, 23, 18, 18.2, 19, 19.2, 20, 12, 13, 14, 15, 16, 10, 10.2,
11, 12), Size = c(249.15, 253.13, 260.64, 264.68, 366, 367.81,
369.97, 372.02, 373.95, 91.65, 95.86, 100, 104.24, 108.38,
177.51, 179.4, 181.42, 185.49), Height = c(173L, 1976L, 145L,
1078L, 137L, 62L, 1381L, 45L, 1005L, 38L, 482L, 5766L, 4893L,
19L, 287L, 36L, 5001L, 50L), Type = c("minusone", "allele",
"minusone", "allele", "ambiguous", "minushalf", "allele",
"minushalf", "allele", "minustwo", "ambiguous", "allele",
"allele", "plusone", "minusone", "minushalf", "allele", "plusone"
), LUS = c(11.75, 11.286, 13.375, 13.5, 18, 9, 19, 10, 20,
12, 11, 14, 15, 16, 9.5, NA, 11, 11.5)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L), groups = structure(list(
`Sample File` = c("A", "A", "B", "C"), Marker = c("X", "Y",
"Z", "q"), .rows = structure(list(1:4, 5:9, 10:14, 15:18), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))
I want to look up values based on the classification $Type.
"minustwo" means I want to look up the "Allele", "Height" and "LUS"
values for the row with "Allele" equal to the current row plus two,
with the same Sample File and Marker.
"minusone" means the same but for "Allele" equal to the current row plus one.
"minushalf" means the same but for "Allele" equal to the current row plus 0.2 but the dot values here are 25% each, so 12.1, 12.3, 12.3, 13, 13.1 etc - I have a helper function plusTwoBP() for this.
"plusone" means the same for "Allele" equal to the current row -1
"allele" or "ambiguous" don't need to do anything.
Ideal output:
# A tibble: 18 × 10
# Rowwise: Sample File, Marker
`Sample File` Marker Allele Size Height Type LUS ParentHeight ParentAllele ParentLUS
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 A X 19 249. 173 minusone 11.8 1976 20 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 1078 23 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 1381 19 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 1005 20 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 5766 14 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 4893 15 15
15 C q 10 178. 287 minusone 9.5 5001 11 11
16 C q 10.2 179. 36 minushalf NA 5001 11 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 5001 11 11
I have a rather belaboured way of doing it:
# eg for minustwo
sampleData %>%
  filter(Type == "minustwo") %>%
  rowwise() %>%
  mutate(ParentHeight = sampleData$Height[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
         ParentAllele = sampleData$Allele[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
         ParentLUS = sampleData$LUS[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)]) %>%
  right_join(sampleData)
I then have to redo that for each of my Types.
My real dataset is thousands of rows, so this ends up being a little slow but manageable. More to the point, I want to learn a better way to do it; in particular, the sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker part seems like it should be doable with grouping, so I must be missing a trick there.
I have tried using group_map() but I've clearly not understood it correctly:
sampleData$ParentHeight <- sampleData %>%
  group_by(`Sample File`, `Marker`) %>%
  group_map(.f = \(.x, .y) {
    pmap_dbl(.l = .x, .f = \(Allele, Height, Type, ...) {
      if (Type == "allele" | Type == "ambiguous") {
        return(0)
      } else if (Type == "plusone") {
        return(.x$Height[.x$Allele == round(Allele - 1, 1)])
      } else if (Type == "minushalf") {
        return(.x$Height[.x$Allele == round(plusTwoBP(Allele), 1)])
      } else if (Type == "minusone") {
        return(.x$Height[.x$Allele == round(Allele + 1, 1)])
      } else if (Type == "minustwo") {
        return(.x$Height[.x$Allele == round(Allele + 2, 1)])
      } else {
        stop("unexpected peak type")
      }
    })
  }) %>%
  unlist()
Initially seems to work, but on investigation it's not respecting both layers of grouping, so brings matches from the wrong Marker. Additionally, here I'm assigning the output to a new column in the data frame, but if I try to instead wrap a mutate() around this so that I can create all three new columns in one go then the group_map() no longer works at all.
I also considered using complete() to hugely extend the data frame with all possible values of Allele (including x.0, x.1, x.2, x.3 variants), then use lag() to select the corresponding rows, then drop the spare rows. This seems like it'd make the data frame enormous in the interim.
To summarise
This works, but it feels ugly and like I'm missing a more elegant and obvious solution. How would you approach this?
You can create two versions of Allele: one identical to the original Allele, and one adjusted according to minusone, minustwo, etc.
Then do a self left join based on that adjusted version of Allele (plus Sample File and Marker):
sampleData <- sampleData %>% group_by(`Sample File`, Marker) %>% mutate(id = Allele) %>% ungroup()

left_join(
  sampleData %>%
    mutate(id = case_when(
      Type == "minusone" ~ id + 1,
      Type == "minustwo" ~ id + 2,
      Type == "plusone" ~ id - 1,
      Type == "minushalf" ~ ceiling(id))),
  sampleData %>% select(-c(Size, Type)),
  by = c("Sample File", "Marker", "id"),
  suffix = c("", ".parent")
) %>% select(-id)
Output:
# A tibble: 18 × 10
`Sample File` Marker Allele Size Height Type LUS Allele.parent Height.parent LUS.parent
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl> <int> <dbl>
1 A X 19 249. 173 minusone 11.8 20 1976 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 23 1078 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 19 1381 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 20 1005 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 14 5766 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 15 4893 15
15 C q 10 178. 287 minusone 9.5 11 5001 11
16 C q 10.2 179. 36 minushalf NA 11 5001 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 11 5001 11
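If the 0.2-step alleles need the question's 25%-step arithmetic rather than simple rounding, the OP's plusTwoBP() helper (not shown in the post; assumed here to be vectorised and to return the parent allele of a minushalf peak) should slot into the same case_when. A hedged sketch of that variant:

left_join(
  sampleData %>%
    ungroup() %>%
    mutate(id = case_when(
      Type == "minusone" ~ Allele + 1,
      Type == "minustwo" ~ Allele + 2,
      Type == "plusone" ~ Allele - 1,
      Type == "minushalf" ~ plusTwoBP(Allele))),  # OP's helper; assumed vectorised
  sampleData %>% ungroup() %>% select(-c(Size, Type)) %>% mutate(id = Allele),
  by = c("Sample File", "Marker", "id"),
  suffix = c("", ".parent")
) %>%
  select(-id)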

Compare effect of environment between two dataframes

I have two dataframes as follows:
a <- structure(list(Bacteria_A = c(12, 23, 45, 32, 34, 0), Bacteria_B = c(23,
12, 33, 44, 55, 3), Bacteria_C = c(25, 10, 50, 38, 3, 34), Group = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "soil")), class = "data.frame", row.names = c("Sample_1",
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))
b <- structure(list(Bacteria_A = c(14, 10, 40, 40, 37, 3), Bacteria_B = c(25,
14, 32, 23, 45, 35), Bacteria_C = c(12, 34, 45, 22, 7, 23), Group = structure(c(1L,
1L, 1L, 1L, 1L, 1L), class = "factor", .Label = "water")), class = "data.frame", row.names = c("Sample_1",
"Sample_2", "Sample_3", "Sample_4", "Sample_5", "Sample_6"))
> a
Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 12 23 25 soil
Sample_2 23 12 10 soil
Sample_3 45 33 50 soil
Sample_4 32 44 38 soil
Sample_5 34 55 3 soil
Sample_6 0 3 34 soil
> b
Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 14 25 12 water
Sample_2 10 14 34 water
Sample_3 40 32 45 water
Sample_4 40 23 22 water
Sample_5 37 45 7 water
Sample_6 3 35 23 water
I want to compare, for each bacterium, the difference across samples between soil and water.
For example, for Bacteria_A I want to know if there is a difference between soil and water; the same for Bacteria_B and Bacteria_C (I have 900 bacteria). I thought of a t-test but am not sure how to do it with two dataframes.
I forgot to mention that not all bacteria are present in both dataframes, so it could happen that a bacterium is not present in one of the environments. If bacteria are found in both environments they have exactly the same name.
The original dataframe is 160 samples by 500 bacteria, and the data are not normally distributed.
Thanks for your help.
First of all, I want to mention that there are statistical methods for this comparison which are more adequate than a t-test. They take into account the distribution the counts come from (usually Negative Binomial). You can check the DESeq2 package, for instance.
As to your technical issue I would do:
for (bac in setdiff(intersect(colnames(a), colnames(b)), "Group")) {
  print(t.test(a[, bac], b[, bac]))
}
Your values do not seem to be in a normal or near-normal distribution, so you should stay away from the t-test. If you are unsure which distribution you are dealing with, you could use a wilcox.test.
You can stick your two data frames together quite easily then convert them to long format before running the appropriate tests:
library(tidyr)
library(dplyr)
bind_rows(a, b) %>%
  pivot_longer(c(Bacteria_A, Bacteria_B, Bacteria_C)) %>%
  group_by(name) %>%
  summarise(mean_soil = mean(value[Group == "soil"]),
            mean_water = mean(value[Group == "water"]),
            pvalue = wilcox.test(value ~ Group)$p.value)
Which gives you
#> # A tibble: 3 x 4
#> name mean_soil mean_water pvalue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Bacteria_A 24.3 24 0.936
#> 2 Bacteria_B 28.3 29 0.873
#> 3 Bacteria_C 26.7 23.8 0.748
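With hundreds of bacteria you will be running one test per column, so it is usually worth adjusting the p-values for multiple testing as well. A sketch using Benjamini-Hochberg adjustment (p.adjust() is base R; the rest follows the pipeline above):

bind_rows(a, b) %>%
  pivot_longer(starts_with("Bacteria")) %>%
  group_by(name) %>%
  summarise(pvalue = wilcox.test(value ~ Group)$p.value) %>%
  mutate(p_adj = p.adjust(pvalue, method = "BH"))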
This finds the bacteria names that exist in both data frames and then does a t.test between the same names giving a list L whose names are the bacteria names. The last line uses tidy to convert L to a data frame. You can replace t.test with wilcox.test if you prefer a non-parametric test. (Of course this does not take into account the problems of performing multiple hypothesis tests but rather just does the calculations.)
Name <- intersect(names(Filter(is.numeric, a)), names(Filter(is.numeric, b)))
L <- Map(t.test, a[Name], b[Name])
library(broom)
cbind(Name, do.call("rbind", lapply(L, tidy)))
The last line gives the following data frame:
Name estimate estimate1 estimate2 statistic p.value
Bacteria_A Bacteria_A 0.3333333 24.33333 24.00000 0.03485781 0.9728799
Bacteria_B Bacteria_B -0.6666667 28.33333 29.00000 -0.07312724 0.9435532
Bacteria_C Bacteria_C 2.8333333 26.66667 23.83333 0.30754940 0.7650662
parameter conf.low conf.high method alternative
Bacteria_A 9.988603 -20.97689 21.64356 Welch Two Sample t-test two.sided
Bacteria_B 7.765869 -21.80026 20.46692 Welch Two Sample t-test two.sided
Bacteria_C 9.492873 -17.84326 23.50993 Welch Two Sample t-test two.sided
Note
LinesA <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 12 23 25 soil
Sample_2 23 12 10 soil
Sample_3 45 33 50 soil
Sample_4 32 44 38 soil
Sample_5 34 55 3 soil
Sample_6 0 3 34 soil"
LinesB <- "Bacteria_A Bacteria_B Bacteria_C Group
Sample_1 14 25 12 water
Sample_2 10 14 34 water
Sample_3 40 32 45 water
Sample_4 40 23 22 water
Sample_5 37 45 7 water
Sample_6 3 35 23 water"
a <- read.table(text = LinesA, as.is = TRUE)
b <- read.table(text = LinesB, as.is = TRUE)

Merging rows with same value with conditions for keeping multiple dummies

Creating a subset example of the DF (the code for a part of the actual one is at the end)
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 0 0 1
2011 41 12 1 1 0 0
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
I want to merge the rows with the same value of PK_COD_ENTIDADE into a single one, and keep the values "1" for the dummies sharing that PK_COD_ENTIDADE. I don't care about the different values of PK_COD_TURMA; it doesn't matter which one stays in the final DF (27 or 41).
My DF has multiple variables like PK_COD_TURMA whose final value I don't care about; the important ones are PK_COD_ENTIDADE and the dummies with value "1".
It would look like this at the end:
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 1 0 1
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
Notice how the first observation with PK_COD_ENTIDADE = 12 has the value "1" for two dummies and another observation with PK_COD_ENTIDADE = 12 has "1" in a different dummy; at the end they are merged into a single observation for that PK_COD_ENTIDADE, keeping the different "1" dummies (and dummies that are 1 in several observations don't sum to 2, because they are dummies).
I have no idea how to do this; I searched for some solutions with dplyr but couldn't get anything close to working.
Here is the structure of the df with all variables:
dftest2 <- structure(list(ANO_CENSO = c(2011, 2011, 2011, 2011), PK_COD_TURMA = c(27,
41, 18, 16), NU_DURACAO_TURMA = c(250, 255, 255,
255), FK_COD_ETAPA_ENSINO = c(41, 19, 19, 19), PK_COD_ENTIDADE = c(12,
12, 13, 14), FK_COD_ESTADO = c(11, 11, 11,
11), SIGLA = c("RO", "RO", "RO", "RO"), FK_COD_MUNICIPIO = c(1100023,
1100023, 1100023, 1100023), ID_LOCALIZACAO = c(1, 1, 1, 1), ID_DEPENDENCIA_ADM = c(2,
2, 2, 2), MAIS_ENSINO_FUND = c(1, 1, 0, 1), MAIS_ENSINO_MED = c(0,
1, 0, 1), ENSINO_INTEG_FUND = c(0L, 0L, 0L, 0L), ENSINO_INTEG_MED = c(1L,
0L, 1L, 1L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
The sample data you give for dftest2 does not match the data you present at the beginning of your post.
In response to your question, an option is to use aggregate
aggregate(
  . ~ PK_COD_ENTIDADE,
  data = transform(dftest2, SIGLA = as.factor(SIGLA)),
  FUN = max)
# PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_ENSINO
#1 12 2011 41 255 41
#2 13 2011 18 255 19
#3 14 2011 16 255 19
# FK_COD_ESTADO SIGLA FK_COD_MUNICIPIO ID_LOCALIZACAO ID_DEPENDENCIA_ADM
#1 11 1 1100023 1 2
#2 11 1 1100023 1 2
#3 11 1 1100023 1 2
# MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
#1 1 1 0 1
#2 0 0 0 1
#3 1 1 0 1
Explanation: We first convert the character column SIGLA to a factor; then we aggregate data in all columns (except PK_COD_ENTIDADE) by PK_COD_ENTIDADE, and return the max value (which should be consistent with your problem statement).
You can do something similar using dplyr's group_by and summarise_all:
library(dplyr)
dftest2 %>%
  group_by(PK_COD_ENTIDADE) %>%
  summarise_all(~ ifelse(is.character(.x), last(.x), max(.x))) %>%
  ungroup()
# A tibble: 3 x 14
PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_EN…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2011 41 255 41
2 13 2011 18 255 19
3 14 2011 16 255 19
# … with 9 more variables: FK_COD_ESTADO <dbl>, SIGLA <chr>,
# FK_COD_MUNICIPIO <dbl>, ID_LOCALIZACAO <dbl>, ID_DEPENDENCIA_ADM <dbl>,
# MAIS_ENSINO_FUND <dbl>, MAIS_ENSINO_MED <dbl>, ENSINO_INTEG_FUND <int>,
# ENSINO_INTEG_MED <int>
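Note that summarise_all() is superseded in current dplyr; the same idea with across() would look roughly like this (a sketch, assuming dplyr >= 1.0):

library(dplyr)

dftest2 %>%
  group_by(PK_COD_ENTIDADE) %>%
  summarise(across(everything(), ~ if (is.character(.x)) last(.x) else max(.x)),
            .groups = "drop")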

Use dplyr to filter dataframe by 32 conditions stored in a 2nd dataframe

Let me dive right into a reproducible example here:
Here is the dataframe with these "possession" conditions to be met for each team:
structure(list(conferenceId = c("A10", "AAC", "ACC", "AE", "AS",
"BIG10", "BIG12", "BIGEAST", "BIGSKY", "BIGSOUTH", "BIGWEST",
"COLONIAL", "CUSA", "HORIZON", "IVY", "MAAC", "MAC", "MEAC",
"MVC", "MWC", "NE", "OVC", "PAC12", "PATRIOT", "SEC", "SOUTHERN",
"SOUTHLAND", "SUMMIT", "SUNBELT", "SWAC", "WAC", "WCC"), values = c(25.5,
33.625, 57.65, 16, 20.9, 48.55, 63.9, 45, 17.95, 28, 11, 24.4,
23.45, 10.5, 16, 12.275, 31.5, 10.95, 21.425, 36.8999999999999,
31.025, 18.1, 23.7, 19.675, 52.9999999999997, 24.5, 15, 27.5,
12.6, 17.75, 13, 33)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -32L))
> head(poss_quantiles)
# A tibble: 6 x 2
conferenceId values
<chr> <dbl>
1 A10 25.5
2 AAC 33.6
3 ACC 57.6
4 AE 16
5 AS 20.9
6 BIG10 48.5
My main dataframe looks as follows:
> head(stats_df)
# A tibble: 6 x 8
season teamId teamName teamMarket conferenceName conferenceId possessions games
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <int>
1 1819 AFA Falcons Air Force Mountain West MWC 75 2
2 1819 AKR Zips Akron Mid-American MAC 46 3
3 1819 ALA Crimson Tide Alabama Southeastern SEC 90.5 6
4 1819 ARK Razorbacks Arkansas Southeastern SEC 71.5 5
5 1819 ARK Razorbacks Arkansas Southeastern SEC 42.5 5
6 1819 ASU Sun Devils Arizona State Pacific 12 PAC12 91.5 7
> dim(stats_df)
[1] 6426 500
I need to filter the main dataframe stats_df so that each row's possessions value is greater than the respective possession value for its conference in the poss_quantiles dataframe. I am struggling to figure out the best way to do this with dplyr.
I believe the following is what the question asks for.
I have made up a dataset to test the code. Posted at the end.
library(dplyr)
stats_df %>%
  inner_join(poss_quantiles) %>%
  filter(possessions > values) %>%
  select(-values) %>%
  left_join(stats_df)
# conferenceId possessions otherCol oneMoreCol
#1 s 119.63695 -1.2519859 1.3853352
#2 d 82.68660 -0.4968500 0.1954866
#3 b 103.58936 -1.0149620 0.9405918
#4 o 139.69607 -0.1623095 0.4832004
#5 q 76.06736 0.5630558 0.1319336
#6 x 86.19777 -0.7733534 2.3939706
#7 p 135.80127 -1.1578085 0.2037951
#8 t 136.05944 1.7770844 0.5145781
Data creation code.
set.seed(1234)
poss_quantiles <- data.frame(conferenceId = letters[sample(26, 20)],
                             values = runif(20, 50, 100),
                             stringsAsFactors = FALSE)

stats_df <- data.frame(conferenceId = letters[sample(26, 20)],
                       possessions = runif(20, 10, 150),
                       otherCol = rnorm(20),
                       oneMoreCol = rexp(20),
                       stringsAsFactors = FALSE)
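For what it's worth, the final left_join(stats_df) is not strictly needed if you only want the rows that pass the threshold; a slightly more direct sketch of the same filter (same column names as above):

library(dplyr)

stats_df %>%
  inner_join(poss_quantiles, by = "conferenceId") %>%
  filter(possessions > values) %>%
  select(-values)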

Add a column dependent on another column and a subset (using the tidyverse)

I have the following data set of stock data (as a tibble)
Open Volume Ticker
<dbl> <dbl> <chr>
1106. 10 AAL.L
1086. 30 AAL.L
1043. 9 ABF.L
1055. 2 ABF.L
1048. 90000 BT-A.L
1077. 8000 BT-A.L
Using the dplyr package I would like to obtain this data set, with the average volume for a given Ticker.
Open Volume Ticker AvgVolume
<dbl> <dbl> <chr> <dbl>
1106. 10 AAL.L 20
1086. 30 AAL.L 20
1043. 9 ABF.L 5.5
1055. 2 ABF.L 5.5
1048. 90000 BT-A.L 49000
1077. 8000 BT-A.L 49000
I will be doing this over 500,000 rows, so speed and efficiency are key (no for loops, etc.).
I am new to the Tidyverse and would appreciate an explanation of the answer.
Though you ask for a tidyverse solution and @akrun already posted one, I will post a base R way.
This is because your problem is textbook-perfect for ave().
ave(df1$Volume, df1$Ticker)
#[1] 20.0 20.0 5.5 5.5 49000.0 49000.0
And just assign this output to a new column.
df1$AvgVolume <- ave(df1$Volume, df1$Ticker)
df1
## A tibble: 6 x 4
# Open Volume Ticker AvgVolume
# <dbl> <int> <fct> <dbl>
#1 1106 10 AAL.L 20
#2 1086 30 AAL.L 20
#3 1043 9 ABF.L 5.5
#4 1055 2 ABF.L 5.5
#5 1048 90000 BT-A.L 49000
#6 1077 8000 BT-A.L 49000
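ave() uses FUN = mean by default; any other per-group summary drops in the same way, for example (a sketch):

# grouped median instead of the default mean
df1$MedVolume <- ave(df1$Volume, df1$Ticker, FUN = median)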
Data.
df1 <-
structure(list(Open = c(1106, 1086, 1043, 1055, 1048, 1077),
Volume = c(10L, 30L, 9L, 2L, 90000L, 8000L), Ticker = structure(c(1L,
1L, 2L, 2L, 3L, 3L), .Label = c("AAL.L", "ABF.L", "BT-A.L"
), class = "factor"), AvgVolume = c(20, 20, 5.5, 5.5, 49000,
49000)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
We group by 'Ticker' and create 'AvgVolume' as the mean of 'Volume' with mutate:
library(dplyr)
df1 %>%
  group_by(Ticker) %>%
  mutate(AvgVolume = mean(Volume))
# A tibble: 6 x 4
# Groups: Ticker [3]
# Open Volume Ticker AvgVolume
# <int> <int> <chr> <dbl>
#1 1106 10 AAL.L 20
#2 1086 30 AAL.L 20
#3 1043 9 ABF.L 5.5
#4 1055 2 ABF.L 5.5
#5 1048 90000 BT-A.L 49000
#6 1077 8000 BT-A.L 49000
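Given the ~500,000-row concern, a data.table version of the same grouped mean is another option worth benchmarking (a sketch, assuming df1 as defined above):

library(data.table)

dt <- as.data.table(df1)
dt[, AvgVolume := mean(Volume), by = Ticker]  # adds the column by reference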
