Merging rows with same value with conditions for keeping multiple dummies - r

Here is a subset example of the DF (the code for part of the actual one is at the end):
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 0 0 1
2011 41 12 1 1 0 0
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
I want to merge the rows that share the same value of PK_COD_ENTIDADE into a single row, keeping the value "1" for every dummy that is 1 in any of those rows. I don't care about the different values of PK_COD_TURMA; it doesn't matter which one stays in the final DF (27 or 41).
My DF has multiple variables like PK_COD_TURMA whose final value I don't care about; the important ones are PK_COD_ENTIDADE and the dummies with value "1".
It would look like this at the end:
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 1 0 1
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
Note how PK_COD_ENTIDADE = 12 has the value "1" for two dummies in its first observation and "1" for another dummy in its second observation; at the end the two rows are merged into a single observation for that PK_COD_ENTIDADE, keeping every dummy that was "1" (and a dummy that is 1 in several observations stays 1, it does not sum to 2, because these are dummies).
I have no idea how to do this. I searched for solutions with dplyr but couldn't get anything close to working.
Here is the structure of the df with all variables:
dftest2 <- structure(list(ANO_CENSO = c(2011, 2011, 2011, 2011), PK_COD_TURMA = c(27,
41, 18, 16), NU_DURACAO_TURMA = c(250, 255, 255,
255), FK_COD_ETAPA_ENSINO = c(41, 19, 19, 19), PK_COD_ENTIDADE = c(12,
12, 13, 14), FK_COD_ESTADO = c(11, 11, 11,
11), SIGLA = c("RO", "RO", "RO", "RO"), FK_COD_MUNICIPIO = c(1100023,
1100023, 1100023, 1100023), ID_LOCALIZACAO = c(1, 1, 1, 1), ID_DEPENDENCIA_ADM = c(2,
2, 2, 2), MAIS_ENSINO_FUND = c(1, 1, 0, 1), MAIS_ENSINO_MED = c(0,
1, 0, 1), ENSINO_INTEG_FUND = c(0L, 0L, 0L, 0L), ENSINO_INTEG_MED = c(1L,
0L, 1L, 1L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))

The sample data you give for dftest2 does not match the data you present at the beginning of your post.
In response to your question, an option is to use aggregate
aggregate(
. ~ PK_COD_ENTIDADE,
data = transform(dftest2, SIGLA = as.factor(SIGLA)),
FUN = max)
#  PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_ENSINO
#1 12 2011 41 255 41
#2 13 2011 18 255 19
#3 14 2011 16 255 19
# FK_COD_ESTADO SIGLA FK_COD_MUNICIPIO ID_LOCALIZACAO ID_DEPENDENCIA_ADM
#1 11 1 1100023 1 2
#2 11 1 1100023 1 2
#3 11 1 1100023 1 2
# MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
#1 1 1 0 1
#2 0 0 0 1
#3 1 1 0 1
Explanation: We first convert the character column SIGLA to a factor; then we aggregate data in all columns (except PK_COD_ENTIDADE) by PK_COD_ENTIDADE, and return the max value (which should be consistent with your problem statement).
You can do something similar using dplyr's group_by and summarise_all:
library(dplyr)
dftest2 %>%
group_by(PK_COD_ENTIDADE) %>%
summarise_all(~ifelse(is.character(.x), last(.x), max(.x))) %>%
ungroup()
# A tibble: 3 x 14
PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_EN…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2011 41 255 41
2 13 2011 18 255 19
3 14 2011 16 255 19
# … with 9 more variables: FK_COD_ESTADO <dbl>, SIGLA <chr>,
# FK_COD_MUNICIPIO <dbl>, ID_LOCALIZACAO <dbl>, ID_DEPENDENCIA_ADM <dbl>,
# MAIS_ENSINO_FUND <dbl>, MAIS_ENSINO_MED <dbl>, ENSINO_INTEG_FUND <int>,
# ENSINO_INTEG_MED <int>
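If you are on dplyr 1.0.0 or later, note that summarise_all() has been superseded by across(); a minimal sketch of the same idea (not part of the original answer) would be:
library(dplyr)

dftest2 %>%
  group_by(PK_COD_ENTIDADE) %>%
  summarise(across(where(is.character), last),
            across(where(is.numeric), max),
            .groups = "drop")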

Related

replacing rowwise() operations in grouped data

Anonymised example subset of a much larger dataset (now edited to show an option with multiple competing types):
structure(list(`Sample File` = c("A", "A", "A", "A", "A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C"),
Marker = c("X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y", "Z",
"Z", "Z", "Z", "Z", "q", "q", "q", "q"), Allele = c(19, 20,
22, 23, 18, 18.2, 19, 19.2, 20, 12, 13, 14, 15, 16, 10, 10.2,
11, 12), Size = c(249.15, 253.13, 260.64, 264.68, 366, 367.81,
369.97, 372.02, 373.95, 91.65, 95.86, 100, 104.24, 108.38,
177.51, 179.4, 181.42, 185.49), Height = c(173L, 1976L, 145L,
1078L, 137L, 62L, 1381L, 45L, 1005L, 38L, 482L, 5766L, 4893L,
19L, 287L, 36L, 5001L, 50L), Type = c("minusone", "allele",
"minusone", "allele", "ambiguous", "minushalf", "allele",
"minushalf", "allele", "minustwo", "ambiguous", "allele",
"allele", "plusone", "minusone", "minushalf", "allele", "plusone"
), LUS = c(11.75, 11.286, 13.375, 13.5, 18, 9, 19, 10, 20,
12, 11, 14, 15, 16, 9.5, NA, 11, 11.5)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L), groups = structure(list(
`Sample File` = c("A", "A", "B", "C"), Marker = c("X", "Y",
"Z", "q"), .rows = structure(list(1:4, 5:9, 10:14, 15:18), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -4L), .drop = TRUE))
I want to look up values based on the classification $Type.
"minustwo" means I want to look up the "Allele", "Height" and "LUS"
values for the row with "Allele" equal to the current row plus two,
with the same Sample File and Marker.
"minusone" means the same but for "Allele" equal to the current row plus one.
"minushalf" means the same but for "Allele" equal to the current row plus 0.2 but the dot values here are 25% each, so 12.1, 12.3, 12.3, 13, 13.1 etc - I have a helper function plusTwoBP() for this.
"plusone" means the same for "Allele" equal to the current row -1
"allele" or "ambiguous" don't need to do anything.
Ideal output:
# A tibble: 18 × 10
# Rowwise: Sample File, Marker
`Sample File` Marker Allele Size Height Type LUS ParentHeight ParentAllele ParentLUS
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <int> <dbl> <dbl>
1 A X 19 249. 173 minusone 11.8 1976 20 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 1078 23 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 1381 19 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 1005 20 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 5766 14 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 4893 15 15
15 C q 10 178. 287 minusone 9.5 5001 11 11
16 C q 10.2 179. 36 minushalf NA 5001 11 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 5001 11 11
I have a rather belaboured way of doing it:
# eg for minustwo
sampleData %>%
filter(Type == "minustwo") %>%
rowwise() %>%
mutate(ParentHeight = sampleData$Height[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentAllele = sampleData$Allele[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)],
ParentLUS = sampleData$LUS[sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker & sampleData$Allele == (Allele + 2)]) %>%
right_join(sampleData)
I then have to redo that for each of my Types
My real dataset is thousands of rows, so this ends up being a little slow but manageable. More to the point, I want to learn a better way to do it; in particular, the sampleData$`Sample File` == `Sample File` & sampleData$Marker == Marker part seems like it should be doable with grouping, so I must be missing a trick there.
I have tried using group_map() but I've clearly not understood it correctly:
sampleData$ParentHeight <- sampleData %>%
group_by(`Sample File`, `Marker`) %>%
group_map(.f = \(.x, .y) {
pmap_dbl(.l = .x, .f = \(Allele, Height, Type, ...){
if(Type == "allele" | Type == "ambiguous") { return(0)
} else if (Type == "plusone") {
return(.x$Height[.x$Allele == round(Allele - 1, 1)])
} else if (Type == "minushalf") {
return(.x$Height[.x$Allele == round(plustwoBP(Allele), 1)])
} else if (Type == "minusone") {
return(.x$Height[.x$Allele == round(Allele + 1, 1)])
} else if (Type == "minustwo") {
return(.x$Height[.x$Allele == round(Allele + 2, 1)])
} else { stop("unexpected peak type") }
})}) %>% unlist()
Initially seems to work, but on investigation it's not respecting both layers of grouping, so brings matches from the wrong Marker. Additionally, here I'm assigning the output to a new column in the data frame, but if I try to instead wrap a mutate() around this so that I can create all three new columns in one go then the group_map() no longer works at all.
I also considered using complete() to hugely extend the data frame with all possible values of Allele (including x.0, x.1, x.2, x.3 variants), then using lag() to select the corresponding rows, then dropping the spare rows. This seems like it would make the data frame enormous in the interim.
To summarise
This works, but it feels ugly and like I'm missing a more elegant and obvious solution. How would you approach this?
You can create two versions of Allele: one identical to the original Allele, and one adjusted according to the Type (minusone, minustwo, etc.).
Then do a self left join based on that adjusted version of Allele (plus Sample File and Marker):
sampleData = sampleData %>% group_by(`Sample File`,Marker) %>% mutate(id = Allele) %>% ungroup()
left_join(
sampleData %>%
mutate(id = case_when(
Type=="minusone"~id+1,
Type=="minustwo"~id+2,
Type=="plusone"~id-1,
Type=="minushalf"~ceiling(id))),
sampleData %>% select(-c(Size,Type)),
by=c("Sample File", "Marker", "id"),
suffix = c("", ".parent")
) %>% select(-id)
Output:
# A tibble: 18 × 10
`Sample File` Marker Allele Size Height Type LUS Allele.parent Height.parent LUS.parent
<chr> <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl> <int> <dbl>
1 A X 19 249. 173 minusone 11.8 20 1976 11.3
2 A X 20 253. 1976 allele 11.3 NA NA NA
3 A X 22 261. 145 minusone 13.4 23 1078 13.5
4 A X 23 265. 1078 allele 13.5 NA NA NA
5 A Y 18 366 137 ambiguous 18 NA NA NA
6 A Y 18.2 368. 62 minushalf 9 19 1381 19
7 A Y 19 370. 1381 allele 19 NA NA NA
8 A Y 19.2 372. 45 minushalf 10 20 1005 20
9 A Y 20 374. 1005 allele 20 NA NA NA
10 B Z 12 91.6 38 minustwo 12 14 5766 14
11 B Z 13 95.9 482 ambiguous 11 NA NA NA
12 B Z 14 100 5766 allele 14 NA NA NA
13 B Z 15 104. 4893 allele 15 NA NA NA
14 B Z 16 108. 19 plusone 16 15 4893 15
15 C q 10 178. 287 minusone 9.5 11 5001 11
16 C q 10.2 179. 36 minushalf NA 11 5001 11
17 C q 11 181. 5001 allele 11 NA NA NA
18 C q 12 185. 50 plusone 11.5 11 5001 11
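If you prefer the ParentHeight / ParentAllele / ParentLUS names shown in the ideal output over the .parent suffix, a final rename would do it; a small sketch, assuming the join result has been assigned to a variable called joined (a name introduced here):
joined %>%
  rename(ParentAllele = Allele.parent,
         ParentHeight = Height.parent,
         ParentLUS    = LUS.parent)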

R : regression line interrupted in ggplot while a continuous line is expected

I created a multilevel regression model with the nlme package and now I would like to plot the regression line obtained for some patients (unfortunately I cannot use geom_smooth with nlme).
So, using the model, I obtained the following predicted values (predicted_value) at different times (date_day), shown here for two patients (ID 1 and ID 2).
df <- data.frame (ID = c (rep (1, 10), rep(2, 10)),
date_day = c (7:16, 7:16),
predicted_value = c (33, 33, 33, 33, 33, NA, 34, NA, NA, NA,
55, NA, NA, 53.3, NA, NA, 51.6, NA, 50.5, NA))
ID date_day predicted_value
1 1 7 33.0
2 1 8 33.0
3 1 9 33.0
4 1 10 33.0
5 1 11 33.0
6 1 12 NA
7 1 13 34.0
8 1 14 NA
9 1 15 NA
10 1 16 NA
11 2 7 55.0
12 2 8 NA
13 2 9 NA
14 2 10 53.3
15 2 11 NA
16 2 12 NA
17 2 13 51.6
18 2 14 NA
19 2 15 50.5
20 2 16 NA
Now I would like to draw the regression line for each of these patients. So I tried the following
ggplot(df%>% filter(ID %in% c("1", "2")))+
aes(x = date_day, y = predicted_value) +
geom_point(shape = "circle", size = 1.5, colour = "#112446", na.rm = T) +
geom_line(aes(y = predicted_value), na.rm = T, size = 1) +
theme_minimal() +
facet_wrap(vars(ID)) +
scale_x_continuous(name="days", limits=c(7, 16)) +
scale_y_continuous(name="predicted values", limits=c(0, 60))
But I end up with the following plots: for patient 1 the line is interrupted, and for patient 2 there is no line at all. How can I fix that?
Thanks a lot
Thank you @BenBolker, indeed changing the first line
ggplot(df %>% filter(ID %in% c("1", "2")))
to
ggplot(na.omit(df) %>% filter(ID %in% c("1", "2")))
solved the problem.
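Note that na.omit(df) drops a row if any column is NA, which can be broader than intended. An alternative sketch (not from the original thread) keeps all rows available to the other layers and removes the NA predictions only from the data used by geom_line(), so the line is drawn through the remaining observations:
ggplot(df %>% filter(ID %in% c("1", "2")),
       aes(x = date_day, y = predicted_value)) +
  geom_point(shape = "circle", size = 1.5, colour = "#112446", na.rm = TRUE) +
  # a layer's data argument may be a function of the plot data
  geom_line(data = function(d) d[!is.na(d$predicted_value), ], size = 1) +
  theme_minimal() +
  facet_wrap(vars(ID)) +
  scale_x_continuous(name = "days", limits = c(7, 16)) +
  scale_y_continuous(name = "predicted values", limits = c(0, 60))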

Classify table based on value 'moving window' range and proportions?

I have a dataset of forest stands, each containing several tree layers of different age and volume.
I want to classify the stands as even-aged or uneven-aged, combining volume and age data. A forest is considered even-aged if more than 80% of the volume is allocated to age classes within 20 years of each other. I wonder how to implement the 'within 20 years apart' condition? I can easily calculate the sum of volume and its share for the individual tree layers (strat). But how do I check how many years apart they are? Is it some sort of moving window?
Dummy example:
# investigate volume by age classes?
library(dplyr)
df <- data.frame(stand = c("id1", "id1", "id1", "id1",
'id2', 'id2', 'id2'),
strat = c(1,2,3,4,
1,2,3),
v = c(4,10,15,20,
11,15,18),
age = c(5,10,65,80,
10,15,20))
# even-aged = more than 80% of the volume is allocated to layers within a 20-year range
df %>%
group_by(stand) %>%
mutate(V_tot = sum(v)) %>%
mutate(V_share = v/V_tot*100)
Expected outcome:
stand strat v age V_tot V_share quality
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 id1 1 4 5 49 8.16 uneven-aged
2 id1 2 10 10 49 20.4 uneven-aged
3 id1 3 15 65 49 30.6 uneven-aged
4 id1 4 20 80 49 40.8 uneven-aged #* because age classes 65 and 80, although less than 20 years apart, hold only about 70% of the total volume
5 id2 1 11 10 44 25 even-aged
6 id2 2 15 15 44 34.1 even-aged
7 id2 3 18 20 44 40.9 even-aged
Another tidyverse solution implementing a moving average:
library(tidyverse)
df <- structure(list(stand = c("id1", "id1", "id1", "id1", "id2", "id2", "id2"), strat = c(1, 2, 3, 4, 1, 2, 3), v = c(4, 10, 15, 20, 11, 15, 18), age = c(5, 10, 65, 80, 10, 15, 20), V_tot = c(49, 49, 49, 49, 44, 44, 44), V_share = c(8.16326530612245, 20.4081632653061, 30.6122448979592, 40.8163265306122, 25, 34.0909090909091, 40.9090909090909)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))
df %>%
group_by(stand) %>%
mutate(range20 = map_dbl(age, ~ sum(V_share[which(abs(age - .x) <= 20)])),
quality = ifelse(any(range20 > 80), "even-aged", "uneven-aged"))
#> # A tibble: 7 × 8
#> # Groups: stand [2]
#> stand strat v age V_tot V_share range20 quality
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 id1 1 4 5 49 8.16 28.6 uneven-aged
#> 2 id1 2 10 10 49 20.4 28.6 uneven-aged
#> 3 id1 3 15 65 49 30.6 71.4 uneven-aged
#> 4 id1 4 20 80 49 40.8 71.4 uneven-aged
#> 5 id2 1 11 10 44 25 100 even-aged
#> 6 id2 2 15 15 44 34.1 100 even-aged
#> 7 id2 3 18 20 44 40.9 100 even-aged
Created on 2021-09-08 by the reprex package (v2.0.1)
Interesting issue, I think I have a solution using the runner package
library(dplyr)
library(runner)

df %>%
group_by(stand) %>%
mutate(
V_tot = sum(v),
V_share = v/V_tot*100,
test = sum_run(
V_share,
k = 20L,
idx = age,
na_rm = TRUE,
na_pad = FALSE
),
quality = if_else(any(test >= 80), 'even-aged', 'uneven-aged')
) %>%
select(-test)

R Aggregate over multiple columns

I'm currently working with a large dataframe of 75 columns and roughly 9,500 rows. This dataframe contains observations for every day from 1995-2019 for several observation points.
Edit: the output of dput(head(df)):
> dput(head(df))
structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
The dataframe looks like this sample from it:
date x1 x2 x3 x4 x5 xn year month day
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1995-01-01 50.8 62.2 90.2 60 NA 53.2 1995 1 1
2 1999-08-02 62.6 58.7 NA 72 NA 61.1 1999 8 2
3 2001-09-03 57.2 49.9 70.1 68.4 NA 56.6 2001 9 3
4 2008-05-04 56.6 56.4 75.8 65.5 NA 58.6 2008 5 4
5 2012-04-05 36.8 43.2 83.3 63.2 NA 36.2 2012 4 5
6 2019-12-31 39.1 41.6 98.5 55.9 NA 44.4 2019 12 31
str(df)
tibble [9,131 x 75] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:9131], format: "1995-01-01" "1995-01-02" ...
$ x1 : num [1:9131] 50.8 62.6 57.2 56.6 36.8 ...
$ x2 : num [1:9131] 62.2 58.7 49.9 56.4 43.2 ...
 ... (columns x3 through xn omitted) ...
$ year : num [1:9131] 1995 1995 1995 1995 1995 ...
$ month : num [1:9131] 1 1 1 1 1 1 1 1 1 1 ...
$ day : num [1:9131] 1 2 3 4 5 6 7 8 9 10 ...
My goal is to get, for every observation point x1 ... xn, the count of observations per year that exceed a certain limit.
So far I have tried to achieve this with the aggregate() function.
To get the mean of every year I used the following command:
aggregate(list(df), by=list(year=df$year), mean, na.rm=TRUE)
This works perfectly; I get the mean for every year for every observation point.
To get the count for one station I used the following code:
aggregate(list(x1=df$x1), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
which results in this print:
year x1
1 1995 52
2 1996 43
3 1997 44
4 1998 42
5 1999 38
6 2000 76
7 2001 52
8 2002 58
9 2003 110
10 2004 34
11 2005 64
12 2006 46
13 2007 46
14 2008 17
15 2009 41
16 2010 30
17 2011 40
18 2012 47
19 2013 40
20 2014 21
21 2015 56
22 2016 27
23 2017 45
24 2018 22
25 2019 45
So far, so good. I know I could expand the code by adding (.., x2=df$x2, x3=df$x3, .. xn) to the list argument in the code above, which I tried and it works.
But how do I get them all at once?
I tried the following codes:
aggregate(.~(date, year, month, day), by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error: unexpected ',' in "aggregate(.~(date,"
aggregate(.~date+year+month+day, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in as.data.frame.default(data, optional = TRUE) :
  cannot coerce class ‘"function"’ to a data.frame
aggregate(. ~ date + year + month + day, data = df, by=list(year=df$year), function(x) sum(rle(x)$values>120, na.rm=TRUE))
Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
  arguments must have same length
But unfortunately none of them works. Could someone please give me a hint as to where my mistake is?
Here is an answer that uses base R. Since none of the values in the example data are above 120, we set a criterion of above 70.
data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))
First, we create a subset of the data that contains all columns whose names contain x.
theCols <- data[,colnames(data)[grepl("x",colnames(data))]]
Second, we convert these columns to TRUE or FALSE based on whether the value is greater than 70, and cbind() the year onto the resulting logical values.
x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))
Finally, we use aggregate across all columns other than year and sum the columns.
aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)
...and the output:
Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0
>
Note that by using colnames() to extract the columns that contain x and ncol() in the aggregate() function, we make this a general solution that will handle a varying number of x locations.
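For illustration, the same base R steps could be wrapped in a small helper so the threshold becomes a parameter. This is a sketch; count_exceed is a name introduced here, not part of the original answer:
count_exceed <- function(data, threshold = 70) {
  # keep only the measurement columns (names containing "x")
  theCols <- data[, grepl("x", colnames(data))]
  # convert to TRUE/FALSE by threshold, then re-attach the year
  x_logical <- cbind(year = data$year, as.data.frame(theCols > threshold))
  # count TRUE values per year and per column
  aggregate(x_logical[-1], by = list(year = x_logical$year), sum, na.rm = TRUE)
}

count_exceed(data, 70)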
Two tidyverse solutions
A tidyverse solution to the same problem is as follows. It includes the following steps.
Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.
Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.
Use pivot_wider() to convert the data back to one column for each x measurement.
...and the code is:
devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
mutate(.,across(starts_with("x"),~if_else(. > 70,TRUE,FALSE))) %>%
select(-year,-month,-day) %>% group_by(date) %>%
pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>%
mutate(year = year(date)) %>% group_by(year,measure) %>%
select(-date) %>%
summarise(value = sum(value,na.rm=TRUE)) %>%
pivot_wider(id_cols = year,names_from = "measure",
values_from = value)
...and the output, which matches the Base R solution that I originally posted:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0
>
...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year,name) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)
...and the output:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0
>
Conclusions
As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.
That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
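For example, with dplyr 1.0.0 or later the per-column yearly counts can be computed directly on the wide data; a minimal sketch using the 70 threshold from the base R answer above:
library(dplyr)

data %>%
  group_by(year) %>%
  summarise(across(starts_with("x"), ~ sum(.x > 70, na.rm = TRUE)))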
This should solve your problem
library(tidyverse)
library(lubridate)
df_example <- structure(list(date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"), x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125), x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625), x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875), x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375), year = c(1995, 1995, 1995, 1995,
1995, 1995), month = c(1, 1, 1, 1, 1, 1), day = c(1, 2, 3,
4, 5, 6)), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year(date)) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE))

Need to create a variable based on the equality of other variables

I have a dataset called CSES (Comparative Study of Electoral Systems) where each row corresponds to an individual (one interview in a public opinion survey), from many countries, in many different years.
I need to create a variable which identifies the ideology of the party each person voted, as perceived by this same person.
However, the dataset identifies this perceived ideology of each party (as it does many other variables) by letters A, B, C, etc. When it comes to identifying WHICH PARTY each person voted for, it uses a UNIQUE CODE NUMBER that does not correspond to these letters across different years (i.e., the same party can have a different letter in different years; and, of course, it is never the same party across different countries, since each country has its own political parties).
Fictitious data to help clarify, reproduce and create a code:
Let’s say:
country = c(1,1,1,1,2,2,2,2,3,3,3,3)
year = c (2000,2000,2004,2004, 2002,2002,2004,2008,2000,2000,2000,2000)
party_A_number = c(11,11,12,12,21,21,22,23,31,31,31,31)
party_B_number = c(12, 12, 11, 11, 22,22,21,22,32,32,32,32)
party_C_number = c(13,13,13,13,23,23,23,21,33,33,33,33)
party_voted = c(12,13,12,11,21,24,23,22,31,32,33,31)
ideology_party_A <- floor(runif (12, min=1, max=10))
ideology_party_B <- floor(runif (12, min=1, max=10))
ideology_party_C <- floor(runif (12, min=1, max=10))
Let’s call the variable I want to create “ideology_voted”:
I need something like:
IF party_A_number == party_voted THEN ideology_voted = ideology_party_A
IF party_B_number == party_voted, THEN ideology_voted == ideology_party_B
IF party_C_number == party_voted, THEN ideology_voted == ideology_party_C
The real dataset has 9 letters for (up to) 9 main parties in each country, dozens of countries and election years. Therefore, it would be great to have code where I could iterate through letters A-I instead of writing "if voted party A, then ...; if voted party B, then ...".
Nevertheless, I am having trouble even with longer, repetitive code (one transformation for each party letter).
library(tidyverse)
df <- tibble(
country = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
year = c(2000, 2000, 2004, 2004, 2002, 2002, 2004, 2008, 2000, 2000, 2000, 2000),
party_A_number = c(11, 11, 12, 12, 21, 21, 22, 23, 31, 31, 31, 31),
party_B_number = c(12, 12, 11, 11, 22, 22, 21, 22, 32, 32, 32, 32),
party_C_number = c(13, 13, 13, 13, 23, 23, 23, 21, 33, 33, 33, 33),
party_voted = c(12, 13, 12, 11, 21, 24, 23, 22, 31, 32, 33, 31),
ideology_party_A = floor(runif (12, min = 1, max = 10)),
ideology_party_B = floor(runif (12, min = 1, max = 10)),
ideology_party_C = floor(runif (12, min = 1, max = 10))
)
> df
# A tibble: 12 x 9
country year party_A_number party_B_number party_C_number party_voted ideology_party_A ideology_party_B
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2000 11 12 13 12 9 3
2 1 2000 11 12 13 13 2 6
3 1 2004 12 11 13 12 3 8
4 1 2004 12 11 13 11 7 8
5 2 2002 21 22 23 21 2 7
6 2 2002 21 22 23 24 8 2
7 2 2004 22 21 23 23 1 7
8 2 2008 23 22 21 22 7 7
9 3 2000 31 32 33 31 4 3
10 3 2000 31 32 33 32 7 5
11 3 2000 31 32 33 33 1 6
12 3 2000 31 32 33 31 2 1
# ... with 1 more variable: ideology_party_C <dbl>
It seems you're after conditioning using case_when:
ideology_voted <- df %>% transmute(
ideology_voted = case_when(
party_A_number == party_voted ~ ideology_party_A,
party_B_number == party_voted ~ ideology_party_B,
party_C_number == party_voted ~ ideology_party_C,
TRUE ~ party_voted
)
)
> ideology_voted
# A tibble: 12 x 1
ideology_voted
<dbl>
1 3
2 7
3 3
4 8
5 2
6 24
7 8
8 7
9 4
10 5
11 6
12 2
Note that case_when() uses the first condition that evaluates to TRUE, so if more than one condition happens to be true for a row, the earliest match wins.
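To iterate over the letters A-I as the question asks, rather than writing one case_when() branch per letter by hand, one option is a plain loop over the letter-specific column names. This is a sketch, not part of the original answer; letters_used is a name introduced here, and non-matching rows are left as NA instead of falling back to party_voted:
letters_used <- c("A", "B", "C")   # use LETTERS[1:9] for the full dataset

df$ideology_voted <- NA_real_
for (l in letters_used) {
  # rows where the party coded by letter l is the party the person voted for
  hit <- which(df[[paste0("party_", l, "_number")]] == df$party_voted)
  df$ideology_voted[hit] <- df[[paste0("ideology_party_", l)]][hit]
}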
