I am stuck with reshaping data in R and I hope someone could help me out.
The data looks like this:
ID  measurement  biomarker_x  biomarker_y
1   1            10           100
1   2            11           110
1   3            12           120
2   1            20           200
2   2            19           190
2   3            21           210
It needs to be reshaped to look like this:
ID  biomarker  measurement1  measurement2  measurement3
1   x          10            11            12
1   y          100           110           120
2   x          20            19            21
2   y          200           190           210
I tried to work with tidyr::gather and spread and with pivot_wider and pivot_longer but failed.
If someone would have a solution for applying this on multiple biomarkers I would be very thankful.
This can be done with tidyr alone:
library(tidyr)
df <- read.table(header = T, text = 'ID measurement biomarker_x biomarker_y
1 1 10 100
1 2 11 110
1 3 12 120
2 1 20 200
2 2 19 190
2 3 21 210')
df %>% pivot_longer(starts_with('biomarker'), names_to = 'biomarker', names_prefix = 'biomarker_') %>%
pivot_wider(names_from = measurement, values_from = value, names_prefix = 'measurement_')
#> # A tibble: 4 x 5
#> ID biomarker measurement_1 measurement_2 measurement_3
#> <int> <chr> <int> <int> <int>
#> 1 1 x 10 11 12
#> 2 1 y 100 110 120
#> 3 2 x 20 19 21
#> 4 2 y 200 190 210
Created on 2021-07-06 by the reprex package (v2.0.0)
Using recast from reshape2
library(reshape2)
names(df1)[-(1:2)] <- sub("biomarker_", "", names(df1)[-(1:2)])
reshape2::recast(df1, id.var = c("ID", "measurement"),
ID + variable ~ paste0('measurement', measurement), value.var = 'value')
-output
ID variable measurement1 measurement2 measurement3
1 1 x 10 11 12
2 1 y 100 110 120
3 2 x 20 19 21
4 2 y 200 190 210
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), measurement = c(1L,
2L, 3L, 1L, 2L, 3L), biomarker_x = c(10L, 11L, 12L, 20L, 19L,
21L), biomarker_y = c(100L, 110L, 120L, 200L, 190L, 210L)),
class = "data.frame", row.names = c(NA,
-6L))
Does this work:
library(dplyr)
library(tidyr)
library(stringr)
df %>% pivot_longer(-c(ID, measurement), names_to = 'biomarker') %>% mutate(biomarker = str_extract(biomarker, '[xy]$')) %>%
pivot_wider(id_cols = c(ID, biomarker), names_from = measurement, names_prefix = 'measurement', values_from = value)
# A tibble: 4 x 5
ID biomarker measurement1 measurement2 measurement3
<int> <chr> <int> <int> <int>
1 1 x 10 11 12
2 1 y 100 110 120
3 2 x 20 19 21
4 2 y 200 190 210
Here is one approach.
library(tidyverse)
dat |>
pivot_longer(
cols = starts_with("bio"),
names_to = "biomarker"
) |>
mutate(biomarker = str_remove(biomarker, "biomarker_")) |>
pivot_wider(
names_from = measurement,
values_from = value,
names_prefix = "measurement"
)
# # A tibble: 4 x 5
# ID biomarker measurement1 measurement2 measurement3
# <int> <chr> <int> <int> <int>
# 1 1 x 10 11 12
# 2 1 y 100 110 120
# 3 2 x 20 19 21
# 4 2 y 200 190 210
A pure base R option using nested `reshape`:
reshape(
reshape(
df,
direction = "long",
idvar = c("ID", "measurement"),
varying = -(1:2),
sep = "_"
),
direction = "wide",
idvar = c("ID", "time"),
timevar = "measurement"
)
gives
ID time biomarker.1 biomarker.2 biomarker.3
1.1.x 1 x 10 11 12
2.1.x 2 x 20 19 21
1.1.y 1 y 100 110 120
2.1.y 2 y 200 190 210
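The column names and row order above differ from the layout requested in the question. A minimal sketch, assuming the same df as in the tidyr answer, that renames and sorts the result in base R; the object name wide is my own and not part of the original answer.

```r
# Same data as in the tidyr answer above
df <- read.table(header = TRUE, text = "ID measurement biomarker_x biomarker_y
1 1 10 100
1 2 11 110
1 3 12 120
2 1 20 200
2 2 19 190
2 3 21 210")

# Nested reshape: long over the biomarker columns, then wide over measurement
wide <- reshape(
  reshape(df, direction = "long", idvar = c("ID", "measurement"),
          varying = -(1:2), sep = "_"),
  direction = "wide", idvar = c("ID", "time"), timevar = "measurement"
)

# Rename to match the requested layout
names(wide)[names(wide) == "time"] <- "biomarker"
names(wide) <- sub("^biomarker\\.", "measurement", names(wide))

# Sort by ID and biomarker, and drop the composite row names
wide <- wide[order(wide$ID, wide$biomarker), ]
rownames(wide) <- NULL
wide
```

The renaming relies on the default names that reshape generates (time and biomarker.<measurement>), so it would need adjusting if timevar or v.names were set explicitly.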
Related
I have the following table that I want to modify
Debt2017 Debt2018 Debt2019 Cash2017 Cash2018 Cash2019 Year Other
2 4 3 5 6 7 2018 x
3 8 9 7 9 9 2017 y
So that the result is the following
Debt Cash FLAG After Other
2 5 0 x
3 7 1 x
8 9 1 y
9 9 1 y
Basically, I want to change the data so that I have the different years in different rows, eliminating the values for the year indicated in the column "Year" and adding a FLAG that tells me whether the data indicated in the row is from a previous (0) or following (1) year (with respect to the year indicated in the column "Year").
Furthermore, I also want to keep the column "Other".
Does anybody know how to do it in R?
library(dplyr)
library(tidyr)
df %>%
pivot_longer(Debt2017:Cash2019,
names_to = c(".value", "Year2"),
names_pattern = "(\\D+)(\\d+)") %>%
filter(Year != Year2) %>%
mutate(flag = +(Year2 > Year))
# # A tibble: 4 × 6
# Year Other Year2 Debt Cash flag
# <int> <chr> <chr> <int> <int> <int>
# 1 2018 x 2017 2 5 0
# 2 2018 x 2019 3 7 1
# 3 2017 y 2018 8 9 1
# 4 2017 y 2019 9 9 1
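In the output above, Year2 is a character column while Year is an integer, so Year2 > Year is evaluated as a string comparison. That happens to work for four-digit years. A hedged variant, assuming a tidyr version that supports the names_transform argument of pivot_longer, converts the captured year to integer so the comparison is numeric; the object name res is my own.

```r
library(dplyr)
library(tidyr)

# Same data as in the answer, written out with data.frame()
df <- data.frame(
  Debt2017 = c(2L, 3L), Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
  Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
  Year = c(2018L, 2017L), Other = c("x", "y")
)

res <- df %>%
  pivot_longer(Debt2017:Cash2019,
               # ".value" keeps the text part (Debt, Cash) as column names,
               # the digits go into the new column Year2
               names_to = c(".value", "Year2"),
               names_pattern = "(\\D+)(\\d+)",
               # convert the year captured from the column name to integer
               names_transform = list(Year2 = as.integer)) %>%
  filter(Year != Year2) %>%
  mutate(flag = as.integer(Year2 > Year))
res
```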
Data
df <- structure(list(Debt2017 = 2:3, Debt2018 = c(4L, 8L), Debt2019 = c(3L, 9L),
Cash2017 = c(5L, 7L), Cash2018 = c(6L, 9L), Cash2019 = c(7L, 9L),
Year = 2018:2017, Other = c("x", "y")), class = "data.frame", row.names = c(NA, -2L))
I'm working with a dataframe of trial participant blood test results, with some sporadic missing values (analyte failed). Fortunately we have two time points quite close together, so for missing values at timepoint 1, I'm hoping to impute the corresponding value from timepoint 2.
I am just wondering, if there is an elegant way to code this in R/tidyverse for multiple test results?
Here is some sample data:
test <- tibble(
timepoint = c(1,1,1,1,1,2,2,2,2,2),
fst_test = c(NA,sample(1:40,9, replace = FALSE)),
scd_test = c(sample(1:20,8, replace = FALSE),NA,NA))
So far I have been pivoting wider, then manually coalescing the corresponding test results, like so:
test %>%
pivot_wider(names_from = timepoint,
values_from = fst_test:scd_test) %>%
mutate(fst_test_imputed = coalesce(fst_test_1, fst_test_2),
scd_test_imputed = coalesce(scd_test_1, scd_test_2)) %>%
select(ID, fst_test_imputed, scd_test_imputed)
However for 15 tests this is cumbersome...
I thought there might be an elegant R / dplyr solution for this situation?
Many thanks in advance for your help!!
We can use fill() after creating a grouping column with data.table's rowid() on 'timepoint', so that the n-th row of timepoint 1 is paired with the n-th row of timepoint 2. Within each pair, fill() with .direction = "updown" first replaces an NA with the following non-NA value and then falls back to the preceding one. If only the NAs at timepoint 1 should be filled, change this to .direction = "up".
library(dplyr)
library(tidyr)
library(data.table)
test %>%
group_by(grp = rowid(timepoint)) %>%
fill(fst_test, scd_test, .direction = "updown") %>%
ungroup %>%
select(-grp)
data
test <- structure(list(timepoint = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
fst_test = c(NA,
16L, 30L, 29L, 14L, 32L, 21L, 20L, 3L, 23L), scd_test = c(18L,
17L, 8L, 20L, 1L, 10L, 14L, 19L, NA, NA)),
class = "data.frame", row.names = c(NA,
-10L))
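The same pairing idea can also be sketched in base R without extra packages. This is only an illustration, assuming (as the answer above does) that the n-th row of timepoint 1 and the n-th row of timepoint 2 belong to the same participant; the names t1, t2 and test_cols are my own.

```r
# Same data as above, written out with data.frame()
test <- data.frame(
  timepoint = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  fst_test  = c(NA, 16L, 30L, 29L, 14L, 32L, 21L, 20L, 3L, 23L),
  scd_test  = c(18L, 17L, 8L, 20L, 1L, 10L, 14L, 19L, NA, NA)
)

# Split by timepoint; rows are assumed to be in the same participant order
t1 <- test[test$timepoint == 1, ]
t2 <- test[test$timepoint == 2, ]

# Every column except 'timepoint' is treated as a test result,
# so this scales to 15 tests without listing them
test_cols <- setdiff(names(test), "timepoint")

# Replace NA at timepoint 1 with the paired timepoint 2 value
t1[test_cols] <- Map(function(a, b) ifelse(is.na(a), b, a),
                     t1[test_cols], t2[test_cols])
t1
```

If the real data has a participant ID column, merging t1 and t2 on that ID would be safer than relying on row order.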
You could pivot your data so that "timepoint" defines the columns, with all your tests on the rows. In order to perform this pivot without creating list-cols, we'll have to group by "timepoint" and create an index for each row within the group:
test <- tibble(
timepoint = c(1,1,1,1,1,2,2,2,2,2),
fst_test = c(NA,sample(1:40,9, replace =F)),
scd_test = c(sample(1:20,8, replace = F),NA,NA)
)
test_pivoted <- test %>%
group_by(timepoint) %>%
mutate(idx = row_number()) %>%
pivot_longer(-c(timepoint, idx)) %>%
pivot_wider(names_from = timepoint, values_from = value, names_prefix = 'timepoint')
idx name timepoint1 timepoint2
<int> <chr> <int> <int>
1 1 fst_test NA 39
2 1 scd_test 5 10
3 2 fst_test 37 7
4 2 scd_test 20 3
5 3 fst_test 5 26
6 3 scd_test 19 11
7 4 fst_test 17 28
8 4 scd_test 9 NA
9 5 fst_test 14 32
10 5 scd_test 8 NA
Now we can coalesce once across the two timepoints for all tests:
test_pivoted %>%
mutate(
imputed = coalesce(timepoint1, timepoint2)
)
idx name timepoint1 timepoint2 imputed
<int> <chr> <int> <int> <int>
1 1 fst_test NA 39 39
2 1 scd_test 5 10 5
3 2 fst_test 37 7 37
4 2 scd_test 20 3 20
5 3 fst_test 5 26 5
6 3 scd_test 19 11 19
7 4 fst_test 17 28 17
8 4 scd_test 9 NA 9
9 5 fst_test 14 32 14
10 5 scd_test 8 NA 8
And if you wanted to clean up the result a little more:
test_pivoted %>%
mutate(
imputed = coalesce(timepoint1, timepoint2)
) %>%
select(name, idx, imputed) %>%
pivot_wider(names_from = name, values_from = imputed)
idx fst_test scd_test
<int> <int> <int>
1 1 39 5
2 2 37 20
3 3 5 19
4 4 17 9
5 5 14 8
Hi I'm analysing the pattern of spending for individuals before they died. My dataset contains individuals' monthly spending and their dates of death. The dataset looks similar to this:
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
Each column denotes the month of the year. For example, "2018_11" means November 2018. The number in each cell denotes the spending in that specific month.
I would like to construct a data frame which contains the spending data of each individual in their last 0-12 months. It will look like this:
ID last_12_month last_11_month ...... last_1_month last_0_month date_of_death
A 6 23 30 1 2020-01-02
B 2 5 30 31 2019-11-15
Each individual died at a different time. For example, individual A died on 2020-01-02, so the data of the "last_0_month" for this person should be extracted from the column "2020_01", and that of "last_12_month" extracted from "2019_01"; individual B died on 2019-11-15, so the data of "last_0_month" for this person should be extracted from the column "2019_11", and that of "last_12_month" should be extracted from the column "2018_11".
I will be really grateful for your help.
Using data.table and lubridate packages
library(data.table)
library(lubridate)
setDT(dt)
dt <- melt(dt, id.vars = c("ID", "date_of_death"))
dt[, since_death := interval(ym(variable), ymd(date_of_death)) %/% months(1)]
dt <- dcast(dt[since_death %between% c(0, 12)], ID + date_of_death ~ since_death, value.var = "value", fun.aggregate = sum)
setcolorder(dt, c("ID", "date_of_death", rev(names(dt)[3:15])))
setnames(dt, old = names(dt)[3:15], new = paste("last", names(dt)[3:15], "month", sep = "_"))
Results
dt
# ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month
# 1: A 2020-01-02 6 23 23 5 6 30 1 15 6 7
# 2: B 2019-11-15 2 5 6 7 7 8 9 15 12 14
# last_2_month last_1_month last_0_month
# 1: 8 30 1
# 2: 31 30 31
Data
dt <- structure(list(ID = c("A", "B"), `2018_11` = c(15L, 2L), `2018_12` = c(14L,
5L), `2019_01` = c(6L, 6L), `2019_02` = c(23L, 7L), `2019_03` = c(23L,
7L), `2019_04` = c(5L, 8L), `2019_05` = c(6L, 9L), `2019_06` = c(30L,
15L), `2019_07` = c(1L, 12L), `2019_08` = 15:14, `2019_09` = c(6L,
31L), `2019_10` = c(7L, 30L), `2019_11` = c(8L, 31L), `2019_12` = c(30L,
0L), `2020_01` = 1:0, date_of_death = structure(c(18263L, 18215L
), class = c("IDate", "Date"))), row.names = c(NA, -2L), class = c("data.frame"))
Here you can find a similar approach to the one presented by @RuiBarradas, but using lubridate for extracting the difference in months:
library(dplyr)
library(tidyr)
library(lubridate)
# Initial data
df <- structure(list(
ID = c("A", "B"),
`2018_11` = c(15, 2),
`2018_12` = c(14, 5),
`2019_01` = c(6, 6),
`2019_02` = c(23, 7),
`2019_03` = c(23, 7),
`2019_04` = c(5, 8),
`2019_05` = c(6, 9),
`2019_06` = c(30, 15),
`2019_07` = c(1, 12),
`2019_08` = c(15, 14),
`2019_09` = c(6, 31),
`2019_10` = c(7, 30),
`2019_11` = c(8, 31),
`2019_12` = c(30, 0),
`2020_01` = c(1, 0),
date_of_death = c("2020-01-02", "2019-11-15")
),
row.names = c(NA, -2L),
class = "data.frame"
)
# Convert to longer all cols that start with 20 (e.g. 2020, 2021)
df_long <- df %>%
pivot_longer(starts_with("20"), names_to = "month")
# treatment
df_long <- df_long %>%
mutate(
# To date, just in case
date_of_death = as.Date(date_of_death),
# Need to reformat the colnames from (e.g.) 2021_01 to 2021-01-01
month_fmt = as.Date(paste0(gsub("_", "-", month), "-01")),
# End of month
month_fmt = ceiling_date(month_fmt, "month") - days(1),
# End of month for month of death
date_of_death_eom = ceiling_date(date_of_death, "month") - days(1),
# Difference in months (using end of months)
month_diff = round(time_length(
interval(month_fmt, date_of_death_eom),"month"),0)) %>%
# Select only months bw 0 and 12
filter(month_diff %in% 0:12) %>%
# Create labels for the next step
mutate(labs = paste0("last_", month_diff,"_month"))
# To wider
end <- df_long %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = labs,
values_from = value
)
end
#> # A tibble: 2 x 15
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month
#> <chr> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2020-01-02 6 23 23 5
#> 2 B 2019-11-15 2 5 6 7
#> # ... with 9 more variables: last_8_month <dbl>, last_7_month <dbl>,
#> # last_6_month <dbl>, last_5_month <dbl>, last_4_month <dbl>,
#> # last_3_month <dbl>, last_2_month <dbl>, last_1_month <dbl>,
#> # last_0_month <dbl>
Created on 2022-03-09 by the reprex package (v2.0.1)
Here is a tidyverse solution.
Reshape the data to long format, coerce the date columns to class "Date", use Dirk Eddelbuettel's accepted answer to this question to compute the date differences in months and keep the rows with month differences between 0 and 12.
This grouped long format is probably more useful: I use it to compute means by group and to plot the spending over the last 12 months before death. Since the question asks for a wide format, the output data set spending12_wide is also created.
options(width=205)
df1 <- read.table(text = "
ID 2018_11 2018_12 2019_01 2019_02 2019_03 2019_04 2019_05 2019_06 2019_07 2019_08 2019_09 2019_10 2019_11 2019_12 2020_01 date_of_death
A 15 14 6 23 23 5 6 30 1 15 6 7 8 30 1 2020-01-02
B 2 5 6 7 7 8 9 15 12 14 31 30 31 0 0 2019-11-15
", header = TRUE, check.names = FALSE)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
library(ggplot2)
# Dirk's functions
monnb <- function(d) {
lt <- as.POSIXlt(as.Date(d, origin = "1900-01-01"))
lt$year*12 + lt$mon
}
# compute a month difference as a difference between two monnb's
diffmon <- function(d1, d2) { monnb(d2) - monnb(d1) }
spending12 <- df1 %>%
pivot_longer(cols = starts_with('20'), names_to = "month") %>%
mutate(month = as.Date(paste0(month, "_01"), "%Y_%m_%d"),
date_of_death = as.Date(date_of_death)) %>%
group_by(ID, date_of_death) %>%
mutate(diffm = diffmon(month, date_of_death)) %>%
filter(diffm >= 0 & diffm <= 12)
spending12 %>% summarise(spending = mean(value), .groups = "drop")
#> # A tibble: 2 x 3
#> ID date_of_death spending
#> <chr> <date> <dbl>
#> 1 A 2020-01-02 12.4
#> 2 B 2019-11-15 13.6
spending12_wide <- spending12 %>%
mutate(month = zoo::as.yearmon(month)) %>%
pivot_wider(
id_cols = c(ID, date_of_death),
names_from = diffm,
names_glue = "last_{.name}_month",
values_from = value
)
spending12_wide
#> # A tibble: 2 x 15
#> # Groups: ID, date_of_death [2]
#> ID date_of_death last_12_month last_11_month last_10_month last_9_month last_8_month last_7_month last_6_month last_5_month last_4_month last_3_month last_2_month last_1_month last_0_month
#> <chr> <date> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 A 2020-01-02 6 23 23 5 6 30 1 15 6 7 8 30 1
#> 2 B 2019-11-15 2 5 6 7 7 8 9 15 12 14 31 30 31
ggplot(spending12, aes(month, value, color = ID)) +
geom_line() +
geom_point()
Created on 2022-03-09 by the reprex package (v2.0.1)
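The month arithmetic behind diffmon can be checked on the examples from the question. A small base R sketch; the helper names month_index and months_before are my own, not part of the answer, and only whole calendar months are counted (the day of death is ignored).

```r
# Number of months since 1900-01, so that differences are whole calendar months
month_index <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  lt$year * 12 + lt$mon
}

# Months between a spending column such as "2019_01" and the date of death
months_before <- function(month_col, death) {
  month_start <- as.Date(paste0(month_col, "_01"), "%Y_%m_%d")
  month_index(death) - month_index(month_start)
}

# Individual A died on 2020-01-02
a0  <- months_before("2020_01", "2020-01-02")  # column used for last_0_month
a12 <- months_before("2019_01", "2020-01-02")  # column used for last_12_month

# Individual B died on 2019-11-15
b12 <- months_before("2018_11", "2019-11-15")  # column used for last_12_month

c(a0 = a0, a12 = a12, b12 = b12)
```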
I have a quite long table that looks like this:
library(tidyverse)
x=structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), loc = c("A", "B", "?", "A", "B", "?"), count1 = c(10L, 20L, 50L, 5L, 22L, 10L), count2 = c(324L, 564L, 121L, 87L, 66L, 445L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
x
#> # A tibble: 6 x 4
#> id loc count1 count2
#> <int> <chr> <int> <int>
#> 1 1 A 10 324
#> 2 1 B 20 564
#> 3 1 ? 50 121
#> 4 2 A 5 87
#> 5 2 B 22 66
#> 6 2 ? 10 445
For each id, I would like to distribute the counts where loc is unknown to loc A and loc B, in proportion to their known counts.
For instance, for id==1 and column count1, loc A represents 1/3 of the known total (10 out of 30), therefore 1/3 of 50 is allocated to group A and 2/3 of 50 is allocated to group B, giving 26.7 and 53.3 respectively. Rows with unknown loc should then be dropped.
There is (currently) no other possible value for loc than A, B or ?.
The ratio A/(A+B) is different for every count and for every id.
I tried multiple ways of doing this, involving pivoting and transposing, but I never managed to achieve the intended result.
Here is the complete expected output:
expected=structure(list(id = c(1L, 1L, 2L, 2L), loc = c("A", "B", "A", "B"), count1 = c(26.67, 53.33, 6.85, 30.15), count2 = c(368.15, 640.85, 340.04, 257.96)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
expected
#> # A tibble: 4 x 4
#> id loc count1 count2
#> <int> <chr> <dbl> <dbl>
#> 1 1 A 26.7 368.
#> 2 1 B 53.3 641.
#> 3 2 A 6.85 340.
#> 4 2 B 30.2 258.
#Totals are obviously the same:
x %>% group_by(id) %>% summarise(across(count1:count2, sum))
#> # A tibble: 2 x 3
#> id count1 count2
#> <int> <int> <int>
#> 1 1 80 1009
#> 2 2 37 598
expected %>% group_by(id) %>% summarise(across(count1:count2, sum))
#> # A tibble: 2 x 3
#> id count1 count2
#> <int> <dbl> <dbl>
#> 1 1 80 1009
#> 2 2 37 598
Created on 2021-07-16 by the reprex package (v2.0.0)
Perhaps this helps
library(dplyr)
x %>%
group_by(id) %>%
mutate(across(starts_with('count'), ~ {
tmp <- .
i1 <- loc == '?'
tmp[!i1] <- tmp[!i1] + tmp[!i1]/
sum(tmp[!i1]) * tmp[i1]
tmp})) %>%
ungroup %>%
filter(loc != '?')
-output
# A tibble: 4 x 4
id loc count1 count2
<int> <chr> <dbl> <dbl>
1 1 A 26.7 368.
2 1 B 53.3 641.
3 2 A 6.85 340.
4 2 B 30.1 258.
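The arithmetic inside the across() call can be checked by hand for id == 1 and count1. A minimal base R sketch; allocate_unknown is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper: spread 'unknown' over the known counts in proportion
# to each known count's share of the known total
allocate_unknown <- function(known, unknown) {
  known + unknown * known / sum(known)
}

# id == 1, count1: A = 10, B = 20, unknown = 50
res <- allocate_unknown(c(A = 10, B = 20), 50)
round(res, 2)

# The group total is preserved: 10 + 20 + 50
sum(res)
```

A receives 10 + 50 * 10/30 and B receives 20 + 50 * 20/30, which matches the first two rows of the expected output.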
A bit verbose but will do the trick:
library(dplyr)
x %>%
filter(loc != "?") %>%
left_join(x %>%
filter(loc == "?"), by = "id") %>%
group_by(id) %>%
mutate(across(ends_with(".x") & !contains("loc"), ~ prop.table(.x), .names = '{.col}_prop'),
across(ends_with(".x") & !contains("loc"), ~
get(gsub(".x", ".y", cur_column())) * get(paste(cur_column(), "_prop", sep = "")) + .x)) %>%
select(1:4) %>%
rename_with(~ gsub(".x", "", .), !id)
# A tibble: 4 x 4
# Groups: id [2]
id loc count1 count2
<int> <chr> <dbl> <dbl>
1 1 A 26.7 368.
2 1 B 53.3 641.
3 2 A 6.85 340.
4 2 B 30.1 258.
Here is an intuitive solution with data.table
library(data.table)
x=structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), loc = c("A", "B", "?", "A", "B", "?"), count1 = c(10L, 20L, 50L, 5L, 22L, 10L), count2 = c(324L, 564L, 121L, 87L, 66L, 445L)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
setDT(x)
x[,`:=`(count1 = as.double(count1),
count2 = as.double(count2))]
x[,`:=`(count1 = fifelse(loc != "?", count1 + count1[3] * count1/(count1[1] + count1[2]), count1),
count2 = fifelse(loc != "?", count2 + count2[3] * count2/(count2[1] + count2[2]), count2)),
by=id][loc != "?"]
#> id loc count1 count2
#> 1: 1 A 26.666667 368.1486
#> 2: 1 B 53.333333 640.8514
#> 3: 2 A 6.851852 340.0392
#> 4: 2 B 30.148148 257.9608
Created on 2021-07-17 by the reprex package (v2.0.0)
For the following data - I would like to count the number of students per class each year.
Class Students Gender Height Year_1999 Year_2000 Year_2001 Year_2002
1 Mark M 180 80 54 22 12
2 John M 234 0 59 32 62
1 Tom M 124 0 53 26 12
2 Jane F 180 80 54 22 0
3 Kim F 140 0 2 3 32
The output should be
Class Year_1999 Year_2000 Year_2001 Year_2002
1 1 2 2 2
2 1 2 2 1
3 0 1 1 1
I tried the following but didn't have much luck
Number_obs = df %>%
group_by(class) %>%
summarise(count=n())
We can use summarise_at from dplyr. After grouping by 'Class', summarise_at applies a function to every column whose name matches 'Year'; here it counts the values that are not equal to 0 (as.logical() turns non-zero values into TRUE, and sum() counts the TRUEs)
library(dplyr)
df1 %>%
group_by(Class) %>%
summarise_at(vars(matches("Year")), list(~ sum(as.logical(.))))
# A tibble: 3 x 5
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
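Since dplyr 1.0.0 the scoped verbs such as summarise_at are superseded by across(). A sketch of the same count written that way, assuming dplyr 1.0.0 or later; the object name res is my own.

```r
library(dplyr)

# Same data as in the 'data' section of this answer
df1 <- data.frame(
  Class = c(1L, 2L, 1L, 2L, 3L),
  Students = c("Mark", "John", "Tom", "Jane", "Kim"),
  Gender = c("M", "M", "M", "F", "F"),
  Height = c(180L, 234L, 124L, 180L, 140L),
  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
  Year_2000 = c(54L, 59L, 53L, 54L, 2L),
  Year_2001 = c(22L, 32L, 26L, 22L, 3L),
  Year_2002 = c(12L, 62L, 12L, 0L, 32L)
)

# Count, per class and year, the students with a non-zero value
res <- df1 %>%
  group_by(Class) %>%
  summarise(across(starts_with("Year"), ~ sum(.x != 0)))
res
```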
Or we can gather into 'long' format, do the group_by operation on a single column and spread it to 'wide' format
library(tidyr)
df1 %>%
gather(key, val, matches("Year")) %>%
group_by(Class, key) %>%
summarise(val = sum(val != 0)) %>%
spread(key, val)
Or using data.table
library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(as.logical(x))), .(Class), .SDcols = 5:8]
Or using base R with aggregate
aggregate(.~ Class, df1[-(2:4)], function(x) sum(x != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
Or using rowsum
rowsum(+(!!df1[5:8]), df1$Class)
# Year_1999 Year_2000 Year_2001 Year_2002
#1 1 2 2 2
#2 1 2 2 1
#3 0 1 1 1
Or using colSums
t(sapply(split(as.data.frame(df1[5:8] != 0), df1$Class), colSums))
data
df1 <- structure(list(Class = c(1L, 2L, 1L, 2L, 3L), Students = c("Mark",
"John", "Tom", "Jane", "Kim"), Gender = c("M", "M", "M", "F",
"F"), Height = c(180L, 234L, 124L, 180L, 140L), Year_1999 = c(80L,
0L, 0L, 80L, 0L), Year_2000 = c(54L, 59L, 53L, 54L, 2L), Year_2001 = c(22L,
32L, 26L, 22L, 3L),
Year_2002 = c(12L, 62L, 12L, 0L, 32L)), class = "data.frame",
row.names = c(NA,
-5L))
Similar to @akrun's colSums solution, using by.
do.call(rbind, by(df[5:8] > 0, df[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# 1 1 2 2 2
# 2 1 2 2 1
# 3 0 1 1 1
or
Reduce(rbind, by(df[5:8] > 0, df[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# init 1 2 2 2
# 1 2 2 1
# 0 1 1 1
do.call is faster.
Using dplyr, we can use summarise_at
library(dplyr)
df %>%
group_by(Class) %>%
summarise_at(vars(starts_with("Year")), ~sum(. != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1