I have a data frame of values across successive years (columns) for unique individuals (rows). A dummy data example is provided here:
dt = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), `2015` = c(0,
0.8219178, 0, 0.1369863, 0, 1.369863, 0.2739726, 0.8219178, 5,
0), `2016` = c(0, 1.369863, 0, 0.2739726, 0, 0.2739726, 0, 3.2876712,
0, 0), `2017` = c(0.6849315, 0, 0, 0.6849315, 0, 0.5479452, 0,
0, 0, 0), `2018` = c(1.0958904, 0.5479452, 1.9178082, 0, 0, 0,
0, 0, 0, 3), `2019` = c(0, 0, 0, 1.0958904, 0, 0.9589041, 0.5479452,
0, 0, 0), `2020` = c(0.4383562, 0, 0, 0, 0.2739726, 0.6849315,
0, 0, 0, 0)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-10L))
I want to create a dataset where the maximum value that can appear for each individual in any year is 1. Where a value exceeds 1, I want to carry the excess over into the next year (column), add it to that year's value for that individual, and so on.
The expected result is:
dt_expected = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), `2015` = c(0,
0.8219178, 0, 0.1369863, 0, 1, 0.2739726, 0.8219178, 1, 0), `2016` = c(0,
1, 0, 0.2739726, 0, 0.6438356, 0, 1, 1, 0), `2017` = c(0.6849315,
0.369863, 0, 0.6849315, 0, 0.5479452, 0, 1, 1, 0), `2018` = c(1,
0.5479452, 1, 0, 0, 0, 0, 1, 1, 1), `2019` = c(0.0958904, 0,
0.9178082, 1, 0, 0.9589041, 0.5479452, 0.2876712, 1, 1), `2020` = c(0.4383562,
0, 0, 0.0958904, 0.2739726, 0.6849315, 0, 0, 0, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
I am at a total loss as to where to start with this problem, so any assistance achieving this using data.table would be greatly appreciated. My only thought is to use lapply with an ifelse call for the conditional component. Should I then be using rowSums or Reduce to shift the excess values across columns?
A translation of Martin Morgan's answer to data.table:
for (i in 2:(ncol(dt) - 1)) {
  x = dt[[i]]
  set(dt, j = i, value = pmin(x, 1))                         # cap the current year at 1
  set(dt, j = i + 1, value = dt[[i + 1L]] + pmax(x - 1, 0))  # carry the excess into the next year
}
Not particularly pretty or efficient, but as a starting point I used pmin() and pmax() to update each year (and the subsequent year) iteratively. The current year becomes the minimum of the current year and 1 (pmin(x, 1)); the subsequent year becomes its current value plus the excess of the current year (pmax(x - 1, 0)).
update <- function(df) {
result = df
for (idx in 2:(ncol(df) - 1)) {
x = result[[ idx ]]
result[[ idx ]] = pmin(x, 1)
result[[ idx + 1 ]] = result[[ idx + 1 ]] + pmax(x - 1, 0)
}
result
}
We have
> all.equal(update(dt), dt_expected)
[1] TRUE
I don't know how to translate this into efficient data.table syntax, but the function 'works' as is on a data.table, update(as.data.table(dt)).
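One way to make the data.table version more vectorized (a sketch, not benchmarked) is to compute the carry entering each year for all rows at once with Reduce(..., accumulate = TRUE), then cap each column in a single pass. Note this also caps the final year, silently dropping any excess that would run past it:
library(data.table)
DT <- as.data.table(dt)
cols <- setdiff(names(DT), "ID")
## carry entering year j: pmax(0, value_{j-1} + carry_{j-1} - 1), one vector per column
carry <- Reduce(function(prev, v) pmax(0, v + prev - 1),
                as.list(DT[, ..cols]),
                init = 0, accumulate = TRUE)
for (k in seq_along(cols))
  set(DT, j = cols[k], value = pmin(1, DT[[cols[k]]] + carry[[k]]))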
Not sure if there is a more efficient way with built-in functions, but I simply wrote a recursive function that implements your described algorithm for a single row and then applied it over every row.
f <- function(l, rest = 0, out = list()) {
  # `rest` holds the excess carried over from earlier years
  if (length(l) == 0) return(unlist(out))
  if (l[[1]] + rest <= 1) {
    f(l[-1], rest = 0, out = append(out, list(l[[1]] + rest)))
  } else {
    f(l[-1], rest = l[[1]] + rest - 1, out = append(out, list(1)))
  }
}

dt[-1] <- apply(dt[-1], 1, f, simplify = FALSE) |>
  do.call(what = rbind)
dt
#> # A tibble: 10 × 7
#> ID `2015` `2016` `2017` `2018` `2019` `2020`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 0 0.685 1 0.0959 0.438
#> 2 2 0.822 1 0.370 0.548 0 0
#> 3 3 0 0 0 1 0.918 0
#> 4 4 0.137 0.274 0.685 0 1 0.0959
#> 5 5 0 0 0 0 0 0.274
#> 6 6 1 0.644 0.548 0 0.959 0.685
#> 7 7 0.274 0 0 0 0.548 0
#> 8 8 0.822 1 1 1 0.288 0
#> 9 9 1 1 1 1 1 0
#> 10 10 0 0 0 1 1 1
Created on 2022-03-25 by the reprex package (v2.0.1)
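For rows with many year columns the recursion above can get deep; the same algorithm can also be written as a plain loop (a sketch of an equivalent function, under the same assumptions):
f_iter <- function(v) {
  rest <- 0                    # excess carried over from earlier years
  for (i in seq_along(v)) {
    tot <- v[i] + rest
    v[i] <- min(tot, 1)        # cap the current year at 1
    rest <- max(tot - 1, 0)    # carry the remainder forward
  }
  v
}
## drop-in replacement:
## dt[-1] <- apply(dt[-1], 1, f_iter, simplify = FALSE) |> do.call(what = rbind)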
Here is my solution:
library(tidyverse) # pivot_longer()/pivot_wider() from tidyr, accumulate() from purrr

dt |>
pivot_longer(cols = -ID, "year") |>
arrange(ID, year) |>
group_by(ID) |>
mutate(x = {
r <- accumulate(value,
~max(0,.y + .x - 1),
.init = 0)
pmin(1, value + head(r, -1))
}) |>
select(x, year, ID) |>
pivot_wider(names_from = "year", values_from = "x")
##> # A tibble: 10 × 7
##> # Groups: ID [10]
##> ID `2015` `2016` `2017` `2018` `2019` `2020`
##> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##> 1 1 0 0 0.685 1 0.0959 0.438
##> 2 2 0.822 1 0.370 0.548 0 0
##> 3 3 0 0 0 1 0.918 0
##> 4 4 0.137 0.274 0.685 0 1 0.0959
##> 5 5 0 0 0 0 0 0.274
##> 6 6 1 0.644 0.548 0 0.959 0.685
##> 7 7 0.274 0 0 0 0.548 0
##> 8 8 0.822 1 1 1 0.288 0
##> 9 9 1 1 1 1 1 0
##> 10 10 0 0 0 1 1 1
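To see what accumulate() is doing: it threads the carried excess through the years, with .x the carry coming in and .y the current value. On ID 9's row (a 5 in 2015), for example:
library(purrr)
accumulate(c(5, 0, 0, 0, 0, 0), ~ max(0, .y + .x - 1), .init = 0)
##> [1] 0 4 3 2 1 0 0
head(r, -1) then drops the final carry so each year is paired with the excess entering it, and pmin(1, value + ...) caps the result at 1.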
I have this dataset:
structure(list(ID = c(1, 2, 3, 4, 6, 7), V = c(0, 0, 1, 1,
1, 0), Mus = c(1, 0, 1, 1, 1, 0), R = c(1, 0, 1, 1, 1, 1),
E = c(1, 0, 0, 1, 0, 0), S = c(1, 0, 1, 1, 1, 0), t = c(0,
0, 0, 1, 0, 0), score = c(1, 0.4, 1, 0.4, 0.4, 0.4)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"), na.action = structure(c(`5` = 5L,
`12` = 12L, `15` = 15L, `21` = 21L, `22` = 22L, `23` = 23L, `34` = 34L,
`44` = 44L, `46` = 46L, `52` = 52L, `56` = 56L, `57` = 57L, `58` = 58L
), class = "omit"))
I would like to reassign the score column in this way:
For each ID, if the number 1 occurs more than 3 times across the columns, score should be 1.
For each ID, if the number 1 occurs exactly 3 times, score should be 0.4.
For each ID, if the number 1 occurs fewer than 3 times, score should be 0.
Could you please suggest a way to do this via a for loop, dplyr, map, or the apply functions?
Thanks
This should work: calculate the number of 1s in a new ones column, then apply the conditions using case_when():
library(tidyverse)
df |>
rowwise() |>
mutate(ones = sum(c_across(V:t)),
score = case_when(
ones > 3 ~ 1,
ones == 3 ~ 0.4,
ones < 3 ~ 0
))
#> # A tibble: 6 × 9
#> # Rowwise:
#> ID V Mus R E S t score ones
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 1 1 1 1 0 1 4
#> 2 2 0 0 0 0 0 0 0 0
#> 3 3 1 1 1 0 1 0 1 4
#> 4 4 1 1 1 1 1 1 1 6
#> 5 6 1 1 1 0 1 0 1 4
#> 6 7 0 0 1 0 0 0 0 1
To make it tidier, you can use sum(c_across(V:t)) directly in case_when() so you don't need a new variable (though the sum is then recomputed for each condition):
df |>
rowwise() |>
mutate(score = case_when(
sum(c_across(V:t)) > 3 ~ 1,
sum(c_across(V:t)) == 3 ~ 0.4,
sum(c_across(V:t)) < 3 ~ 0
))
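If the data get large, a vectorized sketch that avoids rowwise() is to use rowSums() on across(), which should give the same result here:
df |>
  mutate(ones = rowSums(across(V:t)),
         score = case_when(
           ones > 3  ~ 1,
           ones == 3 ~ 0.4,
           TRUE      ~ 0
         ))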
I want to count the number of times a specific factor level occurs across multiple factor variables per row.
Simplified, I want to know how many times each factor level is chosen across specific variables per row (MemID).
Example data:
results=data.frame(MemID=c('A','B','C','D','E','F','G','H'),
value_a = c(1,2,1,4,5,1,4,0),
value_b = c(1,5,2,3,4,1,0,3),
value_c = c(3,5,2,1,1,1,2,1)
)
In this example, I want to know the frequency of each factor level in value_a and value_b for each MemID. How many times does A respond 1? How many times does A respond 2? And so on, for each level and each MemID, but only for value_a and value_b.
I would like the output to look something like this:
counts_by_level = data.frame(MemID=c('A','B','C','D','E','F','G','H'),
count_1 = c(2, 0, 1, 0, 0, 2, 0, 0),
count_2 = c(0, 1, 1, 0, 0, 0, 0, 0),
count_3 = c(0, 0, 0, 1, 0, 0, 0, 1),
count_4 = c(0, 0, 0, 1, 1, 0, 1, 0),
count_5 = c(0, 1, 0, 0, 1, 0, 0, 0))
I have been trying to use add_count and add_tally, as well as table(), and searching for other ways to answer this question, but I am struggling to identify specific factor levels across multiple variables and then output new columns with the counts of those levels for each row.
You could do something like this. Note that your expected output doesn't include a zero count, although there are some zero responses.
library(tidyverse)
results |>
select(-value_c) |>
pivot_longer(cols = starts_with("value"),
names_pattern = "(value)") |>
mutate(count = 1) |>
select(-name) |>
pivot_wider(names_from = value,
values_from = count,
names_prefix = "count_",
values_fill = 0,
values_fn = sum)
#> # A tibble: 8 x 7
#> MemID count_1 count_2 count_5 count_4 count_3 count_0
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 2 0 0 0 0 0
#> 2 B 0 1 1 0 0 0
#> 3 C 1 1 0 0 0 0
#> 4 D 0 0 0 1 1 0
#> 5 E 0 0 1 1 0 0
#> 6 F 2 0 0 0 0 0
#> 7 G 0 0 0 1 0 1
#> 8 H 0 0 0 0 1 1
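If you want to match counts_by_level exactly, you can finish the pipeline with select(MemID, all_of(paste0("count_", 1:5))) to drop count_0 and put the columns in order (assuming the levels really are 1 through 5).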
Another solution:
results %>%
group_by(MemID, value_a, value_b) %>%
summarise(n=n()) %>%
pivot_longer(c(value_a,value_b)) %>%
group_by(MemID, value) %>%
summarise(n=sum(n)) %>%
pivot_wider(id_cols = MemID,
            names_from = value, names_sort = TRUE, names_prefix = "count_",
            values_from = n, values_fn = sum, values_fill = 0)
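Since the question also mentions table(), a base R sketch of the same counts (like the first answer, it keeps a count_0 column) is to stack value_a and value_b and cross-tabulate against MemID:
tab <- table(rep(results$MemID, 2),                      # MemID repeated once per value column
             unlist(results[c("value_a", "value_b")]))   # stacked responses
counts <- data.frame(MemID = rownames(tab), as.data.frame.matrix(tab))
names(counts)[-1] <- paste0("count_", colnames(tab))
counts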
Hopefully this is straightforward, and I'm just thinking too hard. I have a matrix of peak counts from mass spec (MS) where peaks are rows and columns are sample names. The sample locations have several sampling sites and I would like to add the counts between sites within locations.
For example, one sample with three replicates is identified as "S19S_0010_Sed_Field_ICR.D_p2", "S19S_0010_Sed_Field_ICR.M_p2", and "S19S_0010_Sed_Field_ICR.U_p2": the same location, but downstream (D), midstream (M), and upstream (U). The first two samples have one count of a specific peak each, so I would like to merge the three samples into a single "S19S_0010_Sed_Field_ICR.all_p2" with two counts of that peak. Example dataset:
> dput(data.sed.ex)
structure(list(S19S_0004_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), S19S_0006_Sed_Field_ICR.D_p2 = c(0, 0, 0,
0, 0, 0, 1, 1, 0, 0), S19S_0006_Sed_Field_ICR.M_p2 = c(0, 0,
0, 0, 0, 0, 1, 0, 0, 0), S19S_0006_Sed_Field_ICR.U_p2 = c(0,
0, 0, 0, 0, 0, 1, 1, 0, 0), S19S_0008_Sed_Field_ICR.M_p15 = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0), S19S_0009_Sed_Field_ICR.M_p2 = c(0,
0, 1, 0, 0, 0, 1, 0, 0, 0), S19S_0009_Sed_Field_ICR.U_p2 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.D_p15 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.M_p15 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.U_p15 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c("200.002276", "200.015107",
"200.0564158", "200.0565393", "200.0578394", "200.0677581", "200.092796",
"200.1291723", "200.1292836", "200.9238455"), class = "data.frame")
TIA
Maybe wrangling to a long format can help. In this format you can summarise by one or more groups, e.g. sample, or sample and location, using sum(), mean(), sd(), among others. Hope this helps.
Convert to long format
## dd is the `data.sed.ex` object above
library(tidyverse)
ddLong <- dd %>%
rownames_to_column(var = "peak") %>%
pivot_longer(cols = matches("^S")) %>%
mutate(sample = gsub("(.*)\\.(.*)", "\\1", name), ## pull sample info
location = gsub("(.*)\\.([DMU])_(.*)", "\\2", name), ## pull D M U
p = gsub("(.*)\\.([DMU])_(p.*)", "\\3", name), ## get p2, p15
peak = as.numeric(peak)) ## coerce peak to numeric
ddLong
#> # A tibble: 100 × 6
#> peak name value sample location p
#> <dbl> <chr> <dbl> <chr> <chr> <chr>
#> 1 200. S19S_0004_Sed_Field_ICR.M_p15 0 S19S_0004_Sed_Field… M p15
#> 2 200. S19S_0006_Sed_Field_ICR.D_p2 0 S19S_0006_Sed_Field… D p2
#> 3 200. S19S_0006_Sed_Field_ICR.M_p2 0 S19S_0006_Sed_Field… M p2
#> 4 200. S19S_0006_Sed_Field_ICR.U_p2 0 S19S_0006_Sed_Field… U p2
#> 5 200. S19S_0008_Sed_Field_ICR.M_p15 0 S19S_0008_Sed_Field… M p15
#> 6 200. S19S_0009_Sed_Field_ICR.M_p2 0 S19S_0009_Sed_Field… M p2
#> 7 200. S19S_0009_Sed_Field_ICR.U_p2 0 S19S_0009_Sed_Field… U p2
#> 8 200. S19S_0010_Sed_Field_ICR.D_p15 0 S19S_0010_Sed_Field… D p15
#> 9 200. S19S_0010_Sed_Field_ICR.M_p15 0 S19S_0010_Sed_Field… M p15
#> 10 200. S19S_0010_Sed_Field_ICR.U_p15 0 S19S_0010_Sed_Field… U p15
#> # … with 90 more rows
Summarize by one or more groups
## summarise using group_by + verbs
ddLong %>%
group_by(sample, location) %>%
summarise(n = n(),
sum.value = sum(value),
mean.peak = mean(peak))
#> `summarise()` has grouped output by 'sample'. You can override using the
#> `.groups` argument.
#> # A tibble: 10 × 5
#> # Groups: sample [5]
#> sample location n sum.value mean.peak
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 S19S_0004_Sed_Field_ICR M 10 0 200.
#> 2 S19S_0006_Sed_Field_ICR D 10 2 200.
#> 3 S19S_0006_Sed_Field_ICR M 10 1 200.
#> 4 S19S_0006_Sed_Field_ICR U 10 2 200.
#> 5 S19S_0008_Sed_Field_ICR M 10 1 200.
#> 6 S19S_0009_Sed_Field_ICR M 10 2 200.
#> 7 S19S_0009_Sed_Field_ICR U 10 1 200.
#> 8 S19S_0010_Sed_Field_ICR D 10 1 200.
#> 9 S19S_0010_Sed_Field_ICR M 10 1 200.
#> 10 S19S_0010_Sed_Field_ICR U 10 0 200.
ddLong %>%
group_by(sample, p) %>%
summarise(n = n(),
sum.value = sum(value),
mean.peak = mean(peak))
#> `summarise()` has grouped output by 'sample'. You can override using the
#> `.groups` argument.
#> # A tibble: 5 × 5
#> # Groups: sample [5]
#> sample p n sum.value mean.peak
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 S19S_0004_Sed_Field_ICR p15 10 0 200.
#> 2 S19S_0006_Sed_Field_ICR p2 30 5 200.
#> 3 S19S_0008_Sed_Field_ICR p15 10 1 200.
#> 4 S19S_0009_Sed_Field_ICR p2 20 3 200.
#> 5 S19S_0010_Sed_Field_ICR p15 30 2 200.
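To get back to the wide layout the question describes (one ".all" column per sample and p suffix, summing across D/M/U), you can summarise per peak and pivot back; a sketch, reusing the parsed ddLong from above:
ddLong %>%
  group_by(peak, sample, p) %>%
  summarise(value = sum(value), .groups = "drop") %>%   # sum counts across D/M/U
  mutate(name = paste0(sample, ".all_", p)) %>%         # e.g. S19S_0010_Sed_Field_ICR.all_p15
  select(peak, name, value) %>%
  pivot_wider(names_from = name, values_from = value)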
I have a df and I would like to calculate a percentage (.x/.x[1] * 100) when row_number() > 2 and the first row in the same column is not 0. How can I do this with mutate(across(...)), and where do I add the .x[1] != 0 condition? My attempt so far:
mutate(across(.fns = ~ifelse(row_number() > 2 ... sprintf("%1.0f (%.2f%%)", .x, .x/.x[1] * 100), .x)))
df<-structure(list(Total = c(4, 2, 1, 1, 0, 0), `ELA` = c(0,
0, 0, 0, 0, 0), `Math` = c(4, 2, 1, 1, 0,
0), `PE` = c(0, 0, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
df %>%
mutate(across(
where(~.x[1] > 0),
~ifelse(
row_number() > 2,
sprintf("%1.0f (%.2f%%)", .x, .x/.x[1] * 100),
.x
)))
# # A tibble: 6 × 4
# Total ELA Math PE
# <chr> <dbl> <chr> <dbl>
# 1 4 0 4 0
# 2 2 0 2 0
# 3 1 (25.00%) 0 1 (25.00%) 0
# 4 1 (25.00%) 0 1 (25.00%) 0
# 5 0 (0.00%) 0 0 (0.00%) 0
# 6 0 (0.00%) 0 0 (0.00%) 0
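Note that the question literally asked for .x[1] != 0; where(~.x[1] != 0) works the same way and would also cover a negative first row.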
Have a look at the ?across help page for more examples.
I have a df that looks like this. It can be built using the following code:
structure(list(ID = c(1, 2, 3, 4, 5), Pass = c(0, 1, 1, 1, 1),
Math = c(0, 0, 1, 1, 1), ELA = c(0, 1, 0, 1, 0), PE = c(0,
0, 1, 1, 1)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
where Pass indicates whether a student passed any test at all. Now I want to build a new variable, Result, that captures which tests each student passed (for example "Pass: Math, PE", or "Not Pass" if none). What should I do?
Try the base R code below
q <- with(data.frame(which(df[-(1:2)] == 1, arr.ind = TRUE)),  # row/col positions of the 1s
          tapply(names(df[-(1:2)])[col],                       # test name for each 1
                 factor(row, levels = 1:nrow(df)),             # keep all-zero rows (as NA)
                 toString))                                    # collapse to "Math, PE"
df$Result <- ifelse(is.na(q), "Not Pass", paste0("Pass: ", q))
which gives
> df
# A tibble: 5 x 6
ID Pass Math ELA PE Result
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 0 0 0 0 Not Pass
2 2 1 0 1 0 Pass: ELA
3 3 1 1 0 1 Pass: Math, PE
4 4 1 1 1 1 Pass: Math, ELA, PE
5 5 1 1 0 1 Pass: Math, PE
Using dplyr with rowwise
library(dplyr)
library(stringr)
df1 %>%
rowwise %>%
mutate(Result = if(as.logical(Pass))
str_c('Pass: ', toString(names(select(., Math:PE))[as.logical(c_across(Math:PE))])) else 'Not pass' ) %>%
ungroup
# A tibble: 5 x 6
# ID Pass Math ELA PE Result
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 1 0 0 0 0 Not pass
#2 2 1 0 1 0 Pass: ELA
#3 3 1 1 0 1 Pass: Math, PE
#4 4 1 1 1 1 Pass: Math, ELA, PE
#5 5 1 1 0 1 Pass: Math, PE
data
df1 <- structure(list(ID = c(1, 2, 3, 4, 5), Pass = c(0, 1, 1, 1, 1),
Math = c(0, 0, 1, 1, 1), ELA = c(0, 1, 0, 1, 0), PE = c(0,
0, 1, 1, 1)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Here's one solution:
library(dplyr)
library(tidyr)    # pivot_longer()/pivot_wider()
library(magrittr) # %<>%
library(stringr)
df <- structure(list(ID = c(1, 2, 3, 4, 5), Pass = c(0, 1, 1, 1, 1),
Math = c(0, 0, 1, 1, 1), ELA = c(0, 1, 0, 1, 0), PE = c(0,
0, 1, 1, 1)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
## long format: one row per student/test
df %<>% pivot_longer(cols = -c(ID, Pass), names_to = "sub", values_to = "done")
## collect the passed tests per student ("NA" for failed ones)
df %<>% group_by(ID) %>% mutate(Result = paste0(ifelse(done == 1, sub, NA), collapse = ", ")) %>% ungroup()
## back to wide
df %<>% pivot_wider(names_from = sub, values_from = done)
## strip the NA placeholders, then trailing commas; all-NA rows become "Not pass"
df %<>% mutate(Result = paste0("Pass: ", str_replace_all(Result, "NA[, ]*", "")))
df %<>% mutate(Result = ifelse(str_detect(Result, "Pass: $"), "Not pass", str_replace_all(Result, ",[\\s]*$", "")))
df
# # A tibble: 5 x 6
# ID Pass Result Math ELA PE
# <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 0 Not pass 0 0 0
# 2 2 1 Pass: ELA 0 1 0
# 3 3 1 Pass: Math, PE 1 0 1
# 4 4 1 Pass: Math, ELA, PE 1 1 1
# 5 5 1 Pass: Math, PE 1 0 1
I can provide an explanation of what the code is doing if necessary.