Unable to get 0.00% for cells with 0 observations - r

I have a dataset which looks like:
ID week status
20 0 2
20 1 2
20 2 2
20 3 2
20 4 3
I need the proportion of status by week.
So I used the code
g_young= dat_young%>%group_by(week)%>%count(status)%>%mutate(dist=prop.table(n)*100)
I get the right numbers, but R does not show a row with 0.00 for week/status combinations that have no observations.
For example:
week status n dist
0 1 1957 12.9
0 3 1301 86.4
0 5 90 0.59
Normally this would not be an issue, but I need to make a graph from this table, and the missing rows for status 2 and 4 in the table above cause a weird step-like pattern in the graph. Any ideas how I could sort this out?
Thanks a lot. Appreciate the time and effort in helping me with a solution.

You would need to convert your status variable into a factor type and then add the argument .drop = FALSE to the count() function.
For example:
suppressMessages(library(dplyr))
dat <- tibble(week = c(0,0,0,0,0,1,1,1,1,2),
status = c(1,1,2,1,1,2,3,1,2,1))
dat
#> # A tibble: 10 x 2
#> week status
#> <dbl> <dbl>
#> 1 0 1
#> 2 0 1
#> 3 0 2
#> 4 0 1
#> 5 0 1
#> 6 1 2
#> 7 1 3
#> 8 1 1
#> 9 1 2
#> 10 2 1
dat %>%
mutate(status = factor(status)) %>%
group_by(week) %>%
count(status, .drop = FALSE) %>%
mutate(dist = prop.table(n)*100)
#> # A tibble: 9 x 4
#> # Groups: week [3]
#> week status n dist
#> <dbl> <fct> <int> <dbl>
#> 1 0 1 4 80
#> 2 0 2 1 20
#> 3 0 3 0 0
#> 4 1 1 1 25
#> 5 1 2 2 50
#> 6 1 3 1 25
#> 7 2 1 1 100
#> 8 2 2 0 0
#> 9 2 3 0 0
Created on 2020-10-12 by the reprex package (v0.3.0)
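A possible alternative, sketched here on the same toy data, is to count first and then fill in the missing week/status combinations with tidyr::complete(). Note the assumption: this only fills statuses that occur somewhere in the data, so a status that never appears in any week would still be absent (the factor approach with explicit levels handles that case).

```r
library(dplyr)
library(tidyr)

dat <- tibble(week   = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 2),
              status = c(1, 1, 2, 1, 1, 2, 3, 1, 2, 1))

res <- dat %>%
  count(week, status) %>%
  # add a row with n = 0 for every week/status pair that was not observed
  complete(week, status, fill = list(n = 0L)) %>%
  group_by(week) %>%
  mutate(dist = n / sum(n) * 100) %>%
  ungroup()

res
```

Because every week now has one row per status, a line or area plot of dist against week should no longer jump over the missing combinations.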


How to calculate cumulative sum for each group in time?

For each unique id and rep, I want to calculate the cumulative number of babies at each age.
For instance, for id A and rep 1, the cumulative sum should be 1, 3, 6.
I tried the following method:
id <- c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B")
rep <- c(1,1,1,2,2,2,1,1,1,1,2,2,2,2,2)
age <- c(0,1,2,0,1,2,0,1,2,3,0,1,2,3,4)
babies <- c(1,2,3,0,1,3,0,1,5,1,0,0,12,1,1)
df <- data.frame(id,rep,age,babies)
df$csum <- ave(df$babies, c(df$id,df$age, df$age), FUN=cumsum)
The result is that the cumulative sum is calculated over id alone, not over rep or age. Any suggestions?
How about this:
library(dplyr)
id <- c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B")
rep <- c(1,1,1,2,2,2,1,1,1,1,2,2,2,2,2)
age <- c(0,1,2,0,1,2,0,1,2,3,0,1,2,3,4)
babies <- c(1,2,3,0,1,3,0,1,5,1,0,0,12,1,1)
df <- data.frame(id,rep,age,babies)
df %>%
group_by(id, rep) %>%
arrange(age, .by_group = TRUE) %>%
mutate(csum = cumsum(babies))
#> # A tibble: 15 × 5
#> # Groups: id, rep [4]
#> id rep age babies csum
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 0 1 1
#> 2 A 1 1 2 3
#> 3 A 1 2 3 6
#> 4 A 2 0 0 0
#> 5 A 2 1 1 1
#> 6 A 2 2 3 4
#> 7 B 1 0 0 0
#> 8 B 1 1 1 1
#> 9 B 1 2 5 6
#> 10 B 1 3 1 7
#> 11 B 2 0 0 0
#> 12 B 2 1 0 0
#> 13 B 2 2 12 12
#> 14 B 2 3 1 13
#> 15 B 2 4 1 14
Created on 2022-12-08 by the reprex package (v2.0.1)
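For comparison, the original ave() attempt can probably be repaired in base R. In that attempt, c(df$id, df$age, df$age) glues the columns into one long vector instead of passing them as separate grouping variables. A minimal sketch, assuming the rows are already ordered by age within each id/rep group (as they are in the example data):

```r
df <- data.frame(
  id     = c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B"),
  rep    = c(1,1,1,2,2,2,1,1,1,1,2,2,2,2,2),
  age    = c(0,1,2,0,1,2,0,1,2,3,0,1,2,3,4),
  babies = c(1,2,3,0,1,3,0,1,5,1,0,0,12,1,1)
)

# each grouping variable is passed as its own argument to ave()
df$csum <- ave(df$babies, df$id, df$rep, FUN = cumsum)
df
```

If the rows are not sorted, ordering the data frame by id, rep and age first should give the same result as the dplyr version.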

Creating iGraph object by matching ID numbers

I am seeking a solution to an issue relating to the creation of an igraph object for my master's thesis. I need to conduct a network analysis on a database that links participants by project ID. There are about 1,500 rows covering roughly 600 participants and 300 projects, and many participants are involved in several projects. A sample of the data is below:
Participant_ID Project_Number
1 1 101314
2 2 101314
3 3 101314
4 1 101314
5 5 101346
6 6 101346
7 7 101346
8 8 101531
9 9 101531
10 3 101531
11 11 101533
12 8 101533
13 3 101533
14 14 101533
15 9 101612
How can I match these in a way that creates a network? How could I code these into an edge list - whether it be via a matrix or otherwise, as there are so many observations and differing numbers of participants on each project. I'm finding it hard to find much online that shows how I can create an edge list by ID numbers, and I need to overcome what I imagine is a fairly ordinary few lines of code.
Like this? Project_Number must be converted into a factor to prevent the creation of a graph with 100k+ nodes:
library(tidyverse)
library(tidygraph)
#>
#> Attaching package: 'tidygraph'
#> The following object is masked from 'package:stats':
#>
#> filter
graph <-
tribble(
~Participant_ID, ~Project_Number,
1, 101314,
2, 101314,
3, 101314,
1, 101314
) %>%
mutate(Project_Number = Project_Number %>% factor()) %>%
tbl_graph(edges = .)
graph
#> # A tbl_graph: 4 nodes and 4 edges
#> #
#> # A directed acyclic multigraph with 1 component
#> #
#> # Node Data: 4 × 1 (active)
#> name
#> <chr>
#> 1 1
#> 2 101314
#> 3 2
#> 4 3
#> #
#> # Edge Data: 4 × 2
#> from to
#> <int> <int>
#> 1 1 2
#> 2 3 2
#> 3 4 2
#> # … with 1 more row
graph %>% as.igraph() %>% plot()
Created on 2021-09-10 by the reprex package (v2.0.1)
Maybe a bipartite graph could help
library(igraph)
# df is the two-column data frame from the question
graph_from_data_frame(df) %>%
set_vertex_attr(name = "type", value = names(V(.)) %in% df$Participant_ID) %>%
plot(layout = layout_as_bipartite)
If you want a matrix output, you can try
> table(df)
Project_Number
Participant_ID 101314 101346 101531 101533 101612
1 2 0 0 0 0
2 1 0 0 0 0
3 1 0 1 1 0
5 0 1 0 0 0
6 0 1 0 0 0
7 0 1 0 0 0
8 0 0 1 1 0
9 0 0 1 0 1
11 0 0 0 1 0
14 0 0 0 1 0
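If the goal is a participant-to-participant network, the incidence table above can be turned into a co-membership matrix and then into an edge list. The following is a base R sketch using the sample rows from the question; the names incidence, adj and edges are illustrative only.

```r
df <- data.frame(
  Participant_ID = c(1, 2, 3, 1, 5, 6, 7, 8, 9, 3, 11, 8, 3, 14, 9),
  Project_Number = c(101314, 101314, 101314, 101314, 101346, 101346, 101346,
                     101531, 101531, 101531, 101533, 101533, 101533, 101533,
                     101612)
)

# 0/1 incidence matrix: rows are participants, columns are projects
# (binarised so the duplicated row for participant 1 counts once)
incidence <- 1 * (unclass(table(df)) > 0)

# entry [i, j] is the number of projects participants i and j share
adj <- incidence %*% t(incidence)
diag(adj) <- 0

# one row per connected pair, with the number of shared projects as weight
idx <- which(adj > 0 & upper.tri(adj), arr.ind = TRUE)
edges <- data.frame(from   = rownames(adj)[idx[, 1]],
                    to     = colnames(adj)[idx[, 2]],
                    weight = adj[idx])
edges
```

The matrix adj could then be passed to igraph::graph_from_adjacency_matrix(adj, mode = "undirected", weighted = TRUE, diag = FALSE), or the edges data frame to graph_from_data_frame(edges, directed = FALSE).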

R update values within a grouped df with information from updated previous value

I would like to conditionally mutate variables (var1, var2) within groups (id) at different timepoints (timepoint), using the previously updated/mutated values, according to this function:
change_function <- function(value,pastvalue,timepoint){
if(timepoint==1){valuenew=value} else
if(value==0){valuenew=pastvalue-1}
if(value==1){valuenew=pastvalue}
if(value==2){valuenew=pastvalue+1}
return(valuenew)
}
pastvalue is the MUTATED/UPDATED value at timepoint -1 for timepoint 2:4
Here is an example and output file:
``` r
#example data
df <- data.frame(id=c(1,1,1,1,2,2,2,2),timepoint=c(1,2,3,4,1,2,3,4),var1=c(1,0,1,2,2,2,1,0),var2=c(2,0,1,2,3,2,1,0))
df
#> id timepoint var1 var2
#> 1 1 1 1 2
#> 2 1 2 0 0
#> 3 1 3 1 1
#> 4 1 4 2 2
#> 5 2 1 2 3
#> 6 2 2 2 2
#> 7 2 3 1 1
#> 8 2 4 0 0
#desired output
output <- data.frame(id=c(1,1,1,1,2,2,2,2),timepoint=c(1,2,3,4,1,2,3,4),var1=c(1,0,0,1,2,3,3,2),var2=c(2,1,1,2,3,4,4,3))
output
#> id timepoint var1 var2
#> 1 1 1 1 2
#> 2 1 2 0 1
#> 3 1 3 0 1
#> 4 1 4 1 2
#> 5 2 1 2 3
#> 6 2 2 3 4
#> 7 2 3 3 4
#> 8 2 4 2 3
```
Created on 2020-11-23 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
My Approach: use my function using dplyr::mutate_at
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(.=change_function(.,dplyr::lag(.),timepoint)))
However, this does not work because if/else is not vectorized
Update 1:
Using a nested ifelse function does not give the desired output, because dplyr::lag() supplies the original previous value rather than the updated one:
change_function <- function(value,pastvalue,timepoint){
ifelse((timepoint==1),value,
ifelse((value==0),pastvalue-1,
ifelse((value==1),pastvalue,
ifelse((value==2),pastvalue+1,NA))))
}
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(.=change_function(.,dplyr::lag(.),timepoint)))
id TimePoint var1 var2 var1_. var2_.
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 1 2
2 1 2 0 0 0 1
3 1 3 1 1 0 0
4 1 4 2 2 2 2
5 2 1 2 3 2 3
6 2 2 2 2 3 4
7 2 3 1 1 2 2
8 2 4 0 0 0 0
Update 2:
According to the comments, purrr::accumulate could be used.
Thanks to akrun I could get the correct function:
# write a vectorized function
change_function <- function(prev, new) {
change=if_else(new==0,-1,
if_else(new==1,0,1))
if_else(is.na(new), new, prev + change)
}
# use purrr::accumulate (requires library(purrr) in addition to dplyr)
df %>%
group_by(id) %>%
mutate_at(.vars=vars(var1,var2),
.funs=funs(accumulate(.,change_function)))
# A tibble: 8 x 4
# Groups: id [2]
id timepoint var1 var2
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 2
2 1 2 0 1
3 1 3 0 1
4 1 4 1 2
5 2 1 2 3
6 2 2 3 4
7 2 3 3 4
8 2 4 2 3
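The same running update can also be sketched in base R with Reduce(accumulate = TRUE), which avoids the now-deprecated funs() helper. This is only an illustration, and it assumes that every value after the first timepoint in a group is 0, 1 or 2; the helper names step and run_update are made up for this example.

```r
df <- data.frame(id        = c(1, 1, 1, 1, 2, 2, 2, 2),
                 timepoint = c(1, 2, 3, 4, 1, 2, 3, 4),
                 var1      = c(1, 0, 1, 2, 2, 2, 1, 0),
                 var2      = c(2, 0, 1, 2, 3, 2, 1, 0))

# 0 means "one less than the updated previous value", 1 "same", 2 "one more"
step <- function(prev, new) prev + c(-1, 0, 1)[new + 1]

# the first value of each group is kept as the starting point
run_update <- function(x) Reduce(step, x, accumulate = TRUE)

df$var1 <- ave(df$var1, df$id, FUN = run_update)
df$var2 <- ave(df$var2, df$id, FUN = run_update)
df
```

As with accumulate(), each step sees the already-updated previous value, which is what the lag()-based attempts were missing.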

Adding sequential IDs to rows in data frame

I have a dataset called Snapper_new that has 330 rows and each set of nine rows is named 1 through 9 as shown in the id column. I want each set of nine rows (1-9, 10-18, etc.) to have a unique ID (1,2, etc.). How would I do this in R?
Here is an approach with the tidyverse:
library(tidyverse)
Snapper_new <- rep(seq(1:9), 3) %>%
enframe(name=NULL, value="id")
Snapper_new %>%
mutate(group_start=case_when(id==1 ~ 1,
TRUE ~ as.numeric(0))) %>%
mutate(group_index=cumsum(group_start))
#> # A tibble: 27 x 3
#> id group_start group_index
#> <int> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 0 1
#> 3 3 0 1
#> 4 4 0 1
#> 5 5 0 1
#> 6 6 0 1
#> 7 7 0 1
#> 8 8 0 1
#> 9 9 0 1
#> 10 1 1 2
#> # ... with 17 more rows
Created on 2020-11-30 by the reprex package (v0.3.0)
Pure R answer.
a = data.frame("test"=1:330, "pokus" = 1:330)
b <- unlist(lapply(1:ceiling(330/9), function(x) {replicate(9, x)}))
b <- b[1:nrow(a)]
a <- cbind(a, b)
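If the blocks are always exactly nine consecutive rows, integer division on the row number may be the shortest route. A sketch, using a stand-in data frame with 330 rows since the real Snapper_new is not shown; the column name group_index is an assumption:

```r
# stand-in for Snapper_new: ids 1..9 repeated, truncated to 330 rows
Snapper_new <- data.frame(id = rep(1:9, length.out = 330))

# rows 1-9 get 1, rows 10-18 get 2, and so on
Snapper_new$group_index <- (seq_len(nrow(Snapper_new)) - 1) %/% 9 + 1

head(Snapper_new, 12)
```

Since 330 is not a multiple of 9, the last group (37) contains only six rows.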

Is there a way in R to calculate how many times it happens that if there is a 0, the subsequent number will be also a zero?

I have a binary dataset with 0 and 1. Is there a way to calculate how many times it happens that if there is a 0, the subsequent number will be also a zero (so 1010100). And how many times it happens that there are three subsequent zeros, etc.
A zero in my data means that a bird is not on its nest at a particular time. For every three minutes, it is noted if the bird is on its nest (1) or not (0). My goal is to find out if birds in particular areas leave their nests longer because it takes them longer to get enough food. Longer means more subsequent zeros.
bla <- rle(c(0,1,0,0,1,0,0,0,1,0,0))
table(bla$lengths[bla$values == 0])
gives you
1 2 3
1 2 1
meaning that a sequence of 1 zero appears once, 2 zeros twice and 3 zeros again once.
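To count how often a 0 is directly followed by another 0 (overlapping pairs), the same run lengths can be reused: a run of k zeros contains k - 1 such pairs and k - 2 triples. A small sketch on the example vector; the variable names are illustrative:

```r
x <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
runs <- rle(x)
zero_len <- runs$lengths[runs$values == 0]

# a run of k zeros contributes k - 1 adjacent "0 then 0" pairs
n_pairs <- sum(pmax(zero_len - 1, 0))

# and k - 2 windows of three consecutive zeros
n_triples <- sum(pmax(zero_len - 2, 0))

n_pairs
n_triples
```

For this vector there are 4 zero-zero pairs and 1 window of three consecutive zeros.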
You can use run length encoding to reconstruct this data. Suppose your data looked like this:
set.seed(69)
df <- data.frame(time = seq(1:20), on_off = rbinom(20, 1, 0.5))
df
#> time on_off
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 5 0
#> 6 6 1
#> 7 7 0
#> 8 8 1
#> 9 9 0
#> 10 10 0
#> 11 11 1
#> 12 12 0
#> 13 13 1
#> 14 14 0
#> 15 15 0
#> 16 16 0
#> 17 17 0
#> 18 18 1
#> 19 19 0
#> 20 20 1
Then you can convert it into a run-length encoded data frame like this:
RLE <- rle(df$on_off)
len <- RLE$lengths
new_df <- data.frame(time = df$time[cumsum(c(1, len))[seq_along(len)]],
on_off = RLE$values, duration = len)
new_df
#> time on_off duration
#> 1 1 1 4
#> 2 5 0 1
#> 3 6 1 1
#> 4 7 0 1
#> 5 8 1 1
#> 6 9 0 2
#> 7 11 1 1
#> 8 12 0 1
#> 9 13 1 1
#> 10 14 0 4
#> 11 18 1 1
#> 12 19 0 1
#> 13 20 1 1
Created on 2020-07-01 by the reprex package (v0.3.0)
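From a run-length table like this, absence durations can be summarised directly. The sketch below hard-codes the on_off column printed above rather than relying on the random seed, and assumes the three-minute observation interval mentioned in the question.

```r
on_off <- c(1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1)
runs <- rle(on_off)

# length of each absence (run of zeros) in minutes, at 3 minutes per observation
absence_min <- runs$lengths[runs$values == 0] * 3

absence_min
mean(absence_min)
max(absence_min)
```

Computing these summaries per bird or per area would then allow absence lengths to be compared between areas.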
