Convert summary data to presence/absence data in R

I conducted 5 presence/absence measures at multiple sites and summed them together and ended up with a dataframe that looked something like this:
df <- data.frame("site" = c("a", "b", "c"),
"species1" = c(0, 2, 1),
"species2" = c(5, 2, 4))
i.e. at site "a", species1 was recorded 0/5 times and species2 was recorded 5/5 times.
What I would like to do is convert this back into presence/absence data. Something like this:
data.frame("site" = ("a", "b", "c"),
"species1" = c(0,0,0,0,0, 1,1,0,0,0, 1,0,0,0,0),
"species2" = c(1,1,1,1,1, 1,1,0,0,0, 1,1,1,1,0))
I can duplicate each row 5 times with:
df %>% slice(rep(1:n(), each = 5))
but I can't figure out how to change "2" into "1,1,0,0,0". Ideally the order of the 1s and 0s (within each site) would also be randomised (e.g. "0,0,1,0,1"), but that might be too difficult.
Any help would be appreciated.

We can also use uncount
library(dplyr)
library(tidyr)
df %>%
uncount(max(species2), .remove = FALSE) %>%
group_by(site) %>%
mutate(across(starts_with('species'), ~ as.integer(row_number() <= first(.))))
# A tibble: 15 x 3
# Groups: site [3]
# site species1 species2
# <chr> <int> <int>
# 1 a 0 1
# 2 a 0 1
# 3 a 0 1
# 4 a 0 1
# 5 a 0 1
# 6 b 1 1
# 7 b 1 1
# 8 b 0 0
# 9 b 0 0
#10 b 0 0
#11 c 1 1
#12 c 0 1
#13 c 0 1
#14 c 0 1
#15 c 0 0

After repeating the rows you can compare the row number with the first value of the respective column (the values are constant within each site) and assign 1 if the current row number is less than or equal to that value.
library(dplyr)
df %>%
slice(rep(seq_len(n()), each = 5)) %>%
group_by(site) %>%
mutate(across(starts_with('species'), ~+(row_number() <= first(.))))
# With older dplyr versions, use mutate_at instead of across:
# mutate_at(vars(starts_with('species')), ~+(row_number() <= first(.)))
# site species1 species2
# <chr> <int> <int>
# 1 a 0 1
# 2 a 0 1
# 3 a 0 1
# 4 a 0 1
# 5 a 0 1
# 6 b 1 1
# 7 b 1 1
# 8 b 0 0
# 9 b 0 0
#10 b 0 0
#11 c 1 1
#12 c 0 1
#13 c 0 1
#14 c 0 1
#15 c 0 0
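Neither answer randomises the order of the 1s and 0s within a site, which the question mentions as a nice-to-have. A minimal sketch (assuming each site really was surveyed exactly 5 times) that shuffles the expanded 0/1 values within each site using sample():
library(dplyr)
set.seed(42) # only to make the shuffle reproducible
df %>%
slice(rep(seq_len(n()), each = 5)) %>%
group_by(site) %>%
mutate(across(starts_with('species'), ~ sample(+(row_number() <= first(.))))) %>%
ungroup()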

Related

How to count the cumulative number of subgroupings using dplyr?

I'm trying to count the cumulative number of subgroupings using dplyr, as illustrated and explained in the image below; I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have the reproducible code, which runs all columns up through Flag1 and works fine.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
)
excelCopy <- myData %>%
group_by(Element) %>%
mutate(Element_Count = row_number()) %>%
mutate(Flag1 = case_when(Group > 0 ~ match(Group, unique(Group)),TRUE ~ Element_Count)) %>%
ungroup()
print.data.frame(excelCopy)
Group by Element and Group, then use row_number, setting rows where Group is 0 to NA:
library(dplyr)
excelCopy |>
group_by(Element, Group) |>
mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA
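A data.table sketch of the same idea (my own variant, not part of the original answer): rowid() numbers the rows within each Element/Group combination and fifelse() blanks out the Group == 0 rows.
library(data.table)
setDT(excelCopy)[, Flag2 := fifelse(Group == 0, NA_integer_, rowid(Element, Group))]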

R dplyr filter direct and indirect sequences of string

Assume I have the table below. And I want to find all id where A is directly or indirectly followed by B.
A->B is the direct sequence for id 1, while A->B is the indirect sequence for id 2, id 3, and id 4.
library(dplyr)
have <- tibble(id = c(rep(1, 2), rep(2, 4), rep(3, 3), rep(4, 4)),
sequence = c('A','B','A','D','C','B','D','A','C','B','D','A','B'))
have
# A tibble: 13 × 2
id sequence
<dbl> <chr>
1 1 A
2 1 B
3 2 A
4 2 D
5 2 C
6 2 B
7 3 D
8 3 A
9 3 C
10 4 B
11 4 D
12 4 A
13 4 B
For A->B direct sequences, I do the following. But I don't think the same logic applies to indirect sequences unless I use regular expressions on the concatenated sequence.
have %>% group_by(id) %>%
dplyr::mutate(process_seq = paste(lag(sequence), '->', sequence)) %>%
dplyr::filter(process_seq == 'A -> B')
want
# A tibble: 4 × 2
id sequence-type
<dbl> <chr>
1 1 direct
2 2 indirect
3 3 indirect
4 4 indirect
ABs = have %>%
group_by(id) %>%
mutate(rownum = row_number(),
letternum = match(sequence, LETTERS[1:26])) %>%
filter(sequence == "A" | sequence == "B") %>%
mutate(dif_row = rownum - lag(rownum),
dif_let = letternum - lag(letternum)) %>%
filter(!is.na(dif_row)) %>%
mutate(has_direct_link = max(dif_row==1),
has_indirect_link = max(dif_row >1 & dif_let == 1),
has_reverse_link = max(dif_row >1 & dif_let < 0)) %>%
select(id, starts_with("has_")) %>%
distinct()
res = have %>% left_join(ABs) %>%
mutate(has_no_link = as.integer(is.na(has_direct_link))) %>%
mutate_if(is.numeric,coalesce,0)
> res
# A tibble: 13 x 6
id sequence has_direct_link has_indirect_link has_reverse_link has_no_link
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 A 1 0 0 0
2 1 B 1 0 0 0
3 2 A 0 1 0 0
4 2 D 0 1 0 0
5 2 C 0 1 0 0
6 2 B 0 1 0 0
7 3 D 0 0 0 1
8 3 A 0 0 0 1
9 3 C 0 0 0 1
10 4 B 1 0 1 0
11 4 D 1 0 1 0
12 4 A 1 0 1 0
13 4 B 1 0 1 0
@deschen's answer below is elegant, but somewhat incomplete: I don't think the "definition" of an indirect link is correct (every direct link is also an indirect link). But my answer could probably be improved by deschen.
Here's one approach:
library(tidyverse)
have %>%
group_by(id) %>%
mutate(direct = if_else(sequence == 'A' & lead(sequence) == 'B', 1, 0)) %>%
mutate(indirect = any(sequence == 'A') & any(sequence == 'B')) %>%
filter(any(direct == 1) | indirect == TRUE) %>%
ungroup()
which gives:
# A tibble: 10 x 4
id sequence direct indirect
<dbl> <chr> <dbl> <lgl>
1 1 A 1 TRUE
2 1 B 0 TRUE
3 2 A 0 TRUE
4 2 D 0 TRUE
5 2 C 0 TRUE
6 2 B 0 TRUE
7 4 B 0 TRUE
8 4 D 0 TRUE
9 4 A 1 TRUE
10 4 B 0 TRUE
You can of course deselect the created direct/indirect columns.
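The regex-on-a-concatenated-string idea the OP mentions also works as a per-id classifier. A sketch (the direct/indirect/none labels are my own; under this adjacency reading id 4 comes out as direct, and id 3, which contains no B, as none):
library(dplyr)
have %>%
group_by(id) %>%
summarise(seq_str = paste(sequence, collapse = '')) %>%
mutate(type = case_when(
grepl('AB', seq_str, fixed = TRUE) ~ 'direct', # A immediately followed by B
grepl('A.*B', seq_str) ~ 'indirect', # A eventually followed by B
TRUE ~ 'none'
))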

R - Mutate column based on another column

Using R:
For the dataframe:
A<-c(3,3,3,3,1,1,2,2,2,2,2)
df<-data.frame(A)
How do you add a column such that the output is the same as:
A <- c(3,3,3,3,1,1,2,2,2,2,2)
df <- data.frame(A)
B <- c(1,1,1,0,1,0,1,1,0,0,0)
library(dplyr)
mutate(df, B)
In other words, is there a formula for column 'B' such that it looks at column 'A' and lists '1' three times, then puts a '0', and so on?
So the desired output (given column 'A') is the data frame above with column 'B' added.
Thank you.
Here I assign a new group each time A changes, then within each group put a 1 in B for the first A rows.
(If the values of A are distinct for each group, you could replace the first two lines with group_by(A), but it's unclear whether that's a fair assumption.)
library(dplyr)
df %>%
mutate(group = cumsum(A != lag(A, default = 0))) %>%
group_by(group) %>%
mutate(B = 1 * (row_number() <= A)) %>%
ungroup()
result
# A tibble: 11 x 3
A group B
<dbl> <int> <dbl>
1 3 1 1
2 3 1 1
3 3 1 1
4 3 1 0
5 1 2 1
6 1 2 0
7 2 3 1
8 2 3 1
9 2 3 0
10 2 3 0
11 2 3 0
After grouping by 'A' (together with a run-length id, so that separate runs of the same value stay distinct), use rep to repeat 1 first(A) times and 0 for the remaining n() - first(A) rows of the group.
library(dplyr)
library(data.table)
df %>%
group_by(A, grp = rleid(A)) %>%
mutate(B = rep(c(1, 0), c(first(A), n() - first(A)))) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 11 x 2
# A B
# <dbl> <dbl>
# 1 3 1
# 2 3 1
# 3 3 1
# 4 3 0
# 5 1 1
# 6 1 0
# 7 2 1
# 8 2 1
# 9 2 0
#10 2 0
#11 2 0
Or using rle from base R (interleaving the counts of 1s and 0s per run with rbind):
with(rle(df$A), rep(rep(c(1, 0), length(values)), c(rbind(values, lengths - values))))
#[1] 1 1 1 0 1 0 1 1 0 0 0
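Breaking that rle() expression into steps (a sketch with the same logic):
r <- rle(df$A)
ones <- r$values # each run needs A leading 1s
zeros <- r$lengths - r$values # the remaining rows of the run get 0s
rep(rep(c(1, 0), length(r$values)), c(rbind(ones, zeros)))
#[1] 1 1 1 0 1 0 1 1 0 0 0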

How can I subtract values within one column based on values in multiple other columns?

I have a dataframe like this:
dat <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
id = c(rep(c("A","B","C"), 3)),
measurement = c(1:8, 1))
dat
# c id measurement
# 1 0 A 1
# 2 0 B 2
# 3 0 C 3
# 4 5 A 4
# 5 5 B 5
# 6 5 C 6
# 7 10 A 7
# 8 10 B 8
# 9 10 C 1
I want to subtract the values in the column "measurement" where c is 0 from all other values in this column. This should happen separately based on the info given in the column "id". E.g. the value where c is 0 and "id" is A should be subtracted from all values where c is > 0 and "id" is A. The value where c is 0 and "id" is B should be subtracted from all values where c is > 0 and "id" is B and so on.
If the difference would be negative the result should be 0.
The result should look like this:
result <- data.frame(c = c(rep(0, 3), rep(5, 3), rep(10, 3)),
id = c(rep(c("A","B","C"), 3)),
measurement = c(1:8, 1),
difference = c(0,0,0,3,3,3,6,6,0))
result
# c id measurement difference
# 1 0 A 1 0
# 2 0 B 2 0
# 3 0 C 3 0
# 4 5 A 4 3
# 5 5 B 5 3
# 6 5 C 6 3
# 7 10 A 7 6
# 8 10 B 8 6
# 9 10 C 1 0
I used dplyr to select the values of "measurement" based on the info from the other columns, but unfortunately I don't know how to do the calculations. So any suggestions are welcome!
For each id you can subtract the value where c == 0 from the measurement values. Using pmax we replace negative values with 0.
library(dplyr)
dat %>%
group_by(id) %>%
mutate(difference = pmax(measurement - measurement[c == 0], 0))
# c id measurement difference
# <dbl> <chr> <dbl> <dbl>
#1 0 A 1 0
#2 0 B 2 0
#3 0 C 3 0
#4 5 A 4 3
#5 5 B 5 3
#6 5 C 6 3
#7 10 A 7 6
#8 10 B 8 6
#9 10 C 1 0
Try this. You can use a join after filtering the data to the c == 0 rows. After that, dplyr verbs get you to the expected output:
library(dplyr)
#Code
new <- dat %>%
left_join(
dat %>% filter(c==0) %>% select(-c) %>% rename(Var=measurement)
) %>%
mutate(measurement=measurement-Var) %>%
replace(.<=0,0) %>% select(-Var)
Output:
c id measurement
1 0 A 0
2 0 B 0
3 0 C 0
4 5 A 3
5 5 B 3
6 5 C 3
7 10 A 6
8 10 B 6
9 10 C 0
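A base R sketch of the same idea (assuming exactly one c == 0 baseline row per id): look up each row's baseline by id with match() and floor at zero with pmax().
baseline <- dat$measurement[dat$c == 0][match(dat$id, dat$id[dat$c == 0])]
dat$difference <- pmax(dat$measurement - baseline, 0)
dat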

Create Counter with Binary Variable

I am trying to create a counter variable that starts over at 1 every time there is a change in a binary variable.
library(dplyr)
bin <- c(1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0)
df <- as.data.frame(bin)
df <- df %>%
group_by(bin) %>%
mutate(cntr = row_number())
I would like to get the following results:
bin cntr
1 1
0 1
0 2
1 1
1 2
1 3
...
But instead I'm getting:
1 1
0 1
0 2
1 2
1 3
1 4
I understand why this is ... I just don't know how to get my desired results. Any help would be appreciated.
You can easily do this by combining sequence and rle. No packages required.
data.frame(bin, cntr = sequence(rle(bin)$lengths))
# bin cntr
#1 1 1
#2 0 1
#3 0 2
#4 1 1
#5 1 2
#6 1 3
#7 1 4
#8 1 5
#9 0 1
#10 0 2
#11 0 3
#12 0 4
#13 1 1
#14 0 1
#15 1 1
#16 0 1
We need a run-length id to group adjacent identical elements into a single group. It can be created with rleid from data.table, or by building a logical index and taking its cumulative sum: cumsum(bin != lag(bin, default = first(bin))).
library(data.table)
library(dplyr)
df %>%
group_by(grp = rleid(bin)) %>%
mutate(cntr = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 16 x 2
# bin cntr
# <dbl> <int>
# 1 1 1
# 2 0 1
# 3 0 2
# 4 1 1
# 5 1 2
# 6 1 3
# 7 1 4
#..
In data.table, this can be done more compactly, with := updating the column by reference.
library(data.table)
setDT(df)[, cntr := rowid(rleid(bin))]
df
# bin cntr
# 1: 1 1
# 2: 0 1
# 3: 0 2
# 4: 1 1
# 5: 1 2
# 6: 1 3
# 7: 1 4
#..
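The cumsum() alternative mentioned above, written out as a dplyr-only sketch (assuming df as originally created, before setDT):
library(dplyr)
df %>%
group_by(grp = cumsum(bin != lag(bin, default = first(bin)))) %>%
mutate(cntr = row_number()) %>%
ungroup() %>%
select(-grp)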
