Count total and positive samples by group - R

I have a dataframe like this:
df <- data.frame(concentration=c(0,0,0,0,2,2,2,2,4,4,6,6,6),
result=c(0,0,0,0,0,0,1,0,1,0,1,1,1))
I want to count the total number of results for each concentration level.
I want to count the number of positive samples for each concentration level.
And I want to create a new dataframe with concentration level, total results, and number positives.
conc pos_c total_c
0 0 4
2 1 4
4 1 2
6 3 3
This is what I've come up with so far using plyr:
c <- count(df, "concentration")
r <- count(df, "concentration","result")
names(c)[which(names(c) == "freq")] <- "total_c"
names(r)[which(names(r) == "freq")] <- "pos_c"
cbind(c,r)
concentration total_c concentration pos_c
1 0 4 0 0
2 2 4 2 1
3 4 2 4 1
4 6 3 6 3
This repeats the concentration column. I think there is probably a better/easier way to do this that I'm missing, maybe in another library. I'm not sure how to do this in R; the language is relatively new to me. Thanks.

We need a grouped sum. Using the tidyverse, we group by 'concentration' (group_by), then summarise to get the two columns: 1) the sum of the logical expression (result > 0), and 2) the number of rows (n()).
library(dplyr)
df %>%
  group_by(conc = concentration) %>%
  summarise(pos_c = sum(result > 0), # in the example just sum(result)
            total_c = n())
# A tibble: 4 x 3
# conc pos_c total_c
# <dbl> <int> <int>
#1 0 0 4
#2 2 1 4
#3 4 1 2
#4 6 3 3
Or using base R with table and addmargins:
addmargins(table(df), 2)[,-1]
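For completeness, here is a sketch of the same summary in base R with aggregate() (my addition, not part of the original answer; it assumes the df defined above):
out <- aggregate(result ~ concentration, data = df,
                 FUN = function(r) c(pos_c = sum(r > 0), total_c = length(r)))
out <- do.call(data.frame, out)  # flatten the matrix column aggregate() returns
names(out) <- c("conc", "pos_c", "total_c")
out
#  conc pos_c total_c
#1    0     0       4
#2    2     1       4
#3    4     1       2
#4    6     3       3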

Related

Use replicate to create a new variable

I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' to indicate the measurement occasion number for each individual. But since every ID has a different number of observations, I am not sure how to implement this. Does anybody know how to include that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
  group_by(ID) |>
  mutate(times = row_number()) |>
  ungroup()
With base R we could create the sequence from the run lengths of the ID variable:
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
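A small aside (my addition, not part of the original answers): base R's ave() with seq_along produces the same within-ID counter without run-length encoding, so it also works when the rows of an ID are not stored as one consecutive block:
dat$times <- ave(dat$ID, dat$ID, FUN = seq_along)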

How to use a for loop to change consecutive values in R?

How can I run a loop over multiple columns, changing cumulative values to the true per-bin values?
For example, if I have a dataframe like this...
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
I want to show the binned values...
Time Value Bin Subject_ID
1 6 1 1
2 4 2 1
4 8 3 1
1 2 4 1
Is there a way to do it in a loop?
I tried this code...
for (row in 2:nrow(df)) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] = df[row, 1:2] - df[row - 1, 1:2]
  }
}
But the code changed it line by line and did not give the correct values for each bin.
If you still insist on using a for loop, you can use the following solution. It's very simple, but you first have to create a copy of your data set, because the desired output values are differences between rows of the original data set. We keep the copy DF outside of the for loop so its values remain intact; otherwise, on every iteration the values of the data set would be overwritten with new values and the final output would be incorrect:
df <- read.table(header = TRUE, text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1")
DF <- df[, c("Time", "Value")]
for(i in 2:nrow(df)) {
  df[i, c("Time", "Value")] <- DF[i, ] - DF[i-1, ]
}
df
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
The problem with the code in the question is that after row i is changed, the changed row is used in calculating row i+1 rather than the original row i. To fix that, run the loop in reverse order; that is, use nrow(df):2 in the for statement, as sketched below. Alternatively, try one of the loop-free approaches that follow, which also have the advantage of not overwriting the input -- something that makes the code easier to debug.
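A sketch of that reversed loop (assuming the df from the question):
# Each row is differenced against a previous row that has not yet been modified.
for (row in nrow(df):2) {
  if (df[row - 1, "Subject_ID"] == df[row, "Subject_ID"]) {
    df[row, 1:2] <- df[row, 1:2] - df[row - 1, 1:2]
  }
}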
1) Base R Use ave to apply Diff by group, where Diff uses diff to perform the actual differencing.
Diff <- function(x) c(x[1], diff(x))
transform(df,
  Time = ave(Time, Subject_ID, FUN = Diff),
  Value = ave(Value, Subject_ID, FUN = Diff))
giving:
Time Value Bin Subject_ID
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
2) dplyr Using dplyr we write the same thing, except we use lag:
library(dplyr)
df %>%
  group_by(Subject_ID) %>%
  mutate(Time = Time - lag(Time, default = 0),
         Value = Value - lag(Value, default = 0)) %>%
  ungroup
giving:
# A tibble: 4 x 4
Time Value Bin Subject_ID
<dbl> <dbl> <int> <int>
1 1 6 1 1
2 2 4 2 1
3 4 8 3 1
4 1 2 4 1
or using across:
library(dplyr)
df %>%
  group_by(Subject_ID) %>%
  mutate(across(Time:Value, ~ .x - lag(.x, default = 0))) %>%
  ungroup
Note
Lines <- "Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1"
df <- read.table(text = Lines, header = TRUE)
Here is a base R one-liner with diff in a lapply loop.
df[1:2] <- lapply(df[1:2], function(x) c(x[1], diff(x)))
df
#  Time Value Bin Subject_ID
#1    1     6   1          1
#2    2     4   2          1
#3    4     8   3          1
#4    1     2   4          1
(Note this one-liner ignores Subject_ID grouping, which is fine here because the example has a single subject.)
Data
df <- read.table(text = "
Time Value Bin Subject_ID
1 6 1 1
3 10 2 1
7 18 3 1
8 20 4 1
", header = TRUE)
A dplyr one-liner:
library(dplyr)
df %>% mutate(across(c(Time, Value), ~c(first(.), diff(.))))
#> Time Value Bin Subject_ID
#> 1 1 6 1 1
#> 2 2 4 2 1
#> 3 4 8 3 1
#> 4 1 2 4 1
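And a sketch of the same differencing with data.table's shift() (my addition; it assumes the df from the Data block above):
library(data.table)
setDT(df)  # convert df to a data.table by reference
df[, c("Time", "Value") := .(Time - shift(Time, fill = 0),
                             Value - shift(Value, fill = 0)),
   by = Subject_ID]
df
#    Time Value Bin Subject_ID
# 1:    1     6   1          1
# 2:    2     4   2          1
# 3:    4     8   3          1
# 4:    1     2   4          1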

How can I find the column index of the first non-zero value in a row with R dplyr?

I'm working in R. I have a dataset of COVID case totals that looks like this:
Facility  Day_1  Day_2  Day_3
A             0      0      1
B             1      2      5
C             0      2      6
D             0      0      0
I would like to use mutate() to create a new column, first_case, that has the column index of the first non-zero element in each row -- or "NA" if there is no non-zero element. I thought about using where(), but couldn't quite figure out how to get a column index instead of a row index.
Any help is much appreciated!
We can use max.col to get the first position at which the value is non-zero in each row.
library(dplyr)
df %>%
  mutate(first_case = {
    tmp <- select(., starts_with('Day'))
    ifelse(rowSums(tmp) == 0, NA, max.col(tmp != 0, ties.method = 'first'))
  })
# Facility Day_1 Day_2 Day_3 first_case
#1 A 0 0 1 3
#2 B 1 2 5 1
#3 C 0 2 6 2
#4 D 0 0 0 NA
first_case holds the column index among the 'Day' columns; if you need the column index within the full data frame, add 1 to the output above.
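For completeness, a base R sketch of the same idea (my addition, not part of the original answer): match() returns the position of the first TRUE in each row, and NA when there is none.
day_cols <- startsWith(names(df), "Day")
df$first_case <- apply(df[day_cols], 1, function(x) match(TRUE, x != 0))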
All of this is probably unnecessarily complex; the root issue is that the data is not in the long ('tidy') format that dplyr etc. expect.
library(tidyr)
datlong <- dat %>%
  pivot_longer(cols = starts_with("Day"), names_to = "day", names_pattern = "_(\\d+)")
## A tibble: 12 x 3
# Facility day value
# <chr> <chr> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 B 1 1
# 5 B 2 2
# 6 B 3 5
# 7 C 1 0
# 8 C 2 2
# 9 C 3 6
#10 D 1 0
#11 D 2 0
#12 D 3 0
It's then simple to get the first/second/[n]th day above whatever value, as well as to calculate minimums, maximums, means, weekly averages, rolling averages, and so on, because you are now dealing with a plain vector of values rather than values spread across multiple columns.
datlong %>%
  group_by(Facility) %>%
  filter(value > 0, .preserve = TRUE) %>%
  summarise(first_day = first(day))
#`summarise()` ungrouping output (override with `.groups` argument)
## A tibble: 4 x 2
# Facility first_day
# <chr> <chr>
#1 A 3
#2 B 1
#3 C 2
#4 D <NA>
An alternative using indexing, which is less dplyr-like:
datlong %>%
  group_by(Facility) %>%
  summarise(first_day = day[value > 0][1])

Sample n random rows per group in a dataframe with dplyr when some groups have fewer than n rows

I have a data frame with two categorical variables.
samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
samples groups
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
The result that I would like is, for each sample-group combination, to downsample the data frame (randomly, this is important) to a maximum of X rows, keeping all rows for combinations that appear fewer than X times. In the example here X=2. Is there an easy way to do this? The issue I have is that row 4 (A, 2) appears only once, so dplyr's sample_n would not work.
Desired output:
samples groups
1 A 1
2 A 1
3 A 2
4 B 1
5 B 1
You can sample the minimum of the number of rows and x for each group:
library(dplyr)
x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))
# samples groups
# <chr> <dbl>
#1 A 1
#2 A 1
#3 A 2
#4 B 1
#5 B 1
However, note that sample_n() has been superseded in favor of slice_sample(), but n() doesn't work inside slice_sample(); there is an open dplyr issue about it.
However, as @tmfmnk mentioned, we don't need to call n() here. Try:
df %>% group_by(samples, groups) %>% slice_sample(n = x)
One option with data.table (after converting df to a data.table; here x is the same cap of 2):
library(data.table)
setDT(df)
df[df[, .I[sample(.N, min(.N, x))], by = .(samples, groups)]$V1]
samples groups
1: A 1
2: A 1
3: A 2
4: B 1
5: B 1
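A base R sketch is also possible (my addition; sample.int() draws positions within each group's row indices, which avoids the sample(x, n) pitfall when a group holds a single row index):
set.seed(1)  # only for reproducibility
idx <- unlist(lapply(
  split(seq_len(nrow(df)), df[c("samples", "groups")], drop = TRUE),
  function(i) i[sample.int(length(i), min(length(i), x))]
))
df[sort(idx), ]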

Finding individual values from a cumulative mean in an R dataframe

I am having trouble working out how to recover the individual values from a running mean in an R dataframe.
I have an R dataframe:
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
Here Mean is the running mean of the first x measurements for the given ID.
To find the individual value at each x rather than the mean, I was thinking I needed to apply a recursive function over the dataframe, grouped by ID. How can I do this when an apply-style function doesn't have access to the previous entry in the dataframe?
When calculated and appended to the dataframe, I hope it will look like this:
x ID Mean IndivValues
1 1 1 1
1 2 5 5
2 1 3 5
2 2 6 7
It's much easier to calculate this by going from running totals to individual observations, as below:
Example data.frame:
df <- read.table(text='
x ID Mean
1 1 1
1 2 5
2 1 3
2 2 6
', header=T)
Solution:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(
    total = Mean * x,
    ind_value = total - lag(total, default = 0)
  )
## A tibble: 4 x 5
## Groups: ID [2]
# x ID Mean total ind_value
# <int> <int> <int> <int> <dbl>
#1 1 1 1 1 1
#2 1 2 5 5 5
#3 2 1 3 6 5
#4 2 2 6 12 7
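The same reconstruction works in base R with ave() (my addition; it assumes rows are ordered by x within each ID, as in the example):
df$IndivValues <- ave(df$Mean * df$x, df$ID,
                      FUN = function(total) c(total[1], diff(total)))
df
#  x ID Mean IndivValues
#1 1  1    1           1
#2 1  2    5           5
#3 2  1    3           5
#4 2  2    6           7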
