Why does my table say “No data available” instead of zero?

I need to create multiple new data frames, based on different filters, that contain two variable counts “d” & “e” derived from the values in columns “a”, “b” and “c”. I have written a function for this that works as long as at least one column has a value. However, sometimes certain groups will have no answers for a, b or c. When this happens I want both d and e to return zero, but instead the table says “No data available in table”. I’ve added my code below.
f_calculate_net = function(data) {
  data %>%
    mutate(a = ifelse("a" %in% colnames(data), a, 0)) %>%
    mutate(b = ifelse("b" %in% colnames(data), b, 0)) %>%
    mutate(c = ifelse("c" %in% colnames(data), c, 0)) %>%
    mutate(d = ifelse(a + b + c == 0, 0, ((a/(a+b))*c)+a)) %>%
    mutate(e = ifelse(a + b + c == 0, 0, ((b/(a+b))*c)+b)) %>%
    select(d, e)
}
A sample of the dataframe is:

 wt beet ilo age country ine sex
647    a   3  19       1  24   1
875    b   3  18       1  27   2
647    c   1  24       1   3   2
875    b   3  20       1  27   2
435    b   2  66       4  31   1
643    a   1  32       3   5   1
496    b   2  47       2   1   2
511    c   2  23       4   2   1
774    a   2  37       5   5   1
550    b   1  24       1   1   2
I take the main dataset, apply a filter, and then count the number of responses for the variable beet:
data2 <- df_beet %>%
  filter(age == 18 & sex == 1 & ilo == 2) %>%
  count(beet, wt = wt) %>%
  pivot_wider(names_from = beet, values_from = n) %>%
  f_calculate_net()
There are no results, and the resulting dataframe shows the columns d and e, but instead of zeros it shows “No data available in table”.

Your main problem here is the way you are using ifelse. The expression "a" %in% colnames(data) always returns a length-1 logical vector (either TRUE or FALSE), so the output of ifelse("a" %in% colnames(data), a, 0) is also of length 1: either the first element of a or a single 0. Since this is inside a mutate call, that length-1 value is recycled, so a is either overwritten by a column of its own first element or created as a column of zeros. Instead of ifelse you should use
if(!"a" %in% colnames(data)) data$a <- 0
And the same for columns b and c.
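To see the problem in isolation, here is a minimal illustration of the length behaviour of ifelse (a sketch, not part of the original code):

ifelse(TRUE, c(10, 20, 30), 0)
#> [1] 10

The result has the length of the test, and mutate then recycles that single value across the whole column.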
Even with that fix, you will still get NaN entries in columns d and e whenever a and b are both 0 but c isn't, because the expression (a/(a+b))*c + a divides by the sum of a and b. You should therefore only check whether a + b == 0, since that is the case in which you want to return a 0.
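The NaN comes directly from the zero-by-zero division:

0 / 0
#> [1] NaN
(0 / 0) * 3 + 0  # d and e when a = b = 0 but c = 3
#> [1] NaN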
So the fixed function would be something like:
f_calculate_net = function(data) {
  if(!"a" %in% colnames(data)) data$a <- 0
  if(!"b" %in% colnames(data)) data$b <- 0
  if(!"c" %in% colnames(data)) data$c <- 0
  data %>%
    mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
    mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
    select(d, e)
}
Let's create some random data to test this:
set.seed(123)
df <- data.frame(a = rpois(5, 1), b = rpois(5, 2), c = rpois(5, 1))
df
#> a b c
#> 1 0 0 3
#> 2 2 2 1
#> 3 1 4 1
#> 4 2 2 1
#> 5 3 2 0
And we see that we get the expected output:
f_calculate_net(df)
#> d e
#> 1 0.0 0.0
#> 2 2.5 2.5
#> 3 1.2 4.8
#> 4 2.5 2.5
#> 5 3.0 2.0
Created on 2022-08-15 by the reprex package (v2.0.1)

When a and b are both zero, a/(a+b) is NaN. If you want this case to be zero, change a + b + c == 0 to a + b == 0.
Based on Allan's explanation and comment, another possibility is to make the test a logical vector with the same length as the number of rows:
f_calculate_net = function(data) {
  data %>%
    mutate(a = ifelse(rep("a" %in% colnames(data), nrow(data)), a, 0)) %>%
    mutate(b = ifelse(rep("b" %in% colnames(data), nrow(data)), b, 0)) %>%
    mutate(c = ifelse(rep("c" %in% colnames(data), nrow(data)), c, 0)) %>%
    mutate(d = ifelse(a + b == 0, 0, ((a/(a+b))*c)+a)) %>%
    mutate(e = ifelse(a + b == 0, 0, ((b/(a+b))*c)+b)) %>%
    select(d, e)
}
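As a quick check (hypothetical input, assuming the pivot produced no "c" column at all):

# a data frame where the "c" column is missing entirely
df_no_c <- data.frame(a = c(1, 0), b = c(2, 0))
f_calculate_net(df_no_c)
#>   d e
#> 1 1 2
#> 2 0 0

Since the test vector is all FALSE when a column is missing, ifelse never evaluates the missing column and simply fills the new column with zeros.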

Related

Random Sample From a Dataframe With Specific Count

This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that the sample contains 2 instances where b=0 and 8 instances where b=1?
Right now, I know that I can do df[sample(nrow(df), 10), ] to get part of the answer, but that would give me random numbers of 0 and 1 instances. How can I specify exact counts of 0 and 1 instances while still taking a random sample?
Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0,1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
  c(sample(pop[pop == 0], n_zero),
    sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# Where d1 is the variable you are summarizing with mean and sd
dat <- data.frame(
  id = 1:100,
  val = sample(c(0,1), 100, replace = TRUE),
  d1 = runif(100))

yoursample <- function(dat, n_zero, n_one){
  c(sample(dat[dat$val == 0, "id"], n_zero),
    sample(dat[dat$val == 1, "id"], n_one))
}

sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids

mean(dat[dat$id %in% sample_ids, "d1"])
sd(dat[dat$id %in% sample_ids, "d1"])
Here is a suggestion:
First create a sample of 0s and 1s with an id column.
Then sample 2 and 8 rows under the respective condition and bind them together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1, size = 50, replace = TRUE)) %>%
  mutate(id = row_number())

df1 <- df[sample(which(df$value == 0), 2), ]
df2 <- df[sample(which(df$value == 1), 8), ]
df_final <- bind_rows(df1, df2)
   value    id
   <int> <int>
 1     0    14
 2     0    36
 3     1    21
 4     1    24
 5     1     2
 6     1    50
 7     1    49
 8     1    41
 9     1    28
10     1    33
library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
                 b = sample(c(0,1), 26, T))

bind_rows(
  df %>%
    filter(b == 0) %>%
    sample_n(2),
  df %>%
    filter(b == 1) %>%
    sample_n(8)
) %>%
  arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1

Defining indices for row sequences more succinctly

I have a dataframe like this:
set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE),
                 B = sample(LETTERS[1:5], 50, replace = TRUE))
I want to filter the dataframe on two parameters: (i) the target rows that match a certain criterion and (ii) a certain number of rows that precede the target rows. Specifically, I want to filter rows where A == "A" & B == "A" as well as the five rows preceding the target row. I can do this with a two-step operation: first by defining a function, and second by using the function as input for slice:
Sequ <- function(col1, col2) {
  # get row indices of target row with function `which`
  inds <- which(col1 == "A" & col2 == "A")
  # sort row indices of the rows before target row AND target row itself
  sort(unique(c(inds-5, inds-4, inds-3, inds-2, inds-1, inds)))
}
library(dplyr)
df %>%
  slice(Sequ(col1 = A, col2 = B))
A B
1 D C
2 D B
3 C B
4 C D
5 B B
6 A A
7 E B
8 E D
9 D C
10 D D
11 A A
12 C C
13 D E
14 B E
15 B E
16 B A
17 A A
18 C D
19 C B
20 B D
21 A B
22 A A
But surely there must be a more efficient replacement for this part: sort(unique(c(inds-5, inds-4, inds-3, inds-2, inds-1, inds))). If I want to filter not just the preceding 5 but, say, 10 or 100 rows, this way of defining each index individually quickly becomes impractical. How can this part be coded more economically?
1) Define bothA, which takes a matrix and returns TRUE if any row is all A's. Then use rollapply to apply it as a moving window:
library(zoo)
bothA <- function(x) any(rowSums(rbind(x) == "A") == 2)
ok <- rollapply(df, 6, bothA, align = "left", partial = TRUE, by.column = FALSE)
df[ok, ]
2) or in a pipe
df %>%
  filter(rollapply(., 6, bothA, align = "left", partial = TRUE, by.column = FALSE))
3) This also works:
ok <- rollapply(rowSums(df == "A") == 2, 6, any, align = "left", partial = TRUE)
df[ok, ]
Here is a dplyr solution that can be directly used in a pipe, with no need for filter.
Sequ <- function(x, col1, col2, value = "A"){
  x %>%
    mutate(grp = lag(cumsum({{col1}} == value & {{col2}} == value), default = 0)) %>%
    group_by(grp) %>%
    slice_tail(n = 5) %>%
    ungroup() %>%
    select(-grp)
}
df %>% Sequ(A, B)
# A tibble: 23 x 2
#    A     B
#    <chr> <chr>
#  1 B     D
#  2 C     C
#  3 E     A
#  4 D     B
#  5 A     A
#  6 C     D
#  7 E     E
#  8 C     E
#  9 C     C
# 10 A     A
# … with 13 more rows
One dplyr and purrr solution could be:
df %>%
  filter(row_number() %in% unlist(map(which(A == "A" & B == "A"), ~ (.x-5):.x)))
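For completeness, the hard-coded inds-5, inds-4, … sequence from the question can also be generated in one go with outer(); a base R sketch under the same setup:

inds <- which(df$A == "A" & df$B == "A")
rows <- sort(unique(as.vector(outer(inds, -5:0, `+`))))
rows <- rows[rows >= 1]  # drop indices that fall before the first row
df[rows, ]

Changing -5:0 to -10:0 or -100:0 scales the window without touching anything else.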

How to get sum by each factor level?

I have filtered data in which one of the columns has 5 factor levels, and I want to get the sum for each factor level.

levels(df_Temp$ATYPE)
[1] "a" "b" "c" "d" "Unknown"

I am using the below code

cast(df_Temp, ATYPE ~ AFTER_ADM, sum, value = "CHRGES")
but the output I am getting is as below
  ATYPE 0          1
1     a 0 2368968.39
2     b 0 3206567.47
3     c 0   19551.19
4     e 0 2528688.12
I want all the factor levels to appear, with a sum of 0 for the levels that have no data.
So the desired output is
  ATYPE 0          1
1     a 0 2368968.39
2     b 0 3206567.47
3     c 0   19551.19
4     d 0          0
5     e 0 2528688.12
Using xtabs from base R (here ATYPE == "e" is excluded from the data to simulate a missing level; because ATYPE is a factor, the empty level is kept and filled with zeros):
xtabs(CHRGES ~ ATYPE + AFTER_ADM, subset(df_Temp, ATYPE != "e"))
# AFTER_ADM
#ATYPE 0 1
# a 0.00000000 -5.92270971
# b -1.68910431 0.05222349
# c -0.26869311 0.16922669
# d 1.44764443 -1.59011411
# e 0.00000000 0.00000000
data

set.seed(24)
df_Temp <- data.frame(ATYPE = sample(letters[1:5], 20, replace = TRUE),
                      AFTER_ADM = sample(0:1, 20, replace = TRUE),
                      CHRGES = rnorm(20))
If I understand your question correctly, you can use dplyr. First I created an example dataset:
set.seed(123)
x <- sample(letters[1:5], 1e3, replace = T)
x[x == "e"] <- "Unknown"
y <- sample(1:100, 1e3, replace = T)
df1 <- data.frame(ATYPE = factor(x), AFTER_ADM = y)
df1$AFTER_ADM[df1$ATYPE == "Unknown"] <- NA
head(df1, 10)
ATYPE AFTER_ADM
1 b 28
2 d 60
3 c 17
4 Unknown NA
5 Unknown NA
6 a 48
7 c 78
8 Unknown NA
9 c 7
10 c 45
And then use group_by and summarise to get the sum and the counts. I was not sure whether you would want the counts for each factor level, but they are easy to take out if you are not interested:
library(dplyr)
df1 %>%
  group_by(ATYPE) %>%
  summarise(sum_AFTER_ADM = sum(AFTER_ADM, na.rm = T),
            n_ATYPE = n())
# A tibble: 5 x 3
  ATYPE   sum_AFTER_ADM n_ATYPE
  <fct>           <int>   <int>
1 a               10363     198
2 b               11226     206
3 c                9611     203
4 d                9483     195
5 Unknown             0     198
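One caveat: group_by keeps the Unknown level here only because it is present in the data. If a level were entirely absent, you would need .drop = FALSE to keep it in the output; a sketch assuming dplyr >= 0.8:

df1 %>%
  group_by(ATYPE, .drop = FALSE) %>%
  summarise(sum_AFTER_ADM = sum(AFTER_ADM, na.rm = TRUE),
            n_ATYPE = n())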
Another possible solution using dplyr and tidyr. Using count and complete from the two packages will help solve your problem.
library(dplyr)
library(tidyr)
# using iris as toy data
iris2 <- iris %>%
  filter(Species != "setosa")

# count data and then fill n with 0
ir3 <- count(iris2, Species) %>%
  complete(Species, fill = list(n = 0))
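Printing ir3 should now show all three species, with setosa filled in at n = 0 (output sketched, type annotations omitted):

ir3
#> # A tibble: 3 x 2
#>   Species        n
#> 1 setosa         0
#> 2 versicolor    50
#> 3 virginica     50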

How to use window function in R

I have the following data frame structure :
id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0
Here a, b, c and d are ids, and status is a flag taking the values 0, 1 and 2.
I need to select each id whose status has changed from 0 to 1 at any point during the whole time frame, so the expected output would be the two ids 'b' and 'c'.
I thought of using lag to accomplish that, but then I won't be able to handle id 'c', which has a 0 at the beginning but only reaches 1 at a later stage. Any thoughts on how we can achieve this using window functions (or any other technique)?
You want to find id's having a status of 1 after having had a status of 0.
Here is a dplyr solution:
library(dplyr)
# Generate data
mydf = tibble(
  id = c(rep("a", 3), rep("b", 4), rep("c", 4), rep("d", 3)),
  status = c(1, 2, 1, 1, 1, 0, 1, 0, 0, 2, 1, 0, 2, 0)
)
mydf %>%
  group_by(id) %>%
  # Keep only 0's and 1's
  filter(status %in% c(0, 1)) %>%
  # Compute the difference between consecutive statuses
  mutate(dif = status - lag(status, 1)) %>%
  # A dif of 1 means a 0 was followed by a 1
  filter(dif == 1) %>%
  # Catch the corresponding id's
  select(id) %>%
  unique
One possible way using dplyr (Edited to include id only when a 1 appears after a 0):
library(dplyr)
df %>%
  group_by(id) %>%
  filter(status %in% c(0, 1)) %>%
  filter(status == 0 & lead(status, default = 0) == 1) %>%
  select(id) %>%
  unique()
#> # A tibble: 2 x 1
#> # Groups: id [2]
#> id
#> <chr>
#> 1 b
#> 2 c
Data
df <- read.table(text = "id status
a 1
a 2
a 1
b 1
b 1
b 0
b 1
c 0
c 0
c 2
c 1
d 0
d 2
d 0", header = TRUE, stringsAsFactors = FALSE)
I dunno if this is the most efficient way, but: split by id, check the statuses for a 0, and if there is one, check for a 1 from the first 0 onwards:
lst <- split(df$status, df$id)

f <- function(x) {
  # no 0 at all: a 0 -> 1 change is impossible
  if (!any(x == 0)) return(FALSE)
  # otherwise look for a 1 anywhere from the first 0 onwards
  any(x[which.max(x == 0):length(x)] == 1)
}

names(lst)[sapply(lst, f)]
# [1] "b" "c"

Add group counter on data frame based on column

Say I have a sorted data frame with a distance variable d indicating the distance between measures in variable a.
library(dplyr)
set.seed(1)
df <-
  data.frame(a = sort(sample(2:20, 8))) %>%
  mutate(d = a - lag(a))
This gives:
> df
   a  d
1  5 NA
2  7  2
3  8  1
4  9  1
5 11  2
6 14  3
7 15  1
8 16  1
I am trying to add a kind of counter/grouping variable g which indicates whether d is larger than, say, 2. g could take values like g1, g2, ... etc. In other words, I would like to "increase" g whenever d > 2. For the data above we would get:
> df
   a  d  g
1  5 NA g1
2  7  2 g1
3  8  1 g1
4  9  1 g1
5 11  2 g1
6 14  3 g2
7 15  1 g2
8 16  1 g2
I thought of using a function with a global side effect (and yes, I know this is generally a bad idea, but I could not think of anything else):
f <- function(x){
  if(x)
    g <<- g + 1
  return(paste0('g', g))
}
And then do:
g <- 0
df %>%
  mutate(g = ifelse(is.na(d) | d > 2, f(TRUE), f(FALSE)))
But g is not increased inside mutate (or sapply). In real-world data I might have 1000s of g groups.
You can try,
with(df, paste0('g', cumsum(replace(d, is.na(d), 0) > 2) + 1))
#[1] "g1" "g1" "g1" "g1" "g1" "g2" "g2" "g2"
A solution using dplyr and data.table. df2 is the final output.
library(dplyr)
library(data.table)
df2 <- df %>%
  mutate(Large2 = ifelse(d > 2, 1, NA)) %>%                   # flag rows where d > 2
  mutate(RunID = rleid(Large2)) %>%                           # id each run of flags / non-flags
  mutate(ID = ifelse(RunID %% 2 == 0, RunID + 1, RunID)) %>%  # merge each flagged run with the run that follows
  mutate(g = paste0("g", group_indices(., ID))) %>%           # turn the merged ids into g1, g2, ...
  select(a, d, g)
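For reference, rleid() from data.table gives each run of equal values its own id (consecutive NAs count as one run), which is what makes the even/odd trick on RunID work; a minimal illustration:

library(data.table)
rleid(c(NA, NA, 1, NA))
#> [1] 1 1 2 3

In the data above the first run is an NA run, so the d > 2 runs get even ids, and mapping each even id to the following odd id glues every d > 2 row to the rows after it.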
