Combining MIN and MAX into a rowwise function in R

I have a bit of code from an Excel spreadsheet that uses MIN and MAX, and I'm trying to transfer it over to R.
I have two columns, "mini" and "maxi", which represent a range of possible values. The third column I'm trying to populate is the proportion of that range that falls between 5 and 19. Looking at the first row in the example, where "mini" is 10 and "maxi" is 15, the value of the 5-19 column should be 1, since the range falls completely in that span. In row 9, "mini" is 1 and "maxi" is 3, so the range falls completely outside 5-19 and the value should be 0. Row 3, however, straddles the range, and only 25% of it falls within 5-19, so the output value should be 0.25.
Edit: I have updated R and, although several solutions worked before, I am now getting the error:
Error in mutate_impl(.data, dots, caller_env()) :
attempt to bind a variable to R_UnboundValue
Here's an example of how the DF looks:
ID mini maxi
1 10 15
2 17 20
3 2 5
4 40 59
5 40 59
6 21 39
7 21 39
8 17 20
9 1 3
10 4 6
The code that I used previously was something like this:
=MAX((MIN(maxi,19)-MAX(mini,5)+1),0)/(maxi-mini+1)
I was initially trying to use something like
percentoutput <- mutate(DF, output = MAX((MIN(maxi,19) - MAX(mini,5) + 1),0)/(maxi-mini + 1))
This resulted in the output column being full of NAs.
I wasn't sure if this is a situation where I'd need to run an apply function, but I'm not sure how to go about setting it up. Any guidance is appreciated!
Here is an example DF:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), min = c(10,
17, 2, 40, 40, 21, 21, 17, 1, 4), max = c(15, 20, 5, 59, 59,
39, 39, 20, 3, 6)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(ID = structure(list(), class = c("collector_double",
"collector")), mini = structure(list(), class = c("collector_double",
"collector")), maxi = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))

We can calculate the proportion of each min-to-max range that falls within 5:19 using rowwise().
library(dplyr)
df %>% rowwise() %>% mutate(ratio = mean(min:max %in% 5:19))
# ID min max ratio
# <dbl> <dbl> <dbl> <dbl>
# 1 1 10 15 1
# 2 2 17 20 0.75
# 3 3 2 5 0.25
# 4 4 40 59 0
# 5 5 40 59 0
# 6 6 21 39 0
# 7 7 21 39 0
# 8 8 17 20 0.75
# 9 9 1 3 0
#10 10 4 6 0.667
And similarly in base R using apply():
df$ratio <- apply(df[-1], 1, function(x) mean(x[1]:x[2] %in% 5:19))
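If rowwise() proves slow on a larger table, the same Excel formula can be written in a vectorized dplyr form with pmin()/pmax(). A sketch, assuming the data frame df with columns min and max as in the dput above, and the 5-19 window:
library(dplyr)

lo <- 5
hi <- 19

# Length of the overlap between [min, max] and [lo, hi], clamped at zero,
# divided by the width of [min, max]; the "+ 1" follows the inclusive-count
# convention of the Excel formula in the question.
df %>%
  mutate(ratio = pmax(pmin(max, hi) - pmax(min, lo) + 1, 0) / (max - min + 1))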

Here is a vectorized version using data.table:
DT[, portion := {
  mn <- pmax(mini, lb)
  mx <- pmin(maxi, ub)
  fifelse(mn <= mx, (mx - mn + 1L) / (maxi - mini + 1L), 0)
}]
Or equivalently in base R:
DF$mn <- pmax(DF$mini, lb)
DF$mx <- pmin(DF$maxi, ub)
DF$portion <- ifelse(DF$mn <= DF$mx, (DF$mx - DF$mn + 1L) / (DF$maxi - DF$mini + 1L), 0)
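If the helper columns are not needed afterwards, they can simply be dropped:
DF$mn <- NULL  # remove the intermediate columns once portion is computed
DF$mx <- NULL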
output:
ID mini maxi portion
1: 1 10 15 1.0000000
2: 2 17 20 0.7500000
3: 3 2 5 0.2500000
4: 4 40 59 0.0000000
5: 5 40 59 0.0000000
6: 6 21 39 0.0000000
7: 7 21 39 0.0000000
8: 8 17 20 0.7500000
9: 9 1 3 0.0000000
10: 10 4 6 0.6666667
data:
library(data.table)
DT <- fread("ID mini maxi
1 10 15
2 17 20
3 2 5
4 40 59
5 40 59
6 21 39
7 21 39
8 17 20
9 1 3
10 4 6")
lb <- 5L
ub <- 19L

We can use map2():
library(dplyr)
library(purrr)
df %>%
  mutate(ratio = map2_dbl(min, max, ~ mean(.x:.y %in% 5:19)))

Related

How to convert a list of lists into a dataframe while keeping track of the list "number" in R?

I have a list of lists stored in df_list (sample code below). I want to convert it into a dataframe but I want to keep a counter count such that it tracks which list the data is from.
I want the end product to look something like this:
count replicate level high.density low.density
1 1 low 54 36
1 1 low 54 31
1 2 low 11 28
1 2 low 11 45
1 1 mid 24 10
1 2 mid 12 24
1 2 mid 12 17
1 2 up 40 2
2 1 low 54 31
2 1 low 54 31
2 2 low 11 45
2 2 low 11 28
2 1 mid 24 10
2 2 mid 12 24
2 2 up 20 2
......
1000 2 up 40 5
#List of Lists code
df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10,
NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
df$replicate <- as.factor(as.numeric(df$replicate))
df$level <- as.factor(as.character(df$level))
df_shuffle <- function(DF) {
  my_split <- split(DF, f = ~ DF$replicate + DF$level)
  shuffle <- lapply(my_split, \(x) {
    nrX <- nrow(x)
    cbind(x[, c('replicate', 'level')],
          high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
          low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
  })
  DF_new <- do.call(rbind, shuffle)
  rownames(DF_new) <- NULL
  return(DF_new)
}
B <- 1000
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
We can simply use the data.table::rbindlist() function:
library(data.table)
rbindlist(df_list, idcol = 'count')
# count replicate level high.density low.density
# 1: 1 1 low 14 31
# 2: 1 1 low 14 31
# 3: 1 2 low 12 45
# 4: 1 2 low 12 45
# 5: 1 1 mid 24 NA
# ---
# 10996: 1000 2 mid NA 17
# 10997: 1000 2 mid NA 17
# 10998: 1000 2 up 20 10
# 10999: 1000 2 up 20 5
# 11000: 1000 2 up 40 2
We can add a column containing the index or the names of the list elements with dplyr::bind_rows by specifying the .id argument. If the list has names, bind_rows will add those; otherwise it will take the index (we can't choose which one bind_rows should take).
df_list %>%
bind_rows(., .id = "count")
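Note that for an unnamed list the .id column typically comes back as character ("1", "2", ...); a small sketch of converting it, assuming a plain integer counter is wanted:
df_list %>%
  bind_rows(.id = "count") %>%
  mutate(count = as.integer(count))  # .id is stored as text; convert for a numeric counter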
If we want to go beyond that, for example to control where the new column is placed, we can use purrr::imap and pipe the result into bind_rows (note that imap, like bind_rows, uses the element names when the list is named and the position only when it is unnamed):
library(tidyverse)
imap(df_list,
     ~ mutate(.x,
              count = .y,
              .before = "replicate")) %>%
  bind_rows()
#> count replicate level high.density low.density
#> 1 1 1 low 14 31
#> 2 1 1 low 14 36
#> 3 1 2 low 11 45
#> 4 1 2 low 12 28
#> 5 1 1 mid 24 10
#> 6 1 1 mid 24 10
#> 7 1 2 mid 12 17
#> 8 1 2 mid NA 17
#> 9 1 2 up 40 5
#> 10 1 2 up 20 5
#> 11 1 2 up 20 5
#> 12 2 1 low 14 31
#> 13 2 1 low 54 31
#> ...
Created on 2022-01-20 by the reprex package (v2.0.1)
In imap we create a new column for each data.frame .x with mutate. The new column is called count, and .y is the index of each list element. We use mutate's .before argument to make this column the first column of each data.frame. The result of the call to imap is a list of data.frames, which we merge into one large data.frame with dplyr::bind_rows.
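If the list happens to be named but a plain positional counter is still wanted, a sketch using map2() with seq_along() forces the index regardless of names:
library(dplyr)
library(purrr)

map2(df_list, seq_along(df_list),
     ~ mutate(.x, count = .y, .before = "replicate")) %>%
  bind_rows()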

Create a column which tells the source of the number between the first two columns

I have a data frame which has three columns:
df <- structure(list(lowage = c(45, 15, 9, 51, 22, 45, 4, 4, 9, 25),
highage = c(50, 21, 14, 60, 24, 50, 8, 8, 14, 30)), .Names = c("lowage",
"highage"), row.names = c(NA, 10L), class = "data.frame")
df$random_number <- apply(df, 1, function(x) sample(seq(x[1], x[2]), 1))
I want to create a fourth column that tells us where the random_number comes from. So, for example, suppose in the first row lowage = 45 and highage = 46, and the random number generated is 46; I'd like the fourth column to say 'highage', since the number comes from the highage column. And so on...
If the solution can be in dplyr, that would be great!
Is this what you want?
df %>%
  mutate(newcol = case_when(random_number == lowage ~ "lowage",
                            random_number == highage ~ "highage",
                            TRUE ~ "between"))
# lowage highage random_number newcol
# 1 45 50 47 between
# 2 15 21 18 between
# 3 9 14 13 between
# 4 51 60 57 between
# 5 22 24 23 between
# 6 45 50 49 between
# 7 4 8 4 lowage
# 8 4 8 6 between
# 9 9 14 9 lowage
# 10 25 30 27 between
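For reference, a base R sketch of the same labelling (assuming the random_number column has already been created as in the question):
df$newcol <- ifelse(df$random_number == df$lowage, "lowage",
             ifelse(df$random_number == df$highage, "highage", "between"))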

Is there a way to automate the multiplication of columns in R in relation to their names?

I have a dataset like this.
> dataset
  id   a b  c   d
1  1  10 1 30  50
2  2   5 0  5  60
3  3  20 1 18  90
4  4 103 0 20  80
5  5  16 1 56 100
And so on, up to n columns depending on the requirement.
My colleagues in the research area carry out certain analyses and give me a data frame as input, in which the row names correspond to the variables of interest for a new dataset.
Something like this.
> rownames(Features)
[1] "a"   "b"   "d"   "a:d" "b:d" "b:c" "a:c"
Where the colon (:) represents the "product of". So in order to continue working I need to include those products in the original dataset.
I have manually created a function that reproduces the multiplications in the following way (where x represents my original dataset):
Products <- function(x) {
  x <- x %>% mutate(Product1 = x$a * x$d)
  x <- x %>% mutate(Product2 = x$b * x$d)
  x <- x %>% mutate(Product3 = x$b * x$c)
  x <- x %>% mutate(Product4 = x$a * x$c)
  return(x)
}
However, given that the number of products to create is variable every time, I want to find a way to automate the creation of these column multiplications with the names that they give me as input. I'm sure my approach isn't the best, so any help is very welcome.
This seems to be a modelling task, and the choice of : to stand for a product is quite fitting, as that is exactly what it means in the modelling world. In base R, you could do:
model.matrix(reformulate(c(-1,rownames(Features))), dataset)
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
I added -1 in order to remove the intercept; otherwise you could keep it. Also note that the result is a matrix; you can then convert it to a data frame.
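As a sketch of that conversion (using a hypothetical name dataset_ext), check.names = FALSE keeps the "a:d"-style column names instead of letting data.frame() rewrite them as a.d, b.d, ...:
mm <- model.matrix(reformulate(c(-1, rownames(Features))), dataset)
dataset_ext <- data.frame(dataset["id"], mm, check.names = FALSE)  # id plus the requested terms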
I am not sure if the following code works for you; here eval + gsub are used to produce the product columns:
with(
  dataset,
  list2DF(
    setNames(
      lapply(
        gsub(":", "*", rownames(Features)),
        function(x) eval(str2lang(x))
      ),
      rownames(Features)
    )
  )
)
which gives
a b d a:d b:d b:c a:c
1 10 1 50 500 50 30 300
2 5 0 60 300 0 0 25
3 20 1 90 1800 90 18 360
4 103 0 80 8240 0 0 2060
5 16 1 100 1600 100 56 896
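For clarity, str2lang() turns each rewritten string into an unevaluated call, which eval() then evaluates against the columns made visible by with(); a minimal illustration:
str2lang(gsub(":", "*", "a:d"))
#> a * d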
Data
> dput(dataset)
structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
> dput(Features)
structure(list(Features = 1:7), class = "data.frame", row.names = c("a",
"b", "d", "a:d", "b:d", "b:c", "a:c"))
We could use strsplit to split the names that contain :, select the columns of 'dataset' based on those split names, Reduce with * to do elementwise multiplication, and assign the resulting 'Product' columns to the original 'dataset'.
nm1 <- grep(':', rownames(Features), value = TRUE)
lst1 <- lapply(strsplit(nm1, ":", fixed = TRUE),
function(x) Reduce(`*`, dataset[x]))
dataset[paste0("Product", seq_along(lst1))] <- lst1
-output
dataset
# id a b c d Product1 Product2 Product3 Product4
#1 1 10 1 30 50 500 50 30 300
#2 2 5 0 5 60 300 0 0 25
#3 3 20 1 18 90 1800 90 18 360
#4 4 103 0 20 80 8240 0 0 2060
#5 5 16 1 56 100 1600 100 56 896
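A small variant, in case the original "a:d"-style labels are preferred over Product1, Product2, ... as column names (reusing nm1 and lst1 from above):
dataset[nm1] <- lst1  # creates columns literally named "a:d", "b:d", "b:c", "a:c"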
data
dataset <- structure(list(id = 1:5, a = c(10, 5, 20, 103, 16), b = c(1,
0, 1, 0, 1), c = c(30, 5, 18, 20, 56), d = c(50, 60, 90, 80,
100)), class = "data.frame", row.names = c(NA, -5L))
Features <- structure(1:7, .Dim = c(7L, 1L), .Dimnames = list(c("a", "b",
"d", "a:d", "b:d", "b:c", "a:c"), NULL))
This is similar to many of the solutions above. One thing about R is that you can achieve the same result in many different ways, though the underlying principle is the same.
library(dplyr)
library(purrr)  # for map()
dataset <- tibble(id = c(1, 2, 3, 4, 5),
a = c(10, 5, 20, 103, 16),
b = c(1, 0, 1, 0, 1),
c = c(30, 5, 18, 20, 56),
d = c(50, 60, 90, 80, 100))
features = c("a", "b", "d", "a:d", "b:d", "b:c", "a:c")
final <- bind_cols(
  map(features,
      function(x) {
        dataset %>%
          mutate(!!x := eval(parse(text = gsub(":", "*", x)))) %>%
          select(!!x)
      }
  ))
Final dataset.
# A tibble: 5 x 9
id a b c d `a:d` `b:d` `b:c` `a:c`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 1 30 50 500 50 30 300
2 2 5 0 5 60 300 0 0 25
3 3 20 1 18 90 1800 90 18 360
4 4 103 0 20 80 8240 0 0 2060
5 5 16 1 56 100 1600 100 56 896

Add row to dataframe based on condition of previous row in R

I have a dataframe that looks something like this:
class <- c(3,0,3,0,0)
value <- c(50,50,70,30,100)
days <- c(3,3,2,2,1)
mydata <- data.frame(class, value, days)
What I need is for each day to have both classes represented - so if there is no class 3 on a given day (in this example, day 1) I'd like to add a row where class = 3 and value = 0 and day = 1. My real data is more complicated, because there are varying numbers of rows for each day (and many more days than 3), and many other columns (but for which it would be fine to enter NA). This doesn't seem like too complicated a problem, but I'm having trouble wrapping my head around the code. Thanks so much!
Using the tidyverse, you can use complete():
library(tidyverse)
mydata %>%
complete(days, class, fill = list(value = 0))
Output
# A tibble: 6 x 3
days class value
<dbl> <dbl> <dbl>
1 1 0 100
2 1 3 0
3 2 0 30
4 2 3 70
5 3 0 50
6 3 3 50
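The question also mentions other columns that may simply become NA: complete() only fills the variables listed in fill, so any extra columns are left as NA in the newly added rows. A quick sketch with a hypothetical extra column other:
mydata$other <- letters[1:5]   # hypothetical extra column
mydata %>%
  complete(days, class, fill = list(value = 0))   # `other` is NA in the added rows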
Data
mydata <- structure(list(class = c(3, 0, 3, 0, 0), value = c(50, 50, 70,
30, 100), days = c(3, 3, 2, 2, 1)), class = "data.frame", row.names = c(NA,
-5L))
With base R, we can do
out <- merge(expand.grid(lapply(mydata[c('class', 'days')],
unique)), mydata, all.x = TRUE)
out$value[is.na(out$value)] <- 0
out
# class days value
#1 0 1 100
#2 0 2 30
#3 0 3 50
#4 3 1 0
#5 3 2 70
#6 3 3 50
NOTE: No packages used
Or with data.table
library(data.table)
setDT(mydata)[CJ(class, days, unique = TRUE),
on = .(class, days)][is.na(value), value := 0][]
# class value days
#1: 0 100 1
#2: 0 30 2
#3: 0 50 3
#4: 3 0 1
#5: 3 70 2
#6: 3 50 3
Or using crossing/left_join from tidyverse
library(dplyr)
library(tidyr)
tidyr::crossing(class = unique(mydata$class),
days = unique(mydata$days)) %>%
left_join(mydata) %>%
mutate(value = replace_na(value, 0))
# A tibble: 6 x 3
# class days value
# <dbl> <dbl> <dbl>
#1 0 1 100
#2 0 2 30
#3 0 3 50
#4 3 1 0
#5 3 2 70
#6 3 3 50
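left_join(mydata) will message about the columns it joined on; they can also be spelled out explicitly, a minor variation of the same pipeline:
tidyr::crossing(class = unique(mydata$class),
                days = unique(mydata$days)) %>%
  left_join(mydata, by = c("class", "days")) %>%
  mutate(value = replace_na(value, 0))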

Removing row with duplicated values in all columns of a data frame (R)

With the following data frame:
d <- structure(list(n = c(2, 3, 5), s = c(2, 8, 3),t = c(2, 18, 30)), .Names = c("n", "s","t"), row.names = c(NA, -3L), class = "data.frame")
which looks like:
> d
n s t
1 2 2 2
2 3 8 18
3 5 3 30
How can I remove rows whose values are duplicated across all columns?
Yielding:
n s t
2 3 8 18
3 5 3 30
Here's one possible approach, which compares all columns to the first:
d[rowSums(d == d[,1]) != ncol(d),]
# n s t
# 2 3 8 18
# 3 5 3 30
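An equivalent base R check, as a sketch, keeps the rows that contain more than one distinct value:
d[apply(d, 1, function(x) length(unique(x)) > 1), ]
#   n s  t
# 2 3 8 18
# 3 5 3 30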
