R dplyr: Problem converting a column from character to integer using dplyr - r

I am having a problem with the following script. When I convert the min and max columns of the data.frame base from character to integer using dplyr, they seem to "convert" back to character, and the result, which should be 582, comes out as 513.
base%>%
mutate(ocor=str_count(pass,letter))%>%
filter(ocor%>%between(min,max))%>%
count()
To correct the problem, I tried to convert the variables inside the dplyr pipeline. However, they seem to convert back.
base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
filter(ocor%>%between(min,max))%>%
count()
class(base$max)
class(base$min)
n
1 513
> class(base$max)
[1] "character"
> class(base$min)
[1] "character"
When I do the conversion outside dplyr I get the correct result, for example:
a<-base%>%
mutate(ocor=str_count(pass,letter))%>%
select(ocor)
class(base$max)
class(base$min)
base$max<-as.integer(base$max)
base$min<-as.integer(base$min)
sum(a >= base$min & a <= base$max)
[1] 582
I can't understand what's going on. An example of the database for clarification:
head(base)
min max letter pass ocor
1 2 6 c fcpwjqhcgtffzlbj 2
2 6 9 x xxxtwlxxx 6
3 7 10 q nfbrgwqlvljgq 2
4 2 3 g gjggg 4
5 2 6 s sjsssss 6
6 4 13 b mdbctbzgcpdjbhsdctrd 3
The original base, without changes:
> head(base)
V1 V2 V3
1 2-6 c: fcpwjqhcgtffzlbj
2 6-9 x: xxxtwlxxx
3 5-6 w: wwwwlwwwh
4 7-10 q: nfbrgwqlvljgq
5 2-3 g: gjggg
6 9-11 q: qqqqqqnqgqq
The changes:
base<-read.table('base.txt')
library(tidyverse)
base<-base%>%
separate(V1,c('min','max'),'-')%>%
rename(letter=V2,pass=V3)%>%
mutate(letter = str_replace(letter,':',''))

That's because you are not altering base.
%>% does not assign the result to a variable. I.e.
base %>% mutate(foo=bar(x))
does not alter base. It will just show the result on the console (and show nothing at all if you are running the script or calling it from a function).
You might be confusing the pipe-operator with %<>% (found in the package magrittr) which uses the left-hand variable as input for the pipe, and overwrites the variable with the modified result.
Try
base <- base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
filter(ocor%>%between(min,max))%>%
count()
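A minimal sketch of the point about assignment, using a made-up data frame `toy` with a column `v` (both names are illustrative and not from the question): piping into mutate() returns a modified copy, and the original only changes once the result is assigned back.

```r
library(dplyr)

# Hypothetical toy data; `toy` and `v` are illustrative names
toy <- data.frame(v = c("1", "2", "3"), stringsAsFactors = FALSE)

# Piping alone returns a modified copy; `toy` itself is untouched
converted <- toy %>% mutate(v = as.integer(v))
class(toy$v)        # still "character"
class(converted$v)  # "integer"

# Assigning the result back is what actually changes `toy`
toy <- toy %>% mutate(v = as.integer(v))
class(toy$v)        # now "integer"
```

Note that assigning a pipeline that ends in count() back to the same name replaces the whole data frame with the one-row count, so a separate name for the result may be safer.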
Regarding min and max being converted back to character: I cannot reproduce that once the result is assigned.
Regarding the filter not working as expected: between does not seem to handle vectors for its left and right inputs, so it is not applied row by row here. A fairly new way around this is rowwise:
Without rowwise:
base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
mutate(between(ocor, min,max))
min max letter pass ocor between(ocor, min, max)
1 2 6 c fcpwjqhcgtffzlbj 2 TRUE
2 6 9 x xxxtwlxxx 6 TRUE
3 7 10 q nfbrgwqlvljgq 2 TRUE
4 2 3 g gjggg 4 TRUE
5 2 6 s sjsssss 6 TRUE
6 4 13 b mdbctbzgcpdjbhsdctrd 3 TRUE
With rowwise:
base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
rowwise %>% mutate(between(ocor, min,max))
# A tibble: 6 x 6
# Rowwise:
min max letter pass ocor `between(ocor, min, max)`
<dbl> <dbl> <chr> <chr> <int> <lgl>
1 2 6 c fcpwjqhcgtffzlbj 2 TRUE
2 6 9 x xxxtwlxxx 6 TRUE
3 7 10 q nfbrgwqlvljgq 2 FALSE
4 2 3 g gjggg 4 FALSE
5 2 6 s sjsssss 6 TRUE
6 4 13 b mdbctbzgcpdjbhsdctrd 3 FALSE
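As an alternative to rowwise, the same row-by-row check can be sketched with plain vectorised comparisons, which avoids relying on how between treats its bounds. This is only a sketch on the six sample rows shown in the question; the column name `in_range` is made up for illustration. The conversion to integer matters because comparing a number with a character value in R falls back to string comparison.

```r
library(dplyr)
library(stringr)

# The six sample rows from the question, with min/max still stored as character
base <- data.frame(
  min    = c("2", "6", "7", "2", "2", "4"),
  max    = c("6", "9", "10", "3", "6", "13"),
  letter = c("c", "x", "q", "g", "s", "b"),
  pass   = c("fcpwjqhcgtffzlbj", "xxxtwlxxx", "nfbrgwqlvljgq",
             "gjggg", "sjsssss", "mdbctbzgcpdjbhsdctrd"),
  stringsAsFactors = FALSE
)

res <- base %>%
  mutate(across(c(min, max), as.integer),    # character -> integer
         ocor = str_count(pass, letter)) %>% # occurrences of letter in pass
  mutate(in_range = ocor >= min & ocor <= max)  # element-wise, one result per row

res$in_range       # TRUE TRUE FALSE FALSE TRUE FALSE
sum(res$in_range)  # 3
```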

Related

R improve loop efficiency: Operating on columns that correspond to rows in a second dataframe

I have two data frames:
dat <- data.frame(Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24)
dat
#> Digits_Lower Digits_Upper random
#> 1 1 6 20
#> 2 2 7 21
#> 3 3 8 22
#> 4 4 9 23
#> 5 5 10 24
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4)
cb
#> Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4
I am trying to perform some operation on multiple columns in dat similar to these examples: In data.table: iterating over the rows of another data.table and R multiply columns by values in second dataframe. However, I
am hoping to operate on these columns with an extended expression for every value in its corresponding row in cb. The solution should be applicable
for a large dataset. I have created this for-loop so far.
dat.loop <- dat
for(i in seq_len(nrow(cb)))
{
#create new columns from the Digits column of `cb`
dat.loop[paste0("disp.", cb$Digits[i])] <-
#some operation using every value in a column in `dat` with its corresponding row in `cb`
(dat.loop[, cb$Digits[i]]- cb$y[i]) * cb$x[i]
}
dat.loop
#> Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
I will then perform operations on the data that I appended to dat in dat.loop applying a similar
for-loop, and then perform yet another operation on those values. My dataset is very large, and I imagine
my use of for-loops will become cumbersome. I am wondering:
Would another method improve efficiency such as using data.table or tidyverse?
How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code
to perform operations on columns in dat with corresponding rows in cb. Ideally, I would split my for-loop into
multiple functions that would for example, avoid indexing into cb for the same values over and over again or appending unnecessary data to my dataframe, but I'm not really sure how to
do this.
Any help is appreciated!
EDIT:
I've modified the code @Desmond provided to make it more generic, since dat and cb will come from user-inputted files,
and dat can have a varying number of columns/column names that I will be operating on (the columns in dat will always start with
"Digits_" and will be specified in the "Digits" column of cb).
library(tidytable)
results <- dat %>%
crossing.(cb) %>%
mutate_rowwise.(disp = (get(`Digits`)-y) *x ) %>%
pivot_wider.(names_from = Digits,
values_from = disp,
names_prefix = "disp_")
results2 <- results %>%
fill.(starts_with("disp"), .direction = c("downup"), .by = 'random') %>%
select.(-c(x,y)) %>%
distinct.()
results2
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Here's a tidyverse solution:
crossing generates combinations from both datasets
case_when to apply your logic
pivot_wider, filter and bind_cols to clean up the output
To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing() with crossing.(), pivot_wider() with pivot_wider.(), etc
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.2.1
#> Warning: package 'tibble' was built under R version 4.2.1
dat <- data.frame(
Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24
)
cb <- data.frame(
Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4
)
results <- dat |>
crossing(cb) |>
mutate(disp = case_when(
Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
)) |>
pivot_wider(names_from = Digits,
values_from = disp,
names_prefix = "disp_")
results |>
filter(!is.na(disp_Digits_Lower)) |>
select(-c(x, y, disp_Digits_Upper)) |>
bind_cols(results |>
filter(!is.na(disp_Digits_Upper)) |>
select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> <int> <int> <int> <int> <int>
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Created on 2022-08-20 by the reprex package (v2.0.1)
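For comparison, here is a base R sketch that avoids both the explicit loop and the crossing/pivot round trip: Map walks the rows of cb in parallel and returns one new column per row. It is not part of the answer above; the helper name `out` and the `disp_` prefix are illustrative choices, and it assumes every name in cb$Digits exists as a column of dat.

```r
dat <- data.frame(Digits_Lower = 1:5,
                  Digits_Upper = 6:10,
                  random = 20:24)
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
                 x = 1:2,
                 y = 3:4)

# One list element per row of cb: (column - y) * x
new_cols <- Map(function(col, x, y) (dat[[col]] - y) * x,
                cb$Digits, cb$x, cb$y)

# Work on a copy so the original `dat` stays unchanged
out <- dat
out[paste0("disp_", cb$Digits)] <- unname(new_cols)
out
```

Because the column names come from cb itself, nothing needs to be hard-coded when the user-supplied files contain a different set of "Digits_" columns.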

Divide multiple variable values by a specific value in R

I'm trying to pull off something simple but can't seem to get my head around it. My data looks like this
|Assay|Sample|Number|
|A|1|10|
|B|1|25|
|C|1|30|
|A|2|45|
|B|2|65|
|C|2|8|
|A|3|10|
|B|3|81|
|C|3|12|
What I need to do is to divide each "Number" value for each sample by the value of the respective assay A. That is, for sample 1, I would like to have 10/10, 25/10 and 30/10. Then for sample 2, I would need 45/45, 65/45 and 8/45 and so on with the rest of the samples.
I have already tried doing:
mutate(Normalised = Number/Number[Assay == "A"])
as suggested in another post but the results are not correct.
Any help would be great. Thank you very much!
Using dplyr
df <- data.frame(Assay=rep(c('A','B','C'),3),
Sample=rep(1:3,each=3),
Number=c(10,25,30,45,65,8,10,81,12))
df <- df %>%
group_by(Sample) %>%
arrange(Assay) %>%
mutate(Normalised=Number/first(Number)) %>%
ungroup() %>%
arrange(Sample)
gives out
> df
# A tibble: 9 × 4
Assay Sample Number Normalised
<chr> <int> <dbl> <dbl>
1 A 1 10 1
2 B 1 25 2.5
3 C 1 30 3
4 A 2 45 1
5 B 2 65 1.44
6 C 2 8 0.178
7 A 3 10 1
8 B 3 81 8.1
9 C 3 12 1.2
Note: I added arrange(Assay) just to make sure "A" is always the first row within each group. Also, arrange(Sample) is there just to get the output in the same order as it was but it doesn't really need to be there if you don't care about the display order.
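A sketch of a variant that does not depend on row order: the expression from the question works once the data is grouped by Sample, because `Number[Assay == "A"]` is then evaluated within each sample and yields a single value. This assumes every sample has exactly one assay A row.

```r
library(dplyr)

df <- data.frame(Assay = rep(c("A", "B", "C"), 3),
                 Sample = rep(1:3, each = 3),
                 Number = c(10, 25, 30, 45, 65, 8, 10, 81, 12))

res <- df %>%
  group_by(Sample) %>%
  # Within each sample, divide by that sample's assay A value
  mutate(Normalised = Number / Number[Assay == "A"]) %>%
  ungroup()

res
```

Without the group_by step, `Number[Assay == "A"]` returns all three A values at once and they are recycled down the column, which is why the ungrouped attempt gives wrong results.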

Rolling sum of one variable in data.frame in number of steps defined by another variable

I'm trying to sum up the values in a data.frame in a cumulative way.
I have this:
df <- data.frame(
a = rep(1:2, each = 5),
b = 1:10,
step_window = c(2,3,1,2,4, 1,2,3,2,1)
)
I'm trying to sum up the values of b, within the groups a. The trick is, I want the sum of b values that corresponds to the number of rows following the current row given by step_window.
This is the output I'm looking for:
data.frame(
a = rep(1:2, each = 5),
step_window = c(2,3,1,2,4,
1,2,3,2,1),
b = 1:10,
sum_b_step_window = c(3, 9, 3, 9, 5,
6, 15, 27, 19, 10)
)
I tried to do this using RcppRoll, but I get the error "Expecting a single value":
df %>%
group_by(a) %>%
mutate(sum_b_step_window = RcppRoll::roll_sum(x = b, n = step_window))
I'm not sure whether a variable window size is possible in any of the rolling functions. Here is one way to do this using map2_dbl:
library(dplyr)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = purrr::map2_dbl(row_number(), step_window,
~sum(b[.x:(.x + .y - 1)], na.rm = TRUE)))
# a b step_window sum_b_step_window
# <int> <int> <dbl> <dbl>
# 1 1 1 2 3
# 2 1 2 3 9
# 3 1 3 1 3
# 4 1 4 2 9
# 5 1 5 4 5
# 6 2 6 1 6
# 7 2 7 2 15
# 8 2 8 3 27
# 9 2 9 2 19
#10 2 10 1 10
1) rollapply
rollapply in zoo supports vector widths. partial=TRUE says that if the width goes past the end of the data, only the values within the data are used. (Another possibility would be to use fill=NA instead, in which case the result is NA wherever there is not enough data left.) align="left" specifies that the current value at each step is the left end of the range to sum.
library(dplyr)
library(zoo)
df %>%
group_by(a) %>%
mutate(sum = rollapply(b, step_window, sum, partial = TRUE, align = "left")) %>%
ungroup
2) SQL
This can also be done in SQL by left joining df to itself on the indicated condition and then for each row summing over all rows for which the condition matches.
library(sqldf)
sqldf("select A.*, sum(B.b) as sum
from df A
left join df B on B.rowid between A.rowid and A.rowid + A.step_window - 1
and A.a = B.a
group by A.rowid")
Here is a solution with the package slider.
library(dplyr)
library(slider)
df %>%
group_by(a) %>%
mutate(sum_b_step_window = hop_vec(b, row_number(), step_window+row_number()-1, sum)) %>%
ungroup()
It is flexible on different window sizes.
Output:
# A tibble: 10 x 4
a b step_window sum_b_step_window
<int> <int> <dbl> <int>
1 1 1 2 3
2 1 2 3 9
3 1 3 1 3
4 1 4 2 9
5 1 5 4 5
6 2 6 1 6
7 2 7 2 15
8 2 8 3 27
9 2 9 2 19
10 2 10 1 10
slider is a couple-of-months-old tidyverse package specific for sliding window functions. Have a look here for more info: page, vignette
hop is the engine of slider. With this solution we are triggering different .start and .stop to sum the values of b according to the a groups.
With _vec you're asking hop to return a vector: a double in this case.
row_number() is a dplyr function that allows you to return the row number of each group, thus allowing you to slide along the rows.
data.table solution using cumulative sums
library(data.table)
setDT(df)
df[, sum_b_step_window := {
cs <- c(0,cumsum(b))
cs[pmin(.N+1, 1:.N+step_window)]-cs[pmax(1, (1:.N))]
},by = a]
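The cumulative-sum idea can also be sketched in base R, which makes the arithmetic easier to follow: with cs = c(0, cumsum(b)), the sum of b[i..j] is cs[j + 1] - cs[i], and the window end j is clipped at the group length. The helper name `window_sum` is made up for this illustration.

```r
# Sum of b[i .. i + w[i] - 1] for every i, clipped at the end of the vector
window_sum <- function(b, w) {
  n <- length(b)
  cs <- c(0, cumsum(b))          # cs[k + 1] holds sum(b[1:k])
  start <- seq_len(n)
  end <- pmin(n, start + w - 1)  # do not run past the last element
  cs[end + 1] - cs[start]
}

df <- data.frame(
  a = rep(1:2, each = 5),
  b = 1:10,
  step_window = c(2, 3, 1, 2, 4, 1, 2, 3, 2, 1)
)

# Apply per group of `a`, then put the pieces back in the original row order
pieces <- lapply(split(df, df$a),
                 function(g) window_sum(g$b, g$step_window))
df$sum_b_step_window <- unsplit(pieces, df$a)
df$sum_b_step_window  # 3 9 3 9 5 6 15 27 19 10
```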

How to use dplyr::group_by with multiple groups when programming

Okay so it's one of those days where a previously working piece of code suddenly breaks. Here's a reprex of the code in question:
test = data.frame(factor1 = sample(1:5, 10, replace=T),
factor2 = sample(letters[1:5], 10, replace=T),
variable = sample(100:200, 10))
group_vars = c('factor1','factor2') %>% paste(., collapse = ',')
> test %>% dplyr::group_by_(group_vars)
Error in parse(text = x) : <text>:1:8: unexpected ','
1: factor1,
^
Now I sweaaaar this worked until today. Of course dplyr is trying to do away with the 'x_' functions anyway, but I've tried to plug everything I can think of into group_by()- using combinations of !!, !!!, sym(), quo(), enquo(), etc and can't figure it out. I've tried not pasting the column names together and AT BEST it simply takes the first one and ignores everything else. Most commonly I get the following error message:
Error: Column <chr> must be length 10 (the number of rows) or one, not 2
I've also read over Hadley's dplyr programming guide (https://dplyr.tidyverse.org/articles/programming.html), WHICH SEEMS to cover the issue, except that I'm generating the column names internally and not accepting them as arguments to the function. Has anyone come across this or understand quoting well enough to know a solution to this?
Also, to be clear, this works when only using a single grouping variable. The problem is with multiple groups.
Thanks!
Instead of pasting the names together and using group_by_ (which is deprecated, and would not work here anyway because the pasted string does not parse as a single expression), we can pass the character vector directly to group_by_at
library(dplyr)
group_vars <- c('factor1','factor2')
test %>%
group_by_at(group_vars)
# A tibble: 10 x 3
# Groups: factor1, factor2 [10]
# factor1 factor2 variable
# <int> <fct> <int>
# 1 1 d 145
# 2 5 e 119
# 3 4 a 181
# 4 3 e 155
# 5 3 d 164
# 6 3 b 135
# 7 4 e 137
# 8 4 d 197
# 9 2 d 142
#10 2 c 110
Another option is to convert the names to symbols (syms from rlang) and splice them (!!!) into group_by
test %>%
group_by(!!! rlang::syms(group_vars))
If we go the paste route, then one option is parse_exprs (from rlang)
group_vars = c('factor1','factor2') %>% paste(., collapse = ';')
test %>%
group_by(!!! rlang::parse_exprs(group_vars))
# A tibble: 10 x 3
# Groups: factor1, factor2 [10]
# factor1 factor2 variable
# <int> <fct> <int>
# 1 1 d 145
# 2 5 e 119
# 3 4 a 181
# 4 3 e 155
# 5 3 d 164
# 6 3 b 135
# 7 4 e 137
# 8 4 d 197
# 9 2 d 142
#10 2 c 110
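In dplyr 1.0.0 and later, the same grouping can also be sketched with across() and all_of(), which take the character vector as it is. This is an addition to the answer above, not part of it; the vector is called `grp_cols` here only to avoid clashing with dplyr's own group_vars() function, and the seed is arbitrary.

```r
library(dplyr)

set.seed(1)  # arbitrary seed so the example data is reproducible
test <- data.frame(factor1 = sample(1:5, 10, replace = TRUE),
                   factor2 = sample(letters[1:5], 10, replace = TRUE),
                   variable = sample(100:200, 10))

# Column names generated inside the program, kept as a plain character vector
grp_cols <- c("factor1", "factor2")

grouped <- test %>%
  group_by(across(all_of(grp_cols)))

# Confirm which columns the data is grouped by
dplyr::group_vars(grouped)  # "factor1" "factor2"
```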

cumulative grouping

I have the following data frame:
df = data.frame(a = c(1,1,3,2,2), b=6:10)
## a b
## 1 6
## 1 7
## 3 3
## 2 9
## 2 10
I want to analyze the data by groups (a is the grouping parameter), but instead of the usual grouping (where each value specifies a group of rows, and the groups are disjoint) I need "cumulative groups". That is, for the value a=i, the group should contain all the rows in which a<=i. These are not disjoint groups, but I still want to summarize each group separately.
So for example, if for each group I want the mean of b, the result would be:
## a mean_b
## 1 6.5
## 2 8
## 3 7
Note that in the real scenario behind this simplified example, I cannot analyze the disjoint groups separately and then aggregate the relevant groups. The summarize function must be "aware" of all the rows in that group to perform the computation.
Of course, I can use some apply functions, compute things the good old way, and make a new df out of the result, but I am looking for dplyr/tidyverse-like functions to do that.
Any suggestions?
How about something like this?
library(dplyr)
df %>%
arrange(a) %>%
group_by(a) %>%
summarise(sum_b = sum(b)) %>%
ungroup() %>%
mutate(sum_b = cumsum(sum_b))
# a sum_b
# <dbl> <int>
#1 1. 13
#2 2. 32
#3 3. 40
We take sum by group (a) and then take cumulative sum adding the previous value of the group in the next group.
I had a look and I don't see how it is possible with dplyr itself. However, we can hack the group_by function to make it cumulative. I'll quickly walk you through it:
First, I make your df. It doesn't really fit your output above, so I slightly changed it.
df = data.frame(a = c(1,1,3,2,2), b=6:10)
df$b[3] <- 3
Now I use the normal group_by to check out what it actually does to the data.frame.
library(dplyr)
df_grouped <- df %>%
arrange(a) %>%
group_by(a)
> attributes(df_grouped)
$class
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
$row.names
[1] 1 2 3 4 5
$names
[1] "a" "b"
$vars
[1] "a"
$drop
[1] TRUE
$indices
$indices[[1]]
[1] 0 1
$indices[[2]]
[1] 2 3
$indices[[3]]
[1] 4
$group_sizes
[1] 2 2 1
$biggest_group_size
[1] 2
$labels
a
1 1
2 2
3 3
So besides other things, there is a new attribute called indices where the group of each element in the grouped variable is referenced. We can actually just change that to make it cumulative.
for (i in seq_along(attributes(df_grouped)[["indices"]])[-1]) {
attributes(df_grouped)[["indices"]][[i]] <- c(
attributes(df_grouped)[["indices"]][[i - 1]],
attributes(df_grouped)[["indices"]][[i]]
)
}
It looks a bit weird but is straightforward. The elements of each group are added to the next group. E.g. all elements from group 1 are added to group 2.
> attributes(df_grouped)$indices
[[1]]
[1] 0 1
[[2]]
[1] 0 1 3 4
[[3]]
[1] 0 1 3 4 2
We can use the changed groups in the normal dplyr way.
> df_grouped %>%
+ summarise(sum_b = mean(b))
# A tibble: 3 x 2
a sum_b
<dbl> <dbl>
1 1 6.5
2 2 8
3 3 7
Now of course this is pretty ugly and looks very hacky. But inside a function that doesn't really matter as long as it is still efficient (which it is). So let's make a custom group_by.
group_by_cuml <- function(.data, ...) {
.data_grouped <- group_by(.data, ...)
for (i in seq_along(attributes(.data_grouped)[["indices"]])[-1]) {
attributes(.data_grouped)[["indices"]][[i]] <- c(
attributes(.data_grouped)[["indices"]][[i - 1]],
attributes(.data_grouped)[["indices"]][[i]]
)
}
return(.data_grouped)
}
Now you can use the custom function in a clean dplyr pipe.
> df %>%
+ group_by_cuml(a) %>%
+ summarise(sum_b = mean(b))
# A tibble: 3 x 2
a sum_b
<dbl> <dbl>
1 1 6.5
2 2 8
3 3 7
I would do it this way :
df %>%
arrange(a) %>%
map_dfr(seq_along(as <- unique(.$a)),
~filter(.y, a %in% as[1:.]),.y = ., .id = "meta_group") %>%
group_by(a = meta_group) %>%
summarise(b = mean(b))
# # A tibble: 3 x 2
# a b
# <chr> <dbl>
# 1 1 6.5
# 2 2 7.0
# 3 3 8.0
If you want a separate function you can do :
summarize2 <- function(.data, ..., .by){
grps <- select_at(.data,.by) %>% pull %>% unique
.data %>%
arrange_at(.by) %>%
map_dfr(seq_along(grps),
~ filter_at(.y, .by,all_vars(. %in% grps[1:.x])),
.y = .,
.id = "meta_group") %>%
group_by(meta_group) %>%
summarise(...)
}
df %>%
summarize2(b = mean(b), .by = "a")
# # A tibble: 3 x 2
# meta_group b
# <chr> <dbl>
# 1 1 6.5
# 2 2 7.0
# 3 3 8.0
df %>%
summarize2(b = mean(b), .by = vars(a))
# # A tibble: 3 x 2
# meta_group b
# <chr> <dbl>
# 1 1 6.5
# 2 2 7.0
# 3 3 8.0
One way is to use the base function Reduce with the argument accumulate = TRUE. Once you concatenate, then you can apply any function, i.e.
Reduce(c, split(df$b,df$a), accumulate = TRUE)
#[[1]]
#[1] 6 7
#[[2]]
#[1] 6 7 9 10
#[[3]]
#[1] 6 7 9 10 3
and then for the mean,
sapply(Reduce(c, split(df$b,df$a), accumulate = TRUE), mean)
[1] 6.5 8.0 7.0
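For reference, a short base R sketch of the same cumulative grouping that filters on a <= i directly, so the summary function sees every row of each cumulative group. It uses the adjusted data (b[3] set to 3) that matches the expected output in the question, and the names `keys` and `res` are illustrative. It does not touch grouped_df internals such as the indices attribute used in one of the answers above, which may not exist in newer dplyr releases.

```r
# Data as in the question's printed table (third value of b is 3)
df <- data.frame(a = c(1, 1, 3, 2, 2), b = c(6, 7, 3, 9, 10))

# One cumulative group per distinct value of a, in increasing order
keys <- sort(unique(df$a))

res <- data.frame(
  a = keys,
  # The summary is computed on all rows with a <= k at once
  mean_b = sapply(keys, function(k) mean(df$b[df$a <= k]))
)
res
#   a mean_b
# 1 1    6.5
# 2 2    8.0
# 3 3    7.0
```

Any other summary function can be swapped in for mean, since each call receives the complete set of rows for its cumulative group.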
