Adding hash to each row using dplyr and digest in R

I need to add a fingerprint to each row in a dataset so that I can compare it with a later version of the same dataset and look for differences.
I know how to add a hash for each row in R, like below:
data.frame(iris,hash=apply(iris,1,digest))
I am learning to use dplyr because the dataset is getting huge and I need to store it in SQL Server. I tried something like the code below, but the hash is not working; all rows get the same hash:
iris %>%
rowwise() %>%
mutate(hash=digest(.))
Any clue for row-wise hashing using dplyr? Thanks!

We could use do
res <- iris %>%
rowwise() %>%
do(data.frame(., hash = digest(.)))
head(res, 3)
# A tibble: 3 x 6
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species hash
# <dbl> <dbl> <dbl> <dbl> <fctr> <chr>
#1 5.1 3.5 1.4 0.2 setosa e261621c90a9887a85d70aa460127c78
#2 4.9 3.0 1.4 0.2 setosa 7bf67322858048d82e19adb6399ef7a4
#3 4.7 3.2 1.3 0.2 setosa c20f3ee03573aed5929940a29e07a8bb
Note that in the apply approach, all columns are converted to a single class, because apply first converts the data frame to a matrix and a matrix can hold only one class. There will be a warning about converting the factor to character class.
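The attempt in the question returns one hash for every row because the `.` in `mutate(hash = digest(.))` is the pipe placeholder, i.e. the whole data frame, not the current row. As a minimal sketch of another way to hash row contents (an assumption of this note, not taken from the answers), each row can be pasted into a single string and that string hashed. The names `row_key` and `row_hash` are illustrative, and the unit-separator delimiter is an arbitrary choice:

```r
library(digest)

# Paste each row into one string; "\x1f" (unit separator) is an arbitrary
# delimiter chosen because it is unlikely to occur in the data.
row_key <- do.call(paste, c(iris, sep = "\x1f"))

# Hash each string on its own; serialize = FALSE hashes the raw text.
row_hash <- vapply(
  row_key,
  function(s) digest(s, algo = "md5", serialize = FALSE),
  character(1),
  USE.NAMES = FALSE
)

hashed <- data.frame(iris, hash = row_hash)
head(hashed, 3)
```

Because paste() converts numbers to text, two doubles that differ only beyond the printed precision would get the same hash; whether that matters depends on the data.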

Since do is being superseded, this option may be better now:
library(digest)
library(tidyverse)
# Create a tibble for practice
df <- tibble(x = rep(c(1,2), each=2), y = c(1,1,3,4), z = c(1,1,6,4))
# Note that row 1 and 2 are equal.
# This will generate a sha1 over specific columns (column z is excluded)
df %>% rowwise() %>% mutate(m = sha1( c(x, y ) ))
# This will generate over all columns,
# then convert the hash to integer
# (better for joining or other data operations later)
df %>%
rowwise() %>%
mutate(sha =
digest2int( # generates a new integer hash
sha1( c_across(everything() ) ) # across all columns
)
)
It may be a better option to convert everything to character and paste it together to use just one hash function call. You can use unite:
df %>% rowwise() %>%
unite(allCols, everything(), sep = "", remove = FALSE) %>%
mutate(hash = digest2int(allCols)) %>%
select(-allCols)
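One caveat when pasting columns together: with sep = "", different rows can collapse to the same string and therefore to the same hash. The base R sketch below (the toy data and names are illustrative, not from the answer) shows the collision and how a separator avoids it, provided the separator does not itself occur in the data:

```r
# Two different rows whose values concatenate to the same text.
toy <- data.frame(x = c("1", "11"), y = c("12", "2"), stringsAsFactors = FALSE)

no_sep   <- do.call(paste, c(toy, sep = ""))   # both rows become "112"
with_sep <- do.call(paste, c(toy, sep = "|"))  # "1|12" and "11|2"

no_sep[1] == no_sep[2]      # TRUE: any hash of these strings would collide
with_sep[1] == with_sep[2]  # FALSE
```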

Related

How to create a column for each level of another column in R?

The goal I am trying to achieve is an expanded data frame in which I will have created a new column for each level of a specific column in R. Here is a sample of the initial data frame and the data frame I am trying to achieve:
Original Data Frame:
record crop_land fishing_ground
BiocapPerCap 1.5 3.4
Consumption 2.3 0.5
Goal Data Frame:
crop_land.BiocapPerCap crop_land.Consumption fishing_ground.BiocapPerCap fishing_ground.Consumption
1.5 2.3 3.4 0.5
We can use pivot_wider from the tidyr package as follows.
library(tidyr)
library(magrittr)
dat2 <- dat %>%
pivot_wider(names_from = "record", values_from = c("crop_land", "fishing_ground"),
names_sep = ".")
dat2
# # A tibble: 1 x 4
# crop_land.BiocapPerCap crop_land.Consumption fishing_ground.BiocapPer~ fishing_ground.Consumpti~
# <dbl> <dbl> <dbl> <dbl>
# 1 1.5 2.3 3.4 0.5
DATA
dat <- read.table(text = "record crop_land fishing_ground
BiocapPerCap 1.5 3.4
Consumption 2.3 0.5",
header = TRUE, stringsAsFactors = FALSE)
Using tidyr is one option.
tidyr::pivot_longer() converts crop_land and fishing_ground to variable-value pairs. tidyr::unite() combines the record and variable to new names.
tidyr::pivot_wider() creates the wide data frame you are after.
library(tidyr)
library(magrittr) # for %>%
tst <- data.frame(
record = c("BiocapPerCap", "Consumption"),
crop_land = c(1.5, 2.3),
fishing_ground = c(3.4, 0.5)
)
pivot_longer(tst, -record) %>%
unite(new_name, record, name, sep = '.') %>%
pivot_wider(names_from = new_name, values_from = value)

summarise_at dplyr multiple columns

I am trying to apply a complex function on multiple columns after applying a group on it.
Code example is:
library(dplyr)
data(iris)
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise_at(.vars=c('Sepal.Length', 'Sepal.Width'),
.funs = add('Sepal.Length', 'Sepal.Width' ) )
I was expecting that the function would be applied to each group and returned as a new column but I get:
Error in x + y : non-numeric argument to binary operator
How can I get this to work?
Note that my real problem has a much more complicated function than the simple add function I've written here. It requires the two columns to be fed in as separate entities, so I can't just sum them first.
Thanks
I don't think you need summarise_at, since your definition of add takes care of the multiple input arguments. summarise_at is useful when you are applying the same change to multiple columns, not for combining them.
If you just want the sum of each column, you can try:
iris %>%
group_by(Species) %>%
summarise_at(
.vars= vars( Sepal.Length, Sepal.Width),
.funs = sum)
which gives:
Species Sepal.Length Sepal.Width
<fctr> <dbl> <dbl>
1 setosa 250 171
2 versicolor 297 138
3 virginica 329 149
In case you want to add the columns together, you can just do:
iris %>%
group_by(Species) %>%
summarise( k = sum(Sepal.Length, Sepal.Width))
which gives:
Species k
<fctr> <dbl>
1 setosa 422
2 versicolor 435
3 virginica 478
Using this form with your definition of add:
add = function(x,y) {
z = x+y
return(mean(z))
}
iris %>%
group_by(Species) %>%
summarise( k = add(Sepal.Length, Sepal.Width))
returns
Species k
<fctr> <dbl>
1 setosa 8
2 versicolor 9
3 virginica 10
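The error in the question arises because add('Sepal.Length', 'Sepal.Width') is evaluated immediately, with two character strings, before summarise_at ever sees the data. The base R sketch below (a stand-in for the dplyr pipeline, with illustrative names `msg` and `by_species`) contrasts calling add on column names with calling it on each group's column values:

```r
add <- function(x, y) mean(x + y)

# Calling add() on column *names* fails: the strings cannot be added.
msg <- tryCatch(
  add("Sepal.Length", "Sepal.Width"),
  error = function(e) conditionMessage(e)
)

# Calling add() on the column *values* of each group works.
by_species <- vapply(
  split(iris, iris$Species),
  function(d) add(d$Sepal.Length, d$Sepal.Width),
  numeric(1)
)
by_species
```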
summarize() already allows you to summarize multiple columns in a single call.
Example:
summarize(df, mean_xvalues = mean(x), sd_yvalues = sd(y), median_zvalues = median(z))
where x, y and z are columns of the data frame df.

dplyr+purrr: n() refers to map() groups, not local groups?

I have a parent dataset nesting multiple datasets (i.e. a tibble where each cell is a tibble), and for each nested dataset I want to find the number of rows in each group. The standard way, with a single dataset, would simply be group_by(var) %>% mutate(nrow=n()).
But when I do this for multiple datasets with a map() call, n() seems to refer to the (implicit) grouping made by map(), not the explicit grouping made by group_by within my local dataset.
The standard way for one single dataset, where n() returns 50:
iris %>%
group_by(., Species) %>%
mutate(nrow=n())
Dataset of datasets:
df <- data_frame(name=c("a", "b"), Data=list(iris, iris))
df2 <- df %>%
mutate(Data2=map(Data, ~group_by(., Species) %>%
mutate(nrow=n()) %>%
ungroup()))
but now n() returned 2?
df2[1,]$Data2[[1]]
If you define the function outside of mutate it works fine (I assume this output is what you have in mind...)
fun <- function(x) {
x %>%
group_by(Species) %>%
summarise(nrow = n())
}
df2 <- df %>%
mutate(Data2=map(Data, fun))
df2$Data2
# [[1]]
# # A tibble: 3 x 2
# Species nrow
# <fctr> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
#
# [[2]]
# # A tibble: 3 x 2
# Species nrow
# <fctr> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
Another option, available since dplyr 0.7.0, is to use add_count(), which does not conflict with map() and simplifies the code anyway:
# standard case:
iris %>%
add_count(Species)
## df of df:
df2 <- df %>%
mutate(Data2=map(Data, ~add_count(., Species)))

group_by variables which meet a certain condition in dplyr [duplicate]

Is it possible to group_by using regex match on column names using dplyr?
library(dplyr) # dplyr_0.5.0; R version 3.3.2 (2016-10-31)
# dummy data
set.seed(1)
df1 <- sample_n(iris, 20) %>%
mutate(Sepal.Length = round(Sepal.Length),
Sepal.Width = round(Sepal.Width))
Group by static version (looks/works fine, imagine if we have 10-20 columns):
df1 %>%
group_by(Sepal.Length, Sepal.Width) %>%
summarise(mySum = sum(Petal.Length))
Group by dynamic - "ugly" version:
df1 %>%
group_by_(.dots = colnames(df1)[ grepl("^Sepal", colnames(df1))]) %>%
summarise(mySum = sum(Petal.Length))
Ideally, something like this (doesn't work, as starts_with returns indices):
df1 %>%
group_by(starts_with("Sepal")) %>%
summarise(mySum = sum(Petal.Length))
Error in eval(expr, envir, enclos) :
wrong result size (0), expected 20 or 1
Expected output:
# Source: local data frame [6 x 3]
# Groups: Sepal.Length [?]
#
# Sepal.Length Sepal.Width mySum
# <dbl> <dbl> <dbl>
# 1 4 3 1.4
# 2 5 3 10.9
# 3 6 2 4.0
# 4 6 3 43.7
# 5 7 3 15.7
# 6 8 4 6.4
Note: sounds very much like a duplicated post, kindly link the relevant posts if any.
The solution is to use the group_by_at function, which was tracked in GitHub issue #2619 and is available as of dplyr 0.7.1:
df1 %>%
group_by_at(vars(starts_with("Sepal"))) %>%
summarise(mySum = sum(Petal.Length))
If you just want to stick to dplyr functions, you can try:
df1 %>%
group_by_(.dots = df1 %>% select(contains("Sepal")) %>% colnames()) %>%
summarise(mySum = sum(Petal.Length))
It's not necessarily much prettier, but it gets rid of the regex.
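In dplyr 1.0.0 and later, the scoped verbs such as group_by_at are superseded by across(), which can be used directly inside group_by(). A minimal sketch, assuming dplyr >= 1.0.0 and rebuilding dummy data similar to the question's with base sampling (so the sampled rows may differ from the original post):

```r
library(dplyr)

set.seed(1)
df1 <- iris[sample(nrow(iris), 20), ] %>%
  mutate(Sepal.Length = round(Sepal.Length),
         Sepal.Width = round(Sepal.Width))

# Group by every column whose name starts with "Sepal".
dynamic <- df1 %>%
  group_by(across(starts_with("Sepal"))) %>%
  summarise(mySum = sum(Petal.Length), .groups = "drop")

# The static version, for comparison.
static <- df1 %>%
  group_by(Sepal.Length, Sepal.Width) %>%
  summarise(mySum = sum(Petal.Length), .groups = "drop")

dynamic
```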

R dplyr summarise multiple functions to selected variables

I have a dataset for which I want to summarise by mean, but also calculate the max to just 1 of the variables.
Let me start with an example of what I would like to achieve:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean))
which gives me the following result:
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fctr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.8 4.4 1.9 0.5
2 versicolor 7.0 3.4 5.1 1.8
3 virginica 7.9 3.8 6.9 2.5
Is there an easy way to add, for example, max(Petal.Width) to summarise?
So far I have tried the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the group_by and the filter from the code above, so it gives the wrong results.
The only solution I have been able to achieve is the following:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
select(Species:Petal.Width_mean,Petal.Width_max) %>%
rename(Max.Petal.Width = Petal.Width_max) %>%
rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))
Which is a bit convoluted and involves a lot of typing to just add a column with a different summarisation.
Thank you
Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
mapply(summarise_at,
.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst(mean, max),
MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5))) %>%
reduce(merge, by = "Species")
# Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
# 1 setosa 5.314 3.714 1.509 0.2773 0.5
# 2 versicolor 5.998 2.804 4.317 1.3468 1.8
# 3 virginica 6.622 2.984 5.573 2.0327 2.5
Solution two
An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:
list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
.funs = lst("mean" = mean, "max" = max)) %>%
pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>%
reduce(inner_join, by = "Species")
# A tibble: 3 x 6
Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
Short discussion
The data.frame and the tibble are the desired result: the last column is the max of Petal.Width and the other ones are the means (by group, after the filter) of all the other columns.
Both solutions hinge on three realizations:
1. summarise_at accepts two lists, one of n variables and one of m functions, and applies all m functions to all n variables, producing m x n columns in a tibble. The solution is therefore to make it loop over pairs, each consisting of a group of variables and the one function to apply to them.
2. In R, applying an operation to corresponding elements of two lists is the job of mapply, or of map2, pmap and their variants from purrr, dplyr's fellow tidyverse package. Both accept two lists of equal length and perform a given operation on the elements matched by position.
3. Because the result is a list rather than a tibble or a data.frame, you need reduce with inner_join (or just merge) to combine its elements.
Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the iris dataset?).
If you wanted to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules.
Here's a rough start:
my_summarise_at<-function (.tbl, ...)
{
dots <- list(...)
stopifnot(length(dots)%%3==0)
vars <- do.call("append", Map(function(.cols, .funs, .name) {
cols <- select_colwise_names(.tbl, .cols)
funs <- as.fun_list(.funs, .env = parent.frame())
val<-colwise_(.tbl, funs, cols)
names <- sapply(names(val), function(x) gsub("%", x, .name))
setNames(val, names)
}, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
summarise_(.tbl, .dots = vars)
}
environment(my_summarise_at)<-getNamespace("dplyr")
And you can call it with
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
"Petal.Width", max, "%_max")
For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.
If you are trying to do everything with dplyr (which might be easier to remember), you can use the across function, available from dplyr 1.0.0.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Sepal.Length:Petal.Width, mean)) %>%
cbind(iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(across(Petal.Width, max)) %>%
select(-Species)
)
The only difficulty is combining two calculations on the same column, Petal.Width, of a grouped data frame: the grouping and the filter have to be repeated, but the second pipeline can be nested inside cbind.
This returns:
Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1 setosa 5.313636 3.713636 1.509091 0.2772727 0.5
2 versicolor 5.997872 2.804255 4.317021 1.3468085 1.8
3 virginica 6.622449 2.983673 5.573469 2.0326531 2.5
If the task required only one calculation on Petal.Width rather than two, this could be written more simply as:
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarize(
across(Sepal.Length:Petal.Length, mean),
across(Petal.Width, max)
)
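Two calculations on the same column can also be done in a single summarise() call by passing a named list of functions to across(). This is a minimal sketch assuming dplyr >= 1.0.0; the output column names come from the `.names` pattern chosen here, not from the question:

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise(
    # Mean of the columns that need only one summary.
    across(Sepal.Length:Petal.Length, mean),
    # Both mean and max of Petal.Width, named Petal.Width_mean / Petal.Width_max.
    across(Petal.Width, list(mean = mean, max = max), .names = "{.col}_{.fn}")
  )
res
```

Both functions in the list are applied to the original Petal.Width values of each group, so no second grouped pipeline or join is needed.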
I was looking for something similar and tried the following. It works well and is much easier to read than the suggested solutions.
iris %>%
group_by(Species) %>%
filter(Sepal.Length > 5) %>%
summarise(MeanSepalLength=mean(Sepal.Length),
MeanSepalWidth = mean(Sepal.Width),
MeanPetalLength=mean(Petal.Length),
MeanPetalWidth=mean(Petal.Width),
MaxPetalWidth=max(Petal.Width))
# A tibble: 3 x 6
Species MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.31 3.71 1.51 0.277 0.5
2 versicolor 6.00 2.80 4.32 1.35 1.8
3 virginica 6.62 2.98 5.57 2.03 2.5
In the summarise() call, name each output column and pass the column you want to summarise to the function of your choice.
