Benford’s Law by group in R - r

I am attempting to apply Benford’s Law with the benford.analysis package in R across all vendors’ invoices. Over the entire dataset, the data conforms. I’m trying to find a way to group by vendor to determine whether any individual vendor shows fraud indicators by not conforming. Is there a way to break out non-conformance by group?

Here is a way to use group_by and group_map to create benford.analysis plots for each group. This example groups the iris data by Species and analyzes the Sepal.Length variable.
In group_map(), .x is the subset of the data for the current group, and .y is a one-row tibble holding that group's key.
library(dplyr)
library(benford.analysis)
iris %>%
group_by(Species) %>%
group_map(.f = ~ plot(benford(.x$Sepal.Length)))
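Plots alone do not tell you which groups deviate most. As a rough sketch, and not the package's own conformity measure, the base R code below computes the mean absolute deviation (MAD) between each group's observed first-digit shares and the Benford expectation, then sorts the groups by it. The helper names first_digit and first_digit_mad are made up for this illustration, and iris only stands in for the invoice data, so the numbers themselves mean little.

```r
# Expected first-digit probabilities under Benford's Law: log10(1 + 1/d)
benford_expected <- log10(1 + 1 / (1:9))

# Leading non-zero digit of each value (zeros and NAs are dropped)
first_digit <- function(x) {
  x <- abs(x[!is.na(x) & x != 0])
  as.integer(substr(formatC(x, format = "e", digits = 15), 1, 1))
}

# Mean absolute deviation between observed and expected first-digit shares
first_digit_mad <- function(x) {
  digits <- first_digit(x)
  observed <- tabulate(digits, nbins = 9) / length(digits)
  mean(abs(observed - benford_expected))
}

# One MAD value per group, largest deviation first
mad_by_group <- sapply(split(iris$Sepal.Length, iris$Species), first_digit_mad)
mad_by_group <- sort(mad_by_group, decreasing = TRUE)
mad_by_group
```

With real data you would split the invoice amounts by vendor instead. Groups with few observations will show large deviations by chance, so treat a high MAD as a prompt for closer inspection rather than evidence of fraud.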

Related

group_by and summarize usage in tidyverse package in r

I am analyzing COVID-19 data in R and I want to get the total number of cases for each continent.
total_cases_continent <- covid_data %>%
select(continent, new_cases) %>%
group_by(continent) %>%
summarize(total_cases = sum(new_cases))
Instead of one total per continent, the result I get contains only a single row with the overall total.
It looks like there might be some issues with the values of your variable "continent". I would recommend checking the class of the variable, as well as making sure all of the values are what you expected them to be. This is probably causing the issues within your group_by statement.
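A minimal sketch of those checks, using a made-up miniature of covid_data (the real data will differ). It also calls dplyr::summarise explicitly and passes na.rm = TRUE; both are assumptions about what might be going wrong here, since a summarise from another loaded package or NA values in new_cases can produce similar symptoms.

```r
library(dplyr)

# Hypothetical miniature of covid_data, for illustration only
covid_data <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", NA),
  new_cases = c(10, NA, 5, 7, 3)
)

# Inspect the grouping variable before grouping on it
class(covid_data$continent)
unique(covid_data$continent)

total_cases_continent <- covid_data %>%
  filter(!is.na(continent)) %>%          # drop rows with no continent
  group_by(continent) %>%
  dplyr::summarise(total_cases = sum(new_cases, na.rm = TRUE))

total_cases_continent
```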

Taking the mean of a multitude of variables grouped by a set of categorical variables

I have a data frame with 500 columns and 50 rows. One column is a categorical variable with 3 categories and the rest are continuous variables. How do I group the data frame by the categorical variable and take the mean of the observations within each category for every continuous column? I also want to remove all NAs, and I want to create a new DF from the result.
Best,
Henry
When posting to SO, please include a reproducible example of your data (dput is helpful for this). As it is, I can only guess at the structure of your data.
I like doing general grouping/summarising operations with dplyr. Using iris as an example, you might be able to do something like this:
library(dplyr)
library(tidyr)
data(iris)
iris %>%
drop_na() %>%
group_by(Species) %>%
summarise_all(mean)
summarise_all automatically uses all non-grouping columns and takes the function you want to apply.
Note that with dplyr 1.0.0 or later you could also do something like
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), mean))
since summarise_all has been superseded by across.

Lag function usage within a dplyr subset

My basic goal is to subset a data set and summarise it with new columns that use the lag function. I understand how to subset the data set, but I am struggling to use the lag function within it.
I have already tried a few different ways of implementing it, but have been unsuccessful.
gapminder %>%
na.omit() %>%
group_by(country) %>%
summarise(prevPeriod = lag(year),
lifeExpGrowth = lag(lifeExp),
popGrowth = lag(pop),
gdppcGrowth = 100*(gdpPercap/lag(gdpPercap) - 1))
My code currently runs the lag based on the country, not the year. The gdppcGrowth column is also supposed to return a percent, but I am getting an error:
Column `gdppcGrowth` must be length 1 (a summary value), not 12
For each of the functions, I want to analyze the data by country focusing on growth rates. I want to use the lag(x) function to access the previous value of a series or vector so that 100*(x/lag(x) - 1) computes standard (arithmetic) growth rates of x expressed as a percent.
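The error arises because summarise() expects a single value per group, whereas lag() returns a vector as long as the group. One possible approach, sketched here on a small made-up data frame (toy) rather than gapminder itself, is to sort each country's rows by year and use mutate(), which keeps one row per observation:

```r
library(dplyr)

# Hypothetical stand-in for gapminder: two countries, three periods each
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 2),
  gdpPercap = c(100, 110, 121, 200, 190, 209)
)

growth <- toy %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%    # lag() follows row order, so sort by year
  mutate(
    prevPeriod  = lag(year),
    gdppcGrowth = 100 * (gdpPercap / lag(gdpPercap) - 1)
  ) %>%
  ungroup()

growth
```

The first row of each country is NA because there is no earlier period to compare against. The same pattern applies to lifeExp and pop.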

how to apply lm() to datasets split by factors

In a housing dataset, there are three variables: bsqft (the building size of the house), county (a factor variable with 9 levels) and price. I would like to fit an individual regression line of price on bsqft for each county. Instead of calling lm() repeatedly, I would prefer to use an apply function in R, but I have no idea how to set it up. Could anyone help me with that? Thanks a lot.
You can use dplyr and broom to run regressions by group and collect the results back into a data frame:
library(dplyr)
library(broom)
your_dataset %>%
group_by(county) %>%
do(tidy(lm(price ~ bsqft, data=.)))
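Since the question asks about the apply family, here is a base R sketch of the same idea using split() and lapply(). The housing data frame below is simulated purely for illustration (three counties instead of nine, arbitrary coefficients), so only the pattern carries over.

```r
# Simulated stand-in for the housing data; names and values are assumptions
set.seed(1)
housing <- data.frame(
  county = factor(rep(c("X", "Y", "Z"), each = 20)),
  bsqft  = runif(60, 500, 3000)
)
housing$price <- 50000 + 150 * housing$bsqft + rnorm(60, sd = 10000)

# Fit one model per county: split the rows, then apply lm() to each piece
fits <- lapply(split(housing, housing$county),
               function(d) lm(price ~ bsqft, data = d))

# Collect intercepts and slopes into one matrix, one row per county
coefs <- t(sapply(fits, coef))
coefs
```

Each element of fits is an ordinary lm object, so summary(fits[["X"]]) or predict() work on it as usual.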

Correlations between vectors in two groups (defined by: group_by)

I want to make a correlation between two vectors in two different groups (defined by group_by). The solution needs to be based on dplyr.
My data is in the so-called CDISC format. For simplicity: here is some dummy data (note that one column ("values") holds all the data):
n=5
bmi<-rnorm(n=n,mean=25)
glucose<-rnorm(n=n,mean=5)
insulin<-rnorm(n=n,mean=10)
id<-rep(paste0("id",1:n),3)
myData<-data.frame(id=id,measurement=c(rep("BMI",n),rep("glucose",n),rep("insulin",n)),values=c(bmi,glucose,insulin))
Keep in mind that all my functions for working with this kind of data use the dplyr package, for example:
myData %>% group_by(measurement) %>% summarise(mean(values), n())
How do I get the correlation between glucose and insulin (cor(glucose, insulin))? Or in a more general way: how do I get the correlation between two groups?
The following solution is obviously very wrong (but may help to understand my question):
myData %>% group_by(measurement) %>% summarise(cor(glucose,insulin))
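One possible approach, offered as a sketch: cor() needs the two measurements side by side, so reshape the long data to one column per measurement first and then summarise. This assumes tidyr is acceptable alongside dplyr and that each id has exactly one value per measurement; the dummy data is regenerated with a seed so the example is reproducible.

```r
library(dplyr)
library(tidyr)

# Dummy data in the same long layout as in the question
set.seed(42)
n <- 5
myData <- data.frame(
  id          = rep(paste0("id", 1:n), 3),
  measurement = rep(c("BMI", "glucose", "insulin"), each = n),
  values      = c(rnorm(n, mean = 25), rnorm(n, mean = 5), rnorm(n, mean = 10))
)

# One row per id, one column per measurement
wide <- myData %>%
  pivot_wider(names_from = measurement, values_from = values)

# Correlation between the two measurements of interest
result <- wide %>%
  summarise(r = cor(glucose, insulin))

result
```

For all pairwise correlations at once, cor() can be applied to the numeric columns of wide, for example cor(select(wide, -id)).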
