This question is based on How do I calculate a grouped z score in R using dplyr?.
Here the data are converted to z-scores twice: once within groups and once ungrouped.
library(tidyverse)

dat = iris %>%
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
  ungroup() %>%
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value))
Scaling ungrouped preserves the order of the data.
> identical(order(dat$z_score_ungrouped), order(dat$value))
[1] TRUE
Interestingly, however, scaling group-wise changes the order of the data.
> identical(order(dat$z_score_group), order(dat$value))
[1] FALSE
In my opinion, scaling should never change the order of the data, because this has a huge impact on rank-based analyses (e.g. ROC curves). Does anyone have an idea why grouping changes the order?
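A minimal sketch of what is probably going on, using made-up toy values rather than iris: within one group, (value - mean) / sd is an increasing transformation, so the order inside each group is preserved. Each group is shifted and scaled by its own mean and sd, though, so comparisons across groups can flip.

```r
# Toy data (hypothetical): two groups on very different scales
value <- c(1, 2, 3, 10, 20, 30)
grp   <- c("a", "a", "a", "b", "b", "b")

# Group-wise z-score using base R's ave()
z_grp <- ave(value, grp, FUN = function(x) (x - mean(x)) / sd(x))

# Overall order is not preserved, because both groups are now centred on 0
identical(order(z_grp), order(value))

# Ranks within each group are unchanged
identical(ave(value, grp, FUN = rank), ave(z_grp, grp, FUN = rank))
```

So a rank-based analysis run per group is unaffected, while one that pools all groups sees a different ordering.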
I'm using a grouped time series data set in which the most recent dates often have NAs (the number of trailing NAs varies fairly randomly between series). The data also provide a total across all series; for recent dates this total is greater than the sum of the individual series, I guess because of imputation/forecasting.
So, my question is, how can the missing values be estimated, assuming that the series total is correct?
My general approach is to calculate each series' proportion of the total and somehow extrapolate it to the missing future dates. As you can see from the graphs, I'm not so successful. There are complications caused by the differing last dates of reported data. I'm not sure whether using cumulative values makes a difference.
R code for simulated data and failed solution below:
## simulate simple grouped time series
library(tidyverse)
set.seed(555)
## time series length, e.g. 10
len=10
## group names
grps=letters[1:5]
df=bind_rows(lapply(grps,function(z){
tibble(rn=seq(1:len)) %>%
mutate(real_val=runif(len,min=0,max=1)) %>%
mutate(grp=z) %>%
select(grp,rn,real_val) %>%
## replace final data points with NA, length varying by group
## this simulates delays in data reporting across groups
mutate(reported_val=ifelse(rn>len-match(z,letters)+1,NA,real_val)) %>%
# mutate(reported_val=ifelse(rn>len-runif(1,0,round(max_trim)),NA,real_val)) %>%
## make cumulative to assist viz a bit. may affect imputation method.
group_by(grp) %>%
arrange(rn) %>%
mutate(real_val=cumsum(real_val),reported_val=cumsum(reported_val)) %>%
ungroup()
}))
df
## attempt to impute/estimate missing real values, given total for each rn (date)
## general solution is to use (adjusted?) proportions of the total.
df2=df %>%
group_by(rn) %>%
mutate(sum_real_val=sum(real_val),sum_reported_val=sum(reported_val,na.rm=T)) %>%
ungroup() %>%
## total value missing for each date
mutate(val_missing=sum_real_val-sum_reported_val) %>%
## what proportion of the date's total each group takes up
mutate(prop=reported_val/sum_real_val) %>%
# mutate(prop=reported_val/sum_reported_val) %>%
## fill missing proportions to end terminus from most recent value
group_by(grp) %>%
arrange(rn) %>%
fill(prop,.direction='down') %>%
ungroup() %>%
## get estimated proportion of those missing
mutate(temp1=ifelse(is.na(reported_val),prop,NA)) %>%
## re-calculate proportion as only of those missing.
group_by(grp) %>%
mutate(prop_temp1=temp1/sum(temp1,na.rm=T)) %>%
ungroup() %>%
## if value missing, then multiply total missing by expected proportion of missing
mutate(result_val=ifelse(is.na(reported_val),val_missing*prop_temp1,reported_val)) %>%
ungroup()
## time series plot
## stacked by group, black line shows real_val total.
ggplot(df2 %>%
select(grp,rn,real_val,reported_val,sum_real_val,result_val) %>%
gather(val_grp,val,-c(grp,rn,sum_real_val)) %>%
ungroup(),aes(rn,val))+
# geom_line(aes(colour=loc))+
geom_area(aes(fill=grp))+
geom_line(aes(y=sum_real_val))+
facet_wrap("val_grp")
## but alas the result total doesn't agree with the reported total
## nb the imputed values for each group don't necessarily have to agree with the real values.
It's hard to know what conclusions to draw from the incomplete dates. One simple assumption could be to take the last share and extrapolate that into the future:
default_share <- df %>%
count(rn, grp, wt = !is.na(reported_val)) %>%
count(rn, n) %>%
filter(nn == max(nn)) %>%
slice_max(rn) %>%
left_join(df) %>%
mutate(share = real_val / sum(real_val)) %>%
select(rn, grp, share)
df %>%
group_by(rn) %>%
mutate(result_val = if_else(rn > default_share$rn[[1]],
sum(real_val) * default_share$share[rn == rn],
real_val)
) %>% ungroup() %>%
select(-reported_val) %>%
pivot_longer(-c(grp:rn)) %>%
ggplot(aes(rn,value))+
geom_area(aes(fill=grp))+
facet_wrap(~name)
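If the imputed values also have to add up to the known total for each date, one option is to spread only the unreported remainder across the missing groups, using shares renormalised over those groups. The following is a rough sketch for a single date with made-up numbers; total, reported and last_share are hypothetical stand-ins, not columns from the question's data.

```r
# Hypothetical values for one date
total      <- 10                                 # known total across all groups
reported   <- c(a = 4, b = 3, c = NA, d = NA)    # NA = not yet reported
last_share <- c(a = 0.4, b = 0.3, c = 0.2, d = 0.1)  # assumed last observed shares

missing   <- is.na(reported)
remainder <- total - sum(reported, na.rm = TRUE)  # part of the total still unexplained

# Renormalise the shares over the missing groups only, within this date
w <- last_share[missing] / sum(last_share[missing])

imputed <- reported
imputed[missing] <- remainder * w

# By construction the imputed series now agrees with the total
all.equal(sum(imputed), total)
```

The renormalisation is done across groups within one date, which is what makes the sum match; applied to a full data frame, that would correspond to grouping by the date column before dividing.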
I am trying to create summary statistics without losing column values. For example, using the iris dataset, I want to group_by the species and find summary statistics such as the sd and mean.
Once I have done this, I want to add the results back to the original dataset. How can I do this? I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
This gives one row per species with its mean and sd.
I then want to add the mean and sd back to the original iris data, so that I can get the z-score of each individual row relative to its species.
In other words: group by species, then find the z-score of each individual plant based on its species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
library(dplyr)
iris %>%
  group_by(Species) %>%
  # scale() returns a one-column matrix, so convert it to a plain vector;
  # .names keeps the original columns and adds *_Z columns next to them
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)), .names = "{.col}_Z")) %>%
  ungroup() %>%
  relocate(Species, .after = last_col())
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
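For comparison, here is a join-free sketch that should give the same Z column: mutate() on grouped data evaluates mean() and sd() per species, so no separate summary table is needed. The name z_direct is just illustrative.

```r
library(dplyr)

# mean() and sd() are computed within each Species because of group_by()
z_direct <- iris %>%
  group_by(Species) %>%
  mutate(Z = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length)) %>%
  ungroup()

# All 150 rows are kept, with one extra column
dim(z_direct)
```

The join version is still handy when the mean and sd columns themselves are wanted in the output.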
I am new to R and this is my first post on SO - so please bear with me.
I am trying to identify outliers in my dataset. I have two data.frames:
(1 - original data set, 192 rows): observations and their value (AvgConc)
(2 - created with dplyr, 24 rows): Group averages from the original data set, along with quantiles, minimum, and maximum values
I want to create a new column within the original data set that gives TRUE/FALSE based on whether (AvgConc) is greater than the maximum or less than the minimum I have calculated in the second data.frame. How do I go about doing this?
Failed attempt:
Outliers <- Original.Data %>%
group_by(Status, Stim, Treatment) %>%
mutate(Outlier = Original.Data$AvgConc > Quantiles.Data$Maximum | Original.Data$AvgConc < Quantiles.Data$Minimum) %>%
as.data.frame()
Error: Column Outlier must be length 8 (the group size) or one, not 192
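The error arises because Original.Data$AvgConc and Quantiles.Data$Maximum refer to the full columns (192 and 24 values), not to the 8 rows of the current group. Instead of using the $ references, join 'Quantiles.Data' to 'Original.Data' by 'Status', 'Stim' and 'Treatment', so that each row carries its own group's Minimum and Maximum: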
library(dplyr)
Original.Data %>%
  inner_join(Quantiles.Data %>%
               select(Status, Stim, Treatment, Maximum, Minimum),
             by = c("Status", "Stim", "Treatment")) %>%
  # after the join every row has its own Maximum/Minimum, so no grouping is needed
  mutate(Outlier = (AvgConc > Maximum) | (AvgConc < Minimum)) %>%
  as.data.frame()
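Since the question has no sample data, here is a small self-contained sketch of the same join-then-compare idea. The tables obs and limits and the single key column grp are hypothetical stand-ins for Original.Data, Quantiles.Data and the three grouping columns.

```r
library(dplyr)

# Hypothetical stand-ins for the two data frames
obs    <- tibble(grp = c("a", "a", "b", "b"), AvgConc = c(1, 9, 5, 6))
limits <- tibble(grp = c("a", "b"), Minimum = c(2, 4), Maximum = c(8, 7))

flagged <- obs %>%
  inner_join(limits, by = "grp") %>%                       # attach each group's limits to its rows
  mutate(Outlier = AvgConc > Maximum | AvgConc < Minimum)  # row-wise comparison

flagged
```

Here the two rows in group "a" fall outside [2, 8] and are flagged, while both rows in group "b" lie inside [4, 7].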
Suppose I'm working with a dataset and want to group it (e.g. by country), compute a summary statistic (mean()), and then ungroup() so that I get back a data.frame with the original dimensions (country-year) plus a new column holding each country's mean (repeated over its n years). How would I do that with dplyr? The ungroup() function doesn't return a data.frame with the original dimensions:
gapminder %>%
group_by(country) %>%
summarize(mn = mean(pop)) %>%
ungroup() # returns data.frame with nrows == length(unique(gapminder$country))
ungroup() is useful if you want to do something like
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
where you want to do some sort of transformation that uses an entire group's statistics. In the above example, mn is the ratio of a population to the group's average population. When it is ungrouped, any further mutations called on it would not use the grouping for aggregate statistics.
summarize automatically reduces the dimensions, and there's no way to get that back. Perhaps you wanted to do
gapminder %>%
group_by(country) %>%
mutate(mn = mean(pop)) %>%
ungroup()
This creates mn as the mean for each group, replicated for each row within that group.
The summarize() reduced the number of rows. If you didn't want to change the number of rows, then use mutate() rather than summarize().
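A quick way to see the difference in row counts, sketched here with the built-in mtcars data rather than gapminder (so cyl and mpg take the place of country and pop):

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# summarise() collapses each group to a single row
collapsed <- by_cyl %>% summarise(mn = mean(mpg))

# mutate() keeps every row and repeats the group mean within each group
kept <- by_cyl %>% mutate(mn = mean(mpg)) %>% ungroup()

nrow(collapsed)  # one row per distinct cyl value
nrow(kept)       # same as nrow(mtcars)
```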
Actually, ungroup() is not needed to compute the values in your case.
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop))
generates the same results as the following:
gapminder %>%
group_by(country) %>%
mutate(mn = pop/mean(pop)) %>%
ungroup()
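The values are identical. The differences are that the first result is still grouped by country, so later verbs keep operating per country, and that the second takes marginally longer to run.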
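Whether the leftover grouping matters depends on what comes next. A small sketch with mtcars (again standing in for gapminder) shows that the computed column is the same, but a later summarise() behaves differently on the grouped and ungrouped results:

```r
library(dplyr)

grouped   <- mtcars %>% group_by(cyl) %>% mutate(ratio = mpg / mean(mpg))
ungrouped <- grouped %>% ungroup()

# Same values either way
all.equal(grouped$ratio, ungrouped$ratio)

# A later summarise() still works per group on the grouped result
n_grouped   <- grouped %>% summarise(n = n())    # one row per cyl
n_ungrouped <- ungrouped %>% summarise(n = n())  # a single row
```

So calling ungroup() at the end is mostly a safeguard against surprises in later steps.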
I have a series of categorical variables that have the response options (Favorable, Unfavorable, Neutral).
I want to create a table in R that will give the list of all 10 variables in rows (one variable per row) - with the percentage response "Favorable, Unfavorable, Neutral" in the columns. Is this possible in R? Ideally, I would also want to be able to group this by another categorical variable (e.g. to compare how males vs. females responded to the questions differently).
You'll get better answers if you provide a sample of your actual data (see this post). That said, here is a solution using dplyr:: (and reshape2::melt).
# function to create a column of fake data
make_var <- function(n=100) sample(c("good","bad","ugly"), size=n, replace=TRUE)
# put ten of them together
dat <- as.data.frame(replicate(10, make_var()), stringsAsFactors=FALSE)
library("dplyr")
# then reshape to long format, group, and summarize --
dat %>% reshape2::melt(NULL) %>% group_by(variable) %>% summarize(
good_pct = (sum(value=="good") / length(value)) * 100,
bad_pct = (sum(value=="bad") / length(value)) * 100,
ugly_pct = (sum(value=="ugly") / length(value)) * 100
)
Note that to group by another column (e.g. sex), you can just say group_by(variable, sex) before you summarize (as long as sex is a column of the data, which isn't the case in this constructed example).
Adapting lefft's example but trying to do everything in the tidyverse (dplyr plus tidyr, which provides gather() and spread()):
library(dplyr)
library(tidyr)
dat %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  count(value) %>%
  mutate(pct = n / sum(n) * 100) %>%
  select(-n) %>%
  spread(value, pct)
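To also split the table by another categorical variable, the same pattern can carry that column through the reshape. Below is a sketch using tidyr's newer pivot_longer()/pivot_wider() in place of gather()/spread(). The survey data, the sex column and the question names q1 and q2 are all made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical survey: two questions plus a grouping column
survey <- tibble(
  sex = rep(c("F", "M"), each = 4),
  q1  = c("Favorable", "Favorable", "Neutral", "Unfavorable",
          "Favorable", "Neutral", "Neutral", "Unfavorable"),
  q2  = c("Neutral", "Favorable", "Unfavorable", "Unfavorable",
          "Favorable", "Favorable", "Favorable", "Neutral")
)

pct <- survey %>%
  pivot_longer(-sex, names_to = "variable", values_to = "value") %>%
  count(sex, variable, value) %>%
  group_by(sex, variable) %>%
  mutate(pct = n / sum(n) * 100) %>%   # percentages within each sex/question
  ungroup() %>%
  select(-n) %>%
  # responses that never occur in a cell get 0 instead of NA
  pivot_wider(names_from = value, values_from = pct, values_fill = 0)

pct
```

Each row is one question for one sex, and the three response columns in a row add up to 100.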