dplyr::group_by() fails to group the variables of the following data.frame, which is read from a PC-Axis file:
library("pacman")
pacman::p_load(pxR, dplyr, janitor)
px_file <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1502040100_131"
pxR::read.px(base::url(px_file))$DATA$value %>% # the data.frame
janitor::clean_names() %>%
dplyr::select (student_level = studienstufe,
year = jahr,
counts = value) %>% # dplyr::rename() also fails
dplyr::group_by (year, student_level) %>% # not grouping!
dplyr::summarise(totals = sum (counts))
I believe it could be due to an encoding issue, but I cannot find the problem. Any ideas? Thanks.
The only difference I could find is that you use select() instead of rename(). You wrote that rename() didn't work for you, but this worked for me:
library("pacman")
library("dplyr")
library("janitor")
# Loading your data
pacman::p_load(pxR, dplyr, janitor)
px_file <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1502040100_131"
px <- pxR::read.px(base::url(px_file))$DATA$value
# Cleaning the column names
px1 <- px %>% janitor::clean_names()
# Rename the columns
px2 <- px1 %>%
dplyr::rename (student_level = studienstufe,
sex = geschlecht,
year = jahr,
counts = value)
# Grouping data
px3 <- px2 %>%
dplyr::group_by (year, student_level) %>%
dplyr::summarise(totals = sum (counts))
I assigned each step to its own data frame so I could inspect the intermediate results. This is not necessary.
If this doesn't work, please post your session info (the output of sessionInfo()).
P.S. I also renamed the column geschlecht :)
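To check the rename, group and summarise steps without downloading the px file, here is a minimal sketch on a made-up data frame. The column names mirror the cleaned ones above, but the values are invented for illustration and say nothing about the real data.

```r
library(dplyr)

# Made-up stand-in for the cleaned px data (values are invented)
students <- data.frame(
  studienstufe = c("Bachelor", "Bachelor", "Master", "Master"),
  geschlecht   = c("Mann", "Frau", "Mann", "Frau"),
  jahr         = c("2019", "2019", "2019", "2019"),
  value        = c(10, 20, 5, 7)
)

totals <- students %>%
  rename(student_level = studienstufe,
         sex = geschlecht,
         year = jahr,
         counts = value) %>%
  group_by(year, student_level) %>%
  summarise(totals = sum(counts), .groups = "drop")

totals
```

If this toy version groups correctly but the real data does not, comparing the column types with str() on both might be a reasonable next step.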
Psychologists work with Likert scales almost all the time, and let's say I have this dataset:
data <- data.frame(x1 = c(NA,2,4),
x2 = c(NA,3,2),
x3 = c(NA,6,NA))
I would like to use rowSums only if x1, x2, and x3 are not missing.
This won't work because it will not consider any variable with missing cases:
data %>%
mutate(total_score = rowSums(select(.,x1:x3), na.rm=F))
And this will not work either:
data %>%
filter_at(vars(x1:x2), any_vars(!is.na(.))) %>%
mutate(total_score = rowSums(select(.,x1:x3), na.rm=T))
Because it will filter my dataset and then reduce the number of observations.
Therefore, I would like to integrate filter within mutate.
I have read a post before this one, but I was not able to implement it.
PS: I would like to stay within the tidyverse.
Thank you
The following works for me:
data <- data.frame(x1 = c(NA,2,4),
x2 = c(NA,3,2),
x3 = c(NA,6,NA))
mutate(data, tmp = x1+x2+x3)
If you are committed to using the rowSums function then one option is to coalesce first:
data %>%
mutate_all(~{coalesce(.,-1000)}) %>% # replace all NA with -1000
mutate(total_score = rowSums(select(.,x1:x3), na.rm=F)) %>%
mutate_all(~{ifelse(.<0,NA,.)}) # replace any negative numbers with NA
coalesce() is a dplyr function that returns the first non-NA value. The idea here is that if we carry out a row sum after replacing all NAs with large negative numbers, then any negatives in the result must have come from NAs. Of course, this assumes all your input values are non-negative and that no row sum of the real values reaches 1000.
In case it is unfamiliar the ~{.} pattern is used for anonymous functions. So ~{coalesce(.,-1000)} is equivalent to function(x){coalesce(x,-1000)}
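The sentinel trick depends on the inputs being non-negative. As a sketch of a different technique that avoids that assumption, one can count the missing values per row and blank out the totals directly. Here score_rows() and its max_missing argument are hypothetical names introduced for illustration, not part of dplyr.

```r
library(dplyr)

data <- data.frame(x1 = c(NA, 2, 4),
                   x2 = c(NA, 3, 2),
                   x3 = c(NA, 6, NA))
items <- c("x1", "x2", "x3")

# Hypothetical helper: row sums that become NA when a row has more than
# max_missing missing items. No rows are dropped.
score_rows <- function(df, cols, max_missing = 0) {
  n_missing <- rowSums(is.na(df[cols]))
  totals <- unname(rowSums(df[cols], na.rm = TRUE))
  totals[n_missing > max_missing] <- NA
  totals
}

# NA as soon as any item is missing
strict <- data %>% mutate(total_score = score_rows(., items))

# NA only when every item is missing
lenient <- data %>% mutate(total_score = score_rows(., items, max_missing = 2))
```

Both results keep all three rows, so this behaves like a filter applied inside mutate rather than before it.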
I am trying to compute the column means and row means of some data I have.
It's similar to the following:
library(rsample)
library(tidyquant)
library(tidyverse)
library(tsibble)
aapl <- tq_get("AAPL", start_date = "2000-01-01")
aapl_monthly_nested <- aapl %>%
mutate(ym = yearmonth(date)) %>%
nest(-ym)
aapl_rolled <- aapl_monthly_nested %>%
rolling_origin(cumulative = FALSE)
map(aapl_rolled$splits, ~ analysis(.x)) %>%
head
I try using the summarise_all function once I have mapped over the data but I cannot seem to get the colMeans. I have replaced colMeans with mean without luck.
x <- map(aapl_rolled$splits, ~analysis(.x),
~map(data,
~summarise_all(.funs(colMeans))))
x[[1]]$data
I would like a single observation of the column means for each of the splits.
EDIT:
I think I got it. I believe I forgot to unnest the data after nesting it previously.
x <- map(aapl_rolled$splits, ~ analysis(.x) %>%
unnest() %>%
as_tibble(.) %>%
select(-ym) %>% # the nesting column is called ym in the code above
summarise_all(mean))
If you have a better solution please let me know.
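For reference, the general pattern of "one row of column means per split" can be sketched without the stock download. The windows list below is made-up data standing in for the unnested result of analysis(.x) on each split; map_dfr() and across() are the only tools assumed.

```r
library(dplyr)
library(purrr)

# Made-up stand-ins for the unnested analysis() data of two splits
windows <- list(
  data.frame(open = c(1, 2, 3), close = c(2, 3, 4)),
  data.frame(open = c(4, 5, 6), close = c(5, 6, 7))
)

# One row of column means per split, stacked into a single data frame;
# .id records which split each row came from
split_means <- map_dfr(
  windows,
  ~ summarise(.x, across(where(is.numeric), mean)),
  .id = "split"
)

split_means
```

Restricting the summary to numeric columns sidesteps warnings from taking the mean of character columns, if the real data contains any.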
Starting point:
I have a dataset (tibble) which contains a lot of variables of the same class (dbl). They belong to different settings. One variable (a column in the tibble) is missing: the row sum of all variables belonging to one setting.
Aim:
My aim is to produce sub-datasets with the same data structure for each setting, including the "rowSum" variable (I call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the setting of interest (this works) and
(2) try to rowSums the variables of the setting (does not work; Why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can convert the quosures to strings with as_name and subset the dataset with [ for the rowSums:
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Another option is to select the quosures first and then do the rowSums:
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the setting of interest
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
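As a further sketch, the dots can also be forwarded straight to select() without capturing quosures first, which additionally allows tidyselect helpers such as starts_with(). sum_setting() is a hypothetical name for this variant, and it assumes the dataset has the day and N columns described in the question.

```r
library(dplyr)

set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
                  col2 = runif(20))

# Hypothetical variant of mkr.sumsetting(): the setting's columns are
# selected once up front, then summed row-wise.
sum_setting <- function(dataset, ...) {
  setting_cols <- dplyr::select(dataset, ...)
  dataset %>%
    dplyr::select(day, N, ...) %>%
    dplyr::mutate(s1 = rowSums(setting_cols, na.rm = TRUE))
}

out  <- sum_setting(df1, col1, col2)
out2 <- sum_setting(df1, starts_with("col"))
head(out, 3)
```

Putting dataset first also lets the function sit at the end of a pipe, e.g. df1 %>% sum_setting(col1, col2).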
I am trying to build a summary table of a data frame like DataProfile below.
The idea is to transform each column into a row and add variables for count, nulls, not nulls, unique, and add additional mutations of those variables.
It seems like there should be a better faster way to do this. Is there a function that does this?
#trying to write the functions within dplyr & magrittr framework
library(tidyverse)
library(reshape2) # melt() comes from reshape2, not the tidyverse
mtcars[2,2] <- NA # Add a null to test completeness
#
total <- mtcars %>% summarise_all(funs(n())) %>% melt
nulls <- mtcars %>% summarise_all(funs(sum(is.na(.)))) %>% melt
filled <- mtcars %>% summarise_all(funs(sum(!is.na(.)))) %>% melt
uniques <- mtcars %>% summarise_all(funs(length(unique(.)))) %>% melt
mtcars %>% summarise_all(funs(n_distinct(.))) %>% melt
#Build a Data Frame from names of mtcars and add variables with mutate
DataProfile <- as.data.frame(names(mtcars))
DataProfile <- DataProfile %>% mutate(Total = total$value,
Nulls = nulls$value,
Filled = filled$value,
Complete = Filled/Total,
Cardinality = uniques$value,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
#These are other attempts with Base R, but they are harder to read and don't play well with summarise_all
sapply(mtcars, function(x) length(unique(x[!is.na(x)]))) %>% melt
rapply(mtcars,function(x)length(unique(x))) %>% melt
The summarise_all() function can process more than one function at a time, so you can consolidate code by doing it in one pass then formatting your data to get to the type of "profile" per variable that you want.
library(tidyverse)
library(reshape2) # melt() comes from reshape2, not the tidyverse
mtcars[2,2] <- NA # Add a null to test completeness
DataProfile <- mtcars %>%
summarise_all(funs("Total" = n(),
"Nulls" = sum(is.na(.)),
"Filled" = sum(!is.na(.)),
"Cardinality" = length(unique(.)))) %>%
melt() %>%
separate(variable, into = c('variable', 'measure'), sep="_") %>%
spread(measure, value) %>%
mutate(Complete = Filled/Total,
Uniqueness = Cardinality/Total,
Distinctness = Cardinality/Filled)
DataProfile
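The same one-pass idea can be sketched with across() and pivot_longer(), which avoids melt() and funs(). This is only an illustration and assumes dplyr 1.0 or later and tidyr 1.0 or later. It works on a copy called cars so that mtcars itself is left untouched, and it relies on the column names containing no underscores.

```r
library(dplyr)
library(tidyr)

cars <- mtcars
cars[2, 2] <- NA # add a null to test completeness

profile <- cars %>%
  summarise(across(everything(), list(
    Total       = ~ length(.x),
    Nulls       = ~ sum(is.na(.x)),
    Filled      = ~ sum(!is.na(.x)),
    Cardinality = ~ n_distinct(.x)
  ))) %>%
  # Columns are named like mpg_Total; split them into one row per variable
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_sep = "_") %>%
  mutate(Complete = Filled / Total,
         Uniqueness = Cardinality / Total,
         Distinctness = Cardinality / Filled)

profile
```

Like length(unique(.)), n_distinct() counts NA as a distinct value here, so Cardinality matches the original definition.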
I want to add a suffix or prefix to most variable names in a data.frame, typically after they've all been transformed in some way and before performing a join. I don't have a way to do this without breaking up my piping.
For example, with this data:
library(dplyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
I want to get to this result (note variable names):
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
My current approach is:
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.)))
names(means14)[2:length(names(means14))] <- paste0(names(means14)[2:length(names(means14))], "_mean_2014")
Is there an alternative to that clunky last line that breaks up my pipes? I've looked at select() and rename() but don't want to explicitly specify each variable name, as I usually want to rename all except a single variable and might have a much wider data.frame than in this example.
I'm imagining a final piped command that approximates this made-up function:
appendname(cols = 2:n, str = "_mean_2014", placement = "suffix")
Which doesn't exist as far as I know.
You can pass functions to rename_at, so do
means14 <- dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_all(funs(mean(.))) %>%
rename_at(vars(-class), function(x) paste0(x, "_mean_2014"))
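In more recent dplyr versions (1.0 and later, as far as I know) the same step can be written with rename_with() and across(). The following is a sketch along those lines using the example data from the question.

```r
library(dplyr)

set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
                    force = rexp(10), class = rep(c("a", "b"), 5))

means14 <- dat14 %>%
  select(-ID) %>%
  group_by(class) %>%
  summarise(across(everything(), mean)) %>%
  # Rename every column except class, without listing them one by one
  rename_with(~ paste0(.x, "_mean_2014"), .cols = -class)

names(means14)
```

Switching to a prefix only requires changing the function, e.g. ~ paste0("mean_2014_", .x).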
After additional experimenting since posting this question, I've found that the setNames function will work with the piping as it returns a data.frame:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
setNames(c(names(.)[1], paste0(names(.)[-1],"_mean_2014")))
class speed_mean_2014 power_mean_2014 force_mean_2014
1 a 0.5572500 0.8 0.5519802
2 b 0.2850798 0.6 1.0888116
This is a bit quicker, but not totally what you want:
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) -> means14
names(means14)[-1] %<>% paste0("_mean_2014")
If you haven't used the %<>% operator before, it is worth looking up: it comes from the magrittr package (load it with library(magrittr)) and is a very useful tool.
You can also use it for recomputing or rounding columns in place, for example df$meancolumn %<>% round(). It comes up often and saves a lot of typing.
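A small sketch of both uses of %<>% on a made-up data frame (the df object and its values are invented for illustration):

```r
library(magrittr)

df <- data.frame(id = 1:3, meancolumn = c(1.234, 2.567, 3.891))

# Round a column in place: same as df$meancolumn <- round(df$meancolumn, 1)
df$meancolumn %<>% round(1)

# Append a suffix to every name except the first, in place
names(df)[-1] %<>% paste0("_mean_2014")

df
```

The operator pipes the left-hand side into the right-hand side and assigns the result back to the left-hand side.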
As of February 2017 you can do this with the dplyr command rename_(...).
In the case of this example you could do.
dat14 %>%
group_by(class) %>%
select(-ID) %>%
summarise_each(funs(mean(.))) %>%
rename_(.dots = setNames(names(.)[-1], paste0(names(.)[-1],"_mean_2014")))
This is rather similar to the answer with setNames but works with tibbles too!
This is more of a step back, but you might think of reshaping your data in order to apply the function to multiple years at the same time. This will preserve tidiness. If you're going to end up comparing different years, it might make sense to have the year be a separate variable in the data frame, rather than storing the year in the names. You should be able to use summarise_ to get the mean_year behavior. See http://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
library(dplyr)
library(tidyr)
set.seed(1)
dat14 <- data.frame(ID = 1:10, speed = runif(10), power = rpois(10, 1),
force = rexp(10), class = rep(c("a", "b"),5))
dat14 %>%
gather(variable, value, -ID, -class) %>%
mutate(year = 2014) %>%
group_by(class, year, variable)%>%
summarise(mean = mean(value))
While Sam Firke's solution using setNames() is certainly the only solution that keeps an unbroken pipe, it will not work with the tbl objects from dplyr, since their column names are not accessible to the usual base R naming functions. Here is a function that you can use within a pipe with tbl objects as well, thanks to this solution by hrbrmstr. It adds predefined prefixes and suffixes at the specified column indices. The default is all columns.
tbl.renamer <- function(tbl,prefix="x",suffix=NULL,index=seq_along(tbl_vars(tbl))){
newnames <- tbl_vars(tbl) # Get old variable names
names(newnames) <- newnames
names(newnames)[index] <- paste0(prefix,".",newnames,suffix)[index] # create a named vector for .dots
rename_(tbl,.dots=newnames) # rename the variables
}
Example usage (assuming auth_user is a tbl_sql object):
auth_user %>% tbl_vars
tbl.renamer(auth_user) %>% tbl_vars
auth_user %>% tbl.renamer %>% tbl_vars
auth_user %>% tbl.renamer(index = c(1,5)) %>% tbl_vars
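For local data frames and tibbles, a similar helper can be sketched with rename_with(). add_affix() is a hypothetical name, the users tibble is made up, and I have only considered in-memory data here, not tbl_sql objects.

```r
library(dplyr)

# Hypothetical helper: add a prefix and/or suffix to the columns at the
# given positions (all columns by default)
add_affix <- function(tbl, prefix = "", suffix = "",
                      index = seq_along(names(tbl))) {
  rename_with(tbl, ~ paste0(prefix, .x, suffix), .cols = all_of(index))
}

# Made-up local table standing in for auth_user
users <- tibble(id = 1:2, name = c("a", "b"), email = c("x", "y"))

all_prefixed  <- users %>% add_affix(prefix = "x.") %>% names()
some_suffixed <- users %>% add_affix(suffix = "_2014", index = c(1, 3)) %>% names()
```

Unlike tbl.renamer above, this sketch does not insert a "." separator automatically, so the separator has to be part of the prefix.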