Assign median value for each quartile group using loop [R] - r

I need to categorize numeric variable into the quartile and assign the median values for the quartile groups using loop (because my original dataset has lots of variable).
What I intend is doing the following manipulation over lots of variables:
data(iris)
iris%>%mutate(Sepal.Lengthq=as.factor(ntile(Sepal.Length,4)))%>%
group_by(Sepal.Lengthq)%>%
mutate(Sepal.Lengthq_median=median(Sepal.Length,na.rm=T))
I need loop, so I wrote codes like:
quartilization=c("Sepal.Length","Sepal.Width")
for (i in seq_along(quartilization)){
iris2=iris %>%
mutate(!!str_c(quartilization[i],"q"):=ntile(.[[quartilization[i]]],4)) %>%
group_by_at(vars(one_of(!!str_c(quartilization[i],"q")))) %>%
mutate(!!str_c(quartilization[i],"qn"):=median(.[[quartilization[i]]],na.rm=T)) %>%
ungroup()
}
However, 1) it does not return "Sepal.Lengthqn" and 2) "Sepal.Widthqn" is a same value over samples.
I feel like the syntax for the median function is wrong, but cannot fix it.
So appreciated if anyone could share me some input. Thank you.

When you are using ., you refer to entire dataframe, hence you get the same value for all the years. Use .data in median to get data in the group.
I use map_dfc instead of for loop because it is easier and shorter. I also use transmute instead of mutate because mutate returns all the column every time whereas transmute only returns the changed columns which can be binded to original dataframe.
library(dplyr)
library(purrr)
library(stringr)
quartilization=c("Sepal.Length","Sepal.Width")
bind_cols(iris, map_dfc(quartilization, ~{
iris %>%
group_by(!!str_c(.x,"q") := ntile(.[[.x]],4)) %>%
transmute(!!str_c(.x,"qn"):= median(.data[[.x]],na.rm=TRUE))
}))

Related

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as >df%>% but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
>df%>%
+pull(height)%>%
+mean(na.rm=TRUE)
+print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- seq(1:30)
weight <- seq(1:30)
df <- data.frame(height, weight)
These pipe operators work with the majority of the tidyverse (not just magrittr). What you are trying to do is actually coming out of dplyr. The na.rm=T is required for many summary variables like mean, sd, as well as certain functions used to gather specific data points like min, max, etc. These functions don't play well with NA values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested you may not even need to use pull
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]

Error when trying to calculate column average - R

I have a dataframe so when I try to calculate the mean of column A I just write
mean(df$A)
and it works fine.
But when I try to calculate mean of only part of the data frame I get an error saying it isn't a number or logical value
df$A %>% filter(A=="some value") %>% mean(df$A)
The type of A is double. I also tried to convert it to numeric using
df$A <- as.numeric(as.character(df$A))
but it didn't work.
Best would be to provide an example of your column A.
However, by just looking to your question the problem is in your magrittr-dplyr syntax.
base syntax:
mean(df$A[df$A == 'some value'])
dplyr with pipes:
df %>% filter(A==2) %>% summarise(., average = mean(A))
Careful with syntax and pipes, more info here.
Try df %>% filter(A==some value) %>% summarise(mean(A)).
Note that the mean will be some value because of the filter.
Also, mean() works fine with objects of class double

How to Create Multiple Frequency Tables with Percentages Across Factor Variables using Purrr::map

library(tidyverse)
library(ggmosaic) for "happy" dataset.
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency precentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work...
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
select_if(is.factor) %>%
map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE = happy %>%
select_if(is.factor) %>%
gather() %>%
filter(!is.na(value)) %>%
group_by(key, value) %>%
summarise(count = n()) %>%
mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
split(.$key) %>%
map(~spread(.x, value, perc))

How to compute a column that depends on a function that uses the value of a variable of each row?

This is a mock-up based on mtcars of what I would like to do:
compute a column that counts the number of cars that have less
displacement (disp) of the current row within the same gear type
category (am)
expected column is the values I would like to get
try1 is one try with the findInterval function, the problem is that I cannot make it count across the subsets that depend on the category (am)
I have tried solutions with *apply but I am somehow never able to make the function called work only on a subset that depends on the value of a variable of the row that is processed (hope this makes sense).
x = mtcars[1:6,c("disp","am")]
# expected values are the number of cars that have less disp while having the same am
x$expected = c(1,1,0,1,2,0)
#this ordered table is for findInterval
a = x[order(x$disp),]
a
# I use the findInterval function to get the number of values and I try subsetting the call
# -0.1 is to deal with the closed intervalq
x$try1 = findInterval(x$disp-0.1, a$disp[a$am==x$am])
x
# try1 values are not computed depending on the subsetting of a
Any solution will do; the use of the findInterval function is not mandatory.
I'd rather have a more general solution enabling a column value to be computed by calling a function that takes values from the current row to compute the expected value.
As pointed out by #dimitris_ps, the previous solution neglects the duplicated counts. Following provides the remedy.
library(dplyr)
x %>%
group_by(am) %>%
mutate(expected=findInterval(disp, sort(disp) + 0.0001))
or
library(data.table)
setDT(x)[, expected:=findInterval(disp, sort(disp) + 0.0001), by=am]
Based on #Khashaa's logic this is my approach
library(dplyr)
mtcars %>%
group_by(am) %>%
mutate(expected=match(disp, sort(disp))-1)

Error dplyr summarise

I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code perfectly works. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" is a vector with 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
PD: for the sake of understanding why I need this, I have another data frame with multiple columns with complex names that need to be "summarised", that's why I can't put one by one on the summarise command. What I want is to put a vector there to calculate the probs of each column grouped by patients.
It appears you want summarise_each
vector %>%
group_by(patient) %>%
summarise_each(funs(mean), vars= matches('prop'))
Using data.table you could do
setDT(vector)[,lapply(.SD,mean),by=patient,.SDcols='prob')

Resources