Factor data frame values into quartile/decile ranges

I'm trying to create decile factors corresponding to my dataframe's values. I would like the factors to appear as a range e.g. if the value is "164" then the factored result should be "160 - 166".
In the past I would do this:
quantile(countries.Imported$Imported, seq(0,1, 0.1), na.rm = T) # display deciles
Imported.levels <- c(0, 1000, 10000, 20000, 30000, 50000, 80000) # create levels from observed deciles
Imported.labels <- c('< 1,000t', '1,000t - 10,000t', '10,000t - 20,000t', etc) # create corresponding labels
colfunc <- colorRampPalette(c('#E5E4E2', '#8290af','#512888'))
# apply factor function
Imported.colors <- colfunc(10)
names(Imported.colors) <- Imported.labels
countries.Imported$Imported.fc <- factor(
cut(countries.Imported$Imported, Imported.levels),labels = Imported.labels)
Instead, I would like to apply a function that factors the values into decile ranges. I want to avoid setting factor labels manually, since I will be running many queries and plotting maps that have discrete legends. I've created a column called Value.fc, but I cannot reformat its labels from "(160, 166]" to "160 - 166". The problematic code is below:
corn_df <- corn_df %>%
mutate(Value.fc = gtools::quantcut(Value, 10))
corn_df %>%
select(Value, unit_desc, domain_desc, Value.fc) %>%
head(6)
# A tibble: 6 x 4
Value unit_desc domain_desc Value.fc
<dbl> <chr> <chr> <fct>
1 164. BU / ACRE TOTAL (160,166]
2 196. BU / ACRE TOTAL (191,200]
3 203. BU / ACRE TOTAL (200,230]
4 205. BU / ACRE TOTAL (200,230]
5 172. BU / ACRE TOTAL (171,178]
6 213. BU / ACRE TOTAL (200,230]

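You can use dplyr::ntile() or Hmisc::cut2().
If you want to know where each decile of the variable starts and ends, you can combine Hmisc::cut2() with stringr::str_extract_all(). Note that the pattern "\\d+" only picks up runs of digits, so this works as written for whole-number cut points.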
require(tidyverse)
require(Hmisc)
require(stringr)
df <- data.frame(value = 1:100) %>%
  mutate(decile = cut2(value, g=10),
         decile = factor(sapply(str_extract_all(decile, "\\d+"),
                                function(x) paste(x, collapse="-"))))
head(df)
value decile
1 1 1-11
2 2 1-11
3 3 1-11
4 4 1-11
5 5 1-11
6 6 1-11
If you only need the decile number that each value falls into, you can use dplyr::ntile().
require(tidyverse)
df <- data.frame(value = 1:100) %>%
  mutate(decile = ntile(value, 10))
head(df)
value decile
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
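For the "160 - 166" label format asked about in the question, another route is to build the labels directly from the quantile break points and pass them to cut(). The following is a minimal base R sketch under that assumption, not the method used in the answer above. decile_ranges() and its arguments are hypothetical names introduced for illustration, and the break points are rounded to one decimal for the labels only.

```r
# Hypothetical helper (not from any package): bin x into n quantile groups
# and label each bin "lower - upper" instead of "(lower,upper]".
decile_ranges <- function(x, n = 10, sep = " - ", digits = 1) {
  # Quantile break points; unique() drops repeated breaks caused by ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n + 1),
                            na.rm = TRUE, names = FALSE))
  # Rounded copies are used for display only; binning uses the exact breaks
  shown  <- round(breaks, digits)
  labels <- paste(head(shown, -1), tail(shown, -1), sep = sep)
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

df <- data.frame(value = 1:100)
df$decile <- decile_ranges(df$value)
head(df)
table(df$decile)
```

Because the labels are generated from the data, nothing has to be typed by hand for each query. If rounding makes two neighbouring labels identical, cut() will reject the duplicated labels, so digits may need to be raised for tightly packed data.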

Related

How to calculate mean by row for multiple groups using dplyr in R?

I have a dataframe with 4 columns Age, Location, Distance, and Value. Age and Location each have two possible values whereas Distance can have three. Value is the observed continuous variable which has been measured 3 times per Distance.
Accounting for Age and Location, I would like to calculate a mean for one of the Distance values and then calculate another mean Value when the other two Distance are combined. I am trying to answer, what is the mean Value for Distance 0.5 relative to Distance 1.5 & 2.5 for each Age and Location?
How can I do this using dplyr?
Example Data
library(dplyr)
set.seed(123)
df1 <- data.frame(matrix(ncol = 4, nrow = 36))
x <- c("Age","Location","Distance","Value")
colnames(df1) <- x
df1$Age <- rep(c(1,2), each = 18)
df1$Location <- as.character(rep(c("Central","North"), each = 9))
df1$Distance <- rep(c(0.5,1.5,2.5), each = 3)
df1$Value <- round(rnorm(36,200,25),0)
Output should look something like this
Age Location Mean_0.5 Mean_1.5_and_2.5
1 1 Central 206 202
2 1 North 210 201
3 2 Central 193 186
4 2 North 202 214

We may use %in% or == to subset the 'Value' based on the 'Distance' values (assuming the precision is correct) after grouping by 'Age', 'Location'
library(dplyr)
df1 %>%
group_by(Age, Location) %>%
summarise(Mean_0.5 = mean(Value[Distance == 0.5]),
Mean_1.5_and_2.5 = mean(Value[Distance %in% c(1.5, 2.5)]),
.groups = 'drop')
-output
# A tibble: 4 × 4
Age Location Mean_0.5 Mean_1.5_and_2.5
<dbl> <chr> <dbl> <dbl>
1 1 Central 206. 202.
2 1 North 210. 201.
3 2 Central 193 186.
4 2 North 202. 214.
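The "assuming the precision is correct" caveat matters when Distance comes out of arithmetic rather than typed-in literals, because == compares doubles exactly. A hedged variant of the same summary uses dplyr::near(), which compares within a small tolerance. The data is rebuilt here so the sketch runs on its own; res is a name introduced for illustration.

```r
library(dplyr)

# Same example data as in the question, built in one step
set.seed(123)
df1 <- data.frame(
  Age      = rep(c(1, 2), each = 18),
  Location = rep(rep(c("Central", "North"), each = 9), times = 2),
  Distance = rep(rep(c(0.5, 1.5, 2.5), each = 3), times = 4),
  Value    = round(rnorm(36, 200, 25), 0)
)

res <- df1 %>%
  group_by(Age, Location) %>%
  summarise(
    # near() tolerates tiny floating-point differences
    Mean_0.5         = mean(Value[near(Distance, 0.5)]),
    # every Distance that is not (approximately) 0.5, i.e. 1.5 and 2.5 here
    Mean_1.5_and_2.5 = mean(Value[!near(Distance, 0.5)]),
    .groups = "drop"
  )
res
```

With Distance values entered as literals, as in this example, the result should match the == / %in% version.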

How to take an arithmetic average over common variable, rather than whole data?

I have a data frame of daily stock prices, along with a variable that indicates the week of the year (1, 2, 3, 4, ..., 51, 52); this is repeated for 22 companies. I would like to create a new variable that averages the daily prices within each week.
In the equation, d = day and t = week. My challenge is taking this average over the days within each week, so I should end up with 52 values per stock.

Using ave().
dat <- transform(dat, avg_week_price=ave(price, week, company))
head(dat, 9)
# week company wday price avg_week_price
# 1 1 1 a 16.16528 15.47573
# 2 2 1 a 18.69307 15.13812
# 3 3 1 a 11.01956 12.99854
# 4 1 2 a 15.92029 14.56268
# 5 2 2 a 12.26731 13.64916
# 6 3 2 a 17.40726 17.27226
# 7 1 3 a 11.83037 13.02894
# 8 2 3 a 13.09144 12.95284
# 9 3 3 a 12.08950 15.81040
Data:
set.seed(42)
dat <- expand.grid(week=1:3, company=1:5, wday=letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)

An option with dplyr
library(dplyr)
dat %>%
group_by(week, company) %>%
mutate(avg_week_price = mean(price))
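Both answers above attach the weekly mean to every daily row. If the goal is one value per week and company (52 per stock in the question), the data can be collapsed instead. A minimal base R sketch, reusing the example data from the first answer; the name weekly is introduced here for illustration.

```r
# Example data as in the first answer
set.seed(42)
dat <- expand.grid(week = 1:3, company = 1:5, wday = letters[1:7])
dat$price <- runif(nrow(dat), 10, 20)

# One row per week/company combination, holding the mean daily price
weekly <- aggregate(price ~ week + company, data = dat, FUN = mean)
names(weekly)[names(weekly) == "price"] <- "avg_week_price"
head(weekly)
```

The dplyr equivalent would swap mutate() for summarise() after the same group_by().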

Sort Numeric Bands in R

I have some numeric variables which are categorised into a few bands (like 1-3, 3-5, 5-7 etc). I want to maintain their band order. For example, take the data frame below.
df <- data.frame(x = c("1-3", "3-5","5-9", "9-10", "10-12"))
When I run any data manipulation operation (like group_by, count) in this column, it returns this output.
Current Output
library(tidyverse)
df %>% count(x)
x n
<fct> <int>
1 1-3 1
2 3-5 1
3 5-9 1
4 9-10 1
5 10-12 1
Desired Output
x n
<fct> <int>
1 1-3 1
2 3-5 1
3 5-9 1
4 9-10 1
5 10-12 1
Important note: the solution should be dynamic, meaning it should work for any numeric bands, even if they start from 1000 or any other value (for example 1250 - 2500, 2500 - 5000, 5000 - 10000, 10000 - 20000, etc.). A dplyr solution is preferred.

If x is always sorted and in the same order as shown in the example you could arrange the factor levels based on their appearance before using count.
library(dplyr)
library(rlang)
df %>%
mutate(x = factor(x, levels = unique(x))) %>%
count(x)
However, a general solution would be to get the number before "-" and arrange data based on that.
df %>%
mutate(x1 = as.numeric(sub('-.*', '', x)),
x = factor(x, levels = unique(x[order(x1)]))) %>%
count(x)
To wrap this in a function we can use :
count_band_data <- function(data, col, sep = '-') {
data %>%
mutate(temp = as.numeric(sub(paste0(sep, '.*'), '', {{col}})),
{{col}} := factor({{col}}, levels = unique({{col}}[order(temp)]))) %>%
count({{col}})
}
and then use it as :
df %>% count_band_data(x)
# A tibble: 5 x 2
# x n
# <fct> <int>
#1 1-3 1
#2 3-5 1
#3 5-9 1
#4 9-10 1
#5 10-12 1
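As a quick check of the idea (order the bands by the number before the separator, then use that order as the factor levels), here is a small base R sketch on shuffled bands with a repeated value. The vector bands is made-up illustration data; unique() is there because factor() rejects duplicated levels.

```r
# Made-up bands, deliberately out of order and with "1-3" repeated
bands <- c("10-12", "1-3", "9-10", "3-5", "5-9", "1-3")

# Lower bound of each band: drop everything from the first "-" onward
lower <- as.numeric(trimws(sub("-.*", "", bands)))

# Levels follow the numeric order of the lower bounds
f <- factor(bands, levels = unique(bands[order(lower)]))
levels(f)
# "1-3"   "3-5"   "5-9"   "9-10"  "10-12"
table(f)
```

Once the column is a factor with these levels, count(), group_by() and plotting legends keep the bands in numeric rather than alphabetical order.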

Apply function to create mean for filtered columns across multiple columns r

I have a data frame with likert scoring across multiple aspects of a course (about 40 columns of likert scores like the two in the sample data below).
Not all rows contain valid scores. Valid scores are 1:5. Invalid scores are allocated 96:99 or are simply missing.
I would like to create an average score for each individual ID for each of the satisfaction columns that:
1) filters for invalid scores,
2) creates a mean of the valid scores for each id .
3) places the mean satisfaction score for each id in a new column labelled [column.name].mean as in Skill.satisfaction.mean below
I have included a sample data frame and the transformation of the data frame that I would like on a single row below.
####sample score vector
possible.scores <-c(1:5, 96,97, 99,"")
####data frame
ratings <- data.frame(ID = c(rep(1:7, each =2), 8:10), Degree = c(rep("Double", times = 14), rep("Single", times = 3)),
Skill.satisfaction = sample(possible.scores, size = 17, replace = TRUE),
Social.satisfaction = sample(possible.scores, size = 17, replace = TRUE)
)
####transformation applied over one of the satisfaction scales
ratings<- ratings %>%
group_by(ID) %>%
filter(!Skill.satisfaction %in% c(96:99), Skill.satisfaction!="") %>%
mutate(Skill.satisfaction.mean = mean(as.numeric(Skill.satisfaction), na.rm = T))

library(dplyr)
ratings %>%
group_by(ID) %>%
#Change satisfaction columns from factor into numeric
mutate_at(vars(-ID,-Degree), list(~as.numeric(as.character(.)))) %>%
#Get mean for values in 1:5
mutate_at(vars(-ID,-Degree), list(mean=~mean(.[. %in% 1:5], na.rm = T)))
# A tibble: 6 x 6
# Groups: ID [3]
ID Degree Skill.satisfaction Social.satisfaction Skill.satisfaction_mean Social.satisfaction_mean
<int> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Double 96 99 2 NaN
2 1 Double 2 97 2 NaN
3 2 Double 1 97 1 NaN
4 2 Double 97 NA 1 NaN
5 3 Double 96 96 NaN 3
6 3 Double 99 3 NaN 3
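mutate_at() still works but is superseded in current dplyr by across(). Below is a sketch of the same idea with across(), assuming dplyr 1.0 or later. valid_mean() is a hypothetical helper, the small ratings data frame is made up for illustration, and the .names pattern produces the [column.name].mean naming requested in the question.

```r
library(dplyr)

# Made-up sample: scores stored as text, with invalid codes and a blank
ratings <- data.frame(
  ID = c(1, 1, 2, 2, 3),
  Degree = "Double",
  Skill.satisfaction  = c("96", "2", "1", "97", ""),
  Social.satisfaction = c("3", "97", "5", "4", "99")
)

# Hypothetical helper: mean of the valid scores (1 to 5) only;
# blanks become NA and are excluded by %in% along with 96:99
valid_mean <- function(x) {
  v <- suppressWarnings(as.numeric(as.character(x)))
  mean(v[v %in% 1:5])
}

out <- ratings %>%
  group_by(ID) %>%
  mutate(across(ends_with("satisfaction"), valid_mean,
                .names = "{.col}.mean")) %>%
  ungroup()
out
```

An ID with no valid score in a column gets NaN for that column's mean, as in the output above.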

Using tidyverse, sum values conditionally on distributions within each subset

I have an example dataframe below where each day of the month and precip are recorded.
set.seed(560)
df<-data.frame(month= rep(1:4, each=30),
precip= rep(c(rnorm(30, 20, 10), rnorm(30, 10, 2),
rnorm(30, 50, 1), rnorm(30, 15, 3))))
For each month, I wish to count the number of values that lie more than 2 standard deviations (sd) above or below the mean of that month's precip values. Essentially, I need to find the values at the extremes (i.e. the tails) of the distribution. The result column will be called count.
The output would appear as follows for this example dataset:
set.seed(560)
output<-data.frame(month= rep(1:4, each=1), count= c(1,2,1,1))
Notice that for month 1, values above 35.969 or below 2.61 lie more than 2 sd from the mean. One value (precip=41.1) meets this requirement. Proof:
sub1<- subset(df, month==1)
v1<- mean(sub1$precip)+ 2*sd(sub1$precip)#35.969
v2<- mean(sub1$precip)- 2*sd(sub1$precip)#2.61
sub2<- subset(df, month==2)
v3<- mean(sub2$precip)+ 2*sd(sub2$precip)#13.89
v4<- mean(sub2$precip)- 2*sd(sub2$precip)#7.35
sub3<- subset(df, month==3)
v5<- mean(sub3$precip)+ 2*sd(sub3$precip)#51.83
v6<- mean(sub3$precip)- 2*sd(sub3$precip)#48.308
sub4<- subset(df, month==4)
v7<- mean(sub4$precip)+ 2*sd(sub4$precip)#18.69
v8<- mean(sub4$precip)- 2*sd(sub4$precip)#9.39
I have tried:
output<-
df %>%
group_by(month)%>%
summarise(count= sum(precip > (mean(precip)+(2*sd(precip)))&
precip < (mean(precip)-(2*sd(precip)))))

The fix is simple: change the logical AND (&) to OR (|), since no value can be both above the upper bound and below the lower bound.
output<-
df %>%
group_by(month)%>%
summarise(count= sum(precip > (mean(precip)+(2*sd(precip))) |
precip < (mean(precip)-(2*sd(precip)))))
output
# A tibble: 4 x 2
# month count
# <int> <int>
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 1
And to add a base R solution using by (the counterpart to dplyr::group_by())
do.call(rbind,
by(df, df$month, FUN=function(i){
tmp <- i[i$precip < mean(i$precip) - 2*sd(i$precip) |
i$precip > mean(i$precip) + 2*sd(i$precip),]
return(data.frame(month=i$month[[1]], count=nrow(tmp)))
})
)
# month count
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 1
Alternatively, with ave, ifelse, and aggregate:
df$count <- ifelse(df$precip > ave(df$precip, df$month, FUN=function(g) mean(g) + 2*sd(g)) |
df$precip < ave(df$precip, df$month, FUN=function(g) mean(g) - 2*sd(g)), 1, 0)
aggregate(count ~ month, df, FUN=sum)
# month count
# 1 1 1
# 2 2 2
# 3 3 2
# 4 4 1

In base R
tapply(df$precip, df$month, function(a) sum(abs(scale(a)) >= 2))
Output
1 2 3 4
1 2 2 1
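To see why the original & version counts nothing, and how the z-score shortcut relates to the explicit bounds, the conditions can be compared directly. A small base R sketch on the question's data; upper, lower, above, below, z, count_or and count_z are names introduced here.

```r
set.seed(560)
df <- data.frame(month  = rep(1:4, each = 30),
                 precip = c(rnorm(30, 20, 10), rnorm(30, 10, 2),
                            rnorm(30, 50, 1), rnorm(30, 15, 3)))

# Per-month bounds, repeated for every row via ave()
upper <- ave(df$precip, df$month, FUN = function(g) mean(g) + 2 * sd(g))
lower <- ave(df$precip, df$month, FUN = function(g) mean(g) - 2 * sd(g))
above <- df$precip > upper
below <- df$precip < lower

# A value cannot sit in both tails at once, so AND never matches
sum(above & below)

# OR counts values in either tail, per month
count_or <- tapply(above | below, df$month, sum)

# Same idea via z-scores: |z| > 2 means more than 2 sd from the month's mean
z <- ave(df$precip, df$month, FUN = function(g) (g - mean(g)) / sd(g))
count_z <- tapply(abs(z) > 2, df$month, sum)

count_or
count_z
```

The tapply() answer above uses >= rather than >, which only differs if a value lands exactly 2 sd from the mean.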
