This question already has answers here:
split a vector by percentile
(5 answers)
Closed 6 years ago.
Is it possible to bin a variable into quintiles (fifths) using R, and then select only the values that fall in the 5th bin?
As of now I am using the closest option, the upper quartile (.75), because I could not find a function for quintiles.
Any suggestions, please?
Not completely sure what you mean, but this divides a dataset into 5 equal groups based on value and subsequently selects the fifth group
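obs = rnorm(100)  # example data
qq = quantile(obs, probs = seq(0, 1, .2))  # cut points at 0%, 20%, ..., 100%
obs[obs >= qq[5]]  # qq[5] is the 80% quantile, so this keeps the top fifth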
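If you also want the bin number for every value, and not just the top group, one option is to pass the quantiles to cut as breaks. This is a minimal sketch; the seed and the names bin and top_fifth are only for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
obs <- rnorm(100)
qq <- quantile(obs, probs = seq(0, 1, 0.2))

# Label each value with its quintile (1 = lowest fifth, 5 = highest fifth).
# include.lowest = TRUE keeps the minimum, which sits exactly on the first break.
bin <- cut(obs, breaks = qq, include.lowest = TRUE, labels = FALSE)

top_fifth <- obs[bin == 5]  # values in the 5th bin
table(bin)                  # number of values per bin
```

With tied values or very small samples the bins can come out uneven, so it is worth checking table(bin) on real data.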
This question already has answers here:
Calculate the mean by group
(9 answers)
Mean per group in a data.frame [duplicate]
(8 answers)
How to calculate mean of all columns, by group?
(6 answers)
Closed 1 year ago.
I have some fish catch data. Each row contains a species name, a catch value (cpue), and some other unrelated identifying fields (year, location, depth, etc). This code will produce a dataset with the correct structure:
# a sample dataset
set.seed(1337)
fish = rbind(
  data.frame(
    spp = "Flounder",
    cpue = rnorm(5, 5, 2)
  ),
  data.frame(
    spp = "Bass",
    cpue = rnorm(5, 15, 1)
  ),
  data.frame(
    spp = "Cod",
    cpue = rnorm(5, 2, 4)
  )
)
I'm trying to create a normalized cpue column cpue_norm. To do this, I apply the following function to each cpue value:
cpue_norm = (cpue - cpue_mean)/cpue_std
Here cpue_mean and cpue_std are, respectively, the mean and standard deviation of cpue. The caveat is that I need to do this per species, i.e. when I calculate cpue_norm for a particular row, I need to calculate cpue_mean and cpue_std using cpue from only that species.
The trouble is that all of the species are in the same dataset. So for each row, I need to calculate the mean and standard deviation of cpue for that species and then use those values to calculate cpue_norm.
I've been able to make some headway with tapply:
calc_cpue_norm = function(l) {
  return((l - mean(l))/sd(l))
}
tapply(fish$cpue, fish$spp, calc_cpue_norm)
but I end up with lists when I need to be adding these values to the dataframe rows instead.
Anyone who knows R better than me have some wisdom to share?
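One possible way to get the per-species values back onto the rows is base R's ave, which applies a function within each group and returns a vector in the original row order. The sketch below rebuilds the sample data from the question in compact form and reuses calc_cpue_norm; it is one approach under those assumptions, not the only one.

```r
# Sample data as in the question
set.seed(1337)
fish <- rbind(
  data.frame(spp = "Flounder", cpue = rnorm(5, 5, 2)),
  data.frame(spp = "Bass", cpue = rnorm(5, 15, 1)),
  data.frame(spp = "Cod", cpue = rnorm(5, 2, 4))
)

# Standardise a vector: subtract its mean, divide by its standard deviation
calc_cpue_norm <- function(l) (l - mean(l)) / sd(l)

# ave() applies FUN to the cpue values of each species separately and
# returns the results aligned with the original rows
fish$cpue_norm <- ave(fish$cpue, fish$spp, FUN = calc_cpue_norm)
```

Unlike tapply, the result has one value per row, so it can be assigned straight to a new column.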
This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
Closed 2 years ago.
I have data about the population of every state over time. For each row in this dataframe, I want to add an avg_pop column that is the average population of that state over all time periods. How can I achieve that in R?
Example:
st1 10
st1 20
should become
st1 10 15
st1 20 15
Because the average across st1 is 15.
I tried this but it does not work because the size of the dataframes is different:
averages = aggregate(data, list(data$state_name), mean, na.rm=T)
data$avg_pop = subset(averages, state_name==data$state_name)$stpop
If we want to add the group mean as a column, use ave, which returns a vector the same length as its input:
data$avg_pop <- with(data, ave(stpop, state_name))
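As a small worked example of the ave approach, here is a sketch on made-up data. The column names state_name and stpop are taken from the question's code; the st2 rows and the avg_pop_narm column are invented for illustration.

```r
# Made-up data: two states, two time periods each
data <- data.frame(
  state_name = c("st1", "st1", "st2", "st2"),
  stpop      = c(10, 20, 30, 50)
)

# ave() defaults to FUN = mean and repeats each group's mean on every row
data$avg_pop <- with(data, ave(stpop, state_name))

# If stpop can contain NA, pass a function that drops missing values
data$avg_pop_narm <- with(
  data,
  ave(stpop, state_name, FUN = function(x) mean(x, na.rm = TRUE))
)
```

Here both st1 rows get 15 and both st2 rows get 40.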
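To calculate a new column with dplyr, group by state first and then use mutate, so that the mean is taken within each state and not over the whole table:
library(dplyr)
data <- data %>%
  group_by(state_name) %>%
  mutate(avg_pop = mean(stpop, na.rm = TRUE))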
This question already has answers here:
Categorize numeric variable into group/ bins/ breaks
(4 answers)
Closed 2 years ago.
I am struggling to make a barplot with two variables in R. One variable has data ranging from 0-90, and I need to split it up into 3 groups-- the data that is <5, 5-10, and >10. So that there are only 3 bars in the plot instead of 90. Here is the code I have tried to use but I can't figure out how to get this to work. The problem is in the use of the <,>, and - signs.
First I created a new variable
SVLivedPlot <- SDreal2$SVLived
And then I am trying to group all the numbers that are under 5 to be the value of 1, 5-10 to be the value of 2, and greater than 10 to be 3.
SVLivedPlot[SDreal2$SVLived == c(<5)] <- 1
SVLivedPlot[SDreal2$SVLived == c(5-10)] <- 2
SVLivedPlot[SDreal2$SVLived == c(>90)] <-3
Once I get those values changed I will use the following code to save that new variable with the correct groupings as the variable I will use in my barplot
DataFrameName$OldVariableName <- NewVariableName
Once I can get this new variable created I know how to put it in the barplot() formula to get the plot. I just need to know how to group those data! Any help would be great! Thank you!:)
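We can use cut. With the default right = TRUE the intervals are closed on the right, so the groups here are (-Inf, 5], (5, 10] and (10, 90]:
SDreal2$NewVar <- with(SDreal2, as.integer(cut(SVLived,
                       breaks = c(-Inf, 5, 10, 90))))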
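To see where boundary values such as 5 and 10 end up, here is a small sketch on made-up numbers. It compares the default right-closed intervals with right = FALSE; the vector and the names grp_default, grp_left and counts are illustrative only.

```r
SVLived <- c(0, 4, 5, 7, 10, 11, 90)  # made-up values

# Default (right = TRUE): (-Inf, 5], (5, 10], (10, 90]
grp_default <- as.integer(cut(SVLived, breaks = c(-Inf, 5, 10, 90)))

# right = FALSE: [-Inf, 5), [5, 10), [10, Inf), so 5 moves into group 2
grp_left <- as.integer(cut(SVLived, breaks = c(-Inf, 5, 10, Inf), right = FALSE))

# Counts per group, which is what a three-bar barplot needs
counts <- table(grp_left)
```

Passing counts to barplot() then draws one bar per group. Which variant is right depends on whether 5 and 10 belong to the lower or the upper group.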
This question already has answers here:
Replace all values lower than threshold in R
(4 answers)
Closed 3 years ago.
I have some values such as -77777 which denote a special type of missing info in my dataset. I want to replace these with the smallest value or the largest value in their own column. Say I'm working with dataset HLDE and column is RTLM.
HLDE <- data.frame(RTLM = c(0:9, -77777))
This is not a duplicate! The so-called duplicate has no resemblance.
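Use conditional assignment with max or min. Set na.rm=TRUE so that NA values in the column do not make the result NA. Note that for min you must leave the -77777 rows out of the calculation, otherwise the sentinel itself is the minimum.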
HLDE[HLDE$RTLM == -77777, "RTLM"] <- max(HLDE$RTLM, na.rm=TRUE)
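For the smallest-value case, a minimal sketch that leaves the sentinel out of the calculation could look like this. The helper name is_special is made up for illustration.

```r
HLDE <- data.frame(RTLM = c(0:9, -77777))

# %in% returns FALSE (never NA) for missing values, so the index is safe
# to use for assignment even if the column contains NA
is_special <- HLDE$RTLM %in% -77777

# Smallest genuine value, computed without the sentinel rows
HLDE$RTLM[is_special] <- min(HLDE$RTLM[!is_special], na.rm = TRUE)
```

Swapping min for max gives the largest-value replacement in the same way.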
This question already has answers here:
Replace missing values with column mean
(14 answers)
Closed 6 years ago.
I have a dataset named malt, where one of the columns is named ka. I want to replace the NA values in the ka column with the mean of malt$ka and leave the other values as they are. I tried to do this with ifelse:
malt$ka <- ifelse(malt$ka=="NA", mean(malt$ka), "malt$AcqCostPercust")
This does not seem to work, and I am confused about how to replace the NA values.
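Test for missing values with is.na(); the comparison malt$ka == "NA" checks against the string "NA" and never evaluates to TRUE for a missing value.
x <- mean(malt$ka, na.rm = TRUE)  # mean of the non-missing values
malt$ka <- ifelse(is.na(malt$ka), x, malt$ka)
Or, more directly:
malt$ka[is.na(malt$ka)] <- mean(malt$ka, na.rm = TRUE)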
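A quick way to check the result is to run the replacement on a tiny made-up version of the data. The values below are invented; only the names malt and ka come from the question.

```r
malt <- data.frame(ka = c(1, NA, 3, NA, 8))  # made-up values

# Mean of the non-missing values: (1 + 3 + 8) / 3 = 4
ka_mean <- mean(malt$ka, na.rm = TRUE)

# Overwrite only the missing entries; everything else stays untouched
malt$ka[is.na(malt$ka)] <- ka_mean
```

Afterwards malt$ka is 1, 4, 3, 4, 8.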