How to create bins in R

I have a data frame named cst with columns country, ID, and age. I want to bin age (divide all IDs into deciles or quartiles) separately for each country. I tried this:
cut(cst[!is.na(cst$age), "age"], quantile(cst["age"], probs = seq(0,1,0.1), na.rm = T))
However, this creates bins over the whole data frame, whereas I need them for each country separately.
Could you help me?

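I'd try a dplyr solution, which would look something like this:
library(dplyr)
cst2 <- cst %>%
  group_by(country) %>%
  mutate(
    # include.lowest = TRUE keeps each country's minimum age in the first bin instead of returning NA
    bin = cut(age, quantile(age, probs = seq(0, 1, 0.1), na.rm = TRUE), include.lowest = TRUE)
  ) %>%
  ungroup()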

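Alternatively, subset the data by country before calling cut. This version does not need the dplyr library. Note that the result of cut has to be stored somewhere, otherwise the loop discards it:
cst$bin <- NA_character_
for (c in unique(na.omit(cst$country))) {
  # rows of this country that have a non-missing age
  idx <- which(cst$country == c & !is.na(cst$age))
  breaks <- quantile(cst$age[idx], probs = seq(0, 1, 0.1))
  cst$bin[idx] <- as.character(cut(cst$age[idx], breaks, include.lowest = TRUE))
}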
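Another base R route is ave(), which applies a function within each group and returns the values in the original row order. The sketch below is only an illustration: cst_demo and its ages are made up, and it uses quartiles rather than deciles to keep the toy data small.

```r
# Hypothetical toy data standing in for `cst`
cst_demo <- data.frame(
  country = rep(c("A", "B"), each = 8),
  ID = 1:16,
  age = c(21, 25, 30, 34, 41, 47, 55, NA, 18, 22, 29, 33, 38, 44, 60, 71)
)

# Return the quartile index (1 to 4) of each age within its own group
quartile_index <- function(a) {
  breaks <- quantile(a, probs = seq(0, 1, 0.25), na.rm = TRUE)
  as.integer(cut(a, breaks, include.lowest = TRUE))
}

# ave() splits age by country, applies the function, and reassembles in row order
cst_demo$quartile <- ave(cst_demo$age, cst_demo$country, FUN = quartile_index)
cst_demo
```

Rows with a missing age keep NA as their bin. If a group has many tied ages, quantile() can return duplicate breaks and cut() will then stop with an error, so real data may need unique() on the breaks or a rank-based approach.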

Related

How to change column value?

I am new to R. Thanks in advance for your help.
I created a new data frame by joining 3 data frames, as in the image below.
After merging the data frames, I tried to rename the column names to get the values for the over-70s, but there are still 3 different column names, as below.
How can I merge the values for 70 and over?
You haven't shared the data in a reproducible format nor did you share the code that resulted in the above output but looking at the image here is an attempt which might work for you.
library(dplyr)
result <- df %>%
  # collapse the three labels into one before grouping
  group_by(age = ifelse(age %in% c('70+', '70-79', '80+'), '70+', age)) %>%
  summarise(across(`2020`:`2017`, ~ sum(.x, na.rm = TRUE)))
result
You can write the above in base R as:
aggregate(. ~ age,
          transform(df, age = ifelse(age %in% c('70+', '70-79', '80+'), '70+', age)),
          sum, na.rm = TRUE)
We can also use case_when:
library(dplyr)
out <- df %>%
  group_by(age = case_when(age %in% c("70+", "70-79", "80+") ~ "70+",
                           TRUE ~ age)) %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
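Since the original data was not shared, here is a small self-contained sketch of the same recode-then-sum idea on invented numbers. It swaps in base R's rowsum() instead of dplyr; df_demo and its values are assumptions made purely for illustration.

```r
# Hypothetical data: one row per age label, one column per year
df_demo <- data.frame(
  age = c("60-69", "70+", "70-79", "80+"),
  `2020` = c(10, 1, 2, 3),
  `2019` = c(20, 4, 5, 6),
  check.names = FALSE  # keep the year column names as they are
)

# Collapse the three overlapping labels into a single "70+" label
df_demo$age <- ifelse(df_demo$age %in% c("70+", "70-79", "80+"), "70+", df_demo$age)

# rowsum() adds up the numeric columns within each age label
result_demo <- rowsum(df_demo[c("2020", "2019")], group = df_demo$age, na.rm = TRUE)
result_demo
```

The age labels end up as row names of result_demo rather than as a column, which is one difference from the dplyr versions above.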

Reduce a data frame by combining like rows according to two qualitative factors

I have a dataframe like the following:
observations <- data.frame(
  X = c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001","00KS177011","00KS177011","00P0006","00P006","00P006","00P006"),
  hzdept = c(0,20,0,15,0,13,0,20,0,16,0,6,13,29),
  hzdepb = c(20,30,15,30,13,30,20,30,16,30,6,13,29,30),
  Y = c("Red","White","Red","White","Green","Red","Red","Blue","Black","Black","Red","White","White","White"),
  Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,0.53,0.47,0.2,0.23,0.53,0.04))
I want to be able to reduce this so that anytime X and Y are the same for two rows, the observations are combined i.e.
data.frame(
  X = c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001","00KS177011","00P0006","00P006"),
  hzdept = c(0,20,0,15,0,13,0,20,0,0,6),
  hzdepb = c(20,30,15,30,13,30,20,30,30,6,30),
  Y = c("Red","White","Red","White","Green","Red","Red","Blue","Black","Red","White"),
  Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,1.00,0.20,0.80))
Any suggestions on how to best go about this?
Edit: OK, now that I see from your comment above how hzdept and hzdepb are supposed to be combined:
library(tidyverse)
# sum Z within each X/Y pair
df <- observations %>% count(X, Y, wt = Z, name = "Z")
# smallest hzdept per X/Y pair
df_hzdept <- observations %>%
  arrange(hzdept) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdept)
# largest hzdepb per X/Y pair
df_hzdepb <- observations %>%
  arrange(desc(hzdepb)) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdepb)
df <- df %>%
  left_join(df_hzdept, by = c("X", "Y")) %>%
  left_join(df_hzdepb, by = c("X", "Y"))
Using dplyr
Here is how you would group by two columns and summarize the other columns in a data frame using the minimum, maximum, and sum:
library(magrittr) # For the pipe: %>%
observations %>%
  dplyr::group_by(X, Y) %>%
  dplyr::summarise(hzdept = min(hzdept),
                   hzdepb = max(hzdepb),
                   Z = sum(Z), .groups = 'drop')
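For comparison, the same min/max/sum reduction can be sketched in base R with aggregate() and merge(). The data frame obs_demo below is a shortened, made-up stand-in for observations, so the numbers are illustrative only.

```r
# Hypothetical, shortened stand-in for `observations`
obs_demo <- data.frame(
  X = c("a", "a", "b", "b", "b"),
  Y = c("Red", "Red", "Red", "White", "White"),
  hzdept = c(0, 16, 0, 6, 13),
  hzdepb = c(16, 30, 6, 13, 30),
  Z = c(0.53, 0.47, 0.2, 0.3, 0.5)
)

# One aggregate() call per summary, each grouped by X and Y
top    <- aggregate(hzdept ~ X + Y, data = obs_demo, FUN = min)
bottom <- aggregate(hzdepb ~ X + Y, data = obs_demo, FUN = max)
total  <- aggregate(Z ~ X + Y, data = obs_demo, FUN = sum)

# Join the three pieces back together on the grouping columns
combined <- Reduce(function(a, b) merge(a, b, by = c("X", "Y")),
                   list(top, bottom, total))
combined
```

This takes more steps than the single summarise() call, which is the main reason the dplyr version tends to be preferred when several different summaries are needed at once.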

Apply a function within list-column to another column (compare to reference ecdf by group)

I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
library(dplyr) # also provides tibble()
df <- tibble(site = rep(letters[1:4], length.out = 2000),
             trt = rep(c(0, 1), each = 1000),
             value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
  filter(trt == 0) %>%
  group_by(site) %>%
  summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
  filter(trt == 1) %>%
  inner_join(baseline, by = "site")
# what would be next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl :
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~ .x(.y)))
Or mapply in base R:
trtQuantile$out <- mapply(function(x, y) x(y), trtQuantile$ecdf0, trtQuantile$value)
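To see the whole pipeline end to end without the tidyverse, here is a self-contained base R sketch of the same idea: build one ecdf per site from the baseline rows, then evaluate each treated observation against its own site's ecdf. The data frame demo, the seed, and the two-site layout are made up for illustration.

```r
set.seed(42)
# Hypothetical data: two sites, first 50 rows are baseline (trt == 0)
demo <- data.frame(
  site = rep(c("a", "b"), times = 50),
  trt = rep(c(0, 1), each = 50),
  value = rnorm(100),
  stringsAsFactors = FALSE
)

# One ecdf function per site, built from the baseline observations only
is_base <- demo$trt == 0
baseline_ecdf <- lapply(split(demo$value[is_base], demo$site[is_base]), ecdf)

# Evaluate every treated value with the ecdf of its own site
treated <- demo[demo$trt == 1, ]
treated$out <- unname(mapply(function(f, v) f(v),
                             baseline_ecdf[treated$site],
                             treated$value))
head(treated)
```

Indexing the named list with treated$site repeats each site's ecdf once per treated row, which plays the same role as the join onto the list-column in the tidyverse version.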

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
data snapshot
so for student 3, systolic needs two NAs replaced. I used the min and max values for each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(
    systolic = round(sample(runif(1000, 125, 130), 2), 0),
    diastolic = round(sample(runif(1000, 85, 85), 3), 0),
    heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
    phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, it works only when a single NA needs replacing (it successfully replaced the systolic NA values in that case). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
Here is a solution that makes things a little more automated, though it may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                               size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
Next, a function that replaces NA values in a given variable by sampling uniformly between that variable's min and max within each group (here make). It is written with current dplyr idioms, since group_by(.dots = ), mutate_at() and funs() are deprecated or superseded:
replace_na_sample <- function(df_miss, var, group = "make") {
  df_miss %>%
    group_by(across(all_of(group))) %>%
    # draw one candidate per row from the group's observed range; use it only where the value is NA
    mutate({{ var }} := ifelse(
      is.na({{ var }}),
      round(runif(n(), min({{ var }}, na.rm = TRUE), max({{ var }}, na.rm = TRUE)), 0),
      {{ var }})) %>%
    ungroup()
}
Example replacing several missing values in multiple columns:
mtcars_replaced <- mtcars_miss %>%
  replace_na_sample(cyl, group = "make") %>%
  replace_na_sample(disp, group = "make") %>%
  replace_na_sample(hp, group = "make")
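The error in the question comes from replace_na() expecting a single replacement value per column. A simpler way around it is to assign directly into the NA positions, drawing as many random values as there are gaps. The sketch below does this per student in base R; exercise_demo, its values, and the helper fill_na_by_range() are invented for illustration and are not part of the original data.

```r
set.seed(123)
# Hypothetical stand-in for the `exercise` data
exercise_demo <- data.frame(
  student_id = c(3, 3, 3, 3, 4, 4, 4),
  systolic = c(125, NA, 130, NA, 118, NA, 121)
)

# Fill NAs with random whole numbers between the observed min and max
fill_na_by_range <- function(x) {
  missing <- is.na(x)
  # only fill when there is at least one gap and at least one observed value
  if (any(missing) && any(!missing)) {
    x[missing] <- round(runif(sum(missing),
                              min(x, na.rm = TRUE),
                              max(x, na.rm = TRUE)))
  }
  x
}

# ave() applies the helper separately within each student_id
exercise_demo$systolic <- ave(exercise_demo$systolic,
                              exercise_demo$student_id,
                              FUN = fill_na_by_range)
exercise_demo
```

The same call can be repeated for diastolic, heart_rate, and phys_score. A student whose values are all NA is left unchanged, because there is no observed range to sample from.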

Summarize each category of rows in one column using R

I'm wondering if this is something possible in R:
I have 2 columns. Column A (primaryhistory2.DEPT) has a bunch of categorical data, column B (primaryhistry2.ACT.ENROLL) has numbers and NAs.
I want to get a summary of column B for each category in column A.
Something like, for "NUT" in column A, I want to see min, max, mean, median, NAs, etc. And I would like to see this for every category. Like when you use summary() command.
Not sure if this is possible.. Thank you all in advance!
@Moody_Mudskipper
The results are what I'm looking for, but without column names they are hard to read.
Also, the base R version is not counting NAs, and I do see a lot of NAs in my file.
This is very possible using the dplyr library:
library(dplyr)
most.of.the.answer = df %>%
  group_by(primaryhistory2.DEPT) %>%
  summarise(min = min(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
            max = max(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
            mean = mean(primaryhistry2.ACT.ENROLL, na.rm = TRUE),
            median = median(primaryhistry2.ACT.ENROLL, na.rm = TRUE))
(assuming your data frame is called df)
For counting NAs, try dplyr's filter function. Note that categories without any NA will not appear in this result:
count.NAs = df %>%
  filter(is.na(primaryhistry2.ACT.ENROLL)) %>%
  group_by(primaryhistory2.DEPT) %>%
  summarise(count.NA = n())
I'll leave it to you to merge the two data frames.
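With base R you can do this. The formula interface of aggregate drops rows containing NA by default, so na.action = na.pass is needed for the NA count to work:
temp <- aggregate(primaryhistory2..ACT.ENROLL ~ primaryhistory2.DEPT, df,
                  function(x) c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE),
                                min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE),
                                nas = sum(is.na(x))),
                  na.action = na.pass)
res <- cbind(temp[1], temp[[2]])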
If you want to use summary:
summary1 <- sapply(unique(df$primaryhistory2.DEPT),function(x) summary(subset(df,primaryhistory2.DEPT == x)$primaryhistory2..ACT.ENROLL))
colnames(summary1) <- unique(df$primaryhistory2.DEPT)
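If the goal is one readable table with named columns and an NA count per category, another option is to split the numeric column by category and build one row of statistics per group. The sketch below is self-contained; df_demo and the short column names dept and enroll are made up, standing in for the real columns.

```r
# Hypothetical stand-in for the real data
df_demo <- data.frame(
  dept = c("NUT", "NUT", "NUT", "BIO", "BIO"),
  enroll = c(10, NA, 30, 5, NA)
)

# One row of named statistics per category, including the NA count
per_group <- lapply(split(df_demo$enroll, df_demo$dept), function(x) {
  data.frame(
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    n_na = sum(is.na(x))
  )
})

# Stack the rows; the category names become the row names
stats <- do.call(rbind, per_group)
stats
```

Because the NA count is computed in the same pass as the other statistics, categories with zero NAs still get a row, with n_na equal to 0.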

Resources