How to plot a histogram from a group? - r

I am using RStudio. I have a dataset with about 1000k records; among its columns are FINAL_CLASSIFICATION and AGE. The FINAL_RANKING column holds values from 1 to 7: rows with 1, 2 or 3 are infected with SARS_COVID, while rows with 4, 5, 6 or 7 are healthy. I need to make a histogram of the ages of those who are infected. As I understand it, I must form a group of the ages whose CLASIFICACION_FINAL value is 1, 2 or 3, which will be the ages of the infected people, and then build the histogram from that group, but I cannot find a way to create the group.
Could you help me?
I have the following code
#1)
# import the data into R
# RECOMMENDATION: use read_csv
covid_dataset <- read_csv("Desktop/Course in R/Examples/covid_dataset.csv")
View(covid_dataset)
#------------------------------------------------------------------------------------------
#2) Extract a random sample of 100k records and assign it into a new variable. From now on work with this dataset
# HINT: use dplyr's sample_n function
sample <- sample_n(covid_dataset, 100000)
# With the function sample_n what we get is a syntax sample_n(x,n) where we have that
#x will be our dataset from where we want to extract the sample and n is the sample size
#that we want
nrow(sample)
#with this function we can corroborate that we have extracted a 100K sample.
#------------------------------------------------------------------------------------------
#3)Make a statistical summary of the dataset and also show the data types by column.
summary(sample)
#The summary function is the one that gives us the summary statistics.
map(sample, class)
#The map() function gives us the data type of each column, and we can see that
#most of the columns are numeric.
#-------------------------------------------------------------------------------------------
#4)Filter the rows that are positive for SARS-COVID and calculate the number of records.
## Positive cases are those that in the FINAL_CLASSIFICATION column have 1, 2 or 3.
## To filter the rows, we will make use of the PIPE operator and the select function of dplyr.
#This will help us to select the column and to be able to filter the rows where
#the FINAL_CLASSIFICATION column is 1, 2 or 3, i.e. SARS-COVID positive results.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 1
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 2
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 3
# I do them separately to have a better view of the records.
#Now if we want to get them all together we simply do the following
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3)
#This gives us the rows less than or equal to 3, which is the same as the
#rows where the FINAL_RANKING column has 1, 2 or 3.
#Now, if we want the number of records, doing it separately, we simply add
#another PIPE operator with the nrow() function, which gives the number of
#rows for each value.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1) %>% nrow()
#gives us a result of 1471
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2) %>% nrow()
#gives us a result of 46
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3) %>% nrow()
#Gives us a result of 37703
#If we add the 3 results, we have that the total number of records is
1471+46+37703
#Which gives us 39220
#But it can be simplified by doing it in a straightforward way as follows
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3) %>% nrow()
#And we notice that we get the same result as the previous code.
#In conclusion, we have a total of 39220 positive SARS-COVID cases.
#---------------------------------------------------------------------------------------------
#5)Count the number of null records per column (HINT: Use sapply or map, and is.na)
apply(sample, MARGIN = 2, function(x) sum(is.na(x)))
#This shows us the number of NA's per column. We notice that the only column
#that has NA's is the DATE_DEF with a total of 95044, this tells us that out of the
#100K data, only approximately 5k data are known for DATE_DEF.
#------------------------------------------------------------------------------------------
#6)
##a)Calculate the mean age of covid infectees.
##b)Make a histogram of the ages of the infected persons.
##c)Make a density plot of the ages of the infected persons
sample %>%
group_by(FINAL_CLASSIFICATION <= 3) %>%
summarise(average = mean(AGE))
#Then the total average number of infected is 43.9
#Now we make a histogram of the ages of the infected persons
sample %>% group_by(FINAL_CLASSIFICATION <=3, AGE) %>% summarise(count = n())
It is the last part where I have doubts. I want to find the average age of the infected people; I used the code above with group_by, but I don't know whether it is correct. My other doubts concern the remaining two items in #6: how to build the histogram and the density plot.

What I gathered is that you wish to 1. create a variable 'FINAL_CLASSIFICATION' based on values of 'FINAL_RANKING,' 2. summarize the average age of groups in FINAL_CLASSIFICATION, and 3. create a histogram of the positive cases in FINAL_CLASSIFICATION
I created a random sample of 100 cases with random assumptions of AGE and FINAL_RANKING
library(dplyr)
library(ggplot2)
sample <- tibble(FINAL_RANKING = sample(1:7, 100, replace = TRUE), AGE = sample(10:100, 100, replace = TRUE))
sample <- sample %>%
  mutate(
    FINAL_CLASSIFICATION = case_when(
      FINAL_RANKING %in% 1:3 ~ "SARS_COVID_POSITIVE",
      FINAL_RANKING %in% 4:7 ~ "SARS_COVID_NEGATIVE")
  )
sample %>%
  group_by(FINAL_CLASSIFICATION) %>%
  summarize(average_age = mean(AGE))
sample %>%
  filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
  ggplot(., aes(x = AGE)) +
  geom_histogram()
Gives summary output:
# A tibble: 2 x 2
  FINAL_CLASSIFICATION average_age
  <chr>                      <dbl>
1 SARS_COVID_NEGATIVE         51.8
2 SARS_COVID_POSITIVE         58.6
and plot:
As the message printed by geom_histogram() notes, you should adjust the bins, for example through its bins or binwidth argument.
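If the 1 to 7 codes sit directly in FINAL_CLASSIFICATION, as in the question's own code, the three items in #6 can also be sketched in base R without creating a new variable. This is a minimal sketch on simulated data; the data frame name covid, the 5-year bin width and all values are assumptions made for illustration.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(1)
covid <- data.frame(
  FINAL_CLASSIFICATION = sample(1:7, 1000, replace = TRUE),
  AGE = sample(0:100, 1000, replace = TRUE)
)

# Keep only the rows coded 1, 2 or 3 (positive cases)
infected <- covid[covid$FINAL_CLASSIFICATION %in% 1:3, ]

# a) mean age of the infected
mean_age <- mean(infected$AGE, na.rm = TRUE)

# b) histogram of the ages, with explicit 5-year bins
age_hist <- hist(infected$AGE, breaks = seq(0, 105, by = 5), right = FALSE,
                 main = "Age of positive cases", xlab = "Age")

# c) density plot of the ages
age_dens <- density(infected$AGE, na.rm = TRUE)
plot(age_dens, main = "Age density of positive cases", xlab = "Age")
```

Filtering first, rather than grouping by FINAL_CLASSIFICATION <= 3 and AGE, keeps one row per person, which is what both hist() and density() expect.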

Related

Perform mathematical operation on column values exceeding specific condition in R

Data cleaning question: I have a column in a dataframe that has survey responses to weight in both pounds and kilograms. I need to convert all of the kilogram values of the column to pounds and still keep my dataframe. The kilogram values are preceded by a 9 (they are coded as follows: for 95kg, the number is 9095). So, basically, I have to subtract 9000 from each value and multiply it by 2.20462. I am comparing people in a large dataset by whether or not they have arthritis. Here is the problem I am trying to solve and then the code:
Q3: Compare only those who have and have not had some form of arthritis, rheumatoid arthritis, gout, etc. For those groupings, convert reported weight in kilograms to pounds. Then, compute the mean and standard deviation of the newly created weight in pounds variable. Use the conversion 1KG = 2.20462 LBS. Make sure the units are in pounds, not two decimals implied. The names of the variables should be mean_weight and sd_weight. mean_weight should equal 183.04. The output for this should be a tibble/dataframe/table with two rows (one for the haves and one for the have-nots) and two columns (mean_weight and sd_weight). So four values all together:
mean_weight sd_weight
183.04 xx.xx
xxx.xx xx.xx
Okay. The code so far:
library(tidyverse)
library(lm.beta)
brfss <- read.csv("BRFSS2015.csv")
arthritis <- brfss %>%
  select(HAVARTH3, WEIGHT2) %>%
  filter(HAVARTH3 == '1') %>%
  filter(WEIGHT2 != 7777) %>%
  filter(WEIGHT2 != 9999)
# Parentheses are needed so that 9000 is subtracted before multiplying
kg2lb1 <- ifelse(arthritis$WEIGHT2 > 9000, (arthritis$WEIGHT2 - 9000)*2.20462,
                 arthritis$WEIGHT2)
no_arthritis <- brfss %>%
  select(HAVARTH3, WEIGHT2) %>%
  filter(HAVARTH3 == '2') %>%
  filter(WEIGHT2 != 7777) %>%
  filter(WEIGHT2 != 9999)
kg2lb2 <- ifelse(no_arthritis$WEIGHT2 > 9000, (no_arthritis$WEIGHT2 - 9000)*2.20462,
                 no_arthritis$WEIGHT2)
When I use the ifelse function, it converts it into a huge numeric list. I need to retain the data frame format without writing a whole bunch more code.
Write a conversion function,
kg_to_lb <- \(x) x - 9000*2.20462
apply it in transform to add a new variable with converted values
mtcars <- transform(mtcars, mpg_lb=kg_to_lb(mpg))
and aggregate it by a group (here I use the binary am group of the mtcars data set that comes with R) and applying a FUNction.
aggregate(mpg_lb ~ am, mtcars, FUN=\(x) c(mean=mean(x), sd=sd(x)))
# am mpg_lb.mean mpg_lb.sd
# 1 0 -19824.432632 3.833966
# 2 1 -19817.187692 6.166504
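Note that in `x - 9000*2.20462` the multiplication binds first, so 19841.58 is subtracted from every value, which is why the means above are large negative numbers. A minimal sketch of the same transform/aggregate idea applied to the question's coding, with the subtraction parenthesized and only the kilogram-coded values converted, might look like this. The toy data frame brfss_toy and the helper name kg_code_to_lb are hypothetical; HAVARTH3, WEIGHT2 and the codes 7777 and 9999 come from the question.

```r
# Toy stand-in for the two BRFSS columns (values are made up)
brfss_toy <- data.frame(
  HAVARTH3 = c(1, 1, 2, 2, 1, 2),
  WEIGHT2  = c(150, 9095, 200, 9070, 7777, 9999)
)

# Drop the special codes that the question filters out
brfss_toy <- subset(brfss_toy, !WEIGHT2 %in% c(7777, 9999))

# Convert only the kilogram-coded values; pound values pass through unchanged
kg_code_to_lb <- function(x) ifelse(x > 9000, (x - 9000) * 2.20462, x)
brfss_toy <- transform(brfss_toy, weight_lb = kg_code_to_lb(WEIGHT2))

# Mean and sd per arthritis group; the result is still a data frame
result <- aggregate(weight_lb ~ HAVARTH3, brfss_toy,
                    FUN = function(x) c(mean_weight = mean(x), sd_weight = sd(x)))
```

Because the conversion happens inside transform(), the new column stays attached to the data frame instead of becoming a separate numeric vector.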

How to divide a data frame into groups of a predefined size while keeping each category of a variable represented in each group

I am having trouble doing cross-validation for a hierarchical dataset. There is a level 2 factor ("ID") that needs to be equally represented in each subset. For this dataset, there are 157 rows and 28 IDs. I want to divide my data up into five subsets, each containing 31 rows, where each of the 28 IDs is represented (a stand can be repeated within a subset).
I have gotten as far as:
library(dplyr)
df %>%
group_by(ID) %>%
and have no clue where to take it from there. Any help is appreciated!
Here's what I'd do: assign one row from each ID randomly to each of the 5 subsets, and then distribute the leftovers fully randomly. Without sample data this is untested, but it should at least get you on the right track.
df %>%
  group_by(ID) %>%
  mutate(
    random_rank = rank(runif(n())),
    strata = ifelse(random_rank <= 5, random_rank, sample(1:5, size = n(), replace = TRUE))
  ) %>%
  select(-random_rank) %>%
  ungroup()
This should create a strata column as described above. If you want to split the data into a list of data frames for each strata, ... %>% group_by(strata) %>% group_split().
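Since the answer above is untested, here is a small base R sketch of the same idea that can be run on toy data. The data frame df_toy and the assumption that every ID has at least five rows are made up for illustration; with fewer than five rows an ID cannot appear in every subset.

```r
set.seed(42)

# Toy data: 28 IDs and 157 rows, with at least 5 rows per ID (an assumption)
df_toy <- data.frame(
  ID = c(rep(1:28, each = 5), sample(1:28, 17, replace = TRUE)),
  y  = rnorm(157)
)

# Random rank of each row within its ID
random_rank <- ave(runif(nrow(df_toy)), df_toy$ID, FUN = rank)

# Ranks 1 to 5 go to subsets 1 to 5; leftover rows are assigned at random
df_toy$strata <- ifelse(random_rank <= 5,
                        random_rank,
                        sample(1:5, nrow(df_toy), replace = TRUE))

# Rows per ID (rows) and subset (columns); every cell should be at least 1
tab <- table(df_toy$ID, df_toy$strata)
```

The subsets will be close in size but not exactly equal, because the leftover rows are distributed at random.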

Down-Sample data in R to a given distribution

I have a dataset including 60 predictors and a dependent variable which indicates whether a purchase has taken place and how much was spent. The conversion rate in my data is 3.5% and I want to downsample it to 2.5% by excluding records with a purchase. The original distributions should be preserved.
Thank you for your help!
bjoern.
First, some simpler data (2 columns instead of 60) with 3.5% TRUE values in column b:
library(tidyverse)
n <- 10000
df <- data.frame(
a = rnorm(n)) %>%
mutate(b = row_number() <= .035*n)
df %>%
summarize(mean(b))
mean(b)
1 0.035
One way to downsample would be to rbind all of the rows where b is FALSE with a sample of the TRUE rows reduced by a target amount via sample_frac:
df2 <- rbind(
df %>% filter(!b),
df %>% filter(b) %>% sample_frac(.025/.035)
)
df2 %>%
summarize(mean(b))
mean(b)
1 0.02525253
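You will not get exactly 2.5% this way: removing TRUE rows also shrinks the total, so keeping .025/.035 of them leaves 250 of 9,900 rows (about 2.53%). In addition, we can only sample in whole numbers, so the result depends on the original size of your data.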
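To land closer to the target, the number of TRUE rows to keep can be solved from k / (n_false + k) = target, which gives k = target * n_false / (1 - target). A minimal base R sketch under that assumption follows; the names target, n_false and n_keep are illustrative, and the data mirror the simplified example above.

```r
set.seed(123)
n  <- 10000
df <- data.frame(a = rnorm(n), b = seq_len(n) <= 0.035 * n)

target  <- 0.025
n_false <- sum(!df$b)

# Number of TRUE rows to keep so that they make up `target` of the result
n_keep <- round(target * n_false / (1 - target))

# Keep every FALSE row plus a random subset of the TRUE rows
keep_true <- sample(which(df$b), n_keep)
df2 <- df[c(which(!df$b), keep_true), ]

mean(df2$b)  # about 0.02496 with these sizes
```

Rounding to a whole number of rows still leaves a small gap from 2.5%, but it is much smaller than with the .025/.035 fraction.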

How do I compare group means to individual observations and make a new TRUE/FALSE column?

I am new to R and this is my first post on SO - so please bear with me.
I am trying to identify outliers in my dataset. I have two data.frames:
(1 - original data set, 192 rows): observations and their value (AvgConc)
(2 - created with dplyr, 24 rows): Group averages from the original data set, along with quantiles, minimum, and maximum values
I want to create a new column within the original data set that gives TRUE/FALSE based on whether (AvgConc) is greater than the maximum or less than the minimum I have calculated in the second data.frame. How do I go about doing this?
Failed attempt:
Outliers <- Original.Data %>%
group_by(Status, Stim, Treatment) %>%
mutate(Outlier = Original.Data$AvgConc > Quantiles.Data$Maximum | Original.Data$AvgConc < Quantiles.Data$Minimum) %>%
as.data.frame()
Error: Column Outlier must be length 8 (the group size) or one, not 192
Here, instead of referring to Quantiles.Data$ inside mutate, we join Quantiles.Data to 'Original.Data' by 'Status', 'Stim' and 'Treatment', so that each row carries the Maximum and Minimum of its own group:
library(dplyr)
Original.Data %>%
  inner_join(Quantiles.Data %>%
               select(Status, Stim, Treatment, Maximum, Minimum),
             by = c("Status", "Stim", "Treatment")) %>%
  group_by(Status, Stim, Treatment) %>%
  mutate(Outlier = (AvgConc > Maximum) | (AvgConc < Minimum)) %>%
  as.data.frame()
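The join-then-compare pattern can be checked on a small made-up example. This sketch uses base R's merge() in place of inner_join() and only two grouping columns; the data frames orig and limits and all their values are hypothetical.

```r
# Toy observations (values are made up)
orig <- data.frame(
  Status  = c("A", "A", "A", "B", "B", "B"),
  Stim    = c("x", "x", "y", "x", "y", "y"),
  AvgConc = c(1, 5, 12, 2, 3, 0.5)
)

# Toy per-group limits, one row per Status/Stim combination
limits <- data.frame(
  Status  = c("A", "A", "B", "B"),
  Stim    = c("x", "y", "x", "y"),
  Minimum = c(2, 2, 1, 1),
  Maximum = c(10, 10, 4, 4)
)

# Attach each group's limits to its rows, then compare row by row
merged <- merge(orig, limits, by = c("Status", "Stim"))
merged$Outlier <- merged$AvgConc > merged$Maximum |
  merged$AvgConc < merged$Minimum
```

After the join, both sides of each comparison have one value per row, so the length mismatch from the failed attempt cannot occur.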

Summing across rows conditional on groups with dplyr using select, group_by, and mutate

Problem: I'm making an aggregate market share variable in a car market with 286 distinct models sold and a total of 501 cars sold. This group share is based only on the car characteristics cat = "compact", "midsize", "large" and yr = 77, 78, 79, 80, 81, and on the share s, a small double variable; this gives a total of 15 groups in the market.
Closest answer I've found: by mishabalyasin on community.rstudio: "Calculating rowwise totals and proportions using tidyeval?" link to post on community.rstudio.
Applying the principle of select-split-combine, the closest I've come to the correct answer is a table of the 15 groups (15 x 3 (cat, yr, s)):
df<- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(s))
#in my actual data, this is what fills in the group share to get what I want, but this isn't the desired pipeline-based answer
blp$group_share=0 #initializing the group_share, the 50th col
for(i in 1:501){
for(j in 1:15){
if((blp[i,31]==df[j,1])&&(blp[i,3]==df[j,2])){ #if(sameCat & sameYr){blpGS=dfGS}
blp[i,50]=df[j,3]
}
}
}
This is great, but I know this can be done in one fell swoop... Hopefully, the idea is clear from what I've described above. A simple fix may be a loop and set by conditions on cat and yr, and that'd help, but I really am trying to get better at data wrangling with dplyr, so, any insight along that line to get the pipelining answer would be wonderful.
Example for the site: This example below doesn't work with the code I provided, but this is the "look" of my data. There is a problem with the share being a factor.
#45 obs, 3 cats, 5 yrs
cat=c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr=c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s=c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
blp=as.data.frame(cbind(unlist(lapply(cat,as.character,stringsAsFactors=FALSE)),as.numeric(yr),unlist(as.numeric(s))))
names(blp)<-c("cat","yr","s")
head(blp)
#note: one example of a group share would be summing the share from
(group_share.blp.large.81.s=(blp[cat== "large" &yr==81,]))
#works thanks to akrun: applying the code I provided for what leads to the 15 groups
df <- blp %>%
select(cat,yr,s) %>%
group_by(cat,yr) %>%
summarise(group_share = sum(as.numeric(as.character(s))))
#manually filling doesn't work, but this is what I'd want if I didn't want pipelining
blp$group_share=0
for(i in 1:45){
if( ((blp[i,1])==(df[j,1])) && (as.numeric(blp[i,2])==as.numeric(df[j,2]))){ #if(sameCat & sameYr){blpGS=dfGS}
blp[i,4]=df[j,3];
}
}
If I understood your problem correctly, this should help.
The only difference is that instead of summarize, which returns only the grouping columns and the summarized one, you can use mutate to keep the original columns and add the aggregate to them.
# Sample input
## 45 obs, 3 cats, 5 yrs
cat <- c( "compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large","compact","midsize","large")
yr <- c(77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81,77,78,79,80,81)
s <- c(.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002,.001,.0005,.002,.0001,.0002)
# Calculation
blp <-
  data.frame(cat, yr, s, stringsAsFactors = FALSE) %>% # To create dataframe
  group_by(cat, yr) %>% # Grouping by category and year
  mutate(group_share = sum(s, na.rm = TRUE)) %>% # Calculating sum share per category/year
  ungroup()
Expected output
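For comparison, the same "group total on every row" result can be sketched in base R with ave(), which returns a vector as long as its input. The data frame name blp_toy is hypothetical; the values reproduce the sample input above using rep().

```r
# Same 45 observations as the sample input, built with rep()
blp_toy <- data.frame(
  cat = rep(c("compact", "midsize", "large"), 15),
  yr  = rep(77:81, 9),
  s   = rep(c(.001, .0005, .002, .0001, .0002), 9),
  stringsAsFactors = FALSE
)

# Sum of s within each cat/yr group, repeated on every row of that group
blp_toy$group_share <- ave(blp_toy$s, blp_toy$cat, blp_toy$yr, FUN = sum)

head(blp_toy)
```

As with mutate(), all 45 rows are kept and each one carries the total share of its cat/yr group.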

Resources