I have a dataset that contains zip codes of houses and the price for each house. I need to split it into three datasets based on the average price of each zip code: one set with the highest-priced zip codes, one with the average-priced, and one with the lowest-priced.
My idea was to order the dataset from lowest to highest price, split it into thirds, and then see where each zip code showed up the most, but that feels inefficient. Is there a better way to do this?
Here is a solution that uses dplyr. It is a little verbose, but it gets the job done. Using group_by to calculate the mean price for each postcode lets you split more precisely into expensive, average, and cheap postcodes.
library(dplyr)

# Generate sample data
dat <- tibble(postcode = sample(c("5432", "5654", "2342", "1231", "8543", "4324"), 1000, replace = TRUE),
              price = rnorm(1000, 400000, 50000))

# Work out the mean price for each postcode
mean_prices <- dat %>%
  group_by(postcode) %>%
  summarise(mean_price = mean(price))

# Find the split points for the mean postcode prices
split_points <- quantile(unique(mean_prices$mean_price), (1:3)/3)

# Get the postcodes that fall within the cheap, middle, or expensive price ranges
cheap_postcodes <- mean_prices %>%
  filter(mean_price <= split_points[1]) %>%
  pull(postcode)

middle_postcodes <- mean_prices %>%
  filter(mean_price > split_points[1] & mean_price <= split_points[2]) %>%
  pull(postcode)

expensive_postcodes <- mean_prices %>%
  filter(mean_price > split_points[2]) %>%
  pull(postcode)

# Create the three datasets
cheap_third <- dat %>% filter(postcode %in% cheap_postcodes)
middle_third <- dat %>% filter(postcode %in% middle_postcodes)
expensive_third <- dat %>% filter(postcode %in% expensive_postcodes)
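If you don't need the explicit quantile cut points, a more compact route is dplyr's ntile(), which ranks the postcode means and buckets them into three equal-sized groups. This is just a sketch reusing the dat tibble from above; note that ntile() breaks ties by row order, so with very few postcodes the split may differ slightly from the quantile approach.

library(dplyr)
# Bucket the postcodes into thirds by mean price
postcode_groups <- dat %>%
  group_by(postcode) %>%
  summarise(mean_price = mean(price)) %>%
  mutate(price_group = ntile(mean_price, 3))  # 1 = cheapest third, 3 = most expensive

# Attach the group to every house, then split into a list of three datasets
dat_grouped <- dat %>% left_join(postcode_groups, by = "postcode")
thirds <- split(dat_grouped, dat_grouped$price_group)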
I am using RStudio. I have a dataset with 1000k records, with columns including FINAL_CLASSIFICATION and AGE. The FINAL_CLASSIFICATION column contains values ranging from 1 to 7: those with 1, 2 or 3 are infected with SARS-COVID, while those with 4, 5, 6 and 7 are healthy. I need to make a histogram of the ages of those who are infected. I understand that I must create a group of the ages that coincide with 1, 2 and 3 in the FINAL_CLASSIFICATION column (those will be the ages of the infected people) and from there make the histogram, but I cannot find a way to create the group or obtain this.
Could you help me?
I have the following code
#1)
# Import the data into R
# RECOMMENDATION: use read_csv
library(readr)  # read_csv
library(dplyr)  # sample_n, select, filter, group_by, summarise
library(purrr)  # map
covid_dataset <- read_csv("Desktop/Course in R/Examples/covid_dataset.csv")
View(covid_dataset)
#------------------------------------------------------------------------------------------
#2) Extract a random sample of 100k records and assign it into a new variable. From now on work with this dataset
# HINT: use dplyr's sample_n function
sample <- sample_n(covid_dataset, 100000)
# The syntax is sample_n(x, n), where x is the dataset from which we want
# to extract the sample and n is the sample size that we want
nrow(sample)
#with this function we can corroborate that we have extracted a 100K sample.
#------------------------------------------------------------------------------------------
#3)Make a statistical summary of the dataset and also show the data types by column.
summary(sample)
#The summary function is the one that gives us the summary statistics.
map(sample, class)
#The map() function gives us the data type of each column, and we can see
#that most columns are numeric.
#-------------------------------------------------------------------------------------------
#4)Filter the rows that are positive for SARS-COVID and calculate the number of records.
## Positive cases are those that in the FINAL_CLASSIFICATION column have 1, 2 or 3.
## To filter the rows, we will make use of the pipe operator and dplyr's select and filter functions.
#This will help us select the column and filter the rows where
#the FINAL_CLASSIFICATION column is 1, 2 or 3, i.e. SARS-COVID positive results.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 1
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 2
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 3
# I do them separately to have a better view of the records.
#Now if we want to get them all together we simply do the following
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3)
#This gives us the rows less than or equal to 3, i.e. the rows in which
#the FINAL_CLASSIFICATION column has 1, 2 or 3.
#Now, if we want the number of records when doing it separately, we simply add
#another pipe with the nrow() function to give us the number of
#rows for each query.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1) %>% nrow()
#gives us a result of 1471
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2) %>% nrow()
#gives us a result of 46
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3) %>% nrow()
#Gives us a result of 37703
#If we add the 3 results, we have that the total number of records is
1471+46+37703
#Which gives us 39220
#But it can be simplified by doing it in a straightforward way as follows
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3) %>% nrow()
#And we notice that we get the same result as the previous code.
#In conclusion, we have a total of 39220 positive SARS-COVID cases.
#---------------------------------------------------------------------------------------------
#5)Count the number of null records per column (HINT: Use sapply or map, and is.na)
apply(sample, MARGIN = 2, function(x) sum(is.na(x)))
#This shows us the number of NA's per column. We notice that the only column
#that has NA's is DATE_DEF, with a total of 95044; this tells us that out of the
#100k records, only approximately 5k have a known DATE_DEF.
#------------------------------------------------------------------------------------------
#6)
##a)Calculate the mean age of covid infectees.
##b)Make a histogram of the ages of the infected persons.
##c)Make a density plot of the ages of the infected persons
sample %>%
  group_by(FINAL_CLASSIFICATION <= 3) %>%
  summarise(average = mean(AGE))
#The average age of the infected then comes out to 43.9
#Now we make a histogram of the ages of the infected persons
sample %>% group_by(FINAL_CLASSIFICATION <= 3, AGE) %>% summarise(count = n())
It is the last part where I have doubts. I want to find the average age of the infected people; I used the code above with group_by, but I don't know if that is correct. And I also have doubts about the other two questions in #6, on how to make the histogram and the density plot.
What I gathered is that you wish to:
1. create a variable FINAL_CLASSIFICATION based on the values of FINAL_RANKING,
2. summarize the average age of the groups in FINAL_CLASSIFICATION, and
3. create a histogram of the positive cases in FINAL_CLASSIFICATION.
I created a random sample of 100 cases with random assumptions of AGE and FINAL_RANKING:
library(dplyr)
library(ggplot2)
sample <- tibble(FINAL_RANKING = sample(1:7, 100, replace = T), AGE = sample(10:100, 100, replace = T) )
sample <- sample %>%
mutate(
FINAL_CLASSIFICATION = case_when(
FINAL_RANKING %in% 1:3 ~ "SARS_COVID_POSITIVE",
FINAL_RANKING %in% 4:7 ~ "SARS_COVID_NEGATIVE")
)
sample %>%
group_by(FINAL_CLASSIFICATION) %>%
summarize(average_age = mean(AGE))
sample %>%
filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
ggplot(., aes(x = AGE)) +
geom_histogram()
Gives summary output:
# A tibble: 2 x 2
FINAL_CLASSIFICATION average_age
<chr> <dbl>
1 SARS_COVID_NEGATIVE 51.8
2 SARS_COVID_POSITIVE 58.6
and plot:
[Histogram of AGE for the positive cases; image not reproduced here]
As noted in the ggplot2 message in the output, you should adjust the bins: geom_histogram() defaults to bins = 30 and suggests picking a better value with binwidth.
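The question also asked for a density plot (part c of #6), which the code above does not cover. A minimal sketch with the same sample data, also showing the bins adjustment mentioned above (binwidth = 5 is just an illustrative choice):

# Histogram with an explicit binwidth instead of the default 30 bins
sample %>%
  filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
  ggplot(aes(x = AGE)) +
  geom_histogram(binwidth = 5)

# Density plot of the ages of the infected persons
sample %>%
  filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
  ggplot(aes(x = AGE)) +
  geom_density()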
I have a lab dataset where some patients have multiple recordings on the same date. I am trying to take the average of the results and produce that as my final result for that particular date. I used the approach below; however, it doesn't seem to work.
library(dplyr)

ACCT <- c(4333,3234,4232,1313,1341,4244,3211)
TEST_DATE <- c('2016-04-01','2016-04-01','2016-04-01','2016-04-01','2016-04-01','2016-04-01','2016-04-01')
RESULTS <- c(1.4,1.7,1.2,1.8,1.5,1.7,1.3)  # a 7th value is needed to match ACCT and TEST_DATE; 1.3 is a placeholder
df <- data.frame(ACCT,TEST_DATE,RESULTS)
df$TEST_DATE <- as.POSIXct(df$TEST_DATE)  # as.POSIX() does not exist; as.POSIXct() does
Creating non duplicate rows based on ACCT and TEST_DATE. This will give me accounts that don't have duplicate dates:
df_nonduplicates <- df %>% group_by(ACCT,TEST_DATE) %>% filter(!n()>1)
Creating a data frame that gives me ACCTS that have duplicate TEST_DATE's:
df_duplicates <- df %>% group_by(ACCT,TEST_DATE) %>% filter(n()>1)
Trying to take the average result of ACCT's that have more than one result on a TEST_DATE:
df_cleaned_duplicates <- df_duplicates %>%
  group_by(ACCT) %>%
  mutate(avg_result = mean(as.numeric(df_duplicates$RESULTS, NA.rm = TRUE))) %>%
  select(ACCT, TEST_DATE, avg_result)
This is giving me NA values for the entire avg_result column. I'm not able to understand why.
Joining the two datasets:
final_result <- rbind(df_nonduplicates, df_cleaned_duplicates)
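For what it's worth, the NA column likely comes from two things (an assumption based on the code shown): inside mutate(), df_duplicates$RESULTS refers to the whole column and ignores the grouping, and NA.rm = TRUE is being passed to as.numeric() instead of na.rm = TRUE to mean(). If RESULTS contains any non-numeric text in the real data, as.numeric() produces NAs and mean() without na.rm = TRUE then returns NA for every group. A sketch of the whole task in one step, assuming RESULTS is numeric:

library(dplyr)

# Average the duplicates; single recordings pass through unchanged,
# so the duplicate/non-duplicate split and the rbind() are not needed
final_result <- df %>%
  group_by(ACCT, TEST_DATE) %>%
  summarise(avg_result = mean(RESULTS, na.rm = TRUE), .groups = "drop")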
I have 60 datasets I created from one massive original one. They are split by year, and I named them all using their year number, like Year1, Year2, Year3, Year4, and so on up to Year60. Each dataset has a column "Car" and a column "Week". I am trying to loop through every dataset to find the largest Car value, take the row that value is in, and get the "Week" value for that row (basically the week in which the most cars were sold, for each of the 60 years).
My code is:
Year1$Car <- as.integer(Year1$Car)
df.1 <- aggregate(Car ~ Week, Year1, max)
df.a <- merge(df.1, Year1)
print(paste("Year 1 Most Cars Sold in Week", df.a$Week))
I am trying to find a way to run through this quicker than just manually typing for each dataset Year1, Year2, etc all the way to Year60.
I tried:
for (i in 1:60){
Year"i"$Car <- as.integer(Year"i"$Car)
df.1 <- aggregate(Car ~ Week, Year"i", max)
df.a <- merge(df.1, Year"i")
print(paste("Year "i" Most Cars Sold in Week", print(df.a$Week))
}
that didn't work :/ Would really appreciate any suggestions!
If you want to keep the individual dataframes intact, you can use sapply with mget to go through each one and extract the Week number of the row with the maximum Car value.
sapply(mget(paste0('Year', 1:60)), function(x) x$Week[which.max(x$Car)])
Or with dplyr, you can combine all the datasets into one, group_by each year, and select the row with the maximum value of Car.
library(dplyr)
bind_rows(mget(paste0('Year', 1:60)), .id = "id") %>%
group_by(id) %>%
slice(which.max(Car))
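If you also want the printed messages from your original loop, the sapply result can feed them directly (a small sketch, assuming the Year1 to Year60 dataframes exist in the global environment):

best_weeks <- sapply(mget(paste0('Year', 1:60)), function(x) x$Week[which.max(x$Car)])
# Reproduce the print(paste(...)) output for each year
for (i in seq_along(best_weeks)) {
  print(paste("Year", i, "Most Cars Sold in Week", best_weeks[i]))
}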
I conducted a dietary analysis in a raptor species and I would like to calculate the percentage of occurrence of the prey items in the three different stages of its breeding cycle. I would like the occurrence to be expressed as a percentage of the sample size. As an example, if the sample size is 135 and I get an occurrence of Orthoptera of 65, I would like to calculate the percentage 65/135.
So far I have tried with the long format without success; the result I am getting is not correct. Any help is highly appreciated, and sorry if this question is a repost.
The raw dataset is as it follows:
set.seed(123)
pellets_2014<-data.frame(
Period = sample(c("Prebreeding","Breeding","Postbreeding"), 12, replace=TRUE),
Orthoptera = sample(0:10, 12,replace=TRUE),
Coleoptera=sample(0:10,12,replace = TRUE),
Mammalia=sample(0:10,12, replace=TRUE))
##I transform the file to long format
##Library all the necessary packages
library(dplyr)
library(tidyr)
library(scales)
library(naniar)
pellets2014_long<-gather(pellets_2014,Categories, Count, c(Orthoptera,Coleoptera,Mammalia))
##I transform the zero values to NAs
pellets2014_NA<-pellets2014_long %>% replace_with_na(replace = list(Count = 0))
## Try to calculate the occurrence
Occurence2014<-pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n=n())
## I get this far, but I don't get the right occurrence counts, and I am stuck on how to get the right percentage
##If I try this:
Occurence2014 <- pellets2014_NA %>%
  group_by(Period, Categories) %>%
  summarise(n = n()) %>%
  mutate(Freq_n = n/sum(n)*100)
##The above is also wrong because I need it to be divided by the sample size in each period (here there are 4 samples per period; the overall sample size is 12)!
The output should be the occurrence and the percentage of occurrence for each prey category in each Period, as shown in the picture below.
[Image: desired output table]
Is this close to what you're looking for?
Occurence2014 <- pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n = n()) %>%
ungroup() %>%
mutate(
freq = n / sum(n)
)
Something like this?
Occurence2014 <- pellets2014_NA %>%
  group_by(Period) %>%
  mutate(period_sample_size = n()) %>%
  ungroup() %>%
  group_by(Period, Categories, period_sample_size) %>%
  summarise(n = n()) %>%
  mutate(Freq_n = n/period_sample_size*100)
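An alternative sketch that sidesteps the NA replacement entirely, working from pellets2014_long: occurrence is the number of samples in which a category is present, so summing Count > 0 within each Period and Categories group gives the occurrence, and its mean gives the proportion of samples directly.

library(dplyr)

Occurence2014_alt <- pellets2014_long %>%
  group_by(Period, Categories) %>%
  summarise(n = sum(Count > 0),              # samples in which the prey occurs
            Freq_n = mean(Count > 0) * 100,  # occurrence as % of the period sample size
            .groups = "drop")

This works because each Period and Categories group in the long data has exactly one row per sample.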
I have an unbalanced panel of repeated cross-sectional data, with a different number of observations at each age by sampling year, something like the following:
mydata <- data.frame(age = sample(60, 1000, replace=TRUE),
                     year = sample(3, 1000, replace=TRUE),
                     x = rnorm(1000))
I would like to balance my cross-section panels so that there is an equal number of observations at each age in every cross section. I have thought of a few ways to do this. I believe the easiest would be to count the number of people at each age in each cross section.
mydata <- dplyr::mutate(group_by(mydata, age, year), nage=n())
Then I find the minimum count for each age group across years.
mydata <- dplyr::mutate(group_by(mydata, age), minN=min(nage))
Now the last part is the part I don't know how to do. I would like to select the first 1:N observations within each group. The obvious way would be to create an index variable within each group, then subset the data.frame to the observations whose within-group index, counting from 1 up, is at most minN.
mydata <- dplyr::mutate(group_by(mydata, age, year), index=index())
subset(mydata, index <= minN)
Of course this is the problem. The function index does not exist. I have written out this entire explanation so that either someone can provide the function I am looking for or someone can suggest an alternative method to accomplish this same objective, or both. Thanks for your consideration!
Old solution:
mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  filter(row_number() %in% 1:min(nage))
Final solution:
mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  mutate(minN = min(nage)) %>%
  group_by(age, year) %>%
  slice(seq_len(minN[1L]))
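A quick sanity check of the final solution (just a sketch): after slicing, every age should appear the same number of times in each year.

library(dplyr)

balanced <- mydata %>%
  group_by(age, year) %>%
  mutate(nage = n()) %>%
  group_by(age) %>%
  mutate(minN = min(nage)) %>%
  group_by(age, year) %>%
  slice(seq_len(minN[1L]))

# For each age, the count should be identical across years
balanced %>%
  ungroup() %>%
  count(age, year) %>%
  group_by(age) %>%
  summarise(is_balanced = n_distinct(n) == 1)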