R: Balancing a Repeated Cross-Section Sample - r

I have an unbalanced panel of repeated cross sectional data with different number of observations with different number of ages of individuals by sampling year something like the following:
mydata <- data.frame(age = sample(60, 1000, replace=TRUE),
year=sample(3,1000, replace=TRUE),
x=rnorm(1000))
I would like to balance my cross sections panels so that there is an equal number of ages for each cross section. I have thought of a few ways to do this. I believe the easiest would be to count the number of people in each cross section for each age.
mydata <- dplyr::mutate(group_by(mydata, age, year), nage=n())
Then I find the minimum count for each age group across years.
mydata <- dplyr::mutate(group_by(mydata, age), minN=min(nage))
Now the last part is the part I don't know how to do. I would now like to select the first 1:N observations within each group. The obvious way to do this would be to create an index variable within each group. Then subset the data.frame to only those observations which are less than that index value that counts from 1 to N.
mydata <- dplyr::mutate(group_by(mydata, age, year), index=index())
subset(mydata, index <= minN)
Of course this is the problem. The function index does not exist. I have written out this entire explanation so that either someone can provide the function I am looking for or someone can suggest an alternative method to accomplish this same objective, or both. Thanks for your consideration!

Old solution:
mydata %>% group_by(age, year) %>%
mutate(nage=n()) %>%
group_by(age) %>%
filter(row_number()%in%1:min(nage))
Final solution:
mydata %>%
group_by(age, year) %>%
mutate(nage=n()) %>%
group_by(age) %>%
mutate(minN = min(nage)) %>%
group_by(age, year) %>%
slice(seq_len(minN[1L]))

Related

How do you count the number of observations in multiple columns and use mutate to make the counts as new columns in R?

I have a dataset that has multiple lines of survey responses from different years and from different organizations. There are 100 questions in the survey and people can skip them. I am trying to get the average for each question by year by organization (so grouped by organization and year). I also want to get the count of the number of people in those averages since people can skip them. I want these two data points as new columns as well, so it will add 200 columns total. I figured out how to the average. See code below. I can't seem to use the same function to get the count of observation.
This is how I successfully got the average.
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Question'), mean, na.rm = TRUE, .names = "{.col}_average")) %>%
ungroup()
I am now trying to use a similar set up to get the count of observations. I duplicated the columns with the raw data and added Count in the title so that the new average columns are not counted as columns that R needs to find the ncount for
df<- df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'), function(x){sum(!is.na(.))}, .names = "{.col}_ncount")) %>%
ungroup()
The code above does get me the new columns but the n count is the same of all columns and all rows? Any thoughts?
The issue is in the lambda function i.e. function(x) and then the sum is on the . instead of x. . by itself can be evaluated as the whole data
library(dplyr)
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
function(x){sum(!is.na(x))},
.names = "{.col}_ncount")) %>%
ungroup()
If we want to use the . or .x, specify the lambda function as ~
df%>%
group_by(Organization, Year) %>%
mutate(across(contains('Count'),
~ sum(!is.na(.)),
.names = "{.col}_ncount")) %>%
ungroup()

I need to use a loop in R but don't know where to start

I have a calculation that I have to perform for 23 people (they have varying number of rows allocated to each person so difficult to do in excel. What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20%) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==1) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
Subj2 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==2) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output looks like.
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
Above will take all columns, except Subject and pivot them into two columns, one containing the former col name and one containing the former value.

R function to segment data based on two columns?

I have a dataset that contains zip codes of houses and the price for each house. I need to split it into three datasets based on average price of the zip codes. For example, one set with the zip codes with the highest price, average price, and lowest price.
My idea was to order the dataset from lowest to highest based on price, split it into thirds, and then see where each zip code showed the most, but that feels inefficient. Is there any better way to do this?
Here is a solution that uses dplyr. It is a little bit verbose, but it gets the job done. Using group_by calculates mean prices for each postcode, so that you can more precisely split up according to expensive, average, and cheap postcodes.
library(dplyr)
# Generate sample data
dat <- tibble(postcode = sample(c("5432", "5654", "2342", "1231", "8543", "4324"), 1000, replace = TRUE),
price = rnorm(1000, 400000, 50000))
# Work out mean price for each postcode
mean_prices <- dat %>%
group_by(postcode) %>%
summarise(mean_price = mean(price))
# Find split points for the mean postcode price
split_points <- quantile(unique(mean_prices$mean_price), (1:3)/3)
# Get the postcodes that are within cheap, middle, or expensive price ranges
cheap_postcodes <- mean_prices %>%
filter(mean_price <= split_points[1]) %>%
pull(postcode)
middle_postcodes <- mean_prices %>%
filter(mean_price > split_points[1] & mean_price <= split_points[2]) %>%
pull(postcode)
expensive_postcodes <- mean_prices %>%
filter(mean_price > split_points[2]) %>%
pull(postcode)
# Create the three datasets
cheap_third <- dat %>% filter(postcode %in% cheap_postcodes)
middle_third <- dat %>% filter(postcode %in% middle_postcodes)
expensive_third <- dat %>% filter(postcode %in% expensive_postcodes)

Calculate percentage occurence in R

I conducted a dietary analysis in a raptor species and I would like to calculate the percentage of occurence of the prey items in the three different stages of it's breeding cycle. I would like the occurence to be expressed a percentage of the sample size. As an example if the sample size is 135 and I get an occurence of Orthoptera 65. I would like to calculate the percentage: 65/135.
So far I have tried with the long version without succes. The result I am getting is not correct. Any help is highly recommended and sorry if this question is reposted.
The raw dataset is as it follows:
set.seed(123)
pellets_2014<-data.frame(
Period = sample(c("Prebreeding","Breeding","Postbreedng"),12, replace=TRUE),
Orthoptera = sample(0:10, 12,replace=TRUE),
Coleoptera=sample(0:10,12,replace = TRUE),
Mammalia=sample(0:10,12, replace=TRUE))
##I transform the file to long format
##Library all the necessary packages
library(dplyr)
library(tidyr)
library(scales)
library(naniar)
pellets2014_long<-gather(pellets_2014,Categories, Count, c(Orthoptera,Coleoptera,Mammalia))
##I trasnform the zero values to NAs
pellets2014_NA<-pellets2014_long %>% replace_with_na(replace = list(Count = 0))
## Try to calculate the occurence
Occurence2014<-pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n=n())
## I do get here but I don't get the right number of occurence and I am stuck how to get the right percentage
##If I try this:
Occurence2014<-pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n=n())%>%mutate(Freq_n=n/sum(n)*100)
##The above is also wrong because I need it to be divide by the sample size in each period (here is 4 samples per period, the overall sample size is 12)!
The output must be occurence and percentage of occurence for its prey category in each Period. As it is shown in the picture below
Desired output
Is this close to what you're looking for?
Occurence2014 <- pellets2014_NA %>%
group_by(Period,Categories) %>%
summarise(n = n()) %>%
ungroup() %>%
mutate(
freq = n / sum(n)
)
Something like this?
Occurence2014 <- pellets2014_NA %>%
group_by(Period) %>%
mutate(period_sample_size = n()) %>%
ungroup() %>%
group_by(Period,Categories,period_sample_size) %>%
summarise(n=n())%>%
mutate(Freq_n=n/period_sample_size*100)

Group and summarize with iterative filter using dplyr

Upfront apology if this has been asked, I have been searching all day and have not found an answer I can apply to my problem.
I am trying to solve this issue using dplyr (and co.) because my previous method (for loops) was too inefficient. I have a dataset of event times, at sites, that are in groups. I want to summarize the number (and proportion) of events that occur in a moving window along a sequence.
# Example data
set.seed(1)
sites = rep(letters[1:10],10)
groups = c('red','blue','green','yellow')
times = round(runif(length(sites),1,100))
timePeriod = seq(1,100)
# Example dataframe
df = data.frame(site = sites,
group = rep(groups,length(sites)/length(groups)),
time = times)
This is my attempt to summarize the number of sites from each group that contain a time (event) within a given moving window of time.
The goal is to move through each element of the vector timePeriod and summarize how many events in each group occurred at timePeriod[i] +/- half-window. Ultimately storing them in, e.g., a dataframe with a column for each group, and a row for each time step, is ideal.
df %>%
filter(time > timePeriod[i]-25 & time < timePeriod[i]+25) %>%
group_by(group) %>%
summarise(count = n())
How can I do this without looping through my sequence of time and storing the summary table for each group individually? Thanks!
Combining lapply and dplyr, you can do the following, which is close to what you had worked so far.
lapply(timePeriod, function(i){
df %>%
filter(time > (i - 25) & time < ( i + 25 ) ) %>%
group_by(group) %>%
summarise(count = n()) %>%
mutate(step = i)
}) %>%
bind_rows()

Resources