Data manipulation for sampling with replacement - r

By comparison to most on this site, I am an extreme newbie when it comes to R, and would appreciate any possible help. I am looking to sample my data with replacement, but given how my data is set up, I am not sure how to go about that. I have 11 plant species. For each species I took 5 plant cuttings, and sampled 10 leaves from each cutting totaling 50 leaves per plant species. I need to sample with replacement within species. I was looking at using the sample function for this, but considering I need to sample within species I am not sure if I can. Attached is a photo of my data for context.
Data image
Apologies in advance for the naivety of my question and thanks in advance for any help!

You can group the data and then sample within each group. This assumes you have 5 cuttings per species and want 10 leaves drawn per cutting. If not, you may want to remove cutting from the group_by() call.
library(dplyr)
data %>%
  group_by(species, cutting) %>%
  slice_sample(weight_by = `leaf size`, n = 10, replace = TRUE) %>%
  ungroup()
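If the goal is to resample within species only, ignoring cutting, a minimal sketch could look like the following. The data frame leaves and its columns are made up for illustration (two species instead of 11), and the draw is unweighted; prop = 1 with replace = TRUE returns as many rows per species as the species has (50 here).

```r
library(dplyr)

set.seed(1)
# Hypothetical data: 2 species x 5 cuttings x 10 leaves = 50 leaves per species
leaves <- expand.grid(leaf = 1:10, cutting = 1:5, species = c("A", "B"))
leaves$leaf_size <- runif(nrow(leaves))

# Bootstrap-style resample: draw 50 leaves with replacement within each species
resampled <- leaves %>%
  group_by(species) %>%
  slice_sample(prop = 1, replace = TRUE) %>%
  ungroup()
```

Replacing prop = 1 with n = some number would draw a fixed number of leaves per species instead.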

Related

How do I count two unique values in one column based on another?

Any help with this would be appreciated. I am trying to add two new variables to my existing DF (*Grade1Sum and *Grade2Sum) that sum each film's grades (for example, ET was graded 1 three times, so its *Grade1Sum is always equal to 3).
There are only two grades (1 and 2). I want to add them to an existing DF, so currently I only have the first three columns.
I am adding what I want the DF to look like as a picture, as it won't format correctly:
**Additionally, does anyone know how to count the number of times each grade was received per film?
Here is one way to do it:
library(dplyr)
df %>%
  group_by(Film) %>%
  mutate(Grade1Sum = sum(Grade[Grade == 1]), Grade2Sum = sum(Grade[Grade == 2]))
Here is a more flexible way to do it:
library(dplyr)
library(tidyr)
df %>%
  group_by(Film, Grade) %>%
  summarise(Sum = sum(Grade)) %>%
  pivot_wider(names_from = Grade, values_from = Sum,
              names_glue = "Grade{Grade}{.value}") %>%
  right_join(., df)
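For the additional question about counting how many times each grade was received per film, one possible sketch is to sum a logical comparison instead of the grades themselves. The data frame and the column names Grade1Count and Grade2Count are invented for illustration.

```r
library(dplyr)

# Hypothetical data: ET graded 1 three times, Jaws graded 1 once and 2 twice
df <- data.frame(
  Film  = c("ET", "ET", "ET", "Jaws", "Jaws", "Jaws"),
  Grade = c(1, 1, 1, 1, 2, 2)
)

# sum() over a logical vector counts the TRUE values within each film
counted <- df %>%
  group_by(Film) %>%
  mutate(Grade1Count = sum(Grade == 1),
         Grade2Count = sum(Grade == 2)) %>%
  ungroup()
```

Note that sum(Grade[Grade == 2]) adds up the 2s themselves, so it gives twice the count, whereas sum(Grade == 2) gives the count.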

select lowest values which sum up to 10% of total

I'm new to this place and not very experienced with R, but I need it at work and I really hope you can help me.
I have a huge data set, but I will explain the issue using a small sample.
I have already grouped my data set to achieve the layout I want.
Basically, I have multiple EXCPosOutlet and EXCPPMonth names, and I need to remove the lowest values per EXCPosOutlet per EXCMonth which sum up to 10% of the total for that individual group.
So let's say that the total of AveragePrice for a sampleName for Month 612 is $1000. I need to remove all rows with the lowest values of AveragePrice which sum up to $100.
If removing is messy, even creating an extra column (mutate), using ifelse for example, that just tells me whether a row falls under my criteria would be totally enough.
I have tried the ntile and quantile functions, but I'm not getting what I need.
Thank you so much in advance.
Let me know if I should provide more details.
One possibility is to use the dplyr package and, for legibility, the pipe operator %>%. There are other ways to get the same result, but you might want to give it a try:
library(dplyr)
## generate example data:
data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
) %>%
  ## sort dataframe by outlet and (increasing) price:
  arrange(EXCPosOutlet, AveragePrice) %>%
  ## group by outlet:
  group_by(EXCPosOutlet) %>%
  ## calculate cumulative price:
  mutate(cumAveragePrice = cumsum(AveragePrice)) %>%
  ## keep rows which, per outlet, total at most the threshold of $100:
  filter(cumAveragePrice <= 100)
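A variation on the same cumulative-sum idea, sketched under the assumption that the threshold should be 10% of each group's own total rather than a fixed $100, and that the lowest rows should be flagged or dropped rather than kept. The column in_lowest_10pct and the object names are invented for illustration; with real data you would presumably add the month column to arrange() and group_by().

```r
library(dplyr)

set.seed(42)
# Hypothetical example data, same shape as above
prices <- data.frame(
  EXCPosOutlet = gl(3, 12),
  AveragePrice = runif(36) * 100
)

flagged <- prices %>%
  arrange(EXCPosOutlet, AveragePrice) %>%
  group_by(EXCPosOutlet) %>%
  mutate(
    cumAveragePrice = cumsum(AveragePrice),
    # TRUE for the lowest rows whose running total stays within 10% of the group total
    in_lowest_10pct = cumAveragePrice <= 0.10 * sum(AveragePrice)
  ) %>%
  ungroup()

# Drop the flagged rows if removal (rather than a flag column) is wanted
trimmed <- flagged %>% filter(!in_lowest_10pct)
```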

Recoding a variable across waves of a longitudinal dataset

I realised recently that a longitudinal variable in my dataset (whether people stated they were on furlough after being asked why their working hours were reduced from the previous wave of the study) was coded incorrectly. Right now, the variable is coded as “1” if a respondent reported furlough as the reason for fewer working hours than in the last survey wave, and “0” even if a respondent was on furlough but their working hours did not change from the last survey wave. Therefore, I want to recode this variable so that after the first report of a furlough-related decrease in working hours (“1”), the rest of the data (i.e. the subsequent waves) would also be coded “1”. This may be a simple change to execute in R, but I spent a few hours this morning trying if-else statements and dplyr with no success.
TLDR: I would like to recode a variable so that if the variable equals 1 for a specified wave of my longitudinal dataset, it also equals 1 for the rest of the waves of the dataset.
Can I please ask for any suggestions you have for resolving this? Thank you so much!
It is very hard to help without seeing your data. I made a small example dataset following your explanation. Check whether this is what you have in mind.
library(dplyr)
data <- tibble(ID = c(1,2,3,1,2,3,1,2,3),
               Wave = c(1,1,1,2,2,2,3,3,3),
               Reduced = c(1,1,0,1,0,0,1,0,0))
# Take the wave-1 values (one row per ID)
data2 <- data %>% filter(Wave == 1)
# Copy them into every wave; this relies on each wave listing the IDs in the same order
data <- data %>% group_by(Wave) %>% mutate(Reduced_new = data2$Reduced)
rm(data2)
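If the first “1” can occur in any wave, not just wave 1, another option is a cumulative maximum within each respondent. This is only a sketch on made-up data (the object names waves and recoded are invented), and it assumes Reduced contains only 0 and 1 with no missing values, since cummax() propagates NA.

```r
library(dplyr)

# Hypothetical data: ID 1 first reports furlough in wave 2, ID 2 in wave 1, ID 3 never
waves <- tibble(
  ID      = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Wave    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Reduced = c(0, 1, 0, 1, 0, 0, 0, 0, 0)
)

recoded <- waves %>%
  arrange(ID, Wave) %>%                      # waves must be in order within each ID
  group_by(ID) %>%
  mutate(Reduced_new = cummax(Reduced)) %>%  # stays 1 once a 1 has been seen
  ungroup()
```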

Finding the mean of a subset with 2 variables in R

Sincere apologies if my terminology is inaccurate, I am very new to R and programming in general (<1m experience). I was recently given the opportunity to do data analysis on a project I wish to write-up for a conference and could use some help.
I have a csv file (cida_ams_scc_csv) with patient data from a recent study. It's a dataframe, with columns of patient ID ('Cow ID'), location of sample ('QTR', either LH LF RH or RF), date ('Date', written DD/MM/YY), and the lab result from testing of the sample ('SCC', an integer).
On any given sampling day, each of the four anatomic locations of each patient was sampled and tested. I want to find the average 'SCC' for each location of each patient, across all days the patient was sampled.
I was able to find the average SCC for each patient across all days and all anatomic sites using the code below.
aggregate(cida_ams_scc_csv$SCC, list(cida_ams_scc_csv$'Cow ID'), mean)
Now I want to add another "layer," where I see not just the patient's average, but the average of each patient for each of the 4 sample sites.
I honestly have no idea where to start. Please walk me through this in the simplest way possible, I will be eternally grateful.
It is always better to provide a minimal reproducible example, but here the answer might be simple enough that it's not necessary.
You can use the same code to do what you want. If we look at the aggregate documentation (?aggregate), we find that the second argument, by, is
a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Therefore, running
aggregate(mtcars$mpg, by = list(mtcars$cyl, mtcars$gear), mean)
returns the "double grouped" means.
In your case, that means adding the "second layer" to the list you pass as the value of the by parameter.
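Applied to data shaped like yours, that could look roughly like this. The data frame scc below is invented for illustration (two cows, two sampling days, made-up SCC values); only the column names 'Cow ID', QTR and SCC come from the question.

```r
# Hypothetical data: 2 cows, 4 quarters, 2 sampling days each
scc <- data.frame(
  `Cow ID` = rep(1:2, each = 8),
  QTR      = rep(c("LH", "LF", "RH", "RF"), times = 4),
  SCC      = 1:16,
  check.names = FALSE  # keep the space in the 'Cow ID' column name
)

# Two grouping elements in the by list: one mean per cow and quarter
site_means <- aggregate(
  scc$SCC,
  by  = list(CowID = scc$`Cow ID`, QTR = scc$QTR),
  FUN = mean
)
```

Naming the list elements (CowID, QTR) gives the grouping columns readable names in the result; the means end up in a column called x.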
I'd recommend dplyr for working with data frames - it's fast and friendly. Here's a way to calculate the average of each patient for each location using it:
# Load the package
library(dplyr)
# Make some fake data that looks similar to yours
cida_ams_scc_csv <- data.frame(
  QTR = gl(
    n = 4, k = 5, labels = c("LH", "LF", "RH", "RF"), length = 40
  ),
  Date = rep(c("06/10/2021", "05/10/2021"), each = 5),
  SCC = runif(40),
  Cow_ID = 1:5)
# Group by ID and QTR, then calculate the mean for each group
cida_ams_scc_csv %>%
  group_by(Cow_ID, QTR) %>%
  summarise(grouped_mean = mean(SCC))
This returns one row per combination of Cow_ID and QTR, with that group's mean SCC in the grouped_mean column.

How can I add the populations of males and females together to remove gender as a variable in a demographics table in RStudio? [duplicate]

This is my first time posting a question, so I may not have the correct info to start; apologies in advance. I am new to R and would prefer to use dplyr or tidyverse, because those are the packages we've used so far. I did search for a similar question, but most gender/sex related questions are about separating the data, or performing operations on each sex separately.
I have a table of population counts, with variables (factors) Age Range, Year and Sex, with Population as the dependent variable. I want to create a plot to show if the population is aging, that is, showing how the relative proportion of different age groups changes over time. But gender is not relevant, so I want to add together the population counts for males and females, for each year and age range.
I don't know how to provide a copy of the raw data .csv file, so if you have any suggestions, please let me know.
This is a sample of the data (output table):
And here is the code so far:
file_name <- "AusPopDemographics.csv"
AusDemo_df = read.table(file_name,",", header=TRUE)
(grp_AusDemo_df <- AusDemo_df %>% group_by(Year, Age))
I am guessing it may be something like pivot_wider() to bring male and female up as column headings, then transmute() to sum them and create a new population column.
Thanks for your help.
With dplyr you could do something like this:
library(dplyr)
grp_AusDemo_df <- AusDemo_df %>%
  group_by(Year, Age) %>%
  summarise(Population = sum(Population, na.rm = TRUE))
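Since the stated goal is a plot of how the relative proportion of age groups changes over time, a possible next step is to divide each summed count by the yearly total. This is a sketch on invented numbers; the Sex values, the Share column and the object name age_share are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data with the same columns as described in the question
AusDemo_df <- data.frame(
  Year       = rep(c(2000, 2010), each = 4),
  Age        = rep(rep(c("0-14", "65+"), each = 2), times = 2),
  Sex        = rep(c("Male", "Female"), times = 4),
  Population = c(10, 12, 5, 7, 9, 11, 8, 10)
)

age_share <- AusDemo_df %>%
  # Add males and females together for each year and age range
  group_by(Year, Age) %>%
  summarise(Population = sum(Population, na.rm = TRUE), .groups = "drop") %>%
  # Express each age range as a share of that year's total population
  group_by(Year) %>%
  mutate(Share = Population / sum(Population)) %>%
  ungroup()
```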
