I have a set of data from children, recorded across a number of sessions. The number of sessions and age of each child in each session is different for each participant, so it looks something like this:
library(tibble)
mydf <- tribble(~subj, ~age,
"A", 16,
"A", 17,
"A", 19,
"B", 10,
"B", 11,
"B", 12,
"B", 13)
What I don't currently have in the data is a variable for Session number, and I'd like to add this to my dataframe. Basically I want to create a numeric variable that is ordinal from 1-n for each child, something like this:
mydf2 <- tribble(~subj, ~age, ~session,
"A", 16, 1,
"A", 17, 2,
"A", 19, 3,
"B", 10, 1,
"B", 11, 2,
"B", 12, 3,
"B", 13, 4)
Ideally I'd like to do this with dplyr.
You simply need to group by subj and use row_number():
mydf %>%
  group_by(subj) %>%
  mutate(session = row_number())
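Note that row_number() numbers rows in the order they currently appear. A minimal sketch, assuming the rows might not already be sorted by age (the `shuffled` data below is a hypothetical reordering of the question's example, and `numbered` is just an illustrative name):

```r
library(dplyr)

# Hypothetical unsorted version of the question's data
shuffled <- tibble::tribble(
  ~subj, ~age,
  "B", 12,
  "A", 19,
  "B", 10,
  "A", 16,
  "B", 13,
  "A", 17,
  "B", 11
)

numbered <- shuffled %>%
  group_by(subj) %>%
  arrange(age, .by_group = TRUE) %>%  # sort by age within each child
  mutate(session = row_number()) %>%  # 1..n within each child
  ungroup()                           # drop the grouping afterwards

numbered
```

Sorting first means the session number follows age rather than the incidental row order; the final ungroup() keeps later steps from silently operating per child.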
I want to transform a df from a "counting" approach (number of cases) to an "individual observations" approach.
Example:
df <- dplyr::tibble(
city = c("a", "a", "b", "b", "c", "c"),
sex = c(1,0,1,0,1,0),
age = c(1,2,1,2,1,2),
cases = c(2, 3, 1, 1, 1, 1))
Expected result
df <- dplyr::tibble(
city = c("a","a","a","a","a", "b", "b", "c", "c"),
sex = c(1,1,0,0,0,1,0,1,0),
age = c(1,1,2,2,2,1,2,1,2))
uncount() from tidyr can do that for you.
df |> tidyr::uncount(cases)
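For illustration, here is a small sketch that runs uncount() on the question's data and compares it with a base R way of repeating rows. The names `counts`, `individuals`, and `base_version` are only illustrative, and the base R line is an alternative technique, not part of tidyr:

```r
library(tidyr)

counts <- dplyr::tibble(
  city  = c("a", "a", "b", "b", "c", "c"),
  sex   = c(1, 0, 1, 0, 1, 0),
  age   = c(1, 2, 1, 2, 1, 2),
  cases = c(2, 3, 1, 1, 1, 1)
)

# One row per case; the weights column (cases) is dropped by default
individuals <- uncount(counts, cases)

# Base R alternative: repeat each row index `cases` times
base_version <- counts[rep(seq_len(nrow(counts)), counts$cases),
                       c("city", "sex", "age")]

# Sanity check: one row per counted case
nrow(individuals) == sum(counts$cases)
```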
I want to use dplyr to get the proportion of each value within its group. The desired result is in desired$prop.
Thanks in advance :))
data <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa","uk",
"spain","usa","uk","spain","usa","uk","spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)
desired <- data.frame(
team = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
country = c("usa",
"uk","spain","usa","uk","spain","usa","uk",
"spain"),
value = c(40, 20, 10, 50, 30, 35, 50, 60, 25),
prop = c(0.285714286,0.181818182,0.142857143,0.357142857,
0.272727273,0.5,0.357142857,0.545454545,
0.357142857)
)
@MrFlick is right. And also faster than I am.
library(dplyr)
df <- data %>%
  group_by(country) %>%
  mutate(prop = value / sum(value))
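A quick way to convince yourself that the grouping variable is the right one is to check that the proportions sum to 1 within each group. This is only a sketch on the question's data; `team_data`, `with_prop`, and `check` are illustrative names:

```r
library(dplyr)

team_data <- data.frame(
  team    = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  country = c("usa", "uk", "spain", "usa", "uk", "spain", "usa", "uk", "spain"),
  value   = c(40, 20, 10, 50, 30, 35, 50, 60, 25)
)

with_prop <- team_data %>%
  group_by(country) %>%
  mutate(prop = value / sum(value)) %>%  # share of each value within its country
  ungroup()

# Within each country the proportions should add up to 1
check <- with_prop %>%
  group_by(country) %>%
  summarise(total = sum(prop))

check
```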
This is driving me crazy. I am using anova_test from rstatix and it's telling me that my columns aren't there when they clearly are.
This is what my dataframe looks like:
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
df <- data.frame(ID, Form, Pen, Time)
ID, Form, and Pen are factors, while Time is numeric. So each subject completed forms A and B with Red, Blue, and Green pens, and I measured how long each took in completing the form.
This is a fake dataset that I've purposefully come up with to ask this question. In reality, this dataframe is derived from a larger dataset with several more variables. Each variable has a lot more observations (so not just one datapoint for subject 1 & Form A & Red Pen, as in this example, but multiple), so I've summarized them to get mean Time.
df <- original.df %>% dplyr::select(ID, Form, Pen, Time)
df <- df %>% dplyr::group_by(ID, Form, Pen) %>% dplyr::summarise(Time = mean(Time))
df <- df %>% convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
I wanted to test the main and interaction effects, so I'm doing a 2 by 3 repeated measures ANOVA (a two-way ANOVA, because Form and Pen are two independent variables).
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
and I KEEP getting this error:
Error: Problem with `mutate()` input `data`.
x Can't subset columns that don't exist.
x Columns `ID` and `Form` don't exist.
ℹ Input `data` is `map(.data$data, .f, ...)`.
WHY?! Any help would be greatly appreciated. I've been searching solutions for HOURS and I'm getting pretty frustrated.
Thank you for adding the additional details to the post. By default, summarise() drops only the last grouping variable (Pen), so df is still grouped by ID and Form when it reaches anova_test(). Ungroup it first, e.g.
#install.packages("rstatix")
library(rstatix)
library(tidyverse)
ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)
Form = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B")
Pen = c("Red", "Blue", "Green", "Red", "Blue", "Green", "Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green","Red", "Blue", "Green")
Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
original.df <- data.frame(ID, Form, Pen, Time)
df <- original.df %>%
dplyr::select(ID, Form, Pen, Time)
df <- df %>%
dplyr::group_by(ID, Form, Pen) %>%
dplyr::summarise(Time = mean(Time))
df <- df %>%
convert_as_factor(ID, Form, Pen)
df$Time <- as.numeric(df$Time)
df <- ungroup(df)
aov <- rstatix::anova_test(data = df, dv = Time, wid = ID, within = c(Form, Pen))
You can see whether a dataframe is grouped using str(), e.g. str(df) before and after ungroup() shows you the difference. Please let me know if you are still getting errors after making this change.
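The leftover grouping can also be inspected directly with dplyr's group_vars() and is_grouped_df(). The sketch below rebuilds the toy data from the question (the names `raw`, `still_grouped`, and `dropped` are illustrative) and also shows `.groups = "drop"` as a possible alternative to a separate ungroup() call, assuming dplyr 1.0.0 or later:

```r
library(dplyr)

# Same toy data as in the question
raw <- data.frame(
  ID   = rep(1:3, each = 6),
  Form = rep(rep(c("A", "B"), each = 3), times = 3),
  Pen  = rep(c("Red", "Blue", "Green"), times = 6),
  Time = c(20, 4, 6, 2, 76, 3, 86, 35, 74, 94, 14, 35, 63, 12, 15, 73, 87, 33)
)

# Default behaviour: only the last grouping variable is dropped
still_grouped <- raw %>%
  group_by(ID, Form, Pen) %>%
  summarise(Time = mean(Time))

group_vars(still_grouped)  # lists the grouping variables that remain

# Alternative: ask summarise() to drop all grouping in one step
dropped <- raw %>%
  group_by(ID, Form, Pen) %>%
  summarise(Time = mean(Time), .groups = "drop")

is_grouped_df(dropped)     # checks whether any grouping is left
```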
I have a data set that is shown below:
library(tidyverse)
data <- tribble(
~category, ~product_id,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
"C", 30,
"A", 10,
"A", 10,
"B", 20,
)
And now, I want to group it by the "category" variable, keep the "product_id" and add a new variable that counts the categories:
aggregated_data <- tribble(
~category, ~product_id, ~numberOfcategory,
"A", 10, 5,
"B", 20, 3,
"C", 30, 2,
)
I already got the "numberOfcategory" with this code:
data %>%
group_by(category) %>%
tally(sort=TRUE)
But somehow I could not keep the product_id.
Could someone help me to get the dataframe (aggregated_data)? Thanks in advance.
You were close! Just also group by product_id as follows:
data %>%
  group_by(category, product_id) %>%
  tally(sort = TRUE)
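If the count column should be called numberOfcategory as in the desired output, one possible variant is dplyr's count(), which combines the grouping and tallying and takes a `name` argument. A minimal sketch on the question's data (`aggregated` is just an illustrative name):

```r
library(dplyr)

data <- tibble::tribble(
  ~category, ~product_id,
  "A", 10,
  "B", 20,
  "C", 30,
  "A", 10,
  "A", 10,
  "B", 20,
  "C", 30,
  "A", 10,
  "A", 10,
  "B", 20
)

# count() groups by the listed columns, counts rows, and ungroups again
aggregated <- data %>%
  count(category, product_id, sort = TRUE, name = "numberOfcategory")

aggregated
```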
I am trying to split a dataset in 80/20 - training and testing sets. I am trying to split by location, which is a factor with 4 levels, however each level has not been sampled equally. Out of 1892 samples -
Location1: 172
Location2: 615
Location3: 603
Location4: 502
I am trying to split the whole dataset 80/20, as mentioned above, but I also want each location to be split 80/20 so that the training and testing sets contain the same proportion of each location. I've seen one post about doing this with the stratified function from the splitstackshape package, but it doesn't seem to split my factor levels up.
Here is a simplified reproducible example -
x <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)
validIndex <- stratified(df, "xx", size=16/nrow(df))
valid <- df[-validIndex,]
train <- df[validIndex,]
where A, B, C, D correspond to the factor levels, in approximately the same proportions as in the actual dataset (~ 10, 32, 32, and 26%, respectively)
Setting bothSets = TRUE makes stratified() return a list of two data sets: the sampled rows first and the remaining rows second. Together they make up the original data frame. Because size is 0.8 here, the first element is the 80% training set and the second is the 20% validation set:
library(splitstackshape)
splt <- stratified(df, "xx", size=16/nrow(df), replace=FALSE, bothSets=TRUE)
train <- splt[[1]]
valid <- splt[[2]]
## check
df2 <- as.data.frame(do.call("rbind",splt))
all.equal(df[with(df, order(xx, x)), ],
df2[with(df2, order(xx, x)), ],
check.names=FALSE)
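For comparison, the same kind of stratified 80/20 split can be sketched in base R without splitstackshape. This is a different technique from stratified(), shown only to make the logic explicit; the seed value and the use of floor() for group sizes are arbitrary assumptions:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

x  <- c(1, 2, 3, 4, 1, 3, 7, 4, 5, 7, 8, 9, 4, 6, 7, 9, 7, 1, 5, 6)
xx <- c("A", "A", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C",
        "C", "C", "C", "D", "D", "D", "D", "D")
df <- data.frame(x, xx)

# Row indices belonging to each level of xx
rows_by_level <- split(seq_len(nrow(df)), df$xx)

# Sample about 80% of the row indices within each level
train_idx <- unlist(lapply(rows_by_level, function(i) {
  i[sample.int(length(i), floor(0.8 * length(i)))]
}))

train <- df[train_idx, ]
valid <- df[-train_idx, ]

# Per-level counts in each set
table(train$xx)
table(valid$xx)
```

With very small levels (such as A, which has only two rows here) the rounding rule decides whether a level appears in the validation set at all, so it is worth checking the per-level counts.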