Performing calculations between rows in R

I’m trying to figure out how to make a calculation across (or between?) rows. I’ve tried looking this up but clearly my Google-Fu is not strong today, because I’m just not finding the right search terms.
Here is a super simplified example of the type of data I’m trying to deal with:
mydf <- data.frame(pair = rep(1,2),
participant = c("PartX", "PartY"),
goalsAtt = c(6, 3),
goalsScr = c(2, 3))
We have data on how many "goals" a participant attempted and how many they actually scored, and let's say I want to know about their "defensive" ability. Essentially, what I want to do is mutate() two new columns called, let's say, saved and missed, where saved would be the goals attempted by the opposite participant minus the goals scored by them, and missed would just be the goals scored by the opposite participant. So participant X would have saved 0 and missed 3, and participant Y would have saved 4 and missed 2.
Now obviously this is a stupid simple example, and I’ll have like 6 different “goal” types to deal with and 2.5k pairs to go through, but I’m just having a mental block about where to start with this.
Amateur coder here, and Tidyverse style solutions are appreciated.

OK, so let's assume that each pair contains exactly 2 participants. Here's a tidyverse solution where we first set an index number for the position within a group and then subtract to get the goals saved. Goals missed are handled in a similar way.
library(tidyverse)
mydf %>%
group_by(pair) %>%
mutate(id = row_number()) %>%
mutate(goalsSaved = if_else(id == 1,
lead(goalsAtt) - lead(goalsScr),
lag(goalsAtt) - lag(goalsScr))) %>%
mutate(goalsMissed = if_else(id == 1,
lead(goalsScr),
lag(goalsScr)))
# A tibble: 2 x 7
# Groups: pair [1]
pair participant goalsAtt goalsScr id goalsSaved goalsMissed
<dbl> <fct> <dbl> <dbl> <int> <dbl> <dbl>
1 1 PartX 6 2 1 0 3
2 1 PartY 3 3 2 4 2
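With several goal types, the if_else()/lead()/lag() pattern above gets repetitive. A minimal sketch of an alternative, assuming every pair has exactly two rows: reversing a column within each pair with rev() hands each participant their opponent's value. The column names are taken from the question; the result name is only for illustration.

```r
library(dplyr)

mydf <- data.frame(pair = rep(1, 2),
                   participant = c("PartX", "PartY"),
                   goalsAtt = c(6, 3),
                   goalsScr = c(2, 3))

# Assumption: each pair has exactly two rows, so rev() swaps the two values
# and each participant receives the opponent's numbers.
result <- mydf %>%
  group_by(pair) %>%
  mutate(saved  = rev(goalsAtt - goalsScr),  # opponent's attempts minus opponent's goals
         missed = rev(goalsScr)) %>%         # opponent's goals
  ungroup()

result
```

If some pairs could have more or fewer than two rows, this would silently give wrong values, so a check such as count(mydf, pair) beforehand is worth considering.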

Related

When working with multiple rows with same Id in R, how to assign different values for different groups?

I am dealing with some patient data in R. I need to calculate the time between the first visit and the last visit for the normal patients, and the time between the first visit and the date of the first disease diagnosis for patients who progressed to the disease. I have tried but it didn't work out. I would really appreciate it if someone could help.
My data looks like "patient", where visit_number = the visit orders, followup_days=days between the first visit to each follow up visit.
patient<-data.frame(patient_ID=c(1,1,2,2,2,3,3,4,4,4),age=c(63,64,60,61,63,61,62,77,77,79),
visit_number=c(1,2,1,2,3,1,2,1,2,3), followup_days=c(0,504,0,390,798,0,379,0,310,621),diagnosis=c(0,0,0,1,1,0,0,0,0,1))
The new data needs to look like "patient1". I need to create a new variable "time".
For patients with a normal status, the time is the length of days between the first visit and the last visit.
For patients with a disease diagnosis (diagnosis=1), the time is the length of days between the first visit, and the FIRST time of the diagnosis of 1.
patient1 <-data.frame(patient_ID=c(1,1,2,2,2,3,3,4,4,4),age=c(63,64,60,61,63,61,62,77,77,79),
visit_number=c(1,2,1,2,3,1,2,1,2,3), followup_days=c(0,504,0,390,798,0,379,0,310,621),
diagnosis=c(0,0,0,1,1,0,0,0,0,1), time=c(504,504,390,390,390,379,379,621,621,621))
Lastly, for the final data set, I would like to keep only the first visit for each patient, with the "time" column added.
new_patient <-data.frame(patient_ID=c(1,2,3,4),age=c(63,60,61,77),
visit_number=c(1,1,1,1), followup_days=c(0,0,0,0),diagnosis=c(0,0,0,0), time=c(504,390,379,621))
Any ideas how to make it happen? Thank you
To create the patient1 data, we first load the dplyr package and create a function that returns the minimum positive value. We then group by patient and create the time variable conditional on the diagnosis variable:
library(dplyr)
minpositive = function(x) min(x[x > 0])
patient1 <- patient %>% group_by(patient_ID) %>%
mutate(time = ifelse(sum(diagnosis)>0,
minpositive(followup_days * diagnosis),
max(followup_days)))
To create the final dataset we filter based on visit_number:
new_patient <- patient1 %>% filter(visit_number == 1)
This should create the desired output.
Group by patient_ID, and use an if-else statement to generate the time variable, conditional on the presence of 1 in the diagnosis column:
library(dplyr)
patient %>%
group_by(patient_ID) %>%
mutate(time = ifelse(1 %in% diagnosis, min(followup_days[diagnosis==1]),max(followup_days))) %>%
filter(visit_number==1)
Output:
patient_ID age visit_number followup_days diagnosis time
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 63 1 0 0 504
2 2 60 1 0 0 390
3 3 61 1 0 0 379
4 4 77 1 0 0 621
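Because the condition is a single value per patient, a plain if/else inside mutate() expresses the same logic as the vectorised ifelse(). The sketch below reuses the data from the question. Indexing followup_days by diagnosis == 1 also covers a patient diagnosed at day 0, a case where min(x[x > 0]) would be taking the minimum of an empty vector.

```r
library(dplyr)

patient <- data.frame(
  patient_ID    = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  age           = c(63, 64, 60, 61, 63, 61, 62, 77, 77, 79),
  visit_number  = c(1, 2, 1, 2, 3, 1, 2, 1, 2, 3),
  followup_days = c(0, 504, 0, 390, 798, 0, 379, 0, 310, 621),
  diagnosis     = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1)
)

new_patient <- patient %>%
  group_by(patient_ID) %>%
  mutate(time = if (any(diagnosis == 1)) {
    min(followup_days[diagnosis == 1])  # days until the first diagnosis
  } else {
    max(followup_days)                  # days until the last visit
  }) %>%
  ungroup() %>%
  filter(visit_number == 1)             # keep only the first visit per patient

new_patient
```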

Conditional sampling by group based on sample mean

I am trying to use R to make a bunch of different trivia quizzes. I have a large dataset (quiz_df) containing numerous questions divided into categories and difficulties looking like this:
ID Category Difficulty
1 1 Sports 3
2 2 Science 7
3 3 Low culture 4
4 4 High culture 2
5 5 Geography 8
6 6 Lifestyle 3
7 7 Society 3
8 8 History 5
9 9 Sports 2
10 10 Science 8
... ... ... ...
1000 1000 Science 3
Now I want to randomly sample 3 questions from each of the 8 categories, so that the mean difficulty is 4 (or the sum being 4*24 = 96).
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3)
This creates a random quiz set with 3 questions in each category, but does not take into consideration the difficulty. I am aware of the weight-option in sample_n:
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3, weight = Difficulty)
But this does not solve the issue. Ideally, I would like to add an option like: sum = 96, or mean = 4.
Does anyone have any clues?
This is a brute-force solution:
library(dplyr)
# Generating sample data.
set.seed(1986)
n = 1000
quiz_df = data.frame(id = 1:n,
Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
Difficulty = sample(1:10, size = n , replace = TRUE))
# Solution: resample until condition is met.
repeat {
temp.draw = quiz_df %>% group_by(Category) %>% slice_sample(n = 3) # sample_n() has been superseded by slice_sample()
temp.mean = mean(temp.draw$Difficulty)
if (temp.mean == 4) # Check if the draw satisfies your condition.
{
final.draw = temp.draw
break
}
}
final.draw
mean(final.draw$Difficulty)
First of all, as you are new to SO, let me tell you that you should always include some example data in your questions: not just the structure, but something we can run on our machines. This time I simply simulated some data with only three levels of Category. My solution runs in less than two seconds; with the whole data set, however, the code may need more time.
The idea is simply to resample until we get a draw with three questions per category (24 questions in your full data set, 9 in my simulation) whose mean Difficulty equals 4. Clearly, this is not an elegant solution, but it may be a first step.
I am going to try again to find a better solution. I guess the problem is that the draws are not independent; I will look deeper into that.
P.S. From the documentation I see that sample_n() has been superseded by slice_sample(), so I suggest relying on the latter.
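The resampling loop above can also be wrapped in a function with a cap on the number of attempts, so that it cannot run forever if no matching draw exists. This is only a sketch: draw_quiz() and its arguments are hypothetical names, and the data are simulated as in the answer. It compares the sum of the difficulties against target_mean times the number of questions, which avoids testing a floating-point mean for exact equality.

```r
library(dplyr)

# Simulated data, as in the answer above.
set.seed(1986)
n <- 1000
quiz_df <- data.frame(
  id = 1:n,
  Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
  Difficulty = sample(1:10, size = n, replace = TRUE)
)

# Hypothetical helper: resample until the target is hit or max_tries is reached.
draw_quiz <- function(df, per_category = 3, target_mean = 4, max_tries = 10000) {
  for (i in seq_len(max_tries)) {
    draw <- df %>%
      group_by(Category) %>%
      slice_sample(n = per_category) %>%
      ungroup()
    # Compare sums so that no division is involved.
    if (sum(draw$Difficulty) == target_mean * nrow(draw)) {
      return(draw)
    }
  }
  stop("No draw matched the target mean within max_tries attempts.")
}

quiz <- draw_quiz(quiz_df)
mean(quiz$Difficulty)
```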

Combining/aggregating data in R

I feel like this is a really simple question, and I've looked a lot of places to try to find an answer to it, but everything seems to be looking to do a lot more than what I want--
I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories that are London and notLondon. How would I do this? I'd then want to be able to run a lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would be the combined category?
Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.
Thanks!
There are multiple parts to your question so let's take it step by step.
1. For the first part, this is a great use of forcats::fct_collapse(). See the example here:
library(tidyverse)
set.seed(1)
d <- sample(letters[1:5], 20, T) %>% factor()
# original distribution
table(d)
#> d
#> a b c d e
#> 6 4 3 1 6
# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#> a other
#> 6 14
Created on 2022-02-10 by the reprex package (v2.0.1)
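Applied to the London case from the question, the same call might look like the sketch below. The region vector is made-up illustration data, and region2 is a hypothetical name for the collapsed factor.

```r
library(forcats)

# Made-up example values; the real data would come from the asker's dataset.
region <- factor(c("London", "Sussex", "Buckinghamshire", "London", "Sussex"))

# Keep London as its own level and lump every other level into "notLondon".
region2 <- fct_collapse(region, London = "London", other_level = "notLondon")

table(region2)
```

Storing the result as a new column in the data frame lets the two-level factor be used as a predictor in lm().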
2. For the second part, you will have to clarify and share some data to get more help.
3. Here you can use dplyr::summarize(n = n()), but you need to share some data to get an answer for your specific case.
However something like:
df %>% group_by(birth_year) %>% summarize(n = n())
will give you the number of observations (rows) for each birth year, not the number of distinct participants.
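To count people rather than rows, the distinct participant identifiers can be counted within each birth year. A minimal sketch, assuming the data has a participant ID column (called participant here, which is a guess, on made-up data):

```r
library(dplyr)

# Made-up data: three participants, several observations each.
df <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2", "p3"),
  birth_year  = c(1996, 1996, 1996, 1996, 1996, 1990)
)

per_year <- df %>%
  group_by(birth_year) %>%
  summarise(observations = n(),                        # number of rows
            participants = n_distinct(participant))    # number of distinct people

per_year
```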

How can I calculate a new dataframe only for one outcome type?

I'm working with some data that involves participants running on a cognitive task that measures their outcome (Correct or Incorrect) and reaction time (RT) (the entire dataset is called practice). For each participant, I want to create a new dataframe with their average RT when they got the answer correct, and one for when they were incorrect. I've tried
practice %>%
mutate(correctRT = mean(practice$RT[practice$Outcome=="Correct"]))
Using dplyr and tidyverse, as well as
correctRT <- c(mean(practice$RT[practice$Outcome=="Correct"]))
(which I'm sure isn't the correct way to do it) and nothing seems to be working. I'm a complete novice and am working with this dataset in order to learn how to do stats with R and just can't find any answers with R.
In R you can "keep" multiple objects (e.g. data frames) in a single list. This saves you from storing every (sub)dataframe in a separate variable (e.g. by subsetting your data by Participant and Outcome and storing each piece). This will come in handy when you have "many" individuals and filtering and storing each (sub)dataframe manually becomes prohibitive.
Conceptually, your problem is to "subset" your data to the Participant and Outcome you aim for and calculate the mean on this group.
The following is based on {tidyverse}, i.e. {dplyr}.
data
As you have not provided a reproducible example, this is a quick hack of your data:
practice <- data.frame(
Participant = c("A","A","A","B","B","B","B","C","C","D"),
RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
Outcome = c("Incorrect","Correct", "Correct","Incorrect","Incorrect","Correct", "Correct","Incorrect","Correct", "Correct")
)
which looks like the following:
practice
Participant RT Outcome
1 A 10 Incorrect
2 A 12 Correct
3 A 14 Correct
4 B 9 Incorrect
5 B 12 Incorrect
6 B 13 Correct
7 B 17 Correct
8 C 11 Incorrect
9 C 13 Correct
10 D 17 Correct
splitting groups of a dataframe
The {tidyverse} provides some neat functions for the general data processing.
{dplyr} has a group_split() function that returns such a list.
library(dplyr)
practice %>% group_split(Participant, Outcome)
<list_of<
tbl_df<
Participant: character
RT : double
Outcome : character
>
>[7]>
[[1]]
# A tibble: 2 x 3
Participant RT Outcome
<chr> <dbl> <chr>
1 A 12 Correct
2 A 14 Correct
[[2]]
...
You can address the respective list-elements with the [[]] notation.
Store the list in a variable and try my_list_name[[3]] to extract the 3rd element.
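As a small sketch of that step, using the example data from above (my_list and third are arbitrary names):

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect", "Correct", "Correct", "Incorrect", "Incorrect",
              "Correct", "Correct", "Incorrect", "Correct", "Correct")
)

# One list element per Participant/Outcome combination, ordered by the group keys.
my_list <- practice %>% group_split(Participant, Outcome)

length(my_list)        # number of sub-dataframes
third <- my_list[[3]]  # third element: participant B, correct answers
mean(third$RT)         # mean reaction time within that element
```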
potential summary for your problem
If you do not need a list you could wrap this into a data summary.
If you want to split on Outcomes, you may want to filter your data in 2 sub-dataframes only holding the respective outcome (e.g. correct <- practice %>% filter(Outcome == "Correct")).
Group your data dependent on the summary you want to construct.
Use summarise() to summarise your groups into a 1-row summary.
Note you can combine multiple operations. For example next to the mean reaction time, the following counts the number of rows (:= attempts).
practice %>%
group_by(Participant, Outcome) %>%
##--------- summarise data into 1 row summarise
summarise( Mean_RT = mean(RT) # calculate mean reaction time
,Attempts = n() ) # how many times
This yields:
# A tibble: 7 x 4
# Groups: Participant [4]
Participant Outcome Mean_RT Attempts
<chr> <chr> <dbl> <int>
1 A Correct 13 2
2 A Incorrect 10 1
3 B Correct 15 2
4 B Incorrect 10.5 2
5 C Correct 13 1
6 C Incorrect 11 1
7 D Correct 17 1
Please note that this is a grouped data frame. If you further process the data, you need to "remove" the grouping. Otherwise any follow up operation in a pipe will be on the group-level.
For this you can either use summarise(...., .groups = "drop") or you add ... %>% ungroup() to your pipe.
If you need to split the result, see group_split() above.
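Putting these pieces together, a sketch that drops the grouping in the summarise() call and then keeps only the correct trials, which is roughly the per-participant data frame the question asks for (summary_df and correct_rt are illustrative names):

```r
library(dplyr)

practice <- data.frame(
  Participant = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  RT = c(10, 12, 14, 9, 12, 13, 17, 11, 13, 17),
  Outcome = c("Incorrect", "Correct", "Correct", "Incorrect", "Incorrect",
              "Correct", "Correct", "Incorrect", "Correct", "Correct")
)

summary_df <- practice %>%
  group_by(Participant, Outcome) %>%
  summarise(Mean_RT = mean(RT),   # mean reaction time per group
            Attempts = n(),       # number of rows per group
            .groups = "drop")     # return an ungrouped data frame

# Mean reaction time on correct trials, one row per participant.
correct_rt <- summary_df %>% filter(Outcome == "Correct")
correct_rt
```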

How can I count and compare data over multiple columns in R?

I am working with a dataframe which contains text data which has been categorised and coded. Each numerical value from 1-12 represent a type of word.
I want to count the frequencies of occurrence each number (1 to 12) over 6 columns (pre1 to pre6) so I know how many types of words have been used. Could anyone please advise on how to do this?
My df is structured as such:
Would something like that work for you?
library(dplyr)
df <- data.frame(pre1 = c(sample(1:12, 10)),
pre2 = c(sample(1:12, 10)),
pre3 = c(sample(1:12, 10)),
pre4 = c(sample(1:12, 10)),
pre5 = c(sample(1:12, 10)),
pre6 = c(sample(1:12, 10)))
count <- count(df, pre1, pre2, pre3, pre4, pre5, pre6)
One solution is this:
library(tidyverse)
mtcars %>%
select(cyl, am, gear) %>% # select the columns of interest
gather(column, number) %>% # reshape
count(column, number) # get counts of numbers for each column
# # A tibble: 8 x 3
# column number n
# <chr> <dbl> <int>
# 1 am 0 19
# 2 am 1 13
# 3 cyl 4 11
# 4 cyl 6 7
# 5 cyl 8 14
# 6 gear 3 15
# 7 gear 4 12
# 8 gear 5 5
In your case, column will take the values pre1, pre2, etc., number will take the values 1 - 12, and n will be the count of a specific number in a specific column.
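The same reshape-then-count idea can be sketched on columns named like the asker's, here using tidyr::pivot_longer(), the newer replacement for gather(). The data are simulated, since the original data frame was not shared.

```r
library(dplyr)
library(tidyr)

# Simulated stand-in for the asker's data: six coded columns with values 1-12.
set.seed(1)
df <- data.frame(pre1 = sample(1:12, 10, replace = TRUE),
                 pre2 = sample(1:12, 10, replace = TRUE),
                 pre3 = sample(1:12, 10, replace = TRUE),
                 pre4 = sample(1:12, 10, replace = TRUE),
                 pre5 = sample(1:12, 10, replace = TRUE),
                 pre6 = sample(1:12, 10, replace = TRUE))

counts <- df %>%
  pivot_longer(cols = pre1:pre6, names_to = "column", values_to = "number") %>%  # reshape to long
  count(column, number)                                                            # counts per column and code

counts
```

Replacing count(column, number) with count(number) gives the frequencies over all six columns together.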
It is not entirely clear from the question whether you want frequency tables for all of these columns together or for each column separately. In any further questions you should also make clear whether those numbers are coded as numerics, as characters or as factors (the result of str(pCat) is a good way to do that). For this particular question, it fortunately does not matter.
The answers I have already given in the comments are
table(unlist(pCat[,4:9]))
and
table(pCat$pre3)
As an extension of the latter, I shall also point to the comment by ANG, which boils down to
lapply(pCat[,4:9], table)
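Since pCat itself was not shared, here is a sketch of the same base R calls on a simulated stand-in (df is a made-up data frame holding only the six coded columns). Wrapping the values in factor(..., levels = 1:12) is an optional extra that makes codes which never occur show up with a count of 0.

```r
# Simulated stand-in for pCat[, 4:9]: six coded columns with values 1-12.
set.seed(1)
df <- data.frame(pre1 = sample(1:12, 10, replace = TRUE),
                 pre2 = sample(1:12, 10, replace = TRUE),
                 pre3 = sample(1:12, 10, replace = TRUE),
                 pre4 = sample(1:12, 10, replace = TRUE),
                 pre5 = sample(1:12, 10, replace = TRUE),
                 pre6 = sample(1:12, 10, replace = TRUE))

# Frequencies over all six columns together, including codes with zero count.
overall <- table(factor(unlist(df), levels = 1:12))
overall

# One frequency table per column.
per_column <- lapply(df, table)
per_column$pre3
```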
These are straightforward solutions in base R without any further unnecessary packages. The answers by JonGrub and AntoniosK are based on the tidyverse. There is no obvious need to import dplyr or tidyverse for this problem, but I guess the authors load those packages anyway whenever they use R, so it does not really impose any cost on them. Other great packages to base good answers on are data.table and sqldf. Those are good packages, and many people use them for a lot of things that could be done in base R. The packages promise to be clearer or faster, or to reuse knowledge you might already have. Nothing is wrong with that. However, I take your question as an indication that you are still learning R, and I would advise you to learn R first, before you become distracted by learning special packages and DSLs.
People have been using base R for decades and they will continue to do so. Learning base R will not distract you from R, and the knowledge will continue to be worthwhile for decades. Whether the same can be said for the tidyverse or data.table, time will tell (although sqldf is probably also a solid investment in the future, maybe more so than R).
