I'm struggling to get window functions working in R to rank groups by the number of rows.
Here's my sample code:
data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")
data %>%
group_by(state) %>%
mutate(num_books = n(),
state_rank = dense_rank(num_books)) %>%
arrange(num_books)
The expected output is that the original data will have a new column that tells me the rank for each row (book, state, price and review) depending on whether that row is for a state with the most book reviews (would have state_rank of 1); second most books (rank 2), etc.
Manually I can get the output like this:
manual_ranks <- data %>%
count(state) %>%
mutate(state_rank = rank(state))
desired_output <- data %>%
left_join(manual_ranks)
In other words, I want the last column of this table:
data %>%
count(state) %>%
mutate(state_rank = rank(state))
added to each row of the original table (without having to create this table and then using left_join by state; that's the point of window functions).
Anyway, with the original code, you'll see that all state_rank just say 1, when I would expect states with the most book reviews to be ranked 1, second most reviews would have 2, etc.
My goal is to then be able to filter by, say, state_rank <= 4. That is, I want to keep all the rows in the original data for the top 4 states with the most book reviews.
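One possible explanation and workaround, offered as a sketch rather than a definitive answer: inside group_by(state), dense_rank() is evaluated separately within each state, where num_books is constant, so every row gets rank 1. Counting per state with add_count() and then ranking in an ungrouped mutate() compares states against each other. The small reviews table below is made up and only stands in for the real CSV.

```r
library(dplyr)

# Hypothetical stand-in for the book_reviews data (values are made up).
reviews <- tibble(
  book  = c("a", "b", "c", "d", "e", "f"),
  state = c("NY", "NY", "NY", "TX", "TX", "CA")
)

ranked <- reviews %>%
  # add_count() attaches the per-state row count and keeps every row
  add_count(state, name = "num_books") %>%
  # no grouping is active here, so ranks are computed across all states
  mutate(state_rank = dense_rank(desc(num_books)))

# Keep all rows belonging to the two states with the most reviews
top_states <- ranked %>%
  filter(state_rank <= 2)
```

With dense_rank(), states that tie on the number of reviews share a rank, so a filter like this can return more states than the cutoff suggests.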
I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==1) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
Subj2 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==2) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output looks like.
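In the meantime, here is a minimal sketch of how a function like this could be applied to every subject and the results stacked into one table. The data below is made up (only the column names come from the question), and summarise_subject is a hypothetical variant of the function above that also records the subject number.

```r
library(dplyr)

# Made-up trials using the column names from the question.
rtTrialsYA_s1 <- tibble(
  Subject             = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(100, 600, 1100, 50, 450),
  RetRating.RT        = c(300, 250, 400, 200, 350),
  RetRating.RTTime    = c(400, 850, 1500, 250, 800)
)

# Hypothetical helper: one summary row for a single subject.
summarise_subject <- function(subjectNumber) {
  rtTrialsYA_s1 %>%
    filter(Subject == subjectNumber) %>%
    summarise(
      Subject   = subjectNumber,
      testStart = min(RetRating.OnsetTime),
      testEnd   = max(RetRating.RTTime)
    ) %>%
    mutate(testDuration = testEnd - testStart,
           timeBin      = testDuration / 5)
}

# Call the helper once per subject and stack the one-row results.
all_subjects <- unique(rtTrialsYA_s1$Subject) %>%
  lapply(summarise_subject) %>%
  bind_rows()
```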
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The code above takes all columns except Subject and pivots them into two columns: one containing the former column name (called name by default) and one containing the former value (called value by default).
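If the eventual aim is to label each trial with the 20% time bin it falls into, as the question describes, one possible sketch is a grouped mutate instead of summarise, so that the per-subject start time and bin width are repeated on every row. The data is made up, and assigning bins by trial onset time is an assumption about what is wanted.

```r
library(dplyr)

# Made-up trials using column names from the question.
trials <- tibble(
  Subject             = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(100, 600, 1100, 50, 450),
  RetRating.RTTime    = c(400, 850, 1500, 250, 800)
)

binned <- trials %>%
  group_by(Subject) %>%
  mutate(
    testStart = min(RetRating.OnsetTime),                 # per-subject start
    timeBin   = (max(RetRating.RTTime) - testStart) / 5,  # width of one 20% bin
    # floor() gives a 0-based bin index; + 1 and pmin() keep it in 1..5
    bin = pmin(floor((RetRating.OnsetTime - testStart) / timeBin) + 1, 5)
  ) %>%
  ungroup()
```

From there, something like group_by(Subject, bin) followed by summarise would give per-bin reaction time statistics.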
I am dealing with a dataset of player statistics for a sport. There is an error in the data: in one week, a player who doesn't exist has been attributed the data that belongs to a real player. I need to aggregate the two players' data and delete the false player's row.
I need to adjust my preprocessing code to handle this, so that when I scrape future weeks' data I don't need to make manual adjustments.
df <- data.frame(Name = c("Bob","Ben","Bill"),
Team = c("Dogs","Cats","Birds"),
Runs = c(6, 4, 2))
I'd like to aggregate the two rows based on df$Name, e.g. when df$Name is "Bob" or "Bill", aggregate columns [3:40]. These are my columns with numeric statistics; [1:2] hold df$Name and df$Team.
It would depend on the type of aggregation you are trying to do. This looks like a perfect use of the group_by from the dplyr package. Consider the CO2 data set.
library(dplyr)
CO2 %>%
group_by(Plant) %>%
summarise(
n = n(), #Calculate number of rows in each group
meanUptake = mean(uptake) # Aggregate data and take mean for each group
) %>%
ungroup()
Here the data is split by group; in your case the grouping column would be Name. If you wish to keep extra information (like Team) in the result, compute it inside the summarise call as well.
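Applied to the player data, a rough sketch might look like the following. It assumes "Bill" is the nonexistent player whose numbers belong to "Bob"; the aliases lookup and the Catches column are hypothetical additions for illustration, and sum is assumed to be the right aggregation for every numeric column.

```r
library(dplyr)

df <- data.frame(
  Name    = c("Bob", "Ben", "Bill"),
  Team    = c("Dogs", "Cats", "Birds"),
  Runs    = c(6, 4, 2),
  Catches = c(1, 0, 3)   # hypothetical second statistic
)

# Hypothetical lookup: wrong name -> real player it should be merged into.
aliases <- c(Bill = "Bob")

fixed <- df %>%
  mutate(
    is_alias = Name %in% names(aliases),
    Name     = if_else(is_alias, unname(aliases[Name]), Name)
  ) %>%
  arrange(is_alias) %>%   # real rows first, so first(Team) is the real team
  group_by(Name) %>%
  summarise(
    Team = first(Team),
    across(where(is.numeric), sum),  # sums Runs, Catches, and any other numeric column
    .groups = "drop"
  )
```

Future bad names can then be handled by adding entries to aliases rather than editing the pipeline.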
Being more or less a beginner in R, I have a quick question. I would like to attach a series of elements (country number) to different categories (n°id). The idea is as follows: as soon as a country number is assigned to the same id number 3 times in a row, it is attached to this id number. Here is a simplified example below:
[Example tables not shown: starting database and desired outcome]
I think I can do this using the R program, although I couldn't find similar questions on the different forums.
Thank you very much for your help,
Gauthier
Assuming an n-to-n relationship between country number and id (i.e. each country can have 0-n IDs and each ID can be tied to 0-n countries), here is one solution:
library(dplyr)
dataframe %>%
mutate(Count = 1) %>%
group_by(`Country number`, `n°id`) %>%
summarise(Count = sum(Count, na.rm = TRUE)) %>%
ungroup() %>%
filter(Count >= 3) %>%
select(-Count)
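Note that the code above counts every occurrence of a country/id pair, whether or not the occurrences are consecutive. If "3 times in a row" must be taken literally, one possible sketch is to number the runs of identical ids within each country and keep only runs of length 3 or more. The data layout and the column names country and id are assumptions, since the original tables are not available, and the rows are assumed to already be in the relevant order.

```r
library(dplyr)

# Made-up observations, assumed to be in chronological order.
obs <- tibble(
  country = c(1, 1, 1, 1, 2, 2, 2, 2),
  id      = c(10, 10, 10, 20, 30, 40, 30, 30)
)

attached <- obs %>%
  group_by(country) %>%
  # run increases by one each time id differs from the previous row
  mutate(run = cumsum(id != lag(id, default = first(id)))) %>%
  group_by(country, id, run) %>%
  summarise(run_length = n(), .groups = "drop") %>%
  filter(run_length >= 3) %>%
  distinct(country, id)
```

In this toy data, country 2 sees id 30 three times in total but never three times in a row, so only country 1 is attached (to id 10).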
I have two datasets:
DS1 - contains a list of subjects with columns for name, ID number and employment status
DS2 - contains the same list of subject names and ID numbers, but some of the ID numbers are missing from this second dataset.
It also contains a third column for Education Level.
I want to merge the Education column onto the first dataset. I have done this using the merge function sorting by ID number but because some of the ID numbers are missing on the second data set I want to merge the remaining Education level by name as a secondary option. Is there a way to do this using dplyr/tidyverse?
There are two ways you can do this. Choose the one based on your preference.
1st option:
#here I left join twice and select columns each time to ensure there is no duplication like '.x' '.y'
finalDf = DS1 %>%
dplyr::left_join(DS2 %>%
dplyr::select(ID,EducationLevel1=EducationLevel),by=c('ID')) %>%
dplyr::left_join(DS2 %>%
dplyr::select(Name,EducationLevel2=EducationLevel),by=c('Name')) %>%
dplyr::mutate(FinalEducationLevel = ifelse(is.na(EducationLevel1),EducationLevel2,EducationLevel1))
2nd option:
#first find the IDs which are present in the 2nd dataset
commonIds = DS1 %>%
dplyr::inner_join(DS2 %>%
dplyr::select(ID,EducationLevel),by=c('ID'))
#now the records where ID was not present in DS2
idsNotPresent = DS1 %>%
dplyr::filter(!ID %in% commonIds$ID) %>%
dplyr::left_join(DS2 %>%
dplyr::select(Name,EducationLevel),by=c('Name'))
#bind these two dfs to get the final df
finalDf = bind_rows(commonIds,idsNotPresent)
Let me know if this works.
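For reference, here is a compact variant of the first option that can be run on toy data. It is only a sketch: DS1 and DS2 below are made up, the column names follow the code above, names are assumed to be unique, and dplyr::coalesce() takes the place of ifelse(is.na(...)).

```r
library(dplyr)

# Made-up data; Bo's ID is missing in DS2.
DS1 <- tibble(Name = c("Ann", "Bo", "Cy"),
              ID = c(1, 2, 3),
              Employment = c("Y", "N", "Y"))
DS2 <- tibble(Name = c("Ann", "Bo", "Cy"),
              ID = c(1, NA, 3),
              EducationLevel = c("BSc", "MSc", "PhD"))

# Lookup tables: one keyed by ID (rows with a known ID only), one keyed by Name.
by_id   <- DS2 %>% filter(!is.na(ID)) %>% select(ID, edu_by_id = EducationLevel)
by_name <- DS2 %>% select(Name, edu_by_name = EducationLevel)

merged <- DS1 %>%
  left_join(by_id, by = "ID") %>%
  left_join(by_name, by = "Name") %>%
  # prefer the ID match, fall back to the Name match
  mutate(EducationLevel = coalesce(edu_by_id, edu_by_name)) %>%
  select(-edu_by_id, -edu_by_name)
```

If names can repeat in DS2, the Name join will duplicate rows, so the name lookup would need deduplicating first.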
The second option in makeshift-programmer's answer worked for me. Thank you so much. I had to play around with it for my actual data sets, but the basic structure worked very well and it was easy to adapt.
Ciao, I have two columns. Every row represents one student. The first column tells what class the student is in. The second column tells if the student passed an exam.
Here is my replicating example.
This is the data I have now:
a=c("A","A","A","A","B","B","B","C","C")
b=c(0,0,1,0,0,0,0,1,1)
mydata=data.frame(a,b)
names(mydata)=c("CLASS","PASSED")
This is the data I seek to attain:
a1=c("A","B","C")
b1=c(4,3,2)
c1=c(1,0,2)
mydataWANT=data.frame(a1,b1,c1)
names(mydataWANT)=c("CLASS","SIZE","PASSED")
Here is my attempt with the dplyr package:
mydataWANT <- data.frame(mydata %>%
group_by(CLASS,PASSED) %>%
summarise(N = n()))
yet it does not yield the desired output.
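One possible fix, sketched under the assumption that PASSED is always coded 0/1: group by CLASS only, then compute both the class size and the number of passes in a single summarise. Grouping by CLASS and PASSED together, as in the attempt above, gives one row per class/outcome combination instead of one row per class.

```r
library(dplyr)

mydata <- data.frame(
  CLASS  = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  PASSED = c(0, 0, 1, 0, 0, 0, 0, 1, 1)
)

mydataWANT <- mydata %>%
  group_by(CLASS) %>%
  summarise(
    SIZE   = n(),          # number of students in the class
    PASSED = sum(PASSED),  # 0/1 coding makes the sum a count of passes
    .groups = "drop"
  )
```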