Same difference in alternate rows per sub group - r

I have a data frame where I need to find the difference in number within each sub-group, but for every alternate row the difference should stay the same, since the things to do are the same. This is what I have used:
things <- data.frame(category = c("A","B","A","B","A","B","A","B","A","B"),
                     things2do = c("ball","ball","bat","bat","hockey","hockey",
                                   "volley ball","volley ball","foos ball","foos ball"),
                     number = c(12, 5, 4, 1, 0, 2, 2, 0, 0, 2))
library(dplyr)
things %>%
  mutate(diff = number - lead(number, order_by = things2do))
but it is not helpful, as it does not give the result I expect. Can I get some help here?

library(tidyverse)
things2 <- things %>%
  spread(category, number) %>%
  mutate(diff = B - A) %>%
  gather(category, number, A:B) %>%
  select(category, things2do, number, diff) %>%
  arrange(things2do)
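spread() and gather() still work but are superseded in current tidyr; an equivalent sketch with pivot_wider()/pivot_longer() (assuming tidyr 1.0 or later) would be:
things %>%
  pivot_wider(names_from = category, values_from = number) %>%
  mutate(diff = B - A) %>%
  pivot_longer(c(A, B), names_to = "category", values_to = "number") %>%
  select(category, things2do, number, diff) %>%
  arrange(things2do)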

One way is to group the data by things2do and subsequently take an iterated difference.
library(dplyr)
things %>%
  group_by(things2do) %>%
  mutate(diff = diff(number))
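Note that diff() returns one value per two-row group (second minus first, so -7 for ball), which mutate() recycles to both rows. If the opposite direction is wanted, matching number - lead(number) from the question, a sketch with dplyr's first() and last() (assuming exactly two rows per things2do) could be:
things %>%
  group_by(things2do) %>%
  mutate(diff = first(number) - last(number)) %>%
  ungroup()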

Related

Subtract V1 and V0 from other visit values in long format using grouping

I have a longitudinal dataset in long format, and I want to create two variables: one where the visit = 1 value for a given ID is subtracted from all other values, and one where the visit = 0 value for a given ID is subtracted from all other values. Sample dataset:
df <- data.frame(value = rnorm(30),
                 visit = rep(0:9, 3),
                 id = c(rep("A", 10), rep("B", 10), rep("C", 10)))
I know one method is to use the following code:
df %>%
  group_by(id) %>%
  arrange(visit, .by_group = TRUE) %>%
  mutate(value_chg_v1 = value - nth(value, n = 2),
         value_chg_v0 = value - nth(value, n = 1))
but I want to construct the code so that if an entire row is missing for visit = 0, value_chg_v0 = NA. Is there a way to do this that does not involve adding in the rows where visit = 0 is missing? Thanks in advance.
We could use complete to create the missing combinations, and drop the padded rows at the end:
library(dplyr)
library(tidyr)
df %>%
  complete(id, visit = 0:9) %>%
  group_by(id) %>%
  arrange(visit, .by_group = TRUE) %>%
  mutate(value_chg_v1 = value - nth(value, n = 2),
         value_chg_v0 = value - nth(value, n = 1)) %>%
  ungroup() %>%
  drop_na(value)
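If padding rows with complete() is to be avoided entirely, as the question asks, a sketch that indexes value by visit number directly (assuming visits are labelled 0 and 1) could be used; match() returns NA when the visit is absent, so the subtraction yields NA automatically:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(value_chg_v1 = value - value[match(1, visit)],
         value_chg_v0 = value - value[match(0, visit)]) %>%
  ungroup()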

Finding the first row after which x rows meet some criterion in R

A data wrangling question:
I have a dataframe of hourly animal tracking points with columns for id, time, and whether the animal is on land or in water (0 = water; 1 = land). It looks something like this:
set.seed(13)
n <- 100
dat <- data.frame(id = rep(1:5, each = 10),
                  datetime = seq(as.POSIXct("2020-12-26 00:00:00"),
                                 as.POSIXct("2020-12-30 03:00:00"), by = "hour"),
                  land = sample(0:1, n, replace = TRUE))
What I need to do is flag the first row after which the animal uses land at least once for 3 straight days. I tried doing something like this:
library(dplyr)
library(tidyr)
dat$ymd <- as.Date(dat$datetime) # make a column for year-month-day
# add up land points within each id-day group
land.pts <- dat %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  drop_na(land) %>%
  mutate(all.land = cumsum(land))
# flag days that have any land points
flag <- land.pts %>%
  group_by(id, ymd) %>%
  arrange(id, datetime) %>%
  slice(n()) %>%
  mutate(flag = if_else(all.land == 0, 0, 1))
# combine the flagged dataframe with the full dataframe
comb <- left_join(land.pts, flag)
comb[is.na(comb)] <- 1
and then I tried this:
x <- comb %>%
  group_by(id) %>%
  arrange(id, datetime) %>%
  mutate(time.land = ifelse(land == 0 | is.na(lag(land)) | lag(land) == 0 | flag == 0,
                            0,
                            difftime(datetime, lag(datetime), units = "days")))
But I still can't quite work out how to determine when the animal has been on land at least once a day for three days straight, and then flag that first point on land. Thanks so much for any help you can provide!
Create a date column from the timestamp, then summarise the data so that only one row is kept for each id and date, showing whether the animal was on land at least once during that day. Use zoo's rollapply to mark a day as TRUE when the animal was on land on that day and on each of the following two days (a left-aligned 3-day window).
library(dplyr)
library(zoo)
dat <- dat %>% mutate(date = as.Date(datetime))

dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1)) %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  ungroup() %>%
  # if you want all the rows of the data
  left_join(dat, by = c('id', 'date'))
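If only the first qualifying day per animal is needed, a hedged follow-up reusing the same summary could filter on consec_three (the first TRUE marks the start of a 3-day run):
dat %>%
  group_by(id, date) %>%
  summarise(on_land = any(land == 1), .groups = "drop_last") %>%
  mutate(consec_three = rollapply(on_land, 3, all, align = 'left', fill = NA)) %>%
  filter(consec_three) %>% # drops FALSE and NA days
  slice_head(n = 1) %>%    # first flagged day per id
  ungroup()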

Finding the top n represented entries in a grouped dataframe in R

I am a beginner in R and would be very thankful for a response, as I am stuck on this code (this is my attempt at solving the problem, but it does not work):
library(jsonlite)
library(dplyr)
personal_spotify_df <- fromJSON("data/StreamingHistory0.json")
personal_spotify_df <- personal_spotify_df %>%
  mutate(minutesPlayed = msPlayed / 1000 / 60)
personal_spotify_df_ranked <- personal_spotify_df %>%
  group_by(artistName) %>%
  filter(top_n(15, max(nrows()))) # this is the part that errors
I have a dataframe (see below for a screenshot of how it's structured) which is my Spotify listening history. I want to group this dataframe by artist and then arrange the new dataframe to show the top 15 artists with the most songs listened to. I am stuck on how to get from grouping by artistName to actually filtering out the top 15 most-represented artists from the dataframe.
The dataframe
We may use slice_max, with n specified as 15 and the order column created with add_count
library(dplyr)
personal_spotify_df %>%
  add_count(artistName, name = "Count") %>%
  slice_max(n = 15, order_by = Count) %>%
  select(-Count)
If we want to get only the top 15 distinct 'artistName' values:
personal_spotify_df %>%
  count(artistName, name = "Count") %>%
  slice_max(n = 15, order_by = Count)
Or an option with filter after arranging the rows based on the count:
personal_spotify_df %>%
  add_count(artistName) %>%
  arrange(desc(n)) %>%
  filter(artistName %in% head(unique(artistName), 15))
In base R, you can make use of table, sort and head to get the top 15 artists with their counts:
table(personal_spotify_df$artistName) |>
  sort(decreasing = TRUE) |>
  head(15) |>
  stack()
The pipe operator (|>) requires R 4.1; if you have a lower version, use:
stack(head(sort(table(personal_spotify_df$artistName), decreasing = TRUE), 15))
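One caveat worth noting: slice_max() keeps ties by default, so more than 15 rows can come back when several artists share the cutoff count; with_ties = FALSE caps the result at exactly 15:
personal_spotify_df %>%
  count(artistName, name = "Count") %>%
  slice_max(Count, n = 15, with_ties = FALSE)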

Finding closest matching number between 2 dataframes using a grouped dplyr identifier

I have 2 datasets, each with a `Patient ID` and a collection date measured from the same start date ("fromstart"). In order to join these dataframes together, I'd like to match each sample in d1 to its closest neighbor in d2. How can this be done with a function in dplyr?
d1 <- data.frame(`Patient ID` = c(rep("001", 4), rep("002", 5)),
                 fromstart = c(-5, 30, 90, 150, -10, 15, 45, 100, 250), check.names = FALSE)
d2 <- data.frame(`Patient ID` = c(rep("001", 7), rep("002", 4)),
                 fromstart = c(-20, 10, 30, 50, 90, 110, 150, -10, 15, 45, 100), check.names = FALSE)
closest_date <- function(cases, d2) {
  d2 %>% select(`Patient ID`, fromstart) %>% unique() %>%
    filter(`Patient ID` == cases$`Patient ID`) %>% rowwise() %>%
    mutate(date_match = as.numeric(cases$fromstart[which.min(abs(fromstart - cases$fromstart))]))
}
d1 %>% select(`Patient ID`, fromstart) %>% unique() %>%
  group_by(`Patient ID`) %>% rowwise() %>%
  mutate(closest = closest_date(., d2))
If I understood your problem correctly, you want to join by patient ID and then keep those rows where the difference between the fromstart values is smallest? If so, this would be a solution:
library(dplyr)
d1 %>%
  dplyr::full_join(d2, by = c("Patient ID"), suffix = c("_1", "_2")) %>%
  dplyr::mutate(DIF = abs(fromstart_1 - fromstart_2)) %>%
  dplyr::group_by(`Patient ID`, fromstart_1) %>%
  dplyr::filter(DIF == min(DIF))
As you can see, this does not work perfectly if you want unique combinations, because there can be cases where the distance is the same; then again, maybe I did not get your question right.
Instead of using the absolute value, you could also filter for positive differences/distances if you want fromstart in the second table to be larger than in the first; this would reduce double entries to a certain degree.
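To force a single match per d1 row (arbitrary among exact ties), a sketch under the same join would be slice_min() with with_ties = FALSE:
library(dplyr)
d1 %>%
  full_join(d2, by = "Patient ID", suffix = c("_1", "_2")) %>%
  mutate(DIF = abs(fromstart_1 - fromstart_2)) %>%
  group_by(`Patient ID`, fromstart_1) %>%
  slice_min(DIF, n = 1, with_ties = FALSE) %>%
  ungroup()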

transform() to add rows with dplyr()

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr() to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem is that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into the dplyr() chain, which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site = c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),
                 purchase = c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site = c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),
                     purchase = c("a1","a2","a1","a2","a1","a1","a2"),
                     bin = c(1,2,6,2,4,11,4),
                     bin_per = c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66666))
Current code:
library(dplyr)
df %>%
  group_by(site, purchase) %>%
  summarize(bin = sum(purchase == purchase)) %>%
  group_by(site) %>%
  mutate(bin_per = (bin / sum(bin) * 100))

df %>%
  rbind(df, transform(df, site = "ALLSITES")) %>%
  group_by(site, purchase) %>%
  summarize(bin = sum(purchase == purchase)) %>%
  group_by(site) %>%
  mutate(bin_per = (bin / sum(bin) * 100))
We can start from the output of the first code block: after grouping by a 'site' column set to the constant string 'ALLSITES' together with 'purchase', get the sum of 'bin' and then 'bin_per', and finally row-bind the two datasets with bind_rows:
# df1 is the output of the first code block above (per-site bin and bin_per)
df1 %>%
  ungroup() %>%
  group_by(site = 'ALLSITES', purchase) %>%
  summarise(bin = sum(bin)) %>%
  ungroup() %>%
  mutate(bin_per = 100 * (bin / sum(bin))) %>%
  bind_rows(df1, .)
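An alternative sketch, not from the answer above, is to append an ALLSITES copy of the raw data first and then tally everything in one pass; count() plays the role of the sum(purchase == purchase) idiom:
library(dplyr)
bind_rows(df, mutate(df, site = "ALLSITES")) %>%
  count(site, purchase, name = "bin") %>%
  group_by(site) %>%
  mutate(bin_per = bin / sum(bin) * 100) %>%
  ungroup()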
