Issues with dplyr and ggplot2 on summarizing data - r

I have an issue when trying to use dplyr and ggplot2 to summarize data. I have a data set (Excel file) that I imported:
df<-read.xlsx('sample.xlsx', sheet = 1)
Here is a sample of the data:
date user vert aff browser clicks age rpc installs revenue Week Month Year
1 2017-10-25 2017-10-25 maps_1 appfocus1 Chrome 13 0 0.4436 37 5.7668 43 10 2017
2 2017-10-25 2017-10-25 maps_1 appfocus1 Chrome 1140 0 0.4436 2914 505.7040 43 10 2017
3 2017-10-25 2017-10-25 maps appfocus84 Chrome 2189 0 0.4436 7543 971.0404 43 10 2017
4 2017-10-25 2017-10-25 maps_1 appfocus1 Firefox 1 0 0.4436 6 0.4436 43 10 2017
5 2017-10-25 2017-10-25 maps_1 appfocus1 Firefox 123 0 0.4436 170 54.5628 43 10 2017
6 2017-10-25 2017-10-25 maps appfocus84 Firefox 331 0 0.4436 497 146.8316 43 10 2017
source
1 googlepartner
2 search
3 NULL
4 googlepartner
5 search
6 NULL
The code below takes the column "affiliate" and generates the sum of two fields grouped by that column. Then I create a calculated field by "affiliate":
UC10 <- filter(df, UCMonth == 10)
UC101 <- UC10 %>% group_by(affiliate) %>%
summarise_at(vars(revenue,installs),sum)%>%
mutate(RPI = revenue/installs)
And get the below data:
# A tibble: 2 x 4
affiliate revenue installs RPI
<chr> <dbl> <dbl> <dbl>
1 appfocus1 53603. 809580 0.0662
2 appfocus84 174479. 2768181 0.0630
Then I try to plot, by affiliate, the total RPI using ggplot2:
gcor <- ggplot(UC101, aes(x = affiliate, y = RPI)) +
geom_boxplot(color = "dark red")
My problem is the output of the graph.
Can anyone help understand why it isn't showing a full boxplot? This is really my first time using dplyr and ggplot2 together, so any help would be appreciated.
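For what it's worth, the likely cause is that after summarise_at() the tibble UC101 holds a single RPI value per affiliate, and a boxplot needs a distribution of values to draw quartiles; with one value per group, geom_boxplot() collapses to a flat line. A minimal sketch (re-creating the summarised tibble by hand) using a bar chart, which fits one-value-per-group data:

```r
library(ggplot2)

# Re-create the summarised tibble from the question by hand
UC101 <- data.frame(
  affiliate = c("appfocus1", "appfocus84"),
  RPI       = c(0.0662, 0.0630)
)

# One value per affiliate: a bar chart shows it directly
gcor <- ggplot(UC101, aes(x = affiliate, y = RPI)) +
  geom_col(fill = "dark red")

# For a true boxplot, compute RPI per row *before* summarising, e.g.:
# UC10 %>% mutate(RPI = revenue / installs) %>%
#   ggplot(aes(affiliate, RPI)) + geom_boxplot()
```

The commented alternative keeps one RPI per raw row, which gives geom_boxplot() an actual distribution to summarise.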

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured from permanent plots. These plots are continuously remeasured every couple of years. The data sort of looks like the table at the bottom. I used the following code to separate the dataset to slice the initial measurement at t1. Now, I want to slice t2 which is the remeasurement that is one step greater than the minimum_Cycle or minimum_Measured_year. This is particularly a problem for plots that have more than two remeasurements (num_obs > 2) and the measured_year intervals and cycle intervals are different.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the Cycle order. Note that this should run on the full df (not the already-sliced df_Time1), and that dense_rank() is what you want here, since order() returns a sorting permutation rather than a rank:
df %>%
group_by(State, County, Plot) %>%
mutate(t = dense_rank(Cycle))
You can then filter on t == 1 or t == 2, etc.
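A runnable sketch of this ranking idea on the question's sample table (columns trimmed to the ones needed): dense_rank() numbers the remeasurements within each plot, then t == 2 picks the second measurement.

```r
library(dplyr)

# Sample plot-remeasurement data from the question
df <- tibble::tribble(
  ~State, ~County, ~Plot, ~Measured_year, ~Cycle,
  1,      1,       1,     2006,           8,
  2,      1,       2,     2002,           7,
  1,      1,       1,     2009,           9,
  2,      1,       1,     2005,           6,
  2,      1,       1,     2010,           8,
  2,      1,       2,     2013,           10,
  2,      1,       2,     2021,           12,
  2,      1,       1,     2019,           13
)

# Rank each measurement within its plot, then keep the second one (t2)
df_Time2 <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = dense_rank(Cycle)) %>%
  filter(t == 2) %>%
  ungroup()
```

Because the rank follows Cycle rather than row position, this works even when the measurement intervals differ between plots.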

How to count maximum value for given time period in R?

I got the data from MySQL and I'm trying to visualize it and uncover some answers. Using R for the statistic.
The final product is the % discount for each price change (= row).
Here is an example of my dataset.
itemId pricehis timestamp
1 69295477 1290 2022-04-12 04:42:53
2 69295624 1145 2022-04-12 04:42:53
3 69296136 3609 2022-04-12 04:42:54
4 69296607 855 2022-04-12 04:42:53
5 69295291 1000 2022-04-12 04:42:50
6 69295475 4188 2022-04-12 04:42:52
7 69295614 1145 2022-04-12 04:42:51
8 69295622 1290 2022-04-12 04:42:50
9 69295692 3609 2022-04-12 04:42:49
10 69295917 1725 2022-04-12 04:42:48
11 69296090 2449 2022-04-12 04:42:53
12 69296653 1145 2022-04-12 04:42:51
13 69296657 5638 2022-04-12 04:42:48
14 69296661 1725 2022-04-12 04:42:51
15 69296696 710 2022-04-12 04:42:51
I've been stuck on one part of the calculation: the maximum value for each itemId over the last 6 months.
In the dataset there are rows for a given itemId with different pricehis values and different timestamps. For each row, I need to find the max value no older than 6 months.
The formula for calculating the desired discount is:
Discount grouped by itemId = 1 - pricehis / max(pricehis in the last 6 months)
At this moment I'm unable to solve the second part - pricehis in the last 6 months.
- I need a new column with the maximum 'pricehis' over the last 6 months for that itemId (also known as an interval maximum).
I can group it by the itemId, but I can't figure out how to add the condition on 6 months max.
Any tips on how to get this?
I like slider::slide_index_dbl for this sort of thing. Here's some fake data chosen to demonstrate the 6mo window:
data.frame(itemId = rep(1:2, each = 6),
price = floor(100*cos(0:11)^2),
timestamp = as.Date("2000-01-01") + 50*(0:11)) -> df
We can start with df, group it by itemId, and then apply the window function. (Note that slider requires the data to be sorted by date within each group.)
library(dplyr)
library(lubridate) # for `%m-%`, to get sliding months (harder than it sounds!)
df %>%
group_by(itemId) %>%
mutate(max_6mo = slider::slide_index_dbl(.x = price, # based on price...
.i = timestamp, # and timestamp...
.f = max, # what's the max...
.before = ~.x %m-% months(6))) %>% # over the last 6mo
mutate(discount = 1 - price / max_6mo) %>% # use that to calc discount
ungroup()
Result
# A tibble: 12 × 5
itemId price timestamp max_6mo discount
<int> <dbl> <date> <dbl> <dbl>
1 1 100 2000-01-01 100 0
2 1 29 2000-02-20 100 0.71
3 1 17 2000-04-10 100 0.83
4 1 98 2000-05-30 100 0.0200
5 1 42 2000-07-19 98 0.571 # new max since >6mo since 100
6 1 8 2000-09-07 98 0.918
7 2 92 2000-10-27 92 0
8 2 56 2000-12-16 92 0.391
9 2 2 2001-02-04 92 0.978
10 2 83 2001-03-26 92 0.0978
11 2 70 2001-05-15 83 0.157 # new max since >6mo since 92
12 2 0 2001-07-04 83 1
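Mapping the same pipeline onto the question's actual column names (pricehis, with timestamp parsed from character first) might look like this; the three toy rows are invented to exercise the window.

```r
library(dplyr)
library(lubridate)

# Toy rows in the question's schema; timestamps parsed from character
prices <- tibble(
  itemId    = c(69295477, 69295477, 69295477),
  pricehis  = c(1290, 990, 1145),
  timestamp = ymd_hms(c("2021-09-01 00:00:00",
                        "2022-01-01 00:00:00",
                        "2022-04-12 04:42:53"))
)

discounts <- prices %>%
  arrange(itemId, timestamp) %>%            # slider needs sorted .i per group
  group_by(itemId) %>%
  mutate(max_6mo = slider::slide_index_dbl(
           .x = pricehis, .i = timestamp, .f = max,
           .before = ~.x %m-% months(6)),   # window: last 6 calendar months
         discount = 1 - pricehis / max_6mo) %>%
  ungroup()
```

In the toy data the first 1290 price drops out of the window by the third row, so the 1145 row is its own 6-month maximum and gets a discount of 0.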

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
mutate(result = map_dbl(age, ~mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index -i is applied after subsetting, so it refers to positions in the already-filtered vector rather than in the original data.frame, and no longer removes the intended row.
Try something like this: first store the current row's age, then drop that row and take the mean activity of the remaining cases with age >= that value.
for (i in 1:10){
test.avg[i] <- {amin=age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
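For reference, running the corrected loop on the question's data reproduces the expected value for the first row (39, the mean of the activities at ages 23 and 25). Note that it keeps the question's >= ("same age or older"), whereas the map_dbl and sapply answers use a strict >:

```r
# Question's data
colony   <- c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age      <- c(21, 23, 4, 25, 7, 4, 12, 14, 9, 7)
activity <- c(19, 45, 78, 33, 2, 49, 22, 21, 112, 61)
test.df  <- data.frame(colony, age, activity)

# Corrected loop: drop row i first, then filter on age
test.avg <- vector("numeric", nrow(test.df))
for (i in 1:10) {
  amin <- age[i]
  test.avg[i] <- mean(subset(test.df[-i, ], age >= amin)$activity)
}
test.avg[1]  # colony 29683: mean of activities at ages 23 and 25 -> 39
```

The oldest colony (age 25) has no older peers, so its entry is NaN, matching the tidyverse output above.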
You can use map_df :
library(tidyverse)
test.df %>%
mutate(map_df(1:nrow(test.df), ~
test.df %>%
filter(age >= test.df$age[.x]) %>%
summarise(av_acti= mean(activity))))

Subsetting data by date range across years in R

I have a long-term sightings data set of identified individuals (~16,000 records from 1979-2019) and I would like to subset the same date range (YYYY-09-01 to YYYY(+1)-08-31) across years in R. I have successfully done so for each "year" (and obtained the unique IDs) using:
library(dplyr)
library(lubridate)
year79 <-data%>%
select(ID, Sex, AgeClass, Age, Date, Month, Year)%>%
filter(Date>= as.Date("1978-09-01") & Date<= as.Date("1979-08-31")) %>%
filter(!duplicated(ID))
year80 <-data%>%
select(ID, Sex, AgeClass, Age, Date, Month, Year)%>%
filter(Date>= as.Date("1979-09-01") & Date<= as.Date("1980-08-31")) %>%
filter(!duplicated(ID))
I would like to clean up the code and ideally not need to specify each range (just have it iterate through). I am new to R and stuck on how to do this. Any suggestions?
FYI "Month" and "Year" are included for producing a table via melt and cast later on.
example data:
ID Year Month Day Date AgeClass Age Sex
1 1034 1979 4 17 1979-04-17 U 3 F
2 1127 1979 5 3 1979-05-03 A 13 F
3 1222 1979 5 3 1979-05-03 U 0 F
4 1303 1979 6 16 1979-06-16 U 0 F
5 1153 1980 4 16 1980-04-16 C 0 F
6 1014 1980 4 16 1980-04-16 U 6 F
ID Year Month Day Date AgeClass Age Sex
16428 2503 2019 5 8 2019-05-08 U NA F
16429 3760 2019 5 8 2019-05-08 A 12 F
16430 4080 2019 5 9 2019-05-09 A 9 F
16431 4095 2019 5 9 2019-05-09 A 9 U
16432 1204 2019 5 11 2019-05-11 A 37 F
16433 1204 2019 5 11 2019-05-11 A NA F
#> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Every year has 122 days from Sept 1 to Dec 31 inclusive, so you could add a variable marking the "fiscal year" for each row:
set.seed(42)
library(dplyr)
my_data <- tibble(ID = 1:6,
Date = as.Date("1978-09-01") + c(-1, 0, 1, 364, 365, 366))
my_data
# There are 122 days from each Aug 31 (last of the FY) to the end of the CY.
# lubridate::ymd(19781231) - lubridate::ymd(19780831)
my_data %>%
mutate(FY = year(Date + 122))
## A tibble: 6 x 3
# ID Date FY
# <int> <date> <dbl>
#1 1 1978-08-31 1978
#2 2 1978-09-01 1979
#3 3 1978-09-02 1979
#4 4 1979-08-31 1979
#5 5 1979-09-01 1980
#6 6 1979-09-02 1980
You could keep the data in one table and do subsequent analysis using group_by(FY), or use %>% split(.$FY) to put each FY into its own element of a list. From my limited experience, I think it's generally an anti-pattern to create separate data frames for annual subsets of your data, as that makes your code harder to maintain, troubleshoot, and modify.
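Building on that, here is a sketch of how the per-season unique-ID step (the filter(!duplicated(ID)) in the question) could fold into one pipeline; the four-row data tibble is made up for illustration.

```r
library(dplyr)
library(lubridate)

# Made-up sightings in the shape of the question's `data`
data <- tibble(
  ID   = c(1034, 1127, 1034, 1153),
  Date = as.Date(c("1979-04-17", "1979-05-03", "1979-09-15", "1980-04-16"))
)

# Tag each sighting with its Sept-Aug "season" year, then keep one row
# per individual per season instead of building year79, year80, ... by hand
per_season <- data %>%
  mutate(FY = year(Date + 122)) %>%
  group_by(FY) %>%
  distinct(ID, .keep_all = TRUE) %>%
  ungroup()
```

Here ID 1034 appears in both the 1979 and 1980 seasons, so it survives once per season, which matches what the separate year79/year80 blocks produced.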

Is there a way to filter that does not include duplicates/repeated entries by particular groups?

Some context first:
I'm working with a data set which includes health related data. It includes questionnaire scores pre and post treatment. However, some clients reappear within the data for further treatment. I've provided a mock example of the data in the code section.
I have tried to come up with a solution on dplyr as this is package I'm most familiar with, but I didn't achieve what I've wanted.
#Example/mock data
ClientNumber<-c("4355", "2231", "8894", "9002", "4355", "2231", "8894", "9002", "4355", "2231")
Pre_Post<-c(1,1,1,1,2,2,2,2,1,1)
QuestionnaireScore<-c(62,76,88,56,22,30, 35,40,70,71)
df<-data.frame(ClientNumber, Pre_Post, QuestionnaireScore)
df$ClientNumber<-as.character(df$ClientNumber)
df$Pre_Post<-as.factor(df$Pre_Post)
View(df)
#tried solution
df2<-df%>%
group_by(ClientNumber)%>%
filter( Pre_Post==1|Pre_Post==2)
#this doesn't work, or needs more code to it
As you can see, the first four client numbers both have a pre and post treatment score. This is good. However, client numbers 4355 and 2231 appear again at the end (you could say they have relapsed and started new treatment). These two clients do not have a post treatment score.
I only want to analyse clients that have a pre and post score, therefore I need to filter clients which have completed treatment, while excluding ones that do not have a post treatment score if they have appeared in the data again. In relation to the example I've provided, I want to include the first 8 for analysis while excluding the last two, as they do not have a post treatment score.
If these cases are to be kept in order, you could try:
library(dplyr)
df %>%
group_by(ClientNumber) %>%
filter(!duplicated(Pre_Post) & n_distinct(Pre_Post) == 2)
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 4355 1 62
2 2231 1 76
3 8894 1 88
4 9002 1 56
5 4355 2 22
6 2231 2 30
7 8894 2 35
8 9002 2 40
I don't know if you actually need to use n_distinct() but it won't hurt to keep it. This will remove cases who have a pre score but no post score if they exist in the data.
First arrange by ClientNumber, then group_by, and finally filter using dplyr::lead and dplyr::lag:
library(dplyr)
df %>% arrange(ClientNumber) %>% group_by(ClientNumber) %>%
filter(Pre_Post==1 & lead(Pre_Post)==2 | Pre_Post==2 & lag(Pre_Post)==1)
# A tibble: 8 x 3
# Groups: ClientNumber [4]
ClientNumber Pre_Post QuestionnaireScore
<fct> <dbl> <dbl>
1 2231 1 76
2 2231 2 30
3 4355 1 62
4 4355 2 22
5 8894 1 88
6 8894 2 35
7 9002 1 56
8 9002 2 40
Another option is to create groups of 2 for every ClientNumber and select only those groups which have 2 rows in them.
library(dplyr)
df %>%
arrange(ClientNumber) %>%
group_by(ClientNumber, group = cumsum(Pre_Post == 1)) %>%
filter(n() == 2) %>%
ungroup() %>%
select(-group)
# ClientNumber Pre_Post QuestionnaireScore
# <chr> <fct> <dbl>
#1 2231 1 76
#2 2231 2 30
#3 4355 1 62
#4 4355 2 22
#5 8894 1 88
#6 8894 2 35
#7 9002 1 56
#8 9002 2 40
The same can be translated into base R using ave:
new_df <- df[order(df$ClientNumber), ]
subset(new_df, ave(Pre_Post,ClientNumber,cumsum(Pre_Post == 1),FUN = length) == 2)
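As a quick check, rebuilding the mock data with Pre_Post kept numeric (so ave() returns plain counts), the base-R line keeps exactly the four complete pre/post pairs:

```r
# Mock data from the question, with Pre_Post numeric
ClientNumber <- c("4355","2231","8894","9002","4355","2231","8894","9002","4355","2231")
Pre_Post <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1)
QuestionnaireScore <- c(62, 76, 88, 56, 22, 30, 35, 40, 70, 71)
df <- data.frame(ClientNumber, Pre_Post, QuestionnaireScore)

# Sort by client, count rows per (client, pre/post pair), keep pairs of 2
new_df <- df[order(df$ClientNumber), ]
res <- subset(new_df,
              ave(Pre_Post, ClientNumber, cumsum(Pre_Post == 1), FUN = length) == 2)
nrow(res)  # 8: the four clients' complete pairs
```

The two relapse rows for clients 4355 and 2231 form groups of length 1 and are dropped, matching the dplyr output above.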
