I have been teaching myself R from scratch, so please bear with me. I have found multiple ways to count observations; however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive data set of approximately 1 million observations. The df is set up like so:
Latitude Longitude ID Year Month Day Value
66.16667 -10.16667 CPUELE25399 1979 1 7 0
66.16667 -10.16667 CPUELE25399 1979 1 8 0
66.16667 -10.16667 CPUELE25399 1979 1 9 0
There are 154 unique IDs and, correspondingly, 154 unique lat/long pairs. I am focusing on the top 1% of all values for each unique ID. For each unique ID I have calculated the 99th percentile using its associated values. I went further and calculated each ID's 99th percentile for individual years and months, i.e., for CPUELE25399 in 1979, month=1, the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values: For each ID, for each year, for each month- I need to count the amount of times (per month per year) that the value >= that IDs 99th percentile
I have tried at least 100 different approaches to this but I think that I am fundamentally misunderstanding something maybe in the syntax? This is the snippet of code that has gotten me the farthest:
ddply(Total,
c('Latitude','Longitude','ID','Year','Month'),
function(x) c(Threshold=quantile(x$Value,probs=.99,na.rm=TRUE),
Frequency=nrow(x$Value>=quantile(x$Value,probs=.99,na.rm=TRUE))))
R throws a warning message saying that >= is not useful for factors?
If anyone out there understands this convoluted message, I would be supremely grateful for your help.
Using these threshold values: For each ID, for each year, for each month- I need to count the amount of times (per month per year) that the value >= that IDs 99th percentile
Does this mean you want to
calculate the 99th percentile for each ID (i.e. disregarding month year etc), and THEN
work out the number of times you exceed this value, but now split up by month and year as well as ID?
(note: your example code groups by lat/lon but this is not mentioned in your question, so I am ignoring it. If you wish to add it in, just add it as a grouping variable in the appropriate places).
In that case, you can use ddply to calculate the per-ID percentile first:
library(plyr)
# calculate percentile for each ID
Total <- ddply(Total, .(ID), transform, Threshold=quantile(Value, probs=.99, na.rm=TRUE))
And now you can group by (ID, month and year) to see how many times you exceed:
Total <- ddply(Total, .(ID, Month, Year), summarize, Freq=sum(Value >= Threshold))
Note that summarize returns a data frame with one row per unique combination of (ID, Month, Year), i.e. it drops the Latitude/Longitude columns. If you want to keep them, use transform instead of summarize; Freq is then repeated on every row of each (ID, Month, Year) combination.
Notes on ddply:
you can write .(ID, Month, Year) rather than c('ID', 'Month', 'Year') as you have done
if you just want to add extra columns, using something like summarize, mutate or transform lets you do it neatly without writing Total$ in front of every column name.
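For anyone who wants to try the two-step idea without plyr, here is a minimal base-R sketch. The toy data and the helper column Exceeds are made up for illustration; only the column names ID, Year, Month and Value come from the question.

```r
# Toy data (made up): two IDs, two months each, 100 daily values per month
set.seed(1)
Total <- data.frame(
  ID    = rep(c("A", "B"), each = 200),
  Year  = 1979,
  Month = rep(rep(1:2, each = 100), times = 2),
  Value = rpois(400, lambda = 1)
)

# Step 1: per-ID 99th percentile, repeated on every row of that ID
Total$Threshold <- ave(Total$Value, Total$ID,
                       FUN = function(v) quantile(v, probs = 0.99, na.rm = TRUE))

# Step 2: flag exceedances, then count them per ID, year and month
Total$Exceeds <- Total$Value >= Total$Threshold
Freq <- aggregate(Exceeds ~ ID + Year + Month, data = Total, FUN = sum)
names(Freq)[names(Freq) == "Exceeds"] <- "Freq"
Freq
```

Freq has one row per (ID, Year, Month) combination, which matches what the summarize call above returns.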
Related
I have a data frame (df) with the columns (ID), (Adm_Date), (ICD_10) and (points); it has 1,000,000 rows.
(Points) represent value for (ICD_10)
(ID): each one has many rows
(Adm_Date) from 2010-01-01 to 2018-01-01.
For each (ID), I want the sum of (points), without duplicates, over the rows that fall in the window from 2 years before (Adm_Date) up to (Adm_Date).
The periods like these:
01-01-2010 to 31-01-2012,
01-02-2010 to 29-02-2012,
01-03-2010 to 31-03-2012,...... so on to the last date 01-12-2016 to 31-12-2018.
My problem is with the date filter. It does not restrict the rows to the period. It sums (points) for each (ID), without duplicates, over all data from 2010 to 2018 instead of summing them per period for each (ID).
I used these codes
start.date= seq(as.Date (df$Adm_Date))
end.date = seq(as.Date (df$Adm_Date+ years(-2)))
Sum_df<- df %>% dplyr::filter(Adm_Date >=start.date & Adm_Date<=end.date) %>%
group_by(ID) %>%
mutate(sum_points = sum(points*!duplicated(ICD_10)))
But the filter did not work: it sums (points) for each (ID) over all dates from 2010 to 2018 instead of summing them per period for each (ID).
sum_points will start from 01-01-2012; for any Adm_Date >= 01-01-2012 I need to get the sum.
If I look at the patient with ID=11, I will sum points from row 3 to row 23. I also need to ignore repeated ICD_10 codes (e.g. G81 and I69 are repeated in this period), so the result looks like this:
ID(11), Adm_Date(07-05-2012), sum_points(17). For the same patient at Adm_Date(13-06-2013), I will sum from row 11 to row 27, because I look back 2 years from Adm_Date. So:
ID(11), Adm_Date(13-06-2013), sum_points(14.9)
I have about half a million IDs and more than a million rows.
I hope I explained it well. Thank you
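One way to read the requirement above is as a per-row look-back: for each admission, take the same patient's rows from the preceding two years and sum points once per distinct ICD_10 code. The following base-R sketch does that on made-up data. The helper lookback_sum and the 730-day window are assumptions for illustration, not something from the question, and a row-by-row scan like this will be slow on a million rows.

```r
# Made-up data for one patient; column names follow the question
df <- data.frame(
  ID       = c(11, 11, 11, 11),
  Adm_Date = as.Date(c("2011-06-15", "2012-01-10", "2012-05-07", "2013-06-13")),
  ICD_10   = c("G81", "I69", "G81", "I10"),
  points   = c(2, 1.5, 2, 3)
)

# Hypothetical helper: for row i, sum points over the same ID's rows admitted
# within `days` days up to and including Adm_Date[i], counting each ICD_10 once.
# 730 days is an assumed approximation of "2 years".
lookback_sum <- function(i, d, days = 730) {
  in_window <- d$ID == d$ID[i] &
    d$Adm_Date <= d$Adm_Date[i] &
    d$Adm_Date >= d$Adm_Date[i] - days
  w <- d[in_window, ]
  sum(w$points[!duplicated(w$ICD_10)])
}

df$sum_points <- vapply(seq_len(nrow(df)), lookback_sum, numeric(1), d = df)
df$sum_points
# 2.0 3.5 3.5 6.5
```

The third row shows the de-duplication: G81 appears twice in its window but is counted once.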
I'm working on some Covid-19 related questions in RStudio.
I have a data frame containing the columns of the date, cases (newly infected people on this date), deaths on this date, country, population, and indicator 14, which is the Number of cases per 100,000 residents over the last 14 days including the current date.
Now I want to create a new indicator, which is looking at the cases per 100,000 over the last 7 days.
The way to calculate it would of course be: 7 days indicator = (sum from k= i-6 to i of cases_k/population) * 100,000
So I wanted to code a function incidence <- function(cases, population) {} performing the formula on the data but I'm struggling:
How can I always address the last 7 days?
I know that I can, e.g., compute a sum from 0 to 5 with the following: i <- 0:5; sum(i^2) but how do I define the range from k = i-6 to i in this case?
Do I have to use a loop inside the function?
Thank you!
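No explicit loop is needed for the trailing sum. A minimal sketch, assuming cases holds one country's daily counts in date order with no missing days (the window argument and the toy numbers are made up):

```r
# 7-day incidence per 100,000 residents
incidence <- function(cases, population, window = 7) {
  # stats::filter with sides = 1 sums the current value and the window - 1 values before it;
  # the first window - 1 results are NA because the window is incomplete there
  rolling_sum <- as.numeric(stats::filter(cases, rep(1, window), sides = 1))
  rolling_sum / population * 1e5
}

cases <- c(10, 20, 30, 40, 50, 60, 70, 80)   # made-up daily counts
incidence(cases, population = 1e6)
# NA NA NA NA NA NA 28 35
```

With several countries in one data frame, the function would have to be applied per country (for example inside a grouped mutate), otherwise the window runs across country boundaries.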
I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
The first day on which the maximum value of G was reached in a group coded as zero.
How many days each measurement (in this same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is then coded -2, and a measurement 5 days after the maximum is coded as 5.
Here is an example of what I'm aiming for: Example
I highlighted the days on which the maximum value of G was reached in the different groups. The column 'New' is what I'm trying to get.
I've been trying with dplyr and I managed to get for each group the maximum with group_by, arrange(desc), slice. I then recoded those maxima into zero and joined this dataframe with my original dataframe. However, I cannot manage to do the 'sequence' of days leading up to/ days from the maximum.
EDIT: sorry I didn't include a reprex. I used this code so far:
To find the maximum value: First order by date
data <- data[with(data, order(G, Date)),]
Find maximum and join with original data:
data2 <- data %>%
dplyr::group_by(Group) %>%
arrange(desc(c(G)), .by_group=TRUE) %>%
slice(1) %>%
ungroup()
data2$New <- data2$G
data2 <- data2 %>%
dplyr::select(c("ID", "New", "Date"))
data3 <- full_join(data, data2, by=c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in column New as NA but not yet the number of days leading up to this, and the number of days since. I have no idea how to get to this.
It would help if you would be able to provide the data using dput() in your question, as opposed to using an image.
It looked like you wanted to group_by(Group) in your example to compute number of days before and after the maximum date in a Group. However, you have ID of 3 and Group of A that suggests otherwise, and maybe could be clarified.
Here is one approach using tidyverse I hope will be helpful. After grouping and arranging by Date, you can look at the difference in dates comparing to the Date where G is maximum (the first maximum detected in date order).
Also note, as.numeric is included to provide a number, as the result for New is a difftime (e.g., "7 days").
library(tidyverse)
data %>%
group_by(Group) %>%
arrange(Date) %>%
mutate(New = as.numeric(Date - Date[which.max(G)]))
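To check what this computes, here is the same idea in base R on a small made-up data set (the values are invented; only the column names Group, Date and G follow the question). Note that Group A reaches its maximum twice, and which.max picks the first occurrence in date order.

```r
# Made-up data: Group A hits its maximum G on two dates
data <- data.frame(
  Group = c("A", "A", "A", "A", "B", "B", "B"),
  Date  = as.Date(c("2020-01-01", "2020-01-03", "2020-01-04", "2020-01-08",
                    "2020-02-01", "2020-02-02", "2020-02-05")),
  G     = c(10, 50, 50, 20, 5, 90, 30)
)

# Sort so that split() returns rows in the same order as the data frame
data <- data[order(data$Group, data$Date), ]

# Days relative to the first date on which G is at its group maximum
data$New <- unlist(lapply(split(data, data$Group), function(d) {
  as.numeric(d$Date - d$Date[which.max(d$G)])
}), use.names = FALSE)

data$New
# -2  0  1  5 -1  0  3
```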
I have a large df with around 40,000,000 rows, covering a total time period of 2 years and more than 400k unique users.
The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user at several points in time.
Each row is therefore a unique combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable(df$category_var) and the time variable (df$time_var) I now want to calculate 3 new variables on a user_id level on a rolling time window over the previous 30 days.
So in each row, the new variable should be calculated over the values of the previous 30 days of the input variables.
I do not observe all users over the same time period, some enter later some leave earlier, also the distances between times are not equal, therefore I can not calculate the variables just by number of rows.
So far I have only managed to calculate my new variables per user_id over the whole observation period; I have not managed to calculate them over the previous 30 days as a rolling window per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr, my attempts at calculating these variables over the rolling time window at the group_by(user_id) level have taken more than a week without any results. I would be so grateful for your support!
My df basically looks like :
user_id <- c(1,1,1,1,1,2,2,2,2,3,3,3,3,3)
time_var <- c(1,2,3,4,5, 1.5, 2, 3, 4.5, 1,2.5,3,4,5)
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", …)
v1 <- c(0,1,0,0,1,0,1,1,1,0,1,…)
v2 <- c(1,1,0,1,0,1,1,0,...)
My first needed new variable (new_x1) is basically a cumulative sum based on a condition in dummy variable v1. What I achieved so far:
df <- df %>% group_by(user_id) %>% mutate(new_x1=cumsum(v1==1))
What I need: that variable counted only over the previous 30 days per user.
Needed new variable (new_x2): Basically cumulative count of v1 if v2 has a (so far) unique value. So for each new value in v2 given v1==1, count.
What I achieved so far:
df <- df %>%
group_by(user_id, category_var) %>%
mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))
I also need this based on the previous 30 days and not the whole observation period per user.
My third variable of interest (new_x3):
The time between two observations given a certain condition (v1==1)
#Interevent Time
df2 <- df%>% group_by(user_id) %>% filter(v1==1) %>% mutate(time_between_events=time-lag(time))
I would also need this over the previous 30 days.
Thank you so much!
Edit after John Springs Post:
My potential solution would then be
setDT(df)[, `:=`(new_x1= cumsum(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]),
new_x2= cumsum(!duplicated(df$v1==1[df$user_id == user_id][between(df$time[df$user_id == user_id], time-30, time, incbounds = TRUE)]))),
by = eval(c("user_id", "time"))]
I am really not familiar with data.table and am not sure whether I can nest my conditions inside cumsum in data.table like that.
Any suggestions?
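To pin down what the first variable should look like, here is a small base-R sketch of a 30-day trailing count of v1 == 1 per user. The helper rolling_count, the toy data and the choice to include the current row and both window bounds are assumptions for illustration. It scans every row within a user, so it only shows the intended result and will not scale to 40 million rows as written.

```r
# Made-up data: irregularly spaced observations for two users
df <- data.frame(
  user_id  = c(1, 1, 1, 1, 2, 2, 2),
  time_var = as.POSIXct("2020-01-01", tz = "UTC") + 86400 * c(0, 10, 35, 50, 0, 31, 40),
  v1       = c(1, 1, 0, 1, 1, 1, 1)
)
window_secs <- 30 * 86400   # 30 days expressed in seconds

# Hypothetical helper: for each observation, count flag == 1 among this user's
# observations in [time - window, time], current row included
rolling_count <- function(times, flag, window) {
  t <- as.numeric(times)
  vapply(seq_along(t), function(i) {
    sum(flag[t <= t[i] & t >= t[i] - window] == 1)
  }, numeric(1))
}

# Sort so that split() returns rows in the same order as the data frame
df <- df[order(df$user_id, df$time_var), ]
df$new_x1 <- unlist(lapply(split(df, df$user_id), function(d) {
  rolling_count(d$time_var, d$v1, window_secs)
}), use.names = FALSE)

df$new_x1
# 1 2 1 1 1 1 2
```

Because the window is defined on the timestamps rather than on row counts, users who enter late or have uneven gaps are handled the same way as everyone else.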
I have a dataframe with 8 variables:
For the variable Labor Category, we have 5 factors: Holiday Worked, Regular, Overtime, Training, Other Worked.
The question is: can I find a way to aggregate rows that have the same values in every column except Labor Category, and sum up the Sum_FTE variable?
I.e., can we reduce the number of rows while adding more columns:
"Labor.CategoryHoliday.Worked","Labor.CategoryOther.Worked","Labor.CategoryOvertime","Labor.CategoryRegular","Labor.CategoryTraining" and use 0 or 1 to indicate the status of each factor. And then sum up the Total FTE from rows with same values except Labor Category.
We can use one of the group-by operations. Using dplyr, we specify the column names in group_by as grouping variables and then get the sum of "Sum_FTE" with summarise.
library(dplyr)
df1 %>%
group_by(across(all_of(names(df1)[c(1:2, 4:5)]))) %>%  # group_by_() is deprecated
summarise(TotalFTE= sum(Sum_FTE))
For the second part of the question, we can use dcast (it would have been better to show the dataset with dput instead of image file)
library(data.table)
setDT(df1)[, N := 1:.N, by = Labor.Category]
dcast(df1, Med.Center+Charged.Job+Month+Pay.Period.End ~N,
value.var="Labor.Category", fun.aggregate=length)
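Since the original data was only posted as an image, here is a small base-R sketch of both parts on made-up rows. Only the column names Med.Center, Charged.Job, Labor.Category and Sum_FTE come from the question; the values and the key helper are invented.

```r
# Made-up rows
df1 <- data.frame(
  Med.Center     = c("X", "X", "X", "Y"),
  Charged.Job    = c("RN", "RN", "RN", "RN"),
  Labor.Category = c("Regular", "Overtime", "Regular", "Training"),
  Sum_FTE        = c(1.0, 0.2, 0.5, 0.1)
)

# Part 1: total FTE per combination of the non-category columns
totals <- aggregate(Sum_FTE ~ Med.Center + Charged.Job, data = df1, FUN = sum)

# Part 2: one 0/1 indicator column per labor category
key   <- paste(df1$Med.Center, df1$Charged.Job)   # hypothetical row key
tab   <- table(key, df1$Labor.Category)
flags <- as.data.frame.matrix((tab > 0) * 1)

totals
flags
```

totals has one row per (Med.Center, Charged.Job) pair, and flags has one row per key with a 1 wherever that labor category occurs at least once.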