Filtering by a rolling 2-year date window in R - r

I have a data frame (df) with columns (ID), (Adm_Date), (ICD_10), and (points), and it has 1,000,000 rows.
(points) represents the value for (ICD_10).
(ID): each one has many rows.
(Adm_Date) ranges from 2010-01-01 to 2018-01-01.
I want to sum (points), without counting duplicate (ICD_10) codes, over the rows from each (Adm_Date) back to 2 years before that (Adm_Date), per (ID).
The periods look like this:
01-01-2010 to 31-01-2012,
01-02-2010 to 29-02-2012,
01-03-2010 to 31-03-2012, ... and so on up to the last period, 01-12-2016 to 31-12-2018.
My problem is with the date filter: it does not restrict the rows to each period. It sums (points) for each (ID), without duplicates, over all data from 2010 to 2018 instead of summing them per period for each (ID).
I used this code:
start.date <- seq(as.Date(df$Adm_Date))
end.date   <- seq(as.Date(df$Adm_Date + years(-2)))
Sum_df <- df %>%
  dplyr::filter(Adm_Date >= start.date & Adm_Date <= end.date) %>%
  group_by(ID) %>%
  mutate(sum_points = sum(points * !duplicated(ICD_10)))
but the filter did not work: it sums (points) for each (ID) over all dates from 2010 to 2018 instead of summing them per period for each (ID).
sum_points should start from 01-01-2012; for any Adm_Date >= 01-01-2012 I need to get its sum.
If I look at the patient with ID = 11, I would sum points from row 3 to row 23, ignoring repeated ICD_10 codes (e.g. G81 and I69 are repeated in this period), so the result looks like this:
ID (11), Adm_Date (07-05-2012), sum_points (17). For the same patient at Adm_Date (13-06-2013) I would sum from row 11 to row 27, because I look back 2 years from the Adm_Date. So:
ID (11), Adm_Date (13-06-2013), sum_points (14.9)
I have about half a million IDs and more than a million rows.
I hope I explained it well. Thank you
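One way to sketch this is a per-row look-back window with dplyr and lubridate, using the column names described above; the toy rows below are invented just to illustrate, not the real data:

```r
library(dplyr)
library(lubridate)

# invented example rows for one patient (ID = 11)
df <- tibble(
  ID       = c(11, 11, 11, 11),
  Adm_Date = as.Date(c("2011-03-01", "2012-05-07", "2013-01-10", "2013-06-13")),
  ICD_10   = c("G81", "I69", "G81", "E11"),
  points   = c(5, 7, 5, 2.9)
)

# for each row, look back 2 years within the same ID and sum the points of
# the distinct ICD_10 codes seen in that window
sum_df <- df %>%
  group_by(ID) %>%
  mutate(sum_points = sapply(seq_along(Adm_Date), function(i) {
    in_window <- Adm_Date >= (Adm_Date[i] %m-% years(2)) & Adm_Date <= Adm_Date[i]
    w_icd <- ICD_10[in_window]
    w_pts <- points[in_window]
    sum(w_pts[!duplicated(w_icd)])   # count each ICD_10 only once
  })) %>%
  ungroup()
```

For the 2013-06-13 row this sums I69 (7) + G81 (5) + E11 (2.9) = 14.9; the repeated G81 is counted once. The sapply makes this roughly O(n²) per ID, so on a million rows a data.table non-equi join on ID and the date window would be the scalable alternative.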

Related

How to compare 4 dates across 4 columns and create 4 new columns with the dates that are chronologically ordered in R dataframe

New to R and to Stack Overflow. Thanks for your consideration of my question.
I have a wide dataframe that contains a row for each unique ID and up to 4 dates in 4 columns per ID. Not everyone has all four dates, and the dates were not entered chronologically across the columns.
What I'm trying to do is create 4 new columns with the dates ordered chronologically for each row (ID). That is, the earliest of the one to four dates in the original columns should appear in newcol1; the next earliest date appears in the next column (newcol2), and so on.
So, for example, if the first ID has 4 dates in 4 columns:
2021-08-17, 2022-08-02, 2021-12-12, 2022-03-15
I want them to appear in chronological order in the 4 new columns:
2021-08-17, 2021-12-12, 2022-03-15, 2022-08-02
I know how to pull min and max, but that's not helpful for the middle values.
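A base-R sketch, assuming the four columns are named date1 through date4 (adjust to the real names). Converting to character first keeps apply() from mangling the Date class, and ISO-formatted dates sort correctly as text:

```r
# toy one-row example matching the question
df <- data.frame(
  ID = 1,
  date1 = as.Date("2021-08-17"),
  date2 = as.Date("2022-08-02"),
  date3 = as.Date("2021-12-12"),
  date4 = as.Date("2022-03-15")
)

date_cols <- c("date1", "date2", "date3", "date4")

# sort the dates within each row; missing dates (NA) go last
sorted <- t(apply(df[date_cols], 1,
                  function(x) sort(as.character(x), na.last = TRUE)))
df[paste0("newcol", 1:4)] <- lapply(1:4, function(i) as.Date(sorted[, i]))
```

This handles the middle values as well as the min and max, because sort() orders the whole row at once.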

Selecting data from a dataset using information from another dataset in R

I am trying to calculate animal home ranges using movement data from the 30 days before two animals encountered each other, using R. So, for example, if animal1 meets animal2 on the 15th of June, I would like to select all movement data available between the 16th of May and the 14th of June for each animal. The problem I have is that I do not know how to program the subsetting of the movement data based on the date and animal id.
I would like to end up with two new datasets of movement data for each encounter, one per animal. Each new dataset would contain all movement data recorded for one of the encountering animals in the 30 days before the encounter.
I share part of the data with you in this WeTransfer link. The workbook contains 2 tabs:
- encounters: contains one line per encounter, with a column for the date, another with the ID of group1, and another with the ID of group2. I would use the date and the IDs from this dataset to select the data from the other dataset (movement_data).
- movement_data: contains one line per GPS point collected. There are columns for the id of the point, the ID of the group, the date on which the GPS point was taken, the latitude, and the longitude.
Does anybody know how to do this? I don't even know where to start
Thank you very much!
So to subset by the id of the animal, you just need to use dplyr to subset the data by ID:
data %>% filter(ID == "A")
To get the dates, you could add a column in Excel where you subtract 30 days from each encounter date, and then filter for dates between that column and the encounter date:
data %>%
  filter(ID == "A") %>%
  filter(between(date_column, as.Date('YYYY-MM-DD'), as.Date('YYYY-MM-DD')))
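The 30-day subtraction can also be done entirely in R rather than in a spreadsheet. A sketch with invented column names (encounter_date, id_group1, id_group2 in encounters; group_id, date in movement_data), since the real workbook isn't shown here:

```r
library(dplyr)

# invented stand-ins for the two workbook tabs (column names are assumptions)
encounters <- data.frame(
  encounter_date = as.Date("2022-06-15"),
  id_group1 = "A",
  id_group2 = "B"
)
movement_data <- data.frame(
  group_id = c("A", "A", "B", "C"),
  date = as.Date(c("2022-05-20", "2022-04-01", "2022-06-10", "2022-06-01")),
  lat = c(1, 2, 3, 4),
  lon = c(1, 2, 3, 4)
)

# one pair of data frames per encounter: each animal's movement in the
# 30 days before the encounter (here 2022-05-16 up to 2022-06-14)
per_encounter <- lapply(seq_len(nrow(encounters)), function(i) {
  enc <- encounters[i, ]
  window <- movement_data %>%
    filter(date >= enc$encounter_date - 30,
           date <  enc$encounter_date)
  list(group1 = filter(window, group_id == enc$id_group1),
       group2 = filter(window, group_id == enc$id_group2))
})
```

Each element of per_encounter holds the two per-animal datasets the question asks for, so the hardcoded 'YYYY-MM-DD' strings are no longer needed.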

Sum over dates in RStudio for a formula

I'm working on some Covid-19 related questions in RStudio.
I have a data frame containing columns for the date, cases (people newly infected on this date), deaths on this date, country, population, and indicator 14, which is the number of cases per 100,000 residents over the last 14 days including the current date.
Now I want to create a new indicator that looks at the cases per 100,000 over the last 7 days.
The way to calculate it would of course be: 7-day indicator_i = (sum over k = i-6, ..., i of cases_k / population) * 100,000
So I wanted to code a function incidence <- function(cases, population) {} performing the formula on the data, but I'm struggling:
How can I always address the last 7 days?
I know that I can, e.g., compute a sum from 0 to 5 with i <- 0:5; sum(i^2), but how do I define the sum from k = i-6 to i in this case?
Do I have to use a loop inside the function?
Thank you!
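No explicit loop is needed: the sum from k = i-6 to i is a rolling window, which stats::filter() can compute in one vectorised call. A sketch with invented column names and toy data:

```r
library(dplyr)

# toy stand-in for the real data frame (one country, 10 days)
df <- data.frame(
  country = "X",
  date = as.Date("2021-01-01") + 0:9,
  cases = 1:10,
  population = 1e6
)

# rolling 7-day sum: weights rep(1, 7) with sides = 1 sum the current day
# and the 6 preceding days; the first 6 days are NA (no full window yet)
df <- df %>%
  group_by(country) %>%
  mutate(cases_7d = as.numeric(stats::filter(cases, rep(1, 7), sides = 1)),
         indicator_7 = cases_7d / population * 1e5) %>%
  ungroup()
```

On day 7 the window is days 1-7 (sum 28), giving 28 / 1,000,000 * 100,000 = 2.8. Grouping by country keeps the window from leaking across countries.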

Sum rows that share all but two observations in the same dataframe

In my dataframe, see table attached here, I have three columns: country, results and eurosceptic.
I would like to know if it is possible to merge together rows that share all but two observations, which should be the results and eurosceptic observations.
For example, a function that would leave me with two rows for Belgium. One in which the eurosceptic value is 1 and one where it's 0. Then, the results column in each of those rows would be the sum formed by the results of the former rows that shared either 1 or 0 for the eurosceptic variable.
So, the eurosceptic = 0 Belgium row would have its results observation equal the sum of the results observations of the rows in my current table, which were related to Belgium and all had the eurosceptic value as 0.
In short, a transformation of my df to one with two rows per country, the eurosceptic value as 0 and 1, where the results observation for each is the summed results observations of the previous rows with the corresponding country and eurosceptic values.
Is this possible?
Thanks for your help in advance!
My table as it is now
We can group by 'Country' and 'eurosceptic', then get the sum of 'results':
library(dplyr)
df1 %>%
  group_by(Country, eurosceptic) %>%
  summarise(results = sum(results))

Counting Frequencies Using (logical?) Expressions

I have been teaching myself R from scratch so please bear with me. I have found multiple ways to count observations, however, I am trying to figure out how to count frequencies using (logical?) expressions. I have a massive set of data approx 1 million observations. The df is set up like so:
Latitude Longitude ID Year Month Day Value
66.16667 -10.16667 CPUELE25399 1979 1 7 0
66.16667 -10.16667 CPUELE25399 1979 1 8 0
66.16667 -10.16667 CPUELE25399 1979 1 9 0
There are 154 unique IDs and similarly 154 unique lat/long pairs. I am focusing on the top 1% of all values for each unique ID. For each unique ID I have calculated the 99th percentile using its associated values. I went further and calculated each ID's 99th percentile for individual years and months, i.e., for CPUELE25399, for 1979, for month = 1, the 99th percentile value is 3 (3 being the floor of the top 1%).
Using these threshold values: For each ID, for each year, for each month- I need to count the amount of times (per month per year) that the value >= that IDs 99th percentile
I have tried at least 100 different approaches to this but I think that I am fundamentally misunderstanding something maybe in the syntax? This is the snippet of code that has gotten me the farthest:
ddply(Total,
      c('Latitude', 'Longitude', 'ID', 'Year', 'Month'),
      function(x) c(Threshold = quantile(x$Value, probs = .99, na.rm = TRUE),
                    Frequency = nrow(x$Value >= quantile(x$Value, probs = .99, na.rm = TRUE))))
R throws a warning message saying that >= is not useful for factors?
If any one out there understands this convoluted message I would be supremely grateful for your help.
Using these threshold values: For each ID, for each year, for each month- I need to count the amount of times (per month per year) that the value >= that IDs 99th percentile
Does this mean you want to
calculate the 99th percentile for each ID (i.e. disregarding month year etc), and THEN
work out the number of times you exceed this value, but now split up by month and year as well as ID?
(note: your example code groups by lat/lon but this is not mentioned in your question, so I am ignoring it. If you wish to add it in, just add it as a grouping variable in the appropriate places).
In that case, you can use ddply to calculate the per-ID percentile first:
# calculate percentile for each ID
Total <- ddply(Total, .(ID), transform, Threshold=quantile(Value, probs=.99, na.rm=T))
And now you can group by (ID, month and year) to see how many times you exceed:
Total <- ddply(Total, .(ID, Month, Year), summarize, Freq=sum(Value >= Threshold))
Note that summarize will return a dataframe with only as many rows as there are combinations of .(ID, Month, Year), i.e. it will drop all the Latitude/Longitude columns. If you want to keep them, use transform instead of summarize, and then the Freq will be repeated across all the different (Lat, Lon) rows for each (ID, Month, Year) combo.
Notes on ddply:
- you can write .(ID, Month, Year) rather than c('ID', 'Month', 'Year') as you have done
- if you just want to add extra columns, using something like summarize or mutate or transform lets you do it slickly without needing to put Total$ in front of the column names.
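For comparison, the same two steps can be sketched with dplyr (plyr's successor); Total here is a small invented stand-in for the real data:

```r
library(dplyr)

# invented stand-in: 2 IDs, 2 months, deterministic values
Total <- data.frame(
  ID    = rep(c("A", "B"), each = 50),
  Year  = 1979,
  Month = rep(1:2, 50),
  Value = seq_len(100)
)

result <- Total %>%
  group_by(ID) %>%
  mutate(Threshold = quantile(Value, probs = 0.99, na.rm = TRUE)) %>%  # per-ID threshold
  group_by(ID, Year, Month) %>%
  summarise(Freq = sum(Value >= Threshold), .groups = "drop")          # exceedances per month
```

The grouped mutate plays the role of the transform step (the per-ID threshold is attached to every row), and the grouped summarise plays the role of the second ddply call.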
