I have a dataset I'm working with in R that has an ID number (ID), year they submitted data (Year), some other data (which isn't relevant to my question, but just consider them as "columns"), and a date of registration on our systems (DateR).
DateR is autogenerated from the dataset I am using, and is supposed to represent the "earliest" date the ID number appears on our systems.
However, due to some kind of problem with how the data is being pulled that I can't get fixed, the date is being recorded as a new date that updates every year, instead of simply the earliest date.
Thus, my goal would be to create a script that reworks the data and does the following two checks:
Firstly, it checks the row and identifies which rows have the matching ID number
Secondly, it then applies the "earliest" date of all of the matching ID numbers in the date column.
Below is an example of a dataset like the one I am using:
#  ID1    YearSubmitted  Data  DateR
1  12345  2017           100   22-03-2017
2  12345  2018           100   22-03-2018
3  12345  2019           100   22-03-2019
4  22221  2018           100   22-03-2018
5  22221  2019           100   22-03-2019
This is what I would like it to look like (the changed values are the DateR entries in rows 2, 3, and 5):
#  ID1    YearSubmitted  Data  DateR
1  12345  2017           100   22-03-2017
2  12345  2018           100   22-03-2017
3  12345  2019           100   22-03-2017
4  22221  2018           100   22-03-2018
5  22221  2019           100   22-03-2018
Most of the related questions I have found deal either with replacing data with values from another column, like "If data present, replace with data from another column based on row ID", or with taking the replacement value from another dataframe, like "Replace a value in a dataframe by using other matching IDs of another dataframe in R".
I would prefer to achieve this in dplyr if possible.
Preferably I'd like to start this with
data %>%
group_by(ID1, YearSubmitted) %>%
mutate(across(c(DateR),
And I understand I could use the match function .. but I just draw a blank from this point on.
Thus, I would appreciate advice on how to:
Conditionally change the date if it's matching ID1 values, and secondly, to change all dates to the earliest value in the date column (DateR).
Thanks for your time.
Try this:
quux %>%
  mutate(DateR = as.Date(DateR, format = "%d-%m-%Y")) %>%
  group_by(ID1) %>%
  mutate(DateR = min(DateR)) %>%
  ungroup()
# # A tibble: 5 × 5
# `#` ID1 YearSubmitted Data DateR
# <int> <int> <int> <int> <date>
# 1 1 12345 2017 100 2017-03-22
# 2 2 12345 2018 100 2017-03-22
# 3 3 12345 2019 100 2017-03-22
# 4 4 22221 2018 100 2018-03-22
# 5 5 22221 2019 100 2018-03-22
This involves converting DateR to a "real" Date-class object, where numeric comparisons (such as min) are unambiguous and correct.
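To illustrate why the conversion matters, here is a small base R sketch with two made-up dates (they are not from the question's data). Taking min() of the raw day-month-year strings compares them as text, so the string that starts with the smaller day wins, whereas the Date version picks the genuinely earlier day. In the question's sample every date shares the same day and month, so the text comparison would happen to agree there.

```r
# Hypothetical values chosen so that text order and date order disagree
raw <- c("22-03-2018", "05-01-2019")

# Character comparison: "05-..." sorts before "22-...", although it is the later date
min_as_text <- min(raw)

# Date comparison: parse as day-month-year first, then take the minimum
min_as_date <- min(as.Date(raw, format = "%d-%m-%Y"))

print(min_as_text)
print(min_as_date)
```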
Data
quux <- structure(list("#" = 1:5, ID1 = c(12345L, 12345L, 12345L, 22221L, 22221L), YearSubmitted = c(2017L, 2018L, 2019L, 2018L, 2019L), Data = c(100L, 100L, 100L, 100L, 100L), DateR = c("22-03-2017", "22-03-2018", "22-03-2019", "22-03-2018", "22-03-2019")), class = "data.frame", row.names = c(NA, -5L))
Here is a similar approach using dplyr's first() function after using arrange() to sort by year (this relies on the earliest YearSubmitted row in each ID holding the earliest DateR):
df %>%
  group_by(ID1) %>%
  arrange(YearSubmitted, .by_group = TRUE) %>%
  mutate(DateR = first(DateR))
ID1 YearSubmitted Data DateR
<int> <int> <int> <chr>
1 12345 2017 100 22-03-2017
2 12345 2018 100 22-03-2017
3 12345 2019 100 22-03-2017
4 22221 2018 100 22-03-2018
5 22221 2019 100 22-03-2018
I have two dataframes, one with climate data for every location and date across 4 years. The other data frame has a date for each day an animal was trapped at a site. I am trying to calculate the mean of each climate variable based on a specific amount of time before the day the animal was trapped (time length depends on variable in question).
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
> climate
site precip humid date
1 1 0.1 1 6/13/2020
2 1 0.2 1 6/12/2020
3 1 0.1 3 6/11/2020
4 1 0.1 1 6/14/2020
5 2 0.5 2 6/13/2020
6 2 0.2 3 6/12/2020
7 2 0.3 3 6/11/2020
8 2 0.1 1 6/14/2020
9 1 0.2 1 2/13/2019
10 1 0.1 3 2/14/2019
11 1 0.1 1 2/15/2019
12 1 0.5 2 2/16/2019
> trap
site date
1 1 7/1/2020
2 2 7/1/2020
3 3 7/2/2020
4 3 7/4/2020
I want to calculate the mean humid 18-20 days before the date written in the trap dataframe. So essentially what is the mean humid between 6/11/2020 and 6/13/2020 according to the climate data.frame for animals trapped on 7/1/2020. So for site 1 that would be: 1.667 and site 2 that would be 2.67.
I also want to calculate the sum of precipitation 497-500 days before the date written in the trap dataframe. So I would need to calculate the sum (total) precip between 2/13/2019 and 2/16/2019 for an animal trapped on 7/1/2020 at each site. So for site 1 precip would be 0.9.
I know how to create new columns in the trap data frame for mean humid and sum precip, but I'm not sure where to start in terms of coding so that each value is calculated as described above and the data corresponding to the correct date is used, given that the large dataset contains many different trap dates.
Thank you very much, hopefully I am being clear in my description.
I have a solution using functions from the tidyverse. It is always useful to convert date variables to the Date class, since that class lets you do date arithmetic. Note that I renamed the date column in the trap data to trap_date. See the comments for more details:
library(tidyverse)
climate <- data.frame(site=c(1,1,1,1,2,2,2,2,1,1,1,1),
precip=c(0.1,0.2,0.1,0.1,0.5,0.2,0.3,0.1,0.2,0.1,0.1,0.5),
humid=c(1,1,3,1,2,3,3,1,1,3,1,2),
date=c("6/13/2020","6/12/2020","6/11/2020","6/14/2020","6/13/2020","6/12/2020","6/11/2020","6/14/2020","2/13/2019","2/14/2019","2/15/2019","2/16/2019"))
trap <- data.frame(site=c(1,2,3,3), trap_date=c("7/1/2020","7/1/2020","7/2/2020","7/4/2020"))
# merge data
data <- merge(climate, trap, by="site")
# parse dates to class 'Date'; enables calculations
data <- data %>%
  mutate(date = parse_date(date, format="%m/%d/%Y"),
         trap_date = parse_date(trap_date, format="%m/%d/%Y"))
> head(data)
site precip humid date trap_date
1 1 0.1 1 2020-06-13 2020-07-01
2 1 0.2 1 2020-06-12 2020-07-01
3 1 0.1 3 2020-06-11 2020-07-01
4 1 0.1 1 2020-06-14 2020-07-01
5 1 0.2 1 2019-02-13 2020-07-01
6 1 0.1 3 2019-02-14 2020-07-01
For means:
# humid means
data %>%
  group_by(site) %>%
  filter(date >= trap_date-20 & date <= trap_date-18) %>%
  summarise(mean = mean(humid))
# A tibble: 2 x 2
site mean
<dbl> <dbl>
1 1 1.67
2 2 2.67
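The question mentions many different trap dates, so one site may be trapped more than once. In that case it may be safer to group by the trap date as well as the site. The following is only a sketch under that assumption: the second trap date (2020-07-02 at site 1) is made up for illustration, and the dates are typed directly in ISO format.

```r
library(dplyr)

climate <- data.frame(
  site  = c(1, 1, 1, 1),
  humid = c(1, 1, 3, 1),
  date  = as.Date(c("2020-06-13", "2020-06-12", "2020-06-11", "2020-06-14"))
)

# Hypothetical: two trap dates at the same site
trap <- data.frame(
  site      = c(1, 1),
  trap_date = as.Date(c("2020-07-01", "2020-07-02"))
)

humid_means <- merge(climate, trap, by = "site") %>%
  # keep climate rows that fall 18 to 20 days before each trap date
  filter(date >= trap_date - 20, date <= trap_date - 18) %>%
  # one result per site and trap date, not just per site
  group_by(site, trap_date) %>%
  summarise(mean_humid = mean(humid), .groups = "drop")

print(humid_means)
```

With only group_by(site), the rows for both trap dates would be averaged together into a single value per site.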
However, it seems that the range of 497 to 500 days before the trap date contains no observations. When I used your specified dates, I got the same result of 0.9:
# precip sums
data %>%
group_by(site) %>%
filter(date >= trap_date-500 & date <= trap_date-497)
# A tibble: 0 x 5
# Groups: site [0]
# ... with 5 variables: site <dbl>, precip <dbl>, humid <dbl>,
# date <date>, trap_date <date>
# using your provided dates
data %>%
group_by(site) %>%
filter(date >= as.Date("2019-02-13") & date <= as.Date("2019-02-16")) %>%
summarise(sum = sum(precip))
# A tibble: 1 x 2
site sum
<dbl> <dbl>
1 1 0.9
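A quick base R check of how far the 2019 dates lie from the 7/1/2020 trap date may help explain the empty result. This is only a sketch using the dates quoted in the question.

```r
trap_day <- as.Date("2020-07-01")

# First and last day of the precipitation window named in the question
window <- as.Date(c("2019-02-13", "2019-02-16"))

# Number of days between each window edge and the trap date
days_before <- as.numeric(trap_day - window)
print(days_before)
```

The two edges come out at 504 and 501 days before the trap date, so a filter of the form date >= trap_date-504 & date <= trap_date-501 would cover those four days.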
Hope I can help.
I have a dataframe with ID and treatment dates, as shown below:
ID Dates
1 01/2/2012
1 02/8/2012
1 03/8/2012
1 04/5/2013
1 05/5/2013
2 01/2/2012
2 03/5/2013
2 04/6/2013
I need to find, for each ID, whether there is a break between treatment dates of more than a year. If yes, then I need to split the dates into two courses and list the start and end date of each. So after running the R code, it would look like this:
ID  Course1StartDate  Course1EndDate  Break1to2(Yr)  Course2StartDate  Course2EndDate
1   01/2/2012         03/8/2012       1.075          04/5/2013         05/5/2013
2   01/2/2012         01/2/2012       1.173          03/5/2013         04/6/2013
The dataframe I have includes hundreds of IDs, and I don't know how many courses there will be. Is there an efficient way of using R to solve this? Thanks in advance!
If d is your data (with Dates stored as Date class), you can identify when the difference between a row's date and the prior row's date exceeds 365 (or perhaps 365.25), and then use cumsum to generate the distinct treatment courses. Finally, add a column that estimates the duration of the "break" between courses.
as_tibble(d) %>%
  group_by(ID) %>%
  mutate(trt = as.numeric(Dates - lag(Dates)),
         trt = cumsum(if_else(is.na(trt), 0, trt) > 365) + 1) %>%
  group_by(ID, trt) %>%
  summarize(StartDate = min(Dates),
            EndDate = max(Dates), .groups = "drop_last") %>%
  mutate(Break = as.numeric(lead(StartDate) - EndDate) / 365)
Output:
ID trt StartDate EndDate Break
<dbl> <dbl> <date> <date> <dbl>
1 1 1 2012-01-02 2012-03-08 1.08
2 1 2 2013-04-05 2013-05-05 NA
3 2 1 2012-01-02 2012-01-02 1.17
4 2 2 2013-03-05 2013-04-06 NA
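To see how the cumsum step assigns course numbers, here is a base R sketch for ID 1 alone. It assumes the dates in the question are month/day/year, which matches the output above; the variable names are only illustrative.

```r
# Treatment dates for ID 1, parsed as month/day/year
dates <- as.Date(c("01/2/2012", "02/8/2012", "03/8/2012", "04/5/2013", "05/5/2013"),
                 format = "%m/%d/%Y")

# Days since the previous treatment; the first row has no predecessor, so use 0
gap <- c(0, as.numeric(diff(dates)))

# Each gap above 365 days bumps the running count, which starts a new course
course <- cumsum(gap > 365) + 1

print(data.frame(dates, gap, course))
```

Only the fourth date follows a gap of more than 365 days (393), so the first three rows fall in course 1 and the last two in course 2.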
I would suggest keeping in this long format, rather than swinging to wide format as you have in your example, especially with hundreds of IDs, all with potentially different numbers of courses. The long format is almost always better.
However, if you really want this, you can continue the pipeline from above, like this:
ungroup() %>%
  pivot_wider(id_cols = ID,
              names_from = trt,
              values_from = c(StartDate:Break),
              names_glue = "Course{trt}_{.value}",
              names_vary = "slowest")
to produce this "wide" format:
ID Course1_StartDate Course1_EndDate Course1_Break Course2_StartDate Course2_EndDate Course2_Break
<dbl> <date> <date> <dbl> <date> <date> <dbl>
1 1 2012-01-02 2012-03-08 1.08 2013-04-05 2013-05-05 NA
2 2 2012-01-02 2012-01-02 1.17 2013-03-05 2013-04-06 NA
I have a table where each comment has a publication date (year, month, day, and time), and I would like to add up the sentiment values by day.
This is how the table is composed:
serie <- data.frame(comments$created_time,sentiment2$positive-sentiment2$negative)
Using dplyr you can do:
library(dplyr)
df %>%
  group_by(as.Date(comments.created_time)) %>%
  summarize(total = sum(sentiment))
Here is some sample data that will help others to troubleshoot and understand the data:
df <- tibble(comments.created_time = c("2015-01-26 22:43:00",
"2015-01-26 22:44:00",
"2015-01-27 22:43:00",
"2015-01-27 22:44:00",
"2015-01-28 22:43:00",
"2015-01-28 22:44:00"),
sentiment = c(1,3,5,1,9,1))
Using the sample data will yield:
# A tibble: 3 × 2
`as.Date(comments.created_time)` total
<date> <dbl>
1 2015-01-26 4
2 2015-01-27 6
3 2015-01-28 10
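The grouping column above ends up with the expression itself as its name. One possible tweak, sketched here on the same sample data, is to name the grouping expression inside group_by(); the name day is just an illustrative choice.

```r
library(dplyr)

df <- tibble(comments.created_time = c("2015-01-26 22:43:00",
                                       "2015-01-26 22:44:00",
                                       "2015-01-27 22:43:00",
                                       "2015-01-27 22:44:00",
                                       "2015-01-28 22:43:00",
                                       "2015-01-28 22:44:00"),
             sentiment = c(1, 3, 5, 1, 9, 1))

daily <- df %>%
  # name the derived date column so it is easy to refer to later
  group_by(day = as.Date(comments.created_time)) %>%
  summarize(total = sum(sentiment), .groups = "drop")

print(daily)
```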
I am currently working on a task that requires me to query a list of stocks from an SQL database.
The problem is that it is a list where there are 1:n stocks traded per date. I want to calculate the share of each stock in the portfolio on a given day (see example) and pass it to a new data frame. In other words, date x occurs two times (once for stock A and once for stock B), and I want to pull these together so that date x occurs only once, with the new values.
'data.frame': 1010 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Date : Date, format: "2019-11-22" "2019-11-21" "2019-11-20" "2019-11-19" ...
$ Close: num 52 51 50.1 50.2 50.2 ...
$ Volume : num 5415 6196 3800 4784 6189 ...
$ Stock_ID : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"), Close=c(50,55,56,10,11,12,200),Volume=c(100,110,150,60,70,80,30),Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
* I could not convert Date to a date variable in this example.
I would like to have a new dataframe that generates the Value traded per day, the weight of each stock, and the daily returns per day, while keeping the number of stocks variable.
I hope I translated the issue properly so that I can receive help.
Thank you!
I think the easiest way to do this would be to use the dplyr package. You may need to read some documentation, but the mutate and group_by functions may be able to do what you want. mutate will allow you to modify the current dataframe by either adding a new column or changing the existing data.
Let's start with a reproducible dataset:
RawInput<-data.frame(Date=c("2017-22-11","2017-22-12","2017-22-13","2017-22-11","2017-22-12","2017-22-13","2017-22-11"),
Close=c(50,55,56,10,11,12,200),
Volume=c(100,110,150,60,70,80,30),
Stock_ID=c(1,1,1,2,2,2,3))
RawInput$Stock_ID<-as.factor(RawInput$Stock_ID)
library(magrittr)
library(dplyr)
dat2 <- RawInput %>%
  group_by(Date, Stock_ID) %>% # each Date/Stock_ID pair occurs only once in this example, but I imagine you want to group by stock
  mutate(CloseMean = mean(Close),
         CloseSum = sum(Close),
         VolumeMean = mean(Volume),
         VolumeSum = sum(Volume)) # whatever computation you need to do with
                                  # multiple stock values for a given date goes here
dat2 %>% select(Stock_ID, Date, CloseMean, CloseSum, VolumeMean, VolumeSum) %>% distinct() # dat2 is still the same size as RawInput, so use distinct() to reduce it to unique rows
# A tibble: 7 x 6
# Groups: Date, Stock_ID [7]
Stock_ID Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 2017-22-11 50 50 100 100
2 1 2017-22-12 55 55 110 110
3 1 2017-22-13 56 56 150 150
4 2 2017-22-11 10 10 60 60
5 2 2017-22-12 11 11 70 70
6 2 2017-22-13 12 12 80 80
7 3 2017-22-11 200 200 30 30
The data set that you provided actually has only one row per Stock_ID and Date combination, so nothing was actually done with the data. However, if you remove Stock_ID from the grouping, you can see how this would work:
dat2 <- RawInput %>%
  group_by(Date) %>%
  mutate(CloseMean = mean(Close),
         CloseSum = sum(Close),
         VolumeMean = mean(Volume),
         VolumeSum = sum(Volume))
dat2 %>% select(Date, CloseMean, CloseSum, VolumeMean, VolumeSum) %>% distinct()
# A tibble: 3 x 5
# Groups: Date [3]
Date CloseMean CloseSum VolumeMean VolumeSum
<fct> <dbl> <dbl> <dbl> <dbl>
1 2017-22-11 86.7 260 63.3 190
2 2017-22-12 33 66 90 180
3 2017-22-13 34 68 115 230
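As a possible alternative to mutate() followed by distinct(), summarise() collapses each group to a single row directly. This sketch reuses the question's sample data and keeps its Date strings as plain labels, since they are not valid calendar dates; the name daily is only illustrative.

```r
library(dplyr)

RawInput <- data.frame(
  Date     = c("2017-22-11", "2017-22-12", "2017-22-13",
               "2017-22-11", "2017-22-12", "2017-22-13", "2017-22-11"),
  Close    = c(50, 55, 56, 10, 11, 12, 200),
  Volume   = c(100, 110, 150, 60, 70, 80, 30),
  Stock_ID = factor(c(1, 1, 1, 2, 2, 2, 3))
)

daily <- RawInput %>%
  group_by(Date) %>%
  # one output row per Date, so no distinct() is needed afterwards
  summarise(CloseMean  = mean(Close),
            CloseSum   = sum(Close),
            VolumeMean = mean(Volume),
            VolumeSum  = sum(Volume),
            .groups = "drop")

print(daily)
```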
After reading your first reply: you will have to be specific about how you are trying to calculate the weight. Also, define your end result.
I'm going to assume weight is just each stock's share of the total Close on a given date, and that the end result should show, for each date, the weight per stock. In other words, a matrix of dates and stock IDs:
library(tidyr)
RawInput %>%
  group_by(Date) %>%
  mutate(weight = Close/sum(Close)) %>%
  select(Date, weight, Stock_ID) %>%
  spread(key = "Stock_ID", value = "weight", fill = 0)
# A tibble: 3 x 4
# Groups: Date [3]
Date `1` `2` `3`
<fct> <dbl> <dbl> <dbl>
1 2017-22-11 0.192 0.0385 0.769
2 2017-22-12 0.833 0.167 0
3 2017-22-13 0.824 0.176 0
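In current tidyr, spread() is superseded by pivot_wider(). A roughly equivalent sketch, under the same assumption that weight is each stock's share of the summed Close per date, might look like this (weights_wide is an illustrative name, and the Date strings are kept as plain labels):

```r
library(dplyr)
library(tidyr)

RawInput <- data.frame(
  Date     = c("2017-22-11", "2017-22-12", "2017-22-13",
               "2017-22-11", "2017-22-12", "2017-22-13", "2017-22-11"),
  Close    = c(50, 55, 56, 10, 11, 12, 200),
  Volume   = c(100, 110, 150, 60, 70, 80, 30),
  Stock_ID = factor(c(1, 1, 1, 2, 2, 2, 3))
)

weights_wide <- RawInput %>%
  group_by(Date) %>%
  mutate(weight = Close / sum(Close)) %>%
  ungroup() %>%
  select(Date, Stock_ID, weight) %>%
  # one column per stock; dates on which a stock did not trade get weight 0
  pivot_wider(names_from = Stock_ID, values_from = weight, values_fill = 0)

print(weights_wide)
```

Each row of the result should sum to 1 across the stock columns, which is a handy sanity check when the number of stocks varies from day to day.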