Merge and fill R from two dataframes on date - r

I have set up a reproducible example as follows:
x<-as.Date(c(as.Date('2015-01-01'):as.Date('2016-01-01')), origin='1970-01-01')
dates<-as.Date(c(as.Date('2015-01-01'),as.Date('2015-03-04'),as.Date('2015-07-01')),origin='1970-01-01')
values<-c(3,7,10)
What I would like is an output data frame where the value for each date in x is taken from the most recent historic date entry (i.e. the latest entry on or before that date). For example:
x, value
2015-01-01, 3
2015-01-02, 3
....
2015-03-04, 7
2015-03-05, 7
....
2015-07-01, 10
2015-07-02, 10
....
2016-01-01, 10
I've currently implemented this through a for loop, but it feels slow and horrendously inefficient - I'm sure there must be some way in R to do it more automatically?

Try something like this:
x <- data.frame(x = as.Date(as.Date('2015-01-01'):as.Date('2016-01-01'), origin = '1970-01-01'))
dates <- as.Date(c('2015-01-01', '2015-03-04', '2015-07-01'))
values <- c(3, 7, 10)
a <- data.frame(dates, values)

# Left-join the sparse entries onto the full date sequence; dates with no
# entry get NA
y <- merge(x, a, by.x = 'x', by.y = 'dates', all.x = TRUE)
colnames(y) <- c("x", "value")

# Recursively fill each NA with the closest earlier non-NA value
test <- function(i) {
  if (is.na(y[i, 2])) {
    if (i == 1) return(NA)
    return(test(i - 1))
  } else {
    return(y[i, 2])
  }
}
y$value <- sapply(1:nrow(y), test)
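If you want to avoid the merge and the recursive fill altogether, here is a minimal alternative sketch (not part of the original answer) using base R's findInterval(); it assumes every date in x falls on or after the first entry in dates, as in the example:
x <- as.Date(as.Date('2015-01-01'):as.Date('2016-01-01'), origin = '1970-01-01')
dates <- as.Date(c('2015-01-01', '2015-03-04', '2015-07-01'))
values <- c(3, 7, 10)
# findInterval(x, dates) returns, for each date in x, the index of the most
# recent entry in `dates` that is not after it, i.e. the latest historic entry
out <- data.frame(x = x, value = values[findInterval(x, dates)])
head(out)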

Related

data.table aggregation based on multiple criteria

I am trying to calculate, for each person (pid), how many coworkers within that person's set of firms (fid) have a yob smaller than the person's yob. The second question is about counting unique pid. I am updating the question based on the efforts of @langtang and my own reflections:
#Libraries:
library(data.table)
library(tictoc)
#Make it replicable:
set.seed(1)
#Define parameters of the simulation:
pid<-1:1000
fid<-1:5
time_periods<-1:12
yob<-sample(seq(1900,2010),length(pid),replace = TRUE)
#Obtain in how many firms a given pid works in a given month:
nr_firms_pid_time<-sample(1:length(fid),length(pid),replace = TRUE)
#This means:
#First pid: works in first firm;
#Second pid: works in first four firms;
#Third pid: works in first firm;
#Fourth pid: works in two firms.
#Aux functions:
function_rep <- function(x) {
  rep(1:12, x)
}
function_seq <- function(x) {
  1:x
}
#Create panel
data_panel<-data.table(pid = rep(pid,nr_firms_pid_time*length(time_periods)))
data_panel[,yearmonth:=do.call(c,sapply(nr_firms_pid_time,function_rep))]
data_panel[,fid:=rep(do.call(c,sapply(nr_firms_pid_time,function_seq)),each = 12)]
#Merge in yob:
data_yob<-data.table(pid = pid,yob = yob)
data_panel<-merge(data_panel,data_yob,by = c("pid"),all.x = TRUE)
#Remove not needed stuff:
rm(pid)
rm(fid)
rm(time_periods)
rm(yob)
rm(data_yob)
#Solution 1 (terribly slow):
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(id, yrmonth) {
  #First obtain firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  #Then extract data at a given month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  #Then extract his dob:
  dob_to_use <- unique(data_func[pid == id, yob])
  sum(data_func[pid != id]$yob < dob_to_use)
}
older_coworkers_unique = function(id, yrmonth) {
  #First obtain firms in which a worker works in a given month:
  id_firms <- data_panel[pid == id & yearmonth == yrmonth, fid]
  #Then extract data at a given month:
  data_func <- data_panel[(fid %in% id_firms) & (yearmonth == yrmonth)]
  #Then extract his dob:
  dob_to_use <- unique(data_func[pid == id, yob])
  #Get UNIQUE number of coworkers:
  sum(unique(data_func[pid != id], by = c("pid"))$yob < dob_to_use)
}
#Works but is terribly slow:
tic()
sol_1<-data_panel[, .(older_coworkers(.BY$pid,.BY$yearmonth)),by = c("pid","yearmonth")]
toc()
#Solution 2 (better, but I do not like it; what if I want unique older coworkers?)
function_older <- function(x) {
  noc <- lapply(
    1:length(x),
    function(i) {
      sum(x[-i] < x[i])
    }
  )
  unlist(noc)
}
#This is fast but I cannot get unique number:
tic()
sol_2<-data_panel[,.(pid,function_older(yob)),by = c("fid","yearmonth")][,sum(V2),by = c("pid","yearmonth")][order(pid,yearmonth)]
toc()
#Everything works:
identical(sol_1,sol_2)
The question is how to implement older_coworkers_unique in a very fast manner. Any suggestions would be greatly appreciated.
Update, based on OP's new reproducible dataset
If you want a one-liner to reproduce sol_2 above, you can do this:
data_panel[data_panel, on=.(yearmonth, fid, yob<yob )][, .N, by=.(i.pid, yearmonth)]
Explanation:
The above uses a non-equi join, which can be a helpful approach with data.table. I am joining data_panel on itself, requiring that yearmonth and fid be equal, but that the year of birth on the left side of the join be less than the year of birth on the right side. This returns a data.table where firm and yearmonth match, but where every older coworker (pid) is matched to their younger coworkers (i.pid). We can thus count the rows (.N) by each younger coworker (i.pid) and yearmonth. This produces the same result as sol_1 and sol_2 above. You commented that you would like to find the unique coworkers, and the second approach below does that by using length(unique(pid)).
The same non-equi join approach can be used to get unique older coworkers, like this (the %>% pipe assumes magrittr or dplyr is loaded):
data_panel[data_panel, on = .(yearmonth, fid, yob < yob)] %>%
  .[, .(older_coworkers = length(unique(pid))), by = .(i.pid, yearmonth)]
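If you prefer to stay in pure data.table syntax (avoiding the magrittr pipe), the same result can be written as a chain; data.table's uniqueN() is equivalent to length(unique()):
data_panel[data_panel, on = .(yearmonth, fid, yob < yob)][
  , .(older_coworkers = uniqueN(pid)), by = .(i.pid, yearmonth)]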
Previous Response, based on OP's original very small example dataset
I'm not sure exactly what you want the output to look like. However, in your example data I first drop the duplicate row (I couldn't understand why it was there; see my comment above), and then I apply a function that counts the number of older coworkers for each pid/fid/ym.
# make your example data unique
data = unique(data)
# make a small function that counts the number of coworkers with
# earlier dob than this individual
older_coworkers = function(birth, firm, yrmonth, id) {
  data[dob < birth & fid == firm & ym == yrmonth & pid != id, .N]
}
# apply the function to the data
data[, .(num_older_coworkers = older_coworkers(dob, .BY$fid, .BY$ym, .BY$pid)), by = .(pid, fid, ym)]
Output:
pid fid ym num_older_coworkers
1: 1 1 200801 1
2: 1 2 200802 0
3: 2 1 200801 0
4: 3 2 200801 0
Person 1 at Firm 1 has one older coworker in the month of 2008-01 -- that is, Person 2 at Firm 1 in 2008-01.
Person 1 at Firm 2 (born in 1950) would also have an older coworker, namely Person 3 at Firm 2 (born in 1930), but the result shows 0, because Person 1's ym at Firm 2 (i.e. 2008-01) does not match that potential older coworker's ym (i.e. 2008-02).

Is there a way to mark "troughs" in a graph in R according to specific criteria?

I have a data set which, when plotted, produces a graph that looks like this:
Plot
The head of this data is:
> head(data_frame)
score position
73860 10 43000
73859 10 43001
73858 10 43002
73857 10 43003
73856 10 43004
73855 10 43005
I've uploaded the whole file as a tab delimited text file here.
As you can see, the plot has regions which have a score of around 10, but there's one region in the middle that "dips". I would like to identify these dips.
Defining a dip as:
Starting when the score is below 7
Ending when the score rises to 7 or above and stays at 7 or above for at least 500 positions
I would like to identify all the regions which meet the above definition, and output their start and end positions. In this case that would only be the one region.
However, I'm at a bit of a loss as to how to do this. Looks like the rle() function could be useful, but I'm not too sure how to implement it.
Expected output for the data frame would be something like:
[1] 44561 46568
(I haven't actually checked that everything in between these would qualify under the definition, but from the plot this looks about right)
I would be very grateful for any suggestions!
Andrei
So I've come up with one solution that uses a series of loops. I do realise this is inefficient, though, so if you have a better answer, please let me know.
results <- data.frame(matrix(ncol=2,nrow=1))
colnames(results) <- c("start","end")
state <- "out"
count <- 1
for (i in 1:dim(data_frame)[1]) {
  print(i / dim(data_frame)[1])
  if (data_frame[i, 3] < 7 & state == "out") {
    results[count, 1] <- data_frame[i, 2]
    state <- "in"
    next
  }
  if (data_frame[i, 3] >= 7 & state == "in") {
    if ((i + 500) > dim(data_frame)[1]) {
      results[count, 2] <- data_frame[dim(data_frame)[1], 2]
      state <- "out"
      break
    }
    if (any(data_frame[(i + 1):(i + 500), 3] < 7)) {
      next
    } else {
      results[count, 2] <- data_frame[i - 1, 2]
      count <- count + 1
      state <- "out"
      next
    }
  }
  if ((i + 500) > dim(data_frame)[1] & state == "out") {
    break
  }
}
Something like this is a tidyverse solution that uses rle(), as the OP suggested...
library(tibble)
library(dplyr)

below7 <- data_frame$score < 7
x <- rle(below7)
runs <- tibble(
  RunLength = x$lengths,
  Below7 = x$values
) %>%
  mutate(
    # start position of each run, anchored at the first position in the data
    RunStart = ifelse(
      row_number() == 1,
      data_frame$position[1],
      data_frame$position[1] + cumsum(RunLength) - RunLength + 1
    ),
    RunEnd = RunStart + RunLength - 1,
    # a run belongs to a dip if it is below 7, or if it is an above-7 run
    # shorter than 500 positions
    Dip = Below7 | (!Below7 & RunLength < 500)
  )
as.data.frame(runs)
Giving
RunLength Below7 RunStart RunEnd Dip
1 1393 FALSE 43000 44392 FALSE
2 84 TRUE 44394 44477 TRUE
3 84 FALSE 44478 44561 TRUE
...
19 60 FALSE 46338 46397 TRUE
20 171 TRUE 46398 46568 TRUE
21 2433 FALSE 46569 49001 FALSE
So, to get the OP's final answer:
runs %>%
  filter(Dip) %>%
  summarise(
    DipStart = min(RunStart),
    DipEnd = max(RunEnd)
  )
# A tibble: 1 x 2
DipStart DipEnd
<dbl> <dbl>
1 44394 46568
If the original data.frame might contain more than one dip, you'd have to do a little more work when creating the runs tibble: having identified each individual run, you'd need to create an additional column, DipIndex say, which indexes each individual dip (see the sketch below).
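A minimal sketch of that idea (assuming dplyr is loaded and the runs tibble built above; NewDip and DipIndex are names introduced here for illustration):
library(dplyr)

runs %>%
  mutate(NewDip = Dip & !lag(Dip, default = FALSE),  # TRUE at the first run of each dip
         DipIndex = cumsum(NewDip)) %>%              # number the dips 1, 2, ...
  filter(Dip) %>%
  group_by(DipIndex) %>%
  summarise(DipStart = min(RunStart), DipEnd = max(RunEnd))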

How to setup two dynamic conditions in SUMIFS like problem in R?

I already tried my best but am still pretty much a newbie to R.
Based on roughly 500 MB of input data that currently looks like this:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days
1 2818 5829821 335511.0 1
2 20168 5829746 335265.2 3
3 25428 5830640 331534.6 0
4 27886 5832156 332003.1 3
5 28658 5830888 329727.2 3
6 28871 5829980 332071.3 7
I need to calculate the conditional sum of reviews_last30days - the condition being a specific, changing area range for each record, i.e. R should sum only those reviews for which calc.latitude and calc.longitude do not deviate more than +/-500 from the latitude and longitude values in the current row.
EXAMPLE:
ROW 1 has a calc.latitude 5829821 and a calc.longitude 335511.0, so R should take the sum of all reviews_last30days for which the following ranges apply: calc.latitude 5829321 to 5830321 (value of Row 1 latitude +/-500)
calc.longitude 335011.0 to 336011.0 (value of Row 1 longitude +/-500)
So my intended output would look somewhat like this in column 5:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days reviewsper1000
1 2818 5829821 335511.0 1 4
2 20168 5829746 335265.2 3 4
3 25428 5830640 331534.6 0 10
4 27886 5832156 332003.1 3 3
5 28658 5830888 331727.2 3 10
6 28871 5829980 332071.3 7 10
Hope I calculated correctly in my head, but you get the idea.
So far I have particularly struggled with the fact that my sum conditions are dynamic and "newly assigned", since the latitude and longitude conditions have to be adjusted for each record.
My current code looks like this but it obviously doesn't work that way:
review1000 <- function(TOTALLISTINGS = NULL) {
  # tibble to return
  to_return <- TOTALLISTINGS %>%
    group_by(listing_id) %>%
    summarise(
      reviews1000 = sum(reviews_last30days[(calc.latitude >= (calc.latitude - 500) | calc.latitude <= (calc.latitude + 500))])
    )
  return(to_return)
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)
I know I also would have to add something for longitude in the code above
Does anyone have an idea how to fix this?
Any help or hints highly appreciated & thanks in advance! :)
See whether the code below helps. Note that between() comes from dplyr, so that package needs to be loaded (data.table also provides an equivalent between()).
library(dplyr)  # for between()
TOTALLISTINGS$reviews1000 <- sapply(1:nrow(TOTALLISTINGS), function(r) {
  currentLATI <- TOTALLISTINGS$calc.latitude[r]
  currentLONG <- TOTALLISTINGS$calc.longitude[r]
  sum(TOTALLISTINGS$reviews_last30days[
    between(TOTALLISTINGS$calc.latitude, currentLATI - 500, currentLATI + 500) &
      between(TOTALLISTINGS$calc.longitude, currentLONG - 500, currentLONG + 500)])
})
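On roughly 500 MB of data the row-by-row sapply() above may still be slow. A possible alternative (not from the original answer, and only a sketch) is a data.table non-equi self-join, assuming the data fits in memory and that, as in the sapply version, each listing counts its own reviews inside its own window:
library(data.table)

DT <- as.data.table(TOTALLISTINGS)
# pre-compute the +/-500 window bounds for every listing
DT[, `:=`(lat_lo = calc.latitude - 500, lat_hi = calc.latitude + 500,
          lon_lo = calc.longitude - 500, lon_hi = calc.longitude + 500)]
# non-equi self-join: for each listing (each row of i), sum the reviews of all
# listings whose coordinates fall inside that listing's window
agg <- DT[DT,
          on = .(calc.latitude >= lat_lo, calc.latitude <= lat_hi,
                 calc.longitude >= lon_lo, calc.longitude <= lon_hi),
          .(reviews1000 = sum(reviews_last30days)),
          by = .EACHI]
DT[, reviews1000 := agg$reviews1000]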

Quickly create new columns in dataframe using lists - R

I have data containing quotations of indexes (S&P 500, CAC 40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of the index for each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1) and for each index. I began this way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)) {
  Time <- listTime[j]
  for (i in 1:length(listIndexes)) {
    Index <- listIndexes[i]
    temp[[paste0(Index, "perf", Time)]] <- temp[[paste0(Index, Time)]] / temp[[paste0(Index, "close")]] - 1
    # other stuff to do but with the same concept
  }
}
but it takes quite a long time. Is there a way to get rid of the for loops or to make the creation of those variables quicker? I read some things about the apply family of functions and their derivatives, but I do not see if and how they should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 40001.2 4005 .... 2000 .... 2003
20140106 4005 4004 40003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) which would be added to temp
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
idem for the following day
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler. For instance, you could convert it into a long, tidy format with columns for Date, Index, Time, Value and ClosingValue, and operate directly on just the two relevant columns there; a rough sketch follows.
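This is only a sketch of that reshape idea, assuming tidyr and dplyr are available and the wide layout shown in the question (column names like CAC1000 and CACcloseyesterday); the names_pattern and the use of the previous day's close are assumptions made for illustration:
library(dplyr)
library(tidyr)

long <- temp %>%
  pivot_longer(cols = -date,
               names_to = c("Index", "Time"),
               names_pattern = "^([A-Z]+)(.*)$",
               values_to = "Value")

perf <- long %>%
  group_by(date, Index) %>%
  mutate(Perf = Value / Value[Time == "closeyesterday"] - 1) %>%
  ungroup() %>%
  filter(!Time %in% c("close", "closeyesterday"))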

R - subtracting multiple columns from multiple columns with 2 data frames

I have two dataframes as below:
> head(VN.GRACE.Int, 4)
DecimDate CSR GFZ JPL
1 2003.000 12.1465164 5.50259937 15.7402752
2 2003.083 1.8492431 0.27744418 3.4811423
3 2003.167 1.5168512 -0.06333961 1.7962201
4 2003.250 -0.2355813 6.16296554 0.7215013
> head(VN.GLDAS, 4)
Decim_Date NOAH_SManom CLM_SManom VIC_SManom SM_Month_Mean
1 2003.000 3.0596372 0.4023805 -0.2175665 1.081484
2 2003.083 -1.4459928 -1.0255955 -3.1338024 -1.868464
3 2003.167 -3.9945788 -1.4646734 -4.2052981 -3.221517
4 2003.250 -0.9737429 0.4213161 -1.0537822 -0.535403
EDIT: The names below (VN.GRACE.Int and VN.GLDAS) are the names of the two data frames above. I have added an example of what the final data frame will look like.
I want to subtract columns [,2:5] in the VN.GLDAS data frame from EACH of the columns [,2:4] in VN.GRACE.Int and put the results in a separate data frame (the new data frame will have 12 columns) as below:
EXAMPLE <- data.frame(CSR_NOAH=numeric(), CSR_CLM=numeric(), CSR_VIC=numeric(), CSR_SM_Anom=numeric(),
GFZ_NOAH=numeric(), GFZ_CLM=numeric(), GFZ_VIC=numeric(), GFZ_SM_Anom=numeric(),
JPL_NOAH=numeric(), JPL_CLM=numeric(), JPL_VIC=numeric(), JPL_SM_Anom=numeric())
I've looked into sweep(), as suggested in another post, but am not sure whether my problem would be better suited to a for loop, which I'm a novice at. I've also looked at subtracting values in one data frame from another, but I don't believe that answers my question. Thanks in advance.
# For each of the four VN.GLDAS columns, subtract it from all three
# VN.GRACE.Int columns, then bind the results back onto the date column
res <- cbind(VN.GRACE.Int[, 1, drop = FALSE],
             do.call(cbind, lapply(VN.GLDAS[, 2:5],
                                   function(x) VN.GRACE.Int[, 2:4] - x)))
dim(res)
#[1] 4 13
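If you then want readable column names in the style of the EXAMPLE frame, one possible follow-up (a sketch; note the derived columns come out in GLDAS-major order, i.e. all NOAH differences first, then CLM, and so on) is:
# build 12 names such as "NOAH_SManom_CSR" to match the cbind order above
names(res)[-1] <- paste(rep(names(VN.GLDAS)[2:5], each = 3),
                        names(VN.GRACE.Int)[2:4], sep = "_")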
