I have the following Data:
id customer date value1 value2 isTrue
10 13 2013-08-20 00:00:00.0000 170 180680 0
11 13 2013-09-02 00:00:00.0000 190 181830 0
12 13 2013-09-07 00:00:00.0000 150 183000 1
13 13 2013-09-14 00:00:00.0000 150 183930 0
14 13 2013-09-16 00:00:00.0000 150 184830 0
15 13 2013-09-19 00:00:00.0000 150 185765 1
16 13 2013-09-30 00:00:00.0000 800 187080 0
17 13 2013-10-02 00:00:00.0000 100 188210 0
28 13 2013-10-04 00:00:00.0000 380 188250 1
How can I get the following results, where SumValue1 is the running sum of value1 up to the row where isTrue becomes 1 (resetting afterwards), and DifferenceValue2 is the difference in value2 between consecutive rows where isTrue is 1?
id customer date value1 value2 isTrue SumValue1 DifferenceValue2
10 13 2013-08-20 00:00:00.0000 170 180680 0
11 13 2013-09-02 00:00:00.0000 190 181830 0
12 13 2013-09-07 00:00:00.0000 150 183000 1 510 2320
13 13 2013-09-14 00:00:00.0000 150 183930 0
14 13 2013-09-16 00:00:00.0000 150 184830 0
15 13 2013-09-19 00:00:00.0000 150 185765 1 450 2765
16 13 2013-09-30 00:00:00.0000 800 187080 0
17 13 2013-10-02 00:00:00.0000 100 188210 0
28 13 2013-10-04 00:00:00.0000 380 188250 1 1280 2485
Assuming id ordering, this query will do:
SELECT
id, customer, date, value1, value2, isTrue,
CASE isTrue WHEN 1 THEN (SELECT TOTAL(value1) FROM t WHERE customer=t2.customer AND id>t2.prev_id AND id<=t2.id) END AS SumValue1,
CASE isTrue WHEN 1 THEN value2-(SELECT value2 FROM t WHERE customer=t2.customer AND id=t2.prev_id) END AS DifferenceValue2
FROM (SELECT *, CASE isTrue WHEN 1 THEN COALESCE((SELECT id FROM t AS _ WHERE customer=t.customer AND date<t.date AND isTrue ORDER BY date DESC LIMIT 1), -1) END AS prev_id FROM t) AS t2;
Step by step:
The previous id where isTrue is set is given by:
SELECT id FROM t AS _ WHERE customer=t.customer AND date<t.date AND isTrue ORDER BY date DESC LIMIT 1
Using COALESCE(..., -1) ensures a non-null id that sorts before all other ids (-1).
SELECT *, CASE isTrue WHEN 1 THEN ... END AS prev_id FROM t will return all rows from t with column prev_id added.
Finally, querying SELECT TOTAL(value1) FROM t WHERE customer=t2.customer AND id>t2.prev_id AND id<=t2.id and value2-(SELECT value2 FROM t WHERE customer=t2.customer AND id=t2.prev_id) over the previous results returns the desired output.
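This is not part of the original answer, but since the related questions below are in R, here is a rough dplyr/tidyr sketch of the same reset logic, assuming the data above is loaded as a data frame named t with isTrue stored as 0/1:
library(dplyr)
library(tidyr)

result <- t %>%
  group_by(customer) %>%
  arrange(id, .by_group = TRUE) %>%
  mutate(
    # rows up to and including each isTrue row form one "run"
    run          = lag(cumsum(isTrue), default = 0L),
    # value2 of the most recent isTrue row seen so far (NA before the first one)
    last_true_v2 = if_else(isTrue == 1, as.numeric(value2), NA_real_)
  ) %>%
  fill(last_true_v2) %>%
  mutate(prev_true_v2 = coalesce(lag(last_true_v2), as.numeric(first(value2)))) %>%
  group_by(customer, run) %>%
  mutate(
    SumValue1        = if_else(isTrue == 1, sum(as.numeric(value1)), NA_real_),
    DifferenceValue2 = if_else(isTrue == 1, value2 - prev_true_v2, NA_real_)
  ) %>%
  ungroup() %>%
  select(-run, -last_true_v2, -prev_true_v2)
On the sample data this reproduces 510/2320, 450/2765 and 1280/2485 on the flagged rows.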
Related
I have a dataset with over 1,000 unique IDs, and for each ID about 15 surgery codes done on different dates (recorded as Days.diff).
I want to take only one record per 30 days within each surgery code group for each ID.
Here is some demo data:
ID Age Diag.Date Surgery.Code Days.diff
1 1 67 4/8/2011 A364 421
2 1 67 4/8/2011 A364 1197
3 1 67 4/8/2011 A364 2207
4 1 67 4/8/2011 A364 2226
5 1 67 4/8/2011 A364 2247
6 1 67 4/8/2011 A364 2254
7 1 67 4/8/2011 A364 2331
8 1 67 4/8/2011 A364 2367
9 1 67 4/8/2011 A364 2905
10 1 67 4/8/2011 A364 2918
11 1 67 4/8/2011 D365 2200
12 1 67 4/8/2011 D441 308
13 1 67 4/8/2011 D443 218
14 1 67 4/8/2011 A446 308
15 2 56 6/4/2018 A453 2260
16 2 56 6/4/2018 D453 645
17 2 56 6/4/2018 D453 3095
18 2 56 6/4/2018 B453 645
The difference 2226 - 2207 is 19 days, so row 4 should be deleted; the difference 2247 - 2207 is 40 days, so row 5 should be kept.
Likewise, 2254 - 2247 is 7 days, so row 6 should be deleted.
Similarly, row 10 should be deleted.
Any help is appreciated!
1. Use dplyr::group_by(ID, Surgery.Code) to filter within individuals and surgeries.
2. Within each group, use Days.diff - dplyr::lag(Days.diff) <= 30 to test for adjacent rows within 30 days.
3. Because the results of step 2 may change when rows are removed, you'll want to iterate by removing one row at a time per group, then re-testing. You can use while to iterate until no more cases are detected.
library(dplyr)

filtered <- surgeries %>%
  group_by(ID, Surgery.Code) %>%
  mutate(within30 = if_else(
    Days.diff - lag(Days.diff) <= 30,
    row_number(),
    NA_integer_
  ))

while (any(!is.na(filtered$within30))) {
  filtered <- filtered %>%
    mutate(within30 = if_else(
      Days.diff - lag(Days.diff) <= 30,
      row_number(),
      NA_integer_
    )) %>%
    filter(is.na(within30) | within30 != min(within30, na.rm = TRUE))
}

filtered %>%
  select(!within30) %>%
  ungroup()
#> # A tibble: 15 x 5
#> ID Age Diag.Date Surgery.Code Days.diff
#> <int> <int> <chr> <chr> <int>
#> 1 1 67 4/8/2011 A364 421
#> 2 1 67 4/8/2011 A364 1197
#> 3 1 67 4/8/2011 A364 2207
#> 4 1 67 4/8/2011 A364 2247
#> 5 1 67 4/8/2011 A364 2331
#> 6 1 67 4/8/2011 A364 2367
#> 7 1 67 4/8/2011 A364 2905
#> 8 1 67 4/8/2011 D365 2200
#> 9 1 67 4/8/2011 D441 308
#> 10 1 67 4/8/2011 D443 218
#> 11 1 67 4/8/2011 A446 308
#> 12 2 56 6/4/2018 A453 2260
#> 13 2 56 6/4/2018 D453 645
#> 14 2 56 6/4/2018 D453 3095
#> 15 2 56 6/4/2018 B453 645
Created on 2022-03-01 by the reprex package (v2.0.1)
I am trying to extract the tables from different pages into one data frame. However, I am only able to get it as a list and I am unable to convert to one table. Could you please help me out?
Here's the code we are using so far:
library(rvest)  # read_html(), html_nodes(), html_table(); also provides %>%

Tables_recent <- lapply(paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;home_or_away=1;home_or_away=2;home_or_away=3;page=",
                               1:50,
                               ";template=results;type=batting"),
                        function(url){
                          url %>% read_html() %>%
                            html_nodes(xpath = '//*[@id="ciHomeContentlhs"]/div[3]/table[3]') %>%
                            html_table()
                        })
Each page's result is nested within a list, so you need to pull out the first element, and also remove the entries that are just "No records available to match this query".
library(dplyr)
library(textreadr)
library(rvest)

LINK <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;home_or_away=1;home_or_away=2;home_or_away=3;page="

Tables_recent <- lapply(paste0(LINK, 1:50, ";template=results;type=batting"), function(url){
  url %>% read_html() %>%
    html_nodes(xpath = '//*[@id="ciHomeContentlhs"]/div[3]/table[3]') %>%
    html_table()
})
We check the number of columns for each page entry:
> sapply(Tables_recent,function(i)ncol(i[[1]]))
[1] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
[26] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 1 1 1 1 1 1 1 1 1 1
Those with ncol == 1 are empty:
> Tables_recent[[50]]
[[1]]
X1
1 No records available to match this query
We loop through the non-empty entries and then rbind:
wh = which(sapply(Tables_recent,function(i)ncol(i[[1]]))==16)
Table = do.call(rbind,lapply(Tables_recent[wh],"[[",1))
head(Table)
Player Span Mat Inns NO Runs HS Ave BF SR 100
1 RG Sharma (INDIA) 2007-2019 101 93 14 2539 118 32.13 1843 137.76 4
2 V Kohli (INDIA) 2010-2019 72 67 18 2450 90* 50 1811 135.28 0
3 MJ Guptill (NZ) 2009-2019 83 80 7 2436 105 33.36 1810 134.58 2
4 Shoaib Malik (ICC/PAK) 2006-2019 111 104 30 2263 75 30.58 1824 124.06 0
5 BB McCullum (NZ) 2005-2015 71 70 10 2140 123 35.66 1571 136.21 2
6 DA Warner (AUS) 2009-2019 76 76 8 2079 100* 30.57 1476 140.85 1
50 0 4s 6s
1 18 6 225 115 NA
2 22 2 235 58 NA
3 15 2 215 113 NA
4 7 1 186 61 NA
5 13 3 199 91 NA
6 15 5 203 86 NA
What you have is a list. Try:
do.call(rbind, lapply(Tables_recent, function(x) {x <- as.data.frame(x); if (length(x) > 1) x}))
Or you could do:
do.call(rbind, Filter(function(x) length(x) > 1, lapply(Tables_recent, as.data.frame)))
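As a further hedged alternative (not from the original answers): since Tables_recent is a list of one-element lists of tables, you can pull each inner table out, drop the empty placeholder pages, and stack them with dplyr::bind_rows:
library(dplyr)

# pull the single table out of each inner list, keep only real result tables
# (the empty pages have just one column), then stack them into one data frame
tables <- lapply(Tables_recent, `[[`, 1)
Table  <- bind_rows(Filter(function(x) ncol(x) > 1, tables))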
I'm pretty fresh to R (like 2 days old). I have a set of data that is a time series sampled every 200 ms over a few hours. Here's the head(dat):
Date Time MSec Sample Pat1 Pat2 Pat3
1 8/7/~ 14:34 411 0 100 13 13
2 8/7/~ 14:34 615 1 13 13 143
3 8/7/~ 14:34 814 2 13 13 13
4 8/7/~ 14:34 12 3 130 13 13
5 8/7/~ 14:34 216 4 13 13 130
6 8/7/~ 14:34 417 5 139 13 13
It goes on for 2 hours, so several thousand points, and covers several hundred patients. The 13 is our baseline, and what we are interested in is spikes in activity over, say, 100. I have been trying to create a new column for each patient column marking every time a signal is over 100. I've worked out the following code:
dat$Pat1exc <- as.numeric(dat$Pat1 >=100)
This works and gives me the new column; my data now looks like this:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1exc
1 8/7/~ 14:34 411 0 100 13 13 1
2 8/7/~ 14:34 615 1 13 13 143 0
3 8/7/~ 14:34 814 2 13 13 13 0
4 8/7/~ 14:34 12 3 130 13 13 1
5 8/7/~ 14:34 216 4 13 13 130 0
6 8/7/~ 14:34 417 5 139 13 13 1
This is exactly what I want, but I don't know how to iterate through each column to create Pat2exc, Pat3exc, etc. I figured I could use sapply or vapply after I create a function. However, I can't get the function to work.
excite <- function(x, y) {y <- as.numeric(x >=100)}
excite(x=dat$Pat2, y=dat$Pat2exc)
This gives me no errors, but doesn't modify the dat data frame. Essentially, in the end I just want to sum up all the excited columns (>=100). If there is an easier way to count the samples over 100 for each patient then I'd be happy to learn how to do that as well.
Sorry if this is unclear. Thanks in advance.
P.S.: I am also looking for a good way to combine the Time and Msec columns.
Edit: Added in unabbreviated data:
Date Time Msecs
8/7/2018 14:34:07 411
8/7/2018 14:34:07 615
8/7/2018 14:34:07 814
8/7/2018 14:34:08 12
8/7/2018 14:34:08 216
8/7/2018 14:34:08 417
8/7/2018 14:34:08 619
8/7/2018 14:34:08 816
8/7/2018 14:34:09 15
We can use mutate_at from dplyr to create the binary variables and mutate + rowSums to add them all up:
library(dplyr)

df %>%
  mutate_at(vars(starts_with("Pat")), funs(exc = (. >= 100) * 1)) %>%
  mutate(exc_total = rowSums(.[grepl('_exc', names(.))]))
Result:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1_exc Pat2_exc Pat3_exc exc_total
1 8/7/~ 14:34 411 0 100 13 13 1 0 0 1
2 8/7/~ 14:34 615 1 13 13 143 0 0 1 1
3 8/7/~ 14:34 814 2 13 13 13 0 0 0 0
4 8/7/~ 14:34 12 3 130 13 13 1 0 0 1
5 8/7/~ 14:34 216 4 13 13 130 0 0 1 1
6 8/7/~ 14:34 417 5 139 13 13 1 0 0 1
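Since the stated end goal is a per-patient count of samples at or above 100, a short base-R sketch may also help (assuming, as in the data above, that the patient columns are the only ones whose names start with "Pat"):
# count, for each patient column, how many samples are >= 100
colSums(dat[grepl("^Pat", names(dat))] >= 100)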
I have a data frame like the one below, where Error is 1 if there is an error in DOB, and the corrected DOB for the same record follows with no error (Error = NA). I want to extract only the records that have Error = 1 and were never corrected. Could anyone help me with this?
ID Date1 Date2 DOB Code Error
381 2002-10-01 2015-10-01 1967-01-22 4 1
381 2002-10-01 2015-10-01 1967-01-20 4 NA
381 2011-10-01 2015-10-01 1969-05-13 11 1
381 2011-10-01 2015-10-01 1968-05-13 11 NA
837 2005-12-07 2015-12-07 1987-11-19 8 1
837 2005-12-08 2015-12-08 1989-12-07 8 1
837 2001-04-15 2015-04-15 1984-08-11 18 1
840 2001-04-23 2015-04-23 1999-03-14 18 NA
The output table should contain the rows below.
ID Date1 Date2 DOB Code Error
837 2005-12-07 2015-12-07 1987-11-19 8 1
837 2005-12-08 2015-12-08 1989-12-07 8 1
837 2001-04-15 2015-04-15 1984-08-11 18 1
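No answer is included with this question; one possible approach, assuming a corrected row shares the same ID and Code as its error row and carries Error = NA, is a minimal dplyr sketch:
library(dplyr)

df %>%
  group_by(ID, Code) %>%
  # keep only ID/Code groups that never received a corrected (Error = NA) row
  filter(!any(is.na(Error))) %>%
  ungroup()
Applied to the sample data above, this keeps just the three ID 837 rows shown in the expected output.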
I'm struggling with a space-time problem. Help is very much appreciated! Here's what I'm looking for:
I've got a data frame with fixes of roe deer positions (x,y coordinates) taken at an irregular, approximately five-minute interval.
I want to find all fixes within a certain radius (let's say 15 meters) of the current position of the roe deer, with the condition that they are consecutive for about 30 minutes (so six fixes). Obviously the number of fixes can vary along the trajectory. I want to write the number of fixes into a new column and mark in another column (0 = no, 1 = yes) whether the condition is met.
If the condition is met, I would like to calculate the center of gravity (in x,y coordinates) of that point cloud and write it to this or another data frame.
Following another user's question (Listing number of observations by location), I was able to find the fixes within the radius, but I couldn't figure out a way to ensure that they are consecutive.
Here are some rows of my data frame (in reality there are more than 10,000 rows; because I subsetted this data frame, the IDs do not start at one):
FID No CollarID Date Time Latitude__ Longitude_ Height__m_ DOP FixType
1 0 667 7024 2013-10-22 06:01:49 47.26859 8.570701 609.94 10.6 GPS-3D
2 1 668 7024 2013-10-22 06:06:04 47.26861 8.570634 612.31 10.4 GPS-3D
3 2 669 7024 2013-10-22 06:11:07 47.26871 8.570402 609.43 9.8 GPS-3D
4 3 670 7024 2013-10-22 06:16:14 47.26857 8.570796 665.40 4.4 val. GPS-3D
5 4 671 7024 2013-10-22 06:20:36 47.26855 8.570582 653.65 4.6 val. GPS-3D
6 5 672 7024 2013-10-22 06:25:50 47.26850 8.570834 659.03 4.8 val. GPS-3D
7 6 673 7024 2013-10-23 06:00:53 47.27017 8.569882 654.86 3.6 val. GPS-3D
8 7 700 7024 2013-10-26 12:00:18 47.26904 8.569596 651.88 3.8 val. GPS-3D
9 8 701 7024 2013-10-26 12:05:41 47.26899 8.569640 652.76 3.8 val. GPS-3D
10 9 702 7024 2013-10-26 12:10:40 47.26898 8.569534 650.42 4.6 val. GPS-3D
11 10 703 7024 2013-10-26 12:16:17 47.26896 8.569606 653.77 11.4 GPS-3D
12 11 704 7024 2013-10-26 12:20:18 47.26903 8.569792 702.49 9.8 val. GPS-3D
13 12 705 7024 2013-10-26 12:25:47 47.26901 8.569579 670.12 2.4 val. GPS-3D
14 13 706 7024 2013-10-26 12:30:18 47.26900 8.569477 685.65 2.0 val. GPS-3D
15 14 707 7024 2013-10-26 12:35:23 47.26885 8.569400 685.15 6.2 val. GPS-3D
Temp___C_ X Y ID Trajectory distance speed timelag timevalid
1 19 685667.7 235916.0 RE01 RE01 5.420858 0.021258268 4.250000 1
2 20 685662.6 235917.8 RE01 RE01 21.276251 0.070218649 5.050000 1
3 20 685644.9 235929.5 RE01 RE01 34.070730 0.110979577 5.116667 1
4 20 685675.0 235913.5 RE01 RE01 16.335573 0.062349516 4.366667 1
5 20 685658.8 235911.3 RE01 RE01 19.896906 0.063365941 5.233333 1
6 20 685677.9 235905.7 RE01 RE01 199.248728 0.002346781 1415.050000 0
7 22 685603.2 236090.4 RE01 RE01 126.831124 0.000451734 4679.416667 0
8 22 685583.4 235965.1 RE01 RE01 6.330467 0.019598970 5.383333 1
9 22 685586.8 235959.8 RE01 RE01 8.270701 0.027661208 4.983333 1
10 23 685578.8 235957.8 RE01 RE01 5.888147 0.017472246 5.616667 1
11 22 685584.3 235955.7 RE01 RE01 16.040998 0.066560158 4.016667 1
12 23 685598.3 235963.6 RE01 RE01 16.205330 0.049256322 5.483333 1
13 23 685582.2 235961.6 RE01 RE01 7.742184 0.028568946 4.516667 1
14 23 685574.5 235960.9 RE01 RE01 18.129019 0.059439406 5.083333 1
15 23 685568.8 235943.7 RE01 RE01 15.760165 0.051672673 5.083333 1
Date_text Time_text DateTime Flucht FluchtALL
1 22.10.2013 06:01:49 22.10.2013 06:01:49 0 0
2 22.10.2013 06:06:04 22.10.2013 06:06:04 0 0
3 22.10.2013 06:11:07 22.10.2013 06:11:07 0 0
4 22.10.2013 06:16:14 22.10.2013 06:16:14 0 0
5 22.10.2013 06:20:36 22.10.2013 06:20:36 0 0
6 22.10.2013 06:25:50 22.10.2013 06:25:50 0 0
7 23.10.2013 06:00:53 23.10.2013 06:00:53 0 0
8 26.10.2013 12:00:18 26.10.2013 12:00:18 0 0
9 26.10.2013 12:05:41 26.10.2013 12:05:41 0 0
10 26.10.2013 12:10:40 26.10.2013 12:10:40 0 0
11 26.10.2013 12:16:17 26.10.2013 12:16:17 0 0
12 26.10.2013 12:20:18 26.10.2013 12:20:18 0 0
13 26.10.2013 12:25:47 26.10.2013 12:25:47 0 0
14 26.10.2013 12:30:18 26.10.2013 12:30:18 0 0
15 26.10.2013 12:35:23 26.10.2013 12:35:23 0 0
And here's the code I've got so far:
for (i in seq(nrow(df))) {
  # circle's centre
  xcentre <- df[i, 'X']
  ycentre <- df[i, 'Y']
  # checking how many fixes lie within 15 m of the above centre;
  # the noofclosepoints column will contain this value
  df[i, 'noofclosepoints'] <- sum(
    (df[, 'X'] - xcentre)^2 +
    (df[, 'Y'] - ycentre)^2
    <= 15^2
  ) - 1
  cat(i, ': ')
  # this prints the TRUE/FALSE vector for which rows are within the radius and which aren't
  cat((df[, 'X'] - xcentre)^2 +
      (df[, 'Y'] - ycentre)^2
      <= 15^2)
  cat('\n')
}
I tried to test the consecutive condition on the TRUE/FALSE vector printed by cat(), but I couldn't access the results to process them any further.
I know this is a mile-long question; I would be very glad if someone could help me with this or part of this problem. I will be thankful till the end of my life :). By the way, you may have noticed that I'm an unfortunate R beginner. Many thanks!
Here's how: you can apply a rolling function to your time series over the six preceding fixes (a window of 7 rows in the code below, i.e. the current fix plus the six before it).
library(xts)
## You read your time series using the handy `read.zoo`
## Note the use of the index here
dx <-
read.zoo(text="Date Time DOP X Y noofclosepoints
4705 09.07.2014 11:05:33 3.4 686926.8 231039.3 14
4706 09.07.2014 11:10:53 3.2 686930.5 231042.5 14
4707 09.07.2014 11:16:29 15.8 686935.2 231035.5 14
4708 09.07.2014 11:20:08 5.2 686932.9 231035.6 14
4709 09.07.2014 11:25:17 4.8 686933.8 231038.6 14
4710 09.07.2014 11:30:16 2.2 686938.0 231037.0 15
4711 09.07.2014 11:35:13 2.0 686930.9 231035.8 14
4712 09.07.2014 11:40:09 2.0 686930.6 231035.7 14
4713 09.07.2014 11:45:25 3.4 686907.2 231046.8 0
4714 09.07.2014 11:50:25 3.2 686936.1 231037.1 14",
index = 1:2,format="%d.%m.%Y %H:%M:%S",tz="")
## You apply your rolling function to all columns.
## For each window you compute the distance between the current
## point and the others, then return the number of points within the radius.
as.xts(rollapplyr(dx, 7, function(x) {
  curr   <- tail(x, 1)
  others <- head(x, -1)
  dist   <- sqrt((others[, "X"] - curr[, "X"])^2 + (others[, "Y"] - curr[, "Y"])^2)
  sum(dist < 15)
}, by.column = FALSE))
# [,1]
# 2014-07-09 11:35:13 6
# 2014-07-09 11:40:09 6
# 2014-07-09 11:45:25 0
# 2014-07-09 11:50:25 5
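The question also asks for the center of gravity of the point cloud when the condition holds. A hedged extension of the same rollapply idea (reusing dx from above, and assuming "condition met" means all six preceding fixes lie within 15 m of the current one) might look like:
# return the centroid (X, Y) of the current fix plus its six predecessors
# when all six predecessors are within 15 m, and NA otherwise
as.xts(rollapplyr(dx, 7, function(x) {
  curr   <- tail(x, 1)
  others <- head(x, -1)
  dist   <- sqrt((others[, "X"] - curr[, "X"])^2 + (others[, "Y"] - curr[, "Y"])^2)
  if (sum(dist < 15) == 6) {
    c(Xcog = mean(c(others[, "X"], curr[, "X"])),
      Ycog = mean(c(others[, "Y"], curr[, "Y"])))
  } else {
    c(Xcog = NA_real_, Ycog = NA_real_)
  }
}, by.column = FALSE))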