I am trying to clean up a timeseries with multiple data points. The data is arranged by day and by 'beep'. I want to only keep items that are 1 beep away from each other on the same day.
To do this I have created a dummy variable by multiplying the day number by 100 and adding the beep number to it (so day 1, beep 2 becomes 102).
I was wondering if it would be possible to use some kind of clause to specify that I want to keep data that is contiguously = 1 to its lead OR lag variable, but also less than 50 (so that it will keep the days isolated). Alternatively is there a way to group by participant and then by day so that it will apply across participant and across each day in a way that won't delete incorrect data between days e.g. it should not delete day 2 beep 1 for being too far away from day 1 beep 7.
I am doing this so I can use a function called lagvar from an ESM package to create a time-lagged series. Before doing this I want to make sure that any values in day_beep that are more than 1 away from their contiguous neighbours are removed.
E.g.
Take the following rows and day_beep values
Participant ID Day Beep Dummy Variable
1 1 1 101
1 1 2 102
1 1 4 104
**1 1 7 107**
1 2 3 203
1 2 4 204
2 1 2 102
2 1 3 103
**2 2 5 205
2 3 4 305**
**3 1 1 101**
3 2 4 204
3 2 5 205
**4 1 7 107**
4 4 4 404
4 4 5 405
In this instance I would want to remove the data held between the asterisks as it is either contiguously more than 1 beep from its neighbours, or an isolated beep in the series.
What would be the easiest way to do this for the entire dataframe?
Any help would be greatly appreciated!
You can use lead and lag from dplyr to keep only the rows that have a contiguous value before or after:
library(dplyr)
df %>%
  group_by(Participant_ID) %>%
  filter((Dummy_Variable - lag(Dummy_Variable) == 1) |
         (lead(Dummy_Variable) - Dummy_Variable == 1))
output
Participant_ID Day Beep Dummy_Variable
1 1 1 1 101
2 1 1 2 102
3 1 2 3 203
4 1 2 4 204
5 2 1 2 102
6 2 1 3 103
7 3 2 4 204
8 3 2 5 205
9 4 4 4 404
10 4 4 5 405
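The question also asks whether the check can be grouped by participant and then by day. A minimal base R sketch of that idea follows, assuming the rows are already sorted by beep within each day; `has_neighbour`, `group_key` and `kept` are hypothetical names introduced here for illustration, and the dummy variable is not needed because the comparison uses Beep directly.

```r
# Sample data from the question (Day and Beep only)
df <- data.frame(
  Participant_ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  Day            = c(1, 1, 1, 1, 2, 2, 1, 1, 2, 3, 1, 2, 2, 1, 4, 4),
  Beep           = c(1, 2, 4, 7, 3, 4, 2, 3, 5, 4, 1, 4, 5, 7, 4, 5)
)

# TRUE where a beep is exactly 1 after the previous beep or 1 before the next
# (hypothetical helper; assumes beeps are sorted within the group)
has_neighbour <- function(beep) {
  n <- length(beep)
  step_is_one <- diff(beep) == 1
  after_prev  <- c(FALSE, step_is_one)[seq_len(n)]
  before_next <- c(step_is_one, FALSE)[seq_len(n)]
  after_prev | before_next
}

# One key per participant/day combination, so days never compare with each other
group_key <- paste(df$Participant_ID, df$Day)
keep <- as.logical(ave(df$Beep, group_key, FUN = has_neighbour))
kept <- df[keep, ]
kept
```

Because each day forms its own group, day 2 beep 1 can never be compared with day 1 beep 7, whatever scheme is used to build the dummy variable.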
I have two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and in the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share "Species_name" and "Biom" values (a Biom is an area of the world with the same climatic conditions). "Biom" values go from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben). Furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply a population of that species belonging to that Biom).
CDD26_FF:
CDD26_FF  AreaCell  Area_total  Biom  Species_name  AreaCellSuAreaTotal
1         1         13          10    Abrocomo_ben  0.076923
1         1         8           1     Abrocomo_cin  0.125000
1         1         30          10    Abrocomo_cin  0.033333
1         2         10          1     Abrothrix_an  0.200000
1         1         44          10    Abrothrix_an  0.022727
1         3         6           2     Abrothrix_je  0.500000
1         1         7           12    Abrothrix_lo  0.142857
CDD_HI:
CDD_HI  AreaCell  Area_total  Biom  Species_name  AreaCellSuAreaTot_HI
1       1         8           1     Abrocomo_cin  0.125000
1       5         30          10    Abrocomo_cin  0.166666
1       1         5           2     Abrocomo_cin  0.200000
1       1         10          1     Abrothrix_an  0.100000
1       1         44          10    Abrothrix_an  0.022727
1       6         18          1     Abrothrix_je  0.333333
1       1         23          4     Abrothrix_lo  0.130434
I want to find the rows that match on both "Species_name" and "Biom": in the example these are lines 2, 3, 4, 5 of CDD26_FF, matching lines 1, 2, 4, 5 of CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the "compare()" function seems to do) but all the other columns as well.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" for the matched lines.
How can I do that?
Aside from "compare()", I tried a "for" loop, but the lengths of the tables differ, so I tried a 3-level nested for loop, still without results. I also tried "compareDF()" and "semi_join()". No results until now. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns only the rows that are present in both tables/data.frames and satisfy the matching condition (in this case: matching "Biom" and "Species_name").
After that it is easy to calculate the ratio using mutate:
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: remove the select() step if you need all columns, or adjust it to keep other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
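For readers who prefer to avoid dplyr, the same inner join can be sketched with base R's merge(). This is only an illustration under a few assumptions: the data frames are rebuilt by hand from the sample rows, the first column of each table (the one named after the table) is left out, and the names `cdd26_ff`, `matched` and the `_FF`/`_HI` suffixes are choices made here, not part of the original answer.

```r
# Sample rows from the question, typed in by hand
cdd26_ff <- data.frame(
  AreaCell     = c(1, 1, 1, 2, 1, 3, 1),
  Area_total   = c(13, 8, 30, 10, 44, 6, 7),
  Biom         = c(10, 1, 10, 1, 10, 2, 12),
  Species_name = c("Abrocomo_ben", "Abrocomo_cin", "Abrocomo_cin", "Abrothrix_an",
                   "Abrothrix_an", "Abrothrix_je", "Abrothrix_lo"),
  AreaCellSuAreaTotal = c(0.076923, 0.125, 0.033333, 0.2, 0.022727, 0.5, 0.142857)
)
cdd_hi <- data.frame(
  AreaCell     = c(1, 5, 1, 1, 1, 6, 1),
  Area_total   = c(8, 30, 5, 10, 44, 18, 23),
  Biom         = c(1, 10, 2, 1, 10, 1, 4),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrocomo_cin", "Abrothrix_an",
                   "Abrothrix_an", "Abrothrix_je", "Abrothrix_lo"),
  AreaCellSuAreaTot_HI = c(0.125, 0.166666, 0.2, 0.1, 0.022727, 0.333333, 0.130434)
)

# merge() performs an inner join by default; columns that share a name but are
# not join keys (AreaCell, Area_total) receive the given suffixes
matched <- merge(cdd26_ff, cdd_hi,
                 by = c("Biom", "Species_name"),
                 suffixes = c("_FF", "_HI"))

# Ratio of far-future to historical exposure, keeping all other columns
matched$ratio <- matched$AreaCellSuAreaTotal / matched$AreaCellSuAreaTot_HI
matched
```

All columns from both tables are kept, so the result can be inspected row by row before any columns are dropped.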
I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that an individual only has a single, sustained, transition from symptom present (ss=1) to symptom remitted (ss=0). An individual must have a complete sustained remission in order for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
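I don't think you have considered all the edge cases: what should happen with two NAs in a row at the end of a period, or with four or five NAs in a row? Using na.locf from zoo, the following reproduces the requested output for IDs 1 to 3 in your small test case; for ID 4 it gives 1,1,0,0,0,0,0 rather than the requested 1,1,1,1,1,0,0:
library(zoo)
# Leave a series untouched if it ends in NA; otherwise carry the last observation forward.
# na.rm = FALSE keeps the length unchanged even if the series starts with NA.
fillNA <- function(vec) {
  if (is.na(tail(vec, 1))) vec else na.locf(vec, na.rm = FALSE)
}
mydat$locf <- with(mydat, ave(ss, id, FUN = fillNA))
mydat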
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0
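To make the "last observation carried forward" step less of a black box, here is a minimal base R sketch of the same idea without zoo. The helper names `locf` and `fill_group` are hypothetical; the rule of leaving a series untouched when it ends in NA is copied from the answer above, and the sketch makes no attempt to implement the different result the question requests for ID 4.

```r
# Fake dataset from the question
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(1, 1, 1, 1, NA, 0, 0,
          1, 1, 0, NA, 0, 0, 0,
          1, 1, 1, 1, 1, 1, NA,
          1, 1, 0, NA, NA, 0, 0)
mydat <- data.frame(id, time, ss)

# Carry the last non-missing value forward (hypothetical helper).
# cumsum(seen) counts how many observed values precede or include each position,
# so it indexes the most recent observed value; a leading NA stays NA.
locf <- function(v) {
  seen <- !is.na(v)
  c(NA, v[seen])[cumsum(seen) + 1]
}

# Same rule as the answer: skip series that end in NA
fill_group <- function(v) {
  if (is.na(v[length(v)])) v else locf(v)
}

mydat$locf <- ave(mydat$ss, mydat$id, FUN = fill_group)
mydat
```

Note that a series ending in NA is returned unchanged even if it also contains NAs in the middle, which is one of the edge cases worth deciding on explicitly.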
I am trying to number locations in sequence within each series, where a new series starts whenever the time since the previous location is more than 60 seconds. I've eliminated columns irrelevant to this question, so example data looks like:
TimeSincePrev
1
1
1
1
511
1
2
286
1
My desired output looks like this:
TimeSincePrev  NoInSeries
1              1
1              2
1              3
1              4
511            1
1              2
2              3
286            1
1              2
...and so on for another 3500 lines
I have tried a couple of ways to approach this unsuccessfully:
First, I tried an ifelse that sets NoInSeries to 1 if TimeSincePrev is more than a minute, and otherwise to the previous row's value + 1. (In this case, I first inserted a line number column to help me reference the previous row, but I suspect there is an easier way to do this?)
df$NoInSeries <- ifelse(df$TimeSincePrev > 60, 1, df[df$LineNo - 1, "NoInSeries"] + 1)
I don't get any errors, but it only gives me the 1s where I want to restart sequences but does not fill in any of the other values:
TimeSincePrev  NoInSeries
1              NA
1              NA
1              NA
1              NA
511            1
1              NA
2              NA
286            1
1              NA
I assume this has something to do with trying to reference back to itself?
My other approach was to try to get it to do sequences of numbers (max 15), restarting every time there is a change in the TimeSincePrev value:
df$NoInSeries <- ave(df$TimeSincePrev, df$TimeSincePrev, FUN=function(y) 1:15)
I still get no errors but exactly the same output as before, with NAs in place and no other numbers filled in.
Thanks for any help!
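Use ave after creating a grouping variable that marks where each series starts (a comparison followed by cumsum):
dt$NoInSeries <-
  ave(dt$TimeSincePrev,
      cumsum(dt$TimeSincePrev > 60),
      # seq_along is safe for one-row groups, where seq(511) would expand to 1:511
      FUN = seq_along)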
The result is:
dt
# TimeSincePrev NoInSeries
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 511 1
# 6 1 2
# 7 2 3
# 8 286 1
# 9 1 2
Step-by-step explanation:
## flag each time gap > 60 seconds; the running total of the flags is the group id
(gg <- cumsum(dt$TimeSincePrev > 60))
[1] 0 0 0 0 1 1 1 2 2
## number the rows within each group
ave(dt$TimeSincePrev, gg, FUN = seq_along)
[1] 1 2 3 4 1 2 3 1 2
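The two steps can be put together in a short self-contained sketch. It also covers a case the sample data does not contain, namely two long gaps in a row, which produce a series with a single row; `number_in_series` and `edge` are hypothetical names used only for this illustration.

```r
# Number rows within each series; a gap above `threshold` starts a new series
# (hypothetical helper wrapping the cumsum + ave idea)
number_in_series <- function(time_since_prev, threshold = 60) {
  grp <- cumsum(time_since_prev > threshold)   # group id per row
  ave(seq_along(grp), grp, FUN = seq_along)    # 1, 2, 3, ... within each group
}

dt <- data.frame(TimeSincePrev = c(1, 1, 1, 1, 511, 1, 2, 286, 1))
dt$NoInSeries <- number_in_series(dt$TimeSincePrev)
dt

# Two long gaps in a row: the middle series has exactly one row
edge <- number_in_series(c(1, 300, 400, 1))
edge

# Equivalent without ave: run lengths of the group id fed to sequence()
alt <- sequence(rle(cumsum(dt$TimeSincePrev > 60))$lengths)
```

Passing `seq_along(grp)` rather than the time values to ave keeps the result an integer vector and makes it clear that only the group sizes matter, not the values being grouped.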
Using data.table
library(data.table)
setDT(dt)[,NoInSeries:=seq_len(.N), by=cumsum(TimeSincePrev >60)]
dt
# TimeSincePrev NoInSeries
#1: 1 1
#2: 1 2
#3: 1 3
#4: 1 4
#5: 511 1
#6: 1 2
#7: 2 3
#8: 286 1
#9: 1 2
Or
indx <- c(which(dt$TimeSincePrev >60)-1, nrow(dt))
sequence(c(indx[1], diff(indx)))
#[1] 1 2 3 4 1 2 3 1 2
data
dt <- data.frame(TimeSincePrev=c(1,1,1,1,511, 1,2, 286,1))
I have a data frame which has 2 columns - A & B. I want to replace the values of column B in such a way that, when the VALUE>=5 replace with 1, else replace with 0.
Note - There are 2 conditions to be checked.
X=read.csv("Y:/impdat.csv")
A B
3 16
12 3
1 2
12 9
4 4
5 6
21 1
4 14
3 10
12 1
So after replacing, the data should be
A B
3 1
12 0
1 0
12 1
4 0
5 1
21 0
4 1
3 1
12 0
Sounds simple. But I am unable to implement it.
I tried
ifelse(X$B>=5,1,0)
This only prints the new values, but the original data remains the same.
X$B <- as.integer(X$B >= 5)
will do the trick.
Alternatively, use transform and assign the result back to X:
X <- transform(X, B = ifelse(B >= 5, 1, 0))
Got it.
Just had to assign the object.
X$B=ifelse(X$B>=5,1,0)
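The difference between printing a result and storing it can be seen in a small self-contained sketch. The data frame is typed in from the question instead of being read from the CSV file, and `before` is a hypothetical name used only to show that the first call leaves X untouched.

```r
# Data from the question, typed in by hand
X <- data.frame(
  A = c(3, 12, 1, 12, 4, 5, 21, 4, 3, 12),
  B = c(16, 3, 2, 9, 4, 6, 1, 14, 10, 1)
)

before <- X$B
ifelse(X$B >= 5, 1, 0)        # computes a new vector; X itself is not changed
unchanged <- identical(X$B, before)

# Assigning the result back is what actually replaces the column.
# The comparison gives TRUE/FALSE, and as.integer() turns that into 1/0.
X$B <- as.integer(X$B >= 5)
X
```

R functions return modified copies and do not change their arguments in place, which is why the assignment is needed in every variant shown above.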