I am trying to clean up a time series with multiple data points. The data is arranged by day and by 'beep'. I want to keep only items that are 1 beep away from each other on the same day.
To do this, I created a dummy variable by multiplying the day number by 100 and adding the beep number to it.
I was wondering whether some kind of clause could specify that I want to keep data whose value differs by exactly 1 from its lead OR lag value, but also less than 50 (so that the days stay isolated). Alternatively, is there a way to group by participant and then by day, so that the filter is applied within each participant and each day and does not delete data incorrectly across days? For example, it should not delete day 2 beep 1 for being too far from day 1 beep 7.
I am doing this so I can use a function called lagvar from an ESM package to create a time-lagged series. Before doing so, I want to make sure that any values in day_beep that are more than 1 away from their contiguous neighbours are removed.
E.g.
Take the following rows and day_beep values
Participant ID Day Beep Dummy Variable
1 1 1 101
1 1 2 102
1 1 4 104
**1 1 7 107**
1 2 3 203
1 2 4 204
2 1 2 102
2 1 3 103
**2 2 5 205
2 3 4 305**
**3 1 1 101**
3 2 4 204
3 2 5 205
**4 1 7 107**
4 4 4 404
4 4 5 405
In this instance I would want to remove the data held between the asterisks as it is either contiguously more than 1 beep from its neighbours, or an isolated beep in the series.
What would be the easiest way to do this for the entire dataframe?
Any help would be greatly appreciated!
You can use lead and lag from dplyr to keep only the rows that have a contiguous value before or after:
library(dplyr)
df %>%
  # Dummy_Variable already encodes the day, so grouping by participant is enough
  group_by(Participant_ID) %>%
  filter((Dummy_Variable - lag(Dummy_Variable) == 1) |
           (lead(Dummy_Variable) - Dummy_Variable == 1))
output
Participant_ID Day Beep Dummy_Variable
1 1 1 1 101
2 1 1 2 102
3 1 2 3 203
4 1 2 4 204
5 2 1 2 102
6 2 1 3 103
7 3 2 4 204
8 3 2 5 205
9 4 4 4 404
10 4 4 5 405
I have a dataset (see example below) in which each individual underwent two sessions, each with 4 trials. In each trial they could pick either correctly (1) or incorrectly (0), as recorded in the y variable. I am trying to calculate the rate of correct choices per individual per session. (This is an example dataset; the real one has many more rows, so I don't want to do this by hand.)
df
head(df, 16)
row name session_number y
1 1 Tom 1 1
2 2 Tom 1 1
3 3 Tom 1 0
4 4 Tom 1 0
5 5 Tom 2 1
6 6 Tom 2 0
7 7 Tom 2 1
8 8 Tom 2 0
9 9 Rob 1 0
10 10 Rob 1 1
11 11 Rob 1 0
12 12 Rob 1 1
13 13 Rob 2 0
14 14 Rob 2 1
15 15 Rob 2 0
16 16 Rob 2 1
For example, I want to know that Tom, in his first session, picked correctly in 0.50 of his trials. This is calculated by summing y and dividing by the number of rows associated with "Tom" AND "Session 1". I can't figure out how to count those rows in a larger dataset, though.
I tried using group_by() and mutate(), but I still can't get it to work because count() is not working.
by_name_by_session <- df %>%
group_by(df$name) %>%
group_by(session_number) %>%
mutate(rate = (sum(df$y)/count(df$name)))
Thanks in advance to anyone who can help!
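One possible way to get the per-person, per-session rate, sketched with dplyr: pass both grouping columns to a single group_by() call (a second group_by() replaces the first one), refer to columns by bare name rather than df$..., and use n() for the group size, since count() expects a data frame rather than a vector. The data frame below rebuilds the example above; `rates` and `n_trials` are illustrative names, not part of the original post.

```r
library(dplyr)

# Rebuild the example data from the question
df <- data.frame(
  name           = rep(c("Tom", "Rob"), each = 8),
  session_number = rep(rep(1:2, each = 4), times = 2),
  y              = c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1)
)

rates <- df %>%
  group_by(name, session_number) %>%   # one group per person and session
  summarise(
    n_trials = n(),                    # number of rows in the group
    rate     = sum(y) / n(),           # share of correct picks (same as mean(y))
    .groups  = "drop"
  )

print(rates)
```

If the rate should sit next to every original row instead of in a summary table, replacing summarise() with mutate(rate = mean(y)) after the same group_by() is one option.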
I have these two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and in the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share the same "Species_name" and "Biom" values (a Biom is an area of the world with the same climatic conditions). "Biom" values range from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben). Furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply a population of that species belonging to that Biom).
CDD26_FF :
CDD26_FF  AreaCell  Area_total  Biom  Species_name  AreaCellSuAreaTotal
1         1         13          10    Abrocomo_ben  0.076923
1         1         8           1     Abrocomo_cin  0.125000
1         1         30          10    Abrocomo_cin  0.033333
1         2         10          1     Abrothrix_an  0.200000
1         1         44          10    Abrothrix_an  0.022727
1         3         6           2     Abrothrix_je  0.500000
1         1         7           12    Abrothrix_lo  0.142857
CDD_HI :
CDD_HI  AreaCell  Area_total  Biom  Species_name  AreaCellSuAreaTot_HI
1       1         8           1     Abrocomo_cin  0.125000
1       5         30          10    Abrocomo_cin  0.166666
1       1         5           2     Abrocomo_cin  0.200000
1       1         10          1     Abrothrix_an  0.100000
1       1         44          10    Abrothrix_an  0.022727
1       6         18          1     Abrothrix_je  0.333333
1       1         23          4     Abrothrix_lo  0.130434
I want to highlight rows that have matching "Species_name" and "Biom" values: in the example these are lines 2, 3, 4, 5 from CDD26_FF, matching lines 1, 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the "compare()" function seems to do) but all the other columns as well.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" for the highlighted lines.
How can I do that?
Aside from "compare()", I tried a "for" loop, but the lengths of the tables differ, so I tried a 3-level nested for loop, still without results. I also tried "compareDF()" and "semi_join()". No results so far. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns all rows that are present in both tables/data.frames and satisfy the matching conditions (in this case: matching "Biom" and "Species_name").
Subsequently it's easy to calculate some ratio using mutate:
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by = c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal / AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: Remove the select() step if you need all columns, or adapt it to keep other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
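For readers who want to keep every column and avoid extra packages, the same match can be sketched with base R's merge(). This is only an illustration under assumptions: the data frames `ff` and `hi` are retyped from the tables in the question (leaving out the constant first column), and the suffixes "_FF" and "_HI" as well as the name `matched` are choices made here, not part of the original answer.

```r
# Far-future table, retyped from the question (first indicator column omitted)
ff <- data.frame(
  AreaCell     = c(1, 1, 1, 2, 1, 3, 1),
  Area_total   = c(13, 8, 30, 10, 44, 6, 7),
  Biom         = c(10, 1, 10, 1, 10, 2, 12),
  Species_name = c("Abrocomo_ben", "Abrocomo_cin", "Abrocomo_cin", "Abrothrix_an",
                   "Abrothrix_an", "Abrothrix_je", "Abrothrix_lo"),
  AreaCellSuAreaTotal = c(0.076923, 0.125, 0.033333, 0.2, 0.022727, 0.5, 0.142857)
)

# Historical table, retyped from the question
hi <- data.frame(
  AreaCell     = c(1, 5, 1, 1, 1, 6, 1),
  Area_total   = c(8, 30, 5, 10, 44, 18, 23),
  Biom         = c(1, 10, 2, 1, 10, 1, 4),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrocomo_cin", "Abrothrix_an",
                   "Abrothrix_an", "Abrothrix_je", "Abrothrix_lo"),
  AreaCellSuAreaTot_HI = c(0.125, 0.166666, 0.2, 0.1, 0.022727, 0.333333, 0.130434)
)

# merge() defaults to an inner join; columns that share a name but are not
# join keys (AreaCell, Area_total) get the suffixes so both versions survive
matched <- merge(ff, hi, by = c("Biom", "Species_name"),
                 suffixes = c("_FF", "_HI"))
matched$ratio <- matched$AreaCellSuAreaTotal / matched$AreaCellSuAreaTot_HI

print(matched)
```

Each matched row then carries AreaCell_FF, AreaCell_HI, Area_total_FF and Area_total_HI alongside the ratio, so nothing from either table is lost.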
I have spent a significant amount of time searching for an answer with little luck. I have some time series data and need to collapse it by taking a rolling mean over every nth row. It looks like this is possible in zoo, maybe Hmisc, and I am sure other packages. I need to average rows 1,2,3, then 3,4,5, then 5,6,7 and so on. My data looks like this and has thousands of observations:
id time x.1 x.2 y.1 y.2
10 1 22 19 0 -.5
10 2 27 44 -1 0
10 3 19 13 0 -1.5
10 4 7 22 .5 1
10 5 -15 5 .33 2
10 6 3 17 1 .33
10 7 6 -2 0 0
10 8 44 25 0 0
10 9 27 12 1 -.5
10 10 2 11 2 1
I would like it to look like this when complete:
id time x.1 x.2 y.1 y.2
10 1 22.66 25.33 -.33 -.66
10 2 3.66 13.33 .27 .50
Time 1 would actually be times 1,2,3 averaged, and time 2 would be times 3,4,5 averaged, but at this point the time variable is not important to keep. I need to group by id, as it does change eventually. The only way I could figure out how to do this was to use Lag() to make one new column lagged by 1 and another lagged by 2, then take the average across columns. After that you have to delete every other row:
1 NA NA
2 1 NA
3 2 1
4 3 2
5 4 3
That is, use the 1,2,3 and 3,4,5 rows and remove 2,3,4, and so on. Doing this for each variable would be outrageous, especially as I gather new data.
Any ideas? Help would be much appreciated.
Something like this, maybe?
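# sample data
library(zoo)  # provides rollmean()
id <- c(10, 10, 10, 10, 10, 10)
time <- c(1, 2, 3, 4, 5, 6)
x1 <- c(22, 27, 19, 7, -15, 3)
x2 <- c(19, 44, 13, 22, 5, 17)
df <- data.frame(id, time, x1, x2)
# rolling mean of width 3 over every column except time
means <- data.frame(rollmean(df[, c(1, 3:ncol(df))], 3))
# keep every other window: rows 1-3, 3-5, ...
means <- means[c(TRUE, FALSE), ]
means$time <- seq_len(nrow(means))
row.names(means) <- seq_len(nrow(means))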
> means
id x1 x2 time
1 10 22.666667 25.33333 1
2 10 3.666667 13.33333 2
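The question also asks for the windows to be computed per id, which the snippet above does not do. A base-R sketch of one way to handle that, without zoo, is shown here. The helper `window_means`, its `width` and `step` parameters, and the rows for id 11 are all hypothetical and introduced only for illustration; the rows for id 10 come from the question's data.

```r
# id 10 rows are from the question; id 11 rows are made up to show the grouping
df <- data.frame(
  id   = c(10, 10, 10, 10, 10, 10, 10, 11, 11, 11),
  time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
  x1   = c(22, 27, 19, 7, -15, 3, 6, 1, 2, 3),
  x2   = c(19, 44, 13, 22, 5, 17, -2, 4, 5, 6)
)

# Hypothetical helper: means over windows of `width` rows, starting every
# `step` rows, for one id's rows (assumed to be sorted by time already)
window_means <- function(d, width = 3, step = 2) {
  if (nrow(d) < width) return(NULL)          # not enough rows for one window
  starts <- seq(1, nrow(d) - width + 1, by = step)
  value_cols <- setdiff(names(d), c("id", "time"))
  rows <- lapply(seq_along(starts), function(k) {
    idx <- starts[k]:(starts[k] + width - 1)
    data.frame(id = d$id[1], window = k,
               t(colMeans(d[idx, value_cols, drop = FALSE])))
  })
  do.call(rbind, rows)
}

# Apply the helper to each id separately, then stack the pieces
result <- do.call(rbind, lapply(split(df, df$id), window_means))
rownames(result) <- NULL

print(result)
```

With width = 3 and step = 2 the windows cover rows 1-3, 3-5, 5-7 of each id, so no window ever mixes rows from two ids, and new value columns are picked up automatically.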
I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that an individual only has a single, sustained, transition from symptom present (ss=1) to symptom remitted (ss=0). An individual must have a complete sustained remission in order for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
I don't really think you have considered all the edge cases: what should happen with two NAs in a row at the end of a period, or with 4 or 5 NAs in a row? However, using the na.locf function, this will give you the requested solution in your tiny test case:
library(zoo)
# Carry the last observation forward, unless the series ends in NA.
fillNA <- function(vec) {
  if (is.na(tail(vec, 1))) vec else na.locf(vec, na.rm = FALSE)
}
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0
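One of the edge cases mentioned above is a series that has both an interior gap and a trailing NA: the function shown returns such a series untouched, so the interior gap stays unfilled. A base-R sketch of a variant that fills interior gaps by carrying the last observation forward while leaving trailing NAs alone is given below. The name `fill_interior` is hypothetical, and whether carrying forward is the right rule for every gap is an assumption that the analysis itself has to justify.

```r
# Data from the question
id   <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss   <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Hypothetical helper: last observation carried forward for interior gaps only
fill_interior <- function(vec) {
  observed <- which(!is.na(vec))
  if (length(observed) == 0) return(vec)     # nothing to carry forward
  # for each position, the index of the most recent non-missing value (0 if none)
  last_seen <- cummax(ifelse(is.na(vec), 0L, seq_along(vec)))
  filled <- vec
  has_prev <- last_seen > 0
  filled[has_prev] <- vec[last_seen[has_prev]]
  # anything after the final observed value stays missing
  filled[seq_along(vec) > max(observed)] <- NA
  filled
}

mydat$filled <- ave(mydat$ss, mydat$id, FUN = fill_interior)

print(fill_interior(c(1, NA, 0, NA, NA)))
print(mydat[mydat$id == 3, ])
```

Leading NAs are also left as they are, because there is no earlier observation to carry forward.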