Calculate difference based on two columns in R

I have a little bit of a tricky question. Here is my data:
df <- structure(list(
  seconds = c(689, 689.25, 689.5, 689.75, 690, 690.25, 690.5, 690.75,
              691, 691.25, 691.5, 691.75, 692, 692.25, 692.5),
  threat = c(NA, NA, NA, NA, NA, NA, 1L, 1L, 0L, 0L, 1L, NA, NA, 1L, 1L),
  bins = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L)),
  .Names = c("seconds", "threat", "bins"), class = "data.frame",
  row.names = c(NA, -15L))
seconds threat bins
1 689.00 NA 1
2 689.25 NA 1
3 689.50 NA 1
4 689.75 NA 1
5 690.00 NA 1
6 690.25 NA 2
7 690.50 1 2
8 690.75 1 2
9 691.00 0 2
10 691.25 0 2
11 691.50 1 3
12 691.75 NA 3
13 692.00 NA 3
14 692.25 1 3
15 692.50 1 3
Within each bin, I am trying to calculate the amount of time spent in each type of "threat" in the threat column. So I need to calculate a difference score every time something different happens in threat, within each bin. Here is an example of what I am hoping to achieve:
bin threat seconds
1 NA 1.25
1 1 0.00
1 0 0.00
2 NA 0.25
2 1 0.50
2 0 0.50
3 NA 0.50
3 1 0.75
3 0 0.00

Here's a tidyverse solution:
library(dplyr)
library(tidyr)

df %>%
  arrange(seconds) %>%
  mutate(duration = lead(seconds) - seconds) %>%
  complete(bins, threat, fill = list(duration = 0)) %>%
  group_by(bins, threat) %>%
  summarize(seconds = sum(duration, na.rm = TRUE))
# A tibble: 9 x 3
# Groups: bins [?]
# bins threat seconds
# <int> <int> <dbl>
# 1 1 0 0
# 2 1 1 0
# 3 1 NA 1.25
# 4 2 0 0.5
# 5 2 1 0.5
# 6 2 NA 0.25
# 7 3 0 0
# 8 3 1 0.5
# 9 3 NA 0.5
You may drop complete(bins, threat, fill = list(duration = 0)) if you do not need the rows where seconds is 0.
So, first we arrange the data by seconds to be safe. Then we define a new variable duration as the time until the next observation. Next we add rows with duration == 0 for those (bins, threat) combinations that are not yet present. Lastly we group by bins and threat and sum up the durations.
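For comparison, a rough data.table sketch of the same idea (assuming the df defined above); it skips the complete() step, so (bins, threat) combinations that never occur are simply absent from the result:
library(data.table)
dt <- as.data.table(df)[order(seconds)]
# time until the next reading, regardless of bin
dt[, duration := shift(seconds, type = "lead") - seconds]
# total time per (bins, threat) combination
dt[, .(seconds = sum(duration, na.rm = TRUE)), by = .(bins, threat)]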

Related

Increasingly count the number of times that a certain condition is met (R)

I would like to know how to increasingly count the number of times that a column in my data.frame satisfies a condition. Let's consider a data.frame such as:
x hour count
1 0 NA
2 1 NA
3 2 NA
4 3 NA
5 0 NA
6 1 NA
...
I would like to have this output:
x hour count
1 0 1
2 1 NA
3 2 NA
4 3 NA
5 0 2
6 1 NA
...
With the count column increasing by 1 every time the condition hour == 0 is met.
Is there a smart and efficient way to perform this? Thanks
You can use seq_along on the rows where hour == 0.
i <- which(x$hour == 0)
x$count[i] <- seq_along(i)
x
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
Data:
x <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L), count = c(NA,
NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
You can use cumsum to count the cumulative number of 0 occurrences and replace the counts where the hour value is not 0 with NA.
library(dplyr)
df %>%
  mutate(count = cumsum(hour == 0),
         count = replace(count, hour != 0, NA))
# x hour count
#1 1 0 1
#2 2 1 NA
#3 3 2 NA
#4 4 3 NA
#5 5 0 2
#6 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))
Using data.table
library(data.table)
setDT(df)[hour == 0, count := seq_len(.N)]
df
# x hour count
#1: 1 0 1
#2: 2 1 NA
#3: 3 2 NA
#4: 4 3 NA
#5: 5 0 2
#6: 6 1 NA
data
df <- structure(list(x = 1:6, hour = c(0L, 1L, 2L, 3L, 0L, 1L)),
class = "data.frame", row.names = c(NA, -6L))

Sum column values over a window and report the values of the previous window

I have a data.frame of the following form:
ID Var1
1 1
1 1
1 3
1 4
1 1
1 0
2 2
2 2
2 6
2 7
2 8
2 0
3 0
3 2
3 1
3 3
3 2
3 4
and I would like to get to this:
ID Var1 X
1 1 0
1 1 0
1 3 0
1 4 5
1 1 5
1 0 5
2 2 0
2 2 0
2 6 0
2 7 10
2 8 10
2 0 10
3 0 0
3 2 0
3 1 0
3 3 3
3 2 3
3 4 3
So, in words: I'd like to calculate the sum of the variable over a window of 3 rows and then report the result obtained in the previous window. This should happen within each ID, so the first three observations of every ID should be returned as 0, as there is no previous window that could be reported.
For context: in the actual dataset each row corresponds to one week and the window = 7, so X is supposed to give the sum of Var1 in the previous week.
I have tried some rollapply approaches, but they always ended in an error, and as far as I understand it rollapply uses a rolling window, which is specifically not what I need.
Thanks for your answers!
In rollapply, the width argument can be a list which provides the offsets to use. In this case we want to use the points 3, 2 and 1 back for the first point, 4, 3 and 2 back for the second, 5, 4 and 3 back for the third and then recycle. That is, for a window width of k = 3 we would want the following list of offset vectors:
w <- list(-(3:1), -(4:2), -(5:3))
In general we can write w below in terms of the window width k. ave then invokes rollapply with that width list for each ID.
library(zoo)
k <- 3
w <- lapply(1:k, function(x) seq(to = -x, length = k))
transform(DF, X = ave(Var1, ID, FUN = function(x) rollapply(x, w, sum, fill = 0)))
giving:
ID Var1 X
1 1 1 0
2 1 1 0
3 1 3 0
4 1 4 5
5 1 1 5
6 1 0 5
7 2 2 0
8 2 2 0
9 2 6 0
10 2 7 10
11 2 8 10
12 2 0 10
13 3 0 0
14 3 2 0
15 3 1 0
16 3 3 3
17 3 2 3
18 3 4 3
Note
The input DF in reproducible form is:
DF <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)),
class = "data.frame", row.names = c(NA, -18L))
We could group by 'ID', create another grouping column with a window size of 3 using gl, then summarise by taking the sum of 'Var1' while keeping 'Var1' itself in a list column, take the lag of 'X', and unnest.
library(dplyr) #1.0.0
library(tidyr)
df1 %>%
  # // grouping by ID
  group_by(ID) %>%
  # // create another group added with gl
  group_by(grp = as.integer(gl(n(), 3, n())), .add = TRUE) %>%
  # // get the sum of Var1, while changing the Var1 in a list
  summarise(X = sum(Var1), Var1 = list(Var1)) %>%
  # // get the lag of X
  mutate(X = lag(X, default = 0)) %>%
  # // unnest the list column
  unnest(c(Var1)) %>%
  select(names(df1), X)
# A tibble: 18 x 3
# Groups: ID [3]
# ID Var1 X
# <int> <int> <dbl>
# 1 1 1 0
# 2 1 1 0
# 3 1 3 0
# 4 1 4 5
# 5 1 1 5
# 6 1 0 5
# 7 2 2 0
# 8 2 2 0
# 9 2 6 0
#10 2 7 10
#11 2 8 10
#12 2 0 10
#13 3 0 0
#14 3 2 0
#15 3 1 0
#16 3 3 3
#17 3 2 3
#18 3 4 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)), class = "data.frame",
row.names = c(NA,
-18L))
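A base R sketch of the same logic, assuming the df1 defined just above: sum Var1 in non-overlapping blocks of 3 within each ID, then shift the block sums down by one block.
k <- 3
df1$X <- ave(df1$Var1, df1$ID, FUN = function(x) {
  g <- ceiling(seq_along(x) / k)             # block index within the ID
  s <- tapply(x, g, sum)                     # sum of Var1 per block
  rep(c(0, head(s, -1)), times = table(g))   # previous block's sum, 0 for the first block
})
df1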

Conditionally remove rows from a database using R

ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
For each person there is exactly one record whose Number is 1; the rest are 2.
The variable Var takes different values within the same person.
The Var value on the row where Number equals 1 (call it P) differs from person to person.
Now I want to delete, for every person, the rows whose Var > P.
In the end, I want this:
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first on the rows where Number == 1 to get each person's P value:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Flag = first(Var[Number == 1])) %>%
  filter(Var <= Flag) %>%
  select(-Flag)
# shorter version, if you are sure there is exactly one Number == 1 per ID
df %>% group_by(ID) %>% filter(Var <= Var[Number == 1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x != 0])),]
Here Var * (Number == 1) zeroes out every Var except the one on the Number == 1 row, and ave then spreads that single non-zero value (P) across all rows of the same ID, so the comparison Var <= P can be made row by row.
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
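Yet another base R sketch (again assuming df1 from above) that makes the lookup of P explicit rather than relying on the ave() trick: compute P once per ID, match it back to each row, and keep the rows with Var <= P.
P <- with(df1, tapply(Var[Number == 1], ID[Number == 1], unique))
df1[df1$Var <= P[as.character(df1$ID)], ]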

Tidy up and reshape messy dataset (reshape/gather/unite function)?

Following my earlier question:
R: reshape/gather function to create dataset ready for multilevel analysis
I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story:
I have a big dataset with 240 cases. Each row is a case (a breast cancer patient). Somewhere near the end of the dataset (say from column 417 onwards) I have partner data for the patients, who also filled in a questionnaire.
At the beginning there are demographic variables for both patients and partners, followed by test outcomes for patients only, then followed by the partner data.
I want to create a dataset where I 'split' the patient and partner data but keep them coupled. That is, I want to duplicate the subject ID and create a new column with 1s and 2s (1 corresponding to the patient and 2 to the partner).
Then, I want my data essentially as it is now, but some variables can be matched (for example, I now have "date of birth" for the patient [pgebdat] and for the partner [prgebdat] separately; of course, I can turn this into one 'gebdat' column with the two birth dates below each other).
This code worked for me for a small subset of my data:
mydf_long <- mydf4 %>%
  unite(bb1:bb50rec, col = `1`, sep = ";") %>%     # Combine responses of 'p1' through 'p3'
  unite(pbb1:pbb50recM, col = `2`, sep = ";") %>%  # Combine responses of 'pr1' through 'pr3'
  gather(couple, value, `1`:`2`) %>%               # Form into long data
  separate(value, sep = ";", into = c(paste0("bb", seq(1:104), "", sep = ','))) %>% # Separate and retrieve original answers
  arrange(id)
results in:
id groep_MNC zkhs fbeh pgebdat couple bb1,
1 3 1 1 1 1955-12-01 1 4
2 3 1 1 1 1955-12-01 2 5
3 5 1 1 1 1943-04-09 1 2
4 5 1 1 1 1943-04-09 2 2
But now it also copies the patient's date of birth into the 'partner' row.
I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:
Example of data
id groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age prgesl relpnst
1 3 1 1 1 1955-12-01 42.50000 1 <NA> NA 2 1
2 5 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.50000 1 2
3 7 1 1 1 1958-04-10 40.25000 1 <NA> NA 2 1
4 10 1 1 1 1958-04-17 40.25000 1 1957-07-31 41.33333 2 1
5 12 1 1 2 1947-11-01 50.66667 1 1944-06-08 54.58333 2 1
And then, after a couple of hundred patient-only variables, this partner data comes along:
pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
1 5 5 5 5 2 5 4 2 3
2 2 1 4 1 3 4 3 3 4
3 5 3 4 4 4 3 5 3 4
4 5 3 5 5 5 5 4 4 4
5 5 5 5 5 5 4 4 3 4
note, I didn't create this dataset myself - I'm just here to tidy up the mess :)
Edit: The dataset is in Dutch. pgesl = gender of the patient, prgesl = gender of the partner, etc.
Using the melt function from the data.table package, you can supply multiple measure groups via patterns and as a result create more than one value column:
library(data.table)
melt(setDT(df),
     measure.vars = patterns('_age', 'gesl', 'gebdat'),
     value.name = c('age', 'geslacht', 'geboortedatum')
)[, variable := c('patient', 'partner')[variable]][]
you get:
id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable age geslacht geboortedatum
1: 3 1 1 1 1 5 5 patient 42.50000 1 1955-12-01
2: 5 1 1 1 2 2 1 patient 55.16667 1 1943-04-09
3: 7 1 1 1 1 5 3 patient 40.25000 1 1958-04-10
4: 10 1 1 1 1 5 3 patient 40.25000 1 1958-04-17
5: 12 1 1 2 1 5 5 patient 50.66667 1 1947-11-01
6: 3 1 1 1 1 5 5 partner NA 2 <NA>
7: 5 1 1 1 2 2 1 partner 36.50000 1 1962-04-18
8: 7 1 1 1 1 5 3 partner NA 2 <NA>
9: 10 1 1 1 1 5 3 partner 41.33333 2 1957-07-31
10: 12 1 1 2 1 5 5 partner 54.58333 2 1944-06-08
Instead of patterns you could also use a list of column indexes or column names.
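For instance, the same call written with explicit column names instead of patterns (a sketch, assuming the column names of the example data used below):
melt(setDT(df),
     measure.vars = list(c("p_age", "pr_age"),
                         c("pgesl", "prgesl"),
                         c("pgebdat", "prgebdat")),
     value.name = c("age", "geslacht", "geboortedatum")
)[, variable := c("patient", "partner")[variable]][]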
HTH
Used data:
df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 2L),
pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
pgesl = c(1L, 1L, 1L, 1L, 1L),
prgebdat = c("<NA>", "1962-04-18", "<NA>", "1957-07-31", "1944-06-08"),
pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
prgesl = c(2L, 1L, 2L, 2L, 2L),
relpnst = c(1L, 2L, 1L, 1L, 1L),
pbb1 = c(5L, 2L, 5L, 5L, 5L),
pbb2 = c(5L, 1L, 3L, 3L, 5L)),
.Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

How can I exclude zeros when finding a frequency and separate into 4 categories

I have a data frame data.2016 and am trying to find the frequency with which "DIPL" occurs (excluding zeros); "DIPL" is the number of worm parasites found in a fish.
Data looks something like this:
data.2016
Site DIPL
1 0
1 1
1 1
2 6
2 8
2 1
2 1
3 0
3 0
3 0
4 1258
4 501
I want the output to look like this:
Site freq
1 2
2 4
3 0
4 2
From this I can see that, out of the 3 fish found at site #1 (in the data frame), 2 had worm parasites.
I've tried
aggregate(DIPL ~ Site, data = data.2016, frequency) # and get:
Site DIPL
1 1 1
2 2 1
3 3 1
4 4 1
Is there a way to count the number of fish with worms from the DIPL column (meaning the value in the column is higher than zero) per site?
Just use a custom function that removes the zeros.
aggregate(DIPL ~ Site, data.2016, function(x) length(x[x != 0])) # or sum(x != 0)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
Another option would be to temporarily transform the DIPL column then just take the sum.
aggregate(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0), sum)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
xtabs() is fun too ...
xtabs(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0))
# Site
# 1 2 3 4
# 2 4 0 2
By the way, frequency is for use on time-series data.
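To illustrate: frequency() returns the number of observations per unit of time of a time series, and 1 for a plain vector, which is why the aggregate() attempt above returned 1 for every site.
frequency(ts(1:24, frequency = 12)) # 12 observations per year
frequency(data.2016$DIPL)           # 1 -- a plain vector has no time structure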
Data:
data.2016 <- structure(list(Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L,
501L)), .Names = c("Site", "DIPL"), class = "data.frame", row.names = c(NA,
-12L))
Might something like this be what you're looking for?
# first some fake data
site <- c("A","A","A","B","B","B")
numworms <- c(1,0,3,0,0,42)
data.frame(site,numworms)
site numworms
1 A 1
2 A 0
3 A 3
4 B 0
5 B 0
6 B 42
tapply(numworms, site, function(x) sum(x>0))
A B
2 1
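For completeness, a dplyr sketch of the same count, assuming the data.2016 frame from the Data block above:
library(dplyr)
data.2016 %>%
  group_by(Site) %>%
  summarise(freq = sum(DIPL != 0))
This reproduces the Site/freq table requested in the question.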
