How to change the date in the data frame - r

I have a large data frame, like this:
ID 1 2 3 4 5 type c new_ee first_ee_t
A 20051110 20051111 20051114 20051208 20060105 DATE 1 none none
A NA 1 3 24 2 diff_date 2 1 20051110
B 20050422 20050613 20050711 20071023 NA DATE 1 none none
B NA 52 28 834 999 diff_date 2 1 20050422
C 20021206 20040224 20040423 20040507 20040528 DATE 1 none none
C NA 445 59 14 21 diff_date 2 1 20021206
D 20030708 20050228 20050228 20050815 20050915 DATE 1 none none
D NA 601 0 168 31 diff_date 2 1 20030708
E 20000123 20040306 20060919 20060919 20060920 DATE 1 none none
E NA 1504 927 0 1 diff_date 2 1 20000123
F 20070413 NA NA NA NA DATE 1 none none
F NA 999 999 999 999 diff_date 2 0 0
G 20020318 20020411 NA NA NA DATE 1 none none
G NA 24 999 999 999 diff_date 2 0 0
I have to change the first_ee_t variable. If, for an ID, the gap between the first and second dates is more than 365 days, first_ee_t should change to the second date.
If both the first-to-second and the second-to-third gaps are more than 365 days, it should change to the third date,
such as:
ID 1 2 3 4 5 type c new_ee first_ee_t
A 20051110 20051111 20051114 20051208 20060105 DATE 1 none none
A NA 1 3 24 2 diff_date 2 1 20051110
B 20050422 20050613 20050711 20071023 NA DATE 1 none none
B NA 52 28 834 999 diff_date 2 1 20050422
C 20021206 20040224 20040423 20040507 20040528 DATE 1 none none
C NA 445 59 14 21 diff_date 2 1 20040224
D 20030708 20050228 20050228 20050815 20050915 DATE 1 none none
D NA 601 0 168 31 diff_date 2 1 20050228
E 20000123 20040306 20060919 20060919 20060920 DATE 1 none none
E NA 1504 927 0 1 diff_date 2 1 20060919
F 20070413 NA NA NA NA DATE 1 none none
F NA 999 999 999 999 diff_date 2 0 0
G 20020318 20020411 NA NA NA DATE 1 none none
G NA 24 999 999 999 diff_date 2 0 0

Assuming your expected output above has a few errors, I think this is what you are after
#first, here's the data in a copy/paste-able form
dd <-
structure(list(ID = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L), .Label = c("A", "B", "C", "D", "E",
"F", "G"), class = "factor"), X1 = c(20051110L, NA, 20050422L,
NA, 20021206L, NA, 20030708L, NA, 20000123L, NA, 20070413L, NA,
20020318L, NA), X2 = c(20051111L, 1L, 20050613L, 52L, 20040224L,
445L, 20050228L, 601L, 20040306L, 1504L, NA, 999L, 20020411L,
24L), X3 = c(20051114L, 3L, 20050711L, 28L, 20040423L, 59L, 20050228L,
0L, 20060919L, 927L, NA, 999L, NA, 999L), X4 = c(20051208L, 24L,
20071023L, 834L, 20040507L, 14L, 20050815L, 168L, 20060919L,
0L, NA, 999L, NA, 999L), X5 = c(20060105L, 2L, NA, 999L, 20040528L,
21L, 20050915L, 31L, 20060920L, 1L, NA, 999L, NA, 999L), type = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("DATE",
"diff_date"), class = "factor"), c = c(1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), new_ee = structure(c(3L, 2L,
3L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 3L, 1L, 3L, 1L), .Label = c("0",
"1", "none"), class = "factor"), first_ee_t = c("none", "20051110",
"none", "20050422", "none", "20021206", "none", "20030708", "none",
"20000123", "none", "0", "none", "0")), .Names = c("ID", "X1",
"X2", "X3", "X4", "X5", "type", "c", "new_ee", "first_ee_t"), row.names = c(NA,
-14L), class = "data.frame")
And now, here's the code that will do the transformation:
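result <- unsplit(lapply(split(dd, dd$ID), function(x) {
  # Row 1 holds the dates, row 2 the day differences
  if (all(is.na(x[1, 4:6]))) {
    # No third, fourth or fifth date recorded
    x[2, "first_ee_t"] <- 0
  } else {
    # Position of the first gap under 365 days;
    # x[1, first] is the date that opens that gap
    first <- min(which(x[2, 2:6] < 365))
    if (is.finite(first)) {
      x[2, "first_ee_t"] <- x[1, first]
    }
  }
  x
}), dd$ID)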
This is assuming that each ID has exactly two rows and that the second row always contains the datediffs and the first always contains the dates themselves.
This does produce a warning in the case of ID F who seems to have no values that satisfy the requirements so that row is left untouched.
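The index arithmetic in that function is easy to misread, so here is a minimal sketch of the same lookup on plain vectors. The helper name `first_close_date` and its `limit` argument are made up for illustration and are not part of the answer above. It assumes, as the answer does, that the gap at position k is the number of days between date k-1 and date k, so the first gap is always NA.

```r
# Hypothetical helper: return the date that opens the first gap shorter
# than `limit` days, or NA when no gap qualifies.
first_close_date <- function(dates, gaps, limit = 365) {
  hit <- which(gaps < limit)       # which() silently drops NA comparisons
  if (length(hit) == 0) {
    return(NA_integer_)
  }
  dates[hit[1] - 1]                # the date just before the first short gap
}

# ID C from the question: the first gap (445) is too long, the second (59) is not
first_close_date(
  dates = c(20021206L, 20040224L, 20040423L, 20040507L, 20040528L),
  gaps  = c(NA, 445L, 59L, 14L, 21L)
)
# 20040224

# ID E: two long gaps, so the third date is picked
first_close_date(
  dates = c(20000123L, 20040306L, 20060919L, 20060919L, 20060920L),
  gaps  = c(NA, 1504L, 927L, 0L, 1L)
)
# 20060919
```

Unlike the answer, this sketch returns NA rather than 0 when no gap is short enough (ID F, for instance), which may or may not be what you want.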

Related

time series plot for missing data

I have some sequence event data for which I want to plot the trend of missingness on value across time. Example below:
id time value
1 aa122 1 1
2 aa2142 1 1
3 aa4341 1 1
4 bb132 1 2
5 bb2181 2 1
6 bb3242 2 3
7 bb3321 2 NA
8 cc122 2 1
9 cc2151 2 2
10 cc3241 3 1
11 dd161 3 3
12 dd2152 3 NA
13 dd3282 3 NA
14 ee162 3 1
15 ee2201 4 2
16 ee3331 4 NA
17 ff1102 4 NA
18 ff2141 4 NA
19 ff3232 5 1
20 gg142 5 3
21 gg2192 5 NA
22 gg3311 5 NA
23 gg4362 5 NA
24 ii111 5 NA
The NAs are supposed to increase over time (the behaviors are fading). How do I plot the number of NAs across time?
I think this is what you're looking for: you want to see how many NAs appear over time. Assuming this is correct, and if each time is a group, you can count the number of NAs that appear in each group.
data:
df <- structure(list(id = structure(1:24, .Label = c("aa122", "aa2142",
"aa4341", "bb132", "bb2181", "bb3242", "bb3321", "cc122", "cc2151",
"cc3241", "dd161", "dd2152", "dd3282", "ee162", "ee2201", "ee3331",
"ff1102", "ff2141", "ff3232", "gg142", "gg2192", "gg3311", "gg4362",
"ii111"), class = "factor"), time = c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L), value = c(1L, 1L, 1L, 2L, 1L, 3L, NA, 1L, 2L, 1L, 3L,
NA, NA, 1L, 2L, NA, NA, NA, 1L, 3L, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-24L))
library(tidyverse)
library(ggplot2)
df %>%
  group_by(time) %>%
  summarise(sumNA = sum(is.na(value)))
# A tibble: 5 × 2
   time sumNA
  <int> <int>
1     1     0
2     2     1
3     3     2
4     4     3
5     5     4
You can then plot this using ggplot2
df %>%
  group_by(time) %>%
  summarise(sumNA = sum(is.na(value))) %>%
  ggplot(aes(x = time)) +
  geom_line(aes(y = sumNA))
As you can see, as time increases, the number of NA's also increases
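Because the time groups in the sample have different sizes, it may also be worth looking at the share of NAs per time point and not only the count. The following is a base R sketch under that assumption; the names `na_count` and `na_share` are just illustrative, and the vectors are copied from the sample data.

```r
# time and value columns from the sample data
time  <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
           4, 4, 4, 4, 5, 5, 5, 5, 5, 5)
value <- c(1, 1, 1, 2, 1, 3, NA, 1, 2, 1, 3, NA, NA, 1,
           2, NA, NA, NA, 1, 3, NA, NA, NA, NA)

# is.na() gives TRUE/FALSE; summing counts the NAs, averaging gives their share
na_count <- tapply(is.na(value), time, sum)
na_share <- tapply(is.na(value), time, mean)

na_count   # 0 1 2 3 4 for times 1 to 5
na_share
```

On this sample the count rises at every step, but the share is lower at time 5 than at time 4, since time 5 has six rows and time 4 only four.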

New data frame, if specific value(s) is contained AND other values aren't included in a range of columns in r

So, I have a large data frame with monthly observations of n individuals.
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
A 33 6 1 2 1 5
B 36 5 0 2 1 5
C 22 4 1 NA 1 5
D 2 2 0 2 1 5
E 5 2 1 2 1 6
F 7 1 0 2 1 5
G 8 6 1 2 1 5
H 2 8 0 2 2 5
I 1 3 1 2 1 5
J 3 2 0 2 1 5
I want to create a new data frame that includes only the individuals who meet some specific conditions.
E.g. if, for individual i, the range of columns y_0101:y_0312 does NOT include the values 3, 6 or NA, AND includes the value 2 or 1, THEN individual i should be included in the new data frame. This produces the following data frame:
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
B 36 5 0 2 1 5
D 2 2 0 2 1 5
F 7 1 0 2 1 5
H 2 8 0 2 2 5
I tried different ways, but I can't figure out how to get multiple conditions included.
df <- df %>% filter(vars(starts_with("y_"))!=3 | !=6 | != NA)
or
df <- df %>% filter_at(vars(starts_with("y_")), all_vars(!=3 | !=6 | != NA)
I've tried some other things as well, like !%in%, but that doesn't seem to work. Any ideas?
I think you're almost there, but might need a slight shift in the logic:
library(dplyr)
df <- data.frame(A1 = 1:10,
                 A2 = 10:1,
                 A3 = 1:10,
                 B1 = 1:10)
df %>%
  filter_at(vars(starts_with("A")), ~ !(.x %in% c(3, 6, NA))) %>%
  filter(if_any(starts_with("A"), ~ .x %in% c(1, 2)))
In the first step, I filter out all rows where any of the columns are 3, 6, or NA. In the second step, I keep only the rows where at least one of the columns is 1 or 2. Does this help with your case?
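The reason `%in%` works here where chained `!=` comparisons do not is the way R treats NA. A small sketch on a made-up vector `x` (not from the question) shows the difference:

```r
x <- c(1, 3, NA, 6, 2)

# Comparing with != propagates NA, so the third element is neither TRUE nor FALSE
x != 3
# TRUE FALSE NA TRUE TRUE

# %in% never returns NA, and an NA in the lookup set matches an NA in x
bad <- x %in% c(3, 6, NA)
bad
# FALSE TRUE TRUE TRUE FALSE

# Negating gives the values to keep
!bad
# TRUE FALSE FALSE FALSE TRUE
```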
Here is a base R option using rowSums :
cols <- grep('y_', names(df))
include <- c(1, 2)
not_include <- c(3, 6, NA)
result <- subset(df, rowSums(sapply(df[cols], `%in%`, include)) > 0 &
                     rowSums(sapply(df[cols], `%in%`, not_include)) == 0)
result
# ind y_0101 y_0102 y_0103 y_0104 y_0311 y_0312
#2 B 36 5 0 2 1 5
#4 D 2 2 0 2 1 5
#6 F 7 1 0 2 1 5
#8 H 2 8 0 2 2 5
data
df <- structure(list(ind = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"), y_0101 = c(33L, 36L, 22L, 2L, 5L, 7L, 8L, 2L, 1L,
3L), y_0102 = c(6L, 5L, 4L, 2L, 2L, 1L, 6L, 8L, 3L, 2L), y_0103 = c(1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L), y_0104 = c(2L, 2L, NA, 2L,
2L, 2L, 2L, 2L, 2L, 2L), y_0311 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L), y_0312 = c(5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L
)), class = "data.frame", row.names = c(NA, -10L))

How to change the values within the group?

I created a column of nesting success with a value of "1" if the nest's fate was "rearing" or "fledged", and 0 if the nest's fate was "nest failed". In some cases, the nest's fate was "rearing" on the first visit and "failed" on the second visit. In such cases, the success of a single nest turned out to be both 1 and 0 (see nests "D062" and "D063").
How can I remove the "1"s (or assign "NA") and keep only the "0"s in cases where the same nest has both 1 and 0 as its success?
In other words, I'd like to have only one success outcome per nest (single 1 or 0), not multiple. And, I want to keep all the rows.
My data looks like this:
Example data:
structure(list(date = structure(c(4L, 2L, 1L, 5L, 3L, 1L, 5L,
2L, 1L, 5L, 3L, 1L, 5L, 2L, 1L), .Label = c("14/06/2018", "17/05/2018",
"21/05/2018", "5/05/2018", "6/05/2018"), class = "factor"), nest.code = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("D046",
"D047", "D062", "D063", "W18003"), class = "factor"), year = c(2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L), species = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("AA",
"BB"), class = "factor"), visit = c(1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), eggs = c(1L, 0L, 0L, 1L, 0L,
0L, 2L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), chicks = c(0L, NA, NA,
0L, 1L, 0L, 0L, 2L, 0L, 0L, 1L, 0L, 0L, NA, 1L), fate = structure(c(2L,
4L, 5L, 2L, 4L, 3L, 2L, 4L, 3L, 2L, 4L, 3L, 2L, 5L, 1L), .Label = c("fledged",
"incubating", "nest failed", "rearing", "unknown"), class = "factor"),
success = c(NA, 1L, NA, NA, 1L, 0L, NA, 1L, 0L, NA, 1L, 0L,
NA, NA, 1L)), class = "data.frame", row.names = c(NA, -15L
))
This is the code I tried:
datanew <- data %>%
group_by(year, species, nest.code)%>%
mutate(Real_success = ifelse(success ==1 & 0, 0, success))
I'm not sure how you imagine it to look in the end. Do you want to have all rows preserved? Do you want to have it ordered in some way? Anyway, this is what I came up with:
UPDATE: Sorry, I missed "fledged" in the first answer
dat %>%
  group_by(year, species, nest.code) %>%
  arrange(year, species, nest.code, success) %>%
  mutate(success = ifelse(row_number() > 1, NA, success))
# A tibble: 15 x 9
# Groups: year, species, nest.code [5]
date nest.code year species visit eggs chicks outcome success
<fct> <fct> <int> <fct> <int> <int> <int> <fct> <int>
1 17/05/2018 D046 2018 AA 2 0 NA rearing 1
2 5/05/2018 D046 2018 AA 1 1 0 incubating NA
3 14/06/2018 D046 2018 AA 3 0 NA unknown NA
4 14/06/2018 D047 2018 AA 3 0 0 nest failed 0
5 21/05/2018 D047 2018 AA 2 0 1 rearing NA
6 6/05/2018 D047 2018 AA 1 1 0 incubating NA
7 14/06/2018 D062 2018 AA 3 0 0 nest failed 0
8 17/05/2018 D062 2018 AA 2 0 2 rearing NA
9 6/05/2018 D062 2018 AA 1 2 0 incubating NA
10 14/06/2018 D063 2018 AA 3 0 0 nest failed 0
11 21/05/2018 D063 2018 AA 2 0 1 rearing NA
12 6/05/2018 D063 2018 AA 1 1 0 incubating NA
13 14/06/2018 W18003 2018 BB 3 0 1 fledged 1
14 6/05/2018 W18003 2018 BB 1 1 0 incubating NA
15 17/05/2018 W18003 2018 BB 2 0 NA unknown NA
There will definitely be an easier way to do this; I'm no dplyr pro myself.
If it works, I'm happy.
Here's an approach that puts a zero in all rows for nests with at least one fail, a 1 if there is at least one success and no fail, and NA otherwise:
library(dplyr)
mydata %>%
  group_by(year, species, nest.code) %>%
  mutate(real_success = case_when(
    sum(1 - success, na.rm = TRUE) > 0 ~ 0,  # There was a fail
    sum(success, na.rm = TRUE) > 0 ~ 1,      # At least one success and no fail
    TRUE ~ NA_real_)) %>%                    # No recorded outcome
  ungroup()
# A tibble: 15 x 10
date nest.code year species visit eggs chicks fate success real_success
<fct> <fct> <int> <fct> <int> <int> <int> <fct> <int> <dbl>
1 5/05/2018 D046 2018 AA 1 1 0 incubating NA 1
2 17/05/2018 D046 2018 AA 2 0 NA rearing 1 1
3 14/06/2018 D046 2018 AA 3 0 NA unknown NA 1
4 6/05/2018 D047 2018 AA 1 1 0 incubating NA 0
5 21/05/2018 D047 2018 AA 2 0 1 rearing 1 0
6 14/06/2018 D047 2018 AA 3 0 0 nest fail… 0 0
7 6/05/2018 D062 2018 AA 1 2 0 incubating NA 0
8 17/05/2018 D062 2018 AA 2 0 2 rearing 1 0
9 14/06/2018 D062 2018 AA 3 0 0 nest fail… 0 0
10 6/05/2018 D063 2018 AA 1 1 0 incubating NA 0
11 21/05/2018 D063 2018 AA 2 0 1 rearing 1 0
12 14/06/2018 D063 2018 AA 3 0 0 nest fail… 0 0
13 6/05/2018 W18003 2018 BB 1 1 0 incubating NA 1
14 17/05/2018 W18003 2018 BB 2 0 NA unknown NA 1
15 14/06/2018 W18003 2018 BB 3 0 1 fledged 1 1
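The same rule (any fail gives 0, otherwise any success gives 1, otherwise NA) can also be sketched without dplyr using `ave()`. This is only an illustration: the helper `nest_outcome` is a made-up name, the vectors are copied from the example data, and grouping is by nest code alone, which assumes nest codes are not reused across years or species.

```r
# nest codes and success values from the example data
nest    <- rep(c("D046", "D047", "D062", "D063", "W18003"), each = 3)
success <- c(NA, 1, NA, NA, 1, 0, NA, 1, 0, NA, 1, 0, NA, NA, 1)

# Hypothetical helper: one outcome per nest, repeated for every visit
nest_outcome <- function(s) {
  if (any(s == 0, na.rm = TRUE)) {
    out <- 0            # at least one recorded fail
  } else if (any(s == 1, na.rm = TRUE)) {
    out <- 1            # a success and no fail
  } else {
    out <- NA_real_     # nothing recorded
  }
  rep(out, length(s))
}

# ave() applies the helper within each nest and keeps the original row order
real_success <- ave(success, nest, FUN = nest_outcome)
data.frame(nest, success, real_success)
```

All 15 rows are kept, and each nest ends up with a single value in `real_success`.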

Selecting a sequence of random length starting and ending with specific values and limited by another column

I have a fairly large data set that has the form of the following table:
value ID
1 0 A
2 0 A
3 1 A
4 1 A
5 0 A
6 -1 A
7 0 B
8 1 B
9 1 B
10 0 B
11 0 B
12 0 B
13 1 C
14 1 C
15 0 C
16 1 C
17 1 C
18 1 C
19 0 C
Essentially I'd like to transform the above, keeping only the first and last values of sequences that start with an occurrence of zero followed by an unknown number of ones and end at the last occurrence of one:
value ID
2 0 A
4 1 A
7 0 B
9 1 B
15 0 C
18 1 C
Is there an easy way to accomplish this?
dput of the first example follows:
structure(list(value = structure(c(2L, 2L, 3L, 3L, 2L, 1L, 2L,
3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L), .Label = c("-1",
"0", "1"), class = "factor"), ID = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("value", "ID"), row.names = c(NA, -19L), class = "data.frame")
Here's my attempt using data.table and stringi packages combination
library(stringi)
library(data.table)
setDT(df)[, .(.I[stri_locate_all_regex(paste(value, collapse = ""), "01+")[[1]]], 0:1), by = ID]
# ID V1 V2
# 1: A 2 0
# 2: A 4 1
# 3: B 7 0
# 4: B 9 1
# 5: C 15 0
# 6: C 18 1
This basically converts each group to a single string and then detects the beginning and the end of parts that match the 01+ regex while subsetting from the row index .I. Eventually I'm just adding 0:1 to the data (which seems redundant to me at least).
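One thing to watch with the string approach is that "-1" takes two characters once pasted, so any -1 that occurs before a match shifts the character positions away from the row positions. As a hedged alternative, here is a base R sketch using `rle()` that works on run positions directly. The helper name `zero_one_runs` is made up for illustration, and `value` is assumed to be numeric (the factor from the question would first need `as.numeric(as.character(value))`).

```r
df <- data.frame(
  value = c(0, 0, 1, 1, 0, -1,  0, 1, 1, 0, 0, 0,  1, 1, 0, 1, 1, 1, 0),
  ID    = rep(c("A", "B", "C"), times = c(6, 6, 7))
)

# Hypothetical helper: for one group, find each run of 1s that directly
# follows a run of 0s; return the position of the last 0 and of the last 1.
zero_one_runs <- function(value) {
  r    <- rle(value)
  ends <- cumsum(r$lengths)                         # last position of each run
  prev_is_zero <- c(FALSE, head(r$values, -1) == 0)
  k <- which(r$values == 1 & prev_is_zero)
  cbind(start = ends[k - 1], end = ends[k])
}

# Apply per ID and map positions within the group back to row numbers
rows <- unlist(lapply(split(seq_len(nrow(df)), df$ID), function(i) {
  m <- zero_one_runs(df$value[i])
  i[as.vector(t(m))]                                # start, end, start, end, ...
}), use.names = FALSE)

df[rows, ]
# rows 2, 4, 7, 9, 15 and 18, as in the desired output
```

This version also returns every matching sequence when a group contains more than one.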

'Complex' aggregation function in dcast from reshape2

I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
dcast(data = long, Day + Genotype ~ .,
value.var = "variable", function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Another solution without additional packages:
aggregate(list(Summary = long$variable),
          by = list(Day = long$Day, Genotype = long$Genotype),
          function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
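The cube root of a product of three values is their geometric mean, which can also be computed on the log scale. That may be preferable when there are many views per group and the raw product becomes very large. The sketch below assumes all values are strictly positive; `geo_mean` is an illustrative name, and the data frame is rebuilt by hand from the example.

```r
long <- data.frame(
  Day      = rep(c("1", "2"), each = 6),
  Genotype = rep(rep(c("A", "B"), each = 3), times = 2),
  View     = rep(c("1", "2", "3"), times = 4),
  variable = c(1496, 1704, 1738, 1553, 1834, 1421,
               1208, 1845, 1325, 1264, 1920, 1735)
)

# Geometric mean via logs: exp(mean(log(x))) equals prod(x)^(1/length(x))
geo_mean <- function(x) exp(mean(log(x)))

res <- aggregate(variable ~ Day + Genotype, data = long, FUN = geo_mean)
res

# Check against the direct formula for genotype A on day 1
all.equal(geo_mean(c(1496, 1704, 1738)), prod(c(1496, 1704, 1738))^(1/3))
```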
