Specify which column(s) a specific date appears in (R)

I have a subset of my data in a data frame (dput code block below) containing dates on which a storm occurred ("Date_AR"). I'd like to know whether a storm occurred in the north, the south, or both, by determining whether the same date occurs in the "Date_N" and/or "Date_S" columns.
For example, the first date in the "Date_AR" column is Jan 17, 1989. In the Location column, I would like "S" to be printed, since this date is found in the "Date_S" column. If Apr 5, 1989 occurs in both "Date_N" and "Date_S", then I would like a "B" (for both) to be printed in the Location column.
Thanks in advance for the help! Apologies if this type of question is already out there. I may not know the keywords to search.
structure(list(Date_S = structure(c(6956, 6957, 6970, 7008, 7034,
7035, 7036, 7172, 7223, 7224, 7233, 7247, 7253, 7254, 7255, 7262, 7263, 7266, 7275,
7276), class = "Date"),
Date_N = structure(c(6968, 6969, 7035, 7049, 7103, 7172, 7221, 7223, 7230, 7246, 7247,
7251, 7252, 7253, 7262, 7266, 7275, 7276, 7277, 7280), class = "Date"),
Date_AR = structure(c(6956, 6957, 6968, 6969, 6970, 7008,
7034, 7035, 7036, 7049, 7103, 7172, 7221, 7223, 7224, 7230,
7233, 7246, 7247, 7251), class = "Date"), Precip = c(23.6,
15.4, 3, 16.8, 0.2, 3.6, 22, 13.4, 0, 30.8, 4.6, 27.1, 0,
19, 2.8, 11.4, 2, 57.6, 9.4, 39), Location = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 20L), class = "data.frame")

Using dplyr::case_when (with the dput above assigned to dat) you could do:
library(dplyr)
dat |>
  mutate(Location = case_when(
    Date_AR %in% Date_S & Date_AR %in% Date_N ~ "B",
    Date_AR %in% Date_S ~ "S",
    Date_AR %in% Date_N ~ "N"
  ))
#> Date_S Date_N Date_AR Precip Location
#> 1 1989-01-17 1989-01-29 1989-01-17 23.6 S
#> 2 1989-01-18 1989-01-30 1989-01-18 15.4 S
#> 3 1989-01-31 1989-04-06 1989-01-29 3.0 N
#> 4 1989-03-10 1989-04-20 1989-01-30 16.8 N
#> 5 1989-04-05 1989-06-13 1989-01-31 0.2 S
#> 6 1989-04-06 1989-08-21 1989-03-10 3.6 S
#> 7 1989-04-07 1989-10-09 1989-04-05 22.0 S
#> 8 1989-08-21 1989-10-11 1989-04-06 13.4 B
#> 9 1989-10-11 1989-10-18 1989-04-07 0.0 S
#> 10 1989-10-12 1989-11-03 1989-04-20 30.8 N
#> 11 1989-10-21 1989-11-04 1989-06-13 4.6 N
#> 12 1989-11-04 1989-11-08 1989-08-21 27.1 B
#> 13 1989-11-10 1989-11-09 1989-10-09 0.0 N
#> 14 1989-11-11 1989-11-10 1989-10-11 19.0 B
#> 15 1989-11-12 1989-11-19 1989-10-12 2.8 S
#> 16 1989-11-19 1989-11-23 1989-10-18 11.4 N
#> 17 1989-11-20 1989-12-02 1989-10-21 2.0 S
#> 18 1989-11-23 1989-12-03 1989-11-03 57.6 N
#> 19 1989-12-02 1989-12-04 1989-11-04 9.4 B
#> 20 1989-12-03 1989-12-07 1989-11-08 39.0 N
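The same logic also works in base R with nested ifelse() calls; a minimal sketch, again assuming the dput above is assigned to dat:
# most specific case first, otherwise the "B" rows would be caught by the "S" branch
dat$Location <- with(dat, ifelse(Date_AR %in% Date_S & Date_AR %in% Date_N, "B",
                          ifelse(Date_AR %in% Date_S, "S",
                          ifelse(Date_AR %in% Date_N, "N", NA_character_))))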


Take the first value of group of columns in R

I have some data:
data
structure(list(WBC_BASELINE = c(2.9, NA, NA, 6.9, NA, NA, NA,
NA, NA, NA, 7.4, 12.8, NA, NA, NA, NA, NA, 4.2, NA, NA), WBC_FIRST = c(2.4,
14.8, 11, 7.3, 4.5, NA, NA, 6.1, 7.7, 16.2, 5.3, 10.3, 14.5,
NA, NA, 12.8, 3.7, 4.7, 16.6, 9.3), neuts_BASELINE = c(2, NA,
NA, 5.4, NA, NA, NA, NA, NA, NA, 4.96, 8.9, NA, NA, NA, NA, NA,
NA, NA, NA), neuts_FIRST = c(1.5, 13, 5.8, 4.5, 1.6, NA, NA,
1.7, 4.3, 9.3, 3.4, 5.8, 10.1, NA, NA, 9.7, 2.3, 3.5, 5, 8.2)), row.names = c(NA,
20L), class = "data.frame")
In the dataset I have some blood test results (in this case WBC and neuts taken at 2 time points - baseline, and first). I want to select the baseline value if it exists, else take the first value.
I can do this separately for WBC and neuts, but I want to do it for 20 different blood tests without hard coding it each time...
Hard coding way:
data %>%
  mutate(WBC_first_value = ifelse(!is.na(WBC_BASELINE), WBC_BASELINE, WBC_FIRST)) %>%
  mutate(neuts_first_value = ifelse(!is.na(neuts_BASELINE), neuts_BASELINE, neuts_FIRST))
Please note that each blood test is always followed by _BASELINE and _FIRST
I'd be grateful for any help please!
We could automate this with some data wrangling, using pivot_longer and pivot_wider in combination:
library(dplyr)
library(tidyr)
data %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c('grp', '.value'),
               names_sep = "_") %>%
  group_by(grp) %>%
  transmute(rn, new = coalesce(BASELINE, FIRST)) %>%
  pivot_wider(names_from = grp, values_from = new) %>%
  select(-rn) %>%
  bind_cols(data, .)
output:
WBC_BASELINE WBC_FIRST neuts_BASELINE neuts_FIRST WBC neuts
1 2.9 2.4 2.00 1.5 2.9 2.00
2 NA 14.8 NA 13.0 14.8 13.00
3 NA 11.0 NA 5.8 11.0 5.80
4 6.9 7.3 5.40 4.5 6.9 5.40
5 NA 4.5 NA 1.6 4.5 1.60
6 NA NA NA NA NA NA
7 NA NA NA NA NA NA
8 NA 6.1 NA 1.7 6.1 1.70
9 NA 7.7 NA 4.3 7.7 4.30
10 NA 16.2 NA 9.3 16.2 9.30
11 7.4 5.3 4.96 3.4 7.4 4.96
12 12.8 10.3 8.90 5.8 12.8 8.90
13 NA 14.5 NA 10.1 14.5 10.10
14 NA NA NA NA NA NA
15 NA NA NA NA NA NA
16 NA 12.8 NA 9.7 12.8 9.70
17 NA 3.7 NA 2.3 3.7 2.30
18 4.2 4.7 NA 3.5 4.2 3.50
19 NA 16.6 NA 5.0 16.6 5.00
20 NA 9.3 NA 8.2 9.3 8.20
You could do this with a loop!
vars <- c("WBC", "neuts")
for (v in vars) {
  data[, paste0(v, "_new")] <- ifelse(!is.na(data[, paste0(v, "_BASELINE")]),
                                      data[, paste0(v, "_BASELINE")],
                                      data[, paste0(v, "_FIRST")])
}
Or with sapply (which returns a matrix of the new columns):
sapply(vars, function(v) ifelse(!is.na(data[, paste0(v, "_BASELINE")]),
                                data[, paste0(v, "_BASELINE")],
                                data[, paste0(v, "_FIRST")]))
You could also define vars programmatically:
vars <- unique(gsub("^([A-Za-z]+)_[A-Za-z]+$", "\\1", names(data)))
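With dplyr >= 1.0 you could also skip both the reshape and the loop using across() with cur_column(); a sketch, assuming every test has exactly one _BASELINE and one _FIRST column:
library(dplyr)

data %>%
  mutate(across(
    ends_with("_BASELINE"),
    # look up the matching _FIRST column by name and take BASELINE where available
    ~ coalesce(.x, get(sub("_BASELINE", "_FIRST", cur_column()))),
    .names = "{sub('_BASELINE', '', .col)}"
  ))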

Determine range of time where measurements are not NA

I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches of NA. I need to establish up front for which timespan non-NA data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values that are not NA and to select the first and last such timestamp for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[, 2:4], function(x) {
  ts <- df$timestamp[!is.na(x)]
  paste(ts[1], ts[length(ts)], sep = " - ")
})
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
dplyr solution
library(tidyverse)
df %>%
  pivot_longer(-c(timestamp, starttime_ms)) %>%
  group_by(name) %>%
  drop_na() %>%
  summarise(min = min(timestamp),
            max = max(timestamp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could take the cumulative sum of the differences of the non-NA indicator per column, coerce it to logical, and subset the first and last matching timestamp.
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
       function(x) c(df$timestamp[as.logical(x)][1],
                     rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
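With dplyr >= 1.0, across() gives the same summary without reshaping; a sketch that relies on the fixed-width timestamps sorting correctly as strings:
library(dplyr)

df %>%
  summarise(across(c(C, B, A),
                   ~ paste(range(timestamp[!is.na(.x)]), collapse = " - ")))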

How to melt a multiple columns df with R?

I have this data for which I would like to transform it to long.
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
df <- structure(list(
julian_days = c(
127, 130, 132, 134, 137, 139,
141, 144, 148, 151, 153, 155, 158, 160, 162, 165, 167, 169, 172,
NA, NA, NA, NA, NA, NA, NA, NA, NA
), sea_ice_algae_last_cm = c(
0.636,
0.698, 0.666666666666667, 0.685384615384615, 0.713, 0.6375, 0.58375,
0.637272727272727, 0.6575, 0.691666666666667, 0.629166666666667,
0.637142857142857, 0.589166666666667, 0.56, 0.571818181818182,
0.492, 0.31, 0.312, 0.203076923076923, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), sd = c(
0.0227058484879019, 0.0369684550213647, 0.0533853912601565,
0.0525381424324881, 0.0413790070231539, 0.0381682876458741, 0.0277788888666675,
0.0410099766132362, 0.0222076972732838, 0.0194079021706795, 0.0299873710792131,
0.0363841933236059, 0.0253908835942542, 0.055746679790749, 0.0604678727620178,
0.0294957624075053, 0.10770329614269, 0.0657267069006199, 0.0693282789084673,
NA, NA, NA, NA, NA, NA, NA, NA, NA
), julian_days_2 = c(
127, 130,
132, 134, 137, 139, 141, 144, 146, 148, 151, 153, 155, 158, 160,
162, 165, 167, 169, 172, 174, 176, 179, 181, 183, 186, 188, 190
), water_1_5_m_depth = c(
0.69, 0.5475, 0.596, 0.512, 0.598, 0.488333333333333,
0.27, 0.41, 0.568, 0.503333333333333, 0.668333333333333, 0.71,
0.636666666666667, 0.623333333333333, 0.66, 0.541666666666667,
0.57, 0.545, 0.501666666666667, 0.526666666666667, 0.566666666666667,
0.493333333333333, 0.59, 0.518333333333333, 0.443333333333333,
0.605, 0.58, 0.478333333333333
), sd_2 = c(
0.121655250605964,
0.0718215380880506, 0.0736885337077625, 0.0376828873628335, 0.084380092438916,
0.0636919670497516, 0.054037024344425, 0.0540370243444251, 0.0370135110466435,
0.0571547606649408, 0.0702614166286638, 0.0442718872423573, 0.0799166232186176,
0.0480277697448743, 0.0409878030638384, 0.0462240918425302, 0.0920869154657709,
0.0706399320497981, 0.0511533641774093, 0.100531918646103, 0.0186189867250252,
0.0588784057755188, 0.0841427358718512, 0.0934701378337842, 0.0492612085384298,
0.0653452370108182, 0.0878635305459549, 0.0851860708488579
),
water_10_m_depth = c(
0.66, 0.732, 0.595, 0.712, 0.514, 0.48,
0.35, 0.44, 0.535, 0.403333333333333, 0.728, 0.746, 0.625,
0.698333333333333, 0.705, 0.555, 0.585, 0.651666666666667,
0.603333333333333, 0.595, 0.615, 0.615, 0.658333333333333,
0.641666666666667, 0.623333333333333, 0.628333333333333,
0.661666666666667, 0.631666666666667
), sd_3 = c(
0, 0.0342052627529742,
0.0387298334620742, 0.0327108544675923, 0.0610737259384104,
0.0700000000000001, 0.127279220613579, 0.0972111104761177,
0.0564800849857717, 0.0504645089807343, 0.0540370243444252,
0.0415932686861709, 0.0809320702811933, 0.0475043857624395,
0.0398748040747538, 0.0568330889535313, 0.0388587184554509,
0.0204124145231932, 0.058878405775519, 0.0896102672688791,
0.0535723809439155, 0.0488876262463212, 0.043089055068157,
0.0306050104830347, 0.0527888877195444, 0.0708284312029193,
0.0426223728418147, 0.0348807492274272
), julian_days_3 = c(
134,
137, 139, 141, 146, 148, 153, 155, 160, 162, 165, 169, 172,
174, 176, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA
), water_40_m_depth = c(
0.523166666666667, 0.360833333333333,
0.279, 0.228, 0.551166666666667, 0.358666666666667, 0.593,
0.6225, 0.6665, 0.5468, 0.334714285714286, 0.654, 0.567666666666667,
0.664166666666667, 0.6345, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
), sd_4 = c(
0.0793937445058905, 0.0346145441493408,
0.0834625664594612, 0.105740247777277, 0.0437008771841786,
0.0810719844747042, 0.0849529281425892, 0.0539620236833275,
0.0689514321823702, 0.0344992753547085, 0.0889713704621029,
0.064221491729794, 0.0166933120340652, 0.0545982295195244,
0.0578472125516865, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), julian_days_4 = c(
181, 183, 186, 188, 190, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA
), water_60_m_depth = c(
0.617833333333333,
0.492333333333333, 0.642166666666667, 0.7265, 0.686166666666667,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA
), sd_5 = c(
0.0574818812032684,
0.049766119666563, 0.0704540039079871, 0.0286618212959331,
0.0382225936674458, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)
), row.names = c(
NA,
-28L
), class = c("tbl_df", "tbl", "data.frame"))
arrange(df, desc(julian_days_4)) # Look at the data at day 190
#> # A tibble: 28 x 14
#> julian_days sea_ice_algae_l… sd julian_days_2 water_1_5_m_dep…
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 137 0.713 0.0414 137 0.598
#> 2 134 0.685 0.0525 134 0.512
#> 3 132 0.667 0.0534 132 0.596
#> 4 130 0.698 0.0370 130 0.548
#> 5 127 0.636 0.0227 127 0.69
#> 6 139 0.638 0.0382 139 0.488
#> 7 141 0.584 0.0278 141 0.27
#> 8 144 0.637 0.0410 144 0.41
#> 9 148 0.658 0.0222 146 0.568
#> 10 151 0.692 0.0194 148 0.503
#> # … with 18 more rows, and 9 more variables: sd_2 <dbl>,
#> # water_10_m_depth <dbl>, sd_3 <dbl>, julian_days_3 <dbl>,
#> # water_40_m_depth <dbl>, sd_4 <dbl>, julian_days_4 <dbl>,
#> # water_60_m_depth <dbl>, sd_5 <dbl>
I would like to “stack” all this into 3 columns:
julian with all columns starting with “julian”
measure with all columns starting with “water” or “sea”
sd with all columns starting with “sd”
Note that in the “water” columns, the numbers represent the depth (ex.: water_1_5_m_depth means 1.5 m).
The desired output for the first line would be something like:
tibble(
julian = c(127, 127, 127, 134, 181),
type = c("sea", "water_1.5", "water_10", "water_40", "water_60"),
measure = c(0.64, 0.69, 0.66, 0.52, 0.62),
sd = c(0.02, 0.12, 0, 0.08, 0.06)
)
#> # A tibble: 5 x 4
#> julian type measure sd
#> <dbl> <chr> <dbl> <dbl>
#> 1 127 sea 0.64 0.02
#> 2 127 water_1.5 0.69 0.12
#> 3 127 water_10 0.66 0
#> 4 134 water_40 0.52 0.08
#> 5 181 water_60 0.62 0.06
My attempt so far was with data.table.
melt(
  setDT(df),
  measure = patterns("^julian", "^sea", "^water_1_5", "^water_10", "^water_40", "^water_60", "^sd"),
  value.name = c("julian", "sea", "water_1.5", "water_10", "water_40", "water_60", "sd")
)
#> variable julian sea water_1.5 water_10 water_40 water_60
#> 1: 1 127 0.6360000 0.6900 0.660 0.5231667 0.6178333
#> 2: 1 130 0.6980000 0.5475 0.732 0.3608333 0.4923333
#> 3: 1 132 0.6666667 0.5960 0.595 0.2790000 0.6421667
#> 4: 1 134 0.6853846 0.5120 0.712 0.2280000 0.7265000
#> 5: 1 137 0.7130000 0.5980 0.514 0.5511667 0.6861667
#> ---
#> 136: 5 NA NA NA NA NA NA
#> 137: 5 NA NA NA NA NA NA
#> 138: 5 NA NA NA NA NA NA
#> 139: 5 NA NA NA NA NA NA
#> 140: 5 NA NA NA NA NA NA
#> sd
#> 1: 0.02270585
#> 2: 0.03696846
#> 3: 0.05338539
#> 4: 0.05253814
#> 5: 0.04137901
#> ---
#> 136: NA
#> 137: NA
#> 138: NA
#> 139: NA
#> 140: NA
Any help appreciated.
UPDATE:
Here is the file I received.
Created on 2019-04-12 by the reprex package (v0.2.1)
library(tidyverse)

# split the 14 columns into their natural blocks: cols 1-3 (sea ice),
# 4-8 (the 1.5 m and 10 m series share julian_days_2), 9-11 (40 m), 12-14 (60 m)
list_of_dfs <- split.default(df, rep(1:4, c(3, 5, 3, 3)))

# block 2 holds two measures; carve the 10 m columns out into their own block
list_of_dfs[[5]] <- list_of_dfs[[2]][, c(1, 4, 5)]
list_of_dfs[[2]] <- list_of_dfs[[2]][, 1:3]

list_of_dfs %>%
  map(~ .[complete.cases(.), ]) %>%  # drop the NA padding rows
  map(~ mutate(., type = grep("^sea|^water", names(.), value = TRUE))) %>%  # record which measure each block holds
  map(setNames, nm = c("julian", "measure", "sd", "type")) %>%
  bind_rows()
# # A tibble: 95 x 4
# julian measure sd type
# <dbl> <dbl> <dbl> <chr>
# 1 127 0.636 0.0227 sea_ice_algae_last_cm
# 2 130 0.698 0.0370 sea_ice_algae_last_cm
# 3 132 0.667 0.0534 sea_ice_algae_last_cm
# 4 134 0.685 0.0525 sea_ice_algae_last_cm
# 5 137 0.713 0.0414 sea_ice_algae_last_cm
# 6 139 0.638 0.0382 sea_ice_algae_last_cm
# 7 141 0.584 0.0278 sea_ice_algae_last_cm
# 8 144 0.637 0.0410 sea_ice_algae_last_cm
# 9 148 0.658 0.0222 sea_ice_algae_last_cm
# 10 151 0.692 0.0194 sea_ice_algae_last_cm
# # … with 85 more rows
It would be nice if you shared your desired output. I think this is what you want:
df %>%
  select(starts_with("julian")) %>%
  gather(key = col, julian) %>%
  bind_cols(df %>%
              select(starts_with("water")) %>%
              gather(col_water, measure)) %>%
  # bind_cols(df %>%
  #             select(starts_with("sea")) %>%
  #             gather(col_sea, measure2)) %>%
  bind_cols(df %>%
              select(starts_with("sd")) %>%
              gather(col_sd, sd)) %>%
  select(julian, measure, sd)
julian measure sd
<dbl> <dbl> <dbl>
1 127 0.69 0.122
2 130 0.548 0.0718
3 132 0.596 0.0737
4 134 0.512 0.0377
5 137 0.598 0.0844
6 139 0.488 0.0637
7 141 0.27 0.0540
8 144 0.41 0.0540
9 148 0.568 0.0370
10 151 0.503 0.0572
# ... with 102 more rows
In this attempt I did not include the variables starting with sea, since it would lead to a one-to-many merge. Let me know if I am on the right track to include that one.
data.table::melt(
  df,
  measure.vars = patterns("^julian"),
  variable.name = "julian_variable",
  value.name = "julian_value"
) %>%
  data.table::melt(
    measure.vars = patterns(measure = "^sea|^water"),
    variable.name = "measure_variable",
    value.name = "measure_value"
  ) %>%
  data.table::melt(
    measure.vars = patterns(measure = "^sd"),
    variable.name = "sd_variable",
    value.name = "sd_value"
  )
# julian_variable julian_value measure_variable measure_value sd_variable sd_value
# 1: julian_days 127 sea_ice_algae_last_cm 0.6360000 sd 0.02270585
# 2: julian_days 130 sea_ice_algae_last_cm 0.6980000 sd 0.03696846
# 3: julian_days 132 sea_ice_algae_last_cm 0.6666667 sd 0.05338539
# 4: julian_days 134 sea_ice_algae_last_cm 0.6853846 sd 0.05253814
# 5: julian_days 137 sea_ice_algae_last_cm 0.7130000 sd 0.04137901
# ---
# 2796: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2797: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2798: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2799: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2800: julian_days_4 NA water_60_m_depth NA sd_5 NA
Though it is unclear what the desired output is. This solution obviously leads to a lot of duplication: each original row is expanded 100-fold (4 "julian" columns * 5 "measure" columns * 5 "sd" columns).
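Since the column groupings are irregular (the 1.5 m and 10 m series share julian_days_2), another option is simply to spell the groupings out and stack them; a sketch along the lines of the desired output, with the julian/measure/sd triples read off the data above:
library(dplyr)
library(purrr)

# which julian/measure/sd columns belong together, taken from the structure above
triples <- tribble(
  ~julian,         ~measure,                ~sd,
  "julian_days",   "sea_ice_algae_last_cm", "sd",
  "julian_days_2", "water_1_5_m_depth",     "sd_2",
  "julian_days_2", "water_10_m_depth",      "sd_3",
  "julian_days_3", "water_40_m_depth",      "sd_4",
  "julian_days_4", "water_60_m_depth",      "sd_5"
)

pmap_dfr(triples, function(julian, measure, sd) {
  tibble(julian  = df[[julian]],
         type    = measure,
         measure = df[[measure]],
         sd      = df[[sd]]) %>%
    filter(!is.na(julian), !is.na(measure))  # drop the NA padding rows
})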

How to drop NA variables in a data frame by row

Here is my data frame:
structure(list(Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA,
266.7), L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA), C = c(NA,
23.8, NA, 8.47, 20.89, 18.72, 14.94, NA), X = c(40.56, NA, 26.05,
3.08, 23.77, 59.37, NA, NA), W = c(29.47, NA, NA, NA, 36.08,
NA, 27.34, 28.19), S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34,
NA), Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19), H = c(0,
NA, NA, NA, NA, 0, NA, 0)), class = "data.frame", row.names = c(NA,
-8L), .Names = c("Q", "L", "C", "X", "W", "S", "Y", "H"))
Each row has 4 variables that are NA; now I want to apply the same operations to every row:
Drop the 4 variables that are NA
Calculate diversity for the remaining 4 variables (it's just some computation on the rest; here I use diversity() from vegan)
Append the output to a new data frame
But the problem is:
How do I drop the NA variables using dplyr? I don't know whether select() can do it.
How do I apply operations to every row of a data frame?
It seems that drop_na() removes the entire row for my dataset; any suggestions?
With tidyverse it may be better to gather into 'long' format and then spread it back. Assuming that we have exactly 4 non-NA elements per row: create a row index with rownames_to_column (from tibble), gather (from tidyr) into 'long' format, remove the NA elements, group by row number ('rn'), change the 'key' values to common values, and then spread back to 'wide' format.
library(tibble)
library(tidyr)
library(dplyr)
res <- rownames_to_column(df1, 'rn') %>%
  gather(key, val, -rn) %>%
  filter(!is.na(val)) %>%
  group_by(rn) %>%
  mutate(key = LETTERS[1:4]) %>%
  spread(key, val) %>%
  ungroup %>%
  select(-rn)
res
# A tibble: 8 x 4
# A B C D
#* <dbl> <dbl> <dbl> <dbl>
#1 18.9 40.6 29.5 0
#2 347 23.8 7.47 2.81
#3 167 15.7 26.0 0
#4 163 39.5 8.47 3.08
#5 20.9 23.8 36.1 18.6
#6 18.7 59.4 21.2 0
#7 14.9 27.3 25.3 10.8
#8 267 28.2 12.2 0
diversity(res)
# 1 2 3 4 5 6 7 8
#1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572
Regarding the diversity calculation, we can also replace NA with 0 and apply it to the whole dataset, i.e.
library(vegan)
diversity(replace(df1, is.na(df1), 0))
#[1] 1.0533711 0.3718959 0.6331070 0.7090783
#[5] 1.3517680 0.9516232 1.3215712 0.4697572
as we get the same output as in the first solution
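A compact base R take on the same idea, assuming the dput above is assigned to df1 and that every row really has exactly 4 non-NA values (diversity() is indifferent to column order, so this matches both results above):
library(vegan)

# keep the 4 non-NA values per row; t() because apply() returns them column-wise
vals <- t(apply(df1, 1, function(r) r[!is.na(r)]))
diversity(vals)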

Sum Event Data in R

I'm working with some daily rainfall data that spans several years. I want to sum the rainfall on consecutive rainy days to get a rainfall total for that rainfall event. It would also be nice to get a start and stop date and a rainfall intensity per event. I'm thinking I could hack something together with aggregate, but what I have in mind seems very bulky. Is there a quick and elegant solution, possibly with dplyr, tidyr or data.table?
Data
structure(list(Time = structure(c(1353398400, 1353484800, 1353571200,
1353657600, 1353744000, 1353830400, 1353916800, 1354003200, 1354089600,
1354176000, 1354262400, 1354348800, 1354435200, 1354521600, 1354608000,
1354694400, 1354780800, 1354867200, 1354953600, 1355040000, 1355126400,
1355212800, 1355299200, 1355385600, 1355472000, 1355558400, 1355644800,
1355731200, 1355817600, 1355904000, 1355990400, 1356076800, 1356163200,
1356249600, 1356336000, 1356422400, 1356508800, 1356595200, 1356681600,
1356768000, 1356854400, 1356940800, 1357027200, 1357113600, 1357200000,
1357286400, 1357372800, 1357459200, 1357545600, 1357632000, 1357718400
), class = c("POSIXct", "POSIXt"), tzone = ""), inc = c(NA, NA,
NA, NA, NA, NA, NA, 0.11, NA, 0.62, 0.0899999999999999, 0.39,
NA, NA, 0.03, NA, NA, NA, NA, NA, NA, 0.34, NA, NA, NA, NA, 0.0600000000000001,
0.02, NA, NA, NA, 0.29, 0.35, 0.02, 0.27, 0.17, 0.0600000000000001,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.47, NA, NA, NA, 0.0300000000000002
)), .Names = c("Time", "inc"), row.names = 50:100, class = "data.frame")
Desired output
Begin End Days Total Intensity
11/27/2012 11/27/2012 1 0.11 0.11
11/29/2012 12/1/2012 3 1.1 0.366666667
12/4/2012 12/4/2012 1 0.03 0.03
12/11/2012 12/11/2012 1 0.34 0.34
12/16/2012 12/17/2012 2 0.08 0.04
12/21/2012 12/26/2012 6 0.29 0.048333333
1/5/2013 1/5/2013 1 0.47 0.47
1/9/2013 1/9/2013 1 0.03 0.03
data.table::rleid is a convenient function for dealing with runs of consecutive values. Assuming your data frame is named df and has been sorted by the Time variable beforehand:
library(data.table)
setDT(df)
na.omit(df[, .(Begin = as.Date(first(Time)),
               End = as.Date(last(Time)),
               Days = as.Date(last(Time)) - as.Date(first(Time)) + 1,
               Total = sum(inc),
               Intensity = mean(inc)),
           by = .(id = rleid(is.na(inc)))])
# id Begin End Days Total Intensity
#1: 2 2012-11-27 2012-11-27 1 days 0.11 0.1100000
#2: 4 2012-11-29 2012-12-01 3 days 1.10 0.3666667
#3: 6 2012-12-04 2012-12-04 1 days 0.03 0.0300000
#4: 8 2012-12-11 2012-12-11 1 days 0.34 0.3400000
#5: 10 2012-12-16 2012-12-17 2 days 0.08 0.0400000
#6: 12 2012-12-21 2012-12-26 6 days 1.16 0.1933333 #I think you have some miscalculation here
#7: 14 2013-01-05 2013-01-05 1 days 0.47 0.4700000
#8: 16 2013-01-09 2013-01-09 1 days 0.03 0.0300000
Here is an approach that uses dplyr.
First, some preliminary cleanup: a date variable is needed, not a POSIXct:
library(dplyr)
df2 <- df %>%
  mutate(date = as.Date(Time)) %>%
  select(-Time)
This computes a data frame with an explicit variable for rain_event:
df3 <- df2 %>%
  filter(!is.na(inc)) %>%
  mutate(
    day_lag = as.numeric(difftime(date, lag(date), units = "days")),
    # special case: first rain event
    day_lag = ifelse(is.na(day_lag), 1, day_lag),
    rain_event = 1 + cumsum(day_lag > 1)
  )
> df3
inc date day_lag rain_event
1 0.11 2012-11-27 1 1
2 0.62 2012-11-29 2 2
3 0.09 2012-11-30 1 2
4 0.39 2012-12-01 1 2
5 0.03 2012-12-04 3 3
6 0.34 2012-12-11 7 4
7 0.06 2012-12-16 5 5
8 0.02 2012-12-17 1 5
9 0.29 2012-12-21 4 6
10 0.35 2012-12-22 1 6
11 0.02 2012-12-23 1 6
12 0.27 2012-12-24 1 6
13 0.17 2012-12-25 1 6
14 0.06 2012-12-26 1 6
15 0.47 2013-01-05 10 7
16 0.03 2013-01-09 4 8
Now, summarise by each rain event, computing the metrics you care about:
df3 %>%
  group_by(rain_event) %>%
  summarise(
    begin = min(date),
    end = max(date),
    days = n(),
    total = sum(inc),
    intensity = mean(inc)
  )
# A tibble: 8 × 6
rain_event begin end days total intensity
<dbl> <date> <date> <int> <dbl> <dbl>
1 1 2012-11-27 2012-11-27 1 0.11 0.1100000
2 2 2012-11-29 2012-12-01 3 1.10 0.3666667
3 3 2012-12-04 2012-12-04 1 0.03 0.0300000
4 4 2012-12-11 2012-12-11 1 0.34 0.3400000
5 5 2012-12-16 2012-12-17 2 0.08 0.0400000
6 6 2012-12-21 2012-12-26 6 1.16 0.1933333
7 7 2013-01-05 2013-01-05 1 0.47 0.4700000
8 8 2013-01-09 2013-01-09 1 0.03 0.0300000
You can append a new column that groups rows belonging to the same continuous rainy period, then get the statistics you want using dplyr. Assuming that your data frame is called df:
library(dplyr)
rain_period <- rep(NA, nrow(df)) # initialize vector
group <- 1                       # initialize group number
for (i in 1:nrow(df)) {
  if (is.na(df$inc[i])) {
    group <- group + 1
  } else {
    rain_period[i] <- group
  }
}
df$group <- rain_period

result <- dplyr::group_by(df, group)
result <- dplyr::summarise(result,
                           Begin = min(Time),
                           End = max(Time),
                           Days = n(),
                           Total = sum(inc),
                           Intensity = mean(inc))
Only base packages, basically using the aggregate function. I know it is not the nicest option around. The only problem is the date formatting (the date columns of the resulting data frame must be formatted one by one, otherwise they are converted to integers):
data1 <- df # the rainfall data: the same dput as shown under "Data" above
rainruns <- function(datas = data1) {
  incs <- c(NA, datas$inc) # pad with a leading NA so an event can start at row 1
  event <- cumsum(is.na(incs[-length(incs)]) & !is.na(incs[-1])) # event counter: increments at every NA -> value transition
  datas <- cbind(datas, event) # add events column
  datas2 <- datas[!is.na(datas$inc), ] # delete NA's
  summarydata1 <- aggregate(datas2$inc, by = list(datas2$event), # summarize rain data by event
                            FUN = function(x) c(length(x), sum(x), mean(x)))[[2]]
  summarydata2 <- aggregate(as.Date(datas2$Time), by = list(datas2$event), # summarize dates by event
                            FUN = function(x) c(min(x), max(x)))[[2]]
  summarydata <- data.frame(format(as.Date(summarydata2[, 1], # combine both, correcting date formats
                                           origin = "1970-01-01"), "%m/%d/%Y"),
                            format(as.Date(summarydata2[, 2],
                                           origin = "1970-01-01"), "%m/%d/%Y"), summarydata1)
  names(summarydata) <- c("Begin", "End", "Days", "Total", "Intensity") # update column names
  return(summarydata)
}
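Calling it on the sample data then returns one row per rain event, in the same shape as the desired output above (note that Days here counts rainy days rather than the calendar span, which coincides for this sample):
rainruns(data1)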
