I have a subset of my data in a dataframe (dput codeblock below) containing dates in which a storm occurred ("Date_AR"). I'd like to know if a storm occurred in the north, south or both, by determining whether the same date occurred in the "Date_N" and/or "Date_S" column/s.
For example, the first date is Jan 17, 1989 in the "Date_AR" column. In the location column, I would like "S" to be printed, since this date is found in the "Date_S" column. If Apr 5, 1989 occurs in both "Date_N" and "Date_S", then I would like a "B" (for both) to be printed in the location column.
Thanks in advance for the help! Apologies if this type of question is already out there. I may not know the keywords to search.
structure(list(Date_S = structure(c(6956, 6957, 6970, 7008, 7034,
7035, 7036, 7172, 7223, 7224, 7233, 7247, 7253, 7254, 7255, 7262, 7263, 7266, 7275,
7276), class = "Date"),
Date_N = structure(c(6968, 6969, 7035, 7049, 7103, 7172, 7221, 7223, 7230, 7246, 7247,
7251, 7252, 7253, 7262, 7266, 7275, 7276, 7277, 7280), class = "Date"),
Date_AR = structure(c(6956, 6957, 6968, 6969, 6970, 7008,
7034, 7035, 7036, 7049, 7103, 7172, 7221, 7223, 7224, 7230,
7233, 7246, 7247, 7251), class = "Date"), Precip = c(23.6,
15.4, 3, 16.8, 0.2, 3.6, 22, 13.4, 0, 30.8, 4.6, 27.1, 0,
19, 2.8, 11.4, 2, 57.6, 9.4, 39), Location = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 20L), class = "data.frame")
Using dplyr::case_when you could do:
library(dplyr)
dat |>
  mutate(Location = case_when(
    Date_AR %in% Date_S & Date_AR %in% Date_N ~ "B",
    Date_AR %in% Date_S ~ "S",
    Date_AR %in% Date_N ~ "N"
  ))
#> Date_S Date_N Date_AR Precip Location
#> 1 1989-01-17 1989-01-29 1989-01-17 23.6 S
#> 2 1989-01-18 1989-01-30 1989-01-18 15.4 S
#> 3 1989-01-31 1989-04-06 1989-01-29 3.0 N
#> 4 1989-03-10 1989-04-20 1989-01-30 16.8 N
#> 5 1989-04-05 1989-06-13 1989-01-31 0.2 S
#> 6 1989-04-06 1989-08-21 1989-03-10 3.6 S
#> 7 1989-04-07 1989-10-09 1989-04-05 22.0 S
#> 8 1989-08-21 1989-10-11 1989-04-06 13.4 B
#> 9 1989-10-11 1989-10-18 1989-04-07 0.0 S
#> 10 1989-10-12 1989-11-03 1989-04-20 30.8 N
#> 11 1989-10-21 1989-11-04 1989-06-13 4.6 N
#> 12 1989-11-04 1989-11-08 1989-08-21 27.1 B
#> 13 1989-11-10 1989-11-09 1989-10-09 0.0 N
#> 14 1989-11-11 1989-11-10 1989-10-11 19.0 B
#> 15 1989-11-12 1989-11-19 1989-10-12 2.8 S
#> 16 1989-11-19 1989-11-23 1989-10-18 11.4 N
#> 17 1989-11-20 1989-12-02 1989-10-21 2.0 S
#> 18 1989-11-23 1989-12-03 1989-11-03 57.6 N
#> 19 1989-12-02 1989-12-04 1989-11-04 9.4 B
#> 20 1989-12-03 1989-12-07 1989-11-08 39.0 N
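For comparison, the same logic can be sketched in base R with nested ifelse() calls. This is only an illustration on a few made-up dates: date_ar, date_s and date_n are hypothetical vectors, not the columns of the data frame above.

```r
# Hypothetical example vectors (not the asker's data)
date_ar <- as.Date(c("1989-01-17", "1989-01-29", "1989-04-06", "1989-12-25"))
date_s  <- as.Date(c("1989-01-17", "1989-04-06"))
date_n  <- as.Date(c("1989-01-29", "1989-04-06"))

# One logical per storm date: does it appear in the south / north dates?
in_s <- date_ar %in% date_s
in_n <- date_ar %in% date_n

# Check "both" first, then each side, and fall back to NA
location <- ifelse(in_s & in_n, "B",
            ifelse(in_s, "S",
            ifelse(in_n, "N", NA_character_)))
location
```

Dates found in neither vector come out as NA, which is also what case_when returns when no condition matches.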
I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches with NA. I need to establish up front, for which timespan positive data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values for which the measurement is not NA and to select the first and last such timestamp for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[,2:4], function(x)
  paste(df$timestamp[which(!is.na(x))][1],
        df$timestamp[which(!is.na(x))][length(df$timestamp[which(!is.na(x))])], sep = " - "))
                            C                             B                             A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
dplyr solution
library(tidyverse)
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
df %>%
  pivot_longer(-c(timestamp, starttime_ms)) %>%
  group_by(name) %>%
  drop_na() %>%
  summarise(min = timestamp %>% min(),
            max = timestamp %>% max())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could look at the cumsum of differences where there's no NA, coerce them to logical and subset first and last element.
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
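A shorter base R variant of the same idea is to let range() pick the first and last timestamp among the non-NA rows. The sketch below uses a small made-up data frame (not the full data from the question) and assumes the timestamps are zero-padded, fixed-width strings, so that alphabetical order matches chronological order. A column with no non-NA values would need separate handling.

```r
# Hypothetical miniature of the data, with only two subjects
df <- data.frame(
  timestamp = c("00:00:00.033", "00:00:00.066", "00:00:00.080",
                "00:00:00.097", "00:00:00.130"),
  C = c(NA, NA, 47.876, 47.876, NA),
  B = c(NA, 0.346, 0.346, 0.346, 0.346),
  stringsAsFactors = FALSE
)

# For each subject column, keep the timestamps of non-NA rows
# and paste the smallest and largest together
ranges <- sapply(df[c("C", "B")], function(x) {
  paste(range(df$timestamp[!is.na(x)]), collapse = " - ")
})
ranges
```

The result is a named character vector with one "start - end" string per subject.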
I have this data, which I would like to transform to long format.
library(tidyverse)
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
df <- structure(list(
julian_days = c(
127, 130, 132, 134, 137, 139,
141, 144, 148, 151, 153, 155, 158, 160, 162, 165, 167, 169, 172,
NA, NA, NA, NA, NA, NA, NA, NA, NA
), sea_ice_algae_last_cm = c(
0.636,
0.698, 0.666666666666667, 0.685384615384615, 0.713, 0.6375, 0.58375,
0.637272727272727, 0.6575, 0.691666666666667, 0.629166666666667,
0.637142857142857, 0.589166666666667, 0.56, 0.571818181818182,
0.492, 0.31, 0.312, 0.203076923076923, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), sd = c(
0.0227058484879019, 0.0369684550213647, 0.0533853912601565,
0.0525381424324881, 0.0413790070231539, 0.0381682876458741, 0.0277788888666675,
0.0410099766132362, 0.0222076972732838, 0.0194079021706795, 0.0299873710792131,
0.0363841933236059, 0.0253908835942542, 0.055746679790749, 0.0604678727620178,
0.0294957624075053, 0.10770329614269, 0.0657267069006199, 0.0693282789084673,
NA, NA, NA, NA, NA, NA, NA, NA, NA
), julian_days_2 = c(
127, 130,
132, 134, 137, 139, 141, 144, 146, 148, 151, 153, 155, 158, 160,
162, 165, 167, 169, 172, 174, 176, 179, 181, 183, 186, 188, 190
), water_1_5_m_depth = c(
0.69, 0.5475, 0.596, 0.512, 0.598, 0.488333333333333,
0.27, 0.41, 0.568, 0.503333333333333, 0.668333333333333, 0.71,
0.636666666666667, 0.623333333333333, 0.66, 0.541666666666667,
0.57, 0.545, 0.501666666666667, 0.526666666666667, 0.566666666666667,
0.493333333333333, 0.59, 0.518333333333333, 0.443333333333333,
0.605, 0.58, 0.478333333333333
), sd_2 = c(
0.121655250605964,
0.0718215380880506, 0.0736885337077625, 0.0376828873628335, 0.084380092438916,
0.0636919670497516, 0.054037024344425, 0.0540370243444251, 0.0370135110466435,
0.0571547606649408, 0.0702614166286638, 0.0442718872423573, 0.0799166232186176,
0.0480277697448743, 0.0409878030638384, 0.0462240918425302, 0.0920869154657709,
0.0706399320497981, 0.0511533641774093, 0.100531918646103, 0.0186189867250252,
0.0588784057755188, 0.0841427358718512, 0.0934701378337842, 0.0492612085384298,
0.0653452370108182, 0.0878635305459549, 0.0851860708488579
),
water_10_m_depth = c(
0.66, 0.732, 0.595, 0.712, 0.514, 0.48,
0.35, 0.44, 0.535, 0.403333333333333, 0.728, 0.746, 0.625,
0.698333333333333, 0.705, 0.555, 0.585, 0.651666666666667,
0.603333333333333, 0.595, 0.615, 0.615, 0.658333333333333,
0.641666666666667, 0.623333333333333, 0.628333333333333,
0.661666666666667, 0.631666666666667
), sd_3 = c(
0, 0.0342052627529742,
0.0387298334620742, 0.0327108544675923, 0.0610737259384104,
0.0700000000000001, 0.127279220613579, 0.0972111104761177,
0.0564800849857717, 0.0504645089807343, 0.0540370243444252,
0.0415932686861709, 0.0809320702811933, 0.0475043857624395,
0.0398748040747538, 0.0568330889535313, 0.0388587184554509,
0.0204124145231932, 0.058878405775519, 0.0896102672688791,
0.0535723809439155, 0.0488876262463212, 0.043089055068157,
0.0306050104830347, 0.0527888877195444, 0.0708284312029193,
0.0426223728418147, 0.0348807492274272
), julian_days_3 = c(
134,
137, 139, 141, 146, 148, 153, 155, 160, 162, 165, 169, 172,
174, 176, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA
), water_40_m_depth = c(
0.523166666666667, 0.360833333333333,
0.279, 0.228, 0.551166666666667, 0.358666666666667, 0.593,
0.6225, 0.6665, 0.5468, 0.334714285714286, 0.654, 0.567666666666667,
0.664166666666667, 0.6345, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA
), sd_4 = c(
0.0793937445058905, 0.0346145441493408,
0.0834625664594612, 0.105740247777277, 0.0437008771841786,
0.0810719844747042, 0.0849529281425892, 0.0539620236833275,
0.0689514321823702, 0.0344992753547085, 0.0889713704621029,
0.064221491729794, 0.0166933120340652, 0.0545982295195244,
0.0578472125516865, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA
), julian_days_4 = c(
181, 183, 186, 188, 190, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA
), water_60_m_depth = c(
0.617833333333333,
0.492333333333333, 0.642166666666667, 0.7265, 0.686166666666667,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA
), sd_5 = c(
0.0574818812032684,
0.049766119666563, 0.0704540039079871, 0.0286618212959331,
0.0382225936674458, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)
), row.names = c(
NA,
-28L
), class = c("tbl_df", "tbl", "data.frame"))
arrange(df, desc(julian_days_4)) # Look at the data at day 190
#> # A tibble: 28 x 14
#> julian_days sea_ice_algae_l… sd julian_days_2 water_1_5_m_dep…
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 137 0.713 0.0414 137 0.598
#> 2 134 0.685 0.0525 134 0.512
#> 3 132 0.667 0.0534 132 0.596
#> 4 130 0.698 0.0370 130 0.548
#> 5 127 0.636 0.0227 127 0.69
#> 6 139 0.638 0.0382 139 0.488
#> 7 141 0.584 0.0278 141 0.27
#> 8 144 0.637 0.0410 144 0.41
#> 9 148 0.658 0.0222 146 0.568
#> 10 151 0.692 0.0194 148 0.503
#> # … with 18 more rows, and 9 more variables: sd_2 <dbl>,
#> # water_10_m_depth <dbl>, sd_3 <dbl>, julian_days_3 <dbl>,
#> # water_40_m_depth <dbl>, sd_4 <dbl>, julian_days_4 <dbl>,
#> # water_60_m_depth <dbl>, sd_5 <dbl>
I would like to “stack” all this into 3 columns:
julian with all columns starting with “julian”
measure with all columns starting with “water” or “sea”
sd with all columns starting with “sd”
Note that in the “water” columns, the numbers represent the depth (ex.: water_1_5_m_depth means 1.5 m).
The desired output for the first line would be something like:
tibble(
  julian = c(127, 127, 127, 134, 181),
  type = c("sea", "water_1.5", "water_10", "water_40", "water_60"),
  measure = c(0.64, 0.69, 0.66, 0.52, 0.62),
  sd = c(0.02, 0.12, 0, 0.08, 0.06)
)
#> # A tibble: 5 x 4
#> julian type measure sd
#> <dbl> <chr> <dbl> <dbl>
#> 1 127 sea 0.64 0.02
#> 2 127 water_1.5 0.69 0.12
#> 3 127 water_10 0.66 0
#> 4 134 water_40 0.52 0.08
#> 5 181 water_60 0.62 0.06
My attempt so far was with data.table.
melt(
  setDT(df),
  measure = patterns("^julian", "^sea", "^water_1_5", "^water_10", "^water_40", "^water_60", "^sd"),
  value.name = c("julian", "sea", "water_1.5", "water_10", "water_40", "water_60", "sd")
)
#> variable julian sea water_1.5 water_10 water_40 water_60
#> 1: 1 127 0.6360000 0.6900 0.660 0.5231667 0.6178333
#> 2: 1 130 0.6980000 0.5475 0.732 0.3608333 0.4923333
#> 3: 1 132 0.6666667 0.5960 0.595 0.2790000 0.6421667
#> 4: 1 134 0.6853846 0.5120 0.712 0.2280000 0.7265000
#> 5: 1 137 0.7130000 0.5980 0.514 0.5511667 0.6861667
#> ---
#> 136: 5 NA NA NA NA NA NA
#> 137: 5 NA NA NA NA NA NA
#> 138: 5 NA NA NA NA NA NA
#> 139: 5 NA NA NA NA NA NA
#> 140: 5 NA NA NA NA NA NA
#> sd
#> 1: 0.02270585
#> 2: 0.03696846
#> 3: 0.05338539
#> 4: 0.05253814
#> 5: 0.04137901
#> ---
#> 136: NA
#> 137: NA
#> 138: NA
#> 139: NA
#> 140: NA
Any help appreciated.
UPDATE:
Here is the file I received.
Created on 2019-04-12 by the reprex package (v0.2.1)
library(tidyverse)
list_of_dfs <- split.default(df, rep(1:4, c(3, 5, 3, 3)))
list_of_dfs[[5]] <- list_of_dfs[[2]][, c(1, 4, 5)]
list_of_dfs[[2]] <- list_of_dfs[[2]][, 1:3]
list_of_dfs %>%
  map(~ .[complete.cases(.), ]) %>%
  map(~ mutate(., type = grep("^sea|^water", names(.), value = TRUE))) %>%
  map(setNames, nm = c("julian", "measure", "sd", "type")) %>%
  bind_rows()
# # A tibble: 95 x 4
# julian measure sd type
# <dbl> <dbl> <dbl> <chr>
# 1 127 0.636 0.0227 sea_ice_algae_last_cm
# 2 130 0.698 0.0370 sea_ice_algae_last_cm
# 3 132 0.667 0.0534 sea_ice_algae_last_cm
# 4 134 0.685 0.0525 sea_ice_algae_last_cm
# 5 137 0.713 0.0414 sea_ice_algae_last_cm
# 6 139 0.638 0.0382 sea_ice_algae_last_cm
# 7 141 0.584 0.0278 sea_ice_algae_last_cm
# 8 144 0.637 0.0410 sea_ice_algae_last_cm
# 9 148 0.658 0.0222 sea_ice_algae_last_cm
# 10 151 0.692 0.0194 sea_ice_algae_last_cm
# # … with 85 more rows
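The split-and-stack idea can also be sketched in base R without hard-coding the group sizes, by starting a new group at every column whose name begins with "julian". This is only a sketch on a made-up miniature table (wide and its column names are hypothetical), and it assumes every group is a strict (julian, measure, sd) triplet. The real data has one julian column shared by two measures, so that group would still need the manual fix shown above.

```r
# Hypothetical miniature of the wide layout: two (julian, measure, sd) triplets
wide <- data.frame(
  julian_days   = c(127, 130, NA),
  sea_ice       = c(0.64, 0.70, NA),
  sd            = c(0.02, 0.04, NA),
  julian_days_2 = c(127, 130, 132),
  water_1_5_m   = c(0.69, 0.55, 0.60),
  sd_2          = c(0.12, 0.07, 0.07)
)

# A new group starts at each column whose name begins with "julian"
group_id <- cumsum(startsWith(names(wide), "julian"))
pieces <- split.default(wide, group_id)

# Rename each piece to common column names, record the measure name
# as `type`, and drop the padding rows that are all NA
long <- do.call(rbind, lapply(pieces, function(p) {
  out <- data.frame(julian = p[[1]], type = names(p)[2],
                    measure = p[[2]], sd = p[[3]],
                    stringsAsFactors = FALSE)
  out[complete.cases(out), ]
}))
rownames(long) <- NULL
long
```

Here long has five rows: two for sea_ice and three for water_1_5_m.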
It would be nice if you shared your desired output. I think this is what you want:
df %>%
  select(starts_with("julian")) %>%
  gather(key = col, julian) %>%
  bind_cols(df %>%
              select(starts_with("water")) %>%
              gather(col_water, measure)) %>%
  # bind_cols(df %>%
  #             select(starts_with("sea")) %>%
  #             gather(col_sea, measure2)) %>%
  # "sd_" skips the plain `sd` column, which belongs to the omitted "sea" measure
  bind_cols(df %>%
              select(starts_with("sd_")) %>%
              gather(col_sd, sd)) %>%
  select(julian, measure, sd)
julian measure sd
<dbl> <dbl> <dbl>
1 127 0.69 0.122
2 130 0.548 0.0718
3 132 0.596 0.0737
4 134 0.512 0.0377
5 137 0.598 0.0844
6 139 0.488 0.0637
7 141 0.27 0.0540
8 144 0.41 0.0540
9 148 0.568 0.0370
10 151 0.503 0.0572
# ... with 102 more rows
In this attempt I did not include the variables starting with sea, since that would lead to a one-to-many merge. Let me know if I am heading in the right direction, and I can look at including that one.
data.table::melt(
  df,
  measure.vars = patterns("^julian"),
  variable.name = "julian_variable",
  value.name = "julian_value"
) %>%
  data.table::melt(
    measure.vars = patterns(measure = "^sea|^water"),
    variable.name = "measure_variable",
    value.name = "measure_value"
  ) %>%
  data.table::melt(
    measure.vars = patterns(measure = "^sd"),
    variable.name = "sd_variable",
    value.name = "sd_value"
  )
# julian_variable julian_value measure_variable measure_value sd_variable sd_value
# 1: julian_days 127 sea_ice_algae_last_cm 0.6360000 sd 0.02270585
# 2: julian_days 130 sea_ice_algae_last_cm 0.6980000 sd 0.03696846
# 3: julian_days 132 sea_ice_algae_last_cm 0.6666667 sd 0.05338539
# 4: julian_days 134 sea_ice_algae_last_cm 0.6853846 sd 0.05253814
# 5: julian_days 137 sea_ice_algae_last_cm 0.7130000 sd 0.04137901
# ---
# 2796: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2797: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2798: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2799: julian_days_4 NA water_60_m_depth NA sd_5 NA
# 2800: julian_days_4 NA water_60_m_depth NA sd_5 NA
It is unclear what the desired output is, though. This solution leads to a lot of duplication: each individual value is repeated 100 times (4 "julian" columns * 5 "measure" columns * 5 "sd" columns).
I am trying to replicate the success of this solution:
remove columns with NAs from all dataframes in list
or
Remove columns from dataframe where some of values are NA
with a list of dataframes:
m1<- structure(list(vPWMETRO = c(1520L, 1520L, 1520L, 1520L, 1520L),
vPWPUMA00 = c(500L, 900L, 1000L, 1100L, 1200L),
v100 = c(96.1666666666667, 71.4615384615385, 68.6363636363636, 22.5, 64.5),
v101 = c(5, 15, NA, NA, NA),
v102 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)),
.Names = c("vPWMETRO", "vPWPUMA00", "v100", "v101", "v102"),
row.names = 26:30, class = "data.frame")
m2<- structure(list(vPWMETRO = c(6440L, 6440L, 6440L, NA, NA),
vPWPUMA00 = c(1300L,2100L, 2200L, NA, NA),
v100 = c(38.3921568627451, 35, 12.5, NA, NA),
v101 = c(NA, NA, NA, NA, NA),
v102 = c(38.3333333333333, 68, NA, NA, NA)),
.Names = c("vPWMETRO", "vPWPUMA00", "v100", "v101", "v102"),
row.names = c("39", "40", "41", "NA", "NA.1"), class = "data.frame")
#views structure
str(m1)
str(m2)
#creates list
snag<- list(v1520=m1, v6440=m2)
str(snag)
#attempts lapply solution
prob1<- lapply(snag, function(y) y[ ,!is.na(y)])
#2nd attempt, same result on just dataframe:
snag$v6440[ , apply(snag$v6440, 2, function(x) !(is.na(x)))]
The goal is to delete the columns that contain only NAs within each dataframe. Thus the result should be a list of 2 df's:
v1520: vPWPUMA00, v100, v101
v6440: vPWPUMA00, v100, v102
I see the difference in the example problem is that the dimension is 1x11 and my dimensions are 5x5. I am guessing this causes the "undefined column" error but I'm not sure.
Any assistance or advice would be most appreciated.
Regards,
I don't think you were looking at quite the right questions and answers.
See Remove columns from dataframe where ALL values are NA, which appears to be what you want.
You can then modify my answer to give
lapply(snag, Filter, f = function(x){!all(is.na(x))})
$v1520
vPWMETRO vPWPUMA00 v100 v101
26 1520 500 96.16667 5
27 1520 900 71.46154 15
28 1520 1000 68.63636 NA
29 1520 1100 22.50000 NA
30 1520 1200 64.50000 NA
$v6440
vPWMETRO vPWPUMA00 v100 v102
39 6440 1300 38.39216 38.33333
40 6440 2100 35.00000 68.00000
41 6440 2200 12.50000 NA
NA NA NA NA NA
NA.1 NA NA NA NA
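As for why the original attempt fails: is.na() applied to a whole data frame returns a logical matrix with one entry per cell, so it cannot be used directly as a column index, which is probably where the "undefined columns" error comes from. What is needed is one logical value per column. A minimal sketch of that idea with colSums(), on a small made-up list (the data frames below are hypothetical, not m1 and m2):

```r
# Hypothetical miniature of the list of data frames
snag <- list(
  a = data.frame(x = 1:3, y = c(NA, NA, NA), z = c(1, NA, 3)),
  b = data.frame(x = c(NA, NA, NA), y = 4:6, z = c(NA, 2, NA))
)

# colSums(!is.na(d)) counts the non-NA values in each column;
# keep a column only if that count is above zero
cleaned <- lapply(snag, function(d) d[, colSums(!is.na(d)) > 0, drop = FALSE])

lapply(cleaned, names)
```

drop = FALSE keeps the result a data frame even when only one column survives. Columns that are only partly NA, such as z here, are kept.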