How to drop NA variables in a data frame by row in R

Here is my data frame:
structure(list(Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA,
266.7), L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA), C = c(NA,
23.8, NA, 8.47, 20.89, 18.72, 14.94, NA), X = c(40.56, NA, 26.05,
3.08, 23.77, 59.37, NA, NA), W = c(29.47, NA, NA, NA, 36.08,
NA, 27.34, 28.19), S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34,
NA), Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19), H = c(0,
NA, NA, NA, NA, 0, NA, 0)), class = "data.frame", row.names = c(NA,
-8L), .Names = c("Q", "L", "C", "X", "W", "S", "Y", "H"))
Each row has 4 variables that are NA. I want to do the same operations to every row:
Drop the 4 variables that are NA
Calculate diversity for the remaining 4 variables (it's just some computation involving the rest; here I use diversity() from vegan)
Append the output to a new data frame
But the problem is:
How do I drop NA variables using dplyr? I don't know whether select() can do it.
How do I apply operations to every row of a data frame?
It seems that drop_na() removes the entire row for my dataset; any suggestions?

With tidyverse it may be better to gather into 'long' format and then spread it back. Assuming that we have exactly 4 non-NA elements per row: create a row index with rownames_to_column (from tibble), gather (from tidyr) into 'long' format, remove the NA elements, group by row number ('rn'), change the 'key' values to a common set of names, and then spread back to 'wide' format.
library(tibble)
library(tidyr)
library(dplyr)
res <- rownames_to_column(df1, 'rn') %>%
  gather(key, val, -rn) %>%
  filter(!is.na(val)) %>%
  group_by(rn) %>%
  mutate(key = LETTERS[1:4]) %>%
  spread(key, val) %>%
  ungroup %>%
  select(-rn)
res
# A tibble: 8 x 4
# A B C D
#* <dbl> <dbl> <dbl> <dbl>
#1 18.9 40.6 29.5 0
#2 347 23.8 7.47 2.81
#3 167 15.7 26.0 0
#4 163 39.5 8.47 3.08
#5 20.9 23.8 36.1 18.6
#6 18.7 59.4 21.2 0
#7 14.9 27.3 25.3 10.8
#8 267 28.2 12.2 0
library(vegan)
diversity(res)
# 1 2 3 4 5 6 7 8
#1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572
Regarding the diversity calculation, we can replace NA with 0 and apply it to the whole dataset, i.e.
library(vegan)
diversity(replace(df1, is.na(df1), 0))
#[1] 1.0533711 0.3718959 0.6331070 0.7090783
#[5] 1.3517680 0.9516232 1.3215712 0.4697572
as we get the same output as in the first solution
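As a cross-check (a minimal base R sketch, assuming df1 is the data frame from the question and vegan is installed), you can also apply diversity() row by row on just the non-NA values:

```r
library(vegan)

# Shannon diversity per row, computed only on the non-NA entries of each row
apply(df1, 1, function(x) diversity(x[!is.na(x)]))
```

Both approaches agree because zero abundances contribute nothing to the Shannon index, which is why replacing NA with 0 gives the same result as dropping the NAs.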

Related

Specify which column(s) a specific date appears in R

I have a subset of my data in a dataframe (dput codeblock below) containing dates on which a storm occurred ("Date_AR"). I'd like to know if a storm occurred in the north, south or both, by determining whether the same date occurs in the "Date_N" and/or "Date_S" columns.
For example, the first date is Jan 17, 1989 in the "Date_AR" column. In the location column, I would like "S" to be printed, since this date is found in the "Date_S" column. If Apr 5, 1989 occurs in both "Date_N" and "Date_S", then I would like a "B" (for both) to be printed in the location column.
Thanks in advance for the help! Apologies if this type of question is already out there; I may not know the keywords to search.
structure(list(Date_S = structure(c(6956, 6957, 6970, 7008, 7034,
7035, 7036, 7172, 7223, 7224, 7233, 7247, 7253, 7254, 7255, 7262, 7263, 7266, 7275,
7276), class = "Date"),
Date_N = structure(c(6968, 6969, 7035, 7049, 7103, 7172, 7221, 7223, 7230, 7246, 7247,
7251, 7252, 7253, 7262, 7266, 7275, 7276, 7277, 7280), class = "Date"),
Date_AR = structure(c(6956, 6957, 6968, 6969, 6970, 7008,
7034, 7035, 7036, 7049, 7103, 7172, 7221, 7223, 7224, 7230,
7233, 7246, 7247, 7251), class = "Date"), Precip = c(23.6,
15.4, 3, 16.8, 0.2, 3.6, 22, 13.4, 0, 30.8, 4.6, 27.1, 0,
19, 2.8, 11.4, 2, 57.6, 9.4, 39), Location = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 20L), class = "data.frame")
Using dplyr::case_when you could do:
library(dplyr)
dat |>
  mutate(Location = case_when(
    Date_AR %in% Date_S & Date_AR %in% Date_N ~ "B",
    Date_AR %in% Date_S ~ "S",
    Date_AR %in% Date_N ~ "N"
  ))
#> Date_S Date_N Date_AR Precip Location
#> 1 1989-01-17 1989-01-29 1989-01-17 23.6 S
#> 2 1989-01-18 1989-01-30 1989-01-18 15.4 S
#> 3 1989-01-31 1989-04-06 1989-01-29 3.0 N
#> 4 1989-03-10 1989-04-20 1989-01-30 16.8 N
#> 5 1989-04-05 1989-06-13 1989-01-31 0.2 S
#> 6 1989-04-06 1989-08-21 1989-03-10 3.6 S
#> 7 1989-04-07 1989-10-09 1989-04-05 22.0 S
#> 8 1989-08-21 1989-10-11 1989-04-06 13.4 B
#> 9 1989-10-11 1989-10-18 1989-04-07 0.0 S
#> 10 1989-10-12 1989-11-03 1989-04-20 30.8 N
#> 11 1989-10-21 1989-11-04 1989-06-13 4.6 N
#> 12 1989-11-04 1989-11-08 1989-08-21 27.1 B
#> 13 1989-11-10 1989-11-09 1989-10-09 0.0 N
#> 14 1989-11-11 1989-11-10 1989-10-11 19.0 B
#> 15 1989-11-12 1989-11-19 1989-10-12 2.8 S
#> 16 1989-11-19 1989-11-23 1989-10-18 11.4 N
#> 17 1989-11-20 1989-12-02 1989-10-21 2.0 S
#> 18 1989-11-23 1989-12-03 1989-11-03 57.6 N
#> 19 1989-12-02 1989-12-04 1989-11-04 9.4 B
#> 20 1989-12-03 1989-12-07 1989-11-08 39.0 N

Filter dataframe when all columns are NA in `dplyr`

This is surely a simple question (if one knows the answer) but I still couldn't find guidance on SO: I have a dataframe with lots of rows that only have NA across all columns (after a lead operation). I want to remove those rows:
df <- structure(list(line = c("0001", NA, "0002", NA, "0003", NA, "0004",
NA, "0005", NA),
speaker = c(NA, NA, "ID16.C-U", NA, NA, NA, "ID16.B-U", NA, NA, NA),
utterance = c("7.060", NA, " ah-ha,", NA, "0.304", NA, " °°yes°°", NA, "7.740", NA),
timestamp = c(NA, "00:00:00.000 - 00:00:07.060", NA, "00:00:07.060 - 00:00:07.660", NA,
"00:00:07.660 - 00:00:07.964", NA, "00:00:07.964 - 00:00:08.610", NA,
"00:00:08.610 - 00:00:16.350")), row.names = c(NA, 10L), class = "data.frame")
But neither this:
df %>%
  mutate(timestamp = lead(timestamp)) %>%
  filter(across(everything(), ~!is.na(.)))
nor this works:
df %>%
  mutate(timestamp = lead(timestamp)) %>%
  rowwise() %>%
  filter(c_across(everything(), ~!is.na(.)))
What's the solution?
Expected:
line speaker utterance timestamp
1 0001 <NA> 7.060 00:00:00.000 - 00:00:07.060
3 0002 ID16.C-U ah-ha, 00:00:07.060 - 00:00:07.660
5 0003 <NA> 0.304 00:00:07.660 - 00:00:07.964
7 0004 ID16.B-U °°yes°° 00:00:07.964 - 00:00:08.610
9 0005 <NA> 7.740 00:00:08.610 - 00:00:16.350
dplyr (1.0.4 and later) has the functions if_all() and if_any() to handle cases like these:
library(dplyr, warn.conflicts = FALSE)
df %>%
  mutate(timestamp = lead(timestamp)) %>%
  filter(!if_all(everything(), is.na))
#> line speaker utterance timestamp
#> 1 0001 <NA> 7.060 00:00:00.000 - 00:00:07.060
#> 2 0002 ID16.C-U ah-ha, 00:00:07.060 - 00:00:07.660
#> 3 0003 <NA> 0.304 00:00:07.660 - 00:00:07.964
#> 4 0004 ID16.B-U °°yes°° 00:00:07.964 - 00:00:08.610
#> 5 0005 <NA> 7.740 00:00:08.610 - 00:00:16.350
Will this work?
df <- df %>% mutate(timestamp = lead(timestamp))
df[rowSums(is.na(df))!=ncol(df),]
pseudo-tidyverse version:
df %>%
  dplyr::mutate(timestamp = dplyr::lead(timestamp)) %>%
  dplyr::filter(rowSums(is.na(.)) != ncol(.))
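An equivalent positive formulation (a sketch using the same if_any() helper from dplyr 1.0.4+): keep a row as soon as any column is non-NA, rather than dropping rows where all columns are NA.

```r
library(dplyr, warn.conflicts = FALSE)

# Keep rows where at least one column is not NA;
# equivalent to filter(!if_all(everything(), is.na))
df %>%
  mutate(timestamp = lead(timestamp)) %>%
  filter(if_any(everything(), ~ !is.na(.x)))
```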

Replacing NA with value from previous row or mutate with vector recycling in R [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 1 year ago.
Hey :) I am currently trying to clean up some data and I am struggling to find an easy solution for this.
This is my dataset:
structure(list(sample = c(1, NA, NA, 2, NA, NA, 3, NA, NA, 4,
NA, NA, 5, NA, NA, 6, NA, NA, 7, NA, NA, 8, NA, NA, 9, NA, NA,
10, NA, NA, 11, NA, NA, 12, NA, NA, 13, NA, NA, 14, NA, NA, 15,
NA, NA, 16, NA, NA, 17, NA, NA, 18, NA, NA, 19, NA, NA, 20, NA,
NA), well = c("C1", "C3", "C5", "D1", "D3", "D5", "E1", "E3",
"E5", "F1", "F3", "F5", "C7", "C9", "C11", "D7", "D9", "D11",
"E7", "E9", "E11", "F7", "F9", "F11", "C13", "C15", "C17", "D13",
"D15", "D17", "E13", "E15", "E17", "F13", "F15", "F17", "C19",
"C21", "C23", "D19", "D21", "D23", "E19", "E21", "E23", "F19",
"F21", "F23", "G1", "G3", "G5", "H1", "H3", "H5", "I1", "I3",
"I5", "J1", "J3", "J5"), interp_conc = c(456582, 299611, 338462,
449737, 395905, 546031, 511817, 473617, 455924, 408370, 461656,
429297, 277609, 264949, 404073, 353142, 277509, 246494, 122663,
163873, 169455, 188879, 192751, 255511, 185383, 205396, 187415,
1897500, 1988346, 1854167, 365514, 295724, 262695, 270446, 241531,
209386, 223774, 255885, 181214, 420567, 482818, 443318, 262886,
220969, 283763, 229457, 261859, 202067, 226157, 177300, 215454,
481414, 586233, 383855, 218949, 226852, 244989, 192648, 228195,
201096)), row.names = c(NA, -60L), class = c("tbl_df", "tbl",
"data.frame"))
It's data from an experiment done in triplicate: the first three rows are sample 1, the next three rows are sample 2, and so on.
So basically what I need is a function that, whenever it finds an NA, takes the value from the row above. Is there something like this in R? I was not able to find one.
What I tried to do instead was to just add another column - "condition" - using the mutate function. Since the experiment I did was performed five times, I was hoping that the vector would just be recycled. This was my try:
temp %>% mutate(condition = c("UT", "UT", "UT",
                              "Stimuli", "Stimuli", "Stimuli",
                              "Inhib1", "Inhib1", "Inhib1",
                              "Inhib2", "Inhib2", "Inhib2"))
But since dplyr::mutate() does not seem to recycle vectors (other than those of length 1), this did not work either.
Going with this second approach would have the advantage that it directly adds crucial information that I would otherwise have to add in a second step. My original idea was to first solve the sample column issue and then, using if statements, add the experimental condition...
Does anyone have any idea how I could solve this problem?
Assuming that the non-NA entries don't decrease (as in your example), you could do
cummax(ifelse(is.na(x), 0, x)), where x is the vector you want to transform in this way (looks like temp$sample in what you have provided).
The logic: cummax(), the cumulative max function, returns the largest number encountered sequentially in a vector. However, it doesn't handle NA values well; this is what the ifelse() call is for. We use ifelse() to replace each NA with 0, then use cummax() to extract the largest value previously encountered.
Example:
x <- c(1, NA, NA, 2, NA, NA, NA, 3, NA, 4)
cummax(ifelse(is.na(x), 0, x))
## [1] 1 1 1 2 2 2 2 3 3 4
You can use either of these solutions as specified in the comments:
library(dplyr)
library(zoo)
df %>%
  mutate(across(sample, ~ na.locf(.x)))
# A tibble: 60 x 3
sample well interp_conc
<dbl> <chr> <dbl>
1 1 C1 456582
2 1 C3 299611
3 1 C5 338462
4 2 D1 449737
5 2 D3 395905
6 2 D5 546031
7 3 E1 511817
8 3 E3 473617
9 3 E5 455924
10 4 F1 408370
# ... with 50 more rows
Or
library(tidyr)
df %>%
  fill(sample, .direction = "down")
# A tibble: 60 x 3
sample well interp_conc
<dbl> <chr> <dbl>
1 1 C1 456582
2 1 C3 299611
3 1 C5 338462
4 2 D1 449737
5 2 D3 395905
6 2 D5 546031
7 3 E1 511817
8 3 E3 473617
9 3 E5 455924
10 4 F1 408370
# ... with 50 more rows
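For the second part of the question (the condition column), mutate() will not recycle a length-12 vector to 60 rows, but you can do the recycling yourself with rep(). A sketch, assuming the triplicate/condition layout from the question repeats exactly across all five experiments:

```r
library(dplyr)

# One label per triplicate: UT, Stimuli, Inhib1, Inhib2
conditions <- rep(c("UT", "Stimuli", "Inhib1", "Inhib2"), each = 3)

# length.out = n() recycles the 12 labels to the full 60 rows
df %>%
  mutate(condition = rep(conditions, length.out = n()))
```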

Determine range of time where measurements are not NA

I have a dataset with hundreds of thousands of measurements taken from several subjects. However, the measurements are only partially available, i.e., there may be large stretches with NA. I need to establish up front, for which timespan positive data are available for each subject.
Data:
df
timestamp C B A starttime_ms
1 00:00:00.033 NA NA NA 33
2 00:00:00.064 NA NA NA 64
3 00:00:00.066 NA 0.346 NA 66
4 00:00:00.080 47.876 0.346 22.231 80
5 00:00:00.097 47.876 0.346 22.231 97
6 00:00:00.099 47.876 0.346 NA 99
7 00:00:00.114 47.876 0.346 NA 114
8 00:00:00.130 47.876 0.346 NA 130
9 00:00:00.133 NA 0.346 NA 133
10 00:00:00.147 NA 0.346 NA 147
My (humble) solution so far is to pick out the timestamp values that are not NA and to select the first and last such timestamp for each subject individually. Here's the code for subject C:
NotNA_C <- df$timestamp[which(!is.na(df$C))]
range_C <- paste(NotNA_C[1], NotNA_C[length(NotNA_C)], sep = " - ")
range_C
[1] "00:00:00.080 - 00:00:00.130"
That doesn't look elegant and, what's more, it needs to be repeated for all other subjects. Is there a more efficient way to establish the range of time for which non-NA values are available for all subjects in one go?
EDIT
I've found a base R solution:
sapply(df[, 2:4], function(x)
  paste(df$timestamp[which(!is.na(x))][1],
        df$timestamp[which(!is.na(x))][length(df$timestamp[which(!is.na(x))])],
        sep = " - "))
C B A
"00:00:00.080 - 00:00:00.130" "00:00:00.066 - 00:00:00.147" "00:00:00.080 - 00:00:00.097"
but would be interested in other solutions as well!
Reproducible data:
df <- structure(list(timestamp = c("00:00:00.033", "00:00:00.064",
"00:00:00.066", "00:00:00.080", "00:00:00.097", "00:00:00.099",
"00:00:00.114", "00:00:00.130", "00:00:00.133", "00:00:00.147"
), C = c(NA, NA, NA, 47.876, 47.876, 47.876, 47.876, 47.876,
NA, NA), B = c(NA, NA, 0.346, 0.346, 0.346, 0.346,
0.346, 0.346, 0.346, 0.346), A = c(NA, NA, NA, 22.231, 22.231, NA, NA, NA, NA,
NA), starttime_ms = c(33, 64, 66, 80, 97, 99, 114, 130, 133,
147)), row.names = c(NA, 10L), class = "data.frame")
dplyr solution
library(tidyverse)
df %>%
  pivot_longer(-c(timestamp, starttime_ms)) %>%
  group_by(name) %>%
  drop_na() %>%
  summarise(min = min(timestamp),
            max = max(timestamp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#> name min max
#> <chr> <chr> <chr>
#> 1 A 00:00:00.080 00:00:00.097
#> 2 B 00:00:00.066 00:00:00.147
#> 3 C 00:00:00.080 00:00:00.130
Created on 2021-02-15 by the reprex package (v0.3.0)
You could look at the cumsum of differences where there's no NA, coerce them to logical and subset first and last element.
lapply(data.frame(apply(rbind(0, diff(!sapply(df[c("C", "B", "A")], is.na))), 2, cumsum)),
       function(x) c(df$timestamp[as.logical(x)][1], rev(df$timestamp[as.logical(x)])[1]))
# $C
# [1] "00:00:00.080" "00:00:00.130"
#
# $B
# [1] "00:00:00.066" "00:00:00.147"
#
# $A
# [1] "00:00:00.080" "00:00:00.097"
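A simpler base R variant of the same idea (a sketch, assuming df is the reproducible data above): index the timestamps with the first and last non-NA position in each subject column.

```r
# First and last timestamp with a non-NA measurement, per subject column
sapply(df[c("C", "B", "A")], function(x) {
  idx <- range(which(!is.na(x)))          # positions of first and last non-NA
  paste(df$timestamp[idx], collapse = " - ")
})
```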

Computing Growth Rates

I am working on a dataset for a welfare wage subsidy program, where wages per worker are structured as follows:
df <- structure(list(wage_1990 = c(13451.67, 45000, 10301.67, NA, NA,
8726.67, 11952.5, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA,
NA, 9881.67, 5483.33, 12868.33, 9321.67), wage_1991 = c(13451.67,
45000, 10301.67, NA, NA, 8750, 11952.5, NA, NA, 7140, NA, NA,
10301.67, 7303.33, NA, NA, 9881.67, 5483.33, 12868.33, 9321.67
), wage_1992 = c(13451.67, 49500, 10301.67, NA, NA, 8750, 11952.5,
NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67, NA,
12868.33, 9321.67), wage_1993 = c(NA, NA, 10301.67, NA, NA, 8750,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1994 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1995 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7140, NA, NA, 10301.67, 7303.33, NA, NA, 9881.67,
NA, NA, 9321.67), wage_1996 = c(NA, NA, 10301.67, NA, NA, 8948.33,
11958.33, NA, NA, 7291.67, NA, NA, 10301.67, 7303.33, NA, NA,
9881.67, NA, NA, 9321.67)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L))
I have tried one proposed solution, which is running this code after the one above:
average_growth_rate <- apply(df, 1, function(x) {
  x1 <- x[!is.na(x)]
  mean(x1[-1] / x1[-length(x1)] - 1)
})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate), ]
But I keep getting this error:
Error in dim(X) <- c(n, length(X)/n) : dims [product 60000] do not match the length of object [65051]
I want to create a variable showing the annual growth rate of the wage for each worker (or lack thereof).
The practical issue that I am facing is that each observation is in one row and while the first worker joined the program in 1990, others might have joined in say 1993 or 1992. Therefore, is there a way to apply the growth rate for each worker depending on the specific years they worked, rather than applying a general growth formula for all observations?
My expected output for each row would be a new column:
row average wage growth rate
1 15%
2 9%
3 12%
After running the following code to see descriptive statistics of my variable of interest:
skim(df$average_growth_rate)
I get the following result:
"Variable contains Inf or -Inf value(s) that were converted to NA.── Data Summary ────────────────────────
Values
Name gosi_beneficiary_growth$a...
Number of rows 3671
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 data 1348 0.633 Inf Inf -1 -0.450 0 0.0568
"
I am not sure why my mean and standard deviation values are Inf.
Here is one approach:
library(tidyverse)
growth <- df %>%
  rowid_to_column() %>%
  gather(key, value, -rowid) %>%
  drop_na() %>%
  arrange(rowid, key) %>%
  group_by(rowid) %>%
  mutate(yoy = value / lag(value) - 1) %>%
  summarise(average_growth_rate = mean(yoy, na.rm = TRUE))
# A tibble: 12 x 2
rowid average_growth_rate
<int> <dbl>
1 1 0
2 2 0.05
3 3 0
4 6 0.00422
5 7 0.0000813
6 10 0.00354
7 13 0
8 14 0
9 17 0
10 18 0
11 19 0
12 20 0
And just to highlight that all these 0s are expected, here the dataframe:
> head(df)
# A tibble: 6 x 7
wage_1990 wage_1991 wage_1992 wage_1993 wage_1994 wage_1995 wage_1996
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 13452. 13452. 13452. NA NA NA NA
2 45000 45000 49500 NA NA NA NA
3 10302. 10302. 10302. 10302. 10302. 10302. 10302.
4 NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA
6 8727. 8750 8750 8750 8948. 8948. 8948.
where you see that, e.g., for the first row there was no growth nor any decline. In the second row there was a slight increase between the second and the third year, but no change between the first and second. In the third row, again, absolutely no change. Etc...
Also, finally, to add these results to the initial dataframe, you would do e.g.
df %>%
  rowid_to_column() %>%
  left_join(growth)
And just to answer the performance question, here a benchmark (where I changed akrun's data.frame call to a tibble call to make sure there is no difference coming from this). All functions below correspond to creating the growth rates, not merging back to the original dataframe.
library(microbenchmark)
microbenchmark(cj(), akrun(), akrun2())
Unit: microseconds
expr min lq mean median uq max neval cld
cj() 5577.301 5820.501 6122.076 5988.551 6244.301 10646.9 100 c
akrun() 998.301 1097.252 1559.144 1160.450 1212.552 28704.5 100 a
akrun2() 2033.801 2157.101 2653.018 2258.052 2340.702 34143.0 100 b
base R is the clear winner in terms of performance.
We can use base R with apply. Loop over the rows with MARGIN = 1, remove the NA elements ('x1'), and take the mean of the ratio of each element to the previous one, minus 1:
average_growth_rate <- apply(df, 1, function(x) {
  x1 <- x[!is.na(x)]
  mean(x1[-1] / x1[-length(x1)] - 1)
})
out <- data.frame(rowid = seq_len(nrow(df)), average_growth_rate)
out[!is.na(out$average_growth_rate), ]
# rowid average_growth_rate
#1 1 0.00000000000
#2 2 0.05000000000
#3 3 0.00000000000
#6 6 0.00422328325
#7 7 0.00008129401
#10 10 0.00354038282
#13 13 0.00000000000
#14 14 0.00000000000
#17 17 0.00000000000
#18 18 0.00000000000
#19 19 0.00000000000
#20 20 0.00000000000
Or using tapply/stack (dividing each wage by the previous one, as in the apply solution):
na.omit(stack(tapply(as.matrix(df), row(df), FUN = function(x)
  mean(tail(na.omit(x), -1) / head(na.omit(x), -1) - 1))))[2:1]
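As for why skim() reports an Inf mean and sd (an assumption, since only 20 of the 3671 rows are shown): if any worker's wage is 0 in one year and positive the next, the year-over-year ratio is Inf, and one Inf makes the row mean Inf. A minimal sketch:

```r
# A hypothetical wage path containing a zero
x1 <- c(0, 19000)
x1[-1] / x1[-length(x1)] - 1   # 19000 / 0 - 1 = Inf

# Conversely, a wage dropping to 0 gives a growth rate of -1,
# which matches the p0 = -1 seen in the skim() output.
```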
