Take the first value of a group of columns in R

I have some data:
data <- structure(list(WBC_BASELINE = c(2.9, NA, NA, 6.9, NA, NA, NA,
NA, NA, NA, 7.4, 12.8, NA, NA, NA, NA, NA, 4.2, NA, NA), WBC_FIRST = c(2.4,
14.8, 11, 7.3, 4.5, NA, NA, 6.1, 7.7, 16.2, 5.3, 10.3, 14.5,
NA, NA, 12.8, 3.7, 4.7, 16.6, 9.3), neuts_BASELINE = c(2, NA,
NA, 5.4, NA, NA, NA, NA, NA, NA, 4.96, 8.9, NA, NA, NA, NA, NA,
NA, NA, NA), neuts_FIRST = c(1.5, 13, 5.8, 4.5, 1.6, NA, NA,
1.7, 4.3, 9.3, 3.4, 5.8, 10.1, NA, NA, 9.7, 2.3, 3.5, 5, 8.2)), row.names = c(NA,
20L), class = "data.frame")
In the dataset I have some blood test results (in this case WBC and neuts taken at 2 time points - baseline, and first). I want to select the baseline value if it exists, else take the first value.
I can do this separately for WBC and neuts, but I want to do it for 20 different blood tests without hard coding it each time...
Hard coding way:
data %>% mutate(WBC_first_value=ifelse(!is.na(WBC_BASELINE), WBC_BASELINE, WBC_FIRST)) %>%
mutate(neuts_first_value=ifelse(!is.na(neuts_BASELINE), neuts_BASELINE, neuts_FIRST))
Please note that each blood test is always followed by _BASELINE and _FIRST
I'd be grateful for any help please!

We could automate this process with some data wrangling using pivot_longer and pivot_wider in combination:
library(dplyr)
library(tidyr)
data %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn, names_to = c('grp', '.value'),
               names_sep = "_") %>%
  group_by(grp) %>%
  transmute(rn, new = coalesce(BASELINE, FIRST)) %>%
  pivot_wider(names_from = grp, values_from = new) %>%
  select(-rn) %>%
  bind_cols(data, .)
output:
WBC_BASELINE WBC_FIRST neuts_BASELINE neuts_FIRST WBC neuts
1 2.9 2.4 2.00 1.5 2.9 2.00
2 NA 14.8 NA 13.0 14.8 13.00
3 NA 11.0 NA 5.8 11.0 5.80
4 6.9 7.3 5.40 4.5 6.9 5.40
5 NA 4.5 NA 1.6 4.5 1.60
6 NA NA NA NA NA NA
7 NA NA NA NA NA NA
8 NA 6.1 NA 1.7 6.1 1.70
9 NA 7.7 NA 4.3 7.7 4.30
10 NA 16.2 NA 9.3 16.2 9.30
11 7.4 5.3 4.96 3.4 7.4 4.96
12 12.8 10.3 8.90 5.8 12.8 8.90
13 NA 14.5 NA 10.1 14.5 10.10
14 NA NA NA NA NA NA
15 NA NA NA NA NA NA
16 NA 12.8 NA 9.7 12.8 9.70
17 NA 3.7 NA 2.3 3.7 2.30
18 4.2 4.7 NA 3.5 4.2 3.50
19 NA 16.6 NA 5.0 16.6 5.00
20 NA 9.3 NA 8.2 9.3 8.20

You could do this with a loop!
vars <- c("WBC", "neuts")
for (v in vars) {
  data[, paste0(v, "_new")] <- ifelse(!is.na(data[, paste0(v, "_BASELINE")]),
                                      data[, paste0(v, "_BASELINE")],
                                      data[, paste0(v, "_FIRST")])
}
Or with sapply:
sapply(vars, function(v) ifelse(!is.na(data[, paste0(v, "_BASELINE")]),
                                data[, paste0(v, "_BASELINE")],
                                data[, paste0(v, "_FIRST")]))
You can also define vars programmatically:
vars <- unique(gsub(pattern = "^([A-Za-z]+)_[A-Za-z]+", "\\1", names(data)))
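The loop above can be written a little more compactly with dplyr::coalesce(), which picks the first non-NA value elementwise. A minimal sketch, using a small made-up data frame in the same shape as the question's data (the values here are illustrative, not the original dataset):

```r
library(dplyr)

# Small illustrative data frame in the same shape as the question's data
data <- data.frame(
  WBC_BASELINE   = c(2.9, NA, NA),
  WBC_FIRST      = c(2.4, 14.8, NA),
  neuts_BASELINE = c(2, NA, NA),
  neuts_FIRST    = c(1.5, 13, NA)
)

# Derive the test names from the _BASELINE / _FIRST suffix convention
tests <- unique(sub("_(BASELINE|FIRST)$", "", names(data)))

# coalesce() returns the BASELINE value where present, else the FIRST value
for (v in tests) {
  data[[paste0(v, "_first_value")]] <-
    coalesce(data[[paste0(v, "_BASELINE")]], data[[paste0(v, "_FIRST")]])
}
```

This relies on the guarantee stated in the question that every test name is always followed by exactly the _BASELINE and _FIRST suffixes.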


Specify which column(s) a specific date appears in R

I have a subset of my data in a dataframe (dput codeblock below) containing dates in which a storm occurred ("Date_AR"). I'd like to know if a storm occurred in the north, south or both, by determining whether the same date occurred in the "Date_N" and/or "Date_S" column/s.
For example, the first date is Jan 17, 1989 in the "Date_AR" column. In the location column, I would like "S" to be printed, since this date is found in the "Date_S" column. If Apr 5, 1989 occurs in both "Date_N" and "Date_S", then I would like a "B" (for both) to be printed in the location column.
Thanks in advance for the help! Apologies if this type of question is already out there. I may not know the keywords to search.
structure(list(Date_S = structure(c(6956, 6957, 6970, 7008, 7034,
7035, 7036, 7172, 7223, 7224, 7233, 7247, 7253, 7254, 7255, 7262, 7263, 7266, 7275,
7276), class = "Date"),
Date_N = structure(c(6968, 6969, 7035, 7049, 7103, 7172, 7221, 7223, 7230, 7246, 7247,
7251, 7252, 7253, 7262, 7266, 7275, 7276, 7277, 7280), class = "Date"),
Date_AR = structure(c(6956, 6957, 6968, 6969, 6970, 7008,
7034, 7035, 7036, 7049, 7103, 7172, 7221, 7223, 7224, 7230,
7233, 7246, 7247, 7251), class = "Date"), Precip = c(23.6,
15.4, 3, 16.8, 0.2, 3.6, 22, 13.4, 0, 30.8, 4.6, 27.1, 0,
19, 2.8, 11.4, 2, 57.6, 9.4, 39), Location = c(NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA)), row.names = c(NA, 20L), class = "data.frame")
Using dplyr::case_when you could do:
library(dplyr)
dat |>
  mutate(Location = case_when(
    Date_AR %in% Date_S & Date_AR %in% Date_N ~ "B",
    Date_AR %in% Date_S ~ "S",
    Date_AR %in% Date_N ~ "N"
  ))
#> Date_S Date_N Date_AR Precip Location
#> 1 1989-01-17 1989-01-29 1989-01-17 23.6 S
#> 2 1989-01-18 1989-01-30 1989-01-18 15.4 S
#> 3 1989-01-31 1989-04-06 1989-01-29 3.0 N
#> 4 1989-03-10 1989-04-20 1989-01-30 16.8 N
#> 5 1989-04-05 1989-06-13 1989-01-31 0.2 S
#> 6 1989-04-06 1989-08-21 1989-03-10 3.6 S
#> 7 1989-04-07 1989-10-09 1989-04-05 22.0 S
#> 8 1989-08-21 1989-10-11 1989-04-06 13.4 B
#> 9 1989-10-11 1989-10-18 1989-04-07 0.0 S
#> 10 1989-10-12 1989-11-03 1989-04-20 30.8 N
#> 11 1989-10-21 1989-11-04 1989-06-13 4.6 N
#> 12 1989-11-04 1989-11-08 1989-08-21 27.1 B
#> 13 1989-11-10 1989-11-09 1989-10-09 0.0 N
#> 14 1989-11-11 1989-11-10 1989-10-11 19.0 B
#> 15 1989-11-12 1989-11-19 1989-10-12 2.8 S
#> 16 1989-11-19 1989-11-23 1989-10-18 11.4 N
#> 17 1989-11-20 1989-12-02 1989-10-21 2.0 S
#> 18 1989-11-23 1989-12-03 1989-11-03 57.6 N
#> 19 1989-12-02 1989-12-04 1989-11-04 9.4 B
#> 20 1989-12-03 1989-12-07 1989-11-08 39.0 N
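For completeness, the same lookup logic can be expressed in base R with nested ifelse(). A sketch on a tiny invented example (the dates below are made up for illustration, not the question's data):

```r
# Tiny illustrative example: two storm dates, one southern-only, one in both
dat <- data.frame(
  Date_S  = as.Date(c("1989-01-17", "1989-04-05")),
  Date_N  = as.Date(c("1989-04-05", "1989-01-29")),
  Date_AR = as.Date(c("1989-01-17", "1989-04-05"))
)

# "B" if the date appears in both columns, else "S" or "N", else NA
dat$Location <- with(dat, ifelse(Date_AR %in% Date_S & Date_AR %in% Date_N, "B",
                          ifelse(Date_AR %in% Date_S, "S",
                          ifelse(Date_AR %in% Date_N, "N", NA))))
```

case_when() reads more clearly because the conditions are checked in order without nesting, but the two are equivalent here.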

Split a dataframe by repeated rows in R? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed last year.
I am looking for a way to transform a data.frame with thousands of rows like this:
code date value uname tcode
<chr> <date> <dbl> <ord> <int>
1 CODE1 1968-02-01 14.1 "" NA
2 CODE1 1968-03-01 9.50 "" NA
3 CODE1 1968-04-01 22.1 "" NA
4 CODE2 1968-02-01 15.1 "" NA
5 CODE2 1968-03-01 13.50 "" NA
6 CODE2 1968-04-01 23.1 "" NA
7 CODE3 1968-02-01 16.1 "" NA
8 CODE3 1968-03-01 15.50 "" NA
9 CODE3 1968-04-01 13.1 "" NA
Into something like:
date CODE1 CODE2 CODE3
<date> <dbl> <dbl> <dbl>
1 1968-02-01 14.1 15.1 16.1
2 1968-03-01 9.50 13.50 15.50
3 1968-04-01 22.1 23.1 13.1
This seems straightforward but I am having difficulty realizing this task. Thanks!
With tidyverse you can use pivot_wider
library(dplyr)
library(tidyr)
df %>% select(-c(uname,tcode)) %>% pivot_wider(names_from="code")
# A tibble: 3 x 4
date CODE1 CODE2 CODE3
<chr> <dbl> <dbl> <dbl>
1 1968-02-01 14.1 15.1 16.1
2 1968-03-01 9.5 13.5 15.5
3 1968-04-01 22.1 23.1 13.1
Data
df <- structure(list(code = c("CODE1", "CODE1", "CODE1", "CODE2", "CODE2",
"CODE2", "CODE3", "CODE3", "CODE3"), date = c("1968-02-01", "1968-03-01",
"1968-04-01", "1968-02-01", "1968-03-01", "1968-04-01", "1968-02-01",
"1968-03-01", "1968-04-01"), value = c(14.1, 9.5, 22.1, 15.1,
13.5, 23.1, 16.1, 15.5, 13.1), uname = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA), tcode = c(NA, NA, NA, NA, NA, NA, NA, NA, NA
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))

Error when running cor.test on a few columns with apply in R

I want to calculate the correlation between 'y' column and each column in 'col_df' dataframe.
For each calculation I want to save only the columns name with significant p_value (p_value<0.05).
y is a 64x1 vector of 0s and 1s.
Example of col_df (60x12000):
a b c d e
7.6 4.9 8.9 6.0 4.2
25.0 6.5 4.6 13.2 3.0
col_df <- as.matrix(df)
test <- col_df[, apply(col_df, MARGIN = 2, FUN = function(x)
(cor.test(y, col_df[,x], method = "pearson")$p.value <0.05))]
This is the error:
Error in col_df[, x] : subscript out of bounds
Is this the way to do that?
This is a working solution:
df <- structure(list(a = c(7.6, 7.6, 25, 25, 25, 25, 7.6, 7.6, 7.6, 25),
b = c(4.9, 4.9, 6.5, 6.5, 4.9, 6.5, 4.9, 4.9, 6.5, 6.5),
c = c(8.9, 4.6, 8.9, 8.9, 8.9, 4.6, 4.6, 8.9, 8.9, 4.6),
d = c(13.2, 13.2, 6, 6, 6, 6, 6, 13.2, 13.2, 13.2),
e = c(3, 4.2, 3, 4.2, 3, 3, 3, 4.2, 4.2, 4.2)),
class = "data.frame", row.names = c(NA, -10L))
y <- c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)
test <- df[, apply(df, MARGIN = 2, FUN = function(x)
(cor.test(y, x, method = "pearson")$p.value < 0.05))]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
The difference from your solution is that apply() gives you the column itself as x, not an index. Hence, all you have to do is replace col_df[,x] in your solution with just x.
You can simplify it a little with sapply(). I also recommend not putting everything on a single line: it is hard to read and harder to debug.
Columns <- sapply(df, FUN = function(x) (cor.test(y, x, method = "pearson")$p.value < 0.05))
test <- df[, Columns]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
Created on 2020-07-22 by the reprex package (v0.3.0)
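Since the per-column test returns a single logical, vapply() makes that expected type explicit and fails loudly if cor.test() ever returns something unexpected. A minor hardening of the sapply() version, sketched on the same example df and y as above:

```r
# Same example data as in the answer (first three columns)
df <- data.frame(
  a = c(7.6, 7.6, 25, 25, 25, 25, 7.6, 7.6, 7.6, 25),
  b = c(4.9, 4.9, 6.5, 6.5, 4.9, 6.5, 4.9, 4.9, 6.5, 6.5),
  c = c(8.9, 4.6, 8.9, 8.9, 8.9, 4.6, 4.6, 8.9, 8.9, 4.6)
)
y <- c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)

# FUN.VALUE = logical(1) guarantees each result is exactly one TRUE/FALSE
keep <- vapply(df, function(x) cor.test(y, x, method = "pearson")$p.value < 0.05,
               FUN.VALUE = logical(1))

# drop = FALSE keeps a data frame even if only one column survives
test <- df[, keep, drop = FALSE]
```

With 12000 columns this behaves the same as sapply(), just with a type guarantee; drop = FALSE avoids the classic surprise of a single surviving column collapsing to a vector.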

How to drop NA variables in a data frame by row

Here is my data frame:
structure(list(Q = c(NA, 346.86, 166.95, 162.57, NA, NA, NA,
266.7), L = c(18.93, NA, 15.72, 39.51, NA, NA, NA, NA), C = c(NA,
23.8, NA, 8.47, 20.89, 18.72, 14.94, NA), X = c(40.56, NA, 26.05,
3.08, 23.77, 59.37, NA, NA), W = c(29.47, NA, NA, NA, 36.08,
NA, 27.34, 28.19), S = c(NA, 7.47, NA, NA, 18.64, NA, 25.34,
NA), Y = c(NA, 2.81, 0, NA, NA, 21.18, 10.83, 12.19), H = c(0,
NA, NA, NA, NA, 0, NA, 0)), class = "data.frame", row.names = c(NA,
-8L), .Names = c("Q", "L", "C", "X", "W", "S", "Y", "H"))
Each row has 4 variables that are NAs; now I want to do the same operations to every row:
Drop those 4 variables that are NAs
Calculate diversity for the remaining 4 variables (it's just some computation involving the rest; here I use diversity() from vegan)
Append the output to a new data frame
But the problem is:
How do I drop NA variables using dplyr? I don't know whether select() can do it.
How do I apply operations to every row of a data frame?
It seems that drop_na() will remove the entire row for my dataset, any suggestions?
With tidyverse it may be better to gather into 'long' format and then spread it back. Assuming that we have exactly 4 non-NA elements per row: create a row index with rownames_to_column (from tibble), gather (from tidyr) into 'long' format, remove the NA elements, then, grouped by row number ('rn'), change the 'key' values to common names and spread back to 'wide' format.
library(tibble)
library(tidyr)
library(dplyr)
res <- rownames_to_column(df1, 'rn') %>%
  gather(key, val, -rn) %>%
  filter(!is.na(val)) %>%
  group_by(rn) %>%
  mutate(key = LETTERS[1:4]) %>%
  spread(key, val) %>%
  ungroup %>%
  select(-rn)
res
# A tibble: 8 x 4
# A B C D
#* <dbl> <dbl> <dbl> <dbl>
#1 18.9 40.6 29.5 0
#2 347 23.8 7.47 2.81
#3 167 15.7 26.0 0
#4 163 39.5 8.47 3.08
#5 20.9 23.8 36.1 18.6
#6 18.7 59.4 21.2 0
#7 14.9 27.3 25.3 10.8
#8 267 28.2 12.2 0
diversity(res)
# 1 2 3 4 5 6 7 8
#1.0533711 0.3718959 0.6331070 0.7090783 1.3517680 0.9516232 1.3215712 0.4697572
Regarding the diversity calculation, we can replace NA with 0 and apply on the whole dataset i.e.
library(vegan)
diversity(replace(df1, is.na(df1), 0))
#[1] 1.0533711 0.3718959 0.6331070 0.7090783
#[5] 1.3517680 0.9516232 1.3215712 0.4697572
as we get the same output as in the first solution
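If vegan is not at hand, the row-wise logic is easy to sketch in base R: Shannon diversity (vegan's default index) is H = -sum(p * log(p)) with p the row proportions, and dropping NAs per row is equivalent to replacing them with 0, since a zero abundance contributes nothing to the sum. A hand-rolled sketch checked against row 1 of the question's data:

```r
# Shannon diversity of one row of abundances; NAs and zeros contribute nothing
shannon <- function(x) {
  x <- x[!is.na(x) & x > 0]   # drop NAs and zeros (0 * log(0) is defined as 0)
  p <- x / sum(x)             # proportions
  -sum(p * log(p))
}

# Row 1 of the question's data frame
row1 <- c(Q = NA, L = 18.93, C = NA, X = 40.56, W = 29.47, S = NA, Y = NA, H = 0)
h1 <- shannon(row1)           # ~1.0534, matching diversity() on the same row

# Applied to a whole data frame: apply(df1, 1, shannon)
```

This is only a sketch of what diversity() computes for its default index; for other indices (e.g. Simpson) use the package function.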

Construct new column from last non-NA values for each row [duplicate]

This question already has answers here:
Select last non-NA value in a row, by row
(3 answers)
Closed last month.
I have a data frame Depth which consists of LON and LAT with corresponding depth temperature data. For each coordinate (LON and LAT), I would like to pull out the last record of each depth corresponding to the coordinates into a new data frame:
> Depth<-read.csv('depthdata.csv')
> head(Depth)
LAT LON X150 X175 X200 X225 X250 X275 X300 X325 X350 X375 X400 X425 X450
1 -78.375 -163.875 -1.167 -1.0 NA NA NA NA NA NA NA NA NA NA NA
2 -78.125 -168.875 -1.379 -1.3 -1.259 -1.6 -1.476 -1.374 -1.507 NA NA NA NA NA NA
3 -78.125 -167.625 -1.700 -1.7 -1.700 -1.7 NA NA NA NA NA NA NA NA NA
4 -78.125 -167.375 -2.100 -2.2 -2.400 -2.3 -2.200 NA NA NA NA NA NA NA NA
5 -78.125 -167.125 -1.600 -1.6 -1.600 -1.6 NA NA NA NA NA NA NA NA NA
6 -78.125 -166.875 NA NA NA NA NA NA NA NA NA NA NA NA NA
so that I will have this;
LAT LON
-78.375 -163.875 -1
-78.125 -168.875 -1.507
-78.125 -167.625 -1.7
-78.125 -167.375 -2.2
-78.125 -167.125 -1.6
-78.125 -166.875 NA
I tried the tail() function but I don't get the desired result.
As I understand it, you want the last non-NA value in each row, for all columns except the first two.
We can use max.col() along with is.na() with our relevant columns to get us the column number for the last non-NA value. 2 is added (shown by + 2L) to compensate for the removal of the first two columns (shown by [-(1:2)]).
idx <- max.col(!is.na(Depth[-(1:2)]), ties.method = "last") + 2L
We can use idx in cbind() to create an index matrix for retrieving the values.
Depth[cbind(seq_len(nrow(Depth)), idx)]
# [1] -1.000 -1.507 -1.700 -2.200 -1.600 NA
Bind this together with the first two columns of the original data with cbind() and we're done.
cbind(Depth[1:2], LAST = Depth[cbind(seq_len(nrow(Depth)), idx)])
# LAT LON LAST
# 1 -78.375 -163.875 -1.000
# 2 -78.125 -168.875 -1.507
# 3 -78.125 -167.625 -1.700
# 4 -78.125 -167.375 -2.200
# 5 -78.125 -167.125 -1.600
# 6 -78.125 -166.875 NA
Data:
Depth <- structure(list(LAT = c(-78.375, -78.125, -78.125, -78.125, -78.125,
-78.125), LON = c(-163.875, -168.875, -167.625, -167.375, -167.125,
-166.875), X150 = c(-1.167, -1.379, -1.7, -2.1, -1.6, NA), X175 = c(-1,
-1.3, -1.7, -2.2, -1.6, NA), X200 = c(NA, -1.259, -1.7, -2.4,
-1.6, NA), X225 = c(NA, -1.6, -1.7, -2.3, -1.6, NA), X250 = c(NA,
-1.476, NA, -2.2, NA, NA), X275 = c(NA, -1.374, NA, NA, NA, NA
), X300 = c(NA, -1.507, NA, NA, NA, NA), X325 = c(NA, NA, NA,
NA, NA, NA), X350 = c(NA, NA, NA, NA, NA, NA), X375 = c(NA, NA,
NA, NA, NA, NA), X400 = c(NA, NA, NA, NA, NA, NA), X425 = c(NA,
NA, NA, NA, NA, NA), X450 = c(NA, NA, NA, NA, NA, NA)), .Names = c("LAT",
"LON", "X150", "X175", "X200", "X225", "X250", "X275", "X300",
"X325", "X350", "X375", "X400", "X425", "X450"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
