Select Date and Numeric columns only - r

Is there an efficient way to select Date and numeric columns in R?
df <- data.frame(
  Date = c("10/11/2012", "10/12/2012"),
  AE = c(1211, 100),
  Percent = c(0.03, 0.43),
  Name = c("A", "B")
)
I can use the is.numeric function to check whether a column is numeric and then use one of several ways to subset, but is there a function to check whether a column is a date, and how do I use multiple conditions for subsetting?
I found that there is a function, is.Date, in the lubridate package, but it did not work:
# does not work
df <- df %>%
  select_if(is.numeric|is.Date)

dplyr verbs for selection allow various methods of providing the conditionals:
raw functions, as in is.numeric, which will be called with the column data (a vector) as its one argument;
anonymous functions (R style), as in function(x) is.numeric(x) | inherits(x, "Date");
what is called a "purrr-style lambda" using R formulas (~), which seems to be just a more compact form of the base R anonymous function, but there are some differences; notably, you use . or .x as a placeholder for the column data, as in the answer below
df %>%
  select_if(~ is.numeric(.) | inherits(., "Date"))
#     AE Percent
# 1 1211    0.03
# 2  100    0.43
Since your first column is not actually a Date, let's fix that:
# df$Date <- as.Date(df$Date, format="%m/%d/%Y")
df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  select_if(~ is.numeric(.x) | inherits(.x, "Date"))
#         Date   AE Percent
# 1 2012-10-11 1211    0.03
# 2 2012-10-12  100    0.43
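In newer dplyr versions, select_if() has been superseded by select() combined with where(). A minimal sketch of the same selection in that style (assuming dplyr >= 1.0.0 and the df defined above):
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  select(where(~ is.numeric(.x) || inherits(.x, "Date")))
#         Date   AE Percent
# 1 2012-10-11 1211    0.03
# 2 2012-10-12  100    0.43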

Related

Apply function on data.frame with mutate across using the same columns from another data.frame

I have two data frames with spectral bands from a satellite, redDF and nirDF. Both data frames have one value column per date, with names starting with an 'X'; these names correspond between the two data frames.
I want to get a new data frame where for each column starting with an 'X' in both redDF and nirDF a new value is calculated according to some formula.
Here is a data sample:
library(dplyr)
set.seed(999)
# get column names
datecolnames <- seq(as.Date("2015-05-01", "%Y-%m-%d"),
                    as.Date("2015-09-20", "%Y-%m-%d"),
                    by = "16 days") %>%
  format(., "%Y-%m-%d") %>%
  paste0("X", .)
# sample data values
mydata <- as.integer(runif(length(datecolnames))*1000)
# sample no data indices
nodata <- sample(1:length(datecolnames), length(datecolnames)*0.3)
mydata[nodata] <- NA # assign no data to the correct indices
# get dummy data.frame of red spectral values
redDF <- data.frame(mydata,
                    mydata[sample(1:length(mydata))],
                    mydata[sample(1:length(mydata))]) %>%
  t() %>%
  as.data.frame(., row.names = FALSE) %>%
  rename_with(~datecolnames) %>%
  mutate(id = row_number() + 1142) %>%
  select(id, everything())
# get dummy data.frame of near infrared spectral values
# in this case a modified version of redDF
nirDF <- redDF %>%
  mutate(across(-id, ~as.integer(.x + 20 * 1.8))) %>%
  select(id, everything())
> nirDF
id X2015-05-01 X2015-05-17 X2015-06-02 X2015-06-18 X2015-07-04 X2015-07-20 X2015-08-05
1 1143 NA 645 NA 636 569 841 706
2 1144 1025 NA 706 569 354 NA NA
3 1145 904 636 706 645 NA NA 115
X2015-08-21 X2015-09-06 X2015-09-22 X2015-10-08 X2015-10-24 X2015-11-09
1 115 1025 904 NA 409 354
2 115 636 409 645 841 904
3 569 409 354 841 1025 NA
and this is the formula:
getNDVI <- function(red, nir){round((nir - red)/(nir + red), digits = 4)}
I hoped I would be able to do something like:
ndviDF <- redDF %>% mutate(across(starts_with('X'), .fns = getNDVI))
But that doesn't work, as dplyr doesn't know what the nir argument of getNDVI should be. I have seen solutions for accessing other data frames in mutate() by using the $COLNAME indexer, but since I have 197 columns, that is not an option here.
I would approach this with a for loop, though I know it does not make the best use of functionality like across().
First we create a list of the columns we want to iterate over:
cols_to_iterate_over = redDF %>%
  select(starts_with("X")) %>%
  colnames()
Then we join on id and ensure the columns are named according to their source dataset:
joined_df = redDF %>%
  inner_join(nirDF, by = "id", suffix = c("_red", "_nir"))
So joined_df should have columns like:
id X2015-05-01_red X2015-05-01_nir X2015-05-17_red X2015-05-17_nir ...
Then we can loop over these:
for(col in cols_to_iterate_over){
  # columns for calculation
  red_col = paste0(col, "_red") %>% sym()
  nir_col = paste0(col, "_nir") %>% sym()
  out_col = col %>% sym()
  # calculate
  joined_df = joined_df %>%
    mutate(
      !!out_col := round((!!nir_col - !!red_col)/(!!nir_col + !!red_col),
                         digits = 4)
    ) %>%
    select(-!!red_col, -!!nir_col)
}
Explanation: We can use text strings as variable names if we turn them into symbols and then !! them.
sym() turns text into symbols,
!! inside dplyr commands turns symbols into code,
and := is equivalent to = but permits us to have !! on the left-hand side.
Sorry, this is slightly old syntax. For the current approaches see programming with dplyr.
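For reference, a rough sketch of what the same loop might look like with current tidy-eval idioms (the .data pronoun and a glue-style name on the left of :=), assuming dplyr >= 1.0.0 and the joined_df and cols_to_iterate_over objects built above:
library(dplyr)
for (col in cols_to_iterate_over) {
  red_col <- paste0(col, "_red")
  nir_col <- paste0(col, "_nir")
  joined_df <- joined_df %>%
    mutate("{col}" := round(
      (.data[[nir_col]] - .data[[red_col]]) /
        (.data[[nir_col]] + .data[[red_col]]),
      digits = 4
    )) %>%
    select(-all_of(c(red_col, nir_col)))
}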
In its most basic form, you can just do this:
round((nirDF - redDF)/(nirDF + redDF), digits = 4)
But this does not retain the id-column and can break if some columns are not numeric. A more failsafe version would be:
red <- redDF %>%
  arrange(id) %>%                            # be sure to apply the same order everywhere
  select(starts_with('X')) %>%
  mutate(across(everything(), as.numeric))   # be sure to have numeric columns
nir <- nirDF %>%
  arrange(id) %>%
  select(starts_with('X')) %>%
  mutate(across(everything(), as.numeric))
# make sure that the number of rows is the same
if(nrow(red) == nrow(nir)){
  ndvi <- redDF %>%
    # get data.frame with ndvi values
    transmute(round((nir - red)/(nir + red), digits = 4)) %>%
    # bind id-column and possibly other columns to the data frame
    bind_cols(redDF %>% arrange(id) %>% select(!starts_with('X'))) %>%
    # place the id-column at the front
    select(!starts_with('X'), everything())
}
As far as I have understood dplyr by now, it boils down to this:
across is (generally) meant for many-to-many relationships, but handles columns on an individual basis by default. So, if you give it three columns, it will give you three columns back which are not aware of the values in other columns.
c_across on the other hand, can evaluate relationships between columns (like a sum or a standard deviation) but is meant for many-to-one relationships. In other words, if you give it three columns, it will give you one column back.
Neither of these is suitable for this task. However, by design, arithmetic operations can be applied to data frames in R (just try cars*cars for instance). This is what we need in this case. Luckily, these operations are not as greedy as dplyr join operations, so they can be done efficiently on large data frames.
While doing so, you need to keep a few requirements in mind:
The number of rows of the two data frames must be equal; otherwise the shorter data frame will get recycled.
All columns in the data frames need to be of a numeric class (numeric or integer).
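A tiny illustration of this element-wise data frame arithmetic, on toy data (not the question's data frames):
a <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))
b <- data.frame(x = c(2, 2, 2), y = c(5, 5, 5))
(a - b) / (a + b)
#            x         y
# 1 -0.3333333 0.3333333
# 2  0.0000000 0.6000000
# 3  0.2000000 0.7142857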

How to convert all columns where entries have length ≤1 to numeric?

I have a data frame with ~80 columns, and ~20-40 of those columns have single-digit integers that were stored as characters. Other character columns are complete sentences (so, length >>> 1), and so get coerced to NA if I try mutate_if(is.character, as.numeric).
I would like to transform those efficiently, and based on this question, I was hoping for something like this:
df %>% map_if(is.character & length(.) <= 1, as.numeric)
However, that doesn't work. I'm hoping for a tidy solution, maybe using purrr?
The best function for these situations is type_convert(), from readr:
"[type_convert() re-converts character columns in a data frame], which is useful if you need to do some manual munging - you can read the columns in as character, clean it up with (e.g.) regular expressions and other transformations, and then let readr take another stab at parsing it."
So, all you need to do is add it at the end of your pipe:
df %>% ... %>% type_convert()
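A minimal sketch with a hypothetical data frame (the columns id and note are made up for illustration):
library(dplyr)
library(readr)
toy <- tibble(
  id   = c("1", "2", "3"),                          # digits stored as character
  note = c("first row", "second row", "third row")  # genuine text, left alone
)
toy %>% type_convert()
# id is parsed as a numeric column, note stays character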
Alternatively, we can use type.convert from base R, which automatically detects the column type based on the values and changes it:
df[] <- type.convert(df, as.is = TRUE)
If the constraint is to look only at columns whose entries have a single character:
i1 <- !colSums(nchar(as.matrix(df)) > 1)
df[i1] <- type.convert(df[i1])
If we want to use the tidyverse, there is parse_guess from readr:
library(tidyverse)
library(readr)
df %>%
  mutate_if(~ all(nchar(.) == 1), parse_guess)
You could check nchar of the column in mutate_if:
library(dplyr)
df %>% mutate_if(~all(nchar(.) == 1) & is.character(.), as.numeric)
Using it with some example data:
df <- data.frame(a = c("ab", "bc", "de", "de", "ef"),
                 b = as.character(1:5), stringsAsFactors = FALSE)
df1 <- df %>% mutate_if(~all(nchar(.) == 1) & is.character(.), as.numeric)
str(df1)
#'data.frame': 5 obs. of 2 variables:
# $ a: chr "ab" "bc" "de" "de" ...
# $ b: num 1 2 3 4 5
You could do the same with map_if as well; however, it returns a list, so you need to convert it back to a data frame:
library(purrr)
df %>%
  map_if(~all(nchar(.) == 1) & is.character(.), as.numeric) %>%
  as.data.frame(., stringsAsFactors = FALSE)

Subtract Time Values in R

I have values such as:
df[,1:2]
[screenshot of the In and Out time columns omitted]
I want to create a new column that has the difference between the Ins and the Outs. These are TIME values.
The expected output is:
1201
0718 (neglecting negative values)
... and so on.
library(stringr)
# generate few rows of data
In <- c('143','1239')
Out <- c('1344','521')
df <- data.frame(cbind(In, Out), stringsAsFactors=FALSE)
# pad with zero if needed (e.g. 143 -> 0143)
df$In[str_length(df$In) == 3] <- paste(0,df$In[str_length(df$In) == 3], sep='')
df$Out[str_length(df$Out) == 3] <- paste(0,df$Out[str_length(df$Out) == 3], sep='')
df$In <- strptime(df$In, format='%H%M')
df$Out <- strptime(df$Out, format='%H%M')
df$diff <- df$In - df$Out
This gives:
> df$diff
Time differences in hours
[1] -12.01667 7.30000
Is this what you are looking for?
If I understand correctly, the OP wants to compute the absolute time difference where the time of the day (neglecting the date) is given as character strings in the form HMM or HHMM.
There are classes which support time of the day (without date) directly, e.g., the hms package or the ITime class of the data.table package.
As an additional challenge, the timestamps are not given in a standard time format HH:MM, e.g., 09:43.
Here is an approach which uses as.ITime() after the strings have been padded.
# create sample data frame
df <- data.frame(In = c("143", "1239"),
                 Out = c("1344", "521"))
library(magrittr) # piping is used for readability
# pad strings and coerce to ITime class
df$In %<>%
  stringr::str_pad(4L, pad = "0") %>%
  data.table::as.ITime("%H%M")
df$Out %<>%
  stringr::str_pad(4L, pad = "0") %>%
  data.table::as.ITime("%H%M")
# compute absolute difference
df$absdiff <- abs(df$In - df$Out)
df
In Out absdiff
1 01:43:00 13:44:00 12:01:00
2 12:39:00 05:21:00 07:18:00
Now, the OP seems to expect the result in the same non-standard format HHMM (without the : field separator) as the input values. This can be achieved by
df$absdiff %>%
  as.POSIXct() %>%
  format("%H%M")
[1] "1201" "0718"

Earliest Date for each id in R

I have a dataset where each individual (id) has an e_date, and since each individual can have more than one e_date, I'm trying to get the earliest one for each individual. Basically, I would like a dataset with one row per id showing its earliest e_date value.
I've used the aggregate function to find the minimum values, created a new variable combining the date and the id, and finally subset the original dataset based on the one containing the minimums, using the new variable. I've come up with this:
new <- aggregate(e_date ~ id, data_full, min)
data_full["comb"] <- NULL
data_full$comb <- paste(data_full$id,data_full$e_date)
new["comb"] <- NULL
new$comb <- paste(new$lopnr,new$EDATUM)
data_fixed <- data_full[which(new$comb %in% data_full$comb),]
The first thing is that the aggregate function doesn't seem to work at all: it reduces the number of rows, but viewing the data I can clearly see that some ids still appear more than once with different e_date values. Plus, the code gives me different results when I use the as.Date format instead of the date's original format (integer). I think the answer is simple, but I'm stuck on this one.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(data_full)), order by 'e_date', and, grouped by 'id', take the first row (head(.SD, 1L)).
library(data.table)
setDT(data_full)[order(e_date), head(.SD, 1L), by = id]
Or using dplyr, after grouping by 'id', arrange the 'e_date' (assuming it is of Date class) and get the first row with slice.
library(dplyr)
data_full %>%
  group_by(id) %>%
  arrange(e_date) %>%
  slice(1L)
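In more recent dplyr versions (>= 1.0.0), slice_min() expresses the same idea directly; a short sketch:
data_full %>%
  group_by(id) %>%
  slice_min(e_date, n = 1, with_ties = FALSE) %>%
  ungroup()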
If we need a base R option, ave can be used:
data_full[with(data_full, as.logical(ave(e_date, id, FUN = function(x) rank(x) == 1))),]
Another answer that uses dplyr's filter command:
dta %>%
  group_by(id) %>%
  filter(date == min(date))
You may use library(sqldf) to get the minimum date as follows:
data1 <- data.frame(id = c("789","123","456","123","123","456","789"),
                    e_date = c("2016-05-01","2016-07-02","2016-08-25","2015-12-11",
                               "2014-03-01","2015-07-08","2015-12-11"))
library(sqldf)
data2 <- sqldf("SELECT id,
                       min(e_date) as 'earliest_date'
                FROM data1 GROUP BY 1", method = "name__class")
head(data2)
id earliest_date
123 2014-03-01
456 2015-07-08
789 2015-12-11
I made a reproducible example, supposing that you grouped some dates by which quarter they were in.
library(lubridate)
library(dplyr)
rand_weeks <- now() + weeks(sample(100))
which_quarter <- quarter(rand_weeks)
df <- data.frame(rand_weeks, which_quarter)
df %>%
  group_by(which_quarter) %>%
  summarise(sort(rand_weeks)[1])
# A tibble: 4 x 2
  which_quarter sort(rand_weeks)[1]
          <dbl>              <time>
1             1 2017-01-05 05:46:32
2             2 2017-04-06 05:46:32
3             3 2016-08-18 05:46:32
4             4 2016-10-06 05:46:32

Correlation using funs in dplyr

I want to find the rank correlation of various columns in a data.frame using dplyr.
I am sure there is a simple solution to this problem, but I think the problem lies in not being able to pass two inputs to summarize_each_ in dplyr when using the cor function.
For the following df:
df <- data.frame(Universe = c(rep("A", 5), rep("B", 5)),
                 AA.x = rnorm(10), BB.x = rnorm(10), CC.x = rnorm(10),
                 AA.y = rnorm(10), BB.y = rnorm(10), CC.y = rnorm(10))
I want to get the rank correlations between all the .x and .y combinations. My problem is in the function below, where you see ????:
cor <- df %>%
  group_by(Universe) %>%
  summarize_each_(funs(cor(., method = 'spearman', use = "pairwise.complete.obs")), ????)
I want cor to include just the correlation pairs AA.x.AA.y, AA.x.BB.y, ... for each Universe.
Please help!
An alternative approach is to just call the cor function once since this will calculate all required correlations. Repeated calls to cor might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:
#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>%
  group_by(Universe) %>%
  do(as.data.frame(cor(.[,-1], method = "spearman", use = "pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%
  data.frame(row = rep(colnames(.)[-1], n_groups(.)))
#
# calculate correlations and display in column format
#
library(reshape2) # melt() is used below (assumed dependency)
num_col <- ncol(df[,-1])
out_indx <- which(upper.tri(diag(num_col)))
cor_cols <- df %>%
  group_by(Universe) %>%
  do(melt(cor(.[,-1], method = "spearman", use = "pairwise.complete.obs"),
          value.name = "cor")[out_indx,])
So here follows the winning (time-wise) solution to my problem:
d <- df %>%
  gather(R1, R1v, contains(".x")) %>%
  gather(R2, R2v, contains(".y"), -Universe) %>%
  group_by(Universe, R1, R2) %>%
  summarize(ICAC = cor(x = R1v, y = R2v, method = 'spearman',
                       use = "pairwise.complete.obs")) %>%
  unite(Pair, R1, R2, sep = "_")
It takes only about 0.005 milliseconds on this example, although adding data adds time.
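Since gather() has since been superseded, here is a rough sketch of the same reshape-then-correlate idea using pivot_longer() (assumes tidyr >= 1.0.0, dplyr >= 1.0.0 and the df from the question):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(ends_with(".x"), names_to = "R1", values_to = "R1v") %>%
  pivot_longer(ends_with(".y"), names_to = "R2", values_to = "R2v") %>%
  group_by(Universe, R1, R2) %>%
  summarize(ICAC = cor(R1v, R2v, method = "spearman",
                       use = "pairwise.complete.obs"),
            .groups = "drop") %>%
  unite(Pair, R1, R2, sep = "_")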
Try this:
library(data.table) # needed for fast melt
setDT(df)           # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe', 'id'))
mdf %>%
  mutate(obs_set = substr(variable, 4, 4)) %>%              # ".x" or ".y" subgroup
  full_join(., ., by = c('Universe', 'obs_set', 'id')) %>%  # see notes
  group_by(Universe, variable.x, variable.y) %>%
  filter(variable.x != variable.y) %>%
  dplyr::summarise(rank_corr = cor(value.x, value.y,
                                   method = 'spearman', use = 'pairwise.complete.obs'))
Produces:
Universe variable.x variable.y rank_corr
(fctr) (fctr) (fctr) (dbl)
1 A AA.x BB.x -0.9
2 A AA.x CC.x -0.9
3 A BB.x AA.x -0.9
4 A BB.x CC.x 0.8
5 A CC.x AA.x -0.9
6 A CC.x BB.x 0.8
7 A AA.y BB.y -0.3
8 A AA.y CC.y 0.2
9 A BB.y AA.y -0.3
10 A BB.y CC.y -0.3
.. ... ... ... ...
Explanation:
Melt: converts table to long form, one row per observation. To do the melt in a dplyr chain, you would have to use tidyr::gather, I believe, so pick your dependency. Using data.table there is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr like you wanted.
Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.
Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.
