OR operator in filter()? - r

I want to use the filter() function to find the types that have an x value less than or equal to 4, OR a y value greater than 5. I think this might be a simple fix I just can't find much info on ?filter(). I almost have it I think:
x = c(1, 2, 3, 4, 5, 6)
y = c(3, 6, 1, 9, 1, 1)
type = c("cars", "bikes", "trains")
df = data.frame(x, y, type)
df2 = df %>%
filter(x<=4)

Try
df %>%
filter(x <=4| y>=5)

Related

how to pass options to a function using dplyr mutate_at

I want to center, but not standardize, a set of variables in a data frame. I tried the code for doing that using mutate_at, but the scale function uses scale = TRUE as default, and I can't figure out how to set it to scale = FALSE. Tis scales the desired variables, but standardizes in addition to centering:
centdata <- mydat %>%
mutate_at(.vars = c(1, 2, 3, 4, 5, 6, 7, 8, 14),
.funs = list("scaled" = scale))
You can use purrr style formula or an anonymous function here.
library(dplyr)
cols <- c(1, 2, 3, 4, 5, 6, 7, 8, 14)
centdata <- mydat %>%
mutate_at(.vars = cols,
.funs = list("scaled" = ~scale(., scale = FALSE)))
Since mutate_at has been deprecated, you can use across.
centdata <- mydat %>%
mutate(across(cols, list("scaled" = ~scale(., scale = FALSE))))
In base R -
mydat[paste0(names(mydat)[cols], '_scaled')] <- lapply(mydat[cols], scale, scale = FALSE)
scale also work on dataframe directly.
mydat[paste0(names(mydat)[cols], '_scaled')] <- scale(mydat[cols])

How do you align time-date data across multiple subjects in R

I am trying to scale data from multiple subjects onto the same time-scale. The current data files have 3 months of data for each subject, but the time-stamps for each event for each subject reflect different begin-end dates.
df$ID <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
df$Time <- c(2:34:00, 2:55:13, 5:23:23, 7:23:04, 9:18:18, 3:22:12, 4:23:02; 5:23:22, 9:30:02)
df$Date <- c(7/13/16, 7/13/16, 7/13/16, 7/14/16, 7/14/16, 1/02/14, 1/02/14, 1/03/14, 1/05/14)
df$widgets <-(4, 6, 9, 18, 3, 3, 7, 9, 12)
I want to change the df to have a common time scale so that I have a date index that allows me to keep the same format like below:
df$ScaleDate <- c(1,1,1,2,2,1,1,2,4) #time scale is within-ID
First, for reference, here's the data I used:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
Time = c("2:34:00", "2:55:13", "5:23:23", "7:23:04", "9:18:18", "3:22:12", "4:23:02", "5:23:22", "9:30:02"),
Date = c("7/13/16", "7/13/16", "7/13/16", "7/14/16", "7/14/16", "1/02/14", "1/02/14", "1/03/14", "1/05/14"),
widgets = c(4, 6, 9, 18, 3, 3, 7, 9, 12))
We can use used dplyr syntax (not mandatory, but looks very legible and simple) with lubridate functions and perform this operation very simply:
library(dplyr)
library(lubridate)
df %>%
mutate(Date = mdy(Date)) %>%
group_by(ID) %>%
mutate(ScaleDate = as.numeric((Date - Date[1]) + 1))
mdy converts the values to Date objects. We can then do operations with them. If the dates are equal, it will result in "0 days". Hence, we add 1 and covert it to numeric to get the indexes.

How to loop through vector of column names, mutate each column with assignment back to column and function referencing loop index

If you can forgive my interest in loops, I'd like to know how to loop through a vector of variable names (must be strings in my use case) and mutate the original columns. In this toy example, I want to calculate the mean of the column i plus z.
df_have <- data.frame(x=c(1, 1, 2, 3, 3),
y=c(2, 2, 3, 4, 4),
z=c(0, 1, 2, 3, 4))
for (i in c("x", "y")) {
df_test <-
df_have %>%
mutate(!!i := mean(i)+z)
}
df_want <- data.frame(x=c(2, 3, 4, 5, 6), # mean 2 + z
y=c(3, 4, 5, 6, 7), # mean 3 + z
z=c(0, 1, 2, 3, 4))
Well, if you want to do a loop, then
df_test <- df_have
for (i in c("x", "y")) {
df_test <-
df_test %>%
mutate(!!i := mean((!!as.name(i)))+z)
}
Note you need to turns those strings into symbols in order to use in the expression for mutate. An eaiser trick in this case would be
df_have %>% mutate_at(c("x","y"), funs(mean(.)+z))

Loop through each variable and collect output R

I have a data frame that looks like this. names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times "D', 'E', 'F' may be present, etc.)
# name and number of columns varies...so need flexible process
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
ABC <- data.frame(A, B, C)
I want to loop through each variable and collect various information. This is a simple example, but what I am doing will be more complicated. I say that so that somebody doesn't just recommend some sort of summary() type solution.
maximum_value <- max(A)
mean_value <- mean(A)
# lots of other calculations for A
ID = 'A'
tempA <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(B)
mean_value <- mean(B)
# lots of other calculations for B
ID = 'B'
tempB <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(C)
mean_value <- mean(C)
# lots of other calculations for C
ID = 'C'
tempC <- data.frame(ID, maximum_value, mean_value)
output <- rbind(tempA, tempB, tempC)
Here is my attempt at creating a loop to go through the variables one by one and aggregate output. I can't figure out how to get [i] to point at an individual column of the data frame ABC.
# initialize data frame
data__ <- data.frame(ID__ = as.character(),
max__ = as.numeric(),
mean__ = as.numeric())
# loop through A, then B, then C
for(i in A:C) {
ID__ <- '[i]'
max__ <- maximum[i]
mean__ <- mean[i]
data__temp <- (ID__, max__, mean__)
data__ <- rbind(data__, data__temp)
}
If I were doing this in SAS, I would use a select into within proc sql to create a list of the variable names, then write an array, then i could loop through them that way, but there's something I'm missing here.
How would I tell R to do this process for each variable in the data frame?
If you use the tidyverse dplyr and tidyr package, you can do
library(tidyr)
ABC %>% gather(ID, value) %>% group_by(ID) %>% summarize_all(funs(mean, max))
or
ABC %>% gather(ID, value) %>% group_by(ID) %>%
summarize(maximum_value = max(value), mean_value=mean(value))
If you'd rather use base functions and there are a lot of "weird" functions, you can use purrr's map_df function
library(purrr)
map2_df(ABC, names(ABC), function(a, n) {
data_frame(ID=n, max_val=max(a), mean_val=mean(a))
})

How to use dplyr to make modifications in a dataframe equivalent to the use of a 'which'?

If I have a data frame, say
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))
then I can modify entries with the which function:
w <- which(df[,"y"] == 7)
df[w,c("y", "z")] <- data.frame(6, 9)
One way I see to do this with the package dplyr is the following:
df <- df %>%
mutate(W = (y==7),
y = ifelse(W, 6, y),
z = ifelse(W, 9, z)) %>%
select(-W)
But I find it a bit unelegant, and I am not so sure it would replace all kinds of which uses. Ideally I would imagine something like:
df <- df %>%
keep(y == 7) %>%
mutate(y = 6) %>%
unkeep()
where keep would provisionally select rows where modifications are to be made, and unkeep would unselect them to recover the full data frame.

Resources