Loop through each variable and collect output in R

I have a data frame that looks like this. The names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.).
# name and number of columns varies...so need flexible process
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
ABC <- data.frame(A, B, C)
I want to loop through each variable and collect various information. This is a simple example, but what I am doing will be more complicated. I say that so that somebody doesn't just recommend some sort of summary() type solution.
maximum_value <- max(A)
mean_value <- mean(A)
# lots of other calculations for A
ID = 'A'
tempA <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(B)
mean_value <- mean(B)
# lots of other calculations for B
ID = 'B'
tempB <- data.frame(ID, maximum_value, mean_value)
maximum_value <- max(C)
mean_value <- mean(C)
# lots of other calculations for C
ID = 'C'
tempC <- data.frame(ID, maximum_value, mean_value)
output <- rbind(tempA, tempB, tempC)
Here is my attempt at creating a loop to go through the variables one by one and aggregate output. I can't figure out how to get [i] to point at an individual column of the data frame ABC.
# initialize data frame
data__ <- data.frame(ID__ = as.character(),
max__ = as.numeric(),
mean__ = as.numeric())
# loop through A, then B, then C
for(i in A:C) {
ID__ <- '[i]'
max__ <- maximum[i]
mean__ <- mean[i]
data__temp <- (ID__, max__, mean__)
data__ <- rbind(data__, data__temp)
}
If I were doing this in SAS, I would use a select into within proc sql to create a list of the variable names, then write an array, and then I could loop through them that way, but there's something I'm missing here.
How would I tell R to do this process for each variable in the data frame?

If you use the tidyverse packages dplyr and tidyr, you can do
library(dplyr)
library(tidyr)
# funs() is deprecated in recent dplyr; a named list of functions does the same job
ABC %>% gather(ID, value) %>% group_by(ID) %>% summarize_all(list(mean = mean, max = max))
or
ABC %>% gather(ID, value) %>% group_by(ID) %>%
summarize(maximum_value = max(value), mean_value = mean(value))
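If you'd rather use base functions and there are a lot of "weird" functions, you can use purrr's map2_df function, which passes each column together with its name
library(purrr)
library(tibble)
# tibble() replaces the deprecated data_frame()
map2_df(ABC, names(ABC), function(a, n) {
tibble(ID = n, max_val = max(a), mean_val = mean(a))
})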
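For comparison, here is a minimal sketch of the same per-column loop with no packages at all. It assumes the ABC data frame from the question; the helper name summarise_column is made up for illustration, and the body is where the "lots of other calculations" would go.

```r
# Rebuild the example data so the block runs on its own
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
ABC <- data.frame(A, B, C)

# Hypothetical helper: one row of results for one column
summarise_column <- function(values, id) {
  data.frame(ID = id,
             maximum_value = max(values),
             mean_value = mean(values))
}

# Map() walks over the columns and their names in parallel,
# so it works no matter how many columns are present
rows <- Map(summarise_column, ABC, names(ABC))

# Stack the one-row data frames into a single result
output <- do.call(rbind, rows)
rownames(output) <- NULL
print(output)
```

Because a data frame is a list of columns, Map(), lapply() and a plain for loop over names(ABC) all work here; the SAS-style "list of variable names" is just names(ABC).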

Related

How to insert multiple columns at specific positions in large data frame in R?

I have a large data frame in R with over 200 participants who have answered 152 questions. Now, I want to insert a column based on a conditional query after each "Answer" column. As an example, I have the following data frame:
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Answer2 = c(5, 1, 3, 5, 4))
I now want to insert a new conditional column of "Confidence" after each "Answer" column. For the column of "Answer1", the query would look like this:
data$Confidence1 <- ifelse(data$Answer1 == 1 | data$Answer1 == 6, 2, ifelse(data$Answer1 == 2 | data$Answer1 == 5, 1, 0))
In the end, I want the data frame to look like this:
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Confidence1 = c(0, 2, 1, 1, 0),
Answer2 = c(5, 1, 3, 5, 4),
Confidence2 = c(1, 2, 0, 1, 0))
Does anyone have an idea on how to achieve this for all "Answer" columns at once? Thanks!
Here is one solution using the modify() function from the purrr package and the column-ordering trick from the response by @r2evans to "merge two data table into one, with alternating columns in R":
library(purrr)
# create data, adding one more answer to your example
data <- data.frame(Participants = 1:5,
Answer1 = c(4, 6, 2, 2, 3),
Answer2 = c(5, 1, 3, 5, 4),
Answer3 = c(3, 2, 3, 4, 5))
# make a new df containing only the answer columns from data
answers <- data[,2:4]
# make confidence df and give it correct names
conf <- modify(answers, function(x) ifelse(x == 1 | x == 6, 2, ifelse(x == 2 | x == 5, 1, 0)))
names(conf) <- paste0("Confidence",1:ncol(answers))
# set order
neworder <- order(c(2*(seq_along(answers) - 1) + 1,
2*seq_along(conf)))
# put it together
cbind(Participants = data[,1], cbind(answers, conf)[,neworder])
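If hard-coding the column positions (2:4) is a concern with 152 questions, the following base R sketch picks the "Answer" columns by name instead. It assumes every answer column starts with "Answer" and that the matching confidence column should replace that prefix with "Confidence"; the helper name score is made up for illustration.

```r
data <- data.frame(Participants = 1:5,
                   Answer1 = c(4, 6, 2, 2, 3),
                   Answer2 = c(5, 1, 3, 5, 4))

# Hypothetical helper implementing the rule from the question
score <- function(x) ifelse(x %in% c(1, 6), 2, ifelse(x %in% c(2, 5), 1, 0))

# Find the answer columns by name rather than by position
answer_cols <- grep("^Answer", names(data), value = TRUE)

# Start from the non-answer columns, then append each answer
# followed immediately by its confidence column
result <- data[setdiff(names(data), answer_cols)]
for (col in answer_cols) {
  result[[col]] <- data[[col]]
  result[[sub("^Answer", "Confidence", col)]] <- score(data[[col]])
}
print(result)
```

Building the result column by column keeps each Answer/Confidence pair adjacent without computing an interleaving order.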

Dynamic filtering with dplyr

For every numeric field in a Shiny app I dynamically add a slider. Now I also want to add
a dynamic filter for the data based on the slider input.
To illustrate the problem some code with data and static filtering:
library(glue)
library(tidyverse)
data <-
tibble(
a = c(61, 7, 10, 2, 5, 7, 23, 60),
b = c(2, 7, 1, 9, 6, 7, 3, 6),
c = c(21, 70, 1, 4, 6, 2, 3, 61)
)
input <- list("a" = c(2, 10),
"b" = c(7, 10),
"c" = c(1, 5))
data %>% filter(
between(a, input$a[1], input$a[2]),
between(b, input$b[1], input$b[2]),
between(c, input$c[1], input$c[2])
)
Is there a way to implement dynamic filtering?
I built myself a dynamic filter, which basically works:
# select_if() keeps the numeric columns; filter_if() would need a second predicate
query <- data %>% select_if(is.numeric) %>% colnames() %>% map(function(feature) {
glue("between({feature}, input${feature}[1], input${feature}[2])")
}) %>% paste0(collapse = ", ")
eval(parse(text = glue("data %>% filter({query})")))
Is there a dplyr way?
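One way to avoid building and evaluating code strings is to compute one logical vector per slider and combine them. The sketch below uses only base R and assumes that input is a named list whose names match the column names, as in the example; the variable names in_range, keep and filtered are made up for illustration.

```r
data <- data.frame(
  a = c(61, 7, 10, 2, 5, 7, 23, 60),
  b = c(2, 7, 1, 9, 6, 7, 3, 6),
  c = c(21, 70, 1, 4, 6, 2, 3, 61)
)
input <- list(a = c(2, 10),
              b = c(7, 10),
              c = c(1, 5))

# One logical vector per slider: TRUE where the value lies in [min, max]
in_range <- lapply(names(input), function(col) {
  data[[col]] >= input[[col]][1] & data[[col]] <= input[[col]][2]
})

# A row is kept only if it passes every range check
keep <- Reduce(`&`, in_range)
filtered <- data[keep, ]
print(filtered)
```

The keep vector can also be passed to dplyr's filter() in place of the base subsetting. In a real Shiny app, input is a reactive values object rather than a plain list, so the column names would likely have to come from the data instead of names(input).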

How do you align time-date data across multiple subjects in R

I am trying to scale data from multiple subjects onto the same time-scale. The current data files have 3 months of data for each subject, but the time-stamps for each event for each subject reflect different begin-end dates.
df$ID <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
df$Time <- c(2:34:00, 2:55:13, 5:23:23, 7:23:04, 9:18:18, 3:22:12, 4:23:02; 5:23:22, 9:30:02)
df$Date <- c(7/13/16, 7/13/16, 7/13/16, 7/14/16, 7/14/16, 1/02/14, 1/02/14, 1/03/14, 1/05/14)
df$widgets <-(4, 6, 9, 18, 3, 3, 7, 9, 12)
I want to change the df to have a common time scale so that I have a date index that allows me to keep the same format like below:
df$ScaleDate <- c(1,1,1,2,2,1,1,2,4) #time scale is within-ID
First, for reference, here's the data I used:
df <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
Time = c("2:34:00", "2:55:13", "5:23:23", "7:23:04", "9:18:18", "3:22:12", "4:23:02", "5:23:22", "9:30:02"),
Date = c("7/13/16", "7/13/16", "7/13/16", "7/14/16", "7/14/16", "1/02/14", "1/02/14", "1/03/14", "1/05/14"),
widgets = c(4, 6, 9, 18, 3, 3, 7, 9, 12))
We can use dplyr syntax (not mandatory, but it is legible and simple) with lubridate functions to perform this operation:
library(dplyr)
library(lubridate)
df %>%
mutate(Date = mdy(Date)) %>%
group_by(ID) %>%
mutate(ScaleDate = as.numeric((Date - Date[1]) + 1))
mdy converts the values to Date objects, so we can do arithmetic with them. Within each ID, subtracting the first date gives the number of days elapsed, which is "0 days" for rows on the first date. Hence, we add 1 and convert the result to numeric to get the indexes.
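The same idea can be sketched in base R, assuming the dates are month/day/two-digit-year strings and that rows are already sorted by date within each ID (so the first row of each ID holds its start date):

```r
df <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 Date = c("7/13/16", "7/13/16", "7/13/16", "7/14/16", "7/14/16",
                          "1/02/14", "1/02/14", "1/03/14", "1/05/14"),
                 widgets = c(4, 6, 9, 18, 3, 3, 7, 9, 12))

# Parse the strings into Date objects
df$Date <- as.Date(df$Date, format = "%m/%d/%y")

# ave() applies the function within each ID group:
# days since that ID's first date, plus 1 so the index starts at 1
df$ScaleDate <- ave(as.numeric(df$Date), df$ID,
                    FUN = function(d) d - d[1] + 1)
print(df)
```

If the rows might not be sorted, replacing d[1] with min(d) makes the index relative to each subject's earliest date regardless of row order.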

How to loop through vector of column names, mutate each column with assignment back to column and function referencing loop index

If you can forgive my interest in loops, I'd like to know how to loop through a vector of variable names (must be strings in my use case) and mutate the original columns. In this toy example, I want to calculate the mean of the column i plus z.
df_have <- data.frame(x=c(1, 1, 2, 3, 3),
y=c(2, 2, 3, 4, 4),
z=c(0, 1, 2, 3, 4))
for (i in c("x", "y")) {
df_test <-
df_have %>%
mutate(!!i := mean(i)+z)
}
df_want <- data.frame(x=c(2, 3, 4, 5, 6), # mean 2 + z
y=c(3, 4, 5, 6, 7), # mean 3 + z
z=c(0, 1, 2, 3, 4))
Well, if you want to do a loop, then
df_test <- df_have
for (i in c("x", "y")) {
df_test <-
df_test %>%
mutate(!!i := mean((!!as.name(i)))+z)
}
Note that you need to turn those strings into symbols in order to use them in the expression for mutate. An easier trick in this case would be
# funs() is deprecated in recent dplyr; a formula-style lambda does the same job
df_have %>% mutate_at(c("x", "y"), ~ mean(.) + z)
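If the loop itself is the point, a base R sketch sidesteps the string-to-symbol conversion entirely, because [[ ]] accepts column names as strings:

```r
df_have <- data.frame(x = c(1, 1, 2, 3, 3),
                      y = c(2, 2, 3, 4, 4),
                      z = c(0, 1, 2, 3, 4))

df_test <- df_have
for (i in c("x", "y")) {
  # [[i]] looks the column up by its string name, so no quasiquotation is needed
  df_test[[i]] <- mean(df_test[[i]]) + df_test$z
}
print(df_test)
```

As in the dplyr loop, the result is accumulated in df_test so that the second iteration does not overwrite the first.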

OR operator in filter()?

I want to use the filter() function to find the types that have an x value less than or equal to 4, OR a y value greater than 5. I think this might be a simple fix, but I can't find much info in ?filter. I almost have it, I think:
x = c(1, 2, 3, 4, 5, 6)
y = c(3, 6, 1, 9, 1, 1)
type = c("cars", "bikes", "trains")
df = data.frame(x, y, type)
df2 = df %>%
filter(x<=4)
Try
df %>%
filter(x <= 4 | y > 5)
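The | operator is R's element-wise OR, so the same condition also works with plain subsetting. A small base R sketch using the data from the question (the name df_or is made up for illustration):

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3, 6, 1, 9, 1, 1)
type <- c("cars", "bikes", "trains")  # recycled to length 6 by data.frame()
df <- data.frame(x, y, type)

# | compares row by row; || is not vectorised and is not suitable here
df_or <- df[df$x <= 4 | df$y > 5, ]
print(df_or)
```

Separating conditions with a comma inside filter() combines them with AND, which is why OR has to be written explicitly with |.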
