I'm new to R, and I'm trying to write a function that adds the entries
of two data frame columns row by row and returns the data frame with
a new, named column holding the row sums.
Here's a sample df of my data:
Ethnicity <- c('A', 'B', 'H', 'N', 'O', 'W', 'Unknown')
Texas <- c(2,41,56,1,3,89,7)
Tenn <- c(1,9,2,NA,1,32,3)
df <- data.frame(Ethnicity, Texas, Tenn)
When I directly try the following code, the columns are summed by row as desired:
new_df <- df %>% rowwise() %>%
mutate(TN_TX = sum(Tenn, Texas, na.rm = TRUE))
new_df
But when I try to use my function code, rowwise() seems not to work. My function code is:
df.sum.col <- function(df.in, col.1, col.2) {
if(is.data.frame(df.in) != TRUE){ #warning if first arg not df
warning('df.in is not a dataframe')}
if(is.numeric(col.1) != TRUE){
warning('col.1 is not a numeric vector')}
if(is.numeric(col.2) != TRUE){
warning('col.2 is not a numeric vector')} #warning if col not numeric
df.out <- rowwise(df.in) %>%
mutate(name = sum(col.1, col.2, na.rm = TRUE))
df.out
}
bad_df <- df.sum.col(df, Texas, Tenn)
This results in
bad_df
.
I don't understand why the core of the function works outside it but not within. I also tried piping df.in to rowwise() like this:
df.out <- df.in %>% rowwise() %>%
mutate(name = sum(col.1, col.2, na.rm = TRUE))
But that doesn't resolve the problem.
As far as naming the new column, I tried doing so by adding the name as an argument, but didn't have any success. Thoughts on this?
Any help appreciated!
As suggested by @thelatemail, it's down to non-standard evaluation; rowwise() has nothing to do with it. You need to rewrite your function to use mutate_. It can be tricky to understand, but here's one version of what you're trying to do:
library(dplyr)
df <- tibble::tribble(
~Ethnicity, ~Texas, ~Tenn,
"A", 2, 1,
"B", 41, 9,
"H", 56, 2,
"N", 1, NA,
"O", 3, 1,
"W", 89, 32,
"Unknown", 7, 3
)
df.sum.col <- function(df.in, col.1, col.2, name) {
if(is.data.frame(df.in) != TRUE){ #warning if first arg not df
warning('df.in is not a dataframe')}
if(is.numeric(lazyeval::lazy_eval(substitute(col.1), df.in)) != TRUE){
warning('col.1 is not a numeric vector')}
if(is.numeric(lazyeval::lazy_eval(substitute(col.2), df.in)) != TRUE){
warning('col.2 is not a numeric vector')} #warning if col not numeric
dots <- setNames(list(lazyeval::interp(~sum(x, y, na.rm = TRUE),
x = substitute(col.1), y = substitute(col.2))),
name)
df.out <- rowwise(df.in) %>%
mutate_(.dots = dots)
df.out
}
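For readers on a current dplyr, where mutate_ and the lazyeval approach are deprecated, the same idea can be sketched with tidy evaluation. This is only a sketch: the function name sum_cols is hypothetical, and it assumes a dplyr version that supports the {{ }} operator (1.0.0 or later is a safe assumption).

```r
library(dplyr)

df <- data.frame(
  Ethnicity = c("A", "B", "H", "N", "O", "W", "Unknown"),
  Texas = c(2, 41, 56, 1, 3, 89, 7),
  Tenn = c(1, 9, 2, NA, 1, 32, 3)
)

# Hypothetical helper: col.1 and col.2 are bare column names,
# name is a string giving the name of the new column.
sum_cols <- function(df.in, col.1, col.2, name) {
  stopifnot(is.data.frame(df.in))
  df.in %>%
    rowwise() %>%
    # {{ }} forwards the bare column names into mutate();
    # !!name := uses the string in `name` as the new column name
    mutate(!!name := sum({{ col.1 }}, {{ col.2 }}, na.rm = TRUE)) %>%
    ungroup()
}

new_df <- sum_cols(df, Texas, Tenn, "TN_TX")
new_df
```

The call looks the same as in the question, but the columns are now looked up inside df.in rather than in the global environment.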
In practice, you shouldn't need to use rowwise at all here, but can use rowSums, after selecting only the columns you need to sum.
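A minimal base R sketch of the rowSums idea follows. The helper name add_row_sum is hypothetical, and here the columns are passed as strings, which sidesteps non-standard evaluation entirely.

```r
df <- data.frame(
  Ethnicity = c("A", "B", "H", "N", "O", "W", "Unknown"),
  Texas = c(2, 41, 56, 1, 3, 89, 7),
  Tenn = c(1, 9, 2, NA, 1, 32, 3)
)

# Hypothetical helper: cols is a character vector of column names,
# name is the name of the new column.
add_row_sum <- function(df.in, cols, name) {
  stopifnot(is.data.frame(df.in), all(cols %in% names(df.in)))
  # rowSums() works on the selected columns in one vectorised call
  df.in[[name]] <- rowSums(df.in[, cols, drop = FALSE], na.rm = TRUE)
  df.in
}

out <- add_row_sum(df, c("Texas", "Tenn"), "TN_TX")
out
```

Note that with na.rm = TRUE a row whose selected entries are all NA sums to 0 rather than NA.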
I have some very nested data. Within my list-column-dataframes, there are some pieces I need to put together and I've done so in a single instance to get my desired dataframe:
a <- df[[2]][["result"]]@data
b <- df[[2]][["result"]]@coords
desired_df <- cbind(a, b)
My original Large list has 171 elements, meaning I have 1:171 (3.3 GB) to go inside those square brackets and would ideally end up with 171 desired dataframes (which I would then bind all together).
I haven't needed to write a loop in 10 years, but I don't see a tidyverse way to deal with this. I also no longer know how to write loops. There are definitely some elements in there that are junk and will fail.
You haven't provided any sort of minimal example of the data.
I've taken it to mean something like this:
base_data <- data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2),
var2 = c( 2, 4, 8))
base_data2 = matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9),
nrow = 3,
ncol = 3,
byrow = TRUE
)
rownames(base_data2) = c("d", "e", "f")
methods::setClass(
"weird_object",
slots = c(data = "data.frame", coords = "matrix"),
prototype = list(data = base_data, coords = base_data2)
)
df <- list(
list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
),list(
result = new("weird_object")
)
)
And if I had such a list with these objects, then I could do
library(tidyverse)
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value)
But the selecting / hoisting function might fail, thus
one can wrap it in a purrr::possibly, and
choose a reasonable default:
df %>%
map(possibly(. %>% {
list(data = .$result@data,
coords = .$result@coords)
},
otherwise = list(data = NA, coords = NA))) %>%
enframe() %>%
unnest_wider(value)
Hopefully, this could be a step forward.
Next step is probably something resembling this:
df %>%
map(. %>% {
list(data = .$result@data,
coords = .$result@coords)
}) %>%
enframe() %>%
unnest_wider(value) %>%
mutate(coords = coords %>% map(. %>% as_tibble(rownames = "rowid"))) %>%
unnest(cols = c(data, coords)) %>%
#' rotating the thing now
pivot_longer(cols = c(group, rowid),
names_to = "var_name",
values_to = "var") %>%
select(-var_name) %>%
pivot_longer(cols = c(var1, var2, V1, V2, V3),
names_to = "var_name") %>%
pivot_wider(names_from = var, values_from = value) %>%
identity()
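For comparison, the per-element cbind from the question can also be sketched in base R with lapply and tryCatch, so that junk elements are skipped instead of stopping the run. Everything here is an assumption made for illustration: the class definition, the make_obj and bind_one helpers, and the toy list nested stand in for the real data.

```r
library(methods)

# Toy S4 class standing in for the real objects (assumed structure)
setClass("weird_object", slots = c(data = "data.frame", coords = "matrix"))

make_obj <- function() {
  new("weird_object",
      data = data.frame(group = c("a", "b", "c"), var1 = c(3, 1, 2)),
      coords = matrix(1:6, nrow = 3, dimnames = list(NULL, c("x", "y"))))
}

# The second element is deliberately broken to mimic a junk entry
nested <- list(list(result = make_obj()),
               list(result = NULL),
               list(result = make_obj()))

# cbind the two slots; return NULL when an element cannot be processed
bind_one <- function(el) {
  tryCatch(cbind(el$result@data, el$result@coords),
           error = function(e) NULL)
}

pieces <- lapply(nested, bind_one)
pieces <- Filter(Negate(is.null), pieces)  # drop the failed elements
combined <- do.call(rbind, pieces)
combined
```

With the real list, seq_along() over the 171 elements is implicit in lapply, so no index bookkeeping is needed.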
If I understand your data structure, which I probably don't, you could do:
library(tidyverse)
# Create dummy data
df <- mtcars
df$mpg <- list(result = I(list('test')))
df$mpg$result <- list("@data" = I(list('your data')))
df <- df %>% select(mpg, cyl)
df1 <- df
df2 <- df
# Pull data you're interested in.
# The index is 1 here, instead of 2, because it's fake data and not your data.
# Assuming the @ is not unique, and is just parsed from JSON or some other format.
dont_at_me <- function(x){
a <- x[[1]][["result"]][["@data"]]
a
}
# Get a list of all of your data.frames
all_dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
# Vectorize
purrr::map(all_dfs, ~dont_at_me(.))
I am trying to perform dplyr summarize iteratively, using a concatenated string as the column name.
Category=c("a","a","b","b","b","c","c","c")
A1=c(1,2,3,4,3,2,1,2)
A2=c(10,11,12,13,14,15,16,17)
tt=cbind(Category,A1,A2)
tdat=data.frame(tt)
colnames(tdat)=c("Category","M1","M2")
ll=matrix(1:2,nrow=2)
for(i in 1:nrow(ll)) {
Aone=tdat %>% group_by(Category) %>%
summarize(Msum=sum(paste("M",i,sep="")))
}
I end up with the following error:
x invalid 'type' (character) of argument
ℹ Input Msum is sum(paste("M", i, sep = "")).
ℹ The error occurred in group 1: Category = "A".
Run rlang::last_error() to see where the error occurred.
The goal is to apply arithmetic functions iteratively within dplyr's summarize function, but the concatenated string is not recognized as a column name.
If we want to pass a string as a column name, convert it to a symbol with rlang::sym() and unquote it with !!:
library(dplyr)
Aone <- vector('list', nrow(ll))
for(i in seq_len(nrow(ll))) {
Aone[[i]] <- tdat %>%
group_by(Category) %>%
summarize(Msum = sum(!! rlang::sym(paste("M", i, sep=""))))
}
Or, assuming the column names are 'M-1', 'M-2', etc., the same approach works:
Aone <- vector('list', 2)
for(i in seq_along(Aone)) {
Aone[[i]] <- tdat %>%
group_by(Category) %>%
summarise(Msum = sum(!! rlang::sym(paste("M-", i, sep=""))),
.groups = 'drop')
}
NOTE: The ll was not clear in the original post. Here, we create a list with length equal to the number of 'M-' columns and assign the output back to the list element by looping over the sequence of that list
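As an alternative sketch, the .data pronoun accepts a string directly, so no conversion to a symbol is needed. This assumes dplyr 1.0.0 or later (for the .groups argument) and uses a small numeric version of the question's data, built here for illustration.

```r
library(dplyr)

# Numeric version of the example data (assumed; built directly, not via cbind)
tdat <- data.frame(
  Category = c("a", "a", "b", "b", "b", "c", "c", "c"),
  M1 = c(1, 2, 3, 4, 3, 2, 1, 2),
  M2 = c(10, 11, 12, 13, 14, 15, 16, 17)
)

# One summary per column; .data[[col]] looks the column up by its string name
Aone <- lapply(1:2, function(i) {
  col <- paste0("M", i)
  tdat %>%
    group_by(Category) %>%
    summarise(Msum = sum(.data[[col]]), .groups = "drop")
})

Aone[[1]]
```

Building tdat with data.frame() rather than cbind() keeps M1 and M2 numeric; cbind() with a character vector would coerce every column to character, and sum() would fail again.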
data
tdat <- data.frame(Category, M1 = A1, M2 = A2)
tdat <- structure(list(Category = c("A", "A", "A", "A", "B", "B", "B",
"B"), `M-1` = c(1, 2, 3, 4, 3, 2, 1, 2), `M-2` = c(10, 11, 12,
13, 14, 15, 16, 17)), class = "data.frame", row.names = c(NA,
-8L))
I frequently work with data frames and have to run some sophisticated data wrangling / manipulations by subgroup that is defined in one of the columns. I am aware of dplyr and group_by and know that many things could be solved using group_by. However, often I have to do some pretty intricate calculations and end up just using the 'for' loop.
I was wondering about the existence of some other general approach or paradigm that is faster/more elegant. Maybe map (that I am not very familiar with)?
Below is an example. Notice - it is fake and meaningless. So let's ignore why I need to do those things or the fact that there could be 2 consecutive NAs in a column, etc. That's not the focus of my question. The point is that often I have to operate "within the constraints of a subgroup" and then - inside that subgroup - I have to do operations columnwise, rowwise and sometimes even cellwise.
I also realize that I could probably put most of that code inside a function, split my data frame into a list based on 'group', apply this function to each element of that list and then do.call(rbind...) at the end. But is this the only way?
Thanks a lot for any hints!
library(dplyr)
library(forcats)
set.seed(123)
x <- tibble(group = c(rep('a', 10), rep('b', 10), rep('c', 10)),
attrib = c(sample(c("one", "two", "three", "four"), 10, replace = T),
sample(c("one", "two", "three"), 10, replace = T),
sample(c("one", "three", "four"), 10, replace = T)),
v1 = sample(c(1:5, NA), 30, replace = T),
v2 = sample(c(1:5, NA), 30, replace = T),
v3 = sample(c(1:5, NA), 30, replace = T),
n1 = abs(rnorm(30)), n2 = abs(rnorm(30)), n3 = abs(rnorm(30)))
v_vars = paste0("v", 1:3)
n_vars = paste0("n", 1:3)
results <- NULL # Placeholder for final results
for(i in seq(length(unique(x$group)))) { # loop through groups
mygroup <- unique(x$group)[i]
mysubtable <- x %>% filter(group == mygroup)
# IMPUTE NAs in v columns
# Replace every NA with a mean of values above and below it; and if it's the first or
# the last value, with the mean of 2 values below or above it.
for (v in v_vars){ # loop through v columns
which_nas <- which(is.na(mysubtable[[v]])) # create index of NAs for column v
if (length(which_nas) == 0) next else {
for (na in which_nas) { # loop through indexes of column values that are NAs
if (na == 1) {
mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na + 1],
mysubtable[[v]][na + 2]), na.rm = TRUE)
} else if (na == nrow(mysubtable)) {
mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na - 2],
mysubtable[[v]][na - 1]), na.rm = TRUE)
} else {
mysubtable[[v]][na] <- mean(c(mysubtable[[v]][na - 1],
mysubtable[[v]][na + 1]), na.rm = TRUE)
}
} # end of loop through NA indexes
} # end of else
} # end of loop through v vars
# Aggregate v columns (mean) for each value of column 'attrib'
result1 <- mysubtable %>% group_by(attrib) %>%
summarize_at(v_vars, mean)
# Aggregate n columns (sum) for each value of column 'attrib'
result2 <- mysubtable %>% group_by(attrib) %>%
summarize_at(n_vars, sum)
# final result should contain the name of the group
results[[i]] <- cbind(mygroup, result1, result2[-1])
}
results <- do.call(rbind, results)
Maybe this example is too simple, but in this case, the only thing you need to pull out is the imputation.
my_impute <- function(x) {
which_nas <- which(is.na(x))
for (na in which_nas) {
if (na == 1) {
x[na] <- mean(c(x[na + 1], x[na + 2]), na.rm = TRUE)
} else if (na == length(x)) {
x[na] <- mean(c(x[na - 2], x[na - 1]), na.rm = TRUE)
} else {
x[na] <- mean(c(x[na - 1], x[na + 1]), na.rm = TRUE)
}
}
x
}
Then you just need to group appropriately and impute and summarize.
x2 <- x %>% group_by(group) %>% mutate_at(v_vars, my_impute) %>%
group_by(group, attrib)
full_join(x2 %>% summarize_at(v_vars, mean),
x2 %>% summarize_at(n_vars, sum))
My usual method for things like this, where similar calculations need to be on a bunch of columns, is to put it in long format. Here it feels a little like the long way round, but perhaps this would be useful to see.
x %>% mutate(row=1:n()) %>% gather("variable", "value", c(v_vars, n_vars)) %>%
separate(variable, c("var", "x"), sep=1) %>% spread(var, value) %>%
arrange(group, x, row) %>% group_by(group, x) %>%
mutate(v=my_impute(v)) %>% group_by(group, attrib, x) %>%
summarize(v=mean(v), n=sum(n)) %>%
gather("var", "value", v, n) %>% mutate(X=paste0(var, x)) %>%
select(-x, -var) %>% spread(X, value)
More generally, split-apply-combine is probably the way to go, as you suggest in your question; here's a way using the tidyverse.
doX <- function(x) {
x2 <- x %>% mutate_at(v_vars, my_impute) %>% group_by(attrib)
full_join(x2 %>% summarize_at(v_vars, mean),
x2 %>% summarize_at(n_vars, sum))
}
x %>% group_by(group) %>% nest() %>%
mutate(result=map(data, doX)) %>% select(-data) %>% unnest(cols = result)
The more traditional method is with do.call, split, and rbind; here I don't make the effort to keep the group information.
do.call(rbind, lapply(split(x, x$group), doX))
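Another split-apply-combine spelling, sketched here under the assumption that dplyr 0.8.1 or later is available, is group_modify(), which applies a function to each group's rows and binds the results while keeping the group column. The toy data and the per_group helper are made up for illustration, and the imputation step is replaced by na.rm = TRUE to keep the sketch short.

```r
library(dplyr)

# Small toy table (assumed, not the question's data)
x <- tibble(
  group  = rep(c("a", "b"), each = 4),
  attrib = rep(c("one", "two"), times = 4),
  v1     = c(1, NA, 3, 4, NA, 2, 2, 6),
  n1     = 1:8
)

# Hypothetical per-group function: receives one group's rows as a tibble
per_group <- function(d) {
  d %>%
    group_by(attrib) %>%
    summarise(v1 = mean(v1, na.rm = TRUE), n1 = sum(n1), .groups = "drop")
}

res <- x %>%
  group_by(group) %>%
  group_modify(~ per_group(.x)) %>%
  ungroup()

res
```

Any arbitrarily intricate per-group logic can live inside per_group, as long as it returns a data frame.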
The first thing to do is to change your data imputing into a function. I made some simple modifications to have it accept a vector and simplified the call to mean.
fx_na_rm <- function(z) {
which_nas <- which(is.na(z))
if (length(which_nas) > 0) {
for (na in which_nas) { # loop through indexes of column values that are NAs
if (na == 1) {
z[na] <- mean(z[na + (1:2)], na.rm = TRUE)
} else if (na == length(z)) {
z[na] <- mean(z[na - (1:2)], na.rm = TRUE)
} else {
z[na] <- mean(z[c(na - 1, na + 1)], na.rm = TRUE)
}
} # end of loop through NA indexes
}
return(z)
}
I like data.table so here's a solution that uses it. Now since you use different functions for the n and v variable groups, most purrr or any other solutions will also be a little funny.
library(data.table)
dt <- copy(as.data.table(x))
v_vars = paste0("v", 1:3)
n_vars = paste0("n", 1:3)
dt[, (v_vars) := lapply(.SD, as.numeric), .SDcols = v_vars]
dt[, (v_vars) := lapply(.SD, fx_na_rm), by = group, .SDcols = v_vars]
# see https://stackoverflow.com/questions/50626316/r-data-table-apply-function-a-to-some-columns-and-function-b-to-some-others
scols <- list(v_vars, n_vars)
funs <- rep(c(mean, sum), lengths(scols))
dt[, setNames(Map(function(f, x) f(x), funs, .SD), unlist(scols))
, by = .(group,attrib)
, .SDcols = unlist(scols)]
The for loop itself is difficult to vectorize because each imputed value can depend on values imputed earlier in the same loop. Here is my attempt, which does not produce output identical to yours:
# not identical
fx_na_rm2 <- function(z) {
which_nas <- which(is.na(z))
if (length(which_nas) > 0) {
ind <- c(rbind(which_nas - 1 + 2 * (which_nas == 1) + -1 * (which_nas == length(z)),
which_nas + 1 + 1 * (which_nas == 1) + -2 * (which_nas == length(z))))
z[which_nas] <- colMeans(matrix(z[ind], nrow = 2), na.rm = T)
}
return(z)
}
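To see where the two approaches diverge, here is a small self-contained sketch with two consecutive NAs. The names impute_seq and impute_vec are hypothetical, compact restatements of the sequential and vectorized logic above, not the exact functions from this answer.

```r
# Sequential version: each NA is filled in turn, so later NAs can
# see values that were imputed earlier in the loop.
impute_seq <- function(z) {
  for (na in which(is.na(z))) {
    idx <- if (na == 1) {
      na + 1:2
    } else if (na == length(z)) {
      na - 2:1
    } else {
      c(na - 1, na + 1)
    }
    z[na] <- mean(z[idx], na.rm = TRUE)
  }
  z
}

# Vectorized version: all neighbours are read from the original vector,
# so an NA next to another NA only sees its one non-missing neighbour.
impute_vec <- function(z) {
  nas <- which(is.na(z))
  if (length(nas) == 0) return(z)
  lo <- nas - 1 + 2 * (nas == 1) - (nas == length(z))
  hi <- nas + 1 + (nas == 1) - 2 * (nas == length(z))
  z[nas] <- colMeans(rbind(z[lo], z[hi]), na.rm = TRUE)
  z
}

z <- c(1, NA, NA, 4)
seq_out <- impute_seq(z)  # 1, 1, 2.5, 4
vec_out <- impute_vec(z)  # 1, 1, 4, 4
```

In the sequential run the third value is the mean of the freshly imputed 1 and the 4; the vectorized run never sees that imputed 1. When a column has no adjacent NAs, the two agree.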
My question is about performing a calculation between each pair of groups in a data.frame; I'd like it to be more vectorized.
I have a data.frame that consists of the following columns: Location, Sample, Var1, and Var2. I'd like to find the closest match for each Sample for each pair of Locations for both Var1 and Var2.
I can accomplish this for one pair of locations as such:
df0 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(1:30), times =3),
Var1 = sample(1:25, 90, replace =T),
Var2 = sample(1:25, 90, replace=T))
df00 <- data.frame(Location = rep(c("A", "B", "C"), each =30),
Sample = rep(c(31:60), times =3),
Var1 = sample(1:100, 90, replace =T),
Var2 = sample(1:100, 90, replace=T))
df000 <- rbind(df0, df00)
df <- sample_n(df000, 100) # data
dfl <- df %>% gather(VAR, value, 3:4)
df1 <- dfl %>% filter(Location == "A")
df2 <- dfl %>% filter(Location == "B")
df3 <- merge(df1, df2, by = c("VAR"), all.x = TRUE, allow.cartesian=TRUE)
df3 <- df3 %>% mutate(DIFF = abs(value.x-value.y))
result <- df3 %>% group_by(VAR, Sample.x) %>% top_n(-1, DIFF)
I tried other possibilities such as using dplyr::spread but could not avoid the "Error: Duplicate identifiers for rows" or columns half filled with NA.
Is there a cleaner, more automated way to do this for each possible group pair? I'd like to avoid the manual subset and merge routine for each pair.
One option would be to create the pairwise combinations of 'Location' with combn and then follow the other steps in the OP's code:
library(tidyverse)
df %>%
# get the unique elements of Location
distinct(Location) %>%
# pull the column as a vector
pull %>%
# it is factor, so convert it to character
as.character %>%
# get the pairwise combinations in a list
combn(m = 2, simplify = FALSE) %>%
# loop through the list with map and do the full_join
# with the long format data (dfl in the question)
map(~ full_join(dfl %>%
filter(Location == first(.x)),
dfl %>%
filter(Location == last(.x)), by = "VAR") %>%
# create a column of absolute difference
mutate(DIFF = abs(value.x - value.y)) %>%
# grouped by VAR, Sample.x
group_by(VAR, Sample.x) %>%
# apply the top_n with wt as DIFF
top_n(-1, DIFF))
Alternatively, since the OP asked about avoiding the double filter, both Locations can be filtered at once and spread to wide format (the expected output is not entirely clear, though):
df %>%
distinct(Location) %>%
pull %>%
as.character %>%
combn(m = 2, simplify = FALSE) %>%
map(~ dfl %>%
# change here i.e. filter both the Locations
filter(Location %in% .x) %>%
# spread it to wide format
spread(Location, value, fill = 0) %>%
# create the DIFF column by taking the difference
mutate(DIFF = abs(!! rlang::sym(first(.x)) -
!! rlang::sym(last(.x)))) %>%
group_by(VAR, Sample) %>%
top_n(-1, DIFF))
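To make the pairwise step concrete, here is a small base R sketch of what combn returns and how each pair can be processed in turn. The toy values in vals and the closest-match logic via outer() are assumptions for illustration, not the OP's data or the exact pipeline above.

```r
locs <- c("A", "B", "C")

# All unordered pairs of locations, as a list of length-2 character vectors
pairs <- combn(locs, m = 2, simplify = FALSE)

# Toy values per location (made up)
vals <- list(A = c(1, 10), B = c(2, 7), C = c(11, 4))

# For each pair, find for every value in the first location
# the closest value in the second location
closest <- lapply(pairs, function(p) {
  d <- abs(outer(vals[[p[1]]], vals[[p[2]]], "-"))  # all absolute differences
  data.frame(from  = p[1],
             to    = p[2],
             value = vals[[p[1]]],
             match = vals[[p[2]]][apply(d, 1, which.min)])
})

closest[[1]]
```

With three locations this yields three pairs (A-B, A-C, B-C), and the same function runs on each without any manual subsetting.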
I can use !! to filter by a user-given variable but not to modify that same variable. The following function throws an error when created, but it works just fine if I delete the mutate call.
avg_dims <- function(x, y) {
y <- enquo(y)
x %>%
filter(!!y != "TOTAL") %>%
mutate(!!y = "MEAN") %>%
group_by(var1, var2)
}
Unquoting a name on the left-hand side of an assignment requires the := operator instead of =. The name must also be a string or a symbol, so we convert the quosure ('y' from enquo) to a string with quo_name and then unquote it (!!):
avg_dims <- function(x, y) {
y <- enquo(y)
y1 <- rlang::quo_name(y)
x %>%
filter(!!y != "TOTAL") %>%
mutate(!!y1 := "MEAN") %>%
group_by(var1, var2)
}
avg_dims(df1, varN)
data
set.seed(24)
df1 <- data.frame(var1 = rep(LETTERS[1:3], each = 4),
var2 = rep(letters[1:2], each = 6),
varN = sample(c("TOTAL", "hello", 'bc'), 12, replace = TRUE),
stringsAsFactors = FALSE)
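With more recent versions of dplyr and rlang, the enquo/quo_name steps can be folded into the {{ }} operator. The sketch below assumes dplyr 1.0.0 or later, where the glue-style "{{ y }}" name on the left of := is documented; the function name avg_dims2 and the fixed varN values are made up so the example does not depend on the random seed.

```r
library(dplyr)

df1 <- data.frame(
  var1 = rep(LETTERS[1:3], each = 4),
  var2 = rep(letters[1:2], each = 6),
  varN = c("TOTAL", "hello", "bc", "TOTAL", "hello", "hello",
           "bc", "TOTAL", "bc", "hello", "TOTAL", "bc"),
  stringsAsFactors = FALSE
)

# Hypothetical variant of avg_dims using {{ }} instead of enquo() and !!
avg_dims2 <- function(x, y) {
  x %>%
    filter({{ y }} != "TOTAL") %>%
    # "{{ y }}" on the left of := names the column after the argument
    mutate("{{ y }}" := "MEAN") %>%
    group_by(var1, var2)
}

out <- avg_dims2(df1, varN)
out
```

The := operator is still required; only the way the column name is supplied changes.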