How to pass dynamic column name to h2o arrange function - r

Given an h2o data frame df with a numeric column col, sorting df by col works when the column name is given as a literal string:
h2o.arrange(df, "col")
But the sort doesn't work when I pass the column name in a variable:
var <- "col"
h2o.arrange(df, var)
I do not want to hard-code the column name. Is there any way to solve this? Thanks.
Added an example per Darren's request:
library(h2o)
h2o.init()
df <- as.h2o(cars)
var <- "dist"
h2o.arrange(df, var) # got error
h2o.arrange(df, "dist") # works

It turns out to be quite tricky, but you can get the dynamic column name to be evaluated by using call(). So, to follow on from your example:
var <- "dist"
eval(call("h2o.arrange",df,var))
Gives:
speed dist
1 4 2
2 7 4
3 4 10
4 9 10
Then:
var <- "speed"
eval(call("h2o.arrange",df,var))
Gives:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
(I'd love to say that was the first thing I thought of, but it was more like experiment number 54! I was about halfway down http://adv-r.had.co.nz/Expressions.html There might be other, better ways, to achieve the same thing.)
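One plausible reason the direct call fails, stated here as an assumption rather than a description of h2o's internals, is that the function captures its column argument unevaluated, so it sees the symbol var instead of the string it holds. The sketch below reproduces that behaviour on a plain data frame with a hypothetical helper, arrange_nse (not part of h2o), and shows why building the call first helps:

```r
# Hypothetical stand-in (not an h2o function): it captures the column
# argument unevaluated, so it sees the symbol `var`, not the value "dist".
arrange_nse <- function(x, col) {
  by <- as.character(substitute(col))
  if (!by %in% names(x)) stop("no column named '", by, "'")
  x[order(x[[by]]), , drop = FALSE]
}

var <- "dist"

# Direct call: the function looks for a column literally called "var".
direct <- tryCatch(arrange_nse(cars, var),
                   error = function(e) conditionMessage(e))

# call() evaluates `var` while building the call, so the string "dist"
# is what ends up inside it; eval() then runs arrange_nse(cars, "dist").
sorted <- eval(call("arrange_nse", cars, var))

# do.call() builds and evaluates the same kind of call in one step.
sorted2 <- do.call(arrange_nse, list(cars, var))
```

If the real function behaves like this stand-in, do.call() may be a slightly more readable alternative to eval(call(...)).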
By the way, another approach to achieve the same result is:
var = 1
h2o:::.newExpr("sort", df, var)
and
var = 0
h2o:::.newExpr("sort", df, var)
respectively. That is, the third argument is the zero-based index of the column, which you can get with match(var, names(df)) - 1. By this point you've implemented 75% of h2o.arrange().
(Remember that any time you end up using h2o::: you are taking the risk that it will not work in some future version of H2O.)

Related

How do I use dataframe names as inputs for var in for loop (R language)?

In R, I defined the following function:
race_ethn_tab <- function(x) {
x %>%
group_by(RAC1P) %>%
tally(wt = PWGTP) %>%
print(n = 15) }
The function simply generates a weighted tally for a given dataset, for example, race_ethn_tab(ca_pop_2000) generates a simple 9 x 2 table:
1 Race 1 22322824
2 Race 2 2144044
3 Race 3 228817
4 Race 4 1827
5 Race 5 98823
6 Race 6 3722624
7 Race 7 116176
8 Race 8 3183821
9 Race 9 1268095
I have to do this for several (approx. 10 distinct datasets) where it's easier for me to keep the dfs distinct rather than bind them and create a year variable. So, I am trying to use either a for loop or purrr::map() to iterate through my list of dfs.
Here is what I tried:
dfs_test <- as.list(as_tibble(ca_pop_2000),
as_tibble(ca_pop_2001),
as_tibble(ca_pop_2002),
as_tibble(ca_pop_2003),
as_tibble(ca_pop_2004))
# Attempt 1: Using for loop
for (i in dfs_test) {
race_ethn_tab(i)
}
# Attempt 2: Using purrr::map
race_ethn_outs <- map(dfs_test, race_ethn_tab)
Both attempts are telling me that group_by can't be applied to a factor object, but I can't figure out why the elements in dfs_test are being registered as factors given that I am forcing them into the tibble class. Would appreciate any tips based on my approach or alternative approaches that could make sense here.
This, from @RonakShah, was exactly what was needed:
Your code should work if you use list instead of as.list. See the output of
as.list(as_tibble(mtcars), as_tibble(iris)) vs list(as_tibble(mtcars),
as_tibble(iris)) – Ronak Shah Oct 2 at 0:23
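The difference described in that comment can be checked directly. This is a minimal sketch that uses base data frames rather than tibbles to stay dependency-free, on the assumption that tibbles behave the same way here:

```r
# as.list() on a data frame returns its columns; extra arguments are ignored.
cols <- as.list(mtcars, iris)
length(cols)              # 11, one element per column of mtcars
is.data.frame(cols[[1]])  # FALSE: the first element is the mpg vector

# list() keeps each data frame as one element.
dfs <- list(mtcars, iris)
length(dfs)               # 2
is.data.frame(dfs[[1]])   # TRUE
```

So looping over the as.list() result hands the function one column at a time, which is why group_by complains about the class of its input.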
We can use mget to return a list of datasets, then loop over the list and apply the function
dfs_test <- mget(paste0("ca_pop_", 2000:2004))
It can also be made more general if we use ls:
dfs_test <- mget(ls(pattern = '^ca_pop_\\d{4}$'))
map(dfs_test, race_ethn_tab)
This would make it easier if there are 100s of objects already created in the global environment instead of doing
list(ca_pop_2000, ca_pop_2001, .., ca_pop_2020)
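A small self-contained sketch of the mget approach follows. The two tiny data frames, the environment env, and the base-R weighted tally are illustrative assumptions standing in for the real ca_pop_* objects and race_ethn_tab:

```r
# Toy objects in a dedicated environment (stand-ins for the real data).
env <- new.env()
assign("ca_pop_2000", data.frame(RAC1P = c(1, 1, 2), PWGTP = c(10, 20, 5)), envir = env)
assign("ca_pop_2001", data.frame(RAC1P = c(1, 2, 2), PWGTP = c(7, 3, 4)), envir = env)
assign("other_object", 1:3, envir = env)  # must not be picked up

# Collect every object whose name matches the pattern into a named list.
dfs <- mget(ls(env, pattern = "^ca_pop_\\d{4}$"), envir = env)
names(dfs)  # "ca_pop_2000" "ca_pop_2001"

# Weighted tally per data set: sum of PWGTP within each RAC1P value.
totals <- lapply(dfs, function(d) tapply(d$PWGTP, d$RAC1P, sum))
```

Because mget returns a named list, the results stay labelled by data set, which helps when the data frames are kept separate by year.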

R: "Adding" 2 variables (columns) to create an aggregate variable (column)?

This may be a strange request - I hope I am wording it correctly:
I have a dataset (df) and three variables (BELONG_1, GRPOR_14, ETHNIC10) that I want to "add" so as to get an aggregate variable (PsychIntegration) that can be run in a regression analysis - e.g. controlling for other variables such as gender, age etc.
BELONG_1: 1=Do not belong - 10=Do belong
GRPOR_14: 1=Not proud - 10=Very proud
ETHNIC10: 1=Not important - 4=Very important
The Cronbach Alpha for these three variables is 0.62
ID BELONG_1 GRPOR_14 ETHNIC10 PsychIntegration
1 10 8 4 ??
2 3 4 2 ??
3 7 10 3 ??
4 1 1 1 ??
How exactly do I "add" (?) these variables to get PsychIntegration?
I hope that makes sense - thanks again!
Try using the package dplyr (or package tidyverse). Assuming your data are stored in a data.frame named df
df <- df %>%
mutate(PsychIntegration = rowSums(select(., -ID)))
if you want to sum all columns except ID within each row. Use
df <- df %>%
mutate(PsychIntegration = BELONG_1 + GRPOR_14 + ETHNIC10)
if you just want to sum those three columns.
Using just base R one possibility is
df$PsychIntegration <- df$BELONG_1 + df$GRPOR_14 + df$ETHNIC10
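As a further base R sketch, rowSums can be restricted to the named items, which scales better if more items are added later. The data frame below is rebuilt from the four example rows in the question; the vector name items is just an illustrative choice:

```r
# Example rows from the question.
df <- data.frame(
  ID       = 1:4,
  BELONG_1 = c(10, 3, 7, 1),
  GRPOR_14 = c(8, 4, 10, 1),
  ETHNIC10 = c(4, 2, 3, 1)
)

# Columns that make up the aggregate score.
items <- c("BELONG_1", "GRPOR_14", "ETHNIC10")

# Row-wise sum over those columns only; a row with a missing item
# yields NA unless na.rm = TRUE is passed.
df$PsychIntegration <- rowSums(df[, items])
df$PsychIntegration  # 22 9 20 3
```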

dplyr: Can a function called inside mutate find the element of a column from current row

I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient, and return a vector containing dat in the unaffected years and dat times that coefficient in the selected years, which is to replace dat.
year <- rep(1:5, times = c(2,2,2,2,2))
dat <- 1:10
df <- tibble(year, dat)
adjust = c(rep(0, 4), rep(c(1 + 0.1*1:3), c(2,2,2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I run this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire vector year rather than the year of the current row, so there is then a vectorized subtraction, whereupon [[ chokes on the vector-valued index.
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse’s working now. My question is, is there any way to do what I was trying to do in an R- and dplyr- idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
You need to use [ instead of [[ for vector indexing. Also, year - 2 produces non-positive indices (-1 and 0) for years 1 and 2, which causes further problems. If you want to map year to adjust by index position, you can use replace with a mask that marks the years to be modified:
df %>%
mutate(dat = {
mask = year > 2;
replace(year, mask, dat[mask] * adjust[year[mask] - 2])
})
# A tibble: 10 x 2
# year1 dat1
# <int> <dbl>
# 1 1 1.0
# 2 1 1.0
# 3 2 2.0
# 4 2 2.0
# 5 3 5.5
# 6 3 6.6
# 7 4 8.4
# 8 4 9.6
# 9 5 11.7
#10 5 13.0
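Another way to think about it is as a lookup: build one multiplier per row from a named vector keyed by year, then multiply. The sketch below assumes coefficients of 1.1, 1.2 and 1.3 for years 3, 4 and 5 (the values implied by the output above), and it keeps dat in the unaffected rows, as the question's prose describes, whereas the replace call above puts year there:

```r
year <- rep(1:5, each = 2)
dat  <- 1:10

# Assumed coefficients, keyed by the year they apply to.
coef <- c("3" = 1.1, "4" = 1.2, "5" = 1.3)

# Vectorised lookup: one multiplier per row, NA where no coefficient exists.
mult <- unname(coef[as.character(year)])
mult[is.na(mult)] <- 1  # years without a coefficient are left unchanged

adjusted <- dat * mult
adjusted
```

The same expression can be used inside mutate, since every step is vectorised and the number of rows per year does not matter.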

Changing df of integers to num doesn't work

I'm a newbie in R and don't have much experience with solving errors, so one more time I need help. I have a data frame named n_occur that has 2 columns - number and freq. Both values are integers. I wanted to get the histogram, but I got the error: argument x must be numeric, so I wanted to change both columns into num.
Firstly I tried with the simplest way:
n_occur[,] = as.numeric(as.character(n_occur[,]))
but as a result all values changed into NA. So after searching on stack, I decided to use this formula:
indx <- sapply(n_occur, is.factor)
n_occur[indx] <- lapply(n_occur[indx], function(x) as.numeric(as.character(x)))
and nothing changed, I still have integers and hist still doesn't work. Any ideas how to do that?
If anyone needs it in future, I solved the problem with mutate from dplyr:
n_occur <- mutate(n_occur, freq=as.numeric(freq))
for both columns separately. It worked!
I don't think you really have to do that, just supply the function with the actual columns instead of the entire dataframe. For example:
n_occur = data.frame(
number = sample(as.integer(1:10), 10, TRUE),
freq = sample(as.integer(0:10), 10, TRUE)
)
str(n_occur)
'data.frame': 10 obs. of 2 variables:
$ number: int 9 8 8 5 6 7 8 10 3 4
$ freq : int 0 9 2 0 4 10 7 2 7 9
hist(n_occur$number) # works
hist(n_occur$freq) # works
plot(n_occur$number, n_occur$freq, type = 'h') # works
hist(n_occur) # fails since it is the whole dataframe
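Also, if you still want to convert, note that as.numeric() on a factor returns the internal level codes, which only happen to match the values for factor(1:10). To recover the original numbers, go through character first:
as.numeric(as.character(factor(1:10)))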
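For completeness, here is a hedged sketch of converting every column of a data frame at once, plus a quick check of how factors behave under as.numeric. The values in n_occur and f are made up for illustration:

```r
# Integer columns convert directly; no detour through character is needed.
n_occur <- data.frame(number = c(3L, 5L, 5L, 9L), freq = c(1L, 2L, 2L, 4L))
n_occur[] <- lapply(n_occur, as.numeric)  # [] keeps the data frame structure
sapply(n_occur, class)                    # both columns are now "numeric"

# Factors are different: as.numeric() returns the level codes.
f <- factor(c(10, 20, 20, 5))
codes  <- as.numeric(f)                # 2 3 3 1
values <- as.numeric(as.character(f))  # 10 20 20 5
```

Note that hist() still needs a single column such as n_occur$freq, whatever the column type.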

Getting corresponding values from data.frame

My problem is that I can't really put it into words, which makes it hard to google, so I am asking here. I hope you can shed some light on my issue:
I got a data.frame like this:
6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1
As you can see, in the first column 0 appears twice, 1 appears twice, and so on. What I would like to do is get all the corresponding values for one number, say 0, from the second column (in this example 7 and 2), preferably as a data.frame.
I know the attempt with df$V2[which(df$V1==0)], however since the first column might have over 100 rows I can't really use this. Do you guys have a good solution?
Maybe some words regarding the background of this question: I need to process this data, i.e. get the mean of the second column for all 0's in the first columns, or get min/max values.
Regards
Here is a solution using dplyr:
df %>% group_by(V1) %>% summarize(ME=mean(V2))
Using your data (with some temporary names attached)
txt <- "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1"
df <- read.table(text = txt)
names(df) <- paste0("Var", seq_len(ncol(df)))
Coerce the first column to be a factor
df <- transform(df, Var1 = factor(Var1))
Then you can use aggregate() with a nice formula interface
aggregate(Var2 ~ Var1, data = df, mean)
aggregate(Var2 ~ Var1, data = df, max)
aggregate(Var2 ~ Var1, data = df, min)
For example:
> aggregate(Var2 ~ Var1, data = df, mean)
  Var1 Var2
1    0  4.5
2    1  2.0
3    3  6.0
4    5  2.0
5    6  2.0
Or, using the default interface:
with(df, aggregate(Var2, list(Var1), FUN = mean))
> with(df, aggregate(Var2, list(Var1), FUN = mean))
Group.1 x
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
But the output is nicer from the formula interface.
Using data.table
library(data.table)
setDT(df)[, list(mean=mean(V2), max= max(V2), min=min(V2)), by = V1]
First, what exactly is the issue with the solution you suggest? Is it a question of efficiency? Frankly the code you present is close to optimal [1].
For the general case, you're probably looking at a split-apply-combine action: applying a function to subsets of the data based on some differentiator. As @teucer points out, dplyr (and its ancestor, plyr) is designed for exactly this, as is data.table. In vanilla R, you would tend to use by or aggregate (or split and sapply for more advanced usage) for the same task. For example, to compute group means, you would do
by(df$V2, df$V1, mean)
or
aggregate(df, list(type=df$V1), mean)
Or even
sapply(split(df$V2, df$V1), mean)
[1] The code can be simplified to df$V2[df$V1 == 0] or df[df$V1 == 0,] as well.
Thanks all for your replies. I decided to go for the dplyr solution posted by teucer and eipi10. Since I have a third (and maybe even a fourth) column, this solution seems to be pretty easy to use (just adding V3 to group_by).
Since some are asking what's wrong with df$V2[which(df$V1==0)]: I was maybe a bit unclear when saying "rows"; what I actually meant was "values". If I had n distinct values in the first column, I would have to run the command n times, once per distinct value, and store the n resulting vectors.
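That "one vector per distinct value" step is what split does in a single call. A minimal base R sketch, using the data from the question with the default column names V1 and V2 (the names groups and stats are illustrative):

```r
df <- data.frame(
  V1 = c(6, 5, 3, 0, 0, 1, 6, 1),
  V2 = c(4, 2, 6, 7, 2, 3, 0, 1)
)

# One vector of V2 values per distinct value of V1, all at once.
groups <- split(df$V2, df$V1)
groups[["0"]]  # 7 2

# Several summaries per group, one row per distinct V1 value.
stats <- t(sapply(groups, function(v) c(mean = mean(v), min = min(v), max = max(v))))
stats
```

The row names of stats are the distinct V1 values, so an individual result can be read off with, for example, stats["0", "mean"].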
