My function is defined as follows: I subset a dataframe to a specific name and return the first 5 rows.
Bestideas <- function(x) {
  topideas <- subset(Masterall, Masterall$NAME == x) %>%
    slice(1:5)
  return(topideas)
}
I would then like to apply the function, to an entire df (with one column of Names), so that the function is applied to each name on the list and binds it into a new df, containing the first five ideas from all unique names. Through research - I have arrived at the following:
bestideas_collection = lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
However, it doesn't work. It returns a dataframe with only five ideas in total, from 5 different names. Since there are 30 unique names in my list, I expected 30*5 = 150 ideas in the "bestideas_collection" variable. I get this error message:
"longer object length is not a multiple of shorter object length"
Further, if I do it manually for each name, it works just as intended - which makes me think that the function works fine, and that the issue is with the lapply function.
holder <- Bestideas("NAME 1")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 2")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 3")
bestideas_collection <- bind_rows(bestideas_collection,holder)
...
Can anyone help me if I am using the function wrong, or do you have alternative methods of doing it? I have already tried with a for-loop - but it gives me the same error as with the lapply function.
I don't have your data, so I tried to reproduce your problem on a fabricated set. I was unable to do so. With a very simple case, your function works as expected.
library(dplyr)
set.seed(123)
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260)) %>%
  group_by(NAME) %>% arrange(desc(value))
UNIQUE_NAMES_DF <- LETTERS
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
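Is your UNIQUE_NAMES_DF a data.frame? If so, that is the trouble. A data.frame is a list of columns, so lapply iterates over its columns and passes each whole column to Bestideas in a single call, rather than passing one name at a time. The comparison Masterall$NAME == x then recycles that column against NAME, which produces unexpected matches. Here is an example: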
UNIQUE_NAMES_DF <- data.frame(NAME = LETTERS, other = sample(letters))
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 12 x 2
# Groups: NAME [11]
NAME value
<chr> <dbl>
1 C -0.785
2 D 0.385
3 E -0.371
4 F 1.13
5 I 1.10
6 N -0.641
7 P -1.02
8 Q -0.0341
9 U -1.07
10 X -0.0834
11 Z 1.26
12 Z -0.739
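To make the column-wise behaviour concrete, here is a minimal base R sketch. The data frame names_df and the helper describe are made up for illustration; they are not part of the original question.

```r
# Hypothetical two-column data frame standing in for UNIQUE_NAMES_DF
names_df <- data.frame(NAME = c("A", "B", "C"), other = c("x", "y", "z"))

# Helper that reports how many values a single call receives
describe <- function(x) length(x)

# Over the data frame: one call per COLUMN, each receiving all 3 values
per_column <- lapply(names_df, describe)

# Over the NAME column: one call per NAME, each receiving 1 value
per_name <- lapply(names_df$NAME, describe)

length(per_column)  # number of calls made over the data frame
length(per_name)    # number of calls made over the column
```

When a function written for a single name receives a whole column, a comparison such as Masterall$NAME == x recycles the shorter vector, which is where the "longer object length" warning comes from.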
I do not know the structure of your UNIQUE_NAMES_DF, but if you just feed the column with the names into your lapply, it should work:
lapply(UNIQUE_NAMES_DF$NAME, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
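As for alternative methods, one option is to skip the helper function and let dplyr do the per-name slicing. This is only a sketch on the fabricated data from above: the vector wanted is a hypothetical stand-in for your list of names, and slice_head() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

set.seed(123)
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260))

# Hypothetical subset of names to keep
wanted <- c("A", "B", "C")

bestideas_collection <- Masterall %>%
  filter(NAME %in% wanted) %>%   # keep only the requested names
  group_by(NAME) %>%
  slice_head(n = 5) %>%          # first 5 rows within each name
  ungroup()
```

Like slice(1:5) in the original function, this takes the first five rows in their existing order, so sort Masterall first if "top" should mean something specific.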
I have a case where I need to apply a dynamically selected function onto a column of a tibble. In some cases, I don't want the values to change at all -- then I select the identity function I().
After applying I() the datatype of the column changes from <dbl> to <I<dbl>>. Why is that? Why is it not just double again?
library(tidyverse)
df <- tibble(x = (1:3*pi))
print(df)
# A tibble: 3 x 1
# x
# <dbl>
# 1 3.14
# 2 6.28
# 3 9.42
df <- df %>% mutate(x = I(x))  # %<>% would additionally need library(magrittr)
print(df)
# A tibble: 3 x 1
# x
# <I<dbl>> <-- Why <I...> and not <dbl>?
# 1 3.14
# 2 6.28
# 3 9.42
How can I just get <dbl>?
I() is not the identity function, technically (that would be identity). I() is to inhibit interpretation/conversion, saying that the component should be used "as is". Further I(...) returns an object of class "AsIs", which is and should be recognized as something unique from its non-I(...) counterpart. As for the effect of this class ... I don't know of any (though I don't use them regularly, so I might be missing something).
And you can still operate on this; it's just classed differently.
dput(1:3)
# 1:3
dput(I(1:3))
# structure(1:3, class = "AsIs")
tibble(x = (1:3*pi)) %>%
  mutate(x = I(x)) %>%
  mutate(y = x + 1)
# # A tibble: 3 x 2
# x y
# <I<dbl>> <I<dbl>>
# 1 3.14 4.14
# 2 6.28 7.28
# 3 9.42 10.4
though that new column is also "AsIs".
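If the goal is a plain double column, here is a small base R sketch of the two options implied above: use identity() as the "do nothing" function, or drop the "AsIs" class afterwards with unclass(). The variable names are only illustrative.

```r
x <- 1:3 * pi

as_is <- I(x)                # wrapped: carries the "AsIs" class
kept <- identity(x)          # true identity: returns x unchanged
stripped <- unclass(as_is)   # removes the "AsIs" class again

class(as_is)     # "AsIs"
class(kept)      # "numeric"
class(stripped)  # "numeric"
```

Inside mutate(), the same idea would be mutate(x = identity(x)) when choosing the function dynamically.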
I have a helper function (say foo()) that will be run on various data frames that may or may not contain specified variables. Suppose I have
library(dplyr)
d1 <- tibble(taxon=1,model=2,z=3)  # data_frame() is deprecated in favour of tibble()
d2 <- tibble(taxon=2,pss=4,z=3)
The variables I want to select are
vars <- intersect(names(data),c("taxon","model","z"))
that is, I'd like foo(d1) to return the taxon, model, and z columns, while foo(d2) returns just taxon and z.
If foo contains select(data,c(taxon,model,z)) then foo(d2) fails (because d2 doesn't contain model). If I use select(data,-pss) then foo(d1) fails similarly.
I know how to do this if I retreat from the tidyverse (just return data[vars]), but I'm wondering if there's a handy way to do this either (1) with a select() helper of some sort (tidyselect::select_helpers) or (2) with tidyeval (which I still haven't found time to get my head around!)
Another option is select_if:
d2 %>% select_if(names(.) %in% c('taxon', 'model', 'z'))
# # A tibble: 1 x 2
# taxon z
# <dbl> <dbl>
# 1 2 3
select_if is superseded. Use any_of instead:
d2 %>% select(any_of(c('taxon', 'model', 'z')))
# # A tibble: 1 x 2
# taxon z
# <dbl> <dbl>
# 1 2 3
Type ?dplyr::select in R and you will find this:
These helpers select variables from a character vector:
all_of(): Matches variable names in a character vector. All names must
be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names
that don't exist.
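Put together, the helper from the question could look roughly like this. It is a sketch assuming dplyr 1.0.0 or later (for any_of()); the argument name wanted is my own choice, not something from the question.

```r
library(dplyr)

# Select whichever of the wanted columns are present; missing ones are ignored
foo <- function(data, wanted = c("taxon", "model", "z")) {
  select(data, any_of(wanted))
}

d1 <- tibble(taxon = 1, model = 2, z = 3)
d2 <- tibble(taxon = 2, pss = 4, z = 3)

foo(d1)  # taxon, model and z
foo(d2)  # taxon and z only, with no error for the missing model column
```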
You can use one_of(), which gives a warning when the column is absent but otherwise selects the correct columns:
d1 %>%
  select(one_of(c("taxon", "model", "z")))
d2 %>%
  select(one_of(c("taxon", "model", "z")))
Using the built-in anscombe data frame for the example, and noting that z is not a column in anscombe:
anscombe %>% select(intersect(names(.), c("x1", "y1", "z")))
giving:
x1 y1
1 10 8.04
2 8 6.95
3 13 7.58
4 9 8.81
5 11 8.33
6 14 9.96
7 6 7.24
8 4 4.26
9 12 10.84
10 7 4.82
11 5 5.68
I want to create a function to take in a dataframe and a string assigned GENDER. The function will find the mean and sd of each variable in the df by GENDER and return a dataframe with all that info to a new df named "GENDERstats" that I could use in further analysis later on.
I can get everything I want up until the point where I name the new "GENDERstats" df; then it throws an error.
Here's what I have so far, with dummy data
df <- data.frame(GENDER=c("M","F","M","F","M","F"),HELP=c(5,4,2,7,5,5),CARE=c(6,4,7,8,5,4),TRUST=c(6,5,3,6,8,6),SERVE=c(6,5,7,8,7,6))
my.func <- function(dat, bias){
  datFrame <- data.frame()
  for(i in 2:5){
    d1 <- aggregate(dat[,i],by=list(dat[,bias]),FUN=mean,na.rm=TRUE)
    d2 <- aggregate(dat[,i],by=list(dat[,bias]),FUN=sd,na.rm=TRUE)
    d1$sd <- d2$x
    d1$Var <- i
    datFrame <- rbind(datFrame,d1)
  }
  # paste(bias,"stats") <- datFrame
}
I get the df I want in "datFrame", but I want to paste the bias variable and "stats" to make a new data frame. I will be doing this with several different "biases"
I want the new df to look like this:
Group.1 x sd Var
1 F 5.333333 1.5275252 2
2 M 4.000000 1.7320508 2
3 F 5.333333 2.3094011 3
4 M 6.000000 1.0000000 3
5 F 5.666667 0.5773503 4
6 M 5.666667 2.5166115 4
7 F 6.333333 1.5275252 5
8 M 6.666667 0.5773503 5
and from there I can plot graphs or only focus on means or sds
I'm not quite sure how to fix your function (a couple of details are missing), but you can get the same results without a user-defined function or for loop. The following iterates over combinations of GENDER + other variables, generates means and SDs with aggregate, and then rbinds the dataframes in do.call:
do.call("rbind", lapply(2:ncol(df),
  function(j) {
    df_out <- aggregate(df[j], list(df$GENDER), "mean")
    df_out[3] <-
      aggregate(df[j], list(df$GENDER), "sd")[[2]]
    df_out[4] <- j
    `names<-`(df_out, c("gender", "x", "sd", "var"))
  }))
#### OUTPUT ####
gender x sd var
1 F 5.33333 1.52753 2
2 M 4.00000 1.73205 2
3 F 5.33333 2.30940 3
4 M 6.00000 1.00000 3
5 F 5.66667 0.57735 4
6 M 5.66667 2.51661 4
7 F 6.33333 1.52753 5
8 M 6.66667 0.57735 5
I'm not sure if there isn't a slicker way of doing this in base R. Personally, I would go with dplyr's gather + group_by + summarise, which is much cleaner and easier to understand. The output is pretty much the same as the above, just in a different order. The rounding only looks different because of how tibbles are printed:
library(dplyr)
library(tidyr)
df %>%
  gather(var, val, -GENDER) %>%
  group_by(GENDER, var) %>%
  summarise(x = mean(val), sd = sd(val))
#### OUTPUT ####
# A tibble: 8 x 4
# Groups: GENDER [2]
GENDER var x sd
<chr> <chr> <dbl> <dbl>
1 F CARE 5.33 2.31
2 F HELP 5.33 1.53
3 F SERVE 6.33 1.53
4 F TRUST 5.67 0.577
5 M CARE 6 1
6 M HELP 4 1.73
7 M SERVE 6.67 0.577
8 M TRUST 5.67 2.52
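On the naming part of the question: a left-hand side like paste(bias,"stats") is not a valid assignment target. One possible approach, sketched here under my own assumptions, is to have the function return the data frame and store it in a named list keyed by paste0(bias, "stats"). The list name results is hypothetical, and this version records the column name in Var rather than the column index.

```r
df <- data.frame(GENDER = c("M", "F", "M", "F", "M", "F"),
                 HELP = c(5, 4, 2, 7, 5, 5), CARE = c(6, 4, 7, 8, 5, 4),
                 TRUST = c(6, 5, 3, 6, 8, 6), SERVE = c(6, 5, 7, 8, 7, 6))

my.func <- function(dat, bias) {
  value_cols <- setdiff(names(dat), bias)  # every column except the grouping one
  pieces <- lapply(value_cols, function(v) {
    d1 <- aggregate(dat[[v]], by = list(dat[[bias]]), FUN = mean, na.rm = TRUE)
    d2 <- aggregate(dat[[v]], by = list(dat[[bias]]), FUN = sd, na.rm = TRUE)
    d1$sd <- d2$x
    d1$Var <- v
    d1
  })
  do.call(rbind, pieces)  # returned, so the caller decides where to store it
}

# Hypothetical container: one entry per bias, named "<bias>stats"
results <- list()
results[[paste0("GENDER", "stats")]] <- my.func(df, "GENDER")
results$GENDERstats
```

assign(paste0(bias, "stats"), datFrame, envir = globalenv()) would create a free-standing variable instead, but a named list is usually easier to loop over when there are several biases.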
I have two data sets A and B, and for each observation in A I want to calculate a distance (e.g. a Euclidean distance, L1 distance, or something else) to each observation in B (the calculation of the distance is based on the variables in the data sets). An observation from A should then be matched to the observation in B for which this distance is minimal.
For example, if A has 5000 observations and B has 10000 observations then
for(i in 1:5000)
{
  x = data.frame(x = numeric(), y = numeric())
  for(j in 1:10000)
  {
    x[j,] = distance(A[i,], B[j,])
  }
  A[i,]$associated_row_B = x[which.min(x[1,]),1]
}
does basically what I want (I still have to solve if observations have the same distance). But since I am using dplyr I hardly ever had to use a for loop. My solution needs even two loops so I wonder if there is a possibility to avoid the for loop using a solution from dplyr/tidyverse.
A very basic example:
A:
i a b
1 -0.5920377 a
2 0.4263199 b
3 0.6737029 a
4 1.3063658 c
5 0.1314103 d
B:
i a b
1 -0.30201541 a
2 -0.07093386 b
3 0.96317764 c
4 -0.33303061 d
5 -1.00834895 d
and the distance function:
distance = function(x,y) return(c((x[2] - y[2])^2 + abs(x[3] - y[3]), y[1]))
The first element of the return value is the actual distance, the second value is the identifier from B.
Fair warning: this is going to be pretty inefficient for large datasets!
You can accomplish this using crossing from tidyr and slice from dplyr.
First, let's create two dummy dataframes, A_df and B_df
A_df <- data.frame(
  observation_A = runif(100),
  id_A = 1:100
)
B_df <- data.frame(
  observation_B = runif(50),
  id_B = 1:50
)
For clarity, I've kept the column names unique between A_df and B_df. Next, we'll use tidyr::crossing to form every combination of rows from the two dataframes. Then we use mutate to calculate the distance (here I arbitrarily took the absolute value of their difference, but you can apply your custom distance function here). Finally, we group by id_A and keep only the row with the minimum distance using slice (and base R which.min).
library(tidyverse)
full_df <- A_df %>%
  crossing(B_df) %>%
  mutate(distance = abs(observation_A-observation_B)) %>%
  group_by(id_A) %>%
  slice(which.min(distance))
Looking at full_df, we get what we were hoping for:
> full_df
# A tibble: 100 x 5
# Groups: id_A [100]
observation_A id_A observation_B id_B distance
<dbl> <int> <dbl> <int> <dbl>
1 0.826 1 0.851 44 0.0251
2 0.903 2 0.905 3 0.00176
3 0.371 3 0.368 18 0.00305
4 0.554 4 0.577 34 0.0232
5 0.656 5 0.654 10 0.00268
6 0.120 6 0.110 37 0.0101
7 0.991 7 0.988 6 0.00244
8 0.983 8 0.988 6 0.00483
9 0.325 9 0.318 45 0.00649
10 0.860 10 0.864 40 0.00407
# ... with 90 more rows
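If the distance can be written in vectorised form, a base R alternative is to build the whole distance matrix with outer() and take the row-wise minimum. This is only a sketch using the same dummy layout as above and the same arbitrary absolute-difference distance; the names dist_mat and nearest are made up here.

```r
set.seed(1)
A_df <- data.frame(observation_A = runif(100), id_A = 1:100)
B_df <- data.frame(observation_B = runif(50), id_B = 1:50)

# Rows correspond to A, columns to B; entry [i, j] is the distance A[i] to B[j]
dist_mat <- outer(A_df$observation_A, B_df$observation_B,
                  function(a, b) abs(a - b))

# which.min returns the FIRST minimum, so ties resolve to the earliest row of B
nearest <- apply(dist_mat, 1, which.min)

A_df$id_B <- B_df$id_B[nearest]
A_df$distance <- dist_mat[cbind(seq_len(nrow(A_df)), nearest)]
head(A_df)
```

The matrix holds one double per pair, so with 5000 and 10000 observations it would need roughly 400 MB. It shares the memory caveat of the crossing approach.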