In R, some functions expect column names to be passed as quoted strings, like this:
summarySE(xx, measurevar = "X1F1", groupvars = "genotype", na.rm = TRUE)
Others accept the same names unquoted, as part of a formula, like this:
aov(data = xx, X1F1 ~ genotype)
How can I convert a string like "X1F1" into the unquoted X1F1 that the formula requires?
Here's my data:
genotype X1F1 X2F1
1 R 43.33877 7.881666
2 R 130.34433 65.056984
3 R 53.39783 11.985018
4 R 23.45456 5.683387
5 R 138.50044 61.194956
6 R 108.63964 39.581222
7 R 153.60738 55.854238
8 T 264.96127 108.751380
9 T 222.94124 119.695112
10 T 119.55373 36.793537
11 T 34.97877 12.285921
You can use data.frame column names directly in a formula. The two lm calls below give the same result:
df <- data.frame(x = 1:100, y = rnorm(100))
# Option 1. As formula
lm(y ~ x, df)
# Option 2. As data.frame index
lm(df[, "y"] ~ df[, "x"])
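If the column names are only available as strings, one possible approach is to build the formula from them. The sketch below uses the base R functions reformulate and as.formula on the first column of the data from the question; the variable names measure and group are assumptions chosen for illustration.

```r
# Data from the question (genotype and X1F1 columns only)
xx <- data.frame(
  genotype = rep(c("R", "T"), times = c(7, 4)),
  X1F1 = c(43.33877, 130.34433, 53.39783, 23.45456, 138.50044, 108.63964,
           153.60738, 264.96127, 222.94124, 119.55373, 34.97877)
)

# Column names held as strings (illustrative variable names)
measure <- "X1F1"
group <- "genotype"

# Option A: reformulate builds the formula X1F1 ~ genotype from the strings
f_a <- reformulate(group, response = measure)

# Option B: paste the pieces together and convert with as.formula
f_b <- as.formula(paste(measure, "~", group))

# Either formula can be passed to aov
fit <- aov(f_a, data = xx)
print(summary(fit))
```

Both options produce the same formula, so the choice is mostly a matter of taste; reformulate avoids manual string pasting when there are several grouping variables.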
I haven't been able to find an answer to this, but I am guessing this is because I am not phrasing my question properly.
I want to combine two strings containing several comma-separated values into one string, alternating the inputs from each original string.
x <- '1,2'
y <- 'R,L'
# fictitious function
z <- combineSomehow(x,y)
z = '1R, 2L'
EDIT : Adding dataframe to better describe my issue. I would like to be able to accomplish the above, but within a mutate ideally.
df <- data.frame(
x = c('1','2','1,1','2','1'),
y = c('R','L','R,L','L','R'),
desired_result = c('1R','2L','1R,1L','2L','1R')
)
df:
x y desired_result
1 1 R 1R
2 2 L 2L
3 1,1 R,L 1R,1L
4 2 L 2L
5 1 R 1R
Final Edit/Answer: Based on @akrun's comment/response below and after removing the error originally in df, this ended up being the tidyverse answer:
mutate(desired_result = map2(.x=strsplit(x,','),.y=strsplit(y,','),
~ str_c(.x,.y, collapse=',')))
It can be done with strsplit and paste
combineSomehow <- function(x, y) {
do.call(paste0, c(strsplit(c(x,y),","), collapse=", "))
}
combineSomehow(x,y)
#[1] "1R, 2L"
Without modifying the function, we can Vectorize it to apply on multiple elements
df$desired_result2 <- Vectorize(combineSomehow)(df$x, df$y)
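As a variation, the same idea can be applied to the whole data frame in one step with mapply, splitting both columns first and pasting the pieces pairwise. This is a minimal sketch that assumes the comma-only separator from desired_result; the column name result is chosen for illustration.

```r
df <- data.frame(
  x = c("1", "2", "1,1", "2", "1"),
  y = c("R", "L", "R,L", "L", "R"),
  stringsAsFactors = FALSE
)

# Split each string on commas, then paste matching pieces together
# and collapse each row's pieces back into a single string.
df$result <- mapply(
  function(a, b) paste0(a, b, collapse = ","),
  strsplit(df$x, ","),
  strsplit(df$y, ","),
  USE.NAMES = FALSE
)

print(df)
```

Because mapply simplifies its output, result is a plain character column rather than a list column.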
I'm trying to use the sqldf package inside a user-defined function in R with generic column names. I can only get it to work if the column names match the placeholder names (x and y) used within the function. However, I want it to work regardless of the column names passed to the function. Here is the example I've been playing with:
Here is the form that works:
df<-data.frame(X=as.factor(c("a","a","a","b","b","b","c","c","c")), Y=c(2.5,3,4,4,5.3,6,6.555,7,8))
df
Bar_Prep1<-function(data,x,y){
library(sqldf)
require(sqldf)
dataframe<-sqldf("select a.[x] Grp, AVG(a.[y]) Mean, stdev(a.[y]) SD, Max(a.[y]) Max
from data a
group by a.[x]")
dataframe$RD<-round(dataframe$Mean,digits=0)
return(dataframe)
}
test<-Bar_Prep1(df,df$X,df$Y)
test
Which returns the following df:
Grp Mean SD Max RD
1 a 3.166667 0.7637626 4 3
2 b 5.100000 1.0148892 6 5
3 c 7.185000 0.7400507 8 7
BUT, I want to be able to use the function on various column names, so I tried this:
df1<-data.frame(a=as.factor(c("a","a","a","b","b","b","c","c","c")), b=c(2.5,3,4,4,5.3,6,6.555,7,8))
df1
test1<-Bar_Prep1(df1,df1$a,df1$b)
test1
This returns the following errors: "Error: no such column: a.x" and "object 'test1' not found".
So the question is, how do I need to modify my function code to accept variable names other than "x" and "y"?
Pass the names rather than the columns. Change the sqldf call to fn$sqldf which will enable string interpolation using $. Then in the select statement use $x and $y.
library(sqldf)
Bar_Prep1 <- function(data, x, y) {
dataframe <- fn$sqldf("select
a.[$x] Grp,
AVG(a.[$y]) Mean,
stdev(a.[$y]) SD,
Max(a.[$y]) Max
from data a
group by a.[$x]")
dataframe$RD <- round(dataframe$Mean, digits = 0)
return(dataframe)
}
Bar_Prep1(df, "X", "Y")
## Grp Mean SD Max RD
## 1 a 3.166667 0.7637626 4 3
## 2 b 5.100000 1.0148892 6 5
## 3 c 7.185000 0.7400507 8 7
Note that it would be possible to absorb the rounding into the SQL statement:
Bar_Prep1 <- function(data, x, y) {
fn$sqldf("with tmp as (select
a.[$x] Grp,
AVG(a.[$y]) Mean,
stdev(a.[$y]) SD,
Max(a.[$y]) Max
from data a
group by a.[$x])
select *, round(Mean) RD from tmp")
}
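To make the substitution step easier to see, the sketch below builds the same query text in base R with sprintf instead of fn$. The helper name build_query is hypothetical and the function only returns a string; it does not run any SQL. The resulting string could then be passed to sqldf().

```r
# Hypothetical helper: returns the query text with the column names
# substituted in, mirroring what the $x and $y placeholders expand to.
build_query <- function(x, y) {
  sprintf(
    paste(
      "select a.[%1$s] Grp, AVG(a.[%2$s]) Mean,",
      "stdev(a.[%2$s]) SD, Max(a.[%2$s]) Max",
      "from data a group by a.[%1$s]"
    ),
    x, y
  )
}

# Column names are passed as strings, not as df1$a and df1$b
query <- build_query("a", "b")
cat(query, "\n")
```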
I'd like to perform different aggregations in a loop to be applied to different row subsets of my data, but it seems tricky to achieve (if possible at all):
t <- data.frame(agg=c(list("field1"=field1, "field2"=field2), ...),
fun=c(mean, ...))
f <- function(x) {
for (i in 1:nrow(t)) {
y <- aggregate(x, by=t$agg[i], FUN=t$fun[i])
# do something with y
}
}
One problem is that the field list agg triggers an error when trying to build the data frame ("object 'field1' not found"), and the other problem is that R refuses to store a function in the fun column ("cannot coerce class '"function"' to a data.frame").
Appendix:
A concrete example for my data (just to match the definitions above) could be:
> d <- data.frame(field1=round(rnorm(5, 10, 1)),field2=letters[round(rnorm(5, 10, 1))], field3=1:5)
> d
field1 field2 field3
1 11 j 1
2 11 i 2
3 10 j 3
4 12 i 4
5 11 j 5
> with(d, aggregate(d$field3,by=list(field1, field2),FUN=mean))
Group.1 Group.2 x
1 11 i 2
2 12 i 4
3 10 j 3
4 11 j 3
Playing tricks with the variable names in the data frame, I still get this:
> with(d,t <- data.frame(agg=c(list("field1"=field1, "field2"=field2)),fun=c(mean)))
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class '"function"' to a data.frame
There were several problems, mostly caused by special cases in how R handles these objects:
First, atomic vectors cannot be nested; only lists can. In addition, all elements of an atomic vector must have the same type.
Second, data.frame applies special processing when constructing its columns, which is why a function (closure) cannot be stored in one, so a data frame cannot be used here.
Finally, I had to refer to the variables to aggregate by name.
So the definition looks like this (where , ... means "add more similar items"):
t <- list(agg=list(c("field1", "field2"), ...),
fun=list(mean, ...))
f <- function(x) {
for (i in 1:length(t$agg)) {
agg <- t$agg[[i]]
aggList <- lapply(agg, FUN=function(e) x[[e]])
names(aggList) <- agg
y <- aggregate(x, by=aggList, FUN=t$fun[[i]])
# do something with y
}
}
Note: In the actual solution I added another list holding the names of the columns to select for the aggregated data frame to avoid warnings about mean returning NA.
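A concrete, runnable version of this idea might look like the sketch below. It uses the example values of d from the appendix and adds a third list, cols, for the columns to aggregate, as described in the note; the names specs and cols and the second specification (max by field2) are assumptions added for illustration.

```r
# Example data matching the appendix output
d <- data.frame(
  field1 = c(11, 11, 10, 12, 11),
  field2 = c("j", "i", "j", "i", "j"),
  field3 = 1:5
)

# Each position i describes one aggregation:
# group by specs$agg[[i]], apply specs$fun[[i]] to specs$cols[[i]]
specs <- list(
  agg  = list(c("field1", "field2"), "field2"),
  fun  = list(mean, max),
  cols = list("field3", "field3")
)

results <- lapply(seq_along(specs$agg), function(i) {
  by_list <- as.list(d[specs$agg[[i]]])   # named list of grouping columns
  aggregate(d[specs$cols[[i]]], by = by_list, FUN = specs$fun[[i]])
})

print(results[[1]])
print(results[[2]])
```

Storing the functions in a list (not a vector or data frame) is what allows them to be kept alongside the column names.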
The data has 5 columns, named A, B, C, D, and Portfolio. I want to run a linear regression model for each portfolio, so I split the data into subsets, run the regression on each subset, and check the summaries.
The data frame looks like the table below:
A B C D Portfolio
1 ... 11
2 ... 22
3 ... 13
4 ... 11
5 ... 21
6 ... 21
7 ... 23
8 ... 12
9 ... 11
10 ... 12
11 ... 22
...
My current code is shown below:
Portfolio_11<-subset(df, Portfolio==11)
Portfolio_12<-subset(df, Portfolio==12)
Portfolio_13<-subset(df, Portfolio==13)
Portfolio_21<-subset(df, Portfolio==21)
Portfolio_22<-subset(df, Portfolio==22)
Portfolio_23<-subset(df, Portfolio==23)
Reg_11<-lm(A ~ B + C + D, data=Portfolio_11)
Reg_12<-lm(A ~ B + C + D, data=Portfolio_12)
Reg_13<-lm(A ~ B + C + D, data=Portfolio_13)
Reg_21<-lm(A ~ B + C + D, data=Portfolio_21)
Reg_22<-lm(A ~ B + C + D, data=Portfolio_22)
Reg_23<-lm(A ~ B + C + D, data=Portfolio_23)
summary(Reg_11)
summary(Reg_12)
summary(Reg_13)
summary(Reg_21)
summary(Reg_22)
summary(Reg_23)
I tried to simplify the R code with a loop, like this:
for (i=1:3, j=1:3){
Portfolio_ij<-subset(df, Portfolio==ij)
Reg_ij<-lm(A ~ B + C + D, data=Portfolio_ij)
summary(Reg_ij)
}
However, I am a beginner in R and don't really understand how loops work, so I would like to learn. Thank you so much.
We can use one of the group-by approaches, for example with data.table:
library(data.table)
dtSummary <- setDT(df)[, list(list(summary(lm(A ~ B + C + D)))), by = Portfolio]
dtSummary$V1
To make life easier for yourself, use one of the R packages for data munging. Akrun has already mentioned data.table; this is also a classic use case for dplyr's do:
library(dplyr)
df %>%
group_by(Portfolio) %>%
do(smry=summary(lm(A ~ B + C + D, data=.)))
This is a classic case for the split-apply-combine approach, or at least the split-apply part, since it's not clear what you want to do with the output. Here's one way to do that in base R, returning the results in a list called Summaries:
Summaries <- lapply(split(df, df$Portfolio), function(i) summary(lm(A ~ B + C + D, data = i)))
Working out from the inside, you:
1. Use split to break the original data into a list composed of the desired subsets, defined here by the unique values of df$Portfolio.
2. Use lapply to iterate the modeling and model-summarizing functions over the elements of the list created in step 1.
The result is a list (Summaries), the ith element of which corresponds to the ith subset of df$Portfolio. Conveniently, the list elements will have names that correspond to the unique values of df$Portfolio, so you can inspect them with Summaries[["21"]], for example. Or, if you just want to see the results in your terminal or markdown or whatever, drop the Summaries <- part.
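Once the summaries are in a list, individual quantities can be pulled out with sapply. The sketch below uses simulated data (the real data is not shown in the question), so the values themselves are meaningless; the names r2 and coefs are chosen for illustration.

```r
# Simulated stand-in for the real data: 10 rows per portfolio
set.seed(1)
n <- 60
df <- data.frame(
  A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n),
  Portfolio = rep(c(11, 12, 13, 21, 22, 23), each = 10)
)

# Split by portfolio, fit one model per subset, keep the summaries
Summaries <- lapply(split(df, df$Portfolio),
                    function(i) summary(lm(A ~ B + C + D, data = i)))

# One R-squared value per portfolio, named by portfolio code
r2 <- sapply(Summaries, function(s) s$r.squared)

# One row of coefficient estimates per portfolio
coefs <- t(sapply(Summaries, function(s) coef(s)[, "Estimate"]))

print(r2)
print(coefs)
```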
Using base R, you could try:
# create the portfolio codes 11, 21, 12, 22, 13, 23
subs <- apply(expand.grid(1:2, 1:3), 1, function(x) as.numeric(paste0(x, collapse="")))
# loop along these combinations. Note the print.
for (i in subs)
print(summary(lm(A ~ B + C + D, data=subset(df, Portfolio==i))))
But as asked in the comments, a reproducible example would help.
Here is one with a simulated dataset:
# same as above
subs <- apply(expand.grid(1:2, 1:3), 1, function(x) as.numeric(paste0(x, collapse="")))
# here we create the dataset
n=50 # we want 50 rows
set.seed(1) # for the sake of reproducibility
df <- data.frame(A=rnorm(n), B=rnorm(n), C=rnorm(n), D=rnorm(n), Portfolio=sample(subs, n, replace=TRUE))
# now we can apply the loop:
for (i in subs){
cat(rep("*", 20), "\nlm for Portfolio =", i, '\n') # a cheap console displayer
print(summary(lm(A ~ B + C + D, data=subset(df, Portfolio==i))))
}
But, as others answered, both the data.table and dplyr packages offer a more straightforward and generic syntax than base R.
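Since the question asks specifically about loop syntax, here is a sketch of the attempted i, j loop written as a valid nested for loop. It assumes the portfolio codes are formed from a first digit 1 to 2 and a second digit 1 to 3, as in the codes listed in the question, and it runs on simulated data; the list name models is chosen for illustration.

```r
# Simulated stand-in for the real data: 10 rows per portfolio
set.seed(1)
n <- 60
df <- data.frame(
  A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n),
  Portfolio = rep(c(11, 12, 13, 21, 22, 23), each = 10)
)

# R has no "for (i, j)" form; nest one loop inside the other instead.
# Results go into a named list rather than variables like Reg_11.
models <- list()
for (i in 1:2) {
  for (j in 1:3) {
    id <- as.numeric(paste0(i, j))            # e.g. i = 1, j = 2 gives 12
    models[[paste0("Reg_", id)]] <-
      lm(A ~ B + C + D, data = subset(df, Portfolio == id))
  }
}

# Inside a loop, results must be printed explicitly
for (name in names(models)) {
  cat("\n", name, "\n")
  print(summary(models[[name]]))
}
```

Storing the fitted models in a list keeps them accessible afterwards, for example as models[["Reg_21"]].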
I can't figure out why this isn't working. I have a data set with 5 columns, n rows. I just want to apply a function to each row and have the result returned in an n by 1 vector.
Just to test how everything works, I made this simple function:
f1 <- function(uniqueid,Perspvalue,expvalue,stddevi,stddevc) {
uniqueid+ Perspvalue- expvalue+ stddevi+stddevc
}
and here's the first few rows of my data set:
> data
uniqueid Perspvalue expvalue stddevi stddevc
1 1 2.404421e+03 3337239.00 8.266566e+03 3.324624e+03
2 2 1.345307e+03 3276559.87 7.068823e+03 2.648072e+03
3 3 1.345307e+03 3276559.87 7.068823e+03 2.648072e+03
Note that it's a data frame (I think), not a matrix. I loaded the data from a CSV using read.csv.
So I tried this: apply(data,1,f1)
But the result is this: Error in uniqueid + Perspvalue : 'Perspvalue' is missing
I expected a number instead of an error.
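apply passes each row to f1 as a single vector, so only the first argument is filled and the others are missing. You'll need to use mapply for this, or the even more convenient mdply from the plyr package.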
Some example code:
spam_function = function(a, b) {
return(a*b)
}
require(plyr)
input_args = data.frame(a = runif(1000), b = runif(1000))
result = mdply(input_args, spam_function)
> head(result)
a b V1
1 0.46902575 0.6865863 0.32202668
2 0.56837805 0.2400993 0.13646717
3 0.07185661 0.2334754 0.01677675
4 0.15589191 0.6636891 0.10346377
5 0.98317092 0.8895609 0.87459042
6 0.46070479 0.4301685 0.19818071
If you just want the vector of results:
result_vector = result$V1
Or alternatively, a base R solution using mapply:
result_mapply = mapply(spam_function, a = input_args$a, b = input_args$b)
> head(result_mapply)
[1] 0.2757767 0.1268879 0.5851026 0.7904186
[5] 0.2186079 0.1091692
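Applied to the function and data from the question, a base R sketch could look like this. The data frame below re-creates the three rows shown in the question; the names res_mapply and res_vec are chosen for illustration. Because f1 only uses vectorised arithmetic, calling it once on whole columns with do.call gives the same result as the row-by-row mapply.

```r
f1 <- function(uniqueid, Perspvalue, expvalue, stddevi, stddevc) {
  uniqueid + Perspvalue - expvalue + stddevi + stddevc
}

# First rows of the data from the question
data <- data.frame(
  uniqueid   = 1:3,
  Perspvalue = c(2404.421, 1345.307, 1345.307),
  expvalue   = c(3337239.00, 3276559.87, 3276559.87),
  stddevi    = c(8266.566, 7068.823, 7068.823),
  stddevc    = c(3324.624, 2648.072, 2648.072)
)

# Row by row: each column is passed to mapply as a named argument,
# so the column names must match the argument names of f1.
res_mapply <- do.call(mapply, c(list(FUN = f1), data))

# All rows at once: the data frame's columns become the arguments of f1.
res_vec <- do.call(f1, data)

print(res_mapply)
print(res_vec)
```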