How to get the linear regression equation for a certain variable? - r

When trying to retrieve a linear regression equation I keep getting this error:
Error in eval(predvars, data, env) : object 'OBP' not found
Here is my chunk,
teams_oak %>%
select(G:FP) %>%
mutate(OBP=H+BB+HBP/AB+BB+HBP+SF) %>%
ggplot(aes(x=OBP,y=W))+
geom_point()+
geom_smooth(method = "lm")+
labs(title="OBP vs. Wins in MLB 1982-2002 ex:1994",x="On Base Percentage",y="Wins") %>%
teams_oak %>%
select(G:FP) %>%
mutate(OBP=H+BB+HBP/AB+BB+HBP+SF) %>%
m1<-lm(data = teams_oak,formula = W~OBP)
I've tried piping all the functions together to make sure the code can see the mutated variable, but it seems it cannot find it.

When you do this:
teams_oak %>%
select(G:FP) %>%
mutate(OBP=H+BB+HBP/AB+BB+HBP+SF) %>%
m1<-lm(data = teams_oak,formula = W~OBP)
the pipe on the last two lines ends in an assignment rather than a function call. Within that assignment you are naming teams_oak as the data, but teams_oak is still the original data frame: the result of the pipe, which contains OBP, was never stored anywhere, so lm() cannot find the OBP variable and fails.
I think simply omitting the last pipe operator and turning it into two assignment statements should work, with a blank line and indentation to indicate the end of the pipe:
teams_oak <- teams_oak %>%
  select(G:FP) %>%
  mutate(OBP=H+BB+HBP/AB+BB+HBP+SF)

m1<-lm(data = teams_oak,formula = W~OBP)
Personally, I don't use pipes because they make it easy to break code this way. In my view they are not the best way to write clear code.
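As a side note, the OBP expression in the question divides only HBP by AB because of operator precedence, so the sums probably need parentheses. The following is a minimal sketch of the two-step approach with that change, using a small made-up data frame (teams_demo and all its values are hypothetical stand-ins for teams_oak); coef() then gives the intercept and slope of the fitted line.

```r
library(dplyr)

# Hypothetical stand-in for teams_oak; the values are made up for illustration.
teams_demo <- data.frame(
  W   = c(68, 74, 77, 91, 102),
  H   = c(1400, 1420, 1450, 1470, 1500),
  BB  = c(500, 520, 560, 600, 650),
  HBP = c(40, 45, 50, 55, 60),
  AB  = c(5500, 5480, 5520, 5510, 5530),
  SF  = c(40, 42, 45, 48, 50)
)

# Step 1: finish the pipe and store its result, so OBP exists in the data.
# The parentheses make the whole numerator divide by the whole denominator.
teams_demo <- teams_demo %>%
  mutate(OBP = (H + BB + HBP) / (AB + BB + HBP + SF))

# Step 2: fit the model in a separate statement, outside the pipe.
m1 <- lm(W ~ OBP, data = teams_demo)

# Intercept and slope of the regression equation W = a + b * OBP.
coef(m1)
```

With the real data, the same two statements would apply to teams_oak directly.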

Related

How do I solve the error object not found after I created this variable using mutate?

I have added a variable that is the sum of all policies for each customer:
mhomes %>% mutate(total_policies = rowSums(select(., starts_with("num"))))
However, when I now want to use this total_policies variable in plots or when using summary() it says: Error in summary(total_policies) : object 'total_policies' not found.
I don't understand what I did wrong or what I should do differently here.
This may be slightly roundabout, but I feel it solves the problem. Assuming df is the dataset and it has customer_id, policy_id and policy_amount as variables, the command below should work:
req_output = df %>% group_by(customer_id) %>% summarise(total_policies = sum(policy_amount))
If you still face the issue, convert the result to a data frame and try plotting:
req_output = as.data.frame(req_output)
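Another likely cause of the error in the question is that the result of mutate() is never assigned, and that summary(total_policies) looks for a standalone object rather than a column. A minimal sketch, assuming a small made-up data frame (mhomes_demo and its columns are hypothetical stand-ins for mhomes):

```r
library(dplyr)

# Hypothetical stand-in for mhomes with two "num..." columns.
mhomes_demo <- data.frame(
  num_car  = c(1, 0, 2),
  num_fire = c(1, 1, 0),
  age      = c(34, 51, 46)
)

# Assign the result back; mutate() on its own returns a modified copy
# and leaves the original data frame unchanged.
mhomes_demo <- mhomes_demo %>%
  mutate(total_policies = rowSums(select(., starts_with("num"))))

# total_policies is a column, so refer to it through the data frame.
summary(mhomes_demo$total_policies)
```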

How to `dput` a `ggplot` object?

I am looking for a way to save some ggplot objects for later use. The dput function creates a string that, when passed to dget(), fails with several "unexpected <" errors:
The first one is here: .internal.selfref = <. This can be easily solved by setting .internal.selfref to NULL.
The remaining seven are distributed across different attributes, with the arguments being <environment>. I tried to change the <environment>'s to something like NULL or environment(), but none of them works - the environment is not set right and the object not found error is returned.
Some searches led me to the function ggedit::dput.ggedit. But it gives me the error:
# Error in sprintf("%s = %s", item, y) :
# invalid type of argument[2]: 'symbol'
I am thinking, either I set the environments right in using the dput function, or I figure out why ggedit::dput.ggedit does not work...
Any idea?
Not using dput(), but to save your ggplot objects for later use, you could save them as .rds files (just like any R objects).
Example:
library(ggplot2)
my_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
saveRDS(my_plot, "my_plot.rds")
And to restore your object in another session, another script, etc.:
my_plot <- readRDS("my_plot.rds")
You can try a tidyverse approach.
Save the plot beside the data in a tibble using nest and map.
library(tidyverse)
res <- mtcars %>%
  as_tibble() %>%
  nest(data = everything()) %>%
  mutate(res=map(data, ~ggplot(.,aes(mpg, disp)) + geom_point()))
Then save the resulting tibble using save or saveRDS.
Finally, call the plot:
res$res
The size is 4kb for tibble(mtcars) vs. 21kb with plot.

"could not find function %>%<-", issue with tidyr package and the %>% operator

I'm working on a script for a swirl lesson on using the tidyr package and I'm having some trouble with the %>% operator. I've got a data frame called passed that contains the name, class number, and final grade of 4 students. I want to add a new column called status and populate it with a character vector that says "passed". Before that, I used select to grab some columns from a data frame called students4 and stored the result in a data frame called gradebook:
gradebook <- students4 %>%
select(id, class, midterm, final) %>%
passed<-passed %>% mutate(status="passed")
Swirl problems build on each other, and the last one just had me running the first two lines of code, so I think those two are correct. The third line was what was suggested after a couple of wrong attempts, so I think there's something about %>% that I'm not understanding. When I run the code I get an error that says:
Error in students4 %>% select(id, class, midterm, final) %>% passed <- passed %>% :
could not find function "%>%<-"
I found another user who asked about the error could not find function "%>%" and was able to resolve it by installing the magrittr package, but that didn't do the trick for me. Any input on the issues in my code would be super appreciated!
It’s not a problem with the package or the operator. You’re trying to pipe into a new line with a new variable.
The %>% operator passes the previous data frame into the next function as that function's first argument.
Instead of doing all of this:
Gradebook <- select(students4, id, class, midterm, final)
Gradebook2 <- mutate(Gradebook, test4 = 100)
Gradebook3 <- arrange(Gradebook2, desc(final))
You can use the pipe operator to pass the result into the next function if you’re working on the same data frame. Note that the data frame is named only once, at the start of the pipe:
Gradebook <- students4 %>%
  select(id, class, midterm, final) %>%
  mutate(test4 = 100) %>%
  arrange(desc(final))
Much cleaner and easier to read.
At the end of your second line you’re trying to pass the result to a new function, but instead of a function the next line defines a variable. I don’t know the exercise you’re doing, but you should remove the second pipe operator so that the third line is its own statement:
gradebook <- students4 %>%
  select(id, class, midterm, final)
passed <- passed %>% mutate(status="passed")
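To see why R reports a function named %>%<-, it can help to look at how a pipe that ends in an assignment is parsed. The sketch below only parses the expression and evaluates nothing, so no packages are needed; the names a, f and b are placeholders, not objects from the question.

```r
# Parse (but do not evaluate) a pipe that ends in an assignment.
# a, f and b are placeholder names used only for illustration.
e <- quote(a %>% f() %>% b <- 1)

# The outermost call is the assignment, not the pipe.
as.character(e[[1]])

# Its left-hand side is the whole pipe chain ...
lhs <- e[[2]]
deparse(lhs)

# ... whose outermost function is %>%. Assigning to a call like this
# makes R look for a replacement function called `%>%<-`.
as.character(lhs[[1]])
```

Because the assignment arrow binds less tightly than %>%, everything to its left is treated as the assignment target, which matches the error message in the question.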

Doing Calculations In Spark(R)

I am using the sparklyr library.
I have a variable, wtd, which I copied to Spark:
copy_to(sc,wtd)
colnames(wtd) <- c("a","b","c","d","e","f","g")
Then I want to do a computation and store that in spark, not in my environment in R.
When I try:
sdf_register(wtd %>% group_by(c,b) %>% filter(row_number()==1) %>%count(d), "wtd2")
Error in UseMethod("sdf_register") :
no applicable method for 'sdf_register' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"
The command wtd2 = wtd %>% group_by(c,b) %>% filter(row_number()==1) %>%count(d) works correctly, but that will store it in my environment, not in spark.
The first argument in your sequence of operations should be a "tbl_spark", not a regular data.frame. Your command,
wtd2 = wtd %>% group_by(c,b) %>% filter(row_number()==1) %>%count(d)
works because you are not using Spark at all, just normal R data.frames.
If you want to use it with Spark, first store the tbl_spark object that is returned when you copy your data.frame:
colnames(wtd) <- c("a","b","c","d","e","f","g")
wtd_tbl <- copy_to(sc, wtd)
Then, you can execute your data pipeline using sdf_register(wtd_tbl %>% ..., "wtd2").
If you execute the pipeline as defined, you will get an exception saying:
Error: org.apache.spark.sql.AnalysisException: Window function rownumber() requires window to be ordered
This is because, in order to use row_number() in Spark, you first need to provide an ordering. You can do this with arrange(). I assume that you want your rows ordered by the columns "c" and "b", so your final pipeline would be something like this:
sdf_register(wtd_tbl %>%
               dplyr::group_by(c, b) %>%
               arrange(c, b) %>%
               dplyr::filter(row_number() == 1) %>%
               dplyr::count(d),
             "wtd2")
I hope this helps.

how to break ggplot code into multiple lines in rstudio

Maybe a dumb question: How can we break ggplot code into multiple lines, each line ended with +?
I tried to do that in rstudio editor, but it does not work.
Not sure what you're asking, but I'll offer a suggestion. When working with long ggplot, dplyr or other statements in rstudio, I structure the lines:
results = (dcsv
    %>% mutate(sdisagree_pct = strongly_disagree / n_resp * 100)
    # %>% select( Item,sdisagree_pct:sagree_pct)
    %>% rename( `Strongly disagree` = sdisagree_pct,
                `Disagree` = disagree_pct,
                `Neutral` = neutral_pct,
                `Agree` = agree_pct,
                `Strongly agree` = sagree_pct
              )
)
Note the parentheses wrapping the entire multiline statement. The parentheses force R to read the multiple lines as one statement regardless of where the operator is located in the line. Also note that I put the %>% operator at the beginning of each line. With the operator at the beginning of the line, I'm able to comment out specific lines with # for testing. The same structure works with a + operator for ggplot.
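For ggplot specifically, a minimal sketch of both layouts might look like this. It uses the built-in mtcars data set as an assumed example, since the question does not show its own data.

```r
library(ggplot2)

# Style 1: end every line with `+`, so R knows the statement continues.
p1 <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  labs(x = "Miles per gallon", y = "Displacement")

# Style 2: wrap the statement in parentheses and lead each line with `+`.
p2 <- (ggplot(mtcars, aes(x = mpg, y = disp))
  + geom_point()
  + labs(x = "Miles per gallon", y = "Displacement")
)
```

Without the surrounding parentheses, a line that begins with + is read as a new statement, because R considers the previous line already complete. This is a common reason multi-line ggplot code fails.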
