Keeping a column throughout pipe manipulations, R, dplyr - r

I have a DF called data that has multiple columns. One of them, 'swim_hours', is a column of integers. I want to use dplyr to do some manipulations on this DF, but I also want to keep this column intact, since I am not manipulating it in any way. My code is as follows:
trial_swim = data %>%
  group_by(penguin, date, trial, location) %>%
  summarise(swim_t = as.numeric(datetime[n()] - datetime[1])) %>%
  summarise(hours = swim_hours)
Again, the purpose of the last line is to simply maintain my 'swim_hours' column into my new DF 'trial_swim'.
When I run this, I get the error message: Error in summarise_impl(.data, dots) :
Evaluation error: object 'swim_hour' not found.
Clearly 'swim_hours' is in my 'data' DF, why can't it find it?
Is there an easier way to keep a column that is not being manipulated when using dplyr and pipes?
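A likely cause: the first summarise() returns only the grouping columns plus swim_t, so by the time the second summarise() runs, swim_hours no longer exists in the piped data. Below is a minimal sketch of one way around this, assuming swim_hours has a single value per group. The toy data, the units = "hours" choice, and the use of first() are assumptions for illustration, not taken from the question.

```r
library(dplyr)

# Hypothetical toy data standing in for the asker's `data`
data <- tibble(
  penguin    = c("A", "A", "B", "B"),
  date       = as.Date("2020-01-01"),
  trial      = 1L,
  location   = "pool",
  datetime   = as.POSIXct("2020-01-01 10:00:00", tz = "UTC") + c(0, 3600, 0, 7200),
  swim_hours = c(1L, 1L, 2L, 2L)
)

trial_swim <- data %>%
  group_by(penguin, date, trial, location) %>%
  summarise(
    # elapsed time between the first and last row of each group (units assumed)
    swim_t = as.numeric(datetime[n()] - datetime[1], units = "hours"),
    # carry swim_hours along in the same summarise() call;
    # first() assumes the value is constant within a group
    hours = first(swim_hours),
    .groups = "drop"
  )
```

If swim_hours really is constant within each group, adding it to group_by() would also keep it in the result without a second summary expression.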

Related

`by` can't contain join column/inner_join

I'm trying to run the code below but I keep getting this error. I'm applying an inner join to two data sets, student_db and grade_db, by student_id, and then joining course_db by course_id.
Could anyone help with this issue?
q2 <- inner_join(student_db, grade_db, by = "student_id") %>%
  inner_join(course_db, by = "course_id", suffix = c(".student", ".course")) %>%
  filter(name.student == "Ava Smith" | name.student == "Freddie Haris")
Error in common_by.list():
! by can't contain join column
course_id which is missing from LHS.
Run rlang::last_error() to see where the error occurred.
Does this work?
library(dplyr)
q2 <- student_db %>%
  inner_join(grade_db, by = c("student_id" = "student_id")) %>%
  inner_join(course_db, by = c("course_id" = "course_id")) %>%
  filter(name.student %in% c("Ava Smith", "Freddie Haris"))
If the names of your id variables are different in the two data frames, the by argument lets you tell R which variables should be matched, e.g. by = c("var_from_df1" = "var_from_df2"). My guess is that your data frames have different column names, so this might be what you need to fix.
I'm not sure why you've included the suffix argument. It is there for the case where both data sets contain a non-key variable with the same name; the suffixes are appended so the two columns can be told apart. If you need it, you can add it back. It's hard to tell exactly what the problem is without seeing your data frames or similar example data.
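To illustrate the named-vector form of by, here is a small sketch with made-up data. It assumes, purely for illustration, that the course key is called course in grade_db and course_id in course_db; the asker's real column names are not known.

```r
library(dplyr)

# Hypothetical data frames; the column names are assumptions
student_db <- tibble(student_id = c(1, 2, 3),
                     name = c("Ava Smith", "Freddie Haris", "Someone Else"))
grade_db   <- tibble(student_id = c(1, 2, 3),
                     course = c(10, 20, 10),
                     grade = c(90, 80, 70))
course_db  <- tibble(course_id = c(10, 20),
                     name = c("Maths", "Physics"))

q2 <- student_db %>%
  inner_join(grade_db, by = "student_id") %>%
  # left-hand key `course` is matched against right-hand key `course_id`
  inner_join(course_db, by = c("course" = "course_id"),
             suffix = c(".student", ".course")) %>%
  filter(name.student %in% c("Ava Smith", "Freddie Haris"))
```

Here suffix is doing real work: both sides of the second join have a name column, so the result contains name.student and name.course.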

Group by different column names in a df using in a for loop

I'm new to R. I recently tried to use a for loop to put each column name into a group_by statement and then make a graph for each one. However, I got this error:
Error: Must group by variables found in .data.
*Column i is not found.
Below is the code. I have simplified the problem and kept only the group_by part. Does anyone know how I can solve this?
for (i in colnames(mtcars)) {
  mtcars %>%
    group_by(i) %>%
    summarise(total = n())
}
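The error arises because group_by(i) looks for a column literally named i, rather than the column whose name is stored in the string i. One possible fix, sketched here assuming dplyr 1.0.0 or later, is to select the column by name with across(all_of(i)). The counts list is a hypothetical container added so the results can be inspected after the loop.

```r
library(dplyr)

counts <- list()  # hypothetical container for the per-column summaries

for (i in colnames(mtcars)) {
  counts[[i]] <- mtcars %>%
    # all_of(i) treats the string in `i` as a column name
    group_by(across(all_of(i))) %>%
    summarise(total = n())
}

# Results are not auto-printed inside a for loop, so print explicitly
print(counts[["cyl"]])
```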

How do I convert a column from character to double in R?

I am trying to group a dataset on a certain value and then sum a column based on this grouped value.
UN.surface.area.share <- left_join(countries, UN.surface.area, by = 'country') %>%
  drop_na() %>%
  rename('surface.area' = 'Surface.area..km2.') %>%
  group_by(region) %>%
  summarise(total.area = sum(surface.area))
When I run this I get this error:
Error: Problem with `summarise()` input `total.area`.
x invalid 'type' (character) of argument
i Input `total.area` is `sum(surface.area)`.
i The error occurred in group 1: region = "Africa".
I think the problem is that the 'surface.area' column is of the character type and therefore the sum function doesn't work. I tried adding %>% as.numeric('surface.area') to the previous code:
UN.surface.area.share <- left_join(countries, UN.surface.area, by = 'country') %>%
  drop_na() %>%
  rename('surface.area' = 'Surface.area..km2.') %>%
  as.numeric('surface.area') %>%
  group_by(region) %>%
  summarise(total.area = sum(surface.area))
But this gives the following error:
Error in group_by(., region) :
'list' object cannot be coerced to type 'double'
I think this problem can be solved by changing the 'surface.area' column to a numeric datatype but I am not sure how to do this. I checked the column and it only consists of numbers.
Use dplyr::mutate()
So instead of:
... %>% as.numeric('surface.area') %>%...
do:
...%>% mutate(surface.area = as.numeric(surface.area)) %>%...
mutate() changes one or more variables within a data frame. When you pipe to as.numeric, as you're currently doing, you're effectively asking R to run
as.numeric(data.frame.you.piped.in, 'surface.area')
as.numeric then tries to convert the whole data frame into a number, which it can't do since the data frame is a list object. Hence your error. The second argument does not help either: as.numeric has no parameter for picking a column, so 'surface.area' does not select anything.
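As a small self-contained sketch of the mutate() approach, using made-up numbers rather than the asker's UN data:

```r
library(dplyr)

# Hypothetical stand-in for the joined data; surface.area is stored as character
areas <- tibble(
  region       = c("Africa", "Africa", "Europe"),
  surface.area = c("100", "250", "40")
)

area_share <- areas %>%
  # convert the character column to numeric inside the data frame
  mutate(surface.area = as.numeric(surface.area)) %>%
  group_by(region) %>%
  summarise(total.area = sum(surface.area))
```

Note that as.numeric() returns NA (with a warning) for strings it cannot parse, such as values containing thousands separators, so it is worth checking the column for NA after converting.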

How to interpret column length error from dplyr::mutate?

I'm trying to apply a function (more complex than the one used below, but I was trying to simplify) across two vectors. However, I receive the following error:
mutate_impl(.data, dots) :
Column `diff` must be length 2 (the group size) or one, not 777
I think I may be getting this error because the difference between rows results in one row less than the original dataframe, per some posts that I read. However, when I followed that advice and tried to add a vector to add 0/NA on the final line I received another error. Did I at least identify the source of the error correctly? Ideas? Thank you.
Original code:
diff_df <- DF %>%
  group_by(DF$var1, DF$var2) %>%
  mutate(diff = map2(DF$duration, lead(DF$duration), `-`)) %>%
  as.data.frame()
We don't need map2 to get the difference between 'duration' and the lead of 'duration', because subtraction is vectorized. map2 would loop through each element of 'duration' together with the corresponding element of lead(duration), which is unnecessary.
DF %>%
  group_by(var1, var2) %>%
  mutate(diff = duration - lead(duration))
NOTE: When we extract the column with DF$duration after the group_by, it bypasses the grouping and returns the full column from the original dataset, which is why mutate complains about a length of 777 instead of the group size. Also, inside the pipe there is no need for dataset$columnname; the bare columnname is enough. (In certain situations, when we want the full column for some comparison, dataset$columnname can still be used.)
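A small sketch with made-up data shows what the grouped version returns. The values are hypothetical; the point is that lead() works within each group, so the last row of every group gets NA.

```r
library(dplyr)

# Hypothetical data: two groups of two rows each
DF <- tibble(
  var1     = c("a", "a", "b", "b"),
  var2     = c("x", "x", "y", "y"),
  duration = c(10, 4, 7, 2)
)

diff_df <- DF %>%
  group_by(var1, var2) %>%
  # bare column names respect the grouping; lead() has no next row at the end of a group
  mutate(diff = duration - lead(duration)) %>%
  ungroup()
```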

Passing column names as both variables and columns in a single dplyr function in R

I am writing code in which a column name (e.g. "Category") is supplied by the user and assigned to a variable biz.area. For example...
biz.area <- "Category"
The original data frame is saved as risk.data. User also supplies the range of columns to analyze by providing column names for variables first.column and last.column.
Text in these columns will be broken up into bigrams for further text analysis including tf_idf.
My code for this analysis is given below.
x.bigrams <- risk.data %>%
  gather(fields, alldata, first.column:last.column) %>%
  unnest_tokens(bigrams, alldata, token = "ngrams", n = 2) %>%
  count(bigrams, biz.area, sort = TRUE) %>%
  bind_tf_idf(bigrams, biz.area, n) %>%
  arrange(desc(tf_idf))
However, I get the following error.
Error in grouped_df_impl(data, unname(vars), drop) : Column
x.biz.area is unknown
This is because count() expects a column name text string instead of variable biz.area. If I use count_() instead, I get the following error.
Error in compat_lazy_dots(vars, caller_env()) : object 'bigrams'
not found
This is because count_() expects to find only variables and bigrams is not a variable.
How can I pass both a constant and a variable to count() or count_()?
Thanks for your suggestion!
It looks to me like you need quosures, so that you can pass column names as variables rather than as strings or values. Since you're already using dplyr, you can use dplyr's non-standard evaluation techniques.
Try something along these lines:
library(tidyverse)
library(tidytext)  # provides unnest_tokens() and bind_tf_idf()
analyze_risk <- function(area, firstcol, lastcol) {
  # turn your arguments into quosures
  areaq <- enquo(area)
  firstcolq <- enquo(firstcol)
  lastcolq <- enquo(lastcol)
  # run your analysis on the risk data
  risk.data %>%
    # parentheses make the unquoting order explicit around `:`
    gather(fields, alldata, (!!firstcolq):(!!lastcolq)) %>%
    unnest_tokens(bigrams, alldata, token = "ngrams", n = 2) %>%
    count(bigrams, !!areaq, sort = TRUE) %>%
    bind_tf_idf(bigrams, !!areaq, n) %>%
    arrange(desc(tf_idf))
}
In this case, your users would pass bare column names into the function like this:
myresults <- analyze_risk(Category, Name_of_Firstcol, Name_of_Lastcol)
If you want users to pass in strings instead, enquo() is not the right tool: convert each string to a symbol with rlang::sym() and unquote the result with !!.
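For the situation in the question, where the column name arrives as a string in biz.area, one option is to turn the string into a symbol with rlang::sym() and unquote it with !!. The sketch below uses only dplyr and rlang on a made-up, already-tokenized table, so the tokens data frame and its values are assumptions for illustration.

```r
library(dplyr)
library(rlang)

# Hypothetical tokenized data standing in for the output of unnest_tokens()
tokens <- tibble(
  bigrams  = c("risk event", "risk event", "market risk"),
  Category = c("Ops", "Ops", "Finance")
)

biz.area <- "Category"  # column name supplied by the user as a string

counts <- tokens %>%
  # sym() converts the string to a symbol; !! unquotes it so count() sees the column
  count(bigrams, !!sym(biz.area), sort = TRUE)
```

The same !!sym(biz.area) expression could be used in the bind_tf_idf() step of the original pipeline, since that function also takes bare column names.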
