How do I convert a column from character to double in R?

I am trying to group a dataset on a certain value and then sum a column based on this grouped value.
UN.surface.area.share <- left_join(countries, UN.surface.area, by = 'country') %>% drop_na() %>%
rename('surface.area' = 'Surface.area..km2.') %>% group_by(region) %>% summarise(total.area = sum(surface.area))
When I run this I get this error:
Error: Problem with `summarise()` input `total.area`.
x invalid 'type' (character) of argument
i Input `total.area` is `sum(surface.area)`.
i The error occurred in group 1: region = "Africa".
I think the problem is that the 'surface.area' column is of the character type and therefore the sum function doesn't work. I tried adding %>% as.numeric('surface.area') to the previous code:
UN.surface.area.share <- left_join(countries, UN.surface.area, by = 'country') %>% drop_na() %>%
rename('surface.area' = 'Surface.area..km2.') %>% as.numeric('surface.area') %>% group_by(region) %>% summarise(total.area = sum(surface.area))
But this gives the following error:
Error in group_by(., region) :
'list' object cannot be coerced to type 'double'
I think this problem can be solved by changing the 'surface.area' column to a numeric datatype but I am not sure how to do this. I checked the column and it only consists of numbers.

Use dplyr::mutate()
So instead of:
... %>% as.numeric('surface.area') %>%...
do:
...%>% mutate(surface.area = as.numeric(surface.area)) %>%...
mutate() changes one or more variables within a dataframe. When you pipe to as.numeric, as you're currently doing, you're effectively asking R to run
as.numeric(data.frame.you.piped.in, 'surface.area')
as.numeric then tries to convert the whole data frame into a number, which it can't do since a data frame is a list object. Hence your error. The extra 'surface.area' argument doesn't help either: as.numeric() has no argument for picking a column out of a data frame.
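As a minimal sketch of the fix, here is the same pipeline shape on a small made-up table. The data frame `areas` and its values are hypothetical stand-ins for the joined data in the question; only the `mutate()` step is the point.

```r
library(dplyr)

# Hypothetical toy data: surface.area is stored as character, as in the question
areas <- tibble(
  region = c("Africa", "Africa", "Europe"),
  surface.area = c("100", "250", "40")
)

area_by_region <- areas %>%
  # convert the column inside mutate(), not the whole data frame
  mutate(surface.area = as.numeric(surface.area)) %>%
  group_by(region) %>%
  summarise(total.area = sum(surface.area))

print(area_by_region)
```

Running `str(areas)` before and after the `mutate()` step is a quick way to confirm that the column type changed from chr to num.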

Related

Trying to recode a list variable based on the lengths method

I have a variable that stores user-entered race data. The survey allowed people to choose one or multiple races and separated the entries with a semicolon. I used strsplit to convert the string variable into a list variable. I want to count the number of members in each element and recode any response with more than one entry to "Multiracial"; if only one race was entered, I want to pass that value into the new variable. I have been able to get the following to recode the variable to either Multiracial or Single, but I want to pass the original value if there is only one value in the element.
Combined %>%
select(Race_List) %>%
mutate(Race_Number = if_else(lengths(Race_List) > 1,
"Multiracial",
"Single")) %>%
count(Race_Number) %>%
arrange(desc(n)) %>%
View()
So I tried this:
Combined %>%
select(Race_List) %>%
mutate(Race_Number = if_else(lengths(Race_List) > 1,
"Multiracial",
Race_Number)) %>%
count(Race_Number) %>%
arrange(desc(n)) %>%
View()
This gives me the following error:
Error in View : Problem while computing Race_Number = if_else(lengths(Race_List) > 1, "Multiracial", Race_Number).
I am very new to R and I am probably missing something obvious or more likely using the wrong methods. Any pointers would be greatly appreciated. Thank you.
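One possible approach, sketched under assumptions: the second attempt refers to Race_Number before it exists, and Race_List is a list while if_else() needs a character vector for the single-race case. The sketch below pulls the first entry out of each list element with base R's vapply(). The `Combined` table, the `Race` column, and the helper column `First_Race` are hypothetical names made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the Combined data frame from the question
Combined <- tibble(Race = c("White", "Black;Asian", "Asian", "White;Black;Asian"))

recoded <- Combined %>%
  mutate(
    Race_List = strsplit(Race, ";"),
    # first entry of each list element; x[1] is NA if an element is empty
    First_Race = vapply(Race_List, function(x) x[1], character(1)),
    # both branches are character vectors, which if_else() requires
    Race_Number = if_else(lengths(Race_List) > 1, "Multiracial", First_Race)
  )

recoded %>%
  count(Race_Number) %>%
  arrange(desc(n))
```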

How to use mutate function in R?

I am trying to create a plot in R. For that I first need to create a new column called "export_ratio" with mutate. I have tried the code below, but I am getting the error shown after it. Can someone please help?
eu_macro %>%
mutate(export_ratio = (exports/gdp)*100) %>%
filter(year>1995) %>%
filter(country %in% c("Germany","France","Spain","Sweden")) %>%
ggplot(aes(year,export_ratio))+
geom_line(aes(color=country))
Error: Problem with mutate() column export_ratio.
i export_ratio = (exports/gdp) * 100.
x non-numeric argument to binary operator
Sample Data:
The problem here is that your exports variable is a character type, so it does not make sense to do (exports/gdp). So, you can convert it to numeric:
eu_macro %>%
mutate(export_ratio = (as.numeric(exports)/gdp)*100) %>%
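One thing worth checking when converting is whether every string actually parses as a number. The sketch below uses made-up values; the thousands separator is an assumption about what might appear in such a column, not something shown in the question.

```r
# Hypothetical character values, one of them with a thousands separator
exports <- c("120.5", "98", "1,234")

# as.numeric() returns NA (with a warning) for strings it cannot parse
converted <- suppressWarnings(as.numeric(exports))

# removing the separator first lets every value convert
cleaned <- as.numeric(gsub(",", "", exports, fixed = TRUE))

print(converted)
print(cleaned)
```

If `sum(is.na(converted))` is larger than the number of NAs in the original column, some values were lost in the conversion and need cleaning first.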

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

Trying to perform the basic Summarise() function but getting the same error again and again!
I have a large number of csv files having 4 columns. I am reading them into R using lapply and rbinding them. Next I need to see the number of complete observations present for each ID.
Error:
*Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.*
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like
The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
Of note, you could have written your code with purrr's map family, which can be more convenient than the *apply functions because the suffix specifies the return type. This could look like this:
library(purrr)
remove_na = map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a data.frame with binding rows, but you could have used map_dfc which returns a data.frame with binding columns.
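To see the difference between the three counting options on something small, here is a sketch with a made-up table in the shape described in the question (the values are hypothetical).

```r
library(dplyr)

# Hypothetical data: ID 1 has three rows but only two distinct dates
remove_na <- tibble(
  ID = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03")
)

# n() takes no arguments and counts rows per group;
# n_distinct(Date) counts the unique Date values per group
per_id <- remove_na %>%
  group_by(ID) %>%
  summarise(rows = n(), distinct_dates = n_distinct(Date))

# count(ID) is shorthand for group_by(ID) %>% summarise(n = n())
counted <- remove_na %>% count(ID)

print(per_id)
print(counted)
```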

Applying map function to a nested tibble in R

I'm trying to replicate an 'old' R script I found for the tidyverse package.
library(dslabs)
DataTib<-as_tibble(us_contagious_diseases)
DataTib_nested <- DataTib %>%
group_by(disease) %>%
nest()
Mean_count_nested <- DataTib_nested %>%
mutate(mean_count = map(.x=DataTib_nested$data, ~mean(.x$count)))
As I understand, I have a tibble where data was grouped by disease and the remaining variables/data were nested, and then I'm trying to add a new column which should represent the average for variable "count" on that nested dataframe.
But I get the error, which I don't quite understand:
Error: Problem with `mutate()` input `mean_count`.
x Input `mean_count` can't be recycled to size 1.
i Input `mean_count` is `map(.x = DataTib_nested$data, ~mean(.x$count))`.
i Input `mean_count` must be size 1, not 7.
i The error occured in group 1: disease = "Hepatitis A".
Thanks in advance and best regards!
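Your syntax is slightly wrong: inside mutate(), refer to the column as data rather than DataTib_nested$data. The tibble is still grouped by disease, so each group has a single row, but DataTib_nested$data always contains all 7 elements, which is why the result can't be recycled to size 1.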
DataTib_nested <- DataTib %>%
group_by(disease) %>%
nest(data = - disease)
Mean_count_nested <- DataTib_nested %>%
mutate(mean_count = map_dbl(data, ~mean(.x$count)))
Note that I use map_dbl instead of map since the return value is numeric.
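The same pattern can be tried without the dslabs data. This is a sketch on a made-up table; `toy` and its values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical stand-in for us_contagious_diseases
toy <- tibble(
  disease = c("A", "A", "B", "B", "B"),
  count = c(10, 20, 1, 2, 3)
)

nested <- toy %>%
  group_by(disease) %>%
  nest()

# `data` refers to the list-column of the current group,
# so map_dbl() returns exactly one number per row
means <- nested %>%
  mutate(mean_count = map_dbl(data, ~ mean(.x$count)))

print(means)
```

Because map_dbl() returns a plain double vector, mean_count prints as numbers rather than as a list-column.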

Passing column names as both variables and columns in a single dplyr function in R

I am writing code in which a column name (e.g. "Category") is supplied by the user and assigned to a variable biz.area. For example...
biz.area <- "Category"
The original data frame is saved as risk.data. User also supplies the range of columns to analyze by providing column names for variables first.column and last.column.
Text in these columns will be broken up into bigrams for further text analysis including tf_idf.
My code for this analysis is given below.
x.bigrams <- risk.data %>%
gather(fields, alldata, first.column:last.column) %>%
unnest_tokens(bigrams,alldata,token = "ngrams", n=2) %>%
count(bigrams, biz.area, sort=TRUE) %>%
bind_tf_idf(bigrams, biz.area, n) %>%
arrange(desc(tf_idf))
However, I get the following error.
Error in grouped_df_impl(data, unname(vars), drop) : Column
x.biz.area is unknown
This is because count() expects a column name text string instead of variable biz.area. If I use count_() instead, I get the following error.
Error in compat_lazy_dots(vars, caller_env()) : object 'bigrams'
not found
This is because count_() expects to find only variables and bigrams is not a variable.
How can I pass both a constant and a variable to count() or count_()?
Thanks for your suggestion!
It looks to me like you need quosures, so that you can pass column names as variables, rather than as strings or values. Since you're already using dplyr, you can use dplyr's non-standard evaluation techniques.
Try something along these lines:
library(tidyverse)
library(tidytext) # provides unnest_tokens() and bind_tf_idf()
analyze_risk <- function(area, firstcol, lastcol) {
# turn your arguments into quosures
areaq <- enquo(area)
firstcolq <- enquo(firstcol)
lastcolq <- enquo(lastcol)
# run your analysis on the risk data
risk.data %>%
gather(fields, alldata, !!firstcolq:!!lastcolq) %>%
unnest_tokens(bigrams,alldata,token = "ngrams", n=2) %>%
count(bigrams, !!areaq, sort=TRUE) %>%
bind_tf_idf(bigrams, !!areaq, n) %>%
arrange(desc(tf_idf))
}
In this case, your users would pass bare column names into the function like this:
myresults <- analyze_risk(Category, Name_of_Firstcol, Name_of_Lastcol)
If you want users to pass in strings, you'll need to convert each string to a symbol with rlang::sym() and unquote that with !!, instead of using enquo().
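For the string case, one option is to turn the string into a symbol with rlang::sym() and unquote it with !!. The sketch below shows this on count() alone; the function `count_by` and the table `toy` are hypothetical names made up for illustration, not part of the original code.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: `area` is a column name given as a string
count_by <- function(data, area) {
  area_sym <- sym(area)  # "Category" becomes the symbol Category
  data %>% count(!!area_sym, sort = TRUE)
}

# Hypothetical toy data
toy <- tibble(Category = c("a", "b", "a"))

biz.area <- "Category"
res <- count_by(toy, biz.area)
print(res)
```

The same sym() and !! combination should carry over to the other verbs in the pipeline that take a bare column name.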

Resources