How can I use group_by() in text mining with R?

I am wondering why I can't use group_by() on data that contains a corpus text column.
I tried several packages, but none of them worked.
I also tried converting the data to a tibble.
My code:
data <- data %>%
group_by(Title) %>%
mutate(line = row_number()) %>%
ungroup()
The output:
Error:
! All columns in a tibble must be vectors.
✖ Column `text` is a `corpus_text` object.
Run `rlang::last_error()` to see where the error occurred.
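One possible workaround, sketched here under assumptions rather than taken from the original thread: the error says the `text` column is a `corpus_text` object, so converting it to a plain character vector before calling dplyr verbs may be enough. The sketch assumes the column comes from the corpus package and that `as.character()` works on it; the `Title` and text values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the asker's data; the values are hypothetical.
data <- data.frame(Title = c("A", "A", "B"), stringsAsFactors = FALSE)
texts <- c("first line", "second line", "third line")

# Assumption: the text column was built with the corpus package.
# If corpus is not installed, fall back to a plain character vector.
if (requireNamespace("corpus", quietly = TRUE)) {
  text_col <- corpus::as_corpus_text(texts)
} else {
  text_col <- texts
}

# Store the column as plain character so tibble/dplyr accept it.
data$text <- as.character(text_col)

result <- data %>%
  group_by(Title) %>%
  mutate(line = row_number()) %>%
  ungroup()
```

With an existing data frame, the key step would be `data$text <- as.character(data$text)` before the pipeline.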

Related

na_if() function in R started giving error recently

I was using the following code to get rid of empty cell in my dataframe.
df %>%
# recode empty strings "" by NAs
na_if("") %>%
# remove NAs
na.omit()
It was working fine until recently, but now I am getting the following error:
Error in na_if():
! Can't convert y to match type of x <tbl_df>.
Run rlang::last_error() to see where the error occurred.
rlang::last_error()
<error/vctrs_error_cast>
Error in na_if():
! Can't convert y to match type of x <tbl_df>.
I am using R version 4.1.3 and dplyr 1.1.0.
Note: I get the same error when using
df %>% mutate_all(~na_if(.,"")) %>%
na.omit()

library(tidyverse)
set.seed(2023)
df <- data.frame(values=sample(c(letters[1:3],""),30,T))
no_na_df <- df %>%
na_if("") %>%
na.omit()
map_dbl(list(df,no_na_df),nrow) # print number of rows of each data set
Output:
[1] 30 22
If I may suggest an easier base R version (it can be used inside mutate() as well):
replace(df$values,which(df$values==""),NA)
df %>% mutate(values_no_na=replace(values,which(values==""),NA)) %>% view()
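Since the question reports the error with dplyr 1.1.0, where `na_if()` no longer accepts a whole data frame as its first argument, a column-wise variant may be useful. The following is a minimal sketch, assuming the empty strings sit in character columns; the toy data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical data with two empty strings.
df <- data.frame(values = c("a", "", "b", "", "c"), stringsAsFactors = FALSE)

no_na_df <- df %>%
  # apply na_if() to each character column instead of the whole data frame
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  # drop the rows that now contain NA
  na.omit()

nrow(no_na_df)  # 3 rows remain
```

Restricting `across()` to character columns avoids asking `na_if()` to compare `""` against numeric columns.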

How to remove duplicate rows in R within Arrow?

I work with an Arrow dataset to reduce RAM usage, but I ran into the following problem.
I need to remove duplicate rows. With dplyr I can do it using distinct(), but this function isn't supported in Arrow.
Any ideas?
Following the recommendations, I wrote the following code
Sales_2021 <- Sales_2021 %>%
group_by(`Cust-Item-Loc`) %>%
arrange(desc(SBINDT)) %>%
distinct(`Cust-Item-Loc`, .keep_all = TRUE) %>%
collect()
and got the error message
Error: `distinct()` with `.keep_all = TRUE` not supported in Arrow
How can I keep only the first row of each group?
The advice to use filter(!duplicated()) does not work either.
Sales_2021 <- Sales_2021 %>%
group_by(`Cust-Item-Loc`) %>%
arrange(desc(SBINDT)) %>%
filter(!duplicated(`Cust-Item-Loc`)) %>%
collect()
Error message
Error: Filter expression not supported for Arrow Datasets: !duplicated(`Cust-Item-Loc`)
Call collect() first to pull data into R.
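The error message itself points at one option: call collect() first and deduplicate in R. The sketch below follows that hint and is not an Arrow-native solution; it assumes the data fits in memory once only the needed columns are selected. The toy table and the integer encoding of `SBINDT` are hypothetical.

```r
library(dplyr)
library(arrow)

# Hypothetical stand-in for the Arrow data; real column types may differ.
Sales_2021 <- arrow_table(tibble(
  `Cust-Item-Loc` = c("A", "A", "B"),
  SBINDT = c(20210101L, 20210301L, 20210201L)
))

latest <- Sales_2021 %>%
  select(`Cust-Item-Loc`, SBINDT) %>%  # keep only needed columns to limit RAM
  collect() %>%                        # pull the data into R as a tibble
  arrange(desc(SBINDT)) %>%            # newest SBINDT first
  distinct(`Cust-Item-Loc`, .keep_all = TRUE)  # first row per key
```

Because the rows are sorted before distinct(), the row kept for each `Cust-Item-Loc` is the one with the largest `SBINDT`.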

How to use mutate function in R?

I am trying to create a plot in R. For that, I first need to add a new column called "export_ratio" with mutate(). I tried the code below, but I get the error shown after it. Can someone please help?
eu_macro %>%
mutate(export_ratio = (exports/gdp)*100) %>%
filter(year>1995) %>%
filter(country %in% c("Germany","France","Spain","Sweden")) %>%
ggplot(aes(year,export_ratio))+
geom_line(aes(color=country))
Error: Problem with mutate() column export_ratio.
i export_ratio = (exports/gdp) * 100.
x non-numeric argument to binary operator
Sample Data:

The problem here is that your exports variable is of character type, so the division exports/gdp is not defined. Convert it to numeric first:
eu_macro %>%
mutate(export_ratio = (as.numeric(exports)/gdp)*100)
# then continue with the same filter() and ggplot() steps as in the question
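As a small self-contained illustration of the conversion described above, here is a sketch with a made-up `eu_macro` data frame (the real data was not included in the post). It checks the column type first and then converts inside mutate().

```r
library(dplyr)

# Hypothetical sample; exports is deliberately stored as character.
eu_macro <- data.frame(
  country = c("Germany", "France"),
  year = c(2000, 2000),
  exports = c("50", "30"),
  gdp = c(200, 100),
  stringsAsFactors = FALSE
)

is.character(eu_macro$exports)  # TRUE, which is why exports / gdp fails

fixed <- eu_macro %>%
  mutate(
    exports = as.numeric(exports),         # convert once
    export_ratio = (exports / gdp) * 100   # now plain numeric arithmetic
  )
```

If `as.numeric()` produces NA values with a coercion warning, the column probably contains non-numeric text such as thousands separators that would need cleaning first.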

How do I convert a column from character to double in R?

I am trying to group a dataset on a certain value and then sum a column based on this grouped value.
UN.surface.area.share <- left_join(countries, UN.surface.area, by = 'country') %>% drop_na() %>%
rename('surface.area' = 'Surface.area..km2.') %>% group_by(region) %>% summarise(total.area = sum(surface.area))
When I run this I get this error:
Error: Problem with `summarise()` input `total.area`.
x invalid 'type' (character) of argument
i Input `total.area` is `sum(surface.area)`.
i The error occurred in group 1: region = "Africa".
I think the problem is that the 'surface.area' column is of the character type and therefore the sum function doesn't work. I tried adding %>% as.numeric('surface.area') to the previous code:
UN.surface.area.share <- left_join(countries, UN.surface.area, by = 'country') %>% drop_na() %>%
rename('surface.area' = 'Surface.area..km2.') %>% as.numeric('surface.area') %>% group_by(region) %>% summarise(total.area = sum(surface.area))
But this gives the following error:
Error in group_by(., region) :
'list' object cannot be coerced to type 'double'
I think this problem can be solved by changing the 'surface.area' column to a numeric datatype but I am not sure how to do this. I checked the column and it only consists of numbers.

Use dplyr::mutate()
So instead of:
... %>% as.numeric('surface.area') %>%...
do:
...%>% mutate(surface.area = as.numeric(surface.area)) %>%...
mutate() changes one or more variables within a data frame. When you pipe to as.numeric, as you're currently doing, you're effectively asking R to run
as.numeric(data.frame.you.piped.in, 'surface.area')
as.numeric then tries to convert the whole data frame into a number, which it can't do since a data frame is a list object. Hence your error. The second argument, 'surface.area', does not select a column either, because as.numeric has no such parameter.
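The mutate() approach can be tried on a small example. This is a minimal sketch with made-up regions and areas, not the asker's UN data; it converts the character column and then sums it per group.

```r
library(dplyr)

# Hypothetical data; surface.area is stored as character on purpose.
areas <- data.frame(
  region = c("Africa", "Africa", "Europe"),
  surface.area = c("10", "20", "5"),
  stringsAsFactors = FALSE
)

totals <- areas %>%
  mutate(surface.area = as.numeric(surface.area)) %>%  # convert the column
  group_by(region) %>%
  summarise(total.area = sum(surface.area))            # sum() now gets numbers
```

Here `totals` has one row per region, with `total.area` equal to 30 for Africa and 5 for Europe.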

R version 3.6.3 (2020-02-29) | Using package dplyr_1.0.0 | Unable to perform summarise() function

I am trying to use the basic summarise() function but keep getting the same error.
I have a large number of CSV files with 4 columns each. I read them into R using lapply and rbind them together. Next, I need to see the number of complete observations present for each ID.
Error:
Problem with `summarise()` input `complete_cases`.
x unused argument (Date)
i Input `complete_cases` is `n(Date)`.
i The error occured in group 1: ID = 1.
Code:
library(dplyr)
merged <-do.call(rbind,lapply(list.files(),read.csv))
merged <- as.data.frame(merged)
remove_na <- merged[complete.cases(merged),]
new_data <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n(Date))
Here is what the data looks like

The problem is not coming from summarise but from n.
If you look at the help ?n, you will see that n is used without any argument, like this:
new_data_count <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())
This will count the number of rows for each ID group and is independent from the Date column. You could also use the shortcut function count:
new_data_count <- remove_na %>% count(ID)
If you want to count the different Date values, you might want to use n_distinct:
new_data_count_dates <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
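The difference between the three variants is easier to see on a tiny data set. The sketch below uses a made-up `remove_na` data frame with the `ID` and `Date` columns from the question; the values are hypothetical.

```r
library(dplyr)

# Hypothetical data: ID 1 has three rows but only two distinct dates.
remove_na <- data.frame(
  ID = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03"),
  stringsAsFactors = FALSE
)

# Rows per ID: n() takes no arguments.
rows_per_id <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n())

# Same counts via the count() shortcut; the result column is named n.
rows_per_id_short <- remove_na %>% count(ID)

# Distinct Date values per ID.
dates_per_id <- remove_na %>% group_by(ID) %>% summarise(complete_cases = n_distinct(Date))
```

For this data, the row counts are 3 and 2, while the distinct date counts are 2 and 2.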
Of note, you could have written your code with the purrr::map family, which is more predictable than the base *apply functions because the suffix specifies the return type. This could look like this:
library(purrr)
remove_na <- map_dfr(list.files(), read.csv) %>% na.omit()
Here, map_dfr returns a data.frame with binding rows, but you could have used map_dfc which returns a data.frame with binding columns.
