cannot upload tweets with emojis to mysql database using RMariaDB - r

Original Post
I am trying to gather some tweets using the {rtweet} package and store them in a MySQL database. But I am getting errors whenever I try to upload a tweet that contains emojis.
Here is my code:
# loading packages
library(rtweet)
library(DBI)
library(RMariaDB)
library(dplyr)
library(dbplyr)
library(lubridate)
# create twitter api token
twitterToken <- rtweet_bot(
  api_key = "*****",
  api_secret = "*****",
  access_token = "*****",
  access_secret = "*****"
)
# search tweets
tweets <- search_tweets(q = "beautiful", n = 500, type = "recent", include_rts = FALSE, token = twitterToken)
tweets$Topic <- "beautiful"
tweets$created_at <- parse_date_time(tweets$created_at, orders = c("%a %b %d %T %z %Y", "%Y-%m-%d %H:%M:%S"))
tweets$screen_name <- users_data(tweets)$screen_name
tweets$status_url <- paste0("https://twitter.com/", tweets$screen_name, "/status/", tweets$id_str)
tweets <- tweets %>% select(Topic, status_url, screen_name, created_at, text, favorite_count, retweet_count)
# upload to database
con <- dbConnect(MariaDB(), dbname="dbname", username="username", password="password", host="db_host", port=db_port, ssl.ca = "ssl.pem", load_data_local_infile = TRUE)
dbWriteTable(con, "testTweetsDB", tweets, overwrite = T)
This throws the following error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"MariaDBConnection", "character", "tweets"’
This is a bit odd because it used to work before I upgraded my R version and updated all the packages. But that is not the main issue; I can work around it with the following code:
tweets <- as.data.frame(tweets)
dbWriteTable(con, "testTweetsDB", tweets, overwrite = T)
This time I get the following error:
Error: Error executing query: Invalid utf8 character string: '#Alyri_tv So beautifull girl '
The string it complains about is the first tweet that contains emojis. It works perfectly fine if I only select tweets that don't have any emojis in them. It works even if the tweets contain Chinese, Korean and other language characters. It is the emojis that are causing the problem.
The default collation for the database is utf8mb4_unicode_ci
Update 1
Here is the output from dput(tweets[1:10,])
structure(list(Topic = c("beautiful", "beautiful", "beautiful",
"beautiful", "beautiful", "beautiful", "beautiful", "beautiful",
"beautiful", "beautiful"), status_url = c("https://twitter.com/supyo192/status/1623573544645926917",
"https://twitter.com/Ghana_Ronaldo/status/1623573542267756545",
"https://twitter.com/ValentinoAndy77/status/1623573539906281479",
"https://twitter.com/Gino15618462/status/1623573536550838273",
"https://twitter.com/kaylenicodemus/status/1623573533468000256",
"https://twitter.com/beace_tw/status/1623573527688347651", "https://twitter.com/JoyceNwanochi/status/1623573525968691200",
"https://twitter.com/Adasu_d_gr8/status/1623573525742120961",
"https://twitter.com/AsadAli62560407/status/1623573523070390276",
"https://twitter.com/HidingWolfe/status/1623573516313407488"),
screen_name = c("supyo192", "Ghana_Ronaldo", "ValentinoAndy77",
"Gino15618462", "kaylenicodemus", "beace_tw", "JoyceNwanochi",
"Adasu_d_gr8", "AsadAli62560407", "HidingWolfe"), created_at = structure(c(1675946672,
1675946671, 1675946671, 1675946670, 1675946669, 1675946668,
1675946667, 1675946667, 1675946667, 1675946665), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), text = c("#meeshell___ THESE ARE BEAUTIFUL!! 🥳😊❤💘\nLove! Love! & Good Vibes to you!! 🤗🌞💪",
"Good morning beautiful and handsome people , pls how are u doing today ?? 🥰🥰🥰",
"#geegeebanks Good morning have a beautiful and happy Thursday ❤ 💯 🌹 😘🤗🙏",
"#hossain_mita \"I learned that in feelings we are guided by mysterious laws, perhaps fate or perhaps a mirage, in any case something inexplicable. Because... after all.. there is never a reason why you fall in love. It happens and Enough.\" Good morning dear beautiful soul💖",
"hings are beautiful if you love them\n\n>>>>>>>>S50<<<<<<<\n>>>>>>S50<<<<<<<\n>>>>>>S50<<<<<<<<<",
"子育てママはなぜ燃える!? #ナイトブラ",
"#isaaczara_ This looks beautiful! \nWhole I'm not a designer, I've been doing some work in Canva these past months, and recently stumbled upon the \"Intro rust\" font. I love how sleek it it!\n\nMaybe some day it can feature in your series.\n\nThese are equally beautiful too.",
"Because she is extremely beautiful, gorgeous and charming does not mean you should over look her 🚩.\n\nThose red flags will swallow up everything you are seeing.",
"#farhat121212 Beautiful picture", "#MistressSnowPhD Now that is some beautiful ink. I’m sure he’d love it."
), favorite_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), retweet_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), row.names = c(NA, 10L), class = "data.frame")
If I try dbWriteTable(con, "testTweetsDB", tweets[c(5,6,7,9,10),], overwrite = T), it works fine.
If I try dbWriteTable(con, "testTweetsDB", tweets[-c(5,6,7,9,10),], overwrite = T), it gives me the error Error: Error executing query: Invalid utf8 character string: '#meeshell___ THESE ARE BEAUTIFUL!! '
Also, the table testTweetsDB does not exist on the database. I am relying on the dbWriteTable() function to create the table.
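One likely culprit worth checking: MySQL's legacy utf8 charset stores at most 3 bytes per character, while emojis need 4, so the connection's session charset must be utf8mb4 even when the table collation already is. A minimal sketch (credentials and host are placeholders from the question) that forces this before writing:

```r
library(DBI)
library(RMariaDB)

# placeholder connection details -- replace with your own
con <- dbConnect(MariaDB(), dbname = "dbname", username = "username",
                 password = "password", host = "db_host",
                 load_data_local_infile = TRUE)

# make the session use the 4-byte charset so emojis survive the round trip
dbExecute(con, "SET NAMES utf8mb4")

dbWriteTable(con, "testTweetsDB", as.data.frame(tweets), overwrite = TRUE)
```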

Related

(TypeError): Cannot read property 'substr' of null | cannot filter dataframe

I am having trouble with an error message in R when I try to filter my dataframe. I've tried researching the problem but there doesn't seem to be anything directly related to my needs.
First I had some issues with duplicate row names, which is why I set row.names = NULL.
my_data <- read.csv("my_path\\my_folder\\file.csv", sep = "|", row.names = NULL)
Then I had some problems with shifting columns, so then I used:
colnames(my_data) <- c(colnames(my_data)[-1],NULL)
These commands seem to solve the problem for me. However, now I cannot filter my data with the tidyverse library. I have tried filter(my_data$column_1 >45) and filter(column_1 >45) but I get the same error messages.
While I type the filter command, I get the popup:
(TypeError): Cannot read property 'substr' of null
If I try to execute the command anyway I get:
Error in env_bind_lazy(private$bindings, !!!set_names(promises, names_bindings)) :
attempt to use zero-length variable name
I get the feeling that this is related to row.names = NULL, but I'm having trouble finding an alternative method to get my dataframe in order in a way that I can filter it properly. Any advice will be greatly appreciated.
Thanks all!
(EDIT) Example data that was read from a csv file:
Also I found that I only get the filter issue after I run the colnames(my_data) <- c(colnames(my_data)[-1],NULL) command.
my_index|GT|GQ|DP
1|" 0/1"|67|14
2|" 1/1"|52|11
1|" 0/1"|21|50
2|" 0/1"|39|10
dput result:
structure(list(index = 1:4, GT = c(" 0/1", " 1/1", " 0/1", " 0/1"
), GQ = c(67L, 52L, 21L, 39L), DP = c(14L, 11L, 5L, 1L)), row.names = c(NA,
4L), class = "data.frame")
The error occurs because c(colnames(my_data)[-1], NULL) silently drops the NULL, leaving the last column with a zero-length name. Rename that column with any other arbitrary name and it should work fine. For example -
colnames(my_data) <- c(colnames(my_data)[-1], 'temp')
We can also do this with append
colnames(my_data) <- append(colnames(my_data)[-1], "temp")
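To see why the original rename misbehaves: NULL simply vanishes inside c(), so the names vector comes out one short. A quick base-R illustration using the column names from the example data:

```r
nm <- c("my_index", "GT", "GQ", "DP")

# NULL is dropped by c(), so this is one name short of four columns
length(c(nm[-1], NULL))    # 3

# a real placeholder name keeps the vector the right length
length(c(nm[-1], "temp"))  # 4
```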

R: Error in UseMethod("tbl_vars")

So I'm running the code below in R Studio and getting this error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars'
applied to an object of class "character"
I don't know how to fix it because there is no tbl_vars function! Can someone help?
for (i in 1:ceiling(nrow(reviews)/batch)) {
  row_start <- i*batch - batch + 1
  row_end <- ifelse(i*batch < nrow(reviews), i*batch, nrow(reviews))
  print(paste("Processing row", row_start, "to row", row_end))
  reviews[row_start:row_end, ] %>%
    unnest_tokens(word, text) -> reviews_subset
  reviews_subset$row <- 1:nrow(reviews_subset)
  reviews_subset %>%
    anti_join(stopwords) %>%
    arrange(row) -> reviews_subset
  write_feather(reviews_subset, path = paste0("reviews", i, ".txt"))
}
Ps: dplyr is installed. Also other installed packages: pacman, feather, data.table, devtools, tidyr, tidytext, tokenizers, tibble
I'm using it to work with Yelp dataset.
Thank you so much,
Carmem
ps2: dataset example (edited and simplified to fit here):
> dput(as.data.frame(review))
structure(list(user_id = 1:10, review_id = 11:20, business_id = 21:30,
stars = c(2L, 2L, 5L, 4L, 4L, 5L, 4L, 3L, 5L, 4L), text = c("Are you the type of person that requires being seen in an expensive, overly pretentious restaurant so that you can wear it as a status symbol? Or maybe you're a gansta who dresses like CiLo Green and wants to show the hunny's (yes, a group of them out with one man) a night on the town!",
"Today was my first visit to the new luna, and I was disappointed-- both because I really liked the old cafe luna, and because the new luna came well recommended",
"Stayed here a few months ago and still remember the great service I received.",
"I came here for a business lunch from NYC and had a VERY appetizing meal. ",
"Incredible food with great flavor. ",
"OMG, y'all, try the Apple Pie Moonshine. It. Is. Seriously. Good. Smoooooooth. The best rum that I've sampled so far: Zaya.",
"Caitlin is an amazing stylist. She took time to hear what I had to say before jumping in",
"Oh yeah! After some difficulties in securing dinner, my dad and I found ourselves at one of the billion Primanti's locations for a quick feast",
"I've been going to this studio since the beginning of January",
"The best cannoli, hands down!!"
)), .Names = c("user_id", "review_id", "business_id", "stars",
"text"), row.names = c(NA, -10L), class = "data.frame")
Change anti_join(stopwords) to anti_join(stop_words); stopwords probably doesn't exist or isn't what you want it to be.
The
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars'...
message is not being caused by a missing tbl_vars function. I ran into this exact same error when I mistakenly passed a vector to a dplyr join function instead of another dataframe. Here is a simple example of how to generate this error in R 3.5 using dplyr 0.7.5:
library(dplyr)
# Create a dataframe of sales by person and bike color
salesNames = c('Sally', 'Jim', 'Chris', 'Chris', 'Jim',
               'Sally', 'Jim', 'Sally', 'Chris', 'Sally')
salesDates = c('2018-06-01', '2018-06-05', '2018-06-10', '2018-06-15',
               '2018-06-20', '2018-06-25', '2018-06-30', '2018-07-09',
               '2018-07-12', '2018-07-14')
salesColor = c('red', 'red', 'red', 'green', 'red',
               'blue', 'green', 'green', 'green', 'blue')
df_sales = data.frame(Salesperson = salesNames,
                      SalesDate = as.Date(salesDates),
                      BikeColor = salesColor,
                      stringsAsFactors = F)

# Create another dataframe to join to
modelColor = c('red', 'blue', 'green', 'yellow', 'orange', 'black')
modelPrice = c(279.95, 269.95, 264.95, 233.54, 255.27, 289.95)
modelCommission = modelPrice * 0.20
df_commissions = data.frame(ModelColor = modelColor,
                            ModelPrice = modelPrice,
                            Commission = modelCommission,
                            stringsAsFactors = F)

df_sales_comm = df_sales %>% left_join(df_commissions,
                                       by = c('BikeColor' = 'ModelColor'))
This works fine. Now try this:
df_comms = df_commissions$ModelColor  # vector instead of dataframe
df_sales_comm2 = df_sales %>% left_join(df_comms,
                                        by = c('BikeColor' = 'ModelColor'))
and you should see the exact same error you report, because df_comms is not a dataframe. The problem you are having is that stopwords is a vector and not a dataframe (or a tibble).
There are several ways to resolve this error. As Szczepaniak points out, the root cause is attempting to pass a character vector into an operation that expects a data frame or tibble.
Option 1: convert the character vector to a data frame (or tibble), then use it in anti_join. An example conversion (the column name must match the word column created by unnest_tokens):
stopwords <- tibble(word = stopwords)
Option 2: change the operation to accept a character vector. In this case we can use filter in place of anti_join, as shown here:
reviews_subset <- reviews_subset %>%
  filter(!word %in% stopwords) %>%
  arrange(row)
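A minimal, self-contained sketch of Option 1 with toy data (the word lists here are invented for illustration):

```r
library(dplyr)
library(tibble)

tokens <- tibble(word = c("the", "apple", "a", "pie"), row = 1:4)
stopwords_vec <- c("the", "a", "an")

# anti_join needs a data frame, so wrap the vector in a tibble first
stopwords_df <- tibble(word = stopwords_vec)
tokens %>% anti_join(stopwords_df, by = "word")
# keeps only the "apple" and "pie" rows
```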

How to prepare transaction data for arules

I've been digging the questions for 3 days already so finally have a courage to ask here.
I have a dataset of 379,584 entries and I want to feed it to "arules" in R
It looks like this
A. If I try to go with the format = "basket", I do the following
sales <- read.csv("sales.csv", sep=";")
s1 <- split(sales$product_id, sales$order_id)
s1 <- unique(s1)
tr <- as(s1, "transactions")
This gives me an error "can not coerce list with transactions with duplicated items"
B. If I go with the format = "single"
tr <- read.transactions("sales.csv",
                        sep = ";", format = "single", cols = c(4,2))
I have the same error "can not coerce list with transactions with duplicated items"
I've already checked the files for duplicates and Excel can't find any. I believe the trouble is trivial but I'm just stuck.
Apparently the unique(s1) line is causing the problem in your code. Is it required?
I managed to create the transactions object just by commenting out that line.
sales <- structure(list(sku = c(207426L, 207422L, 207424L, 9793L, 33186L,
72406L), product_id = c(15729L, 15725L, 15727L, 15999L, 15983L,
15992L), item_id = 1:6, order_id = c(1L, 1L, 1L, 2L, 2L, 2L)),
.Names = c("sku", "product_id", "item_id", "order_id"),
class = "data.frame", row.names = c(NA, -6L))
s1 <- split(sales$product_id, sales$order_id)
#s1 <- unique(s1)
tr <- as(s1, "transactions")
tr
transactions in sparse format with
2 transactions (rows) and
6 items (columns)
If unique is really required, run this instead:
s1 <- lapply(s1, unique)
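The difference matters because unique(s1) de-duplicates whole transactions in the list, while the coercion error complains about duplicated items inside a single transaction. A small base-R illustration with made-up order data:

```r
# two orders; order 1 lists product 15729 twice
s1 <- list(`1` = c(15729, 15729, 15725), `2` = c(15999, 15983))

# unique() on the list compares whole transactions,
# so the within-order duplicate survives
unique(s1)[[1]]          # 15729 15729 15725

# lapply(s1, unique) de-duplicates inside each transaction,
# which is what as(s1, "transactions") requires
lapply(s1, unique)[[1]]  # 15729 15725
```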

How to add values for a column after scraping the table from a website

I am trying to get the total deaths from Ebola from the List of Ebola outbreaks and can't seem to find my mistake. Would appreciate some help. The website link is http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks
I have used the following code:
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1)[[2]]
df1$"Human death"
but when I try to add up the values using the sum function, it gives the following error:
Error in Summary.factor(c(5L, 12L, 1L, 2L, 9L, 1L, 1L, 1L, 1L, 14L, 1L, :
sum not meaningful for factors
Can someone please help me figure this out?
You are reading the table in with R's defaults, which convert characters to factors. You can use stringsAsFactors = FALSE in readHTMLTable and this will be passed to data.frame. Also, the table uses commas as thousands separators, which you will need to remove:
library(XML)
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1, which = 2, stringsAsFactors = FALSE)
df1$"Human death"
mySum <- sum(as.integer(gsub(",", "", df1$"Human death")))
> mySum
[1] 6910
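The comma-stripping step in isolation, with made-up figures:

```r
deaths <- c("1,590", "66", "2,280")

# remove thousands separators, then convert to integer
as.integer(gsub(",", "", deaths))
# [1] 1590   66 2280
```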

Unrecognized column name after loading data table with fread

I load a data table with fread the way I always do. The file has ~2M records and is tab delimited.
The load is successful. I can print the head of the table and the column names, so far so good.
But then either changing the name of the first column or setting it as a key fails, complaining it cannot find the old column name. I am sure there is no typo in the column name and no leading or trailing space; I tried many times with copy/paste and retyping. I can change the name of apparently any other column.
The first column is long integer id's, so I had to load the bit64 package to get rid of a warning in 'fread', but it did not seem to help. Is it a clue?
Does anyone have any idea what could cause such a symptom? How to debug?
I use R 3.1.0 on Windows 64, latest version of all packages.
Edit: more details
The data load command:
txnData <- fread(txnInDataPathFileName, header=TRUE, sep="\t", na.strings="NA")
The column names:
colnames(txnData)
[1] "txn_ext_id" "txn_desc" "txn_type_id" "site_id" "date_id" "device_id" "cust_id"
[8] "empl_id" "txn_start_time" "txn_end_time" "total_sales" "total_units" "gross_margin"
The rename column that fails (and so does setkey):
setnames(txnData, "txn_ext_id", "txnId")
Error in setnames(txnData, "txn_ext_id", "txnId") :
Items of 'old' not found in column names: txn_ext_id
And finally the requested dput command:
dput(head(txnData))
structure(list(`txn_ext_id` = structure(c(4.88536962440272e-311,
1.10971996159584e-311, 9.9460266389845e-312, 1.0227644072435e-311,
1.10329710699982e-311, 1.01930594588518e-311), class = "integer64"),
txn_desc = c("checkout transaction", "checkout transaction",
"checkout transaction", "checkout transaction", "checkout transaction",
"checkout transaction"), txn_type_id = c(0L, 0L, 0L, 0L,
0L, 0L), site_id = c(982L, 982L, 982L, 982L, 982L, 982L),
date_id = c("2012-12-24", "2013-11-27", "2013-04-08", "2013-06-04",
"2013-11-14", "2013-05-28"), device_id = c(8L, 7L, 8L, 53L,
8L, 5L), cust_id = structure(c(2.02600292130833e-313, 2.02572944866119e-313,
2.02583815970388e-313, 2.02580527009968e-313, 2.02568405005593e-313,
2.02736582767668e-313), class = "integer64"), empl_id = c("?",
"?", "?", "?", "?", "?"), txn_start_time = c("2012-12-24T08:35:56",
"2013-11-27T12:43:30", "2013-04-08T11:48:29", "2013-06-04T15:27:47",
"2013-11-14T12:57:38", "2013-05-28T11:03:21"), txn_end_time = c("2012-12-24T08:38:00",
"2013-11-27T12:47:00", "2013-04-08T11:49:00", "2013-06-04T15:35:00",
"2013-11-14T13:00:00", "2013-05-28T11:05:00"), total_sales = c(48.86,
69.7, 8.53, 33.46, 39.19, 35.56), total_units = c(12L, 44L,
3L, 4L, 14L, 17L), gross_margin = c(0, 0, 0, 0, 0, 0)), .Names = c("txn_ext_id",
"txn_desc", "txn_type_id", "site_id", "date_id", "device_id",
"cust_id", "empl_id", "txn_start_time", "txn_end_time", "total_sales",
"total_units", "gross_margin"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000000002c0788>)
The hidden character was the Byte Order Mark (BOM), displayed as  when you get a chance to see it. You can in principle see it in editors set to ANSI display mode -- well, I could not in Notepad++! In R, printing the head of the data table does show it when using RStudio, but not in Eclipse StatET, which I use by default, explaining why I did not notice it immediately.
See the following links re. how to get rid of the BOM character: SO1, SO2, yiiframework.
I loaded my file in Notepad++, Encoding -> Convert to UTF-8 without BOM, saved, and this BOM character disappeared, all went fine.
A pure R solution to this problem without touching the file is to include the BOM character as the prefix in the rename command: setnames(dataTable, "\uFEFFfirstColumnName", "firstColumnName"). This worked in RStudio and I suppose would work in the R console as well. However, it does not work in Eclipse StatET, as the BOM character remains hidden while messing up data table accesses: the first column is not accessible with or without the BOM prefix in the name, and setnames fails either way.
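A related approach that avoids typing the invisible character at all is to strip a leading BOM from whatever name fread produced. A sketch (the file name is a placeholder; note that recent data.table versions handle the BOM automatically, which was not the case in 2014-era versions):

```r
library(data.table)

txnData <- fread("transactions.tsv", header = TRUE, sep = "\t")

# drop a leading UTF-8 BOM from the first column name, if present
setnames(txnData, 1, sub("^\ufeff", "", names(txnData)[1]))

setnames(txnData, "txn_ext_id", "txnId")  # now resolves the column
```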