I load a data table with fread the way I always do. The file has ~2M records and is tab-delimited.
The load is successful. I can print the head of the table and the column names, so far so good.
But then either renaming the first column or setting it as a key fails, complaining that it cannot find the old column name. I am sure there is no typo in the column name and no leading or trailing space; I tried many times with copy/paste and by retyping. I can rename apparently any other column.
The first column holds long-integer IDs, so I had to load the bit64 package to get rid of a warning in fread, but that did not seem to help. Could that be a clue?
Does anyone have an idea what could cause such a symptom? How can I debug it?
I use R 3.1.0 on 64-bit Windows, with the latest version of all packages.
Edit: more details
The data load command:
txnData <- fread(txnInDataPathFileName, header=TRUE, sep="\t", na.strings="NA")
The column names:
colnames(txnData)
[1] "txn_ext_id" "txn_desc" "txn_type_id" "site_id" "date_id" "device_id" "cust_id"
[8] "empl_id" "txn_start_time" "txn_end_time" "total_sales" "total_units" "gross_margin"
The column rename that fails (and so does setkey):
setnames(txnData, "txn_ext_id", "txnId")
Error in setnames(txnData, "txn_ext_id", "txnId") :
Items of 'old' not found in column names: txn_ext_id
And finally the requested dput command:
dput(head(txnData))
structure(list(`txn_ext_id` = structure(c(4.88536962440272e-311,
1.10971996159584e-311, 9.9460266389845e-312, 1.0227644072435e-311,
1.10329710699982e-311, 1.01930594588518e-311), class = "integer64"),
txn_desc = c("checkout transaction", "checkout transaction",
"checkout transaction", "checkout transaction", "checkout transaction",
"checkout transaction"), txn_type_id = c(0L, 0L, 0L, 0L,
0L, 0L), site_id = c(982L, 982L, 982L, 982L, 982L, 982L),
date_id = c("2012-12-24", "2013-11-27", "2013-04-08", "2013-06-04",
"2013-11-14", "2013-05-28"), device_id = c(8L, 7L, 8L, 53L,
8L, 5L), cust_id = structure(c(2.02600292130833e-313, 2.02572944866119e-313,
2.02583815970388e-313, 2.02580527009968e-313, 2.02568405005593e-313,
2.02736582767668e-313), class = "integer64"), empl_id = c("?",
"?", "?", "?", "?", "?"), txn_start_time = c("2012-12-24T08:35:56",
"2013-11-27T12:43:30", "2013-04-08T11:48:29", "2013-06-04T15:27:47",
"2013-11-14T12:57:38", "2013-05-28T11:03:21"), txn_end_time = c("2012-12-24T08:38:00",
"2013-11-27T12:47:00", "2013-04-08T11:49:00", "2013-06-04T15:35:00",
"2013-11-14T13:00:00", "2013-05-28T11:05:00"), total_sales = c(48.86,
69.7, 8.53, 33.46, 39.19, 35.56), total_units = c(12L, 44L,
3L, 4L, 14L, 17L), gross_margin = c(0, 0, 0, 0, 0, 0)), .Names = c("txn_ext_id",
"txn_desc", "txn_type_id", "site_id", "date_id", "device_id",
"cust_id", "empl_id", "txn_start_time", "txn_end_time", "total_sales",
"total_units", "gross_margin"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000000002c0788>)
The hidden character was the Byte Order Mark (BOM), displayed as ï»¿ when you do get a chance to see it. You can in principle see it in editors set to ANSI display mode -- though I could not in Notepad++! In R, printing the head of the data table does show it in RStudio, but it does not show it in Eclipse StatET, which I use by default, explaining why I did not notice it immediately.
See the following links re. how to get rid of the BOM character: SO1, SO2, yiiframework.
I loaded my file in Notepad++, chose Encoding -> Convert to UTF-8 without BOM, and saved; the BOM character disappeared and everything worked fine.
A pure R solution to this problem, without touching the file, is to include the BOM character as a prefix in the rename command: setnames(dataTable, "\ufefffirstColumnName", "firstColumnName"). This worked in RStudio, and I suppose it would work in the R console as well. However, it does not work in Eclipse StatET, where the BOM character remains hidden while still breaking data table accesses: the first column is not accessible with or without the BOM prefix in the name, and setnames fails either way.
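For reference, a defensive pure-R variant (a sketch, assuming the BOM survives into the first column name as U+FEFF) is to strip it from all column names right after loading, which works regardless of whether your IDE displays it:

```r
library(data.table)
library(bit64)  # for the integer64 id columns

txnData <- fread(txnInDataPathFileName, header = TRUE, sep = "\t", na.strings = "NA")

# Strip a leading byte order mark (U+FEFF) from any column name, if present
setnames(txnData, names(txnData), sub("^\ufeff", "", names(txnData)))

# The rename now works as expected
setnames(txnData, "txn_ext_id", "txnId")
```

Recent versions of data.table's fread also detect and skip a UTF-8 BOM automatically, so upgrading the package may avoid the problem entirely.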
Original Post
I am trying to gather some tweets using the {rtweet} package and store them in a MySQL database. But I am getting errors whenever I try to upload a tweet that contains emojis.
Here is my code:
# loading packages
library(rtweet)
library(DBI)
library(RMariaDB)
library(dplyr)
library(dbplyr)
library(lubridate)
# create twitter api token
twitterToken <- rtweet_bot(
api_key = "*****",
api_secret = "*****",
access_token = "*****",
access_secret = "*****"
)
# search tweets
tweets <- search_tweets(q = "beautiful", n = 500, type = "recent", include_rts = FALSE, token = twitterToken)
tweets$Topic <- "beautiful"
tweets$created_at <- parse_date_time(tweets$created_at, orders = c("%a %b %d %T %z %Y", "%Y-%m-%d %H:%M:%S"))
tweets$screen_name <- users_data(tweets)$screen_name
tweets$status_url <- paste0("https://twitter.com/", tweets$screen_name, "/status/", tweets$id_str)
tweets <- tweets %>% select(Topic, status_url, screen_name, created_at, text, favorite_count, retweet_count)
# upload to database
con <- dbConnect(MariaDB(), dbname="dbname", username="username", password="password", host="db_host", port=db_port, ssl.ca = "ssl.pem", load_data_local_infile = TRUE)
dbWriteTable(con, "testTweetsDB", tweets, overwrite = T)
This throws the following error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"MariaDBConnection", "character", "tweets"’
This is a bit odd because it used to work before I upgraded my R version and updated all the packages. But that is not the main issue; I can work around it with the following code:
tweets <- as.data.frame(tweets)
dbWriteTable(con, "testTweetsDB", tweets, overwrite = T)
This time I get the following error:
Error: Error executing query: Invalid utf8 character string: '#Alyri_tv So beautifull girl '
The string it complains about is the first tweet that contains emojis. It works perfectly fine if I only select tweets that don't have any emojis in them. It works even if the tweets contain Chinese, Korean and other language characters. It is the emojis that are causing the problem.
The default collation for the database is utf8mb4_unicode_ci
Update 1
Here is the output from dput(tweets[1:10,])
structure(list(Topic = c("beautiful", "beautiful", "beautiful",
"beautiful", "beautiful", "beautiful", "beautiful", "beautiful",
"beautiful", "beautiful"), status_url = c("https://twitter.com/supyo192/status/1623573544645926917",
"https://twitter.com/Ghana_Ronaldo/status/1623573542267756545",
"https://twitter.com/ValentinoAndy77/status/1623573539906281479",
"https://twitter.com/Gino15618462/status/1623573536550838273",
"https://twitter.com/kaylenicodemus/status/1623573533468000256",
"https://twitter.com/beace_tw/status/1623573527688347651", "https://twitter.com/JoyceNwanochi/status/1623573525968691200",
"https://twitter.com/Adasu_d_gr8/status/1623573525742120961",
"https://twitter.com/AsadAli62560407/status/1623573523070390276",
"https://twitter.com/HidingWolfe/status/1623573516313407488"),
screen_name = c("supyo192", "Ghana_Ronaldo", "ValentinoAndy77",
"Gino15618462", "kaylenicodemus", "beace_tw", "JoyceNwanochi",
"Adasu_d_gr8", "AsadAli62560407", "HidingWolfe"), created_at = structure(c(1675946672,
1675946671, 1675946671, 1675946670, 1675946669, 1675946668,
1675946667, 1675946667, 1675946667, 1675946665), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), text = c("#meeshell___ THESE ARE BEAUTIFUL!! 🥳😊❤💘\nLove! Love! & Good Vibes to you!! 🤗🌞💪",
"Good morning beautiful and handsome people , pls how are u doing today ?? 🥰🥰🥰",
"#geegeebanks Good morning have a beautiful and happy Thursday ❤ 💯 🌹 😘🤗🙏",
"#hossain_mita \"I learned that in feelings we are guided by mysterious laws, perhaps fate or perhaps a mirage, in any case something inexplicable. Because... after all.. there is never a reason why you fall in love. It happens and Enough.\" Good morning dear beautiful soul💖",
"hings are beautiful if you love them\n\n>>>>>>>>S50<<<<<<<\n>>>>>>S50<<<<<<<\n>>>>>>S50<<<<<<<<<",
"子育てママはなぜ燃える!? #ナイトブラ",
"#isaaczara_ This looks beautiful! \nWhole I'm not a designer, I've been doing some work in Canva these past months, and recently stumbled upon the \"Intro rust\" font. I love how sleek it it!\n\nMaybe some day it can feature in your series.\n\nThese are equally beautiful too.",
"Because she is extremely beautiful, gorgeous and charming does not mean you should over look her 🚩.\n\nThose red flags will swallow up everything you are seeing.",
"#farhat121212 Beautiful picture", "#MistressSnowPhD Now that is some beautiful ink. I’m sure he’d love it."
), favorite_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), retweet_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L)), row.names = c(NA, 10L), class = "data.frame")
If I try dbWriteTable(con, "testTweetsDB", tweets[c(5,6,7,9,10),], overwrite = T), it works fine.
If I try dbWriteTable(con, "testTweetsDB", tweets[-c(5,6,7,9,10),], overwrite = T), it gives me the error Error: Error executing query: Invalid utf8 character string: '#meeshell___ THESE ARE BEAUTIFUL!! '
Also, the table testTweetsDB does not exist on the database. I am relying on the dbWriteTable() function to create the table.
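One avenue worth checking (a hedged suggestion, not from the original post): MySQL's legacy utf8 charset stores at most three bytes per character and cannot hold emojis, which need four, so the client connection itself may have to be forced to utf8mb4 even when the database collation is already utf8mb4_unicode_ci. A minimal sketch reusing the connection settings from the question:

```r
library(DBI)
library(RMariaDB)

con <- dbConnect(MariaDB(), dbname = "dbname", username = "username",
                 password = "password", host = "db_host", port = db_port,
                 ssl.ca = "ssl.pem", load_data_local_infile = TRUE)

# Force the client connection to the 4-byte UTF-8 charset so emojis survive
dbExecute(con, "SET NAMES utf8mb4")

dbWriteTable(con, "testTweetsDB", as.data.frame(tweets), overwrite = TRUE)
```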
I am having trouble with an error message in R when I try to filter my dataframe. I've tried researching the problem but there doesn't seem to be anything directly related to my needs.
First I had some issues with duplicate row names, which is why I set row.names = NULL.
my_data <- read.csv("my_path\\my_folder\\file.csv", sep = "|", row.names = NULL)
Then I had some problems with shifted columns, so I used:
colnames(my_data) <- c(colnames(my_data)[-1],NULL)
These commands seemed to solve the problem for me. However, now I cannot filter my data with the tidyverse library. I have tried filter(my_data$column_1 > 45) and filter(column_1 > 45), but I get the same error messages.
While I type the filter command, I get the popup:
(TypeError): Cannot read property 'substr' of null
If I try to execute the command anyway I get:
Error in env_bind_lazy(private$bindings, !!!set_names(promises, names_bindings)) :
attempt to use zero-length variable name
I get the feeling that this is related to row.names = NULL, but I'm having trouble finding an alternative method to get my dataframe in order in a way that lets me filter it properly. Any advice will be greatly appreciated.
Thanks all!
(EDIT) Example data that was read from a csv file:
Also, I found that I only get the filter issue after I run the colnames(my_data) <- c(colnames(my_data)[-1], NULL) command.
my_index|GT|GQ|DP
1|" 0/1"|67|14
2|" 1/1"|52|11
1|" 0/1"|21|50
2|" 0/1"|39|10
dput result:
structure(list(index = 1:4, GT = c(" 0/1", " 1/1", " 0/1", " 0/1"
), GQ = c(67L, 52L, 21L, 39L), DP = c(14L, 11L, 5L, 1L)), row.names = c(NA,
4L), class = "data.frame")
We can also do this with append
colnames(my_data) <- append(colnames(my_data)[-1], "temp")
The error occurs because NULL is dropped by c(), leaving the last column without a valid name. Replacing NULL with any other arbitrary name makes it work fine. For example -
colnames(my_data) <- c(colnames(my_data)[-1], 'temp')
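Putting the pieces together, a sketch assuming the pipe-delimited sample above is saved as file.csv:

```r
library(dplyr)

my_data <- read.csv("file.csv", sep = "|", row.names = NULL)

# Drop the spurious first name and give the last column a placeholder name
colnames(my_data) <- c(colnames(my_data)[-1], "temp")

# Filtering works once every column has a valid name
filter(my_data, GQ > 45)
```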
When I extracted the metadata from an IR site, I found that a value in the dataframe could not be rewritten.
In the metadata I extracted, the attribute named "Related URLs" has the value "查看原文" (meaning "look up the source"), which needs to be replaced by its real link from the webpage.
> dput(imeta_dc)
structure(list(itemDisplayTable = structure(c(5L, 8L, 6L, 4L,
3L, 7L, 1L, 1L, 12L, 9L, 13L, 11L, 2L, 10L), .Names = c("Title",
"Author", "Source", "Issued Date", "Volume", "Corresponding Author",
"Abstract", "English Abstract", "Indexed Type", "Related URLs",
"Language", "Content Type", "URI", "专题"), .Label = c(" In the current data-intensive era, the traditional hands-on method of conducting scientific research by exploring related publications to generate a testable hypothesis is well on its way of becoming obsolete within just a year or two. Analyzing the literature and data to automatically generate a hypothesis might become the de facto approach to inform the core research efforts of those trying to master the exponentially rapid expansion of publications and datasets. Here, viewpoints are provided and discussed to help the understanding of challenges of data-driven discovery.",
"[http://ir.las.ac.cn/handle/12502/8904] ", "1, Issue:4, Pages:1-9",
"2016-11-03 ", "Data-driven Discovery: A New Era of Exploiting the Literature and Data",
"Journal of Data and Information Science ", "Ying Ding (E-mail:dingying#indiana.edu) ",
"Ying Ding; Kyle Stirling ", "查看原文 ", "期刊论文", "期刊论文 ",
"其他 ", "英语 "), class = "factor")), .Names = "itemDisplayTable", row.names = c("Title",
"Author", "Source", "Issued Date", "Volume", "Corresponding Author",
"Abstract", "English Abstract", "Indexed Type", "Related URLs",
"Language", "Content Type", "URI", "专题"), class = "data.frame")
I tried to use the row and column names to locate the "Related URLs" value and change it with this statement:
meta_ru <- "http://www.jdis.org"
imeta_dc[c("Related URLs"), c("itemDisplayTable")] <- meta_ru
I use row names instead of row numbers because the metadata records have different lengths and different attribute orders, so this is the only way to locate an attribute reliably. Furthermore, when I do this, no error or warning occurs, but the value is not written; it just changes to blank. What should I do to avoid this problem?
There is one problem with your dataset: the column itemDisplayTable is a factor. You need to first convert it to character and then use rownames() to locate the row and assign the value, like below.
df$itemDisplayTable <- as.character(df$itemDisplayTable)
meta_ru <- c("http://www.jdis.org")
df[(rownames(df) %in% c("Related URLs"))==T,"itemDisplayTable"] <- meta_ru
View(df)
Output:
You can see here that Related URLs is no longer empty and is filled with "http://www.jdis.org" in the final output.
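An alternative sketch that avoids the factor problem for every column at once (relevant on R versions before 4.0, where data.frame defaulted to stringsAsFactors = TRUE):

```r
# Convert all factor columns to character right after extraction, so later
# assignments store the new string instead of an undefined factor level
imeta_dc[] <- lapply(imeta_dc, function(col)
  if (is.factor(col)) as.character(col) else col)

imeta_dc["Related URLs", "itemDisplayTable"] <- "http://www.jdis.org"
```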
I am trying to get the total deaths from Ebola from the List of Ebola outbreaks and can't seem to find my mistake. Would appreciate some help. The page is http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks
I have used the following code:
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1)[[2]]
df1$"Human death"
but when I try to add the values using the sum function, it gives the following error:
Error in Summary.factor(c(5L, 12L, 1L, 2L, 9L, 1L, 1L, 1L, 1L, 14L, 1L, :
sum not meaningful for factors
Can someone please help me figure this out?
You are reading the table in with R's defaults, which convert characters to factors. You can use stringsAsFactors = FALSE in readHTMLTable, and this will be passed on to data.frame. Also, the table uses commas as thousands separators, which you will need to remove:
library(XML)
url1 <-'http://en.wikipedia.org/wiki/List_of_Ebola_outbreaks'
df1<- readHTMLTable(url1, which = 2, stringsAsFactors = FALSE)
df1$"Human death"
mySum <- sum(as.integer(gsub(",", "", df1$"Human death")))
> mySum
[1] 6910
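The comma cleanup generalizes to the table's other numeric columns; a small helper keeps it readable (the "Human cases" column name here is an assumption for illustration):

```r
# Convert strings like "1,234" to numbers; anything non-numeric becomes NA
to_num <- function(x) as.numeric(gsub(",", "", x))

total_deaths <- sum(to_num(df1$"Human death"), na.rm = TRUE)
total_cases  <- sum(to_num(df1$"Human cases"), na.rm = TRUE)  # assumed column name
```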
I have a dataframe that I am putting into a Sweave document using xtable; however, one of my column names is quite long, and I would like to break it over two lines to save space.
calqc_table<-structure(list(RUNID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), ANALYTEINDEX = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), ID = structure(1:11, .Label = c("Cal A", "Cal B", "Cal C",
"Cal D", "Cal E", "Cal F", "Cal G", "Cal H", "Cal High", "Cal Low",
"Cal Mid"), class = "factor"), mean_conc = c(200.619459644855,
158.264703128903, 102.469121407733, 50.3551544728544, 9.88296440865076,
4.41727762501703, 2.53494715706024, 1.00602831741361, 199.065054555735,
2.48063347296935, 50.1499780776199), sd_conc = c(2.3275711264554,
NA, NA, NA, NA, NA, NA, 0.101636943231162, 0, 0, 0), nrow = c(3,
1, 1, 1, 1, 1, 1, 3, 2, 2, 2)), .Names = c("Identifier of the Run within the Study", "ANALYTEINDEX",
"ID", "mean_conc", "sd_conc", "nrow"), row.names = c(NA, -11L
), class = "data.frame")
calqc_xtable<-xtable(calqc_table)
I have tried putting a newline into the name, but this didn't seem to work
names(calqc_table)[1]<-"Identifier of the \nRun within the Study"
Is there a way to do this? I have seen someone suggest using the latex function from the Hmisc package to manually iterate over the table and write it out in LaTeX, including the newline, but this seems like a bit of a faff!
The best way I have found to do this is to indicate the table column as a "fixed width" column so that the text inside it wraps. With the xtable package, this can be done with:
align( calqc_xtable ) <- c( 'l', 'p{1.5in}', rep('c',5) )
xtable demands that you provide an alignment for the optional "rownames" column: this is the initial l specification. The second specification, p{1.5in}, is used for your first column, whose header is quite long. This limits it to a box 1.5 inches in width, and the header will wrap onto multiple lines if necessary. The remaining five columns are set centered using the c specifier.
One major problem with fixed width columns like p{1.5in} is that they set the text using a justified alignment. This causes the inter-word spacing in each line to be expanded such that the line will fill up the entire 1.5 inches allotted.
Frankly, in most cases this produces results which I cannot describe using polite language (I'm an amateur typography nut, and this sort of behavior causes facial tics).
The fix is to provide a latex alignment command by prepending a >{} field to the column specification:
align( calqc_xtable ) <- c( 'l', '>{\\centering}p{1.5in}', rep('c',4) )
Other useful alignment commands are:
\raggedright -> causes text to be left aligned
\raggedleft -> causes text to be right aligned
Remember to double backslashes to escape them in R strings. You may also need to disable the string sanitation function that xtable uses by default.
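Putting the alignment and sanitization pieces together, a minimal sketch using the calqc_table from the question:

```r
library(xtable)

calqc_xtable <- xtable(calqc_table)

# 'l' aligns the rownames column; the centered 1.5in fixed-width box wraps the
# long first header; the remaining five columns stay centered
align(calqc_xtable) <- c("l", ">{\\centering}p{1.5in}", rep("c", 5))

# Pass column names through unchanged so the LaTeX alignment command survives
print(calqc_xtable, sanitize.colnames.function = function(x) x)
```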
Note
This alignment technique will fail if used on the last column of a table unless table rows are ended with \tabularnewline instead of \\, which I think is not the case with xtable and is not easily customizable through any user-settable option.
The other thing to consider is that you may not want the entire column line-wrapped to 1.5 inches and centered, just the header. In that case, disable xtable string sanitization and set your header using a \multicolumn cell of width 1:
names(calqc_table)[1]<-"\\multicolumn{1}{>{\\centering}p{1.5in}}{Identifier of the Run within the Study}"
@Sharpie's technique did not work for me, as pandoc failed with error 43 on conversion to PDF. Therefore, here is what I did:
moved the \\centering marker:
names(calqc_table)=c(rep("\\multicolumn{1}{p{0.75in}}{\\centering Identifier of the Run within the Study}", 6))
(here applied to all 6 columns of the table)
and disabled sanitization in xtable printing:
print(calqc_table, sanitize.colnames.function=function(x){x})
The following solution works for me: I first save the xtable to a file, then re-import it (as malexan describes here), and then replace the text which is to be broken into two lines, as suggested by egreg here. The tricky part is getting the escape characters "\" right. This also matters if the text you want to break into two lines contains special regex characters (for a list of these see e.g. here), as you need to match them in the "pattern" argument.
Below you can see how the table from the question above can be modified to break the first column name in two lines.
print(calqc_xtable, "calqc_xtable.tex", type = "latex")
readLines("calqc_xtable.tex") |>
stringr::str_replace(
pattern = "Identifier of the Run within the Study",
replace = paste("\\\\begin{tabular}[x]{@{}c@{}}Identifier of the Run \\\\\\\\ within the Study \\\\end{tabular}")) |>
writeLines(con = "calqc_xtable.tex")
Screenshot from the resulting LaTeX table