Preserve content while displaying data columnwise from MongoDB - r

1. Reading data from Twitter and then saving it in MongoDB:
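# Fetch 10 tweets for the hashtag and convert them to a data frame (twitteR)
data.list <- searchTwitter('#demonetization ', n=10)
data.df <- twListToDF(data.list)
# Convert the data frame to BSON and insert it into MongoDB (rmongodb)
temp <- mongo.bson.from.df(data.df)
mongo <- mongo.create()
# Namespace is "<database>.<collection>"; the database name must be a quoted string
DB_Details <- paste("twitter", "filterstream", sep=".")
mongo.insert.batch(mongo, DB_Details, temp)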
2. Reading the data back from MongoDB and saving it in the dataset variable (all columns of the collection are stored in this variable):
mongo <- mongo(db = "twitter",collection = "filterstream",url = "mongodb://localhost")
dataset <- mongo$find()
3. When I print the content of the dataset variable there is no problem (see OUTPUT-1), but when I print a single column from dataset, the output for that column (see OUTPUT-2) differs from OUTPUT-1.
OUTPUT-1
> dataset
--------------------------------------------------
| id | text |
--------------------------------------------------
| 1 | <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD>
<ed> <U+00B8> <U+0082><ed><U+00A0><U+00BD>
<ed> <U+00B1><U+0087>\nSome great jokes on #DeMonetization on
my TL today.\n\nThank you, Modi ji. <ed><U+00A0><U+00BD>
<ed><U+00B1><U+0087> |
--------------------------------------------------
| 2 | should be one |
--------------------------------------------------
OUTPUT-2
> dataset$text
| id | text |
--------------------------------------------------
| 1 | \xed��\xed�\u0082\xed��\xed�\u0082\xed��\xed�\u0087\nSome great jokes on #DeMonetization on my TL today.\n\nThank you, Modi ji. \xed��\xed�\u0087 |
--------------------------------------------------
| 2 | should be one |
--------------------------------------------------
4. Detecting the strange characters in OUTPUT-2 and getting rid of them is difficult. Using a regex, I can remove the special characters (tags) from the text column as it appears in OUTPUT-1 and obtain clean text, but the text column as it appears in OUTPUT-2 is quite different, and I am not able to remove those characters.
5. Why does the content change when I print a single column from dataset? What am I doing wrong?
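As a minimal sketch of one possible clean-up, not a confirmed fix for this data: the `<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>` runs in OUTPUT-1 correspond to the bytes `ed a0 bd ed b8 82`, which look like an emoji stored as two surrogate halves rather than as valid UTF-8. Base R's `iconv()` with `sub = ""` drops every byte it cannot convert, so it can remove such sequences regardless of how they are displayed. The helper name `strip_non_ascii` and the sample string are assumptions made for illustration.

```r
# Bytes copied from OUTPUT-1: ed a0 bd ed b8 82 (not valid UTF-8 on their own)
emoji_bytes <- rawToChar(as.raw(c(0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x82)))
raw_text <- paste0(emoji_bytes, "Some great jokes on #DeMonetization")

# Hypothetical helper: drop every byte that cannot be converted to ASCII
strip_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

clean_text <- strip_non_ascii(raw_text)
print(clean_text)
```

Note that this also removes legitimate non-ASCII characters (accented letters, for example), so it only suits cases where plain ASCII text is acceptable. Applied to the column, it would be called as `strip_non_ascii(dataset$text)`.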

Related

Julia Markdown - Chunk Output as Markdown

I would like to print a table in Julia Markdown. To the best of my knowledge, there is no package that does this yet. Hence, I would like to create a nice-looking table through code, but I can't figure out how.
This is my table code...
---
title: Just a test
author: Me
date: 2022-01-03
output: pdf_document
---
```julia
"""
| Column One | Column Two | Column Three |
|:---------- | ---------- |:------------:|
| Row `1` | Column `2` | |
| *Row* 2 | **Row** 2 | Column ``3`` |
"""
```
...and I want it to produce this...
...instead of this:
The Markdown standard library can parse tables too:
julia> using Markdown
julia> tbl = """
| Column One | Column Two | Column Three |
|:---------- | ---------- |:------------:|
| Row `1` | Column `2` | |
| *Row* 2 | **Row** 2 | Column ``3`` |
"""
julia> md = Markdown.parse(tbl);
julia> # text formatting like emphasis and bold are lost in pasting
# to StackOverflow, but shown in the original output
md
Column One Column Two Column Three
–––––––––– –––––––––– ––––––––––––
Row 1 Column 2
Row 2 Row 2 Column 3
The parse output is a Markdown.MD object that is rendered appropriately depending on your output display (e.g. terminal, Jupyter, etc.).
If you want to produce a markdown table directly from data (without parsing it from a string), you can also construct a Markdown.Table directly; check the varinfo() function from the InteractiveUtils standard library for an example of that.

How do you combine Lapply() and dbListFields() to get all column names for every table in a DATABASE?

I would like to create a short catalog for myself from a database, showing which tables and fields are available, using a combination of sapply(), lapply() etc. and dbListFields().
This is as far as I have got, but it returns a "0" character variable:
catalog <- lapply(list_of_tables, function(t) dbListFields(con, name = paste0(t)))
So I would like to create an output like this:
+------------+--------------------+
| DB TABLES | FIELDS |
+------------+--------------------+
| ORDERS | "PRODUCT", "TIME" |
| CLIENTS | "ID", "NAME" |
| PROMOTIONS | "DATE", "DISCOUNT" |
+------------+--------------------+
I haven't used this kind of loop before and would like to start here.
Thank you for your support in advance!
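A minimal sketch of one way to build such a catalog, assuming `list_of_tables` is a character vector of table names. The helper `build_catalog` and the fake schema are hypothetical and exist only so the example runs without a database; with a real DBI connection the lister would be `function(t) dbListFields(con, t)`.

```r
# Hypothetical helper: one row per table, fields collapsed into one string
build_catalog <- function(tables, list_fields) {
  fields <- lapply(tables, list_fields)
  data.frame(
    table  = tables,
    fields = vapply(fields, paste, character(1), collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# Stand-in for a database so the sketch is self-contained (assumption)
fake_schema <- list(
  ORDERS     = c("PRODUCT", "TIME"),
  CLIENTS    = c("ID", "NAME"),
  PROMOTIONS = c("DATE", "DISCOUNT")
)
fake_lister <- function(t) fake_schema[[t]]

catalog <- build_catalog(names(fake_schema), fake_lister)
print(catalog)
```

Against a live connection this would become `build_catalog(dbListTables(con), function(t) dbListFields(con, t))`. If the result is still empty, it is worth checking that `list_of_tables` itself is not empty, since `lapply()` over an empty input returns an empty list.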

Appending text to an existing file from inside a parallel loop

I'm currently processing some data in parallel as part of a larger loop that essentially looks something like this:
for (i in files) {
  doFunction
  foreach (j = 1:100) %dopar% {
    parallelRegFun
  }
}
The doFunction extracts average data for each year, while the parallelRegFun regresses the data with maximally overlapping windows (not always 1:100, sometimes it can be 1:1000+, which is why I'm doing it in parallel).
Part of parallelRegFun involves writing data to a CSV file:
write_csv(parallelResults,
          path = "./outputFile.csv",
          append = TRUE, col_names = FALSE)
The issue is that, quite often, when writing to the output file the data is appended to an existing row, or a blank row is written. For example, the output might look like this:
+-----+-------+------+---+-------+------+
| Uid | X | Y | | | |
+-----+-------+------+---+-------+------+
| 1 | 0.79 | 2.37 | | | |
+-----+-------+------+---+-------+------+
| 2 | -1.88 | 3.53 | 3 | -0.54 | 3.32 |
+-----+-------+------+---+-------+------+
| | | | | | |
+-----+-------+------+---+-------+------+
| 5 | -0.18 | 1.45 | | | |
+-----+-------+------+---+-------+------+
This requires extensive clean-up afterwards, and when some of the output files are 100+ MB, that is a lot of data to inspect and clean manually. It also appears that when a blank row is written, the data for that row is lost entirely, i.e. it is not among the data that gets appended to an existing row.
Is there any way to get the doParallel workers to check whether the file is being accessed and, if it is, to wait until it is free before appending the output?
I thought something like Sys.sleep() before the write_csv command would work, as it would force each worker to wait a different amount of time before writing, but this doesn't appear to work in the testing I've done.
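One way to sidestep concurrent writes altogether, sketched here under assumptions: let each worker return its rows instead of writing them, then append once from the parent process. The sketch uses base R's `parallel` package rather than foreach so that it runs without extra packages; `regress_window` is a hypothetical stand-in for parallelRegFun, and the output goes to a temporary file.

```r
library(parallel)

# Hypothetical stand-in for parallelRegFun: returns one row per window
# instead of writing to disk from inside the worker
regress_window <- function(j) {
  data.frame(Uid = j, X = j / 10, Y = j * 2)
}

cl <- makeCluster(2)
results <- parLapply(cl, 1:100, regress_window)
stopCluster(cl)

# Combine in the parent, then append in a single call so rows cannot interleave
all_rows <- do.call(rbind, results)
out_file <- tempfile(fileext = ".csv")
write.table(all_rows, out_file, sep = ",", append = TRUE,
            col.names = FALSE, row.names = FALSE)
```

With foreach, the equivalent idea is to have the loop body return the data frame and pass `.combine = rbind` to `foreach()`, keeping the single `write_csv(..., append = TRUE)` call outside the parallel loop.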

knitr's kable is printing 2.29e-30 as "0"

CODE:
# some data
dat <- data.frame(
  log2fc = c(0.28, 10.82, 8.54, 5.64, 8.79, 6.46),
  pvalue = c(0.00e+00, 2.29e-30, 7.02e-30, 4.14e-29, 1.86e-28, 1.78e-27)
)
# observe in markdown format
knitr::kable(dat, format="markdown")
OUTPUT:
| log2fc| pvalue|
|------:|------:|
| 0.28| 0|
| 10.82| 0|
| 8.54| 0|
| 5.64| 0|
| 8.79| 0|
| 6.46| 0|
PROBLEM:
The problem with the output is that the last column, pvalue, is rendered as zeros. I want to retain the same format that I see in my data frame. How do I do that? I've tried several solutions from various threads, but nothing seems to work. Can someone point me in the right direction?
Please do not suggest converting the pvalue column into a character vector. That is a quick and dirty solution that works, but I don't want to do it because:
I don't want to mess around with my data frame.
I am interested in the reason why the scientific format of the last column is not retained when printing it in markdown.
I have many tables, each with various columns in scientific format, and I am looking for a way that handles this issue automatically.
kable() calls the base R function round(), which rounds those small values to zero unless you set digits to a really large value. But you can do that, e.g.
knitr::kable(dat, format = "markdown", digits = 32)
which gives
| log2fc| pvalue|
|------:|--------:|
| 0.28| 0.00e+00|
| 10.82| 2.29e-30|
| 8.54| 7.02e-30|
| 5.64| 4.14e-29|
| 8.79| 1.86e-28|
| 6.46| 1.78e-27|
If you do want the regular rounding in some columns, you can specify multiple values for digits, e.g.
knitr::kable(dat, format = "markdown", digits = c(1, 32))
| log2fc| pvalue|
|------:|--------:|
| 0.3| 0.00e+00|
| 10.8| 2.29e-30|
| 8.5| 7.02e-30|
| 5.6| 4.14e-29|
| 8.8| 1.86e-28|
| 6.5| 1.78e-27|
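The rounding behaviour can be checked without knitr. This is a small sketch using only base R, assuming the digits value in effect is 7 (R's default for `getOption("digits")`, which is also kable's default for `digits`):

```r
p <- c(0.00e+00, 2.29e-30, 7.02e-30, 4.14e-29, 1.86e-28, 1.78e-27)

# With 7 decimal places every value rounds to zero
rounded_default <- round(p, digits = 7)

# With 32 decimal places the significant digits survive
rounded_wide <- round(p, digits = 32)

print(rounded_default)
print(rounded_wide)
```

This is why a large digits value restores the scientific notation in the table: the values are no longer rounded to zero before they are formatted.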

How to separate out letters in a sentence using R

I have a character vector that is a string of letters and punctuation. I want to create a data frame where each column is made up of a letter/character from this string.
e.g.
Character string = I WENT TO THE FAIR
Dataframe = | I | | W | E | N | T | | T | O | | T | H | E | | F | A | I | R |
I thought I could do this using a loop with substr, but I can't work out how to get R to write into separate columns, rather than just writing over the previous letter. I'm new to writing loops etc so struggling a bit to get my head around the way in which to compose what I need.
Thanks for any help and advice that you can offer.
Best wishes,
Natalie
This should give that result:
string <- "I WENT TO THE FAIR"
df <- as.data.frame(t(as.data.frame(strsplit(string,""))), row.names = "1")
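A slightly shorter variant of the same idea, as a sketch: `strsplit()` with an empty pattern splits the string into single characters, and transposing that vector gives a one-row matrix that converts to a data frame with one column per character. The column names V1, V2, ... are R's defaults for an unnamed matrix.

```r
string <- "I WENT TO THE FAIR"

# Split into single characters (spaces included)
chars <- strsplit(string, "")[[1]]

# t() turns the vector into a 1-row matrix; each character becomes a column
df <- as.data.frame(t(chars), stringsAsFactors = FALSE)

print(dim(df))
```

No loop is needed here, because `strsplit()` already handles the whole string at once.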

Resources