How can I pull data from a .sqlite file into Databricks Spark?

I found a dataset I like on Kaggle; however, the only download option is a .sqlite file that contains three tables. Is there any way I can access this data from Databricks?

If you are using pyspark and SQLContext, try the following code.
Add extraClassPath to your spark conf.
spark.executor.extraClassPath=<jdbc.jar>
Code snippet:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read one table from the SQLite file over JDBC
df = sqlContext.read.format("jdbc").options(
    url="jdbc:sqlite:{folder_path}/{file_name}.db",
    driver="org.sqlite.JDBC",
    dbtable="employee").load()
df.take(10)
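On Spark 2 and later you can do the same through the SparkSession entry point. The sketch below is only a variant of the answer above: it assumes the SQLite JDBC driver jar is attached to the cluster and that the .sqlite file has been uploaded to DBFS; the path and the table name are placeholders to replace with your own.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def read_sqlite_table(table_name):
    # The path below is a placeholder for wherever the Kaggle .sqlite file lives on DBFS
    return (spark.read.format("jdbc")
            .option("url", "jdbc:sqlite:/dbfs/FileStore/my_dataset.sqlite")
            .option("driver", "org.sqlite.JDBC")
            .option("dbtable", table_name)
            .load())

# Load one of the three tables into its own DataFrame ("employee" is a placeholder name)
df = read_sqlite_table("employee")
df.show(10)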

Related

Importing Excel in Watson Studio

I am trying to read an Excel file (.xlsx) into a data frame in IBM Watson Studio. The Excel file is saved in my list of assets. I'm a bit new to Python.
I have tried creating a project token with some help I got here. I would appreciate it if someone could help with the complete code.
I tried this:
from project_lib import Project
project = Project(project_id='',
                  project_access_token='')
pc = project.project_context
file = project.get_file("xx.xlsx")
file.sheet_names
df = pd.ExcelFile(file)
df = file.parse (0)
df.head ()
I need to load the Excel file into a pandas data frame (imported as pd, for example).
All you need to do is:
First, insert the project token as you already did.
Then fetch the file and call .seek(0) on it.
Then read it with pandas' read_excel() and you should be able to load it.
import pandas as pd

# Fetch the file from the project's object storage
my_file = project.get_file("tests-example.xls")

# Rewind the file-like object before reading it
my_file.seek(0)

# Read the Excel data into a pandas DataFrame (first 10 rows here)
df = pd.read_excel(my_file, nrows=10)
df.head()
For more information: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/project-lib-python.html
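If you also need the sheet names (as in your original attempt), pandas can list and parse them from the same file object. This is a small follow-up sketch, not part of the original answer:
my_file.seek(0)
xls = pd.ExcelFile(my_file)
print(xls.sheet_names)                # list the sheets in the workbook
df = xls.parse(xls.sheet_names[0])    # read the first sheet into a DataFrame
df.head()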

Unable to Export Data from R to Vertica

It would be great if anyone could help me out with the COPY function syntax.
I am trying to copy data from a Google Sheet to Vertica with the help of R (using the googlesheets and RJDBC packages).
I have exported the data from the Google Sheet but am unable to import it into the Vertica server.
Please help me out.
googleSheet <- gs_title("GS", verbose = FALSE)
auditSheet <- gs_read(ss = googleSheet, ws = 'GS1')
copy verticaDB.tableName from local auditSheet
Thanks

Expected BOF record for XLRD when first line is redundant

I came across this problem when I tried to use xlrd to import an .xls file and create a dataframe in Python.
Here is my file format:
(screenshot of the xls file layout)
When I run:
import os
import pandas as pd
import xlrd

for filename in os.listdir("."):
    if filename.startswith("report_1"):
        df = pd.read_excel(filename)
It shows "XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'Report g'".
I am pretty sure nothing is wrong with xlrd (version 1.0.0), because when I remove the first row the dataframe can be created.
Is there any way I can load the original file format?
Try the following, which skips the redundant first line and treats the next row as the header:
df = pd.read_excel(filename, skiprows=1, header=0)
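If the error persists even with skiprows, the file may not be a true binary .xls despite its extension: the "Expected BOF record" message means xlrd found plain text ("Report g...") where the Excel file signature should be. A minimal fallback sketch, assuming the report is actually a tab-delimited text export with one title line before the header:
# Assumption: the "xls" is really a tab-separated text export with a
# redundant title line before the header row
df = pd.read_csv(filename, sep="\t", skiprows=1)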

Jupyter Multiple Notebooks using Same Data

I have built out a Jupyter notebook for my previous analysis, and I want to start a different branch of analysis using some of the same dataframes.
How do I use the previous dataframes in my new notebook without copying all my code to rebuild the previous analysis?
You can share data across notebooks with Jupyter magics. For example:
Given
# Notebook 1
import pandas as pd
d = {"one" : pd.Series([1., 2., 3.], index=list("abc"))}
df = pd.DataFrame(d)
Code
%store df
Recall the DataFrame in a separate notebook:
# Notebook 2
%store -r df
df
Output
   one
a  1.0
b  2.0
c  3.0
More on this in the older IPython docs. See also Jupyter's %bookmark magic for sharing directories.
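A quick sketch of the %bookmark magic mentioned above (the path is a placeholder):
# Save a named bookmark pointing at a shared data directory
%bookmark shared_data /path/to/shared/data
# In another notebook, jump to the bookmarked directory
%cd -b shared_data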
You can pickle the dataframe and then load it in your new notebook. This is fine for short-term data reuse. For long-term storage, writing and then reading a text CSV file may be more reliable.
pickle_save.py
import os
import pandas as pd
pickle_location = r'd:\temp\pickle_file'
df = pd.DataFrame({'A':1,'B':2}, index=[0])
df.to_pickle(pickle_location)
if os.path.exists(pickle_location):
print('pickle created')
pickle_load.py
import os
import pandas as pd
pickle_location = r'd:\temp\pickle_file'
df_load = pd.read_pickle(pickle_location)
print(df_load)
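For the CSV route mentioned above, the equivalent sketch uses to_csv and read_csv (the path is a placeholder):
csv_save_load.py
import pandas as pd

csv_location = r'd:\temp\csv_file.csv'

df = pd.DataFrame({'A': 1, 'B': 2}, index=[0])
df.to_csv(csv_location, index=False)    # save in the first notebook

df_load = pd.read_csv(csv_location)     # load in the second notebook
print(df_load)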

How to use R to analyze kcore from csv file

I have recently started learning about SNA (social network analysis) and ran into the problem of how to use R to analyze the k-core structure of a network from a file.
The format of the csv file is like below:
//File
PointStart,PointEnd
jay,yrt
hiqrr,huame
Sam,joysunn
timka,tomdva
......,.....
I have imported this file into R but I do not know the next step to handle it.
Thanks for your help.
Use read.csv to import your data into an R data frame, say d.
Use the network package to create a network object with net <- network(d, directed=TRUE), with directed set to TRUE or FALSE depending on your data.
Use kcores from the sna package (http://www.rdocumentation.org/packages/sna/functions/kcores): kcores(net).
