How to create a multi-dimensional structure array in Python using a dictionary

Is there a way to form a dictionary in Python which, when opened in MATLAB, appears like the following picture (in structure format)? The saving was done using hdf5storage.savemat.

I would convert the dictionary into a pandas data frame:
import pandas as pd
# Each dict in the list becomes one row of the resulting data frame
data = [{'area': 'new-hills', 'rainfall': 100, 'temperature': 20},
        {'area': 'cape-town', 'rainfall': 70, 'temperature': 25},
        {'area': 'mumbai', 'rainfall': 200, 'temperature': 39}]
df = pd.DataFrame.from_dict(data)
Many IDEs (e.g. PyCharm) provide visualization options for data frames.
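For the original goal of a MATLAB struct, the question's own hdf5storage.savemat is also worth showing: a dict whose values are arrays is saved so it opens as a struct whose fields are those keys. A minimal sketch, assuming hdf5storage is installed; the file and variable names here are illustrative:
import numpy as np
import hdf5storage
# Each key of the inner dict becomes a field of the struct named 'weather' in MATLAB
weather = {'area': np.array(['new-hills', 'cape-town', 'mumbai'], dtype=object),
           'rainfall': np.array([100.0, 70.0, 200.0]),
           'temperature': np.array([20.0, 25.0, 39.0])}
hdf5storage.savemat('weather.mat', {'weather': weather})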

Related

How to export many dataframes from R into one Excel file in a simple way?

I have this dataset (imagine it with 900 variables):
df = data.frame(x = c(1,0,0,0,1,1,1), y = c(2,2,2,2,3,3,2) )
l1 = lapply(df,table)
l2 = lapply(l1,as.data.frame)
I created a list of dataframes that I wish to export to a single Excel file in a simple way, using lapply or something like it. I'd appreciate the help.
write_xlsx() from writexl saves each list item to its own sheet of an Excel file:
library(writexl)
write_xlsx(l2, 'file.xlsx')
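For Python users facing the same task, a pandas equivalent uses ExcelWriter to write one sheet per entry of a dict of dataframes, mirroring how write_xlsx() treats a named list. A sketch, assuming an Excel engine such as openpyxl is installed; the dict contents are illustrative:
import pandas as pd
frames = {'x': pd.DataFrame({'value': [0, 1], 'count': [3, 4]}),
          'y': pd.DataFrame({'value': [2, 3], 'count': [5, 2]})}
# Dict keys become sheet names, one sheet per dataframe
with pd.ExcelWriter('file.xlsx') as writer:
    for name, frame in frames.items():
        frame.to_excel(writer, sheet_name=name, index=False)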

Converting JSON file with nested lists to R Dataframe

I have downloaded a JSON dataset from https://ped.uspto.gov/peds/#!/#%2FapiDocumentation. I need to convert it to a dataframe in R. I checked the guidance provided at:
"Flatten nested lists in dataframe after JSON import in R" and "Convert nested json file to Dataframe in R", but these are not helpful in my case.
Given below are my R code and R Output. The output shows there are nested lists. Kindly suggest how to convert these into a dataframe. Thanks.
library(jsonlite)
df1 <- fromJSON(
  "E:\\IIMU\\Databases\\USPTO\\2020-2022-pairbulk-full-20220609-json\\2022.json",
  simplifyVector = TRUE,
  simplifyDataFrame = TRUE,
  simplifyMatrix = TRUE,
  flatten = TRUE
)
(R output omitted; it shows that several columns are nested lists.)

How to quickly and easily convert between R and pandas dataframes in Databricks?

I am an R user with minimal python experience. I have some colleagues who use python and I want to be able to easily convert between R and python/pandas dataframes in the same Databricks notebook. I have heard that I have to use spark temp tables to do this and that it is quite straightforward, but I cannot find any complete example code and so far I haven't been able to get it to work.
I get a SparkR dataframe (as I can't get Base R dataframes to work with RegisterTempTable()) and convert it to a temp table:
#Cell 1
jdbc_url <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb;user=user;password=*****"
df_R <- read.jdbc(jdbc_url, "(SELECT TOP 10 * FROM [schema].[table]) as result" )
SparkR:::registerTempTable(df_R,"df_temptable")
Then I try to read that back in as a pandas dataframe:
%python
#Cell 2:
import pandas as pd
pandas_df = df_temptable.select("*").toPandas()
which results in the error:
NameError: name 'df_temptable' is not defined
How do I successfully convert between R and python dataframes and back within Databricks (I would preferably like to go from a Base R dataframe to a pandas dataframe without using any Scala and in as few steps as possible)?
From the error message "NameError: name 'df_temptable' is not defined", it looks like df_temptable is not defined as a Python variable. Registering a temp table makes the data available by name within the Spark session, but it does not create a Python variable called df_temptable; the Python cell has to fetch the table from the session first.
Here is an example of converting Spark DataFrames to and from pandas DataFrames.
%python
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Generate a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100, 3))
# Create a Spark DataFrame from a pandas DataFrame using Arrow
df = spark.createDataFrame(pdf)
# Convert the Spark DataFrame back to a pandas DataFrame using Arrow
result_pdf = df.select("*").toPandas()
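To complete the round trip from the question, the Python cell should look the temp table up by name through the shared Spark session instead of referencing the R variable directly. A minimal sketch, assuming df_temptable was registered as in Cell 1:
%python
# Cell 2 (corrected): fetch the temp table by name, then convert to pandas
pandas_df = spark.table("df_temptable").toPandas()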

How to display GroupBy Count as Bokeh vbar for categorical data

I have a small issue creating a Bokeh vbar in 0.13.0 from a dataframe groupby count operation. The response here was for a multi-level groupby, whereas mine isn't.
Updates since posting
added sample data and code based on provided answer to see if issue is my code or something else
Outline
The pandas dataframe contains survey responses (Excellent, Good, Poor, Satisfactory, Very Good) under the columns ('ResponseID','RateGeneral','RateAccomodation','RateClean','RateServices'), and the dtype has been set to category. I want to display a Bokeh vbar of the response count from the groupby, using
DemoDFCount = DemoDF.groupby('RateGeneral').count()
My Bokeh code looks like this:
pTest = figure(title='Rating in General', plot_height=350)
pTest.vbar(width=0.9, source=DemoDFCount, x='RateGeneral', top='ResponseID')
show(pTest)
but this doesn't produce any chart, only a title and toolbar.
If I use pandas' DemoDFCount.plot.bar(legend=False) I can plot something, but how do I create this chart in Bokeh?
Sample data as json export
50 rows of sample data from DemoDF.to_json()
'{"ResponseID":{"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8,"8":9,"9":10,"10":11,"11":12,"12":13,"13":14,"14":15,"15":16,"16":17,"17":18,"18":19,"19":20,"20":21,"21":22,"22":23,"23":24,"24":25,"25":26,"26":27,"27":28,"28":29,"29":30,"30":31,"31":32,"32":33,"33":34,"34":35,"35":36,"36":37,"37":38,"38":39,"39":40,"40":41,"41":42,"42":43,"43":44,"44":45,"45":46,"46":47,"47":48,"48":49,"49":50},"RateGeneral":{"0":"Good","1":"Satisfactory","2":"Good","3":"Poor","4":"Good","5":"Satisfactory","6":"Excellent","7":"Good","8":"Good","9":"Satisfactory","10":"Satisfactory","11":"Excellent","12":"Satisfactory","13":"Excellent","14":"Satisfactory","15":"Very Good","16":"Satisfactory","17":"Excellent","18":"Very Good","19":"Excellent","20":"Satisfactory","21":"Good","22":"Satisfactory","23":"Excellent","24":"Satisfactory","25":"Good","26":"Excellent","27":"Very Good","28":"Good","29":"Very Good","30":"Good","31":"Satisfactory","32":"Very Good","33":"Very Good","34":"Very Good","35":"Good","36":"Excellent","37":"Satisfactory","38":"Excellent","39":"Good","40":"Good","41":"Satisfactory","42":"Very Good","43":"Very Good","44":"Poor","45":"Excellent","46":"Good","47":"Excellent","48":"Satisfactory","49":"Good"},"RateAccomodation":{"0":"Very Good","1":"Excellent","2":"Satisfactory","3":"Satisfactory","4":"Good","5":"Good","6":"Very Good","7":"Very Good","8":"Good","9":"Satisfactory","10":"Satisfactory","11":"Excellent","12":"Satisfactory","13":"Excellent","14":"Good","15":"Very Good","16":"Good","17":"Excellent","18":"Excellent","19":"Very Good","20":"Good","21":"Satisfactory","22":"Good","23":"Excellent","24":"Satisfactory","25":"Very Good","26":"Excellent","27":"Excellent","28":"Good","29":"Very Good","30":"Very Good","31":"Very Good","32":"Excellent","33":"Very Good","34":"Very Good","35":"Very Good","36":"Excellent","37":"Satisfactory","38":"Excellent","39":"Good","40":"Excellent","41":"Poor","42":"Very Good","43":"Very Good","44":"Poor","45":"Excellent","46":"Satisfactory","47":"Excellent","48":"Good","49":"Good"},"RateClean":{"0":"Excellent","1":"Excellent","2":"Satisfactory","3":"Good","4":"Excellent","5":"Very Good","6":"Very Good","7":"Excellent","8":"Excellent","9":"Satisfactory","10":"Satisfactory","11":"Excellent","12":"Good","13":"Good","14":"Excellent","15":"Excellent","16":"Good","17":"Excellent","18":"Excellent","19":"Excellent","20":"Good","21":"Very Good","22":"Poor","23":"Very Good","24":"Satisfactory","25":"Very Good","26":"Excellent","27":"Good","28":"Poor","29":"Good","30":"Excellent","31":"Good","32":"Good","33":"Very Good","34":"Satisfactory","35":"Good","36":"Excellent","37":"Satisfactory","38":"Excellent","39":"Good","40":"Very Good","41":"Satisfactory","42":"Excellent","43":"Excellent","44":"Very Good","45":"Excellent","46":"Good","47":"Excellent","48":"Good","49":"Excellent"},"RateServices":{"0":"Very Good","1":"Excellent","2":"Good","3":"Good","4":"Excellent","5":"Good","6":"Good","7":"Very Good","8":"Good","9":"Satisfactory","10":"Satisfactory","11":"Excellent","12":"Good","13":"Very Good","14":"Good","15":"Excellent","16":"Poor","17":"Excellent","18":"Excellent","19":"Excellent","20":"Good","21":"Good","22":"Very Good","23":"Excellent","24":"Satisfactory","25":"Very Good","26":"Excellent","27":"Very Good","28":"Good","29":"Excellent","30":"Very Good","31":"Excellent","32":"Good","33":"Excellent","34":"Very Good","35":"Very Good","36":"Excellent","37":"Satisfactory","38":"Excellent","39":"Good","40":"Very 
Good","41":"Satisfactory","42":"Excellent","43":"Excellent","44":"Good","45":"Excellent","46":"Very Good","47":"Excellent","48":"Good","49":"Very Good"}}'
The fact that it is multi-level in the other question is not really relevant. When you use a Pandas GroupBy as a data source for Bokeh, Bokeh uses the results of group.describe() (which includes counts for each column per group) as the contents of the data source. Here is a complete example that shows counts per origin from the "cars" data set:
from bokeh.io import show, output_file
from bokeh.plotting import figure
from bokeh.sampledata.autompg import autompg as df
output_file("groupby.html")
df.origin = df.origin.astype(str)
group = df.groupby('origin')
p = figure(plot_height=350, x_range=group, title="Count by Origin",
           toolbar_location=None, tools="")
# using yr_count, but count for any column would work
p.vbar(x='origin', top='yr_count', width=0.8, source=group)
p.y_range.start = 0
p.xgrid.grid_line_color = None
show(p)
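Applied to the survey data in the question, the same pattern would look like the sketch below. The column name 'ResponseID_count' follows Bokeh's group.describe() naming and assumes ResponseID is numeric; any *_count column holds the per-group row count:
group = DemoDF.groupby('RateGeneral')
p = figure(plot_height=350, x_range=group, title='Rating in General',
           toolbar_location=None, tools='')
# top refers to a describe() column of the grouped source
p.vbar(x='RateGeneral', top='ResponseID_count', width=0.9, source=group)
p.y_range.start = 0
show(p)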

Drop duplicate times in xarray

I'm reading NetCDF files with open_mfdataset, which contain duplicate times. For each duplicate time I only want to keep the first occurrence, and drop the second (it will never occur more often). The problem is quite similar to this Pandas question, but none of the solutions provided there seem to work with Xarray.
To reproduce the problem:
import numpy as np
import netCDF4 as nc4
import xarray as xr
# Create example NetCDF files
for t in range(2):
    nc = nc4.Dataset('test{}.nc'.format(t), 'w')
    dim_t = nc.createDimension('time', None)
    var_t = nc.createVariable('time', 'f8', ('time',))
    var_s = nc.createVariable('var', 'f8', ('time',))
    var_t.setncattr('units', 'hours since 2001-01-01 00:00:00')
    var_t[:] = t*5 + np.arange(6)
    var_s[:] = t*5 + np.arange(6) + t
    nc.close()
# Read with xarray
f = xr.open_mfdataset(['test0.nc', 'test1.nc'])
The times in the resulting dataset are:
array(['2001-01-01T00:00:00.000000000', '2001-01-01T01:00:00.000000000',
       '2001-01-01T02:00:00.000000000', '2001-01-01T03:00:00.000000000',
       '2001-01-01T04:00:00.000000000', '2001-01-01T05:00:00.000000000',
       '2001-01-01T05:00:00.000000000', '2001-01-01T06:00:00.000000000',
       '2001-01-01T07:00:00.000000000', '2001-01-01T08:00:00.000000000',
       '2001-01-01T09:00:00.000000000', '2001-01-01T10:00:00.000000000'],
      dtype='datetime64[ns]')
Is there an easy way to remove the second occurrence of 2001-01-01T05:00:00.000000000? The real-life problem deals with multi-dimensional NetCDF files, so switching to Pandas is not an option.
[update] The closest I get is following this answer; that works for my simple example as long as Dask is not used. If the files contain Dask arrays, I get the error:
'last' with skipna=True is not yet implemented on dask arrays
But I don't see where I can/have to set skipna.
I think xarray does not have its own method for this purpose, but the following works:
In [7]: _, index = np.unique(f['time'], return_index=True)
In [8]: index
Out[8]: array([ 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11])
In [9]: f.isel(time=index)
Out[9]:
<xarray.Dataset>
Dimensions:  (time: 11)
Coordinates:
  * time     (time) datetime64[ns] 2001-01-01 2001-01-01T01:00:00 ...
Data variables:
    var      (time) float64 dask.array<shape=(11,), chunksize=(6,)>
To add to the answer above: you can also use the get_index() method to get a pandas index and select the non-duplicated times through its duplicated() mask:
f.sel(time=~f.get_index("time").duplicated())
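Note that duplicated() marks every repeat after the first occurrence by default (keep='first'), so this selection keeps exactly the first of each duplicated time, as the question requires.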
