I pull date data from an Excel file on my computer. Each date is listed as "10/1/10" and stored in the array dData, and the numerical version of the date is stored in nData as 734046. So dData[0] returns "10/1/10" and nData[0] returns 734046.
HOWEVER
When I look up "10/1/10" in the dictionary built by the code in bold, it returns 735536, which is not the matching key-value pair. The pairs should line up chronologically.
import numpy as np
import pandas as pd
import xlrd
import csv
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import time
import random
import statistics
import numpy
from numpy.random import normal
from scipy import stats
dData = [] #Date in string format - Month/Day/Year
pData = [] #Price in float format - Value.Decimals
nData = [] #Data in Dates in int - Formatted Date Data for plotting in Matpl
def loadData(dates, prices, numDates):
    dateDictionary = {} # empty dictionary that will contain string dates to number dates
    numDateToPrice = {} # empty dictionary that will contain number dates to prices
    nestedDictionary = {} # empty dictionary that will contain a nested dictionary str date : {numbertodate: price}
    with open('/Users/dvalentin/Code/IndividualResearch/CrudeOilFuturesAll.csv', 'rU') as csvfile: #This is where I pull data from an excel file on my comp
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            dates.append(row[0])
            numDates.append(row[1])
            prices.append(row[2])
    **for x in dates:
        for x in numDates:
            dateDictionary[x] = y
    print dateDictionary**
    for x in numDates[x]:
        for y in prices[y]:
            numDateToPrice[x] = y
    plt.plot_date(x=numDates, y=prices, fmt="r-")
    plt.plot()
    plt.title("Crude Oil Futures")
    plt.ylabel("Closing Price")
    plt.grid(True)
    plt.show()
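import pandas as pd
dates = ['10/1/10', '10/2/10', '11/3/10', '1/4/11']
prices = [12, 15, 13, 18]
df = pd.DataFrame({'dates': dates, 'prices': prices})
# Parse the Month/Day/Year strings and use them as the index
df = df.set_index(pd.to_datetime(df['dates'], format='%m/%d/%y'))
df = df.drop('dates', axis=1)
# .ix has been removed from pandas; .loc does the label-based lookups
print(df.loc['2010-10-02'])               # a single day
print(df.loc['2010-10-01':'2010-10-02'])  # a range of days
print(df.loc['2010'])                     # a whole year
print(df.loc['2010-10'])                  # a whole month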
This seems to be a better way to organize your data than working with the numerical codes for the dates. You can always adjust the DatetimeIndex for plotting and style it however you want, and it is much easier to manipulate data with than dictionaries. More info on datetime indices: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DatetimeIndex.html. Hope this helps!
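If the dictionaries are still needed, the mismatch in the question most likely comes from the nested loops, which reuse `x` and never pair a date with the value from its own row. A minimal sketch, assuming the three lists are the same length and aligned row by row; the prices and the second and third rows are made up for illustration, and the snake_case names are mine:

```python
# Sample rows standing in for the CSV columns. Only the first date pair
# comes from the question; the other rows and all prices are hypothetical.
dates = ["10/1/10", "10/2/10", "10/3/10"]
num_dates = [734046, 734047, 734048]
prices = [81.58, 81.43, 82.82]

# zip() walks the lists in lockstep, so each key gets its own row's value.
date_dictionary = dict(zip(dates, num_dates))
num_date_to_price = dict(zip(num_dates, prices))

# Nested form: string date -> {numeric date: price}
nested_dictionary = {d: {n: p} for d, n, p in zip(dates, num_dates, prices)}

print(date_dictionary["10/1/10"])  # 734046
```

With this pairing, looking up "10/1/10" gives back the numeric date from the same row instead of the last value written by the inner loop.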
I want to caption all figures generated in a Python chunk in R Markdown. Currently I get only one caption, because I can use fig.cap only once in the chunk header. How can I caption each figure?
The code below plots two different dataframes.
```{python Test1-plot, fig.cap="The shear stress evolution"}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_table("../gb_cyc_csv/T1_gb_cyc.csv",sep=',',decimal='.',low_memory=False,header=0)
X = data[data.keys()[3]]
X = np.array(X)
TS = data[data.keys()[9]]
TS = np.array(TS)
plt.plot(X,TS)
plt.show()
data = pd.read_table("../gb_cyc_csv/T3_gb_cyc.csv",sep=',',decimal='.',low_memory=False,header=0)
X = data[data.keys()[3]]
X = np.array(X)
N = data[data.keys()[15]]
N = np.array(N)
plt.plot(X,N)
plt.show()
```
I am building an automated program in which the user only needs to modify a few parameters in the Launch file.
Is it possible to build the file paths automatically from the custom "year" and "month" values?
Structure
Launch.r
| Import.r #(load libraries and call specific programs)
| Topic1.r
|.......
| Topic2.r
......
#-- > Launch file
# Parameters to be personalized by the user
year <- 2021
month <- 01
# File directory
import <- c('c:/folderX/year/month/folder')
export <- c('c:/folderX/year/export/folder1')
.....
When I run the program, R returns
import "c:/folderX/year/month/folder"
export "c:/folderX/year/export/folder1"
My goal is to get
import "c:/folderX/2021/01/folder"
export "c:/folderX/2021/export/folder1"
Would you have tips to help me?
If I am understanding the question correctly, file.path can accept variables as parts of the path e.g.
year <- 2021
month <- 01
import <- file.path("c:/folderX", year, month, "folder")
should give
#> [1] "c:/folderX/2021/1/folder"
You can also use the glue package to generate strings from templates:
# Parameters to be personalized by the user
library(glue)
year <- 2021
month <- 01
# File directory
import <- glue(
  'c:/folderX/{year}/{month}/folder',
  month = sprintf('%02d', month),  # zero-pad the month: 1 -> "01"
  year = year
)
export <- glue(
  'c:/folderX/{year}/export/folder1',
  year = year
)
# glue can also take variables from the environment:
import <- glue(
  'c:/folderX/{year}/{month}/folder'
)
# in this case just make sure that the month variable is already a string in the correct format, e.g. '01'
I am trying to fetch data from a URL (GitHub) for further analysis (mean, average, percentage, ratio, etc.), but this code is not working properly.
You can use the below code:
import requests
import pandas as pd
res = requests.get('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user')
df = pd.DataFrame([a.split('|') for a in res.text.split('\n')])
df.columns = df.values[0]
df.drop([df.index[0],df.index[df.shape[0]-1]],inplace=True)
print(df['age'].astype(int).mean())
Output:
34.05196182396607
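A shorter route, sketched here under the assumption that the file is pipe-delimited with a header row (as the code above implies), is to let `pandas.read_csv` do the splitting and type conversion. The inline sample and its column names other than `age` are made up so the snippet runs offline; passing the URL from the answer in place of the `StringIO` object should behave the same way.

```python
import io
import pandas as pd

# Hypothetical sample in the same pipe-delimited layout (header + rows).
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# sep="|" splits the columns; numeric columns are parsed as numbers,
# while dtype keeps the zip codes as strings.
users = pd.read_csv(sample, sep="|", dtype={"zip_code": str})

print(users["age"].mean())                           # average age
print(users["gender"].value_counts(normalize=True))  # share of each gender
```

Because `age` is already numeric after `read_csv`, no `astype(int)` step or manual header handling is needed.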
I want to apply an R function to a pandas dataframe
df = pd.DataFrame( np.random.randn(5,2), # 5 rows, 2 columns
columns = ["A","B"], # name of columns
index = ["Max", "Nathy", "Tom", "Joe", "Kathy"] )
How can I apply, for example, the summary function from R?
I have the following code:
import numpy as np
import pandas as pd
import rpy2
# print(rpy2.__version__) ## 2.9.4
from rpy2.rinterface import R_VERSION_BUILD
# print(R_VERSION_BUILD) ## ('3', '5.1', '', 74947)
from rpy2.robjects.packages import importr
# import R's "base" package
base = importr('base')
You are almost there. To run R functions, you need to convert the pandas DataFrame to an R data frame. Once we have the R object, we can call the functions as shown below.
import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri # install any dependency package if you get an error like "module not found"
# import R's "base" package
base = importr('base')
pandas2ri.activate()
# Create pandas df
df = pd.DataFrame( np.random.randn(5,2), # 5 rows, 2 columns
columns = ["A","B"], # name of columns
index = ["Max", "Nathy", "Tom", "Joe", "Kathy"] )
# Convert pandas to R (py2ri is the rpy2 2.9.x name; rpy2 3.x renamed it to py2rpy)
r_df = pandas2ri.py2ri(df)
print(type(r_df))
# Call a function from the base package
print(base.summary(r_df))
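If only the summary statistics are needed, pandas' own `describe()` produces a comparable table without going through R. This is an alternative to the rpy2 route above, not a replacement for calling arbitrary R functions; a minimal sketch using the same dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(5, 2),                          # 5 rows, 2 columns
    columns=["A", "B"],                             # names of the columns
    index=["Max", "Nathy", "Tom", "Joe", "Kathy"],
)

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()
print(summary)
```

The quartile rows (25%, 50%, 75%) cover roughly what R's `summary` reports as 1st Qu., Median and 3rd Qu.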
I created a dataframe using sqlContext and I have a problem with the datetime format as it is identified as string.
df2 = sqlContext.createDataFrame(i[1])
df2.show()
df2.printSchema()
Result:
2016-07-05T17:42:55.238544+0900
2016-07-05T17:17:38.842567+0900
2016-06-16T19:54:09.546626+0900
2016-07-05T17:27:29.227750+0900
2016-07-05T18:44:12.319332+0900
string (nullable = true)
Since the datetime schema is a string, I want to change it to datetime format as follows:
df3 = df2.withColumn('_1', df2['_1'].cast(datetime()))
Here I got an error:
TypeError: Required argument 'year' (pos 1) not found
What should I do to solve this problem?
Try this:
from pyspark.sql.types import DateType
ndf = df2.withColumn('_1', df2['_1'].cast(DateType()))
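Note that a cast to `DateType` keeps only the calendar date and drops the time of day. To see what the strings actually contain, here is a sketch in plain Python (not Spark) that parses one of the values from the question with the standard library. The format string is my reading of the sample output, and Spark's own pattern syntax differs from it, so treat this as a way to check values on the driver rather than as the Spark conversion itself.

```python
from datetime import datetime, timedelta

# One of the values printed in the question.
raw = "2016-07-05T17:42:55.238544+0900"

# %f reads the six-digit microseconds, %z reads the +0900 UTC offset.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

print(parsed.isoformat())  # 2016-07-05T17:42:55.238544+09:00
print(parsed.date())       # 2016-07-05, the part a date-only cast would keep
```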