I have a pandas.DataFrame (df) that contains some values and a datetime column. The datetime is a string at first, which I convert to a Timestamp using:
df['datetime'] = pd.to_datetime(df['Time [dd.mm.yyyy hh:mm:ss.ms]'], format="%d.%m.%Y %H:%M:%S.%f")
It seems to work, and I can access properties of the new column's elements such as obj.day, so the resulting column contains Timestamps. When I try to plot this using either pyplot.plot(df['datetime'], df['value_name']) or df.plot(x='datetime', y='value_name'), the picture below is the result. I tried converting the Timestamps using obj.to_pydatetime(), but that did not change anything. The dataframe itself is populated with data coming from CSV files. What confuses me is that it works with certain CSVs but not with others. I am pretty sure that the conversion to Timestamps was successful, but I could be wrong. Also, my time window should be from 2015-2016, not from 1981-1700. If I look up the min and max Timestamp in the DataFrame, I get the right Timestamps in 2015 and 2016 respectively.
Resulting picture from pyplot.plot
Edit:
df.head() gives:
   Sweep Time [dd.mm.yyyy hh:mm:ss.ms]  Frequency [Hz]  Voltage [V]
0    1.0       11.03.2014 10:13:04.270         50.0252      230.529
1    2.0       11.03.2014 10:13:06.254         49.9515      231.842
2    3.0       11.03.2014 10:13:08.254         49.9527      231.754
3    4.0       11.03.2014 10:13:10.254         49.9490      231.678
4    5.0       11.03.2014 10:13:12.254         49.9512      231.719
                 datetime
0 2014-03-11 10:13:04.270
1 2014-03-11 10:13:06.254
2 2014-03-11 10:13:08.254
3 2014-03-11 10:13:10.254
4 2014-03-11 10:13:12.254
and df.info() gives:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33270741 entries, 0 to 9140687
Data columns (total 5 columns):
Sweep float64
Time [dd.mm.yyyy hh:mm:ss.ms] object
Frequency [Hz] float64
Voltage [V] float64
datetime datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 1.5+ GB
I am trying to plot 'Frequency [Hz]' vs 'datetime'.
I think you need set_index, and then to set the formatting of both axes:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# parse the string column into datetime64[ns]
df['datetime'] = pd.to_datetime(df['Time [dd.mm.yyyy hh:mm:ss.ms]'],
                                format="%d.%m.%Y %H:%M:%S.%f")
print(df)
# use the parsed datetimes as the index so they become the x axis
df.set_index('datetime', inplace=True)
ax = df['Frequency [Hz]'].plot()
# x axis: date labels; y axis: two decimal places
ticklabels = df.index.strftime('%Y-%m-%d')
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
ax.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.2f'))
plt.show()
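Before looking at the plot itself, it can help to confirm the format string on a single value outside pandas. This is only a minimal sketch using the standard library; the sample string is copied from the df.head() output in the question, and it assumes the format codes behave the same way in datetime.strptime as in pd.to_datetime for this pattern.

```python
from datetime import datetime

# Sample value copied from the df.head() output in the question.
raw = "11.03.2014 10:13:04.270"

# Same format string that the question passes to pd.to_datetime.
parsed = datetime.strptime(raw, "%d.%m.%Y %H:%M:%S.%f")

# Day and month should not be swapped, and ".270" should be read as
# 270 milliseconds (stored as 270000 microseconds).
print(parsed.isoformat())
print(parsed.day, parsed.month, parsed.microsecond)
```

If this parses as expected, the format string is probably not the cause, and the row order or the index of the combined CSVs would be the next thing to inspect.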
I have searched everywhere trying to find an answer to this question and I haven't quite found what I'm looking for yet so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides its output in XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers each time you call the API), and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. When I tried to convert the results from the list into JSON, it dropped the ID attribute, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
  usps <- input_tracking_info[i] # input_tracking_info = GET commands
  usps <- read_xml(usps)
  final_tracking_info[[i]] <- usps$TrackResponse # store the result of call i
  gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output,"final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(working_list),file = "Final_Tracking_Info.txt")) # exported the list to a textfile, was not an ideal format to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. Is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there an easy way or preferred format to choose, given that most of the columns will have the same name ("TrackDetail") and the DFs will have different lengths (since each package has a different number of track details), when I'm trying to compile thousands of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
library(magrittr) # provides %>%

xml_data <- XML::xmlToList("71563898.xml") %>%
  unlist() %>% # flattening
  unname()     # removing names
data.frame(
  ID = tail(xml_data, 1),      # getting last element
  Summary = head(xml_data, 1), # getting first element
  Info = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)
I have a question about how to extract parts of a text and convert them to a data frame.
This is an example from my df: the content of one cell (one row of one column).
[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]
My expected output would be to have 2 columns with as many rows I get from this string:
effortDate
2021-07-04
2021-07-11
and second column
effort
2
1
Any suggestion how to achieve that?
Thanks!
This looks like JSON content, but the => gets in the way of reading it. If you replace it with :, you should be able to read it properly.
mystr <- '[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", "effort"=>2, "author"=>"a:aa:a"}, {"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", "effort"=>1, "author"=>"b:bb:b"}, {"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", "effort"=>1, "author"=>"c:cc:c"}]'
jsonlite::fromJSON(gsub("=>", ":", mystr))
# id effortDate effort author
# 1 aaaaaaaaaaaaaaaa 2021-07-04T23:00:00.000Z 2 a:aa:a
# 2 bbbbbbbbbbbbbb 2021-07-11T23:00:00.000Z 1 b:bb:b
# 3 ccccccccccccc 2021-07-17T23:00:00.000Z 1 c:cc:c
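The same idea (replace => with : and parse the result as JSON) can be sketched in Python with only the standard library, this time also reducing the result to the two requested columns. The string is the one from the question; trimming effortDate to its first 10 characters is an assumption about the wanted date format, and a blind replace would corrupt any string value that itself contains =>.

```python
import json

# Cell content from the question (Ruby-style hash syntax with "=>").
raw = (
    '[{"id"=>"aaaaaaaaaaaaaaaa", "effortDate"=>"2021-07-04T23:00:00.000Z", '
    '"effort"=>2, "author"=>"a:aa:a"}, '
    '{"id"=>"bbbbbbbbbbbbbb", "effortDate"=>"2021-07-11T23:00:00.000Z", '
    '"effort"=>1, "author"=>"b:bb:b"}, '
    '{"id"=>"ccccccccccccc", "effortDate"=>"2021-07-17T23:00:00.000Z", '
    '"effort"=>1, "author"=>"c:cc:c"}]'
)

# Turn "=>" into ":" so the text becomes valid JSON, then parse it.
records = json.loads(raw.replace("=>", ":"))

# Keep only the date part (YYYY-MM-DD) of effortDate and the effort value.
rows = [(rec["effortDate"][:10], rec["effort"]) for rec in records]

for effort_date, effort in rows:
    print(effort_date, effort)
```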
I work with MODIS EVI rasters from the year 2000. I have 6 rasters per year, one raster per month:
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000209_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000225_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000241_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000257_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000273_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000289_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000305_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000321_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000337_aid0001.tif"
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000353_aid0001.tif"
I would like to convert the date in each file name to a format like 2000-02-09, but I don't know how to do it.
It looks like your datetime string follows a year (4 digits) plus day-of-year (3 digits) format, e.g. 2000209 means day 209 of the year 2000 (27 July 2000). If so, the problem isn't difficult:
Extract those seven-digit numbers (str_extract).
Parse the date from them (you need to know that j is the code for parsing a date from the day of the year).
str_subset('[:graph:]') keeps only the elements that contain at least one visible character (so missing values are dropped) and returns the dates as character strings.
Code
library(stringr)   # str_extract, str_replace, str_subset, %>%
library(lubridate) # parse_date_time

dt <- data.frame(string =
c("D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000209_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000225_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000241_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000257_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000273_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000289_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000305_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000321_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000337_aid0001.tif",
"D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000353_aid0001.tif"))
dt$string %>% str_extract('\\d{7}') %>% str_replace('2000', '2000-') %>%
parse_date_time('y-j') %>% str_subset('[:graph:]')
Output
[1] "2000-07-27" "2000-08-12" "2000-08-28" "2000-09-13" "2000-09-29" "2000-10-15"
[7] "2000-10-31" "2000-11-16" "2000-12-02" "2000-12-18"
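For comparison, the same two steps (extract the seven digits, parse them as year plus day of year) can be sketched in Python with the standard library. The helper name doy_to_date is hypothetical, and anchoring the pattern on the "doy" prefix is an assumption based on the file names shown in the question; only the first and last file are used here.

```python
import re
from datetime import datetime

# First and last file names from the question.
paths = [
    "D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000209_aid0001.tif",
    "D:/Rteledetection/Pivots/MODIS/MOD13Q1.006__250m_16_days_EVI_doy2000353_aid0001.tif",
]


def doy_to_date(path):
    """Return the date encoded as 'doy' + YYYYDDD in a file name."""
    match = re.search(r"doy(\d{7})", path)
    if match is None:
        raise ValueError(f"no doyYYYYDDD token in {path!r}")
    # %Y = four-digit year, %j = day of the year (001-366)
    return datetime.strptime(match.group(1), "%Y%j").date()


dates = [doy_to_date(p).isoformat() for p in paths]
print(dates)
```

The two results agree with the first and last entries of the R output above.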
I have a CSV file of 1700 daily prices.
They are of this format:
1 1.6
2 2.5
3 0.2
4 ..
5 ..
6
7 ..
.
.
1700 1.3
The index is from 1:1700
But I need to specify a begin date and end date this way:
The start of the period is, let's say, 25 January 2009,
and the last (1700th) value corresponds to 14 May 2013.
So far I've gotten this close to solving the problem:
> dseries <- ts(dseries[,1], start = ??time??, freq = 30)
How do I go about this? thanks
UPDATE:
I managed to create a separate object with dates, as suggested in the answers, and plotted it, but the y axis is weird, as shown in the screenshot.
Something like this?
as.Date("25-01-2009",format="%d-%m-%Y") + (seq(1:1700)-1)
A better way, thanks to @AnandaMahto:
seq(as.Date("2009-01-25"), by="1 day", length.out=1700)
Plotting:
df <- data.frame(
myDate=seq(as.Date("2009-01-25"), by="1 day", length.out=1700),
myPrice=runif(1700)
)
plot(df)
R stores Date-classed objects as the integer offset from "1970-01-01", but the as.Date.numeric function needs an offset ('origin'), which can be any starting date:
rDate <- as.Date.numeric(dseries[,1], origin="2009-01-24")
Testing:
> rDate <- as.Date.numeric(1:10, origin="2009-01-24")
> rDate
[1] "2009-01-25" "2009-01-26" "2009-01-27" "2009-01-28" "2009-01-29"
[6] "2009-01-30" "2009-01-31" "2009-02-01" "2009-02-02" "2009-02-03"
You don't need to add the extension .numeric, since R automatically seeks out that function if you use the generic stem, as.Date, with an integer argument. I just put it in because as.Date.numeric has different arguments than as.Date.character.
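The "origin plus integer offset" idea is not specific to R. A minimal Python sketch of the same arithmetic, using the origin from the answer above and assuming one value per calendar day:

```python
from datetime import date, timedelta

# Day before the first observation, as in the as.Date.numeric call above.
origin = date(2009, 1, 24)

# Index i (1..1700) maps to origin + i days, one entry per calendar day.
dates = [origin + timedelta(days=i) for i in range(1, 1701)]

print(dates[0], dates[9], dates[-1])
```

The first ten entries match the R test output above. If the last date printed does not match the expected end date (14 May 2013 in the question), the series probably skips some days, for example weekends, and a plain daily sequence would be the wrong calendar for it.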
I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk <- data.matrix(read.csv("path/to/csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column with characters in it as a factor (with stringsAsFactors = TRUE, which was the default before R 4.0). The '-' entries in your dataset are characters, so those columns are converted to factors. Now, you pass the result of read.csv into data.matrix, and as the help states, it replaces the levels of a factor with its internal codes.
Basically, you need to ensure that the columns of your data are numeric before you pass the data.frame into data.matrix.
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
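To illustrate why the numbers turn into small integers, here is a rough Python imitation of what happens to one column. It is only a sketch: factor_codes and to_numeric are hypothetical helpers, and sorting the distinct strings is an approximation of how R orders factor levels (which depends on the locale).

```python
# The four sample rows from the question, kept as strings, the way a
# column containing "-" is read when it cannot be parsed as numeric.
rows = [
    ["40900", "-", "-", "-", "241.75", "0"],
    ["40905", "244", "245.79", "241.25", "244", "22114"],
    ["40906", "244", "246.79", "243.6", "245.5", "18024"],
    ["40907", "246", "248.5", "246", "247", "60859"],
]


def factor_codes(values):
    """Replace each string by the 1-based position of its sorted level."""
    levels = sorted(set(values))
    return [levels.index(v) + 1 for v in values]


def to_numeric(values, na="-"):
    """Treat the na marker as missing and convert everything else to float."""
    return [None if v == na else float(v) for v in values]


open_col = [row[1] for row in rows]
print(factor_codes(open_col))  # level codes, not the prices
print(to_numeric(open_col))    # prices, with the "-" entry as missing
```

The first list contains codes that merely rank the distinct strings, which is the kind of value data.matrix returns for a factor column; the second is what declaring "-" as a missing-value marker achieves.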
I'm no R expert, but you may consider using scan() instead, e.g.:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
I took a cut/paste of your data, put it in a file and I get this using 'R'
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file: for one thing, data for the header line. Also, the output you show seems to start at row 2713. I would check:
The format of the header line (or get rid of it and add it manually later).
That each row has exactly 6 values.
That the filename uses forward slashes and has no embedded spaces (use the 8.3 representation, as shown in my filename).
Also, if you generated your CSV file from MS Excel, note that its internal representation of a date is a number.