Daniel Fosas

Introduction to HDF5
Research project
Motivation
Stöckli et al. (2004)
Research question
Goal
13000
Methodology
Considerations
1 csv (hourly): 8761x121, 19MB
Parametric study: 19MB/sim x 13000 sim ≈ 240GB
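The storage estimate above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and it assumes the 8761 rows are one header row plus 8760 hourly values.

```python
# Figures taken from the slide; names are illustrative only.
rows, cols = 8761, 121          # assumed: 1 header row + 8760 hourly rows
mb_per_simulation = 19          # size of one hourly CSV
n_simulations = 13000           # size of the parametric study

total_mb = mb_per_simulation * n_simulations
total_gb_decimal = total_mb / 1000   # 1 GB = 1000 MB
total_gb_binary = total_mb / 1024    # 1 GiB = 1024 MiB

# Rough cost of storing one value as text in the CSV
bytes_per_cell = mb_per_simulation * 1e6 / (rows * cols)

print(f"{total_gb_decimal:.0f} GB (decimal), {total_gb_binary:.0f} GiB (binary)")
print(f"about {bytes_per_cell:.0f} bytes per value as text")
```

Either way of counting lands in the region of the 240GB quoted on the slide.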
LOCATION,San Francisco Intl Ap,CA,USA,TMY3,724940,37.62,-122.40,-8.0,2.0
DESIGN CONDITIONS,1,Climate Design Data 2009 ASHRAE Handbook,,Heating,1,3.8,4.9,-3.7,2.8,10.7,-1.2,3.4
TYPICAL/EXTREME PERIODS,6,Summer - Week Nearest Max Temperature For Period,Extreme,8/ 1,8/ 7,Summer - Week Ne
GROUND TEMPERATURES,3,.5,,,,10.86,10.57,11.08,11.88,13.97,15.58,16.67,17.00,16.44,15.19,13.51,11.96
HOLIDAYS/DAYLIGHT SAVINGS,No,0,0,0
COMMENTS 1,Custom/User Format -- WMO#724940; NREL TMY Data Set (2008); Period of Record 1973-2005 (Generally)
COMMENTS 2, -- Ground temps produced with a standard soil diffusivity of 2.3225760E-03 {m**2/day}
DATA PERIODS,1,1,Data,Sunday, 1/ 1,12/31
1999,1,1,1,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,7.2,5.6,90,102200,0,0,290,0,0,0,0,0
1999,1,1,2,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,7.2,5.6,90,102100,0,0,296,0,0,0,0,0
1999,1,1,3,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,6.7,5.0,89,102200,0,0,291,0,0,0,0,0
EPW San Francisco TMY3: https://energyplus.net/weather-location/
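An EPW file like this one mixes eight metadata header lines with a plain hourly table, which is part of why it is awkward to handle as CSV. A minimal sketch of reading it with pandas follows. It assumes the standard EPW layout (eight header lines, then year, month, day, hour, minute, data flags, dry bulb, dew point, relative humidity, pressure, ...). The sample text is the truncated excerpt from the slide, and the column labels are chosen here for illustration.

```python
import io

import pandas as pd

# Truncated excerpt from the slide (rows are cut short, as shown there).
EPW_SAMPLE = """\
LOCATION,San Francisco Intl Ap,CA,USA,TMY3,724940,37.62,-122.40,-8.0,2.0
DESIGN CONDITIONS,1,Climate Design Data 2009 ASHRAE Handbook,,Heating,1,3.8,4.9,-3.7,2.8,10.7,-1.2,3.4
TYPICAL/EXTREME PERIODS,6,Summer - Week Nearest Max Temperature For Period,Extreme,8/ 1,8/ 7,Summer - Week Ne
GROUND TEMPERATURES,3,.5,,,,10.86,10.57,11.08,11.88,13.97,15.58,16.67,17.00,16.44,15.19,13.51,11.96
HOLIDAYS/DAYLIGHT SAVINGS,No,0,0,0
COMMENTS 1,Custom/User Format -- WMO#724940; NREL TMY Data Set (2008); Period of Record 1973-2005 (Generally)
COMMENTS 2, -- Ground temps produced with a standard soil diffusivity of 2.3225760E-03 {m**2/day}
DATA PERIODS,1,1,Data,Sunday, 1/ 1,12/31
1999,1,1,1,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,7.2,5.6,90,102200,0,0,290,0,0,0,0,0
1999,1,1,2,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,7.2,5.6,90,102100,0,0,296,0,0,0,0,0
1999,1,1,3,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,6.7,5.0,89,102200,0,0,291,0,0,0,0,0
"""

# The first header line carries the site metadata.
location_fields = EPW_SAMPLE.splitlines()[0].split(",")
location = {
    "name": location_fields[1],
    "latitude": float(location_fields[6]),
    "longitude": float(location_fields[7]),
}

# Skip the eight header lines; the hourly records have no column names.
raw = pd.read_csv(io.StringIO(EPW_SAMPLE), skiprows=8, header=None)

# Keep a few fields by position; labels are illustrative, not part of the format.
weather = raw[[0, 1, 2, 3, 6, 7, 8, 9]].copy()
weather.columns = ["year", "month", "day", "hour",
                   "dry_bulb_C", "dew_point_C", "rel_humidity_pct", "pressure_Pa"]
print(location)
print(weather)
```

Note that the site metadata has to be parsed separately from the table; in HDF5 both could live in the same object, as data plus attributes.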
Meet HDF5
Disclaimer
I am a mere user:
I am not a computer scientist
In fact, I am an architect/civil engineer
Which means that I am only vaguely familiar with the details
Yet:
This shows that anyone can use it
It is very well documented, so do check out the references and bibliography!
Introduction
Munroe "Standards"
Hierarchical (groups)
Data (and metadata)
Format (binary file)
5 (not a newcomer)
HDF5
Made up of:
1. A file format that holds the data
2. A library that manages interactions
3. An ecosystem with many applications
Collette (2013, fig 2-1)
HDF5: A self-describing file format
Wasser (2015)
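To make "hierarchical", "data and metadata" and "self-describing" concrete, here is a small sketch using the h5py binding. This is an assumption of the sketch: the demo later in these slides uses pandas instead. The group, dataset and attribute names are invented for illustration, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("in_memory.h5", "w", driver="core", backing_store=False) as f:
    # Groups behave like folders (the "hierarchical" part).
    sim = f.create_group("simulation_0001")
    sim.attrs["weather_file"] = "San Francisco TMY3"   # metadata on a group

    # Datasets hold the arrays (the "data" part).
    temps = sim.create_dataset("air_temperature",
                               data=np.linspace(-10.0, 40.0, 8760))
    temps.attrs["units"] = "degC"                      # metadata on a dataset

    # The file can describe itself: walk it without knowing its layout.
    contents = []
    f.visititems(lambda name, obj: contents.append((name, type(obj).__name__)))

    units = f["simulation_0001/air_temperature"].attrs["units"]
    shape = f["simulation_0001/air_temperature"].shape

for name, kind in contents:
    print(kind, name)
```

The reader needs no side documentation to discover what is stored, its shape, or its units.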
HDF5: The library
Fragment from Collette (2013, fig 2-1)
HDF5: The ecosystem
Many (many) platforms
Bindings:
Fortran
C++
Java
Python
R
Matlab
...
Munroe "Universal Converter Box"
When to use it
Need
Features
Data
Large
Complex
Many objects
Heterogeneous
Access
Parallel I/O
Random access
Fast access
Partial I/O
Environment
requires
Special platforms
Multiple platforms
Portability
Efficient storage
Plus
Standard format
Available tools
Low cost
Image/Table
API
Source: HDFGroup (2011) (personal highlighting)
Esoteric
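Partial I/O and random access are perhaps the easiest of these features to show. The sketch below again assumes the h5py binding; the dataset name, chunk shape and compression choice are illustrative, not recommendations. It writes a table the size of one simulation and then reads back only one day of one column.

```python
import os
import tempfile

import h5py
import numpy as np

rows, cols = 8760, 121
data = np.random.uniform(-10, 40, (rows, cols))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "partial_io.h5")

    # Chunked, compressed storage: one chunk per day of hourly values.
    with h5py.File(path, "w") as f:
        f.create_dataset("results", data=data,
                         chunks=(24, cols), compression="gzip")

    with h5py.File(path, "r") as f:
        dset = f["results"]          # no data is read yet
        full_shape = dset.shape      # shape comes from the file's metadata
        first_day = dset[0:24, 5]    # only this slice is read into memory

print(full_shape, first_day.shape)
```

With a CSV, the same request would mean parsing the file from the top until the wanted rows are reached.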
When less useful
"It’s quite different from SQL-style relational databases. HDF5
has quite a few organizational tricks up its sleeve (see Chapter
8, for example), but if you find yourself needing to enforce
relationships between values in various tables, or wanting to
perform JOINs on your data, a relational database is probably
more appropriate. Likewise, for tiny 1D datasets you need to
be able to read on machines without HDF5 installed. Text
formats like CSV (with all their warts) are a reasonable
alternative." (Collette 2013, p. 5)
Using HDF5
Demo
File viewers:
HDFView
Compass
Python:
Saving
Reading
Metadata
Munroe "Python" (Extract)
Python
Generate some numbers
In[3]: rows = 8760
       cols = 11*11
       index = pd.date_range(start='2001-01-01', periods=8760, freq='H')
       df = pd.DataFrame(np.random.uniform(-10, 40, (rows, cols)),
                         columns=[random_word(40) for i in range(cols)],
                         index=index)
       df.head()
Out[3]:
                     UqoYLpehAfynGiXMvkuINJhqDjmOfHMCjYxzKuwW \
2001-01-01 00:00:00                                 35.580977
2001-01-01 01:00:00                                 19.150455
2001-01-01 02:00:00                                 26.721835
2001-01-01 03:00:00                                 -3.331457
2001-01-01 04:00:00                                 -6.761000
                     aMhPHzbgKgBcNZMEdZQherUOnpSaeeuCxvcOWLyn \
2001-01-01 00:00:00                                 39.823702
Saving
In[4]: %timeit df.to_csv('example.csv')  # 19.0MB
1 loop, best of 3: 17.8 s per loop
In[5]: %timeit df.to_sql('example.db', engine, if_exists='replace')  # 9.97MB
1 loop, best of 3: 5.89 s per loop
In[6]: %timeit df.to_hdf('example.h5', 'a_group', format="table")  # 8.22MB
1 loop, best of 3: 118 ms per loop
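The cells above depend on earlier notebook state (the imports and a `random_word` helper that the slides do not show). A self-contained sketch of the CSV versus HDF5 comparison is given below. Its assumptions: `random_word` here is a hypothetical stand-in for the missing helper, the SQL case is left out because it needs a database engine, files go to a temporary directory, and pandas needs PyTables installed for `to_hdf`. Sizes and timings will differ from the slide depending on machine and library versions.

```python
import os
import random
import string
import tempfile

import numpy as np
import pandas as pd


def random_word(length):
    """Hypothetical stand-in for the helper used on the slide."""
    return "".join(random.choice(string.ascii_letters) for _ in range(length))


# One year of hourly values in 121 columns, as on the slide.
rows, cols = 8760, 11 * 11
index = pd.date_range(start="2001-01-01", periods=rows,
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame(np.random.uniform(-10, 40, (rows, cols)),
                  columns=[random_word(40) for _ in range(cols)],
                  index=index)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "example.csv")
    h5_path = os.path.join(tmp, "example.h5")

    # Same data, two formats.
    df.to_csv(csv_path)
    df.to_hdf(h5_path, key="a_group", format="table")

    csv_mb = os.path.getsize(csv_path) / 1e6
    h5_mb = os.path.getsize(h5_path) / 1e6

    # Read the HDF5 copy back to confirm nothing was lost.
    back = pd.read_hdf(h5_path, key="a_group")

print(f"CSV:  {csv_mb:.1f} MB")
print(f"HDF5: {h5_mb:.1f} MB")
```

Wrapping the `to_csv`, `to_hdf` and `read_hdf` calls in `%timeit` inside IPython reproduces the kind of timing comparison shown on the slides.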
Reading
In[7]: %timeit pd.read_csv('example.csv')
1 loop, best of 3: 1.87 s per loop
In[8]: %timeit pd.read_sql('example.db', engine)
1 loop, best of 3: 3.15 s per loop
In[9]: %timeit pd.read_hdf('example.h5')
10 loops, best of 3: 56.3 ms per loop
Metadata
In[10]: with pd.HDFStore('example.h5', 'a') as store:
            store.put('another_group', df, format='table', append=True)
            store.get_storer('another_group').attrs.metadata = "hello world"
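A sketch of the full round trip for metadata follows: write the attribute, close the file, reopen it and read the attribute back. The small DataFrame and its column names are invented for illustration, and PyTables is assumed to be installed. The sketch leaves out `append=True`, since that option adds rows on every run; repeated runs would be one possible explanation for a table holding a multiple of 8760 rows.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A small illustrative table: one day of hourly values.
df = pd.DataFrame(
    np.random.uniform(-10, 40, (24, 3)),
    columns=["zone_a", "zone_b", "zone_c"],   # hypothetical column names
    index=pd.date_range("2001-01-01", periods=24, freq=pd.Timedelta(hours=1)),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")

    # Write the table and attach free-form metadata to its group.
    with pd.HDFStore(path, mode="a") as store:
        store.put("another_group", df, format="table")
        store.get_storer("another_group").attrs.metadata = "hello world"

    # Reopen read-only: the metadata travels with the data.
    with pd.HDFStore(path, mode="r") as store:
        keys = store.keys()
        storer = store.get_storer("another_group")
        metadata = storer.attrs.metadata
        n_rows = storer.nrows

print(keys, metadata, n_rows)
```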
Looking into the file
>ptdump example.h5
/ (RootGroup) ''
/a_group (Group) ''
/a_group/table (Table(8760,)) ''
/another_group (Group) ''
/another_group/table (Table(35040,)) ''
>ptdump example.h5:/another_group -a
/another_group (Group) ''
/another_group._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := 'UTF-8',
    index_cols := [(0, 'index')],
    info := {1: {'names': [None], 'type': 'Index'}, 'index': {'freq': <Hour>}},
References
Portrait image: Wheatley, M., 2013. file-cabinets-big-data [Online] Available here
[Accessed 15 May 2016]
Collette, A., 2013. Python and HDF5, Beijing: O’Reilly.
HDFGroup, 2011. Why HDF? [Online] Available here [Accessed 15 May 2016]
Munroe, R., Standards [Online]. Available here [Accessed 15 May 2016]
Munroe, R., Universal Converter Box [Online]. Available here [Accessed 15 May 2016]
Munroe, R., Python [Online]. Available here [Accessed 15 May 2016]
Stöckli, R., Simmon, R., Herring, D., 2003. NASA Earth Observatory, based on data from
the MODIS land team [Online]. Available here [Accessed 15 May 2016]
Wasser, L, 2015. About: Hierarchical Data Formats - What is HDF5? [Online]. Available
here [Accessed 15 May 2016]
(Some) Useful resources
HDF5
HDF Group [Online] Available here [Accessed 15 May 2016]
Wasser, L, 2015. About: Hierarchical Data Formats - What is HDF5? [Online]. Available
here [Accessed 15 May 2016]
HDF5 and Python
Collette, A., 2013. Python and HDF5, First edition, Beijing: O’Reilly.
Scopatz, A., 2013. HDF5 is for Lovers [Online]. Available here [Accessed 15 May 2016]
Scopatz, A., 2015. Python & HDF5 — A vision [Online]. Available here [Accessed 15 May
2016]
Thank you!
https://dl.dropboxusercontent.com/u/4235140/_download_tutorial.zip