Daniel Fosas
Transcription
Introduction to HDF5
Daniel Fosas

Research project

Motivation
Stöckli et al. (2004)

Research question

Goal

Methodology
Parametric study: 13000 simulations

Considerations
1 csv (hourly) per simulation: 8761x121, 19MB
19MB/sim x 13000 sim ≈ 240GB

    LOCATION,San Francisco Intl Ap,CA,USA,TMY3,724940,37.62,-122.40,-8.0,2.0
    DESIGN CONDITIONS,1,Climate Design Data 2009 ASHRAE Handbook,,Heating,1,3.8,4.9,3.7,2.8,10.7,1.2,3.4
    TYPICAL/EXTREME PERIODS,6,Summer Week Nearest Max Temperature For Period,Extreme,8/ 1,8/ 7,Summer Week Ne
    GROUND TEMPERATURES,3,.5,,,,10.86,10.57,11.08,11.88,13.97,15.58,16.67,17.00,16.44,15.19,13.51,11.96
    HOLIDAYS/DAYLIGHT SAVINGS,No,0,0,0
    COMMENTS 1,Custom/User Format WMO#724940; NREL TMY Data Set (2008); Period of Record 1973-2005 (Generally)
    COMMENTS 2, Ground temps produced with a standard soil diffusivity of 2.3225760E-03 {m**2/day}
    DATA PERIODS,1,1,Data,Sunday, 1/ 1,12/31
    1999,1,1,1,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,7.2,5.6,90,102200,0,0,290,0,0,0,0,0
    1999,1,1,2,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,7.2,5.6,90,102100,0,0,296,0,0,0,0,0
    1999,1,1,3,0,?9?9?9?9E0?9?9?9?9?9?9?9?9?9?9?9?9?9?9?9*9*9?9?9?9,6.7,5.0,89,102200,0,0,291,0,0,0,0,0

EPW San Francisco TMY3: https://energyplus.net/weather-location/

Meet HDF5

Disclaimer
I am a mere user: I am not a computer scientist
In fact, I am an architect/civil engineer
Which means that I am only vaguely familiar with the details
Yet:
    This shows that everyone can use it
    It is very well documented, do check out the references and bibliography!

Introduction
Munroe "Standards"

HDF5
    Hierarchical (groups)
    Data (and metadata)
    Format (binary file)
    5 (not a newcomer)

HDF5
Made up of:
1. A file format that holds the data
2. A library that manages interactions
3. An ecosystem with many applications
Collette (2013, fig 2-1)

HDF5: A self-describing file format
Wasser (2015)

HDF5: The library
Fragment from Collette (2013, fig 2-1)

HDF5: The ecosystem
Many (many) platforms
Bindings: Fortran, C++, Java, Python, R, Matlab, ...
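
The "hierarchical", "data and metadata" and "binary file" parts of the name can be sketched with h5py, one of the Python bindings. This is only a minimal illustration and not part of the original demo: the file name, the group path simulations/sim_00001, the dataset name and the attribute values are all made up here, and the array is random data shaped like one simulation's hourly results.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group and dataset names, chosen for illustration only.
path = os.path.join(tempfile.mkdtemp(), "sketch.h5")

# Random stand-in for one simulation: 8760 hours x 121 variables.
hourly = np.random.uniform(-10, 40, size=(8760, 121))

with h5py.File(path, "w") as f:
    # Hierarchical: groups nest like folders; intermediate groups are created.
    sim = f.create_group("simulations/sim_00001")
    # Data: a dataset is an n-dimensional array stored in binary form.
    dset = sim.create_dataset("hourly", data=hourly)
    # Metadata: attributes are stored with the object they describe.
    dset.attrs["units"] = "degC"
    sim.attrs["weather_file"] = "San Francisco TMY3"

with h5py.File(path, "r") as f:
    dset = f["simulations/sim_00001/hourly"]
    shape = dset.shape
    units = dset.attrs["units"]
    # Partial I/O: slice the dataset to read only the first day of one column.
    first_day = dset[:24, 0]
```

Slicing the dataset object, rather than loading it whole, is what makes the partial and random access listed among the HDF5 features possible for files larger than memory.
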
Munroe "Universal Converter Box"

When to use it

    Need                    Features
    Data                    Large, Complex, Many objects, Heterogeneous
    Access                  Parallel I/O, Random access, Fast access, Partial I/O
    Environment requires    Special platforms, Multiple platforms, Portability, Efficient storage
    Plus                    Standard format, Available tools, Low cost, Image/Table API

Source: HDFGroup (2011) (personal highlighting)
Esoteric

When less useful
"It’s quite different from SQL-style relational databases. HDF5 has quite a few organizational tricks up its sleeve (see Chapter 8, for example), but if you find yourself needing to enforce relationships between values in various tables, or wanting to perform JOINs on your data, a relational database is probably more appropriate. Likewise, for tiny 1D datasets you need to be able to read on machines without HDF5 installed, text formats like CSV (with all their warts) are a reasonable alternative." (Collette 2013, p. 5)

Using HDF5

Demo
File viewers: HDFView, Compass
Python: Saving, Reading, Metadata
Munroe "Python" (Extract)

Python

Generate some numbers

    # Assumed setup, not shown on the slides (the extract starts at In[3]):
    # the imports, the random_word helper and the SQLite engine are guesses.
    import random
    import string
    import numpy as np
    import pandas as pd
    from sqlalchemy import create_engine

    def random_word(length):
        # Random string of ASCII letters, used as a column name
        return ''.join(random.choice(string.ascii_letters) for _ in range(length))

    engine = create_engine('sqlite:///example.db')

    In[3]: rows = 8760      # hours in a year
           cols = 11*11     # 121 columns
           index = pd.date_range(start='20010101', periods=8760, freq='H')
           df = pd.DataFrame(np.random.uniform(-10, 40, (rows, cols)),
                             columns=[random_word(40) for i in range(cols)],
                             index=index)
           df.head()
    Out[3]:
                         UqoYLpehAfynGiXMvkuINJhqDjmOfHMCjYxzKuwW  \
    2001-01-01 00:00:00                                 35.580977
    2001-01-01 01:00:00                                 19.150455
    2001-01-01 02:00:00                                 26.721835
    2001-01-01 03:00:00                                  3.331457
    2001-01-01 04:00:00                                  6.761000

                         aMhPHzbgKgBcNZMEdZQherUOnpSaeeuCxvcOWLyn  \
    2001-01-01 00:00:00                                 39.823702

Saving

    In[4]: %timeit df.to_csv('example.csv')  # 19.0MB
    1 loop, best of 3: 17.8 s per loop

    In[5]: %timeit df.to_sql('example.db', engine, if_exists='replace')  # 9.97MB
    1 loop, best of 3: 5.89 s per loop

    In[6]: %timeit df.to_hdf('example.h5', key='a_group', format="table")  # 8.22MB
    1 loop, best of 3: 118 ms per loop

Reading

    In[7]: %timeit pd.read_csv('example.csv')
    1 loop, best of 3: 1.87 s per loop

    In[8]: %timeit pd.read_sql('example.db', engine)
    1 loop, best of 3: 3.15 s per loop

    In[9]: %timeit pd.read_hdf('example.h5')
    10 loops, best of 3: 56.3 ms per loop

Metadata

    In[10]: with pd.HDFStore('example.h5', 'a') as store:
                # Append the frame to a second group, then attach a custom attribute to it
                store.put('another_group', df, format='table', append=True)
                store.get_storer('another_group').attrs.metadata = "hello world"

Looking into the file

    > ptdump example.h5
    / (RootGroup) ''
    /a_group (Group) ''
    /a_group/table (Table(8760,)) ''
    /another_group (Group) ''
    /another_group/table (Table(35040,)) ''

    > ptdump -a example.h5:/another_group
    /another_group (Group) ''
      /another_group._v_attrs (AttributeSet), 15 attributes:
       [CLASS := 'GROUP',
        TITLE := '',
        VERSION := '1.0',
        data_columns := [],
        encoding := 'UTF-8',
        index_cols := [(0, 'index')],
        info := {1: {'names': [None], 'type': 'Index'}, 'index': {'freq': <Hour>}},

References
Portrait image: Wheatley, M., 2013. file-cabinets-big-data [Online]. Available here [Accessed 15 May 2016]
Collette, A., 2013. Python and HDF5. Beijing: O’Reilly.
HDFGroup, 2011. Why HDF? [Online]. Available here [Accessed 15 May 2016]
Munroe, R., Standards [Online]. Available here [Accessed 15 May 2016]
Munroe, R., Universal Converter Box [Online]. Available here [Accessed 15 May 2016]
Munroe, R., Python [Online]. Available here [Accessed 15 May 2016]
Stöckli, R., Simmon, R., Herring, D., 2003. NASA Earth Observatory, based on data from the MODIS land team [Online]. Available here [Accessed 15 May 2016]
Wasser, L., 2015. About: Hierarchical Data Formats - What is HDF5? [Online]. Available here [Accessed 15 May 2016]

(Some) Useful resources

HDF5
HDF Group [Online]. Available here [Accessed 15 May 2016]
Wasser, L., 2015. About: Hierarchical Data Formats - What is HDF5? [Online]. Available here [Accessed 15 May 2016]

HDF5 and Python
Collette, A., 2013. Python and HDF5. First edition. Beijing: O’Reilly.
Scopatz, A., 2013. HDF5 is for Lovers [Online]. Available here [Accessed 15 May 2016]
Scopatz, A., 2015. Python & HDF5 - A vision [Online]. Available here [Accessed 15 May 2016]

Thank you!
https://dl.dropboxusercontent.com/u/4235140/_download_tutorial.zip