Making sense out of the many changes to our computing environment
Making sense out of the many changes to our computing environment (with a healthy dose of Odyssey)
Bob Yantosca, Senior Software Engineer, with the GCST: Melissa, Matt, Lizzie, Mike
Jacob Group Meeting, Friday, 20 Nov 2015

Introduction
● Two major sources of significant disruptive change to our computing environments:
  – Consolidation of IT assets formerly managed by SEAS
    ● e.g. email, web servers, computational clusters
    ● Many of these assets have now been taken over by Harvard University IT (HUIT) and FAS Research Computing (RC)
  – Jacob-group migration from the AS cluster to Odyssey
    ● Precipitated by Jack's retirement
    ● But one which has resulted in growing pains

Why centralization of IT assets?
● Historically, each school made its own IT decisions
  – Which led to much repetition and confusion
    ● e.g. 20+ email systems, 1000's of web servers & clusters
    ● Single point of failure – not good
● Need to provide IT as economically as possible
  – More IT assets are being managed by HUIT and FAS RC, and fewer by individual schools
  – Centralization = economy of scale = saving $$
  – SEAS is doing this already (under Gabriele Fariello)

Letter from Gabriele Fariello to the SEAS community (30 Sep 2015)

Dear Friends and Colleagues,

I am writing to update you on what we in computing have been doing and why, what is left to be done, and to invite you to share your thoughts on the future of computing at SEAS. You can find more details below, but in summary: Over the past two years, we have worked to refocus computing at SEAS to be able to provide the exceptional service that a world-class engineering and applied sciences school needs. We have:
● significantly increased the compute power available to all faculty while reducing costs and managing resources
● reduced our operational footprint to one-quarter of what it was two years ago

The many recent changes coming to SEAS and Harvard have inevitably added to the burden of changes the community has experienced. I understand that this has not been an easy period of transition, but the period of significant changes is almost over and should be by the end of the Fall Semester. …

-Gabriele
Assistant Dean for Computing & Chief Information Officer
Harvard John A. Paulson School of Engineering and Applied Sciences

Local IT assets that were changed
● seas.harvard.edu emails
● seas.harvard.edu web sites
● GEOS-Chem and Jacob-group email lists
● GEOS-Chem and Jacob-group websites
● GEOS-Chem wiki
● GEOS-Chem Git repositories

seas.harvard.edu emails
Old System: SEAS-hosted Microsoft Exchange server
New System: Cloud-based Microsoft Office 365 server
Switchover: May 2015 through September 2015
Affected: Everyone with a @seas.harvard.edu email address
Issues:
● People were migrated to a temporary email server in May and then back again in August and September.
● The transition was not as smooth as hoped for some.
● Everyone received an excessive amount of spam for about 2-3 months until HUIT finalized the spam filters.

seas.harvard.edu web sites
Old System: people.seas.harvard.edu/~username
New System: Drupal OpenScholar
Switchover: Summer 2015
Affected: SEAS faculty, staff, students
Issues:
● You used to be able to upload HTML files (created with Dreamweaver etc.) to the people.seas.harvard.edu site.
● The people.seas.harvard.edu sites were discontinued (but existing users were grandfathered in).
● You cannot edit HTML code directly with OpenScholar.
● OpenScholar's "look-and-feel" is inferior to hand-written HTML + CSS.
● OpenScholar has a 2 GB limit (you can't upload a lot of documents).
GC and Jacob-group email lists
Old System: SEAS-hosted "Mailman" list servers
New System: Google Groups (hosted on g.harvard.edu)
Switchover: August 2015
Affected: All users of the GEOS-Chem and Jacob-group email lists
Issues:
● [email protected] → [email protected]
● We were promised a smooth transition, BUT the initial migration was done incorrectly.
● Many addresses were omitted; people complained.
● GCST tried to manually add addresses but found that a 100-address/day limit had been put in place. (UGH!!!)
● The migration had to be done a second time, which was successful.

GC and Jacob-group web sites
Old System: Local Atmospheric Sciences web server
New System: Amazon Web Services (cloud-based) web server
Switchover: Summer 2015
Affected: GC and Jacob-group website users
Issues:
● None really to speak of; the transition was very smooth!
● Judit is currently looking into replacing the website Git servers with another type of content management system (stay tuned).

GEOS-Chem wiki
Old System: SEAS-hosted MediaWiki
New System: Amazon Web Services (cloud-hosted) MediaWiki
Switchover: June 2015 (after IGC7)
Affected: GC users who rely on the wiki (i.e. everyone)
Issues:
● The machine where the GC wiki lived at SEAS was retired.
● Migration to AWS was very smooth. (Thanks Judit!)
● The MediaWiki version was updated from v1.19 to v1.24.2.
● All pages were preserved during the transition.
● Very few changes are apparent to users, except for the look & feel.

GEOS-Chem Git repositories
Old System: git.as.harvard.edu
New System: bitbucket.org/gcst
Switchover: Summer 2015
Affected: All GEOS-Chem users
Issues:
● git.as.harvard.edu can only be updated from the AS server; it is read-only from Odyssey.
● The git.as.harvard.edu server has a 15-minute delay before updates become visible to the outside world.
● The need to sync code between AS and Odyssey prompted us to migrate the Git repos to bitbucket.org.
● We also obtained an academic license for bitbucket.org, so that we can have an unlimited number of developers for free.

Migration to Odyssey
● Homer's Odyssey is the story of a warrior who took 10 years to get home.
● Sometimes it felt like it would take that long to get up and running on Odyssey.
● We faced a few issues along the way (which we'll hear about shortly).
● But first, I'll give a brief introduction to Odyssey.

[Slides from "Introduction to Odyssey & RC Services" by Robert Freeman and Plamen Krastev, FAS RC: overview of the Odyssey cluster — login nodes; compute nodes, including the holyseas01-04 nodes (4 x 64 CPUs) in the jacob partition; home & lab disks (/n/home*/YOUR_USER_NAME, plus /n/seasasfs02/YOUR_USER_NAME for long-term storage); and network scratch (/n/regal/jacob_lab/YOUR_USER_NAME for temporary storage).]

You can request interactive (I) or batch (B) jobs to run in these queues via the SLURM resource manager.
● The jacob partition (ACMG only) accepts both batch and interactive jobs: 18-hour default / 36-hour maximum run time, 256 CPUs (64 CPUs/node), 288 GB/node (that's 4.5 GB/CPU).
● NOTE: While there are issues with the jacob partition, you can try submitting to serial_requeue, especially if you are using 8 or fewer CPUs and a moderate amount of memory.
● The general queue prioritizes high-memory jobs; low-memory jobs will pend for days or weeks (until RC fixes this!).
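As a concrete illustration of a batch request to one of these queues, here is a minimal sketch of a SLURM job script for an OpenMP GEOS-Chem-style run. The partition name comes from the slides above, but the memory, time, CPU count, and executable name are illustrative assumptions, not recommendations:

    #!/bin/bash
    #SBATCH -p jacob                 # partition ("run queue"); serial_requeue is an alternative
    #SBATCH -N 1                     # one node
    #SBATCH -n 1                     # one task (GEOS-Chem "classic" is OpenMP, not MPI)
    #SBATCH -c 8                     # CPUs (OpenMP threads) for that task
    #SBATCH --mem=16000              # memory in MB (illustrative value)
    #SBATCH -t 0-12:00               # wall-time limit (d-hh:mm), within the 36-hour max
    #SBATCH -o geoschem_%j.log       # log file named with the SLURM job ID

    # Match the OpenMP thread count to the CPUs that SLURM granted us
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # Placeholder for the model executable in the run directory
    ./geos

Submit the script with sbatch; an interactive session can be requested with srun instead, as described on the SLURM slide that follows.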
Slide from "Introduction to Odyssey & RC Services" by Robert Freeman and Plamen Krastev, FAS RC

Simple Linux Utility for Resource Management (SLURM)
● You ask SLURM for the following resources:
  – Amount of memory for the job
  – Amount of time for the job
  – Number of CPUs and nodes for the job
  – Interactive session (srun) or queued session (sbatch)
  – Run queue (aka "partition")

Who came up with a name like SLURM?
● SLURM was the cola featured on the animated series "Futurama". It was what people drank in the year 3000 (a parody of Coke and Pepsi).
● The gag was that SLURM was highly addictive, so people couldn't stop drinking it!
● "Slurms Mackenzie", the party slug, was the mascot for SLURM on the show (pictured at left).

● Cc: [email protected] (i.e. Judit) and [email protected]
● GCST has also created local documentation on the AS wiki about how to log into Odyssey and run jobs there. Direct links:
  – http://wiki.as.harvard.edu/wiki/doku.php
  – http://wiki.as.harvard.edu/wiki/doku.php/wiki:basics
  – http://wiki.as.harvard.edu/wiki/doku.php/wiki:as:startup_scripts
  – http://wiki.as.harvard.edu/wiki/doku.php/wiki:as:slurm

Bumps in the road ...
● During the migration process, several unforeseen issues came up that demanded our attention.
● Case in point, GCHP:
  – Mike, Matt, and Jiawei (Jintai Lin's student) were early users of GCHP on Odyssey.
  – But they immediately ran into a couple of serious roadblocks ...

Mike Long wrote to FAS RC (June 2015):
  I failed to notice that you only have TWO Intel licenses. This WILL BE A HUGE PROBLEM for us. Please let me know what we have to do to remedy this soon. Thanks. ML

And the response that Mike got from RC was:
  1. Use a free compiler like gcc/gfortran.
  2. Don't compile as often; in reality the contention for the Intel compiler isn't that high. When I build software I only stall out once a month maybe, and we don't get many complaints. So unless you are building constantly or need to recompile on the fly it may not be a frequent issue.
  3. Buy more licenses. As it stands we do not plan to purchase more licenses. However, if you want to make a case for it I can pass you along to … our operations manager. However, he may ask you to foot the bill for the additional licenses.

Bumps in the road ...
● This situation was made worse by users abusing the system.
  Mike Long wrote again to FAS RC (1st week of June 2015):
  A quick follow-up. Unfortunately the situation is disruptive. A user ... appears to have an automatically running script that loads an Intel-based OpenMPI, compiles and runs a program. It is completely monopolizing the Intel licenses, leaving me absolutely dead in the water. Unfortunately, we rely upon the Intel system for our work. Is there a way to ask Mr. ____ to amend his procedure?
● We had to address this ASAP.

Bumps in the road ...
● The way that we solved this was to port our existing Intel Fortran Compiler licenses (v11) to Odyssey.
● We also brought over our existing netCDF/HDF5 libraries that were compiled with Intel Fortran Compiler v11.
● These are sufficient for most Jacob-group users, who are working with GEOS-Chem "classic".
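For users who want to build against those ported compilers and libraries by hand, here is a minimal sketch of grabbing an interactive session and setting up the environment. The resource values are illustrative, and the module names are hypothetical assumptions — check module avail and the AS wiki for the actual names on Odyssey:

    # Ask SLURM for an interactive shell (srun) rather than a batch job (sbatch)
    srun -p jacob --pty -N 1 -n 1 -c 8 --mem=8000 -t 0-04:00 /bin/bash

    # Hypothetical module names for the ported Intel Fortran v11 compiler and
    # the matching netCDF/HDF5 libraries; the real names may differ
    module load intel/11.1 netcdf

    # Illustrative build step in the GEOS-Chem source directory
    make -j8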
Jobs running slower on Odyssey

Lu Hu wrote (11/5/2015 1:58 PM):
  Here are some examples (see below). I got a set of runs, same settings but for different years. The runtime to finish them varies from normally 20 hours, to 32 hours, and many times >36 hours. All of these jobs were running on /n/regal.
  ~29.5h  ===> SIMULATION START TIME: 2015/08/25 20:19 <===  ===> SIMULATION END TIME: 2015/08/27 02:53 <===
  ~20h    ===> SIMULATION START TIME: 2015/08/26 22:22 <===  ===> SIMULATION END TIME: 2015/08/27 19:22 <===
  ~19.5h  ===> SIMULATION START TIME: 2015/09/12 06:23 <===  ===> SIMULATION END TIME: 2015/09/13 01:07 <===
  ~32.5h  ===> SIMULATION START TIME: 2015/09/17 18:05 <===  ===> SIMULATION END TIME: 2015/09/19 02:44 <===

Katie Travis wrote (11/3/2015 5:16 PM):
  Judit, my job 50950707 has run 10 days in 6 hours, which is about 2/3 slower than it should be. Thanks, Katie

Rachel Silvern wrote (11/3/2015 5:22 PM):
  I have been running into issues trying to run a one-month 4x5 v9-02 (SEAC4RS) simulation this afternoon. My run keeps dying at stratospheric chemistry during the first timestep [with a segfault]. I have a couple of job numbers: 50974904, 50972514, 50967636. I'm not sure if this is related to issues on the partition or if I have my own mysterious issues.

Jobs running slower on Odyssey

Melissa Sulprizio wrote (Thu 11/5/2015 4:05 PM):
  Here are the job stats for our recent 1-month benchmark simulations. These ran on AS (v11-01a, v11-01b), the previous holy2a110xx jacob nodes on Odyssey (v11-01c), and the current holyseas[01-04] jacob nodes on Odyssey (v11-01d). I highlighted the elapsed time for the last two jobs because those runs hit the time limit that I had set. Those jobs should not have timed out, because all of these jobs are 1-month GEOS-Chem simulations and the previous run times were all well below 12 hours. In fact, the last run (50518629) only got to day 16 out of 31 in the allotted 24 hours. It's important to note that some of the differences in run time and memory may have been affected by updates that we added to the model.

Jobs running slower on Odyssey

Melissa Sulprizio wrote again (11/6/2015 8:36 AM):
  Finally, it's interesting to note that run times on AS were *MUCH* faster. We see typical run times of ~1 hour on AS for these 7-day GEOS-Chem simulations, as compared to 2-4 hours on Odyssey. I think this confirms our suspicion that the Odyssey nodes are not running faster than AS, like we were promised.

Jobs running slower on Odyssey

[Chart courtesy of Matt Yannetti: run time in minutes (0-300) for GC with GEOS-FP vs. GC with GCAP2, broken down into Initialization, Timesteps, Chemistry, Transport, and Convection.]

Matt writes:
  Over on the left (labeled "Initialization"), we have all of the I/O that's done, and as you can see, there's not much variation in it. There are some small spikes for the blue bit, but the red is almost all the same. The big kicker we have here is the "Timesteps" bars, which is effectively all the computation the code is doing. Notice how on the red there's a variance of over 60 minutes between some runs, and for the blue a variation of almost 50, while the I/O part has nowhere near that level of variation. The other three columns are a breakdown of the different types of science the code does, and the runtimes of each.

Jobs running slower on Odyssey
● GCST felt that RC had not given us a satisfactory explanation of why GC was so much slower.
● DJJ and GCST brought these issues to FAS RC (via high-level meetings).
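Gathering the kind of job statistics quoted above is a matter of querying SLURM's accounting records and the GEOS-Chem log files. Here is a minimal sketch; the job ID is taken from Katie's email, and the log-file name is a hypothetical placeholder:

    # Elapsed wall time, peak memory, and final state for a finished job
    sacct -j 50950707 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State

    # Pull the simulation start/end stamps out of a GEOS-Chem log
    # ("geoschem.log" is an assumed file name; use whatever your run script writes)
    grep "SIMULATION" geoschem.log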
RC was responsive
  – Scott Yockel (RC) is working with us to diagnose the problem.
  – Scott is helping us run tests.

RC has been very responsive

Scott Yockel wrote:
  GEOS-CHEM Group,
  Thank you all for meeting today. Sometimes it helps a lot on both ends to meet face-to-face. Please continue to send in tickets anytime you have issues with the cluster. Please address those tickets to Dan Caunt and myself, as we will make sure to fully answer your questions/concerns in a timely manner. Also, please spread this to the rest of the Jacobs group that were not at the meeting today. At any point that you are dealing with someone else in RC via our ticketing system, chat, or Office Hours and you are having issues dealing with that individual, please email me directly. And, if you have issues with how I'm handling something, please email James Cuff. I'm here to make sure that your research throughput is optimal and that RC resources are attractable.
  I've already closed off serial_requeue jobs from the jacobs nodes. After job 51173945 is finished, I'll reboot holyseas01 with the lower processor count (essentially hyperthreading disabled). Then I'll run the test provided below on that host to start getting some baseline timings. Lizzie, if you can also provide me with the runtimes for that same job on the AS cluster, that would be helpful for comparison.

Working hypothesis
● Differences in performance may be due to the different architecture of our holyseas01-04 nodes
  – Holyseas nodes use AMD CPUs ("Opteron")
  – The AS cluster uses Intel CPUs ("Nehalem")
● "Hyperthreading" (each CPU acting as 2)
  – May also degrade performance when nodes are busy
  – Recall that other users' jobs are "backfilling" onto holyseas01-04 CPUs to maximize usage

We did some testing ...
● Scott was able to (temporarily) eliminate sources of variability on the holyseas nodes
  – Turned off "hyperthreading"
  – Turned off "backfilling"
● Simulations done
  – 1-year Rn-Pb-Be simulations (Melissa)
  – 12-hour "benchmark" simulation (Lizzie, Scott)

Results of our tests
● Melissa ran a set of 1-yr Rn-Pb-Be simulations
  – On both AS and Odyssey
● But run times are still slower than on AS
  – By up to a factor of 2

Results of our tests
● Scott ran the 12-hr "benchmark" test on Odyssey
  – Using varying options, runs took ~10-12 minutes
● Lizzie ran the 12-hr "benchmark" tests on Odyssey
  – Consistent run times obtained on Odyssey
  – But runs take 2X as long on Odyssey as on AS

Results of our tests

Scott Yockel wrote (Fri 11/13, 10:23 AM):
  In all of this testing, I see only about 1-2 min variability in the run times (on Odyssey), even when I run 4 jobs concurrently. I even tried running on a different host (holyhbs03) that doesn't have the memory cap on it, to see if more memory would make it run faster and do less swapping. I didn't find that to be the case either. After watching many of these runs, there is a substantial amount of time when the CPUs are not at 100%. This means that the CPUs are waiting for instructions; this may be due to the times when it is fetching data from files, or something else. I'm guessing that some compilation options may make some difference here. The Intel 11 compiler came out in 2008, with an update in 2009 (11.1). These AMD Opteron 6376 chips were launched in late 2012, so there are some chipset features that can be used that Intel 11.1 doesn't even know about. For example, the floating-point unit on these chips is 256 bits wide, which means it could do 4 x 64-bit operations simultaneously. However, if the compiled code isn't aware of that feature, then it will not feed it 4 instructions per clock cycle. So effectively, instead of 2.3 GHz you are cutting that down by half or even by a quarter.

● But ... switching to a newer compiler version doesn't make that much difference.
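One way to see what Scott is describing is to compare the instruction sets the holyseas CPUs advertise against what the compiled code actually targets. A minimal sketch, run on a compute node; the /proc/cpuinfo fields are standard Linux, while the compiler-flag remark is an assumption to verify against the documentation for whatever compiler version is installed:

    # Identify the CPU model on the node (e.g. AMD Opteron 6376 on holyseas01-04)
    grep -m1 'model name' /proc/cpuinfo

    # List the vector-instruction features the chip supports (AVX, FMA, SSE4, ...)
    grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(avx|fma|sse4)' | sort -u

    # If AVX shows up here but the executable was built with a 2008-era compiler
    # (Intel 11.x), the 256-bit FPU likely goes unused. Newer compilers offer
    # architecture flags (e.g. -mavx; an assumption to confirm in the compiler
    # docs) that emit AVX code usable on non-Intel CPUs.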
Summary of our findings
● Testing is still continuing
  – Scott was away this week (back next week).
  – We may need to try to find the right combination of SLURM options.
● But Intel Fortran may not work as well on AMD chips
  – This may be an Intel marketing decision designed to crush AMD; this is a longstanding AMD complaint.
  – Article from 2005 (not sure if the situation has improved since then): http://techreport.com/news/8547/does-intel-compiler-cripple-amd-performance
● Worst case: RC has offered to replace the AMD nodes with Intel nodes, if our testing suggests that is necessary.

GCHP test by Seb Eastham shows that input from /n/regal is the fastest
● GCHP @ 4x5, 1-day runs with 6 CPUs (average of 10 runs).
● Disk I/O is the most expensive operation.

Questions and Discussion