Making sense out of the many changes to our computing environment

Transcription

Making sense out of the many changes to our computing environment
(with a healthy dose of Odyssey)
Bob Yantosca
Senior Software Engineer
with the GCST: Melissa, Matt, Lizzie, Mike
Jacob Group Meeting
Friday, 20 Nov 2015
Introduction
● Two major sources of significant disruptive change to our computing environments:
  – Consolidation of IT assets formerly managed by SEAS
    ● e.g. email, web servers, computational clusters
    ● Many of these assets have now been taken over by Harvard University IT (HUIT) and FAS Research Computing (RC)
  – Jacob-group migration from AS cluster to Odyssey
    ● Precipitated by Jack's retirement
    ● But which has resulted in growing pains
Why centralization of IT assets?
● Historically, each school made its own IT decisions
  – Which led to much repetition and confusion
    ● e.g. 20+ email systems, 1000's of web servers & clusters
    ● Single point of failure – not good
● Need to provide IT as economically as possible
  – More IT assets are being managed by HUIT and FAS RC, and fewer by individual schools
  – Centralization = economy of scale = saving $$
  – SEAS is doing this already (under Gabriele Fariello)
Letter from Gabriele Fariello to the SEAS community (30 Sep 2015)
Dear Friends and Colleagues,
I am writing to update you on what we in computing have been doing and why, what is left to
be done, and to invite you to share your thoughts on the future of computing at SEAS. You can
find more details below, but in summary:
Over the past two years, we have worked to refocus computing at SEAS to be able to
provide the exceptional service that a world-class engineering and applied sciences
school needs. We have:
● significantly increased the compute power available to all faculty while reducing costs and managing resources
● reduced our operational footprint to one-quarter of what it was two years ago
The many recent changes coming to SEAS and Harvard have inevitably added to the burden of
changes the community has experienced. I understand that this has not been an easy period
of transition, but the period of significant changes is almost over and should be by the end
of the Fall Semester.
… -Gabriele
Assistant Dean for Computing & Chief Information Officer
Harvard John A. Paulson School of Engineering and Applied Sciences
Local IT assets that were changed
● seas.harvard.edu emails
● seas.harvard.edu web sites
● GEOS-Chem and Jacob-group email lists
● GEOS-Chem and Jacob-group websites
● GEOS-Chem wiki
● GEOS-Chem Git repositories
seas.harvard.edu emails
Old System: SEAS-hosted Microsoft Exchange Server
New System: Cloud-based Microsoft Office 365 server
Switchover: May 2015 through September 2015
Affected: Everyone with a @seas.harvard.edu email address
Issues
● People were migrated to a temporary email server in May and then back again in August and September.
● The transition was not as smooth as hoped for some.
● Everyone was receiving an excessive amount of spam for about 2-3 months until HUIT finalized the spam filters.
seas.harvard.edu web sites
Old System: people.seas.harvard.edu/~username
New System: Drupal OpenScholar
Switchover: Summer 2015
Affected: SEAS faculty, staff, students
Issues
● You used to be able to upload HTML files (created with Dreamweaver etc.) to the people.seas.harvard.edu site.
● The people.seas.harvard.edu sites were discontinued (but existing users were grandfathered in).
● You cannot edit HTML code directly with OpenScholar.
● OpenScholar's “look-and-feel” is much more limited than what you could achieve with hand-written HTML + CSS.
● OpenScholar has a 2 GB limit (you can't upload a lot of documents).
GC and Jacob-group email lists
Old System: SEAS-hosted “Mailman” list servers
New System: Google Groups (hosted on g.harvard.edu)
Switchover: August 2015
Affected: All users of the GEOS-Chem and Jacob-group email lists
Issues
● [email protected][email protected]
● We were promised a smooth transition, BUT the initial migration was done incorrectly.
● Many addresses were omitted; people complained.
● GCST tried to manually add addresses but found that a 100-address/day limit was put in place. (UGH!!!)
● The migration had to be done a second time, which was successful.
GC and Jacob-group web sites
Old System: Local Atmospheric Sciences web server
New System: Amazon Web Services (cloud-based) web server
Switchover: Summer 2015
Affected: GC and Jacob-group website users
Issues
● None really to speak of; the transition was very smooth!
● Judit is currently looking into replacing the website Git servers with another type of content management system (stay tuned).
GEOS-Chem wiki
Old System: SEAS-hosted MediaWiki
New System: Amazon Web Services (cloud-hosted) MediaWiki
Switchover: June 2015 (after IGC7)
Affected: GC users who rely on the wiki (i.e. everyone)
Issues
● The machine where the GC wiki lived @ SEAS was retired.
● Migration to AWS was very smooth. (Thanks, Judit!)
● The MediaWiki version was updated from v1.19 to v1.24.2.
● All pages were preserved during the transition.
● Very few changes are apparent to users, except for look & feel.
GEOS-Chem Git repositories
Old System: git.as.harvard.edu
New System: bitbucket.org/gcst
Switchover: Summer 2015
Affected: All GEOS-Chem users
Issues
● git.as.harvard.edu can only be updated from the AS server. It is read-only from Odyssey.
● The git.as.harvard.edu server has a 15-minute delay before updates are made visible to the outside world.
● The need to sync code between AS and Odyssey prompted us to migrate the Git repos to bitbucket.org.
● We also obtained an academic license for bitbucket.org, so that we can have an unlimited number of developers for free.
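For users whose existing clones still point at git.as.harvard.edu, switching a clone over to the Bitbucket repositories is a one-line change; here is a minimal sketch (the repository name below is a placeholder, so check bitbucket.org/gcst for the actual repo names):

  # Show where the clone currently fetches from (likely git.as.harvard.edu)
  git remote -v

  # Point the "origin" remote at the Bitbucket repository instead
  # (the repository name "geos-chem" is a placeholder)
  git remote set-url origin https://bitbucket.org/gcst/geos-chem.git

  # Fetch to confirm that the new remote works
  git fetch origin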
Migration to Odyssey
● Homer's Odyssey is a story of a warrior who took 10 years to get home.
● Sometimes it felt like it would take that long to get up and running on Odyssey.
● We faced a few issues along the way (which we'll hear about shortly).
● But first, I'll give a brief introduction to Odyssey.
Slide from Introduction to Odyssey & RC Services by Robert Freeman and Plamen Krastev, FAS RC
[Diagram of the Odyssey layout: login nodes, network scratch, home & lab disks, and compute nodes. The jacob partition runs on the holyseas01-04 nodes (4 nodes x 64 CPUs each).]
Storage locations:
● /n/home*/YOUR_USER_NAME
● /n/seasasfs02/YOUR_USER_NAME (long-term storage)
● /n/regal/jacob_lab/YOUR_USER_NAME (temp. storage)
Slide from Introduction to Odyssey & RC Services by Robert Freeman and Plamen Krastev, FAS RC
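As a rough, hypothetical illustration of how these three locations differ in practice (paths follow the list above; the commands simply report how much space your files occupy in each):

  # Home directory (i.e. /n/home*/YOUR_USER_NAME)
  du -sh ~

  # Long-term lab storage
  du -sh /n/seasasfs02/$USER

  # Temporary scratch space; since it is temporary, don't keep the only
  # copy of important output here
  du -sh /n/regal/jacob_lab/$USER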
You can request interactive (I) or batch (B) jobs to run in these queues via the SLURM resource manager.

NOTE: While there are issues with the jacob partition, you can try submitting to serial_requeue, especially if you are using 8 or fewer CPUs and a moderate amount of memory.

The general queue prioritizes high-memory jobs; low-memory jobs will pend for days or weeks (until RC fixes this!).

jacob partition (ACMG only; batch or interactive):
● 18 hours default, 36 hours maximum wall time
● 256 CPUs (64 CPUs/node)
● 288 GB/node (that's 4.5 GB/CPU)
Slide from Introduction to Odyssey & RC Services by Robert Freeman and Plamen Krastev, FAS RC
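For instance, an interactive session on the jacob partition might be requested roughly as follows (a sketch; the CPU, memory, and time values are illustrative placeholders, not recommendations):

  # Ask SLURM for an interactive shell on the jacob partition:
  # 1 node, 8 CPUs for OpenMP threads, 4 GB per CPU, 4 hours of wall time
  srun -p jacob -N 1 -c 8 --mem-per-cpu=4000 -t 04:00:00 --pty bash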
Simple Linux Utility for Resource Management (SLURM)
● You ask SLURM for the following resources (a minimal batch-script sketch follows this list):
  – Amount of memory for the job
  – Amount of time for the job
  – Number of CPUs and nodes for the job
  – Interactive session (srun) or queued session (sbatch)
  – Run queue (aka “Partition”)
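Here is that sketch: a hypothetical sbatch script for a GEOS-Chem “classic” run on Odyssey. The partition, CPU count, memory, time limit, and run-directory path are all placeholders to adapt (see the AS wiki SLURM page for the group's actual settings), and the executable name may also differ.

  #!/bin/bash
  #SBATCH -p jacob                 # partition (could also try serial_requeue)
  #SBATCH -N 1                     # one node (GEOS-Chem classic is shared-memory OpenMP)
  #SBATCH -c 8                     # number of CPUs for OpenMP threads
  #SBATCH --mem-per-cpu=4000       # memory per CPU in MB (placeholder value)
  #SBATCH -t 18:00:00              # wall-time limit (placeholder value)
  #SBATCH -J geoschem              # job name (placeholder)

  # Match the number of OpenMP threads to the CPUs requested above
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

  # Run the model from its run directory (path and executable are placeholders)
  cd /n/regal/jacob_lab/$USER/geosfp_4x5_standard
  ./geos > log 2>&1

Save this as, e.g., run_geoschem.sh and submit it with "sbatch run_geoschem.sh"; an equivalent interactive session would use srun instead, as shown earlier.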
Who came up with a name like SLURM?
SLURM was the cola featured on the animated series “Futurama”. It was what people drank in the year 3000 (a parody of Coke and Pepsi).
The gag was that SLURM was highly addictive, so people couldn't stop drinking it!
“Slurms Mackenzie”, the party slug, was the mascot for SLURM on the show (pictured at left).
Cc: [email protected] (i.e. Judit) and [email protected]
GCST has also created local documentation on the AS wiki pages about how to log into Odyssey and run jobs!
Slide from Introduction to Odyssey & RC Services by Robert Freeman and Plamen Krastev, FAS RC
To get to the AS wiki, click on this link!
We have written documentation on how to run jobs on Odyssey on the AS wiki!
Direct Link: http://wiki.as.harvard.edu/wiki/doku.php
Direct Link: http://wiki.as.harvard.edu/wiki/doku.php/wiki:basics
Direct Link: http://wiki.as.harvard.edu/wiki/doku.php/wiki:as:startup_scripts
Direct Link: http://wiki.as.harvard.edu/wiki/doku.php/wiki:as:slurm
Bumps in the road ...
● During the migration process, several unforeseen issues came up that demanded our attention.
● Case in point, GCHP:
  – Mike, Matt, and Jiawei (Jintai Lin's student) were early users of GCHP on Odyssey.
  – But they immediately ran into a couple of serious roadblocks...
Mike Long wrote to FAS RC (June 2015):
I failed to notice that you only have TWO Intel licenses. This WILL BE A HUGE PROBLEM for us. Please let me know what we have to do to remedy this soon.
Thanks.
ML
And the response that Mike got from RC was:
1. Use a free compiler like gcc/gfortran.
2. Don't compile as often; in reality the contention for the Intel compiler isn't that high. When I build software I only stall out once a month maybe, and we don't get many complaints. So unless you are building constantly or need to recompile on the fly it may not be a frequent issue.
3. Buy more licenses.
As it stands we do not plan to purchase more licenses. However, if you want to make a case for it I can pass you along to … our operations manager. However, he may ask you to foot the bill for the additional licenses.
Bumps in the road ...
● This situation was made worse by users abusing the system:
Mike Long wrote again to FAS RC (1st week of June 2015):
A quick follow-up. Unfortunately the situation is disruptive. A user...appears to have an automatically running script that loads an Intel-based OpenMPI, compiles and runs a program. It is completely monopolizing the Intel licenses, leaving me absolutely dead in the water. Unfortunately, we rely upon the Intel system for our work.
Is there a way to ask Mr. ____ to amend his procedure?
● We had to address this ASAP.
Bumps in the road ...
● We solved this by porting our existing Intel Fortran Compiler (v11) licenses to Odyssey.
● We also brought over our existing netCDF/HDF5 libraries that were compiled with Intel Fortran Compiler v11.
● These are sufficient for most Jacob-group users, who are working with GEOS-Chem “classic”.
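As a rough sketch of what a user's build environment might look like under this arrangement (the module names and library path below are hypothetical placeholders; see the AS wiki startup-script page for the actual settings used by the group):

  # Load the group-ported Intel Fortran v11 compiler and matching
  # netCDF/HDF5 libraries (module names are placeholders)
  module load intel/11.1
  module load netcdf/4.1.1-intel11

  # Tell the build which compiler and library installation to use
  # (variable names are illustrative, not the group's exact convention)
  export FC=ifort
  export NETCDF_HOME=/path/to/netcdf-intel11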
Jobs running slower on Odyssey
Lu Hu wrote (11/5/2015 1:58 PM)
Here are some examples (see below). I got a set of runs, same settings but for different years. The runtime to finish them varies from normally 20 hours, to 32 hours, and many times >36 hours.
All of these jobs were running on /n/regal
~29.5h
===> SIMULATION START TIME: 2015/08/25 20:19 <===
===> SIMULATION END TIME: 2015/08/27 02:53 <===
~20h
===> SIMULATION START TIME: 2015/08/26 22:22 <===
===> SIMULATION END TIME: 2015/08/27 19:22 <===
~19.5h
===> SIMULATION START TIME: 2015/09/12 06:23 <===
===> SIMULATION END TIME: 2015/09/13 01:07 <===
~32.5h
===> SIMULATION START TIME: 2015/09/17 18:05 <===
===> SIMULATION END TIME: 2015/09/19 02:44 <===
Katie Travis wrote (11/3/2015 5:16 PM)
Judit, my job 50950707 has run 10 days in 6 hours, which is about 2/3 slower than it should be.
Thanks,
Katie
Rachel Silvern wrote (11/3/2015 5:22 PM)
I have been running into issues trying to run a one-month 4x5 v9-02 (SEAC4RS) simulation this afternoon. My run keeps dying at stratospheric chemistry during the first timestep [with a segfault].
I have a couple of job numbers: 50974904, 50972514, 50967636. I'm not sure if this is related to issues on the partition or I have my own mysterious issues.
Jobs running slower on Odyssey
Melissa Sulprizio wrote (Thu 11/5/2015 4:05 PM)
Here are the job stats for our recent 1-month benchmark simulations. These ran on AS (v11-01a, v11-01b), the previous holy2a110xx jacob nodes on Odyssey (v11-01c), and the current holyseas[01-04] jacob nodes on Odyssey (v11-01d).
I highlighted the elapsed time for the last two jobs because those runs hit the time limit that I had set. Those jobs should not have timed out, because all of these jobs are for 1-month GEOS-Chem simulations and the previous run times were all well below 12 hours. In fact, the last run (50518629) only got to day 16 out of 31 in the allotted 24 hours.
It’s important to note that some of the differences in run time and memory may have been affected by updates that we added to the model.
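For reference, elapsed-time and memory statistics like these can be pulled from SLURM's accounting database once a job has finished; a minimal sketch, using one of the job IDs quoted above:

  # Report elapsed time, peak memory, and final state for a completed job
  sacct -j 50518629 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State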
Jobs running slower on Odyssey
Melissa Sulprizio wrote again (11/6/2015 8:36 AM)
Finally, it's interesting to note that run times on AS were *MUCH* faster. We see typical run times of ~1 hour on AS for these 7-day GEOS-Chem simulations, as compared to 2-4 hours on Odyssey. I think this confirms our suspicion that the Odyssey nodes are not running faster than AS, like we were promised.
Jobs running slower on Odyssey
[Bar chart, courtesy of Matt Yannetti: run times in minutes (y-axis, 0-300) for GC w/ GEOS-FP and GC w/ GCAP2, broken down into Initialization, Timesteps, Chemistry, Transport, and Convection.]
Matt writes: Over on the left (labeled “Initialization”), we have all of the I/O that’s done, and as you can see, there’s not much variation in it. There are some small spikes for the blue bit, but the red is almost all the same. The big kicker we have here is the “Timesteps” bars, which is effectively all the computation the code is doing. Notice how on the red there’s a variance of over 60 minutes between some runs, and for the blue a variation of almost 50, while the I/O part has nowhere near that level of variation. The other three columns are a breakdown of the different types of science the code does, and the runtimes of each.
Jobs running slower on Odyssey
● GCST felt that RC had not given us a satisfactory explanation of why GC was so much slower.
● DJJ and GCST brought these issues to FAS RC (via high-level meetings). RC was responsive.
  – Scott Yockel (RC) is working with us to diagnose the problem.
  – Scott is helping us run tests.
RC has been very responsive
GEOS-CHEM Group,
Thank you all for meeting today. Sometimes it helps a lot on both ends to meet face-to-face. Please continue to send in tickets anytime you have issues with the cluster. Please address those tickets to Dan Caunt and myself, as we will make sure to fully answer your questions/concerns in a timely manner. Also, please spread this to the rest of the Jacobs group that were not at the meeting today. At any point that you are dealing with someone else in RC via our ticketing system, chat, or Office Hours and you are having issues dealing with that individual, please email me directly. And, if you have issues with how I’m handling something, please email James Cuff. I’m here to make sure that your research throughput is optimal and that RC resources are attractable.
I’ve already closed off serial_requeue jobs from the jacobs nodes. After job 51173945 is finished, I’ll reboot holyseas01 with the lower processor count (essentially hyperthreading disabled). Then I’ll run the test provided below on that host to start getting some baseline timings. Lizzie, if you can also provide me with the runtimes for that same job on the AS cluster too, that would be helpful for comparison.
Working hypothesis
● Differences in performance may be due to the different architecture of our holyseas01-04 nodes
  – Holyseas nodes use AMD CPUs (model: “Opteron”)
  – The AS cluster uses Intel CPUs (model: “Nehalem”)
● “Hyperthreading” (each CPU acting as 2)
  – May also degrade performance when nodes are busy
  – Recall that other users' jobs are “backfilling” onto holyseas01-04 CPUs to maximize usage
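A quick way to confirm these hardware details yourself, using standard SLURM and Linux commands (output fields vary from site to site):

  # Ask SLURM what it knows about one of the jacob nodes (CPUs, memory, features)
  scontrol show node holyseas01

  # From a shell on the node itself, report the CPU model and whether
  # more than one hardware thread per core (hyperthreading) is enabled
  lscpu | grep -E 'Model name|Thread\(s\) per core'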
We did some testing ...
● Scott was able to (temporarily) eliminate sources of variability on the holyseas nodes
  – Turned off “hyperthreading”
  – Turned off “backfilling”
● Simulations done:
  – 1-year Rn-Pb-Be simulations (Melissa)
  – 12-hour “benchmark” simulation (Lizzie, Scott)
Results of our tests
● Melissa ran a set of 1-yr Rn-Pb-Be simulations
  – On both AS and Odyssey
● But, run times are still slower than on AS
  – By up to a factor of 2
Results of our tests
● Scott ran the 12-hr “benchmark” test on Odyssey
  – Using varying options, runs took ~10-12 minutes
Results of our tests
● Lizzie ran the 12-hr “benchmark” tests on Odyssey
  – Consistent run times obtained on Odyssey
  – But runs take 2X as long on Odyssey as on AS
Results of our tests
Scott Yockel wrote (Fri 11/13, 10:23 AM)
In all of this testing, I see only about 1-2 min variability in the run times (on Odyssey), even when I run 4 jobs concurrently. I even tried running on a different host (holyhbs03) that doesn't have the memory cap on it to see if more memory would make it run faster and do less swapping. I didn't find that to be the case either.
After watching many of these runs, there is a substantial amount of time when the CPUs are not at 100%. This means that the CPUs are waiting for instructions; this may be due to the times that it is fetching data from files or something else. I'm guessing that some compilation options may make some difference here. The Intel 11 compiler came out in 2008 with an update in 2009 (11.1). These AMD Opteron 6376 chips were launched in late 2012, so there are some chipset features that can be used that Intel 11.1 doesn't even know about. For example, the floating-point unit on these chips is 256 bits wide, which means it could do 4 x 64-bit operations simultaneously. However, if the compiled code isn’t aware of that feature, then it will not feed it 4 instructions per clock cycle. So effectively, instead of 2.3 GHz you are cutting that down by half or even by a quarter.
But ... switching to a newer compiler version doesn't make that much difference.
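For context, these are the kinds of options that control whether the compiler emits code for newer CPU features such as 256-bit vector instructions; a hypothetical sketch (the flags and file name are illustrative, not the group's actual build settings):

  # Intel Fortran: -xHost targets the instruction set of the machine doing the
  # build; on AMD chips the generated code may fall back to a slower generic path
  ifort -O2 -xHost -openmp mycode.f90 -o geos

  # GNU Fortran: -march=native enables whatever instruction sets (e.g. AVX)
  # the local CPU actually supports
  gfortran -O3 -march=native -fopenmp mycode.f90 -o geos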
Summary of our findings
●
Testing is still continuing
–
Scott was away this week (back next week).
–
We may need to try to find the right combination of
SLURM options.
–
But, Intel Fortran may not work as well on AMD chips
●
–
This may be an Intel marketing decision designed to crush
AMD. This is a longstanding AMD complaint.
Worst case: RC has offered to replace AMD nodes w/
Intel nodes, if our testing suggests that is necessary.
Article from 2005. Not sure if the situation has improved since then.
http://techreport.com/news/8547/does-intel-compiler-cripple-amd-performance
GCHP test by Seb Eastham shows that input from /n/regal is the fastest
[Chart 1: GCHP @ 4x5, 1-day run w/ 6 CPUs. Disk I/O is the most expensive operation.]
[Chart 2: Average of 10 1-day runs, GCHP @ 4x5 w/ 6 CPUs.]
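Given that result, a reasonable workflow is to stage a run onto /n/regal before submitting it; a minimal sketch (directory and script names are placeholders):

  # Copy the run directory and its read-intensive input onto regal scratch,
  # which gave the fastest input performance in Seb's tests
  rsync -av /n/seasasfs02/$USER/gchp_4x5_run/ /n/regal/jacob_lab/$USER/gchp_4x5_run/

  # Submit the job from the regal copy of the run directory
  cd /n/regal/jacob_lab/$USER/gchp_4x5_run
  sbatch run_gchp.sh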
Questions and Discussion