iaas Documentation Release 0.1.0 NorCAMS November 14, 2014
Contents

1 Getting started
2 Installation
3 Design
  3.1 Development hardware (draft)
4 Development
5 Howtos
  5.1 Build docs locally using Sphinx
6 About the project
  6.1 What is NorCAMS anyways?
  6.2 People
  6.3 Tracking the project
  6.4 Meetings
  6.5 Project plan and description

This is our current documentation.

CHAPTER 1 Getting started

CHAPTER 2 Installation

Documentation describing installation of the IaaS platform

CHAPTER 3 Design

High-level documents describing the IaaS platform design

3.1 Development hardware (draft)

The project will deliver a geo-distributed IaaS service across (at least) three locations. A key point is that each location is built from the same hardware specification. This keeps things simple and limits the influence of external variables as much as possible while building the base platform. The spec represents a minimal baseline for one site/location.
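As a rough aid for reading the draft spec in 3.1.1 and 3.1.2 below, the per-site baseline can be written down as data and totalled. This is only a sketch: the class and field names are made up for illustration, "1x12 core" and "2x12 core" are assumed to mean sockets times cores per socket, and only the 3.5" SATA drives of the storage nodes are counted as raw capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """One server class from the draft spec (field names are illustrative)."""
    count: int             # nodes of this type per site
    sockets: int           # assumed reading of "1x12" / "2x12": sockets x cores
    cores_per_socket: int
    ram_gb: int
    hdd_count: int = 0     # 3.5" SATA data drives per node
    hdd_tb: int = 0        # size of each data drive in TB


# Figures copied from sections 3.1.2 (management, compute, storage nodes)
SITE_SPEC = {
    "management": NodeType(3, 1, 12, 128),
    "compute": NodeType(3, 2, 12, 512),
    "storage": NodeType(5, 1, 12, 128, hdd_count=8, hdd_tb=2),
}
SITES = 3  # the plan says "at least three locations"; three is the minimum


def site_totals(spec):
    """Return compute cores, compute RAM and raw HDD capacity for one site."""
    compute = spec["compute"]
    storage = spec["storage"]
    return {
        "compute_cores": compute.count * compute.sockets * compute.cores_per_socket,
        "compute_ram_gb": compute.count * compute.ram_gb,
        "raw_hdd_tb": storage.count * storage.hdd_count * storage.hdd_tb,
    }


def platform_totals(spec, sites=SITES):
    """Scale the per-site totals to the whole platform (identical sites)."""
    return {key: value * sites for key, value in site_totals(spec).items()}


if __name__ == "__main__":
    print(site_totals(SITE_SPEC))
    print(platform_totals(SITE_SPEC))
```

Under these assumptions one site comes to 72 compute cores, 1536 GB of compute RAM and 80 TB of raw SATA capacity. The storage figure is raw capacity before any Ceph replication, so usable storage will be lower.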
3.1.1 Networking

4x L3 switches
• Will be connected as a routed leaf-spine fabric (OSPF)
• Each with at least 48 ports 10 Gb SFP+ / 4 ports 40 Gb QSFP
• Switches that support ONIE/OCP preferred

1x L2 management switch
• 48 ports 1 GbE, VLAN
• Remote management possible

48x 10GBase-SR SFP+ transceivers
8x 40GBase-SR4 QSFP+ transceivers

3.1.2 Servers

3x management nodes
• 1U, 1x12 core with 128 GB RAM
• 2x SFP+ 10 Gb and 2x 1 GbE
• 2x SSD drives, RAID1
• Room for more disks
• Redundant PSUs

3x compute nodes
• 1U, 2x12 core with 512 GB RAM
• 2x SFP+ 10 Gb and 2x 1 GbE
• 2x SSD drives, RAID1
• Room for more disks
• Redundant PSUs

5x storage nodes
• 2U, 1x12 core with 128 GB RAM
• 2x SFP+ 10 Gb and 2x 1 GbE
• 8x 3.5” 2 TB SATA drives
• 4x 120 GB SSD drives
• No RAID, only JBOD
• Room for more disks (12x 3.5” ?)
• Redundant PSUs

Comments
• Management and compute nodes could very well be the same chassis with different specs. Higher density, like half-width nodes, could also be considered, but not blade chassis (that would mean non-standard cabling/connectivity).
• The key attribute for the SSD drives is sequential write performance. SSDs might be PCIe connected.
• 2 TB disks are chosen for the storage nodes to speed up recovery times with Ceph.

CHAPTER 4 Development

CHAPTER 5 Howtos

This is a collection of howtos and documentation bits with relevance to the project.
5.1 Build docs locally using Sphinx

This describes how to build the documentation from norcams/iaas locally.

5.1.1 RHEL, CentOS, Fedora

You’ll need the python-virtualenvwrapper package from EPEL.

    sudo yum -y install python-virtualenvwrapper
    # Restart shell
    exit

    # Make a virtual Python environment
    # This env is placed in .virtualenvs in $HOME
    mkvirtualenv docs
    # Activate the docs virtualenv
    workon docs
    # Install sphinx into it
    pip install sphinx sphinx_rtd_theme
    # Compile docs
    cd iaas/docs
    make html
    # Open in modern internet browser of choice
    xdg-open _build/html/index.html
    # Deactivate the virtualenv
    deactivate

CHAPTER 6 About the project

norcams/iaas is an open source effort focused on automating, documenting and delivering all parts of a complete, OpenStack-based, production-quality infrastructure. This repository is our project handbook.

Infrastructure as code and ‘automation first’ are the main technical driving forces, along with a general need for faster and more efficient delivery of standardized, self-provisioned services among IT departments in the Norwegian academic sector.

Development is funded by the participating entities, who contribute employees and knowledge to a nationally distributed team of engineers. Project goals are set, changed and validated within a formal project organization where management from all contributing entities is represented. This project organization is named UH-sky and is coordinated by UNINETT, the Norwegian NREN organization.

6.1 What is NorCAMS anyways?

It is nothing more than a name label, really. Or a lot more, if you choose for it to be. Pretty confusing, right?

To provide some background: when the universities started to collaborate, it became apparent that we needed a name of some sort to identify us and what we were trying to do. Key words were technology, collaboration and learning to continuously improve.
To have something to go by, the NorCAMS name was invented and presented at a meetup in Tromsø in early 2014. It is a play on words created from the words Norwegian (or it could be Nordic?) and CAMS. CAMS is an acronym (we all love them, right?) coined in 2010 by Damon Edwards and John Willis at the first US based Devopsdays. It stands for Culture, Automation, Measurement and Sharing and has become a mantra for the devops community and concept.

NorCAMS is used as an identifier of the open source and collaboration aspect of the formal UH-sky IaaS project. It is useful in several ways, possibly mostly as a marker to show our ambition to be truly open. By not using a more official name we hope not to scare anyone off, and maybe even to attract contributors.

Some further references around CAMS and Devops for those interested
• John Willis, July 16, 2010: What Devops Means to Me (explaining CAMS)
• Patrick Debois started Devopsdays in 2009 with Devopsdays Ghent
• James Turnbull, Feb 2010: What DevOps means to me...

6.2 People

6.3 Tracking the project

Contents
• Tracking the project
  – Chat room
  – Tasks and progress reporting
  – Core team weekly schedule
    * Daily status meeting
    * Weekly planning meeting
  – Project calendar
  – Social sharing platform

6.3.1 Chat room

All members of the project are expected to join and follow our chat room while working. The chat room is used for socializing, status updates, informal quick questions and coordinating various group efforts. Commit messages from the most important git repositories we use are announced in the chat room automatically.

To start using the chat room, connect to an IRC server on the Freenode network and join the #uh-sky room. Remember, everything in the room is logged on the public internet at https://botbot.me/freenode/uh-sky/

6.3.2 Tasks and progress reporting

The project uses a Trello board for tasks and project planning. Core members are expected to add and manage cards directly on the board.
Tasks described on the cards should not be too complicated to solve; ideally we want cards to flow through the board each day. If we do this correctly we get a low-cost, low-friction way of reporting progress and status. Divide and conquer seems like a good idea to try for this. If a card stays in the same column for a day, divide it and try to get smaller parts of it to Done!

The Goals column is a bit special. This is where we put larger goals and milestones broken out from the project plan. Goals move directly to the Done column once they are reached.

The board is public and available at https://trello.com/b/m7tD31zU/iaas

To be able to comment on a card you’ll need a Trello account. Most of the team members use a Google account as their login identity.

6.3.3 Core team weekly schedule

This table shows which days the core team members are available (1 = full day, 0.5 = half day, 0 = not available). Jan Ivar, Tor and Erlend are working full time.

Name         Monday  Tuesday  Wednesday  Thursday  Friday
Erlend       1       1        1          1         1
Hans-Henry   1       1        1          0         0
Hege         0       0        1          1         0
Jan Ivar     1       1        1          1         1
Marte        0       0        0.5        1         1
Mikael       0       1        1          1         0
Tor          1       1        1          1         1

Daily status meeting

The core team has daily meetings at 09:30 every work day. These are short meetings meant to summarize what has been worked on since yesterday, what will be done today and what blocks progress, if anything. Each team member is expected to speak briefly about their own situation.

Daily meetings are held on Google Hangouts and published to the project calendar. They are also announced in the chat room a few minutes before they start.

Weekly planning meeting

The weekly planning meeting is where we discuss direction, milestones and general progress. This is the place for any larger topics or issues involving the full team. To schedule a topic for this meeting, project members make a card in Trello and label it as Discussion.

The weekly planning meeting is held on Google Hangouts and published to the project calendar.
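When picking days for meetings it can help to see how much of the core team is around on each weekday. The sketch below simply sums the availability table from 6.3.3; the variable and function names are invented for illustration, and the figures are copied from the table as published.

```python
DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Availability per person, in the same column order as DAYS
# (1 = full day, 0.5 = half day, 0 = not available).
AVAILABILITY = {
    "Erlend":     (1, 1, 1, 1, 1),
    "Hans-Henry": (1, 1, 1, 0, 0),
    "Hege":       (0, 0, 1, 1, 0),
    "Jan Ivar":   (1, 1, 1, 1, 1),
    "Marte":      (0, 0, 0.5, 1, 1),
    "Mikael":     (0, 1, 1, 1, 0),
    "Tor":        (1, 1, 1, 1, 1),
}


def per_day(availability):
    """Sum the available person-days for each weekday."""
    return {
        day: sum(row[index] for row in availability.values())
        for index, day in enumerate(DAYS)
    }


def per_person(availability):
    """Sum the available days per week for each team member."""
    return {name: sum(row) for name, row in availability.items()}


if __name__ == "__main__":
    for day, total in per_day(AVAILABILITY).items():
        print(f"{day:<10} {total}")
```

With the table as it stands, Wednesday and Thursday have the most people available, which fits with the planning meeting being held on Thursdays.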
6.3.4 Project calendar

Meetings and events are published to a public Google calendar. It is possible to read it as a webpage or subscribe to it in iCal format. Right now you’ll need to use the webpage interface to find the Google Hangouts video links for each event. There is a plan to update the event description field in the iCal data with the Hangout URL by using this Python code, but it has not been done yet.

6.3.5 Social sharing platform

We have been using a NorCAMS Google Plus community to share links to project-related information for a while. Anyone with relevant content is free to use this as a channel. We put up a web redirect to the community page to make it easier to find; it is at http://plus.norcams.org

6.4 Meetings

6.4.1 Planning meeting

Every Thursday at 13:00 we have a planning meeting. This is the main arena for discussing the project, the choices we are making and planning ahead.

2014-10-16: Planning meeting

Contents
• 2014-10-16: Planning meeting
  – 1. Defining an MVP
  – 2. Ceph training
  – 3. OpenStack Summit
  – 4. Meetings with potential partners

Present: Jan Ivar, Tor, Erlend, Hege, Hans-Henry, Mikael

1. Defining an MVP

Jan Ivar brought up some ideas around MVP as a concept. We discussed defining an MVP and what it means. We got as far as defining a few general limitations and some actions around networking.

Outcome

The current list of characteristics defining our MVP
• Networking
  – 2x 10 Gb fiber (SR) to the core network at each location
  – A single /24 public IPv4 subnet per site will be allocated for the infrastructure
• Feature limitations
  – No redundancy for the OpenStack services
  – No central authN/Z
  – No persistence (no booting from Cinder) for instances, we’ll use local disk on the compute nodes

Actions
• Write a short introduction about the plans for connectivity to the core network at each site. We will share this with the local organizations so that they are aware of the plans. (Jan Ivar, Hege)

2. Ceph training

There’s a need for basic Ceph training as part of the project. At least one person from each organization should take it.

Outcome

Using the online course is fine, it has a curriculum and dates posted at http://www.inktank.com/university/ceph130/. Pricing is not too bad, about $1000 for two days.

Jan Ivar suggests that Tor, Marte, Erlend and Mikael do the training; they’ll need to check if they are allowed and report back. Tor is ready and might even do it as early as next week. For the rest of the group we are looking at November 19-20.

3. OpenStack Summit

We discussed a little around which sessions we want to focus on. It might make sense to not have everybody go to the same sessions :-) Jan Ivar reported he will focus on networking (IPv6!) and OpenStack maintenance and release engineering (packaging).

There’s an OpenStack Summit app for Android on Google Play.

4. Meetings with potential partners

The project is in a process where we are meeting with potential partners. Jan Ivar reported on the current status of that and it looks like we’re headed in a good direction. Interest in our project among vendors and external parties is great.

Jan Ivar shared a presentation (mostly in Norwegian) he gave at one of the meetings.

2014-10-23: Planning meeting

Contents
• 2014-10-23: Planning meeting
  – 1. MVP and networking
  – 2. Illustrate and document cabling of the equipment
  – 3. Keeping people busy the next weeks
  – 4. Better ability to draw sketches during a meeting

Present: Jan Ivar, Mikael, Tor

1. MVP and networking

Outcome
• Jump hosts ideally need to provide redundancy for access across locations.
• Ideally we’d want access across the infrastructure for both in-band and out-of-band management (switches and servers).
• It was (again) suggested that we use private IP addressing for at least the out-of-band management, maybe also the in-band management.
• Discussion is not finalized

Actions
• Identify IP segments (/24 with possibility for /23)
• Write a spec of how we want the public (service) IP segment routed across two fibers (static routes, how does redundancy/failover work with this?)

2. Illustrate and document cabling of the equipment

We need to make a cabling illustration to show our current plan for connectivity.

Actions
• Jan Ivar, Tor and Raymond will make a draft next week. Hege will need to verify and ask questions :-)

3. Keeping people busy the next weeks

Jan Ivar and Tor are at Devopsdays Ghent next week, and a lot of people in and around the project will go to Paris for the OpenStack Summit the week after. We should write up some tasks and TODOs for winch to keep people busy when they feel ready to move on with Puppet.

Actions
• Jan Ivar adds more open cards to Planned in Trello, focusing on winch and Puppet

4. Better ability to draw sketches during a meeting

When discussing, we need a better ability to draw and sketch things during meetings.

Actions
• We tested a few solutions and https://awwapp.com seems like it might fit our needs. We’ll try using it over the next few meetings.

2014-11-13: Planning meeting

Contents
• 2014-11-13: Planning meeting
  – 1. Further definition of MVP
  – 2. Puppet versus hostnames, certificates, global identifiers across sites
  – 3. Project codebase versus winch?
  – 4. Placement of shared OS images for cloud, testing

Present: Jan Ivar, Tor, Trond, Mikael, Hans-Henry, Erlend

1. Further definition of MVP

Hardware placement in racks, networks, jump hosts. We need to discuss these in more detail.

Outcome

Jan Ivar has described (in a drawing) how to populate the rack at each site with the Dell hardware:

[Figure: rack layout drawing]

The routers and management switches have reversed air flow for our convenience.
This will make cabling work easier.

We discussed in-band vs out-of-band management for the routers (compared to servers, that is). The management ports on the routers will be treated as in-band.

The first revision of the IaaS will be set up in the simplest possible manner with regard to physical failover for the infrastructure hosts, but will be fully cabled up to allow for future improvement.

Actions
• Jan Ivar creates a detailed cabling chart for all the equipment. The equipment is to be set up identically at each site

2. Puppet versus hostnames, certificates, global identifiers across sites

Discussion around DNS, public vs internal IP addresses, naming standards and auto-signing of Puppet certificates. This meeting was not ready to make any decisions.

Actions
• The next meeting will feature some written material on the topics, to allow better preparation and facilitate a good debate.

3. Project codebase versus winch?

Winch will continue to be a side project for testing, training and code development. We need to start working on a new production codebase. It’s not clear for now who can start contributing and when.

Actions
• Jan Ivar will create a skeleton for a new production codebase
• We will have a kickoff day for the new production codebase
• We’ll start iterating together on the codebase and growing it based on feature requests

4. Placement of shared OS images for cloud, testing

We need a web server where we can place images produced by the project.

Actions
• Short term solution for now - we all have arbitrary web servers we can use
• We will create a fork of the bento project and integrate our own code

6.4.2 Technical steering group

Meetings in the technical steering group are called as needed. This group has a mandate granted by the UH-sky leadership to govern the day-to-day coordination and management needs of the project.

The notes below are translated from the original Norwegian.

2014-11-10: Technical steering group

Contents
• 2014-11-10: Technical steering group
  – 1. The group’s mandate
  – 2. Meeting format and frequency
  – 3. Reporting from the project
  – 4. Project plan and status
  – 5. Partnerships with vendors
  – 6. Ongoing activities
  – 7. Questions

Present: Kjetil Otter Olsen, Raymond Kristiansen, Ola Ervik, Per Markussen, Jan Ivar Beddari, Kristin Selvaag

1. The group’s mandate

• Is it sufficiently and correctly defined?
• Is anything missing?

Outcome

There is general support for the mandate. The question was raised whether it is natural for the group to report to the UH-sky steering group via the programme manager; the group does not see this as a problem. The programme manager is best positioned to keep an overview of UH-sky as a whole. There is also informal day-to-day communication with the management level.

Actions
• Jan Ivar publishes the group’s mandate together with the minutes from the first meeting

2. Meeting format and frequency

• Jan Ivar does not want to chair the meetings or write the minutes. It is important that the roles are kept sufficiently separate.
• Frequency of the meetings?
• E-mail as a tool? Anything else?

Outcome
• Kjetil chairs the meeting. At the next meeting the minutes will not be written by Jan Ivar.
• The project can call the steering group in to electronic meetings when necessary; for now we do not set a fixed meeting frequency. When a meeting is called, it must be held no later than the working week after the matter comes up.
• On Thursday January 22 there will be a physical meeting in Trondheim at UNINETT.
• We want to use the Agora platform for sharing access-restricted documents, to the extent this is necessary.
• The minutes from the technical steering group are published in the project handbook in the same place as the project’s weekly planning meetings.

Actions
• Jan Ivar and Kristin arrange access to Agora. We may be able to reuse the group that already exists for UH-sky.
• Jan Ivar publishes the minutes from this meeting at https://iaas.readthedocs.org (under About the project, Meetings).

3. Reporting from the project

• Frequency and form of reporting from the project to the technical steering group?
• Level of detail? What can we expect from the participants in the technical steering group in terms of time spent?

Outcome

The steering group wants regular insight into the project. The minutes from the weekly meetings are suitable for seeing what we are working on, but they look ahead and do not have much of a reporting focus. The meeting wants a monthly written report that summarizes status in relation to the goals in the project plan. This report is to be sent by e-mail to the technical steering group. The group comments on any omissions or ambiguities before the same report is also posted on the UH-sky web pages.

On the question of expected time spent we have no good answer and want to make the assessment case by case.

Actions
• Jan Ivar plans a monthly report for the first week of every month
• Jan Ivar writes a first report to be published on the UH-sky website

4. Project plan and status

Jan Ivar reports on ongoing activities and how the startup has gone, especially with regard to resources.

Outcome
• We check the status of the hardware delivery; it looks like it will take a few weeks before everything is delivered. Jan Ivar is nevertheless not worried about progress, since the learning and mastery process in the group is good. We are not held back by a lack of hardware until early December.
• We are aware of the difficulties with split positions that the project participants live with daily. Tor Lædre (UIB) and Erlend Midttun (NTNU) become key people in the project since they are assigned 100%. Tor has gradually had his working hours well shielded, and Ola Ervik reports that this is being taken seriously for Erlend too.
• Resources from UNINETT are to be assigned in the short term; there is a meeting this coming Thursday. The group would like feedback from the meeting as soon as the result is known. It is positive that the project appears to be getting the competence it needs.
• It is unlikely that UNINETT will assign a total of 100% at this point, as the others have done. The group, through Kjetil, will take this further.
• The project meeting on Thursday will continue working on the definition of the minimum viable product. The project works with a methodology that is close to agile software development. This is new to some, but understanding of the approach seems to be establishing itself well.

Actions
• Kjetil wants to raise the matter of the size of UNINETT’s assigned resources with the UH-sky steering group once the result is available. The meeting has no comments on this.
• Jan Ivar will follow up on the hardware delivery.

5. Partnerships with vendors

Reporting from meetings with Dell and Red Hat about partnership in the project.

Outcome
• Dell makes the expertise of its internal OpenStack expert group available to us. We get “24 hours every 90 days” of Paul Brook’s time dedicated to us. This is to be entered in the project calendar.
• The text of a press release from Dell is to be sent to the technical steering group. Dell is to propose a written partnership agreement.
• Some explanatory discussion around Red Hat and product versus open source. USIT’s customer relationship with Red Hat is somewhat different from that of the others.

Actions
• Jan Ivar sends the text from Dell to everyone for review.

6. Ongoing activities

Reporting from ongoing activities in the project.

Outcome
• Definition of the first minimum product - the focus is now networking and the physical cabling diagram
• Ceph training, a course? Several people want this.
• Puppet project in the norcams/winch learning environment

7. Questions

Any other business.

Outcome

No significant questions. The meeting was adjourned at 15:25.

Mandate of the technical steering group

Background

There is a need to make technical decisions along the way in the IaaS UH-sky project. The steering group will not be sufficiently operational to make technical decisions in the project and therefore appoints a technical steering group for this purpose.

Mandate

1. The group shall make technical decisions within the project’s mandate, goals, and financial and time constraints.
2. The technical steering group reports to the steering group via the programme manager.
3. All decisions that will lead to changes in the schedule are reported to the steering group with a justification and can be overruled by it. Decisions that change goals or financial constraints shall be presented to the steering group.
4. The technical steering group shall support the technical project manager, contribute to technical clarifications and actively help the project reach its goals.
5. The technical steering group shall help anchor technical assessments, solutions and choices with the professional communities at the participating organizations.
6. The technical project manager reports to the technical steering group on technical questions. Other reporting follows the project’s organization.
7. The technical steering group shall report any significant risks it sees to the steering group.

6.5 Project plan and description

UH-sky IaaS platform development

• Project plan and description
  – Descriptive summary
    * Limitations
    * Prerequisites
  – Project goals and success criteria
    * 1. Develop, document and deliver a base IaaS platform
    * 2. Integration of authentication and authorization
    * 3. Further develop and verify services to cover ‘traditional workloads’
    * 4. Research and suggest a solution for PaaS
    * 5. Research and suggest possible SaaS services
    * 6. Research and specify a consumer-focused self-service portal
  – Project milestones and scheduling
  – Resources and budgeting
  – Project organization and management
    * Core development and engineering
    * Technical steering group
    * Top-level management and ownership
  – Risks
  – Appendix
    * 1. Support for the Microsoft Windows operating system
    * 2. Licensing of instances in the service
    * 3. Calculating needed capacity for development

6.5.1 Descriptive summary

This document describes what the IaaS project will develop and deliver. The project aims to position IaaS as a common building block and vessel for future IT infrastructure and services delivery in the academic sector.

The main project activity is developing, documenting and delivering an open source IaaS platform ready for production use by June 15th 2015.

Additional activities that expand and build on top of this platform are described. These activities will need to be researched, discussed and specified in greater detail before they can be put into action. The project plan sets the earliest startup time for these activities to February/March 2015.

The base IaaS platform will deliver these services:
• Compute
• Storage in 2 variants
  – Block storage, accessible as virtual disks for compute instances
  – Object storage, accessible over the network as an API

Limitations
• The project will not deliver traditional backup. A common definition of backup states that backup data must be off-site and off-grid (e.g. tape). A planned property of the storage system is the ability to select that an instance will be replicated to another location.
• The additional activities described are dependent on the base IaaS platform.
• Initial success criteria for the additional activities are described, but no cost estimates (resources, budget) are given as part of this project plan.
Prerequisites

To be able to deliver the platform as described, on time, it is a requirement that the project gets access to the needed resources
• At least 3 people must work full-time (100%) with the main project activity
• No role may be smaller than 50%
• If split roles are used, time must be given to the project in alternating blocks of at least 3 continuous working days

The project will need at least 6 months from the Locations complete milestone to delivery of the platform. This means that to deliver on time by the 15th of June 2015, procurement of the needed hardware must be completed within 2014. If hardware is delayed until 2015, the final delivery date will be delayed by the same amount of time, counting from August 15th 2015, as June and July are not counted due to vacations. E.g. if Locations complete is reached in February 2015 (a two-month delay), final delivery will be the 15th of October 2015.

6.5.2 Project goals and success criteria

The project will deliver a base IaaS platform to form a building block for future IT infrastructure delivery in the academic sector.

The project has defined the following activities:
1. Develop, document and deliver a base IaaS platform
2. Integration of authentication and authorization
3. Further develop and verify services to cover ‘traditional workloads’
4. Research and suggest a solution for PaaS
5. Research and suggest possible SaaS services
6. Research and specify a consumer-focused self-service portal

Activities 1 and 2 were approved by the UH-sky steering group in June 2014.

To describe the activities, a format similar to user stories is used. The stories share a common set of definitions

service
    The base IaaS platform, including all services layered below
user
    A person within the academic sector (with an identity record in FEIDE) given rights to administer instances and services on behalf of a tenant.
tenant
    An organization or unit within the Norwegian academic sector
administrator
    A person given responsibility and access to all the components of the service. This does not extend to access rights to the resources of a tenant.
small instance
    A compute instance defined as 1 vCPU, 4 GB RAM, 10 GB storage
large instance
    A compute instance defined as 4 vCPU, 16 GB RAM, 100 GB storage

1. Develop, document and deliver a base IaaS platform

This is the main project activity.

• The service must deliver capacity for ~750 small instances or ~275 large instances with a total of 100 TB accessible storage. This capacity should be equally divided across three geo-dispersed sites.
• The project must deliver a proof-of-concept PaaS solution able to offer three standardized development environments.
• The project must deliver proof-of-concept operation of at least one common service, in a SaaS-like model.
• The service must enable and document an expansion of the base platform to include (existing or new) HPC environments and workloads
• The service must deliver data that can be used for billing tenants. The data delivered must be usable to identify users, organizations and organization units.
• A user must be able to start an instance immediately after first login. The instance must be available within 60 seconds.
• A user must be able to create, update and delete instances in the service from a graphical user interface in a browser, using an API or by using command line tools.
• A user must be able to select if an instance should have a persistent boot volume or not.
• A user must be able to assign and use more storage as needed, within a quota. Billing of storage must be per usage, not per quota.
• A user should be able to place or move an instance geographically across the available locations. The choice should be possible to make according to the user’s need for redundancy, resilience, geographical distance or other factors.
• A user should be able to choose that an instance is replicated to other locations automatically, thus potentially increasing protection against service outages. • A user must be given the ability to monitor service performance and quality continuously. • An administrator must use two-factor authentication for any access to the service for systems management and maintenance purposes. • An administrator must be able to expand capacity, plan and execute infrastructure changes and fix errors in all parts of the service by using version-controlled code and automation. This key point should cover all operational tasks like discovery, deployment, maintenance, monitoring and troubleshooting. 6.5. Project plan and description 27 iaas Documentation, Release 0.1.0 2. Integration of authentication and authorization • A user must be able to authenticate via FEIDE and be authorized as belonging to a tenant in the service • Any FEIDE user passwords should NOT be stored in the service Before the service can be used in a production scenario it is neccessary to integrate central authentication and authorization. Users in the service must be identified as belonging to an organizational entity with correct billing information. This activity must research and document a model and solution that shows how user- and organization data from FEIDE (and other sources) can be integrated to cover the needs of the service. The model must be detailed enough to make it possible to estimate cost and resource constraints for the solution. Limitations in the chosen solution and model must be described. Suggestions and cost estimates for more advanced id/authN/authZ models, e.g users and billing across organizational boundaries, must be discussed. An analysis and assessment of integration with the UNINETT project FEIDE Connect should be done as part of this. 3. 
Further develop and verify services to cover ‘traditional workloads’

The base IaaS platform is planned to be built using OpenStack, a framework for building modern scalable cloud-centric infrastructure. Traditional enterprise workloads, defined as long-lived instances with critical data and state kept as part of the boot filesystem, are not as easily integrated into this framework. We believe a lot of our potential users would also like the service to cover this class of workloads.

This activity integrates a solution tailored for traditional workloads with the base IaaS platform. OpenStack and its service APIs are used to unify the solution so that the consumer side of the service is kept uniform. The solution can make use of existing infrastructure at each site/location, possibly by utilizing existing excess capacity, or later by expansion.

A key value proposition for this activity is to confirm and further develop the requirement that any solution, knowledge and people working in the project are part of a shared pool of resources. Existing systems and available free capacity vary greatly between locations but this must not prevent or stop all parties from participating.

Licensing is an important question that this activity must address.

4. Research and suggest a solution for PaaS

There is a definite interest in PaaS as a concept in our communities. Earlier discussions have revealed that it is very likely we would want to deliver some form of PaaS solution on top of the IaaS platform. Today, from what we know, only UNINETT and its internal Nova project have experience with PaaS as an environment.

This activity must research and suggest a form and model for a PaaS service delivered on top of the base IaaS platform. The suggested solution must be described and cost must be estimated.

5. Research and suggest possible SaaS services

Several of the common IT services in the sector are already today delivered in models that are close to SaaS.
From our UH-sky viewpoint it is natural to look at these services as possible future migrations to the IaaS platform.

This activity must actively approach the sector on multiple fronts to find use cases and needs that could possibly fit in a SaaS model. Early examples of such services could be software used in labs or classrooms. Is SPSS as a service possible?

6. Research and specify a consumer-focused self-service portal

This activity will define goals to enable a uniform, consumer-focused, self-service portal for all IaaS, PaaS (SaaS?) related services. A central point for consuming the services is needed.

Functional aspects we’d need solved are

• Chargeback. Automatically generated billing based on usage.
• Support for several cloud and virtualization providers, both private and public
• Possibility for migrating workloads/instances and data between different infrastructure providers
• Overview and monitoring of allocated resources across providers

There are several products today that cover most if not all of the functional aspects described. A central customer-focused portal should be developed using one of them as a base.

A development project formed around this activity will be only loosely coupled to the IaaS project but we think it would be beneficial to wait until the core functionality of the IaaS platform is in place.

6.5.3 Project milestones and scheduling

The following describes planned progress and possible startup dates for the project activities

Date             Activity
June 2014        Startup activity 1 and 2
October 2014     Minimum viable product. Per activity 1, one of three physical sites installed and running.
December 2014    Locations complete. All sites up and running. No storage or instance uptime guaranteed.
February 2015    Functionally complete. All functional goals completed and operative. No storage or instance uptime guaranteed.
Feb.-Jun. 2015   Incubation period. Pre-production tuning, testing and verification. Early customers given access. Best effort storage consistency and instance uptime. Documenting any further development needed.
15.6.2015        Project delivery. Activities 1, 2 delivered as described.

6.5.4 Resources and budgeting

This part of the project plan is not public

6.5.5 Project organization and management

Core development and engineering

Day-to-day activities are led by technical project lead Jan Ivar Beddari. A weekly meeting for planning is held Thursday at 13:00. Daily “morning meetings” to keep track of activities are held at 09:30. Both these regular meetings are held online using video conferencing.

Core development and engineering team

• Erlend Midttun, NTNU
• Tor Lædre, University of Bergen
• Mikael Dalsgard, University of Oslo
• Hege Trosvik, University of Oslo
• Hans-Henry Jakbosen, University of Tromsø
• Marte Karidatter Skadsem, University of Tromsø

Technical steering group

The project reports to a technical steering group with representatives from all the participating organizations. Its main function is to coordinate communication and solve issues that could possibly block progress in the project. This group is given a mandate from the top level project management to specify its roles and functions. Its members are

• Kjetil Otter Olsen, University of Oslo (group lead)
• Per Markussen, University of Tromsø
• Ola Ervik, NTNU Norwegian University of Science and Technology
• Raymond Kristiansen, University of Bergen
• Kristin Selvaag, UNINETT, UH-sky
• Jan Ivar Beddari, UH IaaS tech project lead

Top-level management and ownership

The UH-sky steering group represents the top level project management and project ownership. This group consists of the IT Directors from the four larger universities and representatives from university colleges and UNINETT, the Norwegian NREN organization.
• Håkon Alstad, IT Director, NTNU Norwegian University of Science and Technology
• Lars Oftedal, IT Director, University of Oslo
• Stig Ørsje, IT Director, University of Tromsø
• Tore Burheim, IT Director, University of Bergen
• Thor-Inge Næsset, IT Manager, NHH Norwegian School of Economics
• Vidar Solheim, IT Director, HiST Sør-Trøndelag University College
• Frode Gether-Rønning, Head of IT-dept., AHO The Oslo School of Architecture and Design
• Petter Kongshaug, CEO, UNINETT
• Tor Holmen, Deputy CEO, UNINETT

Meetings in the steering group are organized by the UNINETT UH-sky program manager, Kristin Selvaag.

6.5.6 Risks

• The hardware investments planned will have a lifetime of at least four years. Risks involved with the investment are considered low. All acquired hardware will be usable to its full extent in the local organizations even if the project fails.
• Delays in progress (3 months or more) due to lack of access to resources, unforeseen technical or organizational complexities, or problems with coordinating efforts across the participants are very likely.
• The risk of inaccuracies in cost estimates for hardware (both current and future) is not considered high. However, the project does not estimate costs for production usage of the finished platform.

6.5.7 Appendix

Questions and additions for the goals and criteria

1. Support for the Microsoft Windows operating system

A basic Windows-based instance requires substantial capacity from the service when compared to a basic Linux-based instance. The project aims to support Windows instances in the best way possible. Testing done within the project will determine what the technical solution will be. Windows will be tested in the service as large instances and performance will be measured and compared to our existing virtualization infrastructures.

2.
Licensing of instances in the service

The project will not handle or research licensing of instances in the service. Tenants must ensure that they are properly licensed for all instances they create using the service. Microsoft and Red Hat are examples of vendors with software products and operating systems that require licensing.

In a future production service we recommend negotiating agreements with vendors for site licensing. This could potentially be more cost effective than purchasing licenses per tenant or organization. The project has so far not planned or set aside resources towards this.

3. Calculating needed capacity for development

Back-of-a-napkin assessment of development compute capacity

• Physical cores (non-hyperthreaded): 2x12 core, 3x nodes, 3x sites = 216 cores
• Virtual cores: 4x oversubscription = 864 vCPU, 3x oversubscription = 648 vCPU
• RAM, no oversubscription: 512 GB, 3x nodes, 3x sites = 4608 GB raw capacity

Instances

• Small instances: 1 vCPU, ~6 GB RAM, 10 GB disk ~ 72 instances per compute node, 648 total (at 3x CPU oversubscription)
• Large instances: 4 vCPU, ~24 GB RAM, 100 GB disk ~ 18 instances per compute node, 162 total (at 3x CPU oversubscription)
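
The capacity figures above can be reproduced with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and it assumes, as the assessment above appears to, that the vCPU count is the only limit on the number of instances per compute node.

```python
# Inputs taken from the development hardware assessment above.
CORES_PER_NODE = 2 * 12   # two 12-core CPUs per compute node, no hyperthreading
NODES_PER_SITE = 3        # compute nodes per site
SITES = 3
RAM_PER_NODE_GB = 512

TOTAL_NODES = NODES_PER_SITE * SITES


def instance_capacity(oversubscription, vcpus_per_instance):
    """Return (instances per node, instances in total).

    Hypothetical helper: only vCPUs are treated as the limiting resource.
    """
    vcpus_per_node = CORES_PER_NODE * oversubscription
    per_node = vcpus_per_node // vcpus_per_instance
    return per_node, per_node * TOTAL_NODES


physical_cores = CORES_PER_NODE * TOTAL_NODES
raw_ram_gb = RAM_PER_NODE_GB * TOTAL_NODES
vcpus_at_4x = physical_cores * 4
vcpus_at_3x = physical_cores * 3
small = instance_capacity(oversubscription=3, vcpus_per_instance=1)
large = instance_capacity(oversubscription=3, vcpus_per_instance=4)

print("physical cores:", physical_cores)
print("vCPU at 4x / 3x:", vcpus_at_4x, "/", vcpus_at_3x)
print("raw RAM (GB):", raw_ram_gb)
print("small instances (per node, total):", small)
print("large instances (per node, total):", large)
```

Dividing the 512 GB of RAM per node by 72 small instances gives roughly 7 GB each, so the ~6 GB figure in the list presumably leaves some memory for the host itself.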