Production Readiness
Risks and Mitigation
Here we enumerate the risks and detail how each will be mitigated.
General Security Measures
Until we are better staffed to administer a public-facing site, we shall allow only traffic from the city and county networks to access the site and web services. By adopting this simple and effective measure, we can sidestep a number of other issues such as DoS and SQL injection attacks.
TODO - Will, can I ask you to do this work? I recall that we can use a subnet mask or equivalent.
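As a minimal sketch of one way to do this, assuming Apache httpd fronts the site and using placeholder address ranges in place of the actual city and county networks (Apache 2.2 syntax shown; 2.4 would use "Require ip" instead):

    # Restrict the entire site to internal city/county address ranges.
    # The CIDR blocks below are placeholders - substitute the real ranges.
    <Location />
        Order deny,allow
        Deny from all
        Allow from 10.0.0.0/8
        Allow from 192.168.0.0/16
    </Location>

The same restriction could instead (or additionally) be applied at the firewall, whichever layer is easiest for us to administer.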
Hardware Failure
Hardware failures are handled by AppLogic. As long as we do not consume too many resources on a given grid, hardware failover is automatic.
TODO - Will, is there a way we can be notified when there is a hardware failure? Do we need to do anything here?
Denial of Service Attacks
We will not be taking any specific measures to detect or mitigate a DoS attack beyond the network restriction described under General Security Measures.
SQL Injection Attacks
The application code uses a framework that escapes all user input, which effectively addresses this risk.
User Credential Security
We will not enforce strong passwords, nor will we force password changes.
Each year we will conduct a user account audit and disable or remove all accounts that are not in use.
System Administration Credential Security
Administration credentials include:
ETL Processing
- Windows user accounts (dev, qa, prod)
- SSH keys (dev, qa, prod)
Web Application
- AppLogic grid root
- SSH access to Linux VMs
- Tomcat
- GeoServer
- Postgres
- MAD application admin
- Apache httpd
TODO - Paul - enumerate all these accounts, to be stored and maintained at a single unpublished location known to the support staff.
TODO - Paul - review all of the administration credentials, and do the following:
- ensure that default passwords are not used
Viruses
We will not be scanning for viruses.
Database Integrity and Performance
We are using the Postgres database, which is known for its stability and robustness.
In any event, every 24 hours we will:
- run consistency checks
- run a full vacuum/analyze to maintain good performance
- back up the database
- copy the backup to an offsite location
- retain backups as needed (TODO - need retention requirements from Hema)
TODO - Paul - automate all of this (a sketch follows below)
TODO - Hema - we need your business schedule so we can run these tasks at a good time
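A minimal sketch of what the automated nightly job might look like, assuming .pgpass-based authentication to Postgres and SSH access to an offsite backup host; the database name, paths, and host below are placeholders:

    #!/bin/sh
    # Nightly database maintenance - all names and paths are placeholders.
    DB=mad
    STAMP=`date +%Y%m%d`
    BACKUP=/var/backups/mad-$STAMP.dump

    # Full vacuum and analyze to reclaim space and keep planner statistics fresh.
    vacuumdb --full --analyze $DB

    # Back up the database in custom format; a dump that completes cleanly
    # also doubles as a basic readability/consistency check.
    pg_dump -Fc -f $BACKUP $DB || exit 1

    # Copy the backup offsite.
    scp $BACKUP backup@offsite.example.org:/backups/mad/

    # Retention pruning goes here once we have the schedule from Hema.

This would be scheduled from cron (for example, 0 2 * * * for 2 a.m.) once Hema confirms a good window.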
Data Center Connectivity Lost
With each version release into production we will back up the entire application to a second VM for use upon failover. Should we lose the primary data center, we will bring up the application in the secondary data center using the most recent database backup. Performance of the application for the first day will be sluggish because the map cache will be empty or out of date. Reseeding the map cache takes overnight and can be done from the command line on the ETL machine.
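As a rough, untested outline of the recovery steps on the secondary VM, assuming the most recent custom-format backup has already been copied there (the database name and path are placeholders):

    # Recreate the database from the most recent backup (placeholder names).
    createdb mad
    pg_restore -d mad /var/backups/mad-latest.dump

    # Restart the application stack (Apache httpd, Tomcat, GeoServer),
    # then start the overnight map cache reseed from the ETL machine.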
TODO - Paul - test
TODO - need input from Hema here.
Deployment practices
The Enterprise Addressing System includes a data server, a map server, and a web server. Application deployments are managed using standard practices across three separate environments (DEV, QA, PROD). Changes of any sort are first tested in the DEV environment. If the tests pass, we apply the changes to QA, where business users conduct testing. Only after the business users approve the changes do we release them to PROD. This applies to everything from OS upgrades through to our own application code, minor and major.
Since all application resources are hosted "in the cloud", all deployment and ETL activities in PROD shall be conducted through SSH tunnels. In the DEV and QA environments we shall occasionally allow non-SSH access for convenience.
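For example, a PROD deployment or ETL session might forward the needed ports over SSH rather than exposing them directly; the user, host, and ports below are placeholders:

    # Forward local ports to Postgres and Tomcat on the PROD VM over SSH.
    ssh -N -L 5432:localhost:5432 -L 8080:localhost:8080 deploy@prod-vm.example.org

Local tools (psql, deployment scripts, a browser) then connect to localhost as if they were on the PROD machine.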
For various legacy-related reasons, we were unable to employ standard practices for the extract, transform, and load (ETL) processes. While the ETL code has been written to support DEV, QA, and PROD environments, none of the participating systems has more than a single environment. (Is this correct?) This includes each of the data servers (SFGIS, DPW, ASR). The workstation that executes the ETL is virtualized but is not backed up and has no failover plan in place. (Is this correct?)
TODO - Jeff, please do what you can to nail this down. Add more details here.
- Lack of FME License Manager failover; FME uses a shrink-wrapped, non-production database; etc. These and other FME issues are documented on Citypedia.
The Department of Technology has a Change Control system in place to advise and vet proposed changes to production systems. Change Controls will be created prior to the release of any changes to the Enterprise Addressing System production system, per departmental policy.
Development Practices
The software development team uses version control (Subversion), bug tracking (Jira), and wiki collaboration (Confluence), all hosted by Atlassian. All source code, DDL, DML, design documents, etc. are stored in these Atlassian tools. When a version is released, the repository is tagged. When bug fixes are made to production, a branch is created in the repository.
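For reference, both the release tag and the bug fix branch are simple "svn copy" operations; the repository URL and version number below are placeholders:

    # Tag a release (URL and version are placeholders).
    svn copy https://svn.example.org/eas/trunk \
             https://svn.example.org/eas/tags/release-1.2 \
             -m "Tag release 1.2"

    # Create a branch for production bug fixes off that tag.
    svn copy https://svn.example.org/eas/tags/release-1.2 \
             https://svn.example.org/eas/branches/release-1.2-fixes \
             -m "Branch for production bug fixes to release 1.2"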
Disk Space
Every 24 hours we will:
- check that there are no disk space issues
- notify application support if there are any issues
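A minimal sketch of the daily check, assuming working outbound mail on the VM; the support address and the 90% threshold are placeholder assumptions:

    #!/bin/sh
    # Daily disk space check - email application support about any filesystem
    # at or above the threshold. Address and threshold are placeholders.
    THRESHOLD=90
    SUPPORT=app-support@example.org

    REPORT=`df -P | awk -v t=$THRESHOLD 'NR > 1 { use = $5; sub(/%/, "", use);
            if (use + 0 >= t) print $6 " is at " $5 }'`

    if [ -n "$REPORT" ]; then
        echo "$REPORT" | mail -s "Disk space warning on `hostname`" $SUPPORT
    fi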
TODO - Will, can I ask you to look at this? Paul can help.
Failover
Before we go into production, and once every year thereafter, we will conduct a failover exercise to ensure that we are able to provide business continuity.
TODO - Paul
Service Level Agreement (separate docs?)
Who will provide Bug Fixes and Enhancements?
How will we decide what bugs get fixed and what enhancements are built (governance)?
Maximum time allowed for ETL outage.
Recovery Time Objective
Without MAD, DBI will not be able to issue permits. (Is this correct?) The recovery time objective for the application is 2 hours. Hema, is this acceptable?
Recovery Point Objective
The recovery point objective for the database is 24 hours. I am assuming that, given our limited resources, we will use database backups for recovery and that we do not want to use replication. If we were to use replication, we could achieve a recovery point objective of a single transaction, but we have no expertise in this area, so I recommend against that approach.
TODO - Hema, is all this acceptable?
Support Notifications
At a minimum we will use email for support notifications. This means we will need access to a mail server of some sort. Depending on DBI requirements, including their scheduling requirements, we may consider a more robust solution such as Nagios. Using Nagios would probably mean more work; I have no experience with Nagios, but some DT staff do.
Administration Roles
We need to flesh out who is going to do what...
Virtual Machine Admin
Can our ASP provide this until we move to 200 Paul?
User Administration
TODO - Hema, can DBI help with this?
Linux System Admin
SFGIS will do this
TODO - Paul - wiki pages
Postgres DB Admin
SFGIS will do this
We should have a real DBA sign on.
TODO - Paul - wiki pages
Production Support
SFGIS will do this
TODO - Paul - wiki pages