This page details production environment risks and how each will be mitigated.

Risks

  • hardware failure
  • denial of service attacks
  • SQL injection attacks
  • credentials compromised
  • viruses
  • data center is physically destroyed
  • data center connectivity is lost
  • database is corrupted
  • production support
  • key person risk

Mitigation of Risks

...

Risks and Mitigation

Here we enumerate the risks and detail how each risk will be mitigated.

...

General Security Measures

...

Until we are better staffed to administer a public-facing site, we shall allow only traffic from the City and County to access the site and web services. By adopting this simple and effective measure, we can sidestep a number of other issues such as DoS and SQL injection attacks.
TODO - Will, can I ask you to do this work? I recall that we can use a subnet mask or equivalent.
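
For illustration only, the check amounts to testing whether a client address falls inside the City and County address ranges; the actual restriction would be applied by Will at the firewall or web-server level, and the subnets below are placeholders, not the real ranges.

    import ipaddress

    # Placeholder ranges - the real City and County subnets must come from DT networking.
    ALLOWED_NETWORKS = [
        ipaddress.ip_network("10.0.0.0/8"),
        ipaddress.ip_network("192.168.100.0/24"),
    ]

    def is_allowed(client_ip):
        """Return True if the client address falls within an allowed subnet."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in ALLOWED_NETWORKS)

    print(is_allowed("10.1.2.3"))   # True under the placeholder ranges
    print(is_allowed("8.8.8.8"))    # False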

...

Hardware Failure

...

Hardware failures are handled by AppLogic. As long as we do not use too many resources on a given grid, hardware failover is automatic.
TODO - Will, is there a way we can be notified when there is a hardware failure? Do we need to do anything here?
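
Pending Will's answer, one low-cost stopgap (a sketch only; the host list and mail relay below are placeholders, and AppLogic may provide its own alerting that would make this unnecessary) is a periodic reachability check, run from cron, that emails support when a VM stops answering.

    import smtplib
    import subprocess
    from email.message import EmailMessage

    # Placeholder values - replace with the real VM list and mail relay.
    HOSTS = ["web-vm.example.local", "db-vm.example.local"]
    MAIL_RELAY = "smtp.example.local"
    SUPPORT = "app-support@example.local"

    def is_up(host):
        """One ICMP ping with a short timeout; False means no answer."""
        result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                                capture_output=True)
        return result.returncode == 0

    down = [h for h in HOSTS if not is_up(h)]
    if down:
        msg = EmailMessage()
        msg["Subject"] = "Possible VM/hardware failure: " + ", ".join(down)
        msg["From"] = SUPPORT
        msg["To"] = SUPPORT
        msg.set_content("The following hosts are not responding: " + ", ".join(down))
        with smtplib.SMTP(MAIL_RELAY) as server:
            server.send_message(msg)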

...

Denial of Service Attacks

...

Beyond the subnet restriction described above, we will not be taking any specific measures to detect or to mitigate a DoS attack.

...

SQL Injection Attacks

...

The application code uses a framework to escape all user input, which effectively dispenses with this problem.
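
Purely to illustrate the principle (the application itself runs under Tomcat and its framework handles this for us), the key point is that user input is passed to the database as a bound parameter rather than concatenated into the SQL string. The sketch below uses Python's built-in sqlite3 module as a stand-in.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE addresses (street TEXT)")

    user_input = "Market'; DROP TABLE addresses; --"   # hostile input

    # Bound parameter: the driver treats user_input strictly as data, not SQL.
    rows = conn.execute(
        "SELECT * FROM addresses WHERE street = ?", (user_input,)
    ).fetchall()
    print(rows)   # [] - the table is untouched

    # Never build SQL by concatenation, e.g.:
    #   conn.execute("SELECT * FROM addresses WHERE street = '" + user_input + "'")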

...

User Credential Security

...

We will not enforce strong passwords nor will we force password changes.

Each year we will conduct a user account audit and disable or remove all accounts that are not in use.
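
A rough sketch of what the annual audit could look like follows; the CSV export format, column names, and the one-year threshold are assumptions, and the application admin console remains the authoritative source for last-login data.

    import csv
    from datetime import datetime, timedelta

    # Assumed export: accounts.csv with "username" and "last_login" (YYYY-MM-DD) columns.
    CUTOFF = datetime.now() - timedelta(days=365)

    with open("accounts.csv", newline="") as f:
        stale = [row["username"] for row in csv.DictReader(f)
                 if datetime.strptime(row["last_login"], "%Y-%m-%d") < CUTOFF]

    # Report candidates only; disabling or removing accounts stays a reviewed, manual step.
    for name in stale:
        print("candidate for disable/remove:", name)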

...

System Administration Credential Security

...

Administration credentials include:

ETL Processing

  • windows user accounts (dev, qa, prod)
  • ssh keys (dev, qa, prod)

Web Application

  • applogic grid root
  • ssh access to linux VMs
  • tomcat
  • geoserver
  • postgres
  • MAD application admin
  • apache httpd

TODO - Paul - enumerate all these accounts, to be stored and maintained at a single unpublished location known to the support staff.

TODO - Paul - review all of the administration credentials, and do the following:

  • ensure that default passwords are not used

...

Viruses

...

We will not be scanning for viruses.

...

Database Integrity and Performance

...

We are using the Postgres database, which is known for stability and robustness.
In any event, every 24 hours we will:

  • run consistency checks
  • run a full vacuum/analyze to guarantee good performance
  • back up the database
  • copy the backup to an offsite location
  • retain backups as needed (TODO - need the retention schedule from Hema)

TODO - Paul - automate all of this (see the sketch below)
TODO - Hema - we need your business schedule so we can run these tasks at a good time
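
As a starting point for that automation, the nightly job could look roughly like this; the database name, paths, and offsite destination are placeholders, and the retention step is left open until Hema confirms the schedule.

    import datetime
    import pathlib
    import subprocess

    # Placeholder database name, backup directory, and offsite destination.
    DB = "mad"
    BACKUP_DIR = pathlib.Path("/var/backups/mad")
    OFFSITE = "backup@offsite.example.local:/backups/mad/"

    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    dump_file = BACKUP_DIR / ("mad-" + stamp + ".dump")

    # Full vacuum/analyze for performance; consistency checks would be added here as well.
    subprocess.run(["vacuumdb", "--full", "--analyze", "--dbname", DB], check=True)

    # Back up the database in Postgres custom format.
    subprocess.run(["pg_dump", "--format=custom", "--file", str(dump_file), DB], check=True)

    # Copy the backup to the offsite location.
    subprocess.run(["rsync", "-a", str(dump_file), OFFSITE], check=True)

    # Retention: prune old dumps here once Hema confirms how long backups must be kept.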

...

Data Center Connectivity Lost

...

With each version release into production we will back up the entire application to a second VM for use upon failover. Should we lose the primary data center, we will bring up the application in the secondary data center using the most recent database backup. Performance of the application for the first day will be sluggish because the map cache will be empty or out of date. It will take overnight to reseed the map cache, which can be done from the command line on the ETL machine.

TODO - Paul - test
TODO - need input from Hema here.
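
For reference, reseeding from the command line could look roughly like the following, assuming GeoServer's GeoWebCache REST seeding endpoint is enabled; the host, credentials, layer name, SRS, and zoom range are placeholders to be confirmed against our GeoServer configuration.

    import base64
    import json
    import urllib.request

    # Placeholder host, credentials, and layer - confirm against the production GeoServer.
    GEOSERVER = "http://geoserver.example.local:8080/geoserver"
    LAYER = "mad:addresses"
    AUTH = base64.b64encode(b"admin:changeme").decode()

    seed_request = {"seedRequest": {
        "name": LAYER,
        "srs": {"number": 4326},      # must match the cached gridset
        "type": "seed",
        "zoomStart": 0,
        "zoomStop": 12,
        "format": "image/png",
        "threadCount": 2,
    }}

    req = urllib.request.Request(
        GEOSERVER + "/gwc/rest/seed/" + LAYER + ".json",
        data=json.dumps(seed_request).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Basic " + AUTH},
        method="POST",
    )
    urllib.request.urlopen(req)   # GeoServer queues the seed job and runs it in the background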

...

Deployment Practices

...

The Enterprise Addressing System includes a data server, a map server, and a web server. Application deployments are managed using standard practices, with three separate environments (DEV, QA, PROD). Changes of any sort are first tested in the development (DEV) environment. If the tests pass, we apply the changes to QA, where business users conduct testing. Only after the business users approve the changes do we release them to PROD. This includes everything from OS upgrades through to our own application code, minor and major. To push normal street and parcel data changes into the MAD, we use an ETL process, described below.

Since all application resources are hosted "in the cloud", all deployment and ETL activities in PROD shall be conducted through SSH tunnels. In the DEV and QA environments we shall occasionally allow non-SSH access for convenience.
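
For example, a deployment or ETL task that needs the PROD Postgres port could run through a local forward along these lines (the host name, account, and ports are placeholders):

    import subprocess

    # Placeholder host and account; forward local port 15432 to Postgres on the PROD VM.
    tunnel = subprocess.Popen([
        "ssh", "-N",
        "-L", "15432:localhost:5432",
        "deploy@prod-db.example.local",
    ])

    try:
        # While the tunnel is up, deployment or ETL steps connect to localhost:15432,
        # e.g. pg_restore --host=localhost --port=15432 --dbname=mad mad-backup.dump
        pass
    finally:
        tunnel.terminate()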

For various legacy-oriented reasons, we were unable to employ standard practices for the extract, transform, and load (ETL) processes. While this process has been coded to support DEV, QA, and PROD environments, none of the participating systems have more than a single environment. (Is this correct?) This includes the machine that the ETL jobs run on and each of the data servers (SFGIS, DPW, ASR). The workstation that executes the ETL is virtualized but is not backed up and has no failover plan in place. (Is this correct?)
TODO - Jeff, please do what you can to bolt this down. Add more details here.

  • Lack of FME License Manager failover; FME uses a shrink-wrapped, non-production database; etc. These and other FME issues are documented on Citypedia.

The Department of Technology has a Change Control system in place to advise and vet proposed changes to production systems. Change Controls will be created prior to the release of any changes to the Enterprise Addressing System production system, per departmental policy.

...

Development Practices

...

The software development team uses version control (Subversion), bug tracking (Jira), and wiki collaboration (Confluence), all of which are hosted by Atlassian. All source code, DDL, DML, design documents, etc. are stored on Atlassian. When a version is released, the repository is tagged. When bug fixes are made to production, a branch is created in the repository.

...

Disk Space

...

Every 24 hours we

  • check that there are no disk space issues
  • notify application support if there are any issues
    TODO - Will, can I ask you to look at this? Paul can help. (A sketch follows below.)
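
A rough sketch of that daily check (the mount points, threshold, and mail settings are placeholders):

    import shutil
    import smtplib
    from email.message import EmailMessage

    # Placeholder mount points, threshold, and mail settings.
    MOUNTS = ["/", "/var/lib/postgresql", "/var/backups"]
    MIN_FREE_PCT = 15
    MAIL_RELAY = "smtp.example.local"
    SUPPORT = "app-support@example.local"

    problems = []
    for mount in MOUNTS:
        usage = shutil.disk_usage(mount)
        free_pct = usage.free * 100.0 / usage.total
        if free_pct < MIN_FREE_PCT:
            problems.append("{}: only {:.1f}% free".format(mount, free_pct))

    if problems:
        msg = EmailMessage()
        msg["Subject"] = "Disk space warning"
        msg["From"] = SUPPORT
        msg["To"] = SUPPORT
        msg.set_content("\n".join(problems))
        with smtplib.SMTP(MAIL_RELAY) as server:
            server.send_message(msg)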

...

Failover

...

Before we go into production and once every year, we will conduct a failover exercise to ensure that we are able to provide business continuity.
TODO - Paul

Service Level Agreement (separate docs?)

...

Who will provide Bug Fixes and Enhancements?

...

How will we decide what bugs get fixed and what enhancements are built (governance)?

...

Maximum time allowed for ETL outage.

...

Recovery Time Objective

...

Without MAD, DBI will not be able to issue permits. (Is this correct?)
The recovery time objective for the application is 2 hours. Hema, is this acceptable?

...

Recovery Point Objective

...

The recovery point objective for the database is 8 hours. (Is this OK?)

In some cases, such as a data center failure, the map cache that we fail over to will be unseeded. The cache can be reseeded overnight, but the responsiveness of the entire application will be slow until the cache is reseeded.

Access to the database is more important than access to the cached map data.

Upon a DC failover, we will have to reseed the cache.

Can we assume that we are not going to use replication?

If no replication, how much work can we lose? (1 day? 4 hours? 2 hours?)

24 hours. I am assuming that, given our limited resources, we will use database backups for recovery, and that we do not want to use replication. If we were to use replication, we could have a recovery point objective right down to a single transaction. But we have no expertise here, and so I recommend against this approach.

TODO - Hema, is all this acceptable?

Support Notifications

At a minimum we will use email for support notification. This means we will need access to a mail server of some sort. Depending on DBI requirements, including their schedule requirements, we may consider a more robust solution such as Nagios. Using Nagios would probably mean more work; I have no experience with Nagios, but some DT staff do.

Administration Roles

We need to flesh out who is going to do what...

...

Virtual Machine Admin

...

Can our ASP provide this until we move to 200 Paul?

...

User Administration

...

TODO - Hema, can DBI help with this?

...

Linux System Admin

...

SFGIS will do this.
TODO - Paul - wiki pages

...

Postgres DB Admin

...

SFGIS will do this.
We should have a real DBA sign on.
TODO - Paul - wiki pages

...

Production Support

...

SFGIS will do this.
TODO - Paul - wiki pages