Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Wiki Markup
The AVS addresses are imported into EAS via a one time ETL.
Once EAS goes live, this ETL will be turned off.
Here I describe in general the processing that occurs during the ETL.

The ETL process is controlled by a python job which is here (jira access required)
* http://sfgovdt.jira.com/source/browse/MAD/trunk/etl/src/config/live/config_jobs.py?hb=true#to318

If you do not have access, the important code is here
{code}
        job.commands = [
            # load the data
            FtpCommand(job=job, connectionGroup=self.connectionOrganizer.ftpConnections['DBI2SFGIS'], sourcePath='/dbi-external/tblAVS.txt', targetPath=csvFileIn, direction='GET'),
            FmeCommand(job=job, fmw="avs_load/avs_to_eas.fmw", connectionGroups=(self.connectionGroups['MAD_ETL'],), useSdePorts=(False,), files=(csvFileIn,)),
            FmeCommand(job=job, fmw="avs_load/parcels_sfgisStg_no_geom_to_eas.fmw", connectionGroups=(self.connectionGroups['SFGIS_STG_SDE'], self.connectionGroups['MAD_ETL'],), useSdePorts=(True,False,)),
            SqlCommand(job=job, sqlStatement="select avs.delete_addresses();", connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="vacuum analyze;", connectionGroup=self.connectionGroups['MAD_ETL'], useTransaction=False),
            SqlCommand(job=job, sqlStatement="select avs.init_parcel_points();", connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="select avs.init_streets_nearest();", connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="select avs.init_avs_addresses();", connectionGroup=self.connectionGroups['MAD_ETL']),
            AvsLoadCommand(job=job, connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="update avs.avs_addresses set exception_text = 'programming error - row not processed' where address_base_id is null and exception_text is null;", connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="select public._eas_validate_addresses_after_etl('ALL');", connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="delete from public.d_address_disposition where disposition_description = 'provisional';", connectionGroup=self.connectionGroups['MAD_ETL']),
            SqlCommand(job=job, sqlStatement="vacuum analyze;", connectionGroup=self.connectionGroups['MAD_ETL'], useTransaction=False),
            # report on results
            RemoveDirectoryTreeCommand(job=job, targetDir=os.path.join(getDataFilePath(), 'avs_load_report')),
            SqlCommand(job=job, sqlStatement="select avs.qa();", connectionGroup=self.connectionGroups['MAD_ETL']),
            DumpTablesCommand(job=job, connectionGroup=self.connectionGroups['MAD_ETL'], tableNames=['avs.vw_load_summary', 'avs.vw_load_results', 'avs.qa'], targetDir=reportDir),
            ZipCommand(job=job, sourceDir=os.path.join(getDataFilePath(), 'avs_load_report'), targetFile=os.path.join(getDataFilePath(), 'avs_load_report.zip')),
            EmailCommand(job=job, subject='AVS Load Report', text='See attached.', attachFiles=[os.path.join(getDataFilePath(), 'avs_load_report.zip')]),
            FtpCommand(job=job, connectionGroup=self.connectionOrganizer.ftpConnections['DBI2SFGIS'], sourcePath=os.path.join(reportDir, 'avs.vw_load_results.csv'), targetPath='/dbi-external/avsEasLoadResults.csv', direction='PUT')
        ]

{code}

This python code calls into a DB proc (avs.load) which then does the heavy lifting.

The majority of the other code is here
* http://code.google.com/p/eas/source/browse/#svn%2Ftrunk%2Fetl%2Fsql%2Favs_load

with these being the key pieces
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_init_avs_addresses_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_load_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_process_address_base_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_process_address_unit_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_process_address_parcel_link_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/src/commands.py#1799

The execution path is not trivial so I'll walk through the process in english (below).
By necessity I do a lot of generalizing but I also link to the code.

In case you are curious, the key parts of the EAS data model are here
* http://code.google.com/p/eas/source/browse/trunk/database/erd_20110901.jpeg


OK, off we go...

We start by doing to blanket validations and standardizing some values.
This is done in this db proc:
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_init_avs_addresses_ddl.sql

and includes the following

'invalid street number suffix'
The domain values are here
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/migrate_1-0_beta_to_1-0_prod/d_address_base_number_suffix_ddl.sql


'street name does not exist'
The domain values are specified by DPW.
We do not use fuzzy string matching.

'street suffix does not exist in street dataset'
The domain values are specified by DPW.
We do not use fuzzy string matching.

'street - street suffix combination does not exist'
The domain values are specified by DPW.
We do not use fuzzy string matching.

'referenced parcel has no geometry'
parcel data from DPW

'no matching block - lot'
parcel data from DPW

'block lot values are inconsistent'
Check for consistency across the columnes block, lot, and block-lot.

'length of concatenated unit num exceeds 10'
We concatenate avsa.unit and avsa.unit_sfx using
http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_concatenate_unit_ddl.sql
The result must fit into char 10.

'invalid end date value'
Some of the end date values cannot be cats into the date type.


We also standardize dates, "unit number" values, and street number suffixes.


At this point we have "excepted" addresses that we know we cannot process.
We take each remaining address and try to load it into EAS.
This proceeds from the base address, to the unit address, to the unit address - parcel link.
This is detailed here in
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_load_ddl.sql

The main load proc calls into these procs in this order
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_process_address_base_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_process_address_unit_ddl.sql
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/avs_load/f_process_address_parcel_link_ddl.sql

With the most interesting work being done by f_process_address_base.
I will describe this now and will do a good bit of generalizing.
We try to use an existing adddress and create a new one if necessary.
If we create a new one, we have to find the best matching street segment.
This is easily the most complicated process.
The segment must be within 500 feet of the specified parcel, and the street name and street suffix must match the source data.
Again, we do not use fuzzy string matching (such as Levenshtein etc).

If all goes well, we insert the base address.
There are various constraints that protect data integrity, specifically to prevent duplicate active base addresses.
You  can see the trigger enforced constraints here
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/migrate_1-0_beta_to_1-0_prod/tg_address_base_before_ddl.sql

If the ETL finds or creates a base address, it will then insert the unit address if it is specified.
Again, we do not allow multiple active duplicates.
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/migrate_1-0_beta_to_1-0_prod/tg_addresses_before_ddl.sql


Finally, we insert an "address - parcel link".
Here again we do not allow duplicates.
* http://code.google.com/p/eas/source/browse/trunk/etl/sql/migrate_1-0_beta_to_1-0_prod/tg_address_x_parcels_before_ddl.sql

At the end of all this processing we report results in three ways
 - summary results
 - detail on all AVS rows
 - quality assurance on selected data

An example of this attached here: [^avs_load_report.zip]

The summary results look like this example (from 2011-12-07):

{csv}
exception_text_general,count,percentage
,324018,96.3671
referenced parcel has no geometry,5409,1.6087
no matching block - lot,3926,1.1676
insert rejected - address_x_parcels - the parcel is not within the block number range,1172,0.3486
your street suffix appears to be incorrect ,929,0.2763
unable to find nearby street ,357,0.1062
street name does not exist,321,0.0955
insert rejected - address_x_parcels - the parcel is not within within 500 feet of the address,87,0.0259
length of concatenated unit num exceeds 10,5,0.0015
block lot values are inconsistent,4,0.0012
street suffix does not exist in street dataset,3,0.0009
invalid end date value,2,0.0006
{csv}