Stages 1 - 3 are run only once per Bulk Loader process.
Stages 4 - 7 are run one or more times in batches.
Stage Number | Stage | Category | Summary | Environment | Iterations | Estimated Person Time | Estimated Computer Time |
---|---|---|---|---|---|---|---|
1 | Import and parse reference dataset (Optional) | Parsing | Cross-check each source address against a reference dataset. Addresses found in the reference dataset move on to the next stage; addresses not found are set aside in an exclusion set for later review. | | Once per Bulk Loader process | 1 hour | 10 minutes |
2 | Import, parse and filter source dataset | Parsing | Import the dataset destined for the EAS, then parse and filter it. | | Once per Bulk Loader process | 90 minutes | 15 minutes |
3 | Geocode and filter | Geocoding | Geocode the set and filter it further based on the geocoder score and status. | ArcMap | Once per Bulk Loader process | 1 hour | 5 minutes |
4 | Export full set (single batch) or subset (multiple batches) | Geocoding | For large datasets, create one of many subsets that will be run through the Bulk Loader in multiple batches. | ArcMap | One or more batches per Bulk Loader process | 30 minutes per batch | 5 minutes per batch |
5 | Bulk Load batch (full set or subset) | Bulk Loading | Run the entire set, or each subset, through the Bulk Loader. | | One or more batches per Bulk Loader process | 1 hour per batch | 5 minutes per batch |
6 | Extract results | Bulk Loading | Extract and archive the list of addresses added to the EAS, the unique EAS 'change request id' associated with the batch, and the addresses rejected by the Bulk Loader. | PostgreSQL / pgAdmin | One or more batches per Bulk Loader process | 1 hour per batch | 5 minutes per batch |
7 | Cleanup and Restoration | Bulk Loading | Clean up the database, restore services and, in the event of a failure, restore from backup. | PostgreSQL / pgAdmin | One or more batches per Bulk Loader process | 1 hour per batch | 5 minutes per batch |
Never load any new addresses into production until a successful trial run is performed on the same addresses in a non-production environment, such as development or QA.

Be aware of the downstream implications of running the Bulk Loader. The Bulk Loader adds new addresses to the EAS, which has a ripple effect of populating multiple databases.

Some of the downstream databases affected by running the Bulk Loader:

It is highly recommended that a backup be made of the database prior to running the Bulk Load step in Stage 5.

If a problem is noticed immediately after running the Bulk Loader it may be possible to restore the database from backup.

To facilitate an immediate backout it is recommended that the Bulk Loader process be run after hours and with access to the EAS temporarily suspended. Otherwise, backtracking may become difficult if not impossible.

When Bulk Loading in a production environment, treat it like a software release by following protocols to stop and start the relevant services as outlined in the steps below.

When releasing in the production environment, pick a time outside core work hours that does not conflict with any cron jobs running on the production server.
Artifacts are generated throughout the stages of the Bulk Loader Process. Below is a suggested folder and file layout for the various artifacts.
###########
# Folders #
###########

./                         # root (YYYYMMDD-tag e.g. 20190510-SourceNameHere)
./gap                      # Artifacts from the General Address Parser stage
./geocoder                 # input csv and output shapefiles
./bulkload_001             # artifacts from the first batch
./bulkload_00N             # artifacts from the Nth batch, if any
./bulkload_00N/shapefile/  # this is the input for the Bulk Loader step
./excluded_records         # records from all stages set aside for later review
./excluded_records/gap
./excluded_records/geocoder
./excluded_records/bulkload_001
./excluded_records/bulkload_00N

#########
# Files #
#########

./gap/gap_YYYYMMDD.csv                    # GAP output; geocoder input
./geocoder/bulkloader.mxd                 # ArcMap file
./geocoder/addresses_geocoded.shp         # All geocoded records
./geocoder/geocode_score_100.shp          # Records with non-tied, perfect geocoder score
./bulkload_00N/shapefile/bulkload.shp     # Input for the Bulk Loader stage
./bulkload_00N/address_base.csv           # New base records
./bulkload_00N/addresses.csv              # New base+unit records
./bulkload_00N/change_request.csv         # Single record containing the unique change request id associated with this batch
./bulkload_00N/record-counts-after.csv    # EAS record counts after the Bulk Load
./bulkload_00N/record-counts-before.csv   # EAS record counts before the Bulk Load
./excluded_records/geocoder/geocode_under_100.shp   # Tied and/or non-perfect geocoder score
./excluded_records/bulkload_00N/address_extract_exceptions.csv
./excluded_records/bulkload_00N/address_extract_exception_totals.csv
This optional stage is run once per Bulk Loader process. It can be skipped if the reference dataset is already available or if the optional 'filter by reference' step (Step 2.5) is skipped.
Import CSV file into PostgreSQL table (++<odbc>)
python csv2pg.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --input_file=./path/to/raw_reference_file.csv \
    --output_table=reference_raw
Parse address table (++<odbc>)
python parse_address_table.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --address_table=reference_raw \
    --primary_key=sfgisgapid \
    --address_column=address \
    --city_column=city \
    --state_column=state \
    --zip_column=zip \
    --output_table=reference_parsed \
    --output_report_file=./output/reference.xlsx \
    --limit=-1
This stage is run once per Bulk Loader process. It involves importing, parsing and filtering the source dataset; the result is the dataset that will be geocoded in Stage 3.
Import CSV file into PostgreSQL table (++<odbc>)
python csv2pg.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --input_file=./path/to/raw_source_file.csv \
    --output_table=addresses_raw
Create table addresses_no_range

CREATE TABLE addresses_no_range AS SELECT * FROM addresses_raw WHERE NOT add_number LIKE '%-%';

Create table addresses_with_range

CREATE TABLE addresses_with_range AS SELECT * FROM addresses_raw WHERE add_number LIKE '%-%';
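An optional sanity check: the two subsets should account for every record in the raw table. One way to confirm this:

-- Optional sanity check: no_range_count + with_range_count should equal raw_count.
SELECT
    (SELECT Count(*) FROM addresses_raw)        AS raw_count,
    (SELECT Count(*) FROM addresses_no_range)   AS no_range_count,
    (SELECT Count(*) FROM addresses_with_range) AS with_range_count;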
Parse address table (++<odbc>)
python parse_address_table.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --address_table=addresses_no_range \
    --primary_key=sfgisgapid \
    --address_column=address \
    --city_column=city \
    --state_column=state \
    --zip_column=zip \
    --output_table=addresses_parsed \
    --output_report_file=./output/source_report.xlsx \
    --limit=-1
Remove records with post directional values (++<odbc>)
python remove_postdirectionals.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --address_table=addresses_parsed \
    --output_table=addresses_no_post_dir \
    --output_report_file=./output/source_report.xlsx \
    --limit=-1
Create table addresses_with_post_dir
CREATE TABLE addresses_with_post_dir AS SELECT * FROM addresses_parsed WHERE NOT id IN (SELECT id FROM addresses_no_post_dir);
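Optionally, record how many records were set aside by this filter for later review. One way (assuming the parsed table carries the street_name_post_directional column used by the scripts in this stage) is to group the excluded records by their post-directional value:

-- Optional: count of excluded records, grouped by post-directional value.
-- Assumes addresses_with_post_dir has the street_name_post_directional column.
SELECT street_name_post_directional, Count(*) AS excluded_count
FROM addresses_with_post_dir
GROUP BY street_name_post_directional
ORDER BY excluded_count DESC;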
This optional step excludes from the source dataset any addresses that are not also found in the parsed reference dataset.
For an understanding of how this step handles duplicates in either the source or the reference, see the comments in EAS Issue 281, "Refine the filtering regime for address data to be bulk-loaded".
Add field to flag if address is found in reference (++<odbc>)
python is_address_in_reference.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --source_table=addresses_no_post_dir \
    --reference_table=reference_parsed \
    --source_pk=id \
    --reference_pk=id \
    --source_address_number=address_number \
    --source_address_number_suffix=address_number_suffix \
    --source_street_name_pre_directional=street_name_pre_directional \
    --source_street_name_pre_type=street_name_pre_type \
    --source_street_name=street_name \
    --source_street_name_post_type=street_name_post_type \
    --source_street_name_post_directional=street_name_post_directional \
    --source_subaddress_type=subaddress_type \
    --source_subaddress_identifier=subaddress_identifier \
    --source_place_name=place_name \
    --reference_address_number=address_number \
    --reference_address_number_suffix=address_number_suffix \
    --reference_street_name_pre_directional=street_name_pre_directional \
    --reference_street_name_pre_type=street_name_pre_type \
    --reference_street_name=street_name \
    --reference_street_name_post_type=street_name_post_type \
    --reference_street_name_post_directional=street_name_post_directional \
    --reference_subaddress_type=subaddress_type \
    --reference_subaddress_identifier=subaddress_identifier \
    --reference_place_name=place_name \
    --source_output_field=in_reference \
    --limit=-1
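Before splitting the table in the next two steps, a quick look at the distribution of the new flag can serve as a sanity check. One way:

-- Optional: distribution of the in_reference flag set by the script above.
SELECT in_reference, Count(*) AS record_count
FROM addresses_no_post_dir
GROUP BY in_reference
ORDER BY in_reference;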
Create and archive table addresses_in_reference

CREATE TABLE addresses_in_reference AS SELECT * FROM addresses_no_post_dir WHERE in_reference = 1;

Create and archive table addresses_not_in_reference

CREATE TABLE addresses_not_in_reference AS SELECT * FROM addresses_no_post_dir WHERE in_reference = 0;
Create table addresses_to_geocode
-- Select, order and save as 'addresses_to_geocode'
CREATE TABLE addresses_to_geocode AS
SELECT *
FROM addresses_in_reference
ORDER BY
    street_name,
    street_name_pre_directional,
    street_name_pre_type,
    street_name_post_type,
    street_name_post_directional,
    address_number,
    address_number_suffix,
    subaddress_type,
    subaddress_identifier,
    place_name;

Add counter field

-- Add and index a counter field
ALTER TABLE addresses_to_geocode DROP COLUMN IF EXISTS counter;
ALTER TABLE addresses_to_geocode ADD COLUMN counter SERIAL;
CREATE INDEX ON addresses_to_geocode (counter);
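The counter field is what makes batch selection possible in Stage 4. For illustration only, an equivalent SQL selection of the second batch of 50,000 records would be (in the exported shapefile the field appears as counter_):

-- Illustration only: SQL equivalent of the batch filter applied in Stage 4.
SELECT *
FROM addresses_to_geocode
WHERE counter > 50000 AND counter <= 100000;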
Export PostgreSQL table to CSV file for Bulk Loading (++<odbc>)
python pg2bulkloader.py \
    --odbc_server=<odbc-server> \
    --odbc_port=<odbc-port> \
    --odbc_database=<odbc-database> \
    --odbc_uid=<odbc-uid> \
    --odbc_pwd=<odbc-pwd> \
    --input_table=addresses_to_geocode \
    --output_file_name=./output/addresses_to_geocode \
    --input_source='Imported Addresses (20XX-XX-XX)' \
    --limit=-1
Detailed Substeps
Create folder for storing all input and output artifacts for this iteration of the Bulk Loader process.
For example, R:\Tec\..\Eas\_Task\2018_2019\path\to\archive\bulkloader_process_YYYYMMDD
Create new ArcMap map document (ArcMap 10.6.1)
For example, bulkloader_YYYYMMDD.mxd
Add streets (optional)
See StClines_20190129.shp in R:\Tec\...\Eas\_Task\2018_2019\20181128_248_DocumentBulkLoader\Data
Create Personal Geodatabase
Right-click Home in Folder Connections in Catalog and select New → Personal Geodatabase
Import CSV into Personal Geodatabase
Right-click new personal geodatabase: Import → Table (single)
Browse to addresses_to_geocode.csv
Specify output table addresses_to_geocode
Click OK and time on a stopwatch. Wait up to 5 minutes for the TOC to update.
Right-click addresses_to_geocode in the ArcMap table of contents and select Geocode Addresses
Under 'Choose an address geocoder to use', select 'Add' and browse to R:\311\...\StClines_20150729_VirtualAddressLocator (TODO - Replace with new path). Click 'OK'.
Under 'Address Input Fields', select 'Multiple Fields' and map the address and zip fields.
For the Output, set 'Save as type' to 'Shapefile'. Save the shapefile in the path dedicated to artifacts for this Bulk Loader process, for example R:\Tec\..\Eas\_Task\2018_2019\path\to\archive\bulkloader_YYYYMMDD\geocoder\addresses_geocoded.shp
Time on a stopwatch and take note of the execution time.

Output / Artifacts

bulkloader_process_YYYYMMDD.mxd
addresses_geocoded.shp
The purpose of this step is to filter geocoded addresses based on their geocoding score and status.
Input - addresses_geocoded.shp
Detailed Substeps
Open the ArcMap document created in the previous step.
Filter matched addresses (status = 'M') with score of 100.
In the Table of Contents, right-click the shapefile from the previous step, e.g. addresses_geocoded.shp, and select Open Attributes Table.
Click the first icon in the menu bar (top-left) and select Select By Attributes.
Enter the following WHERE clause to select matched addresses with a score of 100:
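("Status" = 'M') AND ("Score" = 100)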
Click Apply, wait for the operation to complete and then close the Attributes window.
Save results to geocoder_score_100.shp
In the Table of Contents right-click the shapefile from the previous step and select Data → Export Data
Click the browse icon.
In the file browser, select the Save as type dropdown and select Shapefile.
Save the shapefile to the artifacts folder dedicated to this iteration of the Bulk Loader Process.
Filter the opposite set: status not matched ('M') or score not 100.
Right-click addresses_geocoded.shp and open the attributes table.
Enter the following WHERE clause:
NOT ("Status" = 'M') OR NOT ("Score" = 100)
Save results to geocoder_notmatched_or_under100.shp. (See substeps above.)
Output / Artifacts
geocoder_score_100.shp - Shapefile of matched addresses (status = 'M') with geocoding score of 100.
geocoder_notmatched_or_under100.shp - Shapefile of non-matched addresses (unmatched or tied) or with geocoding score less than 100.
The total number of records of the two output shapefiles should be the same as the number of records in the input shapefile.
Stages 4, 5 and 6 can be run one time with the results from Stage 3, or they can be run in multiple batches of subsets. A major consideration in deciding whether to run the full set at once or in batches is the number of records being Bulk Loaded. The size of each Bulk Loader operation affects the following aspects of the EAS:
For medium-to-large datasets (input sets with over 1,000 records) it is recommended that the Bulk Loading process be run in batches over several days or weeks. Reminder! It is required that the process first be run on a development server to assess the implications of the operation. The remaining steps will document a single batch iteration. Repeat these steps in a multi-batch process.
This example shows filtering for the second batch of 50,000 records using the counter_ field.
In the Table of Contents, right-click the layer geocoder_score_100 and select Open Attributes Table.
Click the first icon in the menu bar (top-left) and select Select By Attributes.
Enter the following WHERE clause to select the current batch of 50,000 records:

"counter_" > 50000 AND "counter_" <= 100000

Click Apply, wait for the operation to complete and then close the Attributes window.
In the Table of Contents, right-click the layer geocoder_score_100 and select Data → Export Data.
Click the browse icon.
In the file browser, select the Save as type dropdown and select Shapefile.
Save the shapefile to the artifacts folder dedicated to this iteration of the Bulk Loader Process.
e.g. R:\Tec\..\Eas\_Task\2018_2019\path\to\archive\bulkloader_process_YYYYMMDD\bulkloader\batch_NNN\bulkload.shp
For a complete set of steps and background about the Bulk Loader, see also Running the Bulk Loader, a page dedicated to its input, operation and results.
Never load any new addresses into production until a successful trial run is performed on the same addresses in a non-production environment, such as development or QA.
Disable the web service on <environment>_WEB (SF DEV WEB, SF QA WEB, SF PROD WEB):

cd /var/www/html
sudo ./set_eas_mode.sh MAINT
These steps are being performed to facilitate an immediate roll-back of the EAS database if the Bulk Load process ends in failure.
SKIP: Turn off the replication server

Disable database replication by shutting down the database service on the replication server (DR PROD DB).

#sudo -u postgres -i
#/usr/pgsql-9.0/bin/pg_ctl -D /data/9.0/data stop
Suspend downstream replication to internal business system database (SF PROD WEB).
sudo /var/www/html/eas/bin/xmit_change_notifications.bsh stop

Back up the database.

sudo -u postgres -i
/home/dba/scripts/dbbackup.sh > /var/tmp/dbbackup.log   # this step takes about 2 minutes
ls -l /var/tmp                                          # ensure the log file is 0 bytes
ls -la /mnt/backup/pg/daily/easproddb.sfgov.org-*       # the timestamp on the last file listed should match the timestamp of the backup
exit                                                    # logout of user postgres when done
Connect to the database, <environment>_DB, and clear any leftover records from previous Bulk Loader batches.

TRUNCATE bulkloader.address_extract, bulkloader.blocks_nearest;

VACUUM FULL ANALYZE bulkloader.address_extract;

VACUUM FULL ANALYZE bulkloader.blocks_nearest;
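Optionally, confirm that the staging tables are now empty before proceeding. One way:

-- Optional check: both staging tables should report zero rows.
SELECT
    (SELECT Count(*) FROM bulkloader.address_extract) AS address_extract_count,
    (SELECT Count(*) FROM bulkloader.blocks_nearest)  AS blocks_nearest_count;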
Make note of EAS record counts before the Bulk Loading operation.
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY schemaname, relname, n_live_tup;
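Note that n_live_tup is an estimate maintained by the PostgreSQL statistics collector. If exact counts of the two tables the Bulk Loader grows are preferred, one way to capture them:

-- Optional: exact counts of the two tables the Bulk Loader grows.
SELECT 'public.address_base' AS table_name, Count(*) AS record_count FROM public.address_base
UNION ALL
SELECT 'public.addresses', Count(*) FROM public.addresses;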
Make note of the database partition size on the file system at the current point in time.
date; df /data   # 1st of 3
Connect to the automation server, <environment>_AUTO. Substitute <environment> with one of the relevant environments: SF_DEV, SF_QA, SF_PROD, SD_PROD.

Copy the Bulk Loader shapefile to C:\apps\eas_automation\app_data\data\bulkload_shapefile.

Open a command prompt and change folders:
cd C:\apps\eas_automation\automation\src
Run the step to stage the address records:
python job.py --job stage_bulkload_shapefile --env <environment> --action EXECUTE --v

python job.py --job stage_bulkload_shapefile --env SF_DEV --action EXECUTE --v
python job.py --job stage_bulkload_shapefile --env SF_QA --action EXECUTE --v
python job.py --job stage_bulkload_shapefile --env SF_PROD --action EXECUTE --v
Run the step to bulk load the address records:

python job.py --job bulkload --env <environment> --action EXECUTE --v

python job.py --job bulkload --env SF_DEV --action EXECUTE --v
python job.py --job bulkload --env SF_QA --action EXECUTE --v
python job.py --job bulkload --env SF_PROD --action EXECUTE --v
To calculate the time it took to run the Bulk Loader look at the timestamps in the output or use a stopwatch or clock to time the operation.
Make note of the database partition size on the file system at this point. Compare with size of partition prior to loading to get the total disk space used as a result of running the Bulk Loader.
date; df /data   # 2nd of 3
Make note of EAS record counts after the Bulk Load operation.
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY schemaname, relname, n_live_tup;
In the spreadsheet, calculate the difference between the 'before' and 'after' record counts. The results will indicate the number of new base addresses added to the table `public.address_base` and the number of new addresses and units added to the table `public.addresses`.
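As an alternative to the spreadsheet, the same difference can be computed in SQL, assuming the 'before' and 'after' counts were saved into tables with the hypothetical names record_counts_before and record_counts_after and columns (schemaname, relname, n_live_tup):

-- Hypothetical alternative to the spreadsheet diff; the table names are assumptions.
SELECT b.schemaname,
       b.relname,
       a.n_live_tup - b.n_live_tup AS new_records
FROM record_counts_before b
JOIN record_counts_after a USING (schemaname, relname)
WHERE a.n_live_tup <> b.n_live_tup
ORDER BY b.schemaname, b.relname;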
See dedicated Bulk Loader page, Running the Bulk Loader, for more analysis options.
The Bulk Loader operation in Stage 5 populated the EAS table bulkloader.address_extract. If any errors occurred on a given address during the load, the Bulk Loader populated the 'exception_text' field of that table. Use a query tool such as pgAdmin to query and save the table as a CSV file.
SELECT * FROM bulkloader.address_extract;
Query subtotals of the 'exception_text' field

SELECT exception_text, Count(*)
FROM bulkloader.address_extract
GROUP BY exception_text
ORDER BY exception_text;
Save artifact as exception_text_counts.csv
Query all exception text records
SELECT *
FROM bulkloader.address_extract
WHERE NOT (exception_text IS NULL)
ORDER BY exception_text, id;
Save artifact as address_extract_exceptions.csv

Find the change_request_id associated with the Bulk Load

The Bulk Load operation created a new change_request_id. The value of <change_request_id> will be used in the next steps to count the addresses added to the EAS. Query the 'public.change_requests' table for the new 'change_request_id' value.
SELECT change_request_id
FROM public.change_requests
WHERE requestor_comment LIKE 'bulk load change request'
ORDER BY change_request_id DESC
LIMIT 1;
Query the new addresses records

Query the public.addresses table on the new change_request_id value.
SELECT * FROM public.addresses WHERE activate_change_request_id = <change_request_id>;
Query the new address_base records

Query the public.address_base table on the new change_request_id value.
SELECT activate_change_request_id, address_id, public.address_base.*
FROM public.address_base, public.addresses
WHERE public.address_base.address_base_id = public.addresses.address_base_id
  AND public.addresses.address_base_flg = TRUE
  AND public.addresses.activate_change_request_id = <change_request_id>;
Save artifact as address_base.csv
Compare the results of Stage 5 with the results from Stage 6.
The number of addresses found in Step 5.6 (2) should be identical to the number of addresses found in Step 6.3.
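For a quick comparison, one way (a suggestion; the archived CSV artifacts can also be counted directly) is to run count queries against the batch's change_request_id:

-- Suggestion only: totals for the Stage 5 / Stage 6 comparison.
-- Substitute <change_request_id> with the value found above.
SELECT Count(*) AS new_addresses
FROM public.addresses
WHERE activate_change_request_id = <change_request_id>;

SELECT Count(*) AS new_address_base_records
FROM public.address_base ab
JOIN public.addresses a ON a.address_base_id = ab.address_base_id
WHERE a.address_base_flg = TRUE
  AND a.activate_change_request_id = <change_request_id>;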
Connect to the database, <environment>_DB, and clear the address_extract records from the latest Bulk Loader batch. Make note of the final disk usage tally.

TRUNCATE bulkloader.address_extract, bulkloader.blocks_nearest;

VACUUM FULL ANALYZE bulkloader.address_extract;

VACUUM FULL ANALYZE bulkloader.blocks_nearest;

date; df /data   # 3rd of 3
# Optional step: archive output to 'df.txt' artifact
exit
Re-enable database replication by restarting the database service on the replication server (DR PROD DB).
#sudo -u postgres -i
#/usr/pgsql-9.0/bin/pg_ctl -D /data/9.0/data start
Resume downstream replication to internal business system database (SF PROD WEB).
#sudo /var/www/html/eas/bin/xmit_change_notifications.bsh start
Re-enable the web service on <environment>_WEB (SF DEV WEB, SF QA WEB, SF PROD WEB):

cd /var/www/html
sudo ./set_eas_mode.sh LIVE
exit
Contents of progress and summary artifact, BulkLoader_Process_YYYYMMDD.xlsx
Progress - This sheet contains a table of relevant totals for each batch
Batch number
Batch date
Input record counts
New base record counts
New unit record counts
Sample addresses
Email Jobs - This sheet contains a table of details related to the weekly 'Address Notification Report' automated email job
Batch range
Record count in batch range
Email Timestamp
Total record counts in email
Subtotal of records generated as a result of the Bulk Loader
Size of email attachment
Batch N - This sheet tracks the before and after record counts for all tables in the EAS database. There is a sheet for each batch loaded. Within each sheet is a section for the 'before' records, a section for the 'after' record counts, and a 'diff' column showing the change in record counts.
END OF STEPS
(+) Substitute EAS <environment> with one of the relevant environments: SF_DEV, SF_QA, SF_PROD, SD_PROD.
(++<odbc>) Substitute <odbc> arguments with values for an available PostgreSQL database.
<odbc-server> - Name or IP address of the database server, e.g. localhost
<odbc-port> - Port of the database server, e.g. 5432
<odbc-database> - Name of the database, e.g. awgdb
<odbc-uid> - User name
<odbc-pwd> - User password
(*) Scripts are written in and require Python 3. See the source code repository for more details.
(**) Artifacts should be saved in a network folder dedicated to the entire instance of a given Bulk Loader process. Artifact names shown are suggestions. Note: For large datasets the entire process could be spread over many days or weeks. Take this into consideration when naming any artifacts and subfolders.
This is the stack used for the development and testing of the steps. For best results run the steps with the same or compatible stack.