Build a matching service - Government Digital Service on GitHub

Build a matching service
What is a matching service?
A matching service matches a user’s assured identity to a local identifier in the service’s
records, to allow them to use the service. Because there’s no unique identifier for
citizens in the UK, locating the record involves matching information about the user (eg
name, address, date of birth) against the service’s records. A matching service enables
a service to say that the John Smith trying to access the service is the same John Smith
already held on file locally.
The local identifier could be a known identifier such as an NHS number, or it could
simply be an identifier that the service allocates to a user.
Data matching is a fundamental requirement of identity assurance, which cannot function
without it.
It’s a service’s responsibility to develop and implement a matching service that carries
out this matching, and it must be located within the security domain. Services must also
determine what to do in the following circumstances:
● when multiple records are found that could be a match (risk of incorrect
matching)
● when no match is found, ie whether to create a new record for the user
How does a matching service work?
A matching service matches the user to the correct record in a service’s database using
a limited set of information, called the matching dataset, that has been verified by the
identity provider.
The matching dataset contains:
● name
● address
● date of birth
● gender
It may also contain historical data if available, for example previous addresses. The
matching service uses the matching dataset to make sure the service is dealing with the
right person.
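As an illustration only, the matching dataset described above could be modelled in a matching service roughly as follows. This is a minimal sketch: the class names, field names and the all_surnames helper are hypothetical and do not reflect the actual message format sent by the hub.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Address:
    # Hypothetical shape: free-text address lines plus an optional postcode.
    lines: List[str]
    postcode: Optional[str] = None


@dataclass
class MatchingDataset:
    # The four verified elements listed above.
    forenames: List[str]
    surname: str
    date_of_birth: date
    gender: Optional[str]
    current_address: Address
    # Historical data is only present if the identity provider supplies it.
    previous_addresses: List[Address] = field(default_factory=list)
    previous_surnames: List[str] = field(default_factory=list)

    def all_surnames(self) -> List[str]:
        """Current surname first, followed by any historical surnames."""
        return [self.surname] + list(self.previous_surnames)
```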
The hub sends the matching dataset to the matching service via the Matching Service
Adapter. Once matching has taken place, the user is directed to the correct record
within the service.
There are potentially 3 cycles of matching, as explained in the GOV.UK Verify
Architecture Overview. If, after these 3 cycles, there’s no match, the service can
optionally decide to create a new account for the user. See Creating user accounts for
more information.
Why is a matching service important?
The Identity Assurance Hub Service SAML 2.0 Profile states that a service must
nominate a matching service, even if it doesn’t need to perform matching. This is
because the matching service also performs other key functions. The matching service
also:
● creates the hashed persistent identifier
● creates the final assertion as sent to the service
● acts as the trust anchor for the service (signs the assertion)
● (optionally) attempts to find a local identifier that means something to the service
It’s up to the service to define its own matching service, but we’ve provided this
guidance for services to consider. We recognise that not all services have experience of
matching verified identities to government data sources, and that the process of
matching itself is often complex.
Matching assured identities to service records
Matching assured identities to records is likely to be complicated. A service needs to
establish the most efficient and effective way of doing this, depending on the quality of
data and how it’s stored and managed.
Matching may be made easier when there’s also a unique identifier (eg a passport
number) but this should not be used in isolation, and the process should always first
attempt a match to the matching dataset. There may be data matching problems where
services have old or slightly different information, for example previous addresses or
maiden names; or addresses and names in different formats.
Services have to carry out a risk-based match to find the account that corresponds to
the user. For example, the matching service must be able to handle users who have
proved their identities but whom the service can’t match with sufficient confidence to
any particular record, as well as users who match multiple records.
Confidence scoring for matching
Confidence scoring assigns a score to how closely the data matches; a greatly simplified
example would be:
● 100% match confidence means that all elements of data fully match
● 80% match confidence might mean that first name, surname and address match,
but date of birth is showing a mismatch
Mismatches in data can happen because, for example, the user typed the asserted
information incorrectly or the service records hold incorrect data. Service data may
be incorrect for many reasons, for example if someone has moved house but not
informed the service, so the newly asserted address doesn’t match the service
records.
It’s up to the service to configure the level of match confidence that’s needed, as
attitudes to risk and matching requirements may be different for each service.
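One possible way to turn this into a number is a weighted sum over the fields that agree. The sketch below is illustrative only: the field names, the weights and the match_confidence function are assumptions chosen so that the simplified example above (everything matching except date of birth) comes out at 80%. A real service would set its own weights according to its attitude to risk.

```python
from typing import Dict

# Hypothetical weights in percentage points; they must add up to 100.
WEIGHTS: Dict[str, int] = {
    "forename": 20,
    "surname": 30,
    "address": 30,
    "date_of_birth": 20,
}


def match_confidence(asserted: Dict[str, str], record: Dict[str, str]) -> int:
    """Return a 0-100 score: the summed weight of every field that matches exactly
    (ignoring case and surrounding whitespace)."""
    score = 0
    for field_name, weight in WEIGHTS.items():
        asserted_value = asserted.get(field_name, "").strip().lower()
        record_value = record.get(field_name, "").strip().lower()
        # An empty asserted value never counts as a match.
        if asserted_value and asserted_value == record_value:
            score += weight
    return score
```

The service would then compare the returned score with its own configured threshold.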
What approaches to consider when building a matching service
● Understand the different types of users
● Prepare for matching with customer data aggregation and data cleansing
● Define the rules for successful matching, for example allow synonyms for first
names such as William and Bill
● Specify additional attributes that can be used to match identities reliably and
securely (eg passport number). The matching service can then prompt the hub to
request additional attributes from the identity provider
● Understand, from both a business and a technical view, what level of ‘fuzzy
matching’ is acceptable and possible. The service can apply fuzzy matching
when an exact match isn’t found; it allows a match that, although not 100%, is
above a service-defined threshold matching percentage
● Define the fuzzy matching rules
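To make the idea of a service-defined fuzzy matching threshold concrete, here is a minimal sketch that uses the SequenceMatcher class from Python's standard difflib module. The similarity measure, the 0.85 threshold and the function names are assumptions for illustration, not a recommended rule set.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio between 0.0 and 1.0, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_fuzzy_match(asserted: str, recorded: str, threshold: float = 0.85) -> bool:
    """Accept the pair when the similarity reaches the service-defined threshold.
    The default threshold here is a placeholder, not guidance."""
    return similarity(asserted, recorded) >= threshold
```

A service would normally try an exact comparison first and fall back to a check like this only when no exact match is found.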
Matching strategy
We strongly recommend that the service define a matching strategy for matching
verified identity data (the matching dataset) to locally-held data sources that can
identify an account or local identifier for user transactions.
The matching strategy should take into account:
● quality of available data
● completeness of local data sources
● confidence level required in a successful match
● how to score the confidence of a match (based on multiple criteria)
We also recommend that the service analyse its local data sources in light of the
matching strategy, so that the service can test and refine the strategy before launching
alpha or beta services.
For more guidance, see Example matching strategy.
Common problems
Exact matching of identity data such as name, address and date of birth is rarely
possible because of the following:
● transcription errors
● spelling mistakes
● user-asserted data from previous records may include typos or errors
Services should be aware of this and devise a matching strategy that deals with these
issues. Some suggestions are as follows:
● widen initial queries (or traces) to ensure that relevant records are not missed or
false positives returned. For example, search for surname, date of birth and
postcode first, then examine the results to identify the required record with further
scored matching
● split names into separate forename(s) and surname elements and try synonym
matching against combinations of forename and surname, possibly transposing
them (eg surname, forename 1).
This is to overcome the common situation where people refer to themselves
using shortened versions of their name or nicknames, as opposed to their legal
name. For example, Mike Smith, Michael Smith and David Michael Smith may all
be different representations of the same person (usual name, official name and
legal name (eg from a passport))
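The synonym and transposition suggestions above could be sketched as follows. The synonym groups, function names and matching rule (surnames agree and at least one forename agrees after synonym expansion) are hypothetical; a real service would maintain its own synonym data and rules.

```python
from typing import List, Set

# Hypothetical synonym groups; a real service would hold a much larger list.
SYNONYM_GROUPS: List[Set[str]] = [
    {"michael", "mike", "mick"},
    {"william", "bill", "will"},
    {"stephen", "steven", "steve"},
]


def name_variants(name: str) -> Set[str]:
    """The normalised name plus every synonym from any group that contains it."""
    key = name.strip().lower()
    variants = {key}
    for group in SYNONYM_GROUPS:
        if key in group:
            variants |= group
    return variants


def forenames_overlap(asserted: List[str], recorded: List[str]) -> bool:
    """True if any asserted forename, or one of its synonyms, is a recorded forename."""
    recorded_keys = {n.strip().lower() for n in recorded}
    return any(name_variants(n) & recorded_keys for n in asserted)


def names_match(asserted_forenames: List[str], asserted_surname: str,
                record_forenames: List[str], record_surname: str) -> bool:
    surname = asserted_surname.strip().lower()
    # Normal order: surnames agree and at least one forename agrees.
    if surname == record_surname.strip().lower() and forenames_overlap(
            asserted_forenames, record_forenames):
        return True
    # Transposed order: the record holds the surname in a forename field
    # and a forename in the surname field.
    surname_in_forenames = surname in {n.strip().lower() for n in record_forenames}
    return surname_in_forenames and forenames_overlap(asserted_forenames, [record_surname])
```

With these assumed groups, an asserted "Mike Smith" would match a record held as "David Michael Smith", as in the example above.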
Example matching strategy
An example matching strategy for a service is outlined in the following steps. In this
example, the service’s local data source is called local-data.
1. Identify potential matches by trying to match the matching dataset to the
service’s local records (local-data). The first attempt at matching looks for a
100% match to full name and date of birth. If there are no matches, it widens as
follows:
a. Request the set of records (from local-data) that match both surname and
date of birth (from the matching dataset).
b. Perform the following matches:
Note: Include matches to historic names in the matching dataset, if provided.
● surname, forename 1, forename 2, date of birth
If forename 2 doesn’t exist, skip this match
● surname, forename 1, date of birth
● surname, date of birth
2. Look for a match to historic information in the matching dataset. As for step 1, but
match to historic names if provided in the matching dataset.
3. Look for a match to historic information in local-data (if available). As for step 1,
but match current matching dataset names to name history in local-data.
4. Match to partial data and synonyms. As for step 1, but with synonyms applied to
names from the matching dataset. For example Mike rather than Michael, Steven
rather than Steve or Stephen.
If there’s still no match, try further matching to surname and initials, for example
surname, initial 1, date of birth.
5. If there are matches, apply the outcode element of the postcode (first element,
eg PO1) to increase match confidence (match score).
Of the remaining (potential) matches, try to match to:
a. Full postcode and first 6 characters of address line 1.
b. Outcode element of postcode and first 6 characters of address line 1.
c. Full postcode only.
6. If step 5 results in one or more matches:
a. If there’s a single match and the match confidence score is over the
acceptable limit, assume a match to the matching dataset.
b. If there are multiple matches over the match confidence threshold, proceed to cycle 3
matching (user-asserted match). See GOV.UK Verify Architecture Overview
for a description of the 3 matching cycles.
c. If all matches are below the match confidence threshold, proceed to address
and date of birth partial matching.
7. Look for address and date of birth partial matching. Date of birth is sometimes
incorrectly recorded so the service could perform additional address level
matching to find potential matches, as follows:
a. Retrieve records that match the following criteria:
union (surname, outcode (first element of postcode) + surname, first 6
characters of address line 1)
b. With these records, attempt the following matches at the matching service:
● surname, forename 1, forename 2, MDS-YYYY-DD-MM
where MDS-YYYY-DD-MM is the matching dataset date of birth in year-day-month order
If forename 2 does not exist, skip this match
● surname, forename 1, date of birth
● surname, date of birth
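Parts of the example strategy can be sketched in code. The following is a rough illustration of the step 1 cascade (most specific match first, then wider) and the step 6 decision only. The Person type, the function names and the outcome labels are hypothetical, local-data is represented as an in-memory list, and treating a score equal to the threshold as acceptable is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Person:
    surname: str
    forenames: List[str]
    date_of_birth: date


def _norm(value: str) -> str:
    return value.strip().lower()


def _forename(person: Person, index: int) -> Optional[str]:
    """Normalised forename at the given position, or None if it doesn't exist."""
    return _norm(person.forenames[index]) if len(person.forenames) > index else None


def cascade_match(mds: Person, local_data: List[Person]) -> List[Person]:
    # Step 1a: records that match both surname and date of birth.
    candidates = [r for r in local_data
                  if _norm(r.surname) == _norm(mds.surname)
                  and r.date_of_birth == mds.date_of_birth]
    first, second = _forename(mds, 0), _forename(mds, 1)
    # Step 1b: surname, forename 1, forename 2, date of birth (skipped if no forename 2).
    if first is not None and second is not None:
        hits = [r for r in candidates
                if _forename(r, 0) == first and _forename(r, 1) == second]
        if hits:
            return hits
    # Step 1b: surname, forename 1, date of birth.
    if first is not None:
        hits = [r for r in candidates if _forename(r, 0) == first]
        if hits:
            return hits
    # Step 1b: surname, date of birth.
    return candidates


def decide(scored: List[Tuple[Person, int]], threshold: int) -> str:
    """Step 6: choose the next action from (record, match score) pairs."""
    acceptable = [record for record, score in scored if score >= threshold]
    if len(acceptable) == 1:
        return "match"             # 6a: single match over the limit
    if len(acceptable) > 1:
        return "cycle_3"           # 6b: user-asserted match
    return "partial_matching"      # 6c: continue to step 7
```

Steps 2 to 5 and step 7 would reuse the same cascade with different inputs (historic names, synonyms, postcode and address elements), which is why the cascade is kept separate from the decision.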
Further guidance
Contact the GOV.UK Verify team or your engagement lead if your service needs more
help with building a matching service.