Build a matching service - Government Digital Service on GitHub
Build a matching service

What is a matching service?

A matching service matches a user’s assured identity to a local identifier in the service’s records, to allow them to use the service. Because there’s no unique identifier for citizens in the UK, locating the record involves matching information about the user (eg name, address, date of birth) against the service’s records. A matching service enables a service to say that the John Smith trying to access the service is the same John Smith already held on file locally.

The local identifier could be a known identifier such as an NHS number, or it could simply be an identifier that the service allocates to a user.

Data matching is a fundamental requirement of identity assurance, which cannot function without it. It’s a service’s responsibility to develop and implement a matching service that carries out this matching, and it must be located within the security domain.

Services must also determine what to do in the following circumstances:

● when multiple records are found that could be a match (risk of incorrect matching)
● when no match is found, ie whether to create a new record for the user

How does a matching service work?

A matching service matches the user to the correct record in a service’s database using a limited set of information, called the matching dataset, that has been verified by the identity provider. The matching dataset contains:

● name
● address
● date of birth
● gender

It may also contain historical data if available, for example previous addresses.

The matching service uses the matching dataset to make sure the service is dealing with the right person. The hub sends the matching dataset to the matching service via the Matching Service Adapter. Once matching has taken place, the user is directed to the correct record within the service.

There are potentially 3 cycles of matching, as explained in the GOV.UK Verify Architecture Overview.
If, after these 3 cycles, there’s no match, the service can optionally decide to create a new account for the user. See Creating user accounts for more information.

Why is a matching service important?

The Identity Assurance Hub Service SAML 2.0 Profile states that a service must nominate a matching service, even if it doesn’t need to perform matching. This is because the matching service performs other key functions as well. It:

● creates the hashed persistent identifier
● creates the final assertion as sent to the service
● acts as the trust anchor for the service (signs the assertion)
● (optionally) attempts to find a local identifier that means something to the service

It’s up to the service to define its own matching service, but we’ve provided this guidance for services to consider. We recognise that not all services have experience of matching verified identities to government data sources, and that the process of matching itself is often complex.

Matching assured identities to service records

Matching assured identities to records is likely to be complicated. A service needs to establish the most efficient and effective way of doing this, depending on the quality of data and how it’s stored and managed.

Matching may be made easier when there’s also a unique identifier (eg a passport number) but this should not be used in isolation, and the process should always first attempt a match to the matching dataset.

There may be data matching problems where services have old or slightly different information, for example previous addresses or maiden names, or addresses and names in different formats.

Services have to carry out a risk-based match to find the account to match the user against. For example, the matching service must be able to handle users who have proved their identities, but who the service can’t match with sufficient confidence to any particular record, or even users who match multiple records.
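
As a rough illustration of the outcomes a matching service has to handle (no sufficiently confident match, exactly one, or several), here is a minimal Python sketch. The names `MatchOutcome` and `classify_candidates` and the 0.8 threshold are hypothetical and not part of GOV.UK Verify; each service would set its own threshold according to its attitude to risk.

```python
from enum import Enum


class MatchOutcome(Enum):
    NO_MATCH = "no-match"            # no record is confident enough
    MATCH = "match"                  # exactly one record is confident enough
    MULTIPLE_MATCHES = "multiple"    # several records are plausible


def classify_candidates(scored_candidates, threshold=0.8):
    """Classify scored candidate records.

    scored_candidates: list of (local_id, score) pairs, score in [0, 1].
    threshold: hypothetical service-defined minimum confidence.
    Returns (outcome, list of local ids at or above the threshold).
    """
    confident = [local_id for local_id, score in scored_candidates if score >= threshold]
    if not confident:
        return MatchOutcome.NO_MATCH, []
    if len(confident) == 1:
        return MatchOutcome.MATCH, confident
    return MatchOutcome.MULTIPLE_MATCHES, confident
```

A service would then decide what each outcome means for it, for example whether to offer a new record when the outcome is `NO_MATCH`.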
Confidence scoring for matching

A confidence score expresses how closely the matching dataset agrees with a record. A greatly simplified example would be:

● 100% match confidence means that all elements of data fully match
● 80% match confidence might mean that first name, surname and address match, but date of birth is showing a mismatch

Mismatches in data can happen because, for example, the user typed the asserted information incorrectly or incorrect data is held in the service records. Service data may be incorrect for many reasons, for example if someone has moved house but not informed the service. In that case the new address asserted doesn’t match the service records.

It’s up to the service to configure the level of match confidence that’s needed, as attitudes to risk and matching requirements may be different for each service.

What approaches to consider when building a matching service

● Understand the different types of users
● Prepare for matching with customer data aggregation and data cleansing
● Define the rules for successful matching, for example allow synonyms for first names such as William and Bill
● Specify additional attributes that can be used to match identities reliably and securely (eg passport number). The matching service can then prompt the hub to request additional attributes from the identity provider
● Understand, from both a business and a technical view, what level of ‘fuzzy matching’ is acceptable and possible. The service can apply fuzzy matching when an exact match isn’t found; it allows a match that, although not 100%, is above a service-defined threshold matching percentage
● Define the fuzzy matching rules

Matching strategy

We strongly recommend that the service define a matching strategy for the successful matching of verified identity data (matching dataset) to locally-held data sources able to identify an account or local identifier for user transactions.
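
The simplified confidence scoring example given earlier (100% when everything matches, 80% when only date of birth differs) could be sketched as follows. The field names, the weights and the 0.8 threshold are illustrative assumptions only, chosen so that a date of birth mismatch produces the 80% figure; a real service would derive its own values from its matching strategy.

```python
# Hypothetical weights: how much each field contributes to the overall score.
FIELD_WEIGHTS = {
    "first_name": 0.25,
    "surname": 0.25,
    "address": 0.30,
    "date_of_birth": 0.20,
}


def normalise(value):
    """Lower-case and collapse whitespace so trivial differences don't count."""
    return " ".join(str(value).lower().split())


def match_confidence(matching_dataset, local_record, weights=FIELD_WEIGHTS):
    """Return a score in [0, 1]: the summed weight of the fields that agree."""
    score = 0.0
    for field, weight in weights.items():
        asserted = matching_dataset.get(field)
        held = local_record.get(field)
        if asserted is not None and held is not None and normalise(asserted) == normalise(held):
            score += weight
    return round(score, 2)


def is_acceptable(score, threshold=0.8):
    """Compare a score with the (service-defined, here assumed) threshold."""
    return score >= threshold
```

Exact comparison after normalisation is the simplest possible rule; a service that allows fuzzy matching would replace the equality test with its own comparison.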
The matching strategy should take into account:

● quality of available data
● completeness of local data sources
● confidence level required in a successful match
● how to score the confidence of a match (based on multiple criteria)

We also recommend that the service analyse its local data sources in light of the matching strategy, so that the service can test and refine the strategy before launching alpha or beta services. For more guidance, see Example matching strategy.

Common problems

Exact matching of identity data such as name, address and date of birth is rarely possible because of the following:

● transcription errors
● spelling mistakes
● user-asserted data from previous records may include typos or errors

Services should be aware of this and devise a matching strategy that deals with these issues. Some suggestions are as follows:

● widen initial queries (or traces) to ensure that relevant records are not missed or false positives returned. For example, search for surname, date of birth and postcode first, then examine the results to identify the required record with further scored matching
● split names into separate forename(s) and surname elements and try synonym matching against combinations of forename and surname, possibly transposing them (eg surname, forename 1). This is to overcome the common situation where people refer to themselves using shortened versions of their name or nicknames, as opposed to their legal name. For example, Mike Smith, Michael Smith and David Michael Smith may all be different representations of the same person (usual name, official name and legal name (eg from a passport))

Example matching strategy

An example matching strategy for a service is outlined in the following steps. In this example, the service’s local data source is called local-data.

1. Identify potential matches by trying to match the matching dataset to the service’s local records (local-data).
   The first attempt at matching looks for a 100% match to full name and date of birth. If there are no matches, it widens as follows:
   a. Request the set of records (from local-data) that match both surname and date of birth (from the matching dataset).
   b. Perform the following matches:
      Note: Include matches to historic names in the matching dataset, if provided.
      ● surname, forename 1, forename 2, date of birth (if forename 2 doesn’t exist, skip this match)
      ● surname, forename 1, date of birth
      ● surname, date of birth

2. Look for a match to historic information in the matching dataset.
   As for step 1, but match to historic names if provided in the matching dataset.

3. Look for a match to historic information in local-data (if available).
   As for step 1, but match current matching dataset names to name history in local-data.

4. Match to partial data and synonyms.
   As for step 1, but with synonyms applied to names from the matching dataset. For example Mike rather than Michael, Steven rather than Steve or Stephen. If there’s still no match, try further matching to surname and initials, for example surname, initial 1, date of birth.

5. If there are matches, apply the outcode element of the postcode (first element, eg PO1) to increase match confidence (match score). Of the remaining (potential) matches, try to match to:
   a. Full postcode and first 6 characters of address line 1.
   b. Outcode element of postcode and first 6 characters of address line 1.
   c. Full postcode only.

6. If step 5 results in one or more matches:
   a. If there’s a single match and the match confidence score is over the acceptable limit, assume a match to the matching dataset.
   b. If there are multiple matches over the confidence score, proceed to cycle 3 matching (user-asserted match). See GOV.UK Verify Architecture Overview for a description of the 3 matching cycles.
   c. If all matches are below the match confidence threshold, proceed to address and date of birth partial matching (step 7).

7. Look for address and date of birth partial matching.
   Date of birth is sometimes incorrectly recorded so the service could perform additional address level matching to find potential matches, as follows:
   a. Retrieve records that match the following criteria: union (surname, outcode (first element of postcode) + surname, first 6 characters of address line 1)
   b. With these records, attempt the following matches at the matching service:
      ● surname, forename 1, forename 2, MDS-YYYY-DD-MM, where MDS is matching dataset and YYYY-DD-MM is year-day-month (if forename 2 does not exist, skip this match)
      ● surname, forename 1, date of birth
      ● surname, date of birth

Further guidance

Contact the GOV.UK Verify team or your engagement lead if your service needs more help with building a matching service.
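
Parts of the example matching strategy above (steps 1, 4 and 5) could be sketched in Python as follows. This is a simplified illustration, not a reference implementation: the record layout (`surname`, `forenames`, `date_of_birth`, `postcode`, `address_line_1`), the synonym table and all function names are assumptions made for the example.

```python
# Hypothetical synonym table; a real service would maintain a much larger one.
NAME_SYNONYMS = {"mike": "michael", "bill": "william", "steve": "steven", "stephen": "steven"}


def canonical(name):
    """Map a forename to a canonical form so that Mike and Michael compare equal."""
    key = name.strip().lower()
    return NAME_SYNONYMS.get(key, key)


def norm_postcode(postcode):
    """Upper-case a postcode and remove spaces before comparing."""
    return postcode.upper().replace(" ", "")


def outcode(postcode):
    """First element of a postcode, eg 'PO1'. Assumes the postcode contains a space."""
    parts = postcode.strip().upper().split()
    return parts[0] if parts else ""


def address_key(record):
    """First 6 characters of address line 1, lower-cased."""
    return record["address_line_1"].strip().lower()[:6]


def find_candidates(mds, local_data):
    """Step 1a: records that agree on surname and date of birth."""
    return [
        r for r in local_data
        if r["surname"].lower() == mds["surname"].lower()
        and r["date_of_birth"] == mds["date_of_birth"]
    ]


def match_tier(mds, record, use_synonyms=False):
    """Step 1b (and step 4 with synonyms): most specific match that succeeds."""
    conv = canonical if use_synonyms else str.lower
    mds_names = [conv(n) for n in mds["forenames"]]
    rec_names = [conv(n) for n in record["forenames"]]
    if len(mds_names) >= 2 and mds_names[:2] == rec_names[:2]:
        return "surname, forename 1, forename 2, date of birth"
    if mds_names[:1] and mds_names[:1] == rec_names[:1]:
        return "surname, forename 1, date of birth"
    return "surname, date of birth"


def narrow_by_address(mds, candidates):
    """Step 5: try checks a, b and c in order and return the first non-empty result."""
    same_pc = lambda r: norm_postcode(r["postcode"]) == norm_postcode(mds["postcode"])
    same_out = lambda r: outcode(r["postcode"]) == outcode(mds["postcode"])
    same_addr = lambda r: address_key(r) == address_key(mds)
    checks = [
        lambda r: same_pc(r) and same_addr(r),   # a. full postcode + address prefix
        lambda r: same_out(r) and same_addr(r),  # b. outcode + address prefix
        same_pc,                                 # c. full postcode only
    ]
    for check in checks:
        hits = [r for r in candidates if check(r)]
        if hits:
            return hits
    return []
```

A service would still need to turn the tier and address results into a match confidence score and compare it with its own threshold, as described in step 6.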