Algorithm for product similarity

From wiki.gpii

Similarity is a property of a pair of product records, A and B. It represents the probability that the product described in record A is the same as the one described in record B.

We assume that each database contains a given product only once. However, different versions of the same product may be represented by different records, so the comparison is also made between products of the same database. In that case a correction factor is applied: 0.2 is subtracted from the evaluated similarity. [1]

The following algorithm can be used to compare two products coming from EASTIN databases. To compare a record in the EASTIN DB with a record in the Unified Listing (UL), the algorithm has to be iterated over all the original sources of the UL record. The similarity between the UL record and the EASTIN record is then the maximum of the similarities between the EASTIN record and each original source. The fields of the UL record itself can be treated as an additional “original source”.
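The iteration over original sources described above can be sketched as follows. This is only an illustration, not the code of the EASTIN web service: the function name `ul_similarity`, its parameters, and the `similarity` callable are hypothetical, and returning 0 for a UL record without sources is an assumption.

```python
from typing import Callable, Iterable, TypeVar

Record = TypeVar("Record")


def ul_similarity(
    eastin_record: Record,
    ul_sources: Iterable[Record],
    similarity: Callable[[Record, Record], float],
) -> float:
    """Return the maximum similarity between the EASTIN record and any source.

    `ul_sources` holds the original sources of the UL record; the UL record's
    own fields may be passed in as one more source.
    Assumption (not in the text): an empty source list yields 0.0.
    """
    return max((similarity(eastin_record, s) for s in ul_sources), default=0.0)
```

Any pairwise similarity function with the signature `similarity(record_a, record_b)` can be plugged in.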

The source code of the algorithm implemented in the EASTIN web service is available here: File:Similarity Algorithm source code.docx

Fields to be compared:

  • commercial name
  • manufacturer: name and country
  • ISO 9999 classification codes: primary and optional ISO codes
  • insert date

The comparison of each field results in a score from 0 to 1. The similarity is calculated as the weighted average of these scores (see “Overall similarity” below).

Commercial name:

The procedure to calculate the Commercial name comparison score is the following:

  1. Compare the two strings representing the commercial name: if the strings are the same then the score is 1; else…
  2. Both strings are converted to lower case.
  3. “Tokens” (i.e. the words composing the name) are created from the two commercial names, using the following separators: “ “; “-“; “_”; ”/”; “&”. Example: “ZoomText Magnifier/Screen Reader” is tokenized, after lower-casing, into “zoomtext”, “magnifier”, “screen”, “reader”.
  4. Each token of record A is searched for in the commercial name of record B. The fraction of tokens of record A found in commercial name B is calculated (pta-b = number of tokens of A found in commercial name B / number of tokens of A).
  5. The same is done for the tokens of record B in commercial name A (ptb-a = number of tokens of B found in commercial name A / number of tokens of B).
  6. The maximum of pta-b and ptb-a is the Commercial name comparison score.
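The six steps above can be sketched in Python as follows. This is a minimal sketch, not the code of the EASTIN web service: the names `tokenize` and `name_score` are hypothetical. Two points are assumptions, because the text does not settle them: a token counts as “found” when it occurs as a substring of the other lower-cased name, and a name that yields no tokens scores 0.

```python
import re

# Separators listed in step 3: space, hyphen, underscore, slash, ampersand.
_SEPARATORS = re.compile(r"[ \-_/&]+")


def tokenize(name):
    """Lower-case a name and split it into non-empty tokens."""
    return [t for t in _SEPARATORS.split(name.lower()) if t]


def name_score(name_a, name_b):
    """Commercial name comparison score in [0, 1]."""
    if name_a == name_b:            # step 1: identical strings
        return 1.0
    a, b = name_a.lower(), name_b.lower()      # step 2
    tokens_a, tokens_b = tokenize(a), tokenize(b)  # step 3
    if not tokens_a or not tokens_b:
        return 0.0                  # assumption: no tokens means no match
    # Steps 4-5: fraction of tokens found (as substrings, an assumption)
    # in the other record's name.
    pt_ab = sum(t in b for t in tokens_a) / len(tokens_a)
    pt_ba = sum(t in a for t in tokens_b) / len(tokens_b)
    return max(pt_ab, pt_ba)        # step 6


# Only 2 of the 4 tokens of the first name occur in the second, but both
# tokens of the second occur in the first, so the maximum is 1.0.
print(name_score("ZoomText Magnifier/Screen Reader", "ZoomText Magnifier"))
```

Because the score is the maximum of the two fractions, a short name fully contained in a longer one scores 1 even though the names differ.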

 

Manufacturer comparison

manufacturer name:

A “stop word” list is created (e.g. “ltd”, “spa”, “srl”, “GmbH”, “inc”, “LLC”, “eurl”, “sarl”). These words are removed from the manufacturer name. After that step, the procedure to evaluate the Manufacturer name comparison score is the same as the one described above for the commercial name:

  1. Compare the two strings representing the manufacturer name: if the strings are the same then the score is 1; else…
  2. Both strings are converted to lower case.
  3. “Tokens” (i.e. the words composing the name) are created from the two manufacturer names using the separators: “ “; “-“; “_”; ”/”; “&”.
  4. Each token of record A is searched for in manufacturer name B. The fraction of tokens of A found in manufacturer name B is calculated (pta-b = number of tokens of A found in manufacturer name B / number of tokens of A).
  5. The same is done for the tokens of record B in manufacturer name A (ptb-a = number of tokens of B found in manufacturer name A / number of tokens of B).
  6. The maximum of pta-b and ptb-a is the Manufacturer name comparison score.
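A sketch of the manufacturer name comparison, including the stop-word step, might look like this. It is an illustration under stated assumptions rather than the EASTIN implementation: the function names are hypothetical, the stop-word list contains only the examples given above, stop words are matched case-insensitively on whitespace-separated words, and a token counts as “found” when it is a substring of the other name. Punctuation such as a trailing full stop (“Inc.”) is not handled.

```python
import re

# Only the example stop words from the text; the real list may be longer.
STOP_WORDS = {"ltd", "spa", "srl", "gmbh", "inc", "llc", "eurl", "sarl"}
_SEPARATORS = re.compile(r"[ \-_/&]+")


def strip_stop_words(name):
    """Remove stop words, compared case-insensitively (an assumption)."""
    return " ".join(w for w in name.split() if w.lower() not in STOP_WORDS)


def manufacturer_name_score(name_a, name_b):
    """Manufacturer name comparison score in [0, 1]."""
    a, b = strip_stop_words(name_a), strip_stop_words(name_b)
    if a == b:                      # step 1, after stop-word removal
        return 1.0
    a, b = a.lower(), b.lower()     # step 2
    tokens_a = [t for t in _SEPARATORS.split(a) if t]   # step 3
    tokens_b = [t for t in _SEPARATORS.split(b) if t]
    if not tokens_a or not tokens_b:
        return 0.0                  # assumption: no tokens means no match
    pt_ab = sum(t in b for t in tokens_a) / len(tokens_a)   # step 4
    pt_ba = sum(t in a for t in tokens_b) / len(tokens_b)   # step 5
    return max(pt_ab, pt_ba)        # step 6


# Both names reduce to "Ai Squared" once the stop words are removed.
print(manufacturer_name_score("Ai Squared Inc", "Ai Squared LLC"))  # 1.0
```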

 

manufacturer country

If the country is the same, the Manufacturer country comparison score is 1, otherwise 0. If the country is not available, the score is 0.

Manufacturer comparison score = Manufacturer name comparison score x 0.95 + Manufacturer country comparison score x 0.05
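As a small worked illustration of the formula above, the following sketch combines a name score with the country rule. The function names are hypothetical, and comparing the country values as-is (without normalising case or country-code format) is an assumption.

```python
from typing import Optional


def country_score(country_a: Optional[str], country_b: Optional[str]) -> float:
    """1 if both countries are available and equal, otherwise 0."""
    if not country_a or not country_b:
        return 0.0                  # country not available
    return 1.0 if country_a == country_b else 0.0


def manufacturer_score(name_score: float,
                       country_a: Optional[str],
                       country_b: Optional[str]) -> float:
    """Weighted combination: 0.95 for the name, 0.05 for the country."""
    return name_score * 0.95 + country_score(country_a, country_b) * 0.05
```

With these weights, two records with identical manufacturer names but a missing country reach at most 0.95.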

 

Classification codes

Classification codes are made up of ISO 9999 codes and ETNA subdivision codes (when available)

ISO codes:

Each ISO code of record A is compared with each ISO code of record B (i.e. all the possible combinations are evaluated). For example, if ISOa1 and ISOa2 are the codes of record A, and ISOb1 and ISOb2 are the codes of record B, then 4 comparisons are made: ISOa1 with ISOb1; ISOa1 with ISOb2; ISOa2 with ISOb1; ISOa2 with ISOb2.

Each ISO code is made up of three levels: Class (c), Subclass (s) and Division (d), so that ISOa1 = (ISOa1c ; ISOa1s ; ISOa1d). The overall procedure to evaluate all the single comparison scores is the following:

NumISOA = number of ISO codes in record A
NumISOB = number of ISO codes in record B

For i = 1 to NumISOA
    For k = 1 to NumISOB
        if ISOaic ≠ ISObkc then
            scoreai-bk = 0
        else
            if ISOais ≠ ISObks then
                scoreai-bk = 0.30
            else
                if ISOaid ≠ ISObkd then
                    scoreai-bk = 0.80
                else
                    scoreai-bk = 1
                end if
            end if
        end if
    end for
end for

With this procedure N scores (N = NumISOA x NumISOB) are calculated. The ISO code comparison score is the maximum of the scoreai-bk values evaluated with the procedure above.

The Classification code comparison score is evaluated as:

 Classification code comparison score = ISO code comparison score x 0.95
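The nested loop and the 0.95 factor above could be written in Python roughly as follows. This is a sketch under assumptions: ISO codes are represented as `(class, subclass, division)` tuples, the function names are hypothetical, and a record without ISO codes is given a score of 0, which the text does not specify.

```python
from itertools import product


def iso_pair_score(code_a, code_b):
    """Score one pair of (class, subclass, division) tuples."""
    if code_a[0] != code_b[0]:
        return 0.0                  # different class
    if code_a[1] != code_b[1]:
        return 0.30                 # same class, different subclass
    if code_a[2] != code_b[2]:
        return 0.80                 # same subclass, different division
    return 1.0                      # identical code


def iso_score(codes_a, codes_b):
    """Maximum pair score over all NumISOA x NumISOB combinations."""
    return max(
        (iso_pair_score(a, b) for a, b in product(codes_a, codes_b)),
        default=0.0,                # assumption: no codes means score 0
    )


def classification_score(codes_a, codes_b):
    """Classification code comparison score = ISO code score x 0.95."""
    return iso_score(codes_a, codes_b) * 0.95


# Best pair shares class and subclass but not division: 0.80.
print(iso_score([("22", "39", "12")], [("22", "39", "04"), ("22", "03", "18")]))
```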

Insert date

If the difference between the insert dates is greater than 5 years, the Insert date comparison score is 0.

If the difference is between 2 and 5 years, the Insert date comparison score is 0.5.

If the difference is less than 2 years, the Insert date comparison score is 1.
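The three date bands can be expressed as a short function. This is a sketch: the function name is hypothetical, a year is approximated as 365.25 days, and differences of exactly 2 or exactly 5 years are placed in the middle band. Both choices are assumptions, since the text does not define the boundaries.

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length


def insert_date_score(date_a: date, date_b: date) -> float:
    """Insert date comparison score: 1, 0.5 or 0 depending on the gap."""
    years = abs((date_a - date_b).days) / DAYS_PER_YEAR
    if years > 5:
        return 0.0
    if years >= 2:                  # assumption: boundaries are inclusive
        return 0.5
    return 1.0
```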

 

Overall similarity

similarity = (commercial name x 0.62 + manufacturer x 0.28 + classification codes x 0.08 + insert date x 0.02)
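Putting the weights together, a minimal sketch of the final computation could look like this. The dictionary keys, the function name and the `same_database` flag are hypothetical. The 0.2 subtraction is the correction factor described at the top of this page for records from the same database. The result is not clamped at 0, because the text does not say whether it should be.

```python
# Weights from the formula above; they sum to 1.
WEIGHTS = {
    "commercial_name": 0.62,
    "manufacturer": 0.28,
    "classification": 0.08,
    "insert_date": 0.02,
}
SAME_DATABASE_CORRECTION = 0.2


def overall_similarity(scores, same_database=False):
    """Weighted average of the four field scores, each in [0, 1]."""
    total = sum(WEIGHTS[field] * scores[field] for field in WEIGHTS)
    if same_database:
        total -= SAME_DATABASE_CORRECTION   # correction for same-DB pairs
    return total


scores = {
    "commercial_name": 1.0,
    "manufacturer": 1.0,
    "classification": 0.95,   # the classification score is capped at 0.95
    "insert_date": 1.0,
}
print(round(overall_similarity(scores), 3))  # 0.996
```

Since the classification code comparison score cannot exceed 0.95, the highest similarity this formula can produce is 0.996.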

The weights will be adjusted on the basis of tests to be carried out in the second iteration of the Cloud4all pilots.



[1] This restriction may be removed depending on how the Unified Listing will decide to manage the different versions of the same product.