Added web-example for description
This application demonstrates how to use Python and InterSystems IRIS to fit a linear regression for the task of checking whether two text strings are similar.
Problem: automatically find, for each item in the nomenclature of directory A, its analogue in directory B. For example, directory A may be the price list of a pharmacy company and directory B a reference dictionary such as ESKLP (the federal single structured reference directory of drugs in Russia).
This example is available at https://paramon.esc.ru/csp/maf/index.html in guest mode (choose Инструменты / Распознавание, i.e. Tools / Recognition).
Input data:
Price list (left part of the screen),
Some dictionary (ESKLP, for example).
Candidates from ESKLP for every string of the price list, sorted by similarity: many candidates per price list position.
Information about every pair "price list position / ESKLP candidate".
Similarity is a classic linear regression function: we calculate metric values from the two strings, combine them with their weights, and if the resulting value of the function is not less than the minimum we require, we consider the two positions to be the same.
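The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the application: the function names, the sample metric values, the 0..1 range of each metric, and the scaling of the result to 0..100 are all assumptions.

```python
def similarity(metrics, weights):
    """Weighted average of metric values, scaled to 0..100.

    Assumes every metric value is already normalised to 0..1; the
    scaling to 0..100 is an assumption made for this illustration.
    """
    total_weight = sum(weights[name] for name in metrics)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * value for name, value in metrics.items())
    return 100.0 * weighted_sum / total_weight


def is_same_position(metrics, weights, minimum):
    """Accept the candidate when the score reaches the required minimum."""
    return similarity(metrics, weights) >= minimum


# Hypothetical metric values for one "price list / ESKLP candidate" pair.
pair_metrics = {"ProdName": 1.0, "ManufName": 0.9, "Country": 1.0, "Nomer": 0.5}
unit_weights = {name: 1.0 for name in pair_metrics}  # every weight starts at 1

score = similarity(pair_metrics, unit_weights)
print(round(score, 2), is_same_position(pair_metrics, unit_weights, minimum=80.0))
```

With equal weights every metric contributes equally to the score, so a weak metric can pull a wrong candidate close to a right one.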
Metrics:
Country - similarity of the country;
Decimal - similarity of two specially prepared lists of numbers;
LekForm - similarity of the dosage form, specially prepared;
ManufName - similarity of the manufacturer's name;
Ngramm - similarity of two strings by the n-gram method;
Nomer - similarity of the number of tablets in the pack;
ProdName - similarity of the product name;
Simber - similarity of two specially prepared lists of numbers;
Translit - similarity of two strings in transliteration;
Trigram - similarity of two transliterated strings by the n-gram method.
The methods that compute some of these metrics are shown in App.MAF.Metric.
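As an illustration of an n-gram metric such as Ngramm or Trigram, the sketch below compares the sets of character trigrams of two strings with the Dice coefficient. It is an assumption about how such a metric could work, not the implementation in App.MAF.Metric; the function names and the normalisation step (lower case, collapsed whitespace) are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a normalised string."""
    normalised = " ".join(text.lower().split())  # lower case, single spaces
    if len(normalised) < n:
        # Too short to split: treat the whole string as one gram.
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient of the two n-gram sets, in the range 0..1."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        # Two empty strings are identical; one empty string matches nothing.
        return 1.0 if grams_a == grams_b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngram_similarity("Aspirin tablets", "aspirin caps"))
```

The result is 1.0 for strings that differ only in case or spacing and 0.0 for strings that share no trigram, which makes it usable as one input of the similarity function.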
Information about every pair is collected in App_MAF.LinkML. It contains:
Code from the organization's nomenclature dictionary
ESKLP code (federal single structured reference directory of drugs in Russia)
Similarity value
The value of each metric.
A human-assigned mark for every link: whether the link is right or not.
At the start, the weights of all metrics are 1. In the example from the screenshot (code = 3045_1), the final values of the candidates are 96.38 and 95.5. The second candidate (95.5) is wrong, but the difference is not very big.
Solution
Fit the weights of all metrics, because some metrics are not very effective in deciding whether a string from our organization's dictionary and a string from ESKLP are the same. Then, when we compare another organization's nomenclature, there will be far fewer errors.
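A minimal sketch of the idea, assuming the weights are fitted by ordinary least squares on the human-marked links (1 = right link, 0 = wrong link). This is not the code of ml.match.RgrCoefProcess: the function name, the two-metric sample data and the use of plain gradient descent are assumptions made for illustration.

```python
def fit_weights(rows, labels, epochs=2000, learning_rate=0.5):
    """Fit linear-regression weights by batch gradient descent on squared error."""
    n_features = len(rows[0])
    weights = [1.0] * n_features  # start from weight = 1 for every metric
    for _ in range(epochs):
        gradient = [0.0] * n_features
        for row, label in zip(rows, labels):
            error = sum(w * x for w, x in zip(weights, row)) - label
            for j, x in enumerate(row):
                gradient[j] += error * x
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical marked links: columns are (ProdName, Country) metric values.
# ProdName separates right links from wrong ones; Country does not.
rows = [[1.0, 1.0], [0.9, 0.0], [0.2, 1.0], [0.1, 0.0]]
labels = [1, 1, 0, 0]  # human mark: 1 = right link, 0 = wrong link

weights = fit_weights(rows, labels)
print([round(w, 3) for w in weights])
```

In this toy data the informative metric ends up with a weight above 1 and the uninformative one with a weight close to zero, which is the effect the fitting step aims for.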
Reset the coefficients (IRIS Management Portal: System > SQL): update App_maf.PlanMetric set weight=1
Start the production (Interoperability > Configure > Production Configuration > Category: Match):
ml.match.RgrCoefProcess -> Start button
Test the production: ml.match.RgrCoefProcess > Actions > Test button
Get the result, see the column "weight" (IRIS Management Portal: System > SQL):
SELECT m.id AS metricId, link.weight, link.id AS linkId, link.Order AS Ord
FROM App_MAF.Plan plan
LEFT JOIN app_maf.PlanMetric link ON plan.id = link.plan
RIGHT JOIN app_maf.Metric m ON link.Metric = m.id
WHERE plan.id = 1 AND link.active = 1
ORDER BY Ord
After the weights are updated, the similarity function returns different values (choose the plan Лексредства V2 in Options).