This application demonstrates how to use Python and InterSystems IRIS to
fit a linear regression model for the task of checking the similarity of two text strings. The strings contain descriptions of goods.
Problem: automatically find, for every item in the nomenclature of
directory A, its analogue in directory B. For example, A is the price
list of a pharmacy company and B is a reference dictionary such as ESKLP
(the federal unified structured reference directory of drugs in Russia).
This example is available at https://paramon.esc.ru/csp/maf/index.html
in guest mode (choose Инструменты / Распознавание, that is, Tools / Recognition)
Input data:
Price list (left part of the screen)
Some dictionary (ESKLP, for example)
Output data:
Candidates from ESKLP for every string of the price list, sorted by
similarity: many candidates for one position of the price list.
Information about every pair “Price list – candidate ESKLP”
Similarity is a classic linear regression function: we calculate metric
values from the two strings, and if the total value of the function is
greater than the minimum we require, we can say that the two positions
are the same.
Metrics:
Country - similarity of country;
Decimal - similarity of two specially prepared lists of numbers;
LekForm - similarity of dosage form, specially prepared;
ManufName - similarity of manufacturer's name;
Ngramm - similarity of two strings by the n-gram method;
Nomer - similarity of the tablet count per pack;
ProdName - similarity of product name;
Simber - similarity of two specially prepared lists of numbers;
Translit - similarity of two strings in transliteration;
Trigram - similarity of two transliterated strings by the n-gram method;
BarcodeSimilarity (new) - similarity of two strings that contain (or do not contain) barcodes
Some of the methods that compute these metric values are shown in
App.MAF.Metric.
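As an illustration of how an n-gram metric such as Ngramm or Trigram can be computed, here is a minimal Python sketch. It is not the code from App.MAF.Metric: the function names, the lowercase and whitespace normalization, and the use of the Jaccard overlap of character trigrams are all assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Strings shorter than n are treated as a single gram.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets, in the range 0..1."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are considered identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical strings: a price-list item and a dictionary candidate.
score = ngram_similarity("Aspirin tab. 500 mg N10", "Aspirin tablets 500mg No.10")
```

A value close to 1 means the two strings share most of their trigrams; a value close to 0 means they share almost none.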
Information about every pair is collected in App_MAF.LinkML. It
contains:
Code from the organization’s nomenclature dictionary
ESKLP code (federal unified structured reference directory of drugs
in Russia)
Similarity value
The value of each metric
A human-assigned mark for every link: whether the link is right or not
At the start, the weights of all metrics are 1. For example, in the
screenshot (code = 3045_1) the final values of the two candidates are
96.38 and 95.5. The second candidate (95.5) is wrong, but the difference
is not very big.
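The similarity function described above can be sketched in Python as a weighted sum of metric values. This is only an illustration: the function names, the scaling of the result to 0..100 (a weighted average multiplied by 100), the threshold, and the sample metric values are assumptions, not the application's actual code or data.

```python
METRICS = [
    "Country", "Decimal", "LekForm", "ManufName", "Ngramm", "Nomer",
    "ProdName", "Simber", "Translit", "Trigram", "BarcodeSimilarity",
]


def similarity_score(metric_values, weights):
    """Weighted average of the metric values (each in 0..1), scaled to 0..100."""
    total_weight = sum(weights[name] for name in metric_values)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * value for name, value in metric_values.items())
    return 100.0 * weighted / total_weight


def rank_candidates(candidates, weights, threshold):
    """Return (code, score) pairs that reach the threshold, best first."""
    scored = [(code, similarity_score(values, weights))
              for code, values in candidates.items()]
    accepted = [pair for pair in scored if pair[1] >= threshold]
    return sorted(accepted, key=lambda pair: pair[1], reverse=True)


# Initial state: every metric has weight 1.
weights = {name: 1.0 for name in METRICS}

# Hypothetical metric values for two candidates of one price-list position.
candidates = {
    "cand_1": {"ProdName": 1.0, "Ngramm": 0.5},
    "cand_2": {"ProdName": 0.5, "Ngramm": 0.5},
}
ranked = rank_candidates(candidates, weights, threshold=60.0)
print(ranked)  # [('cand_1', 75.0)]
```

With equal weights, a weak metric pulls the score just as much as a strong one, which is why two candidates can end up with nearly the same value.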
Solution
Fit the weights of all metrics, because some of them are less effective
for deciding whether a string from our organization’s dictionary and a
string from ESKLP are the same or not. Then, when we compare another
organization’s nomenclature, there will be far fewer errors.
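One way to obtain such weights is to fit a linear regression on the human-marked links, with the metric values as features and the right/wrong mark (1 or 0) as the target. The sketch below does this in plain Python with gradient descent on the mean squared error. It is an illustration under assumptions, not the code behind ml.match.RgrCoefProcess: the function name, the learning rate, the number of epochs, and the toy data are all made up for the example.

```python
def fit_linear_weights(rows, labels, learning_rate=0.1, epochs=5000):
    """Fit weights and a bias that minimize the mean squared error
    between (weights . row + bias) and the 0/1 label."""
    n_samples = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j, xi in enumerate(x):
                grad_w[j] += 2.0 * error * xi / n_samples
            grad_b += 2.0 * error / n_samples
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy data (hypothetical): each row is [metric_1, metric_2] for one link,
# each label is the human mark (1 = right link, 0 = wrong link).
rows = [
    [1.0, 0.9], [0.9, 0.2], [0.8, 0.5],   # marked as right
    [0.2, 0.8], [0.1, 0.3], [0.3, 0.6],   # marked as wrong
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_linear_weights(rows, labels)
```

In this toy data the first metric separates right links from wrong ones and the second does not, so the fitted weight of the first metric comes out much larger than that of the second.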
Reset coefficients: in the terminal, run d ##class(App.MAF.Plan).ResetMetricsWeights(1)
Start the production: (Interoperability > Configure > Production
Configuration > Category: Match)
ml.match.RgrCoefProcess -> Start button
Test the production: ml.match.RgrCoefProcess > Actions > Test button
Choose “Ens.Request” in Request Type and press the “Invoke Testing Service” button. Wait until it finishes.
Get the result and look at the “weight” column (IRIS Management Portal:
System > SQL):
SELECT m.id AS metricId, link.weight, link.id AS linkId, link.Order AS Ord
FROM App_MAF.Plan plan
LEFT JOIN app_maf.PlanMetric link ON plan.id = link.plan
RIGHT JOIN app_maf.Metric m ON link.Metric = m.id
WHERE plan.id = 1 AND link.active = 1
ORDER BY Ord
Now we have new weight values for every metric. Why this is good: some metrics now express the similarity of strings more strongly than others, and we can get different similarity function values for different types of goods. For example, a barcode is less important for computer goods than for medicines, so the weight of the BarcodeSimilarity metric should be lower when checking computer goods than when checking the similarity of two strings that describe medicines.
So we can save different similarity-checking plans for different types of goods.
The new version adds a logistic regression calculation for comparison with the linear regression model. How it works:
We can see a big difference between the precision of the linear regression model and that of the logistic regression model.
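To make the comparison concrete, here is a minimal Python sketch of a logistic regression fitted on marked links, together with a precision function that can be applied to the predictions of either model. It is not the application's implementation: the function names, the gradient-descent settings, the 0.5 decision threshold, and the toy data are assumptions for illustration only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, learning_rate=0.5, epochs=3000):
    """Fit logistic regression by gradient descent on the mean log-loss."""
    n_samples = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for j, xi in enumerate(x):
                grad_w[j] += (p - y) * xi / n_samples
            grad_b += (p - y) / n_samples
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def precision(predicted, actual):
    """Share of predicted matches that are true matches: TP / (TP + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy data (hypothetical): metric values per link and the human mark.
rows = [
    [1.0, 0.9], [0.9, 0.2], [0.8, 0.5],
    [0.2, 0.8], [0.1, 0.3], [0.3, 0.6],
]
labels = [1, 1, 1, 0, 0, 0]

log_weights, log_bias = fit_logistic(rows, labels)
predicted = [
    1 if sigmoid(sum(w * xi for w, xi in zip(log_weights, x)) + log_bias) >= 0.5 else 0
    for x in rows
]
log_precision = precision(predicted, labels)
```

The same precision function can be applied to the thresholded output of the linear model, so both models are measured on the same marked links.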
Working with the web application (the code of the web application is not yet included in this repository)