Extracts text from PDF and PPTX files and from images (PNG, JPEG, …) using Embedded Python.
USER>zpm "install text-extractor"
apt-get install tesseract-ocr
If the text is in a language other than English, you will also need the corresponding Tesseract language package, for example tesseract-ocr-fra for French: apt-get install tesseract-ocr-fra
apt-get install poppler-utils
To get text from the whole document:
USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
USER>set string = pdf.Extract()
To get the number of pages:
USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
USER>set numpages = pdf.GetNumPages()
The first argument of the Extract method is the page number (starting from 0).
To get text from the first page of the document:
USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
USER>set string = pdf.Extract(0)
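Combining GetNumPages() and Extract(), the pages can also be processed one at a time. The following is a minimal sketch meant for a routine or class method rather than the terminal prompt; it uses only the two calls shown above, and the separator line it prints is purely illustrative.

```objectscript
set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
set numpages = pdf.GetNumPages()
// Pages are numbered from 0, so the last page is numpages-1
for page = 0:1:numpages-1 {
    write "--- page ", page, " ---", !
    write pdf.Extract(page), !
}
```

Working page by page keeps each extracted string small, which may be convenient for long documents.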
The examples above ignore images embedded in the .pdf, which may also contain text.
To get the text together with the text from those images, use:
USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
USER>set string = pdf.ExtractWithImages(0,"eng")
Another option is to save each .pdf page as an image and then extract the text from those images:
USER>set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
USER>set string = pdf.ExtractAsImages(0,"eng")
(use -1 as the first argument to process the whole document)
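One way to combine these methods is to try the plain text layer first and fall back to OCR only when nothing comes back, which is typical for scanned documents. This is a sketch under an assumption that this document does not state explicitly: that Extract() returns an empty or whitespace-only string for a page without a text layer.

```objectscript
set pdf = ##class(NSolov.TextExtract.PDF).%New("/full/path/to/file.pdf")
set string = pdf.Extract(0)
// Assumption: a page without a text layer yields an empty or whitespace-only string.
// $zstrip with "<>W" removes leading and trailing whitespace before the comparison.
if $zstrip(string, "<>W") = "" {
    // Fall back to OCR of the page saved as an image
    set string = pdf.ExtractAsImages(0, "eng")
}
```

OCR is usually much slower than reading the text layer, so this ordering avoids it whenever the page already carries text.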
To get text from the image:
USER>set img = ##class(NSolov.TextExtract.Image).%New("/full/path/to/file.png", "fra")
USER>set string = img.Extract()
(the second argument of %New() is the language, eng by default)
To get text from the whole presentation:
USER>set pptx = ##class(NSolov.TextExtract.PPTX).%New("/full/path/to/file.pptx")
USER>set string = pptx.Extract()
To get the number of slides:
USER>set pptx = ##class(NSolov.TextExtract.PPTX).%New("/full/path/to/file.pptx")
USER>set numslides = pptx.GetNumSlides()
The first argument of the Extract method is the slide number (starting from 0).
To get text from the first slide of the presentation:
USER>set pptx = ##class(NSolov.TextExtract.PPTX).%New("/full/path/to/file.pptx")
USER>set string = pptx.Extract(0)
From Interoperability, you can use the Business Operation NSolov.TextExtract.BusinessOperation with the request NSolov.TextExtract.PDFRequest for pdf, NSolov.TextExtract.PPTXRequest for pptx, and NSolov.TextExtract.ImageRequest for images.
The response is an Ens.StringContainer object.
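As an illustration, a Business Process could call the operation synchronously and read the extracted text from the response. This is only a sketch: the property names of the request classes are not listed here, so filling in the request is left as a comment, and the target name passed to SendRequestSync() assumes the operation was added to the production under its class name.

```objectscript
// Inside a method of an Ens.BusinessProcess subclass
set request = ##class(NSolov.TextExtract.PDFRequest).%New()
// Fill in the request here (for example, the path to the file). The property
// names are not documented in this text, so check the class definition.
// Assumption: the production item is named after the operation class
set sc = ..SendRequestSync("NSolov.TextExtract.BusinessOperation", request, .response)
if $$$ISOK(sc) {
    // Ens.StringContainer carries its payload in the StringValue property
    set string = response.StringValue
}
```

The same pattern should apply to NSolov.TextExtract.PPTXRequest and NSolov.TextExtract.ImageRequest, since all three are handled by the same operation.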