java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
dev.botcity.botcity_document_processing.pdf.PdfReader

public class PdfReader extends org.apache.pdfbox.text.PDFTextStripper
Overwrites PDF Box PDFTextStripper and converts each text element to an Entry object that is used to parse the document based on the visual structure.
Author:
Gabriel Archanjo
  • Field Summary

    Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

    charactersByArticle, document, LINE_SEPARATOR, output
  • Constructor Summary

    Constructors
    Constructor
    Description
     
  • Method Summary

    Modifier and Type
    Method
    Description
    protected float
    computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
     
    float
     
    float
     
    void
     
    readFile(File file)
     
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
     
    protected void
    writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions)
     

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    endArticle, endDocument, endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startArticle, startDocument, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphEnd, writeParagraphSeparator, writeParagraphStart, writeString, writeText, writeWordSeparator

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

  • Method Details

    • readFile

      public DocumentParser readFile(File file) throws FileNotFoundException, IOException
      Throws:
      FileNotFoundException
      IOException
    • getPageWidth

      public float getPageWidth()
    • getPageHeight

      public float getPageHeight()
    • writeString

      protected void writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions) throws IOException
      Overrides:
      writeString in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • loadEntries

      public void loadEntries(Object entries)
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • computeFontHeight

      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
      Throws:
      IOException