org.icepdf.core.pobjects.graphics.text
Class PageText

java.lang.Object
  extended by org.icepdf.core.pobjects.graphics.text.PageText
All Implemented Interfaces:
TextSelect

public class PageText
extends java.lang.Object
implements TextSelect

Page text represents the root element of a page's text hierarchy which looks something like this.

The hierarchy elements are built by the content parser when text extraction is enabled. The hierarchy exists to separate out the heuristics used for word and line detection, which serve text extraction and search, search highlighting, and text highlighting.

It is very important to note that every coordinate system represented in this hierarchy of objects has been normalized to page space. This allows the objects to be sorted and drawn. Note also that this structure is not used for page layout and painting; it is used for painting text selection made via UI input or search. The separation is needed so that the text represented in PageText can be padded and sorted to improve the readability of extracted text.

Since:
4.0

Constructor Summary
PageText()
           
 
Method Summary
protected  void addGlyph(GlyphText sprite)
           
 void addGlyph(GlyphText glyphText, java.util.LinkedList<OptionalContents> oCGs)
           
protected  void addOptionalPageLines(OptionalContents optionalContent, GlyphText sprite)
           
 void addPageLines(java.util.ArrayList<LineText> pageLines)
          Adds the specified pageLines to the array of pageLines.
 void applyXObjectTransform(java.awt.geom.AffineTransform transform)
          Utility method that normalizes text created in an XForm content stream; it is only called from the content parser when parsing a 'Do' token.
 void clearHighlighted()
           
 void clearSelected()
           
 void deselectAll()
           
 java.util.ArrayList<LineText> getPageLines()
          Creates a copy of the pageLines array and sorts that text both vertically and horizontally to aid in the proper ordering during text extraction.
 java.lang.StringBuilder getSelected()
           
 void newLine()
           
 void newLine(java.util.LinkedList<OptionalContents> oCGs)
           
 void selectAll()
           
 void sortAndFormatText()
          Takes the raw page lines represented as one continuous line and sorts the text by the y coordinate of the word bounds.
 java.lang.String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

PageText

public PageText()
Method Detail

newLine

public void newLine(java.util.LinkedList<OptionalContents> oCGs)

newLine

public void newLine()

addGlyph

protected void addGlyph(GlyphText sprite)

getPageLines

public java.util.ArrayList<LineText> getPageLines()
Creates a copy of the pageLines array and sorts that text both vertically and horizontally to aid in the proper ordering during text extraction. The value is cached, so any change to optional content visibility requires refreshing the cache with a call to sortAndFormatText.

During extraction, extra space is automatically added between words. Depending on how the PDF is encoded, however, this can produce too many extra spaces. The feature can therefore be turned off with the system property org.icepdf.core.views.page.text.autoSpace, which is set to true by default.

Returns:
list of page lines that are in the main content stream and any visible layers.
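
The following is a minimal sketch of toggling the auto-space behaviour through the system property named above. Only the property name and its default of true come from this documentation; the class AutoSpaceProperty and the helper autoSpaceEnabled are hypothetical, and the point at which the library itself reads the property is an assumption (setting it before any ICEpdf class is loaded, or passing it with -D on the command line, is the conservative choice).

```java
public class AutoSpaceProperty {

    // Property name taken from the getPageLines documentation.
    static final String AUTO_SPACE = "org.icepdf.core.views.page.text.autoSpace";

    /**
     * Hypothetical helper: reads the flag as a boolean system property,
     * falling back to the documented default of true when it is unset.
     */
    static boolean autoSpaceEnabled() {
        return Boolean.parseBoolean(System.getProperty(AUTO_SPACE, "true"));
    }

    public static void main(String[] args) {
        // Turn automatic spacing off for documents that extract with too many spaces.
        System.setProperty(AUTO_SPACE, "false");
        System.out.println(autoSpaceEnabled()); // prints "false"
    }
}
```

The same effect can be had without code by starting the JVM with -Dorg.icepdf.core.views.page.text.autoSpace=false.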

addPageLines

public void addPageLines(java.util.ArrayList<LineText> pageLines)
Adds the specified pageLines to the array of pageLines. Generally only called when passing the text of form XObjects up to their parent shape's text.

Parameters:
pageLines - page lines to add.

addGlyph

public void addGlyph(GlyphText glyphText,
                     java.util.LinkedList<OptionalContents> oCGs)

addOptionalPageLines

protected void addOptionalPageLines(OptionalContents optionalContent,
                                    GlyphText sprite)

applyXObjectTransform

public void applyXObjectTransform(java.awt.geom.AffineTransform transform)
Utility method that normalizes text created in an XForm content stream; it is only called from the content parser when parsing a 'Do' token.

Parameters:
transform - 'Do' matrix transform
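
To illustrate what normalizing form text to page space can look like, here is a small sketch that uses only java.awt.geom. It is not the library's implementation: the class XObjectTransformSketch, the helper toPageSpace, and the sample matrix are all hypothetical, and it assumes that normalization amounts to mapping each glyph's bounds through the transform.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

public class XObjectTransformSketch {

    /**
     * Hypothetical helper: maps bounds expressed in the form's own space
     * into page space and returns the bounding box of the transformed shape.
     */
    static Rectangle2D toPageSpace(Rectangle2D formBounds, AffineTransform transform) {
        return transform.createTransformedShape(formBounds).getBounds2D();
    }

    public static void main(String[] args) {
        // Sample matrix: scale the form by 2, then place it at (100, 200) on the page.
        AffineTransform formToPage = AffineTransform.getTranslateInstance(100, 200);
        formToPage.scale(2, 2);

        // Glyph bounds as they would be recorded inside the form content stream.
        Rectangle2D glyph = new Rectangle2D.Double(0, 0, 10, 5);
        Rectangle2D onPage = toPageSpace(glyph, formToPage);

        // prints "100.0, 200.0, 20.0, 10.0"
        System.out.println(onPage.getX() + ", " + onPage.getY() + ", "
                + onPage.getWidth() + ", " + onPage.getHeight());
    }
}
```

Once every glyph is expressed in the same page space, text from nested forms can be sorted alongside text from the main content stream.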

clearSelected

public void clearSelected()
Specified by:
clearSelected in interface TextSelect

clearHighlighted

public void clearHighlighted()
Specified by:
clearHighlighted in interface TextSelect

getSelected

public java.lang.StringBuilder getSelected()
Specified by:
getSelected in interface TextSelect

selectAll

public void selectAll()
Specified by:
selectAll in interface TextSelect

deselectAll

public void deselectAll()

toString

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

sortAndFormatText

public void sortAndFormatText()
Takes the raw page lines represented as one continuous line and sorts the text by the y coordinate of the word bounds. The words are then sliced into separate lines based on changes in y. Finally, each newly formed line is sorted once more by each word's x coordinate.
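
The three steps described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: the Word class, the sortIntoLines method, and the yTolerance parameter are hypothetical, and the sketch assumes that page space has y increasing upward, so larger y values sort first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineSortSketch {

    /** Hypothetical stand-in for a word and the origin of its bounds in page space. */
    static final class Word {
        final String text;
        final double x;
        final double y;

        Word(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts words by y, slices them into lines on y changes, then sorts each line by x. */
    static List<List<Word>> sortIntoLines(List<Word> words, double yTolerance) {
        // Step 1: sort by y, top of the page first (assumes y grows upward).
        List<Word> byY = new ArrayList<>(words);
        byY.sort((a, b) -> Double.compare(b.y, a.y));

        // Step 2: start a new line whenever y moves away from the line's first word.
        List<List<Word>> lines = new ArrayList<>();
        List<Word> current = new ArrayList<>();
        double lineY = 0;
        for (Word w : byY) {
            if (!current.isEmpty() && Math.abs(w.y - lineY) > yTolerance) {
                lines.add(current);
                current = new ArrayList<>();
            }
            if (current.isEmpty()) {
                lineY = w.y;
            }
            current.add(w);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }

        // Step 3: order the words of each line from left to right.
        for (List<Word> line : lines) {
            line.sort((a, b) -> Double.compare(a.x, b.x));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Word> raw = Arrays.asList(
                new Word("world", 50, 700), new Word("Hello", 10, 700),
                new Word("line", 40, 680), new Word("Second", 10, 680.5));
        for (List<Word> line : sortIntoLines(raw, 1.0)) {
            StringBuilder sb = new StringBuilder();
            for (Word w : line) {
                sb.append(sb.length() == 0 ? "" : " ").append(w.text);
            }
            System.out.println(sb); // prints "Hello world", then "Second line"
        }
    }
}
```

A tolerance is used here because words on one visual line rarely share exactly the same y value; the threshold the library actually applies is not stated in this documentation.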