TextPage
This class represents text and images shown on a document page. All MuPDF document types are supported.
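Method | Short Description |
---|---|
TextPage.extractText() | Extract the page’s plain text |
TextPage.extractTEXT() | synonym of previous |
TextPage.extractHTML() | Extract the page’s content in HTML format |
TextPage.extractJSON() | Extract the page’s content in JSON format |
TextPage.extractXHTML() | Extract the page’s content in XHTML format |
TextPage.extractXML() | Extract the page’s text in XML format |
TextPage.extractDICT() | Extract the page’s content in dict format |
TextPage.extractRAWDICT() | Extract the page’s content in dict format, with per-character detail |
TextPage.search() | Search for a string in the page |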
Class API
- class TextPage
- extractText()
- extractTEXT()
Extract all text from a TextPage object. Returns a string of the page’s complete text. The text is UTF-8 encoded Unicode and appears in the same sequence as specified when the document was created.
- Return type
str
- extractHTML()
Extract all text and images in HTML format. This version contains complete formatting and positioning information. Images are included (encoded as base64 strings). You need an HTML package to interpret the output in Python. Your internet browser should be able to display this information adequately, but see Controlling Quality of HTML Output.
- Return type
str
- extractDICT()
Extract content as a Python dictionary. Provides the same level of detail as extractHTML(). See below for the structure.
- Return type
dict
- extractJSON()
Extract content as a string in JSON format. Created by json.dumps(TextPage.extractDICT()). It is included only for backward compatibility. You will probably use this method only to write the result to a text file or similar.
- Return type
str
- extractXHTML()
Extract all text in XHTML format. The level of text detail is comparable to extractTEXT(), but the output also contains images (base64 encoded). This method makes no attempt to re-create the original visual appearance.
- Return type
str
- extractXML()
Extract all text in XML format. This contains complete formatting information about every single character on the page: font, size, line, paragraph, location, etc. Contains no images. You need an XML package to interpret the output in Python.
- Return type
str
- extractRAWDICT()
Extract content as a Python dictionary. It is technically similar to extractDICT() and contains that information as a subset (including any images). It provides additional detail down to each character, which makes using XML obsolete in many cases. See below for the structure.
- Return type
dict
- search(string, hit_max=16, quads=False)
Search for string and return a list of found locations.
- Parameters
string (str) – the string to search for.
hit_max (int) – maximum number of accepted hits (default 16).
quads (bool) – return quadrilaterals instead of rectangles.
- Return type
list
- Returns
a list of Rect or Quad objects, each surrounding a found
string
occurrence.
Example: for a hit when searching for the string “pymupdf”, the corresponding entry is either the rectangle surrounding the hit or, if quads was specified, Quad(ul, ur, ll, lr).
Note
All of the above can be achieved by using the appropriate Page.getText() and Page.searchFor() methods. Also see further down and in the Page chapter for examples of how to create a valid file format by adding the respective headers and trailers.
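The note above can be illustrated with a small helper. This is a minimal sketch, assuming a page object that offers getText() and searchFor() as named in the note. The helper name summarize_page, the "text" argument, and the StubPage class are assumptions for illustration; the stub stands in for a real page so that the snippet runs without a document, and newer library releases may spell these methods differently.

```python
def summarize_page(page, needle, hit_max=16):
    """Return the plain text and the search hits for one page.

    `page` is assumed to behave like a Page object: getText("text")
    returns a str and searchFor(...) returns a list of rectangles.
    """
    text = page.getText("text")
    hits = page.searchFor(needle, hit_max=hit_max)
    return {"text": text, "hit_count": len(hits), "hits": hits}


class StubPage:
    """Hypothetical stand-in for a real page, used only for this demo."""

    def getText(self, output="text"):
        return "PyMuPDF extracts text.\n"

    def searchFor(self, needle, hit_max=16):
        # A real page would return Rect objects; a tuple stands in here.
        return [(72.0, 72.0, 120.0, 84.0)][:hit_max]


summary = summarize_page(StubPage(), "pymupdf")
print(summary["hit_count"], repr(summary["text"]))
```

With a real document, the stub would be replaced by a page loaded from the opened file; the helper itself does not change.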
Dictionary Structure of extractDICT() and extractRAWDICT()
Page Dictionary
Key | Value |
---|---|
width | page width in pixels (float) |
height | page height in pixels (float) |
blocks | list of block dictionaries |
Block Dictionaries
Blocks come in two different formats: image blocks and text blocks.
Image block:
Key | Value |
---|---|
type | 1 = image (int) |
bbox | block / image rectangle, formatted as |
ext | image type (str), as its file extension, see below |
width | original image width (float) |
height | original image height (float) |
image | image content (bytes/bytearray) |
Possible values of key "ext" are "bmp", "gif", "jpeg", "jpx" (JPEG 2000), "jxr" (JPEG XR), "png", "pnm", and "tiff".
Note
All of the above values may be zero or contain empty objects respectively. In an effort to provide complete information we may return entries like {'type': 1, 'bbox': [0.0, 0.0, 0.0, 0.0], 'width': 0, 'height': 0, 'ext': 'png', 'image': b''}.
Text block:
Key | Value |
---|---|
type | 0 = text (int) |
bbox | block rectangle, formatted as |
lines | list of text line dictionaries |
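The two block formats can be told apart by their "type" key. The following is a minimal sketch, assuming a page dictionary shaped as described above; the function name split_blocks, the sample data, and the file-name pattern are made up for illustration. Empty image entries of the kind mentioned in the note are skipped.

```python
def split_blocks(page_dict):
    """Separate a page dictionary's blocks into text and image blocks."""
    text_blocks, image_blocks = [], []
    for block in page_dict["blocks"]:
        if block["type"] == 0:
            text_blocks.append(block)
        elif block["type"] == 1 and block["image"]:
            # Keep only image blocks that actually carry image content.
            image_blocks.append(block)
    return text_blocks, image_blocks


# Hand-made sample data that mimics the documented structure.
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"type": 0, "bbox": [72.0, 72.0, 200.0, 90.0], "lines": []},
        {"type": 1, "bbox": [72.0, 100.0, 172.0, 200.0], "ext": "png",
         "width": 100, "height": 100, "image": b"\x89PNG..."},
        {"type": 1, "bbox": [0.0, 0.0, 0.0, 0.0], "ext": "png",
         "width": 0, "height": 0, "image": b""},
    ],
}

texts, images = split_blocks(sample_page)
# The "ext" value doubles as a file extension for saving the image bytes.
names = ["image-%d.%s" % (i, img["ext"]) for i, img in enumerate(images)]
print(len(texts), names)
```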
Line Dictionary
Key | Value |
---|---|
bbox | line rectangle, formatted as |
wmode | writing mode (int): 0 = horizontal, 1 = vertical |
dir | writing direction (list of floats): |
spans | list of span dictionaries |
The value of key "dir" is a unit vector and should be interpreted as follows:
- x: positive = “left-right”, negative = “right-left”, 0 = neither
- y: positive = “top-bottom”, negative = “bottom-top”, 0 = neither
The values indicate the “relative writing speed” in each direction, such that x² + y² = 1. In other words dir = [cos(beta), sin(beta)], where beta is the writing angle relative to the horizontal.
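Since dir = [cos(beta), sin(beta)], the writing angle can be recovered with the two-argument arctangent. This is a minimal sketch; the function names and the tolerance eps are assumptions chosen for illustration.

```python
import math


def writing_angle(direction):
    """Return the writing angle beta in degrees for a "dir" unit vector."""
    x, y = direction
    return math.degrees(math.atan2(y, x))


def describe_direction(direction, eps=1e-6):
    """Translate the signs of x and y into the labels used above."""
    x, y = direction
    horizontal = "left-right" if x > eps else "right-left" if x < -eps else "neither"
    vertical = "top-bottom" if y > eps else "bottom-top" if y < -eps else "neither"
    return horizontal, vertical


print(writing_angle([1.0, 0.0]), describe_direction([1.0, 0.0]))
print(writing_angle([0.0, 1.0]), describe_direction([0.0, 1.0]))
```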
Span Dictionary
Spans contain the actual text. In contrast to MuPDF versions prior to v1.12, a span no longer includes positioning information. Therefore, to reconstruct the text of a line, the text pieces of all spans must be concatenated. Since v1.12, a span also contains font information. A line contains more than one span only if the font or its attributes change within the line.
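Reconstructing a line by concatenating its spans can be sketched as follows, assuming each span carries a "text" key as in extractDICT() output. The function name line_text and the sample line are made up for illustration.

```python
def line_text(line):
    """Join the text pieces of all spans of one line dictionary."""
    return "".join(span["text"] for span in line["spans"])


# Hand-made sample: the font changes mid-line, so there are three spans.
sample_line = {
    "bbox": [72.0, 72.0, 160.0, 86.0],
    "wmode": 0,
    "dir": [1.0, 0.0],
    "spans": [
        {"font": "Helvetica", "size": 11.0, "flags": 0, "text": "Some "},
        {"font": "Helvetica-Bold", "size": 11.0, "flags": 16, "text": "bold"},
        {"font": "Helvetica", "size": 11.0, "flags": 0, "text": " text."},
    ],
}

print(line_text(sample_line))
```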
Key | Value |
---|---|
font | font name (str) |
size | font size (float) |
flags | font characteristics (int) |
text | (only for |
chars | (only for |
flags is an integer, encoding bools of font properties:
- bit 0: superscripted (2**0 = 1)
- bit 1: italic (2**1 = 2)
- bit 2: serifed (2**2 = 4)
- bit 3: monospaced (2**3 = 8)
- bit 4: bold (2**4 = 16)
Test these characteristics like so:
>>> if flags & 2**0: print("super")
>>> if flags & 2**1: print("italic")
>>> if flags & 2**2: print("serif")
>>> # etc.
>>>
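The same bit tests can be wrapped in a small helper that returns all properties at once. This is a minimal sketch; the names FLAG_NAMES and decode_flags are assumptions for illustration, and the bit order follows the list above.

```python
# Property names indexed by bit position, as listed above.
FLAG_NAMES = ("superscripted", "italic", "serifed", "monospaced", "bold")


def decode_flags(flags):
    """Return the names of all font properties set in a span's flags."""
    return [name for bit, name in enumerate(FLAG_NAMES) if flags & (1 << bit)]


print(decode_flags(18))  # 18 = 2**1 + 2**4
```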
Character Dictionary for extractRAWDICT()
Key | Value |
---|---|
bbox | character rectangle, formatted as |
c | the character (unicode) |
origin | tuple coordinates of the bottom left point |
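With per-character detail, a span's text and its overall rectangle can be rebuilt from the character dictionaries. The following is a minimal sketch, assuming that each "bbox" is a sequence of four numbers in the order (x0, y0, x1, y1); that layout, the function names, and the sample span are assumptions for illustration.

```python
def span_text_from_chars(span):
    """Concatenate the "c" values of a span's character dictionaries."""
    return "".join(ch["c"] for ch in span["chars"])


def span_bbox_from_chars(span):
    """Return the smallest rectangle that contains all character boxes."""
    boxes = [ch["bbox"] for ch in span["chars"]]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Hand-made sample span with two characters.
sample_span = {
    "font": "Helvetica",
    "size": 12.0,
    "flags": 0,
    "chars": [
        {"bbox": (10.0, 20.0, 16.0, 32.0), "c": "H", "origin": (10.0, 30.0)},
        {"bbox": (16.0, 20.0, 22.0, 32.0), "c": "i", "origin": (16.0, 30.0)},
    ],
}

print(span_text_from_chars(sample_span), span_bbox_from_chars(sample_span))
```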