Documentation

simple_html_dom

simple html dom parser

Paperg - in the find routine: allow us to specify that we want case insensitive testing of the value of the selector.

Paperg - change $size from protected to public so we can easily access it.

Paperg - added ForceTagsClosed in the constructor which tells us whether we trust the html or not. Default is to NOT trust it.

Table of Contents

$_charset  : mixed
$_target_charset  : mixed
$callback  : callable|null
Callback function to run for each element in the DOM.
$default_span_text  : string
Suffix for <span> elements
$lowercase  : bool
Indicates how tags and attributes are matched
$nodes  : array<string|int, mixed>
List of nodes in the current DOM
$original_size  : int
Original document size
$root  : object
The root node of the document
$size  : int
Current document size
$block_tags  : mixed
Defines a list of tags that, when closed, also close every element with an optional end tag inside them that has not been closed yet. (So an element where neither the opening nor the closing tag is omissible consistently closes every such element within it.)
$char  : string
Current character
$cursor  : mixed
$default_br_text  : string
Innertext for <br> elements
$doc  : string
The document
$noise  : mixed
$optional_closing_tags  : array<string|int, mixed>
Defines elements whose end tag is omissible.
$parent  : object
Parent node of the next node detected by the parser
$pos  : int
Current position in the document
$self_closing_tags  : mixed
Defines a list of self-closing tags (Void elements) according to the HTML Specification
$token_attr  : string
Tokens to identify the end of an attribute
$token_blank  : string
Tokens considered blank in HTML
$token_equal  : string
Tokens to identify the equal sign for attributes, stopping either at the closing tag ("/" i.e. "<html />") or the end of an opening tag (">" i.e. "<html>")
$token_slash  : string
Tokens to identify the end of a tag name. A tag name either ends on the ending slash ("/" i.e. "<html/>") or whitespace ("\s\r\n\t")
__construct()  : mixed
__destruct()  : mixed
__get()  : mixed
__toString()  : mixed
childNodes()  : mixed
clear()  : mixed
createElement()  : mixed
createTextNode()  : mixed
dump()  : mixed
find()  : mixed
firstChild()  : mixed
getElementById()  : mixed
getElementByTagName()  : mixed
getElementsById()  : mixed
getElementsByTagName()  : mixed
lastChild()  : mixed
load()  : mixed
load_file()  : mixed
loadFile()  : mixed
remove_callback()  : void
Remove callback function
restore_noise()  : string
Restore noise to HTML content
save()  : mixed
search_noise()  : mixed
set_callback()  : void
Set the callback function
as_text_node()  : bool
Add tag as text node to current node
copy_skip()  : string
Copy substring from the current document position to the first occurrence of a character not defined by the provided string.
copy_until()  : string
Copy substring from the current document position to the first occurrence of any of the provided characters.
copy_until_char()  : string
Copy substring from the current document position to the first occurrence of the provided string.
link_nodes()  : void
Link node to parent node
parse()  : bool
Parse HTML content
parse_attr()  : void
Parse attribute from current document position
parse_charset()  : mixed
prepare()  : mixed
read_tag()  : bool
Parse tag from current document position.
remove_noise()  : mixed
Remove noise from HTML content
skip()  : void
Seek from the current document position to the first occurrence of a character not defined by the provided string. Update the current document position to the new position.

Properties

$callback

Callback function to run for each element in the DOM.

public callable|null $callback = null

$default_span_text

Suffix for <span> elements

public string $default_span_text = ""

$lowercase

Indicates how tags and attributes are matched

public bool $lowercase = false

When set to true, tags and attributes are converted to lowercase before matching.

$nodes

List of nodes in the current DOM

public array<string|int, mixed> $nodes = array()

$original_size

Original document size

public int $original_size

Holds the original document size.

$size

Current document size

public int $size

Holds the current document size. The document size is the string length of simple_html_dom::$doc.

Note: Using this variable is more efficient than calling strlen($doc).

$block_tags

Defines a list of tags that, when closed, also close every element with an optional end tag inside them that has not been closed yet. (So an element where neither the opening nor the closing tag is omissible consistently closes every such element within it.)

protected mixed $block_tags = array('body' => 1, 'div' => 1, 'form' => 1, 'root' => 1, 'span' => 1, 'table' => 1)

Remarks:

  • Use isset() instead of in_array() on array elements to boost performance by about 30%
  • Sort elements by name for better readability!
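
The first remark can be illustrated with a minimal sketch. It assumes a table in the same shape as $block_tags (tag names as keys); the helper name is_block_tag is hypothetical and not part of simple_html_dom. The 30% figure is the documentation's own and is not measured here.

```php
<?php
// Lookup table shaped like $block_tags: tag names are the keys,
// so a membership test is a single hash lookup.
$blockTags = array('body' => 1, 'div' => 1, 'form' => 1, 'root' => 1, 'span' => 1, 'table' => 1);

// Hypothetical helper for illustration only.
function is_block_tag(array $table, string $tag): bool
{
    return isset($table[$tag]);
}

// The list-shaped alternative has to scan the values with in_array().
$blockTagList = array_keys($blockTags);

$keyed   = is_block_tag($blockTags, 'div');        // keyed lookup
$scanned = in_array('div', $blockTagList, true);   // linear scan, same answer
$missing = is_block_tag($blockTags, 'p');          // 'p' is not a key
```

Both forms give the same answer; only the keyed form avoids walking the array, which is presumably why the tables store names as keys with a dummy value of 1.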

$default_br_text

Innertext for <br> elements

protected string $default_br_text = ""

$optional_closing_tags

Defines elements whose end tag is omissible.

protected array<string|int, mixed> $optional_closing_tags = array(
    'b' => array('b' => 1), // Not optional, see https://www.w3.org/TR/html/textlevel-semantics.html#the-b-element
    'dd' => array('dd' => 1, 'dt' => 1),
    'dl' => array('dd' => 1, 'dt' => 1), // Not optional, see https://www.w3.org/TR/html/grouping-content.html#the-dl-element
    'dt' => array('dd' => 1, 'dt' => 1),
    'li' => array('li' => 1),
    'optgroup' => array('optgroup' => 1, 'option' => 1),
    'option' => array('optgroup' => 1, 'option' => 1),
    'p' => array('p' => 1),
    'rp' => array('rp' => 1, 'rt' => 1),
    'rt' => array('rp' => 1, 'rt' => 1),
    'td' => array('td' => 1, 'th' => 1),
    'th' => array('td' => 1, 'th' => 1),
    'tr' => array('td' => 1, 'th' => 1, 'tr' => 1),
)

  • key = Name of an element whose end tag is omissible.
  • value = Names of elements whose end tag is omissible, that are closed by the current element.

Remarks:

  • Use isset() instead of in_array() on array elements to boost performance by about 30%
  • Sort elements by name for better readability!

Example

An li element’s end tag may be omitted if the li element is immediately followed by another li element. To allow that, add the following element to the array:

'li' => array('li' => 1),

With this, the following two examples are considered equal. Note that the second example is missing the closing tags on li elements.

<ul><li>First Item</li><li>Second Item</li></ul>
  • First Item
  • Second Item
<ul><li>First Item<li>Second Item</ul>
  • First Item
  • Second Item

A two-dimensional array where the key is the name of an element whose end tag is omissible and the value is an array of elements whose end tag is omissible, that are closed by the current element.
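
How such a two-dimensional table might be consulted can be sketched as follows. This is an illustration under assumptions, not the parser's actual code: the helper closes_open_element is hypothetical, and the table is a subset of the default value shown above.

```php
<?php
// Subset of the documented default table, copied for illustration.
$optionalClosingTags = array(
    'li' => array('li' => 1),
    'p'  => array('p' => 1),
    'td' => array('td' => 1, 'th' => 1),
    'tr' => array('td' => 1, 'th' => 1, 'tr' => 1),
);

// Hypothetical helper: does a newly opened $incoming tag implicitly
// close the currently open $open element?
function closes_open_element(array $table, string $incoming, string $open): bool
{
    // isset() on a nested key is false if either level is missing.
    return isset($table[$incoming][$open]);
}

$liClosesLi = closes_open_element($optionalClosingTags, 'li', 'li');
$trClosesTd = closes_open_element($optionalClosingTags, 'tr', 'td');
$tdClosesTr = closes_open_element($optionalClosingTags, 'td', 'tr');
$divClosesP = closes_open_element($optionalClosingTags, 'div', 'p');
```

In this sketch a new li closes an open li and a new tr closes an open td, while td does not close tr and div (absent from the table) closes nothing.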

Tags
link

Optional tags

todo

The implementation of optional closing tags doesn't work in all cases because it only considers elements that close other optional closing tags, without taking into account that some (non-blocking) tags should close these optional closing tags too. For example, the end tag for "p" is omissible and can be closed by an "address" element, whose end tag is NOT omissible. Currently a "p" element without a closing tag stops at the next "p" element or blocking tag, even if it contains other elements.

todo

Known SourceForge issue #2977341: B tags that are not closed cause us to return everything to the end of the document.

$parent

Parent node of the next node detected by the parser

protected object $parent

$self_closing_tags

Defines a list of self-closing tags (Void elements) according to the HTML Specification

protected mixed $self_closing_tags = array('area' => 1, 'base' => 1, 'br' => 1, 'col' => 1, 'embed' => 1, 'hr' => 1, 'img' => 1, 'input' => 1, 'link' => 1, 'meta' => 1, 'param' => 1, 'source' => 1, 'track' => 1, 'wbr' => 1)

Remarks:

  • Use isset() instead of in_array() on array elements to boost performance by about 30%
  • Sort elements by name for better readability!
Tags
link

HTML Specification

link

Void elements

$token_attr

Tokens to identify the end of an attribute

protected string $token_attr = ' >'

$token_blank

Tokens considered blank in HTML

protected string $token_blank = " \t\r\n"

$token_equal

Tokens to identify the equal sign for attributes, stopping either at the closing tag ("/" i.e. "<html />") or the end of an opening tag (">" i.e. "<html>")

protected string $token_equal = ' =/>'

$token_slash

Tokens to identify the end of a tag name. A tag name either ends on the ending slash ("/" i.e. "<html/>") or whitespace ("\s\r\n\t")

protected string $token_slash = " />\r\n\t"

Methods

__construct()

public __construct([mixed $str = null ][, mixed $lowercase = true ][, mixed $forceTagsClosed = true ][, mixed $target_charset = DEFAULT_TARGET_CHARSET ][, mixed $stripRN = true ][, mixed $defaultBRText = DEFAULT_BR_TEXT ][, mixed $defaultSpanText = DEFAULT_SPAN_TEXT ][, mixed $options = 0 ]) : mixed
Parameters
$str : mixed = null
$lowercase : mixed = true
$forceTagsClosed : mixed = true
$target_charset : mixed = DEFAULT_TARGET_CHARSET
$stripRN : mixed = true
$defaultBRText : mixed = DEFAULT_BR_TEXT
$defaultSpanText : mixed = DEFAULT_SPAN_TEXT
$options : mixed = 0
Return values
mixed

__get()

public __get(mixed $name) : mixed
Parameters
$name : mixed
Return values
mixed

childNodes()

public childNodes([mixed $idx = -1 ]) : mixed
Parameters
$idx : mixed = -1
Return values
mixed

createElement()

public createElement(mixed $name[, mixed $value = null ]) : mixed
Parameters
$name : mixed
$value : mixed = null
Return values
mixed

createTextNode()

public createTextNode(mixed $value) : mixed
Parameters
$value : mixed
Return values
mixed

dump()

public dump([mixed $show_attr = true ]) : mixed
Parameters
$show_attr : mixed = true
Return values
mixed

find()

public find(mixed $selector[, mixed $idx = null ][, mixed $lowercase = false ]) : mixed
Parameters
$selector : mixed
$idx : mixed = null
$lowercase : mixed = false
Return values
mixed

getElementById()

public getElementById(mixed $id) : mixed
Parameters
$id : mixed
Return values
mixed

getElementByTagName()

public getElementByTagName(mixed $name) : mixed
Parameters
$name : mixed
Return values
mixed

getElementsById()

public getElementsById(mixed $id[, mixed $idx = null ]) : mixed
Parameters
$id : mixed
$idx : mixed = null
Return values
mixed

getElementsByTagName()

public getElementsByTagName(mixed $name[, mixed $idx = -1 ]) : mixed
Parameters
$name : mixed
$idx : mixed = -1
Return values
mixed

load()

public load(mixed $str[, mixed $lowercase = true ][, mixed $stripRN = true ][, mixed $defaultBRText = DEFAULT_BR_TEXT ][, mixed $defaultSpanText = DEFAULT_SPAN_TEXT ][, mixed $options = 0 ]) : mixed
Parameters
$str : mixed
$lowercase : mixed = true
$stripRN : mixed = true
$defaultBRText : mixed = DEFAULT_BR_TEXT
$defaultSpanText : mixed = DEFAULT_SPAN_TEXT
$options : mixed = 0
Return values
mixed

remove_callback()

Remove callback function

public remove_callback() : void
Return values
void

restore_noise()

Restore noise to HTML content

public restore_noise(string $text) : string

Noise is restored from simple_html_dom::$noise

Parameters
$text : string

A subset of HTML containing noise

Return values
string

The same content with noise restored
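
The idea of stashing noise and restoring it later can be sketched as a round trip. Everything here is an assumption made for illustration: the helper names stash_noise and unstash_noise, the placeholder format, and the comment-matching pattern are not taken from the library, which keeps its own store in simple_html_dom::$noise.

```php
<?php
// Hypothetical helper: replace every match of $pattern with a
// fixed-width placeholder and remember the original text under it.
function stash_noise(string $html, string $pattern, array &$noise): string
{
    $result = preg_replace_callback($pattern, function (array $m) use (&$noise): string {
        // Fixed width keeps one placeholder from being a prefix of another.
        $key = sprintf('___noise___%05d', count($noise));
        $noise[$key] = $m[0];
        return $key;
    }, $html);

    // preg_replace_callback() returns null on a regex error.
    return $result ?? $html;
}

// Hypothetical helper: put the stashed text back in place.
function unstash_noise(string $text, array $noise): string
{
    return str_replace(array_keys($noise), array_values($noise), $text);
}

$noise    = array();
$html     = '<p>a</p><!-- c1 --><p>b</p><!-- c2 -->';
$clean    = stash_noise($html, '/<!--.*?-->/s', $noise);
$restored = unstash_noise($clean, $noise);
```

After the first call $clean holds placeholders where the comments were, and the second call returns the original string unchanged.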

save()

public save([mixed $filepath = '' ]) : mixed
Parameters
$filepath : mixed = ''
Return values
mixed

search_noise()

public search_noise(mixed $text) : mixed
Parameters
$text : mixed
Return values
mixed

set_callback()

Set the callback function

public set_callback(callable $function_name) : void
Parameters
$function_name : callable

Callback function to run for each element in the DOM.

Return values
void
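
What "run for each element" means can be sketched without the library. The node objects below are hypothetical stand-ins that model only a tag name, and the loop is an assumption about the general pattern; this page does not state when the library actually invokes the callback.

```php
<?php
// Hypothetical stand-ins for DOM nodes; only a tag name is modelled.
$elements = array(
    (object) array('tag' => 'DIV'),
    (object) array('tag' => 'Span'),
);

// Any callable works; this one normalises the tag name in place.
$callback = function ($element): void {
    $element->tag = strtolower($element->tag);
};

// Run the callback once per element, skipping the call when none is set
// (the documented default for $callback is null).
foreach ($elements as $element) {
    if ($callback !== null) {
        call_user_func($callback, $element);
    }
}
```

Because objects are passed by handle, changes made inside the callback are visible on the elements afterwards.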

as_text_node()

Add tag as text node to current node

protected as_text_node(string $tag) : bool
Parameters
$tag : string

Tag name

Return values
bool

True on success

copy_skip()

Copy substring from the current document position to the first occurrence of a character not defined by the provided string.

protected copy_skip(string $chars) : string
Parameters
$chars : string

A string containing every allowed character.

Return values
string

Substring from the current document position to the first occurrence of a character not defined by the provided string.
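
A minimal sketch of this behaviour as a free function, using PHP's strspn(). The name copy_skip_sketch and the by-reference $pos parameter are assumptions made so the example is self-contained; advancing the position past the copied run is also an assumption, since only skip() is documented as updating it.

```php
<?php
// Hypothetical free-function version of the documented behaviour.
function copy_skip_sketch(string $doc, int &$pos, string $chars): string
{
    // Length of the run, starting at $pos, made only of allowed characters.
    $len = strspn($doc, $chars, $pos);
    $run = substr($doc, $pos, $len);
    $pos += $len; // assumption: the cursor moves past the copied run
    return $run;
}

$doc   = "  \t<div>";
$pos   = 0;
$blank = copy_skip_sketch($doc, $pos, " \t\r\n");
```

Here $blank receives the two spaces and the tab, and $pos ends at 3, the index of the "<" character.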

copy_until()

Copy substring from the current document position to the first occurrence of any of the provided characters.

protected copy_until(string $chars) : string
Parameters
$chars : string

A string containing every character to stop at.

Return values
string

Substring from the current document position to the first occurrence of any of the provided characters.
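
The complementary operation can be sketched with strcspn(). As before, copy_until_sketch and the by-reference $pos are hypothetical, and the stop set is modelled on the tag-name terminators described for $token_slash (slash, closing bracket, whitespace).

```php
<?php
// Hypothetical free-function version of the documented behaviour.
function copy_until_sketch(string $doc, int &$pos, string $chars): string
{
    // Length of the run, starting at $pos, containing none of the stop characters.
    $len  = strcspn($doc, $chars, $pos);
    $part = substr($doc, $pos, $len);
    $pos += $len; // assumption: the cursor stops on the terminating character
    return $part;
}

$doc  = '<div class="a">';
$pos  = 1; // just after the opening "<"
$name = copy_until_sketch($doc, $pos, " />\r\n\t");
```

Starting just after the "<", the call copies the tag name div and leaves $pos at 4, on the space that ended it.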

copy_until_char()

Copy substring from the current document position to the first occurrence of the provided string.

protected copy_until_char(string $char) : string
Parameters
$char : string

The string to stop at.

Return values
string

Substring from the current document position to the first occurrence of the provided string.
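
Stopping at a whole string rather than at any character of a set can be sketched with strpos(). The function name and the by-reference $pos are hypothetical, and returning the rest of the document when nothing matches is an assumption, since this page does not describe that case.

```php
<?php
// Hypothetical free-function version of the documented behaviour.
function copy_until_char_sketch(string $doc, int &$pos, string $char): string
{
    $end = strpos($doc, $char, $pos);
    if ($end === false) {
        // Assumption: with no match, copy everything that is left.
        $part = substr($doc, $pos);
        $pos  = strlen($doc);
        return $part;
    }
    $part = substr($doc, $pos, $end - $pos);
    $pos  = $end; // the cursor rests on the start of the match
    return $part;
}

$doc   = 'title="Hello world" id="x"';
$pos   = 7; // first character after the opening quote
$value = copy_until_char_sketch($doc, $pos, '"');
```

The call copies Hello world and leaves $pos at 18, the index of the closing quote.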

link_nodes()

Link node to parent node

protected link_nodes(object &$node, bool $is_child) : void
Parameters
$node : object

Node to link to parent

$is_child : bool

True if the node is a child of parent

Return values
void

parse()

Parse HTML content

protected parse() : bool
Return values
bool

True on success

parse_attr()

Parse attribute from current document position

protected parse_attr(object $node, string $name, array<string|int, mixed> &$space) : void
Parameters
$node : object

Node for the attributes

$name : string

Name of the current attribute

$space : array<string|int, mixed>

Array for spacing information

Return values
void

parse_charset()

protected parse_charset() : mixed
Return values
mixed

prepare()

protected prepare(mixed $str[, mixed $lowercase = true ][, mixed $defaultBRText = DEFAULT_BR_TEXT ][, mixed $defaultSpanText = DEFAULT_SPAN_TEXT ]) : mixed
Parameters
$str : mixed
$lowercase : mixed = true
$defaultBRText : mixed = DEFAULT_BR_TEXT
$defaultSpanText : mixed = DEFAULT_SPAN_TEXT
Return values
mixed

read_tag()

Parse tag from current document position.

protected read_tag() : bool
Return values
bool

True if a tag was found, false otherwise

remove_noise()

Remove noise from HTML content

protected remove_noise(string $pattern[, bool $remove_tag = false ]) : mixed

Noise is stored to simple_html_dom::$noise

Parameters
$pattern : string

The regex pattern used for finding noise

$remove_tag : bool = false

True to remove the entire match. Default is false to only remove the captured data.

Return values
mixed

skip()

Seek from the current document position to the first occurrence of a character not defined by the provided string. Update the current document position to the new position.

protected skip(string $chars) : void
Parameters
$chars : string

A string containing every allowed character.

Return values
void
