Copyright © 2002 - MCS Investments, Inc.

List a Website's Video Links Alphabetically Using XPATH and PHP

This script will get an alphabetically sorted list of the video links found on a website. It is limited to .html, .htm, and .php extensions for page urls.

The script uses PHP 5 and its DOM extension. The DOM extension is enabled by default in most PHP installations, so the following should work fine (it does for us). The extension lets you operate on XML documents through the DOM API, and it supports XPath 1.0, which this script uses extensively. XPath has been around a while. What is it? XPath is a syntax for addressing parts of an XML document (or an HTML or XHTML one). It uses path expressions to navigate documents, and it includes a library of standard functions.
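To make the XPath idea concrete, here is a minimal sketch that loads a small HTML string (made up for this example) into the DOM and pulls out the href of every anchor. It uses the standard DOMDocument and DOMXPath classes; nothing here is taken from the script itself.

```php
<?php
// Minimal sketch: load an HTML string into the DOM and query it with XPath.
$html = '<html><body><p><a href="page2.html">Next</a> <a href="clip.avi">Clip</a></p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);            // @ hides warnings from sloppy markup
$xpath = new DOMXPath($dom);

// The path expression selects every anchor element under body.
$nodes = $xpath->query('/html/body//a');

$links = array();
foreach ($nodes as $node) {
    $links[] = $node->getAttribute('href');
}
print_r($links);
```

The same three steps (load, build the XPath object, run a path expression) are what the script does for every page it visits.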

We start by ensuring that fatal run-time errors (errors that cannot be recovered from) are reported. When one occurs, execution of the script halts and a message appears. The likely ones are running out of memory and timing out. Both happen when a site is too big to process (or is crammed with video content): either the available memory runs out or the server's script execution time limit (usually 30 seconds) is exceeded. The server appears to be lenient if the other websites it hosts are making relatively light demands at the moment: we've experienced a nearly 3-minute script execution time as well as the 30-second timeout. The script has to examine every page on the site, and we've seen 300-page sites take under 10 seconds while a 240-page site with lots of videos took over 2 minutes. Feel free to attempt to manipulate the script timeout setting and risk the wrath of the host. We did NOT.
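A sketch of what that first step might look like, assuming the standard error_reporting() and ini_set() functions are used (the exact call in the original script is not shown here). The two ini_get() lines only read the limits discussed above; they do not change them.

```php
<?php
// Report only fatal run-time errors and show them onscreen.
error_reporting(E_ERROR);
ini_set('display_errors', '1');

// Read (not change) the two limits that usually stop the script.
echo 'Time limit: ' . ini_get('max_execution_time') . " seconds\n";
echo 'Memory limit: ' . ini_get('memory_limit') . "\n";
```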

Next we include config.php so that the connection to the MySQL database works. That gives our video-indexing routine the access it needs to write, and later read, all the juicy data that it stores as records in a database table called videos.

Then comes the function isitvideo(). It checks each url to see whether its extension is video or likely to be video. MIME types are less helpful: if the MIME type is application/x-shockwave-flash, application/x-oleobject, application/x-mplayer2, application/vnd.rn-realmedia, or application/ogg, for example, the media could be audio or video. So MIME types aren't looked at except when the link has been found in the data attribute of an object tag. The extensions rm, ogx, and swf can be audio or video, so we may end up with a few links that are "likely to be video" but aren't. Another issue is that embed, object, param, anchor, iframe, and HTML5 source tags can contain video or audio, and so can data, src, and value attributes. And many video links have no extension at all, such as those on YouTube, Break, Hulu, and plenty more, and the same goes for audio links on some audio sites. In other words, there is no way to get it perfect no matter what one does. Whatever.rm may be video, but that is not a sure thing.

As a result, we put some major video site domains in the isitvideo() function, and you may feel free to add your own. Links that start with one of these domains are included in the video links list displayed onscreen and stored in the MySQL table. Few sites host both video and audio, so checking whether a link starts with "http://www.youtube.com" or one of the other domains is a reliable test of whether the link is video. It's too bad that there are so many hundreds of video sites and combinations of MIME types, file extensions, tags, and tag attributes, and so few standards to enforce consistency, but that is how it is. HTML5 is trying to simplify things with its audio and video tags, but its source tag has already confused the issue: it can contain either video or audio!
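To see how the two tests behave, here is a small sketch with a hypothetical helper, looks_like_video(), that applies the same suffix and prefix checks as isitvideo() but returns true or false instead of filling the $video array. The extension and domain lists are shortened for the example.

```php
<?php
// Hypothetical helper: same two tests as isitvideo(), returning true/false.
function looks_like_video($url) {
    $extensions = array('.avi', '.mov', '.mp4', '.rm', '.swf');           // shortened list
    $domains = array('http://www.youtube.com', 'http://www.vimeo.com');   // shortened list
    foreach ($extensions as $ext) {
        if (substr($url, -strlen($ext)) == $ext) { return true; }          // url ends with extension
    }
    foreach ($domains as $domain) {
        if (substr($url, 0, strlen($domain)) == $domain) { return true; }  // url starts with domain
    }
    return false;
}

var_dump(looks_like_video('clips/floppy2.avi'));
var_dump(looks_like_video('http://www.youtube.com/watch?v=abc'));
var_dump(looks_like_video('sounds/song.mp3'));
var_dump(looks_like_video('banner.swf'));   // accepted, though a .swf is not always video
```

The last call shows the "likely to be video" problem: the check accepts any .swf link even though some Flash files carry no video at all.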

Next there is an HTML form that is submitted to this same page to send the site url to the PHP script. The user is told that the site url "must include /index.html, /index.htm, /index.php or whatever the home page filename is." The input is checked to see that it ends in htm, html, or php. If it does not, the script skips to the end and shows the alert "Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL." before reloading the page.
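The ending check can also be written as a single regular expression. This is a sketch with a hypothetical helper name, has_page_filename(), not the test the script uses; unlike a plain four-character comparison, it insists on the dot before the extension.

```php
<?php
// Hypothetical helper: true when the submitted url ends in .htm, .html, or .php.
function has_page_filename($siteurl) {
    return preg_match('/\.(html?|php)$/', trim($siteurl)) === 1;
}

var_dump(has_page_filename('http://www.example.com/index.html'));
var_dump(has_page_filename('http://www.example.com/'));
```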

Now the parse_url() function is used to parse the URL. It returns an associative array containing whichever components of the URL are present. Then the $home variable is filled with the file name of the home page.

Next the strip_tags() function is run on the url (just in case), and the home page file name is removed from the url to leave the remainder: the scheme and the host name containing the domain. If $f now ends with "/", that character is dumped. Leading and trailing spaces are trimmed from $f, and spaces inside the url are replaced by %20 so that PHP functions that use it do not get errors.
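Here is a sketch of those splitting steps on a placeholder url (www.example.com is an assumption for the example). It follows the variable names $f, $e, and $home used in the listing, with $parts as an extra name for the parse_url() result.

```php
<?php
$f = 'http://www.example.com/index.html';   // placeholder site url
$f = strip_tags($f);

$parts = parse_url($f);                     // associative array: scheme, host, path, ...
$e = $parts['path'];                        // "/index.html"
$home = substr($e, 1);                      // "index.html"

$f = str_replace($e, '', $f);               // scheme and host remain
if (substr($f, -1) == '/') { $f = substr($f, 0, -1); }
$f = str_replace(' ', '%20', trim($f));

echo $home . "\n" . $f . "\n";
```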

Then any existing MySQL videos table is dropped and a new table of that name is built, with a field for the video links and a field for the page url where each link is found. The $a array is for page urls and the $video array is for the video links. It may be useful to have a MySQL table full of the site's video link data (who can say?). We use it for storage, to avoid out-of-memory errors from the $video array getting too big, to be able to count unique videos for statistical purposes, and to use SQL's ORDER BY clause to easily display the table contents in alphabetical order. Now we save $g as $f plus "/" and put into $www the value of $g minus "www."

Now we go to the Internet with file_get_contents($g.$home), which gets the page's contents into the PHP variable $t. A new DOMDocument object is created because XPath queries need one to work on. The @ in @$dom->loadHTML($t) suppresses the warnings that sloppy HTML produces as the page contents are loaded into the DOM object. The $xpath = new DOMXPath($dom) statement creates an XPath object to use with the evaluate() method, which evaluates the given XPath expression (in this case, a rather complex one). If an XPath expression returns a node set, you get a DOMNodeList, which can be looped through to get the values of attributes such as href. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. Our evaluate() argument contains the path /html/body//a. This gets the anchor elements, and $url = $href->getAttribute('href') is used, in a loop, to read each element's href attribute value, which holds the link url. Multiple XPath paths are used in our evaluate() argument. To separate the paths, use "|", the XPath union operator. The result contains every node matched by any one of the paths (the results are combined), so a node does not have to match all of them.
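A small sketch of the "|" operator at work, on an HTML string made up for this example. Two paths are joined: one for anchors and one for embed tags that have a src attribute. The img tag matches neither path, so it is left out.

```php
<?php
$html = '<html><body><img src="pic.jpg"><a href="a.html">A</a><embed src="movie.mov"></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Union of two paths: a node is returned if it matches either one.
$nodes = $xpath->evaluate('/html/body//a | /html/body//embed[@src]');

$found = array();
for ($i = 0; $i < $nodes->length; $i++) {
    $found[] = $nodes->item($i)->nodeName;
}
print_r($found);
```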

Look through our multi-path XPath expression. The first path uses application/x-shockwave-flash as a required MIME type for when we find data attribute node values in object tags. The second one is the anchor tag parser. The next path gets embed tags' src or qtsrc attribute node values. The next one gets value attribute node values from the param tags in object tags, as long as the param tag's name attribute is src, FileName, or movie. The next three get src attribute node values from video tags, source tags, and iframe tags, but an iframe is accepted only if its title attribute is "YouTube video player". (Only HTML5 has video and source tags, not to mention audio tags and some other new tags.) The final path parses img tags for the dynsrc attribute, a Microsoft way of sticking a video where it doesn't belong: in an image tag. Go figure!

The DOMElement class method getAttribute() is essential, since attributes are where all the node values with page urls and video urls are found. It is used, in a results loop, to get href first, and then dynsrc, src, qtsrc, value, and data attributes. The href attributes are tricky in that they can be either a new page to parse or a video url link on a page. The former gets stored in the $a array, the latter in the $video array. In handling hrefs we trim off excess spaces and replace spaces inside the urls with %20 to avoid errors. If there are # anchors in the url, they are trimmed off. If there are ? url query strings, they are dumped if the url ends with html, htm, or php before the ? symbol, but otherwise they are left alone as essential parts of video links. Path symbols like ./ and ../ and / are dumped since we only want the links on the page we are on, not elsewhere. We will, however, get to "elsewhere" (other site pages) because of our overall method, which is to get every page link url on every page and store it in the $a array, and then go to every one of these pages, getting more page urls and video links.
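The href clean-up steps can be sketched as one function. The name clean_href() is hypothetical, and the regular expression stands in for the three substr() comparisons in the listing; the order of the steps follows the description above.

```php
<?php
// Hypothetical helper mirroring the href clean-up steps.
function clean_href($url) {
    $url = str_replace(' ', '%20', trim($url));
    $w = strrpos($url, '#');
    if ($w !== false) { $url = substr($url, 0, $w); }          // drop "#anchor"
    $w = strrpos($url, '?');
    if ($w !== false && preg_match('/\.(html?|php)$/', substr($url, 0, $w))) {
        $url = substr($url, 0, $w);                             // drop "?query" on page urls only
    }
    $url = str_replace(array('../', './'), '', $url);           // dump path symbols
    return ltrim($url, '/');
}

echo clean_href(' about.html#top ') . "\n";
echo clean_href('../news/index.php?id=3') . "\n";
echo clean_href('http://www.youtube.com/watch?v=abc') . "\n";
```

Note that the YouTube link keeps its query string, because the part before the ? does not end in a page extension.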

Now the site url (plus /), a.k.a. $g, is dumped from the url being processed, and then $www, which is $g without the www. part, is dumped too. This handles either http://siteurl.com or http://www.siteurl.com being used on interior links of the website, which should have been relative links without these absolute characteristics. If the links are to offsite urls like YouTube, the absolute parts are purposely retained, because the PHP str_replace() function does nothing when it cannot find $g or $www, which is the desired result. The $ok flag means $url is a page url. If $url does not start with http (which would mark an offsite url) and it is a page url, we search the $a array. If the url isn't in the array, it is added.
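A quick sketch of that stripping step, using www.example.com as a placeholder site. The first two sample links are interior links written as absolute urls; the third is an offsite link that should pass through untouched.

```php
<?php
$g = 'http://www.example.com/';            // placeholder site url plus "/"
$www = str_replace('www.', '', $g);        // "http://example.com/"

$samples = array(
    'http://www.example.com/about.html',
    'http://example.com/contact.php',
    'http://www.youtube.com/watch?v=abc',
);

$results = array();
foreach ($samples as $url) {
    $url = str_replace($g, '', $url);      // interior link with www.
    $url = str_replace($www, '', $url);    // interior link without www.
    $results[] = $url;
}
print_r($results);
```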

If the node is an anchor with an href attribute, $url isn't empty, so control falls through to the if($url){isitvideo();} statement: an href that does not end in htm, html, or php may be a video. We run the isitvideo() function (already discussed) to find out, and it puts the url in the $video array if it is indeed a video or very likely a video. If the node has no href, one of the other attributes, such as src, likely has a value, so we then check the dynsrc, src, qtsrc, value, and data attributes in turn. Again, once one of these attributes is found, control falls through to if($url){isitvideo();}, since $url is no longer empty. After the loop we dump duplicate array values and fill the holes that are left, then call count($a) to get the size of the $a array.
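The duplicate-dumping trick in one small sketch, with made-up page names. array_flip() turns values into keys (so duplicates collapse), and array_keys() turns them back into a freshly numbered list. For comparison, array_unique() on its own keeps the original keys, which leaves holes in the numbering unless it is followed by array_values().

```php
<?php
$a = array('index.html', 'about.html', 'index.html', 'videos.php', 'about.html');

$a = array_keys(array_flip($a));   // duplicates gone, keys renumbered 0, 1, 2
$r = count($a);

print_r($a);
echo $r . "\n";
```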

Now we go through most of what we just did, but in a function. Why not just use the function from the start? Because the needs are similar, but not the same. We have to deal with folders, since we're no longer on the home page as we were above. We have to deal with the context parameter of the PHP file_get_contents() function, since network conditions can cause us to lose some page links unless we create a stream context with a timeout. Moreover, we have to run the results loop twice in the function (but not in the initial code outside the function), and we must confess we don't quite understand why. It will not work otherwise, and that's the bottom line. The global variables declared right there are needed as well, for some ungodly reason. Quirks? Bugs? (One likely factor is PHP's variable scoping: isitvideo() reads $url, $video, and $nn as globals, so it can only see values that add_urls_to_array() has written to the global variables.) Let us know if you solve these riddles. In the meantime, the script works nicely, given an impossible task to try to accomplish . . . well . . . SEMIperfectly! (But, of course, if you know of a bunch of relevant video site domains and you add them to the isitvideo() function, this makes the script more perfect!)

The reason we call the function add_urls_to_array() is to handle the dozens or hundreds of other page urls the script will encounter after the home page. Note that the $folder variable gets the folder of the page being parsed, so that if we are on stuff/myvids.html, we put stuff/ into $folder. If a url starts with ./ or ../ or http, we do not add the folder to it; otherwise we put the folder in front of the url. The rest of the function is pretty much the same as previously discussed. The stream context timeout gives the network time to catch up to the script rather than skipping links. We found this out the hard way when we ignored stream context at first. It really is needed.
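A sketch of the two pieces just described: pulling the folder out of a page url and building a stream context with a timeout. The page name and the 3-second value are example values. No page is actually fetched here; the last comment shows where the context would be passed.

```php
<?php
$z = 'stuff/myvids.html';                  // page currently being parsed (example value)
$folder = '';
$fo = strrpos($z, '/');
if ($fo !== false) { $folder = substr($z, 0, $fo + 1); }   // "stuff/"

// A timeout, in seconds, for HTTP requests made through this context.
$context = stream_context_create(array('http' => array('timeout' => 3)));
$options = stream_context_get_options($context);

echo $folder . "\n";
print_r($options);

// The fetch itself would then be: file_get_contents($g . $z, false, $context);
```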

Note that when we repeat the results loop, we begin with $url = $href->getAttribute('href'), since video links like <a href="http://www.computerhope.com/issues/floppy2.avi">Floppy drive robot</a> would otherwise be skipped, even though the first loop catches the page urls.

Out of the function and farther down the page now, we check whether the $video array has content yet (from the home page) and, if it does, stick it into the MySQL videos table, taking the precaution of running the mysql_real_escape_string() function on it first. Next we put a 3-second timeout in the stream context, a value arrived at by trial and error (mostly error!). Next comes the while ($o<$r-1){ statement that starts the section that parses one page url after another until they are all done. Note that $r increases as trips to the add_urls_to_array() function find more page urls that need checking. $o is the index of the array element being processed. Once again, the videos table gets any content that was found and put in the $video array. The for loop involved is run for every page url that has video content.
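The growing-list idea behind that while loop can be tried without a network. In this sketch the $site array is a made-up stand-in for the website (each page mapped to the page links found on it), and the inner foreach plays the part of add_urls_to_array(). The names $a, $r, $o, and $z follow the listing.

```php
<?php
// Stand-in for the website: each page maps to the page links found on it.
$site = array(
    'index.html'  => array('about.html', 'videos.html'),
    'about.html'  => array('index.html', 'team.html'),
    'videos.html' => array('about.html'),
    'team.html'   => array(),
);

$a = $site['index.html'];   // page urls found on the home page
$r = count($a);
$o = -1;

while ($o < $r - 1) {
    $o++;
    $z = $a[$o];                                      // page being processed
    foreach ($site[$z] as $link) {
        if (!in_array($link, $a)) { $a[] = $link; }   // queue pages not seen before
    }
    $r = count($a);                                   // the loop bound grows with the list
}
print_r($a);
```

Because $r is recounted on every pass, pages discovered late still get visited, and the loop stops only when no unvisited page urls remain.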

Once the links are processed, "SELECT DISTINCT video FROM videos" is the SQL statement run on our table. We just want a count of the unique ones, to report in a message at the end. Then the SQL statement "SELECT * FROM videos ORDER BY video" is run, and the results, because of the ORDER BY, are alphabetical. The final table shows the video links and the page urls they are in. Then the $a array is displayed as well, so you can see the page urls. The final message tells how many videos were found, how many were unique, and how many pages were found. The final code handles the case where the site url entry was unacceptable: an example entry is shown, and then the page reloads.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>List a Website's Video Links Alphabetically</TITLE>
<meta name="description" content="List a Website's Video Links Alphabetically">
<meta name="keywords" content="Index Website Video Links,List Website Page Video Links Alphabetically,List Website Video Links, list Video Links alphabetically,Index a Website,php,CMS,javascript, dhtml, DHTML">
<style type="text/css">
BODY {margin-left:0; margin-right:0; margin-top:0;text-align:left;background-color:#ddd}
p, li {font:13px Verdana; color:black;text-align:left}
h1 {font:bold 28px Verdana; color:black;text-align:center}
h2 {font:bold 24px Verdana;text-align:center}
td {font:normal 13px Verdana;text-align:left;background-color:#ccc}
.topic {text-align:left;background-color:#fff}
.center {text-align:center;}
</style>

function isitvideo(){
global $url; global $video; global $nn;
// Extensions that are video or likely to be video (rm, ogx, and swf can also be audio).
$extensions=array(".afl",".asf",".asx",".avi",".dif",".dl",".dv",".fli",".gl",".isu",".m1v",".m2v",".mjpg",".mov",".moov",".movie",".m4v",".mpe",".mpeg",".mpg",".mv",".ogv",".ogx",".qt",".qtc",".rm",".scm",".flv",".swf",".mp4",".vdo",".viv",".vivo",".vos",".wmv",".xmz",".xsr");
// Major video site domains; feel free to add your own.
$domains=array("http://www.youtube.com","http://video.yahoo.com","http://www.metacafe.com","http://www.imeem.com","http://embed.break.com","http://www.veoh.com","http://www.hulu.com","http://www.clipshack.com","http://www.dailymotion.com","http://www.vimeo.com","http://www.liveleak.com","http://www.vidilife.com","http://www.livevideo.com","http://www.current.com","http://www.maniatv.com");
$found=false;
foreach($extensions as $ext){if(substr($url,-strlen($ext))==$ext){$found=true;break;}}
if(!$found){foreach($domains as $domain){if(substr($url,0,strlen($domain))==$domain){$found=true;break;}}}
if($found){$video[$nn]=$url;$nn++;}}

if (!isset($f)){
echo '<div id="pw" style="position:absolute;top:150px;left:50px;width:950px;text-align:center"><table style="background-color:#8aa;border-color:#00f" border="6" cellspacing=0 cellpadding=6><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="list-a-websites-video-links-alphabetically.php"><b>home page URL (must include /index.html, /index.htm, /index.php or whatever the home page filename is)</b><BR><label for="URL">URL: <input type="text" id="URL" name="siteurl" size="66" maxlength="99" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';


if (substr($f,-4)==".htm" || substr($f,-5)==".html" || substr($f,-4)==".php"){
if (substr($e,0,1)=="/"){$LLLL=strlen($e);$home=substr($e,1,$LLLL-1);}

$f=strip_tags($f);$f=str_replace($e, "", $f);
$L=strlen($f);if (substr($f,-1)=="/"){$f=substr($f,0,$L-1);}
$f=trim($f); $f = str_replace(" ", "%20", $f);

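$sql = "DROP TABLE IF EXISTS videos";
mysql_query($sql) or die(mysql_error());

$sql = "CREATE TABLE videos (
id int(4) NOT NULL auto_increment,
video varchar(255) NOT NULL default '',
pageurl varchar(255) NOT NULL default '',
PRIMARY KEY (id))";
mysql_query($sql) or die(mysql_error());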

// "mediumtext" allows over 16 million bytes

$a=array();$video=array();$n=0;$nn=0;$o=-1;$g=$f."/";echo "<B>".$f."</B><BR>";$www=str_replace("www.","",$g);
$t = file_get_contents($g.$home);
$dom = new DOMDocument();
@$dom->loadHTML($t); //@ suppresses warnings from sloppy HTML
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//object[@data and @type='application/x-shockwave-flash'] | /html/body//a | /html/body//embed[@src or @qtsrc] | /html/body//object/param[@value and (@name='src' or @name='FileName' or @name='movie')] | /html/body//video[@src] | /html/body//source[@src] | /html/body//iframe[@src and @title='YouTube video player'] | /html/body//img[@dynsrc]");

for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w && (substr($url,$w-4,4)==".htm" || substr($url,$w-5,5)==".html" || substr($url,$w-4,4)==".php")){$url=substr($url,0,$w);} //third substr argument is a length, not an end position
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
$ok="0";$url=str_replace($g, "", $url);$url=str_replace($www, "", $url);$L=strlen($url);
if(substr($url,-4)==".htm" || substr($url,-5)==".html" || substr($url,-4)==".php"){$ok="1";}

if (substr($url,0,4)<>"http" && $ok=="1"){$q=array_search($url,$a);if ($L>4 && $ok=="1" && $q===false){$a[$n]=$url;$n++;}} //dumps offsite or wrong extension links
if(!$url){$url = $href->getAttribute('src');} //audio or video
if(!$url){$url = $href->getAttribute('dynsrc');} //only video
if(!$url){$url = $href->getAttribute('qtsrc');} //audio or video
if(!$url){$url = $href->getAttribute('value');} //audio or video
if(!$url){$url = $href->getAttribute('data');} //audio or video
if($url){isitvideo();}}

$a=array_keys(array_flip($a)); //dump duplicate array values and fill the holes that are left; array_unique has BUG!
$r = count($a);

function add_urls_to_array(){
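global $a; global $g; global $z; global $t; global $r; global $nn; $nn=0; global $video; global $context; $n=$r; $folder="";
$fo=strrpos($z,"/"); if ($fo){$folder=substr($z,0,$fo+1);}
$t = file_get_contents($g.$z,false,$context); //$context must be global here or the timeout is ignored
$dom = new DOMDocument();
@$dom->loadHTML($t); //@ suppresses warnings from sloppy HTML
$xpath = new DOMXPath($dom);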
$hrefs = $xpath->evaluate("/html/body//object[@data and @type='application/x-shockwave-flash'] | /html/body//a | /html/body//embed[@src or @qtsrc] | /html/body//object/param[@value and (@name='src' or @name='FileName' or @name='movie')] | /html/body//video[@src] | /html/body//source[@src] | /html/body//iframe[@src and @title='YouTube video player'] | /html/body//img[@dynsrc]");

for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
if (substr($url,0,4)=="http"){$sf="0";}else{$sf="1";}
if (substr($url,0,3)=="../" || substr($url,0,2)=="./"){$sf="0";}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
if (substr($url,0,4)<>"http" && substr($url,0,strlen($folder))<>$folder && $sf=="1"){$url=$folder.$url;}
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w && (substr($url,$w-4,4)==".htm" || substr($url,$w-5,5)==".html" || substr($url,$w-4,4)==".php")){$url=substr($url,0,$w);} //third substr argument is a length, not an end position
$ok="0";$q=null;$url=str_replace($g, "", $url);$L=strlen($url);
if(substr($url,-4)==".htm" || substr($url,-5)==".html" || substr($url,-4)==".php"){$ok="1";$q=array_search($url,$a);}

if (substr($url,0,4)<>"http" && $ok=="1" && $q===false){$a[$n]=$url;$n++;}}

$r = count($a);global $url; global $video; global $nn;

for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href'); //page url, audio, or video
if(!$url){$url = $href->getAttribute('dynsrc');} //only video
if(!$url){$url = $href->getAttribute('src');} //audio or video
if(!$url){$url = $href->getAttribute('qtsrc');} //audio or video
if(!$url){$url = $href->getAttribute('value');} //audio or video
if(!$url){$url = $href->getAttribute('data');} //audio or video
if($url){isitvideo();}}}

if (strlen($video[0])>4){
for($i = 0; $i < count($video); $i++){
$sql="INSERT INTO videos(id, video, pageurl)VALUES('', '$j', '$z')";

$context = stream_context_create(array('http' => array('timeout' => 3))); // Timeout in seconds

while ($o<$r-1){
$o++; $z=$a[$o];$NN=$o+2;
if (strlen($video[0])>4){
for($i = 0; $i < count($video); $i++){
$sql="INSERT INTO videos(id, video, pageurl)VALUES('', '$j', '$z')";

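$result = mysql_query("SELECT DISTINCT video FROM videos")
or die(mysql_error());
$rrr = mysql_num_rows($result); //number of unique videos
$result = mysql_query("SELECT * FROM videos ORDER BY video")
or die(mysql_error());
$rr = mysql_num_rows($result); //number of videos indexed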

echo "<table border='1'>";
echo "<tr><th>Video</th><th>Page URL</th></tr>";
while($row = mysql_fetch_array($result)) {
echo "<tr><td>";
echo $row['video'];
echo "</td><td>";
echo $row['pageurl'];
echo "</td></tr>";}

echo "</table><BR>";

echo "Links<BR>";for ($i = 0; $i < $r; $i++) {echo ($i+1)." ".$a[$i]; echo "<BR>";}
echo "<BR>".$rr." videos (".$rrr." unique ones) were indexed on ".$r." pages. Press Back Button to submit another URL.";



echo '<script language="javascript">alert("Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL.");window.location="list-a-websites-video-links-alphabetically.php"; </script>';}