List a Website's Audio Links Alphabetically Using XPATH and PHP
This script will get an alphabetically sorted list of the audio links found on a website. It is limited to .html, .htm, and .php extensions for page urls.
The script uses the PHP DOM extension and PHP 5. The DOM extension is enabled by default in most PHP installations, so the following should work fine (it does for us). The DOM extension lets you operate on XML and HTML documents through the DOM API. It supports XPATH 1.0, which this script uses extensively. XPATH has been around a while. What is it? XPath is a syntax for addressing parts of an XML document (or an HTML or XHTML one). It uses path expressions to navigate in documents, and it includes a library of standard functions.
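As a warm-up before the full script, here is a minimal sketch of the DOM and XPath calls it relies on. The markup and variable names are made up for illustration, and libxml_use_internal_errors() is used here in place of the @ operator that the script itself uses.

```php
<?php
// Made-up markup, just to have something to query.
$html = '<html><body><p><a href="a.html">A</a> <a href="song.mp3">Song</a></p></body></html>';

// Collect libxml parse warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// A path expression returns a DOMNodeList that can be looped through.
$links = array();
foreach ($xpath->query('/html/body//a[@href]') as $node) {
    $links[] = $node->getAttribute('href');
}

// XPath's standard functions are available too, for example count().
$total = $xpath->evaluate('count(/html/body//a)');

print_r($links);
echo $total, "\n";
```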
We start by ensuring that only fatal run-time errors (errors that cannot be recovered from) are reported. When one occurs, execution of the script is halted and a message appears. The likely ones are running out of memory and timing out. The first happens when a site is too big for the available memory to process (or it's crammed with audio content). The second happens when the server's script execution time limit is exceeded (usually 30 seconds). It appears that the server is lenient if the other websites on the server make relatively light demands at the moment: we've experienced a nearly-3-minute script execution time as well as the 30 second timeout. The script has to fetch and parse every page on the site, but we've seen 300-page sites take under 10 seconds as well as a 240-page site with lots of audios take over 2 minutes. Feel free to attempt to manipulate the script timeout setting and risk the wrath of the host. We did NOT.
Next we include config.php so the connection to the MySQL database is set up. This gives our audio indexing routine a way to write, and later read, all the juicy data it collects, which it inserts as records in a database table called audios.
Then comes the function isitaudio(). It checks each url to see whether its extension is audio or likely to be audio. MIME types are a poor guide: if the type is application/x-shockwave-flash, application/x-oleobject, application/x-mplayer2, application/vnd.rn-realmedia, or application/ogg, the media could be audio or video, so MIME types aren't looked at except when the link is found in the data attribute of an object tag. The extensions rm, ogx, and swf can also be audio or video, so we may end up with a few links that are "likely to be audio" but aren't. Another issue is that embed, object, param, anchor, iframe, and HTML5 source tags can contain video or audio, and so can data, src, and value attributes. Many video links have no extension at all, such as those on YouTube, Break, Hulu, and plenty more, and the same goes for audio links on some audio sites. In other words, there is no way to get it perfect no matter what one does. Whatever.rm may be video, but that is not a sure thing.
As a result, we put some major audio site domains in the isitaudio() function, and you may feel free to add your own. Any link that begins with one of these domains gets included in the audio links list displayed onscreen and stored in the MySQL table. Few sites host both video and audio, so checking whether a link begins with "http://www.mp3.com" or one of the other domains is a reliable sign that the link is audio. It's too bad that there are so many hundreds of audio sites and combinations of MIME types, file extensions, tags, and tag attributes and few standards to enforce consistency, but that is how it is. HTML5 is trying to simplify things with its audio and video tags, but its source tag has already confused the issue: it can point to either video or audio!
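The same idea can be sketched more compactly with lookup lists. The helper below is hypothetical (looks_like_audio() is not part of the script) and uses much shorter lists than isitaudio(). Unlike the script, it ignores letter case and any query string when it compares extensions, which is an assumption on our part about what is wanted.

```php
<?php
// Hypothetical helper, not the script's isitaudio(): same idea, shorter lists.
function looks_like_audio($url) {
    $extensions = array('mp3', 'wav', 'ogg', 'oga', 'm4a', 'wma', 'mid', 'midi', 'ra', 'ram');
    $hosts = array('http://www.mp3.com', 'http://www.purevolume.com');

    // Known audio sites: match on the start of the url, as the script does.
    foreach ($hosts as $host) {
        if (strpos($url, $host) === 0) {
            return true;
        }
    }

    // Otherwise compare the extension of the path part only.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component, or a url that could not be parsed
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, $extensions, true);
}

var_dump(looks_like_audio('http://www.computerhope.com/issues/ibm.mp3'));
var_dump(looks_like_audio('about.html'));
```

Adding a site or an extension then means adding one array entry rather than another substr() test.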
Next there is an HTML form that gets submitted to this same page to send the site url to the PHP script. The user is instructed that the site url "must include /index.html, /index.htm, /index.php or whatever the home page filename is." The input is checked to see that it ends in htm, html, or php and if not, the script skips to the end and gives the alert "Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL." before reloading the page.
Now the parse_url() function is used to parse the URL. Called with the PHP_URL_PATH flag, it returns just the path as a string instead of the usual associative array of URL components. Then the $home variable gets filled with the file name of the home page (the path minus its leading slash).
Next the strip_tags() function is run on the url (just in case) and the home page file name is subtracted from the url to get the remainder of the url—the scheme and the host name containing the domain. If the $f now ends with "/" that character is dumped. Then spaces inside the url are replaced by %20 so PHP functions that use it do not get errors, and spaces before or after the $f variable get trimmed away.
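The url handling described above can be sketched on its own as follows. The host name is a placeholder, and the sketch uses substr() where the script uses an explicit strlen() calculation.

```php
<?php
// Placeholder input, standing in for the submitted form value.
$f = 'http://www.example.com/index.html';

// With PHP_URL_PATH, parse_url() returns only the path string.
$e = parse_url($f, PHP_URL_PATH);            // "/index.html"

// The home page file name is the path without its leading slash.
$home = (substr($e, 0, 1) == '/') ? substr($e, 1) : $e;

// Remove the path from the url, leaving the scheme and host.
$f = str_replace($e, '', strip_tags($f));
if (substr($f, -1) == '/') {
    $f = substr($f, 0, -1);                  // drop a trailing slash
}
$f = trim(str_replace(' ', '%20', $f));

echo $home, "\n", $f, "\n";
```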
Then any existing MySQL audios table is dropped and a new table of that name is built with a field for the audio links and a field for the page url where the link is found. The $a array is for page urls and the $audio array is for the audio links. It may be useful to have a MySQL table full of the site's data regarding audio links (who can say?). We use it to avoid out of memory errors from the $audio array getting too big, for storage, to be able to count unique audios for statistical purposes, and to use SQL's ORDER BY clause to let us easily display the MySQL table contents in alphabetical order. Now we save $g as $f plus "/" and put into $www the value of $g minus "www."
Now we go to the Internet with file_get_contents($g.$home), which gets the page's contents into the PHP variable $t. A new DOMDocument object is created because XPATH needs one to work on. The @ in @$dom->loadHTML($t) suppresses the warnings that sloppy HTML code produces as the page contents are loaded into the DOM object. The $xpath = new DOMXPath($dom) statement creates an XPATH object to use with the evaluate() method, which evaluates the given XPath expression, which is, in this case, rather complex. If an XPATH expression returns a node set, you get a DOMNodeList which can be looped through to get values of attributes such as href. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. Our evaluate() argument contains the path /html/body//a. This gets link anchor elements, and $url = $href->getAttribute('href') is used, in a loop, to read the href attribute values that hold the link urls. Multiple XPath paths are used in our evaluate() argument. To separate the paths, use "|", the union operator: the result is the combined set of nodes matched by any of the paths, so a node only has to match one of them.
Look through our multi-path XPATH expression. The first path requires application/x-shockwave-flash as the MIME type when it picks up object tags that have a data attribute. The second one is the anchor tag parser. The next path gets embed tags that have a src or qtsrc attribute. The next one gets param tags inside object tags, as long as the param tag has a value attribute and its name attribute is either src or FileName. The next three get audio tags, bgsound tags, and source tags that have a src attribute. (Only HTML5 has audio and source tags, not to mention video tags and some other new tags.)
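To see how a multi-path expression behaves, here is a small sketch that runs a shortened version of it against made-up markup. Each node only has to match one of the paths joined by "|". The attribute fallback loop is our own compact stand-in for the script's chain of getAttribute() calls.

```php
<?php
// Made-up page fragment with one node for each of four paths.
$html = '<html><body>'
      . '<a href="page2.html">Next</a>'
      . '<embed src="tune.mid">'
      . '<audio src="clip.ogg"></audio>'
      . '<object><param name="src" value="voice.wav"></object>'
      . '</body></html>';

libxml_use_internal_errors(true);   // HTML5 tags such as <audio> can trigger libxml warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(
    "/html/body//a | /html/body//embed[@src or @qtsrc] | "
  . "/html/body//object/param[@value and (@name='src' or @name='FileName')] | "
  . "/html/body//audio[@src]"
);

// Take the first non-empty attribute among the ones the script checks.
$found = array();
foreach ($nodes as $node) {
    foreach (array('href', 'src', 'qtsrc', 'value', 'data') as $attr) {
        $value = $node->getAttribute($attr);
        if ($value !== '') {
            $found[] = $node->nodeName . ': ' . $value;
            break;
        }
    }
}
print_r($found);
```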
The DOMElement class method getAttribute() is essential since attributes are where all the page urls and audio urls will be found. It is used, in a results loop, to get href first, and then src, qtsrc, value, and data attributes. The href attributes are tricky in that they can be either a new page to parse or an audio link. The former gets stored in the $a array, the latter in the $audio array. In handling hrefs we trim off excess spaces and replace spaces inside the urls with %20 to avoid errors. If there are # anchors in the url, they are trimmed off. If there are ? url query strings, they are dumped if the url ends with html, htm, or php before the ? symbol, but otherwise left alone as essential aspects of audio links. Path symbols like ./ and ../ and a leading / are dumped, which leaves a plain relative url. We still reach the other pages of the site because of our overall method, which is to get every page link url on every page and store it in the $a array, and then go to every one of these pages, getting more page urls and audio links.
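The href clean-up steps can be sketched as one small function. The name clean_href() is hypothetical, and the sketch uses a regular expression for the extension test where the script uses substr() comparisons. It also treats a # at the very start of the url as an anchor, which the script does not.

```php
<?php
// Hypothetical helper mirroring the clean-up steps described above.
function clean_href($url) {
    $url = str_replace(' ', '%20', trim($url));

    // Drop a #anchor.
    $pos = strrpos($url, '#');
    if ($pos !== false) {
        $url = substr($url, 0, $pos);
    }

    // Drop a ?query only when the part before it is a page url.
    $pos = strrpos($url, '?');
    if ($pos !== false && preg_match('/\.(html?|php)$/', substr($url, 0, $pos))) {
        $url = substr($url, 0, $pos);
    }

    // Drop path symbols and one leading slash.
    $url = str_replace('../', '', $url);
    $url = str_replace('./', '', $url);
    if (substr($url, 0, 1) == '/') {
        $url = substr($url, 1);
    }
    return $url;
}

echo clean_href(' about.html#team '), "\n";
echo clean_href('play.mp3?track=7'), "\n";
```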
Now the site url (plus /), A.K.A. $g, is dumped from the url being processed, and then so is $www, which is $g without the www. part. This handles either http://siteurl.com or http://www.siteurl.com being used on interior links of the website, which should have been relative links without these absolute link characteristics. If the links are to offsite urls like YouTube, the absolute aspects are purposely retained because the PHP str_replace() function does nothing when it does not find $g or $www, which is the desired result. The $ok flag means $url is a page url (it ends in htm, html, or php). If $url does not start with http (which would mark it as offsite) and it is a page url, we search the $a array. If the url isn't in the array, it is added.
If the node is an anchor whose href is not a page url (it does not end in htm, html, or php), $url isn't empty and control slips to the if($url){isitaudio();} statement, since it may be an audio. We run the isitaudio() function (already discussed) to find out, which puts it in the $audio array if it is indeed an audio or very likely an audio. If the node has no href, one of the other attributes like src likely is present, so we then check for a src, qtsrc, value, or data attribute. Again, if one of these attributes is found, control slips down to if($url){isitaudio();}, since $url is not empty. Once the loop ends, we dump duplicate array values and fill the holes that are left, then count the $a array elements to get the array size.
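The duplicate-dumping step is short enough to try by itself. The page names below are made up.

```php
<?php
// Made-up page list with repeats.
$a = array('index.html', 'about.html', 'index.html', 'news.php', 'about.html');

// array_flip() turns each value into a key, so repeats collapse into one key.
// array_keys() then returns those keys as a list indexed 0, 1, 2, ... with no holes.
$a = array_keys(array_flip($a));
$r = count($a);

print_r($a);
echo $r, "\n";
```

For a list of strings like this one, array_values(array_unique($a)) should give the same result: array_unique() keeps the original keys, which leaves gaps, and array_values() closes them.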
Now we go through most of what we just did, but in a function. Why not use just the function from the start? Because the needs are similar, but not the same. We have to deal with folders, since we're no longer on the home page, like we were above. We have to deal with the context parameter for the PHP file_get_contents() function, since network conditions can cause us to lose some page links unless we create a stream context with a timeout. Moreover, we have to use the results loop twice in the function (but not in the initial codes outside the function), and we must confess this puzzled us. It will not work otherwise, and the global variables declared right before the second loop are needed as well. The likely explanation is PHP's variable scope: isitaudio() reads $url, $audio, and $nn from the global scope, while variables inside add_urls_to_array() are local until they are declared global. The first loop therefore works on a local $url that isitaudio() cannot see, and only after the global $url declaration does the second loop hand its values to isitaudio(). Let us know if you find a tidier way. In the meantime, the script works nicely, given an impossible task to try to accomplish . . . well . . . SEMIperfectly! (But, of course, if you find that you know of a bunch of relevant audio site domains and you add them to the isitaudio() function, this makes the script more perfect!)
The reason we call the function add_urls_to_array() is to handle the dozens or hundreds of other page urls the script will encounter after the home page. Note that the $folder variable gets the folder part of the page being parsed, so that if the page is stuff/mysound.html, we put stuff/ into $folder. If a link on that page starts with ./ or ../ or http, the folder is not applied; otherwise $folder is put in front of the link so it keeps its folder along with the rest of the url. The rest of the function is pretty much the same as previously discussed. The stream context timeout gives the network time to catch up to the script rather than skipping links. We found this out the hard way when we ignored stream context at first. It really is needed.
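The two ingredients that are new in this function, the folder part and the stream context, can be sketched like this. The helper names folder_of() and fetch_page() are hypothetical. Passing the context in as an argument is our choice here; the point is only that the context has to be visible inside whatever function calls file_get_contents().

```php
<?php
// Hypothetical helper: return the folder part of a page url, or "" if there is none.
function folder_of($pageUrl) {
    $pos = strrpos($pageUrl, '/');
    return ($pos === false) ? '' : substr($pageUrl, 0, $pos + 1);
}

// A stream context carrying a 3-second HTTP timeout.
$context = stream_context_create(array('http' => array('timeout' => 3)));

// Hypothetical helper: fetch a page using the given context.
// The @ hides the warning on failure; the function then returns false.
function fetch_page($url, $context) {
    return @file_get_contents($url, false, $context);
}

echo folder_of('stuff/mysound.html'), "\n";
```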
Note that when we repeat the attribute-getting results loop, we begin again with $url = $href->getAttribute('href'), since audio links like <a href="http://www.computerhope.com/issues/ibm.mp3">IBM audio commercial</a> would be skipped otherwise, even though the first loop catches the page urls.
Out of the function and farther down the page now, we check if the $audio array has content yet (from the home page) and stick it into the MySQL audios table if it does, taking the precaution of running the mysql_real_escape_string() function on it first. (The mysql_* functions belong to the old mysql extension, which was removed in PHP 7, so the script as written needs PHP 5.) Next we put a 3-second timeout in the stream context, a value arrived at by trial and error (mostly error!). Next comes the while ($o<$r-1){ that starts the section that parses one page url after another until they are all done. Note that $r will have its value increased as more page urls are found by trips to the add_urls_to_array() function. $o is the number of the $a element being processed. Once again, the audios table gets any content found and put in the $audio array. The for loop involved is run for every one of the page urls that has audio content.
Once the links are processed, "SELECT DISTINCT audio FROM audios" is the SQL statement run on our table. We just want a count of the unique ones, to report in a message at the end. Then the SQL statement "SELECT * FROM audios ORDER BY audio" is run and the results, because of the ORDER BY, are alphabetical. The final table shows the audio links and the page urls they are in. Then the $a array is displayed as well, so you can see the page urls. The final message tells how many audios were indexed, how many were unique, and how many pages were found. The last lines of code handle an unacceptable site url entry: an example entry is shown in an alert, and then the page reloads.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>List a Website's Audio Links Alphabetically</TITLE>
<meta name="description" content="List a Website's Audio Links Alphabetically">
<meta name="keywords" content="Index Website Audio Links,List Website Page Audio Links Alphabetically,List Website Audio Links, list Audio Links alphabetically,Index a Website,php,CMS,javascript, dhtml, DHTML">
<style type="text/css">
BODY {margin-left:0; margin-right:0; margin-top:0;text-align:left;background-color:#ddd}
p, li {font:13px Verdana; color:black;text-align:left}
h1 {font:bold 28px Verdana; color:black;text-align:center}
h2 {font:bold 24px Verdana;text-align:center}
td {font:normal 13px Verdana;text-align:left;background-color:#ccc}
.topic {text-align:left;background-color:#fff}
.center {text-align:center;}
</style>
</head>
<body>
<?php
error_reporting(E_ERROR);
include_once"config.php";
function isitaudio(){
global $url; global $audio; global $nn;
if (substr($url,-4)==".aac" || substr($url,-4)==".aif" || substr($url,-5)==".aifc" || substr($url,-5)==".aiff" || substr($url,-3)==".au" || substr($url,-5)==".funk" || substr($url,-4)==".gsd" || substr($url,-4)==".gsm" || substr($url,-3)==".it" || substr($url,-4)==".jam" || substr($url,-3)==".la" || substr($url,-4)==".lam" || substr($url,-4)==".lma" || substr($url,-4)==".m2a" || substr($url,-4)==".m3u" || substr($url,-4)==".mid" || substr($url,-5)==".midi" || substr($url,-4)==".mod" || substr($url,-4)==".mp2" || substr($url,-4)==".mp3" || substr($url,-4)==".mpa" || substr($url,-4)==".m1a" || substr($url,-5)==".mpga" || substr($url,-3)==".my" || substr($url,-4)==".oga" || substr($url,-4)==".ogg" || substr($url,-4)==".ogx" || substr($url,-6)==".pfunk" || substr($url,-3)==".ra" || substr($url,-4)==".ram" || substr($url,-3)==".rm" || substr($url,-4)==".rmi" || substr($url,-4)==".rmm" || substr($url,-4)==".rmp" || substr($url,-4)==".rnx" || substr($url,-4)==".rpm" || substr($url,-3)==".rv" || substr($url,-4)==".s3m" || substr($url,-4)==".sid" || substr($url,-4)==".snd" || substr($url,-4)==".ssm" || substr($url,-4)==".swf" || substr($url,-4)==".m4a" || substr($url,-4)==".tsi" || substr($url,-4)==".tsp" || substr($url,-4)==".voc" || substr($url,-4)==".vox" || substr($url,-4)==".vqf" || substr($url,-4)==".wav" || substr($url,-4)==".wma" || substr($url,-3)==".xm" || substr($url,0,25)=="http://www.purevolume.com" || substr($url,0,18)=="http://www.mp3.com" || substr($url,0,21)=="http://www.deezer.com" || substr($url,0,14)=="http://mog.com" || substr($url,0,27)=="http://www.jukeboxalive.com" || substr($url,0,25)=="http://www.dopetracks.com" || substr($url,0,28)=="http://www.apple.com/itunes/" || substr($url,0,21)=="http://www.emusic.com" || substr($url,0,16)=="http://bleep.com" || substr($url,0,20)=="http://www.ilike.com"){$audio[$nn]=$url;$nn++;}}
$f=isset($_POST['siteurl']) ? $_POST['siteurl'] : null; // avoids an undefined-index notice on the first page load
if ($f===null){
echo '<div id="pw" style="position:absolute;top:150px;left:50px;width:950px;text-align:center"><table style="background-color:#8aa;border-color:#00f" border="6" cellspacing=0 cellpadding=6><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="list-a-websites-audio-links-alphabetically.php"><b>home page URL (must include /index.html, /index.htm, /index.php or whatever the home page filename is)</b><BR><label for="URL">URL: <input type="text" name="siteurl" size="66" maxlength="99" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';
}else{
if (substr($f,-4)==".htm" || substr($f,-5)==".html" || substr($f,-4)==".php"){
$e=(parse_url($f,PHP_URL_PATH));
if (substr($e,0,1)=="/"){$LLLL=strlen($e);$home=substr($e,1,$LLLL-1);}
$f=strip_tags($f);$f=str_replace($e, "", $f);
$L=strlen($f);if (substr($f,-1)=="/"){$f=substr($f,0,$L-1);}
$f = str_replace(" ", "%20", $f); $f=trim($f);
$sql = "DROP TABLE IF EXISTS audios";
mysql_query($sql);
$sql = "CREATE TABLE audios (
id int(4) NOT NULL auto_increment,
audio varchar(255) NOT NULL default '',
pageurl varchar(255) NOT NULL default '',
PRIMARY KEY (id)
) ENGINE=MyISAM AUTO_INCREMENT=1";
mysql_query($sql);
// "mediumtext" allows over 16 million bytes
$a=array();$audio=array();$n=0;$nn=0;$o=-1;$g=$f."/";echo "<B>".$f."</B><BR>";$www=str_replace("www.","",$g);
$t = file_get_contents($g.$home);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//object[@data and @type='application/x-shockwave-flash'] | /html/body//a | /html/body//embed[@src or @qtsrc] | /html/body//object/param[@value and (@name='src' or @name='FileName')] | /html/body//audio[@src] | /html/body//bgsound[@src] | /html/body//source[@src]");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w && (substr($url,$w-4,4)==".htm" || substr($url,$w-5,5)==".html" || substr($url,$w-4,4)==".php")){$url=substr($url,0,$w);} // third substr() argument is a length, not an end position
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
$ok="0";$url=str_replace($g, "", $url);$url=str_replace($www, "", $url);$L=strlen($url);
if(substr($url,-4)==".htm" || substr($url,-5)==".html" || substr($url,-4)==".php"){$ok="1";}
if (substr($url,0,4)<>"http" && $ok=="1"){$q=array_search($url,$a);if ($L>4 && $ok=="1" && $q===false){$a[$n]=$url;$n++;}} //dumps offsite or wrong extension links
else{
if(!$url){$url = $href->getAttribute('src');} //audio or video
if(!$url){$url = $href->getAttribute('qtsrc');} //audio or video
if(!$url){$url = $href->getAttribute('value');} //audio or video
if(!$url){$url = $href->getAttribute('data');} //audio or video
if($url){isitaudio();}
}}
$a=array_keys(array_flip($a)); //dump duplicate array values and fill the holes that are left; array_unique has BUG!
$r = count($a);
function add_urls_to_array(){
global $a; global $g; global $z; global $t; global $r; global $nn; $nn=0; global $audio; global $context; $n=$r; $folder=""; // $context must be declared global or the timeout is never applied
$fo=strrpos($z,"/"); if ($fo){$folder=substr($z,0,$fo+1);}
$LLL=strlen($folder);
$t = file_get_contents($g.$z,false,$context);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//object[@data and @type='application/x-shockwave-flash'] | /html/body//a | /html/body//embed[@src or @qtsrc] | /html/body//object/param[@value and (@name='src' or @name='FileName')] | /html/body//audio[@src] | /html/body//bgsound[@src] | /html/body//source[@src]");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
if (substr($url,0,4)=="http"){$sf="0";}else{$sf="1";}
if (substr($url,0,3)=="../" || substr($url,0,2)=="./"){$sf="0";}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
if (substr($url,0,4)<>"http" && substr($url,0,$LLL)<>$folder && $sf=="1"){$url=$folder.$url;}
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w && (substr($url,$w-4,4)==".htm" || substr($url,$w-5,5)==".html" || substr($url,$w-4,4)==".php")){$url=substr($url,0,$w);} // third substr() argument is a length, not an end position
$ok="0";$q=null;$url=str_replace($g, "", $url);$L=strlen($url);
if(substr($url,-4)==".htm" || substr($url,-5)==".html" || substr($url,-4)==".php"){$ok="1";$q=array_search($url,$a);}
if (substr($url,0,4)<>"http" && $ok=="1" && $q===false){$a[$n]=$url;$n++;}}
$r = count($a);global $url; global $audio; global $nn;
for($i = 0; $i < $hrefs->length; $i++){
$href = $hrefs->item($i);
$url = $href->getAttribute('href'); //audio or video
if(!$url){$url = $href->getAttribute('src');} //audio or video
if(!$url){$url = $href->getAttribute('qtsrc');} //audio or video
if(!$url){$url = $href->getAttribute('value');} //audio or video
if(!$url){$url = $href->getAttribute('data');} //audio or video
if($url){isitaudio();}}
}
$z=$home;$NN=1;$rr=$nn;
if (isset($audio[0]) && strlen($audio[0])>4){
$p=mysql_real_escape_string($z); // escape the page url once, not again on every pass
for($i = 0; $i < count($audio); $i++){
$j=mysql_real_escape_string($audio[$i]);
$sql="INSERT INTO audios(id, audio, pageurl)VALUES('', '$j', '$p')";
$result=mysql_query($sql);}}
$context = stream_context_create(array('http' => array('timeout' => 3))); // Timeout in seconds
$z="";
while ($o<$r-1){
$audio=array();
$o++; $z=$a[$o];$NN=$o+2;
add_urls_to_array();
if (isset($audio[0]) && strlen($audio[0])>4){
$p=mysql_real_escape_string($z); // escape the page url once, not again on every pass
for($i = 0; $i < count($audio); $i++){
$j=mysql_real_escape_string($audio[$i]);
$sql="INSERT INTO audios(id, audio, pageurl)VALUES('', '$j', '$p')";
$result=mysql_query($sql);$rr++;}}
}
$result = mysql_query("SELECT DISTINCT audio FROM audios")
or die(mysql_error());
$rrr=mysql_num_rows($result);
$result = mysql_query("SELECT * FROM audios ORDER BY audio")
or die(mysql_error());
echo "<table border='1'>";
echo "<tr><th>Audio</th><th>Page URL</th></tr>";
while($row = mysql_fetch_array($result)) {
echo "<tr><td>";
echo $row['audio'];
echo "</td><td>";
echo $row['pageurl'];
echo "</td></tr>";
}
echo "</table><BR>";
mysql_close();
unset($f);
echo "Links<BR>";for ($i = 0; $i < $r; $i++) {echo ($i+1)." ".$a[$i]; echo "<BR>";}
echo "<BR>".$rr." audios (".$rrr." unique ones) were indexed on ".$r." pages. Press Back Button to submit another URL.";
}else{
mysql_close();
unset($f);
echo '<script language="javascript">alert("Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL.");window.location="list-a-websites-audio-links-alphabetically.php"; </script>';}
}
?>
</body>
</html>