Search Website without Indexing Using XPATH and PHP

This script searches any website for a term or phrase without requiring site indexing first, unlike Free-Website-Indexing-Script.html. It is limited to page URLs with .html, .htm, and .php extensions. The reason to use the Free-Website-Indexing-Script.html script, which indexes one of YOUR sites, is the same reason that other site searches do indexing first: it makes searches a lot faster, since the page contents are indexed in a convenient MySQL database table. This means you get fewer failures due to default script timeouts (30 seconds allowed if the servers are busy, more if not). If you host your own site on your own server, feel free to set the timeout setting to infinity and use this no-indexing site search. Another reason to use it is if you want to search relatively small sites or ones with few words per page. However, to keep it fast we did not search title tags or description metatags (and we did not use MySQL or any other database). If you need the title tags or description metatags searched, use Free-Website-Indexing-Script.html and its site search on your website.

The script uses the PHP DOM extension and PHP 5. The DOM extension is enabled by default in most PHP installations, so the following should work fine; it does for us. The DOM extension lets you operate on XML documents (or HTML or XHTML ones) through the DOM API with PHP 5. It supports XPATH 1.0, which this script uses extensively. XPATH has been around a while. What is it? XPath is a syntax for defining parts of an XML document. It uses path expressions to navigate in documents, and it comes with a library of standard functions.
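
To make the DOM-plus-XPATH combination concrete, here is a minimal sketch that grabs a page and lists its link URLs the same way the script does (the URL is just a placeholder):

$html = file_get_contents("http://www.example.com/index.html"); // placeholder URL
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings about sloppy HTML
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a"); // all anchors in the body
for ($i = 0; $i < $hrefs->length; $i++) {
echo $hrefs->item($i)->getAttribute('href')."<BR>";
}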

The bottom line here is that in order to have a fast and efficient site search on your website, you first have to index your site. That means running our indexing script, which crawls through your website, grabbing most of the searchable text and storing it in a MySQL database. No tags are stored at all, and neither is the content inside script tags, style tags, and the like. What is stored: the content of the head tag's title and description, the page URL, and all the non-tag text between the start and end tags of body tags, paragraph tags, heading tags like <H1>, and so on.

But this no-indexing-site-search script may suffice for the smaller sites you want to search. Who can say?

The script starts with an error reporting function: error_reporting(E_ERROR) lets you know about the nature of any fatal run-time errors. Errors that cannot be recovered from are called "fatal"; execution of the script is halted. Examples: server time-outs, because the script execution time limit is usually 30 seconds on hosted servers (unless you are the host), and out-of-memory errors while running a script. Both of these can happen in this script if network connections are weak or you try to search too big a website. We've indexed sites of 789 pages with no problem with Free-Website-Indexing-Script.html, but when network connections are weak, we can have a problem with a smaller site. The lesson: do no indexing during peak network use or peak server use times. This script makes extensive use of the file_get_contents() function, which goes out to web pages on the Internet and reads their contents; the script then searches them and displays the results. While crawling, the script finds links to other pages and crawls those too. All links are stored in an array, and as new links are found, the script checks the array to make sure each one is not already in the list. Each page gets extensive processing before the search. You can see why commercial web crawlers host their own servers with no time limits, since doing all the above in under 30 seconds is asking a lot!
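
If you are your own host, the relevant settings look something like this (a sketch; adjust the values to your environment):

error_reporting(E_ERROR); // report only fatal run-time errors
set_time_limit(0); // 0 = no script execution time limit (your own server only)
ini_set('memory_limit', '256M'); // extra headroom for crawling big sites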

After the error reporter, an HTML form is echoed to the screen if the user has not yet submitted a URL. If s/he has submitted, the URL is processed, as long as a search term/phrase has been entered. The form action is to reload the page (no-indexing-site-search.php) while POSTing the URL and search term to the PHP script.

If a URL was submitted, it is checked to see that it ends in php, html, or htm. If it doesn't, the user did not read the instruction that said to include the file name of the home page as well as the rest of the URL. The script jumps all the way to the bottom, where an alert is given along with an example of doing it right, and then the page reloads. If the search term is too short or too common, an alert is given and then the page reloads.

If the home page file name was correctly included in the URL, we start getting serious. The URL is parsed with the parse_url() function, which returns an array of the URL's parts. The path is the /filename in our case; the / is removed so we have just index.html (or whatever) in the $home variable. Next we dump the filename from the URL the user entered so we have just the site URL in $f, without the filename. If there's a / at the end of $f, we dump it. Note that we also dump tags in the input using strip_tags(), trim any spaces the user typed before or after the URL, and change any spaces inside the URL to %20, since URLs with spaces misbehave in PHP functions.
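
For example, here is what that parsing does to a typical submission (a sketch with a placeholder URL):

$f = "http://www.yoursite.com/index.html"; // placeholder user input
$e = parse_url($f, PHP_URL_PATH); // "/index.html"
$home = substr($e, 1); // "index.html", leading / removed
$f = str_replace($e, "", $f); // "http://www.yoursite.com", filename dumped
echo $home." and ".$f;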

Now we add / to the site URL and put it in $g as a base to build URLs with, since the file_get_contents() function cannot work with relative addresses; it needs absolute ones. Then we use file_get_contents() on the home page and get its full contents.

We use XPATH to evaluate the home page. (There are a few places on the Net to learn about $xpath = new DOMXPath($dom).) Here we get all the link URLs, trim off outer spaces, replace inner spaces with %20 so the URL has no holes that prevent it from working correctly in PHP functions, and use strrpos(), which finds the last position of one string within another, to dump anchors (#whatever) and query strings (?whatever). Next, relative path syntax such as ../ and ./ is dumped. If the URL starts with /, that is dumped too, but / inside the path is left alone. If the site URL with / at the end is found in the URL (i.e., it is an absolute URL), it is removed, since we want only relative URLs.
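
In isolation, the cleanup applied to each extracted link looks like this (a sketch):

$url = trim(" folder/page one.html#section "); // raw href from the page
$url = str_replace(" ", "%20", $url); // no holes in URLs
$w=strrpos($url,"#"); if ($w){$url=substr($url,0,$w);} // dump anchors
$w=strrpos($url,"?"); if ($w){$url=substr($url,0,$w);} // dump query strings
$url = str_replace("../", "", $url); // dump relative path syntax
$url = str_replace("./", "", $url);
echo $url; // folder/page%20one.html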

Then the script dumps offsite, home page, or wrong-extension links. Finally, if the URL made it this far and is 5 or more characters long (the minimum allowable relative URL: a.htm), it is added to the $a array. Once the home page links are all in the array, $a=array_keys(array_flip($a)) dumps duplicate array values and fills the holes that are left; array_unique has a bug and is about 10 times slower, so it is not used. Why does this work? The array_flip documentation says: "array_flip — Exchanges all keys with their associated values in an array. If a value has several occurrences, the latest key will be used as its value, and all others will be lost." And array_keys says: "array_keys — Return all the keys of an array." So the combination dumps duplicate values and reindexes the array with no holes. Finally, the count() function puts the number of array elements into the $r variable.
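
Here is the dedup trick in isolation:

$a = array("a.htm", "b.htm", "a.htm", "c.php", "b.htm");
$a = array_keys(array_flip($a)); // dedupe and reindex in one pass
print_r($a); // a.htm, b.htm, c.php remain, reindexed 0 through 2
$r = count($a); // 3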

Control now jumps over the add_urls_to_array() function. Then the stream context is created so the file_get_contents() function doesn't miss a slow-to-load page. Before we added the stream context, a tiny fraction of pages would be bypassed due to remote server busyness. If the remote server hangs or fails to respond, you need to program your way around it. As one PHP experts' site says, "Most HTTP requests complete in sub-second time," so a stream context timeout is not usually needed; but when it is, a timeout of 3 seconds really comes to the rescue during a remote server falter.
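
The stream context itself is just a couple of lines (a sketch; the URL is a placeholder):

$context = stream_context_create(array('http' => array('timeout' => 3))); // 3-second limit
$t = @file_get_contents("http://www.example.com/slow-page.html", false, $context);
if ($t === false) { echo "Remote server faltered; page skipped."; }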

Now we run a while loop. The PHP variable $o is the array element number we're currently on. We've already processed the home page, but now we need to loop through all the URLs in the $a array, processing one page at a time. While on these pages, if more URLs are found that are not yet in the array, they are added to the end of the array.

The function add_urls_to_array() adds any new URLs found during site crawling to the main URLs array $a. In the first 3 lines, we find folders and subfolders in the URL and put everything but the file name into $folder. When we run the PHP function file_get_contents(), we use the stream context discussed above to cope with remote server falters or weak network conditions. The filename parameter of file_get_contents() (besides the stream context) is $g (e.g., http://www.yoursite.com/) concatenated with $z (e.g., folder/subfolder/file.html). We once again use XPATH to get any links on the page, and near the end of this function we use the PHP function array_search() to see if each URL is already in the array; if not, we add it. If the URL starts with http or ./ or ../, we zero the "folder prepending flag" named $sf. Otherwise, we concatenate $folder and $url to get the correct URL for the array (e.g., folder/subfolder/file.html). We replace $g in the URL (if we find it) with an empty string, since only relative addresses are allowed in our array. If we still find "http" in the URL after all this, we dump it; it's not part of this website. Incidentally, dumping URL filenames like home, placeholder, and default (and a few others) is needed because we've already processed the home page; these names are all acceptable home pages on websites, but we're only recognizing the one we started with, the one the user was asked to put at the end of the site URL in the HTML form.
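
The folder-prepending logic boils down to this (a sketch with made-up values):

$z = "folder/subfolder/file.html"; // page currently being crawled
$folder = "";
$fo = strrpos($z, "/");
if ($fo){$folder = substr($z, 0, $fo+1);} // "folder/subfolder/"
$url = "other.html"; // a bare link found on that page
echo $folder.$url; // folder/subfolder/other.html goes into the array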

The stream_context_create() function defines the stream context, as discussed. Then the while loop runs through the $a array elements, adding more URLs to the $a array as it goes. If the search term is acceptable, we go on to filter it, stripping out tags and dumping unacceptable characters. Then we grab each page's content with the PHP function file_get_contents().

The next code block processes the page content. It dumps the head, style, script, object, embed, applet, noframes, noscript, noembed, and comment tags and everything between them. The &nbsp; (nonbreaking space) entities are replaced with spaces. Any other tags, like P, div, font, and span, are dumped, but what's between them is not touched, obviously. The [^>]*? in the opening-tag regular expression allows this generic tag dumper to succeed even if there are attributes in the tag, like class='h', for instance. This is critical, since many webmasters put such attributes all over the place, and trying to catch bare tags only, without allowing for attributes, WILL CAUSE TONS OF PAGE TEXT TO SIMPLY VANISH from a page's content string when the strip_tags() function comes up a couple of lines later. (So if a generic opening-tag dumper without [^>]*? misses one or more tags, it will be because it's not a simple <P> tag but a <P id='main'> tag, one with attributes.) In other words, if we did the opening tag the way we did the closing tag (which takes no attributes), we would risk losing much of the page content on many website pages, simply because of that oversight about tag attributes. (The "\\0 " replacement is the whole matched tag plus a space, so each tag gets padded with a space before strip_tags() removes it.)

The above generic opening tag dumper was tested on sites with and without [^>]*?, and, sure enough, without [^>]*?, a lot of page content was missing that had been nestled between paragraph tags with attributes. But that's not the only story we have for you.

It's hard to believe, but ONE SINGLE MISSING CHARACTER ON ONE PAGE can make the difference between ALL page content showing up and NO page content showing up, on ALL of a site's pages! True story. We tested a website and the indexing was a sick joke. We investigated why. It turned out that a missing character in one file created this entire disaster. It's good to run your pages through an HTML validator; we hadn't noticed that an ending span tag was coded as </span instead of </span>. The good news is that the browsers were forgiving and merciful and let it slide, and everything displayed as it should despite our goof. The bad news is that our indexer was neither forgiving nor merciful. Our generic closing-tag dumper expects a complete tag (the validator caught the incomplete one), so it left the </span in place. One would think the strip_tags() function on the next line would then ignore it; it's not a real tag if it's incomplete, right? Wrong. It DID NOT ignore it. That function looks for any tags missed by our bunch of tag dumpers and deals with them harshly: it treated the partial </span as the start of a tag and looked for the closing >. It didn't find one, so it defaulted to the end of the page! The strip_tags() documentation says: "Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected." (Ya think?!) The end result is that it removed the page content after the broken tag, which was nearly everything. So why would this goofy tag dump the page content of 240 pages? Because it was in a PHP included file, included with code like this: <?php include("important-links.html"); ?>. This code is on every page of that site. So the page with the broken tag (which we fixed once we found it, solving the whole problem) was part of the code for all pages; that's how includes work. So the strip_tags() function dumped all page content of all pages, except for the teeny bit of stuff before the broken tag. The moral of the story: always validate your include files.
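
You can watch strip_tags() do exactly this damage with a two-line test:

$good = "Intro </span> All the page content follows here.";
$broken = "Intro </span All the page content follows here.";
echo strip_tags($good); // "Intro  All the page content follows here."
echo strip_tags($broken); // "Intro " and everything after the < is gone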

Our strategy to avoid bad search results, besides making good, well-tested site search and site indexing scripts, was to let the tag dumpers above help avoid unintentional meanings by replacing tags with spaces. Had we not done this, the last word of one paragraph could get concatenated with the first word of the next, and a search would display and/or find weird results. If a paragraph ends with dog. and the next one starts with Dew, searchers would find neither the lone word dog. nor the lone word Dew, but only the word dog.Dew. So it's important to keep words apart. If one paragraph ends with "then the sales fell off." and the next paragraph starts with "The cliff was high but we climbed it.", then "fell off.The cliff" illustrates why paragraphs need spaces between them.
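
A quick demonstration of why the tag dumpers pad with spaces before strip_tags() runs (note that [^>]*? lets the padding work on tags with attributes too):

$html = "<p class='h'>then the sales fell off.</p><p>The cliff was high.</p>";
$padded = preg_replace("/<[A-Za-z]+[^>]*?>/i", "\\0 ", $html); // pad opening tags
$padded = preg_replace("/<\/[A-Za-z]+>/", "\\0 ", $padded); // pad closing tags
echo strip_tags($padded); // "then the sales fell off. The cliff was high." with words kept apart
echo strip_tags($html); // "then the sales fell off.The cliff was high." with words fused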

We replace carriage returns and newlines with spaces, and trim the string, which knocks out spaces, NULLs, and tabs before and after it.

Our searcher will go at least 3 levels deep if you use folder/subfolder/file.html or ../folder/subfolder/file.html syntax, rather than ../folder/../subfolder/file.htm syntax, which confuses the script so you get fewer pages searched than you wanted. We don't mess with https, pdf, Excel, PowerPoint, Word, text, doc, asp, aspx, xml, xhtml, images, or other file types, or with nofollow or noindex attributes, etc. We do not deal with robots, ports, sockets, sessions, keywords, encrypted or hashed or password-protected files, or links not on the submitted domain. We just search .html, .htm, and .php file extensions on your domain, period. (The .shtml extension may also work; we haven't tried it. An SHTML document contains "Server Side Includes" that the server processes before the page gets sent to the browser, and it depends on the settings on your particular server.)

Next we use the PHP function stripos() to find the position of the search string $S in the $content string. If it's found, we print results onscreen. Each result uses $g.$a[$j] as a title, which we turn into a link; you can click it and go to that page. Next, we create a substring $x that holds the 350 characters beginning with the found search term or phrase. Then we use the PHP function str_ireplace() to put a span tag around the search term that highlights it with a light blue background color. After the page URL link comes a new line that starts with ... to represent "not the start of the page." We do the same at the end of the 350-character excerpt, which we echo to the screen. Finally, we repeat the page URL in green italics so the search result looks somewhat Google-like. Each time we find the search term, we increment the $found variable. If it is still zero after searching the site, we give the message "Term was not found on this site." If the URL entered in the form was not a page ending in html, htm, or php, the user gets a message showing an example of correct input, and then the page is reloaded.
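
In isolation, the excerpt-and-highlight step looks like this (a sketch with placeholder content):

$content = "Lots of page text about the squirrel valley railroad and more text.";
$S = "railroad";
$starter = stripos($content, $S); // case-insensitive position, or false
if ($starter !== false){ // !== false so a hit at position 0 still counts
$x = substr($content, $starter, 350); // 350-character excerpt
$x = str_ireplace($S, '<span style="background-color:lightblue;">'.$S.'</span>', $x);
echo "...".$x."...";
}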

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252">
<TITLE>No Indexing Site Search</TITLE>
<meta name="description" content="No Indexing Site Search">
<meta name="keywords" content="No Indexing Site Search,Search Website URLs,Search Website,php,CMS,javascript, dhtml, DHTML">
<style type="text/css">
BODY {margin-left:0; margin-right:0; margin-top:0;text-align:left;background-color:#ddd}
p, li {font:13px Verdana; color:black;text-align:left}
h1 {font:bold 28px Verdana; color:black;text-align:center}
h2 {font:bold 24px Verdana;text-align:center}
td {font:normal 13px Verdana;text-align:left;background-color:#ccc}
.topic {text-align:left;background-color:#fff}
.center {text-align:center;}
.textbox {position:absolute;top:50px;left:190px;width:772px;word-wrap:break-word;white-space:nowrap;overflow:hidden;text-overflow: ellipsis;}
.info {position:absolute;top:0px;left:2px;width:160px;background-color:#bbb;border:1px solid blue;padding:5px}
.ts {background-color:#8aa;border:6px solid blue;padding:6px}
.pw {position:absolute;top:150px;left:185px;width:820px;text-align:center}
</style>
</head>
<body>
<center><h1>Search Website</h1></center>
<?php
error_reporting(E_ERROR);

$f=$_POST['siteurl'];
$S=$_POST['search'];
if (!isset($f)){
echo '<div class="pw"><table class="ts"><tr><td style="text-align:center"><form id="formurl" name="formurl" method="post" action="no-indexing-site-search.php"><b>home page URL (must include /index.html, /index.htm, /index.php or whatever the home page filename is)</b><BR><label>URL: <input type="text" name="siteurl" size="66" maxlength="99" value=""></label><br><br><b>Search word/phrase (only exact word or phrase will be searched for)</b><BR><label>Search: <input type="text" name="search" size="66" maxlength="99" value=""></label><br><br><input type="submit" value="Submit URL"><br><br><input type="reset" value="Reset"></form></td></tr></table></div>';

}else{

if (substr($f,-4)==".htm" || substr($f,-4)=="html" || substr($f,-4)==".php"){
if (strlen($S)<3 || $S=="the" || $S=="The" || $S=="THE") {echo '<script language="javascript">alert("Enter longer search terms.");window.location="no-indexing-site-search.php"; </script>';
}else{

$e=(parse_url($f,PHP_URL_PATH));
if (substr($e,0,1)=="/"){$LLLL=strlen($e);$home=substr($e,1,$LLLL-1);}

$f=strip_tags($f);$f=str_replace($e, "", $f);
$L=strlen($f);if (substr($f,-1)=="/"){$f=substr($f,0,$L-1);}
$f = str_replace(" ", "%20", $f); $f=trim($f);

$a=array();$n=1;$o=-1;$g=$f."/"; $a[0]=$home;$found=0;
$t = file_get_contents($g.$home);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w){$url=substr($url,0,$w);}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
$ok="0";$url=str_replace($g, "", $url);$L=strlen($url);
if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && (substr($url,-4)==".htm" || substr($url,-4)=="html" || substr($url,-4)==".php")){$ok="1";} //dumps offsite, home page or wrong extension links
if($L>4 && $ok=="1"){$a[$n]=$url;$n++;}}
$a=array_keys(array_flip($a)); //dump duplicate array values and fill the holes that are left; array_unique has BUG!
$r = count($a);

function add_urls_to_array(){
global $a; global $g; global $z; global $t; global $r; global $context; $n=$r; $folder=""; // $context must be global here or file_get_contents() below loses its timeout
$fo=strrpos($z,"/"); if ($fo){$folder=substr($z,0,$fo+1);}
$LLL=strlen($folder);
$t = file_get_contents($g.$z,false,$context);
$dom = new DOMDocument();
@$dom->loadHTML($t);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
$url = trim($url);
$url = str_replace(" ", "%20", $url);
if (substr($url,0,4)=="http"){$sf="0";}else{$sf="1";}
if (substr($url,0,3)=="../" || substr($url,0,2)=="./"){$sf="0";}
$url = str_replace("../", "", $url);
$url = str_replace("./", "", $url);
if (substr($url,0,1)=="/"){$LL=strlen($url);$url=substr($url,1,$LL-1);}
if (substr($url,0,4)<>"http" && substr($url,0,$LLL)<>$folder && $sf=="1"){$url=$folder.$url;}
$w=strrpos($url,"#");if ($w){$url=substr($url,0,$w);}
$w=strrpos($url,"?");if ($w){$url=substr($url,0,$w);}
$ok="0";$url=str_replace($g, "", $url);$L=strlen($url);
if ((substr($url,0,4)<>"http" && substr($url,0,6)<>"index." && substr($url,0,8)<>"default." && substr($url,0,5)<>"home." && substr($url,0,6)<>"Index." && substr($url,0,8)<>"Default." && substr($url,0,5)<>"Home." && substr($url,0,12)<>"placeholder.") && (substr($url,-4)==".htm" || substr($url,-4)=="html" || substr($url,-4)==".php")){$ok="1";} //dumps offsite, home page or wrong extension links
$q=array_search($url,$a);if ($L>4 && $ok=="1" && $q===false){$a[$n]=$url;$n++;}}
$r = count($a);
}

$context = stream_context_create(array('http' => array('timeout' => 3))); // Timeout in seconds

while ($o<$r-1){
$o++; $z=$a[$o];
add_urls_to_array();
}

if (strlen($S)>2 && $S<>"the" && $S<>"The" && $S<>"THE") {
echo "<div class='textbox'>";

$S=strip_tags($S);
$pattern2 = '/[^a-zA-Z0-9\\s\\.\\,\\!\\?]/i';
$replacement = '';
$S=preg_replace($pattern2, $replacement, $S);

echo '<table width="772" border="1">';

for($j=0;$j<$r;$j++) {
$content = file_get_contents($g.$a[$j],0,$context);

$pp=array('/<head[^>]*?>.*?<\/head>/si',
'/<style[^>]*?>.*?<\/style>/si',
'/<script[^>]*?>.*?<\/script>/si',
'/<object[^>]*?>.*?<\/object>/si',
'/<embed[^>]*?>.*?<\/embed>/si',
'/<applet[^>]*?>.*?<\/applet>/si',
'/<noframes[^>]*?>.*?<\/noframes>/si',
'/<noscript[^>]*?>.*?<\/noscript>/si',
'/<noembed[^>]*?>.*?<\/noembed>/si',
'/<!--.*?-->/si');
$content = preg_replace($pp,'',$content);
$content = preg_replace('/&nbsp;/si',' ',$content);
$content = preg_replace("/<[A-Za-z]+[^>]*?>/i", "\\0 ", $content);
$content = preg_replace("/<\/[A-Za-z]+>/", "\\0 ", $content);
$content=strip_tags($content);
$content=preg_replace('/[\r\n]+/', ' ', trim($content));

$starter=stripos($content,$S);if ($starter!==false){
$found++;
echo "<a target='_blank' a HREF='".$g.$a[$j]."'>".$g.$a[$j]."</a><BR>";
$x=substr($content,$starter,350);
$x = str_ireplace($S, '<span style="background-color:lightblue;">'.$S.'</span>', $x);
echo "...".$x."...<BR><I><span style='color:green;background-color:#ddd'>".$g.$a[$j]."</span></I>";
echo "<br><br></td></tr>";
}

}

if($found==0){echo '<script language="javascript">alert("Term was not found on this site.");window.location="no-indexing-site-search.php";</script>';}

echo '</table>';

}}

}else{
echo '<script language="javascript">alert("Enter full URL with page filename, like this example:\n\nhttp://www.yoursitename/index.html\n\nPress a key to submit another URL.");window.location="no-indexing-site-search.php"; </script>';}
}
?>

</div>

<div id='info' class='info'>No hyphens (-) or underscores (_) or Enter/Return allowed in search terms. Use letters, numbers, spaces and these: <B> , . ? ! </b> in searches. Search results will be in no particular order. <A HREF="javascript:history.go(-1)">GO BACK</A> </div>
</body>
</html>