Creating a File Content Crawler with ColdFusion

This tutorial will show you how to create a local file crawler that finds every file of a specified type (e.g. PDF files) within a directory and its subdirectories.

I want to begin by explaining a little about what a crawler is, since some of you might be thinking… a what? 🙂

A crawler is a script that walks through a set of locations and returns the items that match what you tell it to look for… I think the best explanation is the code itself, so let's get started:

This first example is a local file crawler. Say you have a directory structure that looks like this:

D:\websites\information.pdf
D:\websites\account_info.pdf
D:\websites\mysite.com\info.pdf
D:\websites\hello kitty\free_stuff.pdf

Notice that the PDF files sit in different folders under D:\websites, so that folder becomes the ROOT FOLDER.
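
The listing that follows keeps its folders and files in pipe-delimited lists, which works because `|` cannot appear in a Windows path. Here is a minimal sketch of the list functions it relies on. The variable names `folders` and `position` are made up for this illustration and are not part of the crawler itself:

```cfml
<!--- build a pipe-delimited list one element at a time --->
<cfset folders = "">
<cfset folders = ListAppend(folders, "D:\websites\mysite.com\", "|")>
<cfset folders = ListAppend(folders, "D:\websites\hello kitty\", "|")>

<!--- ListFind returns the 1-based position of an exact (case-sensitive)
      match, or 0 when the element is not in the list --->
<cfset position = ListFind(folders, "D:\websites\hello kitty\", "|")>

<cfoutput>
    #ListLen(folders, "|")# folders queued, "hello kitty" found at position #position#<br>
    <!--- walk the list the same way the crawler does --->
    <cfloop list="#folders#" index="folder" delimiters="|">
        #folder#<br>
    </cfloop>
</cfoutput>
```

Because ListFind compares whole elements, a path only counts as "already seen" when the full folder and file name match.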

<!--- define an empty variable that will become a list of directories 
        to search later in the application --->
<cfset current_directory_to_crawl = "">

<!--- now by default define the root folder to search, in this example D:\websites\
        (keep the trailing backslash, because paths are built below as dir & name) --->
<cfset next_directory_to_crawl = "D:\websites\">

<!--- Now define a variable that will tell the application later on if it should continue.
        By default set the value to 1 --->
<cfset crawl_again = 1>

<!--- now define a variable that will count the number of files found and set it to 'zero' by default --->
<cfset file_counter = 0>

<!--- do >>ONLY<< one extension per run --->
<cfset extension_to_crawl = "pdf">

<!--- define a variable to hold the file names of the files found  --->
<cfset file_container = "">

<!--- create a container to hold all files processed (useful if you want to move them elsewhere) --->
<cfset file_completed = "">

<!--- ok, begin the processing here; the loop runs while
        crawl_again is 1 and stops when it is set to 0 --->
<cfloop condition="crawl_again neq 0">

    <!--- first switch the directory values --->
    <cfset current_directory_to_crawl = next_directory_to_crawl>

    <!--- now clear the next --->
    <cfset next_directory_to_crawl = "">

    <!--- Clear the file container --->
    <cfset file_container = "">

    <!--- Now loop through the list of directories to crawl and look for the extensions --->
    <cfloop list="#current_directory_to_crawl#" index="dir" delimiters="|">

        <!--- now list the directory contents --->
        <cfdirectory action="LIST"
                     directory="#dir#"
                     name="CurrentPull">

        <!--- loop over everything CFDIRECTORY returned --->
        <cfloop query="CurrentPull">

            <!--- process every record except the "." and ".." entries
                  that CFDIRECTORY may return as the first two records --->
            <cfif name neq "." AND name neq "..">

                <!--- display the current file/directory on the screen --->
                <cfoutput>#name#<BR></cfoutput>

                <!--- let's see if the current item is a file or a directory --->
                <cfif type eq "dir">

                    <!--- Found a directory, set this folder as crawlable
                          so on the next loop we can search it for PDF files --->
                    <cfset next_directory_to_crawl =
                           ListAppend(next_directory_to_crawl, dir & name & "\", "|")>

                <cfelseif type eq "file">

                    <!--- this is a file, see if its extension is the one defined above --->
                    <cfif ListLast(name, ".") eq extension_to_crawl>

                        <!--- make sure that this file and its path are UNIQUE --->
                        <cfif NOT ListFind(file_completed, dir & name, "|")>

                            <!--- mark this file as completed --->
                            <cfset file_completed = ListAppend(file_completed, dir & name, "|")>

                            <!--- add the file to the container --->
                            <cfset file_container = ListAppend(file_container, dir & name, "|")>

                            <!--- add one to the file counter --->
                            <cfset file_counter = file_counter + 1>

                        </cfif>
                    </cfif>

                </cfif>

            </cfif>
        </cfloop>

    </cfloop>
 

    <!--- now output the final values to the screen so we can see them --->
    <cfoutput>
        <hr>
        <ol>
            <cfloop list="#next_directory_to_crawl#" index="folder" delimiters="|">
                <li>#folder#</li>
            </cfloop>
        </ol>
        <hr>
        <ol>
            <cfloop list="#file_container#" index="files" delimiters="|">
                <li>#files#</li>
            </cfloop>
        </ol>
        <hr>Files Found: #file_counter#<hr>
    </cfoutput>

    <cfif next_directory_to_crawl eq "">
        <!--- There are no more folders to crawl, stop the main loop --->
        <cfset crawl_again = 0>
    </cfif>
</cfloop>
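
The file_completed list is there in case you want to move the files elsewhere once the crawl has finished. Here is one way that could look, as a rough sketch only: `moveCrawledFiles` is a hypothetical helper that is not part of the listing above, and it assumes the destination folder already exists and ends with a path separator.

```cfml
<!--- moveCrawledFiles is a hypothetical helper, not part of the original listing --->
<cffunction name="moveCrawledFiles" returntype="numeric" output="false">
    <cfargument name="fileList" type="string" required="true">
    <cfargument name="destinationDir" type="string" required="true">

    <cfset var sourceFile = "">
    <cfset var targetFile = "">
    <cfset var moved = 0>

    <!--- walk the pipe-delimited list the crawler built --->
    <cfloop list="#arguments.fileList#" index="sourceFile" delimiters="|">

        <!--- keep the file name, swap the folder --->
        <cfset targetFile = arguments.destinationDir & GetFileFromPath(sourceFile)>

        <!--- skip files that no longer exist and never overwrite an existing file --->
        <cfif FileExists(sourceFile) AND NOT FileExists(targetFile)>
            <cffile action="move" source="#sourceFile#" destination="#targetFile#">
            <cfset moved = moved + 1>
        </cfif>

    </cfloop>

    <!--- report how many files were actually moved --->
    <cfreturn moved>
</cffunction>
```

After the main loop you could call it as moveCrawledFiles(file_completed, "D:\archive\"), where D:\archive\ is a made-up folder. Since the crawl can find files with the same name in different folders, this sketch leaves the later ones where they are rather than overwriting the first.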

That’s pretty much it. This gives you a local crawler that finds every file with a given extension under a root folder, and you can extend it from there.
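
If your ColdFusion version has a cfdirectory tag that supports the recurse, filter and type attributes (type is the newest of the three, so check your server's documentation), the same search can be written far more briefly. This is an alternative to the loop above, not the method this tutorial walks through, and `findFilesRecursive` is a made-up name for this sketch:

```cfml
<!--- findFilesRecursive is a hypothetical helper, not part of the original listing --->
<cffunction name="findFilesRecursive" returntype="query" output="false">
    <cfargument name="rootDir" type="string" required="true">
    <cfargument name="extension" type="string" required="true">

    <cfset var found = "">

    <!--- recurse="yes" descends into every subdirectory,
          filter keeps only names ending in the extension,
          type="file" leaves directories out of the result --->
    <cfdirectory action="list"
                 directory="#arguments.rootDir#"
                 recurse="yes"
                 filter="*.#arguments.extension#"
                 type="file"
                 name="found">

    <cfreturn found>
</cffunction>
```

Each row of the returned query has directory and name columns, so looping over findFilesRecursive("D:\websites\", "pdf") and joining the two columns gives the full path of every match. The hand-written loop is still useful when you need to act on each folder as it is visited, or when your server's cfdirectory lacks these attributes.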

EasyCFM.Com introduces at least three new tutorials each week, written by the webmaster (Pablo Varando) and also from individual people who post their own tutorials for visitors to learn from. For more information please visit: http://www.easycfm.com [EasyCFM is Hosted by Colony One On-Line – http://www.colony1.net]
