\noweboptions{longxref,longchunks} \title{ {\tt mygrep} -- Recursive Grep } \author{ Anthony Towns \\ { \tt aj@humbug.org.au } \\ 343676-968 } @ \section{ Motivation } The Unix {\tt grep} command allows you to search strings in named files, but does not include the ability to search recursively through directory trees. {\tt mygrep} is designed to overcome this difficulty. @ \section{ Usage } The command line options for {\tt mygrep} are: {\tt mygrep [-w] [-r] [-h] } where {\tt -w} enables warnings, {\tt -r} enables recursion, {\tt -h} displays usage information and exits (with a return code of 1). The program will then search each file specified on the command line; descending through subdirectories if the {\tt -r} argument is specified. Output is of the form: {\tt : } for each {\tt line} in the specified {\tt file(s)} that contains the {\tt searchstring}. @ \section{ Program Structure } As most of this code is highly specific to this particular problem, it was left in a single source file. <>= <
> <> <> <> <> <<[[main]]>> @ In order to access the standard string and file manipulation routines, we will need at least the following header files. <
>= #include #include #include #include @ We will also make use of the named constants [[TRUE]] and [[FALSE]]. <>= #ifndef TRUE #define TRUE (1) #endif #ifndef FALSE #define FALSE (0) #endif @ %def TRUE FALSE \subsection{ Program decomposition } We will use three functions to separate the various parts of the code, [[rgrep]] which will be a dispatching function, [[grep_file]] which will do the work of the program, and [[grep_dir]] which will handle recursion if it is enabled. Each of these will return non-zero if any matching lines were output. <>= <<[[rgrep]]>> <<[[grep_file]]>> <<[[grep_dir]]>> @ \subsection{ The [[rgrep_opts]] structure } As we have global mode information (viz whether or not we are warning and/or recursing and the string we're searching for) we will need to either use global (or module level) variables, or pass the pertinent information between each function. The latter path will be used in this case, by passing an [[rgrep_opts]] structure to each function. <>= typedef struct rgrep_opts { int warning; int recursive; char* searchstring; } rgrep_opts; @ %def rgrep_opts \section{ The Dispatcher -- [[rgrep]] } [[rgrep]] will take care of checking whether the particular file we are looking at should be checked. It will first determine whether the given filename refers to a file or a directory, and then if the file should be skipped (eg, if it is a link or an executable). [[rgrep]] will need the name of the file it is to work on, and the various options specified by [[rgrep_opts]]. <>= int rgrep( char* filename, rgrep_opts* opts ); @ %def rgrep <<[[rgrep]]>>= int rgrep( char* filename, rgrep_opts* opts ) { <> <> <> if ( <<[[filename]] is a directory>> ) { <> return grep_dir( filename, opts ); } else { <> return grep_file( filename, opts ); } } @ %def rgrep filename opts [[rgrep]] will present a warning if the file specified by [[filename]] doesn't exist, and will interpret most possible values of [[opts]] successfully. We will add some sanity checking nevertheless. <>= assert( filename != NULL && filename[0] != '\0' ); assert( opts != NULL ); assert( opts->searchstring != NULL ); /* an empty searchstring is acceptable */ @ \subsection{ Avoiding second-class files and directories } As a step toward avoiding infinite recursion, we will not recurse down a directory if it is a link. It's somewhat questionable whether this should be here or within [[grep_dir]], which would permit links to directories to be searched if they were specified on the command line, but not otherwise. <>= if ( <<[[filename]] is a link>> ) { if ( opts->warning ) { <> } return FALSE; } if ( !opts->recursive ) { if ( opts->warning ) { <> } return FALSE; } @ To avoid some uninteresting files, we will skip links, special files and executable files, under the expectation that these will be less likely to produce meaningful results. <>= if ( <<[[filename]] is a link>> ) { if ( opts->warning ) { <> } return FALSE; } else if ( <<[[filename]] is special>> ) { if ( opts->warning ) { <> } return FALSE; } else if ( <<[[filename]] is executable>> ) { if ( opts->warning ) { <> } return FALSE; } @ \subsection{ Getting the file's status } We will use [[lstat]] to obtain information on the file, as that will allow us to distinguish links from other files. To do so, we will need to include the [[sys/stat.h]] system header and declare an instance of [[struct stat]] in which we can store [[lstat]]'s results. <
>= #include @ <>= struct stat stat_buf; @ %def stat_buf We may then obtain the file status of our file fairly simply: <>= if ( lstat( filename, &stat_buf ) == -1 ) { perror( filename ); return FALSE; } @ And we can also use the macros defined in [[sys/stat.h]] to give a meaning to assertions about the sort of file we are dealing with. <<[[filename]] is a directory>>= (S_ISDIR( stat_buf.st_mode )) @ <<[[filename]] is a link>>= (S_ISLNK( stat_buf.st_mode )) @ <<[[filename]] is special>>= (!S_ISREG( stat_buf.st_mode )) @ <<[[filename]] is executable>>= (stat_buf.st_mode & ( S_IXUSR | S_IXGRP | S_IXOTH )) @ \subsection{ Warnings } We may also need to warn when we skip files for various reasons. <>= fprintf( stderr, "warning: %s is a link. Skipping.\n", filename ); @ <>= fprintf( stderr, "warning: %s is special. Skipping.\n", filename ); @ <>= fprintf( stderr, "warning: %s is executable. Skipping.\n", filename ); @ <>= fprintf( stderr, "warning: %s is a directory. Skipping." "(use -r to enter directories)\n", filename ); @ \section{ The Meat of the Program -- [[grep_file]] } [[grep_file]] will, given the name of a file, and a valid [[rgrep_opts]] structure, display each matching line on stdout. As did [[rgrep]], [[grep_file]] will need the name of the file it is to work on, and the searchstring contained in an [[rgrep_opts]] structure. It would be possible to pass a [[char*]] instead of an [[rgrep_opts*]] to this function, but this would break the symmetry of these calls, and make the addition of further options in future difficult. <>= int grep_file( char* filename, rgrep_opts* opts ); @ %def grep_file <<[[grep_file]]>>= int grep_file( char* filename, rgrep_opts* opts ) { <> int found = FALSE; <> <> <> for(;;) { <> if ( <> ) { <> found = TRUE; } } <> <> return found; } @ %def grep_file filename opts found As for [[rgrep]], there's not really too much checking that is necessary -- any errors will be output to stderr. Again, some sanity-checking is worthwhile nevertheless. <>= assert( filename != NULL && filename[0] != '\0' ); assert( opts != NULL ); assert( opts->searchstring != NULL ); @ \subsection{ File manipulation } Our file manipulation will simply involve using [[fgets]] to iterate over each line in the given file. <>= FILE* file; @ %def file <>= file = fopen( filename, "r" ); if ( file == NULL ) { perror( filename ); return FALSE; } @ <>= fclose(file); file = NULL; @ The number of alternative scenarios when attempting to read a line from a file make the code here somewhat complicated. In short, though, we need to [[break]] if we find we can't find anything useful <>= if ( feof( file ) ) break; assert( line_size > 0 ); if ( fgets( line, line_size, file ) == NULL ) break; while ( !feof(file) && <> ) { if ( ! <> ) { break; } fgets( line + strlen(line), line_size / 2 + 1, file ); } <> @ We can make an approximation as to when [[fgets]] has finished reading its line in two ways: if it ends with a newline, or if it didn't fill up the entire buffer then we can be sure it did finish. Otherwise, we can't be entirely confident that there is more to read, but that should be rare enough that it doesn't cause any problems. <>= ( (strlen( line ) == line_size - 1) && (line[ line_size - 2 ] != '\n') ) @ \subsection{ Line storage } We use a dynamically allocated array of characters to store each line we read. As such, we need to allocate, reallocate and free this buffer. <>= char* line; size_t line_size; @ %def line line_size <>= line_size = 5; line = malloc( line_size ); if ( line == NULL ) { fprintf( stderr, "Out of memory!\n" ); return FALSE; } @ <>= free(line); line_size = 0; line = NULL; @ As realloc is somewhat painful to use, we introduce a wrraper function, [[double_buffer]], which will attempt to expand the size of a buffer allocated with [[malloc]], returning [[TRUE]] or [[FALSE]] depending on its success. <>= <<[[double_buffer]]>> @ <>= int double_buffer( void** buffer, size_t* size ); @ %def double_buffer <<[[double_buffer]]>>= int double_buffer( void** buffer, size_t* size ) { void* bufnew; assert( buffer != NULL && *buffer != NULL ); assert( size != NULL && *size > 0 ); bufnew = realloc( *buffer, *size * 2 ); if ( bufnew == NULL ) { return FALSE; } *buffer = bufnew; *size *= 2; return TRUE; } @ %def double_buffer buffer size bufnew We can then go on to define our chunk to increase the size of our [[line]] buffer. Note that this is used as an expression. <>= (double_buffer( (void**) &line, &line_size )) @ \subsection{ Line manipulation } Our chunks for manipulating the line are fairly simple. <>= if ( line[ strlen(line) - 1 ] == '\n' ) { line[ strlen(line) - 1 ] = '\0'; } @ <>= (strstr( line, opts->searchstring ) != NULL) @ <>= printf( "%s: %s\n", filename, line ); @ \section{ Putting the r in rgrep -- [[grep_dir]] } This procedure will handle ``grepping'' a directory. It will do so by stepping through each entry in the directory and calling the dispatching function, [[rgrep]], to process that entry. [[grep_dir]] will therefore need the name of the directory it is to recurse down, and an [[rgrep_opts]] structure. <>= int grep_dir( char* dirname, rgrep_opts* opts ); @ %def grep_dir <<[[grep_dir]]>>= int grep_dir( char* dirname, rgrep_opts* opts ) { <> int result = FALSE; /* have we found any lines yet? */ <> <> <> <> while ( <> ) { if ( <> ) continue; <> <> } <> <> <> return result; } @ %def grep_dir dirname opts result As was the case for both [[rgrep]] and [[grep_file]] there are few preconditions to this function. <>= assert( dirname != NULL && dirname[0] != '\0' ); assert( opts != NULL ); /* we're not actually concerned with searchstring here */ @ As we value the side-effects of the call to rgrep (ie printing out the matching lines), we use a cute trick to avoid C's short-circuit evaluation of logical or's: we use a bitwise or. This is guaranteed to work, as C treats any non-zero value as {\tt TRUE}, and both bitwise and logical or will never reduce a non-zero value to a zero (or {\tt FALSE}) value. <>= result |= rgrep( name, opts ); @ \subsection{ Directory manipulation } Our directory manipulation is textbook usage of the POSIX functions. <
>= #include <>= DIR* dirp; struct dirent* direntp; @ %def dirp direntp <>= dirp = opendir( dirname ); if ( dirp == NULL ) { perror( dirname ); <> return FALSE; } @ <>= closedir(dirp); @ <>= ((direntp = readdir( dirp )) != NULL) @ \subsection{ Filename storage } We're guaranteed that no given filename will be greater than [[MAXNAMLEN]] characters in length, so by allocating enough characters for the given directory name, any filename, a separating {\tt '/'} and a terminating {\tt NUL} we can be confident we have enough space. We will assume that we have enough heap to allocate the couple of kilobytes this will waste. <>= char* name; @ %def name <>= name = malloc( strlen(dirname) + MAXNAMLEN + 2 ); if ( name == NULL ) { fprintf( stderr, "Out of memory!\n" ); return FALSE; } strcpy( name, dirname ); if ( name[ strlen(name) - 1 ] != '/' ) { strcat( name, "/" ); } @ <>= free(name); name = NULL; @ \subsection{ Filename manipulation } For convenience, we have an extra [[char*]] variable, [[addhere]] which we set to the first character after the {\tt '/'}. This separates the section of [[name]] which is constant throughout the function, and that section which changes for each iteration. <>= char* addhere; @ %def addhere <>= addhere = name + strlen(name); @ <>= addhere = NULL; @ <>= strcpy( addhere, direntp->d_name ); @ \subsection{ Ensuring termination } To avoid infinite recursion, we will skip files named {\tt "."} and {\tt ".."}, which are used to represent the current and parent directory respectively. <>= ( (strcmp(".", direntp->d_name) == 0) || (strcmp("..", direntp->d_name) == 0)) @ \section{ [[main]] } Our [[main]] function is this program will simply parse any command line arguments and pass the work on to [[rgrep]]. <<[[main]]>>= int main( int argc, char** argv ) { <> int found = FALSE; <> <> <> <> <> { <> } if ( found ) return 0; else return 1; } @ %def main argc argv found \subsection{ Argument parsing } We handle arguments by iterating over [[argv]] exactly once, updating an index, [[c_arg]], as we finish processing each argument. <>= int c_arg; @ %def c_arg When we have parsed all our options, we may then easily check whether the supplied arguments were correct or not, simply by determining whether we have an argument we may take as a searchstring, and one or more arguments we may take as files. <>= if ( <> ) { <> return 1; } <>= (argc - c_arg < 2) @ The only remaining steps in parsing our arguments is to iterate over each file, which is facilitated by the remaining two chunks. <>= for ( ; c_arg < argc; c_arg++ ) @ <>= found |= rgrep( argv[c_arg], &opts ); @ \subsection{ Handling options } Our options all neatly map to items in the [[rgrep_opts]] structure. They are conveniently found at the start of our argument list, and consist of two character pairs, beginning with a {\tt '-'}. We will assume that the {\tt } does not begin with a {\tt '-'}. <>= rgrep_opts opts; @ %def opts <>= opts.warning = FALSE; opts.recursive = FALSE; opts.searchstring = NULL; @ <>= for ( c_arg = 1; c_arg < argc && argv[c_arg][0] == '-'; c_arg++ ) @ <