safer file access techniques in php

gbhorwood

grant horwood

Posted on September 15, 2022

file handling is one of those things developers don't pay a lot of attention to anymore; we read in files with, for instance, file_get_contents, or write with its 'put' companion and it just works. most of the time.

file handling in the real world, beyond our development environment, though, can be a lot trickier. for many years, my company's primary focus was rescuing other people's projects that had been abandoned or were just plain Too Broken, and i have seen many instances where a casual approach to reading and writing files was the cause of production-crashing bugs.

in this article, we're going to go over how to write file handling functions that are safer and more reliable.

the flyover

the basic topics we're going to go over in this article are:

  • handling filesystem problems before access
  • reading files with generators
  • a simple file write function
  • putting it together with a fun but not particularly useful 'file copy' function

note: all of the examples here assume we are reading text files to keep things shorter.

reading files

when we read a file with file_get_contents() or even fopen() and a loop, we're putting a lot of faith in the filesystem; faith that the file exists, faith that we have the permissions to read it, and so on. on our development box, this faith is usually justified. but if we're writing wares that run in the wild, especially if it's on someone else's servers that we don't control, that faith can cause problems.

catching filesystem errors early

there are basically three things we have to check before we can successfully read a file:

  • the file exists
  • we have permission to read the file
  • nothing else goes wrong when we open the file

fortunately, php provides us with functions to check each of these. let's build a validate_file_read() function that does those checks.

/**
 * Validate we can read a file
 * @param  String $path_to_file The path to the file we want to read
 * @return void
 */
function validate_file_read(String $path_to_file):void {
    // does the file exist?
    !file_exists($path_to_file) ? die("File '$path_to_file' does not exist") : null;

    // do we have permissions to read the file?
    !is_readable($path_to_file) ? die("File '$path_to_file' is not readable") : null;

    // do we get any other errors trying to open the file
    $fp = fopen($path_to_file, "r");
    $fp === false ? die("Could not open file '$path_to_file'") : null;
    fclose($fp);
}

this function tests all three of our requirements and, if one fails, kills the script with an error message.

first, we confirm the file exists with file_exists. this function returns a boolean, so we can test it with an if() statement or, as shown here, a ternary operator. next, we use is_readable to verify that we actually have the permissions required to open and read the file. note that php, when used as a web language, usually runs under a special user, e.g. 'www-data', that has limited permissions.

finally, we try opening the file to confirm that there is nothing else wrong. we note here that fopen() is one of those annoying php functions that returns the 'mixed' type. on success, we get our file pointer. on fail, we get boolean false.

we probably aren't going to use this function 'as is'; it's basically just to illustrate the process of file validation. going forward, we will be implementing the contents of this function in other, more immediately useful functions.
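calling die() is fine for illustration, but it leaves the calling code no chance to recover. here is a minimal sketch of the same three checks, assuming we would rather throw an exception and let the caller decide what to do. the function name validate_file_read_or_throw() and the choice of RuntimeException are our own assumptions for this example, not something from the function above.

```php
<?php
/**
 * Hypothetical variant of validate_file_read() that throws instead of calling die().
 * The name and the use of RuntimeException are assumptions made for this sketch.
 */
function validate_file_read_or_throw(string $path_to_file): void
{
    // does the file exist?
    if (!file_exists($path_to_file)) {
        throw new RuntimeException("File '$path_to_file' does not exist");
    }

    // do we have permissions to read the file?
    if (!is_readable($path_to_file)) {
        throw new RuntimeException("File '$path_to_file' is not readable");
    }

    // do we get any other errors trying to open the file?
    // the @ suppresses php's own warning so we can report the failure ourselves
    $fp = @fopen($path_to_file, "r");
    if ($fp === false) {
        throw new RuntimeException("Could not open file '$path_to_file'");
    }
    fclose($fp);
}

// usage: the caller decides how to handle the failure
try {
    validate_file_read_or_throw("/path/that/does/not/exist");
} catch (RuntimeException $e) {
    print "cannot read: " . $e->getMessage() . PHP_EOL;
}
```

the checks are the same as before; only the way we report a failure changes.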

handling large files

once, many years ago, i worked on a rescue project that had a central feature of processing very large files uploaded by users. the system would frequently fail if the file was too large, and the 'solution' the original development team implemented was setting php's memory_limit to -1 (no limit) and buying a boatload of ram. it was clumsy, expensive, and still didn't stop the client from losing business because of errors.

the solution we implemented was to migrate all the file reads to generators so that the wares only ever held one line of a file in memory at a time.

let's take a look at how we would read a file using a generator:

/**
 * Generator to read a file line-by-line
 *
 * @param  String $file  The path to the file to read
 * @return Generator
 */
function read_generator(String $file):Generator {
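    // open the file for reading
    $fp = fopen($file, "r");
    $fp === false ? die("Could not open file '$file'") : null;

    // read file one line at a time until the end
    // fgets() returns false at end of file, so we never yield a stray false
    while (($line = fgets($fp)) !== false) {
        yield $line;
    }

    // close the file pointer
    fclose($fp);
}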

// entry point
foreach (read_generator("testfile") as $line) {
    print "processing line... ".$line;
}

if you've never used generators before, this may be a bit confusing, and we will do a short overview of generators below.

in the read_generator() function, we accept a path to a file as an argument and then open that file for reading. we then proceed to loop to the end of the file, reading one line at a time. instead of appending each line to a buffer string or array, however, we yield that line so it can be dealt with by the code that called the function. no buffer means no risk of running out of memory with large files!

generators implement a simple iterator that allows us to loop over the results using foreach. we can see in our loop at the entry point that we can treat a call to read_generator() the same way that we would treat foreach-ing over an array that we got from calling, say, file.

this construct gives us (most of) the convenience of having our entire file in an array of lines without the risk of us blowing through our memory roof if the file is very large. of course there are some limitations to this technique compared to having an in-memory array of file lines; we can't call count or use array_map or the like, but the payoff is safety and reliability.
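those limitations can be worked around without giving up the memory savings. the sketch below shows one way: php's iterator_count() stands in for count, and a small generator stands in for array_map. the names read_lines() and map_lines() are hypothetical helpers made up for this example, and the temp file is only there so the sketch runs on its own.

```php
<?php
/**
 * Line-by-line generator, repeated here so this sketch is self-contained.
 */
function read_lines(string $file): Generator
{
    $fp = fopen($file, "r");
    if ($fp === false) {
        throw new RuntimeException("Could not open file '$file'");
    }
    while (($line = fgets($fp)) !== false) {
        yield $line;
    }
    fclose($fp);
}

/**
 * Hypothetical lazy stand-in for array_map(): applies $fn to each item as it is yielded.
 */
function map_lines(callable $fn, iterable $lines): Generator
{
    foreach ($lines as $line) {
        yield $fn($line);
    }
}

// build a small test file; the location is just for illustration
$path = tempnam(sys_get_temp_dir(), "gen");
file_put_contents($path, "one\ntwo\nthree\n");

// stand-in for count(): iterator_count() walks the generator and counts the items
print iterator_count(read_lines($path)) . PHP_EOL;

// stand-in for array_map(): still only one line in memory at a time
foreach (map_lines('strtoupper', read_lines($path)) as $line) {
    print $line;
}

unlink($path);
```

note that iterator_count() consumes the generator, which means reading the whole file once just to count it. that is why we call read_lines() a second time before mapping.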

a bit about generators

generators are not widely used in php, which is a shame because they are a very powerful tool to have.

essentially, a generator is a function that runs up to the next yield statement every time it is resumed, maintaining the function's state in between.

let's look at this example:

/**
 * A sample generator function
 */

function samplegenerator() {
    // we set the state of $j here on the first call
    $j = 10;

    // we then yield three times, incrementing $j each time
    yield $j; // first yield
    $j++;
    yield $j; // second yield
    $j++;
    yield $j; // third yield

    // there are no more yields, so the end of the generator's iterator is reached
    print "there are no more yields, the generator's iterator ends.";
}

foreach(samplegenerator() as $l) {
    print $l.PHP_EOL;
}

in our foreach() we call the generator function samplegenerator() once. that call doesn't run the function body; it hands back a Generator object, and foreach then resumes that generator over and over until it has no more yield statements. now, let's look at the generator function itself and how it behaves on each pass of the loop.

on the first pass, the code executes up to the first yield. this means our function sets the value of the internal variable $j to 10 and then yields it, essentially returning the value of $j to the calling loop. our calling loop assigns that value, 10 this time, to the variable $l and prints it.

on the second pass, execution resumes right after the first yield. the value of $j, which has persisted in the function, is incremented by one to 11 and then returned by the second yield. on the third pass, we find that $j is still 11 from the previous pass. we increment again, and advance to the third and final yield.

on the last pass, there are no more yields. execution continues to the bottom of the function and it terminates. our generator is now 'empty', and our calling foreach loop ends.

if we run this script, we will see output like this:

10
11
12
there are no more yields, the generator's iterator ends.

generators are not limited to being used as iterators, either. php provides methods on the Generator class like current and next that allow us more finely-grained control.
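as a rough illustration of that finer control, here is a sketch that drives a generator by hand with valid, current and next instead of foreach. the countdown() function is a made-up example for this purpose.

```php
<?php
/**
 * A made-up three-step generator, used only to demonstrate manual control.
 */
function countdown(): Generator
{
    yield 3;
    yield 2;
    yield 1;
}

// calling the function returns a Generator object; none of its body has run yet
$gen = countdown();

// valid() is true as long as the generator is paused at a yield
while ($gen->valid()) {
    // current() returns the value of the yield we are paused at
    print $gen->current() . PHP_EOL;

    // next() resumes the function until the following yield, or the end
    $gen->next();
}
```

this is roughly what foreach does for us behind the scenes, which is why the loop form is usually the more convenient one.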

writing files

reading files is great, but at some point we're going to want to write them as well.

fortunately, writing files requires less work than reading them; there's no need for generators. however, we will still have to do some error checking.

catching errors

like reading, writing files can result in errors, and we want to catch those errors before they happen.

in general, there are four potential errors we want to check for:

  • the target directory we want to write to does not exist
  • we don't have permission to write the file
  • we don't have enough disk space for our new file
  • optionally, we're overwriting a file that's already there

let's look at a function that tests all those conditions:

/**
 * Validate we can write a file
 * @param  String $file_contents The contents we want to write
 * @param  String $path_to_file The path to the file we want to write to
 * @return void
 */
function validate_file_write(String $file_contents, String $path_to_file):void {
    // does the target directory exist?
    !file_exists(dirname($path_to_file)) ? die("Target directory does not exist") : null;

    // is the target directory writable?
    !is_writable(dirname($path_to_file)) ? die("Target directory is not writable") : null;

    // do we have enough diskspace
    strlen($file_contents) > disk_free_space(dirname($path_to_file)) ? die("File '$path_to_file' is too big to write. Not enough space on disk.") : null;

    // optional: are we clobbering an existing file?
    file_exists($path_to_file) ? die("Target output file already exists at '$path_to_file'") : null;
}

again, this is not a function we would typically use in real life; it's just an example to show how we test our file writes.

we used file_exists to check for errors when we were reading files, and we're using it again here. however, this time we're checking to see if the directory we want to write to is there or not. we have a file path in $path_to_file that we want to write to, but is this a valid path? we test that by getting the directory of the path with dirname and then running file_exists to see if the directory is there. despite its name, file_exists also handles directories!

next, we check if we have permissions to write our file with is_writable. this function is the companion to the is_readable we used in our file read example, and it works the same way. again, we're testing if the directory is writeable since the file itself does not exist yet.

then there's the issue of diskspace. running out of diskspace is never fun. fortunately, checking how much room we have on a drive is fairly straightforward with disk_free_space. we note, here, that this function takes our target directory as an argument. this is because we're not really checking for available space on the disk, but on the partition. the path to the directory tells disk_free_space which partition we're interested in. once we have our available space in bytes, we can check it against the size of the contents we want to write.
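to turn those checks into something that actually writes, here is a minimal sketch of a write function. it is only one possible shape: the name write_file(), the use of exceptions rather than die(), and the swap of file_exists for the stricter is_dir on the directory check are all assumptions made for this example.

```php
<?php
/**
 * Hypothetical write helper: runs the checks described above, then writes.
 * The name write_file() and the use of exceptions are assumptions for this sketch.
 *
 * @param  string $file_contents The contents we want to write
 * @param  string $path_to_file  The path to the file we want to write to
 * @return int    The number of bytes written
 */
function write_file(string $file_contents, string $path_to_file): int
{
    $dir = dirname($path_to_file);

    // does the target directory exist? (is_dir also rejects a plain file at that path)
    if (!is_dir($dir)) {
        throw new RuntimeException("Target directory '$dir' does not exist");
    }

    // is the target directory writable?
    if (!is_writable($dir)) {
        throw new RuntimeException("Target directory '$dir' is not writable");
    }

    // disk_free_space() returns false on failure, so test for that first
    $free = disk_free_space($dir);
    if ($free === false || strlen($file_contents) > $free) {
        throw new RuntimeException("Not enough space on disk to write '$path_to_file'");
    }

    // optional: refuse to clobber an existing file
    if (file_exists($path_to_file)) {
        throw new RuntimeException("Target output file already exists at '$path_to_file'");
    }

    // file_put_contents() returns the byte count, or false on failure
    $bytes = file_put_contents($path_to_file, $file_contents);
    if ($bytes === false) {
        throw new RuntimeException("Could not write file '$path_to_file'");
    }
    return $bytes;
}

// usage: write a small file into the system temp directory, then clean up
$target = sys_get_temp_dir() . "/write_file_demo_" . uniqid() . ".txt";
print write_file("hello\n", $target) . " bytes written" . PHP_EOL;
unlink($target);
```

checking the return value of file_put_contents matters even after all the preflight checks, since the filesystem can still change between the check and the write.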

building a cp function

now that we can safely read and write files, let's put it all together into a function that copies a text file by reading it line-by-line and applying a transform and filter function on each line. note that this function is just for demonstration and is probably not something you would use in real life!

let's look at the function cp:

/**
 * Copies file $in to destination $out with optional line filter and transformation.
 *
 * @param  String   $in  Path to input file
 * @param  String   $out Path to target output file
 * @param  Callable $transform Optional. Function to apply to each line on copy
 * @param  Callable $filter Optional. Function that returns boolean test on line and copies line on true.
 * @return void
 */
function cp(String $in, String $out, ?callable $transform = null, ?callable $filter = null):void
{
    /**
     * Assign pass-through defaults for transform and filter
     */
    $transform = $transform ?? fn ($n) => $n;
    $filter = $filter ?? fn ($n) => true;

    /**
     * Preflight we can read and write.
     * 
     * @param  String  $in  Path to the input file
     * @param  String  $out Path to the target output file
     * @return void
     */
    $preflight = function (String $in, String $out):void {
        !file_exists($in) ? die("File '$in' does not exist") : null;
        !is_readable($in) ? die("File '$in' is not readable") : null;

        !file_exists(dirname($out)) ? die("Target directory does not exist") : null;
        !is_writable(dirname($out)) ? die("Target directory is not writable") : null;
        file_exists($out) ? die("Target output file already exists at '$out'") : null;

        // check disk space
        filesize($in) > disk_free_space(dirname($out)) ? die("File '$in' is too big to copy") : null;
    };

    /**
     * File readline generator
     * @param  String  $file  Path to the file to read
     * @return Generator
     */
    $read = function (String $file):Generator {

        // open file and handle error
        $fp = fopen($file, "r");
        $fp === false ? die("Could not open file '$file'") : null;

        // yield each line
        while (($line = fgets($fp)) !== false) {
            yield $line;
        }

        // cleanup
        fclose($fp);
    };

    /**
     * Confirm that our filesystem is good before starting copy
     */
    $preflight($in, $out);

    /**
     * Read from input file and write to output file one line at a time
     * testing the filter and applying the transformation
     */
    $fp = fopen($out, 'w');
    $fp === false ? die("Could not open file '$out' for writing") : null;

    foreach ($read($in) as $line) {
        $filter($line) ? fwrite($fp, $transform($line)) : null;
    }

    // cleanup
    fclose($fp);
} // cp

the basic steps this function follows are:

  • 'preflight' that we can read and write our target files
  • read the input file, one line at a time, using a generator
  • apply a filter function on each line, determining if that line will be copied to the output file or not
  • apply a transform function on each line, altering it
  • write the line to the target output file

when we look at this function, the first things we notice are the transform and filter arguments. these are of type callable; basically they are functions that we will apply to each line as we copy it from the in file to the out file.

the filter function determines if we copy the line at all. this function takes the line of the file as an argument and its body applies a test to that line. if the filter function returns true, we copy the line. if it returns false, we don't.

let's take a look at a filter function we might use as an argument here:

$filter = fn($line) => !str_contains($line, 'two');
cp($in, $out, null, $filter);

here we create an anonymous function using php's arrow notation and assign it to the variable named $filter. this function checks if the line contains the word 'two' and returns false if it does. if we pass this as our $filter argument, then any line in our input file that contains the word 'two' will not be copied to our output file.

the $transform argument is similar. it is also a function we pass to cp; however, its purpose is to modify the line we are copying. here's an example:

$transform = fn($line) => ucfirst($line);
cp($in, $out, $transform, null);

this transform changes the first letter of the line to uppercase and returns it. if we pass this function as an argument to cp, then every line in the out file will have its first letter uppercase.

in the body of the cp function, we see that the first thing we do is assign values to $transform and $filter if they are null:

$transform = $transform ?? fn($n) => $n;
$filter = $filter ?? fn($n) => true;

our default filter function always returns true, so it copies every line, and our default transform function simply returns the line unmodified.
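to see those defaults in isolation, here is a small sketch that applies the same null-coalescing logic to a single line. process_line() is a hypothetical helper invented for this example; it returns null when the filter rejects the line.

```php
<?php
/**
 * Hypothetical helper that applies the same defaulting logic as cp()
 * to a single line, returning null when the filter rejects it.
 */
function process_line(string $line, ?callable $transform = null, ?callable $filter = null): ?string
{
    $transform = $transform ?? fn ($n) => $n;  // default: return the line unchanged
    $filter = $filter ?? fn ($n) => true;      // default: accept every line

    return $filter($line) ? $transform($line) : null;
}

// both defaults: the line comes back unchanged
var_dump(process_line("goo\n"));

// only a transform: the default filter lets the line through
var_dump(process_line("goo\n", 'strtoupper'));

// only a filter: this one rejects short lines, so we get null
var_dump(process_line("goo\n", null, fn ($l) => strlen($l) > 10));
```

because the defaults are filled in at the top, the rest of the function never has to check whether a callable was passed.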

the next thing we see is a preflight function. we didn't need to wrap this code in an anonymous function, but it's done that way here to keep it neat and separate.

the preflight function is where we do all the tests to ensure that our read and write will work; the stuff we've covered already.

next is another anonymous function: read. this is our generator function for reading the file line-by-line. it behaves the same way as the read_generator() we looked at before; the only difference is that this is an anonymous function inside our main function and is assigned to a variable.

finally, we get to the foreach call that iterates over the generator. each line it takes is tested against the filter function. if it passes, the transform function is applied to the line and it is written to the output file. copying is achieved.

to show how this cp function is used, we'll run a few basic tests to copy this textfile, which is a list of the sonic youth records i own.

$ cat /tmp/in.txt
confusion is sex
bad moon rising
evol
sister
daydream nation
goo
dirty
experimental jet set, trash and no star

now let's run our cp function with a transform that uppercases the first letter of each line, and a filter function that removes any line that contains the letter 'o'.

$in = "/tmp/in.txt";
$out = "/tmp/out.txt";
$transform = fn($line) => ucfirst($line);
$filter = fn($line) => !str_contains($line, 'o');
cp($in, $out, $transform, $filter);

the results, predictably enough, are:

Sister
Dirty

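of course, we can also just copy the file without any filtering or transforming. since our preflight refuses to overwrite an existing file, we need to delete the /tmp/out.txt from the previous run first: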

$in = "/tmp/in.txt";
$out = "/tmp/out.txt";
cp($in, $out, null, null);

and our new file is the same as the original.

conclusion

it may not seem like an important thing to write safer file-access code. after all, most php applications are for the web and run in controlled environments where the filesystem is predictable.

however, the effort is minimal and file access errors, if not properly caught and handled, can be catastrophic. i have seen clients who have lost tens of thousands of dollars in business because of unsafe file access code. safety has its rewards.
