PHP Array: A Gross Mistake
Anton Ukhanev
Posted on May 2, 2022
Any developer who has spent a little time working with PHP knows one of its most used compound types: the array. Its uses range from deserialization results to the backbone of sets, collections and containers, stacks and queues, indexes, and much more. It is so ubiquitous that it's possible to make an object appear to be an array with the ArrayAccess, Iterator, and Countable interfaces.
However, countless examples suggest that even experienced PHP developers frequently make the same kind of mistake, which often has a hidden cost months or even years later. I contend that
- An array is not an array.
- What you really want is either a map or a list.
- You're doing it wrong.
PHP Array
From Java to C++ to Pascal to Basic, an array is an implementation of a list. As follows from official documentation, however, a PHP array is really a linked hash map. This is very versatile and flexible; yet it goes against the very principles that keep our code sane. Specifically, it violates the Interface Segregation Principle, which leads to much lower separation of concerns, puts unnecessary burden on implementations, confuses and complicates consumers, and makes implementations less flexible. All this leads to a cascade of more negative side-effects which could easily be avoided by applying engineering to the problem at hand. What does this mean for us?
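To make the "linked hash map" point concrete, here is a minimal sketch of how a single PHP array mixes list and map behaviour. The variable names are only illustrative.

```php
<?php
declare(strict_types=1);

// A "list": integer keys 0, 1, 2 are assigned implicitly.
$list = ['a', 'b', 'c'];

// The same type also accepts arbitrary keys, and remembers insertion order
// rather than numeric or alphabetical order.
$mixed = [];
$mixed[10] = 'ten';
$mixed['name'] = 'xedin';
$mixed[2] = 'two';
$mixed[] = 'appended'; // gets the next integer key after the largest one

$keys = array_keys($mixed);

// Removing an element leaves a gap in the keys: the "list" is now just a map.
unset($list[1]);
$isList = array_keys($list) === range(0, count($list) - 1);

var_dump($keys, $isList);
```

Nothing in the array type itself tells a consumer which of these behaviours a given value is expected to have.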
ISP
The Interface Segregation Principle states:
no client should be forced to depend on methods it does not use
Let's consider a typical scenario:
function sumNumbers(array $numbers, int $limit): float
{
    $sum = 0;
    $i = 0;
    foreach ($numbers as $number) {
        if ($i === $limit) {
            break;
        }
        $sum += $number;
        $i++;
    }
    return (float) $sum;
}
Most frequently, this is the sort of thing developers write when they need to perform some operation on the elements of a list. The list is consumed here by iterating over its elements with a simple foreach loop, starting from the first one, until either the end or the limit has been reached. Nothing else is done to this list of numbers besides iterating over it in the given order. And yet, we are asking for an array, as if we needed to access or modify its elements directly, or in random order. What if the consumer of sumNumbers() has an infinite set of numbers and just wants to know the sum of the first 1000? The signature does not permit that, because nothing but an array may be passed. A type defines the ways in which values of that type may be consumed, and looking at ArrayAccess, Iterator, and Countable, which together make up the true interface of an array, we see that this type is in reality far more complex than the ease of its use lets on. But simplicity is not about ease, and a much simpler version of the function affords us incredible flexibility compared to its former state - without changing anything about the algorithm:
function sumNumbers(iterable $numbers, int $limit): float
{
    // ...
}

// Sum the first 1000 numbers of an infinite Fibonacci generator.
echo sumNumbers((function () {
    $i = 0;
    $k = 1;
    yield $k;
    while (true) {
        $k = $i + $k;
        $i = $k - $i;
        yield $k;
    }
})(), 1000);
The new consumer of sumNumbers() can now use any series of numbers, finite or infinite, generated, hard-coded, or loaded from an external source.
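As a runnable sketch of that flexibility, the same function body can be fed from an array, an iterator object, or an endless generator. The fibonacci() helper and the result variable names are assumptions made for this illustration.

```php
<?php
declare(strict_types=1);

/**
 * Sums at most $limit numbers taken from any iterable source.
 */
function sumNumbers(iterable $numbers, int $limit): float
{
    $sum = 0;
    $i = 0;
    foreach ($numbers as $number) {
        if ($i === $limit) {
            break;
        }
        $sum += $number;
        $i++;
    }
    return (float) $sum;
}

/**
 * Hypothetical helper: an endless Fibonacci sequence 1, 1, 2, 3, 5, ...
 */
function fibonacci(): Generator
{
    [$previous, $current] = [0, 1];
    while (true) {
        yield $current;
        [$previous, $current] = [$current, $previous + $current];
    }
}

$fromArray     = sumNumbers([1, 2, 3, 4], 3);                   // stops at the limit
$fromIterator  = sumNumbers(new ArrayIterator([1.5, 2.5]), 10); // fewer items than the limit
$fromGenerator = sumNumbers(fibonacci(), 10);                   // infinite source, bounded by the limit

var_dump($fromArray, $fromIterator, $fromGenerator);
```

The function never learns, or needs to learn, which kind of source it was given.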
Another example of array usage.
function getGreeting(array $user): string
{
    $fullName = [];
    $fullNameSegments = ['first_name', 'last_name'];
    foreach ($fullNameSegments as $segment) {
        if (isset($user[$segment])) {
            $fullName[] = $user[$segment];
        }
    }
    return implode(' ', $fullName);
}
The only way that the $user argument is consumed is by accessing specific, discrete indices. If the consumer wanted to pass something that is only capable of exposing discrete members, which would actually be enough for the algorithm to work, they could not! Let's consider a simplified version, where we depend only on what we actually use.
function getGreeting(MapInterface $user): string
{
    $fullName = [];
    $fullNameSegments = ['first_name', 'last_name'];
    foreach ($fullNameSegments as $segment) {
        if ($user->has($segment)) {
            $fullName[] = $user->get($segment);
        }
    }
    return implode(' ', $fullName);
}

interface MapInterface
{
    public function get(string $key);
    public function has(string $key): bool;
}

class Map implements MapInterface
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }
        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

$user = new Map([
    'first_name' => 'Xedin',
    'last_name' => 'Unknown',
    'id' => '12345',
]);
assert($user instanceof MapInterface);
echo getGreeting($user);
Because the new getGreeting() both requires and consumes only the methods of a map, any compatible map can be used, whether hard-coded, loaded from a database, deserialized, or retrieved from a remote API. Cases such as a remote API or some key-value stores are especially interesting here, because they may not allow listing all entries while still supporting retrieval and checking by key, and so cannot be represented by an array: their "members" are not enumerable.
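A minimal sketch of such a non-enumerable map follows. LazyMap, its resolver callable, and the convention that a null result means "key not found" are all assumptions of this example, standing in for something like a remote lookup.

```php
<?php
declare(strict_types=1);

interface MapInterface
{
    public function get(string $key);
    public function has(string $key): bool;
}

/**
 * Hypothetical map that resolves each key on demand through a callable.
 * It cannot list its entries; it can only answer lookups by key.
 */
class LazyMap implements MapInterface
{
    /** @var callable Receives a key; returns null when the key is unknown. */
    private $resolver;

    public function __construct(callable $resolver)
    {
        $this->resolver = $resolver;
    }

    public function get(string $key)
    {
        $value = ($this->resolver)($key);
        if ($value === null) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }
        return $value;
    }

    public function has(string $key): bool
    {
        return ($this->resolver)($key) !== null;
    }
}

function getGreeting(MapInterface $user): string
{
    $fullName = [];
    foreach (['first_name', 'last_name'] as $segment) {
        if ($user->has($segment)) {
            $fullName[] = $user->get($segment);
        }
    }
    return implode(' ', $fullName);
}

// Stand-in for a remote service that only answers "what is the value of X?".
$remoteUser = new LazyMap(fn (string $key): ?string => match ($key) {
    'first_name' => 'Xedin',
    'last_name' => 'Unknown',
    default => null,
});

$greeting = getGreeting($remoteUser);
echo $greeting;
```

getGreeting() works unchanged, even though this source could never be passed where an array is required.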
Data Representation
Often, data needs to be encoded in text form in order to be saved or transferred. In these cases, some kind of DTO is used to represent that data in the program. Let's look at a typical example of a remote API response:
{
    "users": [
        {
            "id": 1,
            "username": "xedin",
            "first_name": "Xedin",
            "last_name": "Unknown"
        },
        {
            "id": 2,
            "username": "jsmith",
            "first_name": "John",
            "last_name": "Smith"
        }
    ]
}
The response data contains a map with a single member users, which corresponds to a list, where every member is a map with members id, username, first_name, and last_name. This is because JSON is a very simple interchange format that supports maps and lists. Note that there is no such thing as an "ordered map": looking at such a response, we understand quite intuitively that each "user" representation has a schema, which dictates certain mandatory (and perhaps some optional) fields, and in an application this data will be retrieved by keys that are known in advance - because the application is written in accordance with the schema. There is never really a need to get all fields of a user. Let's look at a solution for a typical problem, where entries support arbitrary fields.
{
    "users": [
        {
            "id": 1,
            "username": "xedin",
            "first_name": "Xedin",
            "last_name": "Unknown",
            "meta": [
                {
                    "name": "date_of_birth",
                    "value": "1970-01-01"
                },
                {
                    "name": "hair_colour",
                    "value": null
                }
            ]
        }
    ]
}
This adds support for an arbitrary number of arbitrary members through the meta member of each user. It is a list of maps, each holding a name-value pair, and each map can at any time receive additional members if necessary, such as a type which could determine the data or field type of the member. This is structurally very similar to how data is stored in various engines, be it EAV (Magento 1), WordPress (meta tables), or some other relational or key-value storage, and allows a simple and seamless flow between the HTTP, application, and data layers.
So then, why should the DTO type structure be any different from the schema? If we wanted to represent the entities from the above data in a PHP API, this is what it could look like.
interface UsersResponseInterface
{
    /**
     * @return iterable<UserInterface>
     */
    public function getUsers(): iterable;
}

interface UserInterface
{
    public function getId(): int;
    public function getUsername(): string;
    public function getFirstName(): string;
    public function getLastName(): string;

    /**
     * @return iterable<MetaInterface>
     */
    public function getMeta(): iterable;
}

interface MetaInterface
{
    public function getName(): string;

    // Nullable, because the sample response contains a null value (hair_colour).
    public function getValue(): int|float|string|bool|null;
}
Here, each user is represented by a UserInterface instance, a data object that exposes the members of each "users" entry via its methods. Conceptually, it is consumed as a simple map, by knowing its exact getter method names (keys), and in no other way. The design of arbitrary metadata support follows a similar approach. For convenience, the metadata can also be represented as a map by simply augmenting the UserInterface:
interface MetaMapAwareInterface
{
    public function getMetaMap(): MapInterface;
}

interface UserInterface extends MetaMapAwareInterface
{
    public function getId(): int;
    public function getUsername(): string;
    public function getFirstName(): string;
    public function getLastName(): string;

    /**
     * @return iterable<MetaInterface>
     */
    public function getMeta(): iterable;

    // Inherits `getMetaMap()`
}

/** @var UserInterface $user */
$meta = $user->getMetaMap();
if ($meta->has('date_of_birth')) {
    echo $meta->get('date_of_birth');
}
Note: In the above example, generic syntax such as iterable<MetaInterface> may not be supported by PHPDoc natively, but is probably supported by your IDE, and is definitely supported by Psalm.
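One way getMetaMap() could be backed is by indexing the metadata list by name. The sketch below is an assumption of how that might look: createMetaMap(), the Meta class, and the "last entry wins" rule for repeated names are not part of the original design, and the value type admits null because the sample response contains a null value.

```php
<?php
declare(strict_types=1);

interface MapInterface
{
    public function get(string $key);
    public function has(string $key): bool;
}

interface MetaInterface
{
    public function getName(): string;
    public function getValue(): int|float|string|bool|null;
}

/** Array-backed map, equivalent to the Map class shown earlier. */
class Map implements MapInterface
{
    public function __construct(protected array $data)
    {
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }
        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

/** Hypothetical immutable metadata entry. */
class Meta implements MetaInterface
{
    public function __construct(
        private string $name,
        private int|float|string|bool|null $value
    ) {
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function getValue(): int|float|string|bool|null
    {
        return $this->value;
    }
}

/**
 * Builds a name => value map from a list of metadata entries.
 * If a name repeats, the last entry wins (an assumption of this sketch).
 *
 * @param iterable<MetaInterface> $meta
 */
function createMetaMap(iterable $meta): MapInterface
{
    $data = [];
    foreach ($meta as $entry) {
        $data[$entry->getName()] = $entry->getValue();
    }
    return new Map($data);
}

$metaMap = createMetaMap([
    new Meta('date_of_birth', '1970-01-01'),
    new Meta('hair_colour', null),
]);

$dateOfBirth   = $metaMap->get('date_of_birth');
$hasHairColour = $metaMap->has('hair_colour'); // the key exists even though its value is null
```

The list remains the source of truth; the map is a derived view for consumers that only look values up by name.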
In fact, a more generic DTO type structure could be achieved by converting all maps to e.g. a MapInterface, and all lists to an iterable. Since these are the only two compound types necessary, any data structure can be represented as lists of maps of lists, and so on. Following the ISP allows great flexibility: any such structure can be parsed by a uniform algorithm and preserve more type information, and any part of it can be replaced by one that comes from another source, retrieves data in a different way, or generates mock data on the fly. Through all of this, the meaning of the program and the logic of your DTO's consumers need not change.
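A uniform algorithm of this kind might look like the following sketch. The hydrate() function is hypothetical; it relies on json_decode() returning stdClass instances for JSON objects and PHP arrays for JSON arrays when decoding in object mode.

```php
<?php
declare(strict_types=1);

interface MapInterface
{
    public function get(string $key);
    public function has(string $key): bool;
}

/** Array-backed map, equivalent to the Map class shown earlier. */
class Map implements MapInterface
{
    public function __construct(protected array $data)
    {
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }
        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

/**
 * Recursively converts decoded JSON: objects become maps, arrays become lists.
 */
function hydrate(mixed $value): mixed
{
    if ($value instanceof stdClass) {
        // JSON object: a map. Member names are preserved, values are converted.
        return new Map(array_map('hydrate', get_object_vars($value)));
    }
    if (is_array($value)) {
        // JSON array: a list. A plain array satisfies `iterable`;
        // consumers should rely on nothing more than iteration.
        return array_map('hydrate', $value);
    }
    return $value; // scalars and null pass through unchanged
}

$json = '{"users":[{"id":1,"username":"xedin",'
    . '"meta":[{"name":"date_of_birth","value":"1970-01-01"}]}]}';

$response = hydrate(json_decode($json, false, 512, JSON_THROW_ON_ERROR));

$seen = [];
foreach ($response->get('users') as $user) {
    $seen[] = $user->get('username');
    foreach ($user->get('meta') as $meta) {
        $seen[] = $meta->get('name');
    }
}

var_dump($seen);
```

The consuming loop touches only get() and foreach, so any level of the structure could be swapped for a lazy or remote implementation.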
Indexing
Another very common thing PHP developers do, and which can be found in the code of most frameworks, is something like this:
/**
 * @return array<UserInterface> A list of users, by ID.
 */
function get_users(): array
{
    // Retrieves users from DB...
}
Here, the value returned by get_users() betrays the principles described in this article. While it is perfectly reasonable and valuable to have an index, an index does not imply any order, but simply the ability to reference a whole record directly, often by a combination of only some of its members. If consuming code needs an ordered set of users (for example, sorted by first_name), then it is consuming the interface of a list of users, and every user has the same significance to the consuming logic. If consuming code needs an index of users, where each user can be retrieved by their id, then it is consuming the interface of a map of users: every user has a potentially different significance, and the order is irrelevant. Naturally, it is possible to convert a list of users to an index of users at any time by simply iterating over it. Because with this separation the index is now a collection separate from the list, the index can even be cached separately - like databases do, but also in memory, in a file such as JSON, etc. With some additional logic, such an index can easily be used as an entity repository, which is usually unable to reliably enumerate its members.
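The conversion from a list to an index can be sketched as below. The indexBy() function and the simplified User class are assumptions of this example; a real application would work with its own UserInterface implementations.

```php
<?php
declare(strict_types=1);

interface MapInterface
{
    public function get(string $key);
    public function has(string $key): bool;
}

/** Array-backed map, equivalent to the Map class shown earlier. */
class Map implements MapInterface
{
    public function __construct(protected array $data)
    {
    }

    public function get(string $key)
    {
        if (!array_key_exists($key, $this->data)) {
            throw new RangeException(sprintf('Key %1$s not found', $key));
        }
        return $this->data[$key];
    }

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->data);
    }
}

/** Hypothetical, simplified user record. */
final class User
{
    public function __construct(private int $id, private string $firstName)
    {
    }

    public function getId(): int
    {
        return $this->id;
    }

    public function getFirstName(): string
    {
        return $this->firstName;
    }
}

/**
 * Builds an index (a map) from a list by deriving a key from each item.
 * The callable receives an item and must return its string key.
 */
function indexBy(iterable $items, callable $keyOf): MapInterface
{
    $index = [];
    foreach ($items as $item) {
        $index[$keyOf($item)] = $item;
    }
    return new Map($index);
}

// The list: order matters here (sorted by first name for display).
$users = [new User(2, 'John'), new User(1, 'Xedin')];

// The index: order is irrelevant, lookup by ID is what matters.
$usersById = indexBy($users, fn (User $user): string => (string) $user->getId());

$firstName = $usersById->get('1')->getFirstName();
echo $firstName;
```

The list and the index are now two separate values with two separate interfaces, and either can be cached or replaced independently.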
Summary
Here are some practical take-aways that I would like to suggest.
- Achieve parity across your application layers by representing all data in a documented format with a single source of truth.
- It's either a map or a list. It's not both. If you think you need both, it's a good sign that your design could be simplified.
- Do not restrict the consumers of your APIs to using native types. Strings, lists, and maps can all be generated on the fly, in different ways, and there's no reason to limit how your consumers acquire the data they pass to your code.
- Observe the ISP on one hand, and on the other - always depend on the narrowest type that provides the necessary interface. The array type is far too wide for most cases.