Pinch me, I'm zooming: gestures in the DOM

danburzo

Dan Burzo

Posted on November 22, 2020

Note: The version you're reading is a first draft. Please refer to the updated article:

Pinch me, I'm zooming: gestures in the DOM


Interpreting multi-touch user gestures on the web is not as straightforward as you'd imagine. In this article we look at how the current generation of browsers behave, and piece together a solution using wheel, gesture and touch DOM events.

The anatomy of a gesture

Two-finger gestures on touchscreens and modern trackpads allow users to manipulate on-screen elements as if they were physical objects: to move them and spin them around, to bring them closer or push them further away. Such a gesture encodes a unique combination of translation, uniform scaling, and rotation, known as an (affine) linear transformation, to be applied to the target element.

To create the impression of direct manipulation, this transformation must map naturally to the movement of the touchpoints. One possible mapping is that which keeps the parts you touch underneath the fingertips throughout the gesture. While it's not the only way to interpret a gesture, it's the approach on which mobile operating systems have settled. The principle has also been adapted to trackpads — which, in their modern incarnation, can be thought of as smaller, surrogate (or even literal!) touchscreens.

Let's see how a two-finger gesture maps to the basic components of a linear transformation. The change in distance between the two touchpoints throughout the gesture dictates the scale: if the fingers are brought together to half the initial distance, the object should be made half its original size. The slope defined by the two touchpoints similarly dictates the rotation to be applied to the object. The midpoint, located halfway between the two touchpoints, has a double role: its initial coordinates establish the transformation origin, and its movement throughout the gesture imposes a translation to the object.

Native applications on touch-enabled devices have access to high-level APIs that provide the translation, scale, rotation, and origin of a user gesture directly. On the web, we have to glue together several types of events to get similar results across a variety of platforms.

A summary of relevant DOM events

A WheelEvent is triggered when the user intends to scroll an element with the mousewheel (from which the interface takes its name), a separate "scroll area" on older trackpads, or the entire surface area of newer trackpads with the two-finger vertical movement.

Wheel events have deltaX, deltaY, and deltaZ properties to encode the displacement dictated by the input device, and a deltaMode to establish the unit of measurement:

Constant                      Value   Explanation
WheelEvent.DOM_DELTA_PIXEL    0       scroll an amount of pixels
WheelEvent.DOM_DELTA_LINE     1       scroll by lines
WheelEvent.DOM_DELTA_PAGE     2       scroll entire pages

As pinch gestures on trackpads became more commonplace, browser implementers needed a way to support them in desktop browsers. Kenneth Auchenberg, in his article on detecting multi-touch trackpad gestures, brings together key pieces of the story. In short, Chrome settled on an approach inspired by Internet Explorer: to encode pinch gestures as wheel events with ctrlKey: true, and the deltaY property holding the proposed scale increment. Firefox eventually did the same, and with Microsoft Edge recently having switched to Chromium as its underlying engine, we have a "standard" of sorts. I use scare-quotes because, as will be revealed shortly (and stop me if you've heard this before about Web APIs), some aspects don't quite match across browsers.
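To make the encoding concrete, here is a minimal sketch that summarizes a wheel event using only the properties discussed so far. The describeWheelEvent helper and the DELTA_MODE_NAMES lookup are hypothetical names introduced for illustration, not part of any browser API:

```javascript
// Numeric values of the WheelEvent.DOM_DELTA_* constants, from the table above.
const DELTA_MODE_NAMES = { 0: 'pixel', 1: 'line', 2: 'page' };

// Hypothetical helper: summarizes what a wheel event is telling us.
function describeWheelEvent(e) {
    return {
        // ctrlKey: true marks a pinch (or a Ctrl + scroll, which looks the same)
        kind: e.ctrlKey ? 'pinch' : 'scroll',
        unit: DELTA_MODE_NAMES[e.deltaMode] || 'unknown',
        dx: e.deltaX,
        dy: e.deltaY
    };
}
```

Attached with something like element.addEventListener('wheel', e => console.log(describeWheelEvent(e))), it gives a quick way to see how a given browser and input device report scrolls and pinches.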

Sometime between Chrome and Firefox adding support for pinch-zoom, Safari 9.1 brought its very own GestureEvent, which exposes precomputed scale and rotation properties, to the desktop.

To this day, Safari remains the only browser implementing GestureEvent, even among browsers on touch-enabled platforms. Instead, mobile browsers produce the arguably more useful TouchEvents, which encode the positions of individual touchpoints in a gesture. They allow us, with a bit more effort than is required with higher-level events, to compute all the components of the linear transformation ourselves: whereas WheelEvent only maps scale, and GestureEvent adds rotation, TouchEvent uniquely affords capturing the translation, with much more fine-grained control over interpreting the gesture.

Intuitively, the combination of wheel, gesture and touch events seems sufficient to handle two-finger gestures across a variety of platforms. Let's see how this intuition — ahem — pans out.

Putting browsers to the test

I've put together a basic test page that logs relevant properties of all the wheel, gesture, and touch events it captures.

The plan is to perform a series of scrolls and pinches in recent versions of Firefox, Chrome, Safari, and Edge (Chromium-based), on a variety of devices I managed to procure for this purpose:

  • a MacBook Pro (macOS Big Sur);
  • a Surface Laptop with a touchscreen and built-in precision touchpad (Windows 10);
  • an ASUS notebook with a non-precision touchpad (Windows 10);
  • an iPhone (iOS 14);
  • an iPad with a keyboard (iPadOS 14); and
  • an external mouse to connect to all laptops.

Let's dig into a few of the results, and how they inform our solution.

Results on macOS

When performing a pinch-zoom gesture, Firefox and Chrome produce a wheel event with deltaY: ±scale, ctrlKey: true. They produce an identical result when you scroll normally with two fingers while physically pressing down the Ctrl key, with the difference that the latter is subject to inertial scrolling. For its part, Safari emits the proprietary gesturestart, gesturechange, and gestureend events, which carry a precomputed scale and rotation.

In all browsers, clientX and clientY, and the position of the on-screen cursor, remain constant throughout two-finger gestures. The pair of coordinates establishes the gesture origin.

The process of testing various modifier keys brought forth some default browser behaviors that we'll likely need to deflect with event.preventDefault():

  • Option + wheel in Firefox navigates (or rather flies) through the browser history; this is probably a misapplication of the code that handles discrete steps on a mousewheel, and it feels too weird to be useful on an inertial trackpad;
  • Command + wheel in Firefox zooms in and out of the page, similarly to the Command + and Command - keyboard shortcuts;
  • Pinching inwards in Safari minimizes the tab into a tab overview screen.
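A minimal sketch of deflecting these defaults, assuming that cancelling the wheel and gesture events is enough to suppress them (worth verifying in each browser you target). The suppressDefaultGestures helper is a hypothetical name:

```javascript
// Hypothetical helper: cancels the default browser reaction to wheel and
// Safari gesture events on `el`. Returns a function that undoes it.
function suppressDefaultGestures(el) {
    const cancel = e => e.preventDefault();
    // passive: false is an explicit opt-in, because preventDefault()
    // has no effect inside passive listeners.
    const options = { passive: false };
    const types = ['wheel', 'gesturestart', 'gesturechange', 'gestureend'];
    types.forEach(type => el.addEventListener(type, cancel, options));
    return () => {
        types.forEach(type => el.removeEventListener(type, cancel, options));
    };
}
```

Scoping the listeners to the element being manipulated, rather than the whole document, leaves ordinary scrolling and zooming intact everywhere else on the page.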

External, third-party mice are a different matter. Instead of the smooth pixel increments on the trackpad, the mouse's wheel jumps entire lines at a time. (The Scrolling speed setting in System Preferences > Mouse controls how many.)

Accordingly, Firefox shows deltaY: ±1, deltaMode: DOM_DELTA_LINE for a tick of the wheel. This is the first, and at least on macOS the only, encounter with DOM_DELTA_LINE. Chrome and Safari stick with deltaMode: DOM_DELTA_PIXEL and a much larger deltaY, sometimes hundreds of pixels at a time. This is an instance of the "many more pixels than expected" deviation, of which we'll see more throughout the test session. A basic pinch-zoom implementation that doesn't account for this quirk will zoom in and out in large, hard-to-control strides when using the mousewheel.

In all three browsers, deltaX is normally zero. Holding down the Shift key, a common way for users of an external mouse to scroll horizontally, swaps deltas: deltaY becomes zero instead.

Results on Windows

A precision touchpad works on Windows similarly to the Magic Trackpad on macOS: Firefox, Chrome, and Edge produce results comparable to what we've seen on macOS. The quirks emerge with non-precision touchpads and external mice, however.

On Windows, the wheel of an external mouse has two scroll modes: either L lines at a time (with a configurable L), or a whole page at a time.

When using the external mouse with line-scrolling, Firefox produces the expected deltaY: ±L, deltaMode: DOM_DELTA_LINE. Chrome generates deltaY: ±L * N, deltaMode: DOM_DELTA_PIXEL, where N is a multiplier dictated by the browser, and which varies by machine: I've seen 33px on the ASUS laptop and 50px on the Surface. (There's probably an inner logic to what's going on, but it doesn't warrant further investigation at this point.) Edge produces deltaY: ±100, deltaMode: DOM_DELTA_PIXEL, so 100px regardless of the number of lines L that the mouse is configured to scroll. With page-scrolling, browsers uniformly report deltaY: ±1, deltaMode: DOM_DELTA_PAGE. None of the three browsers support holding down the Shift key to reverse the scroll axis of the mousewheel.

On non-precision touchpads, the effect of scrolling on the primary (vertical) axis will mostly be equivalent to that of a mousewheel. The behavior of the secondary (horizontal) axis will not necessarily match it. At least on the machines on which I performed the tests, mouse settings also apply to the touchpad, even when there was no external mouse attached.

In Firefox, in line-scrolling mode, scrolls on both axes produce deltaMode: DOM_DELTA_LINE with deltaX and deltaY, respectively, containing a fraction of a line; a pinch gesture produces a constant deltaY: ±L, deltaMode: DOM_DELTA_LINE, ctrlKey: true. In page-scrolling mode, scrolls on the primary axis produce deltaMode: DOM_DELTA_PAGE, while on the secondary axis it remains in deltaMode: DOM_DELTA_LINE; the pinch gesture produces deltaY: ±1, deltaMode: DOM_DELTA_PAGE, ctrlKey: true. In Chrome, a surprising result is that when scrolling on the secondary axis we get deltaX: 0, deltaY: N * ±L, shiftKey: true. Otherwise, the effects seen with a non-precision touchpad on Windows are of the unexpected deltaMode or unexpected deltaY value varieties.

Converting WheelEvents to gestures

If we took Safari's GestureEvent as the gold standard, and we wanted to derive an equivalent from wheel events, we'd find a few sub-problems to tackle:

  1. how to normalize the various ways browsers emit wheel events into a uniform delta value;
  2. how to generate the equivalent of the gesturestart, gesturechange and gestureend events from wheel events;
  3. how to compute the scale value from the delta.

Let's explore each task one by one.

Normalizing wheel events

Our goal here is to implement a normalizeWheelEvent function as described below:

/*
    Normalizes WheelEvent `e`,
    returning an array of deltas `[dx, dy]`.
*/
function normalizeWheelEvent(e) {
    let dx = e.deltaX;
    let dy = e.deltaY;
    // TODO: normalize dx, dy
    return [dx, dy];
}

This is where we can put our experimental browser data to good use. Let's recap some findings relevant to normalizing wheel events.

The browser may emit deltaX: 0, deltaY: N, shiftKey: true when scrolling horizontally. We want to interpret this as deltaX: N, deltaY: 0 instead:

if (dx === 0 && e.shiftKey) {
    return [dy, dx]; // swap deltas
}

Furthermore, the browser may emit values in a deltaMode other than pixels; for each, we need a multiplier:

if (e.deltaMode === WheelEvent.DOM_DELTA_LINE) {
    // lines to pixels, on both axes
    dx = dx * 8;
    dy = dy * 8;
} else if (e.deltaMode === WheelEvent.DOM_DELTA_PAGE) {
    // pages to pixels, on both axes
    dx = dx * 24;
    dy = dy * 24;
}

The choice of multipliers ultimately depends on the application. We might take inspiration from browsers themselves or other tools the user may be familiar with; a document viewer may respect the mouse configuration to scroll one page at a time; map-pinching, on the other hand, may benefit from smaller increments.

Finally, the browser may forego emitting DOM_DELTA_LINE or DOM_DELTA_PAGE where the input device would dictate them, and instead offer a premultiplied value in DOM_DELTA_PIXELs, which is often very large, 100px or more at a time. Why would they do that? With a whole lot of code out there that fails to look at the deltaMode, minuscule DOM_DELTA_LINE / DOM_DELTA_PAGE increments interpreted as pixels would make for lackluster scrolls. Browsers can be excused for trying to give a helping hand, but premultiplied pixel values, often computed in a way that only works if you think of wheel events as signifying scroll intents, are harder to use for other purposes.

Thankfully, in the absence of a more sophisticated approach, simply setting the upper limit of deltaY to something reasonable, such as 24px, just to put the brakes on a wild zoom, can go a long way towards improving the experience.

dy = Math.sign(dy) * Math.min(24, Math.abs(dy));

(The code above uses Math.sign() and Math.min() to impose a maximum on the absolute value of a possibly-negative number.)

These few adjustments should cover a vast array of variations across browsers and devices. Yay compromise!
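Assembled into one function, the adjustments could look like the sketch below. It applies the multipliers and the limit to both axes, and spells out the deltaMode constants as numbers so it also runs outside a browser; the limit helper and the constant names are hypothetical, and the values are the ones chosen in this section rather than anything canonical:

```javascript
// Numeric values of WheelEvent.DOM_DELTA_LINE and WheelEvent.DOM_DELTA_PAGE.
const DOM_DELTA_LINE = 1;
const DOM_DELTA_PAGE = 2;

const LINE_MULTIPLIER = 8;  // pixels per line
const PAGE_MULTIPLIER = 24; // pixels per page
const MAX_DELTA = 24;       // upper limit for a single delta, in pixels

// Caps the absolute value of `value` at `max`, preserving its sign.
function limit(value, max) {
    return Math.sign(value) * Math.min(max, Math.abs(value));
}

function normalizeWheelEvent(e) {
    let dx = e.deltaX;
    let dy = e.deltaY;
    if (dx === 0 && e.shiftKey) {
        // Shift + wheel reported on the vertical axis: swap deltas
        [dx, dy] = [dy, dx];
    }
    if (e.deltaMode === DOM_DELTA_LINE) {
        dx *= LINE_MULTIPLIER;
        dy *= LINE_MULTIPLIER;
    } else if (e.deltaMode === DOM_DELTA_PAGE) {
        dx *= PAGE_MULTIPLIER;
        dy *= PAGE_MULTIPLIER;
    }
    return [limit(dx, MAX_DELTA), limit(dy, MAX_DELTA)];
}
```

With these values, a 120px premultiplied tick is capped at 24px, and a three-line tick (3 × 8 = 24px) lands on the same limit, so the two styles of reporting end up producing comparable increments.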

Generating gesture events

With normalization out of the way, the next obstacle is that wheel events are separate happenings, for which we must devise a "start" and "end" if we want to have equivalents to gesturestart and gestureend.

The first wheel event marks the beginning of a gesture, but what about the end? In line with keeping things simple, we consider a gesture done once a number of milliseconds pass after the last wheel event. An outline for batching wheel events into gestures is listed below:

let timer;
let gesture = false;
element.addEventListener('wheel', function(e) {
    if (!gesture) {
        startGesture();
        gesture = true;
    } else {
        doGesture();
    }
    if (timer) {
        window.clearTimeout(timer);
    }
    timer = window.setTimeout(function() {
        endGesture();
        gesture = false;
    }, 200); // timeout in milliseconds
});

What arguments we're supposed to send to the startGesture, doGesture, and endGesture functions is explored in the next section.

Converting the delta to a scale

In Safari, a gesturechange event's scale property holds the accumulated scale to apply to the object at each moment of the gesture:

final_scale = initial_scale * event.scale;

In fact, the documentation for the UIPinchGestureRecognizer which native iOS apps use to detect pinch gestures, and which works similarly to Safari's GestureEvent, emphasizes this aspect:

Important: Take care when applying a pinch gesture recognizer’s scale factor to your content, or you might get unexpected results. Because your action method may be called many times, you cannot simply apply the current scale factor to your content. If you multiply each new scale value by the current value of your content, which has already been scaled by previous calls to your action method, your content will grow or shrink exponentially. Instead, cache the original value of your content, apply the scale factor to that original value, and apply the new value back to your content. Alternatively, reset the scale factor to 1.0 after applying each new change.

Conversely, pinch gestures encoded as wheel events contain deltas that correspond to percentage changes in scale that you're supposed to apply incrementally:

scale = previous_scale * (1 + delta/100);

Accumulating a series of increments d1, d2, ..., dN into a final scaling factor requires some back-of-the-napkin arithmetic. The intermediary scales:

scale1 = initial_scale * (1 + d1/100);
scale2 = scale1 * (1 + d2/100);
scale3 = scale2 * (1 + d3/100);
....

Lead us to the formula for the final scale:

final_scale = initial_scale * factor;
factor = (1 + d1/100) * (1 + d2/100) * ... * (1 + dN/100);
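As a quick check of the formula, the product can be written as a reduction over the list of deltas. The accumulateFactor helper is a hypothetical name used only for this worked example:

```javascript
// Hypothetical helper: folds a list of wheel deltas into one scaling factor,
// following factor = (1 + d1/100) * (1 + d2/100) * ... * (1 + dN/100).
function accumulateFactor(deltas) {
    return deltas.reduce((factor, d) => factor * (1 + d / 100), 1);
}

// Three increments of +10% compound to roughly +33%, not +30%.
const factor = accumulateFactor([10, 10, 10]);

// With an initial_scale of 2, final_scale comes out at about 2.662.
const final_scale = 2 * factor;
```

The compounding is what distinguishes this from simply summing the deltas, and it is also why a -50 followed by a +100 brings the object back to its original size.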

Which lets us flesh out the scale we're supposed to send to the startGesture, doGesture and endGesture functions we introduced in the previous section:

let gesture = false;
let timer;
let factor; // accumulates the scaling factor
element.addEventListener('wheel', e => {
    let [dx, dy] = normalizeWheelEvent(e);
    if (!gesture) {
        factor = 1; // reset the factor
        startGesture({
            scale: factor
        });
        gesture = true;
    } else {
        factor = factor * (1 + dy/100);
        doGesture({
            scale: factor
        });
    }
    if (timer) {
        window.clearTimeout(timer);
    }
    timer = window.setTimeout(() => {
        endGesture({
            scale: factor
        });
        gesture = false;
    }, 200);
});

This approach will get us scale values in the same ballpark for WheelEvent and GestureEvent, but you'll notice pinches in Firefox and Chrome effect a smaller scale factor than similar gestures in Safari. We can solve this by mixing in a SPEEDUP multiplier that makes up for the difference:

/*
    Eyeballing it suggests the sweet spot
    for SPEEDUP is somewhere between 
    1.5 and 3. Season to taste!
*/
const SPEEDUP = 2.5;
factor = factor * (1 + SPEEDUP * dy/100);

Converting TouchEvents to gestures

Touch events are more low-level; they contain everything we need to derive the entire affine transformation ourselves. Each individual touchpoint is encoded in the event.touches list as a Touch object containing, among others, its coordinates clientX and clientY.

Emitting gesture-like events

The four touch events are touchstart, touchmove, touchend and touchcancel.
We want to map these to the startGesture, doGesture and endGesture functions introduced in the WheelEvent section.

Each individual touch produces a touchstart event on contact and a touchend event when lifted from the touchscreen; the touchcancel event is emitted when the browser wants to bail out of the gesture (for example, when adding too many touchpoints to the screen). For our purpose we want to observe gestures involving exactly two touchpoints, and we use the same function, watchTouches, for all three events.

let gesture = false;
function watchTouches(e) {
    if (e.touches.length === 2) {
        gesture = true;
        e.preventDefault();
        startGesture(/* … */);
        el.addEventListener('touchmove', touchMove);
        el.addEventListener('touchend', watchTouches);
        el.addEventListener('touchcancel', watchTouches);
    } else if (gesture) {
        gesture = false;
        endGesture(/* … */);
        el.removeEventListener('touchmove', touchMove);
        el.removeEventListener('touchend', watchTouches);
        el.removeEventListener('touchcancel', watchTouches);
    }
}
el.addEventListener('touchstart', watchTouches);

The touchmove event is the only one using its own separate listener:

function touchMove(e) {
    if (e.touches.length === 2) {
        doGesture(/* … */);
        e.preventDefault();
    }
}

In the next section we figure out what to put in place of the ellipses (…) as the argument for the startGesture, doGesture, and endGesture functions.

Producing the affine transformation

To have a frame of reference, we must store the initial touches, at the very beginning of a gesture. We'll take advantage of the fact that TouchList and Touch objects are immutable to just save a reference:

let gesture = false;
let initial_touches;
function watchTouches(e) {
    if (e.touches.length === 2) {
        gesture = true;
        initial_touches = e.touches;
        startGesture(/* … */);
        // …
    }
    // …
}

The argument to startGesture is straightforward. We haven't done any gesturing yet, so all parts of the transformation are set to their initial values. The origin of the transform is the midpoint between the two initial touchpoints:

startGesture({
  scale: 1,
  rotation: 0,
  translation: { x: 0, y: 0 },
  origin: midpoint(initial_touches)
});

The midpoint is computed as:

function midpoint(touches) {
    let [t1, t2] = touches;
    // returned as an object, since callers read .x and .y
    return {
        x: (t1.clientX + t2.clientX) / 2,
        y: (t1.clientY + t2.clientY) / 2
    };
}

For the doGesture function, we must compare our pair of current touchpoints to the initial ones, using the distance and angle formed by each pair (for which functions are defined below):

// Distance between the two touchpoints, in pixels.
function distance(touches) {
    let [t1, t2] = touches;
    let dx = t2.clientX - t1.clientX;
    let dy = t2.clientY - t1.clientY;
    return Math.sqrt(dx * dx + dy * dy);
}

// Angle of the line through the two touchpoints, in degrees.
function angle(touches) {
    let [t1, t2] = touches;
    let dx = t2.clientX - t1.clientX;
    let dy = t2.clientY - t1.clientY;
    return 180 / Math.PI * Math.atan2(dy, dx);
}

We can produce the argument to doGesture:

let mp_init = midpoint(initial_touches);
let mp_curr = midpoint(e.touches);

doGesture({
    scale: distance(e.touches) / distance(initial_touches),
    rotation: angle(e.touches) - angle(initial_touches),
    translation: {
        x: mp_curr.x - mp_init.x,
        y: mp_curr.y - mp_init.y
    },
    origin: mp_init
});
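To see the numbers this produces, here is a self-contained sketch that runs the computation on plain objects standing in for Touch objects. The gestureFromTouches helper is a hypothetical name, and midpoint, distance and angle are compact restatements of the helpers in this section (with midpoint returning an object with x and y):

```javascript
function midpoint([t1, t2]) {
    return {
        x: (t1.clientX + t2.clientX) / 2,
        y: (t1.clientY + t2.clientY) / 2
    };
}

function distance([t1, t2]) {
    return Math.hypot(t2.clientX - t1.clientX, t2.clientY - t1.clientY);
}

function angle([t1, t2]) {
    const dx = t2.clientX - t1.clientX;
    const dy = t2.clientY - t1.clientY;
    return 180 / Math.PI * Math.atan2(dy, dx);
}

// Hypothetical helper: derives the whole gesture from two pairs of touches.
function gestureFromTouches(initial, current) {
    const mp_init = midpoint(initial);
    const mp_curr = midpoint(current);
    return {
        scale: distance(current) / distance(initial),
        rotation: angle(current) - angle(initial),
        translation: {
            x: mp_curr.x - mp_init.x,
            y: mp_curr.y - mp_init.y
        },
        origin: mp_init
    };
}

// The second finger swings from the right of the first to below it,
// ending up twice as far away.
const initial = [{ clientX: 0, clientY: 0 }, { clientX: 100, clientY: 0 }];
const current = [{ clientX: 0, clientY: 0 }, { clientX: 0, clientY: 200 }];
const g = gestureFromTouches(initial, current);
// g.scale is 2 and g.rotation is about 90 (degrees);
// g.translation is { x: -50, y: 100 } and g.origin is { x: 50, y: 0 }
```

One thing the sketch does not handle is the rotation jumping by 360 degrees when the angle crosses the ±180 boundary; whether that matters depends on how the rotation is applied.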

Finally, let's tackle the argument to endGesture. It can't be computed on the spot: at the moment when endGesture gets called, we explicitly don't have two touchpoints available. Therefore, in order to place a relevant gesture as the argument to endGesture we must remember the last gesture we produced. To that end, instead of having the gesture variable hold a boolean value, let's use it to store the latest gesture.

Putting everything together, the watchTouches and touchMove functions look like:

let gesture = null;
let initial_touches;
function watchTouches(e) {
    if (e.touches.length === 2) {
        // remember the touches that started the gesture
        initial_touches = e.touches;
        gesture = {
            scale: 1,
            rotation: 0,
            translation: { x: 0, y: 0 },
            origin: midpoint(initial_touches)
        };
        e.preventDefault();
        startGesture(gesture);
        el.addEventListener('touchmove', touchMove);
        el.addEventListener('touchend', watchTouches);
        el.addEventListener('touchcancel', watchTouches);
    } else if (gesture) {
        // fewer or more than two touchpoints: end with the last known gesture
        endGesture(gesture);
        gesture = null;
        el.removeEventListener('touchmove', touchMove);
        el.removeEventListener('touchend', watchTouches);
        el.removeEventListener('touchcancel', watchTouches);
    }
}

el.addEventListener('touchstart', watchTouches);

function touchMove(e) {
    if (e.touches.length === 2) {
        let mp_init = midpoint(initial_touches);
        let mp_curr = midpoint(e.touches);
        gesture = {
            scale: distance(e.touches) / distance(initial_touches),
            rotation: angle(e.touches) - angle(initial_touches),
            translation: {
                x: mp_curr.x - mp_init.x,
                y: mp_curr.y - mp_init.y
            },
            origin: mp_init
        };
        doGesture(gesture);
        e.preventDefault();
    }
}

Safari mobile: touch or gesture events?

Safari mobile (iOS and iPadOS) is the only browser that has support for both GestureEvent and TouchEvent, so which one should you choose for handling two-finger gestures? On the one hand, enhancements Safari applies to GestureEvents make them feel smoother; on the other hand, TouchEvents afford capturing the translation aspect of the gesture. Ultimately, the choice is dictated by the needs of the web application, and the subjective experience on real-life iOS/iPadOS devices.

The feature-detection code, based on which you can attach to GestureEvents or not, is below:

if (typeof GestureEvent !== 'undefined') {
    // Safari...
    if (typeof TouchEvent !== 'undefined') {
        // ...on mobile
    } else {
        // ...on desktop
    }
}
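Should you attach to GestureEvents, the mapping to our three functions is nearly one-to-one. The sketch below is a rough outline rather than a tested Safari integration: watchGestureEvents is a hypothetical helper, and it assumes the event's clientX and clientY can serve as the gesture origin, as observed in the desktop tests earlier:

```javascript
// Hypothetical helper: forwards Safari gesture events on `el` to the
// startGesture / doGesture / endGesture callbacks in `handlers`.
function watchGestureEvents(el, handlers) {
    // GestureEvent provides scale and rotation, but no translation.
    const toGesture = e => ({
        scale: e.scale,
        rotation: e.rotation,
        translation: { x: 0, y: 0 },
        origin: { x: e.clientX, y: e.clientY }
    });
    const forward = callback => e => {
        e.preventDefault();
        callback(toGesture(e));
    };
    el.addEventListener('gesturestart', forward(handlers.startGesture));
    el.addEventListener('gesturechange', forward(handlers.doGesture));
    el.addEventListener('gestureend', forward(handlers.endGesture));
}
```

Because the gesture objects have the same shape as the ones produced from wheel and touch events, the code that applies the transformation doesn't need to know where they came from.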

Applying the affine transformation to the object

When we talk about transforming elements we mean either an HTML or an SVG element. Both take a list of transform functions, as described in the CSS Transforms Level 1 specification:

let transform_string = `
    translate(
        ${translation && translation.x ? translation.x : 0 } 
        ${translation && translation.y ? translation.y: 0 }
    )
    scale(${scale || 1}) 
    rotate(${rotation || 0})`;

The mechanisms to apply a transform from DOM APIs are similar. For HTML, we set it on the element's style object; SVG also affords it as an attribute:

html_el.style.transform = transform_string;
svg_el.setAttribute('transform', transform_string);
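One detail worth accounting for: when set through the style object, CSS expects units on lengths and angles and a comma between the arguments of translate(), while the SVG transform attribute takes plain numbers. A minimal sketch that serializes a gesture for either target, using a hypothetical transformString helper with a forCSS flag:

```javascript
// Hypothetical helper: serializes a gesture as a transform string.
// `forCSS` picks CSS syntax (px and deg units, comma-separated translate)
// over the unitless syntax of the SVG `transform` attribute.
function transformString({ translation, scale, rotation }, forCSS) {
    const tx = (translation && translation.x) || 0;
    const ty = (translation && translation.y) || 0;
    const s = scale || 1;
    const r = rotation || 0;
    return forCSS
        ? `translate(${tx}px, ${ty}px) scale(${s}) rotate(${r}deg)`
        : `translate(${tx} ${ty}) scale(${s}) rotate(${r})`;
}
```

With it, the HTML case becomes html_el.style.transform = transformString(gesture, true), and the SVG case svg_el.setAttribute('transform', transformString(gesture, false)).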

The origin of the transform must correspond to the gesture's midpoint, and this is done via the transform-origin CSS property and its equivalent SVG attribute. These are slightly different in HTML vs. SVG, so we need some more math to bring the midpoint coordinates to something that can be used for transform-origin.

For SVG elements, values in transform-origin are relative to the element's closest <svg>. The SVGGraphicsElement.getScreenCTM() method returns the object's current transform matrix, which expresses the transform from the element's coordinate system to client coordinates. The .inverse() of that matrix does the opposite, letting us convert client coordinates to values useful for transform-origin:

function clientToSVGElementCoords(el, coords) {
    let screen_to_el = el.getScreenCTM().inverse();
    let point = el.ownerSVGElement.createSVGPoint();
    point.x = coords.x;
    point.y = coords.y;
    return point.matrixTransform(screen_to_el);
}

let o = clientToSVGElementCoords(el, origin);
el.setAttribute('transform-origin', `${o.x} ${o.y}`);

This works splendidly no matter what transforms are already applied to the element: translation, scale, rotation are all supported.

In HTML the closest we can get to getScreenCTM is with the Element.getBoundingClientRect() method, which returns information about the element's on-screen size and position. And since HTML elements' transform-origin is relative to the element itself, this allows us to compute the appropriate origin for the transform:

function clientToHTMLElementCoords(el, coords) {
  let rect = el.getBoundingClientRect();
  return {
    x: coords.x - rect.x,
    y: coords.y - rect.y
  };
}

let o = clientToHTMLElementCoords(el, origin);
el.style.transformOrigin = `${o.x}px ${o.y}px`; // CSS lengths need a unit

Unlike SVG, this method does not work as well when the element is rotated.

Conclusion

In this article we've looked at how we can treat DOM GestureEvent, WheelEvent, or TouchEvent uniformly, to add support for two-finger gestures to web pages with pretty-good-to-great results across a variety of devices.

Head over to danburzo/ok-zoomer on GitHub for the full implementation, as well as the event debug tool I used while researching this article.

Further reading

Miscellaneous things tangential to the article you might find interesting:

  • The algorithm for decomposing a DOMMatrix so that you can extract the translation, scale and rotation from a 2D matrix;
  • lethargy, a JavaScript library that tries to figure out which wheel events are initiated by the user and which are inertial;
  • Chrome's percent-based scrolling, a proposal I haven't yet read up on.