Pinch me, I'm zooming: gestures in the DOM
Dan Burzo
Posted on November 22, 2020
Note: The version you're reading is a first draft. Please refer to the updated article:
Pinch me, I'm zooming: gestures in the DOM
Interpreting multi-touch user gestures on the web is not as straightforward as you'd imagine. In this article we look at how the current generation of browsers behaves, and piece together a solution using wheel, gesture and touch DOM events.
The anatomy of a gesture
Two-finger gestures on touchscreens and modern trackpads allow users to manipulate on-screen elements as if they were physical objects: to move them and spin them around, to bring them closer or push them further away. Such a gesture encodes a unique combination of translation, uniform scaling, and rotation, known as an (affine) linear transformation, to be applied to the target element.
To create the impression of direct manipulation, this transformation must map naturally to the movement of the touchpoints. One possible mapping is that which keeps the parts you touch underneath the fingertips throughout the gesture. While it's not the only way to interpret a gesture, it's the approach on which mobile operating systems have settled. The principle has also been adapted to trackpads — which, in their modern incarnation, can be thought of as smaller, surrogate (or even literal!) touchscreens.
Let's see how a two-finger gesture maps to the basic components of a linear transformation. The change in distance between the two touchpoints throughout the gesture dictates the scale: if the fingers are brought together to half the initial distance, the object should be made half its original size. The slope defined by the two touchpoints similarly dictates the rotation to be applied to the object. The midpoint, located halfway between the two touchpoints, has a double role: its initial coordinates establish the transformation origin, and its movement throughout the gesture imposes a translation to the object.
Native applications on touch-enabled devices have access to high-level APIs that provide the translation, scale, rotation, and origin of a user gesture directly. On the web, we have to glue together several types of events to get similar results across a variety of platforms.
A summary of relevant DOM events
A WheelEvent is triggered when the user intends to scroll an element with the mousewheel (from which the interface takes its name), a separate "scroll area" on older trackpads, or the entire surface area of newer trackpads with the two-finger vertical movement.
Wheel events have deltaX, deltaY, and deltaZ properties to encode the displacement dictated by the input device, and a deltaMode to establish the unit of measurement:
Constant | Value | Explanation |
---|---|---|
WheelEvent.DOM_DELTA_PIXEL | 0 | scroll an amount of pixels |
WheelEvent.DOM_DELTA_LINE | 1 | scroll by lines |
WheelEvent.DOM_DELTA_PAGE | 2 | scroll entire pages |
As pinch gestures on trackpads became more commonplace, browser implementers needed a way to support them in desktop browsers. Kenneth Auchenberg, in his article on detecting multi-touch trackpad gestures, brings together key pieces of the story. In short, Chrome settled on an approach inspired by Internet Explorer: to encode pinch gestures as wheel events with ctrlKey: true, and the deltaY property holding the proposed scale increment. Firefox eventually did the same, and with Microsoft Edge recently having switched to Chromium as its underlying engine, we have a "standard" of sorts. I use scare-quotes because, as will be revealed shortly (and stop me if you've heard this before about Web APIs), some aspects don't quite match across browsers.
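As a rough illustration of that convention, a wheel handler might separate pinches from scrolls as in the sketch below. The helper name classifyWheelEvent and the shape of its return value are assumptions made for this example, not part of any browser API.

```javascript
// Hypothetical helper: tells pinches from scrolls using only the
// ctrlKey convention described above.
function classifyWheelEvent(e) {
  if (e.ctrlKey) {
    // deltaY holds the proposed scale increment
    return { type: 'pinch', delta: e.deltaY };
  }
  return { type: 'scroll', dx: e.deltaX, dy: e.deltaY };
}

// Usage in a browser (not executed here):
// element.addEventListener('wheel', e => console.log(classifyWheelEvent(e)));
```

Going by this flag alone, a pinch is indistinguishable from a regular scroll performed while the Ctrl key is held down.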
Sometime between Chrome and Firefox adding support for pinch-zoom, Safari 9.1 brought its very own GestureEvent, which exposes precomputed scale and rotation properties, to the desktop.
To this day, Safari remains the only browser implementing GestureEvent, even among browsers on touch-enabled platforms. Instead, mobile browsers produce the arguably more useful TouchEvents, which encode the positions of individual touchpoints in a gesture. They allow us, with a bit more effort than is required with higher-level events, to compute all the components of the linear transformation ourselves: whereas WheelEvent only maps scale, and GestureEvent adds rotation, TouchEvent uniquely affords capturing the translation, with much more fine-grained control over interpreting the gesture.
Intuitively, the combination of wheel, gesture and touch events seems sufficient to handle two-finger gestures across a variety of platforms. Let's see how this intuition (ahem) pans out.
Putting browsers to the test
I've put together a basic test page that logs relevant properties of all the wheel, gesture, and touch events it captures.
The plan is to perform a series of scrolls and pinches in recent versions of Firefox, Chrome, Safari, and Edge (Chromium-based), on a variety of devices I managed to procure for this purpose:
- a MacBook Pro (macOS Big Sur);
- a Surface Laptop with a touchscreen and built-in precision touchpad (Windows 10);
- an ASUS notebook with a non-precision touchpad (Windows 10);
- an iPhone (iOS 14);
- an iPad with a keyboard (iPadOS 14); and
- an external mouse to connect to all laptops.
Let's dig into a few of the results, and how they inform our solution.
Results on macOS
When performing a pinch-zoom gesture, Firefox and Chrome produce a wheel event with deltaY: ±scale, ctrlKey: true. They produce an identical result when you scroll normally with two fingers while physically pressing down the Ctrl key, with the difference that the latter is subject to inertial scrolling. For its part, Safari emits the proprietary gesturestart, gesturechange, and gestureend events, which carry a precomputed scale and rotation.
In all browsers, clientX and clientY, and the position of the on-screen cursor, remain constant throughout two-finger gestures. The pair of coordinates establishes the gesture origin.
The process of testing various modifier keys brought forth some default browser behaviors that we'll likely need to deflect with event.preventDefault():
- Option + wheel in Firefox navigates (or rather flies) through the browser history; this is probably a misapplication of the code that handles discrete steps on a mousewheel, and it feels too weird to be useful on an inertial trackpad;
- Command + wheel in Firefox zooms in and out of the page, similarly to the Command + and Command - keyboard shortcuts;
- Pinching inwards in Safari minimizes the tab into a tab overview screen.
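A minimal sketch of deflecting these defaults follows. The helper name suppressDefaultGestures is hypothetical. The sketch assumes that calling preventDefault() on the wheel and gesture events is enough to suppress the behaviors listed above, which is worth verifying on real devices. Because some browsers treat wheel listeners on root targets such as window or document as passive by default, the listener is registered with passive: false explicitly.

```javascript
// Hypothetical helper: cancels the browser's default reaction to
// wheel and Safari gesture events on `target`.
// Returns a function that removes the listeners again.
function suppressDefaultGestures(target) {
  const prevent = e => e.preventDefault();
  const gestureTypes = ['gesturestart', 'gesturechange', 'gestureend'];

  // preventDefault() is ignored inside passive listeners,
  // so opt out of passive mode explicitly.
  target.addEventListener('wheel', prevent, { passive: false });

  // Browsers without GestureEvent never fire these,
  // so registering them everywhere should be harmless.
  for (const type of gestureTypes) {
    target.addEventListener(type, prevent);
  }

  return function restore() {
    target.removeEventListener('wheel', prevent);
    for (const type of gestureTypes) {
      target.removeEventListener(type, prevent);
    }
  };
}
```

Note that cancelling every wheel event also disables regular scrolling over the target, so it is best applied only to the element being manipulated.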
External, third-party mice are a different matter. Instead of the smooth pixel increments on the trackpad, the mouse's wheel jumps entire lines at a time. (The Scrolling speed setting in System Preferences > Mouse controls how many.)
Accordingly, Firefox shows deltaY: ±1, deltaMode: DOM_DELTA_LINE for a tick of the wheel. This is the first, and at least on macOS the only, encounter with DOM_DELTA_LINE. Chrome and Safari stick with deltaMode: DOM_DELTA_PIXEL and a much larger deltaY, sometimes hundreds of pixels at a time. This is an instance of the "many more pixels than expected" deviation, of which we'll see more throughout the test session. A basic pinch-zoom implementation that doesn't account for this quirk will zoom in and out in large, hard-to-control strides when using the mousewheel.
In all three browsers, deltaX is normally zero. Holding down the Shift key, a common way for users of an external mouse to scroll horizontally, swaps deltas: deltaY becomes zero instead.
Results on Windows
A precision touchpad works on Windows similarly to the Magic Trackpad on macOS: Firefox, Chrome, and Edge produce results comparable to what we've seen on macOS. The quirks emerge with non-precision touchpads and external mice, however.
On Windows, the wheel of an external mouse has two scroll modes: either L lines at a time (with a configurable L), or a whole page at a time.
When using the external mouse with line-scrolling, Firefox produces the expected deltaY: ±L, deltaMode: DOM_DELTA_LINE. Chrome generates deltaY: ±L * N, deltaMode: DOM_DELTA_PIXEL, where N is a multiplier dictated by the browser, and which varies by machine: I've seen 33px on the ASUS laptop and 50px on the Surface. (There's probably an inner logic to what's going on, but it doesn't warrant further investigation at this point.) Edge produces deltaY: ±100, deltaMode: DOM_DELTA_PIXEL, so 100px regardless of the number of lines L that the mouse is configured to scroll. With page-scrolling, browsers uniformly report deltaY: ±1, deltaMode: DOM_DELTA_PAGE. None of the three browsers support holding down the Shift key to reverse the scroll axis of the mousewheel.
On non-precision touchpads, the effect of scrolling on the primary (vertical) axis will mostly be equivalent to that of a mousewheel. The behavior of the secondary (horizontal) axis will not necessarily match it. At least on the machines on which I performed the tests, mouse settings also apply to the touchpad, even when there was no external mouse attached.
In Firefox, in line-scrolling mode, scrolls on both axes produce deltaMode: DOM_DELTA_LINE with deltaX and deltaY, respectively, containing a fraction of a line; a pinch gesture produces a constant deltaY: ±L, deltaMode: DOM_DELTA_LINE, ctrlKey: true. In page-scrolling mode, scrolls on the primary axis produce deltaMode: DOM_DELTA_PAGE, while on the secondary axis it remains in deltaMode: DOM_DELTA_LINE; the pinch gesture produces deltaY: ±1, deltaMode: DOM_DELTA_PAGE, ctrlKey: true. In Chrome, a surprising result is that when scrolling on the secondary axis we get deltaX: 0, deltaY: N * ±L, shiftKey: true. Otherwise, the effects seen with a non-precision touchpad on Windows are of the unexpected deltaMode or unexpected deltaY value varieties.
Converting WheelEvents to gestures
If we took Safari's GestureEvent as the gold standard, and we wanted to derive an equivalent from wheel events, we'd find a few sub-problems to tackle:
- how to normalize the various ways browsers emit wheel events into a uniform delta value;
- how to generate the equivalent of the gesturestart, gesturechange and gestureend events from wheel events;
- how to compute the scale value from the delta.
Let's explore each task one by one.
Normalizing wheel events
Our goal here is to implement a normalizeWheelEvent function as described below:
/*
Normalizes WheelEvent `e`,
returning an array of deltas `[dx, dy]`.
*/
function normalizeWheelEvent(e) {
let dx = e.deltaX;
let dy = e.deltaY;
// TODO: normalize dx, dy
return [dx, dy];
}
This is where we can put our experimental browser data to good use. Let's recap some findings relevant to normalizing wheel events.
The browser may emit deltaX: 0, deltaY: N, shiftKey: true when scrolling horizontally. We want to interpret this as deltaX: N, deltaY: 0 instead:
if (dx === 0 && e.shiftKey) {
return [dy, dx]; // swap deltas
}
Furthermore, the browser may emit values in a deltaMode other than pixels; for each, we need a multiplier:
if (e.deltaMode === WheelEvent.DOM_DELTA_LINE) {
  // scale both axes, since the function returns [dx, dy]
  dx = dx * 8;
  dy = dy * 8;
} else if (e.deltaMode === WheelEvent.DOM_DELTA_PAGE) {
  dx = dx * 24;
  dy = dy * 24;
}
The choice of multipliers ultimately depends on the application. We might take inspiration from browsers themselves or other tools the user may be familiar with; a document viewer may respect the mouse configuration to scroll one page at a time; map-pinching, on the other hand, may benefit from smaller increments.
Finally, the browser may forgo emitting DOM_DELTA_LINE or DOM_DELTA_PAGE where the input device would dictate them, and instead offer a premultiplied value in DOM_DELTA_PIXELs, which is often very large, 100px or more at a time. Why would they do that? With a whole lot of code out there that fails to look at the deltaMode, minuscule DOM_DELTA_LINE / DOM_DELTA_PAGE increments interpreted as pixels would make for lackluster scrolls. Browsers can be excused for trying to give a helping hand, but premultiplied pixel values, often computed in a way that only works if you think of wheel events as signifying scroll intents, are harder to use for other purposes.
Thankfully, in the absence of a more sophisticated approach, simply setting the upper limit of deltaY to something reasonable, such as 24px, just to pump the brakes a little on a wild zoom, can go a long way towards improving the experience.
dy = Math.sign(dy) * Math.min(24, Math.abs(dy));
(The code above uses Math.sign() and Math.min() to impose a maximum on the absolute value of a possibly-negative number.)
These few adjustments should cover a vast array of variations across browsers and devices. Yay compromise!
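One way the fragments above might fit together is sketched below. The multipliers and the 24px cap are the example values used in this section. Applying the cap to both axes, and doing the Shift-key swap before the unit conversion, are assumptions of this sketch rather than requirements. The deltaMode constants are spelled out numerically so the function can run outside a browser.

```javascript
// deltaMode values, as listed in the table of wheel event constants
const DOM_DELTA_LINE = 1;
const DOM_DELTA_PAGE = 2;

// Example values; tune them for your application
const LINE_MULTIPLIER = 8;
const PAGE_MULTIPLIER = 24;
const MAX_DELTA = 24;

// Limits the absolute value of `value` to `max`, keeping its sign
function clampAbs(value, max) {
  return Math.sign(value) * Math.min(max, Math.abs(value));
}

/*
  Normalizes WheelEvent `e`,
  returning an array of deltas `[dx, dy]`.
*/
function normalizeWheelEvent(e) {
  let dx = e.deltaX;
  let dy = e.deltaY;

  // Shift + wheel may report a horizontal scroll on the vertical axis
  if (dx === 0 && e.shiftKey) {
    [dx, dy] = [dy, dx];
  }

  // Convert lines and pages to approximate pixels
  if (e.deltaMode === DOM_DELTA_LINE) {
    dx *= LINE_MULTIPLIER;
    dy *= LINE_MULTIPLIER;
  } else if (e.deltaMode === DOM_DELTA_PAGE) {
    dx *= PAGE_MULTIPLIER;
    dy *= PAGE_MULTIPLIER;
  }

  // Tame premultiplied pixel values
  return [clampAbs(dx, MAX_DELTA), clampAbs(dy, MAX_DELTA)];
}

// A mousewheel tick reported as 120 pixels is capped:
// normalizeWheelEvent({ deltaX: 0, deltaY: 120, deltaMode: 0 }) → [0, 24]
// A one-line tick becomes 8 pixels:
// normalizeWheelEvent({ deltaX: 0, deltaY: -1, deltaMode: 1 }) → [0, -8]
```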
Generating gesture events
With normalization out of the way, the next obstacle is that wheel events are separate happenings, for which we must devise a "start" and "end" if we want to have equivalents to gesturestart and gestureend.
The first wheel event marks the beginning of a gesture, but what about the end? In line with keeping things simple, we consider a gesture done once a number of milliseconds pass after the last wheel event. An outline for batching wheel events into gestures is listed below:
let timer;
let gesture = false;
element.addEventListener('wheel', function(e) {
if (!gesture) {
startGesture(…);
gesture = true;
} else {
doGesture(…);
}
if (timer) {
window.clearTimeout(timer);
}
timer = window.setTimeout(function() {
endGesture(…);
gesture = false;
}, 200); // timeout in milliseconds
});
What arguments we're supposed to send to the startGesture, doGesture, and endGesture functions is explored in the next section.
Converting the delta to a scale
In Safari, a gesturechange event's scale property holds the accumulated scale to apply to the object at each moment of the gesture:
final_scale = initial_scale * event.scale;
In fact, the documentation for the UIPinchGestureRecognizer, which native iOS apps use to detect pinch gestures, and which works similarly to Safari's GestureEvent, emphasizes this aspect:
Important: Take care when applying a pinch gesture recognizer’s scale factor to your content, or you might get unexpected results. Because your action method may be called many times, you cannot simply apply the current scale factor to your content. If you multiply each new scale value by the current value of your content, which has already been scaled by previous calls to your action method, your content will grow or shrink exponentially. Instead, cache the original value of your content, apply the scale factor to that original value, and apply the new value back to your content. Alternatively, reset the scale factor to 1.0 after applying each new change.
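In terms of the startGesture, doGesture, and endGesture functions from the previous section, the advice to cache the original value might translate to the sketch below. The createScaleHandlers factory and the state object are hypothetical stand-ins for however an application stores an element's scale.

```javascript
// Hypothetical factory: builds gesture handlers that apply the
// accumulated gesture scale to a cached initial value.
function createScaleHandlers(state) {
  let initialScale;
  return {
    startGesture() {
      // remember the scale the content had when the gesture began
      initialScale = state.scale;
    },
    doGesture(gesture) {
      // always multiply the cached value, never the current one
      state.scale = initialScale * gesture.scale;
    },
    endGesture(gesture) {
      state.scale = initialScale * gesture.scale;
    }
  };
}

const state = { scale: 2 };
const handlers = createScaleHandlers(state);
handlers.startGesture();
handlers.doGesture({ scale: 1.5 }); // state.scale is now 3
handlers.doGesture({ scale: 2 });   // state.scale is now 4, not 3 * 2 = 6
```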
Conversely, pinch gestures encoded as wheel events contain deltas that correspond to percentage changes in scale that you're supposed to apply incrementally:
scale = previous_scale * (1 + delta/100);
Accumulating a series of increments d1, d2, ..., dN into a final scaling factor requires some back-of-the-napkin arithmetic. The intermediary scales:
scale1 = initial_scale * (1 + d1/100);
scale2 = scale1 * (1 + d2/100);
scale3 = scale2 * (1 + d3/100);
....
Lead us to the formula for the final scale:
final_scale = initial_scale * factor;
factor = (1 + d1/100) * (1 + d2/100) * ... * (1 + dN/100);
Which lets us flesh out the scale we're supposed to send to the startGesture, doGesture and endGesture functions we introduced in the previous section:
let gesture = false;
let timer;
let factor; // accumulates the scaling factor
element.addEventListener('wheel', e => {
let [dx, dy] = normalizeWheelEvent(e);
if (!gesture) {
factor = 1; // reset the factor
startGesture({
scale: factor
});
gesture = true;
} else {
factor = factor * (1 + dy/100);
doGesture({
scale: factor
});
}
if (timer) {
window.clearTimeout(timer);
}
timer = window.setTimeout(() => {
endGesture({
scale: factor
});
gesture = false;
}, 200);
});
This approach will get us scale values in the same ballpark for WheelEvent and GestureEvent, but you'll notice pinches in Firefox and Chrome effect a smaller scale factor than similar gestures in Safari. We can solve this by mixing in a SPEEDUP multiplier that makes up for the difference:
/*
Eyeballing it suggests the sweet spot
for SPEEDUP is somewhere between
1.5 and 3. Season to taste!
*/
const SPEEDUP = 2.5;
factor = factor * (1 + SPEEDUP * dy/100);
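To get a feel for how the increments compound, the formula can be tried on a few made-up deltas. The accumulateScale helper below is hypothetical and simply folds a list of deltas into a factor, with an optional speedup.

```javascript
// Hypothetical helper: folds a series of percentage deltas
// into a single scaling factor.
function accumulateScale(deltas, speedup = 1) {
  return deltas.reduce(
    (factor, d) => factor * (1 + (speedup * d) / 100),
    1
  );
}

// Two increments of 10% compound to roughly 1.21, not 1.2
const grown = accumulateScale([10, 10]);

// Equal and opposite increments do not cancel out exactly:
// 1.1 * 0.9 is roughly 0.99
const roundTrip = accumulateScale([10, -10]);

// With a speedup of 2.5, a delta of 10 scales by 1.25
const boosted = accumulateScale([10], 2.5);
```

The factor stays positive only as long as speedup * delta stays above -100, which is one more reason to cap the normalized deltas.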
Converting TouchEvents to gestures
Touch events are more low-level; they contain everything we need to derive the entire affine transformation ourselves. Each individual touchpoint is encoded in the event.touches list as a Touch object containing, among others, its coordinates clientX and clientY.
Emitting gesture-like events
The four touch events are touchstart, touchmove, touchend and touchcancel.
We want to map these to the startGesture, doGesture and endGesture functions introduced in the WheelEvent section.
Each individual touch produces a touchstart event on contact and a touchend event when lifted from the touchscreen; the touchcancel event is emitted when the browser wants to bail out of the gesture (for example, when adding too many touchpoints to the screen). For our purpose we want to observe gestures involving exactly two touchpoints, and we use the same function watchTouches for all three events.
let gesture = false;
function watchTouches(e) {
if (e.touches.length === 2) {
gesture = true;
e.preventDefault();
startGesture(…);
el.addEventListener('touchmove', touchMove);
el.addEventListener('touchend', watchTouches);
el.addEventListener('touchcancel', watchTouches);
} else if (gesture) {
gesture = false;
endGesture(…);
el.removeEventListener('touchmove', touchMove);
el.removeEventListener('touchend', watchTouches);
el.removeEventListener('touchcancel', watchTouches);
}
}
el.addEventListener('touchstart', watchTouches);
The touchmove event is the only one using its own separate listener:
function touchMove(e) {
if (e.touches.length === 2) {
doGesture(…);
e.preventDefault();
}
}
In the next section we figure out what to put in place of the ellipses (…) as the argument for the startGesture, doGesture, and endGesture functions.
Producing the affine transformation
To have a frame of reference, we must store the initial touches, at the very beginning of a gesture. We'll take advantage of the fact that TouchList and Touch objects are immutable to just save a reference:
let gesture = false;
let initial_touches;
function watchTouches(e) {
if (e.touches.length === 2) {
gesture = true;
initial_touches = e.touches;
startGesture(…);
…
}
…
}
The argument to startGesture is straightforward. We haven't done any gesturing yet, so all parts of the transformation are set to their initial values. The origin of the transform is the midpoint between the two initial touchpoints:
startGesture({
scale: 1,
rotation: 0,
translation: [0, 0],
origin: midpoint(initial_touches)
});
The midpoint is computed as:
function midpoint(touches) {
  let [t1, t2] = touches;
  // returned as an object, since later code reads the .x and .y properties
  return {
    x: (t1.clientX + t2.clientX) / 2,
    y: (t1.clientY + t2.clientY) / 2
  };
}
For the doGesture function, we must compare our pair of current touchpoints to the initial ones, using the distance and angle formed by each pair (for which functions are defined below):
function distance(touches) {
  let [t1, t2] = touches;
  let dx = t2.clientX - t1.clientX;
  let dy = t2.clientY - t1.clientY;
  return Math.sqrt(dx * dx + dy * dy);
}
// angle of the line through the two touchpoints, in degrees
function angle(touches) {
  let [t1, t2] = touches;
  let dx = t2.clientX - t1.clientX;
  let dy = t2.clientY - t1.clientY;
  return 180 / Math.PI * Math.atan2(dy, dx);
}
We can produce the argument to doGesture:
let mp_init = midpoint(initial_touches);
let mp_curr = midpoint(e.touches);
doGesture({
scale: distance(e.touches) / distance(initial_touches),
rotation: angle(e.touches) - angle(initial_touches),
translation: [
mp_curr.x - mp_init.x,
mp_curr.y - mp_init.y
],
origin: mp_init
});
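To check the arithmetic with concrete numbers, the sketch below redefines the helpers on plain objects carrying clientX and clientY, so it can run outside a browser. The gestureFromPoints name is hypothetical, and midpoint is assumed here to return an object with x and y properties.

```javascript
// Points are plain objects shaped like Touch objects
function midpoint([p1, p2]) {
  return {
    x: (p1.clientX + p2.clientX) / 2,
    y: (p1.clientY + p2.clientY) / 2
  };
}

function distance([p1, p2]) {
  return Math.hypot(p2.clientX - p1.clientX, p2.clientY - p1.clientY);
}

// Angle of the line through the two points, in degrees
function angle([p1, p2]) {
  return (180 / Math.PI) *
    Math.atan2(p2.clientY - p1.clientY, p2.clientX - p1.clientX);
}

// Hypothetical helper: derives the gesture from two pairs of points
function gestureFromPoints(initial, current) {
  const mpInit = midpoint(initial);
  const mpCurr = midpoint(current);
  return {
    scale: distance(current) / distance(initial),
    rotation: angle(current) - angle(initial),
    translation: [mpCurr.x - mpInit.x, mpCurr.y - mpInit.y],
    origin: mpInit
  };
}

// Fingers start 100px apart on a horizontal line...
const initial = [
  { clientX: 0, clientY: 0 },
  { clientX: 100, clientY: 0 }
];
// ...and end 200px apart on a vertical line, further down
const current = [
  { clientX: 50, clientY: 0 },
  { clientX: 50, clientY: 200 }
];

const g = gestureFromPoints(initial, current);
// g.scale is 2, g.rotation is approximately 90 (degrees),
// g.translation is [0, 100], and g.origin is { x: 50, y: 0 }
```

One caveat: since each angle comes from Math.atan2(), the difference between two angles can fall outside the -180 to 180 range, so an application that animates rotations may want to normalize it.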
Finally, let's tackle the argument to endGesture. It can't be computed on the spot: at the moment when endGesture gets called, we explicitly don't have two touchpoints available. Therefore, in order to place a relevant gesture as the argument to endGesture we must remember the last gesture we produced. To that end, instead of having the gesture variable hold a boolean value, let's use it to store the latest gesture.
Putting everything together, the watchTouches and touchMove functions look like:
let gesture = false;
let initial_touches;
function watchTouches(e) {
  if (e.touches.length === 2) {
    // keep the initial touches as the frame of reference
    initial_touches = e.touches;
    gesture = {
      scale: 1,
      rotation: 0,
      translation: [0, 0],
      origin: midpoint(initial_touches)
    };
    e.preventDefault();
    startGesture(gesture);
    el.addEventListener('touchmove', touchMove);
    el.addEventListener('touchend', watchTouches);
    el.addEventListener('touchcancel', watchTouches);
  } else if (gesture) {
    endGesture(gesture);
    gesture = null;
    el.removeEventListener('touchmove', touchMove);
    el.removeEventListener('touchend', watchTouches);
    el.removeEventListener('touchcancel', watchTouches);
  }
}
el.addEventListener('touchstart', watchTouches);
function touchMove(e) {
if (e.touches.length === 2) {
let mp_init = midpoint(initial_touches);
let mp_curr = midpoint(e.touches);
gesture = {
scale: distance(e.touches) / distance(initial_touches),
rotation: angle(e.touches) - angle(initial_touches),
translation: [
mp_curr.x - mp_init.x,
mp_curr.y - mp_init.y
],
origin: mp_init
};
doGesture(gesture);
e.preventDefault();
}
}
Safari mobile: touch or gesture events?
Safari mobile (iOS and iPadOS) is the only browser that has support for both GestureEvent and TouchEvent, so which one should you choose for handling two-finger gestures? On the one hand, enhancements Safari applies to GestureEvents make them feel smoother; on the other hand, TouchEvents afford capturing the translation aspect of the gesture. Ultimately, the choice is dictated by the needs of the web application, and the subjective experience on real-life iOS/iPadOS devices.
The feature-detection code, based on which you can attach to GestureEvents or not, is below:
if (typeof GestureEvent !== 'undefined') {
// Safari...
if (typeof TouchEvent !== 'undefined') {
// ...on mobile
} else {
// ...on desktop
}
}
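The same checks can be folded into a small decision function, sketched below. The shouldUseGestureEvents name and its preferGestureOnMobile flag are hypothetical. The global object is passed in as a parameter so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper: decides whether to listen to Safari's
// gesture events, given a global object such as `window`.
function shouldUseGestureEvents(globalObject, preferGestureOnMobile = false) {
  const hasGesture = typeof globalObject.GestureEvent !== 'undefined';
  const hasTouch = typeof globalObject.TouchEvent !== 'undefined';

  if (!hasGesture) {
    return false; // not Safari: rely on wheel and touch events
  }
  if (!hasTouch) {
    return true; // Safari on desktop
  }
  // Safari on mobile: both are available, the application decides
  return preferGestureOnMobile;
}

// Usage in a browser (not executed here):
// if (shouldUseGestureEvents(window, true)) { /* attach gesture listeners */ }
```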
Applying the affine transformation to the object
When we talk about transforming elements, we mean either an HTML or an SVG element. Both use the same syntax, described in the CSS Transforms Level 1 specification:
// translation is an [x, y] array, as produced by the touch gestures
let transform_string = `
translate(
${translation ? translation[0] : 0}
${translation ? translation[1] : 0}
)
scale(${scale || 1})
rotate(${rotation || 0})`;
The mechanisms to apply a transform from DOM APIs are similar. For HTML, we set it on the element's style object; SVG also affords it as an attribute:
html_el.style.transform = transform_string;
svg_el.setAttribute('transform', transform_string);
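One wrinkle to hedge against: values set through style.transform follow CSS syntax, where lengths and angles need explicit units, while the SVG transform attribute has traditionally taken unitless numbers. A sketch with one serializer per target is shown below. The function names cssTransform and svgTransform are hypothetical, and the gesture is assumed to carry its translation as an [x, y] array.

```javascript
// Hypothetical helper: serializes a gesture for element.style.transform
function cssTransform({ translation = [0, 0], scale = 1, rotation = 0 }) {
  const [tx, ty] = translation;
  // CSS needs units: px for lengths, deg for angles
  return `translate(${tx}px, ${ty}px) scale(${scale}) rotate(${rotation}deg)`;
}

// Hypothetical helper: serializes a gesture for the SVG transform attribute
function svgTransform({ translation = [0, 0], scale = 1, rotation = 0 }) {
  const [tx, ty] = translation;
  // unitless numbers; the rotation is in degrees
  return `translate(${tx} ${ty}) scale(${scale}) rotate(${rotation})`;
}

const gesture = { translation: [10, 20], scale: 2, rotation: 45 };
// cssTransform(gesture) → 'translate(10px, 20px) scale(2) rotate(45deg)'
// svgTransform(gesture) → 'translate(10 20) scale(2) rotate(45)'
```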
The origin of the transform must correspond to the gesture's midpoint, and this is done via the transform-origin CSS property and its equivalent SVG attribute. These are slightly different in HTML vs. SVG, so we need some more math to bring the midpoint coordinates to something that can be used for transform-origin.
For SVG elements, values in transform-origin are relative to the element's closest <svg>. The SVGGraphicsElement.getScreenCTM() method returns the object's current transform matrix, which expresses the transform from the element's coordinate system to client coordinates. The .inverse() of that matrix does the opposite, letting us convert client coordinates to values useful for transform-origin:
function clientToSVGElementCoords(el, coords) {
let screen_to_el = el.getScreenCTM().inverse();
let point = el.ownerSVGElement.createSVGPoint();
point.x = coords.x;
point.y = coords.y;
return point.matrixTransform(screen_to_el);
}
let o = clientToSVGElementCoords(el, origin);
el.setAttribute('transform-origin', `${o.x} ${o.y}`);
This works splendidly no matter what transforms are already applied to the element: translation, scale, rotation are all supported.
In HTML the closest we can get to getScreenCTM is with the Element.getBoundingClientRect() method, which returns information about the element's on-screen size and position. And since HTML elements' transform-origin is relative to the element itself, this allows us to compute the appropriate origin for the transform:
function clientToHTMLElementCoords(el, coords) {
let rect = el.getBoundingClientRect();
return {
x: coords.x - rect.x,
y: coords.y - rect.y
};
}
let o = clientToHTMLElementCoords(el, origin);
el.style.transformOrigin = `${o.x}px ${o.y}px`; // CSS lengths need units
Unlike SVG, this method does not work as well when the element is rotated.
Conclusion
In this article we've looked at how we can treat DOM GestureEvent, WheelEvent, or TouchEvent uniformly, to add support for two-finger gestures to web pages with pretty-good-to-great results across a variety of devices.
Head over to danburzo/ok-zoomer on GitHub for the full implementation, as well as the event debug tool I used while researching this article.
Further reading
Miscellaneous things tangential to the article you might find interesting:
- The algorithm for decomposing a DOMMatrix so that you can extract the translation, scale and rotation from a 2D matrix;
- lethargy, a JavaScript library that tries to figure out which wheel events are initiated by the user and which are inertial;
- Chrome's percent-based scrolling, a proposal I haven't yet read up on.