Scroll through the hundreds of icons for "camera" on Noun Project or the 124,706 community-generated drawings of cameras on Google Quick Draw, and you'll notice they're all remarkably similar. Together, they suggest a shared cultural understanding of a camera: a classic point-and-shoot.
But the cameras we encounter every day bear little resemblance, in form or function, to this vestigial object. New capabilities in software, new hardware formats and imaging technologies, and emerging user behaviors around image creation are radically reshaping the object we know as the "camera" into new categories. Perhaps the most impactful influence on the camera comes from computer vision, which empowers cameras not only to capture various kinds of images but also to parse visual information and, in effect, to understand the world.
Software trained on vast datasets of labeled images can recognize things like vehicles, dogs, cats, and people, along with facial features, emotions, and second-order information like movement vectors and gaze direction from raw images and videos. Timo Arnall explored this emerging capability of machines to interpret images in Robot Readable World, a collection of computer vision videos from 2012. In the years since, machine learning has advanced by leaps and bounds in both accuracy and speed—see, for instance, the more recent open-source YOLO Real-Time Object Detection technique for comparison—with the potential to transform how we interface with cameras and computers alike. In media analyst and venture capitalist Benedict Evans' recent Ten Year Futures presentation, he discussed the potential impact of machine learning on cameras in the near future: "You turn the image sensor into a universal input for a computer. It's not just that computers see like people, it's that computers can see like computers, and that makes all sorts of things possible."
Cameras enabled with machine learning therefore have the potential to both automate existing functions of the camera as a tool for human use and extend its creative possibilities far beyond image capture. One notable example is the recently launched Google Clips camera. Clips is a small camera with a special superpower: it understands enough about what it sees that it can take pictures for you. You set it on a shelf or a table or clip it to something, and on-board machine learning allows it to continuously observe, learn familiar faces, detect smiles and movement, and snap a well-composed candid picture, all while allowing you to be present in the scene instead of behind the viewfinder. It also does all of this without connecting to the internet.
As computing hardware becomes smaller and more affordable, and machine learning algorithms become more efficient and accurate, we'll likely see more cameras (and objects of all kinds) imbued with intelligence. According to Eva Snee, UX Lead on the Clips project, there are a lot of technological and user benefits to this approach. By learning on-device instead of communicating with servers in the cloud, the device can maintain the privacy of its user (all clips are stored locally on-device unless you choose to share or save them to your photos library on your phone) and operate much more efficiently in terms of both battery power and speed. "No one gets this data except you," says Snee. "That was very deliberate: you don't need a Google account, you don't need Google Photos."
Clips suggests a future of cameras as photographers, where decisions about the moment of capture are further shifted to the device. In the early design phases of the project, Snee says, the team asked themselves, "We're building an automatic capture camera, why does it need a button? ... This is an amazing breakthrough; let's just make a camera that does it all for you." The Clips team stopped short of removing the button entirely, however. Snee explains that in addition to helping to train a camera to appreciate the inherently subjective, personal nature of photography, the button remained functionally significant:
"Every other camera that a human has interacted with in their life has a button. So it felt extremely foreign, it didn't make sense to people, and it actually made it harder for them to really understand how to use this thing and to understand even what capture means. That was a core design goal that we changed our position on—we need to give people agency and control just like they would have in a traditional camera."
A camera like Clips that can choose an appropriate moment to capture is really only the tip of the iceberg when considering the larger implications of machine learning. As the capacities of computer vision systems continue to evolve rapidly, what else might a camera that understands what it sees be capable of? How might these capabilities shift our relationship with our devices?
Pinterest Lens points to a potential future for the camera as a kind of sampling device: perceiving phenomena it can interpret from its environment and reporting back to the user. Every time you pin an image to a board on the Pinterest platform, you are creating a set of associations between it and other images on the board, which helps Pinterest's machine learning systems to categorize images. Lens leverages these insights to give its smartphone app a semantic understanding of on-camera objects, and uses that understanding for "visual discovery": essentially, querying the world for information relevant to your interests.
The camera in this context is a kind of interpretation device for a user's lived experience, extracting salient information from what it sees and reporting back with useful information rather than serving as a tool for composing and capturing a moment in time.
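To make the idea of board-based associations concrete, here is a minimal sketch in Python. It is not Pinterest's actual system, which is not described here; it assumes only that images pinned to the same board count as related, and the names `build_associations` and `suggest` and the sample boards are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, List


def build_associations(boards: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count how often each pair of images shares a board (hypothetical model)."""
    related: Dict[str, Counter] = {}
    for images in boards.values():
        # Every ordered pair of distinct images on a board is one association.
        for a, b in permutations(set(images), 2):
            related.setdefault(a, Counter())[b] += 1
    return related


def suggest(related: Dict[str, Counter], image: str, k: int = 3) -> List[str]:
    """Return up to k images most often pinned alongside the given image."""
    return [other for other, _ in related.get(image, Counter()).most_common(k)]


# Illustrative data only: three boards pinned by imaginary users.
boards = {
    "kitchen": ["chair", "lamp", "table"],
    "studio": ["chair", "lamp"],
    "garden": ["chair", "bench"],
}
related = build_associations(boards)
```

In this toy data, a camera that recognizes a "chair" would surface "lamp" first, because the two were pinned together on two boards. A production system would presumably learn visual features as well, rather than relying on labels alone.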
Beyond interpreting phenomena at capture, machine learning, and especially techniques like Generative Adversarial Networks (GANs), extends the camera's expressive potential into profoundly unnerving territory. These algorithms have the remarkable ability to synthesize realistic images from the emergent patterns in a database of images. Since their characteristics are drawn from real conditions, they produce a kind of uncanny fantasy of reality: they capture alternative conditions in alternative presents. And as a result, they suggest the potential of a camera without a camera, the full dissolution of the camera's physical form into software.
Take for instance the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, which showcases a technique to translate the characteristics of one set of photos into those of another—for instance, transforming horses into zebras or oranges into apples. Or GPU maker NVIDIA's research paper "Progressive Growing of GANs for Improved Quality, Stability, and Variation" and the uncanny images it produces while exploring the latent space of a database of celebrity faces.
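For readers curious about what "cycle-consistent" means in the first of these papers, the core idea can be written compactly. The notation below follows the paper's convention as best recalled here: a generator G maps images from one set X (say, horses) to another set Y (zebras), and a second generator F maps back. The cycle-consistency term penalizes a round trip that fails to return the original image:

```latex
\mathcal{L}_{\text{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]
```

This term is added, with a weighting factor, to the usual adversarial losses that push G(x) to look like a real member of Y and F(y) like a real member of X. In plain terms, a horse turned into a zebra and back again should still be recognizably the same horse, which is what allows the technique to work without paired before-and-after photos.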
These algorithms use real images as a basis for manufacturing reality. In this sense, they are reminiscent of the 5th Dimensional Camera, a terrific early project by the speculative design studio Superflux. This camera was essentially a prop, designed to suggest the possibilities of the many worlds theory in the emerging science of quantum physics. It's a fictional camera that captures parallel worlds, the parallel possibilities between two moments in time.
The images that GANs produce have a similar quality: extrapolating from conditions in the world to explore plausible alternate realities. None of these images are "real" per se, but as their features are drawn from the world, they are both somehow made of the real and, in composite, made unreal. As a result of this inherent ambivalence, reality gets a bit wobbly. Similar systems for extrapolating from image sets are empowering a new arsenal of fake news and manipulated-to-an-unhealthy-extreme advertisements, transforming political attitudes and images of self in the process.
Consider the already existing pressures on image manipulation in advertisements and the way we present ourselves on social networks as described in Jiayang Fan's piece in the New Yorker, China's Selfie Obsession:
"I asked a number of Chinese friends how long it takes them to edit a photo before posting it on social media. The answer for most of them was about forty minutes per face; a selfie taken with a friend would take well over an hour. The work requires several apps, each of which has particular strengths. No one I asked would consider posting or sending a photo that hadn't been improved."
Pressures to automatically enhance images are likely to continue. Imagine an internet ad adjusting its image content on demand, to match a model trained on an individual viewer's interests. Or an Instagram filter guaranteed to increase follow count by curating your feed and manipulating your images imperceptibly towards a more desirable ideal. Perhaps in such a world of truth-bending we'll take on-camera image manipulation for granted as long as it furthers our interests. But where might this lead us?
In Vernacular Video, a keynote talk at the 2010 Vimeo Awards, sci-fi author and futurist Bruce Sterling took the notion of a camera as reality-sampling-device one step further to explore the future possibilities of what he called a "magic transformation camera", capable of total understanding of a given scene.
"In order to take a picture, I simply tell the system to calculate what that picture would have looked like from that angle at that moment, you see? I just send it as a computational problem out into the cloud wirelessly. [...] In other words there's sort of no camera there, there's just the cloud and computation."
Sterling distills photography into its core action: the selection of a specific vantage point at a specific moment in time. Yet, in the future, this "decisive moment" is instead reconstructed by querying a database with comprehensive knowledge. He later describes imaging and computational power embedded in paint flakes in the walls, a kind of sensory layer on everything, capable of observing everything.
This future concept inspired the early development of DepthKit, a toolkit built by Scatter in the context of efforts by the creative coding community to explore the expressive potential of the Kinect and similar structured-light scanners capable of bringing three-dimensionality to recorded images. The technique, called volumetric video (about which Scatter co-founder James George has written at greater length), allows a real scene to be captured with depth information, or even in-the-round from various vantage points, with perspectives on the scene lit and composed after the fact.
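The geometry behind choosing a viewpoint after the fact can be sketched in a few lines of Python. This is not DepthKit's API, only an illustration under a simple pinhole-camera assumption; the function names and the intrinsic parameters `fx`, `fy`, `cx`, and `cy` (focal lengths and image center, in pixels) are hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def backproject(depth: List[List[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point3D]:
    """Turn a depth image (distance per pixel) into 3D points in camera space."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # no depth reading at this pixel
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def project(points: List[Point3D], fx: float, fy: float,
            cx: float, cy: float, shift_x: float = 0.0) -> List[Pixel]:
    """Re-project points into a virtual camera moved shift_x units to the right."""
    pixels = []
    for x, y, z in points:
        # Moving the camera right moves every point left in its frame.
        pixels.append((fx * (x - shift_x) / z + cx, fy * y / z + cy))
    return pixels


# Illustrative 2x2 depth image: two near pixels, one missing, one far.
depth = [[2.0, 2.0], [0.0, 4.0]]
points = backproject(depth, 2.0, 2.0, 0.5, 0.5)
original = project(points, 2.0, 2.0, 0.5, 0.5)
moved = project(points, 2.0, 2.0, 0.5, 0.5, shift_x=1.0)
```

Re-projecting with no shift recovers the original pixel positions, while shifting the virtual camera moves near points further across the frame than far ones. That parallax is what lets a recorded scene be recomposed from a vantage point the physical camera never occupied.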
So, what to make of all this? If we consider the implications of products like Google Clips and Pinterest Lens, algorithmic approaches like GANs, and Sterling's magic transformation camera as indicators of the future of the camera, they suggest a camera that is far more than a point-and-shoot.
Google Clips suggests a near future of cameras as things endowed with agency, capable of observing, composing, and selecting moments to capture for us. And furthermore, it suggests future cameras as learning platforms, evolving over time in response to human use. Pinterest Lens suggests cameras as reality querying devices, interpreting our surroundings for information of value to us. GANs extend this possibility into generative territory, a near future of cameras as reality-extrapolation or -distortion devices, building on learned models to produce convincing synthetic images. Bruce Sterling's speculative future camera and the volumetric filmmaking techniques it inspired suggest a near future camera whose act of taking a photograph is one of searching through recorded moments from a total history of lived experience.
All of these cases suggest a camera with a very different kind of relationship to its operator: a camera with its own basic intelligence, agency, and access to information. Beyond a formal evolution away from the artifact of the "camera", these novel capabilities should complicate our expectations of what a camera is capable of. And increasingly, we may need to acknowledge a certain speciation has happened: these strange new cameras deserve categories of their own in order to contend with the competing visions of reality they suggest.