


How Does Dropbox Camera Uploads Work?

Camera uploads is a feature in our Android and iOS apps that automatically backs up a user's photos and videos from their mobile device to Dropbox. The feature was first introduced in 2012, and uploads millions of photos and videos for hundreds of thousands of users every day. People who use camera uploads are some of our most dedicated and engaged users. They care deeply about their photo libraries, and expect their backups to be quick and dependable every time. It's important that we offer a service they can trust.

Until recently, camera uploads was built on a C++ library shared between the Android and iOS Dropbox apps. This library served us well for a long time, uploading billions of images over many years. However, it had numerous problems. The shared code had grown polluted with complex platform-specific hacks that made it difficult to understand and risky to change. This risk was compounded by a lack of tooling support, and a shortage of in-house C++ expertise. Plus, after more than five years in production, the C++ implementation was beginning to show its age. It was unaware of platform-specific restrictions on background processes, had bugs that could delay uploads for long periods of time, and made outage recovery difficult and time-consuming.

In 2019, we decided that rewriting the feature was the best way to offer a reliable, trustworthy user experience for years to come. This time, the Android and iOS implementations would be separate and use platform-native languages (Kotlin and Swift respectively) and libraries (such as WorkManager and Room for Android). The implementations could then be optimized for each platform and evolve independently, without being constrained by design decisions from the other.

This post is about some of the design, validation, and release decisions we made while building the new camera uploads feature for Android, which we released to all users during the summer of 2021. The project shipped successfully, with no outages or major issues; error rates went down, and upload performance greatly improved. If you haven't already enabled camera uploads, you should try it out for yourself.

Designing for background reliability

The main value proposition of camera uploads is that it works silently in the background. For users who don't open the app for weeks or even months at a time, new photos should still upload promptly.

How does this work? When someone takes a new photo or modifies an existing photo, the OS notifies the Dropbox mobile app. A background worker we call the scanner carefully identifies all the photos (or videos) that haven't yet been uploaded to Dropbox and queues them for upload. Then another background worker, the uploader, batch uploads all the photos in the queue.

Uploading is a two step process. First, like many Dropbox systems, we break the file into 4 MB blocks, compute the hash of each block, and upload each block to the server. Once all the file blocks are uploaded, we make a final commit request to the server with a list of all block hashes in the file. This creates a new file consisting of those blocks in the user's Camera Uploads folder. Photos and videos uploaded to this folder can then be accessed from any linked device.
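
A minimal sketch of this two-step flow, assuming a hypothetical BlockServerClient interface. The real Dropbox endpoints and hashing scheme differ; SHA-256 is used here purely for illustration.

    import java.io.File
    import java.security.MessageDigest

    // Hypothetical server interface; names and signatures are illustrative only.
    interface BlockServerClient {
        suspend fun uploadBlock(hash: String, bytes: ByteArray)
        suspend fun commitFile(remotePath: String, blockHashes: List<String>)
    }

    private const val BLOCK_SIZE = 4 * 1024 * 1024 // 4 MB blocks

    suspend fun uploadFile(file: File, remotePath: String, client: BlockServerClient) {
        val hashes = mutableListOf<String>()
        file.inputStream().use { input ->
            val buffer = ByteArray(BLOCK_SIZE)
            while (true) {
                val read = input.read(buffer)
                if (read <= 0) break
                val block = buffer.copyOf(read)
                // Hash the block and upload it to the block server.
                val hash = MessageDigest.getInstance("SHA-256")
                    .digest(block)
                    .joinToString("") { "%02x".format(it) }
                client.uploadBlock(hash, block)
                hashes += hash
            }
        }
        // Final commit: tell the server which blocks, in order, make up the file.
        client.commitFile(remotePath, hashes)
    }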

One of our biggest challenges is that Android places strong constraints on how often apps can run in the background and what capabilities they have. For example, App Standby limits our background network access if the Dropbox app hasn't recently been foregrounded. This means we might only be allowed to access the network for a 10-minute interval once every 24 hours. These restrictions have grown more strict in recent versions of Android, and the cross-platform C++ version of camera uploads was not well-equipped to handle them. It would sometimes try to perform uploads that were doomed to fail because of a lack of network access, or fail to restart uploads during the system-provided window when network access became available.

Our rewrite does not escape these background restrictions; they still apply unless the user chooses to disable them in Android's system settings. However, we reduce delays as much as possible by taking maximum advantage of the network access we do receive. We use WorkManager to handle these background constraints for us, guaranteeing that uploads are attempted if, and only if, network access becomes available. Unlike our C++ implementation, we also do as much work as possible while offline—for example, by performing rudimentary checks on new photos for duplicates—before asking WorkManager to schedule us for network access.
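
As an illustration of that scheduling approach, here is a minimal WorkManager sketch that only runs the work when the network is connected; the worker class and unique-work name are assumptions, not the app's real ones.

    import android.content.Context
    import androidx.work.Constraints
    import androidx.work.CoroutineWorker
    import androidx.work.ExistingWorkPolicy
    import androidx.work.NetworkType
    import androidx.work.OneTimeWorkRequestBuilder
    import androidx.work.WorkManager
    import androidx.work.WorkerParameters

    // Illustrative worker: uploads whatever is pending when WorkManager runs it.
    class CameraUploadWorker(context: Context, params: WorkerParameters) :
        CoroutineWorker(context, params) {
        override suspend fun doWork(): Result {
            // ...upload pending photos; return Result.retry() on transient failure...
            return Result.success()
        }
    }

    fun scheduleCameraUploads(context: Context) {
        val constraints = Constraints.Builder()
            .setRequiredNetworkType(NetworkType.CONNECTED) // run only with network access
            .build()
        val request = OneTimeWorkRequestBuilder<CameraUploadWorker>()
            .setConstraints(constraints)
            .build()
        // KEEP avoids enqueueing duplicates if an upload pass is already scheduled.
        WorkManager.getInstance(context)
            .enqueueUniqueWork("camera-uploads", ExistingWorkPolicy.KEEP, request)
    }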

Measuring interactions with our status banners helps us identify emerging problems in our apps, and is a helpful signal in our efforts to eliminate errors. After the rewrite was released, we saw users interacting with more "all done" statuses than usual, while the number of "waiting" or error status interactions went down. (This data reflects only paid users, but non-paying users show similar results.)

To further optimize use of our limited network access, we also refined our handling of failed uploads. C++ camera uploads aggressively retried failed uploads an unlimited number of times. In the rewrite we added backoff intervals between retry attempts, and also tuned our retry behavior for different error categories. If an error is likely to be transient, we retry multiple times. If it's likely to be permanent, we don't bother retrying at all. As a result, we make fewer overall retry attempts—which limits network and battery usage—and users see fewer errors.
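
A sketch of the idea, with made-up error categories and retry limits (the real classification and tuning are more detailed):

    import kotlinx.coroutines.delay

    enum class ErrorKind { TRANSIENT, PERMANENT }

    // Retry a block with exponential backoff, but only for transient errors.
    suspend fun <T> retryWithBackoff(
        maxAttempts: Int = 5,
        initialDelayMs: Long = 1_000,
        classify: (Exception) -> ErrorKind,
        block: suspend () -> T,
    ): T {
        var delayMs = initialDelayMs
        repeat(maxAttempts - 1) {
            try {
                return block()
            } catch (e: Exception) {
                if (classify(e) == ErrorKind.PERMANENT) throw e // don't bother retrying
                delay(delayMs) // back off before the next attempt
                delayMs *= 2
            }
        }
        return block() // final attempt; let any failure propagate
    }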

Designing for performance

Our users don't just expect camera uploads to work reliably. They also expect their photos to upload quickly, and without wasting system resources. We were able to make some big improvements here. For example, first-time uploads of large photo libraries now finish up to four times faster. There are a few ways our new implementation achieves this.

Parallel uploads
First, we substantially improved performance by adding support for parallel uploads. The C++ version uploaded only one file at a time. Early in the rewrite, we collaborated with our iOS and backend infrastructure colleagues to design an updated commit endpoint with support for parallel uploads.

Once the server constraint was gone, Kotlin coroutines made it easy to run uploads concurrently. Although Kotlin Flows are typically processed sequentially, the available operators are flexible enough to serve as building blocks for powerful custom operators that support concurrent processing. These operators can be chained declaratively to produce code that's much simpler, and has less overhead, than the manual thread management that would've been necessary in C++.

    val uploadResults = mediaUploadStore
        .getPendingUploads()
        .unorderedConcurrentMap(concurrentUploadCount) {
            mediaUploader.upload(it)
        }
        .takeUntil {
            it != UploadTaskResult.SUCCESS
        }
        .toList()

A simple example of a concurrent upload pipeline. unorderedConcurrentMap is a custom operator that combines the built-in flatMapMerge and transform operators.
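
One plausible way to build such an operator from those primitives is sketched below; this is an illustration under assumptions, not the exact operator used in the app.

    import kotlinx.coroutines.ExperimentalCoroutinesApi
    import kotlinx.coroutines.flow.Flow
    import kotlinx.coroutines.flow.flatMapMerge
    import kotlinx.coroutines.flow.flowOf
    import kotlinx.coroutines.flow.transform

    // Transforms each element on its own coroutine, up to the given concurrency,
    // emitting results in completion order rather than upstream order.
    @OptIn(ExperimentalCoroutinesApi::class)
    fun <T, R> Flow<T>.unorderedConcurrentMap(
        concurrency: Int,
        block: suspend (T) -> R,
    ): Flow<R> =
        flatMapMerge(concurrency) { value ->
            flowOf(value).transform { emit(block(it)) }
        }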

Optimizing memory use
After adding support for parallel uploads, we saw a big uptick in out-of-memory crashes from our early testers. A number of improvements were required to make parallel uploads stable enough for production.

First, we modified our uploader to dynamically vary the number of simultaneous uploads based on the amount of available system memory. This way, devices with lots of memory could enjoy the fastest possible uploads, while older devices would not be overwhelmed. However, we were still seeing much higher memory usage than we expected, so we used the memory profiler to take a closer look.
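
For example, the concurrency level might be derived from ActivityManager's memory info, something like the following sketch; the thresholds here are invented for illustration.

    import android.app.ActivityManager
    import android.content.Context

    fun chooseConcurrentUploadCount(context: Context): Int {
        val activityManager =
            context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
        val memoryInfo = ActivityManager.MemoryInfo()
        activityManager.getMemoryInfo(memoryInfo)

        val availableMb = memoryInfo.availMem / (1024 * 1024)
        return when {
            memoryInfo.lowMemory -> 1 // device is already under memory pressure
            availableMb > 1_024 -> 4  // plenty of headroom: fastest uploads
            availableMb > 512 -> 2
            else -> 1
        }
    }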

The first thing we noticed was that memory consumption wasn't returning to its pre-upload baseline after all uploads were done. It turned out this was due to an unfortunate behavior of the Java NIO API. It created an in-memory cache on every thread where we read a file, and once created, the cache could never be destroyed. Since we read files with the threadpool-backed IO dispatcher, we typically ended up with many of these caches, one for each dispatcher thread we used. We resolved this by switching to direct byte buffers, which don't allocate this cache.
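
A sketch of the fix: reading through a direct ByteBuffer and a FileChannel, which bypasses the per-thread heap-buffer copies NIO would otherwise cache. A real implementation would also loop until the full block is read.

    import java.io.File
    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.StandardOpenOption

    fun readBlockDirect(file: File, offset: Long, blockSize: Int): ByteArray {
        FileChannel.open(file.toPath(), StandardOpenOption.READ).use { channel ->
            // Direct buffers are handed straight to the OS, so NIO doesn't copy
            // them through its cached per-thread temporary buffers.
            val buffer = ByteBuffer.allocateDirect(blockSize)
            channel.read(buffer, offset)
            buffer.flip()
            val bytes = ByteArray(buffer.remaining())
            buffer.get(bytes)
            return bytes
        }
    }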

The next thing we noticed were large spikes in memory usage when uploading, particularly with larger files. During each upload, we read the file in blocks, copying each block into a ByteArray for further processing. We never created a new byte array until the previous one had gone out of scope, so we expected only one to be in memory at a time. However, it turned out that when we allocated a large number of byte arrays in a short time, the garbage collector could not free them quickly enough, causing a transient memory spike. We resolved this issue by re-using the same buffer for all block reads.
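
A simplified illustration of the buffer-reuse approach, where one array is allocated per upload and handed to the caller for each block:

    import java.io.File

    // The same buffer is reused for every block, so allocation stays flat instead
    // of producing a fresh multi-megabyte array per block for the GC to clean up.
    fun forEachBlock(file: File, blockSize: Int, onBlock: (buffer: ByteArray, length: Int) -> Unit) {
        val buffer = ByteArray(blockSize) // allocated once per upload
        file.inputStream().use { input ->
            while (true) {
                val read = input.read(buffer)
                if (read <= 0) break
                onBlock(buffer, read) // caller must finish with the block before the next read
            }
        }
    }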

Parallel scanning and uploading
In the C++ implementation of camera uploads, uploading could not start until we finished scanning a user's photo library for changes. To avoid upload delays, each scan only looked at changes that were newer than what was seen in the previous scan.

This approach had downsides. There were some edge cases where photos with misleading timestamps could be skipped completely. If we ever missed photos due to a bug or OS change, shipping a fix wasn't enough to recover; we also had to clear affected users' saved scan timestamps to force a full re-scan. Plus, when camera uploads was first enabled, we still had to check everything before uploading anything. This wasn't a great first impression for new users.

In the rewrite, we ensured correctness by re-scanning the whole library after every change. We also parallelized uploading and scanning, so new photos can start uploading while we're still scanning older ones. This means that although re-scanning can take longer, the uploads themselves still start and finish promptly.
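
One way to wire that up is with a channel between the two workers, so uploading begins as soon as the scanner finds its first pending photo. The PhotoScanner and MediaUploader types below are stand-ins, not the app's real interfaces.

    import kotlinx.coroutines.channels.Channel
    import kotlinx.coroutines.coroutineScope
    import kotlinx.coroutines.launch

    data class PhotoItem(val uri: String)

    interface PhotoScanner {
        suspend fun scanLibrary(onPendingPhoto: suspend (PhotoItem) -> Unit)
    }

    interface MediaUploader {
        suspend fun upload(photo: PhotoItem)
    }

    suspend fun scanAndUpload(scanner: PhotoScanner, uploader: MediaUploader) = coroutineScope {
        val pending = Channel<PhotoItem>(capacity = 64)

        launch {
            scanner.scanLibrary { photo -> pending.send(photo) } // stream items as found
            pending.close() // no more items once the full re-scan completes
        }

        launch {
            for (photo in pending) { // starts consuming as soon as the first item arrives
                uploader.upload(photo)
            }
        }
    }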

Validation

A rewrite of this magnitude is risky to ship. It has dangerous failure modes that might only show up at scale, such as corrupting one out of every million uploads. Plus, as with most rewrites, we could not avoid introducing new bugs because we did not understand—or even know about—every edge case handled by the old system. We were reminded of this at the start of the project when we tried to remove some ancient camera uploads code that we thought was dead, and instead ended up DDOSing Dropbox's crash reporting service. 🙃

Hash validation in production
During early development, we validated many low-level components by running them in production alongside their C++ counterparts and then comparing the outputs. This let us confirm that the new components were working correctly before we started relying on their results.

One of those components was a Kotlin implementation of the hashing algorithms that we use to identify photos. Because these hashes are used for de-duplication, unexpected things could happen if the hashes change for even a tiny percentage of photos. For example, we might re-upload old photos believing they are new. When we ran our Kotlin code alongside the C++ implementation, both implementations almost always returned matching hashes, but they differed about 0.005% of the time. Which implementation was wrong?
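
The shadow-validation pattern is simple in outline: compute both hashes, keep relying on the legacy result, and log any disagreement. The function and parameter names below are illustrative, not the real interfaces.

    fun hashWithValidation(
        photoBytes: ByteArray,
        legacyHasher: (ByteArray) -> String,
        rewrittenHasher: (ByteArray) -> String,
        reportMismatch: (legacy: String, rewritten: String) -> Unit,
    ): String {
        val legacy = legacyHasher(photoBytes)
        val rewritten = rewrittenHasher(photoBytes)
        if (legacy != rewritten) {
            // Log disagreements so we can investigate which implementation is wrong.
            reportMismatch(legacy, rewritten)
        }
        return legacy // keep using the existing implementation until the new one is validated
    }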

To answer this, we added some additional logging. In cases where Kotlin and C++ disagreed, we checked if the server subsequently rejected the upload because of a hash mismatch, and if so, what hash it was expecting. We saw that the server was expecting the Kotlin hashes, giving us high confidence the C++ hashes were wrong. This was great news, since it meant we had fixed a rare bug we didn't even know we had.

Validating state transitions
Camera uploads uses a database to track each photo's upload state. Typically, the scanner adds photos in state NEW and then moves them to PENDING (or DONE if they don't need to be uploaded). The uploader tries to upload PENDING photos and then moves them to DONE or ERROR.

Since we parallelize so much work, it's normal for multiple parts of the system to read and write this state database simultaneously. Individual reads and writes are guaranteed to happen sequentially, but we're still vulnerable to subtle bugs where multiple workers try to change the state in redundant or contradictory ways. Since unit tests only cover single components in isolation, they won't catch these bugs. Even an integration test might miss rare race conditions.

In the rewritten version of camera uploads, we guard against this by validating every state update against a set of allowed state transitions. For example, we stipulate that a photo can never move from ERROR to DONE without passing back through PENDING. Unexpected state transitions could indicate a serious bug, so if we see one, we stop uploading and report an exception.
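
A simplified sketch of that guard, with an invented transition table; the real states and rules are more detailed.

    enum class UploadState { NEW, PENDING, DONE, ERROR }

    private val allowedTransitions: Map<UploadState, Set<UploadState>> = mapOf(
        UploadState.NEW to setOf(UploadState.PENDING, UploadState.DONE),
        UploadState.PENDING to setOf(UploadState.DONE, UploadState.ERROR),
        UploadState.ERROR to setOf(UploadState.PENDING), // must pass back through PENDING
        UploadState.DONE to emptySet(),
    )

    fun validateTransition(from: UploadState, to: UploadState) {
        if (to !in allowedTransitions.getValue(from)) {
            // An unexpected transition points at a logic bug, so fail loudly
            // instead of silently corrupting upload state.
            throw IllegalStateException("Illegal upload state transition: $from -> $to")
        }
    }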

These checks helped us detect a nasty bug early in our rollout. We started to see a high volume of exceptions in our logs that were caused when camera uploads tried to transition photos from DONE to DONE. This made us realize we were uploading some photos multiple times! The root cause was a surprising behavior in WorkManager where unique workers can restart before the previous instance is fully cancelled. No duplicate files were being created because the server rejects them, but the redundant uploads were wasting bandwidth and time. Once we fixed the issue, upload throughput dramatically improved.

Rolling it out

Even after all this validation, we still had to be cautious during the rollout. The fully-integrated system was more complex than its parts, and we'd also need to contend with a long tail of rare device types that are not represented in our internal user testing pool. We also needed to continue to meet or surpass the high expectations of all our users who rely on camera uploads.

To reduce this risk preemptively, we made sure to support rollbacks from the new version to the C++ version. For example, we ensured that all user preference changes made in the new version would apply to the old version as well. In the end we never needed to roll back, but it was still worth the effort to have the option available in case of disaster.

We started our rollout with an opt-in pool of beta (Play Store early access) users who receive a new version of the Dropbox Android app every week. This pool of users was large enough to surface rare errors and collect key performance metrics such as upload success rate. We monitored these key metrics in this population for a number of months to gain confidence it was ready to ship widely. We discovered many problems during this time period, but the fast beta release cadence allowed us to iterate and fix them quickly.

We also monitored many metrics that could hint at future problems. To make sure our uploader wasn't falling behind over time, we watched for signs of ever-growing backlogs of photos waiting to upload. We tracked retry success rates by error type, and used this to fine-tune our retry algorithm. Last but not least, we also paid close attention to feedback and support tickets we received from users, which helped surface bugs that our metrics had missed.

When we finally released the new version of camera uploads to all users, it was clear our months spent in beta had paid off. Our metrics held steady through the rollout and we had no major surprises, with improved reliability and low error rates right out of the gate. In fact, we ended up finishing the rollout ahead of schedule. Since we'd front-loaded so much quality improvement work into the beta period (with its weekly releases), we didn't have any multi-week delays waiting for critical bug fixes to roll out in the stable releases.

So, was it worth it?

Rewriting a big legacy feature isn't always the right decision. Rewrites are extremely time-consuming—the Android version alone took two people working for two full years—and can easily cause major regressions or outages. In order to be worthwhile, a rewrite needs to deliver tangible value by improving the user experience, saving engineering time and effort in the long term, or both.

What advice do we have for others who are starting a project like this?

  • Define your goals and how you will measure them. At the start, this is important to make sure that the benefits will justify the effort. At the end, it will help you determine whether you got the results you wanted. Some goals (for example, future resilience against OS changes) may not be quantifiable—and that's OK—but it's good to spell out which ones are and aren't.
  • De-risk it. Identify the components (or system-wide interactions) that would cause the biggest problems if they failed, and guard against those failures from the very start. Build critical components first, and try to test them in production without waiting for the whole system to be finished. It's also worth doing extra work up-front in order to be able to roll back if something goes wrong.
  • Don't rush. Shipping a rewrite is arguably riskier than shipping a new feature, since your audience is already relying on things to work as expected. Start by releasing to an audience that's just big enough to give you the information you need to evaluate success. Then, watch and wait (and fix stuff) until your data give you confidence to continue. Dealing with problems when the user-base is small is much faster and less stressful in the long run.
  • Limit your scope. When doing a rewrite, it's tempting to tackle new feature requests, UI cleanup, and other backlog work at the same time. Consider whether this will actually be faster or easier than shipping the rewrite first and fast-following with the rest. During this rewrite we addressed bugs linked to the core architecture (such as crashes intrinsic to the underlying data model) and deferred all other improvements. If you change the feature too much, not only does it take longer to implement, but it's also harder to notice regressions or roll back.

In this case, we feel good about the decision to rewrite. We were able to improve reliability right away, and more importantly, we set ourselves up to stay reliable in the future. As the iOS and Android operating systems continue to evolve in separate directions, it was only a matter of time before the C++ library broke badly enough to require fundamental systemic changes. Now that the rewrite is complete, we're able to build and iterate on camera uploads much faster—and offer a better experience for our users, too.

Also: We're hiring!

Are you a mobile engineer who wants to make software that's reliable and maintainable for the long haul? If so, we'd love to have you at Dropbox! Visit our jobs page to see current openings.

Source: https://dropbox.tech/mobile/making-camera-uploads-for-android-faster-and-more-reliable
