docs: Add a new formats section, move static deltas in there

The `src/libostree/README-deltas.md` was rather hidden - let's move this into the manual.

This commit is contained in: parent 6821ca1029, commit 11b3050fd7

@ -0,0 +1,181 @@
# OSTree data formats

## On the topic of "smart servers"

One really crucial difference between OSTree and git is that git has a
"smart server". Even when fetching over `https://`, it isn't just a
static webserver, but one that e.g. dynamically computes and
compresses pack files for each client.

In contrast, the author of OSTree feels that for operating system
updates, many deployments will want to use simple static webservers,
the same target most package systems were designed to use. The
primary advantages are security and compute efficiency. Services like
Amazon S3 and CDNs are a canonical target, as well as a stock static
nginx server.

## The archive-z2 format

In the [repo](repo) section, the concept of objects was introduced,
where file/content objects are checksummed and managed individually
(unlike a package system, which operates on compressed aggregates).

The archive-z2 format simply gzip-compresses each content object.
Metadata objects are stored uncompressed. This means that it's easy
to serve via static HTTP.

When you commit new content, you will see new `.filez` files appearing
in `objects/`.
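
The object layout can be illustrated with a short sketch. Note this is a
simplified illustration: `archive_object_path` is a hypothetical helper, and a
real OSTree content checksum covers a metadata header (uid/gid/xattrs) in
addition to the raw file data.

```python
import hashlib

def archive_object_path(checksum: str, objtype: str) -> str:
    # Objects live under a two-hex-character "prefix directory" so that
    # no single directory grows too large; content objects get the
    # .filez suffix in archive-z2 repos since they are stored compressed.
    return f"objects/{checksum[:2]}/{checksum[2:]}.{objtype}"

# A content object addressed by the checksum of (simplified) file data:
csum = hashlib.sha256(b"#!/bin/sh\necho hello\n").hexdigest()
path = archive_object_path(csum, "filez")
```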

## archive-z2 efficiency

The advantages of `archive-z2`:

- It's easy to understand and implement
- Can be served directly over plain HTTP by a static webserver
- Clients can download/unpack updates incrementally
- Space efficient on the server

The biggest disadvantage of this format is that for a client to
perform an update, one HTTP request per changed file is required. In
some scenarios, this actually isn't bad at all, particularly with
techniques to reduce HTTP overhead, such as
[HTTP/2](https://en.wikipedia.org/wiki/HTTP/2).

In order to make this format work well, you should design your content
such that large data that changes infrequently (e.g. graphic images)
is stored separately from small, frequently changing data (application
code).

Other disadvantages of `archive-z2`:

- It's quite bad when clients are performing an initial pull (without HTTP/2)
- One doesn't know the total size (compressed or uncompressed) of content
  before downloading everything

## Aside: the bare and bare-user formats

The most common operation is to pull from an `archive-z2` repository
into a `bare` or `bare-user` formatted repository. These latter two
are not compressed on disk. In other words, pulling to them is
similar to unpacking (but not installing) an RPM/deb package.

The `bare-user` format is a bit special in that the uid/gid and xattrs
from the content are ignored. This is primarily useful if you want to
have the same OSTree-managed content that can be run on a host system
or in an unprivileged container.

## Static deltas

OSTree itself was originally focused on a continuous delivery model, where
client systems are expected to update regularly. However, many OS vendors
would like to supply content that's updated e.g. once a month or less often.

For this model, we can do a lot better to support batched updates than
a basic `archive-z2` repo. However, we still want to preserve the
model of "static webserver only". Given this, OSTree has gained the
concept of a "static delta".

These deltas are targeted to be a delta between two specific commit
objects, including "bsdiff" and "rsync-style" deltas within a content
object. Static deltas also support `from NULL`, where the client can
more efficiently download a commit object from scratch.

Effectively, we're spending server-side storage (and one-time compute
cost), and gaining efficiency in client network bandwidth.

## Static delta repository layout

Since static deltas may not exist, the client first needs to attempt
to locate one. Suppose a client wants to retrieve commit `${new}`
while currently running `${current}`.

The first thing to understand is that, in order to save space, the two
commit checksums are represented in "modified base64", where the `/`
character is replaced with `_`.

Like the commit objects, a "prefix directory" is used to make
management easier for filesystem tools.

A delta is named `$(mbase64 $from)-$(mbase64 $to)`, for example
`GpTyZaVut2jXFPWnO4LJiKEdRTvOw_mFUCtIKW1NIX0-L8f+VVDkEBKNc1Ncd+mDUrSVR4EyybQGCkuKtkDnTwk`,
which in sha256 format is
`1a94f265a56eb768d714f5a73b82c988a11d453bcec3f985502b48296d4d217d-2fc7fe5550e410128d73535c77e98352b495478132c9b4060a4b8ab640e74f09`.

Finally, the actual content can be found in
`deltas/$fromprefix/$fromsuffix-$to`.
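
The naming scheme above can be reproduced in a few lines. This is a sketch:
`to_mbase64` and `delta_path` are hypothetical helper names, and the
two-character prefix split is assumed to work analogously to the commit
object layout.

```python
import base64

def to_mbase64(checksum_hex: str) -> str:
    # "Modified base64": standard base64 of the binary checksum, with
    # '/' replaced by '_' and the trailing '=' padding dropped.
    raw = bytes.fromhex(checksum_hex)
    return base64.b64encode(raw).decode().rstrip("=").replace("/", "_")

def delta_path(from_hex: str, to_hex: str) -> str:
    # Assumed: the prefix directory is the first two characters of the
    # delta name, analogous to the commit object layout.
    name = to_mbase64(from_hex) + "-" + to_mbase64(to_hex)
    return "deltas/" + name[:2] + "/" + name[2:]

current = "1a94f265a56eb768d714f5a73b82c988a11d453bcec3f985502b48296d4d217d"
new = "2fc7fe5550e410128d73535c77e98352b495478132c9b4060a4b8ab640e74f09"
```

Running `delta_path(current, new)` on the two checksums from the example
above reproduces the delta name shown there.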

## Static delta internal structure

A delta is itself a directory. Inside, there is a file called
`superblock` which contains metadata. The rest of the files are
integer-named packs of content.

The file format of static deltas should currently be considered an
OSTree implementation detail. Obviously, nothing stops one from
writing code which is compatible with OSTree today. However, we would
like the flexibility to expand and change things, and having multiple
codebases makes that more problematic. Please contact the authors
with any requests.

That said, one critical thing to understand about the design is that
delta payloads are a bit more like "restricted programs" than they are
raw data. There's a "compilation" phase which generates output that
the client executes.

This "updates as code" model allows for multiple content generation
strategies. The design of this was inspired by that of
[ChromiumOS autoupdate](http://dev.chromium.org/chromium-os/chromiumos-design-docs/filesystem-autoupdate).

### The delta superblock

The superblock contains:

- arbitrary metadata
- the delta generation timestamp
- the new commit object
- an array of recursive deltas to apply
- an array of per-part metadata, including total object sizes (compressed and uncompressed)
- an array of fallback objects
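
As a mental model only (field names here are hypothetical, and the on-disk
encoding is an OSTree implementation detail as noted above), the superblock's
contents can be modeled like this:

```python
from dataclasses import dataclass, field

@dataclass
class PartMeta:
    # Per-part metadata: which content a part yields and its sizes.
    checksum: str
    size_compressed: int
    size_uncompressed: int

@dataclass
class Superblock:
    metadata: dict = field(default_factory=dict)   # arbitrary metadata
    timestamp: int = 0                             # delta generation time
    new_commit: bytes = b""                        # the new commit object
    recursive_deltas: list = field(default_factory=list)
    parts: list = field(default_factory=list)      # list of PartMeta
    fallback_objects: list = field(default_factory=list)

# Because per-part sizes are in the superblock, a client can predict
# the download cost before fetching anything - unlike plain archive-z2:
sb = Superblock(parts=[PartMeta("aa...", 1_000, 4_096),
                       PartMeta("bb...", 2_000, 9_000)])
download_size = sum(p.size_compressed for p in sb.parts)
```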

Let's define a delta part, then return to discuss details:

## A delta part

A delta part is a combination of a raw blob of data, plus a very
restricted bytecode that operates on it. Say, for example, that two
files happen to share a common section. It's possible for the delta
compilation to include that section once in the delta data blob, then
generate instructions to write out that blob twice when generating
both objects.

Realistically though, it's very common for most of a delta to just be
a "stream of new objects" - if one considers it, it doesn't make sense
to have too much duplication inside operating system content at this
level.

So then, what's more interesting is that OSTree static deltas support
a per-file delta algorithm called
[bsdiff](https://github.com/mendsley/bsdiff) that most notably works
well on executable code.

The current delta compiler scans for files with matching basenames in
each commit that have a similar size, and attempts a bsdiff between
them. (It would make sense later to have a build system provide a
hint for this - for example, files within the same package.)

A generated bsdiff is included in the payload blob, and applying it is
an instruction.
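
The matching heuristic just described can be sketched as follows. This is a
simplified illustration, not the real delta compiler; the 0.5 size-ratio
threshold and the function name are assumptions for the example.

```python
import os

def bsdiff_candidates(old_files, new_files, size_ratio=0.5):
    """Pair files across two commits by matching basename, keeping
    pairs whose sizes are similar enough for bsdiff to pay off.

    old_files/new_files: dicts of path -> size in bytes.
    Returns a list of (old_path, new_path) pairs to try bsdiff on.
    """
    by_basename = {}
    for path, size in old_files.items():
        by_basename.setdefault(os.path.basename(path), []).append((path, size))

    pairs = []
    for new_path, new_size in new_files.items():
        for old_path, old_size in by_basename.get(os.path.basename(new_path), []):
            smaller, larger = sorted((old_size, new_size))
            if larger > 0 and smaller / larger >= size_ratio:
                pairs.append((old_path, new_path))
    return pairs

old = {"usr/bin/bash": 900_000, "usr/lib/libc.so.6": 2_000_000}
new = {"usr/bin/bash": 910_000, "usr/lib/libc.so.6": 2_050_000,
       "usr/bin/newtool": 50_000}
pairs = bsdiff_candidates(old, new)
```

Here `usr/bin/newtool` has no counterpart in the old commit, so it would be
shipped as new data rather than as a bsdiff.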

## Fallback objects

It's possible for there to be large-ish files which might be resistant
to bsdiff. A good example is that it's common for operating systems
to use an "initramfs", which is itself a compressed filesystem. This
"internal compression" defeats bsdiff analysis.

For these types of objects, the delta superblock contains an array of
"fallback objects". These objects aren't included in the delta
parts - the client simply fetches them from the underlying `.filez`
object.
@ -8,3 +8,4 @@ pages:
- Deployments: 'manual/deployment.md'
- Atomic Upgrades: 'manual/atomic-upgrades.md'
- Adapting Existing Systems: 'manual/adapting-existing.md'
- Formats: 'manual/formats.md'
@ -1,158 +0,0 @@
OSTree Static Object Deltas
===========================
Currently, OSTree's "archive-z2" mode stores both metadata and content
objects as individual files in the filesystem. Content objects are
zlib-compressed.

The advantages of this model are:

0) It's easy to understand and implement
1) Can be served directly over plain HTTP by a static webserver
2) Space efficient on the server

However, it can be inefficient for both large updates and small ones:

0) For large tree changes (such as going from -runtime to
   -devel-debug, or major version upgrades), this can mean thousands
   and thousands of HTTP requests. The overhead for that is very
   large (until SPDY/HTTP2.0), and will be catastrophically bad if the
   webserver is not configured with KeepAlive.
1) Small changes (a typo in a gnome-shell .js file) still require around
   5 metadata HTTP requests, plus a redownload of the whole file.
Why not smart servers?
======================

Smart servers (custom daemons, or just CGI scripts) as git has are not
under consideration for this proposal. OSTree is designed for the
same use case as GNU/Linux distribution package systems, where
content is served by a network of volunteer mirrors that will
generally not run custom code.

In particular, Amazon S3 style dumb content servers are a very
important use case, as is being able to apply updates from static
media like a DVD-ROM.
Finding Static Deltas
=====================

Since static deltas may not exist, the client first needs to attempt
to locate one. Suppose a client wants to retrieve commit ${new} while
currently running ${current}. The first thing to fetch is the delta
metadata, called "meta". It can be found at
${repo}/deltas/${current}-${new}/meta.

FIXME: GPG signatures (.metameta?) Or include the commit object in meta?
But we would then be forced to verify the commit only after processing
the entirety of the delta, which is dangerous. I think we need to
require signing deltas.
Delta Bytecode Format
=====================

A delta-part has the following form:

  byte compression-type (0 = none, 'g' = gzip)
  REPEAT[(varint size, delta-part-content)]

  delta-part-content:
    byte[] payload
    ARRAY[operation]
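
The format above does not pin down the varint encoding; here is a sketch
assuming the common little-endian base-128 ("LEB128"-style) scheme, which is
an assumption for illustration rather than a statement about the actual
on-wire encoding.

```python
def varint_encode(n: int) -> bytes:
    # Little-endian base-128: 7 bits per byte, high bit set on all
    # but the final byte.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(buf: bytes, offset: int = 0):
    # Returns (value, next_offset) so a parser can keep scanning.
    shift = value = 0
    while True:
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, offset

size, next_offset = varint_decode(varint_encode(300))
```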

The rationale for having delta-parts is that they allow easy incremental
resumption of downloads. The client can look at the delta descriptor
and skip downloading delta-parts for which it already has the
contained objects. This is better than simply resuming a gigantic
file, because if the client decides to fetch a slightly newer version,
it's very probable that some of the downloading we've already done is
still useful.

The actual delta payload comes as a stream of (payload, operation)
pairs so that it can be processed while being decompressed.

Finally, the delta-part-content is effectively a high level bytecode
for a stack-oriented machine. It iterates on the array of objects in
order. The following operations are available:

FETCH
  Fall back to fetching the current object individually. Move
  to the next object.

WRITE(array[(varint offset, varint length)])
  Write from current input target (default payload) to output.

GUNZIP(array[(varint offset, varint length)])
  gunzip from current input target (default payload) to output.

CLOSE
  Close the current output target, and proceed to the next; if the
  output object was a temporary, the output resets to the current
  object.

# Change the input source to an object
READOBJECT(csum object)
  Set object as current input target

# Change the input source to payload
READPAYLOAD
  Set payload as current input target
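
To make the stack-machine flavor concrete, here is a toy interpreter for just
the WRITE/CLOSE subset. This is a simplified model: the real bytecode also
has FETCH, GUNZIP, and the input-source operations, and the tuple-based op
representation here is invented for the example. It shows how one shared
payload section can be written into two different output objects.

```python
import io

def apply_part(payload: bytes, ops):
    # Execute a WRITE/CLOSE-only subset of the delta-part bytecode.
    # Each CLOSE finishes the current output object and starts the next.
    outputs, current = [], io.BytesIO()
    for op in ops:
        if op[0] == "WRITE":
            _, offset, length = op
            current.write(payload[offset:offset + length])
        elif op[0] == "CLOSE":
            outputs.append(current.getvalue())
            current = io.BytesIO()
    return outputs

# One shared section, stored once in the payload, emitted twice:
payload = b"SHARED-SECTION" + b"unique-tail"
objs = apply_part(payload, [
    ("WRITE", 0, 14), ("CLOSE",),
    ("WRITE", 0, 14), ("WRITE", 14, 11), ("CLOSE",),
])
```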
Compiling Deltas
================

After reading the above, you may be wondering how we actually *make*
these deltas. I envision a strategy similar to that employed by
Chromium autoupdate:
http://www.chromium.org/chromium-os/chromiumos-design-docs/autoupdate-details

Something like this would be a useful initial algorithm:

1) Compute the set of added objects NEW
2) For each object in NEW:
   - Look for the set of "superficially similar" objects in the
     previous tree, using heuristics based first on filename (including
     prefix), then on size. Call this set CANDIDATES.
     For each entry in CANDIDATES:
     - Try doing a bup/librsync style rolling checksum, and compute the
       list of changed blocks.
     - Try gzip-compressing it.
3) Choose the lowest cost method for each NEW object, and partition
   the program for each method into deltapart-sized chunks.
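
Step 3 - picking the cheapest method per object - can be sketched like this.
Illustrative only: the method names and byte costs are made up, and the real
compiler would also partition the winning programs into deltapart-sized
chunks.

```python
def choose_methods(candidates):
    # candidates: dict of object -> list of (method_name, byte_cost);
    # every object always has a plain-fetch fallback entry.
    chosen = {}
    for obj, options in candidates.items():
        chosen[obj] = min(options, key=lambda opt: opt[1])
    return chosen

plans = {
    # Executable content deltas well:
    "usr/bin/bash": [("fetch", 910_000), ("rolling-checksum", 40_000),
                     ("gzip", 350_000)],
    # Already-compressed content doesn't, so plain fetch wins:
    "usr/lib/initramfs.img": [("fetch", 8_000_000), ("gzip", 8_050_000)],
}
chosen = choose_methods(plans)
```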

However, there are many other possibilities that could be used in a
hybrid mode with the above. For example, we could try to find similar
objects and gzip them together. This would be a *very* useful
strategy for things like the 9000 Boost headers, which have massive
amounts of redundant data.

Notice too that the delta format supports falling back to retrieving
individual objects. For cases like the initramfs, which is compressed
inside the tree with gzip, we're not going to find an efficient way to
sync it, so the delta compiler should just fall back to fetching it
individually.
Which Deltas To Create?
=======================

Going back to the start, there are two cases to optimize for:

1) Incremental upgrades between builds
2) Major version upgrades

A command line operation would look something like this:

  $ ostree --repo=/path/to/repo gendelta --ref-prefix=gnome-ostree/buildmaster/ --strategy=latest --depth=5

This would tell ostree to generate deltas from each of the last 4
commits to each ref (e.g. gnome-ostree/buildmaster/x86_64-runtime) to
the latest commit. It might also be possible of course to have
--strategy=incremental, where we generate a delta between each commit.
I suspect that'd be something to do if one has a *lot* of disk space
to spend, and there's a reason for clients to be fetching individual
refs.

  $ ostree --repo=/path/to/repo gendelta --from=gnome-ostree/3.10/x86_64-runtime --to=gnome-ostree/buildmaster/x86_64-runtime

This is an obvious one - generate a delta from the last stable release
to the current development head.