.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

.. include:: ../../common.defs

.. _developer-cache-consistency:

Cache Tools
~~~~~~~~~~~

Tools and techniques for cache monitoring and inspection.

* :ref:`The cache inspector <inspecting-the-cache>`.

Topics to be done
~~~~~~~~~~~~~~~~~

* Resident alternates
* Object refresh

Cache Consistency
~~~~~~~~~~~~~~~~~

The cache is completely consistent, up to and including kicking the power cord
out, if the write buffer on consumer disk drives is disabled. You need to use::

  hdparm -W0

The cache validates that all the data for the document is available and will
silently mark a partial document as a miss on read. There is no gentle
shutdown for Traffic Server. You simply kill the process and the recovery code
(fsck) is run every time Traffic Server starts up.

On startup the two versions of the index are checked, and the last valid one is
read into memory. |TS| then moves forward from the last snapped write
cursor and reads all the fragments written to disk and updates the directory
(as in a log-based file system). It stops reading at the write before the last
valid write header it sees (as a write is not necessarily atomic because of
sector reordering). Then the new updated index is written to the invalid
version (in case of a crash during startup) and the system starts.

.. _volume tagging:

Volume Tagging
~~~~~~~~~~~~~~

Currently, :term:`cache volumes <cache volume>` are allocated somewhat
arbitrarily from storage elements. `This enhancement <https://issues.apache.org/jira/browse/TS-1728>`__
allows :file:`storage.config` to assign :term:`storage units <storage unit>` to
specific :term:`volumes <cache volume>` although the volumes must still be
listed in :file:`volume.config` in general and in particular to map domains to
specific volumes. A primary use case for this is to be able to map specific
types of content to different storage elements. This can be employed to have
different storage devices for various types of content (SSD vs. rotational).

Version Upgrade
---------------

It is currently the case that any change to the cache format will clear the
cache. This is an issue when upgrading the |TS| version and should be kept in mind.

.. _cache-key:

Controlling the cache key
-------------------------

The :term:`cache key` is by default the URL of the request. There are two
possible choices, the original (pristine) URL and the remapped URL. Which of
these is used is determined by the configuration value
:ts:cv:`proxy.config.url_remap.pristine_host_hdr`.

This is an ``INT`` value. If set to ``0`` (disabled) then the remapped URL is
used, and if it is not ``0`` (enabled) then the original URL is used. This
setting also controls the value of the ``HOST`` header that is placed in the
request sent to the :term:`origin server`, using the hostname from the original
URL if not ``0`` and the host name from the remapped URL if ``0``. It has no
other effects.

For caching, this setting is irrelevant if no remapping is done or there is a
one-to-one mapping between the original and remapped URLs.

It becomes significant if multiple original URLs are mapped to the same
remapped URL. If pristine headers are enabled, requests to different original
URLs will be stored as distinct :term:`objects <cache object>` in the cache. If
disabled, the remapped URL will be used and there may be collisions. This is
bad if the contents different, but quite useful if they are the same (as in
situations where the original URLs are just aliases for the same underlying
server resource).

This is also an issue if a remapping is changed because it is effectively a
time axis version of the previous case. If an original URL is remapped to a
different server address then the setting determines if existing cached objects
will be served for new requests (enabled) or not (disabled). Similarly, if the
original URL mapped to a particular URL is changed then cached objects from the
initial original URL will be served from the updated original URL if pristine
headers is disabled.

These collisions are not by themselves good or bad. An administrator needs to
decide which is appropriate for their situation and set the value correspondingly.

If a greater degree of control is desired, a plugin must be used to invoke the
API calls :c:func:`TSHttpTxnCacheLookupUrlSet()` or  :c:func:`TSCacheUrlSet()`
to provide a specific :term:`cache key`. The :c:func:`TSCacheUrlSet()` API can
be called as early as ``TS_HTTP_READ_REQUEST_HDR_HOOK`` but no later than
``TS_HTTP_POST_REMAP_HOOK``. It can be called only once per transaction;
calling it multiple times has no additional effect.

A plugin that changes the cache key must do so consistently for both cache hit
and cache miss requests because two different requests that map to the same
cache key will be considered equivalent by the cache. Use of the URL directly
provides this and so must any substitute. This is entirely the responsibility
of the plugin; there is no way for the |TS| core to detect such an occurrence.

If :c:func:`TSHttpTxnCacheLookupUrlGet()` is called after new cache url set by
:c:func:`TSHttpTxnCacheLookupUrlSet()` or :c:func:`TSCacheUrlSet()`, it should
use a URL location created by :c:func:`TSUrlCreate()` as its third input
parameter instead of getting ``url_loc`` from the client request.

It is a requirement that the string be syntactically a URL but otherwise it is
completely arbitrary and need not have any path. For instance, if the company
Network Geographics wanted to store certain content under its own
:term:`cache key`, using a document GUID as part of the key, it could use a
cache key like ::

   ngeo://W39WaGTPnvg

The scheme ``ngeo`` was picked specifically because it is not a valid URL
scheme, and so will never collide with any valid URL.

This can be useful if the URL encodes both important and unimportant data.
Instead of storing potentially identical content under different URLs (because
they differ on the unimportant parts) a url containing only the important parts
could be created and used.

For example, suppose the URL for Network Geographics content encoded both the
document GUID and a referral key. ::

   http://network-geographics-farm-1.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H

We don't want to serve the same content for every possible referrer. Instead,
we could use a plugin to convert this to the previous example and requests that
differed only in the referrer key would all reference the same cache entry.
Note that we would also map the following to the same cache key ::

   http://network-geographics-farm-56.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H

This can be handy for sharing content between servers when that content is
identical. Plugins can change the cache key, or not, depending on any data in
the request header. For instance, not changing the cache key if the request is
not in the ``doc`` directory. If distinguishing servers is important, that can
easily be pulled from the request URL and used in the synthetic cache key. The
implementor is free to extract all relevant elements for use in the cache key.

While there is no explicit requirement that the synthetic cache key be based on
the HTTP request header, in practice it is generally necessary due to the
consistency requirement. Because cache lookup happens before attempting to
connect to the :term:`origin server`, no data from the HTTP response header is
available, leaving only the request header. The most common case is the one
described above where the goal is to elide elements of the URL that do not
affect the content to minimize cache footprint and improve cache hit rates.