next up previous contents
Next: Sendmail X: Implementation Up: Sendmail X Previous: Sendmail X: Architecture   Contents

Subsections

Sendmail X: Functional Specification


Functional Specification of sendmail X

This chapter describes the external functionality as well as the internal interfaces of sendmail X as much as required, i.e., for an implementation by experienced software engineers with some knowledge about MTAs etc. In each section it will be made clear what is the external functionality and which API or other interface (in the form of a protocol) will be provided.

The internal functionality describes the interfaces between the modules of sendmail X. As such they must never be used by anything else but the programs of which sendmail X consists. These interfaces are subject to changes without notice. It is not expected that modules from different versions work with each other, even though it might be possible.

An external interface is one which is accessible by a user.


Asynchronous APIs

Many of the sendmail X modules require that function calls do not block. If a function can block then an asynchronous API is required. Unfortunately it seems hard to specify an API that works well with blocking and non-blocking calls. Hence the APIs will be specified such that most (or all) blocking functions are split in two: one to initiate a function (the initiate function) and another one (the result function) to get the result(s) of the function call. To allow for a synchronous implementation, the first (initiate) function should also provide result parameters and a status return. The status indicates whether the result parameters have valid values, or whether the result will be returned later on via the second (get result) function. This convention seems to make it possible to provide APIs that work for both (synchronous and asynchronous) kind of implementations.

The calling program should not be required to poll for a result, but instead it should be notified that a result is available. This can be done via some interprocess notification mechanism, e.g., System V message passing or via a message over a socket through which both programs communicate. Then those programs can use select(2), poll(2), or something similar to determine whether a message is available. Note: this requires that both programs agree on a common notification mechanism which can be a problem in case of (third-party) libraries. For example, OpenLDAP [LDA] does provide an asynchronous interface (e.g., ldap_search(3)), but it does not provide an ``official'' notification mechanism3.1, it requires polling (ldap_result(3)). It should be possible to have a function associated with a message (type) such that the main loop is just a function dispatcher (compare engine in libmilter).


Generic Description of Asynchronous Operation

In this section we describe how asynchronous ``function'' calls can be handled. For the following discussion we assume a caller C and a callee S. The caller C creates a request RQ that contains the arguments for a function F, and a token T which uniquely identifies the request RQ (we could write RQ(T)). Note: it is also possible to let S generate the token if the argument is a result parameter (pointer to a token which can be populated by S). Additionally, the argument list may contain a result parameter PR in which a result can be immediately stored if the call can be handled synchronously. C enqueues the fact that it made the function call by storing data describing the current state St(T) in a queue/database DB along with the identifier T. The request is sent to S (may be appended to a queue for sending via a single communication task instead of sending it directly) unless it can be handled immediately and returned via the result parameter PR (the return status must specify that the result has been returned synchronously). A communication task receives answers A from S. The answer contains the identifier T which is used to lookup the current data St(T) in the DB. Then a function (the result or callback function RC) is invoked with St and A as parameters. It decodes the answer A (which contains the function result) and acts accordingly, also taking the current state St into account and probably modifying it accordingly. Additionally, it may use the result parameter PR to store the result in the proper location. The result parameter may provide some buffer (or allocated) structure in which to store the result such that the callee is not responsible for the allocation. However, the details of this depend on the functions and their requirements, e.g., their output values (it may be just a scalar).

To handle immediate status returns, the function that sends (enqueues) the request has to be able to deal with that. Such an implementation is an enhancement for a later version (probably not for 9.0).


Some Remarks about Identifiers

There are several kinds of identifiers that are used within sendmail X for various purposes. Identifiers (handles, tags) in general must of course be unique during the time in which they are used to identify an object. Beyond this obvious purpose, some identifiers should fulfill other requirements.

  1. Some identifiers are supposed to be unique for a long time, e.g., for sessions and transactions. It seems useful to make these unique for a much longer time which allows them to be used as identifiers in logfiles over a long period. These identifiers should also be easily representable in human-readable form.
  2. Identifiers for asynchronous operations (see Section 3.1.1.1) are very short-lived, they need to be only valid while the transaction is not yet finished. In some cases it might be useful to assure that they are unique even after the end of a transaction to avoid problem with ``hanging'' transactions (compare TCP/IP?).

    If a handle is only used by the caller to identify an object in its memory, then it seems a good idea to use the memory address of that object as handle (as for example done by aio(3), see aio_return(3)). The advantages are:

    The disadvantages are:


Configuration


Configuration Structure

Section 2.2.5 gives an example how two SMTP servers are defined and reference common parts (aliases). However, there are some problems due to the modularized implementation of the MTS: many operations are performed by the address resolver, e.g., access map lookups and routing decisions including whether a recipient address is local. It is easy to imagine a configuration in which different servers listen on different IP addresses and need different configurations, e.g., a map of local users or even anti-spam configurations. Most MTAs only offer virtual hosting, i.e., one MTA is responsible for several domains, but those domain have basically the same configuration. sendmail 8 has some options per ``daemon'' (DaemonPortOptions) and several requests have been made to have more options ``per daemon''. This is easy in sendmail X as long as those options are ``local'' to the server, but it is complicated when other modules provide the functionality.

There are two problems involved here:

  1. options are currently specified in the module which implements the functionality, e.g.,
  2. some functionality provided by an option is used by different modules, e.g., mailertable (as explained above).

Problem 1 could be solved as explained in Section 2.2.5.

Problem 2 could be solved by separating the functionality, e.g., mailertable could be split into two tables: one to decide where a recipient is local, and another one for routing addresses. However, this may increase maintainance costs as some information is duplicated in different tables which may cause consistency problems.


Naming Conventions

It is important to use consistent naming; not just in the source code but especially in the configuration file.

In the following some proposals are given to achieve this. First about the structure of entries:

  1. Entries can consist of components which are delimited by underscores (or dots).
  2. Entries are named in a top-down manner: the most important component is first, e.g., queue_timeout. However, if an adjective is used, it may make sense to use the ``natural'' ordering, i.e., the adjective before the noun, e.g., max_processes. These two rules may be used within a single entry, e.g., smtps_max_connections.
  3. Whenever useful, entries should be structured, not written as components. This is similar to breaking a program into several functions instead of inlining everything. However, there is no clear rule about this. Some rules of thumb:
    1. If there are several entries (more than three(?)) with the same prefix, then creating a structured entry is useful.
    2. If there are several entries (more than two(?)) with the same prefix which are used in different components, then creating a structured entry is useful.
  4. Abbreviations are used sparringly and if they are used, then they are used everywhere, they should not be mixed with the unabbreviated words.

Next about the (components of) names used:

  1. entries: number of entries in some structure.
  2. size: size of some structure (in KB, MB, etc)
  3. space: required free space for some structure in some other medium, e.g., on disk.
  4. level: level at which some events should occur, usually refers to producing some kind of output, e.g., log_level.


Specification of Classes (Lists, Sets)

Question: how should classes (lists, sets) be specified? Should it be just a lists of entries in the configuration file? sendmail 8 allows to load classes from other sources, e.g., from an external file, a program, or a map lookup. In the latter case a key is looked up in a map and the RHS (split?) is added to the specified class. It might be useful to have also a map which simply specifies the valid members as keys, i.e., lookup a value and if it is found, it is a member of the class for which the map is given. More complicated would be that the RHS contains a list of class names which would be compared against the class that is checked.


Supervisor


External Interfaces

The supervisor doesn't have an external interface, except for starting/stopping it. Reconfiguring is probably done via stop/start, or maybe via sending it a HUP signal. The latter would be good enough to enable/disable parts of sendmail X, i.e., edit the configuration file and restart the supervisor. In the first version, there will be no special commands to enable/disable parts of sendmail X.

The supervisor can be configured in different ways for various situations. Example: on a gateway system local delivery is not required hence the LDA daemon is not configured (or started).


Operation

The MCP starts processes on demand, which can come from several sources:

  1. The configuration requires more than zero processes (see Section 3.3.4).
  2. There is a communication request coming in on the socket on which the process is supposed to listen, but there is no process available (and the maximum number of processes has not been reached yet).
  3. Another process requests that a process is started, a typical example would be the QMGR asking for more SMTPS or DA processes.

The MCP keeps track of the number of processes and acts accordingly, i.e., when the upper limit is reached no further processes will be started, when a lower limit is reached more processes are started. Child processes also inform the MCP whenever they finished a task and are available for a new one.

Case 2: If there is no process available then the MCP listens on the communication channel and starts a process when a communication comes in. In that case the connection will be passed to the new process (in the form of an open file descriptor). This is similar to inetd or postfix's master program. Question: how to deal with the case where processes are available? Should those processes just call accept() on the connection socket? That's the solution that the postfix master program uses; the child processes have access to the connection socket and all(?) of them listen to it (Wietse claims the thundering herd problem isn't really a problem since the number of processes is small).

Question: would it be more secure if only a program can request to start more copies of itself? However, that would probably increase the program complexity (slightly) and communication overhead. Decision: designated programs can request to start other programs.

Question: should the MCP use only one file descriptor to receive requests from all processes it started or should it have one fd per process or process type? The latter seems better since it allows for a clean separation, even though it may require (a few) more fds.


Shutdown

There are several types of shutdown:

Question: how to distinguish between immediate shutdown and ``normal'' shutdown? The former is probably triggered by a signal (SIGTERM), the latter by a command from the MCP (or some other control program).

The supervisor is responsible for shutting down the entire sendmail X system. To achieve this, the receiving processes are first told to stop accepting new connections. Current connections are either terminated (if immediate shutdown is required, e.g., system will be turned off fast) or are allowed to proceed (up to a certain amount of time). The queue manager will not schedule any delivery attempts any more and wait for outstanding connections to terminate. Delivery agents will be told to terminate (again: either immediately or orderly). The helper programs, e.g., address resolver, will be terminated as soon as the programs which require them have stopped.


Configuration

The configuration of the MCP might be similar to the master process in postfix. It contains a list of processes, their types, the uid/gid they should run as, the number of processes that should be available, how they are supposed to be started, etc. It should also list how the processes communicate with the MCP (fork()/wait(), pipe), how often/fast a process can be restarted before it is considered to fail permanently. The latter functionality should probably be similar to (x)inetd.

Here's the example master.cf3.2file from postfix:

Postfix master process configuration file. Each line describes how a mailer component program should be run. The fields that make up each line are described below. A "-" field value requests that a default value be used for that field.

Service: any name that is valid for the specified transport type (the next field). With INET transports, a service is specified as host:port. The host part (and colon) may be omitted. Either host or port may be given in symbolic form or in numeric form. Examples for the SMTP server: localhost:smtp receives mail via the loopback interface only; 10025 receives mail on port 10025.

Transport type: "inet" for Internet sockets, "unix" for UNIX-domain sockets, "fifo" for named pipes.

Private: whether or not access is restricted to the mail system. Default is private service. Internet (inet) sockets can't be private.

Unprivileged: whether the service runs with root privileges or as the owner of the Postfix system (the owner name is controlled by the mail_owner configuration variable in the main.cf file).

Chroot: whether or not the service runs chrooted to the mail queue directory (pathname is controlled by the queue_directory configuration variable in the main.cf file). Presently, all Postfix daemons can run chrooted, except for the pipe and local daemons. The files in the examples/chroot-setup subdirectory describe how to set up a Postfix chroot environment for your type of machine.

Wakeup time: automatically wake up the named service after the specified number of seconds. A ? at the end of the wakeup time field requests that wake up events be sent only to services that are actually being used. Specify 0 for no wakeup. Presently, only the pickup, queue manager and flush daemons need a wakeup timer.

Max procs: the maximum number of processes that may execute this service simultaneously. Default is to use a globally configurable limit (the default_process_limit configuration parameter in main.cf). Specify 0 for no process count limit.

Command + args: the command to be executed. The command name is relative to the Postfix program directory (pathname is controlled by the program_directory configuration variable). Adding one or more -v options turns on verbose logging for that service; adding a -D option enables symbolic debugging (see the debugger_command variable in the main.cf configuration file). See individual command man pages for specific command-line options, if any.

SPECIFY ONLY PROGRAMS THAT ARE WRITTEN TO RUN AS POSTFIX DAEMONS. ALL DAEMONS SPECIFIED HERE MUST SPEAK A POSTFIX-INTERNAL PROTOCOL.

DO NOT CHANGE THE ZERO PROCESS LIMIT FOR CLEANUP/BOUNCE/DEFER OR POSTFIX WILL BECOME STUCK UP UNDER HEAVY LOAD

DO NOT CHANGE THE ONE PROCESS LIMIT FOR PICKUP/QMGR OR POSTFIX WILL DELIVER MAIL MULTIPLE TIMES.

DO NOT SHARE THE POSTFIX QUEUE BETWEEN MULTIPLE POSTFIX INSTANCES.

# service type private unpriv chroot wakeup maxproc command + args
#   (yes) (yes) (yes) (never) (50)  
               
smtp inet n - n - - smtpd
#628 inet n - n - - qmqpd
pickup fifo n n n 60 1 pickup
cleanup unix - - n - 0 cleanup
qmgr fifo n - n 300 1 qmgr
#qmgr fifo n - n 300 1 nqmgr
rewrite unix - - n - - trivial-rewrite
bounce unix - - n - 0 bounce
defer unix - - n - 0 bounce
flush unix - - n 1000? 0 flush
smtp unix - - n - - smtp
showq unix n - n - - showq
error unix - - n - - error
local unix - n n - - local
virtual unix - n n - - virtual
lmtp unix - - n - - lmtp

Interfaces to non-Postfix software. Be sure to examine the manual pages of the non-Postfix software to find out what options it wants. The Cyrus deliver program has changed incompatibly.

cyrus	  unix	-	n	n	-	-	pipe
  flags=R user=cyrus argv=/cyrus/bin/deliver -e -m ${extension} ${user}
uucp	  unix	-	n	n	-	-	pipe
  flags=Fqhu user=uucp argv=uux -r -n -z -a$sender - $nexthop!rmail ($recipient)
ifmail    unix  -       n       n       -       -       pipe
  flags=F user=ftn argv=/usr/lib/ifmail/ifmail -r $nexthop ($recipient)
bsmtp     unix  -       n       n       -       -       pipe
  flags=Fq. user=foo argv=/usr/local/sbin/bsmtp -f $sender $nexthop $recipient

First take at necessary configuration options for MCP:

Notice: using a syntax as in master.cf or inetd.conf is not a good idea, since it violates the requirements we listed in Section 2.2.2. The syntax must be the same as for the other configuration files for consistency.


Internal Interfaces

The supervisor starts and controls several processes. As such, it has a control connection with them. In the simplest case these are just fork(), wait() system calls, in more elaborate case it may be a socket over which status and control commands are exchanged.

The supervisor may set up data (fd, shared memory, etc) that should be shared by different processes of the sendmail X system. Notice: since the MCP runs as root, it could setup sockets in protected directories to which normal users don't have access. Open file descriptors to those sockets (for communication between modules) could then be passed on to forked programs. This may help to increase the security of sendmail X due to the extra protection. However, it requires that the MCP sets up all the necessary communication files. Moreover, if a program closes the socket (as SMTPS may do for port 25) it can't reopen it anymore and hence must get the file descriptors from the MCP again (either by passing a fd or by terminating and being started). This may not be really useful, but it's just an idea to be noted.

The MCP will bind to port 25 as explained earlier (see Section 2.3) and hand the open socket over to the SMTPS (after changing the uid).


Queue Manager

Question: Will this be a (purely) event driven program?

Question: worker model or thread per functionality? It won't be a single process which uses event based programming since this doesn't scale on multi-processor machines. It seems a worker model is more appropriate: it is more general and we might be able to reuse it for other modules, see also Section 3.20.2.


External Interfaces

The queue manager doesn't have an external interface. However, it can be configured in different ways for various situations. Todo: and these are?

Such configuration options influence the behavior of the scheduler, the location of the queues, the size of memory caches, etc.


Shutdown

The queue manager will not schedule any delivery attempts any more. It will wait for outstanding connections to terminate unless an immediate shutdown is requested. The incoming queue will be flushed to the deferred queue. Delivery agents are informed by the MCP to stop. The QMGR is waiting for them to terminate and records the delivery status in the deferred EDB.


Internal Interfaces

The status of the queue manager must be accessible from helper programs, e.g., see Section 2.11.1 The QMGR does not provide this as user accessible interface to allow for changing the internal protocols without having to change programs outside sendmail X.

Note: due to the complexity of the QMGR the following sections are not structured as subsection of this one because the nesting becomes too deep otherwise.


Indices for Accessing the EDBs

The main index to access the deferred EDB will be the time of the next try. However, it is certainly useful to send entries to a host for which a DA has an open connection. Question: what do we use as index to find entries in an EDB to reuse an open connection? An EDB stores canonified addresses or resolved addresses (DA, host, address), it does not store MX records. Those are kept in a different cache (mapping), if they are cached at all. We cannot use the host signature for lookups as sendmail 8 does for piggybacking for the same reason: the host signature consists of the MX records. 8.12 uses an enhanced version for piggybacking, i.e., not the entire host signatures need to match, but only the first. Maybe it is sufficient for selection of entries from an EDB to use the host name (beside all the other criteria as listed in Section 2.4.4.1). The actual decision to reuse a connection is made later on (by the micro scheduler, see Section 2.4.4.3). That decision is based on the host name/IP address and maybe the connection properties, e.g., STARTTLS, AUTH data.

If a DA has an open connection, then that data is added to the outgoing open connection cache (an incoming connection from a host may be taken as hint that the host can also receive mail). The hint is given to the schedulers, which may change their selection strategy accordingly. The first level scheduler can reverse map the host name/IP address to a list of destination hosts that can appear in resolved addresses. Then a scan for those hosts can be made and the entries may be moved upward in the queue.

Question: do we only lookup queue entries for the best MX record (to piggyback other recipients on an open connection)? We could lookup queue entries for lower priority MX records if we know the higher ones are unavailable. It may be a question of available resources (computational as well as network I/O) whether it is useful to perform such fairly complicated searches (and decision processes) to reuse a connection. In some cases it might be simpler (and faster) to just open another connection. However, it might be ``expensive'' to setup a connection, esp. if STARTTLS or something similar is used. A really sophisticated scheduler may take this into account for the decision whether to use an existing connection or whether to open new ones. For example, if the number of open connections is large then it is most likely better to reuse an existing connection. Those numbers (total and per domain) will be configurable (and may also depend on the OS resources).


Interface to SMTP DAs

Question: which host information does the QMGR send to the DA? Only the host from the resolved address tuple (so the DA does MX lookups), the MX list, only one host out of the MX list, or only one IP address? A ``clean'' interface would be to send only the resolved address and let the DA do the MX lookup etc. However, for piggybacking the QMGR needs to know the open connections and it must be able to compare those connections with the entries in the EDB. Hence the QMGR must do the MX expansion (using the AR or DNS helper processes).

Very simple outline of selection of next hop(s) for SMTP delivery:

  1. Determine DA, destination host, address (done by AR). Let's assume the DA is a SMTP client (the first sendmail X prototype will only support SMTP and LMTP).
  2. Lookup MX records for destination host. If none: only destination host (priority 0) in list.
  3. Is any local host in list? If yes: remove all entries with same priority of higher.
  4. Is the list of MX records empty? If yes: failure (maybe configuration option).
  5. Randomize equal priority MX records; try to use an existing randomization3.3if it matches an existing set of MX records, otherwise we need functions on sets, at least for comparison. For example, if one lookup returns A and B and another one returns B and A, then a list comparison will fail, but comparing the two sets will return that they are equal. It would also be helpful to have a ``is subset of'' function (if the lists are sorted according to the same criteria, then sets a ``is prefix of'' function).
  6. Lookup address records for each host in the list. Question: who does this? Will the AR return the entire list or will the QMGR cache that data and access it directly (asking a DNS helper if entries are expired)?

The MX list and the address list are represented as linked lists, with two different kind of links: same priority, lower priority. This can be either coded as two pointers or as a ``type of link'' value. If we use two pointers then we have to decide whether we fill in both pointers (if a record of that type is available) or only one. For example, let A, B, and C be MX records for a host with value 10, 10, and 20 respectively. Does A have only a (same priority) pointer to B, or does it have pointers to B and C? Is there a case where we do not go sequentially through the list? Maintaining two pointers is more effort which may not give us any advantage at all.


QMGR - Delivery Agents API

The QMGR provides APIs for the different modules with which it communicates. The two most important interfaces are those to the SMTP server and the delivery agents, the former is discussed in Section 3.4.12.

Question: how does the QMGR control the DAs? That is, who starts a new DA if necessary? This should be the MCP, since it starts (almost) all sendmail X processes. However, what if a DA is already running and the QMGR just wants more? Does the MCP start more or does a DA fork()? Probably the former, which means the QMGR and the MCP must be able to communicate this type of data. Do the DAs terminate themselves after some (configurable) idle time? That should be a configuration option in the MCP control file, see Section 3.3.4.

Question: does the QMGR have each of the following functions per DA or are these generic functions which take the name/type of the DA as argument and are dispatched accordingly?

General side note about (communication) protocols: it seems simpler that the caller generates the appropriate handles instead of the callee. The caller has to pass some kind of token (handle/identifier) anyway to the callee to associate the result it is getting back with the correct data (supplied via the call). If this would be a synchronous call, then the callee could generate the handle, but since we must deal with asynchronous calls, we must either generate the handle ourselves such that the callee can use it to identify the returned data, or the calling mechanism itself must generate the data, which however makes this too complicated.

Notice: the handles (identifiers) from the DA (da-trans-handle, da-session-handle) are not related to the transaction/session identifiers of SMTPS. That is, we do not ``reuse'' those identifiers except for transaction identifiers for the purpose of logging. We only need those handles to identify the sessions/transactions in SMTPC and QMGR, i.e., to associate the data structures (and maybe threads) that describe the sessions/transactions. We can generate the identifiers in a DA (SMTPC) similarly as in SMTPS; the differences are:

See also Section 3.1.1.2.


Transferring Data between EDBs

There are several different EDBs in the QMGR: active queue (AQ or ACTEDB), incoming queue (INCEDB; two variants: IBDB: backup on disk and IQDB: in memory only), and main (deferred) queue (DEFEDB). Data must be transferred between those DBs in various situations, e.g., for scheduling data is taken from IQDB or DEFEDB and put into ACTEDB. Doing so involves copying of data and maybe allocating memory for referenced data structures, e.g., mail addresses, and then copying the data from one place (Src) into another (Dst). This problem can be solved in two ways:

  1. moving the pointer to the data from the old structure Src to the new one Dst, i.e., copy the pointer and set it to NULL in Src. This only works if the data isn't needed anymore in Src.
  2. Use data structures with reference counts, see Section 3.16.7.1. This is a clean solution but adds overhead to data structures that barely need the feature (and may require additional locking).

Another approach is to use the same data structures for all/most EDBs with an additional type field that defines which data elements are valid. This way copying is either not necessary or can be done almost one-to-one in most cases. The disadvantage of this approach is the potential waste of some memory and fewer chances for typechecks by the compiler (however, more generic functions might be possible). Moreover, the data requirements for incoming and outgoing envelopes are fairly different, so maybe those should be separate data structures.

Maybe only ACTEDB and DEFEDB data structures should be identical, or at least ACTEDB structs should be a superset of DEFEDB structs. Otherwise we need to update entries in DEFEDB by reading the data in (into a structure for DEFEDB), modifying the data (according to the data in ACTEDB), and writing the data out.


Detailled Data Flow for Cut-Through Delivery

Section 2.4.3.3 describes the data flow of envelopes between the various EDB that QMGR maintains, while Section 2.4.3.6 describes the changes required for cut-through delivery. This section specifies the functional behavior for the latter.

If the final data dot is received by the SMTP server, it sends a CTAID record to QMGR including the information that this transaction is scheduled for cut-through delivery. Such an information is given by QMGR in reply to RCPT records: if all of them are flagged by QMGR for cut-through delivery, the transaction is scheduled for it. After receiving CTAID QMGR decides whether the scheduler will go ahead with cut-through delivery. If it doesn't, it sends back a reply code for case 1b below and proceeds as for normal delivery mode. Otherwise all recipients are transferred to AQ immediately, the recipients and the transaction are marked properly and delivery attempts are made. Moreover, a timeout is set for the transaction after which case 1b is selected.

Question: should the data be in some IBDB list too?

  1. QMGR can return one of the following replies:
    1. accept without fsync(2): the mail has been successfully delivered to all recipients. All recipients and the transaction data are removed from IQDB and from AQ.
    2. accept with fsync(2): the mail has not been successfully delivered to all recipients. QMGR must store the information about the transaction safely in IBDB (or DEFEDB) before sending this reply.
    3. reject
  2. QMGR does not reply within the timeout: return a temporary error to the client.

For case 1b the SMTP server needs to send another message to QMGR telling it the result of fsync(2). If fsync(2) fails, the message must be rejected with a temporary error, however, QMGR may already have delivered the mail to some recipients, hence causing double deliveries.


Reading Entries from Deferred Queue

To minimize disk I/O an envelope database cache (EDBC) is used (see Section 3.4.6.2). As explained in Section 2.4.3.5.1 the cache may not always contain all references to DEFEDB due to memory restrictions. Such a restriction can be either artificially enforced (by specifying a maximum number of entries in EDBC) or indirectly if the program runs out of memory when trying to add a new entry to the cache (see also Section 3.4.17.1). In that case, the operation mode of reading entries from the deferred queue must be changed (from cached to disk). In the disk mode the entire DEFEDB is read at regular intervals to fill up EDBC with the youngest entries which in turn are read at the appropriate time from DEFEDB into AQ. Question: how can those ``read the queue'' operations be minimized? It is important that EDBC is properly maintained, it contains the ``youngest'' entries, i.e., all entries in DEFEDB that are not referenced from EDBC have a ``next time to try'' $ntt$ that is larger than the last entry in EDBC. Question: how can this be guaranteed? Proposal: when switching from cached to disk set a status flag to keep track of the mode, and store the maximum $ntt$ in $nttmax$. When inserting entries into EDBC, ignore everything that has a $ntt$ greater then $nttmax$. If entries are actually added ($ntt < nttmax$) and hence older entries are removed, set a new $nttmax$. Perform a DEFEDB scan when EDBC is empty or below some threshold, e.g., only filled up to ten per cent and $nttmax - now < 10$. If all entries from DEFEDB can be read, reset the mode to cached.


Reconstructing Data from IBDB

In case of an unclean shutdown there might be open transactions in IBDB. Hence on startup the QMGR must read the IBDB files and reconstruct data as necessary. The API for IBDB is specified in Section 3.13.4.3. It allows an application to read all records from the IBDB. To reconstruct the data, the function sequentially reads through IBDB and stores the data in an internal data structure (in the following called RecDB: Reconstruction DB)that allows access via transaction/recipient id. The entries are ordered, i.e., the first entry for a record has the state open, the next entry has the state done (with potentially more information, e.g., delivered, transferred to DEFEDB due to a temporary/permanent failure, or cancelled). The second entry for a record might be missing which indicates an open transaction that must be taken care of. For each done transaction the corresponding open entry is removed from RecDB. After all records have been read RecDB contains all open transactions. These must be added to DEFEDB or AQ. If they are added only to the latter, then we still need to keep the IBDB files around until the transactions are done in which case a record is written to IBDB. This approach causes problems if the number of open transactions exceeds the size of AQ in which case an overflow mechanism must set in, e.g., either delaying further reading of IBDB or writing the data to DEFEDB. In the first sendmail X version the data should be transferred to DEFEDB instead for simplicity. Even with this simpler approach there are still some open problems:

  1. The most important is the order of operations: should the reconstruction of the open transactions be done concurrently with the normal operation of QMGR?

  2. A smaller problem is that IBDB does not contain all data that is necessary for DEFEDB entries (see Sections 3.4.10.4 and 3.4.10.6).

About 1: If yes, a faster startup time is achieved since the QMGR does not need to wait for the reconstruction. This of course causes other problems, e.g., what to do if the reconstruction runs into problems3.4? There are further complications with the order of operations: the reconstruction must be performed before the new IBDB is opened unless either a different name is used or the sequence numbers start after the last previously used entry. In the former case some scheme must be used to come up with names that will not cause problems if the system goes down while the reconstruction is still ongoing. In the latter case the problem of a wrap-around must be dealt with, i.e., what happens if the sequence number reaches the maximum value? A simple approach would be to simply start over at 1 again, but then it must be avoided that those files are still in use (which seems to be fairly unlikely if for example a 32 bit unsigned integer is used because the number of files would be huge, definitely larger what is sane to store in a single directory3.5).

A potential solution to the problem of overlapping operation is to use different sets of files, e.g., two different directories: ibdbw for normal operation, ibdbr for recovery. In that case at startup the following algorithm is used: If both ibdbw and ibdbr exist then the recovery run from the last startup didn't finish properly. Hence the rest of QMGR is not started before completing recovery, i.e., asynchronous operation is only allowed if the previous recovery succeeded. This simple approach allows us to deal with recovery problems without introducing an unbound number of IBDB file sets (directories). If only ibdbw exists, then rename it to ibdbr and let the QMGR startup continue while recovery is running. After recovery finished, ibdbr is removed thus indicating successful recovery for subsequent startups.

This approach can be extended to deal with a cleanup process that ``compresses'' IBDB files by removing all closed transactions from them. This cleanup process uses a different directory ibdbc in which it writes open transactions similar to the recovery program. That is, it reads IBDB files from ibdbw, gets rid of closed transaction in the same way as the recovery function does and writes the open transactions into a new file. After this has been safely committed to persistent storage, the IBDB files that have been read can be removed. The recovery function needs to read in this case the files from ibdbw and ibdbc to reconstruct all open transactions. The cleanup process should minimize the amount of data to read at startup. Such a cleanup function may not be available in sendmail X.0, however, if the MTA runs for a long time it may require a lot of disk space if there is not cleanup task. Question: is cleaning up a recursive process? If so, how to accomplish that? In a simple approach two ``versions'' can be used between which the cleanup task toggles, i.e., read from version 1 and write to version 2, then switch: read from version 2 and write to version 1. Question: how easy is it to keep track of this?

An example of the missing data mentioned as problem 2 in the list of problems is the start time (of the transaction) which is not recorded. This can be either taken in first approximation from the creation time of the IBDB file, or the current time can be simply used (which might be off by a fair amount if the system is down for a longer time, which, however, should not happen).


Cleaning up IBDB

Question: is there a way to avoid writing data when cleaning up IBDB? The cleanup task could ``simply'' remove IBDB files that contain only references to closed transactions. We may not even have to read any IBDB files at all since the data can be stored in IQDB, i.e., a reference to the IBDB logfile (sequence number). This requires a reference counter for each IBDB logfile which contains the number of open transactions (TA/RCPTs). When both reference counters reach zero the logfile can be removed. Note: this violates the modularity: now IQDB is used to store data that is related to the implementation of IBDB, i.e., both are tied together. Currently IQDB is fairly independent of the implementation of IBDB, e.g., it does not know about sequence numbers. Now there must a way to return those numbers to IBDB callers and they must be stored in IQDB. Moreover, there are functions that do not allow for a simple way to do this, e.g., those which act on request lists. In this case it would be fairly complicated to associate the IBDB data with the IQDB data. However, at the expense of memory consumption, the data could be maintained by the IBDB module itself. In this case it behaves similar as the IBDB recovery program, i.e., it stores open transactions (only a subset of data is necessary: identifier and sequence number) in an internal (hash) table, and matches closed transactions against existing entries to remove them. Periodically it can scan the table to find out which sequence numbers are still in use and remove unused files.

A different (simpler?) approach to the problem is to use time-based cleanup. Open transactions that are stored in IBDB are referenced by IQBD (at least with the current implementation) or AQ (entries which are marked to come from IQDB). Periodically these entries can be scanned to find the oldest. All older IBDB logfile can be removed. Note: there should be a correctness proof for this before it is implemented.


Deferred Envelope Database: Recipient Addresses

An interesting question is in which format recipient addresses are stored in the EDB, i.e., whether only the original recipient address is stored or whether also the resolved format (DA, host, address, etc) is stored too. If we use the resolved (expanded) format then we need to remove/invalidate those in case of reconfigurations. These reconfigurations may change DAs or routing. A possible solution is to add a time stamp to the resolved form. If that time stamp is older than the last reconfiguration then the AR must be called again. However, the routing decision may have been based on maps which have changed inbetween, hence this isn't a complete solution. It may require a TTL based on the minimum of all entries in maps used for the routing decision. Alternatively we can keep the resolved address ``as-is'' and do not care about changes. For examples, some address resolution steps happen in sendmail 8 before the address is stored in the queue, some happen afterwards. Examples for the former are alias expansion, which certainly should not be done every time. So the only address resolution that happens during delivery are DNS lookups (MX records, A/AAAA records), and those can be cached since they provide TTLs. We might make the address resolution a clean two step process:

  1. address resolution which is done only once before the (resolved) recipient address is stored in the queue;
  2. address resolution which is done during each delivery attempt and hence changes take effect almost immediately for all (not just new) recipients.

It might be an interesting idea to provide a cache of address mappings. However, such a cache cannot be simply for domains since it might be possible to provide per-address routing. The cache may be for DNS (MX/A/AAAA) lookups nevertheless, i.e., a ``partially'' resolved address maps to a domain which in turn maps to a list of destination hosts. This is fairly much what sendmail 8 does:

  1. the address rewriting engine returns a triple mailer, host, and address.
  2. The host part is mapped to list of destination hosts during ``final'' delivery via MX lookups (unless suppressed via a flag or other measures).
Note that the host part can also be a list of (colon separated) hosts.

These two steps are clearly related to the two steps listed above.

Incomplete (as of now) summary: Advantages of storing resolved addresses:

Disadvantages of storing resolved addresses:

If an address ``expires'' earlier than the computed ``next time to try'' then it probably is not useful to store the resolved address in DEFEDB. However, if the scheduler decides to try the address before the ``next time to try'', e.g., because a host is available again, then the resolved address might still be useful.

See also Section 3.4.15 for further discussion of this problem.

We also need to mark addresses (similar to sendmail 8 internally) to denote their origin, i.e., original recipient, expanded from alias (alias or list), etc.


Scheduler Algorithms

As described in Section 2.4.7 the scheduler must control the load it generates.


Slow Start and Connection Limit

One of the algorithms the scheduler should implement is ``slow start'', i.e., when a connection to a new host is created, only a certain number of initial connections must be made (``initial concurrency''). When a session/transaction has been succesful, the number of allowed concurrent connections can be increased (by one for each successful connection) until an upper limit (``maximum concurrency'') is reached. This varying number of concurrency is usually called ``window'' (see TCP/IP). If a connection fails, the size of the window is decreased until it reaches 0 in which the destination host is declared ``dead'' (for some amount of time).

Question: which kind of failures should actually decrease the window size besides any kind of network I/O errors? For example, a server may respond with 452 Too many connections but smX will not try to interpret the text of the reply, only the reply code.


Data Structures

The queue manager uses several data structures to store the status of the system and the envelopes of mails for which it has to schedule delivery. These are listed in the next sections.


Connection Cache Access

Question: are the connection caches indexed on host name or IP address? Problem: there is no 1-1 relation between host names and IP addresses, but an N-M relation. This causes problems for finding recipients to send over an open connection. A recipient address is mapped to (a DA and) a destination host, which is mapped to a list of MX records (host names) which in turn are mapped to IP addresses. Since there can be multiple address records for a host name and multiple PTR records for an IP address, we have a problem. The general problem is what to use as index for the connection caches. A smaller problem results from the expiration (TTL) of DNS entries (all along the mapping: MX, A/AAAA, and PTR records). Question: do load balancers further complicate this or can we ignore them for our purposes? Possible solution: use host name as index, provide N-M mappings for host name to IP address and vice versa. These mappings are provided by DNS. A simpler (but more restrictive solution) is to use the IP address and the host name together as index (see Exim [Haz01]). Question: do we want our own (simpler) DNS interface? We need some (asynchronous) DNS ``helper'' module anyway, maybe that can add some caching and a useful API? We shouldn't replicate caches too often due to the programming and memory usage overhead. So how do we want to access the connection caches? If a host name maps to several IP addresses, must it be a single machine? Does it matter wrt SMTP? Is a host down or an IP address? It could be possible that some of the network interfaces are down, but the host still can receive e-mail via another one.

So the QMGR to DA interface should provide an option to tell the DA to use a certain IP address for a delivery attempt because the QMGR knows the DA has an open connection to it. Even though this is a slight violation of the abstraction principle, the QMGR scheduled this delivery because of the connection cache, so it seems better than letting the DA figuring out to use one of its open connection (by looking up addresses etc).


Connection Reuse Problem

Problem: a connection (session) has certain attributes, e.g., STARTTLS, AUTH restrictions/requirements. These session-specific options can depend on the host name or the IP address (in sendmail 8: TLS_Srv:, AuthInfo:). This makes it even more complicated to reuse a connection. If a host behaves differently based on under which IP address it has been contacted, or if different requirements are specified for host name/IP addresses then connection reuse is significantly more complicated. In the worst case this could result in bounces that happen only in certain situations. It is certainly possible to document this behavior on the sendmail side, but if the other (server) side has a connection information dependent behavior then we have a problem.

A connection is made to an IP address, the server only knows the client IP address (plus port, plus maybe ident information, the latter should not be used for anything important). SMTP doesn't include a ``Hi, I would like to speak with X'' greeting, but only a ``Hi, this is Y'' (which might be considered a design flaw), so the server can't change its behavior based on whom the client wants to speak to (which would be useful for virtual hosting), but only based on the connection information (IP addresses, ports). Hence when a connection is reused (same IP address) the server can't change its behavior. Problem solved? Not really, someone could impose wierd restrictions based on sender addresses. It seems to be necessary to make this configurable (some external map/function can make the decision). This probably can be achieved by imposing a transactions per session limit, see also Section 2.4.8.

Another solution might be to base connection reuse on host name and IP address3.6. This may restrict connection reuse more than necessary, but it should avoid the potential problems. Maybe that should be a compile time option? Make sure that the code does not depend completely on the type of the index.

There have been requests to perform SMTP AUTH based on the sender address. which of course invalidates connection reuse. It's one (almost valid) example for additional requirements for a connection. It's only almost valid, since SMTP AUTH between MTAs is not for user authentication, it is used to authenticate the directly communicating parties. However, SMTP AUTH allows some kind of proxying (authenticate as X, authorize to act as Y), which seems to be barely used.

Note: connection reuse requires that the delivery agent is the same, if two recipients resolve to different delivery agents -- even for the same IP address and hostname -- then the connection will not be reused. In some cases this seems rather useless3.7hence the maximum flexibility would be reached by establishing congruence classes of mailer with respect to connection reuse. If then not just the mailer definitions are used to create those classes but also some connection attributes (see begin of this section), then we may have found the right approach to connection reuse.


Data for DSN

See Section 2.4.6 for the different types of DSNs and the problems we may encounter. We need to store (at least?) three different counters as explained in Section 2.4.6.1. DSNs can be requested individually for each recipient. Hence the sum of these counters is less than or equal the number of original recipients. Question: could it ever increase due to alias expansion? We could store the number of requested DSNs for each type and then compare the number of generated DSNs against them. If any of the generated DSN counters reached the requested counter we can schedule the DSN for delivery and hence we can be sure (modulo the question above) that all DSNs of the same type can be merged into one. Question: do we really need those counters? Or do we just generate a DSN whenever necessary and schedule it for delivery (with some delay)? Then the DSN generation code could look whether there is already a DSN generated and add the new one to it, as far as this is possible since the scheduler has to coalesce those DSNs. Problem: there is extra text (the error description) that should be put into one mail. How to do this? See also Section 2.4.6 about the question how to generate the body of DSNs.


Incoming Queue (INCEDB)

As explained in Section 2.4.1, the incoming queue consists of two parts: a restricted size cache in memory and a backup on disk.

The incoming queue does not (need to) have the same amount of data as the SMTP servers. It only stores data that is relevant to the QMGR. There is even less information in the data that is written to disk when an entry has been received. The data in the RSC is not just needed for delivery, but also during mail reception for policy decisions. In contrast, the backup data is only there to reconstruct the incoming cache in case of a system crash, i.e., the mail has already been received, there will not be any policy decisions about it. Hence the backup stores only the data that is required for delivery, not the data that is necessary to make decision while the mail is being received. This of course means that a reconstructed RSC does not contain the same (amount of) information as the original RSC.

The size of the cache defines the maximum number of concurrently open sessions and transactions in the SMTP servers. Question: do we store transactions and recipients also in fixed size caches? We certainly have to limit the size used for that data, which then acts as an additional restriction on the number of connections, transactions, and recipients. It's probably better to handle at least the recipients dynamically instead of pre-allocating a fixed amount of memory. The amount of memory must be limited, but it should be able to shrink if it has been expanded during a high volume phase (instead of having some maximum reserved all the time). Since most of the information in the cache is not fixed size, we need to dynamically allocate memory anyway. We maybe need some special kind of memory allocation for this which works within a given allocated area (see also Section 3.16.6).

Each entry in the cache has one of the following two formats (maybe use two different RSC):

Session

session-id session identifier
client-host identification of connecting host
  IP address, host name, ident
features features offered: AUTH, TLS, EXPN, ...
workarounds work around bugs in client (?)
transaction-id current transaction
reject-msg message to use for rejections (needed?)
auth AUTH information
starttls TLS information
n-bad-cmds number of bad SMTP commands
n-transactions number of transactions
n-rcpts total number of recipients
n-bad-rcpts number of bad recipients

Transaction:

transaction-id transaction identifier
start-time date/time of transaction
mail address, arguments (decoded?)
n-rcpts number of recipients
rcpt-list addresses, arguments (decoded?)
cdb-id CDB identifier (obtained from cdb?)
msg-size message size
n-bad-cmds number of bad SMTP commands (necessary?)
n-rcpts number of valid recipients
n-bad-rcpts number of bad recipients
session-id (pointer back to) session
statistics:
end-time end of transaction

If recipients addresses are expanded while in the INCEDB, we need to store the number of original recipients too.

Backup on disk: those entries have a different format than the in-memory version. The entries must be clearly marked as to what they are: transaction sender or recipient.

Sender (transaction):

transaction-id transaction identifier
start-time start time of transaction
sender-spec address incl. ESMTP extensions
cdb-id CDB identifier (obtained from cdb?)
n-rcpts reference count for cdb-id

Notice: the sender information is written to disk after all recipients have been received, i.e., when DATA is received, because it contains a counter of the recipients (reference count).

ESMTP sender extensions (substructure of the structure above)

size size of mail content (SIZE=)
bodytype type of body
envid envelope id
ret DSN return information (FULL, HDRS)
auth AUTH parameter
by Deliverby specification

Per recipient (transaction):

transaction-id transaction identifier
rcpt-spec address incl. ESMTP extensions
  and maybe a unique id (per session/transaction?)

ESMTP Recipient extensions (substructure of the structure above):

notify DSN parameters (SUCCESS, FAILURE, WARNING)
orcpt original recipient

The data for sender and recipient should be put into a data structure such that all relevant data is kept together. That structure must contain the data that must be kept in (almost) all queues.


Active Queue (ACTEDB, AQ)

The active queue needs different types of entries, hence it might be implemented as two RSCs, i.e., one for senders and one for recipients3.8. It could also be only one RSC if the ``typed'' variant is used (see 4.3.5.1).

Notice: there are two types of transaction records:

  1. The (incoming, SMTPS) transaction.
  2. The (outgoing, DA) transaction.
Since sendmail X performs scheduling per recipient (see Section 2.1.5, item 5), not per (incoming) transaction, those two transactions do not need to match. The outgoing (DA) transaction context is not stored directly in AQ, it is part of the QMGR - DA interface, see 3.4.10.12. However, the recipient data in AQ has data that refers to an outgoing transaction as described below.

The AQ context itself contains some summary data and the data structures necessary to access transactions and recipients:

max-entries maximum number of entries
limit current limit on number of entries
entries current number of entries
t-da entries being delivered
nextrun if set: don't run scheduler before this time
tas access to transactions
rcpts access to recipients

There should probably be more specific counters: total number of recipients, number of recipients being delivered, number of recipients waiting for AR, number of recipients ready to be scheduled, and total number of transactions.

The incoming transaction context contains at least the following fields:

transaction-id SMTPS transaction identifier
sender-spec address (see INCEDB)
cdb-id CDB identifier
from-queue from which queue does this entry come? (deferred or incoming)
counters several counters to keep track of delivery status

For ACTEDB we only need to know how many recipients are referenced by this transaction in the DB itself. That is, when we put a new recipient into ACTEDB, then we need to have the sender (transaction) context in it. If it is not yet in the queue, then we need to get it (from INCEDB or ACTEDB) and initialize its counter to one. For each added recipient the counter is incremented by one, when the status of a recipient delivery attempt is returned from a DA, the counter is decremented by one and the recipient is taken care of in the appropriate way. See also 2.4.3.4. However, because AQ should contain all necessary data to update DEFEDB it must also store the overall counters, e.g., how many recipients are in the system in total (not just in AQ).

Note: the delivery status for a transaction is not stored in DEFEDB since each delivery attempt in theory may lead to a different transaction, i.e., a DA transaction is not stored in DEFEDB.

The recipient context in AQ contains at least the following elements:

SMTPS transaction-id SMTPS transaction identifier
DA transaction-id DA transaction identifier
rcpt-spec address (see INCEDB)
rcpt-internal delivery tuple (DA, host, address)
from-queue from which queue does this entry come?
status not yet scheduled, being delivered, (temp) failure, ...
SMTPS-TA recipient from same SMTPS transaction
DA-TA recipient in same DA transaction
DEST recipient for same destination

The last three entries are links to access related recipients. These are used to group recipients based on the usual criteria, i.e., same SMTPS transaction, same delivery transaction, same next hop. Maybe this data can also be stored in DEFEDB to pull in a set of recipients that belongs together instead of searching during each scheduling attempt for the right recipients that can be grouped into a single transaction. Questions: how to do this? Is it worth it?

The internal recipient format is the resolved address returned by the AR. Its format is explained in Section 3.6.3.1.

As explained in 3.4.4, several access methods are required for the various EDBs, those for AQ are:

  1. The unique identifier for entries in AQ is the SMTPS transaction identifier (plus recipient index for recipients).
  2. Another (non-unique) key is the DA transaction identifier, which is required to associate DA results with entries in AQ; as explained in Section 3.4.10.12 the DA status cache stores also DA and SMTPS transaction identifier.
  3. The DA and next hop are used as key (maybe combined into one) to access a list of recipient entries which are going to be delivered to the same destination (see Section 2.4.4.8: AQRD, 3). This data access method can also be used to easily control the number of allowed concurrent connections (see also 3.4.9.1): when that number is reached, no entry in the list is scheduled anymore, i.e., the entire list is skipped, which is simpler than checking for each recipient whether the limit has been reached.

The key described in item 3 refers to another data structure which summarizes the entries. This data structure is the ``head'' of the DEST list:

DA Delivery Agent
next-hop Host (destination/next hop) to which to connect for delivery
todo-entries Number of entries in todo-list
todo-list Link to recipients which still need to be sent
busy-entries Number of entries in busy-list
busy-list Link to recipients which are being sent

The number of waiting transactions (todo-entries) can be used to determine whether to keep a session open or close it.

Question: is it really useful to have a busy list? What's the purpose of that list, which algorithms in the scheduler need this access method? The number of entries in the busy list is somehow useful if it were the number of open transactions or sessions, however, this is the number of recipients which does not have a useful relation to transactions/sessions.

Note: when a recipient is added to AQ it may not be in these destination queues because its next hop has not yet been determined, i.e., the address resolver needs to be called first. Those entries must be accessible via other means, e.g., their unique (recipient) identifier (see item 1 above). It might also be possible (for consistency) to have another queue with a bogus destination (e.g., a reserved DA value or IP address) which contains the entries whose destination addresses have not yet been resolved. Section 3.4.4 explains some of the problems with chosing indices to access AQ (and other EDBs), there is an additional problem for AQRD: if the connection limit for an IP address is reached, the scheduler will skip recipients for that destination. However, the recipient may have other destinations with the same priority whose limit is not yet reached. Either the system relies on the randomization of those same priority destinations (real randomization in turn causes problems for session reuse), or some better access methods need to be used. It might be useful in certain cases to look through the recipient destinations nevertheless (which defeats the advantage of this organization to easily skip entries that cannot be scheduled due to connection limits).

There might be yet another data structure which provides a summary of the entries in DEFEDB of this kind. That data structure can be used to decide whether to pull in recipients from DEFEDB to deliver them over an open connection.

Figure 3.1: Example of Active Queue

\begin{picture}(120, 150)\epsfxsize 100mm
\leavevmode
\epsffile{aq2.eps}
\end{picture}


Cleaning up AQ

The QMGR/scheduler must also remove entries from AQ that are too long in the queue, either because AR didn't respond or because a delivery attempt failed and the DA didn't tell QMGR about it (see Section 2.4.4.5). Question: what is an efficient way to do this? Should those entries also be in some list, organized in the order of timeout? Then the scheduler (or some other module) just needs to check the head of the list (and only needs to wake up if that timeout is actually reached). When an item is sent to AR or a DA then it must be sorted into that list.


Deferred Queue (DEFEDB)

The entries must be clearly marked as to what they are: transaction sender or recipient.

Sender:

transaction-id transaction identifier
start-time date/time mail was received
sender-spec address (see INCEDB)
cdb-id CDB identifier
rcpts-left reference count for cdb-id
rcpts-tried counter for ``tried'' recipients

rcpts-left refers to the number of recipients which somehow still require a delivery, whether to the recipient address or a DSN back to the sender. rcpts-tried is used to determine when to send a DSN (if requested). It might be useful to have counters for the three different delivery results: ok, temporary/permanent failure:

rcpts-total total number of recipients
rcpts-left number of recipients left
rcpts-temp-fail number of recipients which temporary failed
rcpts-perm-fail number of recipients which permanently failed
rcpts-succ recipients which have been delivered

In case of a DELAY DSN request we may need yet another counter. See also Sections 2.4.6 and 3.4.10.3.

rcpts-total is at least useful for statistics (logging); one of rcpts-succ and rcpts-total may be omitted. The invariances are:

$rcpts\_total = rcpts\_temp\_fail + rcpts\_perm\_fail + rcpts\_succ$

rcpts-total also counts the bounces that have been generated. It is never decremented.

$rcpts\_left = rcpts\_temp\_fail + rcpts\_perm\_fail$

Notice: these counters must be only changed if the delivery status of a recipient changes. For example, if a recipient was previously undelivered and now a delivery caused a temporary failure, then rcpts-temp-fail is increased. However, if a recipient previously caused a temporary failure and now a delivery failed again temporarily, then rcpts-temp-fail is unchanged. This obviously requires to keep the last delivery status for each recipients (see below).

Recipient:

transaction-id transaction identifier
rcpt-spec address (see INCEDB)
rcpt-internal delivery tuple (DA, host, address, timestamp)
d-stat delivery status (why is rcpt in deferred queue)
schedule data relevant for delivery scheduling, e.g.,
  last-try: last time delivery has been attempted
  next-try: time for next delivery attempt

d-stat must contain sufficient data for a DSN, i.e.:

act-rcpt actual recipient (?)
orig-rcpt original recipient (stored in rcpt-spec, see above)
final-rcpt final recipient (from RCPT command)
DSN-status extended delivery status code
remote-mta remote MTA
diagnostic-code actual SMTP code from other side (complete reply)
last-attempt data/time of last attempt
will-retry for temporary failure: estimated final delivery time

The internal recipient format is the resolved address returned by the AR. Its format is explained in Section 3.6.3.1. Question: do we really want to store rcpt-internal in DEFEDB? See Section 3.4.8 for a discussion. The timestamp for the delivery tuple is necessary as explained in the same section.

Question: which kind of delivery timestamp is better: last time a delivery has been attempted or time for next delivery attempt? We probably need both (last-try for DSN, next-try for scheduling).


Deferred Queue Cache (EDBC)

EDBC implements a sorted list based on the ``next time to try''. with references to recipient identifiers (which are the main indices to access DEFEDB). ``Next time to try'' is not a unique identifier hence this structure must be aware of that, e.g., when adding or removing entries.


Connection Cache (incoming)

This cache is accessed via IP addresses and maybe hostnames. It is used to check whether an incoming connection (to SMTPS) is allowed (see also Section 2.4.7).

open-conn number of currently open connections
open-conn-X number of open connections over last X seconds
  (probably for X in 60, 120, 180, ...)
trans-X number of performed transactions over last X seconds
rcpts-X number of recipients over last X seconds
fail-X number of SMTP failures over last X seconds
last-conn time of last connection
last-state status of last connection, see 3.4.10.9

Notice: statistics must be kept in even intervals, otherwise there is no way to cycle them as time goes on.

Question: do we use a RSC for this? If so, how do we handle the case when the RSC is full? Just throwing out the least recently used connection information doesn't seem appropriate.


Connection Status (incoming)

Question: what kind of status do we want to store here? Whether the connection was succesfull, or aborted by the sender? Or whether it acted strange, e.g., caused SMTP violations? Maybe performance related data? For example, number of recipients, number of transactions, throughput, and latency.


Connection Cache (outgoing) Connections

As explained in Section 2.4.4.8, there are two different connection caches for outgoing connections: one for currently open connections (OCC, 1) and one for previously made (tried) connections (DCC, 2). These two connection caches are described in the following two subsections.

Note: it might be possible to merge these two caches into one if the proper implementation, such as a restricted size cache (see Section 3.13.10), is chosen.


Connection Cache for currently open (outgoing) Connections

This is OCC (see Section 2.4.4.8: 1) for the currently open (outgoing) connections.

OCC helps the scheduler to perform its operation, it contains summary information (and hence could be gathered from the AQ/DA data by going through the appropriate entries3.9).

open-conn number of currently open connections
open-conn-X number of open connections over last X seconds
  (probably for X in 60, 120, 180, ...)
trans-X number of performed transactions over last X seconds
rcpts-X number of recipients over last X seconds
fail-X number of failures over last X seconds
performance data related to performance, see 3.13.8
first-conn time of first connection
last-conn time of last connection
last-update time of last status update
last-state status of last connection, see 3.4.10.11
initial-conc initial concurrency value
max-conc maximum concurrency limit
cur-conc current concurrency limit (``window'')

This connection cache stores only information about current connections. The connection cache also stores the time of the last connection. Question: do we need to store a list of those times, e.g., the last three? We can use these times to decide when a new connection attempt should be made (if the last connection failed). For this we need at least the last connection time and the time interval to the previous attempt. If we use exponential backup we need only those two values. For more sophisticated methods (which?) we need probably more time stamps.

It is not yet clear whether the open connection cache actually needs the values listed above, especially those for ``over last X seconds''. Unless the scheduler actually needs them, they can be omitted (they might be useful for logging or statistics). Instead, the counters may be for the entire ``lifetime'' of the connection cache entry; those counters can be used to implement limits for the total number of sessions, transactions, recipients, etc.

The last three values are used to implement the slow-start algorithm, see 3.4.9.1. The time of the last status update can be used to expire the ``host is dead'' marking, i.e., cur-conc equal zero.

Question: which times do we actually need: the time of the last connection or last status update? If a connection is successful, but the session takes very long, those times may differ substantially. The time of the last status update is used to expire a ``host is dead'' marking (as explained in the previous paragraph). The time of the last connection might be used to expire an entry if something went wrong with the delivery agent, e.g., it ``died'' in some way such that QMGR does not notice it and hence cannot clean up properly. In this case the expiration timeout should depend on the message size, obviously it takes much longer to deliver a message of several MB than of a few KB. The timeout could be set as the some of a basic timeout (60s?) and the message size divided by the transfer rate, where transfer rate is an option set in a configuration file (and maybe checked against the ``performance'' values for this entry).


Connection Cache for previously made Connections

This is DCC (see Section 2.4.4.8: 2) for previously open connections. It contains similar data as OCC but only for connections which are not open anymore.

open-conn-X number of open connections over last X seconds
  (probably for X in 60, 120, 180, ...)
trans-X number of performed transactions over last X seconds
rcpts-X number of recipients over last X seconds
fail-X number of failures over last X seconds
performance data related to performance, see 3.13.8
last-conn time of last connection
last-update time of last status update
last-state status of last connection, see 3.4.10.11

The connection cache also stores the time of the last connection. Question: do we need to store a list of those times, e.g., the last three? We can use these times to decide when a new connection attempt should be made (if the last connection failed). For this we need at least the last connection time and the time interval to the previous attempt. If we use exponential backup we need only those two values. For more sophisticated methods (which?) we need probably more time stamps.

This connection cache can be optimized to ignore some recent connections at the expense of being limited in size. For example, see the BSD inetd(8) ([ine]) implementation which uses a fixed-size hash array to store recent connections. If there are too many connections, some entries are simply overwritten (least recent entry will be replaced).