Claus Assmann
Copyright Sendmail, Inc.
All rights reserved
This chapter describes the basic requirements which sendmail X.0 must fulfill as well as the basic functionality that it must implement. It will state several obvious items just for the sake of completeness. This chapter will serve as a basis from which the MTA team will work to design sendmail X.0. The content must be agreed upon by all relevant, involved parties. It will also lay the groundwork for the future development of the sendmail X series of MTAs, of which sendmail X.0 will be the first implementation. sendmail X.0 is not meant to be feature complete but a proof of concept. It must be designed and written with extensibility in mind. Subsequent versions of sendmail X will add more features.
Some requirements for an MTA are so basic that it should not be necessary to mention them. However, we will do it nevertheless. These requirements are:
These requirements are conditio sine qua non; they will be taken for granted throughout this and related documents and during the design and implementation of sendmail X. In addition to these, sendmail X must be efficient. However, neither of the two main requirements will be compromised to increase efficiency. The intended goal for the efficiency of sendmail X is to be about one order of magnitude faster than sendmail 8.
There is another obvious requirement which is given here for completeness, i.e., conformance to all relevant RFCs, esp. RFC 2821 [Kle01], and implementation of (nearly) all RFCs that sendmail 8 handles:
RFC 974 | Mail Routing and the Domain System [Par86] |
RFC 1123 | Internet Host Requirements [Bra89] |
RFC 1652 | SMTP 8BITMIME Extension [KFR+94] |
RFC 1869 | SMTP Service Extensions [KFR+95] |
RFC 1870 | SMTP SIZE Extension [KFM95] |
RFC 1891 | SMTP Delivery Status Notifications [Moo96b] |
RFC 1892 | The Multipart/Report Content Type for the Reporting of Mail System Administrative Messages [Vau96b] |
RFC 1893 | Enhanced Mail System Status Codes [Vau96a] |
RFC 1894 | Delivery Status Notifications [Moo96a] |
RFC 1985 | SMTP Service Extension for Remote Message Queue Starting [Win96] |
RFC 2033 | Local Mail Transfer Protocol [Mye96] |
RFC 2034 | SMTP Service Extension for Returning Enhanced Error Codes [Fre96] |
RFC 2045 | Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies [FB96] |
RFC 2476 | Message Submission [GK98] |
RFC 2487 | SMTP Service Extension for Secure SMTP over TLS [Hof99] |
RFC 2554 | SMTP Service Extension for Authentication [Mye99] |
RFC 2822 | Internet Message Format [Res01] |
RFC 2852 | Deliver By SMTP Service Extension [New00] |
RFC 2920 | SMTP Service Extension for Command Pipelining [Fre00] |
sendmail X must be compliant with all the usual buzzwords (marketing or software development): robust, flexible, scalable, extensible, maintainable, and portable to modern OSs with support for POSIX (or similar) threads and certain other basic requirements. What this really means in particular will be explained in the next sections. sendmail X must run well on Unix and Windows (natively); support for Windows will not be added later on as an afterthought. It must provide hooks for monitoring, managing, etc. It also must be simple to replace certain modules (library functions) with specialized functions, e.g., different lookup schemes for mailboxes as in sendmail 8.12.
sendmail X should be backward compatible with sendmail 8 at the user level as far as the basic requirements (esp. security) allow this. This includes support for aliases and .forward files, however the actual implementation may differ and result in slightly different behavior.
sendmail X should minimize the burden it puts on other systems, i.e., it must not overwhelm other MTAs, and it should make efficient use of network resources. It must properly deal with resource shortages on the local system and degrade gracefully in such events. However, the first implementation(s) of sendmail X may require large amounts of certain resources, esp. memory.
sendmail X is meant to be useful to a majority of users, and thus must deal with some of the quirks of other MTAs and MUAs. However, it will most likely not go to the extreme of sendmail 8. Even though it should be ``liberal in what it accepts'', there are limits. Those limits are given by the time required to implement workarounds for broken systems, the possible problems for security (which will not be compromised), performance (which will only be compromised if really necessary), the amount of architectural changes, etc.
In this section we explain what the various buzzwords mean with respect to sendmail X.
sendmail X must be robust, which means that even in the event of failures it should behave reasonably. It must never lose e-mails except in the case of hardware failures or software failures beyond sendmail's control that occur after sendmail has accepted responsibility for an e-mail (there's nothing sendmail can do if someone/something destroys the files/disks containing the e-mails). In the event of resource shortages (memory, disk, etc.) it must degrade gracefully. It should be implemented in such a way that it can also deal with simple OS errors, e.g., if a (sub)process fails due to an error (e.g., looping because a system call fails in unexpected ways), it should deal with that in a useful way, i.e., log an error message and exit without causing havoc to the rest of the system. It is not expected to deal with all possible errors (e.g., who supervises the supervisor?), but the design should be done properly to deal with most problems, including unexpected ones.
sendmail X also must be able to deal with Denial of Service attacks as much as technically possible.
It must be possible to tune or replace components of the sendmail X system for different purposes to achieve good performance in almost all cases or to add new features.
sendmail X should be able to take advantage of changes in the underlying OS and hardware that modify the relative performance of the available components. For example, if more processing power (faster or more CPUs) is available then the overall system should become faster. However, this can't be taken to an extreme since other parts might become bottlenecks despite careful design. For example, it is unreasonable to expect twice as many local mail deliveries if the number of CPUs is doubled since the limiting factor is most likely disk I/O in this scenario.
Clustering/High Availability support might be required too. The impact, as far as an application designer is concerned, is:
It must be possible to extend sendmail X in various places. These include: interfaces to other mailbox identification schemes, interfaces to other map types, and adding other delivery agents. sendmail X will have clear APIs for various extensions, just as the milter API is one for sendmail 8.
sendmail X must be designed and implemented to be easily maintainable. This means not just adherence to the Sendmail coding standard, but also a clear design without ``surprises'', e.g., far-reaching side effects.
It is important that (almost) each functionality in sendmail X is subject to automatic testing. Having tests greatly simplifies maintainability because changes to the implementation can be automatically tested, i.e., if some of the tests break then the changes are most likely incorrect (unless the tests are broken themselves). Tests should cover (almost) all of the code (paths).
sendmail X must run on Unix OSs which support (POSIX) threads and on Windows. The exact list of requirements for the OS is not yet finalized. sendmail X will be written in C, most likely C89-compliant.
There are several things that sendmail X will not do or offer. First of all, it will not use the same (or even compatible) configuration files as sendmail 8 does. There should be a tool to migrate sendmail 8 mc files to sendmail X configuration files. However, it is unlikely that cf files can be easily migrated to sendmail X configuration files.
sendmail X.0 will not have support for:
Those features may be added in future versions.
This section deals with two aspects: the configuration file for sendmail X and the configuration options for sendmail X.0.
sendmail 8 uses a configuration file that is supposedly easy to parse, but hard to read, understand, or modify by a system administrator. Even though this was simplified by using a macro preprocessor (m4) and a significantly simpler input file (mc), it is still considered complex. Moreover, there is no syntax check for the mc file, which makes it fairly user-unfriendly.
Hence it is necessary that sendmail X uses a simpler and more flexible configuration file. We need a syntax that allows us to keep options that belong together in a structured entity. This will be achieved with a C-like syntax using (initial) keywords and braces. Moreover, the syntax will be free form, i.e., it will not depend on indentation, use of tabs, etc. It is also a bad idea to require special software, e.g., editors, to maintain the configuration files. For example, requiring a syntax-oriented editor would add an additional migration hurdle which must be avoided; otherwise users may switch to a different MTA that does not have such additional requirements.
Since sendmail X will consist of several modules, it is likely that there are several configuration files, at least for modules that are really separate. Moreover, it might be useful to delegate certain parts of a configuration to different administrators or even users. For example, it should be possible to delegate the configuration of virtual domains to different administrators. However, this will not be implemented in sendmail X.0, but the design will take this into consideration.
sendmail X.0 will initially offer a subset of the (huge set of) configuration options of sendmail 8. It is simply not possible to implement all of the accumulated features of sendmail 8 in the first version of sendmail X within a reasonable timeframe and the limited resources we have.
The documentation for sendmail X must fulfill different requirements with respect to its structure and its formats. The latter is explained in the next section.
The sendmail documentation must at least consist of the following sections:
It is important that the documentation provides different levels of detail. The current documentation is not particularly well structured or organized. It more or less requires reading everything, which, however, is not really necessary, nor does it help someone who has a lot of other things to do besides installing an MTA. Even though the installation of an MTA usually requires a bit more than

./configure && make && make test && make install && start

it is not really necessary to read about (or even understand) all parts of the MTA.
The documentation that describes sendmail X must be written in a format that allows easy generation of different versions including at least:
This chapter describes the architecture of sendmail X. It presents some possible design choices for various parts of sendmail X and explains why a particular choice has been made. Notice: several decisions haven't been made yet; there are currently a lot of open questions.
sendmail X consists of several communicating modules. A strict separation of functionality allows for a flexible, maintainable, and scalable program. It also enhances security by running only those parts with special privileges (e.g., root, which will be used as a synonym for the required privileges in this text) that really require it.
Some terms relevant for e-mail are explained in the glossary (Section 2.17).
sendmail X consists of the following modules:
sendmail X uses persistent databases for content (CDB) and for envelope (routing) information (EDB). The content DB is written by the SMTP servers only, and read by the delivery agents. The envelope DBs are under complete control of the queue manager.
There are other components for sendmail X, e.g., a recovery program that can reconstruct an EDB after a crash if necessary, a program to show the content of the mail queue (EDB), and at least hooks for status monitoring.
Since sendmail X is designed to have a lifetime of about one decade, it must not be tuned to specific bottlenecks in common computers as they are known now. For example, even though it seems common knowledge that disk I/O is the predominant bottleneck in MTAs, this isn't true in all cases. There is hardware support (e.g., disk system with non-volatile RAM) that eliminates this bottleneck. Moreover, some system tests show that sendmail 8 is CPU bound on some platforms. Therefore the sendmail X design must be well-balanced and it must be easy to tune (or replace) subsystems that become bottlenecks in certain (hardware) configurations or situations.
This section contains some general remarks about configuring sendmail X. Todo: fill this in, add a new section later on that defines the configuration.
sendmail X must be easy enough to configure such that it does not require reading lots of files or even large sections of a single file (see also Section 1.4). A ``default'' configuration may not require any configuration at all, i.e., the defaults should be stored in the binary and most of the required values should automagically be determined at startup. A small configuration file might be necessary to override those defaults in case the system cannot determine the right values. Moreover, it is usually required to tell the MTS for which domain name to accept mail - by default a computer should have a FQDN, but it is not advisable to decide to accept mail for the domain name itself.
The configuration file must be suitable for all kinds of administrators: at one end of the spectrum are those who just want to have an MTA installed and running with minimum effort, at the other end are those who want to tweak every detail of the system and maybe even enhance it by other software.
Only a few configuration options apply globally, many have exceptions or suboptions that apply in specific situations. For example, sendmail 8 has timeouts for most SMTP commands and there are separate timeouts to return queued messages for different precedence values. Moreover, some features can be determined by rulesets, some options apply on a per connection basis, etc. In many cases it is useful to group configuration options together instead of having those options very fine grained. For example, there are different SMTP mailers in sendmail 8 that create configuration groups (with some preselected set of options) which can be selected via mailertable (or rules). Instead of having mailer options per destination host (or other criteria), different options are grouped together and then an option set is selected. This can reduce the amount of configuration data that needs to be stored (e.g., it's a two-level mapping: address -> mailer -> mailer flags, instead of just one level in which each address can map to different function values: address -> mailer and address -> mailer flags).
However, it might be complicated to actually structure options in a tree like manner. For example, a rewrite configuration option may be
Question: can we organize options into a tree structure? If not, how should we specify options and how should we implement them? Take the above example: there might be rewrite options per mailer and per address type (seems to make sense). However, in which order should those rewrite options be processed? Does that require yet another option?
A simple tree structure is not sufficient. For example, some option groups may share common suboptions, e.g., rewrite rules. Instead of having to specify them separately in each group, it makes more sense to refer to them. Here is an example from sendmail 8: there are several different SMTP mailers, but most of them share the same rewrite rulesets. In a strict tree structure each mailer would have a copy of the rewrite rulesets, which is neither efficient nor simple to maintain. Hence there must be something like ``subroutines'' which can be referenced. In a sendmail 8 configuration file this means there is a list of rulesets which can be referenced from various places, e.g., the binary (builtin ruleset numbers) and the mailers.
This means internally a configuration might be represented as a graph with references to various subconfigurations. However, this structure can be unfolded such that it actually looks like a tree. Hence, the configuration can conceptually be viewed as a tree.
There should be a way to query the system about the current configuration and to change (some) options on the fly. A possible interface could be similar to sysctl(8) in BSD. Here options are structured in a tree form with names consisting of categories and subcategories separated by dots, i.e., ``Management Information Base'' (MIB) style. Such names could be daemon.MTA.port, mailer.local.path, etc. If we can structure options into a tree as mentioned in the previous section then we can use this naming scheme. Whether it will be possible to change all parts on the fly is questionable, esp. since some changes must be done as a transaction (all at once or none at all).
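To illustrate the idea, here is a minimal sketch (not an existing sm9 interface) of how such a dotted, MIB-style name could be resolved against a tree of configuration nodes; the node layout and the helper conf_get are assumptions made for this example.

/* Minimal sketch of resolving a sysctl(8)-style dotted option name
 * (e.g. "daemon.MTA.port") against a tree of configuration nodes.
 * All names and the node layout are illustrative, not the sm9 API. */
#include <stdio.h>
#include <string.h>

struct conf_node {
    const char *name;            /* component name, e.g. "daemon" */
    const char *value;           /* leaf value, NULL for inner nodes */
    struct conf_node *children;  /* first child */
    struct conf_node *next;      /* next sibling */
};

/* Walk the tree one dotted component at a time. */
static const char *
conf_get(struct conf_node *root, const char *path)
{
    char buf[256];
    struct conf_node *node = root;
    char *comp, *save;

    if (strlen(path) >= sizeof(buf))
        return NULL;
    strcpy(buf, path);
    for (comp = strtok_r(buf, ".", &save); comp != NULL;
         comp = strtok_r(NULL, ".", &save)) {
        struct conf_node *child;
        for (child = node->children; child != NULL; child = child->next)
            if (strcmp(child->name, comp) == 0)
                break;
        if (child == NULL)
            return NULL;         /* unknown component */
        node = child;
    }
    return node->value;          /* NULL if this is not a leaf */
}

int
main(void)
{
    /* daemon.MTA.port = "25" (hand-built tree for the example) */
    struct conf_node port = { "port", "25", NULL, NULL };
    struct conf_node mta  = { "MTA", NULL, &port, NULL };
    struct conf_node dmn  = { "daemon", NULL, &mta, NULL };
    struct conf_node root = { "", NULL, &dmn, NULL };

    printf("daemon.MTA.port = %s\n", conf_get(&root, "daemon.MTA.port"));
    return 0;
}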
Each section of this chapter that describes a module of sendmail X has a subsection about security considerations for that particular part. More discussion can be found in Section 2.14.
This section gives an overview of the control and data flow for a typical situation, i.e., e-mail received via SMTP. This should give an idea how the various components interact. More details can be found in the appropriate sections.
Question: can we treat a configuration file like a programming language with
Definitions do not depend on anything else, they define the basic structure (and behavior?) of the system. There are fixed attributes which cannot be changed at runtime, e.g., port number, IP address to listen on. Attributes which can change at runtime, e.g., the hostname to use for a session, fall in category 3, i.e., they are functions which can determine a value at runtime.
The distinction between definitions and functions is largely determined by the implementation and the underlying operating system as well as the application protocol to implement and the underlying transport protocol. When defining an SMTP daemon (or a DA) some of its attributes must be fixed (defined/specified) in the configuration; these are called immutable. For example, it is not possible to dynamically change the port of the SMTP daemon because that's the way the OS call bind(2) works. However, the IP address of the daemon does not need to be fixed (within the capabilities of the OS and the available hardware), i.e., it could listen on exactly one IP address or on any. Such configuration options are called variable or mutable.
It seems to be useful to make a list of configuration options and their ``configurability'', i.e., whether they are fixed, or at which places they can change, i.e., on which other values they can depend.
As required, the semantics of the configuration file does not depend on its layout, i.e., spaces are only important for delimiting syntactic entities, tabs (whitespace) do not have a special meaning.
The syntax of the sendmail X configuration files is as follows:
conf ::= entries
entries ::= entry *
entry ::= option | section
section ::= keyword [ name ] "{" entries "}" [ ";" ]
option ::= option-name "=" value
value ::= name ";" | values [ ";" ]
values ::= "{" name-list "}"
This can be shortened to (remove the rule for entries):
conf ::= entry *
entry ::= option | section
section ::= keyword [ name ] "{" conf "}" [ ";" ]
option ::= option-name "=" value
value ::= name ";" | values [ ";" ]
values ::= "{" name-list "}"
Generic definition of ``list'':
X-list ::= X | X "," X-list [ "," ]
That is, a configuration file consists of several entries, each of which is either a section or an option. A section starts with a keyword, e.g., mailer, daemon, rewriterules, and has an optional name, e.g., daemon MTA. Each section contains a sequence of entries which is embedded in curly braces. Each syntactic entity that isn't embedded in braces is terminated with a semicolon. An entry in a section can be an option or a (sub)configuration. To make writing configuration files simpler, lists can have a terminating comma and a semicolon can follow after values. That makes these symbols terminators, not separators.
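As an illustration of the grammar, the following C structures are one possible in-memory representation of a parsed configuration; the type and field names are made up for this sketch and do not describe an existing sm9 data structure.

/* Sketch of C structures mirroring the grammar above: a configuration
 * is a list of entries, each entry is either an option or a section.
 * Type and field names are illustrative only. */
enum entry_type { ENTRY_OPTION, ENTRY_SECTION };

struct value {                    /* value ::= name ";" | values [";"] */
    char *name;                   /* single name, or NULL ... */
    struct name_list *names;      /* ... when a brace-enclosed list is used */
};

struct name_list {                /* name-list per the generic list rule */
    char *name;
    struct name_list *next;
};

struct option {                   /* option ::= option-name "=" value */
    char *option_name;
    struct value value;
};

struct section {                  /* section ::= keyword [name] "{" entries "}" */
    char *keyword;                /* e.g. "mailer", "daemon" */
    char *name;                   /* optional, e.g. "MTA"; NULL if absent */
    struct entry *entries;        /* list of contained entries */
};

struct entry {                    /* entry ::= option | section */
    enum entry_type type;
    union {
        struct option option;
        struct section section;
    } u;
    struct entry *next;           /* next entry in the enclosing section */
};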
Examples:
mailer smtp {
  Protocol = SMTP;
  Connection = TCP;
  Port = mtp;
  flags { DSN }
  MaxRecipientsPerSession = 5;
};
mailer lmtp {
  Protocol = LMTP;
  flags = { LocalRecipient, Aliases }
  Path = "/usr/bin/lmtp";
};
Daemon MTA {
  smtps-restriction = { qualified-sender, resolvable-domain }
};
Map Mailertable {
  type = hash;
  file = "/etc/sm9/mailertable";
};
Rewrite {
  Envelope {
    sender = { Normalize, Canonify },
    recipient = { Normalize, Virtual, Mailertable }
  };
  Header {
    sender = { Normalize },
    recipient = { Normalize }
  };
};
Check {
  DNSBL MyDNSBL { Orbd, Maps }
  Envelope {
    sender = { Qualified, MyDNSBL },
    recipient = { Qualified, AuthorizedRelay }
  };
};
The usual rules for identifiers (list of characters, digits, and underscores) apply. Values (name) that contain spaces must be quoted; other entries can be quoted, but don't need to be. Those quotes are stripped in the internal representation. Backslashes can be used to escape meta-symbols.
Todo: completely specify syntax.
Note: it has been proposed to make the equal sign optional for this rule:
option ::= option-name [ "=" ] value
However, that causes a reduce/reduce conflict when the grammar is fed into yacc(1) because it conflicts with
section ::= keyword [ name ] "{" entries "}" [ ";" ]
That is, with a lookahead of one it cannot be decided whether something reduces to option or section. If the parser ``knows'' whether some identifier is a keyword or the name of an option then the equal sign can easily be optional. However, doing so violates the layering principle because it ``pushes'' knowledge about the actual configuration file into the parser where it does not really belong: the parser should only know about the grammar. Of course it would be possible to write a more specific grammar that includes lists of options and keywords. However, keeping the grammar abstract (hopefully) allows for simpler tools to handle configuration files. Moreover, if new options or keywords are added the parser does not need to change, it is only the upper layers that perform semantic analysis of a configuration file.
Most configuration/programming languages provide at least one way to add comments: a special character starts a comment which extends to the end of the line. Some languages also have constructs to end comments at a different place than the end of a line, i.e., they have characters (or character sequences) that start and end a comment. To make it even more complicated, some languages allow for nested comments. Text editors make it fairly easy to replace the beginning of a line with a character and hence it is simple to ``comment out'' entire sections of a (configuration) file. Therefore it seems sufficient to have just a simple comment character (``#'') which starts a comment that extends to the end of the current line. The comment character can be escaped, i.e., its special meaning disabled, by putting a backslash in front of it as usual in many languages.
For now all characters are in UTF-8 format, which has ASCII as a proper subset. Hence it is possible to specify text in a different language, which might be useful in some cases, esp. if the configuration syntax is also used in projects other than sendmail X.
Strings are embedded (as usual) in double quotes. To escape special characters inside strings the usual C conventions are used, probably enhanced by a way to specify Unicode characters (``\uVALUE''). Strings cannot continue past the end of a line; to specify longer strings they can be continued by starting the next line (after any amount of white space) with a double quote (just like in ANSI C).
The parser should be able to do some basic semantic checks for various types. That is, it can detect whether strings are well formed (see above), and it must understand basic types like boolean, time specification, file names, etc.
There has been a wish to include configuration data via files or even databases, e.g., OpenLDAP attributes.
There are some suggestions for alternative configuration formats:
option = value
This syntax is not flexible enough to describe the configuration of an MTA, unless some hacks are employed as done by postfix which uses an artificial structuring by naming the options ``hierarchically''. For example, sendmail 8 uses a dot-notation to structure some options, e.g., timeouts (Timeout.queuereturn.urgent); postfix uses underscores for a similar purpose, e.g.,
smtpd_recipient_restrictions =
smtpd_sender_restrictions =
local_destination_concurrency_limit =
default_destination_concurrency_limit =
An explicit hierarchical structure is easier to understand and to maintain.
SMTP defines a structure which influences how an SMTP server (and client) can be configured. The topmost element in SMTP is a session, which can contain multiple transactions, which can contain multiple recipients and one message. Each of these elements has certain attributes (properties). For example
This structure restricts how an SMTP server can be configured. Some things can only be selected (``configured'') at a certain point in a session, e.g., a milter cannot be selected for each recipient, neither can a server IP address be selected per transaction; other options have explicit relations to the stage in a session, e.g., MaxRecipientsPerSession, MaxRecipientsPerTransaction (which might be better expressed as Session.MaxRecipients and Transaction.MaxRecipients or Session.Transaction.MaxRecipients). Some options do not have a clear place in a session at all, e.g., QueueLA, RefuseLA: do these apply to a session, a transaction or a recipient? It is possible to use QueueLA per recipient, but only in sendmail X because it does scheduling per recipient; in sendmail 8 scheduling is done per transaction and hence QueueLA can only be per transaction. This example shows that an actual implementation restricts the configurability, not just the protocol itself.
If an SMTP session is depicted as a tree (where the root is a session) then there is a ``maximum depth'' for each option at which it can be applied. As explained before, that depth is determined
Question: taking these restrictions into consideration, can we specify for each configuration option the maximum depth at which setting the option is possible/makes sense? Moreover, can we specify a range of depths for options? For example: QueueLA can be a global option, an option per daemon, an option per session, etc. If such a range can be defined per option, then the configuration can be checked for violations. Moreover, it restricts the ``search'' for the value of an option that must be applied in the current stage/situation.
Question: it seems the most important restriction is the implementation (beside the structure of SMTP of course). If the implementation does not check for an option at a certain stage, then it does not make any sense to specify the option at that stage. While for some options it is not much effort to check it at a very deep level, for others that means that data structures must be replicated or be made significantly more complex. Examples:
recipient postmaster { reject-client-ip { map really-bad-abusers } }
recipient * { reject-client-ip { map all-abusers } }
This brings us back to the previous question: can we specify for each configuration option the maximum depth at which setting the option makes sense, or at which it is possible without making the implementation too complex?
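One possible way to make this ``maximum depth'' (or depth range) checkable is sketched below; the scope enum, the table contents, and the depth assignments are assumptions for illustration, not a decided sm9 design.

/* Sketch: each option carries the range of scopes ("depths") at which it
 * may be set; a configuration checker rejects settings outside that range.
 * The enum, the table contents, and the option assignments are assumptions
 * for the example. */
#include <stdio.h>
#include <string.h>

enum scope {            /* ordered from shallow to deep */
    SC_GLOBAL, SC_DAEMON, SC_SESSION, SC_TRANSACTION, SC_RECIPIENT
};

struct opt_range {
    const char *name;
    enum scope min_scope;   /* shallowest scope at which it makes sense */
    enum scope max_scope;   /* deepest scope the implementation checks */
};

static const struct opt_range opt_ranges[] = {
    { "QueueLA",                     SC_GLOBAL, SC_RECIPIENT   },
    { "MaxRecipientsPerSession",     SC_GLOBAL, SC_SESSION     },
    { "MaxRecipientsPerTransaction", SC_GLOBAL, SC_TRANSACTION },
    { "Port",                        SC_DAEMON, SC_DAEMON      },
};

static int
scope_allowed(const char *option, enum scope where)
{
    size_t i;
    for (i = 0; i < sizeof(opt_ranges) / sizeof(opt_ranges[0]); i++)
        if (strcmp(opt_ranges[i].name, option) == 0)
            return where >= opt_ranges[i].min_scope &&
                   where <= opt_ranges[i].max_scope;
    return 0;   /* unknown option: reject */
}

int
main(void)
{
    printf("Port per session allowed? %d\n",
           scope_allowed("Port", SC_SESSION));                   /* 0 */
    printf("MaxRecipientsPerSession per daemon allowed? %d\n",
           scope_allowed("MaxRecipientsPerSession", SC_DAEMON)); /* 1 */
    return 0;
}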
There are other configuration options which do not really belong to that structure, e.g., ``mailers'' (as they are called in sm8). A mailer defines a delivery agent (DA), it is selected per recipient. Hence a DA describes the behavior of an SMTP client, not an SMTP server. In turn, many options are per DA too, while others only apply to the server, e.g., milters are server side only.
Problem: STARTTLS is a session attribute, i.e., whether it is used/offered is defined per client/server (per session). However, it is useful (and possible) to require certain STARTTLS features per recipient (as sm8 does via access db and ruleset). It is not possible to say: only offer STARTTLS feature X if the recipient is R, but it is possible to say: if the recipient is R then STARTTLS feature X must be in use (active). Moreover, it's not possible to say: "if the recipient is R, the milter M must be used." How do those configuration options fit into the schema explained above? What's the qualitative difference between these configuration options?
Questions: What's the qualitative difference between these examples? What is the underlying structure? How does the structure define configurability, i.e., what defines why a behavior/option can be dependent on something but not on something else?
For example: STARTTLS in the client (SMTPC): this isn't really: ``use STARTTLS with feature X if recipient R will be sent'', but it is: ``if recipient R will be sent then STARTTLS with feature X must be active'' (similar to SMTPS). However, it is conceivable to actually do the former, i.e., make a session option based on the recipient, because sm9 can do per recipient scheduling, i.e., a DA is selected per recipient. Hence it can be specified that a session to deliver recipient R must have STARTTLS feature X. However, doing that makes connection reuse significantly more complicated (see Section 3.4.10.2). Question: doesn't this define a specific DA? Each DA has some features/options. Currently the use of STARTTLS is orthogonal to DAs (e.g., almost completely independent), hence the connection reuse problem (a connection is defined by DA and server, not DA and server and specific features, because those features should be in the DA definition). Hence if different DAs are defined based on whether STARTTLS feature X should be used, then we have tied a session to DA and server. This brings us to the topic of defining DAs. Question: what do we need ``per DA'' to make things like connection reuse simple? Note: if we define DAs with all features, then we may have a lot of DAs. Hence we should restrict the DA features to those which are really specific to a DA (connection/session/transaction) behavior, and cannot be defined independently. For example, it doesn't seem to be useful to have a DA for each different STARTTLS and AUTH feature, e.g., TLS version, SASL mechanism, cipher algorithm, and key length. However, can't we leave that decision up to the admin?
In addition to simple syntax checks, it would be nice to check a configuration also for consistency. Examples?
As explained in Section 2.1.3.2 there are some issues with the structuring of the configuration options. Here is a simple example that should serve as base for a discussion:
Daemon MTA {
  smtps-restriction { qualified-sender, resolvable-domain }
  mailer smtp {
    Protocol SMTP;
    Port smtp;
    flags { DSN }
    MaxRecipientsPerSession 25;
  };
  Aliases { type hash; file /etc/sm9/aliases; };
  mailer lmtp {
    Protocol LMTP;
    flags { LocalRecipient, Aliases }
    Path "/usr/bin/lmtp";
  };
  Map Mailertable { type hash; file /etc/sm9/mailertable; };
  Rewrite {
    Envelope {
      sender { Normalize },
      recipient { Normalize, Virtual, Mailertable }
    };
    Header {
      sender { Normalize },
      recipient { Normalize }
    };
  };
};
Daemon MSA {
  mailer smtp {
    Protocol SMTP;
    Port submission;
    flags { DSN }
    MaxRecipientsPerSession 100;
  };
  Aliases { type hash; file /etc/sm9/aliases; };
  mailer lmtp {
    Protocol LMTP;
    flags { LocalRecipient, Aliases }
    Path "/usr/bin/lmtp";
  };
  Rewrite {
    Envelope {
      sender { Normalize, Canonify },
      recipient { Normalize, Canonify }
    };
    Header {
      sender { Normalize, Canonify },
      recipient { Normalize, Canonify }
    };
  };
};
This configuration specifies two daemons, MTA and MSA, which share several subconfigurations, e.g., aliases and the lmtp mailer, that are identical in both daemons. As explained in Section 2.1.3.2 it is better to not duplicate those specifications in various places. Here is the example again written in the new style:
aliases MyAliases { type hash; file /etc/sm9/aliases; };
mailer lmtp {
  Protocol LMTP;
  flags { LocalRecipient, Aliases }
  Path "/usr/bin/lmtp";
};
Daemon MTA {
  smtps-restriction { qualified-sender, resolvable-domain }
  mailer smtp {
    Protocol SMTP;
    Port smtp;
    flags { DSN }
    MaxRecipientsPerSession 25;
  };
  aliases MyAliases;
  mailer lmtp;
  Map Mailertable { type hash; file /etc/sm9/mailertable; };
  Rewrite {
    Envelope {
      sender { Normalize },
      recipient { Normalize, Virtual, Mailertable }
    };
    Header {
      sender { Normalize },
      recipient { Normalize }
    };
  };
};
Daemon MSA {
  mailer smtp {
    Protocol SMTP;
    Port submission;
    flags { DSN }
    MaxRecipientsPerSession 100;
  };
  aliases MyAliases;
  mailer lmtp;
  Rewrite {
    Envelope {
      sender { Normalize, Canonify },
      recipient { Normalize, Canonify }
    };
    Header {
      sender { Normalize, Canonify },
      recipient { Normalize, Canonify }
    };
  };
};
Here the subconfigurations aliases and lmtp mailer are referenced explicitly from both daemon declarations. This is ok if there are only a few places in which a few common subconfigurations are referenced, but what if there are many subconfigurations or many places? In this case a new root of the tree would be used which declares all ``global'' options which can be overridden in subtrees. So the configuration tree would look like:
(sketch of the configuration tree: a common root with generic declarations, and subtrees per daemon, per mailer, ...?)
Question: what is the complete structure of the configuration tree? Question: can the tree be specified by the configuration file itself, or is its structure fixed in the binary?
The next problem is how to find the correct value for an option. For example, how to determine the value for MaxRecipientsPerSession in this configuration:
MaxRecipientsPerSession 10;
Daemon MTA {
  MaxRecipientsPerSession 25;
  mailer smtp {
    ...
  };
  mailer relay {
    MaxRecipientsPerSession 75;
  };
};
Daemon MSA {
  MaxRecipientsPerSession 50;
  mailer smtp {
    MaxRecipientsPerSession 100;
  };
};
Does this mean the system has to search in the tree for the correct value? This wouldn't be particularly efficient.
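For illustration, the naive lookup is sketched below in C: walk from the most specific node toward the root until the option is found. A real implementation would more likely flatten the inherited values into each node when the configuration is loaded, so that no search happens per mail; the structure and names here are illustrative only.

/* Sketch of the naive lookup: starting at the most specific node
 * (e.g. "mailer smtp" inside "Daemon MSA"), walk toward the root until
 * a node defines the option.  Structure and names are illustrative. */
#include <stdio.h>

struct cnode {
    struct cnode *parent;
    int have_max_rcpts;          /* is MaxRecipientsPerSession set here? */
    int max_rcpts;
};

static int
max_rcpts_per_session(const struct cnode *node, int def)
{
    for (; node != NULL; node = node->parent)
        if (node->have_max_rcpts)
            return node->max_rcpts;
    return def;                  /* compiled-in default */
}

int
main(void)
{
    /* global { 10 }  Daemon MSA { 50  mailer smtp { 100 } } */
    struct cnode global     = { NULL, 1, 10 };
    struct cnode daemon_msa = { &global, 1, 50 };
    struct cnode msa_smtp   = { &daemon_msa, 1, 100 };
    struct cnode msa_lmtp   = { &daemon_msa, 0, 0 };   /* not set: inherit */

    printf("MSA/smtp: %d\n", max_rcpts_per_session(&msa_smtp, 5));  /* 100 */
    printf("MSA/lmtp: %d\n", max_rcpts_per_session(&msa_lmtp, 5));  /* 50 */
    return 0;
}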
sendmail 8 also offers configuration via the access database, i.e., some tagged key is looked up to find potential changes for the configuration options that are specified in the cf file. For example, srv_features allows setting several options based on the connecting client (see also Section 2.2.6.2). This adds another ``search path'' to find the correct value for a configuration option. In this case there are even two tree structures that need to be searched, which are defined by the host name of the client and its IP address, both of which are looked up in the database by successively removing the most specific parts, e.g., Tag:host.sub.tld, Tag:sub.tld, Tag:tld, Tag:.
What about a ``dynamic'' configuration, i.e., something that contains conditions etc? For example:
if client IP = A and LA < B
then accept connection
else if client IP in net C and LA < D and OpenConnections < E
then accept connection
else if OpenConnections < F
then accept connection
else if ConnectionRate < G
then accept connection
else reject connection
fi
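For comparison, here is a sketch of what such a policy chain looks like when it is hard-wired in the binary and only the thresholds (A..G) come from the configuration; the struct and function names are hypothetical.

/* Hard-wired C equivalent of the policy chain above, as a sketch.
 * The thresholds A..G and the helper values are hypothetical parameters,
 * not existing sm9 options. */
#include <stdbool.h>
#include <stdint.h>

struct conn_policy {
    uint32_t ip_a;          /* threshold A: specific client IP */
    uint32_t net_c, mask_c; /* threshold C: client network */
    double   la_b, la_d;    /* load average limits B and D */
    int      open_e, open_f;/* open-connection limits E and F */
    int      rate_g;        /* connection-rate limit G */
};

static bool
accept_connection(const struct conn_policy *p, uint32_t client_ip,
                  double la, int open_conns, int conn_rate)
{
    if (client_ip == p->ip_a && la < p->la_b)
        return true;
    if ((client_ip & p->mask_c) == p->net_c && la < p->la_d &&
        open_conns < p->open_e)
        return true;
    if (open_conns < p->open_f)
        return true;
    if (conn_rate < p->rate_g)
        return true;
    return false;           /* reject connection */
}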
Note: it might be not too hard to specify a functional configuration language, i.e., one without side effects. However, experience with sm8 shows that temporary storage is required too. As soon as assignments are introduced, the language becomes significantly more complex to implement. Moreover, having such a language introduces another barrier to the configuration: unless it is one that is established and widely used, people would have to learn it to use sm9 efficiently. For example, the configuration language of exim allows for runtime evaluation of macros (variables) and the syntax is hard to read (as usual for unknown languages). There are a few approaches to deal with this problem:
One proposal for the sm9 syntax includes conditionals in the form of
entry ::= option | section | condopt
option ::= option-name [ "=" ] value
condopt ::= "if" "(" condition ")" option
In sendmail 8 it proved to be useful to have some configuration options stored in maps. These can be as simple as reply codes to certain phases in ESMTP and for anti-relay/anti-spam checks, and as complex as the srv_features rulesets (see also Section 2.2.5).
There are several reasons to have configuration options in maps:
If not just anti-spam data is stored in maps but also more complicated options (as explained before: map entries for srv_features) then those options are usually not well structured, e.g., for the example it is just a sequence of single characters where the case (upper/lower) determines whether some feature is offered/required. This does not fulfill the readability requirements of a configuration syntax for sm9.
Question: how to integrate references to maps that provide configuration data into the configuration file syntax, and how should map entries look? One possible way is to have a set of options combined into a group and reference that group from the map. For example, instead of using

SrvFeatures:10   l V

it would be
LocalSrvFeatures { RequestClientCertificate=No; AUTH=require; };
SrvFeatures:10 LocalSrvFeatures
The defaults of the configuration should be compiled into the binary instead of having a required configuration file which contains all default values.
Advantages:
Disadvantages:
It must be possible to query the various sm9 components to print their current configuration settings as well as their current status. The output should be formatted such that it can be used as a configuration file to reproduce the current configuration.
It must be possible to tell the various sm9 components to change their current configuration settings. This may not be practical for all possible options, but at least most of them should be changeable while the system is running. That minimizes downtime to make configuration changes, i.e., it must not be required to restart the system just to change some ``minor'' options. However, options like the size of various data structures may not be changeable ``on the fly''.
sendmail X.0.0.PreAlpha9 has the following configuration parameters:
Various other definitions: postmaster address for double bounces, log level and debug level (these could be more specific, i.e., per module in the code, but probably not per something external), configuration flags, time to wait for SMAR and SMTPC to be ready.
It doesn't seem to be very useful to make these dependent on something: minimum and ``ok'' free disk space (KB).
definitions (see Section 2.2.1, 2): log level and debug level (see above), heap check level, group id (numeric) for CDB, time to wait for QMGR to be ready.
run in interactive mode, serialize all accept() calls, perform one SMTP session over stdin/stdout,
socket over which to receive listen fd, specify thread limits per listening address,
create specified number of processes, bind to specified address - multiple addresses are permitted, maximum length of pending connections queue,
I/O timeout: could be per daemon and client IP address,
client IP addresses from which relaying is allowed, recipient addresses to which relaying is allowed.
All of these are definitions:
log level and debug level (see above), heap check level, time to wait for QMGR to be ready, run in interactive mode, create specified number of processes, specify thread limits.
These could be dependent on DA or even server address: socket location for LMTP, I/O timeout, connect to (server)port.
All of these are runtime options, i.e., they are specified when the binary is started (hence definitions in the sense of Section 2.2.1, 2):
log level and debug level (see above), IPv4 address for name server, DNS query timeout, use TCP for DNS queries instead of UDP, use connect(2) for UDP.
All of these are definitions:
name: string (name of program/service);
port: number or service entry (optional);
socket.name: name of socket to listen on: path (optional);
tcp: currently always tcp (could be udp);
type: type of operation: nostartaccept, pass, wait;
exchange_socket: socket over which fd should be passed to program;
processes_min: minimum number of processes;
processes_max: maximum number of processes;
user: run as which user (user name, i.e., string);
path: path to executable;
args: arguments for execv(3) call.
MCP {
  processes_min=1;
  processes_max=1;
  type=wait;
  smtps {
    port=25;
    type=pass;
    exchange_socket=smtps/smtpsfd;
    user=sm9s;
    path="../smtps/smtps";
    arguments="smtps -w 4 -d 4 -v 12 -g 262 -i -l . -L smtps/smtpsfd";
  }
  smtpc {
    user=sm9c;
    path="../smtpc/smtpc";
    arguments="smtpc -w 4 -P 25 -d 4 -v 12 -i -l .";
  }
  qmgr {
    user=sm9q;
    path="../qmgr/qmgr";
    arguments="qmgr -w 4 -W 4 -B 256 -A 512 -d 5 -v 12";
  }
  smar {
    user=sm9m;
    path="../smar/smar";
    arguments="smar -i 127.0.0.1 -d 3 -v 12";
  }
  lmtp {
    socket_name="lmtpsock";
    socket_perm="007";
    socket_owner="root:sm9c";
    type=nostartaccept;
    processes_min=0;
    processes_max=8;
    user=root;
    path="/usr/local/bin/procmail";
    arguments="procmail -z";
  }
};
Note: some definitions could be functions (see Section 2.2.1), e.g., I/O timeout could be dependent on the IP address of the other side or the protocol, debug and log level could have similar dependencies. As explained in Section 2.2.4 the implementation restricts how ``flexible'' those values are.
Currently hostname is determined by the binary at runtime. If it is set by the configuration then it could be: global, per SMTP server, per IP address of client, per SMTP client, per IP address of server. This is one example of how an option can be set at various depths in the configuration file. Would this be a working configuration file?
Global { hostname = my.host.tld; }
Daemon SMTPS1 {
  Port=MTA;
  hostname=my2.host.tld;
  IP-Client { IP-ClientAddr=127.*; hostname=local.host.tld; }
}
DA SMTPC1 {
  hostname=out1.host.tld;
  IP-Server { IP-ServerAddr=127.*; hostname=local.host.tld; }
  IP-Server { IP-ServerAddr=10.*; hostname=net.host.tld; }
}
The lines that list an IP address are intended to act as restrictions, i.e., if the IP address is as follows then apply this setting. Question: Is this the correct way to express that? What about more complicated expressions (see Section 2.2.6)?
In principle these are conditionals:
hostname = my.host.tld;
if (Port==MTA) {
  hostname=my2.host.tld;
  if (IP-ClientAddr==127.*)
    hostname=local.host.tld;
}
Question: what are the requirements for anti-spam configuration for a (pre-)alpha version of sendmail X?
Not yet available: allow relaying based on TLS.
This brings in all the subtleties from sm8, especially delay-checks. What's a simple way to express this?
The control flow in sm8 is explained in Section 3.5.2.6.
Note: for the first version it seems best to use a simple configuration file without any conditionals etc. If an option is dependent on some data, then the access map method from sm8 should be used. This allows us to put that data into a well known place and treat it in a manner that has been successfully used before. Configuration like anti-relaying should be ``hard-wired'' in the binary and its behavior should only be dependent on data in a map. This is similar to the mc configuration ``level'' in sm8; more control over the behavior is achievable in sm8 by writing rules, which in sm9 may have some equivalent in modules.
The configuration files must be protected from tampering. They should be owned by root or a trusted user. sendmail must not read/use configuration files from untrusted sources, which not just means wrong owners, but also files in insecure directories.
Some processes require root privileges to perform certain operations. Since sendmail X will not have any set-user-id root program for security reasons, those processes must be started by root. It is the task of the supervisor process (MCP: Master Control Process) to do this.
There are a few operations that usually require root privileges in a mail system:
The MCP will bind to port 25 and other ports if necessary before it starts the SMTP server daemons (see Section 2.5) such that those processes can use the sockets without requiring root access themselves.
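The privileged part of that sequence could look roughly like the following sketch (error handling abbreviated, names illustrative): the socket is created and bound while the MCP still runs as root, and only the already-bound descriptor is handed to the unprivileged SMTP server processes.

/* Sketch of the privileged part of the MCP: create and bind the SMTP
 * listening socket while still running as root, so the SMTP server
 * processes started afterwards (under unprivileged IDs) only inherit or
 * receive the already-bound descriptor.  Error handling is abbreviated. */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int
bind_smtp_socket(void)
{
    struct sockaddr_in sin;
    int fd, on = 1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    (void) setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(25);            /* privileged port: needs root */
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(fd, (struct sockaddr *) &sin, sizeof(sin)) < 0 ||
        listen(fd, 128) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* handed to the SMTP server, e.g. via fork()/exec()
                  * inheritance or an exchange socket (SCM_RIGHTS) */
}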
The supervisor process will also be responsible for starting the various processes belonging to sendmail X and supervising them, i.e., deal with failures, e.g., crashes, by either restarting the failed processes or just reporting those crashes if they are fatal. The latter may happen if a system has a hardware or software failure or is (very) misconfigured. The MCP is also responsible for shutting down the entire sendmail X system on request.
The configuration file for the supervisor specifies which processes to start under which user/group IDs. It also controls the behavior in case of problems, i.e., whether they should be restarted, etc. This is fairly similar to inetd, except that the processes are not started on demand (incoming connection) but at startup and whenever a restart is necessary.
The supervisor process runs as root and hence must be carefully written (just like any other sendmail X program). Input from other sources must be carefully examined for possible security implications (configuration file, communication with other parts of the sendmail X system).
The queue manager is the central coordination instance in sendmail X. It controls the flow of e-mail throughout the system. It implements (almost) all policies and coordinates the receiving and sending processes. Since it controls several processes it is important that it will not slow them down. Hence the queue manager will be a multi-threaded program to allow for easy scalability and fast response to requests.
The queue manager will handle several queues (see Section 2.4.1); there will be at least queues for incoming mails, for scheduling delivery of mails, and for storing information about delayed mails.
The queue manager also maintains the state of the various other processes with which it communicates and which it controls, e.g., the open connections of the delivery agents. It should also have knowledge about the systems to which it sends e-mails, i.e., whether they are accepting e-mails, probably the throughput of the connections, etc.
Todo: add a complete specification of what the QMGR does; at least the parts that aren't related to incoming SMTP.
One proposal for a set of possible queues is:
Having several on-disk queues has the following advantages:
Disadvantages of several on-disk queues are:
Since the disadvantages outweigh the advantages the number of on-disk queues will be minimized. The deferred queue will become the main queue and will also contain entries that are on hold or waiting for ETRN. Envelopes for bounces should go to the main queue too. This way the status for an envelope is available in one place (well, almost: the incoming queue may have the current status). Only the ``corrupt'' queue is different since nobody ever schedules entries from this queue and it probably needs a different format (no decision yet). To achieve the ``effect'' of having different queues it might be sufficient to build different indices to access parts of the queue (logical queues). For example, the ETRN queue index has only references to items in the queue that are waiting for an ETRN command.
The ``active'' and the ``incoming'' queues are resident in memory for fast access. The incoming queue is backed up on stable storage in a form that allows fast modifications, but may be slow to reconstruct in case the queue manager crashes. The active queue is not directly backed up on disk, other queues act as backup. That is, the active queue is a restricted size cache (RSC) of entries in other queues. The deferred queue contains items that have been removed from the active queue for various reasons, e.g., policy (deliver only on ETRN, certain times, quarantine due to milter feedback, etc), delays (temporary failures, load too high, etc), or as the result of delivery attempts (success/failure/delay). The active queue must contain the necessary information to schedule deliveries in efficient (and policy based) ways. This implies that there is not just one way to access the data (one key), but several to accommodate different needs. For example, it might be useful to MX-piggyback deliveries, which requires storing (valid, i.e., not-expired) MX records together with recipient domains. Another example is a list of recipients that wait for a host to be able to receive mail again, i.e., a DB which is keyed on hosts (IP addresses or names?) and whose data is a list of recipients for those hosts.
Normally entries in the incoming queue are moved into the deferred queue only after a delivery attempt, i.e., via the active queue. However, the active queue itself is not backed up on persistent storage. Hence an envelope must be either in the incoming queue or in the deferred queue at any given time (unless it has been completely delivered). Moving an entry into the deferred queue must be done safely, i.e., the envelope must be safely in the deferred queue before it is removed from the incoming queue. Question: When do we move the sender data over from the incoming to the deferred queue? Do we do it as soon as one recipient has been tried or only after all have been tried? Since we are supposed to try delivery as soon as we have the mail, we probably should move the sender data after we tried all recipients. ``Trying'' means here: figure out a DA, check flags/status (whether to try delivery at all, what's the status of the system to which the mail should be delivered), if normal delivery: schedule it, otherwise move to the deferred queue.
The in-memory queues are limited in size. These sizes are specified in the configuration file. It is not yet clear which form these specifications may have: amount of memory, amount of entries, percentage of total memory that can be used by sendmail X or specific processes. The specification should include percentages at which the behavior of the queue manager changes, esp. for the incoming queue. If the incoming queue becomes almost full the acceptance of messages must be throttled. This can be done in several stages: just slow down, reduce the number of concurrent incoming connections (just accept them slower), and in the extreme the SMTP server daemons will reject connections. Similarly the mail acceptance must be slowed down if the active queue is about to overflow. Even though the queue manager will normally favor incoming mail over other (e.g., deferred) mail, it must not be possible to starve those other queues. The size of the active queue does not need to be a direct feedback mechanism to the SMTP daemon, it is sufficient if this happens indirectly through the incoming queue (which will fill up if items can't be moved fast enough into the active queue). However, this may not be intended; maybe we want to accept messages for some limited time faster than we can send them.
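A minimal sketch of such staged throttling is shown below; the thresholds and stage names are invented for the example and would in practice come from the configuration.

/* Sketch of staged throttling driven by the fill level of the incoming
 * queue.  The percentage thresholds and the stage names are made up for
 * the example; real values would come from the configuration. */
enum throttle { THR_NONE, THR_SLOW, THR_FEWER_CONNS, THR_REJECT };

static enum throttle
incoming_queue_throttle(unsigned used, unsigned capacity)
{
    unsigned pct = capacity == 0 ? 100 : used * 100 / capacity;

    if (pct >= 95)
        return THR_REJECT;       /* SMTP servers reject new connections */
    if (pct >= 85)
        return THR_FEWER_CONNS;  /* accept connections more slowly */
    if (pct >= 70)
        return THR_SLOW;         /* insert small delays in the dialogue */
    return THR_NONE;
}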
It might be nice for the in-memory queues to vary in size during runtime. In high-load situations those queues may grow up to some maximum, but during lower utilization they should shrink again. Maximum and minimum sizes should be user-configurable. However, in general the OS (VM system) should solve the problem for us.
Here's the list of queues that are currently used:
One proposal for a set of possible queues is:
There's currently no decision about the queue for corrupted entries.
The incoming queue must be backed up on stable storage to ensure reliability. The most likely implementation right now is a logfile, in which entries are simply appended since this is the fastest method to store data on disk. This is similar to a log-structured filesystem and we should take a look at the appropriate code. However, our requirements are simpler: we don't need a full filesystem, but only one file type with limited access methods. There must be a cleanup task that removes entries from the backup of the incoming queue when they have been taken care of. For maximum I/O throughput, it might be useful to specify several logfiles on different disks.
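A sketch of the append-and-commit operation on such a logfile is given below; the record layout and function names are placeholders, the essential points are O_APPEND and fsync(2) before the receipt of the message is acknowledged.

/* Sketch of the append-only backup for the incoming queue: records are
 * only appended and forced to disk before the final dot is acknowledged.
 * Record layout and file name are placeholders. */
#include <fcntl.h>
#include <unistd.h>

static int
ibdb_append(int fd, const void *record, size_t len)
{
    ssize_t n = write(fd, record, len);   /* O_APPEND: append to the log */
    if (n != (ssize_t) len)
        return -1;
    return fsync(fd);                     /* commit before acknowledging */
}

static int
ibdb_open(const char *path)               /* e.g. one logfile per disk */
{
    return open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
}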
The other queues require different formats. The items on hold are only released on request (e.g., for ETRN). Hence they must be organized in a way that allows easy access per domain (the ETRN argument) or other criteria, e.g., a hold message for quarantined entries.
The delayed queue contains items that could not be delivered before due to temporary errors. These are accessed in at least two ways:
The queue manager needs to keep a reference count for an envelope to decide when an entry can be removed. This may become non-trivial in case of aliases (expansion). If the LDA does alias expansion then one question is whether it injects a new mail (envelope) with a new body. Otherwise reference counting must take care of this too.
The MAIL (sender) data, which includes the reference counter, is stored in the deferred queue if necessary, i.e., as long as there are any recipients left (reference count greater than zero). Hence we must have a fast way to access the sender data by transaction id. At any time the envelope sender information must be either in the incoming queue or in the deferred queue.
Problem: mailing lists require creating new envelope sender addresses, i.e., the list owner will be the sender for those mails. An e-mail can be addressed to several mailing lists and to ``normal'' recipients, hence this may require generating several different mail sender entries. Question: should the reference counter only be in the original sender entry, with the newly created entries having references to that? Distributing the reference count is complicated. However, this may mean that a mail sender entry stays around even though all of its (primary) recipients have been taken care of.
It might be necessary to have a garbage collection task: if the system crashes during an update of a reference counter the value might become incorrect. We must assure that the value is never too small, because we could remove data that is still needed. If the stored value is bigger than it should be, the garbage collection task must deal with this (compare fsck for file systems).
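The following sketch shows the basic reference-count handling described above; it is illustrative only (names invented), and the crash-safety requirement is reduced to a comment.

/* Sketch of the reference-count handling for an envelope: the counter is
 * decremented (and persisted) as recipients are taken care of, and the
 * sender entry is removed only when it reaches zero. */
#include <assert.h>
#include <stdbool.h>

struct envelope {
    int refcount;                /* number of recipients not yet done */
    /* ... sender data, identifiers, DSN flags ... */
};

/* Returns true if the envelope can now be removed from the queue. */
static bool
envelope_recipient_done(struct envelope *env)
{
    assert(env->refcount > 0);
    env->refcount--;
    /* The decremented value must reach stable storage before the
     * recipient entry is deleted; if the system crashes in between, the
     * stored counter is at worst too large, and a garbage-collection
     * pass (compare fsck) can reclaim the entry later. */
    return env->refcount == 0;
}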
Envelopes received by the SMTP servers are put into the incoming queue and backed up on disk. If an envelope has been completely received, the data is copied into the active queue unless that queue is full. Entries in the active queue are scheduled for delivery. When delivery attempts are made, the results of those attempts are written to the incoming queue (mark it as delivered) or the deferred queue as necessary. Entries from the deferred queue are copied into the active queue based on their ``next time to try'' time stamp.
It would be nice to define some terms for
A delivery attempt consists of:
This section explains how data is added to the various queues, what happens with it, and under which circumstances data is read from a queue if there is no queue into which the data is read, i.e., this is a consumer-oriented view.
This section gives a bit more detail about the data flow than the previous section. It only deals with data that is stored by QMGR in some queue; it does not specify the complete data flow, i.e., what happens in the SMTP server or the delivery agents.
Question: should we keep entries in the incoming queue only during delivery attempts, or should we move the envelope data into the deferred queue while the attempts are going on? If we move the envelopes, we have more space available in the incoming queue and can accept more mail. However, moving envelopes of course costs performance. In the ``normal'' case we don't need the envelope data in the deferred queue, i.e., if delivery succeeds for all recipients and no SUCCESS DSNs are requested, we never need the envelope data in the deferred queue. Question: do we want a flexible algorithm that moves the envelope data only under certain conditions? Those conditions could include how much space is free in the incoming queue and how long an entry has already been in the queue. There should be two different modes (selectable by an option):
We need the envelope data in the deferred queue, if and only if
If no DSN must be sent and all recipients have been taken care of, the envelope does not need to be moved into the DEFEDB, and it can be removed from the INCEDB afterwards without causing additional data moving.
Note: it should be possible to remove recipient and transaction data from IQDB as soon as it has been transferred to AQ and safely committed to IBDB; at this moment the data is in persistent storage and it is available to the scheduler, hence the data is not really needed anymore in IQDB. There are some implementation issues around this, hence it is not done in the current version; this is something that should be optimized in a subsequent version.
When a delivery attempt (see 4d in Section 2.4.3.2) has been made, the recipient must be taken care of in the appropriate way. Note that a delivery attempt may fail in different stages (see Section 2.4.3.1), and hence updating the status of a recipient can be done from different parts of QMGR. That is, in all cases the recipient address is removed from ACTEDB and
The data is stored in DEFEDB (persistent storage) to avoid retrying a failed delivery, see also Section 2.4.6.
Notice: it is recommended to perform the update for a delivery attempt in one (DB) transaction to minimize the amount of I/O and to maintain consistency. Furthermore, the updates to DEFEDB should be made before updates to INCEDB are made as explained in Section 2.4.1.
Note: this section does not discuss how to deal with a transaction whose recipients are spread out over INCEDB and DEFEDB. For example, consider a transaction with two recipients, all data is in INCEDB. A delivery attempt for one recipient causes a temporary failure, the other recipient is not tried yet. Now the transaction and the failed recipient are written to DEFEDB. However, the recipient counters in the transaction do not properly reflect the number of recipients in DEFEDB but in both queues together. The recovery program must be able to deal with that.
According to item 10 in Section 2.4.3.3 entries are read from the deferred queue into the active queue based on their ``next time to try'' (or whatever criteria the scheduler wants to apply). Instead of reading through the entire DB -- which is on disk and hence involves expensive disk I/O -- each time entries should be added, an in-memory cache (EDBC) is maintained which contains references to entries in DEFEDB sorted by the ``next time to try''. Note: it might be interesting to investigate whether a DEFEDB implementation based on Berkeley DB would make this optimization superfluous because Berkeley DB maintains a cache anyway. However, it is not clear which data that cache contains; most likely it is not the ``next time to try'' but only the key (recipient/transaction identifiers).
Even though each entry in the cache is fairly small (recipient identifier, next time to try, and some management overhead), it might be impossible to hold all references in memory because of the size. Here is a simple size estimation: an entry is about 40 bytes, hence 1 million entries require 40 MB. If a machine actually has 1 million entries in its deferred queue then it most likely has more than 1 GB RAM. Hence it seems fairly unlikely that EDBC will exceed the available memory. Nevertheless, the system must be prepared to deal with such a resource shortage. This can be done by changing into a different mode in which DEFEDB is regularly scanned and entries are inserted into EDBC such that older entries are removed when newer entries are inserted and EDBC is full2.20.
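As an illustration only, EDBC could be kept as a binary min-heap ordered by the ``next time to try''; the entry layout below roughly matches the 40 byte estimate, but all names and sizes are assumptions, not the actual sendmail X data structures.

#include <stdlib.h>
#include <string.h>
#include <time.h>

/* One EDBC entry: roughly the 40 bytes estimated above. */
struct edbc_entry {
    char   rcpt_id[32];   /* recipient/transaction identifier (key into DEFEDB) */
    time_t next_try;      /* "next time to try" */
};

/* Simple binary min-heap ordered by next_try; the scheduler pops the
 * entry with the smallest next_try and then fetches the full record
 * from DEFEDB. */
struct edbc {
    struct edbc_entry *e;
    size_t used, size;    /* size is the configured maximum */
};

static void
edbc_swap(struct edbc_entry *a, struct edbc_entry *b)
{
    struct edbc_entry t = *a; *a = *b; *b = t;
}

static int
edbc_insert(struct edbc *c, const char *id, time_t next_try)
{
    size_t i;

    if (c->used >= c->size)
        return -1;      /* full: caller must fall back to scanning DEFEDB */
    i = c->used++;
    strncpy(c->e[i].rcpt_id, id, sizeof(c->e[i].rcpt_id) - 1);
    c->e[i].rcpt_id[sizeof(c->e[i].rcpt_id) - 1] = '\0';
    c->e[i].next_try = next_try;
    while (i > 0 && c->e[(i - 1) / 2].next_try > c->e[i].next_try) {
        edbc_swap(&c->e[(i - 1) / 2], &c->e[i]);   /* sift up */
        i = (i - 1) / 2;
    }
    return 0;
}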
If the MTS is busy it might not be possible to read all entries from DEFEDB when their next time to try is actually reached because AQ might be full. Hence it is necessary to establish some fairness between the two producers for AQ: IQDB (SMTP servers) and DEFEDB. A very simple approach is to reserve a certain amount, e.g., half, for each of the producers. However, that does not seem to be useful:
A slightly better approach is as follows:
This approach will
sendmail 8 provides a delivery mode called interactive in which a mail is delivered before the server acknowledges the final dot. An enhanced version of this could be implemented in sendmail X, i.e., try immediate delivery but enforce a timeout after which the final dot is acknowledged. A timeout is necessary because otherwise clients run into a timeout themselves and resend the mail which will usually result in double deliveries.
This mode is useful to avoid expensive disk I/O operations; in a simple mode at least the fsync(2) call can be avoided, in a more complicated mode the message body could be shared directly between SMTP server and delivery agent to even avoid the creation of a file on disk (this could be accomplished by using the buffered file mode from sendmail 8 with large buffers, however, this requires some form of memory sharing2.21). Various factors can be used to decide whether to use interactive delivery, e.g., the size of the mails, the number of recipients and their destinations, e.g., local versus remote, or other information that the scheduler has about the recipient hosts, e.g., whether they are currently unavailable etc.
Cut-through delivery requires a more complicated protocol between QMGR and SMTP server. In normal mode the SMTP server calls fsync(2) before giving the information about the mail to QMGR and then waits for a reply which in turn is used to inform the SMTP client about the status of the mail, i.e., the reply to the final dot. For cut-through delivery the SMTP server does not call fsync(2) but informs QMGR about the mail. Then the following cases can happen:
For case 1b the SMTP server needs to send another message to QMGR telling it the result of fsync(2). If fsync(2) fails, the message must be rejected with a temporary error, however, QMGR may already have delivered the mail to some recipients, hence causing double deliveries.
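The following sketch shows one possible shape of the SMTP server side of this protocol; since the enumeration of cases is not reproduced here, the code simply assumes that ``case 1b'' means the QMGR asks for normal queueing, and every type, constant, and qmgr_*() function is a hypothetical placeholder, not an existing API.

#include <unistd.h>

struct session { int cdb_fd; /* fd of the CDB entry for this transaction */ };
enum { QMGR_DELIVERED, QMGR_QUEUE };   /* possible replies from the QMGR */

void qmgr_announce_mail(struct session *);
int  qmgr_wait_reply(struct session *, int timeout);
void qmgr_fsync_result(struct session *, int result);
int  smtp_reply(struct session *, int code);

int
smtps_final_dot_cutthrough(struct session *se)
{
    /* hand the envelope to the QMGR before calling fsync(2) */
    qmgr_announce_mail(se);

    if (qmgr_wait_reply(se, 30) == QMGR_DELIVERED)
        return smtp_reply(se, 250);       /* cut-through delivery succeeded */

    /* QMGR wants normal queueing (assumed case 1b): commit to disk now
     * and report the fsync(2) result back to the QMGR */
    if (fsync(se->cdb_fd) == 0) {
        qmgr_fsync_result(se, 0);
        return smtp_reply(se, 250);
    }
    qmgr_fsync_result(se, -1);
    /* temporary error; as noted above this may cause double deliveries
     * if the QMGR already delivered the mail to some recipients */
    return smtp_reply(se, 451);
}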
Items from the delayed queue need to be read into the active queue based on different criteria, e.g., time in queue, time since last attempt, precedence, random walk.
The queue manager must establish a fair selection between items in the incoming queue and items in the deferred queue. This algorithm can be influenced by user settings, which include simple options (compare QueueSortOrder in sendmail 8), table driven decisions, e.g., no more than N connections to a given host, and a priority (see Section 2.4.4.1). A simple way to achieve a fair selection is to establish a (configurable) ratio between the queues from which entries are read into the active queue, e.g., incoming: 5, deferred: 1. Question: do we use a fixed ratio between incoming and deferred queue or do we vary that ratio according to certain (yet to be determined) conditions? These ratios are only honored if the system is under heavy load, otherwise it will try to get as many entries into the active queue as possible (to keep the delivery agents busy). However, the scheduler will usually not read entries from the deferred queue whose next time to try has not yet been reached, unless there is a specific reason to do so. Such a reason might be that a connection to the destination site became available, an ETRN command has been given, or delivery is forced by an admin via a control command. Question: does the ratio refer to the number of recipients or the number of envelopes?
The QMGR must at least ensure that mail from one envelope to the same destination site is sent in one transaction (unless the number of recipients per message is exceeded). Hence there should be a simple way to access the recipients of one envelope, maybe the envelope id is a key for access to the main queue. See also 2.4.4.4 for further discussion. Additionally MX piggybacking (as in 8.12) should be implemented to minimize the required number of transactions.
Question: how to schedule deliveries, how to manage the active queue? Scheduling: Deliveries are scheduled only from the active queue, entries are added to this queue from the incoming queue and from the deferred queue.
To reduce disk I/O the active queue has two thresholds: the maximum size and a low watermark. Only if too few entries are in the cache are entries read from the deferred queue. Problem: entries from the incoming queue should be moved as fast as possible into the active queue. To avoid starvation of deferred entries a fair selection must be made, but this must be done on a ``large'' scale to minimize disk I/O. That is, if the ratio is 2-1 (at least one entry from the deferred queue for every two from the incoming queue), then it could be that 100 entries are moved from the incoming queue, and then 50 from the deferred queue. Of course the algorithm must be able to deal with border conditions, e.g., very few incoming entries but a large deferred queue, or only a few entries trickling in such that the number of entries in the active queue always stays around the low watermark.
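A minimal sketch of such a refill step, assuming a configurable ratio and hypothetical helper functions (none of these names exist in sendmail X):

#include <stddef.h>

struct aq_limits {
    size_t max;            /* maximum size of the active queue */
    size_t low_watermark;  /* refill only below this */
    unsigned ratio_inc;    /* e.g. 2: entries from the incoming queue ... */
    unsigned ratio_def;    /* e.g. 1: ... per entry from the deferred queue */
};

size_t aq_used(void);                  /* current number of AQ entries */
size_t move_from_incoming(size_t n);   /* returns how many were actually moved */
size_t move_from_deferred(size_t n);   /* only entries whose next-try is due */

void
aq_refill(const struct aq_limits *lim)
{
    size_t room, want_inc, want_def, got_inc;

    if (aq_used() >= lim->low_watermark)
        return;                        /* enough work queued already */
    room = lim->max - aq_used();

    /* split the free room "on a large scale" to limit disk I/O */
    want_inc = room * lim->ratio_inc / (lim->ratio_inc + lim->ratio_def);
    want_def = room - want_inc;

    got_inc = move_from_incoming(want_inc);
    /* if few incoming entries exist, let the deferred queue use the rest */
    want_def += want_inc - got_inc;
    (void)move_from_deferred(want_def);
}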
Question: where/when do we ask the address resolver for the delivery tuple? That's probably a configuration option. The incoming queue must be able to store addresses in external and in ``resolved'' form. See also Section 3.11.6 for possible problems when using the resolved form.
Here's a list of scheduling options people (may) want (there are certainly many more):
Question: how to specify such scheduling options and how to do that in an efficient way? It doesn't make much sense to evaluate a complicated expression each time the QMGR looks for an item in the deferred queue to schedule for delivery. For example, if an entry should only be sent at certain times, then this should be ``immediately'' recognizable (and the item can be skipped most of the time, similar to entries on hold).
Remark: qmail-1.03/THOUGHTS [Ber98] contains this paragraph:
Mathematical amusement: The optimal retry schedule is essentially, though not exactly, independent of the actual distribution of message delay times. What really matters is how much cost you assign to retries and to particular increases in latency. qmail's current quadratic retry schedule says that an hour-long delay in a day-old message is worth the same as a ten-minute delay in an hour-old message; this doesn't seem so unreasonable.
Remark: Exim [Haz01] seems to offer a quite flexible retry time calculation:
For example, it is possible to specify a rule such as `retry every 15 minutes for 2 hours; then increase the interval between retries by a factor of 1.5 each time until 8 hours have passed; then retry every 8 hours until 4 days have passed; then give up'. The times are measured from when the address first failed, so, for example, if a host has been down for two days, new messages will immediately go on to the 8-hour retry schedule.
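As a sketch only, one reading of the quoted rule could be computed like this (the function and its parameters are illustrative, not Exim's implementation):

#include <time.h>

/* One reading of the quoted rule: retry every 15 minutes while the
 * address has been failing for less than 2 hours; afterwards grow the
 * interval by a factor of 1.5 per attempt, capped at 8 hours; give up
 * after 4 days.  Times are measured from when the address first failed. */
long
retry_interval(time_t first_failure, time_t now, long last_interval)
{
    double failing = difftime(now, first_failure);

    if (failing >= 4 * 24 * 3600.0)
        return 0;                        /* give up: return the message */
    if (failing < 2 * 3600.0)
        return 15 * 60;                  /* flat 15 minute schedule */
    if (last_interval < 15 * 60)
        last_interval = 15 * 60;
    last_interval = (long)(last_interval * 1.5);
    if (last_interval > 8 * 3600)
        last_interval = 8 * 3600;        /* 8 hour cap */
    return last_interval;
}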
Courier-MTA has four variables to specify retries:
, , ,
These control files specify the schedule with which Courier tries to deliver each message that has a temporary, transient, delivery failure. and contain a time interval, specified in the same way as queuetime. and contain small integral numbers only.
Courier will first make delivery attempts, waiting for the time interval specified by between each attempt. Then, Courier waits for the amount of time specified by , then Courier will make another delivery attempts, amount of time apart. If still undeliverable, Courier waits amount of time before another delivery attempts, with amount of time apart. The next delay will be amount of time long, the next one , and so on. sets the upper limit on the exponential backoff. Eventually Courier will keep waiting amount of time before making delivery attempts amount of time apart, until the queuetime interval expires.
The default values are:
This results in Courier delivering each message according to the following schedule, in minutes: 5, 5, 5, 15, 5, 5, 30, 5, 5, 60, 5, 5, then repeating 120, 5, 5, until the message expires.
There are two levels of scheduling:
We could assign each entry a priority that is dynamically computed. For example, the priority could incorporate:
However, it is questionable whether we can devise a formula that generates the right priority. How do we have to weight those parameters (linear functions?), and how do we combine them? It might be simpler (better) to specify the priority in some logical formula (if-then-else) in combination with arithmetic. Of course we could use just arithmetic (really?) if we use the right operations. However, we want to be able to short-cut the computation, e.g., if one parameter specifies that the entry certainly will not be scheduled now. For example: if time-next-try > now then Not-Now unless connections-open(recipient-site) > 0.
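A sketch of what such a short-circuiting priority function might look like; the fields, the weights, and the convention that a negative value means ``do not schedule now'' are all arbitrary assumptions.

#include <time.h>

struct aq_entry {
    time_t next_try;        /* earliest time for the next attempt */
    time_t queued_since;    /* time at which the entry was queued */
    unsigned nrcpts;        /* number of recipients in the transaction */
    int have_connection;    /* an open connection to the destination exists */
};

/* Negative: do not schedule now; larger values: higher priority. */
long
entry_priority(const struct aq_entry *e, time_t now)
{
    /* short-cut: not due yet and no open connection to exploit */
    if (e->next_try > now && !e->have_connection)
        return -1;

    /* otherwise combine a few parameters; linear weights chosen arbitrarily */
    return (long)difftime(now, e->queued_since)      /* older mail first */
         - (long)e->nrcpts                           /* penalize huge lists */
         + (e->have_connection ? 1000 : 0);          /* prefer reusing connections */
}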
On systems with low mail volume the schedulers will not be busy all the time. Hence they should sleep for a certain time (in sendmail 8 that's the -q parameter). However, it must be possible to wake them up whenever necessary. For example, when a new mail comes in, the first level scheduler should be notified of that event such that it can immediately put that mail into the active queue if that is possible, i.e., if there is enough free space. The sleep time might be a configurable option, but it should also be possible to just say: wake up at the next retry time, which is the minimum of the retry times in the deferred queue.
The next retry time should not be computed based on the message/recipient, but on the destination site (Exim does that). It doesn't make much sense to contact a site that is down at random intervals just because different messages are addressed to it. Since the status of a destination site is stored in the connection cache, we can use that to determine the next retry time. However, we have the usual problem here: a recipient specifies an e-mail address, not the actual host to contact. The latter is determined by the address resolver, and in general it's not a single host, but a list of hosts. In theory, we could base the retry time on the first host in the list. However, what should we do if another host in the list has a different next retry time, esp. an earlier one? Should we use the minimum of all retry times? We would still have to try the hosts in order (as required by the standard), but since a lower priority host may be reachable, we can deliver the mail to it. Question: under which circumstances can a host in the list have an earlier retry time? This can only happen if the list changes and a new host is added to it (because of DNS changes or routing changes). In that case, we could set the retry time for the new host to the same time as all the other hosts in the list. However, this isn't ``fair'': it would penalize all mails to that host. So maybe it is best to use the retry time of the first host in the list as the retry time of a message.
Note: There are benefits to some randomness in the scheduling. For example, if some systematic problem knocks down a site every 3 hours, taking 15 minutes to restore itself, then delivery attempts should not accidentally synchronize with the periodic failures. Hence adding some ``fuzz'' factor might be useful.
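For illustration, such a fuzz factor could be added to a computed retry time as follows; the +/- 10% range is an arbitrary choice.

#include <stdlib.h>
#include <time.h>

/* Add up to +/- 10% random "fuzz" to a computed retry time so that
 * delivery attempts do not synchronize with periodic failures of the
 * destination. */
time_t
add_fuzz(time_t next_try, time_t now)
{
    long ivl = (long)difftime(next_try, now);
    long fuzz;

    if (ivl <= 0)
        return next_try;
    fuzz = (rand() % (2 * (ivl / 10) + 1)) - ivl / 10;  /* in [-10%, +10%] */
    return next_try + fuzz;
}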
Notice: it might be useful to have a pre-emptive scheduler. That is, even if the active queue is full, there might be reasons to remove entries from it and replace them with higher priority entries from the incoming queue. For example, the active queue may be filled with a lot of entries from a mailing list while new mail is coming in. If the delivery is slow, then some of those new entries may replace entries in the active queue that haven't actually been given to a delivery agent yet. Theoretically, this could be handled by priorities too.
Whenever there is sufficient free space (number of entries falls below low watermark), then the first level scheduler must put some entries from the incoming and the deferred queue into the active queue.
Problem: we have to avoid using up all delivery agents (all allowed connections) for one big e-mail, e.g., an e-mail to a mailing list with thousands of recipients. Even if we take the number of recipients into account for the priority calculation, we don't want to put all recipients behind other mails with fewer recipients (do we?). This is just another example of how complicated it is to properly calculate the priority. Moreover, expansion of an alias to a large list must be done such that it doesn't overflow the incoming queue. That is: where do we put those expanded addresses? We could schedule some entries immediately and put others into the deferred queue (which doesn't have strict size restrictions).
Entries from the incoming queue are placed into the active queue in FIFO order in most cases.
Question: do we put an entry from the incoming queue into the active queue even though we know the destination is unavailable, or do we move it in such a case directly to the deferred queue? We could add some kind of probability and a time range to the configuration (maybe even per host). Get a random number between 0 and 100 and check it against the specified probability; if it is lower, try the connection anyway. Another way (combinable?) is to specify a time range (absolute or as a percentage) and check whether the next time to try is within this range.
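A minimal sketch of the combined check just described, with purely illustrative parameter names:

#include <stdlib.h>
#include <time.h>

/* Even if the destination is marked unavailable, try it anyway with a
 * configured probability, or if its next retry time is "close enough". */
int
try_unavailable_dest(unsigned try_probability /* 0..100 */,
                     time_t next_retry, long time_range, time_t now)
{
    if ((unsigned)(rand() % 100) < try_probability)
        return 1;                             /* try the connection anyway */
    if (difftime(next_retry, now) <= time_range)
        return 1;                             /* retry time almost reached */
    return 0;                                 /* move directly to the deferred queue */
}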
Whenever new entries are added to the active queue, a ``micro scheduler'' arranges those in an appropriate order. Question: how to do micro scheduling within the active queue?
Question: do we treat the active queue strictly as a queue? Probably not, because we want to reuse open connections (as much as allowed by the configuration). So if we have an open connection and we move an entry ``up front'' to reuse the connection, how do we avoid letting other entries ``lie around'' forever in the queue? We could add a penalty to this connection (priority calculation), such that after some uses the priority becomes too bad and hence entries can't leapfrog others anymore. The problem is still the same: how to properly calculate the priority without causing instabilities? Courier moves the oldest entry (from the tail) to the head of the queue in such a case to prevent starvation. Question: does this really prevent starvation or is it still possible that some entries may stay in the queue forever?
The second level scheduler must be able to preempt entries in the queue. This is required at least for entries that are supposed to be sent to a destination which turns out to be unavailable after the entry has been placed in the active queue. This can happen if an ``earlier'' entry has the same destination and that delivery attempt fails. Then all entries for the same destination will be removed from active queue. In such a case, they will be marked as deferred (assuming it was a temporary delivery failure). Notice: this is complicated due to the possibility of multiple destination sites, so all of them have to be unavailable for this to happen. It may also be useful to just remove entries from the active queue based on request by the first level scheduler. Question: how can this be properly coordinated?
As described in Section 2.4.4 the scheduler should at least ensure that mail from one envelope to the same destination site is sent in one transaction (unless the number of recipients per message is exceeded). However, this isn't as trivial to achieve as it seems at first sight. If MX piggybacking is supposed to be implemented then all addresses of one envelope must be resolved first before any delivery is scheduled. This may reduce throughput since otherwise delivery attempts can be made as soon as a recipient address is available. If those recipient addresses are for different destinations then starting delivery as soon as possible is more efficient (assuming the system has not yet reached its capacity). If the recipient addresses are for the same destination then putting them into one transaction will at least reduce the required bandwidth (and depending on the disk I/O system and its buffer implementation maybe also the number of disk I/O operations). Recipient coalescing based on the domain parts is easier to implement since it can be done before addresses are resolved; it still requires walking through the entire recipient list of course (some optimized access structure, e.g., a tree with sublists, could be implemented). Depending on when addresses are resolved and where they are stored, MX piggybacking may be just as easy to achieve, i.e., if the resolved addresses are directly available.
Entries must not stay in the AQ for unlimited time (see Section 2.4.3.2, item 4e) hence some kind of timeout must be enforced. There are two situations in which timeouts can occur:
The queue manager keeps a connection cache that records the number of open connections, the last time of a connection attempt, the status (failure?), etc. For details, see Section 3.4.10.10. Question: if the retry time for a host isn't reached, should an incoming message go directly into the deferred queue instead of being tried? That might be a configuration option. See also 2.4.4.2.2.
For SMTP clients, mail might have multiple possible destinations due to the use of MX records. The basic idea is to provide a metric of hosts that are ``nearer'' to the final delivery host (where usually local delivery occurs). An SMTP client must try those hosts in order of their preference ``until a delivery attempt succeeds''2.22. However, this description is at least misleading, because it seems to imply that if mail delivery fails other destination hosts should (MUST) still be tried, which is obviously wrong. So the question is: when should an SMTP client (or in this case, the QMGR) stop trying other hosts? One simple reason to stop is of course when delivery succeeded. But what about all the other cases (see Section 3.8.4.1)? qmail stops trying other hosts as soon as a connection succeeds, which is probably not a good idea since the SMTP server may greet with a temporary error or cause temporary errors during a transaction.
The QMGR should maintain the following data structures (``connection caches'', ``connection databases'') to help the scheduler make its decisions:
The last structure (3: AQRD) is just one way to access recipients in AQ, in this case via the DA and the next hop (``destination''). It can be used to access recipients that are to be sent to some destination, e.g., to reuse an open connection. All recipients that have the same destination are linked together.
OCC (1) keeps track of the currently open connections and how busy they are, as well as the current ``load'', i.e., the number of open sessions/transactions per destination. This can be used to implement things like slow-start (see 2.4.7) and overall connection limits. Note: these limits should not be implemented per DA, but for the complete MTS. Question: should there be only one (global) open connection cache, not one per DA?
DCC (2) keeps track of previously made/tried connections (not those that are currently open); it can be compared to the hoststatus cache of sendmail 8. This can be used by the scheduler to decide whether to try connecting to hosts at all, e.g., because they have been down for some time already.
All three structures are basically accessed via the same key (DA plus next hop); the structures AQRD (3) and OCC (1) keep an accurate state, while DCC (2) might be implemented in a way that some information is lost in order to keep the size reasonable (it is not feasible to keep track of all connections that have ever been made, nor is it reasonable to keep track of all connections for a certain amount of time if that interval is too large, see 3.4.10.10 for a proposal).
Question: is it useful to merge AQRD (3) and OCC (1) because they basically provide two parts of a bigger picture (and hence merging them avoids having to update and maintain them separately, e.g., memory (de-)allocation and lookups are done twice for each update)? However, keeping them separate seems cleaner from a software design standpoint: AQRD is ``just'' one way to access entries in AQ, while OCC is an overview of the current state of all delivery agents.
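For illustration, the three structures might look roughly like this; the field names are assumptions and mainly show that all of them share the (delivery agent, next hop) key:

#include <time.h>

struct conn_key {
    int  da;                 /* delivery agent */
    char nexthop[256];       /* destination ("next hop") */
};

struct occ_entry {           /* (1) open connection cache: accurate state */
    struct conn_key key;
    unsigned open_sessions;
    unsigned open_transactions;
};

struct dcc_entry {           /* (2) cache of past connections; may be lossy */
    struct conn_key key;
    time_t last_attempt;
    int    last_status;      /* e.g. OK, temporary or permanent failure */
};

struct aqrd_entry {          /* (3) recipients in AQ grouped by destination */
    struct conn_key key;
    struct aq_rcpt *rcpts;   /* linked list of AQ recipients for this destination */
};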
There are (at least) two more access methods that are useful for the scheduler:
It might be useful to organize recipients that are waiting for an AR or DA result into a list which is sorted according to their potential timeout.
The administrator should have the chance to trigger a delivery attempt or complete queue runs manually. For example, if the admin notices that a site or a network connection is up again after a problem, she should be able to inform the scheduler about this change, see also Section 2.11.1.2.
According to RFC 1894 five types of DSNs are possible:
Using the terms ``mailing list'' and ``alias'' as defined in RFC 2821 [Kle01], section 3.10.1 and 3.10.2: An action-value of ``expanded'' is only to be used when the message is delivered to a multiple-recipient ``alias''. An action-value of ``expanded'' should not be used with a DSN issued on delivery of a message to a ``mailing list''.
The queue manager collects the delivery status information from the various delivery agents (temporary and permanent failures). Based on the requested delivery status notifications (delay, failure, success), it puts this information together and generates a DSN as appropriate. DSNs are added to the appropriate queue and scheduled for delivery.
Question: how to coalesce DSNs? We don't want to send a DSN for each (failed) recipient back to the sender individually. After each recipient has been tried at least once (see also 3.4.10.6) we can send an initial DSN (if requested) which includes the failed recipients (default setting). Do we need to impose a time limit after which a DSN should be sent even if not all recipients have been tried yet? Assuming that our basic queue manager policy causes all recipients to be tried more or less immediately, we probably don't need to do this. Recipients would not be tried if the policy says so (hold/quarantine), or if the destination host is known to be down and its retry time hasn't been reached yet. In these cases those recipients would be considered ``tried'' for the purpose of a DSN (they are delayed). After the basic warning timeout (either local policy or due to deliver-by) a DSN for the delayed recipients is sent if requested. This still leaves open when to send DSNs for failed recipients during later queue runs. Since the queue manager doesn't schedule deliveries per envelope but per recipient, we need to establish some policy on when to send other DSNs. Todo: take a look at how other MTAs (postfix) handle this. Note: RFC 1891, 6.2.8 DSNs describing delivery to multiple recipients: a single DSN may describe attempts to deliver a message to multiple recipients of that message. Hence the RFC allows sending several DSNs; it doesn't require coalescing.
Notice: it is especially annoying to get several DSNs for the same message if the full message is returned each time. However, it would probably violate the RFCs to return the entire mail only once (which could be fairly easily accomplished). BTW: the RET parameter only applies to ``failed'' DSNs, for others only headers should be returned (RFC 1891, 5.3).
Additional problem: different timeouts for warnings. It is most likely possible to assign different timeouts for DELAY DSNs to different recipients within a single mail. In that case the algorithm to coalesce DELAY DSNs will be even more complicated, i.e., it can't be a simple counter of whether all recipients have been tried already.
Question: where do we store the data for the DSN? Do we store it in the queue and generate the DSN body ``on the fly'' or do we create a body in the CDB? Current vote is for the former.
A MTA must be able to distinguish between different types of recipient addresses:
Note: RFC 1891, 6.2.7.4 explains confidential forwarding addresses which should be somehow implemented in sendmail X.
It doesn't seem to be easy to maintain this data. First of all, the types are only known after address expansion. Even then, they may not be complete because an LDA may perform further alias expansion. Question: must the sum of these counters be the same as the number of original recipients? That is, ``all'' we have to do is to classify the original recipients into those three cases and then keep track of them? Answer: no. DSNs can be requested individually for each recipient. Hence the question should be: must the sum of these counters be less than or equal to the number of original recipients?
The main problem is how to deal with address expansions, i.e., addresses that resolve (via aliases) to others. RFC 1891 lists the following cases:
user "|/usr/bin/vacation user"
If DSNs are implemented properly the sender can determine herself whether she wants the full body or just the headers of her e-mail returned in a DSN. sendmail 8 has a configuration option to not return the body of an e-mail in a bounce (to save bandwidth etc). In addition to that, it might be useful to have a configuration option to return the body only in the first bounce but not in subsequent DSNs (see Section 2.4.6 about the problem to send only one DSN). So at least two options are necessary:
These options need to be combined with the DSN requests such that the ``minimum'' is returned, e.g., if option 2 is selected but the sender requests only headers, then only the headers are sent.
There are two aspects of load control:
Some of the measures can be applied to both cases (local/remote load, incoming/outgoing connections).
The queue manager must control local resource usage, by default it should favor mail delivery over incoming mail. To achieve this, the queue manager needs to keep the state of the entire system or at least it must be able to gather the relevant data from the involved sendmail X programs and the OS. This data must be sufficient to make good decisions how to deal with an overload situation. Local resources are:
Therefore the system state (resource usage) should include:
The queue manager must be able to limit the number of messages/resources devoted to a single site. This applies to incoming connections as well as to outgoing connections. It must also be possible to always allow connections from certain hosts/domains, e.g., localhost for submissions. This can be a fixed number or a percentage of the total number of connections or the maximum of both.
The queue manager must ensure that the delivery agents do not overload a single site. It should have an adaptive algorithm to use an ``optimal'' number of connections to a single site; these must be within specified limits (lower/upper bound) for site/overall connections. Question: how can the QMGR determine the ``optimal'' number of connections? By measuring the throughput or latency? Will the overhead for measurements kill the potential gain? Proposal: check whether the aggregate bandwidth increases with a new connection, or whether it stays flat. If connections are refused: back off.
The queue manager may use a ``slow start'' algorithm (TCP/IP, postfix) which gradually increases the number of simultaneous connections to the same site as long as delivery succeeds, and gradually decreases the number of connections if delivery fails.
Idea (probably not particularly good): use ``rate control'': don't just check how full the INCEDB is, but also the rate of incoming and ``leaving'' mails. Problem: how to count those? Possible solution: keep the numbers over certain intervals (5s), count envelopes (not recipients, deal with envelope splitting). If the incoming rate is higher than the departure rate and a certain percentage (threshold) is reached: slow down mail reception. If the leaving rate is higher than the incoming rate, the threshold (usage of INCEDB) could be increased. However, since more data is removed than added, the higher threshold shouldn't be reached at all. This can only be useful if we have different thresholds, e.g., slow down a bit, slow down more, stop, and we want to dynamically change them based on the rates of incoming and outgoing messages.
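A sketch of what such rate counters could look like, with an arbitrary 5 second interval and an arbitrary 80% threshold; the caller would increment the counters whenever an envelope is accepted or completely delivered.

#include <time.h>

#define RATE_INTERVAL 5            /* seconds */

struct rates {
    time_t   interval_start;
    unsigned incoming;             /* envelopes accepted in the current interval */
    unsigned leaving;              /* envelopes completely delivered/removed */
    unsigned prev_incoming;        /* counts of the last completed interval */
    unsigned prev_leaving;
};

/* returns 1 if mail reception should be slowed down */
int
rate_check(struct rates *r, time_t now, unsigned incedb_usage_percent)
{
    if (difftime(now, r->interval_start) >= RATE_INTERVAL) {
        r->prev_incoming = r->incoming;    /* close the current interval */
        r->prev_leaving  = r->leaving;
        r->incoming = r->leaving = 0;
        r->interval_start = now;
    }
    /* slow down reception if mail arrives faster than it leaves and the
     * incoming envelope DB is already fairly full */
    return r->prev_incoming > r->prev_leaving && incedb_usage_percent >= 80;
}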
All parts of sendmail X, esp. the queue manager, must be able to deal with local resource exhaustion, see also Section 2.15.7.
The queue manager must implement proper policies to ensure that sendmail X is resistant against denial of service attacks. Even though this can't be completely achieved (at least not against distributed denial of service attacks), there are some measures that can be taken. One of those is to impose restrictions on the number of connections a site can make. This applies not only to the currently open connections, but also to those over certain time intervals. For this purpose appropriate connection data must be cached, see Section 3.11.8.
Todo: structure the following items.
The queue manager does not need any special privileges. It will run as an unprivileged user.
The communication channels between the various modules (esp. between the QMGR and other modules) must be protected. Even if they are compromised, the worst that is allowed to happen is a local denial of service attack and the loss of e-mail. Getting write access to the communication channel must not result in enhanced privileges. It might sound bad enough that compromising the communication channels may cause the loss of e-mail, but consider that an attacker with write access to the mail queue directory may accomplish the same by just removing queued mail. There is one possible way to protect the communication even if an attacker can get write access to the channels: by using cryptography, i.e., an authenticated and encrypted communication channel (e.g., TLS). However, this is most likely not worth the overhead. It could be considered if the communication is done over otherwise unsecured channels, e.g., a network.
There are several alternatives for implementing an SMTP server daemon. However, before we take a look at those, some initial remarks are in order. We have to distinguish between the process(es) that listen(s) for incoming connections (on port 25 by default; in the following we will only write ``port 25'' instead of ``all specified ports'') and those processes that actually deal with an SMTP session. We call the former the SMTP listener and the latter the SMTP server, while SMTP server daemon is used for both of them. It might be that listeners and servers are different processes (passing open file descriptors from the listener to the server) or the same.
Interesting references about the architecture of internet servers are: [Keg01] for WWW server models and evaluations, [SV96c], [SV96a], and [SV96b] for some comparisons between C, C++, and CORBA for various server models, and [SGI01] for one particular thread model esp. designed for internet server/client applications. Papers about support of multi-threading by the kernel and in libraries are [ABLL92], [NGP02], [Dre02], and [Sol02].
An internet server application (ISA) reads data from a network, performs some actions based on it, and sends answer(s) back. The interesting case is when the ISA has to serve many connections concurrently. Each connection requires a certain amount of state. This state consists at least of:
There is a certain amount of hardware concurrency that must be efficiently used: processors (CPU, I/O), asynchronously operating devices, e.g., SCSI (send a command, do something else, get the result), and network I/O. There should be one process per CPU assuming the process is runnable all the time (or it invokes a system call that executes without giving up the CPU for the duration of the call); if the process can block then more processes are required to keep the CPU busy. We always need to have one thread of execution that is runnable. Unix provides preemptive, timesliced multitasking, which might not be the best scheduling mechanism for ISA purposes. Assuming that context switches are (more or less) expensive, we want to minimize them. This can be achieved by ``cooperative'' multitasking, i.e., context switches occur only when one thread of execution (may) block. Notice: this requires that no thread executes too long such that other threads may starve. This will be a problem if a thread executes a long (compute-intensive) code sequence, e.g., generation of an RSA key. Question: how can we avoid this problem? Maybe use preemptive multitasking, but make the timeslice long enough? As long as each thread only performs a small amount of work, it is better to let it execute its entire work up to a blocking function to minimize the context switch overhead. Question: can we influence the scheduling algorithm for threads? POSIX threads allow for different scheduling algorithms, but most OSs implement only one (timesliced, priority based scheduling).
An ISA should answer I/O requests as fast as possible since that allows clients (which are waiting for an answer) to proceed. Hence threads that are able to answer a request should have priority. However, a thread that performs a long computation must proceed too, otherwise its client may time out. So we have a ``classic'' scheduling problem. Question: do we want to get into these low-level problems or leave it to the OS?
The alternatives to implement SMTP server daemons are at least:
Disadvantage: threads require very careful programming; the program must never crash, even when running out of memory or on "fatal" errors in one of the threads (connections). Only that connection must be aborted and the problem must be logged.
Advantages: "crash resistant" (if one thread goes down, it can take down only one process). Probably the most important part of this solution is: it doesn't bind us to any particular model which may show deficiencies on a particular OS, configuration, or even in the long run of further OS development. We can easily tune the usage of processes and threads based on the OS requirements/restrictions. It is the most flexible (and most complicated) model, with which we can get around limitations in different OSs.
Disadvantages: pretty complicated. Selecting this model shouldn't be necessary for crash resistance since we don't have plugins that could easily kill smtpd, but we use external libraries (Cyrus SASL, OpenSSL). It requires extra effort to share data since we have multiple processes.
Todo: we have to figure out which of those models works best. A comparison [RB01] between 4a, 4b, and one process per connection clearly points to 4b. However, the tests listed in that article are not really representative of SMTP because no messages were sent. Moreover, it misses model 5. Even though there might be some data available about the performance of different models, most of it probably applies to HTTP servers (Apache) or caches (Squid). These are not really representative of SMTP because HTTP/1.0 is only one request/one response exchange where the response is often pretty short. SMTP uses several exchanges (some of which can be pipelined) and often transports larger data. HTTP/1.1 can be used to keep a connection open (multiple requests and answers) and might be better comparable in its requirements to SMTP. Question: is there performance data available for this? How about FTP (larger data transports, but often only one request)?
Notice: slow connections must be taken into account too. Those connections have very little (I/O, computational) requirements per time slice, but they take up as much context data as fast connections. An attempt should be made to minimize any additional data, e.g., process contexts, for these connections. If we for example use one thread per connection, then slow connections will take up an entire thread context, but rarely use it. A worker thread model reduces this overhead.
We need some basic prototypes to do comparisons, esp. on different OS to find out whether threading really achieves high performance (probably on Solaris, less likely on *BSD).
Question: if we choose 5 (current favorite), how do we handle incoming connections? Do we have a single process listen on port 25 and hand over the socket to another process (3.14.15)? This may create a bottleneck. Alternative: similar to postfix, have multiple processes do a select() on the port. One of them will get the connection. Possible problem(s):
Question: how much data sharing do we need in SMTPS? We need to know the number of open connections (overall, per host) and the current resource usage. It might be sufficient to have this only in the listener process (if we decide to go for only one) or it can be in the queue manager (which implements general policy and has data about past connections too). Another shared resource is the CDB which may require shared data (locking). If the transaction and session ids are not generated by the queue manager, then these require some form of synchronization too. In the simplest form, the process id might be used to generate unique ids (the MCP may be able to provide ids instead if process ids are not useful for this purpose because they may be reused too fast).
A SMTP session may consist of several SMTP transactions. The SMTP server uses data structures that closely follow this model, i.e., a session context and a transaction context. A session context contains (a pointer to) a transaction context, which in turn points back to the session context. The data is stored by the queue manager. The transaction context ``inherits'' its environment from the session context. The session context may be a child of a server daemon context that provides general configuration information. The session context contains for example information about the sending host (the client) and possibly active security and authentication layers.
Todo: describe a complete transaction here including the interaction with other components, esp. queue manager.
The basic control flow of an incoming SMTP connection has already been described in Section 2.1.5.
The whole command is passed to the QMGR which stores the relevant data in the incoming queue. Questions: whole or only relevant parts? Are there irrelevant parts? Do we send the original text or a decoded version? A decoded version seems better to avoid double work.
Notice: the server must check whether the first line it reads is a header line. If it isn't, it must put a blank line after its Received: header as a separator between header and body. If the first line starts with a white space character (LWSP), then a blank line must be inserted too. This should be covered by the ``is a header'' check because a header can't start with LWSP (it would be folded into the previous line).
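A sketch of such a check (not the actual sendmail X code): a line counts as a header line if it starts with a field name followed by a colon and does not start with LWSP.

#include <ctype.h>

/* Return 1 if a blank line must be emitted after the Received: header,
 * i.e., if the first line of the mail is not a header line. */
int
needs_separator(const char *first_line)
{
    const char *p = first_line;

    if (*p == ' ' || *p == '\t')      /* starts with LWSP: can't be a header */
        return 1;
    while (*p != '\0' && *p != ':' && *p != ' ' && *p != '\t'
           && isprint((unsigned char)*p))
        p++;                           /* scan the potential field name */
    return *p != ':' || p == first_line;   /* no field name before ':' */
}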
Other commands (like NOOP, HELP, etc.) can be treated fairly simply and are not (yet) described here.
Question: who removes entries from the CDB? Why should it be the SMTP server? The idea from the original paper was to avoid locking overhead since the SMTP server is the only one which has write access to the CDB. Note: if we use multiple SMTP server processes then we may run into locking issues nevertheless. The QMGR controls the envelope databases which contain the reference counters for messages. Hence it is the logical place to issue the removal command. However, it's still not completely clear which part of sendmail X actually performs the removal.
Misc:
Questions: which storage format should be used? Most likely: network format (CR LF, dot-stuffed). What about BDAT handling?
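Assuming network format is chosen, storing a line could look like the following sketch, which appends CR LF and performs dot-stuffing so the stored body can later be copied verbatim into an SMTP DATA phase (the function name and error handling are illustrative):

#include <stdio.h>

/* Write one text line to the CDB in "network format": dot-stuff it
 * (prepend a '.' if the line starts with one) and terminate it with CR LF. */
int
cdb_write_line(FILE *fp, const char *line)
{
    if (line[0] == '.' && fputc('.', fp) == EOF)    /* dot-stuffing */
        return -1;
    if (fputs(line, fp) == EOF)
        return -1;
    return fputs("\r\n", fp) == EOF ? -1 : 0;
}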
The SMTP server must provide similar anti-spam checks as sendmail 8 does. However, it must be more flexible. Currently it is very complicated to change the order in which things are tested. This causes problems in various situations. For example, before 8.12 it could happen that relaying was denied due to temporary failures even though the mail could have gone through. This was due to the fixed order in which the checks were run and the fact that the checks were stopped as soon as an error occurred, even if it was just a temporary error. This has been fixed in 8.12 but it was slightly complicated to do so.
The anti-spam checks belong in the SMTP server. It has all the necessary data, i.e., client connection and authentication (AUTH and STARTTLS) data, sender and recipient addresses. If the anti-spam checks are done by an outside module, all of this data needs to be sent to it. However, anti-spam checks most likely have to perform map requests, and such calls may block. It might be interesting to ``parallelize'' those requests, esp. for DNS based maps, i.e., start several of those requests and collect the data later on. This of course makes programming more complicated; it might be considered as an enhancement later on. We need to define a clean API and then it may be available as a library which can be linked into the SMTPS or the AR or another module.
The SMTP server must offer a way to check for valid (local) users (see Section 2.6.6). Otherwise mail to local addresses will be rejected only during local delivery and hence a bounce must be generated which causes problems due to forged sender addresses, i.e., they result in double bounces and may clog up the MTS.
See also Section 2.6.4.
There must be an option to rewrite envelope addresses. This should be separately configurable for sender and recipient addresses.
If sendmail acts as a gateway, it may rewrite addresses in the headers. This can be done by contacting an address rewrite engine. Question: should this be just another mode in which the address resolver can operate?
Question: what is the best way to specify when address rewriting should be used? It might be useful to do this based on the connection information, i.e., when sendmail acts as a gateway between the internal network and the internet.
It would be nice to implement the address rewriting as just another file type. In this case the SMTP servers could just open a file and output the data (message header) to that file. This file type is a layer on top of the CDB file type. The file operations are in this case stateful (similar to those for TLS). As soon as the body is reached, no more interference occurs. Using this approach makes the SMTP server simpler since it doesn't have to deal with the mail content itself.
It must be possible to specify to which headers address rewriting is applied. There are two different classes of header addresses: sender and recipient. These should relate to two different configuration options.
The SMTP server must bind to port 25, which can be done by the supervisor before the server is actually started provided the file descriptor can be transferred to the SMTP server. sendmail 8 closes the socket if the system becomes overloaded which requires it to be reopened later on, which in turn requires root privileges again.
The SMTP server may need access to an authentication database which contains secret information (e.g., passwords). In most systems access to this information is restricted to the root user. To minimize the exposure of the root account, access to this data should be done via daemons which are contacted via protected communication means, e.g., local socket, message queues.
In some cases it might be sufficient to make secret information only available to the user id under which the SMTP server is running, e.g., the secret key for a TLS certificate. This is esp. true if the information is only needed by the SMTP server and not shared with other programs. An example for the latter might be ``sasldb'' for AUTH as used by Cyrus SASL which may be shared with an IMAP server.
The address resolver (AR) has at least two tasks:
Hence the AR is not just for address resolving but also address rewriting. Other tasks might include anti-spam checks. Question: should the two main tasks (rewriting and resolving) be strictly separated?
Question: what kind of interfaces should the address resolver provide? Does it take addresses in external (unparsed) form and offer various modes, e.g., conversion into internal (tokenized) form, syntax check, anti-spam checks, return of a tuple containing delivery agent, host, address and optionally localpart and extension?
Question: who communicates with the AR? The amount of data the AR has to return might be rather large, at least if it is used to expand aliases (see Section 2.6.7). Using IPC for that could cause a significant slowdown compared to intra-process communication. So maybe the AR should be a library that is linked into the queue manager? Possible problems: other programs need easy access to the AR; security problems since AR and QMGR run with the same privileges in that case? Moreover, the AR certainly performs blocking calls which probably should not be in the QMGR. See also Section 3.6.3.4.
Usually the address resolver determines the next hop (mailer triple in sendmail 8) solely based on the envelope and connection information and the configuration. However, it might be useful to take also headers or even the mail body into account. Question: should this be a standard functionality of the AR or should this be only achievable via milters (see Section 2.10) or should this not be available at all? Decision: make this functionality only available via milters, maybe not even at all. It might be sufficient to ``quarantine'' mails (even individual recipients), or reject them as explained in Section 2.10.
sendmail 8 uses a two stage approach for most address lookups:
This approach has the advantage that it can avoid map lookups - which may be expensive (depending on the kind of maps) and in most cases several variations are checked - if the entry is not in a class. It has the disadvantage that the class data and the map data must be kept in sync, e.g., it is not sufficient to simply add some entries to a map, the domain part(s) must be added to the corresponding class first2.24.
sendmail 8 provides several facilities for mail routing (besides ruleset hacking):
As can be seen from the previous sections, there are operations that solely affect mail routing and there are operations that solely affect address rewriting. However, some operations affect both, because address rewriting is done before mail routing. Hence the order of operations is important. If address rewriting is performed before mail routing, then the latter is affected. If address rewriting is done after mail routing, then it applies only to the address part of the resolved address (maybe it shouldn't be called resolved address since it is more than an address).
Proposal: provide several operations (rewriting, routing) and let the user specify which operations to apply to which type of addresses and the order in which this happens.
Operations can be:
It might be useful for routing operations to not modify the actual address, i.e., if user@domain is specified, it can be redirected to some other host with the original address or with some new address, e.g., user@host.domain.
Some operations - like masquerading - only modify the address without affecting routing.
So for a clear separation it might be useful to provide two strictly separated set of operations for routing and rewriting. However, in many cases both effects (routing and rewriting) are required together.
Address types are: envelope and header addresses, recipient and sender addresses (others), so there are (at least) four combinations.
envelope-sender-handling { canonify }
envelope-recipient-handling { canonify, virtual, mailertable }
Question: is this sufficient?
It must be possible to specify valid recipients via some mechanism. In most cases this applies to local delivery, but there is also a requirement to apply recipient checks to other domains, e.g., those for which the system allows relaying.
Note that local recipients can often only be found in maps that do not specify a domain part, hence the local domains are separately specified. Question: is it sufficient if (unqualified) local recipients are valid for every local domain or is it necessary to have for each local domain a map which specifies the valid recipients? For example, for domain A check map M(A), for domain B check map M(B), etc. Moreover, the domain class would specify whether the corresponding map contains qualified or unqualified addresses. Other attributes might be: preserve case for local part, allow +detail handling, etc.
Configuration example:
local-addresses {
    domains = { list of domains };
    map { type=hash, name=aliases, flags={rfc2821} }
    map { type=passwd, name=/etc/passwd, flags={local-parts} }
}
Valid remote recipients can be specified via entries in an access map to allow relaying to specific addresses, e.g.,
To:user@remote.domain RELAY
If not all valid recipients are known for a domain for which the MTA acts as backup MX server, then an entry of the form:
To:@remote.domain error:451 Please try main MX
should be used.
There are different types of aliases: those which expand to addresses and those which expand to files or programs. Only the former can be handled by the address resolver; the latter must be handled by the local delivery agent for security reasons. Note: Courier-MTA also allows only aliases that expand to e-mail addresses; postfix handles aliases in the LDA. If alias expansion is handled by the LDA then an extra round trip is added to mail delivery. Hence it might be useful to have two different types of alias files according to the categorization above.
Problem: if the SMTP server is supposed to reject unknown local addresses during the RCPT stage, then we need a map that tells us which local addresses are valid. There are two different kinds: real users and aliases. The former can be looked up via some mailbox database (generalization of getpwnam()), the latter in the aliases database. However, if we have two different kinds of alias files then we don't have all necessary information unless the address resolver has access to both files. This might be the best solution: the address resolver just returns whether the address is valid. The expansion of non-address aliases happens later on.
The address resolver expands address aliases when requested by the queue manager. It provides also an owner address for mailing lists if available. This must be used by the queue manager when scheduling deliveries for those expanded addresses to change the envelope sender address.
The queue manager changes the envelope sender for mailing list expansions during delivery. RFC 2821 makes a distinction between alias (3.10.1) and list (3.10.2). Only in the latter case is the envelope sender replaced by the owner of the mailing list. Whether an address is just an alias or a list is a local decision. sendmail 8 uses owner-address to recognize lists.
Question: what to do about delivery to files or programs? For security reasons, these should never end up in the queue (otherwise someone could manipulate a queue file and cause problems; sendmail would have to trust the security of the queue file, which is a bad idea). In postfix aliases expansion is done by the local delivery agent to avoid this security problem. It introduces other problems because no checkpointing will be done for those deliveries (remember: these destinations - they are not addresses - never show up in queue files).
Notice: alias expansion can result in huge lists (large numbers of recipients). If we want to suppress duplicates, we need to expand the whole list in memory (as sendmail 8 does now). This may cause problems (memory usage). Since we can't act the same as older sendmail versions do (crash if running out of memory), we need to restrict the memory usage and we need to use a mechanism that allows us to expand the alias piecewise. One such algorithm is to open a DB (e.g., Berkeley DB; proposed by Murray) on disk and add the addresses to it. This will also detect duplicates if the addresses are used as keys. To avoid double delivery, expansion should be done in the local delivery agent and it must mark mails with a Delivered-To: header as postfix [Ven98] and qmail do. Should attempted double delivery (delivery to a recipient that is already listed as Delivered-To:) in this case cause a DSN? Question: is it ok to list all those Delivered-To: headers in an email? Does this cause an information leak? Question: is it ok to use Delivered-To: at all? Is this sanctioned by some RFC? Question: do we only do one level of expansion per alias lookup? This minimizes the problem of ``exploding'' lists, but it may have a significant performance impact (n deliveries for n-level expansion).
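A sketch of the proposed duplicate suppression using Berkeley DB (the surrounding code that creates and opens the DB handle is omitted, and the function name is illustrative): each expanded address is inserted as a key with DB_NOOVERWRITE; a DB_KEYEXIST result marks a duplicate.

#include <string.h>
#include <db.h>

/* Insert an expanded address into the on-disk expansion DB; return 1 if
 * it is a duplicate, 0 if it was added, -1 on error. */
int
expand_add(DB *dbp, const char *addr)
{
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = (void *)addr;
    key.size = (u_int32_t)strlen(addr) + 1;

    ret = dbp->put(dbp, NULL, &key, &data, DB_NOOVERWRITE);
    if (ret == DB_KEYEXIST)
        return 1;        /* duplicate: deliver only once */
    return ret == 0 ? 0 : -1;
}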
Question: should there be special aliases, e.g., ERROR, DEFER, similar to access map, that cause (temporary) delivery errors, or can those be handled by the access map?
Question: Who does .forward expansion?
551-try <new@address1>
551-try <new@address2>
551-try <new@address3>
551 try <new@address4>
Notice: whether a .forward file is in the home directory of a user, or whether it's in a central directory, or whether it's in a DB doesn't matter much for the design of sendmail X. Even less important is how users can edit their .forward files. sendmail X.0 will certainly not contain any program that allows users to authenticate and remotely edit their .forward file that is stored on some central server. Such a task is beyond the scope of the sendmail X design, and should be solved (in general) by some utilities that adhere to local conventions. Those utilities can be timsieved, scp, digitally signed mails to a script, updates via HTTP, etc.
How about the qmail approach to aliasing? Everything is just handled via one mechanism: HOME(user)/.alias[-extension]. System aliases are in HOME(aliases)/.alias[-extension]. This results in lots of file I/O which probably should be avoided.
Of course this wouldn't be flexible enough for sendmail; it must be possible to specify aliases via other means, e.g., maps. It might be better to put everything into maps instead of files spread out over the filesystem. In that case a program could be provided that allows a user to edit her/his own alias entry. However, such a program is certainly security critical, hence it may add a lot of work to implement properly; compare the passwd(1) command.
There have been requests to have other mechanisms than just aliases/.forward to expand an address to multiple recipients. We should consider making the AR API flexible enough to allow for this. However, there is (at least) one problem: the inheritance of ESMTP attributes, e.g., DSN parameters (see also Section 2.4.6). There are rules in RFC 1894 [Moo96a] which explain how to pass DSN requests for mailing lists and aliases. Hence for DSN parameters the rules for aliases should probably apply.
It would be nice to have per-user virtual hosting. This can relieve the admin of some work. Todo: Compare what other MTAs offer and at least make sure the design doesn't preclude this even though it won't be in sendmail X.0. Is contrib/buildvirtuser good enough?
qmail allows delegating virtual hosts to users via an entry in a configuration file, e.g., virthost.domain: user. Mail to address@virthost.domain goes to user-address. To keep the virtual domain, use virthost.domain: user-virthost.domain; address@virthost.domain then goes to user-virthost.domain-address. Problem: lots of little files instead of a table.
FallbackMXHost can be used in sendmail 8 to specify a host which is used in case delivery to other hosts fails (applies only to mailers for which MX expansion of the destination host is performed). It might be useful to make this more flexible:
The advantages/disadvantages of these proposals are not yet clear.
In theory, we could use the second proposal to have generic bounce and defer mailers. That is, if mail delivery fails with a permanent error, the default ``fallback'' will be a bounce mailer; if mail delivery fails with a temporary error, the ``fallback'' will be a defer mailer. This would allow maximum flexibility, but the impact on the QMGR (which has to deal with all of this) is not clear.
The address resolver should run without any privileges. It needs access to user information databases (mailbox database), but it does not need access to any restricted information, e.g., passwords or authentication data.
Initial mail submission poses interesting problems. There are several ways to submit e-mail all of which have different advantages and disadvantages. In the following section we briefly list some approaches. But first we need to state our requirements (in addition to the general sendmail X requirements that have been given in Chapter 1).
Don't use SMTP (because it's complicated and may reject mail), but SMSP (simple mail submission protocol), submission via socket. Possible problem: how to identify the other side? Always require authentication (SMTP AUTH)? Way too complicated. It's in general not possible to get the uid of the sender side, even for local (Unix socket) connections.
At the MotM (2002-08-13) option 3 was clearly favored.
See also [Ber] for more information about the problem of secure interprocess communication (for Unix).
Does this work? The files should not be world writable, so there must be some common group. Since it is not practical to have all users in the same group (and to ensure that this group is used when a queue file is written), this may not work after all. Run a daemon as root, notify it of new entries: cd queuedirectory, set*uid to the owner of queuedirectory, run the entry.
Possible pitfalls: on some systems chown is possible for non-root users!
Notice: in the first version we may be able to reuse the sendmail 8 MSP. This gives us a complete MHS without coding every part.
Since the initial mail submission program is invoked by users, it must be careful about its input. The usual measures regarding buffer overflows, untrusted data, parsing user input, etc. apply esp. to this program. See Section 2.14 for some information.
Todo: Depending on the model selected above describe the possible security problems in more detail.
There are several types of mail delivery agents in sendmail X, similar to sendmail 8. One of them acts as SMTP client, which is treated separately in Section 2.9. Another important one is the local delivery agent treated in Section 2.8.3.
Question: does a DA (esp. the SMTP client) check whether a connection is ``acceptable''? Compare sendmail 8: TLS_Srv, TLS_RCPT. It could also be done by the QMGR. The DA has the TLS information; it would need to send that data to the QMGR if the latter is to perform the check. That might make it simpler for the QMGR to decide whether to reuse a connection (see also Section 3.4.10.2; maybe the QMGR doesn't need this additional restriction for reuse). However, if it is a new connection it is simpler (faster) to perform that check in the DA.
Idea: instead of having a fixed set of delivery agents and an address resolver that ``knows'' about all of them, maybe a more modular approach should be taken. Similar to Exim [Haz01] and Courier-MTA [Var01] delivery agents would be provided as modules which provide their own address rewriting functions. These are called in some specified order and the first which returns that it can handle the address will be selected for delivery.
sendmail 8 uses a centralized approach: all delivery agents must be specified in the .cf file and the address rewriting must select the appropriate delivery agent.
sendmail X must provide a simple way to add custom delivery agents and to select them. It seems best to hook them into the address resolver, since that is the module which selects a delivery agent.
There must be a simple way to specify different delivery agents, i.e., their behavior and their features (see Section 3.8.2 for details). This refers not only to local delivery agents (2.8.3) and SMTP clients (2.9), but also to variants of those.
In addition to specifying behavior, actual instances must be described, i.e., the number of processes and threads that are (or can be) started and are available. These two descriptions are orthogonal, i.e., they can be combined in almost any way. The configuration must reflect this, e.g., by having two (syntactically separate) structures that describe the two specifications. For practical reasons, the following approach might be feasible:
Note: sendmail 8 only specifies delivery classes (called mailers); it does not need delivery instances because it is a monolithic program that implements the mailers itself or invokes them as external programs without restrictions. In sendmail X certain restrictions are imposed, i.e., the number of processes that can run as delivery agents or the number of threads are in general limited. Even though these limits might be arbitrarily high, they must be specified.
Example:
delivery-class smtp { port = 25; protocol = esmtp; }
delivery-class msaclient { port = 587; protocol = esmtp; }
delivery-class lmtp { socket = lmtp.sock; protocol = lmtp; }
delivery-agent mailer1 { delivery-classes = { esmtp, lmtp }; max_processes = 4; max_threads = 255; }
delivery-agent mailer2 { delivery-classes = { msaclient }; max_processes = 1; max_threads = 16; }
delivery-agent mailer3 { delivery-classes = { esmtp }; max_processes = 2; max_threads = 100; }
Notes:
A local delivery agent usually needs to change its user id to that of the recipient (depending on the local mail store; this is the common situation in many Unix versions). Since sendmail X must not have any set-user-id root program, a daemon is the appropriate answer to this problem (started by the supervisor, see Section 2.3).
Alternatively, a group-writable mailstore can be used as is done in most System V based Unix systems. A unique group id must be chosen which is only used by the local delivery agent. It must not be shared with MUAs as is done in some OSs. There is at least one problem with this approach: a user mailbox must exist before the first delivery can be performed. That requires that the mailbox is created when the user account is created and that no MUA removes the mailbox when it is empty. There could be a helper program that creates an empty mailbox for a user; however, it must run as root and hence has security implications.
The local delivery agent in sendmail X will be the equivalent of mail.local from sendmail 8. It runs as a daemon and speaks LMTP. By default, it uses root privileges and changes its user id to that of a recipient before writing to a mailbox.
There might be other local delivery agents which use the content database access API to maximize performance, e.g., immediate delivery while the sender is waiting for confirmation of the final dot of an SMTP session.
sendmail X.0 will use a normal SMTP client - which is also capable of speaking LMTP - as an interface between mail.local and the queue manager. That program implements the general DA API on which the queue manager relies. The API is described in Section 3.8.4. Later versions may integrate the API into mail.local.
If the LDA also takes care of alias (and .forward) expansion (see Section 2.6.7.1), then sendmail X must provide a stub LDA that interfaces with custom LDAs. The stub LDA must provide the interface to the QMGR and the ability to perform .forward expansion. Its interface to the custom LDAs should be via LMTP in a first approach.
The interface to the local delivery agents must be able to provide the full address as well as just the local part (plus extensions) in all required variations. There are currently some problems with LDAs that require the full address instead of just the local part which must be solved in sendmail X. Todo: explain problems and solution(s).
Mail delivery agents may require special privileges as explained above.
For obvious security reasons, the LDA will not deliver mail to a mailbox owned by root. There must be an alias (or some other method) that redirects mail to another account. The LDA should also not read files which require root privileges.
The SMTP client is one of the mail delivery agents (see Section 2.8).
Todo: describe functionality.
Similar to SMTPS there are several architectures possible, e.g., simple (preforked) processes, multi-threaded, event-driven, or using state-threads. We need to write similar prototypes to figure out the best way to implement the SMTP clients. It isn't clear yet whether it should be the same model as SMTPS. However, it might be best to use the same model to minimize programming effort.
There are basically two different situations for an SMTPC: open a new connection (new session) or reuse an existing connection (new transaction).
New session:
New transaction:
It might be useful if the data from the QMGR includes:
The SMTP client will run without root privileges. It needs only access to the body of an e-mail that it is supposed to deliver. However, it may need access to authentication data, e.g., for STARTTLS: a client certificate (can be readable by everyone) and a corresponding key (must be secured); for AUTH it needs access to a pass phrase (for most mechanisms) which also must be secured. For these reasons it seems appropriate that an SMTP client uses a different user id than other sendmail X programs, and gains access to shared data (mail body, interprocess communication) via group rights.
The Milter API should be extended in sendmail X, even though maybe not in the first version. However, sendmail X must allow for the changes proposed here.
Notice: if a milter is allowed to change recipient information (1) then the sendmail X architecture must allow for this. The architecture could be simpler if the address resolver solely depends on envelope information and the configuration. If it also depends on the mail content, then the address resolver must be called later during the mail transmission. This also defeats ``immediate'' delivery, i.e., delivery to the next hop while the mail is being received. The additional functionality will most likely be a requirement (too many people want to muck around with mail transmission). It would be nice to allow for both, i.e., mail routing solely based on envelope data, and mail routing based on all data (including mail content). There should be a configuration option which allows the MTA to speed up mail routing by selecting the first option.
Milters should run as a normal (unprivileged) user, but without any access to the sendmail X configuration/data files. The communication between the MTA and the milters must occur via protected means to prevent bogus milters from interfering with the operation of the MTA.
A program called mailq will show the content of the mail queue. Various options control the output format and the kind of data shown.
It might be useful to ask the queue manager to schedule certain entries for immediate (``as soon as possible'') delivery. This will also be necessary for the implementation of ETRN.
Some statistics need to be available. At least similar to mailstats in sendmail 8 and the data from the control socket. Data for SNMP should be made available. Maybe the rest is gathered from the logfile unless it can be provided in some other fashion. For example, it is probably not very useful to provide per-address statistics inside the MTA (QMGR). This would require too many resources and most people will not use that data anyway. However, it might be useful to provide hooks into the appropriate modules such that another program can collect the data in ``real-time'' without having to parse a logfile.
There should be some sophisticated programs that can give feedback about the performance of the MTA.
General question: how to allow access to the data? Should we rely on the access restrictions of the OS? That might be complicated since we probably have to use group access rights to share data between various modules of sendmail X. It is certainly not a good idea to give out those group rights to normal users. Moreover, some OSs only allow up to 8 groups for an account. Depending on the number of modules a program has to communicate with, this may cause problems.
Maps can be used to look up data (keys, LHS) and possibly replace a key with the result of the lookup; some lookups are only used to find out whether an entry exists in the map. Ordinary maps (databases) provide only a lookup function to find an exact match. There are many cases in which some form of wildcard matching needs to be provided. This can be achieved by performing multiple lookups as explained in the following.
In many places we will use maps to look up data and to replace it with the RHS of the map. Those places are: aliases, virtual hosting, mail routing, anti-spam, etc. There are some items which can be looked up in a map that need some form of wildcard matching. These are:
There's an important difference between 1 and 2: IP addresses have a fixed length, while hostnames can have a varying length. This influences how map entries for parts (subnets/subdomains) can be specified: for the former it is clear whether a map entry denotes a subnet, while this isn't defined for the latter, i.e., domain.tld could be meant to apply only to domain.tld or also to host.domain.tld.
We need a consistent way to define how a match occurs. This refers to:
dom.ain | RHS |
.dom.ain | RHS |
to match the domain itself and all of its subdomains.
So the lookup would be: the full name; then the name without its first component but with a leading dot; repeat the previous step as long as something is left, until a lookup succeeds.
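The following is a sketch of this lookup sequence in C; map_lookup() is a hypothetical exact-match lookup that returns the RHS or NULL and is only an assumption for illustration.

#include <stddef.h>
#include <string.h>

extern const char *map_lookup(const char *key);

const char *
domain_match(const char *host)
{
    const char *rhs;
    const char *p;

    if ((rhs = map_lookup(host)) != NULL)   /* full name, e.g., host.dom.ain */
        return rhs;
    p = host;
    while ((p = strchr(p, '.')) != NULL) {
        if ((rhs = map_lookup(p)) != NULL)  /* ".dom.ain", then ".ain", ... */
            return rhs;
        p++;                                /* drop the next component */
    }
    return NULL;                            /* no match found */
}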
*dom.ain | RHS |
This would be more in line with the expectation of normal users due to wildcard usage in shells or regular expressions.
@dom.ain | RHS |
dom.ain | RHS |
The first one will be an exact match, the second also matches all subdomains. However, this makes the lookup algorithm slightly more complicated: during the first lookup (full host name) it has to include the anchor (@), thereafter it must omit it. Moreover, the anchor is confusing if the entry doesn't apply to e-mail addresses but to connection information etc.
Lookup full address.
Lookup address with ++ if +detail exists and detail is not null.
Lookup address with +* if +detail exists.
Lookup address without +detail.
Replacement:
1 | user name |
2 | detail |
3 | +detail |
4 | omitted subdomain? |
As usual localparts are anchored with a trailing @ to avoid confusion with domain names.
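For illustration only (the address and map contents are hypothetical), and assuming the lookup is performed on the local part anchored with a trailing @, the lookups for user+ext@dom.ain would be tried in this order:
user+ext@
user++@ (only because the detail ``ext'' is not null)
user+*@
user@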
Notice: the ``detail'' delimiter should be configurable. sendmail 8 uses +, other MTAs use -. This may interfere with aliases, e.g., owner-list. Question: how to solve this problem?
As explained in Section 1.1.3.4, sendmail X must provide hooks for extensions. One possible way are modules, similar to Apache [ASF]. Modules help to avoid having large binaries that include everything that could ever be used.
Modules must only be loaded from secure locations. They must be owned by a trusted user.
This section contains hints and thoughts on how to design and implement programs, esp. those related to MTAs, to ensure they are secure.
It has not yet been decided whether the initial mail submission program (see Section 2.7) will be set-group-ID. No program in sendmail X is set-user-ID root.
A set-group-ID/set-user-ID program must operate in a very dangerous environment that can be controlled by a malicious user. Moreover, the items that must be checked vary from OS to OS, so it is difficult to write portable code that cleans up properly.
Only use root if absolutely necessary. Do not keep root privileges because they might be needed later on again, consider splitting up the program instead.
Avoid writing to root owned files. Question: is there any situation where this would be required? Avoid reading from files that are only accessible for root. This should only be necessary for the supervisor, since this program runs as root so its configuration file should be owned by root. Otherwise root would have to rely on the security of another account.
sendmail 8 treats programs and files as addresses. Obviously random people can't be allowed to execute arbitrary programs or write to arbitrary files, so sendmail 8 goes through contortions trying to keep track of whether a local user was ``responsible'' for an address. This must be avoided.
The local delivery agent can run programs or write to files as directed by $HOME/.forward, but it must always run as that user. The notion of ``user'' should be configurable, but root must never be a user. To prevent stupid mistakes, the LDA must make sure that neither $HOME nor $HOME/.forward are group-writable or world-writable.
Security impact: Having the ability to write to .forward, like .cshrc and .exrc and various other files, means that anyone who can write arbitrary files as a user can execute arbitrary programs as that user.
Do not assume that data written to disk is secure. If at all possible, assume that someone may have altered it. Hence no security relevant actions should be based on it.
The essence of user interfaces is parsing: converting an unstructured sequence of commands into structured data. When another program wants to talk to a user interface, it has to quote: convert the structured data into an unstructured sequence of commands that the parser hopefully will convert back into the original structured data. This situation is a recipe for disaster. The parser often has bugs: it fails to handle some inputs according to the documented interface. The quoter often has bugs: it produces outputs that do not have the right meaning. When the original data is controlled by a malicious user, many of these bugs translate into security holes (e.g., find | xargs rm).
For e-mail, only a few interfaces need parsing, e.g., RFC 2821 [Kle01] (SMTP) and RFC 2822 [Res01] (for mail submission). All the complexity of parsing RFC 2822 address lists and rewriting headers must be in a program which runs without privileges.
Security holes can't show up in features that don't exist. That doesn't mean that sendmail X will have almost no features, but we have to be very careful about selecting them and about their security and reliability impact.
Especially the availability of several options can cause problems if a program can access data that is not directly accessible to the user who calls it. This applies not only to set-group/user-ID programs, but also daemons that answer requests. This has been demonstrated by sendmail 8, e.g., debug flags for queue runners, which reveal private data.
C ``strings'' are inherently dangerous. Use something else which prevents buffer overflows.
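As a sketch of such an alternative, a counted-string type like the following avoids relying on NUL termination; the names are illustrative assumptions, not the sendmail X API.

#include <stddef.h>

typedef struct sm_str {
    size_t  len;    /* current length of the content */
    size_t  size;   /* allocated size of data */
    char   *data;   /* content, not necessarily NUL terminated */
} sm_str;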
There is more to security than just this programming advice. For example, a program should not leak privileged (private/confidential) information. This applies to data that is logged or made available via debugging options. A program must also avoid being abused to access data that it can read due to its privileges, i.e., it must not be tricked into making that data available to an attacker, neither for reading nor for writing.
Question: where does sendmail need privileged access? The following sections provide a list and hopefully solutions.
LDAP with Kerberos: there should be a way to do this without root privileges. This might be a documentation issue (kinit before starting sendmail, chown the ticket file).
PH map: it's possible... (how?)
Problem: the server may close(2) the socket due to errors or load conditions, e.g., RefuseLA, MaxChildren in sendmail 8. In that case the server needs to bind(2) to the port again later on. Since the server is not supposed to run with root privileges, another program (the MCP) must take care of that, i.e., it is notified of the problem and can either start a new server or pass an open fd to the server.
Note: binding (bind(2)) to a reserved port may not require root on all OS variants; there might be other access control methods, e.g., a different, privileged user id that is allowed to bind(2) to certain ports.
Alternatively, a server might not close(2) the listening socket; instead it could accept(2) the connection, return a 421 error, and then close(2) just that connection.
Side note: RefuseLA, MaxDaemonChildren should be configurable per DaemonPortOption.
Can we keep interfaces abstract and simple enough so we can use RPCs? This would allow us to build a distributed system. However, this must be a compile time option, so we can "put together" an MTA according to our requirements (all in one; some via shared libraries; some via sockets; some via RPCs). See also Section 3.1.1.
Rulesets, esp. check_*: make the order in which things happen flexible. Currently it's fixed in proto.m4, which causes problems (e.g., returning a temporary failure only in some parts requires a rewrite, and even then it's hard to maintain). Use subroutines and make the order configurable (within limits).
Use of mode bits to indicate the status of a file? E.g., for .forward: if +t is set, the file is being edited right now, so don't use it (temporary failure); for queue files: +x means completely written.
Can several processes listen on a socket? Yes, but there is a ``thundering herd'' problem: all processes are woken up, but only one gets the connection. That is inefficient for a large number of processes. However, it can be mitigated by placing a lock (e.g., on a lock file) around the call such that only one process at a time will do an accept(). See [Ste98] for examples.
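A minimal sketch of this mitigation (after the examples in [Ste98]), serializing accept(2) among preforked processes with an fcntl(2) lock; lockfd refers to a lock file opened before fork(2), and the function names are assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>

static void
lock_op(int lockfd, short type)
{
    struct flock fl;

    fl.l_type = type;           /* F_WRLCK to acquire, F_UNLCK to release */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 1;
    (void) fcntl(lockfd, F_SETLKW, &fl);
}

int
serialized_accept(int lockfd, int listenfd)
{
    int connfd;

    lock_op(lockfd, F_WRLCK);   /* only the lock holder blocks in accept() */
    connfd = accept(listenfd, NULL, NULL);
    lock_op(lockfd, F_UNLCK);
    return connfd;
}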
Configuration: instead of having global configuration options, why not have configuration functions? For example: Timeout.Queuereturn could be a function with user-defined input parameters (in the form of macros?):
Timeout.Queuereturn(size, priority, destination) = some expression.
This way we don't have to specify the dependency of options on parameters, but the user can do it. Is this worthwhile and feasible? What about the expression? Would it be a ruleset? Too ugly and not really useful for options (only for addresses). Example where this is useful: the FFR in 8.12 when milters are used (per Daemon). Example where this is already implemented: srv_features in 8.12 allows something like this.
Problem: which macros are valid for options? For example, recipient can be a list for most mails.
Configuration changes may cause problems because some stored data refers to a part of the configuration that is changed or removed. For example, if there are several retry schedules and an entry in the queue refers to one which is removed during a configuration change, what should we do? Or if the retry schedule is changed, should it affect ``old'' entries? sendmail 8 pretty much treats a message as new every time it processes it, i.e., it dynamically determines the actual delivery agent, routing information, etc. This probably can solve the problem of configuration changes, but it is certainly not efficient. We could invalidate stored information if the configuration changes (see also Section 3.11.6).
Most sendmail X programs must have a compilation switch to turn on profiling (not just -pg in the compiler). Such a switch will turn on code (and data structures) that collect statistics related to performance. For example, usage (size, hit rate) of caches, symbol tables, general memory usage, maybe locking contentions, etc. More useful data can probably be gathered with getrusage(2). However, this system call may not return really useful data on most OS. On OpenBSD 2.8:
long ru_maxrss;  /* max resident set size */
long ru_ixrss;   /* integral shared text memory size */
long ru_idrss;   /* integral unshared data size */
long ru_isrss;   /* integral unshared stack size */
seem to be useless (i.e., 0). On SunOS 5.7 the manual notes: only the timeval members of struct rusage are supported in this implementation.
A program like top might help, but that's extremely OS dependent. Unless we can just link a library call in, we probably don't want to use this.
There are various requirements for logging:
Note: a different approach to logging is to use the normal I/O (well, only O) operations and have a file type that specifies logging. The basic functionality for that is available in the sendmail 8/9 I/O layer. However, it seems that this approach does not fulfill the requirements that are stated in the following sections.
The logging in sendmail X must be more flexible than it was in older versions. There are two different issues:
About item 1: the current version (sendmail 8) uses LogLevel and the syslog priorities (LOG_EMERG, LOG_ERR, ...). The latter can be used to configure where and how to log entries via syslog.conf(5). The loglevel can be set by the administrator to select how much should be logged. Note: in some sense these are overlapping: syslog priorities and loglevels are both indicators of how important a log event is. However, the former is not very fine grained: there are only 8 priorities, while sendmail allows for up to 100 loglevels. Question: is it useful to combine both into a single level or should they be kept separate? If they are kept separate, is there some correlation between them? For example, it doesn't make sense to log an error with priority LOG_ERR but only if LogLevel is at least 50. ISC [ISC01] combines those into a single value, but it basically uses only the default syslog priorities and then additionally debug levels.
An orthogonal problem (item 2) is logging per ``functionality''. There are many cases where it is useful to select logging granularity dependent on functionalities provided by a system. This is similar to the debugging levels in sendmail 8. So we can assign a number to a functionality and then have a LogLevel per functionality. For example, -L8.16 -L20.4 would set the LogLevel for functionality 8 to 16 and for 20 to 4. Whether we use numbers or names is open for discussion.
syslog offers facilities (LOG_AUTH, LOG_MAIL, ..., LOG_LOCALx); however, the facility for logging is fixed during the openlog() call, it is not a parameter for each syslog() call. This is a serious drawback and makes the facility fairly useless for software packages that consist of several parts within a single process, like sendmail 8 (which performs authentication calls, mail operations, and acts as a daemon, at least).
ISC ([ISC01], see Section 3.14.16.1) offers categories and modules to distinguish between various invocations of the logging call. Logging parameters are per category, i.e., it is possible to configure how entries are logged per category and per priority. The category is similar to the syslog facility but it is an argument for each logging call and hence offers more flexibility. However, ISC does not offer loglevels beyond the priorities. A simple extension can associate loglevels with categories and modules. If the loglevel specified in the logging call is larger than the selected value, then the entry will not be logged.
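A minimal sketch of such an extension, checking a per-category loglevel before logging; the names (sm_log, LogLevel) are assumptions, not a defined sendmail X API.

#include <stdarg.h>
#include <stdio.h>

#define SM_LOG_NCAT 64

static unsigned int LogLevel[SM_LOG_NCAT];  /* configured level per category */

void
sm_log(unsigned int category, unsigned int level, const char *fmt, ...)
{
    va_list ap;

    if (category >= SM_LOG_NCAT || level > LogLevel[category])
        return;                 /* the entry is not logged at this level */
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);  /* a real implementation would use syslog() etc. */
    va_end(ap);
    fputc('\n', stderr);
}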
Misc: should we use a log number consisting of a 16 bit category and a 16 bit type number?
Logfiles must be easy to parse and analyze. For parsing it is very helpful if simple text tools like awk, sed, et al. can be used instead of requiring a full parser, e.g., one that understands quoting and escaping.
The basic structure of logfile entries is a list of fields which consist of a name and a value, e.g.,
name1=value1, name2=value2, ...
The problems here are whether the delimiters (space or comma) are unique, i.e., whether they do not appear in the name or the value of a field. While this is easy to guarantee for the name (because it's chosen by the program), values may contain those delimiters because they can be (indirectly) supplied by users. There are two approaches to solve this problem:
Proposal 1 is not easy to achieve since values are user controlled as explained before. Approach 2 seems to be more promising; even now there is some encoding happening in sendmail 8, i.e., non-printable characters are replaced by their octal representation (or in some cases simply by another character, e.g., '?'). A simple encoding scheme would be: replace space with underscore, escape underscore and backslash by a leading backslash. The decoding for this requires parsing to see whether underscores or backslashes were escaped. This encoding allows the use of space as a delimiter. A different proposal is not to use spaces as delimiters (and hence not to change them), but commas or another fairly unique character. Which character (besides the obvious ',' and ';') would be a ``good'' delimiter, i.e., would not commonly appear in a value?
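A sketch of the simple encoding scheme described above (space becomes underscore, underscore and backslash are escaped with a leading backslash); the function name is an assumption and the output buffer must provide at least 2 * strlen(in) + 1 bytes.

#include <stddef.h>

void
log_encode(const char *in, char *out)
{
    for (; *in != '\0'; in++) {
        if (*in == ' ')
            *out++ = '_';
        else {
            if (*in == '_' || *in == '\\')
                *out++ = '\\';
            *out++ = *in;
        }
    }
    *out = '\0';
}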
The logging functionality must be abstracted out, i.e., we have to come up with a useful API and provide one library for it, which would use syslog(). Other people can replace that library with their own, e.g., for logging into a database, files, or whatever.
A simple level of debugging can be achieved by turning on verbose logging. This may be done via additional logging options, e.g., -dX.Y, similar to sendmail 8, but the output is logged, not printed on stdout.
Additionally, there must be some way to start a program under a debugger. Remember: most programs are started by the supervisor, so it is not as simple as in sendmail 8 to debug a particular module. Take a look at postfix for a possible solution.
sendmail X must behave well in the case of resource shortages. Even if memory allocation fails, the program should not just abort, but act in a fail-safe manner. For example, a program that must always run even in the worst circumstances is the queue manager. If it can't allocate memory for some necessary operation, it must fall back to a ``survival'' mode in which it does only absolutely necessary things and shuts down everything else and then ``wait for better times'' (when the resource shortage is over). This might be accomplished by reserving some memory at startup which will be used in that ``survival'' mode.
postfix components just terminate when they run out of memory if a my*alloc() routine is used. This is certainly not acceptable for some parts of postfix, nor for sendmail X. Library routines especially shouldn't have this behavior.
sendmail X will be developed in several stages such that we relatively soon have something to test and to experiment with.
First, the design and architecture must be specified almost completely. Even if not every detail is specified, every aspect of the complete system must be mentioned and be considered in the overall design. We don't want to patch in new parts later which may require a redesign of some components, esp. if we have already worked on those.
However, the implementation can be done in stages as follows: the first version will consist only of a relaying MTA, i.e., a system that can accept mail via SMTP and relay it to delivery agents that speak SMTP or LMTP. This way we will have an almost complete system with basic functionality. The ``only'' parts that are missing are other delivery agents, an MSP, and some header rewriting routines etc.
This section contains the explanation of terms used in this (and related) documents. The terms are usually taken from various RFCs.
This chapter describes the external functionality as well as the internal interfaces of sendmail X as much as required, i.e., for an implementation by experienced software engineers with some knowledge about MTAs etc. In each section it will be made clear what the external functionality is and which API or other interface (in the form of a protocol) will be provided.
The internal functionality describes the interfaces between the modules of sendmail X. As such they must never be used by anything else but the programs of which sendmail X consists. These interfaces are subject to changes without notice. It is not expected that modules from different versions work with each other, even though it might be possible.
An external interface is one which is accessible by a user.
Many of the sendmail X modules require that function calls do not block. If a function can block then an asynchronous API is required. Unfortunately it seems hard to specify an API that works well with blocking and non-blocking calls. Hence the APIs will be specified such that most (or all) blocking functions are split in two: one to initiate a function (the initiate function) and another one (the result function) to get the result(s) of the function call. To allow for a synchronous implementation, the first (initiate) function should also provide result parameters and a status return. The status indicates whether the result parameters have valid values, or whether the result will be returned later on via the second (get result) function. This convention seems to make it possible to provide APIs that work for both (synchronous and asynchronous) kinds of implementations.
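A minimal sketch of this initiate/result convention for a hypothetical address resolver lookup; all names (sm_stat, ar_token, ar_result, ar_lookup*) are illustrative assumptions, not the sendmail X API.

typedef enum { SM_OK, SM_ASYNC, SM_ERROR } sm_stat;
typedef unsigned int ar_token;
typedef struct { char rhs[256]; } ar_result;

/* initiate: fills in *res and returns SM_OK if the call could be handled
 * synchronously; otherwise it returns SM_ASYNC and the result is delivered
 * later, identified by *tok */
sm_stat ar_lookup(const char *addr, ar_token *tok, ar_result *res);

/* result: called after notification that the answer for tok is available */
sm_stat ar_lookup_result(ar_token tok, ar_result *res);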
The calling program should not be required to poll for a result, but instead it should be notified that a result is available. This can be done via some interprocess notification mechanism, e.g., System V message passing or via a message over a socket through which both programs communicate. Then those programs can use select(2), poll(2), or something similar to determine whether a message is available. Note: this requires that both programs agree on a common notification mechanism, which can be a problem in case of (third-party) libraries. For example, OpenLDAP [LDA] does provide an asynchronous interface (e.g., ldap_search(3)), but it does not provide an ``official'' notification mechanism; it requires polling (ldap_result(3)). It should be possible to have a function associated with a message (type) such that the main loop is just a function dispatcher (compare the engine in libmilter).
In this section we describe how asynchronous ``function'' calls can be handled. For the following discussion we assume a caller C and a callee S. The caller C creates a request RQ that contains the arguments for a function F, and a token T which uniquely identifies the request RQ (we could write RQ(T)). Note: it is also possible to let S generate the token if the argument is a result parameter (pointer to a token which can be populated by S). Additionally, the argument list may contain a result parameter PR in which a result can be immediately stored if the call can be handled synchronously. C enqueues the fact that it made the function call by storing data describing the current state St(T) in a queue/database DB along with the identifier T. The request is sent to S (may be appended to a queue for sending via a single communication task instead of sending it directly) unless it can be handled immediately and returned via the result parameter PR (the return status must specify that the result has been returned synchronously). A communication task receives answers A from S. The answer contains the identifier T which is used to lookup the current data St(T) in the DB. Then a function (the result or callback function RC) is invoked with St and A as parameters. It decodes the answer A (which contains the function result) and acts accordingly, also taking the current state St into account and probably modifying it accordingly. Additionally, it may use the result parameter PR to store the result in the proper location. The result parameter may provide some buffer (or allocated) structure in which to store the result such that the callee is not responsible for the allocation. However, the details of this depend on the functions and their requirements, e.g., their output values (it may be just a scalar).
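A sketch of the per-request state St(T) that the caller C could keep in its queue/database DB; the member names and types are assumptions chosen to mirror the description above.

typedef unsigned int sm_token;

struct sm_request {
    sm_token   token;                   /* T: identifies the request RQ(T) */
    void      *state;                   /* St(T): caller state for the callback */
    void      *result;                  /* PR: optional result parameter */
    void     (*callback)(void *state,   /* RC: invoked with St(T), PR, and the answer A */
                         void *result,
                         const void *answer);
};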
To handle immediate status returns, the function that sends (enqueues) the request has to be able to deal with that. Such an implementation is an enhancement for a later version (probably not for 9.0).
There are several kinds of identifiers that are used within sendmail X for various purposes. Identifiers (handles, tags) in general must of course be unique during the time in which they are used to identify an object. Beyond this obvious purpose, some identifiers should fulfill other requirements.
If a handle is only used by the caller to identify an object in its memory, then it seems a good idea to use the memory address of that object as handle (as for example done by aio(3), see aio_return(3)). The advantages are:
It is important to use consistent naming; not just in the source code but especially in the configuration file.
In the following some proposals are given to achieve this. First about the structure of entries:
Next about the (components of) names used:
Question: how should classes (lists, sets) be specified? Should it be just a list of entries in the configuration file? sendmail 8 allows loading classes from other sources, e.g., from an external file, a program, or a map lookup. In the latter case a key is looked up in a map and the RHS (split?) is added to the specified class. It might be useful to also have a map which simply specifies the valid members as keys, i.e., look up a value and, if it is found, it is a member of the class for which the map is given. More complicated would be that the RHS contains a list of class names which would be compared against the class that is checked.
The supervisor doesn't have an external interface, except for starting/stopping it. Reconfiguring is probably done via stop/start, or maybe via sending it a HUP signal. The latter would be good enough to enable/disable parts of sendmail X, i.e., edit the configuration file and restart the supervisor. In the first version, there will be no special commands to enable/disable parts of sendmail X.
The supervisor can be configured in different ways for various situations. Example: on a gateway system local delivery is not required hence the LDA daemon is not configured (or started).
The MCP starts processes on demand, which can come from several sources:
The MCP keeps track of the number of processes and acts accordingly, i.e., when the upper limit is reached no further processes will be started; when a lower limit is reached more processes are started. Child processes also inform the MCP whenever they have finished a task and are available for a new one.
Case 2: If there is no process available then the MCP listens on the communication channel and starts a process when a communication comes in. In that case the connection will be passed to the new process (in the form of an open file descriptor). This is similar to inetd or postfix's master program. Question: how to deal with the case where processes are available? Should those processes just call accept() on the connection socket? That's the solution that the postfix master program uses; the child processes have access to the connection socket and all(?) of them listen to it (Wietse claims the thundering herd problem isn't really a problem since the number of processes is small).
Question: would it be more secure if only a program can request to start more copies of itself? However, that would probably increase the program complexity (slightly) and communication overhead. Decision: designated programs can request to start other programs.
Question: should the MCP use only one file descriptor to receive requests from all processes it started or should it have one fd per process or process type? The latter seems better since it allows for a clean separation, even though it may require (a few) more fds.
There are several types of shutdown:
Question: how to distinguish between immediate shutdown and ``normal'' shutdown? The former is probably triggered by a signal (SIGTERM), the latter by a command from the MCP (or some other control program).
The supervisor is responsible for shutting down the entire sendmail X system. To achieve this, the receiving processes are first told to stop accepting new connections. Current connections are either terminated (if immediate shutdown is required, e.g., system will be turned off fast) or are allowed to proceed (up to a certain amount of time). The queue manager will not schedule any delivery attempts any more and wait for outstanding connections to terminate. Delivery agents will be told to terminate (again: either immediately or orderly). The helper programs, e.g., address resolver, will be terminated as soon as the programs which require them have stopped.
The configuration of the MCP might be similar to the master process in postfix. It contains a list of processes, their types, the uid/gid they should run as, the number of processes that should be available, how they are supposed to be started, etc. It should also list how the processes communicate with the MCP (fork()/wait(), pipe), how often/fast a process can be restarted before it is considered to fail permanently. The latter functionality should probably be similar to (x)inetd.
Here's the example master.cf file from postfix:
Postfix master process configuration file. Each line describes how a mailer component program should be run. The fields that make up each line are described below. A "-" field value requests that a default value be used for that field.
Service: any name that is valid for the specified transport type (the next field). With INET transports, a service is specified as host:port. The host part (and colon) may be omitted. Either host or port may be given in symbolic form or in numeric form. Examples for the SMTP server: localhost:smtp receives mail via the loopback interface only; 10025 receives mail on port 10025.
Transport type: "inet" for Internet sockets, "unix" for UNIX-domain sockets, "fifo" for named pipes.
Private: whether or not access is restricted to the mail system. Default is private service. Internet (inet) sockets can't be private.
Unprivileged: whether the service runs with root privileges or as the owner of the Postfix system (the owner name is controlled by the mail_owner configuration variable in the main.cf file).
Chroot: whether or not the service runs chrooted to the mail queue directory (pathname is controlled by the queue_directory configuration variable in the main.cf file). Presently, all Postfix daemons can run chrooted, except for the pipe and local daemons. The files in the examples/chroot-setup subdirectory describe how to set up a Postfix chroot environment for your type of machine.
Wakeup time: automatically wake up the named service after the specified number of seconds. A ? at the end of the wakeup time field requests that wake up events be sent only to services that are actually being used. Specify 0 for no wakeup. Presently, only the pickup, queue manager and flush daemons need a wakeup timer.
Max procs: the maximum number of processes that may execute this service simultaneously. Default is to use a globally configurable limit (the default_process_limit configuration parameter in main.cf). Specify 0 for no process count limit.
Command + args: the command to be executed. The command name is relative to the Postfix program directory (pathname is controlled by the program_directory configuration variable). Adding one or more -v options turns on verbose logging for that service; adding a -D option enables symbolic debugging (see the debugger_command variable in the main.cf configuration file). See individual command man pages for specific command-line options, if any.
SPECIFY ONLY PROGRAMS THAT ARE WRITTEN TO RUN AS POSTFIX DAEMONS. ALL DAEMONS SPECIFIED HERE MUST SPEAK A POSTFIX-INTERNAL PROTOCOL.
DO NOT CHANGE THE ZERO PROCESS LIMIT FOR CLEANUP/BOUNCE/DEFER OR POSTFIX WILL BECOME STUCK UP UNDER HEAVY LOAD
DO NOT CHANGE THE ONE PROCESS LIMIT FOR PICKUP/QMGR OR POSTFIX WILL DELIVER MAIL MULTIPLE TIMES.
DO NOT SHARE THE POSTFIX QUEUE BETWEEN MULTIPLE POSTFIX INSTANCES.
# service type  private unpriv  chroot  wakeup  maxproc command + args
#               (yes)   (yes)   (yes)   (never) (50)
smtp      inet  n       -       n       -       -       smtpd
#628      inet  n       -       n       -       -       qmqpd
pickup    fifo  n       n       n       60      1       pickup
cleanup   unix  -       -       n       -       0       cleanup
qmgr      fifo  n       -       n       300     1       qmgr
#qmgr     fifo  n       -       n       300     1       nqmgr
rewrite   unix  -       -       n       -       -       trivial-rewrite
bounce    unix  -       -       n       -       0       bounce
defer     unix  -       -       n       -       0       bounce
flush     unix  -       -       n       1000?   0       flush
smtp      unix  -       -       n       -       -       smtp
showq     unix  n       -       n       -       -       showq
error     unix  -       -       n       -       -       error
local     unix  -       n       n       -       -       local
virtual   unix  -       n       n       -       -       virtual
lmtp      unix  -       -       n       -       -       lmtp
Interfaces to non-Postfix software. Be sure to examine the manual pages of the non-Postfix software to find out what options it wants. The Cyrus deliver program has changed incompatibly.
cyrus     unix  -       n       n       -       -       pipe
    flags=R user=cyrus argv=/cyrus/bin/deliver -e -m ${extension} ${user}
uucp      unix  -       n       n       -       -       pipe
    flags=Fqhu user=uucp argv=uux -r -n -z -a$sender - $nexthop!rmail ($recipient)
ifmail    unix  -       n       n       -       -       pipe
    flags=F user=ftn argv=/usr/lib/ifmail/ifmail -r $nexthop ($recipient)
bsmtp     unix  -       n       n       -       -       pipe
    flags=Fq. user=foo argv=/usr/local/sbin/bsmtp -f $sender $nexthop $recipient
First take at necessary configuration options for MCP:
Notice: using a syntax as in master.cf or inetd.conf is not a good idea, since it violates the requirements we listed in Section 2.2.2. The syntax must be the same as for the other configuration files for consistency.
The supervisor starts and controls several processes. As such, it has a control connection with them. In the simplest case these are just fork() and wait() system calls; in more elaborate cases it may be a socket over which status and control commands are exchanged.
The supervisor may set up data (fd, shared memory, etc.) that should be shared by different processes of the sendmail X system. Notice: since the MCP runs as root, it could set up sockets in protected directories to which normal users don't have access. Open file descriptors to those sockets (for communication between modules) could then be passed on to forked programs. This may help to increase the security of sendmail X due to the extra protection. However, it requires that the MCP sets up all the necessary communication files. Moreover, if a program closes the socket (as SMTPS may do for port 25) it can't reopen it anymore and hence must get the file descriptors from the MCP again (either by passing an fd or by terminating and being restarted). This may not be really useful, but it's just an idea to be noted.
The MCP will bind to port 25 as explained earlier (see Section 2.3) and hand the open socket over to the SMTPS (after changing the uid).
Question: Will this be a (purely) event driven program?
Question: worker model or thread per functionality? It won't be a single process which uses event based programming since this doesn't scale on multi-processor machines. It seems a worker model is more appropriate: it is more general and we might be able to reuse it for other modules, see also Section 3.18.2.
The queue manager doesn't have an external interface. However, it can be configured in different ways for various situations. Todo: and these are?
Such configuration options influence the behavior of the scheduler, the location of the queues, the size of memory caches, etc.
The queue manager will not schedule any delivery attempts any more. It will wait for outstanding connections to terminate unless an immediate shutdown is requested. The incoming queue will be flushed to the deferred queue. Delivery agents are informed by the MCP to stop. The QMGR is waiting for them to terminate and records the delivery status in the deferred EDB.
The status of the queue manager must be accessible from helper programs, e.g., see Section 2.11.1. The QMGR does not provide this as a user accessible interface, to allow for changing the internal protocols without having to change programs outside sendmail X.
Note: due to the complexity of the QMGR the following sections are not structured as subsections of this one because the nesting would become too deep otherwise.
The main index to access the deferred EDB will be the time of the next try. However, it is certainly useful to send entries to a host for which a DA has an open connection. Question: what do we use as an index to find entries in an EDB to reuse an open connection? An EDB stores canonified addresses or resolved addresses (DA, host, address), it does not store MX records. Those are kept in a different cache (mapping), if they are cached at all. We cannot use the host signature for lookups as sendmail 8 does for piggybacking, for the same reason: the host signature consists of the MX records. 8.12 uses an enhanced version for piggybacking, i.e., not the entire host signatures need to match, but only the first. Maybe it is sufficient for the selection of entries from an EDB to use the host name (besides all the other criteria as listed in Section 2.4.4.1). The actual decision to reuse a connection is made later on (by the micro scheduler, see Section 2.4.4.3). That decision is based on the host name/IP address and maybe the connection properties, e.g., STARTTLS, AUTH data.
If a DA has an open connection, then that data is added to the outgoing open connection cache (an incoming connection from a host may be taken as hint that the host can also receive mail). The hint is given to the schedulers, which may change their selection strategy accordingly. The first level scheduler can reverse map the host name/IP address to a list of destination hosts that can appear in resolved addresses. Then a scan for those hosts can be made and the entries may be moved upward in the queue.
Question: do we only look up queue entries for the best MX record (to piggyback other recipients on an open connection)? We could look up queue entries for lower priority MX records if we know the higher ones are unavailable. It may be a question of available resources (computational as well as network I/O) whether it is useful to perform such fairly complicated searches (and decision processes) to reuse a connection. In some cases it might be simpler (and faster) to just open another connection. However, it might be ``expensive'' to set up a connection, esp. if STARTTLS or something similar is used. A really sophisticated scheduler may take this into account for the decision whether to use an existing connection or whether to open new ones. For example, if the number of open connections is large then it is most likely better to reuse an existing connection. Those numbers (total and per domain) will be configurable (and may also depend on the OS resources).
Question: which host information does the QMGR send to the DA? Only the host from the resolved address tuple (so the DA does MX lookups), the MX list, only one host out of the MX list, or only one IP address? A ``clean'' interface would be to send only the resolved address and let the DA do the MX lookup etc. However, for piggybacking the QMGR needs to know the open connections and it must be able to compare those connections with the entries in the EDB. Hence the QMGR must do the MX expansion (using the AR or DNS helper processes).
Very simple outline of selection of next hop(s) for SMTP delivery:
The MX list and the address list are represented as linked lists, with two different kinds of links: same priority and lower priority. This can be coded either as two pointers or as a ``type of link'' value. If we use two pointers then we have to decide whether we fill in both pointers (if a record of that type is available) or only one. For example, let A, B, and C be MX records for a host with values 10, 10, and 20 respectively. Does A have only a (same priority) pointer to B, or does it have pointers to B and C? Is there a case where we do not go sequentially through the list? Maintaining two pointers is more effort which may not give us any advantage at all.
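A sketch of the simpler ``type of link'' representation mentioned above (one next pointer plus a link type); the names are assumptions for illustration only.

typedef enum { MX_SAME_PRI, MX_LOWER_PRI } mx_linktype;

struct mx_entry {
    char            *host;      /* host name from the MX record */
    unsigned short   pref;      /* MX preference value */
    mx_linktype      linktype;  /* relation of next to this entry */
    struct mx_entry *next;      /* NULL at the end of the list */
};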
The QMGR provides APIs for the different modules with which it communicates. The two most important interfaces are those to the SMTP server and the delivery agents, the former is discussed in Section 3.4.12.
Question: how does the QMGR control the DAs? That is, who starts a new DA if necessary? This should be the MCP, since it starts (almost) all sendmail X processes. However, what if a DA is already running and the QMGR just wants more? Does the MCP start more or does a DA fork()? Probably the former, which means the QMGR and the MCP must be able to communicate this type of data. Do the DAs terminate themselves after some (configurable) idle time? That should be a configuration option in the MCP control file, see Section 3.3.4.
Question: does the QMGR have each of the following functions per DA or are these generic functions which take the name/type of the DA as argument and are dispatched accordingly?
General side note about (communication) protocols: it seems simpler that the caller generates the appropriate handles instead of the callee. The caller has to pass some kind of token (handle/identifier) anyway to the callee to associate the result it is getting back with the correct data (supplied via the call). If this would be a synchronous call, then the callee could generate the handle, but since we must deal with asynchronous calls, we must either generate the handle ourselves such that the callee can use it to identify the returned data, or the calling mechanism itself must generate the data, which however makes this too complicated.
Create/request a new DA (or maybe several). The properties etc. are defined in da-description.
Get result(s) from creating/requesting a new DA.
Stop a DA (all DAs of that type).
Get result of stopping a DA.
Open a connection for delivery and send one mail. session contains the destination host to which to connect and some requirements, e.g., STARTTLS, AUTH. If we only want to open a connection, transaction and da-trans-handle can be NULL.
Get results of opening a connection for delivery and sending one mail. The status can be fairly complicated since the operation can fail for various reasons in different stages, see 3.8.4.1.
Perform one delivery (maybe to multiple recipients), transaction contains necessary information for session.
Get results for delivery attempt. This can be a state for the entire transaction, or per recipient depending on the DA and the actual delivery.
Notice: da-session-handle might not be necessary, da-trans-handle is sufficient to identify the transaction. However, an implementation might prefer to get also the session handle.
Close a connection (session).
Get result of closing a connection (session).
Notice: the handles (identifiers) from the DA (da-trans-handle, da-session-handle) are not related to the transaction/session identifiers of SMTPS. That is, we do not ``reuse'' those identifiers except for transaction identifiers for the purpose of logging. We only need those handles to identify the sessions/transactions in SMTPC and QMGR, i.e., to associate the data structures (and maybe threads) that describe the sessions/transactions. We can generate the identifiers in a DA (SMTPC) similarly as in SMTPS; the differences are:
There are several different EDBs in the QMGR: active queue (AQ or ACTEDB), incoming queue (INCEDB; two variants: IBDB: backup on disk and IQDB: in memory only), and main (deferred) queue (DEFEDB). Data must be transferred between those DBs in various situations, e.g., for scheduling data is taken from IQDB or DEFEDB and put into ACTEDB. Doing so involves copying of data and maybe allocating memory for referenced data structures, e.g., mail addresses, and then copying the data from one place (Src) into another (Dst). This problem can be solved in two ways:
Another approach is to use the same data structures for all/most EDBs with an additional type field that defines which data elements are valid. This way copying is either not necessary or can be done almost one-to-one in most cases. The disadvantage of this approach is the potential waste of some memory and fewer chances for typechecks by the compiler (however, more generic functions might be possible). Moreover, the data requirements for incoming and outgoing envelopes are fairly different, so maybe those should be separate data structures.
Maybe only ACTEDB and DEFEDB data structures should be identical, or at least ACTEDB structs should be a superset of DEFEDB structs. Otherwise we need to update entries in DEFEDB by reading the data in (into a structure for DEFEDB), modifying the data (according to the data in ACTEDB), and writing the data out.
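One possible shape of such a shared structure is sketched below in C; the names are illustrative only. A type field records which queue the entry currently belongs to and therefore which members are meaningful, so moving an entry between EDBs is mostly a matter of adjusting the type and filling in the few fields the target queue needs.

/* illustrative sketch; names are hypothetical */
#include <time.h>

typedef enum {
        EDB_T_IQDB   = 0x01,    /* entry lives in the incoming cache          */
        EDB_T_ACTEDB = 0x02,    /* entry is in the active queue               */
        EDB_T_DEFEDB = 0x04     /* entry is in (or destined for) DEFEDB       */
} edb_type_T;

typedef struct edb_rcpt_S {
        edb_type_T       type;           /* which of the fields below are valid   */
        char            *trans_id;       /* transaction identifier                */
        char            *rcpt_spec;      /* recipient address (original form)     */
        char            *rcpt_internal;  /* resolved tuple: DA, host, address
                                          * (ACTEDB/DEFEDB only)                  */
        time_t           next_try;       /* next delivery attempt (DEFEDB only)   */
        int              d_stat;         /* last delivery status                  */
} edb_rcpt_T;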
Section 2.4.3.3 describes the data flow of envelopes between the various EDB that QMGR maintains, while Section 2.4.3.6 describes the changes required for cut-through delivery. This section specifies the functional behavior for the latter.
When the final data dot is received by the SMTP server, it sends a CTAID record to QMGR including the information that this transaction is scheduled for cut-through delivery. This information is provided by QMGR in its replies to RCPT records: if all of them are flagged by QMGR for cut-through delivery, the transaction is scheduled for it. After receiving CTAID, QMGR decides whether the scheduler will go ahead with cut-through delivery. If it does not, it sends back a reply code for case 1b below and proceeds as for normal delivery mode. Otherwise all recipients are transferred to AQ immediately, the recipients and the transaction are marked accordingly, and delivery attempts are made. Moreover, a timeout is set for the transaction after which case 1b is selected.
Question: should the data be in some IBDB list too?
For case 1b the SMTP server needs to send another message to QMGR telling it the result of fsync(2). If fsync(2) fails, the message must be rejected with a temporary error, however, QMGR may already have delivered the mail to some recipients, hence causing double deliveries.
To minimize disk I/O an envelope database cache (EDBC) is used (see Section 3.4.6.2). As explained in Section 2.4.3.5.1 the cache may not always contain references to all entries in DEFEDB due to memory restrictions. Such a restriction can either be artificially enforced (by specifying a maximum number of entries in EDBC) or occur indirectly if the program runs out of memory when trying to add a new entry to the cache (see also Section 3.4.17.1). In that case, the operation mode for reading entries from the deferred queue must be changed (from cached to disk). In disk mode the entire DEFEDB is read at regular intervals to fill up EDBC with the youngest entries, which in turn are read at the appropriate time from DEFEDB into AQ. Question: how can those ``read the queue'' operations be minimized? It is important that EDBC is properly maintained so that it contains the ``youngest'' entries, i.e., all entries in DEFEDB that are not referenced from EDBC have a ``next time to try'' that is larger than that of the last entry in EDBC. Question: how can this be guaranteed? Proposal: when switching from cached to disk mode, set a status flag to keep track of the mode and store the maximum ``next time to try'' currently referenced by EDBC. When inserting entries into EDBC, ignore everything that has a ``next time to try'' greater than that maximum. If entries are actually added and hence older entries are removed, set a new maximum accordingly. Perform a DEFEDB scan when EDBC is empty or only filled below some threshold, e.g., up to ten per cent. If all entries from DEFEDB can be read, reset the mode to cached.
In case of an unclean shutdown there might be open transactions in IBDB. Hence on startup the QMGR must read the IBDB files and reconstruct data as necessary. The API for IBDB is specified in Section 3.11.4.3. It allows an application to read all records from the IBDB. To reconstruct the data, the function sequentially reads through IBDB and stores the data in an internal data structure (in the following called RecDB: Reconstruction DB) that allows access via transaction/recipient id. The entries are ordered, i.e., the first entry for a record has the state open, the next entry has the state done (with potentially more information, e.g., delivered, transferred to DEFEDB due to a temporary/permanent failure, or cancelled). The second entry for a record might be missing, which indicates an open transaction that must be taken care of. For each done transaction the corresponding open entry is removed from RecDB. After all records have been read, RecDB contains all open transactions. These must be added to DEFEDB or AQ. If they are added only to the latter, then we still need to keep the IBDB files around until the transactions are done, in which case a record is written to IBDB. This approach causes problems if the number of open transactions exceeds the size of AQ, in which case an overflow mechanism must kick in, e.g., either delaying further reading of IBDB or writing the data to DEFEDB. In the first sendmail X version the data should be transferred to DEFEDB instead for simplicity. Even with this simpler approach there are still some open problems:
About 1: If yes, a faster startup time is achieved since the QMGR does not need to wait for the reconstruction. This of course causes other problems, e.g., what to do if the reconstruction runs into problems3.4? There are further complications with the order of operations: the reconstruction must be performed before the new IBDB is opened unless either a different name is used or the sequence numbers start after the last previously used entry. In the former case some scheme must be used to come up with names that will not cause problems if the system goes down while the reconstruction is still ongoing. In the latter case the problem of a wrap-around must be dealt with, i.e., what happens if the sequence number reaches the maximum value? A simple approach would be to start over at 1 again, but then it must be ensured that those files are no longer in use (which seems fairly unlikely to happen if, for example, a 32 bit unsigned integer is used, because the number of files would be huge, definitely larger than what is sane to store in a single directory3.5).
A potential solution to the problem of overlapping operation is to use different sets of files, e.g., two different directories: ibdbw for normal operation, ibdbr for recovery. In that case at startup the following algorithm is used: If both ibdbw and ibdbr exist then the recovery run from the last startup didn't finish properly. Hence the rest of QMGR is not started before completing recovery, i.e., asynchronous operation is only allowed if the previous recovery succeeded. This simple approach allows us to deal with recovery problems without introducing an unbound number of IBDB file sets (directories). If only ibdbw exists, then rename it to ibdbr and let the QMGR startup continue while recovery is running. After recovery finished, ibdbr is removed thus indicating successful recovery for subsequent startups.
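A minimal sketch of this startup decision in C is shown below; ibdb_recover(), start_async_recovery(), and remove_dir() are hypothetical helpers, and the directory names follow the convention above.

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

extern int  ibdb_recover(const char *dir);          /* hypothetical */
extern void remove_dir(const char *dir);            /* hypothetical */
extern void start_async_recovery(const char *dir);  /* hypothetical */

static bool dir_exists(const char *path)
{
        struct stat st;

        return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* returns 0 on success, -1 if (synchronous) recovery failed */
int ibdb_startup(void)
{
        if (dir_exists("ibdbr")) {
                /* a previous recovery did not finish: complete it before
                 * the rest of QMGR is started (synchronous recovery) */
                if (ibdb_recover("ibdbr") != 0)
                        return -1;
                remove_dir("ibdbr");
        }
        if (dir_exists("ibdbw")) {
                /* normal case: move the logs aside and recover them
                 * asynchronously while QMGR starts up */
                if (rename("ibdbw", "ibdbr") != 0)
                        return -1;
                start_async_recovery("ibdbr");  /* removes ibdbr when done */
        }
        /* create an empty ibdbw for the new incarnation */
        return mkdir("ibdbw", 0700) == 0 ? 0 : -1;
}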
This approach can be extended to deal with a cleanup process that ``compresses'' IBDB files by removing all closed transactions from them. This cleanup process uses a different directory ibdbc in which it writes open transactions similar to the recovery program. That is, it reads IBDB files from ibdbw, gets rid of closed transactions in the same way as the recovery function does, and writes the open transactions into a new file. After this has been safely committed to persistent storage, the IBDB files that have been read can be removed. The recovery function needs to read in this case the files from ibdbw and ibdbc to reconstruct all open transactions. The cleanup process should minimize the amount of data to read at startup. Such a cleanup function may not be available in sendmail X.0; however, if the MTA runs for a long time it may require a lot of disk space if there is no cleanup task. Question: is cleaning up a recursive process? If so, how to accomplish that? In a simple approach two ``versions'' can be used between which the cleanup task toggles, i.e., read from version 1 and write to version 2, then switch: read from version 2 and write to version 1. Question: how easy is it to keep track of this?
An example of the missing data mentioned as problem 2 in the list of problems is the start time (of the transaction) which is not recorded. This can be either taken in first approximation from the creation time of the IBDB file, or the current time can be simply used (which might be off by a fair amount if the system is down for a longer time, which, however, should not happen).
Question: is there a way to avoid writing data when cleaning up IBDB? The cleanup task could ``simply'' remove IBDB files that contain only references to closed transactions. We may not even have to read any IBDB files at all since the data can be stored in IQDB, i.e., a reference to the IBDB logfile (sequence number). This requires a reference counter for each IBDB logfile which contains the number of open transactions (TA/RCPTs). When both reference counters reach zero the logfile can be removed. Note: this violates the modularity: now IQDB is used to store data that is related to the implementation of IBDB, i.e., both are tied together. Currently IQDB is fairly independent of the implementation of IBDB, e.g., it does not know about sequence numbers. Now there must be a way to return those numbers to IBDB callers and they must be stored in IQDB. Moreover, there are functions that do not allow for a simple way to do this, e.g., those which act on request lists. In this case it would be fairly complicated to associate the IBDB data with the IQDB data. However, at the expense of memory consumption, the data could be maintained by the IBDB module itself. In this case it behaves similarly to the IBDB recovery program, i.e., it stores open transactions (only a subset of data is necessary: identifier and sequence number) in an internal (hash) table, and matches closed transactions against existing entries to remove them. Periodically it can scan the table to find out which sequence numbers are still in use and remove unused files.
A different (simpler?) approach to the problem is to use time-based cleanup. Open transactions that are stored in IBDB are referenced by IQDB (at least with the current implementation) or AQ (entries which are marked to come from IQDB). Periodically these entries can be scanned to find the oldest. All older IBDB logfiles can be removed. Note: there should be a correctness proof for this before it is implemented.
An interesting question is in which format recipient addresses are stored in the EDB, i.e., whether only the original recipient address is stored or whether the resolved format (DA, host, address, etc.) is stored too. If we use the resolved (expanded) format then we need to remove/invalidate those entries in case of reconfigurations. These reconfigurations may change DAs or routing. A possible solution is to add a time stamp to the resolved form. If that time stamp is older than the last reconfiguration then the AR must be called again. However, the routing decision may have been based on maps which have changed in between, hence this isn't a complete solution. It may require a TTL based on the minimum of all entries in maps used for the routing decision. Alternatively we can keep the resolved address ``as-is'' and not care about changes. For example, some address resolution steps happen in sendmail 8 before the address is stored in the queue, some happen afterwards. An example of the former is alias expansion, which certainly should not be done every time. So the only address resolution that happens during delivery are DNS lookups (MX records, A/AAAA records), and those can be cached since they provide TTLs. We might make the address resolution a clean two step process:
It might be an interesting idea to provide a cache of address mappings. However, such a cache cannot be simply for domains since it might be possible to provide per-address routing. The cache may be for DNS (MX/A/AAAA) lookups nevertheless, i.e., a ``partially'' resolved address maps to a domain which in turn maps to a list of destination hosts. This is fairly much what sendmail 8 does:
These two steps are clearly related to the two steps listed above.
Incomplete (as of now) summary: Advantages of storing resolved addresses:
Disadvantages of storing resolved addresses:
If an address ``expires'' earlier than the computed ``next time to try'' then it probably is not useful to store the resolved address in DEFEDB. However, if the scheduler decides to try the address before the ``next time to try'', e.g., because a host is available again, then the resolved address might still be useful.
See also Section 3.4.15 for further discussion of this problem.
We also need to mark addresses (similar to sendmail 8 internally) to denote their origin, i.e., original recipient, expanded from alias (alias or list), etc.
As described in Section 2.4.7 the scheduler must control the load it generates.
One of the algorithms the scheduler should implement is ``slow start'', i.e., when a connection to a new host is created, only a certain number of initial connections must be made (``initial concurrency''). When a session/transaction has been successful, the number of allowed concurrent connections can be increased (by one for each successful connection) until an upper limit (``maximum concurrency'') is reached. This varying concurrency limit is usually called a ``window'' (see TCP/IP). If a connection fails, the size of the window is decreased until it reaches 0, in which case the destination host is declared ``dead'' (for some amount of time).
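A minimal sketch of the corresponding per-destination bookkeeping in C, assuming an entry in the open connection cache; all names are illustrative:

#include <stdbool.h>
#include <time.h>

typedef struct occ_entry_S {
        unsigned int init_conc;    /* initial concurrency                  */
        unsigned int max_conc;     /* maximum concurrency                  */
        unsigned int cur_conc;     /* current window                       */
        unsigned int open_conn;    /* currently open connections           */
        time_t       dead_until;   /* window reached 0: host dead until    */
} occ_entry_T;

void occ_success(occ_entry_T *occ)
{
        if (occ->cur_conc < occ->max_conc)
                occ->cur_conc++;                /* widen the window by one */
}

void occ_failure(occ_entry_T *occ, time_t now, time_t dead_time)
{
        if (occ->cur_conc > 0)
                occ->cur_conc--;                /* shrink the window */
        if (occ->cur_conc == 0)
                occ->dead_until = now + dead_time;      /* declare host dead */
}

bool occ_may_connect(const occ_entry_T *occ, time_t now)
{
        return now >= occ->dead_until && occ->open_conn < occ->cur_conc;
}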
Question: which kind of failures should actually decrease the window size besides any kind of network I/O errors? For example, a server may respond with 452 Too many connections but sm9 will not try to interpret the text of the reply, only the reply code.
The queue manager uses several data structures to store the status of the system and the envelopes of mails for which it has to schedule delivery. These are listed in the next sections.
Question: are the connection caches indexed on host name or IP address? Problem: there is no 1-1 relation between host names and IP addresses, but an N-M relation. This causes problems for finding recipients to send over an open connection. A recipient address is mapped to (a DA and) a destination host, which is mapped to a list of MX records (host names) which in turn are mapped to IP addresses. Since there can be multiple address records for a host name and multiple PTR records for an IP address, we have a problem. The general problem is what to use as index for the connection caches. A smaller problem results from the expiration (TTL) of DNS entries (all along the mapping: MX, A/AAAA, and PTR records). Question: do load balancers further complicate this or can we ignore them for our purposes? Possible solution: use the host name as index, provide N-M mappings for host name to IP address and vice versa. These mappings are provided by DNS. A simpler (but more restrictive) solution is to use the IP address and the host name together as index (see Exim [Haz01]). Question: do we want our own (simpler) DNS interface? We need some (asynchronous) DNS ``helper'' module anyway, maybe that can add some caching and a useful API? We shouldn't replicate caches too often due to the programming and memory usage overhead. So how do we want to access the connection caches? If a host name maps to several IP addresses, must it be a single machine? Does it matter wrt SMTP? Is a host down or an IP address? It could be possible that some of the network interfaces are down, but the host can still receive e-mail via another one.
So the QMGR to DA interface should provide an option to tell the DA to use a certain IP address for a delivery attempt because the QMGR knows the DA has an open connection to it. Even though this is a slight violation of the abstraction principle, the QMGR scheduled this delivery because of the connection cache, so it seems better than letting the DA figure out which of its open connections to use (by looking up addresses, etc.).
Problem: a connection (session) has certain attributes, e.g., STARTTLS, AUTH restrictions/requirements. These session-specific options can depend on the host name or the IP address (in sendmail 8: TLS_Srv:, AuthInfo:). This makes it even more complicated to reuse a connection. If a host behaves differently based on which IP address it has been contacted under, or if different requirements are specified for host name/IP addresses, then connection reuse is significantly more complicated. In the worst case this could result in bounces that happen only in certain situations. It is certainly possible to document this behavior on the sendmail side, but if the other (server) side shows behavior that depends on the connection information then we have a problem.
A connection is made to an IP address, the server only knows the client IP address (plus port, plus maybe ident information, the latter should not be used for anything important). SMTP doesn't include a ``Hi, I would like to speak with X'' greeting, but only a ``Hi, this is Y'' (which might be considered a design flaw), so the server can't change its behavior based on whom the client wants to speak to (which would be useful for virtual hosting), but only based on the connection information (IP addresses, ports). Hence when a connection is reused (same IP address) the server can't change its behavior. Problem solved? Not really, someone could impose weird restrictions based on sender addresses. It seems to be necessary to make this configurable (some external map/function can make the decision). This probably can be achieved by imposing a transactions per session limit, see also Section 2.4.8.
Another solution might be to base connection reuse on host name and IP address3.6. This may restrict connection reuse more than necessary, but it should avoid the potential problems. Maybe that should be a compile time option? Make sure that the code does not depend completely on the type of the index.
There have been requests to perform SMTP AUTH based on the sender address, which of course invalidates connection reuse. It's one (almost valid) example of additional requirements for a connection. It's only almost valid, since SMTP AUTH between MTAs is not for user authentication, it is used to authenticate the directly communicating parties. However, SMTP AUTH allows some kind of proxying (authenticate as X, authorize to act as Y), which seems to be barely used.
Note: connection reuse requires that the delivery agent is the same; if two recipients resolve to different delivery agents -- even for the same IP address and host name -- then the connection will not be reused. In some cases this seems rather useless3.7, hence maximum flexibility would be reached by establishing congruence classes of mailers with respect to connection reuse. If not just the mailer definitions but also some connection attributes (see the beginning of this section) are used to create those classes, then we may have found the right approach to connection reuse.
See Section 2.4.6 for the different types of DSNs and the problems we may encounter. We need to store (at least?) three different counters as explained in Section 2.4.6.1. DSNs can be requested individually for each recipient. Hence the sum of these counters is less than or equal to the number of original recipients. Question: could it ever increase due to alias expansion? We could store the number of requested DSNs for each type and then compare the number of generated DSNs against them. If any of the generated DSN counters reaches the requested counter we can schedule the DSN for delivery, and hence we can be sure (modulo the question above) that all DSNs of the same type can be merged into one. Question: do we really need those counters? Or do we just generate a DSN whenever necessary and schedule it for delivery (with some delay)? Then the DSN generation code could look whether there is already a DSN generated and add the new one to it, as far as this is possible since the scheduler has to coalesce those DSNs. Problem: there is extra text (the error description) that should be put into one mail. How to do this? See also Section 2.4.6 about the question how to generate the body of DSNs.
As explained in Section 2.4.1, the incoming queue consists of two parts: a restricted size cache in memory and a backup on disk.
The incoming queue does not (need to) have the same amount of data as the SMTP servers. It only stores data that is relevant to the QMGR. There is even less information in the data that is written to disk when an entry has been received. The data in the RSC is not just needed for delivery, but also during mail reception for policy decisions. In contrast, the backup data is only there to reconstruct the incoming cache in case of a system crash, i.e., the mail has already been received, there will not be any policy decisions about it. Hence the backup stores only the data that is required for delivery, not the data that is necessary to make decisions while the mail is being received. This of course means that a reconstructed RSC does not contain the same (amount of) information as the original RSC.
The size of the cache defines the maximum number of concurrently open sessions and transactions in the SMTP servers. Question: do we store transactions and recipients also in fixed size caches? We certainly have to limit the size used for that data, which then acts as an additional restriction on the number of connections, transactions, and recipients. It's probably better to handle at least the recipients dynamically instead of pre-allocating a fixed amount of memory. The amount of memory must be limited, but it should be able to shrink if it has been expanded during a high volume phase (instead of having some maximum reserved all the time). Since most of the information in the cache is not fixed size, we need to dynamically allocate memory anyway. We may need some special kind of memory allocation for this which works within a given allocated area (see also Section 3.14.6).
Each entry in the cache has one of the following two formats (maybe use two different RSC):
Session
session-id | session identifier |
client-host | identification of connecting host |
IP address, host name, ident | |
features | features offered: AUTH, TLS, EXPN, ... |
workarounds | work around bugs in client (?) |
transaction-id | current transaction |
reject-msg | message to use for rejections (needed?) |
auth | AUTH information |
starttls | TLS information |
n-bad-cmds | number of bad SMTP commands |
n-transactions | number of transactions |
n-rcpts | total number of recipients |
n-bad-rcpts | number of bad recipients |
Transaction:
transaction-id | transaction identifier |
start-time | date/time of transaction |
sender-spec | address, arguments (decoded?) |
n-rcpts | number of recipients |
rcpt-list | addresses, arguments (decoded?) |
cdb-id | CDB identifier (obtained from cdb?) |
msg-size | message size |
n-bad-cmds | number of bad SMTP commands (necessary?) |
n-rcpts | number of valid recipients |
n-bad-rcpts | number of bad recipients |
session-id | (pointer back to) session |
statistics: | |
end-time | end of transaction |
If recipient addresses are expanded while in the INCEDB, we need to store the number of original recipients too.
Backup on disk: those entries have a different format than the in-memory version. The entries must be clearly marked as to what they are: transaction sender or recipient.
Sender (transaction):
transaction-id | transaction identifier |
start-time | start time of transaction |
sender-spec | address incl. ESMTP extensions |
cdb-id | CDB identifier (obtained from cdb?) |
n-rcpts | reference count for cdb-id |
Notice: the sender information is written to disk after all recipients have been received, i.e., when DATA is received, because it contains a counter of the recipients (reference count).
ESMTP sender extensions (substructure of the structure above)
size | size of mail content (SIZE=) |
bodytype | type of body |
envid | envelope id |
ret | DSN return information (FULL, HDRS) |
auth | AUTH parameter |
by | Deliverby specification |
Per recipient (transaction):
transaction-id | transaction identifier |
rcpt-spec | address incl. ESMTP extensions |
and maybe a unique id (per session/transaction?) |
ESMTP Recipient extensions (substructure of the structure above):
notify | DSN parameters (SUCCESS, FAILURE, WARNING) |
orcpt | original recipient |
The data for sender and recipient should be put into a data structure such that all relevant data is kept together. That structure must contain the data that must be kept in (almost) all queues.
The active queue needs different types of entries, hence it might be implemented as two RSCs, i.e., one for senders and one for recipients3.8. It could also be only one RSC if the ``typed'' variant is used (see 4.3.5.1).
Notice: there are two types of transaction records:
The AQ context itself contains some summary data and the data structures necessary to access transactions and recipients:
max-entries | maximum number of entries |
limit | current limit on number of entries |
entries | current number of entries |
t-da | entries being delivered |
nextrun | if set: don't run scheduler before this time |
tas | access to transactions |
rcpts | access to recipients |
There should probably be more specific counters: total number of recipients, number of recipients being delivered, number of recipients waiting for AR, number of recipients ready to be scheduled, and total number of transactions.
The incoming transaction context contains at least the following fields:
transaction-id | SMTPS transaction identifier |
sender-spec | address (see INCEDB) |
cdb-id | CDB identifier |
from-queue | from which queue does this entry come? (deferred or incoming) |
counters | several counters to keep track of delivery status |
For ACTEDB we only need to know how many recipients are referenced by this transaction in the DB itself. That is, when we put a new recipient into ACTEDB, then we need to have the sender (transaction) context in it. If it is not yet in the queue, then we need to get it (from INCEDB or DEFEDB) and initialize its counter to one. For each added recipient the counter is incremented by one; when the status of a recipient delivery attempt is returned from a DA, the counter is decremented by one and the recipient is taken care of in the appropriate way. See also 2.4.3.4. However, because AQ should contain all necessary data to update DEFEDB it must also store the overall counters, e.g., how many recipients are in the system in total (not just in AQ).
Note: the delivery status for a transaction is not stored in DEFEDB since each delivery attempt in theory may lead to a different transaction, i.e., a DA transaction is not stored in DEFEDB.
The recipient context in AQ contains at least the following elements:
SMTPS transaction-id | SMTPS transaction identifier |
DA transaction-id | DA transaction identifier |
rcpt-spec | address (see INCEDB) |
rcpt-internal | delivery tuple (DA, host, address) |
from-queue | from which queue does this entry come? |
status | not yet scheduled, being delivered, (temp) failure, ... |
SMTPS-TA | recipient from same SMTPS transaction |
DA-TA | recipient in same DA transaction |
DEST | recipient for same destination |
The last three entries are links to access related recipients. These are used to group recipients based on the usual criteria, i.e., same SMTPS transaction, same delivery transaction, same next hop. Maybe this data can also be stored in DEFEDB to pull in a set of recipients that belongs together instead of searching during each scheduling attempt for the right recipients that can be grouped into a single transaction. Questions: how to do this? Is it worth it?
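A sketch of how the recipient context and its three link fields could look in C (all names illustrative); the links make it cheap to walk from one recipient to the others in the same SMTPS transaction, the same DA transaction, or with the same destination:

/* illustrative sketch; names are hypothetical */
typedef struct aq_rcpt_S aq_rcpt_T;
struct aq_rcpt_S {
        char       *ss_ta_id;       /* SMTPS transaction identifier          */
        char       *da_ta_id;       /* DA transaction identifier             */
        char       *rcpt_spec;      /* recipient address (see INCEDB)        */
        char       *rcpt_internal;  /* delivery tuple (DA, host, address)    */
        int         from_queue;     /* deferred or incoming                  */
        int         status;         /* not yet scheduled, being delivered... */

        /* links to related recipients */
        aq_rcpt_T  *ss_ta_next;     /* next recipient from same SMTPS transaction */
        aq_rcpt_T  *da_ta_next;     /* next recipient in same DA transaction      */
        aq_rcpt_T  *dest_next;      /* next recipient for same destination        */
};

Grouping recipients into one delivery transaction then amounts to following dest_next from the head of a todo-list instead of searching AQ.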
The internal recipient format is the resolved address returned by the AR. Its format is explained in Section 3.6.3.1.
As explained in 3.4.4, several access methods are required for the various EDBs, those for AQ are:
The key described in item 3 refers to another data structure which summarizes the entries. This data structure is the ``head'' of the DEST list:
DA | Delivery Agent |
next-hop | Host (destination/next hop) to which to connect for delivery |
todo-entries | Number of entries in todo-list |
todo-list | Link to recipients which still need to be sent |
busy-entries | Number of entries in busy-list |
busy-list | Link to recipients which are being sent |
The number of waiting transactions (todo-entries) can be used to determine whether to keep a session open or close it.
Question: is it really useful to have a busy list? What's the purpose of that list, which algorithms in the scheduler need this access method? The number of entries in the busy list is somehow useful if it were the number of open transactions or sessions, however, this is the number of recipients which does not have a useful relation to transactions/sessions.
Note: when a recipient is added to AQ it may not be in these destination queues because its next hop has not yet been determined, i.e., the address resolver needs to be called first. Those entries must be accessible via other means, e.g., their unique (recipient) identifier (see item 1 above). It might also be possible (for consistency) to have another queue with a bogus destination (e.g., a reserved DA value or IP address) which contains the entries whose destination addresses have not yet been resolved. Section 3.4.4 explains some of the problems with choosing indices to access AQ (and other EDBs); there is an additional problem for AQRD: if the connection limit for an IP address is reached, the scheduler will skip recipients for that destination. However, the recipient may have other destinations with the same priority whose limit is not yet reached. Either the system relies on the randomization of those same priority destinations (real randomization in turn causes problems for session reuse), or some better access methods need to be used. It might be useful in certain cases to look through the recipient destinations nevertheless (which defeats the advantage of this organization to easily skip entries that cannot be scheduled due to connection limits).
There might be yet another data structure which provides a summary of the entries in DEFEDB of this kind. That data structure can be used to decide whether to pull in recipients from DEFEDB to deliver them over an open connection.
The QMGR/scheduler must also remove entries from AQ that have been in the queue too long, either because AR didn't respond or because a delivery attempt failed and the DA didn't tell QMGR about it (see Section 2.4.4.5). Question: what is an efficient way to do this? Should those entries also be in some list, organized in the order of timeout? Then the scheduler (or some other module) just needs to check the head of the list (and only needs to wake up if that timeout is actually reached). When an item is sent to AR or a DA then it must be sorted into that list.
The entries must be clearly marked as to what they are: transaction sender or recipient.
Sender:
transaction-id | transaction identifier |
start-time | date/time mail was received |
sender-spec | address (see INCEDB) |
cdb-id | CDB identifier |
rcpts-left | reference count for cdb-id |
rcpts-tried | counter for ``tried'' recipients |
rcpts-left refers to the number of recipients which somehow still require a delivery, whether to the recipient address or a DSN back to the sender. rcpts-tried is used to determine when to send a DSN (if requested). It might be useful to have counters for the three different delivery results: ok, temporary/permanent failure:
rcpts-total | total number of recipients |
rcpts-left | number of recipients left |
rcpts-temp-fail | number of recipients which temporary failed |
rcpts-perm-fail | number of recipients which permanently failed |
rcpts-succ | recipients which have been delivered |
In case of a DELAY DSN request we may need yet another counter. See also Sections 2.4.6 and 3.4.10.3.
rcpts-total is at least useful for statistics (logging); one of rcpts-succ and rcpts-total may be omitted. The invariants are:
rcpts-total also counts the bounces that have been generated. It is never decremented.
Notice: these counters must only be changed if the delivery status of a recipient changes. For example, if a recipient was previously undelivered and now a delivery caused a temporary failure, then rcpts-temp-fail is increased. However, if a recipient previously caused a temporary failure and now a delivery failed again temporarily, then rcpts-temp-fail is unchanged. This obviously requires keeping the last delivery status for each recipient (see below).
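A small sketch of this rule in C (the status values and the counter structure are illustrative, not an existing definition):

enum rcpt_status { RCPT_UNDELIVERED, RCPT_TEMP_FAIL, RCPT_PERM_FAIL, RCPT_DELIVERED };

struct ta_counters {
        unsigned int rcpts_total, rcpts_left,
                     rcpts_temp_fail, rcpts_perm_fail, rcpts_succ;
};

void ta_update_counters(struct ta_counters *ta,
                        enum rcpt_status old_stat, enum rcpt_status new_stat)
{
        if (old_stat == new_stat)
                return;         /* e.g., temp failure -> temp failure: no change */
        if (old_stat == RCPT_TEMP_FAIL)
                ta->rcpts_temp_fail--;
        switch (new_stat) {
        case RCPT_TEMP_FAIL:
                ta->rcpts_temp_fail++;
                break;
        case RCPT_PERM_FAIL:
                ta->rcpts_perm_fail++;  /* DSN handling adjusts rcpts_left later */
                break;
        case RCPT_DELIVERED:
                ta->rcpts_succ++;
                ta->rcpts_left--;       /* assuming no SUCCESS DSN was requested */
                break;
        default:
                break;
        }
}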
Recipient:
transaction-id | transaction identifier |
rcpt-spec | address (see INCEDB) |
rcpt-internal | delivery tuple (DA, host, address, timestamp) |
d-stat | delivery status (why is rcpt in deferred queue) |
schedule | data relevant for delivery scheduling, e.g., |
last-try: last time delivery has been attempted | |
next-try: time for next delivery attempt |
d-stat must contain sufficient data for a DSN, i.e.:
act-rcpt | actual recipient (?) |
orig-rcpt | original recipient (stored in rcpt-spec, see above) |
final-rcpt | final recipient (from RCPT command) |
DSN-status | extended delivery status code |
remote-mta | remote MTA |
diagnostic-code | actual SMTP code from other side (complete reply) |
last-attempt | date/time of last attempt |
will-retry | for temporary failure: estimated final delivery time |
The internal recipient format is the resolved address returned by the AR. Its format is explained in Section 3.6.3.1. Question: do we really want to store rcpt-internal in DEFEDB? See Section 3.4.8 for a discussion. The timestamp for the delivery tuple is necessary as explained in the same section.
Question: which kind of delivery timestamp is better: last time a delivery has been attempted or time for next delivery attempt? We probably need both (last-try for DSN, next-try for scheduling).
EDBC implements a sorted list based on the ``next time to try'' with references to recipient identifiers (which are the main indices to access DEFEDB). ``Next time to try'' is not a unique identifier, hence this structure must be aware of that, e.g., when adding or removing entries.
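One simple way to deal with the non-unique key is to sort on the pair (next time to try, recipient id), as in this illustrative C sketch (names and sizes are assumptions):

#include <string.h>
#include <time.h>

#define EDBC_IDLEN 64                   /* illustrative maximum id length */

typedef struct edbc_entry_S {
        time_t  next_try;               /* primary sort key               */
        char    rcpt_id[EDBC_IDLEN];    /* recipient id, breaks ties      */
        struct edbc_entry_S *next;      /* sorted singly linked list      */
} edbc_entry_T;

/* compare by time first, then by id so that equal times still have a
 * well-defined order; this makes removal of a specific entry unambiguous */
int edbc_cmp(const edbc_entry_T *a, const edbc_entry_T *b)
{
        if (a->next_try != b->next_try)
                return a->next_try < b->next_try ? -1 : 1;
        return strcmp(a->rcpt_id, b->rcpt_id);
}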
This cache is accessed via IP addresses and maybe hostnames. It is used to check whether an incoming connection (to SMTPS) is allowed (see also Section 2.4.7).
open-conn | number of currently open connections |
open-conn-X | number of open connections over last X seconds |
(probably for X in 60, 120, 180, ...) | |
trans-X | number of performed transactions over last X seconds |
rcpts-X | number of recipients over last X seconds |
fail-X | number of SMTP failures over last X seconds |
last-conn | time of last connection |
last-state | status of last connection, see 3.4.10.9 |
Notice: statistics must be kept in even intervals, otherwise there is no way to cycle them as time goes on.
Question: do we use a RSC for this? If so, how do we handle the case when the RSC is full? Just throwing out the least recently used connection information doesn't seem appropriate.
Question: what kind of status do we want to store here? Whether the connection was successful, or aborted by the sender? Or whether it acted strange, e.g., caused SMTP violations? Maybe performance related data? For example, number of recipients, number of transactions, throughput, and latency.
As explained in Section 2.4.4.8, there are two different connection caches for outgoing connections: one for currently open connections (OCC, 1) and one for previously made (tried) connections (DCC, 2). These two connection caches are described in the following two subsections.
This is OCC (see Section 2.4.4.8: 1) for the currently open (outgoing) connections.
OCC helps the scheduler to perform its operation, it contains summary information (and hence could be gathered from the AQ/DA data by going through the appropriate entries3.9).
open-conn | number of currently open connections |
open-conn-X | number of open connections over last X seconds |
(probably for X in 60, 120, 180, ...) | |
trans-X | number of performed transactions over last X seconds |
rcpts-X | number of recipients over last X seconds |
fail-X | number of failures over last X seconds |
performance | data related to performance, see 3.11.8 |
first-conn | time of first connection |
last-conn | time of last connection |
last-state | status of last connection, see 3.4.10.11 |
initial-conc | initial concurrency value |
max-conc | maximum concurrency limit |
cur-conc | current concurrency limit (``window'') |
This connection cache stores only information about current connections. The connection cache also stores the time of the last connection. Question: do we need to store a list of those times, e.g., the last three? We can use these times to decide when a new connection attempt should be made (if the last connection failed). For this we need at least the last connection time and the time interval to the previous attempt. If we use exponential backoff we need only those two values. For more sophisticated methods (which?) we probably need more time stamps.
It is not yet clear whether the open connection cache actually needs the values listed above, especially those for ``over last X seconds''. Unless the scheduler actually needs them, they can be omitted (they might be useful for logging or statistics). Instead, the counters may be for the entire ``lifetime'' of the connection cache entry; those counters can be used to implement limits for the total number of sessions, transactions, recipients, etc.
The last three values are used to implement the slow-start algorithm, see 3.4.9.1.
This is DCC (see Section 2.4.4.8: 2) for previously open connections. It contains similar data as OCC but only for connections which are not open anymore.
open-conn-X | number of open connections over last X seconds |
(probably for X in 60, 120, 180, ...) | |
trans-X | number of performed transactions over last X seconds |
rcpts-X | number of recipients over last X seconds |
fail-X | number of failures over last X seconds |
performance | data related to performance, see 3.11.8 |
last-conn | time of last connection |
last-state | status of last connection, see 3.4.10.11 |
The connection cache also stores the time of the last connection. Question: do we need to store a list of those times, e.g., the last three? We can use these times to decide when a new connection attempt should be made (if the last connection failed). For this we need at least the last connection time and the time interval to the previous attempt. If we use exponential backoff we need only those two values. For more sophisticated methods (which?) we probably need more time stamps.
This connection cache can be optimized to ignore some recent connections at the expense of being limited in size. For example, see the BSD inetd(8) ([ine]) implementation which uses a fixed-size hash array to store recent connections. If there are too many connections, some entries are simply overwritten (least recent entry will be replaced).
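A minimal sketch of that inetd-style scheme in C (sizes and names are illustrative): the peer address is hashed into a fixed-size array and collisions simply overwrite the previous entry, so the cache never grows but may forget recent connections.

#include <netinet/in.h>
#include <string.h>
#include <time.h>

#define DCC_SIZE 241            /* fixed number of slots (illustrative) */

struct dcc_entry {
        struct in_addr  addr;           /* peer IPv4 address            */
        time_t          last_conn;      /* time of last connection      */
        int             last_state;     /* status of last connection    */
};

static struct dcc_entry dcc_tab[DCC_SIZE];

struct dcc_entry *dcc_lookup(struct in_addr addr)
{
        unsigned int h = (unsigned int) addr.s_addr % DCC_SIZE;

        if (dcc_tab[h].addr.s_addr != addr.s_addr) {
                /* empty slot or collision: overwrite, losing the old entry */
                memset(&dcc_tab[h], 0, sizeof(dcc_tab[h]));
                dcc_tab[h].addr = addr;
        }
        return &dcc_tab[h];
}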
See Section 3.8.4.1 for a delivery status that must be stored in the appropriate entries. Question: where do we store the status? We store it on a per-recipient basis in the EDB and on a per-host (or whatever the index will be) basis in the connection cache. The delivery status will be only stored in the connection cache if it pertains to the connection. For example, ``Recipient unknown'' is not stored in that cache. The delivery status should reflect this distinction easily. Question: is it useful to create groups of recipients, i.e., group those recipients within an envelope that are sent to the same host via the same DA? This might be useful to schedule delivery, but should we create extra data types/entries for this?
It must also be stored whether currently a connection attempt is made. This could be denoted as one open connection and status equal ``Opening'' (or something similar).
Question: how much should the QMGR control (know about) the status of the various DAs? Should it know exactly how many are active, how many are free? That seems to be useful for scheduling, e.g., it doesn't make sense to send a delivery task to a DA which is completely busy and unable to make the delivery attempt ``now'', i.e., before another one is finished. Hence we need another data structure that keeps track of each available DA (each ``thread'' in it, however, this should be abstracted out; all the QMGR needs to know is how many DAs are available and what they are doing, i.e., whether they are busy, idle, or free3.10). The data might be hierarchically organized, e.g., if one DA of a certain type can offer multiple incarnations, then the features of the DA should be listed in one structure and the current status of the ``threads'' in a field (or list or something otherwise appropriate). Some statistics need to be stored too which can be used to implement certain restrictions, e.g., limit the number of transactions/recipients per session, or the time a connection is open.
status | busy/free (other?) |
DA session-id | DA session identifier |
DA transaction-id | DA transaction identifier |
SMTPS transaction-id | SMTPS transaction identifier |
server-host | identification of server: IP address, host name |
n-trans | number of performed transactions |
n-rcpts | number of recipients |
n-fail | number of failures |
performance | data related to performance, see 3.11.8 |
opened | time of connection open |
last-conn | time of last connection |
Question: what do we use as index to access this structure? We could use a simple integer DA-idx (0 to max threads-1), i.e., a fixed size array. Then however we should also use that value as an identifier for communication between QMGR and DA, otherwise we still have to search for session/transaction id. Nevertheless, using an integer might give us the wrong idea about the level of control of the QMGR over the DA, i.e., we shouldn't assume that DA-idx is actually useful as an index in the DA itself.
Notice: this is the only place where we store information about a DA session, the active queue contains only mail and recipient data. Hence we may have to store more information here. This data structure is also used for communication between QMGR and DAs; it associates results coming back from DAs (which use DA session/transaction ids) with the data in AQ (which use SMTPS session/transaction ids).
One simple approach is to check how much storage resources are used, e.g., how full are AQ, IQDB, etc, as well as disk space usage. However, that does not take into account the ``load'' of the system, i.e., CPU, I/O, etc.
Various DBs are stored on disk: CDB, DEFEDB, and IBDB. The latter two are completely under control of QMGR, the former is used by SMTPS (write), DA (read), and QMGR (unlink). The amount of available disk space can be stored in a data structure and updated on each operation that influences it. Additionally, system calls can be made periodically to reflect changes to the disk space caused by other processes. About CDB: SMTPS should pass the size of a CDB entry to QMGR, which then can be used when a transaction is accepted and when all recipients for a transaction have been delivered and hence the CDB entry is removed.
sendmail 8.12 uses a data structure to associate queue directories with disk space (``partitions''). A similar structure can be used for sm9.
struct filesys_S
{
        dev_t            fs_dev;        /* unique device id */
        long             fs_kbfree;     /* KB free */
        long             fs_blksize;    /* block size, in bytes */
        time_T           fs_lastupdate; /* last time fs_kbfree was updated */
        const char      *fs_path;       /* some path in the FS */
};
For internal (memory resident) DBs it is straightforward to use the number of entries in the DB as a measure for its usage. This number should be expressed as percentage to be independent of the actual size chosen at runtime. Hence the actual usage of a DB can be represented as a single number whose value ranges from 0 to 100.
See Section 3.11.3 for more information about envelope databases, esp. APIs.
Notice: these functions are ``called'' from SMTPS (usually via a message), hence they do not have a corresponding function that returns the result. The functions may internally wait for the results of others, but they will return the result ``directly'' to the caller, i.e., via a notification mechanism. The other side (SMTPS) may use an asynchronous API (invoke function, ask for result) as explained in Section 3.1.1.
Maybe qmgr_trans_close() and qmgr_trans_discard() can be merged into one function which receives another parameter: qmgr_trans_close(IN trans-id, IN cdb-id, IN smtps-status, OUT status), where smtps-status determines whether SMTPS has accepted the mail so far; the QMGR can still return an error.
See Section 2.4.4.2 for a description of the tasks of the first level scheduler. This part of the QMGR adds entries to the active queue, whose API is described in Section 3.11.5.
qmgr_fl_getentries(IN actq, IN incq, IN defq, IN number, IN policy, OUT status) get up to a certain number of entries for the active queue.
qmgr_fl_get_inc(IN actq, IN incq, IN number, IN policy, OUT status) get up to a certain number of entries for the active queue from the incoming queue.
qmgr_fl_get_def(IN actq, IN defq, IN number, IN policy, OUT status) get up to a certain number of entries for the active queue from the deferred queue.
qmgr_fl_get_match(IN actq, IN number, IN criteria, OUT status) get some entries for the active queue from the deferred queue that match some criteria, e.g., items on hold with a matching hold message, or items for a certain domain. We may want different functions here depending on what a user can request. But we also want a generic function that can get entries depending on some conditions that can be fairly freely specified.
See Section 2.4.4.3 for a description of the tasks of the second level (micro) scheduler. This part of the QMGR controls the active queue, whose API is described in Section 3.11.5. It uses the first level scheduler to fill the active queue whenever necessary, and the DA API (3.4.5.0.1) to request actual deliveries.
Question: which additional functions do we need/want here?
Whenever a delivery attempt has been made, the status will be collected and an appropriate update in the main (or incoming) queue must be made.
Common to all cases is the handling of DSNs. If a particular DSN is requested and the conditions for that DSN are fulfilled, then the recipient is added to the DSN for that message (based on the transaction id of the received message). If there is no DSN entry yet, then it will be created. If all entries for the particular DSN have been tried, ``release'' (schedule) the DSN to be sent. See also Section 2.4.6. Question: do DSNs cause new entries in the main queue or do we just change the type of the recipient entry?
If the entry will be tried later on, i.e., the queue return timeout isn't reached, then determine the next time for retry. Update the entry in the deferred queue (this may require moving the entry from the incoming cache to the deferred queue).
The AR API is briefly described in Section 3.6.3. An open question is when the QMGR should call the AR. There are at least the following possibilities:
The SMAR may expand aliases in which case it can return a list of (resolved) addresses. The QMGR must make sure that the new addresses are either safely stored in persistent storage or that the operation is repeatable. The simple approach is to store the new addresses in DEFEDB after they have been received from SMAR and remove the address which caused the expansion from the DB in which it is stored3.11 (IBDB or DEFEDB). If the new addresses are stored only in AQ, then the handling becomes complicated due to potential delivery problems and crashes before all expanded addresses have been tried. The expansion would be done in a two step process:
Note: the design requires that all data in AQ can be removed (lost) at any time without losing mail.
If QMGR is terminated between step one and two, the alias expansion will be repeated the next time the (original) address is selected for delivery. If QMGR is terminated during step two, i.e., the delivery of expanded addresses, then this approach may result in multiple deliveries to the same recipient(s). Note: the current functions to update the recipient status after a delivery attempt do not yet deal with recipients resulting from an alias expansion.
For 1-1 aliases it seems simplest to replace the current address with the new one, which avoids most of the problems mentioned above3.12.
In a first step 1-N (N > 1) alias expansion should be done by writing all data to DEFEDB. Later on optimizations can be implemented, e.g., if N is small, then the expansion is done in AQ only.
When a delivery attempt (see 4d in Section 2.4.3.2) has been made, the recipient must be taken care of in the appropriate way. Note that a delivery attempt may fail in different stages (see Section 2.4.3.1), and hence updating the status of a recipient can be done from different parts of QMGR:
See Section 2.4.3.4 for a description what needs to be done after a delivery attempt has been made. As mentioned there it is recommended to perform the update for a delivery attempt in one (DB) transaction to minimize the amount of I/O and to maintain consistency. To achieve this, request lists are created which contain the changes that should be made. Change requests are appended to the list when the status for recipients are updated. After all changes (for one transaction or one scheduler run) have been made, the changes are committed to persistent storage. Updates to DEFEDB are made before updates to INCEDB (as explained in Section 2.4.1), that is, first the request list for DEFEDB is committed and if that succeeded, the request list of IBDB is committed to disk.
Here's a more detailed version of the description in Section 2.4.3.4, but only for the case that no special DSN is requested, i.e., without DSN support (RFC 1894). After a delivery attempt, the recipient is removed from ACTEDB and
Updates to DEFEDB are partially ordered. If multiple threads prepare updates for DEFEDB, they may contain changes for the same transaction(s). The order of updates of the transaction(s) in AQ must be reflected in the order of updates in DEFEDB. There can either be a strict order or a partial order, i.e., two updates need to be done only in a certain order if they are for the same transaction. To simplify implementation, a strict order is preserved3.13.
Each function that needs to update DEFEDB creates a ``request list'', i.e., a list of items that need to be committed to DEFEDB. This has two advantages over updating each entry individually:
The first approach (more flexible and more complicated than approach 2) is to enforce just an ordering on writing data to DEFEDB by asking for a ``sequence number'' and enforcing the ordering by that sequence number in the write routine.
Algorithm:
Use a mutex, a condition variable, and two counters: first and last.
Initialize first and last to 0.
Invariants: first <= last. Number of entries: last - first + 1 if first > 0, otherwise 0, i.e., there are no requests iff first == 0.
Resetting first and last to 0 is done to minimize the chance for an overflow.
Get an entry:
lock(mutex)
if (first == 0)
        first = last = 1
        n = 1
else
        n = ++last
unlock(mutex)
return n
To write a request list:
lock(mutex)
while (number != first)
        cond_wait(cond, mutex)
unlock(mutex)
lock(write_mutex)
... write list ...
unlock(write_mutex)
lock(mutex)
assert(number == first)
assert(first <= last)
if (first < last)
        ++first
        signal(cond)
else
        first = last = 0
unlock(mutex)
Note: there is currently no absolute prevention of overflowing last. If this needs to be done, then last would be checked in the function that increases it and if it hits an upper limit, it would wait on a condition variable. The signalling would be done in the function that increases first: if it reaches the upper limit, then it would signal the waiting process.
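A sketch of this ordering scheme with POSIX threads is shown below; it follows the pseudocode above, except that a broadcast is used so that the waiter whose number just became first is guaranteed to wake up (a plain signal could wake the wrong waiter). Function names are illustrative.

#include <assert.h>
#include <pthread.h>

static pthread_mutex_t seq_mutex   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  seq_cond    = PTHREAD_COND_INITIALIZER;
static unsigned int first = 0, last = 0;

unsigned int edb_req_seq(void)          /* get a sequence number */
{
        unsigned int n;

        pthread_mutex_lock(&seq_mutex);
        if (first == 0)
                first = last = n = 1;
        else
                n = ++last;
        pthread_mutex_unlock(&seq_mutex);
        return n;
}

void edb_wr_list(unsigned int number)   /* write one request list in order */
{
        pthread_mutex_lock(&seq_mutex);
        while (number != first)
                pthread_cond_wait(&seq_cond, &seq_mutex);
        pthread_mutex_unlock(&seq_mutex);

        pthread_mutex_lock(&write_mutex);
        /* ... write the request list to DEFEDB ... */
        pthread_mutex_unlock(&write_mutex);

        pthread_mutex_lock(&seq_mutex);
        assert(number == first && first <= last);
        if (first < last) {
                ++first;
                pthread_cond_broadcast(&seq_cond);
        } else {
                first = last = 0;
        }
        pthread_mutex_unlock(&seq_mutex);
}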
A simpler way to solve the problem is to lock DEFEDB and AQ, and write changes to DEFEDB before it is unlocked. Even though that keeps DEFEDB locked while making changes to AQ (which prevents only a reader process from accessing DEFEDB even though it might be possible to allow that otherwise), this seems like the simplest approach to solve the problem.
According to Section 2.4.7 QMGR must control the local load of the system, Section 3.4.10.13 describes the possible data structures for this purpose. In this section the functionality will be specified.
An MTS has some modules that produce data (SMTP servers) and some which consume data (SMTP clients, delivery agents in general). The simplest approach to control how much storage is needed is to regulate the producers such that they do not exceed the available storage capacities. To accomplish this two thresholds are introduced for each resource:
To allow for fine grained control the capacity C (range: 0 to 100) of the producers should be regulated on a sliding scale proportional to the actual value v of the resource if it is between the lower threshold lt and the upper threshold ut. Capacity is the inverse of the resource usage R, i.e., C = 100 - R.
if v <= lt then C = 100
else if v >= ut then C = 0
else C = 100 * (ut - v) / (ut - lt)

or, computing the resource usage R first (with C = 100 - R):

if v <= lt then R = 0
else if v >= ut then R = 100
else R = 100 * (v - lt) / (ut - lt)
For multiple values a loop can be used which is stopped as soon as one value exceeds its upper threshold. Computing the capacity can be done as follows:

for each resource i
do
        if v_i >= ut_i
        then C = 0; break;
        else if v_i <= lt_i then C_i = 100
        else C_i = 100 * (ut_i - v_i) / (ut_i - lt_i)
done
if no resource exceeded its upper threshold
then C = minimum over all i of C_i
Computing the resource usage is done in the corresponding way:

for each resource i
do
        if v_i >= ut_i
        then R = 100; break;
        else if v_i <= lt_i
        then R_i = 0
        else R_i = 100 * (v_i - lt_i) / (ut_i - lt_i)
done
if no resource exceeded its upper threshold
then R = maximum over all i of R_i
Notes:
In general, whenever a resource is used, it must be checked whether it is exhausted. For example, whenever an item is added to a DB the result value is checked. If the result is an error which states that the DB is full, the SMTP servers (producers) must be stopped. This is the simple case where the usage of a resource reaches the upper limit (exceeds the upper threshold).
After several operations have been performed of which some contain additions to DBs, a generic throttle function can be called which checks whether any of the resource usages exceeds its lower limit, in which case the SMTP servers are throttled accordingly. This needs to be done only if the new resource usage is sufficiently different from the old value, otherwise it is not worth notifying the producers of a change.
If the system is in an overloaded state, i.e., the producers are throttled or even stopped, then the resource usage must be checked whenever items are removed from DBs. If the new resource usage is sufficiently less than the old value, the SMTP servers are informed about that change (unthrottled). Alternatively, the producers can be unthrottled only after all resource usages are below their lower thresholds.
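A sketch of such a generic throttle check in C; qmgr_capacity(), smtps_throttle(), and smtps_unthrottle() are hypothetical helpers, and the hysteresis value is an arbitrary example:

#define CAP_DELTA 10    /* minimum change (in percent) worth reporting */

extern unsigned int qmgr_capacity(void);          /* hypothetical: 0..100 */
extern void smtps_throttle(unsigned int cap);     /* hypothetical */
extern void smtps_unthrottle(unsigned int cap);   /* hypothetical */

static unsigned int last_capacity = 100;

void qmgr_resource_check(void)
{
        unsigned int cap = qmgr_capacity();     /* see computation above */
        int diff = (int) cap - (int) last_capacity;

        if (diff <= -CAP_DELTA || cap == 0) {
                smtps_throttle(cap);            /* stop producers if cap == 0 */
                last_capacity = cap;
        } else if (diff >= CAP_DELTA || (last_capacity == 0 && cap > 0)) {
                smtps_unthrottle(cap);
                last_capacity = cap;
        }
}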
Unfortunately there is no simple, portable way to determine the amount of memory that is used by a process. Moreover, even though malloc(3) is supposed to return ENOMEM if there is no more space available, this does not work on all OSs because they overallocate memory, i.e., they allocate memory and detect a memory shortage only if the memory is actually used, in which case some OSs even start killing processes to deal with the problem. This is completely unacceptable for robust programming: why should a programmer invest so much time in resource control if the OS just ``randomly'' kills processes? One way to deal with this might be to use setrlimit(2) to artificially limit the amount of memory that a process can use, in which case malloc(3) should fail properly with ENOMEM.
If the OS properly returns an appropriate error code if memory allocation fails, then the system can be throttled as described in 3.4.17.1. However, it is hard to recover from this resource shortage because the actual usage is unknown. One way to deal with the problem is to reduce the size of the memory caches if the system runs out of memory. That can be done in two steps:
Now the system can either stay in this state or after a while the limits can be increased again. In the latter case the resource control mechanisms may trigger again if the system runs out of memory again. It might be useful to keep track of those resource problems to adapt the resource limits to avoid large variations and hence repeated operation of the throttling/unthrottling code (i.e., implement some ``dampening'' algorithm).
The external interface of the SMTP server (also called smtpd in this document) is of course defined by (E)SMTP as specified in RFC 2821. In addition to that basic protocol, the sendmail X SMTP server will support: RFC 1123, RFC 1869, RFC 1652, RFC 1870, RFC 1891, RFC 1893, RFC 1894, RFC 1985, RFC 2034, RFC 2487, RFC 2554, RFC 2852, RFC 2920. See Section 1.1.1 for details about these RFCs.
Todo: verify this list, add CHUNKING, maybe (future extension, check whether design allows for this) RFC 1845 (SMTP Service Extension for Checkpoint/Restart).
An SMTP session may consist of several SMTP transactions. The SMTP server uses data structures that closely follow this model, i.e., a session context and a transaction context. A session context contains (a pointer to) a transaction context, which in turn contains a pointer back to the session context. The latter ``inherits'' its environment from the former. The session context may be a child of a daemon context that provides general configuration information. The session context contains for example information about the sending host (the client) and possibly active security and authentication layers.
The transaction context contains a sender address (reverse-path buffer in RFC 2821 lingo), a list of recipient addresses (forward-path buffer in RFC 2821 lingo), and a mail data buffer.
At startup an SMTP server registers with the QMGR. It opens a communication channel to the QMGR and announces its presence. The initial data includes at least a unique identifier for the SMTP server which will be used later on during communication. Even though the identification of the communication channel itself may be used as a unique identifier, e.g., a file descriptor inside the QMGR, it is better to be independent of such an implementation detail. The unique identifier is generated by the MCP and transferred to the SMTP server. It may also act as a ``key'' for the communication between SMTPS and QMGR/MCP if the communication channel could be abused by other programs. In that case, the MCP tells the QMGR about the new SMTPS (including its key).
It seems appropriate to have two different states: the state of the session and that of the transaction, instead of combining these into one. Not all of the combinations are possible (if there is no active session, there can't be a transaction). Many of the states may have substates with simple (enumerative) or complex (e.g., STARTTLS information) descriptions, some of them must be combinable (``additive''). Question: How to describe ``additive'' states? Simplest way: bit flags, binary or. So: can we put those into bitfields?
Session states:
Transaction states:
Multiple MAIL commands are not allowed during a transaction, i.e., MAIL is only valid if state is SMTPS-TA-INIT.
RCPT is only valid if state is SMTPS-TA-MAIL or SMTPS-TA-RCPT.
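If the ``additive'' states are indeed represented as bit flags (binary or), the states named above could be encoded roughly as in the following C sketch; the names and values are illustrative only, and the DATA flag is an assumption:

/* Transaction states (mutually exclusive part). */
#define SMTPS_TA_NONE  0x0000   /* no active transaction             */
#define SMTPS_TA_INIT  0x0001   /* transaction started, no MAIL yet  */
#define SMTPS_TA_MAIL  0x0002   /* MAIL accepted                     */
#define SMTPS_TA_RCPT  0x0004   /* at least one RCPT accepted        */
#define SMTPS_TA_DATA  0x0008   /* DATA in progress (assumption)     */

/* Session flags that can be combined (``additive''). */
#define SMTPS_SE_EHLO      0x0100   /* EHLO/HELO seen   */
#define SMTPS_SE_AUTH      0x0200   /* AUTH successful  */
#define SMTPS_SE_STARTTLS  0x0400   /* TLS layer active */

/* Example of the checks described above:
**   MAIL is only valid if (ta_state == SMTPS_TA_INIT);
**   RCPT is only valid if (ta_state & (SMTPS_TA_MAIL|SMTPS_TA_RCPT)).
*/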
Some of these states require a certain sequence, others don't. We can either put this directly into the code (each function checks itself whether its prerequisites are fulfilled etc) or into a central state-table which describes the valid state changes (and includes appropriate error messages). It seems easier to have this coded into the functions than having it in the event loop which requires some state transition table, see libmilter.
VRFY, EXPN, NOOP, and HELP can be issued at (almost) any time and do not modify the transaction context (no change of buffers).
There need to be at least two data structures in the SMTP server; one for a transaction, one for a session.
Additionally, a server context is used in which configuration data is stored. This context holds the data that is usually stored in global variables. In some cases it might be useful to change the context based on the configuration, e.g., per daemon options. Hence global variables should be avoided; the data is stored instead in a context which can be easily passed to subroutines. The session context contains a pointer to the currently used server context.
Session:
session-id | session identifier, maybe obtained from QMGR |
connect-info | identification of connecting host (IP address, host name, ident) |
fd | file descriptor/handle for connection |
helo-host | host name from HELO/EHLO |
access times | start time, last read, last write |
status | EHLO/HELO, AUTH, STARTTLS (see above) |
features | features offered: AUTH, TLS, EXPN, ... |
workarounds | work around bugs in client |
reject-msg | message to use for rejections (nullserver) |
auth | AUTH context |
starttls | TLS context |
transaction | pointer to current transaction |
Transaction:
transaction-id | transaction identifier, maybe obtained from QMGR |
sender | address, arguments (decoded?) |
rcpt-list | addresses, arguments (decoded?) |
cdb-id | CDB identifier (obtained from cdb?) |
cmd-failures | number of failures for certain commands |
session | pointer to session |
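The two tables above could translate into C structures roughly as follows; the member types, the se_/ta_ prefixes, and the opaque pointers are assumptions, not a fixed layout:

#include <time.h>

struct smtps_transaction;

struct smtps_session {
    char                *se_id;           /* session identifier (from QMGR?)  */
    struct conn_info    *se_conn;         /* IP address, host name, ident     */
    int                  se_fd;           /* file descriptor for connection   */
    char                *se_helo_host;    /* host name from HELO/EHLO         */
    time_t               se_start, se_last_rd, se_last_wr;
    unsigned int         se_state;        /* EHLO/HELO, AUTH, STARTTLS, ...   */
    unsigned int         se_features;     /* features offered: AUTH, TLS, ... */
    unsigned int         se_workarounds;  /* work around bugs in client       */
    char                *se_reject_msg;   /* rejection message (nullserver)   */
    void                *se_auth;         /* AUTH context (opaque)            */
    void                *se_starttls;     /* TLS context (opaque)             */
    struct smtps_transaction *se_ta;      /* current transaction              */
};

struct smtps_transaction {
    char                *ta_id;           /* transaction identifier           */
    struct sm_addr      *ta_mail;         /* sender address, arguments        */
    struct sm_addr_list *ta_rcpts;        /* recipient addresses, arguments   */
    char                *ta_cdb_id;       /* CDB identifier                   */
    unsigned int         ta_cmd_failures; /* failures for certain commands    */
    struct smtps_session *ta_session;     /* pointer back to the session      */
};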
Question: How much of the transaction do we need to store in the SMTP server? The QMGR holds the important data for delivery. The SMTP server needs the data to deal with the current session/transaction.
Todo: describe a complete transaction here including the interaction with other components, esp. queue manager.
Notice: this probably occurs in an event loop. So all functions are scheduled via events (unless noted otherwise).
As described in Section 3.14.4 the I/O layer will try to present only complete records to the middle layer. However, this may be intertwined with the event loop because otherwise we have a problem with non-blocking I/O. For the purpose of this description, we omit this I/O layer part and assume we receive full records, i.e., for the ESMTP dialogue CRLF terminated lines.
Before accepting any new connection, a session context is created. This allows the server to react faster (it doesn't need to create the context on demand), and it is guaranteed that a session context is actually available. Otherwise the server may not have enough memory to allocate the session context and hence to start the session.
smtps_session_new(OUT smtps-session-ctx, OUT status): create a new session context.
smtps_session_init(IN fd, IN connect-info, INOUT smtps-session-ctx, OUT status): Initialize session context, including state of session, fill in appropriate data.
The basic information about the client is available via a system call (IP address). The (canonical) host name can be determined by another system call (gethostbyaddr(), which may be slow due to DNS lookups). Question: do we want to perform this call in the SMTP server, in the QMGR, or in the AR? There should be an option to turn off this lookup completely and hence only rely on IP addresses in checks. This can avoid those (slow) DNS lookups. SMTPS optionally (configurable) performs an auth (ident) lookup.
smtps_session_chk(INOUT smtps-session-ctx, OUT status): SMTPS contacts the queue manager and milters that are registered for this with the available data (IP address, auth result). The queue manager and active milters decide whether to accept or reject the connection. In the latter case the status of the server is changed appropriately and most commands are rejected. In the former case a session id is returned by the queue manager. Further policy decisions can be made, e.g., which features to offer to the client: allow ETRN, AUTH (different mechanisms?), STARTTLS (different certs?), EXPN, VRFY, etc, which name to display for the 220 greeting, etc. The QMGR optionally logs the connection data.
Question: how can we overlap some OS calls, i.e., auth lookup, gethostbyaddr(), etc?
Output initial greeting (usually 220, may be 554 or even 421). Set timeout. Progress state or terminate session.
smtps_starttls(INOUT session-ctx, OUT state)
smtps_auth(INOUT session-ctx, OUT state)
All the following functions check whether their corresponding commands are allowed in the current state as stored in the session/transaction context. Moreover, they check whether the arguments are syntactically correct and allowed depending on the features in the session context. The server must check for abuse, e.g., too many wrong commands, and act accordingly, i.e., slow down or in the worst case close the connection. Functions in SMTP server (see Section 3.4.12 for counterparts in the QMGR):
smtps_helo(INOUT session-ctx, IN line, OUT state): store host name in session context, clear transaction context, reply appropriately (default 250).
smtps_ehlo(INOUT session-ctx, IN line, OUT state): store host name in session context, clear transaction context, reply with list of available features.
smtps_noop(INOUT session-ctx, IN line, OUT state): reply appropriately (default 250).
smtps_rset(INOUT session-ctx, IN line, OUT state): clear transaction context, reply appropriately (default 250).
smtps_vrfy(INOUT session-ctx, IN line, OUT state): maybe try to verify address: ask address resolver.
smtps_expn(INOUT session-ctx, IN line, OUT state): maybe try to expand address: ask address resolver.
smtps_mail(INOUT session-ctx, IN line, OUT state): start new transaction, set sender address.
smtps_rcpt(INOUT session-ctx, IN line, OUT state): add recipient to list.
smtps_data(INOUT session-ctx, IN line, OUT state): start data section (get cdb-id now or earlier?).
smtps_body(INOUT session-ctx, IN buffer, OUT state): called for each chunk of the body. It's not clear yet whether this needs to be line oriented or whether it can simply receive entire body chunks. It may have to be line oriented for the header while the rest can be just buffers, unless we want to perform operations on the body in the SMTP server too. This function is responsible for recognizing the final dot and for acting accordingly.
Question: how do we deal with information sent to the QMGR and the responses? Question: how do we describe this properly without defining the actual implementation, i.e., with leaving ourselves room for possible (different) implementations?
Assuming that the SMTP servers communicate with the QMGR, we need some asynchronous interface. The event loop in the SMTP servers must then also react on data sent back from the QMGR to the SMTP servers. As described in Section 3.5.2.5, events for the session must be disabled or ignored while the SMTP server is ``waiting'' for a response from the QMGR.
If the QMGR is unavailable the session must be aborted (421).
There is one easy way to deal with PIPELINING in the SMTP server and one more complicated way. The easy way is to more or less ignore it, i.e., perform the SMTP dialogue sequentially and let the I/O buffering take care of PIPELINING. This works fine and is the current approach in sendmail 8. The more complicated way is to actually try to process several SMTP commands concurrently. This can be used to ``hide'' latencies, e.g., if the address resolver or the anti-spam checks take a lot of time - maybe due to DNS lookups - then performing several of those tasks concurrently will speed up the overall SMTP dialogue. This of course requires that some of those lookups can be performed concurrently, as is the case if an asynchronous DNS resolver is used.
Question: who generates the ids (session-id, transaction-id)? The QMGR could do this because it has the global state and could use a running counter based on which the ids are generated. This counter would be initialized at startup based on the EDB, which would have the last used value. However, it might reduce the amount of communication between SMTPS and QMGR if the former could generate the ids themselves. For example, just asking the QMGR for a new id when a session is initialized is a problem: which id should an SMTPS use to identify this request? The ids must be unique for the ``life time'' of an e-mail; they should be unique for much longer. Using a timestamp (as sendmail 8.12 does more or less) causes problems if the time is set backwards. Hence we need a monotonically increasing value. The ids should have a fixed length which makes several things simpler (even if it is just the output formatting of mailq). Since there can be several SMTPS processes, a single value in a shared location would require locking (the QMGR could provide this as discussed above), hence it would be more useful to separate the possible values, e.g., by adding a ``SMTPS id'' (which shouldn't be the pid) to the counter. This SMTPS id can be assigned at startup by the MCP. It seems tempting to make the transaction id name space disjoint from the session id name space and to provide a simple relation between those. For example, a session id could be: ``S SMTPS-Id counter'' and a transaction id could be: ``T SMTPS-Id counter ta-counter''. However, this makes it harder to generate fixed-length ids, unless we restrict the number of transactions in a session and the number of SMTPS processes. If we do that, then we must not set those limits too small (maybe 9999 would be ok), which on the other hand wastes digits for the normal cases (less than 10 for both values). Using 16 bits for the SMTPS id and 64 bits for the counter results in a string of 20 characters if we use base 16 encoding (hex). We could go down to 16 (14) characters if we use base 32 (64) encoding. However, if we use base 64 encoding, the resulting id can't be used as a filename on NT since it doesn't use case sensitive filenames (check this).
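A sketch of fixed-length id generation along these lines, i.e., a 16 bit SMTPS id assigned by the MCP plus a 64 bit monotonically increasing counter, hex encoded; the exact format and the leading 'S'/'T' letters are assumptions, and counter initialization from the EDB as well as the base 32 variant are omitted:

#include <stdio.h>
#include <stdint.h>

/* 'S' + 4 hex digits (SMTPS id) + 16 hex digits (counter) = 21 characters.
** A transaction id could append a per-session transaction counter the same way.
*/
static void
sm_session_id(char *buf, size_t len, uint16_t smtps_id, uint64_t counter)
{
    snprintf(buf, len, "S%04X%016llX",
             (unsigned int) smtps_id, (unsigned long long) counter);
}

int
main(void)
{
    char id[32];
    static uint64_t counter = 0;    /* per-SMTPS process, monotonically increasing */

    sm_session_id(id, sizeof(id), 0x002A, ++counter);
    printf("%s\n", id);             /* e.g. S002A0000000000000001 */
    return 0;
}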
If an SMTP server is implemented as an event-driven state machine, then a small precaution must be taken for pipelining. After an SMTP command has been read and a thread takes care of processing it, another SMTP command may be available on the input stream; however, it should not be processed before the processing of the current command has been finished. This can either be achieved by disabling the I/O events for that command stream during processing, or the session must be locked and any data available must be buffered while processing is active. It is not clear yet which of those two approaches is better; changing the set of I/O events may be an expensive operation; additional buffering of input data seems to be superfluous since the OS or the I/O layer takes care of that, and more buffering will complicate the I/O handling (and violates layering). Maybe the loop which checks for I/O events should check only those sessions which are not active, i.e., which are currently not being processed. However, then the mechanism that is used to check for I/O events must present that I/O event each time (select() and poll() do this, how about others?).
Note: there are different checks all of which are currently called ``anti-spam checks''. Basically they can be divided into two categories:
Here RELAY is stronger than OK because it grants additional rights. However, it is not clear whether RELAY should also override tests in other (later) stages or just a possible anti-relay test (which might be done for RCPT). However, there might be cases where this simple model is not sufficient. Example: B is a backup server for A, but B does not have a complete list of valid addresses for A. Hence it allows relaying for a (possible) subset of addresses and temporarily rejects other (unknown) addresses, e.g.,
To:a1@A    RELAY
To:a2@A    RELAY
To:@A      error:451 4.7.1 Try A directly
If session relaying actually overrides all subsequent tests, then this example works as expected, because hosts which are authorized to relay can send mail to A. However, if an administrator wants to maintain blacklists which are not supposed to be overridden by relay permissions, then relaying tests and other policy tests need to be separate.
The order (and effect) of checks must be specifiable by the user; in sm8 this is controlled by the delay_checks feature. This allows only for either the ``normal'' order (client, MAIL, RCPT) or the reverse order (RCPT, MAIL, client). With the default implementation this means that MAIL and client checks are done for each RCPT, which can be very inefficient.
The implementation of the anti-spam checks is non-trivial since multiple tests can be performed and hence the individual return values must be combined into one. As explained in Section 3.12.1, map lookups return two values: the lookup result and the RHS if the key was found. For example, for anti-relay checks the interesting results are (written as pair: (lookup-result, RHS), where RHS is only valid if lookup-result is FOUND):
state = NO-DECISION-YET;
while (state != RELAY && more(tests))
{
    (r, rhs) = test();
    switch (r, rhs)
    {
      case (TEMPFAIL, -):  state = r; break;
      case (FOUND, RELAY): return r;
      default:             break;
    }
}
if (state == NO-DECISION-YET)
    return REJECT;
else
    return state;
These anti-relay checks should be done in several phases of the ESMTP dialogue.
To do this, the flow control depicted above is enhanced such that the state is initialized properly by the state of the surrounding ESMTP phase, i.e., for session it is initialized to NO-DECISION-YET, for RCPT it is set to the relay-state of the session.
It becomes more complicated for other anti-spam tests. For example: some test may return TEMPFAIL, another may return OK, a third one REJECT, a fourth one REJECT but with a 4xy error. Question: how to prioritize? Question: should it be just a fixed algorithm in the binary or something user-definable? Maybe compare nsswitch.conf ([nss]) which provides a ``what to do in case of...'' decision between different sources (checks in the anti-spam case)? That is:
Result | Meaning |
FOUND | Requested database entry was found |
NOTFOUND | Source responded "no such entry" |
PERMFAIL | Source is not responding or corrupted (permanent error) |
TEMPFAIL | Source is busy, might respond to retries (temporary error) |
Action | Meaning |
CONTINUE | Try the next source in the list |
RETURN | Return now |
<entry>     ::= <database> ":" [<source> [<criteria>]]*
<criteria>  ::= "[" <criterion>+ "]"
<criterion> ::= <status> "=" <action>
<status>    ::= "FOUND" | "NOTFOUND" | "PERMFAIL" | "TEMPFAIL"
<action>    ::= "RETURN" | "CONTINUE"
<source>    ::= "file" | "DNS" | ...
So we need a state that is carried through and can be modified, and an action (break/return or continue).
<entry>      ::= <what> ":" [<test> [<criteria>]]* <return>
<test>       ::= some-test-to-perform
<criteria>   ::= "[" <criterion>+ "]"
<criterion>  ::= <status> "?" [<assignment>] <action> ";"
<status>     ::= "FOUND" | "NOTFOUND" | "PERMFAIL" | "TEMPFAIL"
<action>     ::= <return> | "continue"
<return>     ::= "RETURN" [<value>]
<assignment> ::= "state" "=" <value> ","
<value>      ::= <status> | "status"
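To illustrate the proposed grammar, a purely hypothetical entry for an anti-relay check might look as follows (access-map and dns-bl are placeholder test names; the bare RETURN at the end is assumed to return the accumulated state):

anti-relay: access-map [ FOUND ? RETURN status; TEMPFAIL ? state = TEMPFAIL, continue; ]
            dns-bl     [ FOUND ? RETURN FOUND; NOTFOUND ? continue; ]
            RETURN

That is, a hit in the access map immediately returns the lookup result, a temporary lookup failure is remembered in the carried state, and if no test forces an early return the accumulated state is the final result.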
Alternatively we can use a marker that says: ``use this result'' (e.g., pf([pf]): quick); by default the last result would be used. However, this doesn't seem to solve the TEMPFAIL problem: the algorithm needs to remember that there was a TEMPFAIL and return that in certain cases.
Question: can we define an algebra for this? Define the elements and the operations such that a state and a lookup result can be combined into a new state? Would that make sense at all? Would that be flexible enough?
Question: what are the requirements for anti-spam? Are the features provided by sm8 sufficient? Requirements:
Question: Where do we put the anti-spam checks? Some of them need similar routines as provided by the address resolver, e.g., for anti-relay we need the recipient address in a canonical form, i.e., which is the real address to which the mail should be sent? Maybe we need an interface that allows different modes, i.e., one for verification etc only, one for expansion.
For each stage in the SMTP dialogue there are builtin checks (which usually can be turned on/off via an option, e.g., accept-unresolvable-domain), and there are checks against the access map. The order of checks specifies whether a check can override other (``later'') checks unless special measures are taken (see below). Note: some of the builtin checks are configured by the user, hence it does not make sense to provide another option to override these because that can be done by setting the option accordingly, e.g., for connection rate it simply requires increasing the limit.
user defined: check host name and address against maps (access map, DNS based blacklists).
builtin: syntax checks.
user defined: check address against maps (access map, DNS based blacklists).
builtin: unresolvable domain, others?
builtin: syntax checks, local user: does the address exist?
user defined: check address against maps,
builtin (configurable via map): check anti-relay.
As usual these checks may all be done during the RCPT stage or when they occur during the SMTP dialogue.
Before the control flow of sm8 is described, the values (data, RHS) of a map lookup need to be explained:
Control flow in sm8:
These two methods could be combined by keeping track of a state. If delay-checks is used, then MAIL and connect checks are performed for each recipient (unless overridden via an accept return code). Moreover, the return codes could be more fine grained, e.g., for connect checks a return value could say: reject now, don't bother to do any further checks later on (compare firewall configurations ([pf]): quick). This can be achieved by using either different tags for the LHS or different tags for the RHS.
Question: what would the state look like? Is it possible to define an order on the value of the RHS (in that case the state would be the ``maximum'' of the values)?
Note: a simple state might not be good enough, e.g., if MAIL should be rejected only if RCPT matches an entry in a map, then a state for MAIL does not help. However, such a test would have to be performed in RCPT itself (and use MAIL too for its check).
Other hooks should be the same as in sendmail 8, i.e., for the various SMTP commands (DATA, ETRN, VRFY, EXPN) and some checks for TLS or AUTH.
As the experiences with sendmail 8 anti-spam checks have shown, it will most likely not be sufficient to provide some simple, restricted grammar. Many people want expansions that are only available in sendmail 8 due to the ruleset rewrite engine (even though that has several problems by itself since it was intended for address rewriting, not as a general programming language). Question: how can sendmail X provide a flexible description language for anti-spam checks (and other extensions, e.g., the address resolver)? Of course the simplest way for the implementors is to let users hack the C source code. If there is a clean API and a module concept (see also Section 3.15) then this might not be as bad as it sounds. However, letting user supplied C code interact with the sendmail source code may compromise the reliability of sendmail. A separation like libmilter is cleaner but slower.
For the first release (sendmail X.0) some basic anti-spam functionality must be provided that is sufficient for most cases. For each step SMAR returns up to three values:
OK | 250 |
REJECT | 550 |
TEMP | 450 |
RELAY | 150 |
This code is modified by adding a constant to it if the RHS string contained QUICK.
Below is a proposal for the basic anti-spam algorithm. There are a few configuration options for this: feature delay-checks (same behavior as in sendmail 8, i.e., returning rejections is delayed to the RCPT stage).
Note: SMAR and SMTPS have to interact properly here to avoid too much communication between them (and hence latency). The problem is that SMTPS may make some local decisions whether to allow relaying (based on local maps of some kind, e.g., regular expressions) and those checks are also part of the algorithm outlined above. Unless SMTPS tells SMAR whether relaying is allowed, SMAR may have to make the lookups with tag Spam: even though they might not be necessary because SMTPS will reject the recipient due to unauthorized relaying. In the current version, SMTPS checks whether it is an unauthorized relaying attempt before calling SMAR, i.e., relaying via access map is not possible right now.
The API provides functions for each SMTP command/phase similar to sendmail 8. For maximum flexibility, we specify it as an asynchronous API as outlined in Section 3.1.1. In this case, two different function calls for the same session/transaction cannot overlap (theoretically we could do that due to pipelining, but it probably doesn't give us anything). Therefore the session context (which may contain (a pointer to) the transaction context) should be enough to store the state between calls, i.e., it acts as a handle for the two corresponding functions.
check_connect(INOUT session-ctx, IN line, OUT state), check_connect_status(IN session-ctx, OUT state),
check_helo(INOUT session-ctx, IN line, OUT state), check_helo_status(IN session-ctx, OUT state),
check_ehlo(INOUT session-ctx, IN line, OUT state), check_ehlo_status(IN session-ctx, OUT state),
check_noop(INOUT session-ctx, IN line, OUT state), check_noop_status(IN session-ctx, OUT state),
check_rset(INOUT session-ctx, IN line, OUT state), check_rset_status(IN session-ctx, OUT state),
check_vrfy(INOUT session-ctx, IN line, OUT state), check_vrfy_status(IN session-ctx, OUT state),
check_expn(INOUT session-ctx, IN line, OUT state), check_expn_status(IN session-ctx, OUT state),
check_mail(INOUT session-ctx, IN line, OUT state), check_mail_status(IN session-ctx, OUT state),
check_rcpt(INOUT session-ctx, IN line, OUT state), check_rcpt_status(IN session-ctx, OUT state),
check_data(INOUT session-ctx, IN line, OUT state), check_data_status(IN session-ctx, OUT state),
check_header(INOUT session-ctx, IN buffer, OUT state), check_header_status(IN session-ctx, OUT state), called for each header line?
check_body(INOUT session-ctx, IN buffer, OUT state), check_body_status(IN session-ctx, OUT state), called for each body chunk?
It might also be possible to just pass the SMTP command/phase as a parameter and hence minimize the number of functions. However, then the callee most likely has to dispatch functions appropriately. sendmail X will most likely only provide functions for connect, mail, and rcpt, hence it might really be simpler to have just a single function. There must be some mechanism anyway to act upon configuration data (see Section 3.5.2.7).
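A sketch of this single-entry-point alternative, dispatching on the SMTP phase via a function-pointer table; the enum, the table, and check_phase() are assumptions layered on top of the check_* functions listed above:

#include <stddef.h>

struct smtps_session;           /* defined elsewhere */

typedef enum {
    SMTPS_PH_CONNECT, SMTPS_PH_HELO, SMTPS_PH_EHLO,
    SMTPS_PH_MAIL, SMTPS_PH_RCPT, SMTPS_PH_DATA
} smtps_phase_t;

typedef int (*check_fn_t)(struct smtps_session *ctx, const char *line, int *state);

extern int check_connect(struct smtps_session *, const char *, int *);
extern int check_mail(struct smtps_session *, const char *, int *);
extern int check_rcpt(struct smtps_session *, const char *, int *);

/* Only connect, mail, and rcpt have handlers; other phases are simply accepted. */
static check_fn_t check_tab[] = {
    [SMTPS_PH_CONNECT] = check_connect,
    [SMTPS_PH_MAIL]    = check_mail,
    [SMTPS_PH_RCPT]    = check_rcpt,
};

int
check_phase(smtps_phase_t phase, struct smtps_session *ctx, const char *line, int *state)
{
    if ((size_t) phase < sizeof(check_tab) / sizeof(check_tab[0])
        && check_tab[phase] != NULL)
        return check_tab[phase](ctx, line, state);
    *state = 0;                 /* nothing to check for this phase: accept */
    return 0;
}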
Notice: if the anti-spam functions end up in a different module from the SMTP server, we certainly have to minimize the amount of data to transfer; it would be bad to transfer the entire session/transaction context all the time. Crazy idea: have a stub routine that only sends the changed data to the module. This might be ugly, but could be generalized. However, it would require that the stub routine recognizes changed data, i.e., it must store the old data and provide a fast method to access and compare it. Then we need an interface that allows transferring only changed data, which could be done by using ``named'' fields. However, then the other side also needs to store the data and re-assemble it. Moreover, both caches need to be cleaned up, either explicitly (preferred, but would require extra API calls which may expose the implementation; at least we would need a call like close or discard) or by expiration (if an entry is expired too early, i.e., while in use, it would ``simply'' be added again). The overhead of keeping track of and assembling/re-assembling data might outweigh the advantage of minimized data transfers.
The SMTP server must offer a way to check for valid (local) users. This is usually done in two steps:
These steps could be combined into one, i.e., look up the complete address in one map. However, this may require many entries if a lot of ``virtual'' domains are used and valid users are the same in those. Moreover, a map of valid localparts might be the password file which does not contain domain parts, hence the check must be done in two steps.
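A sketch of the two-step check; map_t, map_lookup(), and the MAP_* result codes are placeholders for whatever map interface is finally chosen (compare the lookup results in Section 3.12.1):

typedef struct sm_map map_t;    /* opaque map handle (placeholder) */
enum { MAP_FOUND, MAP_NOTFOUND, MAP_TEMPFAIL };
extern int map_lookup(map_t *map, const char *key, char **rhs);

/* Returns 1 if localpart@domain is a valid local address, 0 if not,
** -1 on a temporary lookup failure.
*/
int
is_local_address(map_t *local_domains, map_t *local_users,
                 const char *localpart, const char *domain)
{
    int r;

    r = map_lookup(local_domains, domain, NULL);    /* step 1: is the domain local? */
    if (r == MAP_TEMPFAIL)
        return -1;
    if (r != MAP_FOUND)
        return 0;

    r = map_lookup(local_users, localpart, NULL);   /* step 2: does the localpart exist? */
    if (r == MAP_TEMPFAIL)
        return -1;
    return r == MAP_FOUND;
}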
The mail sender address must not just be syntactically valid but also replyable. The simplest way to determine this is to treat it as a recipient address and ask the address resolver to route (resolve) it. However, there are a few problems: the address resolver should not perform alias expansion because it would be a waste of time to check a long list of addresses (recipients). Question: if an address is found in the alias map, is that a sufficient indication that it is acceptable (i.e., routable)? Answer: most likely yes, it is at least very unusual to have an address in the alias map which is not valid (routable).
sendmail 8 performs a simple check for the sender address: it tries to determine whether the domain part of the address is resolvable in DNS. Even though this is a basic check, it is not sufficient as spammers simply create bogus DNS entries, e.g., pointing to 127.0.0.1. Hence it seems to be useful to perform an almost full address expansion as described above and then check the IP addresses against a list of forbidden values which are specified in some map. This approach also has a small problem: if the address is routed via mailertable then it could be directed to one of those forbidden values. There are two possible solutions to this problem:
Solution 2 seems to be useful because mailertable entries are most likely made by an administrator with a clue and hence are implicitly ok. However, implicit operations can cause problems, hence approach 1 might be better from a design standpoint, but less user friendly.
Most map lookups for the SMTP servers are implemented by SMAR to avoid blocking calls in the state-threads application. SMTPS sends minimal information to SMAR, i.e., basically a key, a map to use for the lookup, and some flags, e.g., look up the full address, the domain part, the address without details (``+detail''). SMAR performs the lookup(s) and sends back the result(s). The result includes an error code and the RHS of the map (maybe already rewritten if requested?).
The SMTP server is started and controlled by the MCP. Each server communicates with the queue manager and the address resolver (question: directly?).
The SMTP servers must be able to reject connections if requested by QMGR to enable load control. This should be possible in various steps, i.e., by gradually reducing the number of allowed connections. In the worst case, the SMTP server must reject all incoming connections. This may happen if the system is overloaded, or simply if there is no free disk space to accept mail.
The original design required that the SMTP server removes a content entry from the CDB on request by the queue manager. This was done to avoid locking overhead since the SMTP server is the only one which has write access to the CDB. However, in the current design the QMGR interacts directly with the CDB (which should be implemented as a library) to remove entries from it. This is done for several reasons:
The address resolver doesn't have an external interface. However, it can be configured in different ways for various situations. Todo: and these are?
Question: how to specify address resolving routines? Provide some basic functionality, allow hooks and replacements? How to describe? There are four different types of addresses: envelope/header combined with sender/recipient. It must be possible to treat those differently. For example, a pure routing MTA doesn't touch header addresses at all.
In Section 2.6.6.1 a configuration example was given which is slightly modified here:
local-addresses {
    domains = { list of domains };
    map { type=hash, name=aliases, flags={rfc2821, local-parts, detail} }
    map { type=passwd, name=/etc/passwd, flags={local-parts} }
}
See Section 3.2.2 for how the list of valid domains can be specified, and Section 4.3.3 for a possible way to check whether a domain is in a list.
The current implementation (2004-07-27) uses the following hacks:
[127.0.0.2]
which indicates that LMTP over a local socket should be used.
local:
.
Item 1 could be modified a bit to use a different RHS, e.g., local: or lmtp:, to avoid using an IP address. The implementation must be modified accordingly (theoretically it might be possible to use a broadcast address but this is just another hack which may cause problems later). Item 2 should be extended by the more generic description given above in the configuration example.
Possible flags for map types are:
The address resolver must be able to access the usual system routines (esp. DNS) and different map types (Berkeley DB, NDBM, LDAP, ...). In addition to this, sendmail X.x should contain a generic map interface. This can either be done using a simple protocol via a socket or shared libraries (modules).
Question: should the address resolver receive a pre-parsed address (internal, tokenized format) or the string-representation (external format) of the address? This depends on where the address parser is called. It is non-trivial to pass the internal (tokenized) form via IPC because it is a tree with pointers in memory. So if we want to use that format, we need to put it into a contiguous memory region with the appropriate information (we could use offsets relative to the beginning of the memory section, but is the packing/unpacking worth it?). Question: do we want the SMTP server to do some basic address (syntax) checking? That seems to be ok, but adds code to it. This may also depend on the decision of who will do anti-spam checks, because the address must be parsed before that.
Basic functionality: return DA information for an address.
ar_init(IN options, OUT status): Initialize address resolver.
ar_da(IN address, IN ar_tid, OUT da_info, OUT status): start an AR task; optionally get the result immediately.
ar_da_get(IN ar_tid, OUT da_info, OUT status): get results for an AR task.
ar_da_cancel(IN ar_tid, OUT status): cancel an AR task.
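A possible C rendering of this interface and its use by a caller; the types (ar_tid_t, da_info_t), the AR_OK code, and the return-value-instead-of-OUT-status convention are assumptions for illustration only:

typedef unsigned int ar_tid_t;                                   /* caller-generated task id */
typedef struct { char *da; char *host; char *addr; } da_info_t;  /* simplified result tuple  */
#define AR_OK 0

extern int ar_da(const char *address, ar_tid_t tid, da_info_t *da_info);
extern int ar_da_get(ar_tid_t tid, da_info_t *da_info);
extern int ar_da_cancel(ar_tid_t tid);

void
resolve_example(void)
{
    ar_tid_t  tid = 1;          /* must be unique per outstanding request */
    da_info_t da;

    if (ar_da("user@example.com", tid, &da) == AR_OK) {
        /* result was available immediately: hand "da" to the scheduler */
    } else {
        /* result arrives later: the event loop calls ar_da_get(tid, &da)
        ** when the AR signals completion, or ar_da_cancel(tid) on timeout.
        */
    }
}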
Question: do we want to add a handle that is returned from the initialization function and that is passed to the various AR routines? We may want to have different ARs running, or we want to pass different options to the routines. If we use a handle, then we also need a terminate function that ends the usage of that handle.
Note: as explained in Section 3.4.5, the AR is not doing MX lookups. Those are either done in the QMGR or the DAs.
Resolved address format is a tuple consisting of
Question: should there be more elements like: localpart, extension, hostpart? It would be nice to make this flexible (based on DA?).
Notice: the result of an address resolver call can be a list of such tuples. This is more than MX records can do. By allowing a list of tuples we have maximum flexibility. For example, the address resolver could return lmtp,host0,user@domain; esmtp,host1:host2,user@domain. Question: what kind of complexity does this introduce? By allowing different DAs the scheduling may become more complicated. Most MTAs allow at most several hosts, but only one address and DA. This flexibility may have serious impact on the QMGR, esp. the scheduling algorithms. Before we actually allow this, we have to evaluate the consequences (and the advantages).
Of course an address can also resolve to multiple addresses (alias expansion).
Handling errors is an especially interesting problem in the address resolver. Even though it looks simple at first, the problem is complicated by the fact that multiple requests might have to be made. For example, getting the IP addresses of hosts to contact for an e-mail address involves (in general) MX and A record lookups, i.e., two phases where the second phase can consist of multiple requests. Handling an error in the first phase is simple: that error is passed back to the caller. Handling an error in the second phase is complicated: it depends on the kind of error and for which of the A records an error occurs. The main idea for the address resolution is to find a host to which the mail can be delivered. According to the requirements, this is one of the systems that is listed as ``best MX host''. Hence if a lookup of any A record for any of the best MX hosts succeeds it is sufficient to return that result; maybe with a flag indicating that there might be more data but the lookup for that failed in some way. That is, a partial result is better than insisting that all lookups return successfully.
It might be useful to cache results of map lookups in the AR to avoid too many lookups, esp. for maps where lookups are expensive. This requires that similar to DNS a TTL (Time To Live) is specified to avoid caching data for too long. A TTL of 0 will disable caching. In general it would be better if the external maps provide their own caching since then the code in the AR will be simpler. Moreover, if several modules in sendmail X will use maps, then caching in each of the modules will just replicate data with the usual problems: more memory in use, possibility of inconsistencies, etc.
Note: this may be a future enhancement that is not part of sendmail X.0.
The AR is likely to call functions that may take a long time before they return a result. Those functions will include network activity or other tasks that have a high latency, e.g., disk I/O: searching for an entry in a large map. If we actually know all the places that can block, we could employ a similar threading model as intended for the QMGR, and hence even make the AR a part of the QMGR. However, it is unlikely that we are able to determine all possible blocking function calls in advance, especially if third-party software (or modules) is involved. Hence it is easier for programming to use a threading model in which the OS helps us to get around the latency, i.e., the OS schedules another thread if the current one is blocked. This will require more threads than are usually advisable for a high-performance program (e.g., two times the number of processors), since more threads can be blocked. However, it is probably still a bad idea to use one thread per task, since that will more or less relate to one thread per connection, which is definitely bad for performance. The event thread library described in Section 3.18.5.1.1 can be configured to have a reasonable number of worker threads and hence should be flexible enough to use for the implementation of the AR.
If all blocking calls actually are introduced by network activity, then state threads (see Section 3.18.3.1) would be an efficient way to implement the AR. However, that requires that all libraries which are used by the AR perform I/O only via the state threads interface, which is not achievable because that would require recompiling many third-party libraries.
The external interface of the mail submission program must be compatible (as much as possible) with sendmail 8 when used for this purpose. There will be less functionality overall, but the MSP must be sufficient to submit e-mails.
Todo: to which other program does the MSP speak? See the architecture chapter.
Mail delivery agents must obey the system conventions about mailboxes. sendmail X.0 will support LDAs that speak LMTP and run as a daemon or can be started by MCP (on demand) such as procmail [pro].
Mail delivery agents have different properties (compare mailer definitions in sendmail 8). These must be specified such that the QMGR can make proper use of the DA.
These properties are listed below (some of them are configuration options for the MCP, i.e., how to start DA). Notice: this list is currently almost verbatim taken from the sendmail 8 specification.
MCP configuration:
sendmail 8 options to fix mails by adding headers like Date, Full-Name, From, Message-Id, or Return-Path won't be in sendmail X(.0) as options for delivery, but should be in the MSA/MSP such that only valid mails are stored in the system. It's not the task of an MTA to repair broken mails.
Other sendmail 8 options that probably won't be in sendmail X.0:
More sendmail 8 options that won't be in sendmail X.0:
See Section 2.8.2 about delivery classes and delivery instances for some basic terminology (which is not yet fixed and hence not used consistently here). There are several ways multiple delivery agents (instances) can be specified:
Note: proposal 3 is probably required to make efficient use of multi-processor systems, i.e., if a DA is implemented using statethreads [SGI01] (see also Section 3.18.3.1) then the DA can only make use of a single processor; moreover, it can block on disk I/O. Additionally, if a DA can only perform a single delivery per invocation (process), then multiple processes can be used to improve throughput by providing concurrency.
Proposal 1 is required if the different features are too complicated to implement in a single process. For example, it does not seem to be useful to implement a DA that speaks SMTP and performs local delivery as a single process because the latter usually requires changing uids. However, it makes sense to implement a DA that can speak ESMTP, SMTP, and LMTP in a single process, maybe even using different ports as well as some other minor differences. Multiple processes are also useful if the DA process has certain restrictions on the number of DA instances (threads) it can provide. For example, one DA may provide only 10 instances while another one may provide 100. However, such a restriction may also be achieved by other means, e.g., in the scheduler.
In some cases it might be useful to have a fairly generic DA which can be instantiated by external data, e.g., via mailertable entries that select a port number. Obviously one instantiation parameter is the destination IP address (it would not make sense to specify a class for each of them), but which other parameters should be flexible (specified externally to the configuration itself)? Possible parameters are for example port number and protocol (ESMTP, SMTP, LMTP). However, this data must be provided somehow to instantiate the values. Offering too much flexibility here means that these parameters must be configured elsewhere -- outside the main configuration file -- which most likely introduces an additional configuration syntax (e.g., in mailertable: mailer:host:port) which needs to be understood by the user and parsed by some program, thus adding complexity to the configuration and the program. This is another place where it is necessary to carefully balance consistency (simplicity) and flexibility. In a first version, a less flexible and more consistent configuration is desired (which also should help to reduce implementation time).
The previous section showed several proposals for how to specify multiple DAs. It is most likely that all of them will be implemented, because each of them is useful under different circumstances. This causes a problem: how to select (name) a DA? If multiple processes can implement the same DA behavior, then they either need the same name or some other way must be found to describe them such that they can be referenced (selected).
There will be most likely only two ways to select a DA: one default DA (e.g., esmtp) and the rest can be selected via mailertable.
A DA needs several names:
The scheduler receives a delivery class identifier from smar. Based on that it has to find a delivery agent instance that provides the delivery class. There can be multiple DAs, and hence some selection algorithm must be implemented; see also below (item 4) about round robin selection etc.
Todo: need some terminology here to properly specify what is meant and unambiguously reference the various items.
Some random notes that need to be cleaned up:
args = "program -f sm.conf -i %d" snprintf(realargs, sizeof(realargs), args, id)Instead of using such a format it might be simpler to just have an option pass_id = -i; and then pass -i id as first argument (just insert it). MCP can give a unique id to each process that specifies this option.
Note: the SMTPC session/transaction id format expects an 8 bit id only. Possible solution: qmgr keeps an ``internal'' id for each smtpc which is used to generate the transaction/session ids but not anywhere else.
min/max processes: who requests more processes? How do they stop? Idle timeout? Number of uses? On request?
Compare this with sendmail 8: there are different types of mailers specified by the Path field, e.g., those which use TCP/IP for communication with a program (IPC), delivery to files (FILE), and external programs for which the communication occurs via stdin and stdout. All IPC mailers can be treated as one delivery family.
Interface between QMGR and DA (see Section 3.4.5.0.1).
Question: what about outgoing address rewriting, header changes, output milters?
The result from the DA of a delivery attempt can be subdivided into two categories: per session and per task.
For the session the result contains the session-id and a status, which is one of the following:
For a transaction the result contains the transaction-id and a status, where status might be fairly complicated:
Usually only one status value is sent back, but if there are multiple recipients and some of them are accepted while some are rejected, then status codes are sent back for each of the failed recipients in addition to an ``overall'' (transaction) status code. If individual recipients had errors then those should be sent back to the QMGR. It might be possible to return only one result if all recipient errors are identical, e.g., in the simplest case if there is only one recipient.
This list is SMTP client specific; it must be extended (generalized) for other DA types. What is important is whether the transaction was successful, failed temporarily or permanently, and whether per-recipient results are available. In case of failures: is this failure ``sticky'' or can another delivery attempt (with different sender/recipients) be made immediately?
Question: which information does the QMGR really need and what is just informational (for display in the mail queue)? Compare mci in sendmail 8. The QMGR needs to know whether the error is permanent or temporary and whether it is ``sticky'', i.e., it will influence other connections to that host too. Question: anything else?
The external interface of the SMTP client is of course defined by (E)SMTP as specified in RFC 2821. In addition to that basic protocol, the sendmail X SMTP client implementation will support: RFC 974, RFC 1123, RFC 2045, RFC 1869, RFC 1652, RFC 1870, RFC 1891, RFC 1893, RFC 1894, RFC 2487, RFC 2554, RFC 2852, RFC 2920.
Todo: verify this list, add CHUNKING, maybe (future extension) RFC 1845 (SMTP Service Extension for Checkpoint/Restart).
We have similar choices here for the process model as for the SMTP server. However, we have to figure out which of those models is the best for the client, it might not be the same as for the server.
The internal interface must be the same as for the other delivery agents, except maybe for minor variations.
This section describes one possible choice for user and group ids in the sendmail X system and the owners and permissions of files, directories, sockets etc.
The MCP is started by root. The configuration file can either be owned by root or a trusted user. Most other sendmail X programs are started by the MCP.
The following notation is used:
There are the following files which must be accessed by the various programs:
Hence the directory must be writable by SMTPS and QMGR, accessible by SMTPC. The files must be writable by SMTPS and readable by SMTPC.
Hence the directory is owned by SMTPS (O-S), has group QMGR (G-Q) and the following permissions 0771 (or 0731):
d rwx rwx --x
The files in the directory are owned by SMTPS (O-S) and have group SMTPC (G-C) with permissions 0640:
rw- r-- ---
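As an illustration only (the path and the way the uids/gids are obtained are assumptions), the CDB directory and its files could be created with exactly these owners and modes:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Create the CDB directory: owner SMTPS, group QMGR, mode 0771
** (the mode is subject to the process umask; the uids/gids would
** come from getpwnam()/getgrnam() lookups of the configured accounts).
*/
int
create_cdb_dir(const char *dir, uid_t uid_smtps, gid_t gid_qmgr)
{
    if (mkdir(dir, 0771) != 0)
        return -1;
    return chown(dir, uid_smtps, gid_qmgr);
}

/* Create a CDB file: owner SMTPS, group SMTPC, mode 0640. */
int
create_cdb_file(const char *path, uid_t uid_smtps, gid_t gid_smtpc)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0640);

    if (fd < 0)
        return -1;
    if (fchown(fd, uid_smtps, gid_smtpc) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}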
Todo: figure out how this works for non-Unix systems.
Moreover, there is one communication socket each between QMGR and other sendmail X programs, hence these sockets must be readable/writable by QMGR and that program. Considering that socket permissions are not sufficient to protect a socket on some OSs, the socket must be in a directory with the proper permissions.
The sockets and directories are owned by QMGR and group-accessible to the corresponding program. Problem: either QMGR must belong to all those groups (to do a chown(2) to the correct group), or the directories must exist with the correct permissions and the group id must be inherited from the directory if a new socket is created in that directory. The former can be considered a security problem since it violates the principle of least privilege. The latter may not work on all OS versions.
This section deals in general with databases and caches as they are used in sendmail X.
General notice: as usual it is important to tune the amount of information stored/maintained such that the advantages gained from the information (faster access to required data) outweigh the disadvantages (more storage and effort to properly maintain the data). For example, data should not be replicated in different places just to allow simpler access. Such replication requires more storage (which in the case of memory is precious) and the overhead for maintaining that data and keeping it consistent can easily outweigh the advantage of faster access.
Section 3.4.4 explains that some DBs need multiple access keys. This can be achieved by having one primary key and several secondary keys. Question: do we want to allow a varying number of keys, or do we want to write modules with one, two, three, and four keys? The latter is easier to achieve, but not as flexible. However, if the full flexibility is not needed (it most likely isn't), then it might not be worth trying to specify and implement it.
In some cases an access key (which does not need to be the primary key, see 3.11.1) may not uniquely specify an element. For example, if a host name is used as key, then there might be multiple entries with the same key. In that case, we need to provide functions to return either a list of entries or to walk through the list. The latter is probably more appropriate, since usually the individual entries are interesting, not the list as a whole. For such a ``walk through'' function we need a context pointer that is passed to the get-next-entry function to remember the last position.
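A sketch of such a ``walk through'' interface; the names and the opaque cursor type are placeholders:

typedef struct edb        edb_t;          /* some EDB (placeholder)              */
typedef struct edb_cursor edb_cursor_t;   /* opaque: remembers the last position */

/* Return the first entry matching key (or NULL) and initialize the cursor. */
extern void *edb_get_first(edb_t *db, const char *key, edb_cursor_t **cursor);

/* Return the next entry with the same key, or NULL when the list is exhausted. */
extern void *edb_get_next(edb_cursor_t *cursor);

/* Release the cursor. */
extern void  edb_cursor_free(edb_cursor_t *cursor);

/* Typical use:
**   for (e = edb_get_first(db, host, &c); e != NULL; e = edb_get_next(c))
**       consider(e);
**   edb_cursor_free(c);
*/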
Note: if non-unique access keys are used, then it might be still better to have a primary key which is unique. This should simplify some parts of the API and the implementation. For example, to remove an element from the DB it is required to uniquely identify it (unless the application wants to remove any element matching the non-unique key which might be only useful in a few cases). If we don't have a unique key, then the application needs to search the correct data first and pass a context to the remove function which is in turn used as unique identifier (e.g., simply the pointer to the entry, but the structure should be kept opaque for abstraction).
In some cases the application data itself can provide a list of elements with the same key such that the access method can simply point to one element (usually the head of the list). It is then up to the application to access the desired elements. However, it might be useful to provide a generic method for this, at least in the form of macros to maintain such a list. For example, if the element to which the access structure points is removed, the (pointer in the) access structure itself must be updated. Such a removal function should be provided to the application programmer to minimize coding effort (and errors).
First we need to figure out what kind of access methods we need. This applies to the incoming queue, the active queue, and all other queues. Then we can specify an API that gives us those access methods. Of course we have to make sure that we do not restrict ourselves if we later come up with another access method that is useful but can't be implemented given the current specification. We also have to make sure that the API can be implemented efficiently. Just because we would like some access methods doesn't mean that we should really implement them in case they lead to slow implementations.
Question: do we need a recipient id? This might be necessary if the same recipient is given twice. Even though we could try to remove duplicates, there might be different arguments for the recipient command. It might not be useful, but we have to be able to deal with it. Answer: we need a recipient id as key to access the data in the various EDBs.
There must be some way to search for entries in an EDB according to certain criteria, e.g., recipient host (delivery agent, destination host), delivery time, in general: scheduling information.
Collection of thoughts (just so they don't get lost, they need reconsideration and might be thrown away):
The incoming envelope database (INCEDB) is the place where the queue manager stores its incoming queue. It is stored in memory (restricted size cache) and backed up on disk. Question: do we need two APIs or can we get along with one? There are some functions that only apply to the backup, but that can be taken care of by a unified API. However, this unified API is just a layer on top of two APIs, i.e., its implementation will make use of the APIs specific to the RSC and the disk backup. For the queue manager, the latter two APIs should be invisible. For the implementation, it might be useful to describe them here.
Question: When do we need access to the backup? Answer:
Decision: Use the disk backup of the INCEDB purely for disaster recovery. So envelopes stay in the RSC if they can't be transferred to the ACTEDB, as long as the RSC doesn't overflow. If it does, we use the deferred EDB or we slow down mail reception.
Question: do we just store the data (addresses) in unchanged form in the INCEDB or do we store them in internal (expanded) form? In general we want the expanded form, but due to temporary problems during the RCPT stage external formats may be stored too. In that case the address must be expanded when read from the INCEDB before it is placed into the active queue. See also Section 3.11.6.
API proposal (this is certainly not finalized):
If the RSC stores data of different types, it must be possible to distinguish between them. This is necessary for functions that ``walk'' through the RSC and perform operations on the data or just for generic functions. A different approach would be to use multiple RSCs each of which stores only data of a unique kind. However, that wastes memory (internal fragmentation), since it cannot be known in advance in which proportions the different kinds of data need to be stored, e.g., there might be only a few sessions with lots of transactions, or there might be many sessions with one transaction each.
The INCEDB stores envelope information only temporarily. The envelopes will usually be moved into the active queue. A reasonable implementation of the disk backup currently seems to be a logfile. This file shouldn't grow endlessly, so it must be rotated from time to time (based on size, time, etc). This is done by some cleanup task, which should be a low-priority thread. It reads the backup file and extracts that envelope data that hasn't been taken care of yet. These entries are consolidated into a new file and the files read will be removed or marked for reuse. The more often it runs the less memory it may need because then it doesn't need to read so many entries (hopefully). Question: should this be made explicit in the API or should this be a side effect of the commit operation, e.g., if logfile big enough, rotate it? It seems cleaner to have this done implicitly. But that might be complicated and it might slow down the QMGR in an unexpected moment. Making the operation explicit exposes the internal implementation and binds part of the queue manager to one particular implementation. This violates modularity and layering principles. However, we could get around this by making those functions (rotate disk backup) empty functions in other implementations. Decision: don't make the implementation visible at the API level. It's just cleaner. It can be some low-priority thread that is triggered by the commit operation.
Question: do we need to store session oriented data, e.g., AUTH and STARTTLS, in the INCEDB?
Question: how do we handle commits? Should that be an asynchronous function? That is, initiate-commit, and then get a notification later on? The notification might be via an event since the queue manager is mainly an event-driven program. This might make it simpler to perform synchronous (interactive) delivery, i.e., start delivery while the connection from the sender is still open and only confirm the final dot after delivery succeeded, which makes it possible to not actually commit the content to stable storage. However, does this really help anything? There has to be a thread that performs the appropriate actions, so that thread would be busy anyway. However, most of the work is done by a delivery agent, so we don't need to block a thread for this. So we need another function: incedb_commit_done(IN incedb-handle, IN trans-id, IN commit_status, OUT status): trigger notification that envelope has been committed.
Question: should we use asynchronous operations (compare aio)? edb_commit(handle, trans-id, max-time): do this within max-time. edb_commit_status(handle, trans-id): check the status. How do we want to handle group-commits? We can't do commits for each entry by itself (see architecture section), so we either need to block on commit (threading: the context is blocked and when enough entries are there for committing or too much time passed, then the commits are actually performed and the contexts are ready for execution again), or we can use asynchronous operation. However, we don't want to poll, but we want to be notified. Does the API depend on the way we implement the queue manager? Since the QMGR is the only one with (write) access to the EDB, it seems so. But we need a similar interface for the CDB (see Section 3.11.7.1). So we should come up with an API that is more or less independent of the callers processing model. On the other hand, it doesn't make sense to provide different functions to achieve the same, if we don't need it (compare blocking vs. non-blocking I/O). It only makes it harder for us to implement the API. Todo: figure out whether EDB and CDB can use similar APIs for commits and how much that depends on the callers processing model (thread per connection, worker threads, processes).
The disk backup for the INCEDB should have its own API for the usual reasons (abstraction layer). API proposal (this is not finalized):
The last four functions deal with request lists. These lists contain status updates for transactions or recipients. Request lists are used to support transaction-based processing: all change requests are collected in a list and, after all of the necessary operations have been performed, the requests are committed to disk. This is for example helpful in a loop that updates status information: instead of updating the status for each element one by one, the changes are collected in a list. If during the processing of the loop an error occurs which requires that all changes be undone, then the request list can simply be discarded (ibdb_req_cancel()). If the loop finishes without errors, then the requests are committed to disk (ibdb_wr_status()). This also has the advantage of being able to implement group commits, which may result in better performance.
Question: do we need ibdb_trans_discard()? If the transaction data is only stored after the final dot, then we wouldn't need it. However, that might cause unnecessary delays, and in most cases mails are transmitted and accepted without being rejected after the final dot (or the transmission aborts due to timeouts etc). We could also merge ibdb_trans_rm() and ibdb_trans_discard() by adding a status parameter to a common function, e.g., ibdb_trans_end().
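To make the preceding discussion more concrete, the following sketch shows what the IBDB API might look like in C. Only ibdb_wr_status(), ibdb_req_cancel(), and ibdb_trans_rm()/ibdb_trans_discard()/ibdb_trans_end() are named in the text above; all other identifiers, the types, and the signatures are assumptions, not the actual (still unfinalized) proposal.

/* sketch only: every name and type below is a placeholder */
typedef int sm_ret_T;                     /* status/result code */
typedef struct ibdb_ctx_S *ibdb_ctx_P;    /* open INCEDB disk backup */
typedef struct ibdb_req_S *ibdb_req_P;    /* list of pending status updates */

sm_ret_T ibdb_open(const char *name, ibdb_ctx_P *ibdb);
sm_ret_T ibdb_close(ibdb_ctx_P ibdb);
sm_ret_T ibdb_trans_new(ibdb_ctx_P ibdb, const char *ta_id, const char *sender);
sm_ret_T ibdb_rcpt_add(ibdb_ctx_P ibdb, const char *ta_id, const char *rcpt);
sm_ret_T ibdb_trans_end(ibdb_ctx_P ibdb, const char *ta_id, int status);

/* request lists: collect status changes, then commit or discard them as a group */
sm_ret_T ibdb_req_new(ibdb_ctx_P ibdb, ibdb_req_P *req);
sm_ret_T ibdb_req_add(ibdb_req_P req, const char *ta_id, const char *rcpt, int d_stat);
sm_ret_T ibdb_wr_status(ibdb_ctx_P ibdb, ibdb_req_P req);   /* group commit to disk */
sm_ret_T ibdb_req_cancel(ibdb_ctx_P ibdb, ibdb_req_P req);  /* throw the list away */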
The active envelope database (ACTEDB) is the place where the queue manager stores its active queue; it is implemented as a restricted-size cache. The active queue itself is not directly backed up on disk, but other queues on disk act as (indirect) backup.
Todo: clarify this API.
Routine | Arguments | Returns |
actedb_open | name, size | status, actedb-handle |
actedb_close | actedb-handle | status |
actedb_env_add | actedb-handle, sender-env-info | status |
actedb_env_rm | actedb-handle, trans-id | status |
actedb_rcpt_add | actedb-handle, trans-id, rcpt-env-info | status |
actedb_rcpt_status | actedb-handle, trans-id, rcpt, d-stat | status |
actedb_rcpt_rm | actedb-handle, trans-id, rcpt-env-info | status |
actedb_commit | actedb-handle, trans-id | status |
actedb_discard | actedb-handle, trans-id | status |
We need some support functions for the API specified in Section 3.4.5.0.1 to manipulate the Delivery Agent DB, whose purpose and content have been explained in Section 3.4.10.12.
Open a DA DB.
Close a DA DB.
Open a session and transaction. session contains the destination host to which to connect. The created dadb-entry contains newly created session and transaction handles.
Open a transaction. The updated dadb-entry contains a newly created transaction handle.
Close a transaction.
Close a session.
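A possible C rendering of the six operations just described is sketched below; since the text only names the operations, every identifier and signature here is hypothetical.

/* sketch only: all names and types are placeholders */
typedef int sm_ret_T;
typedef struct dadb_ctx_S   *dadb_ctx_P;    /* the DA DB itself */
typedef struct dadb_entry_S *dadb_entry_P;  /* one entry: session + transaction handles */
typedef struct da_sess_S    *da_sess_P;     /* contains the destination host */

sm_ret_T dadb_open(const char *name, dadb_ctx_P *dadb);
sm_ret_T dadb_close(dadb_ctx_P dadb);
/* open a session (connect to the host given in sess) plus an initial
 * transaction; the newly created handles end up in the dadb entry */
sm_ret_T dadb_sess_open(dadb_ctx_P dadb, da_sess_P sess, dadb_entry_P *entry);
/* open a further transaction within an existing session */
sm_ret_T dadb_ta_open(dadb_ctx_P dadb, dadb_entry_P entry);
sm_ret_T dadb_ta_close(dadb_ctx_P dadb, dadb_entry_P entry);
sm_ret_T dadb_sess_close(dadb_ctx_P dadb, dadb_entry_P entry);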
The deferred envelope database (DEFEDB) is obviously the place where the queue manager stores the deferred queue.
Question: do we need session oriented data in the deferred EDB? Answer: shouldn't be necessary, all data required for delivery is in the envelopes (sender and recipients and ESMTP extensions). Note: this assumes that nobody is so weird as to use session oriented data for routing. There might be people who want to do that; can those be satisfied by allowing additions to the data similar to ``persistent macros'' in sendmail 8?
We need several access methods for the EDB (from the QMGR for scheduling). Question: can we build indices for access ``on-the-fly'', i.e., can we specify a configuration option: `use field X as index' without having code for each field? Answer: we probably need at least code per data type (int, string, ...). Question: is this sufficient? How about (DA, host) as index or something similarly complicated?
Todo: clarify this API.
Routine | Arguments | Returns |
defedb_open | name, mode | status, defedb-handle |
defedb_mail_add | defedb-handle, sender-env-info | status |
defedb_env_rm | defedb-handle, trans-id | status |
defedb_rcpt_add | defedb-handle, trans-id, rcpt-env-info | status |
defedb_rcpt_status | defedb-handle, trans-id, rcpt, d-stat | status |
defedb_rcpt_rm | defedb-handle, trans-id, rcpt-env-info | status |
defedb_commit | defedb-handle, trans-id | status |
defedb_readprep | defedb-handle | status, cursor |
defedb_getnext | cursor | status, record |
defedb_close | defedb-handle | status |
The last three functions deal with status change requests, similar to those for IBDB (see Section 3.11.4.3).
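As an illustration of how the cursor routines from the table might be used for a full scan, here is a minimal sketch; the handle and record types, the return codes, and qmgr_schedule() are placeholders, not part of the proposal.

/* placeholders so the sketch is self-contained */
typedef int sm_ret_T;
#define SM_SUCCESS 0
typedef struct defedb_ctx_S    *defedb_P;
typedef struct defedb_cursor_S *defedb_cursor_P;
typedef struct defedb_rec_S { int dummy; } defedb_rec_T;   /* stands in for a real record */

extern sm_ret_T defedb_readprep(defedb_P defedb, defedb_cursor_P *cursor);
extern sm_ret_T defedb_getnext(defedb_cursor_P cursor, defedb_rec_T *rec);
extern void     qmgr_schedule(defedb_rec_T *rec);          /* hypothetical consumer */

static sm_ret_T
defedb_scan(defedb_P defedb)
{
    defedb_cursor_P cursor;
    defedb_rec_T rec;
    sm_ret_T ret;

    ret = defedb_readprep(defedb, &cursor);
    if (ret != SM_SUCCESS)
        return ret;
    while ((ret = defedb_getnext(cursor, &rec)) == SM_SUCCESS)
        qmgr_schedule(&rec);    /* hand the record to the scheduler */
    /* ret now says "no more records" or reports an error */
    return ret;
}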
Possible implementations:
About option 3: The basic idea of a cyclical file system is to create a fixed set of files and reuse them. In this particular case we need files that represent the times at which queue entries should be tried again (theoretical maximum: Timeout.queuereturn, practical maximum: max-delay). Subdivide the maximum time into units, e.g., 1 minute, and use the files for the queue in round-robin fashion. Each file represents the entries that are supposed to be tried at the time it stands for. So a new entry (which represents a deferred mail/recipient) is placed into the file which represents its next-retry-time. If an entry is delayed again, it is appended to the appropriate file. Since the QMGR is supposed to take care of all entries that are to be tried at that time, the file will afterwards be ``empty'', i.e., no entry in the file is needed anymore.
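A minimal sketch of the slot computation this scheme implies; SLOT_SECS and N_SLOTS are made-up parameters (one file per minute, a four-day practical maximum), not values taken from the text:

#include <time.h>

#define SLOT_SECS 60                /* one file per minute */
#define N_SLOTS   (4 * 24 * 60)     /* assumed practical maximum delay: 4 days */

/* map a next-retry-time to one of the files used in round-robin fashion */
unsigned int
retry_slot(time_t next_retry)
{
    return (unsigned int)((next_retry / SLOT_SECS) % N_SLOTS);
}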
Possible problems:
Whenever an entry is added to the deferred queue a reason is specified (d-stat). The entry also contains scheduling information, see 3.4.10.6. This data can be used by the implementation to treat the entry accordingly.
Note: Courier-MTA uses something remotely similar: it stores the queue files according to the next retry time, see courier/doc/queue.html.
The mail queue consists of two directories: LSDIR/msgs and LSDIR/msgq. The control file is stored in LSDIR/msgs/nnnn/Ciiiiii, and the data file is LSDIR/msgs/nnnn/Diiiiii. ``iiiiii'' is the inode number of the control file. Since inode numbers are always unique, the inode is a convenient way to obtain unique identifiers for each message in the mail queue. ``nnnn'' is the inode number hashed; the size of the hash is set by the configure script (100 is the default, so this is usually just the last two digits of the inode number).
One item of information that's stored in the control is the time of the next scheduled delivery attempt of this message, expressed as seconds since the epoch. There is also a hard link to the control file: LSDIR/msgq/xxxx/Ciiiiii.tttttt. ``tttttt'' is the next scheduled delivery time of this message. ``xxxx'' is the time divided by 10,000. 10,000 seconds is approximately two and a half hours. Each subdirectory is created on demand. Once all delivery attempts, scheduled for the time range that's represented by each subdirectory, have been made, the empty subdirectory is deleted. If a message needs to be re-queued for another delivery attempt, later, the next scheduled delivery time is written into the control file, and its hard link in LSDIR/msgq is renamed.
This scheme comes into play when there is a large amount of mail backed up. When reading the mail queue, Courier doesn't need to read the contents of any directory that represents a time interval in the future. Also, Courier does not have to read the contents of all subdirectories that represent the current and previous time intervals, when it's falling behind and can't keep up with incoming mail. Courier does not cache the entire mail queue in memory. Courier needs to only cache the contents of the oldest one or two subdirectories, in order to begin working on the oldest messages in the mail queue.
About option 2: this may be simple because it requires only little implementation effort since the DB is already available. However, it isn't clear whether the DB is fast and stable enough. Considering the problems that other Sendmail projects encounter with Berkeley DB we have to be careful. Those problems are mainly related to locks held by several processes. In case a problem occurs and a process that holds a DB lock aborts without releasing the lock, all processes must be stopped and a recovery run must be started. In sendmail X we do not have multiple processes accessing the same EDB; only the QMGR has access to those. Hence the problem is minimized, but we still need to make sure that in case of an abort the DB locks are properly released. However, we may even be able to use BDB without using its locks. If we lock the access to the DB inside the QMGR itself, then we avoid the DB corruption problem. Using locks inside the QMGR to serialize access to an EDB does not seem like a big problem. This would lead to some coarse-grain locking (obtain a mutex before accessing the DB), but that might be ok. It might even be possible to use BDB as a cyclical filesystem, but that is something we have to investigate. The cyclical filesystem idea restricts access to the recipient records to a purely sequential method; other access methods would be fairly expensive (require exhaustive search or at least other indices that may require many updates).
BDB provides a fairly easy way to use multiple indices to access the DB: the DB->associate method can link a secondary index to a primary and there can be multiple secondary indices (see [Sle]: Berkeley DB: Secondary indices).
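The following (untested) sketch shows how DB->associate could be used to maintain a `next time to try' secondary index. The DB->associate mechanism and the callback signature are Berkeley DB's; the record layout rcpt_rec_T and the idea of keying the secondary index on next_try are assumptions for illustration.

#include <db.h>
#include <string.h>
#include <time.h>

typedef struct {
    time_t next_try;        /* next time to try this recipient */
    /* ... rest of the recipient record ... */
} rcpt_rec_T;

/* callback: derive the secondary key from a primary record */
static int
next_try_key(DB *secondary, const DBT *pkey, const DBT *pdata, DBT *skey)
{
    const rcpt_rec_T *rec = pdata->data;

    memset(skey, 0, sizeof(*skey));
    skey->data = (void *)&rec->next_try;   /* points into pdata, which BDB permits here */
    skey->size = sizeof(rec->next_try);
    return 0;
}

/* after both databases have been opened:
 *      primary->associate(primary, NULL, secondary, next_try_key, 0);
 * lookups and cursor scans by `next time to try' then go through the
 * secondary handle. */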
Instead of keeping secondary indices on disk (and hence increasing disk I/O, which usually is a scarce resource), we may want to keep them in main memory and just restrict their size. Since they are just indices, i.e., they just contain a pointer (e.g., the main key for the DB), the size of each entry can be fairly small. For example, if the recipient identifier is used as main key then that's about 25 bytes (see Section 4.2.1). If a secondary index is then based on the `next time to try', the overall size of the index is less than 32 bytes per entry (assuming a 4 byte time value). Hence even storing 100000 entries will only require about 3 MB of memory (only data; using an AVL tree will add about 20 bytes per entry on a 32 bit machine, hence resulting in a total of about 5 MB). If a system really has that many entries in the queue then the size of the secondary index seems fairly small compared to the amount of data it stores on disk (and the common memory sizes). Moreover, it should be possible to use restricted size caches for some indices and fill them up based on scanning the main queue. Coming back to our example of using `next time to try' as secondary index, the data can be sorted (e.g., using AVL trees) and the number of entries limited to some configurable value. Then we can simply remove entries that are too far away (their `next time to try' is too far in the future for immediate consideration) when the cache is full and new entries must be added.
Note that not every transaction must be committed to disk immediately. It is not acceptable to lose mail, but it is acceptable (even though highly undesirable) to have mail delivered multiple times. Hence the information that a message has been accepted and not yet successfully delivered must always be safely stored on persistent storage. However, the fact that a mail has been delivered is not of that high importance, i.e., several such updates can be committed as a group to minimize transaction costs. See also the sendmail 8 option CheckpointInterval and Section 3.11.11.
Note: just a small reminder: transaction ids in Berkeley DB are 32 bit integers (of which only 31 bits are available) and hence may be exhausted rather soon. They are reset by running recovery, which hence might be necessary to do on a regular basis (at least on high volume systems).
The content database stores the mail contents, not just the mail bodies (see 2.17).
Possible implementations:
Question: will the CDB be implemented as library or as a process by itself? In the former case there might be a problem to coordinate access to the CDB. If the CDB is accessed from different processes, then some form of locking may be required. This should be handled internally by the CDB implementation.
We need an API (access abstraction) that allows for different implementations. The abstraction must be good enough to allow for simple implementations (e.g., one file per mail content) as well as maybe complicated and certainly fast ones. It must be possible to use just memory storage if delivery modes allow for this (compare interactive delivery in sendmail 8), and it must be possible to use some remote storage on a distributed system. The API must neither exclude any imaginable implementation (as long as it makes some sense), nor must it keep us from doing extremely efficient implementations. It must accommodate (unexpected) changes in the underlying implementations during the lifetime of sendmail X.
Question: what exactly do we need here?
First we need to get a handle for a CDB. We may have to create a CDB before that but it seems better to do this outside the system with other programs, e.g., just like you have to set up queue(s) right now. It might be more flexible to create CDBs on demand (e.g., new directories to avoid ``overflowing'' the ones that exist), however, that doesn't seem to belong in the SMTP server daemon. If the CDB implementation needs it, it should do it itself whenever its operation requires it.
Question: what about different file systems for the CDB? Maybe the user has a normal disk, a RAID, a memory file system, etc. So we also need some ideas how to use different file systems, which should be controllable by the user. Do we really want to make it that complicated? Can we hide this in the API and use ``hints'' in the open call? That should be good enough (for now); but is it sufficiently general?
We need to create a storage for the mail content per envelope. We should give it the transaction id and if existent the SIZE parameter. The library returns an id (content-id) and a status.
To make the interface flexible, there are two identifiers: transaction id as created by SMTPS and stored in the envelope data. If it proves to be useful or necessary, the CDB implementation can return its own identifier which will be used for further function calls (and will be stored in the envelope data too). It's not yet clear which data types to use, currently character strings are fairly likely. If the CDB doesn't want/need to generate its own ids, it can simply return the transaction id. If the CDB is realized as a library, the content-id may be an opaque pointer (or a small integer like a fd).
Question: should read/write maintain the data pointer (offset) themselves (hidden behind content-id) or should this be passed back to the caller? If it is hidden, the calls might be optimized (assuming there is only one writer/reader at a time, which may not be true if we use interleaved delivery). The implementation must allow one writer and several readers concurrently.
Question: should all functions have cdb-handle as input? That allows us to have multiple CDBs open at a time. It seems to be useful, because that makes the API more generic. However, the cdb-handle could be ``hidden'' in the content-id. For now, keep cdb-handle as parameter. If we don't use cdb-handle explicitly all the time, the interface is almost the same as for (file) I/O. Hence the question: do we want to use the same API? The I/O API call sm_io_setinfo() can be ``abused'' to implement calls like cdb_commit(), cdb_abort(), and cdb_unlink().
We have the same problem here as with the envelope DB: how to perform group commits? However, we probably can't use the solution (whatever it may be) from the QMGR in the SMTPS. The mechanism may depend on the process model of the respective module; it might be different in a program that uses worker threads from one that uses (preforked) processes or per-connection threads. If we use per-connection threads then we can simply use a blocking call: we can't do anything else (can we?) in between, we have to wait for the commit to give back the 250 (or an error). Moreover, most of the proposals how to implement the CDB do not even allow for group commits (unless the underlying filesystem provides this facility somehow). For example, if one file per mail content is used, then it is impossible in FFS to perform a group commit; each file has to be fsync(2)ed on its own.
Proposal 2 (see Section 3.11.7) basically tries to get around the problem of fast writes and group commits (synchronous meta-data operations). By using a log-structured filesystem (implemented by sendmail X or maybe just used on some OS that actually provide such a filesystem), writing contents to disk should be significantly faster (an order of magnitude) than using a conventional filesystem and one file per mail [SSB+95].
In the following, we take a look at some variations of this topic.
For some references on log-structured filesystems and their performance relative to conventional filesystems see [RO92], [SBMS93], and [SSB+95]. Especially [SSB+95] indicates that proposal 2 (see Section 3.11.7) may not work as well as hoped. The main reason for this is garbage collection, which can decrease performance by up to 40%, which in turn causes LFS and FFS to have similar performance.
Question: can we use a cyclical filesystem also for the CDB? In addition to the problems mentioned in Section 3.11.6.1 for case 3 we also have a different garbage collection problem. If we store the mail content in files related to their arrival time, and do not reuse them until the maximum queue time is reached, then we have huge fragmentation, but we would get garbage collection for free since we simply reuse the storage after the timeout. How much waste would this be? Let's assume we achieve a throughput of 10MB/s; that's 36GB/h, which is 864GB/day. Even though 100GB IDE disks can be bought for $200, that's a lot of waste. More so due to the number of disks and hence controllers required. Therefore this proposal seems too weird, it requires too much hardware, most of which is barely used. Since most mail is delivered almost immediately, probably more than 90% of that space would be wasted, i.e., used only once during the reception and (almost immediate) delivery of a mail.
It seems that garbage collection is the main problem for using log-structured filesystems as CDB for sendmail. However, maybe a property of the mail queue can help us to alleviate that problem. Garbage collection in conventional log-structured filesystems must be able to reclaim space in partially used segments (see [SSB+95] for the terminology). This requires compacting that space, e.g., by retrieving the ``live'' data and storing it in another segment and then releasing the old segment. Doing that of course requires disk access and hence reduces the available I/O bandwidth. Therefore it should be avoided. Instead of reclaiming partially free segments, only entirely free segments will be reclaimed, which doesn't require any disk access (except for storing the information on disk, i.e., the bitmap for free/used segments). This should be feasible without causing too much fragmentation since the files in the mail queue have a limited lifetime. Hence the general problem that a filesystem has to deal with doesn't (fully) apply here. It should be possible to reclaim most segments after a fairly short time. This approach might be something we can pursue in a later version of sendmail X, for 9.0 it is definitely too much work. Even if we can get access to the source code of a log-structured filesystem (as offered by Margo Seltzer in [SSB+95], pg. 16), that code is in the kernel and hence would require a significant effort to rewrite. Moreover, it is not clear how much such code would interfere with the (U)VM of the OS.
Proposal 5 looks like a strange idea. However, it may have some merits, i.e., it gives us group commits. If we use the transaction id concatenated with a block number as key, then the value would be the corresponding block. Group commits are possible due to the transaction logging facility of BDB. Whether this will give good performance (esp. for reading) remains to be seen (tested).
General notice: currently it is claimed that meta-data operations in Unix file systems are slow and this (together with general I/O limits of hard disks) limits the throughput of an MTA. However, there are file systems that try to minimize the problem, e.g., softupdate (for *BSD), ext3, ReiserFS (Linux), and journalling file systems (many Unix versions). We may be able to take advantage of those file system enhancements to minimize the amount of work we have to do to implement a high performance MTA. Moreover, in high-performance systems disks with large NVRAM caches will be used, thus making disk access really fast.
Idea for simple CDB: keep queue files around and reuse them. That should minimize file creation/deletion overhead. However, does it really minimize meta-data operations? As it is shown in Section 5.2.1.2.1, some filesystems seem to become actually slower when reusing existing files.
The queue manager keeps track of all connections:
The first two should be kept in memory, the last two can be stored on disk.
The open connections must be completely available in memory. The older connections can be limited to some maximum size, using something like LRU to throw expired connections out of the cache.
The connection caches should contain all information that might be relevant for the queue manager:
For more details see 3.4.10.8 and 3.4.10.10.
Do we want to look into other databases than Berkeley DB? How about:
Some data should be stored in memory but the amount of memory used should be limited. For example, the connection caches must be limited in size.
An RSC stores only pointers to data. We may need to create a copy of the data for the RSC for each new entry, in which case an entry_create() function should be passed to rsc_create().
Returns a status and (if successful) a handle (otherwise NULL).
Question: how do we implement such caches? If the cache size is small, then we can create a fixed size array. We can then simply use a linear search for entries. Up to which size is this efficient enough?
If linear search becomes too slow, we need a better organization. We need two access methods: First by key, second by time. So we need to maintain two lookup structures for the array. The first one for the key can be a b-tree, the second one for the time can be a linked list or a b-tree.
We can use a ring buffer (fixed size linked list; fixed size array) for the time access. If the list is supposed to expire purely based on the time when an entry has been entered, then that organization is the simplest and fastest: when the buffer is full, just remove the tail (which is the oldest entry). If the expiration is based on LRU then a new entry can be made each time. The old one can either be removed (esp. if it is a linked list), or marked as invalid (free). In the latter case the effective size would shrink, which might not be intended but may be reasonable to get efficient operations.
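A toy sketch of such a fixed-size cache that expires purely by insertion order; conn_T, the key, and the size are made up:

#include <string.h>
#include <time.h>

#define CONN_CACHE_SIZE 256

typedef struct {
    char    host[256];      /* key: peer host */
    time_t  last_used;
    int     valid;
} conn_T;

typedef struct {
    conn_T       entries[CONN_CACHE_SIZE];
    unsigned int head;      /* next slot to overwrite, i.e., the oldest entry */
} conn_cache_T;

/* add a new entry; when the buffer is full the oldest entry is overwritten */
static void
conn_cache_add(conn_cache_T *cache, const char *host, time_t now)
{
    conn_T *e = &cache->entries[cache->head];

    strncpy(e->host, host, sizeof(e->host) - 1);
    e->host[sizeof(e->host) - 1] = '\0';
    e->last_used = now;
    e->valid = 1;
    cache->head = (cache->head + 1) % CONN_CACHE_SIZE;
}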
One possible implementation can be taken from postfix: ctable.
For some structures it might be useful to vary the size of a cache depending on the load. Question: which ones are these? Is this really necessary? A possible implementation might be to link RSCs together, but that might be complicated if b-trees are used for lookups. Moreover, shrinking a cache is non-trivial; the question is whether saving some memory is worth the computational overhead of shrinking.
For some operations atomic updates are necessary or at least very useful. Such operations include updates of the delivery status of an email. In previous sendmail versions, rename(2) is used for atomic changes, i.e., to write a new qf file a tf file is written and then rename(tf, qf) is used. BSD FFS provides this guarantee for rename(const char *from, const char *to):
rename() guarantees that if to already exists, an instance of to will always exist, even if the system should crash in the middle of the operation.
However, sendmail 8 doesn't update the delivery status of an envelope all the time by default:
CheckpointInterval=N: Checkpoints the queue every N (default 10) addresses sent. If your system crashes during delivery to a large list, this prevents retransmission to any but the last N recipients.
sendmail X can behave similarly, which means that atomic updates are not really necessary. However, it must be made sure that there is no inconsistent information on disk, or that at least sendmail can determine the correct state from the stored data.
Question: are disk writes atomic? As long as they are within a block, are they guaranteed to leave the file in a consistent state? Can we figure out whether the write is within a block? If it spans multiple blocks, it may cause problems. Otherwise a write could be used for an atomic update, e.g., a reference counter or status update, without having to copy a file or create a new entry. If we can do ``atomic one byte/block writes'', we can update the information with an fseek(); write(new_status,1). Remark: qmail-1.03/THOUGHTS [Ber98] contains this paragraph:
Queue reliability demands that single-byte writes be atomic. This is true for a fixed-block filesystem such as UFS, and for a logging filesystem such as LFS.
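Assuming single-byte writes are indeed atomic on the filesystem in use, an in-place status update could look roughly like the sketch below. The text above suggests fseek(); write(); the sketch uses the equivalent pwrite(2) so that other readers' file offsets are not disturbed. The function name and the fsync() policy are assumptions.

#include <sys/types.h>
#include <unistd.h>

/* overwrite a single status byte at a known offset in a queue file */
static int
update_status(int fd, off_t status_offset, unsigned char new_status)
{
    if (pwrite(fd, &new_status, 1, status_offset) != 1)
        return -1;
    return fsync(fd);       /* force the update to stable storage */
}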
If we store several queue file entries in a single file instead of each entry in an individual file (or some database), then the following layout instructions should be observed. Note: the same structure will be used for communication between sendmail X modules, see also Section 1.3.1 about this decision.
It might be useful to implement the queue files using a record structure with fixed length entries. This allows easy access to individual entries instead of sequentially reading through a file. To deal with variable length entries, continuation records can be used. Even if we don't use fixed size records, the entries should be structured as described below (tagged data including length specification). This allows for fast scanning and recognition without relying on parsing (looking for an end-of-record marker, e.g., CR).
The first record must have a version information and should contain a description of the content. It must have the length of the records in this file. Possible layout: record length (32 bit), version (32 bit), content description, maybe similar to a printf format string; see also Section 3.14.11.1.
Each record contains first the total length of the entry (in bytes, must be a multiple of the record length), then a type (32 bit), and then one or more subentries. The type should contain a continuation flag to denote whether this is the first or a continuation record. The last record has a length (less than or) equal to the record length. It also has a termination subentry to recognize that the record was completely written to disk when it is read. Each of the subentries consists of its length (32 bit, must be a multiple of 4 to be aligned), type (32 bit), and content. The length specifies only the length of the data; it doesn't include the type (or the length itself). We could save bytes if we do this the other way round: type first, then optionally length, whereby length can be omitted if the type says it's a fixed size entry (32 bit as default). However, this probably makes en/decoding more complicated and is not worth the effort. Question: should each subentry be terminated by a zero character? If an entry doesn't fit into the record, a continuation record is used and the current record is padded, which is denoted by a special type. The termination subentry is identified by its type. Additionally, it might be useful to write the entire length of the record as its value to minimize the problem of mistakenly identifying garbage on the disk as a valid record. We won't use a checksum because that requires us to actually read through all the data before we can write it. Notice: variable length data should be aligned on 4 byte boundaries to allow things like accessing a (byte) buffer as an integer value. On 64 bit systems it might be necessary to use 8 byte alignment.
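One possible C rendering of these headers, purely as an illustration; the field names and the exact encoding of the continuation flag are not specified above:

#include <stdint.h>

typedef struct {
    uint32_t rec_len;       /* total length of the entry, multiple of the record length */
    uint32_t rec_type;      /* type, including a continuation flag */
    /* followed by one or more subentries */
} rec_hdr_T;

typedef struct {
    uint32_t se_len;        /* length of the data only, multiple of 4 */
    uint32_t se_type;       /* content type; special types mark padding and termination */
    /* followed by se_len bytes of data, 4 byte aligned */
} subentry_hdr_T;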
Some open questions:
Note: see also xdr(3).
Note: this might be not needed if we go with Berkeley DB for the envelope DB. However, the ``queue'' access type for Berkeley DB works only with fixed sized records, so there may be a use for the following considerations.
Let's do some size estimates for the INCEDB, i.e., the disk backup (see Section 3.11.4.3 and 3.4.10.4). The smallest size should be 4 bytes (32 bit word); for each entry a description (type) and length field is used additionally (see above).
entry | length (bytes) |
transaction-id | 20, up to 32 or even 64 |
start-time | 8 |
cdb-id | up to 64 |
n-rcpts | 8 |
sender-spec | up to 256 (RFC 2821) plus ESMTP extensions |
size | 8 |
bodytype | 4 |
envid | up to 100 (RFC 1891) |
ret | 4 |
auth | no length restriction? (RFC 2554, refers to 1891) |
by | 10 for time, 1 (4?) for mode/trace |
entry | length (bytes) |
transaction-id | 20, up to 32 or even 64 |
rcpt-spec | up to 256 (RFC 2821) plus ESMTP extensions |
maybe a unique id (per session/transaction?) | |
notify | 4 |
orcpt | up to 500 (RFC 1891) |
If we use a default size of 512 bytes, then we would get 100 KB/s if the MTA can deal with 1000 msgs/s (for one recipient per message, if it would be two - which is more than the average - it still would be only 150 KB/s). This is fairly small and should be easily achievable. Moreover, it doesn't amount to much data, e.g., for one day (without cleanup) it would be 12 GB (for very large traffic, i.e., 86 million messages).
Some questions and ideas; this part needs to be cleaned up.
Generic idea (see the appropriate sections earlier on which may have more detailed information): EDB: maybe use some of those as files (just append) and switch to a new one at regular intervals (time/size limits). The old one will be cleaned up then: all entries for recipients that have been delivered will be removed (logged), and a new file will be created. This has to be flexible to deal with ``overflow'', i.e., many of these files might be used, not a fixed set.
In some cases it might be useful to store domain names in reverse order in a DB (type btree). Then a prefix based lookup can (sequentially) return all subdomains too. Whether host.domain.tld should be stored as dlt.niamod.tsoh or tld.domain.host remains to be seen (as well as whether this idea makes much sense at all).
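A purely illustrative helper for the label-reversal idea (not optimized, and only useful if the idea turns out to make sense at all):

#include <stdio.h>
#include <string.h>

/* rewrite "host.domain.tld" as "tld.domain.host" so that a btree
 * prefix lookup on the reversed form returns all subdomains */
static void
reverse_labels(const char *domain, char *out, size_t outlen)
{
    const char *p = domain + strlen(domain);
    size_t used = 0;

    out[0] = '\0';
    while (p > domain && used < outlen) {
        const char *dot = p;

        while (dot > domain && dot[-1] != '.')
            dot--;
        used += (size_t)snprintf(out + used, outlen - used, "%.*s%s",
            (int)(p - dot), dot, dot > domain ? "." : "");
        p = (dot > domain) ? dot - 1 : domain;
    }
}
/* reverse_labels("host.domain.tld", buf, sizeof(buf)) yields "tld.domain.host" */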
As explained in Section 2.12 maps are used to lookup keys and possibly replace the keys with the matching RHS. The required functionality exceeds that of default lookups (exact match) which can be provided by performing multiple lookups in which parts of the key are omitted or replaced by wildcard patterns (see 3.12.2). Accordingly replacement must provide a way to put the parts of a key that have been omitted or replaced into the RHS; this is explained in Section 3.12.3.
A map lookup has two results:
Section 2.12 explains the problem of checking subdomains (pseudo wildcard matches). The solution chosen is a leading dot to specify ``only subdomains'' (case 2.12 in Section 2.12).
Below are the cases to consider. Note: in the description of the algorithms some parts are omitted:
Notes:
In sendmail 8 there are various places where (parts of) e-mail addresses are looked up: aliases, virtusertable, access map. Unfortunately, all of these use different patterns which are inconsistent and hence should be avoided:
Question: should this be made consistent? If yes, which format should be chosen? By majority, it would be: user@host, user, @host. A compromise could be to always use the '@' sign to distinguish an e-mail address (or a part of it) from a hostname or something else: user@host, user@, @host.
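If the compromise were adopted (user@host, user@, @host), the lookup sequence could look roughly like the sketch below. map_lookup(), its return codes, and the types are placeholders; note also that a temporary lookup failure would have to be reported as such rather than treated like ``not found''.

#include <stdio.h>

/* placeholders so the sketch is self-contained */
typedef int sm_ret_T;
typedef struct map_S *map_P;
#define SM_NOTFOUND 1
extern sm_ret_T map_lookup(map_P map, const char *key, char *rhs, size_t rhslen);

static sm_ret_T
addr_lookup(map_P map, const char *user, const char *host, char *rhs, size_t rhslen)
{
    char key[512];
    sm_ret_T ret;

    snprintf(key, sizeof(key), "%s@%s", user, host);    /* user@host */
    ret = map_lookup(map, key, rhs, rhslen);
    if (ret != SM_NOTFOUND)
        return ret;     /* found, or a (temporary) lookup error */

    snprintf(key, sizeof(key), "%s@", user);            /* user@ */
    ret = map_lookup(map, key, rhs, rhslen);
    if (ret != SM_NOTFOUND)
        return ret;

    snprintf(key, sizeof(key), "@%s", host);            /* @host */
    return map_lookup(map, key, rhs, rhslen);
}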
The RHS of a map entry can contain references to parts of the LHS which have been omitted or replaced into the RHS. In sm8 this is done via %digit where the arguments corresponding to the digits are provided in the map call in a rule. Since sendmail X does not provide rulesets, a fixed assignment must be provided. A possible assignment for case 3 (see previous section) is:
digit | refers to |
0 | entire LHS |
1 | user name |
2 | detail |
3 | +detail |
4 | omitted subdomain? |
Functions can return errors in different ways:
We will not use one general error return mechanism just for sake of uniformity. It is better to write functions in a ``natural'' way and then choose the appropriate error return method.
Question: is timeout an error? Usually yes, so maybe it should be coded as one.
Question: Do we want to use an error stack like OpenSSL does? This would allow us to give more information about problems. For example, take a look at safefile(): it only tells us that there is a problem and what kind of problem, but it doesn't exactly say where. That is: Group writable directory can refer to any directory in a path. It would be nice to know which directory causes the problem.
OpenSSL uses a fixed size (ring) buffer to store errors (see openssl/include/err.h), which is thread specific.
A detailed error status consists of several fields (the status, i.e., all of its fields, must be zero if no error occurred):
As explained in 3.13.1 some functions return positive integers as valid results, hence negative values can be used to indicate errors. To facilitate this, the error codes will always be negative, i.e., the topmost bit is set.
So overall the error status is encoded in a signed 32 bit integer (of course typedef will be used to hide this implementation detail). Macros will be used to hide the partitioning of the fields to allow for changes if necessary.
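To make this concrete, here is one possible partitioning written out as macros. The field widths and names are assumptions; only the rule that error values have the topmost bit set comes from the text.

#include <stdint.h>

typedef int32_t sm_ret_T;       /* >= 0: valid result, < 0: error */

#define SM_ERR_BIT        0x80000000u
#define SM_IS_ERR(e)      (((uint32_t)(e) & SM_ERR_BIT) != 0)
#define SM_ERR_SUBSYS(e)  (((uint32_t)(e) >> 16) & 0x7fffu)   /* which module/library */
#define SM_ERR_CODE(e)    ((uint32_t)(e) & 0xffffu)           /* module specific code */
#define SM_MK_ERR(subsys, code) \
        ((sm_ret_T)(SM_ERR_BIT | (((uint32_t)(subsys) & 0x7fffu) << 16) | \
                    ((uint32_t)(code) & 0xffffu)))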
In some cases a numeric error status is not sufficient, an additional description (string) is required or at least useful. This can be made available as a result parameter, however, see also the next section how errors can be returned.
If modules are used, it is hard to have a single function that converts an error code into a textual description like strerror(3) does as Section 3.15 explains. The subsystem field in the error value can be used to call the appropriate conversion function.
If we want to be extremely flexible, then a module can register itself with a global error handler. It submits its own error conversion function and it receives a subsystem code. This would make the subsystem code allocation dynamic and will avoid conflicts that otherwise may arise. Question: is this too complex?
Minor nit about tags in structures for assertions that the submitted data is of the right type. 8.12 uses strings; a simple integer value is sufficient, e.g., 0xdeadbeef, 0xbc2001. Then the check is just an integer comparison instead of a strcmp(). Currently it's just a pointer comparison, so there's probably not much difference. Question: is this guaranteed to work? Someone reported problems on AIX 4.3; it crashes in the assertion module. The reason for that is a bug in some AIX C compiler that does not guarantee the uniqueness of a C const char pointer. Bind 9 uses a nice trick, defining a magic value as the ``concatenation'' of four characters (in a 32 bit word). Even though this restricts the range of magic values, at least it helps debugging since it can be displayed as a string.
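A sketch of that trick (compare BIND 9's isc/magic.h); the macro names, the example structure, and the magic value below are made up:

#include <stdint.h>
#include <assert.h>

/* four characters packed into a 32 bit word: comparable as an integer,
 * still readable as a string in a debugger or core dump */
#define SM_MAGIC(a, b, c, d) \
        (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
         ((uint32_t)(c) << 8) | (uint32_t)(d))

#define SM_STR_MAGIC    SM_MAGIC('S', 'm', 's', 't')

typedef struct {
    uint32_t sm_magic;      /* must be SM_STR_MAGIC */
    /* ... actual members ... */
} sm_str_T;

#define SM_IS_STR(s)    assert((s) != NULL && (s)->sm_magic == SM_STR_MAGIC)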
sendmail X should make use of many libraries such that the source code of the main programs is relatively small and easy to follow. Most complexity should be hidden in libraries, such as different I/O handling with timeouts and for different OSs as well as different actual lower layers (e.g., IPv4 vs IPv6). The more modules are implemented via libraries, the easier it should be to change the underlying implementations to accommodate different requirements and to actually reuse the code in other parts of sendmail X as well as other projects.
_T | type |
_E | enum |
_F | function |
_P | pointer |
_S | structure |
_U | union |
_T is the default; the others should only be used if necessary.
_new() | create a new object |
or _create() | |
_destroy() | destroy object (other name?) |
_open() | open an object for use, this may perform an implicit new() |
_close() | close an object, can't be used unless reopened, this does not destroy the object |
_add() | add an entry to an object |
_rm() | remove an entry from an object |
_alloc() | allocate an entry and add it to an object |
_free() | free an entry and remove it from an object |
_lookup() | lookup an entry, return the data value |
_locate() | lookup an entry, return the entry |
There are many alternative names used for lookup: search, find, and get. The latter is used by BDB and, even though it implies that a value is returned, it is a bit overloaded by the I/O function with that name, which returns the next element without implying that a specific entry is requested.
X_ctx | context for object X |
X_next | pointer to next element for X |
X_nxt | pointer to next element for X (if X is a long name) |
_state | state (if the object goes through various states/stages) |
_status | status: what happened to the object, e.g., SMTP reply code |
_flags | (bit mask of) flags that can be set/cleared/checked, more or less independent of each other |
The distinction between state and status is not easy. Here's an example: An SMTP server transaction goes through various stages (states): none, got SMTP MAIL/RCPT/DATA command, got final dot. The status should be whether some command was accepted, so in this case the two are strongly interrelated. It probably depends on the object whether the distinction can be made at all and whether it's useful. For example, the status might be stored in the substructures, e.g., the mail-structure has the response to the SMTP MAIL command.
As shown in Section 3.14.1.1 a structure definition in general looks like this:
typedef struct abc_S abc_T, *abc_P;
struct abc_S
{
    type1   abc_ctx;
    type2   abc_status;
    type3   abc_link;
};
X_link | link for list X |
X_lnk | link for list XYZ (if X is a long name) |
X_l | link for list XYZ (if X is a really long name) |
Note: link is used for structure elements instead of next which is used for variables. This makes it easier to distinguish variables and structure elements.
The name of macros should indicate whether they can cause any side effects, i.e., whether they use their arguments more than once and whether they can change the control flow. For the former, the macro should be written in all upper case. For the latter, the macro should indicate how it may change the control flow as depicted in the next table.
_B | break |
_C | continue |
_G | goto |
_E | jump to an error label |
_R | return |
_T | terminate (exit) |
_X | some of the above |
Macros that include assert statements don't need to be of any special form, even though _A could be used; however, many functions use assert/require too without indication of doing that in their name.
All include files in sendmail program code will refer to sendmail X specific include files. Those sendmail X specific files in turn will refer to the correct, OS dependent include files. This way the main program modules are not cluttered with preprocessor conditionals, e.g., #if, #ifdef.
sendmail X will use its own I/O layer, which might be based on the libsm I/O of sendmail 8.12. However, it must be enhanced to provide an OS independent layer such that sendmail X doesn't have to distinguish in the main code between those OSs. Moreover, it should be a stripped-down version containing only functions that are really needed. For example, it should have record oriented I/O functions (record for SMTP: a line ending in CRLF), and functions that can deal with continuation lines (for headers).
How should the I/O layer work?
It should buffer reads and writes in the following manner:
Requirements of SMTPS for data:
How to read the data with the least effort? We cannot just read a buffer and write it entirely to disk (give it to the CDB) because we need to recognize the trailing dot. The smtpsdata() routine should have access to the underlying I/O buffer. We could either be ugly and access the buffer directly (which violates basic software design principles, e.g., abstraction), or we can provide a function to do that. In principle, sm_getc() is that function, but it can trigger a read, which we don't want. If the buffer is empty, we want to know it, write the entire buffer somewhere, and then trigger a read. We could add a new macro: sm_getb() that returns SM_EOB if the buffer is empty. Note: to do this properly (accessing the buffer almost directly) we should provide some form of locking. Currently it's up to the application to "ensure" that the buffer isn't modified, i.e., the file must not be accessed in any other way in between.
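A sketch of what such a macro could look like, in terms of the buffer fields r and p described below; SM_EOB and the exact expression are assumptions (and, per the macro naming convention, the argument is evaluated more than once):

#define SM_EOB   (-2)   /* "end of buffer", distinct from end of file */

/* return the next byte from the buffer, or SM_EOB if it is empty;
 * never triggers a read from the underlying descriptor */
#define sm_getb(fp) \
        ((fp)->r <= 0 ? SM_EOB : ((fp)->r--, (int)(*(fp)->p++)))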
The I/O buffer currently consists of the following elements:
int | size | total size |
int | r | left to read |
int | w | left to write |
int | fd | fileno, if Unix fd, else -1 |
sm_f_flags_T | flags | flags, see below |
uchar | *base | start of buffer |
uchar | *p | pointer into buffer |
This implements a simple buffer. The buffer begins at base and its size is size. The current usage of the buffer is encoded in flags. The interesting flags for the use as buffer are:
RD | currently reading |
WR | currently writing |
RD and WR are never simultaneously asserted. RW means the file is open for reading and writing, i.e., we can switch from one mode to the other.
The following always hold:
flags & RD => w = 0
flags & WR => r = 0
This ensures that the getc and putc macros (or inline functions) never try to write or read from a file that is in `read' or `write' mode. (Moreover, they can, and do, automatically switch from read mode to write mode, and back, on "r+" and "w+" files.)
r/w denote the number of bytes left to read/write. p is the read/write pointer into the buffer, i.e., it points to the location in the buffer where the next byte is read/written.
|<------------- size ------------->|
|--------------------|-------------|
^                    ^             ^
base                 p             base + size

In RD mode, p points to the next byte to read and r bytes are left in the buffer; in WR mode, p points to the next free position and w bytes of space are left.
The buffer acts as a very simple queue between the producer and the consumer. ``Simple'' means:
``Real'' queues would have different pointers for writing/reading for the producer and consumer such that the queue is entirely utilized. However, that causes problems at ``wrap-around'', esp. if more than just one item (byte) should be read/written: it must be done piecewise (or at least special care must be taken for the wrap-around cases). We could do something like that because p is our pointer to write more data into the buffer.
The main part of most sendmail X programs is an event driven loop.
For example, the SMTP server reads data from the network and performs appropriate actions. The context for each server thread contains a function that is called whenever an I/O event occurs. However, if we use non-blocking I/O (which we certainly do), we shouldn't call an SMTP server function (e.g., smtps_mail_from) on each I/O event for that thread, but only if complete input data for that middle-level function is available. As stated before (see 3.14.3) the I/O layer should be able to assemble the input data first. So the thread has an appropriate I/O state which is used by the I/O layer to select the right function to get the input data. During the SMTP dialogue this would be a CRLF terminated line; if CHUNKING is active it would be an entire chunk.
Event driven programming can be fairly complicated, esp. if events can occur somewhere ``deep down'' in a function call sequence. We have to check which programs are easy to write in an event based model, and how we can deal with others (appropriate structuring or using a different model, e.g., threads per connection/task).
If a blocking call has to be made inside a function, it seems that the function must be split into two parts such that the blocking call is the divider, i.e., the function ends with a call that initiates the blocking function. The blocking function must be wrapped into an asynchronous call, i.e., an ``initiate'' call and a ``get result'' call. The function must trigger an event that is checked by the main loop. Such a wrapper can be implemented by having (thread-per-functionality) queue(s) into which the initiate calls are queued. A certain number of threads are devoted to such queue(s) and take care of the work items. When they are done they trigger an (I/O?) event and the main event loop can start the function that receives that result and continues processing. This kind of programming can become ugly (see above).
sendmail X will reuse the rpool abstraction available in libsm (with the exception of not using exceptions). Moreover, there should be an rpool_free_block() function even though it may actually do nothing. An additional enhancement might be the specification of an upper limit for the amount of memory to use.
In addition to resource pools (which are only completely freed when a task is done), it might be useful to have memory (de)allocation routines that work in an area that is passed as parameter. This way we can restrict memory usage for a given functionality.
A debugging version of malloc/free can be taken from libsm. It might be useful to provide a version that can do some statistics, see malloc(9) on OpenBSD.
In case of memory allocation problems, i.e., if the system runs out of memory, the usage has to be reduced by slowing down the system or even aborting some actions (connections). The system should pre-allocate some amount of memory at startup which it can use in case of such an emergency to allow basic operation. The pre-allocated memory can either be explicitly used by specifying it as parameter for memory handling routines, or it can be simply free()d such that system operation can continue. It's not yet clear which of those both approaches is better.
sendmail X should have a string abstraction that is better than the way C treats ``strings'' ('\0' terminated sequences of characters). They may be modelled after postfix's VSTRING abstraction or libsmi; the latter seems a good start. They should make use of rpools to simplify memory management. We have to analyze which operations are most needed and optimize the string abstraction for those without precluding other (efficient) usages. Some stuff:
Note: it might be useful to have a mode for snprintf() that denotes appending to a string instead of writing from the beginning. Should this be a new function or an additional parameter? The former seems the best at the user level, internally it will be some flag.
Even though strings should dynamically grow, there also needs to be a way to specify an upper bound. For example, we don't want to read in a line and run out of memory while doing so because some attacker feeds an ``endless'' line into the program.
The string structure could look like this:
struct sm_str_S
{
    size_t    sm_str_len;    /* current length */
    size_t    sm_str_size;   /* allocated length */
    size_t    sm_str_maxsz;  /* maximum length, 0: unrestricted? */
    uchar    *sm_str_base;   /* pointer to byte sequence */
    rpool_t   sm_str_rpool;  /* data is allocated from this */
}
Some strings are only created once, then used (read only) and maybe copied, but never modified. Those strings can be implemented by using reference counting and a simplified version of the general string abstraction.
struct sm_cstr_S
{
    size_t        sm_cstr_len;    /* current length */
    unsigned int  sm_cstr_refcnt; /* reference counter */
    uchar        *sm_cstr_base;   /* pointer to byte sequence */
}
Copying is then simply done by incrementing the reference counter. This is useful for identifiers that are only created once and then read and copied several times.
Available functions are:
/* create a cstr with a preallocated buffer */
sm_cstr_P sm_cstr_crt(uchar *_str, size_t _len);
/* create a cstr by copying a buffer of a certain size */
sm_cstr_P sm_cstr_scpyn(const char *_src, size_t _n);
/* duplicate a cstr */
sm_cstr_P sm_cstr_dup(sm_cstr_P _src);
/* free a cstr */
void sm_cstr_free(sm_cstr_P _cstr);
size_t sm_cstr_getlen(sm_cstr_P _cstr);
bool sm_cstr_eq(const sm_cstr_P _s1, const sm_cstr_P _s2);
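A short usage fragment (the identifier value is made up); the point is that sm_cstr_dup() only increments the reference counter instead of copying the bytes:

sm_cstr_P id, copy;

id = sm_cstr_scpyn("S12345-000001", 13);  /* refcnt == 1 */
copy = sm_cstr_dup(id);                   /* same buffer, refcnt == 2 */
/* ... both handles can be read concurrently ... */
sm_cstr_free(copy);                       /* refcnt back to 1 */
sm_cstr_free(id);                         /* refcnt 0: memory is released */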
Triggering events at certain times requires either signals or periodic checks of a list which contains the events (and their times). If we use signals, then the signal handler must be thread-safe (of course). In a multi-threaded program there is one thread which takes care of signal handling. This thread is only allowed to use a few functions (signal-safe). If the main program is event-driven, then we can use one pipe to itself and the signal handler will just write a message to it which says that a timed event occurred. In that case, the event-loop will trigger due to the I/O operation (ready for reading) and the appropriate action can be performed.
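A minimal sketch of that approach (pipe to self plus an async-signal-safe handler); the names are made up and the setup shown in the trailing comment is simplified:

#include <signal.h>
#include <unistd.h>

static int alarm_pipe[2];       /* created with pipe() at startup */

static void
alarm_handler(int sig)
{
    char c = (char)sig;

    /* write(2) is async-signal-safe; the main loop has the read end
     * of the pipe in its descriptor set and wakes up when it becomes
     * readable */
    (void)write(alarm_pipe[1], &c, 1);
}

/* at startup (error handling omitted):
 *      pipe(alarm_pipe);
 *      signal(SIGALRM, alarm_handler);
 * the event loop then treats readability of alarm_pipe[0] as
 * "a timed event is due". */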
The alternative is busy-waiting (polling), i.e., use a short timeout in the main event-loop (e.g., one second) and check whenever the thread awakes whether a scheduled action should be performed. This seems to be more compute-intensive than the above solution.
Another way is to set the timeout to the interval to the next scheduled event. However, this causes slight problems when new events are added in the meantime that are supposed to be performed earlier than the previous next event.
Currently the signal-handler approach seems to be the best since it provides a clean interface and it doesn't change the main event handler loop logic. Note: take a look at libisc code.
System V shared memory can be used between unrelated processes (just share the key and give appropriate access). Other forms of shared memory (mmap()) require a common ancestor, which is available in form of the supervisor process. However, we have to be careful how such shared memory would be used (we can't create more on the fly, it's a fixed number).
We need an RFC 2821 and an RFC 2822 parser.
RFC 2821: Routing based on address (MX) and additional rules (configuration by admin).
RFC 2822: We also need a rewrite engine to modify addresses based on rules (configuration) by the admin.
The SMTP server/address resolver needs an RFC 2821 parser to analyze (check for syntactical correctness) and to decide what to do about addresses (routing to appropriate delivery agent).
Note: the syntax for addresses in the envelope (RFC 2821) and in the headers (RFC 2822) is different, the latter is almost a superset of the former. In theory we only need an RFC 2821 parser for SMTP server daemon, but some MTAs may be broken and use RFC 2822 addresses. Should we allow this? Maybe consider it as an option. This would require that we have a good library to do both where the RFC 2821 API is a subset of the RFC 2822 API.
The parser shouldn't be too complicated, the syntax is significantly simpler than RFC 2822. Quoting RFC 2821:
Reverse-path = Path
Forward-path = Path
Path = "<" [ A-d-l ":" ] Mailbox ">"
A-d-l = At-domain *( "," A-d-l )
        ; Note that this form, the so-called "source route",
        ; MUST BE accepted, SHOULD NOT be generated, and SHOULD be ignored.
At-domain = "@" domain
Domain = (sub-domain 1*("." sub-domain)) / address-literal
sub-domain = Let-dig [Ldh-str]
address-literal = "[" IPv4-address-literal / IPv6-address-literal /
                  General-address-literal "]"
Mailbox = Local-part "@" Domain
Local-part = Dot-string / Quoted-string ; MAY be case-sensitive
Dot-string = Atom *("." Atom)
Atom = 1*atext
Quoted-string = DQUOTE *qcontent DQUOTE
String = Atom / Quoted-string
Some elements are not defined in RFC 2821, but RFC 2822; i.e., atext, qcontent.
IPv4-address-literal = Snum 3("." Snum)
IPv6-address-literal = "IPv6:" IPv6-addr
General-address-literal = Standardized-tag ":" 1*dcontent
Standardized-tag = Ldh-str
Snum = 1*3DIGIT ; representing a decimal integer
                ; value in the range 0 through 255
Let-dig = ALPHA / DIGIT
Ldh-str = *( ALPHA / DIGIT / "-" ) Let-dig
IPv6-addr = IPv6-full / IPv6-comp / IPv6v4-full / IPv6v4-comp
IPv6-hex = 1*4HEXDIG
IPv6-full = IPv6-hex 7(":" IPv6-hex)
IPv6-comp = [IPv6-hex *5(":" IPv6-hex)] "::" [IPv6-hex *5(":" IPv6-hex)]
            ; The "::" represents at least 2 16-bit groups of zeros
            ; No more than 6 groups in addition to the "::" may be present
IPv6v4-full = IPv6-hex 5(":" IPv6-hex) ":" IPv4-address-literal
IPv6v4-comp = [IPv6-hex *3(":" IPv6-hex)] "::"
              [IPv6-hex *3(":" IPv6-hex) ":"] IPv4-address-literal
              ; The "::" represents at least 2 16-bit groups of zeros
              ; No more than 4 groups in addition to the "::" and
              ; IPv4-address-literal may be present
Note about quoting: the addresses
<"abc"@abc.de> <abc@abc.de> <\a\b\c@abc.de> are the same. A string is just quoted because it may contain characters that could be misinterpreted. The ``value'' of the string is the string without quotes. Just its representation differs. Hence the parser (scanner) must get rid of the quotes and the value must be used. The quotes are only necessary for external representation. We have to be careful when strings with ``weird'' characters are used to look up data or are passed to other programs/functions that may interpret that data, e.g., a shell. The address must be properly quoted when used externally. In some cases all ``dangerous'' characters should be replaced for safety, i.e., a list of ``safe'' characters exists and each character not in that list is replaced by a safe one. Whether this replacement is reversible is open for discussion. Some MTAs just use a question mark as replacement, which of course is not reversible.
The message submission program needs an RFC 2822 parser to extract addresses from headers as well as from the command line. Moreover, addresses must be brought into a form that is acceptable by RFC 2822 (in headers) and RFC 2821 (for SMTP delivery).
atext = ALPHA / DIGIT /  ; Any character except controls,
        "!" / "#" /      ; SP, and specials.
        "$" / "%" /      ; Used for atoms
        "&" / "'" /
        "*" / "+" /
        "-" / "/" /
        "=" / "?" /
        "^" / "_" /
        "`" / "{" /
        "|" / "}" /
        "~"
atom = [CFWS] 1*atext [CFWS]
dot-atom = [CFWS] dot-atom-text [CFWS]
dot-atom-text = 1*atext *("." 1*atext)
qtext = NO-WS-CTL /  ; Non white space controls
        %d33 /       ; The rest of the US-ASCII
        %d35-91 /    ; characters not including "\"
        %d93-126     ; or the quote character
qcontent = qtext / quoted-pair
quoted-string = [CFWS] DQUOTE *([FWS] qcontent) [FWS] DQUOTE [CFWS]
Address parsing (RFC 2822/2821) should be based on a tokenized version of the string representation of an address. Therefore we need a token handling library (compare postfix).
Question: can we create only one address rewriting engine or do we need different ones for RFC 2821 and 2822? Question: is it sufficient for the first version to just have a table-driven rewrite engine (only mapping like postfix, not the full rewrite engine of sendmail 8)?
The communication between program modules in sendmail X is of course hidden inside a library. This library presents an API to the modules. In its first implementation, the library will probably use sockets for the communication.
Note: we may use ``sliding windows'' just as TCP/IP does to specify the amount of data the other side can receive. This allows us to send data without overflowing the recipient. Question: can we figure this out at the lower level?
The data structures for internal communication consist of tagged fields, see also Section 3.11.12. Tagged fields are used to allow for:
Question: do we need a package header to identify packages? This may be required if ``garbled'' packages can be sent over one connection. To find the beginning of the next package, a linear scan for a package header would be necessary. TCP guarantees reliable connections (Unix sockets too?), so this may not be necessary. If a ``garbled'' package is received, the connection is closed such that the sender has to deal with the problem.
A record communication buffer (rcb) can be realized as an extension of the string abstraction (see Section 3.14.7) by using one additional counter (sm_rcb_rw) (or pointer) to keep track of sequential reading out of it.
struct sm_rcb_S
{
    size_t    sm_rcb_len;    /* current length */
    size_t    sm_rcb_size;   /* allocated length */
    size_t    sm_rcb_maxsz;  /* maximum length */
    uchar    *sm_rcb_base;   /* pointer to byte sequence */
    rpool_t   sm_rcb_rpool;  /* data is allocated from this */
    int       sm_rcb_rw;     /* read index/left to write */
}
Invariants: 0 <= sm_rcb_len <= sm_rcb_size <= sm_rcb_maxsz. If reading: sm_rcb_rw <= sm_rcb_len. If writing: sm_rcb_rw is 0 (no length specified) or the size of the data to put into the RCB.
Basic operations to create and delete RCBs:
new() | create a new rcb |
free() | free an rcb |
An RCB can be operated in two different modes, each of which consists of two submodes:
For 1a: read record from file: if we do only record oriented operations, then we can use sm_rcb_len to keep track of the data written into the buffer and sm_rcb_rw to keep track of how much data is left to read.
The first word from the file (which is the record size, see Section 3.11.12) is written into sm_rcb_rw (only if it is currently 0). Make sure there is enough space in the buffer (right after the size of the record is read) without exceeding the maximum size (this may require reallocation); read data from a file and put it into the buffer (asserting the size will not be exceeded); this may happen in pieces, so sm_rcb_len and sm_rcb_rw are updated accordingly. The buffer will be filled until sm_rcb_rw reaches 0. Notice: the initial read may cause a problem if the number of elements to read is specified too large: it may cause more than one record to end up in the buffer.
For 1b: After the entire record has been copied into the buffer we can read out of it (for decoding). Decode from buffer requires the following functionality:
rcb_getn(IN rcb_P rcb, OUT uchar *buf, IN size_t n): read n characters into buf
rcb_getint(IN rcb_P rcb, OUT int *val): read an integer value
Even if we have a common encoding (e.g., lowest bit) to distinguish between integers and strings, we still can't use one decoding function because we need to pass a buffer and its size to the function for strings.
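As an illustration, decoding one subrecord might look like the following sketch; the record type constants (RT_PROTO_VER, RT_RCPT_ADDR) and the return codes are assumptions, not actual sendmail X definitions, and length handling is simplified:

/* Sketch only: record types and return codes are hypothetical. */
sm_ret_T
decode_one_subrecord(rcb_P rcb)
{
    int rt, ver;
    uchar addr[256];
    sm_ret_T ret;

    ret = rcb_getint(rcb, &rt);       /* each subrecord starts with its type */
    if (ret != SM_SUCCESS)
        return ret;
    switch (rt)
    {
      case RT_PROTO_VER:              /* integer subrecord */
        ret = rcb_getint(rcb, &ver);
        break;
      case RT_RCPT_ADDR:              /* string subrecord */
        ret = rcb_getn(rcb, addr, sizeof(addr));
        break;
      default:
        ret = SM_FAILURE;             /* unknown type: treat RCB as garbled */
        break;
    }
    return ret;
}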
For 2a: encoding into rcb:
rcb_putn(IN rcb_P rcb, IN uchar *buf, IN size_t n): put n characters into the rcb at the current position
rcb_putint(IN rcb_P rcb, IN int val): put an integer value into the rcb at the current position
These functions check whether the expected record size is reached, unless no expected size has been specified.
Question: should we have an additional parameter: record type? Then we would just need one call (instead of three) to write (encode) one subrecord.
For 2b: writing to file:
rcb_snd(INOUT rcb_P rcb, IN fd_T fd): read data from and write to .
Returns the number of bytes left to write, i.e., a positive value if it has to be called again, 0 if it is done, and a negative value on error.
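Putting 2a and 2b together, sending one record could look like this sketch; RT_RCPT_ADDR, SM_SUCCESS/SM_FAILURE, and the exact return value conventions of rcb_snd() are assumptions:

/* Sketch only: builds one record and writes it to fd. */
sm_ret_T
send_one_record(rcb_P rcb, fd_T fd, int idx, uchar *addr, size_t addrlen)
{
    sm_ret_T ret;
    int left;

    ret = rcb_putint(rcb, RT_RCPT_ADDR);     /* subrecord type (hypothetical) */
    if (ret == SM_SUCCESS)
        ret = rcb_putint(rcb, idx);          /* recipient index */
    if (ret == SM_SUCCESS)
        ret = rcb_putn(rcb, addr, addrlen);  /* recipient address */
    if (ret != SM_SUCCESS)
        return ret;

    do
    {
        left = rcb_snd(rcb, fd);             /* returns bytes left to write */
    } while (left > 0);                      /* call again until done */
    return (left == 0) ? SM_SUCCESS : SM_FAILURE;  /* negative: error */
}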
Notice:
The only (important) difference to the string abstraction explained in Section 3.14.7 is to keep track of sequential reading out of/writing into it. This is similar to I/O buffers (see Section 3.14.3.2).
As explained in Section 3.11.12 the end of a record should be marked. This is not only useful for data written to disk but also for data transferred over the network. Even though TCP offers a reliable connection, there is a chance of receiving garbled RCBs, e.g., consider the following situation: a multi-threaded client sends RCBs to a server, one of the write attempts fails such that a partial RCB is transmitted, another thread sends an RCB whose beginning will then be mistaken for the end of the previous (partial) RCB. Even though the client should stop communicating with the server as soon as the timeout (write problem) occurs, an end-of-record marker will help to recognize the problem. However, the question is how to implement this properly. The best approach is to do this completely transparently to the application. Question: how?
As explained in Section 3.11.12 each record entry has a type field that describes the content. There is a small problem in selecting these types because they can encode information which might as well be transported separately. For example, there might be errors for A, B, C, ..., then we can have either (1) a separate record type for each error, or (2) one generic error record type whose content identifies the item that failed.
Option 1 is not easily extendible: we need to add record types for each new error, we need to recognize them on the decoding side, and obviously they have to be generated by the sender. Option 2 would cause one record type to describe a ``compound'' field. As long as the type of the subentries is clear this should not be a problem.
There are several libraries that need to be written which provide some kind of storage functionality, e.g., hash tables, (balanced) trees, RSC (4.3.5). If we can use (almost) the same API for them, then it is fairly easy to replace an underlying implementation, i.e., to trade one storage/access method against another. This can be useful if different access methods are required. For example, while all of these provide a simple lookup function (map key to value, i.e., the typical DB functionality), some also provide additional functionality, or at least a more efficient implementation of certain access methods than others. This is exemplified by the different Berkeley DB access methods (see [OBS99] and [Sle]), e.g., hash, btree, queue. That API might serve as an example for others that need to be implemented. Starting with version 4.2, Berkeley DB supports an in-memory database without a backup on disk. Had this option been available earlier, it would not have been necessary to (re-)implement some of the functionality.
Maps can be considered as a specialized version of (file) I/O: while (file) I/O is in general sequential (even though positioning can be used and some OS provide record-based I/O), map I/O is based on an access key (however, sequential access is usually provided too). It might be useful to treat maps as a subset of the generic (file) I/O layer, i.e., use its open(), close(), read(), write(), etc functions. Todo: investigate this further as time permits.
Note: there are basically two different kinds of maps that are used:
The first kind is used internally by sendmail X for various purposes to store and access data. The second kind is used to control the behavior of sendmail X, e.g., to map addresses from one form into another, or to change options based on various keys like the name or IP address of the site to which a connection is open. Even though it might be useful to distinguish between both, it seems more generic to use one common API. Maps that don't provide write access (e.g., a map of valid users is usually not written by an MTA) simply don't have the corresponding function calls, and trying to invoke those will trigger an appropriate error.
The basic API for BDB looks like this:
db_create | create | Create a database handle |
DB-open | open | Open a database |
DB-close | close | Close a database |
DB-get | lookup | Get items from a database |
DB-put | add | Store items into a database |
DB-del | rm | Delete items from a database |
DB-remove | destroy | Remove a database |
This doesn't fully match the naming conventions for sendmail X explained in Section 3.14.1.2; the names for the latter are listed in the second column.
Additional functions should include walk: walk through all items and apply a function to them, and for some access methods a way to get elements in a sorted order out of the storage. In Berkeley DB, the former can be implemented using the cursor functions, the latter works at least if btree is used as access method.
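For illustration, a walk over all entries using the Berkeley DB 4.x cursor interface might look roughly like this sketch (error handling abbreviated):

#include <string.h>
#include <db.h>

/* Apply func() to every key/data pair in the database; sketch only. */
int
db_walk(DB *dbp, void (*func)(DBT *key, DBT *data))
{
    DBC *cursor;
    DBT key, data;
    int ret;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    ret = dbp->cursor(dbp, NULL, &cursor, 0);
    if (ret != 0)
        return ret;
    while ((ret = cursor->c_get(cursor, &key, &data, DB_NEXT)) == 0)
        func(&key, &data);                 /* visit one entry */
    (void) cursor->c_close(cursor);
    return (ret == DB_NOTFOUND) ? 0 : ret; /* DB_NOTFOUND: end of database */
}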
Generic API proposal:
_create() | create a new object |
_destroy() | destroy object (other name?) |
_open() | open an object for use, this may perform an implicit create() |
_close() | close an object, can't be used unless reopened, this does not destroy the object |
_reopen() | close and open an object; this is done only if necessary, |
e.g., for a DB if the file on disk has changed | |
_add() | add an entry to an object |
_rm() | remove an entry from an object |
_alloc() | allocate an entry and add it to an object |
_free() | free an entry and remove it from an object |
_lookup() | lookup an entry, return the data value |
_locate() | lookup an entry, return the entry |
_first() | return first entry |
_next() | return next entry |
Instead of creating a new function name for each map type, the map (context) itself is passed to a generic function map_f(); alternatively the context can provide function pointers such that the functions are invoked as map_ctx->f() (as it is done by BDB, see above).
An application uses the functions as follows:
To simplify the code the following shortcuts might be implemented:
Question: which level should handle the _reopen() function? It could be done in the abstraction layer, but then it needs to know when to reopen a map. This could be done by checking whether the underlying file has changed, but this is specific to some map implementations, e.g., Berkeley DB, while a network map does not provide such a way to check. This can be solved as follows: add a flag that indicates whether the simple check (file changed) should be performed in the abstraction layer, and add a function pointer to a _reopen() function in the map implementation which will be used if the flag is not set and the function pointer is not NULL. An additional problem is that if the _open() function takes a variable number of arguments, then those need either be resupplied when _reopen() is called, or they need to be stored in the map instance (which would make it necessary to preserve that data across the sequence _close(), _open() somehow), or they need to be supplied in some other way. Such an ``other way'' could be separate initialization and option setting functions, e.g., first _create() returns a map context, then various options are set via _setoption() which is a polymorphic function (compare setsockopt(2)) (and has a corresponding map_getoption() function), and finally _open() is called with the map context that has all options set in the proper way. Closing a map then requires either two steps or at least a parameter which indicates whether to discard (destroy) the map context too. However, this requires that the underlying map implementation actually supports this; Berkeley DB does not preserve the handle across the close function. One way around this is to store the options in an additional abstraction layer between the map abstraction and the map implementation, or to have a variable sized array of option types and option values in the map abstraction context in addition to the pointer to the map implementation context. Then the _setoption() function could store the options in that array and _reopen() could ``replay'' the calls to _setoption() by going through that array. Nevertheless, the same argument can be made here: it requires that the underlying map implementation actually supports a _create() and an _open() function; if there is only one _open() function that allocates the map context and its internal data while maybe taking several arguments, then this split can not be done.
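A minimal sketch of the ``option replay'' idea might look as follows; all names here (sm_map_setoption(), SM_MAP_OPT_NONE, and the wrapper functions) are hypothetical, and the context-preservation problem discussed above is ignored:

/* Sketch only: reopen a map by replaying recorded options. */
sm_ret_T
sm_map_reopen(sm_map_P map)
{
    sm_ret_T ret;
    unsigned int i;

    ret = sm_map_close(map);               /* close underlying implementation */
    if (ret != SM_SUCCESS)
        return ret;

    /* replay the _setoption() calls recorded in the option array */
    for (i = 0; i < SM_MAP_MAX_OPT; i++)
    {
        if (map->sm_map_opts[i].type == SM_MAP_OPT_NONE)
            break;                         /* no more recorded options */
        ret = sm_map_setoption(map, map->sm_map_opts[i].type,
                map->sm_map_opts[i].value);
        if (ret != SM_SUCCESS)
            return ret;
    }
    return sm_map_open(map, map->sm_map_path, map->sm_map_mode,
            map->sm_map_openflags);
}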
Note: it might be useful to have a _load() callback for the _open() function to initialize data in the map, e.g., to read key and data pairs from a file. This is especially useful in conjunction with _reopen() to reload a file that has changed.
The following data seems to be sufficient to describe elements in a storage system (e.g., database):
uchar      *STH_key;       /* key */
uint32_t    STH_key_len;   /* length of key */
void       *STH_data;      /* pointer to data (byte string) */
uint32_t    STH_data_len;  /* length of data */
An alternative is to use the string type described in Section 3.14.7. The advantage of doing so is that the functions for it can be reused.
Berkeley DB uses the following structure (called DBT) to describe elements in the database as well as keys:
typedef struct {
    void       *data;   /* pointer to byte string */
    u_int32_t   size;   /* length of data */
    u_int32_t   ulen;   /* length of user buffer */
    u_int32_t   dlen;   /* length of partial record */
    u_int32_t   doff;   /* offset of partial record */
    u_int32_t   flags;  /* various flags */
} DBT;
For an internal read/write map the storage for the key itself is managed by the library or by the application; the data storage is by default managed by the application, i.e., the data must be persistent as long as the map is used. This is in contrast to BDB, which manages the data itself since it has to store it on disk. The default can be changed by setting flags in the open() call to indicate whether the application supplies memory for storing the data, in which case another field specifies the length of that memory section. This is done by the DEFEDB library. It is useful to provide callbacks for data allocation/deallocation such that the library can (indirectly) manage the data storage if necessary, e.g., if it is supposed to automagically remove outdated entries, or if it can update an existing entry or create new entries based on whether the add function should allow for this, and for the destroy function to remove all allocated data (memory).
A map context should contain flags that describe how keys and data are handled. First, a map implementation needs to specify whether it can actually perform memory management functions (it may not have the required code). If it does not provide memory management functions, then the application must take care of it. When a map is opened the application specifies whether keys and data are (de)allocated by the library (if available). The flags are per map instance, not per key/data pair; the latter would require storing the information in the key or data itself, which would complicate the storage interface, e.g., it would have to encode the flags somehow and make sure that the application will only get the relevant part, not the additional control information (compare allocation contexts for malloc(3)).
Note: if the map abstraction uses the string type described in Section 3.14.7 then the memory management functions are those provided by the string implementation, there is no need to have additional functions and hence no function pointers are needed for this.
An abstraction layer should be placed on top of the various storage libraries to access the latter through a generic interface. This is similar to the Berkeley DB layer which specifies which type of database to use (open). To access the underlying implementation two methods are common: either the abstraction layer exposes the implementation's function pointers directly, or all calls go through wrapper functions in the abstraction layer.
The first method avoids an additional indirection, but the second can perform generic operations in one place before/after calling the specific storage library function. This is helpful for something like an expand function (sm_map_rewrite()) that replaces certain parts of the result by (parts of) the key, e.g., positional parameters (``%n'' in sendmail 8). Moreover, it makes it easier to deal with functions that aren't available because the wrapper can check for that (otherwise the function pointer must point to a dummy function).
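A wrapper for the second method might look like this sketch, assuming the map context contains a pointer to its class (as shown further below); SM_E_NOTIMPL and the SM_MAP_REWRITE flag are assumptions:

/* Sketch only: generic lookup wrapper in the abstraction layer. */
sm_ret_T
sm_map_lookup(sm_map_P map, sm_map_key key, sm_map_data *pdata)
{
    sm_ret_T ret;

    if (map->sm_map_class->sm_mapc_lookupf == NULL)
        return SM_E_NOTIMPL;               /* map type does not support lookup */
    ret = map->sm_map_class->sm_mapc_lookupf(map, key, pdata);
    if (ret == SM_SUCCESS && (map->sm_map_flags & SM_MAP_REWRITE) != 0)
        ret = sm_map_rewrite(map, key, pdata);  /* e.g., ``%n'' replacement */
    return ret;
}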
The abstraction layer needs to provide functions to
For case 2 the abstraction layer also needs to provide the generic map functions (see Section 3.14.12).
Convention: functions that operate on
In its simplest form the maps context contains a list or a hash table (using the type as key) of registered map classes:
struct sm_maps_S { List/Hash sm_mapc_P; }
A map class should look like this:
struct sm_mapc_S
{
    sm_cstr_P          sm_mapc_type;
    uint32_t           sm_mapc_flags;
    sm_map_create_F    sm_mapc_createf;
    sm_map_destroy_F   sm_mapc_destroyf;
    sm_map_open_F      sm_mapc_openf;
    sm_map_load_F      sm_mapc_loadf;
    sm_map_close_F     sm_mapc_closef;
    sm_map_reopen_F    sm_mapc_reopenf;
    sm_map_add_F       sm_mapc_addf;
    sm_map_rm_F        sm_mapc_rmf;
    sm_map_alloc_F     sm_mapc_allocf;
    sm_map_free_F      sm_mapc_freef;
    sm_map_lookup_F    sm_mapc_lookupf;
    sm_map_locate_F    sm_mapc_locatef;
    sm_map_first_F     sm_mapc_firstf;
    sm_map_next_F      sm_mapc_nextf;
}
Flags for map classes describe the capabilities of a map class, e.g., some of these are:
MAPC-ALLOCKEY | Map can allocate storage for key |
MAPC-ALLOCDATA | Map can allocate storage for data |
MAPC-FREEKEY | Map can free storage for key |
MAPC-FREEDATA | Map can free storage for data |
MAPC-CLOSEFREEKEY | Map must free storage for key on close (destroy?) |
MAPC-CLOSEFREEDATA | Map must free storage for data on close (destroy?) |
MAPC-NORMAL-REOPEN | Map can be reopened using close/open |
These flags indicate whether the library can perform memory management functions. If it does not set those flags, then the application must take care of it.
A map is an instance of a map class:
struct sm_map_S
{
    sm_cstr_P     sm_map_name;
    sm_cstr_P     sm_map_type;
    sm_mapc_P     sm_map_class;
    char         *sm_map_path;
    uint32_t      sm_map_flags;
    uint32_t      sm_map_openflags;  /* flags when open() was called */
    uint32_t      sm_map_caps;       /* capabilities */
    int           sm_map_mode;
    time_T        sm_map_mtime;      /* mtime when opened */
    ino_t         sm_map_ino;        /* inode of file */
    void         *sm_map_db;         /* for use by map implementation */
    void         *sm_map_app_ctx;

    /* array of option types and option values */
    sm_map_opt_T  sm_map_opts[SM_MAP_MAX_OPT];
}
There are different types of flags for maps: some describe the state of the map, some describe the functionalities (capabilities) that are offered. Capabilities are also something that can be requested when a map is opened; if a map does not offer the requested functionality, the open fails. Note: it might be useful to have a query function that returns the capabilities of a map class.
Flags that describe the state of a map are: (some of these can probably be collapsed if locking is used)
CREATED | Map has been created |
INITIALIZED | Map has been initialized |
OPEN | Map is open |
OPENBOGUS | open failed, do not call close |
CLOSING | map is being closed |
CLOSED | map is closed |
VALID | this entry is valid |
WRITABLE | open for writing |
ALIAS | this is an alias file |
Generic flags (options) that describe how the map abstraction layer should handle various operations:
INCLNULL | include nul byte in key |
OPTIONAL | don't complain if map can't be opened |
NOFOLDCASE | don't fold case in keys |
MATCHONLY | don't use the map value |
ALIAS | this is an alias file |
TRY0NUL | try without nul byte |
TRY1NUL | try with nul byte |
LOCKED | this map is currently locked |
KEEPQUOTES | don't dequote key before lookup |
NODEFER | don't defer if map lookup fails |
SINGLEMATCH | successful only if match returns exactly one key |
The alloc() and free() functions receive the application context as an additional parameter:
typedef void *(*sm_map_*_alloc_F)(size_t size, void *app_ctx);
typedef void (*sm_map_*_free_F)(void *ptr, void *app_ctx);
The create() function creates a map instance:
typedef sm_ret_T (*sm_map_create_F)(sm_mapc_P mapc, sm_cstr_P name, sm_cstr_P type, uint32_t flags, sm_map_P *pmap, ...);
The open() function opens a map for usage:
typedef sm_ret_T (*sm_map_open_F)(sm_mapc_P mapc, sm_cstr_P name, sm_cstr_P type, uint32_t flags, char *path, int mode, sm_map_P *pmap);
Question: what is the exact meaning of the parameters, especially name, flags, path, and mode? name: name of the map, can be used for socket map? This is mostly a descriptive parameter. path: if there is a file on disk for the map this is the name of that file. So what are flags and mode? mode could have the same meaning as the mode parameter of the Unix system call chmod(2). flags could have the same meaning as the flags parameter of the Unix system call open(2), e.g.,
O_RDONLY | open for reading only |
O_WRONLY | open for writing only |
O_RDWR | open for reading and writing |
O_NONBLOCK | do not block on open |
O_APPEND | append on each write |
O_CREAT | create file if it does not exist |
O_TRUNC | truncate size to 0 |
O_EXCL | error if create and file exists |
O_SHLOCK | atomically obtain a shared lock |
O_EXLOCK | atomically obtain an exclusive lock |
O_DIRECT | eliminate or reduce cache effects |
O_FSYNC | synchronous writes |
O_NOFOLLOW | do not follow symlinks |
Whether these flags actually make sense depends on the underlying map implementation. Questions:
Here is an example of the valid flags for a map implementation (Berkeley DB), some of which have equivalents from open(2):
DB_AUTO_COMMIT | |
DB_CREATE | O_CREAT |
DB_EXCL | O_EXCL |
DB_DIRTY_READ | |
DB_NOMMAP | |
DB_RDONLY | O_RDONLY |
DB_THREAD | |
DB_TRUNCATE | O_TRUNC |
The load() function needs the map context and the name of the file from which to read the data:
typedef sm_ret_T (*sm_map_load_F)(sm_map_P map, char *path);
The close() function requires only the map context:
typedef sm_ret_T (*sm_map_close_F)(sm_map_P map);
Adding an item requires the map context, the key, and the data (note: see elsewhere about whether the map needs to copy key or data).
typedef sm_ret_T (*sm_map_add_F)(sm_map_P map, sm_map_key key, sm_map_data data, unsigned int flags);
Removing an item requires the map context and the key:
typedef sm_ret_T (*sm_map_rm_F)(sm_map_P map, sm_map_key key);
typedef sm_ret_T (*sm_map_alloc_F)(sm_map_P map, sm_map_entry *pentry);
typedef sm_ret_T (*sm_map_free_F)(sm_map_P map, sm_map_entry entry);
typedef sm_ret_T (*sm_map_lookup_F)(sm_map_P map, sm_map_key key, sm_map_data *pdata);
typedef sm_ret_T (*sm_map_locate_F)(sm_map_P map, sm_map_key key, sm_map_entry *pentry);
typedef sm_ret_T (*sm_map_first_F)(sm_map_P map, sm_map_cursor *pmap_cursor, sm_map_entry *pentry);
typedef sm_ret_T (*sm_map_next_F)(sm_map_P map, sm_map_cursor *pmap_cursor, sm_map_entry *pentry);
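As a usage sketch, walking over all entries of a map via the first/next interface could look like this; the return codes SM_SUCCESS and SM_E_NOMOREENTRIES are assumptions:

/* Sketch only: apply func() to every entry of a map. */
sm_ret_T
map_walk(sm_map_P map, void (*func)(sm_map_entry entry))
{
    sm_map_cursor cursor;
    sm_map_entry entry;
    sm_ret_T ret;

    ret = map->sm_map_class->sm_mapc_firstf(map, &cursor, &entry);
    while (ret == SM_SUCCESS)
    {
        func(entry);                       /* visit one entry */
        ret = map->sm_map_class->sm_mapc_nextf(map, &cursor, &entry);
    }
    return (ret == SM_E_NOMOREENTRIES) ? SM_SUCCESS : ret;
}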
Notes:
struct sm_mapn_S
{
    sm_map_P    sm_mapn_map;
    uint32_t    sm_mapn_refcnt;
}
sendmail 8 opens maps dynamically on demand. This can be useful if some maps are not needed in some processes. Moreover, some map types do not work very well when shared between processes. For sendmail X it can be useful to close and reopen maps in case of errors, e.g., network communication problems. Question: which level should handle this: the abstraction layer or the implementation? If it's done in the abstraction layer then code duplication can be avoided; however, it's not yet clear what the proper API for this functionality is. It seems that using sm_map_close() and sm_map_open() is not appropriate because we already have the map itself. Would sm_map_reopen() (see Section 3.14.12.0.1) be the appropriate interface?
The previous sections describe a synchronous map interface. Section 3.1.1 contains a generic description of asynchronous functions. The map abstraction layer can either provide just a synchronous map API, or it can provide the more generic approach described in Section 3.1.1, however, it should let the upper layer ``see'' whether the lower level implements synchronous or asynchronous operations.
In most cases access to maps must be protected by mutexes. This can be either provided by the abstraction layer or by the low-level implementation. Doing it in the former minimizes the amount of code duplication.
Some map implementations may be slow (e.g., network based lookups) and hence some kind of enhancement is required. One way to do this is to issue multiple requests concurrently. Besides asynchronous lookups (which are currently not supported, see Section 3.14.12.4) this can be achieved by having multiple map instances of the same map type provided the underlying implementation allows for that.
It would be very useful if this can be provided by the map abstraction layer to avoid having this implemented several times in the low level map types.
Question: how to do this properly? The map type needs to specify that it offers this functionality and some useful upper limit for the number of concurrent maps.
Unfortunately older DNS resolvers (BIND 4.x) are not thread safe. The resolver that comes with BIND 8 (and 9) is supposed to be thread safe, but even FreeBSD 4.7 does not seem to support it (res_nsearch(3) is not in any library). In general there are two options to use DNS: either use an existing resolver library, or implement our own (simplified) resolver.
Option 1 does not directly work for applications that use state threads (see Section 3.18.3.1); the communication must at least be modified to use state threads I/O to avoid blocking the entire application. Therefore it is probably not much different to choose option 2, which would be a customized (and hopefully simpler) version of the resolver library. This option however causes a small problem: how should the caller, i.e., the task that requested a DNS lookup, be informed about the result? See Section 3.1.1.1 for a general description of this problem. We could integrate the functionality directly into the application; e.g., for an application which uses event threads this is exactly the activation sequence that is used: a request is sent, the task waits for an I/O event and then is activated to perform the necessary operations. However, this tightly integrates the DNS resolver functionality into the application, which may not be the best approach. Alternatively the DNS request can include a context and a callback which is invoked (with the result and the context) by the DNS task. The latter is useful for an application using the event thread library. For a state threads application a different notification mechanism is used, i.e., a condition variable. See Section 4.2.2 for a description of the possible implementations.
A DNS request contains the following elements:
name | hostname/domain to resolve |
type | request type (MX, A, AAAA, ...) |
flags | flags (perform recursion, CNAME, ...) |
dns-ctx | DNS resolver context |
event thread: | |
app-ctx | application context |
app-callback | application callback |
state threads: | |
app-cond | condition variable to trigger |
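A possible C representation of such a request is sketched below; all type and member names are assumptions:

/* Sketch only: one DNS request as handed to the DNS resolver task. */
typedef void (*dns_callback_F)(void *_app_ctx, dns_result_P _result);

struct dns_req_S
{
    sm_cstr_P        dnsreq_name;         /* hostname/domain to resolve */
    int              dnsreq_type;         /* T_MX, T_A, T_AAAA, ... */
    uint32_t         dnsreq_flags;        /* recursion, follow CNAME, ... */
    dns_rslv_P       dnsreq_rslv;         /* DNS resolver context */

    /* event thread applications: */
    void            *dnsreq_app_ctx;      /* application context */
    dns_callback_F   dnsreq_app_callback; /* invoked with result and context */

    /* state threads applications: */
    cond_T          *dnsreq_app_cond;     /* condition variable to trigger */
};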
The DNS resolver context is created by a dns_ctx_new() function and contains at least these elements:
DNS servers | list of DNS servers to contact |
flags | flags (use TCP, ...) |
Available functions include:
dns_mgr_ctx_new() | create a new DNS manager context |
dns_mgr_ctx_free() | free (delete) DNS manager context |
dns_rslv_new() | create a new DNS resolver context |
dns_rslv_free() | free (delete) DNS resolver context |
dns_tsk_new() | create a new DNS task |
dns_tsk_free() | free (delete) DNS task |
dns_req_add() | send DNS request; this may either receive a DNS request structure |
as parameter or the individual elements |
There should be only one DNS task since it will manage all DNS requests. However, this causes problems due to:
If multiple DNS resolver tasks are used then there is the problem of distributing the requests between them and of a fallback in case of problems, e.g., timeout or truncations (UDP).
A DNS resolver task should have the following context:
DNS server | which server? (fd is stored in task context) |
flags | flags (UDP, TCP, ...) |
dns-ctx | DNS resolver context |
error counters | number of timeouts etc |
If too many errors occurred the task may terminate itself after it created a new task using a different DNS server.
Many functions can cause temporary errors of which timeouts, especially from asynchronous functions, are hard to handle. Question: should the caller implement a timeout or the callee? That is, should the caller set some timestamp on its request and the data where the result should be stored and check this periodically or should the callee implement the timeout and guarantee returning a result (even if it is just an error code) to the caller? The latter looks like the better solution because it avoids periodic scanning in the application (caller), however, then it needs to be in the callee, which at least has some centralized list of outstanding requests. The caller has to make sure that its open requests are removed after some time, e.g., some sort of garbage collection is required. If the caller removes a request but the callee returns a result later on, then the caller must handle this properly; in the simplest case it can just ignore the result. A more sophisticated approach would use a cancel_request() function such that the request is also discarded by the callee.
In some cases the caller can easily implement a timeout, e.g., SMTPS does this when it waits for a reply from QMGR. If the reply does not arrive within a time limit, then the SMTP server returns a temporary error code to the client.
Todo: structure this.
For example: after the final dot a message must (probably) be fsync()ed, but we don't want to do this for each message individually, but maybe for a group. Then we use fsync(within-10-seconds) and the library can group several requests together (compare softupdates).
ISC [ISC01] logging uses a context (type isc_log_t), a configuration (type isc_logconfig_t), categories and modules, and channels.
Channels specify where and how to log entries:
<channel> ::= "channel" <channel_name> "{"
        ( "file" <path_name>
            [ "versions" ( <number> | "unlimited" ) ]
            [ "size" <size_spec> ]
        | "syslog" <syslog_facility>
        | "stderr"
        | "null"
        );
        [ "severity" <priority>; ]
        [ "print-category" <bool>; ]
        [ "print-severity" <bool>; ]
        [ "print-time" <bool>; ]
    "}";

<priority> ::= "critical" | "error" | "warning" | "notice" |
               "info" | "debug" [ <level> ] | "dynamic"
For each category a logging statement specifies where to log entries for that category:
<logging> ::= "category" <category_name> { <channel_name_list> };
Categories are ``global'' for a software package, i.e., there is a common superset of all categories for all parts of the package. Some parts may only use a subset, but the meaning of a category must be consistent across all parts, otherwise the logging configuration will cause problems (at least inconsistencies). Modules are used by the software (libraries etc.) to describe from which part of the software a logging entry has been made.
The API is as follows:
isc_log_create(isc_mem_t *mctx, isc_log_t **lctxp, isc_logconfig_t **lcfgp);
isc_log_destroy(isc_log_t **lctxp);

isc_logconfig_create(isc_log_t *lctx, isc_logconfig_t **lcfgp);
isc_logconfig_use(isc_log_t *lctx, isc_logconfig_t *lcfg);
isc_logconfig_get(isc_log_t *lctx);
isc_logconfig_destroy(isc_logconfig_t **lcfgp);

isc_log_registercategories(isc_log_t *lctx, isc_logcategory_t categories[]);
isc_log_registermodules(isc_log_t *lctx, isc_logmodule_t modules[]);
To create and use channels:
isc_log_createchannel(isc_logconfig_t *lcfg, const char *name,
    unsigned int type, int priority,
    const isc_logdestination_t *destination, unsigned int flags)
isc_log_usechannel(isc_logconfig_t *lcfg, const char *name,
    const isc_logcategory_t *category, const isc_logmodule_t *module)

isc_log_write(isc_log_t *lctx, isc_logcategory_t *category,
    isc_logmodule_t *module, int priority, const char *format, ...)
isc_log_vwrite(isc_log_t *lctx, isc_logcategory_t *category,
    isc_logmodule_t *module, int priority, const char *format, va_list args)

isc_log_setdebuglevel(isc_log_t *lctx, unsigned int level);
isc_log_getdebuglevel(isc_log_t *lctx);

isc_log_wouldlog(isc_log_t *lctx, int priority);

isc_log_setduplicateinterval(isc_logconfig_t *lcfg, unsigned int interval);
isc_log_getduplicateinterval(isc_logconfig_t *lcfg);

isc_log_settag(isc_logconfig_t *lcfg, const char *tag);
isc_log_gettag(isc_logconfig_t *lcfg);

isc_log_opensyslog(const char *tag, int options, int facility);
isc_log_closefilelogs(isc_log_t *lctx);

isc_log_categorybyname(isc_log_t *lctx, const char *name);
isc_log_modulebyname(isc_log_t *lctx, const char *name);
There are some technical problems that must be solved by a clean design. For example, in sendmail 8.12 the function that returns error messages for error codes includes the basic OS error codes but also extensions like LDAP if it has been compiled in. This makes it hard to reuse that library function in different programs that don't want to use LDAP because it will be linked in since it is referenced in the error message function. This must be avoided for obvious reasons.
One possible solution is to have a function list where modules register their error conversion functions. However, this requires that the error codes are disjoint. This can be achieved as explained in Section 3.13.2. So the current approach (having a ``global'' error conversion function) doesn't work in general. The error conversion must be done locally and the error string must be properly propagated.
Question: how do we build sendmail X on different systems? Currently we have one big file (conf.h) with a lot of OS-specific definitions and several OS-specific include files (os/sm_os_OS.h). However, this is ugly and hard to maintain. We should use something like autoconf to automagically generate the necessary defines on the build system. This should minimize our maintenance overhead, esp. if it also tests whether the feature actually works.
Since we do not have enough man power to develop yet another build system and since our current system is completely static, we will use the GNU autotools (automake, autoconf, etc) for sendmail X. We already have a (partial) patch for sendmail 8 to use autoconf etc (contributed by Mark D. Roth). We can use that as a basis for the build configuration of sendmail X.
Interestingly enough, most other open source MTAs use their own build system. However, BIND 9 and the Courier MTA also use autoconf.
Note: there are some things which can't be solved by using autoconf. Those are features of an OS that cannot be (easily or at all) determined by a test program. Examples for these are:
In those cases we need to provide the data in some script (config.cache?) that can be easily used by configure. The data should probably be organized according to the output of config.guess: ``CPU-VENDOR-OS'', where OS can be ``SYSTEM'' or ``KERNEL-SYSTEM''.
This section contains hints about some operating system calls, i.e., how they can be used.
This section should contain only standard (POSIX?) behavior, nothing specific to some operating system. See the next section for the latter.
Todo: structure this and then use it properly.
(OpenBSD man page, should apply to all OS) If the socket is marked non-blocking and no pending connections are present on the queue, accept() returns an error as described below.
It is possible to select(2) or poll(2) a socket for the purposes of doing an accept() by selecting it for read.
One can obtain user connection request data without confirming the connection by issuing a recvmsg(2) call with an msg_iovlen of 0 and a nonzero msg_controllen, or by issuing a getsockopt(2) request. Similarly, one can provide user connection rejection information by issuing a sendmsg(2) call with providing only the control information, or by calling setsockopt(2).
Question about the last sentence: how to do this and is it portable?
This section contains hints about the behavior of some system calls on certain operating system. That is, anything that is interesting with respect to the use of system calls in sendmail X. For example: don't use call xzy in these situations on this OS.
Todo: structure this.
Just some unsorted notes about worker threads.
We have a pool of worker threads. Can/should this grow/shrink dynamically? In a first attempt: no, but the data structures should be flexible. We may use fixed size arrays because the number of threads will possibly neither vary much nor be very large. However, fixed size doesn't refer to compile time, but to configuration (start) time.
We have a set of tasks. This set will vary (widely) over time. It will have a fixed upper limit (configuration time), which depends for example on the number of available resources of the machine/OS, e.g., file descriptors, memory.
We need a queue of runnable tasks (do we need/want multiple queues? Do we want/need prioritized queues?). Whenever a task becomes active (in our case usually: I/O is possible) it will be added to the ``run'' queue (putwork()). A thread that has nothing to do looks for work by calling getwork(). These two functions access the queue (which of course is protected by a mutex and an associated condition variable). A sketch of these two functions follows below.
Maybe we can create some shortcuts (e.g., if a task becomes runnable and there is an inactive worker thread, give it directly to it), but that's probably not worth the effort (at least in the first version).
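A minimal sketch of putwork()/getwork() using a mutex and condition variable; task_P, queue_T, and the queue helper functions are assumptions:

#include <pthread.h>

/* Sketch only: run queue shared between the control thread and workers. */
static queue_T         runq;               /* queue of runnable tasks */
static pthread_mutex_t runq_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  runq_cv  = PTHREAD_COND_INITIALIZER;

void
putwork(task_P task)
{
    pthread_mutex_lock(&runq_mtx);
    queue_append(&runq, task);             /* task became runnable */
    pthread_cond_signal(&runq_cv);         /* wake one idle worker */
    pthread_mutex_unlock(&runq_mtx);
}

task_P
getwork(void)
{
    task_P task;

    pthread_mutex_lock(&runq_mtx);
    while (queue_empty(&runq))             /* nothing to do: wait */
        pthread_cond_wait(&runq_cv, &runq_mtx);
    task = queue_remove_first(&runq);
    pthread_mutex_unlock(&runq_mtx);
    return task;
}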
An SMTP server is responsible for an SMTP session. As such it has a file descriptor (fd) that denotes the connection with the client. The server reads commands from that fd and answers accordingly. This seems to be a completely I/O driven process and hence it should be controlled by one thread that is watching I/O activities. This assumes that local operations are fast enough or that they can be controlled by the same loop.
There is one control thread that does something like this:
while (!stop)
{
    check for I/O activity
    - timeout?
        check whether a connection has really timed out (inactive for too long)
    - input on fd:
        lookup context that takes care of this fd
        add context to runnable queue
    - fd ready for output:
        lookup context that takes care of this fd
        add context to runnable queue
    - others?
}
Question: do we remove fd from the watch-set when we schedule work for it? If so: who puts it back? A worker thread when it performed the work?
The worker threads just do:
while (!stop)
{
    get_work(&ctx);
    ctx->perform_action(ctx);
}
See also Section 3.14.4 about programming problems.
It would be nice to build a skeleton library that implements this model. It then can be filled in with the appropriate data structure, that contain (links to) the actual data that is needed.
It would be even better to implement the generic model (multiple, pre-forked processes with worker threads) as a skeleton. This should be written general enough such that it can be tweaked (run time/compile time) to extreme situations, i.e., one process or one thread. The latter is probably not possible with this model, since there must be at least one control thread and one worker thread. Usually there's also a third thread that deals with signals.
However, this generic model could be reused in other situations, e.g., the first candidate would be libmilter.
This section contains some comments about available thread libraries. We try to investigate whether those libraries are suitable for use in sendmail X, and if so, for which components.
Some comments about [SGI01]: State Threads (ST) for Internet Applications (IA).
We assume that the performance of an IA is constrained by available CPU cycles rather than network bandwidth or disk I/O (that is, CPU is a bottleneck resource).
This isn't true for SMTP servers in general; they are disk I/O bound. Does this change the suitability of state threads for SMTPS?
The state of each concurrent session includes its stack environment (stack pointer, program counter, CPU registers) and its stack. Conceptually, a thread context switch can be viewed as a process changing its state. There are no kernel entities involved other than processes. Unlike other general-purpose threading libraries, the State Threads library is fully deterministic. The thread context switch (process state change) can only happen in a well-known set of functions (at I/O points or at explicit synchronization points). As a result, process-specific global data does not have to be protected by mutual exclusion locks in most cases. The entire application is free to use all the static variables and non-reentrant library functions it wants, greatly simplifying programming and debugging while increasing performance.
The application program must be extremely aware of this! For example: x = GlobalVar; y = f(x); GlobalVar = y; is ``dangerous'' if f() has a ``synchronization point''.
Note: Any blocking call must be converted into an I/O event, otherwise the entire process will block, because scheduling is based on asynchronous I/O. This doesn't happen with POSIX threads. Does this make ST unusable for the SMTPS? For example, fsync() may cause a problem. Question: can we combine ST and POSIX threads? The latter would be used for blocking calls, e.g., fsync(), maybe read()/write() to disk, or compute-intensive operations, e.g., cryptographic operations during TLS handshake. Answer: no [She01a].
Note: if you link with a library that does network I/O, it must use the I/O calls of ST [She01b]:
This is a general problem - external libraries should conform to the core server architecture. E.g., if the core server uses POSIX threads, all libraries must be thread-safe, and if the core server is ST-based, all libraries must use ST socket I/O. That might be an even bigger problem than the compute-intensive operations. However, those libraries might be only used in the address resolver.
Since purely event based programming is hard as explained in Section 3.14.4, another approach has been suggested.
The problems with a thread-per-connection programming model have already been mentioned in Section 2.5.2, 4a. Additionally, the scheduling of those threads is entirely up to the OS (or the thread library).
We would like to reduce the number of threads without having to resort to the complicated event based programming model. Hence we use a worker model, but we do not need to split functions when they perform a blocking call. Instead, we protect those regions by counting semaphores to limit the number of threads that can execute those (compute intensive or blocking) functions. This ``helps'' the OS/thread library to schedule threads by restricting the number of threads in those sections.
Example: we have one counting semaphore (iosem) for events that must be taken care of and one (compsem) for a compute intensive section. Before a worker thread takes a task out of the queue to which the event scheduler added it, it must acquire iosem. When a thread wants to enter the compute intensive section, it releases iosem (thus another worker thread can take care of an event task), and acquires compsem, which may cause it to block (thus allowing another thread to run). After the thread finished the compute intensive section, it releases compsem and acquires iosem again before continuing. A sketch of this protocol is given below.
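Expressed with POSIX counting semaphores, the protocol might look like the following sketch; initialization and error handling are omitted, and the task handling functions are assumptions:

#include <semaphore.h>

/* Sketch only: iosem limits event-handling threads, compsem limits
** threads inside the compute-intensive (or blocking) section. */
extern sem_t iosem;
extern sem_t compsem;

void
worker_loop(void)
{
    task_P task;                   /* task_P and helpers are assumptions */

    for (;;)
    {
        sem_wait(&iosem);          /* allowed to handle an event task */
        task = getwork();
        handle_event(task);

        sem_post(&iosem);          /* let another worker handle events */
        sem_wait(&compsem);        /* enter compute-intensive section */
        compute(task);
        sem_post(&compsem);        /* leave compute-intensive section */
        sem_wait(&iosem);          /* back to event handling */

        finish(task);
        sem_post(&iosem);          /* done with this task */
    }
}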
Notice: it is still possible that the OS scheduler will let a compute intensive operation continue without switching to another thread. Scheduling threads is usually cooperative, not time-sliced. Hence we may have a similar problem here as for state threads.
An event thread library provides the basic framework for a worker based thread pool that is driven by I/O events and by wakeup times for tasks.
The library uses a general context that describes one event thread system and a per task context. It maintains two queues: a run queue of tasks that can be executed, and a wait queue of tasks that wait for some event (IS-READABLE, IS-WRITABLE, TIMEOUT, NEW-CONNECTION (listen())). Each task is in at most one of the queues at any time; if it is taken out of a queue, it is under the sole control of the function that removed it from the queue. If this model can be achieved then no per-task mutex is required, i.e., access to a task is either protected by the mutex for the queue it is in or it is ``protected'' by the fact that it isn't in a queue.
The system maintains some number of worker threads between the specified minimum and maximum. Idle threads vanish after some timeout until the minimum number is reached. An application first initializes the library (evthr_init()), then it creates at least one task (evthr_task_new()), and thereafter calls evthr_loop() which in turn monitors the desired events (I/O, timeout) and invokes the registered callback functions. Those callback functions return a result which indicates what should happen next with the task:
flag | meaning |
OK | do nothing, task has been taken care of |
WAITQ | put in wait queue |
RUNQ | put in run queue |
SLPQ | sleep for a while |
DEL | delete task |
TERM | terminate event thread loop |
Additionally the task may want to change the events it is waiting for. This can be accomplished in several ways:
Solution 2 would extend the return values given before by more flags (the required values can be easily encoded as bits in an integer value). These flags could be returned as ``SET'' and ``CLEAR''.
flag | meaning |
RD | IS-READABLE |
WR | IS-WRITABLE |
SL | TIMEOUT |
LI | NEW-CONNECTION |
To maximize concurrency, the system must be able to handle multiple requests over one file descriptor concurrently. This is required for the QMGR if it serves requests from SMTPS, which is multi-threaded and will use only one connection per process, over which it will multiplex all requests to (and answers from) the QMGR. Hence there must be a (library supplied) function that can be called from an application to put a task back into the wait queue after it read a request (usually in form of an RCB, see Section 3.14.11.1.1). Then the thread can continue processing the request while the task manager can wait for an event on the same file descriptor and schedule another thread to take care of it. This way multiple requests can be processed concurrently.
This raises another problem: usually tasks in a server wait for incoming requests. Then they process them and send back responses. However, if an application returns a task as soon as possible to the wait queue, then it can't change the event types for the task (to include IS-WRITABLE so the answer is sent back to the client), because it relinquished control of it. So either the task descriptions must be protected by a mutex (which significantly complicates locking and error handling), or a different way must be used to change the event mask of a task (see 3.18.5.1.2). One such way is to provide a function that informs the task manager of the change of the event mask, i.e., send the task manager (over an internal communication mechanism, usually a pipe) the request to enable a certain event type (e.g., IS-WRITABLE). Since the task manager locks both queues as soon as an event occurred, it can easily change the task event type without any additional overhead. Moreover, this solution has the useful side effect that the wait queue is scanned again and the new event type is added to the list of events to wait for. However, the task might be active, i.e., in neither of the queues. In that case, we can either request that the user-land program does not change the status flags (which is probably a good idea anyway), or we actually have to use a mutex per task, or we need to come up with another solution.
A related problem is ``waking'' up another task. For example, we may have a task that periodically performs a function if data is available on which to act. If no data is available, then this task should not wake up at all, or only in long intervals. Initially the task may wake up in long intervals, look for data, and go back to sleep for the long interval if no data is available. If however data becomes available while the task is sleeping for the long interval, its timeout should be decreased to the short interval. There might be varying requirements:
This can probably be handled similar to the problem mentioned before, in addition to telling the task manager that a certain event type should be enabled, it also tells it the new timeout.
The task context should contain the following status flags:
There doesn't seem to be a way around using a mutex to protect access to the event request flags (1) and similar data, e.g., the timeout of a task. If some data can be modified in a task context while it is not ``held'' because it's in one of the queues, then we need to protect it.
Maybe we can split the event request flags:
Then we can modify the event request flags of a task when it is ``under control'', i.e., when it is in a queue and the main control loops examines the tasks. That is, a function adds a change request for a task to a list; the change request includes the new event request flags and the task identifier (as well as other data that might be necessary, e.g., new sleep time). When the main loop is invoked, it will check the change request list and apply all changes to tasks that are under its control, i.e., in the wait queue. This should allow us to avoid per-task mutexes.
Alternatively, we can use mutexes just to protect the ``new request'' flags (see list above: item 2). That way the protected data is fairly small and there should almost never be any contention. Moreover, we can avoid having a list (with all its problems: allocation, searching for data, etc).
It might be useful to be able to specify a task that is invoked if the system is idle (the so-called ``idle task''). For example, if the system doesn't do anything (all threads idle), invoke the idle task. That task may take some options like:
Question: how to "signal" the idle task when the system is busy again? It might be useful to give this task a low priority in case other tasks are started while it is running such that the scheduler can act accordingly. Notice that some pthread implementation do not care much about priorities, some even implement cooperative mult-threading.
Question: what about callback functions for signals?
Order of implementation:
based on state threads
Milestone 3 is complete if a mail can be received by SMTPS, safely stored in a queue by QMGR, scheduled for delivery by a very simple scheduler in the QMGR, and then delivered by SMTPC.
Milestone 3 has been reached 2002-08-19.
Notice: this is prototype code. There are many unfinished places within the existing code not to mention all the parts that are completely missing.
Milestone 4 has been reached 2002-09-19.
Next steps:
As of 2004-01-01 sendmail X is running as MTA on the machine of the author. sendmail 8 is only used for mail submission.
There are some parts of the sendmail X source code which should be explained before the individual modules and the libraries. Those are conventions that are common to at least two modules and which will be used in the following sections.
In Section 3.1.1.2 some remarks about identifiers have been made. In this section the structure of session and transaction identifiers is explained. These identifiers for the SMTP server consist (currently) of a leading `S', a 64 bit counter, and an 8 bit process index, i.e., the format is: "S%016qX%02X". Those identifiers are of course used in other modules too, especially the QMGR. For simplicity the identifiers for the delivery agents follow a similar scheme: a leading `C', an 8 bit process index, a running counter (32 bit) and a thread index (32 bit). Notice that only the SMTPS identifiers are supposed to be unique over a very long time period (it takes a while before a 64 bit counter wraps around). The SMTPC (DA) identifiers are unique for a shorter time (see Section 3.1.1.2 about the requirements) and they allow easy identification of delivery threads in SMTPC. The type for this identifier is sessta_id_T.
To uniquely identify recipients they are enumerated within a transaction, so their format is "%19s-%04X" where the first 19 characters are the SMTPS transaction id and the last four characters are the recipient index (this limits the number of recipients per transaction to 65535, which seems to be enough). The type for this identifier is rcpt_id_T.
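For illustration, the identifiers could be generated like this; the "%qX" format mentioned above is BSD-specific, so PRIX64 is used here instead, and the buffer sizes simply follow from the formats:

#include <inttypes.h>
#include <stdio.h>

/* Sketch only: construct a session/transaction id and a recipient id. */
void
make_ids(uint64_t counter, unsigned int proc_idx, unsigned int rcpt_idx,
    char sess_id[20], char rcpt_id[25])
{
    /* 'S', 64 bit counter, 8 bit process index: 19 characters plus NUL */
    snprintf(sess_id, 20, "S%016" PRIX64 "%02X", counter, proc_idx);
    /* transaction id plus recipient index: 24 characters plus NUL */
    snprintf(rcpt_id, 25, "%19s-%04X", sess_id, rcpt_idx);
}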
In some cases it is useful to have only one identifier in a structure and some field that denotes which identifier type is actually used, i.e., a superset of sessta_id_T and rcpt_id_T. This type is smtp_id_T.
Notice: under certain circumstances it might be worth storing identifiers as binary data instead of printable strings, for example, if they are used for large indices that are stored in main memory. For SMTP server identifiers this shrinks the size from 20 bytes down to 12 (8 bytes for the counter, 4 for the process index assuming that it's a 32 bit system, i.e., the 8 bit process index will be stored in a 32 bit word). For the recipient identifiers the length will shrink from 25 bytes (probably stored in 28 bytes) down to 16 bytes. Whether it is worth trading memory storage for conversion overhead (and added complexity) remains to be seen. In the first version of sendmail X this will probably not be implemented. See also Section 3.11.6.1.1 about size estimations.
This section talks about the integration of asynchronous functions with the event thread library. Section 3.1.1.1 gives a description of asynchronous functions, stating that a result (callback) function RC is invoked with the current status and the result of an asynchronous function as parameters. However, the implementation of this is not as simple as it might seem. The event thread library (see Section 3.18.5 for a functional description and Section 4.3.8 for the implementation) uses a task context that is passed to worker threads which in turn execute the application function (passing it the entire task context).
There are basically two approaches to this problem:
The advantage of solution 1 is that it requires less context switching (even if it is only a thread context). If the function RC is a callback that is directly invoked from the function that receives the result from an external source, then the result values should not be those that are returned to the worker manager (e.g., OK, WAITQ, RUNQ, DEL, see 3.18.5.1.1), unless we put an extra handler in between that knows about these return values and can handle them properly by performing the appropriate actions. Therefore this approach requires careful programming, see below for details.
Solution 2 does not mess around with the worker function and the task (event thread) context without telling the event thread system about it, hence it does not have the programming restriction mentioned above; it allows for a more normal programming style.
An asynchronous call sequence looks like this:
The next step depends on which solution has been chosen:
If the task C is invoked via the callback from result_handler() then we either have to make C aware of this (implicitly by ``knowing'' which parts of C are invoked by callbacks or explicitly by passing this information somehow, e.g., encoded in the task structure). We can either require that a callback has different set of return values, i.e., only EVTHR_OK, or we need an ``inbetween'' function that can interpret the result values for a manager thread and manipulate the task context accordingly. The latter seems like the most generic and clean solution. Notice, however, that we have to take care of the usual problem with accessing the task while it is outside of our control, i.e., if it has been returned to the event thread system earlier on, the well-known problems arise (see Section 3.18.5.1.5, additionally the task must know whether it placed itself earlier on into the wait queue to avoid doing it again).
In the previous section solution 3a for approach 1 states that the callback function may store the result in a data structure. If the caller of the asynchronous function waits for the result, then the access to the structure must be synchronized and the caller must be informed that the data is available (valid). Note: this is not a good solution to the problem of asynchronous functions (because the caller blocks while waiting without giving the event threads library a chance to switch to another thread; if too many functions are doing this the system may even deadlock?), but just a simple approach until a better solution is found.
The simple algorithm uses a mutex (to protect access to the shared data structure and the status variable), a condition variable (to signal completion), and a status variable which has three states:
Note: due to the asynchronous operation, states 2 and 3 do not need to happen in that order. Moreover, state 2 may not be reached at all because the callee is done before the caller needs the result (this is the best case because it avoids waiting).
Caller:
status = init;             /* initialize system */
invoke callee              /* call asynchronous function */
...                        /* do something */
lock                       /* acquire mutex */
if (status == init) {      /* is it still the initial value? */
    status = wait;         /* indicate that caller is waiting */
    while (status == wait) {  /* wait until status changes */
        cond_wait          /* wait for signal from callee */
    }
}
unlock
Callee:
...                        /* compute result */
lock;                      /* acquire mutex */
v = result;                /* set result */
notify = (status == wait); /* is caller waiting? */
status = sent;             /* result is available */
if (notify)
    cond_signal            /* notify caller if it is waiting */
unlock                     /* done */
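A concrete version of this simple protocol using POSIX threads might look like the following sketch; the structure and names are illustrative only:

#include <pthread.h>

/* Sketch only: shared state between caller and callee. */
enum result_state { ST_INIT, ST_WAIT, ST_SENT };

struct async_result_S
{
    pthread_mutex_t    mtx;
    pthread_cond_t     cv;
    enum result_state  status;
    int                value;       /* the result */
};

/* caller side: fetch the result, waiting if it has not arrived yet */
int
result_get(struct async_result_S *r)
{
    int value;

    pthread_mutex_lock(&r->mtx);
    if (r->status == ST_INIT)
    {
        r->status = ST_WAIT;        /* indicate that caller is waiting */
        while (r->status == ST_WAIT)
            pthread_cond_wait(&r->cv, &r->mtx);
    }
    value = r->value;
    pthread_mutex_unlock(&r->mtx);
    return value;
}

/* callee side: store the result, wake the caller only if it is waiting */
void
result_put(struct async_result_S *r, int value)
{
    int notify;

    pthread_mutex_lock(&r->mtx);
    r->value = value;
    notify = (r->status == ST_WAIT);
    r->status = ST_SENT;
    if (notify)
        pthread_cond_signal(&r->cv);
    pthread_mutex_unlock(&r->mtx);
}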
This can be extended if there are multiple function calls and hence multiple results to wait for. It would be ugly to use an individual status variable for each function, hence a status variable and a counter are used. If the counter is zero, then all results are available. The status variable simply indicates whether the caller is waiting for a result and hence the callee should use the condition variable to signal that the result is valid. Since there can be multiple callees, only the one for which the counter is zero and the status variable is wait must signal completion.
Caller:
status = init;             /* initialize system */
counter = 0;
...
lock
++counter;
invoke callee              /* call asynchronous function */
unlock                     /* (maybe multiple times) */
...                        /* do something */
lock                       /* acquire mutex */
if (counter > 0) {         /* are there outstanding results? */
    status = wait;         /* indicate that caller is waiting */
    while (status == wait && counter > 0) /* wait for results */
        cond_wait          /* wait for signal from callee */
}
unlock                     /* done */
Callee:
...                        /* compute result */
lock                       /* acquire mutex */
--counter;
v = result;                /* set result */
/* is caller waiting and this is the last result? */
if (status == wait && counter == 0) {
    status = sent;         /* results are available */
    cond_signal            /* notify caller */
}
unlock                     /* done */
Notes:
If the result value cannot have all possible values in its range, then two values can be designated as v-init and v-wait while all others can be considered as v-sent. This removes the need for a status variable because the result variable v itself is used for that purpose:
Caller:
v = v_init;                /* initialize system */
invoke callee              /* call asynchronous function */
...                        /* do something */
lock                       /* acquire mutex */
if (v == v_init) {         /* is it still the initial value? */
    v = v_wait;            /* indicate that caller is waiting */
    while (v == v_wait) {  /* wait until status changes */
        cond_wait          /* wait for signal from callee */
    }
}
unlock
Callee:
...                        /* compute result */
lock;                      /* acquire mutex */
notify = (v == v_wait);    /* is caller waiting? */
v = result;                /* set result */
if (notify)
    cond_signal            /* notify caller if it is waiting */
unlock                     /* done */
To deal with various kinds of errors, it is necessary to write functions such that they are ``reversible'', i.e., they either perform a transaction or they don't change the state (in an inconsistent manner). Unfortunately, it is fairly complicated to write all functions in such a way. For example, if multiple changes must be made, each of which can fail, then either the previous state must be preserved such that it can be restored when something goes wrong, or the changes must be undone individually. For this to work properly, it is essential that the ``undo'' operation itself cannot fail. Hence an ``undo'' operation must not rely on resources that might become unavailable during processing; for example, if it requires memory, then that memory must be allocated before the whole operation is started, or at least while the individual changes are made, such that the ``undo'' operation does not need to allocate memory. This might not be achievable in some cases, however. For example, there are operations in sendmail X that require updates to two persistent databases, i.e., DEFEDB and IBDB. Both of these require disk space which cannot be pre-allocated, nor can it be guaranteed that two independent disk write operations will succeed. If the first one fails, then we don't need to perform the second one. However, if the first one succeeds and the second one fails, then there is no guarantee that the first operation can be undone. Hence the operations must be performed in such a manner that even in the worst case mail will not be lost but at most delivered more than once.
Transaction-based processing is also fairly complicated in case of asynchronous operations. If a function has to perform changes to a local state and creates a list of change requests which will be performed later on by some asynchronously running thread, then that list of change requests should offer a callback functionality which can be invoked with the status of the operation such that it can either perform the local changes (if that has not been done yet) or undo the local changes (if they have been performed earlier). An additional problem is that other functions may have performed changes in between. Hence it is not possible to save the old state and restore it in case of an error because that would erase other (valid) changes. Instead ``relative'' changes must be performed, e.g., incrementing or decrementing a counter.
This section describes the implementation of some libraries which are used by various sendmail X modules.
Queues, lists, et al. are taken from OpenBSD <sys/queue.h>.
Two hash table implementations are available: a conventional one with one key and a version with two keys. The latter is currently not used; it was intended for the DA database.
Classes can be used to check whether a key is in a set.
Classes can be implemented via a hash table: sendmail 8 allows up to 256 classes (there is a simple mapping of class names to single byte values). Each element of a class is stored in a hash table; the RHS is a bitmap that indicates to which classes the element belongs. Checking whether a word is in a class is done by looking it up in the hash table; if it exists, test whether it is a member of the class in question (a simple bitnset() test). This kind of implementation uses one hash table for many classes.
Alternative implementations are lists (linear search), if a class has only a few elements, or trees.
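A minimal sketch of the bitmap approach described above; the names and macros are hypothetical, the real code uses the hash table library and the bitnset() test mentioned above.

#define MAXCLASSES  256
#define BITMAPBYTES (MAXCLASSES / 8)

/* RHS stored in the hash table for each word: one bit per class */
struct class_entry_S
{
    unsigned char ce_map[BITMAPBYTES];
};

/* mark a word (represented by its hash table entry) as member of class cl */
static void
class_add(struct class_entry_S *ce, unsigned int cl)
{
    ce->ce_map[cl / 8] |= (unsigned char) (1 << (cl % 8));
}

/* is the word a member of class cl?  ce is the result of the hash table
 * lookup, NULL if the word is not in any class */
static int
class_member(const struct class_entry_S *ce, unsigned int cl)
{
    if (ce == NULL)
        return 0;
    return (ce->ce_map[cl / 8] >> (cl % 8)) & 1;
}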
Various versions of balanced trees are available which are based on code that has been found on the internet. See include/sm/tree.h, include/sm/avl.h, and include/sm/bsd-tree.h. The latter is taken from OpenBSD <sys/tree.h>, it can be used to generate functions that operate on splay trees and red-black trees. That is, there aren't generic functions that operate on trees, but specific functions are generated that operate on one type. Of course it is possible to generate a generic version where a tree node contains the usual data, i.e., a key and a pointer to the value.
A restricted size cache is implemented using a hash table (for access) and a linked list which is kept in most-recently used order (MRU).
Note: if we don't need to dynamically add and delete entries in the RSC and we don't need the MRU feature, then we can probably use a (binary) hash table which keeps track of the number of entries in it.
The structure definitions given below are hidden from the application.
typedef struct rsc_S rsc_T, *rsc_P;

struct rsc_entry_S
{
    CIRCLEQ_ENTRY(rsc_entry_S) rsce_link;   /* MRU linkage */
    const char      *rsce_key;              /* lookup key */
    unsigned int     rsce_len;              /* key len */
    void            *rsce_value;            /* corresponding value */
};

struct rsc_S
{
    bht_P            rsc_table;     /* table with key, rsc_entry pairs */
    unsigned int     rsc_limit;     /* max # of entries */
    unsigned int     rsc_used;      /* current # of entries */
    rsc_create_F     rsc_create;    /* constructor */
    rsc_delete_F     rsc_delete;    /* destructor */
    CIRCLEQ_HEAD(, rsc_entry_S) rsc_link;   /* MRU linkage */
    void            *rsc_ctx;       /* application context */
};
The following functions (and function types) are provided:
typedef void *(*rsc_create_F)(const char *_key, unsigned int _len,
                void *_value, void *_ctx);
typedef sm_ret_T (*rsc_delete_F)(void *_value, void *_ctx);
typedef void (rsc_walk_F)(const char *_key, const void *_value,
                const void *_ctx);

extern rsc_P       rsc_create(sm_rpool_P _rpool, unsigned int _limit,
                        unsigned int _htsize, rsc_create_F _create,
                        rsc_delete_F _delete, void *_ctx);
extern void        rsc_free(rsc_P _rsc);
extern void        rsc_walk(rsc_P _rsc, rsc_walk_F *_f);
extern sm_ret_T    rsc_add(rsc_P _rsc, bool _delok, const char *_key,
                        unsigned int _len, void *_value, void **_pvalue);
extern const void *rsc_lookup(rsc_P _cache, const char *_key,
                        unsigned int _len);
extern sm_ret_T    rsc_rm(rsc_P _rsc, const char *_key, unsigned int _len);
extern int         rsc_usage(rsc_P _rsc);
extern void        rsc_stats(rsc_P _rsc, char *_out, unsigned int _len);
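A hedged usage sketch of this API; the callback bodies, the NULL arguments (no rpool, no application context), the SM_SUCCESS return code, and the NULL-on-miss behavior of rsc_lookup() are assumptions for illustration, not verified details.

/* constructor: this cache simply stores the value pointer unchanged */
static void *
app_create(const char *key, unsigned int len, void *value, void *ctx)
{
    return value;
}

/* destructor: nothing extra to free in this sketch */
static sm_ret_T
app_delete(void *value, void *ctx)
{
    return SM_SUCCESS;          /* assumption: generic success code */
}

    rsc_P cache;
    const void *value;

    /* at most 128 entries, hash table with 64 slots */
    cache = rsc_create(NULL, 128, 64, app_create, app_delete, NULL);
    ret = rsc_add(cache, true, "some key", 8, app_data, NULL);
    value = rsc_lookup(cache, "some key", 8);   /* NULL if not cached */
    ret = rsc_rm(cache, "some key", 8);
    rsc_free(cache);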
If an RSC stores data of different types, it must be possible to distinguish between them. This is necessary for functions that ``walk'' through an RSC and perform operations on the data, or just for the generic functions, i.e., create and delete. As such, the RSC could store an additional type and pass it to the delete and create functions. However, this could also be handled by the application itself, i.e., it adds a unique identifier to the data that it stores in an RSC.
The API of a typed RSC implementation differs as follows:
Note that this causes problems with a unified API as described in Section 3.14.12.
We could use a fixed number of keys and set those that we are not interested in to NULL, or we could use a variable number of keys and specify a key and an index. How easy is the second solution to implement? The keys might be implemented as an array with some upper limit that is specified at creation time.
A possible approach to allow for any number of keys is to specify an array of keys (and corresponding lengths). During initialization the maximum number of keys is defined (and stored in the structure describing the DB). Access methods pass in a key and length and the index of the key, where zero is the primary key. When an element is entered, the primary key is used. What about the other keys? They must be specified too since the appropriate entries in the hash tables must be constructed. This seems to become ugly. Let's try a simpler approach first (one, two, three keys).
This might be implemented by having a (second) link in the entries which points to the next element with the same key. Check the hash table implementation.
The event thread library provides the basic framework for a worker based thread pool that is driven by I/O events and by wakeup times for tasks. See Section 3.18.5 for a functional description.
This library uses a general context that describes one event thread system and a per-task context. It uses two queues: a run queue of tasks that can be executed, and a wait queue of tasks that wait for some event. Each task is in exactly one of the two queues at any given time.
The event thread context looks like this:
struct sm_evthr_ctx_S
{
    pthread_cond_t   evthrc_cv;
    pthread_mutex_t  evthrc_waitqmut;
    pthread_mutex_t  evthrc_runqmut;
    int              evthrc_max;        /* max. number of threads */
    int              evthrc_min;        /* min. number of threads */
    int              evthrc_cur;        /* current number of threads */
    int              evthrc_idl;        /* idle threads */
    int              evthrc_stop;       /* stop threads */
    int              evthrc_maxfd;      /* maximum number of FDs */
    timeval_T        evthrc_time;       /* current time */
    sm_evthr_task_P *evthrc_fd2t;       /* array to map FDs to tasks */

    /* pipe between control thread and worker/signal threads */
    int              evthrc_pipe[2];

    CIRCLEQ_HEAD(, sm_evthr_task_S) evthrc_waitq;
    CIRCLEQ_HEAD(, sm_evthr_task_S) evthrc_runq;
};
The system maintains some number of worker threads between the specified minimum and maximum. Idle threads vanish after some timeout until the minimum number is reached.
An application function has this prototype:
typedef sm_ret_T (evthr_task_F)(sm_evthr_task_P);
That is, the function receives the per-task context, whose structure is listed next, as a parameter. Even though this is a violation of the abstraction principle, it allows for some functionality which would be awkward to achieve otherwise. For example, a user application can directly manipulate the next wakeup time. We could hide this behind a void pointer and provide some functions to manipulate the task context if we want to hide the implementation.
struct sm_evthr_task_S
{
    CIRCLEQ_ENTRY(sm_evthr_task_S) evthrt_next;

    /* protects evthrt_rqevf and evthrt_sleep */
    pthread_mutex_t  evthrt_mutex;
    int              evthrt_rqevf;      /* requested event flags; see below */
    int              evthrt_evocc;      /* events occurred; see below */
    int              evthrt_state;      /* current state; see below */
    int              evthrt_fd;         /* fd to watch */
    timeval_T        evthrt_sleep;      /* when to wake up */
    evthr_task_F    *evthrt_fct;        /* function to execute */
    void            *evthrt_actx;       /* application context */
    sm_evthr_ctx_P   evthrt_ctx;        /* evthr context */
    sm_evthr_nc_P    evthrt_nc;         /* network connection */
};
The first element describes the linked list, i.e., either the wait queue or the run queue. The second element (requested event flags) lists the events the task is waiting for:
flag | meaning |
EVT_EV_RD | task is waiting for fd to become ready for read |
EVT_EV_WR | task is waiting for fd to become ready for write |
EVT_EV_LI | task is waiting for fd to become ready for accept |
EVT_EV_SL | task is waiting for timeout |
The third element (events occurred) lists the events that activated the task:
flag | meaning |
EVT_EV_RD_Y | task is ready for read |
EVT_EV_WR_Y | task is ready for write |
EVT_EV_LI_Y | task received a new connection |
EVT_EV_SL_Y | task has been woken up (after timeout) |
The fourth element (status) consists of some internal state flags, e.g., in which queue the task is and what it wants to do next.
The event thread library provides these functions:
extern sm_ret_T evthr_init(sm_evthr_ctx_P *_pctx, int _minthr, int _maxthr,
                int _maxfd);
extern sm_ret_T evthr_task_new(sm_evthr_ctx_P _ctx, sm_evthr_task_P *_task,
                int _ev, int _fd, timeval_T *_sleept, evthr_task_F *_fct,
                void *_taskctx);
extern sm_ret_T evthr_loop(sm_evthr_ctx_P _ctx);
extern sm_ret_T evthr_waitq_app(sm_evthr_task_P _task);
extern sm_ret_T evthr_en_wr(sm_evthr_task_P _task);
extern sm_ret_T evthr_time(sm_evthr_ctx_P _ctx, timeval_T *_ct);
extern sm_ret_T evthr_new_sl(sm_evthr_task_P _task, timeval_T _slpt,
                bool _change);
An application first initializes the library (evthr_init()), then it creates at least one task (evthr_task_new()), and thereafter calls evthr_loop() which in turn monitors the desired events (I/O, timeout) and invokes the callback functions. Those callback functions return a result which indicates what should happen with the task:
flag | meaning |
EVTHR_OK | do nothing, task has been taken care of |
EVTHR_WAITQ | put in wait queue |
EVTHR_RUNQ | put in run queue |
EVTHR_SLPQ | sleep for a while |
EVTHR_DEL | delete task |
EVTHR_TERM | terminate event thread loop |
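A minimal sketch of how these pieces fit together; the timeval_T field names and the use of -1 for ``no file descriptor'' are assumptions made for illustration.

/* a task that wakes up once per second and eventually stops the loop */
static sm_ret_T
tick_task(sm_evthr_task_P tsk)
{
    static int count = 0;

    if (++count >= 10)
        return EVTHR_TERM;      /* terminate the event thread loop */
    return EVTHR_SLPQ;          /* sleep again until the next timeout */
}

    sm_evthr_ctx_P evthr_ctx;
    sm_evthr_task_P tsk;
    timeval_T slpt;

    slpt.tv_sec = 1;            /* assumption: timeval_T layout */
    slpt.tv_usec = 0;
    ret = evthr_init(&evthr_ctx, 2, 8, 256);    /* 2..8 threads, 256 fds */
    ret = evthr_task_new(evthr_ctx, &tsk, EVT_EV_SL, -1, &slpt,
                tick_task, NULL);
    ret = evthr_loop(evthr_ctx);    /* runs until EVTHR_TERM is returned */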
Section 3.18.5 describes some problems that need to be solved:
Question: what about callback functions for signals? Currently the usual signals terminate the system.
sendmail X uses RCBs for communication between the various modules (see Section 3.14.11.1.1). This section describes some functions that simplify handling of RCB based communication for modules which use the event thread library (see Section 4.3.8).
This library uses the following structure:
struct rcbcom_ctx_S
{
    sm_evthr_task_P  rcbcom_tsk;        /* event thread task */
    sm_rcb_P         rcbcom_rdrcb;      /* RCB for communication (rd) */
    SIMPLEQ_HEAD(, sm_rcbe_S) rcbcom_wrrcbl;    /* RCB list (wr) */
    pthread_mutex_t  rcbcom_wrmutex;
};
The first entry is a pointer to the task that is responsible for the communication. The second entry is the RCB in which data is received. The third is a list of RCBs for data that must be sent out; this list is protected by the mutex that is the last element of the structure.
The following functions are provided by the library:
sm_ret_T sm_rcbcom_open(rcbcom_ctx_P rcbcom_ctx): create a RCB communication context.
sm_ret_T sm_rcbcom_close(rcbcom_ctx_P rcbcom_ctx): close a RCB communication context.
sm_ret_T sm_rcbe_new_enc(sm_rcbe_P *prcbe, int minsz): create an entry for the write RCB list and open it for encoding.
sm_rcbcom_prerep(rcbcom_ctx_P rcbcom_ctx, sm_evthr_task_P tsk, sm_rcbe_P *prcbe): prepare a reply to a module: close the read RCB after decoding, open it for encoding, create a new RCB for writing (using sm_rcbe_new_enc()), and put the task back into the wait queue (unless tsk is NULL). Notice: after calling this function (with tsk not NULL), the task is not under control of the caller anymore, hence all manipulation of its state must be done via the functions provided by the event thread library.
sm_rcbcom_endrep(rcbcom_ctx_P rcbcom_ctx, sm_evthr_task_P tsk, bool notified, sm_rcbe_P rcbe): close the write RCB, append it to the RCB list in the context, and if notified is not set inform the task about the new write request (using evthr_en_wr()).
sm_rcbcom2mod(sm_evthr_task_P tsk, rcbcom_ctx_P rcbcom_ctx): send the first element of the RCB write list to the file descriptor specified in the task. This function should be called when the event threads library invokes the callback for the task and indicates that the file descriptor is ready for writing, i.e., it will be called from an event thread task that checks for I/O activity on that file descriptor. Since the function does not conform to the function specification for a task, it is called by a wrapper that extracts the RCB communication context from the task (probably indirectly). The function will only send the first element of the list; if the list is empty afterwards, it will disable the write request for this task, otherwise the event thread library will invoke the callback again when the file descriptor is ready for writing. This offers a chance for an optimization: check whether another RCB can be written; question: is there a call that can determine the free buffer size for the file descriptor such that the next write operation does not block?
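Putting these functions together, a typical reply from a module handler could look like the following sketch; the encoding of the actual reply records is omitted since the RCB encoding functions are not described in this document.

    sm_rcbe_P rcbe;
    sm_ret_T ret;

    /* finish decoding the request, open a new RCB for the reply,
     * and hand the task back to the event thread system */
    ret = sm_rcbcom_prerep(rcbcom_ctx, tsk, &rcbe);

    /* ... encode the reply records into rcbe here ... */

    /* queue the reply and trigger the write side of the task
     * (notified == false: the task has not been informed yet) */
    ret = sm_rcbcom_endrep(rcbcom_ctx, tsk, false, rcbe);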
An asynchronous DNS resolver has been implemented (libdns/) according to the specification of Section 3.14.13.
A DNS request has the following structure:
struct dns_req_S
{
    sm_cstr_P        dnsreq_query;      /* the query itself */
    time_T           dnsreq_start;      /* request start time */
 /* unsigned int     dnsreq_tries;       * retries */
    dns_type_T       dnsreq_type;       /* query type */
    unsigned short   dnsreq_flags;      /* currently: is in list? */
    sm_str_P         dnsreq_key;        /* key for hash table: query + type */
    dns_callback_F  *dnsreq_fct;        /* callback into application */
    void            *dnsreq_ctx;        /* context for application callback */
    TAILQ_ENTRY(dns_req_S) dnsreq_link; /* next entry */
};
A DNS result consists of a list of entries with the following elements:
typedef union
{
    ipv4_T           dnsresu_a;
    sm_cstr_P        dnsresu_name;      /* name from DNS */
} dns_resu_T;

struct dns_res_S
{
    sm_cstr_P        dnsres_query;      /* original query */
    sm_ret_T         dnsres_ret;        /* error code */
    dns_type_T       dnsres_qtype;      /* original query type */
    unsigned int     dnsres_entries;    /* number of entries */
    unsigned int     dnsres_maxentries; /* max. number of entries */
    dns_resl_T       dnsres_hd;         /* head of list of mx entries */
};

struct dns_rese_S
{
    dns_type_T       dnsrese_type;      /* result type */
    unsigned int     dnsrese_ttl;       /* TTL from DNS */
    unsigned short   dnsrese_pref;      /* preference from DNS */
    unsigned short   dnsrese_weight;    /* for internal randomization */
    sm_cstr_P        dnsrese_name;      /* RR name */
    TAILQ_ENTRY(dns_rese_S) dnsrese_link;   /* next entry */
    dns_resu_T       dnsrese_val;
};
A DNS manager context contains the following elements:
struct dns_mgr_ctx_S
{
    unsigned int     dnsmgr_flags;
    dns_rql_T        dnsmgr_req_hd;     /* list of requests */
    dns_req_P        dnsmgr_req_cur;    /* current request */

    /* hash table to store requests */
    bht_P            dnsmgr_req_ht;

#if SM_USE_PTHREADS
    pthread_mutex_t  dnsmgr_mutex;      /* for the entire context? */
    sm_evthr_task_P  dnsmgr_tsk;        /* XXX Just one? */
    sm_evthr_task_P  dnsmgr_cleanup;
#endif /* SM_USE_PTHREADS */
};
and a DNS task looks like this:
struct dns_tsk_S
{
    /* XXX int or statethreads socket */
    int              dnstsk_fd;         /* socket */
    dns_mgr_ctx_P    dnstsk_mgr;        /* DNS manager */
    uint             dnstsk_flags;      /* operating flags */
    uint             dnstsk_timeouts;   /* queries that timed out */
    sockaddr_in_T    dnstsk_sin;        /* socket description */
    sm_str_P         dnstsk_rd;         /* read buffer */
    sm_str_P         dnstsk_wr;         /* write buffer */
};
The DNS library offers functions to create and delete a DNS manager context:
sm_ret_T dns_mgr_ctx_new(uint flags, dns_mgr_ctx_P *pdns_mgr_ctx);
sm_ret_T dns_mgr_ctx_del(dns_mgr_ctx_P dns_mgr_ctx);
and similar functions for DNS tasks:
sm_ret_T dns_tsk_new(dns_mgr_ctx_P dns_mgr_ctx, uint flags, ipv4_T ipv4,
            dns_tsk_P *pdns_tsk);
sm_ret_T dns_tsk_del(dns_tsk_P dns_tsk);
sm_ret_T dns_tsk_start(dns_mgr_ctx_P dns_mgr_ctx, dns_tsk_P dns_tsk,
            sm_evthr_ctx_P evthr_ctx);
An application can make a DNS request using the following function:
sm_ret_T dns_req_add(dns_mgr_ctx_P dns_mgr_ctx, sm_cstr_P query, dns_type_T type, dns_callback_F *fct, void *ctx);
Internal functions of the library are:
sm_ret_T dns_comm_tsk(sm_evthr_task_P tsk);
sm_ret_T dns_tsk_cleanup(sm_evthr_task_P tsk);
sm_ret_T dns_tsk_rd(sm_evthr_task_P tsk);
sm_ret_T dns_tsk_wr(sm_evthr_task_P tsk);
void dns_req_del(void *value, void *ctx);
sm_ret_T dns_receive(dns_tsk_P dns_tsk);
sm_ret_T dns_send(dns_tsk_P dns_tsk);
sm_ret_T dns_decode(sm_str_P ans, uchar *query, int qlen, dns_type_T *ptype,
            dns_res_P dns_res);
dns_comm_tsk() is a function that is called whenever I/O activity is possible; it invokes the read and write functions dns_tsk_rd() and dns_tsk_wr(), respectively. dns_tsk_cleanup() is a cleanup task that deals with requests which didn't receive an answer within a certain amount of time.
The following steps are necessary for initializing and starting the DNS resolver (when using the event threads system):
/* Initialize DNS resolver */
ret = dns_rslv_new(random);

/* Create DNS manager context */
ret = dns_mgr_ctx_new(0, &dns_mgr_ctx);

/* Create one DNS resolver task */
ret = dns_tsk_new(dns_mgr_ctx, 0, ipv4, &dns_tsk);

/* Start DNS tasks (resolver and cleanup) */
ret = dns_tsk_start(dns_mgr_ctx, dns_tsk, evthr_ctx);
dns_req_add() creates a DNS request and always adds it to the hash table dnsmgr_req_ht; it is appended to the list maintained in the DNS manager context (dnsmgr_req_hd) only if no request for the same query is queued yet. The key for the hash table is the DNS query string and the DNS query type concatenated (a trailing dot in the query string is removed). Whenever a DNS request is added to the list, the write event is triggered for dns_tsk_wr(), which takes the current request (pointed to by dnsmgr_req_cur), forms a DNS query, and sends it to a DNS server using dns_send(). Notice: the request is not removed from the list at this point; this is done by either the cleanup task or the read function. To indicate whether a request is in the list or merely in the hash table, the field dnsreq_flags is used.
The hash table is used by the function dns_tsk_rd(), which receives replies from a DNS server (using dns_receive() and dns_decode()) and identifies all requests for a DNS query. The function removes all those requests from the hash table and the DNS manager list, and creates a local list out of them. Then it walks through that list and invokes the callback functions specified in the requests with the DNS result and the application specific context.
The cleanup task dns_tsk_cleanup() walks the DNS manager list of requests (from the first element up to the current one) and checks whether they have timed out (based on the time the request was made, dnsreq_start). If this is the case, the request is removed from the list and appended to a local list. Moreover, all requests in the hash table for the same query are moved into the local list too - this needs improvement. As soon as a request is found that has not timed out (or the current element is reached), the search stops (this currently doesn't allow for individual timeouts). Thereafter, the callbacks for the requests in the local list are invoked with a DNS result that contains an error code (timeout).
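A hedged usage sketch of the application side: the exact signature of dns_callback_F is not shown in this document, so the callback prototype below (result plus application context, no return value) is an assumption, as is the query type constant T_MX.

/* assumption: the callback receives the DNS result and the application
 * context that was passed to dns_req_add() */
static void
mx_done(dns_res_P res, void *app_ctx)
{
    dns_rese_P entry;

    /* walk the list of result entries (a TAILQ, see dns_res_S above) */
    TAILQ_FOREACH(entry, &res->dnsres_hd, dnsrese_link)
    {
        /* use entry->dnsrese_pref, entry->dnsrese_name, ... */
    }
}

    /* query: an sm_cstr_P holding the domain to look up
     * (construction of the string is not shown here) */
    ret = dns_req_add(dns_mgr_ctx, query, T_MX, mx_done, app_ctx);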
There is currently a simple implementation for logging in libmta/log.c. It closely follows the ISC logging module API mentioned in Section 3.14.16.1.
The current implementation has a slight problem with logfile rotation. The programs use stdout/stderr for logging; they do not open a named logfile themselves, that is done for them by the MCP. Hence there is no way for them to reopen a logfile; instead the sm_log_reopen() function rewinds the file using ftruncate(2) and fseek(3).
Alternatively the name of a logfile can be passed to each module such that it can open the file itself; the filename is then available for the reopen function. The filename could be stored in the logging context so that the reopen function can act accordingly, i.e., use sm_io_open() if a filename exists and otherwise the method described above.
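A minimal sketch of that alternative; plain POSIX/stdio calls are used here, whereas the real implementation would go through the sm_io layer, and the function name is hypothetical.

#include <stdio.h>
#include <unistd.h>

/* reopen the logfile if its name is known, otherwise rewind and
 * truncate the stream that was handed over (e.g., stdout/stderr) */
static int
log_reopen(FILE *fp, const char *fname)
{
    if (fname != NULL)
        return freopen(fname, "a", fp) != NULL ? 0 : -1;
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    return ftruncate(fileno(fp), 0);
}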
The first prototype of SMTPS is based on state threads (see Section 3.18.3.1).
This prototype uses a set of threads which is limited by an upper and lower number. If threads are idle for some time and there are more than a specified number of idle threads available, they terminate. If not enough threads are available, new ones are created up to a specified maximum.
Remark (placed here so it doesn't get lost): there is a restricted number (about 60000) of possible open connections to one port. Could that limit the throughput we are trying to achieve, or is such a high number of connections unfeasible?
We do not want a separate connection between QMGR and SMTPS for each thread in SMTPS, hence we need to associate the data from the QMGR with the right thread in SMTPS. One approach is to have a receiver thread in SMTPS which communicates with the QMGR. It receives all data from the QMGR and identifies the context (session/transaction) to which the data belongs. This needs some list of all contexts, e.g., an AVL tree, or, if the number of entries is small enough, a linear list. Question: is there some better method, i.e., instead of searching some structure have direct access to the right thread (see also Section 3.1.1.2)? There might be some optimization possible since each SMTPS has only a limited number of threads, so we could have an array of that size and encode an index into that array into the RCB, e.g., use another ID type that is passed around (like a context pointer). The receiver thread then adds the data to the context and notifies the thread.

There is one thread to read from the communication channel, but multiple tasks can actually write to it; writing is sequentialized by a mutex. In a conventional thread model, we would just select on I/O activities for that channel and on notifications when a new RCB is added to the list to send to the QMGR (as is done in the QMGR); however, in state threads I/O activity is controlled by the internal scheduler. Since there are multiple threads, it might be necessary to control how far ahead the writers can be of the readers (to avoid starvation and unfairness). However, this should be self-adjusting since threads wait for replies to the requests they have sent before sending new ones (by default; in some cases a few requests may be outstanding from one thread). If too many threads send data, then the capacity of the communication channel and the way requests are handled by the QMGR should avoid starvation and guarantee fairness.
As explained above there is one thread that takes care of the communication between the module and the QMGR. This thread uses the following structure as context:
struct s2q_ctx_S
{
    int              s2q_status;        /* status */
    st_netfd_t       s2q_fd;            /* fd for communication */
    int              s2q_smtps_id;      /* smtps id */
    st_mutex_t       s2q_wr_mutex;      /* mutex for write */
    unsigned int     s2q_maxrcbs;       /* max. # of outstanding requests */
    unsigned int     s2q_currcbs;       /* current # of outstanding requests */
    sessta_id_P     *s2q_sids;          /* array of session ids */
    smtps_sess_P    *s2q_sess;          /* array of session ctx */
};
For initialization and termination of the communication task the following two functions are provided:
sm_ret_T sm_s2q_init(s2q_ctx_P s2q_ctx, int smtps_id, unsigned int maxrcbs);
sm_ret_T sm_s2q_stop(s2q_ctx_P s2q_ctx);
The initialization function connects to the QMGR and stores the file descriptor for communication in s2q_fd. It allocates two arrays for session IDs and session contexts which are used to find the SMTPS session for an incoming RCB, and it sends the initial ``a new SMTPS has been started'' notification to the QMGR. Finally sm_s2q_init() starts a thread that executes the function void *sm_rcb_from_srv(void *arg), which receives the s2q context as parameter. This function receives an RCB from the QMGR and notifies the thread that is associated with the task via a condition variable; the thread can be found using the s2q_sids array.
Data can be sent to the QMGR using one of the functions sm_s2q_*() for new session ID, close session ID, new transaction ID, new recipient, close transaction ID, and discard transaction ID.
The function sm_w4q2s_reply() is used to wait for a reply from the QMGR. It waits on a condition variable (stored in the SMTPS session context) which is signalled by sm_rcb_from_qmgr().
Initially the SMTP server sends the QMGR its id and the maximum number of threads it is going to create.
RT_S2Q_NID | id of new SMTPS |
RT_S2Q_ID | id of SMTPS |
RT_S2Q_CID | close SMTPS (id) |
RT_S2Q_STAT | status |
RT_S2Q_MAXTHRDS | max number of threads |
RT_S2Q_NSEID | new session id |
RT_S2Q_SEID | session id |
RT_S2Q_CSEID | close session id |
RT_S2Q_CLTIP4 | client IPv4 address |
RT_S2Q_CLTIP6 | client IPv6 address |
RT_S2Q_CLTPORT | client port |
RT_S2Q_NTAID | new transaction id |
RT_S2Q_TAID | transaction id |
RT_S2Q_CTAID | close transaction id |
RT_S2Q_DTAID | discard transaction id |
RT_S2Q_MAIL | mail from |
RT_S2Q_RCPT_IDX | rcpt idx |
RT_S2Q_RCPT | rcpt to |
RT_S2Q_CDBID | cdb id |
The common reply format from QMGR to SMTPS consists of the SMTPS id (which is only transmitted for paranoia), a session or transaction id, a status code and an optional status text:
RT_Q2S_ID | SMTPS id |
RT_Q2S_SEID/RT_Q2S_TAID | session/transaction id |
RT_Q2S_STATD | status (ok, reject, more detailed?) |
RT_Q2S_STATT | status text |
The function sm_rcb_from_srv() uses the session/transaction id to find the correct thread to which the rest of the RCB will be given.
RT_Q2S_ID | id of SMTPS |
RT_Q2S_STAT | status for session/transaction/... |
RT_Q2S_STATV | status value (text follows) |
RT_Q2S_STATT | status text |
RT_Q2S_SEID | session id |
RT_Q2S_TAID | transaction id |
RT_Q2S_RCPT_IDX | rcpt idx |
RT_Q2S_CDBID | cdb id |
RT_Q2S_THRDS | slow down |
RT_Q2S_STOP | stop reception (use slow = 0?) |
RT_Q2S_DOWN | shut down |
The SMTP server uses the AR as map lookup server to avoid blocking calls in the state-threads application. While the anti-spam logic etc. is implemented in SMTPS, the map lookups are performed by SMAR. Hence SMTPS only sends minimal information to SMAR, e.g., the sender or recipient address, and asks for lookups in some maps with certain features, e.g., look up the full address, the domain part, or the address without detail (``+detail'').
RT_S2A_TAID | transaction id |
RT_S2A_MAIL | mail from |
RT_S2A_RCPT_IDX | rcpt idx |
RT_S2A_RCPT | rcpt to (printable address) |
RT_S2A_LTYPE | lookup type |
RT_S2A_LFLAGS | lookup flags |
To simplify the SMTP server code, the reply format for SMAR is basically the same as for QMGR:
RT_A2S_ID | id of SMTPS |
RT_A2S_TAID | transaction id |
RT_A2S_STATD | status (ok, reject, more detailed?) |
RT_A2S_STATT | status message |
RT_A2S_MAIL_ST | mail status |
RT_A2S_RCPT_IDX | rcpt index |
RT_A2S_RCPT_ST | rcpt status |
Values for lookup types are:
LT_LOCAL_ADDR | is local address? |
LT_RCPT_OK | is address ok as a recipient? |
LT_MAIL_OK | is address ok as a sender? |
Values for lookup flags are:
LF_DOMAIN | try domain |
LF_FULL | try full address |
LF_LOCAL | try localpart |
LF_NODETAIL | try without detail |
As explained elsewhere (2.8.2) it is possible to specify multiple delivery classes and multiple delivery agents that implement delivery classes. The former are referenced by the address resolver when selecting a mailer. The latter are selected by the scheduler after it receives a mailer to use.
Every delivery agent has an index and a list of delivery classes it implements. There is also a list of delivery classes (which are referenced by some id, most likely a numeric index into an array). This list is maintained by SMAR, each DA, and QMGR (and must obviously be kept in sync if numeric indices are used instead of names). QMGR keeps for each delivery class a list of delivery agents that implement the class, which can be used by the scheduler to select a DA that will perform a delivery attempt.
Note: As described in Section 3.8.3.1 item 6, the first version of sendmail X does not need to implement the full set of this; all delivery agents implement the same delivery classes, hence they can be selected freely without any restriction.
The first prototype of SMTPC is based on state threads (see Section 3.18.3.1).
It follows a similar thread model as that used for the SMTP server daemon (see Section 4.4).
See Section 3.4.5.0.1 for a description.
As usual, a protocol header is sent first. Moreover, the next entry in each RCB is the identifier of the SMTPC to which the QMGR wants to talk: RT_Q2C_ID.
The rest of the RCB is described below for each function.
Notice: for status codes an additional text field might follow, which currently isn't specified here.
RT_Q2C_DCID: delivery class id.
More data to follow, e.g., requirements about the session.
For the transaction data see below (item 7).
RT_C2Q_SESTAT: session status: either SMTP status code or an error code, e.g., connection refused etc.
The recipient data might be repeated to list multiple recipients. Notice: we may run into a size limit of RCBs here; do we need something like a continuation RCB?
RT_Q2C_CSEID: close session id
This value can be pretty much ignored for all practical purposes, except if we want to see whether the server behaves properly and still responds.
RT_C2Q_SESTAT: session status (or do we want to use a different record type? Might be useful to distinguish to avoid confusion)
The main SMTPC context structure looks like this:
struct sc_ctx_S
{
    unsigned int     sc_max_thrds;      /* Max number of threads */
    unsigned int     sc_wait_thrds;     /* # of threads waiting to accept */
    unsigned int     sc_busy_thrds;     /* # of threads processing request */
    unsigned int     sc_rqst_count;     /* Total # of processed requests */
    uint32_t         sc_status;         /* SMTPC status */
    sm_str_P         sc_hostname;       /* SMTPC hostname */
    sc_t_ctx_P      *sc_scts;           /* array of sct's */
};
The last element of that structure is an array of SMTPC thread contexts (of size sc_max_thrds):
struct sc_t_ctx_S
{
    sc_ctx_P         sct_sc_ctx;        /* pointer back to sc_ctx */
    unsigned int     sct_thr_id;        /* thread id (debugging) */
    unsigned int     sct_status;
    st_cond_t        sct_cond_rd;       /* received data from QMGR */
    sc_sess_P        sct_sess;          /* current session */
};
The condition variable denotes when data from the QMGR is received for this particular thread. The last element is a pointer to the SMTPC session:
struct sc_sess_S
{
    sc_t_ctx_P       scse_sct_ctx;      /* pointer to thread context */
    sm_file_T       *scse_fp;           /* file to use (SMTP) */
    sm_str_P         scse_rd;           /* smtp read buffer */
    sm_str_P         scse_wr;           /* smtp write buffer */
    sm_str_P         scse_str;          /* str for general use */
    sm_rpool_P       scse_rpool;
    unsigned int     scse_cap;          /* server capabilities */
    unsigned int     scse_flags;
    unsigned int     scse_state;
    struct in_addr  *scse_client;       /* XXX use a generic struct! */
    sc_ta_P          scse_ta;           /* current transaction */
    sessta_id_T      scse_id;
    sm_rcb_P         scse_rcb;          /* rcb for communication with QMGR */
    SOCK_IN_T        scse_rmt_addr;     /* Remote address */
    st_netfd_t       scse_rmt_fd;       /* fd */
};
The SMTPC transaction structure looks as follows:
struct sc_ta_S
{
    sc_sess_P        scta_sess;         /* pointer to session */
    sm_rpool_P       scta_rpool;
    sc_mail_P        scta_mail;         /* mail from */
    sc_rcpts_P       scta_rcpts;        /* rcpts */
    sc_rcpt_P        scta_rcpt_p;       /* current rcpt for reply */
    unsigned int     scta_rcpts_rcvd;   /* # of recipient replies received */
    unsigned int     scta_rcpts_tot;    /* number of recipients total */
    unsigned int     scta_rcpts_snt;    /* number of recipients sent */
    unsigned int     scta_rcpts_ok;     /* number of recipients ok */
    unsigned int     scta_rcpts_lmtp;   /* # LMTP rcpts still to collect */
    unsigned int     scta_state;
    smtp_status_T    scta_status;       /* SMTP status code (if applicable) */
    sessta_id_T      scta_id;           /* transaction id */
    sm_str_P         scta_cdb_id;       /* CDB id */
};
In the main() function SMTPC calls several initialization functions, one of which (sc_init(sc_ctx)) initializes the SMTPC context and allocates the array of SMTPC thread contexts. Then it starts the minimum number of threads (using start_threads(sc_ctx)), and the main thread takes care of signals afterwards. The threads run the function sc_hdl_requests(), which receives the SMTPC context as parameter. This function looks for a free entry in the SMTPC thread context array and allocates a new thread context which it assigns to that entry. It also allocates a new SMTPC session context. Thereafter it sets its status to SC_T_FREE, and the first thread that reaches this point informs the QMGR communication thread that SMTPC is ready to process tasks. The main part of the function is a loop:
while (WAIT_THREADS(sc_ctx) <= max_wait_threads) { ... }
i.e., the thread stays active as long as the number of waiting threads is below the allowed maximum. This takes care of too many waiting threads by simply terminating them if the condition is false, in which case the thread cleans up after itself and terminates. Inside the loop the thread waits on its condition variable: sc_t_ctx->sct_cond_rd. If that wait times out, the current session (if one is open) will be terminated. If the QMGR actually has a task for this thread, then it first checks whether another thread should be started:
if (WAIT_THREADS(sc_ctx) < min_wait_threads &&
    TOTAL_THREADS(sc_ctx) < MAX_THREADS(sc_ctx))
{
    /* start another thread */
}
and then handles the current session: handle_session(sc_t_ctx). This function handles one SMTP client session. The state of the session is recorded in sc_sess->scse_state and can take one of the following values:
SCSE_ST_NONE | no session active |
SCSE_ST_NEW | new session |
SCSE_ST_CONNECTED | connection succeeded |
SCSE_ST_GREETED | received greeting |
SCSE_ST_OPEN | connection open |
SCSE_ST_CLOSED | close session |
Based on this state the function opens a session if that hadn't happened yet and performs one transaction according to the data from the QMGR. Depending on a flag in sc_sess->scse_flags the session is optionally closed afterwards.
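A hedged sketch of that logic; error handling, the exact state transitions, the return type, and the close flag are omitted or assumed here, and the session/transaction functions used are described further below.

static sm_ret_T
handle_session(sc_t_ctx_P sc_t_ctx)
{
    sc_sess_P sess = sc_t_ctx->sct_sess;
    sm_ret_T ret;

    /* open a session first unless one is already open */
    if (sess->scse_state != SCSE_ST_OPEN)
    {
        ret = smtpc_sess_open(sc_t_ctx);    /* connect, read greeting */
        if (ret != SM_SUCCESS)              /* assumption: success code */
            return ret;
    }

    /* perform one transaction according to the data from the QMGR */
    ret = smtpc_ta(sc_t_ctx);

    /* optionally close the session, depending on a flag in scse_flags
     * (the exact flag is not shown in this document) */
    return ret;
}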
As usual there is one thread that takes care of the communication between the module and the QMGR. This thread uses the following structure as context:
struct c2q_ctx_S
{
    sc_ctx_P         c2q_sc_ctx;        /* pointer back to SMTPC context */
    unsigned int     c2q_status;        /* status */
    st_netfd_t       c2q_fd;            /* fd for communication */
    unsigned int     c2q_sc_id;         /* smtpc id */
    st_cond_t        c2q_cond_rd;       /* cond.var for read */
    st_cond_t        c2q_cond_wr;       /* cond.var for write */
    unsigned int     c2q_maxses;        /* max. # of open sessions */
    sc_sess_P       *c2q_sess;          /* array of session ctx */
};
For initialization and termination the following two functions are provided:
sm_ret_T sm_c2q_init(sc_ctx_P sc_ctx, c2q_ctx_P c2q_ctx, unsigned int sc_idx,
            unsigned int maxses);
sm_ret_T sm_c2q_stop(c2q_ctx_P c2q_ctx);
The initialization function starts a thread that executes the function void *sc_rcb_from_qmgr(void *arg) which receives the c2q context as parameter. This function receives an RCB from QMGR and notifies a thread that is associated with the task or finds a free SMTPC thread if it is a new task. To maintain the former information one array for session contexts c2q_sess is allocated; its size is maxses which is set to MAX_THREADS(sc_ctx) by the caller. This allows the communication module to find the correct session context based on the session (or transaction) identifier sent by the QMGR in its requests if the request refers to an open session. To find a free SMTPC thread, the array sc_scts in the SMTPC context is searched for a NULL entry.
Status information can be sent back to the QMGR using the function sm_ret_T sc_c2q(sc_t_ctx_P sc_t_ctx, uint32_t whichstatus, sm_ret_T ret, c2q_ctx_P c2q_ctx).
The SMTP client functionality is fairly restricted right now, but the system implements full pipelining (in contrast to sendmail 8 which uses MAIL as synchronization point). As usual, the SMTP client is also able to speak LMTP.
To open and close an SMTP session two functions are provided: sm_ret_T smtpc_sess_open(sc_t_ctx_P sc_t_ctx) and sm_ret_T smtpc_sess_close(sc_t_ctx_P sc_t_ctx). The function sm_ret_T smtpc_ta(sc_t_ctx_P sc_t_ctx) performs one SMTP transaction. As can be seen from the prototypes, the only parameter passed to these functions is the SMTPC thread context, which contains (directly or indirectly) pointers to the current SMTPC session and transaction.
As shown in Section 4.6.2.1, the SMTPC session context contains three strings (see 3.14.7) that are used for the SMTP dialog and related operations.
Since the content database stores the mail in SMTP format, it can be sent out directly without any conversion. As on the SMTP server side, this function accesses the file buffer directly to avoid too much copying.
Just some items to take into consideration for the implementation of the queue manager. These are written down here so they don't get lost...
Problem here: what about disk I/O? For example: calling fsync() for the logfile may cause the queue manager to block. If the thread implementation doesn't schedule another thread while one is blocked on disk I/O, then the entire process will hang and the queue manager will not respond to other requests.
If this actually happens (fairly likely on some OSs with user-land pthread implementation), and it causes a problem (performance), then it might be necessary to create another process that actually performs disk I/O on behalf of the QMGR.
How about a flow diagram? Some architectural overview would be nice.
The QMGR should not have to deal with many connections. SMTPS and SMTPC are multi-threaded themselves; we may have some SMTPS/SMTPC processes. However, it won't be so many that we have a problem with the number of connections to monitor, i.e., poll() should be sufficient.
Which threading model should we choose? Just a few worker threads that will go back to work whenever they encounter a blocking action? See Section 3.18 for discussion.
Do we need priority queues or can we serve all jobs FIFO?
The QMGR is based on the event threads library described in Section 4.3.8.
Currently access to tasks is controlled via the mutexes that control the queues: if a task is taken out of a queue, it is under the sole control of the thread that did it, no other thread can (should) access the task. Unless we change this access model, no mutex is necessary for individual tasks.
The queue manager has several data structures that can be concurrently accessed from different threads. Hence the access must be protected by mutexes unless there are other means which prevent conflicts. Some data structures can be rather large, e.g., the various DBs and caches. Locking them for an extended time may cause lock contention. Some data structures and operations on them may allow locking only a single element; others may require locking the entire structure. Examples of the latter are adding and removing elements, which in most cases require locking the entire structure.
In some cases there might be ways around locking contention. For example, to delete items from a DB (or cache) the item might just be marked ``Delete'' instead of actually deleting it. This only requires locking of a single entry, not the entire DB. Those ``Delete'' entries can be removed in a single sweep later on (or during normal ``walk through'' operations), or they can simply be reclaimed for use. Question: what is more efficient? That is, if the DB is large and a walk through all elements is required to free a few, then that might take too long, and we shouldn't hold a lock too long. We could gather ``Delete'' elements in a queue; then we don't have to walk through the entire DB. However, then the position of the elements must be fixed such that we can directly access and delete them, or at least lookup prior to deletion must be fast. If the DB internally may rearrange the location of entries, then we can't keep a pointer to them. Question: will this ever happen? Some DB versions may do this; how about the ones we use? In some cases, some of the algorithms may require that DB elements don't move, but in most cases the elements just contain pointers to the data, which isn't moved and hence can be accessed even if the DB rearranges its internal data structures.
If a system locks various items then there is a potential for deadlocks. One way to prevent this is a locking hierarchy, i.e., items are always locked in the same order. We probably need to define a locking order. It's currently unclear how this can be done such that access is still efficient without too much locking contention. See also Section 4.7.2 for possible ways around locking contention.
The main context for QMGR looks like this: (2004-04-14)
struct qmgr_ctx_S
{
    sm_magic_T       sm_magic;
    pthread_mutex_t  qmgr_mutex;
    unsigned int     qmgr_status;       /* see below, QMGR_ST_* */
    time_T           qmgr_st_time;

    /* Resource flags */
    uint32_t         qmgr_rflags;       /* see QMGR_RFL_* */

    /* Overall value to indicate resource usage 0:free 100:overloaded */
    unsigned int     qmgr_total_usage;

    /* Status flags */
    uint32_t         qmgr_sflags;       /* see QMGR_SFL_* */

    sm_str_P         qmgr_hostname;
    sm_str_P         qmgr_pm_addr;      /* <postmaster@hostname> */

    /* info about connections? */
    fs_ctx_P         qmgr_fs_ctx;
    cdb_fsctx_P      qmgr_cdb_fsctx;
    unsigned long    qmgr_cdb_kbfree;
    edb_fsctx_P      qmgr_edb_fsctx;
    unsigned long    qmgr_edb_kbfree;
    unsigned long    qmgr_ibdb_kbfree;

    /* SMTPS */
    id_count_T       qmgr_idc;          /* last used SMTP id counter */
    int              qmgr_sslfd;        /* listen fd */
    int              qmgr_ssnfd;        /* number of used fds */
    uint32_t         qmgr_ssused;       /* bitmask for used elements */
    qss_ctx_P        qmgr_ssctx[MAX_SMTPS_FD];
    ssocc_ctx_P      qmgr_ssocc_ctx;
    occ_ctx_P        qmgr_occ_ctx;

    /* SMTPC */
    int              qmgr_sclfd;        /* listen fd */
    int              qmgr_scnfd;        /* number of used fds */
    uint8_t          qmgr_scused;       /* bitmask for used elements */
    qsc_ctx_P        qmgr_scctx[MAX_SMTPC_FD];

    sm_evthr_ctx_P   qmgr_ev_ctx;       /* event thread context */
    iqdb_P           qmgr_iqdb;         /* rsc for incoming edb */
    ibdb_ctx_P       qmgr_ibdb;         /* backup for incoming edb */
    sm_evthr_task_P  qmgr_icommit;      /* task for ibdbc commits */
    qss_opta_P       qmgr_optas;        /* open transactions (commit) */
    sm_evthr_task_P  qmgr_sched;        /* scheduling task */
    aq_ctx_P         qmgr_aq;           /* active envelope db */
    edb_ctx_P        qmgr_edb;          /* deferred envelope db */
    edbc_ctx_P       qmgr_edbc;         /* cache for envelope db */
    sm_evthr_task_P  qmgr_tsk_cleanup;  /* task for cleanup */
    qcleanup_ctx_P   qmgr_cleanup_ctx;
    sm_maps_P        qmgr_maps;         /* map system context */

    /* AR */
    sm_evthr_task_P  qmgr_ar_tsk;       /* address resolver task */
    int              qmgr_ar_fd;        /* communication fd */
    qar_ctx_P        qmgr_ar_ctx;

    sm_rcbh_T        qmgr_rcbh;         /* head for RCB list */
    unsigned int     qmgr_rcbn;         /* number of entries in RCB list */

    /* currently protected by qmgr_mutex */
    qmgr_conf_T      qmgr_conf;
    sm_log_ctx_P     qmgr_lctx;
    sm_logconfig_P   qmgr_lcfg;
    uint8_t          qmgr_usage[QMGR_RFL_LAST_I + 1];
    uint8_t          qmgr_lower[QMGR_RFL_LAST_I + 1];
    uint8_t          qmgr_upper[QMGR_RFL_LAST_I + 1];
};
There are task contexts for QMGR/SMTPS (2004-04-15):
struct qss_ctx_S
{
    sm_magic_T       sm_magic;
    rcbcom_ctx_T     qss_com;
    qmgr_ctx_P       qss_qmgr_ctx;      /* pointer back to main ctx */
    int              qss_id;            /* SMTPS id */
    uint8_t          qss_bit;           /* bit for qmgr_ssctx */
    qss_status_T     qss_status;        /* status of SMTPS */
    unsigned int     qss_max_thrs;      /* upper limit for threads */
    unsigned int     qss_max_cur_thrs;  /* current limit for threads */
    unsigned int     qss_cur_session;   /* current # of sessions */
};
and QMGR/SMTPC (2004-04-15):
struct qsc_ctx_S
{
    sm_magic_T       sm_magic;
    rcbcom_ctx_T     qsc_com;
    qmgr_ctx_P       qsc_qmgr_ctx;      /* pointer back to main ctx */
    int              qsc_id;            /* SMTPC id */
    uint8_t          qsc_bit;           /* bit for qmgr_ssctx */
    dadb_ctx_P       qsc_dadb_ctx;      /* pointer to DA DB context */

    /* split this in status and flags? */
    qsc_status_T     qsc_status;        /* status of SMTPC */
    uint32_t         qsc_id_cnt;
};
Both refer to a generic communication structure:
struct qcom_ctx_S
{
    qmgr_ctx_P       qcom_qmgr_ctx;     /* pointer back to main ctx */
    sm_evthr_task_P  qcom_tsk;          /* pointer to evthr task */
    sm_rcb_P         qcom_rdrcb;        /* rcb for rd */
    SIMPLEQ_HEAD(, sm_rcbl_S) qcom_wrrcbl;  /* rcb list for wr */
    pthread_mutex_t  qcom_wrmutex;      /* protect qss_wrrcb */
};
The QMGR holds also the necessary data for SMTPS sessions (2004-04-15):
struct qss_sess_S
{
    sessta_id_T      qsses_id;
    time_T           qsses_st_time;
    sm_rpool_P       qsess_rpool;
    struct in_addr   qsess_client;      /* XXX use a generic struct! */
};
and transactions (2004-04-15):
struct qss_ta_S
{
    sm_rpool_P       qssta_rpool;
    time_T           qssta_st_time;
    qss_mail_P       qssta_mail;        /* mail from */
    qss_rcpts_T      qssta_rcpts;       /* rcpts */
    unsigned int     qssta_rcpts_tot;   /* total number of recipients */
    unsigned int     qssta_flags;
    sessta_id_T      qssta_id;
    cdb_id_P         qssta_cdb_id;
    size_t           qssta_msg_size;    /* KB */
    qss_ctx_P        qssta_ssctx;       /* pointer back to SMTPS ctx */
    pthread_mutex_t  qssta_mutex;
};
The open transaction context (from SMTPS) stores information about outstanding transactions, i.e., those transactions in SMTPS that have ended the data transmission, but have not yet been confirmed by the QMGR. This data structure (fixed size queue) is used for group commits to notify the threads in the SMTPS servers that hold the open transactions.
struct qss_opta_S
{
    unsigned int     qot_max;       /* allocated size */
    unsigned int     qot_cur;       /* currently used (basically last-first) */
    unsigned int     qot_first;     /* first index to read */
    unsigned int     qot_last;      /* last index to read (first to write) */
    pthread_mutex_t  qot_mutex;
    qss_ta_P        *qot_tas;       /* array of open transactions */
};
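A hedged sketch of how an open transaction might be appended to this fixed size queue; the function name is hypothetical, and the real code may handle a full queue differently, e.g., by forcing an immediate commit.

/* append an open transaction to the fixed size queue; -1 if full */
static int
qss_opta_add(qss_opta_P opta, qss_ta_P qss_ta)
{
    int r;

    r = 0;
    pthread_mutex_lock(&opta->qot_mutex);
    if (opta->qot_cur >= opta->qot_max)
        r = -1;     /* full: caller must trigger a commit now */
    else
    {
        opta->qot_tas[opta->qot_last] = qss_ta;
        opta->qot_last = (opta->qot_last + 1) % opta->qot_max;
        ++opta->qot_cur;
    }
    pthread_mutex_unlock(&opta->qot_mutex);
    return r;
}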
Other structures that the QMGR currently uses are
All envelope DBs (INCEDB: ibdb and iqdb, ACTEDB, DEFEDB, EDB) have their own mutexes in their context structures.
IQDB contains references to qss_sess_T, qss_ta_T, and qss_rcpts_T.
The recipient structure in AQ uses these flags:
AQR_FL_IQDB | from IQDB |
AQR_FL_DEFEDB | from DEFEDB |
AQR_FL_SENT2AR | Sent to AR |
AQR_FL_RCVD4AR | Received from AR |
AQR_FL_RDY4DLVRY | Ready for delivery |
AQR_FL_SCHED | Scheduled for delivery, is going to be sent to DA |
AQR_FL_WAIT4UPD | Waiting for status update, must not be touched by scheduler |
AQR_FL_TO | Too long in AQ |
AQR_FL_TEMP | temporary failure |
AQR_FL_PERM | permanent failure |
AQR_FL_ARF | failure from SMAR |
AQR_FL_DAF | failure from DA |
AQR_FL_MEMAR | memory allocation for aqr_addrs failed, use fallback |
AQR_FL_ARINCOMPL | addr resolution incomplete |
AQR_FL_ARF_ADD | rcpt with SMAR failure added to delivery list |
AQR_FL_TO_ADD | rcpt with timeout added to delivery list |
AQR_FL_IS_BNC | this is a bounce |
AQR_FL_IS_DBNC | double bounce |
AQR_FL_DSN_PERM | perm error |
AQR_FL_DSN_TMT | timeout |
AQR_FL_DSN_GEN | bounce has been generated |
AQR_FL_CNT_UPD | rcpt counters have been updated, i.e., aq_upd_rcpt_cnts() has been called |
AQR_FL_STAT_UPD | rcpt status (aqr_status) has been updated individually |
Section 2.4.3.3 explains how transaction and recipient data flows through the various DBs in QMGR. This section tries to tie the various steps to functions in QMGR (which are explained in Section 4.7.5).
The main() function of the QMGR is very simple (Notice: in almost all example code error checking etc has been removed for simplicity).
ret = sm_qmgr_init0(qmgr_ctx);      /* basic initialization */
ret = sm_qmgr_rdcf(qmgr_ctx);       /* read configuration */
ret = sm_qmgr_init(qmgr_ctx);       /* initialization after configuration */
ret = sm_qmgr_start(qmgr_ctx);      /* start all components */
ret = sm_qmgr_loop(qmgr_ctx);       /* start event threads loop */
ret = sm_qmgr_stop(qmgr_ctx);       /* stop all components */
where all functions do what is obvious from their name.
The main loop sm_qmgr_loop() simply calls evthr_loop(qmgr_ctx->qmgr_ev_ctx).
sm_qmgr_init0() performs basic initialization, sm_qmgr_rdcf() reads the configuration (currently (2004-02-13) only command line parameters), and sm_qmgr_init() initializes various QMGR data structures.
sm_qmgr_start() starts various tasks:
ret = sm_qm_stli(qmgr_ctx);
ret = sm_qm_stcommit(qmgr_ctx, now);
ret = sm_qm_stsched(qmgr_ctx, now);
sm_qm_stli() starts two (event thread) tasks listening for connections from SMTPS and SMTPC using the functions sm_qm_smtpsli() and sm_qm_smtpcli(). sm_qm_stcommit() starts the periodic commit task and sm_qm_stsched() starts the scheduling task.
The two listener tasks sm_qm_smtpsli() and sm_qm_smtpcli() do basically the same: wait for a new connection from the respective service (SMTPS/SMTPC), ``register'' it in the QMGR context, and start one task sm_qmgr_smtpX(sm_evthr_task_P tsk) that takes care of the communication with the SMTPX process. Notes:
The communication tasks sm_qmgr_smtpX() dispatch a read function sm_smtpX2qmgr() or a write function sm_qmgr2smtpX() to deal with the communication request. Those functions use the read RCB qsX_rdrcb to read (sequentially) data from SMTPS/SMTPC and a list of write RCBs qsX_wrrcbl to write data back to those modules. Access to the latter is protected by a mutex, and RCBs are appended to the list by various functions. The communication tasks are activated via read/write availability, where the write availability is additionally triggered by functions that put something into the list of write RCBs (otherwise the task would be activated most of the time without actually having anything to do).
The read functions sm_smtpX2qmgr() receive an RCB qsX_ctx->qsX_rdrcb from the module and then call the function sm_qsX_react() to decode the RCB and act accordingly. Those functions may return different values to determine what should happen next with the task. If the value indicates an error, the task terminates (which might be overkill); other values are QMGR_R_WAITQ (translated to EVTHR_WAITQ), QMGR_R_ASYNC (translated to EVTHR_OK), and EVTHR_DEL, which causes the task to terminate; any other value is directly returned to the event threads library. QMGR_R_ASYNC means that the task has already been returned to the event thread system (waitq), see Section 3.18.5.1.3.
The write function sm_qmgr2mod() locks the mutex qsX_wrmutex and then checks whether the list qsX_wrrcbl of RCBs is empty. If it is, the task returns and turns off the WRITE request. Otherwise it sends the first element to the respective module using sm_rcb_snd(), removes that element, and turns off the WRITE request on return if the list is empty thereafter. Notice: it currently does not go through the entire list trying to write it all; this is done to prevent the thread from blocking, on the assumption that a single RCB can always be sent. This assumption might be wrong, in which case the thread blocks (and hopefully another runs); that could be prevented by requiring enough space in the communication buffer (which can be set via setsockopt() for sockets).
The commit task sm_qm_stcommit() is responsible for group commits. It checks the list of open transactions qmgr_ctx->qmgr_optas and, if it isn't empty, calls q_ibdb_commit(qmgr_ctx), which in turn commits the current INCEDB and then notifies all outstanding transactions of this fact. This is done by going through the list and adding an RCB with the commit information to the list of RCBs qss_wrrcbl for the task qss_ta->qssta_ssctx that handles the transaction qss_ta.
The scheduling function sm_qm_stsched() is supposed to implement the core of the QMGR.
A recipient goes through the following stages:
The function sm_qs2c_task(qsc_ctx_P qsc_ctx, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, sm_rcbe_P rcbe, sessta_id_P da_sess_id, sessta_id_P da_ta_id) creates one session with one transaction for SMTPC. The protocol is as follows:
RT_Q2C_ID | SMTPC identifier |
RT_Q2C_DCID | delivery class identifier |
RT_Q2C_ONESEID | Session id, only one transaction (hack) |
RT_Q2C_SRVIP4 | IPv4 address of server (hack) |
RT_Q2C_NTAID | New transaction id |
RT_Q2C_MAIL | Mail from address |
RT_Q2C_CDBID | CDB identifier |
RT_Q2C_RCPT_IDX | recipient index |
RT_Q2C_RCPT | recipient address |
Additional recipients can be added via sm_qs2c_add_rcpt(qsc_ctx_P qsc_ctx, aq_rcpt_P aq_rcpt, sm_rcbe_P rcbe), which just adds the recipient index and address to the RCB.
If the transaction denotes a bounce message, only one recipient can be sent, and instead of the record tag RT_Q2C_NTAID either RT_Q2C_NTAIDB (bounce) or RT_Q2C_NTAIDDB (double bounce) is used. Additionally an entire error text is sent using RT_Q2C_B_MSG (bounce message) as record tag. Currently this does not include the headers. It should be something like:
Hi! This is the sendmail X MTA. I'm sorry to inform you that a mail from you could not be delivered. See below for details.
listing recipient address, delivery host, and delivery message for each failed recipient.
The main QMGR context contains three arrays which store the lower and upper thresholds for various resources and the current usage. A single scalar contains the overall resource usage.
uint8_t qmgr_usage[QMGR_RFL_LAST_I + 1];
uint8_t qmgr_lower[QMGR_RFL_LAST_I + 1];
uint8_t qmgr_upper[QMGR_RFL_LAST_I + 1];

/* Overall value to indicate resource usage 0:free 100:overloaded */
unsigned int qmgr_total_usage;
To store the amount of free disk space, two data structures are used: one to store the amount of available disk space per partition (see also Section 3.4.10.13.1):
struct filesys_S
{
    dev_t            fs_dev;            /* unique device id */
    unsigned long    fs_kbfree;         /* KB free */
    unsigned long    fs_blksize;        /* block size, in bytes */
    time_T           fs_lastupdate;     /* last time fs_kbfree was updated */
    const char      *fs_path;           /* some path in the FS */
};
and one which contains an array of those individual structures:
struct fs_ctx_S
{
#if SM_USE_PTHREADS
    pthread_mutex_t  fsc_mutex;
#endif /* SM_USE_PTHREADS */
    int              fsc_cur_entries;   /* cur. number of entries in fsc_sys */
    int              fsc_max_entries;   /* max. number of entries in fsc_sys */
    filesys_P        fsc_sys;           /* array of filesys_T */
};
The function qm_comp_resource(qmgr_ctx_P qmgr_ctx, thr_lock_T locktype) computes a value that is a measure for the overall resource usage: qmgr_total_usage. Moreover, the function also invokes functions that return the amount of free disk space for a DB that is stored on disk: cdb_fs_getfree(), edb_fs_getfree(), and ibdb_fs_getfree(). Each of these functions receives a pointer to a variable of type fs_ctx_T and a pointer to an integer variable which will contain the amount of available disk space after a successful return. The functions themselves check the last update timestamp to avoid invoking system functions too often. Since each DB operation tries to keep track of the amount of disk space changes, this should return a reasonable estimate of the actual value.
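A hedged sketch of the per-partition update step described above; the update interval, the function name, and the use of statvfs(3) are assumptions, the real code may use a different system call and caching policy.

#include <sys/statvfs.h>

#define FS_UPDATE_INTERVAL  8   /* seconds; value is an assumption */

/* refresh fs_kbfree for one partition unless it was updated recently */
static int
fs_update(filesys_P fs, time_T now)
{
    struct statvfs st;

    if (now - fs->fs_lastupdate < FS_UPDATE_INTERVAL)
        return 0;               /* cached value is recent enough */
    if (statvfs(fs->fs_path, &st) != 0)
        return -1;
    fs->fs_blksize = st.f_frsize;
    fs->fs_kbfree = (unsigned long)
        (((double) st.f_bavail * (double) st.f_frsize) / 1024.0);
    fs->fs_lastupdate = now;
    return 0;
}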
The function q2s_throttle(qss_ctx_P qss_ctx, sm_evthr_task_P tsk, unsigned int nthreads) informs one SMTP server (referenced by qss_ctx) about the new maximum number of threads it should allow.
The generic function qs_control(qss_ctx_P qss_ctx, int direction, unsigned int use, unsigned int resource) checks the new usage of a resource and based on the input parameter direction decides whether to (un)throttle one SMTP server. qs_control() has the following behavior: throttle the system iff
else unthrottle the system iff
The specific function qs_unthrottle(qss_ctx_P qss_ctx) checks whether one SMTP server can be unthrottled based on the current resource usage. It is called by sm_smtps_wakeup() which is scheduled by sm_qss_wakeup(qmgr_ctx_P qmgr_ctx, thr_lock_T locktype) as a sleep() task. sm_qss_wakeup() in turn is invoked from qm_resource() when all resources are available (again).
The function qs_comp_control(qss_ctx_P qss_ctx, bool unthrottle) is invoked from sm_qss_react(). It will only check whether the address resolver (SMAR) is available and accordingly call qs_control().
The requirements for updating the recipient status after a delivery attempt has been made are described in Section 2.4.3.4. Section 3.4.16 describes the functionality, which distinguishes several reasons that require updating the status of a recipient:
Before examining these cases, a short note about updating the various queues: entries in IQDB are removed immediately if the recipient was in that queue (this can be done because the recipient is safely stored in DEFEDB or IBDB). To update DEFEDB and IBDB more complicated measures are taken: a request is queued that the status must be changed (this may also mean removal of an entry from the respective DB) while the function goes through all the recipients of the transaction. DEFEDB provides functions to do this: edb_ta_rm_req() and edb_rcpt_rm_req() which are described in Section 4.9.4. See Section 4.9.2 about the implementation of updating IBDB based on a list of change requests.
As explained in Section 3.4.16.1 it is necessary to preserve the order of updates for recipients and transactions when those changes are committed to DEFEDB.
To update the status for some (failed) recipients (case 2a) the function
qda_upd_ta_rcpt_stat(qmgr_ctx_P qmgr_ctx, sessta_id_T da_ta_id, sm_rcb_P rcb, unsigned int err_st)
is used; the RCB contains the recipient status from a DA. This function simply takes the data out of the RCB and updates the recipient status in the active queue. For this it invokes
aq_rcpt_status(aq_ctx, da_ta_id, idx, rcpt_status, err_st, errmsg),
which updates the field aqr_status_new; this field is later used by aq_upd_rcpt_cnts() (see 4.7.5.7.6), which requires the previous and the new status of a recipient to determine which recipient counters to change in the transaction context.
To update the status for an entire transaction (case 2b) from a DA the function
qda_update_ta_stat(qmgr_ctx_P qmgr_ctx, sessta_id_T da_ta_id, sm_ret_T status, unsigned int err_st, dadb_ctx_P dadb_ctx, dadb_entry_P dadb_entry, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, thr_lock_T locktype)
is called. This function walks through all recipients of a transaction and updates the various DBs and counters based on the individual recipient status (which may be different from the overall transaction status). See Section 2.4.3.4 for a high-level description.
The function qda_update_ta_stat() simply invokes
qda_upd_ta_stat(qmgr_ctx, da_ta_id, status, err_st, dadb_ctx, dadb_entry, aq_ta, aq_rcpt, &edb_req_hd, &ibdb_req_hd, locktype)
(see 4.7.5.7.4) and then writes the changes for DEFEDB and IBDB to disk (unless the result is an error).
The function
qda_upd_ta_stat(qmgr_ctx_P qmgr_ctx, sessta_id_T da_ta_id, sm_ret_T status, unsigned int err_st, dadb_ctx_P dadb_ctx, dadb_entry_P dadb_entry, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, edb_req_hd_P edb_req_hd, ibdb_req_hd_P ibdb_req_hd, thr_lock_T locktype)
can be used to update an entire transaction, i.e., all recipients of that transaction, or just an individual recipient. These two cases are distinguished by specifying exactly one of the two: either the DA transaction identifier da_ta_id (i.e., the id must be valid - the first character must not be '0') or the recipient aq_rcpt (i.e., it must not be NULL).
This function is also used in other places to update the status of a single recipient, e.g., for failures from the address resolver (called from the scheduler when it comes across such a recipient). For all recipients that need to be updated, qda_upd_ta_stat() invokes the function
q_upd_rcpt_stat(qmgr_ctx, ss_ta_id, status, err_st, aq_ta, aq_rcpt, edb_req_hd, ibdb_req_hd, &iqdb_rcpts_done, THR_NO_LOCK).
(see 4.7.5.7.5). Afterwards it checks whether iqdb_rcpts_done is greater than zero, in which case the function qda_upd_iqdb(qmgr_ctx, iqdb_rcpts_done, ss_ta_id, cdb_id, ibdb_req_hd) is invoked, see 4.7.5.7.10.
If there are no deliverable recipients in AQ anymore for the current transaction, or it is required to update the transaction, then the function performs the following steps: first it checks whether there are no recipients at all, i.e., aq_ta->aqt_rcpts_left is zero, which means that the transaction and the data file (CDB) must be removed. If that's not the case but the transaction needs to be updated in DEFEDB, then a request is appended to the DEFEDB request list, the flag AQ_TA_FL_DEFEDB is set, and the flags AQ_TA_FL_EDB_UPD_C and AQ_TA_FL_EDB_UPD_R are cleared. A transaction needs to be updated if at least one of the following conditions holds:
Note: When a RCPT is updated in DEFEDB then TA is in DEFEDB or the counters change (the counters do not change iff the RCPT status does not change; if the RCPT status does not change, then the only reason the RCPT is written to DEFEDB is because it was there earlier and hence TA was there too - that is a pre-requirement of the algorithm: a RCPT is only in DEFEDB iff its TA is there too). Hence we can simplify to .
Note: this is a side-effect of the current scheduler which keeps recipients in AQ until a delivery attempt is complete. If the scheduler changes to include pre-empting then the update logic must be modified to take care of that, i.e., the requirement - a RCPT is only in DEFEDB iff its TA is there too - does not necessarily hold anymore.
This can be expressed as:
Without the simplification it is:
If aq_ta->aqt_rcpts_left is zero and the transaction is in DEFEDB, then a remove request is appended to the request list.
If there are no more recipients in AQ for the TA (aq_ta->aqt_rcpts == 0), then the TA is removed from AQ.
If aq_ta->aqt_rcpts_left is zero and the CDB identifier is set (which must be the case), then the entry is removed from the CDB.
Finally, if the DA TA identifier is valid and the DA context is not NULL, then the session is closed (which can be done because the scheduler is currently a hack that only uses one transaction per session).
The function
q_upd_rcpt_stat(qmgr_ctx_P qmgr_ctx, sessta_id_T ss_ta_id, sm_ret_T status, unsigned int err_st, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, edb_req_hd_P edb_req_hd, ibdb_req_hd_P ibdb_req_hd, unsigned int *piqdb_rcpts_done, thr_lock_T locktype)
in turn updates the status for one recipient. If the recipient is in IQDB and it won't be retried, i.e.,
(rcpt_status == SM_SUCCESS || smtp_reply_type(rcpt_status) == SMTP_RTYPE_FAIL || !AQR_MORE_DESTS(aq_rcpt) || AQR_DEFER(aq_rcpt))
then it is immediately removed from IQDB. Next the recipient counters in the transaction are updated:
aq_upd_rcpt_cnts(aq_ta, aq_rcpt->aqr_status, rcpt_status)
(see 4.7.5.7.6).
Then one of two functions is called:
q_upd_rcpt_ok() if the recipient has been delivered or is a double bounce that can't be delivered (and hence will be dropped on the floor); see 4.7.5.7.7.
q_upd_rcpt_fail() for temporary or permanent errors; see 4.7.5.7.8.
Afterwards it is checked whether there can be no more retries for that recipient, in which case it is removed from AQ; otherwise the next destination host will be tried and the flags AQR_FL_SCHED, AQR_FL_WAIT4UPD, AQR_FL_STAT_NEW, and AQR_FL_ERRST_UPD are cleared.
The counters in the transaction are updated via
aq_upd_rcpt_cnts(aq_ta, oldstatus, newstatus)
This function sets the flag AQ_TA_FL_EDB_UPD_C if a counter has been changed.
Case 1 (from 4.7.5.7.5): q_upd_rcpt_ok() is responsible for removing a recipient from the queues in which it is stored.
q_upd_rcpt_ok(qmgr_ctx_P qmgr_ctx, sessta_id_T ss_ta_id, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, ibdb_rcpt_P ibdb_rcpt, rcpt_id_T rcpt_id, edb_req_hd_P edb_req_hd)
In case of a double bounce it decrements the number of recipients left and logs the problem (dropped a double bounce). Then it removes the recipient from IBDB if it is stored there (directly, without going via the request queue), or from DEFEDB by appending the remove operation to the request queue. Finally, if the recipient is a (double) bounce, the function qda_upd_dsn() is called to remove the recipients for which the DSN has been generated; see 4.7.5.7.9.
Case 2 (from 4.7.5.7.5): q_upd_rcpt_fail() examines rcpt_status to decide whether it is a temporary error or a permanent failure. In the former case the time in the queue is checked: if it exceeds a limit and there are no more destination hosts to try, or the recipient must be deferred (e.g., address resolver failure), then the flags AQR_FL_PERM and AQR_FL_DSN_TMT are set. In the latter case the flags AQR_FL_PERM and AQR_FL_DSN_PERM are set.
If the recipient can't be delivered and is not a double bounce itself then sm_q_bounce_add(qmgr_ctx, aq_ta, aq_rcpt, errmsg) is called to create a bounce message for this recipient; see 4.7.5.8.1.
If there are no more destinations to try, or the recipient must be deferred (because of an address resolver problem or because it has been in AQ too long), or a bounce message has been generated, then the number of tries is incremented, the next time to try is computed if necessary (i.e., the recipient has only a temporary failure, or it has a permanent failure but no bounce because generation of the bounce recipient failed), and a request to update the recipient status in DEFEDB is appended to the request list. If that was successful and the recipient is a (double) bounce, then qda_upd_dsn(qmgr_ctx, aq_ta, aq_rcpt, ss_ta_id, edb_req_hd) is called to remove the recipients for which this was a bounce (see 4.7.5.7.9).
If the recipient must be retried, i.e., it is not a permanent failure, then it is added to the EDB cache: edbc_add(qmgr_ctx->qmgr_edbc, rcpt_id, aq_rcpt->aqr_next_try, false)
If the recipient was in IQDB then a status update is appended to the request list for IBDB using the function ibdb_rcpt_app().
Finally q_upd_rcpt_fail() returns a flag value that indicates either an error or whether some actions (in this case: activate the address resolver) need to be performed by the caller.
qda_upd_dsn(qmgr_ctx_P qmgr_ctx, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, sessta_id_T ss_ta_id, edb_req_hd_P edb_req_hd)
is responsible for removing the recipients for which the DSN has been generated, which is done by going through its list (aq_rcpt->aqr_dsns[]) and appending remove requests to the DEFEDB change queue. It also updates the number of recipients left if necessary, i.e., if the DSN was for more than one recipient, and resets the used data structures.
qda_upd_iqdb(qmgr_ctx_P qmgr_ctx, unsigned int iqdb_rcpts_done, sessta_id_T ss_ta_id, cdb_id_P cdb_id, ibdb_req_hd_P ibdb_req_hd) updates the IQDB status for one transaction; if all recipients have been delivered then it removes the transaction from IQDB using iqdb_trans_rm(), adds a request to remove it from IBDB via ibdb_ta_app(), and removes it from the internal DB using qss_ta_free().
The current implementation of sendmail X does not support DSN per RFC 1894, but it creates non-delivery reports in a ``free'' format; see also 4.7.5.5.
If a bounce message is generated the function
sm_q_bounce_add(qmgr_ctx_P qmgr_ctx, aq_ta_P aq_ta, aq_rcpt_P aq_rcpt, sm_str_P errmsg)
is used. See Section 4.9.3 about the data structures that are relevant here (AQ transaction and recipient), and Section 4.7.3 about the flags (esp. those containing DSN or BNC in the name).
To generate a bounce, a new recipient is created (the ``bounce recipient'') using the function sm_q_bounce_new() unless the transaction already has a bounce recipient (that hasn't been scheduled yet). This recipient has an array aqr_dsns which contains the indices of the recipients for which this recipient contains the bounce message. Whether a transaction already has a (double) bounce recipient is recorded in the transaction (see 4.9.3): aqt_bounce_idx and aqt_dbl_bounce_idx. These can be reused to add more recipients to a bounce recipient (instead of sending one DSN per bounce).
The function
sm_q_bounce_new(qmgr_ctx_P qmgr_ctx, aq_ta_P aq_ta, bool dbl_bounce, aq_rcpt_P *aq_rcpt_bounce)
creates a new bounce recipient. It uses aq_ta->aqt_nxt_idx as the index for the bounce recipient (after checking it against the maximum value: currently the index is only 16 bits) and stores the value in aqt_bounce_idx or aqt_dbl_bounce_idx, respectively. aq_rcpt_add() is used to add a new recipient to AQ, then an array of size aq_ta->aqt_rcpts_tot is created to hold the indices of those recipients for which this will be a bounce. This array is in general too big, some optimization can be applied (later on). sm_q_bounce_new() then fills in the data for the recipient and sends it to the address resolver using qmgr_rcpt2ar(). It also increases the number of recipients for this transaction (aqt_rcpts_tot and aqt_rcpts). This may create an inconsistent state since the bounce recipient is only in AQ, not in a persistent DB (DEFEDB), see 4.7.5.8.3.
A bounce recipient is not written to a persistent DB when it is generated, but the failed recipients are written to DEFEDB. Only when a delivery attempt for a bounce message fails is the bounce recipient written to DEFEDB, and the recipients for which it is a bounce are then removed by qda_upd_dsn(), see 4.7.5.7.9. Hence if the system crashes before the bounce is delivered (or at least tried and then written to the queue), the bounce will be lost. However, the original recipients that caused the bounce are still in DEFEDB, and hence the bounce message can be reconstructed.
Alternatively the bounce recipient can be written to one of the persistent queues and the original recipients can be removed; this could reduce the on-disk storage. However, it requires that the RCB to store the bounce recipient be fairly large because it contains the complete error text for each failed recipient and a large list of recipient indices. Theoretically it might also be possible to store the error text in IBDB, but that most likely requires changes to the storage format, which does not make much sense because bounces should occur infrequently. Moreover, recovery of IBDB would become more complicated. Additionally, the failed recipient might not be in IBDB but in DEFEDB, making it even harder to correctly reconstruct the bounce data because it can be spread out over various places.
This is a place where optimizations are certainly possible, but it is currently not important enough (it is more important to implement the full sendmail 9.0 system instead of optimizing bounces).
The address resolver sends resolved addresses to QMGR which are used in turn by the scheduler for delivery attempts. The basic protocol returns a DA that must be used for delivery and a number of addresses to try.
A significantly more complicated part is alias expansion (see also Section 3.4.15.1).
The address resolver is called smar for sendmail address resolver since ar is already used. SMAR is based on the event threads library described in Section 4.3.8.
The description below is based on the implementation from 2003-06-26 and hence not up to date.
The AR uses a very simple mailertable which must be in a strict form: a domain part of an e-mail address, then one or more whitespace characters, followed by the IPv4 address (in square brackets) of the host to which the mail should be sent or the hostname itself. If an entry is not found in the mailertable or the RHS is a hostname, DNS lookups (for MX and A records) are performed (see Section 4.3.10 for the DNS library).
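To make the expected format concrete, here is a small hypothetical example (the domains and addresses are made up); each line contains a domain, whitespace, and either a bracketed IPv4 address or a hostname:

example.org      [192.0.2.25]
example.com      relay.example.com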
The main context for the AR currently has the following elements:
struct smar_ctx_S
{
    sm_magic_T       sm_magic;
    pthread_mutex_t  smar_mutex;
    int              smar_status;         /* see below, SMAR_ST_* */
    sm_evthr_ctx_P   smar_ev_ctx;         /* event thread context */
    int              smar_qfd;            /* fd for communication with QMGR */
    sm_evthr_task_P  smar_qmgr_tsk;       /* QMGR communication task */
    rcbcom_ctx_T     smar_qmgr_com;       /* QMGR communication context */
    sm_log_ctx_P     smar_lctx;
    ipv4_T           smar_nameserveripv4;
    unsigned int     smar_dns_flags;
    dns_tsk_P        smar_dns_tsk;
    dns_mgr_ctx_P    smar_dns_mgr_ctx;
    bht_P            smar_mt;             /* "mailertable" (XXX HACK) */
};
To store addresses sent by the QMGR to the AR for resolving, the following structure is used:
struct smar_rcpt_S
{
    sm_str_P      arr_rcpt_pa;    /* printable addr */
    rcpt_id_T     arr_rcpt_id;    /* rcpt id */
    sm_str_P      arr_domain_pa;
    unsigned int  arr_flags;      /* status of address resolving */
    unsigned int  arr_da;         /* DA */
    int           arr_n_mx;       /* total number of MX records */
    int           arr_c_mx;       /* current number of MX records */
    int           arr_n_a;        /* total number of A records */
    smar_dns_T   *arr_res;        /* array of results */
    ipv4_T        arr_ip4;        /* single A record */
    sm_rcbe_P     arr_rcbe;       /* RCB to write back result */
    smar_ctx_P    arr_smar_ctx;   /* pointer back to SMAR context */
};
arr_n_mx stores the total number of MX records after the initial query for an MX record comes back. arr_c_mx keeps track of the current number of answers for those MX records, i.e., when both variables have the same value then all outstanding requests have been answered and the complete result can be returned to the QMGR.
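A minimal sketch of that completeness check (not the actual smar_rcpt_cb() logic; the helper name and the reduced structure are illustrative only):

struct smar_rcpt_sketch
{
    int arr_n_mx;   /* total number of MX records */
    int arr_c_mx;   /* MX records answered so far */
};

/* all outstanding MX sub-queries answered? then the result can go back to QMGR */
static int
smar_rcpt_complete(const struct smar_rcpt_sketch *rcpt)
{
    return rcpt->arr_n_mx > 0 && rcpt->arr_c_mx == rcpt->arr_n_mx;
}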
The array of results has the following type:
struct smar_dns_S
{
    unsigned int    ardns_ttl;   /* TTL from DNS */
    unsigned short  ardns_pref;  /* preference from DNS */
    sm_cstr_P       ardns_name;  /* name from DNS */
    int             ardns_n_a;   /* number of A records */
    ipv4_T         *ardns_a;     /* pointer to list of A records */
};
The address resolver receives recipient addresses from QMGR, creates a recipient structure using
smar_rcpt_new(smar_ctx, &smar_rcpt)
fills in the necessary data, and invokes
smar_rcpt_rslv(smar_ctx, smar_rcpt)
which first checks the ``mailertable'' and if it can't find a matching entry, then it will invoke the DNS resolver (see Section 4.3.10):
dns_req_add(dns_mgr_ctx, q, T_MX, smar_rcpt_cb, smar_rcpt)
The callback function
smar_rcpt_cb(dns_res_P dns_res, void *ctx)
locks smar_ctx and analyzes dns_res as follows.
If the result is DNSR_NOTFOUND and the lookup was for an MX record then it will simply try to find A records. If it is another error then a general error handling section will be invoked.
First it is checked whether the query was for an A record, in which case arr_c_mx is incremented. Then the error type (temporary or permanent) is checked and, in the former case, a flag in smar_rcpt is set. An error is returned to the caller iff the query was for an MX record, or the query was for an A record and all records have been received and there was a temporary error. If that is not the case, but all open requests have been answered, then the results are returned to the caller using smar_rcpt_re_all(smar_rcpt) and thereafter the rcpt structure is freed using smar_rcpt_free(smar_rcpt).
If the result was not an error then two cases have to be taken care of:
Expanding aliases makes the address resolver significantly more complex. Unfortunately the current implementation does not allow for a simple layer around the current recipient resolver. This is due to the asynchronous nature of the DNS library which requires the use of callback functions. As explained in the previous section, the callback function checks whether all results arrived in which case it will put the data into an RCB and send it back to QMGR.
Question: how to implement owner-alias and alias-request? Problem: bounces go to owner-alias (see also Section 2.6.7). Does this mean a transaction should be ``cloned'' or should the transaction context be extended? What if mail is sent to two lists and two "normal" rcpts?
mail from:<sender>
rcpt to:<list1>
rcpt to:<list2>
rcpt to:<rcpt1>
rcpt to:<rcpt2>
Usually a bounce goes to the sender. However, if mail to listX fails, the function that creates bounces needs to handle this differently. Here's a list of possible ways to handle owner-aliases and the problems with them.
Problems:
Problems:
Problems:
It might be possible to add another counter: aqt_clones, which counts the number of cloned transactions. A cloned transaction contains a link to the original transaction (just the TA id). If the number of recipients aqt_rcpts_left is decreased to zero, then it is checked whether the TA is cloned, in which case the clone counter in the original TA is decreased. A CDB entry is removed if the two counters aqt_rcpts_left and aqt_clones in an original TA are both zero.
Overall, proposal 3 seems like the ``cleanest'' solution even though it currently (2004-06-30) has the biggest impact on the implementation, i.e., it requires a lot of changes.
IQDB is currently implemented as typed RSC, see Section 4.3.5.1. IQDB simply contains references to qss_sess_T, qss_ta_T, and qss_rcpts_T, see Section 4.7.3, it does not have its own data structures, it merely allows for a way to access the data via a key (SMTPS session, transaction, recipient identifier).
The API of IBDB is described in Section 3.11.4.3.
The current implementation requires more functions than described there. This is an outcome of the requirement to update the various queues safely and in a transaction-based manner. To achieve this, a list of change requests can be maintained; the elements of this list have the following structure:
struct ibdb_req_S
{
    unsigned int  ibdb_req_type;
    int           ibdb_req_status;
    sessta_id_T   ibdb_req_ss_ta_id;  /* SMTPS transaction id */
    sm_str_P      ibdb_req_addr_pa;   /* MAIL/RCPT address */
    cdb_id_P      ibdb_req_cdb_id;
    unsigned int  ibdb_req_nrcpts;
    rcpt_idx_T    ibdb_req_rcpt_idx;  /* RCPT index */
    SIMPLEQ_ENTRY(ibdb_req_S) ibdb_req_link;
};
The functions to deal with the request lists are:
sm_ret_T ibdb_rcpt_app(ibdb_ctx_P ibdb_ctx, ibdb_rcpt_P ibdb_rcpt, ibdb_req_hd_P ibdb_req_hd, int status) appends a recipient change request.
sm_ret_T ibdb_ta_app(ibdb_ctx_P ibdb_ctx, ibdb_ta_P ibdb_ta, ibdb_req_hd_P ibdb_req_hd, int status) appends a transaction change request.
sm_ret_T ibdb_req_cancel(ibdb_ctx_P ibdb_ctx, ibdb_req_hd_P ibdb_req_hd) cancels all requests in the list.
sm_ret_T ibdb_wr_status(ibdb_ctx_P ibdb_ctx, ibdb_req_hd_P ibdb_req_hd) performs the status updates as specified by the request list; a usage sketch follows below.
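The sketch below shows how these functions are intended to fit together: change requests are accumulated per recipient and per transaction, then committed (or cancelled) in one step. Only the four functions above are taken from the text; the header name, the SIMPLEQ-based initializer, and the SM_SUCCESS comparison are assumptions.

#include "sm/ibdb.h"   /* assumed header providing the IBDB types and prototypes */

sm_ret_T
ibdb_commit_sketch(ibdb_ctx_P ibdb_ctx, ibdb_ta_P ibdb_ta,
                   ibdb_rcpt_P *rcpts, int nrcpts, int status)
{
    ibdb_req_hd_T req_hd;   /* head of the change request list */
    sm_ret_T ret;
    int i;

    SIMPLEQ_INIT(&req_hd);  /* assumption: the request head is a SIMPLEQ head */

    /* queue one change request per recipient of the transaction */
    for (i = 0; i < nrcpts; i++)
    {
        ret = ibdb_rcpt_app(ibdb_ctx, rcpts[i], &req_hd, status);
        if (ret != SM_SUCCESS)
            goto fail;
    }

    /* queue the change request for the transaction itself */
    ret = ibdb_ta_app(ibdb_ctx, ibdb_ta, &req_hd, status);
    if (ret != SM_SUCCESS)
        goto fail;

    /* commit all queued requests in one step */
    return ibdb_wr_status(ibdb_ctx, &req_hd);

  fail:
    (void) ibdb_req_cancel(ibdb_ctx, &req_hd);
    return ret;
}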
The active queue currently uses the following structure as context:
struct aq_ctx_S
{
    pthread_mutex_t  aq_mutex;    /* only one mutex for now */
    unsigned int     aq_limit;    /* maximum number of entries */
    unsigned int     aq_entries;  /* current number of entries */
    unsigned int     aq_t_da;     /* entries being delivered */
    aq_tas_T         aq_tas;
    aq_rcpts_T       aq_rcpts;
};
For now we just use lists of aq_ta/aq_dta/aq_rcpt structures. Of course we need better access methods later on. Currently FIFO is the only scheduling strategy.
A recipient context consists of the following elements:
struct aq_rcpt_S
{
    sessta_id_T    aqr_ss_ta_id;   /* ta id in SMTPS */
    sessta_id_T    aqr_da_ta_id;   /* ta id in DA */
    sm_str_P       aqr_rcpt_pa;    /* printable addr */
    smtp_status_T  aqr_status;     /* status */
    smtp_status_T  aqr_status_new; /* new status */
    unsigned int   aqr_err_st;     /* state which caused error */
    unsigned int   aqr_flags;      /* flags */
    rcpt_idx_T     aqr_rcpt_idx;   /* rcpt idx */
    unsigned int   aqr_tries;      /* # of delivery attempts */
    unsigned int   aqr_da_idx;     /* DA idx (kind of DA) */
    /* SMTPC id (actually selected DA) */
    int            aqr_qsc_id;

    /*
    **  HACK! Need list of addresses. Do we only need IP addresses
    **  or do we need more (MX records, TTLs)? We need at least some
    **  kind of ordering, i.e., the priority. This is needed for the
    **  scheduler (if a domain has several MX records with the same
    **  priority, we can deliver to any of those, there's no order
    **  between them). Moreover, if we store this data in DEFEDB,
    **  we also need TTLs.
    */

    /*
    **  Number of entries in address array.
    **  Should this be "int" instead and denote the maximum index,
    **  where -1 means: no entries?
    **  Currently the check for "is there another entry" is
    **  (aqr_addr_cur < aqr_addr_max - 1)
    **  i.e., valid entries are 0 to aqr_addr_max - 1.
    */
    unsigned int   aqr_addr_max;
    unsigned int   aqr_addr_cur;   /* cur idx in address array */
    aq_raddr_T    *aqr_addrs;      /* array of addresses */

    /* XXX Hack */
    ipv4_T         aqr_addr_fail;  /* failed address */

    /* address storage to use if memory allocation failed */
    aq_raddr_T     aqr_addr_mf;

    time_T         aqr_entered;    /* entered into AQ */
    time_T         aqr_st_time;    /* start time (rcvd) */
    time_T         aqr_last_try;   /* last time scheduled */
    /* next time to try (after it has been stored in DEFEDB) */
    time_T         aqr_next_try;

    /* Error message if delivery failed */
    sm_str_P       aqr_msg;

    /*
    **  Bounce recipient: stores list of recipient indices for which
    **  this is a bounce message.
    **  Note: if this is stored in DEFEDB, then the array doesn't need
    **  to be saved provided that the recipients are removed in the
    **  same (DB) transaction because the bounce recipient contains
    **  all necessary data for the DSN. If, however, the recipients
    **  are not removed "simultaneously", then it is a bit harder to
    **  get consistency because it isn't obvious for which recipients
    **  this bounce has been created. That data is only indirectly
    **  available through aqr_bounce_idx (see below).
    */
    sm_str_P       aqr_dsn_msg;
    unsigned int   aqr_dsn_rcpts;
    unsigned int   aqr_dsn_rcpts_max;
    rcpt_idx_T    *aqr_dsns;       /* array of rcpt indices */

    /*
    **  rcpt idx for bounce: stores the rcpt_idx (> 0) if a bounce
    **  for this recipient is generated and being delivered.
    **  This is used as "semaphore" to avoid multiple bounces for
    **  the same recipient (needs to be stored in DEFEDB).
    */
    rcpt_idx_T     aqr_bounce_idx;

    /* linked list for AQ, currently this is the way to access all rcpts */
    TAILQ_ENTRY(aq_rcpt_S) aqr_db_link;  /* links */

    /*
    **  Linked lists for:
    **  - SMTPS transaction:
    **    to find all recipients for the original transaction
    **    (to find out whether they can be delivered in the same
    **    transaction, i.e., same DA, + MX piggybacking)
    **  - DA transaction:
    **    to find the recipients that belong to one delivery attempt
    **    and update their status
    **  Link to ta:
    **    to update the recipient counter(s).
    */
    sm_ring_T      aqr_ss_link;
    sm_ring_T      aqr_da_link;
    aq_ta_P        aqr_ss_ta;      /* transaction */
};
This structure contains linked lists for:
A reference to the transaction aqr_ss_ta is used to make it easier to update the recipient counter(s).
A transaction context has these elements:
struct aq_ta_S
{
    /* XXX other times? */
    time_T        aqt_st_time;     /* start time (received) */
    aq_mail_P     aqt_mail;        /* mail from */

    /* XXX only in aq_da_ta */
    unsigned int  aqt_rcpts;       /* number of recipients in AQ */
    unsigned int  aqt_rcpts_ar;    /* rcpts to receive from AR */
    unsigned int  aqt_rcpts_arf;   /* #of entries with SMAR failure */

    /* Number of recipients in DEFEDB */
    unsigned int  aqt_rcpts_tot;   /* total number of recipients */
    unsigned int  aqt_rcpts_left;  /* rcpts still to deliver */
    unsigned int  aqt_rcpts_temp;  /* rcpts temp failed */
    unsigned int  aqt_rcpts_perm;  /* rcpts perm failed */
    unsigned int  aqt_rcpts_tried; /* rcpts already tried */

    rcpt_idx_T    aqt_nxt_idx;     /* next recipient index */
    unsigned int  aqt_state;
    unsigned int  aqt_flags;

    /*
    **  rcpt idx for (double) bounce; when a bounce is needed a recipient
    **  struct is created, its rcpt_idx is this bounce_idx.
    **  It should be aqt_rcpts_tot (+1) when it is created; afterwards
    **  aqt_rcpts_tot is increased of course.
    */
    rcpt_idx_T    aqt_bounce_idx;
    rcpt_idx_T    aqt_dbl_bounce_idx;

    sessta_id_T   aqt_ss_ta_id;    /* ta id in SMTPS */
    cdb_id_P      aqt_cdb_id;
    TAILQ_ENTRY(aq_ta_S) aqt_ta_l; /* links */
    /* XXX add list of recipients? that makes lookups easier... see above */
};
The field aqt_ta_l links all transactions in the active queue together.
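A minimal sketch of the FIFO strategy mentioned above (assuming aq_tas_T is a TAILQ_HEAD over struct aq_ta_S linked via aqt_ta_l; the header name and the ``schedulable'' predicate are placeholders):

#include <sys/queue.h>
#include "sm/aq.h"   /* assumed header defining aq_ctx_P, aq_ta_P, and the structures above */

/* return the oldest transaction that still has recipients in AQ, or NULL */
static aq_ta_P
aq_next_ta_sketch(aq_ctx_P aq_ctx)
{
    aq_ta_P aq_ta;

    TAILQ_FOREACH(aq_ta, &aq_ctx->aq_tas, aqt_ta_l)
    {
        if (aq_ta->aqt_rcpts > 0)   /* placeholder predicate */
            return aq_ta;
    }
    return NULL;
}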
As can be seen, there are many counters in the transaction context:
The number of recipients that encountered a temporary failure aqt_rcpts_temp does not reflect those recipients that are in DEFEDB. When a recipient is written to DEFEDB, the status flag indicating a temporary failure will not be saved. Hence when a recipient is tried again, the previous temporary failure is mostly ignored except for bookkeeping.
The current implementation (2003-01-01) of the main queue uses Berkeley DB 4.1 (proposal 2 from Section 3.11.6.1).
The main context looks like this:
typedef SIMPLEQ_HEAD(, edb_req_S) edb_req_hd_T, *edb_req_hd_P;

struct edb_ctx_S
{
    pthread_mutex_t  edb_mutex;     /* only one mutex for now */
    unsigned int     edb_entries;   /* current number of entries */
    edb_req_hd_T     edb_reql_wr;   /* request list (wr) */
    edb_req_hd_T     edb_reql_pool;
    DB_ENV          *edb_bdbenv;
    DB              *edb_bdb;       /* Berkeley DB */
};
A change request has the following structure:
struct edb_req_S
{
    unsigned int  edb_req_type;
    smtp_id_T     edb_req_id;
    sm_rcb_P      edb_rcb;
    SIMPLEQ_ENTRY(edb_req_S) edb_req_link;
};
The context maintains two lists of requests: a pool of free request list entries (edb_reql_pool) for reuse, and a list of change requests (edb_reql_wr) that are committed to disk when edb_wr_status(edb_ctx_P edb_ctx) is called.
A request itself contains a type (TA or RCPT), an identifier (which is used as key), and an RCB that stores the appropriate context in encoded form as defined by the RCB format. The type can actually be more than just transaction or recipient; it can also denote that the entry matching the identifier should be removed from the DB.
The data structures for transactions (mail sender) and recipients are shared with the active queue, see Section 4.9.3.
See Section 3.11.6 for the DEFEDB API, here's the current implementation:
Two functions are available to open and close a DEFEDB:
The functions which take a parameter edb_req_hd_P edb_req_hd will use that argument as the head of the request list unless it is NULL, in which case the write request list edb_reql_wr of the context will be used.
To retrieve an entry from the DEFEDB one function is provided:
To decode the RCB retrieved via edb_rd_req() and fill out an active queue context of the correct type the following two functions are available:
To read through a DEFEDB these four functions are provided:
To remove a transaction or a recipient from the DB (directly) use:
To add a request to remove a transaction or a recipient from the DB use:
and commit that request later on with edb_wr_status().
Internal functions to manage a request entry or list are:
This section describes the implementation of various programs.
How to efficiently perform IBDB cleanup?
Try to minimize the amount of data to clean up. This can be done by performing rollovers at an appropriate moment, i.e., when the number of outstanding transactions and recipients is zero. This is probably only possible for low-volume sites. If those two values are zero, then all preceding files can be removed.
Read an IBDB file and create a new one that has only the open transactions/recipients in there? Leave ``holes'' in the sequence, e.g., use 0x1-0xf and leave 0 free for ``cleaning'', i.e., read 0x1-0xf and then write all the open transactions into 0. Problem: what to do with repeated passes?
How about different names (extensions) instead?
It might be possible to ignore logfiles that are older than the transactional timeout. Those logfiles can only contain data about transactions that have either been completed or have timed out. Neither of these is of interest for the reconstruction of the queue. Does this mean a very simple cleanup process is possible which simply removes old logfiles? This minimizes the amount of effort during runtime at the expense of diskspace and startup time after an unclean shutdown. For the first sendmail X version this might be a ``good enough'' solution.
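A minimal sketch of such a cleanup pass (the directory layout, file name filtering, and the timeout value are assumptions; error handling is mostly omitted for brevity):

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* remove IBDB logfiles whose last modification is older than the TA timeout */
static void
ibdb_cleanup_sketch(const char *dir, time_t ta_timeout)
{
    DIR *dp;
    struct dirent *de;
    struct stat st;
    char path[1024];
    time_t now = time(NULL);

    if ((dp = opendir(dir)) == NULL)
        return;
    while ((de = readdir(dp)) != NULL)
    {
        if (de->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (stat(path, &st) == 0 && now - st.st_mtime > ta_timeout)
            (void) unlink(path);  /* can only describe completed or timed-out TAs */
    }
    closedir(dp);
}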
As requested in Section 1.1.3.5.1 there are many test programs which can as usual be invoked by make check.
Some of them are included in the directories for which they perform tests, e.g., smar and libdns, but most of them are in the directories checks and chkmts. There are two ``check'' directories because of some length restrictions on AIX (2048 bytes line length for some utilities used by the autoconf tools).
The test programs range from very simple: testing a few library functions, e.g., string related, to those which test the entire MTS.
Remark (placed here so it doesn't get lost): there is a restricted number (about 60000) of possible open connections to one port. Could that limit the throughput we are trying to achieve, or is such a high number of connections unfeasible?
For simple performance comparisons several SMTP sinks have been implemented or tested.
Test programs are:
Test machines are:
Entries in the tables below denote execution time in seconds unless otherwise noted; hence smaller values are better.
Tests have been performed with myslam (a multi-threaded SMTP client), using 7 to 8 client machines, 50 threads per client, and 5000 messages per client.
parameters | smtp-sink | smtps | thrperconn | thrpool |
1KB/msg (40MB) | 45s | 70s | 92s | 43s |
4KB/msg (160MB) | 49s | 56s | 259s | 78s |
32KB/msg (1280MB) | 203s | 208s | 999s | 110s |
-w 1 | 141s | 109s | 156s | 230s |
Note: v-sun is a four processor machine, hence the multi-threaded programs (thrpool, thrperconn) can use multiple processors. I didn't select (via an option) multiple processors for smtps though.
Just as one example, the achieved throughput in MB/s is listed in the next table. As can be seen, it is an order of magnitude lower than the sustainable throughput that can be achieved over a single connection (about 85-90MB/s measured with ttcp; this is a 100Mbit/s ethernet).
parameters | smtp-sink | smtps | thrperconn | thrpool |
1KB/msg (40MB) | 0.9 | 0.6 | 0.4 | 0.9 |
4KB/msg (160MB) | 3.3 | 2.9 | 0.6 | 2.1 |
32KB/msg (1280MB) | 6.5 | 6.3 | - | 11.9 |
parameters | smtp-sink | smtps | thrperconn | thrpool |
1KB msg size | 97 | 87 | 380 | 140 |
4KB msg size | 108 | 130 | 1150 | 156 |
32KB msg size | 208 | 197 | fails | 330 |
-w 1 | 165 | 138 | 484 | 223 |
parameters | smtp-sink | smtps | thrperconn | thrpool |
1KB msg size | 38 | 28 | - | 31 |
4KB msg size | 34 | 33 | - | 31 |
32KB msg size | 125 | 125 | - | 125 |
-w 1 | 125 | 125 | - | 155 |
125 for 250/3 |
parameters | smtp-sink | smtps | thrperconn | thrpool |
1KB msg size | 45 | 44 | 165 | 74 |
4KB msg size | 54 | 45 | 418 | 75 |
32KB msg size | 217 | 167 | fails | 256 |
-w 1 | 370 | 360 | - | 337 |
2004-03-02
statethreads/examples/smtps3
See Section 5.2.1.1, machine 1
wiz$ time ./smtpc2 -fa@b.c -Rx@y.z -t 100 -s 1000 -r localhost
sink program | FS | times (s) |
smtps3 | - | 5 |
smtpss | UFS | 17, 18 |
smtps3 -C | UFS | 16, 17, 19 |
source: s-6.perf-lab
sink: v-bsd.perf-lab
with -C
s-6.perf-lab$ time ./smtpc2 -t 100 -s 1000 -r v-bsd.perf-lab
19.17s real     1.08s user     0.64s system
without -C
s-6.perf-lab$ time ./smtpc2 -t 100 -s 1000 -r v-bsd.perf-lab
3.04s real      0.81s user     0.59s system
source: s-6.perf-lab
sink: mon.perf-lab (FreeBSD 4.9)
with -C
12.05s real 1.04s user 0.67s system
without -C
3.03s real 0.92s user 0.54s system
2004-03-04 source: s-6.perf-lab; sink: v-sun.perf-lab
with -C: 20s - 24s (UFS) Note: It takes 20s(!) to remove all CDB files:
time rm ?/S*: 0m20.11s
with -C: 1s (TMPFS); 16s (UFS, /), rm: 14s; logging turned on: 16s, rm: 0.8s.
without -C: 1s
2004-03-08 source: s-6.perf-lab; sink: v-bsd;
./smtpc -t 100 -s 1000
sink program | time (s) |
smtpss | 30 |
smtps3 -C | 30 |
smtps3 | 3 |
2004-03-08 source: s-6.perf-lab; sink: v-sun;
./smtpc -t 100 -s 1000
sink program | FS | times (s) |
smtps3 | - | 1 |
smtpss | UFS | 25, 30 |
smtps3 -C | UFS | 23 |
smtpss | swap | 2, 3 |
smtps3 -C | swap | 1, 2 |
Note: the variance for smtpss on UFS is fairly large. The lower numbers are achieved by running smtps3 -C first and then smtpss; the larger numbers are measured when the CDB files have just been removed. However, this effect was not reproducible. Note: removing those files takes about as long as a test run.
Test setup with a sendmail X prototype of 2002-09-04: v-aix.perf-lab running QMGR, SMTPS, and SMTPC. Relaying from localhost to v-bsd.perf-lab. Source program running on v-aix:
time ./smtp-source -s 50 -m 100 -c localhost:8000
Using the full version: 2.45s; turning fsync() off: 1.44s.
This clearly shows the need for a better CDB implementation, at least on AIX.
Same test with reversed roles (sm9 on v-bsd, sink on v-aix): using the full version: 7.44s; turning fsync() off: 6.20s. For comparison: using sendmail 8.12: 14.71s.
The SCSI disks on v-bsd seem to be fairly slow. Moreover, there seems to be something wrong with the OS version (it's very old: FreeBSD 3.4).
On FreeBSD 4.6 (machine 14, see Section 5.2.1.1) (source, sink, sm-9 of 2002-10-01 on the same machine):
time ./smtp-source -s 100 -m 200 -c localhost:8000
softupdates: 4.35s; without softupdates: 5.66s
time ./smtp-source -s 50 -m 100 -c localhost:8000
softupdates: 2.01s/1.93s, -U: 1.79s; without softupdates: 2.60s/2.46s, -U: 2.17s
(-U turns off fsync()).
Using sendmail 8.12.6:
time ./smtp-source -s 50 -m 100 localhost:1234
softupdates: 5.01s. This looks quite good for sendmail 8, but the result for:
time ./smtp-source -c -s 100 -m 200 localhost:1234
is: 143.12s, which certainly is not anywhere near good. This is related to the high load generated by this: up to 200 concurrent sendmail processes just kill the machine. sendmail X has only up to 4 processes running.
Test date: 2003-05-25, version: sm9.0.0.6, machine: PC, AMD Duron 700MHz, 512MB RAM, SuSE 8.1
Test program:
time ./smtp-source -s 50 -m 500 -fa@b.c -tx@y.z localhost:1234
FS | Times | msg/s (best) |
JFS | 4.02s, 4.23s | 124 |
ReiserFS | 4.8s | 104 |
XFS | 6.7s, 7.2s, 7.48s, 7.64s | 74 |
EXT3 | 14.39s, 13.44s | 34 |
2004-03-17 checks/t-readwrite on destiny (Linux, IDE, ext2):
parameters | writes | time |
-s -f 1000 -p 1 | - | 9 |
-s -f 100 -p 10 | - | 6 |
The FS is mounted async (default!).
2004-03-17 checks/t-readwrite on ia64-2 (Linux, SCSI, reiserfs):
parameters | writes | time |
-s -f 1000 -p 1 | - | 5.2 |
-s -f 100 -p 10 | - | 2.6 |
2004-03-23 source: basil.ps-lab MTA: cilantro.ps-lab (Linux 2.4.18-64GB-SMP) sink: v-sun.perf-lab
FS: ReiserFS version 3.6.25
smtpc -t 100 -s 1000
program | source time | sink time |
smtps3 -C | - | |
sm9.0.0.12 | 6 | 5 |
sm8.12.11 | 74 | 74 |
sm8.12.11 See 1 | 50 | |
postfix 2.0.18 |
gatling -m 100 -c 5000 -z 1 -Z 1
program | writes | source time | source msgs/s | sink time |
smtps3 | 2 | 2295 | - | |
smtps3 -C | 5 | 962 | - | |
sm9.0.0.12 | 22 | 225 | 22 | |
sm8.12.11 | 358 | 14 | 358 | |
sm8.12.11 See 1 | 246 | 20 | - | |
postfix 2.0.18 |
Notes:
2004-03-25:
Filesystems:
smtpc -t 100 -s 1000
program | FS | source time | sink time |
sm9.0.0.12 | 1 | 63 | 61 |
1 | 63 | 63 | |
2 | 19 | 18 | |
3 | 5 | 4 | |
3 | 5 | 5 | |
5 | 81 | 80 | |
sm8.12.11 | 3 | 45 | several read errors |
5 | 91 | 92 | |
smtps3 -C |
2004-03-25: gatling -m 100 -c 5000 -z 1 -Z 1 (1KB message size)
program | FS | source time | sink time | msgs/s |
sm9.0.0.12 | 1 | |||
2 | 90 | 90 | 55 | |
3 | 24 | 24 | 208 | |
4 | 100 | 99 | 100 | |
sm8.12.11 | 3 | 216 | errors | 23 |
gatling -m 100 -c 5000 -z 4 -Z 4 (4KB message size)
program | FS | source time | sink time | msgs/s |
sm9.0.0.12 | 1 | |||
2 | 92 | 92 | 54 | |
3 | 141 | 140 | 35 | |
4 | 168 | 168 | 29 | |
sm8.12.11 | 3 | 226 | errors | 22 |
gatling -m 100 -c 5000 -z 16 -Z 16 (16KB message size)
program | FS | source time | sink time | msgs/s |
sm9.0.0.12 | 1 | |||
2 | ||||
3 | 169 | 29 | ||
4 | ||||
sm8.12.11 | 3 | 226 | errors | 22 |
Notes:
2003-11-19 sm-9.0.0.9 running on v-bsd.perf-lab (2 processors, FreeBSD 3.4)
Source on bsd.dev-lab
time ./smtp-source -d -s 100 -m 500
directly to sink: 2.16 - 2.74s (231msgs/s)
using MFS: 14.37 - 14.43s (34msgs/s) (sm8.12.10: 32s)
using FS with softupdates: 22.78 - 23.83s (21msgs/s) (sm8.12.10: 49s)
using FS without softupdates: 35.27 - 35.56s (14msgs/s)
2004-03-02 source: s-6.perf-lab; relay: mon; sink: v-bsd
time ./smtpc2 -O 10 -fa@s-6.perf-lab -Rnobody@v-bsd.perf-lab -t 100 -s 1000 -r mon.perf-lab:1234
38.26s real 1.01s user 0.88s system
2004-03-04 source: s-6.perf-lab; relay: v-bsd; sink: v-sun
options: -t 100 -s 1000
MTA | source time(s) | sink time |
postfix 2.0.18 | 53 | 94 |
sm9.0.0.12 | 69 | 68 |
without smtpc | 56 | - |
sm8.12.11 | 67 | 67 |
-odq | 79, 82 | |
-odq / 100 qd | 101 | |
-odq / 10 qd | 100 |
Note: this is FreeBSD 3.4 without softupdates and directory hashes.
getrusage(2) data:
sm8.12.11 -odq
ru_utime= 15.0158488 ru_stime= 71.0104605 ru_maxrss= 1524 ru_ixrss= 5030592 ru_idrss= 4098456 ru_isrss= 1412096 ru_minflt= 127503 ru_majflt= 0 ru_nswap= 0 ru_inblock= 0 ru_oublock= 11851 ru_msgsnd= 13000 ru_msgrcv= 10000 ru_nsignals= 0 ru_nvcsw= 617469 ru_nivcsw= 18793
sm8.12.11
ru_utime= 15.0236311 ru_stime= 62.0117941 ru_maxrss= 1520 ru_ixrss= 4573224 ru_idrss= 3676784 ru_isrss= 1283712 ru_minflt= 174619 ru_majflt= 0 ru_nswap= 0 ru_inblock= 0 ru_oublock= 4001 ru_msgsnd= 12000 ru_msgrcv= 10000 ru_nsignals= 1000 ru_nvcsw= 128074 ru_nivcsw= 14771
This looks like a problem in queue-only mode: there's way too much data written, almost 3 times the amount of background delivery mode. Why does sm8 send 1000 more messages in queue-only mode?
2004-03-05 source, relay, sink: wiz (FreeBSD 4.8)
options: -t 100 -s 1000
source: 34s, sink: 32s
turn off smtpc: source: 31s, 34s
2004-03-26 source: v-6.perf-lab running smtpc -t 100 -s 5000; relay: v-bsd.perf-lab; sink: v-sun.perf-lab
sink runs smtps2 -R n with varying values for n
n | source time | requests served |
0 | 108 | 5000 |
8000000 | 115 | 5060 |
58000000 | 140 | 5450 |
88000000 | 151 | 5620 |
put defedb on a RAM disk:
n | source time | requests served |
0 | 108 | 5000 |
8000000 | ||
58000000 | 111 | 5453 |
88000000 | 114 | 5693 |
Obviously the additional disk I/O traffic created by having to use DEFEDB is slowing down the system.
2004-06-23 Upgraded v-bsd.perf-lab to FreeBSD 4.9 (2 processors), using softupdates.
source on v-sun, sink on s-6:
time ./smtpc2 -O 10 -t 100 -s 1000 -r v-bsd.perf-lab:1234
43s
turn off fsync(): (smtps -U, must be compiled with -DTESTING)
32s
A modified iostat(8) program is used to show the number of bytes written and read, and the number of read, write, and other disk I/O operations.
The following tests were performed: sink (smtps3) on v-bsd.perf-lab, source (smtpc) on s-6.perf-lab sending 1000 mails. All numbers for write operations are rounded; if there are numbers in parentheses then those denote the value of ru_oublock (getrusage(2)) for smtps/qmgr or sm8. If two times are given (separated by /) then the second time denotes the output (elapsed time) for the sink.
program | softupdates? | writes | reads | time |
smtps3 -C | yes | 2200 | - | 14 |
smtps3 -C | no | 2900 | - | 30 |
sm9.0.0.12, no sched (see 1) | yes | 5200 | - | 34 |
sm9.0.0.12, no sched | yes | - | ||
sm9.0.0.12, no sched | no | - | ||
sm9.0.0.12 (see 2) | yes | 3500 (2000/1300) | 4 | 33 |
yes | 3370 (2020/1270) | 4 | 30/29 | |
-O i=1000000 | yes | 2660 (1850/660) | 0 | 25/24 |
sm9.0.0.12 | no | 6300 (3000/3200) | 0 | 52 |
sm9.0.0.12 (see 4) | yes | 3500 (2200/1200) | 4 | 25 |
sm8.12.11 -odq SS=m | yes | 1800 | - | 41 |
sm8.12.11 -odq SS=m | no | 12200 | - | 72 |
sm8.12.11 SS=m (see 3) | yes | 236 (164) | 0 | 61 |
yes | 370 (218) | 0 | 60 | |
sm8.12.11 | no | 8100 (4100) | 1 | 63 |
sm8.12.11 SS=t | yes | 7400 | 0 | 70 |
postfix 2.0.18 | yes | 2900 | 16 | 21/26 |
Notes:
2004-03-23 source: basil.ps-lab MTA: wasabi.ps-lab (FreeBSD 4.9, machine 16 in Section 5.2.1.1) sink: v-sun.perf-lab
smtpc -t 100 -s 1000
program | writes | reads | source time | sink time |
smtps3 -C | 2400 | - | 11 | - |
sm9.0.0.12 | 2600 | 5 | 15 | 13 |
sm8.12.11 | 6000 | 1 | 35 | |
postfix 2.0.18 | 2800 | 15 | 14 | 20 |
Note: the source time for postfix is shorter than the time for sm9 because sm9 emptied the queue during the run while postfix has more than 700 entries in the mail queue after the source finished sending all mails. This can be seen by looking at the sink time, which is noticeably larger for postfix compared to sendmail X.
Using gatling:
Max random envelope rcpts:  1
Connections:                100
Max msgs/conn:              Unlimited
Messages:                   Fixed size 1 Kbytes
Desired Message Rate:       Unlimited
Total messages:             5000
Total test elapsed time:    73.571 seconds (1:13.570)
Overall message rate:       67.962 msg/sec
Peak rate:                  100.000 msg/sec
gatling -m 100 -c 5000 -z 1 -Z 1
program | writes | source time | source msgs/s | sink time |
smtps3 | 0 | 5 | 980 | - |
smtps3 -C | 11750 | 53 | 93 | |
sm9.0.0.12 | 73 | 67 | 71 | |
sm9.0.0.12 | 11157 (8000/2700) | 70 | 71 | 69 |
sm8.12.11 | 136 | 36 | ||
postfix 2.0.18 | 60 | 83 | 78 | |
postfix 2.0.18 | 12635 | 58 | 85 | 75 |
2004-03-16 results for wiz: source: time ./smtpc -s 1000 -t 100 -r localhost:1234; sink: smtps3, file system: UFS, softupdates
parameters | oublock | writes | source time | sink time |
-C -i | 1920 | ? | 17 | 16 |
-C -p 1 | 1860 | ? | 17 | 17 |
-C -p 1 | 1940 | 2700 | 16 | 15 |
-C -p 1 | 1970 | 2770 | 16 | 15 |
-C -p 2 | ? | 15 | ? | |
-C -p 2 | 877+966 | 2600 | 15 | ? |
-C -p 4 | 455+476+432+472 | 2640 | 15 | ? |
New option: -f for flat, i.e., instead of using 16 subdirectories for CDB files, a single directory is used. Even though this does not cause a noticeable difference in run time, the number of I/O operations is reduced.
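A minimal sketch of the two layouts (the naming scheme shown is an assumption, not the actual CDB file naming): the default variant selects one of 16 subdirectories from a leading character of the CDB identifier, the flat (-f) variant uses a single directory.

#include <stdio.h>

/* sketch only: derive the relative path of a CDB file from its identifier */
static void
cdb_path_sketch(char *buf, size_t len, const char *cdb_id, int flat)
{
    if (flat)
        snprintf(buf, len, "%s", cdb_id);               /* -f: single directory */
    else
        snprintf(buf, len, "%c/%s", cdb_id[0], cdb_id); /* 16 subdirectories if the id starts with a hex digit */
}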
parameters | oublock | writes | source time |
-C -p 2 | 915+920 | 2600 | 14 |
-C -p 2 -f | 600+610 | 2200 | 14 |
2004-03-16 source: s-6.perf-lab, time ./smtpc -s 1000 -t 100 -r localhost:1234; sink: -v-bsd.perf-lab, smtps3, file system: UFS, softupdates
parameters | oublock | writes | source time | sink time |
-C -i | 1430 | 2165 | 12 | 11 |
1550 | 2300 | 14 | 13 | |
-C -p 1 | 1500 | 2500 | 14 | 12 |
-C -p 2 | 1100+620 | 2500 | 13 | - |
800+770 | 2320 | 13 | - | |
-C -p 4 | 530+350+540+470 | 2600 | 13 | - |
Note: some of the write operations might be from softupdates due to the previous rm command (removing the CDB files).
2004-03-17 checks/t-readwrite on v-bsd (FreeBSD 4.9, SCSI):
parameters | softupdates | oublock | writes | time |
-s -f 1000 -p 1 | yes | 4000 | 4000 | 22 |
-s -f 100 -p 10 | yes | 2575 | 2579 | 14 |
-s -f 1000 -p 1 | no | 4050 | 4050 | 28 |
-s -f 100 -p 10 | no | 4050 | 4050 | 27 |
-p specifies the number of processes to start, -f the number of files to write per process. The test cases above write 1000 files with either 1 or 10 processes. As can be seen, it is significantly more efficient to use 10 processes if softupdates are turned on.
2004-03-17 checks/t-readwrite on wiz (FreeBSD 4.8, IDE):
parameters | softupdates | oublock | writes | time |
-s -f 1000 -p 1 | yes | 3000 | 3800 | 13 |
-s -f 100 -p 10 | yes | 2860 | 3600 | 13 |
In this case no difference can be seen, which is most likely a result of using an IDE drive with write-caching turned on (default).
2003-11-21 sm-9.0.0.9 running on v-sun.perf-lab
Source on bsd.dev-lab
time ./smtp-source -d -s 100 -m 5000 -c
using FS: 301.90 - 305.02s (16msgs/s)
using swap: 77.98 - 78.55s (64msgs/s)
Those tests ran only 32 SMTPS threads (the machine has 4 CPUs, hence the specified limit of 128 was divided by 4). Using 128 SMTPS threads (by forcing only one process, which was used anyway because SMTPS is run with the interactive option which does not start background processes):
time ./smtp-source -d -s 100 -m 50000 -c
using swap: 727.73s (68msgs/s)
2004-03-09 sm-9.0.0.12 running on v-sun.perf-lab
time ./smtpc -O 20 -fa@s-6.perf-lab.sendmail.com -Rnobody@v-bsd.perf-lab.sendmail.com -t 100 -s 1000 -r v-sun.perf-lab.sendmail.com:1234
MTA options | FS | source time(s) | sink time(s) |
full MTS | SWAPFS | 16 | 14 |
without sched | SWAPFS | 10 | - |
smtpss | SWAPFS | 3 | - |
full MTS | UFS | 64, 65, 64 | 75, 70, 69 |
8.12.11 | SWAPFS | 16 | 19 |
8.12.11 | UFS | 141 | 138 |
Note: sm9 using UFS runs into connection limitations: QMGR believes there are 100 open connections even though the sink shows at most 18. This seems to be a communication latency between SMTPC and QMGR (and needs to be investigated further).
2004-03-17 checks/t-readwrite on v-sun (SunOS 5.8, SCSI):
parameters | writes | time |
-s -f 1000 -p 1 | - | 39 |
-s -f 100 -p 10 | - | 37 |
The filesystem on SunOS 5.8 does not show any difference between using 1 or 10 processes.
2004-03-05 source, relay, and sink on zardoc (OpenBSD 3.2)
test with logging via smioout
zardoc$ time ./smtpc2 -O 10 -s 1000 -t 100 -r localhost:1234
24.17s real     0.94s user     2.57s system
smtps3 stats:
elapsed                  26
Thread limits (min/max)  8/256
Waiting threads          8
Max busy threads         3
Requests served          1000
Note that there were only 3 busy threads. That means the client is not busy at all. Another test shows elapsed=23s, max busy threads=21, so the result isn't deterministic (the machine is running as a normal SMTP server etc. during the tests).
test with logging via smioerr: smtpc2: 24.53s; no difference.
2004-03-17 checks/t-readwrite on aix-3 (AIX 4.3, SCSI, jfs):
parameters | writes | time |
-s -f 1000 -p 1 | - | 30 |
-s -f 100 -p 10 | - | 29 |
No (noticeable) difference.
Here are some results of a simple test program which creates and deletes a number of files and optionally renames them twice while doing so.
Notice: unless mentioned otherwise, all measurements are at most accurate to one second resolution. Repeated tests will most likely show (slightly) different results. These tests are only listed to give an idea of the magnitude of available performance.
The involved systems are:
wdc0: unit 0 (wd0): <FUJITSU MPD3064AT> wd0: 6187MB (12672450 sectors), 13410 cyls, 15 heads, 63 S/T, 512 B/S
wd0 at pciide0 channel 0 drive 0: <IBM-DJNA-351010> wd0: can use 32-bit, PIO mode 4, DMA mode 2, Ultra-DMA mode 4 wd0: 16-sector PIO, LBA, 9671MB, 16383 cyl, 16 head, 63 sec, 19807200 sectors
wd1 at pciide0 channel 0 drive 1: <Maxtor 98196H8>, wd1: can use 32-bit, PIO mode 4, DMA mode 2, Ultra-DMA mode 4, wd1: 16-sector PIO, LBA, 78167MB, 16383 cyl, 16 head, 63 sec, 160086528 sectors
ad0: 6187MB <FUJITSU MPC3064AT> [13410/15/63] at ata0-master UDMA33
ahc0: <Adaptec 2940 Ultra2 SCSI adapter (OEM)> da0: <IBM DNES-309170W SA30> Fixed Direct Access SCSI-3 device da0: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing Enabled da0: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
ad0: 8063MB <FUJITSU MPD3084AT> [16383/16/63] at ata0-master UDMA66, softupdates
hda: IBM-DJNA-370910, 8693MB w/1966kB Cache, CHS=1108/255/63, ext2 FS
hda: 39102336 sectors (20020 MB) w/2048KiB Cache, CHS=2434/255/63, UDMA(66) reiserfs: using 3.5.x disk format ReiserFS version 3.6.25
WD1200BB hdg: 234441648 sectors (120034 MB) w/2048KiB Cache, CHS=232581/16/63, UDMA(100)
ad0: 8693MB <IBM-DJNA-370910> [17662/16/63] at ata0-master UDMA33 acd0: CDROM <CD-ROM 40X> at ata1-master PIO4
scsi0 : ioc0: LSI53C1030, FwRev=01000000h, Ports=1, MaxQ=255, IRQ=52 Vendor: MAXTOR Model: ATLASU320_18_SCA Rev: B120 Type: Direct-Access ANSI SCSI revision: 03 Attached scsi disk sda at scsi0, channel 0, id 0, lun 0 SCSI device sda: 35916548 512-byte hdwr sectors (18389 MB) reiserfs: found format "3.6" with standard journal reiserfs: using ordered data mode Using r5 hash to sort names
da0 at ahc0 bus 0 target 0 lun 0 da0: <SEAGATE ST39175LW 0001> Fixed Direct Access SCSI-2 device da0: 80.000MB/s transfers (40.000MHz, offset 15, 16bit), Tagged Queueing Enabled da0: 8683MB (17783240 512 byte sectors: 255H 63S/T 1106C)
ad0: 6187MB <FUJITSU MPD3064AT> [13410/15/63] at ata0-master UDMA33
da3 at ahc0 bus 0 target 4 lun 0 da3: <IBM DNES-309170Y SA30> Fixed Direct Access SCSI-3 device da3: 40.000MB/s transfers (20.000MHz, offset 31, 16bit), Tagged Queueing Enabled da3: 8748MB (17916240 512 byte sectors: 255H 63S/T 1115C)
wd0 at pciide0 channel 0 drive 0: <IBM-DJNA-371350> wd0: 16-sector PIO, LBA, 12949MB, 16383 cyl, 16 head, 63 sec, 26520480 sectors
wd1 at pciide0 channel 0 drive 1: <WDC WD1200BB-53CAA0> wd1: 16-sector PIO, LBA, 114473MB, 16383 cyl, 16 head, 63 sec, 234441648 sectors
wd2 at pciide1 channel 0 drive 0: <Maxtor 6Y160P0> wd2: 16-sector PIO, LBA48, 156334MB, 16383 cyl, 16 head, 63 sec, 320173056 sectors wd2(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 6
In this section, some simple test programs are used that create some files, perform (sequential) read/write operations on them and remove them afterwards.
Entries in the following table are elapsed time in seconds (except for the first column which obviously refers to the machine description above). The program that has been used to produce these results is fsperf1.c.
machine | 5000 100 | -c 5000 100 | -c -r 5000 100 |
1 | 50 | 49 | 48 |
1 | 42 | 48 | 51 |
2a | 3 | 7 | 10 |
about 2200 tps | about 1500 tps | ||
11 | 21 | ||
3 | 10 | 34 | 34 |
about 500 tps | |||
4(a)i | 126 | 125 | |
4(a)ii | 208 | 454 | |
4b | 43 | 48 | |
7 | 7 | 13 | 16 |
5 | 9 | 8 | 9 |
8 | 133 | 201 | 603 |
9a | 52 | 665 | |
10a | 9 | 9 | 12 |
89 | 139 | 233 |
Comments:
(2004-07-14) With and without fsync(2) (-S)
common parameters | machine | -c | -c -r | -S -c | -S -c -r |
(5000 100) | 17 | 42 | 42 | 2 | 3 |
10b | 165 | 496 | 165 | 495 | |
18 | 83 | 83 | 5 | 8 | |
19a | 8 | 7 | 1 | 3 | |
19b | 8 | 9 | 1 | 3 | |
19c | 7 | 9 | 1 | 2 | |
(-s 32 5000 100) | 17 | 109 | 109 | 8 | 9 |
10b | 250 | 537 | 207 | 498 | |
18 | 114 | 113 | 14 | 16 | |
19b | 87 | 81 | 3 | 5 | |
19c | 26 | 26 | 4 | 5 |
Comments:
Next version: allow for hashing (00 - 99, up to two levels). Use enough files to defeat the (2MB) cache of IDE disks.
machine | -h 1 -c 1000 1000 | -h 1 -c -r 1000 1000 |
1 | 18 | 18 |
2a | 24 | 24 |
2b | 7 | 9 |
3 | 14 | 14 |
4(a)i | 23 | 23 |
4(a)ii | 33 | 77 |
4b | 25 | 49 |
5 | 3 | 2 |
7 | 3 | 4 |
8 | 58 | 163 |
9a | 51 | 139 |
11 | 28 | 48 |
Comments:
Next version of fsperf1.c: allow for hashing (00 - 99, up to two levels). Use enough files to defeat the (2MB) cache of IDE disks. The parameters for the following table are 1000 operations and 1000 files, hence each file is used once. Additional parameters are listed in the heading. c: create, h 1: one level hashing, r: rename file twice, p: populate directories before test, then just reuse the files.
machine | -h 1 -c | -h 1 -c -r | -p -h 1 -c | -p -h 1 -c -r |
1 | 32 | 31 | 18 | 17 |
2a | 18 | 18 | 9 | 10 |
2b | 10 | 10 | 8 | 10 |
5 | 2 | 1 | 2 | 1 |
6 | 2 | 2 | 4 | 4 |
7 | 2 | 4 | 2 | 3 |
8 | 58 | 165 | 78 | 178 |
9a | 27 | 127 | 33 | 131 |
9c | 13 | 51 | 37 | 55 |
11 | 28 | 48 | 28 | 48 |
Comments:
Another test program (fsseq1.c) writes lines to a file and uses fsync(2) after a specified number (-C parameter).
20000 entries (10000 entries each for received/delivered, total 490000 bytes).
machine | - | -C 100 | -C 50 | -C 10 | -C 5 | -C 2 | -f |
1 | 1 | 4 | 6 | 17 | 32 | 78 | 150 |
2a | 0 | 2 | 2 | 5 | 5 | 9 | 18 |
2b | 1 | 0 | 1 | 3 | 4 | 10 | 20 |
3 | 1 | 2 | 3 | 9 | 16 | 37 | 68 |
5 | 1 | 1 | 2 | 6 | 12 | 27 | 56 |
7 | 0 | 4 | 8 | 39 | 79 | 198 | 410 |
8 | 1 | 7 | 13 | 60 | 120 | 299 | 598 |
9a | 1 | 8 | 13 | 15 | 62 | 90 | 140 |
11 | 0 | 6 | 12 | 53 | 106 | 262 | 518 |
This clearly demonstrates the need for group commits. However, the program requires a lot of CPU since each line is generated by snprintf(). Hence the full I/O speed may not be reached. To confirm this, another program (fsseq2.c) is used that just writes a buffer with a fixed content to a file.
The following table lists the results for group commits (C) together with various buffer sizes (256, 1024, 4096, 8192, and 16384). As usual the entries are execution time in seconds. The program writes 2000 records in total, e.g., for size 16384 that is 31MB data.
machine | C | 256 | 1024 | 4096 | 8192 | 16384 | |
5 | 1 | 4 | 5 | 10 | 20 | 34 | |
2 | 2 | 4 | 6 | 12 | 22 | ||
5 | 1 | 2 | 5 | 7 | 15 | ||
10 | 1 | 1 | 3 | 6 | 12 | ||
50 | 1 | 0 | 3 | 5 | 10 | ||
100 | 0 | 1 | 3 | 5 | 10 | ||
7 | 1 | 1 | 5 | 20 | 40 | 44 | |
2 | 1 | 5 | 11 | 23 | 29 | ||
5 | 1 | 5 | 9 | 12 | 13 | ||
10 | 1 | 2 | 3 | 6 | 7 | ||
50 | 0 | 1 | 1 | 2 | 3 | ||
100 | 0 | 1 | 1 | 1 | 3 | ||
8 | 1 | 3 | 10 | 45 | 95 | 109 | |
2 | 2 | 11 | 23 | 52 | 59 | ||
5 | 3 | 11 | 19 | 24 | 32 | ||
10 | 2 | 5 | 6 | 15 | 21 | ||
50 | 1 | 2 | 3 | 8 | 13 | ||
100 | 0 | 1 | 3 | 6 | 13 | ||
9a | 1 | 3 | 12 | 34 | 35 | 58 | |
2 | 3 | 12 | 18 | 53 | 53 | ||
5 | 3 | 6 | 21 | 23 | 24 | ||
10 | 3 | 5 | 6 | 13 | 14 | ||
50 | 1 | 2 | 2 | 5 | 7 | ||
100 | 1 | 1 | 2 | 3 | 6 | ||
11 | 1 | 21 | 35 | 77 | 83 | 92 | |
2 | 13 | 26 | 38 | 45 | 50 | ||
5 | 8 | 13 | 17 | 20 | 24 | ||
10 | 5 | 6 | 10 | 11 | 15 | ||
50 | 1 | 2 | 2 | 4 | 7 | ||
100 | 1 | 1 | 2 | 3 | 6 |
Comments:
Yet another program (fsseq3.c) uses write() instead of fwrite(). This time the tests write 40000KB each, which makes it simpler to determine the throughput.
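The core of such a test program is small; a minimal sketch (file handling and error reporting simplified, buffer limited to the largest record size used here) looks like this:

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* write "records" records of "size" bytes, fsync(2)ing only every C records */
static int
fsseq_sketch(const char *file, size_t size, unsigned int records, unsigned int C)
{
    char buf[16384];
    int fd;
    unsigned int i;

    if (size > sizeof(buf) || C == 0)
        return -1;
    memset(buf, 'x', size);   /* fixed content */
    if ((fd = open(file, O_WRONLY|O_CREAT|O_TRUNC, 0644)) < 0)
        return -1;
    for (i = 1; i <= records; i++)
    {
        if (write(fd, buf, size) != (ssize_t) size)
            break;
        if (i % C == 0)
            (void) fsync(fd);  /* group commit every C records */
    }
    (void) fsync(fd);
    return close(fd);
}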
Note: as usual, these times are not very accurate (1s resolution), and hence the rate is inaccurate too. Machines:
C | s | records | time | KB/s |
1 | 512 | 80000 | 1365 | 29 |
1 | 1024 | 40000 | 734 | 54 |
1 | 2048 | 20000 | 451 | 88 |
1 | 4096 | 10000 | 352 | 113 |
1 | 8192 | 5000 | 250 | 160 |
2 | 512 | 80000 | 736 | 54 |
2 | 1024 | 40000 | 453 | 88 |
2 | 2048 | 20000 | 354 | 112 |
2 | 4096 | 10000 | 382 | 104 |
2 | 8192 | 5000 | 225 | 177 |
5 | 512 | 80000 | 638 | 62 |
5 | 1024 | 40000 | 585 | 68 |
5 | 2048 | 20000 | 312 | 128 |
5 | 4096 | 10000 | 187 | 213 |
5 | 8192 | 5000 | 101 | 396 |
10 | 512 | 80000 | 561 | 71 |
10 | 1024 | 40000 | 296 | 135 |
10 | 2048 | 20000 | 161 | 248 |
10 | 4096 | 10000 | 88 | 454 |
10 | 8192 | 5000 | 60 | 666 |
50 | 512 | 80000 | 128 | 312 |
50 | 1024 | 40000 | 70 | 571 |
50 | 2048 | 20000 | 41 | 975 |
50 | 4096 | 10000 | 34 | 1176 |
50 | 8192 | 5000 | 29 | 1379 |
100 | 512 | 80000 | 73 | 547 |
100 | 1024 | 40000 | 43 | 930 |
100 | 2048 | 20000 | 33 | 1212 |
100 | 4096 | 10000 | 28 | 1428 |
100 | 8192 | 5000 | 27 | 1481 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 165 | 242 |
1 | 1024 | 40000 | 90 | 444 |
1 | 2048 | 20000 | 54 | 740 |
1 | 4096 | 10000 | 28 | 1428 |
1 | 8192 | 5000 | 16 | 2500 |
2 | 512 | 80000 | 94 | 425 |
2 | 1024 | 40000 | 52 | 769 |
2 | 2048 | 20000 | 30 | 1333 |
2 | 4096 | 10000 | 17 | 2352 |
2 | 8192 | 5000 | 11 | 3636 |
5 | 512 | 80000 | 54 | 740 |
5 | 1024 | 40000 | 33 | 1212 |
5 | 2048 | 20000 | 19 | 2105 |
5 | 4096 | 10000 | 11 | 3636 |
5 | 8192 | 5000 | 8 | 5000 |
10 | 512 | 80000 | 31 | 1290 |
10 | 1024 | 40000 | 18 | 2222 |
10 | 2048 | 20000 | 11 | 3636 |
10 | 4096 | 10000 | 8 | 5000 |
10 | 8192 | 5000 | 6 | 6666 |
50 | 512 | 80000 | 11 | 3636 |
50 | 1024 | 40000 | 8 | 5000 |
50 | 2048 | 20000 | 6 | 6666 |
50 | 4096 | 10000 | 5 | 8000 |
50 | 8192 | 5000 | 4 | 10000 |
100 | 512 | 80000 | 10 | 4000 |
100 | 1024 | 40000 | 8 | 5000 |
100 | 2048 | 20000 | 5 | 8000 |
100 | 4096 | 10000 | 4 | 10000 |
100 | 8192 | 5000 | 5 | 8000 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 13440 | 2 |
1 | 1024 | 40000 | 6790 | 5 |
1 | 2048 | 20000 | 3451 | 11 |
1 | 4096 | 10000 | 1779 | 22 |
1 | 8192 | 5000 | 1007 | 39 |
2 | 512 | 80000 | 6790 | 5 |
2 | 1024 | 40000 | 3439 | 11 |
2 | 2048 | 20000 | 1763 | 22 |
2 | 4096 | 10000 | 909 | 44 |
2 | 8192 | 5000 | 471 | 84 |
5 | 512 | 80000 | 2763 | 14 |
5 | 1024 | 40000 | 1414 | 28 |
5 | 2048 | 20000 | 739 | 54 |
5 | 4096 | 10000 | 383 | 104 |
5 | 8192 | 5000 | 208 | 192 |
10 | 512 | 80000 | 1414 | 28 |
10 | 1024 | 40000 | 731 | 54 |
10 | 2048 | 20000 | 384 | 104 |
10 | 4096 | 10000 | 208 | 192 |
10 | 8192 | 5000 | 120 | 333 |
50 | 512 | 80000 | 312 | 128 |
50 | 1024 | 40000 | 174 | 229 |
50 | 2048 | 20000 | 101 | 396 |
50 | 4096 | 10000 | 64 | 625 |
50 | 8192 | 5000 | 46 | 869 |
100 | 512 | 80000 | 171 | 233 |
100 | 1024 | 40000 | 100 | 400 |
100 | 2048 | 20000 | 64 | 625 |
100 | 4096 | 10000 | 46 | 869 |
100 | 8192 | 5000 | 37 | 1081 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 130 | 307 |
1 | 1024 | 40000 | 93 | 430 |
1 | 2048 | 20000 | 78 | 512 |
1 | 4096 | 10000 | 23 | 1739 |
1 | 8192 | 5000 | 12 | 3333 |
2 | 512 | 80000 | 62 | 645 |
2 | 1024 | 40000 | 46 | 869 |
2 | 2048 | 20000 | 24 | 1666 |
2 | 4096 | 10000 | 13 | 3076 |
2 | 8192 | 5000 | 15 | 2666 |
5 | 512 | 80000 | 66 | 606 |
5 | 1024 | 40000 | 31 | 1290 |
5 | 2048 | 20000 | 18 | 2222 |
5 | 4096 | 10000 | 15 | 2666 |
5 | 8192 | 5000 | 10 | 4000 |
10 | 512 | 80000 | 28 | 1428 |
10 | 1024 | 40000 | 19 | 2105 |
10 | 2048 | 20000 | 13 | 3076 |
10 | 4096 | 10000 | 10 | 4000 |
10 | 8192 | 5000 | 10 | 4000 |
50 | 512 | 80000 | 14 | 2857 |
50 | 1024 | 40000 | 10 | 4000 |
50 | 2048 | 20000 | 10 | 4000 |
50 | 4096 | 10000 | 9 | 4444 |
50 | 8192 | 5000 | 7 | 5714 |
100 | 512 | 80000 | 11 | 3636 |
100 | 1024 | 40000 | 10 | 4000 |
100 | 2048 | 20000 | 8 | 5000 |
100 | 4096 | 10000 | 8 | 5000 |
100 | 8192 | 5000 | 8 | 5000 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 3347 | 11 |
1 | 1024 | 40000 | 1689 | 23 |
1 | 2048 | 20000 | 845 | 47 |
1 | 4096 | 10000 | 418 | 95 |
1 | 8192 | 5000 | 192 | 208 |
2 | 512 | 80000 | 1243 | 32 |
2 | 1024 | 40000 | 796 | 50 |
2 | 2048 | 20000 | 431 | 92 |
2 | 4096 | 10000 | 222 | 180 |
2 | 8192 | 5000 | 122 | 327 |
5 | 512 | 80000 | 655 | 61 |
5 | 1024 | 40000 | 268 | 149 |
5 | 2048 | 20000 | 161 | 248 |
5 | 4096 | 10000 | 108 | 370 |
5 | 8192 | 5000 | 58 | 689 |
10 | 512 | 80000 | 355 | 112 |
10 | 1024 | 40000 | 185 | 216 |
10 | 2048 | 20000 | 85 | 470 |
10 | 4096 | 10000 | 42 | 952 |
10 | 8192 | 5000 | 38 | 1052 |
50 | 512 | 80000 | 88 | 454 |
50 | 1024 | 40000 | 49 | 816 |
50 | 2048 | 20000 | 31 | 1290 |
50 | 4096 | 10000 | 18 | 2222 |
50 | 8192 | 5000 | 10 | 4000 |
100 | 512 | 80000 | 45 | 888 |
100 | 1024 | 40000 | 33 | 1212 |
100 | 2048 | 20000 | 19 | 2105 |
100 | 4096 | 10000 | 14 | 2857 |
100 | 8192 | 5000 | 14 | 2857 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 6302 | 6 |
1 | 1024 | 40000 | 3220 | 12 |
1 | 2048 | 20000 | 1695 | 23 |
1 | 4096 | 10000 | 949 | 42 |
1 | 8192 | 5000 | 552 | 72 |
2 | 512 | 80000 | 3183 | 12 |
2 | 1024 | 40000 | 1708 | 23 |
2 | 2048 | 20000 | 950 | 42 |
2 | 4096 | 10000 | 484 | 82 |
2 | 8192 | 5000 | 299 | 133 |
5 | 512 | 80000 | 1402 | 28 |
5 | 1024 | 40000 | 805 | 49 |
5 | 2048 | 20000 | 440 | 90 |
5 | 4096 | 10000 | 252 | 158 |
5 | 8192 | 5000 | 137 | 291 |
10 | 512 | 80000 | 783 | 51 |
10 | 1024 | 40000 | 395 | 101 |
10 | 2048 | 20000 | 211 | 189 |
10 | 4096 | 10000 | 122 | 327 |
10 | 8192 | 5000 | 87 | 459 |
50 | 512 | 80000 | 181 | 220 |
50 | 1024 | 40000 | 107 | 373 |
50 | 2048 | 20000 | 68 | 588 |
50 | 4096 | 10000 | 49 | 816 |
50 | 8192 | 5000 | 42 | 952 |
100 | 512 | 80000 | 111 | 360 |
100 | 1024 | 40000 | 70 | 571 |
100 | 2048 | 20000 | 50 | 800 |
100 | 4096 | 10000 | 40 | 1000 |
100 | 8192 | 5000 | 36 | 1111 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 2638 | 15 |
1 | 1024 | 40000 | 1419 | 28 |
1 | 2048 | 20000 | 753 | 53 |
1 | 4096 | 10000 | 442 | 90 |
1 | 8192 | 5000 | 221 | 180 |
2 | 512 | 80000 | 1379 | 29 |
2 | 1024 | 40000 | 774 | 51 |
2 | 2048 | 20000 | 409 | 97 |
2 | 4096 | 10000 | 220 | 181 |
2 | 8192 | 5000 | 124 | 322 |
5 | 512 | 80000 | 644 | 62 |
5 | 1024 | 40000 | 382 | 104 |
5 | 2048 | 20000 | 198 | 202 |
5 | 4096 | 10000 | 105 | 380 |
5 | 8192 | 5000 | 58 | 689 |
10 | 512 | 80000 | 355 | 112 |
10 | 1024 | 40000 | 196 | 204 |
10 | 2048 | 20000 | 104 | 384 |
10 | 4096 | 10000 | 59 | 677 |
10 | 8192 | 5000 | 32 | 1250 |
50 | 512 | 80000 | 90 | 444 |
50 | 1024 | 40000 | 51 | 784 |
50 | 2048 | 20000 | 28 | 1428 |
50 | 4096 | 10000 | 19 | 2105 |
50 | 8192 | 5000 | 15 | 2666 |
100 | 512 | 80000 | 54 | 740 |
100 | 1024 | 40000 | 28 | 1428 |
100 | 2048 | 20000 | 20 | 2000 |
100 | 4096 | 10000 | 15 | 2666 |
100 | 8192 | 5000 | 14 | 2857 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 2642 | 15 |
1 | 1024 | 40000 | 1312 | 30 |
1 | 2048 | 20000 | 723 | 55 |
1 | 4096 | 10000 | 376 | 106 |
1 | 8192 | 5000 | 185 | 216 |
2 | 512 | 80000 | 1363 | 29 |
2 | 1024 | 40000 | 699 | 57 |
2 | 2048 | 20000 | 359 | 111 |
2 | 4096 | 10000 | 185 | 216 |
2 | 8192 | 5000 | 104 | 384 |
5 | 512 | 80000 | 563 | 71 |
5 | 1024 | 40000 | 302 | 132 |
5 | 2048 | 20000 | 162 | 246 |
5 | 4096 | 10000 | 88 | 454 |
5 | 8192 | 5000 | 46 | 869 |
10 | 512 | 80000 | 299 | 133 |
10 | 1024 | 40000 | 161 | 248 |
10 | 2048 | 20000 | 87 | 459 |
10 | 4096 | 10000 | 46 | 869 |
10 | 8192 | 5000 | 24 | 1666 |
50 | 512 | 80000 | 81 | 493 |
50 | 1024 | 40000 | 44 | 909 |
50 | 2048 | 20000 | 35 | 1142 |
50 | 4096 | 10000 | 19 | 2105 |
50 | 8192 | 5000 | 13 | 3076 |
100 | 512 | 80000 | 51 | 784 |
100 | 1024 | 40000 | 35 | 1142 |
100 | 2048 | 20000 | 26 | 1538 |
100 | 4096 | 10000 | 15 | 2666 |
100 | 8192 | 5000 | 13 | 3076 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 2576 | 15 |
1 | 1024 | 40000 | 1326 | 30 |
1 | 2048 | 20000 | 707 | 56 |
1 | 4096 | 10000 | 377 | 106 |
1 | 8192 | 5000 | 192 | 208 |
2 | 512 | 80000 | 1324 | 30 |
2 | 1024 | 40000 | 685 | 58 |
2 | 2048 | 20000 | 349 | 114 |
2 | 4096 | 10000 | 187 | 213 |
2 | 8192 | 5000 | 107 | 373 |
5 | 512 | 80000 | 578 | 69 |
5 | 1024 | 40000 | 313 | 127 |
5 | 2048 | 20000 | 163 | 245 |
5 | 4096 | 10000 | 89 | 449 |
5 | 8192 | 5000 | 46 | 869 |
10 | 512 | 80000 | 306 | 130 |
10 | 1024 | 40000 | 162 | 246 |
10 | 2048 | 20000 | 86 | 465 |
10 | 4096 | 10000 | 46 | 869 |
10 | 8192 | 5000 | 25 | 1600 |
50 | 512 | 80000 | 82 | 487 |
50 | 1024 | 40000 | 44 | 909 |
50 | 2048 | 20000 | 33 | 1212 |
50 | 4096 | 10000 | 19 | 2105 |
50 | 8192 | 5000 | 13 | 3076 |
100 | 512 | 80000 | 52 | 769 |
100 | 1024 | 40000 | 36 | 1111 |
100 | 2048 | 20000 | 25 | 1600 |
100 | 4096 | 10000 | 16 | 2500 |
100 | 8192 | 5000 | 13 | 3076 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 65 | 615 |
1 | 1024 | 40000 | 61 | 655 |
1 | 2048 | 20000 | 59 | 677 |
1 | 4096 | 10000 | 5 | 8000 |
1 | 8192 | 5000 | 4 | 10000 |
2 | 512 | 80000 | 13 | 3076 |
2 | 1024 | 40000 | 8 | 5000 |
2 | 2048 | 20000 | 4 | 10000 |
2 | 4096 | 10000 | 4 | 10000 |
2 | 8192 | 5000 | 3 | 13333 |
5 | 512 | 80000 | 44 | 909 |
5 | 1024 | 40000 | 21 | 1904 |
5 | 2048 | 20000 | 13 | 3076 |
5 | 4096 | 10000 | 3 | 13333 |
5 | 8192 | 5000 | 3 | 13333 |
10 | 512 | 80000 | 12 | 3333 |
10 | 1024 | 40000 | 3 | 13333 |
10 | 2048 | 20000 | 3 | 13333 |
10 | 4096 | 10000 | 3 | 13333 |
10 | 8192 | 5000 | 5 | 8000 |
50 | 512 | 80000 | 11 | 3636 |
50 | 1024 | 40000 | 3 | 13333 |
50 | 2048 | 20000 | 5 | 8000 |
50 | 4096 | 10000 | 5 | 8000 |
50 | 8192 | 5000 | 4 | 10000 |
100 | 512 | 80000 | 5 | 8000 |
100 | 1024 | 40000 | 5 | 8000 |
100 | 2048 | 20000 | 5 | 8000 |
100 | 4096 | 10000 | 4 | 10000 |
100 | 8192 | 5000 | 3 | 13333 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 124 | 322 |
1 | 1024 | 40000 | 87 | 459 |
1 | 2048 | 20000 | 72 | 555 |
1 | 4096 | 10000 | 20 | 2000 |
1 | 8192 | 5000 | 10 | 4000 |
2 | 512 | 80000 | 47 | 851 |
2 | 1024 | 40000 | 32 | 1250 |
2 | 2048 | 20000 | 16 | 2500 |
2 | 4096 | 10000 | 8 | 5000 |
2 | 8192 | 5000 | 5 | 8000 |
5 | 512 | 80000 | 56 | 714 |
5 | 1024 | 40000 | 27 | 1481 |
5 | 2048 | 20000 | 20 | 2000 |
5 | 4096 | 10000 | 5 | 8000 |
5 | 8192 | 5000 | 5 | 8000 |
10 | 512 | 80000 | 23 | 1739 |
10 | 1024 | 40000 | 17 | 2352 |
10 | 2048 | 20000 | 6 | 6666 |
10 | 4096 | 10000 | 3 | 13333 |
10 | 8192 | 5000 | 6 | 6666 |
50 | 512 | 80000 | 7 | 5714 |
50 | 1024 | 40000 | 4 | 10000 |
50 | 2048 | 20000 | 6 | 6666 |
50 | 4096 | 10000 | 6 | 6666 |
50 | 8192 | 5000 | 4 | 10000 |
100 | 512 | 80000 | 7 | 5714 |
100 | 1024 | 40000 | 6 | 6666 |
100 | 2048 | 20000 | 5 | 8000 |
100 | 4096 | 10000 | 4 | 10000 |
100 | 8192 | 5000 | 3 | 13333 |
C | s | records | time | KB/s |
1 | 512 | 80000 | 205 | 195 |
1 | 1024 | 40000 | 144 | 277 |
1 | 2048 | 20000 | 122 | 327 |
1 | 4096 | 10000 | 14 | 2857 |
1 | 8192 | 5000 | 7 | 5714 |
2 | 512 | 80000 | 34 | 1176 |
2 | 1024 | 40000 | 22 | 1818 |
2 | 2048 | 20000 | 13 | 3076 |
2 | 4096 | 10000 | 7 | 5714 |
2 | 8192 | 5000 | 5 | 8000 |
5 | 512 | 80000 | 96 | 416 |
5 | 1024 | 40000 | 48 | 833 |
5 | 2048 | 20000 | 20 | 2000 |
5 | 4096 | 10000 | 4 | 10000 |
5 | 8192 | 5000 | 4 | 10000 |
10 | 512 | 80000 | 36 | 1111 |
10 | 1024 | 40000 | 7 | 5714 |
10 | 2048 | 20000 | 5 | 8000 |
10 | 4096 | 10000 | 4 | 10000 |
10 | 8192 | 5000 | 3 | 13333 |
50 | 512 | 80000 | 12 | 3333 |
50 | 1024 | 40000 | 4 | 10000 |
50 | 2048 | 20000 | 4 | 10000 |
50 | 4096 | 10000 | 3 | 13333 |
50 | 8192 | 5000 | 3 | 13333 |
100 | 512 | 80000 | 7 | 5714 |
100 | 1024 | 40000 | 6 | 6666 |
100 | 2048 | 20000 | 3 | 13333 |
100 | 4096 | 10000 | 3 | 13333 |
100 | 8192 | 5000 | 3 | 13333 |
Very simple measurement of transfer rate:
time dd ibs=8192 if=/dev/zero obs=8192 count=5120 of=incq
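This command writes 5120 blocks of 8 KB, i.e. 40 MB, sequentially to the file incq; the MB/s column is essentially this amount divided by the elapsed time (small deviations presumably come from rounding of the displayed times).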
machine | time (s) | MB/s |
1 | 11.6 | 3.6 |
2a | 4.8 | 8.4 |
2b | 1.9 | 20.9 |
5 | 10.83 | 3.9 |
6 | 0.65 | 61 |
7 | 1.0 | 40.0 |
8 | 14.8 | 2.8 |
9 | 6.3 | 6.6 |
11 | 6.98 | 6.0 |
12a | 0.247 | 161 |
12b | 0.401 | 99 |
12c | 0.357 | 112 |
Comments:
dd ibs=8192 if=/dev/zero obs=8192 count=124000 of=incq
machine | time (s) | MB/s |
12a | 24.762 | 39 |
12b | 22.608 | 42 |
The data in this table is more plausible, even though 40 MB/s is still very fast.
For comparison with the Berkeley DB performance data, more tests have been run with fsseq4 using different parameters. The number of records is 100000 unless otherwise noted; t/s is transactions (records written) per second. Notice: fsseq3 writes twice as many records as fsseq4 (one add and one delete entry each), and it calls fsync() twice as often (after the add and after the delete entry).
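As a reading aid for the following tables: t/s is the number of records divided by the elapsed time, and KB/s is (records x s)/1024 divided by the elapsed time; e.g., 100000 records of 20 bytes written in 1 s give 1953 KB/s and 100000 t/s.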
C | s | time | KB/s | t/s |
100000 | 20 | 1 | 1953 | 100000 |
10000 | 20 | 2 | 976 | 50000 |
1000 | 20 | 7 | 279 | 14285 |
100 | 20 | 20 | 97 | 5000 |
100000 | 100 | 3 | 3255 | 33333 |
10000 | 100 | 4 | 2441 | 25000 |
1000 | 100 | 8 | 1220 | 12500 |
100 | 100 | 57 | 171 | 1754 |
100000 | 512 | 15 | 3333 | 6666 |
10000 | 512 | 16 | 3125 | 6250 |
1000 | 512 | 17 | 2941 | 5882 |
100 | 512 | 67 | 746 | 1492 |
100000 | 1024 | 29 | 3448 | 3448 |
10000 | 1024 | 30 | 3333 | 3333 |
1000 | 1024 | 33 | 3030 | 3030 |
100 | 1024 | 77 | 1298 | 1298 |
100000 | 2048 | 60 | 3333 | 1666 |
10000 | 2048 | 60 | 3333 | 1666 |
1000 | 2048 | 64 | 3125 | 1562 |
100 | 2048 | 101 | 1980 | 990 |
C | s | time | KB/s | t/s |
100000 | 20 | 1 | 1953 | 100000 |
10000 | 20 | 1 | 1953 | 100000 |
1000 | 20 | 2 | 976 | 50000 |
100 | 20 | 2 | 976 | 50000 |
100000 | 100 | 2 | 4882 | 50000 |
10000 | 100 | 1 | 9765 | 100000 |
1000 | 100 | 2 | 4882 | 50000 |
100 | 100 | 7 | 1395 | 14285 |
100000 | 512 | 3 | 16666 | 33333 |
10000 | 512 | 3 | 16666 | 33333 |
1000 | 512 | 4 | 12500 | 25000 |
100 | 512 | 6 | 8333 | 16666 |
100000 | 1024 | 6 | 16666 | 16666 |
10000 | 1024 | 5 | 20000 | 20000 |
1000 | 1024 | 6 | 16666 | 16666 |
100 | 1024 | 8 | 12500 | 12500 |
100000 | 2048 | 12 | 16666 | 8333 |
10000 | 2048 | 12 | 16666 | 8333 |
1000 | 2048 | 15 | 13333 | 6666 |
100 | 2048 | 15 | 13333 | 6666 |
C | s | time | KB/s | t/s |
100000 | 20 | 1 | 1953 | 100000 |
10000 | 20 | 1 | 1953 | 100000 |
1000 | 20 | 2 | 976 | 50000 |
100 | 20 | 9 | 217 | 11111 |
100000 | 100 | 3 | 3255 | 33333 |
10000 | 100 | 4 | 2441 | 25000 |
1000 | 100 | 5 | 1953 | 20000 |
100 | 100 | 15 | 651 | 6666 |
100000 | 512 | 16 | 3125 | 6250 |
10000 | 512 | 18 | 2777 | 5555 |
1000 | 512 | 22 | 2272 | 4545 |
100 | 512 | 75 | 666 | 1333 |
100000 | 1024 | 34 | 2941 | 2941 |
10000 | 1024 | 35 | 2857 | 2857 |
1000 | 1024 | 46 | 2173 | 2173 |
100 | 1024 | 139 | 719 | 719 |
100000 | 2048 | 67 | 2985 | 1492 |
10000 | 2048 | 79 | 2531 | 1265 |
1000 | 2048 | 95 | 2105 | 1052 |
100 | 2048 | 246 | 813 | 406 |
C | s | time | KB/s | t/s |
100000 | 20 | 1 | 1953 | 100000 |
10000 | 20 | 1 | 1953 | 100000 |
1000 | 20 | 4 | 488 | 25000 |
100 | 20 | 31 | 63 | 3225 |
100000 | 100 | 2 | 4882 | 50000 |
10000 | 100 | 2 | 4882 | 50000 |
1000 | 100 | 6 | 1627 | 16666 |
100 | 100 | 33 | 295 | 3030 |
100000 | 512 | 8 | 6250 | 12500 |
10000 | 512 | 11 | 4545 | 9090 |
1000 | 512 | 15 | 3333 | 6666 |
100 | 512 | 50 | 1000 | 2000 |
100000 | 1024 | 11 | 9090 | 9090 |
10000 | 1024 | 10 | 10000 | 10000 |
1000 | 1024 | 14 | 7142 | 7142 |
100 | 1024 | 42 | 2380 | 2380 |
100000 | 2048 | 25 | 8000 | 4000 |
10000 | 2048 | 26 | 7692 | 3846 |
1000 | 2048 | 21 | 9523 | 4761 |
100 | 2048 | 42 | 4761 | 2380 |
C | s | time | KB/s | t/s |
100000 | 20 | 3 | 651 | 33333 |
10000 | 20 | 3 | 651 | 33333 |
1000 | 20 | 3 | 651 | 33333 |
100 | 20 | 5 | 390 | 20000 |
100000 | 100 | 3 | 3255 | 33333 |
10000 | 100 | 4 | 2441 | 25000 |
1000 | 100 | 4 | 2441 | 25000 |
100 | 100 | 9 | 1085 | 11111 |
100000 | 512 | 5 | 10000 | 20000 |
10000 | 512 | 5 | 10000 | 20000 |
1000 | 512 | 7 | 7142 | 14285 |
100 | 512 | 20 | 2500 | 5000 |
100000 | 1024 | 8 | 12500 | 12500 |
10000 | 1024 | 8 | 12500 | 12500 |
1000 | 1024 | 9 | 11111 | 11111 |
100 | 1024 | 26 | 3846 | 3846 |
100000 | 2048 | 15 | 13333 | 6666 |
10000 | 2048 | 16 | 12500 | 6250 |
1000 | 2048 | 21 | 9523 | 4761 |
100 | 2048 | 36 | 5555 | 2777 |
C | s | time | KB/s | t/s |
100000 | 20 | 1 | 1953 | 100000 |
10000 | 20 | 1 | 1953 | 100000 |
1000 | 20 | 4 | 488 | 25000 |
100 | 20 | 29 | 67 | 3448 |
100000 | 100 | 1 | 9765 | 100000 |
10000 | 100 | 2 | 4882 | 50000 |
1000 | 100 | 5 | 1953 | 20000 |
100 | 100 | 36 | 271 | 2777 |
100000 | 512 | 4 | 12500 | 25000 |
10000 | 512 | 5 | 10000 | 20000 |
1000 | 512 | 9 | 5555 | 11111 |
100 | 512 | 44 | 1136 | 2272 |
100000 | 1024 | 8 | 12500 | 12500 |
10000 | 1024 | 9 | 11111 | 11111 |
1000 | 1024 | 13 | 7692 | 7692 |
100 | 1024 | 54 | 1851 | 1851 |
100000 | 2048 | 15 | 13333 | 6666 |
10000 | 2048 | 17 | 11764 | 5882 |
1000 | 2048 | 22 | 9090 | 4545 |
100 | 2048 | 67 | 2985 | 1492 |
C | s | time | KB/s | t/s |
100000 | 20 | 2 | 976 | 50000 |
10000 | 20 | 1 | 1953 | 100000 |
1000 | 20 | 2 | 976 | 50000 |
100 | 20 | 3 | 651 | 33333 |
100000 | 100 | 2 | 4882 | 50000 |
10000 | 100 | 2 | 4882 | 50000 |
1000 | 100 | 2 | 4882 | 50000 |
100 | 100 | 6 | 1627 | 16666 |
100000 | 512 | 3 | 16666 | 33333 |
10000 | 512 | 3 | 16666 | 33333 |
1000 | 512 | 4 | 12500 | 25000 |
100 | 512 | 21 | 2380 | 4761 |
100000 | 1024 | 3 | 33333 | 33333 |
10000 | 1024 | 4 | 25000 | 25000 |
1000 | 1024 | 7 | 14285 | 14285 |
100 | 1024 | 41 | 2439 | 2439 |
100000 | 2048 | 4 | 50000 | 25000 |
10000 | 2048 | 5 | 40000 | 20000 |
1000 | 2048 | 12 | 16666 | 8333 |
100 | 2048 | 80 | 2500 | 1250 |
C | s | time | KB/s | t/s |
100000 | 20 | 1 | 1953 | 100000 |
10000 | 20 | 1 | 1953 | 100000 |
1000 | 20 | 4 | 488 | 25000 |
100 | 20 | 23 | 84 | 4347 |
100000 | 100 | 2 | 4882 | 50000 |
10000 | 100 | 2 | 4882 | 50000 |
1000 | 100 | 5 | 1953 | 20000 |
100 | 100 | 32 | 305 | 3125 |
100000 | 512 | 5 | 10000 | 20000 |
10000 | 512 | 5 | 10000 | 20000 |
1000 | 512 | 9 | 5555 | 11111 |
100 | 512 | 42 | 1190 | 2380 |
100000 | 1024 | 10 | 10000 | 10000 |
10000 | 1024 | 11 | 9090 | 9090 |
1000 | 1024 | 14 | 7142 | 7142 |
100 | 1024 | 59 | 1694 | 1694 |
100000 | 2048 | 21 | 9523 | 4761 |
10000 | 2048 | 21 | 9523 | 4761 |
1000 | 2048 | 25 | 8000 | 4000 |
100 | 2048 | 78 | 2564 | 1282 |
Comments:
Some performance data gathered from the WWW.
SR Office DriveMark 2002 in IO/Sec taken from [Ra01]:
Manufacturer | Model | I/O operations/second |
Seagate | Cheetah X15-36LP (36.7 GB Ultra160/m SCSI) | 485 |
Maxtor | Atlas 10k III (73 GB Ultra160/m SCSI) | 455 |
Fujitsu | MAM3367 (36 GB Ultra160/m SCSI) | 446 |
IBM | Ultrastar 36Z15 (36.7 GB Ultra160/m SCSI) | 402 |
Western Digital | Caviar WD1000BB-SE (100 GB ATA-100) | 397 |
Seagate | Cheetah 36ES (36 GB Ultra160/m SCSI) | 373 |
Fujitsu | MAN3735 (73 GB Ultra160/m SCSI) | 369 |
Seagate | Cheetah 73LP (73.4 GB Ultra160/m SCSI) | 364 |
Western Digital | Caviar WD1200BB (120 GB ATA-100) | 337 |
Seagate | Cheetah 36XL (36.7 GB Ultra 160/m SCSI) | 328 |
IBM | Deskstar 60GXP (60.0 GB ATA-100) | 303 |
Maxtor | DiamondMax Plus D740X (80 GB ATA-133) | 301 |
Seagate | Barracuda ATA IV (80 GB ATA-100) | 296 |
Quantum | Fireball Plus AS (60.0 GB ATA-100) | 295 |
Quantum | Atlas V (36.7 GB Ultra160/m SCSI) | 269 |
Seagate | Barracuda 180 (180 GB Ultra160/m SCSI) | 249 |
Maxtor | DiamondMax 536DX (100 GB ATA-100) | 248 |
Seagate | Barracuda 36ES (36 GB Ultra160/m SCSI) | 222 |
Seagate | U6 (80 GB ATA-100) | 210 |
Samsung | SpinPoint P20 (40.0 GB ATA-100) | 192 |
ZD Business Disk WinMark 99 in MB/sec:
Manufacturer | Model | MB/second |
Seagate | Cheetah X15-36LP (36.7 GB Ultra160/m SCSI) | 13.1 |
Maxtor | Atlas 10k III (73 GB Ultra160/m SCSI) | 12.0 |
IBM | Ultrastar 36Z15 (36.7 GB Ultra160/m SCSI) | 11.3 |
Fujitsu | MAM3367 (36 GB Ultra160/m SCSI) | 11.1 |
Seagate | Cheetah 36ES (36 GB Ultra160/m SCSI) | 10.5 |
Seagate | Cheetah 73LP (73.4 GB Ultra160/m SCSI) | 10.2 |
Seagate | Cheetah 36XL (36.7 GB Ultra 160/m SCSI) | 9.9 |
Western Digital | Caviar WD1000BB-SE (100 GB ATA-100) | 9.8 |
Fujitsu | MAN3735 (73 GB Ultra160/m SCSI) | 9.1 |
Western Digital | Caviar WD1200BB (120 GB ATA-100) | 8.9 |
IBM | Deskstar 60GXP (60.0 GB ATA-100) | 8.8 |
Seagate | Barracuda ATA IV (80 GB ATA-100) | 8.5 |
Maxtor | DiamondMax Plus D740X (80 GB ATA-133) | 8.0 |
Quantum | Atlas V (36.7 GB Ultra160/m SCSI) | 7.9 |
Quantum | Fireball Plus AS (60.0 GB ATA-100) | 7.7 |
Seagate | Barracuda 36ES (36 GB Ultra160/m SCSI) | 7.4 |
Seagate | Barracuda 180 (180 GB Ultra160/m SCSI) | 7.1 |
Maxtor | DiamondMax 536DX (100 GB ATA-100) | 6.9 |
Samsung | SpinPoint P20 (40.0 GB ATA-100) | 6.5 |
Seagate | U6 (80 GB ATA-100) | 6.3 |
The file and web server benchmarks (also available at [Ra01]) are not useful since they use 80 and 100 percent read accesses, which is not typical of MTA servers.
Some preliminary, very simple performance tests with Berkeley DB 4.0.14 have been made. Two benchmark programs have been used: bench_001 and bench_002, which use Btree and Queue as access methods, respectively. They are based on examples_c/bench_001.c, which comes with Berkeley DB. Notice: the access method Queue requires fixed-size records, and the access key is a (monotonically increasing) record number. This method may be used for the backup of the incoming EDB. Notice: the tests have not (yet) been run multiple times, at least not systematically; testing showed that the runtimes may vary noticeably. However, the data can be used to show some trends.
Possible parameters are:
-n N | number of records to write |
-T N | use transactions, synchronize after N transactions |
-l N | length of data part |
-C N | do a checkpoint every N actions and possibly remove logfile |
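The core of such a benchmark is sketched below, under the assumption that the programs follow the structure of examples_c/bench_001.c: open a transactional environment, open a Queue (or Btree) database, and commit a transaction after every -T records. The method-style 4.x API is used; exact signatures vary slightly between releases (e.g., DB->open() gained a transaction argument in 4.1).

/*
 * Sketch of a transactional Berkeley DB write loop in the spirit of
 * bench_002 (assumed structure; the real program may differ).
 */
#include <stdlib.h>
#include <string.h>
#include <db.h>

#define NRECS   100000          /* -n: number of records */
#define DATALEN 20              /* -l: length of data part */
#define TGROUP  1000            /* -T: commit/sync after this many records */

int
main(void)
{
    DB_ENV *dbenv;
    DB *dbp;
    DB_TXN *txn = NULL;
    DBT key, data;
    db_recno_t recno;
    char buf[DATALEN];
    u_int32_t i;

    memset(buf, 'x', sizeof(buf));
    if (db_env_create(&dbenv, 0) != 0 ||
        dbenv->open(dbenv, "db-home", DB_CREATE | DB_INIT_LOCK |
            DB_INIT_LOG | DB_INIT_MPOOL | DB_INIT_TXN, 0) != 0)
        exit(1);
    if (db_create(&dbp, dbenv, 0) != 0)
        exit(1);
    dbp->set_re_len(dbp, DATALEN);      /* Queue requires fixed-size records */
    if (dbp->open(dbp, "bench.db", NULL, DB_QUEUE, DB_CREATE, 0644) != 0)
        exit(1);

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = &recno;                  /* DB_APPEND returns the record number */
    key.ulen = sizeof(recno);
    key.flags = DB_DBT_USERMEM;
    data.data = buf;
    data.size = sizeof(buf);

    for (i = 0; i < NRECS; i++) {
        if (i % TGROUP == 0) {          /* group commit: sync the log */
            if (txn != NULL && txn->commit(txn, 0) != 0)
                exit(1);
            if (dbenv->txn_begin(dbenv, NULL, &txn, 0) != 0)
                exit(1);
        }
        if (dbp->put(dbp, txn, &key, &data, DB_APPEND) != 0)
            exit(1);
    }
    if (txn != NULL)
        txn->commit(txn, 0);
    dbp->close(dbp, 0);
    dbenv->close(dbenv, 0);
    return (0);
}

A run corresponding to the -T 1000, -l 512 rows below would presumably be invoked as bench_002 -n 100000 -T 1000 -l 512.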
Unless otherwise noted, the following tests have been performed on system 1, see Section 5.2.1. Number of records is 100000 unless otherwise noted, t/s is transactions (records written) per second.
Vary synchronization (-T):
Prg | -T | -l | real | user | sys | KB/s | t/s |
1 | 100000 | 20 | 14.73 | 5.99 | 1.00 | 132 | 6788 |
1 | 10000 | 20 | 14.64 | 5.85 | 1.29 | 133 | 6830 |
1 | 1000 | 20 | 18.14 | 6.02 | 1.10 | 107 | 5512 |
1 | 100 | 20 | 70.57 | 6.03 | 1.76 | 27 | 1417 |
2 | 100000 | 20 | 11.58 | 2.91 | 0.74 | 168 | 8635 |
2 | 10000 | 20 | 10.14 | 2.86 | 0.85 | 192 | 9861 |
2 | 1000 | 20 | 11.20 | 2.85 | 0.95 | 174 | 8928 |
2 | 100 | 20 | 68.71 | 2.73 | 1.61 | 28 | 1455 |
Vary data length, first program only:
Prg | -T | -l | real | user | sys | KB/s | t/s |
1 | 100000 | 20 | 14.39 | 5.93 | 1.16 | 135 | 6949 |
1 | 10000 | 20 | 16.77 | 5.91 | 1.16 | 116 | 5963 |
1 | 1000 | 20 | 16.58 | 5.91 | 1.13 | 117 | 6031 |
1 | 100 | 20 | 68.10 | 5.95 | 1.85 | 28 | 1468 |
1 | 100000 | 100 | 23.30 | 5.57 | 1.90 | 419 | 4291 |
1 | 10000 | 100 | 30.56 | 5.56 | 1.90 | 319 | 3272 |
1 | 1000 | 100 | 33.39 | 5.51 | 1.99 | 292 | 2994 |
1 | 100 | 100 | 82.58 | 5.47 | 2.62 | 118 | 1210 |
1 | 100000 | 512 | 96.03 | 7.69 | 4.78 | 520 | 1041 |
1 | 10000 | 512 | 94.12 | 7.39 | 5.03 | 531 | 1062 |
1 | 1000 | 512 | 97.67 | 7.20 | 5.15 | 511 | 1023 |
1 | 100 | 512 | 164.13 | 7.51 | 5.67 | 304 | 609 |
1 | 100000 | 1024 | 304.88 | 10.88 | 10.62 | 327 | 327 |
1 | 10000 | 1024 | 270.00 | 10.69 | 10.66 | 370 | 370 |
1 | 1000 | 1024 | 275.27 | 10.91 | 11.06 | 363 | 363 |
1 | 100 | 1024 | 346.10 | 11.01 | 12.09 | 288 | 288 |
1 | 100000 | 2048 | 788.88 | 22.18 | 27.59 | 253 | 126 |
The test was aborted at this point; it may be repeated later.
Vary data length, second program only:
Prg | -T | -l | real | user | sys | KB/s | t/s |
2 | 100000 | 20 | 9.46 | 2.81 | 0.80 | 206 | 10570 |
2 | 10000 | 20 | 11.53 | 2.88 | 0.81 | 169 | 8673 |
2 | 1000 | 20 | 12.47 | 2.83 | 0.96 | 156 | 8019 |
2 | 100 | 20 | 67.91 | 2.80 | 1.59 | 28 | 1472 |
2 | 100000 | 100 | 13.57 | 2.92 | 1.20 | 719 | 7369 |
2 | 10000 | 100 | 18.62 | 3.07 | 1.17 | 524 | 5370 |
2 | 1000 | 100 | 19.04 | 2.92 | 1.20 | 512 | 5252 |
2 | 100 | 100 | 72.73 | 2.80 | 2.16 | 134 | 1374 |
2 | 100000 | 512 | 46.10 | 3.90 | 2.61 | 1084 | 2169 |
2 | 10000 | 512 | 53.55 | 3.84 | 2.79 | 933 | 1867 |
2 | 1000 | 512 | 66.71 | 3.65 | 3.05 | 749 | 1499 |
2 | 100 | 512 | 105.25 | 3.36 | 3.76 | 475 | 950 |
2 | 100000 | 1024 | 103.72 | 4.92 | 4.68 | 964 | 964 |
2 | 10000 | 1024 | 105.53 | 4.87 | 4.82 | 947 | 947 |
2 | 1000 | 1024 | 105.60 | 4.73 | 4.85 | 946 | 946 |
2 | 100 | 1024 | 145.14 | 4.73 | 5.84 | 688 | 688 |
2 | 100000 | 2048 | 194.70 | 7.44 | 8.09 | 1027 | 513 |
2 | 10000 | 2048 | 197.09 | 7.22 | 8.15 | 1014 | 507 |
2 | 1000 | 2048 | 200.09 | 7.10 | 8.70 | 999 | 499 |
2 | 100 | 2048 | 234.85 | 6.86 | 9.53 | 851 | 425 |
Put the directory for logfiles on a different disk (/extra/home/ca/tmp/db), using Btree.
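Placing the transaction log on a different disk is presumably done by pointing the environment at a separate log directory; one way (an assumption about the test setup, not taken from the benchmark sources) is a DB_CONFIG file in the environment home containing

set_lg_dir /extra/home/ca/tmp/db

or, equivalently, a call to DB_ENV->set_lg_dir() before DB_ENV->open().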
Prg | -T | -l | real | user | sys | KB/s | t/s |
1 | 100000 | 20 | 14.90 | 6.05 | 0.96 | 131 | 6711 |
1 | 10000 | 20 | 14.46 | 5.95 | 1.12 | 135 | 6915 |
1 | 1000 | 20 | 17.70 | 5.83 | 1.08 | 110 | 5649 |
1 | 100 | 20 | 63.91 | 5.92 | 1.74 | 30 | 1564 |
1 | 100000 | 100 | 27.00 | 5.53 | 1.90 | 361 | 3703 |
1 | 10000 | 100 | 33.39 | 5.63 | 1.92 | 292 | 2994 |
1 | 1000 | 100 | 29.16 | 5.63 | 1.75 | 334 | 3429 |
1 | 100 | 100 | 72.18 | 5.44 | 2.42 | 135 | 1385 |
1 | 100000 | 512 | 96.94 | 7.49 | 5.09 | 515 | 1031 |
1 | 10000 | 512 | 107.99 | 7.34 | 5.17 | 463 | 926 |
1 | 1000 | 512 | 97.05 | 7.21 | 5.54 | 515 | 1030 |
1 | 100 | 512 | 145.15 | 7.85 | 5.36 | 344 | 688 |
1 | 100000 | 1024 | 268.88 | 10.67 | 11.54 | 371 | 371 |
1 | 10000 | 1024 | 279.65 | 11.02 | 11.05 | 357 | 357 |
1 | 1000 | 1024 | 304.07 | 10.58 | 11.69 | 328 | 328 |
1 | 100 | 1024 | 319.74 | 10.88 | 12.10 | 312 | 312 |
1 | 100000 | 2048 | 738.38 | 23.07 | 27.13 | 270 | 135 |
1 | 10000 | 2048 | 651.86 | 22.70 | 26.92 | 306 | 153 |
1 | 1000 | 2048 | 693.13 | 21.79 | 28.63 | 288 | 144 |
1 | 100 | 2048 | 724.68 | 22.51 | 29.04 | 275 | 137 |
Put the directory for logfiles on a different disk (/extra/home/ca/tmp/db), using Queue.
Prg | -T | -l | real | user | sys | KB/s | t/s |
2 | 100000 | 20 | 10.92 | 2.90 | 0.65 | 178 | 9157 |
2 | 10000 | 20 | 9.94 | 2.87 | 0.77 | 196 | 10060 |
2 | 1000 | 20 | 31.66 | 2.85 | 0.88 | 61 | 3158 |
2 | 100 | 20 | 60.74 | 2.93 | 1.36 | 32 | 1646 |
2 | 100000 | 100 | 13.62 | 3.09 | 0.95 | 717 | 7342 |
2 | 10000 | 100 | 19.30 | 3.02 | 1.17 | 505 | 5181 |
2 | 1000 | 100 | 15.55 | 3.16 | 1.08 | 628 | 6430 |
2 | 100 | 100 | 71.88 | 2.97 | 1.72 | 135 | 1391 |
2 | 100000 | 512 | 52.08 | 3.93 | 2.50 | 960 | 1920 |
2 | 10000 | 512 | 52.42 | 3.68 | 3.03 | 953 | 1907 |
2 | 1000 | 512 | 56.58 | 3.91 | 2.90 | 883 | 1767 |
2 | 100 | 512 | 95.38 | 3.74 | 3.64 | 524 | 1048 |
2 | 100000 | 1024 | 107.20 | 4.69 | 4.87 | 932 | 932 |
2 | 10000 | 1024 | 100.15 | 4.88 | 4.57 | 998 | 998 |
2 | 1000 | 1024 | 100.95 | 4.78 | 5.06 | 990 | 990 |
2 | 100 | 1024 | 139.38 | 4.71 | 5.61 | 717 | 717 |
2 | 100000 | 2048 | 187.78 | 7.68 | 8.41 | 1065 | 532 |
2 | 10000 | 2048 | 189.76 | 7.09 | 8.62 | 1053 | 526 |
2 | 1000 | 2048 | 201.95 | 7.37 | 8.65 | 990 | 495 |
2 | 100 | 2048 | 217.66 | 7.21 | 9.53 | 918 | 459 |
Machine 2b: Vary data length, first program:
Prg | -T | -l | real | user | sys | KB/s | t/s |
1 | 100000 | 20 | 21.56 | 9.04 | 1.88 | 90 | 4638 |
1 | 10000 | 20 | 13.02 | 9.58 | 1.92 | 150 | 7680 |
1 | 1000 | 20 | 12.64 | 9.40 | 1.81 | 154 | 7911 |
1 | 100 | 20 | 16.35 | 9.68 | 1.73 | 119 | 6116 |
1 | 100000 | 100 | 32.79 | 9.16 | 4.60 | 297 | 3049 |
1 | 10000 | 100 | 25.05 | 9.54 | 4.11 | 389 | 3992 |
1 | 1000 | 100 | 23.69 | 9.80 | 4.39 | 412 | 4221 |
1 | 100 | 100 | 28.51 | 10.25 | 3.89 | 342 | 3507 |
1 | 100000 | 512 | 47.67 | 13.82 | 13.65 | 1048 | 2097 |
1 | 10000 | 512 | 48.04 | 13.22 | 13.64 | 1040 | 2081 |
1 | 1000 | 512 | 46.35 | 13.16 | 14.54 | 1078 | 2157 |
1 | 100 | 512 | 52.10 | 13.78 | 11.93 | 959 | 1919 |
1 | 100000 | 1024 | 109.32 | 21.59 | 25.00 | 914 | 914 |
1 | 10000 | 1024 | 107.94 | 19.97 | 26.49 | 926 | 926 |
1 | 1000 | 1024 | 108.74 | 20.13 | 26.06 | 919 | 919 |
1 | 100 | 1024 | 113.14 | 20.01 | 26.45 | 883 | 883 |
1 | 100000 | 2048 | 240.16 | 44.55 | 55.72 | 832 | 416 |
1 | 10000 | 2048 | 262.05 | 43.58 | 54.94 | 763 | 381 |
1 | 1000 | 2048 | 245.93 | 41.17 | 57.54 | 813 | 406 |
1 | 100 | 2048 | 254.97 | 41.39 | 59.63 | 784 | 392 |
Vary data length, second program:
Prg | -T | -l | real | user | sys | KB/s | t/s |
2 | 100000 | 20 | 9.85 | 5.92 | 1.30 | 198 | 10152 |
2 | 10000 | 20 | 7.82 | 5.90 | 1.28 | 249 | 12787 |
2 | 1000 | 20 | 7.21 | 5.13 | 1.34 | 270 | 13869 |
2 | 100 | 20 | 10.36 | 5.79 | 1.23 | 188 | 9652 |
2 | 100000 | 100 | 10.22 | 5.84 | 2.73 | 955 | 9784 |
2 | 10000 | 100 | 10.54 | 6.11 | 2.72 | 926 | 9487 |
2 | 1000 | 100 | 10.68 | 6.12 | 2.40 | 914 | 93 |