Defining Programming Languages

Analysis Requirements & Principles architecture:
Paradigms
substance:
Abstractions
structure:
Domains
building blocks: Features surface:
Syntactics
Defining PLs Language
List
foundational safety flexibilized

  1. General
  2. Defining Morphology and Syntax
  3. Notations for Checking Rules
  4. Styles of Defining Computational Semantics
  5. Some Techniques of Defining Computational Semantics
  6. OOPL State Models
  7. Calculi/Models as Semantic Basis for Semantics-defining Formalisms

 
As online examples of complete programming language definitions in natural language check out Ada, or C++. If you know further references, please don't hesitate to tell me (Ulf Schünemann), likewise if you have any suggestions.

General

All modern programming languages are extendible by new, user-defined names. A large part of a PL can be described as if it were user-defined, just that this ``extension'' cannot be removed or skipped (it is always there) and that it is the same for all users. In analogy to other extensions, this ``extension'' is called the "standard library" or "language library". Based on this observation, a language description can be split into two main parts:
  1. One part describes the standard or language library. In linguistics, this would be called the language's lexicon
  2. The other part describes the core language, i.e., that what is left after taking away the library.

General Guidelines

  • The terminology for PL descriptions is standardized by ISO/IEC's "Information technology -- Vocabulary" series (also called "Dictionary for Information Processing Systems", used e.g. used in the C standard) (cf. INCITS):
    1. Fundamental Terms (ISO/IEC 2382:1993)
    2. Arithmetic and logic operations
    3. Equipment technology
    4. Organization of Data
    5. Representation of data
    6. Computer programming
    7. Data Communication
    8. Operating techniques & facilities
    9. Processing Units
    10. Peripheral equipment
    11. Computer Graphics
    12. Databases
    13. Analog computing
    14. Systems developmentsystems and technical processes
    15. Interfaces between process computer
    16. Calculator
    17. Text Processing
    18. Computer-integrated manufacturing (CIM)
    19. Local Area Networks (LAN)
    20. Open Systems Interconnection Architecture
    21. Office Automation
    22. Artificial Intelligence - Basic concepts and expert systems
  • Lessons from the standardization of time: On Julius Caesar, Queen Eanfleda, and the lessons from time past [Brian Meek; 49-53: StandardView 1/2; ACM Dec 1993].
  • Proper definitions are just complicated and take their time to write: Of chips, numbers and all that [Brian Meek; 130-131: StandardView 2/3; ACM Sep 1994]
  • [TAOOP] Carl A Gunter, John C Mitchell (eds): Theoretical Aspects of Object-oriented Programming: types, semantics, and language design; MIT Press, 1994.
  • see also The Programming Language Research Community

    Language Design Tools / Frameworks

  • A discussion: Generating Compilers from Formal Semantics. Description of some compiler generators.
  • APPLAB supports the interactive design, development and use of domain-specific extensions of a base-language (by extending attribute grammars) [E Bjarnson et.al.: APPLAB - An Application Language Laboratory; Tech Rpt, Dpt CS, Lund Univ, 1997]
  • TinkerType is a framework for modular (feature-by-feature) construction of formal languages & their type systems by Michael Y Levin and Benjamin C Pierce

    Defining Morphology and Syntax

    1. Morphology deals with the structure of words (alphanumeric identifiers, numbers, symbolic tokens, etc.), and the grammatical and semantical categories they convey.
    2. Syntax deals with the structure of ``sentences,'' i.e., programs, above the word level.
    It is common-place since Fortran to describe morphology and syntax of PLs formally. The following formalisms (with different expressive power) are at hand:
    1. regular expressions (least powerful) -- for describing word structure (morphology)
    2. phrase structure grammars [Chomsky 1951] -- for describing syntax (with or without context-senstive rules). Used for programming languages (since Algol60) as BNF (Backus-Naur-Form) [1959/60]
    3. EBNF (Extended Backus Naur Form) [Wirth 1977] -- for describing syntax. Combines BNF with regular expressions. It is a most successful modern formal notation.
    4. transformation grammars -- for describing syntax
    5. categorial grammars [Ajdukiewicz 1935] -- for describing syntax, developed by logicians and linguists
    For those who want not only to describe a language but to implement language processors: There are many tools which take EBNF like definitions of morphology and syntax to produce so-called ``scanner'' and ``parser'' program modules which can take an input text and analyze its words and syntax, respectively.

    Notations for Checking Rules

    The best know application is type checking. Have a look at linguistics to see that the term semantic checking is justified. This level of a language definition is also called "static semantics", "context-sensitive syntax", or "well-formedness".
    A discussion of the term "static semantics" vs. "syntax" vs. "context-sensitivity" vs. "well-formedness": Brian Meek: The static semantics file; 33-42 in: Sigplan Notices 25/4; ACM, April 1990,

    Checking rules are described in different styles:

    A. Natural language.

    B. Extensions of context-free grammars.

    1. context-sensitive grammars
    2. attribute grammars
    3. two-level grammars = ?

    C. Inference rules, deductive systems (Post systems). «Grammars are examples of formal systems. They are useful in describing certain aspects of programming languages. ... Deductive systems are useful in various capacities in reasoning about programming languages. Deductive systems are rarely presented as grammars; they are more often given as Post systems, named after Emil Post, who first used them.» [Stansifer, 62]
    --> General definition of Post systems as a formal, symbol-manipulating system.
    (bcprules.sty -- a LaTeX style for type setting inference rules by Benjamin Pierce)
    Deductive systems infer the static type of a program term e in the context of the types of the free identifiers in e. The formal structure relating free identifiers with types has variously described:


    Styles of Defining Computational Semantics

    [LTCO] José Messeguer: A Logical Theory of Concurrent Objects; 101-115, OOPSLA/ECOOP'90.
    [OSPE] Andrew M Pitts: Operational Semantics and Program Equivalence; Proceedings of APPSEM'2000 (International Summerschool on Applied Semantics in Caminha, Minho, Portugal Sep. 2000); INRIA, 2000.
    G D Plotkin: A structural approach to operational semantics; TR DAIMI FN-19; Aarhus Univ 1981.
    A K Wright, M Felleisen: A syntactic approach to type soundness; 38-94 in Information and Computation 115; 1994.
    [AM/NS] Sabine Glesner: ASMs versus Natural Semantics: A Comparison with New Insights; 293-309: ASM'03 (LNCS 2589); 2003.

    «Grammars are convenient enough for syntax that they are used by all communities: programmers, implementers, designers.

    Semantic descriptions, however, are seldom both readable enough to be suitable for a beginner and precise enought to specify a language fully. Several distinct styles of language descriptions have arised to meet the conflicting needs of readability and precision:
  • Tutorials ...
  • Reference Manuals ...
  • Formal Definitions ...» [Sethi 26]

    Recursivity of the Definition

    All styles in use are syntax directed. That is, the definition is a system of definitions which define for each syntactic class X in the abstract syntax (in particular classes of statements and expressions), what computation an X-term e means [relative to some model of the computation]. The styles can be distinguished by the parts of the program whose meaning is accessible for the definition of e's meaning. For certain kinds of terms, there may be no difference because the same parts are used in all styles. The difference shows up in computationally more complex terms, like the while loop of imperative programming:
    proper subterms of e
    [for while: + fixed point operator]
    subterms of e
    [for while: + recursion]
    any constructed term
    [for while: "unfolding" using e's subterms]
    [[while(e) do s]]
    = fix F. st ->
    { st if not [[e]](st)
    F([[s]](st)) if [[e]](st)
    [[while(e) do s]]
    = st ->
    { st if not [[e]](st)
    [[while(e) do s]]([[s]](st)) if [[e]](st)
    [[while(e) do s]]
    = [[if(e) {s; while(e) do s}]]
    eg. in denotational semantics eg. in big-step semantics eg. in small-step semantics, rewrite systems

    Use of Abstract Syntax Tree

    «Both big-step semantics and ASMs use the abstract syntax trees of programs in their specifications but do not modify it during program execution. In contrast, small-step semantics explicitly rewrites the abstract syntax trees during execution. In general, a small-step semantics defines a term-rewriting system. Starting with the initial program as initial continuation program, during each state transition, the current continuation program is rewritten until the empty program is reached. In each state, the continuation program represents the computation which still needs to be done. In contrast, ASMs represent the remaining computation in a given state as a pointer to the node in the abstract syntax tree which is executed next. In each state transition, this pointer is updated» [AM/NS 294].

    There are different styles of formal definition, using deduction rules for each syntactic construct which collectively define the semantics of program terms M, N etc, especially function definitions (lambda abstractions) fun(a)L (N'), in a data state s. As an example of comparison we use the rules defining function application (call) with call-by-value parameter passing.

    For those semantics based on a deduction system (c.f. Post systems above), there are basically two dimensions of classification:

    • What is the structure Rel of premises and conclusions?
    • How does the conclusion depend on the premises?
    rewrite semantics vs. evaluation semantics
    big-step vs. small-step semantics
      Rel structure term semantics [[M]] [[M]] =
    rewrite steps Term --> Term Ground { N | M --> ... --> N and N irreducible }
    complete rewrite Term --> Ground Ground { N | M --> N }
    rewrite/valuation(Term mix BaseVal) --> BaseVal Generic BaseVal { X | M --> X }
    valuation (Term mix Value) --> Generic Value Value { v | M --> v }
    with state
    rewrite steps Term × State --> Term × State State × (Ground × State) { (s, (N, s')) | (M,s) --> ... --> (N, s') and (N, s') irreducible }
    complete rewrite Term × State -->
    Ground × State
    State ×
    (Ground × State)
    { (s, (N,s')) | (M,s) --> (N,s') }
    rewrite/valuationTerm × State -->
    (Generic BaseVal) × State
    State ×
    ((Generic BaseVal) × State)
    { (s, (X,s')) | (M,s) --> (X,s') }
    valuation Term × State --> Value × State State × (Value × State) { (s, (v,s')) | (M,s) --> (v,s') }
    where Term is a program term (a syntactic domain), Ground Term is an irreducible program term (a syntactic domain), Generic Ground is an irreducible generic program term (a syntactic domain), State is a mathematical environment and state model (a semantic domain), Value is a mathematical model of all values, including functions/generics (a semantic domain), and BaseVal Value is a mathematical model of non-generic values (a semantic domain).
    1. Abstract Machine operational semantics: «tend to pull the syntax of [language] constructs appart and build non-intuitive auxiliary data structures» [OSPE].
    2. Structural Operational Semantics (SOS), popularized by Plotkin (1981), is a syntax-directed approach for specifying the semantics of programs (which simplifies proofs by induction) [OSPE].
      1. Transition semantics = reduction semantics = small-step semantics (Plotkin's form of SOS) [OSPE].
        There are two kinds of derivation rules for the relation:
        • Rules that give the basic steps of reduction (axioms of the semantics' derivation system).
          E.g. (fun(x)L (v), s) --> (L, s[v/x])
        • Rules for simplification steps that say how reductions are performed within a context. The latter can be more succinctly specified using the notion of evaluation context by Wright and Felleisen (1994).
          (M, s) --> (M', s')
          (M(N), s) --> (M'(N), s')
             
          (N, s) --> (N', s')
          (fun(x)L (N), s) --> (fun(x)L (N'), s')

        Based on them computation proceeds with the rewriting logic's deduction rules until an irreducible configuration is reached, i.e. one that cannot be rewritten any more.
      2. Evaluation semantics = big-step semantics = natural semantics (Kahn 1987) = relational semantics (Milner) [OSPE] has deduction rules of the following form:
        (M, s) --> (fun(a)L, s');   (N, s') --> (x, s'');   (L, s''[x/a]) --> (y, s''')
        (M(N), s) --> (y, s''')
    3. In a denotational semantics deduction rules have the following form:
      (M, s) --> (f, s');   (N, s') --> (x, s'');   f(x) = y
      (M(N), s) --> (y, s'')
      The irreducible values are only single-token literal constants, or primitive values, respectively; function terms evaluate to mappings f. Treatment of recursion and loops requires least fixed points and domain theory. Category theory is used for modelling type systems. Rel (Term × State) × (NonTerm × State).
    4. Axiomatic semantics TODO c.f. Hoare logic, Hoare triples

    ``In comparing the denotational semantics with the operational semantics, the denotational semantics does not seem to be much simpler. It may even be argued that it is a great deal more complex, because it requires an understanding of fixed points. The primary advantage of the denotational semantics is the intuitive explanation it provides [for OO inheritance]. It suggests that inheritance may be useful for other kinds of recursive structures, like types and functions in addition to classes. It reveals that inheritance, while a natural extension of existing mechanisms,, does provide expressive power not found in conventional languages by allowing more flexible use of the fixed point function.'' [William Cook, Jens Palsberg: A Denotational Semantics of Inheritance and its Correctness; 329-350 in: Information and Computing 114; 1994]


    Some Techniques of Defining Computational Semantics

    1. Evaluation contexts are runtime terms E "with a whole" [], and E[M] is the substitution of M into the whole. Evaluation context are an elegant way of formalizing the notion of point-of-control in a (runtime) program P (aka. the redex in a semantics by reduction rules): Evaluation context are defined so that any runtime program P decomposes uniquely into P = E[M], where M is a, in some sense basic, term. Then M is the point-of-control in P. More precisely---since M might occur several times as subterm in P---the point of control is at that occurence of M in P which is substituted to [] in E.
      Evaluation context were used first in M Felleisen, D Friedman, E Kohlbecker, B Duba: Reasoning with continuations (131-141: Proc First Symposium on Logic in Computer Science; IEEE 1986) and A syntactic theory of sequential control (205-237: Theoretical Computer Science 52(2); 1987)

      Normally, evaluation can be defined by rewriting the redex (the basic term at the point-of-control). Special, eg., exit in C: E[exit v] --> v. Or call/cc in Scheme: E[call/cc F] --> F(val-from-F.abort(E[val-from-F])) where E[abort E] --> E.

    2. ... please add your own techniques ...

    OOPL State Models


    [CfS] John Boyland, James Noble, William Retert: Capabilities for Sharing: A Generalisation of Uniqueness and Read-Only; ECOOP'01.
    [PolyTOIL] Kim B Bruce, Adrian Fiech, Angela Schuett, Robert van Gent: PolyTOIL: A type-safe polymorphic object-oriented language; ACM Transactions on Programming Languages and Systems 2002.
    [Joe] David Clarke, Sophia Drossopoulou: Ownership, Encapsulation and the Disjointness of Type and Effect; OOPSLA'02.
    [OOC] David Clarke: Object Ownership & Containment; PhD thesis, Univ. of New South Wales 2001.
    [OT] David G Clarke, John M Potter, James Nobel: Ownership Types for Flexible Alias Protection; OOPSLA'98.
    [JTS] Sophia Drossopoulou, Susan Eisenbach: Java is Type Safe --- Probably; ECOOP'97.
    [Univ] Peter Müuller, Arnd Poetzsch-Heffter: Universes: A Type System for Alias and Dependency Control; Informatik Berichte 279, Fernuniversitäat Hagen 2001.
    [J/Isa] David von Oheimb: Analyzing Java in Isabelle/HOL: Formalization, Type Safety and Hoare Logic; Dissertation, Technische Universit&aauml'at Müunchen (Germany) 2001.
    [JavaS] D Syme:
    Proving JavaS Type Soundness; Tech Report, Comp Lab, Univ of Camb 1997.
    [Moby] Fisher, Reppy: Foundations for MOBY classes; Dec 1998.

    A state model for an OOPL has to relate

    1. object-IDs plus field labels to values --- for modeling instance variables, aka object fields
    2. local names (parameters, local variables, this) to values --- for modeling locals
    A some OOPL state models and their terminology:
    instance variables, aka object fields
    OID -> Id -> TermVal
    called (Object) Store [OT, Joe, CfS, Univ, Moby],
    or Store of Globals [J/Isa]
    OID -> Id -> ValLoc
    + ValLoc -> TermVal
    l
    o
    c
    a
    l
    s
    Id -> TermVal

    called
    Stack Frame [OT],
    Bindings [Joe], or
    Store of Locals [J/Isa]

  • [J/Isa] State = Store-of-globals × Store-of-locals
  • [JTS] State = OID->Id->Val Id->Val
  • [JavaS] State = OID->HeapObj (Id->Val)*
  • [Univ] State = {$} -> Store Id -> Val
  • [CfS] Store = (Id (OID x Id) -> Val
  • [OT] Store and Stack-frame*
  • [Joe] Store and Bindings
    • [PolyTOIL] Store = Loc -> Irr
      and Environment = Id -> Irr
      where
      Loc =^= OID ValLoc,
      Irr =^= PrimVal Loc,
      PrimVal OID =^= TermVal,
      Env =^= Id -> TermVal (for parameters)
      Id -> ValLoc (for local variables)
      reachable Store =^= OID -> Id -> ValLoc
      ValLoc -> TermVal
    Id -> ValLoc
    + ValLoc -> TermVal
     
    substitution
  • [Moby] Object store (and evaluation context stack)
  • [OOC] Store
  •  

    Calculi/Models as Semantic Basis for Semantics-defining Formalisms

    Computing with Rewriting Systems

    ``To compute ... one performs equational simplicitation by using the equations from left to right until no more simpliciations are possible. Note that this can be done concurrently, i.e., applying several equations at once'' [
    LTCO]

    Rewriting logic [LTCO] is a logic for deducing how to rewrite terms based on a set of rewrite rules from a rewrite theory (see above) with the first four following deduction rules. Equational logic is obtained from rewriting logic by adding the symmetry rule.
    1. Reflexivity: M --> M
    2. Congurence [c.f. eager evaluation]: f(M1, ... Mn) --> f(N1, ... Nn) if M1 --> N1 through Mn --> Nn
    3. Replacement for each rewrite rule f(x1, ... xn) --> g(x1, ... xn): f[M1/x1, ... Mn/xn] --> g[N1/x1, ... Nn/xn] if M1 --> N1 through Mn --> Nn
    4. Transitivity: M --> N if M --> L and L --> N

    5. Symmetry: M --> N if N --> M
    In rewrite logic, a sequent "M --> N" should be read "M becomes N". In equational logic, a sequent "M --> N" can be read "M equals N".
    ``[R]ewriting logic is a logic of becoming or change, not a logic of equality in the static Platonic sense. Adding the symmetry rule is a very strong restriction, namely assuming that all change is reversible, thus bringing us into a timeless Platonic realm in which "before" and "after" have been identified. [LTCO]

    Classification

    1. Syntactic rewriting:
      • Labelled transition systems (model concurrency as interleaving)
      • Functional programming ``corresponds to the case of confluent rules'':
        • Lambda calculus (in combinator form)
        • Herbrand-Gödel-Kleene theory of recursive functions
        • Algebraic datatypes (?)
          • Free datatypes
    2. Rewriting modulo associativity and neutral elements (e.g. nil for lists):
      • Post systems
        • Phrase structure grammars, including:
          • Turing machines
    3. Rewriting modulo associativity, commutativity, and neutral elements:
      • Petri nets
      • Chemical abstract machine [Berry and Boudol: The Chemical Abstract Machine; 91-94; POLP'90], specializes to:
        • CCS [Robin Milner: Communication and Concurrency; Prentice-Hall 1989]
      • Concurrent programming:
        • Unity's model of computation [K M Chandy, J Misra: Parallel Program Design: A Foundation; Addison-Wesley 1998]
        • Higher-level Petri nets for actors: POT and POP [Engelfriet: Net-based description of parallel object-based systems, or POTs and POPs; Technical Report, FOOL; 1990]
    ``[I]nitial algebra semantics is entirely inadequate to handle system modules with a nonfunctional behavior'' [LTCO] See there to find rewrite systems as models of rewrite theories (which are usually called rewrite systems themselves).

    For a comprehensive overview over all the formal models developed for reasoning about object-oriented language semantics you must read:
    Tom Mens: A survey on formal models for OO, 1994.


    Analysis Requirements & Principles Paradigms Abstractions Domains Features foundational safety flexible typing Syntactics Defining PLs Language List
    Ulf Schünemann 111099, 200901