Started on a draft RFC about type normalization

2025-05-04 10:33:46 +01:00 · 2022-03-09 20:37:49 -06:00 · 2022-03-09 20:37:49 -06:00 · a48c2e17cd
commit a48c2e17cd
parent dbdf91f3ca
1 changed files with 175 additions and 0 deletions
--- a/rfcs/type-normalization.md
+++ b/rfcs/type-normalization.md
@ -0,0 +1,175 @@
+# Type normalizaton
+
+## Summary
+
+Normalize types, for example removing redundant unions and
+intersections, to minimize memory usage and make user-visible inferred
+types easier to read.
+
+## Motivation
+
+Once local type inference lands, we will infer union types more
+often. This is good, in that we will be inferring more precise types,
+but does mean we have to consider the impact this has on memory, and
+the complexity of the user-visible types.
+
+Consider the program:
+```
+  function f()
+    local x:T = ...
+    local y:U = ...
+    if g() then return x else return y end
+  end
+```
+
+Currently, this program produces a type error, but with local type inference
+it does not, and instead introduces two subtyping constraints on the free return type:
+```
+  T <: R    U <: R
+```
+
+which is solved as
+```
+  R = T | U
+```
+
+This can produce unexpected types, for example `number|number`
+or `Animal|Cat`. Conversely, a program like
+
+```
+ local x:T = ...
+ local y:U = ...
+ local z = ...
+ x = z
+ y = z
+```
+
+will introduce a new free type `Z` for `z`, with constraints 
+```
+  Z <: T    Z <: U
+```
+
+which is solved as
+```
+  Z = T & U
+```
+
+This can also produce unexpected types, for example `number&number`
+or `Animal&Cat`.
+
+## Design
+
+In this section, we outline a number of possible designs. These fall into two broad camps: syntactic subtyping uses rewrite rules on the syntax of types, whereas semantic subtyping uses semantic models of types.
+
+### Syntactic subtyping
+
+#### Alternative: check for pointer equality before adding to a union/intersection
+
+To normalize a union or intersection, iterate through the options, adding them to a
+vector. Don't add them if the type is already present (a cheap check
+for pointer equality).
+
+Advantages: easy to implement; fast.
+
+Disadvantage: pointer equality is a very brittle test; it is easy to still get
+types `T|T`, caused by having two clones of `T`.
+
+#### Alternative: check for suptyping before adding to a union/intersection
+
+Ditto, but don't add a type if there's already a supertype in the
+vector (subtype in the case of intersections).
+
+Advantages: easy to implement (reuses the existing subtyping
+machinery). Does most cases we're interesred in (e.g. `number|number`
+or `Cat|Animal`).
+
+Disadvantages: doesn't deal with mixed intersections and unions
+(e.g. `Animal&Cat` doesn't simplify); depends on the order of types
+(e.g. `Cat&Animal` normalizes to `Cat`).
+
+#### Alternative: ditto but convert to disjunctive normal form first
+
+Use the fact that intersection distributes through union
+```
+  (T | U) & V  ==  (T & V) | (U & V)
+```
+
+to move union out of intersection before normalizing.
+
+Advantage: does a better job of some examples with tables:
+
+```
+   ({ p : T } | { q : U }) & { r : V }
+     == ({ p : T } & { r : V }) | ({ q : U } & { r : V })
+     == ({ p : T, r : V }) | ({ q : U, r : V })
+```
+
+Disadvantage: exponential blowup.
+
+#### Alternative: normalize union and intersection of tables to tables
+
+We can work up to the equivalence used in the last example, normalizing `{ p : T } & { q : U }` as `{ p : T, q : U }`.
+
+For read-write properties `{ p : T } & { p : U }` is inhabited only when `T == U`, otherwise it is an uninhabited type (so equivalent to `{ p : never }`).
+
+For read-only properties `{ get p : T } & { get p : U }` is inhabited by the same values as `{ get p : T&U }`. (Ditto union).
+
+Advantage: a better presentation of intersections of tables.
+
+Disadvantage: developers might be surprised that `{ p : number } & { p : number? }` is `{ p : never }`; recursive unions and intersections increase time complexity; there is less normalization of unions-of-tables.
+
+#### Alternative: normalize union and intersection of functions to functions
+
+Similarly, we can treat `(T -> U) | (T -> V)` as `T -> (U | V)` and `(T -> V) | (U -> V)` as `(T & U) -> V`.
+
+There is no good normalization of `(T -> U) | (V -> W)` in general. The obvious candidate is `(T & V) -> (U | W)` but there are functions of that type that are not of type `T -> U` or `V -> W`, such as `function(x) if random() then U.new() else W.new() end`.
+
+Advantage: normalizes some functions.
+
+Disadvantage: doesn't do what you expect. More recursive normalization. Has nasty interactions with overloaded functions. Requires union and intersection of type packs.
+
+## Semantic subtyping
+
+The idea here is to think of types as sets of values, where intersection and union have their usual interpretation on sets.
+
+Values can be thought of as trees, for example a table has children given by the properties, so types are sets of trees such as
+```
+     o         o
+    / \       / \
+{  p   q  ,  p   q  , ... }
+   ↓   ↓     ↓   ↓
+ true  1  false  5
+```
+
+is the set of trees corresponding to `{ p : boolean , q : number }`.
+
+In semantic subtyping, the subtyping replation is interpreted as subset order.
+
+In the same way that sets of strings are the languages of automata, sets
+of trees are the languages of *tree automata*. Many of the same
+techniques, such as minimization, construction of union and
+intersection, etc, apply to tree automata. The main difference is that
+nondeterministic top-down tree automata are strictly more powerful than derministic ones.
+
+Tree automata are given by rules of the form `q0 -> f(q1, ..., qN)` where
+
+ * each `qI` is a state of the automaton, and
+ * `f` is a symbol used to label a tree node with `N` children.
+
+for example, the automaton for `{ p : boolean, q : number }` has initial state `q0` and transitions:
+```
+  q0 -> { p = q1, q = q2 }
+  q1 -> true
+  q1 -> false
+  q2 -> n for any number n
+```
+
+Tree automata are closed under union and intersection, in the same way that string automata are.
+
+## Drawbacks
+
+Why should we *not* do this?
+
+## Alternatives
+
+What other designs have been considered? What is the impact of not doing this?