additional notes about escaping to ensure correct event IDs

2024-12-22 16:35:52 -05:00 · 2024-08-17 11:42:07 +01:00 · 2024-08-17 11:42:07 +01:00 · e34653ad04
commit e34653ad04
parent 2c7e2af15f
1 changed files with 38 additions and 0 deletions
--- a/01.md
+++ b/01.md
@@ -42,6 +42,8 @@ To obtain the `event.id`, we `sha256` the serialized event. The serialization is
 ]
 ```

+### String Escapes
+
 To prevent implementation differences from creating a different event ID for the same event, the following rules MUST be followed while serializing:
 - UTF-8 should be used for encoding.
 - Whitespace, line breaks or other unnecessary formatting should not be included in the output JSON.
@@ -54,6 +56,42 @@ To prevent implementation differences from creating a different event ID for the
  - A backspace, (`0x08`), use `\b`
  - A form feed, (`0x0C`), use `\f`

+In addition, implementations should retain all other escape sequences
+without modification. Normalizing them to one scheme would change event IDs,
+and there is no normative marker to specify which scheme is in use.
+There are three forms of escaping other than the single-letter C
+style above:
+
+- `\uXX` - 8 bit hex
+- `\uXXXX` - 16 bit hex
+- `\XXX` - 3-digit octal (8 bit)
+
+Implementations *could* make this a part of their internal data structure,
+but the primary directive is that the submitted event string encoding MUST
+be the same after marshalling it back to JSON. It is therefore simpler to
+leave the escapes alone.
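As a minimal sketch of why this matters (not part of the specification), the Python below hashes two serializations that decode to the same content but escape the newline differently. The `event_id` helper and the `"pubkey"` placeholder are assumptions made for illustration only.

```python
import hashlib
import json

# Two serializations of the same logical content ("a", newline, "b").
# They differ only in how the newline is escaped in the JSON text.
raw_short = '[0,"pubkey",0,1,[],"a\\nb"]'      # C-style escape: \n
raw_long = '[0,"pubkey",0,1,[],"a\\u000ab"]'   # 16 bit hex escape: \u000a


def event_id(serialized: str) -> str:
    """Hypothetical helper: sha256 over the UTF-8 bytes of the serialized event."""
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Both texts decode to identical data...
same_content = json.loads(raw_short) == json.loads(raw_long)
# ...but the hashes of the raw texts differ.
same_id = event_id(raw_short) == event_id(raw_long)

# Decoding and re-encoding silently rewrites \u000a as \n,
# so the re-marshalled text no longer matches what was submitted.
remarshalled = json.dumps(
    json.loads(raw_long), separators=(",", ":"), ensure_ascii=False
)

print(same_content, same_id, remarshalled == raw_long)
```

A decode and re-encode round trip is exactly the kind of normalization the paragraph above warns against: the data is unchanged, but the bytes that feed the hash are not.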
+
+There can also be HTML entities, but these do not need special handling
+because they are not based on the reverse solidus (`\`). Longer `\u` codes
+are possible according to UTF-8 rules, but few implementations use them. A
+parser that passes the `\u` prefix through without modification will accept
+2, 4, 6 or 8 hex digits, or even incorrect values, as long as they don't
+include a reverse solidus. A parser can thus make a special case for `\u`
+and `\[0-9]` and cover all cases.
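The special-casing of `\u` and `\[0-9]` could be sketched as below. This is an illustrative scanner, not a normative parser: the `classify_escapes` name and the three kind labels are assumptions, and the function only inspects the character after each reverse solidus without decoding or rewriting anything.

```python
import re

# A reverse solidus followed by any single character. Consuming both
# characters means an escaped reverse solidus (\\) is handled as one unit.
ESCAPE = re.compile(r"\\(.)", re.DOTALL)


def classify_escapes(raw: str) -> list[tuple[str, str]]:
    """Return (matched_prefix, kind) pairs for escapes in raw JSON string content."""
    found = []
    for match in ESCAPE.finditer(raw):
        ch = match.group(1)
        if ch == "u":
            kind = "unicode"        # \u followed by hex digits, left untouched
        elif ch in "0123456789":
            kind = "octal"          # \[0-9] form, left untouched
        else:
            kind = "single-letter"  # C-style escapes such as \n, \t, \\
        found.append((match.group(0), kind))
    return found


sample = r"line\nbreak \u00e9 \101 back\\slash"
kinds = [kind for _, kind in classify_escapes(sample)]
print(kinds)
```

Because the scanner never interprets the digits after `\u`, it tolerates any number of hex digits, which matches the pass-through behaviour described above.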
+
+Because there are no sentinels to signify which scheme should be used, and
+to conserve space on the most frequently escaped characters, `\n`,
+`\t` and `\\`, this specification uses the C-style escapes. Any other escapes,
+such as the three forms listed above, should not be modified, so that the
+canonical form of the event that determines the event ID hash is consistent
+across implementations.
+
+As a rule, data that is intended to represent binary should be encoded
+either in hexadecimal or in standard Base64. Wherever possible, as with
+the `e` and `p` tags, specifications that put binary data in a specific
+format in fields of tags should make it simple for implementations to store
+the data in binary form at runtime. This conserves memory and improves
+matching performance at a very low processing cost.
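A rough sketch of the memory saving, assuming a 32-byte value encoded as 64 lowercase hex characters, as in `e` and `p` tags. The helper names and the sample value are hypothetical.

```python
def tag_value_to_bytes(hex_value: str) -> bytes:
    """Decode a hex tag value into raw bytes for in-memory storage."""
    return bytes.fromhex(hex_value)


def bytes_to_tag_value(raw: bytes) -> str:
    """Re-encode raw bytes as lowercase hex when serializing the tag again."""
    return raw.hex()


# Made-up 64-character hex value standing in for an event ID or public key.
hex_id = "ab" * 32

binary_id = tag_value_to_bytes(hex_id)

# 32 bytes in memory instead of 64 characters, and the round trip
# restores the original text because the input was lowercase hex.
print(len(hex_id), len(binary_id), bytes_to_tag_value(binary_id) == hex_id)
```

The round trip is only lossless when the original text is lowercase hex. An implementation that accepts other casings would need to keep the original string to preserve the event ID.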
+
 ### Tags

 Each tag is an array of one or more strings, with some conventions around them. Take a look at the example below: