1 | ---
|
2 | in_progress: yes
|
3 | default_highlighter: oils-sh
|
4 | ---
|
5 |
|
6 | Doc Processing in YSH - Notation, Query, Templating
|
7 | ====================================================
|
8 |
|
9 | This is a slogan for the "maximalist YSH" design:
|
10 |
|
11 | *Documents, Objects, and Tables - HTML, JSON, and CSV* †
|
12 |
|
13 | This design doc is about the first part - **documents** and document processing.
|
14 |
|
15 | † from a paper about the C# language
|
16 |
|
17 | <div id="toc">
|
18 | </div>
|
19 |
|
20 | ## Intro
|
21 |
|
22 | Let's sketch a design for 3 aspects of doc processing:
|
23 |
|
24 | 1. HTM8 Notation - A **subset** of HTML5 meant for easy implementation, with
|
25 | regular languages.
|
26 | - It's part of J8 Notation (although it does not use J8 strings, like JSON8
|
27 | and TSV8 do).
|
28 | - It's very important to understand that this is HTM8, not HTML8!
|
29 | 1. A subset of CSS for querying
|
30 | 1. Templating in the Markaby style (a bit like Lisp, but unlike JSX templates)
|
31 |
|
32 | The basic goal is to write ad hoc HTML processors.
|
33 |
|
34 | YSH programs should loosely follow the style of the DOM API in web browsers,
|
35 | e.g. `document.querySelectorAll('table#mytable')` and the doc fragments it
|
36 | returns.
|
37 |
|
38 | Note that the DOM API is not available in Node.js or Deno by default, much less
|
39 | any alternative lightweight JavaScript runtimes.
|
40 |
|
41 | I believe we can include something that's simpler, and just as powerful,
|
42 | in YSH.
|
43 |
|
44 | ## Use Cases for HTML Processing
|
45 |
|
46 | These use cases should give an idea of what the design needs to support.
|
47 |
|
48 | 1. making Oils cross-ref.html
|
49 | - query and replacement
|
50 | 1. table language - md-ul-table
|
51 | - query and replacement
|
52 | - many tables to make here
|
53 | 1. safe HTML subset, e.g. for publishing user results on continuous build
|
54 | - well I think I want to encode the policy, like
|
55 | - query
|
56 |
|
57 | Design goals:
|
58 |
|
59 | - Simple format that can be re-implemented anywhere
|
60 | - a few re2c expressions
|
61 | - Fast
|
62 | - re2c uses C
|
63 | - Few allocations
|
64 | - much simpler than an entire browser engine
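
To make the "few re2c expressions" goal concrete, here is a minimal sketch of a regex-driven lexer in Python, with the standard `re` module standing in for re2c. The token names (`Comment`, `StartTag`, `EndTag`, `RawData`) and their patterns are assumptions made for illustration; they are not the actual HTM8 token set, which is described in htm8.html.

```python
import re

# Hypothetical token patterns, NOT the real HTM8 definition.
# Each alternative is a regular language, so the same idea ports to re2c.
TOKEN_RE = re.compile(r'''
    (?P<Comment>  <!-- .*? --> )
  | (?P<EndTag>   </ [a-zA-Z][a-zA-Z0-9]* \s* > )
  | (?P<StartTag> <  [a-zA-Z][a-zA-Z0-9]* [^<>]* > )
  | (?P<RawData>  [^<]+ )
''', re.VERBOSE | re.DOTALL)


def lex(s):
    """Yield (token name, token text) pairs; raise ValueError on bad input."""
    pos = 0
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if m is None:
            # e.g. a stray '<' that doesn't start a tag or comment
            raise ValueError('Invalid input at position %d' % pos)
        yield m.lastgroup, m.group()
        pos = m.end()


tokens = list(lex('<p>hi</p><!-- c -->'))
```

Because the lexer only yields spans of the input, a caller can copy or skip spans without building a tree, which is where the "few allocations" goal comes from.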
|
65 |
|
66 | ## Operations
|
67 |
|
68 | - `doc('<p>')` - validates it and creates a value.Obj
|
69 | - `docQuery(mydoc, '#element')` - does a simple search
|
70 |
|
71 | Constructors:
|
72 |
|
73 | doc { # prints valid HTM8
|
74 | p {
|
75 | echo 'hi'
|
76 | }
|
77 | p {
|
78 | 'hi' # I think I want to turn on this auto-quote feature
|
79 | }
|
80 | raw '<b>bold</b>'
|
81 | }
|
82 |
|
83 | And then, to capture the output instead of printing it:
|
84 |
|
85 | doc (&mydoc) { # captures the output, and creates a value.Obj
|
86 | p {
|
87 | 'hi' # I think I want to turn on this auto-quote feature
|
88 | "hi $x"
|
89 | }
|
90 | }
|
91 |
|
92 | This is the same as the table constructor.
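
As a rough model of what the block-style constructors above might do, here is a small Python sketch that uses context managers in place of YSH blocks. The `Doc` class and its `tag`, `text`, `raw`, and `render` methods are hypothetical names invented for this sketch, and the choice to escape `text` but not `raw` is an assumption about the intended semantics, not a statement of how YSH behaves.

```python
import html
from contextlib import contextmanager


class Doc:
    """Hypothetical builder: collects output parts, like doc (&mydoc) { ... }."""

    def __init__(self):
        self.parts = []

    @contextmanager
    def tag(self, name):
        # Emit the start tag, run the nested block, then emit the end tag
        self.parts.append('<%s>' % name)
        yield
        self.parts.append('</%s>' % name)

    def text(self, s):
        # Assumption: plain strings are escaped
        self.parts.append(html.escape(s))

    def raw(self, s):
        # Assumption: 'raw' passes markup through unchanged
        self.parts.append(s)

    def render(self):
        return ''.join(self.parts)


d = Doc()
with d.tag('p'):
    d.text('hi & bye')
d.raw('<b>bold</b>')
result = d.render()  # '<p>hi &amp; bye</p><b>bold</b>'
```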
|
93 |
|
94 | Module:
|
95 |
|
96 | source $LIB_YSH/doc.ysh
|
97 |
|
98 | doc (&d) {
|
99 | }
|
100 | doc {
|
101 | }
|
102 | doc('<p>')
|
103 |
|
104 | This can have both __invoke__ and __call__
|
105 |
|
106 | var results = d.query('#a')
|
107 |
|
108 | # The doc could be __invoke__ ?
|
109 | d query '#a' {
|
110 | }
|
111 |
|
112 | doc query (d, '#a') {
|
113 | for result in (results) {
|
114 | echo hi
|
115 | }
|
116 | }
|
117 |
|
118 | # we create (old, new) pairs?
|
119 | # this performs an operation like:
|
120 | # d.outerHTML = outerHTML
|
121 | var d = d.replace(pairs)
|
122 |
|
123 |
|
124 | Safe HTML subset:
|
125 |
|
126 | d query (tags= :|a p div h1 h2 h3|) {
|
127 | case (_frag.tag) {
|
128 | a {
|
129 | # get a list of all attributes
|
130 | var attrs = _frag.getAttributes()
|
131 | }
|
132 | }
|
133 | }
|
134 |
|
135 | If you want to accept user HTML, first run it through an HTML5 -> HTM8 converter.
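
The allow-list policy in the query above could be checked with a single pass over the tags. This Python sketch is only an illustration: `disallowed_tags` is a made-up helper, the tag pattern is a simplification, and attribute checks (like the `getAttributes()` step) are left out.

```python
import re

# Same tag list as the YSH example above
ALLOWED_TAGS = frozenset(['a', 'p', 'div', 'h1', 'h2', 'h3'])

# Simplified pattern for start and end tags; not the real HTM8 lexer
TAG_RE = re.compile(r'</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>')


def disallowed_tags(s):
    """Return a sorted list of tag names in s that are not on the allow-list."""
    found = set(m.group(1).lower() for m in TAG_RE.finditer(s))
    return sorted(found - ALLOWED_TAGS)


bad = disallowed_tags('<p>ok</p><script>x</script><h1>t</h1>')  # ['script']
```

A publishing step could refuse any document for which this list is non-empty.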
|
136 |
|
137 | ## More Notes
|
138 |
|
139 | YSH API
|
140 |
|
141 | - Generating HTML/HTM8 is much more common than parsing it
|
142 | - although maybe we can do RemoveComments as a demo?
|
143 | - that is the lowest level "sed" model
|
144 |
|
145 | - For parsing, a minimum idea is:
|
146 | - lexer-based algorithms for query by tag, class name, and id
|
147 | - and then toTree() - this is a DOM
|
148 | - .tag and .attrs?
|
149 | - .innerHTML() and .outerHTML() perhaps
|
150 | - rewrite ul-table in that?
|
151 | - does that mean you mutate it, or construct text?
|
152 | - I think you can set the innerHTML probably
|
153 |
|
154 | - Testing of html.ysh aka htm8.ysh in the stdlib
|
155 |
|
156 | Cases:
|
157 |
|
158 | html 'hello <b>world</b>'
|
159 | html "hello <b>$name</b>"html
|
160 | html ["hello <b>$name</b>"] # hm this isn't bad, it's an unevaluated expression?
|
161 | commonmark 'hello **world**'
|
162 | md 'hello **world**'
|
163 | md ['hello **$escape**'] # ? We don't have a good escaping algorithm
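
The "sed" model and the lexer-based query by id mentioned in the notes above can be sketched in a few lines of Python. The function names and regexes here are assumptions for illustration, not the html.ysh API; in particular, the attribute pattern only handles double-quoted values.

```python
import re

COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
START_TAG_RE = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>')
# Simplification: only name="value" attributes preceded by whitespace
ATTR_RE = re.compile(r'\s([a-zA-Z][a-zA-Z0-9-]*)\s*=\s*"([^"]*)"')


def remove_comments(s):
    """The 'sed' model: text in, text out, no tree in between."""
    return COMMENT_RE.sub('', s)


def find_by_id(s, elem_id):
    """Return (tag name, start offset) of the first start tag with this id.

    Scans raw text, so call remove_comments() first if comments may
    contain tags. Returns None when there is no match.
    """
    for m in START_TAG_RE.finditer(s):
        attrs = dict(ATTR_RE.findall(m.group(2)))
        if attrs.get('id') == elem_id:
            return m.group(1), m.start()
    return None


page = '<div><!-- note --><table id="mytable" class="c"></table></div>'
clean = remove_comments(page)
hit = find_by_id(clean, 'mytable')  # ('table', 5)
```

The returned offset is what a later `toTree()` or `outerHTML`-style operation would need in order to slice the fragment out of the original text.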
|
164 |
|
165 | ## Related
|
166 |
|
167 | - [table-object-doc.html](table-object-doc.html)
|
168 | - [htm8.html](htm8.html)
|