plugin/md (thanks Oblivion), ide: UppHub now interprets .md correctly

git-svn-id: svn://ultimatepp.org/upp/trunk@15737 f0d560ea-af0d-0410-9eb7-867de7ffcac7
cxl 2021-02-04 16:15:30 +00:00
parent 6c19de028b
commit 8c4cb75eb5
13 changed files with 7881 additions and 1 deletions

@@ -98,6 +98,8 @@ void UppHubDlg::Readme()
if(s.GetCount()) {
if(n->readme.EndsWith(".qtf"))
PromptOK(s);
else if(n->readme.EndsWith(".md"))
PromptOK(MarkdownConverter().Tables().ToQtf(s));
else
PromptOK("\1" + s);
}

@@ -17,6 +17,7 @@
#include <TextDiffCtrl/TextDiffCtrl.h>
#include <ide/Designers/Designers.h>
#include <ide/Android/Android.h>
#include <plugin/md/Markdown.h>
#include "About.h"
#include "MethodsCtrls.h"

@@ -20,7 +20,8 @@ uses
ide/Java,
ide/MacroManager,
Report,
Core/SSL;
Core/SSL,
plugin/md;
file
IDE readonly separator,

uppsrc/plugin/md/Copying

@@ -0,0 +1,22 @@
Copyright (c) 1998, 2021, The U++ Project
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation and/or other materials provided
with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

@@ -0,0 +1,362 @@
# MD4C Change Log
## Version 0.4.7
Changes:
* Add `MD_TABLE_DETAIL` structure into the API. The structure describes column
count and row count of the table, and pointer to it is passed into the
application-provided block callback with the `MD_BLOCK_TABLE` block type.
Fixes:
* [#131](https://github.com/mity/md4c/issues/131):
Fix handling of a reference image nested in a reference link.
* [#135](https://github.com/mity/md4c/issues/135):
Handle unmatched parenthesis pairs inside a permissive URL and WWW auto-links
in a way more compatible with the GFM.
* [#138](https://github.com/mity/md4c/issues/138):
The tag `<tbody></tbody>` is now suppressed whenever the table has zero body
rows.
* [#139](https://github.com/mity/md4c/issues/139):
Recognize a list item mark even when EOF follows it.
* [#142](https://github.com/mity/md4c/issues/142):
Fix reference link definition label matching in a case when the label ends
with a Unicode character with non-trivial case folding mapping.
## Version 0.4.6
Fixes:
* [#130](https://github.com/mity/md4c/issues/130):
Fix `ISANYOF` macro, which could provide unexpected results when encountering
zero byte in the input text; in some cases leading to broken internal state
of the parser.
The bug could result in denial of service and possibly also have other security
implications. Applications are advised to update to 0.4.6.
## Version 0.4.5
Fixes:
* [#118](https://github.com/mity/md4c/issues/118):
Fix HTML renderer's `MD_HTML_FLAG_VERBATIM_ENTITIES` flag, exposed in the
`md2html` utility via `--fverbatim-entities`.
* [#124](https://github.com/mity/md4c/issues/124):
Fix handling of indentation of 16 or more spaces in the fenced code blocks.
## Version 0.4.4
Changes:
* Make Unicode-specific code compliant to Unicode 13.0.
New features:
* The HTML renderer, developed originally as the heart of the `md2html`
utility, is now built as a standalone library, in order to simplify its
reuse in applications.
* With `MD_HTML_FLAG_SKIP_UTF8_BOM`, the HTML renderer now skips UTF-8 byte
order mark (BOM) if the input begins with it, before passing to the Markdown
parser.
`md2html` utility automatically enables the flag (unless it is custom-built
with `-DMD4C_USE_ASCII`).
* With `MD_HTML_FLAG_XHTML`, the HTML renderer generates XHTML instead of
HTML.
This effectively means `<br />` instead of `<br>`, `<hr />` instead of
`<hr>`, and `<img ... />` instead of `<img ...>`.
`md2html` utility now understands the command line option `-x` or `--xhtml`
enabling the XHTML mode.
Fixes:
* [#113](https://github.com/mity/md4c/issues/113):
Add missing folding info data for the following Unicode characters:
`U+0184`, `U+018a`, `U+01b2`, `U+01b5`, `U+01f4`, `U+0372`, `U+038f`,
`U+1c84`, `U+1fb9`, `U+1fbb`, `U+1fd9`, `U+1fdb`, `U+1fe9`, `U+1feb`,
`U+1ff9`, `U+1ffb`, `U+2c7f`, `U+2ced`, `U+a77b`, `U+a792`, `U+a7c9`.
Due to the bug, the link definition label matching did not work in the case
insensitive way for these characters.
## Version 0.4.3
New features:
* With `MD_FLAG_UNDERLINE`, spans enclosed in underscore (`_foo_`) are seen
as underline (`MD_SPAN_UNDERLINE`) rather than an ordinary emphasis or
strong emphasis.
Changes:
* The implementation of wiki-links extension (with `MD_FLAG_WIKILINKS`) has
been simplified.
- A noticeable increase of MD4C's memory footprint introduced by the
extension implementation in 0.4.0 has been removed.
- The priority handling towards other inline elements has been unified.
(This affects an obscure case where image syntax in place of the
wiki-link destination made the wiki-link invalid. Now *all* inline spans
in the wiki-link destination, including images, are suppressed.)
- The length limitation of 100 characters now always applies to wiki-link
destination.
* Recognition of strike-through spans (with the flag `MD_FLAG_STRIKETHROUGH`)
has become much stricter and, arguably, reasonable.
- Only single tildes (`~`) and double tildes (`~~`) are recognized as
strike-through marks. Longer ones are not anymore.
- The lengths of the opener and closer marks have to be the same.
- The tildes cannot open a strike-through span if a whitespace follows.
- The tildes cannot close a strike-through span if a whitespace precedes.
This change follows the changes of behavior in cmark-gfm some time ago, so
it is also beneficial from a compatibility point of view.
* When building MD4C by hand instead of using its CMake-based build, the UTF-8
support was by default disabled, unless explicitly asked for by defining
a preprocessor macro `MD4C_USE_UTF8`.
This has been changed and the UTF-8 mode now becomes the default, no matter
how `md4c.c` is compiled. If you need to disable it and use the ASCII-only
mode, you have to explicitly define the macro `MD4C_USE_ASCII` when compiling it.
(The CMake-based build as provided in our repository explicitly asked for
the UTF-8 support with `-DMD4C_USE_UTF8`. I.e. if you are using MD4C library
built with our vanilla `CMakeLists.txt` files, this change should not affect
you.)
Fixes:
* Fixed some string length handling in the special `MD4C_USE_UTF16` build.
(This does not affect you unless you are on Windows and explicitly define
the macro when building MD4C.)
* [#100](https://github.com/mity/md4c/issues/100):
Fixed an off-by-one error in the maximal length limit of some segments
of e-mail addresses used in autolinks.
* [#107](https://github.com/mity/md4c/issues/107):
Fix mis-detection of asterisk-encoded emphasis in some corner cases when
length of the opener and closer differs, as in `***foo *bar baz***`.
## Version 0.4.2
Fixes:
* [#98](https://github.com/mity/md4c/issues/98):
Fix mis-detection of asterisk-encoded emphasis in some corner cases when
length of the opener and closer differs, as in `**a *b c** d*`.
## Version 0.4.1
Unfortunately, 0.4.0 was released with a badly updated ChangeLog. Fixing
this is the only change in 0.4.1.
## Version 0.4.0
New features:
* With `MD_FLAG_LATEXMATHSPANS`, LaTeX math spans (`$...$`) and LaTeX display
math spans (`$$...$$`) are now recognized. (Note though that the HTML
renderer outputs them verbatim in a custom `<x-equation>` tag.)
Contributed by [Tilman Roeder](https://github.com/dyedgreen).
* With `MD_FLAG_WIKILINKS`, Wiki-style links (`[[...]]`) are now recognized.
(Note though that the HTML renderer renders them as a custom `<x-wikilink>`
tag.)
Contributed by [Nils Blomqvist](https://github.com/niblo).
Changes:
* Parsing of tables (with `MD_FLAG_TABLES`) is now closer to the way
cmark-gfm parses tables as we do not require every row of the table to
contain a pipe `|` anymore.
As a consequence, paragraphs now cannot interrupt tables. A paragraph which
follows the table has to be delimited with a blank line.
Fixes:
* [#94](https://github.com/mity/md4c/issues/94):
`md_build_ref_def_hashtable()`: Do not allocate more memory than strictly
needed.
* [#95](https://github.com/mity/md4c/issues/95):
`md_is_container_mark()`: Ordered list mark requires at least one digit.
* [#96](https://github.com/mity/md4c/issues/96):
Some fixes for link label comparison.
## Version 0.3.4
Changes:
* Make Unicode-specific code compliant to Unicode 12.1.
* Structure `MD_BLOCK_CODE_DETAIL` got new member `fenced_char`. Application
can use it to detect character used to form the block fences (`` ` `` or
`~`). In the case of indented code block, it is set to zero.
Fixes:
* [#77](https://github.com/mity/md4c/issues/77):
Fix maximal count of digits for numerical character references, as requested
by CommonMark specification 0.29.
* [#78](https://github.com/mity/md4c/issues/78):
Fix link reference definition label matching for Unicode characters where
the folding mapping leads to multiple codepoints, as e.g. in `ẞ` -> `SS`.
* [#83](https://github.com/mity/md4c/issues/83):
Fix recognition of an empty blockquote which interrupts a paragraph.
## Version 0.3.3
Changes:
* Make permissive URL autolink and permissive WWW autolink extensions stricter.
This brings the behavior closer to GFM and mitigates risk of false positives.
In particular, the domain has to contain at least one dot and parenthesis
can be part of the link destination only if `(` and `)` are balanced.
Fixes:
* [#73](https://github.com/mity/md4c/issues/73):
Some raw HTML inputs could lead to quadratic parsing times.
* [#74](https://github.com/mity/md4c/issues/74):
Fix input leading to a crash. Found by fuzzing.
* [#76](https://github.com/mity/md4c/issues/76):
Fix handling of parenthesis in some corner cases of permissive URL autolink
and permissive WWW autolink extensions.
## Version 0.3.2
Changes:
* Changes mandated by CommonMark specification 0.29.
Most importantly, the white-space trimming rules for code spans have changed.
At most one space/newline is trimmed from beginning/end of the code span
(if the code span contains some non-space contents, and if it begins and
ends with space at the same time). In all other cases the spaces in the code
span are now left intact.
Other changes in behavior are in corner cases only. Refer to [CommonMark
0.29 notes](https://github.com/commonmark/commonmark-spec/releases/tag/0.29)
for more info.
Fixes:
* [#68](https://github.com/mity/md4c/issues/68):
Some specific HTML blocks were not recognized when EOF follows without any
end-of-line character.
* [#69](https://github.com/mity/md4c/issues/69):
Strike-through span not working correctly when its opener mark is directly
followed by other opener mark; or when other closer mark directly precedes
its closer mark.
## Version 0.3.1
Fixes:
* [#58](https://github.com/mity/md4c/issues/58),
[#59](https://github.com/mity/md4c/issues/59),
[#60](https://github.com/mity/md4c/issues/60),
[#63](https://github.com/mity/md4c/issues/63),
[#66](https://github.com/mity/md4c/issues/66):
Some inputs could lead to quadratic parsing times. Thanks to Anders Kaseorg
for finding all those issues.
* [#61](https://github.com/mity/md4c/issues/61):
Flag `MD_FLAG_NOHTMLSPANS` erroneously affected also recognition of
CommonMark autolinks.
## Version 0.3.0
New features:
* Add extension for GitHub-style task lists:
```
* [x] foo
* [x] bar
* [ ] baz
```
(It has to be explicitly enabled with `MD_FLAG_TASKLISTS`.)
* Added support for building as a shared library. On non-Windows platforms,
this is now default behavior; on Windows static library is still the default.
The CMake option `BUILD_SHARED_LIBS` can be used to request one or the other
explicitly.
Contributed by Lisandro Damián Nicanor Pérez Meyer.
* Renamed structure `MD_RENDERER` to `MD_PARSER` and refactored its contents
a little bit. Note this is source-level incompatible and initialization code
in apps may need to be updated.
The aim of the change is to be more friendly for long-term ABI compatibility
we shall maintain, starting with this release.
* Added `CHANGELOG.md` (this file).
* Make sure `md_process_table_row()` reports the same count of table cells for
all table rows, no matter how broken the input is. The cell count is derived
from table underline line. Bogus cells in other rows are silently ignored.
Missing cells in other rows are reported as empty ones.
Fixes:
* CID 1475544:
Calling `md_free_attribute()` on uninitialized data.
* [#47](https://github.com/mity/md4c/issues/47):
Using bad offsets in `md_is_entity_str()`, in some cases leading to buffer
overflow.
* [#51](https://github.com/mity/md4c/issues/51):
Segfault in `md_process_table_cell()`.
* [#53](https://github.com/mity/md4c/issues/53):
With `MD_FLAG_PERMISSIVEURLAUTOLINKS` or `MD_FLAG_PERMISSIVEWWWAUTOLINKS`
we could generate bad output for ordinary Markdown links, if a non-space
character immediately follows like e.g. in `[link](http://github.com)X`.
## Version 0.2.7
This was the last version before the changelog was added.

@@ -0,0 +1,22 @@
# The MIT License (MIT)
Copyright © 2016-2020 Martin Mitáš
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the “Software”),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.

@@ -0,0 +1,297 @@
[![Linux Build Status (travis-ci.com)](https://img.shields.io/travis/mity/md4c/master.svg?logo=linux&label=linux%20build)](https://travis-ci.com/mity/md4c)
[![Windows Build Status (appveyor.com)](https://img.shields.io/appveyor/ci/mity/md4c/master.svg?logo=windows&label=windows%20build)](https://ci.appveyor.com/project/mity/md4c/branch/master)
[![Code Coverage Status (codecov.io)](https://img.shields.io/codecov/c/github/mity/md4c/master.svg?logo=codecov&label=code%20coverage)](https://codecov.io/github/mity/md4c)
[![Coverity Scan Status](https://img.shields.io/coverity/scan/mity-md4c.svg?label=coverity%20scan)](https://scan.coverity.com/projects/mity-md4c)
# MD4C Readme
* Home: http://github.com/mity/md4c
* Wiki: http://github.com/mity/md4c/wiki
* Issue tracker: http://github.com/mity/md4c/issues
MD4C stands for "Markdown for C" and that's exactly what this project is about.
## What is Markdown
In short, Markdown is the markup language this `README.md` file is written in.
The following resources can explain more if you are unfamiliar with it:
* [Wikipedia article](http://en.wikipedia.org/wiki/Markdown)
* [CommonMark site](http://commonmark.org)
## What is MD4C
MD4C is a Markdown parser implementation in C, with the following features:
* **Compliance:** Generally, MD4C aims to be compliant to the latest version of
[CommonMark specification](http://spec.commonmark.org/). Currently, we are
fully compliant to CommonMark 0.29.
* **Extensions:** MD4C supports some commonly requested and accepted extensions.
See below.
* **Performance:** MD4C is [very fast](https://talk.commonmark.org/t/2520).
* **Compactness:** MD4C parser is implemented in one source file and one header
file. There are no dependencies other than standard C library.
* **Embedding:** MD4C parser is easy to reuse in other projects, its API is
very straightforward: There is actually just one function, `md_parse()`.
* **Push model:** MD4C parses the complete document and calls a few callback
functions provided by the application to inform it about a start/end of
every block, a start/end of every span, and with any textual contents.
* **Portability:** MD4C builds and works on Windows and POSIX-compliant OSes.
(It should be simple to make it run also on most other platforms, at least as
long as the platform provides C standard library, including a heap memory
management.)
* **Encoding:** MD4C by default expects UTF-8 encoding of the input document.
But it can be compiled to recognize ASCII-only control characters (i.e. to
disable all Unicode-specific code), or (on Windows) to expect UTF-16 (i.e.
what is on Windows commonly called just "Unicode"). See more details below.
* **Permissive license:** MD4C is available under the [MIT license](LICENSE.md).
## Using MD4C
### Parsing Markdown
If you need just to parse a Markdown document, you need to include `md4c.h`
and link against MD4C library (`-lmd4c`); or alternatively add `md4c.[hc]`
directly to your code base as the parser is only implemented in the single C
source file.
The main provided function is `md_parse()`. It takes a text in the Markdown
syntax and a pointer to a structure which provides pointers to several callback
functions.
As `md_parse()` processes the input, it calls the callbacks (when entering or
leaving any Markdown block or span; and when outputting any textual content of
the document), allowing application to convert it into another format or render
it onto the screen.
### Converting to HTML
If you need to convert Markdown to HTML, include `md4c-html.h` and link against
MD4C-HTML library (`-lmd4c-html`); or alternatively add the sources `md4c.[hc]`,
`md4c-html.[hc]` and `entity.[hc]` into your code base.
To convert a Markdown input, call `md_html()` function. It takes the Markdown
input and calls the provided callback function. The callback is fed with
chunks of the HTML output. Typical callback implementation just appends the
chunks into a buffer or writes them to a file.
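The "append the chunks into a buffer" case described above could be sketched as follows. This is only an illustration: `OutBuf` and `append_chunk` are hypothetical names, the two typedefs repeat the default (non-UTF-16) ones from `md4c.h` so that the snippet compiles on its own, and the callback signature is assumed to match the output callback declared in `md4c-html.h`, so check it against the header before use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same as the default typedefs in md4c.h; repeated here only to keep the
 * sketch self-contained. In real code, include md4c-html.h instead. */
typedef char MD_CHAR;
typedef unsigned MD_SIZE;

/* Hypothetical growable output buffer; not part of MD4C. */
typedef struct {
    char*  data;      /* NUL-terminated accumulated output */
    size_t size;      /* bytes stored, not counting the terminator */
    size_t capacity;  /* bytes allocated */
    int    failed;    /* set to 1 when an allocation fails */
} OutBuf;

/* Output callback: receives one chunk of HTML and appends it to the
 * OutBuf passed through userdata. */
static void append_chunk(const MD_CHAR* text, MD_SIZE size, void* userdata)
{
    OutBuf* buf = (OutBuf*) userdata;

    assert(buf != NULL);
    if(buf->failed)
        return;

    if(buf->size + size + 1 > buf->capacity) {
        size_t new_cap = buf->capacity ? buf->capacity : 64;
        char* p;

        while(new_cap < buf->size + size + 1)
            new_cap *= 2;
        p = (char*) realloc(buf->data, new_cap);
        if(p == NULL) {
            buf->failed = 1;   /* keep what we have, ignore further chunks */
            return;
        }
        buf->data = p;
        buf->capacity = new_cap;
    }

    memcpy(buf->data + buf->size, text, size);
    buf->size += size;
    buf->data[buf->size] = '\0';
}
```

The buffer starts zero-initialized and is handed to the renderer as the user data pointer; after the call the application checks `failed` and eventually frees `data`.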
## Markdown Extensions
The default behavior is to recognize only Markdown syntax defined by the
[CommonMark specification](http://spec.commonmark.org/).
However, with appropriate flags, the behavior can be tuned to enable some
extensions:
* With the flag `MD_FLAG_COLLAPSEWHITESPACE`, a non-trivial whitespace is
collapsed into a single space.
* With the flag `MD_FLAG_TABLES`, GitHub-style tables are supported.
* With the flag `MD_FLAG_TASKLISTS`, GitHub-style task lists are supported.
* With the flag `MD_FLAG_STRIKETHROUGH`, strike-through spans are enabled
(text enclosed in tilde marks, e.g. `~foo bar~`).
* With the flag `MD_FLAG_PERMISSIVEURLAUTOLINKS` permissive URL autolinks
(not enclosed in `<` and `>`) are supported.
* With the flag `MD_FLAG_PERMISSIVEEMAILAUTOLINKS`, permissive e-mail
autolinks (not enclosed in `<` and `>`) are supported.
* With the flag `MD_FLAG_PERMISSIVEWWWAUTOLINKS` permissive WWW autolinks
without any scheme specified (e.g. `www.example.com`) are supported. MD4C
then assumes `http:` scheme.
* With the flag `MD_FLAG_LATEXMATHSPANS` LaTeX math spans (`$...$`) and
LaTeX display math spans (`$$...$$`) are supported. (Note though that the
HTML renderer outputs them verbatim in a custom tag `<x-equation>`.)
* With the flag `MD_FLAG_WIKILINKS`, wiki-style links (`[[link label]]` and
`[[target article|link label]]`) are supported. (Note that the HTML renderer
outputs them in a custom tag `<x-wikilink>`.)
* With the flag `MD_FLAG_UNDERLINE`, underscore (`_`) denotes an underline
instead of an ordinary emphasis or strong emphasis.
A few features of CommonMark (those some people see as mis-features) may be
disabled with the following flags:
* With the flag `MD_FLAG_NOHTMLSPANS` or `MD_FLAG_NOHTMLBLOCKS`, raw inline
HTML or raw HTML blocks respectively are disabled.
* With the flag `MD_FLAG_NOINDENTEDCODEBLOCKS`, indented code blocks are
disabled.
## Input/Output Encoding
The CommonMark specification declares that any sequence of Unicode code points
is a valid CommonMark document.
But, on closer inspection, Unicode plays a role in only a few very specific
situations when parsing Markdown documents:
1. For detection of word boundaries when processing emphasis and strong
emphasis, some classification of Unicode characters (whether it is
a whitespace or a punctuation) is needed.
2. For (case-insensitive) matching of a link reference label with the
corresponding link reference definition, Unicode case folding is used.
3. For translating HTML entities (e.g. `&amp;`) and numeric character
references (e.g. `&#35;` or `&#xcab;`) into their Unicode equivalents.
Note, however, that MD4C leaves this translation to the renderer/application, as
the renderer is supposed to really know output encoding and whether it
really needs to perform this kind of translation. (For example, when the
renderer outputs HTML, it may leave the entities untranslated and defer the
work to a web browser.)
MD4C relies on this property of the CommonMark and the implementation is, to
a large degree, encoding-agnostic. Most of MD4C code only assumes that the
encoding of your choice is compatible with ASCII. I.e. that the codepoints
below 128 have the same numeric values as ASCII.
Any input MD4C does not understand is simply seen as part of the document text
and sent to the renderer's callback functions unchanged.
The two situations (word boundary detection and link reference matching) where
MD4C has to understand Unicode are handled as specified by the following
preprocessor macros (as specified at the time MD4C is being built):
* If preprocessor macro `MD4C_USE_UTF8` is defined, MD4C assumes UTF-8 for the
word boundary detection and for the case-insensitive matching of link labels.
When none of these macros is explicitly used, this is the default behavior.
* On Windows, if preprocessor macro `MD4C_USE_UTF16` is defined, MD4C uses
`WCHAR` instead of `char` and assumes UTF-16 encoding in those situations.
(UTF-16 is what Windows developers usually call just "Unicode" and what
Win32API generally works with.)
Note that because this macro affects also the types in `md4c.h`, you have
to define the macro both when building MD4C as well as when including
`md4c.h`.
Also note this is only supported in the parser (`md4c.[hc]`). The HTML
renderer does not support this and you will have to write your own custom
renderer to use this feature.
* If preprocessor macro `MD4C_USE_ASCII` is defined, MD4C assumes nothing but
an ASCII input.
That effectively means that non-ASCII whitespace or punctuation characters
won't be recognized as such and that link reference matching will work in
a case-insensitive way only for ASCII letters (`[a-zA-Z]`).
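To make the ASCII-only behavior concrete, here is a minimal sketch of a comparison that folds only `[A-Z]` to `[a-z]`. It is not MD4C's actual code: `ascii_fold` and `ascii_label_equal` are hypothetical names, and the sketch ignores the whitespace normalization that real link label matching also performs.

```c
#include <assert.h>
#include <stddef.h>

/* Fold an ASCII upper-case letter to lower case. Every other byte,
 * including all bytes >= 0x80, is returned unchanged. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char) (c + ('a' - 'A')) : c;
}

/* Returns 1 if both labels are equal when only ASCII letters are compared
 * case-insensitively, 0 otherwise. */
static int ascii_label_equal(const char* a, size_t a_size,
                             const char* b, size_t b_size)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if(a_size != b_size)
        return 0;
    for(i = 0; i < a_size; i++) {
        if(ascii_fold((unsigned char) a[i]) != ascii_fold((unsigned char) b[i]))
            return 0;
    }
    return 1;
}
```

With such a comparison, `Foo` matches `fOO`, but the UTF-8 encodings of `É` and `é` differ byte-wise and therefore do not match.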
## Documentation
The API of the parser is quite well documented in the comments in the `md4c.h`.
Similarly, the markdown-to-html API is described in its header `md4c-html.h`.
There is also [project wiki](http://github.com/mity/md4c/wiki) which provides
some more comprehensive documentation. However note it is incomplete and some
details may be somewhat outdated.
## FAQ
**Q: How does MD4C compare to a parser XY?**
**A:** Some other implementations combine Markdown parser and HTML generator
into a single entangled code hidden behind an interface which just allows the
conversion from Markdown to HTML. They are often unusable if you want to
process the input in any other way.
Even when the parsing is available as a standalone feature, most parsers (if
not all of them; at least within the scope of C/C++ language) are full DOM-like
parsers: They construct abstract syntax tree (AST) representation of the whole
Markdown document. That takes time and it leads to bigger memory footprint.
It's completely fine as long as you really need it. If you don't need the full
AST, there is a very high chance that using MD4C will be substantially faster
and less hungry in terms of memory consumption.
Last but not least, some Markdown parsers are implemented in a naive way. When
fed with a [smartly crafted input pattern](test/pathological_tests.py), they
may exhibit quadratic (or even worse) parsing times. What MD4C can still parse
in a fraction of second may turn into long minutes or possibly hours with them.
Hence, when such a naive parser is used to process an input from an untrusted
source, the possibility of denial-of-service attacks becomes a real danger.
A lot of our effort went into providing linear parsing times no matter what
kind of crazy input MD4C parser is fed with. (If you encounter an input pattern
which leads to super-linear parsing times, please do not hesitate to report it
as a bug.)
**Q: Does MD4C perform any input validation?**
**A:** No. And we are proud of it. :-)
CommonMark specification states that any sequence of Unicode characters is
a valid Markdown document. (In practice, this more or less always means UTF-8
encoding.)
In other words, according to the specification, it does not matter whether some
Markdown syntax construction is in some way broken or not. If it is broken, it
will simply not be recognized and the parser should see it just as a verbatim
text.
MD4C takes this a step further: It sees any sequence of bytes as a valid input,
following completely the GIGO philosophy (garbage in, garbage out). I.e. any
ill-formed UTF-8 byte sequence will propagate to the respective callback as
a part of the text.
If you need to validate that the input is, say, a well-formed UTF-8 document,
you have to do it on your own. The easiest way to do this is to simply
validate the whole document before passing it to the MD4C parser.
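Such a pre-validation pass might look like the sketch below. It is not part of MD4C: `is_valid_utf8` is a hypothetical helper that rejects truncated sequences, stray continuation bytes, overlong encodings, surrogates and code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the first `size` bytes at `data` are well-formed UTF-8,
 * 0 otherwise. */
static int is_valid_utf8(const char* data, size_t size)
{
    const unsigned char* s = (const unsigned char*) data;
    size_t i = 0;

    assert(data != NULL || size == 0);

    while(i < size) {
        unsigned char c = s[i];
        size_t n;           /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point valid for this length */
        size_t k;

        if(c < 0x80) {
            i++;            /* plain ASCII byte */
            continue;
        }
        else if((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else
            return 0;       /* stray continuation byte or invalid lead byte */

        if(size - i <= n)
            return 0;       /* sequence is cut off by the end of input */

        for(k = 1; k <= n; k++) {
            if((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (unsigned long) (s[i + k] & 0x3F);
        }

        if(cp < min)
            return 0;       /* overlong encoding */
        if(cp >= 0xD800 && cp <= 0xDFFF)
            return 0;       /* UTF-16 surrogate */
        if(cp > 0x10FFFF)
            return 0;       /* beyond the Unicode range */

        i += n + 1;
    }
    return 1;
}
```

An application would call this on the whole input and only pass it to the parser when the result is 1; what to do with rejected input (refuse it, or replace bad bytes) is left to the application.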
## License
MD4C is covered by the MIT license, see the file `LICENSE.md`.
## Links to Related Projects
Ports and bindings to other languages:
* [commonmark-d](https://github.com/AuburnSounds/commonmark-d):
Port of MD4C to D language.
* [markdown-wasm](https://github.com/rsms/markdown-wasm):
Port of MD4C to WebAssembly.
* [PyMD4C](https://github.com/dominickpastore/pymd4c):
Python bindings for MD4C
Software using MD4C:
* [QOwnNotes](https://www.qownnotes.org/):
A plain-text file notepad and todo-list manager with markdown support and
ownCloud / Nextcloud integration.
* [Qt](https://www.qt.io/):
Cross-platform C++ GUI framework.
* [Textosaurus](https://github.com/martinrotter/textosaurus):
Cross-platform text editor based on Qt and Scintilla.
* [8th](https://8th-dev.com/):
Cross-platform concatenative programming language.

uppsrc/plugin/md/MD4C/md4c.c

File diff suppressed because it is too large

@@ -0,0 +1,405 @@
/*
* MD4C: Markdown parser for C
* (http://github.com/mity/md4c)
*
* Copyright (c) 2016-2020 Martin Mitas
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#ifndef MD4C_H
#define MD4C_H
#ifdef __cplusplus
extern "C" {
#endif
#if defined MD4C_USE_UTF16
/* Magic to support UTF-16. Note that in order to use it, you have to define
* the macro MD4C_USE_UTF16 both when building MD4C as well as when
* including this header in your code. */
#ifdef _WIN32
#include <windows.h>
typedef WCHAR MD_CHAR;
#else
#error MD4C_USE_UTF16 is only supported on Windows.
#endif
#else
typedef char MD_CHAR;
#endif
typedef unsigned MD_SIZE;
typedef unsigned MD_OFFSET;
/* Block represents a part of document hierarchy structure like a paragraph
* or list item.
*/
typedef enum MD_BLOCKTYPE {
/* <body>...</body> */
MD_BLOCK_DOC = 0,
/* <blockquote>...</blockquote> */
MD_BLOCK_QUOTE,
/* <ul>...</ul>
* Detail: Structure MD_BLOCK_UL_DETAIL. */
MD_BLOCK_UL,
/* <ol>...</ol>
* Detail: Structure MD_BLOCK_OL_DETAIL. */
MD_BLOCK_OL,
/* <li>...</li>
* Detail: Structure MD_BLOCK_LI_DETAIL. */
MD_BLOCK_LI,
/* <hr> */
MD_BLOCK_HR,
/* <h1>...</h1> (for levels up to 6)
* Detail: Structure MD_BLOCK_H_DETAIL. */
MD_BLOCK_H,
/* <pre><code>...</code></pre>
* Note the text lines within code blocks are terminated with '\n'
* instead of explicit MD_TEXT_BR. */
MD_BLOCK_CODE,
/* Raw HTML block. This itself does not correspond to any particular HTML
* tag. The contents of it _is_ raw HTML source intended to be put
* in verbatim form to the HTML output. */
MD_BLOCK_HTML,
/* <p>...</p> */
MD_BLOCK_P,
/* <table>...</table> and its contents.
* Detail: Structure MD_BLOCK_TABLE_DETAIL (for MD_BLOCK_TABLE),
* structure MD_BLOCK_TD_DETAIL (for MD_BLOCK_TH and MD_BLOCK_TD)
* Note all of these are used only if extension MD_FLAG_TABLES is enabled. */
MD_BLOCK_TABLE,
MD_BLOCK_THEAD,
MD_BLOCK_TBODY,
MD_BLOCK_TR,
MD_BLOCK_TH,
MD_BLOCK_TD
} MD_BLOCKTYPE;
/* Span represents an in-line piece of a document which should be rendered with
* the same font, color and other attributes. A sequence of spans forms a block
* like paragraph or list item. */
typedef enum MD_SPANTYPE {
/* <em>...</em> */
MD_SPAN_EM,
/* <strong>...</strong> */
MD_SPAN_STRONG,
/* <a href="xxx">...</a>
* Detail: Structure MD_SPAN_A_DETAIL. */
MD_SPAN_A,
/* <img src="xxx">...</a>
* Detail: Structure MD_SPAN_IMG_DETAIL.
* Note: Image text can contain nested spans and even nested images.
* If rendered into ALT attribute of HTML <IMG> tag, it's responsibility
* of the parser to deal with it.
*/
MD_SPAN_IMG,
/* <code>...</code> */
MD_SPAN_CODE,
/* <del>...</del>
* Note: Recognized only when MD_FLAG_STRIKETHROUGH is enabled.
*/
MD_SPAN_DEL,
/* For recognizing inline ($) and display ($$) equations
* Note: Recognized only when MD_FLAG_LATEXMATHSPANS is enabled.
*/
MD_SPAN_LATEXMATH,
MD_SPAN_LATEXMATH_DISPLAY,
/* Wiki links
* Note: Recognized only when MD_FLAG_WIKILINKS is enabled.
*/
MD_SPAN_WIKILINK,
/* <u>...</u>
* Note: Recognized only when MD_FLAG_UNDERLINE is enabled. */
MD_SPAN_U
} MD_SPANTYPE;
/* Text is the actual textual contents of span. */
typedef enum MD_TEXTTYPE {
/* Normal text. */
MD_TEXT_NORMAL = 0,
/* NULL character. CommonMark requires replacing NULL character with
* the replacement char U+FFFD, so this allows caller to do that easily. */
MD_TEXT_NULLCHAR,
/* Line breaks.
* Note these are not sent from blocks with verbatim output (MD_BLOCK_CODE
* or MD_BLOCK_HTML). In such cases, '\n' is part of the text itself. */
MD_TEXT_BR, /* <br> (hard break) */
MD_TEXT_SOFTBR, /* '\n' in source text where it is not semantically meaningful (soft break) */
/* Entity.
* (a) Named entity, e.g. &nbsp;
* (Note MD4C does not have a list of known entities.
* Anything matching the regexp /&[A-Za-z][A-Za-z0-9]{1,47};/ is
* treated as a named entity.)
* (b) Numerical entity, e.g. &#1234;
* (c) Hexadecimal entity, e.g. &#x12AB;
*
* As MD4C is mostly encoding agnostic, the application gets the verbatim
* entity text in the MD_PARSER::text() callback. */
MD_TEXT_ENTITY,
/* Text in a code block (inside MD_BLOCK_CODE) or inlined code (`code`).
* If it is inside MD_BLOCK_CODE, it includes spaces for indentation and
* '\n' for new lines. MD_TEXT_BR and MD_TEXT_SOFTBR are not sent for this
* kind of text. */
MD_TEXT_CODE,
/* Text is raw HTML. If it is the contents of a raw HTML block (i.e. not
* an inline raw HTML), then MD_TEXT_BR and MD_TEXT_SOFTBR are not used.
* The text contains verbatim '\n' for the new lines. */
MD_TEXT_HTML,
/* Text is inside an equation. This is processed the same way as inlined code
* spans (`code`). */
MD_TEXT_LATEXMATH
} MD_TEXTTYPE;
/* Alignment enumeration. */
typedef enum MD_ALIGN {
MD_ALIGN_DEFAULT = 0, /* When unspecified. */
MD_ALIGN_LEFT,
MD_ALIGN_CENTER,
MD_ALIGN_RIGHT
} MD_ALIGN;
/* String attribute.
*
* This wraps strings which are outside of a normal text flow and which are
* propagated within various detailed structures, but which still may contain
* string portions of different types like e.g. entities.
*
* So, for example, let's consider this image:
*
* ![image alt text](http://example.org/image.png 'foo &quot; bar')
*
* The image alt text is propagated as a normal text via the MD_PARSER::text()
* callback. However, the image title ('foo &quot; bar') is propagated as
* MD_ATTRIBUTE in MD_SPAN_IMG_DETAIL::title.
*
* Then the attribute MD_SPAN_IMG_DETAIL::title shall provide the following:
* -- [0]: "foo " (substr_types[0] == MD_TEXT_NORMAL; substr_offsets[0] == 0)
* -- [1]: "&quot;" (substr_types[1] == MD_TEXT_ENTITY; substr_offsets[1] == 4)
* -- [2]: " bar" (substr_types[2] == MD_TEXT_NORMAL; substr_offsets[2] == 10)
* -- [3]: (n/a) (n/a ; substr_offsets[3] == 14)
*
* Note that these invariants are always guaranteed:
* -- substr_offsets[0] == 0
* -- substr_offsets[LAST+1] == size
* -- Currently, only MD_TEXT_NORMAL, MD_TEXT_ENTITY, MD_TEXT_NULLCHAR
* substrings can appear. This could change only if the specification
* changes.
*/
typedef struct MD_ATTRIBUTE {
const MD_CHAR* text;
MD_SIZE size;
const MD_TEXTTYPE* substr_types;
const MD_OFFSET* substr_offsets;
} MD_ATTRIBUTE;
/* Detailed info for MD_BLOCK_UL. */
typedef struct MD_BLOCK_UL_DETAIL {
int is_tight; /* Non-zero if tight list, zero if loose. */
MD_CHAR mark; /* Item bullet character in MarkDown source of the list, e.g. '-', '+', '*'. */
} MD_BLOCK_UL_DETAIL;
/* Detailed info for MD_BLOCK_OL. */
typedef struct MD_BLOCK_OL_DETAIL {
unsigned start; /* Start index of the ordered list. */
int is_tight; /* Non-zero if tight list, zero if loose. */
MD_CHAR mark_delimiter; /* Character delimiting the item marks in MarkDown source, e.g. '.' or ')' */
} MD_BLOCK_OL_DETAIL;
/* Detailed info for MD_BLOCK_LI. */
typedef struct MD_BLOCK_LI_DETAIL {
int is_task; /* Can be non-zero only with MD_FLAG_TASKLISTS */
MD_CHAR task_mark; /* If is_task, then one of 'x', 'X' or ' '. Undefined otherwise. */
MD_OFFSET task_mark_offset; /* If is_task, then offset in the input of the char between '[' and ']'. */
} MD_BLOCK_LI_DETAIL;
/* Detailed info for MD_BLOCK_H. */
typedef struct MD_BLOCK_H_DETAIL {
unsigned level; /* Header level (1 - 6) */
} MD_BLOCK_H_DETAIL;
/* Detailed info for MD_BLOCK_CODE. */
typedef struct MD_BLOCK_CODE_DETAIL {
MD_ATTRIBUTE info;
MD_ATTRIBUTE lang;
MD_CHAR fence_char; /* The character used for fenced code block; or zero for indented code block. */
} MD_BLOCK_CODE_DETAIL;
/* Detailed info for MD_BLOCK_TABLE. */
typedef struct MD_BLOCK_TABLE_DETAIL {
unsigned col_count; /* Count of columns in the table. */
unsigned head_row_count; /* Count of rows in the table header (currently always 1) */
unsigned body_row_count; /* Count of rows in the table body */
} MD_BLOCK_TABLE_DETAIL;
/* Detailed info for MD_BLOCK_TH and MD_BLOCK_TD. */
typedef struct MD_BLOCK_TD_DETAIL {
MD_ALIGN align;
} MD_BLOCK_TD_DETAIL;
/* Detailed info for MD_SPAN_A. */
typedef struct MD_SPAN_A_DETAIL {
MD_ATTRIBUTE href;
MD_ATTRIBUTE title;
} MD_SPAN_A_DETAIL;
/* Detailed info for MD_SPAN_IMG. */
typedef struct MD_SPAN_IMG_DETAIL {
MD_ATTRIBUTE src;
MD_ATTRIBUTE title;
} MD_SPAN_IMG_DETAIL;
/* Detailed info for MD_SPAN_WIKILINK. */
typedef struct MD_SPAN_WIKILINK_DETAIL {
MD_ATTRIBUTE target;
} MD_SPAN_WIKILINK_DETAIL;
/* Flags specifying extensions/deviations from CommonMark specification.
*
* By default (when MD_PARSER::flags == 0), we follow CommonMark specification.
* The following flags may allow some extensions or deviations from it.
*/
#define MD_FLAG_COLLAPSEWHITESPACE 0x0001 /* In MD_TEXT_NORMAL, collapse non-trivial whitespace into single ' ' */
#define MD_FLAG_PERMISSIVEATXHEADERS 0x0002 /* Do not require space in ATX headers ( ###header ) */
#define MD_FLAG_PERMISSIVEURLAUTOLINKS 0x0004 /* Recognize URLs as autolinks even without '<', '>' */
#define MD_FLAG_PERMISSIVEEMAILAUTOLINKS 0x0008 /* Recognize e-mails as autolinks even without '<', '>' and 'mailto:' */
#define MD_FLAG_NOINDENTEDCODEBLOCKS 0x0010 /* Disable indented code blocks. (Only fenced code works.) */
#define MD_FLAG_NOHTMLBLOCKS 0x0020 /* Disable raw HTML blocks. */
#define MD_FLAG_NOHTMLSPANS 0x0040 /* Disable raw HTML (inline). */
#define MD_FLAG_TABLES 0x0100 /* Enable tables extension. */
#define MD_FLAG_STRIKETHROUGH 0x0200 /* Enable strikethrough extension. */
#define MD_FLAG_PERMISSIVEWWWAUTOLINKS 0x0400 /* Enable WWW autolinks (even without any scheme prefix, if they begin with 'www.') */
#define MD_FLAG_TASKLISTS 0x0800 /* Enable task list extension. */
#define MD_FLAG_LATEXMATHSPANS 0x1000 /* Enable $ and $$ containing LaTeX equations. */
#define MD_FLAG_WIKILINKS 0x2000 /* Enable wiki links extension. */
#define MD_FLAG_UNDERLINE 0x4000 /* Enable underline extension (and disable '_' for normal emphasis). */
#define MD_FLAG_PERMISSIVEAUTOLINKS (MD_FLAG_PERMISSIVEEMAILAUTOLINKS | MD_FLAG_PERMISSIVEURLAUTOLINKS | MD_FLAG_PERMISSIVEWWWAUTOLINKS)
#define MD_FLAG_NOHTML (MD_FLAG_NOHTMLBLOCKS | MD_FLAG_NOHTMLSPANS)
/* Convenient sets of flags corresponding to well-known Markdown dialects.
*
* Note we may support only a subset of the features of the referred dialect.
* The constant just enables those extensions which bring us as close as
* possible given what features we implement.
*
* ABI compatibility note: Meaning of these can change in time as new
* extensions, bringing the dialect closer to the original, are implemented.
*/
#define MD_DIALECT_COMMONMARK 0
#define MD_DIALECT_GITHUB (MD_FLAG_PERMISSIVEAUTOLINKS | MD_FLAG_TABLES | MD_FLAG_STRIKETHROUGH | MD_FLAG_TASKLISTS)
/* Parser structure.
*/
typedef struct MD_PARSER {
/* Reserved. Set to zero.
*/
unsigned abi_version;
/* Dialect options. Bitmask of MD_FLAG_xxxx values.
*/
unsigned flags;
/* Caller-provided rendering callbacks.
*
* For some block/span types, more detailed information is provided in a
* type-specific structure pointed to by the argument 'detail'.
*
* The last argument of all callbacks, 'userdata', is just propagated from
* md_parse() and is available for any use by the application.
*
* Note any strings provided to the callbacks as their arguments or as
* members of any detail structure are generally not zero-terminated.
* Application has to take the respective size information into account.
*
* Any rendering callback may abort further parsing of the document by
* returning non-zero.
*/
int (*enter_block)(MD_BLOCKTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*leave_block)(MD_BLOCKTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*enter_span)(MD_SPANTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*leave_span)(MD_SPANTYPE /*type*/, void* /*detail*/, void* /*userdata*/);
int (*text)(MD_TEXTTYPE /*type*/, const MD_CHAR* /*text*/, MD_SIZE /*size*/, void* /*userdata*/);
/* Debug callback. Optional (may be NULL).
*
* If provided and something goes wrong, this function gets called.
* This is intended for debugging and problem diagnosis for developers;
* it is not intended to provide any errors suitable for displaying to an
* end user.
*/
void (*debug_log)(const char* /*msg*/, void* /*userdata*/);
/* Reserved. Set to NULL.
*/
void (*syntax)(void);
} MD_PARSER;
/* For backward compatibility. Do not use in new code.
*/
typedef MD_PARSER MD_RENDERER;
/* Parse the Markdown document stored in the string 'text' of size 'size'.
* The parser provides callbacks to be called during the parsing so the
* caller can render the document on the screen or convert the Markdown
* to another format.
*
* Zero is returned on success. If a runtime error occurs (e.g. a memory
* allocation fails), -1 is returned. If the processing is aborted due to any
* callback returning non-zero, the return value of the callback is returned.
*/
int md_parse(const MD_CHAR* text, MD_SIZE size, const MD_PARSER* parser, void* userdata);
#ifdef __cplusplus
} /* extern "C" { */
#endif
#endif /* MD4C_H */
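
The MD_ATTRIBUTE comment in md4c.h describes an attribute as one text buffer plus parallel arrays of substring types and offsets. The following is a minimal sketch of walking such an attribute, written in C++ against stand-in types so that it builds without MD4C. `TextType`, `Attribute`, `Piece`, `SplitAttribute` and `SplitExampleTitle` are hypothetical names introduced for illustration only; with the real header the corresponding types would be MD_TEXTTYPE and MD_ATTRIBUTE.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-ins for the md4c.h declarations (assumption: same layout as
// MD_ATTRIBUTE, reduced to the three substring types that can appear).
enum TextType { TEXT_NORMAL = 0, TEXT_NULLCHAR, TEXT_ENTITY };

struct Attribute {
    const char*     text;            // not zero-terminated in general
    unsigned        size;
    const TextType* substr_types;
    const unsigned* substr_offsets;  // has one more entry than substr_types
};

struct Piece {                       // hypothetical helper type
    TextType    type;
    std::string text;
};

// Split an attribute into its typed substrings. The loop stops once a start
// offset reaches 'size', relying on the documented invariant
// substr_offsets[LAST+1] == size.
std::vector<Piece> SplitAttribute(const Attribute& a)
{
    std::vector<Piece> out;
    for (std::size_t i = 0; a.substr_offsets[i] < a.size; i++) {
        unsigned from = a.substr_offsets[i];
        unsigned to   = a.substr_offsets[i + 1];
        out.push_back(Piece{ a.substr_types[i], std::string(a.text + from, to - from) });
    }
    return out;
}

// The image title from the header comment: 'foo &quot; bar'.
std::vector<Piece> SplitExampleTitle()
{
    static const char     text[]    = "foo &quot; bar";
    static const TextType types[]   = { TEXT_NORMAL, TEXT_ENTITY, TEXT_NORMAL };
    static const unsigned offsets[] = { 0, 4, 10, 14 };
    Attribute title = { text, 14, types, offsets };
    return SplitAttribute(title);
}
```

A renderer would typically copy TEXT_NORMAL pieces through, translate TEXT_ENTITY pieces itself, and substitute U+FFFD for TEXT_NULLCHAR.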


@ -0,0 +1,362 @@
#include "Markdown.h"
#define LLOG(x) // DLOG("MarkdownConverter: " << x)
namespace Upp {
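// Collects the MD4C block callbacks into a tree of Blocks; Compose() then renders that tree as QTF.
class sMarkdownContext : NoCopy {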
struct Block
{
MD_BLOCKTYPE type;
String text;
Value data;
int level;
Block *parent;
Array<Block> children;
Block(MD_BLOCKTYPE t, Block *p);
};
// Bit flags handed down through Compose() to record the enclosing container blocks.
enum NestedBlockPosition {
BLOCK_IN_OLIST = 1,
BLOCK_IN_ULIST = 2,
BLOCK_IN_QUOTE = 4,
BLOCK_IN_THEAD = 8,
BLOCK_IN_TBODY = 16
};
Array<Block> document;
Block* current_block;
int current_level;
String Compose(const Array<Block>& doc, int data = 0, bool notext = false, dword flags = 0) const;
public:
void BeginBlock(MD_BLOCKTYPE type, void *detail);
void EndBlock(MD_BLOCKTYPE type, void *detail);
String ToQtf() { return Compose(document); }
sMarkdownContext& operator<<(const String& s) { ASSERT(current_block); current_block->text << s; return *this; }
sMarkdownContext& operator<<(const char *s) { ASSERT(current_block); current_block->text << s; return *this; }
sMarkdownContext()
: current_block(nullptr)
, current_level(0)
{}
};
sMarkdownContext::Block::Block(MD_BLOCKTYPE t, Block *p)
: type(t)
, level(0)
, parent(p)
{
}
// Creates a block under the current one (or at the document root) and makes it current.
void sMarkdownContext::BeginBlock(MD_BLOCKTYPE type, void *detail)
{
if(!current_block) {
current_block = &document.Create<Block>(type, current_block);
}
else {
current_block = &current_block->children.Create<Block>(type, current_block);
}
switch(type) {
case MD_BLOCK_UL:
current_block->data = reinterpret_cast<MD_BLOCK_UL_DETAIL*>(detail)->mark;
current_block->level = ++current_level;
break;
case MD_BLOCK_OL:
current_block->data = '1';
current_block->level = ++current_level;
break;
case MD_BLOCK_H:
current_block->data = (int) reinterpret_cast<MD_BLOCK_H_DETAIL*>(detail)->level;
current_block->level = current_level;
break;
default:
current_block->level = current_level;
break;
}
}
// Leaves the current block: unwinds the list nesting level and steps back to the parent block.
void sMarkdownContext::EndBlock(MD_BLOCKTYPE type, void *detail)
{
if(findarg(type, MD_BLOCK_UL, MD_BLOCK_OL) >= 0)
--current_level;
if(current_block)
current_block = current_block->parent;
}
String sMarkdownContext::Compose(const Array<Block>& doc, int data, bool notext, dword flags) const
{
// TODO:
// 1) Refactor this method.
// 2) Make certain block styles and page properties (e.g. margins, indentation, etc.) configurable.
String txt;
for(int i = 0; i < doc.GetCount(); i++) {
const Block& b = doc[i];
switch(b.type) {
case MD_BLOCK_DOC:
{
txt << "[G;2;# "
<< Compose(b.children, data, false, flags)
<< "&]";
break;
}
case MD_BLOCK_HR:
{
txt << "[H1;L0;h(220.220.220) &]";
break;
}
case MD_BLOCK_H:
{
txt << "&[*;"
<< clamp(6 - b.data.To<int>(), 1, 6)
<< " "
<< b.text
<< "&]";
break;
}
case MD_BLOCK_UL:
{
txt << Compose(b.children, b.data, false, flags|BLOCK_IN_ULIST);
break;
}
case MD_BLOCK_OL:
{
txt << "[N! "
<< Compose(b.children, b.data, false, flags|BLOCK_IN_OLIST)
<< "&]";
break;
}
case MD_BLOCK_LI:
{
bool q = b.text.IsEmpty();
if(!q) {
txt << "[a20;b20;l"
<< b.level * ((flags & BLOCK_IN_QUOTE) ? 100 : 200)
<< ";i200;"
<< decode(data, '*', "OO ", '-', "O1 ", '+', "O2 ", "N1 ")
<< b.text
<< "&]";
}
txt << Compose(b.children, data, q, flags);
break;
}
case MD_BLOCK_P:
{
if(!b.level) {
txt << "[a50;b50 " << b.text << " &]";
}
else
if(notext && !i) {
txt << "[a20;b20;l"
<< b.level * ((flags & BLOCK_IN_QUOTE) ? 20 : 200)
<< ";i200;"
<< decode(data, '*', "OO ", '-', "O1 ", '+', "O2 ", "N1 ")
<< b.text
<< "&]";
}
else {
txt << "[a20;b20;l"
<< b.level * ((flags & BLOCK_IN_QUOTE) ? 40 : 400)
<< ";O_ "
<< b.text
<< "&]";
}
break;
}
case MD_BLOCK_CODE:
case MD_BLOCK_HTML: // Treat HTML as code block....
{
txt << "{{10000;<"
<< b.level * ((flags & BLOCK_IN_QUOTE) ? 10 : 400)
<< ";@(250.250.250);F(230.230.230) [i10;C;1;@5;< "
<< b.text
<< " ]}}&";
break;
}
case MD_BLOCK_QUOTE:
{
txt << "{{1:500;G4;g20;F0;f0;<"
<< b.level * ((flags & BLOCK_IN_QUOTE) ? 10 : 400)
<< " :: [i10;<; "
<< Nvl(b.text, "")
<< Compose(b.children, data, false, flags|BLOCK_IN_QUOTE)
<< " ]}}&";
break;
}
case MD_BLOCK_TABLE:
{
txt << "{{"
<< Compose(b.children, data, false, flags)
<< "}}&";
break;
}
case MD_BLOCK_THEAD:
{
txt << Compose(b.children, data, false, flags|BLOCK_IN_THEAD);
break;
}
case MD_BLOCK_TBODY:
{
txt << Compose(b.children, data, false, flags|BLOCK_IN_TBODY);
break;
}
case MD_BLOCK_TR:
{
int n = b.children.GetCount();
if(flags & BLOCK_IN_THEAD) {
for(int j = 0; j < n; j++) {
txt << '1';
if(j < n - 1)
txt << ':';
else {
txt << "@(220.225.230);G(220.220.220);<"
<< b.level * ((flags & BLOCK_IN_QUOTE) ? 10 : 400)
<< " ";
}
}
}
for(int j = 0; j < n; j++) {
const Block& bb = b.children[j];
if(j == 0 && bb.type == MD_BLOCK_TD)
txt << "::@2 ";
txt << bb.text;
if(j < n - 1)
txt << "||";
}
break;
}
default:
//txt << b.text;
break;
}
}
return txt;
}
static String sDeQtfMd(const char *s)
{
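// Here we duplicate the DeQtf() function, because we don't want to bring in the RichText package.
// Newlines become '&' and ASCII punctuation is prefixed with '`' so that QTF treats it as literal text.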
StringBuffer r;
for(; *s; s++) {
if(*s == '\n')
r.Cat('&');
else {
if((byte) *s > ' ' && !IsDigit(*s) && !IsAlpha(*s) && (byte) *s < 128)
r.Cat('`');
r.Cat(*s);
}
}
return String(r); // Make compilers happy...
}
String MarkdownConverter::ToQtf(const String& mdtext)
{
MD_PARSER parser;
parser.abi_version = 0;
parser.flags = flags;
parser.debug_log = nullptr;
parser.syntax = nullptr;
parser.enter_block = [](MD_BLOCKTYPE type, void *detail, void *udata) -> int
{
reinterpret_cast<sMarkdownContext*>(udata)->BeginBlock(type, detail);
return 0;
};
parser.leave_block = [](MD_BLOCKTYPE type, void *detail, void *udata) -> int
{
reinterpret_cast<sMarkdownContext*>(udata)->EndBlock(type, detail);
return 0;
};
parser.enter_span = [](MD_SPANTYPE type, void *detail, void *udata) -> int
{
auto& ctx = *reinterpret_cast<sMarkdownContext*>(udata);
switch(type) {
case MD_SPAN_A:
{
auto *q = reinterpret_cast<MD_SPAN_A_DETAIL*>(detail);
ctx << "[^" << String(q->href.text, q->href.size) << "^ ";
break;
}
case MD_SPAN_IMG:
{
auto *q = reinterpret_cast<MD_SPAN_IMG_DETAIL*>(detail);
ctx << "[^" << String(q->src.text, q->src.size) << "^ ";
break;
}
case MD_SPAN_WIKILINK:
{
auto *q = reinterpret_cast<MD_SPAN_WIKILINK_DETAIL*>(detail);
ctx << "[^" << String(q->target.text, q->target.size) << "^ ";
break;
}
default:
{
ctx << decode(type,
MD_SPAN_U, "[_ ",
MD_SPAN_EM, "[/ ",
MD_SPAN_DEL, "[- ",
MD_SPAN_CODE, "[C;@5;$(245.245.245) ",
MD_SPAN_STRONG, "[* ", "");
break;
}}
return 0;
};
parser.leave_span = [](MD_SPANTYPE type, void *detail, void *udata) -> int
{
auto& ctx = *reinterpret_cast<sMarkdownContext*>(udata);
if(findarg(type,
MD_SPAN_A,
MD_SPAN_U,
MD_SPAN_EM,
MD_SPAN_IMG,
MD_SPAN_DEL,
MD_SPAN_CODE,
MD_SPAN_STRONG,
MD_SPAN_WIKILINK) >= 0)
ctx << "]";
return 0;
};
parser.text = [](MD_TEXTTYPE type, const MD_CHAR *text, MD_SIZE size, void* udata) -> int
{
auto& ctx = *reinterpret_cast<sMarkdownContext*>(udata);
ctx << decode(type,
MD_TEXT_NULLCHAR, "?",
MD_TEXT_BR, "&",
MD_TEXT_SOFTBR, " ", // TODO: See if there is a way to properly imitate this in qtf...
(const char*) ~sDeQtfMd(String((const char*) text, size)));
return 0;
};
#ifdef _DEBUG
parser.debug_log = [](const char *msg, void *udata) -> void
{
LLOG(msg);
};
#endif
sMarkdownContext ctx;
int rc = md_parse((const MD_CHAR*)~mdtext, (MD_SIZE) mdtext.GetLength(), &parser, &ctx);
return rc ? String::GetVoid() : ctx.ToQtf(); // A void String signals that parsing failed or was aborted.
}
}
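
The escaping rule in sDeQtfMd() can be restated in standard C++ to make its effect easier to see. This is only a sketch under stated assumptions: `EscapeForQtf` is a hypothetical name, `std::string` replaces the U++ String/StringBuffer pair, and explicit ASCII range checks stand in for U++'s IsDigit() and IsAlpha(). Unlike the original, which stops at the first zero byte, it processes the whole string.

```cpp
#include <string>

// Sketch of the sDeQtfMd() rule: '\n' becomes '&', printable ASCII
// punctuation gets a '`' prefix, everything else is copied unchanged.
std::string EscapeForQtf(const std::string& s)
{
    std::string r;
    for (unsigned char c : s) {
        if (c == '\n') {
            r += '&';                      // same substitution as in sDeQtfMd()
            continue;
        }
        bool alnum = (c >= '0' && c <= '9') ||
                     (c >= 'A' && c <= 'Z') ||
                     (c >= 'a' && c <= 'z');
        if (c > ' ' && c < 128 && !alnum)
            r += '`';                      // escape ASCII punctuation
        r += static_cast<char>(c);         // bytes >= 128 (e.g. UTF-8) pass through
    }
    return r;
}
```

For instance, `EscapeForQtf("[1] ok")` yields `` `[1`] ok ``: the brackets are escaped, while the digit, the space and the letters are left alone.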


@ -0,0 +1,37 @@
#ifndef Upp_Markdown_h
#define Upp_Markdown_h
#include <Core/Core.h>
#include "MD4C/md4c.h"
namespace Upp {
class MarkdownConverter {
public:
MarkdownConverter() : flags(MD_DIALECT_COMMONMARK) {}
MarkdownConverter& CollapseWhitespaces(bool b = true) { if(b) flags |= MD_FLAG_COLLAPSEWHITESPACE; else flags &= ~MD_FLAG_COLLAPSEWHITESPACE; return *this; }
MarkdownConverter& NoIndentedCodeblocks(bool b = true) { if(b) flags |= MD_FLAG_NOINDENTEDCODEBLOCKS; else flags &= ~MD_FLAG_NOINDENTEDCODEBLOCKS; return *this; }
MarkdownConverter& Tables(bool b = true) { if(b) flags |= MD_FLAG_TABLES; else flags &= ~MD_FLAG_TABLES; return *this; }
MarkdownConverter& WikiLinks(bool b = true) { if(b) flags |= MD_FLAG_WIKILINKS; else flags &= ~MD_FLAG_WIKILINKS; return *this; }
MarkdownConverter& Strikeout(bool b = true) { if(b) flags |= MD_FLAG_STRIKETHROUGH; else flags &= ~MD_FLAG_STRIKETHROUGH; return *this; }
MarkdownConverter& Underline(bool b = true) { if(b) flags |= MD_FLAG_UNDERLINE; else flags &= ~MD_FLAG_UNDERLINE; return *this; }
MarkdownConverter& NoHtmlBlocks(bool b = true) { if(b) flags |= MD_FLAG_NOHTMLBLOCKS; else flags &= ~MD_FLAG_NOHTMLBLOCKS; return *this; }
MarkdownConverter& NoHtmlSpans(bool b = true) { if(b) flags |= MD_FLAG_NOHTMLSPANS; else flags &= ~MD_FLAG_NOHTMLSPANS; return *this; }
MarkdownConverter& NoHtml(bool b = true) { if(b) flags |= MD_FLAG_NOHTML; else flags &= ~MD_FLAG_NOHTML; return *this; }
MarkdownConverter& PermissiveAtxHeaders(bool b = true) { if(b) flags |= MD_FLAG_PERMISSIVEATXHEADERS; else flags &= ~MD_FLAG_PERMISSIVEATXHEADERS; return *this; }
MarkdownConverter& PermissiveUrlAutolinks(bool b = true) { if(b) flags |= MD_FLAG_PERMISSIVEURLAUTOLINKS; else flags &= ~MD_FLAG_PERMISSIVEURLAUTOLINKS; return *this; }
MarkdownConverter& PermissiveWWWAutolinks(bool b = true) { if(b) flags |= MD_FLAG_PERMISSIVEWWWAUTOLINKS; else flags &= ~MD_FLAG_PERMISSIVEWWWAUTOLINKS; return *this; }
MarkdownConverter& PermissiveAutolinks(bool b = true) { if(b) flags |= MD_FLAG_PERMISSIVEAUTOLINKS; else flags &= ~MD_FLAG_PERMISSIVEAUTOLINKS; return *this; }
MarkdownConverter& CommonMarkDialect() { flags = MD_DIALECT_COMMONMARK; return *this; }
MarkdownConverter& GitHubDialect() { flags = MD_DIALECT_GITHUB; return *this; }
String ToQtf(const String& markdown_text);
private:
dword flags;
};
}
#endif
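
Every setter in MarkdownConverter repeats the same set-or-clear expression on `flags`. One possible way to fold that into a single private helper is sketched below. `ConverterSketch`, `Set` and `GetFlags` are hypothetical names that are not part of this commit; the two flag values are copied from md4c.h.

```cpp
#include <cstdint>

// Flag values as defined in md4c.h (MD_FLAG_TABLES, MD_FLAG_STRIKETHROUGH).
constexpr std::uint32_t FLAG_TABLES        = 0x0100;
constexpr std::uint32_t FLAG_STRIKETHROUGH = 0x0200;

// Hypothetical miniature of MarkdownConverter, reduced to the flag handling.
class ConverterSketch {
public:
    ConverterSketch& Tables(bool b = true)    { return Set(FLAG_TABLES, b); }
    ConverterSketch& Strikeout(bool b = true) { return Set(FLAG_STRIKETHROUGH, b); }
    std::uint32_t    GetFlags() const         { return flags; }

private:
    // Set or clear 'mask', then return *this so that calls can be chained.
    ConverterSketch& Set(std::uint32_t mask, bool b)
    {
        if (b)
            flags |= mask;
        else
            flags &= ~mask;
        return *this;
    }

    std::uint32_t flags = 0;   // 0 corresponds to MD_DIALECT_COMMONMARK
};
```

The chaining works the same way as with the real class, which this commit uses in UppHubDlg::Readme() as `MarkdownConverter().Tables().ToQtf(s)`.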

12
uppsrc/plugin/md/issues Normal file

@ -0,0 +1,12 @@
# Markdown package (Scheduled for: Ultimate++ release 2021.1)
## TODO:
- Image handling is yet to be implemented.
- Task lists are yet to be implemented.
## Remaining issues:
- Block elements don't have configurable style.
- Page properties (margins, indentation, etc.) are not configurable.
- Soft break is not handled very well.

14
uppsrc/plugin/md/md.upp Normal file

@ -0,0 +1,14 @@
description "Markdown to QTF converter, based on MD4C library\377";
uses
Core;
file
Markdown.h,
Markdown.cpp,
Info readonly separator,
Copying,
issues,
Library readonly separator,
MD4C/md4c.h,
MD4C/md4c.c;