Hermann Frami

1679956080

Stitch Multiple Markdown Files Together into a Single Document

Stitchmd


Introduction

stitchmd is a tool that stitches several Markdown files together into one large Markdown document, making large documents easier to maintain.

It lets you define the layout of your final document in a summary file, which it then uses to stitch together and interlink the other Markdown files.

Flow diagram

See Getting Started for a tutorial, or Usage to start using it.

Features

Cross-linking: Recognizes cross-links between files and their headers and re-targets them for their new locations. This keeps your input and output files independently browsable on websites like GitHub.

Example

Input

[Install](install.md) the program.
See also, [Overview](#overview).

Output

[Install](#install) the program.
See also, [Overview](#overview).
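The rewriting behavior can be sketched in a few lines of Python (illustrative only, not stitchmd's actual implementation; the anchor mapping here is a hypothetical input):

```python
import re

def retarget_links(markdown: str, anchors: dict) -> str:
    """Rewrite links whose targets appear in `anchors` (a hypothetical
    mapping from included file path to in-document anchor)."""
    def repl(match):
        text, target = match.group(1), match.group(2)
        return "[{}]({})".format(text, anchors.get(target, target))
    return re.sub(r"\[([^\]]+)\]\(([^)]+)\)", repl, markdown)
```

For the input above, `retarget_links(text, {"install.md": "#install"})` rewrites the file link while leaving the already-internal `#overview` link untouched.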

Relative linking: Rewrites relative images and links to match their new location.

Example

Input

![Graph](images/graph.png)

Output

![Graph](docs/images/graph.png)
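The same idea can be sketched with path arithmetic (a sketch, assuming the source file lives in docs/ and the combined output is written at the repository root):

```python
import posixpath

def relocate(path: str, src_dir: str, out_dir: str) -> str:
    """Re-root a relative link: resolve it against the source file's
    directory, then express it relative to the output's directory."""
    absolute = posixpath.normpath(posixpath.join(src_dir, path))
    return posixpath.relpath(absolute, out_dir)
```

Under those assumptions, `relocate("images/graph.png", "docs", ".")` yields `"docs/images/graph.png"`, matching the example above.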

Header offsetting: Adjusts levels of all headings in included Markdown files based on the hierarchy in the summary file.

Example

Input

- [Introduction](intro.md)
  - [Installation](install.md)

Output

# Introduction

<!-- contents of intro.md -->

## Installation

<!-- contents of install.md -->
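The heading adjustment can be sketched in Python (illustrative; the real implementation operates on a parsed Markdown tree rather than regexes):

```python
import re

def offset_headings(markdown: str, offset: int) -> str:
    """Shift every ATX heading by `offset` levels, clamped to h1..h6."""
    def shift(match):
        level = min(max(len(match.group(1)) + offset, 1), 6)
        return "#" * level + match.group(2)
    return re.sub(r"^(#{1,6})( .*)$", shift, markdown, flags=re.M)
```

For example, `offset_headings("# Installation", 1)` produces `"## Installation"`, as in the output above.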

Use cases

The following is a non-exhaustive list of use cases where stitchmd may come in handy.

  • Maintaining a document with several collaborators with reduced risk of merge conflicts.
  • Divvying up a document between collaborators by ownership areas. Owners will work inside the documents or directories assigned to them.
  • Keeping a single-page and multi-page version of the same content.
  • Re-using documentation across multiple Markdown documents.
  • Preparing initial drafts of long-form content from an outline of smaller texts.

...and more. (Feel free to contribute a PR with your use case.)

Getting Started

This is a step-by-step tutorial to introduce stitchmd.

For details on how to use it, see Usage.

First, install stitchmd. If you have Go installed, this is as simple as:

go install go.abhg.dev/stitchmd@latest

For other installation methods, see the Installation section.

Create a couple of Markdown files. Feel free to open these up and add content to them.

echo 'Welcome to my program.' > intro.md
echo 'It has many features.' > features.md
echo 'Download it from GitHub.' > install.md

Alternatively, clone this repository and copy the doc folder.

Create a summary file defining the layout between these files.

cat > summary.md << EOF
- [Introduction](intro.md)
  - [Features](features.md)
- [Installation](install.md)
EOF

Run stitchmd on the summary.

stitchmd summary.md

The output should look similar to the following:

- [Introduction](#introduction)
  - [Features](#features)
- [Installation](#installation)

# Introduction

Welcome to my program.

## Features

It has many features.

# Installation

Download it from GitHub.

Each included document got its own heading matching its level in the summary file.

Next, open up intro.md and add the following to the bottom:

See [installation](install.md) for instructions.

If you run stitchmd now, the output should change slightly.

- [Introduction](#introduction)
  - [Features](#features)
- [Installation](#installation)

# Introduction

Welcome to my program.
See [installation](#installation) for instructions.

## Features

It has many features.

# Installation

Download it from GitHub.

stitchmd recognized the link from intro.md to install.md, and updated it to point to the # Installation header instead.

Next steps: Play around with the document:

Alter the hierarchy further.

Add an item to the list without a file:

- Overview
  - [Introduction](intro.md)
  - [Features](features.md)

Add sections or subsections to a document and link to those.

[Build from source](install.md#build-from-source).

Add a heading to the summary.md:

# my awesome program

- [Introduction](#introduction)
  - [Features](#features)
- [Installation](#installation)

Installation

You can install stitchmd from pre-built binaries or from source.

Binary installation

Pre-built binaries of stitchmd are available for several platforms through a few different channels.

Homebrew

If you use Homebrew on macOS or Linux, run the following command to install stitchmd:

brew install abhinav/tap/stitchmd

Arch Linux

If you use Arch Linux, install stitchmd from the AUR using the stitchmd-bin package.

git clone https://aur.archlinux.org/stitchmd-bin.git
cd stitchmd-bin
makepkg -si

If you use an AUR helper like yay, run the following command instead:

yay -S stitchmd-bin

GitHub Releases

For other platforms, download a pre-built binary from the Releases page and place it on your $PATH.

Install from source

To install stitchmd from source, install Go >= 1.20 and run:

go install go.abhg.dev/stitchmd@latest

Usage

stitchmd [OPTIONS] FILE

stitchmd accepts a single Markdown file as input. This file defines the layout you want in your combined document, and is referred to as the summary file.

For example:

# User Guide

- [Getting Started](getting-started.md)
    - [Installation](installation.md)
- [Usage](usage.md)
- [API](api.md)

# Appendix

- [How things work](implementation.md)
- [FAQ](faq.md)

The format of the summary file is specified in more detail in Syntax.

Given such a file as input, stitchmd will print a single Markdown file including the contents of all listed files inline.

Example output

The output of the input file above will be roughly in the following shape:

# User Guide

- [Getting Started](#getting-started)
    - [Installation](#installation)
- [Usage](#usage)
- [API](#api)

## Getting Started

<!-- contents of getting-started.md -->

### Installation

<!-- contents of installation.md -->

## Usage

<!-- contents of usage.md -->

## API

<!-- contents of api.md -->

# Appendix

- [How things work](#how-things-work)
- [FAQ](#faq)

## How things work

<!-- contents of implementation.md -->

## FAQ

<!-- contents of faq.md -->

Options

stitchmd supports the following options:

Read from stdin

Instead of reading from a specific file on disk, you can pass in '-' as the file name to read the summary from stdin.

cat summary.md | stitchmd -

Add a preface

-preface FILE

If this flag is specified, stitchmd will include the given file at the top of the output verbatim.

You can use this to add comments holding license headers or instructions for contributors.

For example:

cat > generated.txt <<EOF
<!-- This file was generated by stitchmd. DO NOT EDIT. -->

EOF
stitchmd -preface generated.txt summary.md

Offset heading levels

-offset N

stitchmd changes heading levels based on a few factors:

  • level of the section heading
  • position of the file in the hierarchy of that section
  • the file's own title heading

The -offset flag allows you to offset all these headings by a fixed value.

Example

Input

# User Guide

- [Introduction](intro.md)
  - [Installation](install.md)
stitchmd -offset 1 summary.md

Output

## User Guide

- [Introduction](#introduction)
  - [Installation](#installation)

### Introduction

<!-- ... -->

### Installation

<!-- ... -->

Use a negative value to reduce heading levels.

Example

Input

# User Guide

- [Introduction](intro.md)
  - [Installation](install.md)
stitchmd -offset -1 summary.md

Output

# User Guide

- [Introduction](#introduction)
  - [Installation](#installation)

# Introduction

<!-- ... -->

## Installation

<!-- ... -->

Disable the TOC

-no-toc

stitchmd reproduces the original table of contents in the output. You can change this with the -no-toc flag.

stitchmd -no-toc summary.md

This will omit the item listing under each section.

Example

Input

- [Introduction](intro.md)
- [Installation](install.md)
stitchmd -no-toc summary.md

Output

# Introduction

<!-- .. -->

# Installation

<!-- .. -->

Write to file

-o FILE

stitchmd writes its output to stdout by default. Use the -o option to write to a file instead.

stitchmd -o README.md summary.md

Change the directory

-C DIR

Paths in the summary file are considered relative to the summary file.

Use the -C flag to change the directory that stitchmd considers itself to be in.

stitchmd -C docs summary.md

This is especially useful if your summary file is passed via stdin:

... | stitchmd -C docs -

Report a diff

-d

stitchmd normally writes output directly to the file if you pass in a filename with -o. Use the -d flag to instead have it report what would change in the output file without actually changing it.

stitchmd -d -o README.md # ...

This can be useful for lint checks and similar, or as a dry run to find out what would change.

Syntax

Although the summary file is Markdown, stitchmd expects it in a very specific format.

The summary file consists of one or more sections. Each section has a title specified by a Markdown heading.

Example

# Section 1

<!-- contents of section 1 -->

# Section 2

<!-- contents of section 2 -->

If there's only one section, the section title may be omitted.

File = Section | (SectionTitle Section)+

Each section contains a Markdown list defining one or more list items. List items are one of the following, and may optionally have another list nested inside them to indicate a hierarchy.

Links to local Markdown files: These files will be included into the output, with their contents adjusted to match their place.

  • Example
- [Overview](overview.md)
- [Getting Started](start/install.md)

Plain text: These become standalone headers in the output and must have a nested list.

  • Example
- Introduction
    - [Overview](overview.md)
    - [Getting Started](start/install.md)
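The item structure above can be sketched as a tiny indentation-based parser (a sketch only, handling a simplified line format; stitchmd itself parses real Markdown):

```python
import re

LINK = re.compile(r"-\s+\[([^\]]+)\]\(([^)]+)\)")
TEXT = re.compile(r"-\s+(.+)")

def parse_summary(text: str):
    """Parse an indented summary list into nested
    {"title", "path", "children"} dicts."""
    root = {"children": []}
    stack = [(-1, root)]  # (indent, node) pairs, outermost first
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        m = LINK.match(line.lstrip())
        if m:  # a link item: include this file
            node = {"title": m.group(1), "path": m.group(2), "children": []}
        else:  # a plain-text item: becomes a standalone header
            node = {"title": TEXT.match(line.lstrip()).group(1),
                    "path": None, "children": []}
        while stack and stack[-1][0] >= indent:
            stack.pop()  # close deeper or sibling items
        stack[-1][1]["children"].append(node)
        stack.append((indent, node))
    return root["children"]
```

For example, parsing the plain-text item above yields an "Introduction" node with no path and two file children nested under it.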

Items listed in a section are rendered together under that section. A section is rendered in its entirety before the listing for the next section begins.

Example

Input

# Section 1

- [Item 1](item-1.md)
- [Item 2](item-2.md)

# Section 2

- [Item 3](item-3.md)
- [Item 4](item-4.md)

Output

# Section 1

- [Item 1](#item-1)
- [Item 2](#item-2)

## Item 1

<!-- ... -->

## Item 2

<!-- ... -->

# Section 2

- [Item 3](#item-3)
- [Item 4](#item-4)

## Item 3

<!-- ... -->

## Item 4

<!-- ... -->

The heading level of a section determines the minimum heading level for included documents: one plus the section level.

Example

Input

## User Guide

- [Introduction](intro.md)

Output

## User Guide

- [Introduction](#introduction)

### Introduction

<!-- ... -->
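That rule can be stated as a one-line function (illustrative; the names are my own):

```python
def min_heading_level(section_level: int, depth: int = 0) -> int:
    """Heading level for an included file: one more than its section's
    heading level, plus its nesting depth in the summary list,
    capped at h6."""
    return min(section_level + 1 + depth, 6)
```

For the "## User Guide" section (level 2), a top-level item renders as an h3, matching the output above.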

Page Titles

All pages included with stitchmd are assigned a title.

By default, the title is the name of the item in the summary. For example, given the following:

<!-- summary.md -->
- [Introduction](intro.md)

<!-- intro.md -->
Welcome to Foo.

The title for intro.md is "Introduction".

Output

- [Introduction](#introduction)

# Introduction

Welcome to Foo.

A file may specify its own title by adding a heading that meets the following rules:

  • it's a level 1 heading
  • it's the first item in the file
  • there are no other level 1 headings in the file

If a file specifies its own title, this does not affect its name in the summary list. This allows the use of short link titles for long headings.

For example, given the following:

<!-- summary.md -->
- [Introduction](intro.md)

<!-- intro.md -->
# Introduction to Foo

Welcome to Foo.

The title for intro.md will be "Introduction to Foo".

Output

- [Introduction](#introduction-to-foo)

# Introduction to Foo

Welcome to Foo.
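The generated anchors follow GitHub-style slugs. A rough approximation in Python (an assumption about the exact algorithm; edge cases such as duplicate headings are ignored):

```python
import re

def heading_anchor(heading: str) -> str:
    """Approximate a GitHub-style anchor: lowercase, strip punctuation,
    spaces become hyphens."""
    slug = re.sub(r"[^\w\- ]", "", heading.strip().lower())
    return "#" + slug.replace(" ", "-")
```

For example, `heading_anchor("Introduction to Foo")` gives `"#introduction-to-foo"`, matching the link in the output above.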

Download Details:

Author: Abhinav
Source Code: https://github.com/abhinav/stitchmd 
License: MIT license


Desmond Gerber

1676672640

How to Fix: Class "DOMDocument" Not Found in Laravel

In this article, we will see how to fix the Class "DOMDocument" not found error in Laravel. This error can occur on PHP 7 and PHP 8, including when running php artisan commands. The DOMDocument class represents an entire HTML or XML document.

We will also see how to install the php-dom and php-xml extensions on Ubuntu using the terminal.

Run the following command to install php-dom and php-xml.

sudo apt-get install php-dom
sudo apt-get install php-xml

Install php-dom in PHP 8.2

Run the following command and install php8.2-dom and php8.2-xml.

sudo apt-get install php8.2-dom
sudo apt-get install php8.2-xml

Install php-dom in PHP 8.1

Run the following command and install php8.1-dom and php8.1-xml.

sudo apt-get install php8.1-dom
sudo apt-get install php8.1-xml

Install php-dom in PHP 8.0

Run the following command and install php8.0-dom and php8.0-xml.

sudo apt-get install php8.0-dom
sudo apt-get install php8.0-xml

Install php-dom in PHP 7.4

Run the following command and install php7.4-dom and php7.4-xml.

sudo apt-get install php7.4-dom
sudo apt-get install php7.4-xml

Install php-dom in PHP 7.3

Run the following command and install php7.3-dom and php7.3-xml.

sudo apt-get install php7.3-dom
sudo apt-get install php7.3-xml

Install php-dom in PHP 7.2

Run the following command and install php7.2-dom and php7.2-xml.

sudo apt-get install php7.2-dom
sudo apt-get install php7.2-xml

Original article source at: https://websolutionstuff.com/


Add Margin Notes to a LibreOffice Document

With margin notes, you can provide notes for your reader, such as extra context, errata, or pointers to other material.

I use LibreOffice Writer on Linux to write my documentation, including client proposals, training materials, and books. Sometimes when I work on a very technical document, I might need to add a margin note to a document, to provide extra context or to make some other note about the text.

LibreOffice Writer doesn't have a "margin note" feature but instead implements margin notes as frames. Here is how I add margin notes as frames in a LibreOffice document.

Set page margins accordingly

If you want to use margin notes in a document, you'll need a wider margin than the standard 1-inch on the left and right sides of the page. This will accommodate placing the frame completely in the margin, so it becomes a true margin note. In my documents, I don't want the margin to become too wide, so I usually increase the left and right page margins to something like 1.25" when I need to use margin notes.

You can set the page margin by using the Styles pop-out selection box. If you don't see the Styles selection box, you can activate it by using View → Styles in the menus. The default keyboard shortcut for this is F11.

Image of the styles tab in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

Select the Page style, and right-click on the Default Page Style entry to modify it.

Image of the page style tab in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

You can use this dialog box to change the page size and the margins. Where you might normally set a 1-inch margin on the left, right, top, and bottom for your documents, instead set the margins to 1.25" on the left and right, and 1-inch on the top and bottom. The extra space on the left and right will provide a little extra room for the page margins without adding too much blank space.

Image of a page with margins.

(Jim Hall, CC BY-SA 4.0)

When I'm writing a print edition of a book, I need to prepare the document for double-sided printing, with Left and Right pages. That means my materials need to use the Left Page and Right Page styles for the body pages, and the First Page style for the cover page. So after I modify the document's default page style, I need to do the same for the First Page, Left Page, and Right Page styles. The first time you edit these page styles, they will inherit the page size from the default, so you only need to modify the page margins to your preferred style. For documents with Left and Right pages, you might instead provide the 1.25" margin only on the "outside" margins: the right side for Right pages, and the left side for Left pages.

Add a note in a frame

Wherever you need to add a margin note, insert a frame using the Insert → Frame → Frame… menu action. This brings up a dialog box where you can edit the settings of the frame.

Image showing where you can edit in frame in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

Set the width of the frame to the width of the outside page margin, which you earlier set to 1.25". Click the box for Mirror on even pages then set the horizontal position to Outside for the Entire page. Change the anchor to the character, which will lock the margin note to where it appears in the document, then set the vertical position to Top for the Character, which will start the margin note on the same line as where you insert the frame. I also select Keep inside text boundaries so my frame stays within the printed page.

Image showing where you edit a frame in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

Under the Wrap tab, set the spacing to zero on the left, right, top, and bottom.

Image showing frame wrap in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

Finally, select the Borders tab and remove the border by clicking the No Borders preset. Unselect the Synchronize check box, and set the left and right padding to .25", and the top and bottom padding to zero. The left and right padding is important. This provides some extra white space around your margin note so it isn't jammed right up against the text body, and it isn't too close to the page edge.

Frame borders

(Jim Hall, CC BY-SA 4.0)

Click OK to add the empty frame to your document, anchored to the character position where you inserted the frame.

Image showing an added frame in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

The frame padding changes the effective width of the margin note. In this example, you set the outside page margin and frame width to 1.25" and .25" for left and right padding within the frame. That leaves .75" for the margin note itself. This narrow space could be too tight for very long margin notes, depending on what text you need to write here. If you find you need more room for your margin notes, you might instead set the outside page margin and frame width to 1.5" which gives 1-inch margin notes.

Once you've added the frame, you can click inside the frame to type your margin note.

Image showing where to type your note in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

Change the look with Styles

When you use frames to add margin notes, LibreOffice uses the Frame Contents paragraph style for your text. You can modify this style to change how margin notes are displayed.

By default, the Frame Contents style uses the same font, font style, and font size as your Default Paragraph Style. To change just the characteristics of the margin note, right-click on the Frame Contents style in the Style Selection selection box, and click Modify. This brings up a new dialog box where you can change the style of the margin notes.

Image showing how to change the style of margin notes in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

In my printed documents, I use Crimson Pro at 12-point size as my default paragraph font. For my margin notes, I prefer to reduce the font size to 10-point and change the style to italics. This makes the margin note stand apart from the rest of the text, yet remains easy to read in print.

Image showing the changed style of margin notes in LibreOffice.

(Jim Hall, CC BY-SA 4.0)

Copy and paste to create new frames

Inserting a frame as a margin note involves several steps, and it can be a pain to repeat each of them whenever I need to create a new margin note. I save a few steps by copying an existing margin note frame, and pasting it into my document at a new position. Updating the margin note requires selecting the text within the frame and typing a new note.

Because you set up the margin note to appear on the outside of the page area, the copied and pasted margin note will correctly appear in the left margin on Left pages, and in the right margin on Right pages.

Adding margin notes in a document is easy with LibreOffice. With margin notes, you can provide notes for your reader, such as extra context, errata, or pointers to other material. Used sparingly, margin notes can be a welcome addition to your documents.

Original article source at: https://opensource.com/


How to Hide Create Document Button

Requirement

Hide Create Document (Document CorePack) button from opportunity entity.

Details

The Create Document button is added to entities by the Document CorePack add-on, but sometimes we want to hide it for specific entities. For example, let's say we want to hide it in the opportunity entity. If you try to find this button using Ribbon Workbench, you won't see it there. To hide it, you need to change a setting in Document CorePack, following these steps:

  • Navigate to Settings -> DocumentsCorePack
  • Click on General Settings and expand Dialog Settings

Hide Create Document button

  • Search for the ‘Create Document’ button Entity Configuration and click the Change button
  • Search for Opportunity and click the Delete button

Hide Create Document button

In case you want to add the Create Document button to an entity, you can use the Add button and select your entity.

  • Click the Save Configuration button to save your changes.

Now we should no longer be able to see the Create Document button in the opportunity entity.

Summary

This is how we can hide or add the 'Create Document' button for an entity.

Hope it will help someone!

Keep learning and keep sharing!

Original article source at: https://www.c-sharpcorner.com/


Learn Document with BookStack, an Open Source Confluence Alternative

BookStack is an open source, web-based documentation system, that allows you to create a structured knowledge store for personal, team, or company use.

BookStack focuses on ease of use and design to provide an experience suitable for an audience with, potentially, mixed skills in technology. It's built upon the PHP framework Laravel, with MySQL or MariaDB used as a datastore.

I built BookStack after attempting to find a documentation or wiki system for my workplace. Confluence was the closest option to suit my requirements, but the user-based pricing introduced a barrier. The closed nature of Confluence also raised questions about the longevity of the documentation I'd be building. In the end, I decided to build my own platform to suit my needs. I released it under the MIT license to give back to the open source community that I'd come to love and benefit from over the years.

Content hierarchy and organization options

To keep things familiar and intuitive, BookStack makes use of real-world book terms to describe its organization structure. Documentation content is created as a "Page":

  • Pages belong to a specific "Book".
  • Within a Book, Pages can optionally be grouped up into "Chapters".
  • As your documentation grows, you can then use "Shelves" to categorize Books, with Books being able to be part of multiple shelves if needed.

This structure sits at the heart of BookStack, and can often be the love-it-or-hate-it deciding aspect of whether BookStack is suitable for your use case.

Image of a view of a "Book" in BookStack, with child chapters and pages shown within.

(Dan Brown, CC BY-SA 4.0)

Upon this core hierarchy, BookStack also provides tagging, user favorites, and advanced search capabilities to ensure content remains discoverable.

Writing documentation

The primary method of writing documentation in BookStack is through the use of its what-you-see-is-what-you-get (WYSIWYG) editor, which makes use of the open source Tiny project. This editor provides a range of content formats including:

  • Various header levels
  • Code blocks
  • Collapsible blocks
  • Tables
  • Images
  • Links
  • iFrame embeds
  • Alert callouts
  • Bullet, numbered and tasks lists
  • Drawings (through integration with the open source diagrams.net)

An image of editing a "Page" using the WYSIWYG editor.

(Dan Brown, CC BY-SA 4.0)

If you prefer Markdown, you can use the built-in Markdown editor, which provides a live preview and supports the same feature set as the WYSIWYG editor. If permission allows, you can even jump between these editor options depending on the page you're editing.

How your data is stored

Documentation is stored within a MySQL or MariaDB database in a relatively simple HTML format, in addition to the original Markdown content if Markdown was used. A lot of design and development decisions have been made to keep this HTML format simplistic. It uses plain standard HTML elements where possible, to ensure raw documentation content remains open and portable.

Uploaded images, attachments, and created drawings are saved on the local filesystem but can optionally be stored in an s3-compatible datastore like the open source MinIO.

To keep your content accessible, there are built-in options to export content as PDF, HTML, plain text, or Markdown. For external consumption, there's an HTTP REST API and a webhook system. In terms of extension, a "logical theme system" allows running custom PHP code on a wide range of system events.

Ready for business

BookStack comes with a range of features to support business environments. Support for a range of authentication options is built in, including SAML2, OpenID Connect, and LDAP, allowing easy single-sign-on usage with platforms such as KeyCloak. MFA options are available and can be mandated based upon role. An audit log provides full visibility of modification activities across an instance.

Image of the BookStack audit log activity list.

(Dan Brown, CC BY-SA 4.0)

A full role-based permission system provides administrators full control over create, view, update, and delete actions of system content. This allows per-role system defaults, with options to set custom permissions on a per-hierarchy item basis.

A community of support

After more than 7 years of activity, the BookStack community has grown, with various avenues for discussion and support.

If you want to play with BookStack, you can try it out on our demo site. To learn how to set up your own instance, visit the installation page of our documentation.

Original article source at: https://opensource.com/


Bring The Data and Document in MarkLogic

MarkLogic is an Enterprise NoSQL database that brings all the features you need into one unified system. It can bring multiple heterogeneous data sources into a single platform architecture, allowing for homogeneous data access. To bring in the data, we need to insert documents; on the Query Console, we can then run queries as required.

Bringing in the documents

There are many ways to insert documents into a MarkLogic database. Available interfaces include:

  • MarkLogic Data Hub
  • MarkLogic Content Pump
  • Apache Nifi
  • REST API
  • XQuery functions
  • MuleSoft
  • Data Movement SDK (Java API)
  • Node.js API
  • JavaScript functions
  • Apache Kafka
  • Content Processing Framework
  • XCC
  • WebDAV

Explanation of available interfaces

  • MarkLogic Data Hub: The MarkLogic Data Hub is open-source software used to ingest data from one or more different sources. It is used to import the data as well as harmonize it.
  • MarkLogic Content Pump: A command-line tool for bulk loading billions of documents into a MarkLogic database, and for extracting or copying content. It makes workflow integration very easy.
  • Apache Nifi: Useful when someone needs to ingest data from a relational database into a MarkLogic database.
  • REST API: Provides a programming-language-agnostic way to write a document to MarkLogic.
  • XQuery functions: Used to write documents to a MarkLogic database, either from the Query Console or from an XQuery application.
  • MuleSoft: The MarkLogic connector for MuleSoft is used to bring data from various other systems into the MarkLogic database.


  • Data Movement SDK (Java API): Included in the Java API, the Data Movement SDK provides classes that Java developers can use to import and transform documents.
  • Node.js API: Provides Node.js classes that developers can use to write documents to a MarkLogic database from their Node.js code.
  • JavaScript functions: Used to write documents through the Query Console or from a JavaScript application.
  • Apache Kafka: When we need to stream data into the database, we can do it using the Kafka MarkLogic connector.
  • Content Processing Framework: A pipeline framework for making changes to documents as they are being loaded into the database, such as enriching the data or transforming PDF or MS Office documents into XML.
  • XML Contentbase Connector (XCC): Useful if you need to create a multi-tier application that communicates with MarkLogic.
  • WebDAV: Web Distributed Authoring and Versioning, used to drag and drop documents into the MarkLogic database.

Inserting the document using the Query Console

To insert a document using the Query Console, JavaScript or XQuery can be used. The xdmp.documentLoad() function loads a document from the file system into a database.

declareUpdate();

xdmp.documentLoad("path of the source file");

When running a JavaScript expression that makes changes to a database, you need to use the declareUpdate() function.

The xdmp.documentInsert() function is used to write a document into a database.

declareUpdate();

xdmp.documentInsert('/employee1.json', {
  'title': 'Knoldus',
  'description': 'Amazing place to work'
});

Uniform Resource Identifier (URI)

To address any document in a MarkLogic database, it is necessary that each document has a unique URI.

/products/1.json

The URI does not refer to the physical location of a document in the database; it provides a unique name for referencing the document.

Deleting the documents

  • The clear button in the admin interface can be used to delete all the documents in a database.
  • To delete an individual document, the xdmp.documentDelete() function can be used.

declareUpdate();

xdmp.documentDelete('/employee1.json');

Accessing a Document

To read a document in a database, use the cts.doc() function.

cts.doc('/employee1.json');

Modifying Documents

Documents can be modified via various APIs and tools, including Data Hub, JavaScript, XQuery, etc.

JavaScript functions for updating documents include:

xdmp.nodeReplace()

xdmp.nodeInsert()

xdmp.nodeInsertBefore()

xdmp.nodeInsertAfter()

xdmp.nodeDelete()

Conclusion

MarkLogic is a NoSQL database with many facilities, and this blog should help anyone who wants to insert data into it. After insertion, documents can be accessed and modified using the predefined functions described above.


Original article source at: https://blog.knoldus.com/

Lawrence Lesch

1668094080

PDF-lib: Create and Modify PDF Documents in any JavaScript Environment

PDF-lib

Create and modify PDF documents in any JavaScript environment.

Designed to work in any modern JavaScript runtime. Tested in Node, Browser, Deno, and React Native environments.

Learn more at pdf-lib.js.org

Features

  • Create new PDFs
  • Modify existing PDFs
  • Create forms
  • Fill forms
  • Flatten forms
  • Add Pages
  • Insert Pages
  • Remove Pages
  • Copy pages between PDFs
  • Draw Text
  • Draw Images
  • Draw PDF Pages
  • Draw Vector Graphics
  • Draw SVG Paths
  • Measure width and height of text
  • Embed Fonts (supports UTF-8 and UTF-16 character sets)
  • Set document metadata
  • Read document metadata
  • Set viewer preferences
  • Read viewer preferences
  • Add attachments

Motivation

pdf-lib was created to address the JavaScript ecosystem's lack of robust support for PDF manipulation (especially for PDF modification).

Two of pdf-lib's distinguishing features are:

  1. Supporting modification (editing) of existing documents.
  2. Working in all JavaScript environments - not just in Node or the Browser.

There are other good open source JavaScript PDF libraries available. However, most of them can only create documents, they cannot modify existing ones. And many of them only work in particular environments.

Usage Examples

Create Document

This example produces this PDF.

Try the JSFiddle demo

import { PDFDocument, StandardFonts, rgb } from 'pdf-lib'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Embed the Times Roman font
const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman)

// Add a blank page to the document
const page = pdfDoc.addPage()

// Get the width and height of the page
const { width, height } = page.getSize()

// Draw a string of text toward the top of the page
const fontSize = 30
page.drawText('Creating PDFs in JavaScript is awesome!', {
  x: 50,
  y: height - 4 * fontSize,
  size: fontSize,
  font: timesRomanFont,
  color: rgb(0, 0.53, 0.71),
})

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Modify Document

This example produces this PDF (when this PDF is used for the existingPdfBytes variable).

Try the JSFiddle demo

import { degrees, PDFDocument, rgb, StandardFonts } from 'pdf-lib';

// This should be a Uint8Array or ArrayBuffer
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const existingPdfBytes = ...

// Load a PDFDocument from the existing PDF bytes
const pdfDoc = await PDFDocument.load(existingPdfBytes)

// Embed the Helvetica font
const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica)

// Get the first page of the document
const pages = pdfDoc.getPages()
const firstPage = pages[0]

// Get the width and height of the first page
const { width, height } = firstPage.getSize()

// Draw a string of text diagonally across the first page
firstPage.drawText('This text was added with JavaScript!', {
  x: 5,
  y: height / 2 + 300,
  size: 50,
  font: helveticaFont,
  color: rgb(0.95, 0.1, 0.1),
  rotate: degrees(-45),
})


// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Create Form

This example produces this PDF.

Try the JSFiddle demo

See also Creating and Filling Forms

import { PDFDocument } from 'pdf-lib'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Add a blank page to the document
const page = pdfDoc.addPage([550, 750])

// Get the form so we can add fields to it
const form = pdfDoc.getForm()

// Add the superhero text field and description
page.drawText('Enter your favorite superhero:', { x: 50, y: 700, size: 20 })

const superheroField = form.createTextField('favorite.superhero')
superheroField.setText('One Punch Man')
superheroField.addToPage(page, { x: 55, y: 640 })

// Add the rocket radio group, labels, and description
page.drawText('Select your favorite rocket:', { x: 50, y: 600, size: 20 })

page.drawText('Falcon Heavy', { x: 120, y: 560, size: 18 })
page.drawText('Saturn IV', { x: 120, y: 500, size: 18 })
page.drawText('Delta IV Heavy', { x: 340, y: 560, size: 18 })
page.drawText('Space Launch System', { x: 340, y: 500, size: 18 })

const rocketField = form.createRadioGroup('favorite.rocket')
rocketField.addOptionToPage('Falcon Heavy', page, { x: 55, y: 540 })
rocketField.addOptionToPage('Saturn IV', page, { x: 55, y: 480 })
rocketField.addOptionToPage('Delta IV Heavy', page, { x: 275, y: 540 })
rocketField.addOptionToPage('Space Launch System', page, { x: 275, y: 480 })
rocketField.select('Saturn IV')

// Add the gundam check boxes, labels, and description
page.drawText('Select your favorite gundams:', { x: 50, y: 440, size: 20 })

page.drawText('Exia', { x: 120, y: 400, size: 18 })
page.drawText('Kyrios', { x: 120, y: 340, size: 18 })
page.drawText('Virtue', { x: 340, y: 400, size: 18 })
page.drawText('Dynames', { x: 340, y: 340, size: 18 })

const exiaField = form.createCheckBox('gundam.exia')
const kyriosField = form.createCheckBox('gundam.kyrios')
const virtueField = form.createCheckBox('gundam.virtue')
const dynamesField = form.createCheckBox('gundam.dynames')

exiaField.addToPage(page, { x: 55, y: 380 })
kyriosField.addToPage(page, { x: 55, y: 320 })
virtueField.addToPage(page, { x: 275, y: 380 })
dynamesField.addToPage(page, { x: 275, y: 320 })

exiaField.check()
dynamesField.check()

// Add the planet dropdown and description
page.drawText('Select your favorite planet*:', { x: 50, y: 280, size: 20 })

const planetsField = form.createDropdown('favorite.planet')
planetsField.addOptions(['Venus', 'Earth', 'Mars', 'Pluto'])
planetsField.select('Pluto')
planetsField.addToPage(page, { x: 55, y: 220 })

// Add the person option list and description
page.drawText('Select your favorite person:', { x: 50, y: 180, size: 18 })

const personField = form.createOptionList('favorite.person')
personField.addOptions([
  'Julius Caesar',
  'Ada Lovelace',
  'Cleopatra',
  'Aaron Burr',
  'Mark Antony',
])
personField.select('Ada Lovelace')
personField.addToPage(page, { x: 55, y: 70 })

// Just saying...
page.drawText(`* Pluto should be a planet too!`, { x: 15, y: 15, size: 15 })

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Fill Form

This example produces this PDF (when this PDF is used for the formPdfBytes variable, this image is used for the marioImageBytes variable, and this image is used for the emblemImageBytes variable).

Try the JSFiddle demo

See also Creating and Filling Forms

import { PDFDocument } from 'pdf-lib'

// These should be Uint8Arrays or ArrayBuffers
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const formPdfBytes = ...
const marioImageBytes = ...
const emblemImageBytes = ...

// Load a PDF with form fields
const pdfDoc = await PDFDocument.load(formPdfBytes)

// Embed the Mario and emblem images
const marioImage = await pdfDoc.embedPng(marioImageBytes)
const emblemImage = await pdfDoc.embedPng(emblemImageBytes)

// Get the form containing all the fields
const form = pdfDoc.getForm()

// Get all fields in the PDF by their names
const nameField = form.getTextField('CharacterName 2')
const ageField = form.getTextField('Age')
const heightField = form.getTextField('Height')
const weightField = form.getTextField('Weight')
const eyesField = form.getTextField('Eyes')
const skinField = form.getTextField('Skin')
const hairField = form.getTextField('Hair')

const alliesField = form.getTextField('Allies')
const factionField = form.getTextField('FactionName')
const backstoryField = form.getTextField('Backstory')
const traitsField = form.getTextField('Feat+Traits')
const treasureField = form.getTextField('Treasure')

const characterImageField = form.getButton('CHARACTER IMAGE')
const factionImageField = form.getTextField('Faction Symbol Image')

// Fill in the basic info fields
nameField.setText('Mario')
ageField.setText('24 years')
heightField.setText(`5' 1"`)
weightField.setText('196 lbs')
eyesField.setText('blue')
skinField.setText('white')
hairField.setText('brown')

// Fill the character image field with our Mario image
characterImageField.setImage(marioImage)

// Fill in the allies field
alliesField.setText(
  [
    `Allies:`,
    `  • Princess Daisy`,
    `  • Princess Peach`,
    `  • Rosalina`,
    `  • Geno`,
    `  • Luigi`,
    `  • Donkey Kong`,
    `  • Yoshi`,
    `  • Diddy Kong`,
    ``,
    `Organizations:`,
    `  • Italian Plumbers Association`,
  ].join('\n'),
)

// Fill in the faction name field
factionField.setText(`Mario's Emblem`)

// Fill the faction image field with our emblem image
factionImageField.setImage(emblemImage)

// Fill in the backstory field
backstoryField.setText(
  `Mario is a fictional character in the Mario video game franchise, owned by Nintendo and created by Japanese video game designer Shigeru Miyamoto. Serving as the company's mascot and the eponymous protagonist of the series, Mario has appeared in over 200 video games since his creation. Depicted as a short, pudgy, Italian plumber who resides in the Mushroom Kingdom, his adventures generally center upon rescuing Princess Peach from the Koopa villain Bowser. His younger brother and sidekick is Luigi.`,
)

// Fill in the traits field
traitsField.setText(
  [
    `Mario can use three basic power-ups:`,
    `  • the Super Mushroom, which causes Mario to grow larger`,
    `  • the Fire Flower, which allows Mario to throw fireballs`,
    `  • the Starman, which gives Mario temporary invincibility`,
  ].join('\n'),
)

// Fill in the treasure field
treasureField.setText(['• Gold coins', '• Treasure chests'].join('\n'))

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Flatten Form

This example produces this PDF (when this PDF is used for the formPdfBytes variable).

Try the JSFiddle demo

import { PDFDocument } from 'pdf-lib'

// This should be a Uint8Array or ArrayBuffer
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const formPdfBytes = ...

// Load a PDF with form fields
const pdfDoc = await PDFDocument.load(formPdfBytes)

// Get the form containing all the fields
const form = pdfDoc.getForm()

// Fill the form's fields
form.getTextField('Text1').setText('Some Text');

form.getRadioGroup('Group2').select('Choice1');
form.getRadioGroup('Group3').select('Choice3');
form.getRadioGroup('Group4').select('Choice1');

form.getCheckBox('Check Box3').check();
form.getCheckBox('Check Box4').uncheck();

form.getDropdown('Dropdown7').select('Infinity');

form.getOptionList('List Box6').select('Honda');

// Flatten the form's fields
form.flatten();

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Copy Pages

This example produces this PDF (when this PDF is used for the firstDonorPdfBytes variable and this PDF is used for the secondDonorPdfBytes variable).

Try the JSFiddle demo

import { PDFDocument } from 'pdf-lib'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// These should be Uint8Arrays or ArrayBuffers
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const firstDonorPdfBytes = ...
const secondDonorPdfBytes = ...

// Load a PDFDocument from each of the existing PDFs
const firstDonorPdfDoc = await PDFDocument.load(firstDonorPdfBytes)
const secondDonorPdfDoc = await PDFDocument.load(secondDonorPdfBytes)

// Copy the 1st page from the first donor document, and
// the 743rd page from the second donor document
const [firstDonorPage] = await pdfDoc.copyPages(firstDonorPdfDoc, [0])
const [secondDonorPage] = await pdfDoc.copyPages(secondDonorPdfDoc, [742])

// Add the first copied page
pdfDoc.addPage(firstDonorPage)

// Insert the second copied page to index 0, so it will be the
// first page in `pdfDoc`
pdfDoc.insertPage(0, secondDonorPage)

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Embed PNG and JPEG Images

This example produces this PDF (when this image is used for the jpgImageBytes variable and this image is used for the pngImageBytes variable).

Try the JSFiddle demo

import { PDFDocument } from 'pdf-lib'

// These should be Uint8Arrays or ArrayBuffers
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const jpgImageBytes = ...
const pngImageBytes = ...

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Embed the JPG image bytes and PNG image bytes
const jpgImage = await pdfDoc.embedJpg(jpgImageBytes)
const pngImage = await pdfDoc.embedPng(pngImageBytes)

// Get the width/height of the JPG image scaled down to 25% of its original size
const jpgDims = jpgImage.scale(0.25)

// Get the width/height of the PNG image scaled down to 50% of its original size
const pngDims = pngImage.scale(0.5)

// Add a blank page to the document
const page = pdfDoc.addPage()

// Draw the JPG image in the center of the page
page.drawImage(jpgImage, {
  x: page.getWidth() / 2 - jpgDims.width / 2,
  y: page.getHeight() / 2 - jpgDims.height / 2,
  width: jpgDims.width,
  height: jpgDims.height,
})

// Draw the PNG image near the lower right corner of the JPG image
page.drawImage(pngImage, {
  x: page.getWidth() / 2 - pngDims.width / 2 + 75,
  y: page.getHeight() / 2 - pngDims.height,
  width: pngDims.width,
  height: pngDims.height,
})

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Embed PDF Pages

This example produces this PDF (when this PDF is used for the americanFlagPdfBytes variable and this PDF is used for the usConstitutionPdfBytes variable).

Try the JSFiddle demo

import { PDFDocument } from 'pdf-lib'

// These should be Uint8Arrays or ArrayBuffers
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const americanFlagPdfBytes = ...
const usConstitutionPdfBytes = ...

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Embed the American flag PDF bytes
const [americanFlag] = await pdfDoc.embedPdf(americanFlagPdfBytes)

// Load the U.S. constitution PDF bytes
const usConstitutionPdf = await PDFDocument.load(usConstitutionPdfBytes)

// Embed the second page of the constitution and clip the preamble
const preamble = await pdfDoc.embedPage(usConstitutionPdf.getPages()[1], {
  left: 55,
  bottom: 485,
  right: 300,
  top: 575,
})

// Get the width/height of the American flag PDF scaled down to 30% of
// its original size
const americanFlagDims = americanFlag.scale(0.3)

// Get the width/height of the preamble clipping scaled up to 225% of
// its original size
const preambleDims = preamble.scale(2.25)

// Add a blank page to the document
const page = pdfDoc.addPage()

// Draw the American flag image in the center top of the page
page.drawPage(americanFlag, {
  ...americanFlagDims,
  x: page.getWidth() / 2 - americanFlagDims.width / 2,
  y: page.getHeight() - americanFlagDims.height - 150,
})

// Draw the preamble clipping in the center bottom of the page
page.drawPage(preamble, {
  ...preambleDims,
  x: page.getWidth() / 2 - preambleDims.width / 2,
  y: page.getHeight() / 2 - preambleDims.height / 2 - 50,
})

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Embed Font and Measure Text

pdf-lib relies on a sister module to support embedding custom fonts: @pdf-lib/fontkit. You must add the @pdf-lib/fontkit module to your project and register it using pdfDoc.registerFontkit(...) before embedding custom fonts.

See below for detailed installation instructions on installing @pdf-lib/fontkit as a UMD or NPM module.

This example produces this PDF (when this font is used for the fontBytes variable).

Try the JSFiddle demo

import { PDFDocument, rgb } from 'pdf-lib'
import fontkit from '@pdf-lib/fontkit'

// This should be a Uint8Array or ArrayBuffer
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const fontBytes = ...

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Register the `fontkit` instance
pdfDoc.registerFontkit(fontkit)

// Embed our custom font in the document
const customFont = await pdfDoc.embedFont(fontBytes)

// Add a blank page to the document
const page = pdfDoc.addPage()

// Create a string of text and measure its width and height in our custom font
const text = 'This is text in an embedded font!'
const textSize = 35
const textWidth = customFont.widthOfTextAtSize(text, textSize)
const textHeight = customFont.heightAtSize(textSize)

// Draw the string of text on the page
page.drawText(text, {
  x: 40,
  y: 450,
  size: textSize,
  font: customFont,
  color: rgb(0, 0.53, 0.71),
})

// Draw a box around the string of text
page.drawRectangle({
  x: 40,
  y: 450,
  width: textWidth,
  height: textHeight,
  borderColor: rgb(1, 0, 0),
  borderWidth: 1.5,
})

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Add Attachments

This example produces this PDF (when this image is used for the jpgAttachmentBytes variable and this PDF is used for the pdfAttachmentBytes variable).

Try the JSFiddle demo

import { PDFDocument } from 'pdf-lib'

// These should be Uint8Arrays or ArrayBuffers
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const jpgAttachmentBytes = ...
const pdfAttachmentBytes = ...

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Add the JPG attachment
await pdfDoc.attach(jpgAttachmentBytes, 'cat_riding_unicorn.jpg', {
  mimeType: 'image/jpeg',
  description: 'Cool cat riding a unicorn! 🦄🐈🕶️',
  creationDate: new Date('2019/12/01'),
  modificationDate: new Date('2020/04/19'),
})

// Add the PDF attachment
await pdfDoc.attach(pdfAttachmentBytes, 'us_constitution.pdf', {
  mimeType: 'application/pdf',
  description: 'Constitution of the United States 🇺🇸🦅',
  creationDate: new Date('1787/09/17'),
  modificationDate: new Date('1992/05/07'),
})

// Add a page with some text
const page = pdfDoc.addPage();
page.drawText('This PDF has two attachments', { x: 135, y: 415 })

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Set Document Metadata

This example produces this PDF.

Try the JSFiddle demo

import { PDFDocument, StandardFonts } from 'pdf-lib'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Embed the Times Roman font
const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman)

// Add a page and draw some text on it
const page = pdfDoc.addPage([500, 600])
page.setFont(timesRomanFont)
page.drawText('The Life of an Egg', { x: 60, y: 500, size: 50 })
page.drawText('An Epic Tale of Woe', { x: 125, y: 460, size: 25 })

// Set all available metadata fields on the PDFDocument. Note that these fields
// are visible in the "Document Properties" section of most PDF readers.
pdfDoc.setTitle('🥚 The Life of an Egg 🍳')
pdfDoc.setAuthor('Humpty Dumpty')
pdfDoc.setSubject('📘 An Epic Tale of Woe 📖')
pdfDoc.setKeywords(['eggs', 'wall', 'fall', 'king', 'horses', 'men'])
pdfDoc.setProducer('PDF App 9000 🤖')
pdfDoc.setCreator('pdf-lib (https://github.com/Hopding/pdf-lib)')
pdfDoc.setCreationDate(new Date('2018-06-24T01:58:37.228Z'))
pdfDoc.setModificationDate(new Date('2019-12-21T07:00:11.000Z'))

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Read Document Metadata

Try the JSFiddle demo

import { PDFDocument } from 'pdf-lib'

// This should be a Uint8Array or ArrayBuffer
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const existingPdfBytes = ...

// Load a PDFDocument without updating its existing metadata
const pdfDoc = await PDFDocument.load(existingPdfBytes, {
  updateMetadata: false
})

// Print all available metadata fields
console.log('Title:', pdfDoc.getTitle())
console.log('Author:', pdfDoc.getAuthor())
console.log('Subject:', pdfDoc.getSubject())
console.log('Creator:', pdfDoc.getCreator())
console.log('Keywords:', pdfDoc.getKeywords())
console.log('Producer:', pdfDoc.getProducer())
console.log('Creation Date:', pdfDoc.getCreationDate())
console.log('Modification Date:', pdfDoc.getModificationDate())

This script outputs the following (when this PDF is used for the existingPdfBytes variable):

Title: Microsoft Word - Basic Curriculum Vitae example.doc
Author: Administrator
Subject: undefined
Creator: PScript5.dll Version 5.2
Keywords: undefined
Producer: Acrobat Distiller 8.1.0 (Windows)
Creation Date: 2010-07-29T14:26:00.000Z
Modification Date: 2010-07-29T14:26:00.000Z

Set Viewer Preferences

import {
  PDFDocument,
  StandardFonts,
  NonFullScreenPageMode,
  ReadingDirection,
  PrintScaling,
  Duplex,
  PDFName,
} from 'pdf-lib'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Embed the Times Roman font
const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman)

// Add a page and draw some text on it
const page = pdfDoc.addPage([500, 600])
page.setFont(timesRomanFont)
page.drawText('The Life of an Egg', { x: 60, y: 500, size: 50 })
page.drawText('An Epic Tale of Woe', { x: 125, y: 460, size: 25 })

// Set all available viewer preferences on the PDFDocument:
const viewerPrefs = pdfDoc.catalog.getOrCreateViewerPreferences()
viewerPrefs.setHideToolbar(true)
viewerPrefs.setHideMenubar(true)
viewerPrefs.setHideWindowUI(true)
viewerPrefs.setFitWindow(true)
viewerPrefs.setCenterWindow(true)
viewerPrefs.setDisplayDocTitle(true)

// Set the PageMode (otherwise setting NonFullScreenPageMode has no meaning)
pdfDoc.catalog.set(PDFName.of('PageMode'), PDFName.of('FullScreen'))

// Set what happens when fullScreen is closed
viewerPrefs.setNonFullScreenPageMode(NonFullScreenPageMode.UseOutlines)

viewerPrefs.setReadingDirection(ReadingDirection.L2R)
viewerPrefs.setPrintScaling(PrintScaling.None)
viewerPrefs.setDuplex(Duplex.DuplexFlipLongEdge)
viewerPrefs.setPickTrayByPDFSize(true)

// We can set the default print range to only the first page
viewerPrefs.setPrintPageRange({ start: 0, end: 0 })

// Or we can supply noncontiguous ranges (e.g. pages 1, 3, and 5-7)
viewerPrefs.setPrintPageRange([
  { start: 0, end: 0 },
  { start: 2, end: 2 },
  { start: 4, end: 6 },
])

viewerPrefs.setNumCopies(2)

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Read Viewer Preferences

import { PDFDocument } from 'pdf-lib'

// This should be a Uint8Array or ArrayBuffer
// This data can be obtained in a number of different ways
// If you're running in a Node environment, you could use fs.readFile()
// In the browser, you could make a fetch() call and use res.arrayBuffer()
const existingPdfBytes = ...

// Load a PDFDocument without updating its existing metadata
const pdfDoc = await PDFDocument.load(existingPdfBytes)
const viewerPrefs = pdfDoc.catalog.getOrCreateViewerPreferences()

// Print all available viewer preference fields
console.log('HideToolbar:', viewerPrefs.getHideToolbar())
console.log('HideMenubar:', viewerPrefs.getHideMenubar())
console.log('HideWindowUI:', viewerPrefs.getHideWindowUI())
console.log('FitWindow:', viewerPrefs.getFitWindow())
console.log('CenterWindow:', viewerPrefs.getCenterWindow())
console.log('DisplayDocTitle:', viewerPrefs.getDisplayDocTitle())
console.log('NonFullScreenPageMode:', viewerPrefs.getNonFullScreenPageMode())
console.log('ReadingDirection:', viewerPrefs.getReadingDirection())
console.log('PrintScaling:', viewerPrefs.getPrintScaling())
console.log('Duplex:', viewerPrefs.getDuplex())
console.log('PickTrayByPDFSize:', viewerPrefs.getPickTrayByPDFSize())
console.log('PrintPageRange:', viewerPrefs.getPrintPageRange())
console.log('NumCopies:', viewerPrefs.getNumCopies())

This script outputs the following (when this PDF is used for the existingPdfBytes variable):

HideToolbar: true
HideMenubar: true
HideWindowUI: false
FitWindow: true
CenterWindow: true
DisplayDocTitle: true
NonFullScreenPageMode: UseNone
ReadingDirection: R2L
PrintScaling: None
Duplex: DuplexFlipLongEdge
PickTrayByPDFSize: true
PrintPageRange: [ { start: 1, end: 1 }, { start: 3, end: 4 } ]
NumCopies: 2

Draw SVG Paths

This example produces this PDF.

Try the JSFiddle demo

import { PDFDocument, rgb } from 'pdf-lib'

// SVG path for a wavy line
const svgPath =
  'M 0,20 L 100,160 Q 130,200 150,120 C 190,-40 200,200 300,150 L 400,90'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Add a blank page to the document
const page = pdfDoc.addPage()
page.moveTo(100, page.getHeight() - 5)

// Draw the SVG path as a black line
page.moveDown(25)
page.drawSvgPath(svgPath)

// Draw the SVG path as a thick green line
page.moveDown(200)
page.drawSvgPath(svgPath, { borderColor: rgb(0, 1, 0), borderWidth: 5 })

// Draw the SVG path and fill it with red
page.moveDown(200)
page.drawSvgPath(svgPath, { color: rgb(1, 0, 0) })

// Draw the SVG path at 50% of its original size
page.moveDown(200)
page.drawSvgPath(svgPath, { scale: 0.5 })

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// For example, `pdfBytes` can be:
//   • Written to a file in Node
//   • Downloaded from the browser
//   • Rendered in an <iframe>

Deno Usage

pdf-lib fully supports the exciting new Deno runtime! All of the usage examples work in Deno. The only thing you need to do is change the imports for pdf-lib and @pdf-lib/fontkit to use the Skypack CDN, because Deno requires all modules to be referenced via URLs.

See also How to Create and Modify PDF Files in Deno With pdf-lib

Creating a Document with Deno

Below is the create document example modified for Deno:

import {
  PDFDocument,
  StandardFonts,
  rgb,
} from 'https://cdn.skypack.dev/pdf-lib@^1.11.1?dts';

const pdfDoc = await PDFDocument.create();
const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman);

const page = pdfDoc.addPage();
const { width, height } = page.getSize();
const fontSize = 30;
page.drawText('Creating PDFs in JavaScript is awesome!', {
  x: 50,
  y: height - 4 * fontSize,
  size: fontSize,
  font: timesRomanFont,
  color: rgb(0, 0.53, 0.71),
});

const pdfBytes = await pdfDoc.save();

await Deno.writeFile('out.pdf', pdfBytes);

If you save this script as create-document.ts, you can execute it using Deno with the following command:

deno run --allow-write create-document.ts

The resulting out.pdf file will look like this PDF.

Embedding a Font with Deno

Here's a slightly more complicated example demonstrating how to embed a font and measure text in Deno:

import {
  degrees,
  PDFDocument,
  rgb,
  StandardFonts,
} from 'https://cdn.skypack.dev/pdf-lib@^1.11.1?dts';
import fontkit from 'https://cdn.skypack.dev/@pdf-lib/fontkit@^1.0.0?dts';

const url = 'https://pdf-lib.js.org/assets/ubuntu/Ubuntu-R.ttf';
const fontBytes = await fetch(url).then((res) => res.arrayBuffer());

const pdfDoc = await PDFDocument.create();

pdfDoc.registerFontkit(fontkit);
const customFont = await pdfDoc.embedFont(fontBytes);

const page = pdfDoc.addPage();

const text = 'This is text in an embedded font!';
const textSize = 35;
const textWidth = customFont.widthOfTextAtSize(text, textSize);
const textHeight = customFont.heightAtSize(textSize);

page.drawText(text, {
  x: 40,
  y: 450,
  size: textSize,
  font: customFont,
  color: rgb(0, 0.53, 0.71),
});
page.drawRectangle({
  x: 40,
  y: 450,
  width: textWidth,
  height: textHeight,
  borderColor: rgb(1, 0, 0),
  borderWidth: 1.5,
});

const pdfBytes = await pdfDoc.save();

await Deno.writeFile('out.pdf', pdfBytes);

If you save this script as custom-font.ts, you can execute it with the following command:

deno run --allow-write --allow-net custom-font.ts

The resulting out.pdf file will look like this PDF.

Complete Examples

The usage examples provide code that is brief and to the point, demonstrating the different features of pdf-lib. You can find complete working examples in the apps/ directory. These apps are used to do manual testing of pdf-lib before every release (in addition to the automated tests).

There are currently four apps:

  • node - contains tests for pdf-lib in Node environments. These tests are a handy reference when trying to save/load PDFs, fonts, or images with pdf-lib from the filesystem. They also allow you to quickly open your PDFs in different viewers (Acrobat, Preview, Foxit, Chrome, Firefox, etc...) to ensure compatibility.
  • web - contains tests for pdf-lib in browser environments. These tests are a handy reference when trying to save/load PDFs, fonts, or images with pdf-lib in a browser environment.
  • rn - contains tests for pdf-lib in React Native environments. These tests are a handy reference when trying to save/load PDFs, fonts, or images with pdf-lib in a React Native environment.
  • deno - contains tests for pdf-lib in Deno environments. These tests are a handy reference when trying to save/load PDFs, fonts, or images with pdf-lib from the filesystem.

Installation

NPM Module

To install the latest stable version:

# With npm
npm install --save pdf-lib

# With yarn
yarn add pdf-lib

This assumes you're using npm or yarn as your package manager.

UMD Module

You can also download pdf-lib as a UMD module from unpkg or jsDelivr. The UMD builds have been compiled to ES5, so they should work in any modern browser. UMD builds are useful if you aren't using a package manager or module bundler. For example, you can use them directly in the <script> tag of an HTML page.

The following builds are available:

NOTE: if you are using the CDN scripts in production, you should include a specific version number in the URL, for example:

When using a UMD build, you will have access to a global window.PDFLib variable. This variable contains all of the classes and functions exported by pdf-lib. For example:

// NPM module
import { PDFDocument, rgb } from 'pdf-lib';

// UMD module
var PDFDocument = PDFLib.PDFDocument;
var rgb = PDFLib.rgb;

Fontkit Installation

pdf-lib relies upon a sister module to support embedding custom fonts: @pdf-lib/fontkit. You must add the @pdf-lib/fontkit module to your project and register it using pdfDoc.registerFontkit(...) before embedding custom fonts (see the font embedding example). This module is not included by default because not all users need it, and it increases bundle size.

Installing this module is easy. Just like pdf-lib itself, @pdf-lib/fontkit can be installed with npm/yarn or as a UMD module.

Fontkit NPM Module

# With npm
npm install --save @pdf-lib/fontkit

# With yarn
yarn add @pdf-lib/fontkit

To register the fontkit instance:

import { PDFDocument } from 'pdf-lib'
import fontkit from '@pdf-lib/fontkit'

const pdfDoc = await PDFDocument.create()
pdfDoc.registerFontkit(fontkit)

Fontkit UMD Module

The following builds are available:

NOTE: if you are using the CDN scripts in production, you should include a specific version number in the URL, for example:

When using a UMD build, you will have access to a global window.fontkit variable. To register the fontkit instance:

var pdfDoc = await PDFLib.PDFDocument.create()
pdfDoc.registerFontkit(fontkit)

Documentation

API documentation is available on the project site at https://pdf-lib.js.org/docs/api/.

The repo for the project site (and generated documentation files) is located here: https://github.com/Hopding/pdf-lib-docs.

Fonts and Unicode

When working with PDFs, you will frequently come across the terms "character encoding" and "font". If you have experience in web development, you may wonder why these are so prevalent. Aren't they just annoying details that you shouldn't need to worry about? Shouldn't PDF libraries and readers be able to handle all of this for you like web browsers can? Unfortunately, this is not the case. The nature of the PDF file format makes it very difficult to avoid thinking about character encodings and fonts when working with PDFs.

pdf-lib does its best to simplify things for you. But it can't perform magic. This means you should be aware of the following:

  • There are 14 standard fonts defined in the PDF specification. They are as follows: Times Roman (normal, bold, italic, and bold-italic), Helvetica (normal, bold, oblique, and bold-oblique), Courier (normal, bold, oblique, and bold-oblique), ZapfDingbats, and Symbol. These 14 fonts are guaranteed to be available in PDF readers. As such, you do not need to embed any font data if you wish to use one of these fonts. You can use a standard font like so:
import { PDFDocument, StandardFonts } from 'pdf-lib'
const pdfDoc = await PDFDocument.create()
const courierFont = await pdfDoc.embedFont(StandardFonts.Courier)
const page = pdfDoc.addPage()
page.drawText('Some boring latin text in the Courier font', {
  font: courierFont,
})
  • The standard fonts do not support all characters available in Unicode. The Times Roman, Helvetica, and Courier fonts use WinAnsi encoding (aka Windows-1252). The WinAnsi character set only supports 218 characters in the Latin alphabet. For this reason, many users will find the standard fonts insufficient for their use case. This is unfortunate, but there's nothing that PDF libraries can do to change this. This is a result of the PDF specification and its age. Note that the ZapfDingbats and Symbol fonts use their own specialized encodings that support 203 and 194 characters, respectively. However, the characters they support are not useful for most use cases. See here for an example of all 14 standard fonts.
  • You can use characters outside the Latin alphabet by embedding your own fonts. Embedding your own font requires you to load the font data (from a file or via a network request, for example) and pass it to the embedFont method. When you embed your own font, you can use any Unicode characters that it supports. This capability frees you from the limitations imposed by the standard fonts. Most PDF files use embedded fonts. You can embed and use a custom font like so (see also):
import { PDFDocument } from 'pdf-lib'
import fontkit from '@pdf-lib/fontkit'

const url = 'https://pdf-lib.js.org/assets/ubuntu/Ubuntu-R.ttf'
const fontBytes = await fetch(url).then((res) => res.arrayBuffer())

const pdfDoc = await PDFDocument.create()

pdfDoc.registerFontkit(fontkit)
const ubuntuFont = await pdfDoc.embedFont(fontBytes)

const page = pdfDoc.addPage()
page.drawText('Some fancy Unicode text in the ŪЬȕǹƚü font', {
  font: ubuntuFont,
})

Note that encoding errors will be thrown if you try to use a character with a font that does not support it. For example, Ω is not in the WinAnsi character set. So trying to draw it on a page with the standard Helvetica font will throw the following error:

Error: WinAnsi cannot encode "Ω" (0x03a9)
    at Encoding.encodeUnicodeCodePoint

Font Subsetting

Embedding a font in a PDF document typically increases the file's size. You can reduce this increase by subsetting the font so that only the necessary characters are embedded. You can subset a font by setting the subset option to true. For example:

const font = await pdfDoc.embedFont(fontBytes, { subset: true });

Note that subsetting does not work for all fonts. See https://github.com/Hopding/pdf-lib/issues/207#issuecomment-537210471 for additional details.
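To build intuition for why subsetting shrinks the file, consider how few distinct characters a document typically draws compared to the thousands of glyphs in a full font. The following is a plain-JavaScript sketch (not pdf-lib API) that counts the glyphs a subset would actually need:

```javascript
// Subsetting embeds only the glyphs for characters actually used.
// The distinct characters of the drawn text are all a subset needs.
const text = 'Hello, Hello, Hello!';
const neededGlyphs = [...new Set(text)];
console.log(neededGlyphs.length); // 7 distinct characters
```

A full TrueType font can contain thousands of glyphs; here a subset would only need 7.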

Creating and Filling Forms

pdf-lib can create, fill, and read PDF form fields. The following field types are supported:

See the form creation and form filling usage examples for code samples. Tests 1, 14, 15, 16, and 17 in the complete examples contain working example code for form creation and filling in a variety of different JS environments.

IMPORTANT: The default font used to display text in buttons, dropdowns, option lists, and text fields is the standard Helvetica font. This font only supports characters in the Latin alphabet (see Fonts and Unicode for details). This means that if any of these field types are created or modified to contain text outside the Latin alphabet (as is often the case), you will need to embed and use a custom font to update the field appearances. Otherwise, an error will be thrown (likely when you save the PDFDocument).

You can use an embedded font when filling form fields as follows:

import { PDFDocument } from 'pdf-lib';
import fontkit from '@pdf-lib/fontkit';

// Fetch the PDF with form fields
const formUrl = 'https://pdf-lib.js.org/assets/dod_character.pdf';
const formBytes = await fetch(formUrl).then((res) => res.arrayBuffer());

// Fetch the Ubuntu font
const fontUrl = 'https://pdf-lib.js.org/assets/ubuntu/Ubuntu-R.ttf';
const fontBytes = await fetch(fontUrl).then((res) => res.arrayBuffer());

// Load the PDF with form fields
const pdfDoc = await PDFDocument.load(formBytes);

// Embed the Ubuntu font
pdfDoc.registerFontkit(fontkit);
const ubuntuFont = await pdfDoc.embedFont(fontBytes);

// Get two text fields from the form
const form = pdfDoc.getForm();
const nameField = form.getTextField('CharacterName 2');
const ageField = form.getTextField('Age');

// Fill the text fields with some fancy Unicode characters (outside
// the WinAnsi latin character set)
nameField.setText('Ӎӑȑїõ');
ageField.setText('24 ŷȇȁŗš');

// **Key Step:** Update the field appearances with the Ubuntu font
form.updateFieldAppearances(ubuntuFont);

// Save the PDF with filled form fields
const pdfBytes = await pdfDoc.save();

Handy Methods for Filling, Creating, and Reading Form Fields

Existing form fields can be accessed with the following methods of PDFForm:

New form fields can be created with the following methods of PDFForm:

Below are some of the most commonly used methods for reading and filling the aforementioned subclasses of PDFField:





Limitations

  • pdf-lib can extract the content of text fields (see PDFTextField.getText), but it cannot extract plain text on a page outside of a form field. This is a difficult feature to implement, but it is within the scope of this library and may be added to pdf-lib in the future. See #93, #137, #177, #329, and #380.
  • pdf-lib can remove and edit the content of text fields (see PDFTextField.setText), but it does not provide APIs for removing or editing text on a page outside of a form field. This is also a difficult feature to implement, but is within the scope of pdf-lib and may be added in the future. See #93, #137, #177, #329, and #380.
  • pdf-lib does not support the use of HTML or CSS when adding content to a PDF. Similarly, pdf-lib cannot embed HTML/CSS content into PDFs. As convenient as such a feature might be, it would be extremely difficult to implement and is far beyond the scope of this library. If this capability is something you need, consider using Puppeteer.

Help and Discussion

Discussions is the best place to chat with us, ask questions, and learn more about pdf-lib!

See also MAINTAINERSHIP.md#communication and MAINTAINERSHIP.md#discord.

Encryption Handling

pdf-lib does not currently support encrypted documents. You should not use pdf-lib with encrypted documents. However, this is a feature that could be added to pdf-lib. Please create an issue if you would find this feature helpful!

When an encrypted document is passed to PDFDocument.load(...), an error will be thrown:

import { PDFDocument, EncryptedPDFError } from 'pdf-lib'

const encryptedPdfBytes = ...

// Fails. Throws an `EncryptedPDFError`.
const pdfDoc = await PDFDocument.load(encryptedPdfBytes)

This default behavior is usually what you want. It allows you to easily detect if a given document is encrypted, and it prevents you from trying to modify it. However, if you really want to load the document, you can use the { ignoreEncryption: true } option:

import { PDFDocument } from 'pdf-lib'

const encryptedPdfBytes = ...

// Succeeds. Does not throw an error.
const pdfDoc = await PDFDocument.load(encryptedPdfBytes, { ignoreEncryption: true })

Note that using this option does not decrypt the document. This means that any modifications you attempt to make on the returned PDFDocument may fail, or have unexpected results.

You should not use this option. It only exists for backwards compatibility reasons.

Contributing

We welcome contributions from the open source community! If you are interested in contributing to pdf-lib, please take a look at the CONTRIBUTING.md file. It contains information to help you get pdf-lib setup and running on your machine. (We try to make this as simple and fast as possible! :rocket:)

Maintainership

Check out MAINTAINERSHIP.md for details on how this repo is maintained and how we use issues, PRs, and discussions.

Tutorials and Cool Stuff

Prior Art

  • pdfkit is a PDF generation library for Node and the Browser. This library was immensely helpful as a reference and existence proof when creating pdf-lib. pdfkit's code for font embedding, PNG embedding, and JPG embedding was especially useful.
  • pdf.js is a PDF rendering library for the Browser. This library was helpful as a reference when writing pdf-lib's parser. Some of the code for stream decoding was ported directly to TypeScript for use in pdf-lib.
  • pdfbox is a PDF generation and modification library written in Java. This library was an invaluable reference when implementing form creation and filling APIs for pdf-lib.
  • jspdf is a PDF generation library for the browser.
  • pdfmake is a PDF generation library for the browser.
  • hummus is a PDF generation and modification library for Node environments. hummus is a Node wrapper around a C++ library, so it doesn't work in many JavaScript environments - like the Browser or React Native.
  • react-native-pdf-lib is a PDF generation and modification library for React Native environments. react-native-pdf-lib is a wrapper around C++ and Java libraries.
  • pdfassembler is a PDF generation and modification library for Node and the browser. It requires some knowledge about the logical structure of PDF documents to use.

Git History Rewrite

This repo used to contain a file called pdf_specification.pdf in the root directory. This was a copy of the PDF 1.7 specification, which is made freely available by Adobe. On 8/30/2021, we received a DMCA complaint requiring us to remove the file from this repo. Simply removing the file via a new commit to master was insufficient to satisfy the complaint. The file needed to be completely removed from the repo's git history. Unfortunately, the file was added over two years ago, which meant we had to rewrite the repo's git history and force push to master 😔.

Steps We Took

We removed the file and rewrote the repo's history using BFG Repo-Cleaner as outlined here. For full transparency, here are the exact commands we ran:

$ git clone git@github.com:Hopding/pdf-lib.git
$ cd pdf-lib
$ rm pdf_specification.pdf
$ git commit -am 'Remove pdf_specification.pdf'
$ bfg --delete-files pdf_specification.pdf
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push --force

Why Should I Care?

If you're a user of pdf-lib, you shouldn't care! Just keep on using pdf-lib like normal 😃 ✨!

If you are a pdf-lib developer (meaning you've forked pdf-lib and/or have an open PR) then this does impact you. If you forked or cloned the repo prior to 8/30/2021 then your fork's git history is out of sync with this repo's master branch. Unfortunately, this will likely be a headache for you to deal with. Sorry! We didn't want to rewrite the history, but there really was no alternative.

It's important to note that pdf-lib's source code has not changed at all. It's exactly the same as it was before the git history rewrite. The repo still has the exact same number of commits (and even the same commit contents, except for the commit that added pdf_specification.pdf). What has changed are the SHAs of those commits.

The simplest way to deal with this fact is to:

  1. Reclone pdf-lib
  2. Manually copy any changes you've made from your old clone to the new one
  3. Use your new clone going forward
  4. Reopen your unmerged PRs using your new clone

See this StackOverflow answer for a great, in-depth explanation of what a git history rewrite entails.

Download Details:

Author: Hopding
Source Code: https://github.com/Hopding/pdf-lib 
License: MIT license

#typescript #javascript #pdf #editing #document 

Zak Dyer

Parsing and Modifying XML in Python

In this Python XML tutorial, you'll learn how to parse and modify XML in Python. Python enables you to parse and modify XML documents; with the approach shown here, the entire XML document is loaded into memory.

What is XML?

XML stands for eXtensible Markup Language. It was designed to store and transport small to medium amounts of data, to be both human- and machine-readable, and it is widely used for sharing structured information. Its design goals emphasize simplicity, generality, and usability across the Internet.

Here we assume that the XML file shown below is saved on disk. Please read the comments in the code for a clear understanding.

XML File:

<note>
	<fname>Jack</fname>
	<lname>Shelby</lname>
	<favgame>Football</favgame>
	<player name="Messi"></player>
	<player name="Ronaldo"></player>
	<player name="Mbappe"></player>
</note>

Let us save the above XML file as “test.xml”. Before going further, you should know that XML does not have predefined tags as HTML does: the author defines their own tags and document structure. Now we need to parse this file and modify it using Python. We will be using the “minidom” module, which is part of Python's standard library (xml.dom.minidom), so no separate installation is required.

Reading XML

First we will be reading the contents of the XML file and then we will learn how to modify the XML file.
Example

import xml.dom.minidom as md

def main():

	# parsing the xml file and
	# storing the contents in
	# "file" object Put in the
	# path of your XML file in
	# the parameter for parse() method.
	file = md.parse( "test.xml" )

	# nodeName returns the type of
	# the file(in our case it returns
	# document)
	print( file.nodeName )

	# firstChild.tagName returns the
	# name of the first tag.Here it
	# is "note"
	print( file.firstChild.tagName )

	firstname = file.getElementsByTagName( "fname" )

	# printing the first name
	print( "Name: " + firstname[ 0 ].firstChild.nodeValue )

	lastname = file.getElementsByTagName( "lname" )

	# printing the last name
	print( "Surname: " + lastname[ 0 ].firstChild.nodeValue )

	favgame = file.getElementsByTagName( "favgame" )

	# printing the favourite game
	print( "Favourite Game: " + favgame[ 0 ].firstChild.nodeValue )

	# Printing tag values having
	# attributes(Here tag "player"
	# has "name" attribute)
	players = file.getElementsByTagName( "player" )

	for player in players:
		print( player.getAttribute( "name" ) )

if __name__ == "__main__":
	main()

Output:

#document
note
Name: Jack
Surname: Shelby
Favourite Game: Football
Messi
Ronaldo
Mbappe

In the above Python code, we used firstname[0] and lastname[0] while printing the first and last name because there is only one “fname” and one “lname” tag. When the same tag occurs multiple times, we can iterate over all occurrences, as shown below.
XML:

<note>
	<fname>Jack</fname>
	<fname>John</fname>
	<fname>Harry</fname>
</note>

Python

import xml.dom.minidom as md

def main():

	file = md.parse( "test.xml" )
	names = file.getElementsByTagName( "fname" )

	for name in names:

		print( name.firstChild.nodeValue )

if __name__ == "__main__":
	main()

Output

Jack
John
Harry
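If the XML is already in memory as a string rather than in a file, minidom's parseString() works the same way as parse(). A small self-contained example:

```python
import xml.dom.minidom as md

# parseString() accepts the XML content as a string
# instead of a file path
doc = md.parseString( "<note><fname>Jack</fname><fname>John</fname></note>" )

names = [ n.firstChild.nodeValue for n in doc.getElementsByTagName( "fname" ) ]
print( names )  # ['Jack', 'John']
```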

Modifying XML

Now that we have a basic idea of how to parse and read the contents of an XML file using Python, let us learn how to modify an XML file.
XML File:

(Image of the XML file omitted. In addition to the fields above, it contains an “age” tag and one or more “hobby” tags.)

Let us add the following:

  • Height
  • Languages known by Jack

Let us delete the “hobby” tag. Also let us modify the age to 29.
Python Code:(Modifying XML)

import xml.dom.minidom as md

def main():

	file = md.parse("test.xml")
	
	height = file.createElement( "height" )

	# setting height value to 180cm
	height.setAttribute("val", "180 cm")

	# adding height tag to the "file"
	# object
	file.firstChild.appendChild(height)

	lan = [ "English", "Spanish", "French" ]

	# creating separate "lang" tags for
	# each language and adding it to
	# "file" object
	for l in lan:
		
		lang = file.createElement( "lang" )
		lang.setAttribute( "lng", l )
		file.firstChild.appendChild( lang )

	delete = file.getElementsByTagName( "hobby" )

	# deleting all occurrences of a particular
	# tag (here "hobby")
	for i in delete:

		x = i.parentNode
		x.removeChild( i )

	# modifying the value of a tag(here "age")
	file.getElementsByTagName( "age" )[ 0 ].childNodes[ 0 ].nodeValue = "29"

	# writing the changes in the "file" object
	# back to "test.xml"; the "with" block closes
	# the file automatically
	with open( "test.xml", "w" ) as fs:

		fs.write( file.toxml() )

if __name__ == "__main__":
	main()

Output:

(Image of the modified XML file omitted: the “hobby” tags are removed, the age is 29, and new “height” and “lang” tags are appended.)

The final lines of the Python code simply convert the “file” object into XML using the toxml() method and write it to the “test.xml” file. If you do not want to edit the original file and just want to print the modified XML, replace them with:

print(file.toxml())
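Note that toxml() produces compact output. If you want human-readable output, minidom also provides toprettyxml(), which adds newlines and indentation:

```python
import xml.dom.minidom as md

doc = md.parseString( "<note><fname>Jack</fname></note>" )

# toprettyxml() inserts newlines and the given indent string
# at each nesting level
pretty = doc.toprettyxml( indent="  " )
print( pretty )
```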

#python #xml #programming 


DocumenterCitations.jl Uses Bibliography.jl to Add Support for BibTeX

DocumenterCitations.jl

DocumenterCitations.jl uses Bibliography.jl to add support for BibTeX citations and references in documentation pages generated by Documenter.jl.

DocumenterCitations.jl is still in early development so please open issues if you encounter any bugs or pain points, would like to see a new feature, or if you have any questions. 

Download Details:

Author: Ali-ramadhan
Source Code: https://github.com/ali-ramadhan/DocumenterCitations.jl 
License: MIT license

#julia #document 


Documentertools.jl: Extra Tools for Setting Up Documenter

DocumenterTools

This package contains utilities for setting up documentation generation with Documenter.jl. For documentation, see Documenter.jl's documentation.

Installation

The package can be added using the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run

pkg> add DocumenterTools

Download Details:

Author: JuliaDocs
Source Code: https://github.com/JuliaDocs/DocumenterTools.jl 
License: View license

#julia #docs #document 


Documenter.jl: A Documentation Generator for Julia

Documenter

A documentation generator for Julia.

Installation

The package can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add Documenter

Or, equivalently, via the Pkg API:

julia> import Pkg; Pkg.add("Documenter")

Documentation

  • STABLE: documentation of the most recently tagged version.
  • DEVEL: documentation of the in-development version.

Project Status

The package is tested against, and being developed for, Julia 1.6 and above on Linux, macOS, and Windows.

Questions and Contributions

Usage questions can be posted on the Julia Discourse forum under the documenter tag, in the #documentation channel of the Julia Slack and/or in the JuliaDocs Gitter chat room.

Contributions are very welcome, as are feature requests and suggestions. Please open an issue if you encounter any problems. The contributing page has a few guidelines that should be followed when opening pull requests and contributing code.

Related packages

There are several packages that extend Documenter in different ways. The JuliaDocs organization maintains:

Other third-party packages that can be combined with Documenter include:

Finally, there are also a few other packages in the Julia ecosystem that are similar to Documenter, but fill a slightly different niche:

Download Details:

Author: JuliaDocs 
Source Code: https://github.com/JuliaDocs/Documenter.jl 
License: MIT license

#julia #docs #document 

Hunter Krajcik

Flutter Document Reader Core Ocrandmrz

Document Reader Core (Flutter)

Regula Document Reader SDK allows you to read various kinds of identification documents: passports, driving licenses, ID cards, etc. All processing is performed completely offline on your device; no data ever leaves it.

Installing

Use this package as a library

Depend on it

Run this command:

With Flutter:

 $ flutter pub add flutter_document_reader_core_ocrandmrz

This will add a line like this to your package's pubspec.yaml (and run an implicit flutter pub get):

dependencies:
  flutter_document_reader_core_ocrandmrz: ^6.4.0

Alternatively, your editor might support flutter pub get. Check the docs for your editor to learn more.

Import it

Now in your Dart code, you can use:

import 'package:flutter_document_reader_core_ocrandmrz/flutter_document_reader_core_ocrandmrz.dart';

example/lib/main.dart

import 'package:flutter/material.dart';
import 'dart:async';

import 'package:flutter/services.dart';
import 'package:flutter_document_reader_core_ocrandmrz/flutter_document_reader_core_ocrandmrz.dart';

void main() {
  runApp(MyApp());
}

class MyApp extends StatefulWidget {
  @override
  _MyAppState createState() => _MyAppState();
}

class _MyAppState extends State<MyApp> {
  String _platformVersion = 'Unknown';

  @override
  void initState() {
    super.initState();
    initPlatformState();
  }

  // Platform messages are asynchronous, so we initialize in an async method.
  Future<void> initPlatformState() async {
    String platformVersion;
    // Platform messages may fail, so we use a try/catch PlatformException.
    try {
      platformVersion = await FlutterDocumentReaderCore.platformVersion;
    } on PlatformException {
      platformVersion = 'Failed to get platform version.';
    }

    // If the widget was removed from the tree while the asynchronous platform
    // message was in flight, we want to discard the reply rather than calling
    // setState to update our non-existent appearance.
    if (!mounted) return;

    setState(() {
      _platformVersion = platformVersion;
    });
  }

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: Scaffold(
        appBar: AppBar(
          title: const Text('Plugin example app'),
        ),
        body: Center(
          child: Text('Running on: $_platformVersion\n'),
        ),
      ),
    );
  }
}

Documentation

The documentation can be found here.

Demo application

The demo application can be found here: https://github.com/regulaforensics/DocumentReader-Flutter.

Original article source at: https://pub.dev/packages/flutter_document_reader_core_ocrandmrz 

#flutter #dart #document 

Royce Reinger

TF-IDF: Term Frequency - Inverse Document Frequency in Ruby

Tf-Idf

Install

gem sources -a http://gemcutter.org
sudo gem install tf_idf

How To Use

require 'rubygems'
require 'tf_idf'

data = [%w{a a a a a a a a b b}, %w{a a}]

a = TfIdf.new(data)

# To find the term frequencies
a.tf
  #=> [{'b' => 0.2, 'a' => etc...}, {'a' => 1}]

# To find the inverse document frequency
a.idf
  #=> {'b' => 0.301... etc...}

# And to find the tf-idf
a.tf_idf
  #=> [{'b' => 0.0602, 'a' => etc...}, {etc...}]
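For reference, the math behind these numbers can be sketched in a few lines of plain Ruby. This is not the gem's internal code, just the standard tf-idf definitions with a base-10 logarithm, which reproduces the values above:

```ruby
data = [%w[a a a a a a a a b b], %w[a a]]

# term frequency: occurrences of a term in a document,
# divided by the document's length
tf = data.map { |doc| doc.tally.transform_values { |c| c.to_f / doc.size } }

# inverse document frequency: log10 of
# (total documents / documents containing the term)
idf = data.flatten.uniq.to_h do |term|
  df = data.count { |doc| doc.include?(term) }
  [term, Math.log10(data.size.to_f / df)]
end

# tf-idf: the product of the two, per term and per document
tf_idf = tf.map { |doc_tf| doc_tf.to_h { |term, f| [term, f * idf[term]] } }

p tf[0]["b"]      # 0.2
p idf["b"]        # ~0.301
p tf_idf[0]["b"]  # ~0.0602
```

Note that "a" appears in every document, so its idf (and therefore its tf-idf) is zero: tf-idf rewards terms that are frequent in one document but rare across the collection.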

Copyright

Copyright © 2009 Red Davis. See LICENSE for details.

en.wikipedia.org/wiki/Tf–idf

Author: Reddavis
Source Code: https://github.com/reddavis/TF-IDF 
License: MIT license

#ruby #document 


Smooks: Extensible Data integration Java Framework for Building XML

Smooks

This is the Git source code repository for the Smooks project.

Building

Prerequisites

JDK 8

Apache Maven 3.2.x

Maven

git clone git://github.com/smooks/smooks.git

cd smooks

mvn clean install

Note: You will need both Maven (version 3.2.x) and Git installed on your local machine.

Docker

You can also build from the Docker image:

Install Docker.

Run sudo docker build -t smooks github.com/smooks/smooks. This will create a Docker image named smooks that contains the correct build environment and a clone of this Git repo.

Run sudo docker run -i smooks mvn clean install to build the source code.

Getting Started

The easiest way to get started with Smooks is to download and try out the examples. The examples are the recommended base upon which to integrate Smooks into your application.

Introduction

Smooks is an extensible Java framework for building XML and non-XML (CSV, EDI, POJO, etc.) fragment-based applications. It can be used as a lightweight framework on which to hook your own processing logic for a wide range of data formats but, out of the box, Smooks ships with features that can be used individually or seamlessly together:

Java Binding: populate POJOs from a source (CSV, EDI, XML, POJOs, etc.). Populated POJOs can either be the final result of a transformation, or serve as a bridge for further transformations, as seen in template resources which generate textual results such as XML. Additionally, Smooks supports collections (maps and lists of typed data) that can be referenced from expression languages and templates.

Transformation: perform a wide range of data transformations and mappings: XML to XML, CSV to XML, EDI to XML, XML to EDI, XML to CSV, POJO to XML, POJO to EDI, POJO to CSV, etc.

Templating: extensible template-driven transformations, with support for XSLT, FreeMarker, and StringTemplate.

Huge Message Processing: process huge messages (gigabytes!). Split, transform and route fragments to JMS, filesystem, database, and other destinations.

Fragment Enrichment: enrich fragments with data from a database or other data sources.

Complex Fragment Validation: rule-based fragment validation.

Fragment Persistence: read fragments from, and save fragments to, a database with either JDBC, persistence frameworks (like MyBatis, Hibernate, or any JPA compatible framework), or DAOs.

Combine: leverage Smooks’s transformation, routing and persistence functionality for Extract Transform Load (ETL) operations.

Validation: perform basic or complex validation on fragment content. This is more than simple type/value-range validation.

Why Smooks?

Smooks was conceived to perform fragment-based transformations on messages. Supporting fragment-based transformation opened up the possibility of mixing and matching different technologies within the context of a single transformation. This meant that one could leverage distinct technologies for transforming fragments, depending on the type of transformation required by the fragment in question.

In the process of evolving this fragment-based transformation solution, it dawned on us that we were establishing a fragment-based processing paradigm. Concretely, a framework was being built for targeting custom visitor logic at message fragments. A visitor does not need to be restricted to transformation. A visitor could be implemented to apply all sorts of operations on fragments, and therefore, the message as a whole.

Smooks supports a wide range of data structures - XML, EDI, JSON, CSV, POJOs (POJO to POJO!). A pluggable reader interface allows you to plug in a reader implementation for any data format.

Fragment-Based Processing

The primary design goal of Smooks is to provide a framework that isolates and processes fragments in structured data (XML and non-XML) using existing data processing technologies (such as XSLT, plain vanilla Java, Groovy script).

A visitor targets a fragment with the visitor’s resource selector value. The targeted fragment can take in as much or as little of the source stream as you like. A fragment is identified by the name of the node enclosing the fragment. You can target the whole stream using the node name of the root node as the selector or through the reserved #document selector.

Note: The terms fragment and node denote different meanings, though it is usually acceptable to use them interchangeably because the difference is subtle and, more often than not, irrelevant. A node may be the outer node of a fragment, excluding the child nodes. A fragment is the outer node and all its child nodes along with their character nodes (text, etc.). When a visitor targets a node, it typically means that the visitor can only process the fragment's outer node, as opposed to the fragment as a whole, that is, the outer node and its child nodes.
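To make selectors concrete, here is an illustrative configuration sketch (the visitor class names are hypothetical): one resource config targets the whole stream via the reserved #document selector, while another targets only the fragments enclosed by an order-item node.

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch only; com.example.* visitors are placeholders -->
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">
  <!-- Targets the whole stream -->
  <resource-config selector="#document">
    <resource>com.example.WholeStreamVisitor</resource>
  </resource-config>
  <!-- Targets every fragment whose outer node is <order-item> -->
  <resource-config selector="order-item">
    <resource>com.example.OrderItemVisitor</resource>
  </resource-config>
</smooks-resource-list>
```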

What’s new in Smooks 2?

Smooks 2 introduces the DFDL cartridge and revamps its EDI cartridge, while dropping support for Java 7 along with a few other notable breaking changes:

  • DFDL cartridge: DFDL is a specification for describing file formats in XML. The DFDL cartridge leverages Apache Daffodil to parse files and unparse XML. This opens up Smooks to a wide array of data formats like SWIFT, ISO8583, HL7, and many more.
  • Pipeline support: compose any series of transformations on an event outside the main execution context before directing the pipeline output to the execution result stream or to other destinations.
  • Complete overhaul of the EDI cartridge: rewritten to extend the DFDL cartridge and provide much better support for reading EDI documents; added functionality to serialize EDI documents; and, as in previous Smooks versions, special support for EDIFACT.
  • SAX NG filter: replaces the SAX filter and supersedes the DOM filter. It brings a new visitor API which unifies the SAX and DOM visitor APIs, all cartridges have been migrated to SAX NG, and it supports XSLT and StringTemplate resources, unlike the legacy SAX filter.
  • Mementos: a convenient way to stash and un-stash a visitor's state during its execution lifecycle.
  • Independent release cycles for all cartridges, and one Maven BOM (bill of materials) to track them all.
  • License change: after reaching consensus among our code contributors, we've dual-licensed Smooks under LGPL v3.0 and Apache License 2.0. This license change keeps Smooks open source while adopting a permissive stance to modifications.
  • New Smooks XSD schema (xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd").
  • Uniform XML namespace declarations: dropped the default-selector-namespace and selector-namespace XML attributes in favour of declaring namespaces within the standard xmlns attribute of the smooks-resource-config element.
  • Removed the default-selector attribute from the smooks-resource-config element: selectors now need to be set explicitly.
  • Dropped Smooks-specific annotations in favour of JSR annotations: farewell @ConfigParam, @Config, @AppContext, and @StreamResultWriter, welcome @Inject; farewell @Initialize and @Uninitialize, welcome @PostConstruct and @PreDestroy.
  • Separate top-level Java namespaces for the API and the implementation, providing a cleaner and more intuitive package structure: API interfaces and internal classes were relocated to org.smooks.api and org.smooks.engine, respectively.
  • Improved XPath support for resource selectors: functions like not() are now supported.
  • Numerous dependency updates.
  • Maven coordinates change: we are now publishing Smooks artifacts under Maven group IDs prefixed with org.smooks.
  • Replaced the default SAX parser implementation, Apache Xerces, with FasterXML's Woodstox: benchmarks consistently showed Woodstox outperforming Xerces.

Migrating from Smooks 1.7 to 2.0

  1. Smooks 2 no longer supports Java 7. Your application needs to be compiled to at least Java 8 to run Smooks 2.
  2. Replace references to Java packages org.milyn with org.smooks.api, org.smooks.engine, org.smooks.io or org.smooks.support.
  3. Inherit from org.smooks.api.resource.visitor.sax.ng.SaxNgVisitor instead of org.milyn.delivery.sax.SAXVisitor.
  4. Change legacy document root fragment selectors from $document to #document.
  5. Replace Smooks Maven coordinates to match the coordinates as described in the Maven guide.
  6. Replace ExecutionContext#isDefaultSerializationOn() method calls with ExecutionContext#getContentDeliveryRuntime().getDeliveryConfig().isDefaultSerializationOn().
  7. Replace ExecutionContext#getContext() method calls with ExecutionContext#getApplicationContext().
  8. Replace org.smooks.delivery.dom.serialize.SerializationVisitor references with org.smooks.api.resource.visitor.SerializerVisitor.
  9. Replace org.smooks.cdr.annotation.AppContext annotations with javax.inject.Inject annotations.
  10. Replace org.smooks.cdr.annotation.ConfigParam annotations with javax.inject.Inject annotations:
      • Substitute the @ConfigParam name attribute with the @javax.inject.Named annotation.
      • Wrap java.util.Optional around the field to mimic the behaviour of the @ConfigParam optional attribute.
  11. Replace org.smooks.delivery.annotation.Initialize annotations with javax.annotation.PostConstruct annotations.
  12. Replace org.smooks.delivery.annotation.Uninitialize annotations with javax.annotation.PreDestroy annotations.
  13. Replace references to org.smooks.javabean.DataDecode with org.smooks.api.converter.TypeConverterFactory.
  14. Replace references to org.smooks.cdr.annotation.Configurator with org.smooks.api.lifecycle.LifecycleManager.
  15. Replace references to org.smooks.javabean.DataDecoderException with org.smooks.api.converter.TypeConverterException.
  16. Replace references to org.smooks.cdr.SmooksResourceConfigurationStore with org.smooks.api.Registry.
  17. Replace references to org.milyn.cdr.SmooksResourceConfiguration with org.smooks.api.resource.config.ResourceConfig.
  18. Replace references to org.milyn.delivery.sax.SAXToXMLWriter with org.smooks.io.DomSerializer.

FAQs

See the FAQ.

Maven

See the Maven guide for details on how to integrate Smooks into your project via Maven.

Fundamentals

Smooks is commonly described as a "Transformation Engine". Nonetheless, at its core, Smooks makes no reference to data transformation. The core codebase is designed to hook visitor logic into an event stream produced from a source of some kind. As such, in its most distilled form, Smooks is a Structured Data Event Stream Processor.

An application of a structured data event processor is transformation. In implementation terms, a Smooks transformation solution is a visitor reading the event stream from a source to produce a different representation of the input. However, Smooks’s core capabilities enable much more than transformation. A range of other solutions can be implemented based on the fragment-based processing model:

Java Binding: population of a POJO from the source.

Splitting & Routing: perform complex splitting and routing operations on the source stream, including routing data in different formats (XML, EDI, CSV, POJO, etc…​) to multiple destinations concurrently.

Huge Message Processing: declaratively consume (transform, or split and route) huge messages without writing boilerplate code.

Basic Processing Model

Smooks’s fundamental behaviour is to take an input source, such as XML, and from it generate an event stream to which visitors are applied to produce a result such as EDI.

Several sources and result types are supported which equate to different transformation types, including but not limited to:

  • XML to XML
  • XML to POJO
  • POJO to XML
  • POJO to POJO
  • EDI to XML
  • EDI to POJO
  • POJO to EDI
  • CSV to XML
  • CSV to …​
  • …​ to …​

Smooks maps the source to the result with the help of a highly-tunable SAX event model. The hierarchical events generated from an XML source (startElement, endElement, etc…​) drive the SAX event model, though the event model can be just as easily applied to other structured data sources (EDI, CSV, POJO, etc…​). The most important events are typically the before and after visit events. The following illustration conveys the hierarchical nature of these events.

Image:event-model.gif
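To make the event hierarchy tangible without any Smooks code, the following plain JAXP sketch (standard org.xml.sax classes only; class and method names are illustrative) records the startElement/endElement events a SAX parser fires, which correspond to the "visit before" and "visit after" events described above:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Plain JAXP illustration (not the Smooks API) of the hierarchical SAX
// events Smooks builds on: startElement corresponds to a "visit before"
// event and endElement to a "visit after" event for each fragment.
public class EventModelDemo {

    public static List<String> events(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        events.add("before:" + qName);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        events.add("after:" + qName);
                    }
                });
        return events;
    }
}
```

Parsing `<order><header/></order>` yields before:order, before:header, after:header, after:order, mirroring the nesting shown in the illustration.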

Hello World App

One or more of the SaxNgVisitor interfaces need to be implemented in order to consume the SAX event stream produced from the source, depending on which events are of interest.

The following is a hello world app demonstrating how to implement a visitor that is fired on the visitBefore and visitAfter events of a targeted node in the event stream. In this case, Smooks configures the visitor to target element foo:

Image:simple-example.png

The visitor implementation is straightforward: one method implementation per event. As shown above, a Smooks config (more about resource-config later on) is written to target the visitor at a node’s visitBefore and visitAfter events.

The Java code executing the hello world app is a two-liner:

Smooks smooks = new Smooks("/smooks/echo-example.xml");
smooks.filterSource(new StreamSource(inputStream));

Observe that in this case the program does not produce a result. The program does not even interact with the filtering process in any way because it does not provide an ExecutionContext to smooks.filterSource(...).

This example illustrates the lower-level mechanics of Smooks’s programming model. In reality, most users are not going to want to solve their problems at this level of detail. Smooks ships with substantial pre-built functionality, that is, pre-built visitors. Visitors are bundled based on functionality: these bundles are called Cartridges.

Smooks Resources

A Smooks execution consumes a source of one form or another (XML, EDI, POJO, JSON, CSV, etc…​), and from it, generates an event stream that fires different visitors (Java, Groovy, DFDL, XSLT, etc…​). The goal of this process can be to produce a new result stream in a different format (data transformation), bind data from the source to POJOs and produce a populated Java object graph (Java binding), produce many fragments (splitting), and so on.

At its core, Smooks views visitors and other abstractions as resources. A resource is applied when a selector matches a node in the event stream. The generality of such a processing model can be daunting from a usability perspective because resources are not tied to a particular domain. To counteract this, Smooks 1.1 introduced an Extensible Configuration Model feature that allows specific resource types to be specified in the configuration using dedicated XSD namespaces of their own. Instead of having a generic resource config such as:

<resource-config selector="order-item">
    <resource type="ftl"><!-- <item>
    <id>${.vars["order-item"].@id}</id>
    <productId>${.vars["order-item"].product}</productId>
    <quantity>${.vars["order-item"].quantity}</quantity>
    <price>${.vars["order-item"].price}</price>
</item>
    -->
    </resource>
</resource-config>

an Extensible Configuration Model allows us to have a domain-specific resource config:

<ftl:freemarker applyOnElement="order-item">
    <ftl:template><!-- <item>
    <id>${.vars["order-item"].@id}</id>
    <productId>${.vars["order-item"].product}</productId>
    <quantity>${.vars["order-item"].quantity}</quantity>
    <price>${.vars["order-item"].price}</price>
</item>
    -->
    </ftl:template>
</ftl:freemarker>

When comparing the above snippets, the latter resource config:

  • is more strongly typed and domain-specific, and so easier to read,
  • has auto-completion support from the user’s IDE because Smooks 1.1+ configurations are XSD-based, and
  • needs no resource type to be set in its configuration.

Visitors

Central to how Smooks works is the concept of a visitor. A visitor is a Java class performing a specific task on the targeted fragment, such as applying an XSLT script, binding fragment data to a POJO, or validating fragments.

Selectors

Resource selectors are another central concept in Smooks. A selector chooses the node (or nodes) a visitor should visit, and also serves as a simple opaque lookup value for non-visitor logic.

When the resource is a visitor, Smooks will interpret the selector as an XPath-like expression. There are a number of things to be aware of:

The order in which the XPath expression is applied is the reverse of the normal order, such as what happens in an XSLT script: Smooks inspects backwards from the targeted fragment node, as opposed to forwards from the root node.

Not all of the XPath specification is supported. A selector supports the following XPath syntax:

text() and attribute value selectors: a/b[text() = 'abc'], a/b[text() = 123], a/b[@id = 'abc'], a/b[@id = 123].

text() is only supported on the last selector step in an expression: a/b[text() = 'abc'] is legal while a/b[text() = 'abc']/c is illegal.

text() is only supported on visitor implementations that implement the AfterVisitor interface exclusively. If the visitor also implements the BeforeVisitor or ChildrenVisitor interfaces, an error will result.

or & and logical operations: a/b[text() = 'abc' and @id = 123], a/b[text() = 'abc' or @id = 123]

Namespaces on both the elements and attributes: a:order/b:address[@b:city = 'NY'].

Note: This requires the namespace prefix-to-URI mappings to be defined. A configuration error will result if not defined. Read the namespace declaration section for more details.

Comparison operators: = (equals), != (not equals), < (less than), > (greater than).

Index selectors: a/b[3].

Namespace Declaration

The xmlns attribute is used to bind a selector prefix to a namespace:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:c="http://c" xmlns:d="http://d">

    <resource-config selector="c:item[@c:code = '8655']/d:units[text() = 1]">
        <resource>com.acme.visitors.MyCustomVisitorImpl</resource>
    </resource-config>

</smooks-resource-list>

Alternatively, namespace prefix-to-URI mappings can be declared using the legacy core config namespace element:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">

    <core:namespaces>
        <core:namespace prefix="c" uri="http://c"/>
        <core:namespace prefix="d" uri="http://d"/>
    </core:namespaces>

    <resource-config selector="c:item[@c:code = '8655']/d:units[text() = 1]">
        <resource>com.acme.visitors.MyCustomVisitorImpl</resource>
    </resource-config>

</smooks-resource-list>

Input

Smooks relies on a Reader for ingesting a source and generating a SAX event stream. A reader is any class extending XMLReader. By default, Smooks uses the XMLReader returned from XMLReaderFactory.createXMLReader(). You can easily implement your own XMLReader to create a non-XML reader that generates the source event stream for Smooks to process:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <reader class="com.acme.ZZZZReader" />

    <!--
        Other Smooks resources, e.g. <jb:bean> configs for
        binding data from the ZZZZ data stream into POJOs....
    -->

</smooks-resource-list>

The reader config element is referencing a user-defined XMLReader. It can be configured with a set of handlers, features and parameters:

<reader class="com.acme.ZZZZReader">
    <handlers>
        <handler class="com.X" />
        <handler class="com.Y" />
    </handlers>
    <features>
        <setOn feature="http://a" />
        <setOn feature="http://b" />
        <setOff feature="http://c" />
        <setOff feature="http://d" />
    </features>
    <params>
        <param name="param1">val1</param>
        <param name="param2">val2</param>
    </params>
</reader>

Packaged Smooks modules, known as cartridges, provide support for non-XML readers but, by default, Smooks expects an XML source. Omit the class name from the reader element to set features on the default XML reader:

<reader>
    <features>
        <setOn feature="http://a" />
        <setOn feature="http://b" />
        <setOff feature="http://c" />
        <setOff feature="http://d" />
    </features>
</reader>
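To make the reader contract concrete, here is a minimal, hypothetical non-XML reader built only on the JDK’s SAX classes. The class name, element names, and CSV format are illustrative assumptions; a real Smooks reader such as the com.acme.ZZZZReader referenced above implements XMLReader in the same spirit. Extending XMLFilterImpl (which already implements XMLReader) keeps the sketch short:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical non-XML reader: parses CSV input and fires the same SAX
// events an XML source would, so downstream consumers see
// <csv><record><field>...</field>...</record>...</csv>.
public class CsvSaxReader extends XMLFilterImpl {

    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        // Sketch only: assumes the InputSource carries a character stream.
        Reader source = input.getCharacterStream();
        BufferedReader lines = new BufferedReader(source);

        getContentHandler().startDocument();
        getContentHandler().startElement("", "csv", "csv", new AttributesImpl());

        String line;
        while ((line = lines.readLine()) != null) {
            getContentHandler().startElement("", "record", "record", new AttributesImpl());
            for (String field : line.split(",")) {
                getContentHandler().startElement("", "field", "field", new AttributesImpl());
                getContentHandler().characters(field.toCharArray(), 0, field.length());
                getContentHandler().endElement("", "field", "field");
            }
            getContentHandler().endElement("", "record", "record");
        }

        getContentHandler().endElement("", "csv", "csv");
        getContentHandler().endDocument();
    }

    // Demo helper: serialize the reader's event stream back to XML text
    // via a JAXP identity transform.
    public static String toXml(String csv) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(
                new SAXSource(new CsvSaxReader(), new InputSource(new StringReader(csv))),
                new StreamResult(out));
        return out.toString();
    }
}
```

Feeding `a,b` through the reader produces the element events for one record with two fields, exactly as if the input had been XML to begin with.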

Output

Smooks can present output to the outside world in two ways:

As instances of Result: client code extracts output from the Result instance after passing an empty one to Smooks#filterSource(...).

As side effects: during filtering, resource output is sent to web services, local storage, queues, data stores, and other locations. Events trigger the routing of fragments to external endpoints such as what happens when splitting and routing.

Unless configured otherwise, a Smooks execution does not accumulate the input data to produce all the outputs. The reason is simple: performance! Consider a document consisting of hundreds of thousands (or millions) of orders that need to be split up and routed to different systems in different formats, based on different conditions. The only way of handling documents of these magnitudes is by streaming them.

Important: Smooks can generate output in either, or both, of the above ways, all in a single filtering pass of the source. It does not need to filter the source multiple times in order to generate multiple outputs, which is critical for performance.

Result

A look at the Smooks API reveals that Smooks can be supplied with multiple Result instances:

public void filterSource(Source source, Result... results) throws SmooksException

Smooks can work with the standard JDK StreamResult and DOMResult result types, as well as the Smooks specific ones:

JavaResult: result type for capturing the contents of the Smooks JavaBean context.

StringResult: StreamResult extension wrapping a StringWriter, useful for testing.

Important: As yet, Smooks does not support capturing output to multiple Result instances of the same type. For example, you can specify multiple StreamResult instances in Smooks.filterSource(...) but Smooks will only output to the first StreamResult instance.

Stream Results

The StreamResult and DOMResult types receive special attention from Smooks. When the default.serialization.on global parameter is turned on, as it is by default, Smooks serializes the stream of events to XML while filtering the source. The XML is fed to the Result instance if a StreamResult or DOMResult is passed to Smooks#filterSource.

Note: This is the mechanism used to perform a standard 1-input/1-xml-output character-based transformation.
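Default serialization can be switched off globally via the parameter named above. A sketch (the parameter name comes from the text; its placement in a global <params> block is an assumption based on the params syntax used elsewhere in this guide):

```xml
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <!-- Sketch: turn off default serialization so StreamResult/DOMResult
         instances no longer receive the serialized event stream -->
    <params>
        <param name="default.serialization.on">false</param>
    </params>

</smooks-resource-list>
```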

Side Effects

Smooks is also able to generate different types of output during filtering, that is, while filtering the source event stream but before it reaches the end of the stream. A classic example of this output type is when it is used to split and route fragments to different endpoints for processing by other processes.

Pipeline

A pipeline is a flexible, yet simple, Smooks construct that isolates the processing of a targeted event from its main processing as well as from the processing of other pipelines. In practice, this means being able to compose any series of transformations on an event outside the main execution context before directing the pipeline output to the execution result stream or to other destinations. With pipelines, you can enrich data, rename/remove nodes, and much more.

Under the hood, a pipeline is just another instance of Smooks, made self-evident from the Smooks config element declaring a pipeline:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">

   <core:smooks filterSourceOn="...">
       <core:action>
           ...
       </core:action>
       <core:config>
           <smooks-resource-list>
               ...
           </smooks-resource-list>
       </core:config>
   </core:smooks>

</smooks-resource-list>

core:smooks fires a nested Smooks execution whenever an event in the stream matches the filterSourceOn selector. The pipeline within the inner smooks-resource-list element visits the selected event and its child events. It is worth highlighting that the inner smooks-resource-list element behaves identically to the outer one, and therefore, it accepts resources like visitors, readers, and even pipelines (a pipeline within a pipeline!). Moreover, a pipeline is transparent to its nested resources: a resource’s behaviour remains the same whether it’s declared inside a pipeline or outside it.

The optional core:action element tells the nested Smooks instance what to do with the pipeline’s output. The next sections list the supported actions.

Inline

Merges the pipeline’s output with the result stream:

...
<core:action>
    <core:inline>
        ...
    </core:inline>
</core:action>
...

As described in the subsequent sections, an inline action replaces, prepends, or appends content.

Replace

Substitutes the selected fragment with the pipeline output:

...
<core:inline>
    <core:replace/>
</core:inline>
...

Prepend Before

Adds the output before the selector start tag:

<core:inline>
    <core:prepend-before/>
</core:inline>

Prepend After

Adds the output after the selector start tag:

<core:inline>
    <core:prepend-after/>
</core:inline>

Append Before

Adds the output before the selector end tag:

<core:inline>
    <core:append-before/>
</core:inline>

Append After

Adds the output after the selector end tag:

<core:inline>
    <core:append-after/>
</core:inline>

Bind To

Binds the output to the execution context’s bean store:

...
<core:action>
    <core:bind-to id="..."/>
</core:action>
...

Output To

Directs the output to a different stream other than the result stream:

...
<core:action>
    <core:output-to outputStreamResource="..."/>
</core:action>
...
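Putting these pieces together, a pipeline that replaces each targeted fragment with its transformed output would look roughly like this (a sketch assembled from the elements above; the selector value and the nested resources are placeholders):

```xml
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">

    <core:smooks filterSourceOn="order-item">
        <core:action>
            <core:inline>
                <core:replace/>
            </core:inline>
        </core:action>
        <core:config>
            <smooks-resource-list>
                <!-- visitors transforming the order-item fragment -->
            </smooks-resource-list>
        </core:config>
    </core:smooks>

</smooks-resource-list>
```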

Cartridge

The basic functionality of Smooks can be extended through the development of a Smooks cartridge. A cartridge is a Java archive (JAR) containing reusable resources (also known as Content Handlers). A cartridge augments Smooks with support for a specific type of input source or event handling.

Visit the GitHub organisation page for the complete list of Smooks cartridges.

Filter

A Smooks filter delivers generated events from a reader to the application’s resources. Smooks 1 had the DOM and SAX filters. The DOM filter was simple to use but kept all the events in memory while the SAX filter, though more complex, delivered the events in streaming fashion. Having two filter types meant two different visitor APIs and execution paths, with all the baggage it entailed.

Smooks 2 unifies the legacy DOM and SAX filters without sacrificing convenience or performance. The new SAX NG filter drops the API distinction between DOM and SAX. Instead, the filter streams SAX events as partial DOM elements to SAX NG visitors targeting the element. A SAX NG visitor can read the targeted node as well as any of the node’s ancestors but not the targeted node’s children or siblings in order to keep the memory footprint to a minimum.

The SAX NG filter can mimic DOM by setting its max.node.depth parameter to 0 (default value is 1), allowing each visitor to process the complete DOM tree in its visitAfter(...) method:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <params>
        <param name="max.node.depth">0</param>
    </params>
    ...
</smooks-resource-list>

A max.node.depth value greater than 1 tells the filter to read and keep a node’s descendants up to the desired depth. Take the following input as an example:

<order id="332">
    <header>
        <customer number="123">Joe</customer>
    </header>
    <order-items>
        <order-item id="1">
            <product>1</product>
            <quantity>2</quantity>
            <price>8.80</price>
        </order-item>
        <order-item id="2">
            <product>2</product>
            <quantity>2</quantity>
            <price>8.80</price>
        </order-item>
        <order-item id="3">
            <product>3</product>
            <quantity>2</quantity>
            <price>8.80</price>
        </order-item>
    </order-items>
</order>

Along with the config:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <params>
        <param name="max.node.depth">2</param>
    </params>

    <resource-config selector="order-item">
        <resource>org.acme.MyVisitor</resource>
    </resource-config>

</smooks-resource-list>

At any given time, there will always be a single order-item, along with its product child node, in memory because max.node.depth is 2. Each new order-item overwrites the previous one to minimise the memory footprint. MyVisitor#visitAfter(...) is invoked 3 times, each invocation corresponding to an order-item fragment. The first invocation will process:

<order-item id='1'>
    <product>1</product>
</order-item>

While the second invocation will process:

<order-item id='2'>
    <product>2</product>
</order-item>

Whereas the last invocation will process:

<order-item id='3'>
    <product>3</product>
</order-item>

Programmatically, implementing org.smooks.api.resource.visitor.sax.ng.ParameterizedVisitor will give you fine-grained control over the visitor’s targeted element depth:

...
public class DomVisitor implements ParameterizedVisitor {

    @Override
    public void visitBefore(Element element, ExecutionContext executionContext) {
    }

    @Override
    public void visitAfter(Element element, ExecutionContext executionContext) {
        System.out.println("Element: " + XmlUtil.serialize(element, true));
    }

    @Override
    public int getMaxNodeDepth() {
        return Integer.MAX_VALUE;
    }
}

ParameterizedVisitor#getMaxNodeDepth() returns an integer denoting the targeted element’s maximum tree depth the visitor can accept in its visitAfter(...) method.

Settings

Filter-specific knobs are set through the smooks-core configuration namespace (https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd) introduced in Smooks 1.3:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">

    <core:filterSettings type="SAX NG" (1)
                         defaultSerialization="true" (2)
                         terminateOnException="true" (3)
                         closeSource="true" (4)
                         closeResult="true" (5)
                         rewriteEntities="true" (6)
                         readerPoolSize="3"/> (7)

    <!-- Other visitor configs etc... -->

</smooks-resource-list>

type (default: SAX NG): the type of processing model that will be used. SAX NG is the recommended type. The DOM type is deprecated.

defaultSerialization (default: true): if default serialization should be switched on. Default serialization being turned on simply tells Smooks to locate a StreamResult (or DOMResult) in the Result objects provided to the Smooks.filterSource method and to serialize all events to that Result instance. This behavior can be turned off using this global configuration parameter and can be overridden on a per-fragment basis by targeting a visitor at that fragment that takes ownership of the org.smooks.io.FragmentWriter object.

terminateOnException (default: true): whether an exception should terminate execution.

closeSource (default: true): close Source instance streams passed to the Smooks.filterSource method. The exception here is System.in, which will never be closed.

closeResult (default: true): close Result streams passed to the Smooks.filterSource method. The exceptions here are System.out and System.err, which will never be closed.

rewriteEntities: rewrite XML entities when reading and writing (default serialization) XML.

readerPoolSize (default: 0): the reader pool size. Some Reader implementations are very expensive to create (e.g. Xerces). Pooling (i.e. reusing) Reader instances can result in a huge performance improvement, especially when processing lots of "small" messages. The default value of 0 means a new Reader instance is created for each message (i.e. unpooled). Configure in line with your application’s threading model.

Troubleshooting

Smooks streams events that can be captured and inspected, either in-flight or after execution. HtmlReportGenerator is one such class: it inspects in-flight events and generates an HTML report of the execution:

Smooks smooks = new Smooks("/smooks/smooks-transform-x.xml");
ExecutionContext executionContext = smooks.createExecutionContext();

executionContext.getContentDeliveryRuntime().addExecutionEventListener(new HtmlReportGenerator("/tmp/smooks-report.html"));
smooks.filterSource(executionContext, new StreamSource(inputStream), new StreamResult(outputStream));

HtmlReportGenerator is a useful tool in the developer’s arsenal for diagnosing issues, or for comprehending a transformation.

An example HtmlReportGenerator report can be seen online here.

Of course you can also write and use your own ExecutionEventListener implementations.

Caution: Only use HtmlReportGenerator in development. When enabled, HtmlReportGenerator incurs a significant performance overhead and, with large messages, can even result in OutOfMemory exceptions.

Terminate

You can terminate Smooks’s filtering before it reaches the end of a stream. The following config terminates filtering at the end of the customer fragment:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">

    <!-- Visitors... -->
    <core:terminate onElement="customer"/>

</smooks-resource-list>

The default behavior is to terminate at the end of the targeted fragment, on the visitAfter event. To terminate at the start of the targeted fragment, on the visitBefore event, set the terminateBefore attribute to true:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">

    <!-- Visitors... -->
    <core:terminate onElement="customer" terminateBefore="true"/>

</smooks-resource-list>

Bean Context

The Bean Context is a container for objects which can be accessed during a Smooks execution. One bean context is created per execution context, that is, per Smooks#filterSource(...) operation. Provide an org.smooks.io.payload.JavaResult object to Smooks#filterSource(...) if you want the contents of the bean context to be returned at the end of the filtering process:

//Get the data to filter
StreamSource source = new StreamSource(getClass().getResourceAsStream("data.xml"));

//Create a Smooks instance (cachable)
Smooks smooks = new Smooks("smooks-config.xml");

//Create the JavaResult, which will contain the filter result after filtering
JavaResult result = new JavaResult();

//Filter the data from the source, putting the result into the JavaResult
smooks.filterSource(source, result);

//Getting the Order bean which was created by the JavaBean cartridge
Order order = (Order)result.getBean("order");

Resources like visitors access the bean context’s beans at runtime from the BeanContext. The BeanContext is retrieved from ExecutionContext#getBeanContext(). You should first retrieve a BeanId from the BeanIdStore when adding or retrieving objects from the BeanContext. A BeanId is a special key that delivers higher performance than String keys, though String keys are also supported. The BeanIdStore must be retrieved from ApplicationContext#getBeanIdStore(). A BeanId object can be created by calling BeanIdStore#register(String). If you know that the BeanId is already registered, then you can retrieve it by calling BeanIdStore#getBeanId(String). BeanId is scoped at the application context. You normally register it in the @PostConstruct annotated method of your visitor implementation and then reference it as a member variable from the visitBefore and visitAfter methods.

Note: BeanId and BeanIdStore are thread-safe.

Pre-installed Beans

A number of pre-installed beans are available in the bean context at runtime:

PUUID: This UniqueId instance provides unique identifiers for the filtering ExecutionContext.

PTIME: This Time instance provides time-based data for the filtering ExecutionContext.

The following are examples of how each of these would be used in a FreeMarker template.

Unique ID of the ExecutionContext:

${PUUID.execContext}

Random Unique ID:

${PUUID.random}

Filtering start time in milliseconds:

${PTIME.startMillis}

Filtering start time in nanoseconds:

${PTIME.startNanos}

Filtering start date:

${PTIME.startDate}

Current time in milliseconds:

${PTIME.nowMillis}

Current time in nanoSeconds:

${PTIME.nowNanos}

Current date:

${PTIME.nowDate}

Global Configurations

Global configuration settings are, as the name implies, configuration options that can be set once and be applied to all resources in a configuration.

Smooks supports two types of globals, default properties and global parameters:

Global Configuration Parameters: Every resource configuration in a Smooks configuration can specify parameter elements. These parameter values are available at runtime through the ResourceConfig, or are reflectively injected through the @Inject annotation. Global Configuration Parameters are parameters that are defined centrally (see below) and are accessible to all runtime components via the ExecutionContext (vs ResourceConfig). More on this in the following sections.

Default Properties: Specify default values for attributes. These defaults are automatically applied to `ResourceConfig`s when their corresponding configuration does not specify the attribute. More on this in the following section.

Global Configuration Parameters

Global properties differ from the default properties in that they are not specified on the root element and are not automatically applied to resources.

Global parameters are specified in a <params> element:

<params>
    <param name="xyz.param1">param1-val</param>
</params>

Global Configuration Parameters are accessible via the ExecutionContext e.g.:

public void visitAfter(Element element, ExecutionContext executionContext) {
    String param1 = executionContext.getConfigParameter("xyz.param1", "defaultValueABC");
    ....
}

Default Properties

Default properties are properties that can be set on the root element of a Smooks configuration and applied to all resource configurations in the smooks-conf.xml file. For example, if you have a resource configuration file in which all the resource configurations apply to the same target profile, you could specify default-target-profile="order" to save specifying the profile on every resource configuration:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      default-target-profile="order">

    <resource-config>
        <resource>com.acme.VisitorA</resource>
        ...
    </resource-config>

    <resource-config>
        <resource>com.acme.VisitorB</resource>
        ...
    </resource-config>

</smooks-resource-list>

The following default configuration options are available:

default-target-profile: The default target profile that will be applied to all resources in the Smooks configuration file where a target-profile is not defined.

default-condition-ref: Refers to a global condition by the condition's id. This condition is applied to resources that define an empty condition element (i.e., <condition/>) that does not reference a globally defined condition.

Configuration Modularization

Smooks configurations are easily modularized through use of the <import> element. This allows you to split Smooks configurations into multiple reusable configuration files and then compose the top-level configuration using the <import> element, e.g.:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <import file="bindings/order-binding.xml" />
    <import file="templates/order-template.xml" />

</smooks-resource-list>

You can also inject replacement tokens into the imported configuration by using <param> sub-elements on the <import>. This allows you to make tweaks to the imported configuration.

<!-- Top level configuration... -->
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <import file="bindings/order-binding.xml">
        <param name="orderRootElement">order</param>
    </import>

</smooks-resource-list>
<!-- Imported parameterized bindings/order-binding.xml configuration... -->
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:jb="https://www.smooks.org/xsd/smooks/javabean-1.6.xsd">

    <jb:bean beanId="order" class="org.acme.Order" createOnElement="@orderRootElement@">
        .....
    </jb:bean>

</smooks-resource-list>

Note how the replacement token injection points are specified using @tokenname@.

Exporting Results

When using Smooks standalone, you are in full control of the type of output that Smooks produces, since you specify it by passing a particular Result to the filter method. But when integrating Smooks with other frameworks (JBossESB, Mule, Camel, and others), this needs to be specified inside the framework's configuration. Starting with Smooks 1.4, you can declare the data types that Smooks produces and use the Smooks API to retrieve the Result(s) that Smooks exports.

To declare the type of result that Smooks produces you use the 'exports' element as shown below:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd" xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">
   <core:exports>
      <core:result type="org.smooks.io.payload.JavaResult"/>
   </core:exports>
</smooks-resource-list>

The exports element declares the results that are produced by this Smooks configuration. An exports element can contain one or more result elements. A framework that uses Smooks could then perform filtering like this:

// Get the Exported types that were configured.
Exports exports = Exports.getExports(smooks.getApplicationContext());
if (exports.hasExports())
{
    // Create instances of the Result types.
    // (Only the type, i.e. the Class, is declared in the 'type' attribute.)
    Result[] results = exports.createResults();
    smooks.filterSource(executionContext, getSource(exchange), results);
    // The Result(s) will now be populated by the Smooks filtering process and
    // available to the framework in question.
}

There might also be cases where you only want a portion of the result extracted and returned. You can use the ‘extract’ attribute to specify this:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd">
   <core:exports>
      <core:result type="org.smooks.io.payload.JavaResult" extract="orderBean"/>
   </core:exports>
</smooks-resource-list>

The extract attribute is intended to be used when you are only interested in a sub-section of a produced result. In the example above we are saying that we only want the object named orderBean to be exported. The other contents of the JavaResult will be ignored. Another example where you might want to use this kind of extracting could be when you only want a ValidationResult of a certain type, for example to only return validation errors.

Below is an example of using the extract option from an embedded framework:

// Get the Exported types that were configured.
Exports exports = Exports.getExports(smooks.getApplicationContext());
if (exports.hasExports())
{
    // Create instances of the Result types.
    // (Only the type, i.e. the Class, is declared in the 'type' attribute.)
    Result[] results = exports.createResults();
    smooks.filterSource(executionContext, getSource(exchange), results);
    List<Object> objects = Exports.extractResults(results, exports);
    // Now make the objects available to the framework that this code is running in:
    // Camel, JBossESB, Mule, etc...
}

Performance Tuning

As with any software, performance can be one of the first things to suffer when it is configured or used incorrectly. Smooks is no different in this regard.

General

Cache and reuse the Smooks Object. Initialization of Smooks takes some time and therefore it is important that it is reused.

Pool reader instances where possible. This can result in a huge performance boost, as some readers are very expensive to create.

If possible, use SAX NG filtering. However, you need to check that all Smooks cartridges in use are SAX NG compatible. SAX NG processing is faster than DOM processing and has a consistently small memory footprint. It is especially recommended for processing large messages. See the Filtering Process Selection (DOM or SAX?) section. SAX NG is the default filter since Smooks 2.

Turn off debug logging. Smooks performs some intensive debug logging in parts of the code. This can add significant processing overhead and lower throughput. Also remember that not having your logging configured at all may result in debug log statements being executed!

Contextual selectors can have a negative effect on performance, e.g. evaluating a match for a selector like "a/b/c/d/e" requires more processing than for a selector like "d/e". There will be situations where your data model requires deep selectors, but where it does not, you should try to optimize them for the sake of performance.

Smooks Cartridges

Every cartridge can have its own performance optimization tips.

Javabean Cartridge

If possible, don’t use the Virtual Bean Model. Create beans instead of maps. Creating and adding data to Maps is a lot slower than creating simple POJOs and calling their setter methods.
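In other words, prefer a plain binding target with setters. The Order class below is a hypothetical example of such a POJO, not a class from the Smooks codebase; the Javabean Cartridge would call its setters directly, which is cheaper and more type-safe than populating a Map.

```java
// Hypothetical POJO binding target for the Javabean Cartridge.
// Direct setter calls avoid the per-entry overhead of Map puts.
public class Order {

    private String orderId;
    private double total;

    public void setOrderId(String orderId) {
        this.orderId = orderId;
    }

    public String getOrderId() {
        return orderId;
    }

    public void setTotal(double total) {
        this.total = total;
    }

    public double getTotal() {
        return total;
    }
}
```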

Testing

Unit Testing

Unit testing with Smooks is simple:

public class MyMessageTransformTest {
    @Test
    public void test_transform() throws Exception {
        Smooks smooks = new Smooks(getClass().getResourceAsStream("smooks-config.xml"));

        try {
            Source source = new StreamSource(getClass().getResourceAsStream("input-message.xml"));
            StringResult result = new StringResult();

            smooks.filterSource(source, result);

            // compare the expected xml with the transformation result.
            XMLUnit.setIgnoreWhitespace(true);
            XMLAssert.assertXMLEqual(new InputStreamReader(getClass().getResourceAsStream("expected.xml")), new StringReader(result.getResult()));
        } finally {
            smooks.close();
        }
    }
}

The test case above uses XMLUnit.

The following maven dependency was used for xmlunit in the above test:

<dependency>
    <groupId>xmlunit</groupId>
    <artifactId>xmlunit</artifactId>
    <version>1.1</version>
</dependency>

Common use cases

Processing Huge Messages (GBs)

One of the main features introduced in Smooks v1.0 is the ability to process huge messages (Gbs in size). Smooks supports the following types of processing for huge messages:

One-to-One Transformation: This is the process of transforming a huge message from its source format (e.g. XML), to a huge message in a target format e.g. EDI, CSV, XML etc.

Splitting & Routing: Splitting of a huge message into smaller (more consumable) messages in any format (EDI, XML, Java, etc…​) and Routing of those smaller messages to a number of different destination types (filesystem, JMS, database).

Persistence: Persisting the components of the huge message to a database, from where they can be more easily queried and processed. Within Smooks, we consider this to be a form of Splitting and Routing (routing to a database).

All of the above is possible without writing any code (i.e. in a declarative manner). Typically, any of the above types of processing would have required writing quite a bit of ugly/unmaintainable code. It might also have been implemented as a multi-stage process where the huge message is split into smaller messages (stage #1) and then each smaller message is processed in turn to persist, route, etc…​ (stage #2). This would all be done in an effort to make that ugly/unmaintainable code a little more maintainable and reusable. With Smooks, most of these use-cases can be handled without writing any code. As well as that, they can also be handled in a single pass over the source message, splitting and routing in parallel (plus routing to multiple destinations of different types and in different formats).

Note: Be sure to read the section on Java Binding.

One-to-One Transformation

If the requirement is to process a huge message by transforming it into a single message of another format, the easiest mechanism with Smooks is to apply multiple FreeMarker templates to the Source message Event Stream, outputting to a Smooks.filterSource Result stream.

This can be done in one of 2 ways with FreeMarker templating, depending on the type of model that’s appropriate:

Using FreeMarker + NodeModels for the model.

Using FreeMarker + a Java Object model for the model. The model can be constructed from data in the message, using the Javabean Cartridge.

Option #1 above is obviously the option of choice, if the tradeoffs are OK for your use case. Please see the FreeMarker Templating docs for more details.

The following image shows the source order message, as well as the target sales-order message to which we need to transform it:

Image:huge-message.png

Imagine a situation where the message contains millions of order items. Processing a huge message in this way with Smooks and FreeMarker (using NodeModels) is quite straightforward. Because the message is huge, we need to identify multiple NodeModels in the message, such that the runtime memory footprint is as low as possible. We cannot process the message using a single model, as the full message is just too big to hold in memory. In this case, there are 2 models: one for the main order data (blue highlight) and one for the order-item data (beige highlight):

Image:huge-message-models.png

So in this case, the most data that will be in memory at any one time is the main order data, plus one of the order-items. Because the NodeModels are nested, Smooks makes sure that the order data NodeModel never contains any of the data from the order-item NodeModels. Also, as Smooks filters the message, the order-item NodeModel will be overwritten for every order-item (i.e. they are not collected). See SAX NG.

Configuring Smooks to capture multiple NodeModels for use by the FreeMarker templates is just a matter of configuring the DomModelCreator visitor, targeting it at the root node of each of the models. Note again that Smooks also makes this available to SAX filtering (the key to processing huge messages). The Smooks configuration for creating the NodeModels for this message is:

<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:core="https://www.smooks.org/xsd/smooks/smooks-core-1.6.xsd"
                      xmlns:ftl="https://www.smooks.org/xsd/smooks/freemarker-2.0.xsd">

     <!--
        Create 2 NodeModels. One high level model for the "order"
        (header, etc...) and then one for the "order-item" elements...
     -->
    <resource-config selector="order,order-item">
        <resource>org.smooks.engine.resource.visitor.dom.DomModelCreator</resource>
    </resource-config>

    <!-- FreeMarker templating configs to be added below... -->

Now the FreeMarker templates need to be added. We need to apply 3 templates in total:

A template to output the order "header" details, up to but not including the order items.

A template for each of the order items, generating the <item> elements in the <itemList>.

A template to close out the message.

With Smooks, we implement this by defining 2 FreeMarker templates: one to cover #1 and #3 (combined) above, and a second to cover the order-item elements.

The first FreeMarker template is targeted at the order-items element and looks as follows:

<ftl:freemarker applyOnElement="order-items">
        <ftl:template><!--<salesorder>
    <details>
        <orderid>${order.@id}</orderid>
        <customer>
            <id>${order.header.customer.@number}</id>
            <name>${order.header.customer}</name>
        </customer>
    </details>
    <itemList>
    <?TEMPLATE-SPLIT-PI?>
    </itemList>
</salesorder>-->
        </ftl:template>
</ftl:freemarker>

You will notice the <?TEMPLATE-SPLIT-PI?> processing instruction. This tells Smooks where to split the template, outputting the first part of the template at the start of the order-items element, and the other part at the end of that element. The order-item template (the second template) will be output in between.

The second FreeMarker template is very straightforward. It simply outputs an <item> element at the end of every order-item element in the source message:

    <ftl:freemarker applyOnElement="order-item">
        <ftl:template><!-- <item>
    <id>${.vars["order-item"].@id}</id>
    <productId>${.vars["order-item"].product}</productId>
    <quantity>${.vars["order-item"].quantity}</quantity>
    <price>${.vars["order-item"].price}</price>
</item>-->
        </ftl:template>
    </ftl:freemarker>
</smooks-resource-list>

Because the second template fires on the end of the order-item elements, it effectively generates output into the location of the <?TEMPLATE-SPLIT-PI?> processing instruction in the first template. Note that the second template could also have referenced data in the "order" NodeModel.

And that’s it! This is available as a runnable example in the Tutorials section.

This approach to performing a One-to-One Transformation of a huge message works simply because the only objects in memory at any one time are the order header details and the current order item (in the Virtual Object Model). Obviously it can’t work if the transformation is so obscure as to always require full access to all the data in the source message, e.g. if the message needs to have all the order items reversed in order (or sorted). In such a case, however, you do have the option of routing the order details and items to a database and then using the database’s storage, query, and paging features to perform the transformation.

Splitting & Routing

Smooks supports a number of options when it comes to splitting and routing fragments. The ability to split the stream into fragments and route these fragments to different endpoints (File, JMS, etc…​) is a fundamental capability. Smooks improves this capability with the following features:

Basic Fragment Splitting: basic splitting means that no fragment transformation happens prior to routing. Basic splitting and routing involves defining the XPath of the fragment to be split out and defining a routing component (e.g., Apache Camel) to route that unmodified split fragment.

Complex Fragment Splitting: basic fragment splitting works for many use cases and is what most splitting and routing solutions offer. Smooks extends the basic splitting capabilities by allowing you to perform transformations on the split fragment data before routing is applied. For example, merging the customer-details order information into each order-item fragment before routing the split order-item fragments.

In-Flight Stream Splitting & Routing (Huge Message Support): Smooks is able to process gigabyte streams because it can perform in-flight event routing; events are not accumulated when the max.node.depth parameter is left unset.

Multiple Splitting and Routing: conditionally split and route multiple fragments (different formats XML, EDI, POJOs, etc…​) to different endpoints in a single filtering pass of the source. One could route an OrderItem Java instance to the HighValueOrdersValidation JMS queue for order items with a value greater than $1,000 and route all order items as XML/JSON to an HTTP endpoint for logging.

Extending Smooks

All existing Smooks functionality (Java Binding, EDI processing, etc…​) is built through extension of a number of well defined APIs. We will look at these APIs in the coming sections.

The main extension points/APIs in Smooks are:

Reader APIs: Those for processing Source/Input data (Readers) so as to make it consumable by other Smooks components as a series of well defined hierarchical events (based on the SAX event model) for all of the message fragments and sub-fragments.

Visitor APIs: Those for consuming the message fragment SAX events produced by a source/input reader.

Another very important aspect of writing Smooks extensions is how these components are configured. Because this is common to all Smooks components, we will look at this first.

Configuring Smooks Components

All Smooks components are configured in exactly the same way. As far as the Smooks Core code is concerned, all Smooks components are "resources" and are configured via a ResourceConfig instance, which we talked about in earlier sections.

Smooks provides mechanisms for constructing namespace (XSD) specific XML configurations for components, but the most basic configuration (and the one that maps directly to the ResourceConfig class) is the basic XML configuration from the base configuration namespace (https://www.smooks.org/xsd/smooks-2.0.xsd).

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <resource-config selector="">
        <resource></resource>
        <param name=""></param>
    </resource-config>

</smooks-resource-list>

Where:

The selector attribute is the mechanism by which the resource is "selected" e.g. can be an XPath for a visitor. We’ll see more of this in the coming sections.

The resource element is the actual resource. This can be a Java class name or some other form of resource (such as a template). For the purposes of this section, however, let’s just assume the resource to be a Java class name.

The param elements are configuration parameters for the resource defined in the resource element.

Smooks takes care of all the details of creating the runtime representation of the resource (e.g. constructing the class named in the resource element) and injecting all the configuration parameters. It also works out what the resource type is, and from that, how to interpret things like the selector e.g., if the resource is a visitor instance, it knows the selector is an XPath, selecting a Source message fragment.

Configuration Annotations

After your component has been created, you need to configure it with the <param> element details. This is done using the @Inject annotation.

@Inject

The @Inject annotation reflectively injects the named parameter (from the <param> elements) having the same name as the annotated property itself (the name can actually be different, but by default it matches against the name of the component property).

Suppose we have a component as follows:

public class DataSeeder {

    @Inject
    private File seedDataFile;

    public File getSeedDataFile() {
        return seedDataFile;
    }

    // etc...
}

We configure this component in Smooks as follows:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <resource-config selector="dataSeeder">
        <resource>com.acme.DataSeeder</resource>
        <param name="seedDataFile">./seedData.xml</param>
    </resource-config>

</smooks-resource-list>

This annotation eliminates a lot of noisy code from your component because it:

Handles decoding of the value before setting it on the annotated component property. Smooks provides type converters for all the main types (Integer, Double, File, Enums, etc…​), but you can implement and use a custom TypeConverter where the out-of-the-box converters don’t cover specific decoding requirements. Smooks will automatically use your custom converter if it is registered. See the TypeConverter Javadocs for details on registering a TypeConverter implementation such that Smooks will automatically locate it for converting a specific data type.

Supports enum constraints for the injected property, generating a configuration exception where the configured value is not one of the defined choice values. For example, you may have a property which has a constrained value set of "ON" and "OFF". You can use an enum for the property type to constrain the value, raise exceptions, etc…​:

@Inject
private OnOffEnum foo;

Can specify default property values:

@Inject
private Boolean foo = true;

Can specify whether the property is optional:

@Inject
private java.util.Optional<Boolean> foo;

By default, all properties are required but setting a default implicitly marks the property as being optional.

@PostConstruct and @PreDestroy

The Inject annotation is great for configuring your component with simple values, but sometimes your component needs more involved configuration for which we need to write some "initialization" code. For this, Smooks provides @PostConstruct.

On the other side of this, there are times when we need to undo work performed during initialization when the associated Smooks instance is being discarded (garbage collected), e.g. to release resources acquired during initialization. For this, Smooks provides @PreDestroy.

The basic initialization/un-initialization sequence can be described as follows:

smooks = new Smooks(..);

    // Initialize all annotated components
    @PostConstruct

        // Use the smooks instance through a series of filterSource invocations...
        smooks.filterSource(...);
        smooks.filterSource(...);
        smooks.filterSource(...);
        ... etc ...

smooks.close();

    // Uninitialize all annotated components
    @PreDestroy

In the following example, lets assume we have a component that opens multiple connections to a database on initialization and then needs to release all those database resources when we close the Smooks instance.

public class MultiDataSourceAccessor {

    @Inject
    private File dataSourceConfig;

    Map<String, Datasource> datasources = new HashMap<String, Datasource>();

    @PostConstruct
    public void createDataSources() {
        // Add DS creation code here....
        // Read the dataSourceConfig property to read the DS configs...
    }

    @PreDestroy
    public void releaseDataSources() {
        // Add DS release code here....
    }

    // etc...
}

Notes:

@PostConstruct and @PreDestroy methods must be public, zero-arg methods.

@Inject properties are all initialized before the first @PostConstruct method is called. Therefore, you can use @Inject component properties as input to the initialization process.

@PreDestroy methods are all called in response to a call to the Smooks.close method.

Defining Custom Configuration Namespaces

Smooks supports a mechanism for defining custom configuration namespaces for components. This allows you to support custom, XSD-based (validatable) configurations for your components vs. treating them all as vanilla Smooks resources via the base configuration.

The basic process involves:

Writing a configuration XSD for your component that extends the base https://www.smooks.org/xsd/smooks-2.0.xsd configuration namespace. This XSD must be supplied on the classpath with your component. It must be located in the /META-INF folder and have the same path as the namespace URI. For example, if your extended namespace URI is http://www.acme.com/schemas/smooks/acme-core-1.0.xsd, then the physical XSD file must be supplied on the classpath in "/META-INF/schemas/smooks/acme-core-1.0.xsd".

Writing a Smooks configuration namespace mapping configuration file that maps the custom namespace configuration into a ResourceConfig instance. This file must be named (by convention) based on the name of the namespace it is mapping and must be physically located on the classpath in the same folder as the XSD. Extending the above example, the Smooks mapping file would be "/META-INF/schemas/smooks/acme-core-1.0.xsd-smooks.xml". Note the "-smooks.xml" postfix.

The easiest way to get familiar with this mechanism is to look at existing extended namespace configurations within the Smooks code itself. All Smooks components (including, e.g., the Java Binding functionality) use this mechanism for defining their configurations. Smooks Core itself defines a number of extended configuration namespaces, as can be seen in the source.

Implementing a Source Reader

Implementing and configuring a new Source Reader for Smooks is straightforward. The Smooks-specific parts of the process are easy; the level of effort involved is mostly a function of the complexity of the source data format for which you are implementing the reader.

Implementing a Reader for your custom data format immediately opens all Smooks capabilities to that data format e.g. Java Binding, Templating, Persistence, Validation, Splitting & Routing, etc…​ So a relatively small investment can yield a quite significant return. The only requirement, from a Smooks perspective, is that the Reader implements the standard org.xml.sax.XMLReader interface from the Java JDK. However, if you want to be able to configure the Reader implementation, it needs to implement the org.smooks.api.resource.reader.SmooksXMLReader interface (which is just an extension of org.xml.sax.XMLReader). So, you can easily use (or extend) an existing org.xml.sax.XMLReader implementation, or implement a new Reader from scratch.

Let’s now look at a simple example of implementing a Reader for use with Smooks. In this example, we will implement a Reader that can read a stream of Comma Separated Value (CSV) records, converting the CSV stream into a stream of SAX events that can be processed by Smooks, allowing you to do all the things Smooks allows (Java Binding, etc…​).

We start by implementing the basic Reader class:

public class MyCSVReader implements SmooksXMLReader {

    // Implement all of the XMLReader methods...
}

Two methods from the XMLReader interface are of particular interest:

setContentHandler(ContentHandler): This method is called by Smooks Core. It sets the ContentHandler instance for the reader. The ContentHandler instance methods are called from inside the parse(InputSource) method.

parse(InputSource): This is the method that receives the Source data input stream, parses it (i.e. in the case of this example, the CSV stream) and generates the SAX event stream through calls to the ContentHandler instance supplied in the setContentHandler(ContentHandler) method.

We need to configure our CSV reader with the names of the fields associated with the CSV records. Configuring a custom reader implementation is the same as for any Smooks component, as described in the Configuring Smooks Components section above.

So focusing a little more closely on the above methods and our fields configuration:

public class MyCSVReader implements SmooksXMLReader {

    private ContentHandler contentHandler;

    @Inject
    private String[] fields; // Auto decoded and injected from the "fields" <param> on the reader config.

    public void setContentHandler(ContentHandler contentHandler) {
        this.contentHandler = contentHandler;
    }

    public void parse(InputSource csvInputSource) throws IOException, SAXException {
        // TODO: Implement parsing of CSV Stream...
    }

    // Other XMLReader methods...
}

So now we have our basic Reader implementation stub. We can start writing unit tests to test the new reader implementation.

The first thing we need is some sample CSV input. Let’s use a simple list of names:

names.csv

Tom,Fennelly
Mike,Fennelly
Mark,Jones

The second thing we need is a test Smooks configuration that configures Smooks with our MyCSVReader. As stated before, everything in Smooks is a resource and can be configured with the basic <resource-config> configuration. While this works fine, it’s a little noisy, so Smooks provides a <reader> configuration element specifically for the purpose of configuring a reader. The configuration for our test looks like the following:

mycsvread-config.xml
<?xml version="1.0"?>
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <reader class="com.acme.MyCSVReader">
        <params>
            <param name="fields">firstname,lastname</param>
        </params>
    </reader>

</smooks-resource-list>

And of course we need the JUnit test class:

public class MyCSVReaderTest extends TestCase {

    public void test() {
        Smooks smooks = new Smooks(getClass().getResourceAsStream("mycsvread-config.xml"));
        StringResult serializedCSVEvents = new StringResult();

        smooks.filterSource(new StreamSource(getClass().getResourceAsStream("names.csv")), serializedCSVEvents);

        System.out.println(serializedCSVEvents);

        // TODO: add assertions, etc...
    }
}

So now we have a basic setup with our custom Reader implementation, as well as a unit test that we can use to drive our development. Of course, our reader’s parse method is not doing anything yet and our test class is not making any assertions, so let’s start implementing the parse method:

public class MyCSVReader implements SmooksXMLReader {

    private ContentHandler contentHandler;

    @Inject
    private String[] fields; // Auto decoded and injected from the "fields" <param> on the reader config.

    public void setContentHandler(ContentHandler contentHandler) {
        this.contentHandler = contentHandler;
    }

    public void parse(InputSource csvInputSource) throws IOException, SAXException {
        BufferedReader csvRecordReader = new BufferedReader(csvInputSource.getCharacterStream());
        String csvRecord;

        // Send the start of message events to the handler...
        contentHandler.startDocument();
        contentHandler.startElement(XMLConstants.NULL_NS_URI, "message-root", "", new AttributesImpl());

        csvRecord = csvRecordReader.readLine();
        while(csvRecord != null) {
            String[] fieldValues = csvRecord.split(",");

            // perform checks...

            // Send the events for this record...
            contentHandler.startElement(XMLConstants.NULL_NS_URI, "record", "", new AttributesImpl());
            for(int i = 0; i < fields.length; i++) {
                contentHandler.startElement(XMLConstants.NULL_NS_URI, fields[i], "", new AttributesImpl());
                contentHandler.characters(fieldValues[i].toCharArray(), 0, fieldValues[i].length());
                contentHandler.endElement(XMLConstants.NULL_NS_URI, fields[i], "");
            }
            contentHandler.endElement(XMLConstants.NULL_NS_URI, "record", "");

            csvRecord = csvRecordReader.readLine();
        }

        // Send the end of message events to the handler...
        contentHandler.endElement(XMLConstants.NULL_NS_URI, "message-root", "");
        contentHandler.endDocument();
    }

    // Other XMLReader methods...
}

If you run the unit test class now, you should see the following output on the console (formatted):

<message-root>
    <record>
        <firstname>Tom</firstname>
        <lastname>Fennelly</lastname>
    </record>
    <record>
        <firstname>Mike</firstname>
        <lastname>Fennelly</lastname>
    </record>
    <record>
        <firstname>Mark</firstname>
        <lastname>Jones</lastname>
    </record>
</message-root>

After this, it is just a case of expanding the tests, hardening the reader implementation code, etc…​
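To illustrate the kind of hardening the "// perform checks..." placeholder in the parse loop hints at, here is a small self-contained sketch of record validation logic. The CsvRecordSplitter class and its method are hypothetical helpers for illustration, not part of Smooks:

```java
// Hypothetical helper illustrating the checks a hardened CSV reader might perform.
public class CsvRecordSplitter {

    // Split a CSV record, verify the field count, and trim surrounding whitespace.
    public static String[] splitRecord(String record, int expectedFields) {
        String[] values = record.split(",", -1); // -1 keeps trailing empty fields
        if (values.length != expectedFields) {
            throw new IllegalArgumentException("Expected " + expectedFields
                    + " fields but got " + values.length + ": " + record);
        }
        for (int i = 0; i < values.length; i++) {
            values[i] = values[i].trim();
        }
        return values;
    }
}
```

Logic like this would slot into the parse method before the per-field event loop, so malformed records fail fast instead of producing broken events.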

Now you can use your reader to perform all sorts of operations supported by Smooks. As an example, the following configuration could be used to bind the names into a List of PersonName objects:

java-binding-config.xml
<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd" xmlns:jb="https://www.smooks.org/xsd/smooks/javabean-1.6.xsd">

    <reader class="com.acme.MyCSVReader">
        <params>
            <param name="fields">firstname,lastname</param>
        </params>
    </reader>

    <jb:bean beanId="peopleNames" class="java.util.ArrayList" createOnElement="message-root">
        <jb:wiring beanIdRef="personName" />
    </jb:bean>

    <jb:bean beanId="personName" class="com.acme.PersonName" createOnElement="message-root/record">
        <jb:value property="first" data="record/firstname" />
        <jb:value property="last" data="record/lastname" />
    </jb:bean>

</smooks-resource-list>

And then a test for this configuration could look as follows:

public class MyCSVReaderTest extends TestCase {

    public void test_java_binding() {
        Smooks smooks = new Smooks(getClass().getResourceAsStream("java-binding-config.xml"));
        JavaResult javaResult = new JavaResult();

        smooks.filterSource(new StreamSource(getClass().getResourceAsStream("names.csv")), javaResult);

        List<PersonName> peopleNames = (List<PersonName>) javaResult.getBean("peopleNames");

        // TODO: add assertions etc
    }
}

For more on Java Binding, see the Java Binding section.

Tips:

Reader instances are never used concurrently. Smooks Core creates a new instance for every message, or pools and reuses instances according to the readerPoolSize FilterSettings property.

If your Reader requires access to the Smooks ExecutionContext for the current filtering context, your Reader needs to implement the SmooksXMLReader interface.

If your Source data is a binary data stream, your Reader must implement the StreamReader interface (see the next section).

You can programmatically configure your reader (e.g. in your unit tests) using a GenericReaderConfigurator instance, which you then set on the Smooks instance.

While the basic configuration is fine, it’s possible to define a custom configuration namespace (XSD) for your custom CSV Reader implementation. This topic is not covered here. Review the source code to see the extended configuration namespace for the Reader implementations supplied with Smooks out-of-the-box, e.g. the EDIReader, CSVReader, JSONReader, etc. From this, you should be able to work out how to do the same for your own custom Reader.

Implementing a Binary Source Reader

Prior to Smooks v1.5, binary readers needed to implement the StreamReader interface. This is no longer a requirement. All XMLReader instances receive an InputSource (to their parse method) that contains an InputStream if the InputStream was provided in the StreamSource passed in the Smooks.filterSource method call. This means that all XMLReader instances are guaranteed to receive an InputStream if one is available, so there is no need to mark the XMLReader instance.

Implementing a Flat File Source Reader

In Smooks v1.5 we tried to make it a little easier to implement a custom reader for reading flat file data formats. By flat file we mean "record" based data formats, where the data in the message is structured in flat records as opposed to a more hierarchical structure. Examples of this would be Comma Separated Value (CSV) and Fixed Length Field (FLF). The new API introduced in Smooks v1.5 should remove the complexity of the XMLReader API (as outlined above).

The API is composed of two interfaces plus a number of support classes. These interfaces work as a pair: both need to be implemented if you wish to use this API for processing a custom flat file format not already supported by Smooks.

/**
 * {@link RecordParser} factory class.
 * <p/>
 * Configurable by the Smooks {@link org.smooks.cdr.annotation.Configurator}
 */
public interface RecordParserFactory {

    /**
     * Create a new Flat File {@link RecordParser} instance.
     * @return A new {@link RecordParser} instance.
     */
    RecordParser newRecordParser();
}


/**
 * Flat file Record Parser.
 */
public interface RecordParser<T extends RecordParserFactory>  {

    /**
     * Set the parser factory that created the parser instance.
     * @param factory The parser factory that created the parser instance.
     */
    void setRecordParserFactory(T factory);

    /**
     * Set the Flat File data source on the parser.
     * @param source The flat file data source.
     */
    void setDataSource(InputSource source);

    /**
     * Parse the next record from the message stream and produce a {@link Record} instance.
     * @return The records instance.
     * @throws IOException Error reading message stream.
     */
    Record nextRecord() throws IOException;

}

Obviously, the RecordParserFactory implementation is responsible for creating the RecordParser instances for the Smooks runtime. The RecordParserFactory is the class that Smooks configures, so it is here that you place all your @Inject details. The created RecordParser instances are supplied with a reference to the RecordParserFactory instance that created them, so it is easy enough to provide them with access to the configuration via getters on the RecordParserFactory implementation.

The RecordParser implementation is responsible for parsing out each record (a Record contains a set of Fields) in the nextRecord() method. Each instance is supplied with the flat file data source via the setDataSource(InputSource) method. The RecordParser should store a reference to this source and use it in the nextRecord() method. A new instance of a given RecordParser implementation is created for each message being filtered by Smooks.

Configuring your implementation in the Smooks configuration is as simple as the following:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:ff="https://www.smooks.org/xsd/smooks/flatfile-1.6.xsd">

    <ff:reader fields="first,second,third" parserFactory="com.acme.ARecordParserFactory">
        <params>
            <param name="aConfigParameter">aValue</param>
            <param name="bConfigParameter">bValue</param>
        </params>
    </ff:reader>

    <!--
        Other Smooks configurations e.g. <jb:bean> configurations
    -->

</smooks-resource-list>

The Flat File configuration also supports basic Java binding configurations, inlined in the reader configuration.

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:ff="https://www.smooks.org/xsd/smooks/flatfile-1.6.xsd">

    <ff:reader fields="firstname,lastname,gender,age,country" parserFactory="com.acme.PersonRecordParserFactory">
        <!-- The field names must match the property names on the Person class. -->
        <ff:listBinding beanId="people" class="com.acme.Person" />
    </ff:reader>

</smooks-resource-list>

To execute this configuration:

Smooks smooks = new Smooks(configStream);
JavaResult result = new JavaResult();

smooks.filterSource(new StreamSource(messageReader), result);

List<Person> people = (List<Person>) result.getBean("people");

Smooks also supports creation of Maps from the record set:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:ff="https://www.smooks.org/xsd/smooks/flatfile-1.6.xsd">

    <ff:reader fields="firstname,lastname,gender,age,country" parserFactory="com.acme.PersonRecordParserFactory">
        <ff:mapBinding beanId="people" class="com.acme.Person" keyField="firstname" />
    </ff:reader>

</smooks-resource-list>

The above configuration would produce a Map of Person instances, keyed by the "firstname" value of each Person. It would be executed as follows:

Smooks smooks = new Smooks(configStream);
JavaResult result = new JavaResult();

smooks.filterSource(new StreamSource(messageReader), result);

Map<String, Person> people = (Map<String, Person>) result.getBean("people");

Person tom = people.get("Tom");
Person mike = people.get("Mike");

Virtual Models are also supported, so you can define the class attribute as a java.util.Map and have the record field values bound into Map instances, which are in turn added to a List or a Map.
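Conceptually, the keyField-based map binding described above amounts to turning each record into a map of field name to value, keyed by the chosen field — much like the Virtual Model case. The following self-contained sketch (a hypothetical helper, not Smooks internals) illustrates the idea with plain Maps:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of keyField binding, not part of the Smooks API.
public class KeyFieldBinder {

    // Each record becomes a Map of field name -> value; the result map is
    // keyed by the value of the chosen key field in each record.
    public static Map<String, Map<String, String>> bind(
            List<String[]> records, String[] fieldNames, String keyField) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        int keyIndex = Arrays.asList(fieldNames).indexOf(keyField);
        for (String[] record : records) {
            Map<String, String> bean = new LinkedHashMap<>();
            for (int i = 0; i < fieldNames.length; i++) {
                bean.put(fieldNames[i], record[i]);
            }
            result.put(record[keyIndex], bean);
        }
        return result;
    }
}
```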

VariableFieldRecordParser and VariableFieldRecordParserFactory

VariableFieldRecordParser and VariableFieldRecordParserFactory are abstract implementations of the RecordParser and RecordParserFactory interfaces. They provide very useful base implementations for a flat file reader, with base support for:

The utility java binding configurations as outlined in the previous section.

Support for "variable field" records, i.e. a flat file message that contains multiple record definitions. The different records are identified by the value of the first field in the record and are defined as follows: fields="book[name,author] | magazine[*]". Note that the record definitions are pipe separated: "book" records have a first field value of "book", while "magazine" records have a first field value of "magazine". An asterisk ("*") as the field definition for a record tells the reader to generate the field names in the generated events (e.g. "field_0", "field_1", etc.).

The ability to read the next record chunk, with support for a simple record delimiter, or a regular expression (regex) pattern that marks the beginning of each record.

The CSV and Regex readers are implemented using these abstract classes. See the csv-variable-record and flatfile-to-xml-regex examples. The Regex Reader implementation is also a good example that can be used as a basis for your own custom flat file reader.
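The record-definition syntax described above is straightforward to interpret. The following self-contained sketch (a hypothetical helper, not part of the Smooks API) shows one way to parse a definition such as fields="book[name,author] | magazine[*]" into per-record field lists:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for the "variable field" record-definition syntax.
public class FieldDefinitionParser {

    // Parse e.g. "book[name,author] | magazine[*]" into record name -> field names.
    // An empty list marks a "*" record, whose field names are generated
    // (field_0, field_1, ...) by the reader.
    public static Map<String, List<String>> parse(String definition) {
        Map<String, List<String>> records = new LinkedHashMap<>();
        for (String recordDef : definition.split("\\|")) {   // definitions are pipe separated
            recordDef = recordDef.trim();
            int open = recordDef.indexOf('[');
            String name = recordDef.substring(0, open).trim();
            String fieldList = recordDef.substring(open + 1, recordDef.length() - 1);
            if (fieldList.equals("*")) {
                records.put(name, Collections.emptyList());  // generated field names
            } else {
                List<String> fields = new ArrayList<>();
                for (String field : fieldList.split(",")) {
                    fields.add(field.trim());
                }
                records.put(name, fields);
            }
        }
        return records;
    }
}
```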

Implementing a Fragment Visitor

Visitors are the workhorse of Smooks. Most of the out-of-the-box functionality in Smooks (Java binding, templating, persistence, etc…​) was created by creating one or more visitors. Visitors often collaborate through the ExecutionContext and ApplicationContext objects, accomplishing a common goal by working together.

Important: Smooks treats all visitors as stateless objects. A visitor instance must be usable concurrently across multiple messages, that is, across multiple concurrent calls to the Smooks.filterSource method. All state associated with the current Smooks.filterSource execution must be stored in the ExecutionContext. For more details see the ExecutionContext and ApplicationContext sections.

SAX NG Visitor API

The SAX NG visitor API is made up of a number of interfaces. These interfaces are based on the SAX events that a SaxNgVisitor implementation can capture and process. Depending on the use case being solved with the SaxNgVisitor implementation, you may need to implement one or all of these interfaces.

BeforeVisitor: Captures the startElement SAX event for the targeted fragment element:

public interface BeforeVisitor extends Visitor {

    void visitBefore(Element element, ExecutionContext executionContext);
}

ChildrenVisitor: Captures the character based SAX events for the targeted fragment element, as well as Smooks generated (pseudo) events corresponding to the startElement events of child fragment elements:

public interface ChildrenVisitor extends Visitor {

    void visitChildText(CharacterData characterData, ExecutionContext executionContext) throws SmooksException, IOException;

    void visitChildElement(Element childElement, ExecutionContext executionContext) throws SmooksException, IOException;
}

AfterVisitor: Captures the endElement SAX event for the targeted fragment element:

public interface AfterVisitor extends Visitor {

    void visitAfter(Element element, ExecutionContext executionContext);
}

As a convenience for implementations that need to capture all the SAX events, the above three interfaces are pulled together into the single ElementVisitor interface.

Illustrating these events using a piece of XML:

<message>
    <target-fragment>      <--- BeforeVisitor.visitBefore
        Text!!                       <--- ChildrenVisitor.visitChildText
        <child>                      <--- ChildrenVisitor.visitChildElement
        </child>
    </target-fragment>     <--- AfterVisitor.visitAfter
</message>
Note: The above is just an illustration of a Source message event stream. It looks like XML here, but the Source could be EDI, CSV, JSON, etc. Think of it as an XML serialization of the event stream, rendered for easy reading.
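The event sequence in the illustration can be reproduced with the JDK's standard SAX parser. This self-contained sketch (not Smooks-specific) maps raw SAX callbacks to the visitor methods described above:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Trace the SAX events fired for a small document, in order.
public class SaxEventTrace {

    public static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        events.add("start:" + qName);  // maps to BeforeVisitor.visitBefore
                    }
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        String text = new String(ch, start, length).trim();
                        if (!text.isEmpty()) {
                            events.add("text:" + text); // maps to ChildrenVisitor.visitChildText
                        }
                    }
                    @Override
                    public void endElement(String uri, String local, String qName) {
                        events.add("end:" + qName);    // maps to AfterVisitor.visitAfter
                    }
                });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return events;
    }
}
```

Running it against the illustrated message shows the same ordering: the start of target-fragment arrives first, then its text and child events, and its end event arrives last.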

Element: As can be seen from the SAX NG interfaces above, an Element is passed to every method call. This object contains details about the targeted fragment element, including attributes and their values. We’ll discuss text accumulation and StreamResult writing in the coming sections.

Text Accumulation

SAX is a stream based processing model. It doesn’t create a Document Object Model (DOM) of any form. It doesn’t "accumulate" event data in any way. This is why it is a suitable processing model for processing huge message streams.

The Element will always contain attributes associated with the targeted element, but will not contain the fragment child text data, whose SAX events (ChildrenVisitor.visitChildText) occur between the BeforeVisitor.visitBefore and AfterVisitor.visitAfter events (see above illustration). The filter does not accumulate text events on the Element because, as already stated, that could result in a significant performance drain. Of course the downside to this is the fact that if your SaxNgVisitor implementation needs access to the text content of a fragment, you need to explicitly tell Smooks to accumulate text for the targeted fragment. This is done by stashing the text into a memento from within the ChildrenVisitor.visitChildText method and then restoring the memento from within the AfterVisitor.visitAfter method implementation of your SaxNgVisitor as shown below:

public class MyVisitor implements ChildrenVisitor, AfterVisitor {

    @Override
    public void visitChildText(CharacterData characterData, ExecutionContext executionContext) {
        executionContext.getMementoCaretaker().stash(new TextAccumulatorMemento(new NodeVisitable(characterData.getParentNode()), this), textAccumulatorMemento -> textAccumulatorMemento.accumulateText(characterData.getTextContent()));
    }

    @Override
    public void visitChildElement(Element childElement, ExecutionContext executionContext) {

    }

    @Override
    public void visitAfter(Element element, ExecutionContext executionContext) {
        TextAccumulatorMemento textAccumulatorMemento = new TextAccumulatorMemento(new NodeVisitable(element), this);
        executionContext.getMementoCaretaker().restore(textAccumulatorMemento);
        String fragmentText = textAccumulatorMemento.getTextContent();

        // ... etc ...
    }
}

It is a bit ugly having to implement ChildrenVisitor.visitChildText just to tell Smooks to accumulate the text events for the targeted fragment. For that reason, we have the @TextConsumer annotation that can be used to annotate your SaxNgVisitor implementation, removing the need to implement the ChildrenVisitor.visitChildText method:

@TextConsumer
public class MyVisitor implements AfterVisitor {

    public void visitAfter(Element element, ExecutionContext executionContext) {
        String fragmentText = element.getTextContent();

        // ... etc ...
    }
}

Note that the complete fragment text will not be available until the AfterVisitor.visitAfter event.

StreamResult Writing/Serialization

The Smooks.filterSource(Source, Result) method can take one or more of a number of different Result type implementations, one of which is the StreamResult class (see Multiple Outputs/Results). By default, Smooks will always serialize the full Source event stream as XML to any StreamResult instance provided to the Smooks.filterSource(Source, Result) method.

So, if the Source provided to the Smooks.filterSource(Source, Result) method is an XML stream and a StreamResult instance is provided as one of the Result instances, the Source XML will be written out to the StreamResult unmodified, unless the Smooks instance is configured with one or more SaxNgVisitor implementations that modify one or more fragments. In other words, Smooks streams the Source in and back out again through the StreamResult instance. Default serialization can be turned on/off by configuring the filter settings.

If you want to modify the serialized form of one of the message fragments (i.e. "transform"), you need to implement a SaxNgVisitor to do so and target it at the message fragment using an XPath-like expression.

Note: You can also modify the serialized form of a message fragment using one of the out-of-the-box Templating components. These components are also SaxNgVisitor implementations.

The key to implementing a SaxNgVisitor geared towards transforming the serialized form of a fragment is telling Smooks that the SaxNgVisitor implementation in question will be writing to the StreamResult. You need to tell Smooks this because Smooks supports targeting of multiple SaxNgVisitor implementations at a single fragment, but only one SaxNgVisitor is allowed to write to the StreamResult, per fragment. If a second SaxNgVisitor attempts to write to the StreamResult, a SAXWriterAccessException will result and you will need to modify your Smooks configuration.

In order to be "the one" that writes to the StreamResult, the SaxNgVisitor needs to acquire ownership of the Writer to the StreamResult. It does this simply by calling the ExecutionContext.getWriter().write(…​) method from inside its BeforeVisitor.visitBefore method implementation:

public class MyVisitor implements ElementVisitor {

    @Override
    public void visitBefore(Element element, ExecutionContext executionContext) {
        Writer writer = executionContext.getWriter();

        // ... write the start of the fragment...
    }

    @Override
    public void visitChildText(CharacterData characterData, ExecutionContext executionContext) {
        Writer writer = executionContext.getWriter();

        // ... write the child text...
    }

    @Override
    public void visitChildElement(Element childElement, ExecutionContext executionContext) {
    }

    @Override
    public void visitAfter(Element element, ExecutionContext executionContext) {
        Writer writer = executionContext.getWriter();

        // ... close the fragment...
    }
}
Note: If you need to control serialization of sub-fragments, reset the Writer instance so as to divert serialization of the sub-fragments. You do this by calling ExecutionContext.setWriter.

Sometimes you know that the target fragment you are serializing/transforming will never have sub-fragments. In this situation, it’s a bit ugly to have to implement the BeforeVisitor.visitBefore method just to make a call to the ExecutionContext.getWriter().write(...) method to acquire ownership of the Writer. For this reason, we have the @StreamResultWriter annotation. Used in combination with the @TextConsumer annotation, we can remove the need to implement all but the AfterVisitor.visitAfter method:

@TextConsumer
@StreamResultWriter
public class MyVisitor implements AfterVisitor {

    public void visitAfter(Element element, ExecutionContext executionContext) {
        Writer writer = executionContext.getWriter();

        // ... serialize to the writer ...
    }
}

DomSerializer

Smooks provides the DomSerializer class to make serializing of element data, as XML, a little easier. This class allows you to write a SaxNgVisitor implementation like:

@StreamResultWriter
public class MyVisitor implements ElementVisitor {

    private DomSerializer domSerializer = new DomSerializer(true, true);

    @Override
    public void visitBefore(Element element, ExecutionContext executionContext) {
        try {
            domSerializer.writeStartElement(element, executionContext.getWriter());
        } catch (IOException e) {
            throw new SmooksException(e);
        }
    }

    @Override
    public void visitChildText(CharacterData characterData, ExecutionContext executionContext) {
        try {
            domSerializer.writeText(characterData, executionContext.getWriter());
        } catch (IOException e) {
            throw new SmooksException(e);
        }
    }

    @Override
    public void visitChildElement(Element element, ExecutionContext executionContext) throws SmooksException, IOException {
    }

    @Override
    public void visitAfter(Element element, ExecutionContext executionContext) throws SmooksException, IOException {
        try {
            domSerializer.writeEndElement(element, executionContext.getWriter());
        } catch (IOException e) {
            throw new SmooksException(e);
        }
    }
}

You may have noticed that the arguments in the DomSerializer constructor are boolean. These are the closeEmptyElements and rewriteEntities arguments, which should match the corresponding filter settings. Smooks provides a small code optimization/assist here. If you annotate the DomSerializer field with @Inject, Smooks will create the DomSerializer instance and initialize it with the closeEmptyElements and rewriteEntities filter settings for the associated Smooks instance:

@TextConsumer
public class MyVisitor implements AfterVisitor {

    @Inject
    private DomSerializer domSerializer;

    public void visitAfter(Element element, ExecutionContext executionContext) throws SmooksException, IOException {
        try {
            domSerializer.writeStartElement(element, executionContext.getWriter());
            domSerializer.writeText(element, executionContext.getWriter());
            domSerializer.writeEndElement(element, executionContext.getWriter());
        } catch (IOException e) {
            throw new SmooksException(e);
        }
    }
}

Visitor Configuration

SaxNgVisitor configuration works in exactly the same way as any other Smooks component. See Configuring Smooks Components.

The most important thing to note with respect to configuring visitor instances is the fact that the selector attribute is interpreted as an XPath (like) expression. For more on this see the docs on Selectors.

Also note that visitors can be programmatically configured on a Smooks instance. Among other things, this is very useful for unit testing.

Example Visitor Configuration

Let’s assume we have a very simple SaxNgVisitor implementation as follows:

@TextConsumer
public class ChangeItemState implements AfterVisitor {

    @Inject
    private DomSerializer domSerializer;

    @Inject
    private String newState;

    public void visitAfter(Element element, ExecutionContext executionContext) {
        element.setAttribute("state", newState);

        try {
            domSerializer.writeStartElement(element, executionContext.getWriter());
            domSerializer.writeText(element, executionContext.getWriter());
            domSerializer.writeEndElement(element, executionContext.getWriter());
        } catch (IOException e) {
            throw new SmooksException(e);
        }
    }
}

Declaratively configuring ChangeItemState to fire on fragments having a status of "OK" is as simple as:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd">

    <resource-config selector="order-items/order-item[@status = 'OK']">
        <resource>com.acme.ChangeItemState</resource>
        <param name="newState">COMPLETED</param>
    </resource-config>

</smooks-resource-list>

Of course it would be really nice to be able to define a cleaner and more strongly typed configuration for the ChangeItemState component, such that it could be configured something like:

<smooks-resource-list xmlns="https://www.smooks.org/xsd/smooks-2.0.xsd"
                      xmlns:order="http://www.acme.com/schemas/smooks/order.xsd">

    <order:changeItemState itemElement="order-items/order-item[@status = 'OK']" newState="COMPLETED" />

</smooks-resource-list>

For details on this, see the section on Defining Custom Configuration Namespaces.

This visitor could also be programmatically configured on a Smooks instance as follows:

Smooks smooks = new Smooks();

smooks.addVisitor(new ChangeItemState().setNewState("COMPLETED"), "order-items/order-item[@status = 'OK']");

smooks.filterSource(new StreamSource(inReader), new StreamResult(outWriter));

Visitor Instance Lifecycle

One aspect of the visitor lifecycle has already been discussed in the general context of Smooks component initialization and uninitialization.

Smooks supports two additional component lifecycle events, specific to visitor components, via the ExecutionLifecycleCleanable and VisitLifecycleCleanable interfaces.

ExecutionLifecycleCleanable

Visitor components implementing this lifecycle interface will be able to perform post Smooks.filterSource lifecycle operations.

public interface ExecutionLifecycleCleanable extends Visitor {

    void executeExecutionLifecycleCleanup(ExecutionContext executionContext);
}

The basic call sequence can be described as follows (note the executeExecutionLifecycleCleanup calls):

smooks = new Smooks(..);

smooks.filterSource(...);
    ** VisitorXX.executeExecutionLifecycleCleanup **
smooks.filterSource(...);
    ** VisitorXX.executeExecutionLifecycleCleanup **
smooks.filterSource(...);
    ** VisitorXX.executeExecutionLifecycleCleanup **
... etc ...

This lifecycle method allows you to ensure that resources scoped around the Smooks.filterSource execution lifecycle can be cleaned up for the associated ExecutionContext.

VisitLifecycleCleanable

Visitor components implementing this lifecycle interface will be able to perform post AfterVisitor.visitAfter lifecycle operations.

public interface VisitLifecycleCleanable extends Visitor {

    void executeVisitLifecycleCleanup(ExecutionContext executionContext);
}

The basic call sequence can be described as follows (note the executeVisitLifecycleCleanup calls):

smooks.filterSource(...);

    <message>
        <target-fragment>      <--- VisitorXX.visitBefore
            Text!!                       <--- VisitorXX.visitChildText
            <child>                      <--- VisitorXX.visitChildElement
            </child>
        </target-fragment>     <--- VisitorXX.visitAfter
        ** VisitorXX.executeVisitLifecycleCleanup **
        <target-fragment>      <--- VisitorXX.visitBefore
            Text!!                       <--- VisitorXX.visitChildText
            <child>                      <--- VisitorXX.visitChildElement
            </child>
        </target-fragment>     <--- VisitorXX.visitAfter
        ** VisitorXX.executeVisitLifecycleCleanup **
    </message>
    VisitorXX.executeExecutionLifecycleCleanup

smooks.filterSource(...);

    <message>
        <target-fragment>      <--- VisitorXX.visitBefore
            Text!!                       <--- VisitorXX.visitChildText
            <child>                      <--- VisitorXX.visitChildElement
            </child>
        </target-fragment>     <--- VisitorXX.visitAfter
        ** VisitorXX.executeVisitLifecycleCleanup **
        <target-fragment>      <--- VisitorXX.visitBefore
            Text!!                       <--- VisitorXX.visitChildText
            <child>                      <--- VisitorXX.visitChildElement
            </child>
        </target-fragment>     <--- VisitorXX.visitAfter
        ** VisitorXX.executeVisitLifecycleCleanup **
    </message>
    VisitorXX.executeExecutionLifecycleCleanup

This lifecycle method allows you to ensure that resources scoped around a single fragment execution of a SaxNgVisitor implementation can be cleaned up for the associated ExecutionContext.

ExecutionContext

ExecutionContext is scoped specifically around a single execution of a Smooks.filterSource method. All Smooks visitors must be stateless within the context of a single execution. A visitor is created once in Smooks and referenced across multiple concurrent executions of the Smooks.filterSource method. All data stored in an ExecutionContext instance will be lost on completion of the Smooks.filterSource execution. ExecutionContext is a parameter in all visit invocations.

ApplicationContext

ApplicationContext is scoped around the associated Smooks instance: only one ApplicationContext instance exists per Smooks instance. This context object can be used to store data that needs to be maintained (and accessible) across multiple Smooks.filterSource executions. Components (any component, including SaxNgVisitor components) can gain access to their associated ApplicationContext instance by declaring an ApplicationContext class property and annotating it with @Inject:

public class MySmooksResource {

    @Inject
    private ApplicationContext appContext;

    // etc...
}

Community

You can join these groups and chats to discuss and ask Smooks related questions:

Mailing list: googlegroups: smooks-user

Chat room about using Smooks: gitter:smooks/smooks

Issue tracker: github:smooks/smooks

Contributing

Please see the following guidelines if you’d like to contribute code to Smooks.

Download Details:
Author: smooks
Source Code: https://github.com/smooks/smooks
License: View license

#java 

Smooks: Extensible Data integration Java Framework for Building XML

Flutter Plugin for Reading and Validation of Identification Documents

Document Reader Core (Flutter)

Regula Document Reader SDK allows you to read various kinds of identification documents: passports, driving licenses, ID cards, etc. All processing is performed completely offline on your device; no data ever leaves your device.

Documentation

The documentation can be found here.

Demo application

The demo application can be found here: https://github.com/regulaforensics/DocumentReader-Flutter.

Use this package as a library

Depend on it

Run this command:

With Flutter:

 $ flutter pub add flutter_document_reader_core_ocrandmrzrfid

This will add a line like this to your package's pubspec.yaml (and run an implicit flutter pub get):

dependencies:
  flutter_document_reader_core_ocrandmrzrfid: ^6.3.0

Alternatively, your editor might support flutter pub get. Check the docs for your editor to learn more.

Import it

Now in your Dart code, you can use:

import 'package:flutter_document_reader_core_ocrandmrzrfid/flutter_document_reader_core_ocrandmrzrfid.dart'; 

example/lib/main.dart

import 'package:flutter/material.dart';
import 'dart:async';

import 'package:flutter/services.dart';
import 'package:flutter_document_reader_core_ocrandmrzrfid/flutter_document_reader_core_ocrandmrzrfid.dart';

void main() {
  runApp(MyApp());
}

class MyApp extends StatefulWidget {
  @override
  _MyAppState createState() => _MyAppState();
}

class _MyAppState extends State<MyApp> {
  String _platformVersion = 'Unknown';

  @override
  void initState() {
    super.initState();
    initPlatformState();
  }

  // Platform messages are asynchronous, so we initialize in an async method.
  Future<void> initPlatformState() async {
    String platformVersion;
    // Platform messages may fail, so we use a try/catch PlatformException.
    try {
      platformVersion = await FlutterDocumentReaderCore.platformVersion;
    } on PlatformException {
      platformVersion = 'Failed to get platform version.';
    }

    // If the widget was removed from the tree while the asynchronous platform
    // message was in flight, we want to discard the reply rather than calling
    // setState to update our non-existent appearance.
    if (!mounted) return;

    setState(() {
      _platformVersion = platformVersion;
    });
  }

  @override
  Widget build(BuildContext context) {
    return MaterialApp(
      home: Scaffold(
        appBar: AppBar(
          title: const Text('Plugin example app'),
        ),
        body: Center(
          child: Text('Running on: $_platformVersion\n'),
        ),
      ),
    );
  }
} 

Download Details:

Author: regulaforensics

Source Code: https://github.com/regulaforensics/DocumentReader-Flutter

