Gordon  Murray

Gordon Murray

1673466300

How to Loading and Indexing Data In MarkLogic

With MarkLogic being a document-oriented database, data is commonly stored in a JSON or XML document format.

If the data to bring into the MarkLogic is not already structured in JSON or XML means if it is currently in a relational database, there are various ways to export or transform it from the source.

For example, many relational databases provide an option to export relational data in XML or in JSON format, or a SQL script could be written to fetch the data from the database, outputting it in an XML or JSON structure. Or, using Marklogic rows from a .csv file can be imported as XML and JSON documents.

In any case, it is normal to first denormalize the data being exported from the relational database to first put the content back together in its original state. Denormalization, which naturally occurs when working with documents in their original form, greatly reduces the need for joins and acceleration performance.

Schema Agnostic

As we know that schema is something having a set of rules for a particular structure of the database. While we talk about data quality then schemas are helpful as quality matters a lot with quality reliability and a proper actional database is going to present.

Now if we talk about the schema-agnostic then it is something the database is not bounded by any schema but it is aware of it. Schemas are optional in MarkLogic. Data is going to be loaded in its original data form. To address a group of documents within a database, directories, collections and internal structure of documents can be used. With MarkLogic easily supports data from disparate systems all in the same database.

Required Document Size and Structure

When loading a document, it is the best choice to have one document per entity. Marklogic is the most performant with many small documents, rather than one large document. The target document size is 1KB to 100KB but can be larger.

For Example, rather than loading a bunch of students all as one document, have each student be a document.

Whenever defining a document remember that use XML document and attribute names or JSON property names. Make document names human-readable so do not create generic names. Using this convention help indexes be efficient.

<items>

<item>

<product> Mouse </product>

<price> 1000 </price>

<quantity> 3 </quantity>

</item>

<item>

<product> Keyboard </product>

<price> 2000 </price>

<quantity> 2 </quantity>

</item>

</items>

Indexing Documents

As documents are loaded, all the words in each document and the structure of each document, are indexed. So documents are easily searchable.

The document can be loaded into the MarkLogic in many ways:

  • MarkLogic Content Pump.
  • Data movement SDK.
  • Rest APIs
  • Java API or Node js API.
  • XQuery
  • Javascript Functions.

Reading a Document

To read a document, the URI of the document is used.

XQuery Example : fn:doc("college/course-101.json")

JavaScript Example : fn:doc("account/order-202.json")

Rest API Example : curl --abc --user admin:admin: -X GET "http://localhost:8055/v1/document?uri=/accounting/order-10072.json"

Splitting feature of MLCP

MLCP has the feature of splitting the long XML documents, where each occurrence of a designated element becomes an individual XML document in the database. This is useful when multiple records are all contained within one large XML file. Such as a list of students, courses, details, etc.

The -input_file_type aggregates option is used to split a large document into individual documents. The aggregate_record-element option is used element used to designate a new document. The -uri_id is used to create a URI for each document.

While it is fine to have a mix of XML and JSON documents in the same database, it is also possible to transform content from one format to other. You can easily transform the files by following the below steps.

xquery version "1.0-ml";

import module namespace json = "http://abc.com/xdmp/json" at "abc/json/json.xqy";

json:transform-to-json(fn:doc("doc-01.xml"), json:config("custom"))

A Marklogic content pump can be used to import the rows from the .csv file to a MarkLogic database. We are able to the data during the process or afterward in the database. Ways to modify content once it is already in the database include using the data movement SDK, XQuery, Js, etc.

Conclusion

As we know that MarkLogic is a database that facilitates many things like we can load the data, indexing the data, transforming the data, and splitting the data.

References:

Original article source at: https://blog.knoldus.com/

#loading #data #index 

What is GEEK

Buddha Community

How to Loading and Indexing Data In MarkLogic

Dotnet Script: Run C# Scripts From The .NET CLI

dotnet script

Run C# scripts from the .NET CLI, define NuGet packages inline and edit/debug them in VS Code - all of that with full language services support from OmniSharp.

NuGet Packages

NameVersionFramework(s)
dotnet-script (global tool)Nugetnet6.0, net5.0, netcoreapp3.1
Dotnet.Script (CLI as Nuget)Nugetnet6.0, net5.0, netcoreapp3.1
Dotnet.Script.CoreNugetnetcoreapp3.1 , netstandard2.0
Dotnet.Script.DependencyModelNugetnetstandard2.0
Dotnet.Script.DependencyModel.NugetNugetnetstandard2.0

Installing

Prerequisites

The only thing we need to install is .NET Core 3.1 or .NET 5.0 SDK.

.NET Core Global Tool

.NET Core 2.1 introduced the concept of global tools meaning that you can install dotnet-script using nothing but the .NET CLI.

dotnet tool install -g dotnet-script

You can invoke the tool using the following command: dotnet-script
Tool 'dotnet-script' (version '0.22.0') was successfully installed.

The advantage of this approach is that you can use the same command for installation across all platforms. .NET Core SDK also supports viewing a list of installed tools and their uninstallation.

dotnet tool list -g

Package Id         Version      Commands
---------------------------------------------
dotnet-script      0.22.0       dotnet-script
dotnet tool uninstall dotnet-script -g

Tool 'dotnet-script' (version '0.22.0') was successfully uninstalled.

Windows

choco install dotnet.script

We also provide a PowerShell script for installation.

(new-object Net.WebClient).DownloadString("https://raw.githubusercontent.com/filipw/dotnet-script/master/install/install.ps1") | iex

Linux and Mac

curl -s https://raw.githubusercontent.com/filipw/dotnet-script/master/install/install.sh | bash

If permission is denied we can try with sudo

curl -s https://raw.githubusercontent.com/filipw/dotnet-script/master/install/install.sh | sudo bash

Docker

A Dockerfile for running dotnet-script in a Linux container is available. Build:

cd build
docker build -t dotnet-script -f Dockerfile ..

And run:

docker run -it dotnet-script --version

Github

You can manually download all the releases in zip format from the GitHub releases page.

Usage

Our typical helloworld.csx might look like this:

Console.WriteLine("Hello world!");

That is all it takes and we can execute the script. Args are accessible via the global Args array.

dotnet script helloworld.csx

Scaffolding

Simply create a folder somewhere on your system and issue the following command.

dotnet script init

This will create main.csx along with the launch configuration needed to debug the script in VS Code.

.
├── .vscode
│   └── launch.json
├── main.csx
└── omnisharp.json

We can also initialize a folder using a custom filename.

dotnet script init custom.csx

Instead of main.csx which is the default, we now have a file named custom.csx.

.
├── .vscode
│   └── launch.json
├── custom.csx
└── omnisharp.json

Note: Executing dotnet script init inside a folder that already contains one or more script files will not create the main.csx file.

Running scripts

Scripts can be executed directly from the shell as if they were executables.

foo.csx arg1 arg2 arg3

OSX/Linux

Just like all scripts, on OSX/Linux you need to have a #! and mark the file as executable via chmod +x foo.csx. If you use dotnet script init to create your csx it will automatically have the #! directive and be marked as executable.

The OSX/Linux shebang directive should be #!/usr/bin/env dotnet-script

#!/usr/bin/env dotnet-script
Console.WriteLine("Hello world");

You can execute your script using dotnet script or dotnet-script, which allows you to pass arguments to control your script execution more.

foo.csx arg1 arg2 arg3
dotnet script foo.csx -- arg1 arg2 arg3
dotnet-script foo.csx -- arg1 arg2 arg3

Passing arguments to scripts

All arguments after -- are passed to the script in the following way:

dotnet script foo.csx -- arg1 arg2 arg3

Then you can access the arguments in the script context using the global Args collection:

foreach (var arg in Args)
{
    Console.WriteLine(arg);
}

All arguments before -- are processed by dotnet script. For example, the following command-line

dotnet script -d foo.csx -- -d

will pass the -d before -- to dotnet script and enable the debug mode whereas the -d after -- is passed to script for its own interpretation of the argument.

NuGet Packages

dotnet script has built-in support for referencing NuGet packages directly from within the script.

#r "nuget: AutoMapper, 6.1.0"

package

Note: Omnisharp needs to be restarted after adding a new package reference

Package Sources

We can define package sources using a NuGet.Config file in the script root folder. In addition to being used during execution of the script, it will also be used by OmniSharp that provides language services for packages resolved from these package sources.

As an alternative to maintaining a local NuGet.Config file we can define these package sources globally either at the user level or at the computer level as described in Configuring NuGet Behaviour

It is also possible to specify packages sources when executing the script.

dotnet script foo.csx -s https://SomePackageSource

Multiple packages sources can be specified like this:

dotnet script foo.csx -s https://SomePackageSource -s https://AnotherPackageSource

Creating DLLs or Exes from a CSX file

Dotnet-Script can create a standalone executable or DLL for your script.

SwitchLong switchdescription
-o--outputDirectory where the published executable should be placed. Defaults to a 'publish' folder in the current directory.
-n--nameThe name for the generated DLL (executable not supported at this time). Defaults to the name of the script.
 --dllPublish to a .dll instead of an executable.
-c--configurationConfiguration to use for publishing the script [Release/Debug]. Default is "Debug"
-d--debugEnables debug output.
-r--runtimeThe runtime used when publishing the self contained executable. Defaults to your current runtime.

The executable you can run directly independent of dotnet install, while the DLL can be run using the dotnet CLI like this:

dotnet script exec {path_to_dll} -- arg1 arg2

Caching

We provide two types of caching, the dependency cache and the execution cache which is explained in detail below. In order for any of these caches to be enabled, it is required that all NuGet package references are specified using an exact version number. The reason for this constraint is that we need to make sure that we don't execute a script with a stale dependency graph.

Dependency Cache

In order to resolve the dependencies for a script, a dotnet restore is executed under the hood to produce a project.assets.json file from which we can figure out all the dependencies we need to add to the compilation. This is an out-of-process operation and represents a significant overhead to the script execution. So this cache works by looking at all the dependencies specified in the script(s) either in the form of NuGet package references or assembly file references. If these dependencies matches the dependencies from the last script execution, we skip the restore and read the dependencies from the already generated project.assets.json file. If any of the dependencies has changed, we must restore again to obtain the new dependency graph.

Execution cache

In order to execute a script it needs to be compiled first and since that is a CPU and time consuming operation, we make sure that we only compile when the source code has changed. This works by creating a SHA256 hash from all the script files involved in the execution. This hash is written to a temporary location along with the DLL that represents the result of the script compilation. When a script is executed the hash is computed and compared with the hash from the previous compilation. If they match there is no need to recompile and we run from the already compiled DLL. If the hashes don't match, the cache is invalidated and we recompile.

You can override this automatic caching by passing --no-cache flag, which will bypass both caches and cause dependency resolution and script compilation to happen every time we execute the script.

Cache Location

The temporary location used for caches is a sub-directory named dotnet-script under (in order of priority):

  1. The path specified for the value of the environment variable named DOTNET_SCRIPT_CACHE_LOCATION, if defined and value is not empty.
  2. Linux distributions only: $XDG_CACHE_HOME if defined otherwise $HOME/.cache
  3. macOS only: ~/Library/Caches
  4. The value returned by Path.GetTempPath for the platform.

 

Debugging

The days of debugging scripts using Console.WriteLine are over. One major feature of dotnet script is the ability to debug scripts directly in VS Code. Just set a breakpoint anywhere in your script file(s) and hit F5(start debugging)

debug

Script Packages

Script packages are a way of organizing reusable scripts into NuGet packages that can be consumed by other scripts. This means that we now can leverage scripting infrastructure without the need for any kind of bootstrapping.

Creating a script package

A script package is just a regular NuGet package that contains script files inside the content or contentFiles folder.

The following example shows how the scripts are laid out inside the NuGet package according to the standard convention .

└── contentFiles
    └── csx
        └── netstandard2.0
            └── main.csx

This example contains just the main.csx file in the root folder, but packages may have multiple script files either in the root folder or in subfolders below the root folder.

When loading a script package we will look for an entry point script to be loaded. This entry point script is identified by one of the following.

  • A script called main.csx in the root folder
  • A single script file in the root folder

If the entry point script cannot be determined, we will simply load all the scripts files in the package.

The advantage with using an entry point script is that we can control loading other scripts from the package.

Consuming a script package

To consume a script package all we need to do specify the NuGet package in the #loaddirective.

The following example loads the simple-targets package that contains script files to be included in our script.

#load "nuget:simple-targets-csx, 6.0.0"

using static SimpleTargets;
var targets = new TargetDictionary();

targets.Add("default", () => Console.WriteLine("Hello, world!"));

Run(Args, targets);

Note: Debugging also works for script packages so that we can easily step into the scripts that are brought in using the #load directive.

Remote Scripts

Scripts don't actually have to exist locally on the machine. We can also execute scripts that are made available on an http(s) endpoint.

This means that we can create a Gist on Github and execute it just by providing the URL to the Gist.

This Gist contains a script that prints out "Hello World"

We can execute the script like this

dotnet script https://gist.githubusercontent.com/seesharper/5d6859509ea8364a1fdf66bbf5b7923d/raw/0a32bac2c3ea807f9379a38e251d93e39c8131cb/HelloWorld.csx

That is a pretty long URL, so why don't make it a TinyURL like this:

dotnet script https://tinyurl.com/y8cda9zt

Script Location

A pretty common scenario is that we have logic that is relative to the script path. We don't want to require the user to be in a certain directory for these paths to resolve correctly so here is how to provide the script path and the script folder regardless of the current working directory.

public static string GetScriptPath([CallerFilePath] string path = null) => path;
public static string GetScriptFolder([CallerFilePath] string path = null) => Path.GetDirectoryName(path);

Tip: Put these methods as top level methods in a separate script file and #load that file wherever access to the script path and/or folder is needed.

REPL

This release contains a C# REPL (Read-Evaluate-Print-Loop). The REPL mode ("interactive mode") is started by executing dotnet-script without any arguments.

The interactive mode allows you to supply individual C# code blocks and have them executed as soon as you press Enter. The REPL is configured with the same default set of assembly references and using statements as regular CSX script execution.

Basic usage

Once dotnet-script starts you will see a prompt for input. You can start typing C# code there.

~$ dotnet script
> var x = 1;
> x+x
2

If you submit an unterminated expression into the REPL (no ; at the end), it will be evaluated and the result will be serialized using a formatter and printed in the output. This is a bit more interesting than just calling ToString() on the object, because it attempts to capture the actual structure of the object. For example:

~$ dotnet script
> var x = new List<string>();
> x.Add("foo");
> x
List<string>(1) { "foo" }
> x.Add("bar");
> x
List<string>(2) { "foo", "bar" }
>

Inline Nuget packages

REPL also supports inline Nuget packages - meaning the Nuget packages can be installed into the REPL from within the REPL. This is done via our #r and #load from Nuget support and uses identical syntax.

~$ dotnet script
> #r "nuget: Automapper, 6.1.1"
> using AutoMapper;
> typeof(MapperConfiguration)
[AutoMapper.MapperConfiguration]
> #load "nuget: simple-targets-csx, 6.0.0";
> using static SimpleTargets;
> typeof(TargetDictionary)
[Submission#0+SimpleTargets+TargetDictionary]

Multiline mode

Using Roslyn syntax parsing, we also support multiline REPL mode. This means that if you have an uncompleted code block and press Enter, we will automatically enter the multiline mode. The mode is indicated by the * character. This is particularly useful for declaring classes and other more complex constructs.

~$ dotnet script
> class Foo {
* public string Bar {get; set;}
* }
> var foo = new Foo();

REPL commands

Aside from the regular C# script code, you can invoke the following commands (directives) from within the REPL:

CommandDescription
#loadLoad a script into the REPL (same as #load usage in CSX)
#rLoad an assembly into the REPL (same as #r usage in CSX)
#resetReset the REPL back to initial state (without restarting it)
#clsClear the console screen without resetting the REPL state
#exitExits the REPL

Seeding REPL with a script

You can execute a CSX script and, at the end of it, drop yourself into the context of the REPL. This way, the REPL becomes "seeded" with your code - all the classes, methods or variables are available in the REPL context. This is achieved by running a script with an -i flag.

For example, given the following CSX script:

var msg = "Hello World";
Console.WriteLine(msg);

When you run this with the -i flag, Hello World is printed, REPL starts and msg variable is available in the REPL context.

~$ dotnet script foo.csx -i
Hello World
>

You can also seed the REPL from inside the REPL - at any point - by invoking a #load directive pointed at a specific file. For example:

~$ dotnet script
> #load "foo.csx"
Hello World
>

Piping

The following example shows how we can pipe data in and out of a script.

The UpperCase.csx script simply converts the standard input to upper case and writes it back out to standard output.

using (var streamReader = new StreamReader(Console.OpenStandardInput()))
{
    Write(streamReader.ReadToEnd().ToUpper());
}

We can now simply pipe the output from one command into our script like this.

echo "This is some text" | dotnet script UpperCase.csx
THIS IS SOME TEXT

Debugging

The first thing we need to do add the following to the launch.config file that allows VS Code to debug a running process.

{
    "name": ".NET Core Attach",
    "type": "coreclr",
    "request": "attach",
    "processId": "${command:pickProcess}"
}

To debug this script we need a way to attach the debugger in VS Code and the simplest thing we can do here is to wait for the debugger to attach by adding this method somewhere.

public static void WaitForDebugger()
{
    Console.WriteLine("Attach Debugger (VS Code)");
    while(!Debugger.IsAttached)
    {
    }
}

To debug the script when executing it from the command line we can do something like

WaitForDebugger();
using (var streamReader = new StreamReader(Console.OpenStandardInput()))
{
    Write(streamReader.ReadToEnd().ToUpper()); // <- SET BREAKPOINT HERE
}

Now when we run the script from the command line we will get

$ echo "This is some text" | dotnet script UpperCase.csx
Attach Debugger (VS Code)

This now gives us a chance to attach the debugger before stepping into the script and from VS Code, select the .NET Core Attach debugger and pick the process that represents the executing script.

Once that is done we should see our breakpoint being hit.

Configuration(Debug/Release)

By default, scripts will be compiled using the debug configuration. This is to ensure that we can debug a script in VS Code as well as attaching a debugger for long running scripts.

There are however situations where we might need to execute a script that is compiled with the release configuration. For instance, running benchmarks using BenchmarkDotNet is not possible unless the script is compiled with the release configuration.

We can specify this when executing the script.

dotnet script foo.csx -c release

 

Nullable reference types

Starting from version 0.50.0, dotnet-script supports .Net Core 3.0 and all the C# 8 features. The way we deal with nullable references types in dotnet-script is that we turn every warning related to nullable reference types into compiler errors. This means every warning between CS8600 and CS8655 are treated as an error when compiling the script.

Nullable references types are turned off by default and the way we enable it is using the #nullable enable compiler directive. This means that existing scripts will continue to work, but we can now opt-in on this new feature.

#!/usr/bin/env dotnet-script

#nullable enable

string name = null;

Trying to execute the script will result in the following error

main.csx(5,15): error CS8625: Cannot convert null literal to non-nullable reference type.

We will also see this when working with scripts in VS Code under the problems panel.

image

Download Details:
Author: filipw
Source Code: https://github.com/filipw/dotnet-script
License: MIT License

#dotnet  #aspdotnet  #csharp 

 iOS App Dev

iOS App Dev

1620466520

Your Data Architecture: Simple Best Practices for Your Data Strategy

If you accumulate data on which you base your decision-making as an organization, you should probably think about your data architecture and possible best practices.

If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.

In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points from our checklist for what we perceive to be an anticipatory analytics ecosystem.

#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition

Gerhard  Brink

Gerhard Brink

1620629020

Getting Started With Data Lakes

Frameworks for Efficient Enterprise Analytics

The opportunities big data offers also come with very real challenges that many organizations are facing today. Often, it’s finding the most cost-effective, scalable way to store and process boundless volumes of data in multiple formats that come from a growing number of sources. Then organizations need the analytical capabilities and flexibility to turn this data into insights that can meet their specific business objectives.

This Refcard dives into how a data lake helps tackle these challenges at both ends — from its enhanced architecture that’s designed for efficient data ingestion, storage, and management to its advanced analytics functionality and performance flexibility. You’ll also explore key benefits and common use cases.

Introduction

As technology continues to evolve with new data sources, such as IoT sensors and social media churning out large volumes of data, there has never been a better time to discuss the possibilities and challenges of managing such data for varying analytical insights. In this Refcard, we dig deep into how data lakes solve the problem of storing and processing enormous amounts of data. While doing so, we also explore the benefits of data lakes, their use cases, and how they differ from data warehouses (DWHs).


This is a preview of the Getting Started With Data Lakes Refcard. To read the entire Refcard, please download the PDF from the link above.

#big data #data analytics #data analysis #business analytics #data warehouse #data storage #data lake #data lake architecture #data lake governance #data lake management

Cyrus  Kreiger

Cyrus Kreiger

1618039260

How Has COVID-19 Impacted Data Science?

The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In turn, businesses need access to accurate, timely data more than ever before. As a result, the demand for data analytics is skyrocketing as businesses try to navigate an uncertain future. However, the sudden surge in demand comes with its own set of challenges.

Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.

#big data #data #data analysis #data security #data integration #etl #data warehouse #data breach #elt

Macey  Kling

Macey Kling

1597579680

Applications Of Data Science On 3D Imagery Data

CVDC 2020, the Computer Vision conference of the year, is scheduled for 13th and 14th of August to bring together the leading experts on Computer Vision from around the world. Organised by the Association of Data Scientists (ADaSCi), the premier global professional body of data science and machine learning professionals, it is a first-of-its-kind virtual conference on Computer Vision.

The second day of the conference started with quite an informative talk on the current pandemic situation. Speaking of talks, the second session “Application of Data Science Algorithms on 3D Imagery Data” was presented by Ramana M, who is the Principal Data Scientist in Analytics at Cyient Ltd.

Ramana talked about one of the most important assets of organisations, data and how the digital world is moving from using 2D data to 3D data for highly accurate information along with realistic user experiences.

The agenda of the talk included an introduction to 3D data, its applications and case studies, 3D data alignment, 3D data for object detection and two general case studies, which are-

  • Industrial metrology for quality assurance.
  • 3d object detection and its volumetric analysis.

This talk discussed the recent advances in 3D data processing, feature extraction methods, object type detection, object segmentation, and object measurements in different body cross-sections. It also covered the 3D imagery concepts, the various algorithms for faster data processing on the GPU environment, and the application of deep learning techniques for object detection and segmentation.

#developers corner #3d data #3d data alignment #applications of data science on 3d imagery data #computer vision #cvdc 2020 #deep learning techniques for 3d data #mesh data #point cloud data #uav data