Olen  Predovic

Olen Predovic


Interpreting Black-Box ML Models using LIME

It is almost trite at this point for anyone to espouse the potential of machine learning in the medical field. There are a plethora of examples to support this claim — one would be Microsoft’s use of medical imaging data in helping clinicians and radiologists make an accurate cancer diagnosis. Simultaneously, the development of sophisticated AI algorithms has drastically improved the accuracy of such diagnoses. Undoubtedly, such amazing applications of medical data and one has all the good reasons to be excited about its benefits.

However, such cutting-edge algorithms are black boxes that might be difficult, if not impossible, to interpret. One example of a black-box model is the deep neural network, where a single decision is made after the input data passes through millions of neurons in the network. Such black-box models do not allow clinicians to verify the models’ diagnosis with their prior knowledge and experiences, making the model-based diagnosis less trustworthy.

In fact, a recent survey of radiologists in Europe paints a realistic picture of the use of black-box models in radiology. The survey shows that only 55.4% of the clinicians think that patients will not accept purely AI-based application without the supervision of a physician. [1]

Image for post

#explainable-ai #cancer #machine-learning #interpretability #data-science

What is GEEK

Buddha Community

Interpreting Black-Box ML Models using LIME
Chloe  Butler

Chloe Butler


Pdf2gerb: Perl Script Converts PDF Files to Gerber format


Perl script converts PDF files to Gerber format

Pdf2Gerb generates Gerber 274X photoplotting and Excellon drill files from PDFs of a PCB. Up to three PDFs are used: the top copper layer, the bottom copper layer (for 2-sided PCBs), and an optional silk screen layer. The PDFs can be created directly from any PDF drawing software, or a PDF print driver can be used to capture the Print output if the drawing software does not directly support output to PDF.

The general workflow is as follows:

  1. Design the PCB using your favorite CAD or drawing software.
  2. Print the top and bottom copper and top silk screen layers to a PDF file.
  3. Run Pdf2Gerb on the PDFs to create Gerber and Excellon files.
  4. Use a Gerber viewer to double-check the output against the original PCB design.
  5. Make adjustments as needed.
  6. Submit the files to a PCB manufacturer.

Please note that Pdf2Gerb does NOT perform DRC (Design Rule Checks), as these will vary according to individual PCB manufacturer conventions and capabilities. Also note that Pdf2Gerb is not perfect, so the output files must always be checked before submitting them. As of version 1.6, Pdf2Gerb supports most PCB elements, such as round and square pads, round holes, traces, SMD pads, ground planes, no-fill areas, and panelization. However, because it interprets the graphical output of a Print function, there are limitations in what it can recognize (or there may be bugs).

See docs/Pdf2Gerb.pdf for install/setup, config, usage, and other info.


#Pdf2Gerb config settings:
#Put this file in same folder/directory as pdf2gerb.pl itself (global settings),
#or copy to another folder/directory with PDFs if you want PCB-specific settings.
#There is only one user of this file, so we don't need a custom package or namespace.
#NOTE: all constants defined in here will be added to main namespace.
#package pdf2gerb_cfg;

use strict; #trap undef vars (easier debug)
use warnings; #other useful info (easier debug)

#configurable settings:
#change values here instead of in main pfg2gerb.pl file

use constant WANT_COLORS => ($^O !~ m/Win/); #ANSI colors no worky on Windows? this must be set < first DebugPrint() call

#just a little warning; set realistic expectations:
#DebugPrint("${\(CYAN)}Pdf2Gerb.pl ${\(VERSION)}, $^O O/S\n${\(YELLOW)}${\(BOLD)}${\(ITALIC)}This is EXPERIMENTAL software.  \nGerber files MAY CONTAIN ERRORS.  Please CHECK them before fabrication!${\(RESET)}", 0); #if WANT_DEBUG

use constant METRIC => FALSE; #set to TRUE for metric units (only affect final numbers in output files, not internal arithmetic)
use constant APERTURE_LIMIT => 0; #34; #max #apertures to use; generate warnings if too many apertures are used (0 to not check)
use constant DRILL_FMT => '2.4'; #'2.3'; #'2.4' is the default for PCB fab; change to '2.3' for CNC

use constant WANT_DEBUG => 0; #10; #level of debug wanted; higher == more, lower == less, 0 == none
use constant GERBER_DEBUG => 0; #level of debug to include in Gerber file; DON'T USE FOR FABRICATION
use constant WANT_STREAMS => FALSE; #TRUE; #save decompressed streams to files (for debug)
use constant WANT_ALLINPUT => FALSE; #TRUE; #save entire input stream (for debug ONLY)

#DebugPrint(sprintf("${\(CYAN)}DEBUG: stdout %d, gerber %d, want streams? %d, all input? %d, O/S: $^O, Perl: $]${\(RESET)}\n", WANT_DEBUG, GERBER_DEBUG, WANT_STREAMS, WANT_ALLINPUT), 1);
#DebugPrint(sprintf("max int = %d, min int = %d\n", MAXINT, MININT), 1); 

#define standard trace and pad sizes to reduce scaling or PDF rendering errors:
#This avoids weird aperture settings and replaces them with more standardized values.
#(I'm not sure how photoplotters handle strange sizes).
#Fewer choices here gives more accurate mapping in the final Gerber files.
#units are in inches
use constant TOOL_SIZES => #add more as desired
#round or square pads (> 0) and drills (< 0):
    .010, -.001,  #tiny pads for SMD; dummy drill size (too small for practical use, but needed so StandardTool will use this entry)
    .031, -.014,  #used for vias
    .041, -.020,  #smallest non-filled plated hole
    .051, -.025,
    .056, -.029,  #useful for IC pins
    .070, -.033,
    .075, -.040,  #heavier leads
#    .090, -.043,  #NOTE: 600 dpi is not high enough resolution to reliably distinguish between .043" and .046", so choose 1 of the 2 here
    .100, -.046,
    .115, -.052,
    .130, -.061,
    .140, -.067,
    .150, -.079,
    .175, -.088,
    .190, -.093,
    .200, -.100,
    .220, -.110,
    .160, -.125,  #useful for mounting holes
#some additional pad sizes without holes (repeat a previous hole size if you just want the pad size):
    .090, -.040,  #want a .090 pad option, but use dummy hole size
    .065, -.040, #.065 x .065 rect pad
    .035, -.040, #.035 x .065 rect pad
    .001,  #too thin for real traces; use only for board outlines
    .006,  #minimum real trace width; mainly used for text
    .008,  #mainly used for mid-sized text, not traces
    .010,  #minimum recommended trace width for low-current signals
    .015,  #moderate low-voltage current
    .020,  #heavier trace for power, ground (even if a lighter one is adequate)
    .030,  #heavy-current traces; be careful with these ones!
#Areas larger than the values below will be filled with parallel lines:
#This cuts down on the number of aperture sizes used.
#Set to 0 to always use an aperture or drill, regardless of size.
use constant { MAX_APERTURE => max((TOOL_SIZES)) + .004, MAX_DRILL => -min((TOOL_SIZES)) + .004 }; #max aperture and drill sizes (plus a little tolerance)
#DebugPrint(sprintf("using %d standard tool sizes: %s, max aper %.3f, max drill %.3f\n", scalar((TOOL_SIZES)), join(", ", (TOOL_SIZES)), MAX_APERTURE, MAX_DRILL), 1);

#NOTE: Compare the PDF to the original CAD file to check the accuracy of the PDF rendering and parsing!
#for example, the CAD software I used generated the following circles for holes:
#CAD hole size:   parsed PDF diameter:      error:
#  .014                .016                +.002
#  .020                .02267              +.00267
#  .025                .026                +.001
#  .029                .03167              +.00267
#  .033                .036                +.003
#  .040                .04267              +.00267
#This was usually ~ .002" - .003" too big compared to the hole as displayed in the CAD software.
#To compensate for PDF rendering errors (either during CAD Print function or PDF parsing logic), adjust the values below as needed.
#units are pixels; for example, a value of 2.4 at 600 dpi = .0004 inch, 2 at 600 dpi = .0033"
use constant
    HOLE_ADJUST => -0.004 * 600, #-2.6, #holes seemed to be slightly oversized (by .002" - .004"), so shrink them a little
    RNDPAD_ADJUST => -0.003 * 600, #-2, #-2.4, #round pads seemed to be slightly oversized, so shrink them a little
    SQRPAD_ADJUST => +0.001 * 600, #+.5, #square pads are sometimes too small by .00067, so bump them up a little
    RECTPAD_ADJUST => 0, #(pixels) rectangular pads seem to be okay? (not tested much)
    TRACE_ADJUST => 0, #(pixels) traces seemed to be okay?
    REDUCE_TOLERANCE => .001, #(inches) allow this much variation when reducing circles and rects

#Also, my CAD's Print function or the PDF print driver I used was a little off for circles, so define some additional adjustment values here:
#Values are added to X/Y coordinates; units are pixels; for example, a value of 1 at 600 dpi would be ~= .002 inch
use constant
    CIRCLE_ADJUST_MINY => -0.001 * 600, #-1, #circles were a little too high, so nudge them a little lower
    CIRCLE_ADJUST_MAXX => +0.001 * 600, #+1, #circles were a little too far to the left, so nudge them a little to the right
    SUBST_CIRCLE_CLIPRECT => FALSE, #generate circle and substitute for clip rects (to compensate for the way some CAD software draws circles)
    WANT_CLIPRECT => TRUE, #FALSE, #AI doesn't need clip rect at all? should be on normally?
    RECT_COMPLETION => FALSE, #TRUE, #fill in 4th side of rect when 3 sides found

#allow .012 clearance around pads for solder mask:
#This value effectively adjusts pad sizes in the TOOL_SIZES list above (only for solder mask layers).
use constant SOLDER_MARGIN => +.012; #units are inches

#line join/cap styles:
use constant
    CAP_NONE => 0, #butt (none); line is exact length
    CAP_ROUND => 1, #round cap/join; line overhangs by a semi-circle at either end
    CAP_SQUARE => 2, #square cap/join; line overhangs by a half square on either end
    CAP_OVERRIDE => FALSE, #cap style overrides drawing logic
#number of elements in each shape type:
use constant
    RECT_SHAPELEN => 6, #x0, y0, x1, y1, count, "rect" (start, end corners)
    LINE_SHAPELEN => 6, #x0, y0, x1, y1, count, "line" (line seg)
    CURVE_SHAPELEN => 10, #xstart, ystart, x0, y0, x1, y1, xend, yend, count, "curve" (bezier 2 points)
    CIRCLE_SHAPELEN => 5, #x, y, 5, count, "circle" (center + radius)
#const my %SHAPELEN =
#Readonly my %SHAPELEN =>
    rect => RECT_SHAPELEN,
    line => LINE_SHAPELEN,
    curve => CURVE_SHAPELEN,
    circle => CIRCLE_SHAPELEN,

#This will repeat the entire body the number of times indicated along the X or Y axes (files grow accordingly).
#Display elements that overhang PCB boundary can be squashed or left as-is (typically text or other silk screen markings).
#Set "overhangs" TRUE to allow overhangs, FALSE to truncate them.
#xpad and ypad allow margins to be added around outer edge of panelized PCB.
use constant PANELIZE => {'x' => 1, 'y' => 1, 'xpad' => 0, 'ypad' => 0, 'overhangs' => TRUE}; #number of times to repeat in X and Y directions

# Set this to 1 if you need TurboCAD support.
#$turboCAD = FALSE; #is this still needed as an option?

#CIRCAD pad generation uses an appropriate aperture, then moves it (stroke) "a little" - we use this to find pads and distinguish them from PCB holes. 
use constant PAD_STROKE => 0.3; #0.0005 * 600; #units are pixels
#convert very short traces to pads or holes:
use constant TRACE_MINLEN => .001; #units are inches
#use constant ALWAYS_XY => TRUE; #FALSE; #force XY even if X or Y doesn't change; NOTE: needs to be TRUE for all pads to show in FlatCAM and ViewPlot
use constant REMOVE_POLARITY => FALSE; #TRUE; #set to remove subtractive (negative) polarity; NOTE: must be FALSE for ground planes

#PDF uses "points", each point = 1/72 inch
#combined with a PDF scale factor of .12, this gives 600 dpi resolution (1/72 * .12 = 600 dpi)
use constant INCHES_PER_POINT => 1/72; #0.0138888889; #multiply point-size by this to get inches

# The precision used when computing a bezier curve. Higher numbers are more precise but slower (and generate larger files).
#$bezierPrecision = 100;
use constant BEZIER_PRECISION => 36; #100; #use const; reduced for faster rendering (mainly used for silk screen and thermal pads)

# Ground planes and silk screen or larger copper rectangles or circles are filled line-by-line using this resolution.
use constant FILL_WIDTH => .01; #fill at most 0.01 inch at a time

# The max number of characters to read into memory
use constant MAX_BYTES => 10 * M; #bumped up to 10 MB, use const

use constant DUP_DRILL1 => TRUE; #FALSE; #kludge: ViewPlot doesn't load drill files that are too small so duplicate first tool

my $runtime = time(); #Time::HiRes::gettimeofday(); #measure my execution time

print STDERR "Loaded config settings from '${\(__FILE__)}'.\n";
1; #last value must be truthful to indicate successful load


#use Package::Constants;
#use Exporter qw(import); #https://perldoc.perl.org/Exporter.html

#my $caller = "pdf2gerb::";

#sub cfg
#    my $proto = shift;
#    my $class = ref($proto) || $proto;
#    my $settings =
#    {
#        $WANT_DEBUG => 990, #10; #level of debug wanted; higher == more, lower == less, 0 == none
#    };
#    bless($settings, $class);
#    return $settings;

#use constant HELLO => "hi there2"; #"main::HELLO" => "hi there";
#use constant GOODBYE => 14; #"main::GOODBYE" => 12;

#print STDERR "read cfg file\n";

#our @EXPORT_OK = Package::Constants->list(__PACKAGE__); #https://www.perlmonks.org/?node_id=1072691; NOTE: "_OK" skips short/common names

#print STDERR scalar(@EXPORT_OK) . " consts exported:\n";
#foreach(@EXPORT_OK) { print STDERR "$_\n"; }
#my $val = main::thing("xyz");
#print STDERR "caller gave me $val\n";
#foreach my $arg (@ARGV) { print STDERR "arg $arg\n"; }

Download Details:

Author: swannman
Source Code: https://github.com/swannman/pdf2gerb

License: GPL-3.0 license


Uncovering the Magic: interpreting Machine Learning black-box models

**The trade-off between predictive power and interpretability **is a common issue to face when working with black-box models, especially in business environments where results have to be explained to non-technical audiences. Interpretability is crucial to being able to question, understand, and trust AI and ML systems. It also provides data scientists and engineers better means for debugging models and ensuring that they are working as intended.

Image for post


This tutorial aims to present different techniques for approaching model interpretation in black-box models.

_Disclaimer: _this article seeks to introduce some useful techniques from the field of interpretable machine learning to the average data scientists and to motivate its adoption . Most of them have been summarized from this highly recommendable book from Christoph Molnar: Interpretable Machine Learning.

The entire code used in this article can be found in my GitHub


  1. Taxonomy of Interpretability Methods
  2. Dataset and Model Training
  3. Global Importance
  4. Local Importance

1. Taxonomy of Interpretability Methods

  • **Intrinsic or Post-Hoc? **This criteria distinguishes whether interpretability is achieved by restricting the complexity of the machine learning model (intrinsic) or by applying methods that analyze the model after training (post-hoc).
  • **Model-Specific or Model-Agnostic? **Linear models have a model-specific interpretation, since the interpretation of the regression weights are specific to that sort of models. Similarly, decision trees splits have their own specific interpretation. Model-agnostic tools, on the other hand, can be used on any machine learning model and are applied after the model has been trained (post-hoc).
  • **Local or Global? **Local interpretability refers to explaining an individual prediction, whereas global interpretability is related to explaining the model general behavior in the prediction task. Both types of interpretations are important and there are different tools for addressing each of them.

2. Dataset and Model Training

The dataset used for this article is the Adult Census Income from UCI Machine Learning Repository. The prediction task is to determine whether a person makes over $50K a year.

Since the focus of this article is not centered in the modelling phase of the ML pipeline, minimum feature engineering was performed in order to model the data with an XGBoost.

The performance metrics obtained for the model are the following:

Image for post

Fig. 1: Receiving Operating Characteristic (ROC) curves for Train and Test sets.

Image for post

Fig. 2: XGBoost performance metrics

The model’s performance seems to be pretty acceptable.

3. Global Importance

The techniques used to evaluate the global behavior of the model will be:

3.1 - Feature Importance (evaluated by the XGBoost model and by SHAP)

3.2 - Summary Plot (SHAP)

3.3 - Permutation Importance (ELI5)

3.4 - Partial Dependence Plot (PDPBox and SHAP)

3.5 - Global Surrogate Model (Decision Tree and Logistic Regression)

3.1 - Feature Importance

  • XGBoost (model-specific)
feat_importances = pd.Series(clf_xgb_df.feature_importances_, index=X_train.columns).sort_values(ascending=True)

Image for post

Fig. 3: XGBoost Feature Importance

When working with XGBoost, one must be careful when interpreting features importances, since the results might be misleading. This is because the model calculates several importance metrics, with different interpretations. It creates an importance matrix, which is a table with the first column including the names of all the features actually used in the boosted trees, and the other with the resulting ‘importance’ values calculated with different metrics (Gain, Cover, Frequence). A more thourough explanation of these can be found here.

The **Gain **is the most relevant attribute to interpret the relative importance (i.e. improvement in accuracy) of each feature.

  • SHAP

In general, SHAP library is considered to be a model-agnostic tool for addressing interpretability (we will cover SHAP’s intuition in the Local Importance section). However, the library has a model-specific method for tree-based machine learning models such as decision trees, random forests and gradient boosted trees.

explainer = shap.TreeExplainer(clf_xgb_df)
shap_values = explainer.shap_values(X_test)

shap.summary_plot(shap_values, X_test, plot_type = 'bar')

Image for post

Fig. 4: SHAP Feature Importance

The XGBoost feature importance was used to evaluate the relevance of the predictors in the model’s outputs for the Train dataset and the SHAP one to evaluate it for Test dataset, in order to assess if the most important features were similar in both approaches and sets.

It is observed that the most important variables of the model are maintained, although in different order of importance (age seems to take much more relevance in the test set by SHAP approach).

3.2 Summary Plot (SHAP)

The SHAP Summary Plot is a very interesting plot to evaluate the features of the model, since it provides more information than the traditional Feature Importance:

  • Feature Importance: variables are sorted in descending order of importance.
  • Impact on Prediction: the position on the horizontal axis indicates whether the values of the dataset instances for each feature have more or less impact on the output of the model.
  • Original Value: the color indicates, for each feature, whether it is a high or low value (in the range of each of the feature).
  • Correlation: the correlation of a feature with the model output can be analyzed by evaluating its color (its range of values) and the impact on the horizontal axis. For example, it is observed that the age has a positive correlation with the target, since the impact on the output increases as the value of the feature increases.
shap.summary_plot(shap_values, X_test)

Image for post

Fig. 5: SHAP Summary Plot

3.3 - Permutation Importance (ELI5)

Another way to assess the global importance of the predictors is to randomly permute the order of the instances for each feature in the dataset and predict with the trained model. If by doing this disturbance in the order, the evaluation metric does not change substantially, then the feature is not so relevant. If instead the evaluation metric is affected, then the feature is considered important in the model. This process is done individually for each feature.

To evaluate the trained XGBoost model, the Area Under the Curve (AUC) of the ROC Curve will be used as the performance metric. Permutation Importance will be analyzed in both Train and Test:

# Train
perm = PermutationImportance(clf_xgb_df, scoring = 'roc_auc', random_state=1984).fit(X_train, y_train)
eli5.show_weights(perm, feature_names = X_train.columns.tolist())

# Test
perm = PermutationImportance(clf_xgb_df, scoring = 'roc_auc', random_state=1984).fit(X_test, y_test)
eli5.show_weights(perm, feature_names = X_test.columns.tolist())

Image for post

Fig. 6: Permutation Importance for Train and Test sets.

Even though the order of the most important features changes, it looks like that the most relevant ones remain the same. It is interesting to note that, unlike the XGBoost Feature Importance, the age variable in the Train set has a fairly strong effect (as showed by SHAP Feature Importance in the Test set). Furthermore, the 6 most important variables according to the Permutation Importance are kept in Train and Test (the difference in order may be due to the distribution of each sample).

The coherence between the different approaches to approximate the global importance generates more confidence in the interpretation of the model’s output.

#model-interpretability #model-fairness #interpretability #machine-learning #shapley-values #deep learning

Display Boxes

Display Boxes


Custom Printed Soap Boxes | Wholesale Soap Packaging | Custom Packaging Pro

You have a great quality product in your hand, and now you are thinking of ways to represent it to the customers in a most appealing way. Many brands are afraid of spending money on custom boxes without knowing that it is the most cost-effecting way with various benefits. When designed with proper planning and creativity, custom packaging can result in money. You shouldn’t mind spending a little more to save money in the long run. You may disagree with us, but investment in customization and personalization can open new doors of success for your business.

How Custom Boxes for Soap Are a Cost-Effective Solution?

From our kitchen cabinets to the big supermarkets, custom boxes are everywhere. We are surrounded by customization, even if it is about selling a small cosmetic product. When you deal in the soap market, you have to face a lot of competition. You can’t compete in the market with a strategy of using plain cardboard boxes. Gone are the days when you can use a simple packaging solution. Today is the age of customization, and you have to use custom Boxes for Soap. These are not only affordable but also look good on the shelves and provide an ultimate customer experience. Let’s take a deeper look at how custom packaging is a cost-effective solution with several benefits.

Helps To Reduce Shipping Prices

During this pandemic, the e-commerce industry has grown too fast, and every brand is selling its products online. One thing which is keeping the manufacturer is the high shipping price. But to make the right decision, you need to understand how shipping prices work. It is not only about the weight of the box; consider dimensional weight as well. Even if you are shipping a small item like soap, the box size can increase the shipping cost. So, the firsts step towards price reduction is using the box which is according to the product size and also light in weight.

Maximum Durability Ensures Protection

It is one of the most faced issues which brands face. Damaged and broken products always result in returns which ultimately means additional cost. No brand will ever want to face negative reviews and customer backlash. You can avoid it by using durable and sturdy boxes. When it comes to protection, corrugate and cardboard soap boxers can outclass every other solution. These two materials are quite affordable and readily available in the market. Once again, it is highly recommended to use the right box size to avoid damage and returns. You can also use other packaging materials for added protection.

Increase Your Visibility with Pillow Boxes

When it comes to increasing your visibility and exposure in retail stores, there is no better option other than custom containers. These come in a variety of shapes and styles, which increase the customer’s interest in your product. Custom Pillow Boxes are the right choice to present your soap products on the shelves. You can customize the pillow packaging further with other customization options like window patching, lamination, and gold stamping. Unique solutions always make customers take a closer look at the product, and most probably they will end up buying it.

Right Box Size Help to Save Big

We have mentioned it before, but it needs more repetition. A wrong size box will always cost you more. If you are thinking that you can end up saving by choosing the standard size boxes for all the products, you are wrong. It will cost you more in form of damaged products and returns. Moreover, the bigger will be the size of the box, the more you have to pay for shipping. So, always choose the right size, which suits the product dimensions, and don’t leave too much space in the containers. A bigger size box will also make you use the protective material.

Cardboard And Paperboard Are the Cheapest Options

If you still think that custom boxes are way out of your range, we have still so many options for you. Take a simple corrugate or cardboard box in white or any plain color, print your logo on it, and you have a custom box in your hand. Getting a custom solution for your soap products has never been so easy. Today is the age of minimalism, and you can take advantage of this trend. Use simple customization for a natural and minimal look. Cardboard and corrugated are the most affordable option when it comes to custom material.

Happy Customers Result in Repeat Business and Positive Reviews

When it comes to benefits with custom packaging, there are several benefits that you can get. From product protection to the customer experience, you will get everything with a custom solution. The biggest benefits of providing a personalized experience are repeat business and positive reviews from the customers. When you put your heart and effort into the customization, customers will reward you with positive feedback. A good review from the customers will attract more customers to your business. Satisfied customers will bring business with repeat business and higher brand recall.

Kraft Boxes for Display Improve Sustainability

When it comes to being sustainable, there is no better option than Kraft. Use Kraft packaging boxes to display your products on the shelves. It will attract more and more eco-conscious customers, which will ultimately result in boosted sales. It is not only a cheap option but offers 100% recyclability. The presentation and display of your product have a greater role to play in drawing the attraction. Using the same old-style packaging will not going to help you out. Think of something innovative and try using Kraft boxes for better results. Find a solution that is not unique but meets the customer’s needs.

When it comes to soap packaging, Kraft Boxes for Display are a perfect choice. These are not only cost-efficient but result in customer satisfaction, repeat business, and reduced shipping cost. Find a solution that meets your needs without breaking the budget.

#boxes for soap #boxes for pillow #boxes for display #soap boxes #pillow boxes #display boxes

Olen  Predovic

Olen Predovic


Interpreting Black-Box ML Models using LIME

It is almost trite at this point for anyone to espouse the potential of machine learning in the medical field. There are a plethora of examples to support this claim — one would be Microsoft’s use of medical imaging data in helping clinicians and radiologists make an accurate cancer diagnosis. Simultaneously, the development of sophisticated AI algorithms has drastically improved the accuracy of such diagnoses. Undoubtedly, such amazing applications of medical data and one has all the good reasons to be excited about its benefits.

However, such cutting-edge algorithms are black boxes that might be difficult, if not impossible, to interpret. One example of a black-box model is the deep neural network, where a single decision is made after the input data passes through millions of neurons in the network. Such black-box models do not allow clinicians to verify the models’ diagnosis with their prior knowledge and experiences, making the model-based diagnosis less trustworthy.

In fact, a recent survey of radiologists in Europe paints a realistic picture of the use of black-box models in radiology. The survey shows that only 55.4% of the clinicians think that patients will not accept purely AI-based application without the supervision of a physician. [1]

Image for post

#explainable-ai #cancer #machine-learning #interpretability #data-science

Tia  Gottlieb

Tia Gottlieb


Opening black-box models with LIME — Beauty and the beast

Despite a booming adoption of machine learning in making critical decisions, many machine learning models remain black boxes. Here trust and ethical challenges of using machine learning in making decisions come into the picture. Because the real-world consequences of wrong prediction can be expensive. In 2017, a lawsuit was filed against three US-based companies responsible for designing and developing an automated computer system which was used by Michigan government unemployment agency and made 20,000 false fraud accusations. A good boss will also question his data team why the company should be making these critical business decisions. In order not to blindly trust machine learning models, the need for understanding model’s behaviors, the decisions it makes, and any associated potential pitfalls is high on the agenda.

Model interpretability — Post hoc explanation technique

A commonly accepted phenomenon in data science community and literature (Johansson, Ulf, et al., 2011; Choi, Edward, et al., 2016, etc.) is that there is a trade-off between model accuracy and interpretability. At times

when you have a very large potentially complicated dataset, the use of complex models such as deep neural nets and random forests often achieve higher performance, but lower interpretability. In contrast, simpler models, such as linear or logistic regression just to name a few, often offer lower performance but higher interpretability. This is where post hoc explanation techniques emerge and become a game changer in machine learning literature.

Traditional statistical methods use a hypothesis-driven approach, which means that we construct the assumptions and verify hypotheses by training the models. Post hoc techniques, on the contrary, are set-up techniques which are applied “after the event”- after model training. So the idea people come up with is to first build complex models for making decisions and then explain them using simpler models. In other words, constructing complex model first and assumptions will come later. Some examples of post hoc explanation techniques include LIME, SHAP, Anchor, MUSE amongst others. My point of discussion in this article will be explaining what LIME is and how it works using a step-by-step guide with Python codes as well as the good and the ugly associated with LIME.

So what the heck is LIME and how does it work?

LIME is the abbreviation for Local Interpretable Model-agnostic Explanations. In essence, LIME is Model-agnostic, meaning that it can be applied to ANY models based on the assumption that every complex model is a black-box. By interpretable manner, it implies that LIME will be able to explain how the model behaves, which features it picks up and what kinds of interactions between them take place to drive the predictions. Last but certainly not least, LIME is observation specific, meaning that it tries to understand features that influence the black-box model around a single instance of interest. The rationale behind LIME is that as global model interpretability is very difficult to achieve in reality, it is far simpler to approximate a black-box model by a simple model locally.

In practice, the decision boundary of a black-box model can look very complex. In the figure below, the decision boundary of a black-box model is represented by the blue-pink background, which seems too complex to be accurately approximated by a linear model.

Image for post

Decision boundary of blackbox model.

However, studies (Baehrens, David, et al. (2010), Laugel et al., 2018 etc.) have shown that when you zoom in and look at a small enough neighborhood (bear with me, I will discuss what it means a meaningful neighborhood in the next session), no matter how complex the model is at a global level, the decision boundary at a local neighborhood can be a much more simpler, can in fact be even linear. For instance, the probability of someone having a cancer may be nonlinearly dependent on his age. But if you are looking at only a small group of people who are more than 70 so to say, there is a likelihood that for that subset of the population, the risk of having cancer is linearly correlated with increase in age.

That being said, what LIME does is to go on a very local level and get to a point where it becomes so local that a linear model is powerful enough to explain the behavior of the original black-box model at the local level and it will be highly accurate on that locality (in the neighborhood of the observation we want to explain). In the example (Figure 1), the bold red cross is the instance being explained. LIME will generate new instances by sampling a neighborhood around this selected instance, applying the original black-box model on that neighborhood to generate the corresponding predictions, and weighting those generated instances by their distances to the instance being explained.

In this obtained dataset (which includes the generated instances, the corresponding predictions and the weights), LIME trains an interpretable model (e.g weighted linear model, decision trees) which captures the behaviors of the complex model in that neighborhood. The coefficients of that local linear model will serve as the explanators and tell us which features drive the prediction to one way or the other. In Figure 1, the dashed line is the behaviour of the black-box model in a specific locality which can be explained by a linear model and the explanation is considered faithful in that locality.

#lime #black-box #machine-learning #data-science