Counterfactual Policy Gradients Explained

Among its many challenges, multi-agent reinforcement learning has one obstacle that is often overlooked: “credit assignment.” To explain this concept, let’s first take a look at an example…

Say we have two robots, robot A and robot B, trying to collaboratively push a box into a hole. They both receive a reward of 1 if they push it in and 0 otherwise. In the ideal case, the two robots would push the box toward the hole at the same time, maximizing the speed and efficiency of the task.

However, suppose that robot A does all the heavy lifting: robot A pushes the box into the hole while robot B stands idly on the sidelines. Even though robot B simply loitered around, **both robot A and robot B receive a reward of 1.** In other words, the same behavior is encouraged later on even though robot B executed a suboptimal policy. This is where the issue of “credit assignment” comes in. In multi-agent systems, we need a way to give “credit,” or reward, to agents who contribute to the overall goal, not to those who let others do the work.
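To make the problem concrete, here is a minimal sketch of how a shared team reward plays out in this toy scenario; the environment and the agent bookkeeping are purely illustrative, not an actual implementation:

# Toy illustration of a shared team reward; the setup is hypothetical.
def team_reward(box_in_hole: bool) -> float:
    # The team reward depends only on the outcome, not on who produced it.
    return 1.0 if box_in_hole else 0.0

# An episode where robot A does all the work and robot B idles.
contributions = {"robot_A": True, "robot_B": False}
box_in_hole = any(contributions.values())

rewards = {agent: team_reward(box_in_hole) for agent in contributions}
print(rewards)  # {'robot_A': 1.0, 'robot_B': 1.0} -- B is rewarded despite idling

Nothing in the reward signal distinguishes the two robots, which is exactly why robot B’s loitering gets reinforced.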

Okay, so what’s the solution? Maybe we should only give rewards to agents who contribute to the task itself.


It’s Harder than It Seems

It seems like this easy solution may just work, but we have to keep several things in mind.

First, state representation in reinforcement learning might not be expressive enough to properly tailor rewards like this. In other words, we can’t always easily quantify whether an agent contributed to a given task and dole out rewards accordingly.

Second, we don’t want to handcraft these rewards, because doing so defeats the purpose of designing multi-agent algorithms. There’s a fine line between telling agents how to collaborate and encouraging them to learn how to do so.

#data-science #artificial-intelligence #machine learning


Improving Kubernetes Security with Open Policy Agent (OPA)

Many multinational organizations now run their applications on microservice architectures inside their cloud environments, and administrators are responsible for defining multiple policies across those environments. These giant IT organizations have extensive infrastructure, and each of their systems tends to have its own policy module or built-in authorization system. This is an excellent solution to policy at enterprise scale (especially if you have the investment and resources to ensure best-practice implementation), but the overall ecosystem can be fragmented, which means that if you want to improve control and visibility over who can do what across the stack, you face a lot of complexity.

Why We Need OPA

Enforcing policies manually is an approach of the past. It does not work in today’s environments, where everything is dynamic and ephemeral, the technology stack is heterogeneous, and every development team may use a different language. So, the question is: how do you gain granular control over these policies and automate and streamline their implementation? The answer is Open Policy Agent (OPA).

OPA provides technology that helps unify policy enforcement across a wide range of software and empowers administrators with more control over their systems. Such policies are incredibly helpful in maintaining security, compliance, and standardization across environments, because OPA lets us define and enforce them in a declarative way.
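To give a rough feel for what this looks like in practice, here is a minimal sketch that asks a locally running OPA server for a decision through its Data API; the policy package name (`authz`), the `allow` rule, and the input fields are illustrative assumptions rather than anything prescribed by OPA or this article:

import json
import urllib.request

# Hypothetical authorization question; the package "authz" and these fields are made up.
input_doc = {"input": {"user": "alice", "action": "get", "resource": "pods"}}

req = urllib.request.Request(
    "http://localhost:8181/v1/data/authz/allow",   # OPA's Data API, default local port
    data=json.dumps(input_doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    decision = json.load(resp)

# OPA responds with something like {"result": true} when the policy allows the request.
print("allowed:", decision.get("result", False))

The point is that the decision logic lives in the declaratively written policy, not in the calling service.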

#blog #kubernetes #security #kubernetes open policy agent #opa #open policy agent #policy enforcement #policy implementation

Explaining the Explainable AI: A 2-Stage Approach

As artificial intelligence (AI) models, especially those using deep learning, have gained prominence over the last eight or so years [8], they are now significantly impacting society, ranging from loan decisions to self-driving cars. Inherently, though, a majority of these models are opaque, and hence following their recommendations blindly in human-critical applications can raise issues such as fairness, safety, and reliability, among many others. This has led to the emergence of a subfield in AI called explainable AI (XAI) [7]. XAI is primarily concerned with understanding or interpreting the decisions made by these opaque or black-box models so that one can place appropriate trust in them and, in some cases, even achieve better performance through human-machine collaboration [5].

While there are multiple views on what XAI is [12] and how explainability can be formalized [4, 6], it is still unclear what XAI truly is and why it is so hard to formalize mathematically. The reason for this lack of clarity is that not only must the model and/or data be considered, but also the final consumer of the explanation. Given this intermingled view, most XAI methods [11, 9, 3] try to meet all of these requirements at the same time. For example, many methods try to identify a sparse set of features that replicate the decision of the model, where sparsity serves as a proxy for the consumer’s mental model. An important question is whether we can disentangle the steps that XAI methods are trying to accomplish. Doing so may help us better understand the truly challenging parts of XAI as well as the simpler ones, and it may motivate different types of methods.

Two Stages of XAI

We conjecture that the XAI process can be broadly disentangled into two parts, as depicted in Figure 1. The first part is uncovering what is truly happening in the model that we want to understand, while the second part is about conveying that information to the user in a consumable way. The first part is relatively easy to formalize as it mainly deals with analyzing how well a simple proxy model might generalize either locally or globally with respect to (w.r.t.) data that is generated using the black-box model. Rather than having generalization guarantees w.r.t. the underlying distribution, we now want them w.r.t. the (conditional) output distribution of the model. Once we have some way of figuring out what is truly important, a second step is to communicate this information. This second part is much less clear as we do not have an objective way of characterizing an individual’s mind. This part, we believe, is what makes explainability as a whole so challenging to formalize. A mainstay for a lot of XAI research over the last year or so has been to conduct user studies to evaluate new XAI methods.
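As a rough sketch of what the first stage can look like, the snippet below perturbs an instance, labels the perturbations with the black-box model’s own outputs, and fits a sparse linear proxy to them; the black-box function, perturbation scale, and sparsity penalty are all illustrative assumptions, not a specific published method:

import numpy as np
from sklearn.linear_model import Lasso

# Stand-in for an opaque model's prediction function (purely illustrative).
def black_box_predict(X):
    return np.sin(X[:, 0]) + 0.1 * X[:, 1]

x0 = np.zeros(5)                                  # instance we want to explain
rng = np.random.default_rng(0)

# Stage 1: generate data from the black box around x0 and fit a simple proxy to it.
X_local = x0 + 0.1 * rng.normal(size=(500, 5))    # local perturbations of x0
y_local = black_box_predict(X_local)              # labels come from the model, not the dataset

proxy = Lasso(alpha=0.01).fit(X_local, y_local)   # sparse linear surrogate
print("surrogate weights:", proxy.coef_)          # stage 2 would convey these to the user

How faithfully the proxy matches the model is measurable; how well its sparse weights land in a person’s head is the part that still needs user studies.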

#overviews #ai #explainability #explainable ai #xai

Mia Marquardt

Policy Gradient REINFORCE Algorithm with Baseline

Policy gradient methods are very popular reinforcement learning (RL) algorithms. They are useful in that they model the policy directly, and they work in both discrete and continuous action spaces. In this article, we will:

  1. have a short overview of the underlying math of policy gradients;
  2. implement the policy gradient REINFORCE algorithm in TensorFlow to play CartPole;
  3. compare policy gradients and Deep Q Networks (DQN).

I assume readers have an understanding of reinforcement learning basics. As a refresher, you can take a quick look at the first section of my previous post A Structural Overview of Reinforcement Learning Algorithms.

I have also implemented a Deep Q-Network (DQN) in TensorFlow to play CartPole previously. Check it out here if you are interested. :)
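As a preview of item 2, here is a minimal sketch of a REINFORCE update with a baseline in TensorFlow 2; the network architecture, hyperparameters, and the assumption that discounted returns and a baseline are already computed are illustrative choices here, not the full implementation:

import tensorflow as tf

# Tiny policy network for CartPole (4 state dims, 2 discrete actions); sizes are illustrative.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)

def reinforce_update(states, actions, returns, baseline):
    """One REINFORCE step. states: (T, 4) float32, actions: (T,) int,
    returns: (T,) discounted returns G_t, baseline: (T,) e.g. value estimates."""
    actions = tf.cast(actions, tf.int32)
    advantages = tf.cast(returns - baseline, tf.float32)
    with tf.GradientTape() as tape:
        probs = policy(states)                                        # (T, 2) action probabilities
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_probs = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)      # log pi(a_t | s_t)
        loss = -tf.reduce_mean(log_probs * advantages)                # ascend the expected return
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return loss

Subtracting the baseline from the return doesn’t change the expected gradient, but it usually reduces its variance, which is the whole point of REINFORCE with a baseline.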

#reinforcement-learning #artificial-intelligence #policy-gradient #tensorflow

Mikel Okuneva

Google Updates Ad Policies to Counter Influence Campaigns, Extortion

Google is making two changes in its advertising policy as the U.S. moves into the fall election season ahead of the presidential contest in November, in an attempt to thwart disinformation campaigns.

For one, Google is updating its Google Ads Misrepresentation Policy to prevent coordinated activity around politics, social issues or “matters of public concern,” by requiring advertisers to provide transparency about who they are. As of Sept. 1, this will mean big penalties for “concealing or misrepresenting your identity or other material details about yourself,” the internet giant said in a recent post, adding that violations will be considered “egregious.”

“If we find violations of this policy, we will suspend your Google Ads accounts upon detection and without prior warning, and you will not be allowed to advertise with us again,” according to the announcement.

Coordinated activity (i.e. the use of ads in cooperation with other sites or accounts to create viral content and an artificial echo chamber) has been seen as a hallmark of disinformation and fake-news influence campaigns. Social media platforms have cracked down on fake accounts ever since such operations were discovered to be widespread during the 2016 presidential election.

For instance, in June, Twitter took down three separate nation-sponsored influence operations, attributed to the People’s Republic of China (PRC), Russia and Turkey. Collectively the operations consisted of 32,242 bogus or bot accounts generating fake content, and the various amplifier accounts that retweeted it.

Advertising, however, hasn’t been in the content-policing crosshairs in the same way as content accounts on social media platforms – something that Google is now correcting.

“The changes Google is implementing around misrepresentation are timely as we come up to an election period,” Brandon Hoffman, CISO at Netenrich, told Threatpost. “Certainly nobody can deny the power of the advertising machine for getting an agenda out there. The manipulation that can be achieved with such advertising systems can be considered tantamount to a cybersecurity issue. Putting policy measures in place and making them known well in advance is a positive gesture in the attempt to stem the tide of misinformation that is almost certain to come our way over the coming months.”

He added a caveat, however: “Unfortunately, policy and the enforcement of policy are subject to the effectiveness of the controls put in place to identify the abuse. This draws a parallel to other cybersecurity issues we see, where controls are constantly being updated and enhanced yet the volume of security issues remains unabated.”

The second change, also taking effect September 1, is the launch of the Google Ads Hacked Political Materials Policy. The aim is to keep hacked materials from circulating by preventing them from being marketed – specifically within the context of politics. This can make politically motivated extortion or influence attempts less effective.

“Ads that directly facilitate or advertise access to hacked material related to political entities within scope of Google’s elections ads policies [are not allowed],” according to Google. “This applies to all protected material that was obtained through the unauthorized intrusion or access of a computer, computer network, or personal electronic device, even if distributed by a third party.”

#government #hacks #web security #2020 presidential election #ad policy #disinformation #dnc hack #fake news #google #google ads #hacked political materials policy #influence campaigns #misinformation #misrepresentation policy #transparency

Hollie Ratke

ML Optimization pt.1 - Gradient Descent with Python

So far in our journey through the Machine Learning universe, we have covered several big topics. We investigated some regression algorithms, classification algorithms, and algorithms that can be used for both types of problems (SVM, Decision Trees, and Random Forest). Apart from that, we dipped our toes into unsupervised learning, saw how we can use this type of learning for clustering, and learned about several clustering techniques.

We also talked about how to quantify machine learning model performance and how to improve it with regularization. In all these articles, we used Python for “from scratch” implementations, along with libraries like TensorFlow, PyTorch, and SciKit Learn. The word optimization popped up more than once in those articles, so in this article and the next we focus on optimization techniques, which are an important part of the machine learning process.

In general, every machine learning algorithm is composed of three integral parts:

  1. A loss function.
  2. An optimization criterion based on the loss function, such as a cost function.
  3. An optimization technique – the process that leverages training data to find a solution for the optimization criterion (cost function).

As you were able to see in previous articles, some algorithms were created intuitively and didn’t have an optimization criterion in mind; in fact, the mathematical explanations of why and how these algorithms work came later. Decision Trees and kNN are examples. Other algorithms, developed later, were designed with an optimization criterion in mind from the start. SVM is one example.

During training, we change the parameters of our machine learning model to try to minimize the loss function. However, this raises the questions of how to change those parameters, by how much, and when. To answer these questions we use optimizers; they tie the different parts of the machine learning algorithm together. So far we have mentioned Gradient Descent as an optimization technique, but we haven’t explored it in detail. In this article we do exactly that: we cover the grandfather of all optimization techniques and its variations. Note that these techniques are not machine learning algorithms themselves. They are solvers of minimization problems in which the function to minimize has a gradient at most points of its domain.
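Before the worked example, here is a minimal sketch of the plain gradient descent update on an arbitrary differentiable function; the test function, starting point, learning rate, and iteration count are arbitrary illustrative choices:

import numpy as np

def gradient(f, w, eps=1e-6):
    # Numerical gradient of f at w via central finite differences (for illustration only).
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

def gradient_descent(f, w0, learning_rate=0.1, n_steps=100):
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= learning_rate * gradient(f, w)   # step against the gradient
    return w

# Example: minimize a simple quadratic bowl; the minimum is at (1, -2).
f = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
print(gradient_descent(f, w0=[5.0, 5.0]))    # approaches [1, -2]

Each step moves the parameters a small distance opposite to the gradient, i.e. in the direction of steepest decrease.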

Dataset & Prerequisites

The data we use in this article is the famous Boston Housing Dataset. This dataset is composed of 14 features and contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. It is a small dataset, with only 506 samples.

For the purpose of this article, make sure that you have installed the following _Python_ libraries:

  • **NumPy** – Follow this guide if you need help with installation.
  • **SciKit Learn** – Follow this guide if you need help with installation.
  • **Pandas** – Follow this guide if you need help with installation.

Once they are installed, make sure that you have imported all the necessary modules used in this tutorial.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

Apart from that, it would be good to be at least familiar with the basics of linear algebra, calculus, and probability.

Why do we use Optimizers?

Note that we also use simple Linear Regression in all examples. Because we are exploring optimization techniques, we picked the easiest machine learning algorithm. You can see more details about Linear Regression here. As a quick reminder, the formula for linear regression goes like this:

ŷ = w · x + b

where w and b are parameters of the machine learning algorithm. The entire point of the training process is to set the correct values of w and b so that we get the desired output from the machine learning model. This means that we are trying to make the value of our error vector as small as possible, i.e. to find a global minimum of the cost function.

One way of solving this problem is to use calculus. We could compute the derivatives and then use them to find the extrema of the cost function. However, the cost function is not a function of one or a few variables; it is a function of all the parameters of the machine learning algorithm, so these calculations quickly grow into a monster. That is why we use optimizers.
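To make this concrete, here is a minimal from-scratch sketch of batch gradient descent for simple linear regression with a mean-squared-error cost; the synthetic data, learning rate, and iteration count are illustrative assumptions rather than a definitive implementation:

import numpy as np

# Synthetic 1-D data standing in for a single scaled feature (illustrative only).
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 3.0 * x + 2.0 + 0.5 * rng.normal(size=200)   # true w = 3, b = 2, plus noise

w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(500):
    y_pred = w * x + b
    error = y_pred - y
    # Gradients of the MSE cost (1/n) * sum(error^2) with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")   # should end up close to 3 and 2

With more features, the same update generalizes by replacing the scalar w with a weight vector and the products with dot products.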

#ai #machine learning #python #artificaial inteligance #artificial intelligence #batch gradient descent #data science #datascience #deep learning #from scratch #gradient descent #machine learning #machine learning optimizers #ml optimization #optimizers #scikit learn #software #software craft #software craftsmanship #software development #stochastic gradient descent