1678450560
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
pip install traceml
If you would like to use the tracking features, you need to install polyaxon
as well:
pip install polyaxon traceml
Coming soon
You can enable the offline mode to track runs without an API:
export POLYAXON_OFFLINE="true"
Or pass the offline flag:
from traceml import tracking
tracking.init(..., is_offline=True, ...)
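The same switch can also be set from Python for the current process. This is a minimal sketch; it assumes the variable is read when tracking is initialised, which the source does not state explicitly, so it is set before any tracking call.

```python
import os

# Equivalent of `export POLYAXON_OFFLINE="true"` for this process only.
# Assumption: the variable is read at initialisation time, so set it first.
os.environ["POLYAXON_OFFLINE"] = "true"

# Sanity check that the flag is visible to code running in this process.
offline = os.environ.get("POLYAXON_OFFLINE", "false").lower() == "true"
print(offline)
```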
import random

from traceml import tracking

tracking.init(
    is_offline=True,
    project='quick-start',
    name="my-new-run",
    description="trying TraceML",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

# Tracking some data refs (X_train and y_train are assumed to be defined elsewhere)
tracking.log_data_ref(content=X_train, name='x_train')
tracking.log_data_ref(content=y_train, name='y_train')

# Tracking inputs
tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

def get_loss(step):
    # Synthetic loss that decays with the step, plus some random noise
    result = 10 / (step + 1)
    noise = (random.random() - 0.5) * 0.5 * result
    return result + noise

# Track metrics
for step in range(100):
    loss = get_loss(step)
    tracking.log_metrics(
        loss=loss,
        accuracy=(100 - loss) / 100.0,
    )

# Track some one time results
tracking.log_outputs(validation_score=0.66)

# Optionally manually stop the tracking process
tracking.stop()
You can use TraceML's Keras callback to automatically save all metrics and collect outputs and models. You can also track additional information using the logging methods:
from traceml import tracking
from traceml.integrations.keras import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="keras-run",
    description="trying TraceML & Keras",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

# x_train, y_train, x_test, y_test, model and epochs are assumed to be defined elsewhere
tracking.log_data_ref(content=x_train, name='x_train')
tracking.log_data_ref(content=y_train, name='y_train')
tracking.log_data_ref(content=x_test, name='x_test')
tracking.log_data_ref(content=y_test, name='y_test')

# ...

model.fit(
    x_train,
    y_train,
    validation_data=(x_test, y_test),
    epochs=epochs,
    batch_size=100,
    callbacks=[Callback()],
)
You can log metrics, inputs, and outputs of PyTorch experiments using the tracking module:
import torch
import torch.nn.functional as F

from traceml import tracking

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="pytorch-run",
    description="trying TraceML & PyTorch",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

# Metrics (model, optimizer and train_loader are assumed to be defined elsewhere)
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()  # reset gradients so they do not accumulate across batches
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    optimizer.step()
    tracking.log_metrics(loss=loss)

asset_path = tracking.get_outputs_path('model.ckpt')
torch.save(model.state_dict(), asset_path)

# log model
tracking.log_artifact_ref(asset_path, framework="pytorch", ...)
You can log metrics, outputs, and models of TensorFlow experiments and distributed TensorFlow experiments using the tracking module:
from traceml import tracking
from traceml.integrations.tensorflow import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="tf-run",
    description="trying TraceML & Tensorflow",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

tracking.log_inputs(
    batch_size=64,
    dropout=0.2,
    learning_rate=0.001,
    optimizer="Adam"
)

# log model
estimator.train(hooks=[Callback(log_image=True, log_histo=True, log_tensor=True)])
You can log metrics, outputs, and models of Fastai experiments using the tracking module:
from traceml import tracking
from traceml.integrations.fastai import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="fastai-run",
    description="trying TraceML & Fastai",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

# Log model metrics
learn.fit(..., cbs=[Callback()])
You can log metrics, outputs, and models of PyTorch Lightning experiments using the tracking module:
from traceml import tracking
from traceml.integrations.pytorch_lightning import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="pytorch-lightning-run",
    description="trying TraceML & Lightning",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

...

trainer = pl.Trainer(
    gpus=0,
    progress_bar_refresh_rate=20,
    max_epochs=2,
    logger=Callback(),
)
You can log metrics, outputs, and models of Hugging Face experiments using the tracking module:
from traceml import tracking
from traceml.integrations.hugging_face import Callback

tracking.init(
    is_offline=True,
    project='tracking-project',
    name="hg-run",
    description="trying TraceML & HuggingFace",
    tags=["examples"],
    artifacts_path="path/to/artifacts/repo"
)

...

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset if training_args.do_train else None,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    callbacks=[Callback],
    # ...
)
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
from bokeh.plotting import figure
from vega_datasets import data

from traceml import tracking


def plot_mpl_figure(step):
    np.random.seed(19680801)
    data = np.random.randn(2, 100)

    figure, axs = plt.subplots(2, 2, figsize=(5, 5))
    axs[0, 0].hist(data[0])
    axs[1, 0].scatter(data[0], data[1])
    axs[0, 1].plot(data[0], data[1])
    axs[1, 1].hist2d(data[0], data[1])

    tracking.log_mpl_image(figure, 'mpl_image', step=step)


def log_bokeh(step):
    factors = ["a", "b", "c", "d", "e", "f", "g", "h"]
    x = [50, 40, 65, 10, 25, 37, 80, 60]

    dot = figure(title="Categorical Dot Plot", tools="", toolbar_location=None,
                 y_range=factors, x_range=[0, 100])

    dot.segment(0, factors, x, factors, line_width=2, line_color="green")
    dot.circle(x, factors, size=15, fill_color="orange", line_color="green", line_width=3)

    factors = ["foo 123", "bar:0.2", "baz-10"]
    x = ["foo 123", "foo 123", "foo 123", "bar:0.2", "bar:0.2", "bar:0.2", "baz-10", "baz-10",
         "baz-10"]
    y = ["foo 123", "bar:0.2", "baz-10", "foo 123", "bar:0.2", "baz-10", "foo 123", "bar:0.2",
         "baz-10"]
    colors = [
        "#0B486B", "#79BD9A", "#CFF09E",
        "#79BD9A", "#0B486B", "#79BD9A",
        "#CFF09E", "#79BD9A", "#0B486B"
    ]

    hm = figure(title="Categorical Heatmap", tools="hover", toolbar_location=None,
                x_range=factors, y_range=factors)

    hm.rect(x, y, color=colors, width=1, height=1)

    tracking.log_bokeh_chart(name='confusion-bokeh', figure=hm, step=step)


def log_altair(step):
    source = data.cars()

    brush = alt.selection(type='interval')

    points = alt.Chart(source).mark_point().encode(
        x='Horsepower:Q',
        y='Miles_per_Gallon:Q',
        color=alt.condition(brush, 'Origin:N', alt.value('lightgray'))
    ).add_selection(
        brush
    )

    bars = alt.Chart(source).mark_bar().encode(
        y='Origin:N',
        color='Origin:N',
        x='count(Origin):Q'
    ).transform_filter(
        brush
    )

    chart = points & bars

    tracking.log_altair_chart(name='altair_chart', figure=chart, step=step)


def log_plotly(step):
    df = px.data.tips()

    fig = px.density_heatmap(df, x="total_bill", y="tip", facet_row="sex", facet_col="smoker")
    tracking.log_plotly_chart(name="2d-hist", figure=fig, step=step)


plot_mpl_figure(100)
log_bokeh(100)
log_altair(100)
log_plotly(100)
An extension to the pandas DataFrame describe() function.
The module contains a DataFrameSummary object that extends describe() with per-column statistics (columns_stats), a count of the column types (columns_types), and in-depth summaries of individual columns.
DataFrameSummary expects a pandas DataFrame to summarise.
from traceml.summary.df import DataFrameSummary
dfs = DataFrameSummary(df)
Getting the column types:
dfs.columns_types
numeric        9
bool           3
categorical    2
unique         1
date           1
constant       1
dtype: int64
Getting the column stats:
dfs.columns_stats
                   A            B        C        D        E
counts          5802         5794     5781     5781     4617
uniques         5802            3     5771      128      121
missing            0            8       21       21     1185
missing_perc      0%        0.14%    0.36%    0.36%   20.42%
types         unique  categorical  numeric  numeric  numeric
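The per-column figures in this table can be approximated with plain pandas. The following is a rough sketch, not TraceML's implementation; the toy DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are invented for this sketch.
df = pd.DataFrame({
    "A": [0.1, 0.5, np.nan, 0.9],
    "B": ["x", "y", "x", None],
})

missing = df.isna().sum()

# One row per column of df: non-null count, distinct values, missing values.
stats = pd.DataFrame({
    "counts": df.count(),
    "uniques": df.nunique(),
    "missing": missing,
    "missing_perc": (missing / len(df) * 100).round(2),
})
print(stats)
```

Here each statistic is a column and each original column is a row; calling `stats.T` gives the orientation used in the table, at the cost of mixing dtypes within each column.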
Getting a single column summary, e.g. for a numerical column:
# a column can also be accessed by position, e.g. dfs[1]
dfs['A']
std 0.2827146
max 1.072792
min 0
variance 0.07992753
mean 0.5548516
5% 0.1603367
25% 0.3199776
50% 0.4968588
75% 0.8274732
95% 1.011255
iqr 0.5074956
kurtosis -1.208469
skewness 0.2679559
sum 3207.597
mad 0.2459508
cv 0.5095319
zeros_num 11
zeros_perc 0,1%
deviating_of_mean 21
deviating_of_mean_perc 0.36%
deviating_of_median 21
deviating_of_median_perc 0.36%
top_correlations {u'D': 0.702240243124, u'E': -0.663}
counts 5781
uniques 5771
missing 21
missing_perc 0.36%
types numeric
Name: A, dtype: object
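Several of the derived figures in this summary follow from the basic ones: 0.8274732 - 0.3199776 = 0.5074956 matches iqr, and 0.2827146 / 0.5548516 ≈ 0.5095 matches cv. The sketch below computes such values with plain pandas on made-up data. The definitions (in particular mad as mean absolute deviation) are assumptions, not taken from the TraceML source.

```python
import pandas as pd

# Made-up values, chosen so the results are easy to check by hand.
s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])

summary = {
    # interquartile range: 75th percentile minus 25th percentile
    "iqr": s.quantile(0.75) - s.quantile(0.25),
    # coefficient of variation: sample standard deviation over the mean
    "cv": s.std() / s.mean(),
    # assumed definition: mean absolute deviation around the mean
    "mad": (s - s.mean()).abs().mean(),
    # number of entries equal to zero
    "zeros_num": int((s == 0).sum()),
}
print(summary)
```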
dfs[[1, 2]]
Author: Polyaxon
Source Code: https://github.com/polyaxon/traceml
License: Apache-2.0 license
#machinelearning #python #datascience #tensorflow
1620466520
If you accumulate data on which you base your decision-making as an organization, you most probably need to think about your data architecture and consider possible best practices. Gaining a competitive edge, remaining customer-centric to the greatest extent possible, and streamlining processes to get on-the-button outcomes can all be traced back to an organization’s capacity to build a future-ready data architecture.
In what follows, we offer a short overview of the overarching capabilities of data architecture. These include user-centricity, elasticity, robustness, and the capacity to ensure the seamless flow of data at all times. Added to these are automation enablement, plus security and data governance considerations. These points form our checklist for what we perceive to be an anticipatory analytics ecosystem.
#big data #data science #big data analytics #data analysis #data architecture #data transformation #data platform #data strategy #cloud data platform #data acquisition
1617988080
Using data to inform decisions is essential to product management, or anything really. And thankfully, we aren’t short of it. Any online application generates an abundance of data and it’s up to us to collect it and then make sense of it.
Google Data Studio helps us understand the meaning behind data, enabling us to build beautiful visualizations and dashboards that transform data into stories. If it wasn’t already, data literacy is as much a fundamental skill as learning to read or write. Or it certainly will be.
Nothing is more powerful than data democracy, where anyone in your organization can regularly make decisions informed by data. As part of enabling this, we need to be able to visualize data in a way that brings it to life and makes it more accessible. I’ve recently been learning how to do this and wanted to share some of the cool ways you can achieve it in Google Data Studio.
#google-data-studio #blending-data #dashboard #data-visualization #creating-visualizations #how-to-visualize-data #data-analysis #data-visualisation
1621518941
The data-related career landscape can be confusing, not only to newcomers, but also to those who have spent time working within the field.
Get in where you fit in. Focusing on newcomers, however, the requests I receive from those interested in joining the data field in some capacity suggest that there is often (and understandably) a general lack of understanding of what one needs to know in order to decide where one fits in. In this article, we will have a look at five distinct data career archetypes, and hopefully provide some advice on how to get one’s feet wet in this vast, convoluted field.
We will focus solely on industry roles, as opposed to those in research, so as not to add an additional layer of complication. We will also omit executive-level positions such as Chief Data Officer and the like, mostly because if you are at the point in your career where this role is an option for you, you probably don’t need the information in this article.
So here are 5 data career archetypes, replete with descriptions and information on what makes them distinct from one another.
The data architect focuses on engineering and managing data stores and the data that reside within them.
The data architect is concerned with managing data and engineering the infrastructure which stores and supports this data. There is generally little to no data analysis needing to take place in such a role (beyond data store analysis for performance tuning), and the use of languages such as Python and R is likely not necessary. An expert level knowledge of relational and non-relational databases, however, will undoubtedly be necessary for such a role. Selecting data stores for the appropriate types of data being stored, as well as transforming and loading the data, will be necessary. Databases, data warehouses, and data lakes; these are among the storage landscapes that will be in the data architect’s wheelhouse. This role is likely the one which will have the greatest understanding of and closest relationship with hardware, primarily that related to storage, and will probably have the best understanding of cloud computing architectures of anyone else in this article as well.
SQL and other data query languages — such as Jaql, Hive, Pig, etc. — will be invaluable, and will likely be some of the main tools of an ongoing data architect’s daily work after a data infrastructure has been designed and implemented. Verifying the consistency of this data as well as optimizing access to it are also important tasks for this role. A data architect will have the know-how to maintain appropriate data access rights, ensure the infrastructure’s stability, and guarantee the availability of the housed data.
This is differentiated from the data engineer role by focus: while a data engineer is concerned with building and maintaining data pipelines (see below), the data architect is focused on the data itself. There may be overlap between the two roles, however, in areas such as ETL, any task which could transform or move data (especially from one store to another), and starting data on a journey down a pipeline.
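To make the ETL overlap concrete, here is a minimal extract-transform-load sketch using only Python's standard library. It is an illustration under stated assumptions, not a recommended architecture: the table names, columns, and cleaning rules are all hypothetical, and an in-memory SQLite database stands in for both the source and the target store.

```python
import sqlite3

# In-memory database standing in for both stores (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, " us "), (2, 990, "DE"), (3, None, "fr")],
)
conn.execute("CREATE TABLE clean_orders (id INTEGER, amount REAL, country TEXT)")

# Extract: read the raw rows from the source table.
rows = conn.execute("SELECT id, amount_cents, country FROM raw_orders").fetchall()

# Transform: drop rows without an amount, convert cents to units,
# and normalise country codes.
clean = [
    (order_id, cents / 100, country.strip().upper())
    for order_id, cents, country in rows
    if cents is not None
]

# Load: write the cleaned rows into the target table.
conn.executemany("INSERT INTO clean_orders VALUES (?, ?, ?)", clean)
conn.commit()

result = conn.execute(
    "SELECT id, amount, country FROM clean_orders ORDER BY id"
).fetchall()
print(result)
```

In the terms used here, designing the two tables and their access rules is closer to the architect's side, while the extract, transform, and load steps are the kind of pipeline work attributed to the data engineer.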
Like other roles in this article, you might not necessarily see a “data architect” role advertised as such, and might instead see related job titles, such as:
The data engineer focuses on engineering and managing the infrastructure which supports the data and data pipelines.
What is the data infrastructure? It’s the collection of software and storage solutions that allow for the retrieval of data from a data store, the processing of data in some specified manner (or series of manners), the movement of data between tasks (as well as the tasks themselves), as data is on its way to analysis or modeling, as well as the tasks which come after this analysis or modeling. It’s the pathway that the data takes as it moves along its journey from its home to its ultimate location of usefulness, and beyond. The data engineer is certainly familiar with DataOps and its integration into the data lifecycle.
From where does the data infrastructure come? Well, it needs to be designed and implemented, and the data engineer does this. If the data architect is the automobile mechanic, keeping the car running optimally, then data engineering can be thought of as designing the roadway and service centers that the automobile requires to both get around and to make the changes needed to continue on the next section of its journey. The pair of these roles are crucial to both the functioning and movement of your automobile, and are of equal importance when you are driving from point A to point B.
Truth be told, some of the technologies and skills required for data engineering and data management are similar; however, the practitioners of these disciplines use and understand these concepts at different levels. The data engineer may have a foundational knowledge of securing data access in a relational database, while the data architect has expert-level knowledge; the data architect may have some understanding of the transformation process that an organization requires its stored data to undergo prior to a data scientist performing modeling with that data, while a data engineer knows this transformation process intimately. These roles speak their own languages, but these languages are more or less mutually intelligible.
#data analyst #data engineer #data engineering #data management #data science
1621413060
Data engineering is among the core branches of big data. If you’re studying to become a data engineer and want some projects to showcase your skills (or gain knowledge), you’ve come to the right place. In this article, we’ll discuss several data engineering project ideas that you can work on and should be aware of.
You should note that you should be familiar with some topics and technologies before you work on these projects. Companies are always on the lookout for skilled data engineers who can develop innovative data engineering projects. So, if you are a beginner, the best thing you can do is work on some real-time data engineering projects.
We, here at upGrad, believe in a practical approach, as theoretical knowledge alone won’t be of help in a real-world work environment. In this article, we will explore some interesting data engineering projects that beginners can work on to put their knowledge to the test and gain hands-on experience.
Amid the cut-throat competition, aspiring developers must have hands-on experience with real-world data engineering projects. In fact, this is one of the primary recruitment criteria for most employers today. As you start working on data engineering projects, you will not only be able to test your strengths and weaknesses, but you will also gain exposure that can be immensely helpful in boosting your career.
That’s because you’ll need to complete the projects correctly. Here are the most important ones:
#big data #big data projects #data engineer #data engineer project #data engineering projects #data projects
1624072920
Big data skills are crucial for landing data engineering roles. From designing, creating, building, and maintaining data pipelines to collating raw data from various sources and ensuring performance optimization, data engineering professionals carry out a plethora of tasks. They are expected to know about big data frameworks, databases, building data infrastructure, containers, and more. It is also important that they have hands-on exposure to tools such as Scala, Hadoop, HPCC, Storm, Cloudera, Rapidminer, SPSS, SAS, Excel, R, Python, Docker, Kubernetes, MapReduce, and Pig, to name a few.
Here, we list some of the important skills that one should possess to build a successful career in big data.
#big data #latest news #data engineering jobs #skills for data engineering jobs #10 must-have skills for data engineering jobs #data engineering