
Chatbot for The Institute for Ethical ML, specifically for The ML Engineer Newsletter


EthicalML/chatbot


The ML Engineer Chatbot

This repo contains the code for The ML Engineer chatbot.

It also includes a cleaned version of the parsed newsletter pages for easier ingestion.

This can be found in the newsletter_clean.json file.
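As a rough illustration of how the cleaned file might be consumed, here is a minimal Python sketch. It assumes that `newsletter_clean.json` holds a top-level list of records, each with an `issue` number and an `items` list whose entries carry `title`, `link` and `content`. That shape is inferred from the flattened column names (`items.0.title`, `items.0.link`, ...) and is not confirmed by the repo; `load_issues` and `iter_items` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_issues(path="newsletter_clean.json"):
    """Load the cleaned newsletter file (assumed: a JSON list of issue records)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def iter_items(issues):
    """Yield (issue, title, link, content) for every item of every issue."""
    for record in issues:
        for item in record.get("items", []):
            yield (
                record.get("issue"),
                item.get("title"),
                item.get("link"),
                item.get("content"),
            )


# Tiny inline sample in the assumed shape, so the sketch runs without the file.
sample = [
    {
        "issue": 24,
        "items": [
            {
                "title": "The illustrated transformer",
                "link": "http://jalammar.github.io/illustrated-transformer/",
                "content": "The Illustrated Transformer ...",
            }
        ],
    }
]
rows = list(iter_items(sample))
```

With the real file, `rows = list(iter_items(load_issues()))` would give one tuple per newsletter item, provided the assumed layout holds.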

It is also available as the Markdown table below:

issue items.0.title items.0.link items.0.content items.1.title items.1.link items.1.content items.2.title items.2.link items.2.content items.3.title items.3.link items.3.content items.4.title items.4.link items.4.content items.5.title items.5.link items.5.content items.6.title items.6.link items.6.content
24 Stanford’s Deep Learning NLP Course http://cs224d.stanford.edu/ Stanford’s Deep NLP Course A great time to be alive thanks to the incredible e-learning resources. Stanford has made its computer science course on Deep Learning for Natural Language Processing available online. All the video lectures can be found online for free - a great end-to-end introduction to the theory and practice of several cutting edge concepts. A few alternative resources are available as well, such as DeepMind’s deep learning NLP course which can be found on Github. The data orchestration layer http://www.alluxio.io/resources/whitepapers/alluxio-overview/ The Data Orchestration Layer Alluxio is an open source framework that provides and advocates for a data orchestration layer. This is an architectural layer in charge of simplifying and standardising data access, making it easier for data scientists and engineers to load and interact with the right datasets. With datasets and data sources growing exponentially, this opportunity will only grow - Alluxio provides a really interesting whitepaper where they explain these challenges, and cover some of the benefits that a platform like Alluxio can bring to the table. The illustrated transformer http://jalammar.github.io/illustrated-transformer/ The Illustrated Transformer Last week we shared Jay’s work on Attention in Seq2seq NLP models. This week Jay comes back with another great visual deep dive into the transformer - a model that uses attention to boost the speed at which these models can be trained. Google’s people+AI guide http://pair.withgoogle.com People plus AI Guidebook Google released a “People+AI” guidebook, a great and extensible resource that introduces fundamental knowledge for designing human-centered AI products. 
The guide covers an overview of machine learning and automation, as well as more high level (and critical) topics such as data collection, explainability, trust, feedback, control, errors and more. Reproducible ML pipelines http://medium.com/comet-ml/building-a-fully-reproducible-machine-learning-pipeline-with-comet-ml-and-quilt-aa9c7bf85e72 Build a reproducible ML Pipeline The space of machine learning reproducibility keeps surprising us with a lot of innovative approaches - this week Cecelia Shao from CometML has put together a tutorial on how to build a reproducible machine learning pipeline using Comet.ML and Quilt. In this tutorial she shows us how we can build a Keras image classifier on a fruits dataset. GANs in action http://www.manning.com/books/gans-in-action GANs in Action Book This week we have seen yet another great piece of research by the Samsung AI team, along with a video showing how they are able to use this tech to bring world famous paintings (like the Mona Lisa) to life. For anyone interested in diving deeper into the world of GANs, there is a Manning book “GANs in Action” by Jakub Langr which has made content available for free. Industrial NLP libraries http://github.com/EthicalML/awesome-machine-learning-operations Industrial NLP libraries
25 Google’s Research Director on MLOps http://www.microsoft.com/en-us/research/video/as-we-may-program/ Google Research on MLOps Google’s current research director and former NASA chief scientist Peter Norvig dives into how machine learning will change the way we program. Peter focuses a lot on machine learning model evaluation, as well as the tools we use. You can find the hour-long video here, as well as the full slides here. A book on AutoML http://www.automl.org/book/ The Book on AutoML AutoML provides methods and processes to make Machine Learning available for ML experts and non-experts. AutoML has achieved considerable successes in recent years and an ever-growing number of disciplines rely on it. AutoML.org provides a free e-book that covers all things around this topic, and provides extensive resources to dive into hands-on use of this powerful set of tools and techniques. Deep Learning for face detection http://machinelearningmastery.com/how-to-perform-face-detection-with-classical-and-deep-learning-methods-in-python-with-keras/ Deep Learning for face detection Face detection is a computer vision problem that involves finding faces in photos. This hands-on tutorial provides the knowledge to understand the challenges and opportunities for face detection, as well as a hands-on example showing how state-of-the-art face detection can be achieved using a Multi-task Cascaded CNN via the MTCNN library. Maintainable ETL Pipelines http://multithreaded.stitchfix.com/blog/2019/05/21/maintainable-etls/ Maintainable ETL Pipelines This great article provides a set of tips and best practices to structure your ETL data pipelines to ensure they are scalable and maintainable in the medium and longer term. The tips provided consist of 4 key themes: 1) Building a chain of simple tasks, 2) using a workflow management tool, 3) leveraging SQL where possible and 4) implementing data quality checks. 
Counterfactuals for XAI http://docs.seldon.io/projects/alibi/en/v0.2.0/methods/CF.html Counterfactuals for Explainable AI A counterfactual explanation describes a causal situation in the form: “If X had not occurred, Y would not have occurred”. In interpretable machine learning, counterfactual explanations can be used to explain predictions of individual instances. Seldon’s Interpretable Machine Learning Library Alibi has launched its v0.2.0 version which contains Counterfactual explanations, and provides an example of how to find counterfactual instances using the MNIST dataset. Semi-supervised machine learning http://towardsdatascience.com/the-quiet-semi-supervised-revolution-edec1e9ad8c The Semi-Supervised Revolution One of the most familiar settings for a machine learning engineer is having access to a lot of data, but modest resources to annotate it. Everyone in that predicament eventually goes through the logical steps of asking themselves what to do when they have limited supervised data, but lots of unlabeled data, and the literature appears to have a ready answer: semi-supervised learning. This post provides a brief introduction to the concept of semi-supervised machine learning, as well as references to papers that provide an insight on this topic.
26 PyTorch Hub for Reproducible ML http://pytorch.org/blog/towards-reproducible-research-with-pytorch-hub/ PyTorch Hub + Reproducible ML Reproducibility is an essential requirement for many fields of research including those based on machine learning techniques. PyTorch has released PyTorch Hub, where the community can now share models built with PyTorch. This great new resource also has built-in support for Colab, integration with Papers With Code and currently contains a broad set of models that include Classification and Segmentation, Generative, Transformers, and beyond 🚀. A free course on privacy preserving ML http://eu.udacity.com/course/secure-and-private-ai--ud185 Privacy-preserving AI free course What a time to be alive for life-long learners - a brand new free online course has been made available by Facebook AI on hands down some of the most exciting topics in this space: Federated Learning, Differential Privacy and Encrypted computation. This course teaches you how to leverage open source tools to explore these topics on an introductory level. Really awesome to see this type of content be made available freely. MLFlow for pipeline management http://thenewstack.io/databricks-mlflow-aims-to-simplify-management-of-machine-learning-pipelines/ MLFlow for pipeline management MLflow from Databricks is an open source framework that addresses some of the biggest challenges in machine learning, including configuring environments, tracking experiments, and deploying trained models for inference. This post provides a high level overview on this framework as well as useful links to get started trying it out. E2E NLP Pipelines with Kubeflow & Seldon http://github.com/SeldonIO/seldon-core/tree/master/examples/kubeflow E2E NLP Pipelines with Kubeflow End to end pipelines are always a challenge in the data science space. Kubeflow is an open source framework that helps you run reproducible ML workloads in Kubernetes. 
This example showcases an end-to-end NLP pipeline leveraging re-usable components that utilize key frameworks such as the SpaCy NLP library to perform automation of text analysis, as well as serving the models using Seldon. The brains behind SpaCy http://www.analyticsvidhya.com/blog/2019/06/datahack-radio-ines-montani-matthew-honnibal-brains-behind-spacy/ The Brains behind SpaCy The DataHack team has put together a great podcast where they bring the co-founders of Explosion.ai, and authors of SpaCy, to talk about the story behind this popular framework. During this 40 minute episode, they dive into the idea behind developing spaCy, spaCy’s evolution from the first alpha release, use cases of spaCy including a couple of surprising applications, and Ines and Matt’s advice to NLP enthusiasts.
27 Distributed AI made easy with Ray http://bair.berkeley.edu/blog/2018/01/09/ray/ Distributed AI made easy w Ray Last week we attended China’s first Ray meetup, and had the great pleasure of seeing Ion Stoica, Apache Spark founder and Databricks Chairman, presenting a technical deep dive on Ray, followed by a set of great talks from Alibaba, Didi and Ant Financial engineering leaders on Ray use cases. Ray is a fast and simple framework for building and running distributed applications, and comes with a broad set of tools including Tune (Rapid Hyperparam search), RLlib (Scalable Reinforcement Learning), and Distributed Training, among several other features. Model interpretation with Alibi http://changelog.com/practicalai/48 Model Interpretation with Alibi Janis Klaise, Data Scientist at Seldon, joins Daniel Whitenack and Chris Benson on their Practical AI podcast to talk about the challenges of production machine learning, and how Seldon is tackling the challenge with open source, particularly in the theme of explainable machine learning with Alibi Black Box Model Explanations. Janis provides an introduction to the challenges of production machine learning, as well as the different approaches that can be used in machine learning explainability. Principled Machine Learning http://dev.to/robogeek/principled-machine-learning-4eho Principled Machine Learning A great blog post that summarises a set of principles presented at a talk by Patrick Ball with the Data & Society Research Institute titled Principled Data Processing. The principles are transparency, accountability, reproducibility and scalability, which truly resonate with our 8 principles for responsible machine learning. Deep learning vs classical for time series http://github.com/SeldonIO/seldon-core/tree/master/examples/kubeflow Comparing Time Series Models Machine learning mastery comes back this week with a deep dive analysing results of classical and machine learning methods for time series analysis. 
The post highlights three key findings: 1) classical methods like ETS and ARIMA outperform ML/DL methods for one-step forecasting on univariate datasets, 2) classical methods like Theta and ARIMA outperform ML/DL models for multi-step forecasting on univariate datasets, and 3) ML/DL methods do not yet deliver on their promise for univariate time series forecasting. The quest for high quality data http://www.oreilly.com/ideas/the-quest-for-high-quality-data The quest for high-quality data Ben Lorica and Ihab Ilyas bring us an excellent piece this week covering machine learning solutions for data integration, cleaning, and data generation, which are quickly gaining traction and popularity. This post covers fundamental topics like data integration / cleaning, data programming and market validation. Open source libraries for adversarial robustness https://github.com/EthicalML/awesome-production-machine-learning#adversarial-robustness-libraries OSS: Adversarial Robustness The theme for this week’s featured ML libraries is Adversarial Robustness, which includes tools for adversarial attacks and adversarial security. These libraries are an incredibly exciting addition that fall in our Responsible ML Principle #8, and the whole section was contributed by one of the Fellows at the Institute, Ilja Moisejevs from Calypso AI. The four featured libraries this week are:
28 The state of AI in 2019 http://www.stateof.ai/?fbclid=IwAR0AAhweykbVQMw28mVFLc7Tl1SMQGM_IiBpk6a_jeiODm1vnTEFbg8UfUg The state of AI in 2019 Nathan Benaich and Ian Hogarth bring us the report outlining the state of AI in 2019, which aims to provide an overview on the current state and advances in 5 key areas: research, talent, industry, China and politics. This annual report covers the most interesting things that they have seen in the last 12 months. You can also find last year’s report here. Production machine learning in 2019 http://github.com/EthicalML/state-of-mlops-2019/ Production ML in 2019 During Kubecon Shanghai 2019 we presented a high level overview of the themes that are increasingly becoming critical in the world of production machine learning. This includes over 10 themes which span explainability, privacy, model versioning, adversarial robustness and beyond. This repository contains a set of slides that dive into three key themes: black box explainability, model versioning and ML orchestration. For each of these themes, there is a high level explanation, together with a hands-on example with a Jupyter notebook, including an end-to-end NLP pipeline, tabular explainers and a PyTorch Hub integration. Model governance and operations http://www.oreilly.com/ideas/what-are-model-governance-and-model-operations Model governance and ops Machine learning at scale introduces new challenges, as managing a large number of models that perform increasingly critical tasks becomes more complex. O’Reilly Chief Scientist Ben Lorica has put together a great overview of the ecosystem and tools available in the machine learning governance and operations world. 
The best of modern NLP http://medium.com/huggingface/the-best-and-most-current-of-modern-natural-language-processing-5055f409a1d1 The best of modern NLP Hugging Face scientist Victor Sanh has put together an extensive list of resources related to key themes in NLP, including transfer learning, representation learning, neural dialogue, as well as other miscellaneous pieces of research that have contributed to the growth of the field in the last two years. Adversarial examples with FGSM http://www.tensorflow.org/beta/tutorials/generative/adversarial_fgsm Adversarial examples with FGSM The TensorFlow tutorials have added a new deep dive on adversarial examples, which covers the conceptual, theoretical and practical aspects of this topic. It provides the code for you to find an adversarial example which could trick a classifier.
29 The state of AI in 2019 http://www.stateof.ai/?fbclid=IwAR0AAhweykbVQMw28mVFLc7Tl1SMQGM_IiBpk6a_jeiODm1vnTEFbg8UfUg The state of AI in 2019 Production machine learning in 2019 http://github.com/EthicalML/state-of-mlops-2019/ Production machine learning in 2019 Model governance and operations http://www.oreilly.com/ideas/what-are-model-governance-and-model-operations Model governance and operations The best of modern NLP http://medium.com/huggingface/the-best-and-most-current-of-modern-natural-language-processing-5055f409a1d1 The best of modern NLP Adversarial examples with FGSM http://www.tensorflow.org/beta/tutorials/generative/adversarial_fgsm Question-answering AI in K8s Intel Software Innovator Daniel Whitenack has put together an awesome production-level framework with modular functionality to perform question-answering ML inference on top of Kubernetes. They’ve put together a brief screencast that showcases how you can interact with it, as well as an arXiv research paper with full details on the framework.
30 Production-level ML Explainers http://github.com/EthicalML/explainability-and-bias/ Production-level ML Explainers This week we presented at PyData London and EuroPython Basel on production-level machine learning model explainers, which is an approach to leverage explanations end-to-end with the purpose of aligning with higher level frameworks like regulation or industry standards. The slides are available online and include code examples for data analysis with XAI, black box model analysis with Alibi and production explainers with Seldon. AI Explanations with Counterfactuals http://arxiv.org/abs/1907.02584 AI Explanations w Counterfactuals An awesome research paper published on arXiv this week by Seldon Data Scientists Arnaud Van Looveren and Janis Klaise titled “Interpretable Counterfactual Explanations Guided by Prototypes”. This paper dives into the concept of counterfactuals, which is an ML local model explanation technique that allows you to ask the question “for this ML prediction, what are the smallest changes I could make to the input to change the outcome?”. As this is a computationally expensive task, the paper proposes a new approach to reduce the computational resources required to use this technique. Flat light: Privacy & Cybersecurity Merging http://www.lawfareblog.com/flat-light-data-protection-disoriented-policy-practice Privacy & Cybersecurity Merging One of the most interesting white papers so far, written by Immuta’s Chief Privacy Officer Andrew Burt. This paper covers critical topics on privacy and cybersecurity, as well as how these topics have been changing as we move into massive scale production systems. This paper also provides great historical case studies that provide an insight into how important conceptual shifts and standardisation of these concepts will be. 
Highlights of AI O’Reilly Beijing http://www.oreilly.com/ideas/highlights-from-ai-beijing-2019 Highlights of AI O’Reilly Beijing O’Reilly’s Ben Webb brings us a great high level overview of some of the keynotes at the AI O’Reilly Beijing. Some of these include the future of hiring in AI, RISELab innovations, breakthroughs, data orchestration, AI in retail, data structures and more. 18 Impressive GAN Applications http://machinelearningmastery.com/impressive-applications-of-generative-adversarial-networks/ 18 Impressive GAN Applications Great post from Machine Learning Mastery that dives into the more practical side of GANs. This article covers use-cases of GANs across various datasets of image and text types.
31 End-to-end XAI in production http://www.youtube.com/watch?v=vq8mDiDODhc End-to-end XAI in production Our PyData London talk last week is now on YouTube, where we spoke about end-to-end machine learning explainability techniques with an emphasis on production. During this talk we covered the tools and approaches you can take to tackle machine learning explainability in data and models. We also introduced the concept of production ML explainer design patterns which abstract the XAI techniques so they can be used at scale across live models in production. Causal inference for UX http://eng.uber.com/causal-inference-at-uber/ Causal inference to improve UX Excellent and comprehensible post by Uber Engineering on how they use Causal Inference techniques to improve user experience. In this post they introduce the importance of the topic, as well as a deep dive on key causal inference techniques including: complier average causal effect (CACE), CUPED / diff-in-diff, propensity score matching (IPTW), heterogeneous treatment effects / uplift modeling, quantile regression and mediation modelling. An excellent post that covers the theoretical and practical perspectives of causal inference techniques. ML in enterprise http://www.oreilly.com/ideas/managing-machine-learning-in-the-enterprise-lessons-from-banking-and-health-care Managing ML in enterprise Excellent post by the O’Reilly team covering key lessons learned in the field from managing production machine learning systems in the financial and healthcare sectors. Historically these two sectors (and the financial sector in particular) tend to lead the way on technology adoption, so it’s often great to take some of the learnings obtained introducing innovations to these sectors, and abstract them to help introduce innovations into other sectors (such as transport, energy, construction, etc). 
Adversarial Robustness http://towardsdatascience.com/evasion-attacks-on-machine-learning-or-adversarial-examples-12f2283e06a1 Intro to Adversarial Examples Great high level introduction to the topic of Adversarial Robustness, with case studies and examples that showcase the importance of this branch of techniques. The post breaks down evasion attacks into five separate classes: gradients, confidence scores, hard labels, surrogate models and brute force. The GANs story so far http://blog.floydhub.com/gans-story-so-far/ The GAN Story so far Great article which provides a very comprehensible overview of the recent history of Generative Adversarial Networks. GANs, DCGANs, CGANs, CycleGANs, CoGANs, ProGANs, WGANs, SAGANs, BigGANs and StyleGANs - GANs EVERYWHERE!
32 End-to-end ML Pipelines in Enterprise http://www.oreilly.com/ideas/enabling-end-to-end-machine-learning-pipelines-in-real-world-applications E2e ML Pipelines in Enterprise IBM Principal Engineer Nick Pentreath and Ben Lorica dive into end-to-end machine learning pipelines, and discuss the challenges and opportunities in unlocking the potential of machine learning at scale. During this conversation, they cover fundamental topics not only in the training phase of machine learning but also focus on the deployment, monitoring and governance of machine learning systems at scale. An excellent overview + deep dive on an incredibly important topic. Code-free deep learning with Ludwig http://eng.uber.com/introducing-ludwig/ Code-free deep learning Ludwig Uber engineering is making deep learning more accessible through their open source code-free deep learning framework called Ludwig. As they mention, Ludwig is unique in its ability to help make deep learning easier to understand for non-experts and enable faster model improvement iteration cycles for experienced machine learning developers and researchers alike. Of course, with great power comes great responsibility, so we recommend any newcomers to the deep learning world to check out and follow our 8 principles for responsible Machine Learning. ML Reidentification and Privacy Issues http://www.nature.com/articles/s41467-019-10933-3 ML Reidentification and Privacy An incredibly insightful research paper which could have a significant impact on privacy, where they propose a method that can accurately estimate the likelihood of a specific person being correctly re-identified, even in a heavily incomplete dataset. Some of their results are impressive: “Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes”. 
With the rise of privacy protection laws such as GDPR, it will be important to consider these kinds of loopholes and semi-indirect (but still fully relevant) challenges. How OSS and AI will take us to the moon http://venturebeat.com/2019/07/20/how-open-source-and-ai-can-take-us-to-the-moon-mars-and-beyond/ OSS + AI will take us to the Moon Great positive take by VentureBeat on two of the biggest game changers in technology in 2019, open source and artificial intelligence. In this article they cover openness and collaboration, the spaceborne computer example, open source software+hardware and augmenting human capability with AI. Managing large-scale distributed systems http://blog.pragmaticengineer.com/operating-a-high-scale-distributed-system/ Large Scale Distributed Systems Yet another great article by Uber Engineering Manager Gergely Orosz on “Operating a Large, Distributed System in a Reliable Way”. In this article Gergely takes us through a high level overview of the key themes he has identified managing the payments system at Uber. In this post he covers fundamental (and super interesting) concepts including Monitoring, Oncall, Anomaly Detection, Alerting, Outages, Incident Management Processes, Postmortems, Incident Reviews, a Culture of Ongoing Improvements, Failover Drills, Capacity Planning & Blackbox Testing and more (much, much more). Stream Processing OSS Libraries https://github.com/ethicalml/awesome-production-machine-learning Stream Processing OSS Libraries
33 Data streaming at scale with Brooklin http://engineering.linkedin.com/blog/2019/brooklin-open-source Brooklin for data streaming LinkedIn open sources Brooklin, a distributed service for streaming data in near real-time at scale, currently powering over 2 trillion messages per day at LinkedIn. Data streaming is truly paving the way for real-time machine learning use cases. This is also a very interesting project, primarily as it doesn’t aim to replace OSS projects like Kafka; instead it sits on a higher level providing a primary solution for streaming across various stores and messaging systems (Kafka, Azure Event Hubs, Kinesis, etc). In this post, they showcase how Brooklin can be used as a streaming bridge across these heterogeneous messaging services, as well as mirroring Kafka functionality, and beyond. Tensorflow AI interpretability http://blog.sicara.com/tf-explain-interpretability-tensorflow-2-9438b5846e35?gi=6a3bde675a49 Tensorflow AI Interpretability The TensorFlow ecosystem gains a new entrant in the ML model interpretability arena with tf-explain - a library that offers interpretability methods to understand model predictions. The library is adapted to the TensorFlow 2.0 workflow, using the tf.keras API where possible, providing: 1) heatmap visualisations & gradient analysis, 2) off-training & keras.callback usages, and 3) TensorBoard integration. Smart City revolution with OSS LiDAR http://v-sense.scss.tcd.ie/dublincity/ LIDAR and its smart applications A project that would have been considered a dream for smart-city enthusiasts has been fully open sourced. The Urban Modelling Group at University College Dublin has captured a major area of Dublin city centre (around 5.6km^2) and made it available as the densest LiDAR point cloud and imagery dataset (260m points out of 1.4b are labelled). 
All hail the (AI) algorithm http://interactive.aljazeera.com/aje/2019/hail-algorithms/index.html All hail the (AI) algorithm A five-part video series released by Al Jazeera covering high level concepts that break down some of the biggest challenges in AI through a mainstream media lens. The five parts break down into: 1) Trust & bias, 2) Big Tech monopolies, 3) Misinformation, 4) Surveillance, and 5) Regulation around data & privacy. Machines gone wrong http://machinesgonewrong.com/#start Machines (and AI) Gone Wrong An excellent project that tries to simplify one of the most popular concepts around machine learning. “Machines gone wrong” covers foundational topics in the challenges of AI such as an explanation of AI ethics, why AI is different when talking about these issues, as well as some key themes like algorithmic bias. Stream Processing OSS Libraries https://github.com/ethicalml/awesome-production-machine-learning Stream Processing OSS Libraries
34 A survey on the state of AutoML http://arxiv.org/abs/1908.00709 A survey on the state of AutoML An extensive and interesting deep dive into the AutoML ecosystem, together with the techniques, tools and challenges that this area in machine learning will face, including end-to-end pipelines, interpretability, reproducibility and beyond. Got Speech? Voice Applications 101 http://www.oreilly.com/ideas/got-speech-these-guidelines-will-help-you-get-started-building-voice-applications Got speech? Voice Applications O’Reilly’s Ben Lorica and Yishay Carmiel have put together a 101 for building voice applications. They break down the voice applications into dialogue vs monologues and human2human vs human2machine, and into some of the challenges, applications, and potential. Cloud native kubernetes semantic search http://gnes.ai/ Cloud native semantic text search A really exciting new open source framework brings industrial NLP functionality for text search. GNES [jee-nes] is a cloud-native semantic search system based on deep neural network. It enables large-scale index and semantic search for text-to-text, image-to-image, video-to-video and any content form. Certainly an exciting space for content search at massive scale. Learning from adversaries in AI http://www.oreilly.com/ideas/learning-from-adversaries Learning from adversaries In machine learning adversarial attacks are becoming an increasing worry in production-level applications. This interesting article proposes that adversarial images aren’t 100% a problem—they’re an opportunity to explore new ways of interacting with AI. This is an interesting space, as some of the work we have been doing has provided some correlations around explainability/interpretability techniques in AI, and adversarial attacks, where the ultimate objective is to reverse-engineer models (with different ultimate outcomes of course). 
Python-compatible spreadsheets for data science http://hackernoon.com/introducing-grid-studio-a-spreadsheet-app-with-python-to-make-data-science-easier-tdup38f7 Python-compatible spreadsheets An engineering approach to a data science challenge - Rick Lamers has built a Python-first open source spreadsheet application. In this blog post he shows how you are able to leverage the full power of spreadsheets, together with functionality that Python makes available, such as scraping, performing pre-/post-processing, etc. Really interesting project which does seem to offer quite a lot of potential for expansion. Neural Network Search AutoML http://github.com/ethicalml/awesome-production-machine-learning#neural-architecture-search OSS: NN Architecture AutoML The theme for this week’s featured ML libraries is Neural Network Architecture AutoML, which you can find in our Production Machine Learning ecosystem list. These libraries are an incredibly exciting addition that fall in our Responsible ML Principle #4. The four featured libraries this week are:
35 The future of data engineering http://riccomini.name/future-data-engineering The future of Data Engineering Great insight into the near future of data engineering covering the current transformations that this role has been undergoing, as well as some of the trends that will become more prominent in the immediate and medium term for data engineers. The article covers the transition from batch into realtime, the exponential increase of connectivity, automation, decentralisation and beyond. A self study journey towards machine learning http://towardsdatascience.com/6-techniques-which-help-me-study-machine-learning-five-days-per-week-fb3e889fad80 From self study to ML Engineering Former Apple engineer shares his experience transitioning into a machine learning full time role. In this article he shares 6 techniques that helped him study machine learning 5 days a week. This includes reducing search space, fixing your environment, setting up your system, working smart, embracing being stuck and the 3-year-old principle. 12 NLP researchers to follow http://www.kdnuggets.com/2019/08/nlp-researchers-practitioners-innovators-should-follow.html 12 NLP Researchers to Follow If you are interested in NLP, KDNuggets has put together an excellent list of 12 NLP researchers to follow to stay up to date with some of the latest cutting edge tools and research in this field. The list includes some really great practitioners and researchers, including Explosion (and SpaCy) cofounders Matthew Honnibal & Ines Montani, DeepMind Researcher Sebastian Ruder, FastAI founder Jeremy Howard and more. N-shot learning with small-data http://blog.floydhub.com/n-shot-learning/ N-Shot Learning with Small Data N-Shot Learning is a very exciting area of research which focuses on tackling challenges with “Small Data”. That is, using n-data examples. These can range from zero-shot learning - zero examples - to one-shot, to few-shot, etc. 
This is an interesting challenge as it requires a lot of the domain expertise and in an abstract sense the concept of intuition to be embedded in the algorithms to be able to abstract insights that would otherwise be missed by vanilla deep learning (or even traditional machine learning) algorithms. Causal inference with counterfactuals http://www.inference.vc/causal-inference-3-counterfactuals/ Causal Inference: Counterfactuals Excellent piece by Twitter’s Ferenc Huszár on Counterfactuals. Counterfactuals is an incredibly interesting technique that falls within the causal inference family. This technique has been growing in popularity especially due to its intuitive explanatory power. From a high level definition, a counterfactual asks the question around “what would be the minimum change that I could make to change the outcome”. In machine learning this has great predictive power. Neural Network Search AutoML http://github.com/ethicalml/awesome-production-machine-learning#neural-architecture-search OSS: NN Architecture AutoML The theme for this week’s featured ML libraries is Neural Network Architecture AutoML, which you can find in our Production Machine Learning ecosystem list. These libraries are an incredibly exciting addition that fall in our Responsible ML Principle #4. The four featured libraries this week are:
36 The state of federated machine learning http://arxiv.org/abs/1908.07873 The state of Federated Learning Federated learning involves training machine learning models distributed over remote devices or siloed data centers (this could be mobile phones or even hospitals, while keeping data localized). Training in heterogeneous and potentially massive networks introduces a lot of new challenges. This great article dives into the unique characteristics and challenges of federated learning, together with current approaches and future work. Data science best practices http://syslog.ravelin.com/data-science-best-practices-843c9693db8 Data Science Best Practices It’s easy and fun to ship a prototype, whether that’s in software or data science. What’s much, much harder is making it resilient, reliable, scalable, fast, and secure. This article brings some of the best practices identified by the team at Ravelin. Their data science guidelines include: 1) all starters will build, train and deploy production models within a week, 2) leverage humans whilst automating manual work, 3) deploy models incrementally and often, 4) end users will never notice a model change other than improved results. Career progression of a data scientist http://medium.com/sequoia-capital/progression-of-a-data-scientist-e1bebf8c8420 Progression of a Data Scientist Sequoia has put together a great overview of the career progression of a data scientist - specifically they examine what characteristics senior product data scientists have relative to junior ones, and why a healthy data-informed company should invest in the development of their data scientists. The article covers the key “five core skills” of a data scientist, how data scientists advance, common questions in data science, and key takeaways. 
A 3 year retrospective on observability http://thenewstack.io/observability-a-3-year-retrospective/ Observability 3 year retrospective Incredibly insightful deep dive into the topic of observability, which discusses terminology, challenges and key insights. It emphasises that metrics do not equal observability, and provides key terms such as cardinality for system insights, and covers some of the present and future of this very important topic. Introduction to the transformer architecture http://rubikscode.net/2019/07/29/introduction-to-transformers-architecture/ Intro to transformer architecture Transformers are popular (and effective) sequence-to-sequence models used for language modeling, machine translation, image captioning and text generation. This article covers key concepts, including RNNs, LSTMs, attention, self-attention and then covers how these all fit together in the transformer architecture. Industry-strength NLP Frameworks http://github.com/ethicalml/awesome-production-machine-learning#industrial-strength-nlp OSS: Industry-strength NLP   The theme for this week’s featured ML libraries is Industry-strength NLP, and we’re happy to announce that we have added over 10 new libraries to the section. The four featured libraries this week are:
37 Real Time NLP & ML with Spacy, Spark and Kafka http://infoshare.pl/news/one,66,254,126,infoshare-alejandro-saucedo-real-time-nlp-machine-learning-with-spark-streaming-kafka-and-spacy.html Real Time NLP: Spacy and Kafka The need for real time machine learning use-cases in production is increasing. This talk provides practical insight on how to build real time data streaming machine learning pipelines that are production ready, covering a case study performing automated content moderation on Reddit comments in real time. The talk dives into fundamental concepts of stream processing such as windows, watermarking and checkpointing, and shows how to use frameworks like Kafka, Spacy and Spark Streaming. Becoming an ML practitioner http://www.oreilly.com/ideas/becoming-a-machine-learning-practitioner Becoming an ML practitioner O’Reilly comes with an awesome podcast, talking with Kesha Williams, technical instructor at A Cloud Guru, a training company focused on cloud computing. As a full stack web developer, Williams became intrigued by machine learning and started teaching herself the ML tools on Amazon Web Services. Fast forward to today, Williams has built some well-regarded Alexa skills, mastered ML services on AWS, and has now firmly added machine learning to her developer toolkit. Cracking the black box with interpretability techniques http://opendatascience.com/cracking-the-box-interpreting-black-box-machine-learning-models/ Cracking the black box (XAI) Great overview on the concept, techniques and key areas in machine learning interpretability. 
This article covers several classes of interpretability methods such as model-specific vs model-agnostic, techniques such as partial dependency plots, permutation importance, anchors, and more. Notebook innovation (and infrastructure) at Netflix http://medium.com/netflix-techblog/notebook-innovation-591ee3221233 Notebook innovation at Netflix Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis. This post provides a very interesting insight on the infrastructure and processes that Netflix has introduced internally around Notebooks and data. This article covers the processes that involve the roles of analysts, data scientists and data engineers, as well as the challenges with data access, templates and infrastructure. How AI solves scale complexities https://thenewstack.io/how-ai-solves-the-kubernetes-complexity-conundrum/ How AI solves scale complexities How AI Solves the Kubernetes Complexity Conundrum. A really interesting article from The New Stack that dives into the challenges and complexities that the introduction of Kubernetes has brought to the tech world, and makes a high level case on how AI will help manage some of the key complexities that the scale of Kubernetes-based systems will entail. Industry-strength NLP Frameworks http://github.com/ethicalml/awesome-production-machine-learning#industrial-strength-nlp OSS: Industry-strength NLP   The theme for this week’s featured ML libraries is Industry-strength NLP, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
38 Continuous delivery for machine learning http://martinfowler.com/articles/cd4ml.html Continuous Delivery for ML Great article from martinfowler.com tackling the challenge of automating the end-to-end lifecycle of Machine Learning applications. As we have seen, the process for developing, deploying, and continuously improving them is more complex compared to more traditional software. This article proposes Continuous Delivery for Machine Learning (CD4ML), which is explained as the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications. What is AIOps and why you should care http://thenewstack.io/what-is-aiops-and-why-you-should-care/ AIOps and why you should care The first of a two-part series on the emerging concept of AIOps. Whilst a lot of articles that we reference focus on the productionisation techniques and hence “DevOps for ML”, the keyword of AIOps is increasingly used to refer to applying ML to DevOps. Key examples of this would be to leverage concept drift, outlier detection, explanations and beyond. Although there is still ambiguity with these terms and concepts, here The New Stack provides a deep dive on these increasingly discussed topics. Language models as knowledge bases https://arxiv.org/abs/1909.01066 Lang models as knowledge bases Interesting perspective on how the potential of language models can be generalised. In this paper, the authors argue that language models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as “fill-in-the-blank” cloze statements. They suggest that language models have advantages over structured knowledge bases, and they propose the “LAMA (LAnguage Model Analysis) probe” to test the factual and commonsense knowledge in language models. They have also made their code available open source. 
AWS Data Security best practices http://datafloq.com/read/best-practices-for-data-security-in-aws/6804 AWS data security best practices In AI data is key, and with growing use cases it’s key to ensure your security is aligned with best practices. Great article that outlines (as more of a reminder) a set of simple principles to follow as best practices when handling data in AWS (although it could also apply to other clouds). A smooth approach to production ML https://maxhalford.github.io/blog/a-smooth-approach-to-putting-machine-learning-into-production/ A smooth approach to prod ML Another great article that dives into the challenges of dealing with production machine learning systems. In this article there are multiple areas discussed, including a dissection of the “lambda architecture”, online learning, and a few other topics. The article also shares some lessons learned, and covers some hands on examples using a really interesting library called “creme”. ML Explainability libraries https://github.com/ethicalml/awesome-production-machine-learning OSS: Explainability Libraries   The theme for this week’s featured ML libraries is ML Explainability, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
39 Management insights for data science http://towardsdatascience.com/everything-a-data-scientist-should-know-about-data-management-6877788c6a42 Management for Data Science Management in data science teams has become a big challenge in organisations looking to introduce and grow their data science teams. This article provides an excellent ground up overview of management, from a practical perspective. It first introduces high level concepts you have probably come across, but then dives into a hands on case study building end-to-end data science infrastructure for a recommendation app startup, detailing the stakeholders involved, together with interactions and processes. Outlier selection vs detection http://koaning.io/posts/outliers-selection-vs-detection/ Selection vs Detection of outliers Anomaly detection algorithms have been gaining popularity due to their practical use beyond traditional areas like fraud detection. In this context, this article does a great job defining nuanced terminology involved in this area - namely the difference between selection versus detection of outliers. The blog provides code that allows you to leverage sklearn’s algorithms to approach these challenges with a “one size fits all” approach, that encompasses a pattern that can generalise into other techniques that could be useful in similar contexts. 10 rules for sharing notebooks http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007007 Rules for sharing notebooks With a fast increase in adoption of Jupyter (and other) notebook technologies, there has also been an increase in complexity around collaborating and maintaining notebooks. Last week we showed how Netflix tackles their challenges with data and software infrastructure. This week, we see another great piece that instead proposes 10 rules you can follow as an individual to make your notebooks easier to digest, maintain and extend. 
AutoML and AI at Google http://changelog.com/practicalai/55 AutoML and AI at Google Large scale use of machine learning has introduced new complexities - with that, there has been a large amount of manual work that comes when finding the best parameters for an ML algorithm (such as a neural network, random forest classifier and beyond). Practical AI brings us an excellent podcast from Google’s Sherol Chen diving into AutoML and AI at Google, covering some of the most popular topics in machine learning at this time. 5 sampling algorithms for everyone http://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c 5 sampling algos for everyone Data imbalance and representability of training vs production data often becomes a huge challenge, and is certainly often quite a key point when the topic of “algorithmic bias” is raised. Here is a great blog post that introduces 5 common sampling techniques that every ML practitioner should be familiar with. ML Explainability libraries https://github.com/ethicalml/awesome-production-machine-learning OSS: Explainability Libraries   The theme for this week’s featured ML libraries is ML Explainability, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
40 Tricking machine learning Malware classifiers https://towardsdatascience.com/evading-machine-learning-malware-classifiers-ce52dabdb713 Tricking ML Classifiers The topic of cybersecurity in machine learning has seen an increase in activity in the community due to its critical nature around production systems. This blog post covers a fascinating competition that took place at DEFCON, where participants were tasked with tricking an ML classifier trained to detect malware. In this article the author provides an insight on how the challenge was applied together with the techniques used to succeed. A new paradigm for ML deployment https://www.oreilly.com/radar/machine-learning-requires-a-fundamentally-different-deployment-approach/ ML deployment paradigm Production machine learning systems have proven that the nuanced challenges that are faced when deploying machine learning require a new paradigm. O’Reilly’s Mike Loukides does a fantastic job in his latest article to provide an overview to the topic of machine learning deployment, together with insights on how this challenge is currently being tackled. Five must-know graph algorithms https://towardsdatascience.com/data-scientists-the-five-graph-algorithms-that-you-should-know-30f454fa5513 Five must-know graph algorithms Although most of the carefully curated datasets that you may come across online may be on relational or key-value stores, there has been an ever-increasing interest in graph datasets, as most of the data we interact with on a regular basis will tend to have more complex, and often graph-like structures. This article provides a comprehensible and non-exhaustive list of graph algorithms to get acquainted with - these include an intuitive explanation, insights of where they may be relevant and an example code implementation. 
Survey on fairness and bias in ML https://arxiv.org/abs/1908.09635 Survey fairness and bias in ML As larger and more critical datasets (and decisions) become part of the machine learning end-to-end production workflow, the challenges with statistical and societal bias/fairness become more complex. This survey provides a very comprehensible deep dive on the concepts and taxonomies around the concepts of “types of bias”, “types of discrimination”, and “types of fairness”, together with how these interact with the different types of machine learning techniques. Google’s OSS differential privacy library https://developers.googleblog.com/2019/09/enabling-developers-and-organizations.html Google’s OSS differential privacy The current implications of data privacy and trust have led to a revival of interest in extremely fascinating research areas that have existed for decades. This one in particular is differential privacy, a technique that allows for data to be anonymised in a way that still leaves statistical properties which allow for processing on top of the anonymised data, which can lead to improvements in privacy. Google has released a C++ library of ε-differentially private algorithms, which can be used to produce aggregate statistics over numeric data sets containing private or sensitive information. Open source differential privacy libraries https://github.com/ethicalml/awesome-production-machine-learning OSS: Privacy Preserving ML   The theme for this week’s featured ML libraries is Privacy Preserving Machine Learning libraries, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
41 One data engine to rule them all at Google https://blog.acolyer.org/2019/09/11/procella/ One data engine to rule them all The world of data processing is full of different engines with nuanced differences. Google already has numerous data processing engines including Dremel, Mesa, Photon, F1, PowerDrill and Spanner - so, why did they need yet another data processing engine? Apparently because they felt they had too many data processing systems, and wanted to unify them all. Because of this, Google released a fascinating new paper that introduces their new SQL Engine called Procella, which aims to unify serving and analytical data at YouTube. This blog post provides a great insight on the architecture, objectives and use-cases of this new engine. Tackling data processing at scale with Presto foundation https://www.linuxfoundation.org/uncategorized/2019/09/facebook-uber-twitter-and-alibaba-form-presto-foundation-to-tackle-distributed-data-processing-at-scale/ Tackling data processing at scale The Linux Foundation is leading yet another fantastic initiative: Facebook, Uber, Twitter and Alibaba join forces to form the “Presto Foundation” to tackle distributed data processing at scale. This is great news as the neutral governance will enable the members to contribute to the project to tackle some of the bigger challenges dealing with massively distributed data processing. GitHub releases the ImageNet for Code https://www.wandb.com/articles/codesearchnet The ImageNet for Code GitHub and Weights & Biases have collaborated to put together a fantastic contribution to the machine learning ecosystem that could bring fantastic innovations to the software engineering industry itself and trigger more similar competitions. 
This consists of a massively large dataset containing 6 million functions, 2 million of them documented, from open source projects on GitHub in 6 languages (Go, Java, JavaScript, PHP, Python and Ruby) with the objective of improving semantic code search. They are also launching the CodeSearchNet challenge, which is a benchmark that will track and compare models trained on the CodeSearchNet dataset. Six lessons learned debugging a scaling problem at Gitlab https://about.gitlab.com/2019/08/27/tyranny-of-the-clock/ Wisdom from debugging at scale As systems grow in complexity, the approaches towards debugging issues also become more complex as the issues can be on code workflows, but also in data inconsistencies, network issues, infrastructure problems and beyond. GitLab has put together a fantastic deep dive on how they were able to resolve one of their issues at massive scale, together with a set of lessons they learned from it. Reimagining Experimentation Analysis at Netflix https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21 Netflix reimagining experiments As your systems and teams become larger and more complex, the need not only to experiment efficiently but to be able to track, share and reproduce experiments becomes more critical. Netflix has put together a great post where they outline their approach to re-thinking the way they track and manage experiments internally. Traditionally they have been using ABlaze, which is their centralised A/B testing platform, but now with their new platform they are able to perfectly recreate analyses on notebooks. Open source differential privacy libraries https://github.com/ethicalml/awesome-production-machine-learning OSS: Privacy Preserving ML   The theme for this week’s featured ML libraries is Privacy Preserving Machine Learning libraries, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
42 Serverless for ML in Kubernetes with Kubeflow https://www.youtube.com/watch?v=hGIvlFADMhU Serverless for ML in Kubernetes Serving machine learning models at scale is one of the biggest challenges. The KFServing project aims to tackle this. KFServing is a cross-industry open source collaboration (currently led by multiple technology companies including Seldon, Google, Microsoft, IBM and Bloomberg) with the objective to develop a fully fledged machine learning serving and orchestration framework in Kubernetes. This initiative is incredibly exciting, because it has several tech leaders collaborating on defining what production ML could look like, and working towards abstracting some of the very complex and heterogeneous production ML terminology into standardised protocols and interfaces. When a model is too big for production http://nlathia.github.io/2019/09/29/Large-NLP-in-prod/ When a model is too big for prod Machine learning models that are trained with very large datasets introduce new complexities, including large memory usage, heavy compute, black box constraints and more. The team at Monzo has put together a great overview that provides an outline of the key concepts that are often taken into consideration when moving a model into production, and dives into their use-case leveraging the HuggingFace library. Optimising production machine learning at Apple https://arxiv.org/abs/1909.05372 Optimising production machine learning at Apple Turn your projects into visual apps with Streamlit https://towardsdatascience.com/coding-ml-tools-like-you-code-ml-models-ddba3357eace Turn your ML into interactive apps Historically in data science, converting an idea into an interactive application has taken a non-trivial amount of time. 
A new tool called Streamlit provides a way to easily build interactive applications from complex data science tools without the need to deal with the underlying infrastructural complexities (wrapping the backend in a microservice, exposing endpoints, building a UI to consume them, etc). Really awesome tool, definitely recommend checking it out. Modern applications at Amazon https://www.allthingsdistributed.com/2019/08/modern-applications-at-aws.html Modern Applications at AWS As an organisation scales and teams become more distant, there is a risk for innovation to stagnate, and a lot of the challenges in the organisational structure start to reflect in the product/service interfaces - often for the worse. Amazon provides an interesting retrospective view of how they have tackled this to be able to build modern applications at Amazon Web Services. Machine learning deployment & orchestration libraries https://github.com/ethicalml/awesome-production-machine-learning OSS: ML Deployment Libraries   The theme for this week’s featured ML libraries is Machine learning Deployment and Orchestration Libraries, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
43 The Institute for Ethical AI & ML joins the Linux Foundation https://lfai.foundation/blog/2019/10/09/the-institute-for-ethical-ai-and-machine-learning-joins-lf-ai/ IEML joins the Linux Foundation The Institute for Ethical AI & Machine Learning (the non-profit behind this newsletter) is thrilled to be joining the Linux Foundation’s LFAI as an organisational member! Some of our core work has already made its way to LF initiatives, including large sections of our Awesome Production ML List contributing to the fast-growing LF AI Landscape. We will be involved across various workstreams within the Linux Foundation, and will be contributing across the board to their ML initiatives. Exciting times ahead, and a lot of even more exciting news in future newsletter editions! Black holes, new cures and AI Ethics with NumFocus http://numfocus.org/case-studies Case studies with NumFocus We’ve heard of incredible achievements in the research community, ranging from creating the world’s first picture of a black hole, to finding cures for diseases. NumFocus is the organisation behind the open source tools that have been enabling some of these great achievements, including NumPy, SKlearn, Jupyter, Pandas, and many more. Recently NumFocus launched an initiative where they created a set of case studies where they showcase achievements accomplished using NumFocus tools, including the black hole photograph, curing diseases and introducing transparency into AI algorithms. Machine learning for business & operational intelligence http://www.oreilly.com/ideas/machine-learning-for-operational-analytics-and-business-intelligence ML for business & ops intelligence O’Reilly Chief Scientist Ben Lorica comes back with yet another great podcast where he speaks with Peter Bailis, Co-founder of Stanford’s DAWN Lab and CEO of Sisu, a startup that is using machine learning to improve operational analytics. 
In this podcast they dive into the role of ML in operational analytics, ML Benchmark initiatives (such as MLPerf and DAWNBench), and trends in tools for the lifecycle of ML in the enterprise. The FairML book free resources https://fairmlbook.org/about.html The open FairML Book With the rise of AI, learning machine learning concepts has become critical. However, there is a significant lack of resources around the social challenges which ML practitioners may face. Three researchers from Cornell, Berkeley and Princeton came together to write a non-exhaustive but comprehensible book that contains key insights to consider “Fairness” as core throughout the development of ML-related systems, as opposed to as an afterthought. The book is still work in progress, but there are a couple of key chapters available for free. AI ethics - whose ethics? http://www.meetup.com/Ethics-Lunch-Group-Rise-London/events/265585657/ AI Ethics - whose ethics? As AI becomes more prevalent in society, we face tougher challenges around privacy, security and trust of systems. These challenges often create scenarios that may raise ethical questions which practitioners and leaders will have to tackle. Because of this, learning and studying the underlying philosophical concepts that have been built throughout the millennia could provide incredibly positive results. We started a lunch group in London to dive into these topics once a month. Last session Dr. Ryan Dawson provided an introduction to Aristotle’s Nicomachean Ethics, which was followed by a discussion around its relevance in today’s connected world. Next session’s topic is “Whose Ethics?” where we’ll be diving into the similarities and differences of Western and Eastern philosophy and its modern relevance into AI Ethics. 
Machine learning deployment & orchestration libraries https://github.com/ethicalml/awesome-production-machine-learning IEML joins the Linux Foundation The Institute for Ethical AI & Machine Learning (the non-profit behind this newsletter) is thrilled to be joining the Linux Foundation’s LFAI as an organisational member! Some of our core work has already made its way to LF initiatives, including large sections of our Awesome Production ML List contributing to the fast-growing LF AI Landscape. We will be involved across various workstreams within the Linux Foundation, and will be contributing across the board to their ML initiatives. Exciting times ahead, and a lot of even more exciting news in future newsletter editions!
44 Awesome Artificial Intelligence Guidelines List http://github.com/ethicalml/awesome-artificial-intelligence-guidelines Awesome AI Guidelines List As AI systems become more prevalent in society, we face bigger and tougher societal and ethical challenges. Recently there has been an increase in content that attempts to address these challenges in the form of “Principles”, “Ethics Frameworks”, “Checklists” and beyond. Navigating through so many resources is not easy, which is why we created and now maintain “The Awesome AI Guidelines List”, a repository which maps the ecosystem of guidelines, principles, codes of ethics, standards, regulation, etc. related to AI 🚀 If there is any guideline or framework which is not outlined please let us know or feel free to submit an issue / pull request! Simplifying Model Management with MLflow http://www.youtube.com/watch?v=MSUTaCBhD7A MLflow simplifying model mgmt When dealing with production machine learning systems, we may face challenges that we won’t see in the experimentation stage. One of the key challenges is to manage a large number of machine learning models, potentially from various users, and being able to compare them and upgrade them into staging and production environments. Databricks announced a new ML Management feature which tackles this issue by providing a workflow system to version and manage models across multiple stages. Choosing charts that everyone understands http://sloanreview.mit.edu/article/choose-charts-everyone-understands/ Choosing intuitive visualisations When dealing with challenges that involve a lot of data, it’s often hard to choose the best visualisations to use at different stages of the project. This post provides a set of best practices on how to approach this challenge, suggesting how to leverage complex charts for data analysis, and classic charts for communicating data. Furthermore they provide a case study / example of what this looks like in practice in their team. 
Machine Learning Explainability at AI O’Reilly http://github.com/EthicalML/explainability-and-bias/ ML Explainability at AI O’Reilly When training complex models like neural networks, we obtain advantages in accuracy, but we face tradeoffs on explainability. Fortunately we have seen a recent increase in tools and methods that you can use to extract explanations from various types of machine learning models. Last week we gave a talk about the approaches and tools you can use to introduce interpretability techniques into your experimentation workflow. Furthermore we show how you can leverage explainability techniques in the production stage of your machine learning lifecycle. 6-step process for building ML projects http://towardsdatascience.com/a-6-step-field-guide-for-building-machine-learning-projects-6e4554f6e3a1 Machine learning in 6 steps There are many ways in which you can tackle building a machine learning model. This post proposes a 6-step approach towards building any machine learning model. It provides a reasonable breakdown that covers many of the important pieces, from defining the problem, to identifying risks and high level, to questions around interpretability, tuning & inference time. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning OSS: ML Deployment Libraries   The theme for this week’s featured ML libraries is Machine learning Deployment and Orchestration Libraries, and we’re happy to share brand new libraries into that section. The four featured libraries this week are:
45 Deep fake detection challenge http://deepfakedetectionchallenge.ai/ Deep fake detection challenge Lately the topic of deep fakes has been raising more concerns due to an increasing number of high profile stories. The use of deep fakes for malicious use-cases seems to have huge potential to cause damage in society. A very interesting new initiative has been launched by the Partnership on AI, Microsoft, AWS and Facebook. This initiative invites researchers and practitioners to participate in a competition to detect deep fakes. The dataset is now released, and the competition starts in December. Human knowledge to Improve AI http://bair.berkeley.edu/blog/2019/10/21/coordination/ Human knowledge to improve AI In reinforcement learning, agents can often be trained efficiently by running them with other agents throughout a significant number of iterations. Although this may be quite efficient, and may lead to hyper-optimised results, these may not be optimal when the agents have to interact with humans. This Berkeley project has set out to explore this topic in more detail by looking at how agents can be improved when trained with human interaction. Neural text search data flow http://hanxiao.github.io/2019/10/18/GNES-Flow-a-Pythonic-Way-to-Build-Cloud-Native-Neural-Search-Pipelines/ Neural text search data flow A very interesting project called GNES is tackling semantic search across image, text and beyond. As its popularity and use have been increasing, there have been new challenges that the open source team has had to deal with. One of these key challenges has been data flow across complex jobs. This is why the GNES team released GNES flow, a framework that brings DAG-based data flow to GNES. 
The causal inference book http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ The Causal Inference Book An incredible resource on all things causal inference, which aims to support researchers and practitioners from various backgrounds, including epidemiologists, statisticians, psychologists, economists, sociologists, political scientists, computer scientists and beyond. The book is divided into 3 parts of increasing difficulty: causal inference without models, causal inference with models, and causal inference from complex longitudinal data. Netflix open sources polynote http://medium.com/netflix-techblog/open-sourcing-polynote-an-ide-inspired-polyglot-notebook-7f929d3f447 Netflix Open Sources Polynote In previous editions of the MLE Newsletter we have covered how Netflix has built advanced infrastructure and introduced processes which have allowed for production experimentation at scale. It is great to see that Netflix is now open sourcing parts of their internal infrastructure, starting with Polynote - an experimental polyglot notebook environment which supports Scala and Python (with or without Spark), SQL, and Vega. If you are interested in other open source data science notebooks, we have an entire section in our Production ML list which is worth checking out. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning OSS: Data Science Notebooks   The theme for this week’s featured ML libraries is Data Science Notebooks, and we’re happy to share brand new libraries into that section to showcase tools beyond the good old Jupyter Notebooks. The four featured libraries this week are:
46 6 lessons learned at Booking.com http://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com 6 lessons learned at Booking.com Booking.com has put together 6 lessons learned from building 150 models that were successful in production inference. This has come from their currently massive user-base which consists of millions of accommodation providers and millions of guests. Most of their use-cases consist of advanced and specialised recommender systems, with a constraint of massive throughput in processing. Their main conclusion is that an iterative, hypothesis driven process, integrated with other disciplines, was fundamental to build 150 successful products enabled by Machine Learning. EurNLP 2019 videos released http://www.facebook.com/pg/eurnlp/videos/ EurNLP 2019 videos released The first annual EurNLP Summit took place in London on October 11th. This was a great opportunity to foster discussion and collaboration between NLP researchers in academia and industry. The talks from this event were recorded and are all available at their Facebook page. Consistency of AI Summarization http://arxiv.org/abs/1910.12840 Consistency of AI Summarization A very interesting paper, which proposes that current metrics for assessing summarization algorithms do not account for whether summaries are factually consistent with source documents. The paper proposes a weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary. The proposed approach does three key things: 1) identifies whether sentences remain factually consistent after transformation, 2) extracts a span in the source documents to support the consistency prediction, 3) extracts a span in the summary sentence that is inconsistent if one exists. 
Linux Foundation Trusted AI http://lfai.foundation/blog/2019/10/29/trusted-ai-committee-established/ Linux Foundation Trusted AI LF AI is an umbrella foundation of the Linux Foundation that supports open source innovation in artificial intelligence, machine learning, and deep learning. To build trust in the adoption of AI, the Trusted AI Committee has been established as part of Linux Foundation AI. We are proud to be contributing to this foundation, which has the objectives of: 1) define policies, guidelines and tooling, 2) survey and contract current OSS projects to join LFAI, 3) create a badging or certification process for OSS projects, and 4) standardise taxonomy around trusted AI. AI Ethics - Whose Ethics? http://www.meetup.com/Ethics-Lunch-Group-Rise-London/ AI Ethics - Whose Ethics? As AI becomes more prevalent in society, we face tougher challenges around privacy, security and trust of systems. These challenges often create scenarios that may raise ethical questions which practitioners and leaders will have to tackle. Because of this, learning and studying the underlying philosophical concepts that have been built throughout the millennia could provide incredibly positive results. We started a lunch group in London to dive into these topics once a month. Last session Dr. Ryan Dawson provided an introduction to Aristotle’s Nicomachean Ethics, which was followed by a discussion around its relevance in today’s connected world. Next session’s topic is “Whose Ethics?” where we’ll be diving into the similarities and differences of Western and Eastern philosophy and their modern relevance to AI Ethics. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning OSS: Data Science Notebooks   The theme for this week’s featured ML libraries is Data Science Notebooks, and we’re happy to share brand new libraries into that section to showcase tools beyond the good old Jupyter Notebooks. The four featured libraries this week are:
47 End-to-end ML with MLFlow and Seldon http://www.youtube.com/watch?v=D6eSfd9w9eA E2E ML with MLFlow and Seldon Machine Learning Engineer Adrian Gonzales has put together a fantastic hands-on tutorial which covers the motivations, challenges and best practices when setting up end-to-end machine learning systems. In this talk he dives into how MLFlow can be leveraged to train models through its experimentation functionality, and how the Seldon-MLFlow integration can be used to seamlessly deploy models into a production Kubernetes cluster, where he then runs real-time feedback analysis whilst A/B testing the models. You can watch the video on YouTube and try it out yourself with the Jupyter notebook. Reconstructing human thoughts with ML http://techxplore.com/news/2019-10-neural-network-reconstructs-human-thoughts.html Reconstructing thoughts with ML Neurobotics is leveraging machine learning to reverse engineer human thoughts. They have come up with quite a clever way to leverage labelled data through the expected visual appearance of an object linked to the brain waves that would be emitted when perceiving it. This startup has showcases of their technology that provide very interesting initial results. Technology applications like this can be incredibly interesting, but also require an evaluation of their impact on society. Scalable AutoML with Ray on Spark https://medium.com/riselab/scalable-automl-for-time-series-prediction-using-ray-and-analytics-zoo-b79a6fd08139 Scalable AutoML with Ray Machine learning experimentation can be highly time consuming, and the growing complexity of machine learning requirements makes it harder to build and run experimentation at scale. The team at Intel have put together a great insight on how they are tackling it with their “analytics zoo”, and more specifically in this post they outline how it can be tackled leveraging AutoML using Ray on top of Spark. 
This is an interesting approach as it allows the technical user to leverage existing Spark infrastructure, but with the simplicity of Ray, which in this case allows for large scale automated hyperparameter search. Tensorflow world videos are now out http://www.youtube.com/playlist?list=PLQY2H8rRoyvxcmHHRftsuiO1GyinVAwUg Tensorflow World Videos O’Reilly and TensorFlow teamed up to put together the first TensorFlow World conference, which took place recently. It brought together the growing TensorFlow community to learn from each other and explore new ideas, techniques, and approaches in deep and machine learning. The videos for the conference are now live on YouTube. 14 different types of learning in ML http://machinelearningmastery.com/types-of-learning-in-machine-learning/ 14 types of learning in ML Machine learning is a very broad subject. Machine Learning Mastery does a fantastic job of mapping out some of the key different types of learning in machine learning. The article gives an intuitive overview of what each of these consists of, covering learning problems, hybrid learning problems, statistical inference and learning techniques. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning OSS: Data Science Notebooks   The theme for this week’s featured ML libraries is Data Science Notebooks, and we’re happy to share brand new libraries into that section to showcase tools beyond the good old Jupyter Notebooks. The four featured libraries this week are:
48 ONNX Joins the Linux Foundation http://lfai.foundation/press-release/2019/11/14/lf-ai-welcomes-onnx/ ONNX Joins the Linux Foundation ONNX has joined the Linux Foundation! This is an incredibly exciting announcement given the potential this presents towards standardisation of protocols in the machine learning ecosystem. ONNX stands for Open Neural Network eXchange, and it is an open format used to represent machine learning and deep learning models, which provides advanced and standardised functionality for model creation and export, visualization, optimization, and acceleration. The Nuances in DevOps for ML http://hackernoon.com/why-is-devops-for-machine-learning-so-different-384z32f1 The Nuances in DevOps for ML MLOps is a concept that is used to define the challenges and methodologies to continuously integrate, deploy and monitor machine learning in production. Open Source Engineer Ryan Dawson has put together a great article that provides a high level overview of what MLOps is, and how machine learning is different to traditional software. Continuous Delivery for ML http://martinfowler.com/articles/cd4ml.html Continuous Delivery for ML When managing hundreds or even thousands of models in production it is necessary to introduce automation across the deployment and integration process. This great article provides a thorough overview of the nuanced challenges that machine learning introduces when deploying at scale, and provides some of the core concepts that need to be taken into consideration when introducing automation for continuous deployment of ML. Learnings reaching 2% in Kaggle http://towardsdatascience.com/my-secret-sauce-to-be-in-top-2-of-a-kaggle-competition-57cff0677d3c Learnings reaching 2% in Kaggle Instacart machine learning engineer Abhay Pawar has put together a very comprehensible article on lessons learned reaching the top 2% in a Kaggle competition. 
In the article he covers best practices to learn more from the data to be able to build feature understanding, and iteratively improve the solution to ensure reliable results. The New Data Exchange http://thedataexchange.media/taking-stock-of-foundational-tools-for-analytics-and-machine-learning The New Data Exchange O’Reilly Chief Scientist Ben Lorica has announced a brand new podcast that focuses on Machine Learning called “The Data Exchange”. The first episode dives right in with a conversation with Paco Nathan exploring core trends in ML including data governance, autoML, notebooks and deep learning libraries. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning OSS: Data Science Notebooks   The theme for this week’s featured ML libraries is Industry Strength NLP. The four featured libraries this week are:
49 New Outlier & Adversarial Detector OSS Library http://docs.seldon.io/projects/alibi-detect/en/stable/overview/getting_started.html Outlier & Adversarial Detector Often machine learning models, and systems in general, are monitored with very basic rules - e.g. send an alert if a metric drops under a certain threshold. However this can lead to a lot of false positives being flagged, creating too much noise, which could lead to these alerts being ignored (and real issues not being caught). The Seldon team has released a new open source machine learning library that focuses on outlier, adversarial and concept drift detection, which allows for smarter monitoring techniques. The package aims to cover both online and offline detectors for tabular data, text, images and time series. This package is built with production machine learning use-cases in mind, and has continuously updated integrations with the ML deployment framework Seldon Core. The AI Governance Dilemma https://hackernoon.com/move-fast-and-break-things-the-ai-governance-dilemma-dsq32ix The AI Governance Dilemma As machine learning is adopted in more critical use-cases, the common phrase that tech startups have used, “move fast and break things”, becomes less desirable. In this great article, Seldon Open Source Engineer Ryan Dawson breaks down this AI Governance Dilemma. Ryan provides an introduction to the challenges of ML being deployed in critical use-cases, together with the different areas that should be taken into account, including outliers, concept drift, bias, privacy and other risks. Meaning-aware word vectors with sense2vec http://explosion.ai/blog/sense2vec-reloaded Contextually Keyed Word Vectors Neural word representations have proven useful in Natural Language Processing (NLP) tasks due to their ability to efficiently model complex semantic and syntactic word relationships. 
However, most techniques model only one representation per word, despite the fact that a single word can have multiple meanings or “senses”. The ExplosionAI team, which is also behind SpaCy, created a technique called sense2vec, which builds word embeddings in a similar way to word2vec, but also takes into account part-of-speech attributes for word tokens, which allows for meaning-aware vectors. Time series anomaly detection at Microsoft http://arxiv.org/abs/1906.03821 Time Series Anomaly Detection Large companies need to monitor various metrics of their applications and services in real time. Microsoft has released a fascinating paper where they share some of their knowledge developing and maintaining a time-series anomaly detection service which helps customers to monitor time series continuously and alert for potential incidents on time. In this paper, they introduce the pipeline and algorithm of their anomaly detection service, which is designed to be accurate, efficient and general. The pipeline consists of three major modules: data ingestion, experimentation platform and online compute. In this paper, the team also proposes a novel algorithm based on Spectral Residual (SR) and Convolutional Neural Network (CNN). Pyro 1.0 Released with LFAI http://lfai.foundation/blog/2019/11/18/pyro-1-0-has-arrived/ Pyro 1.0 Released Another great announcement from the LFAI this week - Pyro has released its 1.0 version! Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling. It is developed and maintained by Uber AI and community contributors. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
50 Data Science Best Practices http://syslog.ravelin.com/data-science-best-practices-843c9693db8 Data Science Best Practices It’s easy and fun to ship a prototype, whether that’s in software or data science. What’s much, much harder is making it resilient, reliable, scalable, fast, and secure. Ravelin co-founder and CTO Leonard Austin has written an excellent blog post where he outlines some data science best practices borrowed from software engineering. A Contract for the Web http://contractfortheweb.org/ A Contract for the Web Sir Tim Berners-Lee has launched what he has called a ‘Contract for the Web’, intended to govern the behaviour of both internet giants, such as Google and Facebook, and governments. The Contract describes itself as “a global plan of action to make our online world safe and empowering for everyone”. Deep Learning Indaba 2019 http://www.youtube.com/playlist?list=PLICxY_yQeGYng7mbMmuZjt3S1sDb6YpBJ Deep Learning Indaba 2019 The videos for Deep Learning Indaba 2019 are out! The mission of the Deep Learning Indaba is to Strengthen African Machine Learning. The Deep Learning Indaba is the annual meeting of the African machine learning community. In 2019, the Indaba aims to see 700 members of Africa’s artificial intelligence community for a week-long event of teaching, research, exchange, and debate around the state of the art in machine learning and artificial intelligence. Uncertainty Quantification in Deep Learning http://www.inovex.de/blog/uncertainty-quantification-deep-learning/ Uncertainty Quantification in DL While we usually cannot guarantee our models to be absolutely perfect, we could use information about how certain they are with their predictions. That way, in case of high uncertainty, we can perform more extensive tests or pass the case to a human in order to avoid potentially wrong results. This, however, requires our models to be aware of their prediction accuracy for a given input. 
This article aims to break down just that. Google XAI Whitepaper http://storage.googleapis.com/cloud-ai-whitepapers/AI%20Explainability%20Whitepaper.pdf Google XAI Whitepaper Google entered the XAI ecosystem with a new Google Cloud AI Explanations product, which is targeted at model developers and data scientists. Together with their new system, they have released a whitepaper that outlines their approach towards XAI, together with a high level overview of the motivations and features. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
51 Play Endless Game Built by AI http://aiweirdness.com/post/189511103367/play-ai-dungeon-2-become-a-dragon-eat-the-moon Play Endless Game Built by AI “AI Dungeon 2”, a new dungeon-crawling game was built using the GPT-2 model, and has been very well received by the community. This “open-world” game allows you to interact with the storyline by providing actions that are followed by results that expand on a story that is generated on the go. This is a very creative use of pre-trained language models, and certainly quite an exciting one that could be interesting to explore in many different industry applications. Gentle Intro to Model Selection http://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/ Gentle Intro to Model Selection Given easy-to-use machine learning libraries like scikit-learn and Keras, it is straightforward to fit many different machine learning models on a given predictive modeling dataset. The challenge of applied machine learning, therefore, becomes how to choose among a range of different models that you can use for your problem. Machine learning mastery has put together a great article containing insights on what is model selection, considerations for model selection and techniques available. Code Reviews for Jupyter Notebooks http://towardsdatascience.com/introducing-reviewnb-visual-diff-for-jupyter-notebooks-6797e6dfa20c Code Reviews for Jupyter NBs Code-review methodologies have brought robust development practice into software development. A new exciting project is now extending existing frameworks to provide further code-review functionality into Jupyter notebooks specifically. This project has been named ReviewNB, and it is a visual diff for Jupyter notebooks presented as a GitHub app that communicates to GitHub APIs directly, and processes changes which are then displayed as side-by-side diff formats. Very exciting project, and certainly a space to keep an eye on. 
Netflix Releases Metaflow http://www.zdnet.com/article/netflix-our-metaflow-python-library-for-faster-data-science-is-now-open-source/ Netflix Releases Metaflow Netflix’s data-science team has open-sourced its Metaflow Python library, a key part of the ‘human-centered’ machine-learning infrastructure it uses for building and deploying data-science workflows. It’s great to see tech giants contributing to open source, especially in areas that are currently progressing at breakneck speed, namely the intersection between data science, devops and software engineering. Adversarial Detection Hands On Example http://docs.seldon.io/projects/alibi-detect/en/stable/examples/ad_advvae_mnist.html Adversarial Detection Hands On Adversarial detection algorithms are growing in popularity due to growing concern in exploitation of production machine learning models. A great tutorial was put together by the data science team at Seldon outlining how to use Adversarial Variational Autoencoder Detection algorithms specifically on the MNIST dataset (and more generally on image datasets). Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
52 Top Python ML Libraries in 2019 http://tryolabs.com/blog/2019/12/10/top-10-python-libraries-of-2019/ Top Python ML Libraries in 2019 This last year we have seen a large number of open source libraries coming out. This article highlights 10 Python libraries that came out in 2019 and are a must-watch, many of which are machine learning related. This list includes HTTPX, Starlette, FastAPI, Immutables, Pyodide, Modin, Streamlit, Transformers, Detectron2 and Metaflow. NeurIPS 2019 Videos are Out http://slideslive.com/neurips/ NeurIPS 2019 Videos are Out Neural Information Processing Systems (NeurIPS) is a multi-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, symposia and oral and poster presentations of refereed papers. The videos for this year’s conference are now online and available at https://slideslive.com/neurips/ Modern NLP with SpaCy Podcast http://changelog.com/practicalai/68 Modern NLP with SpaCy Podcast SpaCy is an awesome NLP open source library! It’s easy to use, has widespread adoption, is open source, and integrates the latest language models. Ines Montani and Matthew Honnibal (core developers of spaCy and co-founders of Explosion) join the PracticalAI podcast to discuss the history of the project, its capabilities, and the latest trends in NLP. They also dive into the practicalities of taking NLP workflows to production. Testing Guide for Software http://martinfowler.com/testing/ Testing Guide for Software As software approaches production scale, it requires the relevant amount of testing on a component and system level. The approaches involved when testing systems, especially in machine learning, become more ambiguous, and benefit from the best practices that have been gathered. 
The testing guide in Martin Fowler’s blog is an excellent and comprehensible source of information about testing, which can be adopted not only for traditional software projects but also for machine learning / data science projects. Spotify on Better ML Infrastructure http://labs.spotify.com/2019/12/13/the-winding-road-to-better-machine-learning-infrastructure-through-tensorflow-extended-and-kubeflow/ Spotify on Better ML Infrastructure When Spotify launched, people were amazed that they could access almost the world’s entire music catalog instantaneously. More users and more features led to more systems that relied on Machine Learning to scale inferences across a growing user base. As these ML systems were built, they started to hit a point where engineers spent more of their time maintaining data and backend systems in support of the ML-specific code than iterating on the model itself. They realized they needed to standardize best practices and build tooling to bridge the gaps between data, backend, and ML. This blog post outlines their experience building just that, and how they leverage Tensorflow Extended (TFX) and Kubeflow in their Paved Road for ML systems. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
53 Real time computer vision at scale http://thedataexchange.media/building-large-scale-real-time-computer-vision-applications/ Real time computer vision at scale Chief Data Scientist Ben Lorica comes back with another fantastic podcast with the Data Exchange - this time with a conversation with Reza Zadeh on large scale & real time computer vision use cases, adversarial attacks, deepfakes, fairness, privacy, and security. In this edition they dive into 1) Challenges in building large-scale, real-time computer vision applications. 2) Robustness of computer vision applications (adversarial attacks, deepfakes). 3) Impact of computer vision technologies on society: security, privacy and surveillance. Ray for the curious http://medium.com/distributed-computing-with-ray/ray-for-the-curious-fa0e019e17d3 Ray for the curious Ray is an open-source system for scaling Python applications from single machines to large clusters. Dean Wampler from the newly announced company (founded by some of the core Ray team) has put together a great article that provides an intuitive understanding of Ray for distributed data processing. Evolution of Zulily’s Airflow http://zulily-tech.com/2019/11/19/evolution-of-zulilys-airflow-infrastructure/ Evolution of Zulily’s Airflow In production data science use-cases, the challenge of enabling and managing scheduling of data processing tasks at scale becomes increasingly complex. Apache Airflow has skyrocketed since its debut as a key tool to perform workflow management, and with the growth of cloud native / Kubernetes technologies, Airflow has been able to ride the wave by providing more integrated Kubernetes support. Zulily has put together a great overview of how they have been able to extend their Airflow production infrastructure in Kubernetes, together with lessons learned on the way. 
Key trends in ML for 2020 http://www.kdnuggets.com/2019/12/predictions-ai-machine-learning-data-science-research.html Key trends in ML for 2020 It’s year end again, and that means it’s time for KDnuggets’ annual year end expert analysis and predictions. This year they posed the question: What were the main developments in AI, Data Science, Deep Learning, and Machine Learning in 2019, and what key trends do you expect in 2020? They brought together insights from various renowned experts in the field, which have been made available in this article. A gentle intro to imbalanced ML http://www.meetup.com/Ethics-Lunch-Group-Rise-London/events/267303953/ A gentle intro to imbalanced ML Machine Learning Mastery has put together a very comprehensible introduction to “imbalanced classification”. This tutorial covers three key areas: 1) Imbalanced classification is the problem of classification when there is an unequal distribution of classes in the training dataset. 2) The imbalance in the class distribution may vary, but a severe imbalance is more challenging to model and may require specialized techniques. 3) Many real-world classification problems have an imbalanced class distribution, such as fraud detection, spam detection, and churn prediction. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
54 From system 1 to system 2 by Yoshua Bengio http://slideslive.com/38921750/from-system-1-deep-learning-to-system-2-deep-learning Yoshua Bengio; Towards system 2 Past progress in deep learning has concentrated mostly on learning from a static dataset, mostly for perception tasks and other System 1 tasks which are done intuitively and unconsciously by humans. However, in recent years, a shift in research direction and new tools such as soft-attention and progress in deep reinforcement learning are opening the door to the development of novel deep architectures and training frameworks for addressing System 2 tasks (which are done consciously), such as reasoning, planning, capturing causality and obtaining systematic generalization in natural language processing and other applications. Yoshua Bengio shared very interesting insights on this NeurIPS talk which covered the key concepts that will enable expansion from System 1 tasks to System 2 tasks. The Artificial Intelligence Index 2019 Report http://hai.stanford.edu/sites/g/files/sbiybj10986/f/ai_index_2019_report.pdf AI Index 2019 Report The AI Index, a Stanford-backed initiative to assess the progress and impact of AI, has launched its 2019 report. The new report contains a vast amount of data relating to AI, covering areas ranging from bibliometrics, to technical progress, to analysis of diversity within the field of AI. Jack Clark from OpenAI, who is part of the steering committee outlined some key statistics that include: 300% growth in volume of peer-reviewed AI papers, 800% growth in NeurIPS attendance since 2012, $70b invested worldwide in AI, and more. Microsoft’s NLP Best Practices http://github.com/microsoft/nlp-recipes/ Microsoft’s NLP Best Practices In recent years, natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. 
Microsoft has put together a great repository with examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language. The day that changed Netflix tech http://www.linkedin.com/pulse/date-changed-netflixs-attitude-towards-availability-jaspreet-bakshi/ The day that changed Netflix tech On Christmas Eve 2012, Netflix streaming service experienced an outage. This particular incident got a lot of media coverage for obvious reasons, and was caused due to an AWS region becoming fully unavailable. To mitigate region-based outages, Netflix invested heavily in Resiliency Engineering and Cloud Platform teams to create a discipline to break things on purpose. This post provides really interesting insight on some of the approaches taken to address these issues. Attention and Augmented RNNs http://distill.pub/2016/augmented-rnns/ Attention and Augmented RNNs Recurrent neural networks are one of the staples of deep learning, allowing neural networks to work with sequences of data like text, audio and video. Such models have been found to be very powerful, achieving remarkable results in many tasks including translation, voice recognition, and image captioning. As a result, recurrent neural networks have become very widespread in the last few years. This post provides a great overview on RNNs, together with intuition on some of the core concepts around these. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
55 Machine Learning System Design http://github.com/chiphuyen/machine-learning-systems-design Machine Learning System Design With the rise of large scale machine learning applications, it is becoming increasingly critical for practitioners to learn the best practices in machine learning system design. This great booklet covers four main steps of designing machine learning systems, including 1) project setup, 2) data pipeline, 3) modeling, and 4) serving. The booklet itself also contains 27 open-ended machine learning system design questions that might come up in machine learning interviews. Machine Learning Interviews http://docs.google.com/presentation/d/1MX2V6fTp71j1aztvY5HLYM44iLG4HYMrYd4Dxn6Cxnw/edit#slide=id.g6152350dbb_0_63 Machine Learning Interviews As the role of machine learning engineer becomes more prominent in industry, more useful content is contributed by the community to define the role, together with the best practices, and even advice on job interviews. This presentation by Machine Learning Engineer Chip Huyen provides great insight on the role of the MLE, together with advice on how to best approach machine learning interviews. A Deep Dive into Online Learning http://parameterfree.com/2019/09/02/introduction-to-online-learning/ A Deep Dive into Online Learning A fantastic resource that provides a very comprehensible introduction to online learning, which comes together with a set of lecture notes from Boston University’s “Introduction to Online Learning” course. This first lecture provides an initial insight on the topic, with a strong technical foundation as well as an exercise to put the learnings into practice. Unsupervised NLU via GPT-2 http://rakeshchada.github.io/Zero-Shot-GPT-2.html Unsupervised NLU via GPT-2 Amazon Applied Scientist Rakesh Chada has put together a great post that showcases the power of GPT-2. The language model GPT-2 from OpenAI is one of the most coherent generative models for text out there. 
While its generation capabilities are impressive, its ability to zero-shot perform some of the Natural Language Understanding (NLU) tasks seems even more fascinating to Rakesh. In this blog post, some of those capabilities are highlighted as well as a deep dive on one such fun use-case of converting singular nouns in English to their plural counterparts (and vice-versa). Open Source Business Models http://a16z.com/2019/10/04/commercializing-open-source/ Open Source Business Models The open source software (OSS) movement has created some of our most important and widely used technologies, including operating systems, web browsers, databases and (of course) machine learning. Our world would not function, or at least not function as well, without open source software. In this podcast, Peter Levine shares some of his experience working with open source as a developer, entrepreneur and investor around business models for open source projects. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
56 The Evolution of Tensorflow and ML Infrastructure http://thedataexchange.media/the-evolution-of-machine-learning-infrastructure Evolution of ML Infrastructure Chief Data Scientist Ben Lorica comes back with another great podcast on The Data Exchange Podcast in conversation with Rajat Monga, one of the founding members of the TensorFlow Engineering team. Up until recently Rajat was the engineering manager for TensorFlow at Google. In this podcast they dive into TFX, a production scale ML platform based on Tensorflow, and talk about Multi-Level Intermediate Representation (MLIR), Deep Learning and the state of machine learning infrastructure. 30 Women Advancing AI http://blog.re-work.co/top-women-in-ai-2019/ 30 Women Advancing AI Re-work has put together a Women in AI list of the year, which focuses on individuals that have spearheaded or taken part in great research in 2019, and therefore deserve recognition. Calculating the Value of Data http://bair.berkeley.edu/blog/2019/12/16/data-worth/ Calculating the Value of Data People give massive amounts of their personal data to companies every day and these data are used to generate tremendous business value. Some economists and politicians argue that, given the value of this data, there are situations where people should be paid for it. Furthermore, in the context of organisations holding data, this data has both a value and a risk that are currently hard to quantify. This article discusses methods proposed in Berkeley papers that attempt to answer this question in the ML context. Intro to Ethics in Artificial Intelligence http://www.meetup.com/Ethics-Lunch-Group-Rise-London/events/267303953/ Intro to Ethics in AI The discussion of ethics in AI has become more critical as more applications make their way into production environments that affect the real world. 
We’re organising a London meetup on January 24th covering an introduction to AI, where HATLAB Deputy Director James Kingston will help us get our bearings by taking us on a survey of an AI Ethics Landscape, followed by an open discussion. Come join us! A Guide to File Formats in ML http://towardsdatascience.com/guide-to-file-formats-for-machine-learning-columnar-training-inferencing-and-the-feature-store-2e0c3d18d4f9 A Guide to File Formats in ML Most machine learning models are trained using data from files. Logical Clocks Co-Founder James Dowling has put together this guide to the popular file formats used in open source frameworks for machine learning in Python, including TensorFlow/Keras, PyTorch, Scikit-Learn, and PySpark. This post also describes how a Feature Store can make the Data Scientist’s life easier by generating training/test data in a file format of choice on a file system of choice. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
57 Google Research in 2019 and beyond http://ai.googleblog.com/2020/01/google-research-looking-back-at-2019.html Google Research 2019 + Beyond Google Research has put together a very comprehensible overview of their key highlights and results from 2019 as well as their focus for 2020 and beyond. In this post they touch upon ethical use of AI, AI for social good, Applications of AI in Other Fields, Assistive Technology, Use of AI in Mobile Devices, Quantum Computing, AutoML and more. Facebook Open Source Year in Review http://engineering.fb.com/open-source/open-source-2019/ Facebook OSS Year in Review Facebook dives into their open source year in review list, where they give a high-level overview of their open source projects and the achievements/updates in 2019, including their ~2500 contributors and 32000 contributions. In this article they cover their open source frameworks PyTorch, Hydra, Calibra as well as other Open Source partnerships. AI Lessons Learned with Rakuten http://thedataexchange.media/business-at-the-speed-of-ai-lessons-from-rakuten AI Lessons Learned with Rakuten Chief Scientist Ben Lorica comes back with another Data Exchange podcast. This time he dives into lessons learned with Rakuten Data Science VP Bahman Bahmani. In this podcast they cover the impact of machine learning at Rakuten, best practices in attracting/retaining ML talent, the trio of strategic options and culture within the organisation. Rethinking moving fast and breaking things with AI http://practical-ai-ethics.org/move-fast-and-break-things-the-ai-governance-dilemma/ Move fast and break things w AI The AI Governance Dilemma: As machine learning is adopted in more critical use-cases, the common phrase that tech startups have used, “move fast and break things”, becomes less desirable. In this great article by Seldon Open Source Engineer Ryan Dawson, this AI Governance Dilemma is broken down. 
Ryan provides an introduction to the challenges of ML being deployed in critical use-cases, together with the different areas that should be taken into account, including outliers, concept drift, bias, privacy and other risks. Intro to Ethics in Artificial Intelligence Meetup http://www.meetup.com/Ethics-Lunch-Group-Rise-London/ Intro to Ethics in AI Intro to Ethics in AI: The discussion of ethics in AI has become more critical as more applications make their way into production environments that affect the real world. We’re organising a London meetup on January 24th covering an introduction to AI, where HATLAB Deputy Director James Kingston will help us get our bearings by taking us on a survey of an AI Ethics Landscape, followed by an open discussion. Come join us! Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
58 Feature Stores for Machine Learning http://featurestore.org/ Feature Stores for ML Duplicated work in data science scales as the projects and teams scale. Feature stores are now seen as a core part of the solution for re-usability, however there is still a lot of ambiguity on their definition, architecture and best practices. This site contains an excellent list of resources that map a large part of the ecosystem to drive the conversation forward, including videos, articles and beyond. Key AI & Data Trends for 2020 http://thedataexchange.media/key-ai-and-data-trends-for-2020 Key AI & Data Trends for 2020 The Data Exchange podcast comes back this week with an excellent deep dive into the key AI, Machine Learning and data trends for 2020. In this episode they dive into types of machine learning, real life applications, infrastructure/tools, and other topics such as managing risks and trends to watch. LF AI 2019 Year in Review https://lfai.foundation/blog/2020/01/22/lf-ai-2019-year-in-review/ LF AI 2019 Year in Review It has been an incredible journey at the LF AI since we became an organisational member, and we could not be more excited for the great leaps it has achieved, and more importantly what it has yet to achieve. This great post provides an insight on some of the achievements and updates from 2019. Massive shoutout especially to the core team for their great work driving this forward, here is to yet another great 2020. From local interpretability to global understanding http://www.nature.com/articles/s42256-019-0138-9 From local to global XAI Tree-based models have seen a steady increase in adoption in production use-cases, and with that adoption has also come demand for compliance and reduction of operational risks. This paper proposes a solution that improves the interpretability of tree-based models through three main contributions. Contributions like this are what further the area of interpretability in machine learning. 
Sampling methods for imbalanced classes http://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/ Sampling methods for imbalances Machine learning techniques often fail or give misleadingly optimistic performance on classification datasets with an imbalanced class distribution. The reason is that many machine learning algorithms are designed to operate on classification data with an equal number of observations for each class. When this is not the case, algorithms can learn that very few examples are not important and can be ignored in order to achieve good performance. In this article, machine learning mastery dives into a set of practical sampling methods that can be used when facing imbalanced datasets. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
59 Table Detection, Information Extraction with Deep Learning http://nanonets.com/blog/table-extraction-deep-learning/ Table Detection & NLP with DL The amount of data being collected is drastically increasing day-by-day with lots of applications, tools, and online platforms booming in the present technological era. To handle and access this humongous data productively, it’s necessary to develop valuable information extraction tools. One of the sub-areas that’s demanding attention in the Information Extraction field is the fetching and accessing of data from tabular forms. Table Extraction (TE) is the task of detecting and decomposing table information in a document. In this article they cover the motivations, techniques and solutions on how this can be achieved. Google’s general AI conversational agent http://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html?m=1 Towards general conv. agent The current open-domain chatbots have a critical flaw: they often don’t make sense; sometimes they say inconsistent things, or lack common sense and basic knowledge of the world. In this research, they present Meena, a 2.6 billion parameter end-to-end trained neural conversation model which can conduct conversations that are more sensible and specific than existing state of the art chatbots. New improvements are reflected through a human evaluation metric proposed for open domain chatbots called sensibleness and specificity average (SSA). TF-Encrypt and the state of privacy-preserving ML http://thedataexchange.media/the-state-of-privacy-preserving-machine-learning State of privacy preserving ML Ben Lorica comes back this week with yet another great episode of the data exchange podcast, where he dives into conversation with Morten Dahl, research scientist at Dropout Labs, a startup building a platform and tools for privacy-preserving machine learning (and the person behind TF-Encrypt). 
In this conversation they dive into the current state of TF Encrypted, Federated learning (FL) and secure aggregation for FL, Privacy-preserving ML solutions, differential privacy, homomorphic encryption, and RISELab’s stack for coopetitive learning (MC2). Airbnb’s new Distributed Job Queueing System http://medium.com/airbnb-engineering/dynein-building-a-distributed-delayed-job-queueing-system-93ab10f05f99 Distributed Delayed Job Queueing Asynchronous background jobs can often dramatically improve the scalability of web applications by moving time-consuming, resource-intensive tasks to the background. These tasks are often prone to failures, and retrying mechanisms often make it even more expensive to operate applications with such jobs. Having a background queue helps the web servers handle incoming web requests promptly, and reduces the likelihood of performance issues that occur when requests become backlogged. At Airbnb, they built a job scheduling system called Dynein for very critical use cases. In this article, they walk through the history of job queuing systems at Airbnb, explain why they built Dynein, and describe how they were able to achieve its high scalability. Confidence models in financial research & practice http://www.oreilly.com/ideas/the-trinity-of-errors-in-applying-confidence-intervals-an-exploration-using-statsmodels Applying confidence models Financial models are at the mercy of errors in model specifications, errors in model parameter estimates and errors resulting from the failure of a model to adapt to structural changes of an environment. Because of this trifecta of errors, it’s important for dynamic models to quantify the uncertainty inherent in the financial estimates and predictions. In this post they explore three types of errors in applying confidence intervals that are common in financial research and practice. 
Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
60 An MLOps Framework for Machine Learning at Scale https://www.youtube.com/watch?v=68_Phxwaj-k&feature=youtu.be Hands on MLOps for AI at Scale Production machine learning systems bring fundamentally different challenges to those in traditional software engineering. Last week in our talk at FOSDEM 2020 we provided a practical CI/CD framework to scale production machine learning at massive scale. In this talk we define the concept of MLOps, cover some of the challenges that production machine learning brings to the table, as well as a hands on example using Seldon Core and Jenkins X to build machine learning pipelines that can scale to hundreds of models. Why ML Degrades in Production http://towardsdatascience.com/why-machine-learning-models-degrade-in-production-d0f2108e9214 Why ML Degrades in Production The lifecycle of a machine learning model only begins when it’s deployed. Degrading performance is a big challenge that requires the right processes and infrastructure to ensure it’s monitored so that any business impact that would arise from skewed predictions due to drift in performance is avoided. Kaggle Kernel on Interpretability http://www.kaggle.com/parulpandey/intrepreting-machine-learning-models Kaggle Kernel on Interpretability Machine learning interpretability is key in high risk use-cases - there are a large number of techniques available, each with their own tradeoffs, and it’s important to make sure the tradeoffs of these are understood. This Kaggle Kernel covers a high level overview of the importance of machine learning interpretability, together with hands on examples around permutation importance, partial dependence plots and SHAP. 
Building Domain Specific NLP http://thedataexchange.media/building-domain-specific-natural-language-applications Building Domain Specific NLP In this episode of the Data Exchange, Chief Scientist Ben Lorica speaks with David Talby, co-creator of Spark NLP, an open source, highly scalable, production grade natural language processing (NLP) library. Spark NLP has become one of the more popular NLP libraries and is available on PyPI, Conda, Maven, and Spark Packages. With recent advances in research in large-scale natural language models, there is strong interest in domain specific natural language applications - in this podcast they dive into some of these. Bayesian Product Ranking at Wayfair http://tech.wayfair.com/data-science/2020/01/bayesian-product-ranking-at-wayfair/ Bayesian Product Ranking Wayfair Wayfair has a huge catalog with over 14 million items with very broad categories. However, the large size of our product catalog also makes it hard for customers to find the perfect item among all of the possible options. In this post Wayfair introduces their new Bayesian system which was developed to (1) identify these products and (2) present them to their customers. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
61 Microsoft’s NLP Recipes http://github.com/microsoft/nlp-recipes/ Microsoft’s NLP Recipes In recent years, the field of natural language processing (NLP) has seen quick growth in quality and usability, and this has helped to drive business adoption of artificial intelligence solutions. Microsoft has put together a great resource with best practices for NLP through Jupyter notebooks and utility functions. Messaging & Data Ingestion with Pulsar http://thedataexchange.media/taking-messaging-and-data-ingestion-systems-to-the-next-level Messaging & Data Ingestion++ The Data Exchange Podcast dives into conversation with Sijie Guo on how Apache Pulsar is able to handle both queuing and streaming, and both online and offline applications. In this episode they cover the role of messaging in modern data applications/platforms, queuing implementations, streaming applications, and a status update on Apache Pulsar. Why Imbalanced ML is so hard http://machinelearningmastery.com/imbalanced-classification-is-hard/ Why Imbalanced ML is so hard Machine Learning Mastery sheds light on the topic of imbalanced classification in machine learning, specifically around why this challenge is so difficult to tackle. In this tutorial they cover the challenges of severely skewed class distributions, costs of misclassification, properties that can be imbalanced, and a framework to develop an intuition for the compounding effects on modelling difficulty posed by different dataset properties. AI for Data Cleaning at Scale http://towardsdatascience.com/ai-should-not-leave-structured-data-behind-33474f9cd07a AI for Data Cleaning at Scale An interesting article that proposes using ML to clean data at scale (for training more ML). This article breaks down the challenge of data cleaning, and covers a fascinating academic open source project called HoloClean, which aims to tackle this, together with a breakdown of the techniques and next steps. 
Training Models with 1b+ Params http://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/ Training Models with 1b+ Params Larger models are difficult to train because of cost, time, and ease of code integration. Microsoft is releasing an open-source library called DeepSpeed, which suggests to provide scale, speed, cost, and usability, unlocking the ability to train models at massive scale. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
62 Jurgen’s Retrospective AI 2010s http://people.idsia.ch/~juergen/2010s-our-decade-of-deep-learning.html Jurgen’s Retrospective AI 2010s Jürgen Schmidhuber has put together a fantastic post focusing on the recent decade’s most important developments and applications based on their work, as well as developments from related work, addressing privacy and data markets. The post includes LSTMs, Feed Forward NNs, Network Comparisons, Trends and the Future. Misinformation in a Hyperconnected World http://www.meetup.com/Ethics-Lunch-Group-Rise-London/events/268774414/ Hyperconnected Misinformation This Friday we are organising an open event in London to dive into the topic of misinformation and bias in a hyperconnected world. Head of Machine Learning at Factmata Dr. Magdalena Lis will be presenting a brief overview of the topic of fake news and bias, which will be followed by a discussion to explore key themes, such as “Is social media fuelling the spread of misinformation?” and “What can be done to address it?”. Come join us! MLOps: The End of End-to-End http://www.mosaicventures.com/mosaicblog/2020/2/20/mlops-the-end-of-end-to-end MLOps: The End of End-to-End Mosaic has put together an overview of the concept of MLOps in the context of the full lifecycle of machine learning. This post provides a conceptual understanding of the different stages in end-to-end ML including data exploration, modelling and production inference. Empirical Quality Metrics for Deep Learning http://calculatedcontent.com/2020/02/16/weightwatcher-empirical-quality-metrics-for-deep-neural-networks/ Empirical Quality Metrics for DL A blog post that dives into the release of a tool called “Weightwatcher” which provides a set of tools for computing quality metrics of trained and pre-trained deep neural networks. The post provides insight on some of the metrics that are computed, as well as ways in which models can be benchmarked against each other. 
How to Interpret an ML Model http://francescopochetti.com/whitening-a-black-box-how-to-interpret-a-ml-model/ How to Interpret an ML Model A practical Jupyter notebook that dives into some high level techniques for explaining machine learning models. Some of the methods explored include partial dependency plots, individual conditional expectation, tree ensembles feature contribution, permutation feature importance and the good old LIME & SHAP. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
63 The Building Blocks of ML Interpretability http://distill.pub/2018/building-blocks/ Building Blocks of Interpretability Often machine learning interpretability techniques are studied and analysed in isolation. This post explores the powerful interfaces that arise when you combine interpretability techniques, as well as the rich structure of the combinatorial space that results when combining them. The post provides an intuitive explanation, together with visual representations of these building blocks for machine learning interpretability techniques. Scalable ML for Everyone with Ray http://thedataexchange.media/scalable-machine-learning-scalable-python-for-everyone Scalable ML for Everyone with Ray Linux Foundation’s Ethics in AI and Big Data Course http://www.edx.org/course/ethics-in-ai-and-big-data Ethics in AI and Big Data Course The Linux Foundation has put together an excellent resource that covers key topics that form the foundations of Ethics in AI and Big Data. In this course they cover a brief overview on AI, as well as principles for building responsible AI, together with several initiatives and open source drivers that surround this topic. The course starts this week, so perfect timing to join in. Re-assessing Emotional Expressions http://francescopochetti.com/whitening-a-black-box-how-to-interpret-a-ml-model/ Re-assessing Emotional Expressions Microsoft’s Agile Data Science Process http://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview Microsoft’s Data Science Process Microsoft has put together a post that dives into the Team Data Science Process, which is an agile and iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently. This article provides an overview of TDSP and its main components. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
64 Kubeflow 1.0: Kubernetes ML for Everyone http://medium.com/kubeflow/kubeflow-1-0-cloud-native-ml-for-everyone-a3950202751 Kubernetes ML for Everyone The Kubeflow Project has made their official release of the v1.0, a great milestone to bring Kubernetes Machine Learning for everyone. The Kubeflow project brings end to end machine learning capabilities, including Development with Jupyter Notebook Management, Building with Kubeflow Fairing, Training leveraging multiple ML frameworks and Deployment with KFServing and Seldon Core. Production-Ready ML Systems http://medium.com/cracking-the-data-science-interview/the-5-components-towards-building-production-ready-machine-learning-system-a4d5237ec04e Production-Ready ML Systems One of the biggest issues that we face in machine learning is how to deploy and scale models in production. This article breaks down the core concepts that make machine learning deployment and productionisation different to that of traditional software, as well as the core components that are part of the machine learning lifecycle. These include Training, Validation, Testing, Serving and Monitoring. Explaining Long Term ML Impact http://ai.googleblog.com/2020/02/ml-fairness-gym-tool-for-exploring-long.html Explaining Long Term ML Impact Machine learning systems have been increasingly deployed to aid in high-impact decision-making, such as determining criminal sentencing, child welfare assessments, who receives medical attention and many other settings. Understanding whether such systems are fair is crucial, and requires an understanding of models’ short- and long-term effects. Google released a research paper which outlines a set of components for building simple simulations that explore potential long-run impacts of deploying machine learning-based decision systems in social environments. 
Quantifying Reproducibility of ML http://thegradient.pub/independently-reproducible-machine-learning/ Quantifying Reproducibility of ML Peer review has been an integral part of scientific research for more than 300 years. But even before peer review was introduced, reproducibility was a primary component of the scientific method. Now, we hear warnings that Artificial Intelligence (AI) and Machine Learning (ML) face their own reproducibility crises. This article dives into insights obtained whilst attempting to reproduce ML algorithms from papers continuously, leading into a framework to assess and quantify how reproducible a specific resource is. Adversarial Examples Resource http://nicholas.carlini.com/writing/2019/all-adversarial-example-papers.html Adversarial Examples Resource It can be hard to stay up-to-date on the published papers in the field of adversarial examples, where we have seen massive growth in the number of papers written each year. This resource attempts to address just that by putting together a huge list of papers from Arxiv related to adversarial examples. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
65 Explainability, Security & MLOps in Podcast http://thedataexchange.media/the-responsible-development-deployment-and-operation-of-machine-learning-systems Explainability, Security & MLOps The Data Exchange Podcast comes back this week with a conversation on machine learning explainability, MLOps, adversarial robustness and privacy preserving ML, with Institute for Ethical AI Chief Scientist & Seldon Engineering Director Alejandro Saucedo. In this podcast Ben Lorica and Alejandro dive into some of the key trends in machine learning, as well as some of the core best practices for developing, deploying and monitoring production machine learning at massive scale. Python Machine Learning Books http://pythonbooks.org/topical-books/machine-learning-and-artifical-intelligence/ Python Machine Learning Books A fantastic resource that has been compiled together from some of the best books found through conversations across developers and researchers. Specifically this sub-page has carefully curated Python Books that focus specifically on machine learning. The books referenced in this section are not only great for beginners, but also for intermediate-level ML learners. DevOps in Machine Learning http://www.theregister.co.uk/2020/03/07/devops_machine_learning_mlops/ DevOps in Machine Learning What would machine learning look like if you mixed in DevOps? Wonder no more, Seldon Open Source Developer Dr. Ryan Dawson has put together a great piece that outlines the concept of MLOps, together with some of the existing biggest challenges, solutions and best practices, together with some of the initiatives that are advancing these discussions forward. A Tour on E2E ML Platforms http://databaseline.tech/a-tour-of-end-to-end-ml-platforms/ A Tour on E2E ML Platforms A fantastic article by Spotify Senior Data Engineer Ian Hellström which aims to provide a high level overview of some of the end-to-end platforms available in the MLOps space. 
This post dives into Google TFX, Uber Michelangelo, Airbnb Bighead, Netflix Metaflow and more. Adversarial ML Reading List http://nicholas.carlini.com/writing/2018/adversarial-machine-learning-reading-list.html Adversarial ML Reading List Another fantastic resource that dives into the broad world of Adversarial Robustness, which has curated and carefully selected pieces that are recommended as reading list for anyone interested to learn more on the topic. This resource also provides a high level overview that covers the basics, a quick introduction, a complete background and papers broken down by various categories. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
66 PyTorch ML from Scratch http://machinelearningmastery.com/pytorch-tutorial-develop-deep-learning-models/ PyTorch ML from Scratch Machine Learning Mastery has put together a great resource providing a step by step guide on how to train a PyTorch machine learning model. The article covers every step including installation, ML lifecycle (data prep, training, evaluation), and even going further into developing a MLP for multiclass & regression and a CNN for image classification. The MLOps References List http://github.com/visenger/mlops-references The MLOps References List A fantastic resource that has put together a very comprehensive list of resources related to MLOps, or the topic surrounding the components required to productionise machine learning. This resource contains multiple different themes including references, papers, talks, existing ML systems, and more. Deep Learning in Information Retrieval http://thedataexchange.media/how-deep-learning-is-being-used-in-search-and-information-retrieval Deep Learning & Info Retrieval The Data Exchange Podcast comes back this week in conversation with Hypercube Founder Edo Liberty, focusing primarily on how deep learning can be used in search & information retrieval. This podcast includes Edo’s experience, deep learning & IR, challenges when building information retrieval tools at scale, and deep learning based search including enterprise e2e deep search platforms. Integrating SHAP Explainability http://docs.seldon.io/projects/alibi/en/latest/methods/KernelSHAP.html Integrating SHAP Explainability SHAP (SHapley Additive exPlanations) is an algorithm which provides model-agnostic (black box), human interpretable explanations suitable for regression and classification models applied to tabular data. 
This method is a member of the additive feature attribution methods class; feature attribution refers to the fact that the change of an outcome to be explained (e.g., a class probability in a classification problem) with respect to a baseline (e.g., average prediction probability for that class in the training set) can be attributed in different proportions to the model input features. The Alibi Explain OSS project has implemented this technique and has put together several Jupyter notebook examples to implement this algorithm across various models. AI meets operations with OReilly http://www.oreilly.com/radar/ai-meets-operations/ AI meets operations with OReilly One of the biggest challenges operations groups will face over the coming year will be learning how to support AI- and ML-based applications. The O’Reilly team has put together a comprehensive high level overview of the problem of managing machine learning at scale. In this article Mike Loukides provides a high level overview of this challenge, together with some of the components that may comprise the solutions, as well as examples. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
67 Deploying COVID-19 AI solutions at scale http://github.com/axsaucedo/seldon-core/blob/corona_research_exploration/examples/models/research_paper_classification/README.md COVID-19 AI solutions at scale There has been great momentum from the machine learning community to extract insights from the increasingly growing COVID-19 Datasets, such as the Allen Institute for AI Open Research Dataset as well as the data repository by Johns Hopkins CSSE - the best insights have come out of cross-functional collaborations across ML practitioners and relevant domain experts such as infectious disease experts. Chief Scientist at the Institute for Ethical AI Alejandro Saucedo has put together a brief hands-on tutorial to showcase how to deploy COVID-19 AI Solutions at scale, encouraging cross functional collaboration across domain experts such as data scientists, software engineers and even epidemiologists & healthcare professionals. Democratising Deep Fakes 😬 http://colab.research.google.com/github/AliaksandrSiarohin/first-order-model/blob/master/demo.ipynb#scrollTo=d8kQ3U7MHqh- Democratising Deep Fakes 😬 An incredibly fascinating tutorial that showcases yet further improvement and simplification in the creation of deep fakes, making it even easier for researchers and practitioners to create them. The quality and simplicity of deep fake creation are improving at break-neck speed, and with that a lot of very interesting questions arise around ethical, privacy- and security-related concerns, among others. Try out the hands-on Colab notebook to try it yourself. You can also check out the video that covers the paper and implementation in further detail. 
Shopify on Scaling AI http://thedataexchange.media/business-at-the-speed-of-ai-lessons-from-shopify Shopify on Scaling AI The Data Exchange comes back this week with a fantastic podcast in conversation with Shopify VP and Head of Data Science and Data Platform Engineering Solmaz Shahalizadeh. In this podcast they dive into building and scaling machine learning data products, building and scaling data teams, and data informed product building. Industry Reinforcement Learning http://anyscale.com/blog/enterprise-applications-of-reinforcement-learning-recommenders-and-simulation-modeling/ Industry Reinforcement Learning Chief Scientist Ben Lorica has put together a great article that covers a high level overview of enterprise applications of reinforcement learning. The post covers applications of reinforcement learning in recommender systems, simulation modelling & optimisation, and dives into some of the tools that power some of those solutions, together with an insight on some of the biggest challenges currently in this space. Transfer Learning in NLP http://docs.google.com/presentation/d/1LsUAhR_qIVbq6xH6Aw4ag8MGB_-UWfd0KoVhtTgye6o/edit#slide=id.g6e76c30798_0_0 Transfer Learning in NLP A fantastic presentation that covers a very comprehensive overview and deep dive on all-things-transfer-learning in NLP. The presentation covers the motivations, open problems, definitions/terminology, as well as some of the current work in the research and practitioner communities. The slides are available, as well as a version of that presentation in video format. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
68 GitLab Data Lessons Learned http://about.gitlab.com/blog/2020/02/10/lessons-learned-as-data-team-manager/ GitLab Data Lessons Learned A fantastic article by GitLab Staff Data Engineer Taylor Murphy on his key lessons learned leading the GitLab Data Team. In this article Taylor covers the importance of management skills in data engineering, together with key areas to focus on including growth, hiring, process, tools, performance, meetings and beyond. The article also provides a significant number of links and resources to expand on these very useful areas. Data Discovery at Spotify http://labs.spotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/ Data Discovery at Spotify Spotify has released a high level overview of their journey to improve data discovery across the firm. In this article they provide a set of high level steps (or themes) that they followed to achieve this, including diagnosing the problem, understanding intent, enabling knowledge, mapping expertise and more. Exploratory Data Analysis Deep Dive http://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e Exploratory Data Analysis Dive A very comprehensive article that outlines the best practices on Exploratory Data Analysis, a step which is foundational to the data science process. This article covers a high level definition of EDA, its components, and a deep dive into understanding features, cleaning datasets and analysing feature relationships. Tokenisers & How Machines Read http://blog.floydhub.com/tokenization-nlp/ Tokenisers & How Machines Read NLP applications are only growing in industry, and hence best practices and an understanding of its fundamentals are increasingly crucial. This article provides a very comprehensive deep dive into one of the core components of NLP: tokenization. 
This article provides a deep dive on the topic of tokenization, together with the challenges present in this space, and common types of text tokenization. Intel Demystifying the AI Stack http://www.intel.com/content/www/us/en/intel-capital/news/story.html?id=730#/type=QWxs/page=0/term=/tags= Intel Demystifying the AI Stack Intel Capital has created an overview of the end-to-end AI infrastructure stack, together with a mapping of how existing projects fall into their respective categories. In this article they cover an overview of the different layers of their Stack, including Hardware, Software Accelerators, Libraries, Data Science Frameworks, Orchestration, Automation, and Autonomous. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
69 AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/69.html This week in Issue #69: Insights for Remote ML Teams http://www.comet.ml/site/how-to-make-remote-work-effective-for-data-science-teams/ Insights for Remote ML Teams CometML has put together a great article that outlines best practices for managing remote data science teams. The article includes key considerations, including organisational structures, biggest challenges that remote workers face, and best practices; these include productive workspaces, communication, habits, trust and more. Human-in-the-loop in Prod ML http://thedataexchange.media/human-in-the-loop-machine-learning/ Human-in-the-loop in Prod ML A great Data Exchange Podcast with Machine Learning Consulting CEO Rob Munro, where they dive into “Human in the loop Machine Learning”, and cover Rob’s experience at various tech giants, writing his book on the topic, several NLP areas where it’s relevant, and how this fits in real life. Netflix & Druid for Real Time Data http://netflixtechblog.com/how-netflix-uses-druid-for-real-time-insights-to-ensure-a-high-quality-experience-19e1e8568d06 Netflix & Druid for Real Time Data Netflix brings us a high level overview of how they use Druid for real time insights. Apache Druid is a high performance real-time analytics database, which is designed primarily for workflows where fast queries and ingest really matter. In this post they highlight how Druid’s capabilities shine around instant data visibility, ad-hoc queries, operational analytics and handling high concurrency. They cover a high level architecture of their data processing lifecycle, as well as insights they have gathered to ensure scale. The Importance of Data Prep http://www.oreilly.com/radar/the-unreasonable-importance-of-data-preparation/ The Importance of Data Prep The O’Reilly team published a great post that highlights the importance of data preparation. 
In this article they present a “Data Science Hierarchy of Needs”, where they outline how key data processing is to ensure accurate and reliable insights when tackling any data-related challenge. In the post they cover the importance of automatic data prep, some key tools aiding in this area, and they dive into the future of tooling. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
70 [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/70.html This week in Issue #70: Privacy Preserving AI Lecture http://www.youtube.com/watch?v=4zrU54VIK6k Privacy Preserving AI Lecture Andrew Trask joins Lex Fridman to deliver a full-length lecture on privacy preserving AI. This excellent resource covers a broad set of areas within the space of privacy preserving AI, including the premise of the problem it aims to solve, the different tools available at our disposal, terminology and other key concepts in this area. Backpropagation 101 from Thinc http://thinc.ai/docs/backprop101 Backpropagation 101 from Thinc SpaCy Cofounder Matt Honnibal has put together a fantastic resource that dives into a step by step intuitive overview of the backpropagation algorithm and its implementation in deep learning, and breaks it down into its constituent terms with hands-on practical examples. Harvard Offering Free Courses http://online-learning.harvard.edu/catalog?keywords=&subject%5B%5D=3&paid%5B1%5D=1&max_price=&start_date_range%5Bmin%5D%5Bdate%5D=&start_date_range%5Bmax%5D%5Bdate%5D= Harvard Offering Free Courses Harvard is offering free online courses for anyone that wants to expand their knowledge boundaries. These courses cover a broad range of topics in computer science, including their AI with Python course. For anyone interested, they also have made available over 50 courses across various other academic fields that are also available for free. GPT2 AI Dungeon Game Update http://medium.com/@aidungeon/ai-dungeon-multiplayer-is-out-84177419bf7a GPT2 AI Dungeon Game Update AI Dungeon is a narrative based game built on top of the natural language generation model GPT-2, which allows for a fully unique and pseudo-personalised gaming experience. 
They have released new functionality which allows players to try it out without any code, and even to play in multi-player mode. Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
71 A Practical Intro to Responsible AI http://www.youtube.com/watch?v=TPoEs-HJE6U A Practical Intro to Responsible AI The next generation of young innovators took to the virtual channels this weekend to build solutions to tackle the challenge of our generation at the YouthVsCOVID TeensInAI Hackathon. During this event we presented a talk that covered a practical introduction to responsible AI, where we covered some of the key motivations for following best practices to ensure responsible development, deployment and operation of AI systems. Please also check out the other fantastic talks presented at the event at the TeensInAI Youtube Channel. Simulating the Real World in Python http://realpython.com/simpy-simulating-with-python/ Simulating Real World in Python A simulation is a representation of a real-world system. One can use mathematical or computational models of this system to study how it works - this article showcases how you can leverage Python’s SimPy library to get started. Advanced NLP with SpaCy http://course.spacy.io/en/ Advanced NLP with SpaCy SpaCy Co-founder Ines Montani has put together an official advanced NLP course which introduces core Natural Language Processing concepts using SpaCy. This course is broken down into four chapters which cover foundational pieces such as finding words / phrases, scaling analysis, building processing pipelines and training your own neural network model. 500 Free CompSci Courses http://www.freecodecamp.org/news/free-courses-top-cs-universities/ 500 Free CompSci Courses Every year, Class Central publishes rankings of the world’s highest rated and most popular online courses. This year they decided to show all the free online courses from some of the top universities (with a section focusing on computer science). This article provides the methodology (and Jupyter notebook) used to rank the universities using the Class Central database. 
Modelling & Simulating Epidemics https://thedataexchange.media/computational-models-and-simulations-of-epidemic-infectious-diseases/ Modelling & Simulating Epidemics The Data Exchange comes back with an excellent podcast where it dives into the trending topic of computational modelling and simulations of epidemic infectious diseases. During this podcast Chief Scientist Ben Lorica speaks with Data Scientist Bruno Goncalves, and covers some key techniques used for epidemic modelling, as well as their impact in decision making. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/70.html AI Conferences Gone Virtual 2020 [Updated List 26/04/2020] Due to the current global situation, a large number of conferences have had to face hard choices, several of which decided to go fully virtual. This hard choice has now opened the doors to people from around the world to gain access to the great online content generated by expert speakers and contributors. We wanted to highlight some of these key conferences so they are not missed - these include: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
72 Monitoring Machine Learning Models in Prod http://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models/ Monitoring ML Models in Prod The lifecycle of a machine learning model only begins when it’s deployed to production. It’s key to be able to understand the performance, and especially the degradation that models experience as time passes. Whether it is due to data distribution changing, or other external factors, it’s key to ensure the right infrastructure is in place. This article provides an excellent picture of this challenge and various tools and solutions available. 65 Free Springer ML Books http://towardsdatascience.com/springer-has-released-65-machine-learning-and-data-books-for-free-961f8181f189 65 Free Springer ML Books Springer has released hundreds of free books on a wide range of topics to the general public. The list, which includes 408 books in total, covers a wide range of scientific and technological topics. In order to save you some time, this article has created a single list of all the books (65 in number) that are relevant to the data and Machine Learning field. Neural Network Music Generator http://openai.com/blog/jukebox/ Neural Network Music Generator OpenAI has released a very interesting announcement, the launch of Jukebox, a neural network based model that can be used to generate music of different genres with lyrics. In this post they cover in depth the approach as well as various examples that were generated with this model. 
Open Source Deep Learning http://thedataexchange.media/an-open-source-platform-for-training-deep-learning-models/ Open Source Deep Learning The Data Exchange podcast comes back this week with a deep dive with DeterminedAI CEO Evan Sparks, where they dive into their brand new open sourced Deep Learning Training platform, together with some key enterprise use-cases of deep learning, the challenges and opportunities of distributed training & hyperparameter tuning, as well as some examples of how teams have been using their open source platform. AI, COVID19, Ethics & Contact Tracing http://www.meetup.com/Ethics-Lunch-Group-Rise-London/events/270379397/ AI, COVID19 & Contact Tracing As we face one of the biggest challenges of our generation, several ethical implications arise which often appear to clash with the urgency of proposed solutions - one of the ongoing key technological discussions in this space is the use and approach towards contact tracing technology. We are organising a meetup on the 15th of March where HATLAB Deputy Director will dive into the ethical implications of responses to COVID in the context of contact tracing in particular. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/70.html [Updated] AI Conferences Gone Virtual in 2020 Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
73 Real Time ML Stream Processing http://pycon.hk/sessions-2020-spring/real-time-stream-processing-with-python-at-scale/ Real Time ML Stream Processing This weekend we have a talk at PyConHK 2020 about real time machine learning using Python, Faust, Kafka and Seldon. During this talk we dive into the core concepts of stream processing, as well as the ecosystem of tools available and a hands-on example building a streaming pipeline with multiple workers processing real time comments from Reddit’s /r/science subreddit using 200k comments from the “comments removed by mods” dataset. ACM Position Statement on Contact Tracing http://www.acm.org/binaries/content/assets/public-policy/europe-tpc-contact-tracing-statement.pdf Statement on Contact Tracing We have been contributing to a position statement on Contact Tracing applications through our role at the Association for Computing Machinery (ACM)’s European Policy Committee. This statement provides a set of principles and recommendations to countries that are looking to use contact tracing to tackle the challenges COVID has been posing in our societies. This Friday we’re also organising a virtual meetup where the Hatlab Deputy Director will be sharing insights on contact tracing apps. ICLR 2020 Videos Released http://iclr.cc/virtual_2020/calendar.html ICLR 2020 Videos Released The Eighth International Conference on Learning Representations took place virtually in 2020, and has recently released the videos for all the talks, which are now available at their website. This is a fantastic resource that provides access to state of the art research in a digestible format, and provides an open resource which other researchers and practitioners will be able to build upon. Why TinyML will be Huge http://thedataexchange.media/why-tinyml-will-be-huge/ Why TinyML will be Huge The Data Exchange Podcast comes back this week with a fantastic session with Staff Research Engineer Pete Warden. 
This episode covers the early days of deep learning for computer vision, the early days of the TensorFlow project, insights on TinyML and why it’s such an important topic, privacy in the context of TinyML, and Pete’s new book. PapersWithCode: A home for ML http://medium.com/paperswithcode/a-home-for-results-in-ml-e25681c598dc PapersWithCode: A home for ML The PapersWithCode project has released a new update, where they share some insights on the challenge of reproducibility that they have been tackling with this project. They are introducing exciting new features, including a new results interface, an ML Extraction Algorithm that automatically extracts results from papers, and a Big Database update with 800+ new leaderboards, 550+ new results and more. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/73.html This week in Issue #73: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
74 Coding Habits for Data Scientists http://www.thoughtworks.com/insights/blog/coding-habits-data-scientists Coding Habits for Data Scientists Often ML code is written in Jupyter notebooks with the main purpose of experimentation instead of scalability, which may come with undesired side-effects and may have detrimental impacts on the stability and robustness of the model beyond its deployment. This article has put together a great overview of some of the motivations as well as best practices around coding habits for data science. Enterprise AI Adoption 2020 http://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2020/ Enterprise AI Adoption 2020 O’Reilly has put together and compiled the results of a survey they carried out which provides insights on AI Adoption in the Enterprise in 2020. In this article they dive into how the efforts are maturing from prototype to production on AI, and how companies are able to fill the skills gap across a broad range of industries. Natural Language Processing 101 http://realpython.com/natural-language-processing-spacy-python/ Natural Language Processing 101 A fantastic tutorial that goes into the depths of Natural Language Processing. This article dives into the foundational terms and concepts in NLP, how to use the SpaCy framework for NLP, building end to end pipelines and diving into more advanced NLP concepts. AI Scalability & Performance http://thedataexchange.media/improving-performance-and-scalability-of-data-science-libraries/ AI Scalability & Performance The Data Exchange podcast goes into conversation with Wes McKinney, Director of Ursa Labs and Apache Arrow PMC member. Wes is the creator of Pandas, and author of the best selling book “Python for Data Analysis”. In this episode they cover these open source projects, dive into the need for shared infrastructure for data science, and discuss some of the critical work at Ursa Labs. 
MLOps is Not Enough http://techcommunity.microsoft.com/t5/azure-ai/mlops-is-not-enough/ba-p/1386789# MLOps is Not Enough MLOps is defined as the operational complexities involved in operating production machine learning at scale. This article from Microsoft provides deeper thoughts on the concept of MLOps and argues that a broader approach is required to tackle the challenge at scale - namely the Data Science Lifecycle process. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/74.html This week in Issue #74: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
75 Microsoft Code Autocomplete AI http://www.youtube.com/watch?v=fZSFNUT6iY8 Microsoft Programming AI Microsoft and OpenAI shared yet another interesting use-case of GPT-2 text generation - this time its function is to generate Python code. They showcased it during Microsoft Build 2020 last week, which, although only a prepared demonstration, does seem to have some really interesting insights showing how to generate suggested code based on an initial input. ML Infra for Model Building http://towardsdatascience.com/ml-infrastructure-tools-for-model-building-464770ac4fec ML Infra for Model Building Arize AI Co-Founder Aparna Dhinakaran has put together an overview on the tools available to address the challenges present in the end to end machine learning lifecycle. This article proposes over a dozen different themes to classify the various different technologies available, which shows the complexity that the end to end ML challenge encompasses, and provides a brief intuitive overview of each. Virtual Discourse on Rethinking Public Data http://www.meetup.com/Tech-Ethics-London/events/270315999/ Discourse Rethinking Public Data Thanks to advances in the development of data-driven technologies, we now have unprecedented opportunities to unlock the social value of data. Data could now truly function as a common resource and a public good with transformative power for communities and society. This Friday we are organising an online session where Head of Public Engagement at the UK Ada Lovelace Institute, Reema Patel, will be sharing insights on the ethical challenges about data use and we will dive into how we can best ensure data is used in the interests of society. Advanced NLP Video Course http://www.youtube.com/watch?v=THduWAnG97k Advanced NLP Video Course Last week we shared the Advanced NLP Course that dives into how to use SpaCy to tackle intermediate and advanced NLP real-life challenges. 
This week the SpaCy team has shared a video series they have created which covers the NLP Course end to end. This is a fantastic resource which has been (and is still being) translated into a large range of different languages (with human help, not NLP, in this case; we’re not fully there just yet). What to Do When AI Fails http://www.oreilly.com/radar/what-to-do-when-ai-fails/ What to Do When AI Fails O’Reilly has put together a great piece that emphasises the implications AI use-cases have when they go wrong, motivating that reflection by quoting several relatively recent high profile incidents that showcase their impact. In this post they outline how these AI cases are different, as well as terminology around “AI incidents”, and some of the best practices and approaches available to mitigate these scenarios. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/75.html This week in Issue #75: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
76 Highlights on EuroPython & ACM http://www.acm.org/articles/membernet/2020/membernet-05282020#ACM-officers Highlights on EuroPython & ACM ML in Production Deployment Guide http://mlinproduction.com/what-does-it-mean-to-deploy-a-machine-learning-model-deployment-series-01/ ML in Prod Deployment Guide “ML in Production” is a website that curates content focused around best practices for building real world machine learning systems. They have put together a fantastic five-part series that dives into the concepts and challenges of production machine learning, including the definitions, the software interfaces, batch processing, online inference and ML deployment. Frameworks used by ML Startups http://neptune.ai/blog/tools-libraries-frameworks-methodologies-ml-startups-roundup?utm_source=reddit&utm_medium=post&utm_campaign=blog-tools-libraries-frameworks-methodologies-ml-startups-roundup Frameworks used by ML Startups Navigating the wide and deep range of machine learning tools can be hard, especially for the fast-moving requirements that startups face. In this article 41 machine learning startups were surveyed across the world to gain an understanding of the tools, libraries and frameworks used on a day to day basis. The insights obtained are grouped into Methodology, Software Development setup, ML Frameworks, MLOps and “the unexpected”. GPT-3 Deep Dive Explanation http://www.youtube.com/watch?v=SY5PvZrJhLE GPT-3 Deep Dive Explanation A paper was released last week covering initial achievements in the experimental results of the GPT-3 Language Model, which has 175 Billion parameters and was trained on almost 500 Billion tokens. This 60 minute video dives into the paper and breaks it down in an intuitive and comprehensible way, covering the terminology & foundations, details on the model size & dataset, methodology, fine tuning, experimental results and much more. 
Scaling Data with Outliers for ML http://machinelearningmastery.com/robust-scaler-transforms-for-machine-learning/ Scaling Data with Outliers for ML Machine Learning Mastery has put together a comprehensive article which dives into how to use robust scaler transforms to standardise numerical input variables for classification and regression. In this tutorial they cover the algorithms that benefit from these techniques, some of the approaches that enable it, and how to use the RobustScaler to scale numerical input variables using the median and interquartile range. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/75.html AI Conferences Gone Virtual 2020 [Updated List 31/05/2020] Due to the current global situation, a large number of conferences have had to face hard choices, several of which decided to go fully virtual. This hard choice has now opened the doors to people from around the world to gain access to the great online content generated by expert speakers and contributors. We wanted to highlight some of these key conferences so they are not missed - these include: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
77 Made with Machine Learning Platform http://madewithml.com/ Made with ML Platform “Made with ML (MWML)” is a fantastic free platform that focuses on enabling the ML community to learn, explore and build, through a set of curated resources, ML related lessons, a continuously updated set of ML projects, and more. Check it out and do make sure to add any projects / resources that are not listed already. Identifying & Mitigating AI Risks http://thedataexchange.media/identifying-and-mitigating-liabilities-and-risks-associated-with-ai/ Identifying & Mitigating AI Risks The Data Exchange podcast comes back with a fantastic conversation with Immuta Chief Legal Officer and BNH AI Managing Partner Andrew Burt. This podcast dives into core components of machine learning model governance, specifically from a legal professional perspective, diving into the intersection between these two fields, covering best practices and challenges of identifying and mitigating risks, as well as incident response and recovery in ML. ACM ByteCast with Donald Knuth http://learning.acm.org/bytecast ACM ByteCast with Donald Knuth The Association for Computing Machinery has released their first ByteCast podcast, kicking off with a fantastic conversation with Computer Science Legend Donald Knuth, largely known for his book, “The Art of Computer Programming”. In this podcast they discuss what led him to discover his love for computer science, as well as his outlook on how people learn technical skills, and how his mentorship has helped him write “human oriented” programs. Microsoft NLP Bias Research http://venturebeat.com/2020/06/01/microsoft-researchers-say-nlp-bias-studies-must-consider-role-of-social-hierarchies-like-racism/ Microsoft NLP Bias Research Following our post last week covering GPT-3, this week Microsoft comes with a very important topic, publishing a paper that covers the analysis of 146 NLP bias research papers. 
In this paper they dive into the issues and impact of some of this bias, as well as best practices required in the research field to ensure some of these undesired biases are identified and mitigated. Feature Selection with Continuous Data http://machinelearningmastery.com/feature-selection-with-numerical-input-data/ Feature Selection with Cont. Data Machine Learning Mastery has put together a great overview of an important sub-topic in feature selection, namely feature selection with numerical or continuous input data. In this post they cover a hands-on example using a diabetes prediction dataset, showcasing the challenges found in continuous inputs in the context of binary classification, and they teach how to evaluate the importance of numerical features using the ANOVA f-test and mutual information statistics. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/75.html [Updated] AI Conferences Gone Virtual in 2020 Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
78 Outlier & Anomaly Detection ML http://github.com/ethicalml/awesome-production-machine-learning/#outlier-and-anomaly-detection Outlier & Anomaly Detection ML The State of ML in Python 2020 http://arxiv.org/abs/2002.04803 The State of ML in Python 2020 Python continues to be the fastest growing language for scientific computing, data science and machine learning. Sebastian Raschka, together with Joshua Patterson and Corey Nolet, has put together an overview of the current state of machine learning in Python, diving into some of the main developments and technology trends in data science, machine learning and broader artificial intelligence. Applied Homomorphic Encryption http://simons.berkeley.edu/talks/practical-applications-homomorphic-encryption Applied Homomorphic Encryption Homomorphic Encryption is a fascinating privacy preserving machine learning technique that allows for processing to take place on encrypted data, which provides the same results as if the computation was processed on the plaintext. This of course comes at a computational cost, however the developments in these techniques are making them more accessible. In this talk Hao Chen from Microsoft Research dives into some of the practical applications of this technique, together with an overview of the technique itself. OpenAI NLP API Beta Launch http://beta.openai.com/ OpenAI NLP API Beta Launch We covered last week the launch of the new OpenAI GPT-3 release, a model that requires an unprecedented amount of computational power to even process an inference, let alone train. This week OpenAI has released a new commercial API for NLP tasks including semantic search, summarization, sentiment analysis, content generation, translation, and more. 
Continuous Delivery Podcast http://cd.foundation/podcast/ Continuous Delivery Podcast With the demands for large scale production machine learning systems, building, maintaining and operating these systems requires a set of cross-functional skills which range from data science to devops. In the context of the devops requirements, the trend of continuous delivery has become increasingly important with the emergence and adoption of frameworks like Kubernetes. The Continuous Delivery foundation has released an exciting initiative to dive into some of the topics that are critical in MLOps in their new CI/CD & DevOps Podcasts. Check it out. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/75.html [Updated] AI Conferences Gone Virtual in 2020 Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
79 ML Model Serving Ecosystem http://github.com/EthicalML/awesome-production-machine-learning#model-serving-and-monitoring Model Serving Ecosystem Demands for large scale production machine learning capabilities are growing at breakneck speeds, and the ecosystem of tools is growing at an equally fast pace. DKB ML Lead Engineer Lina Weichbrodt has made a fantastic contribution to our OSS Production ML Tools list, adding a new section with tools that specialise in large-scale frameworks for ML serving and monitoring. This is a great new addition which we’re quite excited about as it will allow the community to stay up to date with innovation in this field. GitHub Actions for MLOps http://github.blog/2020-06-17-using-github-actions-for-mlops-data-science/ GitHub Actions for MLOps Machine Learning Operations (or MLOps) enables Data Scientists to work in a more collaborative fashion, by providing testing, lineage, versioning, and historical information in an automated way. GitHub has put together an article that outlines how it’s possible to leverage the GitHub Actions feature to integrate parts of the data science and machine learning workflow with a software development workflow. Building OSS Tools for NLP Devs http://thedataexchange.media/building-open-source-developer-tools-for-language-applications/ Building OSS Tools for NLP Devs The Data Exchange podcast dives into conversation with spaCy and ExplosionAI Cofounder Matt Honnibal. In this great episode Matt shares insights related to the most popular NLP library spaCy, together with some of the other fantastic projects the ExplosionAI team is working on including the ML framework Thinc, their commercial data labelling tool Prodi.gy and beyond. 
Reinforcement Learning Applications http://anyscale.com/blog/enterprise-applications-of-reinforcement-learning-recommenders-and-simulation-modeling/ Reinforcement Learning Apps In recent years machine learning research – particularly research in deep learning – has had a profound impact on enterprise applications. We’re now also seeing more researchers studying RL and some of these investments will begin to show up in applications. In this post Chief Data Scientist Ben Lorica dives into enterprise applications of reinforcement learning, together with insightful metrics and facts of adoption in industry. NLP Search Transfer Learning at Scale http://aidemos.cs.toronto.edu/nds/paper.html NLP Transfer Learning at Scale Transfer learning has proven to be a successful technique to train deep learning models in the domains where little training data is available. The dominant approach is to pretrain a model on a large generic dataset such as ImageNet and finetune its weights on the target domain. This fascinating paper proposes an architecture for a large scale neural transfer search framework, together with a SaaS implementation of the service which can be tested. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/75.html [Updated] AI Conferences Gone Virtual in 2020 Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
80 [XAI] Explainable AI in Retail http://ai.science/e/xai-explainable-ai-in-retail--P5GX9My5lrZufnFvL9h6 [XAI] Explainable AI in Retail Great deep dive into Explainability in AI, which is a key component in production machine learning systems. In this article they dive into how machine learning algorithms are increasingly being used in high stakes decisions, steps to mitigate the risk of these black-box algorithms, a number of explainability techniques, a great collection of resources representing the application of explainability methods in a practical setting, and key insights on challenges applying these methods in the field. MLflow Joins Linux Foundation http://databricks.com/blog/2020/06/25/mlflow-joins-the-linux-foundation-to-become-the-open-standard-for-machine-learning-platforms.html MLflow Joins Linux Foundation Last week at the Spark + AI Summit 2020 Databricks announced that their flagship open source AI framework MLflow is becoming a Linux Foundation project! This is absolutely fantastic news for the open source and enterprise machine learning ecosystem as it will further the current topic of experiment management and deployment lifecycle. During this conference they also announced some core roadmap features that will be added into the MLflow library, together with some of the plans and stats behind this great decision. DVC 1.0 features for MLOps http://dvc.org/blog/dvc-1-0-release DVC 1.0 features for MLOps The Data Version Control (DVC) framework has released their 1.0 version! This is a great announcement for the MLOps ecosystem as this is one of the core tools providing full provenance and version control to machine learning assets, introducing sophisticated versioning capabilities for the machine learning constituents of each pipeline component, consisting of data, config and code. 
Designing Industrial Scale ML http://thedataexchange.media/designing-machine-learning-models-for-both-consumer-and-industrial-applications/ Designing Industrial Scale ML The Data Exchange Podcast comes back this week with a conversation with Christopher Nguyen, CEO of Arimo (a Panasonic company). Christopher is a former Engineering Director at Google, and was an early proponent of deep learning for enterprise applications. In this podcast they dive into the difference between working at an AI vendor company vs working at an AI buying company. They also dive into ML use cases for IoT and Industrial internet apps, and also cover key concepts in MLOps. Machine Learning Operations http://ml-ops.org/ Machine Learning Operations ML-ops.org is a new resource in the MLOps space covering some of the core principles on the topic of productionisation of machine learning across its full lifecycle. This resource includes a concise definition of MLOps, together with several deep dives into sub-topics of MLOps, including underlying motivations, design processes, workflows, principles and more. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/75.html [Updated] AI Conferences Gone Virtual in 2020 Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
81 Getting ML into Production http://veekaybee.github.io/2020/06/09/ml-in-prod/ Getting ML into Production A fantastic and comprehensible post by Automattic Machine Learning Engineer Vicki Boykis covering an end to end journey towards productionising an AI powered application. This great post provides a sneak-peek into some of the challenges and pain-points involved when developing some of the underlying components required to produce production ready machine learning services, which are able to power an AI application. Top Books on ML Feature Engineering http://machinelearningmastery.com/books-on-data-cleaning-data-preparation-and-feature-engineering/ Top Books on ML Feature Eng Machine Learning Mastery’s Jason Brownlee has put together a great post featuring 8 of the top books on data cleaning and feature engineering as recommended reads. Feature engineering is key in the machine learning lifecycle, as it enables better performance, more robust models, more explainable models (through domain knowledge abstraction), among other improvements. Adversarial ML in Industry http://arxiv.org/abs/2002.05646 MSFT Adversarial ML in Industry Microsoft researchers have published a research survey that provides insights on the state of adversarial ML in industry, through 28 interviews which outline key insights on the gaps in securing machine learning systems when viewed from the context of traditional software security development. This paper provides a deep dive from the perspective of both ML engineers and security incident responders, making it quite an interesting piece for practitioners involved in the development and design of production machine learning systems. Google on Neural Nets for Tables http://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html Google on Neural Nets for Tables Table information extraction in natural language processing is a well known and still not fully resolved challenge across both research and industry. 
Google has released an article that aims to showcase their achievements tackling this challenge by leveraging state-of-the-art NLP deep learning frameworks. This article provides both theoretical and practical insights on “TAPAS”, a weakly supervised table parsing approach that extends the BERT architecture to tackle this challenge via question answering techniques on (seemingly structured) text-based tables. Getting into a Causal Flow http://www.causalflows.com/introduction/ Getting into a Causal Flow Why should you care about Causal Inference? Most, if not all, business analytics questions, are inquiries of cause and effect. This fantastic article provides an introductory insight into causal inference with practical and intuitive examples. It also aims to provide an intuition on when this branch of techniques is “good enough” as well as more importantly “when they are not”. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/81.html This week in Issue #81: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
82 Full Stack Deep Learning Course http://course.fullstackdeeplearning.com/ Full Stack Deep Learning Course A fantastic resource which covers end to end concepts in productionisation of machine learning systems, taught by experts in the field. They have compiled insights focused around formulating the problem, estimating costs, finding & cleaning datasets, picking the right frameworks, assessing compute infrastructure, ensuring reproducibility, troubleshooting training and deploying models at scale. Software Engineers in ML http://towardsdatascience.com/what-software-engineers-can-bring-to-machine-learning-25f458c80e5 Software Engineers in ML Many production machine learning challenges are analogous to those of software engineering; this article puts together a high level overview of key insights that software engineers can bring to machine learning. This article dives into reproducibility as version control, model serving as devops and model drift as performance monitoring. Web Services vs Streaming for Inference http://towardsdatascience.com/web-services-vs-streaming-for-real-time-machine-learning-endpoints-c08054e2b18e Web Services vs Streaming in ML A very interesting evaluation of machine learning performance comparing REST vs Kafka APIs for the use case of streaming data. In this article we can see Playtika’s journey benchmarking ETL systems (such as Airflow) vs streaming systems (such as Kafka), and how they compare in service exhaustion, client starvation, handling failures, retries and performance. Continuous ML (CML) CI/CD http://dvc.org/blog/cml-release Continuous ML (CML) CI/CD Iterative.ai has announced a new OSS project in their data version control family, called continuous machine learning (CML). This CML framework dives into CI/CD for machine learning, introducing best practices for continuous delivery for model training, model evaluation, comparing ML experiments, and monitoring dataset changes. 
Papers With Code Methods http://paperswithcode.com/methods Papers With Code Methods The great ML resource PapersWithCode has released a new feature called “Methods”. Here they are now tracking 730+ building blocks of machine learning: optimizers, activations, attention layers, convolutions and much more. This allows the community to track usage over time and explore papers from a new perspective. [Updated] AI Conferences Gone Virtual in 2020 http://ethical.institute/mle/82.html This week in Issue #82: Featured OSS Production ML Libraries http://github.com/ethicalml/awesome-production-machine-learning Featured OSS Production ML Libraries
83 Building an Enterprise Deep Learning Stack http://medium.com/@Determined_AI/building-a-deep-learning-platform-21a4a9dd90fe Building an Enterprise DL Stack Determined AI has put together a fantastic article outlining how they leveraged open source and enterprise tools to build an end to end deep learning platform. They cover some of the motivations that led them to require end to end capabilities, dive into some of the key challenges, and provide a solution for each phase of the model lifecycle. 5 Key Features for ML Platforms http://anyscale.com/blog/five-key-features-for-a-machine-learning-platform/ 5 Key Features for ML Platforms ML Platform Designers need to meet current challenges and plan for future workloads. In this post by Anyscale, Ben Lorica and Ion Stoica cover some of the key components in the machine learning lifecycle, as well as how the different components of Ray tackle each of these pieces, including model training, model tuning, model serving and model monitoring. AI Dungeon Open World w GPT3 http://medium.com/@aidungeon/ai-dungeon-dragon-model-upgrade-7e8ea579abfe AI Dungeon Open World w GPT3 This week there has been a large surge of GPT3 case-studies showcasing the astonishing capabilities of this massive-scale new model. AI Dungeon has been an early adopter of the GPT-x algorithms, and has included a release to their open world, procedurally generated, smart AI text-based adventure game. The State of Apache Airflow http://softwareengineeringdaily.com/2020/06/10/apache-airflow-with-maxime-beauchemin-vikram-koka-and-ash-berlin-taylor/ The State of Apache Airflow Apache Airflow creator Maxime Beauchemin joins the Software Engineering Daily podcast to dive into the state of Airflow in 2020. Since Airflow’s creation, it has powered the data infrastructure at companies like Airbnb, Netflix, Lyft and beyond. It has had a huge and growing impact in the data pipeline space, and there’s a lot yet to come.
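
The dotted column names in the table (`issue`, `items.0.title`, `items.0.link`, ...) look like a nested record flattened into one row per issue. Below is a minimal sketch of that flattening, assuming each issue is stored as an `issue` number plus an `items` list of `title` / `link` / `content` objects. That layout is an assumption inferred from the column names, not a confirmed description of `newsletter_clean.json`, and the `flatten` helper and `example_issue` record are hypothetical.

```python
from typing import Any


def flatten(record: Any, prefix: str = "") -> dict:
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    if isinstance(record, dict):
        children = record.items()
    elif isinstance(record, list):
        children = enumerate(record)
    else:
        # Leaf value: store it under the path built so far.
        return {prefix: record}

    flat: dict = {}
    for key, value in children:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical record: the nested layout is assumed. Title and link are
# copied from issue 82 in the table; the content value is a placeholder.
example_issue = {
    "issue": 82,
    "items": [
        {
            "title": "Full Stack Deep Learning Course",
            "link": "http://course.fullstackdeeplearning.com/",
            "content": "(blurb text)",
        },
    ],
}

row = flatten(example_issue)
for column, value in row.items():
    print(f"{column}: {value}")
```

Each additional entry in `items` would add another `items.N.title`, `items.N.link`, `items.N.content` triple, which is consistent with the repeating column groups in the table header.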
