Programming, languages, and software
Programming
Machine learning requires programming. Nobody else can do it for you. You will have to learn how to program: this means not just hacking together some code, but writing software that will work under all conditions and can be easily maintained. Some advice based on hard experience gained over the years:
Learn at least two programming languages. There is no one best language for all applications. Some are better at data handling; others are better for computationally-intensive applications.
Take a software engineering approach to any code you write. It may take slightly longer to develop the original code, but the additional time will more than repay itself in reducing the effort needed for testing and debugging. And when you come back to code that you wrote six months or a year ago, you will still be able to understand what you wrote.
Learn object-oriented programming. Object-oriented programming is the modern approach to software engineering. Among other things it provides mechanisms for data abstraction, data hiding, and developing reusable code.
Tie in to the user communities to get answers to your problems and see examples of how other programmers work. Stack Overflow [64] is a particularly good source of information for software and programming issues. Code Project [30] provides a number of articles and applications, many of which deal with machine learning, written in a variety of programming languages.
Avoid language wars! A lot of ink – and electrons – has been wasted on debating which is the ‘best’ language. (Stack Overflow will now delete any conversations with themes like ‘my language is better than yours because …’.) Choose a language or languages that work best for you and learn them. (Of course, if you are looking for jobs in machine learning you will need to pay attention to what is being sought in the job market.) You will find that once you learn one language well it is not too difficult to pick up others.
Two very good books on programming are those by McConnell [50] and Hunt and Thomas [40]. McConnell’s book is good to have at one’s side when programming.
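The object-oriented mechanisms mentioned in the advice above (data abstraction, data hiding, and reusable code) can be illustrated with a minimal Python sketch. The class name RunningStats and its methods are hypothetical, chosen only for this example; the update rule is the standard one-pass (Welford) method for mean and variance.

```python
class RunningStats:
    """Running mean and sample variance of a stream of numbers.

    Hypothetical example class: callers see only add(), count, mean and
    variance (data abstraction); the accumulators are internal.
    """

    def __init__(self):
        # Leading underscores mark internal state (data hiding by convention).
        self._count = 0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, value):
        """Fold one observation into the summary (Welford's update)."""
        self._count += 1
        delta = value - self._mean
        self._mean += delta / self._count
        self._m2 += delta * (value - self._mean)

    @property
    def count(self):
        return self._count

    @property
    def mean(self):
        return self._mean

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been added."""
        if self._count < 2:
            return 0.0
        return self._m2 / (self._count - 1)


# The same class can be reused for any numeric column (reusable code).
stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(x)
print(stats.count, stats.mean, stats.variance)
```

Because callers never touch the accumulators directly, the internal update could later be replaced (for example, by a batched version) without changing any code that uses the class.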
C – C is a systems programming language that was developed at Bell Labs in the late 1960s and early 1970s. C is a procedural language, unlike object-oriented languages like C++ and Java. It is one of the most widely used languages in the world, and is used on a number of different platforms. The GNU Compiler Collection [35] and Microsoft Visual Studio [54] both contain C compilers. C can be used to write efficient numeric and data processing routines that can be called from languages like Python or R. Because it provides access to the internal workings of a computer, it is useful to think of C as essentially a high-level assembly language.
C++ – C++ is an object-oriented language based on the C language. C++ compilers are available for Windows and Linux platforms: Microsoft Visual Studio [54] includes C++, and the GNU Compiler Collection [35] includes C++ for both Linux and Windows. C++ can produce fast-executing code, but it has some significant weaknesses that have limited its popularity for machine learning. In particular: 1) there is no automatic garbage collection, so memory management is left to the programmer, who must take care to avoid memory leaks, and 2) C++ templates are not true generics: the compiler generates separate code for each type a template is used with, which can lead to significant ‘code bloat’.
C# – C# was developed as Microsoft’s flagship language for developing Windows applications, although versions are now available for Linux. It is an object-oriented language that also contains a number of functional programming features. Like Java, C# source code is compiled to intermediate code that runs on a virtual machine (.NET). Although not particularly designed for machine learning, C# can be quite useful for processing data in preparation for machine learning. Among the nice features of C# are: 1) implementation of true generics, which means that generics can be part of a dynamic link library, 2) a large set of generic container classes (lists, stacks, dictionaries, hash tables, etc.), and 3) LINQ, a data query facility that can be used instead of SQL for many applications and that can be used to query a number of different types of data sources (databases, spreadsheets, text files, in-memory arrays, etc.). C# is available as part of Microsoft Visual Studio [54].
Programming languages
There are a number of programming languages that are being used in machine learning; R and Python are two of the top languages in data engineering. C, C++, and C#, described above, are not designed for machine learning, but can provide useful capabilities working in conjunction with languages like those listed below. Here is a list of some of the languages used in machine learning, in no particular order:
R – R is an open-source implementation of the S language for data analysis, which was developed at Bell Labs. It has excellent capabilities for statistics, data analysis, and graphics. R is especially popular in the academic environment; many statistics texts published these days are based on R. The ggplot2 package provides one of the best data visualization capabilities in any language. There are over 10,000 add-on packages available for R, many of which deal with machine learning. The R community provides excellent support for the language, and new versions come out regularly. R is available through the R web site [59].
Python – Python is an object-oriented language that is rapidly becoming popular in the commercial world because of machine-learning-oriented libraries such as SciPy, NumPy, and Pandas. Python can be downloaded from the Python web site [58] or as part of the Anaconda distribution [27], which also automatically includes SciPy, NumPy, and Pandas.
Java – Java is billed as a ‘write once, run anywhere’ object-oriented language that is one of the most widely used languages in enterprise applications. Java source code is compiled to an intermediate language that runs on the Java Virtual Machine. The language is fairly similar to C++, although it has its own idioms. Java is maintained by Oracle, but resides on its own web site [42].
Scala – Scala was designed to address some of the criticisms of Java. It combines functional and object-oriented programming, and is designed to work with both Java and JavaScript. It has become popular with data scientists and machine learning specialists. Scala can be found on its web site [62].
Julia – Julia is a high-level general-purpose programming language that was developed to carry out high-performance numerical analysis. It is a fairly new language – version 1.0 was released in August 2018 – but has attracted a number of high-profile users whose need for computational speed is paramount. Julia can be downloaded from its web site [43].
Octave – Octave is an open-source platform that is largely compatible with MATLAB and has most of MATLAB’s functionality. Like MATLAB, it is particularly useful for matrix computations. Some machine learning developers use Octave to prototype their algorithms, which they later implement in another language (e.g., Python, R) for more efficient running. Both Linux and Windows versions of Octave are available [36].
Lisp – Lisp is a family of programming languages originating in the late 1950s, of which the most popular dialects today are Clojure and Common Lisp. Lisp has the longest history of the languages on this list, and it has had considerable influence on the development of R, Python, and JavaScript. It is dynamically typed. It has been used in large AI projects such as Macsyma, DART, and CYC, and by companies such as Grammarly. Lisp is used in the AI field because of its strong prototyping capabilities and its support for symbolic expressions.
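As a rough illustration of what support for symbolic expressions means, the sketch below (written in Python rather than Lisp) represents a formula as nested tuples in Lisp-like prefix form, so that a program can both evaluate the formula and transform it, here by differentiating it. The names OPERATORS, evaluate, and differentiate are hypothetical and the rules cover only binary sums and products; this is not how any particular Lisp system is implemented.

```python
import operator

# An expression is a number, a variable name (str), or a tuple
# (operator_symbol, left, right), mirroring Lisp's (+ a b) prefix form.
OPERATORS = {"+": operator.add, "*": operator.mul}


def evaluate(expr, env):
    """Compute the numeric value of an expression given variable values."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):
        return env[expr]
    symbol, left, right = expr
    return OPERATORS[symbol](evaluate(left, env), evaluate(right, env))


def differentiate(expr, var):
    """Return a new expression: the derivative of expr with respect to var."""
    if isinstance(expr, (int, float)):
        return 0
    if isinstance(expr, str):
        return 1 if expr == var else 0
    symbol, left, right = expr
    if symbol == "+":  # sum rule
        return ("+", differentiate(left, var), differentiate(right, var))
    if symbol == "*":  # product rule
        return ("+",
                ("*", differentiate(left, var), right),
                ("*", left, differentiate(right, var)))
    raise ValueError(f"unsupported operator: {symbol}")


# x*x + 3*x, written as data rather than as executable code
formula = ("+", ("*", "x", "x"), ("*", 3, "x"))
derivative = differentiate(formula, "x")

print(evaluate(formula, {"x": 2}))      # 10
print(evaluate(derivative, {"x": 2}))   # 7, i.e. 2*x + 3 at x = 2
```

The derivative is returned unsimplified; a fuller symbolic system would add simplification rules operating on the same tuple structure.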
Prolog – Prolog is a high-level programming language based on formal logic. Unlike traditional programming languages, in which a program performs a sequence of commands, a Prolog program consists of a list of facts and rules and is run by posing queries, which the system tries to satisfy by solving logical formulas. It is therefore both a rule-based and a declarative language. Prolog offers a tree-based data structuring mechanism, automatic backtracking, and pattern matching; combining these mechanisms provides a flexible framework for working with artificial intelligence.
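The interplay of facts, rules, pattern matching, and backtracking can be sketched in a few lines of Python. This is only a toy under stated assumptions, not a Prolog implementation: it handles flat facts and a single conjunctive rule, and the names FACTS, is_variable, match, solve, and grandparents_of are hypothetical. The rule being mimicked would read grandparent(X, Z) :- parent(X, Y), parent(Y, Z) in Prolog.

```python
# Facts: parent(tom, bob). parent(bob, ann). parent(bob, pat).
FACTS = [
    ("parent", "tom", "bob"),
    ("parent", "bob", "ann"),
    ("parent", "bob", "pat"),
]


def is_variable(term):
    # Prolog convention: names starting with an uppercase letter are variables.
    return isinstance(term, str) and term[:1].isupper()


def match(pattern, fact, bindings):
    """Match one goal against one fact; return extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)  # copy, so a failed match leaves no trace
    for p, f in zip(pattern, fact):
        if is_variable(p):
            # Bind the variable, or check consistency with an earlier binding.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def solve(goals, bindings=None):
    """Yield every set of variable bindings that satisfies all goals."""
    bindings = {} if bindings is None else bindings
    if not goals:
        yield bindings
        return
    first, rest = goals[0], goals[1:]
    for fact in FACTS:
        extended = match(first, fact, bindings)
        if extended is not None:
            # If the remaining goals fail, the loop simply moves on to the
            # next fact: this is the backtracking step.
            yield from solve(rest, extended)


def grandparents_of(child):
    # Rule body: parent(X, Y), parent(Y, child)
    goals = [("parent", "X", "Y"), ("parent", "Y", child)]
    return [b["X"] for b in solve(goals)]


print(grandparents_of("ann"))  # ['tom']
```

Note that the program never states how to search; it only lists facts and a rule, and the generic solve routine does the rest, which is the sense in which such a language is declarative.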
The purpose of this section is to provide an overview of software libraries that will help the reader to implement the AI/ML techniques described in the previous chapters. This list is by no means comprehensive, but includes a selection of the most popular libraries for the topics covered by this primer.
CUDA is a parallel computing platform developed by NVIDIA that allows for general-purpose computing with Graphics Processing Units (GPUs). While CUDA itself is not a machine learning library, it is frequently used for computationally intensive steps in AI and ML work, such as the training of neural networks.
DSSTNE is a library developed by Amazon for building Deep Learning models on sparse datasets. It is written in C++ and uses GPUs.
Keras is a Python-based tool to support convolutional neural networks and deep learning. It can now work in conjunction with TensorFlow.
Microsoft Cognitive Toolkit contains tools for a number of different machine learning applications including neural networks and time series.
mlpack is a large library containing a variety of algorithms for classification, regression, clustering, geometry, preprocessing, and transformations. It is written in C++ and also has Python bindings.
OpenAI Gym is a very general toolkit for developing and comparing reinforcement learning algorithms. It is written in Python.
OpenCV is a large library for computer vision and machine learning. It includes a wide range of algorithms for object recognition, object tracking, image processing, and other purposes. The library is written in C++ and also has interfaces in C, Python, Java, and MATLAB.
StellarGraph is a library for machine learning on graph-structured data. It is written in Python and uses the Keras library.
TensorFlow was developed by Google. It includes capabilities for deep learning and reinforcement learning. APIs are available in a number of languages, including Python (the most fully developed) and C++.
ThunderSVM is a support-vector machine library that supports both GPU and CPU computing. It is written in C++ and also has interfaces in Python, R, and MATLAB.
XGBoost is a general purpose gradient boosting library. It is written in C++, supports distributed computing, and also has interfaces in Python, R, Java, and Scala.