Friday, April 21, 2017

A New Dawn for Local Learning Methods?

The relentless improvement in the speed of computers continues. While some technical barriers to this progress have begun to emerge, the exploitation of parallelism has actually accelerated the rate of improvement for many purposes, especially in applied mathematical fields such as data mining.

Interestingly, new, powerful hardware has been put to the task of running ever more baroque algorithms. Feedforward neural networks, once trained over several days, now train in minutes on affordable desktop hardware. Over time, ever fancier algorithms have been fed to these machines: boosting, support vector machines, random forests and, most recently, deep learning illustrate this trend.

Another class of learning algorithms may also benefit from developments in hardware: local learning methods. Typical of local methods are radial basis function (RBF) neural networks and k-nearest neighbors (k-NN). RBF neural networks were briefly popular in the heyday of neural networks (the 1990s) since they train much faster than the more popular feedforward neural networks. k-NN is often discussed in chapter 1 of machine learning books: it is conceptually simple, easy to implement and demonstrates the advantages and disadvantages of local techniques well.

Local learning techniques usually have a large number of components, each of which handles only a small fraction of the set of possible input cases. The nice thing about this approach is that these local components largely do not need to coordinate with each other: The complexity of the model comes from having a large number of such components to handle many different situations. Local learning techniques thus make training easy: In the case of k-NN, one simply stores the training data for future reference. Little, if any, "fitting" is done during learning. This gift comes with a price, though: Local learning systems train very quickly, but model execution is often rather slow. This is because local models will either fire all of those local components, or spend time figuring out which among them applies to any given situation.

Local learning methods have largely fallen out of favor since: 1. they are slow to predict outcomes for new cases and, secondarily, 2. their implementation requires retention of some or all of the training data. This author wonders whether contemporary computer hardware may not present an opportunity for a resurgence among local methods. Local methods often perform well statistically, and would help diversify model ensembles for users of more popular learning algorithms. Analysts looking for that last measure of improvement might be well served by investigating this class of solutions. Local algorithms are among the easiest to code from scratch (see the sketch below). Interested readers are directed to "Lazy Learning", edited by D. Aha (ISBN-13: 978-0792345848) and "Nearest Neighbor Norms: NN Pattern Classification Techniques", edited by B. Dasarathy (ISBN-13: 978-0818689307).
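To illustrate how little there is to such methods, here is a minimal sketch of a brute-force k-NN classifier in MATLAB (the function and variable names are hypothetical, and this is a sketch rather than a production implementation):

function PredictedY = KnnPredict(TrainX, TrainY, TestX, k)
% Brute-force k-nearest neighbor classification (sketch).
% TrainX: training cases (one row per case), TrainY: their class labels,
% TestX: new cases to classify, k: number of neighbors to consult.
PredictedY = zeros(size(TestX, 1), 1);
for i = 1:size(TestX, 1)
    % Squared Euclidean distance from this test case to every stored training case
    Differences = TrainX - repmat(TestX(i, :), size(TrainX, 1), 1);
    Distances = sum(Differences .^ 2, 2);
    % Majority vote among the k nearest training cases
    [~, NearestIndices] = sort(Distances);
    PredictedY(i) = mode(TrainY(NearestIndices(1:k)));
end

Note that there is no training step at all beyond retaining TrainX and TrainY, and that every prediction requires a pass over the entire training set: precisely the fast-train, slow-execute profile described above. The distance calculation inside the loop is also an obvious candidate for the sort of parallel hardware discussed earlier.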


Wednesday, March 29, 2017

Data Analytics Summit III at Harrisburg University of Science and Technology

Harrisburg University of Science and Technology (Harrisburg, Pennsylvania) has just finished hosting Data Analytics Summit III. This is a multi-day event featuring a mix of presenters from the private sector, government and government-related businesses, and academia, spanning research, practice and more visionary ("big picture") topics. The theme was "Analytics Applied: Case Studies, Measuring Impact, and Communicating Results".

Regrettably, I was unable to attend this time because I was traveling for business, but I was at Data Analytics Summit II, which was held in December of 2015. If you haven't been: Harrisburg University of Science and Technology does a nice job hosting this event. Additionally, (so far) the Data Analytics Summit has been free of charge, so there is the prospect of free food if you are a starving grad student.

The university has generously provided links to video of the presentations from the most recent Summit:

http://analyticssummit.harrisburgu.edu/


Video links for the previous Summit, whose theme was unstructured data, can be found at the bottom of my article, "Unstructured Data Mining - A Primer" (Apr-11-2016) over on icrunchdata:

https://icrunchdata.com/unstructured-data-mining-primer/

I encourage readers to explore this free resource.


Friday, March 17, 2017

Geographic Distances: A Quick Trip Around the Great Circle

Recently, I wanted to calculate the distance between locations on the Earth. Having found a handy solution, I thought readers might be interested. In my situation, location data included ZIP codes (American postal codes). Also available to me was a look-up table of the latitude and longitude of the geometric centroid of each ZIP code. Since the areas identified by ZIP codes are usually geographically small, and making the "close enough" assumption that this planet is perfectly spherical, trigonometry will allow distance calculations which are, for most purposes, precise enough.

Given the latitude and longitude of cities 'A' and 'B', the following line of MATLAB code will calculate the distance between the two coordinates "as the crow flies" (technically, the "great circle distance"), in kilometers:

DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)));

Note that latitude and longitude are expected as decimal degrees, and that the constant 111.12 is the approximate number of kilometers in one degree of arc along the Earth's surface (60 nautical miles). If your data is in degrees/minutes/seconds, a quick conversion will be needed, as shown below.
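For instance, a degrees/minutes/seconds reading (the values here are hypothetical) is converted to decimal degrees by dividing the minutes by 60 and the seconds by 3600:

% Example conversion: 40 degrees, 39 minutes, 51.4 seconds
Degrees = 40;  Minutes = 39;  Seconds = 51.4;
DecimalDegrees = Degrees + Minutes / 60 + Seconds / 3600

Southern latitudes and western longitudes are negated, as in the city examples below.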

I've checked this formula against a second source and quickly verified it using a few pairs of cities:


% 'A' = New York
% 'B' = Atlanta
% Random on-line reference: 1202km
LatA = 40.664274;
LongA =  -73.9385;
LatB = 33.762909;
LongB = -84.422675;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =

        1202


% 'A' = New York
% 'B' = Los Angeles
% Random on-line reference: 3940km (less than 0.5% difference)
LatA = 40.664274;
LongA =  -73.9385;
LatB = 34.019394;
LongB = -118.410825;
DistanceKilometers = round(111.12 * acosd(cosd(LongA - LongB) * cosd(LatA) * cosd(LatB) + sind(LatA) * sind(LatB)))

DistanceKilometers =

        3955



References:

"How Far is Berlin?", by Alan Zeichick, published in the Sep-1991 issue of "Computer Language" magazine. Note that Zeichick credits as his source an HP-27 scientific calculator, from which  he reverse-engineered the formula above.

"Trigonometry DeMYSTiFieD, 2nd edition", by Stan Gibilisco (ISBN: 978-0-07-178024-7)