Crunch Big Data on Your Laptop With Polars Streaming
Polars streaming avoids out-of-memory errors in large cross joins by processing data in chunks. Learn how to run 27M-row workloads on a single machine.
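The chunking idea can be sketched without any library. The snippet below is a plain-Python illustration, not the Polars API: `chunks`, `count_pairs_within`, the threshold filter, and the toy data are all hypothetical names and values chosen for this example. It joins one slice of the left table against the right table, reduces that slice to a count, and only then reads the next slice, so peak memory follows the chunk size rather than the size of the full cross product.

```python
from itertools import islice, product


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def count_pairs_within(left, right, threshold, chunk_size=1000):
    """Count (l, r) pairs with |l - r| <= threshold, one left chunk at a time."""
    total = 0
    for batch in chunks(left, chunk_size):
        # Each chunk is joined and reduced before the next one is read,
        # so the full len(left) * len(right) product is never held in memory.
        total += sum(1 for l, r in product(batch, right) if abs(l - r) <= threshold)
    return total


# Toy data (hypothetical): 5,000 left rows crossed with 500 right rows.
left = range(5_000)
right = list(range(0, 5_000, 10))
result = count_pairs_within(left, right, threshold=2)
print(result)
```

The answer does not depend on `chunk_size`; the setting only trades memory for the number of passes over `right`.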
Refactoring an RCE machine learning algorithm from Pandas lambda functions to the Polars expression API reduced execution time from six minutes to fourteen seconds. Polars cross joins, columnar operations, and Apache Arrow drive a 25x speedup.
Build a lightweight capacity planning model in Python Pandas using flow diagrams, throughput estimates, and GROUP BY operations to estimate CPU requirements and infrastructure cost. Apply Operations Research concepts to size a simple web...
Identify investment-grade copies of sealed Super Mario Bros. 3 variants through Python Pandas, Seaborn, and auction sales data. Normalize prices across market cycles and compare box grade, seal grade, release variant, and sale date to rank...
Explore practical Pandas, Seaborn, and Matplotlib techniques for exploring machine learning datasets using the UCI Abalone database. Includes histograms, KDE plots, boxplots, correlation heatmaps, PCA, regression plots, and multidimensional...
Refactor a Reduced Coulomb Energy neural network implementation from Matlab into R Tidyverse with pipes, tibbles, functional operations, and vectorized distance calculations. Compares loop-based Matlab patterns with tidy data workflows for...
Compare Howard Roark and Rodion Raskolnikov through sentiment scoring, emotional intensity, and thematic analysis.
Train a TensorFlow and Keras NLP model to identify and extract Raskolnikov’s dialogue and internal monologues from Crime and Punishment. Perform speaker-level literary analysis via transfer learning, BERT classification, Pandas processing, and...
In part one of this two-part series, I developed a Reduced Coulomb Energy (RCE) classifier in Python. RCE calculates hit footprints around training data and uses the footprints to classify test data. RCE draws a circle around each labeled training...
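A minimal sketch of the footprint idea follows, under stated assumptions rather than as the code from the post: each training point gets a radius equal to the distance to its nearest point of a different class, capped at a hypothetical `max_radius`, and a test point is labeled only when every footprint that covers it belongs to a single class. The function names, the cap, and the toy points are illustrative.

```python
import math


def fit_rce(points, labels, max_radius=1.0):
    """Assign each training point a radius reaching toward its nearest rival-class point."""
    radii = []
    for p, lab in zip(points, labels):
        rival = [math.dist(p, q) for q, other in zip(points, labels) if other != lab]
        # Cap the footprint so isolated points do not claim unbounded territory.
        radii.append(min(rival + [max_radius]))
    return radii


def classify_rce(x, points, labels, radii):
    """Return the one class whose footprints cover x, else 'ambiguous'."""
    hits = {lab for p, lab, r in zip(points, labels, radii) if math.dist(x, p) < r}
    return hits.pop() if len(hits) == 1 else "ambiguous"


# Toy training set (hypothetical): two classes along the x-axis.
points = [(0, 0), (1, 0), (4, 0), (5, 0)]
labels = ["a", "a", "b", "b"]
radii = fit_rce(points, labels, max_radius=3.5)

print(classify_rce((0.5, 0), points, labels, radii))   # covered only by class "a"
print(classify_rce((2.5, 0), points, labels, radii))   # covered by both classes
print(classify_rce((10, 0), points, labels, radii))    # covered by no footprint
```

Points that fall inside footprints of more than one class, or inside none, come back as `"ambiguous"` instead of being forced into a class.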
In Pattern Classification Using Neural Networks (IEEE Communications Magazine, Nov. 1989), Richard P. Lippmann provides the following definition of Exemplar neural net classifiers: [Exemplar classifiers] perform classification based on the identity...
Data Scientists need skill and experience to create useful Machine Learning (ML) models. ML activities include tool selection, training logistic decisions (move data to training vs. train in-situ), data acquisition, data cleaning, data quality...
Good vs. Evil: Two Opposing Paths Taken by a Similar Genius. This blog post compares Henry David Thoreau's Walden and Ted Kaczynski's Unabomber Manifesto. To compare these two works, I use both a modern Natural Language...
I started my AI/ML journey in 2011 with a laptop model, a term which indicates a measure of size. Laptop models, by definition, do not exceed the compute, memory and storage resources of a single piece of hardware. The laptop model approach works...
Model optimization on traditional Artificial Intelligence and Machine Learning (AI/ML) platforms requires considerable Data Architect expertise and judgement. These ML platforms require the Architect to choose from dozens of available training...
In this demonstration we continue to use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem. Today we use Principal Component Analysis (PCA) to...
In this demonstration we will use Keras and TensorFlow 2.3 to explore data, normalize data, and build both a linear model and Deep Neural Network (DNN) to solve a regression problem. TensorFlow Core 2.3 includes tf.keras, which provides the high...
FastAI provides Jupyter notebooks to wrangle data, train models, optimize models and then serve models. I recommended FastAI to my Data Scientist friends and they found the FastAI Jupyter layout and workflow both cumbersome and confusing. GCP...
Fastai provides helper functions on top of PyTorch to help us wrangle, clean, and process data. In this HOWTO we will accomplish the following: deploy an AWS g3.8xlarge instance; compile and install NVIDIA drivers on our g3.8xlarge instance; use a...
Introduction: Machine Learning engineers use Probabilistic Neural Networks (PNNs) for classification and pattern recognition tasks. PNNs use a Parzen Window along with a non-negative kernel function to estimate the probability distribution function...
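As a rough illustration of the Parzen Window step, the sketch below estimates a one-dimensional density for each class by averaging Gaussian kernels centered on that class's training samples, then picks the class with the highest estimate. The Gaussian kernel, the bandwidth `sigma=0.5`, equal class priors, the function names, and the toy data are all assumptions made for this example.

```python
import math


def parzen_density(x, samples, sigma):
    """Average of Gaussian kernels centered on each sample (1-D Parzen estimate)."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    kernel_sum = sum(norm * math.exp(-0.5 * ((x - s) / sigma) ** 2) for s in samples)
    return kernel_sum / len(samples)


def pnn_classify(x, samples_by_class, sigma=0.5):
    """Score x under each class density and return (best_label, all_scores)."""
    scores = {label: parzen_density(x, s, sigma) for label, s in samples_by_class.items()}
    # Equal class priors are assumed, so the largest density wins.
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical): one feature, two classes.
training = {"low": [1.0, 1.2, 0.8], "high": [3.0, 3.3, 2.9]}

label, scores = pnn_classify(1.1, training)
print(label, scores)
```

A smaller `sigma` makes each kernel narrower and the estimate spikier; a larger one smooths the class densities together.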
Introduction: I investigate the effectiveness of a Reduced Coulomb Energy (RCE) Neural Network on the classification of the University of California, Irvine (UCI) Bupa liver disorder data set. I investigate seven (7) different versions of the data...
Caution! Math Ahead! For the Math-phobic, I explain how I crunch the test results in a math-free, simple and focused blog post here. I use math here, so this may be your last chance to escape! Still with me? Excellent! The bullets below outline...
Do you have big data chops? Quick, what do these three things have in common? Yankees, Giants, Rangers, Knicks What about these? Beatles, Monkees, Beach Boys Do you have an answer for each? "New York," for example, for the first list and "Rock...
In this blog post I will revisit the first piece of code I wrote with the R programming language, back in the early part of this decade. Coming from an Octave/MATLAB background, I really enjoyed the functional nature of R. I imagined flinging...
Why do we need yet another personality test? Because without "big data" technologies, online "personality tests" suffer these problems: With most tests, we quickly see a pattern to the answers, and can easily steer the test to the outcome we want...
For my (post) master's project on machine learning and big data infrastructure, I thought it would be fun to acquire my own data set. Last semester I traded available services and architected a scalable (big data) Internet-facing survey...