Our CEO, Chander Dhall, became fascinated with machine learning over a decade ago. With a master’s degree in computer science, he has always kept up with academia even though the company primarily works on projects for mid-size and large Fortune 500 corporations. Having been recognized by both Microsoft (Microsoft Most Valuable Professional for close to a decade) and Google (Google Developer Expert), he has been fortunate to interact and share knowledge with the people who create these technologies.
Since Cazton is a big supporter of open source, we have experts who have contributed to open source machine learning libraries as well as data engineering libraries. Our team consists of experts who hold PhDs and master’s degrees in data science and machine learning, are open source contributors with their own ML libraries, and have years of experience in the industry. That’s one reason our team was working on serious machine learning projects long before our competitors.
Machine learning projects require serious understanding of data. Cazton data scientists with years of successful experience in the industry have worked in multiple business domains. Our machine learning experts not only evaluate the data, but are adept at all facets of data engineering. If you are serious about doing machine learning properly, please check out our expert machine learning team of PhDs, as well as Microsoft-awarded Most Valuable Professionals and Google Developer Experts.
Think of Amazon’s Alexa. Do you like it? Even at the time of writing this article, it doesn’t understand context. It’s not conversational. It treats most commands from the user as mutually exclusive. For the most part, it’s not very good at correlating the commands subsequently given to it. So what’s the point? Machine learning is not that straightforward. In order to understand it, we need to understand TensorFlow, which is a deep learning library created and open-sourced by Google. We also need to understand machine learning and, most importantly, deep learning, which is a subset of ML. If you are new to Machine Learning please watch this 3-min video demonstration.
With the current popularity of ML/AI, having a serious data science practice has been one of the top priorities for most corporations. However, there is currently a shortage of data scientists. To be more accurate, there is a shortage of good data scientists. Many people have tried to take advantage of the latest hype and game the system. Why is data science so complex? In order to be successful at data science and machine learning, there are usually major steps we need to follow, at least at a very high level:
- Data Collection: A good data collection process includes ingesting data through multiple sources. Most projects we work on have data coming in from many sources. At first glance, this seems like a piece of cake. But we have yet to find a project in which the data had been collected ideally. It’s actually a more complex process than most people tend to think. We are talking about instrumentation, logging, and external as well as user-generated data. Even in companies which have moved to a uniform API strategy, data could be very fragmented and may not be in the best format needed for exploration.
- Storage and data flow: Once we have the data, we need to be able to store it in the best format possible. As we know, data could be structured (as in a relational database), unstructured (typically text and multimedia content) and semi-structured (like XML and JSON). In this stage, we need to take care of the infrastructure to store data optimally. We need to make sure we have the right pipelines and the data flow is reliable.
- ETL (Extract, Transform and Load): Extract is the process of reading data from a data source. This could be an RDBMS like PostgreSQL, SQL Server, Oracle etc., a document database like MongoDB, Couchbase etc., a search engine like Elasticsearch, Solr, Lucene etc., a logging tool like Splunk, Logstash, System Center etc. or even a caching engine like Redis, Memcached etc. Transform is the process of converting the data into the form it needs to be in so it can be placed into the database of choice. Imagine getting data from all the sources and moving it to Hadoop for big data processing. Load is the process of writing the data into the target database.
- Clean up and anomaly detection: This stage involves anomaly detection, a technique used to identify unusual patterns, called outliers, that do not conform to expected behavior. In deep learning, it’s important to train the machine using training data. It’s important to detect and remove anomalies on top of cleaning up redundant data. If anomalies are not removed, the training data would be faulty and hence the machine may have a bias and not provide the best results. There are three kinds of anomalies that our business may need to detect. They are point anomalies, contextual anomalies and collective anomalies.
- Point anomalies: Imagine a credit card fraud where the user buys a $50,000 business class ticket when he has a history of never buying a flight ticket of more than $5,000. This is a good example of a point anomaly.
- Contextual anomalies: A good example of this would be spending during Thanksgiving or Christmas, which may be way higher than the average spend at other times. The same amount could be normal in that context and anomalous in another.
- Collective anomalies: This could be the result of detecting multiple anomalies and figuring out a pattern that is not usual. A good example would be analyzing network traffic for collective anomalies that could indicate a hack.
- Representation: In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. A machine learning model can't directly see, hear, or sense input examples. Instead, we need to create a representation of the data. This representation is provided to the model and ideally highlights the key qualities of the data in question. In order to create a well-trained model, we must choose the set of features that best represents the data. The phase of representation means creating feature vectors that could be understood by an ML library like TensorFlow.
- Aggregation and Training: Once the data is cleaned up, we can use it to train our machine learning model. This stage includes aggregation of training data. During this stage, we use analytics and metrics to determine which set of training data will produce a model that can later solve real-life problems with production data. Usually we also set aside what’s called test data, a small portion of the available data that is held out from training and used to test and verify the results.
- Evaluation: Evaluation is a critical stage in ML. This is the stage in which we decide to prefer one model over another. The mean squared error between a model’s output and the observed output is one example of an evaluation function. Another example is the likelihood, that is, the estimated probability of the observed data under the model. Different evaluation functions imply somewhat different heights at each point on the same landscape of candidate models.
- Optimization: Optimization is how we search the space of represented models to obtain better evaluations. Stochastic gradient descent and genetic algorithms are two (very) different ways of optimizing a model class. This involves traversing the current landscape to find the ideal model. After training the model, it’s quite possible that we may no longer be able to recover exactly how it was optimized. However, we can log the relevant data while training that can explain the trajectory. (This video shows an illustration that will clarify this concept).
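Several of the steps above can be sketched in a few lines of code. The following is a minimal illustration, not our production approach: it flags a point anomaly with a median-based score, represents synthetic data as feature vectors, holds out test data, optimizes a linear model with stochastic gradient descent, and evaluates it with mean squared error. All function names, the threshold of 3.5, the learning rate and the synthetic data are assumptions chosen for the example, and NumPy stands in for a full ML library such as TensorFlow.

```python
import numpy as np


def flag_point_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold


def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as test data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = max(1, int(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


def mean_squared_error(y_true, y_pred):
    """Evaluation function: average squared difference."""
    return float(np.mean((y_true - y_pred) ** 2))


def fit_linear_sgd(X, y, learning_rate=0.05, epochs=200, seed=0):
    """Optimize a linear model one example at a time (stochastic gradient descent)."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(X.shape[1])
    bias = 0.0
    history = []  # training MSE per epoch, logged to explain the trajectory
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            error = X[i] @ weights + bias - y[i]
            weights -= learning_rate * error * X[i]
            bias -= learning_rate * error
        history.append(mean_squared_error(y, X @ weights + bias))
    return weights, bias, history


# Clean up: a hypothetical history of ticket purchases with one extreme value.
amounts = np.array([1200, 900, 4800, 2500, 50000, 3100, 1500])
print("point anomalies:", amounts[flag_point_anomalies(amounts)])

# Representation: each example becomes a feature vector (one feature here).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0.0, 0.1, size=200)  # synthetic target

# Training, optimization and evaluation on held-out test data.
X_train, y_train, X_test, y_test = split_train_test(X, y)
weights, bias, history = fit_linear_sgd(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, X_test @ weights + bias))
```

The `history` list is one simple way to log relevant data during training, so that the path the optimizer took can be inspected afterwards even though the final weights alone do not reveal it.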
Once all of this is done and the model is trained, we can use it to solve real world problems in the domain of our choice. Our team works in different industries like airlines, finance, insurance, engineering, healthcare, tech and manufacturing just to name a few. Understanding the domain helps identify the right approach in using machine learning to solve complex problems. We are fortunate to have the priceless experience of knowing what works and what doesn’t work given the breadth of domains we work in.
This helps us bring best practices and unique perspective to future projects. We use various machine learning and data engineering technologies to make the projects successful. Some of these technologies are TensorFlow, Keras, Scikit-learn, Microsoft Cognitive Toolkit, Theano, Caffe, Torch, Kafka, Hadoop, Spark, Ignite and many others. The great news is that our team works on all major cloud platforms including Microsoft, Google and AWS and has experience with VMware and Pivotal so we can work with the existing tech stack of the client. Machine learning can be used to solve a wide variety of problems including:
- Image Recognition: Imagine identifying a cab driver based on his real time picture using machine learning to verify the resemblance with the pictures in the database.
- Natural language processing: Imagine creating a chat bot that understands the question and uses machine learning to accept a pizza order from a customer online. With speech to text, we can even allow the user to just provide voice instructions to the bot.
- Sentiment analysis (video or audio): Imagine conducting a company meeting with recorded video and using machine learning to provide a sentiment analysis.
- Recommendation engine: Rather than creating rules on what to recommend, imagine using machine learning to create recommendations based on users’ purchase habits and activity on an eCommerce website.
- Search Engine: Imagine creating something as complex as a search engine like Google or Bing. This might not be easy and may involve tens of thousands of models, but it has been done and is surely doable. Remember, Google and Bing have a need to parse almost unlimited data. However, it’s a lot easier to solve the same problems on limited sets of data and probably get similar or even better results.
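As a small illustration of the recommendation engine idea above, here is a hedged sketch of item-based collaborative filtering. Instead of hand-written rules, it scores items by how similar their purchase patterns are to what a user has already bought. The `recommend` function, the cosine similarity measure and the tiny purchase matrix are assumptions made for the example; a real system would use far more data and signals.

```python
import numpy as np


def recommend(purchases, user_index, top_n=2):
    """Rank items a user has not bought by similarity to items they did buy.

    purchases: 2-D array, rows are users, columns are items, 1 means "bought".
    """
    purchases = np.asarray(purchases, dtype=float)
    # Cosine similarity between item columns.
    norms = np.linalg.norm(purchases, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for never-bought items
    normalized = purchases / norms
    similarity = normalized.T @ normalized
    np.fill_diagonal(similarity, 0.0)  # an item should not recommend itself
    # Score each item by its similarity to the user's purchased items.
    scores = purchases[user_index] @ similarity
    scores[purchases[user_index] > 0] = -np.inf  # exclude already-bought items
    ranked = np.argsort(-scores)
    return [int(i) for i in ranked[:top_n] if np.isfinite(scores[i])]


# Hypothetical purchase history: 4 users, 4 items.
history = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],  # this user has only bought item 0
]
print("recommended item indices for user 3:", recommend(history, user_index=3))
```

Because the scores come from observed behavior, the recommendations change as purchase data changes, with no rule to rewrite.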
We can go on and on with examples of where to use machine learning as the practical applications are truly endless. The above process (Steps 1-8) can be very complex. The individual problems may not be that hard to solve, but if the team isn’t experienced or lacks people with good aptitude, the process could be never-ending and the likelihood of success very low. We do need to remember one thing in our competitive world: success doesn’t simply mean delivery. Imagine finishing a project in five years after spending $100 million on it. What if the same project could be done in one year for just $5 million? If delivery is the only criterion of success, both projects would be deemed successful. However, we clearly know that the former is more of a failure than a success.
With data science and AI, it’s essential to have the right team and understand how to manage them. Some of the work our team has to do is change the mindset of using archaic processes for data science and ML. Expertise, experience and our company’s history of success are crucial in making a project successful. Delays in projects not only reduce the competitive edge of companies, but can also result in massive layoffs. We, at Cazton, work with you to ensure you are successful both as an individual by rising higher in your career and as a company by staying innovative and ahead of the competition.
Cazton is composed of technical professionals with expertise gained all over the world and in all fields of the tech industry and we put this expertise to work for you. We serve all industries, including banking, finance, legal services, life sciences & healthcare, technology, media, and the public sector. Check out some of our services:
Cazton has expanded into a global company, servicing clients not only across the United States, but in Europe and Canada as well. In the United States, we provide our consulting and training services across various cities like Austin, Dallas, Houston, New York, New Jersey, Irvine, Los Angeles, Denver, Boulder, Charlotte, Atlanta, Orlando, Miami, San Antonio, San Diego and others. Contact us today to learn more about what our experts can do for you.