In this Data Age, it's clear that we have a surplus of data. But why should that necessitate an entirely new set of vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it literally impossible for a human to parse it in a reasonable time frame. Data is collected in various forms and from different sources, and often comes in a very unorganized format.
Data can be missing, incomplete, or just flat out wrong. Oftentimes, we will have data on very different scales, and that makes it tough to compare it. Say that we are looking at data in relation to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Once we clean our data (which we will spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.
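To make the used-car example above concrete, here is a minimal sketch of one common way to put features on a comparable scale: min-max normalization. The helper name `min_max_scale` and the car values are invented for illustration; other rescaling methods (such as standardizing to zero mean and unit variance) would work as well.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale, so map to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical used-car records (illustrative values only).
years = [2008, 2012, 2015, 2019]
miles = [142000, 98000, 61000, 12000]

scaled_years = min_max_scale(years)
scaled_miles = min_max_scale(miles)

# After scaling, both features live on the same 0-to-1 scale.
for y, m in zip(scaled_years, scaled_miles):
    print(f"year={y:.2f}  miles={m:.2f}")
```

Once both columns run from 0 to 1, a difference of 0.1 in one is roughly as meaningful as a difference of 0.1 in the other, which is not true of raw years and raw miles.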
In the nineteenth century, the world was in the grip of the Industrial Age. Mankind was exploring its place in the industrial world, working with giant mechanical inventions. Captains of industry, such as Henry Ford, recognized that using these machines could open major market opportunities, enabling industries to achieve previously unimaginable profits. Of course, the Industrial Age had its pros and cons. While mass production placed goods in the hands of more consumers, our battle with pollution also began around this time. By the twentieth century, we were quite skilled at making huge machines; the goal then was to make them smaller and faster. The Industrial Age was over and was replaced by what we now refer to as the Information Age.
The Data Age can lead to something much more sinister: the dehumanization of individuals through mass data. Many people ask me what the biggest difference between data science and data analytics is. While some argue that there is no difference between the two, many will argue that there are hundreds! I believe that, regardless of how many differences there are between the two terms, the biggest is that data science follows a structured, step-by-step process that, when followed, preserves the integrity of the results. Between logarithms/exponents, matrix algebra, and proportionality, mathematics clearly has a big role not just in the analysis of data, but in many aspects of our lives.
Electric vehicles are powered by lithium-ion batteries (LIBs), a type of rechargeable battery that is still not fully understood or perfected. Because electric cars are expected to replace gas-powered cars, any research that improves the performance of lithium-ion batteries will be a boon for electric vehicles and the environment. Battery Management Systems are trained to capture a battery's state of health and to predict its remaining lifetime. These two estimates help owners of electric vehicles know when to stop the car to recharge its battery as well as when to schedule battery replacements. Furthermore, a model with high estimation accuracy translates into a longer lifetime for battery packs, since it allows a Battery Management System to identify and protect weak cells.
The data, which are generated every minute, measure battery functions such as temperature, voltage and volatility in the currents, resulting in hundreds of thousands of data points.
The five essential steps to perform data science are as follows:
1. Asking an interesting question
2. Obtaining the data
3. Exploring the data
4. Modeling the data
5. Communicating and visualizing the results
Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:
• Make decisions
• Predict the future
• Understand the past/present
• Create new industries/products
No matter which industry you work in—IT, fashion, food, or finance—there is no doubt that data affects your life and work. At some point this week, you will either have or hear a conversation about data. We started using machines to gather and store information (data) about ourselves and our environment for the purpose of understanding our universe. News outlets are covering more and more stories about data leaks, battery performance, and how data can give us a glimpse into our lives. But why now? What makes this era such a hotbed of data-related industries? As with the Industrial Age, the Information Age brought us both the good and the bad. The good was the extraordinary works of technology, including mobile phones and televisions. The bad was not as bad as worldwide pollution, but still left us with a problem in the twenty-first century—so much data. That's right—the Information Age, in its quest to procure data, has exploded the production of electronic data. Estimates show that we created about 1.8 trillion gigabytes of data in 2011 (take a moment to just think about how much that is). Just one year later, in 2012, we created over 2.8 trillion gigabytes of data! This number is only going to explode further to hit an estimated 40 trillion gigabytes of created data in just one year by 2020. People contribute to this every time they tweet, post on Facebook, save a new resume on Microsoft Word, or just send their mom a picture by text message.
Machine learning: This refers to giving computers the ability to learn from data without explicit "rules" being given by a programmer. We have seen the concept of machine learning earlier in this chapter as the union of someone who has both coding and math skills. Here, we are attempting to formalize this definition. Machine learning combines the power of computers with intelligent learning algorithms in order to automate the discovery of relationships in data and create powerful data models.
Speaking of data models, in this book, we will concern ourselves with the following two basic types of data model:
• Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness
• Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula
Exploratory data analysis (EDA): This refers to preparing data in order to standardize results and gain quick insights. EDA is concerned with data visualization and preparation. This is where we turn unorganized data into organized data and clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models.
Data mining: This is the process of finding relationships between elements of data. Data mining is the part of data science where we try to find relationships between variables (think the spawn-recruit model).
Lithium-ion batteries power almost every electronic device in our lives, including phones and laptops. They’re at the heart of renewable energy and e-mobility. For years companies have tried to predict how many charging cycles a battery will last before it dies. Better predictions would enable more accurate quality assessment and improve long-term planning.
Artificial Intelligence: The term Artificial Intelligence comprises two words, “Artificial” and “Intelligence.” Artificial refers to something made by humans rather than occurring naturally, and Intelligence means the ability to understand or think. There is a misconception that Artificial Intelligence is a system, but it is not a system; rather, AI is implemented in a system. There are many definitions of AI. One is “the study of how to train computers so that they can do things which, at present, humans can do better.” In other words, it is a form of intelligence in which we want to give a machine all the capabilities that a human has.
Machine Learning: Machine Learning is a form of learning in which machines learn on their own without being explicitly programmed. It is an application of AI that gives a system the ability to learn and improve from experience automatically. Here, we can generate a program by integrating the input and output of that program. One simple definition of Machine Learning is: “A computer program is said to learn from experience E with respect to some class of tasks T and a performance measure P if its performance at the tasks in T, as measured by P, improves with experience E.”
Is the data organized or not? We are checking for whether the data is presented in a row/column structure. For the most part, data will be presented in an organized fashion. In this book, over 90% of our examples will begin with organized data. Nevertheless, this is the most basic question that we can answer before diving any deeper into our analysis. A general rule of thumb is that if we have unorganized data, we want to transform it into a row/column structure. For example, earlier in this book, we looked at ways to transform text into a row/column structure by counting the number of words/phrases. The data for each cell is presented in a nested structure, with some features only measured once per cycle and others multiple times. Over a full cycle, we have more than a thousand measurements for capacity, temperature, voltage, and current, but only one scalar measurement for other metrics such as internal resistance of the cell or the total cycle time.
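As a rough illustration of turning the nested per-cell structure described above into rows and columns, the sketch below flattens one cell's cycles into a list of row dictionaries. The function name `flatten_cell`, every field name, and all the numbers are hypothetical; they are not the actual keys or values of the dataset.

```python
def flatten_cell(cell_id, cycles):
    """Turn one cell's nested cycle data into flat row dictionaries.

    `cycles` maps a cycle number to a dict holding per-cycle scalars and
    equal-length lists of within-cycle measurements. All field names here
    are hypothetical, not the dataset's actual keys.
    """
    rows = []
    for cycle_number, cycle in sorted(cycles.items()):
        series = zip(cycle["voltage"], cycle["current"],
                     cycle["temperature"], cycle["capacity"])
        for step, (v, i, t, q) in enumerate(series):
            rows.append({
                "cell_id": cell_id,
                "cycle": cycle_number,
                "step": step,
                "voltage": v,
                "current": i,
                "temperature": t,
                "capacity": q,
                # Per-cycle scalars are repeated on every row of that cycle.
                "internal_resistance": cycle["internal_resistance"],
                "cycle_time": cycle["cycle_time"],
            })
    return rows


# Tiny made-up example: two cycles with three measurements each.
example_cycles = {
    1: {"internal_resistance": 0.017, "cycle_time": 48.2,
        "voltage": [3.6, 3.3, 2.0], "current": [-4.4, -4.4, -4.4],
        "temperature": [30.1, 31.5, 32.0], "capacity": [0.0, 0.55, 1.08]},
    2: {"internal_resistance": 0.018, "cycle_time": 48.5,
        "voltage": [3.6, 3.3, 2.0], "current": [-4.4, -4.4, -4.4],
        "temperature": [30.0, 31.4, 31.9], "capacity": [0.0, 0.54, 1.07]},
}

rows = flatten_cell("cell_001", example_cycles)
print(len(rows), "rows; first row:", rows[0])
```

Repeating the per-cycle scalars on every row costs some memory, but it gives a single table in which each row is one measurement, which is the shape most modeling tools expect.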
What does each row represent? Once we have an answer to how the data is organized and are looking at a nice row/column-based data set, we should identify what each row represents. This step is usually very quick and can help put things into perspective much more quickly. Do we need to perform any transformations on the columns? Depending on the level/type of data in each column, we might need to perform certain types of transformation. For example, for the sake of statistical modeling and machine learning, we would like each column to be numerical. For the battery data, we could remove cycles that had time gaps, small outliers, or other inconsistencies. One particularly useful tool we found for smoothing out noise is the Savitzky-Golay filter.
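For readers who have not met the Savitzky-Golay filter, here is a minimal sketch of its simplest common form: a 5-point window with a quadratic fit, whose convolution weights are the well-known (-3, 12, 17, 12, -3)/35. The function name `savgol_smooth_5`, the choice of window, and the sample values are assumptions made for illustration; the text does not say which window length or polynomial order was actually used.

```python
# Convolution weights for a 5-point window and a quadratic (or cubic) fit.
SG_WEIGHTS_5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth_5(values):
    """Smooth a sequence with a 5-point Savitzky-Golay filter.

    The first and last two points are returned unchanged. That is a
    simplification; library implementations treat the edges more carefully.
    """
    values = list(values)
    if len(values) < 5:
        return values
    smoothed = values[:]
    for k in range(2, len(values) - 2):
        window = values[k - 2:k + 3]
        smoothed[k] = sum(w * x for w, x in zip(SG_WEIGHTS_5, window))
    return smoothed


# Made-up noisy capacity readings (Ah), for illustration only.
noisy_capacity = [1.080, 1.079, 1.083, 1.076, 1.077, 1.071, 1.075, 1.072, 1.070]
print([round(q, 4) for q in savgol_smooth_5(noisy_capacity)])
```

Unlike a plain moving average, this filter fits a low-order polynomial inside each window, so it leaves smooth trends intact while damping point-to-point jitter. In practice, `scipy.signal.savgol_filter` provides a general version with configurable window length and polynomial order.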
What does each column represent? We should identify each column by the level of data and whether it is quantitative/qualitative, and so on. This categorization might change as our analysis progresses, but it is important to begin this step as early as possible. Are there any missing data points? Data isn't perfect. Sometimes, we might be missing data because of human or mechanical error. When this happens, we, as data scientists, must make decisions about how to deal with these discrepancies. Raw measurement data can be extremely noisy. The intervals between measurements are not always equal; data that’s supposed to decrease monotonically sometimes increases unexpectedly; and sometimes the hardware just shuts off and resumes measuring at a random point in time.
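The two problems just described, uneven time gaps and values that rise when they should only fall, can be flagged with very simple checks. The sketch below is one possible way to do it; the function names, the thresholds, and the sample readings are all hypothetical.

```python
def find_time_gaps(timestamps, max_gap):
    """Return indices i where the gap between sample i and i + 1 exceeds max_gap."""
    return [i for i in range(len(timestamps) - 1)
            if timestamps[i + 1] - timestamps[i] > max_gap]


def find_unexpected_increases(values, tolerance=0.0):
    """Return indices i where values[i] exceeds values[i - 1] by more than tolerance."""
    return [i for i in range(1, len(values))
            if values[i] - values[i - 1] > tolerance]


def is_clean(timestamps, values, max_gap, tolerance=0.0):
    """A cycle is 'clean' if it has no large time gaps and no unexpected increases."""
    return (not find_time_gaps(timestamps, max_gap)
            and not find_unexpected_increases(values, tolerance))


# Made-up discharge readings: the logger paused between 20 s and 95 s,
# and the voltage ticks upward once although it should only decrease.
times = [0, 10, 20, 95, 105]
voltage = [3.50, 3.42, 3.35, 3.37, 3.20]

print("gap after index:", find_time_gaps(times, max_gap=15))
print("increase at index:", find_unexpected_increases(voltage, tolerance=0.01))
print("clean cycle?", is_clean(times, voltage, max_gap=15, tolerance=0.01))
```

Whether a flagged cycle should be dropped, repaired by interpolation, or kept as-is is a judgment call that depends on how many cycles are affected and how the downstream features are computed.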
You may ask: how do I get the data, and how do I parse it to run data science experiments? Here is a data set for capacity degradation tests from Stanford University: https://data.matr.io/1/projects/5c48dd2bc625d700019f3204
According to the dataset description, this data set, used in the publication “Data-driven prediction of battery cycle life before capacity degradation”, consists of 124 commercial lithium-ion batteries cycled to failure under fast-charging conditions. These lithium iron phosphate (LFP)/graphite cells, manufactured by A123 Systems (APR18650M1A), were cycled in horizontal cylindrical fixtures on a 48-channel Arbin LBT potentiostat in a forced convection temperature chamber set to 30°C. The cells have a nominal capacity of 1.1 Ah and a nominal voltage of 3.3 V.
The objective of this work is to optimize fast charging for lithium-ion batteries. As such, all cells in this dataset are charged with a one-step or two-step fast-charging policy.
The following repository contains some starter code to load the datasets in either MATLAB or Python:
At Energsoft, we have proprietary algorithms and machine learning models, built on the features generated for the variance model and run on our internal machine learning toolkit. Energsoft used automated training that selected from 12,000 different machine learning hyperparameter and model configurations. The results obtained from this ML framework are better than those obtained by using Lasso and Ridge regression in Python or anything at Stanford, NASA, or MIT. To evaluate the model, the mean percentage error has to be calculated from the predicted lifetime and the observed lifetime. The model then has to be tested with new samples (secondary testing) to assess its consistency and to make sure that it is not overfitting.
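The error calculation mentioned above can be sketched in a few lines. This version assumes the metric is the mean absolute percentage error, so that over- and under-predictions do not cancel out; the text does not state whether the absolute value is taken. The function name and the lifetimes below are invented for illustration.

```python
def mean_absolute_percentage_error(observed, predicted):
    """Mean of |observed - predicted| / observed, expressed in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    total = sum(abs(o - p) / o for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


# Made-up cycle lives: cells used during training ...
train_observed = [500, 800, 1000]
train_predicted = [550, 760, 1000]

# ... and new cells the model has never seen (the secondary test).
new_observed = [600, 900]
new_predicted = [690, 765]

train_error = mean_absolute_percentage_error(train_observed, train_predicted)
new_error = mean_absolute_percentage_error(new_observed, new_predicted)

print(f"training error:  {train_error:.1f}%")
print(f"secondary error: {new_error:.1f}%")
```

A secondary-test error that is much larger than the training error is the usual warning sign of overfitting, which is why both numbers are worth reporting side by side.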
The model that gives the best results is the full model, which uses cycles 10 and 80 for the Stanford data and cycles 10 and 60 for the new data. Its performance is considerably lower than that of models built only for the Stanford data or the original data. The model also uses two different sets of cycles to calculate features, and the R-squared is on the lower side. One reason for the decrease in performance is that the two data sets are different, so the model cannot establish a robust relationship between the features and the battery lifetime. A next step to increase accuracy could be to cluster the data by the variance feature (as it is the most important one) and build a separate model for each cluster of batteries. Increasing the number of samples could also help.
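The text does not spell out how the variance feature is computed. One plausible reading, loosely following the approach of the publication cited earlier, is to take the discharge-capacity curves of two cycles, subtract one from the other, and use the log of the variance of that difference. The sketch below implements this reading; the function name, the use of the population variance, and the sample curves are all assumptions.

```python
import math
from statistics import pvariance


def delta_q_variance_feature(q_early, q_late):
    """log10 of the variance of the capacity difference between two cycles.

    Both inputs are discharge-capacity curves sampled on the same voltage
    grid (an assumption of this sketch).
    """
    if len(q_early) != len(q_late):
        raise ValueError("both curves must be sampled at the same points")
    delta_q = [late - early for early, late in zip(q_early, q_late)]
    variance = pvariance(delta_q)
    if variance <= 0:
        raise ValueError("curves differ only by a constant; variance is zero")
    return math.log10(variance)


# Made-up capacity curves (Ah) for an early and a later cycle.
q_cycle_10 = [0.00, 0.30, 0.60, 0.90, 1.10]
q_cycle_80 = [0.00, 0.29, 0.57, 0.86, 1.05]

feature = delta_q_variance_feature(q_cycle_10, q_cycle_80)
print(f"variance feature: {feature:.3f}")
```

Because the feature depends on which two cycles are compared, using cycles 10 and 80 for one data set and 10 and 60 for another puts the two sets on slightly different footings, which may be part of why the combined model struggles.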
The classification model sorts batteries by cycle life into two groups (below and above 550 cycles).
The accuracy obtained is 91.67%.
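To show how such an accuracy figure is computed, here is a minimal sketch that labels cells by the 550-cycle threshold and scores a set of predictions. The cycle lives and predicted labels are made up and do not reproduce the result above; placing a cell with exactly 550 cycles in the lower group is also an assumption, since the text does not say.

```python
THRESHOLD_CYCLES = 550


def label_cycle_life(cycle_life, threshold=THRESHOLD_CYCLES):
    """Return 1 for a cell that lasts longer than `threshold` cycles, else 0."""
    return 1 if cycle_life > threshold else 0


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up observed cycle lives and a made-up set of model predictions.
observed_cycle_lives = [480, 530, 610, 900, 1200]
true_labels = [label_cycle_life(c) for c in observed_cycle_lives]
predicted_labels = [0, 1, 1, 1, 1]  # the second cell is misclassified

print(f"accuracy: {accuracy(true_labels, predicted_labels):.2%}")
```

With small test sets, each misclassified cell moves the accuracy by several percentage points, so the number of test cells is worth stating alongside the score.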
Many battery management systems are advertised as making state-of-the-art predictions with 95% accuracy, but this is simply not true: the predictions do not work correctly on different chemistries or even on new products. Most of these systems are based on mathematical models and not on ML.
Artificial Intelligence should be impacting new material research, understanding problems with batteries, and predicting performance not just with predictive, but with prescriptive analytics. The impact should be analyzed with metrics and adjusted.
We have offices in Ukraine and the USA. We are currently still in the seed stage, but we have 90 years of experience in the enterprise software and data science industries. We are a massive engineering team, and we can provide consulting if needed.
The current competitors think that they can charge researchers and innovators in the battery industry tons of money, but it is time to help stop climate change and not just to get rich. Nobody will care about successful companies if the planet is in danger. Most of the competitors are staffed heavily with Ph.D. researchers rather than people from data science, ML, or software backgrounds.
Energsoft is building an integrated AI and ML solution for R&D labs, cell manufacturing, pack integration, new product introduction, second life, and recycling. The software platform, available on desktop and the web, provides visualizations, automated reports, and insightful dashboards.
Please contact email@example.com for further testimonials and connections with our customers.