The Leftshift One Data Audit: Foundation for the AIaaS Journey
 The Leftshift One Data Audit offers an ideal starting point for clients with little to no experience in AI projects. The audit assists in identifying areas where AI can provide value and assesses the suitability of existing data for the specific use case.
The Data Audit serves as the foundation for further AI projects with Leftshift One and establishes a common basis for the next steps.
17 October 2024
Patrick Ratheiser
CEO & Founder
The Leftshift One Data Audit
 For clients with a large amount of existing data, its suitability is evaluated based on technical feasibility and economic benefits. A review of the quality and quantity of the data is essential to achieve the desired benefits of an AI model. Preprocessing is a crucial step to systematically read raw data and convert it into a format suitable for AI models. In feature engineering, manual metrics for the data are defined that do not directly emerge from the data itself.
What added value does the Data Audit provide as a starting point for AI projects?
The Data Audit provides an ideal starting point for embarking on the AIaaS journey with Leftshift One. Even for clients with no prior experience in AI or data science, areas where AI can deliver value can be identified based on the extensive experience from previous projects. It is assessed whether the chosen use case aligns with the existing data or if a use case can be derived from the available data.
For clients who have progressed further and collected relevant data, its suitability is also evaluated in terms of technical feasibility and economic benefits.
At the conclusion of the Data Audit, a meaningful decision is made regarding whether the various challenges can be automated and resolved with AI in a short timeframe. Furthermore, the Data Audit establishes a common foundation for future potential AI projects with Leftshift One.
Quality and Quantity of Data are Essential
For an AI model to achieve the desired benefits, it is necessary to review the data depending on the complexity of the chosen use case. This also considers whether the use case aligns with the business model and, ideally, provides value from the outset.
An example illustrates the relationship between the quality and quantity of data for a specific use case. Consider an industrial company that manufactures gearboxes with about 1,500 features each; of all gearboxes tested for correct functionality, roughly 1% fail the test. With supervised learning, the AI learns from example datasets. The goal is for it to identify the underlying patterns rather than simply memorizing the examples, which would lead to “overfitting.” On the quantity side, it quickly becomes clear that a certain amount of data must be available before the AI can recognize a pattern at all: for a gearbox with 1,500 features, hundreds of thousands of data records are required, while a simpler example with only 10 parameters may need only around a thousand.
Another example is text analysis with an email classifier. If only about 1% of emails are classified as spam, a correspondingly large amount of data must be collected before the results are statistically representative. As a rule of thumb, the more parameters an AI model has, the larger the dataset it requires. Shortcomings in data quality can be mitigated during the Data Audit through preprocessing, although there are natural limits to this.
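As a rough illustration of the quantity question, here is a minimal back-of-the-envelope sketch in Python. The figures are invented for illustration and are not project numbers; the point is only how strongly a rare class (around 1% faulty gearboxes or 1% spam emails) drives the total amount of data needed.

```python
# Back-of-the-envelope sketch: how many records are needed so that a rare
# class (e.g. ~1% faulty gearboxes or ~1% spam emails) appears often enough
# for a supervised model to learn from it. All numbers are illustrative.

def required_records(positive_rate: float, min_positive_examples: int) -> int:
    """Total records needed to expect at least `min_positive_examples`
    of the rare class, given its share `positive_rate` in the data."""
    return int(min_positive_examples / positive_rate)

# ~1% faulty units, and we want at least 1,000 faulty examples to learn from:
print(required_records(positive_rate=0.01, min_positive_examples=1_000))  # 100000

# A simpler use case that gets by with ~100 rare examples needs far less data:
print(required_records(positive_rate=0.01, min_positive_examples=100))    # 10000
```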
While machine learning is a significant area of AI, there are also fields where no data is needed. For example, fixed algorithms are used in optimizing route planners or schedules without needing to check the quantity or quality of the data in advance.
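To make the contrast concrete, below is a minimal sketch of such a fixed, non-learning algorithm: a nearest-neighbour route heuristic in Python. The stops and coordinates are invented, and this is only one of many possible optimization approaches, but none of them require training data.

```python
import math

# Minimal sketch of a fixed (non-learning) routing rule: always visit the
# nearest unvisited stop next. Coordinates are invented for illustration;
# real route planners use more sophisticated exact or heuristic methods,
# but none of them need training data of any particular quantity or quality.

stops = {"depot": (0, 0), "A": (2, 3), "B": (5, 1), "C": (1, 6)}

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def nearest_neighbour_route(start, points):
    route = [start]
    remaining = set(points) - {start}
    while remaining:
        last = route[-1]
        nxt = min(remaining, key=lambda s: distance(points[last], points[s]))
        route.append(nxt)
        remaining.remove(nxt)
    return route

print(nearest_neighbour_route("depot", stops))  # ['depot', 'A', 'C', 'B']
```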
Through preprocessing, the data is transformed into the desired format.
The data is often available in CSV or JSON format, or ideally in an SQL database. The goal of preprocessing is to read the data in a structured manner and convert it into a tabular format. Since the majority of AI algorithms can only work with numbers, texts are converted into numerical tokens using what are called tokenizers, which the algorithms can then utilize. The raw data, which can also include images, is thus transformed into a format that AI models can consume, and the resulting numerical values serve as the inputs required for the use case.
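A minimal Python sketch of this step, assuming pandas and scikit-learn are available; the file name and column layout are hypothetical placeholders. Raw records are read into a table, and free text is converted into numerical token counts that an algorithm can work with.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Read raw data into a tabular structure (CSV shown here; JSON or an SQL
# query would work analogously via pd.read_json / pd.read_sql).
# "emails.csv" and its columns are hypothetical placeholders.
df = pd.read_csv("emails.csv")          # columns: "text", "is_spam"

# Most AI algorithms only work with numbers, so the free text is turned
# into numerical token counts using a simple tokenizer/vectorizer.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["text"])   # sparse matrix of token counts
y = df["is_spam"]

print(X.shape)  # (number of emails, size of the learned vocabulary)
```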
Another step in preprocessing is dealing with erroneous or incomplete data. Depending on the context, such records are either removed or replaced with representative values, such as the mean. In addition, feature engineering uses background knowledge to define metrics manually that do not emerge directly from the data itself. A simple example is the difference between gross and net amounts, which is made up of taxes and social contributions. In large deep learning models, preprocessing and feature engineering can be partly automated, since the model learns the end-to-end mapping on its own.
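A minimal pandas sketch of these two steps, with invented column names and a simplified deduction rate: a missing value is replaced with the column mean, and a new feature that is not contained directly in the raw data is derived from background knowledge.

```python
import pandas as pd

# Hypothetical raw records with a missing value.
df = pd.DataFrame({
    "gross_salary": [3200.0, 4100.0, None, 2800.0],
    "deduction_rate": [0.42, 0.40, 0.41, 0.38],   # taxes + social contributions
})

# Incomplete data: depending on the context, remove the row or replace the
# missing value with a representative value such as the mean.
df["gross_salary"] = df["gross_salary"].fillna(df["gross_salary"].mean())

# Feature engineering: the net amount is not in the raw data, but can be
# derived with background knowledge about taxes and social contributions.
df["net_salary"] = df["gross_salary"] * (1 - df["deduction_rate"])

print(df)
```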
The selection of AI algorithms is based on experience and depends on the use case
The choice of the right algorithm always depends on the specific problem at hand. Often, it is necessary to experiment with different algorithms and adjust the parameters of the training procedure. It is advisable to start with a handful of simpler algorithms and only move on to more complex ones, such as neural networks, when needed. For problem classes like Natural Language Processing, transformers are a suitable method, while for tabular data, gradient boosting techniques or neural networks are often the best choice. Neural networks usually offer the best performance, but the trade-off between resource consumption and performance must also be considered.
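A minimal scikit-learn sketch of this "start simple, then escalate" approach on synthetic tabular data: a plain logistic regression baseline is compared against a gradient boosting model, and only if neither suffices would a neural network be the next step. The dataset and parameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced tabular data stands in for a real use case.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Start with a handful of simpler algorithms before reaching for neural networks.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```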
After training, the model is evaluated on test data to assess its functionality. The method with the best accuracy is then selected and tested on additional data. It is advisable to rely on established models and best practices from the literature and, where necessary, to look at successful solutions from similar projects. In many cases, it makes sense to build on open-source algorithms. The trade-off between resources and performance is always taken into account, and the algorithm chosen is the one that delivers adequate performance at acceptable resource usage. For instance, while ChatGPT can address a wide range of text-related problems, its resource consumption is enormous and often not economical. The Data Audit is recommended to ensure that the data and use case are suitable for the chosen algorithm.
Feasibility is evaluated from a technical, infrastructural, and AI perspective.
To achieve a demonstrable result, the initial model must work well and deliver good results. Clients are aware that they need to provide data for the AI implementation. Even with limited data, careful preprocessing can make the desired use case achievable, and the project also presents an opportunity to collect further data during the AI implementation.
Before beginning the implementation, the feasibility of the project is assessed. From a technical perspective, the client’s systems can ideally be integrated in the cloud with Leftshift One. However, many clients prefer on-premise integration due to data protection concerns. Therefore, part of the Data Audit focuses on the technical infrastructure to ensure potential integration.
Once the technical components are clarified, a feasibility analysis is conducted from an AI perspective. The entire Data Audit essentially serves as a feasibility analysis, typically taking 3 to 5 days depending on the project size. If the client wishes to clarify feasibility in advance through a datathon, time can be saved. This involves conducting rough analyses on the provided data.
The theoretical feasibility analysis is based on previously completed projects. The individual steps within the Data Audit include workshops, preprocessing, feature engineering, and exploratory data analysis. It becomes evident early on if the data exhibits significant variations that complicate meaningful analysis. A simple example is energy consumption data from before and after the COVID-19 pandemic: data from both periods is required for a meaningful analysis.
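As part of exploratory data analysis, such structural breaks can be made visible with very simple checks. The sketch below uses invented monthly energy consumption figures and a hypothetical cutoff date and merely compares summary statistics for the two periods.

```python
import pandas as pd

# Invented monthly energy consumption figures around a hypothetical cutoff
# (e.g. the start of the COVID-19 pandemic). A pronounced difference between
# the periods signals that both must be covered by the training data.
df = pd.DataFrame({
    "month": pd.date_range("2019-09-01", periods=12, freq="MS"),
    "consumption_kwh": [510, 495, 530, 520, 505, 515,
                        380, 360, 355, 370, 365, 375],
})

cutoff = pd.Timestamp("2020-03-01")
before = df[df["month"] < cutoff]["consumption_kwh"]
after = df[df["month"] >= cutoff]["consumption_kwh"]

print("mean before:", before.mean(), "| mean after:", after.mean())
```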
Particularly for text data, the datasets and corresponding target variables are essential. To evaluate the model’s performance, the client’s entire data is carefully divided into a training set and a test set. The model is trained and developed on the training dataset before being evaluated on the test dataset to see if it performs well on previously unseen data.
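A minimal scikit-learn sketch of this split and evaluation, using synthetic data in place of the client's records: the model never sees the test portion during training, so the accuracy on that portion indicates how it behaves on previously unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the client's records and target variables.
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

# Carefully split the full dataset: the model is developed on the training
# portion only and evaluated afterwards on the held-out test portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Performance on previously unseen data is what is reported to the client.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```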
The results of the method and implementation, along with metrics, are presented to the client. The next steps are then discussed, such as developing a prototype or providing further support in data collection.
The outcome of the Data Audit is an informed decision for or against the use of AI
Whether AI should be used in a specific project depends on factors such as the use case and the available data. Additionally, the selection of the appropriate AI is crucial for the project’s success. It is fundamentally important to evaluate the use case from an economic perspective. If AI delivers better results than the previously employed method, transitioning to AI is worthwhile. The final decision on whether it should be used in the future lies with the respective company.
However, some technical issues may arise, such as a lack of quantity and quality of data or a suboptimal use case. In such cases, it makes sense to reframe the problem to continue the project under the newly defined parameters. If all stakeholders are satisfied and the AI is technically feasible, a go decision can be made, and the AIaaS journey can progress to the next phase with a prototype.
Utilize our trustworthy and explainable AI models in an energy-efficient and cost-effective manner to sustainably solve your problems with the AIaaS approach. Book your non-binding AI expert consultation here now and get started with a Data Audit!