Laying the Foundation for Productive AI Implementation with Optimal Data Preparation
From Practice
An international logistics company has set the clear goal of making its internal knowledge management more efficient and innovative through the use of Leftshift One’s MyGPT.
To establish a solid foundation for this initiative, the initial focus is on precise data extraction and preparation.
October 17, 2024
Patrick Ratheiser
CEO & Founder
Karin Schnedlitz
Content Manager
What Does the Process Look Like in Detail?
The primary challenge is to prepare the diverse data in a form that the LLM behind MyGPT can use effectively. Simply uploading Word or Excel documents is not sufficient. Clean data extraction and preparation are critical success factors: they help prevent hallucinations and provide a reliable foundation for every query to the system.
Data Preparation is the Foundation for the Reliable Functionality of MyGPT
With MyGPT’s Strict Mode, all generated answers are based on relevant documents. The process of data extraction and preparation is crucial for two reasons. First, the desired information must be found through semantic search based on the user’s input. Second, this information must be correctly structured and in a format that allows the LLM to generate an accurate response.
Data Extraction and Preparation for Various Formats
Leftshift One has extensive experience in document preparation for AI. The goal is to map the structure of each document to text. Various mechanisms are in place to optimally analyze and process information across different formats. The common formats include:
Format | Challenge | Approach |
---|---|---|
Text | Depending on the context, text can be unstructured and variable in formatting. | Text data can be analyzed and processed directly. Application of text processing techniques to clean the text and bring it into a standardized format. |
PDF | PDFs often consist of a mixture of text, images, and other media. | Application of OCR techniques to extract text from images. Use of specialized libraries to read and interpret the content of PDFs. Conversion of the extracted content into a structured format suitable for machine learning. |
Word | Word documents can contain complex formatting, tables, images, and embedded objects. | Use of libraries specifically designed for reading Word documents. Extraction of plain text while ignoring complex formatting and irrelevant content. Conversion of the content into a machine-readable format. |
Excel | Excel spreadsheets can contain complex data structures, formulas, and links between cells. The order and structure of the data vary. | Use of tools that can efficiently read table data. Conversion of tables into structured data formats such as CSV or JSON. Consideration of cell formatting and types during data extraction. |
PowerPoint | Slides often include many design elements and effects. | Extraction of text content using dedicated Python libraries. Preparation in a machine-readable format. |
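Two of the approaches in the table, text cleanup and converting tables into structured records, can be sketched with the Python standard library alone. This is an illustrative sketch, not Leftshift One's pipeline: the function names are hypothetical, and the table is assumed to have already been exported to CSV text.

```python
import csv
import io
import json
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Bring extracted text into a standardized form."""
    text = unicodedata.normalize("NFKC", raw)            # unify Unicode variants
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[ \t]+", " ", text)                  # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                 # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)               # keep at most one blank line
    return text.strip()


def table_to_records(csv_text: str) -> list[dict[str, str]]:
    """Turn a CSV export of a spreadsheet into one dict per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append({
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # skip cells beyond the header width
        })
    return records


def records_to_json(records: list[dict[str, str]]) -> str:
    """Serialize the records so each value stays attached to its column name."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Keeping each value next to its column header is one way to preserve a table's structure once it is mapped to text.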
Flexibility for Data Storage and Data Security
Regarding data sources and storage, a distinction is made between the document format (e.g., Word, PDF) and the document storage location. The location of the documents can vary. Leftshift One’s connectors allow access to both local data and cloud-based documents. Strict data protection requirements are met by clearly separating the process of uploading data from the interaction with MyGPT.
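The distinction between document format and storage location can be made concrete with a small sketch. Everything here is hypothetical and only illustrates the idea: the `Connector` protocol, the two example connectors, and the `ingest` function are not Leftshift One's actual connector API, and the in-memory connector merely stands in for a cloud source.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Protocol


@dataclass(frozen=True)
class SourceDocument:
    location: str   # where the document is stored
    fmt: str        # document format, derived from the file suffix
    content: bytes  # raw bytes, handed to format-specific extraction later


class Connector(Protocol):
    """Anything that can deliver documents from some storage location."""
    def fetch(self) -> Iterator[SourceDocument]: ...


def _format_of(name: str) -> str:
    return Path(name).suffix.lstrip(".").lower()


class LocalFolderConnector:
    """Reads every file below a local folder."""
    def __init__(self, root: Path) -> None:
        self.root = root

    def fetch(self) -> Iterator[SourceDocument]:
        for path in sorted(self.root.rglob("*")):
            if path.is_file():
                yield SourceDocument(str(path), _format_of(path.name), path.read_bytes())


class InMemoryConnector:
    """Stand-in for a cloud connector; a real one would call the provider's API."""
    def __init__(self, blobs: dict[str, bytes]) -> None:
        self.blobs = blobs

    def fetch(self) -> Iterator[SourceDocument]:
        for name, content in sorted(self.blobs.items()):
            yield SourceDocument(name, _format_of(name), content)


def ingest(connectors: list[Connector]) -> list[SourceDocument]:
    """Upload step: runs on its own, separate from any chat interaction."""
    return [doc for connector in connectors for doc in connector.fetch()]
```

Because `ingest` only depends on the `fetch` method, extraction logic keyed on `fmt` stays the same regardless of where a document is stored.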
Data Update: “On the Fly”!
To ensure that knowledge management is always based on up-to-date information, the logistics company can use a connector to update documents based on time triggers. Since Leftshift One’s approach does not require extensive fine-tuning, no retraining is necessary after a data update, so changes take effect “on the fly.”
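Why no retraining is needed can be sketched as follows: when knowledge lives in a document index rather than in model weights, an update only has to re-index what changed. The `DocumentIndex` class below is a hypothetical illustration, not MyGPT's internals; a scheduler (for example a cron job) is assumed to call `sync` at the configured times.

```python
import hashlib


class DocumentIndex:
    """Keeps prepared document text keyed by ID, with a hash to detect changes."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, str]] = {}  # doc_id -> (digest, text)

    def sync(self, current: dict[str, str]) -> dict[str, list[str]]:
        """Bring the index in line with the current documents.

        Only new or changed documents are re-indexed; the model itself
        is never touched.
        """
        changes: dict[str, list[str]] = {"added": [], "updated": [], "removed": []}
        for doc_id, text in current.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            known = self._entries.get(doc_id)
            if known is None:
                changes["added"].append(doc_id)
            elif known[0] != digest:
                changes["updated"].append(doc_id)
            else:
                continue  # unchanged, nothing to do
            self._entries[doc_id] = (digest, text)
        # Drop documents that no longer exist at the source.
        for doc_id in sorted(self._entries.keys() - current.keys()):
            del self._entries[doc_id]
            changes["removed"].append(doc_id)
        return changes

    def text(self, doc_id: str) -> str:
        return self._entries[doc_id][1]
```

The next query after a `sync` call already sees the new content, because retrieval reads from the index at question time.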
The Steps from Data Foundation to Productive Deployment
The journey to the productive deployment of MyGPT began for the logistics company with an initial meeting with Leftshift One and a review of the data foundation. A Data Audit then provided a deeper analysis of the data. During this phase, the specifics of the documents were examined and the further course of action was defined. Afterward, the logistics company only needed to upload the necessary data into MyGPT. Processing takes place automatically in the background, so the knowledge management system becomes available for productive use within a short time.
Leftshift One: A Unique Selling Proposition Through Experience and In-House Development
For the logistics company, the key reason for relying on Leftshift One when implementing an AI-based knowledge management system was its extensive, personalized consultation. Leftshift One’s deep expertise, gained from numerous successful AI projects, enables it to address unique challenges and develop tailored solutions. MyGPT also stands out for its plugins, which Leftshift One develops in-house and which cover a wide range of business use cases and technical scenarios.
Innovations in Multimodality Open Up New Opportunities
With the emergence of new and more powerful AI image models, Leftshift One has expanded the roadmap for the further development of MyGPT towards multimodality. To implement this productively, it is essential to first evaluate the actual maturity of these new technologies. In terms of data extraction and preparation, this would open up a multitude of new possibilities.