Overview
AI Engineering: Building Multi-Modal Intelligent Systems with Vision, Language, and Audio
From LLM Fine-Tuning to Voice Agents, AR Interfaces, and Real-World Deployment
Unlock the future of artificial intelligence with practical, production-ready multi-modal engineering.
This hands-on guide is built for developers, researchers, and AI professionals who want to go beyond chatbots and dive into building intelligent systems that understand text, images, audio, and human intent - all in one pipeline.
Whether you're fine-tuning large language models (LLMs) or creating voice-driven AR interfaces, this book walks you through the real engineering decisions, tools, and architectures needed to bring multi-modal AI to life.
What You'll Learn:
- Fine-tuning Large Language Models (LLMs): Train and adapt models like GPT-2, LLaMA, and Mistral for custom tasks using Hugging Face, LoRA, QLoRA, and PEFT.
- Voice Interfaces: Combine Whisper, LLMs, and Bark/Tortoise TTS to build interactive speech-driven assistants.
- Computer Vision + Language: Use models like BLIP, CLIP, and DETR to connect what systems see to what they say and understand.
- Instruction Tuning & Hyperparameter Optimization: Build smarter, domain-specific models with efficient training workflows.
- Multi-Modal Pipelines: Chain audio, image, and text inputs for question answering, summarization, tutoring, and AR/robotic control.
- Real-Time Interfaces: Deploy intelligent agents using FastAPI, Streamlit, Gradio, Docker, and Hugging Face Spaces.
- Edge & Offline Deployment: Optimize models with ONNX, quantization (4-bit, 8-bit), and TensorRT for low-latency inference on CPU/GPU.
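To give a flavor of the LoRA technique mentioned above: instead of updating a full weight matrix, LoRA trains a small low-rank correction alongside the frozen weights. The sketch below is a pure-NumPy illustration of that idea, not code from the book and not the PEFT library's API.

```python
import numpy as np

# LoRA replaces a full weight update dW with a low-rank product B @ A,
# so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus scaled low-rank path; with B = 0 at init,
    # the adapted model's output matches the frozen model exactly.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
```

The zero initialization of B is the key design choice: training starts from the pretrained model's behavior, while the trainable parameter count (A plus B) stays far below the size of W.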
Use Cases Covered:
- Smart document summarizers with OCR + TTS
- Voice-enabled image assistants
- Emotion-aware agents
- Virtual tutors
- AR-enhanced AI interfaces
- Robotic perception + control from voice/image input
- Secure, multilingual, and privacy-conscious AI systems
Tools & Frameworks Inside:
- Python, PyTorch, Hugging Face Transformers
- LangChain, OpenCV, Whisper, TTS, BLIP
- ROS, Unity (AR/VR), Gradio, Streamlit
- Docker, FastAPI, gRPC, TorchServe
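The edge-deployment workflow listed earlier leans on quantization for low-latency inference. As a rough, self-contained illustration (pure NumPy, not tied to ONNX or TensorRT, and not code from the book), here is symmetric 8-bit post-training quantization of a weight tensor:

```python
import numpy as np

# Symmetric per-tensor INT8 quantization: map floats to int8 with one
# shared scale factor, then dequantize back to an approximation.
rng = np.random.default_rng(1)
w = rng.standard_normal(1000).astype(np.float32)   # stand-in for a weight tensor

scale = np.abs(w).max() / 127.0                    # one scale for the whole tensor
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale               # dequantized approximation

# Rounding error is bounded by half a quantization step.
max_err = np.abs(w - w_hat).max()
```

Real toolchains add per-channel scales, zero-points for asymmetric ranges, and calibration data, but the storage win is the same: each weight shrinks from 32 bits to 8.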
Built for engineers. Written with depth. Designed for real-world impact.
If you're ready to build intelligent multi-modal agents that understand the world like humans do - across speech, vision, and language - this book gives you the complete roadmap.
Perfect for:
Machine learning engineers, data scientists, AI product developers, researchers, robotics engineers, and anyone building cutting-edge AI systems.
Details
- ISBN-13: 9798296089038
- Publisher: Independently Published
- Publish Date: August 2025
- Dimensions: 9 x 6 x 0.62 inches
- Shipping Weight: 0.88 pounds
- Page Count: 296