Azure - Building Multimodal Generative Experiences. Part 1

Manjunatha Sai Uppu

Posted on May 29, 2024


This blog is a brief run-through of this Learn collection, to give you a good understanding of the AI services offered by Azure.

Getting Started with Azure OpenAI Services

  • Generative models are a subset of deep learning algorithms that support various workloads like vision, speech, language, decision, and search.
  • On Azure, these models are available through REST APIs, SDKs, and Studio interfaces.
  • Azure OpenAI provides access to model management, deployment, experimentation, customization, and learning resources.
  • To begin building with Azure OpenAI, you need to choose and deploy a base model. Microsoft provides base models and an option to create customized base models.
  • Several types of models are available, such as GPT-4, GPT-3.5, embedding models, and DALL-E models (image generation models), which differ by cost, speed, and how well they complete tasks.
  • Deployment of these Azure OpenAI models can be done in multiple ways: through Azure OpenAI Studio, the Azure CLI, or the Azure REST API.
  • Once the deployment is done, you can test the models by triggering them from Azure OpenAI Studio, typically using the following prompt types:
    • Classifying Content
    • Generating New Content
    • Holding a Conversation
    • Transformation (translation and symbol conversion)
    • Summarizing Content
    • Picking up where you left off
    • Giving Factual Responses
  • Additionally, you can test your models in the Completions playground.
    • Below are the parameters you see in the Completions playground (a minimal API call using them follows this list):
    • Temperature: Controls randomness. Lowering the temperature means that the model produces more repetitive and deterministic responses. Increasing the temperature results in more unexpected or creative responses. Try adjusting temperature or Top P but not both.
    • Max length (tokens): Set a limit on the number of tokens per model response. The API supports a maximum of 4000 tokens shared between the prompt (including system message, examples, message history, and user query) and the model's response. One token is roughly four characters for typical English text.
    • Stop sequences: Make responses stop at a desired point, such as the end of a sentence or list. Specify up to four sequences where the model will stop generating further tokens in a response. The returned text won't contain the stop sequence.
    • Top probabilities (Top P): Similar to temperature, this controls randomness but uses a different method. Lowering Top P narrows the model’s token selection to likelier tokens. Increasing Top P lets the model choose from tokens with both high and low likelihood. Try adjusting temperature or Top P but not both.
    • Frequency penalty: Reduce the chance of repeating a token proportionally based on how often it has appeared in the text so far. This decreases the likelihood of repeating the exact same text in a response.
    • Presence penalty: Reduce the chance of repeating any token that has appeared in the text at all so far. This increases the likelihood of introducing new topics in a response.
    • Pre-response text: Insert text after the user’s input and before the model’s response. This can help prepare the model for a response.
    • Post-response text: Insert text after the model’s generated response to encourage further user input, as when modeling a conversation.
  • You also have a Chat playground, based on a conversation-in, message-out interface, which has its own parameters (Max response, Top P, Past messages included).
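These playground parameters map directly onto the API. Below is a minimal sketch of calling a deployed model with the openai Python SDK; the endpoint, key, API version, and deployment name are placeholders for your own resource:

```python
from openai import AzureOpenAI

# Placeholder values for your Azure OpenAI resource and deployment.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-key>",
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # the deployment you created earlier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what Azure OpenAI offers in one sentence."},
    ],
    temperature=0.7,      # randomness; adjust this or top_p, not both
    max_tokens=200,       # cap on tokens in the response
    top_p=1.0,            # nucleus sampling; the Top P parameter
    stop=["END"],         # up to four stop sequences
    frequency_penalty=0,  # penalize tokens by how often they have appeared
    presence_penalty=0,   # penalize any token that has appeared at all
)

print(response.choices[0].message.content)
```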

Analyze Images

  • The Azure AI Vision service is designed to help you extract information from images. It provides the following functionality (a minimal SDK call follows this list):
    • Description and Tag Generation
    • Object Detection
    • People Detection
    • Image metadata, color, and type analysis
    • Category Identification
    • Background Removal
    • Moderation ratings (determine whether the image includes any adult or violent content)
    • Optical Character Recognition
    • Smart thumbnail generation
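As an illustration, here is a minimal sketch using the Image Analysis Python SDK (azure-ai-vision-imageanalysis); the endpoint, key, and image URL are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures

# Placeholder endpoint and key for your Azure AI Vision resource.
client = ImageAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Ask for a caption and tags for a publicly accessible image.
result = client.analyze_from_url(
    image_url="https://example.com/sample.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS],
)

if result.caption:
    print(f"Caption: {result.caption.text} (confidence {result.caption.confidence:.2f})")
if result.tags:
    for tag in result.tags.list:
        print(f"Tag: {tag.name} (confidence {tag.confidence:.2f})")
```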

Plan an Azure AI Document Intelligence solution

  • Azure AI Document Intelligence uses Azure AI Services to analyze the content of scanned forms, completed by your customers, partners, employees, or others, and convert them into data. It can recognize text values both in common forms and in forms that are unique to your business.
  • Prebuilt models for general analysis (read, general document, layout) and for common form types (invoice, receipt, W-2 US tax declaration, ID document, business card, health insurance card) are available, along with custom models (custom template models and custom neural models) and composed models (a model made up of multiple custom models).
  • Refer to this for more info on model types.
  • Azure AI Document Intelligence includes APIs for each of the model types you've seen; the sketch below shows how those types map to model IDs.
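For orientation, here is a sketch of how those model types map to the model IDs used when calling the service; the IDs are the documented prebuilt model IDs, while the dictionary itself is just an illustration:

```python
# Illustrative mapping from the model types above to Document Intelligence
# model IDs; you pass one of these IDs when starting an analysis.
PREBUILT_MODEL_IDS = {
    "read": "prebuilt-read",
    "general document": "prebuilt-document",
    "layout": "prebuilt-layout",
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "W-2 US tax declaration": "prebuilt-tax.us.w2",
    "ID document": "prebuilt-idDocument",
    "business card": "prebuilt-businessCard",
    "health insurance card": "prebuilt-healthInsuranceCard.us",
}
```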

Use Prebuilt Document Intelligence models

  • Prebuilt models in Azure AI Document Intelligence enable you to extract data from common forms without training your own models.
  • Several of the prebuilt models are trained on specific form types:
    • Invoice model. Extracts common fields and their values from invoices.
    • Receipt model. Extracts common fields and their values from receipts.
    • W2 model. Extracts common fields and their values from the US Government's W2 tax declaration form.
    • ID document model. Extracts common fields and their values from US drivers' licenses and international passports.
    • Business card model. Extracts common fields and their values from business cards.
    • Health insurance card model. Extracts common fields and their values from health insurance cards.
  • The other models are designed to extract values from documents with less specific structures:
    • Read model. Extracts text and languages from documents.
    • General document model. Extracts text, keys, values, entities, and selection marks from documents.
    • Layout model. Extracts text and structure information from documents.
  • Prebuilt model features include text extraction, key-value pairs, entities, selection marks, tables, and fields.
  • There are also input requirements you need to meet to use the prebuilt models; for those, and for a brief summary of which models support which features, refer to this link.
  • The W-2 and general document models are likely enough to cover all the features you might need. A minimal sketch of calling a prebuilt model follows.
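For example, here is a minimal sketch of extracting fields with the prebuilt invoice model using the Python SDK (azure-ai-formrecognizer); the endpoint, key, and document URL are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholder endpoint and key for your Document Intelligence resource.
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Analyze a publicly accessible invoice with the prebuilt invoice model.
poller = client.begin_analyze_document_from_url(
    "prebuilt-invoice", "https://example.com/sample-invoice.pdf"
)
result = poller.result()

# Each recognized document exposes the common invoice fields with confidence scores.
for doc in result.documents:
    for name, field in doc.fields.items():
        print(f"{name}: {field.value} (confidence {field.confidence})")
```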

Extract data from forms with Azure Document Intelligence

  • Azure Document Intelligence uses Optical Character Recognition capabilities and a deep learning model to extract text, key-value pairs, selection marks, and tables from documents.
  • OCR captures document structure by creating bounding boxes around detected objects in an image.
  • Azure Document Intelligence is composed of the following services:
    • Document Analysis Models
    • Prebuilt models (W-2, invoices, receipts, ID documents, business cards)
    • Custom Models
  • Azure Document Intelligence works on input documents that meet these requirements:
    • Format must be JPG, PNG, BMP, PDF (text or scanned), or TIFF.
    • The file size must be less than 500 MB for the paid (S0) tier and 4 MB for the free (F0) tier.
    • Image dimensions must be between 50 x 50 pixels and 10000 x 10000 pixels.
    • The total size of the training data set must be 500 pages or less.
  • To use OCR capabilities, use the read, layout, or general document model.
  • To create an application that extracts data from common document types, use the prebuilt models.
  • To create an application that extracts data from your industry-specific forms, use custom models.
  • Custom Models
    • Custom template models accurately extract labeled key-value pairs, selection marks, tables, regions, and signatures from documents. Training only takes a few minutes, and more than 100 languages are supported.
    • Custom neural models are deep-learned models that combine layout and language features to accurately extract labeled fields from documents. This model is best for semi-structured or unstructured documents (a build sketch follows this list).
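Here is a minimal sketch of building a custom template model with the Python SDK, assuming your labeled training documents are already in an Azure blob container; the endpoint, key, SAS URL, and model ID are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentModelAdministrationClient, ModelBuildMode

# Placeholder endpoint and key for your Document Intelligence resource.
admin_client = DocumentModelAdministrationClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Build a custom template model from labeled documents in a blob container.
# Use ModelBuildMode.NEURAL instead for semi-structured or unstructured documents.
poller = admin_client.begin_build_document_model(
    ModelBuildMode.TEMPLATE,
    blob_container_url="https://<storage>.blob.core.windows.net/<container>?<sas-token>",
    model_id="my-custom-template-model",
)
model = poller.result()

print(f"Built model {model.model_id}; document types: {list(model.doc_types)}")
```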

Refer to this link for the continuation.
