Generating a Concordance
David Berry
Posted on January 28, 2024
Merriam-Webster defines a concordance as an alphabetical index of the principal words in a book or the works of an author, with their immediate contexts. In other words, it is an index of the words in a text together with where they occur. In this article we will build a concordance generator in Java. The concordance will list each word, the number of times it appears in the text, and the sentences in which it appears. The text we will work with is the Gettysburg Address.
Creating the project
Let's start by creating the project. We will use the Maven quickstart archetype to generate an empty project:
mvn archetype:generate -DarchetypeGroupId=org.apache.maven.archetypes -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4
You will be asked to answer the following prompts, and you can use a groupId and package that make sense for you:
Define value for property 'groupId': org.exaxis.concordance
Define value for property 'artifactId': concordance
Define value for property 'version' 1.0-SNAPSHOT: :
Define value for property 'package' org.exaxis.concordance: :
Confirm properties configuration:
groupId: org.exaxis.concordance
artifactId: concordance
version: 1.0-SNAPSHOT
package: org.exaxis.concordance
Y: : Y
When you are done, Maven will have generated the project skeleton.
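Assuming the groupId and package above, the quickstart archetype typically produces a layout along these lines:

concordance
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── org
    │           └── exaxis
    │               └── concordance
    │                   └── App.java
    └── test
        └── java
            └── org
                └── exaxis
                    └── concordance
                        └── AppTest.java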
Setting Up Main
The first thing we are going to do is set up main. By default, quickstart sets up main to be a "Hello World" program:
/**
 * Hello world!
 *
 */
public class App
{
    public static void main( String[] args )
    {
        System.out.println( "Hello World!" );
    }
}
We want to keep all the logic out of main, so we will need a Concordance class to encapsulate it. Main will instantiate a Concordance from a file name and print out the concordance. We will also need to get the file name from the args passed into main. The following code should suffice:
/**
 * Generate a Concordance from the file passed in
 */
public class App
{
    public static void main( String[] args )
    {
        // make sure we have enough arguments
        if (args.length == 0)
            usage();

        // grab the filename from the last argument
        String filename = args[args.length - 1];

        try {
            // load the concordance from the file
            Concordance concordance = Concordance.fromFile(filename);

            // print the concordance
            concordance.print(System.out);
        } catch (Exception e) {
            errorMessageAndExit("Error processing file: " + filename + " - " + e.getLocalizedMessage());
        }
    }

    private static void usage(){
        errorMessageAndExit("Usage: java -jar target/concordance-1.0-SNAPSHOT.jar inputFile");
    }

    private static void errorMessageAndExit(String message){
        System.err.println(message);
        System.exit(1);
    }
}
In short, this code:
- makes sure there is at least one argument,
- assumes the last argument is the file name,
- creates a Concordance using the Concordance.fromFile() factory method,
- prints the concordance to System.out using the Concordance.print() method, and
- if the concordance throws an exception, prints the error and exits with a status code of 1.
The Concordance class
In order for App.main to compile, we will need a Concordance class with a static fromFile method and a print method. Here is the basic class:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.Reader;
import java.io.StringReader;

public class Concordance {

    public void print(PrintStream printStream) {
        printStream.println("Not Implemented");
    }

    public static Concordance fromFile(String filename) throws IOException {
        return fromReader(new FileReader(filename));
    }

    public static Concordance fromString(String s) throws IOException {
        return fromReader(new StringReader(s));
    }

    public static Concordance fromReader(Reader reader) throws IOException {
        try (BufferedReader bufferedReader = new BufferedReader(reader)) {
            return fromBufferedReader(bufferedReader);
        }
    }

    public static Concordance fromBufferedReader(BufferedReader bufferedReader) throws IOException {
        // Process the text
        process(bufferedReader);

        // return the Concordance
        return new Concordance();
    }

    private static void process(BufferedReader bufferedReader) throws IOException {
    }
}
There are several factory methods (fromFile, fromString, fromReader, and fromBufferedReader) to allow us to turn a file or a string into a BufferedReader. A file will be used from the command line, and strings can be used for testing. It is fromBufferedReader where the text is processed and a Concordance is generated and returned.
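Since strings are meant for testing, here is a minimal sketch of a JUnit test (the quickstart archetype already adds JUnit as a test dependency; the class name and assertion are placeholders of my choosing). At this point it only verifies that a Concordance comes back, because print is still a stub:

package org.exaxis.concordance; // assumes the package chosen during archetype generation

import static org.junit.Assert.assertNotNull;

import org.junit.Test;

public class ConcordanceTest {
    // fromString lets us build a Concordance without touching the file system
    @Test
    public void fromStringReturnsAConcordance() throws Exception {
        assertNotNull(Concordance.fromString("Four score and seven years ago."));
    }
}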
But what is a concordance? As stated above, a concordance is an alphabetical list of words together with their context: here, the number of occurrences and the sentences in which each word appears.
What structure(s) would be best for storing this information? Let's start with the word usage info. We need the word, the number of occurrences, and the list of sentences it appears in. The sentences can be represented by their sentence numbers within the text. The following class would work:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordUsage {
    private final String word;
    private final ArrayList<Integer> sentences = new ArrayList<>();

    public WordUsage(String word, int sentence){
        this.word = word;
        addSentence(sentence);
    }

    public WordUsage(String word, int[] locations){
        this.word = word;
        Arrays.sort(locations);
        for (int sentence : locations)
            this.sentences.add(sentence);
    }

    public String getWord() {
        return word;
    }

    public int getOccurrences() {
        return sentences.size();
    }

    public List<Integer> getSentences() {
        return sentences;
    }

    public void addSentence(int sentence){
        sentences.add(sentence);
    }
}
WordUsage stores the word and the sentences in which it appears. The occurrence count is simply the number of recorded sentences. There is one constructor for creating a WordUsage from its first sentence, and another for several sentences at once. addSentence adds additional sentences to the WordUsage as they are found.
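As a quick illustration (a throwaway fragment, not part of the project), a WordUsage for a word first seen in sentence 1 and found again in sentences 2 and 4 reports three occurrences:

// inside any main method or test:
WordUsage usage = new WordUsage("nation", 1);
usage.addSentence(2);
usage.addSentence(4);
// the occurrence count is just the number of recorded sentences
System.out.println(usage.getWord() + ": " + usage.getOccurrences() + " " + usage.getSentences());
// prints: nation: 3 [1, 2, 4]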
Now we need a structure to store the words alphabetically along with their usage data. A TreeMap seems a good fit, as it is a sorted map that can hold the usage data for each word. We can add a TreeMap<String, WordUsage> to the Concordance class to hold the concordance data. Using a TreeMap means we can retrieve the WordUsages in sorted order without any additional sorting. We can now define a constructor that takes the TreeMap and instantiates the Concordance class. This also gives us the return value for the process method, and we can complete the fromBufferedReader factory method. The updated code is:
public class Concordance {
    private final TreeMap<String, WordUsage> wordMap;

    private Concordance(TreeMap<String, WordUsage> wordMap){
        this.wordMap = wordMap;
    }

    public static Concordance fromBufferedReader(BufferedReader bufferedReader) throws IOException {
        // Process the text and return the Concordance
        return new Concordance(process(bufferedReader));
    }

    private static TreeMap<String, WordUsage> process(BufferedReader bufferedReader) throws IOException {
        assert(bufferedReader != null);

        TreeMap<String, WordUsage> wordUsageMap = new TreeMap<>();

        // process the bufferedReader

        return wordUsageMap;
    }
}
The only thing left to do is generate the TreeMap<String, WordUsage>. To keep it as efficient as possible, we will implement a simple single-pass lexer that breaks the text into words to be added to the concordance. Whitespace, punctuation, and end-of-sentence characters will be used to identify word boundaries. End-of-sentence characters will also be used to increment the current sentence number. Here is the code for the processor:
private static TreeMap<String, WordUsage> process(BufferedReader bufferedReader) throws IOException {
    assert(bufferedReader != null);

    TreeMap<String, WordUsage> wordUsageMap = new TreeMap<>();

    // process the bufferedReader
    int currentSentence = 1;
    StringBuilder stringBuilder = new StringBuilder();

    // Use a simple single-pass lexer to return the words of the corpus
    int ch = bufferedReader.read();
    while (ch != -1){
        char curr = (char) ch;

        // Are we at the end of a word?
        if (Character.isWhitespace(curr) || isEndOfSentence(curr) || isPunctuation(curr)) {
            if (!stringBuilder.isEmpty()){
                // add the word to the TreeMap
                addWord(wordUsageMap, stringBuilder.toString().toLowerCase(), currentSentence);
                stringBuilder = new StringBuilder();
            }
            if (isEndOfSentence(curr)) {
                ++currentSentence;
            }
        } else {
            stringBuilder.append(curr);
        }
        ch = bufferedReader.read();
    }

    // flush a trailing word that is not followed by whitespace or punctuation
    if (!stringBuilder.isEmpty()){
        addWord(wordUsageMap, stringBuilder.toString().toLowerCase(), currentSentence);
    }

    return wordUsageMap;
}

private static void addWord(TreeMap<String, WordUsage> wordStatsMap, String word, int sentence){
    WordUsage wordUsage = wordStatsMap.get(word);
    if (wordUsage == null) {
        wordStatsMap.put(word, new WordUsage(word, sentence));
    } else {
        wordUsage.addSentence(sentence);
    }
}

private static boolean isEndOfSentence(char ch){
    return ch == '.' || ch == '?' || ch == '!';
}

private static boolean isPunctuation(char ch){
    return ch == ',' || ch == ':' || ch == ';' || ch == '"' || ch == '\'' || ch == '(' || ch == ')' || ch == '-';
}
The only thing left is to fill in the Concordance.print method so we can print out the results of the concordance. Since printing the concordance means printing each WordUsage, we will add a toString method to the WordUsage class. Here are the print and toString methods:
Concordance.java
public void print(PrintStream printStream) {
    for (WordUsage wordUsage : wordMap.values()){
        printStream.println(wordUsage.toString());
    }
}
WordUsage.java
private String printLocations(){
    StringBuilder sb = new StringBuilder();
    for (int sentence : sentences){
        sb.append(sentence);
        sb.append(',');
    }
    sb.deleteCharAt(sb.length() - 1);
    return sb.toString();
}

@Override
public String toString() {
    return String.format("%15s: {%d:%s}", getWord(), getOccurrences(), printLocations());
}
Finally, running the concordance on the Gettysburg Address is done as follows:
java -jar target/concordance-1.0-SNAPSHOT.jar src/test/resources/GettysburgAddress.txt
and produces the following output:
a: {7:1,2,3,4,4,6,10}
above: {1:7}
add: {1:7}
advanced: {1:9}
ago: {1:1}
all: {1:1}
altogether: {1:5}
and: {6:1,1,2,5,7,10}
any: {1:2}
are: {3:1,2,3}
as: {1:4}
battlefield: {1:3}
be: {2:9,10}
before: {1:10}
birth: {1:10}
brave: {1:7}
brought: {1:1}
but: {2:6,8}
by: {1:10}
can: {5:2,6,6,6,8}
cause: {1:10}
civil: {1:2}
come: {1:4}
conceived: {2:1,2}
consecrate: {1:6}
consecrated: {1:7}
continent: {1:1}
created: {1:1}
dead: {3:7,10,10}
dedicate: {2:4,6}
dedicated: {4:1,2,9,10}
detract: {1:7}
devotion: {2:10,10}
did: {1:8}
died: {1:10}
do: {1:5}
earth: {1:10}
endure: {1:2}
engaged: {1:2}
equal: {1:1}
far: {2:7,9}
fathers: {1:1}
field: {1:4}
final: {1:4}
fitting: {1:5}
for: {5:4,9,10,10,10}
forget: {1:8}
forth: {1:1}
fought: {1:9}
four: {1:1}
freedom: {1:10}
from: {2:10,10}
full: {1:10}
gave: {2:4,10}
god: {1:10}
government: {1:10}
great: {3:2,3,10}
ground: {1:6}
hallow: {1:6}
have: {5:4,7,9,10,10}
here: {8:4,7,8,8,9,9,10,10}
highly: {1:10}
honored: {1:10}
in: {4:1,2,6,10}
increased: {1:10}
is: {3:5,9,10}
it: {5:5,7,8,9,10}
larger: {1:6}
last: {1:10}
liberty: {1:1}
little: {1:8}
live: {1:4}
lives: {1:4}
living: {2:7,9}
long: {2:2,8}
measure: {1:10}
men: {2:1,7}
met: {1:3}
might: {1:4}
nation: {5:1,2,2,4,10}
never: {1:8}
new: {2:1,10}
nobly: {1:9}
nor: {1:8}
not: {5:6,6,6,10,10}
note: {1:8}
now: {1:2}
of: {5:3,4,10,10,10}
on: {2:1,3}
or: {2:2,7}
our: {2:1,7}
people: {3:10,10,10}
perish: {1:10}
place: {1:4}
poor: {1:7}
portion: {1:4}
power: {1:7}
proper: {1:5}
proposition: {1:1}
rather: {2:9,10}
remaining: {1:10}
remember: {1:8}
resolve: {1:10}
resting: {1:4}
say: {1:8}
score: {1:1}
sense: {1:6}
seven: {1:1}
shall: {3:10,10,10}
should: {1:5}
so: {3:2,2,9}
struggled: {1:7}
take: {1:10}
task: {1:10}
testing: {1:2}
that: {13:1,2,3,4,4,4,5,10,10,10,10,10,10}
the: {11:1,7,8,9,9,10,10,10,10,10,10}
their: {1:4}
these: {2:10,10}
they: {3:8,9,10}
this: {4:1,5,6,10}
those: {1:4}
thus: {1:9}
to: {8:1,4,7,9,9,10,10,10}
under: {1:10}
unfinished: {1:9}
us: {3:9,10,10}
vain: {1:10}
war: {2:2,3}
we: {10:2,3,4,5,6,6,6,8,10,10}
what: {2:8,8}
whether: {1:2}
which: {2:9,10}
who: {3:4,7,9}
will: {1:8}
work: {1:9}
world: {1:8}
years: {1:1}
Conclusion
A concordance generator is a useful tool for building an index of words from a body of text. The body of text can be processed in a single pass, and the sorted list of word usage information is simply the values of the TreeMap.
The code for this project can be found on GitHub.