Apache POI: ${my_placeholder} is treated as three different runs

I have a .docx template with placeholders to be filled, such as ${programming_language}, ${education}, etc.

The placeholder keywords must be easy to distinguish from ordinary words, so they are enclosed in ${ }.

for (XWPFTable table : doc.getTables()) {
  for (XWPFTableRow row : table.getRows()) {
    for (XWPFTableCell cell : row.getTableCells()) {
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
        for (XWPFRun run : paragraph.getRuns()) {
          System.out.println("run text: " + run.text());
          /** replace text here, etc. */
        }
      }
    }
  }
}

I want to extract the placeholders together with the enclosing ${ } characters. The problem is that the enclosing characters seem to be treated as separate runs:

run text: ${
run text: programming_language
run text: }
run text: Some plain text here 
run text: ${
run text: education
run text: }

Instead, I would like to achieve the following effect:

run text: ${programming_language}
run text: Some plain text here
run text: ${education}

I have tried using other enclosing characters, such as: { }, < >, # #, etc.

I do not want to do some weird concatenations of runs, etc. I want to have it in a single XWPFRun.

If I cannot find a proper solution, I will probably just use names like VAR_PROGRAMMING_LANGUAGE and VAR_EDUCATION.



Solution 1:[1]

Apache POI (4.1.2 at the time of writing) provides TextSegment to deal with these Word text-run issues. XWPFParagraph.searchText searches for a string in a paragraph and returns a TextSegment. The TextSegment gives access to the begin run and the end run of that text in the paragraph (BeginRun and EndRun), to the start character position in the begin run and the end character position in the end run (BeginChar and EndChar), and to the index of the text element within the run (BeginText and EndText). The text index should normally be 0, because by default a text run has only one text element.

Having this, we can do the following:

1. Replace the partial string found in the begin run with the replacement. To do so, take the text that comes before the searched string and append the replacement to it. After that, the begin run fully contains the replacement.

2. Delete all text runs between the begin run and the end run, as they contain parts of the searched string that are no longer needed.

3. Keep only the text after the searched string in the end run.

This way we can replace text that is spread over multiple text runs.
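
The three steps can be tried out without a Word document. The following is a minimal sketch that models each run as a plain String in a list. The class RunReplaceModel and its replace method are hypothetical names used for illustration only and are not part of Apache POI. The begin and end positions are passed in by hand here, whereas in POI they would come from a TextSegment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model (not POI): each run is represented only by its text.
final class RunReplaceModel {

    /**
     * Replaces the text spanning from (beginRun, beginChar) to (endRun, endChar),
     * both inclusive, with the replacement and returns the new list of run texts.
     */
    static List<String> replace(List<String> runs, int beginRun, int beginChar,
                                int endRun, int endChar, String replacement) {
        List<String> result = new ArrayList<>(runs);
        String textBefore = result.get(beginRun).substring(0, beginChar);
        String textAfter = result.get(endRun).substring(endChar + 1);

        if (beginRun == endRun) {
            // Single run: text before, then the replacement, then text after.
            result.set(beginRun, textBefore + replacement + textAfter);
            return result;
        }
        // Step 1: the begin run keeps the text before and receives the replacement.
        result.set(beginRun, textBefore + replacement);
        // Step 3: the end run keeps only the text after the match.
        result.set(endRun, textAfter);
        // Step 2: remove the runs in between, from the back so indexes stay valid.
        for (int i = endRun - 1; i > beginRun; i--) {
            result.remove(i);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("${", "programming_language", "}", " is fun");
        System.out.println(replace(runs, 0, 0, 2, 0, "Java"));
    }
}
```

Note that the end run is kept even when nothing remains after the match, so the model leaves an empty string in its place, just as the end run stays in the paragraph with empty text.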

The following complete example shows this.

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;

public class WordReplaceTextSegment {

 static public void replaceTextSegment(XWPFParagraph paragraph, String textToFind, String replacement) {
  TextSegment foundTextSegment = null;
  PositionInParagraph startPos = new PositionInParagraph(0, 0, 0);
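  // Search for every text segment containing textToFind. startPos is never advanced, so the
  // replacement must not itself contain textToFind, otherwise this loop does not terminate.
  while ((foundTextSegment = paragraph.searchText(textToFind, startPos)) != null) {

   // debug output: index of run, text element and character for the begin and the end of the match
   System.out.println(foundTextSegment.getBeginRun() + ":" + foundTextSegment.getBeginText() + ":" + foundTextSegment.getBeginChar());
   System.out.println(foundTextSegment.getEndRun() + ":" + foundTextSegment.getEndText() + ":" + foundTextSegment.getEndChar());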

   // maybe there is text before textToFind in begin run
   XWPFRun beginRun = paragraph.getRuns().get(foundTextSegment.getBeginRun());
   String textInBeginRun = beginRun.getText(foundTextSegment.getBeginText());
   String textBefore = textInBeginRun.substring(0, foundTextSegment.getBeginChar()); // we only need the text before

   // maybe there is text after textToFind in end run
   XWPFRun endRun = paragraph.getRuns().get(foundTextSegment.getEndRun());
   String textInEndRun = endRun.getText(foundTextSegment.getEndText());
   String textAfter = textInEndRun.substring(foundTextSegment.getEndChar() + 1); // we only need the text after

   if (foundTextSegment.getEndRun() == foundTextSegment.getBeginRun()) { 
    textInBeginRun = textBefore + replacement + textAfter; // if we have only one run, we need the text before, then the replacement, then the text after in that run
   } else {
    textInBeginRun = textBefore + replacement; // else we need the text before followed by the replacement in begin run
    endRun.setText(textAfter, foundTextSegment.getEndText()); // and the text after in end run
   }

   beginRun.setText(textInBeginRun, foundTextSegment.getBeginText());

   // the runs between begin run and end run need to be removed
   for (int runBetween = foundTextSegment.getEndRun() - 1; runBetween > foundTextSegment.getBeginRun(); runBetween--) {
    paragraph.removeRun(runBetween); // remove runs that are no longer needed
   }

  }
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));

  String textToFind = "${This is the text to find}"; // might be in different runs
  String replacement = "Replacement text";

  for (XWPFParagraph paragraph : doc.getParagraphs()) { //go through all paragraphs
   if (paragraph.getText().contains(textToFind)) { // paragraph contains text to find
    replaceTextSegment(paragraph, textToFind, replacement);
   }
  }

  // try-with-resources closes the output stream even if writing fails
  try (FileOutputStream out = new FileOutputStream("result.docx")) {
   doc.write(out);
  }
  doc.close();

 }
}

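The code above does not work in all cases because XWPFParagraph.searchText has bugs, so here is a better searchText method. In addition to the imports above, it needs org.apache.xmlbeans.XmlCursor and org.apache.xmlbeans.XmlObject: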

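/**
 * Parses the paragraph and searches for the string searched.
 * If the string is found, returns a TextSegment describing its position
 * (begin and end run, text element and character); otherwise returns null.
 *
 * @param paragraph the paragraph to search in
 * @param searched  the string to search for
 * @param startPos  the position (run, text, char) at which the search starts
 */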
static TextSegment searchText(XWPFParagraph paragraph, String searched, PositionInParagraph startPos) {
    int startRun = startPos.getRun(),
        startText = startPos.getText(),
        startChar = startPos.getChar();
    int beginRunPos = 0, candCharPos = 0;
    boolean newList = false;

    //CTR[] rArray = paragraph.getRArray(); //This does not contain all runs. It lacks hyperlink runs for ex.
    java.util.List<XWPFRun> runs = paragraph.getRuns(); 
    
    int beginTextPos = 0, beginCharPos = 0; //must be outside the for loop
    
    //for (int runPos = startRun; runPos < rArray.length; runPos++) {
    for (int runPos = startRun; runPos < runs.size(); runPos++) {
        //int beginTextPos = 0, beginCharPos = 0, textPos = 0, charPos; //int beginTextPos = 0, beginCharPos = 0 must be outside the for loop
        int textPos = 0, charPos;
        //CTR ctRun = rArray[runPos];
        CTR ctRun = runs.get(runPos).getCTR();
        XmlCursor c = ctRun.newCursor();
        c.selectPath("./*");
        try {
            while (c.toNextSelection()) {
                XmlObject o = c.getObject();
                if (o instanceof CTText) {
                    if (textPos >= startText) {
                        String candidate = ((CTText) o).getStringValue();
                        if (runPos == startRun) {
                            charPos = startChar;
                        } else {
                            charPos = 0;
                        }

                        for (; charPos < candidate.length(); charPos++) {
                            if ((candidate.charAt(charPos) == searched.charAt(0)) && (candCharPos == 0)) {
                                beginTextPos = textPos;
                                beginCharPos = charPos;
                                beginRunPos = runPos;
                                newList = true;
                            }
                            if (candidate.charAt(charPos) == searched.charAt(candCharPos)) {
                                if (candCharPos + 1 < searched.length()) {
                                    candCharPos++;
                                } else if (newList) {
                                    TextSegment segment = new TextSegment();
                                    segment.setBeginRun(beginRunPos);
                                    segment.setBeginText(beginTextPos);
                                    segment.setBeginChar(beginCharPos);
                                    segment.setEndRun(runPos);
                                    segment.setEndText(textPos);
                                    segment.setEndChar(charPos);
                                    return segment;
                                }
                            } else {
                                candCharPos = 0;
                            }
                        }
                    }
                    textPos++;
                } else if (o instanceof CTProofErr) {
                    c.removeXml();
                } else if (o instanceof CTRPr) {
                    //do nothing
                } else {
                    candCharPos = 0;
                }
            }
        } finally {
            c.dispose();
        }
    }
    return null;
}

It is called like this:

...
while((foundTextSegment = searchText(paragraph, textToFind, startPos)) != null) {
...
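
A matcher that compares one character at a time and falls back to the start of the search string on a mismatch, as the method above does, can miss a match when the text directly before it repeats the beginning of the search string (for example `$${name}` when searching for `${name}`). One possible way around this is to search the concatenated text and map the offsets back to run and character indexes. The sketch below assumes each run is modelled as a plain String with a single text element; RunSearchModel and locate are hypothetical names and not POI API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model (not POI): each run is represented only by its text.
final class RunSearchModel {

    /**
     * Returns {beginRun, beginChar, endRun, endChar} (end inclusive) of the first
     * occurrence of searched in the concatenated run texts, or null if absent.
     */
    static int[] locate(List<String> runs, String searched) {
        if (searched.isEmpty()) {
            return null;
        }
        StringBuilder all = new StringBuilder();
        for (String run : runs) {
            all.append(run);
        }
        int begin = all.indexOf(searched);
        if (begin < 0) {
            return null;
        }
        int end = begin + searched.length() - 1; // offset of the last matched character
        int[] result = new int[4];
        int runStart = 0; // offset of the current run within the concatenated text
        for (int i = 0; i < runs.size(); i++) {
            int runEnd = runStart + runs.get(i).length(); // exclusive
            if (begin >= runStart && begin < runEnd) {
                result[0] = i;
                result[1] = begin - runStart;
            }
            if (end >= runStart && end < runEnd) {
                result[2] = i;
                result[3] = end - runStart;
                return result;
            }
            runStart = runEnd;
        }
        return null; // not reached for a non-empty match
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("a $", "${na", "me} b");
        System.out.println(Arrays.toString(locate(runs, "${name}")));
    }
}
```

In real POI code the four indexes could be stored in a TextSegment through setBeginRun, setBeginChar, setEndRun and setEndChar, as the method above does. Runs with more than one text element would need the text index tracked as well, which this model ignores.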

Solution 2:[2]

As someone commented on your question, you have no control over where or when Word splits a paragraph into runs. If the other answer did not help you, here is how I got around it.

First of all, this "solution" has a big problem, but I will still post it here in case someone can solve it.

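    // requires: import java.util.List; import org.apache.poi.xwpf.usermodel.*;
    public void mainMethod(XWPFParagraph paragraph) {
        if (paragraph.getRuns().size() > 1) {
            String myRun = unifyRuns(paragraph.getRuns());
            // check for / replace the ${...} placeholders in myRun here

            // setText(text, 0) overwrites the first text element of the run;
            // setText(text) would append a new text element to the existing text
            paragraph.getRuns().get(0).setText(myRun, 0);

            // remove all other runs, their text is now contained in the first run
            while (paragraph.getRuns().size() > 1) {
                paragraph.removeRun(1);
            }
        }
    }

    // concatenates the text of all runs into a single string
    private String unifyRuns(List<XWPFRun> runElements) {
        StringBuilder unifiedRun = new StringBuilder();
        for (XWPFRun run : runElements) {
            unifiedRun.append(run.text());
        }
        return unifiedRun.toString();
    }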

The code may contain some errors since I am writing it from memory.

The problem here is that Word does not split paragraphs into runs for no reason: when a paragraph contains text with different formatting (such as font family or font size), each differently formatted part goes into its own run.

In the text "Here's my bold text" (with the word "bold" in bold), Word splits the text to separate the bold part from the normal text. The code above is therefore a bad solution if you are using POI to create large documents with different kinds of formatting, because merging all runs into the first one discards the formatting of the others. In that case you would first need to check whether a run is actually bold, and only then handle the placeholders.
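
One way to reduce that risk is to merge only neighbouring runs that share the same formatting. The sketch below illustrates the idea with a hypothetical Run class that holds just a text and a bold flag. It is a model and not the POI API, and real runs carry many more properties (font, size, colour and so on) that would have to be compared as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model (not POI): a run is just a text plus a bold flag.
final class FormatAwareMerge {

    static final class Run {
        final String text;
        final boolean bold;

        Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }

        @Override
        public String toString() {
            return (bold ? "[bold]" : "[plain]") + text;
        }
    }

    /** Merges adjacent runs only when their formatting (here: bold) is identical. */
    static List<Run> mergeSameFormat(List<Run> runs) {
        List<Run> merged = new ArrayList<>();
        for (Run run : runs) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).bold == run.bold) {
                // Same formatting as the previous run: append the text to it.
                merged.set(last, new Run(merged.get(last).text + run.text, run.bold));
            } else {
                // Formatting changes here, so a separate run is kept.
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Here's my ", false),
                new Run("bold", true),
                new Run(" te", false),
                new Run("xt", false));
        System.out.println(mergeSameFormat(runs));
    }
}
```

A placeholder whose parts are formatted differently would still end up in separate runs with this approach, so it only helps when the whole placeholder has uniform formatting.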

Again, this is a "solution" that I found, and it is not complete yet. Sorry for any English errors; I am using Google Translate to write this answer.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Luiz Felipe Rodrigues