MalformedByteSequenceException Invalid byte 1 of 1-byte UTF-8 sequence
I am writing an XML parser class. Sometimes it runs fine, but other times it fails and throws this exception:
MalformedByteSequenceException Invalid byte 1 of 1-byte UTF-8 sequence
Can anyone provide some information as to why?
Here is my code:
package TRT;

import java.math.BigInteger;
import java.net.URL;
import java.net.URLConnection;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class Gundem {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        Gundem gundem=new Gundem();
        try {
            URL url=new URL("http://www.trt.net.tr/rss/gundem.rss");
            URLConnection connection=url.openConnection();
            DocumentBuilderFactory builderFactory=DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder=builderFactory.newDocumentBuilder();
            Document document=docBuilder.parse(connection.getInputStream());
            Element element=document.getDocumentElement();
            Node node=(Node)element.getChildNodes();
            System.out.println(node.getNodeName());
            NodeList nodeList=node.getChildNodes();
            Node channelNode=(Node)nodeList.item(0);
            System.out.println(channelNode.getNodeName());
            NodeList childNodeListOfChannelNode=channelNode.getChildNodes();
            for(int i=0;i<childNodeListOfChannelNode.getLength();i++){
                Node childNodesOfChannelNode=(Node)childNodeListOfChannelNode.item(i);
                System.out.println(childNodesOfChannelNode.getNodeName());
                if(childNodesOfChannelNode.getNodeName().equals(Constants.ITEM)){
                    Item item=new Item();
                    NodeList itemList=childNodesOfChannelNode.getChildNodes();
                    for(int j=0;j<itemList.getLength();j++){
                        Node childNodeOfItem=itemList.item(j);
                        if(childNodeOfItem.getNodeName().equals(Constants.TITLE)){
                            item.setTitle(childNodeOfItem.getTextContent());
                            System.out.println(item.getTitle());
                            System.out.println(gundem.dumpingInputAsHex(item.getTitle()));
                        }
                        else if(childNodeOfItem.getNodeName().equals(Constants.DESCRIPTION)){
                            item.setDescription(childNodeOfItem.getTextContent());
                            System.out.println(item.getDescription());
                        }
                    }
                }
            }
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.exit(0); // this line is for solving that problem; JDWP Unable to get JNI 1.2 environment, jvm->GetEnv() return code = -2
    }

    public String dumpingInputAsHex(String input){
        return String.format("%40x",new BigInteger(1,input.getBytes()));
    }
}
Solution 1:[1]
The most likely cause is that you are parsing, as UTF-8, a document that is actually encoded in some other character set such as ISO-8859-1. The parser then encounters a byte that is a valid ISO-8859-1 character but is not allowed to occur by itself in UTF-8.
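To illustrate the mismatch, here is a minimal sketch using only the JDK's strict UTF-8 decoder rather than the XML parser itself. The class name, the isValidUtf8 helper, and the sample word are hypothetical and chosen for illustration; the real feed's bytes may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Hypothetical helper: returns true only if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input by throwing,
            // instead of silently substituting a replacement character.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample text (assumption): the u-umlaut is the single byte 0xFC in ISO-8859-1.
        String title = "G\u00fcndem";
        System.out.println(isValidUtf8(title.getBytes(StandardCharsets.ISO_8859_1))); // false
        System.out.println(isValidUtf8(title.getBytes(StandardCharsets.UTF_8)));      // true
    }
}
```

The same text is rejected or accepted depending only on which encoding produced the bytes, which is consistent with a parser that assumes UTF-8 failing on input that is really in a single-byte encoding.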
To solve the problem, determine the actual encoding of the document, then wrap the stream returned by connection.getInputStream() in your own InputStreamReader that specifies that encoding. Create an InputSource from the reader and pass it to docBuilder.parse().
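The steps above could be sketched roughly as follows. This is an illustration under assumptions, not a verified fix for this feed: the class and the parseWithCharset helper are hypothetical names, ISO-8859-1 is only a placeholder for whatever encoding the document really uses, and an in-memory byte array stands in for connection.getInputStream() so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EncodingAwareParse {
    // Hypothetical helper: decode the bytes with an explicit charset,
    // then hand the parser characters instead of raw bytes.
    static Document parseWithCharset(InputStream in, Charset charset)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            // With a character stream, the parser does not guess the encoding itself.
            return docBuilder.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for connection.getInputStream(): a tiny document encoded as ISO-8859-1 (assumption).
        byte[] bytes = "<rss><channel><title>G\u00fcndem</title></channel></rss>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseWithCharset(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

In the question's code, this would mean replacing docBuilder.parse(connection.getInputStream()) with a call that goes through a reader for the encoding you have confirmed. If the wrong charset is supplied, the parse may succeed but produce garbled text, so it is worth checking the feed's XML declaration or the HTTP Content-Type header first.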
Further Research:
I ran your code in Eclipse (JDK 7) and was able to reproduce the error. I then set exception breakpoints in Eclipse on both MalformedByteSequenceException exceptions, and with the breakpoints in place the parse did not fail. Tracing into the code, I was able to see the invalid characters in the input buffer, but only once. This suggests to me that there may be a race condition somewhere in the Xerces parser.
If that is the case, you may have to file a bug with Oracle.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
