george.gu

Java String Charset Encoding


Charset

A Charset is a named mapping between sequences of sixteen-bit Unicode code units and sequences of bytes (this is the definition given in the java.nio.charset.Charset javadoc).

One byte is 8 bits and holds an integer value from 0x00 through 0xFF. ASCII is a mapping from such values to characters. For example, charset names themselves are strings composed of the following ASCII characters (code values in parentheses):
		The uppercase letters 'A' through 'Z' ('0x41' through '0x5a'), 
		The lowercase letters 'a' through 'z' ('0x61' through '0x7a'), 
		The digits '0' through '9' ('0x30' through '0x39'), 
		The dash character '-' ('0x2d', HYPHEN-MINUS), 
		The period character '.' ('0x2e', FULL STOP), 
		The colon character ':' ('0x3a', COLON), and 
		The underscore character '_' ('0x5f', LOW LINE). 
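As a minimal sketch of how such names are used, the class below looks up a charset by name or alias and returns its canonical name. The class and method names (CharsetNameDemo, canonicalName) are hypothetical; only Charset.isSupported, Charset.forName, Charset.name and Charset.aliases are real JDK calls.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetNameDemo {
    // Returns the canonical charset name for a name or alias,
    // or null if the name is illegal or not supported by this JVM.
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters outside the legal set listed above.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("utf8"));           // alias, lookup is case-insensitive
        System.out.println(canonicalName("latin1"));         // alias of ISO-8859-1
        System.out.println(canonicalName("not a charset!")); // illegal characters in the name
        System.out.println(Charset.forName("UTF-8").aliases());
    }
}
```

Names with a space or an exclamation mark are rejected before any lookup happens, which is why the third call returns null rather than throwing.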
 

Unicode 4.0

The weakness of the ASCII mapping is that many characters, such as Chinese or Greek ones, cannot be represented. The Unicode 4.0 standard defines a character set that covers multiple languages. Its Basic Multilingual Plane runs from \u0000 to \uFFFF; supplementary characters lie above that range, up to U+10FFFF.
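The limitation of ASCII can be checked directly. The following is a small sketch, with hypothetical names (AsciiLimitDemo, asciiCanEncode, asciiBytes), that asks the US-ASCII encoder whether it can represent a Latin, a Greek and a Chinese character. The characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class AsciiLimitDemo {
    // True if US-ASCII has a byte representation for the given char.
    static boolean asciiCanEncode(char c) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(c);
    }

    // Encodes with US-ASCII; String.getBytes(Charset) substitutes the
    // charset's replacement byte ('?', 0x3f) for unmappable characters.
    static byte[] asciiBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Latin capital A, Greek small alpha, Chinese character U+4E2D
        char[] samples = { 'A', '\u03b1', '\u4e2d' };
        for (char c : samples) {
            System.out.printf("U+%04X encodable=%b byte=0x%02x%n",
                    (int) c, asciiCanEncode(c), asciiBytes(String.valueOf(c))[0]);
        }
    }
}
```

Only the first character is encodable; the other two come back as the single byte 0x3f, so the original text is lost.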

 

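Java stores characters as Unicode, using UTF-16 code units. In Java 8 and earlier, the String class holds them in a char array (since Java 9 the backing store is a byte array plus a coder flag, but the char-based API is unchanged):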

 

    /** The value is used for character storage. */
    private final char value[];

1 char = 2 bytes (one 16-bit UTF-16 code unit).

 

The String javadoc says the following:

 

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String. 

The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).
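The difference between code units and code points shows up as soon as a string contains a supplementary character. Here is a minimal sketch, assuming U+1D11E (MUSICAL SYMBOL G CLEF) as the sample character; the class and helper names are hypothetical.

```java
class CodePointDemo {
    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+1D11E, which lies outside the Basic Multilingual Plane
        String s = "A" + new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(s));   // 3: one char for A, two for the surrogate pair
        System.out.println(codePoints(s));  // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true

        // Iterate by code point rather than by char
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Indexing with charAt works on code units, so charAt(1) and charAt(2) each return half of the surrogate pair rather than a full character.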
 

Encoding and Decoding String

A String does not keep any Charset information, because all of its characters are stored as Unicode. A Charset is only used at the boundary: when encoding a String into an array of bytes, and when decoding an array of bytes into String characters.

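The following snippet illustrates this. The byte values noted for the no-argument calls assume that the platform default charset is ISO-8859-1 or a similar single-byte Latin charset; with a different default they will differ.
	// Note: getBytes(String) and new String(byte[], String) throw the checked UnsupportedEncodingException.
	String a = "èç";
	byte[] b_defaultEncoding = a.getBytes(); // default charset; 0xe8, 0xe7 under ISO-8859-1
	byte[] b_utf8 = a.getBytes("UTF-8"); // 0xc3, 0xa8, 0xc3, 0xa7
	byte[] b_ucs2 = a.getBytes("ISO-10646-UCS-2"); // 0x00, 0xe8, 0x00, 0xe7

	String a_defaultEncoding = new String(b_defaultEncoding); // decoded with the default charset
	String a_mix = new String(b_ucs2); // UCS-2 bytes decoded with the default charset: a mismatch
	String a_ucs2 = new String(b_ucs2, "ISO-10646-UCS-2");
	String a_utf8 = new String(b_utf8, "UTF-8");

	System.out.println(a_defaultEncoding); // èç
	System.out.println(a_ucs2); // èç
	System.out.println(a_utf8); // èç
	System.out.println(a_mix); // NUL, è, NUL, ç: not the original text
a_mix is not converted correctly because the bytes were encoded with the charset "ISO-10646-UCS-2" but decoded with the default charset. The charset used for decoding must match the one used for encoding.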
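To make the same point without depending on the platform default, here is a sketch that names the charset on both sides using StandardCharsets constants, which also avoids the checked UnsupportedEncodingException. The class and method names (RoundTripDemo, roundTrip) are hypothetical, and UTF-16BE is used on the assumption that it yields the same big-endian two-byte form shown for "ISO-10646-UCS-2" above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u00e8\u00e7"; // the two accented letters from the snippet above

        // Matching charsets: the original text comes back.
        System.out.println(text.equals(
                roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        System.out.println(text.equals(
                roundTrip(text, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16BE)));

        // Mismatched charsets: each byte is decoded as its own Latin-1 character.
        String fromUtf16 = roundTrip(text, StandardCharsets.UTF_16BE, StandardCharsets.ISO_8859_1);
        String fromUtf8 = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(fromUtf16.length()); // 4 chars: NUL, e-grave, NUL, c-cedilla
        System.out.println(fromUtf8.length());  // 4 chars: one per UTF-8 byte
    }
}
```

Passing a Charset object instead of a name string is a design choice that trades flexibility for safety: the constants are guaranteed to exist on every Java platform.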

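Default Charset in Java

You can set the default Charset with the system property "file.encoding", typically on the command line (for example -Dfile.encoding=UTF-8). If it is not specified, older JDKs derive the default from the operating system's locale, while JDK 18 and later use UTF-8. For more details, refer to Charset.defaultCharset().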
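A quick way to see which default is in effect on a given JVM is sketched below. The class and method names are hypothetical; the printed values depend on the JDK version, the operating system and any -Dfile.encoding setting, so no expected output is shown for them.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultCharsetDemo {
    // The no-argument getBytes() uses the default charset, so it should
    // produce the same bytes as passing Charset.defaultCharset() explicitly.
    static boolean noArgMatchesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println(noArgMatchesDefault("\u00e8\u00e7"));
    }
}
```

Because the default varies between machines, code that exchanges bytes with files or the network is usually safer naming the charset explicitly.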

Chinese Characters Unicode Range

The commonly used Chinese characters, Simplified and Traditional alike, are in the CJK Unified Ideographs block. In Unicode 4.0 its assigned code points run from \u4e00 to \u9fa5.
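A range check against these bounds can be sketched as follows. The class and method names are hypothetical, and the test only covers the range quoted above; ideographs in CJK extension blocks or supplementary planes fall outside it.

```java
class CjkRangeDemo {
    // True if the char lies in the range U+4E00 to U+9FA5 quoted above.
    static boolean inRange(char c) {
        return c >= 0x4e00 && c <= 0x9fa5;
    }

    // True if the string is non-empty and every char lies in that range.
    static boolean allInRange(String s) {
        return !s.isEmpty() && s.chars().allMatch(c -> c >= 0x4e00 && c <= 0x9fa5);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        System.out.println(allInRange(chinese));       // true
        System.out.println(allInRange(chinese + "A")); // false: contains a Latin letter
        System.out.println(inRange('A'));              // false

        // The JDK's own block lookup agrees for this character.
        System.out.println(Character.UnicodeBlock.of('\u4e2d')
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS);
    }
}
```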


 
