PPStream keeps showing more and more ads, which is especially annoying when watching anime. I didn't want the ads but still wanted to use the latest PPS version, so I tried filtering the ads out of PPS myself.
- Editing the hosts file: failed (PPS simply rewrites the hosts file), and this can only block ad servers that have a hostname.
- Preventing PPS from caching ads and other promotional files: succeeded.
How it works:
- Find the folder where PPS stores the ads it downloads.
- Prevent PPS from creating that ad folder.
Credit to the PPS programmers here: playback is not restricted just because the ad folder cannot be created.
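The folder-blocking step can be sketched as follows. This is a minimal sketch, not PPS-specific code: the actual ad-cache path varies by PPS version and install location, so the path passed in is a hypothetical placeholder. The trick is to delete the ad directory and put a read-only plain file in its place, so no directory can be created under that name anymore:

```java
import java.io.File;
import java.io.IOException;

public class BlockAdFolder {
    // Replaces a directory with a read-only plain file of the same name,
    // so the program can no longer recreate the directory.
    // Assumes the ad folder contains only files, no subfolders.
    public static boolean block(File adDir) throws IOException {
        if (adDir.isDirectory()) {
            for (File f : adDir.listFiles()) {
                f.delete(); // remove cached ad files
            }
            adDir.delete();
        }
        boolean created = adDir.createNewFile(); // a plain file now occupies the name
        adDir.setReadOnly();
        return created && adDir.isFile();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // pass the real ad-cache folder of your PPS installation here
            System.out.println(block(new File(args[0])));
        }
    }
}
```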
Taking screenshots with java.awt.Robot in Java is very slow. Calling RobotPeer or the native JNI function directly is several times faster and makes real-time capture feasible.
Benchmark results:
```
// Using Robot
Robot.getPixelColor(1024 * 768):            3850 ms
Robot.createScreenCapture(1024 * 768):        19 ms
// Using RobotPeer
RobotPeer.getRGBPixel(1024 * 768):          3686 ms
RobotPeer.getRGBPixels(1024 * 768):           10 ms
// Using RobotPeer.getRGBPixels(int x, int y, int w, int h, int[] buffer) (native)
RobotPeer.getRGBPixels(1024 * 768, buffer):    7 ms
```
Benchmark code:
```java
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.peer.RobotPeer;
import java.lang.reflect.Method;

import sun.awt.ComponentFactory;

public class RobotBenchmark {
    public static void main(String[] args) throws Exception {
        //
        // Using Robot
        //
        final Robot robot = new Robot();
        long start = System.currentTimeMillis();
        int x = 0;
        int y = 0;
        for (int i = 0; i < 1024 * 768; i++) {
            robot.getPixelColor(x++, y);
            if (x == 1024) {
                x = 0; // wrap to the next row
                y++;
            }
        }
        System.out.println("Robot.getPixelColor(1024 * 768): " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        robot.createScreenCapture(new Rectangle(0, 0, 1024, 768));
        System.out.println("Robot.createScreenCapture(1024 * 768): " + (System.currentTimeMillis() - start) + " ms");

        //
        // Using RobotPeer
        //
        final RobotPeer peer = ((ComponentFactory) Toolkit.getDefaultToolkit()).createRobot(null, null);
        start = System.currentTimeMillis();
        x = 0;
        y = 0;
        for (int i = 0; i < 1024 * 768; i++) {
            peer.getRGBPixel(x++, y);
            if (x == 1024) {
                x = 0; // wrap to the next row
                y++;
            }
        }
        System.out.println("RobotPeer.getRGBPixel(1024 * 768): " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        peer.getRGBPixels(new Rectangle(0, 0, 1024, 768));
        System.out.println("RobotPeer.getRGBPixels(1024 * 768): " + (System.currentTimeMillis() - start) + " ms");

        //
        // Using RobotPeer.getRGBPixels(int x, int y, int w, int h, int[] buffer) (native)
        //
        final Class<?>[] params = new Class[] { int.class, int.class, int.class, int.class, int[].class };
        final Method getRGBPixelsMethod = peer.getClass().getDeclaredMethod("getRGBPixels", params);
        getRGBPixelsMethod.setAccessible(true);
        final int[] buffer = new int[1024 * 768];
        start = System.currentTimeMillis();
        getRGBPixelsMethod.invoke(peer, 0, 0, 1024, 768, buffer);
        System.out.println("RobotPeer.getRGBPixels(1024 * 768, buffer): " + (System.currentTimeMillis() - start) + " ms");
    }
}
```
If you only want the speed-up on your own machine, you can also try binary weaving, i.e. overriding the WRobotPeer class from rt.jar (for example by prepending a patched class directory to the boot class path with `-Xbootclasspath/p:` on Java 8 and earlier).
Benchmark results with the patched class:
```
Robot.getPixelColor(1024 * 768):                   3446 ms
Robot.createScreenCapture(1024 * 768):               23 ms
RobotPeer.getRGBPixel(1024 * 768):                 3387 ms
RobotPeer.getRGBPixels(1024 * 768):                  10 ms
RobotPeer.getRGBPixels(1024 * 768, buffer):           8 ms
RobotPeer.getRGBPixels(1024 * 768, buffer) direct:    7 ms
```
The patched WRobotPeer.java:
```java
package sun.awt.windows;

import java.awt.Rectangle;
import java.awt.peer.RobotPeer;

public class WRobotPeer extends WObjectPeer implements RobotPeer {
    public WRobotPeer() {
        create();
    }

    private synchronized native void _dispose();

    protected void disposeImpl() {
        _dispose();
    }

    public native void create();

    public native void mouseMoveImpl(int paramInt1, int paramInt2);

    public void mouseMove(int paramInt1, int paramInt2) {
        mouseMoveImpl(paramInt1, paramInt2);
    }

    public native void mousePress(int paramInt);

    public native void mouseRelease(int paramInt);

    public native void mouseWheel(int paramInt);

    public native void keyPress(int paramInt);

    public native void keyRelease(int paramInt);

    public int getRGBPixel(int paramInt1, int paramInt2) {
        return getRGBPixelImpl(paramInt1, paramInt2);
    }

    public native int getRGBPixelImpl(int paramInt1, int paramInt2);

    public int[] getRGBPixels(Rectangle paramRectangle) {
        int[] arrayOfInt = new int[paramRectangle.width * paramRectangle.height];
        getRGBPixels(paramRectangle.x, paramRectangle.y, paramRectangle.width, paramRectangle.height, arrayOfInt);
        return arrayOfInt;
    }

    public native void getRGBPixels(int x, int y, int w, int h, int[] buffer);
}
```
Source code: https://dict4cn.googlecode.com/svn/trunk/importer/src/SogouSgimCoreBinReader.java
```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

/**
 * Sogou sgim_core.bin Reader
 *
 * Layout:
 * 0x0C: number of words
 * ????: word length (bytes), word (encoding: UTF-16LE)
 *
 * For files like sgim_eng.bin etc., the implementation has to be slightly modified.
 *
 * @author keke
 */
public class SogouSgimCoreBinReader {
    public static void main(String[] args) throws IOException {
        String binFile = "D:\\sgim_core.bin";
        // String binFile = "D:\\sgim_eng.bin";

        // read bin file into byte buffer
        FileChannel fChannel = new RandomAccessFile(binFile, "r").getChannel();
        ByteBuffer bb = ByteBuffer.allocate((int) fChannel.size());
        fChannel.read(bb);
        bb.order(ByteOrder.LITTLE_ENDIAN);
        bb.rewind();

        int words = bb.getInt(0xC);
        System.out.println("File: " + binFile + ", words: " + words);

        int i;
        int startPos = -1;
        while (bb.hasRemaining()) {
            i = bb.getInt();
            if (i == 0x554a0002) { // core, 6.1.0.6700
            // if (i == 0x00610002) { // eng, 6.1.0.6700
                startPos = bb.position() - 4;
                break;
            }
        }
        if (startPos > -1) {
            short s;
            int counter = 0;
            ByteBuffer buffer = ByteBuffer.allocate(Short.MAX_VALUE);
            System.out.println("Words start at: 0x" + Integer.toHexString(startPos));
            bb.position(startPos);
            while (bb.hasRemaining() && words-- > 0) {
                s = bb.getShort();
                bb.get(buffer.array(), 0, s);
                counter++;
                // System.out.println(new String(buffer.array(), 0, s, "UTF-16LE"));
            }
            int endPos = bb.position();
            int diff = endPos - startPos;
            System.out.println("Words read from '" + binFile + "': " + counter);
            System.out.println("Words end at: 0x" + Integer.toHexString(endPos));
            System.out.println("Dictionary length: 0x" + Integer.toHexString(diff));
        }
        fChannel.close();
    }
}
```
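The word section described above is just a sequence of [16-bit byte length][UTF-16LE word] records. A minimal, self-contained sketch of that record format (the sample data here is synthetic, not taken from a real sgim_core.bin):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedWords {
    // Parses consecutive [short byteLength][UTF-16LE bytes] records.
    public static List<String> parse(byte[] data, int wordCount) {
        ByteBuffer bb = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<String> words = new ArrayList<>();
        while (bb.hasRemaining() && wordCount-- > 0) {
            byte[] buf = new byte[bb.getShort()];
            bb.get(buf);
            words.add(new String(buf, StandardCharsets.UTF_16LE));
        }
        return words;
    }

    // Builds such a record sequence, for demonstration only.
    public static byte[] build(String... words) {
        ByteBuffer bb = ByteBuffer.allocate(4096).order(ByteOrder.LITTLE_ENDIAN);
        for (String w : words) {
            byte[] enc = w.getBytes(StandardCharsets.UTF_16LE);
            bb.putShort((short) enc.length);
            bb.put(enc);
        }
        byte[] out = new byte[bb.position()];
        bb.rewind();
        bb.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = build("你好", "词典");
        System.out.println(parse(data, 2)); // prints [你好, 词典]
    }
}
```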
New: Lingoes LD2 (LDF) dictionary word extractor
http://code.google.com/p/lingoes-extractor/
1. Windows version: http://lingoes-extractor.googlecode.com/files/lingoes-extractor-1.0.exe
2. Java version: http://lingoes-extractor.googlecode.com/files/lingoes-extractor-1.0.jar
Select the LD2 file and the output file:
The exported files:
Supports all known Lingoes dictionary versions (2.x). Automatically exports the index group (*.idx), all words (*.words), the translations (*.output), and more.
Lingoes Reader / Exporter source code download: https://dict4cn.googlecode.com/svn/trunk/importer/src/LingoesLd2Reader.java
Source file:
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

/**
 * Lingoes LD2/LDF File Reader
 *
 * <pre>
 * Lingoes Format overview:
 *
 * General Information:
 * - Dictionary data are stored in deflate streams.
 * - Index group information is stored in an index array in the LD2 file itself.
 * - Numbers are using little endian byte order.
 * - Definitions and xml data have UTF-8 or UTF-16LE encodings.
 *
 * LD2 file schema:
 * - File Header
 * - File Description
 * - Additional Information (optional)
 * - Index Group (corresponds to definitions in dictionary)
 * - Deflated Dictionary Streams
 * -- Index Data
 * --- Offsets of definitions
 * --- Offsets of translations
 * --- Flags
 * --- References to other translations
 * -- Definitions
 * -- Translations (xml)
 *
 * TODO: find encoding / language fields to replace auto-detect of encodings
 *
 * </pre>
 *
 * @author keke
 */
public class LingoesLd2Reader {
    private static final String[] AVAIL_ENCODINGS = { "UTF-8", "UTF-16LE", "UTF-16BE" };

    public static void main(String[] args) throws IOException {
        // download from
        // https://skydrive.live.com/?cid=a10100d37adc7ad3&sc=documents&id=A10100D37ADC7AD3%211172#cid=A10100D37ADC7AD3&sc=documents
        String ld2File = "X:\\kkdict\\dicts\\lingoes\\Prodic English-Vietnamese Business.ld2";

        // read lingoes ld2 into byte array
        ByteArrayOutputStream dataOut = new ByteArrayOutputStream();
        FileChannel fChannel = new RandomAccessFile(ld2File, "r").getChannel();
        fChannel.transferTo(0, fChannel.size(), Channels.newChannel(dataOut));
        fChannel.close();

        // as bytes
        ByteBuffer dataRawBytes = ByteBuffer.wrap(dataOut.toByteArray());
        dataRawBytes.order(ByteOrder.LITTLE_ENDIAN);

        System.out.println("File: " + ld2File);
        System.out.println("Type: " + new String(dataRawBytes.array(), 0, 4, "ASCII"));
        System.out.println("Version: " + dataRawBytes.getShort(0x18) + "." + dataRawBytes.getShort(0x1A));
        System.out.println("ID: 0x" + Long.toHexString(dataRawBytes.getLong(0x1C)));

        int offsetData = dataRawBytes.getInt(0x5C) + 0x60;
        if (dataRawBytes.limit() > offsetData) {
            System.out.println("Description address: 0x" + Integer.toHexString(offsetData));
            int type = dataRawBytes.getInt(offsetData);
            System.out.println("Description type: 0x" + Integer.toHexString(type));
            int offsetWithInfo = dataRawBytes.getInt(offsetData + 4) + offsetData + 12;
            if (type == 3) {
                // without additional information
                readDictionary(ld2File, dataRawBytes, offsetData);
            } else if (dataRawBytes.limit() > offsetWithInfo - 0x1C) {
                readDictionary(ld2File, dataRawBytes, offsetWithInfo);
            } else {
                System.err.println("The file does not contain dictionary data. Online dictionary?");
            }
        } else {
            System.err.println("The file does not contain dictionary data. Online dictionary?");
        }
    }

    private static final long decompress(final String inflatedFile, final ByteBuffer data, final int offset,
            final int length, final boolean append) throws IOException {
        Inflater inflator = new Inflater();
        InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(data.array(), offset, length),
                inflator, 1024 * 8);
        FileOutputStream out = new FileOutputStream(inflatedFile, append);
        writeInputStream(in, out);
        long bytesRead = inflator.getBytesRead();
        in.close();
        out.close();
        inflator.end();
        return bytesRead;
    }

    private static final String[] detectEncodings(final ByteBuffer inflatedBytes, final int offsetWords,
            final int offsetXml, final int defTotal, final int dataLen, final int[] idxData, final String[] defData)
            throws UnsupportedEncodingException {
        final int tests = Math.min(defTotal, 10);
        int defEnc = 0;
        int xmlEnc = 0;
        Pattern p = Pattern.compile("^.*[\\x00-\\x1f].*$");
        for (int i = 0; i < tests; i++) {
            readDefinitionData(inflatedBytes, offsetWords, offsetXml, dataLen, AVAIL_ENCODINGS[defEnc],
                    AVAIL_ENCODINGS[xmlEnc], idxData, defData, i);
            if (p.matcher(defData[0]).matches()) {
                if (defEnc < AVAIL_ENCODINGS.length - 1) {
                    defEnc++;
                }
                i = 0;
            }
            if (p.matcher(defData[1]).matches()) {
                if (xmlEnc < AVAIL_ENCODINGS.length - 1) {
                    xmlEnc++;
                }
                i = 0;
            }
        }
        System.out.println("Word encoding: " + AVAIL_ENCODINGS[defEnc]);
        System.out.println("XML encoding: " + AVAIL_ENCODINGS[xmlEnc]);
        return new String[] { AVAIL_ENCODINGS[defEnc], AVAIL_ENCODINGS[xmlEnc] };
    }

    private static final void extract(final String inflatedFile, final String indexFile,
            final String extractedWordsFile, final String extractedXmlFile, final String extractedOutputFile,
            final int[] idxArray, final int offsetDefs, final int offsetXml) throws IOException,
            FileNotFoundException, UnsupportedEncodingException {
        System.out.println("Writing '" + extractedOutputFile + "' ...");
        FileWriter indexWriter = new FileWriter(indexFile);
        FileWriter defsWriter = new FileWriter(extractedWordsFile);
        FileWriter xmlWriter = new FileWriter(extractedXmlFile);
        FileWriter outputWriter = new FileWriter(extractedOutputFile);

        // read inflated data
        ByteArrayOutputStream dataOut = new ByteArrayOutputStream();
        FileChannel fChannel = new RandomAccessFile(inflatedFile, "r").getChannel();
        fChannel.transferTo(0, fChannel.size(), Channels.newChannel(dataOut));
        fChannel.close();
        ByteBuffer dataRawBytes = ByteBuffer.wrap(dataOut.toByteArray());
        dataRawBytes.order(ByteOrder.LITTLE_ENDIAN);

        final int dataLen = 10;
        final int defTotal = offsetDefs / dataLen - 1;
        String[] words = new String[defTotal];
        int[] idxData = new int[6];
        String[] defData = new String[2];
        final String[] encodings = detectEncodings(dataRawBytes, offsetDefs, offsetXml, defTotal, dataLen, idxData,
                defData);

        dataRawBytes.position(8);
        int counter = 0;
        final String defEncoding = encodings[0];
        final String xmlEncoding = encodings[1];
        for (int i = 0; i < defTotal; i++) {
            readDefinitionData(dataRawBytes, offsetDefs, offsetXml, dataLen, defEncoding, xmlEncoding, idxData,
                    defData, i);
            words[i] = defData[0];
            defsWriter.write(defData[0]);
            defsWriter.write("\n");
            xmlWriter.write(defData[1]);
            xmlWriter.write("\n");
            outputWriter.write(defData[0]);
            outputWriter.write("=");
            outputWriter.write(defData[1]);
            outputWriter.write("\n");
            System.out.println(defData[0] + " = " + defData[1]);
            counter++;
        }
        for (int i = 0; i < idxArray.length; i++) {
            int idx = idxArray[i];
            indexWriter.write(words[idx]);
            indexWriter.write(", ");
            indexWriter.write(String.valueOf(idx));
            indexWriter.write("\n");
        }
        indexWriter.close();
        defsWriter.close();
        xmlWriter.close();
        outputWriter.close();
        System.out.println("Successfully read " + counter + " entries.");
    }

    private static final void getIdxData(final ByteBuffer dataRawBytes, final int position, final int[] wordIdxData) {
        dataRawBytes.position(position);
        wordIdxData[0] = dataRawBytes.getInt();
        wordIdxData[1] = dataRawBytes.getInt();
        wordIdxData[2] = dataRawBytes.get() & 0xff;
        wordIdxData[3] = dataRawBytes.get() & 0xff;
        wordIdxData[4] = dataRawBytes.getInt();
        wordIdxData[5] = dataRawBytes.getInt();
    }

    private static final void inflate(final ByteBuffer dataRawBytes, final List<Integer> deflateStreams,
            final String inflatedFile) {
        System.out.println("Inflating " + deflateStreams.size() + " data streams into '" + inflatedFile + "' ...");
        int startOffset = dataRawBytes.position();
        int offset = -1;
        int lastOffset = startOffset;
        boolean append = false;
        try {
            for (Integer offsetRelative : deflateStreams) {
                offset = startOffset + offsetRelative.intValue();
                decompress(inflatedFile, dataRawBytes, lastOffset, offset - lastOffset, append);
                append = true;
                lastOffset = offset;
            }
        } catch (Throwable e) {
            System.err.println("Inflating failed: 0x" + Integer.toHexString(offset) + ": " + e.toString());
        }
    }

    private static final void readDefinitionData(final ByteBuffer inflatedBytes, final int offsetWords,
            final int offsetXml, final int dataLen, final String wordEncoding, final String xmlEncoding,
            final int[] idxData, final String[] defData, final int i) throws UnsupportedEncodingException {
        getIdxData(inflatedBytes, dataLen * i, idxData);
        int lastWordPos = idxData[0];
        int lastXmlPos = idxData[1];
        final int flags = idxData[2];
        int refs = idxData[3];
        int currentWordOffset = idxData[4];
        int currenXmlOffset = idxData[5];
        String xml = strip(new String(inflatedBytes.array(), offsetXml + lastXmlPos, currenXmlOffset - lastXmlPos,
                xmlEncoding));
        while (refs-- > 0) {
            int ref = inflatedBytes.getInt(offsetWords + lastWordPos);
            getIdxData(inflatedBytes, dataLen * ref, idxData);
            lastXmlPos = idxData[1];
            currenXmlOffset = idxData[5];
            if (xml.isEmpty()) {
                xml = strip(new String(inflatedBytes.array(), offsetXml + lastXmlPos, currenXmlOffset - lastXmlPos,
                        xmlEncoding));
            } else {
                xml = strip(new String(inflatedBytes.array(), offsetXml + lastXmlPos, currenXmlOffset - lastXmlPos,
                        xmlEncoding)) + ", " + xml;
            }
            lastWordPos += 4;
        }
        defData[1] = xml;
        String word = new String(inflatedBytes.array(), offsetWords + lastWordPos, currentWordOffset - lastWordPos,
                wordEncoding);
        defData[0] = word;
    }

    private static final void readDictionary(final String ld2File, final ByteBuffer dataRawBytes,
            final int offsetWithIndex) throws IOException, FileNotFoundException, UnsupportedEncodingException {
        System.out.println("Dictionary type: 0x" + Integer.toHexString(dataRawBytes.getInt(offsetWithIndex)));
        int limit = dataRawBytes.getInt(offsetWithIndex + 4) + offsetWithIndex + 8;
        int offsetIndex = offsetWithIndex + 0x1C;
        int offsetCompressedDataHeader = dataRawBytes.getInt(offsetWithIndex + 8) + offsetIndex;
        int inflatedWordsIndexLength = dataRawBytes.getInt(offsetWithIndex + 12);
        int inflatedWordsLength = dataRawBytes.getInt(offsetWithIndex + 16);
        int inflatedXmlLength = dataRawBytes.getInt(offsetWithIndex + 20);
        int definitions = (offsetCompressedDataHeader - offsetIndex) / 4;
        List<Integer> deflateStreams = new ArrayList<Integer>();
        dataRawBytes.position(offsetCompressedDataHeader + 8);
        int offset = dataRawBytes.getInt();
        while (offset + dataRawBytes.position() < limit) {
            offset = dataRawBytes.getInt();
            deflateStreams.add(Integer.valueOf(offset));
        }
        int offsetCompressedData = dataRawBytes.position();
        System.out.println("Indexed definitions: " + definitions);
        System.out.println("Index address/size: 0x" + Integer.toHexString(offsetIndex) + " / "
                + (offsetCompressedDataHeader - offsetIndex) + " B");
        System.out.println("Compressed data address/size: 0x" + Integer.toHexString(offsetCompressedData) + " / "
                + (limit - offsetCompressedData) + " B");
        System.out.println("Word index address/size (inflated): 0x0 / " + inflatedWordsIndexLength + " B");
        System.out.println("Words address/size (inflated): 0x" + Integer.toHexString(inflatedWordsIndexLength)
                + " / " + inflatedWordsLength + " B");
        System.out.println("XML address/size (inflated): 0x"
                + Integer.toHexString(inflatedWordsIndexLength + inflatedWordsLength) + " / " + inflatedXmlLength
                + " B");
        System.out.println("File size (inflated): "
                + (inflatedWordsIndexLength + inflatedWordsLength + inflatedXmlLength) / 1024 + " KB");

        String inflatedFile = ld2File + ".inflated";
        inflate(dataRawBytes, deflateStreams, inflatedFile);

        if (new File(inflatedFile).isFile()) {
            String indexFile = ld2File + ".idx";
            String extractedFile = ld2File + ".words";
            String extractedXmlFile = ld2File + ".xml";
            String extractedOutputFile = ld2File + ".output";

            dataRawBytes.position(offsetIndex);
            int[] idxArray = new int[definitions];
            for (int i = 0; i < definitions; i++) {
                idxArray[i] = dataRawBytes.getInt();
            }
            extract(inflatedFile, indexFile, extractedFile, extractedXmlFile, extractedOutputFile, idxArray,
                    inflatedWordsIndexLength, inflatedWordsIndexLength + inflatedWordsLength);
        }
    }

    private static final String strip(final String xml) {
        int open = 0;
        int end = 0;
        if ((open = xml.indexOf("<![CDATA[")) != -1) {
            if ((end = xml.indexOf("]]>", open)) != -1) {
                return xml.substring(open + "<![CDATA[".length(), end).replace('\t', ' ').replace('\n', ' ')
                        .replace('\u001e', ' ').replace('\u001f', ' ');
            }
        } else if ((open = xml.indexOf("<Ô")) != -1) {
            if ((end = xml.indexOf("</Ô", open)) != -1) {
                open = xml.indexOf(">", open + 1);
                return xml.substring(open + 1, end).replace('\t', ' ').replace('\n', ' ').replace('\u001e', ' ')
                        .replace('\u001f', ' ');
            }
        } else {
            StringBuilder sb = new StringBuilder();
            end = 0;
            open = xml.indexOf('<');
            do {
                if (open - end > 1) {
                    sb.append(xml.substring(end + 1, open));
                }
                open = xml.indexOf('<', open + 1);
                end = xml.indexOf('>', end + 1);
            } while (open != -1 && end != -1);
            return sb.toString().replace('\t', ' ').replace('\n', ' ').replace('\u001e', ' ').replace('\u001f', ' ');
        }
        return "";
    }

    private static final void writeInputStream(final InputStream in, final OutputStream out) throws IOException {
        byte[] buffer = new byte[1024 * 8];
        int len;
        while ((len = in.read(buffer)) > 0) {
            out.write(buffer, 0, len);
        }
    }
}
```
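LD2 stores its payload as concatenated deflate streams, which the reader above decompresses one by one with `java.util.zip.Inflater`. A minimal self-contained sketch of that round trip (the input here is synthetic, not a real LD2 stream):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateRoundTrip {
    // Compresses bytes into a single deflate stream.
    public static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses one deflate stream, as the LD2 reader does per stream.
    public static byte[] inflate(byte[] compressed) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "dictionary payload".getBytes("UTF-8");
        byte[] restored = inflate(deflate(original));
        System.out.println(new String(restored, "UTF-8")); // prints: dictionary payload
    }
}
```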
Source code download: https://code.google.com/p/dict4cn/source/browse/trunk/importer/src/SogouScelReader.java
Source Code:
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;

/**
 * Sogou Pinyin IME SCEL File Reader
 *
 * SCEL Format overview:
 *
 * General Information:
 * - Chinese characters and pinyin are all encoded with UTF-16LE.
 * - Numbers are using little endian byte order.
 *
 * SCEL hex analysis:
 * - 0x0    Pinyin List Offset
 * - 0x120  total number of words
 * - 0x     total number of pinyin
 * - ...    List of pinyin as [index, byte length of pinyin, pinyin as string] triples
 * - ...    Dictionary
 *
 * Dictionary format:
 * - It can be interpreted as a list of
 *   [alternatives of words,
 *    byte length of pinyin indexes, pinyin indexes,
 *    [byte length of word, word as string, length of skip bytes, skip bytes]
 *    ... (alternatives)
 *   ].
 *
 * @author keke
 */
public class SogouScelReader {
    public static void main(String[] args) throws IOException {
        // download from http://pinyin.sogou.com/dict
        String scelFile = "D:\\test.scel";

        // read scel into byte array
        ByteArrayOutputStream dataOut = new ByteArrayOutputStream();
        FileChannel fChannel = new RandomAccessFile(scelFile, "r").getChannel();
        fChannel.transferTo(0, fChannel.size(), Channels.newChannel(dataOut));
        fChannel.close();

        // scel as bytes
        ByteBuffer dataRawBytes = ByteBuffer.wrap(dataOut.toByteArray());
        dataRawBytes.order(ByteOrder.LITTLE_ENDIAN);

        byte[] buf = new byte[1024];
        String[] pyDict = new String[512];

        int totalWords = dataRawBytes.getInt(0x120);

        // pinyin offset
        dataRawBytes.position(dataRawBytes.getInt());
        int totalPinyin = dataRawBytes.getInt();
        for (int i = 0; i < totalPinyin; i++) {
            int idx = dataRawBytes.getShort();
            int len = dataRawBytes.getShort();
            dataRawBytes.get(buf, 0, len);
            pyDict[idx] = new String(buf, 0, len, "UTF-16LE");
        }

        // extract dictionary
        int counter = 0;
        for (int i = 0; i < totalWords; i++) {
            StringBuilder py = new StringBuilder();
            StringBuilder word = new StringBuilder();
            int alternatives = dataRawBytes.getShort();
            int pyLength = dataRawBytes.getShort() / 2;
            boolean first = true;
            while (pyLength-- > 0) {
                int key = dataRawBytes.getShort();
                if (first) {
                    first = false;
                } else {
                    py.append('\'');
                }
                py.append(pyDict[key]);
            }
            first = true;
            while (alternatives-- > 0) {
                if (first) {
                    first = false;
                } else {
                    word.append(", ");
                }
                int wordlength = dataRawBytes.getShort();
                dataRawBytes.get(buf, 0, wordlength);
                word.append(new String(buf, 0, wordlength, "UTF-16LE"));
                // skip bytes
                dataRawBytes.get(buf, 0, dataRawBytes.getShort());
            }
            System.out.println(word.toString() + "\t" + py.toString());
            counter++;
        }
        System.out.println("\nExtracted '" + scelFile + "': " + counter);
    }
}
```
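The pinyin table at the start of the file is a list of [16-bit index, 16-bit byte length, UTF-16LE string] triples; word entries then reference syllables by index, joined with apostrophes on output. A self-contained sketch of this lookup with synthetic table data (not a real SCEL file):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class ScelPinyinTable {
    // Parses [short index][short byteLen][UTF-16LE pinyin] triples into a lookup table.
    public static String[] parseTable(ByteBuffer bb, int totalPinyin) {
        String[] pyDict = new String[512];
        byte[] buf = new byte[1024];
        for (int i = 0; i < totalPinyin; i++) {
            int idx = bb.getShort();
            int len = bb.getShort();
            bb.get(buf, 0, len);
            pyDict[idx] = new String(buf, 0, len, StandardCharsets.UTF_16LE);
        }
        return pyDict;
    }

    // Resolves syllable indexes to a pinyin string joined with apostrophes.
    public static String joinPinyin(String[] pyDict, int... keys) {
        StringBuilder py = new StringBuilder();
        for (int k : keys) {
            if (py.length() > 0) {
                py.append('\'');
            }
            py.append(pyDict[k]);
        }
        return py.toString();
    }

    public static void main(String[] args) {
        // synthetic table: index 0 -> "san", 1 -> "ling", 2 -> "pai"
        ByteBuffer bb = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        String[] syllables = { "san", "ling", "pai" };
        for (int i = 0; i < syllables.length; i++) {
            byte[] enc = syllables[i].getBytes(StandardCharsets.UTF_16LE);
            bb.putShort((short) i).putShort((short) enc.length).put(enc);
        }
        bb.rewind();
        String[] pyDict = parseTable(bb, 3);
        System.out.println(joinPinyin(pyDict, 0, 1, 2)); // prints san'ling'pai
    }
}
```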
Source code download: https://dict4cn.googlecode.com/svn/trunk/importer/src/BaiduBcdReader.java
Source Code:
```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;

/**
 * Baidu Pinyin IME BDICT File Reader
 *
 * BDICT Format overview:
 *
 * General Information:
 * - Chinese characters and pinyin are all encoded with UTF-16LE.
 * - Numbers are using little endian byte order.
 *
 * BDICT hex analysis:
 * - 0x250  total number of words
 * - 0x350  dictionary offset
 * - 0x     Dictionary
 *
 * Dictionary format:
 * - It can be interpreted as a list of
 *   [amount of characters (short not integer!),
 *    pinyin construction using fenmu and yunmu,
 *    word as string
 *   ].
 *
 * @author keke
 */
public class BaiduBdictReader {
    private static final String[] FEN_MU = { "c", "d", "b", "f", "g", "h", "ch", "j", "k", "l", "m", "n", "", "p",
            "q", "r", "s", "t", "sh", "zh", "w", "x", "y", "z" };
    private static final String[] YUN_MU = { "uang", "iang", "ong", "ang", "eng", "ian", "iao", "ing", "ong", "uai",
            "uan", "ai", "an", "ao", "ei", "en", "er", "ua", "ie", "in", "iu", "ou", "ia", "ue", "ui", "un", "uo",
            "a", "e", "i", "a", "u", "v" };

    public static void main(String[] args) throws IOException {
        // download from http://r6.mo.baidu.com/web/iw/index/
        String bdictFile = "D:\\test.bcd";

        // read bdict into byte array
        ByteArrayOutputStream dataOut = new ByteArrayOutputStream();
        FileChannel fChannel = new RandomAccessFile(bdictFile, "r").getChannel();
        fChannel.transferTo(0, fChannel.size(), Channels.newChannel(dataOut));
        fChannel.close();

        // bdict as bytes
        ByteBuffer dataRawBytes = ByteBuffer.wrap(dataOut.toByteArray());
        dataRawBytes.order(ByteOrder.LITTLE_ENDIAN);

        byte[] buf = new byte[1024];
        int total = dataRawBytes.getInt(0x250);

        // dictionary offset
        dataRawBytes.position(0x350);
        for (int i = 0; i < total; i++) {
            int length = dataRawBytes.getShort();
            dataRawBytes.getShort();
            boolean first = true;
            StringBuilder pinyin = new StringBuilder();
            for (int j = 0; j < length; j++) {
                if (first) {
                    first = false;
                } else {
                    pinyin.append('\'');
                }
                pinyin.append(FEN_MU[dataRawBytes.get()] + YUN_MU[dataRawBytes.get()]);
            }
            dataRawBytes.get(buf, 0, 2 * length);
            String word = new String(buf, 0, 2 * length, "UTF-16LE");
            System.out.println(word + "\t" + pinyin);
        }
        System.out.println("\nExtracted '" + bdictFile + "': " + total);
    }
}
```
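Each character's pinyin is reconstructed by indexing the FEN_MU (initial) and YUN_MU (final) tables with one byte each. A tiny self-contained sketch of that lookup, with the tables copied from the reader above:

```java
public class BdictPinyin {
    private static final String[] FEN_MU = { "c", "d", "b", "f", "g", "h", "ch", "j", "k", "l", "m", "n", "", "p",
            "q", "r", "s", "t", "sh", "zh", "w", "x", "y", "z" };
    private static final String[] YUN_MU = { "uang", "iang", "ong", "ang", "eng", "ian", "iao", "ing", "ong", "uai",
            "uan", "ai", "an", "ao", "ei", "en", "er", "ua", "ie", "in", "iu", "ou", "ia", "ue", "ui", "un", "uo",
            "a", "e", "i", "a", "u", "v" };

    // One character = one (fenmu index, yunmu index) byte pair in the file.
    public static String syllable(int fenMuIdx, int yunMuIdx) {
        return FEN_MU[fenMuIdx] + YUN_MU[yunMuIdx];
    }

    public static void main(String[] args) {
        // "zh" (index 19) + "ong" (index 2) -> "zhong"; "g" (4) + "uo" (26) -> "guo"
        System.out.println(syllable(19, 2) + "'" + syllable(4, 26)); // prints zhong'guo
    }
}
```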
Source code download: https://code.google.com/p/dict4cn/source/browse/trunk/importer/src/QQPinyinQpydReader.java
Output (汽车品牌.qpyd):
```
Name: 汽车品牌
Type: 汽车
Subtype: 爱好
Description: 汽车品牌
Samples: 三菱牌 印度斯坦 爱丽舍 吉奥 阿斯顿 京城海狮 风骏 东南汽车 福特牌 御马 富康 华阳汽车 海锋 奇瑞君威 德托 大发牌 都市骏马 利亚纳 法比亚伊比萨
Entries: 961
Compressed dictionary data address: 0x180
三菱牌	san'ling'pai
典雅	dian'ya
奇兵	qi'bing
东风牌	dong'feng'pai
水星	shui'xing
新锋锐	xin'feng'rui
勇士	yong'shi
百利	bai'li
嘉年华	jia'nian'hua
飞碟汽车	fei'die'qi'che
...
```
Source Code:
```java
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.Arrays;
import java.util.zip.InflaterOutputStream;

/**
 * QQ Pinyin IME QPYD File Reader
 *
 * QPYD Format overview:
 *
 * General Information:
 * - Chinese characters are all encoded with UTF-16LE.
 * - Pinyin are encoded in ascii (or UTF-8).
 * - Numbers are using little endian byte order.
 *
 * QPYD hex analysis:
 * - 0x00  QPYD file identifier
 * - 0x38  offset of compressed data (word-pinyin-dictionary)
 * - 0x44  total words in qpyd
 * - 0x60  start of header information
 *
 * Compressed data analysis:
 * - zip/standard (beginning with 0x789C) is used in (all analyzed) qpyd files
 * - data is divided in two parts
 * -- 1. offset and length information (10 bytes for each pinyin-word pair)
 *       0x06 offset points to first pinyin
 *       0x00 length of pinyin
 *       0x01 length of word
 * -- 2. actual data
 *       Dictionary data has the form ((pinyin)(word))* with no separators.
 *       Data can only be read using offset and length information.
 *
 * @author keke
 */
public class QQPinyinQpydReader {
    public static void main(String[] args) throws IOException {
        // download from http://dict.py.qq.com/list.php
        String qqydFile = "D:\\test.qpyd";

        // read qpyd into byte array
        ByteArrayOutputStream dataOut = new ByteArrayOutputStream();
        FileChannel fChannel = new RandomAccessFile(qqydFile, "r").getChannel();
        fChannel.transferTo(0, fChannel.size(), Channels.newChannel(dataOut));
        fChannel.close();

        // qpyd as bytes
        ByteBuffer dataRawBytes = ByteBuffer.wrap(dataOut.toByteArray());
        dataRawBytes.order(ByteOrder.LITTLE_ENDIAN);

        // read info of compressed data
        int startZippedDictAddr = dataRawBytes.getInt(0x38);
        int zippedDictLength = dataRawBytes.limit() - startZippedDictAddr;

        // qpyd header as UTF-16LE string
        String dataString = new String(Arrays.copyOfRange(dataRawBytes.array(), 0x60, startZippedDictAddr),
                "UTF-16LE");

        // print header
        System.out.println("Name: " + substringBetween(dataString, "Name: ", "\r\n"));
        System.out.println("Type: " + substringBetween(dataString, "Type: ", "\r\n"));
        System.out.println("Subtype: " + substringBetween(dataString, "FirstType: ", "\r\n"));
        System.out.println("Description: " + substringBetween(dataString, "Intro: ", "\r\n"));
        System.out.println("Samples: " + substringBetween(dataString, "Example: ", "\r\n"));
        System.out.println("Entries: " + dataRawBytes.getInt(0x44));

        // read zipped qpyd dictionary into byte array
        dataOut.reset();
        Channels.newChannel(new InflaterOutputStream(dataOut)).write(
                ByteBuffer.wrap(dataRawBytes.array(), startZippedDictAddr, zippedDictLength));

        // uncompressed qpyd dictionary as bytes
        ByteBuffer dataUnzippedBytes = ByteBuffer.wrap(dataOut.toByteArray());
        dataUnzippedBytes.order(ByteOrder.LITTLE_ENDIAN);

        // for debugging: save unzipped data to *.unzipped file
        Channels.newChannel(new FileOutputStream(qqydFile + ".unzipped")).write(dataUnzippedBytes);
        System.out.println("Compressed data: 0x" + Integer.toHexString(startZippedDictAddr) + " (before inflating: "
                + zippedDictLength + " B, after inflating: " + dataUnzippedBytes.limit() + " B)");

        // stores the start address of actual dictionary data
        int unzippedDictStartAddr = -1;
        int idx = 0;
        byte[] byteArray = dataUnzippedBytes.array();
        while (unzippedDictStartAddr == -1 || idx < unzippedDictStartAddr) {
            // read word
            int pinyinStartAddr = dataUnzippedBytes.getInt(idx + 0x6);
            int pinyinLength = dataUnzippedBytes.get(idx + 0x0) & 0xff;
            int wordStartAddr = pinyinStartAddr + pinyinLength;
            int wordLength = dataUnzippedBytes.get(idx + 0x1) & 0xff;
            if (unzippedDictStartAddr == -1) {
                unzippedDictStartAddr = pinyinStartAddr;
                System.out.println("Dictionary address (inflated): 0x"
                        + Integer.toHexString(unzippedDictStartAddr) + "\n");
            }
            String pinyin = new String(Arrays.copyOfRange(byteArray, pinyinStartAddr, pinyinStartAddr
                    + pinyinLength), "UTF-8");
            String word = new String(Arrays.copyOfRange(byteArray, wordStartAddr, wordStartAddr + wordLength),
                    "UTF-16LE");
            System.out.println(word + "\t" + pinyin);

            // step up
            idx += 0xa;
        }
    }

    public static final String substringBetween(String text, String start, String end) {
        int nStart = text.indexOf(start);
        int nEnd = text.indexOf(end, nStart + 1);
        if (nStart != -1 && nEnd != -1) {
            return text.substring(nStart + start.length(), nEnd);
        } else {
            return null;
        }
    }
}
```
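The inflated stream's index entries described above are 10 bytes each: pinyin byte length at +0x0, word byte length at +0x1, and the absolute pinyin offset as a 32-bit int at +0x6, with the word immediately following the pinyin. A self-contained sketch that builds one synthetic entry and reads it back:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class QpydIndexEntry {
    // Reads the [word, pinyin] pair described by the 10-byte index entry at idx.
    public static String[] read(ByteBuffer data, int idx) {
        int pinyinStart = data.getInt(idx + 0x6);
        int pinyinLen = data.get(idx + 0x0) & 0xff;
        int wordStart = pinyinStart + pinyinLen;
        int wordLen = data.get(idx + 0x1) & 0xff;
        byte[] arr = data.array();
        String pinyin = new String(arr, pinyinStart, pinyinLen, StandardCharsets.UTF_8);
        String word = new String(arr, wordStart, wordLen, StandardCharsets.UTF_16LE);
        return new String[] { word, pinyin };
    }

    public static void main(String[] args) {
        byte[] pinyin = "fu'kang".getBytes(StandardCharsets.UTF_8);
        byte[] word = "富康".getBytes(StandardCharsets.UTF_16LE);
        ByteBuffer bb = ByteBuffer.allocate(10 + pinyin.length + word.length).order(ByteOrder.LITTLE_ENDIAN);
        // one index entry at offset 0, the data right after it at offset 10
        bb.put(0x0, (byte) pinyin.length);
        bb.put(0x1, (byte) word.length);
        bb.putInt(0x6, 10);
        bb.position(10);
        bb.put(pinyin).put(word);
        String[] pair = read(bb, 0);
        System.out.println(pair[0] + "\t" + pair[1]); // prints: 富康	fu'kang
    }
}
```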