[qa:47785] Re: dvipdfm vs. pdflatex? (Re: Experiment in making pTeX Unicode-based internally)

Name: anonymouse
Date: 2007-04-27 23:29:11
IP address: 66.7.192.*

>>47783
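> VF (virtual fonts) compensate for TeX's weak points (hyphenation, kerning),
> but they are powerless against the text-information problem. Inside a font,
> a glyph and its code position (its glyph name, in Type1?) are inseparable,
> and at present no processing step before PDF output changes that mapping.
> In the end, is making a "real font" the only option, I wonder...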

With dvipdfmx, wouldn't it be enough to create a CMap?

Either way, reliable text information can be obtained from a PDF
(generated by TeX) only when "simple" characters are used.

However, the whole business of extracting text/searching/etc in PDF  
files based on CMap resources is a mess, and my advice would be to  
regard PDF as a medium for viewing and printing, not for text data  
exchange. The stream of glyphs present in the PDF may have very  
complex relationships to the underlying Unicode text -- consider, for  
example, Indic scripts where there is extensive reordering of  
elements within the syllable. As I understand it, to search for  
"hindi" in a PDF with Acrobat, you'd effectively have to type "ihndi"  
as the search string (and that's just a small example; it gets much  
worse).

Sure, it's nice (especially for plain English text) when copy/paste  
and text search give you a good approximation of what you'd expect,  
but until there's a (widely-supported) way to "annotate" the glyph  
stream in the PDF with the associated Unicode text, rather than  
attempting to recover Unicode characters from the actual sequence of  
glyphs, it will never really be universal and reliable. The
character-to-glyph process is not fully reversible; there's too much
complexity and potential ambiguity in the mappings and transformations.

JK
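
The reordering problem JK describes can be shown with a toy model. This is only a sketch under loud assumptions: `to_visual_order` and `naive_extract` are hypothetical helpers, not a real shaping engine or PDF extractor, and the only rule modeled is that the Devanagari vowel sign I (U+093F) is drawn to the left of the consonant it follows in the text.

```python
# Toy model only: hypothetical helpers, not a real shaper or PDF text extractor.
PRE_BASE_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn left of its consonant


def to_visual_order(text: str) -> str:
    """Move each vowel sign I in front of the character that precedes it."""
    out = []
    for ch in text:
        if ch == PRE_BASE_MATRA and out:
            out.insert(len(out) - 1, ch)  # glyph goes before the consonant
        else:
            out.append(ch)
    return "".join(out)


def naive_extract(glyph_stream: str) -> str:
    """Map glyphs back to characters one by one, in stream order."""
    return "".join(glyph_stream)  # identity mapping: the best a per-glyph table can do


logical = "\u0939\u093F\u0928\u094D\u0926\u0940"  # "hindi" in logical (typed) order
visual = to_visual_order(logical)                 # order of glyphs on the page
extracted = naive_extract(visual)

print(extracted == logical)   # False
print(logical in extracted)   # False: searching for the typed word fails
```

Every glyph maps back to the right character, yet the extracted string starts with the vowel sign, which is the "ihndi" effect. Real shaping also involves conjuncts and ligatures, which this sketch ignores.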

# I added the restriction "PDF generated by TeX" because a dvi file contains
# no space characters, so converting it to tagged PDF requires some kind of
# heuristic guess.
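
To make "heuristic guess" concrete, here is a minimal sketch of one such guess: insert a space wherever the horizontal gap between two positioned glyphs exceeds a threshold. Everything in it is an assumption for illustration. `PlacedGlyph`, `guess_text`, the `gap_ratio` parameter, and the glyph metrics are made up, and this is not what dvipdfmx or any other tool actually does.

```python
from dataclasses import dataclass


@dataclass
class PlacedGlyph:
    char: str     # character the glyph is assumed to map to
    x: float      # left edge, in points
    width: float  # advance width, in points


def guess_text(glyphs, gap_ratio=0.25, font_size=10.0):
    """Rebuild one line of text, guessing a space at each sufficiently wide gap."""
    threshold = gap_ratio * font_size
    pieces = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end > threshold:
            pieces.append(" ")  # gap looks like interword glue
        pieces.append(g.char)
        prev_end = g.x + g.width
    return "".join(pieces)


# Made-up metrics: no space glyph anywhere, only positions.
line = [
    PlacedGlyph("T", 0.0, 7.2),
    PlacedGlyph("e", 6.4, 4.4),   # kerned back toward "T" (negative gap)
    PlacedGlyph("X", 10.8, 7.5),
    PlacedGlyph("i", 21.6, 2.8),  # about 3.3pt after "X": interword glue
    PlacedGlyph("s", 24.4, 3.9),
]

print(guess_text(line))                 # TeX is
print(guess_text(line, gap_ratio=0.4))  # TeXis
```

The second call shows why this is only a guess: with a slightly different threshold, or with glue that has shrunk on a tight line, the word boundary disappears.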