#!/usr/bin/env ruby
# -*- coding: utf-8 -*-

=begin

= convbkmk Ver.0.30

  2018.11.25
  Takuji Tanaka
  ttk (at) t-lab.opal.ne.jp
((<URL:http://www.t-lab.opal.ne.jp/tex/uptex_en.html>))


== Abstract

((*convbkmk*)) is a tiny utility for making correct bookmarks in pdf files
typesetted by pLaTeX/upLaTeX with the hyperref package.
pLaTeX/upLaTeX + hyperref outputs data of bookmarks
in their internal encodings (EUC-JP, Shift_JIS or UTF-8).
On the other hand, the PostScript/PDF format requests that
the data is written in a certain syntax with UTF-16 or PDFDocEncoding.
Thus, data conversion is required to create correct bookmarks.

In addition, pLaTeX outputs dvi files with special commands
in its internal encoding (EUC-JP or Shift_JIS).
It is not consistent with recent dviware and file systems
which assume UTF-8.

((*convbkmk*)) provides a function of
the encoding conversion and formatting the data.

== Requirement

((*ruby*)) 1.9.x or later is required.
((*ruby*)) 1.8.x is no longer supported.
To support conversion of dvi special,
((*dvispc*)) in dviout-util is required.

== Examples

=== for pdf bookmark

pLaTeX (internal kanji code: euc) + hyperref + dvips :
 $ platex doc00.tex
 $ platex doc00.tex
 $ dvips doc00.dvi
 $ convbkmk.rb -e doc00.ps
 $ ps2pdf doc00-convbkmk.ps

pLaTeX (kanji code: sjis) + hyperref + dvipdfmx :
 $ platex doc01.tex
 $ platex doc01.tex
 $ convbkmk.rb -s -o doc01.out
 $ platex doc01.tex
 $ dvipdfmx doc01.dvi

upLaTeX + hyperref + dvips :
 $ uplatex doc02.tex
 $ uplatex doc02.tex
 $ dvips doc02.dvi
 $ convbkmk.rb doc02.ps
 $ ps2pdf doc02-convbkmk.ps

upLaTeX + hyperref + dvipdfmx :
 $ uplatex doc03.tex
 $ uplatex doc03.tex
 $ convbkmk.rb -o doc03.out
 $ uplatex doc03.tex
 $ dvipdfmx doc03.dvi

=== for dvi special (graphic file names)

pLaTeX (internal kanji code: euc) + dvips :
 $ platex doc04.tex
 $ platex doc04.tex
 $ convbkmk.rb -e -d doc04.dvi
 $ dvips doc04-convbkmk.dvi
 $ ps2pdf doc04-convbkmk.ps

pLaTeX (internal kanji code: sjis) + dvipdfmx :
 $ platex doc05.tex
 $ platex doc05.tex
 $ convbkmk.rb -s -d doc05.dvi
 $ dvipdfmx doc05.dvi

((*convbkmk*)) executes ((*dvispc*)) command
to extract dvi files.
((*dvispc*)) command is designated by
an environmental variable 'DVISPC'.
By default, 'dvispc' is set.

More examples are provided at the GitHub repository
and by the upTeX source archive.

== Repository

convbkmk is maintained on GitHub:
((<URL:https://github.com/t-tk/convbkmk>))

== License

convbkmk

Copyright (c) 2009-2018 Takuji Tanaka

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

== History

: 2009.08.02  0.00
 * Initial version.
: 2011.05.02  0.01
 * Bug fix: BOM was not correct.
: 2012.05.08  0.02
 * Bug fix: for a case of dvips with -z option and Ruby1.8.
 * Add conversion of /Creator and /Producer .
: 2012.05.12  0.03
 * Suppress halfwidth -> fullwidth katakana conversion and MIME decoding in Ruby1.8.
: 2012.06.01  0.04
 * Support escape sequences: \n, \r, \t, \b, \f, \\, \ddd (octal, PDFDocEncoding) and \0xUUUU (Unicode UTF-16BE).
 * Support sequences of end of line: '\' or other followed by "\n", "\r\n" or "\r" .
 * Set file IO to binary mode.
: 2012.07.26  0.05
 * Add -o option to support conversion of OUT files generated by dvipdfmx.
: 2012.08.07  0.06
 * Bug fix: Ver.0.05 does not work with Ruby1.9.
: 2012.09.17  0.07
 * Bug fix: An infinite loop occurs in Ver.0.05, 0.06 with -g option in some cases.
 * Add reference for PDFDocEncoding.
: 2013.05.11  0.08
 * Add -O option: overwrite output files onto input files instead of creating foo-convbkmk.ps .
 * Make comments rd/rdtool friendly.
: 2014.03.02  0.09
 * Bug fix: Conversion was not complete in some cases.
: 2014.03.08  0.10
 * Bug fix: Output of binary data might be broken in filter mode on Windows.
: 2014.12.29  0.10a
 * Update the author's mail address and web site.
: 2018.11.11  0.20
 * Do not support Ruby1.8 anymore.
: 2018.11.25  0.30
 * Add -d option to support conversion of graphic file names in dvi special by pLaTeX.

=end

Version = "0.30"

require "optparse"

if RUBY_VERSION < "1.9"
  abort("Ruby 1.8 or earlier is no longer supported.")
end

class String
  def to_utf8(enc)
    self.force_encoding(enc.current).encode('UTF-8')
  end
  def utf16be_to_utf8
    self.force_encoding('UTF-16BE').encode('UTF-8')
  end
  def utf8_to_utf16be
    self.force_encoding('UTF-8').encode('UTF-16BE')
  end
end

class TeXEncoding
  attr_accessor :current, :option, :status, :is_8bit
  attr_reader :list

  def initialize
    @current = false
    @option = false
    @status = false
    @is_8bit = false
    @list = ['Shift_JIS', 'EUC-JP', 'UTF-8']
  end

  def set_process_encoding(enc)
    if @status == 'fixed'
      raise 'dupulicate definition'
    end
    if enc == 'guess'
      @option = 'guess'
      @status = 'guess'
    else
      @current = enc
      @option = enc
      @status = 'fixed'
    end
    return enc
  end
end
enc = TeXEncoding.new

Opts = {}

OptionParser.new do |opt|
  opt.on('-e', '--euc-jp',
         'set pTeX internal encoding to EUC-JP') {|v|
    enc.set_process_encoding('EUC-JP')
  }
  opt.on('-s', '--shift_jis',
         'set pTeX internal encoding to Shift_JIS') {|v|
    enc.set_process_encoding('Shift_JIS')
  }
  opt.on('-u', '--utf-8',
         'set upTeX internal encoding to UTF-8') {|v|
    enc.set_process_encoding('UTF-8')
  }
  opt.on('-g', '--guess',
         'guess pTeX/upTeX internal encoding') {|v|
    enc.set_process_encoding('guess')
  }
  enc_alias = Hash.new
  enc.list.each { |e|
    enc_alias[e] = e
    enc_alias[e[0]] = e
    enc_alias[e.downcase] = e
  }
  opt.on('--enc=ENC', enc_alias,
         'set pTeX/upTeX internal encoding to ENC') {|v|
    enc.set_process_encoding(v)
  }
  opt.on('-o', '--out',
         'treat OUT files') {|v|
    Opts[:mode] = :out
    Opts[:overwrite] = true
    require "fileutils"
  }
  opt.on('-d', '--dvi-special',
         'treat specials in DVI files') {|v|
    Opts[:mode] = :spc
    Dvispc = ENV["DVISPC"] ||= 'dvispc'
    require "fileutils"
  }
  opt.on('-O', '--overwrite',
         'overwrite output files') {|v|
    Opts[:overwrite] = true
    require "fileutils"
  }
  opt.banner += " file0.ps [file1.ps ...]\n" \
    + opt.banner.sub('Usage:','      ') + " < in_file.ps > out_file.ps\n" \
    + opt.banner.sub('Usage:','      ') + " -o file0.out [file1.out ...]\n" \
    + opt.banner.sub('Usage:','      ') + " -d file0.dvi [file1.dvi ...]\n" \
    + opt.banner.sub('Usage:','      ') + " -d file0.dvispc [file1.dvispc ...]"

  opt.parse!
end

# default encoding
if enc.status == false
  enc.set_process_encoding('UTF-8')
end
if Opts[:mode] == :out
  OpenP, CloseP, OpenPEsc, ClosePEsc = '{', '}', '\{', '\}'
  FileSfx = 'out'
elsif Opts[:mode] == :spc then
  FileSfx = '(dvi|dvispc)'
else
  OpenP, CloseP, OpenPEsc, ClosePEsc = '(', ')', '\(', '\)'
  FileSfx = 'ps'
end

def try_guess_encoding(line, enc)
  return 'US-ASCII' if line.ascii_only?

  enc.is_8bit = true
  valid_enc = false
  count = 0
  enc.list.each { |e|
    if line.dup.force_encoding(e).valid_encoding?
      count += 1
      valid_enc = e
    end
  }
  if count == 1
    enc.set_process_encoding(valid_enc)
    return valid_enc
  elsif count > 1
    return false # ambiguous
  else
    raise 'Cannot guess encoding!'
  end
end

def os_legacy_encoding(enc)
  return if enc.status != 'guess'
  enc.is_8bit = true
  if (RUBY_PLATFORM =~ /mswin|msys|mingw|cygwin|bccwin|wince|emc/i)
    valid_enc = 'Shift_JIS'
  else
    valid_enc = 'EUC-JP'
  end
  enc.set_process_encoding(valid_enc)
end

def check_parentheses_balance(line, enc)
  depth = 0
  count = 0
  tmp_prev = ''
  tmp_rest = line

  if enc.status == 'guess'
    if tmp_enc = try_guess_encoding(line, enc)
      # succeeded in guess or ascii only
      tmp_rest = line.force_encoding(tmp_enc)
    else
      # ambiguous
      raise 'unexpected internal condition!'
    end
  else
    tmp_enc = enc.current
    tmp_rest = tmp_rest.force_encoding(tmp_enc)
    unless tmp_rest.valid_encoding?
      # illegal input
      $stdout = STDERR
      p 'parameters: '
      p '  status: ' + enc.status
      p '  option: ' + enc.option
      p '  current: ' + enc.current
      p enc.is_8bit
      p '             [' + line + ']'
      raise 'encoding is not consistent'
    end
  end

  while tmp_rest.length>0 do
    if    (tmp_rest =~ /\A(\\#{OpenPEsc}|\\#{ClosePEsc}|[^#{OpenP}#{CloseP}])*(#{OpenPEsc}|#{ClosePEsc})/o) # parenthis
      if $2 == OpenP
        depth += 1
        count += 1
      else
        depth -= 1
      end
      tmp_prev += $&
      tmp_rest = $'
    else
      tmp_prev += tmp_rest
      tmp_rest = ''
    end
    if depth<1
      break
    end
  end
  return depth, count, tmp_prev, tmp_rest
end

# PDFDocEncoding -> UTF-16BE
# Ref. "PDF Reference, Sixth Edition, version 1.7", 2006, Adobe Systems Incorporated
#   http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
#   http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
PDF2UNI = Array(0..255)
PDF2UNI[0o030..0o037] = 0x02d8, 0x02c7, 0x02c6, 0x02d9, 0x02dd, 0x02db, 0x02da, 0x02dc
PDF2UNI[0o200..0o207] = 0x2022, 0x2020, 0x2021, 0x2026, 0x2014, 0x2013, 0x0192, 0x2044
PDF2UNI[0o210..0o217] = 0x2039, 0x203a, 0x2212, 0x2030, 0x201e, 0x201c, 0x201d, 0x2018
PDF2UNI[0o220..0o227] = 0x2019, 0x201a, 0x2122, 0xfb01, 0xfb02, 0x0141, 0x0152, 0x0160
PDF2UNI[0o230..0o237] = 0x0178, 0x017d, 0x0131, 0x0142, 0x0153, 0x0161, 0x017e, 0xfffd
PDF2UNI[0o240       ] = 0x20ac
PDF2UNI[0o255       ] = 0xfffd

def conv_string_to_utf16be(line, enc)
  if line !~ /(#{OpenPEsc})(.*)(#{ClosePEsc})/mo
    raise 'illegal input!'
  end
  pre, tmp, post = $`, $2, $'

  if tmp.ascii_only? && tmp !~ /\\0x[0-9A-F]{4}/i
    return line
  end

  conv = ''
  conv.force_encoding('UTF-8')
  tmp.force_encoding(enc.current)

  while tmp.length>0 do
    case tmp
    when /\A[^\\\n\r]+/
      conv += $&.to_utf8(enc)
    when /\A\\([0-3][0-7][0-7])/  # PDFDocEncoding -> UTF-8
      conv += [PDF2UNI[$1.oct]].pack("U*")
    when /\A\\0x(D[8-B][0-9A-F]{2})\\0x(D[C-F][0-9A-F]{2})/i  # surrogate pair
      conv += [$1.hex, $2.hex].pack("n*").utf16be_to_utf8
    when /\A\\0x([0-9A-F]{4})/i
      conv += [$1.hex].pack("U*")
    when /\A\\[nrtbf\\]/
      conv += eval(%!"#{$&}"!)
    when /\A(\r\n|\r|\n)/
      conv += "\n"
    when /\A\\([\r\n]{1,2})|\\/
      # ignore
    else
      raise 'unexpected input!'
    end
    tmp = $'
  end

  buf = ''
  conv16be = "\xFE\xFF"            # BOM U+FEFF
  conv16be.force_encoding('UTF-16BE')
  conv16be += conv.utf8_to_utf16be # UTF-16BE with BOM
  conv16be.each_byte {|byte|
    buf += (Opts[:mode] == :out ? '\%03o' : '%02X') % byte
  }
  buf = Opts[:mode] == :out ? '{' + buf + '}' : '<' + buf + '>'
  return pre + buf + post
end

def special_string_to_utf8(line, enc)
  if line.ascii_only? || line !~ /\Axxx[1-4]/mo
    return line, 0
  end

  if line !~ /\Axxx(\d) (\d+) '(.*)'([^']*)\Z/mo
    raise 'illegal input!'
  end
  xxx, bytes, str, trail = $1.to_i, $2.to_i, $3, $4
  if str.bytesize != bytes
    raise 'byte size is not consistent!'
  end
  if str !~ /\A((PS|ps)file=|pdf:image |pdf:epdf )/mo
    return line, 0
  end

  conv = ''
  conv.force_encoding('UTF-8')

  os_legacy_encoding(enc)
  str.force_encoding(enc.current)
  str = str.to_utf8(enc)
  bytes_new = str.bytesize

  xxx_new = bytes_new <= 0xff ? 1 : 4
  conv = 'xxx' + xxx_new.to_s + ' ' + bytes_new.to_s + " '" + str + "'" + trail
  return conv, bytes_new - bytes + xxx_new - xxx
end

def dvi_post_post(line, offset)
  if line !~ /\Apost_post (\d+) ([23])(?: 223){4,7}\Z/mo
    raise 'illegal input!'
  end
  bytes, id = $1.to_i, $2
  padding = line.scan(' 223').count
  bytes += offset
  padding = (padding - offset) % 4 + 4
  line = 'post_post ' + bytes.to_s + ' ' + id + ' 223' * padding + "\n"
  return line
end

def file_treatment(ifile, ofile, enc)
  ifile.set_encoding('ASCII-8BIT')
  ofile.set_encoding('ASCII-8BIT')

  line, offset = '', 0
  while l = ifile.gets do
    line.force_encoding('ASCII-8BIT')
    line += l
    if    Opts[:mode] == :out then
      reg = %r!(\{)!
    elsif Opts[:mode] == :spc then
      reg = %r!(\A(xxx|post_post))!
    else
      reg = %r!(/Title|/Author|/Keywords|/Subject|/Creator|/Producer)(\s+\(|$)!
    end
    if (line !~ reg )
      ofile.print line
      line = ''
      next
    end

    if Opts[:mode] == :spc
      if (line =~ /\Axxx/)
        line, diff = special_string_to_utf8(line, enc)
        offset += diff
      else
        line = dvi_post_post(line, offset)
      end
      ofile.print line
      line = ''
      next
    end

    ofile.print $`
    line = $& + $'

    if Opts[:mode] != :out
      while line =~ %r!(/Title|/Author|/Keywords|/Subject|/Creator|/Producer)\Z! do
        line += ifile.gets
      end
    end

    if enc.status == 'guess'
      if tmp_enc = try_guess_encoding(line, enc)
        # succeeded in guess or ascii only
        line.force_encoding(tmp_enc)
      else
        # ambiguous
        next
      end
    end

    while line.length>0 do
      depth, count, tmp_prev, tmp_rest \
        = check_parentheses_balance(line, enc)
      if depth<0
        p depth, count, tmp_prev, tmp_rest
        raise 'illegal input! (depth<0)'
      elsif depth>0
        break
      elsif count==0
        ofile.print line
        line = ''
        break
      elsif count>0
        ofile.print conv_string_to_utf16be(tmp_prev, enc)
        line = tmp_rest
      else
        p depth, count, tmp_prev, tmp_rest
        raise 'illegal input! (count<0)'
      end
    end

  end

  if enc.status == 'guess' && enc.is_8bit
    raise 'did not succeed in guess encoding!'
  end
end


### main
if ARGV.size == 0
  ifile = STDIN.binmode
  ofile = STDOUT.binmode
  file_treatment(ifile, ofile, enc)
else
  ARGV.each {|fin|
    if (fin !~ /\.#{FileSfx}$/io)
      raise 'input file does not seem ' + FileSfx.upcase + ' file'
    end
    sfx = $&
    if (Opts[:mode] == :spc && fin =~ /\.dvi$/i)
      dvi_conversion = true
      fspc = fin.gsub(/\.dvi$/io, '.dvispc')
      if !(system Dvispc + ' -a ' + fin + ' ' + fspc)
        raise "fail to execute 'dvispc -a' command!"
      end
      fin = fspc
      sfx = '.dvispc'
    end
    fout = fin.gsub(/#{sfx}$/i, "-convbkmk#{sfx}")
    open(fin, 'rb') {|ifile|
      open(fout, 'wb') {|ofile|
        file_treatment(ifile, ofile, enc)
      }
    }
    if (Opts[:overwrite])
      FileUtils.mv(fout, fin)
      fout = fin
    end
    if dvi_conversion
      fdvi = fout.gsub(/\.dvispc$/o, '.dvi')
      if !(system Dvispc + ' -x ' + fout + ' ' + fdvi)
        raise "fail to execute 'dvispc -x' command!"
      end
    end
  }
end