Unix HOWTOs and Tips Short unix command line administration tips and scripts

30Oct/110

Simple PHP cyr2lat command line transliteration filter from Bulgarian to Latin

Sometimes you need to convert Bulgarian Cyrillic text to its Latin equivalent (a process known as "romanization"; see Romanization of Bulgarian).

A typical use case is generating URL slugs from Bulgarian text.

Since this is a common task, it is very useful, in the best Unix tradition, to have a simple command line filter that reads Cyrillic text from standard input and writes the romanized version to standard output.

Here is a simple version of such a filter, cyr2lat, written in PHP:

#!/usr/bin/env php
<?php
$cyr  = array('а','б','в','г','д','е','ж','з','и','й','к','л','м','н','о','п','р',
              'с','т','у','ф','х','ц','ч','ш','щ','ъ','ь','ю','я',
              'А','Б','В','Г','Д','Е','Ж','З','И','Й','К','Л','М','Н','О','П','Р',
              'С','Т','У','Ф','Х','Ц','Ч','Ш','Щ','Ъ','Ь', 'Ю','Я' );
$lat = array( 'a','b','v','g','d','e','zh','z','i','y','k','l','m','n','o','p','r',
              's','t','u','f' ,'h' ,'ts' ,'ch','sh' ,'sht' ,'a' ,'y' ,'yu','ya',
              'A','B','V','G','D','E','Zh','Z','I','Y','K','L','M','N','O','P','R',
              'S','T','U','F' ,'H' ,'Ts' ,'Ch','Sh' ,'Sht' ,'A' ,'Y' ,'Yu' ,'Ya' );

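// Read standard input line by line.
$in = fopen('php://stdin', 'r');
// Compare against false explicitly, so that a final line containing only "0" is not dropped.
while (($line = fgets($in)) !== false) {
    // Replace each Cyrillic letter with the Latin string at the same array index.
    echo str_replace($cyr, $lat, $line);
}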

Source of cyr2lat.php

To use it, save it to a file named cyr2lat.php, then make the script executable:

chmod 755 cyr2lat.php

... and optionally move it to a directory in your PATH (writing to /usr/local/bin usually requires root privileges):

mv cyr2lat.php /usr/local/bin

or

mv cyr2lat.php ~/bin

After this, you can run, for example:

echo "Това е текст на кирилица" | cyr2lat.php

and you will get:

Tova e tekst na kirilitsa
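For the URL slug use case mentioned above, the romanized output can be piped through one more small filter. The following is only a sketch: the function name slugify is hypothetical, and the slug rules (lowercase, hyphens between words) are an assumption, not something cyr2lat itself provides.

```shell
# slugify: hypothetical helper that reads romanized (ASCII) text on stdin
# and prints a lowercase, hyphen-separated slug for each input line.
slugify() {
    # Lowercase, turn every run of non-alphanumerics into one hyphen,
    # then strip a leading or trailing hyphen.
    LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C sed -e 's/[^a-z0-9][^a-z0-9]*/-/g' -e 's/^-//' -e 's/-$//'
}
```

With cyr2lat.php in your path, running echo "Това е текст на кирилица" | cyr2lat.php | slugify should give tova-e-tekst-na-kirilitsa.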

 

NB: This filter assumes that the input text is encoded in UTF-8. If your input is in the CP1251 (Windows-1251) encoding, pipe it through iconv first, like this:

echo "Това е пак текст на кирилица, но този път с кодировка cp1251" | iconv -f cp1251 -t utf-8 | cyr2lat.php
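If you do not know in advance which of the two encodings a file uses, you can let iconv guess. The sketch below uses a hypothetical helper named to_utf8 and a simple heuristic: it assumes that any file which is not valid UTF-8 is CP1251. That is good enough for mixed Bulgarian content, but it is not a general encoding detector.

```shell
# to_utf8: hypothetical helper that prints FILE as UTF-8.
# Heuristic: input that is not valid UTF-8 is assumed to be CP1251.
to_utf8() {
    # iconv exits with a non-zero status when the input is not valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        cat "$1"
    else
        iconv -f CP1251 -t UTF-8 "$1"
    fi
}
```

You could then run to_utf8 FILENAME | cyr2lat.php regardless of which encoding the file was saved in.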

 

6Mar/110

Using jed with different encodings – utf8 and cp1251.

Maintaining websites over remote ssh connections requires a console editor that starts fast, does not need huge amounts of memory, and is easy to use.

The powerful editors Vim and Emacs are not ideal for this task. Vim is virtually guaranteed to be installed, is fast, and can be customized, but it is hardly easy to learn. Emacs may be easier, since it has menus and a tutorial, and it is very customizable, but it is not very fast to start.

In contrast, jed is a very lightweight editor that is ideal for quick changes: it starts instantly, supports syntax highlighting for many common programming languages, and has good UTF-8 support. It also has an Emacs-like macro recorder, buffers, and auto-indenting. It supports Emacs keyboard shortcuts by default, so if you use Emacs for longer programming sessions, you will probably like jed when you need to quickly edit some remote files.

Unfortunately, Bulgarian texts (and sites) are commonly written in two different encodings, UTF-8 and CP1251, so there is a catch: your programs (and jed in particular) need some tweaks to support both equally well.

To edit UTF-8 files with jed, you first need to have the Bulgarian UTF-8 locale generated.
You can do this with the following command:

sudo localedef -i bg_BG -f UTF-8 bg_BG.UTF-8
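To check whether the locale is already available before generating it, you can look it up in the output of locale -a. The function name has_locale below is hypothetical. The sketch normalizes case and hyphens, because locale -a typically lists the name as bg_BG.utf8 rather than bg_BG.UTF-8.

```shell
# has_locale: hypothetical helper, succeeds if the named locale is installed.
has_locale() {
    # Normalize spelling: lowercase and drop hyphens (UTF-8 becomes utf8).
    want=$(printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    # Compare against every installed locale name, normalized the same way.
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/-//g' \
        | grep -qxF "$want"
}
```

For example, has_locale bg_BG.UTF-8 && echo "locale is installed" prints the message only when the locale has already been generated.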

You can install jed and some jed goodies with the following command (Debian/Ubuntu):

sudo apt-get install jed jed-extra jedstate

For Red Hat/CentOS, search the web for a jed RPM, or try:

wget ftp://ftp.pbone.net/mirror/centos.karan.org/el5/extras/testing/i386/RPMS/jed-0.99.18-5.el5.kb.i386.rpm
rpm -i jed-0.99.18-5.el5.kb.i386.rpm

Next, add these lines to your ~/.bashrc file:

alias jed.cp1251='LANG=C jed'
alias jed.utf8='LANG=bg_BG.UTF-8 jed'

Restart your shell, or run:

source ~/.bashrc

in your current shell to apply the changes.

After all this preparation, when you want to edit a UTF-8 encoded file, open it like this:

jed.utf8 FILENAME

For CP1251 (Windows-1251), the command is:

jed.cp1251 FILENAME
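If you prefer a single command that takes the encoding as an argument instead of two aliases, the same LANG settings can be wrapped in shell functions. This is only a sketch of an alternative: the names jed_lang and jede are hypothetical, and the mapping simply mirrors the two aliases above.

```shell
# jed_lang: hypothetical helper that maps an encoding name to the LANG
# value used by the corresponding alias above.
jed_lang() {
    case "$1" in
        utf8|utf-8|UTF-8)                 echo "bg_BG.UTF-8" ;;
        cp1251|CP1251|windows-1251)       echo "C" ;;
        *) echo "unknown encoding: $1" >&2; return 1 ;;
    esac
}

# jede: hypothetical wrapper, used as: jede ENCODING FILENAME
jede() {
    _jed_lang=$(jed_lang "$1") || return 1
    shift
    LANG=$_jed_lang jed "$@"
}
```

With these in your ~/.bashrc, jede utf8 FILENAME behaves like jed.utf8 FILENAME, and jede cp1251 FILENAME like jed.cp1251 FILENAME.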

 

Note that the encoding selected in your terminal application must match the encoding used by your editor, or else you will see some funny characters instead of the proper ones. For this and other reasons, I recommend KDE's Konsole. It is stable, has good internationalization support, and is pretty fast, even with a large scroll buffer.

You can also use PuTTY in case you are trapped in a Windows-only shop :-) .