Unix HOWTOs and Tips Short unix command line administration tips and scripts

30 Oct 2011

Simple PHP cyr2lat command-line transliteration filter from Bulgarian to Latin

Sometimes you need to convert Bulgarian Cyrillic text to its Latin equivalent (a process known as "romanization"; see Romanization of Bulgarian).

One possible use case is generating slugs for URLs that contain Bulgarian text.

Since this is a common task, it is very useful, in the best Unix tradition, to have a simple command-line filter: you pipe Cyrillic text into it and it writes the romanized version to its output.

Here is a simple version of such a filter, cyr2lat, written in PHP:

#!/usr/bin/env php
<?php
// Bulgarian Cyrillic letters: lowercase first, then uppercase.
$cyr = array('а','б','в','г','д','е','ж','з','и','й','к','л','м','н','о','п','р',
             'с','т','у','ф','х','ц','ч','ш','щ','ъ','ь','ю','я',
             'А','Б','В','Г','Д','Е','Ж','З','И','Й','К','Л','М','Н','О','П','Р',
             'С','Т','У','Ф','Х','Ц','Ч','Ш','Щ','Ъ','Ь','Ю','Я');
// Latin replacements, in exactly the same order as $cyr.
$lat = array('a','b','v','g','d','e','zh','z','i','y','k','l','m','n','o','p','r',
             's','t','u','f','h','ts','ch','sh','sht','a','y','yu','ya',
             'A','B','V','G','D','E','Zh','Z','I','Y','K','L','M','N','O','P','R',
             'S','T','U','F','H','Ts','Ch','Sh','Sht','A','Y','Yu','Ya');

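// Read stdin line by line and write the romanized text to stdout.
// Comparing against false keeps a final line such as "0" from ending the loop early.
$in = fopen("php://stdin", "r");
while (($line = fgets($in)) !== false) {
    echo str_replace($cyr, $lat, $line);
}
fclose($in);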

Source of cyr2lat.php
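
As a variation on the same idea, here is a minimal sketch that keeps each Cyrillic letter next to its Latin replacement in one associative array and passes it to strtr(), so the two lists cannot drift out of alignment. The function name cyr2lat() is hypothetical and not part of the script above; the mapping is the same one.

```php
<?php
// Sketch only: cyr2lat() is a hypothetical helper name, not part of the original script.
function cyr2lat(string $text): string
{
    // Same table as the original script, written as letter => replacement pairs.
    static $map = [
        'а' => 'a',  'б' => 'b',  'в' => 'v',  'г' => 'g',  'д' => 'd',
        'е' => 'e',  'ж' => 'zh', 'з' => 'z',  'и' => 'i',  'й' => 'y',
        'к' => 'k',  'л' => 'l',  'м' => 'm',  'н' => 'n',  'о' => 'o',
        'п' => 'p',  'р' => 'r',  'с' => 's',  'т' => 't',  'у' => 'u',
        'ф' => 'f',  'х' => 'h',  'ц' => 'ts', 'ч' => 'ch', 'ш' => 'sh',
        'щ' => 'sht','ъ' => 'a',  'ь' => 'y',  'ю' => 'yu', 'я' => 'ya',
        'А' => 'A',  'Б' => 'B',  'В' => 'V',  'Г' => 'G',  'Д' => 'D',
        'Е' => 'E',  'Ж' => 'Zh', 'З' => 'Z',  'И' => 'I',  'Й' => 'Y',
        'К' => 'K',  'Л' => 'L',  'М' => 'M',  'Н' => 'N',  'О' => 'O',
        'П' => 'P',  'Р' => 'R',  'С' => 'S',  'Т' => 'T',  'У' => 'U',
        'Ф' => 'F',  'Х' => 'H',  'Ц' => 'Ts', 'Ч' => 'Ch', 'Ш' => 'Sh',
        'Щ' => 'Sht','Ъ' => 'A',  'Ь' => 'Y',  'Ю' => 'Yu', 'Я' => 'Ya',
    ];
    // With an array argument, strtr() replaces every key in a single pass
    // and never re-scans text it has already replaced.
    return strtr($text, $map);
}

echo cyr2lat("Това е текст на кирилица"), "\n"; // Tova e tekst na kirilitsa
```

For this particular table both approaches give the same result, because none of the Latin replacements contains a Cyrillic letter; the associative form is mainly easier to read and extend.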

To use it, save it to a file named cyr2lat.php, then make it executable:

chmod 755 cyr2lat.php

... and possibly move it to a location in your path:

mv cyr2lat.php /usr/local/bin

or

mv cyr2lat.php ~/bin

After this you can run, for example:

echo "Това е текст на кирилица" | cyr2lat.php

and you will get:

Tova e tekst na kirilitsa
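
For the URL slug use case mentioned at the start, the romanized text still needs to be lowercased and stripped of spaces and punctuation. The following is a rough sketch of how that could look; bg_slug() is a hypothetical helper name, and the slug rules (lowercase a-z and 0-9, with hyphens between words) are an assumption rather than something the filter itself does.

```php
<?php
// Sketch only: bg_slug() is a hypothetical helper, not part of cyr2lat.php.
function bg_slug(string $text): string
{
    $lower = ['а','б','в','г','д','е','ж','з','и','й','к','л','м','н','о','п','р',
              'с','т','у','ф','х','ц','ч','ш','щ','ъ','ь','ю','я'];
    $upper = ['А','Б','В','Г','Д','Е','Ж','З','И','Й','К','Л','М','Н','О','П','Р',
              'С','Т','У','Ф','Х','Ц','Ч','Ш','Щ','Ъ','Ь','Ю','Я'];
    $lat   = ['a','b','v','g','d','e','zh','z','i','y','k','l','m','n','o','p','r',
              's','t','u','f','h','ts','ch','sh','sht','a','y','yu','ya'];

    // Map both cases straight to lowercase Latin, then lowercase any
    // Latin capitals that were already present in the input.
    $ascii = str_replace(array_merge($lower, $upper), array_merge($lat, $lat), $text);
    $ascii = strtolower($ascii);

    // Collapse every run of characters other than a-z and 0-9 into one hyphen,
    // then drop hyphens left at either end.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);
    return trim($slug, '-');
}

echo bg_slug("Това е текст на кирилица"), "\n"; // tova-e-tekst-na-kirilitsa
```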

 

NB: This filter assumes that the input text is encoded in UTF-8. If your input is in the cp1251 encoding, pipe it through iconv first. This only makes sense if the text reaching iconv really is cp1251, for example when it comes from a cp1251-encoded file or terminal:

echo "Това е пак текст на кирилица, но този път с кодировка cp1251" | iconv -f cp1251 -t utf-8 | cyr2lat.php
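
If you would rather handle the encoding inside PHP, a sketch along the following lines might work. It relies on PHP's iconv() function (the iconv extension must be available), and to_utf8() is a hypothetical helper name. It also makes a strong assumption: any input that is not valid UTF-8 is treated as cp1251, which is a heuristic and not real encoding detection.

```php
<?php
// Sketch only: to_utf8() is a hypothetical helper, not part of cyr2lat.php.
// Assumption: input that is not valid UTF-8 is cp1251.
function to_utf8(string $text): string
{
    // With the /u modifier, preg_match() does not return 1 for a subject
    // that is not valid UTF-8, so this acts as a validity check.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    $converted = iconv('CP1251', 'UTF-8', $text);
    // iconv() returns false on failure; in that case hand back the input unchanged.
    return $converted === false ? $text : $converted;
}
```

Inside the filter's read loop, each line could then be passed through to_utf8() before the str_replace() call.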