28th September, 2011
libexttextcat: text guessing feature
LibreOffice inherited a text language guesser, based on textcat from wise-guys.nl and extended by Jocelyn Merand to basically handle UTF-8 text. This is the thing that suggests what language your text might really be in when you right-click on some misspelled text and choose Set Language.
We’ve now spun this off as a standalone libexttextcat. Along the way we fixed up some conversion problems from the original selection of 8-bit encodings and generated new language fingerprints in other cases. This should give better results for various languages, and allows us to enable checking for some languages which were disabled until now.
The current list of languages it attempts to detect can be seen here
Here’s a plausible process for adding your favourite language to it, starting from git clone git://anongit.freedesktop.org/libreoffice/libexttextcat and bootstrapping from the insanely-translated UDHR (Universal Declaration of Human Rights), using Abkhaz as an example.
cd libexttextcat/langclass/ShortTexts/
wget http://unicode.org/udhr/d/udhr_abk.txt
# skip the English header, name the result using its BCP-47 tag
tail -n+7 udhr_abk.txt > ab.txt
cd ../LM
../../src/createfp < ../ShortTexts/ab.txt > ab.lm
echo ab.lm ab--utf8 >> ../fpdb.conf
Then update the check target in src/Makefile.am and run make check to confirm that ShortTexts/ab.txt is detected as ab.
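The download-and-fingerprint steps above could be wrapped in a small helper so the same recipe can be repeated for other languages. This is only a sketch: add_language and fpdb_entry are made-up names, it assumes you run it from the top of the libexttextcat checkout with src/createfp already built, and it assumes other UDHR files and fpdb.conf entries follow the same pattern as the Abkhaz example (which may not hold for every language).

```shell
#!/bin/sh
# Sketch only: generalizes the Abkhaz steps; not part of libexttextcat itself.

# Build the fpdb.conf line for a BCP-47 tag, e.g. "ab" -> "ab.lm ab--utf8".
# Assumption: other languages use the same "<tag>--utf8" form as the example.
fpdb_entry() {
    printf '%s.lm %s--utf8\n' "$1" "$1"
}

# add_language TAG UDHR_CODE [HEADER_LINES]
# TAG is the BCP-47 tag (ab), UDHR_CODE the code used in the UDHR file name (abk).
add_language() {
    tag=$1
    udhr=$2
    header=${3:-6}   # the example's "tail -n+7" skips 6 header lines
    (
        cd langclass/ShortTexts || exit 1
        # Assumption: the URL pattern of the example holds for other codes.
        wget "http://unicode.org/udhr/d/udhr_${udhr}.txt" || exit 1
        tail -n "+$((header + 1))" "udhr_${udhr}.txt" > "${tag}.txt" || exit 1
        cd ../LM || exit 1
        ../../src/createfp < "../ShortTexts/${tag}.txt" > "${tag}.lm" || exit 1
        fpdb_entry "$tag" >> ../fpdb.conf
    )
}

# Example (not run here): add_language ab abk
```

The header length is a parameter because the number of lines to skip is worth checking by eye for each downloaded file before trusting the resulting fingerprint.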
I’ll remove the necessity of a configuration file in a later version, and convert the result to a BCP-47 tag. For the moment it remains a drop-in replacement for the original solution, which necessitates retaining the slightly odd language tag syntax.
Posted at 3:10 pm | Comments Off
26th September, 2011
Recording presentations
I have had a need to record lecture presentations. To this end I’ve hacked up some software which takes (a) a feed from a webcam and (b) a PDF/ODF presentation, and combines them into a WebM file.
For the code:
git clone git://gitorious.org/lecturec/lecturec.git
A simple overview:
Posted at 3:23 pm | Comments Off
20th September, 2011
git, really nifty after all
Maybe there’s something to the cult-of-git after all. vcl/unx/source/fontmanager/fontcache.cxx had some code which painstakingly constructed a string, only to do nothing with it. Clearly at some time in the past it was used, so when did its use go away? This is a file which has been moved around from place to place over the years, hmm, potentially tricky to scratch the itch of knowing when it happened? Not at all…
git log --follow --oneline -S'suspiciously missing variable' /path/to/file.cxx
and 2 seconds later I have a list of 5 commits, and there it is at the top of the list. Back in 2005, a rework of the font cache optimized out the stat on a file, while the constructed path to the file remained. No undetected nightmare merge bug then, just a missed micro-optimization opportunity.
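To try the same trick without a large checkout to hand, here is a rough sketch that builds a throwaway repository, moves a file, drops a use of a string, and then runs the same kind of search. The file names, commit messages and the searched string are all invented for the demo, and pickaxe_demo is a made-up function name.

```shell
#!/bin/sh
# Sketch: reproduce a "when did this stop being used?" search in a scratch repo.

pickaxe_demo() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q . || exit 1
        git config user.name "Demo"
        git config user.email "demo@example.invalid"

        # 1. The file starts life in one directory, with the string in use.
        mkdir old
        printf 'path = build_path();\nstat(path);\n' > old/cache.c
        git add old/cache.c
        git commit -q -m "add cache with stat call"

        # 2. It gets moved, content untouched.
        mkdir new
        git mv old/cache.c new/cache.c
        git commit -q -m "move cache to new directory"

        # 3. The use of the string goes away.
        printf 'path = build_path();\n' > new/cache.c
        git commit -q -a -m "drop stat call"

        # -S lists commits that change the number of occurrences of the
        # string; --follow keeps tracking the file across the rename.
        git log --follow --oneline -S'stat(path)' -- new/cache.c
    )
    rm -rf "$repo"
}
```

Calling pickaxe_demo should list the commit that dropped the call and the one that introduced it, and skip the pure move in between, since that commit leaves the occurrence count unchanged.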
Posted at 12:57 am | Comments Off