Archive for June, 2009

Back references in Python

June 15, 2009

Back references in regular expressions in Python work just like you’d expect them to. For example, to break up hyphenated words in a text (but keep all other hyphens), do this:


>>> import re
>>> pat = re.compile(r'([A-Za-z]+)-([A-Za-z]+)')
>>> s = 'This is a te- test of back-references in Python.'
>>> re.sub(pat, r'\1 \2', s)
'This is a te- test of back references in Python.'
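
As a small variation on the session above, the same substitution can be wrapped in a function that uses the compiled pattern's own sub method. This is only a sketch; the names HYPHENATED and dehyphenate are illustrative choices of mine, not part of the re module.

```python
import re

# Same pattern as in the session above: letters, a hyphen, then letters.
HYPHENATED = re.compile(r'([A-Za-z]+)-([A-Za-z]+)')

def dehyphenate(text):
    # \1 and \2 refer back to the first and second parenthesized groups.
    return HYPHENATED.sub(r'\1 \2', text)

print(dehyphenate('This is a te- test of back-references in Python.'))
```

The hyphen in "te- " survives because it is not directly followed by a letter. One caveat: matches do not overlap, so a word with several hyphens, such as state-of-the-art, only loses every other hyphen in a single pass.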

Single quotes and HTK

June 12, 2009

(Sorry in advance for the stupid “smart quotes” in this post. I haven’t yet figured out how to fix this in WordPress.)

In transcribing naturally occurring speech (in my case, sociolinguistic interviews) for the purpose of forced alignment, I need to decide what to do with clitics and other similarly reduced words. A common orthographic convention is to start the transcription with a single quote, as in 'em for reduced them, 'im for reduced him, 'cause for reduced because, 'til for reduced until, etc. Simply providing transcriptions like this as input to HTK (I’m using version 3.4) doesn’t work, though, because HShell treats any token beginning with a single quote (or a double quote) as the start of a quoted string and processes it differently. So, if I include, for example, 'EM in my transcription file, along with the corresponding entry:

EM AH0 M

in my pronouncing dictionary, I get the following HShell error while trying to run HVite to do forced alignment:

ERROR [+5013] ReadString: File end within quoted string
FATAL ERROR - Terminating program HVite

One solution would be to simply not transcribe such tokens with initial single quotes, and have dictionary entries like:

CUZ K AH0 Z
EM AH0 M

These conventions would work fine for my own use, but it would be cumbersome to train transcribers in our group project to behave consistently in this regard. If they accidentally used a transcription with an initial single quote, the forced alignment would fail.

A more flexible solution is to escape the initial single quote in both the transcription and the dictionary. In the transcription, this can be done automatically in pre-processing before sending the file to HTK for forced alignment, assuming that there aren’t any tokens that are actually supposed to be quoted. The way to do this in HTK is to put a backslash (\) before each initial single quote that is supposed to be escaped. So, the transcription of cliticized them would be \'em, and the corresponding dictionary entry would be:

\'EM AH0 M

Following these conventions enables transcribers to continue to transcribe reduced tokens in a way that is natural to them without breaking the forced alignment in HTK.
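
The automatic pre-processing step mentioned above could look something like the following. It is a minimal sketch that assumes tokens are separated by whitespace and that no token is meant to be a real quoted string; the function name escape_initial_quotes is hypothetical, and only single quotes are handled.

```python
import re

# A single quote at the start of a token, i.e. not preceded by a
# non-whitespace character. (Assumes whitespace-separated tokens.)
_INITIAL_QUOTE = re.compile(r"(?<!\S)'")

def escape_initial_quotes(line):
    # Hypothetical helper: put a backslash before each token-initial
    # single quote so that HTK does not read the token as a quoted string.
    return _INITIAL_QUOTE.sub(r"\\'", line)

print(escape_initial_quotes("I TOLD 'EM 'CAUSE IT'S TRUE"))
```

Word-internal apostrophes, as in IT'S, are left alone, and a token that is already escaped is not escaped a second time, because its quote is preceded by the backslash.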

Using HTK 3.4.1 on Mac OS 10.5

June 12, 2009

The two problems that I encountered trying to compile HTK 3.4 on my MacBook (OS 10.5.7, 64-bit Intel Core 2 Duo) are now fixed in HTK 3.4.1; this newer version compiles successfully with no modifications. What makes it work is the addition of the following lines to the configure file:

i386*darwin*)
CFLAGS="-ansi -g -O2 -DNO_AUDIO -D'ARCH=\"darwin\"' -I/usr/include/malloc $CFLAGS"
LDFLAGS="-L/usr/X11R6/lib $LDFLAGS"
ARCH=darwin
Objcopy=echo
PRILF="-x"
CPU=darwin
SHRLF="-shared"
LIBEXT=dylib
LIBEXT=dylib
;;

(See this post for the solutions I was using earlier.)

However, something has changed between HTK 3.4 and HTK 3.4.1, and now I’m not able to do forced alignment with the newer version. The following HVite command:

$ HVite -T 1 -a -m -I tmp/tmp.mlf -H model/macros -H model/hmmdefs -S ./tmp/test.scp -i tmp/aligned.mlf tmp/dict model/monophones

works for me in HTK 3.4 to do forced alignment, but produces this error in HTK 3.4.1:

ERROR [+8522] LatFromPaths: Align have dur<=0
FATAL ERROR - Terminating program HVite

I haven’t had time to look into this error yet to see what the problem is. My solution for the time being has been to continue using HTK 3.4 for doing forced alignment.

Compiling HTK 3.4 on Mac OS 10.5

June 12, 2009

HTK 3.4 does not compile out of the box on my MacBook (OS 10.5.7, 64-bit Intel Core 2 Duo). There are two problems: one in the configure file and one in the file HTKLib/strarr.c.

1) After unpacking the source code and running:

$ ./configure
$ make all

I get the following error during compilation:

gcc  -Wall -Wno-switch -g -O2 -I.    -c -o esignal.o esignal.c
esignal.c: In function ‘ReadHeader’:
esignal.c:974: error: ‘ARCH’ undeclared (first use in this function)
esignal.c:974: error: (Each undeclared identifier is reported only once
esignal.c:974: error: for each function it appears in.)
esignal.c: In function ‘WriteHeader’:
esignal.c:1184: error: ‘ARCH’ undeclared (first use in this function)
make[1]: *** [esignal.o] Error 1
make: *** [HTKLib/HTKLib.a] Error 1

After looking into the configure file, I see that the variable ARCH should be defined for my system on line 4983. However, this code isn’t executed, because the host variable isn’t being set. My solution was to add the following code:

i386)
host=darwin
trad_bin_dir=$host
;;

to the case "$host_cpu" in statement on line 4937.

2) After making this change and re-running:

$ ./configure
$ make all

I get the following error:

gcc -ansi -g -O2 -DNO_AUDIO -D'ARCH="darwin"' -Wall -Wno-switch -g -O2 -I. -c -o strarr.o strarr.c
strarr.c:21:20: error: malloc.h: No such file or directory
make[1]: *** [strarr.o] Error 1
make: *** [HTKLib/HTKLib.a] Error 1

To fix this bug, I changed line 21 of HTKLib/strarr.c to:

#include <malloc/malloc.h>

After making this change, compilation completed successfully, and HTK was ready to be installed and used on my system.