After reading a scriptum about neural networks I wanted to get my hands on them. So I wrote some rather simple implementations of Hopfield networks [1], multi-layer-perceptrons [2] and last but not least (serial) Self-Organizing Maps [3] (SOM).
As the Hopfield nets and multi-layer perceptrons worked fine with some artificial test data (hand-crafted patterns for the Hopfield nets and a simple
function for the MLP), I needed some test data for the SOMs. The idea was to cartograph my iTunes library. So I used categorical data stored in the library as input vectors for the SOM. It turned out that the data was not suited for this purpose, as it was to evenly distributed and not significant enough to identify meaningful clusters on the resulting U-Matrix [4].
function for the MLP), I needed some test data for the SOMs. The idea was to cartograph my iTunes library. So I used categorical data stored in the library as input vectors for the SOM. It turned out that the data was not suited for this purpose, as it was to evenly distributed and not significant enough to identify meaningful clusters on the resulting U-Matrix [4].
After searching for a more usefull way of extracting feature vectors from my music collection I eventually decided to go with so called “Mel frequency cepstral coefficients” [5].
According to Wikipedia the mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another.(Mel scale [6]). Actually converting “normal” frequency into Mel frequency is pretty easy:
.
According to Wikipedia the mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another.(Mel scale [6]). Actually converting “normal” frequency into Mel frequency is pretty easy:

I did not find a suitable implementation for calculating those MFCCs, so I wrote my own:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 | /* * stdin: WAV, Mono-16bit-44100 mit Standard 48-byte WAV-Header * stdout: dito */ #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <math.h> #include <assert.h> #include <fftw3.h> /** * The range of the fiter bank in Hz */ #define F_MIN 20.0 #define F_MAX 11025.0 /** * Basically N is the window size in samples. As the WAV we're reading is supposed * to have a sample size of 44.1kHz and we want a window size of 0.1sec, so we set * this to 4410 (Hz). */ #define N 4410 #define N2_p1 2206 #define N2_m1 2204 #define SAMPLE_RATE 44100.0 #define SAMPLE_RATE_DIV_N 10.0 /** * The amount of filters in the MEL filter bank */ #define M_NUMCH 24 #define M_COEFF 20 double melScaleInverse(double f) { return 700 * (pow(10, f / 2595.0) - 1); } double melScale(double f) { return 2595 * log10(1 + (f / 700.0)); } void buildFilterBank(double* filterBank) { int i = 0; double melDelta = (melScale(F_MAX) - melScale(F_MIN)) / (M_NUMCH + 1); for(i = 0; i < M_NUMCH; i++) { filterBank[i] = melScaleInverse(melDelta * i); } } double filter(double* f /*filterBank*/, int m /* filterIndex */, double f_k) { if(f_k < f[m - 1] || f_k >= f[m + 1]) { return 0; } else if(f[m - 1] <= f_k && f_k < f[m]) { return (f_k - f[m - 1]) / (f[m] - f[m - 1]); } else if(f[m] <= f_k && f_k < f[m + 1]) { return (f_k - f[m + 1]) / (f[m] - f[m + 1]); } else { return 0; } } void computeFFT() { int i, k, nread; uint16_t *iobuf; double *in; double ONE_DIV_N; double filterBank[M_NUMCH]; fftw_complex *out; fftw_plan plan; ONE_DIV_N = 1.0 / N; iobuf = calloc(sizeof(uint16_t), N); in = (double*) fftw_malloc(sizeof(double) * N); out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N); plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_MEASURE); buildFilterBank(filterBank); /* pass thru the 48 byte wav-header */ fread (iobuf, sizeof(uint16_t), 24, stdin); double c_ls[N2_m1]; double c_mf[M_NUMCH]; double c_mc[M_COEFF]; while( (nread = fread(iobuf, sizeof(uint16_t), N, stdin)) == N ) { for(i = 0; i < N; i++) { in[i] = iobuf[i]; } fftw_execute(plan); double result_re = 0; double result_im = 0; for(k = 0; k < N2_m1; k++) { c_ls[k] = sqrt(((out[k][0] / N) * (out[k][0] / N)) + (out[k][1] * out[k][1])); } for(i = 0; i < M_NUMCH; i++) { c_mf[i] = 0; for(k = 0; k < N2_m1; k++) { c_mf[i] += c_ls[k] * filter(filterBank, i, k * SAMPLE_RATE_DIV_N); } } for(k = 0; k < M_COEFF; k++) { c_mc[k] = 0; for(i = 0; i < M_NUMCH; i++) { if(c_mf[i] != 0) c_mc[k] += log(c_mf[i]) * cos(k * (M_PI / M_NUMCH) * (i - 0.5)); } } fwrite(c_mc, sizeof(double), M_COEFF, stdout); } free(in); free(iobuf); } int main (int argc, char *argv[]) { computeFFT(); return 0; } |
The implementation above reads raw WAV from the stdin and calculates the MFCCs from them. It’s called using
1 | lame --decode -m m -t $filename | ./mfcc > $filename.features |
and works as follows:
- Read the data using a rectangular 10ms window (usually a Hamming window [7] or similiar is used). This results in 4410 samples (16 bit wide) as the WAV is supposed to be sampled in 44.1kHz.
- Run FFT [8] on the acquired samples – here we use the FFTW [9] library. Then take the square of the transformation results: if
is the
th transformation result we use
. - Scale the result of step 2 using the Mel scale filter bank. The one I used looks like this (this presentation [10] helped me alot for implementing the filter bank):
We’ll call the Mel transformed value
. - Finally compute the
th coefficient using a discrete cosine transformation:
where
is the dimensionality of the resulting feature vector. - Write the feature vector
to the stdout as 8 byte wide little endian IEEE 754 double precision floats.
Running the programming above on my music collection (roughly 28 gigabytes) took roughly 24 hours and resulted in 487747560 feature vectors. That leaves me with to questions:
- How good is the quality of those feature vectors?
- How do I process such a vast amount of data?
In order to get an idea of the quality (in the context of feeding them into a SOM) I used some simple statistical methods. As the SOM estimates the propability density function of the samples, I had a closer look at the distribution of those vectors in the
.
The idea is to compute the mean vector using
, compute the eucledian distance (as this is the measure of distance a SOM implementation would use) of each vector
to
:
. Finally we take the mean of the
(
) and calculate the
. Those
drawn plotted in a histogram look like this:
.The idea is to compute the mean vector using
, compute the eucledian distance (as this is the measure of distance a SOM implementation would use) of each vector
to
:
. Finally we take the mean of the
(
) and calculate the
. Those
drawn plotted in a histogram look like this:
The image above leaves us with two possible conclusions:
- The music I listen two is pretty much all the same
- The feature vectors are not really of good quality
Personally I’d go with the later one as my the music in my library is not that much the same. One way to improve the quality of those feature vectors could be use the first and second derivation of the cebstrum coefficients as proposed here [11] or to use a Hamming window for reading the data.
None the less, once I finally make it and finish a working batch SOM [12] (and this paper [13]) implementation I’ll do another post describing the results. Or I might simply abandon this project as the semester starts in a few days and I might run out of time.
References
- [1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences of the USA, vol. 79 no. 8 pp. 2554-2558, April 1982.: http://www.pnas.org/content/79/8/2554.abstract
- [2] Rosenblatt, Frank (1958): The perceptron : a probabilistic model for information storage and organization in the brain. Psychological Reviews 65 (1958) 386-408: http://www.citeulike.org/user/kira/article/2051038
- [3] Wikipedia: Self Organizing Map: http://en.wikipedia.org/wiki/Self_organizing_map
- [4] Jaakko Hollmén, 1996: Using the Self-Organizing Map: http://www.cis.hut.fi/jhollmen/dippa/node24.html
- [5] Beth Logan: Mel frequency cepstral coefficients for musical modeling: http://ciir.cs.umass.edu/music2000/papers/logan_paper.pdf
- [6] Wikipedia: Mel scale: http://en.wikipedia.org/wiki/Mel_scale
- [7] Wikipedia: Hamming window: http://en.wikipedia.org/wiki/Window_function#Hamming_window
- [8] FFTW documentation: The 1d Real-data DFT: http://www.fftw.org/fftw3_doc/The-1d-Real_002ddata-DFT.html#The-1d-Real_002ddata-DFT
- [9] Homepage: Fastest Fourier Transform in the West: http://www.fftw.org/
- [10] Stefan Scherer: MFCC Implementierung: http://www.informatik.uni-ulm.de/ni/Lehre/SS09/PraktikumNI/MFCC.pdf
- [11] Lehrbuch zur Mustererkennung: http://www5.informatik.uni-erlangen.de/fileadmin/Persons/NiemannHeinrich/klassifikation-von-mustern/m00-www.pdf
- [12] Teuvo Kohonen, et. al.: Self organization of a massive document collection (2000): http://130.203.133.121:8080/viewdoc/summary;jsessionid=73500594ABCEE7205702FB7CB3DBF8AB?doi=10.1.1.32.2117
- [13] R. D. Lawrence, et. al: A Scalable Parallel Algorithm for Self-Organizing Maps with Applicationsto Sparse Data Mining Problems: http://portal.acm.org/citation.cfm?id=593480
