## Research Log – Jedi’s Revenge 2

This entry makes up for yesterday's missing log.

I did not fall asleep until 2:30 a.m. and got up at 6:50 a.m. One possible reason is that I needed to collect my reserved iPhone at IFC, but I think the real reason is that I missed her too much ==

But the short sleep did not hurt my efficiency. I tested the algorithm at a larger scale, say 100 ads' audio tracks. The result is magically right with a clip length of 3 seconds. One defect is that MATLAB's processing time is unacceptable: 47 seconds to process against a database of only size 20! I spent a lot of time trying to reduce it, but it proved to be MATLAB's problem: the programming is so high-level that the computation is not optimized.

So I implemented a Java version (although Java may not be as fast as C, it is at least 100 times faster than MATLAB). The result is terrible. I am still looking for the reason.

P.S. I was really happy to FaceTime my girlfriend with my new iPhone 5 ^^

## Research Log – Jedi’s Revenge

Today is a remarkable day.

The whole morning I was still working on the AIM (Auditory Image Model) thing. I tried again and again to understand the following MATLAB package:

aim 2006

There are several steps to generate the final image that I want to use in place of the spectrogram: PCP, BMM, strobe detection, and then the final SAI. The package is one monolithic thing that is hard to disassemble. It finally proved to be idle work.

At some moment in the afternoon, I figured out that, even if the whole thing is not nonsense, fewer than twenty people could fully understand what these Cambridge geniuses are talking about. So I shifted my focus. My dear colleague told me that he did some work to improve the system last week. His previous approach was to find the little red lines in the spectrogram and use those lines as fingerprints. The recognition was good, but the number of fingerprints exploded and the processing time became unacceptable. He followed my suggestion on the Mel frequency scale, which reduces the previous 4000 frequencies to only 64 recognizable bands. After that, mark the peaks in each time slot and use the 9 closest peaks as the fingerprint for that time slot.
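The 4000-to-64 reduction can be sketched as follows. This is my own minimal illustration, not my colleague's code: the pooling here just takes the maximum magnitude per mel band instead of proper triangular mel filters, and the names and parameters are assumptions.

```java
// Sketch: collapse linear FFT bins into 64 mel-spaced bands.
// Simple max-pooling per band; a real implementation would use
// triangular mel filters.
public class MelReduce {
    static double hzToMel(double hz) { return 2595.0 * Math.log10(1.0 + hz / 700.0); }
    static double melToHz(double mel) { return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0); }

    // Map each of the linear bins (spanning 0..maxHz) to one of nBands
    // mel bands and keep the max magnitude seen in each band.
    static double[] reduce(double[] spectrum, double maxHz, int nBands) {
        double[] bands = new double[nBands];
        double melMax = hzToMel(maxHz);
        for (int i = 0; i < spectrum.length; i++) {
            double hz = maxHz * i / (spectrum.length - 1);
            int b = (int) Math.min(nBands - 1, hzToMel(hz) / melMax * nBands);
            bands[b] = Math.max(bands[b], spectrum[i]);
        }
        return bands;
    }
}
```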

Why should we use the 9 closest peaks? This lacks a good justification and is too close to Shazam's patent. The line-recognition idea would be good if it did not produce too many lines. What if I applied yesterday's finding?

I really did. After applying the convolution to the 64-frequency spectrogram, the transformed matrix looks really good: clear lines, thick lines, not only horizontal lines but also sloped lines (I have not figured out why). What's more, with this sparse matrix, we don't really need a separate fingerprint. The sparse matrix itself is the fingerprint!

Matching is logically easy: count how many positions are non-zero in both the query clip and the database file. A simple experiment on 21 audio sources showed that, in a generally good recording environment, a 3-second recorded clip performs really well: the recording has more than 3000 co-occurrences with the correct audio source and only 2000 to 3000 co-occurrences with incorrect sources, with an 8000 Hz sampling rate and a 16 ms time interval.
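The counting step above can be sketched like this. A minimal illustration under my own assumptions (binary sparse matrices indexed as [frequency][time]; the clip is slid along the database file's time axis and the best alignment wins), not the exact experimental code.

```java
// Sketch of the matching step: count positions that are non-zero in both
// the query clip's sparse matrix and a same-sized window of the database file.
public class CoOccurrence {
    // clip and db are sparse matrices [freq][time]; offset shifts the clip
    // along the database's time axis.
    static int countAtOffset(int[][] clip, int[][] db, int offset) {
        int count = 0;
        for (int f = 0; f < clip.length; f++)
            for (int t = 0; t < clip[f].length; t++)
                if (clip[f][t] != 0 && db[f][t + offset] != 0) count++;
        return count;
    }

    // Best co-occurrence score over all alignments of the clip
    // against the database file.
    static int bestScore(int[][] clip, int[][] db) {
        int best = 0;
        for (int off = 0; off + clip[0].length <= db[0].length; off++)
            best = Math.max(best, countAtOffset(clip, db, off));
        return best;
    }
}
```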

## Research Log – Separation of music and human speech (1)

Most current audio fingerprinting methods operate on the spectrogram. A spectrogram is actually an image.

Spectrogram 1

Spectrogram 2

From the above spectrograms, where Spectrogram 1 is a mix of music and human speech and Spectrogram 2 is solely human speech, it can be seen that melodic audio consists of straight red lines, while human speech always comes with little waves and a lot of red smears. The red smears can be considered noise. My idea is to separate the music, speech, and noise, i.e. the lines, the waves, and the other red smears, from each other.

The problem intuitively led me to image processing: find the lines in the image. Initially I tried a line detector, which applies a convolution filter over the spectrogram:

$latex \left( \begin{array}{ccc} -1 & -1 & -1 \\ 2 & 2 & 2 \\ -1 & -1 & -1 \end{array} \right)$

This filter only finds lines one pixel wide, whilst I want to mark thicker lines. Thanks to my friend Yupeng Zhang, who helped me find a filter that can mark thicker lines:

$latex \left( \begin{array}{ccccc} -1 & -1 & -1 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \\ 2 & 2 & 2 & 2 & 2 \\ 0 & 0 & 0 & 0 & 0 \\ -1 & -1 & -1 & -1 & -1 \end{array} \right)$

This filter can find lines two pixels wide, and I extended it to find thicker lines. The filter works generally well.
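The filtering above is a plain 2-D convolution of the spectrogram matrix with the kernel. A minimal sketch (borders are left at zero; names are my own):

```java
// The 5x5 thick-line filter from above, applied as a 2-D convolution
// over the spectrogram matrix.
public class LineDetector {
    static final double[][] THICK = {
        {-1, -1, -1, -1, -1},
        { 0,  0,  0,  0,  0},
        { 2,  2,  2,  2,  2},
        { 0,  0,  0,  0,  0},
        {-1, -1, -1, -1, -1},
    };

    static double[][] convolve(double[][] img, double[][] k) {
        int kh = k.length, kw = k[0].length, oh = kh / 2, ow = kw / 2;
        double[][] out = new double[img.length][img[0].length];
        for (int i = oh; i < img.length - oh; i++)
            for (int j = ow; j < img[0].length - ow; j++) {
                double s = 0;
                for (int u = 0; u < kh; u++)
                    for (int v = 0; v < kw; v++)
                        s += k[u][v] * img[i - oh + u][j - ow + v];
                out[i][j] = s;
            }
        return out;
    }
}
```

A bright horizontal row in the input produces a strong positive response where the center row of the kernel lines up with it, and a negative response two rows away, which is what makes the detected lines stand out as sparse non-zeros.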

The other thing I did today was find a library that can invert a spectrogram. Inverse FFT is easy, but inverting a spectrogram is a nightmare, because the FFT windows overlap when the spectrogram is computed. The inverse spectrogram is inaccurate, but people can still identify the melodies and the speech. I also applied it to generate a .wav file after the line-detecting operation in the frequency domain, but the resulting audio can hardly be recognized. Although the spectrogram looks like what I want, the recovered audio signal is bad anyway.

## Research Log – study of an SVD audio fingerprinting system – US patent 20060190450

US patent 20060190450

Aug 24, 2006

Audio Fingerprinting System and Method

I am trying to associate some matrix factorization methods, like NMF, PCA, and ICA, with audio fingerprinting. Fortunately and unfortunately, there is little literature on this approach, and I found one piece.

The patent is actually a mess, claiming so many systems that I cannot even tell which one is their focus. The first several steps are the same as in most other fingerprinting systems:

preprocessing, FFT, spectrogram …

The following steps are a little different. Since a spectrogram is a time-frequency matrix, some matrix factorization method should be reasonably applicable, and SVD comes first. The characteristic vectors along which the points have the largest variance are extracted and treated as fingerprints. Comparison of fingerprints is done by comparing matrix norms. Since there is no time index, time alignment is not needed; time alignment can also be applied if a window is used, but this scenario is not fully discussed in the patent.
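My reading of the SVD step, as a sketch only and not the patent's actual embodiment: extract the dominant right singular vector of the time-frequency matrix, i.e. the direction of largest variance, and use it as the fingerprint. Here I do it by power iteration on A^T A; the method and names are my own illustration.

```java
// Power iteration on A^T A to approximate the dominant right singular
// vector of a time-frequency matrix A. Illustrative only.
public class SvdSketch {
    static double[] topSingularVector(double[][] a, int iters) {
        int cols = a[0].length;
        double[] v = new double[cols];
        java.util.Arrays.fill(v, 1.0 / Math.sqrt(cols));
        for (int it = 0; it < iters; it++) {
            // av = A v
            double[] av = new double[a.length];
            for (int i = 0; i < a.length; i++)
                for (int j = 0; j < cols; j++) av[i] += a[i][j] * v[j];
            // atav = A^T (A v)
            double[] atav = new double[cols];
            for (int j = 0; j < cols; j++)
                for (int i = 0; i < a.length; i++) atav[j] += a[i][j] * av[i];
            // normalize
            double norm = 0;
            for (double x : atav) norm += x * x;
            norm = Math.sqrt(norm);
            for (int j = 0; j < cols; j++) v[j] = atav[j] / norm;
        }
        return v;
    }
}
```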

The patent mentions another system. It finds frequency peaks in the FFT result and maps them to musical notes. These mappings are the main audio-recognition fingerprints in this patent's embodiment; the SVD fingerprints, however, are averaged and used to classify different styles of music.

This is really weird, so I still wonder whether matrix factorization methods can truly be applied to acoustic fingerprinting.

## Development Diary

1. It is hard to decide what to cache locally and where. Should it be cached at app startup in a global variable held by the Application class, or in a local file via SharedPreferences? The read speeds of the two approaches seem to differ. Caching user avatars is also troublesome: the images have to be cropped to a displayable size first.
2. Communicating with the server API. The API was only just written and still changes sometimes; some of the returned data needs digging out, and I have to coordinate with the colleague who writes the API. In short, very troublesome.
3. Animation effects. I am not responsible for this part (because I don't know how... :-( ), but even a simple animation can eat a whole afternoon.

2. Systematize the storage of user information
3. Test the speed gap between reading data from memory and from SharedPreferences

## Development Log

1. Set up an NDK library inside an Android library project

2. At runtime: java.lang.UnsatisfiedLinkError [library name]

The C library was built with the wrong package name

1. Change the package name in the .c file

2. Recompile the .o files

3. Modify the .java file before running again (if the .java file is unchanged, the .apk will not be rebuilt)
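The package name matters because JNI resolves a native method through a symbol built from package, class, and method name; if the .c file carries the wrong package, the symbol does not match and loading fails with UnsatisfiedLinkError. The helper below (my own illustration, not part of the NDK, and simplified: it ignores JNI's escaping of '_' in names) builds the expected symbol.

```java
// Builds the C symbol name JNI expects for a native method:
// Java_<package with '.' -> '_'>_<Class>_<method>
public class JniName {
    static String symbol(String pkg, String cls, String method) {
        return "Java_" + pkg.replace('.', '_') + "_" + cls + "_" + method;
    }
}
```

So after changing the Java package, the function names in the .c file must be renamed to match before recompiling.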

1. Upload the current signal id
2. Store the current signal id info locally, including the id, the next valid upload date, and the task start time
3. When the server responds, start the flash check-in animation and add points

———–