Similar concepts
Pages with this concept
Similarity |
Page |
Snapshot |
| 39 |
There are five commonly used measures of association in information retrieval
...The simplest of all association measures is $|X \cap Y|$, the simple matching coefficient, which is the number of shared index terms
...These may all be considered to be normalised versions of the simple matching coefficient
...then $|X_1| = 1$, $|Y_1| = 1$, $|X_1 \cap Y_1| = 1 \Rightarrow S_1 = 1$, $S_2 = 1$; $|X_2| = 10$, $|Y_2| = 10$, $|X_2 \cap Y_2| = 1 \Rightarrow S_1 = 1$, $S_2 = 1/10$; so $S_1(X_1, Y_1) = S_1(X_2, Y_2)$, which is clearly absurd since $X_1$ and $Y_1$ are identical representatives whereas $X_2$ and $Y_2$ are radically different
...Doyle [17] hinted at the importance of normalisation in an amusing way: One would regard the postulate "All documents are created equal" as being a reasonable foundation for a library description
... |
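The contrast between the raw and the normalised measure can be checked with a small sketch. It assumes that $S_1$ is the simple matching coefficient and that $S_2$ is a normalised version of the form $2|X \cap Y| / (|X| + |Y|)$ (Dice's coefficient); the snapshot does not show the definition of $S_2$, so that choice, like the function names and the sample index terms, is an assumption made for illustration.

```python
from fractions import Fraction


def simple_matching(x: set, y: set) -> int:
    """S1: number of index terms shared by the two representatives."""
    return len(x & y)


def dice(x: set, y: set) -> Fraction:
    """S2 (assumed form): shared terms normalised by the sizes of both sets."""
    return Fraction(2 * len(x & y), len(x) + len(y))


# Identical one-term representatives.
x1 = {"t0"}
y1 = {"t0"}

# Ten-term representatives that share exactly one term ("t9").
x2 = {f"t{i}" for i in range(10)}
y2 = {f"t{i}" for i in range(9, 19)}

for name, (x, y) in {"pair 1": (x1, y1), "pair 2": (x2, y2)}.items():
    print(name, "S1 =", simple_matching(x, y), "S2 =", dice(x, y))
```

Under these assumptions both pairs get the same $S_1$, while $S_2$ separates the identical pair from the nearly disjoint one.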
| 40 |
pertain to documents, such as index tags, being careful of course to deal with the same number of index tags for each document
...I now return to the promised mathematical definition of dissimilarity
...If $P$ is the set of objects to be clustered, a pairwise dissimilarity coefficient $D$ is a function from $P \times P$ to the non-negative real numbers
...D1: $D(X, Y) \geq 0$ for all $X, Y \in P$; D2: $D(X, X) = 0$ for all $X \in P$; D3: $D(X, Y) = D(Y, X)$ for all $X, Y \in P$. Informally, a dissimilarity coefficient is a kind of distance function
...D4: $D(X, Y) \leq D(X, Z) + D(Y, Z)$, which may be recognised as the theorem from Euclidean geometry which states that the sum of the lengths of two sides of a triangle is always greater than the length of the third side
...An example of a dissimilarity coefficient satisfying D1 to D4 is defined in terms of $X \,\Delta\, Y = (X \cup Y) - (X \cap Y)$, the symmetric difference of sets $X$ and $Y$
...and is monotone with respect to Jaccard's coefficient subtracted from 1
... |
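The conditions D1 to D4 can be tested mechanically on a small collection of sets. The sketch below uses Jaccard's coefficient subtracted from 1, that is $|X \,\Delta\, Y| / |X \cup Y|$, as the dissimilarity; this is chosen for illustration because it is known to be a metric, and it is not necessarily the coefficient defined in the excerpt. The helper names and the sample sets are hypothetical.

```python
from itertools import product


def jaccard_distance(x: frozenset, y: frozenset) -> float:
    """1 - |X & Y| / |X | Y|, written via the symmetric difference X ^ Y."""
    union = x | y
    if not union:  # two empty sets are treated as identical
        return 0.0
    return len(x ^ y) / len(union)


def check_axioms(objects, d, tol=1e-12):
    """Report which of D1 to D4 hold for d on every combination of objects."""
    ok = {"D1": True, "D2": True, "D3": True, "D4": True}
    for x, y in product(objects, repeat=2):
        if d(x, y) < 0:                      # D1: non-negativity
            ok["D1"] = False
        if abs(d(x, y) - d(y, x)) > tol:     # D3: symmetry
            ok["D3"] = False
    for x in objects:
        if abs(d(x, x)) > tol:               # D2: zero self-dissimilarity
            ok["D2"] = False
    for x, y, z in product(objects, repeat=3):
        if d(x, y) > d(x, z) + d(y, z) + tol:  # D4: triangle inequality
            ok["D4"] = False
    return ok


P = [frozenset("abc"), frozenset("bcd"), frozenset("a"),
     frozenset("xyz"), frozenset()]
print(check_axioms(P, jaccard_distance))
```

A check of this kind only shows that no violation occurs on the sample; it is not a proof that a coefficient satisfies the conditions in general.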
| 98 |
is another example of a matching function
...A popular one used by the SMART project, which they call cosine correlation, assumes that the document and query are represented as numerical vectors in $t$-space, that is $Q = (q_1, q_2, \ldots)$ ...or, in the notation for a vector space with a Euclidean norm, where $\theta$ is the angle between vectors $Q$ and $D$
...Serial search: Although serial searches are acknowledged to be slow, they are frequently still used as parts of larger systems
...Suppose there are $N$ documents $D_i$ in the system, then the serial search proceeds by calculating $N$ values $M(Q, D_i)$, from which the set of documents to be retrieved is determined
...(1) the matching function is given a suitable threshold, retrieving the documents above the threshold and discarding the ones below
...(2) the documents are ranked in increasing order of matching function value
... |
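A serial search with cosine correlation as the matching function can be sketched as follows. The code assumes the usual form of the cosine, the inner product of $Q$ and $D$ divided by the product of their Euclidean norms; the function names, the toy vectors, and the threshold and cut-off values are hypothetical. For the ranking variant the sketch places the highest-scoring documents first and keeps the top few, which is an assumption about how the cut-off is meant to be applied.

```python
import math


def cosine(q, d):
    """Cosine of the angle between two equal-length numerical vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:  # avoid dividing by zero for empty vectors
        return 0.0
    return dot / (norm_q * norm_d)


def serial_search(query, docs, match=cosine):
    """Compute M(Q, D_i) for every document; returns (index, score) pairs."""
    return [(i, match(query, d)) for i, d in enumerate(docs)]


def retrieve_by_threshold(scores, threshold):
    """Variant (1): keep the documents scoring above the threshold."""
    return [i for i, m in scores if m > threshold]


def retrieve_by_rank(scores, cutoff):
    """Variant (2): rank by score, best first, and keep the first `cutoff`."""
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in ranked[:cutoff]]


docs = [(1, 0, 0), (1, 1, 0), (0, 0, 1)]  # toy document vectors
query = (1, 0, 0)

scores = serial_search(query, docs)
print(retrieve_by_threshold(scores, 0.5))
print(retrieve_by_rank(scores, 1))
```

Every query touches all $N$ documents, which is why the excerpt calls the method slow; the two retrieval variants differ only in how the $N$ scores are cut.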