TY - JOUR
T1 - Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data
AU - Canessa, Enrique
AU - Chaigneau, Sergio E.
AU - Moreno, Sebastián
AU - Lagos, Rodrigo
N1 - Publisher Copyright: © 2020, Marta Olivetti Belardinelli and Springer-Verlag GmbH Germany, part of Springer Nature.
PY - 2020/11
Y1 - 2020/11
AB - To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. From these data, different measures can be computed. In particular, inter-concept similarity is an important variable used in experimental studies. Among possible similarity measures, the cosine of conceptual property frequency vectors seems to be a de facto standard. However, there is a lack of comparative studies that test the merit of different similarity measures when computed from property frequency data. The current work compares four different similarity measures (cosine, correlation, Euclidean and Chebyshev) and five different types of data structures. To that end, we compared the informational content (i.e., entropy) delivered by each of those 4 × 5 = 20 combinations, and used a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those calculated from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality, and their consequences for statistical analyses.
KW - Chebyshev distance
KW - Clustering
KW - Conceptual properties
KW - Cosine similarity
KW - Euclidean distance
UR - http://www.scopus.com/inward/record.url?scp=85087718025&partnerID=8YFLogxK
U2 - 10.1007/s10339-020-00985-5
DO - 10.1007/s10339-020-00985-5
M3 - Article
C2 - 32647948
AN - SCOPUS:85087718025
SN - 1612-4782
VL - 21
SP - 601
EP - 614
JO - Cognitive Processing
JF - Cognitive Processing
IS - 4
ER -