Clustering Relational Data

Large amounts of relational data arise in areas like cyber security, and such data is usually represented as graphs or hypergraphs, with entities represented as vertices, and relations represented as edges or hyper-edges. It is common that we do not have prior knowledge about the entities (unsupervised context) or that we only have information about a small subset of entities (semi-supervised context). In this talk, we review some of our work related to unsupervised learning on graphs, in particular graph partitioning and clustering. We first review cluster quality measures known as set partition similarities, and we argue that graph clustering algorithms should also be considered through “graph-aware” measures. Using those measures, we point out issues in graph clustering algorithms: instability, scalability and resolution issues. We propose a simple heuristic ensemble method that shows good scalability, and greatly improves stability. Density-based clustering in vector spaces has many appealing advantages, such as the ability to discover arbitrarily shaped and dense clusters, and to handle noisy data. Our ensemble method yields weights on edges that can be used as a density measure on graphs. Empirical results show good promise, but it remains a heuristic notion. We discuss our initial ideas for defining a connectivity measure on graphs based on some more formal definitions, and we show initial results with this measure. This is ongoing work, as we are looking for a better theoretical understanding of the connectivity measure.