Benedek Rozemberczki (@benitorosenberg)
PhD student at the University of Edinburgh studying machine learning
Karate Club is an unsupervised machine learning extension library for the NetworkX Python package. See the documentation here.
Karate Club consists of state-of-the-art methods for unsupervised learning on graph-structured data. Simply put, it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. The implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and contributions by well-known researchers in prominent journals.
A simple example
Karate Club makes using modern community detection techniques quite easy (see here for the related tutorial). The following snippet applies an overlapping community detection algorithm to a synthetic graph.
import networkx as nx
from karateclub import EgoNetSplitter

g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

splitter = EgoNetSplitter(1.0)
splitter.fit(g)
print(splitter.get_memberships())
Design principles
When we started Karate Club, we followed an API-oriented machine learning system design to create an end-user-friendly machine learning tool. This API-oriented design principle involves a few simple ideas. In this section, we discuss these ideas and their apparent advantages in detail, with suitable illustrative examples.
Encapsulated model hyperparameters and inspection
An unsupervised Karate Club model instance is created with the constructor of the appropriate Python object. This constructor has default hyperparameter settings that allow sensible out-of-the-box model use. In simple terms, this means that the end user does not need to understand the inner model mechanics in detail in order to use the methods implemented in our framework.
We set these default hyperparameters to achieve reasonable learning and runtime performance. If necessary, the model hyperparameters can be changed at model creation time with the appropriate parameterization of the constructor. The hyperparameters are stored as public attributes to allow inspection of the model settings.
import networkx as nx
from karateclub import DeepWalk

graph = nx.gnm_random_graph(100, 1000)

model = DeepWalk()
print(model.dimensions)

model = DeepWalk(dimensions=64)
print(model.dimensions)
We demonstrate the encapsulation of hyperparameters with the code snippet above. First, we want to embed an Erdős-Rényi graph generated by NetworkX using the default hyperparameter settings.
When the model is constructed, we do not change these default hyperparameters, and we can print the default value of the dimensions hyperparameter. Second, we decide to use a different number of dimensions, so we create a new model and still have public access to the dimensions hyperparameter.
Consistency and non-proliferation of classes
Each unsupervised machine learning model in Karate Club is implemented as a separate class that inherits from the Estimator class. Algorithms implemented in our framework have a limited number of public methods, because we do not assume that the end user is particularly interested in the algorithmic details of a specific technique.
All models expose a fit() method that takes the inputs (graph, node features) and calls the appropriate private methods to learn an embedding or clustering. Node and graph embeddings are returned by the public get_embedding() method, and cluster memberships are obtained by calling get_memberships().
import networkx as nx
from karateclub import DeepWalk

graph = nx.gnm_random_graph(100, 1000)

model = DeepWalk()
model.fit(graph)
embedding = model.get_embedding()
In the snippet above, we create a random graph and a DeepWalk model with the default hyperparameters, fit the model using the public fit() method, and return the embedding by calling the public get_embedding() method.
This example can be modified to create a Walklets embedding with minimal effort by changing the model import and the constructor; these changes result in the following snippet.
import networkx as nx
from karateclub import Walklets

graph = nx.gnm_random_graph(100, 1000)

model = Walklets()
model.fit(graph)
embedding = model.get_embedding()
Looking at these two snippets, the advantage of the API-driven design is obvious, as we only had to make a few changes. First, the import of the embedding model had to be changed. Second, we had to change the model construction, and the default hyperparameters were already set.
Third, the public methods of the DeepWalk and Walklets classes behave the same way: the embedding is learned with fit() and returned by get_embedding(). This allows quick and minimal code changes when an upstream unsupervised model used for feature extraction performs poorly.
Standardized dataset ingestion
We designed Karate Club so that standardized datasets are used when fitting a model. In practical terms, this means that algorithms serving the same purpose use the same data types for model training. In detail:
- Neighbourhood-based and structural node embedding techniques take a single NetworkX graph as input to the fit method.
- Attributed node embedding procedures take a NetworkX graph as input, with the node features represented as a NumPy array or a sparse SciPy matrix. In these matrices, rows correspond to nodes and columns correspond to features (see the sketch after this list).
- Graph-level embedding methods and statistical graph fingerprinting take a list of NetworkX graphs as input.
- Community detection methods take a NetworkX graph as input.
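The following minimal sketch illustrates the attributed case. It assumes the FeatherNode attributed node embedding class is available in the installed version of Karate Club; the feature matrix is random and purely illustrative.

import numpy as np
import networkx as nx
from karateclub import FeatherNode

# Illustrative connected graph and a random node feature matrix:
# rows correspond to nodes, columns to features.
graph = nx.newman_watts_strogatz_graph(100, 10, 0.05)
features = np.random.uniform(0, 1, (100, 128))

model = FeatherNode()
model.fit(graph, features)
embedding = model.get_embedding()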
High performance model mechanics
The underlying mechanics of the graph mining algorithms are implemented with widely used Python libraries that are independent of the operating system and do not require additional external libraries such as TensorFlow or PyTorch. Internal graph representations in Karate Club use NetworkX.
Dense linear algebra operations are performed with NumPy, and their sparse counterparts use SciPy. Implicit matrix factorization techniques rely on the GenSim package, and methods based on graph signal processing use PyGSP.
Standardized output generation and interface
The standardized output generation of Karate Club ensures that unsupervised learning algorithms serving the same purpose always return the same type of output, with a consistent ordering of data points.
This design principle has a very important consequence: if one algorithm is substituted for another algorithm of the same type, the downstream code that uses the output of the upstream unsupervised model does not need to change. Specifically, the outputs generated with our framework use the following data structures:
- Node embedding algorithms (neighbourhood preserving, attributed and structural) always return a NumPy float array when the get_embedding() method is called. The number of rows in the array equals the number of vertices, and the row index always matches the vertex index. The number of columns is the number of embedding dimensions (see the sketch after this list).
- Whole-graph embedding methods (spectral fingerprints, implicit matrix factorization techniques) return a NumPy float array when the get_embedding() method is called. The row index corresponds to the position of the individual graph in the input list of graphs. Columns again represent the embedding dimensions.
- Community detection methods return a dictionary when the get_memberships() method is called. Node indices are keys, and the corresponding values are the community memberships of the vertices. Certain graph clustering techniques create a node embedding in order to find vertex clusters; these return a NumPy float array when the get_embedding() method is called, structured the same way as the arrays returned by node embedding algorithms.
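A small sketch of the node-level convention, reusing the DeepWalk example from earlier; the shape comment simply reflects the default hyperparameters.

import networkx as nx
from karateclub import DeepWalk

graph = nx.gnm_random_graph(100, 1000)

model = DeepWalk()
model.fit(graph)
embedding = model.get_embedding()

# One row per vertex (in vertex order), one column per embedding dimension.
print(embedding.shape)  # (100, model.dimensions)
print(embedding[5])     # the embedding vector of vertex 5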
We demonstrate the standardized output generation and interface with the following code fragment. We cluster a random graph and return dictionaries containing the cluster memberships. Using the external community library, we can calculate the modularity of these clusterings.
This shows that the standardized output generation simplifies interfacing with external graph mining and machine learning libraries.
import community
import networkx as nx
from karateclub import LabelPropagation, SCD

graph = nx.gnm_random_graph(100, 1000)

model = SCD()
model.fit(graph)
scd_memberships = model.get_memberships()

model = LabelPropagation()
model.fit(graph)
lp_memberships = model.get_memberships()

print(community.modularity(scd_memberships, graph))
print(community.modularity(lp_memberships, graph))
Limitations
The current design of Karate Club has certain limitations, and we make assumptions about the input. We assume that the NetworkX graph is undirected and consists of a single connected component. All algorithms assume that nodes are indexed consecutively with integers and that the starting node index is 0. We also assume that the graph is not a multigraph, the nodes are homogeneous, and the edges are unweighted (each edge has unit weight).
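A minimal preprocessing sketch, assuming you start from a graph that violates these assumptions (the synthetic graph below is only an illustration): it keeps the largest connected component and relabels nodes with consecutive integers starting at 0, using plain NetworkX calls.

import networkx as nx

# An illustrative sparse random graph that may be disconnected.
graph = nx.erdos_renyi_graph(100, 0.01)

# Keep a single connected component.
largest_cc = max(nx.connected_components(graph), key=len)
graph = graph.subgraph(largest_cc).copy()

# Relabel nodes with consecutive integers starting at 0.
graph = nx.convert_node_labels_to_integers(graph, first_label=0)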
In the case of whole-graph embedding algorithms, every graph in the graph dataset must satisfy the input requirements listed above. The Weisfeiler-Lehman feature based embedding techniques allow nodes to have a single string feature, accessible under the feature key. Without this key, these algorithms use the degree centrality of a node as the node feature by default.
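As an illustrative sketch of this convention, a Weisfeiler-Lehman based model such as Graph2Vec can be fitted on graphs whose nodes carry a string feature attribute. The degree-based feature strings and the attributed constructor flag below are assumptions made for the sake of the example and may differ in the installed version.

import networkx as nx
from karateclub import Graph2Vec

# Attach a single string "feature" to every node of every graph; here the node
# degree is simply reused as that string, purely for illustration.
graphs = [nx.newman_watts_strogatz_graph(50, 4, 0.1) for _ in range(100)]
for graph in graphs:
    features = {node: str(graph.degree[node]) for node in graph.nodes()}
    nx.set_node_attributes(graph, features, "feature")

model = Graph2Vec(attributed=True)  # attributed flag assumed for this sketch
model.fit(graphs)
embedding = model.get_embedding()   # one row per input graph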