This paper describes the development of a scalable social network discovery algorithm and its deployment in a revenue agency. A revenue agency makes many risk assessments in the course of its operations. The registration of companies, individuals and other entities needs to be checked to avoid fraudulent registrations or identity theft. Correct lodgement or filing of returns, correct reporting of income, expenses and other relevant data on the return, and finally payment of any resulting liabilities all involve risk assessments and prioritization of work. Most current risk assessments are made on the basis of a particular entity's history or current return details. These risk assessments do not exploit social, professional and organizational relationships. The application of social network mining extends the current risk models to take into account an entity's relationships to other entities.
The core social network extraction (or detection) algorithm implemented here is well known but the implementation process and case study described here should be of interest to other practitioners with large-scale social network discovery applications. The scalable implementation is able to detect networks amongst millions of nodes and tens of millions of links. The resulting thousands of networks have sizes ranging from a few thousand nodes down to small two or three node networks.
The described implementation has replaced various local implementations within the revenue agency. Advantages of the current implementation include its conciseness, its scalability, its correctness and the fact that it is shared corporate-wide. A single shared code base across the organisation reduces maintenance costs.
We use the term “social network”, which narrowly and historically means nodes that are people and links that mean “knows or interacts with”. Our use in the applications described in this paper is much wider. Our nodes can be individuals, companies, trusts, partnerships or tax return forms. Our links can be “shares an address or other common feature”, “is a director of”, “is an employee of”, “is a partner of” or “is a beneficiary of”, to name just some. We have made the judgment for now that these networks are similar enough to social networks to justify the use of the term. Others have made similar judgments. For example, the IMDB movie database, with its network of actors appearing in the same films, is often considered a social network. Alternative terms that may be more appropriate include “relationship networks” or “entity networks”.
Section 2 relates our case study to other applications appearing in the literature. The implementation process is described in Section 3. Next the case studies are given and some of the benefits of social network mining explained. In conclusion, expected future developments are given.
There has been a lot of recent interest in graph mining in academia and in industry. In the data mining research community, the focus has been on large graphs, of which the world wide web is the largest. There has been less focus on the older area of social network analysis, partly because the datasets are fewer and smaller. This has changed in recent years, with social networking sites such as LinkedIn and Facebook requiring practical implementations of large-scale social network analysis algorithms on large networks. In the fraud detection and criminal intelligence communities, there is recognition of the potential of social network analysis algorithms. The vendor market appears mostly to offer network visualisation tools such as Analyst Notebook and Netmap. Network analysis tools are now becoming available in data mining suites such as those from SAS and BAE Systems (Detica). Gartner’s Hype Cycle of Emerging Technologies has social network analysis on the “Slope of Enlightenment” with an estimate of 2 to 5 years to mainstream adoption. Social network analysis was previously on the “Peak of Inflated Expectations” in 2006 and crossed the “Trough of Disillusionment” in 2007/2008.
Some studies on the extraction of social networks have been done [1,3,4], but not for very large databases. The published work involves small databases of entities and often a single network. It usually focuses on identifying entities of interest within that network. Our particular focus here is to find networks of particular interest among thousands or millions of small networks. By small network, we mean anything from one node to a few thousand nodes, where the total population can be up to 20 million entities.
This paper’s focus on risk rating networks does not seem to be a focus in the literature. One exception is work in which knowledge of social structures has been used for detecting securities fraud based on networks of brokers. Other work, such as the SNARE system, focuses on identifying high-risk nodes within a graph representing general ledger data. This has application in a revenue agency, but this paper focuses on network risk assessment rather than entity-within-network risk assessment.
Fig. 1 shows our methodology for building risk score models with social networks. First the raw data of entities to be networked are identified and then the links (or edges) are found. For example, the entities could be taxpayers. The hard links could be characteristics shared by two taxpayers, such as a location, or a shared association with another entity such as a company. The network detection (or extraction) algorithm is run and produces the social networks. A further optional step can add “soft” links, which are attributes that entities have in common but which don't guarantee a connection. For example, soft links may include an address in a common locality or working for the same employer. Then the networks are scored and ranked according to risk score. Higher-risk networks are then investigated. Ideally, feedback is received on the networks investigated and intelligence on any false positives is used to improve the process.
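As a concrete illustration of the hard-link step, the sketch below links any two entities that share a value in a chosen attribute column. The entity records, attribute names and link types here are hypothetical, not agency data:

```python
from itertools import combinations

# Hypothetical entity records: (entity_id, address, employer)
entities = [
    ("E1", "12 High St", "AcmeCo"),
    ("E2", "12 High St", "BetaCo"),
    ("E3", "9 Low Rd", "AcmeCo"),
    ("E4", "9 Low Rd", None),
]

def hard_links(entities, key_index):
    """Link every pair of entities that share a non-null value
    in the attribute at position key_index."""
    by_value = {}
    for row in entities:
        value = row[key_index]
        if value is not None:
            by_value.setdefault(value, []).append(row[0])
    links = set()
    for ids in by_value.values():
        for a, b in combinations(sorted(ids), 2):
            links.add((a, b))
    return links

# A shared address might be treated as a hard link,
# a shared employer as a soft link.
print(sorted(hard_links(entities, 1)))  # [('E1', 'E2'), ('E3', 'E4')]
print(sorted(hard_links(entities, 2)))  # [('E1', 'E3')]
```

In production this pairing is done set-wise in SQL (a self-join on the shared attribute) rather than in application code.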
So the starting point for the network detection implementation is a list of entities and links between entities (the Linked Soup stage of Fig. 1); the algorithm then produces a list of entities with their assigned network (the Social Network stage of Fig. 1).
The algorithms for extracting the list of networks are well known and straightforward. They include breadth-first search, depth-first search and Union-Find. The existing implementations were depth-first search algorithms. With the right data structures and system, they take O(V+E) time and O(V+E) extra space, where V is the number of nodes (vertices) and E is the number of links (edges). In contrast, the algorithm we implemented, Union-Find, has O(n log n) time complexity and O(V+E) extra space. The log n component can be seen in Fig. 2, which shows that the log of the number of networks to merge decreases linearly with each iteration.
Union Find is chosen because it maps to an implementation in SQL, a declarative language, in a relatively straightforward manner. It is not obvious how to implement a breadth-first search algorithm in SQL, since the method seems inherently procedural. To use Union Find to find the networks, start off with each node being its own network. Then each edge/link can be processed as the join (or merger) of two existing networks.
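The iterative merging just described can be sketched in a few lines of Python (toy node ids and links for illustration); each pass relabels a network to the smallest network id it is linked to, and the loop stops when no link crosses two different networks:

```python
def detect_networks(nodes, links):
    """Start with each node as its own network, then repeatedly
    merge networks along links, mirroring the set-based SQL
    implementation described in the text."""
    network = {n: n for n in nodes}  # node -> current network id
    while True:
        # Pairs of distinct networks joined by a link (symmetric),
        # each mapped to the smallest linked network id seen so far.
        merges = {}
        for head, tail in links:
            a, b = network[head], network[tail]
            if a != b:
                merges[a] = min(merges.get(a, a), b)
                merges[b] = min(merges.get(b, b), a)
        if not merges:
            break  # no cross-network links remain
        # Relabel nodes whose network gained a smaller target id
        for n, c in network.items():
            if c in merges and merges[c] < c:
                network[n] = merges[c]
    return network

print(detect_networks([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

The number of passes grows with the log of the number of networks to merge, which is the behaviour shown in Fig. 2.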
The implementation in a stored procedure takes about 45 lines of SQL and is given in the Appendix.
Table 1 shows some of the other implementations that the SQL solution described in this paper has replaced. The existing implementations were relatively slow and did not always produce correct answers. They implemented the depth-first algorithm but did not use efficient data structures for representing networks.
The table is presented to show the reality of industry and government practice. For example, correct implementations of network detection in Visual Basic or SAS should be possible with appropriate personnel expertise and resources, but the agency's experience shows that such expertise was not easily accessible within its operating divisions.
The possibility of scaling up the non-SQL implementations is less clear. For example, R (running on a 64-bit Linux server) didn't scale beyond about 3K nodes using the SNA (Social Network Analysis) package. The SNA package would need a sparse matrix network representation to scale beyond 3K nodes, but current efforts have not achieved this yet.
|System used||Scalable?|
| ||No (up to 5 minutes)|
|Visual Basic / Access Database||No (10 hours on Application 2 from Table 2)|
| ||No (3-4 hours on Application 4 from Table 2)|
|SQL / Teradata||Yes (1-2 minutes)|
The Teradata 5380 server is a powerful parallel computing platform. Recent benchmarking has shown how a parallel database architecture such as the Teradata server compares to the Google/Yahoo map-reduce approach on computer clusters for many key analyses needed by large-scale internet companies [16,17]. In future work, we would like to test whether we can get good performance on a computer cluster. If successful, this would reduce the data mining computational load on the Teradata server, which is the main workhorse for reporting and analysis applications within the revenue agency.
Table 2 shows the four application area datasets we wish to highlight in this case study. Each dataset has a different number of entities and links for the extraction of networks. The entities and relationships represented by the datasets are also different. The first two applications are small relative to the second two. Even the second application is too large to run in R, as was seen in Table 1.
Application 1 of Table 2 involved finding entities without links to other entities in the dataset. This was a method for excluding low-risk entities who had inadvertently been included in a hot-list. Prior to the use of this application, it was too costly to manually identify the low-risk entities, and so there was a cost of wrongly initiating compliance activities on these entities. This had a cost for the entities involved and for the agency and its reputation. These costs are now avoided.
Fig. 3 gives an example of a discovered network from Application 2 of Table 2. The current risk assessment uses a risk score based on the combination of network size, the number of known high-risk entities in the network and the amount of revenue at risk. The benefits of this application include a new capability to detect related patterns of fraudulent behaviour on a daily basis. Recall that the previous implementation took 10 hours to run while the SQL one takes less than 2 minutes. This makes the required information available on an intra-day basis for timely decision-making.
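As an illustration of such a combined score, the sketch below ranks networks by a weighted mix of the three features named above. The weights, the log scaling and the example networks are assumptions for illustration, not the agency's actual model:

```python
import math

def network_risk_score(size, high_risk_count, revenue_at_risk,
                       w_size=0.2, w_high_risk=0.5, w_revenue=0.3):
    """Weighted combination of network size, count of known
    high-risk entities and revenue at risk. Weights and log
    scaling are illustrative assumptions."""
    return (w_size * math.log1p(size)
            + w_high_risk * high_risk_count
            + w_revenue * math.log1p(revenue_at_risk))

# Hypothetical networks: (size, high-risk entities, revenue at risk)
networks = {
    "N1": (40, 3, 250_000),
    "N2": (500, 0, 10_000),
    "N3": (12, 5, 900_000),
}
ranked = sorted(networks, key=lambda n: network_risk_score(*networks[n]),
                reverse=True)
print(ranked)  # ['N3', 'N1', 'N2']
```

Note that the small network N3 outranks the much larger N2 because of its known high-risk members, which anticipates the point made later about not relying on network size alone.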
For the fourth dataset, Fig. 4 shows the linear relationship between the log of the number of networks and the log of network size. This shows that a power law holds for this dataset between network size and the number of networks. This power law also applies to the other datasets in this case study. This application has benefits in better informing the agency about the use and types of complex trust structures.
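A power law of this kind can be checked with a simple log-log regression: if the number of networks of size s is proportional to s^(-a), then the points (log s, log count) fall on a straight line with slope -a. The sketch below uses synthetic sizes constructed to follow an exact inverse-square law, not the actual distribution from the dataset:

```python
import math
from collections import Counter

# Synthetic network sizes with count(s) = 6400 / s^2
sizes = [2] * 1600 + [4] * 400 + [8] * 100 + [16] * 25

counts = Counter(sizes)
xs = [math.log(s) for s in counts]
ys = [math.log(c) for c in counts.values()]

# Least-squares slope of log(count) vs log(size);
# a straight line here is the signature of a power law
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # -2.0, the exponent of the synthetic data
```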
|Dataset||Purpose||System used||No. of entities||No. of links|
|1||Filter low-risk entities from a list of high-risk entities||R||2,000||8,000|
|2||Use a starting list of identified high-risk entities to find related, unknown high-risk entities||Teradata||300-2,000||2,000-16,000|
|3||Find non-agent tax returns that appear to have a common “guiding mind”||Teradata||1.2m||Networks of interest: 6,000|
|4||Find high-risk entity networks involving individuals, trusts, partnerships and companies||Teradata||2,000,000||18,000,000|
This case study reveals many opportunities for future work and improvements. The main focus of this paper was to produce the Social Networks of Fig. 1. Future work will extract enhanced networks, as shown in Fig. 1, using soft links. The current applications use basic network risk scoring methods, but we are aware of more work to be done on better characterizing higher-risk networks. For example, the high-risk network of Fig. 3 has a distinctive topology in common with other high-risk networks and distinct from lower-risk networks. A related issue is the need to detect high-risk networks earlier. This suggests moving away from a reliance on network size and revenue at risk as the key risk assessment features and moving towards signatures available in networks even while they are small.
One immediate issue that has arisen from the current applications is the need to monitor changes in the networks across time. Current work is underway to assign a stable name to each network despite splits and mergers. The naming rules for networks need to fit the intuition of the intelligence analysts working the network case. A practical issue arises around the better integration of the network detection process described here with the available network visualization products, such as iBase and Netmap. The bridge between the visualizations and the discovered networks in the data warehouse needs to be built and made seamless for end-users such as auditors.
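One possible naming rule, given here as an illustrative assumption rather than the agency's actual rule, is to let each newly detected network inherit the name of the old network that contributes the most of its nodes, which keeps names stable through most splits and mergers:

```python
from collections import Counter

def carry_forward_names(old_assignment, new_assignment):
    """Give each new network the name of the old network that
    contributes most of its members; brand-new networks get a
    fresh name derived from their id. Illustrative rule only."""
    members = {}
    for node, net in new_assignment.items():
        members.setdefault(net, []).append(node)
    new_names = {}
    for net, nodes in members.items():
        old_names = Counter(old_assignment[n] for n in nodes
                            if n in old_assignment)
        new_names[net] = (old_names.most_common(1)[0][0]
                          if old_names else f"net-{net}")
    return new_names

old = {"a": "N7", "b": "N7", "c": "N7", "d": "N9"}
new = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}  # N7 absorbed N9
print(carry_forward_names(old, new))  # {1: 'N7'}
```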
The final areas for development were identified by the survey in . Most network mining algorithms do not deal with links between entities of different types, or with links that represent different types of relationships. Risk scoring algorithms that can learn the relative importance of the different types of links will be important in future applications.
This SQL code runs on Teradata but should be transportable to other database engines.
The code assumes the following tables with these structures:
/* contains the network nodes */
network_nodes (node_type id, network_name_type network_id)
/* contains the network links */
network_links (node_type head, node_type tail)
/* working tables used by the stored procedure below */
networks_to_merge (network_name_type c1, network_name_type c2)
new_networks (network_name_type source, network_name_type target)
This is the SQL code for extracting the networks.
Example procedure call: call analyticsdb.netdet();
replace procedure analyticsdb.netdet ()
begin
  declare number_left integer default 1;
  declare loop_cnt integer default 0;

  while (number_left > 0) do
    set loop_cnt = loop_cnt + 1;

    -- Find pairs of distinct networks joined by a link
    delete from analyticsdb.networks_to_merge;
    insert into analyticsdb.networks_to_merge
    select distinct a.network_id as c1,
           b.network_id as c2
    from analyticsdb.network_links l
    join analyticsdb.network_nodes a
      on l.head = a.id
    join analyticsdb.network_nodes b
      on l.tail = b.id
    where a.network_id <> b.network_id;

    -- To ensure symmetry: if you can go A->B, then
    -- also B->A
    insert into analyticsdb.networks_to_merge
    select c2, c1 from analyticsdb.networks_to_merge;

    select count(*) into :number_left
    from analyticsdb.networks_to_merge;

    -- Identify new networks: each network merges into
    -- the smallest network it is linked to
    delete from analyticsdb.new_networks;
    insert into analyticsdb.new_networks
    select c1 as source,
           min(c2) as target
    from analyticsdb.networks_to_merge
    group by c1;

    -- Do the merge
    update analyticsdb.network_nodes
    from analyticsdb.new_networks
    set network_id = case when network_id < target
                          then network_id else target end
    where analyticsdb.new_networks.source = network_id;

  end while; -- End loop
end;