How journalists mined terabytes of offshore data to expose the world’s elites

The Pandora Papers revealed how politicians, celebrities, royalty and fraudsters use offshore tax havens to hide assets, secretly buy property, launder money and avoid taxes.

More than 600 journalists in 117 countries collaborated, using data tools to extract hidden connections between offshore companies and wealthy elites who used tax havens to hide their financial activities. Their investigation embarrassed politicians, royalty, celebrities and oligarchs worldwide.

The Pandora Papers showed that one of the world’s longest-serving monarchs, King Abdullah II of Jordan, had secretly built up a personal property empire.

His portfolio, which included luxury properties in Malibu and London’s Belgravia, was worth well over $100m. The properties were bought at a time when his country’s citizens were facing severe austerity measures and rampant unemployment. Their true ownership was hidden by offshore companies registered in the British Virgin Islands.

In Chile, opposition politicians launched impeachment proceedings against president Sebastián Piñera over irregularities in the sale of a mining company that were revealed in the documents.

And in the UK, prime minister Boris Johnson faced demands to return political funds from Conservative Party donors with alleged links to corruption.

Donations came from a Russian-born oil tycoon and from the wife of a former Russian oligarch, whose husband channelled money through a network of secret offshore “shell” companies.

Another wealthy Conservative Party donor advised a Swiss telecoms company on a complex financial transaction that was later acknowledged as a corrupt payment, the papers reveal.

The Pandora Papers are one of the most significant leaks to have been received by the International Consortium of Investigative Journalists (ICIJ).

Journalists across the world spent over a year analysing the trove of 11.9 million offshore company records to unearth important stories.

Included in the terabytes of data were copies of passports, bank statements, tax declarations, company incorporation documents, property contracts and due diligence questionnaires, presentations, audio and video files and handwritten notes.

“We were talking about very complex documents,” said Emilia Díaz-Struck, the ICIJ’s research editor and Latin America coordinator. “We are talking about financial documents and complex corporate structures.”

The data came from 14 different firms specialising in offshore services, and each of them stored its data in a different way.

Making sense of that “mess of data” required a combination of journalism and sophisticated data analysis.

Only 4% of the data they contained was held in spreadsheets. The rest was unstructured and difficult to search.

“We were lucky with some providers,” said Díaz-Struck. “There were spreadsheets, but we still had to combine them, find duplicates and put them together in one single file.”
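The combining step Díaz-Struck describes can be pictured with a short Python sketch. It is an illustration only, not the ICIJ's code: the provider labels, the `company` and `owner` column names and the sample rows are all hypothetical, and real records would need far more careful matching than lower-casing and trimming whitespace.

```python
import csv
import io


def normalise(name: str) -> str:
    """Lower-case a name and collapse whitespace so near-identical rows compare equal."""
    return " ".join(name.lower().split())


def merge_records(sources: dict[str, str]) -> list[dict[str, str]]:
    """Combine CSV text from several providers into one de-duplicated list.

    `sources` maps a hypothetical provider label to CSV text that has
    `company` and `owner` columns (assumed column names).
    """
    seen: set[tuple[str, str]] = set()
    merged: list[dict[str, str]] = []
    for provider, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            key = (normalise(row["company"]), normalise(row["owner"]))
            if key in seen:
                continue  # this company/owner pair was already recorded
            seen.add(key)
            merged.append({
                "company": row["company"].strip(),
                "owner": row["owner"].strip(),
                "provider": provider,  # remember where the row first appeared
            })
    return merged


# Invented sample data: the first row of provider_b duplicates provider_a.
SAMPLE = {
    "provider_a": "company,owner\nAcme Holdings Ltd,Jane Doe\nBlue Sky Inc,John Roe\n",
    "provider_b": "company,owner\nACME  Holdings Ltd,jane doe\nRed Rock SA,Ann Poe\n",
}

if __name__ == "__main__":
    for record in merge_records(SAMPLE):
        print(record)
```

Keeping the provider label on each merged row means a reviewer can still trace a record back to the spreadsheet it came from.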

In other cases, information was buried in huge PDFs, which had to be analysed and reviewed by the data and technology teams.

“The worst-case scenario was when there were handwritten forms,” said Díaz-Struck. “We had people from our team manually extracting that information and putting it into a structured format.”

Pierre Romera, the ICIJ’s chief technology officer, has spent his career working with sensitive sources, secure communications and analysing massive amounts of data.

Romera was there from the start of the Pandora Papers project when the ICIJ had its first contacts with a confidential informant with access to millions of records on offshore companies.

He works with a team of about 10 people on technology that enables journalists to analyse huge datasets. The team includes developers, systems administrators, designers and specialists in DevOps.

Since its pioneering work on the first large-scale offshore leak in 2013, the ICIJ has developed increasingly powerful tools to index and search documents.

The first leak, which became known as Offshore Leaks, was small in comparison to the Pandora Papers, at just 260GB of leaked emails and databases.

That investigation took blind alleys, made errors, and faced technical difficulties, but it also pioneered new methods of journalism and data analysis, Computer Weekly reported at the time.

After experimenting with structured databases, including SQL, to analyse the Offshore Leaks, data experts at the ICIJ turned to free text retrieval software Nuix.

Data specialists also developed a web portal, using another free-text retrieval programme, DT Search, which enabled more than 100 journalists across the world to interrogate the documents.

By the time the results of another major offshore leak, the Panama Papers, were published in 2016, the ICIJ had set up a small dedicated data team and had begun to develop its own collaboration tools.

The Panama Papers leak was much bigger than previous leaks and it was obvious that Nuix was not a good fit, said Romera.

The data team turned to open source software to build a dedicated free-text search engine using Blacklight, a tool widely used by libraries for searching documents, and Apache Solr, an open source enterprise search tool.

Over time, the data team switched to another technology, Elasticsearch, which allowed faster searches.

“Elasticsearch is much more powerful – it has a huge open source community and has a lot of features that are very useful to these investigations,” said Romera.

That project resulted in the creation of Datashare, which Romera describes as the most important tool used by ICIJ journalists during collaborations. It allows journalists to search vast archives of documents quickly and securely.

One of the most useful features of Datashare is its ability to perform bulk searches of data. Journalists can upload files containing, for example, lists of politicians, members of royalty or celebrities to find stories within the vast archives of data.

Datashare is also scalable, allowing Romera to add more servers to provide computing power needed to analyse bigger leaks and support larger teams.

During the Pandora Papers project, the ICIJ had the capability to deploy 15-20 servers. This made it possible for over 600 journalists to conduct keyword searches on the data – a step up from the 370-plus journalists who worked on the Panama Papers.

“Because we are trying to find the highest number of stories in the documents, we really need to use this search engine intensively,” said Romera.

Datashare is designed to be simple and fast to use and is, said Romera, essentially a lightweight interface built on top of Elasticsearch.

But it can also take software plug-ins and extensions. One of the most useful is a plug-in that extracts the names of people, organisations and place names automatically from the documents.

“Datashare is at the very centre of everything we do at ICIJ,” said Romera. “It is the most important tool we have.”

The second key tool used by ICIJ collaborators is I-Hub, which features in every investigation. Romera described I-Hub as a digital newsroom that allows journalists in multiple countries to work in a co-ordinated way.

Collaborators worked in regional groups during the Pandora Papers operation to share the discoveries they made. Others formed groups to analyse the data, or to develop stories.

I-Hub grew out of the ICIJ’s work on the Offshore Leaks project. An ICIJ member suggested the need for a tool that would allow journalists to work together in a secure way and other ICIJ members agreed.

The Knight Foundation provided a grant to develop I-Hub from an open source platform, Oxwall, originally designed to support social media and dating apps. It was used for the first time in the Swiss Leaks investigation in 2015.

By 2019, I-Hub was in need of an update to allow it to be used to manage the increasing volume of data being shared by journalists. It moved to a new platform, Discourse, which offered greater potential for customisation.

Projects like the Pandora Papers are successful because journalists agree to pool their information. “There is no room for egos,” said Díaz-Struck. “It is all based on sharing and trust.” 

“It’s important to get journalists involved at an early stage,” said Romera. “You need to understand very quickly whether it is of public interest, and whether there are potential stories within the documents.”

At the beginning of the Pandora Papers project, ICIJ members used batch files to match names contained in the documents with “country lists”. These files contained the names of politicians, celebrities, members of royalty and other people of interest in each geography.

Journalists also compared the leaked records to data in previous leaks, sanctions lists and other data sources.

The exercise gave the collaborators an overview of which countries and which individuals featured most prominently in the dataset.
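The batch-matching exercise can be sketched in a few lines of Python. This is a simplified stand-in rather than a description of how Datashare works: it uses plain case-insensitive substring matching instead of a search engine, and the country names, people and documents are invented for illustration.

```python
from collections import Counter


def batch_search(documents: dict[str, str],
                 country_lists: dict[str, list[str]]) -> dict[str, list[tuple[str, str]]]:
    """Return, per country, the (name, document id) pairs where a listed name appears."""
    lowered = {doc_id: text.lower() for doc_id, text in documents.items()}
    hits: dict[str, list[tuple[str, str]]] = {}
    for country, names in country_lists.items():
        for name in names:
            needle = name.lower()
            for doc_id, text in lowered.items():
                if needle in text:  # naive substring match, case-insensitive
                    hits.setdefault(country, []).append((name, doc_id))
    return hits


def overview(hits: dict[str, list[tuple[str, str]]]) -> Counter:
    """Count hits per country to show where the dataset is richest in leads."""
    return Counter({country: len(pairs) for country, pairs in hits.items()})


# Invented documents and "country lists" for illustration only.
DOCUMENTS = {
    "doc-001": "Certificate of incumbency. Beneficial owner: Maria Example.",
    "doc-002": "Invoice addressed to MARIA EXAMPLE, care of Example Trust.",
    "doc-003": "Board minutes signed by Tom Sample.",
}
COUNTRY_LISTS = {
    "Ruritania": ["Maria Example"],
    "Freedonia": ["Tom Sample", "Nobody Listed"],
}

if __name__ == "__main__":
    found = batch_search(DOCUMENTS, COUNTRY_LISTS)
    for country, count in overview(found).most_common():
        print(country, count, found[country])
```

A hit here is only a lead. As the article goes on to explain, each name still has to be checked against other data before anyone can say the person in the document is the person on the list.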

“They were able to do the work and go much deeper into the documents, but that all starts by giving them some leads,” said Romera.

One team pulled together data on US trusts; another worked to identify and count all the billionaires listed in the data. Others worked on identifying the presence of Russian oligarchs. Different research teams focused on different offshore service providers to try to make sense of their data.

“We actually split our data research team across providers, so different people took ownership of specific providers to see how we could structure that information,” said Díaz-Struck.

Some journalists used the documents to research topics they were already interested in. One investigation showed that a disgraced Catholic order, the Legion of Christ, with $300m deposited in offshore companies, had invested millions in a property company that evicted struggling tenants during the pandemic.

The ICIJ’s data and tech team initially spent time manually going through huge PDFs found in the data to identify tables of relevant information.

The teams were able to automate the process by using machine learning tools based on Python, Fonduer and Scikit-learn, to identify and extract the information. Like all data, it needed to be reviewed and cleaned manually.

As journalists began searching the documents, it became clear that they were picking up large numbers of documents that were not directly useful.

The data contained a large number of due-diligence reports, including lists of companies sanctioned by the US Office of Foreign Assets Control (OFAC), “know your customer” forms, and searches on World-Check, a commercial due diligence database.

“They were interesting in that they tell us that potentially [offshore service providers] are researching clients, but the records did not actually mean those people were in our data,” said Díaz-Struck.

The data team used machine learning technology again to identify and cluster the unwanted files, enabling journalists to remove them from their searches.
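The article does not say which model the team used for this step. The Python sketch below is a much simpler stand-in that conveys the general idea: it compares each document's word counts with a known example of an unwanted file and flags anything that looks similar. The seed text, sample documents and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Turn text into a bag-of-words count of lower-cased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def flag_similar(documents: dict[str, str], seed: str, threshold: float = 0.5) -> list[str]:
    """Return ids of documents whose wording resembles the seed example."""
    seed_vector = vectorise(seed)
    return [doc_id for doc_id, text in documents.items()
            if cosine(vectorise(text), seed_vector) >= threshold]


# Invented seed: typical wording of a screening report a reviewer has
# already marked as not directly useful.
SEED = "world check screening report sanctions list search result no match found"

DOCS = {
    "kyc-1": "screening report world check search result no match found for client",
    "deed-1": "share transfer agreement between the company and the beneficial owner",
}

if __name__ == "__main__":
    print(flag_similar(DOCS, SEED))
```

Flagged files would be tagged so that journalists can exclude them from searches, rather than deleted, since a due-diligence report can still be of interest in its own right.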

During the Pandora Papers investigation, Romera and his team developed a plug-in to link I-Hub and Datashare which allowed journalists to comment and start discussions on the documents of interest directly on the Datashare platform.

“We spent months trying to build this feature,” said Romera. “It was not a great success.”

Whether that was because journalists were not aware of the feature or because they had not been trained to use it is as yet unclear.

“Maybe the next step is to ensure that they know they can comment,” he added.

Fact-checking was a major part of the Pandora Papers project. Every number quoted in public stories goes through a scrupulous fact-checking process.

A huge amount of work goes into verifying, for example, that the leaked data contains information about more than 330 politicians and public officials in 90 countries and territories, and more than 29,000 beneficial owners.

Finding an accurate figure for the number of politicians in the dataset required painstaking effort. It meant cross-checking the names of political figures with other data, such as their dates of birth and data contained in public records, to ensure they were identified correctly.

Different countries record dates, including dates of birth, for example, in different formats.

“You need someone assessing a random sample of data and reviewing each service provider to check what format the date is in, because it is not always obvious,” said Díaz-Struck.
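A small Python sketch shows why the format is “not always obvious”. A value such as 03/04/1970 parses as two different dates, while a sample that contains 25/12/1980 rules out the month-first reading for that provider. The list of candidate formats and the sample values are assumptions, not taken from the leaked data.

```python
from datetime import date, datetime

# Candidate formats to test; an assumption for this sketch.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d")


def possible_dates(raw: str) -> set[date]:
    """Return every distinct date the raw string could represent."""
    results: set[date] = set()
    for fmt in CANDIDATE_FORMATS:
        try:
            results.add(datetime.strptime(raw.strip(), fmt).date())
        except ValueError:
            pass  # this format does not fit the value
    return results


def consistent_formats(samples: list[str]) -> list[str]:
    """Return the formats that parse every sampled value from one provider."""
    matching = []
    for fmt in CANDIDATE_FORMATS:
        try:
            for raw in samples:
                datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # at least one sample contradicts this format
        matching.append(fmt)
    return matching


if __name__ == "__main__":
    print(possible_dates("03/04/1970"))                        # two candidate dates
    print(consistent_formats(["03/04/1970", "25/12/1980"]))    # only day-first fits
    print(consistent_formats(["03/04/1970", "05/06/1971"]))    # still ambiguous
```

When more than one format survives the sample, the ambiguity cannot be settled by code alone and someone has to check against another source, such as a passport copy or a public record.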

Journalists and data scientists do most of their research using Datashare’s search capabilities.

Towards the end of the investigation, researchers used the structured data they had created to build a graph database, which mapped the relationships between offshore companies and their beneficial owners.

The ICIJ worked with Neo4j’s graph database and another open source platform, Linkurious, to create interactive visualisations and make them searchable.

The graph database comprises nodes, such as the name of a company, an individual or an offshore service provider, and relationships, which show the connections between them.

The ICIJ began looking at graphs after the first offshore leak in 2013. In the early days, journalists were mapping connections by drawing lines on Word documents.

Emil Eifrem, founder and CEO of Neo4j, offered the technology pro-bono. “We talked to them to understand what they were doing,” he said. “We tried to help out as much as we could, but fundamentally had no idea what they were investigating. And then, bam – the Panama Papers hit.”

The strength of graph databases is that they make it easy to identify connections that are hidden in plain sight. “It’s about finding the indirect connections, the multiple hops down the line, those layers of obscurity,” said Eifrem.
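The “multiple hops” idea can be illustrated without a graph database at all. The Python sketch below stores nodes and relationships in a plain dictionary and uses a breadth-first search to find the shortest chain linking a person to an asset. It is a toy model under stated assumptions: the names and relationship labels are invented, and a real deployment would query Neo4j rather than walk an in-memory dictionary.

```python
from collections import deque

# Invented (source, relationship, target) triples for illustration.
RELATIONSHIPS = [
    ("Person A", "OWNS", "Shell Co 1"),
    ("Shell Co 1", "SHAREHOLDER_OF", "Shell Co 2"),
    ("Shell Co 2", "OWNS", "Villa X"),
    ("Provider P", "REGISTERED", "Shell Co 1"),
]


def build_graph(relationships):
    """Build a directed adjacency map: node -> list of (label, neighbour)."""
    graph = {}
    for source, label, target in relationships:
        graph.setdefault(source, []).append((label, target))
        graph.setdefault(target, [])  # make sure leaf nodes exist too
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for the shortest chain of hops from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _label, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no chain of relationships connects the two nodes


if __name__ == "__main__":
    graph = build_graph(RELATIONSHIPS)
    print(find_path(graph, "Person A", "Villa X"))
```

No single record in this example links the person to the villa. The connection only appears when the three relationships are followed in sequence, which is the kind of indirect link Eifrem describes.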

In the future, Romera said he would like to develop ways of exporting structured data recorded in emails and other documents directly into Neo4j.

That could include, for example, extracting the names of people who sent emails and the people who received them, to create a map showing the relationship between individuals and organisations.

“We would like to be able, with Neo4j, to export all this metadata we have in the documents to generate automatically a graph of relationships between people inside our data,” he said. “If we managed to do that, the power of graph database is absolutely central.”
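As a rough idea of what extracting that metadata involves, the Python sketch below reads the From, To and Cc headers of a single email with the standard library and turns them into sender-to-recipient pairs that could later be loaded into a graph. It is not the ICIJ's prototype, which had not been built at the time of writing, and the message and addresses are invented.

```python
from email import message_from_string
from email.utils import getaddresses


def email_edges(raw_message: str) -> list[tuple[str, str]]:
    """Return (sender, recipient) address pairs found in one raw email."""
    msg = message_from_string(raw_message)
    senders = [addr for _name, addr in getaddresses(msg.get_all("From", []))]
    recipients = [
        addr
        for _name, addr in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    ]
    # One edge per sender/recipient combination, lower-cased for consistency.
    return [(s.lower(), r.lower()) for s in senders for r in recipients]


# Invented message for illustration only.
RAW = (
    "From: Alice Agent <alice@provider.example>\n"
    "To: Bob Client <bob@client.example>, carol@bank.example\n"
    "Cc: Dan <dan@lawfirm.example>\n"
    "Subject: Incorporation documents\n"
    "\n"
    "Please find attached.\n"
)

if __name__ == "__main__":
    for sender, recipient in email_edges(RAW):
        print(sender, "->", recipient)
```

Each pair corresponds to one relationship in a graph. Repeated across millions of messages, the pairs would show who was in regular contact with whom.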

It is likely to take at least six or eight months to build the first prototype, and Romera hopes to work with the open source community to develop the technology.

Meanwhile, the ICIJ is making the results of the graph database available on its website. The Power Players is an interactive graphic that shows the relationship between world leaders and their offshore companies, along with links to redacted copies of selected documents.

Romera said the ICIJ has good reasons for not making the trove of leaked documents fully public. The organisation works with its partner publications to ensure they follow a strict methodology to protect the data and to fact-check it, he said.

“I think that is one of the reasons why, after so many years and after exposing so many companies, we have never got sued,” he added.

Keeping both journalists and leaked documents secure is another priority. The ICIJ works with journalists in 117 countries.

“Journalists could be monitored or could be targeted because of this investigation or other investigations they are working on,” said Díaz-Struck.


Romera has a small team in Spain that carry out regular threat monitoring and security testing of the ICIJ’s servers. “Journalists are not always tech-savvy,” he said. “So we try to make user-friendly interfaces that are also secure.”

Emails are encrypted with PGP and journalists use two-factor authentication to access a single sign-on platform that gives them access to Datashare and I-Hub.

In some cases, the ICIJ also supplements this security with SSL certificates, which journalists can install on their computer to provide an additional layer of authentication.

The risks are real. The ICIJ was hit by a cyber attack as it began publishing the first stories from the Pandora Papers.

The first distributed denial of service (DDoS) attack hit the website on the Sunday evening when the first stories were published, bombarding it with messages. The next attack was “much smarter” and managed to make the website inaccessible for several hours.

“We have to be very careful because most of the time, a DDoS attack can be used to hide an attempt to penetrate the system,” said Romera.

The ICIJ’s mission is not to train journalists to work securely, but because of the sensitivity of its work, training is essential, he added. “Because of this investigation, there are now 627 journalists all around the world who know how to use PGP, that were not aware of this kind of technology.”

Romera and the ICIJ develop new capabilities in Datashare in response to requests from journalists carrying out the research. That might include, for example, the ability to search a new type of document, or carry out a new type of search.

“When reporters want to look for something, they just tell us and if Datashare is not able to find it for them, we try to build that in the future,” he said.

The ICIJ is developing a desktop version of Datashare that will allow journalists to search documents on their computers and share alerts with other journalists in their network.

For example, a journalist could mark a politician as a subject of interest in their desktop version of Datashare and receive alerts when other journalists identify documents containing details about the same politician.

The journalists will have the option of contacting each other and sharing documents if they want to.

The ICIJ, which is working in collaboration with researchers at the Swiss university École Polytechnique Fédérale de Lausanne (EPFL), is about to develop the first prototype of the technology, which will allow ICIJ members to set up their own collaborative investigations.

“We want to provide that network of investigative journalists with some kind of software that mimics the way we all work together on investigations,” said Romera. “If they want to start collaborating on documents together, that is the ideal scenario.”

For Romera, the project has proved that the ICIJ has the capability to turn around large document leaks extremely quickly. “Now when we get such a big leak, we are able to make it available for research in maybe 15 days, whereas before it would have taken months of troubleshooting,” he said.

