How journalists mined terabytes of offshore data to expose the world’s elites
The Pandora Papers revealed how politicians, celebrities,
royalty and fraudsters use offshore tax havens to hide assets, secretly buy
property, launder money and avoid taxes.
More than 600 journalists in 117 countries collaborated,
using data tools to extract hidden connections between offshore companies and
wealthy elites who used tax havens to hide their financial activities. Their
investigation embarrassed politicians, royalty, celebrities and oligarchs
worldwide.
The Pandora Papers showed that one of the world’s
longest-serving monarchs, King Abdullah II of Jordan, had secretly built up a
personal property empire.
His portfolio, including luxury properties in Malibu and
London’s Belgravia, was worth well over $100m. The properties were bought at a
time when his country’s citizens were facing severe austerity measures and
rampant unemployment. Their true ownership was hidden by offshore companies
registered in the British Virgin Islands.
In Chile, opposition politicians launched impeachment
proceedings against president Sebastián Piñera over irregularities in the sale
of a mining company that were revealed in the documents.
And in the UK, prime minister Boris Johnson faced demands to
return political funds from Conservative Party donors with alleged links to
corruption.
Donations came from a Russian-born oil tycoon and from the wife
of a former Russian oligarch who channelled money through a network of
secret offshore “shell” companies.
Another wealthy Conservative Party donor advised a Swiss
telecoms company on a complex financial transaction that was later acknowledged
as a corrupt payment, the papers reveal.
The Pandora Papers are one of the most significant leaks to
have been received by the International Consortium of Investigative Journalists
(ICIJ).
Journalists across the world spent over a year analysing the
trove of 11.9 million offshore company records to unearth important stories.
Included in the terabytes of data were copies of passports, bank
statements, tax declarations, company incorporation documents, property
contracts and due diligence questionnaires, presentations, audio and video
files and handwritten notes.
“We were talking about very complex documents,” said Emilia
Díaz-Struck, the ICIJ’s research editor and Latin America coordinator. “We are
talking about financial documents and complex corporate structures.”
The data came from 14 different firms specialising in
offshore services and each of them stored their data in a different way.
Making sense of that “mess of data” required a combination
of journalism and sophisticated data analysis.
Only 4% of the data they contained was held in spreadsheets.
The rest was unstructured and difficult to search.
“We were lucky with some providers,” said Díaz-Struck.
“There were spreadsheets, but we still had to combine them, find duplicates and
put them together in one single file.”
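The consolidation step Díaz-Struck describes can be pictured with a minimal Python sketch. The column names (`name`, `jurisdiction`), the provider labels and the rule for what counts as a duplicate are assumptions made for illustration, not the ICIJ's actual schema or method.

```python
import csv
import io

def normalise(value):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())

def combine(sources):
    """Merge rows from several CSV sources, dropping duplicate records.

    `sources` maps a hypothetical provider label to an open text file.
    Two rows count as duplicates when their normalised name and
    jurisdiction are identical (an assumed rule, for illustration only).
    """
    seen = set()
    merged = []
    for provider, handle in sources.items():
        for row in csv.DictReader(handle):
            key = (normalise(row["name"]), normalise(row["jurisdiction"]))
            if key in seen:
                continue  # already recorded from an earlier spreadsheet
            seen.add(key)
            merged.append({**row, "provider": provider})
    return merged

# Invented sample data standing in for two providers' spreadsheets.
provider_a = io.StringIO("name,jurisdiction\nAcme Holdings Ltd,BVI\n")
provider_b = io.StringIO(
    "name,jurisdiction\nACME  Holdings Ltd,bvi\nBlue Sea Corp,Panama\n"
)
records = combine({"provider_a": provider_a, "provider_b": provider_b})
```

Real records would rarely match this cleanly. Misspellings and transliterated names need fuzzier comparison and, as the article makes clear, manual review.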
In other cases, information was buried in huge PDFs, which
had to be analysed and reviewed by the data and technology teams.
“The worst-case scenario was when there were handwritten
forms,” said Díaz-Struck. “We had people from our team manually extracting that
information and putting it into a structured format.”
Pierre Romera, the ICIJ’s chief technology officer, has
spent his career working with sensitive sources, secure communications and
analysing massive amounts of data.
Romera was there from the start of the Pandora Papers
project when the ICIJ had its first contacts with a confidential informant with
access to millions of records on offshore companies.
He works with a team of about 10 people on technology that
enables journalists to analyse huge datasets. The team includes developers,
systems administrators, designers and specialists in DevOps.
Since its pioneering work on the first large-scale offshore
leak in 2013, the ICIJ has developed increasingly powerful tools to index and
search documents.
The first leak, which became known as Offshore Leaks, was
small in comparison to the Pandora Papers, at just 260GB of leaked emails and
databases.
That investigation took blind alleys, made errors, and faced
technical difficulties, but it also pioneered new methods of journalism and
data analysis, Computer Weekly reported at the time.
After experimenting with structured databases, including
SQL, to analyse the Offshore Leaks, data experts at the ICIJ turned to free
text retrieval software Nuix.
Data specialists also developed a web portal, using another
free-text retrieval programme, DT Search, which enabled more than 100
journalists across the world to interrogate the documents.
By the time the results of another major offshore leak, the
Panama Papers, were published in 2016, the ICIJ had set up a small dedicated data
team and had begun to develop its own collaboration tools.
The Panama Papers leak was much bigger than previous leaks,
and it was obvious that Nuix was not a good fit, said Romera.
The data team turned to open source software to build a
dedicated free-text search engine using Blacklight, a tool widely used by
libraries for searching documents, and Apache Solr, an open source enterprise
search tool.
Over time, the data team switched to another technology,
Elasticsearch, which allowed faster searches.
“Elasticsearch is much more powerful – it has a huge open
source community and has a lot of features that are very useful to these
investigations,” said Romera.
That project resulted in the creation of Datashare, which Romera
describes as the most important tool used by ICIJ journalists during
collaborations. It allows journalists to search vast archives of documents
quickly and securely.
One of the most useful features of Datashare is its ability
to perform bulk searches of data. Journalists can upload files containing, for
example, lists of politicians, members of royalty or celebrities to find
stories within the vast archives of data.
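The article does not describe how Datashare implements bulk searches internally. As a rough illustration of the idea, the Python sketch below turns an uploaded list of names into a single request body for Elasticsearch's multi-search (`_msearch`) API, which accepts alternating header and query lines in newline-delimited JSON. The index name `leak-documents` and the field name `content` are hypothetical placeholders.

```python
import json

def build_msearch_body(names, index="leak-documents", field="content"):
    """Turn a list of names into an NDJSON body for Elasticsearch's _msearch API.

    Each name becomes one exact-phrase query. The index and field names
    are hypothetical placeholders, not Datashare's real configuration.
    """
    lines = []
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip blank lines in the uploaded list
        lines.append(json.dumps({"index": index}))
        lines.append(
            json.dumps({"query": {"match_phrase": {field: name}}, "size": 10})
        )
    # The multi-search format requires a trailing newline.
    return "\n".join(lines) + "\n"

# Invented names standing in for an uploaded list of people of interest.
uploaded_list = ["Jane Example", "", "John Placeholder"]
body = build_msearch_body(uploaded_list)
```

Sending one body with many queries, rather than one request per name, is what makes it practical to run lists of hundreds of names against a large archive.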
Datashare is also scalable, allowing Romera to add more
servers to provide computing power needed to analyse bigger leaks and support
larger teams.
During the Pandora Papers project, the ICIJ had the
capability to deploy 15-20 servers. This made it possible for over 600
journalists to conduct keyword searches on the data – a step up from the
370-plus journalists who worked on the Panama Papers.
“Because we are trying to find the highest number of stories
in the documents, we really need to use this search engine intensively,” said
Romera.
Datashare is designed to be simple and fast to use and is,
said Romera, essentially a lightweight interface built on top of Elasticsearch.
But it can also take software plug-ins and extensions. One
of the most useful is a plug-in that extracts the names of people,
organisations and place names automatically from the documents.
“Datashare is at the very centre of everything we do at
ICIJ,” said Romera. “It is the most important tool we have.”
The second key tool used by ICIJ collaborators is I-Hub,
which features in every investigation. Romera described it as a digital
newsroom that allows journalists in multiple countries to work in
a co-ordinated way.
Collaborators worked in regional groups during the Pandora
Papers operation to share the discoveries they made. Others formed groups to
analyse the data, or to develop stories.
I-Hub grew out of the ICIJ’s work on the Offshore Leaks
project. An ICIJ member suggested the need for a tool that would allow
journalists to work together in a secure way and other ICIJ members agreed.
The Knight Foundation provided a grant to develop I-Hub from
an open source platform, Oxwall, originally designed to support social media
and dating apps. It was used for the first time in the Swiss Leaks
investigation in 2015.
By 2019, I-Hub was in need of an update to allow it to be
used to manage the increasing volume of data being shared by journalists. It
moved to a new platform, Discourse, which offered greater potential for
customisation.
Projects like the Pandora Papers are successful because
journalists agree to pool their information. “There is no room for egos,” said
Díaz-Struck. “It is all based on sharing and trust.”
“It’s important to get journalists involved at an early
stage,” said Romera. “You need to understand very quickly whether it is of
public interest, and whether there are potential stories within the documents.”
At the beginning of the Pandora Papers project, ICIJ members
used batch files to match names contained in the documents with “country
lists”. These files contained the names of politicians, celebrities, members of
royalty and other people of interest in each geography.
Journalists also compared the leaked records to data in
previous leaks, sanctions lists and other data sources.
The exercise gave the collaborators an overview of which
countries and which individuals featured most prominently in the dataset.
“They were able to do the work and go much deeper into the
documents, but that all starts by giving them some leads,” said Romera.
One team pulled together data on US trusts; another worked
to identify and count all the billionaires listed in the data. Others worked on
identifying the presence of Russian oligarchs. Different research teams focused
on different offshore service providers to try to make sense of their data.
“We actually split our data research team across providers,
so different people took ownership of specific providers to see how we could
structure that information,” said Díaz-Struck.
Some journalists used the documents to research topics they
were already interested in. One investigation showed that a disgraced Catholic
order, the Legion of Christ, with $300m deposited in offshore companies, had
invested millions in a property company that evicted struggling tenants during
the pandemic.
The ICIJ’s data and tech team initially spent time manually
going through huge PDFs found in the data to identify tables of relevant
information.
The teams were able to automate the process by using machine
learning tools based on Python, Fonduer and Scikit-learn, to identify and
extract the information. Like all data, it needed to be reviewed and cleaned
manually.
As journalists began searching the documents, it became
clear that they were picking up large numbers of documents that were not
directly useful.
The data contained a large number of due-diligence reports,
including lists of companies sanctioned by the US Office of Foreign
Assets Control (OFAC), “know your customer” forms, and searches on World-Check,
a commercial due diligence database.
“They were interesting in that they tell us that potentially
[offshore service providers] are researching clients, but the records did not
actually mean those people were in our data,” said Díaz-Struck.
The data team used machine learning technology again to
identify and cluster the unwanted files, enabling journalists to remove them
from their searches.
During the Pandora Papers investigation, Romera and his team
developed a plug-in to link I-Hub and Datashare which allowed journalists to
comment and start discussions on the documents of interest directly on the
Datashare platform.
“We spent months trying to build this feature,” said Romera.
“It was not a great success.”
It is as yet unclear whether that was because journalists were not aware of
the feature or because they had not been trained to use it.
“Maybe the next step is to ensure that they know they can
comment,” he added.
Fact-checking was a major part of the Pandora Papers
project. Every number quoted in public stories goes through a scrupulous
fact-checking process.
A huge amount of work goes into verifying, for example, that
the leaked data contains information about more than 330 politicians and public
officials in 90 countries and territories, and more than 29,000 beneficial
owners.
Finding an accurate figure for the number of politicians in
the dataset required painstaking effort. It meant cross-checking the names of
political figures with other data, such as their dates of birth and data
contained in public records, to ensure they were identified correctly.
Different countries record dates, including dates of birth,
in different formats.
“You need someone assessing a random sample of data and
reviewing each service provider to check what format the date is in, because it
is not always obvious,” said Díaz-Struck.
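The date problem can be illustrated with a short Python sketch. It assumes an analyst has already reviewed a sample from each provider and recorded which format that provider uses. The provider labels, the format table and the matching rule are all hypothetical.

```python
from datetime import datetime

# Hypothetical per-provider formats, as an analyst might record them
# after reviewing a sample of each provider's records.
PROVIDER_DATE_FORMATS = {
    "provider_a": "%d/%m/%Y",  # day first
    "provider_b": "%m/%d/%Y",  # month first
    "provider_c": "%Y-%m-%d",  # ISO 8601
}

def normalise_dob(raw, provider):
    """Return the date of birth as ISO 8601 text, or None if it does not parse."""
    try:
        fmt = PROVIDER_DATE_FORMATS[provider]
        return datetime.strptime(raw.strip(), fmt).date().isoformat()
    except (KeyError, ValueError):
        return None  # unknown provider or malformed date: leave for manual review

def same_person(record, reference):
    """Treat two records as the same person only if name and birth date agree."""
    dob = normalise_dob(record["dob"], record["provider"])
    return (
        dob is not None
        and dob == reference["dob"]
        and record["name"].casefold() == reference["name"].casefold()
    )

# The same string means different days depending on the provider.
print(normalise_dob("03/04/1970", "provider_a"))  # 1970-04-03
print(normalise_dob("03/04/1970", "provider_b"))  # 1970-03-04
```

A string such as "03/04/1970" parses without error under either convention, which is why the format has to be established per provider by a person rather than guessed by the code.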
Journalists and data scientists do most of their research using
Datashare’s search capabilities.
Towards the end of the investigation, researchers used the
structured data they had created to build a graph database, which mapped the
relationships between offshore companies and their beneficial owners.
The ICIJ worked with Neo4j’s graph database and another open
source platform, Linkurious, to create interactive visualisations and make them
searchable.
The graph database comprises nodes, such as a company,
an individual or an offshore service provider, and relationships, which
show the connections between them.
The ICIJ began looking at graphs after the first offshore
leak in 2013. In the early days, journalists were mapping connections by
drawing lines on Word documents.
Emil Eifrem, founder and CEO of Neo4j, offered the
technology pro-bono. “We talked to them to understand what they were doing,” he
said. “We tried to help out as much as we could, but fundamentally had no idea
what they were investigating. And then, bam – the Panama Papers hit.”
The strength of graph databases is that they make it easy to
identify connections that are hidden in plain sight. “It’s about finding the
indirect connections, the multiple hops down the line, those layers of
obscurity,” said Eifrem.
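The "multiple hops" Eifrem mentions can be shown in miniature. The sketch below is plain Python rather than Neo4j or its query language, and every name in it is invented. It stores relationships as triples and uses a breadth-first search to find the shortest chain linking a person to an asset.

```python
from collections import deque

# Invented relationships: (source node, relationship type, target node).
RELATIONSHIPS = [
    ("Person A", "OWNS", "Shell Co 1"),
    ("Shell Co 1", "SHAREHOLDER_OF", "Shell Co 2"),
    ("Shell Co 2", "OWNS", "Villa Holding Ltd"),
    ("Provider X", "REGISTERED", "Shell Co 1"),
]

def build_graph(relationships):
    """Index relationships by node, ignoring direction so chains can be followed either way."""
    graph = {}
    for source, _kind, target in relationships:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set()).add(source)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

graph = build_graph(RELATIONSHIPS)
path = shortest_path(graph, "Person A", "Villa Holding Ltd")
print(" -> ".join(path))
```

In this toy data no single record links Person A to Villa Holding Ltd. The connection only appears once the two intermediate companies are chained together, which is the kind of indirect link a graph database is built to surface at scale.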
In the future, Romera said he would like to develop ways of
exporting structured data recorded in emails and other documents directly into
Neo4j.
That could include, for example, extracting the names of
people who sent emails and the people who received them, to create a map showing
the relationship between individuals and organisations.
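Romera describes this as future work, so the following is only a sketch of what the metadata extraction step might look like, not the ICIJ's design. It uses Python's standard `email` package to pull sender and recipient addresses from raw messages and counts how often each pair occurs. The messages are invented, and loading the resulting pairs into Neo4j is left out.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses

def sender_recipient_pairs(raw_email):
    """Return the set of (sender, recipient) address pairs in one raw email."""
    message = message_from_string(raw_email)
    senders = {
        addr.lower()
        for _name, addr in getaddresses(message.get_all("From", []))
        if addr
    }
    recipients = {
        addr.lower()
        for header in ("To", "Cc")
        for _name, addr in getaddresses(message.get_all(header, []))
        if addr
    }
    return {(sender, recipient) for sender in senders for recipient in recipients}

# Invented messages using reserved example.com addresses.
MESSAGES = [
    "From: Agent <agent@example.com>\n"
    "To: Client <client@example.com>, lawyer@example.com\n"
    "Cc: Client <client@example.com>\n"
    "Subject: Incorporation\n"
    "\n"
    "Documents attached.\n",
    "From: agent@example.com\n"
    "To: client@example.com\n"
    "Subject: Follow-up\n"
    "\n"
    "Please confirm.\n",
]

# Count how many messages link each pair; the counts could become edge weights.
edges = Counter()
for raw in MESSAGES:
    edges.update(sender_recipient_pairs(raw))
```

Each pair is counted once per message, so an address that appears in both the To and Cc headers does not inflate the weight of that relationship.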
“We would like to be able, with Neo4j, to export all this
metadata we have in the documents to generate automatically a graph of
relationships between people inside our data,” he said. “If we managed to do
that, the power of graph database is absolutely central.”
It is likely to take at least six or eight months to build
the first prototype, and Romera hopes to work with the open source community to
develop the technology.
Meanwhile, the ICIJ is making the results of the graph
database available on its website. The Power Players is an interactive graphic
that shows the relationship between world leaders and their offshore companies,
along with links to redacted copies of selected documents.
Romera said the ICIJ has good reasons for not making the
trove of leaked documents fully public. The organisation works with its partner
publications to ensure they follow a strict methodology to protect the data and
to fact-check it, he said.
“I think that is one of the reasons why, after so many years
and after exposing so many companies, we have never got sued,” he added.
Keeping both journalists and leaked documents secure is
another priority. The ICIJ works with journalists in 117 countries.
“Journalists could be monitored or could be targeted because
of this investigation or other investigations they are working on,” said
Díaz-Struck.
Romera has a small team in Spain that carry out regular
threat monitoring and security testing of the ICIJ’s servers. “Journalists are
not always tech-savvy,” he said. “So we try to make user-friendly interfaces
that are also secure.”
Emails are encrypted with PGP and journalists use two-factor
authentication to access a single sign-on platform that gives them access to
Datashare and I-Hub.
In some cases, the ICIJ also supplements this security with
SSL certificates, which journalists can install on their computer to provide an
additional layer of authentication.
The risks are real. The ICIJ was hit by a cyber attack as it
began publishing the first stories from the Pandora Papers.
The first distributed denial of service (DDoS) attack hit
the website on the Sunday evening when the first stories were published,
bombarding it with messages. The next attack was “much smarter” and managed to
make the website inaccessible for several hours.
“We have to be very careful because most of the time, a DDoS
attack can be used to hide an attempt to penetrate the system,” said Romera.
The ICIJ’s mission is not to train journalists to work
securely, but because of the sensitivity of its work, training is essential, he
added. “Because of this investigation, there are now 627 journalists all around
the world who know how to use PGP, that were not aware of this kind of
technology.”
Romera and the ICIJ develop new capabilities in Datashare in
response to requests from journalists carrying out the research. That might
include, for example, the ability to search a new type of document, or carry
out a new type of search.
“When reporters want to look for something, they just tell
us and if Datashare is not able to find it for them, we try to build that in
the future,” he said.
The ICIJ is developing a desktop version of Datashare that
will allow journalists to search documents on their computers and share alerts
with other journalists in their network.
For example, a journalist could mark a politician as a
subject of interest in their desktop version of Datashare and receive alerts
when other journalists identify documents containing details about the same
politician.
The journalists will have the option of contacting each
other and sharing documents if they want to.
The ICIJ, which is working in collaboration with researchers
at the Swiss university École Polytechnique Fédérale de Lausanne (EPFL), is
about to develop the first prototype of the technology, which will allow ICIJ
members to set up their own collaborative investigations.
“We want to provide that network of investigative
journalists with some kind of software that mimics the way we all work together
on investigations,” said Romera. “If they want to start collaborating on
documents together, that is the ideal scenario.”
For Romera, the project has proved that the ICIJ has the
capability to turn around large document leaks extremely quickly. “Now when we
get such a big leak, we are able to make it available for research in, maybe 15
days, whereas before it would have taken months of troubleshooting,” he said.