January 24, 2006 7:30 PM PST

Editing tips from the NSA

Hiding confidential information with black marks works on printed copy, but not with electronic documents, the National Security Agency has warned government officials.

The agency makes the point in a guidance paper on editing documents for release, published last month following several embarrassing incidents in which sensitive data was unintentionally included in computer documents and exposed. The 13-page paper is called "Redacting with confidence: How to safely publish sanitized reports converted from Word to PDF."

Instead of covering up digital text with black boxes, it is better to delete any information you don't want to share, the NSA suggested.

"The key concept for understanding the issues that lead to...inadvertent exposure is that information hidden or covered in a computer document can almost always be recovered," the NSA wrote in the Information Assurance Division paper, dated Dec. 13 but only recently posted to the Web. "The way to avoid exposure is to ensure that sensitive information is not just visually hidden or made illegible, but is actually removed."

Three common mistakes

There are a number of pitfalls for people trying to amend a sensitive Word document for public release as a PDF. Here is the NSA's advice on typical traps.

Redaction of text and diagrams
Covering text, charts, tables or diagrams with black rectangles, or highlighting text in black...is not effective, in general, for computer documents distributed across computer networks (i.e. in "softcopy" format). The most common mistake is covering text with black.

Redaction of images
Covering up parts of an image with separate graphics such as black rectangles, or making images "unreadable" by reducing their size, has also been used for redaction of hardcopy printed materials. It is generally not effective for computer documents distributed in softcopy form.

Metadata and document properties
In addition to the visible content of a document, most office tools, such as (Microsoft) Word, contain substantial hidden information about the document. This information is often as sensitive as the original document, and its presence in downgraded or sanitized documents has historically led to compromise.

Source: NSA Information Assurance Division report

The unintended disclosure of metadata, resulting in high-profile leaks of secrets, has led to red faces at businesses and government bodies in the past. In March 2004, a gaffe by the SCO Group revealed which companies it had considered targeting in its legal campaign against Linux users.

More recently, pharmaceutical giant Merck was put in the hot seat because of changes made to a document regarding the painkiller Vioxx. There have also been document data leaks at the White House, the Pentagon, the United Nations and others, according to compiled research from Workshare, a maker of software that strips tell-tale hidden data out of files.

There have been so many stumbles that the NSA document should be welcome help, said Pete Lindstrom, an analyst with Spire Security in Malvern, Pa.

"It ends up being a really big exercise in public humility because it is an embarrassing issue," he said. "It affects governments more than anyone else."

Cleaning up
Government analysts make three main missteps that will jeopardize confidentiality when sanitizing documents, according to the NSA report. "The most common mistake is covering text with black," the agency said. While this works for printed material, "it is not effective, in general, for computer documents."
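To see why a black mark fails on softcopy, consider a minimal sketch in Python. It is a toy model, not a real Word or PDF parser: the `TextRun` and `BlackBox` classes and all function names are hypothetical, chosen only to show that drawing a shape over text leaves the text in the file, while replacing the text removes it.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A piece of text stored in the (toy) document."""
    text: str


@dataclass
class BlackBox:
    """A black rectangle drawn over the page item at index `covers`."""
    covers: int


def extract_text(page):
    """Return all text stored on the page, ignoring drawn shapes."""
    return " ".join(item.text for item in page if isinstance(item, TextRun))


def cover_with_black(page, index):
    """Unsafe: adds a rectangle on top; the covered text stays in the file."""
    return page + [BlackBox(covers=index)]


def redact(page, index, filler="X"):
    """Safer: replace the text itself with filler of the same length."""
    cleaned = list(page)
    cleaned[index] = TextRun(filler * len(page[index].text))
    return cleaned


page = [TextRun("Budget approved by"), TextRun("Agent Smith")]
covered = cover_with_black(page, 1)
removed = redact(page, 1)

print(extract_text(covered))  # the name is still recoverable
print(extract_text(removed))  # the name is gone, the layout length is kept
```

Anyone who copies the text out of the "covered" version, or deletes the rectangle, gets the original wording back. Only the second approach changes what is actually stored.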

The second top goof is similar: In this case, workers cover up graphics and other images with new graphics, such as a black rectangle. As with blacked-out text, a recipient of the document can often delete the coverings and see the information that is intended to be hidden. The third gaffe is failure to remove information about the document, such as change history, author name and creation dates, known as metadata.
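The metadata problem can be illustrated with a short Python sketch. It assumes the newer zip-based .docx format rather than the binary .doc files that Word 2003 produced, and it reads only the core properties part (`docProps/core.xml`). The helper names and the sample author values are hypothetical, and the sample archive is a stand-in that contains just that one part, not a complete Word file.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the core properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Friendly label -> element that may reveal who wrote the file and when.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(docx_file):
    """Return the identifying properties found in docProps/core.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[label] = node.text
    return found


def make_sample_docx():
    """Build an in-memory stand-in archive holding only the properties part."""
    core = (
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s" xmlns:dcterms="%s">'
        "<dc:creator>J. Analyst</dc:creator>"
        "<cp:lastModifiedBy>Reviewer Two</cp:lastModifiedBy>"
        "<dcterms:created>2005-12-13T09:00:00Z</dcterms:created>"
        "</cp:coreProperties>"
    ) % (NS["cp"], NS["dc"], NS["dcterms"])
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


props = read_core_properties(make_sample_docx())
for label, value in props.items():
    print(label, "=", value)
```

A check like this only lists what is present. It does not cover change history or other hidden content, so it is no substitute for the removal steps the NSA paper describes.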

To avoid such blunders, the NSA paper gives step-by-step instructions on how to strip a Microsoft Word document of confidential information and then convert it to an Adobe Systems PDF file. The advice deals with text passages and images in the document, as well as with metadata.

Both the Word and Adobe PDF formats can contain many kinds of information--such as text, graphics, tables, images and metadata--all mixed together. "The complexity makes them potential vehicles for exposing information unintentionally, especially when downgrading or sanitizing classified materials," the NSA said.

Microsoft Word is used throughout the Department of Defense and the intelligence community, while Adobe PDF is used "very extensively" by all parts of the U.S. government and military services, the agency said. It noted that government bodies often distribute cleaned-up documents in PDF format, and cautioned: "As numerous people have learned to their chagrin, merely converting an MS Word document to PDF does not remove all metadata automatically."

Metadata methodology
Metadata could become an increasing problem in the future, Gartner analysts warned recently. Vista, the next version of the Microsoft Windows operating system, will let people tag files with metadata to improve search capabilities, Microsoft has said. But those tags could lead to unwanted disclosure of information, Gartner analysts said.

Microsoft provides some tools to remove metadata in its Office applications and built a feature into Word 2003 to remove personal information. However, these do not remove sensitive data from the main document, nor do they remove all metadata of possible concern, the NSA said.

Adobe supports the agency's guidance for proper editing techniques and is developing additional documentation for other customers, John Landwehr, director of security solutions and strategy for the San Jose, Calif., technology company, said in a statement via e-mail.

"As the NSA points out, it's very important to actually remove the redacted content from an electronic document--not just leave the data in a document and attempt to graphically cover it," he said.

Following the guidelines will effectively clean a document, said Joe Fantuzzi, chief executive of San Francisco-based Workshare, but could be challenging for the less tech-savvy.

"They are way too complicated. It is going to take too long for people to do the right thing, and people are going to continue to make mistakes," he said.

Meanwhile, the NSA paper itself contains a bit of metadata. According to its cover the paper was created on Dec. 13, 2005. The properties of the Adobe PDF file, however, state the document was created on Jan. 10, 2006.
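A creation date of this kind can be stored inside a PDF, in the `/CreationDate` entry of its document information dictionary. The Python sketch below is a rough way to look for it. It assumes the entry appears as an uncompressed, unencrypted literal string, which is not true of every PDF, and the function name and sample bytes are invented for illustration; the sample is not taken from the NSA file.

```python
import re
from datetime import date

# PDF dates look like (D:YYYYMMDDHHmmSS...); only the day is captured here.
CREATION_DATE = re.compile(rb"/CreationDate\s*\(D:(\d{4})(\d{2})(\d{2})")


def pdf_creation_date(pdf_bytes):
    """Return the embedded creation date, or None if no plain entry is found."""
    match = CREATION_DATE.search(pdf_bytes)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Made-up fragment of a PDF file, for illustration only.
sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Title (Sample) /CreationDate (D:20060110093000-05'00') >>\n"
    b"endobj\n"
)

cover_date = date(2005, 12, 13)  # date printed on the cover of the sample
file_date = pdf_creation_date(sample)

print("embedded creation date:", file_date)
print("differs from cover date:", file_date != cover_date)
```

For real files, a dedicated PDF library is a more reliable choice than a regular expression, because the information dictionary may be compressed or encrypted.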


Join the conversation!
Meta Data Affects SMBs too
The sba.gov site notes how much Small Business contributes to the U.S. infrastructure. Meta Data puts Small Business entities at risk too.

Microsoft publishes how meta data can be removed from such things as Word Documents: http://support.microsoft.com/default.aspx?scid=kb;en-us;290945

Prospective clients can possibly see other client data when meta data is used - http://www.essentialsecurity.com/Documents/article12.htm
Posted by marileev (292 comments )
Free tool works, too
For non-business use there is a great, free tool called Doc Scrubber (www.docscrubber.com) that I have been using for years to strip out all metadata. It makes the job much easier and is more thorough than the cumbersome methods Microsoft prescribes. I just wish they would license a commercial version!
Posted by curtiscarmack (20 comments )
that any government agency would think they could advise people on any sort of computer security issue.

I'm not digging at the IT people within these departments, who must have a hellish job trying to persuade the typical ******, middle-management tw@t that runs these departments to exercise a modicum of common sense, rather that people who can't stop children hacking into their networks, leave sensitive information lying around in word documents and regularly leave laptops full of top secret data at airports or in taxis.
Posted by ajbright (447 comments )
Flawed Premise
I'm all in favor of spreading awareness that metadata and hidden content are dangerous, but the problem can be solved without resorting to the tortured workflow they present. Too bad these folks didn't search on "PDF redaction" before they wrote the paper.
Posted by (3 comments )
The creation date is not relevant metadata
The author thinks he is clever for pointing out that the NSA paper contains "metadata" in the form of a creation date for the PDF. That creation date is in your filesystem, not necessarily the document. Depending on your web browser, and how you saved the PDF file from your web browser - it may even mean on the date it was downloaded.

That metadata is a relic of the file system - not the document. The fact that the date doesn't agree with the document date is insignificant and irrelevant. The document could well have been created in 2005, printed and distributed internally - and the PDF could have recently been created for online distribution.

If you consider creation and modification dates sensitive metadata, and you think that's what it is talking about - you need to find another job. Technical writing is not for you.
Posted by ryebrye (6 comments )
Useful guidelines from NSA
These help to preserve the document in its original electronic format - here was the feedback I sent to the NSA on their guidelines:

Interestingly - they do not like/trust the new Word 2003 redaction tool. It might be one of those "prove a negative" situations - can you prove there is no metadata left in the document that might cause embarrassment?

It would be good if they defined what metadata is being stripped, some of it may legitimately be considered a portion of the exact original record and therefore relevant.

They should move away from converting from an original electronic format of a Word doc to a searchable PDF (is it searchable? or an image only result - unclear) - because this may reduce the ADA compliance of the resulting document for use by disabled for reading the document with alternative interface software (Machine readers etc).

Also, converting to a PDF reduces some of the abilities to search for information as it was contained in the Word format. For example, regular expressions in a Word doc search are probably not possible in a searchable PDF.

http://office.microsoft.com/en-us/assistance/HA010873051033.aspx
Any single character except the characters in the range inside the brackets [!x-z] t[!a-m]ck finds "tock" and "tuck," but not "tack" or "tick."

I like that they understand the concept of preserving a document format at a page and line count level by replacing redacted information with XXX in the same amount of space. They should promote this as the default way to redact.

Nice coverage on how to redact images.
would like to see them replace redacted information with the cited withholding justification - i.e. rather than:
XXXsec 2.12 iiiXX

as i have seen on other FOIA responses - typed on the side of a redacted portion.

It might also be useful to use the marked out space to suggest an alternate way to achieve the information as well.
Posted by kimocrossman (31 comments )
Is This A Simple Solution?
Since most new versions of Word export to HTML and import HTML, why not export to HTML, edit the HTML to remove any confidential information, then re-save the cleaned up document?

That should eliminate any metadata since it would all be visible in HTML.
Posted by bluemist9999 (1020 comments )
Blocking out information before release....
I totally know what it means by saying that blocking out text within a document with a black marker or text cover only works on certain documents.

I once came across some documents about myself that someone tried to black out certain parts, but when I held it under the light at a certain angle, I could read every word. My eyes are kind of funny that way, but if I had to black out anything within my own life endeavors, I will definitely visit this site more often.

Especially, since I am going to school online, so I can become a document specialist. I have always been fascinated with people's signatures, the color of ink, the way certain kinds of pencils look when they get sharpened, and the way a newly printed and completely perfect or imperfect document looks. I love to edit things and try my best to make it look better.

I have even found typos within the online school site that I attend, but I am not totally perfect either!

I would like to again inform you, about the immense enjoyment that I had in reading this site.
Posted by Chaya M Loyens (1 comment )
