Re: Terabyte Limit Doesn't Compute
As an archivist who formerly worked for a state government, I'm not the slightest bit surprised that 75% of the documents generated in the conflict turned out to be redundant and/or inconsequential. Usually, by the time documents reach an archives, they've been picked over by their creators, and many duplicates and low-content items have already been discarded; even then, it's quite common for the archivist to further "weed" multiple copies of memos, duplicates of reports filed elsewhere, etc. How much more weeding would you expect when, by the sound of it, this project sent the documents more or less straight to NARA without that filtering?
Beyond the cost of storage, a bloated, duplicate-ridden collection is also harder to search and use than a streamlined one. What Li Tan said is accurate: "Keep it all" would be akin to an individual carefully filing away every scrap of paper that ever came into her house, whether it was information-packed correspondence from distant family members or the fifteenth copy of the exact same Little Caesars flier. That's not helpful to researchers, and it's not an effective use of funds. I do believe great care must be taken in determining which materials are truly redundant, and there needs to be transparency about what's being kept versus what's being discarded, but it's extremely rare for "Keep it all" to be the appropriate response to the intake of a large collection.
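For what it's worth, the exact-duplicate part of that weeding is mechanically straightforward. Here's a minimal sketch (in Python, purely illustrative, not anything NARA actually runs) of flagging byte-identical files by grouping them on a content checksum:

    import hashlib
    import os
    import sys
    from collections import defaultdict

    def file_digest(path, chunk_size=1 << 20):
        """Return the SHA-256 hex digest of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root):
        """Group files under root by content hash; keep groups with >1 member."""
        groups = defaultdict(list)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                groups[file_digest(path)].append(path)
        return {h: paths for h, paths in groups.items() if len(paths) > 1}

    if __name__ == "__main__":
        for digest, paths in find_duplicates(sys.argv[1]).items():
            print(digest)
            for p in paths:
                print("  ", p)

Of course, hashing only catches byte-identical copies; the fifteenth copy of that flier with a different fax header on it needs fuzzier matching of the extracted text, and that's exactly where the "great care" I mentioned comes in.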
Also, can I just say that the idea of software that identifies potential privacy issues in the documents warms my heart? I can't tell you how much time I spent combing through materials we were about to provide to researchers to make sure we wouldn't be revealing Social Security numbers.
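A toy version of that kind of scan, again just a sketch and nothing like the actual tool, is a pattern match for strings shaped like Social Security numbers, with the big caveat that a real redaction workflow needs far more context to avoid false positives (dates, phone fragments, etc.):

    import re
    import sys

    # Matches NNN-NN-NNNN shapes; a real tool would also validate
    # area/group number ranges and weigh the surrounding context.
    SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

    def flag_possible_ssns(path):
        """Print lines in a text file containing SSN-shaped strings."""
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                for match in SSN_PATTERN.finditer(line):
                    print(f"{path}:{lineno}: possible SSN {match.group()}")

    if __name__ == "__main__":
        for filename in sys.argv[1:]:
            flag_possible_ssns(filename)

Even something this crude, run as a first pass, would have saved me many late nights of eyeballing photocopies.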