How to create a quasi-copy of a file [closed]

I would like to create a quasi-copy of a directory that contains sensitive data.

Then I would like to share this quasi-copy with others as so-called 'real data'.

This 'real data' would allow others to run tests related to storage performance.

My question is: how can I create a copy of any file (text, jpeg, sqlite.db, ...) that contains none of its original data, but from the point of view of compression, de-duplication and so on is very similar?

I would appreciate any pointers to tools or libraries that help with creating such a quasi-copy.

I would also appreciate pointers on what to measure, and how, to assess the similarity of the original file and its quasi-copy.



Solution 1:[1]

I don't know whether a "quasi-copy" is an established notion, or whether there are accepted rules and procedures for building one. But here is a crude take on how to "mask" data for protection: replace words with equal-length sequences of (perhaps adjusted) random characters. One cannot then do a very accurate storage analysis of the real data, but some accuracy has to be lost with any data scrambling.

One way to build such a "quasi-word", wrapped in a small program for convenience:

use warnings;
use strict;
use feature 'say';

use Scalar::Util qw(looks_like_number);

my $word = shift // die "Usage: $0 word\n";

my @alphabet = 'a'..'z';

my $quasi_word;

# Replace each character with a random one of the same kind:
# digits become random digits, everything else a random lowercase letter,
# so the replacement has the same length as the original word.
foreach my $c (split '', $word) {
    if (looks_like_number($c)) {
        $quasi_word .= int rand 10;
    }
    else {
        $quasi_word .= $alphabet[int rand 26];
    }
}

say $quasi_word;

This doesn't cut it at all for a de-duplication analysis, though. For that, one can replace repeated words with the same random sequence, for example as follows.

First make a pass over the words in the file and build a frequency hash of how many times each word appears. Then, as each word is processed, check whether it repeats; if it does, build a random replacement only the first time and reuse it for every later occurrence, as sketched below.
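Here is a minimal sketch of that idea. Instead of a separate frequency-counting pass, it simply caches the first replacement generated for each word and reuses it for repeats, so identical words in the input map to identical quasi-words in the output. Treating words as whitespace-separated tokens is an assumption; adjust the tokenization to your data.

use warnings;
use strict;

use Scalar::Util qw(looks_like_number);

my @alphabet = 'a'..'z';

# Cache of word => its replacement, so repeated words mask identically
my %replacement;

sub quasi_word {
    my ($word) = @_;
    return $replacement{$word} //= join '',
        map { looks_like_number($_) ? int rand 10 : $alphabet[int rand 26] }
        split //, $word;
}

# Mask every whitespace-separated token, keeping whitespace and line structure
while (my $line = <>) {
    $line =~ s/(\S+)/quasi_word($1)/ge;
    print $line;
}

Run it as, for instance, perl mask.pl data.txt > data.masked (the file names here are only placeholders).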

Further adjustments for specific needs should be easy to add.

Of course, any full masking (scrambling, tokenization, ...) of the data cannot allow a precise analysis of how the real data would compress, since the analysis is run on a mangled set.

If you know which specific parts are sensitive, then only those can be masked, and that would improve the accuracy of the subsequent analyses considerably. See the sketch below.
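For example, here is a minimal sketch that masks only the parts matching a pattern and leaves everything else intact. The patterns used (email-like tokens and long digit runs) are purely illustrative assumptions; substitute whatever actually identifies the sensitive fields in your data.

use warnings;
use strict;

# Illustrative patterns only: email-like tokens and runs of 9 or more digits
my $sensitive = qr/[\w.+-]+\@[\w.-]+|\d{9,}/;

# Same-length random replacement: digits for digits, letters for everything else
sub mask {
    my ($text) = @_;
    return join '', map { /\d/ ? int rand 10 : ('a'..'z')[int rand 26] } split //, $text;
}

while (my $line = <>) {
    $line =~ s/($sensitive)/mask($1)/ge;
    print $line;
}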

This will be slow, but if such a set of files only needs to be built once in a while it shouldn't matter.


A question was raised about the "criteria" for the "similarity" of masked data, and the question itself was closed for lack of detail. Here is a comment on that.

It seems that the only "measure" of "similarity" is simply whether the copy behaves the same in the storage-performance analysis as the real data would. But one can't tell without running that analysis on the real data, which would of course reveal that data.

The one way I can think of is to build a copy using a candidate approach and then run the individual components of that analysis on it yourself. Does it compress (roughly) the same? How about de-duplication? And so on. Then make your choices.
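As a concrete example of one such check, here is a minimal sketch that compares gzip compression ratios of the original file and its quasi-copy. It assumes both files fit comfortably in memory, and gzip at its default level is an arbitrary choice of compressor.

use warnings;
use strict;

use IO::Compress::Gzip qw(gzip $GzipError);

# Compressed size divided by original size, as a rough compressibility measure
sub gzip_ratio {
    my ($file) = @_;
    my $data = do {
        open my $fh, '<:raw', $file or die "Can't open $file: $!";
        local $/;
        <$fh>;
    };
    gzip \$data => \my $compressed or die "gzip failed: $GzipError";
    return length($compressed) / length($data);
}

my ($orig, $copy) = @ARGV;
die "Usage: $0 original quasi-copy\n" if not defined $copy;

printf "original:   %.3f\n", gzip_ratio($orig);
printf "quasi-copy: %.3f\n", gzip_ratio($copy);

If the two ratios differ a lot, the masking changed the statistical structure too much for a meaningful compression comparison.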

If the approach used is flexible enough, the masking can then be adjusted for whichever part of the analysis "failed", that is, where the copy behaved substantially differently. (If compression came out very different, perhaps refine the algorithm to study the words more closely and produce a more similar obfuscation, etc.)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
