'Keep k random strings from input file of unknown length in C
Problem
I want to get input from a file of unknown length, separating the input to strings at a defined delimiter and saving some of them. After reading the entire file, I want to have in an array of strings, k random strings from the input that was read. Each string must have equal probability (uniform distribution) of being kept in the final array.
Important Part
I don't want to read the file, save all the strings somewhere and then pick k of them at random, which has equal probability (uniform distribution) for each string. I want the same result but using a fixed length array of k strings, changing/overwriting strings until end of file is reached.
Summary of Goals
- All the strings have equal probability to be kept in the final array.
- No more than
kstrings will be saved in memory (or temporary file) while the program is executing.
Expected Behavior
Contents of input file for testing
Assuming delimeter is ':' but can be changed to anything you wish.
ok:maybe:not:something:yes:off:blue:donut:A:B:C:D:E:F:G:
Output
A random sequence of the strings in file separated by the delimiter. Assuming k = 5 one sequence would look something like the following (format it however you want):
0: not
1: A
2: maybe
3: yes
4: blue
Purpose of my Question
Long story short, accuracy in uniform distribution.
Another question was asked at some point, which is almost identical to mine and i tried to answer it myself. BUT my question is not about solving the initial problem. My attempt of answering the other question had the logic i describe in this question but it has a major weakness, which is non-uniform distribution of the probabilities of the strings that end up in the final array. This issue was mentioned from another user at the comments of my answer, so i checked the information he provided and did some research on the topic myself but still, i wasn't satisfied with what i found. So this question emphasizes on finding an accurate way to implement uniform distribution for this problem and not just solving the problem like the other question. Also, the other question may was a homework problem, but it already has an accepted answer. So the only purpose of my question is out of pure interest and not for homework cheating.
Here are some links from what i have already checked and may be useful to someone who tries to answer this question.
- Similar question, not emphasized on uniform distribution
- Similar question, not sure for uniform distribution
- Reservoir Sampling
- Fisher–Yates shuffle
My Approach
This section contains my setup and assumptions on how i am going to work on it. Ignore it completely if you prefer your own way to approach it, or use it if you want to.
My Setup
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#define DELIMITER ':'
FILE* getInputSource(char* source);
char** pickStrings(char* source, int k);
void printStrings(char** strings, int k);
void freeStrings(char** strings, int k);
// extra declerations
int main(int argc, char const *argv[]) {
srand(time(NULL));
int toKeep = 5;
char** strings = pickStrings("input.txt",toKeep);
printStrings(strings,toKeep);
freeStrings(strings,toKeep);
return 0;
}
char** pickStrings(char* source, int k) {
FILE* input = getInputSource(source);
char** strings = (char**) calloc(k,sizeof(char*));
// implementation
return strings;
}
// extra implementations
FILE* getInputSource(char* source) {
FILE* input = fopen(source,"r");
if( !input ) {
printf("Unable to open input file.\n");
exit(0);
}
return input;
}
void printStrings(char** strings, int k) {
for( int i = 0; i<k; i++ )
printf("%d: %s\n", i, strings[i]);
}
void freeStrings(char** strings, int k) {
for( int i = 0; i < k; i++)
free(strings[i]);
free(strings);
}
My Assumptions
- Each string length is unknown. Either define a fixed length for example 256 chars per string assuming the input file won't contain strings larger than 255 chars, or dynamically allocate memory as much as needed for each string.
- Strings can contain anything except the defined delimiter which is used to seperate them.
- Delimiter is used at the last string as well.
- If input contains less than
kstrings, you can fill first slots of the array (at random order) and leave the rest of the array empty.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
