Reimplement an algorithm to create a refine list
I'm trying to reimplement an algorithm that creates a refine keywords list. I don't have the original source code, only the tool's .exe file, so all I have is the input and the expected output.
The problem is that the output of my function doesn't match the output of the original tool. Here's the code I'm using:
string[] inputLines = File.ReadAllLines("Input.txt");
Dictionary<string, int> keywordsCount = new Dictionary<string, int>();
List<string> refineList = new List<string>();

//Get Keywords Count
foreach (string fileName in inputLines)
{
    string[] fileNameSplitted = fileName.Split('_');
    for (int i = 0; i < fileNameSplitted.Length; i++)
    {
        string currentKeyWord = fileNameSplitted[i];
        if (!string.Equals(currentKeyWord, "SFX", StringComparison.OrdinalIgnoreCase))
        {
            if (keywordsCount.ContainsKey(fileNameSplitted[i]))
            {
                keywordsCount[fileNameSplitted[i]] += 1;
            }
            else
            {
                keywordsCount.Add(fileNameSplitted[i], 1);
            }
        }
    }
}

//Get final keywords
foreach (KeyValuePair<string, int> keyword in keywordsCount)
{
    if (keyword.Value > 2 && keyword.Key.Length > 2)
    {
        refineList.Add(keyword.Key);
    }
}
The input file:
SFX_AMB_BIRDSONG
SFX_AMB_BIRDSONG_MISC
SFX_AMB_BIRDSONG_SEAGULL
SFX_AMB_BIRDSONG_SEAGULL_BUSY
SFX_AMB_BIRDSONG_VULTURE
SFX_AMB_CAVES_DRIP
SFX_AMB_CAVES_DRIP_AUTO
SFX_AMB_CAVES_LOOP
SFX_AMB_DESERT_CICADAS
SFX_AMB_EARTHQUAKE
SFX_AMB_EARTHQUAKE_SHORT
SFX_AMB_EARTHQUAKE_STREAMED
SFX_AMB_FIRE_BURNING
SFX_AMB_FIRE_CAMP_FIRE
SFX_AMB_FIRE_JET
SFX_AMB_FIRE_LAVA
SFX_AMB_FIRE_LAVA_DEEP
SFX_AMB_FIRE_LAVA_JET1
SFX_AMB_FIRE_LAVA_JET2
SFX_AMB_FIRE_LAVA_JET3
SFX_AMB_FIRE_LAVA_JET_STOP
SFX_AMB_UNDW_BUBBLE_RELEASE
SFX_AMB_UNDW_BUBBLE_RELEASE_AUTO
SFX_AMB_WATER_BEACH1
SFX_AMB_WATER_BEACH2
SFX_AMB_WATER_BEACH3
SFX_AMB_WATER_CANALS
SFX_AMB_WATER_FALL_HUGE
SFX_AMB_WATER_FALL_NORMAL
SFX_AMB_WATER_FALL_NORMAL2
SFX_AMB_WATER_FALL_NORMAL3
SFX_AMB_WATER_FOUNTAIN
SFX_CS_LUX_PORTAL_LIGHTNING
SFX_CS_LUX_PORTAL_LIGHTNING1
SFX_CS_LUX_PORTAL_LIGHTNING2
SFX_CS_LUX_PRIEST_COWER
SFX_CS_LUX_PRIEST_MEDAL
SFX_CS_LUX_PRIEST_MEDITATE
SFX_CS_LUX_PRIEST_SCREAM
SFX_CS_LUX_PRIEST_SNIFF1
SFX_CS_LUX_PRIEST_SNIFF2
SFX_CS_LUX_PRIEST_SPIRITS
SFX_CS_LUX_PRIEST_SPIRITS2
SFX_CS_LUX_PRIEST_SPIRITS3
SFX_CS_LUX_PRIEST_SURPRISE
SFX_MON_BM05_TOO_WALK1
SFX_MON_BM05_TOO_WALK2
SFX_MON_BM06_SQU_WALK1
SFX_MON_BM06_SQU_WALK2
SFX_MON_BR06_HAL_ATTACK1
SFX_MON_BR06_HAL_ATTACK2
SFX_MON_BR06_HAL_DIE
SFX_MON_BR06_HAL_HIT
SFX_MON_BR06_HAL_IDLE
SFX_MON_BR06_HAL_IDLE_EATING
SFX_MON_BR06_HAL_LAND1
SFX_MON_BR06_HAL_LAND2
SFX_MON_BR06_HAL_SCRAPE
SFX_MON_BR06_HAL_SLAM
SFX_MON_BR06_HAL_SURPRISE
SFX_MON_BR06_HAL_WALK1
SFX_MON_BR06_HAL_WALK2
SFX_MON_BU01_MUM_ATTACK1
SFX_MON_BU01_MUM_ATTACK2
SFX_MON_BU01_MUM_DIE
SFX_MON_BU01_MUM_HIT
SFX_MON_BU01_MUM_IDLE_RETRIEVE
SFX_MON_BU01_MUM_IDLE_RETRIEVE_GROW
SFX_MON_BU01_MUM_SURPRISE
SFX_MON_BU01_MUM_WALK1
SFX_MON_BU01_MUM_WALK2
SFX_WATER_SPLASH_BIG
SFX_WATER_SPLASH_BIG1
SFX_WATER_SPLASH_BIG2
SFX_WATER_SPLASH_BIG3
SFX_WATER_SPLASH_MED1
SFX_WATER_SPLASH_MED2
SFX_WATER_SPLASH_MED3
SFX_WATER_SPLASH_MEDIUM
SFX_WATER_SPLASH_OUT
SFX_WATER_SPLASH_OUT1
SFX_WATER_SPLASH_OUT2
SFX_WATER_SPLASH_SMALL
And the expected output (from the original tool):
AMB
MON
WATER
LUX
BR06
HAL
SPLASH
PRIEST
FIRE
BU01
MUM
LAVA
BIRDSONG
WALK1
WALK2
JET
IDLE
EARTHQUAKE
FALL
SURPRISE
BIG
CAVES
What should I modify so that my method matches the original output?
Thanks in advance!
EDIT: I've made some new discoveries:
-> It is a method of approximately 100-130 lines.
-> It uses the Visual Basic functions InStr, Len, Right and Left.
-> It discards the word "SFX" and all words shorter than 3 characters.
-> It uses a combobox as a temporary list, where it puts all the words that appear more than once. From there it picks some of the words, which are the ones shown in the combobox visible to the user.
-> For the first test case, which I posted above, this is the list of discarded words:
UNDW
BM05
BM06
SEAGULL
DRIP
BUBBLE
PORTAL
TOO
SQU
OUT
AUTO
RELEASE
NORMAL
LIGHTNING
SPIRITS
ATTACK1
ATTACK2
DIE
HIT
RETRIEVE
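The first stage described in the EDIT (build a temporary list of words that appear more than once) could be sketched roughly as below. This is only a guess, not the tool's actual code: NORMAL is in the discarded list even though it occurs as a standalone word only once (the other occurrences are NORMAL2 and NORMAL3), which hints that the tool matches substrings, as InStr would. The sketch therefore assumes that a word counts as repeated when it occurs as a substring of more than one file name. `CandidateSketch` and `GetCandidates` are names invented for illustration, and `IndexOf` stands in for InStr.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CandidateSketch
{
    // Hypothetical helper: returns every word (3+ characters, not "SFX") that
    // occurs as a substring of more than one input line, in order of first
    // appearance. The substring rule is an assumption based on the InStr hint.
    public static List<string> GetCandidates(IEnumerable<string> lines)
    {
        List<string> names = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var candidates = new List<string>();

        foreach (string name in names)
        {
            foreach (string word in name.Split('_'))
            {
                // Skip short words, the "SFX" prefix, and words already examined.
                if (word.Length < 3 || word == "SFX" || !seen.Add(word))
                    continue;

                // Count the file names that contain the word anywhere
                // (so "NORMAL" also matches "NORMAL2" and "NORMAL3").
                int hits = names.Count(n => n.IndexOf(word, StringComparison.Ordinal) >= 0);
                if (hits > 1)
                    candidates.Add(word);
            }
        }
        return candidates;
    }
}
```

Calling `CandidateSketch.GetCandidates(File.ReadAllLines("Input.txt"))` would give the temporary list under this assumption. It does not explain which words are then discarded, and a plain substring test can also match a word inside an unrelated longer word.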
Solution 1:[1]
How about taking it as a block of text, splitting on line endings or underscores and getting the unique remnants:
File.ReadAllText(path)
    .Split(new[] { '\r', '\n', '_' }, StringSplitOptions.RemoveEmptyEntries)
    .Distinct();
Hang on... maybe it's only words of three or more characters that appear three or more times:
File.ReadAllText(path)
    .Split(new[] { '\r', '\n', '_' }, StringSplitOptions.RemoveEmptyEntries)
    .GroupBy(w => w)
    .Where(g => g.Key.Length > 2 && g.Count() > 2)
    .Select(g => g.Key);
If you have a fixed list of words to exclude, you can append e.g. .Except(new[]{ "SFX", "..." }) to the end.
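Putting the pieces of this answer together, a minimal sketch might look like the following. `RefineSketch` and `GetKeywords` are names invented for illustration, and the exclusion list is passed in by the caller rather than taken from the original tool.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RefineSketch
{
    // Hypothetical helper: words of 3+ characters that occur 3+ times as
    // exact tokens, minus a caller-supplied exclusion list.
    public static List<string> GetKeywords(string text, params string[] excluded)
    {
        return text
            .Split(new[] { '\r', '\n', '_' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)                                 // one group per distinct word
            .Where(g => g.Key.Length > 2 && g.Count() > 2)   // length and frequency filter
            .Select(g => g.Key)
            .Except(excluded)                                // drop fixed words such as "SFX"
            .ToList();
    }
}
```

A call such as `RefineSketch.GetKeywords(File.ReadAllText("Input.txt"), "SFX")` would apply it to the question's file. Because it counts exact tokens only, a word like BIG, which appears once on its own and otherwise as BIG1, BIG2 and BIG3, is not picked up, so this alone does not reproduce the expected output.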
Solution 2:[2]
You can do it with plain LINQ: use GroupBy and convert the result to a dictionary. On that dictionary you can add further criteria, e.g. a minimum number of occurrences. You don't need several if-else conditions, and the code stays pretty readable:
string[] inputLines = File.ReadAllLines("Input.txt");
var output = inputLines
    .SelectMany(s =>
        s.Split('_')
            .Where(w => w != "SFX")
    )
    .GroupBy(g => g)
    .ToDictionary(s => s.Key, s => s.Count())
    .Where(w => w.Key.Length > 2 && w.Value > 2);
Solution 3:[3]
I gave it a go. Can't figure out the ordering, and the performance is not top notch, but you get your required output selection for your one given example.
"SFX" could be excluded due to being (a) contained in all input items, or (b) the very first part of each input item, but I have kept it as a hard-coded string to exclude, in addition to "PORTAL". I really have no idea why "PORTAL" is excluded in the output.
Here, Input is a string[] with the example input provided in the question post.
var excludedWords = new[] { "SFX", "PORTAL" };

var feasibleWords = Input
    .SelectMany(str => str.Split('_'))
    .Where(word =>
        word.Length > 2 &&
        !excludedWords.Contains(word));

var repeatedWords = feasibleWords
    .GroupBy(word => word)
    .Where(gr => gr.Count() > 2)
    .ToDictionary(
        gr => gr.Key,
        gr => gr.Count());

var serialWords = feasibleWords
    .Except(repeatedWords.Keys)
    .GroupBy(word => Regex.Replace(word, @"[\d]", string.Empty))
    .Where(gr =>
        gr.Contains(gr.Key) &&
        gr.Count() > 3)
    .ToDictionary(
        gr => gr.Key,
        gr => gr.Count());

var output = repeatedWords.Concat(serialWords)
    .OrderByDescending(kvp => kvp.Value) // Doesn't add much value, but oh well
    .Select(kvp => kvp.Key);

Console.Write(string.Join(Environment.NewLine, output));
Prints:
AMB
MON
WATER
LUX
BR06
HAL
SPLASH
FIRE
PRIEST
BU01
MUM
LAVA
BIRDSONG
FALL
WALK1
WALK2
IDLE
JET
BIG
CAVES
EARTHQUAKE
SURPRISE
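The serialWords step in this answer is the least obvious part, so here it is isolated as a small sketch. It strips digits from each word, groups the words by what remains, and keeps a base word only if it also occurs un-numbered and the group has more than three distinct members. `SerialWordSketch`, `GetSerialWords` and the `minCount` parameter are names invented for illustration; the threshold of 3 is taken from the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SerialWordSketch
{
    // Hypothetical helper mirroring the serialWords query: group distinct words
    // by their digit-stripped form and keep bases that appear un-numbered and
    // have more than minCount distinct variants.
    public static Dictionary<string, int> GetSerialWords(IEnumerable<string> words, int minCount = 3)
    {
        return words
            .Distinct() // Except() in the answer also removes duplicates
            .GroupBy(word => Regex.Replace(word, @"\d", string.Empty))
            .Where(gr => gr.Contains(gr.Key) && gr.Count() > minCount)
            .ToDictionary(gr => gr.Key, gr => gr.Count());
    }
}
```

With the words BIG, BIG1, BIG2 and BIG3 the group for BIG has four members and is kept. OUT, OUT1 and OUT2 form a group of only three, and MED1, MED2 and MED3 have no un-numbered MED, so both are dropped.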
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |

