Product model extraction from text - Python, NLTK, Custom approach

I have a task I need to do and I think that AI would be the ideal solution. I'm a web developer with some experience in that field, but nothing I found online really pointed me in the right direction.

I discovered some NLP packages which could be quite handy for solving the problem, but I would like to know how to solve the problem without using any external packages.

DESCRIPTION OF THE PROBLEM:

Let's say that we have several strings. Strings consist of Toner Brand, Toner Model, and some additional information.

Examples:

  • 'Brother TN960 - Extra High Yield' - Model is TN960, brand is Brother
  • 'Brother 2PK TN932 - Extra high Yield' - Model is TN932, brand is Brother
  • 'Canon 2PK Extra high Yield 031' - Model is 031, brand is Canon
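For comparison, here is a naive non-AI baseline that uses only the standard library. It is a rough sketch built from nothing but the three examples above: the assumptions that quantity markers look like `2PK` and that the model is the first remaining token containing a digit are mine and may not hold on the real data. `guess_model` is a hypothetical helper name.

```python
import re

# Assumption: quantity markers look like "2PK" (digits followed by "PK").
QUANTITY = re.compile(r"^\d+PK$", re.IGNORECASE)


def guess_model(name):
    """Return the first token that contains a digit and is not a quantity marker."""
    # Treat the " - " separator as plain whitespace before splitting.
    for token in name.replace(" - ", " ").split():
        if QUANTITY.match(token):
            continue  # skip things like "2PK"
        if any(ch.isdigit() for ch in token):
            return token
    return None  # no candidate found


examples = [
    "Brother TN960 - Extra High Yield",
    "Brother 2PK TN932 - Extra high Yield",
    "Canon 2PK Extra high Yield 031",
]
for name in examples:
    print(name, "->", guess_model(name))
```

A rule like this would break as soon as the additional information contains another number (a page yield, for instance), which is part of why a learned approach seems attractive.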

Question:

How would I create and train an AI to recognize and extract only the model? I figured I would need unsupervised learning, since I don't know the output variables. If I want to extract TN960 from the string 'Brother TN960 - Extra High Yield', I don't know in advance that TN960 is a model, so I can't label it for supervised learning.

Although maybe the correct solution would be to mark each entity in the sentence? Like

  • Brother - Brand
  • 2PK - Quantity
  • TN960 - Toner
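To make the labeling idea concrete, this is roughly how I imagine the annotated data and per-token features could look in plain Python. It is only a sketch: the label names (`BRAND`, `QUANTITY`, `MODEL`, `OTHER`), the chosen features, and the helpers `token_features` and `build_training_rows` are all made up for illustration, and the model tokens are labeled `MODEL` here.

```python
# Hand-labeled examples: one list of (token, label) pairs per product name.
# The label names are hypothetical and chosen only for this illustration.
labeled = [
    [("Brother", "BRAND"), ("TN960", "MODEL"),
     ("Extra", "OTHER"), ("High", "OTHER"), ("Yield", "OTHER")],
    [("Brother", "BRAND"), ("2PK", "QUANTITY"), ("TN932", "MODEL"),
     ("Extra", "OTHER"), ("high", "OTHER"), ("Yield", "OTHER")],
]


def token_features(tokens, i):
    """Describe token i by its shape and context instead of its exact text."""
    token = tokens[i]
    return {
        "position": i,
        "has_digit": any(ch.isdigit() for ch in token),
        "all_digits": token.isdigit(),
        "ends_with_pk": token.upper().endswith("PK"),
        "prev_token": tokens[i - 1].lower() if i > 0 else "<start>",
    }


def build_training_rows(examples):
    """Flatten the labeled names into (features, label) rows for a classifier."""
    rows = []
    for sentence in examples:
        tokens = [token for token, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            rows.append((token_features(tokens, i), label))
    return rows


for features, label in build_training_rows(labeled):
    print(label, features)
```

Rows like these would be the input to whatever supervised classifier ends up being used; the point is that only a sample of names needs to be labeled by hand, not every model in advance.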

And let those be my input for supervised learning. I'm interested in the theory. Code would also be helpful, but I suspect this is a complicated problem that would take some time to code, so it would not be fair to ask for that for free.

If anyone has any advice on how this should be done, or could point me to things I would need to learn, that would be extremely helpful.

Also, would this be possible to do in 2 weeks for someone who has no prior experience in AI, without using a library like NLTK or spaCy? How difficult is the problem, really, on a scale of 1-10?

Thanks in advance to anyone willing to spend their time on this.

import nltk
import requests

TonersJSON = None

try:
    req = requests.get('http://localhost:8080/get_toners_admin')
    TonersJSON = req.json()
except requests.exceptions.RequestException as e:
    print(e)


print(type(TonersJSON))

#nltk.download('words')

#TonersJSON = TonersJSON[0:100]
# "or []" skips the loop instead of crashing when the request failed
for toner in TonersJSON or []:
    # Keep only the part of the name before the first " - "
    sentence = toner['Name'].split(" - ")[0]
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    ne_tree = nltk.ne_chunk(tagged)
    # Print the POS tags for names with more than one token
    if len(tagged) > 1:
        print(tagged)

I tried the approach above to see if I would get anything that makes sense, but the model parts are tagged as either NNP or CD, which is not helpful.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
