'How to remove stopwords through SQL
I have a table with millions of messages. I want to remove a list of stopwords from each of those message - in SQL
Example input:
id message
-------------------------------
1 we are on top of the world
2 too bad the apple is rotten
3 there is no I in team
4 it matters where you go to
Stopwords to remove:
in, on, of, to, too
Desired output:
id message
-------------------------------
1 we are top the world
2 bad the apple is rotten
3 there is no I team
4 it matters where you go
The problem here it, I guess, that either stopword could be at the beginning, in the middle or at the end of the message. So a query like this would be satisfactory:
UPDATE table SET message = REPLACE(message, ' in ', '');
UPDATE table SET message = REPLACE(message, ' on ', '');
UPDATE table SET message = REPLACE(message, ' of ', '');
etc...
Is there a better solution?
Solution 1:[1]
If you create a table called stopwords with a field stopword in it holding a list of all your stopwords you can do it like this:
CREATE TABLE [dbo].[stopwords](
[stopword] char(100) NOT NULL
)
insert into stopwords values ('in');
insert into stopwords values ('on');
insert into stopwords values ('of');
insert into stopwords values ('to');
insert into stopwords values ('too');
-- DEBUG: select message ,stopword, replace(message,CONCAT(' ', stopword , ' '), ' ')
update table
set message = trim(replace(CONCAT(' ',message, ' '),CONCAT(' ',stopword,' '),' '))
from stopwords
where CONCAT(' ', message , ' ') like CONCAT('% ' ,stopword , ' %')
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
