Ticket #2299 (new enhancement)

Opened 9 years ago

Last modified 9 years ago

Improve the quality of search matching for words with non-ascii characters

Reported by: madarche Owned by: madarche
Priority: P2 Milestone: CPS 3.5.7
Component: CPS (global) Version: TRUNK
Severity: normal Keywords: stemming stemmer french accented non-ascii
Cc:

Description (last modified by madarche)

If a document contains a word such as "électroménager" (which has non-ASCII characters), one expects a search for "electromenager" to return that document.

This is not hard to do: one only has to extend UnicodeSplitter from the UnicodeLexicon product to create a UnicodeAccentedCharNormalizer class.
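As a rough illustration of the idea, here is a minimal standalone sketch in Python 3. It does not use the actual UnicodeLexicon code: the helper strip_accents and the body of the class are assumptions, and the process() method only mimics the list-in, list-out shape that lexicon pipeline elements typically have. The technique is plain Unicode decomposition followed by removal of combining marks.

```python
import unicodedata


def strip_accents(word):
    """Return word with combining marks (accents) removed.

    Hypothetical helper: decomposes each character (NFKD) and drops
    the combining marks that the decomposition exposes.
    """
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class UnicodeAccentedCharNormalizer:
    """Sketch of a pipeline element; not the class from the patch."""

    def process(self, words):
        # Takes a list of already-split words, returns the degraded list.
        return [strip_accents(w) for w in words]


if __name__ == "__main__":
    normalizer = UnicodeAccentedCharNormalizer()
    print(normalizer.process(["électroménager"]))
```

The same normalizer has to be applied both at indexing time and to the query terms, otherwise the two sides will not match. Letters that have no Unicode decomposition (for example "ø" or "ß") pass through unchanged with this approach.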

I've searched for newer versions or spin-offs of UnicodeLexicon, for example in the Plone ecosystem, but couldn't find any.

One last question: in which product should our UnicodeAccentedCharNormalizer go?

Attachments

catalog.xml.patch (1.1 KB) - added by madarche 9 years ago.
UnicodeLexicon.patch (2.2 KB) - added by madarche 9 years ago.

Change History

comment:1 Changed 9 years ago by madarche

I'm working on this.

I've got a modified version of the UnicodeLexicon product that almost works, but I couldn't manage to register a working CPS lexicon outside of UnicodeLexicon.

comment:2 Changed 9 years ago by gracinet

Since these accented char degradations are locale-dependent, CPSI18n might be a good candidate in the long run. Apart from that, all catalog tweaks are in CPSCore, so this one should go there too.

Btw, don't be afraid to push your patches in branches. A branch is not a real fork imho
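
To illustrate what "locale-dependent" can mean here, the following is a small Python 3 sketch, not code from CPSI18n or CPSCore. The LOCALE_OVERRIDES table, its contents, and the degrade function are all assumptions made for the example: some languages conventionally transliterate certain letters (German "ü" to "ue") rather than simply dropping the accent, so a per-language table is consulted before the generic accent stripping.

```python
import unicodedata

# Hypothetical per-language overrides, applied before generic stripping.
LOCALE_OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "fr": {"œ": "oe", "æ": "ae"},
}


def degrade(word, lang=None):
    """Degrade word to an accent-free form, honouring overrides for lang."""
    table = LOCALE_OVERRIDES.get(lang, {})
    # Compose first so that the single-character table lookups match
    # even if the input arrived in decomposed form.
    composed = unicodedata.normalize("NFC", word)
    replaced = "".join(table.get(ch, ch) for ch in composed)
    # Generic fallback: decompose and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(degrade("müller", "de"))
    print(degrade("müller", "fr"))
```

With such a table, the same word can degrade differently depending on the language of the document, which is why the mapping arguably belongs next to the i18n code rather than in a generic lexicon.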

comment:3 Changed 9 years ago by madarche

I've pushed the fix to a newly created "UnicodeLexicon" branch at http://hg.cps-cms.org/vendor/UnicodeLexicon/

This branch contains a file, catalog_to_use_in_CPSDefault_profile.xml, to be placed in CPSDefault/profiles/default/

This branch makes it possible to find a document that has "électroménager" as a title when searching for "electromenager".

Georges, please, could you test it and see how to integrate it best in CPS?

comment:4 Changed 9 years ago by madarche

  • Keywords stemming stemmer french added

comment:5 Changed 9 years ago by madarche

  • Keywords accented non-ascii added
  • Description modified
  • Summary changed from Improve the quality of search matching for words with accented characters to Improve the quality of search matching for words with non-ascii characters

comment:6 Changed 9 years ago by madarche

  • Version changed from 3.5.1 to TRUNK

Upstream (the UnicodeLexicon author) has been notified and is interested in the functionality, so we'll try to integrate it directly into UnicodeLexicon, which is the best solution for everyone.
