This is meant to match a valid Bengali word when matched from right to left:

    'DependantVowel' : '',
    ...
    BanglaWordPattern = re.compile(BanglaWordPattern, re.VERBOSE)

The matching is done with:

    re.match(BanglaWordPattern, w)

However, it is matching invalid words, such as োগাড় and িদগ.

Your strings in the BanglaAlphabet dictionary are lacking the u (Unicode) flag. Looking at the Bengali code chart, it also seems possible that you missed some other characters: U+09BC BENGALI SIGN NUKTA is not matched by your regular expression. Your string কয়া is made up of these characters:

    >>> import unicodedata
    ...

After numerous corrections, as suggested by … and …, I ended up with:

    bangla_alphabet = dict(
        consonant = u'',
        dependent_vowel = u'',
        dependent_sign = u'',
        ...
    )
    bangla_word_pattern = re.compile(ur'''(?:
        ...
        '''.format(**bangla_alphabet), re.VERBOSE)

The matching is now:

    bangla_word_pattern.match(w)

This code not only corrects errors, but also accounts for more characters and valid constructs than before. I am happy to report that it is working as expected. As such, it now serves as a very basic regular expression for validating the syntax of Bengali words. Several special rules and exceptions are not implemented yet; I will be looking into those and adding them to this basic structure incrementally.
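For readers who want to try the two ideas discussed here (listing the code points of a string with unicodedata, and building a verbose pattern from a dictionary of character classes), here is a minimal sketch in Python 3, where every str is already Unicode and no u prefix is needed. The character classes below are my own deliberately incomplete assumptions taken from the Bengali code chart, not the poster's original classes, and the helper names describe and is_plausible_word are hypothetical. The sketch uses fullmatch rather than match so that a valid prefix alone is not accepted; that is also an assumption on my part.

```python
import re
import unicodedata


def describe(word):
    """Return (code point, Unicode name) pairs for each character in word."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in word]


# Assumed, incomplete character classes (illustration only).
bangla_alphabet = dict(
    independent_vowel="\u0985-\u098c\u098f\u0990\u0993\u0994",
    consonant="\u0995-\u09a8\u09aa-\u09b0\u09b2\u09b6-\u09b9",
    nukta="\u09bc",                      # U+09BC BENGALI SIGN NUKTA
    virama="\u09cd",                     # U+09CD BENGALI SIGN VIRAMA
    dependent_vowel="\u09be-\u09c4\u09c7\u09c8\u09cb\u09cc",
    dependent_sign="\u0981-\u0983",      # candrabindu, anusvara, visarga
)

# One or more syllables; a syllable may not start with a dependent vowel.
bangla_word_pattern = re.compile(r"""
    (?:
        (?:
            [{independent_vowel}]                 # a standalone vowel, or
          | [{consonant}]{nukta}?                 # a consonant, optional nukta,
            (?:{virama}[{consonant}]{nukta}?)*    # optional conjunct consonants,
            [{dependent_vowel}]?                  # and at most one vowel sign
        )
        [{dependent_sign}]?                       # optional trailing sign
    )+
    """.format(**bangla_alphabet), re.VERBOSE)


def is_plausible_word(word):
    """True if the whole word fits the simplified syllable structure above."""
    return bangla_word_pattern.fullmatch(word) is not None


if __name__ == "__main__":
    # KA, YA + NUKTA, VOWEL SIGN AA, written as escapes to avoid ambiguity.
    for code_point, name in describe("\u0995\u09af\u09bc\u09be"):
        print(code_point, name)
    print(is_plausible_word("\u0995\u09af\u09bc\u09be"))
    # The two invalid examples: both begin with a dependent vowel sign.
    print(is_plausible_word("\u09cb\u0997\u09be\u09a1\u09bc"))
    print(is_plausible_word("\u09bf\u09a6\u0997"))
```

Listing the names first makes it easy to see when a string contains a character, such as the nukta, that none of the classes cover. This sketch still ignores the special rules and exceptions mentioned above.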