I'm sketching out an idea for a readability assessment program. It will report the education level required to comfortably read a body of text using formulas (Dale-Chall being the most significant) that look at things like sentence length and the vocabulary level of each word. I was inspired by the word counter website I always paste my essays into. Once it's done, I'd like to hook it into the Lemmy, Mastodon, and Discord APIs so bots can use it.
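For reference, the core of the New Dale-Chall calculation is pretty simple once you have the counts. Here's a rough sketch in C; the word/sentence counting and the easy-word lookup are assumed to happen elsewhere, and none of this is actual project code yet:

```c
#include <stdio.h>

/* New Dale-Chall raw score:
 *   0.1579 * (difficult words / total words * 100) + 0.0496 * (words / sentences)
 * If more than 5% of the words are "difficult" (not on the easy-word list),
 * add 3.6365 to the raw score. */
double dale_chall_score(int total_words, int total_sentences, int difficult_words)
{
    if (total_words == 0 || total_sentences == 0)
        return 0.0;

    double pct_difficult = 100.0 * difficult_words / total_words;
    double avg_sentence_len = (double)total_words / total_sentences;

    double score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len;
    if (pct_difficult > 5.0)
        score += 3.6365;
    return score;
}

int main(void)
{
    /* e.g. a 250-word sample with 12 sentences and 20 difficult words */
    printf("score: %.2f\n", dale_chall_score(250, 12, 20));
    return 0;
}
```

The score then maps onto grade-level ranges, which is the "education level" part of the report.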
BTW, do you guys think I should use a database for this? The Dale-Chall formula uses a list of 4,000 easy words, and storing lists of common proper nouns would help with flagging them. I could also probably get vocabulary-level data for tens of thousands of words... is that better kept in a DB than in a ginormous hash table or trie?
With a dataset that small, imo either option is fine. If it were me, I'd use an ORM + SQLite just to start, in case I ever needed to migrate to a "real" database.
I'm writing it in C (the CLI, which I'll just have the bots use) and have never used any databases. Do you think using the SQLite C interface directly, with some cursory reading of the docs, would be too much? Of course I could switch it all to C++, where there appears to be at least one nice ORM.
I think if you're storing vocabulary etc., using the C interface for SQLite wouldn't be too unwieldy, and it would be a good learning experience if you haven't done much raw SQL query writing of your own. Even when you use an ORM, there are often times you need to write your own queries for more complicated situations.
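To give a rough idea of what the raw C API looks like, here's a minimal sketch of checking a word against an easy-word table (the table, column, and file names are all made up for the example):

```c
#include <stdio.h>
#include <sqlite3.h>

/* Returns 1 if `word` is in the easy_words table, 0 if not, -1 on error.
 * Assumes a database created with something like:
 *   CREATE TABLE easy_words (word TEXT PRIMARY KEY);
 */
int is_easy_word(sqlite3 *db, const char *word)
{
    const char *sql = "SELECT 1 FROM easy_words WHERE word = ?1;";
    sqlite3_stmt *stmt;
    int result = -1;

    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;

    sqlite3_bind_text(stmt, 1, word, -1, SQLITE_STATIC);

    switch (sqlite3_step(stmt)) {
    case SQLITE_ROW:  result = 1; break;  /* found a match */
    case SQLITE_DONE: result = 0; break;  /* no match */
    default:          result = -1; break; /* some other error */
    }

    sqlite3_finalize(stmt);
    return result;
}

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("readability.db", &db) != SQLITE_OK) {
        fprintf(stderr, "can't open db: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    printf("'because' easy? %d\n", is_easy_word(db, "because"));
    sqlite3_close(db);
    return 0;
}
```

Compile with `-lsqlite3`. Prepare/bind/step/finalize is basically the whole pattern; most of what you'd do is a variation on that.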
One other suggestion: once you have the CLI and bots working, you could abstract this even more. Have a service process that does the actual text analysis and communicates over some channel (IPC, a network port, etc.). Your CLI and bots would then just talk to it over that channel. That gives you a separation of duties, so you can implement new clients or rework either side much more easily.
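Something like a Unix domain socket with a dead-simple protocol (write the text, read a score line back) is plenty to start with. A rough sketch of what the server side could look like, with the socket path and the scoring call made up for the example:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define SOCK_PATH "/tmp/readability.sock"   /* made-up path */

int main(void)
{
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv < 0) { perror("socket"); return 1; }

    struct sockaddr_un addr = {0};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(SOCK_PATH);  /* remove a stale socket from a previous run */

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 5) < 0) {
        perror("bind/listen");
        return 1;
    }

    for (;;) {
        int client = accept(srv, NULL, NULL);
        if (client < 0) { perror("accept"); continue; }

        /* Read the submitted text (one shot here; real code would loop). */
        char buf[65536];
        ssize_t n = read(client, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            /* Placeholder: run the actual analysis here and report its result. */
            double score = 0.0;  /* e.g. dale_chall_score(...) */
            char reply[64];
            int len = snprintf(reply, sizeof(reply), "score %.2f\n", score);
            write(client, reply, (size_t)len);
        }
        close(client);
    }
}
```

The CLI and each bot would then just connect to that socket, write the text, and read the reply, so none of them need to know anything about how the analysis works.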