[Noisebridge-discuss] How to mine for a lot of English phrases

Jonathan Foote jtfoote at ieee.org
Fri Mar 26 23:33:10 UTC 2010


There's a lot of work out there from the linguistics research
community, including many huge corpora. No need to scrape your own
text.

Try starting with these:
http://www.sil.org/linguistics/etext.html
http://www.ldc.upenn.edu/


There are easy statistical ways of finding high-likelihood phrases and
sentences, but I think it's going to be tough to meet your "must make
sense" criterion without human vetting.
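
As an illustration of one such statistical approach, here is a minimal
sketch that trains a bigram model on sentences from a corpus and ranks
candidate phrases by average log-likelihood, keeping only those within
the 40 to 84 character limits mentioned below. The function names, the
add-one smoothing, and the simple regex tokenizer are all assumptions
made for this example, not something any particular corpus or toolkit
prescribes. A high score only means the word sequence is statistically
typical; it does not guarantee the phrase makes sense.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigrams(corpus_sentences):
    """Count unigrams and bigrams, using <s> and </s> as boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = ["<s>"] + tokenize(sentence) + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def avg_log_likelihood(sentence, unigrams, bigrams):
    """Mean per-bigram log-probability with add-one smoothing (an assumption)."""
    words = ["<s>"] + tokenize(sentence) + ["</s>"]
    vocab_size = max(len(unigrams), 1)  # avoid dividing by zero on an empty model
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total / (len(words) - 1)


def rank_candidates(candidates, unigrams, bigrams, min_len=40, max_len=84):
    """Keep candidates within the length limits, most likely first."""
    fitting = [c for c in candidates if min_len <= len(c) <= max_len]
    return sorted(
        fitting,
        key=lambda c: avg_log_likelihood(c, unigrams, bigrams),
        reverse=True,
    )
```

In practice the model would be trained on a large corpus and the top
of the ranked list handed to a human for vetting.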


On Fri, Mar 26, 2010 at 3:30 PM, Micah Lee <micahflee at gmail.com> wrote:
> Hi Noisebridge, I'm working on a cryptogram Android/iPhone game and I
> need to create a large database of short English sentences that make
> sense. Things like popular sayings and quotes are great, or pieces of
> lyrics from songs, or famous lines from plays. They need to be between
> 40 and 84 characters (I'll have to test each phrase to make sure it
> fits the actual max size, which will likely be shorter than 84
> characters due to word-wrapping). I'm hoping to get a large database
> to work with, somewhere around 20,000 phrases.
>
> I've tried googling for phrase databases but it isn't leading anywhere
> good. I'll probably write some software that scrapes websites like
> wikiquote.org for phrases of that size. It's also important that the
> phrases make sense out of context and are in sensible English, which
> rules out twitter feeds. Overall I don't have much experience in data
> mining. Anyone have any suggestions?
>
> micah
> _______________________________________________
> Noisebridge-discuss mailing list
> Noisebridge-discuss at lists.noisebridge.net
> https://www.noisebridge.net/mailman/listinfo/noisebridge-discuss
>


