This is the mail archive of the
mailing list for the cygwin project.
cygwn uses for public document retrieval
- From: "Mike Marchywka" <marchywka at hotmail dot com>
- To: cygwin-talk at cygwin dot com
- Date: Sun, 08 Oct 2006 16:53:22 -0400
- Subject: cygwn uses for public document retrieval
- Reply-to: The Cygwin-Talk Maiming List <cygwin-talk at cygwin dot com>
( this was originally rejected from main list, thought to be marginally
( I searched the archives, this hasn't come up before and the question is
at the bottom- sorry for the long intro. I posted this on cygwin because
I run my scripts on cygwin and cygwin illustrates the relationship between
graphically oriented things like windoze and information oriented systems
like linux. )
I've been using scripts now to access and organize searches from various
sources. Given the proliferation of documents and document types,
I think everyone recognizes the need for more structured documents and
the ability to easily do ad hoc searches and extractions- scripts make that
possible and indeed I have some examples to show that may be of
interest beyond specialized communities.
The federal government is one entity that collects structured documents
of public interest from a variety of sources. However, the
various agencies support automated access ( scripts) in highly
My favorite example of an information-friendly site is still the
"Entrez Programming Utilities are tools that provide access to Entrez
data outside of the regular web query interface and may be helpful for
retrieving search results for future use in another environment."
( you would think the IEEE would be leading the charge in automated document
access but so far all I've seen is requests for money when I try to search
their journal databases)
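To give a feel for why a documented interface like this is script-friendly, here is a minimal Python sketch of an ESearch query. The endpoint URL, the `db`/`term`/`retmax` parameters and the `IdList`/`Id` response layout are taken from NCBI's public E-utilities documentation, not from anything in this post; the helper names and the sample response are made up for illustration, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# ESearch endpoint as given in NCBI's E-utilities documentation
# (an assumption here; check the current docs before relying on it).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch URL for the given query term (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


def parse_id_list(xml_text):
    """Extract the record IDs from an ESearch XML response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


# Abbreviated, made-up response in the general shape ESearch returns.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

if __name__ == "__main__":
    print(build_esearch_url("document retrieval"))
    print(parse_id_list(SAMPLE_RESPONSE))
```

The point is that the query is a plain URL and the answer is structured XML, so nothing has to be scraped out of a page meant for a browser.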
As one example:
Most other sites seem to just accept that interactive access via
a web interface is how "normal people" will use the site- this
is just not practical or in the public interest at most sites.
Consider searching US patent documents- afaik you have to parse the
html document hits from their search engine- there is no way
to get documents returned in some simple to use format:
You do have a choice of tiff images but these of course offer nothing
unless you also have local OCR software. Try to do a search on
reasonable criteria and see that you get lots of hits- it is difficult to do
keyword searches without downloading a bunch of documents and
finding confounding words. I've got scripts to do much of this
but it would be easier if there was a stable API supported at uspto.
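For comparison, this is roughly what "parsing the html hits" looks like when there is no API. It is only a sketch in Python using the standard library: the sample page, its link format and the names `HitLinkParser` and `extract_hits` are invented for illustration and do not reflect the real layout of any patent search site.

```python
from html.parser import HTMLParser


class HitLinkParser(HTMLParser):
    """Collect (href, link text) pairs from every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.hits.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_hits(html_text):
    """Return all (href, title) pairs found in a hit-list page."""
    parser = HitLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.hits


# Made-up hit list; a real page would need its own inspection.
SAMPLE_PAGE = """
<html><body><table>
<tr><td>1</td><td><a href="/doc?id=0000001">Example widget</a></td></tr>
<tr><td>2</td><td><a href="/doc?id=0000002">Example <b>gadget</b> holder</a></td></tr>
</table></body></html>
"""

if __name__ == "__main__":
    for href, title in extract_hits(SAMPLE_PAGE):
        print(href, title)
```

A scraper like this depends entirely on the page markup, which is exactly why it breaks whenever the site is regenerated.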
This site's "API" changes every time they regenerate their site
since they seem to use generated code:
Or, consider the SEC website ( their webmaster has been very interested in
but apparently an API is not currently a priority):
Public companies in the US submit lots of info of interest to the general public
but it is difficult to find and sort. Scripts offer a great solution for
investors who happen to know a little programming. However, the SEC
forces you to either use the web interface or parse some cumbersome html.
Further, their full text search is being implemented with even more
difficult to parse html, but it offers incredible benefits to those seeking
to sort out
financial disasters ( for example, look at the option ARM situation):
( I was told that "yahoo" and "finance" attract the spam filter)
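As a rough illustration of the "find and sort" step once the html has been fetched, the Python sketch below pulls the rows out of a results table and orders them by filing date. The table layout, the column name "Filing Date" and the names `TableRowParser` and `filings_by_date` are all assumptions made up for this example, not the SEC's actual markup, and the sketch expects reasonably well-formed rows and cells.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def filings_by_date(html_text, date_header="Filing Date"):
    """Return the data rows sorted newest first by the given date column."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    header, *records = parser.rows
    date_col = header.index(date_header)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(records, key=lambda row: row[date_col], reverse=True)


# Made-up results table for illustration only.
SAMPLE_TABLE = """
<table>
<tr><th>Form</th><th>Company</th><th>Filing Date</th></tr>
<tr><td>10-K</td><td>Example Corp</td><td>2006-03-15</td></tr>
<tr><td>8-K</td><td>Example Corp</td><td>2006-09-01</td></tr>
<tr><td>10-Q</td><td>Sample Inc</td><td>2006-05-10</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in filings_by_date(SAMPLE_TABLE):
        print("\t".join(row))
```

Once the rows are plain lists, sorting or filtering on any column is a one-liner, which is the part the web interface does not let you do.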
Even FDA filings concerning the drugs we take are available but difficult to
due to the web, rather than programming, interface that the FDA
presents- in this case they have all the important documents but
the search facilities are limited due to the data being presented as
scanned pdf files ( scripting very difficult):
( if you wanted to find all approved drugs with certain incidental
this could be a great database except for the above issues)
I could go on and on about the government sources of info-
NOAA, CDC, FTC, various courts, etc - all provide great information of
importance to the public but access is artificially constrained for any
serious uses. If it was available to programmers, it could be repackaged
at low cost and presented in a range
of web formats ( for even larger audiences). Of course- local governments
have even more information types (ranging from traffic cameras and abduction
court and property records)
that have audiences that could be more easily targeted by making the
available for innovative programmers to re-distribute.
So, my question is, are there other people who have used cygwin for
these purposes and what sites have you accessed or attempted to access
in some script based way? Has anyone approached govt sites at
any level requesting computer friendly interaction mechanisms?
What responses have you gotten?
Many private sites make their money from things predicated on
interaction ( advertising for most sites- academic journals have a number
of revenue sources and I find it difficult to believe that they
would have problems with free, automated online access).
Does anyone have examples or thoughts
on free-to-user private entities that are still compatible with
automated access? 10kwizard had a nice service but they took
even their simple features into their subscription rather than free area.