This is the mail archive of the cygwin-talk mailing list for the cygwin project.


cygwn uses for public document retrieval

From: "Mike Marchywka" <marchywka at hotmail dot com>
To: cygwin-talk at cygwin dot com
Date: Sun, 08 Oct 2006 16:53:22 -0400
Subject: cygwn uses for public document retrieval
Reply-to: The Cygwin-Talk Maiming List <cygwin-talk at cygwin dot com>

Hi,

(This was originally rejected from the main list and thought to be marginally relevant here. I searched the archives and this hasn't come up before. The question is at the bottom; sorry for the long intro. I posted this on cygwin because I run my scripts on cygwin, and cygwin illustrates the relationship between graphically oriented things like windoze and information-oriented systems like linux.)

I've been using scripts to access and organize searches from various sources. Given the proliferation of documents and document types, I think everyone recognizes the need for more structured documents and the ability to easily do ad hoc searches and extractions. Scripts make that possible, and I have some examples to show that may be of interest beyond specialized communities.

The federal government is one entity that collects structured documents of public interest from a variety of sources. However, the various agencies support automated access (scripts) in highly variable ways. My favorite example of an information-friendly site is still the ncbi api:

http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

"Entrez Programming Utilities are tools that provide access to Entrez data outside of the regular web query interface and may be helpful for retrieving search results for future use in another environment."

(You would think the IEEE would be leading the charge in automated document access, but so far all I've seen is requests for money when I try to search their journal databases.)

As one example:
http://bioinformatics.org/pipermail/bio_bulletin_board/2006-May/003249.html
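
For readers who have not used a scriptable interface like this, here is a minimal sketch of what access can look like. It is not the script from the post above. The `esearch.fcgi` endpoint, the `db`/`term`/`retmax` parameter names, and the `IdList`/`Id` element names reflect my understanding of the E-utilities and should be checked against the help page; the helper names `build_esearch_url` and `parse_id_list` are made up for illustration.

```python
# Sketch only: endpoint and parameter names are assumptions to verify
# against the E-utilities help page before relying on them.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"  # assumed endpoint


def build_esearch_url(db, term, retmax=20):
    """Return a URL asking Entrez for record IDs matching `term` in `db`."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


def parse_id_list(xml_text):
    """Pull the record IDs out of an eSearchResult-style XML document."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]
```

Fetching is left to the caller, for example with `urllib.request.urlopen(url).read()`. The point is that the result is a small XML document whose IDs come out with one `findall`, with no HTML scraping involved.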


Most other sites seem to just accept that interactive access via
a web interface is how "normal people" will use the site- this
is just not practical or in the public interest at most sites.

Consider searching US patent documents- afaik you have to parse the
html document hits from their search engine- there is no way
to get documents returned in some simple to use format:

http://www.uspto.gov/main/search.html

You do have a choice of tiff images but these of course offer nothing
unless you also have local OCR software. Try to do a search on
reasonable criteria and see that you get lots of hits- it is difficult to do
keyword searches without downloading a bunch of documents and
finding confounding words. I've got scripts to do much of this
but it would be easier if there was a stable API supported at uspto.
This site's "API" changes every time they regenerate their site
since they seem to use generated code:
http://portal.uspto.gov/external/portal/pair
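
By contrast, parsing an html hit list looks roughly like the sketch below. It is only an illustration of the general technique, not the scripts mentioned above, and it assumes nothing about the real uspto page layout: it simply collects every link and its text from whatever html it is given. The names `LinkCollector` and `extract_links` are hypothetical.

```python
# Sketch only: generic link extraction with the standard library.
# Real search-result pages need site-specific filtering on top of this,
# and that filtering is what breaks when the generated html changes.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_text):
    """Return all (href, text) pairs found in `html_text`."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Everything past this point (deciding which links are actual document hits, following "next page" links, and so on) depends on the page markup, which is why a stable API would be easier to script against.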

Or, consider the SEC website ( their webmaster has been very interested in this but apparently an API is not currently a priority):

http://www.sec.gov/edgar/quickedgar.htm

Public companies in the US submit lots of info of interest to the general public but it is difficult to find and sort. Scripts offer a great solution for even casual investors who happen to know a little programming. However, the SEC currently forces you to either use the web interface or parse some cumbersome html. Further, their full text search is being implemented with even more difficult-to-parse html, but it offers incredible benefits to those seeking to sort out potential financial disasters ( for example, look at the option ARM situation):

http://www.investorshub.com/boards/read_msg.asp?message_id=13071715

( I was told that "yahoo" and "finance" attract the spam filter)
http://messages.f-----e.y----o.com/Business_%26_Finance/Investments/Sectors/Healthcare/Biotechnology_and_Drugs/threadview?bn=5990&tid=866620&mid=866620

Even FDA filings concerning the drugs we take are available but difficult to access because the FDA presents a web interface rather than a programming interface. In this case they have all the important documents, but the search facilities are limited because the data is presented as scanned pdf files ( scripting is very difficult). If you wanted to find all approved drugs with certain incidental properties, this could be a great database except for the above issues:

http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm

I could go on and on about the government sources of info (NOAA, CDC, FTC, various courts, etc). All provide great information of importance to the public, but access is artificially constrained for any serious uses. If it were available to programmers, it could be repackaged at low cost and presented in a range of web formats ( for even larger audiences). Of course, local governments have even more information types (ranging from traffic cameras and abduction alerts to court and property records) with audiences that could be more easily targeted by making the information available for innovative programmers to re-distribute.


So, my question is, are there other people who have used cygwin for
these purposes and what sites have you accessed or attempted to access
in some script based way? Has anyone approached govt sites at
any level requesting computer friendly interaction mechanisms?
What responses have you gotten?

Many private sites make their money from things predicated on
interaction ( advertising for most sites- academic journals have a number
of revenue sources and I find it difficult to believe that they
would have problems with free, automated online access).
Does anyone have examples or thoughts
on free-to-user private entities that are still compatible with
automated access? 10kwizard had a nice service but they moved
even their simple features from the free area into their subscription area.

Thanks.

Follow-Ups:
- Re: cygwn uses for public document retrieval
  - From: Carlo Florendo
