I second that.<br><br><div class="gmail_quote">On Thu, May 15, 2008 at 5:11 AM, Markus Döring &lt;<a href="mailto:mdoering@gbif.org">mdoering@gbif.org</a>&gt; wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

that&#39;s right. So they need to be escaped if they really want to have<br>

control characters in their dumps.<br>

<br>

But this is no different from escaping xml or any other document. It<br>

would just be nice if the number of escape characters is kept to a<br>

minimum. For this reason I personally prefer tab files, as escaping<br>

line returns and the delimiting tab space is rather little work.<br>
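<br>

A minimal sketch of that escaping, assuming a backslash is used as the escape character (the thread does not name one). The helper names escape_field, unescape_field, write_row and read_row are hypothetical:<br>

<pre>
```python
# Assumption: backslash is the escape character; the thread does not specify one.
ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}
UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def escape_field(value):
    """Replace backslash, tab and line breaks with two-character escapes."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def unescape_field(value):
    """Reverse escape_field by scanning the text left to right."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "\\")  # a trailing lone backslash stays a backslash
            out.append(UNESCAPES.get(nxt, nxt))
        else:
            out.append(ch)
    return "".join(out)


def write_row(fields):
    """Join escaped fields with a real tab, the only delimiter in the file."""
    return "\t".join(escape_field(f) for f in fields)


def read_row(line):
    """Split on real tabs, then restore the original field contents."""
    return [unescape_field(f) for f in line.split("\t")]


row = ["102", "Found near the river\tbank,\nunder a stone"]
line = write_row(row)
```
</pre>

Only the backslash, the tab and the two line-break characters need handling, which keeps the number of escape sequences small.<br>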

<font color="#888888"><br>

<br>

Markus<br>

</font><div><div></div><div class="Wj3C7c"><br>

<br>

On 15 May, 2008, at 13:40, Holetschek, Jörg wrote:<br>

<br>

&gt; Hi guys,<br>

&gt;<br>

&gt; sorry for the late reaction, but I put off reading all the mails<br>

&gt; until today.<br>

&gt;<br>

&gt; Using CSV and tab delimited files will cause problems when the dumps<br>

&gt; contain freetext data, e.g. locality description or notes. When I<br>

&gt; pushed our BioCASE cache (50 million occurrence records) between<br>

&gt; different DBMS using tab delimited files, I learned the hard way that<br>

&gt; people are very eager to use tabs and new lines in freetext fields.<br>

&gt; Any character you choose for delimiting contents you will find in<br>

&gt; freetext fields...<br>

&gt;<br>

&gt; Cheers from Berlin,<br>

&gt; Jörg<br>

&gt;<br>

&gt; -----Original Message-----<br>

&gt; From: <a href="mailto:tdwg-tapir-bounces@lists.tdwg.org">tdwg-tapir-bounces@lists.tdwg.org</a><br>

&gt; [mailto:<a href="mailto:tdwg-tapir-bounces@lists.tdwg.org">tdwg-tapir-bounces@lists.tdwg.org</a>] On Behalf Of Markus Döring<br>

&gt; Sent: Wednesday, 14 May 2008 15:35<br>

&gt; To: Aaron D. Steele<br>

&gt; Cc: TAPIR mailing list<br>

&gt; Subject: Re: [tdwg-tapir] Fwd: Tapir protocol - Harvest<br>

&gt; methods?[SEC=UNCLASSIFIED]<br>

&gt;<br>

&gt;<br>

&gt; it would keep the relations, but we don&#39;t really want any relational<br>

&gt; structure to be served up.<br>

&gt; And using sqlite binaries for the dwc star scheme would not be easier<br>

&gt; to work with than plain text files. text files can even be loaded into excel<br>

&gt; straight away, can be versioned with svn and so on. If there is a<br>

&gt; geospatial extension file which has the GUID in the first column,<br>

&gt; applications might grab that directly and not even touch the central<br>

&gt; core file if they only want location data.<br>

&gt;<br>

&gt; I&#39;d prefer to stick with a csv or tab delimited file.<br>

&gt; The simpler the better. And it also can&#39;t get corrupted as easily.<br>

&gt;<br>

&gt; Markus<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; On 14 May, 2008, at 15:25, Aaron D. Steele wrote:<br>

&gt;<br>

&gt;&gt; for preserving relational data, we could also just dump tapirlink<br>

&gt;&gt; resources to an sqlite database file (<a href="http://www.sqlite.org" target="_blank">http://www.sqlite.org</a>), zip it<br>

&gt;&gt; up, and again make it available via the web service. we use sqlite<br>

&gt;&gt; internally for many projects, and it&#39;s both easy to use and well<br>

&gt;&gt; supported by jdbc, php, python, etc.<br>

&gt;&gt;<br>

&gt;&gt; would something like this be a useful option?<br>

&gt;&gt;<br>

&gt;&gt; thanks,<br>

&gt;&gt; aaron<br>

&gt;&gt;<br>

&gt;&gt; On Wed, May 14, 2008 at 2:21 AM, Markus Döring &lt;<a href="mailto:mdoering@gbif.org">mdoering@gbif.org</a>&gt;<br>

&gt;&gt; wrote:<br>

&gt;&gt;&gt; Interesting that we all come to the same conclusions...<br>

&gt;&gt;&gt; The trouble I had with just a simple flat csv file is repeating<br>

&gt;&gt;&gt; properties like multiple image urls. ABCD clients don&#39;t use ABCD just<br>

&gt;&gt;&gt; because it&#39;s complex, but because they want to transport this<br>

&gt;&gt;&gt; relational data. We were considering 2 solutions for extending this csv<br>

&gt;&gt;&gt; approach. The first would be to have a single large denormalised csv<br>

&gt;&gt;&gt; file with many rows for the same record. It would require knowledge<br>

&gt;&gt;&gt; about the related entities though and could grow in size rapidly. The<br>

&gt;&gt;&gt; second idea, which we are thinking of adopting, is to allow a single<br>

&gt;&gt;&gt; level of 1-many related entities. It is basically a &quot;star&quot; design with<br>

&gt;&gt;&gt; the core dwc table in the center and any number of extension tables<br>

&gt;&gt;&gt; around it. Each &quot;table&quot;, aka csv file, will have the record id as the<br>

&gt;&gt;&gt; first column, so the files can be related easily and only a single<br>

&gt;&gt;&gt; identifier per record is needed, none for the extension entities. This<br>

&gt;&gt;&gt; would give a lot of flexibility while keeping things pretty simple to<br>

&gt;&gt;&gt; deal with. It would even satisfy the ABCD needs, as I haven&#39;t yet seen<br>

&gt;&gt;&gt; anyone requiring 2 levels of related tables (other than lookup tables).<br>

&gt;&gt;&gt; Those extensions could even be a simple 1-1 relation, but would keep<br>

&gt;&gt;&gt; things semantically together just like an xml namespace. The darwin core<br>

&gt;&gt;&gt; extensions, for example, would be a good fit.<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; So we could have a gzipped set of files, maybe with a simple metafile<br>

&gt;&gt;&gt; indicating the semantics of the columns for each file.<br>

&gt;&gt;&gt; An example could look like this:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; # darwincore.csv<br>

&gt;&gt;&gt; 102 &nbsp; &nbsp;Aster alpinus subsp. parviceps &nbsp; &nbsp;...<br>

&gt;&gt;&gt; 103 &nbsp; &nbsp;Polygala vulgaris &nbsp; &nbsp;...<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; # curatorial.csv<br>

&gt;&gt;&gt; 102 &nbsp; &nbsp;Kew Herbarium<br>

&gt;&gt;&gt; 103 &nbsp; &nbsp;Reading Herbarium<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; # identification.csv<br>

&gt;&gt;&gt; 102 &nbsp; &nbsp;2003-05-04 &nbsp; &nbsp;Karl Marx &nbsp; &nbsp;Aster alpinus L.<br>

&gt;&gt;&gt; 102 &nbsp; &nbsp;2007-01-11 &nbsp; &nbsp;Mark Twain &nbsp; &nbsp;Aster korshinskyi Tamamsch.<br>

&gt;&gt;&gt; 102 &nbsp; &nbsp;2007-09-13 &nbsp; &nbsp;Roger Hyam &nbsp; &nbsp;Aster alpinus subsp. parviceps Novopokr.<br>

&gt;&gt;&gt; 103 &nbsp; &nbsp;2001-02-21 &nbsp; &nbsp;Steve Bekow &nbsp; &nbsp;Polygala vulgaris L.<br>
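<br>

As a rough illustration of how a consumer might relate these files by the record id in the first column, here is a sketch in Python. The sample rows are copied from the example above (the elided core columns are left out); rows_by_id and the star dictionary are hypothetical names, and a real client would read the files from the gzipped set rather than from strings:<br>

<pre>
```python
import csv
import io
from collections import defaultdict

# Sample data copied from the example above; real files would be read from disk.
CORE = "102\tAster alpinus subsp. parviceps\n103\tPolygala vulgaris\n"
IDENTIFICATION = (
    "102\t2003-05-04\tKarl Marx\tAster alpinus L.\n"
    "102\t2007-01-11\tMark Twain\tAster korshinskyi Tamamsch.\n"
    "102\t2007-09-13\tRoger Hyam\tAster alpinus subsp. parviceps Novopokr.\n"
    "103\t2001-02-21\tSteve Bekow\tPolygala vulgaris L.\n"
)


def rows_by_id(text):
    """Group the rows of one tab file by the record id in the first column."""
    grouped = defaultdict(list)
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            grouped[row[0]].append(row[1:])
    return grouped


core = rows_by_id(CORE)
identifications = rows_by_id(IDENTIFICATION)

# Attach the 1-many extension rows to each core record.
star = {
    record_id: {
        "core": rows[0],
        "identifications": identifications.get(record_id, []),
    }
    for record_id, rows in core.items()
}
```
</pre>

Because every file leads with the same id, an application that only needs one extension can read that file alone and skip the core file.<br>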

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; I know this looks old fashioned, but it is just so simple and gives us<br>

&gt;&gt;&gt; so much flexibility.<br>

&gt;&gt;&gt; Markus<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; On 14 May, 2008, at 24:39, Greg Whitbread wrote:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; We have used a very similar protocol to assemble the latest AVH cache.<br>

&gt;&gt;&gt;&gt; It should be noted that this is an as-well-as protocol that only works<br>

&gt;&gt;&gt;&gt; because we have an established semantic standard (hispid/abcd).<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; greg<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; <a href="mailto:trobertson@gbif.org">trobertson@gbif.org</a> wrote:<br>

&gt;&gt;&gt;&gt;&gt; Hi All,<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; This is very interesting to me, as I came to the same conclusion<br>

&gt;&gt;&gt;&gt;&gt; while harvesting for GBIF.<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; As a &quot;harvester of all records&quot; it is best described with an<br>

&gt;&gt;&gt;&gt;&gt; example:<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; - Complete Inventory of ScientificNames: 7 minutes @ the limited 200<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; records per page<br>

&gt;&gt;&gt;&gt;&gt; - Complete Harvesting of records:<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - 260,000 records<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - 9 hours harvesting duration<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - 500MB TAPIR+DwC XML returned (DwC 1.4 with geospatial and<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; &nbsp; curatorial extensions)<br>

&gt;&gt;&gt;&gt;&gt; - Extraction of DwC records from harvested XML: &lt;2 minutes<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - Resulting file size 32MB, Gzipped to &lt;3MB<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; I spun hard drives for 9 hours, and took up bandwidth that is paid for,<br>

&gt;&gt;&gt;&gt;&gt; to retrieve something that could have been generated provider side in<br>

&gt;&gt;&gt;&gt;&gt; minutes and transferred in seconds (3MB).<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; I sent a proposal to TDWG last year termed &quot;datamaps&quot;, which was<br>

&gt;&gt;&gt;&gt;&gt; effectively what you are describing, and I based it on the Sitemaps<br>

&gt;&gt;&gt;&gt;&gt; protocol, but I got nowhere with it. &nbsp;With Markus, we are making more<br>

&gt;&gt;&gt;&gt;&gt; progress, and I have spoken with several GBIF data providers about a<br>

&gt;&gt;&gt;&gt;&gt; proposed new standard for full dataset harvesting and it has been<br>

&gt;&gt;&gt;&gt;&gt; received well. &nbsp;So Markus and I have started a new proposal and have a<br>

&gt;&gt;&gt;&gt;&gt; working name of &#39;Localised DwC Index&#39; file generation (it is an index if<br>

&gt;&gt;&gt;&gt;&gt; you have more than DwC data, and DwC is still standards compliant), which<br>

&gt;&gt;&gt;&gt;&gt; is really a GZipped Tab file dump of the data, which is slightly<br>

&gt;&gt;&gt;&gt;&gt; extensible. The document is not ready to circulate yet but the benefits<br>

&gt;&gt;&gt;&gt;&gt; section currently reads:<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; - Provider database load reduced, allowing it to serve real distributed<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; queries rather than &quot;full datasource&quot; harvesters<br>

&gt;&gt;&gt;&gt;&gt; - Providers can choose to publish their index as it suits them, giving<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; control back to the provider<br>

&gt;&gt;&gt;&gt;&gt; - Localised index generation can be built into tools not yet capable of<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; integrating with TDWG protocol networks such as GBIF<br>

&gt;&gt;&gt;&gt;&gt; - Harvesters receive a full dataset view in one request, making it very<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; easy to determine what records are eligible for deletion<br>

&gt;&gt;&gt;&gt;&gt; - It becomes very simple to write clients that consume entire datasets.<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; E.g. data cleansing tools that the provider can run:<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - Give me ISO Country Codes for my dataset<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; &nbsp; - The application pulls down the provider&#39;s index file, generates ISO<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; &nbsp; &nbsp; country codes, returns a simple table using the provider&#39;s own identifier<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - Check my names for spelling mistakes<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; &nbsp; - Application skims over the records and provides a list of those that<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; &nbsp; &nbsp; are not known to the application<br>

&gt;&gt;&gt;&gt;&gt; - Providers such as UK NBN cannot serve 20 million records to the GBIF<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; index using the existing protocols efficiently.<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; - They have the ability to generate a localised index however<br>

&gt;&gt;&gt;&gt;&gt; - Harvesters can very quickly build up searchable indexes and it is easy<br>

&gt;&gt;&gt;&gt;&gt; &nbsp; to create large indices.<br>

&gt;&gt;&gt;&gt;&gt; - Node Portal can easily aggregate index data files<br>

&gt;&gt;&gt;&gt;&gt; - true index to data, not an illusion of a cache. More like Google<br>

&gt;&gt;&gt;&gt;&gt; sitemaps<br>
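<br>

The deletion point in the list above can be sketched as a plain set comparison between the ids already cached and the ids present in a fresh full dump. The function name diff_ids and the sample ids are made up for illustration:<br>

<pre>
```python
def diff_ids(cached_ids, dump_ids):
    """Compare cached record ids with the ids found in a full dataset dump."""
    cached = set(cached_ids)
    dumped = set(dump_ids)
    return {
        # In the cache but missing from the dump: eligible for deletion.
        "deleted": sorted(cached - dumped),
        # In the dump but not yet cached: new records.
        "new": sorted(dumped - cached),
    }


result = diff_ids(["101", "102", "103"], ["102", "103", "104"])
```
</pre>

This only works because the dump is a complete view of the dataset; a paged or partial response could not be used to infer deletions this way.<br>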

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; It is the ease with which one can offer tools to data providers that<br>

&gt;&gt;&gt;&gt;&gt; really interests me. &nbsp;The technical threshold required to produce<br>

&gt;&gt;&gt;&gt;&gt; services that offer reporting tools on people&#39;s data is really very low<br>

&gt;&gt;&gt;&gt;&gt; with this mechanism. &nbsp;This and the fact that large datasets will be<br>

&gt;&gt;&gt;&gt;&gt; harvestable - we have even considered the likes of bit-torrent for the<br>

&gt;&gt;&gt;&gt;&gt; large ones although I think this is overkill.<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; As a consumer therefore I fully support this move as a valuable<br>

&gt;&gt;&gt;&gt;&gt; addition to the wrapper tools.<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; Cheers<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt; Tim<br>

&gt;&gt;&gt;&gt;&gt; (wrote the GBIF harvesting, and new to this list)<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt; Begin forwarded message:<br>

&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; From: &quot;Aaron D. Steele&quot; &lt;<a href="mailto:eightysteele@gmail.com">eightysteele@gmail.com</a>&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; Date: 13 May 2008 22:40:09 GMT+02:00<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; To: <a href="mailto:tdwg-tapir@lists.tdwg.org">tdwg-tapir@lists.tdwg.org</a><br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; Cc: Aaron Steele &lt;<a href="mailto:asteele@berkeley.edu">asteele@berkeley.edu</a>&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; Subject: Re: [tdwg-tapir] Tapir protocol - Harvest methods?<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; at berkeley we&#39;ve recently prototyped a simple php program that uses<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; an existing tapirlink installation to periodically dump tapir<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; resources into a csv file. the solution is totally generic and can<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; dump darwin core (and technically abcd schema, although it&#39;s currently<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; untested). the resulting csv files are zip archived and made<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; accessible using a web service. it&#39;s a simple approach that has proven<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; to be, at least internally, quite reliable and useful.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; for example, several of our caching applications use the web service<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; to harvest csv data from tapirlink resources using the following<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; process:<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; 1) download latest csv dump for a resource using the web service.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; 2) flush all locally cached records for the resource.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; 3) bulk load the latest csv data into the cache.<br>
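<br>

A sketch of the three steps above, under stated assumptions: the dump is a gzipped tab file with the record id and a name in the first two columns, the cache is an SQLite table named cache, and the download is simulated with an in-memory payload instead of a call to the web service. None of these names come from the tapirlink code:<br>

<pre>
```python
import csv
import gzip
import io
import sqlite3


def refresh_cache(conn, resource, dump_bytes):
    """Replace all cached rows of one resource with the rows of a gzipped dump."""
    text = gzip.decompress(dump_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [(resource, row[0], row[1]) for row in reader if row]
    with conn:  # one transaction: the flush and the load succeed or fail together
        # Step 2: flush all locally cached records for the resource.
        conn.execute("DELETE FROM cache WHERE resource = ?", (resource,))
        # Step 3: bulk load the latest data into the cache.
        conn.executemany(
            "INSERT INTO cache (resource, record_id, name) VALUES (?, ?, ?)", rows
        )
    return len(rows)


# Hypothetical cache table holding one stale record for the resource "demo".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (resource TEXT, record_id TEXT, name TEXT)")
conn.execute("INSERT INTO cache VALUES ('demo', '999', 'stale record')")
conn.commit()

# Step 1 (the download) is simulated here with an in-memory gzip payload.
dump = gzip.compress("102\tAster alpinus\n103\tPolygala vulgaris\n".encode("utf-8"))
loaded = refresh_cache(conn, "demo", dump)
cached_ids = [
    r[0] for r in conn.execute("SELECT record_id FROM cache ORDER BY record_id")
]
```
</pre>

Running the flush and the load in one transaction means a failed load leaves the previous cache contents in place.<br>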

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; in this way, cached data are always synchronized with the resource and<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; there&#39;s no need to track new, deleted, or changed records. as an<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; aside, each time these cached data are queried by the caching<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; application or selected in the user interface, log-only search<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; requests are sent back to the resource.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; after discussion with renato giovanni and john wieczorek, we&#39;ve<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; decided that merging this functionality into the tapirlink codebase<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; would benefit the broader community. csv generation support would be<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; declared through capabilities. although incremental harvesting<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; wouldn&#39;t be immediately implemented, we could certainly extend the<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; service to include it later.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; i&#39;d like to pause here to gauge the consensus, thoughts, concerns, and<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; ideas of others. anyone?<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; thanks,<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; aaron<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt; 2008/5/5 Kevin Richards &lt;<a href="mailto:RichardsK@landcareresearch.co.nz">RichardsK@landcareresearch.co.nz</a>&gt;:<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I think I agree here.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; The harvesting &quot;procedure&quot; is really defined outside the Tapir<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; protocol, is it not? &nbsp;So it is really an agreement between the<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; harvester and the harvestees.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; So what is really needed here is the standard procedure for<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; maintaining a &quot;harvestable&quot; dataset and the standard procedure for<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; harvesting that dataset.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; We have a general rule at Landcare that we never delete records in<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; our datasets - they are either deprecated in favour of another<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; record, and so the resolution of that record would point to the new<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; record, or they are set to a state of &quot;deleted&quot;, but are still kept<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; in the dataset, and can be resolved (which would indicate a state of<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; deleted).<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Kevin<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &quot;Renato De Giovanni&quot; &lt;<a href="mailto:renato@cria.org.br">renato@cria.org.br</a>&gt; 6/05/2008 7:33 a.m. &gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Hi Markus,<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; I would suggest creating new concepts for incremental harvesting,<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; either in the data standards themselves or in some new extension. In<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; the case of TAPIR, GBIF could easily check the mapped concepts before<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; deciding between incremental or full harvesting.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Actually it could be just one new concept such as &quot;recordStatus&quot; or<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; &quot;deletionFlag&quot;. Or perhaps you might also want to create your own<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; definition for dateLastModified indicating which set of concepts<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; should be considered to see if something has changed or not, but I<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; guess this level of granularity would be difficult to support.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>
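<br>

One possible reading of this suggestion, as a sketch: records carry dateLastModified and an optional deletionFlag, and the harvester applies only what changed since its last run. The record layout (a dict with id and name) and the function name apply_increment are assumptions, not part of TAPIR or Darwin Core:<br>

<pre>
```python
from datetime import date


def apply_increment(cache, records, last_harvest):
    """Apply records changed after last_harvest to cache, a dict keyed by id."""
    for rec in records:
        if last_harvest >= rec["dateLastModified"]:
            continue  # unchanged since the previous harvest
        if rec.get("deletionFlag"):
            cache.pop(rec["id"], None)  # flagged as deleted: drop the cached copy
        else:
            cache[rec["id"]] = rec["name"]
    return cache


cache = {"102": "Aster alpinus", "103": "Polygala vulgaris"}
changes = [
    {"id": "102", "name": "Aster alpinus L.", "dateLastModified": date(2008, 5, 10)},
    {"id": "103", "name": "", "dateLastModified": date(2008, 5, 12), "deletionFlag": True},
    {"id": "104", "name": "Bellis perennis", "dateLastModified": date(2008, 4, 1)},
]
apply_increment(cache, changes, last_harvest=date(2008, 5, 5))
```
</pre>

Record 104 is skipped because its modification date precedes the last harvest, so a harvester relying on this scheme depends on providers keeping dateLastModified accurate and never physically removing flagged records.<br>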

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Regards,<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; --<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Renato<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; On 5 May 2008 at 11:24, Markus Döring wrote:<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Phil,<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; incremental harvesting is not implemented on the GBIF side as far as I<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; am aware. And I don&#39;t think that will be a simple thing to implement on<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; the current system. Also, even if we can detect only the changed<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; records since the last harvesting via dateLastModified we still have<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; no information about deletions. We could have an arrangement saying<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; that you keep deleted records as empty records with just the ID and<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; nothing else (I vaguely remember LSIDs were supposed to work like this<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; too). But that also needs to be supported on your side then, never<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; entirely removing any record. I will have a discussion with the others<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; at GBIF about that.<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; Markus<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; _______________________________________________<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; tdwg-tapir mailing list<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; <a href="mailto:tdwg-tapir@lists.tdwg.org">tdwg-tapir@lists.tdwg.org</a><br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; <a href="http://lists.tdwg.org/mailman/listinfo/tdwg-tapir" target="_blank">http://lists.tdwg.org/mailman/listinfo/tdwg-tapir</a><br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;<br>




&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; --<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt; Australian Centre for Plant BIodiversity Research&lt;------------------+<br>

&gt;&gt;&gt;&gt; National &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;greg whitBread &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; voice: +61 2 62509 482<br>

&gt;&gt;&gt;&gt; Botanic Integrated Botanical Information System &nbsp;fax: +61 2 62509 599<br>

&gt;&gt;&gt;&gt; Gardens &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;S........ I.T. happens.. <a href="mailto:ghw@anbg.gov.au">ghw@anbg.gov.au</a><br>

&gt;&gt;&gt;&gt; +-----------------------------------------&gt;GPO Box 1777 Canberra 2601<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;<br>

&gt;&gt;&gt;&gt;<br>


&gt;&gt;&gt;&gt;<br>


</div></div></blockquote></div><br>