Final changes in TAPIR
Dear all,
There are just two items left on the list of possible changes before submitting TAPIR to the TDWG standards track:
1) Allow custom operations to be declared as part of capabilities.
I would suggest simply including a new custom slot for this in the schema, in case someone needs to use it in the future.
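Just to illustrate what I mean (this is only a sketch, not the actual schema change, and the element name is just an example), the slot could be an open wildcard in the capabilities type:

  <xs:element name="custom" minOccurs="0">
    <xs:complexType>
      <xs:sequence>
        <!-- accept any element from another namespace without strict validation -->
        <xs:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>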
2) Allow dump files to be declared.
This has been discussed some time ago in the TAPIR mailing list but we didn't come to a final conclusion.
Since some networks are starting to harvest data by fetching entire dump files, I think it's important to allow TAPIR services to declare any dump files that may be available. Fetching a dump file from a provider and using incremental harvesting in later interactions with the service will probably be the most efficient approach.
Since TAPIR generates XML output, it makes more sense to me to see dump files in XML. However, Tim/Markus (GBIF) are proposing another format for dump files using tab/csv files together with a metafile. It should be easy to allow both options when declaring a dump file in TAPIR capabilities, but I don't think it's the role of TAPIR to define specific formats. We can probably use something like this to declare dump files:
  <archives>
    <archive format="" location="" outputModel=""/>
    ...
  </archives>
Where format could be "xml" or any custom term, and outputModel would be optional (only used with the "xml" format). Things like the date when the dump file was generated and whether it is gzipped could be additional attributes, but in most cases this can be discovered through the protocol used to retrieve the file, so in principle I would not include those attributes.
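To make it concrete, a capabilities response could then carry something like this (the locations and the output model URL below are made-up example values, not real endpoints):

  <archives>
    <archive format="xml"
             location="http://example.org/dumps/full_dump.xml.gz"
             outputModel="http://example.org/models/my_model.xml"/>
    <archive format="csv"
             location="http://example.org/dumps/full_dump.zip"/>
  </archives>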
Please let me know if this is an acceptable solution or if you have any different thoughts. Also let me know if you have any other ideas or suggestions about TAPIR in general. This is the time.
I would like to finally submit the specification & schema to the standards track at the beginning of the week.
Best Regards, -- Renato
Renato,
I really think #2 is worth including. There are times when I wish to send small requests through the TAPIR protocol, but there are other times, especially on first inspection, when it would be nice to pull an initial dump. Your XML format for archives looks fine, but I would consider adding attributes like dateCreated and numberOfRecords. That way, if there are monthly archives, for example, I could pull the latest one. The number of records would also be useful for knowing how much is in the dump, not just its size. Depending on the actual dump files, the other question I would have is whether a dump is a delta of the previous dump or a complete dump, so I would know whether I am getting the full dataset or have to get all the dumps and merge them together.
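Just to illustrate (the attribute names and values here are made up), something like:

  <archive format="xml" location="http://example.org/dumps/2009-01.xml.gz"
           dateCreated="2009-01-01" numberOfRecords="150000"/>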
These are my initial thoughts.
Regards,
Michael Giddens
Biodiversity Informatics Software Development
www.SilverBiology.com
Baton Rouge, LA
phone: +1 225-937-9657
email: mikegiddens@silverbiology.com
skype: mikegiddens
Michael, I think you have a good point here. But is this use case really related to TAPIR? It doesn't seem you want to use TAPIR at all, but rather to aggregate or sync complete datasets using only dump files. Although this is a very frequent scenario, and I agree dump files are much better at doing this job than OAI, Atom feeds or TAPIR, I don't see the need to use TAPIR for this.
Things get TAPIR-related only when providing already supported/advertised output models as full dumps. Those XML models are already part of TAPIR and generated for responses, so why not provide the entire dataset like this?
Markus
Hi Michael,
Thanks for your input.
The original idea was to advertise only complete dumps, not delta files, otherwise things can get more complicated and, as Markus said, it looks a bit out of scope. Even if there can be multiple "archive" elements for whatever reason (multiple locations, different output models, etc.), in most cases I think there will only be a single element pointing to the most recent dump file. After getting the file, it should be possible to do incremental harvesting using the search operation instead of handling multiple delta files.
Regarding the other suggestions, I don't mind adding attributes for creation timestamp and number of records. Following this approach, we can also add an attribute to indicate compression, as Tim suggested on the Wiki. I'll do this if there are no further comments or ideas.
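As a rough sketch of how a declared archive might end up looking with those additions (the attribute names and values below are placeholders, not the final schema):

  <archives>
    <archive format="xml"
             location="http://example.org/dumps/full_dump.xml.gz"
             outputModel="http://example.org/models/my_model.xml"
             created="2009-01-30T12:00:00Z"
             numberOfRecords="152340"
             compression="gzip"/>
  </archives>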
Thanks again, -- Renato