The unofficial client for the Wilson Center Digital Archive¶
The digitalarchive Python library is a client and ORM for accessing, searching, and downloading historical documents and their accompanying scans, translations, and transcriptions from the Wilson Center’s Digital Archive of primary sources.
Features¶
- Search for documents and other Digital Archive resources by keyword.
- Easily retrieve translations, transcriptions, and related records of any document.
- Fully documented models for all of the Digital Archive resource types.
Installation¶
Install the latest stable version of digitalarchive
using pip:
$ python3 -m pip install digitalarchive
Usage¶
Find a subject by name:
>>> from digitalarchive import Subject
>>> Subject.match(name="Tiananmen Square Incident").first()
Subject(id='2229', name='China--History--Tiananmen Square Incident, 1989', value='China--History--Tiananmen Square Incident, 1989', uri='/srv/subject/2229.json')
Discover collections of related documents:
>>> from digitalarchive import Collection, Document
>>> collection = Collection.match(name="Local Nationalism in Xinjiang").first()
>>> docs = Document.match(collections=[collection])
>>> for doc in docs.all():
...     print(doc.title)
Memorandum on a Discussion held by the Consul-General of the USSR in Ürümchi, G.S. DOBASHIN, with the Secretary of the Party Committee of the Xinjiang Uyghur Autonomous Region, Comrade LÜ JIANREN
Memorandum on a Discussion held by the Consul-General of the USSR in Ürümchi, G.S. DOBASHIN, with Deputy Chairman of the People’s Committee of the Xinjiang Uyghur Autonomous Region, Comrade XIN LANTING
Memorandum of a Discussion held by USSR Consul-General in Ürümchi, G S. Dobashin, with First Secretary of the Party Committee of the Xinjiang Uyghur Autonomous Region, Comrade Wang Enmao, and Chair of the People’s Committee, Comrade S. Äzizov
Note from G. Dobashin, Consul-General of the USSR in Ürümchi, to Comrades N.T. Fedorenko, Zimianin, and P.F. Iudin
Memorandum on a Discussion with Wang Huangzhang, Head of the Foreign Affairs Office of the Prefectural People's Committee
Iu. Andropov to the Central Committee of the CPSU, 'On the Struggle with Local Nationalism in China'
Note, M. Zimianin to the Central Committee of the CPSU and to Comrade Iu. V. Andropov
M. Zimianin to the Central Committee of the CPSU and to Comrade Iu. V. Andropov, 'On Manifestations of Local Nationalism in Xinjiang (PRC)'
M. Zimianin to the Department of the Central Committee of the CPSU and to Comrade Iu. V. Andropov
Read the Quickstart guide for a tutorial on working with documents and searches, or consult the Cookbook for examples of common operations.
Quickstart¶
The Document is the basic unit of content in the Digital Archive. Every document is accompanied by metadata, including
a short description of its content, information about the archive from which it was obtained, and the subjects it is tagged with,
among other details.
Most of the Digital Archive’s documents originate from outside the United States. Translations are available for most
documents, as well as original scans in some cases. The Document
model describes the available
methods and attributes for documents.
The digitalarchive
package also provides models for other kinds of resources, such as
Subject
, Collection
,
Theme
, Coverage
, and
Repository
. These models can be used as filters when searching for
documents. Consult the Public API documentation for a full description of available models.
Searching¶
The Document, Contributor, Coverage, Collection, Subject, and Repository models each expose a
match() method that can be used to search for records of that type. The method accepts
keyword arguments corresponding to the attributes of the model being matched.
>>> from digitalarchive import Document
>>> docs = Document.match(description="Cuban Missile Crisis")
The match method always returns an instance of digitalarchive.matching.ResourceMatcher. ResourceMatcher
exposes a first() method for accessing a single record and an
all() method for accessing a list of all matching records.
>>> from digitalarchive import Document
>>> docs = Document.match(description="Cuban Missile Crisis")
>>> docs.first().title
"From the Journal of S.M. Kudryavtsev, 'Record of a Conversation with Prime Minister of Cuba Fidel Castro Ruz, 21 January 1961'"
Searching for a record by its id
always returns a single record and ignores any other keyword arguments.
>>> from digitalarchive import Document
>>> test_search = Document.match(id="175898")
>>> test_search.count
1
>>> doc = test_search.first()
>>> doc.title
'Memorandum on a Discussion held by the Consul-General of the USSR in Ürümchi, G.S. DOBASHIN, with the Secretary of the Party Committee of the Xinjiang Uyghur Autonomous Region, Comrade LÜ JIANREN'
Filtering Searches¶
One can limit searches to records created between specific dates by passing a start_date
keyword, an end_date
keyword, or both.
>>> from digitalarchive import Document
>>> from datetime import date
>>> Document.match(start_date=date(1989, 4, 15), end_date=date(1989, 5, 4))
ResourceMatcher(model=<class 'digitalarchive.models.Document'>, query={'start_date': '19890415', 'end_date': '19890504', 'model': 'Record', 'q': '', 'itemsPerPage': 200}, count=22)
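The query shown in the repr above suggests that dates are sent as compact YYYYMMDD strings alongside a free-text `q` field. The following is a minimal sketch of how such a query dict could be assembled. `build_query` is a hypothetical helper, not part of digitalarchive, and joining keywords with a single space is an assumption.

```python
from datetime import date
from typing import Optional


def build_query(
    title: Optional[str] = None,
    description: Optional[str] = None,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
    items_per_page: int = 200,
) -> dict:
    """Assemble a query dict shaped like the one in the repr (illustrative only)."""
    # Assumption: keyword fields are joined with a space into one "q" string.
    keywords = " ".join(part for part in (title, description) if part)
    query = {"model": "Record", "q": keywords, "itemsPerPage": items_per_page}
    # Dates appear in the repr as YYYYMMDD strings.
    if start_date is not None:
        query["start_date"] = start_date.strftime("%Y%m%d")
    if end_date is not None:
        query["end_date"] = end_date.strftime("%Y%m%d")
    return query


query = build_query(start_date=date(1989, 4, 15), end_date=date(1989, 5, 4))
print(query["start_date"], query["end_date"])
```

Either date may be omitted, which mirrors passing only start_date or only end_date to match().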
Searches can also be limited to records contained within a specific collection, subject, or other container. Matches for
Documents can be filtered by one or more Collection
, Repository
, Coverage
, Subject
, Contributor
,
and Donor
instances:
>>> from digitalarchive import Collection, Document
>>> xinjiang_collection = Collection.match(id="491").first()
>>> xinjiang_collection.name
'“Local Nationalism" in Xinjiang, 1957-1958'
>>> docs = Document.match(collections=[xinjiang_collection])
>>> docs.count
9
Hydrating Search Results¶
Most search results return “unhydrated” instances of resources with incomplete metadata. Attributes that are not yet
available are set to None. Use the hydrate() method to download the full metadata for a resource.
>>> from digitalarchive import Document
>>> test_doc = Document.match(description="Vietnam War").first()
>>> test_doc.source is None
True
>>> test_doc.hydrate()
>>> test_doc.source
'AVPRF f. 0100, op. 34, 1946, p. 253, d. 18. Obtained and translated for CWIHP by Austin Jersild.'
It is also possible to hydrate all of the contents of a search result using the
hydrate()
method of ResourceMatcher
.
This operation can take some time for large result sets.
>>> from digitalarchive import Document
>>> docs = Document.match(description="Taiwan Strait Crisis")
>>> docs.hydrate()
When hydrating a result set, it is also possible to recursively hydrate any child records (translations, transcripts,
etc.) in the result set by setting the recurse parameter of hydrate() to True.
>>> from digitalarchive import Document
>>> docs = Document.match(description="Taiwan Strait Crisis")
>>> docs.hydrate(recurse=True)
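The hydration pattern described in this section can be pictured with a small toy model. This is only a sketch of the general idea (partial record first, full record on demand). `ToyRecord` and `fake_fetch` are hypothetical stand-ins and say nothing about how digitalarchive implements hydrate() internally.

```python
from typing import Callable, Dict, Optional


class ToyRecord:
    """Hypothetical stand-in for a resource; not part of digitalarchive."""

    def __init__(self, id: str, title: str, fetch: Callable[[str], Dict[str, str]]):
        self.id = id
        self.title = title  # available straight from a search result
        self.source: Optional[str] = None  # only filled in by hydrate()
        self._fetch = fetch

    def hydrate(self) -> None:
        """Fetch the full record and copy the missing fields onto this instance."""
        full = self._fetch(self.id)
        self.source = full["source"]


def fake_fetch(record_id: str) -> Dict[str, str]:
    # Stands in for a network request that returns the complete record.
    return {"source": f"Archive reference for record {record_id}"}


record = ToyRecord(id="1", title="Example", fetch=fake_fetch)
before = record.source  # None: the record is still unhydrated
record.hydrate()
after = record.source  # populated after hydration
```

Because each hydration implies at least one extra request in this model, hydrating only the records that are actually needed keeps large searches fast.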
Cookbook¶
Examples of common operations encountered using the Digital Archive client.
Search for a resource by keyword¶
Run a keyword search across the title, description, and document content.
>>> from digitalarchive import Document
>>> # Find a document
>>> results = Document.match(description="Cuban Missile Crisis")
>>> # Access a single record.
>>> results.first().title
"From the Journal of S.M. Kudryavtsev, 'Record of a Conversation with Prime Minister of Cuba Fidel Castro Ruz, 21 January 1961'"
Filter a Document search by language¶
- Limit a search to documents in a certain language:
>>> from digitalarchive.models import Document, Language
>>> RYaN_docs = Document.match(description="project ryan", languages=[Language(id="ger")])
>>> RYaN_docs.count
32
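The "ger" identifier in the language example is an ISO 639-2/B (bibliographic) code, which for some languages differs from the 639-2/T (terminologic) code. As a hedged illustration, the lookup below lists a handful of such codes. `BIBLIOGRAPHIC_CODES` and `bibliographic_code` are hypothetical helpers, and the table is deliberately incomplete.

```python
# A few ISO 639-2 codes: (bibliographic "B" form, terminologic "T" form).
BIBLIOGRAPHIC_CODES = {
    "German": ("ger", "deu"),
    "French": ("fre", "fra"),
    "Chinese": ("chi", "zho"),
    "Russian": ("rus", "rus"),  # identical in both forms
}


def bibliographic_code(language_name: str) -> str:
    """Return the 639-2/B code for a language in the small table above."""
    try:
        return BIBLIOGRAPHIC_CODES[language_name][0]
    except KeyError:
        raise ValueError(f"no code on file for {language_name!r}") from None


german = bibliographic_code("German")
chinese = bibliographic_code("Chinese")
```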
Filter a Document search by date¶
- Search for records after a certain date:
>>> from digitalarchive import Document
>>> from datetime import date
>>> postwar_docs = Document.match(start_date=date(1945, 9, 2))
- Search for records before a certain date:
>>> from digitalarchive import Document
>>> from datetime import date
>>> prewar_docs = Document.match(end_date=date(1945, 9, 2))
- Search for docs between two dates:
>>> from digitalarchive import Document
>>> from datetime import date
>>> coldwar_docs = Document.match(start_date=date(1945, 9, 2), end_date=date(1991, 12, 26))
Download the complete metadata for a document¶
>>> from digitalarchive import Document
>>> chernobyl_doc = Document.match(description="pripyat evacuation order").first()
>>> chernobyl_doc.repositories
>>> chernobyl_doc.repositories is None
True
>>> chernobyl_doc.hydrate()
>>> chernobyl_doc.repositories
[Repository(id='84', name='Central State Archive of Public Organizations of Ukraine (TsDAHOU)', uri=None, value=None), Repository(id='507', name='Archive of the Ukrainian National Chornobyl Museum', uri=None, value=None)]
Download the original scan of a document.¶
Original scans (referred to internally as MediaFile
) are child records of
Document
. They must be hydrated before the PDF content can be accessed.
>>> from digitalarchive import Document
>>> chernobyl_doc = Document.match(id="208406").first()
>>> original_scan = chernobyl_doc.media_files[0]
>>> original_scan.pdf is None
True
>>> original_scan.hydrate()
>>> type(original_scan.pdf)
<class 'bytes'>
>>> len(original_scan.pdf)
10936093
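Since the hydrated scan is exposed as bytes, saving it is ordinary file I/O. The sketch below uses only the standard library. `save_pdf` is a hypothetical helper, and the placeholder bytes stand in for `original_scan.pdf`.

```python
import tempfile
from pathlib import Path


def save_pdf(pdf_bytes: bytes, destination: Path) -> Path:
    """Write raw PDF bytes to `destination` and return the path."""
    # Light sanity check: PDF files normally begin with the "%PDF" marker.
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("content does not look like a PDF")
    destination.write_bytes(pdf_bytes)
    return destination


# Placeholder bytes; in practice this would be original_scan.pdf after hydrate().
sample = b"%PDF-1.4\n% placeholder content\n"
with tempfile.TemporaryDirectory() as tmp:
    saved = save_pdf(sample, Path(tmp) / "scan.pdf")
    written_size = saved.stat().st_size
```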
Download the translation or transcript of a document.¶
Like original scans, Transcript
and Translation
are
child records of Document
. They must also be hydrated before their content can be
accessed. Translations and transcripts are typically presented as HTML files, but may sometimes be presented as PDFs.
>>> from digitalarchive import Document
>>> chernobyl_doc = Document.match(id="208406").first()
>>> translation = chernobyl_doc.translations[0]
>>> translation.hydrate()
>>> translation.filename
'TranslationFile_208406.html'
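When a translation arrives as HTML, the markup often needs to be stripped before further text processing. The following sketch does this with the standard library's html.parser. It assumes the HTML is available as a string (for example from the `html` attribute listed in the Translation signature). `html_to_text` is a hypothetical helper, not a library function.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        # Keep non-empty text nodes; each node becomes one output line.
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of `html`, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = "<html><body><h1>Order</h1><p>Evacuation begins at 14:00.</p></body></html>"
text = html_to_text(sample_html)
```

Inline tags such as bold or italics split a sentence into several text nodes, so real documents may need extra whitespace handling.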
Serialize and dump a document to the filesystem.¶
>>> from digitalarchive import Document
>>> chernobyl_doc = Document.match(id="208406").first()
>>> chernobyl_doc.hydrate()
>>> chernobyl_doc_str = chernobyl_doc.json()
>>> chernobyl_doc == Document.parse_raw(chernobyl_doc_str)
True
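The example above stops at the JSON string; writing it to disk is a separate step. A minimal sketch, assuming `json()` returns a str as shown: `dump_json_text` is a hypothetical helper, and the sample string stands in for `chernobyl_doc_str`.

```python
import json
import tempfile
from pathlib import Path


def dump_json_text(json_text: str, destination: Path) -> Path:
    """Write an already-serialized JSON string to disk as UTF-8."""
    json.loads(json_text)  # fail early if the text is not valid JSON
    destination.write_text(json_text, encoding="utf-8")
    return destination


# Placeholder for the string returned by chernobyl_doc.json().
sample_json = json.dumps({"id": "208406", "title": "Example title"})
with tempfile.TemporaryDirectory() as tmp:
    path = dump_json_text(sample_json, Path(tmp) / "208406.json")
    reloaded = path.read_text(encoding="utf-8")
```

The reloaded text can then be handed to Document.parse_raw() in the same way as the in-memory string.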
Public API¶
Models¶
Match-able & Hydrate-able Models¶
-
class
digitalarchive.models.
Document
(id, uri, title, description, doc_date, frontend_doc_date, slug, source_created_at, source_updated_at, first_published_at, source, type, rights, pdf_generated_at, date_range_start, sort_string_by_coverage, main_src, model, donors, subjects, transcripts, translations, media_files, languages, contributors, creators, original_coverages, collections, attachments, links, repositories, publishers, classifications)¶ A Document corresponding to a single record page on digitalarchive.wilsoncenter.org.
Note
Avoid constructing Documents directly–use the match function to create Documents by keyword search or by ID.
Attributes present on all Documents:
-
source_created_at
¶ Timestamp of when the Document was first added to the DA.
Type: datetime.datetime
-
source_updated_at
¶ Timestamp of when the Document was last edited.
Type: datetime.datetime
-
first_published_at
¶ Timestamp of when the document was first made publicly accessible.
Type: datetime.datetime
Attributes present only on hydrated Documents
These attributes are aliases of UnhydratedField until Document.hydrate() is called on the Document.
-
type
¶ The type of the document (meeting minutes, report, etc.)
Type: digitalarchive.models.Type
-
rights
¶ A list of entities holding the copyright of the Document.
Type: list of digitalarchive.models.Right
-
pdf_generated_at
¶ The date that the combined PDF of the source, translations, and transcriptions was generated.
Type: str
-
date_range_start
¶ A rounded-down date used to standardize approximate dates for date-range matching.
Type: datetime.date
-
sort_string_by_coverage
¶ An alphanumeric identifier used by the API to sort search results.
Type: str
-
donors
¶ A list of donors whose funding made the acquisition or translation of a document possible.
Type: list of digitalarchive.models.Donor
-
subjects
¶ A list of subjects that the document is tagged with.
Type: list of digitalarchive.models.Subject
-
transcripts
¶ A list of transcripts of the document’s contents.
Type: list of digitalarchive.models.Transcript
-
translations
¶ A list of translations of the original document.
Type: list of digitalarchive.models.Translation
-
media_files
¶ A list of attached original scans of the document.
Type: list of digitalarchive.models.MediaFile
-
languages
¶ A list of languages contained in the document.
Type: list of digitalarchive.models.Language
-
original_coverages
¶ A list of geographic locations referenced in the document.
Type: list of digitalarchive.models.Coverage
-
collections
¶ A list of Collections that contain this document.
Type: list of digitalarchive.models.Collection
-
attachments
¶ A list of Documents that were attached to the Document.
Type: list of digitalarchive.models.Document
-
links
¶ A list of topically related documents.
Type: list of digitalarchive.models.Document
-
repositories
¶ A list of archives/libraries containing this document.
Type: list of digitalarchive.models.Repository
-
publishers
¶ A list of Publishers that released the document.
Type: list of digitalarchive.models.Publisher
-
classifications
¶ A list of security classification markings present on the document.
Type: list of digitalarchive.models.Classification
-
hydrate
(recurse: bool = False)¶ Downloads the complete version of the Document with metadata for any related objects.
Parameters: recurse (bool) – If true, also hydrate subordinate and related records.
-
classmethod
match
(**kwargs) → digitalarchive.matching.ResourceMatcher¶ Search for a Document by keyword, or fetch one by ID.
Matching on the Document model runs a full-text search using keywords passed via the title and description keywords. Results can also be limited by dates or by related records, as described below.
Note
Title and description keywords are not searched for individually. All keywords other than date and child-record filters are concatenated into a single query string.
Note
Collection and other related record searches use INNER JOIN logic when passed multiple related resources.
Allowed search fields:
Parameters:
- title (str, optional) – Title search keywords.
- description (str, optional) – Description search keywords.
- start_date (datetime.date, optional) – Return only Documents with a doc_date after the passed start_date.
- end_date (datetime.date, optional) – Return only Documents with a doc_date before the passed end_date.
- collections (list of digitalarchive.models.Collection, optional) – Restrict results to Documents contained in all of the passed Collections.
- publishers (list of digitalarchive.models.Publisher, optional) – Restrict results to Documents published by all of the passed Publishers.
- repositories (list of digitalarchive.models.Repository, optional) – Restrict results to Documents contained in all of the passed Repositories.
- coverages (list of digitalarchive.models.Coverage, optional) – Restrict results to Documents relating to all of the passed geographical Coverages.
- subjects (list of digitalarchive.models.Subject) – Restrict results to Documents tagged with all of the passed subjects.
- contributors (list of digitalarchive.models.Contributor) – Restrict results to Documents whose authors include all of the passed contributors.
- donors (list of digitalarchive.models.Donor) – Restrict results to Documents translated with support from all of the passed donors.
- languages (digitalarchive.models.Language or str) – Restrict results by the language of the original document. If passing a string, you must pass an ISO 639-2/B language code.
- translation (digitalarchive.models.Translation) – Restrict results to Documents for which a translation is available in the passed Language.
- theme (digitalarchive.models.Theme)
Returns: An instance of digitalarchive.matching.ResourceMatcher containing any records responsive to the search.
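The note on INNER JOIN logic means that passing several related resources narrows a search: a Document must belong to every one of them. The effect resembles a set intersection, sketched below with made-up identifiers. `in_all_collections` and the membership table are purely illustrative and involve no library calls.

```python
from typing import Dict, List, Set


def in_all_collections(membership: Dict[str, Set[str]], collection_ids: List[str]) -> Set[str]:
    """Return ids of documents that belong to every listed collection."""
    if not collection_ids:
        return set()
    sets = [membership.get(cid, set()) for cid in collection_ids]
    return set.intersection(*sets)


# Hypothetical collection -> document membership.
membership = {
    "collection-a": {"doc-1", "doc-2", "doc-3"},
    "collection-b": {"doc-2", "doc-3", "doc-4"},
}
both = in_all_collections(membership, ["collection-a", "collection-b"])
only_a = in_all_collections(membership, ["collection-a"])
```

Under this reading, adding more collections to a filter can only shrink the result set, never grow it.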
-
-
class
digitalarchive.models.
Collection
(id, name, slug, uri, parent, model, value, description, short_description, main_src, no_of_documents, is_inactive, source_created_at, source_updated_at, first_published_at)¶ A collection of Documents on a single topic
-
parent
¶ A Collection containing the Collection.
Type: digitalarchive.models.Collection
-
model
¶ A string name of the model used to differentiate Collection and Document searches in the DA API.
Type: str
-
source_created_at
¶ Timestamp of when the Collection was first added to the DA.
Type: datetime.datetime
-
source_updated_at
¶ Timestamp of when the Collection was last edited.
Type: datetime.datetime
-
first_published_at
¶ Timestamp of when the Collection was first made publicly accessible.
Type: datetime.datetime
-
hydrate
()¶ Populate all unhydrated fields of a resource.
-
classmethod
match
(**kwargs) → digitalarchive.matching.ResourceMatcher¶ Find a resource using passed keyword arguments.
Note
If called without arguments, returns all records in the DA.
-
-
class
digitalarchive.models.
Subject
(id, name, uri, value)¶ A historical topic to which documents can be related.
-
hydrate
()¶ Populate all unhydrated fields of a resource.
-
classmethod
match
(**kwargs) → digitalarchive.matching.ResourceMatcher¶ Find a resource using passed keyword arguments.
Note
If called without arguments, returns all records in the DA.
-
-
class
digitalarchive.models.
Coverage
(id, name, uri, value, parent, children)¶ A geographical area referenced by a Document.
-
children
¶ (list of Coverage): Subordinate geographical areas, if any.
-
hydrate
()¶ Populate all unhydrated fields of a resource.
-
classmethod
match
(**kwargs) → digitalarchive.matching.ResourceMatcher¶ Find a resource using passed keyword arguments.
Note
If called without arguments, returns all records in the DA.
-
-
class
digitalarchive.models.
Contributor
(id, name, value, uri)¶ An individual person or organization that contributed to the creation of the document.
Contributors are typically the Document’s author, but for meeting minutes and similar documents, a Contributor may simply be somebody who was in attendance at the meeting.
-
hydrate
()¶ Populate all unhydrated fields of a resource.
-
classmethod
match
(**kwargs) → digitalarchive.matching.ResourceMatcher¶ Find a resource using passed keyword arguments.
Note
If called without arguments, returns all records in the DA.
-
-
class
digitalarchive.models.
Repository
(id, name, value, uri)¶ The archive or library possessing the original, physical Document.
-
hydrate
()¶ Populate all unhydrated fields of a resource.
-
classmethod
match
(**kwargs) → digitalarchive.matching.ResourceMatcher¶ Find a resource using passed keyword arguments.
Note
If called without arguments, returns all records in the DA.
-
Hydrate-able Models¶
-
class
digitalarchive.models.
Transcript
(id, filename, content_type, extension, asset_id, source_created_at, source_updated_at, url, html, pdf, raw)¶ A transcript of a document in its original language.
-
hydrate
()¶ Populate all unhydrated fields of a
digitalarchive.models._Asset
.
-
-
class
digitalarchive.models.
Translation
(id, filename, content_type, extension, asset_id, source_created_at, source_updated_at, url, html, pdf, raw, language)¶ A translation of a Document into another language.
-
language
¶ Type: digitalarchive.models.Language
-
hydrate
()¶ Populate all unhydrated fields of a
digitalarchive.models._Asset
.
-
-
class
digitalarchive.models.
MediaFile
(id, filename, content_type, extension, asset_id, source_created_at, source_updated_at, path, html, pdf, raw)¶ An original scan of a Document.
-
hydrate
()¶ Populate all unhydrated fields of a
digitalarchive.models._Asset
.
-
-
class
digitalarchive.models.
Theme
(id, slug, title, value, description, main_src, uri, featured_resources, has_map, has_timeline, featured_collections, dates_with_events)¶ A parent container for collections on a single geopolitical topic.
Note
Themes never appear on any record model, but can be passed as a search param to Document.
-
main_src
¶ A URI for the Theme’s banner image on the Digital Archive website.
-
has_map
¶ A boolean value for whether the Theme has an accompanying map on the Digital Archive website.
Type: str
-
has_timeline
¶ A boolean value for whether the Theme has a Timeline on the Digital Archive website.
Type: str
-
featured_collections
¶ A list of related collections.
Type: list of Collection
-
hydrate
()¶ Populate all unhydrated fields of a resource.
-
Other Models¶
-
class
digitalarchive.models.
Language
(id, name)¶ The original language of a resource.
-
class
digitalarchive.models.
Donor
(id, name)¶ An entity whose resources helped publish or translate a document.
-
class
digitalarchive.models.
Type
(id, name)¶ The type of a document (memo, report, etc).
-
class
digitalarchive.models.
Right
(id, name)¶ A copyright notice attached to the Document.
-
class
digitalarchive.models.
Classification
(id, name)¶ A classification marking applied to the original Document.
Matching¶
-
class
digitalarchive.matching.
ResourceMatcher
(resource_model: digitalarchive.models.Resource, items_per_page=200, **kwargs)¶ Runs a search against the DA API for the provided DA model and keywords.
ResourceMatcher wraps search results and exposes methods for interacting with the resultant set of resources.
-
list
¶ Search results. Handles pagination of the DA API.
Type: Generator of digitalarchive.models.Resource
-
count
¶ The number of records responsive to the given search.
-
all
() → List[digitalarchive.models.Resource]¶ Exhaust the results generator and return a list of all search results.
-
first
() → digitalarchive.models.Resource¶ Return only the first record from a search result.
-
hydrate
(recurse: bool = False)¶ Hydrate all of the resources in a search result.
-
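The description of ResourceMatcher (a generator that handles pagination, plus first() and all()) can be illustrated with a toy class. This is a hypothetical sketch of the general pattern, not the library's implementation. `ToyMatcher`, `fetch_page`, and the `results` attribute are invented for the example.

```python
from typing import Callable, Iterator, List


class ToyMatcher:
    """Hypothetical generator-backed result set; not the library's class."""

    def __init__(self, fetch_page: Callable[[int], List[str]], count: int):
        self.count = count
        self.results = self._paginate(fetch_page)

    def _paginate(self, fetch_page: Callable[[int], List[str]]) -> Iterator[str]:
        # Request pages lazily, one at a time, until `count` items are yielded.
        page, yielded = 1, 0
        while yielded < self.count:
            items = fetch_page(page)
            if not items:
                break
            for item in items:
                yield item
                yielded += 1
            page += 1

    def first(self) -> str:
        return next(self.results)

    def all(self) -> List[str]:
        return list(self.results)


PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
requested: List[int] = []


def fetch_page(page: int) -> List[str]:
    requested.append(page)  # record which pages were actually requested
    return PAGES.get(page, [])


matcher = ToyMatcher(fetch_page, count=5)
first_item = matcher.first()
pages_after_first = list(requested)
remaining = matcher.all()
```

In this toy, first() and all() share one generator, so all() returns only what first() has not already consumed. The real class may behave differently.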