Table of Contents
API Reference: Corpus
API Reference: Partition
API Reference: Utilities
API Reference: Corpus
Overview
The Corpus class represents a collection of textual data, providing methods to manage, serialize, and query corpora. It allows loading corpora from JSON and XML formats, retrieving metadata, and filtering content based on different criteria.
Class: Corpus
Constructor
Corpus(name: str = "", corpus=None)
Initializes a Corpus object.
Parameters:
name (optional): The name of the corpus.
corpus (optional): A dictionary representing the corpus structure.
Static Methods
deserialize_from_json
staticmethod deserialize_from_json(filepath: str, name: str) -> Corpus
Factory method for creating Corpus objects from a JSON file. The JSON file should have been created by the instance method serialize().
Parameters:
filepath: Path to the JSON file.
name: Name of the corpus.
Returns: The deserialized Corpus object.
deserialize_from_xml
staticmethod deserialize_from_xml(lp: Union[range, int], path: str, name: str) -> Corpus
Factory method for creating Corpus objects from an XML corpus. The XML corpus should comply with the structure of the original GermaParlTEI corpus, which is fetched with utilities.clone_corpus() or created with Partition.serialize_corpus_as_xml().
Parameters:
lp: Legislative term(s) as a range or integer.
path (optional): Path to the corpus directory.
name (optional): Name of the new corpus object.
Returns: The deserialized Corpus object.
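Because lp accepts either a single integer or a range, a loader has to normalize it before iterating over legislative terms. The following is a minimal sketch of such a normalization; normalize_lp is a hypothetical helper used for illustration and is not part of the documented API.

```python
from typing import Union


def normalize_lp(lp: Union[range, int]) -> list[int]:
    """Return the legislative terms as a list, whether lp is an int or a range."""
    if isinstance(lp, int):
        return [lp]
    return list(lp)


# A single term and a span of terms (a range excludes its upper bound).
single_term = normalize_lp(19)
several_terms = normalize_lp(range(17, 20))
```

Note that range(17, 20) covers the terms 17 through 19, so the upper bound has to be one higher than the last term wanted.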
Instance Methods
serialize
serialize(path: str) -> None
Serializes a Corpus object as a JSON file.
Parameters:
path: Path where the JSON file should be saved.
get_corpus
get_corpus(deep: bool = False) -> dict[str, dict[str, Any]]
Returns a copy of the instance’s corpus.
Parameters:
deep (optional): If True, a deep copy is returned.
Returns: A copy of the instance’s corpus.
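The difference the deep flag makes can be illustrated with Python's copy module. This sketch assumes that deep=False corresponds to a shallow copy in the sense of copy.copy; the dictionary below is a hypothetical stand-in, and the real entry layout may differ.

```python
import copy

# Hypothetical stand-in for a corpus dictionary (key and fields are invented).
corpus = {"19_1": {"metadata": {"date": "original"}}}

shallow = copy.copy(corpus)    # new outer dict, inner dicts are shared
deep = copy.deepcopy(corpus)   # fully independent structure

# Mutating a nested value in the source shows which copy is affected.
corpus["19_1"]["metadata"]["date"] = "changed"

shallow_date = shallow["19_1"]["metadata"]["date"]
deep_date = deep["19_1"]["metadata"]["date"]
```

Under that assumption, a deep copy is the safer choice whenever the returned structure is going to be modified.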
get_metadata
get_metadata(key: str) -> dict[str, str]
Retrieves metadata for a specific corpus entry, representing a document.
Parameters:
key: The corpus entry (#legislativePeriod_#sessionNumber).
Returns: The metadata on the corpus entry, i.e., the document.
get_partition_by_sp_attribute
get_partition_by_sp_attribute(attribute: str, value: str) -> Partition
Template method that filters the corpus based on an attribute of sp elements.
For every div element that has at least one child sp element matching the given attribute and value pair, the div element is collected together with the matching sp elements and all of their child elements. sp elements within the same div element that do not match are not collected. All collected elements are assembled in a new ElementTree with a <body> element as the root. The result is returned as a new Partition object that keeps the original keys and metadata.
Parameters:
attribute: The attribute name to filter by.
value: The attribute value to match.
Returns: A Partition object containing the matching elements.
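The collection rule described above can be sketched with the standard library's ElementTree. This is an illustration of the described behavior, not the library's implementation; the sample document, its attribute values, and the function name filter_by_sp_attribute are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real GermaParlTEI files carry more structure.
SAMPLE = """
<TEI><text><body>
  <div>
    <sp name="A. Example" party="X"><p>First speech.</p></sp>
    <sp name="B. Sample" party="Y"><p>Second speech.</p></sp>
  </div>
  <div>
    <sp name="B. Sample" party="Y"><p>Third speech.</p></sp>
  </div>
</body></text></TEI>
"""


def filter_by_sp_attribute(root: ET.Element, attribute: str, value: str) -> ET.Element:
    """Build a new <body> holding only divs with at least one matching <sp>."""
    new_body = ET.Element("body")
    for div in root.iter("div"):
        matches = [sp for sp in div.findall("sp") if sp.get(attribute) == value]
        if not matches:
            continue  # divs without a matching speech are dropped entirely
        new_div = ET.SubElement(new_body, "div", dict(div.attrib))
        for sp in matches:
            new_div.append(copy.deepcopy(sp))  # the sp plus all of its children
    return new_body


filtered = filter_by_sp_attribute(ET.fromstring(SAMPLE), "party", "X")
kept_speakers = [sp.get("name") for sp in filtered.iter("sp")]
```

Copying the matched elements rather than moving them leaves the source tree untouched, which matters if the same corpus is filtered more than once.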
get_speeches_from_politician
get_speeches_from_politician(person: str, attribute_name: str = "name") -> Partition
Filters speeches by politician name with get_partition_by_sp_attribute().
Parameters:
person: The name of the politician.
attribute_name (optional): Attribute name for the identification of the politician’s name.
Returns: A Partition object containing speeches by the specified politician.
get_speeches_from_party
get_speeches_from_party(party: str) -> Partition
Filters speeches by party affiliation with get_partition_by_sp_attribute().
Parameters:
party: The name of the party.
Returns: A Partition object containing speeches from the specified party.
get_speeches_from_role
get_speeches_from_role(role: str, attribute_name: str = "role") -> Partition
Filters speeches by role with get_partition_by_sp_attribute().
Parameters:
role: The role to filter by.
attribute_name (optional): Attribute name for the role identification.
Returns: A Partition object containing speeches from the specified role.
get_speeches_from_condition
_get_speeches_from_condition(condition: Callable[[str], bool]) -> Partition
Template method that filters speeches based on a user-defined condition on their content.
For every <div> element that has at least one child <sp> element fulfilling the condition, the <div> element is collected together with the matching <sp> elements and all of their child elements. <sp> elements within the same <div> element that do not fulfill the condition are not collected. All collected elements are assembled in a new ElementTree with a <body> element as the root. The result is returned as a new Partition object that keeps the original keys and metadata.
Parameters:
condition: A function that takes a string (speech text) and returns a boolean.
Returns: A Partition object containing speeches that match the condition.
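A condition is simply a callable that receives the speech text and returns True or False. The sketch below shows how such a callable could drive the selection using the standard library's ElementTree. It assumes that the speech text is the concatenated text of the <sp> subtree, which this reference does not specify; the function name filter_by_condition and the sample data are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET
from typing import Callable

# Hypothetical sample: two divs, three speeches.
SAMPLE = (
    "<body>"
    "<div><sp><p>We debate the budget.</p></sp><sp><p>Point of order.</p></sp></div>"
    "<div><sp><p>Thank you.</p></sp></div>"
    "</body>"
)


def filter_by_condition(root: ET.Element, condition: Callable[[str], bool]) -> ET.Element:
    """Keep only <sp> elements whose text satisfies the condition."""
    new_body = ET.Element("body")
    for div in root.iter("div"):
        # Assumption: the text handed to the condition is the whole <sp> subtree.
        matches = [sp for sp in div.findall("sp") if condition("".join(sp.itertext()))]
        if matches:
            new_div = ET.SubElement(new_body, "div", dict(div.attrib))
            for sp in matches:
                new_div.append(copy.deepcopy(sp))
    return new_body


result = filter_by_condition(ET.fromstring(SAMPLE), lambda text: "budget" in text)
matched_texts = ["".join(sp.itertext()) for sp in result.iter("sp")]
```

Any function with the signature Callable[[str], bool] fits here, which is what allows the keyword, word-list, and regex filters to share one template method.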
get_speeches_from_keyword
get_speeches_from_keyword(keyword: str, case_sensitive: bool = False) -> Partition
Filters speeches containing a specified keyword with get_speeches_from_condition().
Parameters:
keyword: The keyword to search for.
case_sensitive (optional): Whether the search is case-sensitive.
Returns: A Partition object containing speeches with the keyword.
get_speeches_from_word_list
get_speeches_from_word_list(word_list: list[str], case_sensitive: bool = False) -> Partition
Filters speeches containing any word from a given list.
Parameters:
word_list: A list of words to search for.
case_sensitive (optional): Whether the search is case-sensitive.
Returns: A Partition object containing speeches with any of the specified words.
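Both the keyword filter and the word-list filter can be expressed as conditions over the speech text. The following sketch shows one way to build such predicates. It assumes plain substring matching; whether the library matches substrings or whole words is not stated here. The helper names are hypothetical.

```python
from typing import Callable


def keyword_condition(keyword: str, case_sensitive: bool = False) -> Callable[[str], bool]:
    """Return a predicate that is True when the text contains the keyword."""
    if case_sensitive:
        return lambda text: keyword in text
    needle = keyword.lower()
    return lambda text: needle in text.lower()


def word_list_condition(words: list[str], case_sensitive: bool = False) -> Callable[[str], bool]:
    """Return a predicate that is True when the text contains any listed word."""
    checks = [keyword_condition(word, case_sensitive) for word in words]
    return lambda text: any(check(text) for check in checks)


speech = "Die Energiewende ist ein zentrales Thema."
loose = keyword_condition("energiewende")(speech)
strict = keyword_condition("energiewende", case_sensitive=True)(speech)
any_word = word_list_condition(["Klima", "Thema"])(speech)
```

With case_sensitive left at its default of False, the lowercase keyword matches the capitalized noun, while the case-sensitive variant does not.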
get_speeches_from_regex
get_speeches_from_regex(pattern: str) -> Partition
Filters speeches using a regular expression with get_speeches_from_condition().
Parameters:
pattern: Regular expression pattern to match.
Returns: A Partition object containing speeches that match the pattern.
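A regular expression can be wrapped into a condition in the same way. This sketch uses re.search, which matches anywhere in the text; whether the library uses search, match, or fullmatch is not stated in this reference, so treat that choice as an assumption. regex_condition is a hypothetical helper.

```python
import re
from typing import Callable


def regex_condition(pattern: str) -> Callable[[str], bool]:
    """Return a predicate that is True when the pattern occurs in the text."""
    compiled = re.compile(pattern)  # compile once, reuse for every speech
    return lambda text: compiled.search(text) is not None


# Matches a four-digit year starting with 19 or 20.
has_year = regex_condition(r"\b(19|20)\d{2}\b")
with_year = has_year("Das Gesetz von 2015 wird geändert.")
without_year = has_year("Das Gesetz wird geändert.")
```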
Magic Methods
len
__len__() -> int
Returns the number of entries in the corpus.
Returns: Number of entries in the corpus.
bool
__bool__() -> bool
Returns whether the corpus is non-empty.
Returns: True if the corpus contains entries, otherwise False.
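The two magic methods make a corpus usable with len() and in boolean contexts. The class below is a hypothetical dict-backed stand-in that shows the documented behavior; it is not the library's Corpus class.

```python
class MiniCorpus:
    """Hypothetical dict-backed stand-in, not the library's Corpus class."""

    def __init__(self, corpus=None):
        self.corpus = corpus if corpus is not None else {}

    def __len__(self) -> int:
        # Number of entries (documents) held by the corpus.
        return len(self.corpus)

    def __bool__(self) -> bool:
        # True only if at least one entry is present.
        return len(self.corpus) > 0


empty = MiniCorpus()
filled = MiniCorpus({"19_1": {}})
summary = (len(empty), bool(empty), len(filled), bool(filled))
```

In practice this allows a check such as "if partition:" after a filter call to test whether anything matched.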
Private Methods for internal use
load_xml_for_legislative_period
__load_xml_for_legislative_period(lp: int, path: str) -> None
Loads XML files for a given legislative period.
extract_metadata
staticmethod __extract_metadata(tree: ElementTree) -> dict[str, str]
Extracts metadata from an XML document.
get_text
staticmethod __get_text(element: Element, path: str) -> str
Extracts text from an XML element.
API Reference: Partition
Overview
The Partition class represents a subset (partition) of a corpus. Objects of this class are created by the retrieval methods of the parent class Corpus.
Class: Partition
Constructor
Partition(name: str = "", corpus=None)
Initializes a Partition object by calling the constructor of the superclass Corpus.
Parameters:
corpus (optional): A dictionary representing the corpus structure.
Instance Methods
serialize_corpus_as_xml
serialize_corpus_as_xml(path: str = "derived_corpus") -> None
Serializes the corpus as a set of XML files. Generates one XML file for every entry in the corpus, in a folder that is created at runtime under "path".
Parameters:
path (optional): The path and name of the folder to be created.
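The one-file-per-entry layout can be sketched with the standard library. The function name write_entries_as_xml, the file naming scheme (key plus ".xml"), and the sample entry are assumptions for illustration; the library's actual output format, including the TEI header, is not reproduced here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_entries_as_xml(entries: dict[str, ET.Element], path: str) -> list[str]:
    """Write one XML file per entry into `path`, creating the folder if needed."""
    os.makedirs(path, exist_ok=True)
    written = []
    for key, root in entries.items():
        target = os.path.join(path, f"{key}.xml")  # file naming is an assumption
        ET.ElementTree(root).write(target, encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written


body = ET.fromstring("<body><div><sp><p>Example.</p></sp></div></body>")
with tempfile.TemporaryDirectory() as tmp:
    files = write_entries_as_xml({"19_1": body}, os.path.join(tmp, "derived_corpus"))
    file_names = [os.path.basename(f) for f in files]
    reloaded_tag = ET.parse(files[0]).getroot().tag
```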
Private Methods for internal use
create_tei_header
staticmethod __create_tei_header(metadata: dict) -> Element
Creates a <teiHeader> element from the given metadata dictionary.
Parameters:
metadata: The metadata dictionary.
Returns: The <teiHeader> element as an Element object.
API Reference: Utilities
Overview
The Utilities module provides functions to retrieve data from Corpus objects and to clone the GermaParlTEI corpus from GitHub. It includes helpers for extracting text and attributes from corpus elements.
Module: Utilities
Functions
clone_corpus
clone_corpus(repo_url: str = "https://github.com/PolMine/GermaParlTEI.git") -> None
Clones the GermaParlTEI corpus from GitHub.
Parameters:
repo_url (optional): The URL of the repository. Defaults to https://github.com/PolMine/GermaParlTEI.git.
get_paragraphs_from_element
get_paragraphs_from_element(element: Element) -> list[str]
Extracts the text content from all <p> elements that are descendants of the given XML element.
Parameters:
element: The parent XML element.
Returns: A list of text content from all <p> elements found in the subtree.
get_interjections_from_element
get_interjections_from_element(element: Element) -> list[str]
Extracts the text content from all <stage> elements that are descendants of the given XML element.
Parameters:
element: The parent XML element.
Returns: A list of text content from all <stage> elements found in the subtree.
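The two element-level helpers differ only in the tag they look for. A minimal sketch of the underlying traversal with the standard library's ElementTree follows; texts_of is a hypothetical helper and the speech fragment is invented, so this should be read as an illustration rather than the module's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical speech fragment with two paragraphs and one interjection.
sp = ET.fromstring(
    "<sp><p>First paragraph.</p>"
    "<stage>(Applause)</stage>"
    "<p>Second paragraph.</p></sp>"
)


def texts_of(element: ET.Element, tag: str) -> list[str]:
    """Collect the text of every descendant with the given tag, in document order."""
    return ["".join(node.itertext()) for node in element.iter(tag)]


paragraphs = texts_of(sp, "p")
interjections = texts_of(sp, "stage")
```

Separating <p> from <stage> in this way keeps the speaker's own words apart from interjections, so the two can be analyzed independently.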
get_paragraphs_from_corpus
get_paragraphs_from_corpus(corpus: Corpus) -> list[str]
Extracts the text content from all <p> elements, which contain speeches without interjections, from the given Corpus object.
Parameters:
corpus: The Corpus object.
Returns: A list of text content from all <p> elements in the corpus.
get_interjections_from_corpus
get_interjections_from_corpus(corpus: Corpus) -> list[str]
Extracts the text content from all <stage> elements, which contain interjections to speeches, from the given Corpus object.
Parameters:
corpus: The Corpus object.
Returns: A list of text content from all <stage> elements in the corpus.
extract_element_attributes
extract_element_attributes(corpus: Corpus, tag_name: str) -> list[str]
Extracts all unique attributes of a specified tag from all documents in the corpus.
Parameters:
corpus: The Corpus object containing the documents.
tag_name: The name of the tag whose attributes should be retrieved.
Returns: A list of unique attributes found in the specified tag.
extract_attribute_values
extract_attribute_values(corpus: Corpus, tag: str, attribute: str) -> list[str]
Extracts the values of a specific attribute from the specified tag across all documents in the corpus.
Parameters:
corpus: The Corpus object containing the documents.
tag: The tag whose attributes should be searched.
attribute: The name of the attribute whose values should be extracted.
Returns: A list of unique values for the specified attribute.
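The two attribute helpers can be pictured as set-building passes over all documents: one collects attribute names, the other collects the values of a single attribute. The sketch below works on a plain list of parsed documents instead of a Corpus object. The function names, the sample attributes, and the sorted output order are assumptions, since this reference does not state the order of the returned lists.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; the attribute names are illustrative only.
documents = [
    ET.fromstring('<body><sp name="A" party="X"/><sp name="B" party="Y"/></body>'),
    ET.fromstring('<body><sp name="A" party="X" role="mp"/></body>'),
]


def attribute_names(docs: list[ET.Element], tag: str) -> list[str]:
    """Unique attribute names used on `tag` across all documents."""
    return sorted({name for doc in docs for el in doc.iter(tag) for name in el.attrib})


def attribute_values(docs: list[ET.Element], tag: str, attribute: str) -> list[str]:
    """Unique values of `attribute` on `tag` across all documents."""
    return sorted({
        el.get(attribute)
        for doc in docs
        for el in doc.iter(tag)
        if el.get(attribute) is not None  # skip elements lacking the attribute
    })


names = attribute_names(documents, "sp")
parties = attribute_values(documents, "sp", "party")
```

Listing names and values first is a practical way to find valid arguments for get_partition_by_sp_attribute().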