Table of Contents
API Reference: Corpus
API Reference: Partition
API Reference: Utilities
API Reference: Corpus
Overview
The Corpus class represents a collection of textual data, providing methods to manage, serialize, and query corpora. It allows loading corpora from JSON and XML formats, retrieving metadata, and filtering content based on different criteria.
Class: Corpus
Constructor
Corpus(name: str = "", corpus=None)
Initializes a Corpus object.
Parameters:
name
(optional): The name of the corpus.
corpus
(optional): A dictionary representing the corpus structure.
Static Methods
deserialize_from_json
staticmethod deserialize_from_json(filepath: str, name: str) -> Corpus
Factory method that creates a Corpus object from a JSON file. The JSON file must have been created by the instance method serialize().
Parameters:
filepath
: Path to the JSON file.
name
: Name of the corpus.
Returns: The deserialized Corpus object.
deserialize_from_xml
staticmethod deserialize_from_xml(lp: Union[range, int], path: str, name: str) -> Corpus
Factory Method for creating Corpus objects from an XML corpus. The XML corpus should comply with the structure of the original GermaParlTEI Corpus that is fetched with utilities.clone_corpus() or created with Partition.serialize_corpus_as_xml().
Parameters:
lp
: Legislative term(s) as a range or integer.
path
(optional): Path to the corpus directory.
name
(optional): Name of the new corpus object.
Returns: The deserialized Corpus object.
Instance Methods
serialize
serialize(path: str) -> None
Serializes a Corpus object as a JSON file.
Parameters:
path
: Path where the JSON file should be saved.
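The round trip between serialize() and deserialize_from_json() can be pictured with the standard json module. This is only a minimal sketch: the key "19_1", the "metadata" and "content" fields, and the helper names save_json and load_json are hypothetical stand-ins, not the library's actual storage layout or API.

```python
import json
import os
import tempfile

# Hypothetical corpus layout: entry key -> metadata plus XML content as a string.
corpus = {
    "19_1": {
        "metadata": {"title": "example session"},
        "content": "<body><div><sp name='Alice'><p>Hello.</p></sp></div></body>",
    }
}


def save_json(data: dict, path: str) -> None:
    """Write the dictionary to a UTF-8 encoded JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, ensure_ascii=False, indent=2)


def load_json(path: str) -> dict:
    """Read a dictionary back from a JSON file written by save_json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "corpus.json")
    save_json(corpus, target)
    restored = load_json(target)
```

JSON can only hold plain data types, which is why this sketch keeps the XML content as a string rather than as a parsed tree.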
get_corpus
get_corpus(deep: bool = False) -> dict[str, dict[str, Any]]
Returns a copy of the instance’s corpus.
Parameters:
deep
(optional): If True, a deep copy is returned.
Returns: A copy of the instance’s corpus.
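The practical difference between a shallow and a deep copy of a nested dictionary can be shown with the standard copy module. Whether get_corpus() uses copy.copy and copy.deepcopy internally is an assumption; the dictionary below is a made-up stand-in for a corpus.

```python
import copy

# Made-up nested dictionary standing in for a corpus.
corpus = {"19_1": {"metadata": {"title": "example"}}}

shallow = copy.copy(corpus)   # new outer dict, inner dicts are shared
deep = copy.deepcopy(corpus)  # every nesting level is duplicated

# Mutating a nested value through the shallow copy also changes the original.
shallow["19_1"]["metadata"]["title"] = "changed"

original_title = corpus["19_1"]["metadata"]["title"]
deep_title = deep["19_1"]["metadata"]["title"]
```

After this runs, original_title reflects the change made through the shallow copy, while deep_title still holds the initial value. A deep copy is therefore the safer choice when the returned corpus will be modified.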
get_metadata
get_metadata(key: str) -> dict[str, str]
Retrieves metadata for a specific corpus entry, representing a document.
Parameters:
key
: The corpus entry (#legislativePeriod_#sessionNumber).
Returns: The metadata on the corpus entry, i.e., the document.
get_partition_by_sp_attribute
get_partition_by_sp_attribute(attribute: str, value: str) -> Partition
Template method that filters the corpus based on an attribute of sp elements.
A div element is collected if at least one of its child sp elements has the specified attribute and value. Within a collected div element, only the matching sp elements are kept, together with all of their child elements; sp elements that do not fulfill the condition are not collected. All collected elements are assembled in a new ElementTree with a <body> element as the root. The new corpus is indexed with the old key and the old metadata in a new Partition object.
Parameters:
attribute
: The attribute name to filter by.
value
: The attribute value to match.
Returns: A Partition object containing the matching elements.
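The collection rule described for get_partition_by_sp_attribute can be sketched with xml.etree.ElementTree. This is an illustration under stated assumptions, not the library's implementation: the function name filter_by_sp_attribute and the sample document are hypothetical, and the real method additionally wraps the result in a Partition with the old key and metadata.

```python
import copy
import xml.etree.ElementTree as ET

# Made-up sample document with the div/sp/p nesting the text describes.
SAMPLE = """
<body>
  <div>
    <sp name="Alice" party="X"><p>First speech.</p></sp>
    <sp name="Bob" party="Y"><p>Second speech.</p></sp>
  </div>
  <div>
    <sp name="Bob" party="Y"><p>Third speech.</p></sp>
  </div>
</body>
"""


def filter_by_sp_attribute(root: ET.Element, attribute: str, value: str) -> ET.Element:
    """Return a new <body> holding only divs with matching sp children."""
    new_root = ET.Element("body")
    for div in root.iter("div"):
        # Keep only the direct sp children whose attribute equals the value.
        matches = [sp for sp in div.findall("sp") if sp.get(attribute) == value]
        if not matches:
            continue  # a div without any matching sp is not collected
        new_div = ET.SubElement(new_root, "div", dict(div.attrib))
        for sp in matches:
            # Deep copy so the source tree stays untouched.
            new_div.append(copy.deepcopy(sp))
    return new_root


filtered = filter_by_sp_attribute(ET.fromstring(SAMPLE), "name", "Alice")
speakers = [sp.get("name") for sp in filtered.iter("sp")]
```

Copying the matched elements instead of moving them keeps the original corpus intact, which matters when several partitions are derived from the same corpus.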
get_speeches_from_politician
get_speeches_from_politician(person: str, attribute_name: str = "name") -> Partition
Filters speeches by politician name with get_partition_by_sp_attribute().
Parameters:
person
: The name of the politician.
attribute_name
(optional): Attribute name for the identification of the politician’s name.
Returns: A Partition object containing speeches by the specified politician.
get_speeches_from_party
get_speeches_from_party(party: str) -> Partition
Filters speeches by party affiliation with get_partition_by_sp_attribute().
Parameters:
party
: The name of the party.
Returns: A Partition object containing speeches from the specified party.
get_speeches_from_role
get_speeches_from_role(role: str, attribute_name: str = "role") -> Partition
Filters speeches by role with get_partition_by_sp_attribute().
Parameters:
role
: The role to filter by.
attribute_name
(optional): Attribute name for the role identification.
Returns: A Partition object containing speeches from the specified role.
get_speeches_from_condition
_get_speeches_from_condition(condition: Callable[[str], bool]) -> Partition
Template method that filters speeches based on a user-defined condition on their content.
A <div> element is collected if at least one of its child <sp> elements fulfills the condition. Within a collected <div> element, only the matching <sp> elements are kept, together with all of their child elements; <sp> elements that do not fulfill the condition are not collected. All collected elements are assembled in a new ElementTree with a <body> element as the root. The new corpus is indexed with the old key and the old metadata in a new Partition object.
Parameters:
condition
: A function that takes a string (speech text) and returns a boolean.
Returns: A Partition object containing speeches that match the condition.
get_speeches_from_keyword
get_speeches_from_keyword(keyword: str, case_sensitive: bool = False) -> Partition
Filters speeches containing a specified keyword with get_speeches_from_condition().
Parameters:
keyword
: The keyword to search for.
case_sensitive
(optional): Whether the search is case-sensitive.
Returns: A Partition object containing speeches with the keyword.
get_speeches_from_word_list
get_speeches_from_word_list(word_list: list[str], case_sensitive: bool = False) -> Partition
Filters speeches containing any word from a given list.
Parameters:
word_list
: A list of words to search for.
case_sensitive
(optional): Whether the search is case-sensitive.
Returns: A Partition object containing speeches with any of the specified words.
get_speeches_from_regex
get_speeches_from_regex(pattern: str) -> Partition
Filters speeches using a regular expression with get_speeches_from_condition().
Parameters:
pattern
: Regular expression pattern to match.
Returns: A Partition object containing speeches that match the pattern.
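The keyword, word list, and regex filters can each be expressed as a Callable[[str], bool] of the kind the condition-based template method expects. The sketch below shows one way to build such predicates. The builder names are hypothetical, and two details are assumptions rather than documented behavior: keywords are matched as substrings, and the regular expression is applied with search rather than match.

```python
import re
from typing import Callable


def keyword_condition(keyword: str, case_sensitive: bool = False) -> Callable[[str], bool]:
    """Build a predicate that checks for a substring (assumed matching rule)."""
    if case_sensitive:
        return lambda text: keyword in text
    lowered = keyword.lower()
    return lambda text: lowered in text.lower()


def word_list_condition(words: list[str], case_sensitive: bool = False) -> Callable[[str], bool]:
    """Build a predicate that is true if any word of the list occurs."""
    checks = [keyword_condition(word, case_sensitive) for word in words]
    return lambda text: any(check(text) for check in checks)


def regex_condition(pattern: str) -> Callable[[str], bool]:
    """Build a predicate that is true if the pattern matches anywhere."""
    compiled = re.compile(pattern)
    return lambda text: compiled.search(text) is not None


# Made-up speech texts for demonstration.
speeches = ["Die Rente ist sicher.", "Wir reden über Klima.", "Steuern und RENTE."]

by_keyword = [s for s in speeches if keyword_condition("rente")(s)]
by_keyword_exact = [s for s in speeches if keyword_condition("Rente", True)(s)]
by_word_list = [s for s in speeches if word_list_condition(["klima", "steuern"])(s)]
by_regex = [s for s in speeches if regex_condition(r"\bKlima\b")(s)]
```

Expressing every filter as a predicate keeps the tree traversal in one place and leaves only the matching rule to vary.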
Magic Methods
len
__len__() -> int
Returns the number of entries in the corpus.
Returns: Number of entries in the corpus.
bool
__bool__() -> bool
Returns whether the corpus is non-empty.
Returns: True if the corpus contains entries, otherwise False.
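The behavior of the two magic methods can be illustrated with a small stand-in class. TinyCorpus is hypothetical and only mirrors the documented semantics of __len__ and __bool__; it is not the Corpus class itself.

```python
class TinyCorpus:
    """Hypothetical stand-in that wraps a dictionary of corpus entries."""

    def __init__(self, corpus=None):
        # Fall back to an empty dictionary when no corpus is given.
        self._corpus = dict(corpus) if corpus else {}

    def __len__(self) -> int:
        # Number of entries in the corpus.
        return len(self._corpus)

    def __bool__(self) -> bool:
        # True only when at least one entry is present.
        return len(self._corpus) > 0


empty = TinyCorpus()
filled = TinyCorpus({"19_1": {}, "19_2": {}})

if not empty:
    status = "nothing to process"
```

With these methods in place, an empty result of a filter can be detected with a plain truth test, as in the final lines of the sketch.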
Private Methods for internal use
load_xml_for_legislative_period
__load_xml_for_legislative_period(lp: int, path: str) -> None
Loads XML files for a given legislative period.
extract_metadata
staticmethod __extract_metadata(tree: ElementTree) -> dict[str, str]
Extracts metadata from an XML document.
get_text
staticmethod __get_text(element: Element, path: str) -> str
Extracts text from an XML element.
API Reference: Partition
Overview
The Partition class represents a partition of a corpus. Objects of this class are created by the retrieval methods of the parent class Corpus.
Class: Partition
Constructor
Corpus(name: str = "", corpus=None)
Initializes a Partition object by calling the superclass Corpus.
Parameters:
corpus
(optional): A dictionary representing the corpus structure.
Instance Methods
serialize_corpus_as_xml
serialize_corpus_as_xml(self, path: str = "derived_corpus") -> None
Serializes the corpus as a set of XML files, one file per corpus entry. The files are written to the folder given by path, which is created at runtime.
Parameters:
path
(optional): The path and name of the folder to be created.
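Writing one XML file per corpus entry into a folder created at runtime might look like the following sketch. The function name write_entries_as_xml, the "key.xml" file naming, and the entries dictionary are hypothetical; the actual method also generates a TEI header, which is omitted here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_entries_as_xml(entries: dict, path: str = "derived_corpus") -> list:
    """Write each entry's tree to <path>/<key>.xml and return the file paths."""
    os.makedirs(path, exist_ok=True)  # create the target folder if missing
    written = []
    for key, root in entries.items():
        file_path = os.path.join(path, f"{key}.xml")
        ET.ElementTree(root).write(file_path, encoding="utf-8", xml_declaration=True)
        written.append(file_path)
    return written


# Made-up corpus entry: key -> root element of the document.
entries = {
    "19_1": ET.fromstring("<body><div><sp name='Alice'><p>Hi.</p></sp></div></body>")
}

# Write into a temporary directory and read the file back to check it.
with tempfile.TemporaryDirectory() as tmp:
    files = write_entries_as_xml(entries, os.path.join(tmp, "derived_corpus"))
    reloaded = ET.parse(files[0]).getroot()
```

Using exist_ok=True makes repeated runs against the same folder safe, at the cost of silently overwriting files with the same key.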
Private Methods for internal use
create_tei_header
staticmethod __create_tei_header(metadata: dict) -> Element
Creates a <teiHeader> element from the given metadata dictionary.
Parameters:
metadata
: The metadata dictionary.
Returns: The <teiHeader> element as an Element object.
API Reference: Utilities
Overview
The module Utilities provides a set of functions to retrieve data from Corpus objects and fetch the TEI-Corpus from GitHub. It includes utilities for extracting text and attributes from corpus elements.
Module: Utilities
Functions
clone_corpus
clone_corpus(repo_url: str = "https://github.com/PolMine/GermaParlTEI.git") -> None
Clones the GermaParlTEI corpus from GitHub.
Parameters:
repo_url
(optional): The URL of the repository. Defaults to https://github.com/PolMine/GermaParlTEI.git.
get_paragraphs_from_element
get_paragraphs_from_element(element: Element) -> list[str]
Extracts the text content from all <p> elements that are descendants of the given XML element.
Parameters:
element
: The parent XML element.
Returns: A list of text content from all <p> elements found in the subtree.
get_interjections_from_element
get_interjections_from_element(element: Element) -> list[str]
Extracts the text content from all <stage> elements that are descendants of the given XML element.
Parameters:
element
: The parent XML element.
Returns: A list of text content from all <stage> elements found in the subtree.
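Both element-level helpers follow the same pattern: walk the subtree and collect the text of every element with a given tag. A minimal sketch with xml.etree.ElementTree follows. The helper name texts_of and the sample speech are hypothetical, and stripping surrounding whitespace is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET


def texts_of(element: ET.Element, tag: str) -> list[str]:
    """Collect the text of every element with the given tag in the subtree."""
    # Element.iter also yields the starting element itself if its tag matches.
    return ["".join(node.itertext()).strip() for node in element.iter(tag)]


# Made-up speech with two paragraphs and one interjection.
speech = ET.fromstring(
    "<sp name='Alice'>"
    "<p>First point.</p>"
    "<stage>(Applause)</stage>"
    "<p>Second point.</p>"
    "</sp>"
)

paragraphs = texts_of(speech, "p")
interjections = texts_of(speech, "stage")
```

Joining itertext() rather than reading the text attribute also captures text nested inside inline child elements of a paragraph.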
get_paragraphs_from_corpus
get_paragraphs_from_corpus(corpus: Corpus) -> list[str]
Extracts the text content from all <p> elements, which contain speeches without interjections, from the given Corpus object.
Parameters:
corpus
: The Corpus object.
Returns: A list of text content from all <p> elements in the corpus.
get_interjections_from_corpus
get_interjections_from_corpus(corpus: Corpus) -> list[str]
Extracts the text content from all <stage> elements, which contain interjections to speeches, from the given Corpus object.
Parameters:
corpus
: The Corpus object.
Returns: A list of text content from all <stage> elements in the corpus.
extract_element_attributes
extract_element_attributes(corpus: Corpus, tag_name: str) -> list[str]
Extracts all unique attributes of a specified tag from all documents in the corpus.
Parameters:
corpus
: The Corpus object containing the documents.
tag_name
: The name of the tag whose attributes should be retrieved.
Returns: A list of unique attributes found in the specified tag.
extract_attribute_values
extract_attribute_values(corpus: Corpus, tag: str, attribute: str) -> list[str]
Extracts the values of a specific attribute from the specified tag across all documents in the corpus.
Parameters:
corpus
: The Corpus object containing the documents.
tag
: The tag whose attributes should be searched.
attribute
: The name of the attribute whose values should be extracted.
Returns: A list of unique values for the specified attribute.
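Collecting unique attribute names and unique attribute values across several documents can be sketched with sets. The function names and sample documents below are hypothetical, the functions take a plain list of root elements instead of a Corpus object, and returning the results sorted is a choice made here for reproducible output, not documented behavior.

```python
import xml.etree.ElementTree as ET


def unique_attribute_names(roots: list, tag_name: str) -> list[str]:
    """Return all attribute names used on elements with the given tag."""
    names = set()
    for root in roots:
        for node in root.iter(tag_name):
            names.update(node.attrib)  # adds the attribute keys
    return sorted(names)


def unique_attribute_values(roots: list, tag: str, attribute: str) -> list[str]:
    """Return all distinct values of one attribute on the given tag."""
    values = set()
    for root in roots:
        for node in root.iter(tag):
            if attribute in node.attrib:
                values.add(node.attrib[attribute])
    return sorted(values)


# Two made-up documents standing in for corpus entries.
documents = [
    ET.fromstring("<body><div><sp name='Alice' party='X'/><sp name='Bob' party='Y'/></div></body>"),
    ET.fromstring("<body><div><sp name='Alice' party='X' role='chair'/></div></body>"),
]

attribute_names = unique_attribute_names(documents, "sp")
party_values = unique_attribute_values(documents, "sp", "party")
```

Listing the available attributes and values first is a convenient way to find valid arguments for the attribute-based filter methods of Corpus.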