GermaParlPy

Table of Contents

API Reference: Corpus

  • Overview
  • Class: Corpus
    • Constructor
    • Static Methods
      • deserialize_from_json
      • deserialize_from_xml
    • Instance Methods
      • serialize
      • get_corpus
      • get_metadata
      • get_partition_by_sp_attribute
      • get_speeches_from_politician
      • get_speeches_from_party
      • get_speeches_from_role
      • _get_speeches_from_condition
      • get_speeches_from_keyword
      • get_speeches_from_word_list
      • get_speeches_from_regex
    • Magic Methods
      • __len__
      • __bool__
    • Private Methods
      • __load_xml_for_legislative_period
      • __extract_metadata
      • __get_text

API Reference: Partition

  • Overview
  • Class: Partition
    • Constructor
    • Instance Methods
      • serialize_corpus_as_xml
    • Private Methods
      • __create_tei_header

API Reference: Utilities

  • Overview
  • Module: Utilities
    • clone_corpus
    • get_paragraphs_from_element
    • get_interjections_from_element
    • get_paragraphs_from_corpus
    • get_interjections_from_corpus
    • extract_element_attributes
    • extract_attribute_values

API Reference: Corpus

Overview

The Corpus class represents a collection of textual data, providing methods to manage, serialize, and query corpora. It allows loading corpora from JSON and XML formats, retrieving metadata, and filtering content based on different criteria.


Class: Corpus

Constructor

Corpus(name: str = "", corpus=None)

Initializes a Corpus object.

Parameters:
name (optional): The name of the corpus.
corpus (optional): A dictionary representing the corpus structure.


Static Methods

deserialize_from_json

staticmethod deserialize_from_json(filepath: str, name: str) -> Corpus

Factory method that creates a Corpus object from a JSON file. The JSON file should have been created by the instance method serialize().

Parameters:
filepath : Path to the JSON file.
name : Name of the corpus.

Returns: The deserialized Corpus object.


deserialize_from_xml

staticmethod deserialize_from_xml(lp: Union[range, int], path: str, name: str) -> Corpus

Factory Method for creating Corpus objects from an XML corpus. The XML corpus should comply with the structure of the original GermaParlTEI Corpus that is fetched with utilities.clone_corpus() or created with Partition.serialize_corpus_as_xml().

Parameters:
lp : Legislative term(s) as a range or integer.
path (optional): Path to the corpus directory.
name (optional): Name of the new corpus object.

Returns: The deserialized Corpus object.


Instance Methods

serialize

serialize(path: str) -> None

Serializes the Corpus object as a JSON file.

Parameters:
path: Path where the JSON file should be saved.


get_corpus

get_corpus(deep: bool = False) -> dict[str, dict[str, Any]]

Returns a copy of the instance’s corpus.

Parameters:
deep (optional): If True, a deep copy is returned. Defaults to False.

Returns: A copy of the instance’s corpus.
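
The deep flag matters as soon as the returned dictionary is modified. The following is a minimal sketch using Python's standard copy module on a plain nested dictionary. The dictionary layout and the key "19_001" are hypothetical stand-ins rather than the library's internal structure, and it is an assumption that deep=False behaves like copy.copy.

```python
import copy

# Hypothetical stand-in for a corpus dictionary: key -> entry dictionary.
original = {"19_001": {"metadata": {"note": "unchanged"}}}

shallow = copy.copy(original)    # new outer dict, nested dicts are shared
deep = copy.deepcopy(original)   # fully independent copy

# Mutating a nested value through the shallow copy also changes the original.
shallow["19_001"]["metadata"]["note"] = "changed"
shallow_affects_original = original["19_001"]["metadata"]["note"] == "changed"

# The deep copy was taken before the mutation and does not see it.
deep_is_independent = deep["19_001"]["metadata"]["note"] == "unchanged"
```

If the copy is only read, the cheaper default is usually sufficient; a deep copy is the safer choice before editing entries in place.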


get_metadata

get_metadata(key: str) -> dict[str, str]

Retrieves metadata for a specific corpus entry, representing a document.

Parameters:
key: The corpus entry (#legislativePeriod_#sessionNumber).

Returns: The metadata on the corpus entry, i.e., the document.


get_partition_by_sp_attribute

get_partition_by_sp_attribute(attribute: str, value: str) -> Partition

Template method that filters the corpus based on an attribute of sp-elements.

For every div-element in the corpus, the child sp-elements are checked against the specified attribute and value pair. If at least one sp-element matches, the div-element is collected together with the matching sp-elements and all of their child elements. sp-elements within the same div-element that do not match are not collected. All collected elements are assembled in a new ElementTree with a <body>-element as the root. The new corpus is indexed with the old key and keeps the old metadata within a new Partition object.

Parameters:
attribute: The attribute name to filter by.
value: The attribute value to match.

Returns: A Partition object containing the matching elements.
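
The collection rule described above can be sketched with the standard library's xml.etree.ElementTree. This is an illustration of the described behavior for a single document, not the library's implementation. The function name filter_by_sp_attribute and the sample XML are hypothetical, and the sketch assumes elements without an XML namespace.

```python
import copy
import xml.etree.ElementTree as ET

def filter_by_sp_attribute(body: ET.Element, attribute: str, value: str) -> ET.Element:
    """Return a new <body> holding only <div>s that have matching <sp> children."""
    new_body = ET.Element("body")
    for div in body.iter("div"):
        matches = [sp for sp in div.findall("sp") if sp.get(attribute) == value]
        if not matches:
            continue  # a <div> without any matching <sp> is not collected
        # Copy the <div> with its attributes, but without its children.
        new_div = ET.SubElement(new_body, "div", dict(div.attrib))
        for sp in matches:
            # Deep-copy so that all child elements of <sp> come along.
            new_div.append(copy.deepcopy(sp))
    return new_body

sample = ET.fromstring(
    "<body>"
    "<div type='agenda_item'>"
    "<sp name='A' party='X'><p>first</p><stage>(applause)</stage></sp>"
    "<sp name='B' party='Y'><p>second</p></sp>"
    "</div>"
    "<div type='agenda_item'><sp name='C' party='Y'><p>third</p></sp></div>"
    "</body>"
)

result = filter_by_sp_attribute(sample, "party", "X")
kept_speakers = [sp.get("name") for sp in result.iter("sp")]
```

Only the first div survives in result, and within it only the sp-element of speaker A, including its p and stage children.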


get_speeches_from_politician

get_speeches_from_politician(person: str, attribute_name: str = "name") -> Partition

Filters speeches by politician name with get_partition_by_sp_attribute().

Parameters:
person : The name of the politician.
attribute_name (optional): Attribute name for the identification of the politician’s name.

Returns: A Partition object containing speeches by the specified politician.


get_speeches_from_party

get_speeches_from_party(party: str) -> Partition

Filters speeches by party affiliation with get_partition_by_sp_attribute().

Parameters:
party : The name of the party.

Returns: A Partition object containing speeches from the specified party.


get_speeches_from_role

get_speeches_from_role(role: str, attribute_name: str = "role") -> Partition

Filters speeches by role with get_partition_by_sp_attribute().

Parameters:
role : The role to filter by.
attribute_name (optional): Attribute name for the role identification.

Returns: A Partition object containing speeches from the specified role.


get_speeches_from_condition

_get_speeches_from_condition(condition: Callable[[str], bool]) -> Partition

Template method that filters speeches based on a user-defined condition on their content.

For every <div>-element in the corpus, the child <sp>-elements are checked against the condition. If at least one <sp>-element fulfills it, the <div>-element is collected together with the matching <sp>-elements and all of their child elements. <sp>-elements within the same <div>-element that do not fulfill the condition are not collected. All collected elements are assembled in a new ElementTree with a <body>-element as the root. The new corpus is indexed with the old key and keeps the old metadata within a new Partition object.

Parameters:
condition : A function that takes a string (speech text) and returns a boolean.

Returns: A Partition object containing speeches that match the condition.
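
A condition is any callable that maps the speech text to True or False. The sketch below applies such a callable to a single document with xml.etree.ElementTree. The helper filter_by_condition and the sample XML are hypothetical, and treating the speech text as all text inside the sp-element is an assumption about how the text is assembled.

```python
import copy
import xml.etree.ElementTree as ET
from typing import Callable

def filter_by_condition(body: ET.Element, condition: Callable[[str], bool]) -> ET.Element:
    """Keep each <div> that has at least one <sp> whose text fulfills the condition."""
    new_body = ET.Element("body")
    for div in body.iter("div"):
        # Assumption: the speech text is all text found inside the <sp> element.
        matches = [sp for sp in div.findall("sp") if condition("".join(sp.itertext()))]
        if matches:
            new_div = ET.SubElement(new_body, "div", dict(div.attrib))
            new_div.extend([copy.deepcopy(sp) for sp in matches])
    return new_body

sample = ET.fromstring(
    "<body><div>"
    "<sp name='A'><p>Wir brauchen mehr Klimaschutz.</p></sp>"
    "<sp name='B'><p>Zur Tagesordnung.</p><stage>(Beifall)</stage></sp>"
    "</div></body>"
)

result = filter_by_condition(sample, lambda text: "Klimaschutz" in text)
speakers = [sp.get("name") for sp in result.iter("sp")]
```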


get_speeches_from_keyword

get_speeches_from_keyword(keyword: str, case_sensitive: bool = False) -> Partition

Filters speeches containing a specified keyword with _get_speeches_from_condition().

Parameters:
keyword : The keyword to search for.
case_sensitive (optional): Whether the search is case-sensitive.

Returns: A Partition object containing speeches with the keyword.


get_speeches_from_word_list

get_speeches_from_word_list(word_list: list[str], case_sensitive: bool = False) -> Partition

Filters speeches containing any word from a given list.

Parameters:
word_list : A list of words to search for.
case_sensitive (optional): Whether the search is case-sensitive.

Returns: A Partition object containing speeches with any of the specified words.


get_speeches_from_regex

get_speeches_from_regex(pattern: str) -> Partition

Filters speeches using a regular expression with _get_speeches_from_condition().

Parameters:
pattern : Regular expression pattern to match.

Returns: A Partition object containing speeches that match the pattern.
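
The keyword, word list, and regex filters can each be read as a predicate over the speech text. The sketch below builds such predicates with the standard library only. The three function names are hypothetical, and two details are assumptions rather than documented behavior: keywords are matched as substrings, and the pattern is applied with re.search.

```python
import re
from typing import Callable

def keyword_condition(keyword: str, case_sensitive: bool = False) -> Callable[[str], bool]:
    """Predicate that is True if the keyword occurs in the text (substring match)."""
    if case_sensitive:
        return lambda text: keyword in text
    lowered = keyword.lower()
    return lambda text: lowered in text.lower()

def word_list_condition(words: list[str], case_sensitive: bool = False) -> Callable[[str], bool]:
    """Predicate that is True if any word of the list occurs in the text."""
    checks = [keyword_condition(word, case_sensitive) for word in words]
    return lambda text: any(check(text) for check in checks)

def regex_condition(pattern: str) -> Callable[[str], bool]:
    """Predicate that is True if the pattern matches anywhere in the text."""
    compiled = re.compile(pattern)
    return lambda text: compiled.search(text) is not None

has_europa = keyword_condition("europa")
has_topic = word_list_condition(["Steuer", "Rente"])
mentions_eu = regex_condition(r"\bEU\b")
```

Any of these predicates has the Callable[[str], bool] shape that the condition parameter expects.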


Magic Methods

len

__len__() -> int

Returns the number of entries in the corpus.

Returns: Number of entries in the corpus.


bool

__bool__() -> bool

Returns whether the corpus is non-empty.

Returns: True if the corpus contains entries, otherwise False.
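
Together, the two magic methods allow the usual Python idioms len(corpus) and if corpus:, which is handy for checking whether a query returned anything. The class MiniCorpus below is a hypothetical stand-in that mimics only the documented behavior of these two methods.

```python
class MiniCorpus:
    """Hypothetical stand-in that mimics the documented __len__ and __bool__."""

    def __init__(self, corpus=None):
        self.corpus = corpus if corpus is not None else {}

    def __len__(self) -> int:
        # Number of entries in the corpus dictionary.
        return len(self.corpus)

    def __bool__(self) -> bool:
        # True only if the corpus holds at least one entry.
        return len(self.corpus) > 0

empty = MiniCorpus()
filled = MiniCorpus({"19_001": {}, "19_002": {}})

summary = "matches found" if filled else "no matches"
empty_summary = "matches found" if empty else "no matches"
```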


Private Methods for internal use

load_xml_for_legislative_period

__load_xml_for_legislative_period(lp: int, path: str) -> None

Loads XML files for a given legislative period.

extract_metadata

staticmethod __extract_metadata(tree: ElementTree) -> dict[str, str]

Extracts metadata from an XML document.

get_text

staticmethod __get_text(element: Element, path: str) -> str

Extracts text from an XML element.


API Reference: Partition

Overview

The Partition class represents a partition, i.e. a filtered subset, of a corpus. Objects of this class are created by the retrieval methods of the parent class Corpus.


Class: Partition

Constructor

Partition(name: str = "", corpus=None)

Initializes a Partition object by calling the super class Corpus.

Parameters:
corpus (optional): A dictionary representing the corpus structure.


Instance Methods

serialize_corpus_as_xml

serialize_corpus_as_xml(path: str = "derived_corpus") -> None

Serializes the corpus as a set of XML files. One XML file is generated for every entry in the corpus, inside the folder given by path; the folder is created at runtime.

Parameters:
path (optional): The path and name of the folder to be created.
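
The one-file-per-entry layout can be sketched with the standard library as follows. This is not the library's implementation: the function write_entries_as_xml is hypothetical, and both the mapping of keys to root elements and the file naming scheme (key plus ".xml") are assumptions made for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

def write_entries_as_xml(entries: dict[str, ET.Element], path: str = "derived_corpus") -> list[Path]:
    """Write one XML file per entry into the folder at `path`, creating it if needed."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for key, root in entries.items():
        target = folder / f"{key}.xml"  # assumed naming scheme
        ET.ElementTree(root).write(str(target), encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written

# Usage inside a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    entries = {"19_001": ET.fromstring("<TEI><text><body/></text></TEI>")}
    files = write_entries_as_xml(entries, str(Path(tmp) / "derived_corpus"))
    file_names = [f.name for f in files]
    reloaded_tag = ET.parse(str(files[0])).getroot().tag
```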


Private Methods for internal use

create_tei_header

staticmethod __create_tei_header(metadata: dict) -> Element

Creates a <teiHeader> element from the given metadata dictionary.

Parameters:
metadata: The metadata dictionary.

Returns: The <teiHeader> element as an Element object.


API Reference: Utilities

Overview

The module Utilities provides a set of functions to retrieve data from Corpus objects and fetch the TEI-Corpus from GitHub. It includes utilities for extracting text and attributes from corpus elements.


Module: Utilities

Functions

clone_corpus

clone_corpus(repo_url: str = "https://github.com/PolMine/GermaParlTEI.git") -> None

Clones the GermaParlTEI corpus from GitHub.

Parameters:
repo_url (optional): The URL of the repository. Defaults to https://github.com/PolMine/GermaParlTEI.git.


get_paragraphs_from_element

get_paragraphs_from_element(element: Element) -> list[str]

Extracts the text content from all <p> elements that are descendants of the given XML element.

Parameters:
element: The parent XML element.

Returns: A list of text content from all <p> elements found in the subtree.


get_interjections_from_element

get_interjections_from_element(element: Element) -> list[str]

Extracts the text content from all <stage> elements that are descendants of the given XML element.

Parameters:
element: The parent XML element.

Returns: A list of text content from all <stage> elements found in the subtree.
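
Both element-level helpers amount to a subtree walk for one tag name. The sketch below shows that walk with xml.etree.ElementTree. The helper texts_of and the sample element are hypothetical, and the sketch assumes elements without an XML namespace.

```python
import xml.etree.ElementTree as ET

def texts_of(element: ET.Element, tag: str) -> list[str]:
    """Collect the text of every `tag` element in the subtree of `element`."""
    # iter() walks the whole subtree in document order; it would also yield
    # `element` itself if its own tag matched.
    return ["".join(node.itertext()) for node in element.iter(tag)]

sp = ET.fromstring(
    "<sp name='A'>"
    "<p>Erster Absatz.</p>"
    "<stage>(Beifall)</stage>"
    "<p>Zweiter Absatz.</p>"
    "</sp>"
)

paragraphs = texts_of(sp, "p")          # speech text without interjections
interjections = texts_of(sp, "stage")   # interjections only
```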


get_paragraphs_from_corpus

get_paragraphs_from_corpus(corpus: Corpus) -> list[str]

Extracts the text content from all <p> elements, which contain speeches without interjections, from the given Corpus object.

Parameters:
corpus: The Corpus object.

Returns: A list of text content from all <p> elements in the corpus.


get_interjections_from_corpus

get_interjections_from_corpus(corpus: Corpus) -> list[str]

Extracts the text content from all <stage> elements, which contain interjections to speeches, from the given Corpus object.

Parameters:
corpus: The Corpus object.

Returns: A list of text content from all <stage> elements in the corpus.


extract_element_attributes

extract_element_attributes(corpus: Corpus, tag_name: str) -> list[str]

Extracts all unique attributes of a specified tag from all documents in the corpus.

Parameters:
corpus: The Corpus object containing the documents.
tag_name: The name of the tag whose attributes should be retrieved.

Returns: A list of unique attributes found in the specified tag.


extract_attribute_values

extract_attribute_values(corpus: Corpus, tag: str, attribute: str) -> list[str]

Extracts the values of a specific attribute from the specified tag across all documents in the corpus.

Parameters:
corpus: The Corpus object containing the documents.
tag: The tag whose attributes should be searched.
attribute: The name of the attribute whose values should be extracted.

Returns: A list of unique values for the specified attribute.
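
The two attribute helpers can be pictured as set-building passes over all documents. The sketch below works on a plain list of parsed documents instead of a Corpus object. The function names attribute_names and attribute_values are hypothetical, and returning the results in sorted order is an assumption made so that the output is deterministic.

```python
import xml.etree.ElementTree as ET

def attribute_names(documents: list[ET.Element], tag_name: str) -> list[str]:
    """Unique attribute names used on `tag_name` elements across all documents."""
    names = {name for doc in documents for el in doc.iter(tag_name) for name in el.attrib}
    return sorted(names)

def attribute_values(documents: list[ET.Element], tag: str, attribute: str) -> list[str]:
    """Unique values of `attribute` on `tag` elements across all documents."""
    values = {
        el.get(attribute)
        for doc in documents
        for el in doc.iter(tag)
        if el.get(attribute) is not None
    }
    return sorted(values)

docs = [
    ET.fromstring("<body><sp name='A' party='X'/><sp name='B' party='Y'/></body>"),
    ET.fromstring("<body><sp name='C' party='X' role='mp'/></body>"),
]

sp_attributes = attribute_names(docs, "sp")
parties = attribute_values(docs, "sp", "party")
```

Listing the attribute names first is a practical way to find valid arguments for get_partition_by_sp_attribute before filtering.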