API Reference

Stance Target Extraction and Stance Detection

class stancemining.main.StanceMining(stance_target_type='noun-phrases', llm_method='finetuned', model_inference='vllm', model_name='microsoft/Phi-4-mini-instruct', model_kwargs={}, tokenizer_kwargs={}, stance_detection_model=None, stance_detection_finetune_kwargs={}, stance_detection_model_kwargs={}, stance_detection_generation_kwargs={}, target_extraction_model=None, target_extraction_finetune_kwargs={}, target_extraction_model_kwargs={}, target_extraction_generation_kwargs={}, embedding_model='intfloat/multilingual-e5-small', embedding_model_inference='vllm', topic_model='bertopic', cosine_similarity_threshold=0.8, verbose=False, use_embedding_cache=True)

Class for performing stance mining on a set of documents.

Parameters:

stance_target_type (str) – Type of stance target to extract, either ‘noun-phrases’ or ‘claims’.
llm_method (str) – Method to use for LLM inference, either ‘prompting’ or ‘finetuned’.
model_inference (str) – Inference method for the LLM, either ‘vllm’ or ‘transformers’.
model_name (str) – Name of the base LLM, used for target cluster naming and, if llm_method is ‘prompting’, for stance target extraction and detection.
model_kwargs (dict) – Additional keyword arguments for the LLM model.
tokenizer_kwargs (dict) – Additional keyword arguments for the tokenizer.
stance_detection_model (str) – Name of the stance detection model to use. Defaults to ‘bendavidsteel/SmolLM2-360M-Instruct-stance-detection’ if not provided.
stance_detection_finetune_kwargs (dict) – Keyword arguments for the fine-tuned stance detection model.
stance_detection_model_kwargs (dict) – Keyword arguments for stance detection model inference.
stance_detection_generation_kwargs (dict) – Keyword arguments for the generate method during stance detection.
target_extraction_model (str) – Name of the target extraction model to use. Defaults to ‘bendavidsteel/SmolLM2-360M-Instruct-target-extraction’ if not provided.
target_extraction_finetune_kwargs (dict) – Keyword arguments for the fine-tuned target extraction model.
target_extraction_model_kwargs (dict) – Keyword arguments for target extraction model inference.
target_extraction_generation_kwargs (dict) – Keyword arguments for the generate method during target extraction.
embedding_model (str) – Name of the embedding model to use for target extraction.
embedding_model_inference (str) – Inference method for the embedding model, either ‘vllm’ or ‘transformers’.
topic_model (str) – Topic model to use for clustering targets, either ‘bertopic’ or ‘toponymy’.
cosine_similarity_threshold (float) – Cosine similarity threshold for deduplicating targets. Defaults to 0.8.
verbose (bool) – Whether to enable verbose logging. Defaults to False.
use_embedding_cache (bool) – Whether to cache computed embeddings for reuse. Defaults to True.
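To illustrate what cosine_similarity_threshold controls, here is a minimal greedy deduplication sketch in plain Python. It is not the library's implementation: the helper names (cosine, deduplicate), the greedy strategy, the use of >= for the comparison, and the toy 2-d embeddings are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(
    targets: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.8,
) -> Dict[str, str]:
    """Map each target to the first kept target whose similarity >= threshold."""
    kept: List[int] = []  # indices of targets retained as canonical
    mapping: Dict[str, str] = {}
    for i, target in enumerate(targets):
        match = next(
            (k for k in kept if cosine(embeddings[i], embeddings[k]) >= threshold),
            None,
        )
        if match is None:
            kept.append(i)
            mapping[target] = target
        else:
            mapping[target] = targets[match]
    return mapping


# Toy data: the first two targets point in almost the same direction.
targets = ["vaccine mandates", "vaccine mandate", "carbon tax"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mapping = deduplicate(targets, embeddings, threshold=0.8)
```

With a higher threshold fewer targets are merged; with a lower one more targets collapse into the same canonical form.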

fit_transform(docs: List[str] | DataFrame, text_column: str = 'text', parent_text_column: str = 'parent_text', get_stance: bool = True, generate_targets: bool = True, generate_higher_level_targets: bool = True, deduplicate_all_targets: bool = True, targets: List[str] = [], topic_model_kwargs: dict = {}, embedding_cache: DataFrame = None, max_layers: int = 2) → DataFrame

Find stances from the given documents.

Parameters:

docs (Union[List[str], pl.DataFrame]) – List of documents or a DataFrame containing documents.
text_column (str) – Name of the column containing the text in the DataFrame, if docs is a DataFrame. Defaults to ‘text’.
parent_text_column (str) – Name of the column containing the parent text in the DataFrame, if docs is a DataFrame. Defaults to ‘parent_text’.
get_stance (bool) – Whether to get stance classifications for the targets. Defaults to True.
generate_targets (bool) – Whether to generate stance targets from the documents. Defaults to True.
generate_higher_level_targets (bool) – Whether to generate higher-level stance targets using topic modeling. Defaults to True.
deduplicate_all_targets (bool) – Whether to deduplicate all targets using embedding similarity. Defaults to True.
targets (List[str]) – List of stance targets to use if not generating them. If generate_targets is True, this should be an empty list.
topic_model_kwargs (dict) – Additional keyword arguments for the topic model.
embedding_cache (pl.DataFrame) – Optional cache of embeddings to use for the documents. Should be a polars DataFrame with ‘text’ and ‘embedding’ columns.
max_layers (int) – Maximum number of hierarchical topic model layers to use when generating higher-level targets. Defaults to 2.

Returns:

DataFrame containing the documents with their stance targets and classifications.

Return type:

pl.DataFrame
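A minimal usage sketch for fit_transform, assuming stancemining is installed. The helper names fit_kwargs and mine are hypothetical; they only encode the rule above that targets should be empty when generate_targets is True. The import is deferred so the helpers can be read and tested without the model dependencies.

```python
from typing import List, Optional


def fit_kwargs(targets: Optional[List[str]] = None) -> dict:
    """Hypothetical helper: choose arguments consistent with the parameter docs."""
    if targets:
        # Predefined targets: skip generation and use the supplied list.
        return {"generate_targets": False, "targets": list(targets)}
    # Generated targets: `targets` stays empty.
    return {"generate_targets": True, "targets": []}


def mine(docs: List[str], targets: Optional[List[str]] = None, miner=None):
    """Run fit_transform, constructing a default StanceMining if none is given."""
    if miner is None:
        # Deferred import: loading the library pulls in heavy model dependencies.
        from stancemining.main import StanceMining

        miner = StanceMining(stance_target_type="noun-phrases", llm_method="finetuned")
    return miner.fit_transform(docs, get_stance=True, **fit_kwargs(targets))
```

Calling mine(["I support the new carbon tax."]) would generate targets from the text, while mine(docs, targets=["carbon tax"]) would classify stance toward the supplied target only.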

get_base_targets(docs: List[str] | DataFrame, embedding_model=None, text_column='text', parent_text_column='parent_text') → DataFrame

Generate stance targets from the given documents.

Parameters:

docs (Union[List[str], pl.DataFrame]) – List of documents or a DataFrame containing documents.
embedding_model – Embedding model to use for computing embeddings. If None, uses the default embedding model.
text_column (str) – Name of the column containing the text in the DataFrame, if docs is a DataFrame. Defaults to ‘text’.
parent_text_column (str) – Name of the column containing the parent text in the DataFrame, if docs is a DataFrame. Defaults to ‘parent_text’.

Returns:

DataFrame containing the documents with their stance targets.

Return type:

pl.DataFrame

get_stance(document_df: DataFrame, text_column='text', parent_text_column='parent_text') → DataFrame

Get stance classifications for the targets in the documents.

Parameters:

document_df (pl.DataFrame) – DataFrame containing documents with ‘Targets’ column.
text_column (str) – Name of the column containing the text in the DataFrame. Defaults to ‘text’.
parent_text_column (str) – Name of the column containing the parent text in the DataFrame. Defaults to ‘parent_text’.

Returns:

DataFrame containing the documents with their stance targets and classifications.

Return type:

pl.DataFrame
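The two methods above can be chained when you want to inspect or edit targets before classifying stance. The sketch below assumes that the output of get_base_targets carries the ‘Targets’ column that get_stance expects; that is not stated explicitly here, so the helper (a hypothetical name) checks for it before continuing.

```python
def extract_then_classify(miner, docs, text_column: str = "text"):
    """Two-step sketch: extract targets, then classify stance toward them."""
    target_df = miner.get_base_targets(docs, text_column=text_column)
    # Assumption: extraction output includes the 'Targets' column get_stance needs.
    if "Targets" not in target_df.columns:
        raise ValueError("expected a 'Targets' column after target extraction")
    return miner.get_stance(target_df, text_column=text_column)
```

Here miner is an already constructed StanceMining instance, and docs is a polars DataFrame with a ‘text’ column.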

get_target_info()

Get information about the stance targets.

Returns:

DataFrame containing target information, including counts and topic associations.

Return type:

pl.DataFrame

Stance Mean and Trend Estimation

stancemining.estimate.infer_stance_trends_for_all_targets(document_df: DataFrame, time_column: str = 'createtime', stance_target_type: str = 'noun-phrases', filter_columns: List[str] = [], min_count: int = 5, time_scale: str = '1mo', interpolation_method: str = 'gp', verbose: bool = False) → Tuple[DataFrame, DataFrame]

Compute trends for all targets in the document DataFrame.

Parameters:

document_df (pl.DataFrame) – DataFrame containing the document data with ‘Targets’ and ‘Stances’ columns.
time_column (str) – Column name for the time data. Defaults to ‘createtime’.
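stance_target_type (str) – Type of stance target, either ‘noun-phrases’ or ‘claims’. Defaults to ‘noun-phrases’.
filter_columns (List[str]) – List of columns to filter by for trend calculation (e.g. ‘Source’, ‘Author’, etc.).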
min_count (int) – Minimum count of occurrences for a filter value to be considered.
time_scale (str) – Time scale for the trends, e.g., ‘1mo’, ‘1w’. Can be any combination of an integer and a time unit from [‘h’, ‘d’, ‘w’, ‘mo’, ‘y’]. If time scale is changed from ‘1mo’, the lengthscale prior should be adjusted accordingly.
interpolation_method (str) – The method to use for interpolation: ‘gp’ for Gaussian Process, ‘lowess’ for LOWESS, or ‘kernelreg’ for Kernel Regression. Defaults to ‘gp’. The Gaussian Process handles noise and models error better, but is slower. LOWESS is faster, but does not model error and does not allow setting a prior to properly model noise.
verbose (bool) – Whether to print progress information.

Returns:

A tuple containing:

A DataFrame with trend data for each target and filter value.
A DataFrame with interpolation method outputs, useful for Gaussian Process outputs.

Return type:

Tuple[pl.DataFrame, pl.DataFrame]
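The time_scale format described above (an integer followed by one of ‘h’, ‘d’, ‘w’, ‘mo’, ‘y’) can be checked up front before a long-running fit. The sketch below is illustrative: parse_time_scale and compute_trends are hypothetical helpers, not part of the library, and the library may perform its own validation.

```python
import re
from typing import Tuple

# Integer count followed by one of the documented time units.
_TIME_SCALE = re.compile(r"^(\d+)(h|d|w|mo|y)$")


def parse_time_scale(time_scale: str) -> Tuple[int, str]:
    """Split a time scale such as '1mo' or '2w' into (count, unit)."""
    match = _TIME_SCALE.match(time_scale)
    if match is None:
        raise ValueError(f"invalid time_scale: {time_scale!r}")
    return int(match.group(1)), match.group(2)


def compute_trends(document_df, time_scale: str = "1mo", **kwargs):
    """Validate time_scale, then call the library function (deferred import)."""
    parse_time_scale(time_scale)
    from stancemining.estimate import infer_stance_trends_for_all_targets

    # Returns a tuple: (trend DataFrame, interpolation-output DataFrame).
    return infer_stance_trends_for_all_targets(
        document_df, time_scale=time_scale, **kwargs
    )
```

A typical call would be trend_df, gp_df = compute_trends(document_df, time_scale="1w", filter_columns=["Source"]), where document_df has ‘Targets’ and ‘Stances’ columns as described above.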

stancemining.estimate.infer_stance_trends_for_target(df: DataFrame, target_name, filter_columns: List[str], time_column: str = 'createtime', stance_target_type: str = 'noun-phrases', interpolation_method='gp', classifier_profiles=None, min_filter_count=5, time_scale='1mo', verbose=False, lengthscale_loc=2.0, lengthscale_scale=0.1, sigma_loc=1.0, sigma_scale=0.2) → Tuple[DataFrame, List[Dict[str, Any]]]

Compute trend data for a specific target.

Parameters:

df (pl.DataFrame) – DataFrame containing the data with ‘Target’ and ‘Stance’ columns.
target_name (str) – The target name to filter by.
filter_columns (List[str]) – List of columns to filter by for trend calculation (e.g. ‘Source’, ‘Author’, etc.).
time_column (str) – Column name for the time data. Defaults to ‘createtime’.
stance_target_type (str) – Type of stance target, either ‘noun-phrases’ or ‘claims’. Defaults to ‘noun-phrases’.
interpolation_method (str) – The method to use for interpolation: ‘gp’ for Gaussian Process or ‘lowess’ for LOWESS. Defaults to ‘gp’. The Gaussian Process handles noise and models error better, but is slower. LOWESS is faster, but does not model error and does not allow setting a prior to properly model noise.
classifier_profiles (Dict) – Dictionary containing classifier profiles for ordinal GP.
min_filter_count (int) – Minimum count of occurrences for a filter value to be considered.
time_scale (str) – Time scale for the trends, e.g., ‘1mo’, ‘1w’. Can be any combination of an integer and a time unit from [‘h’, ‘d’, ‘w’, ‘mo’, ‘y’]. If time scale is changed from ‘1mo’, the lengthscale prior should be adjusted accordingly.
verbose (bool) – Whether to print progress information.
lengthscale_loc (float) – Location parameter for the lengthscale prior.
lengthscale_scale (float) – Scale parameter for the lengthscale prior.
sigma_loc (float) – Location parameter for the sigma prior.
sigma_scale (float) – Scale parameter for the sigma prior.

Returns:

A tuple containing the trend DataFrame and a list of interpolation outputs.

Return type:

Tuple[pl.DataFrame, List[Dict[str, Any]]]

stancemining.estimate.infer_stance_normal_for_all_targets(document_df: DataFrame, filter_cols: List[str] = [], min_count: int = 5, verbose: bool = False) → DataFrame

Get the stance normal for all targets in the document DataFrame.

Parameters:

document_df (pl.DataFrame) – DataFrame containing document data with stance information.
filter_cols (List[str]) – List of columns to filter by unique values.
min_count (int) – Minimum number of samples required to calculate stance.
verbose (bool) – If True, print progress messages.

Returns:

DataFrame containing the stance normal for all targets.

Return type:

pl.DataFrame
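To make the role of min_count concrete, here is a deliberately naive summary in plain Python: it groups stance values by target, drops targets with fewer than min_count samples, and reports a mean and standard deviation. This is not the estimator the library uses, which is not described here; the function name and the assumption that stances are numeric values such as -1, 0, and 1 are illustrative only.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def naive_stance_summary(
    rows: Iterable[Tuple[str, float]], min_count: int = 5
) -> Dict[str, dict]:
    """Per-target sample count, mean, and population standard deviation."""
    by_target = defaultdict(list)
    for target, stance in rows:
        by_target[target].append(stance)
    return {
        target: {"n": len(values), "mean": mean(values), "std": pstdev(values)}
        for target, values in by_target.items()
        if len(values) >= min_count  # too few samples: skip this target
    }


rows = [("carbon tax", 1)] * 3 + [("carbon tax", -1)] * 3 + [("plastic ban", 1)] * 2
summary = naive_stance_summary(rows, min_count=5)
```

In this toy input, ‘plastic ban’ has only two samples and is excluded, while ‘carbon tax’ has six evenly split samples.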

stancemining.estimate.infer_stance_normal_for_target(target_df: DataFrame, min_count: int = 5, filter_cols: List[str] = [], verbose: bool = False) → DataFrame

Get the stance normal for a specific target DataFrame.

Parameters:

target_df (pl.DataFrame) – DataFrame containing stance data for a specific target.
min_count (int) – Minimum number of samples required to calculate stance.
filter_cols (List[str]) – List of columns to filter by unique values.
verbose (bool) – If True, print progress messages.

Returns:

DataFrame containing the stance normal for the target.

Return type:

pl.DataFrame

Stance Visualization

stancemining.plot.plot_semantic_map(doc_target_df: DataFrame, top_num_targets: int = 30) → Figure

Create a semantic map of targets based on stance and frequency. This function uses UMAP to reduce the dimensionality of target embeddings and visualizes them in a scatter plot with stance represented by color and frequency represented by circle size.

Parameters:

doc_target_df (pl.DataFrame) – DataFrame containing document stance data with columns: ‘Target’, ‘Stance’ (numeric stance value), and optionally ‘Targets’ and ‘Stances’.
top_num_targets (int) – Number of top targets to visualize based on frequency.

Returns:

The figure containing the semantic map.

Return type:

plt.Figure

stancemining.plot.plot_trend_map(document_df: DataFrame, trend_df: DataFrame, figsize=(20, 14), plot_stream_transitions=True, filter_col: str | None = None, max_stream_width=4.0, min_transition_count=4) → Figure

Plot the continuous stance stream diagram with enhanced features.

Parameters:

document_df (pl.DataFrame) – DataFrame containing document stance data.
trend_df (pl.DataFrame) – DataFrame containing trend data with columns: ‘createtime’, ‘Target’, ‘Stance’, ‘volume’, ‘trend_mean’.
figsize (tuple) – Size of the figure.
plot_stream_transitions (bool) – Whether to plot significant user transitions.
filter_col (str, optional) – Column to filter by (e.g., ‘user_id’). If None, no filtering is applied.
max_stream_width (float) – Maximum width of the streams.
min_transition_count (int) – Minimum number of transitions to consider significant.

Returns:

The figure containing the continuous stance stream diagram.

Return type:

plt.Figure
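Because plot_trend_map expects specific columns in trend_df, it can help to check them before plotting. The following sketch assumes the column list given above is exhaustive; missing_trend_columns and plot_and_save are hypothetical helper names, and the output path is arbitrary.

```python
from typing import Iterable, List

# Columns documented as required in trend_df.
REQUIRED_TREND_COLUMNS = ("createtime", "Target", "Stance", "volume", "trend_mean")


def missing_trend_columns(columns: Iterable[str]) -> List[str]:
    """Return required trend columns that are absent, in documented order."""
    present = set(columns)
    return [name for name in REQUIRED_TREND_COLUMNS if name not in present]


def plot_and_save(document_df, trend_df, path: str = "trend_map.png"):
    """Check trend_df, draw the trend map, and save it to disk."""
    missing = missing_trend_columns(trend_df.columns)
    if missing:
        raise ValueError(f"trend_df is missing columns: {missing}")
    # Deferred import so the column check works without plotting dependencies.
    from stancemining.plot import plot_trend_map

    fig = plot_trend_map(document_df, trend_df, figsize=(20, 14))
    fig.savefig(path, bbox_inches="tight")  # standard matplotlib Figure method
    return fig
```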

Multimodal Utilities

stancemining.utils.get_transcripts_from_audio_files(audio_paths: List[str], hf_token: str, whisper_model: str = 'large-v2', batch_size: int = 16, save_speaker_embeddings: bool = False, verbose: bool = True, skip_errors: bool = False) → DataFrame

Get transcripts from a list of audio file paths using whisperx.

Requires whisperx and pyannote.audio.

Parameters:

audio_paths (List[str]) – List of paths to audio files.
hf_token (str) – Hugging Face token for accessing models.
whisper_model (str) – Whisper model to use (default: “large-v2”).
batch_size (int) – Batch size for processing (default: 16).
save_speaker_embeddings (bool) – Whether to save speaker embeddings (default: False).
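verbose (bool) – Whether to show progress bar (default: True).
skip_errors (bool) – Whether to skip files that fail to process instead of raising an error (default: False).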

Returns:

DataFrame containing the transcripts and diarization results.

Return type:

pl.DataFrame

stancemining.utils.get_transcripts_from_video_files(video_paths: List[str], hf_token: str, whisper_model: str = 'large-v2', batch_size: int = 16, save_speaker_embeddings: bool = False, verbose: bool = True, skip_errors: bool = False) → DataFrame

Get transcripts from a list of video file paths using whisperx.

Requires whisperx, moviepy, and pyannote.audio.

Parameters:

video_paths (List[str]) – List of paths to video files.
hf_token (str) – Hugging Face token for accessing models.
whisper_model (str) – Whisper model to use (default: “large-v2”).
batch_size (int) – Batch size for processing (default: 16).
save_speaker_embeddings (bool) – Whether to save speaker embeddings (default: False).
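verbose (bool) – Whether to show progress bar (default: True).
skip_errors (bool) – Whether to skip files that fail to process instead of raising an error (default: False).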

Returns:

DataFrame containing the transcripts and diarization results.

Return type:

pl.DataFrame
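A short sketch of transcribing every video in a folder, assuming whisperx, moviepy, and pyannote.audio are installed. The helper names, the file suffixes, and the HF_TOKEN environment variable name are assumptions for illustration; only the call to get_transcripts_from_video_files comes from the signature above.

```python
import os
from pathlib import Path
from typing import List, Sequence


def collect_media(folder: str, suffixes: Sequence[str] = (".mp4", ".mov")) -> List[str]:
    """Return sorted paths of files in `folder` with a matching suffix."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )


def transcribe_videos(folder: str):
    """Transcribe all videos in a folder, skipping files that fail."""
    # Deferred import: whisperx and related dependencies are heavy.
    from stancemining.utils import get_transcripts_from_video_files

    token = os.environ["HF_TOKEN"]  # assumed variable name for the Hugging Face token
    return get_transcripts_from_video_files(
        collect_media(folder),
        hf_token=token,
        whisper_model="large-v2",
        batch_size=16,
        skip_errors=True,
    )
```

The same pattern applies to get_transcripts_from_audio_files with audio suffixes in place of video ones.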