A Python package for extracting electronic health transcripts and then classifying them based on human-annotated data.
pytranscripts
An open-source 👨🔧 Python library for automated classification of electronic medical records
Installation
To install the latest version, simply run:
pip install -U pytranscripts
Stages
- Data Extraction
- Target Identification
- Fine-tuning pretrained models (BERT & ELECTRA) on annotated data
- Extracting Interviewer/Interviewee records from the specified .docx file storage
- Metrics Evaluation (Accuracy & Cohen's Kappa Score)
- Reordering records as a neatly arranged and flagged spreadsheet, alongside metrics and reports from pretrained models.
- Running inference on raw documents and color-coding them
Example Usage
Mount Google Drive (Optional)
If using Google Drive as the data source:
from google.colab import drive
drive.mount('/content/drive')
Automated Data Export
To export and combine all .docx files from a folder into a single file:
from pytranscripts import export_docx_from_folder
# Define paths for document processing
INPUT_FOLDER = "/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/synthetic_transcripts" # Folder containing source DOCX files
OUTPUT_FILE = "FULL_INTERVIEW.csv" # Output consolidated spreadsheet (either .csv or .excel)
LABELS = [
"Clinical_Experience", # Descriptions of personal experience using lung ultrasound
"Diagnostic_Utility", # How lung ultrasound helps in diagnosing diseases
"Comparative_Analysis", # Comparisons with other imaging modalities like X-ray or CT
"Implementation_Challenges", # Barriers to adoption and practical difficulties
"Training_and_Education", # Aspects related to learning and teaching lung ultrasound
"System_Infrastructure", # How hospital systems, devices, and software support ultrasound use
"Administrative_Buying", # Role of hospital leadership and institutional support
"Workflow_Impact", # How lung ultrasound affects daily hospital operations
"Patient_Engagement", # Ways ultrasound enhances patient understanding and involvement
"Future_Adoption", # Predictions about the role of lung ultrasound in hospital practice
] # Labels used as columns in the output spreadsheet, initialized to 0
#------------------------------------- PLEASE NOTE --------------------------------------
# THE LIST OF LABELS YOU DEFINE ABOVE MUST BE THE SAME ONE YOU LATER PASS TO YOUR "TranscriptTrainer"
# Export and combine all DOCX files from the input folder
# This function will:
# 1. Read all .docx files from INPUT_FOLDER
# 2. Combine their contents
# 3. Save to a single OUTPUT_FILE
export_docx_from_folder(
input_directory=INPUT_FOLDER,
output_file=OUTPUT_FILE,
labels = LABELS
)
This will:
- Read all .docx files from INPUT_FOLDER.
- Combine their content into a single file.
- Apply the defined labels to create a structured dataset.
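After the export finishes, you can quickly sanity-check the combined spreadsheet with pandas. The snippet below is a minimal, illustrative check (not part of pytranscripts); it reuses the OUTPUT_FILE and LABELS defined above and assumes a .csv export.

import pandas as pd

# Load the consolidated spreadsheet produced by export_docx_from_folder
df = pd.read_csv(OUTPUT_FILE)  # use pd.read_excel(OUTPUT_FILE) for an .xlsx export

print(df.shape)             # number of extracted records and columns
print(df.columns.tolist())  # text columns plus one column per label
print(df[LABELS].sum())     # label columns should be all zeros before tagging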
Requirements
- Python 3.6 or later
- GPU access recommended for optimal performance (if using a Jupyter notebook)
- pytranscripts version 1.2.4 or higher
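A quick environment check along these lines can confirm the requirements above. This is only an illustrative snippet; the GPU check assumes PyTorch is installed, which BERT/ELECTRA fine-tuning typically relies on.

import sys
import torch  # assumed to be present for the GPU check
from importlib.metadata import version  # requires Python 3.8+

print("Python:", sys.version.split()[0])            # should be 3.6 or later
print("pytranscripts:", version("pytranscripts"))   # should be 1.2.4 or higher
print("GPU available:", torch.cuda.is_available())  # recommended for training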
Model Training
The example below shows how to configure our TranscriptTrainer class, which makes training and inference straightforward for your documents.
from pytranscripts import TranscriptTrainer
trainer = TranscriptTrainer(
input_file='/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/FULL_INTERVIEW_TAGGED.xlsx', # Path to the CSV / XLSX file containing the tagged documents. This is the main data source for training and evaluation.
destination_path='/content/', # Directory where all training results, models, and logs will be saved. We use a Colab path here to keep things seamless.
text_column='Interviewee', # Name of the column in the input file that contains the text data to be used for training.
test_size=0.2, # Fraction of the data held out for testing the model rather than training it. Here, 20% of the data is used for testing.
max_length=512, # The maximum number of tokens per input sequence; sequences longer than this are truncated. This helps manage memory and compute.
num_train_epochs=10, # The number of times the model will iterate over the entire training dataset during training. More epochs will mean more training.
labels=[
"Clinical_Experience", # Descriptions of personal experience using lung ultrasound
"Diagnostic_Utility", # How lung ultrasound helps in diagnosing diseases
"Comparative_Analysis", # Comparisons with other imaging modalities like X-ray or CT
"Implementation_Challenges", # Barriers to adoption and practical difficulties
"Training_and_Education", # Aspects related to learning and teaching lung ultrasound
"System_Infrastructure", # How hospital systems, devices, and software support ultrasound use
"Administrative_Buying", # Role of hospital leadership and institutional support
"Workflow_Impact", # How lung ultrasound affects daily hospital operations
"Patient_Engagement", # Ways ultrasound enhances patient understanding and involvement
"Future_Adoption", # Predictions about the role of lung ultrasound in hospital practice
],
# PLEASE MAKE SURE THE LABEL LIST YOU USE HERE MATCHES THE ONE IN YOUR INPUT FILE
upper_lower_mapping = {
"multi_level_org_char": [ # High-level category
"Clinical_Experience", # Provider Characteristics
"System_Infrastructure" # Health System Characteristics
],
"multi_level_org_perspect": [ # High-level category
"Comparative_Analysis", # Imaging modalities in general
"Administrative_Buying", # Value equation
"Diagnostic_Utility", # Clinical utility & efficiency-Provider perspective
"Patient_Engagement", # Patient/Physician interaction in LUS
"Workflow_Impact" # Workflow related problems
],
"impl_sust_infra": [ # High-level category
"Training_and_Education", # Training
"Implementation_Challenges", # Credentialing / Quality Assurance Infrastructure
"Future_Adoption" # Financial Impact
]
}
)
Next, we start the training job with a single line of code.
bert_model, electra_model = trainer.train_and_classify()
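The Stages section mentions metrics evaluation (accuracy and Cohen's kappa score). If you want to recompute those numbers yourself on a held-out set, the standard scikit-learn functions do the job; the y_true/y_pred arrays below are placeholders for your own gold labels and model predictions.

from sklearn.metrics import accuracy_score, cohen_kappa_score

# Placeholder labels and predictions for a single target column; replace with your own
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))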
Inference and Automated Document Classification (via Color Coding)
This uses either of the trained models to run predictions on a folder of raw EHR transcripts:
trainer.inference_documents(
input_folder = "/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/synthetic_transcripts",
output_folder = "/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/inferenced_transcripts",
threshold = 0.15, # default value = 0.15
model_type = 'bert' # defaults to 'bert'; options: 'bert', 'electra'
)
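A reasonable reading of the threshold parameter is as a per-label confidence cutoff: each label receives its own score, and anything at or above the cutoff is assigned (and color-coded) for that passage. The sketch below only illustrates that selection logic with made-up probabilities; pytranscripts computes the actual scores internally.

# Illustration only: how a per-label probability threshold selects tags
label_probs = {
    "Clinical_Experience": 0.62,
    "Diagnostic_Utility": 0.18,
    "Workflow_Impact": 0.07,
}
threshold = 0.15  # same default as inference_documents above
assigned = [label for label, prob in label_probs.items() if prob >= threshold]
print(assigned)  # ['Clinical_Experience', 'Diagnostic_Utility']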
Contributing
We welcome contributions! Please follow the contributing guidelines.
License
This project is licensed under the MIT License. See the LICENSE file for details.