A Python package for extracting electronic health transcripts and then classifying them based on human-annotated data.
pytranscripts
An open-source 👨🔧 Python library for automated classification of electronic medical records
Installation
To install the latest version, simply run:
pip install -U pytranscripts
Stages
- Data Extraction
- Target Identification
- Fine-tuning pretrained models (BERT & ELECTRA) on annotated data
- Extracting Interviewer/Interviewee records from the specified .docx file storage
- Metrics Evaluation (Accuracy & Cohen's Kappa Score)
- Reordering records as a neatly arranged and flagged spreadsheet, alongside metrics and reports from pretrained models.
- Running inference on raw documents and color-coding them
Example Usage
Mount Google Drive (Optional)
If using Google Drive as the data source:
from google.colab import drive
drive.mount('/content/drive')
Automated Data Export
To export and combine all .docx files from a folder into a single file:
from pytranscripts import export_docx_from_folder
# Define paths for document processing
INPUT_FOLDER = "/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/synthetic_transcripts" # Folder containing source DOCX files
OUTPUT_FILE = "FULL_INTERVIEW.csv" # Output consolidated spreadsheet (either .csv or .excel)
LABELS = [
"Clinical_Experience", # Descriptions of personal experience using lung ultrasound
"Diagnostic_Utility", # How lung ultrasound helps in diagnosing diseases
"Comparative_Analysis", # Comparisons with other imaging modalities like X-ray or CT
"Implementation_Challenges", # Barriers to adoption and practical difficulties
"Training_and_Education", # Aspects related to learning and teaching lung ultrasound
"System_Infrastructure", # How hospital systems, devices, and software support ultrasound use
"Administrative_Buying", # Role of hospital leadership and institutional support
"Workflow_Impact", # How lung ultrasound affects daily hospital operations
"Patient_Engagement", # Ways ultrasound enhances patient understanding and involvement
"Future_Adoption", # Predictions about the role of lung ultrasound in hospital practice
] # Labels used as columns in the output spreadsheet, initialized to 0
#------------------------------------- PLEASE NOTE --------------------------------------
# THE LIST OF LABELS YOU DEFINE ABOVE MUST BE THE SAME ONE YOU LATER PASS TO YOUR "TranscriptTrainer"
# Export and combine all DOCX files from the input folder
# This function will:
# 1. Read all .docx files from INPUT_FOLDER
# 2. Combine their contents
# 3. Save to a single OUTPUT_FILE
export_docx_from_folder(
input_directory=INPUT_FOLDER,
output_file=OUTPUT_FILE,
labels = LABELS
)
This will:
- Read all .docx files from INPUT_FOLDER.
- Combine their content into a single file.
- Apply the defined labels to create a structured dataset.
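After the export finishes, you can quickly sanity-check the combined spreadsheet with pandas. The snippet below is a minimal, illustrative check (not part of pytranscripts); it reuses the OUTPUT_FILE and LABELS defined above and assumes a .csv export.

import pandas as pd

# Load the consolidated spreadsheet produced by export_docx_from_folder
df = pd.read_csv(OUTPUT_FILE)  # use pd.read_excel(OUTPUT_FILE) for an .xlsx export

print(df.shape)             # number of extracted records and columns
print(df.columns.tolist())  # text columns plus one column per label
print(df[LABELS].sum())     # label columns should be all zeros before tagging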
Requirements
- Python 3.6 or later
- GPU access recommended for optimal performance (if using a Jupyter notebook)
- pytranscripts version 1.2.4 or higher
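A quick environment check along these lines can confirm the requirements above. This is only an illustrative snippet; the GPU check assumes PyTorch is installed, which BERT/ELECTRA fine-tuning typically relies on.

import sys
import torch  # assumed to be present for the GPU check
from importlib.metadata import version  # requires Python 3.8+

print("Python:", sys.version.split()[0])            # should be 3.6 or later
print("pytranscripts:", version("pytranscripts"))   # should be 1.2.4 or higher
print("GPU available:", torch.cuda.is_available())  # recommended for training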
Model Training
The example below shows how to configure our TranscriptTrainer class, which makes training and inference straightforward for your documents.
from pytranscripts import TranscriptTrainer
trainer = TranscriptTrainer(
input_file='/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/FULL_INTERVIEW_TAGGED.xlsx', # Path to the CSV / XLSX file containing the tagged documents. This is the main data source for training and evaluation.
destination_path='/content/', # Directory where all training results, models, and logs will be saved. We use a Colab path here to keep things seamless.
text_column='Interviewee', # Name of the column in the input file that contains the text data to be used for training.
test_size=0.2, # Fraction of the data held out for testing the model rather than training it. Here, 20% of the data is used for testing.
max_length=512, # The maximum number of tokens per input sequence; sequences longer than this are truncated. This helps manage memory and compute.
num_train_epochs=10, # The number of times the model will iterate over the entire training dataset during training. More epochs will mean more training.
labels=[
"Clinical_Experience", # Descriptions of personal experience using lung ultrasound
"Diagnostic_Utility", # How lung ultrasound helps in diagnosing diseases
"Comparative_Analysis", # Comparisons with other imaging modalities like X-ray or CT
"Implementation_Challenges", # Barriers to adoption and practical difficulties
"Training_and_Education", # Aspects related to learning and teaching lung ultrasound
"System_Infrastructure", # How hospital systems, devices, and software support ultrasound use
"Administrative_Buying", # Role of hospital leadership and institutional support
"Workflow_Impact", # How lung ultrasound affects daily hospital operations
"Patient_Engagement", # Ways ultrasound enhances patient understanding and involvement
"Future_Adoption", # Predictions about the role of lung ultrasound in hospital practice
],
# PLEASE MAKE SURE THE LABEL LIST YOU USE HERE MATCHES THE ONE IN YOUR INPUT FILE
upper_lower_mapping = {
"multi_level_org_char": [ # High-level category
"Clinical_Experience", # Provider Characteristics
"System_Infrastructure" # Health System Characteristics
],
"multi_level_org_perspect": [ # High-level category
"Comparative_Analysis", # Imaging modalities in general
"Administrative_Buying", # Value equation
"Diagnostic_Utility", # Clinical utility & efficiency-Provider perspective
"Patient_Engagement", # Patient/Physician interaction in LUS
"Workflow_Impact" # Workflow related problems
],
"impl_sust_infra": [ # High-level category
"Training_and_Education", # Training
"Implementation_Challenges", # Credentialing / Quality Assurance Infrastructure
"Future_Adoption" # Financial Impact
]
}
)
Next, we start the training job with a single line of code.
bert_model, electra_model = trainer.train_and_classify()
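The Stages section mentions metrics evaluation (accuracy and Cohen's kappa score). If you want to recompute those numbers yourself on a held-out set, the standard scikit-learn functions do the job; the y_true/y_pred arrays below are placeholders for your own gold labels and model predictions.

from sklearn.metrics import accuracy_score, cohen_kappa_score

# Placeholder labels and predictions for a single target column; replace with your own
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))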
Inference and Automated Document Classification (via Color Coding)
This uses either of the trained models to run predictions on a folder of raw EHR transcripts:
trainer.inference_documents(
input_folder = "/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/synthetic_transcripts",
output_folder = "/content/drive/MyDrive/Kalu+Deola/PYTRANSCRIPTS PUBLIC DEMO/inferenced_transcripts",
threshold = 0.15, # default value = 0.15
model_type = 'bert' # defaults to 'bert'; options: 'bert', 'electra'
)
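A reasonable reading of the threshold parameter is as a per-label confidence cutoff: each label receives its own score, and anything at or above the cutoff is assigned (and color-coded) for that passage. The sketch below only illustrates that selection logic with made-up probabilities; pytranscripts computes the actual scores internally.

# Illustration only: how a per-label probability threshold selects tags
label_probs = {
    "Clinical_Experience": 0.62,
    "Diagnostic_Utility": 0.18,
    "Workflow_Impact": 0.07,
}
threshold = 0.15  # same default as inference_documents above
assigned = [label for label, prob in label_probs.items() if prob >= threshold]
print(assigned)  # ['Clinical_Experience', 'Diagnostic_Utility']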
Contributing
We welcome contributions! Please follow the contributing guidelines.
License
This project is licensed under the MIT License. See the LICENSE file for details.