The Challenge

Why T-RECS?

Cancer is one of the leading causes of death worldwide, and early detection and accurate diagnosis of tumors are critical for effective treatment and improved patient outcomes.

Our recommendation system centralizes relevant information, making it easier for patients to make informed decisions. Tumor conditions often require prompt medical attention and treatment. Having a recommendation system that can quickly connect patients with suitable doctors and specialists can not only help reduce delays and ensure timely access to appropriate care but also greatly enhance the overall patient experience.

Unlike other recommendation systems, ours combines the Placekey API, which assigns a unique identifier to any physical address, with machine learning to help patients make informed decisions based on the most current information. In addition, our system aggregates sentiment from all available reviews. By training on a core set of key features, we are able to improve the accuracy and relevance of our tumor specialist recommendations.

According to the Centers for Disease Control and Prevention

602,350+

Cancer deaths in the U.S. in 2023

Learn More about Tumors

Information About Tumors

A tumor is a mass or group of abnormal cells that form in the body. If you have a tumor, it isn’t necessarily cancer. Many tumors are benign (not cancerous). Tumors can form throughout the body. They can affect bone, skin, tissues, glands and organs. Neoplasm is another word for tumor.

A Tumor May Be:

Cancerous: Malignant or cancerous tumors can spread into nearby tissue, glands and other parts of the body. The new tumors are metastases (mets). Cancerous tumors can come back after treatment (cancer recurrence). These tumors can be life-threatening.

Noncancerous: Benign tumors are not cancerous and are rarely life-threatening. They’re localized, which means they don’t typically affect nearby tissue or spread to other parts of the body. Many noncancerous tumors don’t need treatment. But some noncancerous tumors press on other body parts and do need medical care.

Precancerous: These noncancerous tumors can become cancerous if not treated.

Malignant

Bone tumors (osteosarcoma and chordomas)
Brain tumors such as glioblastoma and astrocytoma
Malignant soft tissue tumors and sarcomas
Organ tumors such as lung cancer and pancreatic cancer
Skin tumors (such as squamous cell carcinoma)
Ovarian germ cell tumors

Benign

Benign bone tumors (osteomas)
Brain tumors such as meningiomas and schwannomas
Gland tumors such as pituitary adenomas
Lymphatic tumors such as angiomas
Benign soft tissue tumors such as lipomas
Uterine fibroids

Precancerous

Actinic keratosis, a skin condition
Cervical dysplasia
Colon polyps
Ductal carcinoma in situ, a type of breast tumor

Learn More about Types of Tumors

Our Mission

Our mission is to empower individuals dealing with tumors by providing a personalized and comprehensive tumor recommendation system. Through machine learning and targeted recommendations, we aim to simplify the process of finding and connecting with the most suitable doctors and specialists. By delivering transparency, personalized care, and informed decision-making, we strive to improve treatment outcomes and enhance the overall well-being of tumor patients.

Dataset

Data.CMS.Gov Dataset

Data.CMS.GOV gives you direct access to the Centers for Medicare & Medicaid Services’ (CMS) official data that are used on the Medicare Care Compare website and directories.

The Doctors and Clinicians national downloadable file is organized such that each line is unique at the clinician/enrollment record/group/address level. Clinicians with multiple Medicare enrollment records and/or single enrollments linking to multiple practice locations are listed on multiple lines.

Learn more about CMS.Gov Data Commons

Yelp Dataset

The Yelp dataset contains a subset of Yelp's businesses, reviews, and user data. We mapped business IDs back to business names and narrowed the Yelp dataset down to oncologists based on reviews. The dataset includes business IDs, business names, addresses, cities, states, zip codes, latitude/longitude, and categories for each review. The dataset also includes a JSON file with review IDs, user IDs, business IDs, review text, and date.

Yelp Dataset

Key Features

After data transformation, our project team narrowed data down to core features to inform our machine learning models:

Full Name (MD or DO Medical Professionals)
Specialty (filtered for oncology)
Zip Code (distance is measured from the zip code center to the practice)
Coordinates (latitude and longitude of medical practice)
Placekey (unique identifier for any address)
Score (sentiment analysis of individual and business Yelp reviews)
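As a rough illustration of how these core features might sit together in one record, here is a minimal sketch. The class name, field names, and sample values are assumptions made for this example and are not taken from the project code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OncologistRecord:
    """One row of the feature table (field names are illustrative)."""
    full_name: str            # MD or DO medical professional
    specialty: str            # filtered for oncology
    zip_code: str             # kept as text so leading zeros survive
    latitude: float           # coordinates of the medical practice
    longitude: float
    placekey: str             # identifier for the practice address
    score: Optional[float]    # sentiment score; None when no reviews exist


# Made-up values for illustration only; the placekey is a placeholder.
example = OncologistRecord(
    full_name="Jane Doe",
    specialty="MEDICAL ONCOLOGY",
    zip_code="02115",
    latitude=42.34,
    longitude=-71.10,
    placekey="zzz-222@abc-def-ghi",
    score=0.62,
)
```

Storing the zip code as a string is a deliberate choice in this sketch, since an integer would silently drop leading zeros.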

Models

Sentiment Analysis

Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs.
Learn More about Sentiment Analysis

Our Approach

1. Gathered oncologist reviews from the Yelp Academic dataset and the RateMDs GitHub dataset.
2. Ran the VADER sentiment analysis model, an open-source, lexicon- and rule-based model tuned for social media text that requires no training data. It is limited to English text.
3. Calculated the compound review score, a normalized, weighted composite score for each review.
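To make the idea of a normalized compound score concrete, the sketch below sums word valences from a tiny hand-made lexicon and squashes the sum into the range -1 to 1 using the normalization VADER applies to its compound score, x / sqrt(x² + 15). This is a toy stand-in, not VADER itself: the lexicon entries and their values are invented for illustration, and VADER's rules for negation, punctuation, and intensifiers are omitted.

```python
import math

# Tiny hand-made lexicon for illustration; the values are invented.
# VADER ships a far larger lexicon plus rules this sketch leaves out.
TOY_LEXICON = {"great": 3.1, "kind": 2.4, "helpful": 1.8,
               "rude": -2.0, "terrible": -2.1, "late": -0.8}


def compound_score(text: str, alpha: float = 15.0) -> float:
    """Sum word valences, then squash the sum into the range (-1, 1)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return total / math.sqrt(total * total + alpha)


def label(score: float, threshold: float = 0.05) -> str:
    """Bucket a compound score into positive, negative, or neutral."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

The 0.05 cutoffs follow the thresholds commonly used with VADER compound scores; a project could reasonably pick different ones.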

Top 5 Best Match by Distance, Rating, and Years of Experience

Feature selection is the method of reducing the number of input variables to your model by using only relevant data and getting rid of noise. It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. We do this by including or excluding features without changing them. It helps cut down the noise in our data and reduce the size of our input data.

Feature selection models are of two types:

Supervised Models: Supervised feature selection refers to the method which uses the output label class for feature selection. They use the target variables to identify the variables which can increase the efficiency of the model.

Unsupervised Models: Unsupervised feature selection refers to the method which does not need the output label class for feature selection. We use them for unlabelled data.
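The difference between the two types can be sketched with two simple filters. The variance filter below never looks at a label (unsupervised), while the correlation filter scores each feature against the target (supervised). The function names, thresholds, and sample data are assumptions chosen for illustration, not the selection method used in this project.

```python
from statistics import mean, pvariance


def variance_filter(columns, min_variance=0.01):
    """Unsupervised: keep features whose values actually vary."""
    return [name for name, values in columns.items()
            if pvariance(values) >= min_variance]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_filter(columns, target, min_abs_corr=0.5):
    """Supervised: keep features that correlate with the target label."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= min_abs_corr]


# Made-up demo data.
features = {
    "years_experience": [5, 12, 20, 30],
    "constant_flag": [1, 1, 1, 1],       # no variance, carries no signal
    "noise": [3, 1, 4, 1],
}
target = [2.0, 3.0, 4.0, 5.0]
```

On this demo data the variance filter drops only the constant column, while the correlation filter also drops the noise column because it barely tracks the target.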

Learn More about Feature Selection

Our Approach

1. Individuals have reviews and their practices also have reviews. Combine their review scores to create an overall score.
2. Locations are encoded by 9-character placekeys, a unique identifier for a physical place.
3. Merge doctor information and review scores on placekeys.
4. Not all zip codes may have an oncologist. Expand placekey comparison to fewer placekey characters and calculate distances from starting zip code to all matches.
5. Based on the entered zip code and user preferences, display top recommendations accordingly (e.g. if the user's highest to lowest preference is score, distance, years of experience, then sort in that order).
6. Display name, gender, years of experience, address, and phone number so key facts are visible at a single glance.
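The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the dictionary keys, the averaging rule for combining scores, the minimum prefix length, and the use of the haversine formula for distance are all choices made for this example, since the source does not specify them. It also assumes every record carries numeric values for the sort keys.

```python
import math


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def overall_score(individual, practice):
    """Average whichever review scores exist (assumed combination rule)."""
    present = [s for s in (individual, practice) if s is not None]
    return sum(present) / len(present) if present else None


def match_by_prefix(doctors, user_placekey, min_chars=3):
    """Shorten the compared placekey prefix until at least one doctor matches."""
    for n in range(len(user_placekey), min_chars - 1, -1):
        hits = [d for d in doctors if d["placekey"][:n] == user_placekey[:n]]
        if hits:
            return hits
    return []


def top_recommendations(doctors, preference_order, k=5):
    """Sort by the user's preference order, most important key first."""
    # Higher score and experience are better; shorter distance is better.
    directions = {"score": -1, "distance": 1, "years_experience": -1}

    def key(d):
        return tuple(directions[p] * d[p] for p in preference_order)

    return sorted(doctors, key=key)[:k]
```

For example, calling `top_recommendations(hits, ["score", "distance", "years_experience"])` breaks ties on score by distance, then by years of experience, which mirrors the preference ordering described in step 5.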

HDBScan

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).
Learn More about HDBScan

Our Approach

1. Using the merged data from above, convert gender to a number for the HDBSCAN model. Also map the review scores, which previously ranged from -1 to 1, to a 1 to 5 scale. If there is no review present, use a 0 value.
2. Cluster based on distance, score, years of experience, and gender. Specify each cluster to have a minimum of 25 samples.
3. Not all zip codes may have an oncologist. Expand zip code search to neighboring +/- 10 zip codes.
4. Use the zip code centroid coordinates to calculate distance to medical practice from the center of the zip code.
5. Based on user preferences, sort HDBSCAN results accordingly (e.g. if the user's highest to lowest preference is distance, score, years of experience then sort in that order).
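The preprocessing in these steps can be sketched as follows. The linear mapping from the -1 to 1 range onto 1 to 5, the gender codes, the dictionary keys, and the purely numeric reading of "neighboring +/- 10 zip codes" are assumptions for illustration; the source does not state the exact formulas used.

```python
def rescale_score(compound):
    """Map a sentiment score in [-1, 1] onto [1, 5]; 0 marks 'no review'."""
    if compound is None:
        return 0.0
    return 1.0 + (compound + 1.0) * 2.0


# Hypothetical numeric codes; the source does not state the actual encoding.
GENDER_CODES = {"F": 0, "M": 1}


def neighboring_zips(zip_code, spread=10):
    """Zip codes numerically within +/- spread of the starting zip code."""
    base = int(zip_code)
    return [f"{z:05d}" for z in range(base - spread, base + spread + 1)
            if 0 < z <= 99999]


def feature_row(doctor):
    """Numeric vector in the order: distance, score, experience, gender."""
    return [
        doctor["distance"],
        rescale_score(doctor.get("compound")),
        doctor["years_experience"],
        GENDER_CODES[doctor["gender"]],
    ]
```

A list of such rows could then be handed to an HDBSCAN implementation with its minimum cluster size set to 25, as described in step 2. Because the four features sit on different scales, standardizing them before clustering may be worth considering.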

Architecture

Yelp Sentiment Data Pipeline

Preprocess_yelp_data.py - clean_data/yelp_bid_reviews.csv

Yelp_sentiment.py - sentiment_data/yelp_bid_review_sentiment.csv

Yelp_post_sentiment_combiner.py - sentiment_data/yelp_bid_sent_name.csv

Generate_yelp_placekeys.py - clean_data/final_yelp_dataset.csv

Medical Provider Data Pipeline

Preprocess_provider_data.py - clean_data/oncologists.csv

Generate_provider_placekeys.py - clean_data/placekey_oncologist_dataset.csv

Generate_final_oncologist_dataset.py - clean_data/final_oncologist_dataset.csv

Data Combination Pipeline

Generate_final_trecs_dataset.py - final/trecs.csv

Next Steps

Although our recommendation system focuses primarily on tumors, we can see our project being applied to other specialties. We want to incorporate more doctor reviews and, further down the line, update our website using real-time data. Another opportunity is to develop a mobile application to increase accessibility.

Project Team

Eric Le

ericle@berkeley.edu

Amangeet Kaur Samra

aksamra@berkeley.edu

Jashwanth Sompalli

jashsompalli@berkeley.edu

Stephen Tan

stephen.tan@berkeley.edu

Special Thanks to Dr. Fred Nugen and Dr. Alberto Todeschini

Presentations

Presentation #1

Presentation #2

Presentation #3

Website Demo

Frequently Asked Questions

Have you thought about privacy and ethical concerns with the dataset?

The dataset did not contain any personal information or demographics. Subjects were identified by a randomly generated identifier. From our review of the dataset, there is no data point that can be used or reverse engineered to identify names, locations, or any other identifiers.

Capstone Project Audit Conducted by Megan Martin

Who is your intended audience?

Our intended audience is tumor patients looking to find specialists and doctors for their specific conditions.

What is the key differentiation between your MVP and the existing solutions and/or approaches?

One key differentiator between existing solutions and our proposal is the focus on tumor treatment. Another is the size and quality of the dataset used to train our machine learning model: a larger and more diverse dataset can help improve accuracy and produce a more robust model. A third is the specific algorithms and techniques we used for feature extraction and classification, which can affect the performance of the model.

What are future plans for this project?

Expand recommendations to other specialties. Collect more doctor reviews. Open-source the review engine (sentiment analysis).

How can I trust these recommendations?

We perform sentiment analysis on Yelp reviews and reviews found on HealthGrades for a given doctor and medical practice.

Are there any relevant readings, market research, white papers, academic research (share title and link)?

Kavalci, E., Hartshorn
Zhang A, Xing L, Zou J, Wu JC
Chaitra H, Sreyas M, Ravi T, Aakash K
Jason T, Swaroop R
Jiancong S

Tumor Doctor Recommendation System

In association with the Master's in Information and Data Science Program, University of California, Berkeley
