The Challenge
Why T-RECS?
Cancer is one of the leading causes of death worldwide, and early detection and accurate diagnosis of tumors are critical for effective treatment and improved patient outcomes. Our recommendation system centralizes relevant information, making it easier for patients to make informed decisions. Tumor conditions often require prompt medical attention and treatment. A recommendation system that quickly connects patients with suitable doctors and specialists can reduce delays, ensure timely access to appropriate care, and greatly improve the overall patient experience.
                        
Unlike other recommendation systems, ours combines the Placekey API with machine learning to help patients make informed decisions based on the most current information. In addition, our system aggregates sentiment from all available reviews. By training on a small set of core features, we are able to improve the accuracy and relevance of our tumor specialist recommendations.
According to the Centers for Disease Control and Prevention
602,350+
Cancer deaths in the U.S. in 2023
Learn More about Tumors 
                Information About Tumors
A tumor is a mass or group of abnormal cells that forms in the body. If you have a tumor, it isn’t necessarily cancer. Many tumors are benign (not cancerous). Tumors can form throughout the body. They can affect bone, skin, tissues, glands and organs. Neoplasm is another word for tumor.
A Tumor May Be:
Malignant
- Bone tumors (osteosarcoma and chordomas)
- Brain tumors such as glioblastoma and astrocytoma
- Malignant soft tissue tumors and sarcomas
- Organ tumors such as lung cancer and pancreatic cancer
- Skin tumors (such as squamous cell carcinoma)
- Ovarian germ cell tumors
Benign
- Benign bone tumors (osteomas)
- Brain tumors such as meningiomas and schwannomas
- Gland tumors such as pituitary adenomas
- Lymphatic tumors such as angiomas
- Benign soft tissue tumors such as lipomas
- Uterine fibroids
Precancerous
- Actinic keratosis, a skin condition
- Cervical dysplasia
- Colon polyps
- Ductal carcinoma in situ, a type of breast tumor
Our Mission
Our mission is to empower individuals dealing with tumors by providing a personalized and comprehensive tumor recommendation system. Through machine learning and targeted recommendations, we aim to simplify the process of finding and connecting with the most suitable doctors and specialists. By delivering transparency, personalized care, and informed decision-making, we strive to improve treatment outcomes and enhance the overall well-being of tumor patients.
Dataset
Data.CMS.Gov Dataset
Data.CMS.GOV gives you direct access to the Centers for Medicare & Medicaid Services’ (CMS) official data that are used on the Medicare Care Compare website and directories.
The Doctors and Clinicians national downloadable file is organized such that each line is unique at the clinician/enrollment record/group/address level. Clinicians with multiple Medicare enrollment records and/or single enrollments linking to multiple practice locations are listed on multiple lines.
Learn more about CMS.Gov Data Commons
 
                    
            Yelp Dataset
The Yelp dataset contains a subset of Yelp's businesses, reviews, and user data. We mapped business IDs back to business names and narrowed the Yelp dataset down to oncologists based on reviews. The dataset includes business IDs, business names, addresses, cities, states, zip codes, latitude/longitude, and categories for each review. It also includes a JSON file with review IDs, user IDs, business IDs, review text, and dates.
Yelp Dataset
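A minimal sketch of the mapping step described above: join reviews to business names by business ID and keep only oncology-related businesses. The records are made up for illustration, and filtering on a category keyword is an assumption about how the narrowing could be done, not necessarily the project's exact method.

```python
import json

# Made-up sample records in JSON-lines form (one JSON object per line).
BUSINESS_LINES = [
    '{"business_id": "b1", "name": "Bay Oncology Group", "categories": "Doctors, Oncologist"}',
    '{"business_id": "b2", "name": "Corner Cafe", "categories": "Coffee & Tea"}',
]
REVIEW_LINES = [
    '{"review_id": "r1", "user_id": "u1", "business_id": "b1", "text": "Caring staff.", "date": "2021-03-04"}',
    '{"review_id": "r2", "user_id": "u2", "business_id": "b2", "text": "Good espresso.", "date": "2021-05-06"}',
]


def oncology_reviews(business_lines, review_lines, keyword="oncolog"):
    """Return reviews of oncology businesses, each tagged with its business name."""
    # Build a business_id -> name lookup for businesses whose categories match.
    names = {}
    for line in business_lines:
        business = json.loads(line)
        categories = (business.get("categories") or "").lower()
        if keyword in categories:
            names[business["business_id"]] = business["name"]

    # Keep only reviews whose business_id is in the lookup.
    matched = []
    for line in review_lines:
        review = json.loads(line)
        if review["business_id"] in names:
            matched.append({**review, "business_name": names[review["business_id"]]})
    return matched


reviews = oncology_reviews(BUSINESS_LINES, REVIEW_LINES)
```

Building the lookup first means the larger review file only needs a single pass.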
Key Features
After data transformation, our project team narrowed data down to core features to inform our machine learning models:
- Full Name (MD or DO Medical Professionals)
- Specialty (filtered for oncology)
- Zip Code (used to measure distance from the zip code center to the practice)
- Coordinates (latitude and longitude of medical practice)
- Placekey (unique identifier for any address)
- Score (sentiment analysis of individual and business Yelp reviews)
Models
Sentiment Analysis
                            Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative or neutral. 
                            Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs.
                             Learn More about Sentiment Analysis
                        
Our Approach
                            1. Gathered oncologist reviews from the Yelp Academic dataset and the RateMDs GitHub dataset.
                            2. Ran the VADER sentiment analysis model, an open-source model designed for social media text that requires no training data. It is limited to English text.
                            3. Calculated a compound score for each review: a normalized, weighted composite score.
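To illustrate how a compound score ends up between -1 and 1, here is a minimal sketch of the normalization step. It uses the form x / sqrt(x^2 + alpha) with alpha = 15, which is our understanding of VADER's normalization. The tiny lexicon is hypothetical and stands in for the real VADER lexicon, which is far larger and also applies rules for negation, intensifiers, and punctuation.

```python
import math

# Hypothetical mini-lexicon for illustration only (not the real VADER lexicon).
TOY_LEXICON = {"great": 3.1, "caring": 2.2, "rude": -2.0, "terrible": -2.1}


def compound_score(text: str, alpha: float = 15.0) -> float:
    """Sum per-word valences, then squash the sum into the open interval (-1, 1)."""
    total = sum(
        TOY_LEXICON.get(word.strip(".,!?").lower(), 0.0)
        for word in text.split()
    )
    # Normalization: large sums approach +/-1, a sum of zero stays at 0.
    return total / math.sqrt(total * total + alpha)


positive = compound_score("Great and caring doctor!")
negative = compound_score("The front desk was rude.")
neutral = compound_score("I had an appointment on Tuesday.")
```

In the project itself the scores come from the VADER model rather than a hand-written lexicon; the sketch only shows why the result is bounded and signed.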
                        
 
                        
                Top 5 Best Match by Distance, Rating, and Years of Experience
Feature selection is the method of reducing the number of input variables to your model by using only relevant data and removing noise. It is the process of automatically choosing relevant features for your machine learning model based on the type of problem you are trying to solve. Features are included or excluded without being changed, which cuts down the noise in the data and reduces the size of the input.
                            
Feature selection models are of two types: 
                            
Our Approach
                            1. Individuals have reviews and their practices also have reviews. Combine their review scores to create an overall score.
                            2. Locations are encoded by 9-character placekeys, a unique identifier for a physical place.
                            3. Merge doctor information and review scores on placekeys.
                            4. Not all zip codes may have an oncologist. Expand placekey comparison to fewer placekey characters and calculate distances from starting zip code to all matches.
                            5. Based on the entered zip code and user preferences, display the top recommendations accordingly (e.g. if the user's highest to lowest preference is score, distance, years of experience, then sort in that order).
                            6. Show each doctor's name, gender, years of experience, address, and phone number so that key facts are available at a single glance.
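The matching and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `Doctor` fields, the simple average used to combine individual and practice scores, the placeholder placekey strings, and the `min_chars` cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doctor:
    name: str
    placekey: str            # treated as an opaque string here
    individual_score: float
    practice_score: float
    distance_km: float
    years_experience: int

    @property
    def score(self) -> float:
        # Assumption: combine the two review scores with a simple average.
        return (self.individual_score + self.practice_score) / 2


def match_by_prefix(doctors, user_placekey, min_chars=3):
    """Compare fewer and fewer leading characters until some doctor matches."""
    for n in range(len(user_placekey), min_chars - 1, -1):
        prefix = user_placekey[:n]
        hits = [d for d in doctors if d.placekey.startswith(prefix)]
        if hits:
            return hits
    return []


# Higher score and experience are better; shorter distance is better.
SORT_KEYS = {
    "score": lambda d: -d.score,
    "distance": lambda d: d.distance_km,
    "experience": lambda d: -d.years_experience,
}


def top_matches(doctors, preferences, k=5):
    """Sort by the user's preference order (most important first) and keep k."""
    return sorted(doctors, key=lambda d: tuple(SORT_KEYS[p](d) for p in preferences))[:k]


doctors = [
    Doctor("A", "5vg-7gq-tvz", 0.8, 0.6, 4.0, 10),
    Doctor("B", "5vg-7gq-aaa", 0.9, 0.9, 12.0, 5),
    Doctor("C", "9zz-111-222", 0.2, 0.2, 1.0, 30),
]
nearby = match_by_prefix(doctors, "5vg-7gq-zzz")
by_score = top_matches(nearby, ["score", "distance", "experience"])
by_distance = top_matches(nearby, ["distance", "score", "experience"])
```

Sorting on a tuple lets one function handle any preference order: earlier criteria dominate and later ones only break ties.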
                        
 
                        
                HDBScan
                            HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
                            
In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.
                            
                            
HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).
                            Learn More about HDBScan
                        
Our Approach
                            1. Using the merged data from above, convert gender to a number for the HDBSCAN model. Also map the review scores, which previously ranged from -1 to 1, to a 1-to-5 scale. If there is no review present, use a value of 0.
                            2. Cluster based on distance, score, years of experience, and gender. Specify each cluster to have a minimum of 25 samples.
                            3. Not all zip codes may have an oncologist. Expand zip code search to neighboring +/- 10 zip codes.
                            4. Use the zip code centroid coordinates to calculate distance to medical practice from the center of the zip code.
                            5. Based on user preferences, sort HDBSCAN results accordingly (e.g. if the user's highest to lowest preference is distance, score, years of experience then sort in that order).
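The preparation steps above can be sketched with the standard library alone. The assumptions here are ours: the -1 to 1 score is mapped linearly onto 1 to 5, distance is computed with the haversine formula, "+/- 10 zip codes" is read as numerically adjacent codes, and the gender codes are arbitrary. The clustering call itself is omitted.

```python
import math

# Hypothetical numeric encoding; any consistent mapping would work.
GENDER_CODES = {"F": 0, "M": 1}


def rescale_score(compound):
    """Map a score in [-1, 1] linearly onto [1, 5]; a missing review becomes 0."""
    if compound is None:
        return 0.0
    return 1.0 + (compound + 1.0) * 2.0


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two latitude/longitude points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def neighboring_zips(zip_code, spread=10):
    """Zip codes within +/- spread of the given code, read numerically."""
    center = int(zip_code)
    return [f"{z:05d}" for z in range(center - spread, center + spread + 1) if 0 <= z <= 99999]


def feature_row(distance_km, compound, years_experience, gender):
    """One numeric row per doctor: distance, score, experience, gender."""
    return [distance_km, rescale_score(compound), float(years_experience), float(GENDER_CODES[gender])]


row = feature_row(haversine_km(0.0, 0.0, 0.0, 1.0), 0.5, 12, "F")
```

Rows built this way could then be passed to a clustering library with the minimum cluster size set to 25. Note that numerically adjacent zip codes are not always geographic neighbors, which is one reason the distance is computed from coordinates.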
                        
 
                        
                Architecture
Yelp Sentiment Data Pipeline
Medical Provider Data Pipeline
Data Combination Pipeline

Next Steps
Although our recommendation system focuses primarily on tumors, we can see our project being applied to other specialties. We want to incorporate more doctor reviews and, further down the line, update our website using real-time data. Another opportunity is to develop a mobile application to increase accessibility.
 
             
             
            
        Project Team
 
                            
                        Eric Le
ericle@berkeley.edu
 
                            
                        Amangeet Kaur Samra
aksamra@berkeley.edu
 
                            
                        Jashwanth Sompalli
jashsompalli@berkeley.edu
 
                            
                        Stephen Tan
stephen.tan@berkeley.edu
Special Thanks to Dr. Fred Nugen and Dr. Alberto Todeschini
Presentations
 
               
             
             
             
               
              Frequently Asked Questions
Have you thought about privacy and ethical concerns with the dataset?
The dataset did not contain any personal information or demographics. Subjects were identified by a randomly generated identifier. From our review of the dataset, there is no data point that can be used or reverse engineered to recover names, locations, or any other identifiers.
                        
Capstone Project Audit Conducted by Megan Martin
                    
Who is your intended audience?
Our intended audience is tumor patients looking to find specialists and doctors for their specific conditions.
What is the key differentiation between your MVP and the existing solutions and/or approaches?
One key differentiator between existing solutions and our proposal is the focus on tumor treatment. Another is the size and quality of the dataset used to train our machine learning model: a larger and more diverse dataset generally helps improve accuracy and produce a more robust model. A third is the specific algorithms and techniques we used for feature extraction and classification, which can affect the performance of the model.
What are future plans for this project?
Expand recommendations to other specialties. Collect more doctor reviews. Open-source the review engine (sentiment analysis).
How can I trust these recommendations?
We perform sentiment analysis on Yelp reviews and reviews found on HealthGrades for a given doctor and medical practice.
 
            