About Me

Machine Learning and Data Science Engineer with industry experience at Butterflies Dating and Socials, General Electric, and Propeller Health (ResMed). Bachelor's in Computer Science from VIT University. Master's in Data Science from the University of San Francisco.

Projects

LinkedIn Alumni Profile Similarity

Built an ETL pipeline with Airflow that extracts alumni data from GCP, transforms it with PySpark, and loads it into MongoDB Atlas. Used BERT sentence embeddings in Databricks to measure profile similarity.

Image Captioning on Flickr-8k Dataset

Built an image captioning model in PyTorch, trained on the Flickr-8k dataset, using an attention-based sequence-to-sequence architecture that generates a description of what is happening in a given image.

Apple Support Chatbot using Twitter Customer Support Data

Developed an Apple Support chatbot trained on Twitter customer support data, using an attention-based sequence-to-sequence (seq2seq) model to provide automated responses to user queries.

Default Prediction using American Express Dataset

Implemented a default prediction model for risk assessment on the American Express dataset, using Apache Spark to process 15 million rows of data.

Windmill Power Forecasting

Prototyped multiple time series forecasting models, including Prophet, SARIMAX, VAR, and ETS, to predict the power generation of a windmill.

Experience

Propeller Health (ResMed)

Data Scientist, October 2022 - July 2023
- Developed and implemented an Isolation Forest model and a distance-based algorithm to improve the quality and consistency of patient drug intake data by detecting anomalies across 200M rows, using Apache Spark for data transformation.
- Used randomized hyperparameter search to tune an XGBoost regression model on Amazon SageMaker and Redshift that predicts the likelihood of patients taking rescue inhaler puffs.
- As an intern, built interactive Tableau dashboards to analyze user behavior in an Electronic Health Record (EHR) platform.
- Used PostgreSQL for data storage and retrieval, ensuring data integrity and accessibility throughout the project.
- Collaborated with cross-functional teams, including data engineers and healthcare professionals, to gather requirements and to design and deploy data pipelines.

General Electric Steam Power

Data Analyst, January 2022 - June 2022
- Conducted a word frequency analysis on an imbalanced dataset and applied feature engineering to improve data quality.
- Developed a Random Forest model to predict the Parts Qualification Level for turbine components, achieving 97.3% accuracy.
- Engineered a streamlined, automated model pipeline around the Random Forest model, estimated to reduce field engineers' workload by 20%.
- Used PostgreSQL, Tableau, and machine learning to address real-time business needs.

Butterflies Dating and Socials

Software Development Engineer, April 2021 - December 2021
- As part of the Machine Learning team, developed a transformer-based profile matching algorithm in PyTorch that improved the accuracy and efficiency of matching user profiles within the application.
- Built a DistilBERT model that achieved 98% accuracy in detecting and flagging toxic chats, helping maintain a safe and respectful environment within the app.
- Conducted A/B tests against the identified KPIs to evaluate different algorithms and model variations, supporting data-driven decisions and improved algorithm performance.
- Participated in code reviews and knowledge-sharing sessions.

Skills

Languages & Frameworks

Python, C, C++, Java, R, PyTorch, Keras, HTML, CSS, Git, Tableau, PowerBI, Looker, Django

Databases

MongoDB, PostgreSQL, Oracle (PL/SQL), SQL Server, T-SQL, MySQL, PrestoDB, Cassandra

Data Engineering

Spark (PySpark, SparkSQL, SparkML), RAPIDS cuDF, Kafka, Hadoop, Hive, Snowflake, dbt, Databricks

Cloud

AWS (EC2, S3, SageMaker, Redshift, Lambda, Glue, Athena), GCP (Compute, Buckets, BigQuery), Microsoft Azure

Data Science

Regression, Gradient Boosting, Time Series Analysis (ARIMA, VAR, Prophet), NumPy, SciPy, A/B Testing, SHAP, Recommendation Engines, Customer Segmentation

Machine Learning

Neural Networks, CNN, RNN, LSTM, Transformers (GPT), Generative AI, Prompt Engineering, ChatGPT API, LangChain

ML Ops

MLflow, DVC, Metaflow, Airflow, Git, Kubernetes, Docker, Prefect, Deepchecks, Evidently AI, Alibi Detect

Other

Docker, CI/CD, Microservices, API design, Agile/Scrum