Train_Beneficiarydata-1542865627584.csv: Contains demographic and health-related information for beneficiaries (e.g., BeneID, Age, Gender, ChronicCond_*).
Train_Inpatientdata-1542865627584.csv: Records of inpatient claims, including details like ClaimID, Provider, AttendingPhysician, AdmissionDt, DischargeDt, ClmDiagnosisCode_*, ClmProcedureCode_*, InscClaimAmtReimbursed, and DeductibleAmtPaid.
Train_Outpatientdata-1542865627584.csv: Similar to the inpatient data, but for outpatient claims.
Train-1542865627584.csv: The target file containing Provider IDs and their corresponding PotentialFraud status ('Yes' or 'No').

Unseen_Beneficiarydata-1542969243754.csv
Unseen_Inpatientdata-1542969243754.csv
Unseen_Outpatientdata-1542969243754.csv
Unseen-1542969243754.csv: Contains Provider IDs for which fraud predictions are to be submitted.

Claims were joined with the beneficiary data (on BeneID) to enrich claim records with beneficiary demographics and chronic conditions.
The aggregated data was merged with the Train.csv (or Unseen.csv) file based on Provider ID, which served as the aggregation key and target identifier.

Claim counts (TotalClaims, TotalInpatientClaims, TotalOutpatientClaims).
Financial totals and averages (SumInscClaimAmtReimbursed, AvgInscClaimAmtReimbursed, SumDeductibleAmtPaid, AvgDeductibleAmtPaid).
Average beneficiary demographics and health indicators (AvgAge, AvgGender, AvgRace, AvgChronicCond_*, AvgRenalDiseaseIndicator).
Unique beneficiaries and physicians (UniqueBeneIDs, UniqueAttendingPhysicians, UniqueOperatingPhysicians, UniqueOtherPhysicians).
Unique diagnosis and procedure codes (UniqueClmDiagnosisCode_*, UniqueClmProcedureCode_*, UniqueClmAdmitDiagnosisCode, UniqueDiagnosisGroupCode).
Average claim duration (AvgClaimDuration) and average inpatient stay duration (AvgInpatientStayDuration).
Proportion of claims with a missing physician entry (PropMissingAttendingPhysician, PropMissingOperatingPhysician, PropMissingOtherPhysician).

ClaimsWithManyDiagnosisCodes: Counts claims with at least 4 diagnosis codes, potentially signaling 'upcoding' or unnecessary complexity.
InpatientToOutpatientRatio: The ratio of inpatient to outpatient claims, which can reveal an unusual service distribution for a provider.
ReimburseToDeductibleRatio: The ratio of total reimbursed amount to total deductible paid, flagging abnormal financial patterns.
PropClaimsWithOperatingPhysician: Proportion of claims where an operating physician is listed, which might indicate higher volumes of complex or invasive procedures.

Missing values (e.g., AvgInpatientStayDuration for providers with no inpatient claims) were imputed using the median value of their respective columns.
The scale_pos_weight parameter was used during training to give more weight to the minority class (fraud), directly addressing the imbalance.

pandas (for data manipulation)
numpy (for numerical operations)
scikit-learn (for machine learning utilities and metrics)
xgboost (for the final model)
matplotlib (for plotting)
seaborn (for enhanced visualizations)
joblib (for saving and loading models)

Posted Sep 7, 2025
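The merge and provider-level aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy DataFrames and every value in them are invented, the helper name build_provider_features is hypothetical, and turning a zero denominator into NaN for the ratio features is an assumption about how division by zero might be handled.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real CSV files; all values are invented for illustration.
beneficiary = pd.DataFrame({"BeneID": ["B1", "B2", "B3"], "Age": [70, 82, 65]})
inpatient = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "ClaimID": ["C1", "C2"],
    "Provider": ["P1", "P1"],
    "AttendingPhysician": ["D1", None],
    "InscClaimAmtReimbursed": [1000.0, 3000.0],
    "DeductibleAmtPaid": [100.0, 100.0],
})
outpatient = pd.DataFrame({
    "BeneID": ["B3", "B1", "B2"],
    "ClaimID": ["C3", "C4", "C5"],
    "Provider": ["P1", "P2", "P2"],
    "AttendingPhysician": ["D2", "D3", "D3"],
    "InscClaimAmtReimbursed": [200.0, 50.0, 150.0],
    "DeductibleAmtPaid": [0.0, 0.0, 20.0],
})
labels = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})


def build_provider_features(inpatient, outpatient, beneficiary):
    """Hypothetical helper: one row of aggregated features per Provider."""
    # Stack both claim types, remembering which file each claim came from.
    claims = pd.concat(
        [inpatient.assign(IsInpatient=1), outpatient.assign(IsInpatient=0)],
        ignore_index=True,
    )
    # Enrich each claim with beneficiary attributes via BeneID.
    claims = claims.merge(beneficiary, on="BeneID", how="left")
    claims["MissingAttending"] = claims["AttendingPhysician"].isna()

    features = claims.groupby("Provider").agg(
        TotalClaims=("ClaimID", "count"),
        TotalInpatientClaims=("IsInpatient", "sum"),
        SumInscClaimAmtReimbursed=("InscClaimAmtReimbursed", "sum"),
        AvgInscClaimAmtReimbursed=("InscClaimAmtReimbursed", "mean"),
        SumDeductibleAmtPaid=("DeductibleAmtPaid", "sum"),
        AvgAge=("Age", "mean"),
        UniqueBeneIDs=("BeneID", "nunique"),
        UniqueAttendingPhysicians=("AttendingPhysician", "nunique"),
        PropMissingAttendingPhysician=("MissingAttending", "mean"),
    )
    features["TotalOutpatientClaims"] = (
        features["TotalClaims"] - features["TotalInpatientClaims"]
    )
    # Assumption: a zero denominator yields NaN rather than infinity.
    features["InpatientToOutpatientRatio"] = features[
        "TotalInpatientClaims"
    ] / features["TotalOutpatientClaims"].replace(0, np.nan)
    features["ReimburseToDeductibleRatio"] = features[
        "SumInscClaimAmtReimbursed"
    ] / features["SumDeductibleAmtPaid"].replace(0, np.nan)
    return features.reset_index()


provider_features = build_provider_features(inpatient, outpatient, beneficiary)
# Attach the target; for the unseen set the same merge would use the Unseen file.
train = provider_features.merge(labels, on="Provider", how="left")
print(train.T)
```

Keeping the aggregation in one function means the same feature definitions can be applied to both the training and the unseen files, which avoids the two feature sets drifting apart.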
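The claim-level signals behind features such as AvgInpatientStayDuration, ClaimsWithManyDiagnosisCodes, and PropClaimsWithOperatingPhysician might be derived along these lines. The sketch below makes several assumptions: the data is invented (the diagnosis codes are placeholders), the column name OperatingPhysician is inferred from the feature names, and stay duration is taken as the plain difference in days between DischargeDt and AdmissionDt, which may not match the project's exact definition.

```python
import pandas as pd

# Invented inpatient claims; diagnosis codes are placeholder strings.
inpatient = pd.DataFrame({
    "Provider": ["P1", "P1", "P2"],
    "AdmissionDt": ["2009-01-01", "2009-02-10", "2009-03-05"],
    "DischargeDt": ["2009-01-04", "2009-02-15", "2009-03-06"],
    "ClmDiagnosisCode_1": ["4019", "25000", "4280"],
    "ClmDiagnosisCode_2": ["2724", "4019", None],
    "ClmDiagnosisCode_3": ["V5869", None, None],
    "ClmDiagnosisCode_4": ["2449", None, None],
    "OperatingPhysician": ["D9", None, None],  # assumed column name
})

# Stay duration in whole days (assumption: discharge date minus admission date).
stay = pd.to_datetime(inpatient["DischargeDt"]) - pd.to_datetime(inpatient["AdmissionDt"])
inpatient["StayDays"] = stay.dt.days

# Count the non-missing diagnosis code columns on each claim.
diag_cols = [c for c in inpatient.columns if c.startswith("ClmDiagnosisCode_")]
inpatient["NumDiagnosisCodes"] = inpatient[diag_cols].notna().sum(axis=1)
inpatient["ManyDiagnosisCodes"] = inpatient["NumDiagnosisCodes"] >= 4  # "at least 4"
inpatient["HasOperatingPhysician"] = inpatient["OperatingPhysician"].notna()

per_provider = inpatient.groupby("Provider").agg(
    AvgInpatientStayDuration=("StayDays", "mean"),
    ClaimsWithManyDiagnosisCodes=("ManyDiagnosisCodes", "sum"),
    PropClaimsWithOperatingPhysician=("HasOperatingPhysician", "mean"),
)
print(per_provider)
```

Computing the flags per claim first and aggregating afterwards keeps each feature's definition visible in a single line, which makes thresholds such as the 4-code cutoff easy to adjust.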
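Median imputation and the class-weighting step could look something like the sketch below. The table is invented, and the weight is computed with the common negatives-to-positives heuristic; the source does not state which value of scale_pos_weight was actually used, so treat that formula as an assumption.

```python
import numpy as np
import pandas as pd

# Invented provider-level table; NaN marks providers with no inpatient claims.
features = pd.DataFrame({
    "Provider": ["P1", "P2", "P3", "P4", "P5"],
    "AvgInpatientStayDuration": [4.0, np.nan, 6.0, np.nan, 8.0],
    "TotalClaims": [30, 5, 12, 7, 40],
    "PotentialFraud": ["Yes", "No", "No", "No", "No"],
})

# Impute each numeric column with its own median.
numeric_cols = ["AvgInpatientStayDuration", "TotalClaims"]
medians = features[numeric_cols].median()
features[numeric_cols] = features[numeric_cols].fillna(medians)

# Encode the target and derive a weight for the minority (fraud) class.
y = (features["PotentialFraud"] == "Yes").astype(int)
n_pos = int(y.sum())
n_neg = int(len(y) - n_pos)
scale_pos_weight = n_neg / n_pos  # assumed heuristic: negatives / positives

print(features)
print("scale_pos_weight:", scale_pos_weight)
```

The resulting value would then be passed as the scale_pos_weight argument when constructing xgboost.XGBClassifier. When scoring the unseen providers, reusing the medians computed on the training data (rather than recomputing them) keeps the two sets consistent.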
Developed a machine learning model to detect healthcare fraud with improved recall and precision.
Jul 2, 2025 - Jul 31, 2025