1001 Freelance Projects
Latest Projects from Freelance Marketplaces
Today is: 19-Apr-2026 08:59 GMT
Project title: Extract blood test data from PDF documents that have been OCR'd
Posted by: External project from PeoplePerHour
Started: 03-Mar-2026 14:36 GMT
Expected duration: 1 day or less
Description:
The objective is to build a structured blood test database that allows pathology results to be viewed, edited, filtered, and exported to Excel via a web-based HTML interface. The system stores results in a clean, standardised format so trends can be analysed accurately over time.

Using AI-assisted OCR, I have built a local Python extraction pipeline that converts PDF pathology reports into machine-readable text and inserts structured data into a SQLite database. The majority of blood tests extract correctly, including canonical test name, result value, unit, and reference range.

However, I have reached a specific technical issue with three markers:

• CRP (C-reactive protein)
• ESR
• GLU (Glucose)

The OCR output clearly contains the correct lines, and debug logs confirm they are processed. Yet no rows are inserted for these markers.

The failure appears to occur in one of three stages: canonical matching, numeric extraction, or validation.

Current System Architecture

The system runs locally and consists of:

• extraction_core_2.py (main engine)
• Supporting modules for OCR preprocessing, lab dictionary building, regex matching, and validation
• SQLite backend
• Schema-driven canonical lab dictionary
• Controlled fuzzy fallback logic
• HTML viewer for results display and Excel export

Pipeline flow:

1. Convert PDF to image (pdf2image)
2. Preprocess
3. Run Tesseract OCR
4. Clean and normalise text
5. Match against canonical lab dictionary
6. Extract:
   • canonical test name
   • numeric result
   • unit
   • reference range
7. Validate
8. Insert into SQLite

The engine is deterministic and rule-based.
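
For orientation, the final "Insert into SQLite" step might look roughly like the sketch below. The table name, column names, and the insert_rows helper are assumptions made for illustration, since the real schema is not shown in this post. Only the "Inserted N rows from <file>" message mirrors the log line quoted later in the brief.

```python
import sqlite3

# Hypothetical schema: the real table and column names are not given in the post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS lab_results (
    source_file TEXT NOT NULL,
    test_name   TEXT NOT NULL,
    value       REAL NOT NULL,
    unit        TEXT,
    ref_low     REAL,
    ref_high    REAL
)
"""


def insert_rows(conn, source_file, rows):
    """Insert already-validated rows and return the number written."""
    params = [
        (source_file, r["test"], r["value"], r["unit"], r["low"], r["high"])
        for r in rows
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO lab_results VALUES (?, ?, ?, ?, ?, ?)", params
        )
    print(f"Inserted {len(params)} rows from {source_file}")
    return len(params)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert_rows(conn, "example.pdf", [
    {"test": "CRP", "value": 5.2, "unit": "mg/L", "low": 0.0, "high": 5.0},
])
```

Because the row count comes from the list of validated rows, a report of 0 rows at this step would mean every candidate was dropped earlier in the flow, not that the database write failed.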

The Specific Problem

Example OCR line:

CRP H 5.2 mg/L 0-5

OCR text is correct.
NUMBER_PATTERN matches.
The canonical dictionary contains the test.

Yet:

Inserted 0 rows from 0126251OrderReport_23B00006604_CRP.pdf

Likely failure points include:

• Canonical containment match failing due to normalisation
• Flag tokens (“H”, “L”) interfering with numeric capture
• Numeric extraction anchored incorrectly
• Validation rejecting due to strict range formatting
• Unit pattern mismatch (e.g. mmol/L)
• Dictionary indexing issue
• Match overridden by another lab name
• Guard conditions too strict
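
To make the flag-token hypothesis concrete, here is a minimal sketch of a line pattern that treats "H" or "L" as an optional token between the test name and the result. LINE_PATTERN and parse_line are hypothetical names, not the NUMBER_PATTERN used in extraction_core_2.py, and the pattern only covers simple "low-high" ranges.

```python
import re

# Hypothetical pattern: name, optional H/L flag, number, optional unit, optional range.
LINE_PATTERN = re.compile(
    r"""^\s*(?P<name>[A-Za-z][A-Za-z \-]*?)\s+   # test name, lazy so a flag is not absorbed
        (?:(?P<flag>[HL])\s+)?                   # optional abnormal flag
        (?P<value>\d+(?:\.\d+)?)\s*              # numeric result
        (?P<unit>[A-Za-z%/]+)?\s*                # optional unit such as mg/L
        (?:(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?))?\s*$
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return a dict of captured fields, or None if the line does not match."""
    m = LINE_PATTERN.match(line)
    if m is None:
        return None
    low, high = m.group("low"), m.group("high")
    return {
        "test": m.group("name").strip().upper(),
        "flag": m.group("flag"),
        "value": float(m.group("value")),
        "unit": m.group("unit"),
        "low": float(low) if low is not None else None,
        "high": float(high) if high is not None else None,
    }


print(parse_line("CRP H 5.2 mg/L 0-5"))
```

The flag is captured in its own group, so it neither blocks the numeric capture nor gets lost. A test name that itself ends in a standalone "H" or "L" would be ambiguous under this sketch and would need a dictionary check.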

If validation fails, the row is rejected silently.
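
Silent rejection is what makes this hard to isolate. One low-risk change, sketched below under assumed field names and an illustrative unit list, is to have validation return its reasons and log them while leaving the accept or reject decision unchanged. The actual rules in the existing validation module are not shown in this post, so the checks here are placeholders.

```python
import logging

logger = logging.getLogger("extraction.validation")

# Illustrative unit list only; the real allowed units live in the project's schema.
KNOWN_UNITS = {"mg/L", "mm/hr", "mmol/L"}


def validate(row):
    """Return a list of rejection reasons; an empty list means the row is accepted."""
    reasons = []
    if not row.get("test"):
        reasons.append("no canonical test name")
    if not isinstance(row.get("value"), (int, float)):
        reasons.append("value is not numeric")
    if row.get("unit") not in KNOWN_UNITS:
        reasons.append(f"unknown unit: {row.get('unit')!r}")
    low, high = row.get("low"), row.get("high")
    if low is not None and high is not None and low > high:
        reasons.append("reference range inverted")
    return reasons


def validate_and_log(row, source_line):
    """Same accept/reject outcome as validate(), but every rejection is logged."""
    reasons = validate(row)
    for reason in reasons:
        logger.warning("REJECTED %r: %s", source_line, reason)
    return not reasons
```

With logging enabled, a rejected CRP line would then leave one warning per failed rule in the debug log.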

All other panels extract correctly. The issue appears isolated.

What Is Required

This is not a rebuild.

We do not want:

• Re-architecture
• Experimental AI guessing logic
• Large-scale changes
• Expanded fuzzy matching

We need:

1. Precise Diagnosis

Identify exactly where CRP, ESR, and GLU are failing insertion and which rule is causing rejection.
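
A diagnosis of this kind can be approached by pushing one known-bad line through each stage in order and reporting the first stage that returns nothing. The sketch below uses toy stand-in stages. The strict numeric stage is deliberately written to break on a flag token to show what the trace output looks like; it is an illustration of one hypothesis, not a finding about the real engine.

```python
import re

CANONICAL = {"CRP", "ESR", "GLU"}  # stand-in for the canonical lab dictionary


def normalise(line):
    """Collapse runs of whitespace."""
    return " ".join(line.split())


def match_canonical(line):
    """Return (name, line) if the first token is a known test, else None."""
    token = line.split(" ", 1)[0].upper()
    return (token, line) if token in CANONICAL else None


def extract_number_strict(matched):
    """Deliberately strict: expects the number directly after the name."""
    name, line = matched
    m = re.match(r"^\S+\s+(\d+(?:\.\d+)?)", line)
    return (name, float(m.group(1))) if m else None


def trace_line(line, stages):
    """Run stages in order; return (first_failing_stage_or_None, last_good_value)."""
    value = line
    for name, stage in stages:
        result = stage(value)
        if result is None:
            return name, value
        value = result
    return None, value


STAGES = [
    ("normalise", normalise),
    ("canonical match", match_canonical),
    ("numeric extraction", extract_number_strict),
]

print(trace_line("CRP H 5.2 mg/L 0-5", STAGES))
```

Swapping the stand-ins for the engine's real functions would name the exact stage at which CRP, ESR, and GLU lines drop out.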

2. Minimal Safe Fix

Implement a targeted correction that:

• Adjusts canonical matching if required
• Anchors numeric extraction correctly
• Allows flag tokens without blocking capture
• Relaxes only necessary validation checks
• Preserves deterministic behaviour

3. Zero Regression

• No impact on currently working panels
• No performance degradation
• No uncontrolled fuzzy expansion
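
One way to check the zero-regression requirement is a golden-case comparison: record the current output for lines that already extract correctly, then confirm the output is identical after the fix. The sketch below assumes a parse function that takes one OCR line and returns a dict. parse_stub and the sample line are made up for illustration.

```python
def find_regressions(parse, golden_cases):
    """Return (line, expected, actual) for every case whose output changed."""
    failures = []
    for line, expected in golden_cases:
        actual = parse(line)
        if actual != expected:
            failures.append((line, expected, actual))
    return failures


def parse_stub(line):
    """Stand-in parser: first token is the test name, second is the value."""
    parts = line.split()
    return {"test": parts[0], "value": float(parts[1])}


# Made-up golden case; real cases would be captured from panels that work today.
GOLDEN = [
    ("HB 140 g/L 130-170", {"test": "HB", "value": 140.0}),
]

regressions = find_regressions(parse_stub, GOLDEN)
print(f"{len(regressions)} regression(s)")
```

Because the engine is deterministic, any difference in this comparison points to a behaviour change introduced by the fix.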

4. Modular Implementation

If appropriate:

• Implement as a small isolated module, or
• Cleanly adjust the matching block

The existing architecture should remain intact.

Constraints

The system is designed to be:

• Deterministic
• Schema-driven
• Reproducible
• Forensic-grade

We cannot introduce probabilistic or unpredictable behaviour.

Longer-Term Goal

After stabilising extraction:

• Migrate to web deployment
• Enable structured uploads
• Add trend analysis
• Later incorporate AI-assisted interpretation

Immediate priority:

Stabilise deterministic extraction for CRP, ESR, and GLU without breaking the existing engine.

Materials Provided

Uploaded:

• Full extraction_core_2.py (text format)
• Screenshot of HTML viewer
• Sample PDF files
• Export showing required output

Additional materials available on request:

• Sample OCR blocks
• Canonical dictionary entries
• Regex patterns
• Validation logic
• Database schema
• Debug logs

This is a focused debugging and refinement request. I have spent many hours attempting to isolate the issue and now require an experienced developer to identify the blocking condition and implement a practical fix.

I have been advised this should take 1–2 hours for a senior developer.

Looking for a swift turnaround.
Project ID: 3472184