The Ghost Rows Mystery: A Tale of Excel Import Optimization
The Problem That Started It All
It was supposed to be a straightforward feature: allow users to import orders from Excel files into our order management system. Upload a file, parse it, validate the data, and save it to the database. Simple, right?
Wrong.
Our users started complaining that when they uploaded their Excel files, the entire browser would freeze for several seconds—sometimes more than 10 seconds. The UI became completely unresponsive. No loading spinners, no progress bars, just a frozen screen that made users think the application had crashed.
This wasn’t just a minor inconvenience. In a business environment where users might import hundreds of orders at once, a 10+ second freeze was unacceptable.
The First Fix: Moving to Web Workers
The obvious solution to UI blocking was to move the Excel parsing off the main thread. I implemented a Web Worker to handle the file processing:
1
2
3
4
5
6
7
8
9
10
// Main thread stays responsive
const worker = new Worker(new URL('./excelParser.worker', import.meta.url));
worker.postMessage({ file });
// Worker does the heavy lifting
self.onmessage = async (e) => {
const workbook = XLSX.read(arrayBuffer);
const jsonData = XLSX.utils.sheet_to_json(worksheet);
// ... process data
};
Problem solved! Or so I thought.
The UI no longer froze, and users could see a nice “Processing…” spinner. But there was still something wrong—the processing itself was taking way too long. For some files, users had to wait 15+ seconds staring at a spinner before seeing any results.
The Investigation Begins
I decided to add detailed performance logging to understand where the time was going:
1
2
3
4
console.log('[Excel Worker] Convert ArrayBuffer to binary string:', time1);
console.log('[Excel Worker] Parse Excel file:', time2);
console.log('[Excel Worker] Convert worksheet to JSON:', time3);
console.log('[Excel Worker] Process rows:', time4);
I tested with three different files from our users:
File 1: A 2.88MB file from a user who complained about slow processing
File 2: A typical small file with 36 orders
File 3: A test file with just 10 orders
The results were… confusing:
1
2
3
File 1: 2.88MB → Processing time: 13.4 seconds
File 2: 0.01MB → Processing time: 0.02 seconds
File 3: 0.01MB → Processing time: 0.01 seconds
The small files were lightning fast. But that 2.88MB file? Over 13 seconds. That seemed excessive for a file that wasn’t even that large.
The Breakthrough: Reading the Logs
I dove deeper into the performance breakdown:
1
2
3
4
[Excel Worker] Convert ArrayBuffer to binary string: 60.70ms
[Excel Worker] Parse Excel file: 456.00ms
[Excel Worker] Convert worksheet to JSON: 12,836.90ms ← 😱
[Excel Worker] Process rows: 19.50ms
Wait. 12.8 seconds out of 13.4 seconds was spent in the sheet_to_json() call? That’s 96% of the total time!
But here’s what made me dig deeper—when I looked at the actual data being processed:
1
2
Total rows parsed: 1,048,562
Valid rows after filtering: 1,581
1,048,562 rows?! But the user only imported 1,581 orders!
The “Aha!” Moment
I opened the Excel file manually and scrolled down. Row 1,582… empty. Row 1,583… empty. Row 10,000… empty. Row 100,000… still empty!
But Excel’s row counter at the bottom showed: 1,048,576 rows.
That number looked familiar. A quick Google search confirmed it: 1,048,576 is the maximum number of rows in an Excel spreadsheet.
The file was marked as having over a million rows, but only the first ~1,500 contained actual data. The remaining 1,047,000+ rows were “ghost rows”—empty rows that existed in the file’s metadata but contained no actual data.
And our code was dutifully trying to convert all 1,048,562 rows to JSON.
Understanding the Root Cause
Excel files can accumulate ghost rows in several ways:
- Accidental formatting: A user selects “all cells” and applies formatting, marking millions of empty rows as “used”
- Copy-paste operations: Copying large ranges extends the file’s “used range” even if most cells are empty
- Deleted data: Data gets deleted but the file structure retains the row references
- Template files: Pre-formatted templates with empty rows already marked as “used”
In our case, the user had likely copied data from another source, pasted it into a template, and Excel marked the entire range as used—all 1 million+ rows.
Why This Destroys Performance
When you call XLSX.utils.sheet_to_json(), the library processes every row in the file’s range, not just rows with data:
1
2
3
4
5
6
7
// This converts ALL 1,048,562 rows to JSON objects
const jsonData = XLSX.utils.sheet_to_json(worksheet);
// Result: An array with 1,048,562 objects (mostly empty)
// Memory: ~500MB
// Time: 12.8 seconds
// Actual useful data: 1,581 rows
The performance breakdown showed:
| Step | Time | Percentage |
|---|---|---|
| Read file | 9.70ms | 0.07% |
| Parse Excel structure | 456ms | 3.41% |
| Convert to JSON | 12,836ms | 96.08% ← The killer |
| Process valid rows | 19.50ms | 0.15% |
Converting a massive JSON array doesn’t scale linearly—it grows exponentially with size because:
- Memory allocation for 1M+ objects
- Iteration over every row
- JSON stringification overhead
- Garbage collection pressure
Designing the Solution
Now that I understood the problem, I needed a solution. But I had to balance several concerns:
- Performance: Don’t process millions of ghost rows
- Usability: Most legitimate imports are < 10,000 orders
- User experience: Show progress, allow cancellation
- Memory: Don’t load millions of rows into memory at once
The Strategy: Three-Pronged Approach
After analyzing the performance data and understanding the problem, I designed a three-part solution:
1. Set a Practical Row Limit
Our business context: users import order data. Even bulk imports rarely exceed a few thousand orders. I set a hard limit:
1
const MAX_ROWS_TO_PROCESS = 10000;
Reasoning:
- Legitimate use cases: 10k rows is generous for order imports
- Ghost row protection: Stops processing at 10k instead of 1M+
- Clear expectations: Users know the limit upfront
- Performance guarantee: Bounded processing time
If a file claims to have 1,048,562 rows, we process the first 10,000 and ignore the rest (which are likely ghost rows anyway).
2. Read in Batches, Not All at Once
The original approach:
1
2
3
// ❌ OLD: Convert entire sheet to JSON (1M+ rows at once)
const jsonData = XLSX.utils.sheet_to_json(worksheet);
// Result: 12.8 seconds, 500MB memory spike
The new approach:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// ✅ NEW: Read in small batches
const BATCH_SIZE = 500;
// First, get just the headers
const headers = XLSX.utils.sheet_to_json(worksheet, {
range: 0, // Just row 0
header: 1
})[0];
// Then read 500 rows at a time
for (let start = 1; start < totalRows; start += BATCH_SIZE) {
const batch = XLSX.utils.sheet_to_json(worksheet, {
range: start, // Read from row 'start'
header: headers
});
// Process this batch immediately
const validRows = batch.filter(isValidRow);
sendBatchToMainThread(validRows);
}
Why this works:
- Smaller JSON conversions: 500 rows instead of 1M+
- Memory efficient: Process and discard each batch
- Progressive: Send data as it’s ready
- Interruptible: Can cancel between batches
3. Progressive Status with Cancel Option
Users shouldn’t stare at a generic “Processing…” spinner for 10+ seconds. They need to know:
- What’s happening right now
- How much progress has been made
- The ability to cancel if it’s taking too long
I implemented three distinct states:
Stage 1: Reading File
1
2
🔄 Đang đọc file...
[Cancel Button]
Stage 2: Parsing Excel
1
2
🔄 Đang xử lý 2,500 / 10,000 dòng hợp lệ (25%)
[Cancel Button]
Stage 3: Complete
1
✅ Import thành công 1,581 đơn hàng
The cancel button terminates the worker and clears all accumulated data:
1
2
3
4
5
const handleCancel = () => {
worker.terminate(); // Kill the worker
allData = []; // Clear received data
setProgress(null); // Hide progress UI
};
Implementation: The Technical Details
Now for the fun part—actually building this thing. Here’s how I implemented each piece:
Worker Architecture
The worker needs to:
- Detect ghost rows early by checking total rows vs the limit
- Read headers first to get column structure
- Process rows in batches
- Send progress updates
- Respect the 10k row limit
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
// excelParser.worker.ts
const MAX_ROWS_TO_PROCESS = 10000;
const BATCH_SIZE = 500;
self.onmessage = async (e: MessageEvent<WorkerMessage>) => {
const { file } = e.data;
const startTime = performance.now();
try {
// Step 1: Read file
const arrayBuffer = await file.arrayBuffer();
const fileSizeMB = (file.size / (1024 * 1024)).toFixed(2);
// Step 2: Parse Excel structure (not the data yet!)
const workbook = XLSX.read(arrayBuffer, { type: 'array' });
const worksheet = workbook.Sheets[workbook.SheetNames[0]];
// Step 3: Check total rows from metadata
const range = XLSX.utils.decode_range(worksheet['!ref'] || 'A1');
const totalRows = range.e.r + 1;
// Step 4: Apply row limit
const rowsToProcess = Math.min(totalRows, MAX_ROWS_TO_PROCESS);
if (totalRows > MAX_ROWS_TO_PROCESS) {
console.warn(
`⚠️ File có ${totalRows.toLocaleString()} dòng (có thể là ghost rows), ` +
`chỉ xử lý ${MAX_ROWS_TO_PROCESS.toLocaleString()} dòng đầu`
);
}
// Notify main thread we're starting to parse
self.postMessage({
type: 'parse_start',
totalRows: rowsToProcess
});
// Step 5: Read headers only (row 0)
const headerData = XLSX.utils.sheet_to_json(worksheet, {
range: 0,
header: 1
});
const headers = headerData[0] as string[];
// Step 6: Process in batches
let processedCount = 0;
for (let startRow = 1; startRow < rowsToProcess; startRow += BATCH_SIZE) {
const endRow = Math.min(startRow + BATCH_SIZE - 1, rowsToProcess - 1);
// Read just this batch
const batchData = XLSX.utils.sheet_to_json(worksheet, {
range: startRow,
header: headers,
defval: null
});
// Validate and filter
const validRows = batchData
.map((row: any) => parseOrderRow(row))
.filter((row): row is ParsedOrder => row !== null);
processedCount += batchData.length;
// Send batch to main thread
self.postMessage({
type: 'batch',
data: validRows,
progress: {
current: processedCount,
total: rowsToProcess - 1 // -1 for header
}
});
}
// Step 7: Complete
const totalTime = performance.now() - startTime;
self.postMessage({
type: 'complete',
totalTime,
totalRows: processedCount
});
} catch (error) {
self.postMessage({
type: 'error',
error: error instanceof Error ? error.message : 'Unknown error'
});
}
};
Main Thread Handler
The main thread needs to:
- Show appropriate status for each stage
- Accumulate batches as they arrive
- Update progress bar
- Handle cancellation
- Clean up when done
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
// excelImportService.ts
export const processExcelFile = async (
file: File,
onProgress: (status: ImportStatus) => void,
cancelSignal: { cancelled: boolean }
): Promise<ParsedOrder[]> => {
return new Promise((resolve, reject) => {
const worker = new Worker(
new URL('../workers/excelParser.worker', import.meta.url)
);
let allData: ParsedOrder[] = [];
let totalRows = 0;
// Stage 1: Reading file
onProgress({
stage: 'reading',
message: 'Đang đọc file...'
});
worker.onmessage = (e) => {
const { type, data, progress, totalRows: total, error } = e.data;
// Check if cancelled
if (cancelSignal.cancelled) {
worker.terminate();
reject(new Error('Import cancelled by user'));
return;
}
switch (type) {
case 'parse_start':
// Stage 2: Parsing started
totalRows = total;
onProgress({
stage: 'parsing',
message: `Đang xử lý file...`,
total: totalRows
});
break;
case 'batch':
// Accumulate data
allData = [...allData, ...data];
// Update progress
onProgress({
stage: 'parsing',
message: `Đang xử lý ${progress.current} / ${progress.total} dòng`,
current: progress.current,
total: progress.total,
percentage: Math.round((progress.current / progress.total) * 100)
});
break;
case 'complete':
// Stage 3: Complete
worker.terminate();
onProgress({
stage: 'complete',
message: `Import thành công ${allData.length} đơn hàng`
});
resolve(allData);
break;
case 'error':
worker.terminate();
reject(new Error(error));
break;
}
};
worker.onerror = (error) => {
worker.terminate();
reject(error);
};
// Start processing
worker.postMessage({ file });
});
};
UI Component
The React component ties it all together:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
const ExcelImportDialog = () => {
const [status, setStatus] = useState<ImportStatus | null>(null);
const cancelSignal = useRef({ cancelled: false });
const handleFileUpload = async (file: File) => {
cancelSignal.current.cancelled = false;
try {
const orders = await processExcelFile(
file,
setStatus,
cancelSignal.current
);
// Save to database...
await saveOrders(orders);
} catch (error) {
if (error.message === 'Import cancelled by user') {
toast.info('Import đã bị huỷ');
} else {
toast.error('Lỗi khi import: ' + error.message);
}
} finally {
setStatus(null);
}
};
const handleCancel = () => {
cancelSignal.current.cancelled = true;
};
return (
<div>
{status && (
<div className="import-progress">
<Spinner />
<div className="status-message">
{status.message}
</div>
{status.percentage && (
<div className="progress-bar">
<div
className="progress-fill"
style={{ width: `${status.percentage}%` }}
/>
<span>{status.percentage}%</span>
</div>
)}
<button onClick={handleCancel} className="cancel-button">
Huỷ
</button>
</div>
)}
</div>
);
};
The Results: Before and After
Performance Comparison
| Metric | Before | After | Improvement |
|---|---|---|---|
| File with ghost rows (2.88MB, 1M+ rows) | 13.4s | 0.2s | 67x faster |
| Memory usage | ~500MB | ~15MB | 33x less |
| UI blocking | Yes (frozen) | No (responsive) | ✅ Fixed |
| Progress visibility | None | Real-time with % | ✅ Added |
| Cancellable | No | Yes | ✅ Added |
Real-World Impact
Before:
1
2
3
4
5
6
User uploads file with ghost rows
→ UI freezes for 13+ seconds
→ No indication of progress
→ Can't cancel
→ User thinks app crashed
→ Bad experience
After:
1
2
3
4
5
6
7
User uploads file with ghost rows
→ UI stays responsive
→ Shows "Processing 2,500 / 10,000 rows (25%)"
→ Can cancel anytime
→ Processes in < 1 second
→ Clear success message
→ Great experience
The Detailed Performance Breakdown
For a file with ghost rows (1,048,562 total rows, 1,581 valid):
Before optimization:
| Step | Time | % of Total |
|---|---|---|
| Read file | 9.7ms | 0.07% |
| Parse Excel metadata | 456ms | 3.41% |
| Convert to JSON | 12,837ms | 96.08% |
| Validate rows | 19.5ms | 0.15% |
| Total | 13,380ms | 100% |
After optimization:
| Step | Time | % of Total |
|---|---|---|
| Read file | 9.7ms | 4.5% |
| Parse Excel metadata | 156ms | 72.2% |
| Convert headers to JSON | 2ms | 0.9% |
| Convert batches to JSON (10k rows) | 42ms | 19.4% |
| Validate rows | 7ms | 3.2% |
| Total | ~216ms | 100% |
The bottleneck (JSON conversion) went from 12.8 seconds to 42 milliseconds.
Key Learnings and Takeaways
1. File Metadata Can Lie
Never trust file metadata blindly. Excel files can claim to have 1M+ rows when they only contain hundreds of actual rows. Always validate and set practical limits based on your use case.
2. Batched Processing > Bulk Conversion
Converting large datasets to JSON in one operation is an anti-pattern:
- Memory spikes exponentially
- No progress indication
- Can’t be cancelled
- Slower overall (JSON stringification doesn’t scale linearly)
Batched processing gives you:
- Constant memory usage
- Real-time progress
- Cancellation points
- Often faster (smaller JSON operations are more efficient)
3. Performance Logging Is Essential
Without detailed timing logs, I never would have identified that 96% of the time was spent in sheet_to_json(). Always instrument your code when investigating performance issues.
4. Context Matters for Optimization
The 10k row limit works for our order import use case. Your limit might be different based on:
- What kind of data you’re processing
- Your users’ typical workflows
- Your memory constraints
- Your performance requirements
Don’t cargo-cult solutions—understand your context and adapt accordingly.
5. User Experience Is More Than Just Speed
Even with the optimization, a large file might take a second or two. But the experience is dramatically better because:
- UI stays responsive (Web Worker)
- User sees progress (percentage bar)
- User has control (cancel button)
- User knows what’s happening (clear status messages)
Sometimes perception is as important as actual performance.
Recommendations for Similar Implementations
If you’re building Excel import functionality, here’s my advice:
1. Set Practical Row Limits
1
2
3
4
5
6
// Don't do this
const data = XLSX.utils.sheet_to_json(worksheet); // Unlimited!
// Do this
const MAX_ROWS = 10000; // Based on your use case
const totalRows = Math.min(range.e.r + 1, MAX_ROWS);
2. Always Use Web Workers for File Processing
1
2
// Any operation that might take > 50ms should be in a worker
const worker = new Worker(new URL('./fileProcessor.worker', import.meta.url));
3. Provide Progress and Control
1
2
3
4
5
6
// Show what's happening
onProgress({ message: 'Processing 2,500 / 10,000 rows (25%)' });
// Let users cancel
cancelSignal.current.cancelled = true;
worker.terminate();
4. Read in Batches
1
2
3
4
5
6
// Not all at once
const BATCH_SIZE = 500;
for (let start = 0; start < total; start += BATCH_SIZE) {
const batch = readBatch(start, BATCH_SIZE);
processBatch(batch);
}
5. Validate Early and Often
1
2
3
4
5
6
7
8
9
// Check file size before processing
if (file.size > 50 * 1024 * 1024) { // 50MB
throw new Error('File quá lớn');
}
// Check row count after parsing metadata
if (totalRows > MAX_ROWS) {
console.warn(`Chỉ xử lý ${MAX_ROWS} dòng đầu`);
}
Conclusion
The journey from “simple Excel import” to “optimized, user-friendly Excel import” taught me a valuable lesson: performance issues often have surprising root causes.
I started with a UI freeze problem, fixed it with Web Workers, then discovered a deeper issue with ghost rows consuming 96% of processing time. The solution wasn’t to optimize the JSON conversion itself—it was to avoid converting millions of ghost rows in the first place.
By setting a practical row limit (10k), reading in batches (500 rows), and providing clear progress indicators, we transformed a 13-second operation into a sub-second one, reduced memory usage by 97%, and dramatically improved the user experience.
The key insight: Don’t process more data than necessary. Excel files can lie about their row counts, and bulk operations don’t scale. Read incrementally, validate continuously, and always give users visibility and control.
Now when users upload files with ghost rows, instead of a 13-second freeze, they see:
1
2
🔄 Đang xử lý 10,000 / 10,000 dòng (100%)
✅ Import thành công 1,581 đơn hàng
And that makes all the difference.
Technical Specifications
- Stack: TypeScript, React, Web Workers, XLSX (SheetJS)
- Row Limit: 10,000 rows (configurable constant)
- Batch Size: 500 rows (configurable constant)
- Target Processing Time: < 500ms for typical files
- Memory Target: < 20MB for worker operations
- UI States: 3 (Reading, Parsing with progress, Complete)
- Cancellable: Yes, at any point