pdf_text_extraction 2.0.0

pdf_text_extraction #

Dart bindings and convenience wrappers around a fork of xpdf for extracting text and metadata from PDF files. The native libraries are available for Linux and Windows only.

ℹ️ The project depends on a fork of xpdf maintained at https://github.com/insinfo/xpdf.

Platform requirements #

  • Windows: ship the compiled pdftotext.dll and TextExtraction.dll alongside your executable.
  • Linux: ensure the GNU C++ runtime (libstdc++6) is installed before using the package. On Debian or Ubuntu:
sudo apt-get install libstdc++6
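
Because a missing native library only surfaces as a load failure at runtime, it can help to check for the files up front. The following is a minimal sketch, not part of the package: the helper names expectedNativeLibraries and missingNativeLibraries are hypothetical, the Windows file names come from the requirement above, and the Linux file name pdftotext.so is assumed from the low-level example in this README.

```dart
import 'dart:io';

/// Native library file names expected per platform.
/// Hypothetical helper; the Linux name is an assumption based on the README.
List<String> expectedNativeLibraries({required bool isWindows}) =>
    isWindows ? ['pdftotext.dll', 'TextExtraction.dll'] : ['pdftotext.so'];

/// Returns the expected libraries that are absent from [directory].
List<String> missingNativeLibraries(Directory directory,
    {required bool isWindows}) {
  final separator = Platform.pathSeparator;
  return expectedNativeLibraries(isWindows: isWindows)
      .where((name) => !File('${directory.path}$separator$name').existsSync())
      .toList();
}

void main() {
  // Check the working directory, where the examples load the library from.
  final missing =
      missingNativeLibraries(Directory.current, isWindows: Platform.isWindows);
  if (missing.isEmpty) {
    print('All native libraries found.');
  } else {
    print('Missing native libraries: ${missing.join(', ')}');
  }
}
```

This only verifies that the files exist in one directory; it does not confirm that they load correctly or that libstdc++6 is installed.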

Getting started #

Add the package as a dependency and ensure the native libraries are available on the execution path or in the working directory. Two APIs are exposed:

  1. Low-level bindings generated by package:ffigen, mirroring the C API.
  2. High-level wrappers that take care of memory management and validation.

Low-level usage #

import 'dart:ffi';
import 'dart:io' show Directory, Platform;

import 'package:ffi/ffi.dart';
import 'package:path/path.dart' as path;
import 'package:pdf_text_extraction/pdf_text_extraction.dart';
import 'package:pdf_text_extraction/src/pdf_to_text_bindings.dart';

// Receives log messages emitted by the native library.
void logCallback(Pointer<Int8> msg) {
  print(nativeInt8ToString(msg));
}

void main() {
  // Load the native library from the working directory.
  final libraryName = Platform.isLinux ? 'pdftotext.so' : 'pdftotext.dll';
  final libraryPath = path.join(Directory.current.path, libraryName);
  final dylib = DynamicLibrary.open(libraryPath);
  final pdfLib = PDFToTextBindings(dylib);

  // Input PDF file.
  final uriPointer = stringToNativeInt8('pdf_file.pdf', allocator: calloc);
  // Character encoding of the output text.
  final textOutEnc = stringToNativeInt8('UTF-8', allocator: calloc);
  // Text layout mode.
  final layout = stringToNativeInt8('rawOrder', allocator: calloc);
  // Native callback used to print log messages.
  final lgf = Pointer.fromFunction<Void Function(Pointer<Int8>)>(logCallback);
  // Receives a pointer to the extracted text.
  final Pointer<Pointer<Int8>> textOut = calloc();

  try {
    // Run the extraction; a return value of 0 indicates success.
    final result = pdfLib.extractText(
        uriPointer, 1, 1, textOutEnc, layout, textOut, lgf, nullptr, nullptr);

    if (result == 0) {
      // Read the output only after a successful call.
      print('result ok: ${nativeInt8ToString(textOut.value)}');
    } else {
      print('error on text extraction');
    }
  } finally {
    // Release every buffer allocated on the Dart side, including layout.
    calloc.free(uriPointer);
    calloc.free(textOutEnc);
    calloc.free(layout);
    calloc.free(textOut);
  }
}

High-level usage #

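import 'package:pdf_text_extraction/pdf_text_extraction.dart';

void main() {
  // The wrapper handles native memory management and validation.
  final wrapper = PDFToTextWrapping();
  final text = wrapper.extractText(
    'pdf_file.pdf',
    startPage: 1,
    endPage: 1,
  );
  print('result: $text');
}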

PDFToTextWrapping also exposes getPagesCount and reports any native errors through the static lastError property.

Testing #

The repository ships with unit and integration tests. To run the integration tests, you must have a fixture PDF (for example 1417.pdf) and the native libraries in the root of the project.

dart test

Regenerating bindings #

If you need to regenerate the FFI bindings after updating the native headers, run:

dart run ffigen --config ffigen.yaml

Dependencies

ffi, path
