Converting PDF to images with Python 3

Published 2020-03-12 11:13


 

Installation:

    apt-get install python-poppler
    apt install poppler-utils
    pip3 install pdfminer.six
    pip3 install pdf2image
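
pdf2image does not render pages itself; it calls the `pdftoppm` and `pdfinfo` command-line tools that ship with poppler-utils. A minimal sketch for checking that both are on the PATH before running the script below (the helper name `missing_poppler_tools` is made up for illustration and is not part of any library):

```python
import shutil


def missing_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Return the names of the given command-line tools that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Missing poppler tools:", ", ".join(missing))
    else:
        print("poppler tools found")
```

If either tool is reported missing, the `poppler-utils` install step is the one to revisit.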

 

pdf_decompose.py

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    import io
    import os
    import sys
    import time

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.layout import LAParams
    from pdfminer.layout import LTText
    from pdfminer.converter import PDFPageAggregator
    from pdf2image import convert_from_path, convert_from_bytes


    class PDFDecompose(object):
        """
        Convert a PDF file into images.
        """

        def __init__(self):
            pass

        def decompose_from_bytes(self, file_bytes, dpi=96):
            """
            :param file_bytes: content of the PDF file as bytes
            :param dpi: rendering resolution
            :return: list of PIL images in RGB format, or None on failure
            """
            try:
                images = convert_from_bytes(file_bytes, dpi=dpi)
                return images
            except Exception as e:
                # gl.log.error('PDF Decompose from byte fail, error: {}'.format(str(e)))
                return None

        def decompose_from_file(self, file_name, check_content=False):
            """
            :param file_name: path of the PDF file on disk
            :param check_content: if True, first check whether the PDF is mostly
                text or mostly images, and convert only image PDFs
            :return: list of JPEG-encoded bytes, one per page, or None on failure
            """
            time_start = time.time()
            if check_content:
                try:
                    with open(file_name, 'rb') as fp:
                        parser = PDFParser(fp)
                        document = PDFDocument(parser)
                        if not document.is_extractable:
                            # The document does not allow content extraction.
                            self._log_helper('fail, can not extract', time_start)
                            return None
                        rsrcmgr = PDFResourceManager()
                        laparams = LAParams()
                        # Create a PDF page aggregator object.
                        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
                        interpreter = PDFPageInterpreter(rsrcmgr, device)
                        if self._is_image_file(device, document, interpreter):
                            images = self._to_images(file_name)
                            num_pages = len(images)
                            self._log_helper('success, pages: {0}'.format(num_pages), time_start)
                            return images
                        self._log_helper('fail, no image content', time_start)
                        return None
                except Exception as e:
                    # Missing file, unreadable file, or a parse/convert error.
                    self._log_helper('fail, file io error, {0}'.format(file_name), time_start)
                    return None
            else:
                images = self._to_images(file_name)
                num_pages = len(images)
                self._log_helper('success, pages: {0}'.format(num_pages), time_start)
                return images

        def _is_image_file(self, device, document, interpreter):
            """
            Check the first ten pages of the PDF. If image pages are the
            majority (more than 80%), treat the file as an image PDF.
            :param device: PDFPageAggregator that collects each page layout
            :param document: parsed PDFDocument
            :param interpreter: PDFPageInterpreter bound to the device
            :return: True if the PDF is considered an image PDF
            """
            pages = PDFPage.create_pages(document)
            page_count = 0
            image_page_count = 0
            for i, page in enumerate(pages):
                # Stop after ten pages (indices 0..9).
                if i >= 10:
                    break
                page_count += 1
                interpreter.process_page(page)
                # Receive the LTPage object for the page.
                layout = device.get_result()
                if not self._is_text_page(layout):
                    image_page_count += 1
            if page_count <= 0:
                return True
            # True division: floor division (//) would yield only 0 or 1.
            if image_page_count / page_count > 0.8:
                return True
            return False

        def _is_text_page(self, page):
            """
            Check the first ten objects on the page. If text objects are the
            majority (more than 80%), treat the page as a text page.
            :param page: LTPage layout object
            :return: True if the page is considered a text page
            """
            object_count = len(page._objs)
            if object_count <= 0:
                return False
            if object_count > 10:
                object_count = 10
            text_line_count = 0
            for j, obj in enumerate(page._objs):
                # Look at exactly object_count objects (indices 0..object_count-1).
                if j >= object_count:
                    break
                if isinstance(obj, LTText):
                    text_line_count += 1
            # True division: floor division (//) would yield only 0 or 1.
            if text_line_count / object_count > 0.8:
                return True
            return False

        def _to_images(self, file_name):
            # Render every page at 96 dpi and encode each one as JPEG bytes.
            images = convert_from_path(file_name, dpi=96)
            result = []
            for image in images:
                byteArray = io.BytesIO()
                image.save(byteArray, format='JPEG')
                result.append(byteArray.getvalue())
            return result

        def _log_helper(self, log_content, start_time_point):
            time_end = time.time()
            consume = time_end - start_time_point
            # The original logger call is not shown in the source;
            # print is used here as a stand-in.
            print('{0}, time: {1:.3f}s'.format(log_content, consume))
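
Both `_is_image_file` and `_is_text_page` come down to the same question: does the share of matching items in a small sample exceed 0.8? A minimal standalone sketch of that test, with no pdfminer dependency (the function name `exceeds_share` and its parameters are hypothetical and not part of the script; returning False for an empty sample is also an assumption). It uses true division, because floor division (`//`) gives 0 for any share below 1, so the threshold would only pass at 100%:

```python
def exceeds_share(matching, total, threshold=0.8):
    """Return True when matching/total is strictly greater than threshold."""
    if total <= 0:
        # Assumption for this sketch: an empty sample never passes.
        return False
    return matching / total > threshold


print(exceeds_share(9, 10))   # share is 0.9, above the threshold
print(exceeds_share(8, 10))   # share is exactly 0.8, not strictly above it
print(9 // 10 > 0.8)          # floor division turns 0.9 into 0
```

Nine text objects out of ten pass the check with `/`, while the same counts fail with `//`.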

 

test_pdf_to_images.py

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    import os
    import sys
    import unittest

    from pdf_decompose import PDFDecompose


    class PDFDecomposeTestCase(unittest.TestCase):
        def setUp(self):
            self.decomposer = PDFDecompose()

        def test_pdf_decompose_image(self):
            pdf_file_path = './decompose.pdf'
            images = self.decomposer.decompose_from_file(pdf_file_path,
                                                         check_content=False)
            for i, image in enumerate(images):
                image_path = os.path.join("./", 'decompose_{0}.jpg'.format(i))
                with open(image_path, 'wb') as f:
                    f.write(image)
            image_count = len(images)
            self.assertEqual(image_count, 2)


    if __name__ == '__main__':
        unittest.main()
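
`decompose_from_file` returns each page as JPEG-encoded bytes, which is why the test can write them with a plain binary `open`. A small sketch of the same write loop as a reusable helper, using only the standard library (the name `save_pages` and its `prefix` parameter are illustrative and not from the original code):

```python
import os


def save_pages(pages, out_dir, prefix="decompose"):
    """Write each bytes object in pages to out_dir as <prefix>_<index>.jpg.

    Returns the list of paths written, in page order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, data in enumerate(pages):
        path = os.path.join(out_dir, "{0}_{1}.jpg".format(prefix, i))
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths
```

Pages returned by `decompose_from_bytes` are PIL images rather than bytes, so they would need `image.save(...)` instead of this helper.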

 

Original article: https://blog.csdn.net/qq_14845119/article/details/104798552



Author: 我Lovepython

Link: https://www.pythonheidong.com/blog/article/253986/9815cdd74b472c0df285/

Source: python黑洞网

Any reproduction must credit the source. Infringement, once discovered, will be pursued under the law.
