pytesseract_Test

Install required modules

In [1]:
#!pip install pillow
In [2]:
# !pip install pytesseract
In [3]:
# PIL cannot be installed under that name; the pillow package above provides it.
In [4]:
# !pip install pdf2image

Import required modules

In [5]:
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
import re
from IPython.display import Image as image

Module initialization

In [6]:
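# Pillow and pytesseract were already imported in In [5].
# Point pytesseract at the Tesseract executable (default Windows install path).
# In a raw string the backslashes must not be doubled.
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'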
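
The hard-coded path only works on a Windows machine with Tesseract in its default location. As a minimal sketch, the executable could be located first and assigned only if it exists. The helper names `find_tesseract` and `configure_pytesseract` and the candidate path are illustrative assumptions, not part of the original notebook.

```python
import os
import shutil


def find_tesseract(candidates=()):
    """Return the first usable Tesseract path, or None if none is found.

    Hypothetical helper: checks PATH first, then the explicit candidates.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(candidates=()):
    """Set pytesseract's tesseract_cmd, or raise if no executable is found."""
    path = find_tesseract(candidates)
    if path is None:
        raise FileNotFoundError("Tesseract executable not found")
    # Imported lazily so find_tesseract stays usable without pytesseract.
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract([r'C:\Program Files\Tesseract-OCR\tesseract.exe'])` (an assumed install path) would then fail early with a clear message instead of failing later inside the first OCR call.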

Load the file

In [7]:
PDF_file = 'CV.pdf'
In [8]:
pages = convert_from_path(PDF_file, 500)
In [9]:
image_counter = 1

Convert the PDF pages to images

In [10]:
for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
In [11]:
filelimit = image_counter-1
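
The conversion steps above can also be wrapped in one function that returns the generated file names, so no separate counter or `filelimit` is needed. This is only a sketch: the helper names, the zero-padded naming scheme, the PNG format, and the 300 dpi default are assumptions, not choices made in the original notebook.

```python
def page_filename(prefix, number, ext="png", width=3):
    """Build a zero-padded page file name, e.g. page_001.png."""
    return f"{prefix}_{number:0{width}d}.{ext}"


def pdf_to_images(pdf_path, prefix="page", dpi=300):
    """Render each PDF page to a PNG file and return the file names."""
    # Imported lazily: pdf2image also needs the poppler utilities installed.
    from pdf2image import convert_from_path

    filenames = []
    pages = convert_from_path(pdf_path, dpi=dpi)
    for number, page in enumerate(pages, start=1):
        filename = page_filename(prefix, number)
        page.save(filename, "PNG")  # PNG is lossless, unlike JPEG
        filenames.append(filename)
    return filenames
```

Zero-padded names sort correctly past page 9, and a lossless format avoids adding JPEG compression artifacts before OCR.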

Recognize the images and extract text

In [12]:
outfile = "out_text.txt"
In [13]:
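# Append the OCR text of every page image to the output file.
# Note: mode "a" appends, so re-running this cell duplicates the text.
with open(outfile, "a", encoding='utf-8-sig') as f:
    for i in range(1, filelimit + 1):
        filename = "page_"+str(i)+".jpg"
        text = pytesseract.image_to_string(Image.open(filename))
        # Join words that were hyphenated across a line break.
        text = text.replace('-\n', '')
        f.write(text)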
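
The `text.replace('-\n', '')` step removes every hyphen that ends a line, including dashes that are not word breaks (for example a trailing separator dash in a page footer). A more conservative sketch is shown here, under the assumption that a soft hyphen always sits between two lowercase letters. The name `join_hyphenated` is hypothetical.

```python
import re

# A hyphen between a lowercase letter and a newline followed by a lowercase
# letter is treated as a line-break hyphen; everything else is left alone.
_SOFT_HYPHEN = re.compile(r"(?<=[a-z])-\n(?=[a-z])")


def join_hyphenated(text):
    """Rejoin words split across lines without touching other dashes."""
    return _SOFT_HYPHEN.sub("", text)
```

This heuristic still merges genuine compounds that happen to break at the hyphen (such as `model-` / `based`), so it reduces the damage rather than eliminating it.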

Print the results

In [15]:
print(pytesseract.image_to_string(Image.open('page_1.jpg')))
Mitcnell O'Hara-Wila

DATA SCIENTIST
58 Madeleine Rd, Clayton
0+61 408 259421 | SZmail@mitchelloharawild.com | mitchelloharawild | mitchelloharawild | YW mitchoharawild

Education
Monash University Clayton, Australia
BCom (HONS) IN ECONOMETRICS Mar. 2017 - Nov. 2017

« GPA of 3.875, WAM of 86.625
« Best in class for: Advanced statistical modelling (ETC3580), Bayesian time series econometrics (ETC4541), Applied econometrics 2 (ETC4410),
Advanced topics in computational science (FIT4012), Honours Research Project (ETC4860)

Monash University Clayton, Australia

BCOM IN ECONOMETRICS, BSC IN MATHEMATICAL STATISTICS AND COMPUTATIONAL SCIENCE Mar. 2013 - Nov. 2016

* GPA of 3.688, WAM of 85.385

« Mentored in the Access Monash Ambassador Program (2015 and 2016)

« Participated in the Vice-Chancellor’s Ancora Imparo Student Leadership Program (2014)

« Best in class for: Business analytics (ETC3450), Business forecasting (ETC2450), Algorithms and data structures (FIT2004), Time series analysis
for business and economics (ETC3450)

Experience
iSelect Cheltenham, Australia
DATA MINING (INTERNSHIP) Feb. 2015 - Mar. 2015

« Improved business data and issue reporting with interactive visualisations, and model-based anomaly detection.

Coles Rowville, Australia
FRESH PRODUCE Oct. 2010 - Nov. 2015
« Food preparation & display, first aid, staff training and customer assistance.

Monash University Clayton, Australia
RESEARCH ASSISTANT Jan. 2016 - Present

« Supervisors include Rob Hyndman, Dianne Cook, and George Athanasopoulos.
* Consulting projects with DiabetesLab, Tennis Australia, Monash University and Huawei.
« Contributed to the development of numerous open source R packages.

Google Sydney, Australia
STUDENT AMBASSADOR Feb. 2015 - Nov. 2015
« Supported Google’s presence on campus with events and media.

Monash University Caulfield & Clayton, Australia
TEACHING ASSOCIATE S1 2016 - Present

« Advanced statistical modelling (ETC3580)

- Business forecasting (ETF3231/ETF5231)

« Mathematics for business (ETF2700)

« Data modelling and computing (ETC1010)

Rotaract Monash, Australia

VOLUNTEER Feb. 2013 - Nov. 2016

« Development and maintenance of club website and online services.
« Organising and hosting fundraisers.

NOVEMBER, 2018 MITCHELL O’HARA-WILD + CURRICULUM VITAE 1
In [17]:
print(pytesseract.image_to_string(Image.open('page_2.jpg')))
Awards & Achievements
AWARDS

2017 Commerce Dean’s Honour

2016 Commerce Dean’s Commendation
2014-2016 Science Dean’s List

2014 ‘International Institute of Forecasters Award

2013. Rotary Youth Leadership Award

SCHOLARSHIPS

2017 Econometrics Honours Memorial Scholarship

2015 &
2016

2011 & 2012Mitcham Rotary Scholarship

Monash Community Leaders Scholarship

COMPETITIONS

2018 UseR! 2018 Datathon Champion
2017 RMIT SBITL Analytics Competition Champion
2016 RMIT SBITL Analytics Competition Champion

NOVEMBER, 2018 MITCHELL O’HARA-WILD -

CURRICULUM VITAE

Monash
Monash
Monash
[IF
Rotary

Monash
Monash

Rotary

UseR!
RMIT
RMIT

CV analysis

In [18]:
eng_table = pytesseract.image_to_string(Image.open('page_2.jpg'))
In [19]:
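# Note: this splits `text`, not the string assigned in In [18]. `text` still holds
# the last page (page 2) from the In [13] loop, with '-\n' already removed.
eng_table = re.split('\n',text)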
In [20]:
print(eng_table)
['Awards & Achievements', 'AWARDS', '', '2017 Commerce Dean’s Honour', '', '2016 Commerce Dean’s Commendation', '2014-2016 Science Dean’s List', '', '2014 ‘International Institute of Forecasters Award', '', '2013. Rotary Youth Leadership Award', '', 'SCHOLARSHIPS', '', '2017 Econometrics Honours Memorial Scholarship', '', '2015 &', '2016', '', '2011 & 2012Mitcham Rotary Scholarship', '', 'Monash Community Leaders Scholarship', '', 'COMPETITIONS', '', '2018 UseR! 2018 Datathon Champion', '2017 RMIT SBITL Analytics Competition Champion', '2016 RMIT SBITL Analytics Competition Champion', '', 'NOVEMBER, 2018 MITCHELL O’HARA-WILD ', 'CURRICULUM VITAE', '', 'Monash', 'Monash', 'Monash', '[IF', 'Rotary', '', 'Monash', 'Monash', '', 'Rotary', '', 'UseR!', 'RMIT', 'RMIT']
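
Once the page is a list of lines, a small parser can turn it into structured records. The sketch below groups "year title" lines under the section headings that appear in the output above. The function name, the regular expression, and the shortened `sample` list are assumptions made for illustration.

```python
import re

SECTION_NAMES = {"AWARDS", "SCHOLARSHIPS", "COMPETITIONS"}
# A year or year range (2017, 2014-2016), an optional stray '.', then a title
# that must contain at least one letter.
YEAR_ENTRY = re.compile(r"^(\d{4}(?:-\d{4})?)\.?\s+(?=.*[A-Za-z])(.+)$")


def parse_entries(lines):
    """Group 'year title' lines under the most recent section heading."""
    entries = {}
    section = None
    for line in lines:
        line = line.strip()
        if line in SECTION_NAMES:
            section = line
            entries[section] = []
            continue
        match = YEAR_ENTRY.match(line)
        if match and section is not None:
            entries[section].append((match.group(1), match.group(2)))
    return entries


# Shortened copy of the OCR lines printed above.
sample = [
    'Awards & Achievements', 'AWARDS', '', '2017 Commerce Dean’s Honour',
    '2014-2016 Science Dean’s List', '2013. Rotary Youth Leadership Award',
    'SCHOLARSHIPS', '2015 &', '2016', 'COMPETITIONS',
    '2018 UseR! 2018 Datathon Champion',
]
result = parse_entries(sample)
print(result)
```

Lines that OCR split or fused, such as `2015 &` or `2011 & 2012Mitcham Rotary Scholarship`, are skipped or only partly parsed, so the result still needs manual review.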

Reading a Korean file

In [21]:
PDF_file = '입사지원서_정민.pdf'
In [22]:
pages = convert_from_path(PDF_file, 500)
In [23]:
image_counter = 1

Convert the PDF pages to images

In [24]:
for page in pages:
    filename = "hanpage_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
In [25]:
filelimit = image_counter-1

Recognize the images and extract text

In [26]:
outfile = "hanout_text.txt"
In [27]:
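# Append the OCR text of every page image to the output file.
# lang='kor+eng' is needed here as well; without it Tesseract falls back to English only.
with open(outfile, "a", encoding='utf-8-sig') as f:
    for i in range(1, filelimit + 1):
        filename = "hanpage_"+str(i)+".jpg"
        text = pytesseract.image_to_string(Image.open(filename), lang='kor+eng')
        # Join words that were hyphenated across a line break.
        text = text.replace('-\n', '')
        f.write(text)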
In [29]:
print(pytesseract.image_to_string(Image.open('hanpage_1.jpg'), lang='kor+eng'))
입사지원서

국적       대한민국
생년월일       1995/06/18
 010-9473-9051
-Mai   h20141231@g-mail.hallym.ac.kr 긴급연락처 | 033)261-0312

Zz A | 강원도 춘천시 석사동 퇴계주공 아파트 303동 1002호

Pa
[20
rH
a
~
17
40
re
ra

o7
ro

02 | 0%
Fo

0%

of | oF
Wes

0

ㅇ

=

(으

S

=

m

 

학력사항
=         ai            oti             전공            선전 /마전      조언없브     A
Te      (yyyy/mm~yyyy/mm)          어그                     LO                    영석/만섬         솔업여부       소재지

 2011/03~2014/02 인문계            00 / 0 [졸업 강원 춘천
    2014/03~ | 한림 대학교 | 영어영문학과         3.35/4.5 강원 춘천

00

경력사항
-           (yyyy/mm~yyyy/mm)              그                   ㅇㅇ 버               수
배아 코볼 | ae이
군별                 급              며제사유
(yyyy/mm~yyyy/mm)           =              A 번             그 | |유
2015/07~2017~04

“52
MOS: Access2016         2019/05/02              YBM IT                  uNAH-XMYc
MOS: PowerPoint2016     2019/06/03               YBM IT                   ULYn-sFaN

격 (

@ 제(    )     !회계사 2차 시험 합격           년)        @ 공인회계사 1차 시험     -(

외국어능력
213018
90 0 | 000 | 0 906

취업보호대상여

 

on

ot
Je
=
oct

01
mE!
|
mel
2
4x

>
Jy
>
0안

oln
JK
[또
LO

건
iY
on

'

와
OH
O

re
Jy

[것
+

(급수)              점수(급수)취득일

2019/03/23

FY

요
오

40

HI
FO}
2
HI
FO}
re
Lot

rb
4
뿌
[뽀
수
of

lot             0건
oo               =
[요               요
OHI               4r

4a
ogt
고

i,
|

2019/07/1                                                                         Part-time

ru
4o
>
jal
<i
이
eo
In [30]:
text = pytesseract.image_to_string(Image.open('hanpage_1.jpg'), lang='kor+eng')
In [31]:
import re
In [32]:
cv_table = re.split('\n',text)
In [33]:
print(cv_table)
['입사지원서', '', '국적       대한민국', '생년월일       1995/06/18', ' 010-9473-9051', '-Mai   h20141231@g-mail.hallym.ac.kr 긴급연락처 | 033)261-0312', '', 'Zz A | 강원도 춘천시 석사동 퇴계주공 아파트 303동 1002호', '', 'Pa', '[20', 'rH', 'a', '~', '17', '40', 're', 'ra', '', 'o7', 'ro', '', '02 | 0%', 'Fo', '', '0%', '', 'of | oF', 'Wes', '', '0', '', 'ㅇ', '', '=', '', '(으', '', 'S', '', '=', '', 'm', '', ' ', '', '학력사항', '=         ai            oti             전공            선전 /마전      조언없브     A', 'Te      (yyyy/mm~yyyy/mm)          어그                     LO                    영석/만섬         솔업여부       소재지', '', ' 2011/03~2014/02 인문계            00 / 0 [졸업 강원 춘천', '    2014/03~ | 한림 대학교 | 영어영문학과         3.35/4.5 강원 춘천', '', '00', '', '경력사항', '-           (yyyy/mm~yyyy/mm)              그                   ㅇㅇ 버               수', '배아 코볼 | ae이', '군별                 급              며제사유', '(yyyy/mm~yyyy/mm)           =              A 번             그 | |유', '2015/07~2017~04', '', '“52', 'MOS: Access2016         2019/05/02              YBM IT                  uNAH-XMYc', 'MOS: PowerPoint2016     2019/06/03               YBM IT                   ULYn-sFaN', '', '격 (', '', '@ 제(    )     !회계사 2차 시험 합격           년)        @ 공인회계사 1차 시험     -(', '', '외국어능력', '213018', '90 0 | 000 | 0 906', '', '취업보호대상여', '', ' ', '', 'on', '', 'ot', 'Je', '=', 'oct', '', '01', 'mE!', '|', 'mel', '2', '4x', '', '>', 'Jy', '>', '0안', '', 'oln', 'JK', '[또', 'LO', '', '건', 'iY', 'on', '', "'", '', '와', 'OH', 'O', '', 're', 'Jy', '', '[것', '+', '', '(급수)              점수(급수)취득일', '', '2019/03/23', '', 'FY', '', '요', '오', '', '40', '', 'HI', 'FO}', '2', 'HI', 'FO}', 're', 'Lot', '', 'rb', '4', '뿌', '[뽀', '수', 'of', '', 'lot             0건', 'oo               =', '[요               요', 'OHI               4r', '', '4a', 'ogt', '고', '', 'i,', '|', '', '2019/07/1                                                                         Part-time', '', 'ru', '4o', '>', 'jal', '<i', '이', 'eo']
In [34]:
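# Keywords to look for: name, major, grades, qualifications
matchers = ['성명','전공','성적','자격사항']
# Keep every OCR line that contains at least one keyword.
matched_lines = [s for s in cv_table if any(xs in s for xs in matchers)]
In [35]:
print(matched_lines)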
['=         ai            oti             전공            선전 /마전      조언없브     A']

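Korean text is still not recognized well: of the four keywords, only '전공' (major) was found, and only inside a garbled table-header line.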
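
To see which keywords were missed rather than only which lines matched, the lookup can be keyed by keyword. The sketch below also strips whitespace before comparing, on the assumption that OCR sometimes inserts spaces inside a Korean word. The function name and the `sample_lines` list (including the spaced-out `전 공`) are made up for illustration and are not actual OCR output.

```python
import re


def find_keywords(lines, keywords):
    """Map each keyword to the lines containing it, ignoring whitespace."""
    hits = {keyword: [] for keyword in keywords}
    for line in lines:
        compact = re.sub(r"\s+", "", line)  # drop stray spaces added by OCR
        for keyword in keywords:
            if keyword in compact:
                hits[keyword].append(line)
    return hits


# Hypothetical lines in the style of the OCR output above.
sample_lines = ['입사지원서', '학력사항', '=    ai    oti    전 공    선전 /마전', '경력사항']
hits = find_keywords(sample_lines, ['성명', '전공', '성적', '자격사항'])
missing = [keyword for keyword, found in hits.items() if not found]
print(missing)
```

A list of missing keywords makes it easier to judge whether a change in resolution or language settings actually improved recognition.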