English file comparison

pytesseract

In [1]:
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

Initialization

In [2]:
try:
    from PIL import Image
except ImportError:
    import Image

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
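
The hard-coded path above only works on a Windows machine where Tesseract sits in that install folder. As a minimal sketch, the executable can be looked up on PATH first with the standard library's `shutil.which`, falling back to the notebook's path. The helper name `find_tesseract` and its `fallback` argument are assumptions for illustration, not part of the original notebook.

```python
import shutil


def find_tesseract(fallback=r"C:\Program Files\Tesseract-OCR\tesseract"):
    """Return the tesseract executable found on PATH, else the fallback path.

    Hypothetical helper: the fallback is the path used in this notebook and
    is only an assumption about where Tesseract is installed.
    """
    found = shutil.which("tesseract")
    return found if found is not None else fallback
```

It could then be used as `pytesseract.pytesseract.tesseract_cmd = find_tesseract()`.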

Loading the file

In [3]:
PDF_file = 'CV.pdf'
In [4]:
pages = convert_from_path(PDF_file, dpi=500)  # render each page at 500 DPI
In [5]:
image_counter = 1

Converting the PDF pages to images

In [6]:
for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
In [7]:
filelimit = image_counter-1
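
The manual counter above can also be folded into a small helper. The following is only a sketch: `page_filenames` and `save_pages` are hypothetical names, and `pages` is assumed to be any sequence of objects with a PIL-style `save(filename, format)` method, such as the list returned by `convert_from_path`.

```python
def page_filenames(prefix, count, ext="jpg"):
    """Return 1-based page file names such as page_1.jpg, page_2.jpg, ..."""
    return [f"{prefix}{i}.{ext}" for i in range(1, count + 1)]


def save_pages(pages, prefix="page_", fmt="JPEG"):
    """Save every page image and return the list of file names written.

    Hypothetical helper: each page only needs a save(filename, format) method.
    """
    names = page_filenames(prefix, len(pages))
    for page, name in zip(pages, names):
        page.save(name, fmt)
    return names
```

With this, `len(save_pages(pages))` gives the same value as `filelimit`, and the returned names can be passed straight to the OCR step.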

Recognizing the images and converting them to text

In [8]:
outfile = "out_text.txt"
In [9]:
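# Append the OCR text of every page to the output file.
# Mode "a" appends, so re-running this cell adds the text a second time.
with open(outfile, "a", encoding='utf-8-sig') as f:
    for i in range(1, filelimit + 1):
        filename = "page_"+str(i)+".jpg"
        text = pytesseract.image_to_string(Image.open(filename))
        # Rejoin words that were hyphenated across a line break.
        text = text.replace('-\n', '')
        f.write(text)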
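
The `replace('-\n', '')` step removes every hyphen that is directly followed by a newline, which rejoins words split across lines but also drops a dash that legitimately ends a line. The sketch below shows that behaviour next to a stricter variant. The function names are hypothetical, and the "lowercase letter follows" rule is an assumed heuristic, not something the original notebook uses.

```python
import re


def join_hyphenated(text):
    """Remove every hyphen that is directly followed by a newline."""
    return text.replace("-\n", "")


def join_hyphenated_lowercase(text):
    """Join only when a lowercase letter follows the line break.

    Assumed heuristic: a word continuation usually starts in lowercase,
    while a date range or list dash usually does not.
    """
    return re.sub(r"-\n(?=[a-z])", "", text)
```

For a CV with date ranges such as `Mar. 2017 - Nov. 2017`, the stricter variant keeps a dash that happens to fall at the end of a line.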

pyocr

Initialization

In [10]:
from PIL import Image
import sys
import pyocr
import pyocr.builders
In [11]:
# Get the available OCR tools
tools = pyocr.get_available_tools()
In [12]:
# Check that at least one tool was found
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
In [13]:
# Tools are returned in the recommended order, so use the first one
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
Will use tool 'Tesseract (sh)'
In [14]:
# Print the available languages
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
Available languages: eng, kor, osd
In [15]:
# Use English (the first language in the list)
lang = langs[0]
print("Will use lang '%s'" % (lang))
Will use lang 'eng'
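
Selecting `langs[0]` yields `'eng'` here only because of the order in which the installed languages are listed. A minimal sketch of choosing the language by name instead is shown below. `pick_lang` is a hypothetical helper, not a pyocr function.

```python
def pick_lang(available, preferred=("eng",)):
    """Return the first preferred language code that is installed.

    Hypothetical helper: raises ValueError when none of the preferred
    codes appears in the list of available languages.
    """
    for code in preferred:
        if code in available:
            return code
    raise ValueError(f"none of {list(preferred)} found in {list(available)}")
```

For example, `lang = pick_lang(langs, ("eng",))` would fail loudly if the English data were missing instead of silently using another language.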

Recognizing the images and converting them to text

In [16]:
import codecs
import pyocr
import pyocr.builders
In [17]:
tool = pyocr.get_available_tools()[0]
In [18]:
builder = pyocr.builders.TextBuilder()
In [19]:
txt = tool.image_to_string(
    Image.open('page_1.jpg'),
    lang=lang,
    builder=builder
)
In [20]:
with codecs.open("page_1.txt", 'w', encoding='utf-8') as file_descriptor:
    builder.write_file(file_descriptor, txt)

English file output

pytesseract

In [21]:
print(pytesseract.image_to_string(Image.open('page_1.jpg')))
Mitcnell O'Hara-Wila

DATA SCIENTIST
58 Madeleine Rd, Clayton
0+61 408 259421 | SZmail@mitchelloharawild.com | mitchelloharawild | mitchelloharawild | YW mitchoharawild

Education
Monash University Clayton, Australia
BCom (HONS) IN ECONOMETRICS Mar. 2017 - Nov. 2017

« GPA of 3.875, WAM of 86.625
« Best in class for: Advanced statistical modelling (ETC3580), Bayesian time series econometrics (ETC4541), Applied econometrics 2 (ETC4410),
Advanced topics in computational science (FIT4012), Honours Research Project (ETC4860)

Monash University Clayton, Australia

BCOM IN ECONOMETRICS, BSC IN MATHEMATICAL STATISTICS AND COMPUTATIONAL SCIENCE Mar. 2013 - Nov. 2016

* GPA of 3.688, WAM of 85.385

« Mentored in the Access Monash Ambassador Program (2015 and 2016)

« Participated in the Vice-Chancellor’s Ancora Imparo Student Leadership Program (2014)

« Best in class for: Business analytics (ETC3450), Business forecasting (ETC2450), Algorithms and data structures (FIT2004), Time series analysis
for business and economics (ETC3450)

Experience
iSelect Cheltenham, Australia
DATA MINING (INTERNSHIP) Feb. 2015 - Mar. 2015

« Improved business data and issue reporting with interactive visualisations, and model-based anomaly detection.

Coles Rowville, Australia
FRESH PRODUCE Oct. 2010 - Nov. 2015
« Food preparation & display, first aid, staff training and customer assistance.

Monash University Clayton, Australia
RESEARCH ASSISTANT Jan. 2016 - Present

« Supervisors include Rob Hyndman, Dianne Cook, and George Athanasopoulos.
* Consulting projects with DiabetesLab, Tennis Australia, Monash University and Huawei.
« Contributed to the development of numerous open source R packages.

Google Sydney, Australia
STUDENT AMBASSADOR Feb. 2015 - Nov. 2015
« Supported Google’s presence on campus with events and media.

Monash University Caulfield & Clayton, Australia
TEACHING ASSOCIATE S1 2016 - Present

« Advanced statistical modelling (ETC3580)

- Business forecasting (ETF3231/ETF5231)

« Mathematics for business (ETF2700)

« Data modelling and computing (ETC1010)

Rotaract Monash, Australia

VOLUNTEER Feb. 2013 - Nov. 2016

« Development and maintenance of club website and online services.
« Organising and hosting fundraisers.

NOVEMBER, 2018 MITCHELL O’HARA-WILD + CURRICULUM VITAE 1

pyocr

In [22]:
txt = tool.image_to_string(
    Image.open('page_1.jpg'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
print(txt)
Mitcnell O'Hara-Wilc

DATA SCIENTIST
58 Madeleine Rd, Clayton
0+61 408 259421 | SWmail@mitchelloharawild.com | mitchelloharawild | wmitchelloharawild | YW mitchoharawild

Education
Monash University Clayton, Australia
BCom (HONS) IN ECONOMETRICS Mar. 2017 - Nov. 2017

« GPA of 3.875, WAM of 86.625
« Best in class for: Advanced statistical modelling (ETC3580), Bayesian time series econometrics (ETC4541), Applied econometrics 2 (ETC4410),
Advanced topics in computational science (FIT4012), Honours Research Project (ETC4860)

Monash University Clayton, Australia

BCOM IN ECONOMETRICS, BSC IN MATHEMATICAL STATISTICS AND COMPUTATIONAL SCIENCE Mar. 2013 - Nov. 2016

* GPA of 3.688, WAM of 85.385

« Mentored in the Access Monash Ambassador Program (2015 and 2016)

« Participated in the Vice-Chancellor’s Ancora Imparo Student Leadership Program (2014)

« Best in class for: Business analytics (ETC3450), Business forecasting (ETC2450), Algorithms and data structures (FIT2004), Time series analysis
for business and economics (ETC3450)

Experience
iSelect Cheltenham, Australia
DATA MINING (INTERNSHIP) Feb. 2015 - Mar. 2015

« Improved business data and issue reporting with interactive visualisations, and model-based anomaly detection.

Coles Rowville, Australia
FRESH PRODUCE Oct. 2010 - Nov. 2015
« Food preparation & display, first aid, staff training and customer assistance.

Monash University Clayton, Australia
RESEARCH ASSISTANT Jan. 2016 - Present

« Supervisors include Rob Hyndman, Dianne Cook, and George Athanasopoulos.
* Consulting projects with DiabetesLab, Tennis Australia, Monash University and Huawei.
« Contributed to the development of numerous open source R packages.

Google Sydney, Australia
STUDENT AMBASSADOR Feb. 2015 - Nov. 2015
« Supported Google’s presence on campus with events and media.

Monash University Caulfield & Clayton, Australia
TEACHING ASSOCIATE S1 2016 - Present

« Advanced statistical modelling (ETC3580)

« Business forecasting (ETF3231/ETF5231)

« Mathematics for business (ETF2700)

« Data modelling and computing (ETC1010)

Rotaract Monash, Australia

VOLUNTEER Feb. 2013 - Nov. 2016

« Development and maintenance of club website and online services.
« Organising and hosting fundraisers.

NOVEMBER, 2018 MITCHELL O’HARA-WILD + CURRICULUM VITAE 1

Korean file comparison

pytesseract

Reading the file

In [23]:
PDF_file = '입사지원서_정민.pdf'
In [24]:
pages = convert_from_path(PDF_file, dpi=500)  # render each page at 500 DPI
In [25]:
image_counter = 1

Converting the PDF pages to images

In [26]:
for page in pages:
    filename = "hanpage_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
In [27]:
filelimit = image_counter-1

Recognizing the images and converting them to text

In [28]:
outfile = "hanout_text.txt"
In [29]:
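# Append the OCR text of every page to the output file.
# Mode "a" appends, so re-running this cell adds the text a second time.
with open(outfile, "a", encoding='utf-8-sig') as f:
    for i in range(1, filelimit + 1):
        filename = "hanpage_"+str(i)+".jpg"
        # Without lang, pytesseract falls back to English; use Korean plus English here.
        text = pytesseract.image_to_string(Image.open(filename), lang='kor+eng')
        # Rejoin words that were hyphenated across a line break.
        text = text.replace('-\n', '')
        f.write(text)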

pyocr

Recognizing the images and converting them to text

In [30]:
tool = pyocr.get_available_tools()[0]
In [31]:
builder = pyocr.builders.TextBuilder()
In [32]:
txt = tool.image_to_string(
    Image.open('hanpage_1.jpg'),
    lang='kor',
    builder=builder
)
In [33]:
with codecs.open("hanpage_1.txt", 'w', encoding='utf-8') as file_descriptor:
    builder.write_file(file_descriptor, txt)

Korean file output

pytesseract

In [34]:
print(pytesseract.image_to_string(Image.open('hanpage_1.jpg'), lang='kor+eng'))
입사지원서

국적       대한민국
생년월일       1995/06/18
 010-9473-9051
-Mai   h20141231@g-mail.hallym.ac.kr 긴급연락처 | 033)261-0312

Zz A | 강원도 춘천시 석사동 퇴계주공 아파트 303동 1002호

Pa
[20
rH
a
~
17
40
re
ra

o7
ro

02 | 0%
Fo

0%

of | oF
Wes

0

ㅇ

=

(으

S

=

m

 

학력사항
=         ai            oti             전공            선전 /마전      조언없브     A
Te      (yyyy/mm~yyyy/mm)          어그                     LO                    영석/만섬         솔업여부       소재지

 2011/03~2014/02 인문계            00 / 0 [졸업 강원 춘천
    2014/03~ | 한림 대학교 | 영어영문학과         3.35/4.5 강원 춘천

00

경력사항
-           (yyyy/mm~yyyy/mm)              그                   ㅇㅇ 버               수
배아 코볼 | ae이
군별                 급              며제사유
(yyyy/mm~yyyy/mm)           =              A 번             그 | |유
2015/07~2017~04

“52
MOS: Access2016         2019/05/02              YBM IT                  uNAH-XMYc
MOS: PowerPoint2016     2019/06/03               YBM IT                   ULYn-sFaN

격 (

@ 제(    )     !회계사 2차 시험 합격           년)        @ 공인회계사 1차 시험     -(

외국어능력
213018
90 0 | 000 | 0 906

취업보호대상여

 

on

ot
Je
=
oct

01
mE!
|
mel
2
4x

>
Jy
>
0안

oln
JK
[또
LO

건
iY
on

'

와
OH
O

re
Jy

[것
+

(급수)              점수(급수)취득일

2019/03/23

FY

요
오

40

HI
FO}
2
HI
FO}
re
Lot

rb
4
뿌
[뽀
수
of

lot             0건
oo               =
[요               요
OHI               4r

4a
ogt
고

i,
|

2019/07/1                                                                         Part-time

ru
4o
>
jal
<i
이
eo

pyocr

In [35]:
txt = tool.image_to_string(
    Image.open('hanpage_1.jpg'),
    lang='kor',
    builder=pyocr.builders.TextBuilder()
)
print(txt)
기본사항
지원본부/직무 ㅣ 인턴                         국적      대한민국
성 명                               색티퀴일    1995/06/18
영문성명      100119 1410                       1/1010116      010-9473-9051
『-131|      120141231@0-버키.2011417.00.1                 033)261-0312
주 소    강원도 춘천시 석사동 퇴계주공 아파트 303동 1002호

 

  

 

기간
(/7/"\/0007~\5\77/7/00)
고교 ㅣ|2011/03~2014/02

학교명             저모            성적/만점      졸업여부    소재지

 

 

 

 

 

 

 

 

 

대학         2014/03~
~                                                              /
경력사항
회사명                는우기애                   직위                 담당업무                소재지

(/////0001~///0ㅁ000)

 

 

 

 

 

0표
」요
스
0앞

 

 

 

 

 

 

 

 

 

 

 

 

 

 

군필여부           0       계급             면제사유
0///7/0000~///0000)

군필            2015/07~2017~04     육군               병장
자격사항

자격명                 취득일                  발급처                 등록번호
"14105: 24000552016        2019/05/02                                   44/사1-×\
"105: 『ㅁ0\4600012016 ㅣ 2019/06/03                                   41\7-하24
6 제(  )회 공인회계사 2차 시험 합격 (        년)
외국어능력

외국어명             수준            공인시험명          점수(급수)        점수(급수)쥐득일
영어                                70티16              595            2019/03/23
취업보호대상여부
보훈여부                       보훈번호                       장애여부

근무관련사항

근무시작가능일                     근무형태                       8 비어

2019/07/12            『81ㄴ81176

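Comparison results:

  • For the English image (PDF) file, both libraries extracted the text with high accuracy and performed at a similar level.
  • For the Korean image (PDF) file, pyocr produced the better result. The pytesseract run used lang='kor+eng' while the pyocr run used lang='kor', so the two Korean outputs were not produced under identical settings. Both still contain many errors, and further training of the recognition model will likely be needed before Korean OCR reaches a high level of accuracy.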
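
The comparison above is made by eye. As a rough sketch, the agreement between the two outputs can be put into a number with the standard library's `difflib.SequenceMatcher`. This measures how similar the two OCR results are to each other, not how accurate either one is against the true text. The helper name `similarity` is an assumption, and the two sample strings are the first line printed by each library for the English page.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio between 0 and 1 describing how alike two strings are."""
    return SequenceMatcher(None, a, b).ratio()


# First line of the English page as printed by each library in this notebook.
pytesseract_line = "Mitcnell O'Hara-Wila"
pyocr_line = "Mitcnell O'Hara-Wilc"

print(similarity(pytesseract_line, pyocr_line))
```

Running the same function on the full text of each page would give one score per page for the English and the Korean file.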