21. Crawling

2022. 11. 22. 12:27


The BeautifulSoup library

Used to parse HTML documents.
parse: to analyze the syntax of a document

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
type(html_doc)

Define a sample HTML document.
Wrapping it in triple quotes (""" """) makes it a str.

import bs4

# res = rq.get('https://www.daum.net/')
# html = res.text
type(bs4.BeautifulSoup(html_doc))
# Displays bs4.BeautifulSoup.

soup = bs4.BeautifulSoup(html_doc)
print(soup)
print(type(soup)) # <class 'bs4.BeautifulSoup'>
print(type(soup.title)) # <class 'bs4.element.Tag'> (a tag element)
print(type(soup.head)) # <class 'bs4.element.Tag'> (a tag element)
print(type(soup.title.name)) # <class 'str'>


print(soup.title) # <title>The Dormouse's story</title>
print(type(soup.title)) # <class 'bs4.element.Tag'>

Import the bs4 module.
bs4.BeautifulSoup(html_doc) converts html_doc (a str) into a BeautifulSoup object.
Assign the result to soup and print it, and the parsed document is shown as follows.
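A minimal sketch of the same conversion, with the parser named explicitly. The shortened html_doc string and the choice of "html.parser" (Python's built-in parser) are assumptions for illustration; the post itself lets bs4 pick a parser.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the sample document (assumption, for illustration)
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

# Naming the parser avoids the warning bs4 emits when it has to guess one
soup = BeautifulSoup(html_doc, "html.parser")

print(type(html_doc).__name__)  # str
print(type(soup).__name__)      # BeautifulSoup
print(soup.title)               # <title>The Dormouse's story</title>
```

Passing the parser name also keeps the result the same on machines that have different parsers installed.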

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

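# Tag name
soup.title.name

Returns the name of soup's title tag, which is simply 'title' (is this ever needed?).

# Return only the text inside the tag
print(soup.title.string)

Returns only the text inside title. Here it prints: The Dormouse's story

# Tag element
type(soup.title)

This is the type of the title tag, so it is a tag element. -> bs4.element.Tag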

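# Access the parent tag
print(soup.title.parent)
print(type(soup.title.parent)) # bs4.element.Tag (a tag element)

.parent is an attribute (not a method) that gives access to the element's parent tag.
Because the result is the parent tag, its type is also a tag element. -> bs4.element.Tag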
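To see name, string, and parent side by side, here is a small sketch on a cut-down document (the one-line HTML string is an assumption for illustration). It also uses .parents, the plural form, which walks all the way up the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

title = soup.title
print(title.name)         # title
print(title.string)       # The Dormouse's story
print(title.parent.name)  # head

# .parents yields every ancestor; the BeautifulSoup object itself comes last
# and reports its name as '[document]'
ancestor_names = [parent.name for parent in title.parents]
print(ancestor_names)     # ['head', 'html', '[document]']
```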

# Child tags
# An iterator (its items can be visited with a for loop)
for tag in soup.p.children:
  print(tag)

Conversely, .children accesses the children of a tag. It returns them as an iterator, so they can be visited with a for loop.

print(type(soup.p)) # <class 'bs4.element.Tag'>
print(type(soup.children)) # <class 'list_iterator'>
print(soup.children) # <list_iterator object at 0x7f2e61ce5790>
print(type(soup.p.children)) # .children gives an iterator (a list_iterator here)

.children returns the children as an iterator.
soup.p is a Tag, but soup.p.children, the collection of its children, is an iterator.
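The difference between .children, .contents, and the deeper .descendants is easier to see on a tiny example. This is a sketch on a made-up paragraph (the HTML string is an assumption); it deliberately does not print the iterator's type, since that can differ between bs4 versions.

```python
from bs4 import BeautifulSoup

html = "<p class='story'>Once <a id='link1'>Elsie</a> and <a id='link2'><b>Lacie</b></a></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .contents is a plain list of the direct children (text nodes included)
print(len(p.contents))   # 4

# .children visits the same direct children, but lazily
children = list(p.children)
print(children == p.contents)  # True

# .descendants goes deeper: children, grandchildren, and so on
descendants = list(p.descendants)
print(len(descendants))  # 7
```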

# Return the children as a list
soup.p.contents

# Access a tag object
soup.p
print(type(soup.a))

soup.p returns the first p tag in soup.
<p class="title"><b>The Dormouse's story</b></p> -> it is displayed as HTML.

# Get the URL from a tag object
print(soup.a.attrs['href']) # a tag object supports attrs[''] and .get()
print(soup.a.get('href'))

print(soup.find_all('a'))
print(soup.a)
print(type(soup.find_all('a')))

soup.a is a tag object, so attrs and get can be used on it.
soup.find_all('a'), on the other hand, collects every a tag in soup, so it returns a ResultSet rather than a single tag. attrs and get cannot be used on a ResultSet.

# Get only the links of the a tags with a list comprehension
[a.attrs['href'] for a in soup.find_all('a')]
# [a.attrs.get('href') for a in soup.find_all('a')]

So the items of a ResultSet have to be taken out one at a time with a for loop.
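The Tag versus ResultSet distinction can be checked directly. The sketch below uses a made-up two-link snippet (an assumption), one link without an href, to show how [] and .get() differ on a missing attribute and what happens when .get() is called on the ResultSet itself.

```python
from bs4 import BeautifulSoup

html = '<p><a href="http://example.com/elsie" id="link1">Elsie</a> <a id="no-href">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")   # ResultSet: behaves like a list of Tag objects
print(links[0]["href"])      # http://example.com/elsie
print(links[1].get("href"))  # None: .get() does not raise for a missing attribute

try:
    links[1]["href"]         # [] raises KeyError when the attribute is missing
except KeyError:
    print("second <a> has no href")

try:
    links.get("href")        # the ResultSet itself has no .get()
except AttributeError:
    print("call .get() on each tag, not on the ResultSet")

# href=True keeps only the tags that actually have an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)                 # ['http://example.com/elsie']
```

Using .get() inside the comprehension is the safer choice on real pages, where some a tags have no href.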

import requests as rq

url = "https://en.wikipedia.org/wiki/Artemis_1"
res = rq.get(url)
html = res.text

soup = bs4.BeautifulSoup(html)
print(res)
# print(html)
print(type(html)) # str
print(soup)
print(type(soup)) # bs4.BeautifulSoup

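Now let's apply this to a real URL.
The page fetched from the URL is turned into a BeautifulSoup object and assigned to soup.
print(res) shows <Response [200]>, the status code for a successful request, because the page was requested with get from the requests module (used here as rq).
print(html) prints the HTML of the fetched page.
print(soup) prints the HTML after it has been converted to a BeautifulSoup object.
Here the type of soup is BeautifulSoup.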
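The fetch-then-parse step can be wrapped in a small helper. This is only a sketch: fetch_soup is a hypothetical name, and the 10-second timeout and the "html.parser" choice are assumptions, not something the post prescribes.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    res = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    res.raise_for_status()
    return BeautifulSoup(res.text, "html.parser")


# Usage (needs network access):
# soup = fetch_soup("https://en.wikipedia.org/wiki/Artemis_1")
# print(soup.title.string)
```

Checking the status before parsing means a failed request shows up as an error right away rather than as an unexpectedly empty result later.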

# print(soup.body)
print(type(soup.body))
print(type(soup.find_all('a')))

soup.body gives the body tag together with everything inside it, markup included.
The type of soup.body is Tag.
The type of soup.find_all('a') is ResultSet, because find_all was used.

print(soup.find_all('a'))
print(type(soup.find_all('a')))  # a ResultSet is a list of tags, not a single tag, so attrs cannot be used on it.

## attrs cannot be used on it directly, so use for to extract every link into a list
soup.find_all('a')
[a.attrs.get('href') for a in soup.find_all('a')]

# Find the image tag
# soup.find('img').get('src')
print(soup.img.get('src')) # also works
type(soup.find('img'))
soup.find_all('img')

To look up a tag by name, either call find() or write the tag name directly after a dot. The two are equivalent.
Because find is used instead of find_all (more precisely, because only one element is found), the result is a tag element.
soup.find_all('img') is, naturally, a ResultSet.
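One thing worth knowing about both forms: when nothing matches, they return None, so a chained .get() fails. A small sketch on a made-up snippet (the HTML string is an assumption):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><img src="/a.png" alt="A"><img alt="no src"></div>',
    "html.parser",
)

# Dot access and find() return the same first match
print(soup.img == soup.find("img"))  # True

# No <video> tag exists, so both forms give None
video = soup.find("video")
print(video)                         # None

# Guard before reading attributes from something that may be missing
video_src = video.get("src") if video is not None else None

# src=True keeps only the images that have a src attribute
srcs = [img["src"] for img in soup.find_all("img", src=True)]
print(srcs)                          # ['/a.png']
```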

# Put the 13 image URLs in a list
image_links = [img.get('src') for img in soup.find_all('img')[:5]] # only the first 5

# Save the images from the links
for link in image_links:
  images = rq.get('https:' + link).content  # assumes each src starts with // (protocol-relative)
  with open(link.split('/')[-1],'wb') as f:
    f.write(images)
    print('image save.')
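Prefixing 'https:' only works when every src starts with //. As a more general sketch, urljoin from the standard library resolves any kind of link against the page URL. The three sample links and the helper names absolute_url and filename_from_url are assumptions for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

page_url = "https://en.wikipedia.org/wiki/Artemis_1"

# Hypothetical src values: protocol-relative, site-relative, and absolute
links = [
    "//upload.example.org/img/a.png",
    "/static/b.jpg",
    "https://example.com/c.gif?x=1",
]


def absolute_url(base, link):
    """Resolve a src value against the URL of the page it came from."""
    return urljoin(base, link)


def filename_from_url(url):
    """Take the last path segment, ignoring any query string."""
    return PurePosixPath(urlparse(url).path).name


urls = [absolute_url(page_url, link) for link in links]
names = [filename_from_url(u) for u in urls]
print(urls)
print(names)  # ['a.png', 'b.jpg', 'c.gif']
```

The resolved URLs can then be passed to rq.get(...) and the names to open(..., 'wb'), as in the loop above.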

# BeautifulSoup: getting tags and attributes
from bs4 import BeautifulSoup as bs

html = """
<html>
<head>
<title class='title' id='title'>test</title>
</head>
<body>
<p class='t1' id='t1'>test1<span>span_test</span></p>
<p class='t2' id='t2'>test2</p>
<p class='t3' id='t3'>test3</p>
</body>
</html>
"""

soup = bs(html)
print(soup.title)
print(soup.title.attrs)
print(soup.title['id'])
print(soup.body)
print(soup.title.get('class'))
print(type(soup.title))
# print(soup.title.get('class_none', 'value'))  # returns the default 'value' when the attribute is missing

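The BeautifulSoup class from bs4 is imported under the shorter name bs for convenience.
The HTML string used this time has class and id attributes.
print(soup.title) -> prints only the title element of soup, as HTML.
print(type(soup.title)) -> its type is Tag.
print(soup.title.attrs) -> returns title's attributes, class and id, as a dictionary.
print(soup.title['id']) -> returns the value of title's id attribute.
print(soup.body) -> prints the body element of soup, as HTML.
print(soup.title.get('class')) -> returns the value of title's class attribute as a list (['title']), since a tag can have several classes.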
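A short sketch of how attribute values come back, using a made-up p tag with two classes (the HTML string is an assumption). It shows why class is a list while id is a plain string, and how .get() takes a default.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='t1 big' id='t1'>test1</p>", "html.parser")
p = soup.p

print(p.attrs)                 # {'class': ['t1', 'big'], 'id': 't1'}
print(p["id"])                 # t1
print(p.get("class"))          # ['t1', 'big']

# .get() accepts a default, like dict.get(); [] would raise KeyError here
print(p.get("href", "value"))  # value
```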

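# text, string
# text: extracts all the text inside the tag, nested tags included; a plain str
# string: extracts the tag's text only when the tag has exactly one child; a NavigableString (None otherwise)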

print(soup.title.text, type(soup.title.text)) # test <class 'str'>
print(soup.title.string, type(soup.title.string)) # test <class 'bs4.element.NavigableString'>
print(soup.p.text, type(soup.p.text)) # test1span_test <class 'str'>
print(soup.p.string, type(soup.p.string)) # None <class 'NoneType'>
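When .text glues pieces together, as in test1span_test, get_text() with a separator is often more readable. A sketch on the same p tag as above (reduced to a one-line string for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p id='t1'>test1<span>span_test</span></p>", "html.parser")
p = soup.p

print(p.string)                      # None: p has two children
print(p.span.string)                 # span_test
print(p.text)                        # test1span_test

# get_text() can put a separator between the pieces and strip whitespace
print(p.get_text(" ", strip=True))   # test1 span_test

# stripped_strings yields each piece of text separately
pieces = list(p.stripped_strings)
print(pieces)                        # ['test1', 'span_test']
```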

# children: returned as an iterator (an object that can be visited in a loop)
soup.p.children

[element for element in soup.p.children]

# Access sibling tag objects
sib = soup.body.p.next_sibling.next_sibling
soup.body.p.next_sibling.next_sibling

sib.previous_sibling.previous_sibling

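The next_sibling and previous_sibling attributes give access to sibling nodes. They are applied twice here because the newline between two tags is itself a sibling (a text node).
There are also next_element and previous_element. A sibling is the neighboring node at the same level, while an element is the neighboring node in document order, so the next_element of a tag is usually its first child, such as the text inside it.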
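The whitespace sibling is easy to miss, so here is a small sketch that prints it. The HTML mirrors the test document above in shortened form (an assumption), and find_next_sibling is shown as a way to skip text nodes.

```python
from bs4 import BeautifulSoup

html = "<body>\n<p id='t1'>test1<span>span_test</span></p>\n<p id='t2'>test2</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
first = soup.p

# The node right after the first <p> is the newline, not the next <p>
print(repr(first.next_sibling))              # '\n'
print(first.next_sibling.next_sibling["id"]) # t2

# find_next_sibling() skips text nodes and returns the next matching tag
print(first.find_next_sibling("p")["id"])    # t2

# next_element follows document order, so it steps into the tag
print(first.next_element)                    # test1
```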

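# Ways to access sibling tags
# With a loop
for s in sib.previous_siblings:
  print(s)

# The singular attributes return a single node
# The plural attributes return a generator (usable in a loop)
print(soup.p.next_element)
print(soup.p.next_elements)

For both sibling and element, the plural form ending in s returns a generator over several nodes rather than a single node, so it has to be visited with a for loop, just like an iterator.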
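Since the plural attributes also yield the whitespace text nodes, a loop usually filters for tags. A sketch with three made-up paragraphs (the HTML string is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<p id='t1'>a</p>\n<p id='t2'>b</p>\n<p id='t3'>c</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag objects; the newlines between tags are skipped
next_ids = [s["id"] for s in soup.p.next_siblings if isinstance(s, Tag)]
print(next_ids)   # ['t2', 't3']

# previous_siblings walks backwards, so the nearest sibling comes first
last = soup.find(id="t3")
prev_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]
print(prev_ids)   # ['t2', 't1']

# find_next_siblings() does the filtering for you and returns a ResultSet
same = [t["id"] for t in soup.p.find_next_siblings("p")]
print(same)       # ['t2', 't3']
```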

# soup.find_all('tag name', {'attribute name': 'attribute value', ...})
soup.find_all('p',{'id':'t1'})

soup.find_all(attrs={'class':'t2'})

soup.p.attrs['class']

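A slightly more detailed look at find_all.
Pass find_all() the tag name you want and a dictionary of attribute names and values, and it returns only the tags that match.
Whether it finds one match or many, the result is always a ResultSet.
To read an attribute from a tag directly, without find_all, use the .attrs[] form.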
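find_all also accepts keyword arguments, which can be shorter than the dictionary form. A sketch on a made-up document (the HTML string, including the extra class on the third p, is an assumption):

```python
from bs4 import BeautifulSoup

html = (
    "<body><p class='t1' id='t1'>test1</p>"
    "<p class='t2' id='t2'>test2</p>"
    "<p class='t2 extra' id='t3'>test3</p></body>"
)
soup = BeautifulSoup(html, "html.parser")

# Dictionary form and keyword form find the same tag
by_dict = soup.find_all("p", {"id": "t1"})
by_kw = soup.find_all("p", id="t1")
print(by_dict == by_kw)  # True

# class is a Python keyword, so bs4 spells the argument class_
# It matches any tag that has t2 among its classes
t2_ids = [p["id"] for p in soup.find_all("p", class_="t2")]
print(t2_ids)            # ['t2', 't3']

# limit stops the search after the given number of matches
first_two = [p["id"] for p in soup.find_all("p", limit=2)]
print(first_two)         # ['t1', 't2']
```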

# Chained access on tag objects
soup.find(attrs={'id':'t1'}).find('span')

find returns a Tag, so another find can be chained onto it.

# Select
# Selects elements with a CSS selector.
soup.select('#t1')

soup.select('.t2')

url = "https://en.wikipedia.org/wiki/Artemis_1"
res = rq.get(url)
html = res.text

soup = bs4.BeautifulSoup(html)
soup.select('#mw-content-text > div.mw-parser-output > table.infobox.ib-spaceflight > tbody > tr:nth-child(1) > td > a > img')

The fetched HTML has id attributes as well as class attributes.
The select method takes a CSS selector, so elements can be looked up by class or by id.
An id is selected with a # prefix.
A class is selected with a . prefix.
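A compact sketch of the selector forms on the small test document (reduced to a one-line string, which is an assumption). It also shows select_one, which returns a single tag or None instead of a list.

```python
from bs4 import BeautifulSoup

html = (
    "<body><p class='t1' id='t1'>test1<span>span_test</span></p>"
    "<p class='t2' id='t2'>test2</p></body>"
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#t1")      # id selector
by_class = soup.select(".t2")   # class selector
print(len(by_id), len(by_class))  # 1 1

# > means direct child; select_one returns the first match only
span = soup.select_one("#t1 > span")
print(span.get_text())          # span_test

# Nothing matches: select gives an empty list, select_one gives None
print(soup.select("#missing"))      # []
print(soup.select_one("#missing"))  # None
```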

# CSS Selector
# Chrome DevTools > select the element > right-click > Copy > Copy selector
print(soup.select('#t1 span'))
print(type(soup.select('#t1 span'))) # a list

With the developer tools in Google Chrome, the selector of an element on a complex page can be copied and pasted directly.

# Extract: removes just that element from the soup object (like pop)
soup.p.extract()

It pulls out the first p tag in soup and returns it, and the tag is removed from the original soup.
It is much like pop().
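A sketch of the pop-like behaviour on a two-paragraph document (the HTML string is an assumption). For comparison it also uses decompose(), which removes a tag without handing it back.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p id='t1'>test1</p><p id='t2'>test2</p></body>", "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.p.extract()
print(removed["id"])           # t1
print(removed.parent is None)  # True: it no longer belongs to soup
remaining = str(soup)
print(remaining)               # <body><p id="t2">test2</p></body>

# decompose() also removes the tag, but destroys it instead of returning it
soup.p.decompose()
print(soup.p)                  # None: no p tags are left
```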


담학기에이쁠's 코딩 복습
