Stemming VS Lemmatization


Genie Lee 2022. 2. 26. 23:13

Stemming

Source: Stemming (어간 추출), Korean Wikipedia (위키백과)

Stemming is the process of removing affixes and similar elements from an inflected word in order to isolate the word's stem.
The stem of “cats” (and likewise “catlike”, “catty”, etc.) is “cat”.
The stem of “stemmer”, “stemming”, and “stemmed” is “stem”.
“fishing”, “fished”, and “fisher” become “fish”.
The stem of “argue”, “argued”, “arguing”, and “argus” is “argu”.
From “argument” and “arguments”, “argument” is extracted.
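
To make the idea of affix removal concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy, not the Porter or Lancaster algorithm: the function name `toy_stem`, the `min_stem_len` parameter, and the suffix list are assumptions chosen purely for illustration, and it covers only a few of the examples above.

```python
# Toy suffix-stripping stemmer (illustrative only, not a real stemming algorithm).
# The suffix list is an assumption chosen to cover a few of the examples above.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["cats", "fishing", "fished", "fisher", "stemmed"]:
    print(w, "->", toy_stem(w))
```

Note that this toy turns "stemmed" into "stemm" because it has no rule for doubled consonants; real stemmers add further rules for cases like this.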

Stemming algorithms

1. Stemming with the Porter stemmer algorithm

# Example: using the Porter stemmer
import nltk

stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem('maximum'))
print("The stemmed form of running is: {}".format(stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(stemmer.stem("run")))

-----
maximum
The stemmed form of running is: run
The stemmed form of runs is: run
The stemmed form of run is: run

2. Stemming with the Lancaster stemmer (LancasterStemmer) algorithm

# Example: using the Lancaster stemmer
from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('maximum'))
print("The stemmed form of running is: {}".format(lancaster_stemmer.stem("running")))
print("The stemmed form of runs is: {}".format(lancaster_stemmer.stem("runs")))
print("The stemmed form of run is: {}".format(lancaster_stemmer.stem("run")))

-----
maxim
The stemmed form of running is: run
The stemmed form of runs is: run
The stemmed form of run is: run

3. Stemming with the Snowball stemmer (SnowballStemmer) algorithm
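
The post gives no example for this stemmer, so the following is a minimal sketch that mirrors the two examples above. It assumes NLTK is installed and uses `SnowballStemmer` from `nltk.stem.snowball` with the "english" language; the variable names `snowball_stemmer` and `results` are illustrative.

```python
# Example: using the Snowball stemmer (sketch; assumes NLTK is installed)
from nltk.stem.snowball import SnowballStemmer

# NLTK's SnowballStemmer takes a language name as its first argument.
snowball_stemmer = SnowballStemmer("english")

# Stem the same words used in the Porter and Lancaster examples.
results = {w: snowball_stemmer.stem(w) for w in ["running", "runs", "run"]}
for word, stem in results.items():
    print("The stemmed form of {} is: {}".format(word, stem))
```

The supported language names should be available as `SnowballStemmer.languages`.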

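Lemmatization (Lemmatizing)

What is a lemma? The base form of a word, such as the singular form of a noun or the infinitive of a verb (Naver Dictionary).

Lemmatization identifies a word's meaning by looking at the surrounding context.
In English, "meeting" used as a noun means a conference, while as a form of "meet" it means to come together with someone; lemmatization extracts the base form that fits the meaning, depending on whether the word is used as a noun or as a verb.
Likewise, it distinguishes the verb "fly" (to move through the air) from the noun "fly" (the insect).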

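# Requires the WordNet corpus, e.g. via nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# lemmatize() treats the word as a noun unless a pos argument is given
print(wordnet_lemmatizer.lemmatize('fly'))
print(wordnet_lemmatizer.lemmatize('flies'))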


-----
fly
fly

Stemming VS Lemmatization

Lemmatization is closely related to stemming, but there are some differences.

- A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster.

- The reduced "accuracy" may not matter for some applications. In fact, when used within information retrieval systems, stemming improves query recall (the true positive rate) compared to lemmatization. Nonetheless, stemming reduces precision (the proportion of positively labeled instances that are actually positive) for such systems.

For instance:

The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization.
The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatisation attempts to select the correct lemma depending on the context.
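
The three cases above can be sketched with a toy dictionary lookup keyed on a word and its part of speech. The `LEMMAS` table and the `toy_lemmatize` function are hypothetical and contain only these examples; a real lemmatizer relies on a full lexicon (NLTK's `WordNetLemmatizer.lemmatize`, for instance, accepts a `pos` argument).

```python
# Toy lemmatizer: a dictionary lookup keyed on (word, part of speech).
# LEMMAS is a hypothetical, hand-written table covering only the examples above.
LEMMAS = {
    ("better", "adj"): "good",
    ("walking", "verb"): "walk",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMAS.get((word, pos), word)


print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("meeting", "noun"))  # meeting ("in our last meeting")
print(toy_lemmatize("meeting", "verb"))  # meet ("We are meeting again tomorrow")
```

A suffix-stripping stemmer has no such table, which is why it cannot link "better" to "good" or tell the two uses of "meeting" apart.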

License: Attribution, NonCommercial, NoDerivatives
