The Studious Webmaster www.mmic.net.cn

Processing Study Notes (18): 3.2 Strings: Text Processing at the Lowest Level
2011/8/5 15:22:44

3.2 Strings: Text Processing at the Lowest Level

 

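PS: I personally think this part is very important. String processing is the most basic part of NLP, so newcomers should read it carefully; veterans can skip it...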

It’s time to study a fundamental data type that we’ve been studiously (that is, deliberately) avoiding so far. In earlier chapters we focused on a text as a list of words. We didn’t look too closely at words and how they are handled in the programming language. By using NLTK’s corpus interface we were able to ignore the files that these texts had come from. The contents of a word, and of a file, are represented by programming languages as a fundamental data type known as a string. In this section, we explore strings in detail, and show the connection between strings, words, texts, and files.

 

Basic Operations with Strings

 

Strings are specified using single quotes or double quotes, as shown in the following code example. If a string contains a single quote, we must backslash-escape the quote so Python knows a literal quote character is intended, or else put the string in double quotes. Otherwise, the quote inside the string will be interpreted as a close quote, and the Python interpreter will report a syntax error:

  >>> monty = 'Monty Python'
  >>> monty
  'Monty Python'
  >>> circus = "Monty Python's Flying Circus"
  >>> circus
  "Monty Python's Flying Circus"
  >>> circus = 'Monty Python\'s Flying Circus'
  >>> circus
  "Monty Python's Flying Circus"
  >>> circus = 'Monty Python's Flying Circus'
    File "<stdin>", line 1
      circus = 'Monty Python's Flying Circus'
                             ^
  SyntaxError: invalid syntax
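As a quick check of the quoting rules above, the following sketch writes the same string three ways and confirms that the results are identical. The variable names are illustrative only, and the code uses the function-call form of print so that it also runs under Python 3.

```python
# Three ways to write a string that contains an apostrophe.
double_quoted = "Monty Python's Flying Circus"      # double quotes around a single quote
escaped = 'Monty Python\'s Flying Circus'           # backslash-escaped single quote
triple_quoted = '''Monty Python's Flying Circus'''  # triple quotes also work

# All three literals denote the same string value.
print(double_quoted == escaped == triple_quoted)    # True

# The backslash belongs to the literal's spelling, not to the string itself.
print('\\' in escaped)                              # False
```

The choice of quoting style affects only how the literal is typed; the resulting string object is the same in each case.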

 

Sometimes strings go over several lines. Python provides us with various ways of entering them. In the next example, a sequence of two strings is joined into a single string. We need to use backslash or parentheses so that the interpreter knows that the statement is not complete after the first line.

 

  >>> couplet = "Shall I compare thee to a Summer's day?"\
  ...           "Thou art more lovely and more temperate:"
  >>> print couplet
  Shall I compare thee to a Summer's day?Thou art more lovely and more temperate:
  >>> couplet = ("Rough winds do shake the darling buds of May,"
  ...           "And Summer's lease hath all too short a date:")
  >>> print couplet
  Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:

 

Unfortunately these methods do not give us a newline between the two lines of the sonnet (a fourteen-line poem). Instead, we can use a triple-quoted string as follows:

 

 

  >>> couplet = """Shall I compare thee to a Summer's day?
  ... Thou art more lovely and more temperate:"""
  >>> print couplet
  Shall I compare thee to a Summer's day?
  Thou art more lovely and more temperate:
  >>> couplet = '''Rough winds do shake the darling buds of May,
  ... And Summer's lease hath all too short a date:'''
  >>> print couplet
  Rough winds do shake the darling buds of May,
  And Summer's lease hath all too short a date:
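The difference between these ways of entering a multi-line string can be tested directly. The sketch below is a minimal comparison with illustrative variable names: it checks whether a newline ends up inside the string, and shows that an explicit \n escape is one more way to obtain the same result as a triple-quoted string.

```python
# Adjacent string literals inside parentheses are joined with nothing between them.
joined = ("Rough winds do shake the darling buds of May,"
          "And Summer's lease hath all too short a date:")

# A triple-quoted string keeps the line break that was typed.
triple = """Rough winds do shake the darling buds of May,
And Summer's lease hath all too short a date:"""

# An explicit \n escape gives the same result as the triple-quoted form.
escaped = ("Rough winds do shake the darling buds of May,\n"
           "And Summer's lease hath all too short a date:")

print("\n" in joined)      # False: no newline was inserted
print(triple == escaped)   # True: both contain exactly one newline
```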

 

Now that we can define strings, we can try some simple operations on them. First let’s look at the + operation, known as concatenation. It produces a new string that is a copy of the two original strings pasted together end-to-end. Notice that concatenation doesn’t do anything clever like insert a space between the words. We can even multiply strings:

 

  >>> 'very' + 'very' + 'very'
  'veryveryvery'
  >>> 'very' * 3
  'veryveryvery'
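Because concatenation never inserts spaces, any separator has to be supplied explicitly. The sketch below contrasts + and * with the built-in str.join method, which this section has not introduced but which is a common way to paste strings together with a separator. The variable names are illustrative.

```python
words = ['very', 'very', 'very']

# Concatenation with + pastes strings together with nothing in between.
glued = words[0] + words[1] + words[2]

# To get spaces, let join() insert a separator between the items.
spaced = ' '.join(words)

# Multiplying by an integer repeats the string; the operand order does not matter.
repeated = 'very' * 3

print(glued)                             # veryveryvery
print(spaced)                            # very very very
print(repeated == 3 * 'very' == glued)   # True
```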

 

Your Turn: Try running the following code, then try to use your understanding of the string + and * operations to figure out how it works. Be careful to distinguish between the string ' ', which is a single whitespace character, and '', which is the empty string.

 

 

  >>> a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
  >>> b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
  >>> for line in b:
  ...     print line

 

We’ve seen that the addition and multiplication operations apply to strings, not just numbers. However, note that we cannot use subtraction or division with strings:

 

  >>> 'very' - 'y'
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unsupported operand type(s) for -: 'str' and 'str'
  >>> 'very' / 2
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: unsupported operand type(s) for /: 'str' and 'int'

 

These error messages are another example of Python telling us that we have got our data types in a muddle. In the first case, we are told that the operation of subtraction (i.e., -) cannot apply to objects of type str (strings), while in the second, we are told that division cannot take str and int as its two operands.
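To see which operators strings accept without stopping at the first error, the attempts can be wrapped in a try/except block. This is only a sketch: try_operation is a hypothetical helper written for this note, not part of Python or NLTK.

```python
def try_operation(description, operation):
    """Run operation() and report either its result or the TypeError it raises."""
    try:
        return description + ' -> ' + repr(operation())
    except TypeError as error:
        return description + ' -> TypeError: ' + str(error)

# Each lambda delays the operation so that the helper can catch any error.
results = [
    try_operation("'very' + 'y'", lambda: 'very' + 'y'),
    try_operation("'very' * 2", lambda: 'very' * 2),
    try_operation("'very' - 'y'", lambda: 'very' - 'y'),
    try_operation("'very' / 2", lambda: 'very' / 2),
]

for line in results:
    print(line)
```

The first two lines of output show results, and the last two show the same kind of TypeError message as the tracebacks above.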

 

Printing Strings

 

So far, when we have wanted to look at the contents of a variable or see the result of a calculation, we have just typed the variable name into the interpreter. We can also see the contents of a variable using the print statement:

 

  >>> print monty

  Monty Python
 

 

Notice that there are no quotation marks this time. When we inspect a variable by typing its name in the interpreter, the interpreter prints the Python representation of its value. Since it’s a string, the result is quoted. However, when we tell the interpreter to print the contents of the variable, we don’t see quotation characters, since there are none inside the string.
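The quoted form that the interpreter shows is what the built-in repr() function returns, so the distinction can be reproduced in a script. The sketch below uses the function-call form of print so that it also runs under Python 3.

```python
monty = 'Monty Python'
circus = "Monty Python's Flying Circus"

# repr() gives the quoted form the interpreter shows for a bare variable name.
print(repr(monty))    # 'Monty Python'
print(repr(circus))   # "Monty Python's Flying Circus"

# print shows the characters themselves, so the quotes disappear.
print(monty)          # Monty Python

# The quotes are notation only: repr() adds exactly two characters here.
print(len(repr(monty)) - len(monty))   # 2
```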

The print statement allows us to display more than one item on a line in various ways, as shown here:

 

  >>> grail = 'Holy Grail'
  >>> print monty + grail
  Monty PythonHoly Grail
  >>> print monty, grail
  Monty Python Holy Grail
  >>> print monty, "and the", grail   # a space is added automatically between items
  Monty Python and the Holy Grail

 

 

Accessing Individual Characters

 

As we saw in Section 1.2 for lists, strings are indexed, starting from zero. When we index a string, we get one of its characters (or letters). A single character is nothing special—it’s just a string of length 1.

  >>> monty[0]
  'M'
  >>> monty[3]
  't'
  >>> monty[5]
  ' '

As with lists, if we try to access an index that is outside of the string, we get an error:

 

  >>> monty[20]
  Traceback (most recent call last):
    File "<stdin>", line 1, in ?
  IndexError: string index out of range

Again as with lists, we can use negative indexes for strings, where -1 is the index of the last character. Positive and negative indexes give us two ways to refer to any position in a string. In this case, because the string has a length of 12, indexes 5 and -7 both refer to the same character (a space). (Notice that 5 = len(monty) - 7.)

  >>> monty[-1]  # note: monty = 'Monty Python'; I had briefly assumed it was only five characters long
  'n'
  >>> monty[5]
  ' '
  >>> monty[-7]
  ' '
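The relationship between positive and negative indexes holds for every position, not just for 5 and -7. The sketch below checks it across the whole string and also shows where the valid range ends; char_at is a hypothetical helper introduced only for this illustration.

```python
monty = 'Monty Python'
n = len(monty)  # 12

# For every valid positive index i, the negative index i - n names the same character.
for i in range(n):
    assert monty[i] == monty[i - n]

def char_at(text, index):
    """Return text[index], or None when the index is out of range."""
    try:
        return text[index]
    except IndexError:
        return None

print(monty[5] == monty[-7])   # True, because 5 - 12 == -7
print(char_at(monty, -12))     # M     (the most negative valid index)
print(char_at(monty, 20))      # None  (too large)
print(char_at(monty, -13))     # None  (too negative)
```

Valid indexes for a string of length n therefore run from -n to n - 1.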

We can write for loops to iterate over the characters in strings. This print statement ends with a trailing comma, which is how we tell Python not to print a newline at the end.

  >>> sent = 'colorless green ideas sleep furiously'
  >>> for char in sent:
  ...     print char,
  ...
  c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y
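The trailing comma is specific to the print statement. As a sketch of the same loop for readers on Python 3, where print is a function, the end argument takes over the role of the trailing comma. This example assumes Python 3 and will not run unchanged under Python 2.

```python
sent = 'colorless green ideas sleep furiously'

# end=' ' replaces the default newline with a space after each character.
for char in sent:
    print(char, end=' ')
print()  # finish the line

# The same spaced-out text can also be built as a single string.
spaced = ' '.join(sent)
print(spaced)
```

Joining works here because a string is itself a sequence of one-character strings.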

We can count individual characters as well. We should ignore the case distinction by normalizing everything to lowercase, and filter out non-alphabetic characters:

 

  >>> import nltk
  >>> from nltk.corpus import gutenberg
  >>> raw = gutenberg.raw('melville-moby_dick.txt')
  >>> fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
  >>> fdist.keys()
  ['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w',
  'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']

 

This gives us the letters of the alphabet, with the most frequently occurring letters listed first (this is quite complicated and we’ll explain it more carefully later). You might like to visualize the distribution using fdist.plot(). The relative character frequencies of a text can be used in automatically identifying the language of the text.
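The counting step can be tried without downloading a corpus. The sketch below swaps in collections.Counter from the standard library as a stand-in for nltk.FreqDist, and uses a short sample sentence in place of the full text of Moby Dick. Both the sample and the letter_frequencies helper are assumptions made for illustration.

```python
from collections import Counter

def letter_frequencies(text):
    """Count letters case-insensitively, skipping non-alphabetic characters."""
    return Counter(ch.lower() for ch in text if ch.isalpha())

sample = "Call me Ishmael. Some years ago, never mind how long precisely."
counts = letter_frequencies(sample)
total = sum(counts.values())

# most_common() lists (letter, count) pairs from most to least frequent.
for letter, count in counts.most_common(3):
    print(letter + ': ' + str(count) + ' of ' + str(total))
```

Dividing each count by the total gives the relative frequencies mentioned above, which are the quantities that would be compared when guessing a text's language.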

 
