Python Syntax Notes 3

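1. Regular expressions

Regular expressions are implemented in the re module.

import re
s = r'abc'
print re.findall(s, 'ababcdbabcbc') #['abc', 'abc']
s = r't[io]*p'
print re.findall(s, 'top tiop abc tip') #['top', 'tiop', 'tip'] set: only i or o allowed between t and p
s = r't[^io]*p'
print re.findall(s, 'top tiop tabcp tip') #['tabcp'] negated set: anything except i or o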

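1.1 Metacharacters

[]

Specifies a character set, for example [abc] or [a-z].

Most metacharacters lose their special meaning inside a character set: in [sjdkf$] the $ is treated as an ordinary character. A leading ^ is the exception; it negates the set instead of matching the start of a line.

^

Matches the start of a line. Unless the MULTILINE flag is set, it matches only at the start of the string; in MULTILINE mode it also matches immediately after each newline in the string.

$

Matches the end of a line, which is defined as either the end of the string or a position followed by a newline character.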
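A minimal sketch of how MULTILINE changes ^ and $, assuming a made-up three-line string (the variable names are illustrative only). print() is written with parentheses so the snippet also runs on Python 3:

```python
import re

text = "first line\nsecond line\nthird line"

# Without MULTILINE, ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)
# With MULTILINE, $ matches right before each newline and at the end of the string.
ends_multiline = re.findall(r"\w+$", text, re.M)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second', 'third']
print(ends_multiline)    # ['line', 'line', 'line']
```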

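1.2 Metacharacters and escape sequences

import re
r = r'\^abc' # \^ matches a literal caret
print re.findall(r, '^abc') #['^abc']

A backslash followed by certain characters forms a sequence with a special meaning.

A backslash can also cancel the special meaning of any metacharacter, for example \[ or \\.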

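Escape sequence    Meaning
\d                 Any decimal digit; equivalent to [0-9]
\D                 Any non-digit character; equivalent to [^0-9]
\s                 Any whitespace character; equivalent to [ \t\n\r\f\v]
\S                 Any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w                 Any alphanumeric character; equivalent to [a-zA-Z0-9_]
\W                 Any non-alphanumeric character; equivalent to [^a-zA-Z0-9_]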
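The character classes above can be tried on a short string. This is only a sketch; the sample text is invented for illustration:

```python
import re

sample = "room 42, floor_3\tEast wing"

digits = re.findall(r"\d+", sample)    # runs of decimal digits
words = re.findall(r"\w+", sample)     # runs of letters, digits and underscores
chunks = re.findall(r"\S+", sample)    # runs of non-whitespace characters

print(digits)  # ['42', '3']
print(words)   # ['room', '42', 'floor_3', 'East', 'wing']
print(chunks)  # ['room', '42,', 'floor_3', 'East', 'wing']
```

On Python 3, str patterns also let \d and \w match non-ASCII digits and letters unless the re.ASCII flag is passed, so the bracket equivalents hold exactly only for ASCII matching.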

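*

Specifies that the preceding character can be matched zero or more times.

import re
r = r'^010-[0-9]*'
print re.findall(r, '010-12345678') #['010-12345678']
r = r'^010-[0-9]{8}'
print re.findall(r, '010-1234567890') #['010-12345678']

+

Matches the preceding character one or more times.

?

Matches the preceding character zero or one time; it marks something as optional.

import re
r = r'^010-?[0-9]{8}$'
print re.findall(r, '01012345678') #['01012345678']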

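Maximal matching (greedy mode): by default *, + and ? match as much as possible.
Minimal matching (non-greedy mode): appending ? to a quantifier, as in *?, +? and ??, makes it match as little as possible.

import re
r = r'ab+'
print re.findall(r, 'abbbbbbbbbb') #['abbbbbbbbbb'] greedy: as many b as possible
r = r'ab+?'
print re.findall(r, 'abbbbbbbbbb') #['ab'] non-greedy: as few b as possible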

{m,n}

At least m and at most n repetitions of the preceding character. Omitting n, as in {2,}, removes the upper limit.

import re
r = r'ab{2,}'
print re.findall(r, 'abbbbbb') #['abbbbbb']

Expression    Meaning
{0,}          Same as *
{1,}          Same as +
{0,1}         Same as ?
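The equivalences in the table can be checked directly. This is a small sketch on an invented test string, not an exhaustive proof:

```python
import re

text = "a ab abb abbb"
pairs = [(r"ab*", r"ab{0,}"), (r"ab+", r"ab{1,}"), (r"ab?", r"ab{0,1}")]

# For each pair, check that both spellings return the same matches.
same = [re.findall(short, text) == re.findall(braces, text)
        for short, braces in pairs]

print(same)                      # [True, True, True]
print(re.findall(r"ab?", text))  # ['a', 'ab', 'ab', 'ab']
```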

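1.3 Using regular expressions in Python

The re module provides an interface to the regular expression engine: it lets you compile an RE string into a pattern object and then match with that object.

import re
p = re.compile('ab') # compile to a pattern object; faster when the pattern is reused
print p #<_sre.SRE_Pattern object at 0x0044ED40>
print p.findall('aaaabbb') #['ab']
p = re.compile('ab', re.I) # case-insensitive
print p.findall('Ab') #['Ab']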

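The backslash problem:

Prefixing a string literal with r makes it a raw string, in which Python leaves backslashes untouched.

import re
p = r'\sb' # regex \s: a whitespace character followed by b
print re.findall(p, '\sb') #[]
p = r'\\sb' # regex \\: a literal backslash followed by sb
print re.findall(p, '\\sb') #['\\sb']
p = '\\\sb' # without r: Python turns \\ into \ and keeps \s, giving the regex \\sb
print re.findall(p, '\sb') #['\\sb']
p = re.compile('\\\sb')
print p.findall('\sb') #['\\sb']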
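To see what a raw string saves, the sketch below compares the plain and raw spellings of the same pattern. The variable names are illustrative only:

```python
import re

plain = "\\d"   # plain literal: Python collapses \\ into one backslash
raw = r"\d"     # raw literal: the same two characters, written as they appear
print(plain == raw)  # True
print(len(raw))      # 2

# Matching one literal backslash needs \\ in the regex itself,
# which takes four backslashes in a plain string literal.
pattern_plain = "\\\\sb"
pattern_raw = r"\\sb"
subject = "a\\sb"    # the four characters a, \, s, b

found = re.findall(pattern_raw, subject)
print(pattern_plain == pattern_raw)  # True
print(found)                         # ['\\sb']
```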

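1.4 Performing matches

RegexObject instances have several methods and attributes:

match(): determines whether the RE matches at the beginning of the string
search(): scans the string for the first location where the RE matches
findall(): finds all substrings the RE matches and returns them as a list
finditer(): finds all substrings the RE matches and returns them as an iterator

import re
p = re.compile('ab')
print p.match('cab') #None
print p.search('cab') #<_sre.SRE_Match object at 0x01DFBB10>
print p.finditer('cababb') # an iterator over Match objects
x = p.finditer('cababb')
print x.next() #<_sre.SRE_Match object at 0x003FBB10>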

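The iterator's next() method above returns a Match object, which has the following methods:

group(): returns the string matched by the RE
start(): returns the starting position of the match
end(): returns the ending position of the match
span(): returns a tuple containing the (start, end) positions of the match

import re
p = re.compile('ab')
m = p.search('cababb') # a Match object, or None if nothing matched
if m:
    print m.group(0) #ab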
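The four Match methods listed above can be seen side by side by looping over finditer(). A short sketch, with print() written in the form that also runs on Python 3:

```python
import re

pattern = re.compile("ab")

# Collect (text, start, end, span) for every match in the string.
matches = [(m.group(), m.start(), m.end(), m.span())
           for m in pattern.finditer("cababb")]

for item in matches:
    print(item)
# ('ab', 1, 3, (1, 3))
# ('ab', 3, 5, (3, 5))
```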

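1.5 Module-level functions

The re module also provides top-level functions, such as:

match()
search()
sub()
subn()
split()
findall()

With these functions you can search for matches, replace them, or split a string on them.

import re
p = re.compile('ab')
print re.sub(p, '', 'sfabkjkbskbabaf', 1) #sfkjkbskbabaf sub replaces only the first match because count is 1
print re.split(p, 'abjkjkabjlkbaabjlj') #['', 'jkjk', 'jlkba', 'jlj']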
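subn() is listed above but not demonstrated. The sketch below shows it next to a split() on a character class; the second input string is made up for illustration:

```python
import re

p = re.compile("ab")
text = "sfabkjkbskbabaf"

# subn works like sub but returns (new_string, number_of_replacements).
result = p.subn("-", text)
print(result)  # ('sf-kjkbskb-af', 2)

# split accepts any pattern, here runs of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", "a, b;c  d")
print(parts)   # ['a', 'b', 'c', 'd']
```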

Handy Python help tricks:

print dir(re) # list the attributes and functions of a module

help(re.sub) # show the documentation of a function

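1.6 Compilation flags

Flag             Effect
DOTALL, S        Makes . match any character, including a newline
IGNORECASE, I    Makes matching case-insensitive
LOCALE, L        Makes matching locale-aware, for example for the accented characters of French
MULTILINE, M     Multi-line matching; affects ^ and $
VERBOSE, X       Enables verbose REs, which can be laid out more clearly and readably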

import re
r1 = r'itzhai.com'
print re.findall(r1, 'itzhai.com') #['itzhai.com']
print re.findall(r1, 'itzhai-com') #['itzhai-com'] the unescaped . matches any character
print re.findall(r1, 'itzhai\ncom') #[] by default . does not match a newline
print re.findall(r1, 'itzhai\ncom', re.S) #['itzhai\ncom']
# multi-line
text = """abskdfjaks
abkjaslkfa
abkjskfa
"""
r2 = r'^ab.*'
print re.findall(r2, text, re.M) #['abskdfjaks', 'abkjaslkfa', 'abkjskfa']
# verbose mode
r3 = r"""
\d{3}
-?
\d{8}
"""
print re.findall(r3, '010-12345678', re.X) #['010-12345678']
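Verbose mode also allows comments inside the pattern, and flags can be combined with |. The sketch below reuses the phone-number pattern from the notes; the comment labels and the second test string are assumptions added for illustration:

```python
import re

phone = re.compile(r"""
    \d{3}   # area code
    -?      # optional hyphen
    \d{8}   # subscriber number
""", re.X)

numbers = phone.findall("010-12345678 or 01087654321")
print(numbers)   # ['010-12345678', '01087654321']

# re.I | re.S: ignore case and let . match the newline.
combined = re.findall(r"^ab.c$", "AB\nC", re.I | re.S)
print(combined)  # ['AB\nC']
```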

1.7 Groups

Use parentheses to create groups.

import re
email = r'\w{3,}@\w+(\.com|\.cn)'
print re.match(email, 'admin@itzhai.com') #<_sre.SRE_Match object at 0x02554D60>
m = re.match(email, 'admin@itzhai.com')
print m.group(0) + ", " + m.group(1) #admin@itzhai.com, .com
# when the pattern contains a group, findall returns the group's text instead of the whole match:
print re.findall(email, 'admin@itzhai.com') #['.com']
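If the whole address is wanted from findall(), one option is a non-capturing group, written (?:...). A sketch under that assumption; the second address is hypothetical:

```python
import re

# (?:...) groups the alternatives without capturing, so findall returns whole matches.
email_nc = r"\w{3,}@\w+(?:\.com|\.cn)"
whole = re.findall(email_nc, "admin@itzhai.com, root@example.cn")
print(whole)        # ['admin@itzhai.com', 'root@example.cn']

# With several capturing groups, groups() returns all of them as a tuple.
m = re.match(r"(\w{3,})@(\w+)(\.com|\.cn)", "admin@itzhai.com")
print(m.groups())   # ('admin', 'itzhai', '.com')
```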

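1.8 A crawler

The following example crawls a page and downloads all of its images:

import re
import urllib # Python 2 API; in Python 3 urlopen and urlretrieve live in urllib.request

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    p = r'src="(.*?\.jpg-itzhai)"'
    imgp = re.compile(p)
    imgList = re.findall(imgp, html)
    # save each image under a running number: 0.jpg, 1.jpg, ...
    for index, url in enumerate(imgList):
        urllib.urlretrieve(url, '%d.jpg' % index)

getImg(getHtml("http://www.csdn.net/"))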

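2. How Python uses memory

2.1 Shallow copy and deep copy

Shallow copy: copies only the references to the contained objects.

Deep copy: copies the contained objects themselves.

The following demonstrates assigning a list and shallow-copying it:

import copy
# two names that refer to the same object
a = [1, 2, 3, 'test']
b = a
print id(a) #39511344
print id(b) #39511344
a.append('hello')
print a #[1, 2, 3, 'test', 'hello']
print b #[1, 2, 3, 'test', 'hello']
a.append([1,2])
# copy.copy creates a new outer list, so a and c are different objects
c = copy.copy(a)
print id(a) #39511344 (appending does not change the id of a)
print id(c) #39515600
a[5][0] = 'new'
print a #[1, 2, 3, 'test', 'hello', ['new', 2]]
print c #[1, 2, 3, 'test', 'hello', ['new', 2]]

Changing an element of the nested list through a also changed it in c. That is the effect of a shallow copy: the inner list in both a and c still refers to the same object in memory.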

The following performs a deep copy:

import copy
a = [1, 2, 3, 'test']
a.append('hello')
a.append([1,2])
# copy.deepcopy also copies the nested list, so changes made through a do not affect d
d = copy.deepcopy(a)
print id(a) #39644176
print id(d) #39648992
a[5][0] = 'new'
print a #[1, 2, 3, 'test', 'hello', ['new', 2]]
print d #[1, 2, 3, 'test', 'hello', [1, 2]]
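The difference between the two copies can also be checked with the is operator instead of comparing id() values by eye. A minimal sketch with a shorter list than the one in the notes:

```python
import copy

a = [1, 2, [3, 4]]
shallow = copy.copy(a)
deep = copy.deepcopy(a)

print(shallow is a)        # False: the outer list is a new object
print(shallow[2] is a[2])  # True: the inner list is shared
print(deep[2] is a[2])     # False: the inner list was copied too

a[2].append(5)
print(shallow[2])          # [3, 4, 5]
print(deep[2])             # [3, 4]
```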

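3. Files and directories

3.1 Reading and writing files

Python reads and writes files through the built-in function open, or through file, which exists only in Python 2.

file_handler = open(filename, mode)

Two ways to read a file:

file1 = open('test.txt')
print file1.read()
file1.close()
file2 = file('test.txt')
print file2.read()
file2.close()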

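mode    Effect
r       Read only
r+      Read and write
w       Write; truncates the existing file first, and creates the file if it does not exist
w+      Read and write; truncates the existing file first, and creates the file if it does not exist
a       Write; appends new content to the end of the file, and creates the file if it does not exist
a+      Read and write; appends new content to the end of the file, and creates the file if it does not exist
b       Opens the file in binary mode; can be combined with r, w, a and +
U       Universal newlines: accepts all line endings, '\r', '\n' and '\r\n'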
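The sketch below walks through w, a and r+ on a scratch file. The file name and the use of tempfile are assumptions for illustration, and the with statement is used only so that each file is closed automatically:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical scratch file

with open(path, "w") as f:    # 'w' creates the file, or truncates an existing one
    f.write("first\n")
with open(path, "a") as f:    # 'a' appends to the end
    f.write("second\n")
with open(path, "r+") as f:   # 'r+' reads and writes without truncating
    f.write("F")              # overwrites the first character
with open(path, "r") as f:
    content = f.read()
print(repr(content))          # 'First\nsecond\n'

with open(path, "w") as f:    # opening with 'w' again discards the old content
    pass
size_after_w = os.path.getsize(path)
print(size_after_w)           # 0
```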

# write a file
file3 = open('test.txt', 'w')
file3.write('last commit...')
file3.close()
file3 = open('test.txt')
print file3.read() #last commit...
file3.close()
file4 = open('test.txt', 'r+')
file4.write('a') # overwrites the first character
file4.close()
file4 = open('test.txt', 'r') # reopen after closing so the file pointer is reset and reading starts from the beginning
print file4.read() #aast commit...
file4.close()

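3.2 File object methods

FileObject.close()
String = FileObject.readline([size]) size: read at most the first size characters
List = FileObject.readlines([size])
String = FileObject.read([size])
FileObject.next()
FileObject.write(string)
FileObject.writelines(List)
FileObject.seek(offset, whence)
whence: 0 positions the file pointer offset bytes from the start of the file
1 moves the file pointer offset bytes from its current position
2 positions the file pointer offset bytes from the end of the file (use a negative offset to move backwards)
FileObject.flush()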
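The three seek() options can be tried on a ten-byte scratch file. This is a sketch with an invented file name; it opens the file in binary mode because Python 3 restricts relative seeks on text files:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek.bin")  # hypothetical scratch file
with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "rb") as f:
    f.seek(3, 0)               # 3 bytes from the start
    from_start = f.read(2)     # reads b'34', pointer now at 5
    f.seek(2, 1)               # 2 bytes forward from the current position, to 7
    from_current = f.read(1)   # reads b'7'
    f.seek(-3, 2)              # 3 bytes before the end
    from_end = f.read()        # reads b'789'
    position = f.tell()        # 10, the end of the file

print(position)  # 10
```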

# read a file line by line
file1 = open('test.txt')
for i in file1:
    print i
file1.close()

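3. The os module

3.1 Directory operations

Directory operations cover creating, modifying and traversing directories from Python.

import os

Directory operations require the os module.

For example:

os.mkdir('text2.txt') # creates a directory named text2.txt

Other functions:

mkdir(path[,mode=0777])
makedirs(name, mode=511) # creates intermediate directories recursively
rmdir(path)
removedirs(path) # removes directories recursively
listdir(path) # returns the entries of the given directory as a list

os.listdir('.')
os.listdir('/')

getcwd() # returns the current working directory

chdir(path) # changes the current working directory

walk(top, topdown=True, onerror=None)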
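The functions above can be tried safely inside a temporary directory. A sketch, assuming the directory names a, b and c, which are made up for the example:

```python
import os
import tempfile

base = tempfile.mkdtemp()              # scratch directory, name chosen by tempfile
nested = os.path.join(base, "a", "b")
os.makedirs(nested)                    # creates a and a/b in one call
os.mkdir(os.path.join(base, "c"))      # creates a single directory
created = sorted(os.listdir(base))
print(created)                         # ['a', 'c']

os.removedirs(nested)                  # removes b, then a; stops at base, which still holds c
remaining = os.listdir(base)
print(remaining)                       # ['c']

os.rmdir(os.path.join(base, "c"))      # rmdir only removes an empty directory
os.rmdir(base)
print(os.path.exists(base))            # False
```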

3.2 Directory traversal

3.2.1 Recursive function

import os

def listAllFiles(path):
    files = os.listdir(path)
    allfiles = []
    for filename in files:
        filepath = os.path.join(path, filename)
        if os.path.isdir(filepath):
            # extend keeps the result a flat list of paths
            allfiles.extend(listAllFiles(filepath))
        allfiles.append(filepath)
    return allfiles

print listAllFiles('/dev/workspace')

3.2.2 The os.walk() function

import os

generator = os.walk('/dev/workspace')
for path, dirs, files in generator:
    for filename in files:
        print os.path.join(path, filename)
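Since /dev/workspace may not exist on another machine, the same loop can be tried on a small throwaway tree. The file and directory names below are invented for the sketch:

```python
import os
import tempfile

root = tempfile.mkdtemp()                      # hypothetical scratch tree
os.makedirs(os.path.join(root, "src", "pkg"))
for rel in ("README.txt",
            os.path.join("src", "main.py"),
            os.path.join("src", "pkg", "mod.py")):
    open(os.path.join(root, rel), "w").close()  # create an empty file

found = []
for path, dirs, files in os.walk(root):
    for filename in files:
        # store each path relative to root so the result is easy to read
        found.append(os.path.relpath(os.path.join(path, filename), root))
found.sort()
print(found)
```

os.walk() yields one (path, dirs, files) tuple per directory, so no explicit recursion is needed.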

4. Exception handling

f = None
try:
    f = open('a.txt', 'r')
    print asdf
except IOError, e:
    print False, str(e) #False [Errno 2] No such file or directory: 'a.txt'
except NameError, e:
    print False, str(e)
    print 'An undefined variable was used'
finally:
    if f:
        f.close() # f is set to None before try; without that, a failed open() leaves f undefined and this check raises NameError: name 'f' is not defined

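4.1 How exceptions propagate

When an exception occurs at run time, the interpreter looks for a matching handler. If the current function has none, the exception propagates up to the caller, much as in Java. If no handler is found even at the outermost (global) level, the interpreter exits and prints a traceback so that the user can find the cause of the error.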
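The lookup described above can be followed in a three-function call chain. The function names are made up for the sketch:

```python
def parse(text):
    return int(text)          # raises ValueError for non-numeric text

def load(text):
    return parse(text) * 2    # no handler here, so the exception passes through

def main(text):
    try:
        return load(text)
    except ValueError as e:   # first matching handler up the call chain
        return "handled in main: %s" % type(e).__name__

print(main("21"))   # 42
print(main("abc"))  # handled in main: ValueError
```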

An exception does not always indicate an error: it can be a warning, or a termination signal, for example one that ends a loop.
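StopIteration is one such signal: an iterator raises it when it has no more items, and a for loop catches it internally. The sketch below does by hand what the for loop does:

```python
it = iter([1, 2])
seen = []
while True:
    try:
        seen.append(next(it))
    except StopIteration:   # the iterator is exhausted; this is not a failure
        break

print(seen)  # [1, 2]
```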

4.2 Raising exceptions with raise

Use raise to raise an exception:

raise TypeError("'a' must be an integer")
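In context, a raise usually sits behind a check and is caught by the caller. A sketch with a hypothetical function half():

```python
def half(a):
    # Reject anything that is not an integer before doing the arithmetic.
    if not isinstance(a, int):
        raise TypeError("'a' must be an integer")
    return a // 2

try:
    half("10")
except TypeError as e:
    message = str(e)

print(half(10))  # 5
print(message)   # 'a' must be an integer
```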

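Exception            Description
AssertionError       An assert statement failed
AttributeError       Tried to access an attribute that the object does not have
IOError              Input/output error
ImportError          A module or package could not be imported
IndentationError     Incorrect indentation; a subclass of SyntaxError
IndexError           An index is outside the bounds of the sequence
KeyError             Tried to access a key that is not in the dictionary
KeyboardInterrupt    Raised when Ctrl+C is pressed
NameError            Used a name that has not been bound to an object
SyntaxError          The code is not syntactically valid Python and cannot run
TypeError            An object of an inappropriate type was passed
UnboundLocalError    Tried to access a local variable before it was assigned
ValueError           An argument has the right type but an unexpected value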
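Several of these exceptions can be triggered on purpose to see which one each mistake produces. A sketch; the helper error_name() and the function bump() are invented for the example:

```python
def error_name(action):
    """Run action() and return the name of the exception it raises, if any."""
    try:
        action()
    except Exception as e:
        return type(e).__name__
    return None

def bump():
    total += 1   # total is local because it is assigned here, but it has no value yet

names = [
    error_name(lambda: [1, 2][5]),         # IndexError
    error_name(lambda: {}["missing"]),     # KeyError
    error_name(lambda: int("abc")),        # ValueError
    error_name(lambda: 1 + "1"),           # TypeError
    error_name(lambda: object().missing),  # AttributeError
    error_name(lambda: undefined_name),    # NameError
    error_name(bump),                      # UnboundLocalError
]
print(names)
```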
