Python

1 Language

The ultimate reference:

For python 3

An atom is the most basic expression. Of its forms, the enclosure is the interesting one.

atom      ::=  identifier | literal | enclosure
enclosure ::=  parenth_form | list_display | dict_display | set_display
               | generator_expression | yield_atom

A display is special syntax for creating lists, sets and dicts, either from listed elements or with a comprehension.

list_display ::=  "[" [starred_list | comprehension] "]"
set_display ::=  "{" (starred_list | comprehension) "}"

dict_display       ::=  "{" [key_datum_list | dict_comprehension] "}"
key_datum_list     ::=  key_datum ("," key_datum)* [","]
key_datum          ::=  expression ":" expression | "**" or_expr
dict_comprehension ::=  expression ":" expression comp_for

And the comprehension grammar:

comprehension ::=  expression comp_for
comp_for      ::=  [ASYNC] "for" target_list "in" or_test [comp_iter]
comp_iter     ::=  comp_for | comp_if
comp_if       ::=  "if" expression_nocond [comp_iter]

The comp_for non-terminal contains one or more for clauses and zero or more if clauses.

The elements of a display are evaluated from left to right, so for sets and dicts the last assignment of a duplicate entry prevails. A comprehension is executed by nesting its for and if clauses from left to right.

So some examples:

((a,b) for a in range(2) for b in range(3))
[(a,b) for a in range(2) for b in range(3)]
{(a,b) for a in range(2) for b in range(3)}

{a:b for a in (1,2) for b in (3,4)}

Note that the outermost brackets are required. A comprehension also creates its own scope.
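A small sketch of the left-to-right nesting rule: the comprehension is written next to the explicit loops it corresponds to. The names pairs, loop_pairs and last_wins are made up for illustration.

```python
# A comprehension with two for clauses and one if clause ...
pairs = [(a, b) for a in range(2) for b in range(3) if a != b]

# ... nests the clauses from left to right, like these loops.
loop_pairs = []
for a in range(2):
    for b in range(3):
        if a != b:
            loop_pairs.append((a, b))

# In a dict comprehension a repeated key keeps the last value assigned.
last_wins = {a: b for a in (1, 2) for b in (3, 4)}

print(pairs == loop_pairs)  # True
print(last_wins)            # {1: 4, 2: 4}
```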

2 Emacs support

Install the elpy package. It provides:

  • C-c C-c starts the shell and sends the current buffer to it
  • C-c C-d runs elpy-doc
  • C-c C-t runs elpy-test, which runs unittest discovery

To lint Python in Emacs, use pylint. This requires the pylint executable and a configuration file. Generate the latter with:

pylint --generate-rcfile > ~/.pylintrc

3 Unit Test

import unittest

class MyTest(unittest.TestCase):
    def test_me(self):
        self.assertEqual(1, 2)  # fails, since 1 != 2

if __name__ == '__main__':
    unittest.main()

Python unit tests support automatic test discovery. To use it, name the file test_xxx.py and run python -m unittest discover.

4 Concept

4.1 Scoping

There are four levels:

  • current scope
  • parent scope
  • module scope (global)
  • built-in scope

The nonlocal keyword specifies that a variable refers to the binding in the parent (enclosing function) scope. It never reaches the global scope. The global keyword, in contrast, declares the listed variables to be in the module-level scope.

The nonlocal statement causes the listed identifiers to refer to previously bound variables in the nearest enclosing scope excluding globals.

As an example:

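var = 0 # global

def outer():
  var = 1 # parent
  def inner():
    nonlocal var
    var = 2 # rebinds the var of outer
  # nonlocal and global cannot both be declared for the same name
  # in one function (SyntaxError), so use a second function
  def set_global():
    global var
    var = 3 # rebinds the module-level var
  inner()
  set_global()
  # var = 2 here

outer()
# global var = 3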

4.2 Collection

4.2.1 String

4.2.1.1 Concatenation
  • concatenate two strings directly with +
  • convert an integer to a string before concatenating: s + str(35)
  • "".join(lst) concatenates a list of strings
4.2.1.2 split
  • str.split(sep=None): split on whitespace by default
  • str.strip(): strip whitespace from both the beginning and the end
  • str.replace(old, new): replace all occurrences
  • str.startswith(s)
  • str.endswith(s)
4.2.1.3 Slicing

A string is an immutable sequence and supports slicing. E.g. reversing a string is as easy as "hello"[::-1]!

However, notice that with a negative step the slice must be written lst[end:begin:-1], with the larger index first. This is because x = i + n*k:

with a third “step” parameter: a[i:j:k] selects all items of a with index x where x = i + n*k, n >= 0 and i <= x < j.

For a negative k the bounds are reversed: i >= x > j.

Also, the negative step does not always work as expected. The i index is included and j is not, and a negative j counts from the end of the sequence rather than meaning "before index 0". To include the first item, omit j, as in lst[i::-1].

Thus to reverse a sub-string, it is often simplest to take the sub-string first and then reverse it.
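The negative-step cases can be checked on a short string. This is only a sketch; the string and the variable names are chosen for illustration.

```python
s = "abcdef"

# Negative step: start at index 3, walk backwards, stop before index 0.
excluding_first = s[3:0:-1]       # 'dcb'

# Omitting the stop (or passing None) runs through index 0.
including_first = s[3::-1]        # 'dcba'
with_none = s[3:None:-1]          # 'dcba'

# A stop of -1 means "the last item", so this slice is empty.
empty = s[3:-1:-1]                # ''

# Slice first, then reverse: the sub-string s[1:4] backwards.
reversed_sub = s[1:4][::-1]       # 'dcb'

print(excluding_first, including_first, with_none, repr(empty), reversed_sub)
```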

4.2.2 TODO tuple

4.2.3 List

4.2.3.1 Slicing

The slicing syntax is l[start:end:step]. Slicing returns a new list; changing that list does not change the original one.

l[4]
l[4:]
l[::2]
l[:-1]

However, assigning to a slice changes the original list:

l[1:2] = [4,5,6]

Also, assigning a list to a new variable only copies the reference:

a = [1,2,3]
b = a # only a reference
4.2.3.2 create a list
  • list(range(stop)): in Python 3, range returns a lazy range object, so wrap it in list()
  • list(range(start, stop[, step]))

Creating a matrix:

newmat=[[-1 for x in range(height)] for y in range(width)]
4.2.3.3 Modify a list
  • list.append
  • list.pop

4.2.4 Dictionary

Create:

x = {'a': 1, 'b': 2}

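A dictionary is not sorted by key. Since Python 3.7 a plain dict remembers the order in which elements were inserted; on older versions use collections.OrderedDict for that feature. To get the items in key order, build the dictionary from the sorted items: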

import collections
od = collections.OrderedDict(sorted(d.items()))

Merge two dictionaries (x and y):

z = x.copy()
z.update(y)

4.2.5 Set

s = set()
s.add(x)
if x in s:
  pass

4.3 Algorithm

4.3.1 TODO sort

Sort a dictionary by value (the result is a list of keys):

sorted(dict1, key=dict1.get) # => list
sorted(dict1, key=dict1.get, reverse=True)

4.4 Function

4.4.1 variadic parameter

Use the *args syntax; args will be a tuple:

def foo(*args):
  for a in args:
    print(a)

Use **kwargs to capture all keyword arguments; kwargs will be a dict:

def bar(**kwargs):
  for a in kwargs:
    print(a, kwargs[a])

Combine them together:

def foobar(kind, *args, **kwargs):
  pass

The reverse also exists: unpack an argument list from a list with *list:

def foo(a,b):
  pass

l = [1,2]
foo(*l)

In Python 3, this syntax can also appear on the left side of an assignment:

first, *rest = [1,2,3,4]
first,*l,last = [1,2,3,4]

4.5 Exception

To give a quick feel:

try:
  pass
except TypeError as e: # capture the exception into a variable
  pass
except AnotherError: # does not capture
  pass
except: # all other exceptions
  pass
else: # runs if no exception was raised
  pass
finally:
  pass

4.6 Lambda

lambda x : x+2
lambda x: x%2==0

Lambdas are often used with map and filter.

  • map(lambda_exp, mylist) applies the lambda expression to each element of the list. In Python 3 it returns a lazy iterator; wrap it in list() to get a list of the results.
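A minimal sketch using the two lambdas above with map and filter; the list numbers is just sample data.

```python
numbers = [1, 2, 3, 4]

# map and filter return lazy iterators in Python 3; list() materializes them.
plus_two = list(map(lambda x: x + 2, numbers))
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(plus_two)  # [3, 4, 5, 6]
print(evens)     # [2, 4]
```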

4.7 Packaging

Exposing API: with the following, from module import * exposes only foo but not bar.

__all__ = ['foo']
def foo():
  pass
def bar():
  pass

4.7.1 importing

Each package directory should contain an __init__.py file so that it can be imported as a regular package.

|-- main.py
|-- mypackage
    |-- __init__.py
    |-- a.py
    |-- b.py
    |-- subdir
        |-- __init__.py
        |-- c.py

The import statements should be:

from mypackage import a
from mypackage.b import foo as myfoo
from mypackage.subdir import c

4.8 Thread

from threading import Thread

class MyThread(Thread):
  def __init__(self, arg):
    Thread.__init__(self)
    self.arg = arg
  def run(self):
    pass

t = MyThread(arg)
t.start()

5 Type

5.1 Boolean

  • not True

5.2 Integer

  • i += 1

5.3 conversion

  • string to integer: int('45')
  • integer to string: str(45)
  • ASCII to char: chr(100) returns 'd'
  • char to ASCII: ord('d') returns 100

6 Black Tech

If else or:

var = d.get('key') or 0
# is equivalent to:
var = d.get('key') if d.get('key') else 0

list comprehension

even_squares = [x**2 for x in l if x%2 == 0]

7 Pep8

Indent:

  • functions and classes should be separated by 2 blank lines
  • in a class, methods should be separated by 1 blank line
  • 1 space before and after the = of a variable assignment

Naming

  • function, variable, attribute: func_var_attr
  • protected instance attributes: _protected_field
  • private instance attributes: __private_field
  • class and exception: ClassExceptionName
  • module level constants: CONSTANT
  • instance methods should use self as the first parameter, which refers to the object
  • class methods should use cls as the first parameter, which refers to the class

Expression

use             DON'T use
a is not b      not a is b
if not list     if len(list) == 0

Import

  • always use absolute imports
  • if you must use a relative import, use the explicit from . import foo instead of import foo

7.1 document

A docstring can be one line or multiple lines. It can be retrieved with func.__doc__.

def func():
  """one line doc"""

def func():
  """The outline

  The above empty line is required.
  Here's the detailed documentation.
  """

8 IO

print('xxx', end='')

f = open('text.txt')
f.read() # return all content

f = open('text.txt')
for line in f:
    print(line)

with open('a.txt') as f:
    for line in f:
        print(line)

read from stdin:

import sys
for line in sys.stdin:
  print(line)

get command line argument: sys.argv

9 Operating System

9.1 Working with the filesystem

import os
for root,dirs,files in os.walk('.'):
  for f in files:
    print(f)

  • os.path.abspath('relative/path/to/file')
  • os.path.exists("/path/to/file")
  • os.rename('old', 'new')
  • os.path.isfile

9.2 Shell command

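  • os.system: simply run a command

os.system("some command")

  • os.popen: run a command with access to its output stream

stream = os.popen("some command")
stream.read()

  • subprocess.Popen

import subprocess
p = subprocess.Popen("echo Hello World", shell=True, stdout=subprocess.PIPE)
# either read the output (bytes) directly ...
p.stdout.read()
# ... or, instead of reading it, feed it to another command.
# Without shell=True the command must be a list, not a single string.
s = subprocess.check_output(['wc', '-l'], stdin=p.stdout)

  • subprocess.call: the same as subprocess.Popen, except that it waits for the command and returns its return code

return_code = subprocess.call("echo Hello World", shell=True, stdout=subprocess.DEVNULL)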

10 Standard Library

10.1 Built-in exceptions

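The tree below is the Python 2 hierarchy. In Python 3, StandardError no longer exists (its children derive directly from Exception), and EnvironmentError, IOError and WindowsError are aliases of OSError.

BaseException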
 +-- SystemExit
 +-- KeyboardInterrupt
 +-- GeneratorExit
 +-- Exception
      +-- StopIteration
      +-- StandardError
      |    +-- BufferError
      |    +-- ArithmeticError
      |    |    +-- FloatingPointError
      |    |    +-- OverflowError
      |    |    +-- ZeroDivisionError
      |    +-- AssertionError
      |    +-- AttributeError
      |    +-- EnvironmentError
      |    |    +-- IOError
      |    |    +-- OSError
      |    |         +-- WindowsError (Windows)
      |    |         +-- VMSError (VMS)
      |    +-- EOFError
      |    +-- ImportError
      |    +-- LookupError
      |    |    +-- IndexError
      |    |    +-- KeyError
      |    +-- MemoryError
      |    +-- NameError
      |    |    +-- UnboundLocalError
      |    +-- ReferenceError
      |    +-- RuntimeError
      |    |    +-- NotImplementedError
      |    +-- SyntaxError
      |    |    +-- IndentationError
      |    |         +-- TabError
      |    +-- SystemError
      |    +-- TypeError
      |    +-- ValueError
      |         +-- UnicodeError
      |              +-- UnicodeDecodeError
      |              +-- UnicodeEncodeError
      |              +-- UnicodeTranslateError
      +-- Warning
           +-- DeprecationWarning
           +-- PendingDeprecationWarning
           +-- RuntimeWarning
           +-- SyntaxWarning
           +-- UserWarning
           +-- FutureWarning
           +-- ImportWarning
           +-- UnicodeWarning
           +-- BytesWarning

10.2 Built-in

These functions are always available.

Numbers:

  • abs(x): absolute value
  • divmod(a,b): a pair (a // b, a % b)
  • max(arg1, arg2, *args)
  • min(arg1, arg2, *args)
  • pow(x,y): x to the power y, same as x**y
  • round(x, ndigits=0)
  • sum(iterable)

Conversion

  • int(x)
  • float(x)
  • long(x): Python 2 only; in Python 3 int has arbitrary precision
  • chr(x): code point (ASCII) to char
  • ord(c): char to code point (ASCII)
  • bool(x): convert x to bool
  • hex(x): convert integer to lowercase hex string prefix with '0x'
  • oct(x): integer to octal string
  • bin(x): an integer to binary string

Boolean:

  • all(iterable): true if all items are true. empty => True
  • any(iterable): true if any item is true. empty => False
  • cmp(x,y): Python 2 only, removed in Python 3
    • x<y => negative
    • x=y => 0
    • x>y => positive

Symbol Table

  • locals()
  • globals()
  • dir()

Creation

  • dict
  • list
  • set
  • tuple

Other

  • len(s): length
  • next(iterator)
  • print(*objects, sep=' ', end='\n', file=sys.stdout)
  • range(stop): [0,stop)
  • range(start, stop, step=1)
  • sorted(iterable, key=None, reverse=False): the cmp parameter exists only in Python 2
    • key=lambda x: x[1]
  • type(obj): get the type of obj
  • open(name, mode): return an object of file type.
    • r,w,a,b; + for read and write

10.3 Printing

  • pprint.pprint(object, stream=None): pretty print
  • 'string {0}, {hello}'.format('yes', hello=2)

10.4 File System

10.4.1 os.path

If parameter is not listed, it means a single path.

  • exists: GOOD. check whether a path exists
  • split: return a pair (head, tail). tail is the last component, without slash. If path ends with slash, tail is empty
    • basename: the tail of the split output
    • dirname: head of split output
  • normpath: collapse redundant separators and up level references
  • abspath: from relative to absolute path. normpath(join(os.getcwd(), path))
  • commonprefix(list): return the longest common prefix, compared character by character (so the result may not be a valid path)
  • expanduser: replace an initial ~ with the user's home directory.
  • getsize: in bytes
  • isabs: predicate for absolute
  • isfile:
  • isdir
  • islink
  • join(path, *paths): join intelligently
  • realpath: canonical path by following symbolic links

10.4.2 TODO pathlib

10.4.3 TODO tempfile

10.5 os

10.5.1 Env

  • os.environ['HOME']
  • os.getenv(name)
  • os.putenv(name, value)
  • os.unsetenv(name)

10.5.2 Filesystem

  • os.getcwd(): current working directory
  • os.chdir(path): change cwd
  • os.mkdir(path)
  • os.listdir(path='.'): list all in this dir. E.g. for item in os.listdir('/path'): print (item)
  • os.makedirs(path): GOOD. this is the way to go to make directories, creating intermediate ones as needed
  • os.remove(path): remove a file
  • os.rmdir(path): remove a directory; only works if it is empty
  • os.removedirs(path): for foo/bar/aaa it will try to remove aaa, then bar, then foo. Don't use! To recursively remove all contents, use shutil.rmtree
  • os.rename(src, dst)
  • os.renames(old, new)
  • os.tempnam(): a reasonable absolute name for creating a temporary file
    • vulnerable to race conditions and removed in Python 3; use the tempfile module instead
  • os.walk(top, topdown=True): for each directory including top itself, it yields 3-tuple (dirpath, dirnames, filenames). E.g. for root,dirs,files in os.walk('/path'): for f in files: print (f);

10.5.3 shutil

  • copy(src,dst)
  • copytree(src, dst): recursive
  • rmtree(path): rm -r
  • move(src, dst)

popen family is deprecated. Use subprocess.

10.5.4 Process

  • os.abort()
  • os.execl(path, arg0, arg1, …)
  • os.execle(path, arg0, arg1, …, env)
  • os.execlp(file, arg0, arg1, …)
  • os.execlpe(file, arg0, arg1, …, env)
  • os.execv(path, args)
  • os.execve(path, args, env)
  • os.execvp(file, args)
  • os.execvpe(file, args, env)
  • os.fork()
  • os.wait()
  • os.system(cmd): run cmd, return exit code
  • os.times(): 5-tuple
    • user time
    • system time
    • children's user time
    • children's system time
    • elapsed real time

10.6 io

  • f = open('file.txt')
  • f = io.StringIO("some string"): in memory text stream
  • f = open('file', 'rb')
  • f = io.BytesIO(b"some binary data \x00\x01")
  • support with statement: with open('file.txt') as file:

10.6.1 IOBase

Methods:

  • close()
  • flush()
  • readline(): return one line
  • readlines(): return a list of lines
  • seek(offset, whence=0): whence says what the offset is relative to
    • 0 start
    • 1 current
    • 2 end
  • tell(): current position
  • writelines(lines): write a list of lines

10.6.2 RawIOBase : IOBase (should not use directly)

  • read()
  • readall()
  • readinto(b)
  • write(b)

10.6.3 BufferedIOBase

  • read(): read all
  • write(b)

10.6.4 FileIO : RawIOBase

10.6.5 BytesIO : BufferedIOBase

10.6.6 BufferedReader(raw)

  • peek()
  • read()

10.6.7 BufferedWriter(raw)

  • flush()
  • write()

10.6.8 TextIOBase : IOBase

  • read()
  • readline(size=-1)
  • seek(offset, whence=0)
  • tell()
  • write(s): finally the string!

10.6.9 TextIOWrapper(buffer) : TextIOBase

10.6.10 StringIO

  • getvalue()
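A minimal sketch of the in-memory streams, showing readline, readlines, seek and getvalue together. The contents are sample data.

```python
import io

text = io.StringIO("first\nsecond\n")
first_line = text.readline()   # 'first\n'
rest = text.readlines()        # ['second\n']

text.seek(0, io.SEEK_END)      # move to the end before appending
text.write("third\n")
whole = text.getvalue()        # the entire buffer, regardless of position

data = io.BytesIO(b"\x00\x01\x02")
data.seek(1)                   # whence defaults to 0 (start of stream)
tail_bytes = data.read()       # b'\x01\x02'

print(repr(first_line), rest, repr(whole), tail_bytes)
```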

10.7 time

  • time.sleep(secs)
  • time.time(): time in seconds since epoch
  • strptime(string[, format]): parse a string into time object
    • format default: "%a %b %d %H:%M:%S %Y"
    • time.strptime("30 Nov 00", "%d %b %y")
  • strftime(format[, t]): convert from time object to string
    • %a/A: abbr/full weekday name
    • %b/B: abbr/full month name
    • %Y: year
    • %m: month [01,12]
    • %d: day of the month [01,31]
    • %H: 24-hour [00,23]
    • %I: 12-hour [01,12]
    • %p: AM or PM
    • %M: Minute [00,59]
    • %S: second [00,61]
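  • gmtime([secs]): convert seconds since the epoch to a struct_time in UTC; uses the current time if secs is omitted
  • localtime([secs]): like gmtime() but converts to local time
  • clock(): processor time as a floating point number in seconds (removed in Python 3.8; use time.perf_counter() or time.process_time())

class time.struct_time: returned by gmtime(), localtime() and strptime()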
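A quick round trip between strings and struct_time, as a sketch. It assumes the default C locale for the month abbreviation.

```python
import time

# Parse a string into a struct_time using an explicit format.
parsed = time.strptime("30 Nov 00", "%d %b %y")
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)  # 2000 11 30

# Format a struct_time back into a string.
formatted = time.strftime("%Y-%m-%d %H:%M:%S", parsed)
print(formatted)  # 2000-11-30 00:00:00

# gmtime() converts seconds since the epoch into a UTC struct_time.
epoch_day = time.strftime("%Y-%m-%d", time.gmtime(0))
print(epoch_day)  # 1970-01-01
```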

10.8 argparse

import argparse
parser = argparse.ArgumentParser(description='Description here')

parser.add_argument('-q', '--query', help='query github api', required=True)
parser.add_argument('-d', '--download', help='do download', action='store_true')

args = parser.parse_args()

The most interesting method is of course add_argument. It accepts the name: either a single string such as 'bar', indicating a positional argument, or strings starting with -, indicating an optional argument. You can supply both a short and a full name, as in parser.add_argument('-f', '--foo'). The value is stored as an attribute of the result with the same name (i.e. bar, foo), but you can change it to another name via the dest argument.

An action defines what to do with the argument. It is a string (!!!). The default is 'store', meaning the supplied value is stored in the result. If you don't need the value but just want to know whether the option was supplied, use 'store_true' or 'store_false', which store True and False respectively and default to the opposite value. The action 'append' collects each occurrence of the argument into a list.

By default, each option consumes one argument. You can change this with the nargs argument. If it is an integer, it says how many arguments are consumed. The result is then a list, so nargs=1 still differs from the default. It can also be one of the strings '*', '+', '?', with the same meaning as in regular expressions. '*' and '+' produce a list, '+' gives an error when no arguments are provided, and '?' uses the default if the value is missing.

For a missing value, the default argument supplies the default value; otherwise it is None. You can also use the required argument to make sure the user supplies something. A value is a string by default; you can convert it to another data type with the type argument, which accepts a data type, e.g. int. You might also want to restrict the allowed values, in which case choices is a list of them.

Finally, the help argument provides a help string, which is printed by parser.print_help(). To test the parser, you can use parser.parse_args(['-f', '1', 'bar']).
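The options described above can be tried together in one parser. This is only a sketch: the argument names (bar, --foo, --tag, --mode, --files) are hypothetical, and the argument list is passed explicitly instead of being read from the command line.

```python
import argparse

parser = argparse.ArgumentParser(description="Demo parser")
parser.add_argument("bar")                                   # positional
parser.add_argument("-f", "--foo", type=int, default=0)      # converted to int
parser.add_argument("-v", "--verbose", action="store_true")  # False unless given
parser.add_argument("-t", "--tag", action="append")          # collected into a list
parser.add_argument("--mode", choices=["fast", "slow"], default="fast")
parser.add_argument("--files", nargs="*", default=[])        # zero or more values

args = parser.parse_args(
    ["input", "-f", "1", "-t", "x", "-t", "y", "--files", "a.txt", "b.txt"]
)

print(args.bar, args.foo, args.verbose)  # input 1 False
print(args.tag, args.mode, args.files)   # ['x', 'y'] fast ['a.txt', 'b.txt']
```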

10.9 Regular Expression

construction

import re
pattern = re.compile(r'\d+.*$') # raw string, so the backslash reaches the regex engine

match

s = 'this is a test string'
pattern.match(s) # return a match object, or None if the start of s does not match

search

pattern.findall(s)

shorthand

m = re.match("[pattern]", "string")
m.group()
m = re.search("[pattern]", "string")
m.group()
re.search("pattern", "string", re.IGNORECASE)
m = re.findall("[pattern]", "string")
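To make the difference between match, search and findall concrete, here is a sketch on a made-up sample string.

```python
import re

pattern = re.compile(r"\d+")
s = "order 66 shipped 2 items"

# match() only succeeds at the start of the string; here it returns None.
at_start = pattern.match(s)

# search() scans the whole string and returns the first match object.
m = pattern.search(s)
print(m.group(), m.start())  # 66 6

# findall() returns every non-overlapping match as a list of strings.
all_numbers = pattern.findall(s)
print(all_numbers)           # ['66', '2']

# Flags such as re.IGNORECASE are passed as the last argument.
ignore_case = re.search("ORDER", s, re.IGNORECASE).group()
print(ignore_case)           # order
```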

10.10 Concurrent

10.10.1 threading

The package name is threading, the object is Thread.

Functions

  • threading.active_count(): number of alive Thread objects
  • threading.current_thread(): current Thread object
  • threading.enumerate(): return a list of all Thread objects
  • threading.main_thread(): the main Thread object
  • threading.local(): the instance of local storage. Different for different threads. Typical usage: mydata = threading.local()

Two ways to specify what to run:

  • pass a callable object to the target argument when constructing Thread
  • define a subclass of Thread and override the run method.

Methods:

  • start: start the thread. It calls the run method in a separate thread. The thread terminates when run returns
  • join(timeout=None): the calling thread blocks until this thread terminates
    • timeout should be a float in seconds
  • is_alive: test whether the thread is alive (started and not yet terminated)

10.10.2 Thread Sync

class threading.Lock

  • acquire()
  • release()

class threading.RLock

  • this is a recursive lock. The same thread can acquire the lock multiple times. The acquisitions nest, and only after the last release is called can the lock be acquired by another thread
  • acquire()
  • release()

class threading.Condition(lock=None)

  • the lock must be a Lock or RLock. If none, a RLock is created
  • acquire()
  • release()
  • wait(timeout=None): wait until notified
    • release underlying lock
    • block until notify
    • re-acquire the lock and return
    • typical usage: while not item_is_available(): cv.wait()
    • often used with the with statement: with cv: cv.wait_for(pred); get()
  • wait_for(predicate, timeout=None)
    • this is the same as while not predicate(): cv.wait(), thus more convenient than wait
  • notify(n=1): wake up n waiting threads (one by default)
  • notify_all(): notify all threads waiting on this condition
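A minimal producer/consumer sketch with a Condition, assuming a plain list as the shared state. The names items, received, consumer and producer are made up.

```python
import threading

cv = threading.Condition()
items = []      # shared state protected by cv
received = []

def consumer():
    with cv:
        # wait_for releases the lock while blocked and re-acquires it on return
        cv.wait_for(lambda: len(items) > 0)
        received.append(items.pop())

def producer():
    with cv:
        items.append("job")
        cv.notify()  # wake one waiting thread

t = threading.Thread(target=consumer)
t.start()
producer()
t.join(timeout=5)
print(received)  # ['job']
```

Because wait_for checks the predicate before blocking, it does not matter whether the producer or the consumer gets the lock first.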

class threading.Semaphore: this class manages resources with limited capacity.

  • acquire(): decrease capacity
  • release(): increase capacity

class threading.Event

  • is_set(): return the internal flag
  • set(): set flag to true
  • clear(): set flag to false
  • wait(timeout=None): block until internal flag is true

class threading.Timer(interval, function) : Thread

  • interval is a float in seconds, function is a callable. Use the start method to start the thread; the function will be called after the delay.
  • cancel(): stop the timer and cancel the execution. Only works if the timer is still waiting.

class threading.Barrier(parties, action=None, timeout=None)

  • parties is an integer. Every thread calling wait blocks until parties threads have called it. Then all of them are released simultaneously.
  • wait(timeout=None)
  • reset(): reset the barrier. The threads waiting for it will receive BrokenBarrierError
  • abort(): all current and future wait calls for it will get BrokenBarrierError
  • parties: number of parties
  • n_waiting: number of threads currently waiting
  • broken: True or False
10.10.2.1 Using with statement

Lock, RLock, Condition, Semaphore can be used.

with somelock:
  # do something

is equivalent to:

somelock.acquire()
try:
  # do something
finally:
  somelock.release()

10.10.3 multiprocessing

This provides the multiprocessing.Process class, which has an API similar to Thread. It seems to use fork, but the documentation shows no explicit exec?? Weird; it seems to do just what a thread can do (except for the sharing of memory, of course).

10.10.4 Process (subprocess module)

  • subprocess.run(args, *, stdin=None, input=None, stdout=None, stderr=None, shell=False, timeout=None, check=False)
    • run the command and wait for it to complete. Return a CompletedProcess instance.
    • if check is True, raise a CalledProcessError exception if the return code is non-zero. This replaces check_call and check_output.

class subprocess.CompletedProcess

  • args
  • returncode
  • stdout: captured if PIPE is passed to stdout
  • stderr: captured if PIPE is passed to stderr
  • check_returncode(): if returncode is non-zero, raise CalledProcessError

Variables:

  • subprocess.DEVNULL
  • subprocess.PIPE
  • subprocess.STDOUT: this is only used in the place of stderr to redirect it to stdout

class subprocess.CalledProcessError

  • returncode
  • cmd
  • output: same as stdout
  • stdout
  • stderr
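A sketch of subprocess.run with and without a failing command. To avoid depending on any shell tool, the child here is the Python interpreter itself (sys.executable), which is an assumption of this example.

```python
import subprocess
import sys

done = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # merge stderr into stdout
    check=True,                # raise CalledProcessError on a non-zero exit
)
output = done.stdout.decode().strip()
print(done.returncode, output)  # 0 hello

# With check=True a failing command raises instead of returning.
failed_code = None
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"], check=True)
except subprocess.CalledProcessError as e:
    failed_code = e.returncode
print(failed_code)  # 3
```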

The following are from 2.7; now just use run.

  • subprocess.call(args, *, stdin=None, stdout=None, stderr=None, shell=False)
    • args: a list of arguments, including arg0
    • it can also be a single string; the * only marks the remaining parameters as keyword-only
    • it will wait, then return returncode
    • do not use stdout=PIPE here, since the pipe is never read and the child may block; use Popen with communicate() instead TODO
    • using shell=True is bad, but it gives me
      • shell pipes
      • filename wildcards
      • env variable expansion
      • ~ expansion
  • check_call(args, *, …): same as call, except it will raise an exception if the return code is non-0
  • check_output(args, *, stdin=None, stderr=None, shell=False, universal_newlines=False)
    • if the return code is non-0, raise an exception. Otherwise return the stdout

Popen object

  • Popen constructor
    • args, bufsize=0, executable=None,
    • stdin=None, stdout=None, stderr=None,
    • preexec_fn=None, close_fds=False,
    • shell=False, cwd=None, env=None,
    • universal_newlines=False, startupinfo=None, creationflags=0
  • Popen.poll(): check if child process has terminated. Set and return returncode.
  • Popen.wait(): wait for process to terminate. Don't use PIPE with this.
  • Popen.communicate(input=None): to use this, the corresponding stdin, stdout, stderr should be set to PIPE.
    • send data to stdin (string)
    • read data from stdout and stderr (it returns a tuple (out, err))
    • wait for termination
  • Popen.send_signal(signal)
  • Popen.terminate(): send SIGTERM
  • Popen.kill(): send SIGKILL
  • Popen.pid
  • Popen.returncode
    • set by poll and wait (and indirectly by communicate)
    • None indicate hasn't terminated
    • -N means terminated by signal N
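A sketch of Popen with communicate: data is sent to the child's stdin and its stdout is read back. The child is again the Python interpreter (sys.executable), and child_code is a made-up one-liner that upper-cases its input.

```python
import subprocess
import sys

# The child reads stdin and writes it back upper-cased.
child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"

p = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# communicate sends the input, reads both pipes and waits for termination.
out, err = p.communicate(input=b"hello")  # bytes, because the pipes are binary
print(out)           # b'HELLO'
print(p.returncode)  # 0
```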

10.11 Internet

10.11.1 urllib.request

package urllib.request

Functions

  • urlopen(url, data=None)
    • url can be a string or Request object
    • for http and https, returns a http.client.HTTPResponse object
    • for FTP, file, data urls, return a urllib.response.addinfourl object
  • pathname2url(path): do quoting
  • url2pathname(path): do unquoting

class Request

  • constructor: (url, data=None, headers={}, method=None)
    • url: a string
    • headers: a dictionary.
    • method: a string, e.g. 'HEAD' or 'POST'. The default is 'GET', or 'POST' if data is given

methods:

  • get_method()
  • add_header(key, val)
  • has_header(key)
  • get_header(key)
  • remove_header(key)
  • get_full_url()
  • header_items(): return a list of tuples (key, value)

req = request.Request(query)
req.add_header("Authorization", "token " + token)
response = request.urlopen(req)
s = response.read().decode('utf8')
langj = json.loads(s)

# deprecated
urllib.request.urlretrieve(url[, filename])

10.11.2 urllib.parse

  • quote(string)
  • quote_plus(string)
  • unquote(string)
  • unquote_plus(string)
  • urlencode(query)
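What the quoting functions produce, as a sketch with made-up inputs.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

quoted = quote("a b/c")            # 'a%20b/c' ('/' is kept by default)
quoted_plus = quote_plus("a b/c")  # 'a+b%2Fc'
plain = unquote("a%20b")           # 'a b'
plain_plus = unquote_plus("a+b")   # 'a b'

# urlencode builds a query string from a dict (or a list of pairs).
query_string = urlencode({"q": "language:C", "per_page": 10})

print(quoted, quoted_plus, plain, plain_plus, query_string)
```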

10.12 Data

10.12.1 Json

import json
json.dumps({"C": 0, "D": 1})
json.loads("a string of json")

json.dump(obj, fp, indent=2)
json.load(fp)

11 Third party libraries

11.1 urllib

from urllib import request
import json

url = 'https://api.github.com'
api = '/search/repositories'
size = 10  # example page size
query = 'language:C&stars:>10&per_page=' + str(size)
response = request.urlopen(url+api+"?q="+query)

s = response.read().decode('utf8')
j = json.loads(s)
# j will be a mix of list and dict

11.2 XML

import xml.etree.ElementTree as ET
root = ET.fromstring(s)
# XPath
nodes = root.findall('{http://www.sdml.info/srcML/src}function')
for node in nodes:
  # do with node
  pass

APIs

  • node.find(XPath)
  • node.findall(XPath)
  • node.get(Attribute)
  • node.text
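A self-contained sketch of find, findall, get and text on a small made-up document. It reuses the namespace URI from above and passes it through a prefix map (the prefix src is arbitrary) instead of writing the {uri}tag form.

```python
import xml.etree.ElementTree as ET

# A small made-up document in the srcML namespace.
doc = """
<unit xmlns="http://www.sdml.info/srcML/src">
  <function kind="free"><name>main</name></function>
  <function kind="free"><name>helper</name></function>
</unit>
"""
root = ET.fromstring(doc)

# Map a prefix to the namespace URI for use in the path expressions.
ns = {"src": "http://www.sdml.info/srcML/src"}

names = [f.find("src:name", ns).text for f in root.findall("src:function", ns)]
kind = root.find("src:function", ns).get("kind")

print(names)  # ['main', 'helper']
print(kind)   # free
```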

11.4 BeautifulSoup

The package is called BeautifulSoup4.

The preamble for using the package:

from bs4 import BeautifulSoup
BeautifulSoup('<html>string</html>', 'html.parser') # name the parser to avoid a warning
with open('a.html') as fp:
    BeautifulSoup(fp, 'html.parser')

Each node can be used as a data structure, with the following fields:

  • name: the tag name
  • string: the string directly embedded inside the node, if there is exactly one (otherwise None)
  • strings: a generator over all the strings inside the node
  • a tag name used as an attribute (e.g. s.a): the first descendant with that tag
  • attrs: a dictionary of all attributes
  • children: going downwards
  • descendants: intuitive
  • parent
  • parents: wow, this should be called ancestor?
  • next_sibling, previous_sibling

It can also be used as a dictionary of its attributes, e.g. s['href']. The value is normally a string. It is similar to calling the get method with the attribute name, except that get returns None for a missing attribute.

Several methods are of particular interest.

  • get_text(): return all text in the node

You can also execute a query on it. In general, find_all returns a list, while find returns the first one. There are also some methods in this family, namely find_next_siblings, find_parents. E.g.

  • s.find_all('a'): return a list of all 'a' tag nodes

Or it can be a query respecting css id and classes. Although find has some support for id and class, the select is easier to use.

  • s.select("body a"): non-direct
  • s.select("p > a"): direct
  • s.select("p.c#id"): class and id
  • s.select("p > #id"): mix
  • s.select('a[href^="xxx"]'): filtering based on attribute values