GZIP, Sockets, and Web Pages

Project #0

What are GZIP and ZIP? How do they differ? Do they use the same compression algorithm?
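One way to start exploring these questions is to compress the same bytes both ways and compare the results. The sketch below is only an illustration using Python's standard gzip and zipfile modules; the sample data and the member name page.html are made up for the example.

```python
import gzip
import io
import zipfile

# hypothetical sample data (any bytes would do)
data = b'<p>Hello, compression!</p>' * 100

# ---- GZIP: compresses a single stream of bytes
gz_bytes = gzip.compress(data)

# ---- ZIP: an archive format that can hold many named files
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('page.html', data)
zip_bytes = buf.getvalue()

# ---- compare sizes and the leading "magic" bytes of each format
print(f'original {len(data)} bytes')
print(f'gzip     {len(gz_bytes)} bytes, starts with {gz_bytes[:2]}')
print(f'zip      {len(zip_bytes)} bytes, starts with {zip_bytes[:4]}')

# ---- both formats round-trip back to the original data
assert gzip.decompress(gz_bytes) == data
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    assert zf.read('page.html') == data
```

The two outputs start with different bytes and have different sizes, which is a useful starting point for researching how the formats differ.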

What other decompression algorithms does your web browser support?

Project #1

Use a web browser to download a web page. Modify (clean up) the code below. Use it to download and decompress web pages. Display the first 60 characters/bytes.

Find and test 10 other web pages. Are they compressed or uncompressed? What is the difference in file sizes?
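When testing pages, keep in mind that urllib does not ask the server for compressed content unless told to (by default it sends Accept-Encoding: identity), so many pages may arrive uncompressed. The following is a minimal sketch, assuming the server honors an Accept-Encoding: gzip request header. The helper names fetch and decode_body are made up for this example.

```python
import gzip
from urllib import request

GZIP_MAGIC = b'\x1f\x8b'    # first two bytes of any GZIP stream

def decode_body(body):
    '''return the body decompressed if it starts with the GZIP magic number'''
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body

def fetch(url):
    '''download a page, asking the server for GZIP compression
       (hypothetical helper; returns the raw body and the
       Content-Encoding header value, which may be empty)'''
    req = request.Request(url, headers={'Accept-Encoding': 'gzip'})
    with request.urlopen(req) as response:
        encoding = response.headers.get('Content-Encoding', '')
        body = response.read()
    return body, encoding
```

For example, calling body, encoding = fetch('https://python.org') and then page = decode_body(body) lets you compare len(body) with len(page) to see the difference in size when the server does compress the page.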

Project #2

Wireshark is an open-source program for monitoring network traffic. Download and install Wireshark.

Use Wireshark to monitor network traffic while requesting and downloading a web page.

Display some of the packets.
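To give Wireshark some simple traffic to capture, a page can be requested over a plain socket. This is a sketch only: it assumes an unencrypted HTTP site (httpforever.com from the sample code is one candidate), and the helper names build_request and fetch_raw are made up for the example.

```python
import socket

def build_request(host, path='/'):
    '''build a minimal HTTP/1.1 GET request as bytes'''
    lines = [
        f'GET {path} HTTP/1.1',
        f'Host: {host}',
        'Accept-Encoding: gzip',
        'Connection: close',
        '',                      # blank line ends the headers
        '',
    ]
    return '\r\n'.join(lines).encode('ascii')

def fetch_raw(host, port=80, path='/'):
    '''send the request and return the raw response bytes
       (status line, headers, and body together)'''
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        while True:
            chunk = sock.recv(4096)
            if not chunk:        # server closed the connection
                break
            chunks.append(chunk)
    return b''.join(chunks)
```

Calling fetch_raw('httpforever.com') while a capture is running should produce request and response packets that can be isolated with a Wireshark display filter such as http or tcp.port == 80.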

Project #3

Create compressed and uncompressed versions of a web page and test them.

  1. create a web page (HTML file) containing several paragraphs, etc.
  2. using the code below, read the HTML file
  3. write the file's data as compressed and uncompressed HTML files
  4. open each file in a web browser to verify the page
  5. display the file sizes (compressed vs uncompressed)
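Steps 2, 3, and 5 above can be sketched as follows. This is one possible approach using the standard gzip module, not the only one; the function name write_copies and the file names in the usage note are hypothetical.

```python
import gzip
import os

def write_copies(src_path, plain_path, gz_path):
    '''read an HTML file, write an uncompressed copy and a
       GZIP-compressed copy, and return both file sizes in bytes'''
    # read the source file as bytes (no decoding needed)
    with open(src_path, 'rb') as f:
        data = f.read()

    # uncompressed copy
    with open(plain_path, 'wb') as f:
        f.write(data)

    # GZIP-compressed copy
    with gzip.open(gz_path, 'wb') as f:
        f.write(data)

    return os.path.getsize(plain_path), os.path.getsize(gz_path)
```

For example, plain, packed = write_copies('page.html', 'copy.html', 'copy.html.gz') followed by print(plain, packed) displays the two sizes. Note that a browser opening the .gz file directly from disk may offer to download it rather than render it, because no Content-Encoding header accompanies a local file.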

See Project: Create a Simple Website for information on creating web pages.

Links

Wireshark (home)

Real Python - Programming Sockets in Python
(The first 2 or 3 videos are free to watch.)

GZIP Compression: How to Enable for Faster Web Pages

Sample Code

#!/usr/bin/python3
# ===================================================================
# Based on: realpython.com/courses/programming-sockets/
#           Real Python - Programming Sockets in Python
#
# FYI: GZIP magic number is:
#      b'\x1f\x8b' bytes or 0x1f8b hex or 8075 dec
#      (It indicates the text was compressed using GZIP.)
# ===================================================================

from urllib import request
import gzip

# ---- compressed and uncompressed web page

url = 'https://python.org'
#url = 'http://www.tomshodgepodge.com/programming-projects'
#url = 'http://httpforever.com'

# ---- download a web page

with request.urlopen(url) as response:
    html = response.read()

print(f'HTML {len(html)} bytes')
print(f'HTML data type {type(html)}')
print('HTML first 50 bytes')
print(f'[:50] {html[:50]}')

# ---- compressed or uncompressed?
# ---- (compare the first two bytes with the GZIP magic number;
# ----  slicing avoids an IndexError on an empty response)

if html[:2] == b'\x1f\x8b':

    print()
    print('---- web page is compressed')
    print()

    dc_html = gzip.decompress(html)

    print(f'decompressed {len(dc_html)} bytes')
    print(f'decompressed data type {type(dc_html)}')
    x1 = dc_html[:50]
    ##print(f'[:50] {len(x1)} bytes')
    ##print(f'[:50] data type {type(x1)}')
    print(f'{x1}')
    print()

    decode_dc_html = dc_html.decode('utf-8')

    print(f'decode("utf-8") {len(decode_dc_html)} characters')
    print(f'decode("utf-8") data type {type(decode_dc_html)}')
    x2 = decode_dc_html[:50]
    ##print(f'[:50] {len(x2)} characters')
    ##print(f'[:50] data type {type(x2)}')
    print(f'{x2}')
    print()

else:

    print()
    print('---- web page is not compressed')
    print()
    print(f'[:50] {html[:50]}')
    # errors="replace" guards against a multi-byte character cut at byte 50
    print(f'[:50] {html[:50].decode("utf-8", errors="replace")}')
    print()