>[!SUMMARY] Table of Contents
>- [[publish/9900/9900_web_scraping_outline#Main level site:|Main level site: ]]
>- [[publish/9900/9900_web_scraping_outline#First level site:|First level site: ]]
> - [[publish/9900/9900_web_scraping_outline#Content in each year site:|Content in each year site:]]
>- [[publish/9900/9900_web_scraping_outline#Second level site:|Second level site: ]]
> - [[publish/9900/9900_web_scraping_outline#Content in each case site:|Content in each case site:]]
# Main level site:
[https://www.austlii.edu.au/cgi-bin/viewdb/au/cases/wa/WASAT/](https://www.austlii.edu.au/cgi-bin/viewdb/au/cases/wa/WASAT/)
1. We can use the page side section to check whether new data has come in
![[Pasted image 20250313010507.png]]
2. It can also be used to verify our scraping database
![[Pasted image 20250313011038.png]]
# First level site:
[https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/wa/WASAT/YEAR/](https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/wa/WASAT/YEAR/)
`YEAR` is the year number, from 2005 to the current year
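
A minimal sketch of how the list of year pages could be generated from the URL pattern above. The names `BASE_URL`, `FIRST_YEAR` and `year_urls` are made up for illustration; only the URL pattern and the 2005 start year come from this note.

```python
from datetime import date

# URL pattern copied from this note; YEAR is substituted below.
BASE_URL = "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/wa/WASAT"
FIRST_YEAR = 2005  # first year listed, per this note


def year_urls(last_year=None):
    """Return one first level URL per year, from FIRST_YEAR to last_year inclusive."""
    if last_year is None:
        # Default to the current calendar year.
        last_year = date.today().year
    return [f"{BASE_URL}/{year}/" for year in range(FIRST_YEAR, last_year + 1)]
```

Calling `year_urls()` with no argument covers every year up to today, so a scheduled run picks up a new year without code changes.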
## Content in each year site:
![[Pasted image 20250313010737.png]]
The `page-main` content has a section for each month.
Each month's section lists every case that we need to scrape deeper,
along with its title.
![[Pasted image 20250313010827.png]]
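
A rough sketch of how the case links and titles could be pulled out of a year page, using only the standard library. It assumes that `page-main` is the `id` of the container element, that the markup is well formed (every non-void tag is closed), and that case links end in `.html`; none of this is confirmed here beyond the screenshots, so check it against the real page. `CaseLinkParser` and `extract_cases` are hypothetical names.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CaseLinkParser(HTMLParser):
    """Collect (href, title) pairs for links inside the container element."""

    def __init__(self, container_id="page-main"):  # assumed to be an id
        super().__init__()
        self.container_id = container_id
        self.depth = 0            # > 0 while we are inside the container
        self.current_href = None  # href of the link being read, if any
        self.current_text = []
        self.cases = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            if attrs.get("id") == self.container_id:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        # Assumption: case links end in ".html".
        if tag == "a" and (attrs.get("href") or "").endswith(".html"):
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        if tag == "a" and self.current_href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self.current_text).split())
            self.cases.append((self.current_href, title))
            self.current_href = None
        self.depth -= 1

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def extract_cases(html):
    """Return a list of (href, title) tuples found inside the container."""
    parser = CaseLinkParser()
    parser.feed(html)
    parser.close()
    return parser.cases
```

Real pages often contain unclosed tags, in which case the depth counter drifts; a tolerant parser such as BeautifulSoup would be a safer choice for production scraping.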
# Second level site:
[https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/wa/WASAT/2006/data-count.html](https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/wa/WASAT/2006/data-count.html)
`data-count` comes from the first level site
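
As a small sketch, the second level URL can be assembled from the year and the `data-count` value picked up on the first level site. `CASE_BASE` and `case_url` are illustrative names, not part of any existing code.

```python
# URL prefix copied from the example link in this note.
CASE_BASE = "https://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/wa/WASAT"


def case_url(year, data_count):
    """Build the second level (case page) URL for one case."""
    return f"{CASE_BASE}/{year}/{data_count}.html"
```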
## Content in each case site:
1. For RTF:
RTF download link format:
[https://www.austlii.edu.au/au/cases/wa/WASAT/year/case_number.rtf](https://www.austlii.edu.au/au/cases/wa/WASAT/year/case_number.rtf)
2. For HTML
`the-document` is the main element we want to save
![[Pasted image 20250313010921.png]]
3. For PDF
TBC
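
A hedged sketch covering the rtf and HTML items above: one helper builds the rtf download link, the other pulls the text out of the `the-document` element. It is not confirmed here whether `the-document` is an `id` or a `class`, so the parser accepts either; it also assumes well-formed markup and keeps only the text, not the inner HTML. `rtf_url`, `DocumentTextParser` and `extract_document_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Prefix copied from the rtf link format in this note.
RTF_BASE = "https://www.austlii.edu.au/au/cases/wa/WASAT"

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def rtf_url(year, case_number):
    """Build the rtf download link for one case."""
    return f"{RTF_BASE}/{year}/{case_number}.rtf"


class DocumentTextParser(HTMLParser):
    """Collect the text found inside the target element."""

    def __init__(self, name="the-document"):
        super().__init__()
        self.name = name
        self.depth = 0    # > 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Assumption: the name may be used as an id or as a class.
            classes = (attrs.get("class") or "").split()
            if attrs.get("id") == self.name or self.name in classes:
                self.depth = 1
        elif tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_document_text(html):
    """Return the whitespace-normalised text of the target element."""
    parser = DocumentTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Joining chunks with a space keeps words from adjacent block elements apart, at the cost of occasionally splitting a word that is wrapped in inline tags.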