Abstract
Web page segmentation has been applied to problems in several fields, including the mobile web, archiving, and phishing detection. It has many applications, such as information retrieval and page type classification. This paper surveys web page segmentation algorithms, including the DOM Tree, VIPS, and SD Tree algorithms, and summarizes how they are used in these fields. The VIPS approach is independent of the underlying HTML representation and works well even when the layout structure differs from the HTML structure. Because identifying meaningful blocks is difficult, the approaches presented here extract the informative parts of web pages by creating meaningful blocks and segmenting noisy web pages.
Keywords
- Cloud-Centric Data Engineering
- Artificial Intelligence
- Data Quality Assurance
- Machine Learning
- Natural Language Processing
- Anomaly Detection
- Cloud Computing
- Data Validation
- Data Cleansing
- Real-Time Monitoring
- Data Integration
- Distributed Systems
- Scalability
- Compliance
- Data Lakes
- Data Warehouses
- AWS Glue
- Google Cloud
- Microsoft Azure
- Open-Source Tools
- Algorithmic Bias
- Ethical Considerations
- Cost Implications
- Autonomous Systems
- Federated Learning
- Edge Computing