Software Engineer (Web Crawling)
We are seeking a highly motivated and experienced Web Crawler Engineer to develop exciting Generative AI (GenAI) products. Candidates should have at least 3 to 5 years of web crawling experience; experience with multimodal (text, image, video) data is preferred. As a Web Crawler Engineer, you will work with other teams to keep product iterations on track, continuously improve the product's user experience through an in-depth understanding of the business and its products, and use technology to drive business growth. Your expertise will play a key role in building our products and achieving our vision.
Responsibilities
  • Designing and Implementing Web Crawlers: Develop scalable and efficient web crawling systems to gather data from various online sources. This involves understanding the structure of different websites, implementing crawling algorithms, handling dynamic content, and ensuring compliance with legal and ethical guidelines.
  • Data Extraction and Parsing: Write scripts or develop algorithms to extract relevant information from web pages. This may involve parsing HTML/XML documents, using regular expressions, or employing advanced parsing techniques such as natural language processing (NLP) to extract structured data from unstructured sources.
  • Data Quality Assurance: Implement mechanisms to ensure the quality and reliability of crawled data. This includes error handling, data validation, deduplication, and dealing with inconsistencies or missing data.
  • Scalable Data Storage and Management: Design and develop backend systems to store, organize, and manage large volumes of crawled data efficiently. This may involve selecting appropriate databases (e.g., relational databases, NoSQL databases), optimizing database schemas, and implementing data caching and indexing strategies for faster retrieval.
  • Performance Optimization: Optimize the performance of web crawling and data management systems to handle large-scale data processing efficiently. This includes optimizing algorithms, minimizing resource usage, and parallelizing data processing tasks.
  • Monitoring and Maintenance: Implement monitoring tools and logging mechanisms to track the health and performance of web crawling and data management systems. Proactively identify and resolve issues such as crawling failures, performance bottlenecks, or data inconsistencies.
  • Security and Compliance: Ensure that the web crawling and data management systems adhere to security best practices and regulatory requirements. Implement mechanisms to protect against security threats such as XSS (Cross-Site Scripting) attacks, CSRF (Cross-Site Request Forgery) attacks, and data breaches.
Requirements
  • Bachelor's degree or above in Computer Science, Software Engineering, Science, E-Commerce, Information Technology, Mathematics, or another technical or software-related major.
  • More than 3 years of web crawling experience; experience with large-scale web crawling and multimodal (text, image, video) data is preferred.
  • Proficient in one or more programming languages such as Golang/Python/PHP/Java, with strong architectural capabilities and good coding standards.
  • Familiar with common databases such as MySQL, Redis, and HBase.
  • Experience with web crawling libraries and frameworks such as Scrapy, Beautiful Soup, Selenium, and Apache Nutch.
  • Prior experience at a search engine company is a plus.
Empowering everyone with best-in-class generative AI
HyperGAI © 2024. All rights reserved