Develop an efficient search engine with the following features: it should have distributed crawlers to crawl private/air-gapped networks (data sources in these networks might include websites, files, and databases) and must work behind sections of networks secured by firewalls. It should use AI/ML/NLP/BDA for better search (queries and results), and it should abide by secure coding practices and SANS Top 25 web vulnerability mitigation techniques. Feel free to improvise your solution and be creative with your approach.

Goal

Build a search engine which takes a keyword/expression as input and crawls the web (internal network or internet) to get all the relevant information. The application shouldn't have any vulnerabilities; make sure it complies with the OWASP Top 10.

Outcome

Write code which will scrape data, match it against the query, and return relevant/related information.

Note - Make the search as robust as possible (e.g. it can correct misspelt queries, suggest similar search terms, etc.); be creative in your approach. The result obtained from the search engine should display all the relevant matches for the search query/keyword along with the time taken by the search engine to fetch that result. There is no constraint on programming language.

To Submit:
- A README having steps to install and run the application
- The entire code repo
- Implement your solution/model in Docker only
- A video of the working search engine
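As a taste of the "robust search" requirement above, here is a minimal sketch of misspelling correction / "did you mean" suggestions using only Python's standard `difflib`; the vocabulary list and function name are illustrative, not part of this repo's code.

```python
# Minimal sketch of query correction using the standard library only.
# VOCABULARY is a stand-in for terms actually indexed by the search engine.
import difflib

VOCABULARY = ["python", "django", "celery", "redis", "rabbitmq", "scrapy", "search"]

def suggest(query: str, n: int = 3) -> list[str]:
    """Return up to n indexed terms that closely match a (possibly misspelt) query."""
    return difflib.get_close_matches(query.lower(), VOCABULARY, n=n, cutoff=0.6)

print(suggest("serach"))  # ['search']
print(suggest("djngo"))   # ['django']
```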
Rename example.env to .env and set up the environment variables to your liking.
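For reference, a .env might look roughly like the sketch below. DATABASE_URL is the only variable referenced later in this README; the other names and all of the values are placeholders, so adapt them to whatever example.env actually defines.

```
# Placeholder values — adjust names/values to whatever example.env actually defines
DATABASE_URL=postgres://postgres:postgres@localhost:5432/search_engine
REDIS_URL=redis://localhost:6379/0
SECRET_KEY=replace-me
DEBUG=True
```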
5. Create a database
Now open pgAdmin and create a database named search_engine. After creating the database, update the DATABASE_URL value accordingly in the .env file.
Note: please read this as well.
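If you prefer the command line to pgAdmin, the same database can be created with psql (assuming the default postgres user; adjust the user to your setup):

```bash
# Command-line equivalent of the pgAdmin step
psql -U postgres -c "CREATE DATABASE search_engine;"
```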
6. Start RabbitMQ and Redis instances
Read their docs on how to start them: Redis, RabbitMQ.
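Since the project is meant to run with Docker anyway, one quick way to bring both services up locally is the following (image tags and container names are just suggestions):

```bash
# Start Redis and RabbitMQ as local containers
docker run -d --name redis -p 6379:6379 redis:7
docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management
```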
7. Migrate the data
python manage.py migrate
And to load the 10 lakh (1 million) website dataset for the crawler to crawl, run:
python manage.py migrate_default_to_be_crawl_data
I have also provided some crawled datasets for reference; you can see them here: data_backup
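For context, migrate_default_to_be_crawl_data is a custom management command shipped with this repo. A command like it typically looks roughly like the sketch below; the app name, the ToBeCrawled model, its url field, and the CSV path are assumptions for illustration, not the repo's actual code.

```python
# search/management/commands/migrate_default_to_be_crawl_data.py  (illustrative path)
# Sketch of a custom Django management command that bulk-loads seed URLs.
import csv

from django.core.management.base import BaseCommand

from search.models import ToBeCrawled  # assumed app/model name


class Command(BaseCommand):
    help = "Load the default list of sites to be crawled into the database"

    def handle(self, *args, **options):
        with open("data/default_to_be_crawled.csv", newline="") as fh:
            rows = [ToBeCrawled(url=row[0]) for row in csv.reader(fh) if row]
        # bulk_create keeps the 1M-row load fast; ignore_conflicts skips duplicates
        ToBeCrawled.objects.bulk_create(rows, ignore_conflicts=True, batch_size=1000)
        self.stdout.write(self.style.SUCCESS(f"Queued {len(rows)} sites for crawling"))
```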
8. Compress the static files
Now run the following command in the console:
python manage.py collectcompress
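collectcompress is a custom command in this repo; conceptually it likely chains Django's collectstatic with django-compressor's compress, roughly like the sketch below (the file path and implementation are assumptions, not the repo's actual code).

```python
# search/management/commands/collectcompress.py  (illustrative path)
from django.core.management import call_command
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Collect static files and pre-compress them in one step"

    def handle(self, *args, **options):
        call_command("collectstatic", interactive=False)  # gather static files without prompting
        call_command("compress", force=True)              # django-compressor offline compression
```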
9. Create a superuser for the site
python manage.py createsuperuser
It asks for some necessary information; provide it and it will create a superuser for the site.
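If you'd rather not answer the prompts (for example, inside a container), Django 3.0+ also supports a non-interactive variant driven by environment variables; the values below are placeholders:

```bash
# Non-interactive alternative to the interactive prompts
DJANGO_SUPERUSER_USERNAME=admin \
DJANGO_SUPERUSER_EMAIL=admin@example.com \
DJANGO_SUPERUSER_PASSWORD=change-me \
python manage.py createsuperuser --noinput
```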
10. Run the Celery worker and beat
Now run this command in the terminal:
python manage.py add_celery_tasks_in_panel
Now open two different terminals and run these commands respectively:
celery -A search_engine worker --loglevel=INFO
celery -A search_engine beat -l INFO --scheduler django_celery_beat.schedulers:DatabaseScheduler
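These commands assume the usual Celery–Django wiring in search_engine/celery.py. For reference, the standard pattern looks like the sketch below; the repo's actual file may differ.

```python
# search_engine/celery.py — standard Celery + Django wiring (sketch)
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "search_engine.settings")

app = Celery("search_engine")
# Read all CELERY_* settings from Django's settings module
app.config_from_object("django.conf:settings", namespace="CELERY")
# Discover tasks.py modules in every installed app
app.autodiscover_tasks()
```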
11. Run the application
Before running the application, make sure that Redis is up and running :)
For Windows, macOS, and Linux
Without IP address bound
uvicorn search_engine.asgi:application --reload --lifespan off
IP address bound
uvicorn search_engine.asgi:application --reload --lifespan off --host 0.0.0.0
If you are on Linux, there is also an alternative way to serve the app instead of the command above.
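One Linux-only option is serving the ASGI app through Gunicorn with Uvicorn workers; treat this as a suggestion rather than the repo's documented command:

```bash
# Gunicorn managing Uvicorn worker processes (POSIX systems only)
gunicorn search_engine.asgi:application -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000
```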
There are three different ways to run the crawlers:
1. crawl_already_crawled
This is a custom Django management command; it re-crawls the already crawled and stored sites and then updates them:
python manage.py crawl_already_crawled
2. crawl_to_be_crawled
This is a custom Django management command; it crawls the sites that were entered either via the migrate_default_to_be_crawl_data custom command or via the submit_site/ endpoint:
python manage.py crawl_to_be_crawled
3. Scrapy Command Line Crawler
This is a Scrapy project that crawls a site from the command line.
Replace example.com below with the site you want to crawl (without http:// or https://):
scrapy crawl konohagakure_to_be_crawled_command_line -a allowed_domains=example.com
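The real konohagakure_to_be_crawled_command_line spider lives in this repo; the sketch below only illustrates the standard Scrapy pattern it relies on, where `-a` arguments are passed into the spider's `__init__`. The parsing logic and start URL scheme here are assumptions.

```python
# Sketch of a Scrapy spider that accepts allowed_domains from the command line via -a
import scrapy


class ToBeCrawledCommandLineSpider(scrapy.Spider):
    name = "konohagakure_to_be_crawled_command_line"

    def __init__(self, allowed_domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if allowed_domains:
            # -a allowed_domains=example.com arrives as a string; wrap it in a list
            self.allowed_domains = [allowed_domains]
            self.start_urls = [f"https://{allowed_domains}/"]

    def parse(self, response):
        # Minimal example of scraped output: the page URL and its title
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
```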