General information for the website: webstore
Num. of web pages/modules: 1
Description of every page/module: Rationale: I am looking for a PHP 7 script that reads sitemap.xml, checks every URL for 40x and 30x status codes, and checks the local/remote resources on each page. So this is NOT a crawler, just a resource checker. Results are then emailed.
Business reason: remove/fix all internal 40x and 30x errors in the sitemap, plus resources hosted locally on the pages, and optionally a-href links.
PHP SCRIPT
- run from the command line via cron; options passed as params/args
- run over HTTP (results shown on screen); options passed as GET params
- without options: reads the local sitemap and checks every page for 30x and 40x
- option --autodomain: takes a domain name (www.domain.com or domain.com) and
a) tries to fetch www.domain.com/robots.txt and parses it to find the sitemap.xml mentioned there (example code available)
b) if no robots.txt is found, tries www.domain.com/sitemap.xml
- option --alldomains: value 1 or true => a robots.txt file, if found, can contain more than one sitemap. If this option is true, all sitemaps are processed and the results in the report are presented per domain
- option --sitemap: takes an http link or a local direct path to the sitemap
- option --checkres: for every page it checks, it reads the contents, then checks all resources such as IMG (not a href) found on that page, again only for 30x and 40x
- option --checkhref: for every page it checks, it reads the contents, then checks all a-hrefs found on that page, again only for 30x and 40x
- option --40xonly: do the above for both res and href, but only check and report 40x
- option --30xonly: do the above for both res and href, but only check and report 30x
- option --all: check all of the above
- results are listed as follows:
* sorted per status code
* a local page that is itself in error (30x or 40x) is listed
* or a local page is listed with, indented beneath it, each resource or a-href found on that page that is in error (30x or 40x)
- option --emailto:mail@mail.com : sends the results to the given e-mail address
- the above are the command-line options
- the same options should be available in the web interface: when the script is called via HTTP, a page is shown where the above params can be entered and then submitted to execute (step 1 is input, step 2 is execute)
- assume most required methods exist in PHP (curl, XML parsing, etc.), or check with us first
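To make the expected flow concrete, here is a minimal sketch of the default run plus --autodomain, using only stock PHP 7 (cURL, SimpleXML). All function names are illustrative, not an agreed API, and the HEAD check deliberately does not follow redirects so 30x codes are reported rather than resolved:

```php
<?php
// Sketch only: function names are assumptions, not an agreed API.

// HTTP status of a URL via a cURL HEAD request. Redirects are NOT
// followed, so a 301/302 is reported as such.
function headStatus(string $url): int {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,   // HEAD request
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false,  // report 30x, do not follow
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code;
}

// Pull the "Sitemap:" lines out of a robots.txt body; there may be
// more than one, which is what --alldomains is for.
function sitemapsFromRobots(string $robotsTxt): array {
    preg_match_all('/^\s*Sitemap:\s*(\S+)/mi', $robotsTxt, $m);
    return $m[1];
}

// --autodomain: a) robots.txt first, b) fall back to /sitemap.xml.
function discoverSitemaps(string $domain): array {
    $base   = 'http://' . preg_replace('#^https?://#i', '', rtrim($domain, '/'));
    $robots = @file_get_contents($base . '/robots.txt');
    $found  = ($robots !== false) ? sitemapsFromRobots($robots) : [];
    return $found ?: [$base . '/sitemap.xml'];
}

// Default run: read the sitemap, HEAD-check every <loc>, keep 30x/40x.
function checkSitemap(string $sitemapUrl): array {
    $errors = [];
    $xml = simplexml_load_string(file_get_contents($sitemapUrl));
    foreach ($xml->url as $entry) {
        $page = (string) $entry->loc;
        $code = headStatus($page);
        if ($code >= 300 && $code < 500) {
            $errors[$code][] = $page;     // grouped per status code
        }
    }
    return $errors;
}
```

This is a sketch under the stated assumptions, not the final implementation; timeouts, retries, and HEAD-vs-GET fallback (some servers reject HEAD) would still need to be agreed.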
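Since the same options must work from cron and over HTTP, one possible approach is a single reader that uses getopt() on the CLI and falls back to $_GET in the web interface. The helper name and the exact merge behaviour are my own assumptions:

```php
<?php
// Sketch: one option set, read from getopt() on the CLI or from
// GET params over HTTP. Helper name is an assumption, not a spec.

function readOptions(): array {
    // options ending in ':' take a value, the rest are flags
    $long = [
        'autodomain:', 'alldomains:', 'sitemap:', 'checkres',
        'checkhref', '40xonly', '30xonly', 'all', 'emailto:',
    ];
    if (PHP_SAPI === 'cli') {
        return getopt('', $long) ?: [];
    }
    // HTTP: step 1 is an input form, step 2 executes with these params
    $opts = [];
    foreach ($long as $name) {
        $key = rtrim($name, ':');
        if (isset($_GET[$key])) {
            $opts[$key] = $_GET[$key];
        }
    }
    return $opts;
}
```

Note that getopt() expects `--emailto=mail@mail.com` (or a space-separated value) rather than the `--emailto:mail@mail.com` form written above; whichever syntax is preferred should be confirmed before implementation.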
Example report: a-href/link errors
- domain.com/page1
301: domain.com/page1/page2.html
404: remoteblog.com/page1/other page.html
- domain.com/page2
302: domain.com/page2/myimagexx.png
404: remoteblog.com/page1/image66.png
etc
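A report in the shape shown above could be produced along these lines: DOMDocument pulls img[src] (--checkres) and a[href] (--checkhref) out of each page's HTML, and failing URLs are printed indented under the page they were found on. Helper names are illustrative only:

```php
<?php
// Sketch: extract resources/links from one page's HTML and render
// the indented per-page report shown above. Names are illustrative.

// --checkres / --checkhref: collect img[src] and/or a[href] values.
function extractUrls(string $html, bool $res, bool $href): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // tolerate real-world, non-valid markup
    $urls = [];
    if ($res) {
        foreach ($doc->getElementsByTagName('img') as $img) {
            $urls[] = $img->getAttribute('src');
        }
    }
    if ($href) {
        foreach ($doc->getElementsByTagName('a') as $a) {
            $urls[] = $a->getAttribute('href');
        }
    }
    return array_values(array_filter($urls));
}

// Render "page, then indented 'code: url' lines" per the sample above.
// Input shape: ['page' => [[statusCode, url], ...], ...]
function renderReport(array $errorsPerPage): string {
    $out = '';
    foreach ($errorsPerPage as $page => $errors) {
        $out .= "- $page\n";
        foreach ($errors as list($code, $url)) {
            $out .= "    $code: $url\n";
        }
    }
    return $out;
}
```

The rendered text can go to stdout for cron, into an HTML page for the web interface, or into the --emailto mail body.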
Description of requirements/features: This job is for experienced PHP developers only
Needs to run on PHP 7
Expert in cURL or wget
Sitemap handling, XML parsing, HTML parsing
Extra notes: