Web Information Gathering

  • There are two types of information gathering:

    • Active: Direct interaction with the target system to gather information.

    • Passive: Gathering information about the target without directly interacting with it.

Whois

  • WHOIS is a query and response protocol designed to access databases that store information about registered internet resources.

  • Some of the information included in a WHOIS record:

    • Domain Name: The domain name itself (e.g., example.com)

    • Technical Contact: The person handling technical issues related to the domain.

    • Creation and Expiration Dates: When the domain was registered and when it's set to expire.

    • Name Servers: Servers that translate the domain name into an IP address.
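
  • Example: a basic lookup with the command-line whois client retrieves this record: whois <Domain-Name>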

Domain Name Server (DNS)

  • The Domain Name System (DNS) is responsible for converting domain names (e.g., www.example.com) to IP addresses (e.g., 192.0.2.1).

  • Hosts File: A simple text file used to map hostnames to IP addresses, providing a manual method of domain name resolution that bypasses the DNS process. The hosts file is located in C:\Windows\System32\drivers\etc\hosts on Windows and in /etc/hosts on Linux and MacOS.
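
  • For example, adding the line 192.0.2.1 app.example.com to the hosts file makes app.example.com (a placeholder hostname) resolve to 192.0.2.1 locally, bypassing DNS.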

  • DNS Zone: A distinct part of the domain namespace that a specific entity or administrator manages. Think of it as a virtual container for a set of domain names. For example, example.com and all its subdomains (like mail.example.com or blog.example.com) would typically belong to the same DNS zone.

  • Zone File: A text file residing on a DNS server that defines the resource records within this zone.

  • Some of the information included in a Zone File:

    • IPv4 & IPv6 Address Records

    • Name Servers

    • Mail Exchange Servers

  • There are many tools that can perform DNS reconnaissance, including dig, nslookup, host, dnsenum, and dnsrecon.
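
  • Example: dig <Domain-Name> A queries the IPv4 address record; replacing A with AAAA, MX, or NS returns the IPv6, mail exchange, or name server records instead.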

Subdomains

  • Subdomain enumeration is the process of identifying and listing subdomains.

  • Subdomains can be enumerated passively by using search engine dorks, and actively by brute-forcing.

  • Subdomains are typically represented by A (or AAAA for IPv6) records in a DNS zone file. Additionally, CNAME records might be used to create aliases for subdomains, pointing them to other domains or subdomains.
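
  • Example: active brute-forcing with Gobuster's DNS mode (wordlist path is a placeholder): gobuster dns -d <Domain-Name> -w <wordlist_file>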

Zone Transfers

  • DNS zone transfers are designed for replicating DNS records between name servers. However, a misconfigured transfer can become a goldmine of information.

  • A DNS zone transfer is a full copy of all DNS records within a zone.

  • Modern DNS servers are typically configured to allow zone transfers only to trusted secondary servers, ensuring that sensitive zone data remains confidential.

  • Tools like dig can attempt this attack using the command: dig axfr @<Name-Server> <Domain-Name>

Virtual Hosts

  • At the core of virtual hosting is the ability of web servers to distinguish between multiple websites or applications sharing the same IP address. This is achieved by leveraging the HTTP Host header, a piece of information included in every HTTP request sent by a web browser.
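
  • Example: a specific virtual host can be requested directly by IP with curl by setting the Host header (hostname is a placeholder): curl -s http://<target_IP_address> -H "Host: app.example.com"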

  • The difference between subdomains and virtual hosts: subdomains are extensions of a main domain name and typically have their own DNS records, pointing to either the same IP address as the main domain or a different one, whereas virtual hosts are configurations within a web server that allow multiple websites or applications to be hosted on a single server. Virtual hosts can be associated with top-level domains (e.g., example.com) or subdomains.

  • We can access virtual hosts by modifying the hosts file on a local machine to include the virtual host.

  • Gobuster can be used to enumerate virtual hosts: gobuster vhost -u http://<target_IP_address> -w <wordlist_file> --append-domain

Other Techniques

Certificate Transparency Logs

  • Certificate Transparency Logs: Certificate Transparency (CT) logs are public, append-only ledgers that record the issuance of SSL/TLS certificates.

  • Tools like crt.sh and Censys can be used to query these logs. For example, the following command retrieves the target's logged subdomains that contain "dev": curl -s "https://crt.sh/?q=<Target-URL>&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u

Fingerprinting

  • Fingerprinting: Focuses on extracting technical details about the technologies powering a website or web application.

  • This can be done through various techniques, like banner grabbing, analyzing HTTP headers, probing for specific responses, and analyzing page content.
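
  • Example: banner grabbing via HTTP response headers (e.g., Server, X-Powered-By) with curl: curl -I <Target-URL>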

  • Tools: Wappalyzer, BuiltWith, WhatWeb, Nikto, wafw00f.

  • Example Command: nikto -h <Target-URL> -Tuning b

Crawling

  • Crawling: Often called spidering, crawling is the automated process of systematically browsing the World Wide Web. Similar to how a spider navigates its web, a web crawler follows links from one page to another, collecting information.

  • Web crawling can be automated using many available tools, such as Burp Suite Spider, OWASP ZAP (Zed Attack Proxy), Scrapy (Python Framework), and Apache Nutch (Scalable Crawler).

Robots.txt

  • Robots.txt: A simple text file placed in the root directory of a website (e.g., www.example.com/robots.txt). This file contains instructions in the form of "directives" that tell bots which parts of the website they can and cannot crawl.

  • The robots.txt file follows a straightforward structure, with each set of instructions, or "record," separated by a blank line. Each record consists of two main components:

    • User-agent: This line specifies which crawler or bot the following rules apply to. A wildcard (*) indicates that the rules apply to all bots. Specific user agents can also be targeted, such as "Googlebot" (Google's crawler) or "Bingbot" (Microsoft's crawler).

    • Directives: These lines provide specific instructions to the identified user-agent.
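
  • Example: a record that applies to all bots and blocks the /admin/ directory while allowing /public/:

    User-agent: *
    Disallow: /admin/
    Allow: /public/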

Well-Known

  • The .well-known standard defines a standardized directory within a website's root domain.

  • This designated location, typically accessible via the /.well-known/ path on a web server, centralizes a website's critical metadata, including configuration files and information related to its services, protocols, and security mechanisms (e.g., https://example.com/.well-known/security.txt).

  • Fuzzing for well-known URIs can reveal new attack vectors.
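
  • Example: fetching a site's security contact information, if published: curl -s https://example.com/.well-known/security.txt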

Search Engines

  • Search engines are another important resource for information.

  • Google dorks, search operators that narrow and refine results, make search engines a good starting point for any internet-facing target.

  • Examples: site:example.com (inurl:login OR inurl:admin)

Web Archives

  • The Wayback Machine is a digital archive of the World Wide Web and other information on the Internet.

  • It allows users to "go back in time" and view snapshots of websites as they appeared at various points in their history.

  • This can be used to uncover hidden assets that might have been exposed in the past.
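
  • Example: the Wayback Machine's CDX API can list archived URLs for a domain (a sketch; parameters may need adjusting): curl -s "http://web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey"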

Automating Recon

  • There are many tools that automate the entire process of web reconnaissance. Some examples include FinalRecon, Recon-ng, theHarvester, SpiderFoot, and autorecon.
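
  • Example: a quick passive sweep with theHarvester, using crt.sh as the data source (available sources vary by version): theHarvester -d example.com -b crtsh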
