How to search and analyze documents? Document OSINT

 


City pulse 
hacker-basement.com
12 min
January 23, 2023

Hi, friend. Searching for and analyzing documents is an area of OSINT that some people forget about and others undeservedly ignore. In fact, with a skillful or simply meticulous approach, hunting for documents can turn up a lot of interesting information. It works if only because documents tend to contain very specific details: last names, addresses, names of organizations, financial data, and so on. So in this article we will look at document search in more detail.

Where to begin?

The document search process can be divided into two types: search by keywords and search by availability. This is a rather loose division, needed mostly to make the process easier to study and to build the right way of thinking.

Keyword search means we roughly understand what content the document should have. For example, if we are looking for information about a person and know their last name, we can look for lists of students of a university or school or, alternatively, participants in olympiads. Websites of educational institutions often publish other things too. Having found such a document, we automatically get a fair amount of additional information, from the city where the person lived or lives to a list of people the target presumably has contact with.

The same is true when researching a company by name or registration number. Reports or other documents about the organization may well be publicly available, and they can give us a whole bunch of additional information.

The second option is "search by availability". In this case we do not know exactly which document, or with what content, we are looking for. We just assume there may well be a document containing something interesting. For example, we are studying a site in order to de-anonymize its owner. In this situation it makes perfect sense to collect the documents from that site, if there are any of course, and examine their content and metadata. You may suddenly find just what you need.
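To make "examine the metadata" a bit more concrete: docx, xlsx and pptx files are ZIP archives, and the author and timestamps usually sit in docProps/core.xml inside them. Below is a minimal sketch using only the Python standard library. The function name read_ooxml_metadata and the choice of fields are my own illustration, not part of any tool mentioned in this article.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML files.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative selection of fields that are often interesting for OSINT.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_ooxml_metadata(source):
    """Return core properties of a docx/xlsx/pptx file as a dict.

    `source` is a file path or a binary file-like object.
    Fields that are absent or empty are skipped.
    """
    with zipfile.ZipFile(source) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))

    found = {}
    for name, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[name] = node.text.strip()
    return found
```

Calling read_ooxml_metadata("report.docx") (a hypothetical file name) on a downloaded document returns whatever of these fields the author's software left behind. Old doc/xls/ppt files and pdf use different containers, so this sketch does not cover them.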

Documents and their formats

Since we are talking about documents, we first need to decide which documents we will be looking for, i.e. which formats we need. So let's go through the most relevant extensions:

Microsoft Office formats:
doc, docx - Microsoft Word document format. And although doc is an outdated option, documents with this extension are still quite common, so you should not forget about them.
xls, xlsx - Microsoft Excel spreadsheet format. Similarly, xls is an obsolete option, but it should not be forgotten.
ppt, pptx - Microsoft PowerPoint presentation format.

OpenDocument formats:
odt - document format;
ods - tables;
odp - presentations;

pdf - a cross-platform open electronic document format. It needs no introduction and is used frequently and heavily, so you should always check for pdf files.

txt - plain text files. Not the most common case, but they sometimes contain useful information, so they are worth checking.

This is certainly not a complete list of document extensions, but these are the ones you should always check for. Everything else is very situational. You can view all existing document extensions at the link:
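If you already have a list of links collected from a site, the extensions above can be used as a simple filter. This is a small sketch with the Python standard library. The names DOCUMENT_EXTENSIONS, document_extension and filter_documents are illustrative, and the set contains only the formats listed in this article.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# The "always check" extensions from the list above.
DOCUMENT_EXTENSIONS = {
    "doc", "docx", "xls", "xlsx", "ppt", "pptx",
    "odt", "ods", "odp", "pdf", "txt",
}


def document_extension(url):
    """Return the lowercase document extension of a URL, or None."""
    # Only the path part matters; query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix if suffix in DOCUMENT_EXTENSIONS else None


def filter_documents(urls):
    """Keep only the URLs that point to one of the known document formats."""
    return [url for url in urls if document_extension(url)]
```

Keep in mind that this looks only at the file name in the link. Documents served through download scripts without an extension will be missed.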

https://www.file-extensions.org/filetype/extension/name/document-files

Another important point to keep in mind is that documents are not always stored in text formats. The documents you need are often stored as images, and in that case you cannot find them by their content. Sometimes they are published this way on purpose, precisely to complicate the search.

Now let's move on to the actual ways of searching for documents. First we will look at search methods, and then we will go over some nuances of working with documents that have already been found.

Google Dorks for document search

One of the easiest ways to find documents is to use Google dorks. The main advantage of this method is simplicity and versatility. You don't need any additional software, and it takes five minutes to figure out how it works, even if you've never done it before.

The main dork we'll be using is filetype:, and we will refine its results with some additional dorks. The point of filetype: is to look for files with the extension we need. You can search for any files this way, of course, but since we are talking about documents, we will use the extensions listed above.

The easiest option is to write filetype:docx (or whatever extension you want) and then specify the desired query. For example:

filetype:docx "Ivanov Ivan Ivanovich"

Here we used quotes to search for an exact match, i.e. Google will search for exactly the phrase we put in quotation marks. But this version of the search is too general, and if the query is a common one (as in our example), there will be a huge number of results. So the query needs to be refined.

filetype:docx "Ivanov Ivan Ivanovich" site:*.gov.ru

Here we have refined the query with another dork, site:. In this version we asked Google to find documents with the docx extension that contain the phrase “Ivanov Ivan Ivanovich”, searching only on sites in the gov.ru domain. The * symbol here means "any value". If necessary, you can specify one particular site instead. Or you can put a - (minus) before site:, and Google will exclude results from that site. The asterisk (*) can also be used in the query itself if the exact wording is unknown, for example, if you are not sure which letter is correct in a confusing surname or first name.

It can also sometimes be useful to narrow the search to a range of years. This is done with the ".." dork, which matches any number within the given range, for example:

2020..2022
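When you have to repeat the same search for several extensions or sites, the dorks above can be assembled by a small script. The sketch below only builds query strings and a search URL; it does not send any requests. The function names and parameters are my own, and joining several filetype: dorks with OR is an assumption based on common dork usage rather than something shown earlier in the article.

```python
from urllib.parse import urlencode


def build_dork(phrase, extensions, site=None, exclude_sites=(), years=None):
    """Compose a Google query from the dorks described above."""
    if len(extensions) == 1:
        parts = [f"filetype:{extensions[0]}"]
    else:
        # Assumption: several filetype: dorks can be combined with OR.
        parts = ["(" + " OR ".join(f"filetype:{ext}" for ext in extensions) + ")"]

    parts.append(f'"{phrase}"')  # quotes request an exact match
    if site:
        parts.append(f"site:{site}")
    parts.extend(f"-site:{excluded}" for excluded in exclude_sites)
    if years:
        first, last = years
        parts.append(f"{first}..{last}")  # numeric range dork
    return " ".join(parts)


def search_url(query):
    """Turn a query string into a URL that can be opened in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    query = build_dork("Ivanov Ivan Ivanovich", ["docx"], site="*.gov.ru")
    print(query)
    print(search_url(query))
```

The example at the bottom reproduces the refined query from the text. Opening the printed URL in a browser is the same as typing the dork by hand.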

Programs for searching documents

Google dorks are good when we at least roughly know what we are looking for. But if, for example, we need to download all the documents from some site, then doing it manually is not very convenient and not very fast. For such purposes, there are useful utilities that will do everything for us.

Metagoofil

In my opinion, Metagoofil is the best at downloading files. It downloads all the files you need and builds a detailed report from the metadata it finds. There is a separate article on Metagoofil where I covered how to use it and what you can find with it. I see no point in retelling it; whoever needs it can go and read it. Here we will look at other options.

Dork Dump

Another equally useful utility is Dork Dump. It is very well suited to checking a small site for documents. It shows the metadata of the found documents right in the terminal, so there is no need to wait until it finishes.

Dork Dump installation:

git clone https://github.com/dievus/msdorkdump.git
cd msdorkdump
pip3 install -r requirements.txt

After installation, run the program with the -t parameter followed by the site of interest. Add the -d parameter so that the found documents are downloaded: a folder named after the site will be created in the program's directory, and all documents will be saved there.

Example:

python3 msdorkdump.py -d -t gijn.org

Sites for finding documents

In most cases, Google dorks and the utilities listed above are enough for a thorough document search. But my story would not be complete without mentioning search sites, which some may find more convenient. There are quite a few websites of this kind, so I will show only those that seem the most sensible to me.

https://intelx.io/tools?tab=file

This is a resource for those who are too lazy to write dorks by hand. You enter a search query, select the desired extensions, and get Google results for that query.

https://cartographia.github.io/FilePhish/

The same as the previous one, except that it applies the dorks to the site you enter.

https://find-pdf-form.pdffiller.com/

This site searches for pdf files matching a given query. Its main feature is that a found document can be opened immediately in a convenient online editor, which has search, annotations, text highlighting in different colors, drawing on the document, and other similar useful features.

Well, a few more useful sites for finding files that may come in handy:

https://www.dedigger.com/ - Looks for public files in Google Drive. You can select the desired extension.

https://www.pdfsearchengine.net/ - Searches for pdf files using Google CSE (Custom Search Engine). A useful feature: in the list of results you can click "Structured data" and see detailed information about the found document.

https://www.searchftps.net/ - Searches for files on ftp servers. The found results can be downloaded immediately.

Data extraction and processing

Suppose we have found the documents we need. It is great if that is a single page with exactly the information we want, but this is rarely the case. I would even say it almost never is. The resulting document may have hundreds of pages with a lot of extra data. All of this has to be processed, the relevant parts picked out, and then processed again in light of the data already obtained. Of course, nobody canceled Ctrl+F. It is fine when you need to find something quickly, preferably in a well-structured document, but for large documents it is not the best solution. And if you get a scanned copy of a document or just a photo, it will not work at all. Also, data is often published in, to put it mildly, not the most readable form. Why, I do not know: either the administrator was careless, or it was done deliberately to complicate the search.

Therefore, we will analyze all sorts of useful things and utilities that may come in handy when working with documents.
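Before moving on to ready-made services, here is what a slightly smarter Ctrl+F could look like for a large document. It is a rough sketch that assumes the text has already been extracted and split into pages (a list of strings). The function name find_mentions and its parameters are illustrative.

```python
import re


def find_mentions(pages, pattern, context=30):
    """Return (page_number, snippet) for every regex match in the pages.

    `pages` is a list of page texts, `pattern` is a regular expression,
    `context` is how many characters to keep on each side of a match.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for number, text in enumerate(pages, start=1):
        for match in regex.finditer(text):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            # Collapse line breaks and repeated spaces inside the snippet.
            snippet = " ".join(text[start:end].split())
            hits.append((number, snippet))
    return hits
```

Unlike a plain text search, this returns every hit at once together with its page number and surroundings, and a regular expression lets you cover spelling variants of a surname in a single pass.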

Google Pinpoint

https://journaliststudio.google.com/pinpoint/

This is a fairly simple but extremely useful Google tool for working with documents and their contents. It is especially useful with voluminous documents or when there are a lot of them. Using it is simple and straightforward, yet very effective. First you create a new workspace and add all the necessary documents to it. When uploading and processing are finished, you can start studying them. Let's go over the main features.

  • Uploading files. Pinpoint is almost omnivorous in terms of formats. You can upload pdf, images, office formats, web pages, text documents, audio and video files. If a file is large, it will be split into several parts.
  • Recognition. If we upload an image with text on it, that text will be recognized. Likewise, if an uploaded document contains images, Pinpoint will recognize the text in them too. Searches then cover the recognized text as well. Note that it recognizes any text in a photo at all, even a sign or an inscription on a wall. This feature can be used as a standalone tool: for example, upload a bunch of photos and search by a name to find photos taken in the same or similar places.
  • Audio and video. If we upload an audio file, the speech will be converted to text, which can be searched and also downloaded separately. For an uploaded video, a text version will likewise be created in the workspace.
  • Search. The search bar is at the top of the window. Searching through it covers all uploaded files. There is another nice feature here: Pinpoint understands abbreviations and can expand them. For example, if we enter “osint” in the search box, the search will also include the phrase “open source intelligence”. It also sometimes manages to find synonyms for the requested word.
  • Search within a document. If we open a specific document, we can search its content. Any text found, including text recognized in a photo, can be highlighted in color, copied, or given a separate link pointing to that exact fragment.

Additional tools for document analysis

https://diffnow.com/compare-clips

Lets you compare two texts line by line to find the differences. It is very situational, but sometimes it comes in handy. You can also compare files or sites.
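The same line-by-line comparison can be done offline, which matters if you do not want to paste the texts into a third-party site. This is a minimal sketch based on Python's standard difflib module. It is not how DiffNow works internally, just an equivalent of the idea, and the function name changed_lines is my own.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return the lines that differ between two versions of a text.

    Removed lines are prefixed with "- ", added lines with "+ ".
    """
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); skip those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

This is handy, for example, for two versions of the same document downloaded at different times: only the edited lines are left.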

https://voyant-tools.org/

A website for lexical analysis of text. You can upload a document or just add a link. As a result, you get a general summary of how often words and phrases are used in the text. You can also select any word in the text and see how often and where else it is mentioned. This is also very situational, but it can be useful when studying large texts.
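The core of such frequency analysis is easy to reproduce locally. Here is a simplified sketch that covers only word counting and is in no way a replacement for Voyant Tools. The function name word_frequencies and the min_length cutoff are illustrative choices.

```python
import re
from collections import Counter


def word_frequencies(text, min_length=3, top=10):
    """Return the most frequent words of a text as (word, count) pairs.

    Words shorter than `min_length` are ignored to drop most
    prepositions and articles.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    return counts.most_common(top)
```

In a large document, the words at the top of the list quickly show what the text is really about and which names or organizations come up most often.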

https://online.sodapdf.com/

This is an online pdf editor with a built-in converter to different formats. You can also search within a document, compare two documents, annotate text, select fragments, and so on. In short, just the thing for lovers of online tools.

Conclusion: documents and working with them

Well, we can end there. Once you have mastered the tools listed above, you can comfortably search for the documents you need and pull the information of interest out of them. A few words should be added about that information, though. It is good that you found a document, but it still needs to be studied properly, picking out both the information needed now and the information that may be useful later. This skill is developed only through practice.

For example, when studying a document, it is a good habit to note and write down what is commonly called "immutable data". This includes personal data, names, dates, and the facts of an event (place, time, what happened, participants). All of this should be recorded even if it does not seem necessary at the moment. The point is that the more information you collect about something, the more your understanding of the event and its circumstances changes. At some point, what you thought was not needed can turn out to be very useful. And if you did not write it down in time or did not note where you saw it, finding it again can be difficult or slow.

Well, now that really is all.

Your Pulse.
