How to Search a Codebase with Python

In this post, I will show you how to search an entire directory tree for keywords in its files. Once we find the files with offending code, we will build a list of them so we can identify the worst offenders and focus on those.

Iterating the File Tree

First, we will collect the paths of the files we are interested in. The directory to search is defined at the top, followed by an exclusion list. The exclusion list keeps out files that aren’t relevant to our platform; the node_modules folder is a prime example of one we don’t want to include.

We initialize the file_paths list outside the loop because we will iterate over it later.

os.walk does the hard work of iterating over each of the directories. For every directory it visits, it yields the current root directory path (updated as we go), a list of subdirectory names, and a list of file names.

We ignore the subdirectory names because we only want the paths to files. Collecting them means iterating over the list of file names in each directory, checking that each one has an appropriate extension, and making sure its directory path doesn’t contain anything from our exclusion list.

import os

walk_dir = 'C:\\Directory\\ForWalking\\'

# If the directory path contains any of these, we don't want its files
#    e.g. C:\\Directory\\ForWalking\\node_modules will be ignored
exclusions = ["node_modules", "SolutionFiles", ".bin", "Test"]

# List to store all our file paths
file_paths = []

# Iterate the file tree
for root, sub_dirs, file_names in os.walk(walk_dir):

    # Iterate the file names in the directory
    for file_name in file_names:

        # We are only interested in TypeScript and JS files
        #   (endswith accepts a tuple of suffixes)
        if file_name.endswith((".ts", ".tsx", ".js")):

            # If the directory path doesn't contain
            #   anything from the exclusion list
            if not any(exclusion in root for exclusion in exclusions):
                file_paths.append(os.path.join(root, file_name))
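
As a possible refinement, os.walk lets you edit the subdirectory list in place so it never descends into folders you want to skip, which can save a lot of time on something like node_modules. The sketch below is one way that might look; the function name collect_paths and its extensions parameter are my own, not part of the script above. Note that it matches directory names exactly, whereas the script above excludes any path that merely contains one of the strings (so "Test" there also catches "Tests").

```python
import os

def collect_paths(walk_dir, exclusions, extensions=(".ts", ".tsx", ".js")):
    """Collect matching file paths, skipping excluded directories entirely."""
    file_paths = []
    for root, sub_dirs, file_names in os.walk(walk_dir):
        # Editing sub_dirs in place tells os.walk not to descend into the
        # removed directories (this relies on the default topdown=True).
        sub_dirs[:] = [d for d in sub_dirs if d not in exclusions]

        for file_name in file_names:
            if file_name.endswith(extensions):
                file_paths.append(os.path.join(root, file_name))
    return file_paths
```

Calling collect_paths(walk_dir, exclusions) would then stand in for the loop above, under the exact-match assumption just described.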

Search Through the Files

Next, we will go through all the files we found and see whether they contain any of the offending code. We create a list of all the methods we want to search for. Any occurrences of these will need to be updated later.

We will visit each of the files we found in the last step. For each one, we open it and count the occurrences of each offending method. If we find any, we add an entry to the occurrences list with the path, the unsupported method, and the number of times it appeared.

import csv

# Occurrences will track each time an offending bit of code is found
# Its format will be:
#   File Path, Method, Num Occurrences
occurrences = []

# Methods that need to be updated
nogos = [
    ".SetFocus(",
    ".IsValid(",
    ".Clear",
    ".IsDirty",
    ".RemoveItem",
    ".SetTime",
]

# Iterate previously collected file paths
for file_path in file_paths:

    # Open the file as read only, ignoring undecodable characters
    with open(file_path, 'r', encoding='utf8', errors='ignore') as f:
        contents = f.read()

    # Check each of the offending code bits
    for string in nogos:
        count_nogo = contents.count(string)

        # If the offending code is in the file, append it to
        #   the occurrences list
        if count_nogo > 0:
            occurrences.append([file_path, string, count_nogo])

# Write the rows to a CSV file; the csv module quotes any
#   field (such as a path) that contains a comma
with open("Outfile.csv", 'w', newline='', encoding='utf8') as f:
    csv.writer(f).writerows(occurrences)
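
Since the goal is to find the worst files and focus on those, it may help to total the counts per file and sort them. Here is a minimal sketch of that idea, assuming rows in the same path, method, count format collected above; the rank_files helper and the sample rows are made up for illustration.

```python
from collections import Counter

def rank_files(rows):
    """Return (file_path, total_hits) pairs, highest total first."""
    totals = Counter()
    for file_path, _method, count in rows:
        # int() accepts the count whether it was stored as a number or a string
        totals[file_path] += int(count)
    return totals.most_common()

# Made-up sample rows, only to show the shape of the data
sample = [
    ["a.ts", ".SetFocus(", "3"],
    ["b.ts", ".Clear", "1"],
    ["a.ts", ".IsDirty", "2"],
]

for path, total in rank_files(sample):
    print(path, total)
```

With the sample rows, a.ts comes out on top with 5 hits, followed by b.ts with 1.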

Results

Using this short bit of scripting saved me from having to open 150+ files of code. Instead, it found the 26 files with offending code so I could focus on those. I was also able to give my manager a better idea of the scope of the project in just a few minutes rather than several days.

For those who may look over the code and point out that this could have been done in fewer steps using map, filter, and reduce: you are right! Not every situation needs polished code, though. This is a great example of how, with relatively little Python experience, anyone can save time at their job.
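
For the curious, here is roughly what a more compact version might look like. This is only a sketch: it leans on pathlib and comprehensions rather than literal map, filter, and reduce calls, and the find_occurrences name and its parameters are hypothetical, not something from the script above.

```python
from pathlib import Path

def find_occurrences(walk_dir, exclusions, nogos, extensions=(".ts", ".tsx", ".js")):
    """Return [path, pattern, count] rows for every pattern found in a file."""
    # Lazily select files with a wanted extension whose directory path
    # contains none of the exclusion strings
    paths = (
        p for p in Path(walk_dir).rglob("*")
        if p.is_file()
        and p.name.endswith(extensions)
        and not any(ex in str(p.parent) for ex in exclusions)
    )

    rows = []
    for p in paths:
        contents = p.read_text(encoding="utf8", errors="ignore")
        # One row per pattern that appears at least once in this file
        rows.extend([str(p), s, contents.count(s)] for s in nogos if s in contents)
    return rows
```

It is shorter, but whether it is easier to read than the step-by-step loops is a matter of taste.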

Thanks for reading!

