Do you have the Software Engineer and Data Scientist skills?

Becoming a reliable software engineer or data scientist and preparing for production-level coding requires practicing a few techniques:

Writing clean and modular code
Code refactoring
Writing efficient code
Adding meaningful documentation
Testing
Logging
Code reviews

These are all essential skills to develop and will help when implementing production solutions. Additionally, data scientists often work side-by-side with software engineers, and it is necessary to work well together. This means being familiar with standard practices and being able to collaborate effectively with others on code.

Clean and Modular Code

When data scientists first start coding in industry, they often struggle to write clean and modular code, even if they have been coding for years. In industry, the code you write could potentially be used in production. Production code is software running on production servers to handle live users and data of the intended audience. For example, when you use software products on a laptop, such as Microsoft Office, Google, or Amazon, the code running those services is called production code. Ideally, code used in production should meet several criteria to ensure reliability and efficiency before it becomes public. First, the code needs to be clean. Code is clean when it is readable, concise, and simple.

Here is an example in plain English of a sentence that is not clean.

One could notice that your pants have been sullied, due to the pink color of your pants that appears to be similar to the color of a certain kind of juice.

This sentence is redundant and convoluted. Just reading it is overwhelming. It can be rewritten as:

It looks like you spilled strawberry juice on your pants.

That sentence accomplishes the same thing, yet it is much more concise and clear.

Clean code is a characteristic of production-quality code that is crucial for collaboration and maintainability in software development. Writing clean code is very important in an industry setting because you work on a team that is continually iterating over its work, and clean code makes it much easier for others to understand and reuse. In addition to being clean, the code should also be modular, meaning it is logically broken up into functions and modules. Modularity is another essential characteristic of production-quality code, because it makes code more organized, efficient, and reusable. In programming, a module is just a file. Just as you can encapsulate code in a function and reuse it by calling the function in different places, modules allow code to be reused by encapsulating it in files that can be imported into other files.

To get a better understanding of what modular code is, try to think of it as putting clothes away. We could just put all our clothes in a single container, but it would not be easy to find anything, especially if there are multiple versions of the same shirt or socks. It would be much better if we had a drawer for T-shirts, another one for shirts, and another for socks. With this design, it is much easier to tell someone else how to find the right shirt, pants, or pair of socks. The same is true of writing modular code.

Splitting code into logical functions and modules makes it possible to find relevant pieces of code quickly. Pieces of code that can be generalized should be reused in different places, to avoid writing unnecessary extra lines. Abstracting details into functions and modules also improves the readability of the code. Programming in a way that makes code easier for a team to understand and iterate on is therefore crucial for production.
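
As a minimal sketch of this idea, the snippet below puts a repeated cleaning step into one small function and reuses it for two different lists. The function names and the column labels are hypothetical and chosen only for illustration.

```python
# Hypothetical example: the names and data here are illustrative only.

def clean_label(label):
    """Return a label in lowercase with spaces replaced by underscores."""
    return label.strip().lower().replace(" ", "_")


def clean_labels(labels):
    """Apply clean_label to every label in a list."""
    return [clean_label(label) for label in labels]


# The same function is reused instead of repeating the cleaning logic.
train_columns = clean_labels(["Fixed Acidity", " Residual Sugar", "pH "])
test_columns = clean_labels(["Citric Acid", "Total Sulfur Dioxide"])

print(train_columns)
print(test_columns)
```

If these two functions were saved in their own file, other files could import them, which is how a module allows the same code to be reused across a project.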

Refactoring Code

It is easy to pay little attention to writing good code when you start on a new idea or task and focus on just getting it to work. Typically, code gets a little messy and repetitive at this stage of development. Furthermore, it is hard to know the best way to write code before it is finished. For example, it can be challenging to see which functions would best modularize the steps in the code before you have experimented with it enough. Thus, going back to refactor after achieving a working model is a must.

Code refactoring is the process of restructuring code to improve its internal structure without changing its external functionality. Refactoring lets you clean and modularize code after you have produced a working version. In the short term, this might seem like a waste of time, since we could be moving on to the next feature. However, allocating time to refactoring speeds up the time it will take the team to develop code in the long run. Refactoring consistently not only makes it much easier to come back to the code later, but also allows us to reuse parts for different tasks and learn reliable programming techniques along the way. The more you practice refactoring, the more intuitive it becomes.
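
To make "without changing its external functionality" concrete, here is a small before-and-after sketch. The score bands and function names are assumptions made up for this example. The check at the end confirms that both versions return the same result for the same input.

```python
# Before refactoring: it works, but the same loop is repeated three times.
def summarize_before(scores):
    low = 0
    for s in scores:
        if s < 50:
            low += 1
    mid = 0
    for s in scores:
        if 50 <= s < 80:
            mid += 1
    high = 0
    for s in scores:
        if s >= 80:
            high += 1
    return {"low": low, "mid": mid, "high": high}


# After refactoring: the repeated logic lives in one helper function.
def count_in_range(scores, lower, upper):
    """Count the scores s that satisfy lower <= s < upper."""
    return sum(1 for s in scores if lower <= s < upper)


def summarize_after(scores):
    return {
        "low": count_in_range(scores, float("-inf"), 50),
        "mid": count_in_range(scores, 50, 80),
        "high": count_in_range(scores, 80, float("inf")),
    }


scores = [35, 50, 62, 79, 80, 95]
# The external behavior is unchanged: both versions agree.
assert summarize_before(scores) == summarize_after(scores)
print(summarize_after(scores))
```

Keeping a check like this around while refactoring is one simple way to notice if a change accidentally alters what the code does.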

Efficient Code

As part of the refactoring process, it is essential to improve the efficiency of the code in addition to making it clean and modular. There are two parts to making code efficient: reducing the time it takes to execute and reducing the amount of space it takes up in memory. Both can have a significant impact on the performance of a company or product, so it is important to practice this when working in a production environment.

However, how important it is to improve efficiency depends on context. Slow code might be acceptable in one case and not in another. For example, a batch data preparation process might not need to be optimized right away if it runs once every three days for a few minutes. On the other hand, code used to generate posts for a social media feed needs to be relatively fast, since updates happen instantaneously. Moreover, time spent refactoring to clean or optimize code after it is working is well spent, and it is important to understand how valuable this process is for a developer. Each time we optimize code, we pick up new knowledge and skills, which makes us more efficient programmers over time.
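
One possible illustration of reducing execution time is replacing repeated list lookups with a set intersection. The data below is made up for the example, and the timings will vary from machine to machine, so treat this as a sketch and not a benchmark.

```python
import time

# Hypothetical data: two collections of integer IDs.
recent_ids = list(range(0, 6000, 2))
purchased_ids = list(range(0, 6000, 3))

# Slower approach: "in" on a list scans the list for every lookup.
start = time.perf_counter()
common_slow = [i for i in recent_ids if i in purchased_ids]
slow_seconds = time.perf_counter() - start

# Faster approach: set intersection relies on hash lookups.
start = time.perf_counter()
common_fast = set(recent_ids) & set(purchased_ids)
fast_seconds = time.perf_counter() - start

# Both approaches find the same IDs.
assert set(common_slow) == common_fast
print(f"list scan: {slow_seconds:.4f}s, set intersection: {fast_seconds:.4f}s")
```

Whether a change like this is worth making depends on the context described above: for a small job that runs rarely, the slower version may be perfectly acceptable.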

Documentation

Documentation is additional text or illustrated information that comes with or is embedded in the code of the software. It helps clarify complex parts of a program, makes code easier to read and navigate, and quickly conveys how and why different components of the program or algorithm are used. Several types of documentation can be added at different levels of a program. First, line-level documentation uses in-line comments to clarify code. Second, function- or module-level documentation uses docstrings to describe purpose and details. Finally, project-level documentation uses various tools, such as a readme file, to document information on the project as a whole and how all the files work together.

In-line Comments

In languages such as Python, text following a hash symbol in code is an in-line comment. In-line comments are used to explain parts of the code and help future contributors understand it. There are different ways comments are used, and there are differences among great comments, okay comments, and even unhelpful comments. One way comments are used is to document the significant steps of complex code to help readers follow along. For example, with guiding comments on a function, future contributors do not need to understand every line of the code to follow what the function does. Comments help readers understand the purpose of each block of code, and can even help them figure out individual lines of code or methods.

However, others would argue that comments are used to justify lousy code, and that code which requires comments to follow is a sign that refactoring is needed. Still, comments are valuable for explaining what code cannot, for example, the history behind why a particular method was implemented in a specific way. Sometimes an unconventional or seemingly arbitrary approach is used because of some undefined external variable causing side effects. These things are difficult to explain with code. For instance, the numbers used for detecting edges in an image may seem arbitrary, but the programmer may have experimented with different values and found that these were the ones that worked for the specific use case.
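
The sketch below shows what such a comment might look like. The threshold values, the constant names, and the `classify_gradient` function are all hypothetical and are not taken from any real project or library. The point is only that the comment records a reason that the code itself cannot express.

```python
# Hypothetical thresholds for a simple edge detector. The numbers look
# arbitrary, so the comment records why they were chosen: in this made-up
# scenario they were tuned by trial and error on sample images.
LOW_THRESHOLD = 50
HIGH_THRESHOLD = 150


def classify_gradient(magnitude):
    """Label a gradient magnitude as 'strong', 'weak', or 'none'."""
    if magnitude >= HIGH_THRESHOLD:
        return "strong"
    # Weak edges are kept as candidates instead of being discarded here.
    if magnitude >= LOW_THRESHOLD:
        return "weak"
    return "none"


print([classify_gradient(m) for m in (20, 75, 200)])
```

A comment that only restated the code, such as "compare magnitude to threshold", would add little, while the comment on the constants explains something a reader could not recover from the code alone.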

Docstrings

Docstrings, or documentation strings, are valuable pieces of documentation that explain the functionality of a function or module. Ideally, every function should have a docstring. In Python, a docstring is conventionally surrounded by triple quotes. The first line is a brief explanation of the function’s purpose. A single-line docstring is perfectly acceptable if one line of documentation is sufficient. However, if the function is complicated enough to warrant a longer description, a more thorough paragraph can be added after the one-line summary. The next element of a docstring is an explanation of the function’s arguments: list the arguments, state their purpose, and state what types they should be. Finally, it is common to describe the output of the function. Every piece of the docstring is optional. However, docstrings are part of good coding practice¹, and they help others understand the code.
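
A docstring with the elements described above might look like the following sketch. The function and its parameters are made up for illustration, and the "Args" and "Returns" layout is just one common convention among several.

```python
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population: int. The number of people living in the area.
        land_area: int or float. The area in square kilometers.

    Returns:
        float. The population density in people per square kilometer.
    """
    return population / land_area


# The docstring is available at runtime through the __doc__ attribute.
print(population_density.__doc__.splitlines()[0])
print(population_density(1000, 40))
```

Because the docstring is stored on the function itself, tools such as the built-in `help()` can display it without the reader having to open the source file.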
