To read Git blob objects, we need to know that Git compresses its stored objects with zlib. We can use the classes in java.util.zip to decompress a Git blob.
Code Snippets
Below are two methods of decompressing the blob file. If you do some research online, you will find many examples showing Method 1.
I recommend Method 2 instead, as it does not assume the size of the decompressed file. This method is also used in the Apache Commons library.
Make sure the binary file is not corrupted, or inflate() will throw java.util.zip.DataFormatException (stream wrappers such as InflaterInputStream report the same problem as java.util.zip.ZipException).
Method 1
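The result byte array is given a fixed size up front. If the decompressed content is larger than that, a single inflate() call only fills the array and the rest is never read, so this method only works when you know the decompressed size in advance.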
String file = "<PATH to Git blob>";
byte[] fileBytes = Files.readAllBytes(Paths.get(file));
Inflater decompresser = new Inflater();
decompresser.setInput(fileBytes, 0, fileBytes.length);
byte[] result = new byte[1024]; // Fixed size: output beyond 1024 bytes is not read
int resultLength = decompresser.inflate(result); // Number of bytes actually written to result
decompresser.end();
Method 2: Checks whether the end of the compressed data is reached
This method inflates the file content in 1024-byte chunks and writes each chunk to a ByteArrayOutputStream until the inflater reports that it is finished.
String file = "<PATH to Git blob>";
byte[] fileBytes = Files.readAllBytes(Paths.get(file));
Inflater inflater = new Inflater();
inflater.setInput(fileBytes);
ByteArrayOutputStream outputStream = new ByteArrayOutputStream(fileBytes.length);
byte[] buffer = new byte[1024];
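while (!inflater.finished()) {
    int count = inflater.inflate(buffer); // Throws DataFormatException on malformed data
    if (count == 0 && inflater.needsInput()) {
        break; // Input is truncated: stop instead of looping forever
    }
    outputStream.write(buffer, 0, count);
}
inflater.end(); // Release the native zlib resources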
byte[] result = outputStream.toByteArray();
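Once the bytes are inflated, a loose Git object starts with a small header of the form "<type> <size>" followed by a NUL byte, and the file content comes after it. As a quick cross-check from a script, here is a minimal sketch in Python (not Java) that inflates in chunks like Method 2 and then splits off that header. The function name read_git_object and the in-memory sample are my own assumptions for illustration; a real blob would be read from a file under .git/objects.

```python
import zlib

def read_git_object(compressed: bytes) -> tuple[str, bytes]:
    """Inflate a loose Git object and split its header from its content."""
    inflater = zlib.decompressobj()
    chunks = []
    # Feed the input in small slices so no output size is assumed up front.
    for start in range(0, len(compressed), 1024):
        chunks.append(inflater.decompress(compressed[start:start + 1024]))
    chunks.append(inflater.flush())
    raw = b"".join(chunks)

    # Header looks like b"blob 6", then a NUL byte, then the content.
    header, _, content = raw.partition(b"\x00")
    object_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return object_type.decode("ascii"), content

# Build a sample object in memory instead of reading from .git/objects.
sample = zlib.compress(b"blob 6\x00hello\n")
print(read_git_object(sample))
```

The size check is a cheap way to notice a truncated or corrupted file before using the content.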
Performing source code review is one of the important skills that you should pick up as an Application Security engineer. Being a polyglot in programming is helpful because you might be reviewing source code written in different languages. Right now, I seldom see anyone mention a systematic approach to reading source code when you are new to the codebase or the language.
Reading source code is an underrated skill in today’s programming education. Often when we want to learn programming, we are advised to build projects and write more code. However, another aspect of learning programming is reading the code of other programmers.
This is why I think “The Programmer’s Brain” is one of the more insightful programming books to read, because it discusses different aspects of being a programmer:
– How to get better at reading code?
– How to get better at thinking about code?
– How to get better at writing code?
Get better at reading code!
Cognitive Processes in relation to programming
We can model our cognitive processes with the diagram below:
Hermans, F. (2021). Programmer’s brain: What every programmer needs to know about cognition. Manning.
At a high level, when we start reading the code, information relating to the code enters our Short-term memory. Think of the times when you hold some information for a short period, such as a phone number or a quick task to complete. This is short-term memory in use.
Then, when we are thinking about and interpreting the code, the information moves from our Short-term memory into our Working memory. The information is processed in our Working memory to generate new understanding / meaning.
Our Working memory also retrieves and connects information from our Long-term memory to process the information. Think of our Working memory as a melting pot. For example, we can recall certain syntax patterns from our Long-term memory. When we are reading JavaScript code, we know that console.log(...) will print out information because of our Long-term memory.
console.log(message)
In short, when we are reading code, different cognitive processes are engaged to comprehend it and to perform later actions such as changing the code or adding new code.
Why do we get confused when reading code?
1) Lack of knowledge
Sometimes you get confused when reading source code because you are unfamiliar with the language syntax or concepts. Or you might be unfamiliar with the specific industry / domain knowledge that the code is written for. In this context, we say that you lack the knowledge to understand the code.
For example, consider a snippet of Racket code. If you do not understand LISP-like code or the concept of a lambda, then you will not know how to evaluate ((double inc) 5).
To understand what is going on:
– You need to know the syntax of a function in LISP.
– You need to understand that you can pass a function into another function.
– In this case, the function inc is passed to the function double. Inside double, the inc function is applied twice.
– Hence ((double inc) 5) evaluates to 7.
In terms of cognitive processes, we are unable to retrieve any information from our Long-term memory that our Working memory can use to evaluate the code.
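If Racket is unfamiliar, the same idea may be easier to follow in a language you already know. Here is a rough Python equivalent, assuming that inc adds one to its argument and that double takes a function and returns a new function that applies it twice (the names mirror the expression above; the bodies are my assumption).

```python
def inc(n):
    # Add one to the argument.
    return n + 1

def double(f):
    # Return a new function that applies f twice.
    return lambda x: f(f(x))

# Equivalent of ((double inc) 5): build the "apply inc twice" function, then call it.
result = double(inc)(5)
print(result)  # 7
```

Reading this version relies on knowledge you already have in Long-term memory, which is exactly what is missing when the Racket syntax is new to you.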
2) Lack of information
Here you are unfamiliar with how a particular method works, or with the purpose of a class, because that information cannot be retrieved directly from the code in front of you.
For example, we might see a Python function that seems to filter a group of members by name, but we do not know exactly how the name is going to be matched. Hence we might guess, or search the codebase for additional information.
We are temporarily confused because we cannot understand the function immediately from the code itself.
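To make this concrete, here is a hypothetical function of that kind. Everything in it (Member, matches, filter_members, and the prefix rule) is invented for illustration. Reading filter_members alone, you cannot tell whether names are compared exactly, by prefix, or case-insensitively; you have to go and find matches.

```python
from dataclasses import dataclass

@dataclass
class Member:
    name: str

def matches(candidate: str, query: str) -> bool:
    # Hypothetical helper: the actual rule (case-insensitive prefix) is only visible here.
    return candidate.lower().startswith(query.lower())

def filter_members(members, name):
    # From this line alone we cannot tell how names are compared.
    return [m for m in members if matches(m.name, name)]

members = [Member("Alice"), Member("alina"), Member("Bob")]
print([m.name for m in filter_members(members, "al")])  # ['Alice', 'alina']
```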
While coding a program, we might forget how a particular concept works or what its syntax is. Often, we will just quickly google for an answer (either from the documentation or Stack Overflow).
The author argues that it is better to know some of this syntax by heart rather than googling for answers all the time. First, googling is distracting, as you might be tempted to do something else once you are in your browser (especially with multiple tabs open). Second, you need to be able to recall the language syntax from your long-term memory in order to chunk the code that you are reading effectively.
Our long-term memories will decay in a pattern similar to a forgetting curve. If we don’t recall the things that we learned in a specific period of time, we will forget things. But if we try to recall the memories at a specific time, the decay will be slowed down.
We know this very well from cramming for an exam: we might forget everything a few days after the exam is over. Without future reminders, we will struggle to retrieve a concept or syntax from our long-term memory.
It is not efficient to recall every concept every single day. Instead, we can use spaced repetition system (SRS) software to schedule the reminders for us based on how well we think we can recall a particular concept.
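To give a feel for how such scheduling can work, here is a deliberately simple sketch. It is not the algorithm of any particular SRS tool, which are usually more elaborate. The rule is my own assumption: double the gap after a successful recall, and reset it to one day after a miss.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Card:
    prompt: str
    interval_days: int = 1
    due: date = date(2024, 1, 1)

def review(card: Card, today: date, recalled: bool) -> Card:
    # Double the gap after a successful recall, reset it after a miss.
    interval = card.interval_days * 2 if recalled else 1
    return Card(card.prompt, interval, today + timedelta(days=interval))

card = Card("JavaScript: what does Array.prototype.filter return?")
card = review(card, date(2024, 1, 1), recalled=True)  # next gap: 2 days
card = review(card, card.due, recalled=True)          # next gap: 4 days
card = review(card, card.due, recalled=False)         # missed: back to 1 day
print(card.interval_days, card.due)                   # 1 2024-01-08
```

Cards you know well drift further apart, while the ones you keep missing come back the next day.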
Note: I have tried a spaced repetition system and struggled to integrate it into my everyday workflow. It’s not that the SRS methodology does not work. Rather, it takes commitment to adopt SRS well and make it a habit to go through your deck every day. I will keep trying and see how we can integrate SRS better into our lives.
Adopt a spaced repetition system (SRS) and add new knowledge that you have to your deck. This can be a situation where you are learning a new programming language, a new concept or framework. You will need to use judgments to know what concepts or syntax need to be included as you can’t add everything you learned into the deck.
Adding a new card on Javascript’s array prototype filter method
Also, when you find yourself googling for an answer, add a new card to your deck. The search shows that you do not yet know the concept or syntax by heart. For example, most programmers know how to write a for-loop in their language. If they do not, it means that they have not learned the language deeply yet.
Another way to recall syntax better is to actively think about the concepts that you are studying. It is easier for us to recall something if it is related to something that we already know. When we relate a new concept / syntax to our existing knowledge, we have a better chance of recalling it in the future. Some questions to think about:
Think about and write down the concepts that you think are related to this new concept or syntax.
In what ways are they similar and different?
Think of variants of code that can achieve the same goals as this new concept or syntax.
How important is this concept or syntax to the language, framework or codebase etc.?
Reducing cognitive loads when reading complex code
Refactoring code temporarily
Replacing unfamiliar language constructs
Adding the concepts that you are confused about to SRS.
Working memory aids
Create Dependency graph
Using a state table
Think about code better!
Reaching deeper understanding of the codebase
Get better at solving programming problems
Avoiding bugs (misconceptions in thinking)
Write better code!
Naming things better
Name moulds
Feitelson’s three-step model
Avoiding code smells and cognitive loads
22 code smells from Martin Fowler’s Refactoring
Arnaoudova’s six linguistic antipatterns
Get better at solving complex programming problems
Automatization
Learn from code and its explanation
Germane Load
Practices that you can do
A) Read different codebases and attempt to understand what each piece of code is doing.
1. Choose a code base to read
Choose a codebase where you have at least some knowledge of the programming language. You should have a high level understanding of what the code does.
2. Select a code snippet and study it for two minutes.
Choose a method, function or coherent code that is about half a page or maximum 50 lines of code.
3. Reproduce the code on paper or in an IDE (new file).
4. Reflect on what you have produced.
Which lines do you find easy and which lines are difficult?
Are the difficult lines unfamiliar to you because of the programming concepts or because of the domain knowledge?
Besides learning about Python asyncio, I have been learning JavaScript for web development for the last few weeks. My methodology is to consume knowledge from multiple resources (books, blogs and MOOCs).
Why use multiple resources that explain the same concepts? If you use different resources, you will be exposed to each concept in different contexts. This is especially useful for beginners, so that they do not get stuck in one context. You need to understand the concept in different situations.
Also, feel free to modify the tutorial steps. Add in anything that is interesting. Apply previously learned knowledge to the tutorial. Combine two different concepts. In short, be active in experimenting.
When you follow a tutorial that builds something, modify it and make it your own. After that, comment every function in a fashion that someone who never has seen your code before understands it.
If you are starting to learn JavaScript from the basics, please use this course. I find that there is a balanced mix of explanation and practical usage of concepts.
One particularly useful thing is the set of challenges that the course instructor gives to the students. After the instructor demonstrates a practical concept, you are expected to complete a variant of the demo yourself.
Eloquent JavaScript
Disclaimer: I cannot give a complete review since I completed only the earlier chapters (1-7). In future, I will read the remaining chapters again. The book introduces foundational programming knowledge. If you are new to programming, you can consider reading chapter 1 – 7 to learn the fundamentals. The chapter on different JavaScript built-in functions for Arrays (e.g. forEach(..), filter(...) and map(...)) was useful later on when I studied with the other MOOCs.
I also advise beginners to try the few challenges that are available at the end of each chapter. I consolidated my understanding by doing these challenges. Some of the challenges may be difficult, so you should feel free to refer to the solution code (this is not school).
Before you take this React course, I suggest that you take Modern JavaScript Bootcamp. At the same time, you should create a few demo web applications. If you want to learn web development, then React is one of the JS frameworks that you need to learn. Why? Because of its wide adoption. Two things that I like about React are its rendering speed and JSX (JavaScript XML).
Food for Thought: React seems like a powerful framework that allows the application to process and compute data on the client side. Does this mean that more applications will start performing business logic workflows on the client side and forget about backend validation?
Suppose your program is trying to execute a series of tasks (1 – 6). If each task takes different time to complete, then your program will need to wait for each task to be completed sequentially before it can proceed.
Asyncio will be useful in such scenarios because it enables the program to continue running other tasks while waiting for the specific task to be completed. In order to use Asyncio, you will need to use compatible libraries. For example, instead of using requests (a popular HTTP library in Python), you will use aiohttp.
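As a small illustration of the difference, here is a sketch that runs six simulated tasks first one after another and then concurrently. asyncio.sleep stands in for waiting on a network response, and the task names and delays are made up for the demo.

```python
import asyncio
import time

async def task(name: str, seconds: float) -> str:
    # asyncio.sleep stands in for waiting on a network response.
    await asyncio.sleep(seconds)
    return name

async def run_sequentially(delays):
    # Each task must finish before the next one starts.
    return [await task(f"task-{i}", d) for i, d in enumerate(delays, 1)]

async def run_concurrently(delays):
    # All tasks wait at the same time; gather keeps the input order.
    return await asyncio.gather(*(task(f"task-{i}", d) for i, d in enumerate(delays, 1)))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

delays = [0.05, 0.1, 0.15, 0.05, 0.1, 0.15]  # six tasks with different durations
seq_result, seq_time = timed(run_sequentially(delays))
con_result, con_time = timed(run_concurrently(delays))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential run takes roughly the sum of the delays, while the concurrent run takes roughly as long as the slowest task. The gain only appears when the tasks spend their time waiting, not computing.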
You want to speed up a specific part of your program where you are running a list of tasks sequentially for large-N items.
Suppose you are making API calls based on a list of different values for a parameter. You can use asyncio and aiohttp to make those API requests concurrently.
You do not need to change your entire program to use async/await syntax. Try to observe which part of the program is a bottleneck and explore how asyncio can improve performance on this particular flow.
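Here is a minimal sketch of that shape: the surrounding program stays synchronous, and only the slow step is wrapped in a co-routine and driven by asyncio.run. call_api is a stand-in that sleeps instead of making a real request (in practice its body would use an aiohttp session), and the function names and parameter values are assumptions for the demo.

```python
import asyncio

async def call_api(value: str) -> dict:
    # Stand-in for an aiohttp request; the sleep simulates network latency.
    await asyncio.sleep(0.05)
    return {"param": value, "status": 200}

async def fetch_all(values):
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(call_api(v) for v in values))

def load_values():
    # Ordinary synchronous code stays as it is.
    return ["arsenal", "chelsea", "everton"]

def summarise(responses):
    return {r["param"]: r["status"] for r in responses}

values = load_values()
responses = asyncio.run(fetch_all(values))  # only the bottleneck is asynchronous
print(summarise(responses))
```

Keeping the asynchronous part behind one function call makes it easy to measure whether it actually helps, and easy to revert if it does not.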
Example: Crawling Wikipedia for info on Football (Soccer) Clubs
In this demo, we are going to perform the list of tasks below:
Read the list of football clubs from a csv file.
Get the Wikipedia URL of each Football club.
Get the Wikipedia HTML page of each Football club.
Write the HTML page into a HTML file for each Football Club.
From the HTML page, we need to parse for information (Full Name, Ground, Founded Date etc.) using BeautifulSoup library.
For the information of each club, we want to append the information into a Dataframe.
Finally, print out the Dataframe to see if the information is correct.
See the synchronous example and the asynchronous example in my GitHub repo. If we execute both scripts, we can see an estimated difference: the asyncio version completes about 20-30% faster.
Execution time for Asyncio : 17.885913610458374
Execution time for Synchronous: 23.075875997543335
Refactor Tips
As a practice, a co-routine main is often defined and run in an event loop (e.g. asyncio.run(main())). Then, inside the main co-routine, all the other co-routines are awaited.
If the requests have a consistent response time, then you should stick to the synchronous approach. For example, if you are using Pandas, then you should use apply() with a function. For the parts of the program that are bottlenecks, you should try asyncio to see if the speed improves.
Key Terms
Event Loop. You must use an event loop to run the co-routine.
async / await. This is the syntax for defining co-routines in Python. You declare a co-routine by writing async def instead of def. await is used inside a co-routine: in the example below, it suspends foo() until do_something() has finished and lets the event loop run other work in the meantime. Make sure that do_something() is also a co-routine (or another awaitable).
async def foo():
    x = await do_something()
    return x
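To actually run foo(), it needs an event loop and a do_something() to await. Here is a minimal runnable version. The body of do_something (a short sleep and the return value 42) is invented for the demo.

```python
import asyncio

async def do_something() -> int:
    await asyncio.sleep(0.01)  # yield control to the event loop while "waiting"
    return 42

async def foo() -> int:
    x = await do_something()
    return x

async def main() -> int:
    # The top-level co-routine awaits the others.
    return await foo()

value = asyncio.run(main())  # asyncio.run creates the event loop and runs main()
print(value)  # 42
```

Note that calling foo() without await does not run it; it only creates a co-routine object that still has to be awaited or scheduled.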
I plan to work through two technical books (Black Hat Python and Black Hat Go).
One of my motivations for going through these books is to understand how to build tools for content discovery and brute-forcing. I would also like to develop my Python scripting skills further.
In Black Hat Python, the sample code for the chapters is in Python 2. I decided to convert the Python 2 code to Python 3. I will also use libraries such as requests to replace some of the steps that were performed by urllib and urllib2.
Here are some sample projects from Black Hat Python that were converted to Python 3:
Web Application Mapper
Once you have identified the open source technology used by the target web app, you can download its source code to a local directory. The mapper will send requests to the target and spider it using the directory and file names found in the open source code.
The script uses the known directories of the particular open source technology to map out the attack surface of the web app
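The first half of the mapper, turning a local copy of the source tree into a list of URL paths to request, can be sketched as follows. This is my own minimal version, not the book's code. collect_paths and the SKIP_EXTENSIONS filter are assumptions, and the demo builds a throwaway directory instead of using a real download.

```python
import os
import tempfile

SKIP_EXTENSIONS = {".jpg", ".gif", ".png", ".css"}  # assumed filter, adjust to taste

def collect_paths(source_root: str) -> list[str]:
    """Turn every file under source_root into a URL path such as /admin/login.php."""
    paths = []
    for directory, _, filenames in os.walk(source_root):
        for filename in filenames:
            if os.path.splitext(filename)[1].lower() in SKIP_EXTENSIONS:
                continue  # static assets rarely reveal anything interesting
            relative = os.path.relpath(os.path.join(directory, filename), source_root)
            paths.append("/" + relative.replace(os.sep, "/"))
    return sorted(paths)

# Demo against a throwaway directory instead of a real download.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "admin"))
    for name in ("index.php", "logo.png", os.path.join("admin", "login.php")):
        open(os.path.join(root, name), "w").close()
    demo_paths = collect_paths(root)

print(demo_paths)  # ['/admin/login.php', '/index.php']
```

Each collected path would then be appended to the target's base URL and requested, for example with requests.get.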
Content Brute Forcing
In cases where you do not know the exact technology stack, you will need to brute force using a common word list. The word list can contain common directory and file names. In the book, the script allows extension brute forcing as well. I have added a filter that lets the script display only the responses that have specific status codes (e.g. 200).
Notice that only responses with status 200 are displayed?
A common workflow that we can observe from these tooling scripts:
A word list or list of test cases is generated or taken from open source. These are added to the queue.
A filter or specific information list is given based on what we are interested in during our recon.
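The workflow above can be sketched as a small queue-and-workers script. To keep it runnable without touching a real target, fetch_status looks paths up in a made-up FAKE_SITE table. In a real tool that function would send an HTTP request (for example with requests) and return the response status code. All names here are my own, not the book's.

```python
import queue
import threading

# Stand-in for the target: maps a path to the status code it would return.
FAKE_SITE = {"/admin": 200, "/admin.php": 403, "/backup.bak": 200}

def fetch_status(path: str) -> int:
    return FAKE_SITE.get(path, 404)

def build_queue(words, extensions):
    # Step 1: the word list (plus extension variants) goes into the queue.
    work = queue.Queue()
    for word in words:
        work.put(f"/{word}")
        for ext in extensions:
            work.put(f"/{word}{ext}")
    return work

def worker(work, wanted, hits, lock):
    while True:
        try:
            path = work.get_nowait()
        except queue.Empty:
            return  # nothing left to test
        status = fetch_status(path)
        # Step 2: keep only the responses we are interested in.
        if status in wanted:
            with lock:
                hits.append((path, status))

def brute_force(words, extensions, wanted, threads=4):
    work, hits, lock = build_queue(words, extensions), [], threading.Lock()
    pool = [threading.Thread(target=worker, args=(work, wanted, hits, lock))
            for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sorted(hits)

found = brute_force(["admin", "backup", "login"], [".php", ".bak"], wanted={200})
print(found)  # [('/admin', 200), ('/backup.bak', 200)]
```

Passing a different set such as {200, 403} as wanted changes what is reported without touching the rest of the script.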