Pickling is the process of converting Python objects into a byte stream. The pickle module is used mainly to serialize and deserialize Python object structures. If a Python object needs to be saved to disk, it can be pickled before it is written to a file. Pickle can also be used to transfer Python objects from one machine to another. The module provides a ‘dump’ function that takes the object and a file object as arguments and serializes the object for us.
Suppose you are a data scientist working with data sets stored as data frames, dictionaries or any other data type. You might want to save them to a file so that you can send them to someone else later. This is where the Python pickle module comes into the picture. Whenever a Python object needs to be transferred from one machine to another, pickling and unpickling become important.
So now let’s understand what pickling and unpickling are.
What Is Pickling?
In Python, pickling is the process of converting Python objects into a byte stream.
A byte stream is a sequence of bytes that any program can use for input or output. The main idea is that the byte stream obtained from a Python object contains all the information needed to reconstruct that object later, for example in another Python script. In other words, pickling is about serializing the structure of objects.
To implement this feature, Python provides a pickle module. It is used mainly to serialize and deserialize the object structure in Python. If a Python object needs to be saved on a disk, then it can be pickled before writing it to the file, which means it will be serialized first and then stored in a file.
The image above (source: cppsecrets) shows that through serialization, Python objects are stored in a file as bytes, and that the file can be deserialized to obtain the objects again.
To serialize an object hierarchy, the pickle module provides a ‘dump’ function. We pass it the desired arguments and it serializes the object for us. The dump function looks like this:
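As a minimal sketch of the byte stream idea, the snippet below uses `pickle.dumps` and `pickle.loads`, the in-memory counterparts of ‘dump’ and ‘load’. The `settings` dictionary is a made-up sample object, not something from this article.

```python
import pickle

# Hypothetical sample object, used only for illustration.
settings = {"name": "demo", "retries": 3, "tags": ["a", "b"]}

# dumps() returns the pickled byte stream directly instead of writing it to a file.
payload = pickle.dumps(settings)
print(type(payload))  # <class 'bytes'>

# loads() rebuilds an equal, but separate, object from those bytes.
restored = pickle.loads(payload)
print(restored == settings)  # True
print(restored is settings)  # False
```

The restored object compares equal to the original but is a new object, which is what we would expect after saving it to disk and loading it in another program.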
pickle.dump(object, file_obj, protocol)
The function takes three main arguments:
The first argument is the Python object that needs to be serialized.
The second argument is the file object, opened for writing in binary mode, where the serialized Python object will be stored.
The third is the protocol. If it is not specified, pickle.DEFAULT_PROTOCOL is used (in Python 2 the default was protocol 0). As new versions of Python were introduced, they added protocols with improved features for pickling:
- Protocol version 0 is the original human-readable format.
- Protocol version 1 pickles data in the old binary format.
- Protocol version 2 came with Python 2.3 and provided much more efficient pickling of new-style classes.
- Protocol version 3 came with Python 3.0 and added support for bytes objects. Bytes objects are immutable sequences of single bytes.
- Protocol version 4 came with Python 3.4 and added support for pickling very large objects.
- Protocol version 5 came with Python 3.8 and added support for out-of-band data.
The pickle module also provides some constants:
- pickle.HIGHEST_PROTOCOL: the highest protocol version available in the running Python version. It can be passed as the protocol value to the dump function shown above.
- pickle.DEFAULT_PROTOCOL: the protocol version used when none is specified. It depends on the Python version and may be lower than the highest protocol value.
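To see how the protocol argument and the two constants behave on your own interpreter, a short sketch like the following can help. It pickles the same list with every supported protocol and records the size of each result. The printed numbers depend on the Python version, so none are shown here.

```python
import pickle

sample = [23, "Hello World", "Python"]

# Both constants are plain integers; their values depend on the Python version.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Pickle the same object with every protocol this interpreter supports.
sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(sample, protocol=protocol)
    sizes[protocol] = len(data)
    # loads() detects the protocol on its own, so it is not passed in.
    assert pickle.loads(data) == sample

print(sizes)  # protocol number -> size of the pickled data in bytes
```

Note that a pickle written with a newer protocol cannot be read by a Python version that predates that protocol, which is one reason the default can lag behind the highest value.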
Let’s now have a look at a sample implementation of the pickling process.
First, we will create a Python object with some sample data in it and name it ‘sample_list’. Then we will use the ‘dump’ function to serialize that object and store it in a file called “data.pickle”. We will use the highest protocol version to ensure we have all the latest features available.
# Pickling example in Python
import pickle

# Sample Python object
sample_list = [23, 'Hello World', 'Python']

# Pickling: open the file in binary write mode and dump the object into it
with open("data.pickle", "wb") as file_handle:
    pickle.dump(sample_list, file_handle, pickle.HIGHEST_PROTOCOL)

print("Pickling finished!")
Once our object is pickled and stored in the file named “data.pickle”, the message “Pickling finished!” appears on the screen. If we look in the current directory, the file “data.pickle” is now present there. It contains the serialized form of the object we created earlier.
The pickle module, imported with the statement import pickle, accepts a Python object, converts it into a byte stream and writes that stream to a file with the dump() function. This process is called pickling.
Some advantages of the Pickle Module:
The pickle module easily handles recursive objects. Recursive objects are those which contain a reference to themselves. A naive serializer could get stuck in endless recursion on such objects, because the reference to the same object keeps occurring. To handle this, the pickle module tracks all the objects it has serialized, so if it later finds a reference to an object that is already serialized, it does not serialize it again.
Another great advantage of the pickle module is that it can serialize most Python objects, including instances of user-defined classes, without much extra code.
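The handling of recursive objects described above can be checked with a small sketch. The list names here are arbitrary and chosen only for the example.

```python
import pickle

# A list that contains a reference to itself.
nested = [1, 2]
nested.append(nested)

# pickle remembers objects it has already written, so this terminates.
clone = pickle.loads(pickle.dumps(nested))

print(clone[:2])          # [1, 2]
print(clone[2] is clone)  # True: the self-reference is preserved

# Shared references survive too: both slots point at one list after loading.
shared = [0]
pair = pickle.loads(pickle.dumps([shared, shared]))
print(pair[0] is pair[1])  # True
```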
What Is Unpickling?
Unpickling is the process of retrieving the original Python objects from the previously stored byte stream, that is, from the pickle file.
So it is just the opposite of pickling, i.e. here a byte stream is converted into a Python object.
To deserialize a file containing a byte stream and obtain the Python object from it, the pickle module provides a ‘load’ function. We pass it a file object opened for reading in binary mode, and the load function deserializes the contents and gives us the Python object.
Let’s see the sample implementation of the unpickling process. Here in the example, we will deserialize the same “data.pickle” file we obtained by pickling the python object we made earlier. In the Pickle module’s load function we pass the file and then receive the Python object in a variable named ‘retrieved_data’.
# Unpickling example in Python
import pickle

# Unpickling: open the file in binary read mode and load the object from it
with open("data.pickle", "rb") as file_handle:
    retrieved_data = pickle.load(file_handle)

print(retrieved_data)
[23, 'Hello World', 'Python']
As we can see, when we print retrieved_data, we get the same object that we pickled earlier. Retrieving the Python object from the pickled file in this way is unpickling. In the mode string passed to open, the character ‘r’ stands for reading mode and the character ‘b’ denotes binary mode. So here we are reading a binary file.
What Data Types Can Be Pickled?
The data types that support pickling include integers, floats, Booleans, complex numbers and strings, as well as containers such as tuples, lists, sets and dictionaries, provided their contents are themselves picklable.
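A quick way to check whether a value can be pickled is simply to try a round trip. The sketch below does that for a handful of sample values. The lambda at the end is an extra case not discussed in this article, included as one example of an object that pickle rejects.

```python
import pickle

# A sample value for each of several built-in types.
values = [42, 3.14, True, 2 + 3j, "text", (1, 2), [1, 2], {1, 2}, None, {"k": "v"}]

for value in values:
    restored = pickle.loads(pickle.dumps(value))
    # The value and its exact type both survive the round trip.
    assert restored == value and type(restored) is type(value)

# A lambda has no importable name, so pickle refuses it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print("not picklable:", type(exc).__name__)
```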
Pickling Use Cases:
Pickling can be used when a program’s state needs to be saved to disk so that, when the program is restarted, it can resume from where it left off.
It is also very useful when Python data needs to be sent over a TCP connection, for example between processes on a multicore machine or between nodes of a distributed system.
Whenever Python objects need to be stored in a database, pickling can be helpful.
Pickling can help in caching, where we can convert an arbitrary picklable Python object to a byte string and use it as a dictionary key.
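The caching use case can be sketched as follows. `cached_call` is a hypothetical helper written for this example, not part of the pickle module. It relies on the fact that `pickle.dumps` returns a `bytes` object, which is hashable even when the original object (here a list) is not.

```python
import pickle

def cached_call(func, argument, cache):
    """Hypothetical helper: memoize func(argument) using pickled bytes as the key."""
    # Lists are unhashable, but their pickled bytes can be a dictionary key.
    key = pickle.dumps(argument)
    if key not in cache:
        cache[key] = func(argument)
    return cache[key]

cache = {}
first = cached_call(sum, [1, 2, 3], cache)   # computed
second = cached_call(sum, [1, 2, 3], cache)  # served from the cache
print(first, second, len(cache))  # 6 6 1
```

One caveat with this design: equal objects are not guaranteed to produce identical pickled bytes in every case (sets are one example), so such a cache may occasionally miss, although it will not return a wrong result.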
Some Dangers of Pickling:
As the documentation of the pickle module states, the module is not secure against erroneous or maliciously constructed data, because unpickling can execute arbitrary code contained in that data. It is therefore very easy to craft data that harms your device when it is unpickled. So it is good practice to never unpickle data received from an untrusted or unknown source.
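To make the danger concrete, here is a harmless sketch. The `Suspicious` class is a made-up payload whose `__reduce__` method tells pickle to call `print` during unpickling; an attacker could name a far more harmful callable. The second half follows the pattern of restricting globals by overriding `Unpickler.find_class`. The allowlist shown is an assumption for this example and would need to be adapted to real data.

```python
import builtins
import io
import pickle

class Suspicious:
    """Hypothetical payload: unpickling it calls print() instead of rebuilding an object."""
    def __reduce__(self):
        # pickle stores "call print with this argument" in the byte stream.
        return (print, ("this ran during unpickling",))

payload = pickle.dumps(Suspicious())
result = pickle.loads(payload)  # prints the message as a side effect
print(result)                   # None, the return value of print()

# Assumed allowlist for this sketch: only these builtins may be looked up.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Everything else, including print, is refused.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but using the restricted unpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

try:
    restricted_loads(payload)
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print("blocked:", exc)

# Plain data and allowlisted builtins still load normally.
print(restricted_loads(pickle.dumps([1, 2, {"a": range(3)}])))
```

Restricting globals narrows what a pickle can do, but it is not a substitute for only unpickling data from sources you trust.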
Finally, we can conclude that pickling and unpickling are quite simple yet very important and useful processes. As a data scientist, you often need to serialize your objects, for example to save a fitted model to disk. The pickle module therefore makes life much easier for data scientists who work with ML algorithms all the time. With just the ‘dump’ and ‘load’ functions, they can easily pickle and unpickle their data.
Still, when using the module, one must always keep its vulnerabilities in mind and never unpickle data exchanged with unknown parties. One should also ensure that the parties exchanging pickles trust each other and use an encrypted, authenticated network connection.
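One way to check that a pickle really comes from a trusted party and was not altered in transit is to sign it with a shared secret before sending it. The sketch below uses the standard `hmac` and `hashlib` modules. The key, the helper names and the message layout (signature followed by data) are all assumptions made for this example.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-shared-secret"  # hypothetical key, for illustration only
DIGEST_SIZE = hashlib.sha256().digest_size         # 32 bytes for SHA-256

def sign_and_pickle(obj, key=SECRET_KEY):
    """Pickle obj and prepend an HMAC-SHA256 signature of the pickled bytes."""
    data = pickle.dumps(obj)
    tag = hmac.new(key, data, hashlib.sha256).digest()
    return tag + data

def verify_and_unpickle(message, key=SECRET_KEY):
    """Check the signature first; unpickle only if it matches."""
    tag, data = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    expected = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(data)

message = sign_and_pickle({"model": "demo", "weights": [0.1, 0.2]})
print(verify_and_unpickle(message))

# Flipping a single bit of the data makes verification fail.
tampered = message[:-1] + bytes([message[-1] ^ 0x01])
try:
    verify_and_unpickle(tampered)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

The signature only proves that the sender knew the key. Anyone who holds the key can still produce a harmful pickle, so the key must be shared only with parties you trust.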