Python code for LinkedIn Scraping – extract profile and analytics data

How to build a Python script that connects to LinkedIn and extracts profile or analytics data.

Building applications that integrate with social networks is increasingly popular, and extracting data from social channels is a key part of that work.

Whatever the purpose, connecting to social media requires your application to go through certain authentication and authorization mechanisms. For that, you need to complete a basic setup and generate a set of credentials. The credentials you will need are an Access Token, App ID, App Secret, and Account ID. If you already have them, you are good to go. If you don’t, no worries: you only need to go through three simple steps to set up a LinkedIn Developer account and generate the credentials.

Your application can be web-based or a script written in any scripting language. Here, I am using Python for this basic LinkedIn data extraction demonstration.

Code structure:-

List of files in this project:-

  • ln_main.py
  • get_ln_campaign_data.py
  • readConfig.py
  • ln_cred.json
  • campaign_category.json

Without boring you with a long setup intro, let’s look at how you can run a Python script to connect to LinkedIn and pull your profile data or analytics data (campaigns and ads).

1. Create a Credential JSON file.

A JSON file is a convenient place to store the credentials your automation code needs; it keeps them in one spot and makes maintenance and updates easy.

Create a JSON file named “ln_cred.json”.

Here is a sample structure; customise it to your needs.

Note: I keyed the file by client name so that a single JSON file can hold credentials for multiple clients, which makes it easy to scale across clients.

{
    "client_name":
    {
        "id":1,
        "access_token":"Replace it with your LinkedIn access token",
        "client_id":"Replace it with your LinkedIn app ID",
        "client_secret":"Replace it with your LinkedIn app secret",
        "account_id":"Replace it with your LinkedIn ads account ID"
    }
}
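
Before wiring this file into the main script, you can sanity-check it with a few lines of plain Python. This is just an optional sketch; the file path and the "client_name" key are placeholders taken from the sample above.

#!/usr/local/bin/python3
import json

#quick check that the credential file parses and that a client entry has the expected keys
#(the path and the "client_name" key are placeholders from the sample above)
with open("ln_cred.json", "r") as f:
    creds = json.load(f)

client = creds["client_name"]
for key in ("id", "access_token", "client_id", "client_secret", "account_id"):
    print(key, ":", "OK" if key in client else "MISSING")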

2. Create a Main Python file (controller/initiator).

This Python file controls the overall flow; consider it the heart of this LinkedIn campaign data extraction project. Save it as “ln_main.py”.

#!/usr/local/bin/python3
# command to run this code: $ python3 ./file_directory/ln_main.py -c client_name -d ./filePath/ln_cred.json -s 2020-07-05 -e 2020-07-11 -q week (or month)
import getopt
import sys
import datetime
import os.path
import json
 
from get_ln_campaign_data import *
#importing the readConfig helper class
from readConfig import *
 
def isFile(fileName):
    if(not os.path.isfile(fileName)):
        raise ValueError("You must provide a valid filename as parameter")
 
def readfile(argv):
    global cred_file
    global client_name
    global s_date
    global e_date
    global qry_type
    try:
        opts, args = getopt.getopt(argv,"c:d:s:e:q:")
    except getopt.GetoptError:
        print("Usage: ln_main.py -c client_name -d ./filePath/ln_cred.json -s start_date -e end_date -q week/month")
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-c':
            client_name = arg
        elif opt == '-d':
            isFile(arg)
            cred_file = arg
        elif opt == '-s':
            s_date = arg
        elif opt == '-e':
            e_date = arg
        elif opt == '-q':
            qry_type = arg
        else:
            print("Invalid Option in command line")
 
if __name__ == '__main__':
    try:
        timestamp = datetime.datetime.strftime(datetime.datetime.now(),'%Y-%m-%d : %H:%M')
        print("DATE : ",timestamp,"\n")
        print("LinkedIn data extraction process Started")
        readfile(sys.argv[1:])
        #read the LinkedIn credential JSON file
        with open(cred_file, 'r') as f:
            cred_json = json.load(f)

        #read the campaign type reference JSON file
        campaign_type_file = "./src/campaign_category.json"
        with open(campaign_type_file, 'r') as f:
            campaign_type_json = json.load(f)

        #initialize variables with data from the credential file
        org_id = cred_json[client_name]["id"]
        access_token = cred_json[client_name]["access_token"]
        app_id = cred_json[client_name]["client_id"]
        app_secret = cred_json[client_name]["client_secret"]
        account_id = cred_json[client_name]["account_id"]

        #call the LinkedIn API query function (i.e. get_LinkedIn_campaigns_list)
        ln_campaign_df = get_LinkedIn_campaigns_list(access_token, account_id, campaign_type_json)
        
        print("LinkedIn Data :\n",ln_campaign_df)
 
        if not ln_campaign_df.empty:
            #get campaign analytics data
            campaign_ids = ln_campaign_df["campaign_id"]
            ln_campaign_analytics = get_LinkedIn_campaign(access_token,campaign_ids,s_date,e_date,qry_type)
            print("\nLinkedIn campaigns analytics :\n",ln_campaign_analytics)
        else:
            print("\n!! DataFrame (ln_campaign_df) is empty !!")
            sys.exit()
            
        #query_result_column = tuple(ln_campaign_analytics.columns.values)
        #print("column name :",query_result_column)
         
        print("LN_MAIN : LinkedIn data extraction Process Finished \n")
    except:
        print("LN_MAIN : LinkedIn data extraction processing Failed !!!!:", sys.exc_info())
 

3. Create a ReadConfig Python file.

This file holds a small helper class for processing plain-text config files, so we can pull settings out of them whenever needed. We import it as a module wherever it is required (in our case, ln_main.py). Think of this file as the helping hand of the project.

Save this file as “readConfig.py”.

#!/usr/local/bin/python3
 
class ReadConfig:
    def __init__( self, path=''):
        self.filePath = path
 
    def __del__(self):
        class_name = self.__class__.__name__
        print(class_name, "Completed")
 
    def getConfigDict(self):
        #read the whole file and split it into lines
        with open(self.filePath, 'r') as config_file:
            lines = config_file.read().split('\n')
        self.configs = {}

        #each non-empty line is expected to be "key<TAB>value"
        for line in lines:
            if line == '':
                continue
            try:
                parts = line.split('\t')
                self.configs[parts[0]] = parts[1]
            except:
                print('readConfig : Error reading config line: ', line)

        return self.configs
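
ln_main.py does not actually call this helper in the flow above, but if you keep other settings in a plain tab-separated config file (one key<TAB>value pair per line), usage looks roughly like the sketch below. The file name and keys are made up for illustration.

#example usage of ReadConfig with a hypothetical tab-separated file "./config/example.config"
#containing lines such as:  db_host<TAB>localhost
from readConfig import ReadConfig

reader = ReadConfig("./config/example.config")
settings = reader.getConfigDict()
print(settings.get("db_host"))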
 
 

4. Create a Data Extraction Process Python file.

This Python file does the most important work: extracting campaign data from LinkedIn via the LinkedIn Marketing API. Save it as “get_ln_campaign_data.py”. This file acts as the brain of the whole project.

#!/usr/bin/python3
import requests
import pandas
import sys
import json
import datetime
from datetime import timedelta
import re
 
#Function for date validation
def date_validation(date_text):
    try:
        #re-prompt until the string round-trips through the YYYY-MM-DD format
        while date_text != datetime.datetime.strptime(date_text, '%Y-%m-%d').strftime('%Y-%m-%d'):
            date_text = input('Please enter the date in YYYY-MM-DD format\t')
        return datetime.datetime.strptime(date_text, '%Y-%m-%d')
    except ValueError:
        raise Exception('linkedin_campaign_processing : date does not match format yyyy-mm-dd')
 
def get_LinkedIn_campaigns_list(access_token,account,campaign_type_json):
    try:
        url = "https://api.linkedin.com/v2/adCampaignsV2?q=search&search.account.values[0]=urn:li:sponsoredAccount:"+account
 
        headers = {"Authorization": "Bearer "+access_token}
        #make the http call
        r = requests.get(url = url, headers = headers)
        #defining the dataframe
        campaign_data_df = pandas.DataFrame(columns=["campaign_name","campaign_id","campaign_account",
                            "daily_budget","unit_cost","objective_type","campaign_status","campaign_type"])
 
        if r.status_code != 200:
            print("get_linkedIn_campaigns_list function : something went wrong :",r)
        else:
            response_dict = json.loads(r.text)
            #print(response_dict)
            if "elements" in response_dict:
                campaigns = response_dict["elements"]
                print("\nTotal number of campain in account : ",len(campaigns))
                #loop over each campaigns in the account
                for campaign in campaigns:
                    tmp_dict = {}
                    #for each campaign check the status; ignore DRAFT campaigns
                    if "status" in campaign and campaign["status"]!="DRAFT":
                        try:
                            campaign_name = campaign["name"]
                        except:
                            campaign_name = "NA"
                        tmp_dict["campaign_name"] = campaign_name
                        
                        try:
                            campaign_id = campaign["id"]
                        except:
                            campaign_id = "NA"
                        tmp_dict["campaign_id"] = campaign_id
                        
                        try:
                            campaign_acct = campaign["account"]
                            campaign_acct = re.findall(r'\d+',campaign_acct)[0]
                        except:
                            campaign_acct = "NA"
                        tmp_dict["campaign_account"] = campaign_acct
                        
                        try:
                            daily_budget = campaign["dailyBudget"]["amount"]
                        except:
                            daily_budget = None
                        tmp_dict["daily_budget"] = daily_budget
 
                        try:
                            unit_cost = campaign["unitCost"]["amount"]
                        except:
                            unit_cost = None
                        tmp_dict["unit_cost"] = unit_cost
 
                        try:
                            campaign_obj = campaign["objectiveType"]
                            if campaign_obj in campaign_type_json["off_site"]:
                                tmp_dict["campaign_type"] = "off_site"
                            elif campaign_obj in campaign_type_json["on_site"]:
                                tmp_dict["campaign_type"] = "on_site"
                            else:
                                print(" ### campaign objectiveType does not match any campaignType reference ###")
                        except:
                            campaign_obj = None
                        tmp_dict["objective_type"] = campaign_obj
 
                        campaign_status = campaign["status"]
                        tmp_dict["campaign_status"] = campaign_status
                    
                        #append this campaign's fields (DataFrame.append was removed in pandas 2.x)
                        campaign_data_df = pandas.concat([campaign_data_df, pandas.DataFrame([tmp_dict])], ignore_index=True)
                try:
                    campaign_data_df["daily_budget"] = pandas.to_numeric(campaign_data_df["daily_budget"])
                    campaign_data_df["unit_cost"] = pandas.to_numeric(campaign_data_df["unit_cost"])
                except:
                    pass
            else:
                print("\nkey *elements* missing in JSON data from LinkedIn")

        #return the dataframe in all cases so the caller gets an empty frame instead of None
        return campaign_data_df
    except:
        print("get_LinkedIn_campaigns_list Failed :",sys.exc_info())
 
def get_LinkedIn_campaign(access_token,campaigns_ids,s_date,e_date,qry_type):
    try:
        #call the date validation function to check the start_date format
        startDate = date_validation(s_date)
        dt = startDate+timedelta(1)
        week_number = dt.isocalendar()[1]
        #call the date validation function to check the end_date format
        endDate = date_validation(e_date)
        #defining the dataframe
        campaign_analytics_data = pandas.DataFrame(columns=["campaign_id","start_date","end_date",
                                    "cost_in_usd","impressions","clicks"])
 
        for cmp_id in campaigns_ids:
            #Building api query in form of url 
            dateRange_start = "dateRange.start.day="+str(startDate.day)+"&dateRange.start.month="+str(startDate.month)+"&dateRange.start.year="+str(startDate.year)
            dateRange_end = "dateRange.end.day="+str(endDate.day)+"&dateRange.end.month="+str(endDate.month)+"&dateRange.end.year="+str(endDate.year)
            
            url = "https://api.linkedin.com/v2/adAnalyticsV2?q=analytics&pivot=CAMPAIGN&"+dateRange_start+"&"+dateRange_end+"&timeGranularity=ALL&campaigns[0]=urn:li:sponsoredCampaign:"+str(cmp_id)
            #defining header for authentication
            headers = {"Authorization": "Bearer "+access_token}
            #make the http call
            r = requests.get(url = url, headers = headers)
 
            if r.status_code != 200:
                print("*get_LinkedIn_campaign : something went wrong :",r)
            else:
                response_dict = json.loads(r.text)
                if "elements" in response_dict:
                    campaigns = response_dict["elements"]
                    for campaign in campaigns:
                        tmp_dict = {}

                        #identify the campaign and reporting window for this row
                        #(setting these per row avoids overwriting earlier campaigns' values)
                        tmp_dict["campaign_id"] = cmp_id
                        tmp_dict["start_date"] = startDate
                        tmp_dict["end_date"] = endDate

                        tmp_dict["cost_in_usd"] = campaign["costInUsd"]
                        tmp_dict["impressions"] = campaign["impressions"]
                        tmp_dict["clicks"] = campaign["clicks"]

                        #tag the row with the requested reporting period
                        if qry_type in ["week","weekly"]:
                            tmp_dict["week"] = week_number
                        elif qry_type in ["month","monthly"]:
                            tmp_dict["month"] = startDate.month

                        #append this row (DataFrame.append was removed in pandas 2.x)
                        campaign_analytics_data = pandas.concat([campaign_analytics_data, pandas.DataFrame([tmp_dict])], ignore_index=True)
 
                    campaign_analytics_data["cost_in_usd"] = pandas.to_numeric(campaign_analytics_data["cost_in_usd"])
                else:
                    print("\nkey *elements* nmissing in JSON data from LinkedIn")
        
        return campaign_analytics_data
    except:
        print("\n*get_LinkedIn_campaign Failed :",sys.exc_info())

5. Create a Campaign Category JSON file.

This file lists the different LinkedIn campaign objective types, categorized into off-site and on-site campaigns. It is a JSON file, i.e. a key: value structure, so we have two keys, “off_site” and “on_site”.

Save this file as “campaign_category.json”.

{
    "off_site":["VIDEO_VIEW","LEAD_GENERATION","BRAND_AWARENESS","CREATIVE_ENGAGEMENT"],
    "on_site":["WEBSITE_VISIT","WEBSITE_CONVERSION","JOB_APPLICANT","ENGAGEMENT","WEBSITE_TRAFFIC"]
}
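
The lookup in get_LinkedIn_campaigns_list simply checks which of these two lists a campaign’s objectiveType belongs to. In isolation, that mapping amounts to something like the following sketch; campaign_category is a hypothetical helper, not part of the project files, and the path is the one used in ln_main.py.

import json

#load the reference file and map a single objectiveType to a campaign category
with open("./src/campaign_category.json", "r") as f:
    campaign_type_json = json.load(f)

def campaign_category(objective_type):
    #mirrors the if/elif check used in get_LinkedIn_campaigns_list
    if objective_type in campaign_type_json["off_site"]:
        return "off_site"
    elif objective_type in campaign_type_json["on_site"]:
        return "on_site"
    return None

print(campaign_category("LEAD_GENERATION"))   # off_site
print(campaign_category("WEBSITE_VISIT"))     # on_site
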
Tips:-
  1. Store the JSON files in a separate folder, say “config”, and the Python files in another, say “source”, for better structure and easier maintenance.
  2. Don’t forget to refresh the LinkedIn Marketing API access token; it expires after 2 months, i.e. it is valid for only 2 months.
  3. For database storage, create a separate file dedicated to the data-load process. Since we already use pandas DataFrames, loading the data into a database table is easy (see the sketch below); in some cases you may need a few additional steps to refine the data and its columns.
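
For tip 3, if the target happens to be a SQL database, pandas can write the DataFrames almost directly. Here is a minimal sketch, assuming SQLAlchemy is installed and an SQLite file is acceptable; the connection string and table name are placeholders, and ln_campaign_analytics is the DataFrame produced in ln_main.py.

#minimal load of the analytics DataFrame into a SQL table
#assumes SQLAlchemy is installed; the connection string and table name are placeholders
from sqlalchemy import create_engine

engine = create_engine("sqlite:///linkedin_data.db")
ln_campaign_analytics.to_sql("ln_campaign_analytics", engine,
                             if_exists="append", index=False)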

For any suggestions or doubts ~ Get In Touch
