Scraping LinkedIn Topics and Skills Data

What are the most popular skills among LinkedIn users?
What are the most popular skills among Microsoft employees?
Other top tech companies? (Google, Amazon, Facebook, etc…)
What are the most interconnected skills?

These are questions that LinkedIn does not provide a direct answer to. However, through their “Topics Directory”, we should be able to answer them ourselves!

The Topics Directory seems to be an index of all the different skills that people have put on their profiles, alphabetically ordered by skill name. Some pages, like Azure, have very specific metadata about the skill, while others, like Azure Active Directory, show up in the directory but do not have this additional metadata.

If we look at the additional metadata, we can see that it calls out a number of very interesting data points. It tells you:

  1. How many people have this skill
  2. The top 10 companies that have employees who register this skill
  3. (My guess) The top skills that people have who also have this skill
  4. (My guess) The top related skills

Now clearly, there is some poor web design here, in that there are two different sections, both titled “Top Skills”, that contain different data. We will have to do our own interpretation of what this data actually means, but nonetheless, the data is all useful.

So how do we start scouring this data to answer the questions I proposed at the start of this post? Well, by scraping it of course, and storing it in our own database. Now, this is not an original idea, but I certainly have not seen anyone collect data at the level of detail I am interested in. I want to have a copy of all the data points above for each topic, all in a single list!

So let’s do it!

Of course, we will be using Python + Beautiful Soup + Requests. You can find the latest version of my LinkedIn scraper on my GitHub. Here, I will only be looking at the main function, which describes the logic of my code, not the specific functions that actually do the scraping; those are in the full source on GitHub.

from bs4 import BeautifulSoup
import requests
import string
import re
import json

# ...
# sub-functions removed, check GitHub for full source
# ...

def main():
    # Letter pages run a-z, plus a 'more' page for topics that do not start with a letter
    letters = list(string.ascii_lowercase)
    letters.append('more')
    base_url = "https://www.linkedin.com/directory/topics-"
    for letter in letters:
        # First level: https://www.linkedin.com/directory/topics-{letter}/
        letter_url = base_url + letter + "/"
        content = get_content(letter_url)
        for con in content:
            if letter == 'y' or letter == 'z':
                # Y and Z have too few topics to be grouped, so the letter page
                # already lists the topics themselves
                sub_content = content
            else:
                # Second level: https://www.linkedin.com/directory/topics-{letter}-{number}/
                letter_page_url = con.find("a")
                print(letter_page_url)
                if letter_page_url is not None and letter_page_url.has_attr('href'):
                    sub_content = get_content(letter_page_url['href'])
                else:
                    # Nothing to follow here, skip this entry
                    sub_content = []
            for sub_con in sub_content:
                # Third level: https://www.linkedin.com/topic/{topic} holds the actual data
                topic_url = sub_con.find("a")
                topic = scrape_data(topic_url)
                create_json(topic)
            if letter == 'y' or letter == 'z':
                # The whole letter page was just processed, so move on to the next letter
                break

To scrape this site, we are basically figuring out the pattern which generates these pages. LinkedIn organizes these topics first by letter, https://www.linkedin.com/directory/topics-{letter}/. Then on each “letter page”, they group the topics by alphabetical order, in groups, https://www.linkedin.com/directory/topics-{letter}-{number}/. Finally, if you navigate to the specific topic, you will get the final page with data, https://www.linkedin.com/topic/{topic}.

There are a few exceptions to this pattern, which added complexity to the scraper. Basically the letters Y and Z do not have enough topics to be able to put them in groups, which means instead of navigating 3 pages deep to get the data, we need to navigate only 2 pages deep. You can see I handle this situation in my scraper. Other than that, once I get the data off the page, I put it into a JSON file for later usage!

One thing to note, but that I will not go into detail about here, is that LinkedIn actually blocks scrapers in general by returning a 999 response when you try to get data using a bot. If you want to run this script, you will have to overcome this. If you look online, people mention that you might need to update the user-agent passed in the headers of the web requests, but this did not work for me. I might go into detail about this in another post.
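
For reference, the header-swapping approach that people commonly suggest looks roughly like the sketch below (using the same requests library; the user-agent string is just an example). As noted above, this alone was not enough to get past the 999 response for me.

import requests

# A common suggestion: make the request look like it came from a regular browser
# by overriding the default requests user-agent. In my experience this alone did
# not get past LinkedIn's 999 response.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
}

response = requests.get("https://www.linkedin.com/directory/topics-a/", headers=headers)
print(response.status_code)  # LinkedIn returns 999 when it decides you are a bot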

Results

So, let’s look at some of the data. I can import the JSON as an array of dictionaries in Python, and then try and write some queries to get data from it. I am not claiming to write the best or most efficient queries, but hopefully they will get the correct data.

Loading the data:

with open(r'C:\Users\shawn\Documents\GitHubVisualStudio\LinkedIn-Topic-Skill-Analysis\results\linkedin_topics_7-23-17.json') as data_file:
    data = json.load(data_file)
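
Each entry in data is a dictionary describing one topic. Based on the fields used by the queries below, a record looks roughly like this (the values here are made up purely for illustration; some fields come back as None when a topic page has no metadata):

example_topic = {
    "name": "Azure",                        # the topic / skill name
    "count": 123456,                        # how many people list this skill
    "companies": {                          # top companies -> employees who list the skill
        "Microsoft": 7890,
        "Amazon": 4567,
    },
    "skills": ["Cloud Computing", "SaaS"],  # one of the two "Top Skills" sections
    "topSkills": ["Cloud", "Azure SQL"],    # the other "Top Skills" section
}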

How many topics are there total?

len(data)
33188

What are the most popular overall topics/skills?

ordered_by_count = sorted(data, key=lambda k: k['count'] if isinstance(k['count'], int) else 0, reverse=True)
for skill in ordered_by_count[:20]:
    print(skill['name'], "-", skill['count'])
Management - 69725749
Microsoft - 55910552
Office - 46632581
Microsoft Office - 45351678
Planning - 34397412
Microsoft Excel - 32966966
Leadership - 31017503
Customer Service - 30810924
Leadership Management - 25854094
Word - 25793371
Project - 25766790
Project+ - 25766790
Microsoft Word - 25567902
Business - 25374740
Customer Management - 24946045
Management Development - 24207445
Development Management - 24207409
Project Management - 23922491
Marketing - 23047665
Customer Service Management - 22856920

What are the top <Company> Skills?

company = 'Microsoft'
company_skills = []
for skill in ordered_by_count:
    if skill['companies'] is not None:
        if company in skill['companies']:
            company_skills.append(skill)

order_by_company = sorted(company_skills, key=lambda k: k['companies'][company], reverse=True)
for skill in order_by_company[:20]:
    print(skill['name'], "-", skill['companies'][company])

Microsoft

Cloud - 74817
Cloud Computing - 74817
Cloud-Computing - 74817
Cloud Services - 74817
Management - 73123
Management Skills - 73123
Multi-Unit Management - 73123
Enterprise - 54516
Enterprise Software - 54516
Software Development - 53201
Project Management - 52083
Project Management Skills - 52083
PMP - 52083
PMI - 52083
Strategy - 43983
SaaS - 41450
Software as a Service - 41450
Program Management - 40749
Business Intelligence - 39291
C# - 39158

Google

Java - 23225
Strategy - 22235
Marketing - 21672
Data-driven Marketing - 21672
Python - 20788
Software Development - 20406
C++ - 20199
Social Media - 20082
Social Networks - 20082
Digital Marketing - 19942
Online Advertising - 19922
Marketing Strategy - 16882
Linux - 16272
JavaScript - 14567
JavaScript Frameworks - 14567
C - 14460
C Programming - 14460
Online Marketing - 13925
Online-Marketing - 13925
Social Media Marketing - 12931

Amazon

Leadership - 44329
Leadership Skills - 44329
Microsoft Office - 42713
Office for Mac - 42713
Customer Service - 36176
Microsoft Excel - 33403
Java - 25609
Word - 23314
Microsoft Word - 23314
PowerPoint - 22318
Microsoft PowerPoint - 22318
Social Media - 22110
Social Networks - 22110
C++ - 19619
Training - 19250
Marketing - 18826
Data-driven Marketing - 18826
Software Development - 18521
Public Speaking - 17366
C - 16813

Facebook

Digital Marketing - 4973
Online Advertising - 4334
Digital Strategy - 3399
Online Marketing - 3012
Online-Marketing - 3012
Facebook - 2883
Algorithms - 2881
Mobile Marketing - 2163
Machine Learning - 2103
Distributed Systems - 2033
User Experience - 1971
UX - 1971
Web Analytics - 1682
SEM - 1626
Computer Science - 1440
Google Analytics - 1261
Adwords - 1093
Google AdWords - 1093
Scalability - 1057
Mobile Advertising - 919

What are the top interconnected skills?

skill_count = {}
for topic in data:
    if topic['skills'] is not None:
        for top_skill in topic['skills']:
            if top_skill not in skill_count:
                skill_count[top_skill] = 1
            else:
                skill_count[top_skill] += 1
    if topic['topSkills'] is not None:
        for top_skill in topic['topSkills']:
            if top_skill not in skill_count:
                skill_count[top_skill] = 1
            else:
                skill_count[top_skill] += 1

for skill in sorted(skill_count, key=skill_count.get, reverse = True)[:20]:
    print(skill, "-", skill_count[skill])
Microsoft Office - 11081
Management - 8845
Customer Service - 7010
Project Management - 6902
Microsoft Excel - 4884
Leadership - 4682
Social Media - 3883
Research - 3798
Public Speaking - 3243
Marketing - 2644
Microsoft Word - 2426
Sales - 2335
SQL - 2322
Engineering - 2300
Business Development - 2071
Strategic Planning - 1879
Java - 1792
Adobe Photoshop - 1555
JavaScript - 1488
Microsoft PowerPoint - 1483

 

There is so much more we can do with this data, and I do have plans! I just can’t talk about them here. On a related note, I am super excited for the Microsoft Hackathon happening this next week. I will be using these tools, and hopefully more, to accomplish an awesome project. Maybe there will be more to share here in the future!

Clients and Tokens and Claims! Oh My!

Let me just jump to the point with this post: Client applications should not depend on claims in access tokens to gather data about the signed-in user or anything about the authenticated session.

Time and time again, I have seen developers of client applications complain that certain claims, like group membership claims, are not appearing in the access token they receive, and they ask me how to enable this. They incorrectly assume that if they go into their application manifest and change the “groupMembershipClaims” setting, they will start getting those claims, but everyone eventually finds out… it doesn’t work!

Let’s take a look at the source material, the OAuth 2 specification:

An access token is a string representing an authorization issued to the client. The string is usually opaque to the client.

The OAuth 2 specification is intentionally broad, but in summary, the ‘access token’ that is given to a client should only really be inspected by the specified audience of the token. Some implementations of OAuth 2 do not even pass a JWT to the client. Instead, they pass an opaque string, and the resource exchanges that string for the actual token using a signed request. Other implementations pass an encrypted JWT rather than one that is merely signed; in that case, the resource application provides an encryption key that the authorization server uses to encrypt the full token, so the only party who can look at the claims in the token is the resource holding the matching decryption key.

The implementation of OAuth 2 that I am most familiar with, Azure Active Directory, issues a signed token, which means its content is completely visible to the client. In the future, Azure AD may add support for encrypted tokens, which means clients are going to have to start following the correct practices.

Need to know about the user signed into your web application?

>> Get an ID token! These are meant for client consumption!

Need to know which groups a user is a member of?

>> Get an access token to the AAD or Microsoft Graph API and query the API!
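
For example, a client that already has a Microsoft Graph access token can simply ask Graph which groups the signed-in user belongs to, instead of trying to read a groups claim out of the access token. Here is a minimal sketch in Python, assuming you already have a valid Graph token in access_token (how you acquire it is out of scope here):

import requests

# Ask Microsoft Graph for the signed-in user's group memberships,
# rather than looking for a "groups" claim inside the access token.
access_token = "<your Microsoft Graph access token>"  # placeholder
headers = {"Authorization": "Bearer " + access_token}

response = requests.get("https://graph.microsoft.com/v1.0/me/memberOf", headers=headers)
response.raise_for_status()

for group in response.json()["value"]:
    print(group.get("displayName"), "-", group.get("id"))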

Now let’s go back to the original problem. If the groupMembershipClaims setting is not there so that clients can read these claims out of the access token, what is it for? You might have figured it out by now: it is there so that resource applications get these claims in the access tokens issued to them!

Let’s walk through an example. To set up, I have registered two Azure AD Web Apps/APIs called Web API 1 and Web API 2. Both of these applications are identical, except that Web API 1 has the setting “groupMembershipClaims”: “All”, while the other is set to null, which is the default. I also set up a fake App ID URI for both apps and made sure that each application has the other set as a “required permission”.

I will be using my PowerShell Scripts to quickly get two access tokens. One where the client is Web API 1 and the resource is Web API 2, and vice versa.

Let’s look at the results, using my JWT Decoder to look at the payload:

Payload 1:  Client = Web API 1, Resource = Web API 2

{
    "aud": "https://shawntest.onmicrosoft.com/WebApi2",
    "iss": "https://sts.windows.net/4a4d599f-e69d-4cd8-a9e1-9882ea340fb5/",
    "iat": 1500243353,
    "nbf": 1500243353,
    "exp": 1500247253,
    "acr": "1",
    "aio": "ATQAy/.../oU",
    "amr": [ "rsa", "mfa" ],
    "appid": "eb7b6208-538c-487b-b5b5-137ac6ab6646",
    "appidacr": "1",
    "email": "shtabriz@microsoft.com",
    "family_name": "Tabrizi",
    "given_name": "Shawn",
    "idp": "https://sts.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47/",
    "in_corp": "true",
    "ipaddr": "XX.XXX.XXX.XXX",
    "name": "Shawn Tabrizi",
    "oid": "41bdce9b-3940-40a9-b2f2-03a003ad599c",
    "platf": "3",
    "scp": "user_impersonation",
    "sub": "hfS9IZ_..._JW8c5Gg",
    "tid": "4a4d599f-e69d-4cd8-a9e1-9882ea340fb5",
    "unique_name": "shtabriz@microsoft.com",
    "ver": "1.0"
}

Payload 2: Client = Web API 2, Resource = Web API 1

{
    "aud": "https://shawntest.onmicrosoft.com/WebApi1",
    "iss": "https://sts.windows.net/4a4d599f-e69d-4cd8-a9e1-9882ea340fb5/",
    "iat": 1500243330,
    "nbf": 1500243330,
    "exp": 1500247230,
    "acr": "1",
    "aio": "ATQAy/...BLDunA",
    "amr": [ "rsa", "mfa" ],
    "appid": "554e427d-36c3-4a77-89a5-a082ee333e12",
    "appidacr": "1",
    "email": "shtabriz@microsoft.com",
    "family_name": "Tabrizi",
    "given_name": "Shawn",
    "groups": [ "0f4374e6-8131-413e-b32b-f98bfdb371ed" ],
    "idp": "https://sts.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47/",
    "in_corp": "true",
    "ipaddr": "XX.XXX.XXX.XXX",
    "name": "Shawn Tabrizi",
    "oid": "41bdce9b-3940-40a9-b2f2-03a003ad599c",
    "platf": "3",
    "scp": "user_impersonation",
    "sub": "xy..._zGJEZnIB4",
    "tid": "4a4d599f-e69d-4cd8-a9e1-9882ea340fb5",
    "unique_name": "shtabriz@microsoft.com",
    "ver": "1.0",
    "wids": [ "62e90394-69f5-4237-9190-012177145e10" ]
}
Note that we only get the group membership claims when the resource application has this setting, not when the client application does. The client application has no power to control the claims in the token, because ultimately the token is not for them!

If you are building a client application using Azure Active Directory, please do not use the access token to try and get information about the authenticated session. The correct practice is to request an ID token separately, or to call the AAD / Microsoft Graph API to get the information you need. I hope you learned exactly how to use the “groupMembershipClaims” property, and I hope this helps you build better apps in the future!

Customizing WordPress’s Twenty Seventeen Theme

As I mentioned in my first post on this new blog, this isn’t my first rodeo with WordPress or blogging in general. I have actually used a number of different content management systems in the past like Drupal, older versions of WordPress, and even forum platforms like ProBoards, phpBB, and vBulletin.

All of these tools are amazingly simple to use in comparison to how much they offer their users. However, this latest version of WordPress (version 4.8) has been above and beyond a great experience.

I value the ability to make quick and accurate customizations to my website, and with WordPress’s huge library of plugins and its latest customization features like Additional CSS, it has been easier than ever to do just that. I wanted to share a few of the specific customizations I made, and just how simple they were to do.

Adjusting the Social Links Footer

One of my top goals for this site was to have a single location for all of my various social media profiles. WordPress’s new Social Links Menu makes it super easy to achieve this; however, the default implementation was not very good looking in my opinion. Specifically, the ‘social-navigation’ section has a width of 36%, with a max-width of 1000px for the container. This means the menu can fit at most 5 icons before wrapping to a second line, and even fewer if the window is smaller.

I wanted to fix this. I wanted the Social Links Menu to take up the entire bottom footer space, and I also wanted the links to align right, so that they would sit right underneath my Home Page text. This also meant I needed to get rid of the “Proudly powered by WordPress” text. I am not against giving credit where credit is due, but not if it means compromising a good look.

To achieve my goals, I made the following simple CSS update in the Additional CSS settings:

.site-info { display: none; }
.social-navigation { width: 100%; text-align: right;}

Here is the final result:

Increasing the max-width of the theme

Surprisingly, the twenty seventeen theme has a max-width of 1000px. We are in a generation of ever higher resolution screens, and I think 1000px was really choking the available space for a text-based blog. I wanted to increase the amount of space used by the site to 1200px (+200px) so that my blog posts and code samples would be easier to read. To do this, I made another CSS update in the Additional CSS settings:

.wrap { max-width: 1200px; }
.navigation-top .wrap { max-width: 1200px; }

This changes the base wrapper used by most elements on the site to have a max-width of 1200px, as well as the top navigation menu, which gets its width from a separate rule. The results were already great, but now the secondary content (which contains the blog sidebar) was taking up too much space. By default, the twenty seventeen theme has the primary content take up 58% of the width, while the secondary content takes up 36% (with 6% of spacing implicitly between the two). We added extra width to the overall content, but we really want all of that extra space to go to the primary content, so we simply need to update the percentages that define the widths of these two content divs:

@media screen and (min-width: 48em) {
.has-sidebar:not(.error404) #primary { width: 68% }
.has-sidebar #secondary { width: 26% }
}

Note that we only wanted to adjust the CSS when the viewport is large. At lower screen sizes, the responsive layout of the theme takes over and it looks good out of the box.

The final results look great:

Before:

After:

You can see that my opening block of text, which used to take 6 lines, now only takes 4. The difference is most notable when you look at the space between the primary and secondary content. You will see that it has shifted significantly to the right, while the primary content also extends further to the left. I think overall this makes the site look much better, and I might even consider increasing the max-width again.

Creating an About Me page as my Home Page

The last thing I wanted to do was to create a minimalist, yet good looking home page that would introduce me to the viewer, and act as a starting point for people to discover my site and other projects I might be working on. I needed a picture of myself, and a blurb about me. (If you have been paying attention, you will notice a sneak peek at the final result above.)

So what is the problem? Well, let’s look at what the default theme does if you write some text and add an image:

Ew. This is such an unbalanced use of space. The title of the page eats up nearly half of the page’s overall space, and the picture forces the text off the screen, which creates a disconnect between my image and my bio. Wouldn’t it be better to have the image in that empty space to the left? I found the easiest way to do this was to simply add the <img> HTML as the title of the page.

Which gives us the following result:

Much better! But I wanted to make a few more small tweaks. Let’s make the image bigger, and let’s make sure that it has no text alignment, so text doesn’t start wrapping around it when the page changes size. Finally, I wanted to make the image a circle/oval rather than a square. This is pretty common for bio pictures and quite easy to do, again using the Additional CSS settings. This was my final title:

<img class="img-circle wp-image-103 size-large alignnone" src="http://shawntabrizi.com/wordpress/wp-content/uploads/2017/07/19453121_1568143886560829_337872348308545095_o-927x1024.jpg" alt="" width="525" height="580" />

Note that I added a special class to the image called ‘img-circle’. This points to a configuration in my Additional CSS which changes the border-radius to 50%.

Here is the final result:

This was exactly what I was going for, and I think it makes a slick home page for any personal site. I expect I will continue to make a few smaller changes, adding additional styles to the page, but this was the starting point I needed to feel comfortable putting this page on the web, and it was done almost entirely using the Additional CSS settings available in the latest version of WordPress. It is just amazing how simple WordPress makes this for the end user.

Final CSS Settings

If you liked the changes I made, and would like to do the same to your own instance of the twenty seventeen theme, you can copy and paste this CSS into your Additional CSS settings:

.site-info { display: none; }
.social-navigation { width: 100%; text-align: right;}
.wrap { max-width: 1200px; }
.navigation-top .wrap { max-width: 1200px; }

@media screen and (min-width: 48em) {
.has-sidebar:not(.error404) #primary { width: 68% }
.has-sidebar #secondary { width: 26% }
}

.img-circle { border-radius: 50% }

Let me know if you found any other tricks or have iterated on the changes I made!

Does Company ‘X’ have an Azure Active Directory Tenant?

One of the cool things about the Open ID Configuration endpoint is that it not only tells us random facts about the tenant, but it also confirms that the tenant exists! Make sure to check out my last post to learn more about this. Using some clever scripting and this endpoint behavior, we could probably figure out which companies have an Azure Active Directory Tenant. Let’s try that!

$csv = Import-Csv -Path .\input.csv
$output = @()

foreach ($line in $csv)
{
    $companyname = $line.CompanyName
    $companynameencoded = [System.Net.WebUtility]::UrlEncode($companyname)

    $GoogleURI = 'https://www.google.com/search?q=' + $companynameencoded + '&btnI'
 
    try { 
        $GoogleResult = Invoke-WebRequest -Uri $GoogleURI
        $CompanyURI = ([System.Uri]$GoogleResult.BaseResponse.ResponseUri).Host.split('.')[-2..-1] -join '.'
    } catch {
        write-host $_.Exception
        $CompanyURI = "error"
    }

    $OpenIDConfigURL = 'https://login.microsoftonline.com/' + $CompanyURI + '/.well-known/openid-configuration'

    try {
        $OpenIDResult = (Invoke-WebRequest -Uri $OpenIDConfigURL).StatusCode
    } catch {
        $OpenIDResult = $_.Exception.Response.StatusCode.value__
    }

    if ($OpenIDResult -eq 200) {
        $tenant = $true
    } else {
        $tenant = $false
    }

    $result = [pscustomobject]@{
        CompanyName = $companyname.ToString()
        HomepageURI = $CompanyURI.ToString()
        OpenIDResult = $OpenIDResult.ToString()
        HasTenant = $tenant.ToString()
    }

    Write-Host $result
    $output += $result 
}

$output | Export-Csv -Path output.csv -NoTypeInformation

So, in summary, what does this script do?

We take a CSV which lists a bunch of Company Names. We then do a Google search, and go to the first result (‘I’m Feeling Lucky’). We assume the first result is the homepage of that company, and the domain they would use for their tenant. We pull out the host name, and then check it against the Open ID Configuration endpoint. If we get a valid response from the endpoint, then we say that they have a tenant! Otherwise, we say they do not have a tenant.

One thing to note about these results: when we get a result that says the company has a tenant, we are nearly 100% correct in that fact. However, if we say that a company does not have a tenant, we are not necessarily correct. It is possible that the Google result did not point to their actual domain name, or that they are using a different domain name for their AAD Tenant. If you wanted to do this really robustly, you would probably want a better source for your domain names than automated Google search results. You might also want to look at other combinations like “<companyname>.onmicrosoft.com”, but here we are just making rough estimates.

So let’s look at the results for the Fortune 500. A quick Google search later, and I have a CSV listing the company names of all 500 companies. Running it through this script, I find that 417 companies, or 83.4%, have AAD, which is just a little off from Microsoft’s public claim of 85%. Not bad for a quick and dirty script!

Secret APIs in Azure Active Directory and Azure Resource Manager

Have you ever wondered what the Tenant ID for Microsoft (microsoft.com) or any other domain is? Have you ever wondered how you can find the right Tenant ID to sign in a user given their Azure Subscription ID?

Oh, you haven’t? Well, that is certainly more reasonable than the fact that I have. But if, for some reason, you are asking the same questions as I am, let me tell you about some of the “secret APIs” that are available to answer them.

Getting the Tenant ID for a Verified Domain in Azure Active Directory

Azure Active Directory tenants have a special type of domain called a ‘verified domain’. Verified domains are what they sound like: domains which a user has proven they own through DNS verification. These domains are unique across all tenants, and can act as an alternative to the initial domain given to every tenant (*.onmicrosoft.com).

While authentication and even the AAD Graph API both support the use of these domains for referencing a tenant, not all APIs support this. Sometimes you might need to convert the tenant domain to a Tenant ID… but how?

The Well-Known Open ID Configuration Endpoint

Check out the specification here. This Open ID configuration endpoint is required for all Open ID Providers, and AAD is one of them. Let’s take a look at what the response looks like for the Microsoft tenant using the verified domain ‘microsoft.com’:

https://login.microsoftonline.com/microsoft.com/.well-known/openid-configuration

{"authorization_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/authorize","token_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token","token_endpoint_auth_methods_supported":["client_secret_post","private_key_jwt"],"jwks_uri":"https://login.microsoftonline.com/common/discovery/keys","response_modes_supported":["query","fragment","form_post"],"subject_types_supported":["pairwise"],"id_token_signing_alg_values_supported":["RS256"],"http_logout_supported":true,"frontchannel_logout_supported":true,"end_session_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/logout","response_types_supported":["code","id_token","code id_token","token id_token","token"],"scopes_supported":["openid"],"issuer":"https://sts.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47/","claims_supported":["sub","iss","cloud_instance_name","cloud_graph_host_name","aud","exp","iat","auth_time","acr","amr","nonce","email","given_name","family_name","nickname"],"microsoft_multi_refresh_token":true,"check_session_iframe":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/checksession","userinfo_endpoint":"https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/openid/userinfo","tenant_region_scope":"WW","cloud_instance_name":"microsoftonline.com","cloud_graph_host_name":"graph.windows.net"}

I honestly have never used most of the data in this JSON, and I am not really sure where it gets used… BUT, you will notice that all of the various authentication endpoints now have a Tenant ID GUID rather than a domain name! This tells us two things:

  1. The Tenant ID for Microsoft.com is 72f988bf-86f1-41af-91ab-2d7cd011db47
  2. (maybe this is obvious already… but) Microsoft has a tenant!
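
If you want to grab the Tenant ID programmatically instead of eyeballing the JSON, a small sketch like the following works (in Python with requests; it just parses the issuer value out of the response shown above):

import requests

def get_tenant_id(domain):
    # The well-known Open ID configuration endpoint for the given verified domain
    url = "https://login.microsoftonline.com/" + domain + "/.well-known/openid-configuration"
    response = requests.get(url)
    response.raise_for_status()  # an error status here generally means no tenant was found for that domain
    # The issuer looks like https://sts.windows.net/<tenant-id>/, so take the GUID segment
    issuer = response.json()["issuer"]
    return issuer.rstrip("/").split("/")[-1]

print(get_tenant_id("microsoft.com"))  # 72f988bf-86f1-41af-91ab-2d7cd011db47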

Now the second realization is kind of a super-set of the first, but it makes me think about something else cool we can do. What if we wanted to get a count and see which companies have an Azure Active Directory Tenant? As long as we know their Domain Name, we should be able to use this endpoint to confirm if a tenant exists! I will save this exploration for my next blog post.

Get the Tenant ID for a Specific Azure Subscription ID

The world of Azure Subscriptions is one of the most complicated spaces that shouldn’t be complicated. Depending on how you start using Azure, you may never even know that you have an Azure Active Directory Tenant. You just have your Live ID, which you use to sign in to the Azure Portal, and from there you can access your Subscription ID! You can’t even use the ‘common’ endpoint with Live IDs on AAD V1, so this lack of knowledge can be really painful for app developers: we need your Tenant ID to know the right login endpoint to send you to. Luckily, we can find it using helpful error messages from Azure Resource Manager! All we need is an application in any tenant that we can use to get a token for Azure Resource Manager.

We can easily execute this plan using my PowerShell Scripts. Update the scripts to have the following configuration:

  • Pick any Tenant ID and Application Information relative to that tenant
  • Set Resource ID to “https://management.azure.com/”
  • Create a variable “$subscriptionId” and set it to the Azure Subscription ID you are looking to investigate.
  • Set up the REST call like this:
try {
    Invoke-RestMethod -Method Get -Uri ("{0}/subscriptions/{1}?api-version=2016-06-01" -f $resourceId, $subscriptionId) -Headers $headers
} catch {
    Write-Host $_.ErrorDetails.Message
}

Hmm… why would I be catching an error? Well, let’s run it and see what gets output:

{"error":{"code":"InvalidAuthenticationTokenTenant","message":"The access token is from the wrong issuer 'https://sts.windows.net/4a4d599f-e69d-4cd8-a9e1-9882ea340fb5/'. It must match the tenant 'https://sts.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47/' associated with this subscription. Please use the authority (URL) 'https://login.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47' to get the token. Note, if the subscription is transferred to another tenant there is no impact to the services, but information about new tenant could take time to propagate (up to an hour). If you just transferred your subscription and see this error message, please try back later."}}

Right in the error they tell us the correct tenant for this Subscription ID!

Please use the authority (URL) ‘https://login.windows.net/72f988bf-86f1-41af-91ab-2d7cd011db47

This really is a “secret API”, and we can use it to consistently get back the right tenant for a user, as long as they know what their Azure Subscription ID is.

Azure AD Authentication with PowerShell and ADAL

In the 3 years I spent on the Azure AD team, I learned a number of useful ‘tricks’ to make my job (and usually the jobs of others) a ton easier. However, if I had to pick just one trick to share with others trying to learn, it would probably be the PowerShell scripts I wrote to quickly get an access token to Azure Active Directory and then call AAD protected APIs like the AAD Graph API.

In general, authentication is hard, and requires way more setup than should be needed for simple testing. To get AAD authentication working on other platforms, you may need to write a ton of code, compile it, or even publish it to the web. With these scripts, you can get authentication and REST API calls done with as little as 13 lines of PowerShell. Running the code is instant, and modifying the REST calls or even the authentication parameters takes seconds rather than minutes.

How to get the samples

You can find all the basic scripts I have written on GitHub here:

https://github.com/shawntabrizi/Azure-AD-Authentication-with-PowerShell-and-ADAL

I provide different scripts for different authentication flows:

  1. Authorization Code Grant Flow for Confidential Client
  2. Native Client Authentication
  3. Client Credential Flow
    1. Using Application Key
    2. Using Client Certificate

Each script ends with a REST API call to get the list of Users in your tenant using the AAD Graph API. You should be able to do this with any application because it uses the “Sign in and read basic profile” permission which is assigned to all AAD Applications by default.

Note that to get these samples running, you will need to add the .NET dlls for ADAL v2 into the ADAL folder. You can find those files on NuGet.

Why it is so darn useful

So now that you have the scripts downloaded, and hopefully working, let me illustrate to you just a few of the different scenarios where I have used this tool to greatly simplify my work.

Verifying Token Claims

So many errors in AAD app development come from some sort of wrong setting, which may manifest itself in your access token. You might want to check the ‘scp’ claim to see if your app has the right permissions. You might want to check the ‘tid’ claim to make sure that you are getting a token for the right tenant! Or even the ‘aud’ claim to make sure the token is for the correct resource. You can simply pump the settings for your application into the appropriate PowerShell script, run the script, and you will get a .txt file with your access token in it. Then you can pop that JWT token into a JWT decoder like the one I created… and voila! There are your claims, and it took literally seconds.
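
If you would rather check those claims straight from the command line, you can also decode the payload locally with a few lines of Python. This is just a rough sketch: it does not validate the signature, it only reads the middle section of the token, and the token.txt filename is an assumption for wherever you saved the script's output.

import base64
import json

def payload_claims(jwt):
    # The payload is the middle of the three dot-separated sections of the token
    payload = jwt.split(".")[1]
    # Base64url decode, adding back the padding that JWTs strip off
    decoded = base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4))
    return json.loads(decoded)

with open("token.txt") as f:  # assumed name for the file holding your access token
    claims = payload_claims(f.read().strip())

for claim in ("aud", "tid", "scp"):
    print(claim, "=", claims.get(claim))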

Making quick REST API calls

Another thing that comes up very often around work is just pulling random data from AAD. Let’s say that someone wants to know the settings of a certain Application Object, Service Principal, or even User. You may be able to do this with tools like the Graph Explorer, but what about more complicated queries, or ones that you want to download to a file for later? Or how about simply wanting to test that YOUR app can make those queries, rather than the Graph Explorer app? Not to mention the fact that you can call ANY AAD protected API, not just the AAD Graph API, with these scripts. Simply update the Invoke-RestMethod command and bam, the results will be saved into a .json file!

Making scripted REST API calls

Maybe you are still not convinced that these scripts are useful. Most of what I showed above can be done with various other tools. However, I challenge you to find a quicker way to create “scripted” REST API calls. What do I mean by that? Let’s say you wanted to pull a list of all the users in your company. Well, the AAD Graph API can return at most 999 results in a single call, so you probably want to create a loop that iterates over the paged results that the Graph API returns. This is SIMPLE!

Here is the loop I wrote to solve this exact problem:

$result = Invoke-RestMethod -Method Get -Uri ('{0}/{1}/users/?api-version=1.6&$top=999' -f $resourceId,$tenantId) -Headers $headers
$count = 0
$result.value | Export-Csv ([String]$count + "_" +$output) -Encoding UTF8

while (($result.'odata.nextLink' -split 'skiptoken=')[1] -ne $null)
{
  $skiptoken = ($result.'odata.nextLink' -split 'skiptoken=')[1]
  Write-Host ('{0}/{1}/users/?api-version=1.6&$top=999&$skiptoken={2}' -f $resourceId,$tenantId,$skiptoken)

  try
  {
    $result = Invoke-RestMethod -Method Get -Uri ('{0}/{1}/users/?api-version=1.6&$top=999&$skiptoken={2}' -f $resourceId,$tenantId,$skiptoken) -Headers $headers
    $count += 1
    $result.value | Export-Csv ([String]$count + "_" + $output) -Encoding UTF8
  }
  catch
  {
    Write-Host "Error with Invoke Rest Method!"
    Write-Host $result.'odata.nextLink'
    break
  }
}

The result is a folder of CSV files, all numbered and ready to be merged. If the script fails at some point (like if I lose my internet connection), I can use the last ‘odata.nextLink’ that was printed and just pick up where I left off. I couldn’t imagine doing this any other way for my needs.

Convinced?

I hope that you too will be able to find this little tool helpful for your day to day needs. Let me know if you find some other unconventional uses for this!

Decoding JWT Tokens

Forewarning: I know that “JWT Tokens” is a case of RAS syndrome… but I can’t help it!

Are your tokens safe when using online decoders?

In the identity space, decoding JSON Web Tokens (JWT tokens) is a regular event. One of the first things we do to debug issues that customers or partners are having is take a quick peek into the access tokens they are using and see if anything is wrong.

In Azure Active Directory, we are commonly looking at the “audience” claim or the “scopes” in the token to make sure that the caller has a token for the right resource and the right level of permissions for the task. But sometimes problems can be even more subtle than that. For example, the “tenant” information can be wrong, and people may never notice the subtle difference in the GUID.

Either way, being able to read the contents of a token is crucial, and so I have always relied on small web apps created by others to do this. However, at work recently, there was discussion about how the most popular site for this (https://jwt.io/) may be storing the tokens that are submitted to it. If someone submits a token that is still active, there is a possibility that the site could use that token and impersonate them! Furthermore, the website was created by a Microsoft competitor, Auth0… so just bad news in general.

I wanted to create my own JWT decoder so that I know for certain that my tokens are not being used maliciously, and so I could learn a little more about JWT tokens in general.

I created this very basic page: http://shawntabrizi.com/jwt/

You can find the GitHub source here. Let’s talk about what I did.

JSON Web Token Structure

A JWT is broken up into 3 sections, all separated by periods. The first section is the Header, which contains information about the token type and the algorithm used to sign or encrypt the token. The second section is the Payload, where all the main claims of the token are stored. Finally, the third section is the Signature, which lets the token issuer prove that they were the ones who actually minted the token. Tokens do not need to be signed, and if they are not, the third section will be empty; however, the token will still contain a period to separate it from the second section, as shown here.

The problem I needed to solve was pretty simple: Take the encoded JWT token, and get the claims out of it. I think the easiest way to explain the steps is simply to look at my commented code:

//This function takes a base 64 url encoded string, and converts it to a JSON object... using a few steps.
function decoder(base64url) {
    try {
        //Convert base 64 url to base 64 (replace every '-' and '_', not just the first occurrence)
        var base64 = base64url.replace(/-/g, '+').replace(/_/g, '/')
        //atob() is a built in JS function that decodes a base-64 encoded string
        var utf8 = atob(base64)
        //Then parse that into JSON
        var json = JSON.parse(utf8)
        //Then make that JSON look pretty
        var json_string = JSON.stringify(json, null, 4)
    } catch (err) {
        json_string = "Bad Section.\nError: " + err.message
    }
    return json_string
}

JWT tokens are Base 64 URL encoded. Base 64 URL is nearly the same as Base 64, but characters like “+” and “/” turn into “-” and “_” respectively. Learn more here. From there, converting a Base 64 encoded string to a pretty JSON string is really self-explanatory.

The rest of the work beyond this is just handling random user inputs. We have checks to verify the individual parts of the token are good, and whether or not the token contains a signature. As I suspected, creating a site to decode JWT tokens is really quite simple, and now I have my own site to do it on!

Discovery through Experience

I have had a blog more than once in the past.

At different points in my life, blogging meant different things. The problem with blogging about myself or other personal things is that I often find I do not relate to my past self.

This blog should hopefully be more permanent because I hope to mostly talk about things related to technology, development, and problem solving.

I think the subtitle of my blog is well fitting for the kind of content I hope to have on here. I want to learn, and then teach, all through experience and action.

I hope that you will discover something new along with me.