My NLP Interview Experience (9 June 2023)
A recent interview for an NLP position gave me an opportunity to delve into NLP problems and some good basic introductions to them. The topics that I looked into were semantic search and document deduplication. In this article (and its accompanying Jupyter notebooks), I will dive into the domain of semantic search.
For the interview, I focused on the topic of semantic search. Semantic search consists of returning the texts that are most closely related in meaning to a provided search query. Strictly speaking, this is not a lexical search, where the documents with the most keyword matches show up at the top of the search results.
There are many ways to capture the semantics of a corpus, from pretrained models to more statistical approaches, and just as many ways to scale them up to accommodate more data in a fast and reasonable manner. For the interview itself, I decided to go for a TF-IDF based search, as it was the easiest to explain within an hour and can also be extended to document deduplication.
In the end, I did not pass the interview. Whilst I cannot speak on behalf of the interviewer, as I was not given any reply or feedback about the interview, I think this is a good learning opportunity to explain my work and plan a future v2.0 of it. This post will go through the Jupyter notebook.
Data
We see the data that we are working with here:
 | jobId | jobUrl | jobTitle | jobDescription | datePosted | companyId | companyIdNormalised | companyName | rawWageMin | rawWageMax | sourceName | qualifications |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 00d9917e95ebbb58d237e90b5a01095947c31fbe119d82… | https://www.efinancialcareers.sg/jobs-Singapor… | Change Manager, IBOR Transition Programme | Change Manager, IBOR Transition Programme… | 2022-04-01 | 7cc4c2d8b4893e7c64265beccd30d4c1b644cf8b57a9e8… | 06634f73009b4765beafae5f98c0996b33870a3d34fa87… | Standard Chartered Bank | 0 | 0 | E-FinancialCareer | [‘No Requirement’] |
1 | 0104963e8e1289488f2ff96edfe95dddc9ab84231b37a5… | https://www.efinancialcareers.sg/jobs-Singapor… | Analyst, KYC Analyst, Corporate Banking, Insti… | Analyst, KYC Analyst, Corporate Banking, Insti… | 2022-04-01 | c3475240458aa07566e1db7eec98affa5d85d8bd2b9577… | 1810faaf5f96a398f5b43df1b80809dbf5b7935f94a5f7… | DBS Bank Limited | 0 | 0 | E-FinancialCareer | [‘Bachelors’] |
2 | 01561a39ff31372551e0be1caaf6a2c32150925f75af76… | https://www.jobstreet.com.sg/en/job/senior-leg… | (Senior) Legal Counsel, Autumn Venture - SC Ve… | About Standard CharteredWe are a leading inter… | 2022-04-01 | a1ad3581a81222507fa918dc2d978ed1db672c44415c3f… | 25fa191ddad0bb854bd7bbe811437b1c820271351c4ac8… | Autumn Life Pte. Ltd. | 0 | 0 | JobStreetSG | [‘No Requirement’] |
3 | 0110a85844f5aa0b87060109b25567903d2188130391b2… | https://www.efinancialcareers.sg/jobs-Singapor… | Product/Data Analyst | About usEndowus is Asia’s leading fee-onl… | 2022-04-01 | 4aac38458acbd96d2de7ceda69e0c3f9923c5c8b4773f5… | 3c7c5b38bc57d7f85029e41a219822866fe3dd6587027a… | Endowus |
The columns we will be using are jobDescription and jobTitle.
Data Cleaning
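import re

def clean_text(text):
    # Remove HTML tags, HTML entities, links, and some punctuation.
    try:
        # Strip simple HTML tags such as </br> or <br/>
        text = re.sub(r'<.[a-zA-Z]+.>', ' ', text)
        # Strip HTML entities such as &nbsp;
        text = re.sub(r'&.[a-zA-Z]+.;', '', text)
        # Strip links (anything starting with "http" up to the next whitespace)
        text = re.sub(r'http\S+\s*', ' ', text)
        # Strip full stops and parentheses
        text = re.sub(r'\.', '', text)
        text = re.sub(r'\(', '', text)
        text = re.sub(r'\)', '', text)
        # Collapse repeated spaces and lowercase everything
        text = re.sub(r' +', ' ', text)
        text = text.lower()
    except Exception as e:
        # Non-string input (e.g. NaN) is reported and returned unchanged
        print(f"Error: {text} ({e})")
    return text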
It is commonly understood that preprocessing text data is very important in NLP tasks. However, with pretrained BERT models, especially Sentence-BERT, it is best not to perform common lexical preprocessing such as lemmatization, stemming, and stopword removal. BERT also has its own internal tokenizer to process the text. See this article for details.
Here, we simply remove some of the “dirty” text that does not contribute to the semantic meaning of a sentence, such as extra spaces, links, and HTML elements, e.g. </br> or &amp;.
# preprocess sample (clean_sent comes from the notebook and is not shown in this post)
clean_sent(clean_text(raw_data.jobDescription[68]))
'responsibilitiesmanage lead team provide residential services effectivelymaintain synergy hotel on-site managing agent resident councilresponsible time attendance record supervised team ensure accurate billing process mcstoversee maintenance accurate updated occupant records ensuring staff adherence confidentiality residents contact details personal informationconduct regular staff meetings maintain open channel communicationresolve resident complaints management office maintain high level resident satisfaction service qualitycommunicate on-site management office resident feedback recurring challenges improvementsparticipate site meetings convened management office requiredconduct participate yearly service excellence audit collaborate on-site managing agent meet compliancemaintain ongoing schedules ensure residential facilities safe clean attractivemaintain compliance regulatory requirements including workplace health amp safety occupational health amp safetyrequirementsdiploma hospitality management equivalentmin years experience least years leadership role luxury hospitality servicestrong leadership communication skillsability build trusting relationships stakeholdersproficient ms office-'
Sample job description after cleaning
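# Drop rows with missing values, then renumber the index from 0.
if raw_data.isnull().values.any():
    raw_data.dropna(how='any', inplace=True)
# reset_index already gives a 0..n-1 index, so the extra set_axis call
# (whose inplace argument no longer exists in pandas 2.x) is not needed.
raw_data.reset_index(drop=True, inplace=True)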
We also remove any rows that contain NaN values, including those in the jobDescription and jobTitle columns.
Calculating TF-IDF Matrix
Here, we use scikit-learn to produce the vocabulary (dictionary) as well as the tfidf_matrix. For context:
- TF (Term Frequency): number of occurrences of a word in a doc / total number of words in that doc, describing how prominent the word is within that document.
- IDF (Inverse Document Frequency): log(total number of docs / number of docs in which the word appears), describing how rare the word is across the corpus.
- Each value in a document's vector lies between 0 and 1.0, because TfidfVectorizer L2-normalises every document vector by default. A higher value represents a higher importance of that word to the document; a value of 1.0 means it is the only weighted word in that document.
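To make the two terms concrete, here is a minimal sketch in plain Python that uses the textbook forms: TF as a word's share of a document's tokens, and IDF as log(total docs / docs containing the word). The toy corpus and the function names are made up for illustration, and scikit-learn applies a smoothed variant, so its numbers will differ.

```python
import math

# Toy corpus of already-tokenised documents (hypothetical, not the job data).
docs = [
    ["data", "analyst", "bank"],
    ["legal", "counsel", "bank"],
    ["data", "engineer", "data"],
]

def tf(word, doc):
    # Share of the document's tokens that are `word`.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total docs / docs containing `word`).
    # Assumes the word appears in at least one document.
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "data" is more frequent in the last document, but "engineer" is rarer
# across the corpus, so it ends up with the higher weight.
tfidf_data = tfidf("data", docs[2], docs)
tfidf_engineer = tfidf("engineer", docs[2], docs)
print(f"data: {tfidf_data:.3f}, engineer: {tfidf_engineer:.3f}")
```

The comparison at the end is the point of the weighting: a word that shows up in fewer documents can outrank a word that is repeated more often inside one document.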
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# `tokenize` is the tokenizer function from the notebook (not shown in this post)
vectorizer = TfidfVectorizer(tokenizer=tokenize, analyzer="word")
bow_vectorizer = CountVectorizer(tokenizer=tokenize, analyzer="word")

# One document per row; both matrices share the same (docs, vocabulary) shape
corpus = resume.Resume_str
tfidf_matrix = vectorizer.fit_transform(corpus)
bow_matrix = bow_vectorizer.fit_transform(corpus)
print(f"tfidf_matrix shape: {tfidf_matrix.shape}")
print(f"bow_matrix shape: {bow_matrix.shape}")
There are several libraries that let you calculate TF-IDF values quickly. The one I used in this case is from scikit-learn. TfidfVectorizer will first create a vocabulary by counting the occurrences of each word in every document (a bag-of-words count vector), and then calculate the TF-IDF values from those counts. You can learn more about its inner workings here.
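As a rough illustration of those two stages, the sketch below reproduces in plain Python what TfidfVectorizer computes under its default settings: raw counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The toy documents and variable names are hypothetical, and tokenisation and lowercasing are skipped by starting from token lists.

```python
import math
from collections import Counter

# Toy, already-tokenised documents (hypothetical examples).
docs = [["hr", "manager", "hr"], ["hr", "consultant"], ["data", "analyst"]]

# Stage 1: build the vocabulary (word -> column index) and count words per doc.
vocabulary = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
counts = [Counter(d) for d in docs]

# Stage 2a: smoothed inverse document frequency for every vocabulary word.
n_docs = len(docs)
doc_freq = Counter(w for d in docs for w in set(d))
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

# Stage 2b: weight the counts by IDF, then scale each row to unit length.
def tfidf_row(count):
    row = [0.0] * len(vocabulary)
    for word, c in count.items():
        row[vocabulary[word]] = c * idf[word]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

tfidf_rows = [tfidf_row(c) for c in counts]
for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

Because every row is scaled to unit length, each entry stays between 0 and 1, and the dot product of two rows is already their cosine similarity.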
Cosine Similarity
## Cosine similarity matrix of a corpus
The cosine score ranges from 0 (no similarity) to 1 (exactly the same).

$\operatorname{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
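# Pairwise similarity between every pair of documents
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
bow_cosine_sim = cosine_similarity(bow_matrix, bow_matrix)
print(cosine_sim)
print(f"cosine_sim shape: {cosine_sim.shape}")

# TF-IDF rows are already L2-normalised, so a plain dot product
# (linear_kernel) gives the same scores as cosine_similarity
cosine_sim_lk = linear_kernel(tfidf_matrix, tfidf_matrix)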
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. With this, we can estimate if the embedding vectors of two documents are similar, and rank them accordingly.
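For a single pair of vectors, the formula can be sketched in a few lines of plain Python. The function name and the example vectors are made up for illustration; for real matrices, scikit-learn's cosine_similarity is the practical choice.

```python
import math

def cosine_similarity_vec(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is (numerically) 1.
same_direction = cosine_similarity_vec([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# No shared non-zero dimension: similarity is 0.
no_overlap = cosine_similarity_vec([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(same_direction, no_overlap)
```

Only the direction of the vectors matters, which is why a long document and a short one can still score as highly similar.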
tfidf_matrix shape: (2484, 32351)
bow_matrix shape: (2484, 32351)
[[1. 0.25575024 0.23616097 ... 0.09187018 0.10855762 0.09396372]
 [0.25575024 1. 0.20146724 ... 0.07162671 0.1392705 0.0711357 ]
 [0.23616097 0.20146724 1. ... 0.0621186 0.09232892 0.08819255]
 ...
 [0.09187018 0.07162671 0.0621186 ... 1. 0.0711868 0.17859443]
 [0.10855762 0.1392705 0.09232892 ... 0.0711868 1. 0.06674726]
 [0.09396372 0.0711357 0.08819255 ... 0.17859443 0.06674726 1. ]]
cosine_sim shape: (2484, 2484)
Inverted Document Indexing
from tqdm import tqdm

def inverted_index(words):
    """
    An inverted index of words for one document (given word, find its idx positions)
    """
    inverted = {}
    for idx, word in enumerate(words):
        loc = inverted.setdefault(word, [])
        loc.append(idx)
    return inverted

def inverted_index_add(inverted, docID, doc_idx):
    # Merge one document's index into the corpus-wide index:
    # word -> {docID: [positions]}
    for word in doc_idx.keys():
        loc = doc_idx[word]
        indices = inverted.setdefault(word, {})
        indices[docID] = loc

    return inverted

corpus = resume.Resume_str
inverted_doc_idx = {}
word_corpus = {}
with tqdm(total=len(corpus)) as pbar:
    for docid, x in enumerate(corpus):
        words = tokenize(x)
        word_corpus[docid] = words
        inv_idx = inverted_index(words)
        inverted_index_add(inverted_doc_idx, docid, inv_idx)
        pbar.update(1)
Finally, we generate an inverted index table. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in. At a high level, this is similar to how Elasticsearch performs its search.
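To show the shape of the resulting table, here is a small self-contained sketch on three made-up documents. It builds the same word -> {docID: [positions]} mapping in a single loop; the names toy_docs, toy_index and docs_containing are hypothetical.

```python
# Toy documents, already tokenised (hypothetical examples).
toy_docs = {
    0: ["hr", "manager", "hr"],
    1: ["hr", "consultant"],
    2: ["data", "analyst"],
}

# word -> {docID: [positions of the word in that document]}
toy_index = {}
for doc_id, words in toy_docs.items():
    for position, word in enumerate(words):
        toy_index.setdefault(word, {}).setdefault(doc_id, []).append(position)

def docs_containing(word):
    # A lookup is a single dictionary access, with no scan over the corpus.
    return sorted(toy_index.get(word, {}))

print(toy_index["hr"])          # which docs contain "hr", and where
print(docs_containing("data"))  # doc IDs only
```

The lookup cost depends on the number of documents that contain the word, not on the size of the whole corpus, which is what makes this structure attractive for search.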
Search Function
import pandas as pd

# Make sure you run the cells above first: this function relies on
# tfidf_matrix, vectorizer, inverted_doc_idx and resume.
def ranked_search(query, firstx=10):
    tokens = set(tokenize(query))
    # word -> column index in tfidf_matrix (avoids rebuilding the feature list per lookup)
    vocabulary = vectorizer.vocabulary_
    query_weights = {}
    # for every query token, get the documents in which the term exists
    for token in tokens:
        wordidx = vocabulary.get(token)
        if wordidx is None:
            continue  # token is not a TF-IDF feature, so it cannot be scored
        for docid in inverted_doc_idx.get(token, {}):
            tfidfval = tfidf_matrix[docid, wordidx]
            # we add the value onto that doc id. The more words scored for that doc, the heavier the weight
            query_weights[docid] = query_weights.get(docid, 0) + tfidfval

    # keep only the firstx highest-scoring documents
    top_docs = sorted(query_weights.items(), key=lambda x: x[1], reverse=True)[:firstx]
    result = []
    for docid, tfidfval in top_docs:
        data = {
            'Relevance': round(tfidfval * 100, 2),
            'ID': docid,
            'Resume_str': resume.Resume_str.iloc[docid],
            'Category': resume.Category.iloc[docid],
        }
        result.append(data)
    return pd.DataFrame(result)
%time ranked_search("HR")

  Relevance ID Resume_str Category
0 58.90 4 HR MANAGER Skill Highlights ... HR
1 51.87 101 REGIONAL HR BUSINESS PARTNER Hu... HR
2 51.10 58 HR CONSULTANT Summary C... HR
3 49.76 92 GLOBAL HR MANAGER Summary ... HR
4 46.55 85 SENIOR HR BUSINESS PARTNER ... HR
5 46.51 31 HR GENERALIST Professional Prof... HR
6 46.18 68 HR DIRECTOR Summary HR Prof... HR
7 45.31 69 HR PROFESSIONAL Summary Dep... HR
8 42.35 88 REGIONAL HR DEPUTY MANAGER Summ... HR
9 42.09 65 HR CONSULTING Summary 7+ yea... HR
Challenges
Some of the challenges of this method are:
- Scalability: Computing the TF-IDF matrix becomes computationally expensive as the corpus grows. For ~4000 documents, it took about 7 minutes to compute the TF-IDF matrix. Adding new documents also requires recomputing the matrix, or some form of partial fitting like this repo suggests.
- Lexical dependency: Even though we have captured some semblance of semantic meaning in our embeddings, a vocabulary built on word occurrence counts is ultimately still lexical. That means its values are heavily influenced by how the text is preprocessed and by the statistics of word appearances.
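The lexical dependency is easy to see on a toy case. The sketch below is not the search function from this post, just a hypothetical helper that checks which terms a query and a document share: a query phrased with synonyms shares none, so any score built on word occurrences has nothing to work with.

```python
def shared_terms(query, document):
    # Lowercase, split on whitespace, and keep the words both texts contain.
    return set(query.lower().split()) & set(document.lower().split())

doc_text = "Automobile maintenance technician"  # made-up document

# The same wording as the document matches on two terms.
lexical_hit = shared_terms("automobile technician", doc_text)
# A synonym query means roughly the same thing but shares no terms at all.
synonym_miss = shared_terms("car mechanic", doc_text)
print(lexical_hit, synonym_miss)
```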
What’s Next?
We will look into S-BERT as a way of producing embeddings, instead of TF-IDF values, for a semantic search system.