Introduction to IR, LSA (and SEO) by Marie-Claire Jenkins
Posted: December 14th, 2008 | Author: Ben McKay | Filed under: SEO Help | Tags: IR, search engines, Semantics Talk
Following on from my last blog post, “Is Semantic SEO the Marketers SEO”, Just Me and My received a fantastic response from Marie-Claire Jenkins, a part-time PhD student at the University of East Anglia, UK, working in the field of Latent Semantic Analysis (LSA) and Information Retrieval (IR). These are essentially what search engine algorithms use to both index and serve the Search Engine Results Pages (SERPs), so an understanding of them is vital for SEOs.
One of the best things about SEO for me is that it’s where technology, analysis and creative marketing converge in such an obvious way. I had been tackling semantic SEO and how it relates very closely to marketing concepts in my previous post, and I could not believe my luck when a finalist of 2006’s Loebner Artificial Intelligence (AI) award agreed to help explain concepts around search engines in more depth. Here’s the interview, but first a few definitions…
Mini IR Glossary
To help explain a few of the points below, here is an explanation of a few of the key terms used…
How search engines work
Latent Semantic Indexing and Analysis:
“Semantic” = meaning
“Latent” = present but hidden
LSI = The analysis of the hidden meaning of words and how often they occur in a document.
Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
OK, so here’s the interview…
What is your background and experience in information retrieval and SEO?
I’m Marie-Claire Jenkins (nickname CJ). I have worked as an SEO for 6 years. I worked for a big Fortune 500 company taking care of SEO and also research and development (reputation mining, sentiment analysis…), so it was a varied role. I then worked for a hip digital agency as head of search and enjoyed it very much.
I did a degree in translating and interpreting, and then an MSc in computing; my thesis was in machine translation. I won a scholarship for a PhD, but I didn’t want to give up the SEO work I was doing, so I ended up doing my PhD part-time. I’m in my final year now and my project is on natural language generation and understanding. It requires a good knowledge of information retrieval, information extraction, NLP [natural language processing], cognitive linguistics, linguistics, human computer interaction and AI.
I’m a yogi and practice Ashtanga every day, I love running and swimming. I’m an internet addict and a news junkie. I love record shopping.
How would you describe IR and its importance for search engine optimisation?
Information retrieval is the bread and butter of most systems. For example, in order for my natural language system to have anything to actually provide to the user, it needs to find the relevant knowledge in the knowledge base. Search engines obviously rely on it, but they also work with ranking algorithms (which are not strictly part of IR), HCI methods, AI and a host of other things.
Information retrieval should definitely be understood at a basic level by SEO experts. When you rely on search to make a living, you should have a thorough idea of how the system that you rely on works. If you don’t keep up with new developments and trends, you could be missing a trick. Most importantly, keeping up with different theories helps you prepare for possible changes in the future. Social networks and Web 2.0, for example, were being discussed in the computing community long before they became mainstream.
There are a good number of SEO people who do take the time to research and understand IR developments, such as Bill Slawski, David Harry, Marios Alexandrou (a brilliant example of someone jumping in and not being afraid to), the guys at SEOmoz and others too. There are a lot, though, who don’t and who rely on hearsay or basic blog information. This is good as a launchpad, but the next step is to verify all of this for yourself and not be afraid to ask questions.
What are the most important aspects of algorithmic LSI and PLSI?
The initial theory of LSI and its methodology has been extended a great deal throughout the years, but if you are interested in where it started you can find the original LSA pdf here. SEOs, either directly or indirectly, have taken some interest in these methodologies, but a greater understanding is obviously quite necessary.
Here are also a few extracts from various posts on latent semantic indexing and analysis on the Science for SEO blog [ed: which you really have to read - great writing on semantics and SE's] to provide a little more detail:
Currently the focus on keywords, which is what LSI uses, isn’t quite right anymore. I’ve seen a lot of recent research (and so have many of you) talking about semantics. There is a lot of work on using semantic units, which are not always keywords anyway. The question should be “What multitudes of methods is Google using?” and “I wonder which LSI method is being used, although I know it is just one factor in a very very large system”. Not “How should I optimise my site for LSI” - I’d ask you which type. I believe that Matt Cutts said something very generic when he said Google used LSI.
If you’re interested in going one step further and building, or using, your own semantic search engine to run these queries, take a look at these latent semantic tools.
What is the importance of LSI / PLSI areas of study for people that take an interest in search engine optimisation?
I think people look for anything that might answer a question, and LSI gave everyone something to embrace. I think it was very valuable because it helped people understand how basic topic detection could work, and helped them gain insight into how to write and present their content. It was very useful for everyone. LSI has since evolved quite a lot, so while the basic formula is still useful for SEO people, the actual LSI methods in use now look very different.
PLSI is the upcoming form of semantics as developed by search engine engineers. How do you think SEOs should build this into their practices on a day-to-day basis?
I think that as far as any LSI technique is concerned, the SEO should simply worry about providing really useful content. I think that working beyond a keyword basis is very useful. There is a lot of talk about semantics, and I know that I use methods in my computing work that look for structures rather than keywords, and also look for the surrounding structures, how they correspond to each other and that kind of thing. If little me is doing that, I suspect that the big grown-up scientists have built on these techniques, and I know they have been expanded a great deal - I don’t need such accuracy in my work.
SEO experts should provide content that is very relevant to the page they are writing for. Using language that is proper to the topic, and that represents it well, is the way to go IMHO. Keywords provide some benefit but there is more to look at.
And also, what about a strategic basis…thinking long-term, is there something that you believe SEOs should consider in their work?
I would simply say use common sense, and take the time to learn about IR. You don’t need to be an expert, and you don’t need to understand all the big complicated equations and the difficult mathematical concepts. At least read the abstract, introduction and conclusion. This should give you enough information.
It’s important to recognise the benefits of Latent Semantic Analysis too:
Its advantage over simple keyword analysis (Boolean search = True or False) is that it can infer meaning which is not evident from the words themselves, and match words which other methods would not normally match. For example, “computer”, “PC” and “laptop” are all connected. Documents are grouped together even if it is not obvious that they are connected, because a “latent semantic space” is created.
It uses vectors in a high-dimensional vector space (lots of them). It creates a term-document matrix from all the documents. This matrix is then factored into 3 matrices using SVD (singular value decomposition); the second of these is a diagonal matrix holding the singular values of the original matrix. Keeping only the d largest singular values means that sets of terms or documents can be represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents. It can be used in any language because of the way that it’s constructed.
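To make those steps a bit more concrete, here is a minimal sketch in Python with NumPy. The toy corpus, the choice of 2 latent dimensions and all the variable names are assumptions made up for illustration; it is not how any particular search engine does it. Note that “computer” and “pc” never appear in the same document here, they only share the neighbour “laptop”.

```python
import numpy as np

# Hypothetical toy corpus: "computer" and "pc" never share a document,
# but both appear alongside "laptop".
docs = [
    "computer laptop",
    "pc laptop",
    "computer software",
    "yoga running swimming",
    "yoga swimming health running",
]

vocab = sorted({word for doc in docs for word in doc.split()})
row = {term: i for i, term in enumerate(vocab)}

# Term-document matrix: one row per term, one column per document, raw counts.
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row[word], j] += 1

# SVD factors A into three matrices: U, the singular values s, and Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the largest singular values (2 is an assumed, arbitrary choice).
n_dims = 2
term_vectors = U[:, :n_dims] * s[:n_dims]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Similarity on the raw counts versus similarity in the latent space.
raw_similarity = cosine(A[row["computer"]], A[row["pc"]])
latent_similarity = cosine(term_vectors[row["computer"]], term_vectors[row["pc"]])
unrelated_similarity = cosine(term_vectors[row["computer"]], term_vectors[row["yoga"]])
```

On the raw counts the two terms have a similarity of 0 because they never co-occur, whereas in the reduced space `latent_similarity` comes out close to 1 and `unrelated_similarity` stays close to 0. That is the “latent” connection at work on a very small scale.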
And the limitations of LSI:
- The resulting dimensions can be very difficult to interpret, so there are mistakes. It’s unclear what the resulting similarities between terms really mean.
- The input is a bag-of-words so we don’t have any text structure information.
- A compound term (e.g. “bull-headed”) is treated as 2 terms.
- Ambiguous terms create noise in the vector space
- There’s no way to define the optimal dimensionality of the vector space
- SVD is computationally expensive, which is a problem for dynamic collections that keep changing
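The bag-of-words and compound-term points are easy to see in a few lines of Python. This is only a sketch: the tokenizer below (lowercase, split on anything that is not a letter) is an assumption for illustration, and real systems tokenize in many different ways.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count words, ignoring order. Assumed tokenizer: lowercase letters only."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


a = bag_of_words("The dog bit the bull-headed man")
b = bag_of_words("The bull-headed man bit the dog")

same_bag = a == b                      # word order is lost
compound_kept = "bull-headed" in a     # the compound is not a term of its own
compound_parts = (a["bull"], a["headed"])
```

The two sentences mean different things but end up as the same bag (`same_bag` is `True`), and “bull-headed” only survives as the separate terms “bull” and “headed”.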
So because of the limitations of LSI, there has been a move towards probabilistic latent semantic indexing / analysis:
- It has a more robust statistical foundation and provides a proper generative data model
- It uses the EM (expectation-maximization) algorithm, with a tempered variant used to avoid over-fitting (a model too specific to noise) - this makes it far more flexible
- It can deal with domain specific synonymy and polysemous words
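For the curious, here is a rough sketch of how a PLSI-style model can be fitted with EM, in Python with NumPy. It assumes the usual latent class form P(d, w) = sum over z of P(z) P(d|z) P(w|z), and it uses plain EM rather than the tempered variant, so it does nothing special about over-fitting. The function name `plsa`, its parameters and the tiny count matrix are all made up for illustration.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit P(z), P(d|z) and P(w|z) to a document-word count matrix with plain EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random starting point, normalised so each distribution sums to 1.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z) P(d|z) P(w|z).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        posterior = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)

        # M-step: re-estimate each distribution from the expected counts.
        expected = counts[None, :, :] * posterior      # shape (topic, doc, word)
        totals = np.maximum(expected.sum(axis=(1, 2)), 1e-12)
        p_w_z = expected.sum(axis=1) / totals[:, None]
        p_d_z = expected.sum(axis=2) / totals[:, None]
        p_z = totals / totals.sum()

    return p_z, p_d_z, p_w_z


# Hypothetical counts: rows are documents, columns are words.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

p_z, p_d_z, p_w_z = plsa(counts, n_topics=2)
```

Each row of `p_w_z` is a probability distribution over words for one latent topic, which is what gives the method its “proper generative data model”. The same word can have weight under several topics, which is how polysemous words are handled.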
…the implications for search engine marketing therefore become all the more apparent regarding site structure and meaning found in the flow of information.
Beyond how PLSI is used, how do you see IR moving forward? Is PLSI as far as we can go with regards to search engines?
I don’t think it’s really about PLSI or anything like that. An awful lot of very prominent scientists who pioneered these techniques are working on personalisation, not to mention Google obviously.
Social networks are very important and harnessing their strengths is super important right now. There are also issues within them, such as deciding how important an individual vote is. If a layman votes on a technical post and an expert does too, are the votes valued the same? Do topics with lots of expert votes rank higher than those with lots of votes overall? There is a lot of spam in these networks, including people voting for their friends and so forth. I think taking on board what is happening right now in personalisation is also quite important.
There are also going to be a lot of developments in things like classification, topic detection and other research areas. I obviously am particularly looking forward to seeing more advances in question-answering systems.
If there was one final bit of advice that you could offer SEOs regarding semantics and IR, what would it be?
Write for your user, pay attention to larger structures rather than just keywords alone, inform yourself about IR and follow developments, and read a lot.
Marie-Claire, this is a spectacular introduction to how search engines work in the way they do. Thank you for sparing this time. For those of you who require more information on the topic:
- Several downloadable research papers on IR, SVD, LSA, LSI, PLSI and more…
- Patterns in unstructured data presentation, including LSI and multi-dimensional scaling
- A piece with ideas on where search engines maybe heading…
- Search Engine Index Presentation by Marie-Claire Jenkins
- Search Engine Spiders by Marie-Claire Jenkins
Books include:
For the more hardcore SEO geeks - but others can also get the gist of it too:
Please note: PLSI / A is but one method of learning models for concepts and related words…there are a variety of methodologies that search engines might buy into. Although this is perceived to be amongst the most prominent directions that search engines have taken regarding their algorithms, it is always a good idea to keep on reading and researching to stay up to date with the SEO game! Algorithmic models constantly evolve and new ones are born…here are a few extra methodologies by Marie-Claire.

Hi Marie, I’m a professor for Uni of Wolverhampton. Your post is really nice and I really appreciate the fact more and more people are looking at SEO from a scientific point of view.
Great post, definitely important to keep track of these search engine changes.
Great data, thanks a lot for sharing.