Varakulan's Blog
The World is not enough to learn...
Monday, July 26, 2010
Automatic generation of META tags in ASP.NET
Some of the well-known tags commonly used in SEO are the following three meta tags: the meta title tag, the meta keywords tag and the meta description tag:
< meta name="title" content="title goes here" />
< meta name="keywords" content="keywords, for, the, page, go, here"/>
< meta name="description" content="Here you will find a textual description of the page" />
A lot has been written about the benefits of using them, and almost as much claiming that search engines no longer consider them. Either way, whether or not they factor into the calculation of the SERP (Search Engine Results Page), nobody disputes the benefit of having them correctly set on all your pages. Meta description tags, at least, are still taken into account by Google to some degree, since Google Webmaster Tools warns you about pages with duplicate meta descriptions:
Differentiate the descriptions for different pages. Using identical or similar descriptions on every page of a site isn't very helpful when individual pages appear in the web results. In these cases we're less likely to display the boilerplate text. Wherever possible, create descriptions that accurately describe the specific page. [...]
The question is not “should I use meta tags in my pages?”. The real questions (and here comes the problem) are “how can I create individual meta descriptions for all my pages?” and “how can I automate the creation of meta keywords?”. Doing it by hand would be too much work (or too much technical work) for you, or for your users if they create content on their own.
For instance, consider a CMS (Content Management System) in which users are prompted for some fields to create a new entry. In the simplest form, the CMS can ask the user to enter a title and the content for the new entry. In advanced-user mode, the CMS could also ask the user to suggest some keywords, but the user will probably enter just two, three or four words (if any). The CMS needs a way to automatically guess and suggest a default set of meta keywords based on the content before finally saving the new entry. The user could then review, complete and accept them. Meta titles and meta descriptions are much easier, but they are also covered in our code.
In our sample VB project we will not suggest keywords for the user to confirm; we will just calculate them on the fly and set them without user intervention. We will use a dummy VirtualPathProvider that overrides the GetFile function in order to retrieve the virtualPath file from the real file system, so it is not a real VirtualPathProvider in the full sense, just a wrapper to take control of the files being served to ASP.NET before they are actually compiled. A VirtualPathProvider is commonly used to seamlessly integrate path URLs with databases or any other source of data other than the file system itself. Our custom class inheriting from VirtualPathProvider will be called FileWrapperPathProvider. In our case it will not use the full potential of VirtualPathProviders, since we will only retrieve the data from the file system, make minor changes to the source code on the fly, and return it to be compiled. This introduces a bit of overhead and some extra CPU cycles before the compilation of the pages, but that only happens once, until the file needs to be compiled again (because the underlying file has changed, for instance).
Our FileWrapperPathProvider.GetFile function will return a FileWrapperVirtualFile whenever the requested virtualPath meets the conditions of the IsPathVirtual function: the file extension is .aspx or .aspx.vb and the path of the requested URL follows the scheme ~/xx/, that is to say, it sits under a folder of two characters (for the language: ~/en/, ~/de/, ~/fr/, ~/es/, …). Otherwise, it will return a VirtualFile handled by the previously registered VirtualPathProvider, i.e. none, or the file system itself, without any change.
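To make that wiring more concrete, here is a minimal sketch of what such a provider can look like. The class names follow the article, but the regular expression and the FileWrapperVirtualFile constructor are assumptions of mine, not the sample project's literal code:

Imports System.Web
Imports System.Web.Hosting
Imports System.Text.RegularExpressions

Public Class FileWrapperPathProvider
    Inherits VirtualPathProvider

    ' Only .aspx / .aspx.vb files under a two-letter language folder (~/en/, ~/es/, ...) are wrapped.
    Private Function IsPathVirtual(ByVal virtualPath As String) As Boolean
        Dim appRelative As String = VirtualPathUtility.ToAppRelative(virtualPath)
        Return Regex.IsMatch(appRelative, "^~/[a-z]{2}/.+\.aspx(\.vb)?$", RegexOptions.IgnoreCase)
    End Function

    Public Overrides Function FileExists(ByVal virtualPath As String) As Boolean
        Return IsPathVirtual(virtualPath) OrElse Previous.FileExists(virtualPath)
    End Function

    Public Overrides Function GetFile(ByVal virtualPath As String) As VirtualFile
        If IsPathVirtual(virtualPath) Then
            ' Wrap the real file so its contents can be rewritten before compilation.
            Return New FileWrapperVirtualFile(virtualPath)
        End If
        ' Anything else is served by the previously registered provider (the plain file system).
        Return Previous.GetFile(virtualPath)
    End Function
End Class

Such a provider would be registered once at application start, typically with HostingEnvironment.RegisterVirtualPathProvider(New FileWrapperPathProvider()) in Global.asax.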
We have chosen to use a VirtualPathProvider wrapper around the real file system just to show what kind of things can be done with that class. If your data is on a database instead of static files, you will probably be using your own VirtualPathProvider, and in that case it will work by virtualizing the path being requested and retrieving the file contents from the database instead of the filesystem. Whichever the case, you can adapt it to your scenario in order to make use of the idea that we will illustrate in this post.
The idea is somewhat twisted or cumbersome:
1. Parse the code-behind file for the page being requested (the .aspx.vb file) and, using regular expressions (regex), replace the base class so that the page no longer inherits from System.Web.UI.Page but from System_Web_UI_ProxyPage instead (a custom class of our own). This proxy page class declares public MetaTitle, MetaDescription and MetaKeywords properties and links them to the underlying meta title, meta description and meta keywords declared inside the head tag of the masterpage. When a page inherits from our System_Web_UI_ProxyPage, it exposes those three properties so they can be easily set. See System_Web_UI_ProxyPage.OnLoad in our sample project.
2. Read and parse the .aspx file linked to the former .aspx.vb file (the same name without the .vb) and call the JAGBarcelo.MetasGen.GuessMetasFromString method, which does the main job on the file contents. See the FileWrapperVirtualFile.Open function in the sample project.
3. Besides changing the base class to our own, we add a few lines to create (or extend) the Page_Init method in that .aspx.vb file. In those few lines of code, added on the fly, we set the three properties exposed by the System_Web_UI_ProxyPage class to the values we have just calculated.
4. Return the Stream as the output of the VirtualFile.Open function with the modified contents, so that it can be compiled by the ASP.NET engine, based on the underlying real file, using the previously calculated meta title, meta keywords and meta description. Note that this is done in memory; the actual file system is never written to. The real files are read (.aspx.vb and .aspx), parsed, and the virtual contents created on the fly and handed to ASP.NET. You need to be really careful, since you can run into compile-time errors in places that are hard to understand: the file system version of the files is only the base content, not the actual content being compiled. A rough sketch of this rewrite follows the list.
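As a rough sketch of steps 1, 3 and 4 (the regular expressions, variable names and injected code here are simplified assumptions, not the project's exact source), the body of FileWrapperVirtualFile.Open could look like this:

' Simplified sketch of FileWrapperVirtualFile.Open; assumes Imports System.IO, System.Text and
' System.Text.RegularExpressions, and that title/keywords/description were already calculated
' from the companion .aspx file via JAGBarcelo.MetasGen.GuessMetasFromString.
Dim source As String = File.ReadAllText(codeBehindPhysicalPath)

' Step 1: make the page inherit from the proxy class instead of System.Web.UI.Page.
source = Regex.Replace(source, "Inherits\s+System\.Web\.UI\.Page", "Inherits System_Web_UI_ProxyPage")

' Step 3: append a Page_Init that pushes the calculated values into the proxy properties.
Dim pageInit As String = _
    "Protected Sub Page_Init(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Init" & vbCrLf & _
    "    MetaTitle = """ & title & """" & vbCrLf & _
    "    MetaKeywords = """ & keywords & """" & vbCrLf & _
    "    MetaDescription = """ & description & """" & vbCrLf & _
    "End Sub" & vbCrLf & "End Class"
source = Regex.Replace(source, "End\s+Class", pageInit)

' Step 4: hand the modified source to ASP.NET as an in-memory stream; nothing is written to disk.
Return New MemoryStream(Encoding.UTF8.GetBytes(source))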
The way we calculate the metas in JAGBarcelo.MetasGen.GuessMetasFromString is:
1. Select the proper noise-word set depending on the language of the text.
2. Look for the content inside the whole text. It must be inside ContentPlaceHolders (we assume you are using masterpages), and we look for the particular ContentPlaceHolder that contains the main body/contents of the page. Change the LookForThisContentPlaceHolder constant inside the MetasGen.vb file to customise it for your own masterpage ContentPlaceHolder names.
3. Calculate the meta title as the text within the first <h1> tag right after the searched ContentPlaceHolder.
4. Iterate through the rest of the content, counting word occurrences and two-word phrase occurrences, discarding noise words for the given language.
5. Calculate the keywords, creating a string filled with the most frequent single-word occurrences (up to 190 characters) and the most frequent two-word occurrences (up to 250 characters in total).
6. Calculate the description, concatenating previously parsed content to create a string of between 200 and 394 characters. Those two figures are not chosen at random: Google Webmaster Tools warns you when any of your pages has a meta description shorter than 200 or longer than 394 characters (based on my experience).
7. Return the calculated title, keywords and description in the proper ByRef parameters.
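Put together, the routine implied by these steps has roughly this shape (a skeleton only; the actual signature and internals in MetasGen.vb may differ):

Namespace JAGBarcelo
    Public Class MetasGen
        Public Shared Sub GuessMetasFromString(ByVal pageSource As String, ByVal language As String, _
                ByRef title As String, ByRef keywords As String, ByRef description As String)
            ' 1. Pick the noise-word lists for the requested language (EN, ES, FR, DE, IT).
            ' 2. Locate the main ContentPlaceHolder (see LookForThisContentPlaceHolder).
            ' 3. title = text of the first <h1> after it.
            ' 4-5. Count single-word and two-word frequencies, skipping noise words, and build keywords.
            ' 6. description = HTML-cleaned first paragraphs, kept between 200 and 394 characters.
        End Sub
    End Class
End Namespace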
A good thing about this approach, using a VirtualFile, is that you can easily apply it to an already existing website. No matter how many pages your site has (hundreds, thousands, ...), this code adds a meta title, meta keywords and a meta description to every page automatically and transparently, without user intervention, with very few modifications (if any) to your existing pages, and it scales well.
Counting word occurrences.
We iterate through the words within the text under consideration (the ContentPlaceHolder) and store their occurrences in a Hashtable (ht1 for single words and ht2 for two-word phrases). All words are considered in their lowercase form. A word must have more than two characters to be taken into account and must not start with a number. If it passes this fast test, it is checked against a noise-word list. If it is not a noise word, it is checked against the existing keys in the proper Hashtable and either added (ht1.Add(word, 1)) or, if it was already there, its value is incremented (ht1(word) = ht1(word) + 1).
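A condensed version of that loop might look like the following, assuming contentText holds the extracted text and noiseWords is the merged noise-word array for the current language (the names are mine, not the project's):

Dim ht1 As New Hashtable()   ' single-word frequencies; ht2 (two-word phrases) works the same way
For Each token As String In Regex.Split(contentText, "\W+")
    Dim word As String = token.ToLowerInvariant()
    ' Skip short words, words starting with a digit, and noise words.
    If word.Length <= 2 OrElse Char.IsDigit(word(0)) Then Continue For
    If Array.IndexOf(noiseWords, word) >= 0 Then Continue For
    If ht1.ContainsKey(word) Then
        ht1(word) = CInt(ht1(word)) + 1
    Else
        ht1.Add(word, 1)
    End If
Next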
Regarding the noise words, we first considered some of the word frequency lists available out there, but then we thought about using verb conjugations as well. So we first created MostCommonWordsEN, an array based on simple frequency lists, and then we also created MostCommonVerbsEN based on another frequency list that considered only verbs. Finally we created MostCommonConjugatedVerbsEN, where we stored all the conjugations of those most common English verbs. When checking a word against these lists we only use MostCommonWordsXX and MostCommonConjugatedVerbsXX (where XX is one of EN, ES, FR, DE, IT). Yes, we did the same for other languages such as Spanish, French, German and Italian, whose conjugations are much more complex than the -ed, -ing and -s endings. For the automatic generation of all possible conjugations of the given verbs (in their infinitive form) we used http://www.verbix.com/.
Calculating meta title.
It will be the text within the first <h1> heading tag right after the main ContentPlaceHolder of the page.
Calculating meta description.
Most of the time, a description of what a whole text is about is (or at least should be) found within its first paragraphs. Based on that assumption, we try to parse and concatenate the text within paragraph (<p>) tags after the first <h1> tag. In our experience, when a meta description tag is longer than 394 characters, Google Webmaster Tools complains about it being too long. With that in mind, we concatenate HTML-cleaned text from the first paragraphs of the text to create the meta description tag, ensuring it is no longer than 394 characters. Once we know how our meta descriptions are automatically created, all we need to do is create our pages starting with an <h1> header tag followed by one or more paragraphs (<p>) that will be the source of the page's meta description. This will suit most scenarios. In other cases, you should change the way you create your pages or adapt the code to your needs.
Calculating meta keywords.
Provided those noise-word lists for a given language, calculating the keyword (single-word) and key-phrase (two-word) occurrences within the text is straightforward. We just iterate through the text, check against the noise words, and add a new keyword or increment its frequency if it is already in the Hashtable. At the end of the iteration, we sort the Hashtables by descending frequency (using a custom class implementing System.Collections.IComparer). The final keyword list is a combination of the most frequent single keywords (ht1), up to 190 characters, and the most common two-word key phrases (ht2), until completing a maximum of 250 characters. All of them are comma-separated values in lowercase.
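For reference, a comparer along those lines, and the way it could be applied to ht1, might look like this (the class name and the surrounding code are assumptions):

' Sorts DictionaryEntry items by descending frequency (the entry Value holds the count).
Public Class DescendingFrequencyComparer
    Implements System.Collections.IComparer

    Public Function Compare(ByVal x As Object, ByVal y As Object) As Integer _
            Implements System.Collections.IComparer.Compare
        Dim fx As Integer = CInt(CType(x, DictionaryEntry).Value)
        Dim fy As Integer = CInt(CType(y, DictionaryEntry).Value)
        Return fy.CompareTo(fx)
    End Function
End Class

' Applying it to the single-word table:
Dim sorted(ht1.Count - 1) As DictionaryEntry
ht1.CopyTo(sorted, 0)
Array.Sort(sorted, New DescendingFrequencyComparer())
' Then walk 'sorted', appending comma-separated lowercase keywords until the 190/250 character limits are reached.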
Padding is invalid and cannot be removed in WebResource.axd
If you are using ASP.NET on your website and take a look at your Application EventLog, you will probably see warning entries like this:
CryptographicException: Padding is invalid and cannot be removed.
Event Type: Warning
Event Source: ASP.NET 2.0.50727.0
Event Category: Web Event
Event ID: 1309
Date: 21/08/2009
Time: 13:08:48
User: N/A
Computer: WEBSERVER
Description:
Event code: 3005
Event message: An unhandled exception has occurred.
Event time: 21/08/2009 13:08:48
Event time (UTC): 21/08/2009 11:08:48
Event ID: 1cc59501bae34562a1e486c16f2e799f
Event sequence: 11912
Event occurrence: 1
Event detail code: 0
Application information:
  Application domain: /LM/W3SVC/1/ROOT-1-128952696565995867
  Trust level: Full
  Application Virtual Path: /
  Application Path: C:\Inetpub\webs\www.test-domain.com\
  Machine name: WEBSERVER
Process information:
  Process ID: 3920
  Process name: w3wp.exe
  Account name: TEST-DOMAIN\IWAM_WEBSERVER
Exception information:
  Exception type: CryptographicException
  Exception message: Padding is invalid and cannot be removed.
Request information:
  Request URL: http://www.test-domain.com/WebResource.axd?d=pFeBotgPWN6u7M4UfAnWTw2&t=633687432177195930
  Request path: /WebResource.axd
  User host address: 127.0.0.1
  User:
  Is authenticated: False
  Authentication Type:
  Thread account name: TEST-DOMAIN\IWAM_WEBSERVER
Thread information:
  Thread ID: 12
  Thread account name: TEST-DOMAIN\IWAM_WEBSERVER
  Is impersonating: False
  Stack trace:
    at System.Security.Cryptography.RijndaelManagedTransform.DecryptData(Byte[] inputBuffer, Int32 inputOffset, Int32 inputCount, Byte[]& outputBuffer, Int32 outputOffset, PaddingMode paddingMode, Boolean fLast)
    at System.Security.Cryptography.RijndaelManagedTransform.TransformFinalBlock(Byte[] inputBuffer, Int32 inputOffset, Int32 inputCount)
    at System.Security.Cryptography.CryptoStream.FlushFinalBlock()
    at System.Web.Configuration.MachineKeySection.EncryptOrDecryptData(Boolean fEncrypt, Byte[] buf, Byte[] modifier, Int32 start, Int32 length, IVType ivType, Boolean useValidationSymAlgo)
    at System.Web.UI.Page.DecryptStringWithIV(String s, IVType ivType)
    at System.Web.Handlers.AssemblyResourceLoader.System.Web.IHttpHandler.ProcessRequest(HttpContext context)
    at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
    at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
Custom event details:
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Depending on how busy your web server is, you will see them appear anywhere from once in a while up to every few minutes, filling your EventLog and ranging from a light annoyance to a real problem (depending on how hypochondriac you are).
In fact, they are just warnings that can be ignored in most cases, but they can become a real problem when they bury other events and you cannot see the forest for the trees. If there are many of them and you want to get rid of them (or at least most of them), keep on reading.
You might check your IIS log around the times the warnings appear and (if you also log the user-agent) you will probably see that most of the time the URL is NOT requested by a real user, but by a search engine spider doing its crawl (googlebot, msnbot, yahoo, tahoma, or any other). You can double-check by doing a reverse DNS lookup of the offending IP address with ping –a aaa.bbb.ccc.ddd, and you will also see that the IP resolves to something like *.googlebot.com, *.search.msn.com, *.crawl.yahoo.net or *.ask.com. This should give you a hint on what to do…
WebResource.axd is just an httpHandler that wraps several resources within the same DLL. It is in charge of returning anything from little .gif files (such as the arrows of the ASP:Menu control) to the .js files governing the behavior of the menu itself. Even if your website does not use the ASP:Menu control, you are probably using WebResource.axd for the JavaScript that handles the postback of your form, or for something else.
Why does this exception happen?
If you look closely at the parameters following the WebResource.axd request, you will notice two of them. The first one, d, refers to a particular resource embedded in the httpHandler DLL. It is a fixed value as long as the source DLL is not updated or recompiled. The second parameter, t, is a timestamp that changes whenever the web application (AppPool) is recompiled (a changed/updated DLL, an update to web.config, and so on) and depends on the machineKey of the web site. If web.config does not explicitly declare a fixed machineKey, the t parameter will change from time to time (restarts, AppPool recycles, etc.).

In fact, these CryptographicException warnings are well known in web farm configurations. In that case, all the servers belonging to the same farm must share the same machineKey, because if a page (the .aspx container page) served by one server of the farm includes a value of the t parameter and the subsequent request for that resource URL is handled by another server of the farm, the exception arises and the user cannot download the resource. In that case we would be talking about real browsers with real users behind them, not spider engines.
The solution: two steps.
As you can imagine, the first thing you can do is set a fixed machineKey in your web.config file. Even if you are not running a cluster or a web farm, it will help you minimize the occurrences of the Padding is invalid and cannot be removed warning. For this you can use a machineKey generator, or generate your own if you know how to do it (random chars will not work):

<machineKey validationKey='A06BDCF2F6CF.A.VERY.LONG.44F13E76184945A7C477601' decryptionKey='99079B21C2F3644.A.BIT.SHORTER.BB81C7E9D58378' validation='SHA1'/>

The second (and easier) step is to prevent WebResource.axd URLs from being requested as much as possible, in particular by search engine crawlers and bots, since those resources should not be indexed or cached by them in any way. Those URLs are not real content to be indexed. If you just add the following lines to your robots.txt, you will see the frequency of the CryptographicException drop drastically. If you also set the machineKey to a static value, you will get rid of them almost completely:

User-agent: *
Disallow: /WebResource.axd

As I said, you will get rid of this warning almost completely. There might be search engines that do not follow your robots.txt policies, users visiting you from a Google cached page version, etc., so you cannot get rid of these warning messages for good, but it is enough for them not to be a problem anymore.
Summary.
Summing up, this event appears when there is a big time difference (lapse) between the request for the page that contains the resource and the request for the resource itself. During that lapse, the application pool might have been recycled, recompiled, the server restarted, etc., thus changing the value of t and rendering the older t value useless (the cryptographic checks fail).

ETag implementation for ASP.NET
This means implementing the logic for handling requests whose headers specify If-None-Match (ETag) and/or If-Modified-Since values. This is not easy, since ASP.NET does not offer direct support for it, nor does it have primitives/methods/functions for it; by default it always returns 200 OK, no matter what the request headers say (apart from errors, such as 404, and so on).
The idea behind this is quite simple; let’s suppose a dialog between a browser (B) and a web server (WS):
B: Hi, can you give me a copy of ~/document.aspx?
WS: Of course. Here you are: 200Kb of code. Thanks for coming, 200 OK.
B: Hi again, can you give me a copy of ~/another-document.aspx?
WS: Yes, we’re here to serve. Here you are: 160Kb. Thanks for coming, 200 OK.
(Now the user clicks on a link that points to ~/document.aspx or goes back in his browsing history)
B: Sorry for disturbing you again, can I have another copy of ~/document.aspx?
WS: No problem at all. Here you are: 200Kb of code (the same as before). Thanks for coming, 200 OK.
Stupid, isn’t it? The way to enhance the dialogue and avoid unnecessary traffic is having a richer vocabulary (If-None-Match & If-Modified-Since). Here is the same dialogue with these improvements:
B: Hi, can you give me a copy of ~/document.aspx?
WS: Of course. Here you are: 200Kb of code. ISBN is 555111222 (ETag) and this is the 2009 edition (Last-Modified). Thanks for coming, 200 OK.
B: Hi again, can you give me a copy of ~/another-document.aspx?
WS: Yes, we are here to serve. Here you are: 160Kb. ISBN is 555111333 (ETag) and it is the 2007 edition (Last-Modified). Thanks for coming, 200 OK.
(Now time passes and the user goes back to ~/document.aspx; maybe it was in his favorites, or he arrived at the same file after browsing for a while)
B: Hi again, I already have a copy of ~/document.aspx, ISBN is 555111222 (If-None-Match), dated 2009 (If-Modified-Since). Is there any update for it?
WS: Let me check… No, you are up to date, 0Kb transferred, 304 Not modified.
It sounds more logical. It takes a little more dialogue (negotiation) before the transaction, but when the conditions are met, those extra words save time and money (bandwidth) for both parties.
Most browsers nowadays support such a negotiation, but the web server must support it too in order to get the benefits. Unfortunately, IIS only supports conditional GET natively for static files. If you want to use it for dynamic content (ASP.NET pages) as well, you need to add support for it programmatically. That’s what we are going to show here.
Calculating Last-Modified response header.
To begin with, the server needs to know when a page was last modified. This is very easy for static content: a simple mapping between the web page being requested and the file in the underlying file system, and you are done. The calculation of this date for .ASPX files is a little more complicated. You need to consider all the dependencies of the content being served and take the most recent date among them. For instance, let’s suppose the browser requests a page at ~/default.aspx and this file is based on a masterpage called ~/MasterPage.master, which has a menu inside it that grabs its contents from the file ~/web.sitemap. In the simplest scenario (no content retrieved from a database, no user controls), ~/default.aspx will contain static content. In this case, the Last-Modified value will be the most recent last-modification time of these files:
- ~/default.aspx
- ~/default.aspx.vb (Optionally, depending on your pages having code behind which modifies the output or not)
- ~/MasterPage.master
- ~/MasterPage.master.vb (Optionally)
- ~/web.sitemap
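A small helper can compute that date; the following sketch (the class and method names are mine) simply takes the newest last-write time among the app-relative paths it is given:

Imports System.IO
Imports System.Web

Public Class ConditionalGetHelper
    ' Returns the most recent last-write time (UTC) among the page's dependencies.
    Public Shared Function GetLastModifiedUtc(ByVal ParamArray appRelativePaths() As String) As DateTime
        Dim newest As DateTime = DateTime.MinValue
        For Each relPath As String In appRelativePaths
            Dim physical As String = HttpContext.Current.Server.MapPath(relPath)
            Dim writeTime As DateTime = File.GetLastWriteTimeUtc(physical)
            If writeTime > newest Then newest = writeTime
        Next
        Return newest
    End Function
End Class

For the example above you would call it with "~/default.aspx", "~/default.aspx.vb", "~/MasterPage.master", "~/MasterPage.master.vb" and "~/web.sitemap".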
Calculating ETag response header.
The second key of the dialogue is the ETag value. It is simply a hash of the final contents being served. If you have any way (with a low CPU footprint) of calculating a hash of textual input, it can be used. In our implementation we used CRC32, but any other will work the same way. We calculate the ETag value by taking a CRC32 checksum of every dependent content plus the last-modification dates of those dependencies. In our simplest case, that is the concatenation of all these strings:
- ~/default.aspx last write time
- ~/default.aspx.vb last write time (not likely, but optionally necessary)
- ~/MasterPage.master last write time
- ~/MasterPage.master.vb last write time (Optionally)
- ~/web.sitemap last write time
- ~/default.aspx contents
- ~/default.aspx.vb contents (Optionally, but not likely, to speed up calculations)
- ~/MasterPage.master contents
- ~/MasterPage.master.vb contents (Optionally)
- ~/web.sitemap contents
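The sample project uses CRC32, which is not part of the framework; purely to keep this illustration self-contained, here is the same idea using MD5 from System.Security.Cryptography over the concatenated contents and last-write times (a substitution, not the article's actual code):

Imports System.Text
Imports System.Security.Cryptography

Public Class ETagHelper
    ' Hashes the concatenation of whatever strings make up the page: dependency contents plus write times.
    Public Shared Function ComputeETag(ByVal ParamArray parts() As String) As String
        Dim data As Byte() = Encoding.UTF8.GetBytes(String.Concat(parts))
        Using hasher As MD5 = MD5.Create()
            ' ETags are opaque quoted strings, so wrap the hex digest in quotes.
            Return """" & BitConverter.ToString(hasher.ComputeHash(data)).Replace("-", "") & """"
        End Using
    End Function
End Class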
It might seem too much burden, too much CPU usage but, as with everything, it really depends on the website:
| | High CPU usage | Low CPU usage |
| High volume | This scenario might not cope with the extra CPU needed. See Note* below. | You can safely spend CPU cycles in order to save some bandwidth. Implementing conditional GETs is a must. |
| Low volume | What kind of web server is it? Definitely not a public web server as we know them. | Implementing conditional GETs will give your website the impression of being served faster. |

Note*: Consider this question: is your CPU usage so high partly because the same contents are requested again and again by the same users? If the answer is yes (or maybe), the extra CPU spent on allowing client-side caching and conditional GETs will, globally viewed, lower your overall CPU usage and also the bandwidth being used. Give the idea a try and decide for yourself afterwards.
Returning them in the response.
Once we have calculated both the Last-Modified and ETag values, we need to return them with the response of the page. This is done using the following lines of code:

Response.Cache.SetLastModified(LastModifiedValue.ToUniversalTime)
Response.Cache.SetETag(ETagValue)
Looking for the values in request’s headers.
Now that our pages’ responses are generated with Last-Modified and ETag headers, we need to check for those values in the requests too. The names of those headers, when sent back by the browser in a request, differ from the response header names:

Response header name | Request header name
Last-Modified | If-Modified-Since
ETag | If-None-Match

The logic for deciding whether we should return 200 OK or 304 Not Modified is as follows:
- If both values (If-Modified-Since & If-None-Match) were provided in the request and both match, return 304 and no content (0 bytes)
- If any of them do NOT match, return 200 and the complete page
- If only one of them was specified (If-Modified-Since or If-None-Match), it decides.
- If none were provided, always return 200 and the complete page.
When the conditions for 304 are met, we short-circuit the response and send no content:

Response.Clear()
Response.StatusCode = CInt(System.Net.HttpStatusCode.NotModified)
Response.SuppressContent = True
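Putting the pieces together, the check itself can run early in the page life cycle (Page_Load, or an HttpModule). The following sketch assumes etagValue and lastModifiedUtc are the values computed as described above; the date handling is deliberately simplified:

Dim ifNoneMatch As String = Request.Headers("If-None-Match")
Dim ifModifiedSince As String = Request.Headers("If-Modified-Since")

Dim etagMatches As Boolean = (ifNoneMatch IsNot Nothing AndAlso ifNoneMatch = etagValue)
Dim dateMatches As Boolean = False
Dim since As DateTime
If ifModifiedSince IsNot Nothing AndAlso DateTime.TryParse(ifModifiedSince, since) Then
    ' HTTP dates have one-second resolution, so compare against the truncated timestamp.
    Dim truncated As DateTime = lastModifiedUtc.AddTicks(-(lastModifiedUtc.Ticks Mod TimeSpan.TicksPerSecond))
    dateMatches = (since.ToUniversalTime() >= truncated)
End If

Dim notModified As Boolean
If ifNoneMatch IsNot Nothing AndAlso ifModifiedSince IsNot Nothing Then
    notModified = etagMatches AndAlso dateMatches    ' both provided: both must match
ElseIf ifNoneMatch IsNot Nothing Then
    notModified = etagMatches                        ' only the ETag decides
ElseIf ifModifiedSince IsNot Nothing Then
    notModified = dateMatches                        ' only the date decides
Else
    notModified = False                              ' no conditional headers: full 200 OK
End If

If notModified Then
    Response.Clear()
    Response.StatusCode = CInt(System.Net.HttpStatusCode.NotModified)
    Response.SuppressContent = True
End If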
Performance Tuning: VIEWSTATE size minimization
According to Microsoft, in the .NET Framework Class Library documentation for the Control.ViewState property:
A server control's view state is the accumulation of all its property values. In order to preserve these values across HTTP requests, ASP.NET server controls use this property […]

That means that the bigger the contents of the control, the bigger its ViewState property must be.
What is it used for? When server technologies such as ASP, ASP.NET, PHP, and so on are used, a high-level and powerful language runs on the server side. These languages have advanced server controls (such as grids, treeviews, etc.) and can do validations of any kind (against the database, etc.). The end goal of this high-level language is transforming the ‘server page’ into a final page that a browser can understand (HTML + JavaScript). If, on the one hand, you have server controls that are rendered into HTML when they are output to the browser, what happens when the user does a postback/callback and sends the page back to the server? Here is where the ViewState plays its role, helping to recreate the page objects at the server, in the OOP sense (<asp:TextBox ID=...), based on the HTML controls (<input name="...).
Wouldn’t it be easier to forget about all this and handle it the traditional way? For recreating simple controls such as a text box or a select box, it could be feasible to fetch the values right from the HTML, without using the ViewState, but imagine trying to recreate a GridView from the HTML alone, having to strip out the formatting. Besides, without the ViewState we could not send certain events to the server, such as a change of selection in a DropDownList (the previously selected element is saved in the ViewState).
Ok, we will need the ViewState after all, but is there any way of minimizing it? Yes. As Microsoft states:
View state is enabled for all server controls by default, but there are circumstances in which you will want to disable it. For more information, see Performance Overview.

If you read Performance Overview, you will be advised to:
Save server control view state only when it is required. View state enables server controls to repopulate property values on a round trip without requiring you to write code […]

Take that into consideration when writing your master pages, since most of the controls in a master page are static (or at most written once by the server) and probably not needed again in case of a postback or callback (unless you have, for instance, a DropDownList for changing the language of the site placed in the master page).
When can we disable the view state? Basically, when we use data that will be read-only, or that will not be needed again by the server in case of a postback/callback, for instance a grid that does not have associated events for sorting or selection at the server.
There are several ways for controlling the viewstate:
- On a page-by-page basis: If you have a particular page in which you know you will not need the viewstate, you can disable it completely in the page declaration:
<%@ Page Title="Home" Language="C#" MasterPageFile="~/MasterPage.master" AutoEventWireup="true" CodeFile="default.aspx.cs" Inherits="_default" EnableViewState="false" %>
However, doing so might render your masterpage controls (if any) unusable for that particular page. For instance, if you have a DropDownList control in your masterpage for changing/selecting the language of the website and you disable the viewstate for several individual pages of your site, the language selector will stop working on those pages.
- In the master page declaration: In a similar way as for a single page, you can also do it in the masterpage. The result will be that, unless you override this option for a single page (explicitly declaring that page as having it), no page using a master page declared this way will have ViewState (and if it does, it will not contain any info about controls from the masterpage):
<%@ Master Language="C#" AutoEventWireup="true" CodeFile="MasterPage.master.cs" Inherits="MasterPage" EnableViewState="false" %>
- On a control-by-control basis: A more flexible (due to its granularity) way of controlling the view state is enabling/disabling it control by control:
<asp:TextBox ID="TextBox1" runat="server" EnableViewState="false"></asp:TextBox>
This will probably be the easiest method and the one that interferes least with the rest of the website; besides, its effects (on the size of the viewstate and on functionality) can be easily checked and easily reverted if something does not work.
One of the ASP.NET controls that makes the View State grow heavily is the asp:Menu control. Moving it out of the masterpage and placing it in a separate, client-side-cacheable file can work wonders. However, if you do not implement that suggestion, you can at least disable the view state for the menu control. The menu control will still be rendered within every page, but the size of the View State will be significantly smaller without further effort. For one of our customers, simply adding EnableViewState="false" to the menu control definition reduced the size of their homepage from 150Kb to around 109Kb. Since the menu was in the masterpage, the reduction was similar for all the pages on their site.
Friday, July 23, 2010
How to improve your YSlow score under IIS7
On my last two projects, I’ve been involved in various bits of performance tweaking and testing. One of the common tools for checking that your web pages are optimised for delivery to the user is YSlow, which rates your pages against the Yahoo! best practices guide.
Since I’ve now done this twice, I thought it would be useful to document what I know about the subject. The examples I’ve given are from my current project, which is built using ASP.NET MVC, the Spark View Engine and hosted on IIS7.
Make Fewer Http Requests
There are two sides to this. Firstly and most obviously, the lighter your pages the faster the user can download them. This is something that has to be addressed at an early stage in the design process – you need to ensure you balance up the need for a compelling and rich user experience against the requirements for fast download and page render times. Having well defined non-functional requirements up front, and ensuring that page designs are technically reviewed with those SLAs in mind is an important part of this. As I have learned from painful experience, having a conversation about slow page download times after you’ve implemented the fantastic designs that the client loved is not nice.
The second part of this is that all but the sparsest of pages can still be optimised by reducing the number of individual files downloaded. You can do this by combining your CSS, Javascript and image files into single files. CSS and JS combining and minification is pretty straightforward these days, especially with things like the Yahoo! UI Library: YUI Compressor for .NET. On my current project, we use the MSBuild task from that library to combine, minify and obfuscate JS and CSS as part of our release builds. To control which files are referenced in our views, we add the list of files to our base ViewModel class (from which all the other page ViewModels inherit) using conditional compilation statements.
private static void ConfigureJavaScriptIncludes(PageViewModel viewModel)
{
#if DEBUG
    viewModel.JavaScriptFiles.Add("Libraries/jquery-1.3.2.min.js");
    viewModel.JavaScriptFiles.Add("Libraries/jquery.query.js");
    viewModel.JavaScriptFiles.Add("Libraries/jquery.validate.js");
    viewModel.JavaScriptFiles.Add("Libraries/xVal.jquery.validate.js");
    viewModel.JavaScriptFiles.Add("resources.js");
    viewModel.JavaScriptFiles.Add("infoFlyout.js");
    viewModel.JavaScriptFiles.Add("bespoke.js");
#else
    viewModel.JavaScriptFiles.Add("js-min.js");
#endif
}

private static void ConfigureStyleSheetIncludes(PageViewModel viewModel)
{
#if DEBUG
    viewModel.CssFile = "styles.css";
#else
    viewModel.CssFile = "css-min.css";
#endif
}
Use a Content Delivery Network
This isn’t something you can really address through configuration. In case you’re not aware, the idea is to farm off the hosting of static files onto a network of geographically distributed servers, and ensure that each user receives that content from the server closest to them. There are a number of CDN providers around – Akamai, Amazon CloudFront and Highwinds, to name but a few.
The biggest problem I’ve encountered with the use of CDNs is for sites which have secure areas. If you want to avoid annoying browser popups telling you that your secure page contains insecure content, you need to ensure that your CDN has both a http and https URLs, and that you have a good way of switching between them as needed. This is a real pain for things like background images in CSS and I haven’t found a good solution for it yet.
On the plus side, I was pleased to see that Spark provides support for CDNs in the form of its resource-mapping configuration section, which allows you to define a path to match against and a full URL to substitute – so, for example, you could tell it to map all files referenced in the path “~/content/css/” to “http://yourcdnprovider.com/youraccount/allstyles/css/”. Pretty cool – if Louis could extend that to address the secure/insecure path issue, it would be perfect.
Add an Expires or Cache-Control Header
The best practice recommendation is to set far future expiry headers on all static content (JS/CSS/images/etc). This means that once the files are pushed to production you can never change them, because users may have already downloaded and cached an old version. Instead, you change the content by creating new files and referencing those instead (read more here.) This is fine for most of our resources, but sometimes you will have imagery which has to follow a set naming convention – this is the case for my current project, where the product imagery is named in a specific way. In this scenario, you just have to make sure that the different types of images are located in separate paths so you can configure them independently, and then choose an appropriate expiration policy based on how often you think the files will change.
To set the headers, all you need is a web.config in the relevant folder. You can either create this manually, or using the IIS Manager, which will create the file for you. To do it using IIS Manager, select the folder you need to set the headers for, choose the “HTTP Response Headers” option and then click the “Set Common Headers” option in the right hand menu. I personally favour including the web.config in my project and in source control as then it gets deployed with the rest of the project (a great improvement over IIS6).
Here’s the one we use for the Content folder.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <staticContent>
      <clientCache cacheControlMode="UseExpires"
                   httpExpires="Sat, 31 Dec 2050 00:00:00 GMT" />
    </staticContent>
  </system.webServer>
</configuration>
And here’s what we use for the product images folder.
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <staticContent>
      <clientCache cacheControlMode="UseMaxAge"
                   cacheControlMaxAge="7.00:00:00" />
    </staticContent>
  </system.webServer>
</configuration>
Our HTML pages all have max-age=0 in the cache-control header, and have the expires header set to the same as the Date and Last-Modified headers, as we don’t want these cached – it’s an ecommerce website and we have user-specific information (e.g. basket summary, recently viewed items) on each page.
Compress components with Gzip
IIS7 does this for you out of the box, but it’s not always 100% clear how to configure it, so here are the details.
Static compression is for files that are identical every time they are requested, e.g. Javascript and CSS. Dynamic compression is for files that differ per request, like the HTML pages generated by our app. The content types that actually fall under each heading are controlled in the IIS7 metabase, which resides in the file C:\Windows\System32\inetsrv\config\applicationHost.config. In that file, assuming you have both static and dynamic compression features enabled, you’ll be able to find the following section:
<httpCompression directory="%SystemDrive%\inetpub\temp\IIS Temporary Compressed Files">
  <scheme name="gzip" dll="%Windir%\system32\inetsrv\gzip.dll" />
  <dynamicTypes>
    <add mimeType="text/*" enabled="true" />
    <add mimeType="message/*" enabled="true" />
    <add mimeType="application/x-javascript" enabled="true" />
    <add mimeType="*/*" enabled="false" />
  </dynamicTypes>
  <staticTypes>
    <add mimeType="text/*" enabled="true" />
    <add mimeType="message/*" enabled="true" />
    <add mimeType="application/javascript" enabled="true" />
    <add mimeType="*/*" enabled="false" />
  </staticTypes>
</httpCompression>
Item | MIME type
Views (i.e. the HTML pages) | text/html
CSS | text/css
JS | application/x-javascript
Images | image/gif, image/jpeg
One gotcha there is that IIS7 is configured out of the box to return a content type of application/x-javascript for .JS files, which results in them coming under the configuration for dynamic compression. This is not great – dynamic compression is intended for frequently changing content, so dynamically compressed files are not cached for future requests. Our JS files don’t change outside of deployments, so we really need them to be in the static compression category.
There’s a fair bit of discussion online as to what the correct MIME type for Javascript should be, with the three choices being:
- text/javascript
- application/x-javascript
- application/javascript
So you have two choices – you can change your config, either in web.config or applicationHost.config to map the .js extension to text/javascript or application/javascript, which are both already registered for static compression, or you can move application/x-javascript into the static configuration section in your web.config or applicationHost.config. I went with the latter option and modified applicationHost.config. I look forward to a comment from anyone who can explain whether that was the correct thing to do or not.
Finally, in the <system.webserver> element of your app’s main web.config, add this:
<urlCompression doStaticCompression="true"
                doDynamicCompression="true" />
<serverRuntime alternateHostName=""
               appConcurrentRequestLimit="5000"
               enabled="true"
               enableNagling="false"
               frequentHitThreshold="2"
               frequentHitTimePeriod="00:00:10"
               maxRequestEntityAllowed="4294967295"
               uploadReadAheadSize="49152" />
Out of the box, you’ll just have an empty <serverRuntime /> tag. The values I’ve shown above are the defaults, and the interesting ones are “frequentHitThreshold” and “frequentHitTimePeriod”. Essentially, a file is only a candidate for static compression if it’s a frequently hit file, and those two attributes control the definition of “frequently hit”. So the first time round, you won’t see the Content-Encoding: gzip header, but if you’re quick with the refresh button, it should appear. If you spend some time Googling, you’ll find some discussion around setting frequentHitThreshold to 1 – some say it’s a bad idea, some say do it. Personally, I decided to trust the people who built IIS7, since in all likelihood they have larger brains than me, and move on.
There are a couple of other interesting configuration values in this area – the MSDN page covers them in detail.
Minify Javascript and CSS
The MSBuild task I mentioned under “Make fewer HTTP requests” minifies the CSS and JS as well as combining the files.
Configure Entity tags (etags)
ETags are a bit of a pain, with a lot of confusion and discussion around them. If anyone has the full picture, I’d love to hear it but my current understanding is that when running IIS, the generated etags vary by server, and will also change if the app restarts. Since this effectively renders them useless, we’re removing them. Howard has already blogged this technique for removing unwanted response headers in IIS7. Some people have had luck with adding a new, blank response header called ETag in their web.config file, but that didn’t work for me.
Split Components Across Domains
The HTTP1.1 protocol states that browsers should not make more than two simultaneous connections to the same domain. Whilst the most recent browser versions (IE8, Chrome and Firefox 3+) ignore this, older browsers will normally still be tied to that figure. By creating separate subdomains for your images, JS and CSS you can get these browsers to download more of your content in parallel. It’s worth noting that using a CDN also helps here. Doing this will also help with the “Use Cookie Free Domains For Components” rule, since any domain cookies you use won’t be sent for requests to the subdomains/CDN. However, you will probably fall foul of the issue I mentioned earlier around mixing secure and insecure content on a single page.
A great tool for visualising how your browser is downloading content is Microsoft Visual Roundtrip Analyzer. Fire it up and request a page, and you will be bombarded with all sorts of information (too much to cover here) about how well the site performs for that request.
Optimise Images
Sounds obvious, doesn’t it? There are some good tools out there to help with this, such as Smush.it, now part of YSlow. If you have an image-heavy website, you can make a massive difference to overall page weight by doing this.
Make favicon.ico Small and Cacheable
If you don’t have a favicon.ico, you’re missing a trick – not least because the browser is going to ask for it anyway, so you’re better returning one than returning a 404 response. The recommendation is that it’s kept under 1Kb and that it has a fairly long expiration period.
WAX – Best practice by default
On my last project, I used the Aptimize Website Accelerator to handle the majority of the things I’ve mentioned above (my current project has a tight budget, so it’s not part of the solution for v1). It’s a fantastic tool and I’d strongly recommend anyone wishing to speed their site up gives it a try. I was interested to learn recently that Microsoft have started running it on sharepoint.microsoft.com, and are seeing a 40-60% reduction in page load times. It handles combination and minification of CSS, Javascript and images, as well as a number of other cool tricks such as inlining images into CSS files where the requesting browser supports it, setting expiration dates appropriately, and so on.