I know HTML well and I always author it with semantic elements like paragraphs. I don’t insert blank paragraphs between paragraphs for spacing, and why in the world would anyone wrap an image or a figure (the latter a block element) in a paragraph?
I recently came up with a partial solution that doesn’t require a plugin or editing functions.php.
It’s not perfect but it’s easy to use and it doesn’t require access to the codebase.
Without further ado I give you (drum roll please):
My Auto-Formatting Solution: class="wp"
Have a look at the code of this post.
You’ll notice that all the <p> elements have a class of wp like this: <p class="wp">.
I don’t use forced line breaks very often (<br>) but it works to preserve those as well.
Anyone who comes after me and works on a site where I have used this method, and isn’t aware of it, could easily be tricked, just like WordPress is, into thinking there’s some significance to this class; there isn’t.
As soon as a class is added, WordPress stops wantonly deleting my semantic HTML, presumably because the software assumes that a class means the element no longer has only its default properties and must have additional properties or styling associated with it.
I don’t really care why it works, and I don’t really care why WordPress strips paragraphs in the first place (and then adds them back programmatically when it renders the post‽).
What I care about is that I don’t end up with blank lines (i.e. newlines, \n) in between lines of text that aren’t wrapped in an appropriate element like a <p>.
I don’t know why; I’m not really a developer. I loathe error logs and debug modes.
On some sites, the <img> tags end up wrapped in perfect little <figure> tags and any captions show up wrapped in <figcaption> tags with no despicable wrapping paragraphs standing on every street corner proclaiming their right to be there.
No, I haven’t reported a bug because, as I said, it’s hit or miss. Sometimes it works like a charm and sometimes it fails miserably. Given the complexity of most sites I work on (think of all the variables: server software, PHP version, WordPress version, other plugins, performance enhancements [e.g. caching], proxies, human error), I don’t even try to figure it out. In my recent experience, it works correctly about two-thirds of the time.
PS Disable Auto-Formatting has been more reliable for me so it’s now my plugin of choice (it has taken the place of Don’t Muck My Markup in the script that automagically installs plugins for me).
The screenshot above is from Google Analytics (the last 6 months) for a site whose analytics data I can access but whose codebase and server I cannot (otherwise I’d have remedied the situation already). My significant other’s sister is the creator/founder/owner/president of Megan Lee Designs and, after learning of my vocation, granted me access to her analytics.
While 2.5% of traffic may not seem especially significant, it amounted to 16% (yep, roughly 1 in 6) of their referral traffic! 42 of 143 sources of referral traffic were actually this referral spammer garbage. In the screenshot I included every website that was reported as having sent more than 1 visitor; there are another 35 domains (well, subdomains) that list a single visit, and 34 of those are subdomains of semalt.com (grumble, grumble).
When I first really noticed the problem in my analytics reports (in mid-2014) I came up with a blacklist. I called it a hitlist because, as I told my colleague Charlie, “No workplace is complete without a hitlist ;-)”
In the spirit of open-source, I give you my current list of directives:
Be forewarned, I’m no back-end developer or SysAdmin, so while I can piece together enough RegEx to get simple things done, it is entirely possible that more efficient (and certainly more comprehensive) directives exist than the ones I’ve written.
My logic when I wrote them was simple:
I don’t care about protocols (http:// or https://)
I don’t care about subdomains (they often use lots of them)
I don’t care about case (hence the [NC] No Case [sensitivity] flag)
If the domain listed as a referrer contains that series of characters (that string), I want the RewriteRule to kill the request before it hits my site (and my analytics), hence [F,L]: the ‘forbidden’ flag (which returns a 403) and the ‘last’ flag on the rewrite.
For anyone who might be of the copy and paste skill level (I myself was until fairly recently), be mindful of the ‘OR’ part of those directives: [NC,OR]. An [OR] is needed on all but the last RewriteCond. Omitting it on an earlier condition means the conditions are ANDed together, so the rule will effectively never match, and adding it to the last condition can make the rule match every request and lock everyone (including you) out of the site.
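To make the flag placement concrete, here is a minimal sketch of what a two-entry version of such a list might look like. It is not my full list, just an illustration using the two domains named in this post, and it assumes an Apache server with mod_rewrite enabled and a .htaccess file you are allowed to edit.

```apache
# Illustrative sketch only (not the full hitlist). Assumes Apache with mod_rewrite.
<IfModule mod_rewrite.c>
    RewriteEngine On

    # The patterns are unanchored, so they match the string anywhere in the
    # Referer header: any protocol (http:// or https://) and any subdomain.
    # [NC] ignores case; [OR] chains this condition to the next one.
    RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]

    # Last condition in the chain: [NC] only, no [OR].
    RewriteCond %{HTTP_REFERER} buttons-for-website\.com [NC]

    # If any condition matched, serve nothing ("-") and answer 403 Forbidden.
    # [F] already implies [L]; the L is kept to mirror the flags described above.
    RewriteRule ^ - [F,L]
</IfModule>
```

To add another offender, insert a new RewriteCond line ending in [NC,OR] above the last condition, so that the final condition remains the only one without an [OR].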
A few months ago I addressed the most flagrant perpetrators—namely semalt.com—but suddenly I was getting visits from buttons-for-website.com.