Content scraping, or content theft, is when original content is taken from a blog and republished on another blog or site without permission. This is considered plagiarism or copyright infringement, but because the Web is still loosely regulated, scraping has become an extremely common problem. Although this can be done manually, it is normally done by automated software that makes it easy to scrape content from any blog.
Some people scrape to intentionally harm others, while others do not realize they are causing harm. When the original content is republished on other blogs or websites, it can cause duplicate content issues, especially if the copy outranks the original in the search engines (see my blog post on duplicate content).
I have experienced this problem myself in the last few months, and I have let it slide because I know the people who copied my posts meant no harm. Although I appreciate and encourage sharing of content, there are netiquette rules that should be followed (I will cover this topic in a different post).
Unfortunately, we cannot completely prevent this growing problem, but there are steps you can take to reduce it.
Simple Steps
- Put a copyright statement with every blog post. This can be done manually or automatically (I will share some plug-ins in the next post).
- Inspect your blog’s backlinks often. This can be done manually in the search engines or with a backlink checker.
- Set up keywords related to your blog posts in Google Alerts. This will alert you to similar posts and may help you monitor scraping.
- Use absolute links when you interlink blog posts or pages on your site (e.g., <a href="http://www.domain.com/pagename.html"> rather than <a href="pagename.html">). If a scraper copies the post, those links will still point back to your site.
- Place a link to your blog in the footer of each post. This can also be done manually or automatically if you have a WordPress blog.
- Link to your blog posts within your RSS feeds.
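If you would rather not fix every internal link by hand, the absolute-link step can be sketched in a few lines of Python. This is only a rough illustration, not a finished tool: the function name make_links_absolute and the base URL http://www.domain.com/ are placeholders I made up for the example, and the simple pattern matching only handles href values wrapped in quotes.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...' (but not attributes such as data-href).
# Group 1 is the quote character, group 2 is the link target.
HREF_PATTERN = re.compile(r'(?<![\w-])href=(["\'])(.*?)\1', re.IGNORECASE)


def make_links_absolute(html, base_url):
    """Rewrite relative href values in `html` so they start from `base_url`.

    Links that are already absolute (http://..., mailto:...) are left alone,
    because urljoin returns them unchanged.
    """
    def replace(match):
        quote, target = match.group(1), match.group(2)
        return f"href={quote}{urljoin(base_url, target)}{quote}"

    return HREF_PATTERN.sub(replace, html)


if __name__ == "__main__":
    # Hypothetical post body and blog address, for illustration only.
    post = '<p>See <a href="pagename.html">my other post</a>.</p>'
    print(make_links_absolute(post, "http://www.domain.com/"))
```

You would run something like this over a post's HTML before publishing it, so that any copy of the post carries links that lead back to your own site.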
Advanced Steps
If you are not a techie, you can ask an experienced webmaster to perform the following steps for you.
- Utilize cloaking to serve automated programs different source code from what visitors see on the blog, making your content harder to scrape. This is normally thought of as a black-hat SEO technique, so it should be used sparingly and with caution (only use it as a means to protect your content if you have a serious theft problem).
- Make use of IP blocking to prevent future content theft. First find the IP addresses of the scrapers taking your content without permission (for example, in your server’s access logs), then block them in your site’s .htaccess file.
- Use captchas to block automated scraping. Captchas are randomly generated strings of letters and numbers displayed as an image. Some scrapers have found ways around captchas, and captchas will not stop someone who copies your content manually.
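To give you or your webmaster a starting point for the IP blocking step, here is a small Python sketch that turns a list of addresses into rules you could paste into an .htaccess file. Treat it as an illustration under some assumptions: the function name build_htaccess_block is made up, the addresses come from ranges reserved for documentation, and the output uses the "Require" syntax of Apache 2.4, so an older server or a host that restricts .htaccess overrides would need different rules.

```python
import ipaddress


def build_htaccess_block(addresses):
    """Return Apache 2.4 .htaccess rules that deny the given IPs or CIDR ranges.

    Every entry is validated with the standard ipaddress module, so a typo
    raises ValueError instead of ending up in the server configuration.
    """
    lines = ["<RequireAll>", "    Require all granted"]
    for raw in addresses:
        network = ipaddress.ip_network(raw.strip(), strict=False)
        if network.num_addresses == 1:
            # Single address: write it without the /32 or /128 suffix.
            target = str(network.network_address)
        else:
            target = str(network)
        lines.append(f"    Require not ip {target}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example addresses only; replace them with the IPs found in your own logs.
    print(build_htaccess_block(["203.0.113.5", "198.51.100.0/24"]))
```

Generating the rules from a list keeps the block easy to update as new scrapers appear. Keep in mind that scrapers can switch IP addresses, so this reduces the problem rather than ending it.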
If you use the above methods and still find that your blog posts have been scraped, I recommend leaving a comment on the copy with a link to your original post, to alert both the scraper and that blog’s readers.
Look for my next post on WordPress plug-ins that will help you streamline your efforts.
If you’ve caught content thieves in the past, share your story in the comments below.