Fetching Images from Blog Posts with C#

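When you scrape the content of a web article, the text can be used directly, but the images should not be. Their URLs still point at the site you scraped, so linking to them from your own site is inappropriate. Instead, download the images in the article and store them on your own image server.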

Today I'll share how to use C# to fetch the images in an article and save them on your own server.

Steps for fetching article images

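1. Use a regular expression to find the image URLs in the page's HTML.
2. Download each image found to your own server, and record the mapping between its original URL and its new storage path.
3. Using that mapping, replace each image URL in the scraped HTML with the corresponding image on your own server.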

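Below is a skeleton method. Given the HTML, it finds all images, saves them to your own server, and returns a mapping between each original image URL and its new location.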

public IList<CrawImageDto> CrawAndSaveImage(string html)
{
    IList<CrawImageDto> crawImageList = new List<CrawImageDto>();

    // Step 1: collect every image URL found in the HTML
    var images = HtmlImageHelper.GetHtmlImageUrlList(html);
    if (images != null)
    {
        foreach (var imgUrl in images)
        {
            // Step 2: download the image and remember where it was saved
            var imgSavePath = HtmlImageHelper.Get_img(imgUrl);
            int width;
            int height;
            ImageHandler.ReadWH(imgSavePath, out width, out height);

            // Record the original URL alongside the new location and size
            crawImageList.Add(new CrawImageDto
            {
                OriginalUrl = imgUrl,
                CrawlSaveUrl = imgSavePath,
                Width = width,
                Height = height
            });
        }
    }
    return crawImageList;
}

CrawImageDto is the object that carries the returned results.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace SCMS.Application.Spider.Dto
{
    public class CrawImageDto
    {
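        /// <summary>
        /// Original image URL
        /// </summary>
        public string OriginalUrl { get; set; }

        /// <summary>
        /// URL where the image was saved after fetching
        /// </summary>
        public string CrawlSaveUrl { get; set; }

        /// <summary>
        /// Image width
        /// </summary>
        public int Width { get; set; }

        /// <summary>
        /// Image height
        /// </summary>
        public int Height { get; set; }

        /// <summary>
        /// How far the image's aspect ratio is from 250*150; useful for choosing a featured image
        /// </summary>
        public double Deviation
        {
            get
            {
                // An image whose height could not be read can never be the best match
                if (Height == 0) return double.MaxValue;
                // Cast to double so that Width / Height is not truncated by integer division
                return Math.Abs(250 / 150.0 - (double)Width / Height);
            }
        }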
    }
}

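The Deviation property above measures how far an image's aspect ratio is from a ratio you define (250*150 here). It can serve as the criterion for picking a featured image: the smaller the value, the better the image suits that role. How you choose a featured image is up to you, and you may not need one at all.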
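As a minimal sketch of how Deviation might be used, the snippet below picks the image whose aspect ratio is closest to 250*150. `ImageInfo` is a simplified stand-in for CrawImageDto, and `FeaturedImagePicker` and its `Pick` method are hypothetical names introduced only for this illustration. The stand-in computes the ratio with floating-point division and treats a height of 0 as the worst possible match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for CrawImageDto (assumption: only these members matter here).
public class ImageInfo
{
    public string CrawlSaveUrl { get; set; }
    public int Width { get; set; }
    public int Height { get; set; }

    // Distance between the image's aspect ratio and the 250:150 target ratio.
    public double Deviation
    {
        get
        {
            if (Height == 0) return double.MaxValue;
            return Math.Abs(250 / 150.0 - (double)Width / Height);
        }
    }
}

// Hypothetical helper, not part of the original article.
public static class FeaturedImagePicker
{
    // Returns the image with the smallest Deviation, or null when the list is empty.
    public static ImageInfo Pick(IEnumerable<ImageInfo> images)
    {
        return images.OrderBy(i => i.Deviation).FirstOrDefault();
    }
}
```

Calling `FeaturedImagePicker.Pick(list)` on a list that contains a 500*300 image, a 400*400 image, and an 800*200 image returns the 500*300 one, since its ratio matches the target exactly.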

The code for the HtmlImageHelper class is as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace SCMS.Common.SpiderHelper
{
    public static class HtmlImageHelper
    {

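        // Directory where downloaded images are stored
        private static string Path = System.IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "img");

        /// <summary>
        /// Downloads the image at the given URL and returns the local path it was saved to.
        /// </summary>
        public static string Get_img(string imgpath)
        {
            // Drop the query string so that only the file name remains
            string[] file = imgpath.Split('?');
            string name = System.IO.Path.GetFileName(file[0]);
            string savePath = System.IO.Path.Combine(Path, name);

            // Make sure the target directory exists before downloading
            System.IO.Directory.CreateDirectory(Path);

            // WebClient is IDisposable, so release it once the download finishes
            using (WebClient mywebclient = new WebClient())
            {
                mywebclient.DownloadFile(imgpath, savePath);
            }
            return savePath;
        }

        /// <summary>
        /// Gets the URLs of all images in the HTML.
        /// </summary>
        /// <param name="sHtmlText">HTML source</param>
        /// <returns>Array of image URLs</returns>
        public static string[] GetHtmlImageUrlList(string sHtmlText)
        {
            // Regular expression that matches img tags and captures the src value
            Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);

            // Find every match in the HTML
            MatchCollection matches = regImg.Matches(sHtmlText);
            int i = 0;
            string[] sUrlList = new string[matches.Count];

            // Copy the captured URL of each match into the result array
            foreach (Match match in matches)
                sUrlList[i++] = match.Groups["imgUrl"].Value;
            return sUrlList;
        }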
    }
}

This is the core class for fetching images. It downloads an image from a URL and extracts the URLs of all images in an article.
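Scraped pages often contain relative src values such as /uploads/a.png, which cannot be downloaded as they stand. The following is a rough sketch, not the author's method, of extracting image URLs and resolving them against the page address. `ImageUrlExtractor`, `GetAbsoluteImageUrls`, and the `pageUrl` parameter are hypothetical names, and the pattern is a simplified variant of the one used in HtmlImageHelper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for this illustration.
public static class ImageUrlExtractor
{
    // Simplified img/src pattern; the named group "imgUrl" captures the src value.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^<>]*?\bsrc\s*=\s*[""']?\s*(?<imgUrl>[^\s""'<>]*)[^<>]*>",
        RegexOptions.IgnoreCase);

    // Returns absolute http/https image URLs in document order, without duplicates.
    public static List<string> GetAbsoluteImageUrls(string html, string pageUrl)
    {
        var baseUri = new Uri(pageUrl);
        var seen = new HashSet<string>();
        var result = new List<string>();

        foreach (Match match in ImgSrc.Matches(html))
        {
            string src = match.Groups["imgUrl"].Value;
            if (src.Length == 0) continue;

            // Resolve relative paths against the page URL; absolute URLs pass through.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, src, out absolute)) continue;

            // Skip data: URIs and other schemes that cannot be downloaded over HTTP.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps) continue;

            // HashSet.Add returns false for a URL that was already collected.
            if (seen.Add(absolute.AbsoluteUri)) result.Add(absolute.AbsoluteUri);
        }
        return result;
    }
}
```

Removing duplicates here means an image that appears several times in a post is downloaded only once.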

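The code for the ImageHandler class is below. It reads image properties such as width and height. If you don't need those properties, you can ignore the related code entirely.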

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace SCMS.Common.FileManager
{
    public static class ImageHandler
    {
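        /// <summary>
        /// Reads the width and height of the image stored at the given local path.
        /// </summary>
        public static void ReadWH(string localPath, out int width, out int height)
        {
            using (FileStream fs = new FileStream(localPath, FileMode.Open, FileAccess.Read))
            // Image holds unmanaged resources, so dispose of it as well as the stream
            using (System.Drawing.Image image = System.Drawing.Image.FromStream(fs))
            {
                width = image.Width;
                height = image.Height;
            }
        }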
    }
}

Once you have the List of CrawImageDto, you can loop over it and replace the image paths in the original HTML.
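The replacement step has no code in this article, so here is one possible sketch of it. `HtmlImageReplacer` and `ReplaceImageUrls` are hypothetical names, and the method takes a plain dictionary from original URL to new URL so that it stays self-contained. It also assumes the new values are public URLs on your server; the paths returned by Get_img are local file paths and would need converting first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, introduced only for this illustration.
public static class HtmlImageReplacer
{
    // Replaces every occurrence of each original URL with its new URL.
    public static string ReplaceImageUrls(string html, IDictionary<string, string> urlMap)
    {
        // Handle longer URLs first so that a URL which is a prefix of another
        // (a.png versus a.png?v=2) does not corrupt the longer one.
        foreach (var pair in urlMap.OrderByDescending(p => p.Key.Length))
        {
            if (string.IsNullOrEmpty(pair.Key)) continue;
            html = html.Replace(pair.Key, pair.Value);
        }
        return html;
    }
}
```

With the DTO list from CrawAndSaveImage, the dictionary could be built by grouping on OriginalUrl and mapping each one to the public URL that corresponds to CrawlSaveUrl. Plain string replacement is used on purpose: it changes the URL wherever it appears, which is usually what you want for scraped content, but it does not distinguish src attributes from other text.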

That is the complete process and code for fetching the images in an article with C#.

Original article by 知道91. If you repost it, please credit the source: http://zhidao91.com/csharp-crawl-post-images
