Java自学者论坛

HttpHelper class in both Java and C# (fixing garbled text when scraping web pages)

    Posted on 2021-4-14 09:39:52

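    Projects that scrape web pages frequently run into garbled text (mojibake). The quickest fix is to check which encoding the target site uses and decode with it. Checking every page by hand does not scale, though, especially when you scrape many pages from different sites, as a web crawler does. In that case you need a reasonably general routine that identifies the page encoding correctly.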

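    Garbled text almost always comes from an encoding mismatch: if a page is encoded as UTF-8 and you read it as GB2312, the output is bound to be garbled. Once that is clear, what remains is working out the page's encoding. Chinese pages mostly use GBK, GB2312, UTF-8, or BIG-5. Simplified, that comes down to GBK and UTF-8: most sites today use one or the other, so the question becomes whether a given site is UTF-8 or GBK.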
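
    To make the mismatch concrete, here is a minimal sketch that encodes a string as UTF-8 and then decodes the same bytes once with UTF-8 and once with GBK. The class and method names (MojibakeDemo, recode) are made up for this illustration, and it assumes the JDK in use ships the GBK charset, which standard desktop and server JDKs normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the HttpUtil code in this post.
final class MojibakeDemo {
    // Encodes text with one charset and decodes the resulting bytes with another.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Unicode escapes keep the example independent of the source file encoding.
        String original = "\u4e2d\u6587\u7f51\u9875";

        // Same charset on both sides: the text survives the round trip.
        System.out.println(recode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(original)); // true

        // UTF-8 bytes read as GBK: the bytes are regrouped into different characters.
        System.out.println(recode(original, StandardCharsets.UTF_8, gbk).equals(original)); // false
    }
}
```

    The bytes on the wire are identical in both calls; only the charset used to interpret them changes, which is why detecting the right charset before decoding is the whole problem.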

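    So how do you determine a page's encoding? First check the Content-Type response header. If it carries no charset, look for the meta tag in the page itself. If that fails too, fall back to a default encoding; I recommend UTF-8. For example, a request to the cnblogs home page http://www.cnblogs.com/ returns Content-Type: text/html; charset=utf-8 in the response headers, which tells us cnblogs uses utf-8. Not every site puts the encoding in Content-Type, though. Baidu, for instance, sends Content-Type: text/html with no charset, so you have to look in the page for <meta http-equiv=Content-Type content="text/html;charset=utf-8"> to confirm the final encoding. In summary, the steps are: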

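    1. Look for charset in the Content-Type response header. If it is there, skip steps 2 and 3 and go straight to step 4.
    2. If step 1 yields no charset, read the page content first and parse the charset from the meta tag.
    3. If step 2 still yields no encoding, fall back to UTF-8 as the default.
    4. Decode the response stream again using the charset obtained.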
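
    The four steps can be sketched on their own, without any networking, as a function from a Content-Type value and the buffered body bytes to a decoded string. This is an illustrative sketch only: CharsetSniffer and its method names are invented here, the regular expression is a slightly more tolerant variant of the one in the full listing further down, and like that listing it matches the first charset=... it finds anywhere in the body, not only inside a meta tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the detection order; not the post's HttpUtil.
final class CharsetSniffer {
    // Matches charset=utf-8, charset="gbk", charset='gbk', case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset value found in the text, or null if there is none.
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Steps 1 to 3: response header first, then the markup, then the UTF-8 default.
    static Charset detect(String contentType, byte[] body) {
        String name = findCharset(contentType);
        if (name == null) {
            // ISO-8859-1 maps every byte to exactly one char, so ASCII markup
            // such as the meta tag is readable whatever the real encoding is.
            name = findCharset(new String(body, StandardCharsets.ISO_8859_1));
        }
        if (name == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name declared by the page.
            return StandardCharsets.UTF_8;
        }
    }

    // Step 4: decode the buffered bytes with whatever charset was detected.
    static String decode(String contentType, byte[] body) {
        return new String(body, detect(contentType, body));
    }
}
```

    A caller would pass the value of the Content-Type header (possibly null) and the fully buffered response body, for example CharsetSniffer.decode(conn.getHeaderField("Content-Type"), bytes).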

    This approach decodes the vast majority of pages correctly. For the few it cannot identify, you will have to check the encoding by hand.

    Notes:

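    1. Almost all sites now support gzip compression, so add Accept-Encoding: gzip,deflate to the request headers. The site then returns a compressed stream, which noticeably speeds up requests.
    2. A network stream does not support seeking, so it can only be read once. For efficiency, read the HTTP response stream into memory first so it can be decoded a second time; there is no need to send the request again to get a fresh stream.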
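
    Both notes can be illustrated together: unwrap the stream if the response says it is gzip-compressed, then drain it into a byte array that can be decoded as many times as needed. The sketch below is an assumption-laden illustration, not the post's code: BodyBuffer and its methods are invented names, IOException is wrapped in UncheckedIOException only to keep call sites short, and the gzip helper exists purely so the example can be exercised without a network.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; names are illustrative only.
final class BodyBuffer {
    // Reads the stream to the end and returns everything as one byte array.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) { // -1 signals end of stream
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Wraps the raw stream in a gzip decoder when Content-Encoding mentions gzip,
    // then buffers the whole body in memory and closes the stream.
    static byte[] readBody(InputStream raw, String contentEncoding) {
        boolean gzip = contentEncoding != null
                && contentEncoding.toLowerCase(Locale.ROOT).contains("gzip");
        try (InputStream in = gzip ? new GZIPInputStream(raw) : raw) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for trying the example offline: gzip-compresses bytes in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

    Once the body is a byte array, a first pass can look for the meta charset and a second pass can decode with the charset that was found, all without touching the network again.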

    Java and C# implementations follow. Git links to the source are given at the bottom of the page for anyone who needs them.

    Java implementation

    package com.cnblogs.lzrabbit.util;
    
    import java.io.*;
    import java.net.*;
    import java.util.*;
    import java.util.Map.Entry;
    import java.util.regex.*;
    import java.util.zip.*;
    
    public class HttpUtil {
    
        public static String sendGet(String url) throws Exception {
            return send(url, "GET", null, null);
        }
    
        public static String sendPost(String url, String param) throws Exception {
            return send(url, "POST", param, null);
        }
    
        public static String send(String url, String method, String param, Map<String, String> headers) throws Exception {
            String result = null;
            HttpURLConnection conn = getConnection(url, method, param, headers);
            String charset = conn.getHeaderField("Content-Type");
            charset = detectCharset(charset);
            InputStream input = getInputStream(conn);
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            int count;
            byte[] buffer = new byte[4096];
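            // read() returns -1 at end of stream
            while ((count = input.read(buffer, 0, buffer.length)) != -1) {
                output.write(buffer, 0, count);
            }
            input.close();
            // If the response header already supplied a charset, skip the search in the HTML
            if (charset == null || charset.equals("")) {
                // ISO-8859-1 maps each byte to one char, so the ASCII meta tag survives on any platform
                charset = detectCharset(output.toString("ISO-8859-1"));
                // If the HTML has no charset either, default to utf-8
                if (charset == null || charset.equals("")) {
                    charset = "utf-8";
                }
            }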
            
            result = output.toString(charset);
            output.close();
    
            return result;
        }
    
        private static String detectCharset(String input) {
            Pattern pattern = Pattern.compile("charset=\"?([\\w\\d-]+)\"?;?", Pattern.CASE_INSENSITIVE);
            if (input != null && !input.equals("")) {
                Matcher matcher = pattern.matcher(input);
                if (matcher.find()) {
                    return matcher.group(1);
                }
            }
            return null;
        }
    
        private static InputStream getInputStream(HttpURLConnection conn) throws Exception {
            String contentEncoding = conn.getHeaderField("Content-Encoding");
            if (contentEncoding != null) {
                contentEncoding = contentEncoding.toLowerCase();
                // indexOf returns -1 (not 1) when the token is absent
                if (contentEncoding.indexOf("gzip") != -1)
                    return new GZIPInputStream(conn.getInputStream());
                // A deflate response must be inflated; DeflaterInputStream would compress it instead
                else if (contentEncoding.indexOf("deflate") != -1)
                    return new InflaterInputStream(conn.getInputStream());
            }
    
            return conn.getInputStream();
        }
    
        static HttpURLConnection getConnection(String url, String method, String param, Map<String, String> header) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) (new URL(url)).openConnection();
            conn.setRequestMethod(method);
    
            // Set the common request headers
            conn.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            conn.setRequestProperty("Connection", "keep-alive");
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36");
            conn.setRequestProperty("Accept-Encoding", "gzip,deflate");
    
            String ContentEncoding = null;
            if (header != null) {
                for (Entry<String, String> entry : header.entrySet()) {
                    if (entry.getKey().equalsIgnoreCase("Content-Encoding"))
                        ContentEncoding = entry.getValue();
                    conn.setRequestProperty(entry.getKey(), entry.getValue());
                }
            }
    
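            // Compare string contents; == only compares object references
            if ("POST".equalsIgnoreCase(method)) {
                conn.setDoOutput(true);
                conn.setDoInput(true);
                if (param != null && !param.equals("")) {
                    OutputStream output = conn.getOutputStream();
                    if (ContentEncoding != null) {
                        // indexOf returns 0 when the value starts with the token, so test against -1
                        if (ContentEncoding.indexOf("gzip") != -1) {
                            output = new GZIPOutputStream(output);
                        } else if (ContentEncoding.indexOf("deflate") != -1) {
                            output = new DeflaterOutputStream(output);
                        }
                    }
                    // Encode explicitly as UTF-8 instead of relying on the platform default
                    output.write(param.getBytes("UTF-8"));
                    // Closing flushes the body and writes the gzip/deflate trailer
                    output.close();
                }
            }
            // Open the actual connection (a no-op if the POST body already opened it)
            conn.connect();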
            return conn;
        }
    }

    C# implementation

    using System;
    using System.Collections;
    using System.IO;
    using System.Linq;
    using System.Net;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web;
    using System.IO.Compression;
    using System.Collections.Generic;
    using System.Collections.Specialized;
    
    namespace CSharp.Util.Net
    {
        public class HttpHelper
        {
            private static bool RemoteCertificateValidate(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors)
            {
                //Used for https requests
                return true; //Always accept. This disables certificate validation, so use it only with hosts you trust
            }
    
            public static string SendPost(string url, string data)
            {
                return Send(url, "POST", data, null);
            }
    
            public static string SendGet(string url)
            {
                return Send(url, "GET", null, null);
            }
    
            public static string Send(string url, string method, string data, HttpConfig config)
            {
                if (config == null) config = new HttpConfig();
                string result;
                using (HttpWebResponse response = GetResponse(url, method, data, config))
                {
                    Stream stream = response.GetResponseStream();
                   
                    if (!String.IsNullOrEmpty(response.ContentEncoding))
                    {
                        if (response.ContentEncoding.Contains("gzip"))
                        {
                            stream = new GZipStream(stream, CompressionMode.Decompress);
                        }
                        else if (response.ContentEncoding.Contains("deflate"))
                        {
                            stream = new DeflateStream(stream, CompressionMode.Decompress);
                        }
                    }
                  
                    byte[] bytes = null;
                    using (MemoryStream ms = new MemoryStream())
                    {
                        int count;
                        byte[] buffer = new byte[4096];
                        while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
                        {
                            ms.Write(buffer, 0, count);
                        }
                        bytes = ms.ToArray();
                    }
    
                    #region Detect stream encoding
                    Encoding encoding;
    
                    //Check whether the response header declares a charset; if it does, use it
                    //Note: when the header carries no charset, CharacterSet is often reported as ISO-8859-1
                    if (!string.IsNullOrEmpty(response.CharacterSet) && response.CharacterSet.ToUpper() != "ISO-8859-1")
                    {
                        encoding = Encoding.GetEncoding(response.CharacterSet == "utf8" ? "utf-8" : response.CharacterSet);
                    }
                    else
                    {
                        //No charset in the response header, so look for the charset in the html meta tag
                        result = Encoding.Default.GetString(bytes);
                        //Match the page charset in the returned html; [^>]* keeps the match inside one meta tag
                        Match match = Regex.Match(result, @"<meta[^>]*charset\s*=\s*[""']?([\w-]+)", RegexOptions.IgnoreCase);
                        if (match.Success)
                        {
                            encoding = Encoding.GetEncoding(match.Groups[1].Value);
                        }
                        else
                        {
                            //No charset in the html either, so use the configured default (utf-8)
                            encoding = Encoding.GetEncoding(config.CharacterSet);
                        }
                    }
                    #endregion
    
                    result = encoding.GetString(bytes);
                }
                return result;
            }
    
            private static HttpWebResponse GetResponse(string url, string method, string data, HttpConfig config)
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                request.Method = method;
                request.Referer = config.Referer;
                //Some pages return no content unless a user agent is set
                request.UserAgent = config.UserAgent;
                request.Timeout = config.Timeout;
                request.Accept = config.Accept;
                request.Headers.Set("Accept-Encoding", config.AcceptEncoding);
                request.ContentType = config.ContentType;
                request.KeepAlive = config.KeepAlive;
    
                if (url.ToLower().StartsWith("https"))
                {
                    //Works around the https error seen in production: Could not establish trust relationship for the SSL/TLS secure channel
                    ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(RemoteCertificateValidate);
                }
    
    
                if (method.ToUpper() == "POST")
                {
                    if (!string.IsNullOrEmpty(data))
                    {
                        byte[] bytes = Encoding.UTF8.GetBytes(data);
    
                        if (config.GZipCompress)
                        {
                            using (MemoryStream stream = new MemoryStream())
                            {
                                using (GZipStream gZipStream = new GZipStream(stream, CompressionMode.Compress))
                                {
                                    gZipStream.Write(bytes, 0, bytes.Length);
                                }
                                bytes = stream.ToArray();
                            }
                        }
    
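                        request.ContentLength = bytes.Length;
                        //Dispose the request stream so the body is flushed and the stream released
                        using (Stream requestStream = request.GetRequestStream())
                        {
                            requestStream.Write(bytes, 0, bytes.Length);
                        }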
                    }
                    else
                    {
                        request.ContentLength = 0;
                    }
                }
    
                return (HttpWebResponse)request.GetResponse();
            }      
        }
    
        public class HttpConfig
        {
            public string Referer { get; set; }
    
            /// <summary>
            /// Defaults to text/html with charset=utf-8
            /// </summary>
            public string ContentType { get; set; }
    
            public string Accept { get; set; }
    
            public string AcceptEncoding { get; set; }
    
            /// <summary>
            /// Timeout in milliseconds; defaults to 100000
            /// </summary>
            public int Timeout { get; set; }
    
            public string UserAgent { get; set; }
    
            /// <summary>
            /// Whether to gzip-compress the data for POST requests
            /// </summary>
            public bool GZipCompress { get; set; }
    
            public bool KeepAlive { get; set; }
    
            public string CharacterSet { get; set; }
    
            public HttpConfig()
            {
                this.Timeout = 100000;
                this.ContentType = "text/html; charset=" + Encoding.UTF8.WebName;
                this.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
                this.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
                this.AcceptEncoding = "gzip,deflate";
                this.GZipCompress = false;
                this.KeepAlive = true;
                this.CharacterSet = "UTF-8";
            }
        }
    }

    HttpUtil.java

    HttpHelper.cs
