

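Author: 牧小熊, Huazhong Agricultural University, Datawhale original author

The original article is about 3,000 characters; suggested reading time is 15 minutes.

Reviewer: 魚佬, Datawhale member, master's graduate of Wuhan University, winner of the Tencent Advertising Algorithm Competition.

We recently crawled the rental listings of Danke Apartment (蛋殼公寓) in Wuhan, then cleaned and visualized the data.

We also built a model to explore how the rent of a Danke unit in Wuhan relates to the features of the unit.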


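1. Data crawling

We crawled the Danke Apartment rental site with the region set to Wuhan. The crawler visits each page, collects the details of every listing, and writes them to a CSV file.

Tools: requests, lxml, BeautifulSoup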

import requests
from lxml import etree
from bs4 import BeautifulSoup
import random
import time
from tqdm import tqdm
import csv

We define a pool of User-Agent strings and pick one at random for every request.

Rotating the User-Agent makes the crawler somewhat less likely to be blocked.
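As a minimal sketch of the rotation idea, the helper below draws one User-Agent at random and wraps it in a headers dictionary. The names `USER_AGENTS` and `build_headers` are assumptions made for illustration and do not appear in the original code; the two-entry pool stands in for the full list.

```python
import random

# Hypothetical two-entry pool; the article's full user_agent list would be used in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
]


def build_headers(pool, rng=random):
    """Return request headers whose User-Agent is drawn at random from pool."""
    if not pool:
        raise ValueError("the User-Agent pool is empty")
    return {"User-Agent": rng.choice(pool)}


headers = build_headers(USER_AGENTS)
print(headers)
```

The resulting dictionary can be passed as the `headers` argument of `requests.get`, which is how the crawler in this article uses it.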

# A pool of User-Agent strings.
# Rotating them gives the crawler some protection against blocking.
user_agent = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)"]

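Next we visit the Danke Apartment site and collect the link of every listing. BeautifulSoup parses each results page, and we locate the listing entries by tag and class to extract each listing's hyperlink.

The custom function get_house_info opens a listing link and returns the details of that listing as a list, which is then written to the CSV file. To avoid putting too much load on the server, the crawler pauses for a few seconds between requests.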

def get_info():
    # Header columns: price, area, listing ID, layout, floor, orientation,
    # district, street, residential complex, subway
    csvheader=['價格','面積','編號','戶型','樓層','朝向','位置1','位置2','小區(qū)','地鐵']
    # utf-8 is set explicitly so the Chinese header does not depend on the platform default.
    # Note that 'a+' appends, so rerunning the crawler writes the header row again.
    with open('wuhan_danke.csv', 'a+', newline='', encoding='utf-8') as csvfile:
        writer  = csv.writer(csvfile)
        writer.writerow(csvheader)
        for i in tqdm(range(1,268)):  # the Wuhan listings span 267 pages
            timelist=[2,3,4,5]
            time.sleep(random.choice(timelist))   # pause 2-5 seconds to avoid overloading the server
            url='https://www.danke.com/room/wh?search=1&search_text=&from=home&page={}'.format(i)
            headers = {'User-Agent': random.choice(user_agent)}
            r = requests.get(url, headers=headers)
            r.encoding = r.apparent_encoding
            soup = BeautifulSoup(r.text, 'lxml')
            all_info = soup.find_all('div', class_='r_lbx_cena')
            for info in all_info:
                href = info.find('a')
                if href is not None:
                    href=href['href']
                    house_info=get_house_info(href)
                    writer.writerow(house_info)


def get_house_info(href):
    # Fetch one listing page and return its fields as a list
    time.sleep(3)
    headers = {'User-Agent': random.choice(user_agent)}
    response = requests.get(url=href, headers=headers)
    response=response.content.decode('utf-8', 'ignore')
    div = etree.HTML(response)
    # The absolute XPaths below depend on the page layout at crawl time
    room_price=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[3]/div[2]/div/span/div/text()")[0]
    room_area=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[1]/div[1]/label/text()")[0].replace('建筑面積：約','').replace('㎡（以現(xiàn)場勘察為準）','')
    room_id=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[1]/div[2]/label/text()")[0].replace('編號：','')
    room_type=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[1]/div[3]/label/text()")[0].replace('\n','').replace(' ','').replace('戶型：','')
    room_floor=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[2]/div[3]/label/text()")[0].replace('樓層：','')
    room_dir=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[2]/div[1]/label/text()")[0].replace("朝向","")
    room_postion_1=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[2]/div[4]/label/div/a[1]/text()")[0]
    room_postion_2=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[2]/div[4]/label/div/a[2]/text()")[0]
    room_postion_3=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[2]/div[4]/label/div/a[3]/text()")[0]
    room_subway=div.xpath("/html/body/div[3]/div[1]/div[2]/div[2]/div[4]/div[2]/div[5]/label/text()")[0]
    room_info=[room_price,room_area,room_id,room_type,room_floor,room_dir,room_postion_1,room_postion_2,room_postion_3,room_subway]
    return room_info
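Indexing an XPath result with `[0]` raises IndexError whenever a listing page lacks the expected node, which would stop the whole crawl. The following is a sketch of one possible guard, not part of the original crawler; `first_text` and `pick_delay` are hypothetical helper names. It relies only on the fact that an lxml `text()` query returns a list of strings.

```python
import random


def first_text(nodes, default=""):
    """Return the first XPath text result, stripped, or default when nothing matched."""
    return nodes[0].strip() if nodes else default


def pick_delay(choices=(2, 3, 4, 5), rng=random):
    """Pick a pause length in seconds, mirroring the 2-5 second pause used by the crawler."""
    return rng.choice(choices)


# Plain lists stand in for the return value of div.xpath(".../text()")
print(first_text(["  1200 "]))
print(repr(first_text([])))
print(pick_delay())
```

With such a guard, a listing whose page is missing a field produces an empty cell in the CSV instead of an exception, and the affected rows can be filtered out during cleaning.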

Here is what the crawled data looks like:

2. Single-feature visualization

We load the Wuhan Danke Apartment data with pandas and visualize each feature on its own.

Tools: pandas, plotly

Import the third-party libraries:

import pandas as pd
import plotly.express as px  # bundled with plotly 4+; replaces the standalone plotly_express package
import plotly.offline as of
import plotly as py
import plotly.graph_objs as go
import re

Load the data:

data=pd.read_csv('wuhan_danke.csv')
data

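Meaning of each field

價格 --- monthly rent, in yuan per month
面積 --- floor area of the unit, in ㎡
編號 --- listing ID
戶型 --- overall layout of the apartment the unit belongs to
樓層 --- the floor the rental unit is on and the total number of floors in the building
朝向 --- orientation of the unit
位置1 --- district
位置2 --- street
小區(qū) --- name of the residential complex
地鐵 --- nearby subway information

2.1 Rent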

price = go.Histogram(x = data['價格'],histnorm = 'probability')
fig = go.Figure(data=price)
fig

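The histogram shows that rents follow a roughly bell-shaped distribution that is not strictly normal, with most rents concentrated between 1,100 and 1,200 yuan per month.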
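With `histnorm = 'probability'` the bar heights are normalized so that they sum to 1, so each bar is the share of listings in that price bin. The same calculation can be sketched in plain Python. The prices below are made up for illustration and are not taken from the crawled data; `bin_shares` is a hypothetical helper, and the 100-yuan bin width is an assumption.

```python
from collections import Counter


def bin_shares(values, width=100):
    """Map each bin's lower edge to the share of values falling in [edge, edge + width)."""
    counts = Counter((v // width) * width for v in values)
    total = len(values)
    return {edge: counts[edge] / total for edge in sorted(counts)}


# Made-up monthly rents in yuan, for illustration only
prices = [990, 1100, 1130, 1150, 1180, 1250, 1490, 1820]
shares = bin_shares(prices)
modal_bin = max(shares, key=shares.get)
print(shares)
print(modal_bin)
```

The bin with the largest share plays the role of the tallest bar in the histogram.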

2.2 Floor area

We take the 15 most common floor areas:

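# Explicit column names keep this working whether or not pandas names the counts column 'count'
area = data["面積"].value_counts()[:15].rename_axis("index").reset_index(name="count")
area["index"] = ['%sm2' % i for i in area["index"]]
fig = go.Bar(x=area['index'], y=area['count'], text=area['count'], textposition='outside')
fig = go.Figure(data=fig)
fig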

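The bar chart shows that Danke units cluster at 9-12 m2. The likely reason is that bedrooms in ordinary residential complexes are usually 9-12 m2, so renting out each existing bedroom as a separate unit keeps the conversion cost as low as possible.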
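The "top N most common values" step used for this chart can be illustrated without pandas. This is only a sketch on made-up areas, using `collections.Counter` from the standard library rather than the `value_counts` call in the article; `top_values` is a hypothetical helper name.

```python
from collections import Counter


def top_values(values, n):
    """Return the n most frequent values as (label, count) pairs, most frequent first."""
    return [("%sm2" % value, count) for value, count in Counter(values).most_common(n)]


# Made-up floor areas in square meters, for illustration only
areas = [10, 10, 10, 12, 12, 9]
print(top_values(areas, 2))
```

The labels get the same `m2` suffix that the article adds before plotting.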

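2.3 Apartment layout

# Explicit column names keep this working across pandas versions
roomtype = data["戶型"].value_counts().rename_axis("index").reset_index(name="count")
fig = px.pie(roomtype, names="index", values="count")
fig

Most of the apartments Danke takes on are 4-bedroom/1-bathroom and 3-bedroom/1-bathroom layouts, which are also the mainstream layouts that property developers prefer to build.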

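2.4 Distribution by district

# Explicit column names keep this working across pandas versions
position = data["位置1"].value_counts().rename_axis("index").reset_index(name="count")
fig = px.pie(position, names="index", values="count")
fig.show()

The pie chart shows that Danke's listings are concentrated in Hongshan District (洪山區), Wuchang District and Jiangxia District. A look at Hongshan District on Baidu Maps shows that it contains many universities and covers a large area, which explains its large share.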

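2.5 Distribution by street

We take the 15 streets with the most listings:

# Explicit column names keep this working across pandas versions
area = data["位置2"].value_counts()[:15].rename_axis("index").reset_index(name="count")
fig = go.Bar(x=area['index'], y=area['count'], text=area['count'], textposition='outside')
fig = go.Figure(data=fig)
fig

The chart shows that Optics Valley Software Park (光谷軟件園), Fozuling (佛祖嶺) and Optics Valley Square (光谷廣場) account for a large share of the listings. Many internet companies are located near Optics Valley Software Park.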

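2.6 Distribution by residential complex

We take the 10 residential complexes with the most Danke listings:

# Explicit column names keep this working across pandas versions
area = data["小區(qū)"].value_counts()[:10].rename_axis("index").reset_index(name="count")
fig = go.Bar(x=area['index'], y=area['count'], text=area['count'], textposition='outside')
fig = go.Figure(data=fig)
fig

Surprisingly, the complex with the most listings is 旭輝御府. Looking at the businesses around it, we find Optics Valley Financial Harbor (光谷金融港), Lenovo Wuhan and the Optics Valley Electronics Industrial Park nearby. These sites attract workers, so the complex is in high demand.

The complex also went on sale relatively recently and few owners live there, so owners are more willing to rent their apartments out.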

3. Data cleaning and feature engineering

We clean the crawled data and construct features so that the pricing of the units can be explored:

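# The listing ID has almost no bearing on the price, so no feature is derived from it
def get_subway_num(row):
    # Count the subway lines mentioned in the subway description
    subway_num=row.count('號線')
    return subway_num

def get_subway_distance(row):
    # Distance to the subway in meters, or -1 when no distance is given
    distance=re.search(r'\d+(?=米)',row)
    if distance is None:
        return -1
    else:
        return int(distance.group())

data['房間數(shù)']=data['戶型'].apply(lambda x:x[0])      # number of rooms
data['衛(wèi)生間數(shù)']=data['戶型'].apply(lambda x:x[2])  # number of bathrooms
data['所在樓層']=data['樓層'].apply(lambda x:x.split('/')[0])  # floor of the unit
data['總樓層']=data['樓層'].apply(lambda x:x.split('/')[1].replace('層',''))  # total floors
data['地鐵數(shù)']=data['地鐵'].apply(get_subway_num)       # number of subway lines
data['距離地鐵距離']=data['地鐵'].apply(get_subway_distance)  # distance to the subway
data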
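To make the parsing rules concrete, here is a self-contained sketch that applies the same ideas to two sample strings. The samples are hypothetical and only imitate the shape of the crawled fields; `subway_line_count`, `subway_distance_m` and `split_floor` are illustrative names, not functions from the original code.

```python
import re


def subway_line_count(text):
    """Count how many subway lines are mentioned in the description."""
    return text.count("號線")


def subway_distance_m(text):
    """Return the first number directly followed by the meter character, or -1."""
    match = re.search(r"\d+(?=米)", text)
    return int(match.group()) if match else -1


def split_floor(text):
    """Split a 'current/total' floor string into two integers."""
    current, total = text.split("/")
    return int(current), int(total.replace("層", ""))


sample_subway = "距2號線光谷廣場站650米"  # hypothetical subway description
sample_floor = "5/33層"                  # hypothetical floor description

print(subway_line_count(sample_subway))
print(subway_distance_m(sample_subway))
print(split_floor(sample_floor))
```

The lookahead `(?=米)` is what keeps the line number `2` from being mistaken for a distance: only digits immediately followed by the meter character are accepted.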

We then drop the features that are no longer needed and encode the categorical ones with LabelEncoder:

data=data.drop(['編號'],axis=1)
data=data.drop(['戶型'],axis=1)
data=data.drop(['樓層'],axis=1)
data=data.drop(['地鐵'],axis=1)

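from sklearn.preprocessing import LabelEncoder
data=data.fillna(-1)
le = LabelEncoder()
# Orientation as an angle in degrees, counterclockwise from east
direction={'東':0,'南':270,'西':180,'北':90,'西南':225,'東南':315,'東北':45,'西北':135}
data['朝向']=data['朝向'].map(direction)
# Cast to str first so the -1 placeholders and the text labels share one type
data['位置1']=le.fit_transform(data['位置1'].astype(str))
data['位置2']=le.fit_transform(data['位置2'].astype(str))
data['小區(qū)']=le.fit_transform(data['小區(qū)'].astype(str))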

At this point every feature of a unit has been converted to a numeric feature.
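Label encoding simply assigns each distinct category an integer. The sketch below mimics that behavior in plain Python by numbering the sorted unique labels, which is also the order scikit-learn's LabelEncoder uses. The helper name `fit_label_map` and the sample districts are assumptions for illustration.

```python
def fit_label_map(values):
    """Assign consecutive integer codes to the sorted unique labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


# District names in pinyin, used here only as sample categories
districts = ["Hongshan", "Wuchang", "Jiangxia", "Hongshan"]
mapping = fit_label_map(districts)
encoded = [mapping[d] for d in districts]
print(mapping)
print(encoded)
```

The codes carry no real ordering: a larger number does not mean a "larger" district. Tree-based models such as the one used in the next section can still split on them, but the codes should not be read as magnitudes.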

4. Building the model

We fit a LightGBM regression model with 5-fold cross-validation and use it to estimate how important each feature is for the rent:

data=data.fillna(-1)
data=data.astype('int')
import lightgbm
from sklearn.model_selection import KFold
train_label=data['價格']
train_data=data.drop(['價格'],axis=1)

def select_by_lgb(train_data,train_label,random_state=2020,n_splits=5,metric='mse',num_round=10000,early_stopping_rounds=100):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    feature_importances = pd.DataFrame()
    feature_importances['feature'] = train_data.columns
    fold=0
    for train_idx, val_idx in kfold.split(train_data):
        # KFold yields positional indices, so rows are selected with iloc
        train_x = train_data.iloc[train_idx]
        train_y = train_label.iloc[train_idx]
        test_x = train_data.iloc[val_idx]
        test_y = train_label.iloc[val_idx]
        clf=lightgbm
        train_matrix=clf.Dataset(train_x,label=train_y)
        test_matrix=clf.Dataset(test_x,label=test_y)
        params={
                'boosting_type': 'gbdt',
                'objective': 'regression',
                'learning_rate': 0.1,
                'metric': metric,
                'seed': 2020,
                'nthread':-1 }
        # Early stopping is passed as a callback; the early_stopping_rounds
        # keyword of train() was removed in LightGBM 4.0
        model=clf.train(params,train_matrix,num_round,valid_sets=[test_matrix],
                        callbacks=[clf.early_stopping(early_stopping_rounds)])
        feature_importances['fold_{}'.format(fold + 1)] = model.feature_importance()
        fold+=1
    # Mean importance of each feature across the folds
    feature_importances['averge']=feature_importances[['fold_{}'.format(i) for i in range(1,n_splits+1)]].mean(axis=1)
    return feature_importances

feature_importances=select_by_lgb(train_data,train_label)
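The cross-validation logic can be separated from LightGBM to show what is being averaged. The sketch below partitions sample indices into folds and averages per-fold importance scores in plain Python. It does not reproduce the exact splits that scikit-learn's KFold generates, and `kfold_indices` and `average_importance` are hypothetical names; the importance numbers are made up.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=2020):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def average_importance(per_fold):
    """Average feature -> importance dictionaries over all folds."""
    return {f: sum(fold[f] for fold in per_fold) / len(per_fold) for f in per_fold[0]}


splits = list(kfold_indices(10))
# Made-up importance scores from two folds, for illustration only
averaged = average_importance([{"area": 2, "street": 0}, {"area": 4, "street": 2}])
print(len(splits))
print(averaged)
```

Averaging over folds dampens the effect of any single train/validation split, so a feature has to be useful on several subsets of the data to rank highly.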

With the importances averaged over the 5 folds, we sort the features by importance in descending order:

feature_importances=feature_importances.sort_values(by='averge',ascending=False)
feature_importances

Visualize the feature importances:

fig = go.Bar(x = feature_importances['feature'],y=feature_importances['averge'],text=feature_importances['averge'],textposition='outside')
fig = go.Figure(data=fig)
fig

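The feature analysis shows that the floor area of a unit and the street it is on have the largest influence on rent.

The residential complex and the distance to the subway are both strongly tied to the street. This suggests that when Danke sets a price it gives the most weight to floor area and street, and then considers which floor the unit is on.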

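5. Project summary

This project crawled the Wuhan listings of Danke Apartment, an O2O long-term rental platform, and collected the details of the units it offers for rent. Because the crawled listings are units still waiting for tenants, we can treat them as something like a sample that reflects the overall distribution to some extent.

We visualized and analyzed rent, floor area, apartment layout, district, street and residential complex. The analysis suggests that when Danke takes on apartments in a complex, it pays close attention to whether there are commercial streets or companies nearby that draw workers, since this raises occupancy. From the layouts and floor areas we find that the typical unit in Wuhan is 9-12 m2, that Danke generally splits an apartment into 3-5 bedrooms to rent out separately, and that rents are concentrated between 1,100 and 1,200 yuan per month.

Finally, we modeled Danke's pricing and explored its structure. The rent appears to depend mainly on three factors: the floor area of the unit, the street it is on, and the floor it is on.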

6. Project attachments

Danke Apartment listing data: https://pan.baidu.com/s/1CmYdX8dj-JgwAnaU3_Wnnw (extraction code: x6fd)

Danke Apartment crawler code: https://pan.baidu.com/s/1D_d0T4SVRrgOXwYI7S0-Cw (extraction code: vjr0)

Danke Apartment data analysis code: https://pan.baidu.com/s/1-8cJJqfUiHC0B9Bvysftew (extraction code: q47x)


Data project summary: Danke Apartment rent analysis