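Real Estate and Finance, Continued: Rental Tenant Edition (3): Rerunning "I tried using machine learning to find bargain rental properties in Tokyo's 23 wards" for Chiba Prefecture (Prediction)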

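This is the final installment of the series on predicting rents for rental properties.

(The source code for the analysis is listed in one place at the end.)

Using the cleansed dataset of rental listings in Chiba Prefecture, we now move on to prediction.

Before that, one additional note on the data cleansing.

At the scraping stage I noticed that the property addresses were missing, so I added them afterwards. Because the data had been extracted city by city, adding them was not a problem, but please keep this in mind if you refer to the code.

As always, this is the reference article.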

www.analyze-world.com

Basic analysis

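This article is actually a sequel to the article linked below, which already gave an overview of the data.

The difference between the previous article and this one is that new variables have been added, such as building structure and parking availability.

(The scraping took a very long time.)

I computed basic statistics such as the number of listings per city this time as well, but they do not differ much, so I leave the overlapping content to the previous article.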

https://blog.hatena.ne.jp/d_s/d-s.hatenablog.com/edit?entry=10257846132614731711

First, here is the dataset used this time.

df(38906,19)

> str(df)

'data.frame':	38906 obs. of  19 variables:

 $ name        : Factor w/ 26511 levels "#NAME?","（仮）Ｄ－ｒｏｏｍ千葉寺",..: 17354 2060 24672 6816 24832 12633 12634 23640 3630 3631 ...

 $ city        : Factor w/ 10 levels "浦安市","鎌ヶ谷市",..: 6 6 6 6 6 6 6 6 6 6 ...

 $ shikikin    : int  38000 0 40000 0 0 38000 39000 0 0 0 ...

 $ reikin      : int  38000 0 0 0 0 38000 39000 45000 0 0 ...

 $ hoshoukin   : int  0 0 0 0 0 0 0 0 0 0 ...

 $ layout      : Factor w/ 8 levels "1DK","1K","1LDK",..: 2 1 1 3 8 2 2 8 3 3 ...

 $ area        : num  21.1 34.3 33.1 42 34.3 ...

 $ direction   : Factor w/ 9 levels "-","西","東",..: 3 3 8 5 3 6 6 4 5 2 ...

 $ type        : Factor w/ 5 levels "アパート","その他",..: 4 4 5 4 4 1 1 1 4 4 ...

 $ age         : int  89 52 48 46 52 22 22 44 48 48 ...

 $ rout1       : Factor w/ 21 levels "ＪＲ外房線","ＪＲ京葉線",..: 4 4 4 15 4 4 4 4 7 1 ...

 $ station     : Factor w/ 159 levels "おゆみ野","くぬぎ山",..: 8 8 89 50 8 8 8 8 100 100 ...

 $ distance    : int  72 25 100 25 28 80 52 28 68 68 ...

 $ construction: Factor w/ 10 levels "その他","プレコン",..: 7 6 10 6 6 10 10 7 7 7 ...

 $ floor       : Factor w/ 25 levels "1","1-2","1-3",..: 16 12 1 20 12 1 12 12 12 16 ...

 $ height      : int  4 4 1 5 4 2 2 2 4 4 ...

 $ car.dum     : Factor w/ 4 levels "近隣","付無料",..: 3 1 2 4 4 3 4 4 4 4 ...

 $ free_rent   : int  0 1 0 0 1 0 0 0 0 0 ...

 $ monthly_cost: int  38000 53000 40000 27000 50000 40000 41000 45000 49000 54000 ...

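To begin, let's look at the variables that are highly correlated with rent.

Only the variables whose correlation coefficient with rent exceeds 0.5 in absolute value are shown.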

f:id:d_s:20180906224319p:plain

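monthly_cost is the rent, area is the floor area, and age is the building age. Rent appears to be most strongly correlated with floor area.

That matches intuition, and it is likely to be an important variable for prediction.

Next, since the building structure was scraped as well, let's look at how it relates to rent.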

f:id:d_s:20180906225837p:plain

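As expected, wooden buildings ended up on the cheap side.

I have lived in an apartment myself, and I did pay attention to the building structure.

I was not particular about details such as reinforced concrete versus steel frame, but I had a preconception that noise is a problem in wooden buildings, so I remember not even considering them at the time.

From here the process would continue with drawing a histogram for each variable to check its distribution, log-transforming the skewed ones, checking whether they become closer to a normal distribution, and so on. That would get long, so here is just one example.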

f:id:d_s:20180906231428p:plain

f:id:d_s:20180906231713p:plain

This is the distance to the nearest station. Log-transforming it gives:

f:id:d_s:20180906231547p:plain

f:id:d_s:20180906231740p:plain

The distribution is now somewhat closer to normal, and the Q-Q plot somewhat closer to a straight line.
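As a rough illustration of why the log transform helps, the sketch below compares the skewness of a right-skewed variable before and after taking logs. The values are simulated with `rlnorm` as a stand-in for the station distance (an assumption; the scraped listings are not used), and `skewness()` is a small helper defined here, not part of the analysis code.

```r
set.seed(1)

# Simulated right-skewed "distance" values
# (assumption: a log-normal stand-in, not the scraped data)
distance_sim <- rlnorm(5000, meanlog = 2.5, sdlog = 0.6)

# Sample skewness: the third standardized moment (about 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(distance_sim)
skew_log <- skewness(log(distance_sim))

cat("skewness before log:", round(skew_raw, 2), "\n")
cat("skewness after log: ", round(skew_log, 2), "\n")
```

A skewness close to zero after the transform is only one indication; the histogram and Q-Q plot remain the more informative check.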

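This time I move straight on to multiple regression, although I would have liked to explore the data more thoroughly. The other processing steps are explained in the comments in the code.

(I noticed afterwards that I had forgotten to apply the log transform in the original dataset to the variables I had intended to transform.)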

Next, we move on to prediction.

As before, the dataset is split into training and test data at a ratio of 8:2 for validation.

Training data: train(31111, 14)

Test data: test(7776, 14)
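For readers who want to see the idea of the 8:2 split in isolation, here is a minimal base-R sketch on a toy data frame. The data frame `toy` and its values are hypothetical. The code at the end of this post uses caret's `createDataPartition` instead, which samples within groups of the outcome rather than purely at random, so this is a simplified stand-in and not the same procedure.

```r
set.seed(42)

# Toy data frame standing in for the listings (hypothetical values)
n <- 1000
toy <- data.frame(
  area         = runif(n, 15, 60),
  monthly_cost = round(runif(n, 30000, 120000))
)

# Randomly pick 80% of the row indices for training; the remaining 20% form the test set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

dim(train_toy)
dim(test_toy)
```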

Prediction 1: Multiple regression

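First, multiple regression analysis.

As you can see in the code, the categorical variables are converted to dummy variables.

Variables are selected with the step function (minimizing AIC), so the output below is the result after variable selection.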

> summary(lm.step)



Call:

lm(formula = monthly_cost ~ city.浦安市 + city.鎌ヶ谷市 + city.市川市 + 

    city.松戸市 + city.千葉市 + city.船橋市 + city.八千代市 + 

    layout.1DK + layout.1K + layout.1LDK + layout.1SDK + layout.1SK + 

    layout.1SLDK + area + type.アパート + type.その他 + type.テラスハウス + 

    type.マンション + age + rout1.ＪＲ外房線 + rout1.ＪＲ京葉線 + 

    rout1.ＪＲ常磐線 + rout1.ＪＲ総武線 + rout1.ＪＲ総武線快速 + 

    rout1.ＪＲ総武本線 + rout1.ＪＲ内房線 + rout1.ＪＲ武蔵野線 + 

    rout1.つくばエクスプレス + rout1.京成千原線 + rout1.京成千葉線 + 

    rout1.京成本線 + rout1.新京成線 + rout1.千葉都市モノレール + 

    rout1.都営新宿線 + rout1.東京メトロ東西線 + rout1.東武野田線 + 

    rout1.北総線 + distance + construction.軽量鉄骨 + construction.鉄筋コン + 

    construction.鉄骨 + construction.鉄骨鉄筋 + floor.1 + floor.12 + 

    floor.13 + floor.14 + floor.15 + floor.17 + floor.18 + floor.19 + 

    floor.2 + `floor.2-3` + floor.3 + floor.36 + floor.38 + floor.6 + 

    height + car.dum.近隣 + car.dum.付無料 + car.dum.敷地内, 

    data = train_dummy)



Residuals:

   Min     1Q Median     3Q    Max 

-47975  -4407   -385   3754  92307 



Coefficients:

                           Estimate Std. Error  t value             Pr(>|t|)    

(Intercept)               51597.992   1333.095   38.705 < 0.0000000000000002 ***

city.浦安市                9991.120    309.562   32.275 < 0.0000000000000002 ***

city.鎌ヶ谷市              -860.327    420.582   -2.046             0.040808 *  

city.市川市                7080.343    218.837   32.354 < 0.0000000000000002 ***

city.松戸市                 785.224    171.400    4.581   0.0000046401196611 ***

city.千葉市               -4705.804    206.609  -22.776 < 0.0000000000000002 ***

city.船橋市                2900.313    179.287   16.177 < 0.0000000000000002 ***

city.八千代市             -4384.625    295.029  -14.862 < 0.0000000000000002 ***

layout.1DK                 2178.404    202.462   10.760 < 0.0000000000000002 ***

layout.1K                  1092.225    124.968    8.740 < 0.0000000000000002 ***

layout.1LDK                3864.571    212.770   18.163 < 0.0000000000000002 ***

layout.1SDK                3438.787   1274.857    2.697             0.006992 ** 

layout.1SK                 3032.182    976.678    3.105             0.001907 ** 

layout.1SLDK               7881.395    932.678    8.450 < 0.0000000000000002 ***

area                        933.801      8.357  111.738 < 0.0000000000000002 ***

type.アパート            -16857.882   1252.718  -13.457 < 0.0000000000000002 ***

type.その他              -15205.806   2156.760   -7.050   0.0000000000018226 ***

type.テラスハウス        -14009.415   1547.312   -9.054 < 0.0000000000000002 ***

type.マンション          -15053.623   1266.946  -11.882 < 0.0000000000000002 ***

age                        -615.966      3.725 -165.344 < 0.0000000000000002 ***

rout1.ＪＲ外房線           3185.541    416.257    7.653   0.0000000000000202 ***

rout1.ＪＲ京葉線          10369.839    373.456   27.767 < 0.0000000000000002 ***

rout1.ＪＲ常磐線           3613.268    327.765   11.024 < 0.0000000000000002 ***

rout1.ＪＲ総武線          10559.045    303.565   34.783 < 0.0000000000000002 ***

rout1.ＪＲ総武線快速      12349.595    389.127   31.737 < 0.0000000000000002 ***

rout1.ＪＲ総武本線         2588.423    414.103    6.251   0.0000000004139921 ***

rout1.ＪＲ内房線           5378.499    461.783   11.647 < 0.0000000000000002 ***

rout1.ＪＲ武蔵野線         1811.428    408.226    4.437   0.0000091398650456 ***

rout1.つくばエクスプレス   3203.856    365.079    8.776 < 0.0000000000000002 ***

rout1.京成千原線           1333.445    440.695    3.026             0.002482 ** 

rout1.京成千葉線           8878.187    445.089   19.947 < 0.0000000000000002 ***

rout1.京成本線             2570.059    295.439    8.699 < 0.0000000000000002 ***

rout1.新京成線            -1117.875    336.274   -3.324             0.000887 ***

rout1.千葉都市モノレール   1237.903    420.995    2.940             0.003280 ** 

rout1.都営新宿線          11057.461    793.120   13.942 < 0.0000000000000002 ***

rout1.東京メトロ東西線     7552.354    352.766   21.409 < 0.0000000000000002 ***

rout1.東武野田線          -1299.222    355.462   -3.655             0.000258 ***

rout1.北総線               1446.186    497.302    2.908             0.003639 ** 

distance                   -279.290      4.316  -64.716 < 0.0000000000000002 ***

construction.軽量鉄骨      2811.936    119.929   23.447 < 0.0000000000000002 ***

construction.鉄筋コン      2424.085    229.301   10.572 < 0.0000000000000002 ***

construction.鉄骨          1085.687    197.038    5.510   0.0000000361614633 ***

construction.鉄骨鉄筋      -779.408    371.952   -2.095             0.036139 *  

floor.1                   -3087.492    175.923  -17.550 < 0.0000000000000002 ***

floor.12                   2906.244   1011.471    2.873             0.004065 ** 

floor.13                   2868.653   1216.872    2.357             0.018410 *  

floor.14                   7940.393   1781.355    4.458   0.0000083210633342 ***

floor.15                  30143.089   4101.146    7.350   0.0000000000002032 ***

floor.17                  11181.280   5029.571    2.223             0.026216 *  

floor.18                  33601.725   7074.130    4.750   0.0000020438072318 ***

floor.19                  34882.878   7073.643    4.931   0.0000008206846043 ***

floor.2                   -1399.758    172.474   -8.116   0.0000000000000005 ***

`floor.2-3`                8776.983   4071.860    2.156             0.031129 *  

floor.3                    -701.876    177.975   -3.944   0.0000804200069929 ***

floor.36                  45586.183   4128.933   11.041 < 0.0000000000000002 ***

floor.38                  12905.202   7082.366    1.822             0.068440 .  

floor.6                    -744.190    365.900   -2.034             0.041974 *  

height                     1311.134     24.931   52.591 < 0.0000000000000002 ***

car.dum.近隣               -868.668    116.089   -7.483   0.0000000000000747 ***

car.dum.付無料             4812.216    625.234    7.697   0.0000000000000144 ***

car.dum.敷地内             -179.877    101.314   -1.775             0.075834 .  

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Residual standard error: 7036 on 31050 degrees of freedom

Multiple R-squared:  0.843,	Adjusted R-squared:  0.8427 

F-statistic:  2778 on 60 and 31050 DF,  p-value: < 0.00000000000000022

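The Estimate (partial regression coefficient) of the Intercept is 51597, so the fitted equation can be read as starting from a base of 51,597 yen and adding or subtracting the other terms to arrive at the rent.

Looking at the cities, for example, the coefficient for Urayasu is 9991 and the one for Chiba is -4705, so a property in Urayasu starts from 51,597 yen + 9,991 yen and one in Chiba from 51,597 yen - 4,705 yen, and the calculation continues in the same way for the other variables.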
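To make the "base plus adjustments" reading concrete, the sketch below adds up the relevant terms for one made-up listing. The coefficients are copied from the `summary(lm.step)` output above; the listing itself (Urayasu, 1K, 25 m², apartment, age 10, JR Keiyo Line, distance 10, light-gauge steel, 2nd floor of a 2-story building, no parking dummy set) is hypothetical, and the ASCII names in the vectors are labels chosen for this example only.

```r
# Coefficients copied from the summary(lm.step) output
coefs <- c(
  intercept          = 51597.992,
  city_urayasu       = 9991.120,    # city.浦安市
  layout_1K          = 1092.225,    # layout.1K
  area               = 933.801,
  type_apartment     = -16857.882,  # type.アパート
  age                = -615.966,
  line_keiyo         = 10369.839,   # rout1.ＪＲ京葉線
  distance           = -279.290,
  constr_light_steel = 2811.936,    # construction.軽量鉄骨
  floor_2            = -1399.758,   # floor.2
  height             = 1311.134
)

# Hypothetical listing: dummy variables are 1 when the level applies,
# numeric variables carry their actual value
x <- c(
  intercept = 1, city_urayasu = 1, layout_1K = 1, area = 25,
  type_apartment = 1, age = 10, line_keiyo = 1, distance = 10,
  constr_light_steel = 1, floor_2 = 1, height = 2
)

# Predicted rent = sum of coefficient * value over all terms
predicted_rent <- sum(coefs * x[names(coefs)])
round(predicted_rent)
```

Every dummy variable that is not listed is treated as 0 here, so the listing falls into the reference level for those variables.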

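Across the cities, the coefficient falls as you move away from Tokyo, which makes sense.

For area (floor area), the partial regression coefficient is 933.

This reads as 933 yen per square meter. The rent per square meter in this dataset (mean rent ÷ mean floor area) is 2,193 yen, so the difference can be regarded as the part explained by the other variables.

The coefficient of determination of this multiple regression is 0.84, which I think is a fairly good result.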

As a check, let's look at the predictions on the test data.

f:id:d_s:20180906234229p:plain

Multiple regression seems to predict reasonably well.

That makes me look forward to the random forest results that come next.

Prediction 2: Random forest

To make the comparison easier, here is the result first.

f:id:d_s:20180906234927p:plain

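The accuracy appears to have improved further.

In fact, to shorten the training time, the training data for the random forest was cut down further to 60%, which should have put it at a disadvantage.

And still it produces this result. It was a fresh reminder of how powerful random forests are.

(The test data is exactly the same as for the multiple regression.)

As usual, the parameter mtry (the number of features considered as candidates at each split) was determined with the tuneRF function.

An mtry of 8 appears to be optimal.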

f:id:d_s:20180906235127p:plain

The variable importance was output as follows.

In the random forest, building age and floor area appear to be the important factors.

f:id:d_s:20180906235042p:plain

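That was a quick look at multiple regression and random forest. Now let's compare their accuracy.

I compare them using RMSE (root mean squared error), a measure of the typical size of the prediction error: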

> ##### RMSE #####

> sqrt( sum((test$monthly_cost - lm_pred)^2) / length(test$monthly_cost) )

[1] 7095.295

> sqrt( sum((test$monthly_cost - rf_pred)^2) / length(test$monthly_cost) )

[1] 5133.426

As shown above, the RMSE is 7095 for multiple regression and 5133 for the random forest. The random forest has the smaller RMSE (a smaller error), so it produced the better result.
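The RMSE expression used in the console output above can be wrapped in a small helper. The sketch below does that and checks it on three made-up rents; the function name `rmse` and the numbers are illustrative assumptions, not values from the dataset.

```r
# Root mean squared error: square the errors, average them, take the square root
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up actual and predicted rents (yen)
actual    <- c(50000, 60000, 70000)
predicted <- c(52000, 57000, 70000)

rmse_value <- rmse(actual, predicted)

# Same quantity written the way the post computes it
rmse_post_style <- sqrt(sum((actual - predicted)^2) / length(actual))

rmse_value
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones do.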

Prediction 3: Finding bargain properties

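We have finally made it this far.

Below, the rent predicted by the random forest is compared with the actual rent, a bargain coefficient is computed from the difference, and the listings are sorted by it.

(Bargain coefficient: a coefficient I made up myself, computed as difference ÷ (rent + common service fee).)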

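Property	City	Layout	Area (m²)	Age (years)	Line	Distance to nearest station (min)	Structure	Rent + common fee	Predicted rent	Difference	Bargain coefficient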
景中寮	流山市	ワンルーム	18.00	31	東武野田線	9	木造	12000	31058.24	19058.24	1.5881867
カルフォルニアハウス江戸川台202号室	流山市	1K	7.30	14	東武野田線	6	鉄筋コン	20000	47337.43	27337.43	1.3668713
カルフォルニアハウス江戸川台201号室	流山市	1K	6.60	14	東武野田線	6	鉄筋コン	20000	46878.56	26878.56	1.3439278
カルフォルニアハウス江戸川台101号室	流山市	1K	7.70	14	東武野田線	6	鉄筋コン	20000	46210.84	26210.84	1.3105422
ＪＲ総武線津田沼駅2階建築26年	船橋市	1K	20.00	26	ＪＲ総武線	15	木造	22000	47256.80	25256.80	1.1480366
ヴィレッジダイドー	千葉市	1LDK	42.00	46	千葉都市モノレール	25	鉄筋コン	27000	57002.69	30002.69	1.1112108
ビレッジハウス二和１号棟305号室	船橋市	1K	45.36	57	新京成線	4	鉄筋コン	34000	66637.96	32637.96	0.9599399
アーバンハイム金丸	習志野市	ワンルーム	18.00	19	京成本線	6	木造	27000	48302.86	21302.86	0.7889947
大和田ハイツI－1	八千代市	1K	20.00	50	京成本線	10	木造	19000	33909.16	14909.16	0.7846928
ハイホーム田中305号室	市川市	1K	19.80	48	ＪＲ総武線快速	7	鉄筋コン	31000	55022.72	24022.72	0.7749265
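The bargain coefficient in the last column can be reproduced by hand. The sketch below does so for three rows of the table above (ranks 1, 2 and 10), using the actual and predicted rents exactly as printed; the data frame name `listings` and the `bargain` column are names chosen for this example.

```r
# Actual rent + common fee and predicted rent, copied from the table above
listings <- data.frame(
  rank         = c(1, 2, 10),
  monthly_cost = c(12000, 20000, 31000),
  pred_rf      = c(31058.24, 47337.43, 55022.72)
)

# Difference = predicted rent - actual rent
listings$diff <- listings$pred_rf - listings$monthly_cost

# Bargain coefficient = difference / actual rent
listings$bargain <- listings$diff / listings$monthly_cost

# Sort from the best bargain downwards
listings[order(-listings$bargain), ]
```

A coefficient of 1 means the model predicts twice the asking rent, so cheap listings with a sizeable gap rise to the top more easily than expensive ones.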

Let's look at a few listings from this list that caught my attention.

First, the listing ranked first by the bargain coefficient.

f:id:d_s:20180906221406p:plain

f:id:d_s:20180906221506p:plain

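Oh, a men's dormitory has made the ranking.

It is probably aimed at students. It feels like it falls a little outside the assumptions of this analysis, but the price is exceptional.

Since it is a dormitory, everything other than the room itself seems to be shared, as in a share house, but judging from the photos the interior is clean.

Next is a listing that caught my eye personally: a 7-minute walk from Ichikawa Station on the Sobu Line.

It ranks 10th in the bargain ranking. It is an old building, so it does seem to show its age...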

f:id:d_s:20180906221327p:plain

f:id:d_s:20180906221444p:plain

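31,000 yen including rent and the common service fee!

It is an old 1K unit, but considering how close it is to Tokyo and to the nearest station, isn't it a very good deal? On top of that, it is a corner room in a reinforced concrete building.

By the way, looking at the photos of the interior...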

f:id:d_s:20180906221616p:plain

f:id:d_s:20180906221635p:plain

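It is old, but it looks renovated and clean!

And there is no bath to be seen!

I see, so that was the catch (laughs).

Perhaps the top of the bargain ranking is lined with listings that have a catch: distinctive features that the variables do not account for.

Put the other way around, the prediction accuracy may well be good for ordinary listings.

Going back to the scraping stage to add features such as whether there is a bath would be a hassle, but this trial turned out to be an interesting one.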

Next steps

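The rent prediction for rental properties went well. Next time I move on to predicting the prices of income properties. The aim is to check whether the prices of income properties are reasonable.

In fact, I have already scraped 楽待, a listing site for income properties for sale, and built a dataset.

From 楽待 too I collected income properties in Chiba Prefecture to match this rent prediction, so the plan is to forecast income and expenses based on the results here and aim for a DCF-based valuation of property prices.

Since I will be doing DCF, I think it would be interesting to derive the discount rate by using multiple regression or similar to work out the structure of the expected yields of income properties (which could also be called the expected return that investors demand).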
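For readers unfamiliar with DCF, the core calculation is small: each future cash flow is divided by (1 + discount rate) raised to the number of years until it arrives, and the results are summed. The sketch below is a generic illustration with made-up numbers (three yearly net cash flows of 1,000,000 yen and a 5% discount rate); it is not the valuation model planned for the next article, and `dcf_value` is a hypothetical helper name.

```r
# Hypothetical inputs (not taken from any dataset in this post)
cash_flows    <- c(1000000, 1000000, 1000000)  # net cash flow at the end of years 1, 2, 3
discount_rate <- 0.05                          # assumed required return per year

# Present value of a stream of yearly cash flows
dcf_value <- function(cf, r) {
  t <- seq_along(cf)       # year index 1, 2, 3, ...
  sum(cf / (1 + r)^t)      # discount each cash flow back to today and add up
}

pv <- dcf_value(cash_flows, discount_rate)
round(pv)
```

A higher discount rate lowers the present value, which is why the choice of discount rate matters so much for the resulting property price.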

This concludes the rent prediction for rental properties.

Today's code

library(ggplot2)

library(tidyverse)

library(caret)

library(randomForest)

library(ggrepel)

library(psych)

library(kernlab)

library(knitr)

library(corrplot)



options(scipen=100)



# Load the dataset

df <- read.csv("<path to the dataset file>",

               fileEncoding = "cp932", sep = ",")



# Remove the full-width middle dot (・) from the level name with later steps in mind

df$type <- df$type %>% 

  str_replace('テラス・タウンハウス','テラスハウス') %>%

  as.factor()



# Check the data

df %>%

  str()



# Create monthly_cost as rent1 + rent2, then drop rent1 and rent2 (rent1: rent, rent2: common service fee)

df <- df %>%

  mutate(monthly_cost=rent1+rent2) %>%

  select(-rent1, -rent2)



# Correlation check

numericVars <- which(sapply(df, is.numeric)) # positions of the numeric columns

numericVarNames <- names(numericVars)

all_numVar <- df[, numericVars]

cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") 

# Sort by correlation with monthly_cost (rent), highest first

cor_sorted <- as.matrix(sort(cor_numVar[,'monthly_cost'], decreasing = TRUE))

# Keep the variables whose correlation exceeds 0.5 in absolute value

CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))

cor_numVar <- cor_numVar[CorHigh, CorHigh]

# Plot the correlation matrix

corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")







# Count the listings per city and sort by count

df1 <- df %>%

  count(city, name = "Freq") %>%

  arrange(Freq)

df1$city <- with(df1, reorder(city, Freq))



# Plot the number of listings per city in Chiba Prefecture

ggplot(df1,aes(city,Freq,fill = city)) +

  geom_bar(stat="identity", alpha=0.8)+

  ggtitle("千葉県市別物件数")+xlab("")+ylab("物件数")+ 

  # theme_bw() resets the theme, so it has to come before theme()

  theme_bw(base_family = "HiraKakuProN-W3") +

  theme(axis.text.x = element_text(angle = 180, hjust =1))+

  coord_flip() +

  # a single y scale carries both the comma labels and the limits

  scale_y_continuous(labels = scales::comma, limits = c(0, 25000))



# Look at the distribution of rent

hist(df$monthly_cost,xlim = c(0, 200000),breaks = 30, col="#993435")



# Box plot of rent by city

cost_median <- with(df, reorder(city, monthly_cost, median))

par(las=1, cex.axis=0.7, family = "HiraKakuProN-W3")

boxplot(monthly_cost ~ cost_median, data = df,

        xlab = "賃料+共益費", ylab = "",

        main = "市別家賃", varwidth = TRUE, horizontal=TRUE)



par(las=1, cex.axis=0.7, family = "HiraKakuProN-W3")

boxplot(monthly_cost ~ cost_median, data = df,

        xlab = "賃料+管理費", ylab = "",

        main = "市別家賃（〜20万円）", varwidth = TRUE, horizontal=TRUE, ylim = c(0,200000), outline=FALSE)



# Building age by city

age_med <- with(df, reorder(city, age, median))

par(las=1, cex.axis=0.7, family = "HiraKakuProN-W3")

boxplot(age ~ age_med, data = df,

        xlab = "築年数", ylab = "",

        main = "市別築年数", varwidth = TRUE, horizontal=TRUE, ylim = c(0,70), outline=FALSE)



# Rent by building structure

cost_median <- with(df, reorder(construction, monthly_cost, median))

par(las=1, cex.axis=0.7, family = "HiraKakuProN-W3")

boxplot(monthly_cost ~ cost_median, data = df,

        xlab = "賃料+共益費", ylab = "",

        main = "建物構造別家賃", varwidth = TRUE, horizontal=TRUE,ylim = c(0,200000))



# Rent vs. floor area

plot(df$area,df$monthly_cost,

     xlim = c(0, 100),

     ylim = c(0, 200000),pch=".")



# Rent vs. building age

plot(df$age,df$monthly_cost,

     xlim = c(0, 60),

     ylim = c(0, 200000),pch=".")



# Rent vs. distance to the nearest station

plot(df$distance,df$monthly_cost,

     xlim = c(0, 100),

     ylim = c(0, 200000),pch=".")



# Rent vs. floor

plot(df$floor,df$monthly_cost,

     xlim = c(0, 10),

     ylim = c(0, 200000),pch=".")



# rout1 (railway line of the nearest station)

df_rout <- df %>%

  select(rout1) %>%

  table %>%

  as.data.frame() %>%

  arrange(Freq)



# Building structure

df_const <- df %>%

  select(construction) %>%

  table %>%

  as.data.frame() %>%

  arrange(Freq)

df_const



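# Restrict monthly_cost (rent + common service fee) to 200,000 yen or less,

# drop a few line/structure levels, and remove columns not used for prediction

# (df2 is defined here because the plots below already use it)

df2 <- df %>%

  filter(monthly_cost <= 200000) %>%

  filter(rout1 != '成田スカイアクセス') %>%

  filter(construction != 'ブロック') %>%

  filter(construction != '鉄骨プレ') %>%

  select(-c(shikikin, reikin, hoshoukin, station, direction))



# floor is a factor with levels such as "2-3", so convert it to numeric for plotting

# (levels that are not a single number become NA)

floor_num <- suppressWarnings(as.numeric(as.character(df2$floor)))



# Histograms

hist(df2$area, main = '専有面積')

hist(df2$age, main = '築年数')

hist(log(df2$distance), main = '最寄り駅距離(分)')

hist(floor_num, main = '階')

hist(df2$height, main = '物件高さ(階建)')

hist(df2$free_rent, main = 'フリーレント実施有無')

hist(df2$monthly_cost, main = '賃料/月')



# Normal Q-Q plots

qqnorm(df2$area, main = '専有面積')

qqnorm(df2$age, main = '築年数')

qqnorm(log(df2$distance), main = '最寄り駅距離(分)')

qqnorm(floor_num, main = '階')

qqnorm(df2$height, main = '物件高さ(階建)')

qqnorm(df2$free_rent, main = 'フリーレント実施有無')

qqnorm(df2$monthly_cost, main = '賃料/月')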



df3 <- dummyVars(~.,data=df2[-1]) %>%

  predict(df2) %>%

  as.data.frame()



# Split into train and test (80:20)

index <- createDataPartition(df2$monthly_cost, p=.8, list=F)

train <- df2[index,]

test <- df2[-index,]

# Dummy-coded versions of the same split

train_dummy <- df3[index,]

test_dummy <- df3[-index,]

# Training subset for the random forest (training on all of train takes too long)

index2 <- createDataPartition(train$monthly_cost, p=.6, list=F)

train_rf <- train[index2,]



##### Linear Regression #####

lm.1 <- lm(monthly_cost~., data = train_dummy)

lm.step <- step(lm.1, trace = 1)

summary(lm.step)

lm_pred <- predict(lm.step, newdata = test_dummy)

lm_pred_csv <- data.frame(lm_pred)

write.csv(lm_pred_csv, "<output file path of your choice>")



test_sort <- test_dummy %>%

  arrange((monthly_cost))

lm_pred_sort <- predict(lm.step, test_sort)



ggplot(test_sort) + 

  geom_line(aes(x=1:nrow(test_sort),y=lm_pred_sort,color="red"))+ 

  geom_line(aes(x=1:nrow(test_sort),y=monthly_cost,color="blue"))+ 

  #geom_line(aes(x=1:nrow(test),y=df_lm_pred,color="green"))+

  ggtitle("テストデータ vs 予測データ（重回帰）")+xlab("index")+ylab("家賃")+ 

  scale_color_hue(name = "", labels = c("テストデータ","重回帰")) +

  theme(axis.text.x = element_text(angle = 180, hjust =1))+theme_bw(base_family = "HiraKakuProN-W3")+

  ylim(0,200000)



##### Random Forest #####



# Search for mtry with tuneRF: the number of features tried at each split when building each tree

# system.time(

#   tuneRF(train_rf %>% select(-c(name,monthly_cost)), train_rf$monthly_cost,

#          doBest=TRUE, trace=TRUE, plot=TRUE )

#   )# result -> 8

rf <- randomForest(monthly_cost ~ ., data=train_rf[-1], mtry=8)

saveRDS(rf, file="rf_0906")

summary(rf)

plot(rf)

print(rf$importance)

varImpPlot(rf, main="Contribution",n.var = 10)



# Predict on the test data

rf_pred <- predict(rf, test[-1]) # drop the name column (column 1)

rf_pred_csv <- data.frame(rf_pred)

write.csv(rf_pred_csv, "<output file path of your choice>")



# Sort the test data by actual rent and predict on the sorted data

test_sort <- test[-1] %>%

  arrange((monthly_cost))

rf_pred_sort <- predict(rf, test_sort)



ggplot(test_sort) + 

  geom_line(aes(x=1:nrow(test_sort),y=rf_pred_sort,color="red"))+ 

  geom_line(aes(x=1:nrow(test_sort),y=monthly_cost,color="blue"))+ 

  #geom_line(aes(x=1:nrow(test),y=df_lm_pred,color="green"))+

  ggtitle("テストデータ vs 予測データ（ランダムフォレスト）")+xlab("index")+ylab("家賃")+ 

  scale_color_hue(name = "", labels = c("テストデータ","ランダムフォレスト") ) +

  theme(axis.text.x = element_text(angle = 180, hjust =1))+theme_bw(base_family = "HiraKakuProN-W3")+

  ylim(0,200000)



##### Random Forest end #####





##### RMSE #####

sqrt( sum((test$monthly_cost - lm_pred)^2) / length(test$monthly_cost) )

sqrt( sum((test$monthly_cost - rf_pred)^2) / length(test$monthly_cost) )





##### Final prediction #####



df2$pred_rf <- predict(rf, newdata = df2)

df2$diff <- (df2$pred_rf - df2$monthly_cost)

df2$diff.monthly_cost <- df2$diff / df2$monthly_cost

df_sort <- df2 %>%

  arrange(desc(diff.monthly_cost))

kable(head(df_sort[c(1,2,3,4,6,7,8,9,14,15,16,17)],20))