気象庁の「世界の年平均気温偏差」データの回帰分析

はじめに

気象庁の「世界の年平均気温偏差」データから、毎年どれくらい地球が温暖化しているかを調べてみましょう。データは2023年までのものです。

import pandas as pd
import matplotlib.pyplot as plt

URL = "https://www.data.jma.go.jp/cpdinfo/temp/list/csv/an_wld.csv"
df = pd.read_csv(URL, encoding="cp932")
plt.plot(df["年"], df["世界全体"], "o-")
plt.xlabel("年")
plt.ylabel("年平均気温偏差")

この気象庁のデータはときどきデータに注釈記号 * が付いてエラーになることがあります。そのときは気象庁の「世界の年平均気温偏差」データに説明した方法で切り抜けてください。

よく見ると、いったん温暖化傾向が止まりますが、1980年からはほぼ直線的に増加しています。そこで、1980年以降だけを切り出して、よく調べてみましょう。

df1 = df.query("年 >= 1980")
plt.plot(df1["年"], df1["世界全体"], "o-")
plt.xlabel("年")
plt.ylabel("年平均気温偏差")

直線をあてはめて、その傾きを求めてみましょう。

NumPyを使う

NumPy の多項式をフィットする関数 np.polyfit(x, y, 次数) を使うのが一番簡単です。

import numpy as np

slope, intercept = np.polyfit(df1["年"], df1["世界全体"], 1)
print("傾き", slope, "y切片", intercept)

傾き 0.0189 くらいです。毎年 0.0189 度だけ温暖化しています。少ないようですが、100年たつと 1.89 度の上昇になります。

さきほどのグラフに直線を追加してみましょう。

plt.plot(df1["年"], df1["世界全体"], "o-")
plt.xlabel("年")
plt.ylabel("年平均気温偏差")
plt.plot(df1["年"], df1["年"] * slope + intercept)

さらに np.polyfit(df1["年"], df1["世界全体"], 2) とすると、2次式であてはめて、2次の係数、1次の係数、定数項を出力します。何次式でも同様にできます（解が数値的に安定しているかは別問題です）。

pingouinを使う

t検定でも紹介したpingouinパッケージを使ってみましょう。

import pingouin as pg

pg.linear_regression(df1["年"], df1["世界全体"])

       names       coef        se          T  ...        r2    adj_r2   CI[2.5%]  CI[97.5%]
0  Intercept -37.883328  2.241121 -16.903738  ...  0.871427  0.868365 -42.406094 -33.360562
1          年   0.018891  0.001120  16.871930  ...  0.871427  0.868365   0.016632   0.021151

例によって画面の幅が狭いので省略されてしまいました。値を少し丸めてみましょう。

pg.linear_regression(df1["年"], df1["世界全体"]).round(5)

       names      coef       se         T  pval       r2   adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept -37.88333  2.24112 -16.90374   0.0  0.87143  0.86837 -42.40609  -33.36056
1          年   0.01889  0.00112  16.87193   0.0  0.87143  0.86837   0.01663    0.02115

statsmodelsを使う

statsmodels は R と近い統計解析の機能を Python に実装するライブラリです。あらかじめ pip install statsmodels と打ち込んでインストールしておきます。マニュアルの statsmodels.formula.api.ols および Fitting models using R-style formulas を参照してください。

import statsmodels.formula.api as smf

res = smf.ols("世界全体 ~ 年", data=df1).fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   世界全体   R-squared:                       0.871
Model:                            OLS   Adj. R-squared:                  0.868
Method:                 Least Squares   F-statistic:                     284.7
Date:                Wed, 07 Feb 2024   Prob (F-statistic):           2.56e-20
Time:                        14:24:41   Log-Likelihood:                 42.480
No. Observations:                  44   AIC:                            -80.96
Df Residuals:                      42   BIC:                            -77.39
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -37.8833      2.241    -16.904      0.000     -42.406     -33.361
年              0.0189      0.001     16.872      0.000       0.017       0.021
==============================================================================
Omnibus:                        2.175   Durbin-Watson:                   1.555
Prob(Omnibus):                  0.337   Jarque-Bera (JB):                1.575
Skew:                           0.259   Prob(JB):                        0.455
Kurtosis:                       2.231   Cond. No.                     3.15e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.15e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

2次式であてはめるには数式部分を "世界全体 ~ 年 + I(年**2)" とします。

R で同じことをする方法：

URL = "https://www.data.jma.go.jp/cpdinfo/temp/list/csv/an_wld.csv"
df = read.csv(URL, fileEncoding="cp932")
df1 = subset(df, 年 >= 1980)
with(df1, plot(年, 世界全体, type="o", pch=16))
r = lm(世界全体 ~ 年, data=df1)
abline(r)
summary(r)

結果は次のように表示されます：

Call:
lm(formula = 世界全体 ~ 年, data = df1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.157424 -0.075507 -0.009075  0.063431  0.205879 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -37.88333    2.24112  -16.90   <2e-16 ***
年            0.01889    0.00112   16.87   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09431 on 42 degrees of freedom
Multiple R-squared:  0.8714,	Adjusted R-squared:  0.8684 
F-statistic: 284.7 on 1 and 42 DF,  p-value: < 2.2e-16

2次式にするには式の部分を 世界全体 ~ 年 + I(年^2) とします。

SciPyを使う

SciPy の scipy.stats.linregress で単回帰ができます。

from scipy import stats
stats.linregress(df1["年"], df1["世界全体"])

LinregressResult(slope=0.018891472868217057, intercept=-37.88332840028189, rvalue=0.9335024093239134, pvalue=2.5605509195046824e-20, stderr=0.0011196983660876176, intercept_stderr=2.2411213832828665)

個々の値は res = stats.linregress(df1["年"], df1["世界全体"]) して res.slope などとして取り出せます。

scikit-learnを使う

scikit-learn は機械学習で標準的に使われているライブラリです。この中にも線形回帰モデルが入っています。

from sklearn import linear_model

model = linear_model.LinearRegression()
X = df1["年"].values.reshape(-1, 1)
y = df1["世界全体"].values
model.fit(X, y)
print(model.coef_, model.intercept_)

[0.01889147] -37.883328400281876

標準ライブラリstatisticsを使う

Python 3.10 で標準ライブラリ statistics にも線形回帰の関数が追加されました：

import statistics as st

st.linear_regression(df1["年"], df1["世界全体"])

LinearRegression(slope=0.018891472868217053, intercept=-37.88332840028188)