NTU Machine Learning Homework 1

tags: `NTU_ML` `Machine Learning`

How to choose features of data

After observing the training data visualized image, you can be aware of the relationship between the PM2.5 feature and the others.
For instance, the CO image, NO image, NO2 image, and NOx image are much more correlated with PM2.5.
I also choose PM10, WS_HR, RAINFALL, RH, WIND_SPEED, and PM2.5 which you can see here
I used Zscore normalization to implement in my project and can see as below
You can see the different result of using or unusing normalization with the same config. | Epoch | Regression | LR | Feats | Batch Size | Loss Fn. | Opti. | RMSE | Data Filter | Norm. Data | | :——: | :——: | ——– | :——: | :——: | :——: | :——: | :——: | :——: | :——: | | 200 | 1st-order| 0.015 | [1-4, 6-9, 13, 14] | 1024 | MSE | Adam | 2.44623 | Yes|Yes| | 200 | 1st-order| 0.015 | [1-4, 6-9, 13, 14] | 1024 | MSE | Adam | 2.44623 | Yes|No|

Hyperparameter and Preprocessing

All my testing config can be found in Training Result.xlsx
I used a filter to choose valid data and set a threshold by observing the visualized figure of all features.

My takeaway

(Solved->See the last paragraph)Using normalization is not like what I thought. Practically speaking, using normalization can gather all data to a specific area that the model can converge much more rapidly. But, in this case, the result is worse and also appear negative value of the PM2.5 result. According to this page, maybe the normalization method is not suitable in my case.
(Solved->See the last paragraph)I also figured that using the stored weight and bias by my pretrained model is not the right way. I used pickle to store the dump parameters during the training and used the best one as my pretrained parameter. But it’s still not that good enough.
The better way in this project to enhance your accuracy is tuning your training config and select good features.
After discussing with my friend, I figured out the problem and tried to solve it successfully by fitting numpy random seed. Then, the parameter will truly fix but normalization is still not working to help model converging.

Update

2022/12/06 update - Refer to 相關 taught by Dr.李柏堅, I use Pearson Correlation to compute the correlation of each factor and PM2.5 and the result is shown as below. According to the video, |r| < 0.4 is low correlation, 0.4 ≦ |r| < 0.7is medium correlation, and 0.7 ≦ |r| < 1 is high correlation. So, the factor <font color=#FF0000>CO, NO, NO2, NOx, PM10, and SO2</font> are quite suitable as our input data to address this regression problem.

Factor	AMB_TEMP	CO	NO	NO2	NOx	O3	PM10	WS_HR	RAINFALL	RH	SO2	WD_HR	WIND_DIREC	WIND_SPEED
r	-0.176147465	0.659147668	0.227219147	0.554273687	0.51365014	0.233923944	0.818868214	-0.102047405	-0.060801221	-0.081576429	0.361333416	0.171932397	0.137658351	-0.10119696