Hi All,
I am facing an issue with the logistic regression and forecast with logistic regression PAL functions in SAP HANA: the predictions I get are incorrect. My situation is as follows:
I have 10 records indicating whether a customer buys a product, encoded as 0 (does not buy) and 1 (buys).
For example, customer 1001, whose salary is 50000 and age is 45, has bought the product (BOUGHT = 1), whereas customer 1003, with salary 25000 and age 23, has not bought it (BOUGHT = 0).
The data is such that customers with a low salary and a young age do not buy the product, whereas older customers with a higher salary do.
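For reference, logistic regression models the probability of buying as a sigmoid of a linear combination of salary and age, and classifies a customer as a buyer when that probability reaches a threshold. A minimal sketch in Python follows; the coefficient values and function names are made up for illustration and are not the ones PAL returned:

```python
import math

# Hypothetical coefficients, chosen only for illustration (NOT PAL output).
INTERCEPT = -10.0
COEF_SALARY = 0.0001
COEF_AGE = 0.15


def buy_probability(salary, age):
    """Probability of BOUGHT = 1 under a logistic model of salary and age."""
    z = INTERCEPT + COEF_SALARY * salary + COEF_AGE * age
    return 1.0 / (1.0 + math.exp(-z))


def predict_bought(salary, age, threshold=0.5):
    """Classify as 1 (buy) when the probability reaches the threshold."""
    return 1 if buy_probability(salary, age) >= threshold else 0


# Customers 1001 and 1003 from the training data.
print(predict_bought(50000, 45))
print(predict_bought(25000, 23))
```

With these assumed coefficients the older, higher-salary customer falls on the "buy" side of the threshold and the younger, lower-salary customer does not, which is the pattern I expect the fitted model to reproduce.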
The variables that determine whether a customer buys are salary and age. The training data is passed to the logistic regression function, which returns a coefficient table. Below is the code, including the sample data:
SET SCHEMA TRAINING;
-- LOGISTIC REGRESSION PMML DATA code starts
DROP TYPE PAL_T_RG_DATA;
DROP TYPE PAL_T_RG_PARAMS;
DROP TYPE PAL_T_RG_COEFF;
DROP TYPE PAL_T_RG_PMML;
DROP TABLE PAL_RG_SIGNATURE;
DROP TABLE RG_PARAMS;
DROP TABLE RG_COEFF;
DROP TABLE RG_PMML;
CREATE TYPE PAL_T_RG_DATA AS TABLE (CUSTID INTEGER, SALARY DOUBLE, AGE INTEGER, BOUGHT INTEGER);
CREATE TYPE PAL_T_RG_PARAMS AS TABLE (NAME VARCHAR(50), INTARGS INTEGER, DOUBLEARGS DOUBLE, STRINGARGS VARCHAR (100));
CREATE TYPE PAL_T_RG_COEFF AS TABLE (ID INTEGER, AI DOUBLE);
CREATE TYPE PAL_T_RG_PMML AS TABLE (ID INTEGER, PMML VARCHAR(5000));
CREATE COLUMN TABLE PAL_RG_SIGNATURE (ID INTEGER, TYPENAME VARCHAR(100), DIRECTION VARCHAR(100));
TRUNCATE TABLE PAL_RG_SIGNATURE;
INSERT INTO PAL_RG_SIGNATURE VALUES (1, 'TRAINING.PAL_T_RG_DATA', 'in');
INSERT INTO PAL_RG_SIGNATURE VALUES (2, 'TRAINING.PAL_T_RG_PARAMS', 'in');
INSERT INTO PAL_RG_SIGNATURE VALUES (3, 'TRAINING.PAL_T_RG_COEFF', 'out');
INSERT INTO PAL_RG_SIGNATURE VALUES (4, 'TRAINING.PAL_T_RG_PMML', 'out');
SELECT * FROM TRAINING.PAL_RG_SIGNATURE;
GRANT SELECT ON TRAINING.PAL_RG_SIGNATURE TO SYSTEM;
CALL "SYSTEM"."AFL_WRAPPER_ERASER" ('PAL_RG');
CALL SYSTEM.AFL_WRAPPER_GENERATOR ('PAL_RG', 'AFLPAL', 'LOGISTICREGRESSION', "TRAINING"."PAL_RG_SIGNATURE");
CREATE COLUMN TABLE PAL_RG_DATA LIKE PAL_T_RG_DATA;
INSERT INTO PAL_RG_DATA VALUES ('1001','50000','45','1');
INSERT INTO PAL_RG_DATA VALUES ('1002','20000','20','0');
INSERT INTO PAL_RG_DATA VALUES ('1003','25000','23','0');
INSERT INTO PAL_RG_DATA VALUES ('1004','40000','47','1');
INSERT INTO PAL_RG_DATA VALUES ('1005','35000','35','0');
INSERT INTO PAL_RG_DATA VALUES ('1006','75000','50','1');
INSERT INTO PAL_RG_DATA VALUES ('1007','60000','50','1');
INSERT INTO PAL_RG_DATA VALUES ('1008','55000','65','1');
INSERT INTO PAL_RG_DATA VALUES ('1009','20000','20','0');
INSERT INTO PAL_RG_DATA VALUES ('1010','20000','23','0');
SELECT * FROM PAL_RG_DATA;
CREATE COLUMN TABLE RG_PARAMS LIKE PAL_T_RG_PARAMS;
CREATE COLUMN TABLE RG_COEFF LIKE PAL_T_RG_COEFF;
CREATE COLUMN TABLE RG_PMML LIKE PAL_T_RG_PMML;
INSERT INTO RG_PARAMS VALUES ('THREAD_NUMBER', 4, null, null);
INSERT INTO RG_PARAMS VALUES ('MAX_ITERATION', 100, null, null);
INSERT INTO RG_PARAMS VALUES ('EXIT_THRESHOLD', null, 0.00001, null);
INSERT INTO RG_PARAMS VALUES ('VARIABLE_NUM', 2, null, null);
INSERT INTO RG_PARAMS VALUES ('METHOD', 0, null, null);
INSERT INTO RG_PARAMS VALUES ('PMML_EXPORT', 2, null, null);
INSERT INTO RG_PARAMS VALUES ('CATEGORY_COL', 3, null, null);
SELECT * FROM RG_PARAMS;
TRUNCATE TABLE RG_COEFF;
TRUNCATE TABLE RG_PMML;
CALL _SYS_AFL.PAL_RG (PAL_RG_DATA, RG_PARAMS, RG_COEFF, RG_PMML) WITH OVERVIEW;
SELECT * FROM RG_COEFF;
SELECT * FROM RG_PMML;
--Code ends
Logistic regression coefficient table output:
Logistic regression PMML table output:
We have another set of 10 records for which we need to predict whether a customer buys or not.
These records are fed into the forecast with logistic regression function together with the model output of the logistic regression function (the code below passes the contents of the RG_PMML table). Below is the code:
--FORECAST/PREDICTION WITH LOGISTIC REGRESSION code starts
DROP TYPE TRAINING.PAL_T_FRG_PREDICT;
DROP TYPE TRAINING.PAL_T_FRG_CONTROL;
DROP TYPE TRAINING.PAL_T_FRG_COEFF;
DROP TYPE TRAINING.PAL_T_FRG_FITTED;
CREATE TYPE PAL_T_FRG_PREDICT AS TABLE ("CUSTID" INTEGER, "SALARY" DOUBLE, "AGE" INTEGER);
CREATE TYPE PAL_T_FRG_CONTROL AS TABLE (NAME VARCHAR(60), INTARGS INTEGER, DOUBLEARGS DOUBLE, STRINGARGS VARCHAR (100));
CREATE TYPE PAL_T_FRG_COEFF AS TABLE (ID INTEGER, AI VARCHAR(5000));
CREATE TYPE PAL_T_FRG_FITTED AS TABLE ("ID" INTEGER, "FITTED" DOUBLE, "TYPE" INTEGER);
DROP TABLE TRAINING.PAL_FRG_SIGN;
CREATE COLUMN TABLE PAL_FRG_SIGN ("ID" INTEGER, "TYPENAME" VARCHAR(100), "DIRECTION" VARCHAR(100));
TRUNCATE TABLE TRAINING.PAL_FRG_SIGN;
INSERT INTO PAL_FRG_SIGN VALUES ('1','TRAINING.PAL_T_FRG_PREDICT','IN');
INSERT INTO PAL_FRG_SIGN VALUES ('2','TRAINING.PAL_T_FRG_CONTROL','IN');
INSERT INTO PAL_FRG_SIGN VALUES ('3','TRAINING.PAL_T_FRG_COEFF','IN');
INSERT INTO PAL_FRG_SIGN VALUES ('4','TRAINING.PAL_T_FRG_FITTED','OUT');
GRANT SELECT ON TRAINING.PAL_FRG_SIGN TO SYSTEM;
CALL SYSTEM.AFL_WRAPPER_ERASER('PAL_FRLGR_PROC');
CALL SYSTEM.AFL_WRAPPER_GENERATOR('PAL_FRLGR_PROC','AFLPAL','FORECASTWITHLOGISTICR',TRAINING.PAL_FRG_SIGN);
DROP TABLE TRAINING.PAL_FRG_PREDICT;
CREATE COLUMN TABLE PAL_FRG_PREDICT LIKE TRAINING.PAL_T_FRG_PREDICT;
TRUNCATE TABLE TRAINING.PAL_FRG_PREDICT;
INSERT INTO PAL_FRG_PREDICT VALUES ('1011','48000','44');
INSERT INTO PAL_FRG_PREDICT VALUES ('1012','18000','22');
INSERT INTO PAL_FRG_PREDICT VALUES ('1013','28000','25');
INSERT INTO PAL_FRG_PREDICT VALUES ('1014','35000','30');
INSERT INTO PAL_FRG_PREDICT VALUES ('1015','50000','50');
INSERT INTO PAL_FRG_PREDICT VALUES ('1016','25000','27');
INSERT INTO PAL_FRG_PREDICT VALUES ('1017','50000','52');
INSERT INTO PAL_FRG_PREDICT VALUES ('1018','70000','67');
INSERT INTO PAL_FRG_PREDICT VALUES ('1019','40000','47');
INSERT INTO PAL_FRG_PREDICT VALUES ('1020','25000','42');
DROP TABLE TRAINING.PAL_FRG_CONTROL;
CREATE COLUMN TABLE PAL_FRG_CONTROL LIKE PAL_T_FRG_CONTROL;
TRUNCATE TABLE TRAINING.PAL_FRG_CONTROL;
INSERT INTO PAL_FRG_CONTROL VALUES ('THREAD_NUMBER',8,null,null);
INSERT INTO PAL_FRG_CONTROL VALUES ('CATEGORY_COL',3,null,null);
INSERT INTO PAL_FRG_CONTROL VALUES ('MODEL_FORMAT',1,null,null);
DROP TABLE TRAINING.PAL_FRG_COEFF;
CREATE COLUMN TABLE PAL_FRG_COEFF LIKE PAL_T_FRG_COEFF;
TRUNCATE TABLE TRAINING.PAL_FRG_COEFF;
INSERT INTO TRAINING.PAL_FRG_COEFF SELECT * FROM TRAINING.RG_PMML;
DROP TABLE TRAINING.PAL_FRG_FITTED;
CREATE COLUMN TABLE PAL_FRG_FITTED LIKE PAL_T_FRG_FITTED;
TRUNCATE TABLE TRAINING.PAL_FRG_FITTED;
CALL _SYS_AFL.PAL_FRLGR_PROC (TRAINING.PAL_FRG_PREDICT,TRAINING.PAL_FRG_CONTROL,TRAINING.PAL_FRG_COEFF,TRAINING.PAL_FRG_FITTED) WITH OVERVIEW;
SELECT * FROM TRAINING.PAL_FRG_FITTED;
--Code ends
Predicted/Fitted table output from Forecast with logistic regression function:
The expected prediction is:
CUSTID | SALARY | AGE | Expected BOUGHT |
1011 | 48000 | 44 | 1 |
1012 | 18000 | 22 | 0 |
1013 | 28000 | 25 | 0 |
1014 | 35000 | 30 | 0 |
1015 | 50000 | 50 | 1 |
1016 | 25000 | 27 | 0 |
1017 | 50000 | 52 | 1 |
1018 | 70000 | 67 | 1 |
1019 | 40000 | 47 | 1 |
1020 | 25000 | 42 | 1 |
The prediction returned by the logistic regression and forecast with logistic regression functions does not match the expected values above.
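One property of the training data may be worth checking here. The sketch below, in Python, copies the ten training rows from the INSERT statements and tests whether a single threshold on salary or on age already splits the two classes. The helper names are hypothetical, and whether this property explains the PAL output is only an assumption, not something I have confirmed:

```python
# Training rows copied from the INSERT statements: (CUSTID, SALARY, AGE, BOUGHT).
TRAINING = [
    (1001, 50000, 45, 1),
    (1002, 20000, 20, 0),
    (1003, 25000, 23, 0),
    (1004, 40000, 47, 1),
    (1005, 35000, 35, 0),
    (1006, 75000, 50, 1),
    (1007, 60000, 50, 1),
    (1008, 55000, 65, 1),
    (1009, 20000, 20, 0),
    (1010, 20000, 23, 0),
]


def separating_gap(rows, column):
    """Return (max of class 0, min of class 1) for one feature column
    if a single threshold separates the classes, otherwise None."""
    class0 = [row[column] for row in rows if row[3] == 0]
    class1 = [row[column] for row in rows if row[3] == 1]
    max0, min1 = max(class0), min(class1)
    return (max0, min1) if max0 < min1 else None


salary_gap = separating_gap(TRAINING, 1)  # column 1 = SALARY
age_gap = separating_gap(TRAINING, 2)     # column 2 = AGE
print("salary gap:", salary_gap)
print("age gap:", age_gap)
```

Both checks find a gap (salary between 35000 and 40000, age between 35 and 45), so the training data is perfectly separable. In that situation the maximum-likelihood coefficients of a logistic regression are not finite, and an iterative solver typically stops at large values that depend on the iteration limit and exit threshold. Note also that customer 1020 (salary 25000, age 42) has a salary on the "not buy" side and an age inside the gap, so the training data alone does not determine the expected value of 1 for that row.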
Can anybody help me achieve the correct prediction using logistic regression?
Thanks and Regards,
M.N.Adinarayanan