Hi,
I originally posted this question as a comment on Blag's blog post (http://scn.sap.com/community/developer-center/hana/blog/2013/02/18/when-sap-hana-met-r--whats-new), but I guess this forum is the more appropriate place to ask (of course, I am grateful for Blag's replies!):
I have a very fundamental question concerning R and HANA: I have been working with an AWS HANA instance, and I was not allowed to install Rserve (or R) on that instance. In a production HANA environment, is it possible to install R on the same machine as HANA? Or is there some restriction from SAP's side concerning what may run on the appliance machine?
With my AWS instance, HANA and R had to communicate over a TCP connection. What I have seen in my tests is that both HANA and R are incredibly efficient, but the internet connection, and especially the transfer of "big data", slows computations down enormously (to my understanding, this would not be much different even with a super-fast Ethernet connection).
I was just wondering how HANA/R scales and performs on truly big data sets. I have done my tests mainly on several dozen MB of data on AWS. Would that also work for, say, 10-100 GB? In other words, would you recommend HANA over any other DB in this case, when doing analytics with R? My concern is that the performance gain from using an in-memory, column-based DB (rather than a disk-based one) seems small when all the data has to be forwarded over a TCP connection to R for analysis, which looks like the real bottleneck here. Also, running HANA on a separate instance from the one where R is located does not seem to fit the general HANA philosophy of moving application logic to the DB in order to get the most out of the in-memory performance advantages and real-time capabilities.
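To make the transfer concern concrete, here is a rough back-of-envelope sketch in Python. It only computes the idealized wire time for moving a data set of a given size over a link of a given speed. The link speeds (100 Mbit/s, 1 Gbit/s, 10 Gbit/s) and the `efficiency` factor are my own assumptions for illustration, not measurements from my AWS tests, and the function name `transfer_seconds` is made up for this post. Serialization on the HANA side and parsing on the R side are not included and would only add to these numbers.

```python
def transfer_seconds(size_gb: float, link_mbit_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Idealized time in seconds to move size_gb gigabytes over a network link.

    size_gb          -- payload size in decimal gigabytes (1 GB = 10**9 bytes)
    link_mbit_per_s  -- nominal link speed in megabits per second (assumed value)
    efficiency       -- assumed fraction of the nominal speed actually achieved
    """
    size_mbit = size_gb * 8 * 1000  # GB -> megabits
    return size_mbit / (link_mbit_per_s * efficiency)


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only to show the order of magnitude.
    links = (("100 Mbit/s", 100), ("1 Gbit/s", 1000), ("10 Gbit/s", 10000))
    for size_gb in (10, 100):
        for name, mbit in links:
            seconds = transfer_seconds(size_gb, mbit)
            print(f"{size_gb:>4} GB over {name:>10}: {seconds:8.0f} s")
```

Under these assumptions, 10 GB over an idealized 1 Gbit/s link already needs about 80 seconds of pure transfer time, before R has computed anything, which is why the connection rather than the database looks like the limiting factor to me.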
The second thing I noticed is that PAL seems to be an alternative for common data mining tasks. I tried it with the same algorithm (k-means), and it worked very nicely. To my understanding, PAL is implemented in C/C++. Is it possible to extend it by adding functions of my own in C/C++ that run on the HANA instance, e.g. machine learning algorithms?
Did I get something wrong here?
Thank you in advance,
regards,
Georg
PS: My background is in data mining rather than SAP / BI/BW, so I may have missed some facts that are obvious to SAP veterans.