Speed up file creation #75
Yes, this library is pretty slow, or I should say, I think PHP itself is slow. With the library you can pass your data line by line or all at once. Passing everything at once should be faster than one line at a time, but it consumes a lot of RAM. Do you have enough RAM in production? You can try to find ways to speed up the library, but that would be complex and the gain probably won't be satisfying. The other alternative is PSPP. With PSPP there are two good variants, depending on your actual conditions:

1. Generate a CSV file with PHP, and also generate an SPSS syntax file from PHP that reads the CSV you just generated and builds the .sav file, specifying everything else that is needed but is not in the CSV. Then tell PSPP to execute the syntax. To be clear, this is about twice as fast, but not extremely fast.

2. If your database is in PostgreSQL, you may be in luck. Debian used to compile PSPP with a plugin that allowed it to connect directly to PostgreSQL and extract the data: https://www.gnu.org/software/pspp/manual/html_node/GET-DATA-_002fTYPE_003dPSQL.html They no longer compile it this way in Debian, due to a licensing problem with an SSL library the feature depended on, but you can try to compile it yourself with that plugin if your distro does not already include it. In Debian:

I don't think you will find a faster option than the ones I propose. PSPP's save-file writer is the most robust and fastest free implementation you will find, but don't expect miracles either.
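A minimal sketch of variant 1, assuming made-up file and variable names (`data.csv`, `convert.sps`, `id`/`age`/`answer`) — in a real setup PHP would generate both files from your survey metadata:

```shell
#!/bin/sh
# Sketch of variant 1 with placeholder names; in practice PHP
# would write both data.csv and convert.sps.

# A tiny example CSV standing in for the real export.
cat > data.csv <<'EOF'
id,age,answer
1,34,"yes"
2,51,"no"
EOF

# Syntax file: read the CSV, add the metadata the CSV cannot
# carry (labels, formats), and save the result as a .sav file.
cat > convert.sps <<'EOF'
GET DATA
  /TYPE=TXT
  /FILE='data.csv'
  /ARRANGEMENT=DELIMITED
  /DELIMITERS=','
  /QUALIFIER='"'
  /FIRSTCASE=2
  /VARIABLES=id F8.0 age F3.0 answer A40.
VARIABLE LABELS id 'Respondent id' age 'Age in years'.
SAVE OUTFILE='data.sav'.
EOF

# Run the syntax if PSPP is installed.
if command -v pspp >/dev/null; then
  pspp convert.sps
fi
```

The point of the design is that PHP only does cheap CSV streaming, while the expensive .sav encoding happens in PSPP's C implementation.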
So I recommend you start the generation of the .sav file on the server, register an id for the process in a database, and record the progress of the process there. Then, on the client, implement a procedure that periodically checks the state of the process using that id. At least, that is how I do it. I don't use this library for writing; I use PSPP instead. I also have another PHP library that writes data in the STATA format (https://www.stata.com/) and yet another that writes the XBase format (https://en.wikipedia.org/wiki/.dbf). Both libraries are much faster than this one.
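The background-job-plus-polling pattern described above can be sketched as follows; this is only an illustration with plain files standing in for the database rows, and all names (`/tmp/export-demo.status`, the percentages) are made up:

```shell
#!/bin/sh
# Server side: start the export in the background and record its
# progress in a status file (a database row in the real setup).
status_file="/tmp/export-demo.status"

(
  echo "running 0" > "$status_file"
  for pct in 25 50 75; do
    sleep 1                       # stands in for writing a chunk of cases
    echo "running $pct" > "$status_file"
  done
  echo "done 100" > "$status_file"
) &

# Client side: poll the status until the job reports done.
while ! grep -q '^done' "$status_file" 2>/dev/null; do
  sleep 1
done
cat "$status_file"                # → done 100
```

In the thread's actual setup the client would hit an HTTP endpoint that reads the progress row by id instead of a file, but the control flow is the same.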
Wow, thanks for the awesome answers! I can see it uses around 250 MB of RAM (I use generators to keep memory as low as possible), but luckily only 3 users of the system can export the .sav files, so chances are it won't be a problem. I actually have PSPP installed, but I originally installed it with Flatpak. I will try compiling it myself.
If I can just make it twice as fast, that would be quite an improvement! Regarding the frontend: I plan to use websockets/SSE to inform the user when the file is ready (and perhaps the progress). Can SPSS read STATA and XBase, and do they have all the same features as .sav?
Yes, well, you could hand the user a CSV file plus a syntax file that converts the CSV to .sav, but the problem then is that you don't know the absolute path where the user will save the files. I think SPSS unfortunately doesn't work with relative paths.
XBase is not a database format originally intended for statistical analysis, so it does not have the same features as SPSS. STATA is statistics software with features similar to SPSS, which is why its database format has more or less the same features. There is software to convert between several statistical database formats, such as Stat/Transfer: https://stattransfer.com/, but SPSS can also import STATA and dBase (XBase) data directly and easily. Still, I don't recommend generating a database in a non-default format, because you would be combining the limitations of each format. Every format has its own limitations (for example, the maximum size of what you can store in a label). I need all of these formats because I have several clients: some want the database in XBase/dBase (.dbf) to work with Microsoft Visual FoxPro, and others use STATA instead of SPSS to analyze the data. Also CSV, because some use R. Our surveys are the official health surveys of the government. There are a lot of researchers involved, with different backgrounds and different interests, who use different software to do the same thing.
It sounds like the kind of task that would be most efficient to do in a database manager like Postgres. But I don't see the point: PSPP executes a query and gets the result, and nothing stops you from executing a query that also formats and filters the data.
Oh, I had hoped there would be an "import cases from csv" option in the menu or something. But they can still download the files, put them next to each other, and then merge them. I have been playing around with PSPP and I can see I can write instructions in a file and pipe it into PSPP with cat. I will give this a shot and see how fast it is. Just need to create some test files :) I just need to extract the data from a survey system and convert it into a .sav file. It is just for my colleagues, who all use SPSS. I will stick to writing a .sav file then 👍
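The piping trick mentioned above can be sketched like this; `run.sps` is a placeholder name, and the real file would hold the `GET DATA` / `SAVE OUTFILE` commands for the export rather than a trivial `ECHO`:

```shell
#!/bin/sh
# Write PSPP commands to a syntax file (placeholder content here;
# in practice this file is generated by the PHP side).
printf "ECHO 'syntax ran'.\n" > run.sps

if command -v pspp >/dev/null; then
  cat run.sps | pspp     # feed the syntax in on stdin, as described
  pspp run.sps           # or simply pass the file as an argument
fi
```

Passing the file as an argument avoids the extra `cat` process, but both forms run the same syntax non-interactively.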
Oh well, if your database is not already in Postgres, it makes no sense to move it there just for this. I can tell you about our case: we have our own system that is prepared for several things, and what we do is connect our system to the survey system we use. For example, one public system we use, and that I can talk about, is CSPro. CSPro has a server-side component, CSWeb, that stores the data in JSON format inside a MySQL/MariaDB database. We connect to the CSWeb MariaDB database and trigger a procedure whenever CSWeb inserts data into MariaDB. The procedure calls a PHP script that extracts the newly inserted data and reformats it. The reformatted data is then sent to our survey and reporting system. So we don't care whether the survey is conducted in an external survey system; in the end, all data is always sent to our own system, regardless of the source. That is why we can use several survey systems at the same time and provide the same service no matter what.
I need to make a sav file with around 200 variables and 7500 cases (this is the largest export currently).
The problem is that it is currently taking around 10-11 minutes to create the file on my dev machine. I fear it will be even slower on production.
Has anyone found a clever way of speeding it up?
I was hoping to somehow split it up and run it in parallel, but sadly merging the files seems quite hard (I know the files could be merged in SPSS later, but I have been asked to deliver a single SPSS file that can be downloaded and used).
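For reference, the merge step that makes the parallel approach hard in PHP is short in SPSS/PSPP syntax, so it could also be done server-side. A hedged sketch with placeholder file names (`part1.sav`, `part2.sav` standing in for the chunks parallel workers would produce):

```shell
#!/bin/sh
# Sketch: concatenate the cases of several .sav chunks into one
# file with the ADD FILES command. File names are placeholders.
cat > merge.sps <<'EOF'
ADD FILES
  /FILE='part1.sav'
  /FILE='part2.sav'.
SAVE OUTFILE='merged.sav'.
EOF

# Only run when PSPP and both chunk files are actually present.
if command -v pspp >/dev/null && [ -f part1.sav ] && [ -f part2.sav ]; then
  pspp merge.sps
fi
```

This only works if every chunk has an identical dictionary (same variables, formats, and labels), which is the case when all chunks are written by the same exporter.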