Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up file creation #75

Open
sinnbeck opened this issue Feb 14, 2024 · 6 comments
Open

Speed up file creation #75

sinnbeck opened this issue Feb 14, 2024 · 6 comments

Comments

@sinnbeck
Copy link

I need to make a sav file with around 200 variables and 7500 cases (this is the largest export currently).
The problem is that it is currently taking around 10-11 minutes to create the file on my dev machine. I fear it will be even slower on production.

Has anyone found a clever way of speeding it up?

I was hoping to somehow split it up running parallel but sadly merging files seems quite hard (I know the files could be merged in spss later but I have been asked to just deliver a single spss file that can be downloaded and used).

@sinnbeck sinnbeck changed the title Speed up file creating Speed up file creation Feb 14, 2024
@lestcape
Copy link
Collaborator

lestcape commented Feb 14, 2024

Yes, this library is pretty slow, or i should said, I think that PHP is slow. With the library, you can pass your data line by line or all at the same time. If you pass the data all at the same time, this should be faster than only passing one line at time, but it will consume a lot of RAM. Have you enough RAM in production?

Anyway, you can try to find how to speed up the library, but that should be complex and probably the acceleration will be not enough satisfying to you.

The other alternative you have is to use PSPP. With PSPP there are two variants, that good, they depends on your real conditions.:

1- Generate a CSV file with PHP and also generate a syntax SPSS file from PHP, which reads the CSV file that you just generated and builds the base in SAV, already specifying everything else that is needed, but that is not in the CSV. You can then tell PSPP to execute the syntax. Let me clarify that this is about twice as fast, but not extremely fast.

2- If your database is in PostgreSQL, you are probably in luck. In Debian, they previously compiled PSPP with a plugin that allowed you to connect directly to PostgreSQL and extract data: https://www.gnu.org/software/pspp/manual/html_node/GET-DATA-_002fTYPE_003dPSQL.html

They no longer compile it like this in Debian, due to a copyright problem with an SSL library on which the solution depended, but you can try to compile it yourself with that plugin, if you don't already have it incorporated into your distro. In Debian:

Captura desde 2024-02-14 14-24-39

I don't think you can find a faster option than the ones I propose. The implementation of PSPP save file writings is the most robust and fastest free implementation you will find, but don't really expect miracles either.

@lestcape
Copy link
Collaborator

lestcape commented Feb 14, 2024

So, i recommend you to start the generation of the sav in the server, register an id of the process in a database and add there information about the progress of the process. Then in the client implement a procedure that check periodically for the state of the process using the id, that is in your database. At less is how i do that. I don't use this library for writing, I use PSPP instead.

And well, I also have another library in PHP to write the data in the STATA (https://www.stata.com/) format and another more that write the data in XBase format (https://en.wikipedia.org/wiki/.dbf). Both libraries are more faster than this one.

@sinnbeck
Copy link
Author

Wow thanks for the awesome answers!

It can see it uses around 250 MB of ram (I use generators to keep memory as low as possible), but luckily the system only have 3 users that can export the .sav files, so chances are it wont be a problem.
I have implemented both versions and (single write and writeCase()) and the single write is indeed a bit faster but uses more ram. The good thing about the cases is that I can echo our which case I am currently adding.

I actually have pspp installed but I installed it with flatpak originally. I will try compiling it myself

  1. This sounds like an awesome idea. I has suggested to our users that I could just output both a .sav file without data along with a csv file or an xlsx file. They could then merge them inside SPSS. But if I can do this in code, that would be much much better! I hope I can pipe in the instructions.
  2. Sadly it doesn't and either way, the data needs to formatted and filtered before being saved to .sav

If I can just make it twice as fast then that would be quite an improvement!

Regarding the frontend. I plan to either use websockets/SSE to inform the user when the file is ready (and perhaps the progress if I use writeCase()) or just mail the user a link to the file once it is done :)

Can SPSS read STATA and XBase and do they have all the same features as .sav ?

@lestcape
Copy link
Collaborator

lestcape commented Feb 15, 2024

I has suggested to our users that I could just output both a .sav file without data along with a csv file or an xlsx file.

Yes well, you can give a csv file and a syntax file for convert the csv to sav, but the problem then will be that you don't know the absolute url where the user will save the files. I think SPSS don't work with relative urls unfortunately.

Can SPSS read STATA and XBase and do they have all the same features as .sav ?

XBase is not a database originally intended for statistics analyzed, so it have not the same features of SPSS. STATA is a statistics software that have similar feature than SPSS, that why his database have more less the same features. There are software for convert several statistics database in another, like can be stattransfer: https://stattransfer.com/, but well, SPSS can import easy and directly the data in the STATA and the dBase (XBase) format. But i don't recommended generate a database in a non default format, because what you will do is to add the limitations of each formats. All formats have their own limitations (example: the max size of what you can stored in a label).

I need all that formats, because i have several clients and some want the database in XBase - DBase (dbf), for work with Microsoft Visual FoxPro and another's because they use STATA to analyze the data instead of SPSS. Also CSV, because some use R. Our surveys are the officially health surveys of the government. There are a lot of researchers with different formations and with different interested involved, that used different software also to do the same thing.

Sadly it doesn't and either way, the data needs to formatted and filtered before being saved to .sav

It sounds like the most efficient task to do in a database manager like Postgres. I don't see the point, PSPP execute a Query and get the result. It don't limited you to execute a query that also formats and filters the data.

@sinnbeck
Copy link
Author

Oh I had hoped there would be an "import cases from csv" thing in the menu or something. But they can still download it and put the files next to each other and then merge them using ADD FILES or UPDATE I assume.

I have been playing around with PSPP and I can see I can write instructions in a file and pipe it into PSPP with cat. I will give this a shot and see how fast it is. Just need to to create some test files :)

I just need to extract the data from a survey system, and convert it into a sav file. It is just for my colleagues who all use SPSS. I will stick to writing to a sav file then 👍

It sounds like the most efficient task to do in a database manager like Postgres.
It sounds good in theory but I would need to get all the data into the proper structure and format, insert it into postgres and then ask pspp to extract it again. Sadly the survey system isnt inhouse so I cannot just change databases :(

@lestcape
Copy link
Collaborator

lestcape commented Feb 15, 2024

O well, if your database is not in postgres already, have not sense move it to postgres to just do that.

I can tell you about our case:

We have our own system that is prepared for several things and what we do is connected our System with the survey system that we used. For example, one public system that we used and i can tell about it is the case of CSPro:

CSPro have a CSWeb server side, that store the data in a json format, inside a MySQL/MariaDb database. We have connected the CSWeb MariaDb database and trigger a procedure when CSWeb insert the data in MariaDB. The procedure call a PHP script that extract the current inserted data and reformat it. The reformatted data is send to our survey and reporting system. So, we don't care if the survey is realized in an external survey system, all data are finally send to our own system always, without matter the source. That is why we can used several survey system at same time and do a service in the same way not matter what.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants