Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rozetka 3 actors added #533

Open
wants to merge 5 commits into
base: trunk
Choose a base branch
from
Open

Conversation

vladyslav-n
Copy link

These are 3 new typescript actors for Hlidac Rozetka Project by Geniusee with edits after previous pull request attempt.

@rarous
Copy link
Collaborator

rarous commented Apr 14, 2021

Please do not use custom editorconfig, eslintrc nor prettierrc. Also, please, do not use TypeScript. Actors have to be directly usable without any compilation steps.

@vladyslav-n
Copy link
Author

vladyslav-n commented Apr 14, 2021 via email

@vladyslav-n
Copy link
Author

Oh, I see Hlidac has dist/ directory in .gitignore, so that I missed that dist directory hasn’t actually been pushed to the repo, sorry for that. I could simply rename it to ‘build/’ so that it will be pushed for sure. In that case will typescript still need to be replaced with javascript? Thanks for your time!

@vladyslav-n
Copy link
Author

Hey @rarous,
I ported Rozetka actors to javascript, please, check them out.
Have a nice day!

Copy link
Collaborator

@metalwarrior665 metalwarrior665 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. I haven't check in-depth everything nor I tested if it works. Please look at the comments. Thanks.

actors/rozetka-count/Dockerfile Outdated Show resolved Hide resolved
actors/rozetka-count/src/main.js Outdated Show resolved Hide resolved
actors/rozetka-count/src/main.js Outdated Show resolved Hide resolved
await crawler.run();
log.info('Crawl finished.');

await Apify.pushData({ OUTPUT: await getOrIncStatsValue() })
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zpelechova This is how the dataset should look like?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zpelechova Since there were no details on the output of the count actor provided in specs, it would be great to know what the correct way of doing this is. I saw an example in another actor as { totalCount: value }. Is this the way it should be done?

actors/rozetka-count/src/tools.js Outdated Show resolved Hide resolved
@@ -0,0 +1,42 @@
export const LABELS = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This second actor looks like most of the code is copy pasted from the first one. That will make it harder for maintenance. I would use a single folder for those and just changed the behaviour via input or env var. cc @rarous

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it look like having separate actors for count functionality is not good idea.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it could be multiple actors but definitely should not be multiple folders. But I will leave that up to you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, I would be the most efficient to do the count of the results and scraping the data in parallel, considering the crawling logic there has to be very similar. I just don't understand clearly how we should handle the output in case of having one actor for both purposes (though there are plenty ways how handle this, for example we could save the results to separate named datasets for count the the daily actors). But the reuse of the code is great, and I'm very glad that I'm allowed to do that here.
So, to sum up, which way should I choose — separate actors with some shared code or a single actor (some more specs about the output should be provided here then)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have already common library for reusable code, but it should be for code that is reusable for all/most of actors.

This case should be IMHO handled just by type: "COUNT" Input parameter. It will be one actor in more modes (we already have this for Black Friday scraping in older actors). It will be scheduled with different input parameters. This mode will just write to Dataset but skip the Keboola upload step.

Sorry, for inconvenience, we are still figuring out the process and shape - count functionality is new requirement for internal benchmarking of scraped data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @rarous 's approach, will be the simplest

Copy link
Collaborator

@zpelechova zpelechova Apr 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I guess it is not properly explained in the docs, but I dont think it makes sense to have an actor four count which does the same as the main actor. The idea behind it is to double check the result, i.e. find and resonably use the numbers which tels how many items there are in each category, like with rozetka here:
Ноутбуки

@vladyslav-n
Copy link
Author

@rarous @metalwarrior665 Hi guys!
I suppose, the code is ready for the 2-nd round of the code review)

Copy link
Collaborator

@metalwarrior665 metalwarrior665 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding my points, it is done. But @rarous needs to check it works with the system and @zpelechova needs to check the output.

@janfiedler
Copy link
Contributor

@vladyslav-n I tried run actor with "type" = "COUNT" & "type" = "DAILY" and I am getting quite frequently errors:

ArgumentError: Expected property string "url" to be a URL, got "/laboratornoe-osnashchenie/c4644808/page=11/" in object "requestLike" at Object.ow [as default] (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/ow/dist/index.js:19:23)
      at RequestQueue.addRequest (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/storages/request_queue.js:173:21)
      at enqueueLastPage (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/routes/helpers/enqueueLastPage.js:17:24)
      at countProductsOrSplitPriceRange (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/routes/helpers/countProductsOrSplitPriceRange.js:21:15)
      at handleProductList (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/routes/handleProductList.js:43:15)
      at CheerioCrawler.handlePageFunction [as userProvidedHandler] (file:///Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/actors/rozetka-daily/src/main.js:68:27)
      at CheerioCrawler._handleRequestFunction (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/crawlers/cheerio_crawler.js:452:49)
      at processTicksAndRejections (node:internal/process/task_queues:96:5)
      at async CheerioCrawler._runTaskFunction (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/crawlers/basic_crawler.js:423:13)
      at async AutoscaledPool._maybeRunTask (/Users/janfiedler/Work/TopMonks/GitHub/hlidac-eshopu/node_modules/apify/build/autoscaling/autoscaled_pool.js:399:13)

It is look like, this happen only when part of url is /page=x/

@vladyslav-n
Copy link
Author

@janfiedler Hi!
Seems like there's an issue with relative pathes in urls, will look at it and give some feedback tomorrow.

@vladyslav-n
Copy link
Author

@janfiedler Hi!
Sorry for the delay. Made some updates to the crawling of the actor and also explicitly added a new allowed content-type to the actor — seems like it failed the actor to work at all in both modes COUNT and DAILY. The crawling changes were made due to some changes in the site categories logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants