Merge pull request #63 from dplocki/57-reorganize-readmemd
57 reorganize readmemd
dplocki authored Apr 10, 2024
2 parents 7aeae42 + b0cfb26 commit 3655337
Showing 1 changed file with 109 additions and 86 deletions.
195 changes: 109 additions & 86 deletions README.md
@@ -6,97 +6,117 @@
[![Downloads](https://img.shields.io/pypi/dm/podcast-downloader.svg)](https://pypi.python.org/pypi/podcast-downloader)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

The Python module for downloading files from given RSS feeds.
It is not using database of any sort. It require configuration file.
A Python module designed for downloading files from given RSS feeds, particularly podcasts.
It does not use any sort of database but requires a configuration file.

The script is analyzing the directory where it put the previously downloaded files.
It is compering the last added file with the rss feed, finding the missing ones, and downloading them.
The script is intended to be run periodically. Upon starting, it analyzes the directory where it previously stored downloaded files.
It then compares these files with those listed in the RSS feed, identifying any missing ones and downloading them.

As name suggested, the script is designed for podcasts. The files searched by default are `mp3`.
By default, the script looks for `mp3` files.

The result of using the [example below](#configuration), on empty directories, will be:

```log
dplocki@ghost-wheel:~$ python -m podcast_downloader
[2024-04-08 21:19:10] Loading configuration (from file: "~/.podcast_downloader_config.json")
[2024-04-08 21:19:15] Checking "The Skeptic Guide"
[2024-04-08 21:19:15] Last downloaded file "<none>"
[2024-04-08 21:19:15] The Skeptic Guide: Downloading file: "https://traffic.libsyn.com/secure/skepticsguide/skepticast2024-04-06.mp3" saved as "skepticast2024-04-06.mp3"
[2024-04-08 21:19:41] Checking "The Real Python Podcast"
[2024-04-08 21:19:41] Last downloaded file "<none>"
[2024-04-08 21:19:41] The Real Python Podcast: Downloading file: "https://chtbl.com/track/92DB94/files.realpython.com/podcasts/RPP_E199_03_Calvin.eef1db4d6679.mp3" saved as "[20240405] rpp_e199_03_calvin.eef1db4d6679.mp3"
[2024-04-08 21:20:04] Finished
```

The result:

```
dplocki@ghost-wheel:~$ tree podcasts/
podcasts/
├── RealPython
│   └── [20240405] rpp_e199_03_calvin.eef1db4d6679.mp3
└── SGTTU
    └── skepticast2024-04-06.mp3

2 directories, 2 files
```

## Setup

### Installation from PyPI
Installation from PyPI:

```bash
pip install podcast_downloader
```

## Running the script

The script [require configuration file](#configuration) in order to work.
After installation, the script can be called as any Python module:
The script [requires a configuration file](#configuration) in order to work.
After installation, the script can be run as any Python module:

```bash
python -m podcast_downloader
```

### In action

Using the [example above](#example), the result will be:
It is also possible to run the script with a given configuration file:

```log
[2020-06-16 19:54:35] Loading configuration (from file: "~/.podcast_downloader_config.json")
[2020-06-16 19:54:35] Checking "The Skeptic Guide"
[2020-06-16 19:54:35] Last downloaded file "skepticast2020-06-13.mp3"
[2020-06-16 19:54:39] The Skeptic Guide: Nothing new
[2020-06-16 19:54:39] ------------------------------
[2020-06-16 19:54:39] Finished
```bash
python -m podcast_downloader --config my_config.json
```

## Configuration

### The configuration file

The configuration file is placed in home directory.

The name: `.podcast_downloader_config.json`. The file is format in [JSON](https://en.wikipedia.org/wiki/JSON). The expected encoding is [utf-8](https://en.wikipedia.org/wiki/UTF-8).

The configuration file placement can be specified by [script argument](#script-arguments).

### An example of configuration file
An example of a configuration file:

```json
{
    "if_directory_empty": "download_from_4_days",
    "podcasts": [
        {
            "name": "Python for dummies",
            "rss_link": "http://python-for-dummies/atom.rss",
            "path": "~/podcasts/PythonForDummies"
        },
        {
            "name": "The Skeptic Guide",
            "rss_link": "https://feed.theskepticsguide.org/feed/rss.aspx",
            "path": "~/podcasts/SGTTU"
        },
        {
            "rss_link": "https://realpython.com/podcasts/rpp/feed",
            "path": "~/podcasts/RealPython",
            "file_name_template": "[%publish_date%] %file_name%.%file_extension%"
        }
    ]
}
```

### The configuration file

By default, the configuration file is placed in the home directory. Its file name is `.podcast_downloader_config.json`.

The config file is formatted as [JSON](https://en.wikipedia.org/wiki/JSON). The expected encoding is [utf-8](https://en.wikipedia.org/wiki/UTF-8).

The path to the configuration file can be specified by a [script argument](#script-arguments).

### The settings hierarchy

The script will replace default values by read from configuration file.
Those will be cover by all values given by command line.
The script replaces default values with those read from the configuration file.
Those are in turn overridden by values given on the command line.

```
command line parameters > configuration file > default values
command line parameters > configuration file > default values
```
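The precedence above can be illustrated with a minimal sketch. It is not the script's actual code: the `resolve_settings` helper, the `DEFAULTS` dictionary, and the use of `None` to mean "no limit" or "option not passed" are assumptions made for this example.

```python
from collections import ChainMap

# Hypothetical defaults, loosely based on the options table in this README.
# `None` stands in for "no limit" here; the real script may represent it differently.
DEFAULTS = {
    "downloads_limit": None,
    "if_directory_empty": "download_last",
    "fill_up_gaps": False,
}


def resolve_settings(defaults, from_file, from_command_line):
    """Combine the three sources; earlier maps in the ChainMap win on lookup."""
    # Ignore command-line options that were not passed (assumed to arrive as None).
    given = {key: value for key, value in from_command_line.items() if value is not None}
    return dict(ChainMap(given, from_file, defaults))


settings = resolve_settings(
    DEFAULTS,
    {"if_directory_empty": "download_from_4_days", "downloads_limit": 10},
    {"downloads_limit": 3, "if_directory_empty": None},
)
print(settings["downloads_limit"])     # 3 (command line wins)
print(settings["if_directory_empty"])  # download_from_4_days (from the file)
print(settings["fill_up_gaps"])        # False (default)
```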

### The main options

| Property | Type | Required | Default | Note |
|:---------------------|:----------:|:--------:|:--------------------------------------:|:-----|
| `downloads_limit` | number | no | infinity | |
| `if_directory_empty` | string | no | download_last | See [In case of empty directory](#in-case-of-empty-directory) |
| `podcast_extensions` | key-value | no | `{".mp3": "audio/mpeg"}` | See [File types filter](#file-types-filter) |
| `podcasts` | subsection | yes | `[]` | See [Podcasts sub category](#podcasts-sub-category) |
| `http_headers` | key-value | no | `{"User-Agent": "podcast-downloader"}` | See [HTTP request headers](#http-request-headers) |
| `fill_up_gaps` | boolean | no | false | See [Download files from gaps](#download-files-from-gaps) |
| Property | Type | Required? | Default | Note |
|:---------------------|:----------:|:---------:|:--------------------------------------:|:-----|
| `downloads_limit` | number | no | infinity | |
| `if_directory_empty` | string | no | download_last | See [In case of empty directory](#in-case-of-empty-directory) |
| `podcast_extensions` | key-value | no | `{".mp3": "audio/mpeg"}` | See [File types filter](#file-types-filter) |
| `podcasts` | subsection | yes | `[]` | See [Podcasts sub category](#podcasts-sub-category) |
| `http_headers` | key-value | no | `{"User-Agent": "podcast-downloader"}` | See [HTTP request headers](#http-request-headers) |
| `fill_up_gaps` | boolean | no | false | See [Download files from gaps](#download-files-from-gaps) |

### Podcasts sub category

`Podcasts` is the part of configuration file where you provide the array of objects with fallowing content:
The `podcasts` segment is the part of the configuration file where you provide an array of objects with the following content:

| Property | Type | Required | Default | Note |
|:---------------------|:----------:|:--------:|:--------------------------------------:|:-----|
@@ -113,71 +133,74 @@ Those will be cover by all values given by command line.

### HTTP request headers

Some servers may don't like how the urllib is presenting itself to them (the HTTP User-Agent header). This may lead into problems like: `urllib.error.HTTPError: HTTP Error 403: Forbidden`. That is way, there is a possibility to present the script client as something else.
Some servers may not like how urllib presents itself to them (the HTTP User-Agent header). This may lead to errors such as `urllib.error.HTTPError: HTTP Error 403: Forbidden`. For that reason, the script can pose as something else by specifying the HTTP headers used when downloading files.

There is an option to specify HTTP headers when downloading files.
You can provide them using the `http_headers` value in the configuration file.
The option value should be a dictionary where each header is presented as a key-value pair, with the key being the header title and the value being the header value.
Use the `http_headers` option in the configuration file. The value should be a dictionary object where each header is a key-value pair: the key is the header name and the value is the header value.

Default value: `{"User-Agent": "podcast-downloader"}`. Providing any value for `http_headers` will override the default value.
By default the value is `{"User-Agent": "podcast-downloader"}`. Providing anything else for `http_headers` overrides all the default values (they are not merged).

Podcast `http_headers` will be merged with the global `http_headers`. In case of a conflict (same key name), the vale from podcast sub-configuration will override the global one.
On the other hand, `http_headers` in a podcast sub-configuration is merged with the global `http_headers`. In case of a conflict (same key name), the value from the podcast sub-configuration overrides the global one.

Example:

```json
{
"http_headers": {
"User-Agent": "podcast-downloader"

},
"podcasts": [
{
"name": "Unu Podcast",
"rss_link": "http://www.unupodcast.org/feed.rss",
"path": "~/podcasts/unu_podcast",
"name": "Unua Podcast",
"rss_link": "http://www.unuapodcast.org/feed.rss",
"path": "~/podcasts/unua_podcast",
"http_headers": {
"User-Agent": "Mozilla/5.0"
}
},
{
"name": "Dua Podcast",
"rss_link": "http://www.duapodcast.org/feed.rss",
"path": "~/podcasts/dua_podcast",
"http_headers": {
"Authorization": "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
}
}
]
}
```

In this example, the Unua Podcast will be downloaded with just the header `User-Agent: Mozilla/5.0`, and the Dua Podcast with `User-Agent: podcast-downloader` and `Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==`.
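The merge rules described in this section can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: `effective_headers` and `DEFAULT_HEADERS` are hypothetical names, and the configuration is assumed to be already loaded into plain dictionaries.

```python
# Default taken from the options table; the names below are hypothetical.
DEFAULT_HEADERS = {"User-Agent": "podcast-downloader"}


def effective_headers(global_config, podcast_config):
    """Return the headers used for one podcast."""
    # A global `http_headers` value replaces the default outright (no merge).
    global_headers = global_config.get("http_headers", DEFAULT_HEADERS)
    # Podcast-level headers are merged on top; on a key conflict the podcast wins.
    return {**global_headers, **podcast_config.get("http_headers", {})}


global_config = {"http_headers": {"User-Agent": "podcast-downloader"}}
unua = {"http_headers": {"User-Agent": "Mozilla/5.0"}}
dua = {"http_headers": {"Authorization": "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="}}

print(effective_headers(global_config, unua))
# {'User-Agent': 'Mozilla/5.0'}
print(effective_headers(global_config, dua))
# {'User-Agent': 'podcast-downloader', 'Authorization': 'Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=='}
```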


## Script arguments

The script accept following command line arguments:
The script accepts the following command line arguments:

| Short version | Long name | Parameter | Default | Note |
|:--------------|:-----------------------|:-------------------:|:-----------------------------------:|:-----|
| | `--config` | string | `~/.podcast_downloader_config.json` | The placement of the configuration file |
| | `--downloads_limit` | number | infinity | The maximum number of downloaded mp3 files |
| | `--if_directory_empty` | string | `download_last` | The general approach on empty directory |

## Adding date to file name

If an RSS channel doesn't follow a single, constant naming convention, the script may work incorrectly. The solution is to force files to have a common and meaningful prefix. The script is able to add the date at the beginning of the downloaded file name.

Use the [File name template](#file-name-template) with the `%publish_date%` option.

## File name template

Use to change the name of downloaded file after its downloading.
Use it to adjust the file name after downloading.

Default value (the `%file_name%.%file_extension%`) will simple save up the file as it was uploaded by original creator. The file name and its extension is taken from the link to podcast file.
The default value (`%file_name%.%file_extension%`) simply saves the file under the name given by its original creator. The file name and its extension are based on the link to the podcast file.

Template values:

| Name | Notes |
|:-------------------|:--------------------------------------------------------|
| `%file_name%` | The file name taken from link, without extension |
| `%file_extension%` | The extension for the file, taken from link |
| `%file_name%` | The file name from the link, without extension |
| `%file_extension%` | The extension for the file, from link |
| `%publish_date%` | The publish date of the RSS entry |
| `%title%` | The title of the RSS entry |
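As a rough illustration of how these template values could be filled in, here is a minimal sketch. The `render_file_name` function is hypothetical, and it does not reproduce any extra normalization the script may apply to file names (for example lowercasing).

```python
import posixpath
from datetime import date
from urllib.parse import urlparse


def render_file_name(template, link, publish_date, title):
    """Fill the template placeholders from the link and the RSS entry data."""
    # Split the last path segment of the link into name and extension.
    base_name = posixpath.basename(urlparse(link).path)
    file_name, extension = posixpath.splitext(base_name)
    values = {
        "%file_name%": file_name,
        "%file_extension%": extension.lstrip("."),
        "%publish_date%": publish_date.strftime("%Y%m%d"),  # default date format
        "%title%": title,
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


print(render_file_name(
    "[%publish_date%] %file_name%.%file_extension%",
    "https://files.realpython.com/podcasts/RPP_E199_03_Calvin.eef1db4d6679.mp3",
    date(2024, 4, 5),
    "Episode 199",
))
# [20240405] RPP_E199_03_Calvin.eef1db4d6679.mp3
```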

### Non default the publish_date

The `%publish_date%` by default gives result in format `YEARMMDD`. In order to change the date you can provide the new format after the colon (the `:` character). The script respect the codes [of the 1989 C standard](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes), but the percent sign (`%`) must be replaced by dollar sign (`$`). This is because of my unfortunate decision to use the percent character as marker of the code.
### Non-default the publish_date

By default, `%publish_date%` gives the result in the format `YEARMMDD`. To change it, provide a new format after a colon (the `:` character). The script respects the codes [of the 1989 C standard](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes), but the percent sign (`%`) must be replaced by a dollar sign (`$`). This is because of my unfortunate decision to use the percent character as the marker of the code.

| The standard code | The script code | Notes |
|:------------------|:----------------|:-------------------------------------------|
@@ -192,20 +215,19 @@ The `%publish_date%` by default gives result in format `YEARMMDD`. In order to c
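The dollar-sign convention for `%publish_date%` can be sketched like this. It is only an illustration: the `render_publish_date` helper is hypothetical, and the exact placement of the format inside the placeholder (`%publish_date:$Y-$m-$d%`) is an assumption based on the description of the colon syntax.

```python
import re
from datetime import date

# Matches `%publish_date%` or `%publish_date:<format>%` (assumed syntax).
PUBLISH_DATE = re.compile(r"%publish_date(?::([^%]*))?%")


def render_publish_date(template, publish_date):
    """Replace the publish date placeholder, honoring an optional custom format."""

    def replace(match):
        custom = match.group(1)
        # `$` stands in for `%` in the strftime codes; the default is YYYYMMDD.
        date_format = custom.replace("$", "%") if custom else "%Y%m%d"
        return publish_date.strftime(date_format)

    return PUBLISH_DATE.sub(replace, template)


print(render_publish_date("[%publish_date:$Y-$m-$d%] %file_name%", date(2024, 4, 5)))
# [2024-04-05] %file_name%
print(render_publish_date("[%publish_date%] %file_name%", date(2024, 4, 5)))
# [20240405] %file_name%
```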

## File types filter

Podcasts are mostly stored as `*.mp3` files. By default Podcast Downloader will look just for them.
Podcasts are mostly stored as `*.mp3` files. By default, Podcast Downloader looks just for them, ignoring all other types.

If your podcast support other types of media files, you can precised your own podcast file filter, by providing extension for the file (like `.mp3`), and type of link in RSS feed itself (for `mp3` it is `audio/mpeg`).
If your podcast supports other types of media files, you can specify the file filters. Provide the extension of the file (like `.mp3`) and the type of the link in the RSS feed itself (for `mp3` it is `audio/mpeg`).

If you don't know the type of the file, you can check the RSS file. Seek for `enclosure` tags, should looks like this:
If you don't know the type of the file, you can look for it in the RSS file. Look for `enclosure` tags, which should look like this:

```xml
<enclosure
url="https://an.apple.supporter.page/podcast/episode23.m4a"
length="14527149"
type="audio/x-m4a" />
<enclosure url="https://www.vidocast.url/podcast/episode23.m4a"
length="14527149"
type="audio/x-m4a" />
```

Notes: the dot on the file extension is require.
**Note**: the dot in the file extension is required.
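If you prefer not to read the RSS file by hand, a short helper can list the extension and type pairs it contains. The sketch below uses only the Python standard library; the `enclosure_types` function and the sample feed are made up for this example and are not part of the script.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# A made-up, minimal feed used only for this example.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <item>
    <title>Episode 23</title>
    <enclosure url="https://www.vidocast.url/podcast/episode23.m4a"
               length="14527149" type="audio/x-m4a" />
  </item>
</channel></rss>"""


def enclosure_types(rss_text):
    """Map each file extension found in the feed to its enclosure type."""
    found = {}
    for enclosure in ET.fromstring(rss_text).iter("enclosure"):
        url = enclosure.get("url", "")
        extension = posixpath.splitext(urlparse(url).path)[1]  # keeps the dot
        if extension:
            found[extension] = enclosure.get("type", "")
    return found


print(enclosure_types(RSS_SAMPLE))  # {'.m4a': 'audio/x-m4a'}
```

The resulting dictionary has the same shape as the `podcast_extensions` value: the extension (with the dot) as the key and the link type as the value.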

### Example

@@ -218,37 +240,38 @@ Notes: the dot on the file extension is require.

## In case of empty directory

If a directory for podcast is empty, the script needs to recognize what to do. Due to lack of database, you can:
If the directory for a podcast is empty, the script needs to know what to do. Because there is no database, you have to choose one of the following:

* [download all episodes from feed](#download-all-from-feed)
* [download only the last episode](#only-last)
* [download last n episodes](#download-n-last-episodes)
* [download only the last episode](#download-last)
* [download last n episodes](#download-last-n-episodes)
* [download all new episodes from the last n days](#download-all-from-n-days)
* [download all new episodes since the day after the last episode should have appeared](#download-all-episode-since-last-excepted)

Default behavior is: `download_last`

### Download all from feed

The script will download all episodes from the feed.

Set by `download_all_from_feed`.

### Only last
### Download last

The script will download only the last episode from the feed.
It is a good approach when you wish to start listening to the podcast.
It is also the default approach of the script.

Set by `download_last`.

### Download last n episodes

The script will download exactly given number of episodes from the feed.
The script will download exactly the given number of episodes from the feed.

Set by `download_last_n_episodes`. The *n* must be replaced by number of episodes, which you wanted to have downloaded. For example: `download_last_5_episodes` means that five last episodes will be downloaded.
Set by `download_last_n_episodes`. Replace *n* with the number of episodes you want downloaded. For example, `download_last_5_episodes` means that the five most recent episodes will be downloaded.

### Download all from n days

The script will download all episodes which appear in last *n* days. I can be use when you are downloading on regular schedule.
The script will download all episodes which appeared in the last *n* days. It can be used when you are downloading on a regular schedule.
The *n* number is given within the setup value: `download_from_n_days`. For example: `download_from_3_days` means download all episodes from last 3 days.
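The option values described so far embed the number *n* in the value itself. A minimal sketch of how such values could be recognized is shown below; `parse_if_directory_empty` is a hypothetical helper, not the script's own parser, and it covers only the four values named above.

```python
import re


def parse_if_directory_empty(value):
    """Return a (mode, n) pair for the option values described above."""
    if value in ("download_all_from_feed", "download_last"):
        return value, None
    match = re.fullmatch(r"download_last_(\d+)_episodes", value)
    if match:
        return "download_last_n_episodes", int(match.group(1))
    match = re.fullmatch(r"download_from_(\d+)_days", value)
    if match:
        return "download_from_n_days", int(match.group(1))
    raise ValueError(f"Unsupported if_directory_empty value: {value!r}")


print(parse_if_directory_empty("download_last_5_episodes"))  # ('download_last_n_episodes', 5)
print(parse_if_directory_empty("download_from_3_days"))      # ('download_from_n_days', 3)
```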

### Download all episode since last excepted
@@ -282,13 +305,13 @@ Examples:

## Download files from gaps

The script recognizes the stream of downloaded files (based on the feed). By default, the last downloaded file (according to the feed) marks the start of downloading. In case of gaps, situations where there are missing files before the last downloaded one, the script will ignore them by default. However, there is a possibility to change this behavior to download all missing files between already downloaded ones. To enable this, you need to set the `fill_up_gaps` value to **true**. It's important to note that the script will not download files before the first one (according to the feed).
The script recognizes the stream of downloaded files (based on the feed data). By default, the last downloaded file (according to the feed) marks the start of downloading. Gaps, that is, files missing before the last downloaded one, are ignored by default. However, this behavior can be changed to download all missing files between already downloaded ones. To enable this, set the `fill_up_gaps` value to **true**. Note that the script will not download files from before the first downloaded one (according to the feed), that is, before the earliest downloaded episode.

Default value: `false`.
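The difference between the two modes can be shown with a small sketch. It is an interpretation of the description above, not the script's code: `files_to_download` is a hypothetical function, and the feed is assumed to be a list of file names ordered from oldest to newest.

```python
def files_to_download(feed_files, downloaded, fill_up_gaps=False):
    """Pick missing files; feed_files is ordered oldest first."""
    positions = [index for index, name in enumerate(feed_files) if name in downloaded]
    if not positions:
        # Empty directory: that case is governed by `if_directory_empty` instead.
        return []
    # Start at the last downloaded file, or at the first one when filling gaps.
    start = positions[0] if fill_up_gaps else positions[-1]
    return [name for name in feed_files[start:] if name not in downloaded]


feed = ["ep1.mp3", "ep2.mp3", "ep3.mp3", "ep4.mp3", "ep5.mp3"]
have = {"ep2.mp3", "ep4.mp3"}

print(files_to_download(feed, have))                     # ['ep5.mp3']
print(files_to_download(feed, have, fill_up_gaps=True))  # ['ep3.mp3', 'ep5.mp3']
```

In both cases `ep1.mp3` is skipped, because it precedes the earliest downloaded file.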

## The analyze of the RSS feed

The script is look through all the `items` nodes in RSS file. The `item` node can contain the `enclosure` node. Those nodes are used to passing the files. According to the convention the single `item` should contain only one `enclosure`, but script (as [the library used](https://pypi.org/project/feedparser/) under it) can handle the multiple files attached into podcast `item`.
The script looks through all the `item` nodes in the RSS file. An `item` node can contain an `enclosure` node. Those nodes are used to attach files. By convention, a single `item` should contain only one `enclosure`, but the script (like [the library it uses](https://pypi.org/project/feedparser/)) can handle multiple files attached to a podcast `item`.

## Converting the OPML
