Organizations

@commoncrawl @bigscience-workshop @oscar-project

pjox/README.md

Hi there 👋

I'm a Senior Research Scientist at the Common Crawl Foundation.

I am interested in large corpora for training language models, especially for under-resourced and historical languages. I am also interested in tasks such as Named Entity Recognition (NER), dependency parsing, part-of-speech tagging, machine translation, and document structuring.

I love coffee ☕️, cookies 🍪 and maths.

Pinned

  1. commoncrawl/cc-downloader Public

    A polite and user-friendly downloader for Common Crawl data

    Rust 10 1

  2. oscar-utils Public

    A new set of utilities to work with the OSCAR Corpus

    Rust 2

  3. oscar2parquet Public

    Converts OSCAR's jsonl files into parquet

    Rust 2

  4. oscar-project/ungoliant Public

    🕷️ The pipeline for the OSCAR corpus

    Rust 163 14

  5. oscar-project/goclassy Public archive

    An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

    Go 86 6

  6. oscar-project/oscar-website Public

    The website of the Oscar Project

    TeX 11 14