'How can I download all tags in the 'Posts' table?

If I write "select Tags from Posts WHERE Tags IS NOT NULL" on site https://data.stackexchange.com I can only download 50,000 rows. How can I get all rows?



Solution 1:[1]

This doesn't full solve your problem, but at least it phrases the logic in a more useful way.

You are querying the wrong table for this type of problem. Instead:

with pt as (
      select pt.tagid, pt.postid, t.tagname
      from posttags pt join
           tags t
           on pt.tagid = t.id
     )
select pt.tagname, pt2.tagname, count(*)
from pt join
     pt pt2
     on pt.postid = pt2.postid and
        pt.tagname < pt2.tagname
group by pt.tagname, pt2.tagname
order by count(*) desc;

You are still limited to 50,000 rows, but you are letting the database do more of the work.

The Stack Overflow SEDE is limited to 50,000 rows. That is simply how it works. A few years ago, they moved the data over to BigQuery -- unfortunately modifying the schema a bit (I would be really curious to know why).

In any case, you can return large result sets on BigQuery. The equivalent query is:

#standardSQL

with pt as (
      select pq.*, tag
      from `bigquery-public-data.stackoverflow.posts_questions` pq cross join
           unnest(split(pq.tags, '|')) tag
     )
select pt.tag as tag1, pt2.tag as tag2, count(*) as cnt
from pt join
     pt pt2
     on pt2.id = pt.id and pt.tag < pt2.tag
group by 1, 2
order by count(*) desc;

This returns all 4,689,465 rows of results -- which you can save in a BigQuery table or export to a file and move to another environment.

You can signup for a BigQuery account and run this.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Peter Mortensen