

Reddit data extractor code
I am attempting to webscrape from Reddit using the R package RedditExtractoR. Specifically, I am using reddit_urls() to return results from Reddit with the search term "president".

I first created an object links499 that (should) contain 499 pages worth of URLs containing the term "president", together with a companion object holding the associated comments:

    links499 <- reddit_urls(search_terms = "president", ...)     # full call in the reproducible code below
    links499Com <- get_reddit(search_terms = "president", ...)

Each of these objects had the same number of unique URL titles (n=239), and both only returned URLs with a very high number of comments (the lowest of which was 12,378). This makes sense, because I am pulling URLs from Reddit in order of decreasing number of comments.

I next wanted to return an even larger number of matched URLs for the search term "president" from Reddit. I thought this could be accomplished by simply increasing the page_threshold parameter, so I (unsuccessfully) tried the same code, only now searching through 1,000 pages worth of URLs:

    links1000 <- reddit_urls(search_terms = "president", ...)
    links1000Com <- get_reddit(search_terms = "president", ...)

I thought links1000 would contain URLs with the search term "president" from the 1,000 pages with the largest number of comments (whereas links499 would contain URLs with the search term "president" from the 499 pages with the largest number of comments). However, links1000 and links499 were identical. Moreover, links1000Com could not be created and threw an error: URL '': status was 'Failure when receiving data from the peer'.

My question is: how would I next obtain all URLs (and their associated comments)? Not just for the top 499 or top 1,000 pages, but continuing until all URLs with the search term "president" on Reddit have been returned?

As suggested, I am adding reproducible code below.

    links499 <- reddit_urls(
        search_terms   = "president",
        cn_threshold   = 0,     # minimum number of comments
        page_threshold = 499    # can probably get as many URLs as you want, but you can only extract a certain amount of data at one time
    )
    # Have the same number of unique titles (n=239)

    links1000 <- reddit_urls(
        search_terms   = "president",
        cn_threshold   = 0,     # minimum number of comments
        page_threshold = 1000   # can probably get as many URLs as you want, but you can only extract a certain amount of data at one time
    )
    # Have the same number of unique titles (n=241)

So, looking at the code for get_reddit and reddit_urls, you will see that get_reddit is a wrapper for reddit_urls and that the defaults are simply different between the two functions.
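To see exactly how those defaults differ on your installed version, one quick check is to compare the formal arguments of the two functions with base R. This assumes only that both functions are exported by RedditExtractoR, which the code above already relies on:

    library(RedditExtractoR)

    # Print each function's arguments and their default values.
    formals(reddit_urls)
    formals(get_reddit)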

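For a sense of what "get_reddit is a wrapper" means in practice, here is a minimal sketch of the two-step equivalent: collect thread URLs first, then pull their comments. It assumes your version of RedditExtractoR exposes reddit_content() and that reddit_urls() returns a data frame with a URL column; older releases did, but check ?get_reddit and ?reddit_content on your version before relying on this.

    library(RedditExtractoR)

    # Step 1: collect thread URLs matching the search term.
    # A small page_threshold keeps this test run short.
    urls <- reddit_urls(
        search_terms   = "president",
        cn_threshold   = 0,
        page_threshold = 2
    )

    # Step 2: pull the comments for those threads.
    # reddit_content() and the URL column name are assumptions about the
    # package version; adjust if yours names them differently.
    comments <- reddit_content(urls$URL)

    # get_reddit() effectively chains these two steps, just with its own defaults.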