Daily Dose of Data: October 2018

Friday, October 12, 2018

configuring postgresql server on raspberry pi

I always burn time re-figuring out how to do this.

postgres config files are in /etc/postgresql/9.6/main
edit postgresql.conf to listen on all interfaces (listen_addresses = '*'); the port setting defaults to 5432. restart postgres afterwards.
verify with netstat -nlt that you're listening on port 5432 to addresses other than localhost.
edit pg_hba.conf to allow the user to connect (host all all 192.168.1.0/24 md5)
in the past I've had issues because I created the database user without a password, so I've needed to do an "alter user pi password '<password>'"
verify with dbGetQuery(con,"select current_time") in R (RPostgreSQL), or cur.execute("select current_time") followed by cur.fetchone() in Python (psycopg2)
this post probably says it all better.
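
A rough sketch of the two verification steps, run from another machine on the network. The function names are mine, and the host, database name, user, and password are whatever your setup uses. port_is_open only needs the standard library; server_time assumes psycopg2 is installed.

```python
import socket


def port_is_open(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, timed out, unreachable, or the name did not resolve
        return False


def server_time(host, dbname, user, password, port=5432):
    """Connect with psycopg2 and return the result of select current_time."""
    # imported here so the port check above works without psycopg2 installed
    import psycopg2

    con = psycopg2.connect(host=host, port=port, dbname=dbname,
                           user=user, password=password)
    try:
        with con.cursor() as cur:
            cur.execute("select current_time")
            return cur.fetchone()[0]
    finally:
        con.close()
```

If port_is_open returns False for the Pi's address, the listen_addresses setting (or a firewall) is the likely culprit. If the port is open but server_time raises an error, look at pg_hba.conf and the user's password instead.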

Wednesday, October 3, 2018

harvesting craigslist posts to look for regional differences in brand and vehicle class

Python script running on RaspberryPi to pull Craigslist RSS and parse into PostgreSQL database. Let this run as a cron job every 30 minutes for a week so you can get several thousand records:

#!/usr/bin/python3

import threading, psycopg2, feedparser
from psycopg2.extras import execute_values

craigslist_feeds = [
    "https://detroit.craigslist.org/search/cto?format=rss",
    "https://seattle.craigslist.org/search/cto?format=rss",
    "https://boston.craigslist.org/search/cto?format=rss",
    "https://washingtondc.craigslist.org/search/cto?format=rss",
    "https://houston.craigslist.org/search/cto?format=rss",
    "https://losangeles.craigslist.org/search/cto?format=rss",
    "https://stlouis.craigslist.org/search/cto?format=rss",
    "https://up.craigslist.org/search/cto?format=rss"
]

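def fetch_rss(url):
    # one connection per thread; fill in your own dbname and user
    con = psycopg2.connect(dbname="",user="")
    try:
        with con.cursor() as cur:
            rss = feedparser.parse(url)
            # keep only the title and link of each post
            parsed_data = [(i["title"],i["link"]) for i in rss["entries"]]
            execute_values(cur,"insert into craigslist (title,link) values %s",parsed_data)
        con.commit()
    finally:
        # always release the connection, even if the fetch or insert fails
        con.close()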

for i in craigslist_feeds:
    t = threading.Thread(target=fetch_rss,args=(i,))
    t.start()

Downloaded EPA fuel economy data from here. All I really needed from this data is the make, model year, and vehicle class information.

The problem is that Craigslist titles aren't consistent, and are sometimes missing information. It's not uncommon, for example, to see an ad for a "2013 F150" instead of "2013 Ford F150". So how do you determine the make and vehicle class from an incomplete title? The answer is to use machine learning (k-nearest neighbors or naive Bayes) to find the most likely make and class given a Craigslist title.

So I need to take the EPA data and train a model, then predict for each Craigslist title. Ongoing project.
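
To make the idea concrete, here is a minimal sketch of the naive Bayes variant in plain Python. It is not the finished model: the training rows are a handful of made-up make/model strings standing in for the EPA data, and the tokenizer and class name are my own choices.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # lowercase alphanumeric tokens, e.g. "2013 F150" -> ["2013", "f150"]
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over word tokens with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()             # label -> number of training rows
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_rows = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # tokens never seen in training carry no information, so drop them
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_rows in self.class_counts.items():
            counts = self.token_counts[label]
            total = sum(counts.values())
            score = math.log(n_rows / total_rows)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha)
                                  / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Placeholder training rows (hypothetical stand-ins for EPA make + model strings)
texts = ["Ford F150 Pickup 2WD", "Ford Mustang", "Toyota Camry",
         "Toyota Tacoma 4WD", "Honda Civic", "Honda Accord"]
labels = ["Ford", "Ford", "Toyota", "Toyota", "Honda", "Honda"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("2013 F150"))
print(model.predict("tacoma 4x4 runs great"))
```

The same class would predict vehicle class instead of make by swapping the label column. With real data the vocabulary gets large, so a library implementation is probably the better choice at that point.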

Monday, October 1, 2018

ubuntu 18.04 Hadoop, Spark, sparklyr installation

Despite instructions, I spent some time on this because of Java compatibility issues.

following instructions here, install oracle java 8. Do the update-alternatives thing and make sure java 8 is your default.
following instructions here install hadoop.
follow these instructions to install spark
in R, install.packages("sparklyr")
you should be able to fire up the sparklyr tutorial at this point.
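
Since the Java version was what tripped me up, a quick check that Java 8 really is the default can save time. This is only a sketch with function names of my own. It relies on two things: java -version prints to stderr, and Java 8 and earlier report themselves as 1.x (for example 1.8.0_181).

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier use the 1.x scheme, so "1.8.0_181" means Java 8
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def default_java_major():
    """Run the default `java` on the PATH and return its major version."""
    try:
        result = subprocess.run(["java", "-version"],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True)
    except FileNotFoundError:
        return None  # no java on the PATH at all
    # the version banner goes to stderr, not stdout
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    major = default_java_major()
    if major == 8:
        print("default java is 8")
    else:
        print("default java is", major, "- rerun update-alternatives")
```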
