Friday, October 12, 2018

configuring postgresql server on raspberry pi

I always burn time re-figuring out how to do this.

  1. postgres config files are in /etc/postgresql/9.6/main
  2. edit postgresql.conf to listen on all addresses, not just localhost (listen_addresses = '*'); the port stays at the default 5432
  3. verify with netstat -nlt that you're listening on port 5432 to addresses other than localhost.
  4. edit pg_hba.conf to allow the user to connect (host all all 192.168.1.0/24 md5)
  5. in the past I've had issues because the user was created without a password, so I've needed to run an "alter user pi password '<password>'"
  6. verify with dbGetQuery(con, "select current_time") in R (RPostgreSQL), or cur.execute("select current_time") from Python (psycopg2); see the sketch after this list
  7. this post probably says it all better.
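
A quick way to run that check from Python (a minimal sketch; the host, dbname, user, and password are placeholders for whatever you configured above):

import psycopg2

# placeholder connection values -- substitute your own
con = psycopg2.connect(host="192.168.1.50", dbname="pi", user="pi", password="<password>")
cur = con.cursor()
cur.execute("select current_time")
print(cur.fetchone())
cur.close()
con.close()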

Wednesday, October 3, 2018

harvesting craigslist posts to look for regional differences in brand and vehicle class

Here's a Python script running on a Raspberry Pi that pulls the Craigslist RSS feeds and parses them into a PostgreSQL database.  Let it run as a cron job every 30 minutes for a week so you can collect several thousand records:

#!/usr/bin/python3

import threading, psycopg2, feedparser
from psycopg2.extras import execute_values

craigslist_feeds = [
    "https://detroit.craigslist.org/search/cto?format=rss",
    "https://seattle.craigslist.org/search/cto?format=rss",
    "https://boston.craigslist.org/search/cto?format=rss",
    "https://washingtondc.craigslist.org/search/cto?format=rss",
    "https://houston.craigslist.org/search/cto?format=rss",
    "https://losangeles.craigslist.org/search/cto?format=rss",
    "https://stlouis.craigslist.org/search/cto?format=rss",
    "https://up.craigslist.org/search/cto?format=rss"
]

def fetch_rss(url):
    con = psycopg2.connect(dbname="",user="")  # fill in your database name and user
    cur = con.cursor()
    rss = feedparser.parse(url)
    parsed_data = [(i["title"],i["link"]) for i in rss["entries"]]  
    execute_values(cur,"insert into craigslist (title,link) values %s",parsed_data)
    con.commit()
    cur.close()
    con.close()
    return

for i in craigslist_feeds:
    t = threading.Thread(target=fetch_rss,args=(i,))
    t.start()
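
The script assumes the craigslist table already exists.  A minimal one-time setup sketch, assuming the table needs only the two columns the script inserts:

import psycopg2

con = psycopg2.connect(dbname="", user="")  # same connection details as the script
cur = con.cursor()
# assumed minimal schema -- just the two columns inserted above
cur.execute("create table if not exists craigslist (title text, link text)")
con.commit()
cur.close()
con.close()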

I downloaded EPA fuel economy data from here.  All I really needed from this data was the make, model year, and vehicle class.

The problem is that Craigslist titles aren't consistent and are sometimes missing information.  It's not uncommon, for example, to see an ad for a "2013 F150" instead of "2013 Ford F150".  So how do you determine the make and vehicle class from an incomplete title?  The answer is to use machine learning (k-NN, naive Bayes) to find the most likely make and class given a Craigslist title.

So I need to take the EPA data and train a model, then predict a make and class for each Craigslist title.  Ongoing project.
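
A minimal sketch of the idea, assuming scikit-learn and pandas; the file name and column names below are placeholders for whatever the EPA download actually uses:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder file and column names for the EPA fuel economy data
epa = pd.read_csv("vehicles.csv")

# learn the most likely make from word counts of the model text
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(epa["model"], epa["make"])

# predict a make for an incomplete Craigslist title
print(clf.predict(["2013 F150"]))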

Monday, October 1, 2018

ubuntu 18.04 Hadoop, Spark, sparklyr installation

Despite the available instructions, I spent some time on this because of Java compatibility issues.

  1. following the instructions here, install Oracle Java 8.  Run update-alternatives and make sure Java 8 is your default.
  2. following the instructions here, install Hadoop.
  3. follow these instructions to install Spark
  4. in R, install.packages("sparklyr")
  5. you should be able to fire up the sparklyr tutorial at this point.

Friday, April 6, 2018

MicroPython and RFID-RC522 (NodeMCU clone and RFID reader)

Refer to micropython-mfrc522.  Download and unzip it to your PC, then use a tool like ampy to copy read.py, write.py, and mfrc522.py to the root directory of your device.

Connect the NodeMCU to the RFID-RC522 according to the picture above.  I wrote the corresponding RFID-RC522 pins in white text on top of the NodeMCU.

Run a tool like screen to open a terminal session to the MicroPython REPL on the device.  I use the command: screen /dev/ttyUSB0 115200.

At the REPL, type "import read".  Then "read.do_read()".  You should get a dump of the RFID tag when you hold it in front of the reader.
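
The REPL exchange looks roughly like this (the exact output depends on the tag):

>>> import read
>>> read.do_read()
... hold a tag in front of the reader and a dump of its contents prints here ...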

You can also experiment with writing data back to the RFID tag.

Here's my setup: