Friday, October 12, 2018

configuring postgresql server on raspberry pi

I always burn time re-figuring out how to do this.

  1. postgres config files are in /etc/postgresql/9.6/main
  2. edit postgresql.conf so the server listens on all addresses (listen_addresses = '*'); the port stays at the default 5432.
  3. verify with netstat -nlt that you're listening on port 5432 on addresses other than localhost.
  4. edit pg_hba.conf to allow the user to connect (host all all 192.168.1.0/24 md5).
  5. in the past I've had issues because I created the database user without a password, so I've needed to run alter user pi password '<password>'.
  6. verify with dbGetQuery(con, "select current_time") in R (RPostgreSQL), or cur.execute("select current_time") from Python (psycopg2).
  7. this post probably says it all better.
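The netstat check in step 3 can also be automated. Here's a minimal sketch in Python; the column layout of Linux `netstat -nlt` output is an assumption, so adjust if your netstat prints differently:

```python
def listening_beyond_localhost(netstat_output, port=5432):
    """Return True if any TCP listener on `port` is bound to a
    non-loopback address in `netstat -nlt`-style output."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # expect: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local = fields[3]  # e.g. "0.0.0.0:5432" or "127.0.0.1:5432"
        addr, _, local_port = local.rpartition(":")
        if local_port == str(port) and addr not in ("127.0.0.1", "::1"):
            return True
    return False
```

Feed it the stdout of `subprocess.run(["netstat", "-nlt"], capture_output=True, text=True)`.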

Wednesday, October 3, 2018

harvesting craigslist posts to look for regional differences in brand and vehicle class

Python script running on a Raspberry Pi to pull Craigslist RSS feeds and parse the results into a PostgreSQL database.  Let this run as a cron job every 30 minutes for a week to collect several thousand records:

#!/usr/bin/python3

import threading, psycopg2, feedparser
from psycopg2.extras import execute_values

craigslist_feeds = [
    "https://detroit.craigslist.org/search/cto?format=rss",
    "https://seattle.craigslist.org/search/cto?format=rss",
    "https://boston.craigslist.org/search/cto?format=rss",
    "https://washingtondc.craigslist.org/search/cto?format=rss",
    "https://houston.craigslist.org/search/cto?format=rss",
    "https://losangeles.craigslist.org/search/cto?format=rss",
    "https://stlouis.craigslist.org/search/cto?format=rss",
    "https://up.craigslist.org/search/cto?format=rss"
]

def fetch_rss(url):
    con = psycopg2.connect(dbname="",user="")
    cur = con.cursor()
    rss = feedparser.parse(url)
    parsed_data = [(i["title"], i["link"]) for i in rss["entries"]]
    # bulk insert all entries from this feed in one statement
    execute_values(cur, "insert into craigslist (title,link) values %s", parsed_data)
    con.commit()
    cur.close()
    con.close()

for i in craigslist_feeds:
    t = threading.Thread(target=fetch_rss,args=(i,))
    t.start()

Downloaded EPA fuel economy data from here.  All I really needed from this data was the make, model year, and vehicle class information.

The problem is that Craigslist titles aren't consistent and are sometimes missing information.  It's not uncommon, for example, to see an ad for a "2013 F150" instead of "2013 Ford F150".  So how do you determine the make and vehicle class from an incomplete title?  The answer is to use machine learning (knn, naive Bayes) to find the most likely make and class given a Craigslist title.

So I need to take the EPA data and train a model, then predict for each Craigslist title.  Ongoing project.
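As a proof of concept, the title-to-make prediction can be sketched with a simple token-vote classifier. This is a stand-in for the knn/NB model, and the training pairs below are made up, not the actual EPA data:

```python
from collections import Counter, defaultdict

def train(pairs):
    """Map each lowercase token to the makes it appeared with."""
    votes = defaultdict(Counter)
    for make, text in pairs:
        for token in text.lower().split():
            votes[token][make] += 1
    return votes

def predict_make(votes, title):
    """Vote across the tokens of a Craigslist title; return the most likely make."""
    tally = Counter()
    for token in title.lower().split():
        tally.update(votes.get(token, Counter()))
    return tally.most_common(1)[0][0] if tally else None

# toy training data standing in for the EPA make/model table
training = [("Ford", "f150"), ("Ford", "focus"), ("Toyota", "camry")]
model = train(training)
```

So `predict_make(model, "2013 F150 low miles")` recovers "Ford" even though the title never says Ford.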

Monday, October 1, 2018

ubuntu 18.04 Hadoop, Spark, sparklyr installation

Despite following the instructions, I spent some time on this because of Java compatibility issues.

  1. following the instructions here, install Oracle Java 8.  Do the update-alternatives thing and make sure Java 8 is your default.
  2. following the instructions here, install Hadoop.
  3. follow these instructions to install Spark.
  4. in R, install.packages("sparklyr")
  5. you should be able to fire up the sparklyr tutorial at this point.
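Since the snag was Java compatibility, a quick sanity check is to parse the output of java -version before starting anything. A sketch, assuming the usual version-string formats:

```python
import re

def java_major_version(version_output):
    """Extract the major version from `java -version` output.
    Handles both the legacy '1.8.0_181' scheme (returns 8) and
    the modern '11.0.2' scheme (returns 11)."""
    m = re.search(r'version "([^"]+)"', version_output)
    if not m:
        return None
    parts = m.group(1).split(".")
    return int(parts[1]) if parts[0] == "1" else int(parts[0].split("_")[0])
```

Run `java -version` with subprocess (the version string goes to stderr) and confirm this returns 8 before firing up Hadoop.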

Friday, April 6, 2018

MicroPython and RFID-RC522 (NodeMCU clone and RFID reader)

Refer to micropython-mfrc522.  Download and unzip it to your PC, then use a tool like ampy to copy read.py, write.py, and mfrc522.py to the root directory of your device.

Connect to the RFID-RC522 according to the picture above.  I wrote the corresponding pins of the RFID-RC522 in white text on top of the NodeMCU.

Run a tool like screen to open a terminal showing the REPL on the device running MicroPython.  I use the command: screen /dev/ttyUSB0 115200.

At the REPL, type "import read", then "read.do_read()".  You should get a dump of the RFID tag when you hold it in front of the reader.

You can also experiment with writing data back to the RFID tag.

Here's my setup:

Wednesday, November 25, 2015

vehicle DTC text mining in the R tm package

Text mining in R, how to process ngrams: http://tm.r-forge.r-project.org/faq.html#Bigrams

Comparing the frequency of a DTM across factor levels (like DTC against platform):

  1. create the DTM
  2. create a dataframe from the DTM and add the factor: df <- as.data.frame(inspect(dtm)) then df$platform <- substr(data$vin_last_8,2,2)
  3. nice compact summary table: dtc_by_platform <- with(melt(df),tapply(value,list(variable,platform),sum))
  4. this table can be used as input to a barplot or chi-square test
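For comparison, the same term-by-factor tally (summing DTM rows within levels of a factor) can be sketched in plain Python. The DTC and platform values below are invented:

```python
from collections import Counter

def term_by_level(docs, levels):
    """Tally term frequency per factor level, like summing a
    document-term matrix within levels of a factor.
    docs: list of token lists; levels: the level for each document."""
    table = {}
    for tokens, level in zip(docs, levels):
        row = table.setdefault(level, Counter())
        row.update(tokens)
    return table

# made-up example: DTC tokens per vehicle, and each vehicle's platform
docs = [["p0300", "p0420"], ["p0300"], ["p0171"]]
platforms = ["A", "A", "B"]
tab = term_by_level(docs, platforms)
```

The resulting dict-of-Counters plays the same role as the tapply table: rows are terms, columns are factor levels.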

Saturday, November 22, 2014

Shakespearean insult generator in R

Credit to this site for the idea.  Simple R code to generate Shakespearean insults.  Download the insult_words.csv file here.

insult_words <- read.csv("insult_words.csv", header=F)
paste("Thou", sample(insult_words[,1],1), sample(insult_words[,2],1), sample(insult_words[,3],1), "!")

Or give yourself a bunch of ammunition:

for(i in 1:10){
    cat(paste("Thou", sample(insult_words[,1],1), sample(insult_words[,2],1), sample(insult_words[,3],1), "!\n"))
}
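The same sample-one-word-per-column trick ports directly to Python. The three word lists below are just a few sample entries, not the full insult_words.csv:

```python
import random

# placeholder word columns standing in for insult_words.csv
col1 = ["artless", "bawdy", "churlish"]
col2 = ["base-court", "bat-fowling", "beef-witted"]
col3 = ["apple-john", "baggage", "barnacle"]

def insult(rng=random):
    """Sample one word from each column, Shakespearean-style."""
    return " ".join(["Thou", rng.choice(col1), rng.choice(col2), rng.choice(col3)]) + "!"
```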

Friday, October 31, 2014

creating Chrysler MOP/MIS charts in Pandas

Pandas is a Python package meant to be used for data analysis.  I love Python, but I'm also a long-term R user.  I've really come to appreciate how quickly I can create a customized chart in R, but let's try a practical example in Pandas and see how it compares to R.

Here's what I came up with.  I'm new to Pandas, so I'm willing to consider that this isn't the most efficient way to get a chart up:

import pandas as pd
from matplotlib import pyplot as plt

data = pd.read_csv("../charts/my14 mis12/recovery%all_smlc%all_lcc.csv")

#change Model Year from int to string; read_csv could do this too, but that needs a dtype dict of column names
data["Model Year"] = data["Model Year"].astype("str")

#clean up column names, remove forward slashes
data = data.rename(columns=lambda x: x.replace('/', ''))

#construct a column of model year - MOP, where MOP is always 2 character
data['MY_MOP'] = "0"+data['MOP']
data['MY_MOP'] = [i[-2:] for i in data["MY_MOP"]]
data['MY_MOP'] = data["Model Year"]+"-"+data["MY_MOP"]

#narrow the dataframe down to just what I need (combine the masks to avoid chained boolean indexing):
data = data[(data.MOP != "TOTALS") & (data.YTDCMP == "YTD")][['MY_MOP','MIS','C1000']]

#pivot table of C/1000 by Model year - MOP and MIS
#obviously work to do on axis labels!
table = data.pivot_table(values="C1000",index=["MY_MOP"],columns=["MIS"],aggfunc=max)
table.plot()
plt.show()
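The zero-padding step above (prepend "0", keep the last two characters) is equivalent to str.zfill(2) for one- and two-character MOP values; a quick check:

```python
def pad_mop(mop):
    """Zero-pad a month-of-production string to two characters,
    mirroring the prepend-then-slice trick used above."""
    return ("0" + mop)[-2:]

# for 1- and 2-character strings this matches mop.zfill(2)
```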