Search This Blog

Tuesday, June 5, 2018

PostgreSQL Lateral Joins

Example from this this webpage: 

The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)

Lateral joins are like running an iteration over the joined subquery.

id | user_id | created_at ----+---------+---------------------------- 1 | 1 | 2017-06-20 04:35:03.582895 2 | 2 | 2017-06-20 04:35:07.564973 3 | 3 | 2017-06-20 04:35:10.986712 4 | 1 | 2017-06-20 04:58:10.137503 5 | 3 | 2017-06-20 04:58:17.905277 6 | 3 | 2017-06-20 04:58:25.289122

SELECT user_id, first_order_time, next_order_time, id FROM
  (SELECT user_id, min(created_at) AS first_order_time 
       FROM orders GROUP BY user_id) o1
  (SELECT id, created_at AS next_order_time
   FROM orders
   WHERE user_id = o1.user_id AND created_at > o1.first_order_time
   ORDER BY created_at ASC LIMIT 1)
   o2 ON true;

 user_id |      first_order_time      |      next_order_time       | id
       1 | 2017-06-20 04:35:03.582895 | 2017-06-20 04:58:10.137503 |  4
       3 | 2017-06-20 04:35:10.986712 | 2017-06-20 04:58:17.905277 |  5
       2 | 2017-06-20 04:35:07.564973 |                            |


Monday, June 4, 2018

Moving data folder for PostgreSQL on Windows

If you run out of space on a server you may need to move the data folder for PostgreSQL to a new drive. On Windows the easiest way to do this is stop the service, copy the data folder, then unregister the service and re-resister it pointing to the new data folder space. This is answered in the following STACKOVERFLOW answer. 

To move the directory, use these steps:
  1. Stop Postgres (you can use the control panel to find the correct service name)
    net stop 
  2. Make sure Postgres is not running (e.g. using ProcessMonitor)
  3. Remove the Windows service using
    pg_ctl unregister -N 
  4. make sure Postgres is not running
  5. move the data directory to the new location (or maybe only copy it, so that you have a backup)
  6. re-create the service using (this assigns postgres as the service name)
    pg_ctl register -N postgres -D c:\new\path\to\datadir
  7. start the service
    net start postgres 
  8. run psql to verify that Postgres is up and running
    psql -U postgres
  9. Verify the running server is using the new data directory
    show data_directory;

Wednesday, January 31, 2018

Raspberry Pi 3 and I2C issues with Pulsed Light v2 -- returning 0 from Lidar-Lite

Lidar-Lite v3 return always zero with raspberry pi 3

Trying to get an Adafruit servo hat to work on Raspberry Pi 3... and a PulsedLight Lidar-Lite v2 unit.

Using I2C for comms - servo hat on 0x40 and Lidar on 0x62.

Tried on Pi2 and the Lidar was fine but the same code returns 0 for distance/velocity on Pi3.

Running   sudo i2cdetect -y 1 returned the correct addresses (0x62 for the Lidar)
 (if I2C tools not installed then first  run these commands)
sudo apt-get install python-smbus
sudo apt-get install i2c-tools

Issue is that the latest Raspbian kernel is auto restarting the I2C and the LidarLite unit doesn't like this.. hence python code returns 0 for distance and 0 for speed - even though the I2C detect has demonstrated the device is connected correctly.


Bump the kernel back to an earlier version - I'm using 4.4.38 which works.  

You copy the hex code (in this case af9cb14d5053f89857225bd18d1df59a089c171e)  and run it like this:

If rpi-update not installed first run this:       sudo apt-get install rpi-update

sudo rpi-update af9cb14d5053f89857225bd18d1df59a089c171e 

... it'll take a while then ask you to reboot your Pi. If you get an error message then try running the command again (possibly a few times) as it could be a missing packet of data from the network corrupting the file - this happened to me twice - 3rd time it worked fine. 

Thanks to Isa Hatipoglu for the information on rpi-update:

Tuesday, October 31, 2017

Natural Language Processing

Useful links for NLP and processing text input for intent:    (was    << vader sentiment for sentiment analysis


Building a recommender system using Python:

Building a skill with AWS Lambda

SQuAD - Standford QA dataset

Facebook babi project:  : transparent use of semantics embedded in geosocial and other data for helping mobile users in their decision-making 

Sunday, October 22, 2017

Natural Language Processing - link to intro article

The original article can be found here:

Article by  , KDnuggets. 

Natural language processing (NLP) concerns itself with the interaction between natural human languages and computing devices. NLP is a major aspect of computational linguistics, and also falls within the realms of computer science and artificial intelligence.
2. Tokenization
Tokenization is, generally, an early step in the NLP process, a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized.
3. Normalization
Before further processing, text needs to be normalized. Normalization generally refers to a series of related tasks meant to put all text on a level playing field: converting all text to the same case (upper or lower), removing punctuation, expanding contractions, converting numbers to their word equivalents, and so on. Normalization puts all words on equal footing, and allows processing to proceed uniformly.
4. Stemming
Stemming is the process of eliminating affixes (suffixed, prefixes, infixes, circumfixes) from a word in order to obtain a word stem.
running → run
5. Lemmatization
Lemmatization is related to stemming, differing in that lemmatization is able to capture canonical forms based on a word's lemma.
For example, stemming the word "better" would fail to return its citation form (another word for lemma); however, lemmatization would result in the following:
better → good
It should be easy to see why the implementation of a stemmer would be the less difficult feat of the two.
6. Corpus
In linguistics and NLP, corpus (literally Latin for body) refers to a collection of texts. Such collections may be formed of a single language of texts, or can span multiple languages -- there are numerous reasons for which multilingual corpora (the plural of corpus) may be useful. Corpora may also consist of themed texts (historical, Biblical, etc.). Corpora are generally solely used for statistical linguistic analysis and hypothesis testing.
7. Stop Words
Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content. As a simple example, the following panagram is just as legible if the stop words are removed:
The quick brown fox jumps over the lazy dog.
8. Parts-of-speech (POS) Tagging
POS tagging consists of assigning a category tag to the tokenized parts of a sentence. The most popular POS tagging would be identifying words as nouns, verbs, adjectives, etc.
POS Tagging
9. Statistical Language Modeling
Statistical Language Modeling is the process of building a statistical language model which is meant to provide an estimate of a natural language. For a sequence of input words, the model would assign a probability to the entire sequence, which contributes to the estimated likelihood of various possible sequences. This can be especially useful for NLP applications which generate text.
10. Bag of Words
Bag of words is a particular representation model used to simplify the contents of a selection of text. The bag of words model omits grammar and word order, but is interested in the number of occurrences of words within the text. The ultimate representation of the text selection is that of a bag of words (bag referring to the set theory concept of multisets, which differ from simple sets).
Actual storage mechanisms for the bag of words representation can vary, but the following is a simple example using a dictionary for intuitiveness. Sample text:
"Well, well, well," said John.
"There, there," said James. "There, there."
The resulting bag of words representation as a dictionary:
      'well': 3,
      'said': 2,
      'john': 1,
      'there': 4,
      'james': 1

11. n-grams
n-grams is another representation model for simplifying text selection contents. As opposed to the orderless representation of bag of words, n-grams modeling is interested in preserving contiguous sequences of N items from the text selection.
An example of trigram (3-gram) model of the second sentence of the above example ("There, there," said James. "There, there.") appears as a list representation below:
      "there there said",
      "there said james",
      "said james there",
      "james there there",

12. Regular Expressions
Regular expressions, often abbreviated regexp or regexp, are a tried and true method of concisely describing patterns of text. A regular expression is represented as a special text string itself, and is meant for developing search patterns on selections of text. Regular expressions can be thought of as an expanded set of rules beyond the wildcard characters of ? and *. Though often cited as frustrating to learn, regular expressions are incredibly powerful text searching tools.
13. Zipf's Law
Zipf's Law is used to describe the relationship between word frequencies in document collections. If a document collection's words are ordered by frequency, and y is used to describe the number of times that the xth word appears, Zipf's observation is concisely captured as y = cx-1/2 (item frequency is inversely proportional to item rank). More generally, Wikipedia says:
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Zipf's Law
14. Similarity Measures
There are numerous similarity measures which can be applied to NLP. What are we measuring the similarity of? Generally, strings.
  • Levenshtein - the number of characters that must be deleted, inserted, or substituted in order to make a pair of strings equal
  • Jaccard - the measure of overlap between 2 sets; in the case of NLP, generally, documents are sets of words
  • Smith Waterman - similar to Levenshtein, but with costs assigned to substitution, insertion, and deletion
15. Syntactic Analysis
Also referred to as parsing, syntactic analysis is the task of analyzing strings as symbols, and ensuring their conformance to a established set of grammatical rules. This step must, out of necessity, come before any further analysis which attempts to extract insight from text -- semantic, sentiment, etc. -- treating it as something beyond symbols.
Also known as meaning generation, semantic analysis is interested in determining the meaning of text selections (either character or word sequences). After an input selection of text is read and parsed (analyzed syntactically), the text selection can then be interpreted for meaning. Simply put, syntactic analysis is concerned with what words a text selection was made up of, while semantic analysis wants to know what the collection of words actually means. The topic of semantic analysis is both broad and deep, with a wide variety of tools and techniques at the researcher's disposal.
Sentiment analysis is the process of evaluating and determining the sentiment captured in a selection of text, with sentiment defined as feeling or emotion. This sentiment can be simply positive (happy), negative (sad or angry), or neutral, or can be some more precise measurement along a scale, with neutral in the middle, and positive and negative increasing in either direction.
Sentiment analysis
Information retrieval is the process of accessing and retrieving the most appropriate information from text based on a particular query, using context-based indexing or metadata. One of the most famous examples of information retrieval would be Google Search.

Tuesday, May 2, 2017

Linux Disk Tools

Useful commands when adding a new HDD to a Linux server

sudo gparted  
  >>partition disk etc

sudo gnome-disks
>> configure mount location

Wednesday, April 26, 2017

Geographic Weighted Regression

Regression has explanatory variables (also known as independant variables) and they have strengths (coefficients)
These inputs result in our dependant variable

+ or - for the coefficients

(if Beta goes up and the number of incidents goes down then a NEGative relationship)