pandasticsearch package¶

Submodules¶

pandasticsearch.client module¶

class pandasticsearch.client.RestClient(host, username=None, password=None, verify_ssl=True)¶

Bases: object

RestClient talks to an Elasticsearch cluster through its native RESTful API.

get(path, params=None)¶

Sends a GET request to Elasticsearch.

Parameters:	path – Path of the verb and resource
	params (optional) – Dictionary to be sent in the query string.
Returns:	The response as a dictionary.

>>> from pandasticsearch import RestClient
>>> client = RestClient('http://host:port')
>>> print(client.get('index_name/_search'))

post(path, data, params=None)¶

Sends a POST request to Elasticsearch.

Parameters:	path – The path of the verb and resource, e.g. "/index_name/_search"
	data – The JSON data to send in the body of the request.
	params (optional) – Dictionary to be sent in the query string.
Returns:	The response as a dictionary.

>>> from pandasticsearch import RestClient
>>> client = RestClient('http://host:port')
>>> print(client.post(path='index/_search', data={"query":{"match_all":{}}}))
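To make the roles of `path` and `params` concrete, the following sketch composes a request URL with the standard library only. `build_url` is a hypothetical helper, not part of pandasticsearch, and how RestClient actually joins its arguments is an assumption here.

```python
from urllib.parse import urlencode


def build_url(host, path, params=None):
    """Join a base URL, a resource path and an optional query string."""
    url = host.rstrip('/') + '/' + path.lstrip('/')
    if params:
        # Sorting the items keeps the query string deterministic.
        url += '?' + urlencode(sorted(params.items()))
    return url


print(build_url('http://localhost:9200', 'index_name/_search', {'size': 10}))
```

The path is normalized so that both 'index_name/_search' and '/index_name/_search' yield the same URL.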

pandasticsearch.dataframe module¶

class pandasticsearch.dataframe.DataFrame(**kwargs)¶

Bases: object

A DataFrame treats index and documents in Elasticsearch as named columns and rows.

>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es('http://host:port', index='people')

Customizing the Elasticsearch endpoint:

>>> from pandasticsearch import DataFrame
>>> from pandasticsearch.client import RestClient
>>> df = DataFrame(client=RestClient('http://host:port'), index='people')

It can be converted to a Pandas object for subsequent analysis:

>>> df.to_pandas()

agg(*aggs)¶

Aggregate on the entire DataFrame without groups.

Parameters:	aggs – a list of `Aggregator` objects

>>> df[df['gender'] == 'male'].agg(df['age'].avg).collect()
[Row(avg(age)=12)]
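For orientation, a filtered average like the one above roughly corresponds to an Elasticsearch search body with a `query` section and an `aggs` section. The sketch below builds such a body by hand. The helper name `avg_body` and the exact dictionary layout are assumptions for illustration and may differ from what `to_dict()` emits.

```python
import json


def avg_body(filter_field, filter_value, agg_field):
    """Build a search body: a term filter plus an avg metric aggregation."""
    return {
        'query': {'bool': {'filter': {'term': {filter_field: filter_value}}}},
        'aggs': {'avg(%s)' % agg_field: {'avg': {'field': agg_field}}},
        'size': 0,  # only the aggregation result is wanted, no hits
    }


print(json.dumps(avg_body('gender', 'male', 'age'), indent=2))
```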

collect()¶

Returns all the records as a list of Row.

Returns:	list of `Row`

>>> df.collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]

columns¶

Returns all column names as a list.

Returns:	column names as a list

>>> df.columns
['age', 'name']

count()¶

Returns a list of numbers indicating the count for each group.

>>> df.groupby(df.gender).count()
[2, 1]
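A per-group count of this kind can be read off the buckets of an Elasticsearch terms aggregation, where each bucket carries a `doc_count`. The response fragment and the helper `bucket_counts` below are hypothetical and only sketch the idea; they are not taken from the library.

```python
# Hypothetical response fragment for a terms aggregation on "gender".
response = {
    'aggregations': {
        'gender': {
            'buckets': [
                {'key': 'male', 'doc_count': 2},
                {'key': 'female', 'doc_count': 1},
            ]
        }
    }
}


def bucket_counts(resp, agg_name):
    """Return the doc_count of every bucket of one aggregation."""
    return [b['doc_count'] for b in resp['aggregations'][agg_name]['buckets']]


print(bucket_counts(response, 'gender'))  # prints [2, 1]
```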

filter(condition)¶

Filters rows using a given condition.

where() is an alias for filter().

Parameters:	condition – `BooleanFilter` object or a string

>>> df.filter(df['age'] < 13).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]
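Expressions such as `df['age'] < 13` work because comparison operators can be overloaded to return a filter object instead of a boolean. The toy class `Field` below is a hypothetical stand-in that returns plain Elasticsearch query dictionaries. It is a sketch of the technique, not the library's `Column` or `BooleanFilter` implementation.

```python
class Field:
    """Toy column: comparisons return query dictionaries, not booleans."""

    def __init__(self, name):
        self.name = name

    def __lt__(self, value):
        return {'range': {self.name: {'lt': value}}}

    def __gt__(self, value):
        return {'range': {self.name: {'gt': value}}}

    def __eq__(self, value):
        return {'term': {self.name: value}}


age = Field('age')
print(age < 13)
print(Field('gender') == 'male')
```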

static from_es(**kwargs)¶

Creates a DataFrame object from the URL of an Elasticsearch node and the name of an index.

Parameters:	url (str) – URL of the node connected to (default: 'http://localhost:9200')
	index (str) – The name of the index
	doc_type (str) – The type of the document
	compat (str) – The compatible ES version (an integer number)
Returns:	DataFrame object for accessing
Return type:	DataFrame

>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es('http://host:port', index='people')

groupby(*cols)¶

Returns a new DataFrame object grouped by the specified column(s).

Parameters:	cols – A list of column names, `Column` or `Grouper` objects

index¶

Returns the index name.

Returns:	string as the name

>>> df.index
people/children

limit(num)¶: Limits the result count to the number specified.

orderby(*cols)¶

Returns a new DataFrame object sorted by the specified column(s).

Parameters:	cols – A list of column names, `Column` or `Sorter`.

orderby() is an alias for sort().

>>> df.sort(df['age'].asc).collect()
[Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]

print_debug()¶: Posts the query to the Elasticsearch server and prints the result it returns.

print_schema()¶

Prints out the schema in the tree format.

>>> df.print_schema()
index_name
|-- type_name
  |-- experience :  {'type': 'integer'}
  |-- id :  {'type': 'string'}
  |-- mobile :  {'index': 'not_analyzed', 'type': 'string'}
  |-- regions :  {'index': 'not_analyzed', 'type': 'string'}

classmethod resolve_mappings(json_map)¶

resolve_schema(json_prop, res_schema='', depth=1)¶

schema¶: Returns the schema (mapping) of the index/type as a dictionary.

select(*cols)¶

Projects a set of columns and returns a new DataFrame.

Parameters:	cols – list of column names or `Column`.

>>> df.filter(df['age'] < 25).select('name', 'age').collect()
[Row(age=12,name='Alice'), Row(age=11,name='Bob'), Row(age=13,name='Leo')]

show(n=200, truncate=15)¶

Prints the first n rows to the console.

Parameters:	n – Number of rows to show.
	truncate – Number of words to be truncated for each column.

>>> df.filter(df['age'] < 25).select('name').show(3)
+------+
| name |
+------+
| Alice|
| Bob  |
| Leo  |
+------+
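The table layout above can be reproduced with a few lines of plain Python. The helper `render_table` is hypothetical, and it assumes that `truncate` cuts each cell to a number of characters, which the parameter description does not confirm.

```python
def render_table(rows, cols, truncate=15):
    """Render dict rows as an ASCII table, cutting cells to `truncate` chars."""
    cells = [[str(row.get(col, ''))[:truncate] for col in cols] for row in rows]
    # Each column is as wide as its header or its longest (truncated) cell.
    widths = [
        max([len(col)] + [len(line[i]) for line in cells])
        for i, col in enumerate(cols)
    ]
    sep = '+' + '+'.join('-' * (w + 2) for w in widths) + '+'

    def fmt(values):
        padded = (' ' + v.ljust(w) + ' ' for v, w in zip(values, widths))
        return '|' + '|'.join(padded) + '|'

    return '\n'.join([sep, fmt(cols), sep] + [fmt(line) for line in cells] + [sep])


print(render_table([{'name': 'Alice'}, {'name': 'Bob'}, {'name': 'Leo'}], ['name']))
```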

sort(*cols)¶

Returns a new DataFrame object sorted by the specified column(s).

Parameters:	cols – A list of column names, `Column` or `Sorter`.

orderby() is an alias for sort().

>>> df.sort(df['age'].asc).collect()
[Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
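In Elasticsearch terms, ordering is expressed as a `sort` list in the search body. The sketch below builds that list from (field, order) pairs. `sort_clause` is a hypothetical helper, and whether the library produces exactly this shape is an assumption.

```python
def sort_clause(*specs):
    """Turn (field, order) pairs into the `sort` section of a search body."""
    clause = []
    for field, order in specs:
        if order not in ('asc', 'desc'):
            raise ValueError('order must be "asc" or "desc", got %r' % (order,))
        clause.append({field: {'order': order}})
    return clause


print(sort_clause(('age', 'asc'), ('name', 'desc')))
```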

to_dict()¶

Converts the current DataFrame object to an Elasticsearch search dictionary.

Returns:	a dictionary which obeys the Elasticsearch RESTful protocol

to_pandas()¶

Export to a Pandas DataFrame object.

Returns:	The DataFrame representing the query result

>>> df[df['gender'] == 'male'].agg(df['age'].avg).to_pandas()
    avg(age)
0        12

where(condition)¶

Filters rows using a given condition.

where() is an alias for filter().

Parameters:	condition – `BooleanFilter` object or a string

>>> df.where(df['age'] < 13).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]

pandasticsearch.errors module¶

exception pandasticsearch.errors.DataFrameException(msg)¶: Bases: pandasticsearch.errors.PandasticSearchException

exception pandasticsearch.errors.NoSuchDependencyException(msg)¶: Bases: pandasticsearch.errors.PandasticSearchException

exception pandasticsearch.errors.PandasticSearchException(msg)¶: Bases: exceptions.RuntimeError

exception pandasticsearch.errors.ParseResultException(msg)¶: Bases: pandasticsearch.errors.PandasticSearchException

exception pandasticsearch.errors.ServerDefinedException(msg)¶: Bases: pandasticsearch.errors.PandasticSearchException

pandasticsearch.operators module¶

pandasticsearch.queries module¶

class pandasticsearch.queries.Agg¶

Bases: pandasticsearch.queries.Query

explain_result(result=None)¶

static from_dict(d)¶

index¶

to_pandas()¶: Export the current query result to a Pandas DataFrame object.

class pandasticsearch.queries.Query¶

Bases: _abcoll.MutableSequence

append(value)¶: S.append(object) – append object to the end of the sequence

explain_result(result=None)¶

insert(index, value)¶: S.insert(index, object) – insert object before index

json¶

Gets the original JSON representation returned by the Elasticsearch REST API.

Returns:	The JSON string indicating the query result
Return type:	string

millis_taken¶

print_json()¶

result¶

to_pandas()¶: Export the current query result to a Pandas DataFrame object.

class pandasticsearch.queries.ScrollSelect(hits_generator)¶

Bases: pandasticsearch.queries.Select

millis_taken and json are not supported for ScrollSelect.

result¶

row_generator()¶

to_pandas()¶: Export the current query result to a Pandas DataFrame object.

class pandasticsearch.queries.Select¶

Bases: pandasticsearch.queries.Query

explain_result(result=None)¶

static from_dict(d)¶

hit_to_row(hit)¶

resolve_fields(row)¶

result_as_tabular(cols, n, truncate=20)¶

to_pandas()¶: Export the current query result to a Pandas DataFrame object.
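A select-style result is built from the hits of a search response, where each hit stores the document under `_source`. The sketch below flattens a hypothetical, trimmed response into a list of dictionaries. `hits_to_rows` is an illustrative helper and not the library's `hit_to_row`.

```python
# Hypothetical search response, trimmed to the parts used here.
response = {
    'took': 3,
    'hits': {
        'hits': [
            {'_index': 'people', '_id': '1', '_source': {'name': 'Alice', 'age': 12}},
            {'_index': 'people', '_id': '2', '_source': {'name': 'Bob', 'age': 11}},
        ]
    },
}


def hits_to_rows(resp):
    """Extract the `_source` document of every hit as a plain dict."""
    return [dict(hit['_source']) for hit in resp['hits']['hits']]


rows = hits_to_rows(response)
print(rows)
```

A list of dictionaries in this form can be passed directly to the `pandas.DataFrame` constructor.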

pandasticsearch.types module¶

class pandasticsearch.types.Column(field)¶

Bases: object

asc¶

Ascending Sorter

Returns:	`Sorter`

>>> df.orderby(df.age.asc)

avg¶

Avg aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.avg)

cardinality¶

Distinct aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.cardinality)
>>> df.groupby(df.gender).agg(df.age.distinct_count)

count¶

Value count aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.count)

date_interval(interval, format='yyyy/MM/dd HH:mm:ss')¶

Returns a Grouper

Parameters:	interval – A string indicating the date interval
	format – Date format string
Returns:	`Grouper`

>>> df.groupby(df.date_interval('1d'))

desc¶

Descending Sorter

Returns:	`Sorter`

>>> df.orderby(df.age.desc)

distinct_count¶

Distinct aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.cardinality)
>>> df.groupby(df.gender).agg(df.age.distinct_count)

extended_stats¶

Extended stats aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.extended_stats)

field_name()¶

isin(values)¶

Returns a BooleanFilter

Parameters:	values – A list of values to filter terms
Returns:	`BooleanFilter`

>>> df.filter(df.gender.isin(['male', 'female']))

isnull¶

BooleanFilter that matches rows whose column value is null

Returns:	`BooleanFilter`

like(wildcard)¶

Returns a BooleanFilter

Parameters:	wildcard (str) – The wildcard to filter the column with.
Returns:	`BooleanFilter`

>>> df.filter(df.name.like('A*'))

max¶

Max aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.max)

min¶

Min aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.min)

notnull¶

BooleanFilter that matches rows whose column value is not null

Returns:	`BooleanFilter`

percentile_ranks¶

Percentile ranks aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.percentile_ranks)

percentiles¶

Percentile aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.percentiles)

ranges(values)¶

Returns a Grouper

Parameters:	values – A list of numeric values
Returns:	`Grouper`

>>> df.groupby(df.age.ranges([10,12,14]))
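An Elasticsearch range aggregation takes explicit `from`/`to` pairs, with `from` inclusive and `to` exclusive. One plausible reading of a boundary list such as `[10, 12, 14]` is that consecutive values are paired into buckets. The sketch below shows that pairing. `range_buckets` is a hypothetical helper, and whether `ranges()` pairs the values this way or adds open-ended buckets is an assumption.

```python
def range_buckets(values):
    """Pair consecutive boundaries into half-open [from, to) ranges."""
    return [{'from': lo, 'to': hi} for lo, hi in zip(values, values[1:])]


print(range_buckets([10, 12, 14]))
```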

rlike(regexp)¶

Returns a BooleanFilter

Parameters:	regexp (str) – The regular expression to filter the column with.
Returns:	`BooleanFilter`

>>> df.filter(df.name.rlike('A.l.e'))

startswith(substr)¶

Returns a BooleanFilter

Parameters:	substr (str) – The substring to filter the column with.
Returns:	`BooleanFilter`

>>> df.filter(df.name.startswith('Al'))

stats¶

Stats aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.stats)

sum¶

Sum aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.sum)

terms(limit=20, include=None, exclude=None)¶

Returns a Grouper

Parameters:	limit – limit the number of terms to be aggregated (default 20)
	include – the exact term to be included
	exclude – the exact term to be excluded
Returns:	`Grouper`

>>> df.groupby(df.age.terms(limit=10, include=[1, 2, 3]))
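The three parameters map naturally onto the `size`, `include` and `exclude` options of an Elasticsearch terms aggregation. The sketch below assembles such an aggregation body. `terms_agg` is a hypothetical helper, and the mapping of `limit` to `size` is an assumption about the library's behaviour.

```python
def terms_agg(field, limit=20, include=None, exclude=None):
    """Build a terms aggregation body; optional keys are added only if set."""
    body = {'field': field, 'size': limit}
    if include is not None:
        body['include'] = include
    if exclude is not None:
        body['exclude'] = exclude
    return {'terms': body}


print(terms_agg('age', limit=10, include=[1, 2, 3]))
```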

value_count¶

Value count aggregator

Returns:	`Aggregator`

>>> df.groupby(df.gender).agg(df.age.value_count)

class pandasticsearch.types.Row¶

Bases: tuple

The built-in DataFrame row type, used to access results before they are converted into a Pandas DataFrame. The fields are sorted by name.

>>> row = Row(name="Alice", age=12)
>>> row
Row(age=12, name='Alice')
>>> row['name'], row['age']
('Alice', 12)
>>> row.name, row.age
('Alice', 12)
>>> 'name' in row
True
>>> 'wrong_key' in row
False

as_dict()¶
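The behaviour documented for Row (sorted fields, access by key or attribute, membership test on field names) can be approximated with a small tuple subclass. `SimpleRow` below is a hypothetical stand-in written for illustration. It is not the library's implementation and omits `as_dict()`.

```python
class SimpleRow(tuple):
    """Toy row: a tuple of values whose field names are kept sorted."""

    def __new__(cls, **fields):
        names = sorted(fields)
        row = super().__new__(cls, [fields[n] for n in names])
        row._names = names
        return row

    def __getitem__(self, key):
        # String keys look up by field name, anything else by position.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._names.index(key))
        return tuple.__getitem__(self, key)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self[name]
        except ValueError:
            raise AttributeError(name)

    def __contains__(self, key):
        return key in self._names

    def __repr__(self):
        pairs = ', '.join('%s=%r' % (n, v) for n, v in zip(self._names, self))
        return 'SimpleRow(%s)' % pairs


row = SimpleRow(name='Alice', age=12)
print(row, row['name'], row.age, 'name' in row)
```

Subclassing tuple keeps rows immutable and cheap to compare, while the sorted name list gives a stable field order regardless of how the keyword arguments were written.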
