pandasticsearch package¶
Submodules¶
pandasticsearch.client module¶
-
class
pandasticsearch.client.
RestClient
(host, username=None, password=None, verify_ssl=True)¶ Bases:
object
RestClient talks to an Elasticsearch cluster through its native RESTful API.
-
get
(path, params=None)¶ Sends a GET request to Elasticsearch.
Parameters: - path – The path of the verb and resource, e.g. “index_name/_search”
- params (optional) – Dictionary to be sent in the query string.
Returns: The response as a dictionary.
>>> from pandasticsearch import RestClient
>>> client = RestClient('http://host:port')
>>> print(client.get('index_name/_search'))
-
post
(path, data, params=None)¶ Sends a POST request to Elasticsearch.
Parameters: - path – The path of the verb and resource, e.g. “/index_name/_search”
- data – The json data to send in the body of the request.
- params (optional) – Dictionary to be sent in the query string.
Returns: The response as a dictionary.
>>> from pandasticsearch import RestClient
>>> client = RestClient('http://host:port')
>>> print(client.post(path='index/_search', data={"query":{"match_all":{}}}))
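For orientation, the sketch below shows how a path, query-string parameters, and a JSON body might combine into a single HTTP request. It uses only the standard library and sends nothing over the network; `build_request` is a hypothetical helper, not part of pandasticsearch, and the host is a placeholder.

```python
import json
from urllib.parse import urlencode


def build_request(host, path, data=None, params=None):
    """Hypothetical helper: describe the HTTP request without sending it."""
    # Join host and path, tolerating a trailing or leading slash.
    url = host.rstrip("/") + "/" + path.lstrip("/")
    if params:
        # params end up in the query string.
        url += "?" + urlencode(params)
    # data, when given, is serialized as the JSON request body.
    body = json.dumps(data) if data is not None else None
    method = "POST" if body is not None else "GET"
    return {"method": method, "url": url, "body": body}


request = build_request(
    "http://localhost:9200",
    "index_name/_search",
    data={"query": {"match_all": {}}},
    params={"size": 10},
)
print(request["method"], request["url"])
print(request["body"])
```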
-
pandasticsearch.dataframe module¶
-
class
pandasticsearch.dataframe.
DataFrame
(**kwargs)¶ Bases:
object
A
DataFrame
treats an index and its documents in Elasticsearch as named columns and rows.
>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es('http://host:port', index='people')
Customizing the Elasticsearch endpoint:
>>> from pandasticsearch import DataFrame
>>> from pandasticsearch.client import RestClient
>>> df = DataFrame(client=RestClient('http://host:port'), index='people')
It can be converted to a Pandas object for subsequent analysis:
>>> df.to_pandas()
-
agg
(*aggs)¶ Aggregate on the entire DataFrame without groups.
Parameters: aggs – a list of Aggregator
objects
>>> df[df['gender'] == 'male'].agg(df['age'].avg).collect()
[Row(avg(age)=12)]
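As a rough illustration, a filtered metric aggregation like the one in the example can be written by hand as an Elasticsearch search body. This is a sketch of the standard query DSL, not necessarily the dictionary pandasticsearch generates; `metric_agg` is a hypothetical helper, and the `avg(age)` key simply mirrors the Row output shown in the example.

```python
import json


def metric_agg(kind, field):
    """Hypothetical helper: one single-field metric aggregation.

    kind is an Elasticsearch metric name such as "avg", "min", "max" or "sum".
    """
    name = "%s(%s)" % (kind, field)
    return {name: {kind: {"field": field}}}


body = {
    # Restrict the documents first, as df[df['gender'] == 'male'] would.
    "query": {"bool": {"filter": {"term": {"gender": "male"}}}},
    "aggs": metric_agg("avg", "age"),
    # size 0 asks Elasticsearch for aggregation results only, no hits.
    "size": 0,
}
print(json.dumps(body, indent=2, sort_keys=True))
```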
-
collect
()¶ Returns all the records as a list of Row.
Returns: list of Row
>>> df.collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
-
columns
¶ Returns all column names as a list.
Returns: column names as a list
>>> df.columns
['age', 'name']
-
count
()¶ Returns a list of numbers indicating the count for each group.
>>> df.groupby(df.gender).count()
[2, 1]
-
filter
(condition)¶ Filters rows using a given condition.
where() is an alias for filter().
Parameters: condition – BooleanFilter
object or a string
>>> df.filter(df['age'] < 13).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]
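In Elasticsearch terms, a condition such as `df['age'] < 13` corresponds to a range query. The sketch below builds such a search body by hand; the exact dictionary pandasticsearch generates may differ, and `range_filter` is a hypothetical helper.

```python
import json


def range_filter(field, **bounds):
    """Hypothetical helper: build an Elasticsearch range query.

    bounds uses the standard range operators: gt, gte, lt, lte.
    """
    allowed = {"gt", "gte", "lt", "lte"}
    unknown = set(bounds) - allowed
    if unknown:
        raise ValueError("unsupported range operators: %s" % sorted(unknown))
    return {"range": {field: bounds}}


# Equivalent in spirit to df.filter(df['age'] < 13).
body = {"query": {"bool": {"filter": range_filter("age", lt=13)}}}
print(json.dumps(body, sort_keys=True))
```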
-
static
from_es
(**kwargs)¶ Creates a
DataFrame
object by providing the URL of an Elasticsearch node and the name of the index.
Parameters: - url (str) – URL of the node connected to (default: ‘http://localhost:9200’)
- index (str) – The name of the index
- doc_type (str) – The type of the document
- compat (str) – The compatible ES version (an integer number)
Returns: DataFrame object for accessing the index
Return type: DataFrame
>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es('http://host:port', index='people')
-
groupby
(*cols)¶ Returns a new
DataFrame
object grouped by the specified column(s).
Parameters: cols – A list of column names, Column
or Grouper
objects
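Grouping maps naturally onto Elasticsearch bucket aggregations. The sketch below nests a metric inside a terms bucket using the standard aggregation DSL; `group_by_terms` is a hypothetical helper, and the request pandasticsearch itself generates may be shaped differently.

```python
def group_by_terms(field, metrics, limit=20):
    """Hypothetical helper: bucket documents by the distinct values of field."""
    return {
        field: {
            "terms": {"field": field, "size": limit},
            # Metrics listed here are computed once per bucket.
            "aggs": metrics,
        }
    }


metrics = {"avg(age)": {"avg": {"field": "age"}}}
body = {"aggs": group_by_terms("gender", metrics), "size": 0}
print(body)
```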
-
index
¶ Returns the index name.
Returns: string as the name
>>> df.index
people/children
-
limit
(num)¶ Limits the result count to the number specified.
-
orderby
(*cols)¶ Returns a new
DataFrame
object sorted by the specified column(s).
Parameters: cols – A list of column names, Column
or Sorter.
orderby() is an alias for sort().
>>> df.sort(df['age'].asc).collect()
[Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
-
print_debug
()¶ Posts the query to the Elasticsearch server and prints the result it returns.
-
print_schema
()¶ Prints out the schema in the tree format.
>>> df.print_schema()
index_name
|-- type_name
  |-- experience : {'type': 'integer'}
  |-- id : {'type': 'string'}
  |-- mobile : {'index': 'not_analyzed', 'type': 'string'}
  |-- regions : {'index': 'not_analyzed', 'type': 'string'}
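The tree can be thought of as a recursive walk over the `properties` of a mapping. The sketch below renders a small hand-written mapping in a similar layout; `format_schema` and the sample mapping are hypothetical, and the real output format may differ in detail.

```python
def format_schema(properties, depth=1):
    """Hypothetical helper: render mapping properties as indented tree lines."""
    lines = []
    prefix = "  " * depth + "|-- "
    for name in sorted(properties):
        spec = properties[name]
        if "properties" in spec:
            # Object fields carry their own nested properties.
            lines.append(prefix + name)
            lines.extend(format_schema(spec["properties"], depth + 1))
        else:
            lines.append("%s%s : %s" % (prefix, name, spec))
    return lines


# Hand-written sample mapping, not taken from a real index.
sample_properties = {
    "id": {"type": "string"},
    "address": {"properties": {"city": {"type": "string"}}},
}
print("\n".join(format_schema(sample_properties)))
```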
-
classmethod
resolve_mappings
(json_map)¶
-
resolve_schema
(json_prop, res_schema='', depth=1)¶
-
schema
¶ Returns the schema(mapping) of the index/type as a dictionary.
-
select
(*cols)¶ Projects a set of columns and returns a new
DataFrame.
Parameters: cols – list of column names or Column.
>>> df.filter(df['age'] < 25).select('name', 'age').collect()
[Row(age=12,name='Alice'), Row(age=11,name='Bob'), Row(age=13,name='Leo')]
-
show
(n=200, truncate=15)¶ Prints the first
n
rows to the console.
Parameters: - n – Number of rows to show.
- truncate – Number of words to be truncated for each column.
>>> df.filter(df['age'] < 25).select('name').show(3)
+------+
| name |
+------+
| Alice|
| Bob  |
| Leo  |
+------+
-
sort
(*cols)¶ Returns a new
DataFrame
object sorted by the specified column(s).
Parameters: cols – A list of column names, Column
or Sorter.
orderby() is an alias for sort().
>>> df.sort(df['age'].asc).collect()
[Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
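In the Elasticsearch query DSL, ordering is expressed with a `sort` list and a result cap with `size`. The sketch below builds both by hand; `sort_clause` is a hypothetical helper and is not taken from pandasticsearch.

```python
def sort_clause(*fields):
    """Hypothetical helper: accept 'age' or ('age', 'desc') style arguments."""
    clause = []
    for item in fields:
        # A bare field name defaults to ascending order.
        field, order = (item, "asc") if isinstance(item, str) else item
        if order not in ("asc", "desc"):
            raise ValueError("order must be 'asc' or 'desc'")
        clause.append({field: {"order": order}})
    return clause


body = {
    "query": {"match_all": {}},
    "sort": sort_clause("age", ("name", "desc")),
    # size plays the role of a row limit.
    "size": 3,
}
print(body["sort"])
```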
-
to_dict
()¶ Converts the current
DataFrame
object to an Elasticsearch search dictionary.
Returns: a dictionary which obeys the Elasticsearch RESTful protocol
-
to_pandas
()¶ Export to a Pandas DataFrame object.
Returns: The DataFrame representing the query result
>>> df[df['gender'] == 'male'].agg(Avg('age')).to_pandas()
   avg(age)
0        12
-
where
(condition)¶ Filters rows using a given condition.
where() is an alias for filter().
Parameters: condition – BooleanFilter
object or a string
>>> df.filter(df['age'] < 13).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]
-
pandasticsearch.errors module¶
-
exception
pandasticsearch.errors.
DataFrameException
(msg)¶
-
exception
pandasticsearch.errors.
NoSuchDependencyException
(msg)¶
-
exception
pandasticsearch.errors.
PandasticSearchException
(msg)¶ Bases:
exceptions.RuntimeError
-
exception
pandasticsearch.errors.
ParseResultException
(msg)¶
-
exception
pandasticsearch.errors.
ServerDefinedException
(msg)¶
pandasticsearch.operators module¶
pandasticsearch.queries module¶
-
class
pandasticsearch.queries.
Agg
¶ Bases:
pandasticsearch.queries.Query
-
explain_result
(result=None)¶
-
static
from_dict
(d)¶
-
index
¶
-
to_pandas
()¶ Export the current query result to a Pandas DataFrame object.
-
-
class
pandasticsearch.queries.
Query
¶ Bases:
_abcoll.MutableSequence
-
append
(value)¶ S.append(object) – append object to the end of the sequence
-
explain_result
(result=None)¶
-
insert
(index, value)¶ S.insert(index, object) – insert object before index
-
json
¶ Gets the original JSON representation returned by the Elasticsearch REST API.
Returns: The JSON string indicating the query result
Return type: string
-
millis_taken
¶
-
print_json
()¶
-
result
¶
-
to_pandas
()¶ Export the current query result to a Pandas DataFrame object.
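A search response from Elasticsearch keeps each matching document under `hits.hits[*]._source`. The sketch below flattens a hand-written sample response into a list of records, the shape a tabular export typically starts from; the sample data and the `hits_to_records` helper are illustrative assumptions, not the library's code.

```python
def hits_to_records(response):
    """Hypothetical helper: extract the document bodies from a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Hand-written sample in the shape of an Elasticsearch search response.
sample_response = {
    "took": 3,
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"name": "Alice", "age": 12}},
            {"_id": "2", "_source": {"name": "Bob", "age": 11}},
        ]
    },
}

records = hits_to_records(sample_response)
# Collect the union of field names to use as column headers.
columns = sorted({key for record in records for key in record})
print(columns)
print(records)
```

If pandas is installed, `pandas.DataFrame(records)` turns such a list of dictionaries into a DataFrame.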
-
-
class
pandasticsearch.queries.
ScrollSelect
(hits_generator)¶ Bases:
pandasticsearch.queries.Select
millis_taken and json are not supported for ScrollSelect.
-
result
¶
-
row_generator
()¶
-
to_pandas
()¶ Export the current query result to a Pandas DataFrame object.
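ScrollSelect is constructed from a hits generator, which suggests that results are consumed lazily, page by page. The sketch below shows that general pattern with an in-memory stand-in; `fake_pages` and `iter_hits` are hypothetical, and no Elasticsearch scroll API call is made.

```python
def iter_hits(pages):
    """Hypothetical helper: yield hits one at a time from an iterable of pages."""
    for page in pages:
        for hit in page:
            yield hit


def fake_pages():
    # Stand-in for successive scroll responses; each list is one page of hits.
    yield [{"_source": {"name": "Alice"}}, {"_source": {"name": "Bob"}}]
    yield [{"_source": {"name": "Leo"}}]


# Hits are pulled on demand, so only one page is held at a time.
names = [hit["_source"]["name"] for hit in iter_hits(fake_pages())]
print(names)
```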
-
pandasticsearch.types module¶
-
class
pandasticsearch.types.
Column
(field)¶ Bases:
object
-
asc
¶ Ascending
Sorter
Returns: Sorter
>>> df.orderby(df.age.asc)
-
avg
¶ Avg aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.avg)
-
cardinality
¶ Distinct count aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.cardinality)
>>> df.groupby(df.gender).agg(df.age.distinct_count)
-
count
¶ Value count aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.value_count)
-
date_interval
(interval, format='yyyy/MM/dd HH:mm:ss')¶ Returns a
Grouper
Parameters: - interval – A string indicating date interval
- format – Date format string
Returns: Grouper
>>> df.groupby(df.date_interval('1d'))
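A date interval grouper presumably corresponds to an Elasticsearch `date_histogram` aggregation. The sketch below builds one by hand; the field name `birthday` and the `date_histogram` helper are assumptions, and Elasticsearch 7 and later prefer `calendar_interval` or `fixed_interval` over the older `interval` key used here.

```python
def date_histogram(field, interval, format="yyyy/MM/dd HH:mm:ss"):
    """Hypothetical helper: bucket documents by a fixed date interval."""
    return {
        field: {
            "date_histogram": {
                "field": field,
                # Older-style key; newer clusters expect calendar_interval
                # or fixed_interval instead.
                "interval": interval,
                "format": format,
            }
        }
    }


body = {"aggs": date_histogram("birthday", "1d"), "size": 0}
print(body)
```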
-
desc
¶ Descending
Sorter
Returns: Sorter
>>> df.orderby(df.age.desc)
-
distinct_count
¶ Distinct count aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.cardinality)
>>> df.groupby(df.gender).agg(df.age.distinct_count)
-
extended_stats
¶ Extended stats aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.extended_stats)
-
field_name
()¶
-
isin
(values)¶ Returns a
BooleanFilter
Parameters: values – A list of values to filter terms
Returns: BooleanFilter
>>> df.filter(df.gender.isin(['male', 'female']))
-
isnull
¶ BooleanFilter
to indicate the null column value
Returns: BooleanFilter
-
like
(wildcard)¶ Returns a
BooleanFilter
Parameters: wildcard (str) – The wildcard to filter the column with.
Returns: BooleanFilter
>>> df.filter(df.name.like('A*'))
-
max
¶ Max aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.max)
-
min
¶ Min aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.min)
-
notnull
¶ BooleanFilter
to indicate the non-null column value
Returns: BooleanFilter
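Null checks are usually expressed in Elasticsearch with the `exists` query: a field is non-null when it exists, and null when a `bool` query with `must_not` wraps the same clause. The helpers below are hypothetical and only sketch that mapping; they are not the library's implementation.

```python
def notnull_query(field):
    """Hypothetical helper: match documents where field has a value."""
    return {"exists": {"field": field}}


def isnull_query(field):
    """Hypothetical helper: match documents where field is missing or null."""
    return {"bool": {"must_not": notnull_query(field)}}


print(notnull_query("name"))
print(isnull_query("name"))
```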
-
percentile_ranks
¶ Percentile ranks aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.percentile_ranks)
-
percentiles
¶ Percentile aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.percentiles)
-
ranges
(values)¶ Returns a
Grouper
Parameters: values – A list of numeric values
Returns: Grouper
>>> df.groupby(df.age.ranges([10,12,14]))
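One plausible reading of `ranges([10, 12, 14])` is that consecutive boundaries form half-open buckets. The sketch below turns a boundary list into an Elasticsearch `range` aggregation on that assumption; `range_buckets` is hypothetical, and the library may pair the boundaries differently.

```python
def range_buckets(field, values):
    """Hypothetical helper: pair consecutive boundaries into range buckets."""
    if len(values) < 2:
        raise ValueError("at least two boundary values are needed")
    # In a range aggregation, "from" is inclusive and "to" is exclusive.
    ranges = [{"from": low, "to": high} for low, high in zip(values, values[1:])]
    return {"range": {"field": field, "ranges": ranges}}


agg = range_buckets("age", [10, 12, 14])
print(agg)
```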
-
rlike
(regexp)¶ Returns a
BooleanFilter
Parameters: regexp (str) – The regular expression to filter the column with.
Returns: BooleanFilter
>>> df.filter(df.name.rlike('A.l.e'))
-
startswith
(substr)¶ Returns a
BooleanFilter
Parameters: substr (str) – The substring to filter the column with.
Returns: BooleanFilter
>>> df.filter(df.name.startswith('Al'))
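The three string filters documented here (`like`, `rlike`, `startswith`) line up with the Elasticsearch `wildcard`, `regexp` and `prefix` queries. The sketch below builds each clause by hand; `string_filter` is a hypothetical helper, and the mapping to pandasticsearch internals is an assumption.

```python
# Assumed correspondence between Column methods and Elasticsearch query types.
_QUERY_KINDS = {"like": "wildcard", "rlike": "regexp", "startswith": "prefix"}


def string_filter(kind, field, pattern):
    """Hypothetical helper: build a term-level string query clause."""
    if kind not in _QUERY_KINDS:
        raise ValueError("unknown filter kind: %r" % kind)
    return {_QUERY_KINDS[kind]: {field: pattern}}


for kind, pattern in [("like", "A*"), ("rlike", "A.l.e"), ("startswith", "Al")]:
    print(string_filter(kind, "name", pattern))
```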
-
stats
¶ Stats aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.stats)
-
sum
¶ Sum aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.sum)
-
terms
(limit=20, include=None, exclude=None)¶ Returns a
Grouper
Parameters: - limit – limit the number of terms to be aggregated (default 20)
- include – the exact term to be included
- exclude – the exact term to be excluded
Returns: Grouper
>>> df.groupby(df.age.terms(limit=10, include=[1, 2, 3]))
-
value_count
¶ Value count aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.value_count)
-
-
class
pandasticsearch.types.
Row
¶ Bases:
tuple
The builtin
DataFrame
row type for accessing records before they are converted into a Pandas DataFrame. The fields will be sorted by name.
>>> row = Row(name="Alice", age=12)
>>> row
Row(age=12, name='Alice')
>>> row['name'], row['age']
('Alice', 12)
>>> row.name, row.age
('Alice', 12)
>>> 'name' in row
True
>>> 'wrong_key' in row
False
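To make the described behavior concrete, the sketch below is a small stand-in class that reproduces it: a tuple whose fields are sorted by name and reachable by key, by attribute, and through `in`. It is written for illustration only and is not the pandasticsearch implementation.

```python
class Row(tuple):
    """Stand-in for illustration only; not the pandasticsearch implementation."""

    def __new__(cls, **fields):
        names = sorted(fields)  # fields are stored sorted by name
        row = super().__new__(cls, (fields[name] for name in names))
        row._names = names
        return row

    def __getitem__(self, key):
        # String keys look up by field name; integers and slices by position.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._names.index(key))
        return tuple.__getitem__(self, key)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except ValueError:
            raise AttributeError(name)

    def __contains__(self, key):
        # Membership tests check field names, not values.
        return key in self._names

    def as_dict(self):
        return dict(zip(self._names, self))

    def __repr__(self):
        pairs = zip(self._names, self)
        return "Row(%s)" % ", ".join("%s=%r" % pair for pair in pairs)


row = Row(name="Alice", age=12)
print(row, row["name"], row.age, "name" in row)
```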
-
as_dict
()¶
-