pandasticsearch package¶
Submodules¶
pandasticsearch.client module¶
-
class
pandasticsearch.client.RestClient(host, username=None, password=None, verify_ssl=True)¶ Bases: object
RestClient talks to an Elasticsearch cluster through its native RESTful API.
-
get(path, params=None)¶ Sends a GET request to Elasticsearch.
Parameters: - path – Path of the verb and resource
- params (optional) – Dictionary to be sent in the query string.
Returns: The response as a dictionary.
>>> from pandasticsearch import RestClient
>>> client = RestClient('http://host:port')
>>> print(client.get('index_name/_search'))
-
post(path, data, params=None)¶ Sends a POST request to Elasticsearch.
Parameters: - path – The path of the verb and resource, e.g. “/index_name/_search”
- data – The json data to send in the body of the request.
- params (optional) – Dictionary to be sent in the query string.
Returns: The response as a dictionary.
>>> from pandasticsearch import RestClient
>>> client = RestClient('http://host:port')
>>> print(client.post(path='index/_search', data={"query": {"match_all": {}}}))
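The way path, data, and params combine into a single HTTP request can be sketched with the standard library alone. This is an illustration under assumptions, not RestClient's implementation: build_search_request is a hypothetical helper, the size parameter is only an example, and the request is assembled but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(host, path, data=None, params=None):
    """Hypothetical helper: assemble (but do not send) an HTTP request."""
    # Join host and path with exactly one slash between them.
    url = host.rstrip('/') + '/' + path.lstrip('/')
    # params become the query string.
    if params:
        url += '?' + urlencode(params)
    # data becomes the JSON body; a body implies POST, no body implies GET.
    body = json.dumps(data).encode('utf-8') if data is not None else None
    method = 'POST' if body is not None else 'GET'
    return Request(url, data=body, method=method,
                   headers={'Content-Type': 'application/json'})


req = build_search_request('http://localhost:9200', 'index/_search',
                           data={"query": {"match_all": {}}},
                           params={'size': 10})
print(req.get_method(), req.full_url)
print(req.data.decode('utf-8'))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before any call reaches the cluster.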
-
pandasticsearch.dataframe module¶
-
class
pandasticsearch.dataframe.DataFrame(**kwargs)¶ Bases: object
A DataFrame treats index and documents in Elasticsearch as named columns and rows.
>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es(url='http://host:port', index='people')
Customizing the Elasticsearch endpoint:
>>> from pandasticsearch import DataFrame
>>> from pandasticsearch.client import RestClient
>>> df = DataFrame(client=RestClient('http://host:port'), index='people')
It can be converted to a Pandas object for subsequent analysis:
>>> df.to_pandas()
-
agg(*aggs)¶ Aggregate on the entire DataFrame without groups.
Parameters: aggs – a list of Aggregator objects
>>> df[df['gender'] == 'male'].agg(df['age'].avg).collect()
[Row(avg(age)=12)]
-
collect()¶ Returns all the records as a list of Row.
Returns: list of Row
>>> df.collect()
[Row(age=2, name='Alice'), Row(age=5, name='Bob')]
-
columns¶ Returns all column names as a list.
Returns: column names as a list
>>> df.columns
['age', 'name']
-
count()¶ Returns a list of numbers indicating the count for each group.
>>> df.groupby(df.gender).count()
[2, 1]
-
filter(condition)¶ Filters rows using a given condition.
where() is an alias for filter().
Parameters: condition – BooleanFilter object or a string
>>> df.filter(df['age'] < 13).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]
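A condition such as df['age'] < 13 ultimately has to be expressed in the Elasticsearch query DSL. The sketch below shows one plausible request body for it, assuming Elasticsearch 2.0 or later (bool query with a filter clause). range_filter and search_body are hypothetical helpers, and the dictionary the library actually builds may be shaped differently.

```python
import json


def range_filter(field, op, value):
    """Hypothetical helper: build an Elasticsearch range clause."""
    ops = {'<': 'lt', '<=': 'lte', '>': 'gt', '>=': 'gte'}
    return {'range': {field: {ops[op]: value}}}


def search_body(filter_clause, size=None):
    """Hypothetical helper: wrap a filter clause in a bool query."""
    body = {'query': {'bool': {'filter': [filter_clause]}}}
    if size is not None:
        body['size'] = size
    return body


# Roughly what "age < 13" looks like as a search request body.
body = search_body(range_filter('age', '<', 13))
print(json.dumps(body, sort_keys=True))
```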
-
static
from_es(**kwargs)¶ Creates a DataFrame object by providing the URL of an Elasticsearch node and the name of the index.
Parameters: - url (str) – URL of the node connected to (default: ‘http://localhost:9200’)
- index (str) – The name of the index
- doc_type (str) – The type of the document
- compat (str) – The compatible ES version (an integer number)
Returns: DataFrame object for accessing the index
Return type: DataFrame
>>> from pandasticsearch import DataFrame
>>> df = DataFrame.from_es(url='http://host:port', index='people')
-
groupby(*cols)¶ Returns a new DataFrame object grouped by the specified column(s).
Parameters: cols – A list of column names, Column or Grouper objects
-
index¶ Returns the index name.
Returns: string as the name
>>> df.index
people/children
-
limit(num)¶ Limits the result count to the number specified.
-
orderby(*cols)¶ Returns a new DataFrame object sorted by the specified column(s).
Parameters: cols – A list of column names, Column or Sorter.
orderby() is an alias for sort().
>>> df.sort(df['age'].asc).collect()
[Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
-
print_debug()¶ Posts the query to the Elasticsearch server and prints out the result it returned.
-
print_schema()¶ Prints out the schema in the tree format.
>>> df.print_schema()
index_name
|-- type_name
  |-- experience : {'type': 'integer'}
  |-- id : {'type': 'string'}
  |-- mobile : {'index': 'not_analyzed', 'type': 'string'}
  |-- regions : {'index': 'not_analyzed', 'type': 'string'}
-
classmethod
resolve_mappings(json_map)¶
-
resolve_schema(json_prop, res_schema='', depth=1)¶
-
schema¶ Returns the schema (mapping) of the index/type as a dictionary.
-
select(*cols)¶ Projects a set of columns and returns a new DataFrame.
Parameters: cols – list of column names or Column.
>>> df.filter(df['age'] < 25).select('name', 'age').collect()
[Row(age=12,name='Alice'), Row(age=11,name='Bob'), Row(age=13,name='Leo')]
-
show(n=200, truncate=15)¶ Prints the first n rows to the console.
Parameters: - n – Number of rows to show.
- truncate – Number of words to be truncated for each column.
>>> df.filter(df['age'] < 25).select('name').show(3)
+------+
| name |
+------+
| Alice|
| Bob  |
| Leo  |
+------+
-
sort(*cols)¶ Returns a new DataFrame object sorted by the specified column(s).
Parameters: cols – A list of column names, Column or Sorter.
orderby() is an alias for sort().
>>> df.sort(df['age'].asc).collect()
[Row(age=11,name='Bob'), Row(age=12,name='Alice'), Row(age=13,name='Leo')]
-
to_dict()¶ Converts the current DataFrame object to an Elasticsearch search dictionary.
Returns: a dictionary which obeys the Elasticsearch RESTful protocol
-
to_pandas()¶ Export to a Pandas DataFrame object.
Returns: The DataFrame representing the query result
>>> df[df['gender'] == 'male'].agg(Avg('age')).to_pandas()
   avg(age)
0        12
-
where(condition)¶ Filters rows using a given condition.
where() is an alias for filter().
Parameters: condition – BooleanFilter object or a string
>>> df.filter(df['age'] < 13).collect()
[Row(age=12,gender='female',name='Alice'), Row(age=11,gender='male',name='Bob')]
-
pandasticsearch.errors module¶
-
exception
pandasticsearch.errors.DataFrameException(msg)¶
-
exception
pandasticsearch.errors.NoSuchDependencyException(msg)¶
-
exception
pandasticsearch.errors.PandasticSearchException(msg)¶ Bases: exceptions.RuntimeError
-
exception
pandasticsearch.errors.ParseResultException(msg)¶
-
exception
pandasticsearch.errors.ServerDefinedException(msg)¶
pandasticsearch.operators module¶
pandasticsearch.queries module¶
-
class
pandasticsearch.queries.Agg¶ Bases: pandasticsearch.queries.Query
-
explain_result(result=None)¶
-
static
from_dict(d)¶
-
index¶
-
to_pandas()¶ Export the current query result to a Pandas DataFrame object.
-
-
class
pandasticsearch.queries.Query¶ Bases: _abcoll.MutableSequence
-
append(value)¶ S.append(object) – append object to the end of the sequence
-
explain_result(result=None)¶
-
insert(index, value)¶ S.insert(index, object) – insert object before index
-
json¶ Gets the original JSON representation returned by the Elasticsearch REST API.
Returns: The JSON string indicating the query result
Return type: string
-
millis_taken¶
-
print_json()¶
-
result¶
-
to_pandas()¶ Export the current query result to a Pandas DataFrame object.
-
-
class
pandasticsearch.queries.ScrollSelect(hits_generator)¶ Bases: pandasticsearch.queries.Select
millis_taken/json not supported for ScrollSelect
-
result¶
-
row_generator()¶
-
to_pandas()¶ Export the current query result to a Pandas DataFrame object.
-
pandasticsearch.types module¶
-
class
pandasticsearch.types.Column(field)¶ Bases: object
-
asc¶ Ascending Sorter
Returns: Sorter
>>> df.orderby(df.age.asc)
-
avg¶ Avg aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.avg)
-
cardinality¶ Distinct aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.cardinality)
>>> df.groupby(df.gender).agg(df.age.distinct_count)
-
count¶ Value count aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.value_count)
-
date_interval(interval, format='yyyy/MM/dd HH:mm:ss')¶ Returns a Grouper
Parameters: - interval – A string indicating date interval
- format – Date format string
Returns: Grouper
>>> df.groupby(df.date_interval('1d'))
-
desc¶ Descending Sorter
Returns: Sorter
>>> df.orderby(df.age.desc)
-
distinct_count¶ Distinct aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.cardinality)
>>> df.groupby(df.gender).agg(df.age.distinct_count)
-
extended_stats¶ Extended stats aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.extended_stats)
-
field_name()¶
-
isin(values)¶ Returns a BooleanFilter
Parameters: values – A list of values to filter terms
Returns: BooleanFilter
>>> df.filter(df.gender.isin(['male', 'female']))
-
isnull¶ BooleanFilter to indicate the null column value
Returns: BooleanFilter
-
like(wildcard)¶ Returns a BooleanFilter
Parameters: wildcard (str) – The wildcard to filter the column with.
Returns: BooleanFilter
>>> df.filter(df.name.like('A*'))
-
max¶ Max aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.max)
-
min¶ Min aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.min)
-
notnull¶ BooleanFilter to indicate the non-null column value
Returns: BooleanFilter
-
percentile_ranks¶ Percentile ranks aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.percentile_ranks)
-
percentiles¶ Percentile aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.percentiles)
-
ranges(values)¶ Returns a Grouper
Parameters: values – A list of numeric values
Returns: Grouper
>>> df.groupby(df.age.ranges([10,12,14]))
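To make the bucketing idea concrete, the sketch below turns a list of boundaries into an Elasticsearch range aggregation, where each bucket includes its from value and excludes its to value. ranges_agg and the aggregation name age_ranges are hypothetical, and pairing adjacent boundaries is an assumption about how values is interpreted; the library may bucket differently, for example by adding open-ended ranges.

```python
def ranges_agg(field, boundaries):
    """Hypothetical helper: turn sorted boundaries into a range aggregation."""
    # Pair each boundary with the next one: [10, 12, 14] -> (10, 12), (12, 14).
    buckets = [{'from': lo, 'to': hi}
               for lo, hi in zip(boundaries, boundaries[1:])]
    return {'range': {'field': field, 'ranges': buckets}}


# size=0 asks Elasticsearch for aggregation results only, with no hits.
body = {'size': 0, 'aggs': {'age_ranges': ranges_agg('age', [10, 12, 14])}}
print(body)
```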
-
rlike(regexp)¶ Returns a BooleanFilter
Parameters: regexp (str) – The regular expression to filter the column with.
Returns: BooleanFilter
>>> df.filter(df.name.rlike('A.l.e'))
-
startswith(substr)¶ Returns a BooleanFilter
Parameters: substr (str) – The sub string to filter the column with.
Returns: BooleanFilter
>>> df.filter(df.name.startswith('Al'))
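The three string filters (like, rlike, startswith) have natural counterparts in the Elasticsearch query DSL: the wildcard, regexp, and prefix queries. The sketch below builds those clauses directly. The helper names are hypothetical, and the mapping from each Column method to a specific query type is an assumption rather than something this reference confirms.

```python
def wildcard_clause(field, pattern):
    """Hypothetical helper: wildcard query, e.g. 'A*'."""
    return {'wildcard': {field: pattern}}


def regexp_clause(field, pattern):
    """Hypothetical helper: regular-expression query, e.g. 'A.l.e'."""
    return {'regexp': {field: pattern}}


def prefix_clause(field, prefix):
    """Hypothetical helper: prefix query, e.g. 'Al'."""
    return {'prefix': {field: prefix}}


# One clause per string filter documented for Column.
clauses = [
    wildcard_clause('name', 'A*'),
    regexp_clause('name', 'A.l.e'),
    prefix_clause('name', 'Al'),
]
for clause in clauses:
    print(clause)
```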
-
stats¶ Stats aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.stats)
-
sum¶ Sum aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.sum)
-
terms(limit=20, include=None, exclude=None)¶ Returns a Grouper
Parameters: - limit – limit the number of terms to be aggregated (default 20)
- include – the exact term to be included
- exclude – the exact term to be excluded
Returns: Grouper
>>> df.groupby(df.age.terms(limit=10, include=[1, 2, 3]))
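Grouping by terms and then aggregating corresponds to a terms aggregation with a nested metric in Elasticsearch. The sketch below assembles such a request body by hand. terms_grouper and grouped_body are hypothetical helpers, the aggregation names by_gender and avg_age are arbitrary, and mapping limit to the aggregation's size option is an assumption.

```python
def terms_grouper(field, limit=20, include=None, exclude=None):
    """Hypothetical helper: build a terms aggregation clause."""
    clause = {'field': field, 'size': limit}
    if include is not None:
        clause['include'] = include
    if exclude is not None:
        clause['exclude'] = exclude
    return {'terms': clause}


def grouped_body(name, grouper, metrics):
    """Hypothetical helper: nest metric aggregations under a grouping."""
    # size=0 suppresses the hits so only aggregation buckets come back.
    return {'size': 0, 'aggs': {name: dict(grouper, aggs=metrics)}}


body = grouped_body(
    'by_gender',
    terms_grouper('gender', limit=10, include=['male', 'female']),
    {'avg_age': {'avg': {'field': 'age'}}},
)
print(body)
```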
-
value_count¶ Value count aggregator
Returns: Aggregator
>>> df.groupby(df.gender).agg(df.age.value_count)
-
-
class
pandasticsearch.types.Row¶ Bases: tuple
The built-in DataFrame row type for accessing records before they are converted into a Pandas DataFrame. The fields will be sorted by names.
>>> row = Row(name="Alice", age=12)
>>> row
Row(age=12, name='Alice')
>>> row['name'], row['age']
('Alice', 12)
>>> row.name, row.age
('Alice', 12)
>>> 'name' in row
True
>>> 'wrong_key' in row
False
-
as_dict()¶
-
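The Row behaviour described above (fields sorted by name, access by key or attribute, membership tests on field names) can be reproduced with a small tuple subclass. SketchRow below is an illustrative stand-in written for this page, not the library's Row implementation, and its internals are assumptions.

```python
class SketchRow(tuple):
    """Illustrative stand-in for a Row-like type (not the library's code)."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # fields are stored sorted by name
        row = super().__new__(cls, (kwargs[n] for n in names))
        row._fields = names
        return row

    def __getitem__(self, key):
        # String keys look up by field name; anything else is positional.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._fields.index(key))
        return tuple.__getitem__(self, key)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self[name]
        except ValueError:
            raise AttributeError(name)

    def __contains__(self, key):
        # Membership tests check field names, not values.
        return key in self._fields

    def as_dict(self):
        return dict(zip(self._fields, self))

    def __repr__(self):
        pairs = ', '.join('%s=%r' % (n, v) for n, v in zip(self._fields, self))
        return 'SketchRow(%s)' % pairs


row = SketchRow(name="Alice", age=12)
print(row, row['name'], row.age, 'name' in row, row.as_dict())
```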