Contact: Chris Garrity | U.S. Geological Survey | cgarrity@usgs.gov
Edits: Jeremy Ray | U.S. Geological Survey | jray@usgs.gov
Database: U.S. Large-Scale Solar Photovoltaic Database | API Access: USPVDB API
The United States Large-Scale Solar Photovoltaic Database (USPVDB) provides the locations and array boundaries of U.S. ground-mounted photovoltaic (PV) facilities with capacity of 1 megawatt or more. It includes corresponding PV facility information, including panel type, site type, and initial year of operation. The database combines datasets from the U.S. Energy Information Administration (EIA), the Environmental Protection Agency (EPA), and the National Renewable Energy Laboratory (NREL). The locations and array boundaries of all facilities were visually verified and digitized to within 10 meters using high-resolution aerial imagery. The USPVDB is available for download in a variety of tabular and geospatial file formats to meet a range of user/software needs. In the following examples, we'll be accessing the large-scale solar photovoltaic data through the USPVDB API. Accessing raw data through an API lets users stay in sync with the database without the need to download static versions of the data. Learn more about the USPVDB and USPVDB API at https://eerscmap.usgs.gov/uspvdb/.
The following Jupyter Notebook examples are targeted for users who are new to Jupyter and notebook environments in general. A notebook integrates code and code output into a single document that combines visualizations, narrative text, mathematical equations, and other media types. This type of workflow promotes iterative and efficient development, making notebooks an increasingly popular choice for contemporary data science and analysis. Throughout this notebook we'll provide exhaustive narrative text for each step, tailored to those just starting development in the Jupyter Notebook environment. Learn more about Project Jupyter.
The examples in this notebook require the installation of two additional Python packages. These packages can easily be installed using `pip`, a well-known standard package manager for Python*. `pip` allows you to install and manage additional packages that are not part of the Python standard library. The Python installer installs `pip` by default, so it should be ready for you to use.
pip install mapboxgl
`mapboxgl` allows you to build Mapbox GL JS data-driven visualizations natively in Jupyter Notebooks. Mapbox GL JS is a high-performance, interactive, WebGL-based data visualization tool that leverages Mapbox Vector Tiles. Learn more about the Mapbox platform at https://www.mapbox.com/maps/.
pip install pandas
`pandas` provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. In the following examples we will be leveraging `pandas.DataFrame`, generally the most commonly used `pandas` object. A dataframe is a 2-dimensional labeled data structure with columns of potentially different types. You can think of a dataframe like a traditional spreadsheet or SQL table, composed of rows and columns supporting a variety of tabular data types. `pandas` provides several methods for reading data in different formats. In these examples, we'll request our source data through the USPVDB API, which returns raw data in JSON format using standard HTTP protocols. Learn more about `pandas` at https://pandas.pydata.org/pandas-docs/stable/index.html.
* `conda` is another widely used packaging tool/installer that, unlike `pip`, handles library dependencies outside of strictly Python packages. The `conda` package and environment manager is included in all versions of Anaconda and Miniconda. Those using `conda` can simply swap `pip` with `conda` for the installs above. Learn more about `conda` at https://docs.conda.io/en/latest/.
Python provides a flexible framework for importing modules and specific members of a module. In the examples in this notebook, we'll import `pandas` and give it the alias `pd`. We'll also import the `viz` submodule and the `utils` submodule from the `mapboxgl` package, as well as operating system dependent functionality via the `os` import.
import os
import pandas as pd
from mapboxgl.viz import *
from mapboxgl.utils import *
Map clustering algorithms typically find map markers (points) that are near each other and denote them with a cluster symbol representing the overall density of aggregated map markers. By default, the new symbols are labeled with the number of map markers they contain. We can apply symbol scaling and custom color ramps to the rendered cluster symbols to better visualize the density of the dataset in our map window. As we zoom in, the algorithm recalibrates clustering on the fly based on the number of markers in our map view. Map clustering can be a powerful visualization tool when mapping large numbers of markers and helps users visualize patterns of points without the traditional issues of marker overlap. In this example, we'll build a simple cluster map to visualize the locations of solar photovoltaic projects at a national scale throughout the United States. A cluster map becomes a useful tool to help us visualize the overall distribution and density of solar photovoltaic projects when zoomed out to a national level.
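The core idea behind clustering can be sketched with a toy grid-based counter: bucket nearby (longitude, latitude) points into fixed cells and report a count per cell. This is only an illustration of the concept; the actual clustering performed by Mapbox GL is hierarchical and recalculated per zoom level.

```python
from collections import Counter

def grid_clusters(points, cell_deg=5.0):
    """Toy clustering sketch: bucket (lon, lat) pairs into
    cell_deg x cell_deg grid cells and count markers per cell.
    Illustrative only -- real map clustering is zoom-aware."""
    return dict(Counter(
        (int(lon // cell_deg), int(lat // cell_deg)) for lon, lat in points
    ))

# Three nearby points fall into one cell; one distant point gets its own.
pts = [(-92.9, 44.5), (-92.8, 44.6), (-92.0, 44.1), (-78.1, 36.4)]
clusters = grid_clusters(pts)
print(clusters)  # two cells: one with count 3, one with count 1
```

Shrinking `cell_deg` mimics zooming in: clusters break apart into smaller groups, which is exactly the on-the-fly recalibration described above.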
The `mapboxgl` package leverages a public Mapbox `token` to access Mapbox-hosted basemap styles. To avoid requiring users of this notebook to have a Mapbox account, we'll add a USGS-hosted vector tile basemap from the National Geologic Map Database (NGMDB) to our notebook. We can call custom vector tile styles using the `style` parameter when we generate our map visualization and simply omit the `token` parameter. Below, we'll pass two styles from the NGMDB as variables to use in our notebook exercises. For those with an existing Mapbox account, swap the `style` parameter in the examples below with your Mapbox `token` and Mapbox `style`. Learn more about the NGMDB at https://ngmdb.usgs.gov.
# NGMDB monochrome style designed to provide a basemap that highlights the data overlay.
ngmdbLight = "https://ngmdb-tiles.usgs.gov/styles/ngmdb-light/style.json"
# NGMDB full-color style that contains standard cartographic basemap layers, contour lines, and hillshading.
ngmdbBasemap = "https://ngmdb-tiles.usgs.gov/styles/ngmdb-tv/style.json"
As noted previously, the USPVDB API allows for programmatic access to the U.S. Large-Scale Solar Photovoltaic Database by the USGS and partner agencies. The USPVDB API was created to extend USPVDB visibility, expand its user base, and create more productive internal workflows. The API supports filtering table rows by appending specific attributes, the filter operator, and the filter value to the request. Filters can exclude table rows using simple operators that compare against specified key values. Applying filters to the request allows for more efficient, faster API responses because unneeded data is withheld by the server prior to API return. This is particularly useful when users are interested in a subset of data from the USPVDB. See additional USPVDB API filter operators at https://eerscmap.usgs.gov/uspvdb/api-doc/#operators.
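To make the filter pattern concrete, here is a small, hypothetical helper (not part of the USPVDB tooling) that composes a request URL from `column: "operator.value"` filters and a column list using only the Python standard library:

```python
from urllib.parse import urlencode

API_ROOT = "https://eersc.usgs.gov/api/uspvdb/v1/projects"

def build_request_url(filters, select=None):
    """Compose a USPVDB API request URL from a dict of
    {column: 'operator.value'} filters (e.g., 'gt.0', 'eq.NC')
    plus an optional list of columns to return.
    Illustrative helper only -- the notebook builds these URLs by hand."""
    params = dict(filters)
    if select:
        params["select"] = ",".join(select)
    # keep ',' and ':' literal so the API sees select=a,b and Alias:col
    return API_ROOT + "?" + urlencode(params, safe=",:")

url = build_request_url(
    {"p_state": "eq.NC", "p_cap_ac": "gt.0"},
    select=["case_id", "p_cap_ac", "xlong", "ylat"],
)
print(url)
```

The printed URL matches the hand-written query strings used throughout this notebook, so either approach works.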
In the first example, we'll make a customized HTTP request to the API and return solar photovoltaic projects that (1) have a capacity greater than 0 MW to exclude any zero or null capacity values and (2) limit the project attributes in the response to `case_id` (unique ID), `p_cap_ac` (AC capacity), `p_state` (state), `xlong` (longitude), and `ylat` (latitude). This is done by appending the URL parameters `?&p_cap_ac=gt.0&select=case_id,p_cap_ac,p_state,xlong,ylat` to the root-level API https://eersc.usgs.gov/api/uspvdb/v1/projects/. Once a successful request is made, we'll parse the JSON response and preview the first 5 records of the `pandas.DataFrame`.
Note: There are many more attributes related to the USPVDB that can be leveraged in the API request. Feel free to experiment with the URL parameters to build your own custom maps using the USPVDB.
# Call the USPVDB API and apply custom URL parameters to the request. Parameters allow us to filter the data return.
data_url = "https://eersc.usgs.gov/api/uspvdb/v1/projects?&p_cap_ac=gt.0&select=case_id,p_cap_ac,p_state,xlong,ylat"
# Parse the JSON response from the API return and populate the dataframe
dfClusterMap = pd.read_json(data_url)
# Preview the first five records of our dataframe based on the custom URL parameters in the API request
dfClusterMap.head(5)
| | case_id | p_cap_ac | p_state | xlong | ylat |
|---|---|---|---|---|---|
| 0 | 402964 | 5.00 | MN | -93.7884 | 45.2575 |
| 1 | 402893 | 5.00 | MN | -92.9717 | 44.5420 |
| 2 | 402892 | 5.00 | MN | -92.9718 | 44.5271 |
| 3 | 402796 | 6.62 | MN | -93.2724 | 45.4585 |
| 4 | 402774 | 4.80 | MN | -91.7913 | 44.1149 |
The `mapboxgl` package supports both vector tile sources and the GeoJSON format for rendering map visualizations. GeoJSON is a common, open-standard geospatial data interchange format based on JSON. It's designed for representing geographical features, along with their non-spatial attributes and spatial extents. Learn more about the GeoJSON format.
The conversion from our `pandas.DataFrame` to GeoJSON is handled by `df_to_geojson`*. There are a variety of parameters we can pass to the function, but for the scope of this example, we'll stick to (1) passing the dataframe columns (attributes) to be included in our GeoJSON object, (2) defining the precision of the project latitude/longitude values, and (3) mapping the names of the dataframe columns to the required latitude and longitude parameters of the function.
* There are a variety of other geospatial extensions like `GeoPandas` to make working with geospatial data in Python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Learn more about the `GeoPandas` project https://geopandas.org/index.html.
# Create GeoJSON object with selected attributes. Define coordinates to three decimal places. Maps require lat/lon.
projectClusterGeoJson = df_to_geojson(
dfClusterMap,
properties=["case_id", "p_cap_ac","p_state"],
precision=3,
lat="ylat",
lon="xlong",
)
projectClusterGeoJson
{"features": [{"geometry": {"coordinates": [-93.788, 45.258], "type": "Point"}, "properties": {"case_id": 402964, "p_cap_ac": 5.0, "p_state": "MN"}, "type": "Feature"}, {"geometry": {"coordinates": [-92.972, 44.542], "type": "Point"}, "properties": {"case_id": 402893, "p_cap_ac": 5.0, "p_state": "MN"}, "type": "Feature"}, {"geometry": {"coordinates": [-92.972, 44.527], "type": "Point"}, "properties": {"case_id": 402892, "p_cap_ac": 5.0, "p_state": "MN"}..."type": "Feature"}], "type": "FeatureCollection"}
To create the project cluster map, we'll create color 'stops' (or cutoffs) based on the density of the project locations (proximity to one another). For our cluster map, we'll apply a 7-step diverging color ramp from ColorBrewer (diverging color schemes highlight the largest and smallest ranges) and create stops for proximity bins with counts of 1, 5, 25, 50, 100, 500, and 1000. Next, we'll define the sizes of our cluster markers. Finally, we'll call `ClusteredCircleViz` and apply our color ramp along with some custom parameters for our visualization. In the code cell below, we provide a brief explanation for each custom parameter used for rendering our cluster map. An exhaustive list of parameters can be found in the `mapboxgl-jupyter` documentation.
# Define our color stops based on a 7-step diverging color ramp
project_color_stops = create_color_stops([1, 5, 25, 50, 100, 500, 1000], colors="Spectral")
# Define the radius (sizes) of the cluster markers
project_radius_stops = [[1, 5], [50, 10], [100, 15], [1000, 20]]
# Define the parameters for our cluster map
# Call our NGMDB style as basemap, set the max zoom level for clusters to show
# Set cluster label size and cluster symbol opacity
# Handle initial zoom/center of visualization
projectClusterMap = ClusteredCircleViz(projectClusterGeoJson,
style=ngmdbLight,
color_stops=project_color_stops,
radius_stops=project_radius_stops,
cluster_maxzoom=10,
label_size=10,
opacity=0.6,
center=(-95, 40),
zoom=3.25)
projectClusterMap.show()
Graduated symbols show a quantitative difference between mapped elements by varying the size of the map markers. Attribute values are classified into ranges, and each range is assigned a symbol size representing it. Symbol size is an effective way to represent differences in magnitude of a selected attribute, because larger markers are naturally associated with a greater amount of something. Graduated symbols give you control over the size of each symbol within its bin range and, unlike proportional symbols, are not scaled directly to the absolute minimum and maximum of the attribute values.
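The binning logic behind graduated symbols can be sketched in a few lines of plain Python. The `radius_for` helper below is illustrative only (not part of `mapboxgl`), using step-style bins of `[threshold, radius]` pairs like the radius stops defined later in this example:

```python
def radius_for(value, radius_bins):
    """Return a marker radius for a value using graduated-symbol bins:
    each [threshold, radius] pair applies to values >= threshold, and
    the highest matching threshold wins. Step-style sketch only --
    mapboxgl can also interpolate smoothly between stops."""
    radius = radius_bins[0][1]
    for threshold, r in radius_bins:
        if value >= threshold:
            radius = r
    return radius

# Capacity (MW) -> marker radius, mirroring the bins used below
bins = [[0, 0], [1, 3], [15, 6], [25, 9], [50, 12]]
print(radius_for(5, bins))   # 3  (falls in the 1-15 MW bin)
print(radius_for(30, bins))  # 9  (falls in the 25-50 MW bin)
```

Because each bin's radius is set explicitly, a 30 MW project is not drawn six times larger than a 5 MW one, which is the granularity advantage over proportional symbols described above.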
In this example, we'll call the USPVDB API with some advanced parameters to fine-tune the data returned to the dataframe. We'll then render a graduated symbol map of projects in North Carolina with symbol sizes based on solar project capacity values (in MW). We'll also apply a color scheme to our graduated symbol map based on project area (in m²), effectively visualizing the relationship between project capacity and project area.
Before we build the map visualization, we'll preview USPVDB data for North Carolina in some simple plots using the `pandas` plotting API. The plots will help us visualize relationships between a variety of project attributes like project year, capacity, and area.
In this example, we'll make another customized HTTP request to the API and return projects that (1) are located only in North Carolina, (2) have `p_cap_ac` (capacity) values that are not `null`, and (3) limit the solar photovoltaic project attributes in the response to `p_cap_ac` (which we will cast as "Capacity"). This is done by appending the URL parameters `?&p_state=eq.NC&p_cap_ac=not.is.null&select=Capacity:p_cap_ac` to the root-level API https://eersc.usgs.gov/api/uspvdb/v1/projects/. Once a successful request is made, we'll parse the JSON response and generate a histogram showing the frequency of project capacities for projects in North Carolina.
Note: There are other operators related to the USPVDB that can be leveraged in the API request. Feel free to experiment with other URL operators to build your own custom plots using the USPVDB.
# Call the USPVDB API and apply custom URL parameters to the request
ncCapHist_url = 'https://eersc.usgs.gov/api/uspvdb/v1/projects?&p_state=eq.NC&p_cap_ac=not.is.null&select=Capacity:p_cap_ac'
# Parse the JSON response from the API return and populate the dataframe
capHist = pd.read_json(ncCapHist_url)
# Display the number of projects in our API return
display(capHist.count())
# Preview the first 5 records of the return. Data should only include the single attribute "Capacity" as defined by our API request
display(capHist.head(5))
# Generate a histogram showing frequencies of solar project capacities. Include number of bins, and size of the plot
capHist.plot.hist(bins=25, figsize=(10,5))
Capacity    651
dtype: int64
| | Capacity |
|---|---|
| 0 | 3.0 |
| 1 | 5.0 |
| 2 | 5.0 |
| 3 | 2.0 |
| 4 | 2.5 |
<AxesSubplot: ylabel='Frequency'>
From our output, we see that the number of projects returned from this API request is 651 (first output line). Based on our histogram with a bin level of `bins=25`, we see that the majority of projects in North Carolina have capacities between 5-10 MW. The next largest frequency appears to be projects with a capacity between 1-5 MW. We can increase the number of bins and rerun the cell to further refine the capacity ranges in the histogram return. Try rerunning the cell with a larger bin level, such as `bins=50`.
Next, let's generate a scatter plot to help us visualize relationships between project installation year, project capacity, and project total area. We'll make another customized request to the USPVDB API for projects that (1) are located in North Carolina, (2) have `p_cap_ac` (capacity) and `p_area` values that are not `null`, and (3) limit the project attributes in the response to `p_year` (year the project was completed), `p_name` (facility name), `p_area` (which we will cast as "Area"), and `p_cap_ac` (which we will cast as "Capacity"). We will also add `xlong` (longitude) and `ylat` (latitude) so we can use the request to generate our graduated symbol map later. All this is done by appending the URL parameters `?&p_state=eq.NC&p_cap_ac=not.is.null&p_area=not.is.null&select=p_year,p_name,Capacity:p_cap_ac,Area:p_area,xlong,ylat` to the root-level API https://eersc.usgs.gov/api/uspvdb/v1/projects/.
# Call the USPVDB API and apply custom URL parameters to the request.
ncProjects_url = 'https://eersc.usgs.gov/api/uspvdb/v1/projects?&p_state=eq.NC&p_cap_ac=not.is.null&p_area=not.is.null&select=p_year,p_name,Capacity:p_cap_ac,Area:p_area,xlong,ylat'
# Parse the JSON response from the API return and populate the dataframe
ncProjects = pd.read_json(ncProjects_url)
# Display the number of projects in our API return
display(ncProjects.count())
# Preview the first 5 records of the return. Data should only include the attributes defined by our API request
display(ncProjects.head(5))
# Generate a scatter plot with x-axis=year, y-axis=capacity, colorized (c) by area using 'viridis' matplotlib colormap
ncProjects.plot.scatter(x='p_year',
y='Capacity',
c='Area',
colormap='viridis',
figsize=(20,5),
sharex=False)
p_year      651
p_name      651
Capacity    651
Area        651
xlong       651
ylat        651
dtype: int64
| | p_year | p_name | Capacity | Area | xlong | ylat |
|---|---|---|---|---|---|---|
| 0 | 2016 | Kenneth Solar | 3.0 | 41275 | -78.1524 | 36.4224 |
| 1 | 2016 | Leggett Solar, LLC | 5.0 | 137415 | -77.6292 | 36.0611 |
| 2 | 2018 | Arthur Solar, LLC | 5.0 | 196857 | -78.8359 | 34.1723 |
| 3 | 2014 | East Wayne Solar, LLC | 2.0 | 44906 | -77.8494 | 35.4418 |
| 4 | 2011 | PCSP3 Airport | 2.5 | 55087 | -78.9890 | 36.2882 |
<AxesSubplot: xlabel='p_year', ylabel='Capacity'>
From our output, we see that the number of projects returned based on our new API request is 651. Based on our scatter plot, we see a general trend of increasing capacity and area over time. We also see that solar photovoltaic projects in North Carolina begin to increase in size and capacity around the year 2015. If we wanted to see whether this general trend holds at a national level, we would simply remove the parameter `&p_state=eq.NC` from our API request and re-run the cell. This is a simple example highlighting the efficiency of data delivery through an API. We only request the data we need for our analysis by modifying and/or appending simple URL parameters to the root-level API endpoint. We also stay in sync with the latest version of the data because we're pulling from the source, not a static flat file that may be out of date.
Just like before, we'll create a GeoJSON object from the dataframe we just defined. We'll start by (1) passing the dataframe columns (attributes) to be added to our GeoJSON object, (2) defining the precision of the solar photovoltaic project latitude/longitude values, and (3) mapping the names of the dataframe columns to the required latitude and longitude parameters of the function*.

*As seen in the examples above, we could simply cast `lat:ylat` and `lon:xlong` in the API request to avoid having to map `lat='ylat', lon='xlong'` in `df_to_geojson`.
# Create GeoJSON object from our 'ncProjects' dataframe
projectGradSymGeoJson = df_to_geojson(ncProjects,
                                      properties=['p_name', 'Capacity', 'Area'],
                                      precision=3, lat='ylat', lon='xlong')
Let's say we wanted to show a graduated symbol map with marker symbols sized by project capacity and a color ramp representing project area (where hotter colors depict increased project area). To create the map, we'll first create color 'stops' (or cutoffs) based on the project area ranges in the database. In this example, we'll hard-code RGB values (cool to hot) at each of our five defined stops. Next, we'll define the sizes of our markers based on project capacity. Again, we'll define five stops representing a range of project capacities. Finally, we'll call `GraduatedCircleViz` and apply our color and radius bins, along with some custom parameters for our visualization. In the code cell below, we provide a brief explanation for each custom parameter used for rendering our graduated symbol map. An exhaustive list of parameters can be found in the `mapboxgl-jupyter` documentation.
# Assign color breaks based on project area ranges (in m2)
project_area_color_bins= [[15000, 'rgb(43,131,186)'],
[25000, 'rgb(171,221,164)'],
[50000, 'rgb(255,255,191)'],
[100000, 'rgb(253,174,97)'],
[500000, 'rgb(215,25,28)']]
# Assign marker radius size based on project capacity ranges (in MW)
project_radius_bins = [[0, 0],
[1, 3],
[15, 6],
[25, 9],
[50, 12]]
# Define the parameters for our graduated symbol map
# Call our NGMDB style as basemap, apply our color and radius stops
# Set cluster symbol opacity and stroke, add scalebar and scalebar styles
# Handle initial zoom/center of visualization
projectGradSymbolMap = GraduatedCircleViz(projectGradSymGeoJson,
style= ngmdbBasemap,
color_property='Area',
color_function_type='interpolate',
color_stops=project_area_color_bins,
radius_property='Capacity',
radius_stops=project_radius_bins,
radius_function_type='interpolate',
radius_default=1,
opacity=0.75,
stroke_color='black',
stroke_width=0.15,
scale=True,
scale_unit_system='imperial',
scale_background_color='#0000ff00',
center=(-79.75, 35.5),
zoom=5.5)
# Generate map labels of solar photovoltaic project names and adjust label properties
projectGradSymbolMap.label_property = "p_name"
projectGradSymbolMap.label_size = 5
# Render the map
projectGradSymbolMap.show()
You may want to view or share the maps generated in your notebook as standalone web maps. Standalone web maps can be displayed on web and mobile devices without the need for notebook dependencies. They look exactly like the inline maps in your notebook and carry with them all the interactivity and control parameters defined in your code. The web map will include your data packaged in the HTML file. You can generate a standalone web map from `mapboxgl` by calling `create_html()` with standard Python protocol. The standalone web maps will be written to your Jupyter notebook home directory.
# Generate a standalone web map of the USPVDB cluster map
with open('uspvdbClusterMap.html', 'w') as f:
f.write(projectClusterMap.create_html())
# Generate a standalone web map of the North Carolina USPVDB graduated symbol map
with open('uspvdbGradSymbolMap.html', 'w') as f:
f.write(projectGradSymbolMap.create_html())
The creation of this database was jointly funded by the U.S. Department of Energy (DOE) Solar Energy Technologies Office (SETO) via the Lawrence Berkeley National Laboratory (LBNL) Electricity Markets and Policy Group and the U.S. Geological Survey (USGS) Energy Resources Program. The database is continuously updated through collaboration between LBNL and the USGS. With the release of this public version, we hope researchers and other interested parties around the world will use the data to further their efforts. If you have feedback or want to let us know how you are using the data, send us a note.
Map services and data downloaded from the U.S. Large-Scale Solar Photovoltaic Database are free and in the public domain. There are no restrictions; however, we request that the following acknowledgment statement be included in products and data derived from our map services when citing, copying, or reprinting: "Map services and data are available from Large-Scale Solar Photovoltaic Database, provided by the U.S. Geological Survey and Lawrence Berkeley National Laboratory via https://eerscmap.usgs.gov/uspvdb".
Although this digital spatial database has been subjected to rigorous review and is substantially complete, it is released on the condition that neither the USGS, LBNL, nor the United States Government nor any agency thereof, nor any employees thereof, makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information contained within the database.