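Recipes
Users sometimes share interesting ways of using the Docker images. We encourage users to contribute these recipes to the documentation by submitting a pull request, in case they prove useful to other members of the community. The sections below capture this knowledge.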
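Google Cloud SDK installation
Some operators, such as GKEStartPodOperator and DataflowStartSqlJobOperator, require the Google Cloud SDK (which includes gcloud) to be installed. You can also run gcloud commands with BashOperator.
Create a new Dockerfile like the one shown below.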
docs/docker-stack/docker-images-recipes/gcloud.Dockerfile
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG BASE_AIRFLOW_IMAGE
FROM ${BASE_AIRFLOW_IMAGE}
SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]
USER 0
ARG CLOUD_SDK_VERSION=322.0.0
ENV GCLOUD_HOME=/opt/google-cloud-sdk
ENV PATH="${GCLOUD_HOME}/bin/:${PATH}"
RUN DOWNLOAD_URL="https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${CLOUD_SDK_VERSION}-linux-x86_64.tar.gz" \
&& TMP_DIR="$(mktemp -d)" \
&& curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/google-cloud-sdk.tar.gz" \
&& mkdir -p "${GCLOUD_HOME}" \
&& tar xzf "${TMP_DIR}/google-cloud-sdk.tar.gz" -C "${GCLOUD_HOME}" --strip-components=1 \
&& "${GCLOUD_HOME}/install.sh" \
--bash-completion=false \
--path-update=false \
--usage-reporting=false \
--additional-components alpha beta kubectl \
--quiet \
&& rm -rf "${TMP_DIR}" \
&& rm -rf "${GCLOUD_HOME}/.install/.backup/" \
&& gcloud --version
USER ${AIRFLOW_UID}
Then build a new image.
docker build . \
--pull \
--build-arg BASE_AIRFLOW_IMAGE="apache/airflow:2.0.2" \
--tag my-airflow-image:0.0.1
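The Dockerfile above builds its download URL from the CLOUD_SDK_VERSION build argument. As a minimal sketch, the helper below reproduces that URL so the archive can be checked before a different version is passed to the build. The function name gcloud_sdk_url is hypothetical and is not part of Airflow or the Google Cloud SDK.

```shell
#!/bin/sh
# Hypothetical helper: prints the archive URL that the Dockerfile's RUN step
# would download for a given CLOUD_SDK_VERSION (defaults to 322.0.0, the
# default used in the Dockerfile above).
gcloud_sdk_url() {
    version="${1:-322.0.0}"
    printf '%s\n' "https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-${version}-linux-x86_64.tar.gz"
}

# Example (not run here; needs network access and Docker):
#   curl -fsIL "$(gcloud_sdk_url 322.0.0)" > /dev/null && \
#   docker build . --pull \
#     --build-arg BASE_AIRFLOW_IMAGE="apache/airflow:2.0.2" \
#     --build-arg CLOUD_SDK_VERSION=322.0.0 \
#     --tag my-airflow-image:0.0.1
```

Checking the URL first means a mistyped version fails quickly rather than partway through the image build.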
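Apache Hadoop Stack installation
Airflow is often used to run tasks on a Hadoop cluster. The Hadoop tools require a Java Runtime Environment (JRE) to run. The steps below install tools that are frequently used in the Hadoop world:
Java Runtime Environment (JRE)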
Apache Hadoop
Apache Hive
GCS connector for Apache Hadoop
Create a new Dockerfile like the one shown below.
docs/docker-stack/docker-images-recipes/hadoop.Dockerfile
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG BASE_AIRFLOW_IMAGE
FROM ${BASE_AIRFLOW_IMAGE}
SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]
USER 0
# Install Java
RUN mkdir -pv /usr/share/man/man1 \
&& mkdir -pv /usr/share/man/man7 \
&& curl -fsSL https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | apt-key add - \
&& echo "deb https://adoptopenjdk.jfrog.io/adoptopenjdk/deb/ $(lsb_release -cs) main" > \
/etc/apt/sources.list.d/adoptopenjdk.list \
&& apt-get update \
&& apt-get install --no-install-recommends -y \
adoptopenjdk-8-hotspot-jre \
&& apt-get autoremove -yqq --purge \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/adoptopenjdk-8-hotspot-jre-amd64
RUN mkdir -p /opt/spark/jars
# Install Apache Hadoop
ARG HADOOP_VERSION=2.10.1
ENV HADOOP_HOME=/opt/hadoop
ENV HADOOP_CONF_DIR=/etc/hadoop
ENV MULTIHOMED_NETWORK=1
ENV USER=root
RUN HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
&& curl -fSL 'https://dist.apache.org/repos/dist/release/hadoop/common/KEYS' | gpg --import - \
&& curl -fSL "$HADOOP_URL" -o /tmp/hadoop.tar.gz \
&& curl -fSL "$HADOOP_URL.asc" -o /tmp/hadoop.tar.gz.asc \
&& gpg --verify /tmp/hadoop.tar.gz.asc \
&& mkdir -p "${HADOOP_HOME}" \
&& tar -xvf /tmp/hadoop.tar.gz -C "${HADOOP_HOME}" --strip-components=1 \
&& rm /tmp/hadoop.tar.gz /tmp/hadoop.tar.gz.asc \
&& ln -s "${HADOOP_HOME}/etc/hadoop" /etc/hadoop \
&& mkdir "${HADOOP_HOME}/logs" \
&& mkdir /hadoop-data
ENV PATH="$HADOOP_HOME/bin/:$PATH"
# Install Apache Hive
ARG HIVE_VERSION=2.3.7
ENV HIVE_HOME=/opt/hive
ENV HIVE_CONF_DIR=/etc/hive
RUN HIVE_URL="https://archive.apache.org/dist/hive/hive-${HIVE_VERSION}/apache-hive-${HIVE_VERSION}-bin.tar.gz" \
&& curl -fSL 'https://downloads.apache.org/hive/KEYS' | gpg --import - \
&& curl -fSL "$HIVE_URL" -o /tmp/hive.tar.gz \
&& curl -fSL "$HIVE_URL.asc" -o /tmp/hive.tar.gz.asc \
&& gpg --verify /tmp/hive.tar.gz.asc \
&& mkdir -p "${HIVE_HOME}" \
&& tar -xf /tmp/hive.tar.gz -C "${HIVE_HOME}" --strip-components=1 \
&& rm /tmp/hive.tar.gz /tmp/hive.tar.gz.asc \
&& ln -s "${HIVE_HOME}/etc/hive" "${HIVE_CONF_DIR}" \
&& mkdir "${HIVE_HOME}/logs"
ENV PATH="$HIVE_HOME/bin/:$PATH"
# Install GCS connector for Apache Hadoop
# See: https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
ARG GCS_VARIANT="hadoop2"
ARG GCS_VERSION="2.1.5"
RUN GCS_JAR_PATH="/opt/spark/jars/gcs-connector-${GCS_VARIANT}-${GCS_VERSION}.jar" \
&& GCS_JAR_URL="https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-${GCS_VARIANT}-${GCS_VERSION}.jar" \
&& curl -fSL "${GCS_JAR_URL}" -o "${GCS_JAR_PATH}"
ENV HADOOP_CLASSPATH="/opt/spark/jars/gcs-connector-${GCS_VARIANT}-${GCS_VERSION}.jar:$HADOOP_CLASSPATH"
USER ${AIRFLOW_UID}
Then build a new image.
docker build . \
--pull \
--build-arg BASE_AIRFLOW_IMAGE="apache/airflow:2.0.2" \
--tag my-airflow-image:0.0.1
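After the build, it can be useful to confirm that the tools the Dockerfile adds to PATH are actually reachable. The sketch below is one way to do that from a shell inside the container. The function name missing_tools is hypothetical and is not provided by the image.

```shell
#!/bin/sh
# Hypothetical helper: prints the name of every given command that is NOT
# found on PATH. Prints nothing when all commands are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (inside a shell in the built container):
#   missing_tools hadoop hive
# An empty result means both commands resolve on PATH.
```

The Dockerfile puts the Hadoop and Hive bin directories on PATH, so those two names are the natural ones to check. This only checks that the commands resolve, not that the installations are configured correctly.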
Apache Beam Go Stack installation
To run a Beam Go pipeline with BeamRunGoPipelineOperator, you need Go in your container. Install Airflow with apache-airflow-providers-google>=6.5.0 and apache-airflow-providers-apache-beam>=3.2.0.
Create a new Dockerfile like the one shown below.
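The minimum provider versions above can be requested directly from pip, and an installed version can be compared against a minimum with a small helper. This is a minimal sketch: the function name version_ge is hypothetical, and it assumes plain dotted numeric versions and a sort command that supports -V (as GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is greater than or equal to
# version $2. Relies on `sort -V` (version sort), so the smaller of the two
# versions ends up on the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (not run here; needs pip and network access):
#   pip install "apache-airflow-providers-google>=6.5.0" \
#               "apache-airflow-providers-apache-beam>=3.2.0"
#
# Example check against a known installed version string:
#   version_ge "6.10.0" "6.5.0" && echo "google provider is new enough"
```

Version sort matters here because a plain string comparison would rank 6.10.0 below 6.5.0.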
docs/docker-stack/docker-images-recipes/go-beam.Dockerfile
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG BASE_AIRFLOW_IMAGE
FROM ${BASE_AIRFLOW_IMAGE}
SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]
USER 0
ARG GO_VERSION=1.16.4
ENV GO_INSTALL_DIR=/usr/local/go
# Install Go
RUN if [[ "$(uname -a)" = *"x86_64"* ]] ; then export ARCH=amd64 ; else export ARCH=arm64 ; fi \
&& DOWNLOAD_URL="https://dl.google.com/go/go${GO_VERSION}.linux-${ARCH}.tar.gz" \
&& TMP_DIR="$(mktemp -d)" \
&& curl -fL "${DOWNLOAD_URL}" --output "${TMP_DIR}/go.linux-${ARCH}.tar.gz" \
&& mkdir -p "${GO_INSTALL_DIR}" \
&& tar xzf "${TMP_DIR}/go.linux-${ARCH}.tar.gz" -C "${GO_INSTALL_DIR}" --strip-components=1 \
&& rm -rf "${TMP_DIR}"
ENV GOROOT=/usr/local/go
ENV PATH="$GOROOT/bin:$PATH"
USER ${AIRFLOW_UID}
Then build a new image.
docker build . \
--pull \
--build-arg BASE_AIRFLOW_IMAGE="apache/airflow:2.2.5" \
--tag my-airflow-image:0.0.1
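The Go install step in the Dockerfile above picks the archive by architecture: anything reporting x86_64 gets the amd64 build, and everything else is treated as arm64. The sketch below mirrors that rule so the resulting download URL can be previewed for a given GO_VERSION. The function names go_arch_for and go_download_url are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper mirroring the Dockerfile's architecture check:
# a machine string containing x86_64 maps to amd64, anything else to arm64.
go_arch_for() {
    case "$1" in
        *x86_64*) printf 'amd64\n' ;;
        *)        printf 'arm64\n' ;;
    esac
}

# Hypothetical helper: prints the Go archive URL for a version ($1) and an
# optional machine string ($2, defaults to the output of `uname -m`).
go_download_url() {
    version="$1"
    arch="$(go_arch_for "${2:-$(uname -m)}")"
    printf '%s\n' "https://dl.google.com/go/go${version}.linux-${arch}.tar.gz"
}

# Example:
#   go_download_url 1.16.4 x86_64
```

Because the fallback branch is arm64, a build on any other architecture would request an arm64 archive. That matches the Dockerfile's behaviour, but it is worth knowing before building on less common platforms.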