[Error] pod not completed

1. Error situation

$ k get all -n prototype
NAME                                     READY   STATUS    RESTARTS   AGE
pod/eth-at-prototype-v1-28123200-kmlbh   1/1     Running   0          2d11h
pod/eth-at-prototype-v1-28124640-lxs7h   1/1     Running   0          35h
pod/eth-at-prototype-v1-28126080-rccf6   1/1     Running   0          11h

NAME                                SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/eth-at-prototype-v1   0 0 * * *   False     3        11h             5d16h

NAME                                     COMPLETIONS   DURATION   AGE
job.batch/eth-at-prototype-v1-28123200   0/1           2d11h      2d11h
job.batch/eth-at-prototype-v1-28124640   0/1           35h        35h
job.batch/eth-at-prototype-v1-28126080   0/1           11h        11h

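As shown above, the pods started by earlier jobs never reach the Completed state.

The current pod's log shows an exception being raised → once the exception is fixed and the pod runs normally, need to check whether the while loop exits through the break in its else branch.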
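
Whether a pod can become Completed comes down to the exit code of the container's main process: exit code 0 counts as success, and with restartPolicy: OnFailure any nonzero exit code restarts the container instead. The following is a minimal local sketch, not part of the original app. It runs two throwaway scripts (hypothetical names CLEAN_EXIT and UNCAUGHT_EXCEPTION) and prints their exit codes, to show that a loop ending with break exits with 0 while an uncaught exception does not.

```python
import subprocess
import sys

# Two throwaway scripts (hypothetical, for illustration only).
CLEAN_EXIT = "while True:\n    break\nprint('done')"
UNCAUGHT_EXCEPTION = "raise RuntimeError('simulated failure')"


def exit_code_of(source: str) -> int:
    """Run Python source in a child interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    print("clean exit        :", exit_code_of(CLEAN_EXIT))
    print("uncaught exception:", exit_code_of(UNCAUGHT_EXCEPTION))
```

If the real app still raises anywhere after the loop, the container would be restarted rather than completed, so the break alone is not enough to check.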

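2. Identifying the cause

For testing, create a cronjob that creates a pod every 10 minutes.
The pod records the current time in output.log (intended to be once per second).
The app in the pod exits 5 minutes after it starts.
Need to check whether the pod switches to Completed.

2.1 Code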

test_app.py

import datetime

LOG_FILE_NAME = "output.log"

def open_logfile(log_file_name):
    # Append mode: existing content in the file is kept.
    return open(log_file_name, 'a')

def write_and_flush_logs(f, log_string):
    # Flush after every write so that `tail` shows the line immediately.
    f.write(log_string + "\n")
    f.flush()

def close_logfile(f):
    f.close()

logfile = open_logfile(LOG_FILE_NAME)

start_time = datetime.datetime.now()
end_time = start_time + datetime.timedelta(minutes=5)
logs = "start_time : " + str(start_time) + "\nend_time : " + str(end_time)
write_and_flush_logs(logfile, logs)

# Note: there is no sleep in this loop, so it writes a line on every iteration.
while True:
    now = datetime.datetime.now()

    if start_time < now < end_time:
        logs = "running : " + str(now)
        write_and_flush_logs(logfile, logs)
    else:
        logs = "endtime : " + str(now)
        write_and_flush_logs(logfile, logs)
        break

close_logfile(logfile)
# send_logs_to_s3 is not shown in this post; it is assumed to be defined
# elsewhere in the project.
send_logs_to_s3(LOG_FILE_NAME)
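
The tail output in 2.2 shows many lines within the same second, because the loop above never waits between writes. A variant that pauses between writes could look like the sketch below. The function name run_for and its parameters are hypothetical and not part of the original app, and the S3 upload is left out.

```python
import datetime
import io
import time


def run_for(duration_seconds, interval_seconds, out):
    """Write a timestamp line every interval_seconds until duration_seconds has elapsed.

    Returns the number of "running" lines written.
    """
    start_time = datetime.datetime.now()
    end_time = start_time + datetime.timedelta(seconds=duration_seconds)
    lines_written = 0
    while True:
        now = datetime.datetime.now()
        if now >= end_time:
            out.write("endtime : " + str(now) + "\n")
            out.flush()
            break
        out.write("running : " + str(now) + "\n")
        out.flush()
        lines_written += 1
        time.sleep(interval_seconds)
    return lines_written


if __name__ == "__main__":
    # Short values so the sketch finishes quickly; the post's app runs for 5 minutes.
    buffer = io.StringIO()
    run_for(duration_seconds=0.3, interval_seconds=0.1, out=buffer)
    print(buffer.getvalue(), end="")
```

For the test app this would be called as run_for(300, 1, logfile), which keeps output.log small enough to read with tail.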

run.sh

sudo ln -sf /usr/share/zoneinfo/Asia/Seoul /etc/localtime
python3 test_app.py

Dockerfile

FROM python:3.10

RUN apt-get update && apt-get install -y \
    vim

WORKDIR /home

COPY . .

RUN pip install --upgrade pip

CMD ["sh", "run.sh"]

test_cronjob.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-v1
  namespace: test
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: test-v1
            namespace: test
        spec:
          containers:
          - name: python-app
            image: cyaninn/test:v.1.2
          restartPolicy: OnFailure

2.2 Test

$ k apply -f test_cronjob.yaml

After 20 minutes:

$ k get all -n test
NAME                         READY   STATUS             RESTARTS        AGE
pod/test-v1-28127000-knk7s   1/1     Running            1 (2m18s ago)   7m20s
pod/test-v1-28126990-7vggg   1/1     Running            3 (2m2s ago)    17m
pod/test-v1-28126980-wtmns   0/1     CrashLoopBackOff   4 (39s ago)     27m

NAME                    SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/test-v1   */10 * * * *   False     3        7m20s           116m

NAME                         COMPLETIONS   DURATION   AGE
job.batch/test-v1-28127000   0/1           7m20s      7m20s
job.batch/test-v1-28126990   0/1           17m        17m
job.batch/test-v1-28126970   0/1           37m        37m
job.batch/test-v1-28126980   0/1           27m        27m

$ k exec -it test-v1-28126960-4cpv4 -n test /bin/bash
root@test-v1-28126960-4cpv4:/home# tail output.log 
running : 2023-06-24 15:10:46.751761
running : 2023-06-24 15:10:46.751770
running : 2023-06-24 15:10:46.751780
running : 2023-06-24 15:10:46.751791
running : 2023-06-24 15:10:46.751801
running : 2023-06-24 15:10:46.751811
running : 2023-06-24 15:10:46.751821
running : 2023-06-24 15:10:46.751831
running : 2023-06-24 15:10:46.759889
running : 2023-06-24 15:10:46.759904

Confirmed that the pod is still running.

※ The CrashLoopBackOff is an error that occurred after connecting with exec and then exiting.

3. Solution

https://kubernetes.io/ko/docs/concepts/workloads/controllers/job/#잡의-종료와-정리

.spec.activeDeadlineSeconds

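Setting an active deadline

Another way to terminate a Job is to set an active deadline. Do this by setting the Job's .spec.activeDeadlineSeconds field to a number of seconds. activeDeadlineSeconds applies to the duration of the Job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status becomes type: Failed with reason: DeadlineExceeded.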
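
The idea of an active deadline can be tried out locally with a plain subprocess. The sketch below is only an analogy, not how Kubernetes implements it: the helper name run_with_deadline and the status strings it returns are hypothetical. It starts a worker, kills it if it is still running when the deadline passes, and reports the outcome.

```python
import subprocess
import sys


def run_with_deadline(command, deadline_seconds):
    """Run command; kill it and report DeadlineExceeded if it runs too long."""
    try:
        completed = subprocess.run(command, timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired.
        return "Failed (DeadlineExceeded)"
    return "Complete" if completed.returncode == 0 else "Failed"


if __name__ == "__main__":
    # A worker that does not finish within the limit.
    slow_worker = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(slow_worker, deadline_seconds=1))

    # A worker that finishes well before the limit.
    quick_worker = [sys.executable, "-c", "print('done')"]
    print(run_with_deadline(quick_worker, deadline_seconds=5))
```

As with the Job field, the deadline is a safety net: a worker that exits by itself with code 0 never hits it.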

3.1 Test after applying the deadline

3.1.1 test_cronjob_2.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-v1
  namespace: test
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      activeDeadlineSeconds: 599
      template:
        metadata:
          labels:
            app: test-v1
            namespace: test
        spec:
          containers:
          - name: python-app
            image: cyaninn/test:v.1.4
          restartPolicy: OnFailure

Configured so that a pod is started every 10 minutes, the pod is deleted once 599 seconds have passed, and the job is marked as failed.
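
Why 599 seconds against a 600 second schedule keeps the pods from piling up can be checked with a small timeline model. This is an idealized sketch with a hypothetical function name active_pods: it assumes a job starts exactly every interval seconds, that a pod never exits on its own, and it ignores backoffLimit, restarts, and pod startup or termination time.

```python
from typing import Optional


def active_pods(t: int, interval: int, deadline: Optional[int]) -> int:
    """Count pods still running at second t.

    Jobs start at 0, interval, 2 * interval, ... and each pod runs until its
    deadline (start + deadline), or forever when deadline is None.
    """
    count = 0
    start = 0
    while start <= t:
        if deadline is None or t < start + deadline:
            count += 1
        start += interval
    return count


if __name__ == "__main__":
    for minute in (5, 15, 25, 55):
        t = minute * 60
        print(
            minute, "min:",
            active_pods(t, 600, None), "without deadline,",
            active_pods(t, 600, 599), "with deadline",
        )
```

In this model the count without a deadline grows by one per schedule tick, while with the deadline one second shorter than the interval it never exceeds 1.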

3.1.2 Test results

The number of pods stays at 1.
The number of jobs stays at 2.

$ k apply -f test_cronjob.yaml 
cronjob.batch/test-v1 created
...
50 minutes later
...
$ k get all -o wide -n test
NAME                         READY   STATUS    RESTARTS   AGE
pod/test-v1-28131230-8vf2l   1/1     Running   0          2m14s

NAME                    SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/test-v1   */10 * * * *   False     1        2m14s           51m

NAME                         COMPLETIONS   DURATION   AGE
job.batch/test-v1-28131220   0/1           12m        12m
job.batch/test-v1-28131230   0/1           2m14s      2m14s

The job goes into the Failed state with reason DeadlineExceeded.

$ k describe job test-v1-28131220 -n test
...
Pods Statuses:            0 Active (1 Ready) / 0 Succeeded / 1 Failed
...
Events:
  Type     Reason            Age    From            Message
  ----     ------            ----   ----            -------
  Normal   SuccessfulCreate  13m    job-controller  Created pod: test-v1-28131220-k8wj8
  Normal   SuccessfulDelete  3m58s  job-controller  Deleted pod: test-v1-28131220-k8wj8
  Warning  DeadlineExceeded  3m58s  job-controller  Job was active longer than specified deadline

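Confirmed that once 10 minutes (599 seconds) have passed since the job was created, the pod goes into Terminating first.
After that, confirmed that the job from the previous batch is automatically deleted another 10 minutes later.
The cronjob then creates a new job for the next batch.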

ubuntu@k3s:~/test$ k get all -n test
NAME                         READY   STATUS    RESTARTS        AGE
pod/test-v1-28131230-8vf2l   1/1     Running   1 (4m57s ago)   9m58s

NAME                    SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/test-v1   */10 * * * *   False     1        9m58s           59m

NAME                         COMPLETIONS   DURATION   AGE
job.batch/test-v1-28131220   0/1           19m        19m
job.batch/test-v1-28131230   0/1           9m58s      9m58s
ubuntu@k3s:~/test$ k get all -n test
NAME                         READY   STATUS              RESTARTS        AGE
pod/test-v1-28131230-8vf2l   1/1     Terminating         1 (4m59s ago)   10m
pod/test-v1-28131240-xv9wk   0/1     ContainerCreating   0               0s

NAME                    SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/test-v1   */10 * * * *   False     1        0s              59m

NAME                         COMPLETIONS   DURATION   AGE
job.batch/test-v1-28131230   0/1           10m        10m
job.batch/test-v1-28131240   0/1           0s         0s
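
The job names in the output above can be used to double-check the 10 minute spacing. The sketch below assumes the naming scheme of the current CronJob controller, where the numeric suffix is the scheduled time in minutes since the Unix epoch. That scheme is an assumption here, not something stated in this post, and the function name is hypothetical.

```python
import datetime


def scheduled_time_from_job_name(job_name: str) -> datetime.datetime:
    """Decode the numeric suffix of a CronJob-created job name into a UTC time.

    Assumes the suffix is the scheduled time in minutes since the Unix epoch.
    """
    minutes_since_epoch = int(job_name.rsplit("-", 1)[1])
    return datetime.datetime.fromtimestamp(
        minutes_since_epoch * 60, tz=datetime.timezone.utc
    )


if __name__ == "__main__":
    for name in ("test-v1-28131230", "test-v1-28131240"):
        print(name, "->", scheduled_time_from_job_name(name).isoformat())
```

Two consecutive jobs should decode to times exactly one schedule interval apart, which is a quick way to spot a skipped run.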
