
K2 Up Services Ignored Errors

samsung-cnct/k2

There are 2 ignored errors on K2 services up. They are valid, but there should be a way to do the needed processing without generating errors that then have to be ignored. (A sketch for the first case follows the log excerpts below.)

e.g.:

TASK [roles/kraken.services : See if tiller rc if present] ********************* fatal: [localhost]: FAILED! => {"changed": true, "cmd": "/opt/cnct/kubernetes/v1.5/bin/kubectl --kubeconfig=/Users/guineveresaenger/.kraken/guinrecon/admin.kubeconfig get deployment tiller-deploy --namespace=kube-system", "delta": "0:00:00.857228", "end": "2017-04-27 17:04:17.077705", "failed": true, "rc": 1, "start": "2017-04-27 17:04:16.220477", "stderr": "Error from server (NotFound): deployments.extensions \"tiller-deploy\" not found", "stdout": "", "stdout_lines": [], "warnings": []} ...ignoring

TASK [roles/kraken.services : Clean up services] ******************************* fatal: [localhost]: FAILED! => {"failed": true, "msg": "The conditional check 'item.status.loadBalancer.ingress[0].hostname is defined and kraken_action == 'down'' failed. The error was: error while evaluating conditional (item.status.loadBalancer.ingress[0].hostname is defined and kraken_action == 'down'): 'ansible.vars.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'status'\n\nThe error appears to have been in '/kraken/ansible/roles/kraken.services/tasks/kill-services.yaml': line 72, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Clean up services\n ^ here\n"} ...ignoring
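One way to avoid the first ignored error (a sketch only, not the current role code) is to use kubectl's --ignore-not-found flag so the lookup exits 0 whether or not tiller is deployed, and then branch on the output. The kubectl, kubeconfig, and helm variables below are placeholders:

- name: See if the tiller deployment is present
  command: >
    {{ kubectl }} --kubeconfig={{ kubeconfig }}
    get deployment tiller-deploy --namespace=kube-system --ignore-not-found
  register: tiller_deployment
  changed_when: false

- name: Initialize tiller only when it is not already present
  command: "{{ helm }} init"
  when: tiller_deployment.stdout == ""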

Updated 27/04/2017 18:10

simplify cloud-init config

samsung-cnct/k2

Currently, cloud-init configs are written to a series of file parts and stitched together. We could, instead, just build a dict and write the yaml once. This would allow us to check the size (metadata services have maximum payload sizes that we cannot exceed) and also would be much faster.
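A rough sketch of the single-pass approach, assuming the assembled cloud-init data ends up in a dict variable (cloud_init_data below) and that the provider's user-data limit is known; the variable names, the config_base path, and the 16KB limit are all assumptions:

- name: Render the cloud-init config in one pass
  copy:
    content: "#cloud-config\n{{ cloud_init_data | to_nice_yaml }}"
    dest: "{{ config_base }}/cloud-config.yaml"

- name: Fail early if the payload exceeds the metadata service limit
  assert:
    that:
      - "(cloud_init_data | to_nice_yaml | length) < 16384"
    msg: "cloud-init payload exceeds the provider's maximum user-data size"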

Updated 25/04/2017 21:07

dryrun currently fails

samsung-cnct/k2

During development it is common to make changes and then partially test them by setting the dryrun tag. In particular this will skip the provider role, so that the developer does not have to wait for the cluster to be brought up. Currently dryrun is broken in that it fails during the fabric role. [1]

This can be partially fixed by fixing #229, but then the fabric role retries 100 times and fails, because this is a dryrun! The simplest fix is to remove the dryrun tag from the fabric role, but then dryrun could not be used to test fabric templating. That may be acceptable (a sketch of this option follows the log below).

[1]

TASK [/root/Samsung/k2/ansible/roles/kraken.fabric/kraken.fabric.flannel : Wait for api server to become available in case it's not] ***
task path: /root/Samsung/k2/ansible/roles/kraken.fabric/kraken.fabric.flannel/tasks/main.yaml:15
fatal: [localhost]: FAILED! => {
    "failed": true, 
    "msg": "{{ lookup('file', kubeconfig) | from_yaml | json_query('clusters[*].cluster.server') }}: An unhandled exception occurred while running the lookup plugin 'file'. Error was a <class 'ansible.errors.AnsibleError'>, original message: could not locate file in lookup: /root/.kraken/lp-k2recon/admin.kubeconfig"
}

For further information about which tags k2 makes available, see up.sh
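A minimal sketch of the "remove the dryrun tag" option mentioned above; the role and tag names follow the issue text, but the playbook layout shown is an assumption:

# Before: the fabric role carries the dryrun tag, so a dryrun run still
# reaches the 100-retry wait for an API server that was never provisioned.
#   - { role: kraken.fabric, tags: ['fabric', 'dryrun'] }
# After: drop the dryrun tag, at the cost of no longer exercising fabric
# templating during a dryrun.
- { role: kraken.fabric, tags: ['fabric'] }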

Updated 07/04/2017 20:27

Fix 'Clean up releases' to avoid needing to ignore the failure

samsung-cnct/k2

If this failure is safe to ignore, we should be able to check up front whether the work actually needs to be done and simply skip the task when it does not. (A sketch follows the log below.)

Seeing failures that are ignored irritates me.

TASK [roles/kraken.services : Clean up releases] *******************************
failed: [localhost] (item={u'name': u'kubedns', u'namespace': u'kube-system', u'chart': u'kubedns', u'repo': u'atlas', u'version': u'0.1.0', u'values': {u'cluster_ip': u'10.32.0.2', u'dns_domain': u'cluster.local'}}) => {"changed": true, "cmd": ["helm", "delete", "--purge", "kubedns"], "delta": "0:00:03.994099", "end": "2017-04-07 20:02:04.912575", "failed": true, "item": {"chart": "kubedns", "name": "kubedns", "namespace": "kube-system", "repo": "atlas", "values": {"cluster_ip": "10.32.0.2", "dns_domain": "cluster.local"}, "version": "0.1.0"}, "rc": 1, "start": "2017-04-07 20:02:00.918476", "stderr": "Error: deletion completed with 1 error(s): object not found, skipping delete", "stdout": "", "stdout_lines": [], "warnings": []}
...ignoring
failed: [localhost] (item={u'repo': u'atlas', u'version': u'0.1.0', u'namespace': u'kube-system', u'name': u'heapster', u'chart': u'heapster'}) => {"changed": true, "cmd": ["helm", "delete", "--purge", "heapster"], "delta": "0:00:03.960751", "end": "2017-04-07 20:02:09.020494", "failed": true, "item": {"chart": "heapster", "name": "heapster", "namespace": "kube-system", "repo": "atlas", "version": "0.1.0"}, "rc": 1, "start": "2017-04-07 20:02:05.059743", "stderr": "Error: deletion completed with 2 error(s): object not found, skipping delete; object not found, skipping delete", "stdout": "", "stdout_lines": [], "warnings": []}
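A sketch of the guarded delete (not the current task): list what helm actually has installed, then only delete releases that exist. The services variable name is a placeholder for whatever holds the item structure shown in the log above:

- name: List installed helm releases
  command: helm list -q
  register: installed_releases
  changed_when: false

- name: Clean up releases
  command: helm delete --purge {{ item.name }}
  with_items: "{{ services }}"   # 'services' is a placeholder variable name
  when: item.name in installed_releases.stdout_lines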
Updated 19/04/2017 20:14

Validate files exist

samsung-cnct/k2

Validate files that we’re going to use actually exist.

For example, terraform failed to complete because I didn't change the ssh keys to point to the ones I actually use for this purpose:

fatal: [localhost]: FAILED! => {"attempts": 10, "changed": true, "cmd": ["terraform", "apply", "-input=false", "-no-color", "/home/jjulian/.kraken/joej3"], "delta": "0:00:00.374135", "end": "2017-04-07 19:24:07.061804", "failed": true, "rc": 1, "start": "2017-04-07 19:24:06.687669", "stderr": "Errors:\n\n * file: open /home/jjulian/.ssh/id_rsa.pub: no such file or directory in:\n\n${file(\"/home/jjulian/.ssh/id_rsa.pub\")}", "stdout": "There are warnings and/or errors related to your configuration. Please\nfix these before continuing.", "stdout_lines": ["There are warnings and/or errors related to your configuration. Please", "fix these before continuing."], "warnings": []}

If these files had been confirmed to exist, we could have failed much sooner.
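A sketch of an up-front check; it assumes the configured key path is exposed as a variable (ssh_public_key below is a placeholder name):

- name: Stat files referenced by the configuration
  stat:
    path: "{{ item }}"
  register: referenced_files
  with_items:
    - "{{ ssh_public_key }}"

- name: Fail fast when a referenced file is missing
  fail:
    msg: "{{ item.item }} does not exist; fix the configuration before terraform runs"
  when: not item.stat.exists
  with_items: "{{ referenced_files.results }}"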

Updated 19/04/2017 20:16

Refactor misnamed and unused roles

samsung-cnct/k2

Rename ansible/roles/kraken.{etcd, master, node}/kraken.*.docker roles and remove ansible/roles/kraken.{etcd, master, node}/kraken.*.selector roles.

Currently we have selector roles, ansible/roles/kraken.{etcd, master, node}/kraken.*.selector, which unconditionally call the analogous ansible/roles/kraken.{etcd, master, node}/kraken.*.docker roles. The docker roles in particular do not have anything to do with docker, so their contents should probably just be part of the top-level role. Then the selector roles can be removed.

Updated 03/04/2017 01:47

Node Taints

samsung-cnct/k2

We should be able to apply taints to nodes at node startup in two ways:
- default nodePool taints
- arbitrary config-driven taints

Taints are the future of how pods will be scheduled to nodes, and as of Kubernetes v1.6 they can be applied to nodes via kubelet CLI options. We should enable this functionality in k2.
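A sketch of what config-driven taints might look like; the nodePool stanza below is an assumption, not the current k2 schema, and the kubelet flag shown is the v1.6 mechanism:

# hypothetical nodePool stanza
nodePools:
  - name: specialNodes
    taints:
      - key: dedicated
        value: gpu
        effect: NoSchedule
# kubelet (v1.6+) would then be started with:
#   --register-with-taints=dedicated=gpu:NoSchedule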

Updated 29/03/2017 22:55 1 Comments

how to handle helm charts over time

samsung-cnct/k2

We are progressing towards using k2/k2cli and the configuration file for cluster management operations. It is certainly possible for people to modify the list of services in the configuration and expect the cluster state to reflect that change. This is already supported for adding new services, but it is not supported for removing services.

How do we want to approach this? This will be a combination of code and operator manual.
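On the code side, one possible approach (a sketch only; the variable names are assumptions) is to diff the configured service list against what helm reports installed and delete the releases that are no longer configured:

- name: List currently installed releases
  command: helm list -q
  register: installed_releases
  changed_when: false

- name: Remove releases that were dropped from the configuration
  command: helm delete --purge {{ item }}
  with_items: "{{ installed_releases.stdout_lines }}"
  when: item not in (configured_services | map(attribute='name') | list)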

Updated 29/03/2017 22:31

Setting global variables in Ansible

samsung-cnct/k2

Currently, the way we make a variable “globally” accessible is by setting it as a fact at the top of the first tasks that get run, in k2/ansible/roles/tasks/main.yaml.

It might be more idiomatic to declare a vars file where globally accessible variables are kept, as per the suggestion in this article: https://robert-reiz.com/2014/09/03/global-variables-with-ansible/

and in the docs (the examples a bit further down are pretty good): http://docs.ansible.com/ansible/playbooks_variables.html#variable-precedence-where-should-i-put-a-variable
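A minimal illustration of the vars-file approach from those links; the file path and variable name are placeholders, not the current k2 layout:

# ansible/group_vars/all.yaml
kraken_base_dir: "{{ lookup('env', 'HOME') }}/.kraken"

# Any role or task can then reference {{ kraken_base_dir }} directly,
# with no set_fact step at the start of the run.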

Not sure how pressing/necessary this is for the way we use Ansible.

Updated 05/04/2017 20:18 1 Comments

start building helm nightly to cover gaps in helm release cycle

samsung-cnct/k2

Helm does not cut official releases at the same time as Kubernetes. We rely heavily on helm for providing ancillary services on Kubernetes (monitoring, logging, CI that targets Kubernetes, etc.). As such, we should be doing our best to be:
- testing against alpha and beta versions of Kubernetes
- papering over the release lag between Kubernetes and helm

This is a result of a discussion raised in #222.

Updated 29/03/2017 22:24 1 Comments

K2-Down fails to bring down clusters with ELBs

samsung-cnct/k2

Currently k2-down is failing when we deploy an ELB with an external IP.

I can spin up a default cluster and bring it down successfully, but when I deploy the above service, k2 down fails at certain tasks (the failures seem to be a bit different each time; see the above comment for a list of common ones I see).

Bug repro:
1. Spin up default k2 cluster
2. Deploy service: https://gist.github.com/leahnp/7240e1fc4c2a73a75d0403f734e8555a
3. Run k2-down (it should fail at some point)

More details on the issue here: https://github.com/samsung-cnct/kraken-ci-jobs/issues/166

Updated 07/04/2017 05:42 16 Comments

Validate format of properties

samsung-cnct/k2

Task

The k2 schema defines a format for some properties. Some of these formats (e.g. “ip4addr”) are part of the JSON Schema standard, which some implementations support. The existing JSON schema validator should be extended to validate non-standard formats (e.g. cidr, semver). Until this is done, non-standard formats should be ignored by all validators.

Updated 05/04/2017 21:02

Warn on depreciated k2 properties

samsung-cnct/k2

Task

Per #221, before a property can be removed it must be depreciated first, to give the user a chance to stop depending on it. Programmatically this is done by adding the property to the depreciated list within the k2 schema. For example, [1] gives a hypothetical schema stanza. Since depreciated is an extension to JSON Schema (see PR #173 for the current discussion), no standard validator will print warnings. This ticket is to create an Ansible task that prints warnings as part of the kraken.config role.

[1]

"kubeConfig": {
"title": "A Kubernetes configuration",
"description": "The location and version of a container containing the Kubernetes hyperkube binary.",
"properties": {
"name": {
"default": "defaultKubeConfig",
"description": "Name of the Kubernetes configuration.",
"type": "string"
},
"hyperkubeLocation": {
"default": "gcr.io/google_containers/hyperkube",
"description": "Location of the Kubernetes container.",
"format": "uri",
"type": "string"
},
"version": {
"default": "v1.5.2",
"description": "Version of the hyperkube binary.",
"format": "symver",
"type": "string"
}
},
"required": [
"name"
],
"depreciated": [
"hyperkubeLocation"
],
"type": "object"
},
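A sketch of such a task for the kraken.config role; the variable names (kraken_config for the parsed config, kube_config_schema for the loaded schema stanza) are assumptions about how the data would be exposed:

- name: Warn on depreciated kubeConfig properties
  debug:
    msg: "WARNING: '{{ item }}' is depreciated and will be removed in a future release"
  with_items: "{{ kube_config_schema.depreciated | default([]) }}"
  when: item in kraken_config.kubeConfig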
Updated 16/03/2017 04:15 1 Comments

Add checksums to Dockerfile to verify integrity

samsung-cnct/k2

The k2 Dockerfile currently contains lines like this:

wget http://storage.googleapis.com/kubernetes-helm/helm-${K8S_HELM_VERSION}-linux-amd64.tar.gz && tar -zxvf helm-${K8S_HELM_VERSION}-linux-amd64.tar.gz && mv linux-amd64/helm /usr/bin/ && rm -rf linux-amd64 helm-${K8S_HELM_VERSION}-linux-amd64.tar.gz

We should follow the best practices implemented in goglide and careen-goglide and verify the checksum of the tarball.

Updated 05/04/2017 21:03

create CI job that verifies non-supported kubernetes templates are removed

samsung-cnct/k2

As of https://github.com/samsung-cnct/k2/issues/222#issuecomment-284583580 we will be introducing template directories and Ansible checks that should be removed when we no longer support the corresponding version.

Create a CI job that checks the source code for templates and version checks belonging to versions we no longer support. At the moment I am unsure how we are tracking which versions of Kubernetes are supported, so that may also be something that needs to be determined here.

Updated 20/03/2017 17:30 2 Comments

Documentation should match code

samsung-cnct/k2

The current documentation is out of date and also includes features which have never been implemented. For example: https://github.com/samsung-cnct/k2/blob/master/Documentation/kraken-configs/kubelabels.md

We can fix this by reviewing the documentation, or depending on the results of #221, we can generate documentation from the config schema.

Updated 28/04/2017 16:30 1 Comments

Consolidate apiserver readiness checks

samsung-cnct/k2

We currently wait for the API server to be ready in 3 different places:

rodin:k2 dwat$ grep -r 'Wait for api server' ansible
ansible/roles/kraken.fabric/kraken.fabric.flannel/tasks/main.yml:- name: Wait for api server to become available in case it's not
ansible/roles/kraken.readiness/tasks/do-wait.yaml:- name: Wait for api server to become available in case it's not
ansible/roles/kraken.services/tasks/run-services.yaml:- name: Wait for api server to become available in case it's not
rodin:k2 dwat$ 

One of these has diverged from the rest, which suggests that it may be wrong. We should remove duplication like this from the code to make it easier to maintain.
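One way to consolidate (a sketch; the role and file names come from the grep output above, but the include mechanism shown is an assumption) is to keep a single wait task in kraken.readiness and pull it in from the other two places:

- name: Wait for api server to become available in case it's not
  include_role:
    name: kraken.readiness
    tasks_from: do-wait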

Updated 03/03/2017 00:25

tpc provider: terraform templates

samsung-cnct/k2

Attempt to follow the design of the AWS terraform templates (see aws-template.yaml) to build a template for terraform using the Triton resources.

This is a tracking issue for the following:
- [ ] vpc #193
- [ ] keypair (= key) #194
- [ ] subnet (= fabric) #195
- [ ] cluster_secgroup (= firewall) #196
- [ ] etcd (= machine) #197
- [ ] master (= machine) #197
- [ ] node (= machine) #197

Updated 26/04/2017 16:10 1 Comments

create jenkins-ci project plan

samsung-cnct/k2

Our current Jenkins CI system (kraken-ci) is unmaintained, is based on Jenkins v1, and the instructions to rebuild it no longer work. We are going to abandon this project in favor of a Kubernetes-based Jenkins build system built on Jenkins v2.

This ticket will cover the basic planning and initial tickets for the new CI system.

Updated 26/04/2017 22:03 25 Comments

Look into CoreRoller for managing CoreOS based installs

samsung-cnct/k2

CoreRoller (https://github.com/coreroller/coreroller) is a project that allows you to manage the packages installed on a CoreOS machine, as well as which version of CoreOS (Container Linux) is deployed where.

This service could be deployed to a Kubernetes cluster (in a self-hosted sort of way) to help manage the nodes that the cluster is running on. It could also be used to keep a specific version of Docker running on the fleet.

Updated 05/04/2017 21:18 1 Comments

Add cluster autoscaling support

samsung-cnct/k2

On cloud environments (AWS, GCE, Azure) adding additional nodes can be done through APIs.

Having a right-sized Kubernetes environment would reduce costs while at the same time being responsive to the cluster’s need for additional hardware when appropriate.

Kubernetes supports cluster autoscaling for GCE, GKE, and AWS; we should have this be an option in K2. See https://github.com/kubernetes/contrib/tree/master/cluster-autoscaler for reference.

Updated 17/04/2017 23:16

Add "kubelet cluster update"

samsung-cnct/k2

Kubelet cluster update should be a new Ansible role that is not part of the default set, must be called explicitly, and has a selector based on provider. The role should perform the upgrade/downgrade tasks (the process should be the same for both) as laid out in the following tickets:
- gke: https://github.com/samsung-cnct/k2/issues/217#issuecomment-285467380
- aws: https://github.com/samsung-cnct/k2/issues/218 (see the last comment for additional detail; if there is confusion, please speak to @coffeepac)

This may be best completed by finishing #60 first.
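A sketch of how the explicit, provider-selected role might be wired up; the role name, tag, and provider variable below are placeholders, not existing k2 names:

# invoked explicitly, e.g. with --tags cluster_update
- { role: kraken.cluster_update, tags: ['cluster_update'] }

# inside the role, dispatch on provider:
- name: Run provider specific kubelet update tasks
  include: "update-{{ kraken_config.provider }}.yaml"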

Updated 19/04/2017 20:34 2 Comments

Get Sysdig-Agent Working Correctly.

samsung-cnct/k2

Get K2 working similarly to the original kraken for sysdigcloud:
1) Get sysdig-agent installed and running if the user specifies a sysdigcloud_access_key; otherwise do not run the agent.
2) Assume each user will have their own unique sysdigcloud_access_key associated with their sysdigcloud.com account.
3) The agent should run on all nodes and all control plane machines (master, apiserver, etcd, …).

Note: originally the agent HAD to be installed via cloudconfig for it to work correctly. Also, the key to seeing Kubernetes data was having the agent running on the APIServer.
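A sketch of requirement 1, assuming the key surfaces as a config variable named sysdigcloud_access_key (per the text above) and that the agent ships as a DaemonSet manifest; the manifest path and the kubectl/kubeconfig/config_base variables are placeholders:

- name: Deploy the sysdig agent when an access key is configured
  command: >
    {{ kubectl }} --kubeconfig={{ kubeconfig }}
    apply -f {{ config_base }}/sysdig-agent-daemonset.yaml
  when: sysdigcloud_access_key is defined and sysdigcloud_access_key != ""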

Updated 05/04/2017 21:48 5 Comments
