
config clusterNode type wrong

samsung-cnct/k2

When we included central-logging in the default configs, we changed the clusterNodes type to m4.xlarge. Somehow this got reverted to c4.large, but for milestone 0.3 we need the correct type to ensure that all ES pods will fit onto the nodes.
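
A sketch of the intended default, assuming the nodePool entry carries the instance type under a providerConfig block (key names here are illustrative and may not match the current schema exactly):

nodePools:
  - name: clusterNodes
    providerConfig:
      type: m4.xlarge   # pinned back to m4.xlarge so all ES pods fit; c4.large is too small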

Updated 26/06/2017 19:00

allow user to add/delete node pools from config file and have this update a running cluster

samsung-cnct/k2

Use Case: a user would like to add a new nodepool to a running cluster. They add the nodepool to the config file, run upgrade, and the new nodepool is created.

Use Case: a user would like to delete a nodepool from a running cluster. They remove the nodepool from the config file, run upgrade, and the nodepool is deleted.

For both use cases, master nodepools should not be allowed to be created or deleted (detect this by checking for apiServer config key).

This may already be done, except for preventing master nodepool mangling.
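
A rough sketch of the guard, assuming the upgrade path has a list of nodePools being created or deleted and that master pools are identified by the presence of an apiServer config key (the nodepools_to_change variable is a placeholder):

- name: Refuse to create or delete master nodePools
  fail:
    msg: "nodePool {{ item.name }} defines apiServer and is treated as a master pool; it cannot be added or deleted via upgrade"
  when: item.apiServer is defined
  with_items: "{{ nodepools_to_change }}"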

Updated 26/06/2017 17:57

allow external IP address to be configurable

samsung-cnct/k2

Not all users will want to expose every node in the cluster to the outside world. We should support having a private-ip only configuration. When this per-cluster flag is set, we should also alter the ssh config generation step to use the private ip. At that point it's up to the user to ensure they have a valid route to the node.
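
One possible shape for this, assuming a per-cluster boolean flag and a node list available when the ssh config is rendered (the flag name and node attribute names are illustrative):

# per-cluster flag in the config file (illustrative name)
publicIps: false

# the ssh config template would then pick the address accordingly (jinja sketch)
Host {{ item.name }}
    HostName {{ item.private_ip if not (publicIps | bool) else item.public_ip }}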

Updated 26/06/2017 17:53

Kubectl Resiliency

samsung-cnct/k2

Add error checking to all kubectl calls. kubectl returns 0 on no error. We should check the return code and retry where appropriate instead of just failing.

# Poll kubectl until it exits cleanly (rc 0) or the retry budget is exhausted
- name: Example of kubectl call and return check
  command: >
    {{ kubectl }} --kubeconfig={{ kubeconfig | expanduser }} version --short
  register: task_get_ver_result
  # kubectl returns 0 on success, so retry until the return code is 0
  until: task_get_ver_result.rc == 0
  retries: "{{ repetitions | int }}"
  delay: "{{ rep_interval | int }}"
  when: not (dryrun | bool)
Updated 23/06/2017 23:23 1 Comments

etcd readiness check

samsung-cnct/k2

we should have a readiness check for any etcd clusters that causes creation of master nodePool resources to wait until the etcd clusters are up and running. this should prevent errors where our etcd clusters are coming up a bit wedged.
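
a rough sketch of what such a check could look like, assuming the etcd client endpoints are known and reachable over https (the endpoint list, port, and certificate handling are illustrative; a real check would also present the cluster's client certs):

- name: Wait for each etcd cluster member to report healthy
  uri:
    url: "https://{{ item }}:2379/health"
    validate_certs: no
  register: etcd_health
  until: etcd_health.status == 200
  retries: 60
  delay: 5
  with_items: "{{ etcd_client_endpoints }}"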

Updated 21/06/2017 22:48

etcd node recovery process

samsung-cnct/k2

this is a tracking issue for building a process that allows for rebuilding a degraded etcd within a K2 install. this process is only expected to work with automation on an etcd cluster that still has a majority of the nodes available.

there should also be a manual process specified that will support rebuilding an etcd cluster that has gone into a ‘read-only’ state.

the specifics of this may mutate greatly over the life of this issue. Please work closely with @alejandroEsc and @sostheim to stay on track

Updated 22/06/2017 01:33 2 Comments

Change default config to have a pinned version of CoreOS

samsung-cnct/k2

Currently we can have an accidental destruction of a running etcd cluster without a user’s direct action because we are defaulting the CoreOS version to ‘latest’. The value ‘latest’ gets passed to terraform and terraform will rebuild any etcd nodes that have had their resolved CoreOS version changed. When terraform updates CoreOS it destroys the node and recreates it.

If we want to move to a pinned version of CoreOS we would need to add ‘checking for a new version of CoreOS’ to the weekly version check ticket. It's small, but it's another thing to check.
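
A sketch of what pinning could look like in the default config, assuming the CoreOS section takes a version and channel (key names and the example release are illustrative):

coreos:
  version: 1353.7.0   # example pinned release instead of 'latest'; bump via the weekly version check
  channel: stable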

Updated 19/06/2017 16:59 2 Comments

Wait for API Server Superfluous

samsung-cnct/k2

The reorganization of ansible role order needed for RBAC should always guarantee that the kubernetes api server is available. The startup has to wait for kubectl to be viable (and hence for the api server to be available) to install the needed system RBAC before anything else can occur.

The api server should be available for anything after that step.

The ‘Wait for API server’ tasks after that step should be able to be removed.

Updated 19/06/2017 19:12 3 Comments

K8s 1.6 deploymentcontroller and daemonsetcontroller

samsung-cnct/k2

The controller manager is showing some issues when deploying monitoring tools:

E0614 17:34:41.484210       1 deployment_controller.go:489] Deployment.apps "kafka-monitor" is invalid: metadata.finalizers[0]: Invalid value: "foregroundDeletion": name is neither a standard finalizer name nor is it fully qualified
E0614 17:35:33.222806       1 reflector.go:201] k8s.io/kubernetes/pkg/controller/garbagecollector/graph_builder.go:192: Failed to list <nil>: the server cannot complete the requested operation at this time, try again later (get replicationcontrollers)

and

E0614 17:03:06.995601       1 daemoncontroller.go:233] default/fluentd-daemon failed with : error storing status for daemon set &v1beta1.DaemonSet{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fluentd-daemon", GenerateName:"", Namespace:"default", SelfLink:"/apis/extensions/v1beta1/namespaces/default/daemonsets/fluentd-daemon", UID:"5ab65933-5118-11e7-b0d8-0a5b615c88a0", ResourceVersion:"4898", Generation:2, CreationTimestamp:v1.Time{Time:time.Time{sec:63633051870, nsec:105424329, loc:(*time.Location)(0x6f16460)}}, DeletionTimestamp:(*v1.Time)(0xc4227fc500), DeletionGracePeriodSeconds:(*int64)(0xc4226ec348), Labels:map[string]string{"app":"log-app"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string{"foregroundDeletion"}, ClusterName:""}, Spec:v1beta1.DaemonSetSpec{Selector:(*v1.LabelSelector)(0xc4227fc540), Template:v1.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"kubernetes.io/cluster-service":"true", "version":"v1", "k8s":"fluentd-logging"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:""}, Spec:v1.PodSpec{Volumes:[]v1.Volume{v1.Volume{Name:"varlog", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(0xc4226ec3c0), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil)}}, v1.Volume{Name:"varlibdockercontainers", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(0xc4226ec3e0), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(nil), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), 
AzureFile:(*v1.AzureFileVolumeSource)(nil), ConfigMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil)}}}, InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"fluentd-kafka", Image:"quay.io/samsung_cnct/fluentd_daemonset", Command:[]string(nil), Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList{"memory":resource.Quantity{i:resource.int64Amount{value:209715200, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}}, Requests:v1.ResourceList{"cpu":resource.Quantity{i:resource.int64Amount{value:100, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"100m", Format:"DecimalSI"}, "memory":resource.Quantity{i:resource.int64Amount{value:209715200, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}}}, VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"varlog", ReadOnly:false, MountPath:"/var/log", SubPath:""}, v1.VolumeMount{Name:"varlibdockercontainers", ReadOnly:true, MountPath:"/var/lib/docker/containers", SubPath:""}}, LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"Always", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}, v1.Container{Name:"logrotate-sidecar", Image:"quay.io/samsung_cnct/logrotate", Command:[]string(nil), Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar{v1.EnvVar{Name:"ALLAPPLICATIONLOGS", Value:"/var/log/containers", ValueFrom:(*v1.EnvVarSource)(nil)}}, Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceList(nil)}, VolumeMounts:[]v1.VolumeMount{v1.VolumeMount{Name:"varlog", ReadOnly:false, MountPath:"/var/log/", SubPath:""}}, LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullPolicy:"Always", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, RestartPolicy:"Always", TerminationGracePeriodSeconds:(*int64)(0xc4226ec468), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"", DeprecatedServiceAccount:"", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, SecurityContext:(*v1.PodSecurityContext)(0xc4220290c0), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]v1.Toleration(nil)}}, UpdateStrategy:v1beta1.DaemonSetUpdateStrategy{Type:"OnDelete", RollingUpdate:(*v1beta1.RollingUpdateDaemonSet)(nil)}, MinReadySeconds:0, TemplateGeneration:1}, Status:v1beta1.DaemonSetStatus{CurrentNumberScheduled:8, NumberMisscheduled:0, DesiredNumberScheduled:8, NumberReady:0, ObservedGeneration:1, UpdatedNumberScheduled:8, NumberAvailable:0, NumberUnavailable:8}}: DaemonSet.extensions 
"fluentd-daemon" is invalid: metadata.finalizers[0]: Invalid value: "foregroundDeletion": name is neither a standard finalizer name nor is it fully qualified
Updated 14/06/2017 20:06

convert dashes to underscores for cluster names in helmOverride code path

samsung-cnct/k2

dashes cannot be used in environment variable identifiers, but they need to be allowed in cluster names. underscores cannot be used in cluster names because we use the cluster name as part of a DNS entry.

this issue should:

  • update the helmOverride environment variable ansible task to convert any dashes to underscores when creating or looking for an environment variable (make sure to update the help text as well); see the sketch below
  • update the json schema for clusterName to disallow underscores
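
a minimal sketch of the dash-to-underscore conversion, assuming the task builds the environment variable name from the cluster name (the variable prefix and fact name are illustrative):

- name: Look up helm override for this cluster
  set_fact:
    helm_override: "{{ lookup('env', 'HELM_OVERRIDE_' + (cluster_name | regex_replace('-', '_'))) }}"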

Updated 14/06/2017 20:36

Check for Helm Up is Problematic

samsung-cnct/k2

kraken.services/tasks/run-services.yaml

The Helm/Tiller steps have some potential issues:

1) Helm init is synchronous; it returns up or error.
2) The next “Wait for Tiller” step should not be needed. If we keep it anyway, we should not be looking under the helm covers to determine its readiness, i.e. we should not be querying via kubectl for any pods/deployments, since Helm may change the way they do things.
3) We should perform the helm readiness check with a helm command itself, e.g. helm ls (the only true judge of helm is helm itself).
4) The readiness loop should not poll every second, but on some longer interval.
5) Currently, if the helm init task fails (after 60 retries), the next readiness check is still performed, forcing another 600 sec wait. The rest of the helm tasks should be skipped if helm init fails.

Also, helm readiness generally takes well under 5 minutes; it should actually be immediate if helm init did not return an error.
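
A sketch of a helm-native readiness check, assuming a {{ helm }} variable analogous to the {{ kubectl }} one used elsewhere and a registered result from the helm init task (names and retry values are illustrative):

- name: Wait for helm to answer for itself
  command: "{{ helm }} ls"
  register: helm_ls_result
  until: helm_ls_result.rc == 0
  retries: 12
  delay: 10
  when: helm_init_result.rc == 0   # skip the readiness loop entirely if helm init failed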

Updated 14/06/2017 21:03

Clean up helmOverride

samsung-cnct/k2

As mentioned here, the schema for helmOverride is not as useful as it could be because the type of helmOverride may be a boolean (spelled 22 different ways) when specified in the config file, a string when specified by an environment variable, or undefined when using an older configuration file.

I believe the root of this problem is that helmOverride has two distinct purposes:

  • Allow potentially incompatible versions of helm to be used anyway
  • Allow helm to be skipped entirely

The latter of these purposes has been requested by other customers, and I think the easiest way to accommodate this behavior is to allow the user to remove the helmConfig line from the cluster definition within the configuration file. For example:

deployment:
  clusters:
    - name: dewMaybe
...
      fabricConfig: *kubeVersionedFabric
      kubeAuth: *defaultKubeAuth
      #helmConfig: *defaultHelm
      dnsConfig: *defaultDns
...

Since the stanza is commented out, helm should be skipped…

For the former purpose, to proceed or not when there is no supported version of helm for the chosen kubernetes version, a boolean will suffice. If it is true, then use the latest version of helm available, even if it does not support the requested version of k8s. If it is false, then fail if the requested version of kubernetes does not have a supported version of helm. The failure message should indicate that the config file or an environment variable can be specified to proceed with an unsupported version of helm.

In this scheme, to support old versions of the configuration file, by default the helmOverride value should be false.

Updated 14/06/2017 20:09

helm permission denied

samsung-cnct/k2

I am running k2 version 65c9194b86b61c5b0f2b194c9e2e49b45c2aa2c5 “rbac 16 hotfix”, using a config generated by a previous version of k2, and receiving this error:

fatal: [localhost]: FAILED! => {"changed": false, "cmd": "/opt/cnct/kubernetes/v1.5/bin/helm init", "failed": true, "msg": "[Errno 13] Permission denied", "rc": 13}

Updated 14/06/2017 20:50 2 Comments

Running the up.sh command with 'dryrun' tags leads to spinning up an actual cluster

samsung-cnct/k2

Running the up.sh command with ‘dryrun’ tags leads to spinning up an actual cluster.

It runs the ansible roles under the k2/ansible/roles/kraken.provider/ directory, which spins up an actual cluster.

The code below needs to change:

k2/ansible/roles/kraken.provider/kraken.provider.aws/tasks/main.yaml

k2/ansible/roles/kraken.provider/kraken.provider.gke/tasks/create-cluster.yaml

For the GKE role, the “Render a deployment manager template” task needs to be moved out to main.yml for the dryrun test.

#391 should be updated with this issue.
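
A sketch of the kind of guard the provider tasks need, reusing the existing dryrun variable (the task name, command, and config_base/cluster_name variables are placeholders):

- name: Run terraform to provision cluster resources
  command: >
    terraform apply -input=false -no-color {{ config_base | expanduser }}/{{ cluster_name }}
  when: not (dryrun | bool)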

Updated 14/06/2017 22:09 9 Comments

Kubernetes System Logging in K2

samsung-cnct/k2

This is the parent ticket for finalizing our logging system. The output of this ticket should be:

  • a final design of our logging system
  • should include two options, one with kafka and one without
  • include all data sources currently not included
  • fluentbit on the node (may be slightly trickier with the lack of fluentd plugins)
  • a set of issues to enact these changes

Updated 26/06/2017 20:49 3 Comments

Crash Reporter: MVP

samsung-cnct/k2

Get a proof of concept running.

  • [x] deploy elasticsearch in prod cluster

  • [x] deploy kibana

  • [x] add functionality to identify failed tasks on up & down, send terminal output to kibana

  • [ ] create kibana report that lists top tasks that are failing

As edge cases, features and preferred data become clear, make corresponding ‘Crash Reporter’ tickets

Updated 16/06/2017 17:59 13 Comments

Create a design for iterating over clusters instead of mushing cluster and nodePool looping together

samsung-cnct/k2

It would feel much safer and less buggy to loop over the entire K2 playbook for each cluster, rather than looping over individual aspects of each cluster inside the single playbook. If such a thing is possible, we would be sure that all configs would pertain to the cluster at hand. I feel we might have some unwanted crossover of information between clusters (once we start being able to start multiple clusters).

Updated 07/06/2017 21:11

K2 Content

samsung-cnct/k2

After the team completes:

https://github.com/samsung-cnct/k2/issues/438 https://github.com/samsung-cnct/k2/issues/437 https://github.com/samsung-cnct/k2/issues/436

We need to plan our K2 content strategy that considers:

  • [ ] - blogs for our github.io
  • [ ] - external blogs and articles
  • [ ] - find K2’s place in K8s tooling docs: https://github.com/samsung-cnct/k2/issues/231
  • [ ] - find K2’s place in: https://github.com/ramitsurana/awesome-kubernetes
  • [ ] - SIG presentations
  • [ ] - Conference talk
  • [ ] - Videos/demos
Updated 26/06/2017 18:01

Update all K2 documentation to have cohesive voice and message

samsung-cnct/k2

https://github.com/samsung-cnct/k2 https://github.com/samsung-cnct/k2cli http://samsung-cnct.github.io/aboutK2.html

Currently the docs listed above have been written by different people at different points in K2’s genesis. We need a clean sweep to ensure all these docs have a clear voice and message reflecting K2’s mission and current state. It should be clear what K2 is and that we recommend using K2cli to get started.

What is K2’s mission today and who should these docs be written for?

K2 deploys a highly available Kubernetes cluster with a sophisticated control plane. A K2 cluster gives devOps and clusterOps professionals granular control and the ability to deploy on multiple clouds. Our docs should explain the benefits of a three-master control plane, a five-node etcd cluster topology, and the ability to deploy across multiple availability zones.

Updated 24/05/2017 20:29

TCME Deployment Design

samsung-cnct/k2

The Cert Management Experience will be deployed in two different modes over the course of a single cluster creation command. The process will run as follows:

1. A dedicated VM for TCME is created and TCME is installed via cloud config (signing keys are not available at this point)
2. When the VM is up, K2 will ssh into the node, create the keys as env vars and start TCME
3. A DNS entry will be created that points to the internal ip address of this node
4. Wait for the cluster to be up and running
5. Install TCME as a kubernetes service that is only reachable via internal services (e.g. on AWS this would be an ELB that is only visible from inside a VPC)
6. Update the DNS entry to point to this new service
7. Delete the standalone TCME VM

Updated 24/05/2017 20:52

TCME Design

samsung-cnct/k2

We would like to have a secure SSL Cert Distribution service. This service should have the following properties:

  • go application
  • be able to choose between pre-formed certs retrieved from a remote location or generating certs on the fly
  • be able to read in the signing certs via env vars or a k8s secret, depending on deployment configuration (the k8s secret can be exposed as an env var)
  • have an optional verification step to ensure the requesting node should get a cert

Updated 24/05/2017 21:00 2 Comments

Calculate the needed resources and err if not available

samsung-cnct/k2

Yesterday I encountered a failure to launch a cluster on gke because we had used up all our storage quota. Frequently we encounter similar problems with AWS reaching other quotas: IAM roles, etc.

If we could calculate the resources that will be needed and compare that with quotas and usage we could fail early if bringing up the cluster is impossible.

We could use the same code for looking for resource leaks, which would allow us to improve that aspect as well.

Updated 24/05/2017 20:31 2 Comments

clean up k2 Jenkinsfile

samsung-cnct/k2

the current k2 Jenkinsfile uses an e2etester docker image that is far too large, isn’t checking for the correct fork (i.e. samsung_cnct vs coffeepac) when pushing a new image, has no concept of versioning, and probably more.

This issue should be used to tackle the topics in the previous paragraph (except versioning) and to collect further desired improvements. These improvements should then be converted into issues. When the first three issues are cleaned up, this issue should be closed and any further CI work should happen in the normal flow of work.

Updated 07/06/2017 17:53

simplify cloud-init config

samsung-cnct/k2

Currently, cloud-init configs are written to a series of file parts and stitched together. We could, instead, just build a dict and write the yaml once. This would allow us to check the size (metadata services have maximum payload sizes that we cannot exceed) and also would be much faster.
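
A sketch of the single-document approach, assuming the parts are collected into dicts and the provider's user-data limit is known (the variable names and the limit are illustrative; EC2's limit is 16 KB, for example):

- name: Render cloud-init as one document
  set_fact:
    cloud_init_doc: "{{ cloud_init_base | combine(cloud_init_extra, recursive=True) | to_nice_yaml }}"

- name: Fail early if the rendered cloud-init is too large for the metadata service
  fail:
    msg: "cloud-init is {{ cloud_init_doc | length }} bytes; the provider limit is {{ max_userdata_bytes }} bytes"
  when: (cloud_init_doc | length) > (max_userdata_bytes | int)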

Updated 03/05/2017 20:43

Validate files exist

samsung-cnct/k2

Validate files that we’re going to use actually exist.

For example, terraform failed to complete because I didn't change the ssh keys to point to the ones I actually use for this purpose:

fatal: [localhost]: FAILED! => {"attempts": 10, "changed": true, "cmd": ["terraform", "apply", "-input=false", "-no-color", "/home/jjulian/.kraken/joej3"], "delta": "0:00:00.374135", "end": "2017-04-07 19:24:07.061804", "failed": true, "rc": 1, "start": "2017-04-07 19:24:06.687669", "stderr": "Errors:\n\n * file: open /home/jjulian/.ssh/id_rsa.pub: no such file or directory in:\n\n${file(\"/home/jjulian/.ssh/id_rsa.pub\")}", "stdout": "There are warnings and/or errors related to your configuration. Please\nfix these before continuing.", "stdout_lines": ["There are warnings and/or errors related to your configuration. Please", "fix these before continuing."], "warnings": []}

If these files were confirmed to exist, we could have failed much sooner.
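
A sketch of an early check, assuming the list of file paths referenced by the config is available (the list variable name is illustrative):

- name: Check that files referenced by the config exist
  stat:
    path: "{{ item | expanduser }}"
  register: referenced_file
  failed_when: not referenced_file.stat.exists
  with_items: "{{ referenced_file_paths }}"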

Updated 07/06/2017 21:27

Refactor misnamed and unused roles

samsung-cnct/k2

Rename ansible/roles/kraken.{etcd, master, node}/kraken.*.docker roles and remove ansible/roles/kraken.{etcd, master, node}/kraken.*.selector roles.

Currently we have a selector role ansible/roles/kraken.{etcd, master, node}/kraken.*.selector which unconditionally call analogous ansible/roles/kraken.{etcd, master, node}/kraken.*.docker roles. The docker roles in particular do not have anything to do with docker, so their contents should probably just be part of the top level role. Then the selector roles can be removed.

Updated 03/05/2017 21:08

how to handle helm charts over time

samsung-cnct/k2

we are progressing towards using k2/k2cli and the configuration file for cluster management operations. it is certainly possible for people to modify the list of services in the configuration and expect the cluster state to reflect that change. this is already supported for adding new services but this is not supported for removing services.

how do we want to approach this? this will be a combination of code and operator manual.

Updated 07/06/2017 21:31

Setting global variables in Ansible

samsung-cnct/k2

Currently, the way we make a variable “globally” accessible is by setting it as a fact at the top of the first tasks that get run, in k2/ansible/roles/tasks/main.yaml.

It might be more idiomatic to declare a vars file where globally accessible variables are kept, as per the suggestion in this article: https://robert-reiz.com/2014/09/03/global-variables-with-ansible/

and in the docs (the examples a bit further down are pretty good): http://docs.ansible.com/ansible/playbooks_variables.html#variable-precedence-where-should-i-put-a-variable
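
A sketch of the vars-file approach, reusing a couple of the variables from the kubectl example above (the file location follows the usual Ansible group_vars convention; the values are illustrative):

# ansible/group_vars/all.yaml
repetitions: 60
rep_interval: 10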

Not sure how pressing/necessary this is for the way we use Ansible.

Updated 03/05/2017 21:18 1 Comments

start building helm nightly to cover gaps in helm release cycle

samsung-cnct/k2

helm does not cut official releases at the same time as kubernetes. we rely heavily on helm for providing ancillary services on kubernetes (monitoring, logging, CI that targets kubernetes, etc). as such we should be doing our best to be:

  • testing against alpha and beta versions of kubernetes
  • papering over the release lag between kubernetes and helm

this is a result of a discussion raised in #222

Updated 07/06/2017 17:53 1 Comments

K2-Down fails to bring down clusters with ELBs

samsung-cnct/k2

Currently k2-down is failing when we deploy an ELB with an external IP.

I can spin up a default cluster and bring it down successfully, but when I deploy the above service, k2 down fails at certain tasks (seem to be a bit different each time, see the above comment for a list of common ones I see).

Bug repro:

1. Spin up a default k2 cluster
2. Deploy this service: https://gist.github.com/leahnp/7240e1fc4c2a73a75d0403f734e8555a
3. Run k2-down (it should fail at some point)

More details on the issue here: https://github.com/samsung-cnct/kraken-ci-jobs/issues/166

Updated 24/06/2017 19:06 30 Comments

Add checksums to Dockerfile to verify integrity

samsung-cnct/k2

The k2 Dockerfile currently contains lines like this:

wget http://storage.googleapis.com/kubernetes-helm/helm-${K8S_HELM_VERSION}-linux-amd64.tar.gz && tar -zxvf helm-${K8S_HELM_VERSION}-linux-amd64.tar.gz && mv linux-amd64/helm /usr/bin/ && rm -rf linux-amd64 helm-${K8S_HELM_VERSION}-linux-amd64.tar.gz

We should follow the best practices implemented in goglide and careen-goglide and verify the checksum of the tarball.

Updated 05/04/2017 21:03

create CI job that verifies non-supported kubernetes templates are removed

samsung-cnct/k2

as of https://github.com/samsung-cnct/k2/issues/222#issuecomment-284583580 we will be introducing template directories and ansible checks that should be removed when we no longer support that version.

create a CI job that checks the source code for no longer supported templates and version checks. at this moment I am unsure how we are tracking what the supported versions of kubernetes are so that may also be something that needs to be determined here.

Updated 07/06/2017 17:53 2 Comments

Consolidate apiserver readiness checks

samsung-cnct/k2

We currently wait for the API server to be ready in 3 different places:

rodin:k2 dwat$ grep -r 'Wait for api server' ansible
ansible/roles/kraken.fabric/kraken.fabric.flannel/tasks/main.yml:- name: Wait for api server to become available in case it's not
ansible/roles/kraken.readiness/tasks/do-wait.yaml:- name: Wait for api server to become available in case it's not
ansible/roles/kraken.services/tasks/run-services.yaml:- name: Wait for api server to become available in case it's not
rodin:k2 dwat$ 

One of these has diverged from the rest, which suggests that it may be wrong. We should remove duplications like these from the code to make it easier to maintain.
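
A sketch of one way to consolidate, assuming the shared copy stays in kraken.readiness/tasks/do-wait.yaml (per the grep above) and that the Ansible version in use supports include_role with tasks_from (2.2+):

- name: Wait for api server to become available in case it's not
  include_role:
    name: kraken.readiness
    tasks_from: do-wait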

Updated 16/06/2017 19:32 2 Comments

Look into CoreRoller for managing CoreOS based installs

samsung-cnct/k2

CoreRoller ( https://github.com/coreroller/coreroller ) is a project that will allow you to manage the packages installed on a CoreOS machine as well as what version of CoreOS (ContainerLinux) is deployed where.

This service could be deployed to a kubernetes cluster (in a self-hosted sort of way) to help manage the nodes that the cluster is running on. It could also be used to keep a specific version of docker running on the fleet.

Updated 05/04/2017 21:18 1 Comments

Add cluster autoscaling support

samsung-cnct/k2

On cloud environments (AWS, GCE, Azure) adding additional nodes can be done through APIs.

Having a right-sized kubernetes environment would reduce costs while at the same time being responsive to the cluster’s need for additional hardware when appropriate.

Kubernetes supports cluster autoscaling for GCE, GKE and AWS. We should have this be an option in K2. https://github.com/kubernetes/contrib/tree/master/cluster-autoscaler for reference

Updated 07/06/2017 21:49

Canal install pulls directly from canal

samsung-cnct/k2

the line: https://github.com/samsung-cnct/k2/blob/6d8a7b43a7ccf2a49f331ca239a418d88e27b2ca/ansible/roles/kraken.master/kraken.master.docker/templates/units.kubelet.part.jinja2#L13 pulls directly from an external source. this means we cannot have an airgapped deploy with this process.

low priority but when air gapping comes up this will need to be addressed.

why is this there: https://github.com/tigera/canal/issues/14

Updated 05/04/2017 21:29

pre-allocate ips for etcd nodes

samsung-cnct/k2

pre-allocating ips for etcd nodes will allow us to create DNS entries for the etcd nodes before they are finished spinning up. This will lower the total time for a cluster to start and will give us more control over when services become visible to the rest of the cluster.

This is a follow on ticket from https://github.com/samsung-cnct/k2/issues/99

Updated 05/04/2017 21:31
