JJinuxLand

GCP: Cloud Digital Leader Certification

2024-01-05T18:43:00.000-08:00

Heh, I passed Google Cloud's Cloud Digital Leader certification!

I started by taking GCP for Beginners - Become a Google Cloud Digital Leader on Udemy. It was about 10 hours of video. It took me a while, though, because I wrote 95 pages of notes. I was perhaps overcompensating for my poor memory. I studied for an extra couple of days reading random things on the web.

The exam was about 90 minutes. It's remote, but it's proctored. It cost $99.

Overall, not bad. Most importantly, I learned a lot!

My Takeaways from the Stack Overflow Developer Survey

2023-08-08T14:55:00.003-07:00

These are my takeaways from the Stack Overflow Developer Survey:

Programming languages:

JavaScript, HTML/CSS, and SQL are still dominant.
Python is the #2 programming language, followed by TypeScript.
Only 5% of developers still code in assembly.
Lisp moved up two spots to 1.33% of respondents.
Rust is the language that the highest number of people want to either start coding in or continue coding in. Rust has held this coveted spot for something like 8 years.

Editors:

The most important editors are VS Code, the IntelliJ variants, and Visual Studio. It remains unclear how many people use one of the IntelliJ variants because there's overlap, but it seems to be higher than Visual Studio.
70% of people use VS Code.
10% of people use neovim or vim.
4% of people use Emacs.

OSs:

Most developers still use Windows.
After that, it's macOS and then Linux, where Ubuntu dominates.

AI:

Most people use ChatGPT and/or GitHub Copilot.
Only 29% of developers don't use AI tools and don't plan on using them.
77% of developers think AI tools are helpful for their development.

Testing:

30% of developers still don't have CI. 40% don't have automated testing. (I have no idea how the second number can be higher than the first.)

Tools:

Docker and npm are the two most common tools. Half of people use Docker.

Microservices:

Half of the developers have microservices at their company.

Pay:

In the US, the median full-stack developer makes $140,000 while engineering managers make $195,000. C-suite people make $220,000. Security professionals make $173,000. SREs make $180,000. Most of these numbers seem low by Bay Area standards.
The median Python programmer makes $78,331, which is up from last year, but that still seems very low to me. For Java, it's only $72,701.
Zig is the top-paying programming language (weird!). It took over that spot from Clojure. You might consider using Zig if you need something like C++.
5% of people are looking for work (4% in the US). There are more independent contractors, etc. than last year.
Most developers are in the US.

Remote work:

Only 16% of people work fully in-person. The rest are split fairly evenly between fully remote and hybrid.

Learning:

Udemy maintains its place as the most popular online course or certification program for learning how to code.

ChatGPT: I feel like a kid who just beat an AI playing Go ;)

2023-08-03T16:56:00.006-07:00

No, ChatGPT, that's not right ;)

If you try solving this puzzle yourself, it's not actually that hard if you start by picking the last word first. I picked "poems on a quick snake". One of the reasons this is hard for ChatGPT is that it picks the words in order.

Security Mistake on GitHub Copilot's Homepage

2023-06-12T12:55:00.002-07:00

Can anyone else spot it?

Python: Advice for Patching Your Code at Runtime

2023-05-26T13:57:00.003-07:00

A lot of people use mock.patch() in their tests, but it's also sometimes useful to monkey-patch code at runtime. This blog post talks about why and how.

Let's imagine that you're using some library (perhaps something big, like a web framework), and for whatever reason, you're unable to update the version you're using. Meanwhile, someone comes along and reports a major vulnerability. You need to somehow deal with the vulnerability, but you're in a situation where it's really hard to update.

So, you go find the actual change that fixed the vulnerability. You want to apply it to your version of the code. What do you do?

Well, you could fork the library, but that's kind of a pain to manage. When will you move off that fork? What's the difference between your fork and the original library? What do you do if you need to update versions slightly but you still need the fork?

Or, you can grab the bits of code you actually care about, and patch the system at runtime.

Some basics of patching

Somewhere, near the start of your application, you make a call to some function, apply_all_patches(). Then, you write a function called apply_all_patches() that calls other functions like apply_patch_for_this_thing() and apply_patch_for_that_thing(), etc.

Now, let's say there's a class, SomeClass, with a function, some_function. Let's suppose there's a vulnerability in some_function, and you can see the newer version of it with the fix.

You basically do:

# fix_for_some_thing_code.py

# This module mostly contains third-party code wrapped in functions.
# Include the original license since this is mostly third-party code.

# This standalone function has the code for the method that I'm trying to replace.
# Note, even though it's a top-level function, it still accepts self because I'm
# going to inject the function into the existing class later.
def some_function(self, ...):
    ...

# fix_for_some_thing_patch.py

# This takes the above third-party code and monkey-patches it in.

import fix_for_some_thing_code

# Here, I'm injecting that code:
def apply_patch_for_this_thing():
    SomeClass._orig_some_function = SomeClass.some_function
    SomeClass.some_function = fix_for_some_thing_code.some_function

Here are a couple of trivial functions to make it easier:

def patch(obj, attribute_name, new_value):
    setattr(obj, f"_orig_{attribute_name}", getattr(obj, attribute_name, None))
    setattr(obj, attribute_name, new_value)


def patch_multiple(obj, attribute_names, copy_from_obj):
    for attribute_name in attribute_names:
        new_value = getattr(copy_from_obj, attribute_name, None)
        patch(obj, attribute_name, new_value)

Now, we can just write:

def apply_patch_for_this_thing():
    patch(SomeClass, "some_function", fix_for_some_thing_code.some_function)

    # Alternatively, if you have a bunch of patches:
    patch_multiple(SomeClass, ["some_function", "some_other_function"], fix_for_some_thing_code)
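
To see these helpers work end to end, here's a minimal, self-contained sketch. SomeClass and fix_code are toy stand-ins that I made up for illustration (fix_code plays the role of the fix_for_some_thing_code module). In real life, they'd come from the library and from your patch module.

```python
import types


def patch(obj, attribute_name, new_value):
    # Keep a reference to the original under _orig_<name>, then swap in the fix.
    setattr(obj, f"_orig_{attribute_name}", getattr(obj, attribute_name, None))
    setattr(obj, attribute_name, new_value)


def patch_multiple(obj, attribute_names, copy_from_obj):
    for attribute_name in attribute_names:
        new_value = getattr(copy_from_obj, attribute_name, None)
        patch(obj, attribute_name, new_value)


# Hypothetical stand-in for the third-party class with the vulnerability.
class SomeClass:
    def some_function(self):
        return "vulnerable"

    def some_other_function(self):
        return "also vulnerable"


# Hypothetical stand-in for the fix_for_some_thing_code module.
fix_code = types.SimpleNamespace(
    some_function=lambda self: "fixed",
    some_other_function=lambda self: "also fixed",
)


def apply_all_patches():
    patch_multiple(SomeClass, ["some_function", "some_other_function"], fix_code)


apply_all_patches()

obj = SomeClass()
print(obj.some_function())                # The patched method runs.
print(SomeClass._orig_some_function(obj)) # The original is still reachable.
```

Since the functions are injected into the class itself, every instance picks up the fix, including instances created before the patch was applied.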

Dealing with imports

Python's from a import b can make a patcher's life difficult.

Let's say you have two modules, module_with_vuln and module_that_imported_from_module_with_vuln.

Depending on how module_that_imported_from_module_with_vuln is written, it can make your life either more or less painful. And, let's imagine there are a ton of modules that import from module_with_vuln.

If the problem is in a class's method, it's no big deal. You can just replace the method in the class.

If the thing that you have to replace is something immutable like an int, function, or enum, life becomes harder.

Let's imagine the problem is in some top_level_function inside module_with_vuln. Let's imagine that module_that_imported_from_module_with_vuln has code like from module_with_vuln import top_level_function. Even if you update module_with_vuln.top_level_function, it won't matter because module_that_imported_from_module_with_vuln.top_level_function still points to the original function. Anyone who used mock.patch in their tests is familiar with this problem.

To deal with it, you have to focus on replacing module_that_imported_from_module_with_vuln.top_level_function with your new module_with_vuln.top_level_function after you've already patched module_with_vuln. Basically, you have two places that you have to monkey-patch.
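
Here's a small sketch of that two-place problem. To keep it self-contained, I build the two modules in memory with types.ModuleType and exec instead of using real files; that setup is just an assumption for the demo, and the variable name importer is mine.

```python
import sys
import types

# Build a fake module_with_vuln in memory and register it so it can be imported.
module_with_vuln = types.ModuleType("module_with_vuln")
exec(
    "def top_level_function():\n"
    "    return 'vulnerable'\n",
    module_with_vuln.__dict__,
)
sys.modules["module_with_vuln"] = module_with_vuln

# Build a second module that does `from module_with_vuln import ...`.
importer = types.ModuleType("module_that_imported_from_module_with_vuln")
exec(
    "from module_with_vuln import top_level_function\n"
    "def handler():\n"
    "    return top_level_function()\n",
    importer.__dict__,
)


def fixed_top_level_function():
    return "fixed"


# Patch #1: the module where the function is defined.
module_with_vuln.top_level_function = fixed_top_level_function
result_after_first_patch = importer.handler()

# Patch #2: the module that copied the reference at import time.
importer.top_level_function = fixed_top_level_function
result_after_second_patch = importer.handler()

print(result_after_first_patch)   # Still the old function.
print(result_after_second_patch)  # Now the new one.
```

The first patch alone changes nothing for handler() because the from-import copied the reference into the importing module's own namespace.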

If you have a lot of things to patch, you might be asking if you can just swap out the entire module in sys.modules, but that actually won't help if other modules have already run and done their imports. If you can really be the first thing to run, then you might be able to pull this trick off, but it's actually subtly harder than you might think.

Anyway, if what you have to patch is a method in a class, it's easy to just patch that one method in the class, but if what you have to patch is something like an int at the top level, you have no choice but to chase down all the places that import it and patch their references too.

By the way, you have to be really careful about entirely redefining classes or modules. If you have some class, ClassWithVuln, and you entirely redefine it, there might be some code out there that imported the old version of ClassWithVuln and is doing stuff like isinstance(some_object, ClassWithVuln). If some_object is an instance of the new ClassWithVuln, but the import is for the old ClassWithVuln, then isinstance is going to return False.
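
Here's a tiny sketch of that isinstance trap. OldClassWithVuln is a name I made up; it stands in for the reference that some other module grabbed before the class was redefined.

```python
class ClassWithVuln:
    def risky(self):
        return "vulnerable"


# Some other module imported the class early and kept this reference.
OldClassWithVuln = ClassWithVuln


# Entirely redefining the class creates a brand-new, unrelated class object.
class ClassWithVuln:
    def risky(self):
        return "fixed"


some_object = ClassWithVuln()
redefined_check = isinstance(some_object, OldClassWithVuln)


# Patching in place keeps the class identity intact.
def fixed_risky(self):
    return "fixed"


OldClassWithVuln.risky = fixed_risky
old_object = OldClassWithVuln()
patched_check = isinstance(old_object, OldClassWithVuln)

print(redefined_check)  # The new class is not the old class.
print(patched_check, old_object.risky())
```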

There's another weird edge case. Let's say that we're replacing some_function_with_vuln, and the code is a closure that uses some globals like SomeHarmlessOtherClass. You want to make sure that the old code and the new code reference the exact same SomeHarmlessOtherClass. So, in your fix_for_some_thing_code.py, you may want to import things from the original module that had the vulnerability:

# fix_for_some_thing_code.py
from module_with_vuln import SomeHarmlessOtherClass

# This standalone function has the code for the method that I'm trying to replace:
def some_function(self, ...):
    ...

One more trick. Write a test like:

def test_remember_the_patch_when_upgrading_the_library(self):
    if some_library.__version__ != "1.2.3":
        raise AssertionError("Remember to update or remove the patch for some_library")
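
Not every library exposes a __version__ attribute. As an alternative, here's a sketch that asks the installed package metadata instead, using importlib.metadata from the standard library (Python 3.8+). The helper name, package name, and version are placeholders that I made up.

```python
from importlib.metadata import PackageNotFoundError, version


def assert_patch_still_needed(package_name, patched_version):
    """Fail loudly if the installed version isn't the one the patch was written for."""
    try:
        installed = version(package_name)
    except PackageNotFoundError:
        raise AssertionError(
            f"{package_name} is not installed; remove the patch for it"
        )
    if installed != patched_version:
        raise AssertionError(
            f"{package_name} is now {installed}; "
            f"update or remove the patch written for {patched_version}"
        )


# In a test, you'd call it like this (hypothetical package and version):
# assert_patch_still_needed("some_library", "1.2.3")
```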

Summary

In summary, my advice is:

Remember to save a reference to the thing you're replacing, like _orig_function_with_vuln.
Whenever possible, stick to replacing individual functions/methods.
Avoid creating new modules; patch them in place instead.
Avoid creating new classes; patch them in place instead.
When creating a new function, make sure its closure is closing around the same instances that the original function was closing around.
If the thing you're trying to replace is something immutable, like an int, function, enum, etc., and other modules are using from module_with_vuln import something_immutable, you're going to have to chase down those other modules and replace those references.
Use a test to remind future developers to update or remove your patch when they update the library.

So, dynamically patching your libraries to work around vulnerabilities is definitely a useful technique. But, if the patching gets too extensive, you might decide to just bite the bullet and do the upgrade. In some cases, you might also decide that the vulnerability just isn't severe enough to worry about.

Security: Generating a Symmetric Key

2023-04-27T16:01:00.007-07:00

When I was first learning AppSec, my buddy, Josh Bonnett, sent me Cryptographic Right Answers. I read it 3 times and still barely understood it. But, now, it's my favorite page for figuring out the right thing to do when it comes to cryptography.

Suppose you need to create a secret (i.e. a symmetric key). You need it to be long enough. That page says 256 bits is enough. You want it to not get messed up in various contexts, so you need to somehow pass it around not in binary. So, using URL-safe base64 is a good idea. And, let's suppose it's gotta be really secret, so you want to use a cryptographically random number generator.

With all of that in mind, I just found my new favorite bit of Python:

>>> from secrets import token_urlsafe

>>> token_urlsafe(nbytes=256)

'ez-MDTo5SSZFp5dk5LByq8S7sN-gGoI_8MyIMa-joBvlsQvIihyOCgct2s8XkLnTztdPIpf8dPu3Q6CSBBuOtGCcS3lbiwczzaR1zF46HazoAlM7v-2wGgZrLmPLpkEcNexfgoy8D4KMz7L06QiRAJTGB6N2F8dbYXAUYuc3iUd6XKkLkr9JIC3p13VdzTEyLlWNOhTYzAbb7YSqFMrqn_ifLjDfr0oakzYR6zQumB1dsRCSqIbBuJubGdRUoVnCgtj3vS6lrhtV-NVSlX4hsHE9oW1qYZcNfxhaRWEOZM5Q6V1cUquxeZ-3QAUxS0N6tdsRUFq41n2vfON67cLhkg'
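
One thing to keep in mind: nbytes counts bytes, not bits, so nbytes=256 yields 2,048 bits of randomness. That's more than enough, but if you want a key that's exactly 256 bits, ask for 32 bytes. A minimal sketch:

```python
from secrets import token_urlsafe

# 32 bytes * 8 bits per byte = 256 bits of randomness.
key = token_urlsafe(32)

# URL-safe base64 without padding encodes 32 bytes as 43 characters.
print(len(key))
```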

Security: BSidesSF 2023: CTF

2023-04-25T14:10:00.002-07:00

This was my third time going to BSidesSF, which is a friendly, volunteer-run security conference. In the past, I've always avoided the CTF (Capture the Flag) hacking competitions because I was afraid of making a fool of myself, but, this time around, I decided to give it a go!

In the last 3 years, I've spent a ton of time practicing thanks to @thecybermentor's Practical Ethical Hacking course (which I took on Udemy), Hack the Box, and the OWASP Juice Shop (which I adored!). They really helped me feel comfortable in the CTF, but in retrospect, I didn't need to be so scared! A lot of the challenges are built to help beginners get their feet wet. In fact, the whole thing was pretty friendly!

So, I ended up skipping all of the talks--multiple people told me they watch the videos after the conference. I hacked all day Saturday and Sunday. I didn't know that the competition actually started at 5 PM on Friday. I wish I had known! I ended up only getting 4 hours of sleep on Saturday night.

I learned so much!

Although I've played with XSS for years, this was the first time I've written an XSS exploit to steal something an admin can access and send it to a Pastebin.
There was an extremely easy buffer overflow challenge that made me actually feel like a real hacker :-P
There was a great RSA lab in which I learned how to implement RSA, generate keys, break keys, etc.
I learned about hunting for subdomains with certificate transparency logs.
I spent a ton of time trying to beat a computer in 3 different rock-paper-scissors challenges.
I learned how to call a private, static method in some random jar file in Java.
I was introduced to Return-Oriented Programming (ROP) which "is a computer security exploit technique that allows an attacker to execute code in the presence of security defenses such as executable space protection and code signing" (Wikipedia).
I learned more about using netcat and socat.
I practiced analyzing an strace file to figure out what a binary is doing.
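
To give a flavor of the RSA item above, here's a toy sketch of textbook RSA: generating a key pair, encrypting, decrypting, and breaking a key by factoring. This is not the code from the CTF lab. The function names are mine, the primes are tiny on purpose, and there's no padding, so it's for learning only. It needs Python 3.8+ for the modular inverse via pow().

```python
from math import isqrt


def generate_keys(p, q, e=17):
    """Build a toy RSA key pair from two primes."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # Modular inverse of e mod phi.
    return (n, e), (n, d)


def encrypt(message, public_key):
    n, e = public_key
    return pow(message, e, n)


def decrypt(ciphertext, private_key):
    n, d = private_key
    return pow(ciphertext, d, n)


def break_key(public_key):
    """Recover the private key by factoring n (only feasible when n is tiny)."""
    n, e = public_key
    for p in range(2, isqrt(n) + 1):
        if n % p == 0:
            q = n // p
            return n, pow(e, -1, (p - 1) * (q - 1))
    raise ValueError("n has no nontrivial factors")


public_key, private_key = generate_keys(61, 53)
ciphertext = encrypt(65, public_key)
print(decrypt(ciphertext, private_key))      # Round-trips back to 65.
print(break_key(public_key) == private_key)  # Factoring n gives away d.
```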

My son, Giovanni, was able to come for a couple of hours at the end and broke into a padlock for me which was one of the challenges. I bought him his first lockpicking kit when I went to my first BSidesSF 3 years ago :-P

By the way, if you want to try some of the challenges, they said they'd keep them up for about a week.

So, you might be wondering how I did. My "team" placed 28 out of 492 teams! I analyzed the users from all the teams that scored more points than me. If I just look at individual users (ignoring the one that says @everyone), I placed 33 out of 676 users! I was definitely pretty happy with that considering this was my first CTF!

Aside from my son, Giovanni, picking one lock, I played on my own. The big winners played on actual teams. This included a ton of online-only teams. The top team had 9 people; at least some of them were onsite. It's definitely the sort of thing where it'd help to have different members of a team working on different challenges. As I said above, it would have also helped to start earlier.

Here are some of the big-picture lessons I learned (or re-learned):

Google is your best friend!
You can get pretty far with just Google and some Python, web, Linux, and possibly Burp Suite skills.
Go broad before you go deep. In hacker terms, enumerate, enumerate, enumerate!
Sometimes the challenge is broken, and it's okay to talk with the admin on Slack if you think it is.
As I said above, pay attention to the exact start time.
Long hours of sitting in a folding chair with high levels of mental intensity are brutal on your body. I totally messed up my back. Thank God I have a great chiropractor!
If your goal is to learn, consider doing it by yourself. If your goal is to win, you definitely need a good team.
Pay attention to how many points a challenge is worth. There's a huge variation! There are some that aren't too hard that are worth a lot of points. There are some that are hard that aren't worth very many points.
CTFs are fun and totally worth doing!

CHATGPT IS TOTALLY not GOING TO TAKE OVER THE WORLD!

2023-02-13T20:25:00.007-08:00

People are understandably frightened by ChatGPT. They fear that it might put software engineers like me out of business. Some of my friends have even suggested that it's the beginning of a Terminator 2 situation! I'm here to put those fears to rest:

First of all, Microsoft is investing in OpenAI. From their purchase of Skype to their development of .NET, Microsoft has always shown itself to be a highly functional company without any aspirations of world domination!

Sure, some naysayers like the CEO of OpenAI might claim that the worst case scenario is lights out for all of us, but everyone knows you shouldn't listen to random people on the Internet!

And, even if the AI goes awry, I seriously doubt that we're going to have humanoid robots that can melt in order to walk through jails like in Terminator 2! For instance, if you check out the final video in this post from Nature, it's quite clear that such technology is still years away!

In any case, I'm sure that Mark Zuckerberg, who totally is not a robot, BTW, will use Facebook to encourage lawmakers to enact world-wide laws to keep us safe!

I'm just really glad that we're all doing our part to bring about AI that can really assist humanity in creating a brighter tomorrow!

Python: Streaming Sieve of Eratosthenes

2022-11-26T11:13:00.006-08:00

I thought of a cute way of infinitely generating prime numbers that I call the Streaming Sieve of Eratosthenes:

#!/usr/bin/env python3

"""
Streaming Sieve of Eratosthenes

I thought of a cute way of infinitely generating prime numbers.
"""

from collections import defaultdict


# upcoming is a defaultdict. Each key is an upcoming number. Each value is a list
# of prime factors of that number.
upcoming = defaultdict(list)

n = 2
while True:
    factors = upcoming[n]
    del upcoming[n]
    if not factors:
        print(n)  # Prime
        factors.append(n)
    for factor in factors:
        next_n = n + factor
        upcoming[next_n].append(factor)
    n += 1
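
If you'd rather consume the primes than print them forever, the same idea can be wrapped in a generator. This is just a sketch of a variation; the function name stream_primes is mine.

```python
from collections import defaultdict
from itertools import count, islice


def stream_primes():
    """Yield primes forever using the streaming sieve."""
    # Each key is an upcoming composite. Each value is a list of the primes
    # that will "land" on that number.
    upcoming = defaultdict(list)
    for n in count(2):
        # pop() with a default doesn't trigger the defaultdict factory.
        factors = upcoming.pop(n, None)
        if factors is None:
            yield n  # Nothing landed on n, so it's prime.
            factors = [n]
        for factor in factors:
            upcoming[n + factor].append(factor)


first_ten = list(islice(stream_primes(), 10))
print(first_ten)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```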

Books: Web Application Security: Exploitation and Countermeasures for Modern Web Applications

2022-09-24T11:27:00.004-07:00

I finished reading Web Application Security: Exploitation and Countermeasures for Modern Web Applications by Andrew Hoffman.

In summary: It's not very broad. It's not very deep. It's not very complete. It's not very polished--I plan on submitting a bunch of errata.

I was surprised at Hoffman's choice to rely on Chrome DevTools and JavaScript for all his exploits. I think most web pentesters rely on man-in-the-middle proxies such as Burp Suite.

He said that it's an intermediate-level book. I think it's fair to say that it's targeted at intermediate-level programmers who are beginners in web security. Surprisingly, he didn't even cover all of the OWASP Top 10. I didn't really learn much.

On the other hand, I appreciated the fact that it was easy to read, and I enjoyed the history of hacking at the beginning of the book.

Here are my more-detailed notes.

Books: Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith

2022-08-30T06:06:00.005-07:00

I finished "Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith" by Sam Newman. It was great :)

There were a lot of things that surprised me in the book.

He's a lot more in favor of modular monoliths than I would have expected. He doesn't believe that microservices are the one true way. He thinks startups should stick with monoliths when they're trying to find their place in the world.

He's perfectly fine with calling back into the monolith. I remember, there were multiple people at my previous company who were really against that, and not just for issues of scale or latency.

He's okay with a service and a monolith both talking to the monolith's DB or the service's DB during a short transition period (i.e. days, weeks, or months) because doing double writes or flag days is so much harder.

You can either separate the app first (i.e. have the monolith and the new service talk to the same DB) or separate the DB first (i.e. have the monolith talk to two DBs).

He's okay with the idea of a "DB as an API". That's when you have a complicated, private, internal DB, but then you publish either a view or a nicely polished DB for the rest of the company to consume. I had always thought that was heresy, but, I had seen this approach in many places. In a certain sense, a lot of people excuse themselves when they do this same thing with Kafka.

Aside from being less dogmatic than I expected, the rest of the book was not that surprising. Nonetheless, it was definitely worth reading.

Here are my notes from the book. Search for "important" for some of my other favorite parts:

ix. There are now many companies created to solve the problems caused by microservices.

ix. Consider whether they're even right for you. See especially chapter 2 on this.

1.6. He gave a nice definition of what a microservice is.

2.1. Independent deployability is key.

2.5. In general, sharing a DB is bad.

2.7. You shouldn't have to deploy 2 services for 1 new feature.

3. Separating by layer (e.g. frontend vs. backend) has drawbacks.

4.8. You should favor cohesion of business functionality over cohesion of technology.

5.8. Own your own data. Avoid sharing a DB unless you really have to because it may make it harder to maintain separate deployability.

6.7. Microservices do cause problems.

7.4. "Honestly, microservices seem like a terrible idea, except for all the good stuff" :-P

7.9. Important: Leaving the UI as a big blob in the monolith is a mistake.

8.2. Avoid changing tech stacks when first moving.

8.5. You don't have to use k8s, Docker, the cloud, Go, etc.

9.0. Clojure requires fewer lines of code than Java.

9.5. Don't worry very much about how big or small each microservice is.

10.9. He was part of the group of people that picked the name "microservices".

12.0. He defined what a monolith is and discussed the various kinds of monoliths.

12.9. He talked about modular monoliths.

13.5. A modular monolith can be an excellent choice. However, scaling the DB can make it harder.

13.9. He said that Shopify was a good example of a successful modular monolith.

14.4. A distributed monolith is when you have a bunch of services that all have to be deployed together. That's bad.

15.6. He talked about the advantages of monoliths. They can be the right choice.

16. He explained coupling and cohesion in a more understandable way than I had ever seen before.

17.6. Cohesion means that the code that changes together stays together. For instance, you *shouldn't* put all of your models in one place, all your controllers in another place, and all your views in a third place. Rather, group things into business units. Coupling means if you change one, you have to change the other. Remember, we want high cohesion and low coupling.

20.8. He talked about the benefits of "publishing" a DB view, as well as "internal" vs. "external" DBs.

21.8. He talked about the benefits of outside-in API development. Talk to your consumers!

22.2. Temporal coupling is when you have multiple services with synchronous calls.

23.4. He prefers release on demand over release trains.

23.9. He spoke highly of an Erlang monolith where you could separately upgrade individual modules.

24.9. "Greenspun's 10th rule states, 'Any sufficiently complicated C or Fortran program contains an ad hoc, informally specified, bug-ridden, slow implementation of half of Common Lisp.' This has morphed into the newer joke, 'Every microservice architecture contains a half-broken reimplementation of Erlang.' I think there is a lot of truth to this."

28.7. He talked about the importance of modeling around business domains as well as Domain-Driven Design.

30.9. Start separating things based on who the system users are.

31.1. He talked about bounded contexts.

31.7. Start with larger services encompassing a whole bounded context (i.e. multiple aggregates).

32.9. He recommended the book "Domain-Driven Design Distilled".

33. You have to understand your goals. There are many when it comes to microservices, and different goals will lead to different ways of going about things.

33.9. Avoid microservice cargo cult mentality.

34. Important: A common failing is when you say "my CTO told me to do it."

34.7. It's hard to get real data on the benefits of microservices.

35.5. He gave 3 questions that you should ask yourself before adopting microservices.

35.8. He covered why you might use them as well as some alternatives (to microservices) that may result in the same benefits.

36.7. Having different teams own different parts of the monolith can help. Making more things self-service can help as well.

37.2. Important: There are many other ways to reduce time to market.

38.5. If you're worried about scalability, you could choose to only move certain code into microservices--i.e. the code where it really matters.

40.9. You could choose to use a modular monolith. The biggest downside of this approach is the shared deployment.

41.9. He walked through an example that perfectly matched why my previous company's PHP to Python transition was successful.

42.9. He talked about times when microservices are a bad idea.

42.9. Don't use microservices when you're trying to tackle an unclear domain, when you're trying to build a startup, when you're building customer-installed and managed software, or in cases where you don't have a strong reason to use them.

45.6. Important: Adopting microservices because everyone else is adopting them is a terrible reason.

45.9. Important: He specifically mentioned that it was a bad idea to try out microservices because you want to try out Kotlin :-P He said that conflating things in this way is a bad idea in general.

48.0. He talked about how to "move" an organization.

50.6. He talked about Google's "Testing on the Toilet". I tried really hard to get something published in that, but I failed :-P

51.6. Important: Don't spend a year building the perfect microservice architecture only to find out that it doesn't solve your problems.

52.4. You can avoid decomposing your DB for a while, but you can't delay that forever.

52.7. Sharing info with the rest of the org is critical.

53. He talked about the importance of incremental migration.

54.7. Don't overthink reversible decisions.

58. He talked about event storming.

59.5. The strangler fig pattern is useful. It implies calling into the monolith, and he's okay with that.

60.3. He said you might not want to tackle DB decomposition for the first few microservices, and that's okay.

63.6. He talked about the importance of cross-functional teams tackling different business verticals. I.e. don't put all your frontend people in one team, all your backend people in another team, etc. Instead, one team should have a mix of people and tackle an entire business vertical (frontend, backend, etc.).

64.7. Important: Bold proclamations can move a company forward, but expect chaos to follow.

65.0. Important: DevOps doesn't mean NoOps.

66.8. Important: Expecting developers to instantly know how to do on-call is unrealistic.

67. Important: It may make sense to have your developers handle on-call during business hours and to have your ops people handle it after hours.

77.0. Copy the code from your monolith into the microservice, and then worry about deleting it later.

77.8. Important: He said that you can perhaps start by moving to a modular monolith. For instance, you can have separate jar files. He said this approach was actually recommended.

79. He talked about the "strangler fig application" pattern.

82.0. Important: He talked more about how to call back into the monolith.

83.0. He did a good job describing how my previous company migrated from PHP to Python.

86.9. He talked about using NGINX as a proxy in front of your monolith and microservice.

88.5. Important: He talked about doing an incremental rollout using a shared DB.

90. Important: He talked about having a service that acts like a proxy translating from one protocol to another. He said that he had grave concerns about this approach because you ended up collecting too much business logic in the proxy. It reminded me a lot of GraphQL proxies.

92. He talked about service meshes.

93.8. He said you may want to delay the adoption of a service mesh until things had settled down.

97.2. He recommended the book "Enterprise Integration Patterns".

98. He talked about various helpful approaches to UI composition such as page composition, widget-based composition, etc. You can have separate widgets on the same page served by different microservices.

103. He talked about "micro frontends".

104.8. Important: He introduced the "branch by abstraction" pattern. Create an interface. Create a front door that matches that interface. Behind that front door, it can either call the old code (which also matches that interface) or the new code (which also matches that interface) which calls out to a service. Naturally, you can throw some feature flags in there, etc.
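
Here's how I'd sketch branch by abstraction in Python. It's my own illustration, not code from the book, and every name in it (NotificationSender, the flag dictionary, the fake service call) is hypothetical.

```python
from typing import Callable, Protocol


class NotificationSender(Protocol):
    """The interface that both the old and new code match."""

    def send(self, user_id: int, message: str) -> str: ...


class LegacyNotificationSender:
    """The old code path, still living in the monolith."""

    def send(self, user_id: int, message: str) -> str:
        return f"monolith sent {message!r} to user {user_id}"


class ServiceNotificationSender:
    """The new code path, which calls out to a service."""

    def __init__(self, call_service: Callable[[int, str], str]):
        self._call_service = call_service

    def send(self, user_id: int, message: str) -> str:
        return self._call_service(user_id, message)


class NotificationFrontDoor:
    """The front door: same interface, picks an implementation per call."""

    def __init__(self, old, new, use_new: Callable[[], bool]):
        self._old = old
        self._new = new
        self._use_new = use_new

    def send(self, user_id: int, message: str) -> str:
        sender = self._new if self._use_new() else self._old
        return sender.send(user_id, message)


# A dict stands in for a real feature flag system.
flags = {"notifications_via_service": False}
front_door = NotificationFrontDoor(
    old=LegacyNotificationSender(),
    new=ServiceNotificationSender(lambda uid, msg: f"service sent {msg!r} to user {uid}"),
    use_new=lambda: flags["notifications_via_service"],
)

before = front_door.send(1, "hi")
flags["notifications_via_service"] = True
after = front_door.send(1, "hi")
print(before)
print(after)
```

Callers only ever see the front door, so flipping the flag back is all it takes to return to the old code.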

105.3. Avoid long-lived branches. Rather, ship both versions of the code, and control them using a feature flag or something like that.

113. The Parallel Run pattern

115. N-Version Programming

117. He talked about GitHub's Scientist library which you can use to compare 2 live implementations of the same thing.
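
A parallel run can be sketched in a few lines of Python. To be clear, this is not Scientist's API; it's a hypothetical helper of my own that runs both implementations, reports any mismatch, and always returns the trusted result.

```python
def parallel_run(trusted, candidate, on_mismatch):
    """Wrap two implementations; the trusted one stays the source of truth."""

    def run(*args, **kwargs):
        expected = trusted(*args, **kwargs)
        try:
            actual = candidate(*args, **kwargs)
        except Exception as exc:
            # A crash in the candidate must never affect the caller.
            on_mismatch(args, kwargs, expected, exc)
        else:
            if actual != expected:
                on_mismatch(args, kwargs, expected, actual)
        return expected

    return run


def old_price(quantity):
    return quantity * 10


def new_price(quantity):
    # Deliberately wrong for large orders so the demo records a mismatch.
    return quantity * 10 if quantity < 100 else quantity * 9


mismatches = []
price = parallel_run(
    old_price,
    new_price,
    lambda args, kwargs, expected, actual: mismatches.append((args, expected, actual)),
)

small = price(5)
large = price(100)
print(small, large)
print(mismatches)
```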

118.3. He introduced "progressive delivery".

118.7. The Decorating Collaborator pattern

120.8. Change Data Capture (CDC)

123. Doing CDC with transaction logs

125. The challenges of using a shared DB

127. Important: He said you should prefer to separate the DB, but if you can't, here are some coping patterns.

128. The Database View pattern

132. The Database Wrapping Service pattern

135. Important: The Database as a Service pattern is when you maintain a well-groomed DB that's specifically meant to be consumed by clients. You'll probably want to have an internal vs. an external, read-only DB.

136. He talked about how to build the mapping engine for an external DB. You can use CDC, etc.

136.9. He mentioned Debezium for CDC.

137.4. He suggested that you might start with a DB view and then possibly move to DB as a service.

138. Important: He introduced the Aggregate Exposing Monolith in which a service provides a nice interface to data that it gets by calling back into the monolith.

138.8. Remember that microservices encompass both behavior *and* state.

138.9. Important: Don't just create a wrapper of a DB. A service should act like a state machine in which it controls what state transitions are possible. This was a key idea that he kept coming back to.
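
Here's a minimal sketch of what "a service as a state machine" might look like. The order states and the OrderService class are hypothetical examples of mine, not something from the book.

```python
class InvalidTransition(Exception):
    pass


# The service, not its callers, decides which state transitions are legal.
ALLOWED_TRANSITIONS = {
    "placed": {"paid", "cancelled"},
    "paid": {"shipped", "cancelled"},
    "shipped": {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}


class OrderService:
    def __init__(self):
        self._states = {}  # order_id -> current state

    def place(self, order_id):
        if order_id in self._states:
            raise InvalidTransition(f"order {order_id} already exists")
        self._states[order_id] = "placed"

    def transition(self, order_id, new_state):
        current = self._states[order_id]
        if new_state not in ALLOWED_TRANSITIONS[current]:
            raise InvalidTransition(f"cannot go from {current} to {new_state}")
        self._states[order_id] = new_state

    def state(self, order_id):
        return self._states[order_id]


orders = OrderService()
orders.place(42)
orders.transition(42, "paid")
orders.transition(42, "shipped")

try:
    orders.transition(42, "cancelled")  # Too late: it has already shipped.
except InvalidTransition as exc:
    rejected = str(exc)

print(orders.state(42))
print(rejected)
```

A thin wrapper around a DB would have happily written "cancelled" into the row; the point here is that the service refuses.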

140.7. Each service takes care of its own authorization insofar as it decides which state transitions are okay.

141. Heh, the same section is duplicated on p. 139 and p. 141.

141. He talked about the Change Data Ownership pattern.

142. Important: This is a super important picture of the Change Data Ownership pattern.

143.0. He talked about projecting data back into the monolith using a view.

144.6. He gave some sensible (but perhaps contrary to the orthodoxy) advice on how to keep data in sync for the strangler fig pattern.

145.0. The Synchronize Data in Application pattern

145. Riak was used for some nationwide Danish medical record system. It's apparently very good.

149. The Tracer Write pattern

158. Splitting apart the DB

160. Important: Should you split apart the DB first or the code first? Each has pros and cons.

162.4. He mentioned using Flyway for DB migrations.

163.5. SchemaSpy is a tool that you can use to visualize DB table relationships.

164.7. He talked about another successful modular monolith that talked to multiple separate schemas.

165.6. Important: Most people split the code first and the DB later. I really like this approach. It seems like it goes against the dogma of never having two services talking to the same database, but as long as it's a temporary part of a transition, it works really well, and I've seen it succeed in practice! Just to spell it out (thanks Maksim Horbul!): Create a new DB user for the service, and give it read/write access to the MonoDB. Create a new service which reads and writes to the MonoDB. Replicate the tables you need from the MonoDB to a new service DB. Stop the service temporarily. Make sure the replicated tables in the MonoDB and service DB are identical. Drop replication. Configure the new service to talk to the service's DB. Restart the service.

166.3. Don't stop there! Finish!

166.5. He talked about using the monolith as a data access layer.

169. Multi-schema storage

170.5. Important: Avoid splitting the code and the schema at the same time. Do one or the other first.

170.9. Important: If you can change the monolith, split the schema first. If not, split the code first.

175.5. Errata: There was an error in some diagram. I just wrote, "Nothing to join on."

176.2. He mentioned using Jaeger for distributed tracing.

177.5. You can just not allow deletion. Have a field for whether the row has been deleted.

178.3. When you pull something out of the DB, take the whole aggregate with you.

178.7. He covered how to deal with shared, static data like country lists.

182.2. Important: He talked about storing static reference data (like country lists) in code as shared libraries. (I remember being criticized for doing this one time.)

187.2. He talked about how to deal with transactions.

188.5. He referred to the book "Designing Data-Intensive Applications" (which my buddy Sam recommended to me).

190. He talked about two-phase commits.

193.1. Important: Say no to distributed transactions.

193.6. He explained sagas. His explanation was better than others I've seen.

201.8. Important: He talked about Business Process Modelling tools. Apparently, he doesn't like them because they're not built for programmers, yet it's programmers who always end up having to use them.

202.5. Camunda and Zeebe are better. They're open-source orchestration frameworks targeted at microservice developers.

204.9. He talked about how to implement sagas.

205.9. He didn't know the term "saga" when he wrote his first microservice book. There was another book that was very influential in this space, and it didn't use the term "saga" either.

208.0. Important: Microservices are a dial, not a switch. It's not about yes or no. It's about how many.

214.0. He talked about running two versions of a service at the same time to cope with clients that are not compatible with the new version. He said it's better to have one version of the service support both the old API and the new API.

215.5. It's imperative that you avoid accidentally breaking a contract and that you have a plan for how to handle purposely changing an API.

215.7. He talked about how to deal with reporting, big-data analytics, data warehouses, etc.

218.8. He covered log aggregation, the ELK stack, etc. He likes Humio.

219.0. You should tackle log aggregation before you try to tackle microservices.

219.5. He talked about tracing.

220.6. He mentioned Jaeger again for tracing.

221.5. Important: He talked about the importance of testing in prod with fake user data and synthetic transactions.

222.6. Distributed systems observability

222.8. He talked about the local dev experience. You'll get to a point where you can't run everything on one laptop.

223.8. He recommended telepresence.io, my buddy's company :-D

225.1. Important: He recommended that you *not* run vanilla Kubernetes. Use OpenShift or some managed solution (perhaps EKS). He's a fan of Function as a Service (e.g. Lambda).

225.7. Important: If you can use some Function as a Service solution (e.g. Lambda), do that before trying to tackle Kubernetes. He said it's way simpler. That advice surprised me. He said don't reach for Kubernetes too soon. You can wait a while. The same goes for microservices in general.

226. He covered end-to-end testing.

227.5. He talked about Consumer-Driven Contracts (CDCs). He said they're underused, but not everyone has been successful with them. He recommended pact.io.

231.2. He recommended something like a tech council where you have one person from each team.

233. He talked about orphaned services.

237. He gave his closing remarks.

Python: My Favorite Python Tricks for LeetCode Questions

2022-08-02T17:01:00.028-07:00

I've been spending a lot of time practicing on LeetCode recently, so I thought I'd share some of my favorite intermediate-level Python tricks. I'll also cover some newer features of Python you may not have started using yet. I'll start with basic tips and then move to more advanced ones.

Get help()

Python's documentation is pretty great, and some of these examples are taken from there.

For instance, if you just google "heapq", you'll see the official docs for heapq, which are often enough.

However, it's also helpful to sometimes just quickly use help() in the shell. Here, I can't remember that push() is actually called append().

>>> help([])

>>> dir([])

>>> help([].append)

enumerate()

If you need to loop over a list, you can use enumerate() to get both the item as well as the index. As a mnemonic, I like to think for (i, x) in enumerate(...):

for (i, x) in enumerate(some_list):
    ...

items()

Similarly, you can get both the key and the value at the same time when looping over a dict using items():

for (k, v) in some_dict.items():
    ...

[] vs. get()

Remember, when you use [] with a dict, if the value doesn't exist, you'll get a KeyError. Rather than see if an item is in the dict and then look up its value, you can use get():

val = some_dict.get(key)  # It defaults to None.
if val is None:
    ...

Similarly, .setdefault() is sometimes helpful.

Some people prefer to just use [] and handle the KeyError since exceptions aren't as expensive in Python as they are in other languages.
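
Since I mentioned .setdefault() without showing it, here's a small sketch. The word list is made up; the pattern is "get the existing value, or store a default and return that":

```python
# Group words by their first letter without checking for the key first.
groups = {}
for word in ["apple", "avocado", "banana", "cherry", "blueberry"]:
    # setdefault() returns groups[key] if it exists; otherwise it stores
    # the default ([] here) under that key and returns it.
    groups.setdefault(word[0], []).append(word)
```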

range() is smarter than you think

for i in range(n):  # 0 through n - 1
    ...
    
for index in range(len(items)):
    ...
    
# Count by 2s.
for i in range(0, 100, 2):
    ...

# Count backward from 100 to 0 inclusive.
for i in range(100, -1, -1):
    ...
    
# Okay, Mr. Smarty Pants, I'm sure you knew all that, but did you know
# that you can pass a range object around, and it knows how to reverse
# itself via slice notation? :-P
r = range(100)
r = r[::-1]  # range(99, -1, -1)

print(f'') debugging

Have you switched to Python's new format strings yet? They're more convenient and safer (from injection vulnerabilities) than % and .format(). They even have a syntax for outputting the expression as well as its value:

# Got 2+2=4
print(f'Got {2+2=}')

for else

Python has a feature that I haven't seen in other programming languages. Both for and while can be followed by an else clause, which is useful when you're searching for something.

for item in some_list:
    if is_what_im_looking_for(item):
        print(f"Yay! It's {item}.")
        break
else:
    print("I couldn't find what I was looking for.")

Use a list as a stack

The cost of using a list as a stack is (amortized) O(1):

elements = []
elements.append(element)  # Not push
element = elements.pop()

Note that inserting something at the beginning of the list or in the middle is more expensive because it has to shift everything to the right--see deque below.

sort() vs. sorted()

# sort() sorts a list in place.
my_list.sort()

# Whereas sorted() returns a sorted *copy* of an iterable:
my_sorted_list = sorted(some_iterable)

And, both of these can take a key function if you need to sort objects.
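
Here's a quick sketch of the key function in action. The player tuples are made-up sample data:

```python
from operator import itemgetter

players = [("Ann", 31), ("Bob", 25), ("Cy", 31)]

# Sort by score, highest first. Python's sort is stable (even with
# reverse=True), so Ann stays ahead of Cy.
by_score = sorted(players, key=itemgetter(1), reverse=True)

# Sort in place by name length, then alphabetically, using a tuple as the key.
players.sort(key=lambda p: (len(p[0]), p[0]))
```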

set and frozenset

Sets are so useful for so many problems! Just in case you didn't know some of these tricks:

# There is now syntax for creating sets.
s = {'Von'}

# There are set "comprehensions" which are like list comprehensions, but for sets.
s2 = {f'{name} the III' for name in s}
# s2 is now {'Von the III'}

# If you can't remember how to use union, intersection, difference, etc.
help(set())

# If you need an immutable set, for instance, to use as a dict key, use frozenset.
frozenset((1, 2, 3))

deque

If you find yourself needing a queue or a list that you can push and pop from either side, use a deque:

>>> from collections import deque
>>> 
>>> d = deque()
>>> d.append(3)
>>> d.append(4)
>>> d.appendleft(2)
>>> d.appendleft(1)
>>> d
deque([1, 2, 3, 4])
>>> d.popleft()
1
>>> d.pop()
4

Using a stack instead of recursion

Instead of using recursion (which is limited to a depth of about 1000 frames by default), you can use a while loop and manually manage a stack yourself. Here's a slightly contrived example:

work = [create_initial_work()]
while work:
    work_item = work.pop()
    result = process(work_item)
    if is_done(result):
        return result
    work.append(result.pieces[0])  # Lists have append(), not push().
    work.append(result.pieces[1])

Using yield from

If you don't know about yield, you can go spend some time learning about that. It's awesome.

Sometimes, when you're in one generator, you need to call another generator. Python now has yield from for that:

def my_generator():
    yield 1
    yield from some_other_generator()
    yield 6

So, here's an example of backtracking:

class Solution:
    def problem(self, digits: str) -> List[str]:
        def generate_possibilities(work_so_far, remaining_work):
            if not remaining_work:
                if work_so_far:
                    yield work_so_far
                return
            first_part, remaining_part = remaining_work[0], remaining_work[1:]
            for i in things_to_try:
                yield from generate_possibilities(work_so_far + i, remaining_part)
        
        output = list(generate_possibilities(no_work_so_far, its_all_remaining_work))
        return output

This is appropriate if you have fewer than 1000 "levels" but a ton of possibilities for each of those levels. This won't work if you're going to need more than 1000 layers of recursion. In that case, switch to "Using a stack instead of recursion".

Updated: On the other hand, if you can have the recursive function append to some list of answers instead of yielding it all the way back to the caller, that's faster.
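
A minimal sketch of that append-to-a-list variant, using subsets as a stand-in problem (the function names are mine, not from any particular LeetCode question):

```python
def subsets(items):
    results = []

    def backtrack(index, chosen):
        if index == len(items):
            results.append(tuple(chosen))  # Record the answer directly.
            return
        backtrack(index + 1, chosen)       # Skip items[index].
        chosen.append(items[index])        # Take items[index].
        backtrack(index + 1, chosen)
        chosen.pop()                       # Undo the choice.

    backtrack(0, [])
    return results
```

Each answer is appended once, instead of being handed back through every level of yield from.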

Pre-initialize your list

If you know how long your list is going to be ahead of time, you can avoid needing to resize it multiple times by just pre-initializing it:

dp = [None] * len(items)

collections.Counter()

How many times have you used a dict to count up something? It's built-in in Python:

>>> from collections import Counter
>>> c = Counter('abcabcabcaaa')
>>> c
Counter({'a': 6, 'b': 3, 'c': 3})

defaultdict

Similarly, there's defaultdict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> d['girls'].append('Jocylenn')
>>> d['boys'].append('Greggory')
>>> d
defaultdict(<class 'list'>, {'girls': ['Jocylenn'], 'boys': ['Greggory']})

Notice that I didn't need to set d['girls'] to an empty list before I started appending to it.

heapq

I had heard of heaps in school, but I didn't really know what they were. Well, it turns out they're pretty helpful for several of the problems, and Python has a list-based heap implementation built-in.

If you don't know what a heap is, I recommend this video and this video. They'll explain what a heap is and how to implement one using a list.

The heapq module is a built-in module for managing a heap. It builds on top of an existing list:

import heapq

some_list = ...
heapq.heapify(some_list)

# The head of the heap is some_list[0].
# The len of the heap is still len(some_list).

heapq.heappush(some_list, item)
head_item = heapq.heappop(some_list)

The heapq module also has nlargest and nsmallest built-in so you don't have to implement those things yourself.
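
For instance (sample data made up), both functions also accept a key function:

```python
import heapq

scores = [5, 1, 9, 3, 7, 2]
top3 = heapq.nlargest(3, scores)      # The three largest, biggest first.
bottom2 = heapq.nsmallest(2, scores)  # The two smallest, smallest first.

words = ["kiwi", "fig", "banana", "grape"]
longest2 = heapq.nlargest(2, words, key=len)
```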

Keep in mind that heapq is a minheap. Let's say that what you really want is a maxheap, and you're not working with ints, you're working with objects. Here's how to tweak your data to get it to fit heapq's way of thinking:

heap = []
heapq.heappush(heap, (-obj.value, obj))

(ignored, first_obj) = heapq.heappop(heap)

Here, I'm using - to make it a maxheap. I'm wrapping things in a tuple so that it's sorted by the obj.value, and I'm including the obj as the second value so that I can get it.
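
One caveat with the tuple trick: if two objects have the same value, Python falls back to comparing the objects themselves, which raises a TypeError unless they define ordering. A common workaround (sketched here with a hypothetical Task class) is to put a unique counter in the middle of the tuple:

```python
import heapq
from itertools import count


class Task:  # Hypothetical object with no ordering defined.
    def __init__(self, name, value):
        self.name = name
        self.value = value


tasks = [Task("a", 5), Task("b", 9), Task("c", 9)]

heap = []
tie_breaker = count()  # 0, 1, 2, ... so equal values never compare Tasks.
for task in tasks:
    heapq.heappush(heap, (-task.value, next(tie_breaker), task))

(_, _, first) = heapq.heappop(heap)
```

Among equal values, the counter also means the one pushed first comes out first.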

Use bisect for binary search

I'm sure you've implemented binary search before. Python has it built-in. It even has keyword arguments that you can use to search in only part of the list:

import bisect

insertion_point = bisect.bisect_left(sorted_list, some_item, lo=lo, hi=hi)

Pay attention to the key argument which is sometimes useful, but may take a little work for it to work the way you want.

namedtuple and dataclasses

Tuples are great, but it can be a pain to deal with remembering the order of the elements or unpacking just a single element in the tuple. That's where namedtuple comes in.

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(5, 7)
>>> p
Point(x=5, y=7)
>>> p.x
5
>>> q = p._replace(x=92)
>>> p
Point(x=5, y=7)
>>> q
Point(x=92, y=7)

Keep in mind that tuples are immutable. I particularly like using namedtuples for backtracking problems. In that case, the immutability is actually a huge asset. I use a namedtuple to represent the state of the problem at each step. I have this much stuff done, this much stuff left to do, this is where I am, etc. At each step, you take the old namedtuple and create a new one in an immutable way.

Updated: Python 3.7 introduced dataclasses. These have multiple advantages:

They can be mutable or immutable (although, there's a small performance penalty).
You can use type annotations.
You can add methods.

from dataclasses import dataclass

@dataclass  # Or: @dataclass(frozen=True)
class InventoryItem:
    """Class for keeping track of an item in inventory."""
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand

item = InventoryItem(name='Box', unit_price=19, quantity_on_hand=2)

dataclasses are great when you want a little class to hold some data, but you don't want to waste much time writing one from scratch.

Updated: Here's a comparison between namedtuples and dataclasses. It leads me to favor dataclasses since they have faster property access and use 30% less memory :-/ Per the Python docs, using frozen=True is slightly slower than not using it. In my (extremely unscientific) testing, using a normal class with __slots__ is faster and uses less memory than a dataclass.

int, decimal, math.inf, etc.

Thankfully, Python's int type supports arbitrarily large values by default:

>>> 1 << 128
340282366920938463463374607431768211456

There's also the decimal module if you need to work with things like money where a float isn't accurate enough or when you need a lot of decimal places of precision.
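
A quick sketch of the difference (the precision of 50 is an arbitrary choice for illustration):

```python
from decimal import Decimal, getcontext

float_total = 0.1 + 0.2                        # Binary floats can't represent 0.1 exactly.
exact_total = Decimal("0.1") + Decimal("0.2")  # Build Decimals from strings, not floats.

getcontext().prec = 50  # Significant digits for subsequent arithmetic.
third = Decimal(1) / Decimal(3)
```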

Sometimes, they'll say the range is -2 ^ 31 to 2 ^ 31 - 1 (the 32-bit signed integer range). You can get those values via bitshifting:

>>> -(2 ** 31) == -(1 << 31)
True
>>> (2 ** 31) - 1 == (1 << 31) - 1
True

Sometimes, it's useful to initialize a variable with math.inf (i.e. infinity) and then try to find new values less than that.

Updated: If you want to save memory by not importing the math module, just use float("inf").

Closures

I'm not sure every interviewer is going to like this, but I tend to skip the OOP stuff and use a bunch of local helper functions so that I can access things via closure:

class Solution():  # This is what LeetCode gave me.
    def solveProblem(self, arg1, arg2):  # Why they used camelCase, I have no idea.
      
        def helper_function():
            # I have access to arg1 and arg2 via closure.
            # I don't have to store them on self or pass them around
            # explicitly.
            return arg1 + arg2
          
        counter = 0
        
        def can_mutate_counter():
            # By using nonlocal, I can even mutate counter.
            # I rarely use this approach in practice. I usually pass in it
            # as an argument and return a value.
            nonlocal counter
            counter += 1
            
        can_mutate_counter()
        return helper_function() + counter

match statement

Did you know Python now has a match statement?

# Taken from: https://learnpython.com/blog/python-match-case-statement/

>>> command = 'Hello, World!'
>>> match command:
...     case 'Hello, World!':
...         print('Hello to you too!')
...     case 'Goodbye, World!':
...         print('See you later')
...     case other:
...         print('No match found')

It's actually much more sophisticated than a switch statement, so take a look, especially if you've never used match in a functional language like Haskell.

OrderedDict

If you ever need to implement an LRU cache, it'll be quite helpful to have an OrderedDict.

Python's dicts are now ordered by default. However, the docs for OrderedDict say that there are still some cases where you might need to use OrderedDict. I can't remember. If you ever need your dicts to be ordered, just read the docs and figure out if you need an OrderedDict or if you can use just a normal dict.
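
For the LRU cache case specifically, the two OrderedDict methods that do the heavy lifting are move_to_end() and popitem(last=False). Here's a minimal sketch in the shape LeetCode usually asks for (returning -1 on a miss is an assumption about the problem statement):

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)  # Mark as most recently used.
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # Evict the least recently used.
```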

@functools.cache

If you need a cache, sometimes you can just wrap your code in a function and use functools.cache:

from functools import cache

@cache
def factorial(n):
    return n * factorial(n - 1) if n else 1
  
print(factorial(5))
...
factorial.cache_info()  # With @cache, maxsize is None: CacheInfo(hits=..., misses=..., maxsize=None, currsize=...)

Debugging ListNodes

A lot of the problems involve a ListNode class that's provided by LeetCode. It's not very "debuggable". Add this code temporarily to improve that:

def list_node_str(head):
    seen_before = set()
    pieces = []
    p = head
    while p is not None:
        if p in seen_before:
            pieces.append(f'loop at {p.val}')
            break
        pieces.append(str(p.val))
        seen_before.add(p)
        p = p.next
    joined_pieces = ', '.join(pieces)  
    return f'[{joined_pieces}]'


ListNode.__str__ = list_node_str

Saving memory with the array module

Sometimes you need a really long list of simple numeric (or boolean) values. The array module can help with this, and it's an easy way to decrease your memory usage after you've already gotten your algorithm working.

>>> import array
>>> array_of_bytes = array.array('b')
>>> array_of_bytes.frombytes(b'\0' * (array_of_bytes.itemsize * 10_000_000))

Pay close attention to the type of values you configure the array to accept. Read the docs.

I'm sure there's a way to use individual bits for an array of booleans to save even more space, but it'd probably cost more CPU, and I generally care about CPU more than memory.

Using an exception for the success case rather than the error case

A lot of Python programmers don't like this trick because it's equivalent to goto, but I still occasionally find it convenient:

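# Subclass Exception rather than StopIteration: a StopIteration raised
# inside a generator gets turned into a RuntimeError.
class Eureka(Exception):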
    """Eureka means "I found it!" """
    pass

  
def do_something_else():
    some_value = 5
    raise Eureka(some_value)


def do_something():
    do_something_else()


try:
    do_something()
except Eureka as exc:
    print(f'I found it: {exc.args[0]}')

Updated: Enums

Python now has built-in enums:

from enum import Enum

# Either:
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

# Or:
Color = Enum('Color', ['RED', 'GREEN', 'BLUE'])

However, in my experience, when coding for LeetCode, just having some local constants (even if the values are strings) is a tad faster and requires a tad less memory:

RED = "RED"
GREEN = "GREEN"
BLUE = "BLUE"

Using strings isn't actually slow if all you're doing is pointer comparisons.

Updated: Using a profiler

You'll need some sample data. Make your code crash when it sees a test case with a lot of data. Grab the data in order to get your code to run on its own. Run something like the following. It'll print out enough information to figure out how to improve your code.

import cProfile
cProfile.run("Solution().someMethod(sampleData)")

Using VS Code, etc.

VS Code has a pretty nice Python extension. If you highlight the code and hit shift-enter, it'll run it in a shell. That's more convenient than just typing everything directly in the shell. Other editors have something similar, or perhaps you use a Jupyter notebook for this.

Another thing that helps me is that I'll often have separate files open with separate attempts at a solution. I guess you can call this the "fast" approach to branching.

Write English before Python

One thing that helps me a lot is to write English before writing Python. Just write all your thoughts. Keep adding to your list of thoughts. Sometimes you have to start over with a new list of thoughts. Get all the thoughts out, and then pick which thoughts you want to start coding first.

Conclusion

Well, those are my favorite tricks off the top of my head. I'll add more if I think of any.

This is just a single blog post, but if you want more, check out Python 3 Module of the Week.

Security: BSidesSF 2022

2022-06-13T20:17:00.009-07:00

Opening Remarks

The theme this year is "from the ground up". They're focusing on community, collaboration, and education.

It's a 100% volunteer team. 25 people work year-round.

They had speed mentoring sessions.

They really need some new volunteers. See bsides.sf/jobs.

The talks will be on their YouTube channel.

They have a stringent photo policy. You must have the permission of everyone in the frame, and crowd shots where you can see faces are strongly discouraged.

Here is the schedule, and here is their YouTube channel.

[I wrote these notes by hand and then transcribed them in a single day. I didn't quite expect them to be so voluminous! Happy hacking!]

Keynote: We Need More Mediocre Security Engineers

Jackie Bow (@jbowocky) from Asana.

[This was my favorite talk.]

She pointed out that BSidesSF was the last in-person conference that a lot of us attended before the pandemic. That was true for her, and it was for me as well.

She's held many jobs in security, including malware reverse engineering, which is one of the most hard-core jobs you can have in security.

She's worked for Facebook, for the government, etc.

She said that ClamAV is still the best open-source antivirus software there is.

One time, she added a virus signature to ClamAV but forgot to add the trailing newline. This broke Facebook Messenger in production for 1-3 hours.

Important: 82% of breaches involve a human element.

We expect each other to be perfect in security. We're not.

She said, "Have you read InfoSec Twitter? Ugh!"

Important: Extreme expectations lead to burnout, not excellence.

More != better.

Burnout is in the standard classification of mental disorders. "Burnout has been defined as a combination of emotional exhaustion, depersonalization, and reduced personal accomplishment caused by chronic work stress" (cited).

Unfortunately, our work predisposes us to burnout, but we have to avoid burnout if we hope to do this career for a long time.

Consider COVID-19, Log4j, the Colonial Pipeline hack, SolarWinds, supply chain attacks, Ukraine, etc.

The SolarWinds thing shook her deeply because she really respected FireEye.

There are currently 600k people in security. It's expected that there will be two million open roles. How are we going to add a million new people to the field?

She referred to Stuxnet.

Our current burn rate is unsustainable.

We need to dismantle our concept of a security unicorn.

We need to see each other as allies. We need to stop overworking. We need to change who we think is hirable.

We're too elitist. That's bad.

We expect people to know everything. [That's something I'm struggling with as I prepare to interview.]

We can't scale as solo individuals.

We need to drop the l337 hacker stuff.

Social isolation and loneliness [which I know all too well] increase the likelihood of early death by 25-30%. It's equivalent to 15 cigarettes a day.

Elitism is the enemy of diversity.

Only 24% of people in security identify as women.

She used to work on reverse-engineering malware. That's one of the most technical jobs you can have in security. Now, she feels like a dinosaur because of all this SaaS software, CSRF, etc.

We end up being expected to always be on.

Important: She called it the "wheel of reactive hell".

There's always more work to do.

Glorifying overworking hurts us all.

She talked about her kid asking her at 7 PM how much more work she had to do. [I literally started tearing up when she said that because the exact same thing had happened to me the day before.]

How often do we get to take a vacation longer than a week?

Vacations are hugely important.

You can be a great security engineer and still have hobbies--even non-security ones!

We need to bridge the "talent gap."

We're looking for unicorns. We need to stop that.

We need to see degrees as a privilege.

We should look at education as something that should happen once you're already in the industry.

There is no agreed-upon value for boot camps or certs.

We need to offer education as a benefit.

We really don't know how to hire for cyber security roles.

We still demand CS degrees, and that's bad.

Your job should pay for you to do boot camps, certs, etc.

At our current rate, we'll burn out before the pipeline fixes itself.

We need to dismantle the unicorn.

We need to challenge our perceptions of who belongs in this industry to achieve a more diverse workforce.

An Unlikely Friendship: Why Security Engineers and Product Managers Should Be Working Together

Leif Dreizler (@leifdreizler) and Rachel Landers (@workingrach) from Twilio Segment.

Segment was acquired by Twilio.

He's an engineering manager, and she's a PM. Their team worked on building security-related features.

Segment is a customer data platform.

They use TypeScript and Go.

Enterprise customers have very high bars. They're very demanding and noisy.

They mentioned LocoMocoSec which is a security conference in Hawaii.

SecEng = security engineering

Netflix has a great security team. They had this idea that the paved path should lead people to do things securely.

They talked about a self-service approach to security.

He talked about the different sides of security. Application security was on his list.

[It's amazing how similar their team is to the team I worked on at Udemy.]

Their first feature was a password strength meter built using zxcvbn and Have I Been Pwned.

Next, they tackled MFA.

The biggest feature they tackled was integrating with SCIM. [Our team didn't do that one. The UB team did that one.]

You can use SCIM to integrate with Okta or Azure Active Directory to provision users in your app. It's a system for cross-domain identity management.

PDLC = product development lifecycle. [We used the term SDLC.]

Ask yourself, why is this the right time to build this feature?

IdP = identity provider

Okta groups were mapped to Segment groups.

SDD = software design doc

You should "always be selling". The SDD should spend a little bit of time convincing people why it's a good idea to build this feature.

The PM owns what and when.

The engineering manager owns how and when it'll be done.

Important: "Weeks of programming can save you days of planning."

SCIM is basically CRUD for users and groups.

He mentioned RFCs 7642, 7643, and 7644.

When you have to implement query filtering, use a library.

Read the onboarding docs for each of the IdPs.

Build the integration with the IdPs.

He gave Okta props for how smoothly the process went. It took OneLogin almost a year to accept their integration.

Enterprise software has a bigger focus on security than consumer-facing software.

1/3 of their customers who use SSO use SCIM.

ARR = annual recurring revenue.

The customers they have that use SCIM account for 21% of their ARR.

Defaults matter a lot!

Lunch

first.org shares incident response data.

I talked to some security journalists who piece together news about incidents.

BoringSSL and libsodium are examples of tools that are simple, easy, and useful.

OpenSSL is pointlessly and hopelessly complex.

Code Red Partners is a recruiting firm that focuses on security professionals.

Embracing Risk Responsibly: Moving beyond inflexible SLAs and exception hell by treating security vulnerabilities and risk like actual debt

Eric Ellett from Twilio Segment

We need to embrace innovation to get away from having a dumpster fire of a security program.

Start by buying some time with solutions that are "good enough".

Identify and engage with critical customers (which are people inside your company that your security team has to work with).

He talked about an example where the AppSec team asked a service to fix a P1 issue reported via a bug bounty program.

He talked about creating metrics for closing vulns.

When you're working on a v2 of your program, rebuild the foundation with data. Now you have some time to build a proper foundation.

He talked about sending formal emails asking people to fix their vulns. A key part of these emails was that they had a due date based on the severity. This due date was possible to extend.

Attributing vulns to teams was hard because of the constant org changes.

They tied vulns to divisions and departments.

They rolled the data up the org chart to enable competition across the company for who could fix their vulns the most quickly.

At this point in your program, you can start experimenting strategically.

There are different risk appetites in different parts of the org.

He referred to Google's SRE book. He talked about SLIs, SLOs, and SLAs. In particular, he referred to chapter 3 on embracing risk.

Important: The only truly secure system is one that is powered off, cast in a block of concrete, and sealed in a lead-lined room with armed guards - and even then I have my doubts. -- Gene Spafford

He talked about error budgets.

For an SLO, he talked about uptime per quarter.

Perfect security and reliability is not the goal--it's too expensive.

Important: They created a debt metric: debt = (current_date - orig_date) / sla_in_days

The higher the priority, the shorter the SLA in days.

So, if the priority says you have to get it fixed in a day, every day you slip, you're increasing your debt by 1. However, if the priority says you have to fix it in a month, then it takes a whole month for you to increase your debt by 1.
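
Here's how I'd sketch that metric in Python. The SLA table is hypothetical (he didn't give actual numbers in the talk), and team_debt is just my guess at what rolling it up might look like:

```python
from datetime import date

# Hypothetical SLA table; the talk didn't give actual numbers.
SLA_DAYS = {"P1": 1, "P2": 7, "P3": 30}


def debt(orig_date, current_date, priority):
    """debt = (current_date - orig_date) / sla_in_days"""
    return (current_date - orig_date).days / SLA_DAYS[priority]


def team_debt(vulns, current_date):
    """Roll up debt across (orig_date, priority) pairs, e.g. for one team."""
    return sum(debt(orig, current_date, prio) for orig, prio in vulns)
```

With these numbers, a P1 that's three days old carries a debt of 3, while a P3 has to sit for 30 days to reach a debt of 1.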

As he mentioned before, this debt can be calculated and rolled up organizationally.

You can break down the debt in different ways.

He mentioned Snowflake.

He said that prioritizing work based on a debt metric is more helpful than prioritizing based on severity alone.

They even integrated the debt metric into CI.

He said that Segment's security program is further ahead than the rest of Twilio's.

At Segment, they're not yet tackling the p4s and p5s. They're too noisy right now.

He said that compensating controls frequently lower the CVSS which lowers the priority.

He talked about using Backstage for code asset management--i.e. which team owns the code with the vuln.

They're moving from VMs to k8s.

Buying Security: A Client's Guide

Rami McCarthy (@ramimacisabird)

He called himself a "reformed security consultant".

Buying security services is hard.

The security industry is a $100 billion industry.

Let's talk about security assessments. This is a comprehensive guide on buying and getting value.

He mentioned some survey that talked about buying and selling security.

He mentioned a talk from 2011 called, Penetration Testing Considered Harmful.

Important: Consider the question: Is a particular pentest good? The answer lies along a scale that goes from "it's bad" to "you don't know".

White box tests are now dominant. They're more efficient and more thorough.

Don't compare a pentest to a bug bounty program.

Don't fall for a dressed-up Nessus scan.

There are different motivations for getting a security assessment. Risk reduction is the number one reason. The second most common reason is compliance.

There are different types of vendors.

It's hard to know if a vendor is good. Network recommendations are helpful.

Be careful about how much time you give the vendor. Keep in mind Parkinson's law.

Know your scope.

Gather 3-5 proposals.

When your goal is compliance, the pentester has to strike a balance between providing value vs. actually enabling you to pass.

Your own sales clients might tell you who to use since they might have customers asking for proof of compliance via specific vendors.

The vendor will help you further refine your target scope. You have to home in on clear objectives and the length of the engagement. These will affect the cost.

Surprisingly, different vendors will come back with very different quotes.

Fast, good, and cheap--pick 2. In security, it's more like pick 1.

Be skeptical of cheap proposals and consultants.

There's lots of paperwork involved: NDA, MSA, SOW, etc.

Cure53 actually made their paperwork public.

Show the pentesters your known risks, your threat models, etc. This will help them.

Don't waste their time by leaving in obvious, known vulnerabilities, by forcing them to go through your WAF (just let them through), or by giving them an incomplete environment that is missing the data it needs to be useful.

Their reports are decomposed and sent to different teams. There is usually an executive summary vs. a section with nitty-gritty details.

A lot of people like getting an overall score or grade.

Make sure the vendor cleans up after themselves. He saw a case where one vendor left an open shell, and then another vendor found it.

Remember, no findings != no risk.

Do root cause and variant analysis.

Assessments are an expensive way to find vulns.

For each vuln, you need to fix, mitigate, or accept the risk.

Remediate the vulns. Don't just leave them there to be found by the next pentester.

Do a retro after you're done.

You can use canary bugs to see if they're actually doing their job.

Consider your pentesting cadence: Once a year? Once every six months?

Think about the ROI.

Don't kill bugs. Kill bug classes.

Emerging Best Practices in Software Supply Chain Security: What We Can Learn from Google, the White House, OWASP, and Gartner

Tony Loehr from Cycode

He talked about Google's SLSA and NIST's SSDF. These are AppSec frameworks.

By 2025, 45% of orgs will experience an attack on their supply chain.

Executive Order 14028, which covers improving the nation's cybersecurity, had some text that complained about the opaqueness of commercial software.

It talked about five objectives: protect, confidentiality, identify (SBOM), rapid responses, and training.

Important: 80% of incidents involve a known vuln that hasn't been patched.

He spoke more about Google's SLSA framework.

Level 4 requires a two-person review of all changes as well as hermetic, reproducible builds.
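
[To make "reproducible" concrete for myself: a build is reproducible if building the same source twice gives a bit-for-bit identical artifact. Here's a minimal sketch of that check. The two build functions are hypothetical stand-ins I made up for illustration; this is not SLSA tooling.]

```python
import hashlib
import uuid
from typing import Callable


def hermetic_build(source: bytes) -> bytes:
    """Hypothetical build whose output depends only on its input."""
    return b"ARTIFACT:" + source


def leaky_build(source: bytes) -> bytes:
    """Hypothetical build that embeds a random build ID, so outputs differ."""
    return b"ARTIFACT:" + source + uuid.uuid4().bytes


def is_reproducible(build: Callable[[bytes], bytes], source: bytes) -> bool:
    """Build twice and compare the SHA-256 digests of the two artifacts."""
    first = hashlib.sha256(build(source)).hexdigest()
    second = hashlib.sha256(build(source)).hexdigest()
    return first == second
```

Calling is_reproducible(hermetic_build, b"src") passes, while the build that embeds a random ID fails the same check. Timestamps and build IDs are the usual things that sneak into artifacts.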

SSDF covers what. SLSA covers how. There are still some gaps.

He mentioned Terraform.

He mentioned least privileged access.

He mentioned anomaly detection.

Avoiding insidious points of compromise in infrastructure access systems

Sharon Goldberg is the CEO/Co-Founder of BastionZero and is also a tenured professor in the Computer Science Department at Boston University.

[I was very impressed by her creds. I don't want to start any rumors, but I'm pretty sure I overheard that at night, she's a vigilante crime fighter, and she likes to fly fighter jets for fun :-P ]

She focuses on infra-access systems.

She wanted to do a detailed breakdown of some war stories.

Act 1: Standing credentials, VPNs

Act 2: Zero Trust

Act 3: Weaknesses in Zero Trust.

She started by talking about bastion hosts.

She talked about Fluffy Bunni from 2001. This compromise involved a fake ssh client that stole passwords from compromised users. Even the bastion was infected. However, it wasn't able to steal ssh key passphrases.

Lesson: Don't give users standing credentials, especially passwords. Use MFA.

Next up, she talked about VPNs.

She talked about Operation Aurora from 2009. It was a Chinese APT breaking into Akamai. There was a zero-day in IE that allowed the attacker to compromise the entire machine.

Amazingly, the adversary had a very long dwell time, i.e. they went undetected for a very long time. They were able to move laterally, behind the VPN.

Their goal was to get to the source code.

Akamai didn't even know they were inside. Finally, Google took over some C&C server and told Akamai about the ongoing attack.

Lesson: Don't trust people just because they're on a secured network. That's the idea behind Zero Trust.

Akamai also wasn't segmented very well at the time.

Lesson: Segment!

Next, she talked about single-level domain administration such as Active Directory Admin Server.

She talked about an article named "NotPetya Ransomware" from 2017 that she said was great. She called it a watering hole attack. That's where you hack something and then wait for people to interact with it. In this case, it was Ukrainian tax software.

Once they were able to steal one credential, they were able to get to all the other machines. The result was that computers were bricked. They literally had to be thrown away.

She said we too often rely on a privileged system--a system locked down with a single cred.

Lesson: Vet your supply chain.

Act 2: Zero Trust

When it comes to remote access, don't trust the user just based on their network address. Don't rely on long-lived creds.
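
[Here's a minimal sketch of what "don't rely on long-lived creds" could look like in code: a token that carries its own expiry and is rejected once that time passes. The token format, the SECRET value, and the function names are all my own assumptions for illustration. Real systems would use something like a short-lived x509 certificate or a SAML token instead.]

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"demo-signing-key"  # hypothetical; a real key would live in a secrets manager


def _sign(payload: str) -> str:
    """Return an HMAC-SHA256 signature over the payload."""
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(user: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Issue a token of the form 'user|expiry|signature' (user must not contain '|')."""
    issued_at = time.time() if now is None else now
    payload = f"{user}|{int(issued_at) + ttl_seconds}"
    return f"{payload}|{_sign(payload)}"


def verify_token(token: str, now: Optional[float] = None) -> bool:
    """Accept the token only if the signature matches and it hasn't expired."""
    try:
        user, expiry, signature = token.split("|")
    except ValueError:
        return False
    expected = _sign(f"{user}|{expiry}")
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current < int(expiry)
```

A stolen token is only useful until its expiry, which is the whole point. The now parameter is only there to make the behavior easy to test.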

She talked about some situation involving a certificate authority, an SSO provider, and a proxy. She talked about an x509 certificate or a SAML token.

She talked about DigiNotar from 2011. She said the incident involved blindly trusting a CA. She said that in her mind this is one of the top 5 incidents of all time.

Some CA was hacked. The hacker created a certificate for Google, and they used it to snoop on Google's traffic.

We later created certificate transparency, etc.

Next, she covered SolarWinds from 2020. She said the problem here was blindly trusting SSO too much [uh oh].

She showed two architectures. In one architecture MFA would not have helped. She said if MFA was separated from the SSO provider, it'd require a second point of compromise.

Lesson: Users get hacked. Access systems get hacked.

She recommended reading some article that talked about DigiNotar getting hacked. [Perhaps this one?]

Red Teaming macOS Environments with Hermes the Swift Messenger

Justin Bui (@slyd0g)

He's a red teamer at Zoom. He's also a skateboarder.

He talked about the benefits of the Swift programming language and the Mythic framework.

He talked about the benefits of using Swift as a post-exploitation language. It now runs on Linux and Windows too.

Swift can interoperate with C, C++, and ObjC.

On macOS, Swift is not installed by default, but the libraries are.

There are several languages used for post-exploitation on macOS: JXA, Python, and Golang are common.

JXA has been abandoned. Apple said that Python and other scripting languages are deprecated and will be removed. [I noticed it's no longer present on macOS 12.4 Monterey.]

He said Golang is fantastic. It too can interoperate with C, C++, and ObjC. It does result in big binaries, though.

By using the swift command, you can circumvent the app whitelist. However, it's not installed by default.

Mythic is a cross-platform, post-exploit, red teaming framework built with python3, docker, docker-compose, and a web browser UI. It has a C&C server.

He talked about how the implant agent calls back from the victim.

There are payloads to target macOS.

He kept talking about LOLBins.

[I didn't know what a LOLBin was. Per this page, LOLBins is the abbreviated term for Living Off the Land Binaries. Living Off the Land Binaries are binaries of a non-malicious nature, local to the operating system, that have been utilized and exploited by cyber criminals and crime groups to camouflage their malicious activity.]

He said that Python and Swift are LOLBins.

Hermes is a Swift payload for the Mythic framework. He's the author.

The Mythic framework makes use of encrypted key exchange in order to encrypt the traffic between the victim and the C&C server.

Hermes has various modules for post-exploitation.

By using the Mythic framework, he only had to worry about writing code for the implant side.

He didn't want to force developers to use Macs. He said that setting up cross-compilation was the hardest part of the project.

Darling is a macOS emulation layer for Linux. It's like Wine, but for macOS. Darling relies on a Linux kernel module.

He talked about the "operator" who was controlling the C&C server.

Each job is a separate thread allowing you to run things in parallel.

He showed Mythic's web UI. You can upload files to and download files from the victim host from your browser. It can also capture screenshots of the user's browser.

It has clipboard monitoring too. Note that root doesn't have access to the clipboard [weird!]. He talked about nabbing passwords when people copy and paste them.

He talked about a time when his co-worker reverse-engineered some malware to steal some techniques.

plist files can be XML, JSON, or binary.
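
[Python's standard plistlib module can read and write the XML and binary formats, so here's a small sketch that round-trips a dictionary through both and sniffs the format from the leading bytes. As far as I know, plistlib doesn't handle the JSON flavor, so the sniffing for JSON is just a guess based on the first character. The settings dictionary and the plist_format helper are made up for illustration.]

```python
import plistlib

settings = {"Name": "demo", "Enabled": True, "Count": 3}  # hypothetical contents

xml_bytes = plistlib.dumps(settings, fmt=plistlib.FMT_XML)
binary_bytes = plistlib.dumps(settings, fmt=plistlib.FMT_BINARY)


def plist_format(data: bytes) -> str:
    """Guess the plist format from the first few bytes."""
    stripped = data.lstrip()
    if stripped.startswith(b"bplist"):
        return "binary"
    if stripped.startswith(b"<"):
        return "xml"
    if stripped.startswith((b"{", b"[")):
        return "json"
    return "unknown"


# plistlib.loads detects XML vs. binary on its own.
assert plistlib.loads(xml_bytes) == plistlib.loads(binary_bytes) == settings
```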

He keeps focusing on using techniques to snoop on what the user is doing.

Apple has an Endpoint Security Framework. 3rd-party developers got "pushed out of the kernel". Because of this, hackers and security software now have equal footing.

Attackers can use launch agents to achieve persistence.

It reminded me of spy vs. spy.

Opening Remarks for Day 2

The summary of the Code of Conduct is, "Do not be an ass, or we'll kick your ass out!"

Keynote: Building sustainable security programs

Astha Singhal, Director of Security, Netflix

She too talked about InfoSec burnout.

This is a job where you never win.

These are the contributing factors:

Constant firefighting: She referred to Log4J.
Security cynicism
Culture of catastrophizing
Possible vs. probable
Personal responsibility
Ridiculous and impossible
Ongoing conflicts with stakeholders
Changing threat landscape
We're never done
There are never enough things in the wins column: Only one thing needs to go wrong for bad things to happen.

That's a lot!

She talked about organizational culture.

We need to disrupt security cynicism.

We need to discourage heroics and instead celebrate long-term wins. Proactive investments are better.

Culture takes intentionality.

Build "additive" teams--where each new person adds something unique to the team.

At one point, all the members of her team were AppSec engineers. They've expanded.

Build an environment of empathy and collaboration.

Keep in mind business enablement and customer service.

Consider things from a risk perspective. Our job is to manage risk.

risk = likelihood * impact

Help other security engineers think about risk as well.

Don't forget about probability or likelihood. Don't overfocus on things that have extremely high impact but very little likelihood.
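
[Here's a tiny worked example of risk = likelihood * impact. The risks and the numbers are made up by me purely for illustration; they're not from the talk.]

```python
from typing import NamedTuple


class Risk(NamedTuple):
    name: str
    likelihood: float  # estimated probability per year, from 0.0 to 1.0
    impact: float      # estimated cost in dollars if it happens

    @property
    def score(self) -> float:
        """risk = likelihood * impact"""
        return self.likelihood * self.impact


risks = [
    Risk("Meteor hits the data center", likelihood=0.000001, impact=50_000_000),
    Risk("Unpatched dependency exploited", likelihood=0.2, impact=500_000),
    Risk("Laptop stolen", likelihood=0.5, impact=20_000),
]

# Highest expected loss first.
ranked = sorted(risks, key=lambda r: r.score, reverse=True)
```

The meteor has by far the highest impact, but it lands at the bottom of the ranking because its likelihood is so tiny. The unpatched dependency comes out on top.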

Understand your threat model and why security matters.

Be rigorous about risk outcomes.

Have a strategic program focus.

Consider strategic vs. operational investments.

Sometimes you have to make "strategic bets" where you choose from among a set of possibilities.

Consider leverage points and efficiency.

Minimize the impact to critical data assets.

Achieving overall security assurance requires a balance of proactive and reactive security controls.

Stakeholders and leadership have to achieve alignment. It's helpful to understand senior leadership's risk appetite.

Netflix open-sourced some library for quantifying risk.

You need to create shared guiding principles.

You need ongoing visibility and reasonable expectations.

You need to show up for the customers with reasonable expectations that are in line with your risk tolerance.

The CISO Panel Discussion

Tom Alcock, Partner and Founder, Code Red Partners (moderator)
Caleb Sima, Chief Security Officer, Robinhood
Fermin Serna, Chief Security Officer, Databricks
Jessica Ferguson, Chief Security Officer, Docusign

They started by talking about the key factors for building a security team from scratch.

Focus on assessments, strategy, organization, finding leaders who can help, what you can build with your ICs, and execution. What are your needs, and what are your tools?

"I wasn't a CSO. I was essentially a recruiter."

People and talent are #1.

Consider your AppSec to developer ratio.

The best hires are sometimes a surprise. It might just be someone who happened to be free that came in through someone's network.

You can also grab people internally and grow them. This was a big focus for Ferguson. Ferguson also said "You can teach security. You can't teach innate curiosity."

60-70% of sourcing is internal sourcing.

Sima said you need to convince the candidate that you're better than the other companies. Call them so that you can explain why you're better. Sima said that Robinhood sells to the candidate before interviewing them. They reverse the order. That hooks them. They do this even for more junior roles.

[Sima was easy to like.]

Every company is a disaster behind the scenes.

Transparency is key.

"Sell the disaster." People want a challenge, and they want to have impact.

It's important to overcommunicate, especially now that people are remote. People feel disconnected.

There are over 100 people on Ferguson's team at Docusign.

Remote work is good for deep dive work, but decision making quickly suffers.

A good manager can hold a team together.

Be intentional about building diverse teams. It helps to start with diverse panels.

Ferguson loves growing non-security people and has been really successful with it.

Initially, start with people who are good generally and have a certain curiosity.

You need someone to tie all the things and people together.

As a manager, don't be a hero. Talk to people. Get advice.

Someone gave a plug for the CSO at Lyft.

The recession is an opportunity.

Don't try to make security the top priority at the company, but it should be in the top 3 or the top 5.

The recession offers an opportunity to hire people who are being laid off.

If you can't hire, focus on retaining your people.

Security is important, but running a business is more important.

If you want to go into security but feel you don't know enough, you have a skill set that can grow. Don't get hung up on what you don't know.

Someone brought up IR, forensics, managing an investigation, etc.

Serna said that soft skills are really important. Even how you write an email is really important. You can build bridges or burn bridges.

It's nice to have a little bit of passion for the field.

It's impressive to see how attackers work.

Serna said "Don't be a jerk. It doesn't cost you anything to be nice."

Rise of the Vermilion: Cross-Platform Cobalt Strike Beacon Targeting Linux and Windows

Avigayil Mechtinger (@AbbyMCH) and Ryan Robinson (@MhicRoibin), security researchers from Intezer

Cobalt Strike is "Software for Adversary Simulations and Red Team Operations". It's very popular.

It's a malware framework.

There are different components involved: a C&C server, a stager, a backdoor, a team server, a client.

It's hard to detect and easy to configure.

There are many possible payloads.

When it's detected, it's hard to attribute to a particular attacker.

It's meant for red teams, but adversaries use it too. Adversaries will often rely on a cracked version of it. It's even used by some nation states.

Geacon is a golang beacon for Linux.

Only 2% of desktop hosts use Linux, but 90% of hosts in the cloud use Linux.

There are several categories of malware on Linux: coin miners, botnets, ransomware, backdoors, etc.

Backdoors are often from nation states such as Russia, North Korea, and China, and they're targeted in nature.

They started talking about the rise of Vermilion.

They do malware analysis. The malware they were analyzing was 94% never-before-seen code and 3% code from Cobalt Strike. That's weird because this was Linux malware, but the Cobalt Strike malware hadn't been officially ported to Linux. There was network-related stuff in the code.

VirusTotal reported that none of the virus scanners were catching this malware.

The name of the binary was nowhere to be seen on Google.

They called the malware Vermilion.

It was an ELF file. There were strings in the code that would be used if the malware ran on Windows. That's pretty weird for an ELF file, which runs on Linux.

It made use of RSA for encryption.

The malware fingerprints the machine it's running on.

The code runs a C&C loop. They analyzed the commands.

There's a Windows version too. Apparently, the Windows version was known as of 2019, but here it is running on Linux.

They partnered with McAfee.

The malware was actively targeting high-profile companies.

There weren't many samples of victims.

There was a backdoor, written from scratch, which ran on Windows and Linux hosts. It was found in live attacks.

It was probably from a nation state.

When running on Linux, the malware flew under the radar.

Linux people often think they don't need antivirus software, but that's a misconception.

Mirai is one of the most popular botnets, and it's not recognized by VirusTotal.

As an industry, we should spend more time detecting Linux malware.

Vermilion Strike for Windows can be detected in memory, or you can detect the stager.

They predict that the prevalence of cross-platform malware will continue in the future.

Got popcorn? What’s on the Vuln Channel tonight?

Rob Jerdonek and Lily Chau from the trust engineering team at Roku

Apparently, trust team = security team

They wanted to build static code scanning tools that were as easy to use as watching a movie.

They mentioned CI/CD integration with Jenkins, k8s, bots that scanned things, a DB, a dashboard, an integration layer, and viewer tools.

They have a web-based UI for users to view actionable vulnerability data.

They integrated with Slack.

They called their work the "Trusty Code Scanning Framework" (TCSF).

It's written in Go, Python, and JavaScript, and it uses Docker.

They integrate with lots of existing code scanning tools such as Semgrep, OSS-Index, npm-audit, Bandit, tfsec, Trivy, Gitleaks, Retire.js, and dependency-check.

They use one container to discover which other scanners should run. The scanners run in parallel.
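
[Since TCSF isn't open source, here's my own rough sketch of the "discover, then run in parallel" idea. The mapping from marker files to scanners and the run_scanner stub are assumptions on my part. A real version would launch a container per scanner rather than return a canned result.]

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mapping from a marker file in the repo to the scanner it implies.
SCANNERS_BY_MARKER = {
    "package.json": "npm-audit",
    "requirements.txt": "Bandit",
    "main.tf": "tfsec",
    "Dockerfile": "Trivy",
}


def discover_scanners(filenames: list[str]) -> list[str]:
    """Pick the scanners whose marker file is present in the repo."""
    present = set(filenames)
    return sorted(
        scanner for marker, scanner in SCANNERS_BY_MARKER.items() if marker in present
    )


def run_scanner(name: str) -> tuple[str, str]:
    """Stand-in for launching one scanner; always reports success."""
    return name, "ok"


def scan(filenames: list[str]) -> dict[str, str]:
    """Discover the relevant scanners and run them in parallel threads."""
    scanners = discover_scanners(filenames)
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_scanner, scanners))
```

Threads are a reasonable fit here because each scanner would mostly be waiting on an external process.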

One of the scanners recommends that you use defusedxml for more secure XML parsing in Python.

They make use of ELK for the DB and dashboards.

They're working on building SBOMs.

They don't yet block merges.

Sadly, it's not yet open source.

They said they need to reduce the false positives.

Their main point is that it was really useful for them to build this tool to bring together multiple code scanning tools.

Hacker TikTok: Community, Creativity, and Controversy

This was a panel discussion about posting security-focused content on TikTok.

Kyle Tobener (moderator)
MakeItHackin
shenetworks
Kylie Robison

They showed example TikTok videos:

You can wrap something in aluminum foil in order to foil an RFID scanner.

If you use "www.nytimes.com." (i.e. add a period at the end), you can circumvent their paywall. They have since fixed that vuln. This was used as a response to the prompt, "Show me you're a hacker without telling me you're a hacker."
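
[I don't know what the actual bug was on the NY Times side. One plausible way this kind of thing happens, though, is comparing hostnames as plain strings: "www.nytimes.com." with the trailing dot is a fully qualified name for the same host, but it isn't the same string. Here's a sketch under that assumption. PAYWALLED_HOSTS and both functions are hypothetical.]

```python
PAYWALLED_HOSTS = {"www.nytimes.com"}  # hypothetical configuration


def normalize_host(host: str) -> str:
    """Lowercase the hostname and drop one trailing dot (the DNS root label)."""
    host = host.lower()
    return host[:-1] if host.endswith(".") else host


def is_paywalled_naive(host: str) -> bool:
    """Buggy check: compares the raw string, so a trailing dot slips through."""
    return host in PAYWALLED_HOSTS


def is_paywalled(host: str) -> bool:
    """Safer check: normalize before comparing."""
    return normalize_host(host) in PAYWALLED_HOSTS
```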

TikTok originally had a lot of dancing content.

Stuff on TikTok goes viral. It's easy to grow an audience.

Some guy had cancer, and [somehow, I don't remember] TikTok helped.

There's a great community on TikTok.

However, it's a grind to make content consistently.

You can record and edit in the app, but some people prefer to edit their content outside the app.

One of the panelists tried to produce two pieces of content a day. Another one of them recommended you do it whenever it makes you happy.

Keep in mind that 33% of your audience is under 19.

One of the memes was "stuff you know that feels illegal to know". There was lots of stuff from Defcon.

TikTok's recommendation algorithm is so good! It's really easy to rabbit hole.

TikTok definitely has a culture.

Several of the panelists said that TikTok has removed some of their videos showing exploits. That was really frustrating. One of them compared their work to lock picking--just because you learn the art of lock picking doesn't mean you plan on doing illegal things. TikTok removed some videos that weren't even very concrete and specific.

It's already hard to show an exploit in 20 seconds. Having TikTok occasionally remove videos adds to the frustration.

NY Times eventually fixed the "www.nytimes.com." vuln, but TikTok actually took down the video for 6 days.

Some women don't want to get into tech because they have heard so many bad things.

In one quarter, TikTok removed 90 million videos. Half of those were by automated means. 5% were false positives. Moderation at scale is tough.

If you show a terminal, they're more likely to take down your video.

It's a problem that some people try to act like gatekeepers who act elitist toward people trying to get into the security industry. We need more people.

TikTok videos get so many comments, and they're so unmoderated.

TikTok is great for building a huge audience of people that you wouldn't be able to reach on other platforms.

The panelists enjoyed being creators. Their work on TikTok wasn't necessarily connected to their day jobs.

TikTok gives you so much exposure to stuff you've never seen before.

Your communication style is important. Tell a story.

It's super easy to get started. You can get started with just your phone and the built-in editor.

Someone in the audience brought up the "elephant in the room" that we're security people, and TikTok is partly owned by China.

It's true, but it's such a great platform.

One of the panelists did some investigation and found that someone had spent hundreds of thousands of dollars on anti-TikTok campaigns, so keep that in mind.

TikTok is actually very transparent. It's not available in China, and they don't have any servers in China.

One of the panelists said she was less concerned about China and more concerned about power plants.

She said that people have ulterior motives for hating on TikTok.

One of the panelists said that 70% of his viewers were male and 30% were female.

It's weird. TikTok knows that he's male, but when he signed up, he never told them his sex. How do they know?

It's a formidable platform. It's not going away.

The content is limited to 10 minutes. They weren't the first short-form video content platform.

One of the panelists said he limits his content to one minute. With good editing, you can cover a lot of content in one minute.

When people watch your content on TikTok, they're searching for specific content. So, it wouldn't make sense to put a tutorial there.

Some of the panelists take sponsorships. One of them used it to pay down her student loans.

Computer Science: Heisenberg Uncertainty Principle

2021-11-22T16:08:00.002-08:00

My buddy, Hy Carrel, joked that the Heisenberg Uncertainty Principle as applied to queues suggests that the more sure you want to be that an item in a queue is going to get processed, the less sure you can be of how long it'll take :-P

Python: PyWeek 32: Lil Miss Vampire

2021-10-27T15:08:00.001-07:00

TL;DR A world that scrolls infinitely in any direction, an RPG-like UI, and simple, real-time fighting.

My younger kids and I built this entry for PyWeek 32 based on the theme "Neverending".

The key innovations are:

It has a neverending world. As the player walks along, it picks up tiles and places new ones invisibly. It uses an LRUDict to remember the last million tiles you've seen. This matches real life in that if you go back to a place after 20 years, it'll look different than when you first saw it.
The user interface was inspired by Super Mario RPG, but the fighting mechanics are purposely realtime. It's a lot like if you were playing Street Fighter, but all you were allowed to do was use a fast punch, a slow punch, or block. It's a little bit like roshambo.
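
For the curious, here's a minimal sketch of how an LRU dictionary like the one mentioned above can be built on top of collections.OrderedDict. It follows the recipe in the Python docs. It's not the exact class from the game, and the tiny max_size in the usage note is just for illustration.

```python
from collections import OrderedDict


class LRUDict(OrderedDict):
    """A dict that forgets its least recently used entry once it's full."""

    def __init__(self, max_size: int):
        super().__init__()
        self.max_size = max_size

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # reading a key makes it the most recently used
        return value

    def __setitem__(self, key, value):
        if key in self:
            self.move_to_end(key)
        super().__setitem__(key, value)
        if len(self) > self.max_size:
            oldest = next(iter(self))  # the first key is the least recently used
            del self[oldest]
```

With max_size=2, storing tiles at (0, 0) and (0, 1), reading (0, 0), and then storing (1, 1) evicts (0, 1), since it's the tile you haven't looked at for the longest.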

The code:

The code is pretty pleasant. I made use of lots of new features in the latest Python, and I built a pretty decent developer experience.
It's built on the excellent arcade library which has exceptionally good documentation, tutorials, and examples.
I used type annotations everywhere, and I enforced them via mypy. I made extensive use of `typing.NamedTuple` which gives it a nice, immutable, well-typed flavor.
I used black to format the code during check-in.
There are extensive unit tests for the models. And there are git hooks to keep everything sane.
Running `make iterate` will reformat the code, run mypy to enforce types, run the unit tests, and then launch the game.
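
To give a taste of the `typing.NamedTuple` style mentioned above, here's a small sketch. The Position class is made up for this post and isn't taken from the game's code.

```python
from typing import NamedTuple


class Position(NamedTuple):
    """An immutable, typed (x, y) position."""

    x: int
    y: int

    def moved(self, dx: int, dy: int) -> "Position":
        """Return a new Position instead of mutating this one."""
        return self._replace(x=self.x + dx, y=self.y + dy)


start = Position(1, 2)
after = start.moved(1, 0)
```

Since the tuple is immutable, start is still Position(1, 2) after the move, and trying to assign to start.x raises an AttributeError. mypy also complains if you pass a string where an int is expected.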

Here's the GitHub page with more details.

Security: What Percentage of Passwords are Pure ASCII?

2021-09-27T17:34:00.011-07:00

I was wondering what percentage of passwords are pure ASCII. Hence, I threw together some code:

#!/usr/bin/env python3

PASSWORD_LIST = "example.txt"

num_pure_ascii = 0
num_iso_8859_1_not_ascii = 0
num_passwords = 0
with open(PASSWORD_LIST, mode="rb") as f:
    for line in f:
        password = line.rstrip(b"\n")
        num_passwords += 1
        
        try:
            password.decode('ASCII')
            num_pure_ascii += 1
            print("Pure ASCII:", password, flush=True)
            
        except UnicodeDecodeError:
            try:
                password.decode('UTF-8').encode('ISO-8859-1')
                num_iso_8859_1_not_ascii += 1
                print("ISO-8859-1 (but not pure ASCII):", password, flush=True)

            except (UnicodeEncodeError, UnicodeDecodeError):
                print("Not encodable into ASCII or ISO-8859-1:", password, flush=True)

    percentage_pure_ascii = (100 * num_pure_ascii) / num_passwords        
    percentage_iso_8859_1_not_ascii = (100 * num_iso_8859_1_not_ascii) / num_passwords
    print("Num passwords:", num_passwords)
    print("Percentage that are pure ASCII:", percentage_pure_ascii)
    print("Percentage that are not pure ASCII but can be encoded into ISO-8859-1:", percentage_iso_8859_1_not_ascii)

Using the largest password list from Metasploit, /usr/share/wordlists/metasploit/password.lst:

Num passwords: 88397
Percentage that are pure ASCII: 99.65%
Percentage that are not pure ASCII but can be encoded into ISO-8859-1: 0.0%

Using the top 1,000,000 passwords according to https://github.com/danielmiessler/SecLists/blob/master/Passwords/Common-Credentials/10-million-password-list-top-1000000.txt:

Num passwords: 999998
Percentage that are pure ASCII: 99.9998%
Percentage that are not pure ASCII but can be encoded into ISO-8859-1: 0.0002%

CrackStation's Password Cracking Dictionary (human passwords only):

Num passwords: 63,941,069
Percentage that are pure ASCII: 99.93%
Percentage that are not pure ASCII but can be encoded into ISO-8859-1: 0.0259%

The results seem suspicious, so I wonder if I'm doing something wrong. I really expected more passwords would contain things like ñ and í, which my code would report as "not pure ASCII but can be encoded into ISO-8859-1".
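
One possible explanation, and this is only a guess, is that the script above assumes every line is UTF-8. If a list was saved as ISO-8859-1, ñ is stored as the single byte 0xF1, which isn't valid UTF-8, so the script would file it under "Not encodable" rather than counting it. Here's a sketch of a classifier that keeps those cases apart. The category names are my own invention.

```python
def classify(password: bytes) -> str:
    """Sort a raw password into one of four rough encoding categories."""
    try:
        password.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    try:
        text = password.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte string decodes as ISO-8859-1, so this can only ever be a guess.
        return "not-utf-8-maybe-iso-8859-1"

    try:
        text.encode("iso-8859-1")
        return "utf-8-within-iso-8859-1"
    except UnicodeEncodeError:
        return "utf-8-beyond-iso-8859-1"
```

Counting the "not-utf-8-maybe-iso-8859-1" bucket separately would show whether the missing ñ and í passwords are hiding there.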

Feel free to try it on your own password list and report the results below.

Type Annotations T-Shirt

2021-08-07T15:05:00.004-07:00

A Space Engine

2021-07-31T10:52:00.000-07:00

(I'm talking about stuff I don't understand, so feel free to ignore me.)

Space isn't entirely empty. There are a few hydrogen atoms hanging out here and there.

Imagine if a spacecraft was flying really fast, and it was collecting those tiny few. It could either use a massive funnel at the front of it, or it could use something electromagnetic. Once it collects them, it could use fusion to release energy. Then, on the other side of the spacecraft, it could shoot out the output as hard as possible.

Add Another Entry to the UNIX Haters' Handbook

2021-05-15T14:11:00.005-07:00

I was using the command line to quickly build out a file hierarchy. I wrote something that looked basically like:

mkdir -p "~/dir/a b/c d"

I meant for dir to be in my home directory. I should have put the ~/ outside the double quotes. Hence, it actually ended up creating a directory called ~.

I thought, "Well that was dumb. Let me delete that and start over..." So I wrote:

rm -rf ~

As you can imagine, that started recursively deleting things from my home directory. I should have put the ~ in double quotes or written ./~.
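
The same trap exists outside the shell. Python never expands ~ on its own, so a path string starting with ~ names a directory literally called ~ unless you ask for expansion. Here's a small sketch of the difference. The first_component helper is just something I made up to show what's going on.

```python
import os.path


def first_component(path: str) -> str:
    """Return the first component of a slash-separated relative path."""
    return path.split("/", 1)[0]


quoted = "~/dir/a b/c d"               # what mkdir saw: "~" is just a directory name
expanded = os.path.expanduser(quoted)  # what I meant: "~" replaced by my home directory
safe_target = os.path.join(".", "~")   # "./~" can only mean the local directory
```

The shell only expands an unquoted ~ at the start of a word, which is why quoting it in the mkdir command and not quoting it in the rm command were both the wrong way around.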

I hit control-c once I started seeing strange errors, but I was a bit late. It started deleting things all over the place. It complained that it couldn't delete a bunch of things in ~/Library, but it did end up deleting a bunch of other things there. Apps started acting strangely or crashing. It deleted my Google Drive settings, but not the files themselves. I was really worried that it'd delete the files and synchronize the deletions to the server, but it didn't. Thankfully, it didn't delete any of my VMs. That would have been painful.

I eventually just created a new user, switched to that user, moved my stuff out of the way, deleted and recreated my original user, and rebuilt things from scratch using my notes. I tend to log everything I do when setting up a machine. All of my stuff is in the cloud, so I don't really worry about backups.

BTW, if you haven't read the UNIX Haters Handbook, it's a lot of fun. My buddy, Travis, put it perfectly when he said, "I love the UNIX command line, but sometimes it's a bit like juggling chainsaws."

Information Security: SOX, SOC2, ISO 27001, PCI-DSS, OMG!

2021-05-03T11:51:00.000-07:00

Introduction

Let’s talk about certifications, standards, controls, control frameworks, etc.

Let’s start with standards.

SOX

Per Wikipedia:

The Sarbanes–Oxley Act of 2002...more commonly called Sarbanes–Oxley or SOX, is a United States federal law that set new or expanded requirements for all U.S. public company boards, management and public accounting firms. A number of provisions of the Act also apply to privately held companies, such as the willful destruction of evidence to impede a federal investigation.

The bill...was enacted as a reaction to a number of major corporate and accounting scandals, including Enron and WorldCom. The sections of the bill cover responsibilities of a public corporation's board of directors, add criminal penalties for certain misconduct, and require the Securities and Exchange Commission to create regulations to define how public corporations are to comply with the law.

In a nutshell (and bearing in mind that I am not an expert), SOX is a set of guidelines that came in response to the fraud committed by Enron, etc. Imagine if an “evil” CEO told someone in the company to “cook the books”. The goal of SOX compliance is to make it hard to actually pull that off in practice. You do that by making sure things are reviewed (limit what one person can do on their own), auditable (the logs are good), reasonable (you should take reasonable steps to be secure), and appropriate (a person in accounting shouldn’t be able to tweak the code).

What’s in scope is anything that mutates financially impacting data (data that would lead to revenue reports). For instance, you might be concerned about employees having read-only access to personal information, but from a SOX perspective, since it’s read-only, it doesn’t allow people to alter financial information, and thus it’s out of scope.

Finally, there’s an understanding that you can’t be 100% perfect against every possible attack--zero percent risk is not a thing. Given someone senior enough and someone hacky enough, if they’re willing to steal people’s passwords and delete your entire AWS infrastructure (thereby killing all the logs), there’s no protecting against that. Consider how likely a risk is. The goal is to be “reasonable and appropriate”. Look at the risk. Mitigate it to the degree you can. Be intentional about risk management.

SOX isn’t that scary. Just think through your normal business. If there’s a risk that people can do inappropriate things, put in procedures to prevent them from doing those things.

It is required by law that a company be compliant with SOX “roughly” a year after going public.

SOC2

Per Wikipedia:

System and Organization Controls (SOC), defined by the American Institute of Certified Public Accountants (AICPA), is the name of a suite of reports produced during an audit. It is intended for use by service organizations (organizations that provide information systems as a service to other organizations) to issue validated reports of internal controls over those information systems to the users of those services...[There are] two levels of reporting, type 1 and type 2. Additional AICPA guidance materials specify three types of reporting: SOC 1, SOC 2, and SOC 3.

These controls have to do with:

Security
- Firewalls
- Intrusion detection
- Multi-factor authentication
Availability
- Performance monitoring
- Disaster recovery
- Incident handling
Confidentiality
- Encryption
- Access controls
- Firewalls
Processing Integrity
- Quality assurance
- Process monitoring
Privacy
- Access control
- Multi-factor authentication
- Encryption

A SOC2 report says that as a service provider, you have a reasonable approach to information security. It says that you do what you claim you do, and that you have a documented process. Clients worldwide might ask you for this as a requirement before doing business with you, and it’s also useful during the process of going public.

ISO 27001

Per Wikipedia:

ISO/IEC 27001 is an international standard on how to manage information security. The standard was originally published...in 2005 and then revised in 2013. It details requirements for establishing, implementing, maintaining and continually improving an information security management system (ISMS) – the aim of which is to help organizations make the information assets they hold more secure. A European update of the standard was published in 2017. Organizations that meet the standard's requirements can choose to be certified by an accredited certification body following successful completion of an audit.

ISO 27001 is similar to SOC2, but it’s way more stringent and comprehensive. As with SOC2, clients worldwide might ask you for this as a requirement before doing business with you, and it’s also useful during the process of going public.

PCI DSS

Per Wikipedia:

The Payment Card Industry Data Security Standard (PCI DSS) is an information security standard for organizations that handle branded credit cards from the major card schemes...The standard was created to increase controls around cardholder data to reduce credit card fraud.

Validation of compliance is performed annually or quarterly by a method suited to the volume of transactions handled.

Payment processors require you to be compliant with PCI DSS. The more transactions you do, the stricter they are. At a certain point, you are required to have an audit performed by a Qualified Security Assessor (QSA) rather than performing a self-assessment.

Controls and Control Frameworks

Per Wikipedia:

Security controls are safeguards or countermeasures to avoid, detect, counteract, or minimize security risks to physical property, information, computer systems, or other assets. In the field of information security, such controls protect the confidentiality, integrity and availability of information.

Systems of controls can be referred to as frameworks or standards. Frameworks can enable an organization to manage security controls across different types of assets with consistency...

Security controls can also be classified according to their nature, for example:

Physical controls e.g. fences, doors, locks and fire extinguishers;
Procedural or administrative controls e.g. incident response processes, management oversight, security awareness and training;
Technical or logical controls e.g. user authentication (login) and logical access controls, antivirus software, firewalls;
Legal and regulatory or compliance controls e.g. privacy laws, policies and clauses.

Going back to ISO 27001, it’s actually a control framework. Per Wikipedia:

Most organizations have a number of information security controls. However, without an information security management system (ISMS), controls tend to be somewhat disorganized and disjointed, having been implemented often as point solutions to specific situations or simply as a matter of convention...ISO/IEC 27001 requires that management:

Systematically examine the organization's information security risks, taking account of the threats, vulnerabilities, and impacts;
Design and implement a coherent and comprehensive suite of information security controls and/or other forms of risk treatment (such as risk avoidance or risk transfer) to address those risks that are deemed unacceptable; and
Adopt an overarching management process to ensure that the information security controls continue to meet the organization's information security needs on an ongoing basis.

The current ISO 27001 standard lets you pick the controls that you deem most appropriate, but a previous version of the standard had an annex that had the following (per Wikipedia):

There are 114 controls in 14 groups and 35 control categories:

A.5: Information security policies (2 controls)
A.6: Organization of information security (7 controls)
A.7: Human resource security - 6 controls that are applied before, during, or after employment
A.8: Asset management (10 controls)
A.9: Access control (14 controls)
A.10: Cryptography (2 controls)
A.11: Physical and environmental security (15 controls)
A.12: Operations security (14 controls)
A.13: Communications security (7 controls)
A.14: System acquisition, development and maintenance (13 controls)
A.15: Supplier relationships (5 controls)
A.16: Information security incident management (7 controls)
A.17: Information security aspects of business continuity management (4 controls)
A.18: Compliance; with internal requirements, such as policies, and with external requirements, such as laws (8 controls)

Other Control Frameworks

There are many control frameworks, and they overlap. As you can imagine, a company might need to meet several standards at the same time. To do so, a company might compile a list of all of the controls from all of the frameworks and organize them into a new, all-encompassing control framework.

For instance, Adobe has such a framework:

The Common Control Framework (CCF) by Adobe is the foundational framework and backbone to our company-wide security compliance strategy. The CCF is a comprehensive set of simple control requirements, aggregated, correlated, and rationalized from industry information security and privacy standards.

Your own company might have its own control framework. This could exist, for instance, as a spreadsheet listing all of your controls, how you implement those controls, and what part of each standard the control applies to.

To explain it from a programmer’s point of view, think of each standard as a Java interface. Think of each control as a required method within that interface. Think of your company as a class that implements all of these interfaces by implementing the various required methods.
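Here is a rough sketch of that analogy in Python, where abstract base classes play the role of Java interfaces. The control names and the one-line "implementations" are made up for illustration; real standards have far more controls than this.

```python
from abc import ABC, abstractmethod


class SOC2(ABC):
    """Hypothetical stand-in for a standard; each abstract method is a control."""

    @abstractmethod
    def multi_factor_authentication(self) -> str: ...

    @abstractmethod
    def disaster_recovery(self) -> str: ...


class PCIDSS(ABC):
    """Another hypothetical standard that overlaps with the first one."""

    @abstractmethod
    def multi_factor_authentication(self) -> str: ...

    @abstractmethod
    def encrypt_cardholder_data(self) -> str: ...


class MyCompany(SOC2, PCIDSS):
    """The company implements each control once, even if several standards need it."""

    def multi_factor_authentication(self) -> str:
        return "SSO with hardware security keys"

    def disaster_recovery(self) -> str:
        return "Nightly backups, restored in a quarterly drill"

    def encrypt_cardholder_data(self) -> str:
        return "Card data encrypted at rest and in transit"


company = MyCompany()
print(company.multi_factor_authentication())
```

If MyCompany left out one of the required methods, Python would refuse to instantiate it, which is loosely what an auditor's exception looks like.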

Conclusion

There are various compliance standards that are required in different situations. SOX is for public companies in the US. PCI DSS is for companies that accept credit card transactions. SOC2 and ISO 27001 are both about running a business in a sensible, robust manner with a special focus on information security.

Controls are specific things like a firewall or a terms of service agreement. If you take a set of controls and organize it into a unified whole, you have a control framework. Your own company might have its own control framework which matches the standards that you need to follow.

Sometimes you need to be certified by an external auditor that you meet a standard. If they flag you with an exception (i.e. a case where you fail to live up to the standard), you need to address that.

Some of your clients might ask that you comply with either SOC2 or ISO 27001. If you're making use of a service or other vendor that stores or processes your critical data (such as email addresses or payment card data) or if they integrate with your infrastructure, you might check their SOC2 and/or ISO 27001 compliance.

Getting Windows 7 Running on a Lenovo Thinkpad T410 with no CDROM Drive and no OEM Software

2021-01-02T17:54:00.013-08:00

This is a continuation of Creating Windows 10 Boot Media for a Lenovo Thinkpad T410 Using Only a Mac and a Linux Machine. I figured out that Windows 10 isn't supported on the Lenovo Thinkpad T410, so I decided to focus on getting Windows 7 running on it, which is what it came with. I know it's a security risk, but I figured it'd be okay if I locked down the firewall, installed a virus scanner, and limited the apps installed on the machine. There's nothing on this laptop that we can't afford to lose.

Remember, one of my challenges was that the laptop doesn't have a CDROM drive, and I didn't have any installation media at all. I just had a Mac to work with.

Attempt 24:
I bought a license key from g2a.com.
I was hoping to download an ISO either from them or from Microsoft.
It turns out Microsoft wouldn't let me download the ISO since it was an OEM license.
I also bought a copy of McAfee AntiVirus Plus at the same time.
I never figured out how to apply that license.
Attempt 25:
Important: I found dellwindowsreinstallationguide which had a direct link to download the ISO.
Important: I found this link for downloading drivers.
Important: I found Lenovo System Update.
I found this guide on creating a bootable USB with Windows 7.
I tried formatting the USB and copying the files over.
It didn't boot.
Attempt 26:
I tried Microsoft's Windows USB/DVD Download Tool.
I was following the instructions for doing it manually with tools built in.
I tried to do it from my Windows 10 VM.
It only works if you have a BIOS, not UEFI.
I switched VMware to use a BIOS instead of UEFI.
My Windows 10 VM wouldn't boot.
Attempt 27:
Important: I tried this guide from Make Use Of.
This guide was key.
It required an OS with NTFS.
My Mac didn't have that.
However, I had a Windows 10 VM that did.
Part of the guide requires running: d:/boot/bootsect.exe /nt60 e:
That comes from the Windows 7 ISO.
It wouldn't work on my Windows 10 VM.
In order to create Windows 7 USB boot media, I needed Windows 7 :(
Attempt 28:
I created a Windows 7 VM in VMware Fusion using the Windows 7 ISO I downloaded.
I tried to run USBRecoveryCreator from Lenovo.
I had to create a Lenovo account.
It ended up just crashing.
Attempt 29:
I tried to make use of this guide from Lenovo.
I can't remember what happened, but it didn't work.
Attempt 30:
Important: I went back to the Make Use Of guide.
I used VMware to connect the ISO to the VM's virtual CDROM drive.
I couldn't get the VM to recognize my USB thumb drive.
It turns out my USB thumb drive was USB 3, and Windows 7 wouldn't recognize it (I think).
I had to scour the house for an old USB thumb drive, which I eventually found.
I ran: d:/boot/bootsect.exe /nt60 e:
It said: Could not map drive partitions to the associated volume device objects: Access is denied.
Attempt 31:
Important: I did the same thing as administrator.
That worked.
I was able to boot the laptop into Windows 7.
My son wanted to dual boot with Ubuntu, so we left some space while partitioning.
It wouldn't let me use the laptop's own license key because it didn't match what I was booting.
I was able to reuse the key from the VM and then delete the VM.

At this point, I had Windows 7 running. However, I was missing key drivers, including the drivers necessary to get an internet connection. Bear in mind, I don't spend a lot of time on Windows. This was my first foray into what I knew was called "driver hell". I was really hoping to get WiFi working so I wouldn't have to keep downloading things on my Mac and transferring them to the Lenovo laptop using a USB thumb drive.

Attempt 32:
I tried to use the laptop to get the original OEM software from Lenovo.
I never got anywhere with that.
Attempt 33:
I tried to use the Lenovo System Update.
I used a USB thumb drive to download it on my Mac and transfer it to the Lenovo.
It wasn't very useful without an internet connection.
My memory is a little hazy, but I don't think I even had ethernet at this point, let alone WiFi.
Attempt 34:
I tried to download useful-looking drivers one-by-one using my Mac from Lenovo.
Installed this WAN driver.
HUAWEI EM660 Wireless WAN
c:\DRIVERS\WIN\WWAN-HUAW
It installed some HUAWEI DataCard driver.
c:\Program Files (x86)\HUAWEI Modem Driver
I searched for the device, but couldn't find it.
Attempt 35:
I tried installing Leadcore 5730D Wireless WAN driver for Windows 7 (32-bit and 64-bit), Vista (32-bit and 64-bit), XP - ThinkPad Edge 11, Edge 13, Edge 14, Edge 15, Edge E10, Edge E30, Edge E31, Edge E40, Edge E50, T410, T410s, X100e, X201, X201 Tablet.
c:\drivers\wwan-leadcore
In device manager, I kept scanning for hardware changes, but it's not helping.
Attempt 36:
I tried installing Intel Wireless LAN (11abgn, 11bgn, 11ac) for Windows 8 (32-bit, 64-bit) - ThinkPad.
c:\drivers\win\wlanint
Scanned for hardware changes in device manager.
Ran the Intel PROSet/Wireless Control Panel Applet.
The Proset thing seems useless for me.
It just offers to import a profile.
Attempt 37:
I tried installing ThinkPad 11b/g/n Wireless LAN Mini-PCI Express Adapter II for Windows 7 (32-bit, 64-bit) - ThinkPad.
c:\swtools\wlan\6iws25ww
Attempt 38:
I tried installing Ethernet driver (Intel PRO/1000 LAN adapter software) for Windows 7 (32-bit, 64-bit) - ThinkPad T410, T410i, T410s, T410si, T510, T510i, W510, W701, W701ds, X201, X201i, X201s, X201 Tablet.
c:\drivers\win\ethernet
Attempt 39:
I tried Rescue and Recovery® 4.52 for Windows 7.
It didn't actually look like it was going to help.
Attempt 40:
I looked at the hardware ID of the network controller.
That led to driveridentifier.com.
That led to this page.
That device is some WiMAX thing.
Attempt 41:
Important: I decided to just plug in an ethernet cable and hope.
The only cable I had lying around was a crossover cable.
However, it actually connected!
I'm not sure if I always had ethernet working or not.
Obviously, I should have tried that first.
I ran the Lenovo System Update utility.
It installed:
Intel Management Engine Interface 6.2 and Serial Over Lan (SOL) Driver
Ricoh Multi Card Reader Driver for Windows 7 and Vista
Thinkpad Video Features (NVIDIA NVS Optimus) -7
Thinkpad Integrated Camera Device Driver for Windows 7/XP/Vista
Conexant Audio Software for Windows 7, Vista, and XP
I ran it again, and it offered to install:
ThinkPad BIOS Update US.
Attempt 42:
I still don't have a driver for my WiFi.
I did a scan in Device Manager.
It offered to install new drivers.
Perhaps it's different because it finally has an internet connection.
Or maybe it's different because some other drivers have been installed.
I think it's trying to find a driver for the network controller.
It never found a driver.
Attempt 43:
Let's try this recovery key thing again.
It made me log in.
It said there are no active orders, and gave me the chance to create one.
This led to another USBRecoveryCreator.
This again had me log in.
This again led me to the website.
It's trying to figure out my serial number.
It downloaded Lenovo Service Bridge.
Running it doesn't seem to do anything.
I still have a network controller with a missing driver.
Attempt 44:
I searched for it by hardware ID.
I ended up on some site downloading something called DriverSupport One.
c:\Program Files(x86)\Driver Support One (or something like that)
You have to create an account to get started with their service.
I created an account.
I ended up on the account portal.
They want $9.99/month.
Groan.
I actually signed up.
I made sure to use PayPal so they wouldn't have my credit card number.
I think it's installing a Lenovo Intel Wireless LAN Driver (Network Controller).
There were a few other drivers, but none looked as important.
I installed them.
After installing one driver, it said that a newer version of that driver was already installed.
I was installing some Intel chipset driver it said I needed.
I ended up with a blue screen of death.
My network controller is still not supported.
I did some cleanup and removed:
Intel Network Connection Driver
Intel PROSet/Wireless Software
DriverSupport One crashed again while installing a monitor driver.
Now DriverSupport One won't even start!
I cancelled my plan.
I also cancelled it on the PayPal side.
I uninstalled the software.
Attempt 45:
I used Device Manager again and just searched for software updates.
Important: It installed Intel Centrino Advanced-N 6250 AGN.
Finally, all the devices are supported!
WiFi is working!
I did some cleanup and removed:
Lenovo Service Bridge
Huawei DataCard Driver
I rebooted.
I checked that all my devices were still supported.
I removed Intel Management Engine Components:
It showed some piece of hardware missing.
I told it to update software drivers.
It reinstalled the Intel Management Engine Interface.
I looked at all the programs installed, and the list looked reasonable.
I went to scan something (???), and now it won't shut off properly.

Conclusion

I hope to never do something like this again. I spent about 3 days all told trying to get a laptop that was end-of-lifed many years ago to run an OS that also was end-of-lifed many years ago.

I bought this laptop off a buddy for $200. It was very old, but it had 8 GB of RAM and an SSD drive. It runs Ubuntu reasonably well. However, I thought installing Windows 7 would let my son run a few things he wanted to run that only ran on Windows. I figured the original video card driver for Windows 7 would run better than the one for Linux. He said that Minecraft is actually running more smoothly on Ubuntu. C'est la vie.

At the end of the day, it just wasn't worth my time. I should have paid more for a newer machine. Obviously, dealing with Macs is easier because you're less likely to end up in driver hell, but installing Windows without having access to the OEM ISO is a particularly frustrating experience.

It's done, but I feel a bit guilty that I could have spent those 3 days actually engaged with my 8 kids. A few days later, I was about to play around with installing Arch Linux in a VM. I decided, "Nope. I have better things to do with my time."

Fun with VMware on a 64 GB Mac

2020-11-14T16:13:00.001-08:00

I'm having a lot of fun with VMware on this 64 GB Mac:

My main OS, obviously, is macOS running work-related stuff.
Then, I have Ubuntu Linux for development.
I have Kali Linux for doing security work.
I have Windows 10 for practicing exploit development.
And, finally, I have macOS running in a VM for my personal stuff.
I could probably get Android and iOS running for completeness' sake (using different emulators), but I don't actually need those right now ;)

I've been running multiple VMs for a month or two. Things are working in general, and I'm happy with this setup. My only complaints are:

It took a while to set everything up.
My battery life sucks :-P
The macOS VM lacks GPU acceleration; hence I had to disable GPU acceleration in Chrome.
Similarly, for personal use, I'd prefer to use Netflix, YouTube, and Zoom in my personal macOS VM. However, the video is too laggy. Hence, I have to do those things on the main OS, using an Incognito window when possible.

Application Security: Hashing, Encryption, Encoding, Compression, Oh My!

2020-09-21T15:58:00.006-07:00

In this blog post, I’m going to be talking about hashing, encryption, encoding, compression, etc. All of these things are related, but they serve different purposes. Sometimes, developers confuse these things, which can lead to tragic results.

My goal is to provide a high-level overview without getting into the weeds. If you’re interested in the details, Wikipedia is a great place to start. In fact, any part of this blog post that sounds even remotely intelligent was probably taken straight from Wikipedia.

Encoding

Let’s start with code:

In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication channel or storage in a storage medium. An early example is the invention of language, which enabled a person, through speech, to communicate what they saw, heard, thought, or felt to others. But speech limits the range of communication to the distance a voice can carry and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time…

The process of encoding converts information from a source into symbols for communication or storage. Decoding is the reverse process. [Wikipedia]

My definition is a little “softer”: It’s a way of transporting information in a way that “fits” within a given context.

Example: Let’s say we want to transmit the string “hi mom” within a GET parameter in a URL. To do this, we must use URL encoding, such as:

http://example.com?greeting=hi+mom

Because “hi mom” has a space in it, and spaces aren’t allowed in URLs, we have to encode the space using a +.
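As a small illustration, Python’s standard library can do this encoding and decoding. This is just a sketch using urllib.parse; the parameter name and URL are the ones from the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# urlencode applies form-style URL encoding, which turns a space into "+".
query = urlencode({"greeting": "hi mom"})
url = "http://example.com?" + query
print(url)  # http://example.com?greeting=hi+mom

# Decoding reverses the process and recovers the original string.
decoded = parse_qs(urlsplit(url).query)
print(decoded["greeting"][0])  # hi mom
```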

Example: Unicode has the concept of “Unicode code points.” Each letter is represented by one or more numbers. To store those Unicode code points in memory or transport them over the wire, they have to be encoded. UTF-8 is one such encoding. It’s a “nice” encoding because, for “simple” English characters, it’s backward compatible with ASCII, which is an older encoding.
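To make the difference between code points and their encoding concrete, here’s a short Python sketch. The sample string is arbitrary.

```python
text = "h\u00e9llo"  # "héllo"; \u00e9 is the code point for e-acute

# Code points are just numbers assigned to characters.
code_points = [ord(ch) for ch in text]
print(code_points)  # [104, 233, 108, 108, 111]

# UTF-8 turns those numbers into bytes; non-ASCII characters take more than one byte.
encoded = text.encode("utf-8")
print(encoded)  # b'h\xc3\xa9llo'

# For plain English text, the UTF-8 bytes are identical to the ASCII bytes.
ascii_compatible = "hi mom".encode("utf-8") == "hi mom".encode("ascii")
print(ascii_compatible)  # True
```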

Example: Base64 is a way of encoding binary into lower and uppercase letters, numbers, +, /, and =.

Example: base64url is similar, but - is used instead of +, and _ is used instead of / because those characters don’t require any further encoding when used in a URL.
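Here’s a quick sketch of both variants using Python’s base64 module. The input bytes were picked only because they happen to produce + and / in the output.

```python
import base64

data = bytes([0xFB, 0xFF, 0xFE])  # arbitrary binary data

standard = base64.b64encode(data)
url_safe = base64.urlsafe_b64encode(data)
print(standard)  # b'+//+'
print(url_safe)  # b'-__-'

# "=" is padding, used when the input length isn't a multiple of 3 bytes.
padded = base64.b64encode(b"hi")
print(padded)  # b'aGk='

# Like any encoding, it's reversible by anyone; it provides no secrecy.
round_trip = base64.b64decode(standard)
```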

Example: QR codes are a way of transmitting a small amount of information, such as a URL, in a way that can be seen by humans and scanned by machines.

Joke: Raising your middle finger is a fantastically succinct way of encoding certain feelings of disgust toward another.

Compression

In signal processing, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder and one that performs the reversal of the process (decompression) as a decoder. [Wikipedia]

Joke: Did you hear about Knuth’s latest compression algorithm? He can fit any 32-bit integer into only 17 bits.

Lossless compression algorithms are useful when you really need to get the exact same bytes you started with after decompressing something you’ve compressed. This is really important, for example, when creating a zip file containing source code.

Lossy algorithms are useful when you only need things to be “close enough”. For instance, JPEGs support lossy compression which allows you to compress them to far smaller files than you could achieve using only a lossless compression algorithm, such as the algorithms used by GIFs.

Example:

The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. [Wikipedia]

Example:

gzip is a file format and a software application used for file compression and decompression. gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. [Wikipedia]
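To see the lossless property in action, here’s a small sketch using Python’s gzip module. The repetitive input is contrived so that it compresses well.

```python
import gzip

# Highly repetitive data, so there is plenty of redundancy to remove.
original = b"def add(a, b):\n    return a + b\n" * 100

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

print(len(original), len(compressed))
print(restored == original)  # True: lossless means byte-for-byte identical
```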

Example:

bzip2 is a free and open-source file compression program that uses the Burrows-Wheeler transform to convert frequently-recurring character sequences into strings of identical letters. It then applies move-to-front transform and Huffman coding. [Wikipedia]

Lossy compression algorithms trade accuracy for even greater compression. This is useful for things like images, video, or audio when losing a little bit of accuracy is okay.

Joke: Beware of the lossy credit card compression algorithm.

Example: JPEG is a lossy image compression algorithm. In contrast, GIF is lossless since it relies on LZW (although it uses a limited color palette which can make it seem lossy).

Example: MPEG is actually a joint effort between multiple standards bodies in order to create standards for audio and video compression and transmission. All of the MPEG formats use discrete cosine transform (DCT) based lossy video compression algorithms.

Lest you be tempted to believe that MPEG-4 is a straightforward algorithm, keep in mind that:

MPEG-4 supports Intellectual Property Management and Protection (IPMP), which provides the facility to use proprietary technologies to manage and protect content like digital rights management. It also supports MPEG-J, a fully programmatic solution for the creation of custom interactive multimedia applications (Java application environment with a Java API) and many other features. [Wikipedia]

Example:

Advanced Video Coding (AVC), also referred to as H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), is a video compression standard based on block-oriented, motion-compensated integer-DCT coding. It is by far the most commonly used format for the recording, compression, and distribution of video content, used by 91% of video industry developers as of September 2019. It supports resolutions up to and including 8K UHD.

The intent of the H.264/AVC project was to create a standard capable of providing good video quality at substantially lower bit rates than previous standards (i.e., half or less the bit rate of MPEG-2, H.263, or MPEG-4 Part 2), without increasing the complexity of design so much that it would be impractical or excessively expensive to implement. [Wikipedia]

If compression is the sort of thing that gets you excited, check out the Compressor Head series by my buddy, Colt McAnlis. To learn more about H.264/AVC, see Everything You Should Know about H.264/AVC (Advanced Video Coding).

Hash functions, checksums, and cryptographic hash functions

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are used to index a fixed-size table called a hash table. The use of a hash function to index a hash table is called hashing or scatter storage addressing. [Wikipedia]

A checksum is a small-sized datum derived from a block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity. [Wikipedia]

A cryptographic hash function (CHF) is a mathematical algorithm that maps data of arbitrary size (often called the "message") to a bit array of a fixed size (the "hash value", "hash", or "message digest"). It is a one-way function, that is, a function that is practically infeasible to invert. Ideally, the only way to find a message that produces a given hash is to attempt a brute-force search of possible inputs to see if they produce a match, or use a rainbow table of matched hashes. Cryptographic hash functions are a basic tool of modern cryptography. [Wikipedia]

These things are somewhat overlapping, but they’re optimized for different things, and they have different applications. For instance, checksums must be fairly short because they often have to be typed in manually, whereas a cryptographic hash should be fairly long so as to avoid hash collisions.

Example: Python (and similarly in other languages) uses a hash function on the key you provide when putting something into a dict. For example, the hash function might take a string and return an array index.
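Here’s a toy sketch of that idea. It is not how CPython’s dict is actually implemented (the real thing is considerably more clever); the class and function names are made up for illustration.

```python
def bucket_index(key, num_buckets):
    # hash() maps the key to an integer; the modulo folds it into a valid index.
    return hash(key) % num_buckets


class ToyHashTable:
    """A simplified hash table that stores (key, value) pairs in buckets."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def put(self, key, value):
        bucket = self.buckets[bucket_index(key, len(self.buckets))]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        bucket = self.buckets[bucket_index(key, len(self.buckets))]
        for existing_key, value in bucket:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ToyHashTable()
table.put("greeting", "hi mom")
print(table.get("greeting"))  # hi mom
```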

Example: Credit card numbers use Luhn’s algorithm (which is a checksum) to ensure the user hasn’t mistyped a digit. Hence, you can figure out the last digit by looking at the previous digits.
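Here’s a minimal sketch of the Luhn algorithm in Python. The function names are my own, and the sample number is a commonly published test number, not a real card.

```python
def luhn_check_digit(partial: str) -> int:
    """Compute the final digit from all of the digits that precede it."""
    total = 0
    # Walk right to left; every other digit, starting with the rightmost, is doubled.
    for i, ch in enumerate(reversed(partial)):
        digit = int(ch)
        if i % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def luhn_is_valid(number: str) -> bool:
    return luhn_check_digit(number[:-1]) == int(number[-1])


print(luhn_check_digit("411111111111111"))  # 1
print(luhn_is_valid("4111111111111111"))    # True
print(luhn_is_valid("4111111111111121"))    # False: one mistyped digit
```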

Example: Similarly, books have ISBNs on the back of them. The last digit of an ISBN is a check digit computed by a checksum algorithm (a different one from Luhn’s). Once again, you can figure out the last digit by looking at the previous digits.
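For illustration, here’s a sketch of the ISBN-13 check digit, which weights the first twelve digits alternately by 1 and 3. The function name is my own, and the sample ISBN is one that’s often used in worked examples.

```python
def isbn13_check_digit(first_twelve: str) -> int:
    """Compute the 13th digit of an ISBN-13 from its first 12 digits."""
    total = 0
    for i, ch in enumerate(first_twelve):
        weight = 1 if i % 2 == 0 else 3  # weights alternate 1, 3, 1, 3, ...
        total += int(ch) * weight
    return (10 - total % 10) % 10


isbn = "978-0-306-40615-7"
digits = isbn.replace("-", "")
print(isbn13_check_digit(digits[:12]))  # 7
```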

Example: MD5, SHA-1, and SHA-256 are all cryptographic hash functions. When you download a CD image (i.e. an ISO), you’ll typically see a file containing an MD5 or SHA1 of that ISO. You can run md5 or shasum on the command line on the ISO image to make sure nothing went wrong while downloading the ISO. This use is somewhat similar to the way a checksum might be used.
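If you’d rather do the same check from Python, here’s a sketch using hashlib. The function name and chunk size are arbitrary choices; you’d compare the result against the digest published next to the ISO.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that a multi-gigabyte ISO doesn't have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Even a one-bit difference in the file produces a completely different digest, which is what makes the comparison useful.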

Example: Similarly, Git internally relies on SHA-1 for the identification and integrity checking of all file objects and commits.

Example: Google relies on SHA-1 (or at least it used to) to transform URLs into something more manageable and uniform. In the extremely rare case of a hash collision (i.e. if two URLs hash to the same value), the search engine will simply discard one of the URLs.

Example: It’s common to use a cryptographic hash function as a core part of digital signature algorithms, which we’ll cover below.

Encryption

In cryptography, encryption is the process of encoding information. This process converts the original representation of the information, known as plaintext, into an alternative form known as ciphertext. Only authorized parties can decipher a ciphertext back to plaintext and access the original information.

[Note, this is a Wikipedia definition. The fact that they use the term “encoding” as a way of describing the process of encrypting something into ciphertext is a bit unfortunate since it goes against the grain of this blog post as a whole which is trying to differentiate these things. My apologies.]

Encryption does not itself prevent interference but denies the intelligible content to a would-be interceptor. For technical reasons, an encryption scheme usually uses a pseudo-random encryption key generated by an algorithm. It is possible to decrypt the message without possessing the key, but, for a well-designed encryption scheme, considerable computational resources and skills are required. An authorized recipient can easily decrypt the message with the key provided by the originator to recipients but not to unauthorized users...

In symmetric-key schemes, the encryption and decryption keys are the same. Communicating parties must have the same key in order to achieve secure communication…

In public-key [AKA asymmetric] encryption schemes, the encryption key is published for anyone to use and encrypt messages. However, only the receiving party has access to the decryption key that enables messages to be read. [Wikipedia]

Note: It’s common to mix and match these two things. For instance, some algorithms use a public-key encryption scheme in order to transmit a key which is used for a symmetric-key scheme.

Note: In the following section, I’ll be combining content from Wikipedia with advice from Josh Bonnett.

First, let me mention some symmetric-key algorithms:

Example: The Germans are well-known for having used a mechanical encryption device during World War II. The Enigma Machine utilized a new symmetric key each day for encoding and decoding messages.

Example: AES is a symmetric-key algorithm. It’s older and slower, but support for it is built into modern CPUs. Where possible, you should use 256-bit AES as the default.

Example: The ChaCha variant of Salsa20 is the new hotness. It’s another symmetric-key algorithm. One reason it’s famous is that the source code for it can fit onto a post-it note.

Both ciphers are built on a pseudorandom function based on add-rotate-XOR (ARX) operations—32-bit addition, bitwise addition (XOR), and rotation operations. The core function maps a 256-bit key, a 64-bit nonce, and a 64-bit counter to a 512-bit block of the key stream.
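To give a feel for what “add-rotate-XOR” means, here’s a sketch of the ChaCha quarter round, the small building block that the cipher applies over and over to its state. This is only that one building block, not the full cipher, so please don’t use it to encrypt anything.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a: int, b: int, c: int, d: int):
    """One ChaCha quarter round: nothing but 32-bit add, XOR, and rotate."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Sample inputs; the expected outputs are checked against the quarter-round test vector in RFC 7539.
result = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([hex(word) for word in result])
```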

Example: DES, Triple DES, Blowfish, and Twofish are older, weaker symmetric-key algorithms. You should avoid using these.

Now, let’s cover some public-key algorithms:

Example:

Diffie–Hellman key exchange is a method of securely exchanging cryptographic keys over a public channel and was one of the first public-key protocols as conceived by Ralph Merkle and named after Whitfield Diffie and Martin Hellman. DH is one of the earliest practical examples of public key exchange implemented within the field of cryptography. Published in 1976 by Diffie and Hellman, this is the earliest publicly known work that proposed the idea of a private key and a corresponding public key. The method was followed shortly afterward by RSA, an implementation of public-key cryptography using asymmetric algorithms. [Wikipedia]
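Here’s a toy sketch of the exchange with tiny numbers. Real deployments use enormous primes (or elliptic curves) and randomly generated secrets; the values below are only small enough to follow by hand.

```python
# Public parameters that everyone, including an eavesdropper, can see.
p, g = 23, 5  # toy prime modulus and generator

# Each party picks a private number and never transmits it.
alice_secret, bob_secret = 4, 3

# Each party sends g^secret mod p over the public channel.
alice_public = pow(g, alice_secret, p)  # 4
bob_public = pow(g, bob_secret, p)      # 10

# Each side combines the other's public value with its own secret.
alice_shared = pow(bob_public, alice_secret, p)
bob_shared = pow(alice_public, bob_secret, p)

print(alice_shared, bob_shared)  # 18 18
```

Both sides end up with the same number without ever having sent it, and that number can then be used to derive a key for a symmetric-key algorithm.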

Example:

RSA (Rivest–Shamir–Adleman) is a notable public-key cryptosystem. Created in 1977, it is still used today for applications involving digital signatures. Using number theory, the RSA algorithm selects two prime numbers, which help generate both the encryption and decryption keys. [Wikipedia]
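As a rough illustration of how two primes turn into an encryption key and a decryption key, here is textbook RSA in Python with toy-sized numbers. The variable names are mine, and this sketch leaves out padding and uses primes that are far too small, so it should never be used for anything real. It assumes Python 3.8 or later for the modular inverse.

```python
# Two small primes (real keys use primes hundreds of digits long).
p, q = 61, 53
n = p * q                # the modulus, part of both keys
phi = (p - 1) * (q - 1)  # used only while generating the keys

e = 17                   # public exponent: (n, e) is the encryption key
d = pow(e, -1, phi)      # private exponent: (n, d) is the decryption key

message = 65                       # a message encoded as a number below n
ciphertext = pow(message, e, n)    # anyone with the public key can do this
recovered = pow(ciphertext, d, n)  # only the private key holder can undo it

print(ciphertext, recovered)
```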

Example:

A publicly available public-key encryption application called Pretty Good Privacy (PGP) was written in 1991 by Phil Zimmermann, and distributed free of charge with source code. [Wikipedia]

Example:

Elliptic-curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. ECC allows smaller keys compared to non-EC cryptography...to provide equivalent security.

Elliptic curves are applicable for key agreement, digital signatures, pseudo-random generators, and other tasks. Indirectly, they can be used for encryption by combining the key agreement with a symmetric encryption scheme. [Wikipedia]

Per Josh Bonnett, it matters which agreed-upon curve you use:

Curve25519: This curve was chosen by D. J. Bernstein, with each parameter chosen publicly and for verifiable reasons.

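NIST curves (such as P-256, commonly used with the Elliptic Curve Digital Signature Algorithm, ECDSA): These curves were developed by NIST with help from the NSA, and the reasons behind their parameters were never fully explained. Separately, RSA Security was reportedly paid to make Dual_EC_DRBG, an NSA-influenced elliptic-curve random number generator, the default in its popular crypto library. Because of this history, the NIST curves are suspected by some of not being completely secure against the NSA.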

Digital signatures

A digital signature is a mathematical scheme for verifying the authenticity of digital messages or documents. A valid digital signature, where the prerequisites are satisfied, gives a recipient very strong reason to believe that the message was created by a known sender (authentication) and that the message was not altered in transit (integrity).

Digital signatures are a standard element of most cryptographic protocol suites, and are commonly used for software distribution, financial transactions, contract management software, and in other cases where it is important to detect forgery or tampering. [Wikipedia]

Hence, a digital signature ensures that the data has not been tampered with and has been sent by the entity you expect. This allows you to transmit data in clear text as the signature will detect tampering.

Example: It’s pretty common to use a cryptographic hash function as the basis for a digital signature. For example:

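import hashlib

message = b"some data you want to transmit"
key = b"some secret key shared with the receiver"
hash = hashlib.sha256(message + key).hexdigest()  # keyed hash of message + key
message_to_transmit = message + b":" + hash.encode()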

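In this case, the hash is acting as a digital signature. When the receiver receives message_to_transmit, as long as they have the same key, they’ll be able to recompute the hash and verify that the message hasn’t been tampered with. Strictly speaking, because both sides share one secret key, this is a message authentication code rather than a public-key signature. Note, this doesn’t provide encryption.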

Example:

In cryptography, an HMAC (sometimes expanded as either keyed-hash message authentication code or hash-based message authentication code) is a specific type of message authentication code (MAC) involving a cryptographic hash function and a secret cryptographic key. As with any MAC, it may be used to simultaneously verify both the data integrity and the authenticity of a message. [Wikipedia]

Basically, HMAC is similar to the previous algorithm, but a lot more secure.
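For comparison, here is a minimal sketch using the hmac module from Python's standard library. The function names (sign, verify), the key, and the message are placeholders of my own. The point is only to show that you hand the key and the message to HMAC rather than concatenating them yourself.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Return the HMAC-SHA256 tag for `message` as a hex string."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign(message, key), tag)


key = b"some secret key shared with the receiver"
message = b"some data you want to transmit"
tag = sign(message, key)

print(verify(message, key, tag))              # the untouched message checks out
print(verify(b"tampered message", key, tag))  # a modified message does not
```

Using hmac.compare_digest instead of == avoids leaking information through how long the comparison takes.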

Building larger systems that piece together multiple algorithms

We’ve covered a bunch of different “classes” of algorithms. Sometimes, it’s useful to piece together multiple algorithms in order to build a larger system.

Example: The Debian project relies on developers meeting each other in person, verifying each other’s identities, and then signing each other’s keys to build a web of trust. They might use SHA-256 as the basis of a digital signature, but the thing they are signing might be a 4096-bit RSA key which is used for encryption.

Example:

JSON Web Token (JWT...) is an Internet standard for creating data with optional signature and/or optional encryption whose payload holds JSON that asserts some number of claims. The tokens are signed either using a private secret or a public/private key. For example, a server could generate a token that has the claim "logged in as admin" and provide that to a client. The client could then use that token to prove that it is logged in as admin. The tokens can be signed by one party's private key (usually the server's) so that party can subsequently verify the token is legitimate. If the other party, by some suitable and trustworthy means, is in possession of the corresponding public key, they too are able to verify the token's legitimacy. The tokens are designed to be compact, URL-safe, and usable especially in a web-browser single-sign-on (SSO) context. JWT claims can typically be used to pass the identity of authenticated users between an identity provider and a service provider, or any other type of claims as required by business processes. [Wikipedia]
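To show how the pieces fit together, here is a rough Python sketch of a JWT signed with a shared secret (the HS256 variant), using only the standard library. The function names, the secret, and the claim are my own placeholders. A real application should use a maintained JWT library, which also handles expiry, algorithm checks, and the other details skipped here.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """URL-safe base64 without the trailing '=' padding, as JWT uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_token(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


def is_valid(token: str, secret: bytes) -> bool:
    """Check that the signature matches the header and payload."""
    try:
        signing_input, signature = token.rsplit(".", 1)
    except ValueError:
        return False  # not even in header.payload.signature form
    expected = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected).encode("ascii"), signature.encode("utf-8"))


secret = b"server-side secret"
token = make_token({"sub": "admin"}, secret)
print(token)
print(is_valid(token, secret))
```

Note that the payload is only encoded, not encrypted: anyone holding the token can read the claims. The signature just stops them from changing the claims undetected.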

Example: Consider a web browser that is using HTTP/2 to talk to a web server. This is a good example of something that pieces together almost everything above.

Encoding: The server might be serving HTML that’s been encoded using UTF-8.
Compression: HTTP/1.1 can use algorithms such as gzip and Deflate to compress the body of a request or response. HTTP/2 can even use Huffman coding and HPACK to compress the HTTP headers. And don’t forget, the body of the response might be a JPEG (lossy compression) or a GIF (lossless compression).
Checksums: TCP uses a 16-bit checksum field to do error-checking of the TCP header, the payload, and an IP pseudo-header.
Cryptographic hash functions: The server might be storing a JWT in a cookie and using an algorithm like SHA-256 to make sure the user doesn’t tamper with it.
Encryption: HTTP/2 relies on TLS for encryption. TLS itself can use a variety of encryption algorithms.
Digital signatures: TLS also relies on server certificates which have been digitally signed by a certificate authority so that the server can prove it is who it says it is.
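Of the items in that list, the checksum is the easiest to show in a few lines. Here is a sketch of the 16-bit ones' complement sum that TCP's checksum field is based on. The function name is my own, and a real TCP implementation feeds in the pseudo-header, TCP header, and payload, whereas this sketch just takes whatever bytes you hand it.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit ones' complement checksum of `data`."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte

    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word

    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in

    return ~total & 0xFFFF  # the checksum is the complement of the sum


segment = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(segment)
print(hex(checksum))

# The receiver runs the same sum over the data plus the checksum;
# an undamaged segment comes out as zero.
print(internet_checksum(segment + checksum.to_bytes(2, "big")))
```

Unlike a cryptographic hash, this only catches accidental corruption. Anyone deliberately changing the data can simply recompute the checksum.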

Closing thoughts

Even if you didn’t understand all the details, hopefully, you now have a decent grasp of the lay of the land.

One time, I had a co-worker ask me how he could decompress some videos he had compressed and/or encrypted using MD5. If you’ve made it this far, hopefully, that’ll give you a chuckle--it’s a true story.

A final tip: if you need to do something “novel” with encryption, either a) don’t or b) consult an expert (I’m not one of them).

Finally, thanks to Eric Bloch for giving me permission to publish this publicly, Eric Batalden for suggesting this blog post, Josh Bonnett for helping fill in the blanks, Wikipedia for providing all the details, and Rusty for his editing.

Ubuntu 20.04 on a 2015 15" MacBook Pro

2020-05-18T22:10:00.001-07:00

I decided to give Ubuntu 20.04 a try on my 2015 15" MacBook Pro. I didn't actually install it; I just live booted from a USB thumb drive which was enough to try out everything I wanted. In summary, it's not perfect, and issues with my camera would prevent me from switching, but given the right hardware, I think it's a really viable option.

The first thing I wanted to try was what would happen if I plugged in a non-HiDPI screen given that my laptop has a HiDPI screen. Without fractional scaling, whatever scale factor I picked for one screen would apply to the other. However, once I turned on fractional scaling, I was able to pick different scale factors for the internal and external displays. That looked ok. I tried plugging in and unplugging multiple times, and it didn't crash. I doubt it'd work with my Thunderbolt display at work, but it worked fine for my HDMI displays at home. I even plugged it into my TV, and it stuck to the 100% scaling I picked for the other monitor, so it looked ok.

The next thing I did was install Zoom, since I'm using that a lot during this COVID-19 quarantine. Trying to download and install the .deb from the website didn't work because of unmet dependencies. However, installing the snap package using Ubuntu Software worked just fine.

At this point, I should note that WiFi and sound worked out of the box. That was a relief!

The microphone worked in Zoom for the first meeting, but not for subsequent meetings. Restarting Zoom fixed the problem.

My screen brightness keeps changing based on changes to the ambient lighting. This happens in macOS too, but somehow there it's a lot less noticeable and disruptive.

The camera did not work by default. By following these instructions, I was able to get the camera working, even in Zoom, but the picture quality was grainy / pixelated. The picture quality was similarly bad in Cheese, suggesting a driver problem. Observe:

On this page, it says, "The driver will complain about 1871_01XX.dat (or similarly named) files missing. This error can be ignored. The .dat files contains sensor calibration settings that will improve image quality. The error looks something like this: Direct firmware load for facetimehd/1871_01XX.dat failed with error -2. Ignore it." I'm guessing that's the problem. I found a guide here in order to install those files. I reloaded the driver. I think I did everything I was supposed to do, but it didn't seem to fix the issue. The documentation for the driver says that it's experimental :-/

Along the way, I had to "enable the multiverse" in order to install unrar. I told my kids about it, but they told me that that doesn't mean I'm an Avenger. Hmm, maybe I have to switch to Arch :-/

I'm guessing I wouldn't have this problem if I were using hardware with better support. On the other hand, my son said he was having the same problem with his 2007 Thinkpad.

Just to check that everything else was sort of behaving, I tried YouTube. No problems.

Next up, because my wife and I have been watching "Once Upon a Time" on Netflix together, I decided to give Netflix a try using the builtin browser, Firefox. It asked for permission to enable DRM. I said yes. But, then Netflix didn't actually work:

Switching to Chrome fixed the problem.

I guess now is the time I cop to the fact that I actually prefer the hotkeys in macOS. The way the command key works is generally more consistent, even in the terminal, and I really like the built-in Emacs hotkeys that work pretty much everywhere--I have those built into my fingers even though I'm not an Emacs guy.

The last few times I've used Linux on the desktop, I ran into other problems such as the following:

There's no Linux version of Backup and Sync from Google. The last time I tried it, the built-in support for Google Drive was not very good. In the past, I tried Insync, which worked pretty well, minus the fact that it didn't know how to deal with my HiDPI screen.

Chrome Canary isn't available for Linux. That's not that big of a deal most of the time, but usually the coolest DevTools toys come to Canary first and might not make it to the stable version for a long time.

I'm guessing that the magic that is iTerm2 + tmux -CC doesn't work in Linux. If you don't know what I'm talking about, Google for it. It's amazing.

Dropbox works pretty well on Ubuntu; not so much for other distros.

IntelliJ just doesn't look as good on Linux as it does on a Mac. But, it does work.

If you have an iPhone, it's really nice to be able to type normal text messages on macOS. Yeah, I know, Apple is just being Apple, and things are nice, but it's still a walled garden. Whatever--it's still nice, and I would miss it in Linux.

Thankfully things like Slack, VS Code, Signal, and Typora just work these days because they're Electron apps. It sucks to think of how much this is eating up in terms of system resources, but it's nice that they at least work.

Anyway, I wanted to figure out if it was viable for me to switch to Ubuntu. I think the answer is yes, but not with this FaceTime camera. I'll stick to using macOS and running Linux in a VM using VMware Fusion. By the way, I really like VMware Fusion, and things like Kali Linux should be run in a VM anyway.

See also PCMag's Windows vs. MacOS vs. Chrome OS vs. Ubuntu Linux: Which Operating System Reigns Supreme? Don't worry about the click-baity title. It's actually pretty good. Perhaps it might not tell you anything revolutionary or new, but it's a solid, well-balanced look at the four OSs.

Creating Windows 10 Boot Media for a Lenovo Thinkpad T410 Using Only a Mac and a Linux Machine

2020-03-23T10:49:00.001-07:00

TL;DR: Giovanni and I struggled trying to get Windows 10 installed on the Lenovo Thinkpad T410. We struggled a lot trying to create the installation media because we only had a Mac and a Linux machine to work with. Every time we tried to boot the USB thumb drive, it just showed us a blinking cursor. In the end, we finally realized that Windows 10 wasn't supported on this laptop :-/

I've heard that it took Thomas Edison 100 tries to figure out the right material to use as a lightbulb filament. Well, I'm no Thomas Edison, but I thought it might be noteworthy to document our attempts at getting it to boot off a USB thumb drive:

Download the ISO.
Attempt 1:
Use Etcher.
Etcher says it doesn't work for Windows.
Attempt 2:
Use Boot Camp Assistant.
It doesn't have that feature anymore.
Attempt 3:
Use Disk Utility on a Mac.
Erase a USB thumb drive:
Format: ExFAT
Scheme: GUID Partition Map
Mount the ISO.
Copy everything from the ISO to the USB thumb drive.
The laptop wouldn't actually boot it.
Attempt 4:
See: https://www.freecodecamp.org/news/how-make-a-windows-10-usb-using-your-mac-build-a-bootable-iso-from-your-macs-terminal/
diskutil list
I found: /dev/disk2
diskutil eraseDisk MS-DOS "WIN10" GPT /dev/disk2
hdiutil mount ~/Downloads/Win10_1909_English_x64.iso
cp -rf /Volumes/CCCOMA_X64FRE_EN-US_DV9/* /Volumes/WIN10
Hmm, it didn't unmount, and it seems like the same approach :-/
It still doesn't seem bootable.
Attempt 5:
See: https://www.top-password.com/blog/create-windows-10-bootable-usb-from-iso-on-mac/
It's basically the same as attempt 4.
Attempt 6:
Let's do attempt 3 again, but use an MBR instead of a GUID Partition Map.
It may be that the older hardware doesn't understand the GUID Partition Map.
It didn't really boot it.
Attempt 7:
See: https://itsfoss.com/bootable-windows-usb-linux/
Do it from Linux.
Looks like basically the same thing.
Attempt 8:
Use Etcher despite the warning.
It ended up booting Linux instead.
Attempt 9:
Do the same thing as attempt 6, but let it boot for a few minutes.
Attempt 10:
Download the ISO again in case that's the problem.
Attempt 11:
See: https://www.makeuseof.com/tag/windows-10-usb-boot-drive/
That's only for Windows.
Attempt 12:
Use VMware Fusion on my Mac to use the Windows Media Creation Tool.
For some reason, VMware Fusion won't let me access the USB drive.
I wonder if IT disabled kernel extensions or something like that.
Attempt 13:
Use unetbootin.
See: https://www.wdiaz.org/how-to-create-a-bootable-windows-usb/
This tutorial gave me a hint that the problem might be that this older laptop simply doesn't support ExFAT.
So many of the tutorials assume that the system can read the installation files off of an ExFAT-formatted USB drive.
It behaved the same way.
It didn't boot.
It just showed a blinking cursor.
Attempt 14:
Try to download the original rescue disk.
You need Windows to create the rescue disk.
Attempt 15:
Download a Windows 7 ISO.
You need a license.
The license sticker is missing from the bottom of the laptop.
Attempt 16:
Try to copy things onto a FAT32-formatted USB.
The file is too large for the destination.
Attempt 17:
Study the comments in that earlier post: https://www.wdiaz.org/how-to-create-a-bootable-windows-usb/
I need to format the USB drive using NTFS.
To do that on a Mac, I would need to install this software called Paragon.
Attempt 18:
Create an NTFS-formatted USB thumb drive and copy the files from Linux.
See: https://tecadmin.net/format-usb-in-linux/
df -h
sudo umount /dev/sdb1
sudo mkfs.ntfs -f /dev/sdb1
Note, using -f makes it go *way* faster.
But since it's skipping the bad sector checking, it might cause problems later.
I did it the fast way, but then used diff to check to make sure the copy matched.
Transferred the ISO from my laptop to Giovanni's:
My laptop: python -m SimpleHTTPServer 8000
Figured out my IP in another tab: ifconfig
His laptop: http://192.168.1.199:8000
Nope, same result.
Attempt 19:
Google for installing Windows 10 on this laptop model.
Windows 10 isn't really supported on this laptop.
See: https://answers.microsoft.com/en-us/insider/forum/all/lenovo-thinkpad-t410-not-compatible-with-windows/5d17c838-7f0c-4e32-b47d-a68a16e8fc7d
Even if you can get it installed, a bunch of drivers are missing.
We decided it's better to stick with Ubuntu.
Attempt 20:
Using the Windows key under the battery, download Windows 7 from Microsoft.
Microsoft won't let me because it's an OEM license.
Attempt 21:
Use the Lenovo Recovery USB creation tool using a Windows VM on my Mac.
Lenovo won't recognize the serial number; this model is too old.
Attempt 22:
Download it from pcsupport.lenovo.com.
They don't have anything older than T420.
Attempt 23:
Download the Rescue and Recovery app.
It won't run on my Windows 10 VM.
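Attempt 18 mentions using diff to make sure the copy on the USB drive matched the original. For what it's worth, here is a rough Python equivalent of that check. The function name is my own, and it simply compares two directory trees byte for byte, so it makes no claim about whether the drive is actually bootable.

```python
import filecmp
import os


def trees_match(source_dir: str, copy_dir: str) -> bool:
    """Return True if both trees contain the same files with the same bytes."""

    def relative_files(root):
        # Every file path under `root`, relative to `root`, in a stable order.
        return sorted(
            os.path.relpath(os.path.join(dirpath, name), root)
            for dirpath, _, names in os.walk(root)
            for name in names
        )

    files = relative_files(source_dir)
    if files != relative_files(copy_dir):
        return False  # a file is missing or extra on one side

    # shallow=False forces a real byte-by-byte comparison of each pair.
    return all(
        filecmp.cmp(os.path.join(source_dir, name), os.path.join(copy_dir, name), shallow=False)
        for name in files
    )
```

You would call it with the mount point of the ISO and the mount point of the USB drive, whatever those happen to be on your machine.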

Updated: See Creating Windows 7 Boot Media for a Lenovo Thinkpad T410 Using Only a Mac