I read your comment, but it's not clear to me how what you are suggesting would work in practice. My friend and I are bootstrapping a startup that would primarily cater to a Pakistani audience. We are hosted on AWS and use Stripe for payment processing etc., like most startups. Even if we wanted to host something in Pakistan (which we don't), no such infrastructure exists, and we are in no position to hire any dev resources in Pakistan either.
Per the article, if you are not hosted in Pakistan, the government will fine you 3.x, which will probably kill our project, but it's not clear how the Pakistani government can find us if we have no presence there?
This is, ultimately, the only way forward: pick a legal jurisdiction, keep your employees within it, and let the internet bring worldwide customers to you. If other countries try to block your packets, that's between them and their citizens.
The more jurisdictions you straddle (with servers, employees), the lower the lowest common denominator of "legal" behavior. At some point the set of behaviors legal everywhere may not intersect at all: "the government must have access to the data" and "the data may not leave the country" are mutually incompatible. This will make life increasingly difficult for giant multinationals like FB and Google.
As long as your content is legal in the US, the US will not extradite you for violating Pakistan's (or Europe's) particular flavors of internet censorship. You may want to be careful about international travel, however.
Pakistan's average salary [1] is about $15,000 USD annually. Three times that would be less than what a software engineering intern makes in Boston [2]. Maybe just add the government to your payroll and call it a cost of doing business?
Recently started introducing Rust to my team, and here are my notes on some of the issues we have run into so far. We are primarily a Scala/JVM shop with a little bit of Python.
- `Error Handling` is a bit of a dumpster fire right now. Rust has a `Result` enum which is similar to the `Either[Error, T]` monad in Scala; however, every single Rust crate I have used so far creates its own `Error` enum, which makes it really hard to compose `Result`s. Ideally I would like to chain `Result`s as `e1.and_then(e2).and_then(e3)`, but that's not possible due to incompatible error enums. I ended up using `https://docs.rs/crate/custom_error/1.3.0` to align the types.
- A lot of basic things are still in flux and community-wide standards have not been established. For example, I needed to externalize some environment-specific settings in a config but couldn't figure out where to put non-code assets in a Cargo project and then how to reliably read them (see the sketch after this list). In the JVM world `src/main/resources` acts as a standard place for stuff like this, but that pattern has not been established yet.
- Distributing code inside the company is hard because there is no integration with Artifactory or similar tools. We are directly referencing git SHAs in Cargo right now and waiting for better solutions.
- Rust comes with a default unit test framework, but it's pretty bare-bones. I haven't seen examples of test fixtures, setup/teardown support, loading test configs, etc.
- I really like the Rust compiler because of the really good error messages it produces, but it's really slow, and you start to notice it as you add more code/external crates.
- IDE support is good but not great. I am using IntelliJ with the Rust plugin, as we use IntelliJ for Scala/JVM, and it is nowhere near as good as even the Scala plugin, which is pretty mediocre in itself.
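For the config issue above, the closest thing I found is reading relative to the crate root. A minimal sketch (the `config/dev.toml` layout is just something we made up, not an established convention):

```rust
use std::fs;
use std::path::Path;

fn main() {
    // CARGO_MANIFEST_DIR is set by cargo and, via env!, resolves at
    // compile time to the directory containing Cargo.toml.
    let path = Path::new(env!("CARGO_MANIFEST_DIR")).join("config/dev.toml");
    let settings = fs::read_to_string(&path).expect("could not read config");
    println!("{}", settings);
}
```

Note this bakes the build machine's path into the binary, so it's fine for tests, but for deployed binaries you'd probably read the path from an environment variable at runtime instead.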
Overall I am pretty happy with the language (except for the error issue), and most of my gripes are around the ecosystem and tooling. Hopefully these will be resolved as the language gains more momentum.
> Rust comes with a default unit test framework, but it's pretty bare-bones. I haven't seen examples of test fixtures, setup/teardown support, loading test configs, etc.
This is one of the things that I really love about Rust - it has a testing framework that is so simple that it subtly pushes you to write better tests and better code.
What I mean by that is: it really wants you to write simple tests with no mocking and no convoluted state using simple asserts, and it really wants you to write code which is trivially testable. In other languages this might be problematic, but in Rust this synergises really well with other language features such as `#[derive(...)]`, sum types, and pattern matching. If you write good, idiomatic Rust code, then all of those extra features of other testing frameworks are usually totally unnecessary.
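A minimal sketch of what I mean (the types are made up for illustration): with `#[derive(Debug, PartialEq)]` on a sum type, a test is one `assert_eq!` on whole values, with no fixtures or mocks:

```rust
// Deriving Debug and PartialEq gives equality checks and readable
// failure output for free.
#[derive(Debug, PartialEq)]
enum Shape {
    Circle { r: f64 },
    Square { side: f64 },
}

fn scale(s: Shape, k: f64) -> Shape {
    match s {
        Shape::Circle { r } => Shape::Circle { r: r * k },
        Shape::Square { side } => Shape::Square { side: side * k },
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn scales_a_circle() {
        assert_eq!(scale(Shape::Circle { r: 1.0 }, 2.0), Shape::Circle { r: 2.0 });
    }
}
```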
I can see your point for unit testing, but I am struggling to see how it will work for integration testing. For example, I would like to initialize a connection pool (or another heavy object) only once for the entire test run; I haven't figured out how to handle that except by creating a new one in each test. Any pointers on how you are handling those? Thanks.
I know a little about two of the big open source Rust projects, Servo and the Rust compiler.
For integration tests, both the Rust compiler and the Servo web engine have written their own test runners with an accompanying set of tools. This allows the test format to be as flexible as the target domain of the program. In fact, Servo can simply use the cross-browser wpt [1] test suite.
As for avoiding initializing heavy objects: I only know that the Rust compiler's test runner initializes the entire compiler from zero for every single test file. The result is quite a long testing time. My suggestions to use a shared driver were rejected because one test might affect another.
But in theory you could write your own custom test runner that has as much shared state as possible. You might also want to check out cucumber-rust [2], which I think allows for some way of sharing. You could also group smaller tests into a small set of bigger tests where each group shares some resource.
If it's something that doesn't take too much wall time, I just initialize it from scratch for every test. If I have multiple tests which need some complex setup, I put that in a separate function and call it in every test. If it's something reaaally expensive to initialize, I just use lazy_static:
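(A minimal sketch; `Pool` is just a stub standing in for the heavy object, and this assumes the `lazy_static` crate is in Cargo.toml.)

```rust
use lazy_static::lazy_static;

// Stub for something expensive, e.g. a real connection pool.
pub struct Pool {
    url: String,
}

impl Pool {
    fn connect(url: &str) -> Pool {
        println!("expensive setup, runs only once");
        Pool { url: url.to_string() }
    }
}

lazy_static! {
    // Initialized lazily on first access, then shared by every test
    // in the same test binary.
    static ref POOL: Pool = Pool::connect("postgres://localhost/test");
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn first_test_touching_pool() {
        assert!(POOL.url.contains("localhost"));
    }

    #[test]
    fn second_test_reuses_it() {
        assert!(POOL.url.starts_with("postgres"));
    }
}
```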
> every single Rust crate I have used so far creates its own `Error` enum, which makes it really hard to compose `Result`s.
Is there a reason why `map_err()` doesn't achieve what you need, i.e., getting everything into your error type no matter where it came from?
For example: `other_lib_result.map_err(|e| MyError::from(e))`
Unless I'm mistaken, you ultimately are doing something like this in one form or another, because if you never match on the different error variants that happened somewhere along the pipeline, then those details are being swallowed.
Implementing a `From` trait at least forces you to make that choice (all in one place, too), and then you can keep composing the results as needed since it's all in your `MyError` type.
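A minimal sketch of that pattern (the variants here are made up for illustration):

```rust
use std::io;
use std::num::ParseIntError;

// One error type for the whole pipeline; each upstream error gets
// its own variant, so no detail is swallowed.
#[derive(Debug)]
enum MyError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl From<io::Error> for MyError {
    fn from(e: io::Error) -> Self {
        MyError::Io(e)
    }
}

impl From<ParseIntError> for MyError {
    fn from(e: ParseIntError) -> Self {
        MyError::Parse(e)
    }
}

// With the From impls in place, Results from different sources compose.
fn parse_port(path: &str) -> Result<u16, MyError> {
    let raw = std::fs::read_to_string(path).map_err(MyError::from)?;
    raw.trim().parse::<u16>().map_err(MyError::from)
}
```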
Yes, that's the pattern I ended up using, with the help of the `custom_error` crate, which automates the implementation of `From` for your custom errors. I guess the disconnect for me was that in Java/Scala all custom exceptions extend `Throwable`, so the types always line up; in Rust custom errors are disjoint, so you have to wrap everything yourself to align the types.
If you make your functions return `Box<dyn Error>` (or `failure::Error`), they will all convert out of the box. The `?` operator performs the conversion itself. For chains, `.map_err(From::from)` does it.
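For example, a minimal sketch of the `Box<dyn Error>` route:

```rust
use std::error::Error;

// `?` auto-converts any error type implementing std::error::Error into
// Box<dyn Error>, so errors from different crates chain without wrapping.
fn port_from_file(path: &str) -> Result<u16, Box<dyn Error>> {
    let raw = std::fs::read_to_string(path)?; // io::Error converts here
    let port = raw.trim().parse::<u16>()?;    // ParseIntError converts here
    Ok(port)
}
```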
> - Rust comes with a default unit test framework, but it's pretty bare-bones. I haven't seen examples of test fixtures, setup/teardown support, loading test configs, etc.
There is an eRFC for better integration of custom test frameworks.
They shut down the listening part a long time ago by providing absolutely no way for people to get in touch with humans about problems faced by customers.
Unfortunately, in the Bay Area and especially at FAANG, you need to meet 0% of the job "Requirements" on your resume, as they will leetcode the shit out of you in the interview without asking one relevant question pertaining to your resume or even to the actual job. One of the reasons given for this is that the stack used at these companies is completely homegrown, so your experience using framework $X for $Y years has no relevance.
Hey man, that's pretty cool, and we do exactly the same thing using Cassandra instead of FDB. Since Cassandra doesn't support transactions at high volume (100K TPS), we do a shuffle so that all updates for the same key do read/modify/write from the same machine. It seems like with FDB you can get away without that since it supports transactions? My question to you is: what is the volume your system is operating at? Also, how does it work for skews? Let's say you need to update an HLL for a key that is heavily skewed; does your FDB transaction unwind fast enough not to slow down the whole system?
This varies, as our workload is dynamic in that anyone at any time can inject a query for the data stream, but for the sake of argument let's say 5k.
> Also, how does it work for skews?
FoundationDB does a magnificent job of automatically detecting and physically relocating skew. However, to mitigate write skew, I use time-bucketing techniques where part of the key is a MURMUR3 hash of the minute_of_hour, so that heavy write loads can only affect a server for one minute. This has helped with certain metrics.
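Roughly this shape of key construction (a sketch; std's hasher stands in for MURMUR3, and the key layout is illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::{SystemTime, UNIX_EPOCH};

// The hot spot moves to a different key (and thus, likely, a different
// storage server) every minute, bounding how long one server is hammered.
fn bucketed_key(metric: &str) -> String {
    let minute_of_hour = (SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs()
        / 60)
        % 60;
    let mut h = DefaultHasher::new();
    minute_of_hour.hash(&mut h);
    format!("{}/{:016x}", metric, h.finish())
}

fn main() {
    println!("{}", bucketed_key("distinct-ips"));
}
```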
> Let's say you need to update an HLL for a key that is heavily skewed; does your FDB transaction unwind fast enough not to slow down the whole system?
There isn't really a concept of an HLL (or key) being heavily skewed. A key lives on a single server (or multiple, depending on replication). Essentially, when I want to merge additional HLL content into one already stored, I just read it, deserialize it, merge it with the one I have, and then write the result back to FDB. Because of transactions, I can ensure that nobody else is doing the same exact thing I am doing. If there were, then my (or their) transaction would fail and retry. The retry is important because it would reattempt the same logic, except the result I got from the database would be the merged result from somebody else. This allows you to ensure that idempotent / atomic operations happen as you'd expect.
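To make the retry loop concrete, here is a toy sketch of that read/merge/write pattern. The versioned in-memory `Store` below just stands in for FDB's conflict detection (it is not the FDB API), and `merge` elides the real HLL register-wise max:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Toy stand-in for FDB: each key holds (version, bytes), and a write
// commits only if the version read is still current (optimistic concurrency).
struct Store {
    data: Mutex<HashMap<String, (u64, Vec<u8>)>>,
}

impl Store {
    fn read(&self, key: &str) -> (u64, Vec<u8>) {
        self.data
            .lock()
            .unwrap()
            .get(key)
            .cloned()
            .unwrap_or((0, Vec::new()))
    }

    // Compare-and-set: fails if another writer committed in between.
    fn cas(&self, key: &str, seen: u64, value: Vec<u8>) -> bool {
        let mut data = self.data.lock().unwrap();
        let current = data.get(key).map(|(v, _)| *v).unwrap_or(0);
        if current != seen {
            return false; // conflict; caller retries
        }
        data.insert(key.to_string(), (seen + 1, value));
        true
    }
}

// On conflict, re-read (now including the other writer's merge) and retry,
// which is what keeps the operation effectively idempotent.
fn merge_hll(store: &Store, key: &str, update: &[u8]) {
    loop {
        let (version, existing) = store.read(key);
        let merged = merge(&existing, update);
        if store.cas(key, version, merged) {
            return;
        }
    }
}

fn merge(existing: &[u8], update: &[u8]) -> Vec<u8> {
    // Placeholder: a real HLL merge takes the max of each register.
    existing.iter().chain(update).copied().collect()
}

fn main() {
    let store = Store { data: Mutex::new(HashMap::new()) };
    merge_hll(&store, "hll/user42", b"a");
    merge_hll(&store, "hll/user42", b"b");
    println!("{:?}", store.read("hll/user42"));
}
```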
Thanks for the reply; I've got a few additional questions for you :-)
Let's say you are counting distinct IPs per user using HLLs, and you start getting DDoSed on certain users. Since I am assuming you are not doing a shuffle before writing to FDB, you will be locking the user's key, reading the HLL, deserializing, merging, and writing back to FDB from multiple machines, which will result in a lot of rejected transactions and retries. My question is whether the retries unwind fast enough, or whether you will end up dropping data on the floor as you exhaust the retry count.
Turns out we are doing a shuffle :) - We're using Apache Flink for the aggregation step (5-second window), which performs a merge on key before writing the value out. So at the end of the day, we would only read/deserialize/merge/write once every 5 seconds, assuming of course that we received data for the HLL aggregation.
However, due to the need for HA, we might run two or three clusters in different AZs, which means we might have a few servers writing a partial aggregation to the same row; thus, the awesomeness of FDB plays a role.
That being said, our P99 latency writing to FDB is typically very low (a few ms). We're usually doing 4,000 - 5,000 transactions a second at any given time.
Not the person you’re responding to, but you can merge HLLs together, so if your workload was skewed, you could hash the value you’re adding to the HLL and distribute it among more keys in FoundationDB.
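A minimal sketch of that sharding (the key names are made up; this works because HLL union is commutative, so readers can merge all shards):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Spread one hot HLL across `shards` subkeys by hashing the inserted
// value; a reader merges all subkeys to recover the full estimate.
fn shard_key(base: &str, value: &str, shards: u64) -> String {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    format!("{}/{}", base, h.finish() % shards)
}

fn main() {
    // Writes for the same logical HLL land on different physical keys.
    println!("{}", shard_key("distinct-ips/user42", "10.0.0.1", 16));
    println!("{}", shard_key("distinct-ips/user42", "10.0.0.2", 16));
}
```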
Additionally, depending on the write rate and the size of the data being written to the HLL, it may be worth only writing it out periodically and keeping a log of recent values that you read at runtime.
There is a trade-off between needlessly rewriting mostly unchanged data and read performance, similar to the I/O amplification trade-off in log-structured merge tree (LSM) derived storage engines.
Why are almost all the replies by ex-Google engineers? I thought I would see an even distribution across all the FAANG companies, but the replies are pretty heavily skewed toward ex-Google employees. Is this because 1) Google's culture is completely different from the other FAANG companies, 2) engineers at other FAANG companies don't quit their jobs, or 3) they don't browse HN in their free time?
Working for Google creates a much bigger cognitive dissonance compared to the other FAANGM companies.
Google disproportionately promotes culture as the number one reason to work there, whereas the rest of the companies are just that: companies that make money.
Imagine the surprise when a person joins Google and realizes it's nothing but another large company.
I have heard that FB has similar issues, but those are just that: rumors or things I read online.
The companies are all of varying sizes and distributions of types of workers. Google has a heavy engineering component to it; Apple, and maybe to a lesser degree Amazon, does as well, but one thing about Apple is that it has a culture of generally not talking about work as much outside of work, in part due to wanting to keep things secret. Netflix as a company is also just small, at around 5,000 employees last I heard. Facebook is somewhere around 20k at HQ, I believe, although that number seems to go up every time I hear more recent figures; Google is at ~60k, I remember reading; and Amazon is around 100k, I believe? Apple is at a little over 120k employees, but Apple and Amazon both have a lot of non-engineer workers as well (retail and warehouse employees, respectively).
That doesn't sound accurate. Anecdata, but in my current org in Amazon Retail I think about 1 in 15 engineers browse HN daily. When I was in AWS it was around 1 in 3.
Why do you think people in Google/Netflix/Facebook browse HN more than Amazon/Apple?
I think the population you're observing here is people who first joined a FAANG before 2014 (since I imagine most people stayed for their initial four-year grant, given FAANG stock performance). Facebook (not sure about the others) had ~2k engineers in 2014 IIRC, while Google had something like 10-15k engineers around that time. The gap has closed in recent years, but looking at current engineering headcounts isn't the right approach IMO.
This may be off-topic, so I apologize in advance, but when people say "Asian-Americans", does it include South Asians as well, or just East Asians? As a South Asian, this has caused me a lot of confusion while talking to people in the Bay Area.
In social situations, "Asians" generally refers to East Asians. However, most official demographic checklists don't have an option for "South Asian" and expect them to self-identify as "Asian". Hence, much of the evidence you see from the above lawsuit likely applies to South Asians as well.
I love when forms mix racial terms and have the options “Asian” and “Caucasian.” Mixing races based on recent geographic origin with races based on morphology can lead to broad interpretation. South Asians are actually Caucasian if you believe in the racial theory that posits three races (Caucasoid, Mongoloid, Negroid: https://en.wikipedia.org/wiki/Caucasian_race), because its proponents were big on bone structure over skin tone.
But for the most part these racial distinctions are all pretty hokey, and you can check whatever you want. I’m not sure why it’s more valid to classify by skin tone vs. hair color vs. hand width or whatever. I can understand the unjust classification within certain cultures in isolation, because it is probably closely associated with certain classes or religions.
But classifying a black-skinned Indian South Asian and a brown-skinned Ethiopian (both also Caucasian) together makes no real sense. Even from a social justice perspective: which background has it worse off?
Does a light-skinned African-American face more or less systematic oppression than a dark-skinned South Asian? What about a dark-skinned AA vs. a light-skinned SA?
There are all sorts of interesting and confusing scenarios. I’m not sure what to do nor what is right, so I largely just keep quiet and/or wait for the loudest shouting groups to figure it out.
Things did not go well for an immigrant friend of mine, a white South African, who signed up for an African-American law program. But when the all-white engineering team won the state championship challenge for minority schools, that was fine. It was a weird quirk in my country where the program was for schools with majority-minority students, but most schools had almost entirely Asian and white teams because the schools had small non-minority populations.
When people in the US say "Asian" conversationally, they usually mean "East Asian". The admissions process uses the more formal definition, which means an American who can trace his/her ethnicity to any country in Asia (including India, etc).
Interesting project, although I can't say I am happy to see SQL being used in streaming systems like this. In my last two jobs I had to write frameworks and tools to enable "data scientists" and "analysts" to write production jobs, and the problem I have run into with exposing SQL to this class of user is that every job ends up being its own special snowflake, with deeply nested SQL and custom UDFs mixed in for good measure. Due to the "unique" nature of each, it significantly increases the support and maintainability cost. I have come to the conclusion that a typesafe API with map/filter/flatMap is a much better API to expose than stringly-typed SQL. I am curious to know whether Uber is running into similar support issues?
Our experience is that AthenaX actually lowers our support costs:
(1) There was a significant consultation load when users had to implement their own jobs in Java / Scala and run them in production. Sometimes it turned into co-development, as the users lacked expertise in the streaming analytics frameworks.
(2) We consciously encourage our users to write good SQL via:
(a) enforcing schemas on all analytical Kafka topics.
(b) setting up a team dedicated to helping them use SQL in big data systems (e.g., Hive, Presto, AthenaX, etc.)
For UDFs we provide general guidance and ask our users to be on call for the jobs that use UDFs. The support costs are definitely not zero, but it is still much better than teaching users to write a Samza / Flink / Storm job from scratch.
From my experience teaching some graduates in a BI shop: SQL is more common, and tools that support SQL tend to be used better.
I've "taught" them how to use Spark, but being a team of varying prior experience, the Scala API meant them learning Scala, the Python one was a bit better, but they did much better with the SQL DSL.
Regarding your concern re: maintainability, UDFs tend to be the problem. I'm also curious to know about their support issues, and also: can anyone write their own UDF (the code requires registering a .jar), or is there a team that helps business users in that regard?
Spring vs. Java EE is a topic that stopped being relevant five years ago. I can't believe anyone is seriously considering either of these heavy-duty tech stacks in 2017. There are much, much better choices out there even for Java developers, and if they can come out of their comfort zone a bit, there are Clojure, Scala, Kotlin, etc. with MUCH nicer frameworks.
Sure, I used to work with Java EE in the enterprise. My team switched over entirely to working with Clojure over the course of the past 6 years. We build large applications for use at the hospital. For context, the projects I work on are typically implemented over several years by a team of 5-10 developers.
We find that our projects now have drastically less code doing the same types of things. Not only that, but the code we do write is predominantly declarative in nature. Clojure makes it much easier to separate the intent of the code from the implementation details.
Conversely, having less code means we're less attached to it. When it takes 1,000 lines to solve a problem, you tend to keep them around once you get a solution working. When you have 100 lines, it's much easier to throw them away and write a cleaner solution once you understand the problem.
Clojure projects are easier to debug and maintain thanks to pervasive immutability in the language. This allows us to safely think about parts of the application in isolation. When I come back to code that I wrote a few months ago and make a change, I know that the change is local and it's not going to affect another part of the project via side effects.
Clojure facilitates interactive development by providing strong integration between the editor and the REPL. When we're building new features, we're able to experiment interactively to see what approach will work best.
Since Clojure also runs in the browser with ClojureScript, we're able to use the same language for the full stack. This also lets us share code, such as validation logic, between the client and the server.
Meanwhile, we're still able to leverage existing Java libraries, infrastructure, and reap all the benefits of using the JVM.
You are barking up the wrong tree. I am also a Clojure enthusiast, and I use it wherever I can. I was genuinely interested to hear about a Java alternative to Spring because I have failed to find any. I don't use Java in production anymore though, only Kotlin and Clojure.
Fair enough. From what I know, Dropwizard is supposed to be pretty decent; I have a few friends working in Java shops who seem to like it. However, I haven't used it myself, so I can't really comment on how it compares to Spring overall.