author sotech117 <michael_foiani@brown.edu> 2024-03-11 15:40:47 -0400
committer sotech117 <michael_foiani@brown.edu> 2024-03-11 15:40:47 -0400
commit 6cbc309cb94551e75d914e105ff94e5a39878abf
tree d5fc7ae437fd3ca766e317ac307361c08f4563e2
parent 7b4d951fa00ee0e94d1d1b65a2f2f06cb9850146
update readme & add energy graph to match chi^2 curve
-rw-r--r-- README.md 24
-rw-r--r-- figs/rs-1dvels.png bin 28263 -> 32212 bytes
-rw-r--r-- figs/rs-3dvels.png bin 60791 -> 68968 bytes
-rw-r--r-- figs/rs-energies.png bin 0 -> 25811 bytes
-rw-r--r-- figs/rs-speeds.png bin 17602 -> 26668 bytes
-rw-r--r-- figs/rw-demo3.png bin 134331 -> 135108 bytes
-rw-r--r-- random-speed.jl 25
7 files changed, 43 insertions, 6 deletions
diff --git a/README.md b/README.md
index 0b7fe77..c74a0f9 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,24 @@
# ab-testing-special-topics
-The simulations used to give intuition and understanding into the reasoning behind why the normal and chi^2 curves are declarative for AB tests..
+Simulations that build intuition for why the normal and chi^2 curves are the reference distributions for AB tests, using the random walk and random speed problems as examples.
+
+## Random Walk
+Notice how the random walk's frequency graph forms a normal distribution, which becomes "more normal" with more points on the walk.
+This frequency plot represents the (un-normalized) pdf of where a walk ends, so we can quantitatively define how probable it is to find a walk at a certain distance.
+
+Relating to AB testing, we can view the t-score as this particle doing a "random walk", with the p-value as the probability of finding a walk at that distance or farther. By converting the average to a t-score, you have effectively normalized the walk, and (for large samples) can use the normal distribution to find the p-value.
+
+When you argue that "if the p-value is less than 0.05, then the null hypothesis is rejected", you are saying that "if the probability of finding a sample at least this far out under the null hypothesis is less than 0.05, and I found one anyway (in your sampleA vs sampleB calculations), then it's highly unlikely this path is a coincidence and the null hypothesis can be rejected."
+
+## Random Speeds
+Notice how the frequency graphs of the random speeds and energies are not normal: they look roughly bell-shaped but are skewed, with a longer right tail. This is because speed has no direction (it's the magnitude of velocity) and can never be negative, so the distribution is no longer symmetric.
+
+While the velocity for each dimension in our system (x, y, z) has a normal distribution, the square root of the sum of their squares (the speed) follows this non-symmetric, non-normal frequency map, the chi distribution (not squared yet). The chi^2 distribution then describes the energies of the system (proportional to speed^2), which we treat as a non-directional measure of the overall "chaos" in the system.
+
+Chi^2 is used to test the null hypothesis of "no difference" between categorical variables in AB testing because it measures generalized, non-directional deviation among all dimensions of the system. If the samples come from the same distribution, each dimension is just small random noise, and the total energy stays in the bulk of the chi^2 curve (the system's equilibrium). By contrast, if the total energy lands far out in the long right tail, then it's highly likely the underlying distributions are different.
+
+Relating to AB testing, when you argue that, for chi^2, "if the p-value is less than 0.05, then the null hypothesis is rejected", you are saying that "if the probability of finding an energy level at least this high is less than 0.05, and I found it anyway (in your sampleA vs sampleB calculations), then it's highly unlikely this state is a coincidence and the null hypothesis can be rejected (i.e. these distributions are not the same)."
+
+In theory, to perform a hypothesis test with categorical variables, we take each dimension to be one category's normalized difference between observed and expected counts, which is approximately normally distributed. This encapsulates the difference between the distributions into normal curves, whose squares we sum to get the chi^2 curve (visuals help this explanation, see video).
+
+## Video of Special Topic Hour
+TODO: add video after the hours
\ No newline at end of file
diff --git a/figs/rs-1dvels.png b/figs/rs-1dvels.png
index c4c06d4..00f7c1d 100644
--- a/figs/rs-1dvels.png
+++ b/figs/rs-1dvels.png
Binary files differ
diff --git a/figs/rs-3dvels.png b/figs/rs-3dvels.png
index 8dae989..f021ad5 100644
--- a/figs/rs-3dvels.png
+++ b/figs/rs-3dvels.png
Binary files differ
diff --git a/figs/rs-energies.png b/figs/rs-energies.png
new file mode 100644
index 0000000..a0705c0
--- /dev/null
+++ b/figs/rs-energies.png
Binary files differ
diff --git a/figs/rs-speeds.png b/figs/rs-speeds.png
index 6110329..0309747 100644
--- a/figs/rs-speeds.png
+++ b/figs/rs-speeds.png
Binary files differ
diff --git a/figs/rw-demo3.png b/figs/rw-demo3.png
index 09e8623..d807594 100644
--- a/figs/rw-demo3.png
+++ b/figs/rw-demo3.png
Binary files differ
diff --git a/random-speed.jl b/random-speed.jl
index 0af9121..058fedd 100644
--- a/random-speed.jl
+++ b/random-speed.jl
@@ -1,13 +1,22 @@
using Plots
using Distributions
-num_velocities = 1000
+num_velocities = 100000
+num_dimensions = 3
println("Starting Random Speed Simualtions...\n")
function make_random_velocity()
    # pull 3 nums randomly from normal distribution
    N = Normal(0, 1)
+    if num_dimensions == 1
+        return (rand(N), 0, 0)
+    end
+
+    if num_dimensions == 2
+        return (rand(N), rand(N), 0)
+    end
+
    return (rand(N), rand(N), rand(N))
end
@@ -56,14 +65,20 @@ p = Plots.scatter(
title="Velocities")
savefig(p, "figs/rs-3dvels.png")
-# plot their speeds
speeds = [sqrt(v[1]^2 + v[2]^2 + v[3]^2) for v in velocities]
p = histogram(
-    speeds, title="Randomly Generated Speeds (n=$num_velocities)",
-    legend=false, xlabel="Speed", ylabel="Frequency")
+    speeds, title="Randomly Generated Speeds (n=$num_velocities, d=$num_dimensions)",
+    legend=false, xlabel="Speed = \$ √(v_x^2 + v_y^2 + v_z^2) \$", ylabel="Frequency")
savefig(p, "figs/rs-speeds.png")
+
+# plot their energy
+energies = [.5 * (v[1]^2 + v[2]^2 + v[3]^2) for v in velocities]
+p = histogram(
+    energies, title="Randomly Generated Energies (n=$num_velocities, d=$num_dimensions)",
+    legend=false, xlabel="Energy = \$ .5m(v_x^2 + v_y^2 + v_z^2) \$", ylabel="Frequency")
+savefig(p, "figs/rs-energies.png")
# print the mean and standard deviation of the speed distribution
-println("\nSpeed->\tμ:$(mean(speeds)), σ:$(std(speeds)), n:$(length(speeds))")
+println("\tEnergies->\tμ:$(mean(energies)), σ:$(std(energies)), n:$(length(energies))")
println("\nRandom Speed Simualtions Complete!")
\ No newline at end of file
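
Reviewer sketch (not part of the commit): the README's claim that speeds follow a chi shape and energies a chi^2 shape can be sanity-checked against the known means of those distributions. The helper `simulate_speeds` and the seed are hypothetical, chosen only for illustration. The sketch assumes unit mass and standard-normal velocity components, as in `random-speed.jl`, and uses only the Julia standard library rather than `Distributions`.

```julia
using Random
using Statistics

# Hypothetical helper (not in random-speed.jl): draw n velocities in d
# dimensions with N(0, 1) components and return their speeds and energies.
function simulate_speeds(n::Int, d::Int; rng=MersenneTwister(42))
    vels = randn(rng, n, d)              # n × d matrix of velocity components
    sq = vec(sum(abs2, vels; dims=2))    # v_x^2 + v_y^2 + ... for each particle
    return sqrt.(sq), 0.5 .* sq          # speed (chi), energy (chi^2 / 2, unit mass)
end

speeds, energies = simulate_speeds(100_000, 3)

# Known means for 3 degrees of freedom: chi has mean 2 * sqrt(2 / pi),
# chi^2 has mean 3, so the energy (half the sum of squares) has mean 1.5.
println("mean speed  = ", mean(speeds), "  (theory: ", 2 * sqrt(2 / pi), ")")
println("mean energy = ", mean(energies), "  (theory: 1.5)")
```

If the simulated means drift far from the theoretical ones as `n` grows, that would suggest the velocity components are not standard normal or that the dimension count is off.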