update readme & add energy graph to match chi^2 curve

author: sotech117 <michael_foiani@brown.edu> 2024-03-11 15:40:47 -0400
committer: sotech117 <michael_foiani@brown.edu> 2024-03-11 15:40:47 -0400
commit: 6cbc309cb94551e75d914e105ff94e5a39878abf (patch)
tree: d5fc7ae437fd3ca766e317ac307361c08f4563e2
parent: 7b4d951fa00ee0e94d1d1b65a2f2f06cb9850146 (diff)
7 files changed, 43 insertions, 6 deletions
diff --git a/README.md b/README.md
index 0b7fe77..c74a0f9 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,24 @@
 # ab-testing-special-topics
-The simulations used to give intuition and understanding into the reasoning behind why the normal and chi^2 curves are declarative for AB tests.. 
+The simulations used to give intuition and understanding into the reasoning behind why the normal and chi^2 curves are declarative for AB tests through an example of the random walk and random speed problems.
+
+## Random Walk
+Notice how the random walk's frequency graph forms a normal distribution, which becomes "more normal" with more points on the walk.
+This frequency plot represents the (un-normalized) pdf of finding a point, so we can quantitatively define how probable it is to find a walk at a certain distance.
+
+Relating to AB testing, we can view the t-score as this particle doing a "random walk", with the p-value as the probability of finding a walk at that distance. By converting the average to a t-score, you have effectively normalized the walk, and can use the normal distribution to find the p-value.
+
+When you argue that "if the p-value is less than 0.05, then the null hypothesis is rejected", you are saying that "if the probability of finding a sample in this position is less than 0.05 and I found it (in your sampleA vs sampleB calculations), then it's highly unlikely this path is a coincidence and the null hypothesis can be rejected."
+
+## Random Speeds
+Notice how the frequency graphs of the random speeds and energies are not normal: they are roughly bell-shaped but skewed, with a longer right tail. This is because speed has no direction (it's the magnitude of velocity), so the distribution is no longer normal.
+
+While the velocity in each dimension of our system (x, y, z) has a normal distribution, the square root of the sum of their squares (the speed) follows this non-symmetric, non-normal frequency map, the chi distribution (not squared yet). Now chi^2 relates to the energies of the system (speed^2), which are directly proportional to the generalized chaos in the system (entropy).
+
+Chi^2 is used to test the null hypothesis of "no difference" between categorical variables in AB testing because it measures generalized, non-directional chaos among all dimensions of the system. If the distributions from your dimensions are similar, the system should converge to a highly chaotic, high-energy state, as stated by the second law of thermodynamics. By contrast, if your underlying distributions create an immensely low-chaos (i.e. low-energy) state, then it's highly likely these underlying distributions are different.
+
+Relating to AB testing, when you argue that, for chi^2, "if the p-value is less than 0.05, then the null hypothesis is rejected", you are saying that "if the probability of finding these directions at an energy level this low is less than 0.05 and I found it (in your sampleA vs sampleB calculations), then it's highly unlikely this state is a coincidence (it would violate the second law of thermodynamics) and the null hypothesis can be rejected (i.e. these distributions are not the same)."
+
+In theory, for performing a hypothesis test with categorical variables, we take each dimension to be the difference between the normal distributions (of differences in observed-expected) of the samples. This encapsulates the difference between the distributions into a normal curve, which we combine into the chi^2 curve (visuals help this explanation, see video).
+
+## Video of Special Topic Hour
+TODO: add video after the hours
+\ No newline at end of file
diff --git a/figs/rs-1dvels.png b/figs/rs-1dvels.png
index c4c06d4..00f7c1d 100644
--- a/figs/rs-1dvels.png
+++ b/figs/rs-1dvels.png
diff --git a/figs/rs-3dvels.png b/figs/rs-3dvels.png
index 8dae989..f021ad5 100644
--- a/figs/rs-3dvels.png
+++ b/figs/rs-3dvels.png
diff --git a/figs/rs-energies.png b/figs/rs-energies.png
new file mode 100644
index 0000000..a0705c0
--- /dev/null
+++ b/figs/rs-energies.png
diff --git a/figs/rs-speeds.png b/figs/rs-speeds.png
index 6110329..0309747 100644
--- a/figs/rs-speeds.png
+++ b/figs/rs-speeds.png
diff --git a/figs/rw-demo3.png b/figs/rw-demo3.png
index 09e8623..d807594 100644
--- a/figs/rw-demo3.png
+++ b/figs/rw-demo3.png
diff --git a/random-speed.jl b/random-speed.jl
index 0af9121..058fedd 100644
--- a/random-speed.jl
+++ b/random-speed.jl
@@ -1,13 +1,22 @@
 using Plots
 using Distributions
 
-num_velocities = 1000
+num_velocities = 100000
+num_dimensions = 3
 
 println("Starting Random Speed Simualtions...\n")
 
 function make_random_velocity()
     # pull 3 nums randomly from normal distribution
     N = Normal(0, 1)
+    if num_dimensions == 1
+        return (rand(N), 0, 0)
+    end
+
+    if num_dimensions == 2
+        return (rand(N), rand(N), 0)
+    end
+
     return (rand(N), rand(N), rand(N))
 end
 
@@ -56,14 +65,20 @@ p = Plots.scatter(
     title="Velocities")
 savefig(p, "figs/rs-3dvels.png")
 
-# plot their speeds
 speeds = [sqrt(v[1]^2 + v[2]^2 + v[3]^2) for v in velocities]
 p = histogram(
-    speeds, title="Randomly Generated Speeds (n=$num_velocities)", 
-    legend=false, xlabel="Speed", ylabel="Frequency")
+    speeds, title="Randomly Generated Speeds (n=$num_velocities, d=$num_dimensions)", 
+    legend=false, xlabel="Speed = \$ √(v_x^2 + v_y^2 + v_z^2) \$", ylabel="Frequency")
 savefig(p, "figs/rs-speeds.png")
+
+# plot their energy
+energies = [.5 * (v[1]^2 + v[2]^2 + v[3]^2) for v in velocities]
+p = histogram(
+    energies, title="Randomly Generated Energies (n=$num_velocities, d=$num_dimensions)", 
+    legend=false, xlabel="Energy =  \$ .5m(v_x^2 + v_y^2 + v_z^2) \$", ylabel="Frequency")
+savefig(p, "figs/rs-energies.png")
 # print the mean and standard deviation of the speed distribution
-println("\nSpeed->\tμ:$(mean(speeds)), σ:$(std(speeds)), n:$(length(speeds))")
+println("\tEnergies->\tμ:$(mean(energies)), σ:$(std(energies)), n:$(length(energies))")
 
 
 println("\nRandom Speed Simualtions Complete!")
 \ No newline at end of file
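
The README text in this patch says that normally distributed velocity components give chi-distributed speeds and chi^2-distributed sums of squares. The following is a minimal sketch for checking that numerically; it is not part of the commit. It assumes unit mass and standard-normal components, as in `random-speed.jl`. The helper `sample_velocities`, the seed, and the use of `randn` from the standard library (in place of `Distributions`) are illustrative choices, and the theoretical speed mean is hard-coded for d = 3.

```julia
using Random
using Statistics

# Hypothetical helper (not in the commit): draw n velocity vectors,
# each with d independent standard-normal components.
sample_velocities(rng, n, d) = [randn(rng, d) for _ in 1:n]

rng = MersenneTwister(42)   # arbitrary fixed seed for repeatability
n, d = 100_000, 3
velocities = sample_velocities(rng, n, d)

sq_sums  = [sum(abs2, v) for v in velocities]  # chi^2 with d degrees of freedom
speeds   = sqrt.(sq_sums)                      # chi with d degrees of freedom
energies = 0.5 .* sq_sums                      # kinetic energy, assuming m = 1

# chi^2(d) has mean d and variance 2d; chi(3) has mean 2*sqrt(2/pi).
println("sum of squares: mean=$(mean(sq_sums)) (theory $d), var=$(var(sq_sums)) (theory $(2d))")
println("speeds:         mean=$(mean(speeds)) (theory for d=3: $(2 * sqrt(2 / pi)))")
println("energies:       mean=$(mean(energies)) (theory $(d / 2))")
```

If the sample moments land close to the theoretical values, the histograms in `figs/rs-speeds.png` and `figs/rs-energies.png` are consistent with the chi and (scaled) chi^2 shapes described in the README.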