I have a set of data (in the region of 800000 lines), in three columns (longitude, latitude and earthquake magnitude) that are not sorted in any way. A small example below...
-118.074 36.930 2.97
-118.005 36.898 2.61
-116.526 36.621 2.72
-116.488 36.650 2.68
-117.675 36.820 2.00
-117.963 36.514 1.30
-118.090 36.757 1.94
-117.651 36.518 1.40
-116.434 36.506 1.90
-117.914 36.531 2.10
-118.235 36.882 2.00
I am required to create a histogram of the earthquake magnitudes (in the range of 1.0 to 7.0), but I am not sure how to go about creating the frequency of magnitudes.
I understand that in order to create a histogram, I will need to discern the unique values, and set them in ascending order in a column. I believe I can then run a for command with a count function for each value... but I need a bit of help in doing so!
Thank you for any help you can offer!
awk '{counts[$3]++} END {for (c in counts) print c, counts[c]}' inputs.txt | sort -nk2
will print each unique magnitude with its count, sorted in ascending order of count:
1.30 1
1.40 1
1.90 1
1.94 1
2.10 1
2.61 1
2.68 1
2.72 1
2.97 1
2.00 2
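If you would rather bin the magnitudes into fixed-width classes (which is usually what "histogram" means here) than count exact values, R can do the counting and the plotting in one step. A minimal sketch, assuming the file is whitespace-delimited as shown, named inputs.txt (a placeholder), and that all magnitudes fall in the stated 1.0 to 7.0 range:

quakes <- read.table("inputs.txt", col.names = c("lon", "lat", "mag"))
# 0.5-wide bins from 1.0 to 7.0; hist() errors if any value falls outside this range
hist(quakes$mag, breaks = seq(1, 7, by = 0.5),
     main = "Earthquake magnitudes", xlab = "Magnitude")
# the bin counts themselves, if you only need the frequencies
table(cut(quakes$mag, breaks = seq(1, 7, by = 0.5), include.lowest = TRUE))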
I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat hocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each grouping to show that one of them has a higher average of sugar than the others? Or at the least, how can I print off the results of the grouped averages of sugar to defend my argument that one has more sugar than the other?
After saving your text to CSV and loading this file into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
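For completeness, the loading step might look like the sketch below (the file name cookies.csv is a placeholder; adjust it to wherever you saved the data):

df <- read.csv("cookies.csv", header = TRUE)  # placeholder file name
str(df)  # quick sanity check that the columns were read as expected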
You first group your data by the variable group and then summarize it with the mean() function.
library(dplyr)

# group by the `group` column and compute the mean sugar per group;
# the outer parentheses print the result as it is assigned
(cookies = df %>%
  group_by(group) %>%
  summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has a slightly higher sugar content than the others, based on your data.
If you want to go a step further and actually plot this data, you can do so:
library(ggplot2)

cookies %>%
  ggplot(aes(x = meanSugar, y = reorder(group, meanSugar),
             fill = group, label = meanSugar)) +
  geom_col() +
  labs(y = "Cookie groups", x = "Mean Sugar") +
  geom_label(stat = "identity", hjust = 1.2, color = "white") +
  theme(legend.position = "none")
If you have any questions on some of these steps, let me know!
Note: please try to provide your data in a more easily reproducible form next time, so it's easy to reproduce what you need and give you a quick answer :)
I have foram delta-18O data spanning 6 million years, with one data point every 0.003 Myr, like this:
Age (Myr) | d18O
0 | 3.43
0.003 | 3.37
0.006 | 3.54
0.009 | 3.87
0.012 | 4.36
0.015 | 4.90
0.018 | 5.01
0.021 | 4.96
0.024 | 4.87
0.027 | 4.67
0.03 | 4.58
ts_d18O <- ts(d18O[,2])
plot(ts_d18O)
However, when I tell R to treat it as a time series and plot it, the x-axis runs all the way to 2000, i.e. the scale is the observation index rather than millions of years.
How do I fix this? I need it to be a time series because I have to do spectral analysis on it, e.g. periodograms.
You need to specify the frequency of observations. You have 1 observation every 3000 years, so the frequency would be 1/3000 if you want your answer in years. I'll assume you want your answer in millions of years. This means your frequency is 1/0.003, or 333.333 (i.e. this is how many samples are taken per million years).
You should specify a start time as well (-5 in this example will represent 5 Mya).
Finally, you can label your x axis as required.
ts_d18O <- ts(d18O[,2], start = -5, frequency = 1/0.003)
ts_d18O
#> Time Series:
#> Start = -5
#> End = -4.97
#> Frequency = 333.333333333333
#> [1] 3.43 3.37 3.54 3.87 4.36 4.90 5.01 4.96 4.87 4.67 4.58
plot(ts_d18O, xlab = "Million years ago")
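Since you mention periodograms: once the frequency is set this way, base R's spectrum() reports frequencies in cycles per million years, so periods come out as 1/frequency Myr. A quick sketch:

# raw periodogram; the x-axis is in cycles per Myr
# (e.g. a peak near 10 corresponds to a ~100 kyr cycle)
spectrum(ts_d18O, log = "no")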
Hopefully this makes sense.
I have one column in my dataset containing entries from one of three size categories (read in as characters): "(0,1.88]", "(1.88,4]", and "(4,10]". I would like to combine all of my entries together by plot (another column in the dataset), totaling the response for each size category in its own column.
Ideally, I'm trying to take data which has multiple responses in each Plot and end up with one total response for each plot, divided by size category. I'm hoping to get something like this:
Plot Total Response for (0,1.88] Total Response for (1.88,4] Total Response for (4,10]
Here is the head of my data. Not all of it is needed, only Plot, ounces, and tuber.diam. tuber.diam has the entries grouped into size categories.
head(newChippers)
Plot ounces Height Shape Area plot variety rate block width length tuber.oz.bin tuber.diam
1 2422 1.31 1.22 26122 3237 242 Lamoka 3 4 1.65 1.70 (0,4] (0,1.88]
2 2422 2.76 1.56 27853 5740 242 Lamoka 3 4 2.20 2.24 (0,4] (1.88,4]
3 2422 1.62 1.31 24125 3721 242 Lamoka 3 4 1.53 1.95 (0,4] (0,1.88]
4 2422 3.37 1.70 27147 6498 242 Lamoka 3 4 2.17 2.48 (0,4] (1.88,4]
5 2422 3.19 1.70 27683 6126 242 Lamoka 3 4 2.22 2.34 (0,4] (1.88,4]
6 2422 2.83 1.53 27356 6009 242 Lamoka 3 4 2.00 2.53 (0,4] (1.88,4]
Here is what I currently have for making the new dataset:
YieldSizeProfileDiameter <- newChippers %>%
group_by(Plot) %>%
summarize(totalOz = sum(Weight),
Diameter.0.1.88 = (tuber.diam("(0,1.88]")),
Diameter.1.88.4 = (tuber.diam(" (1.88,4]")),
Diameter.4.10 = (tuber.diam(" (4,10]")))
I get the following error:
Error in x[[n]] : object of type 'closure' is not subsettable
Any help would be very much appreciated! Again, I'm very sorry if I've explained it poorly or made it too complicated. If any additional information is needed, I can try to provide it. Thank you!
I have revised your code. I assume your variable Weight is the same as the variable ounces, as there is no Weight variable in your data newChippers. I use Weight here, as in your code:
library(dplyr)
library(tidyr)  # pivot_wider() lives in tidyr

YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot, tuber.diam) %>%
  summarize(totalOz = sum(Weight)) %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz)

YieldSizeProfileDiameter
I have not tested the code on my side as I do not have the data.
I have a set of time-series data (GPS speed data, specifically) which includes gaps of missing values where the signal was lost. Missing periods of short duration I can fill simply using na.spline, but this is inappropriate for longer gaps. Instead, I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data the ramp would use the previous value and the rate of acceleration to get the next data point, until speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers
It's not really elegant, but you can do it in a loop.
na.pos <- which(is.na(test$speed))
acc <- FALSE

for (i in na.pos) {
  if (acc) {
    # already past zero: ramp back up from the previous value
    speed <- test$speed[i-1] + (Hz * ramp)
  } else {
    # ramp down from the previous value
    speed <- test$speed[i-1] - (Hz * ramp)
    if (round(speed, 1) < 0) {
      # crossed zero: switch to ramping up
      acc <- TRUE
      speed <- test$speed[i-1] + (Hz * ramp)
    }
  }
  test[i,] <- speed
}
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01': 0.59 - (6 * 1/10) is -0.01, not 0. You can round it later; I decided not to.
When the question says "ramp the values from the last true value down to zero" in each run of NAs, I take that to mean that any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now, use rleid from data.table to create a grouping vector, the same length as test$speed, identifying each run in is.na(test$speed), and use ave to create sequence numbers within those groups, seqno. Then calculate the declining sequences, ramp_down, by combining na.locf(test$speed) and seqno. Finally, replace the NAs.
library(data.table)
library(zoo)

test_speed <- test$speed
# sequence number within each run of NAs (and within each non-NA run)
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
# carry the last real value forward, ramp it down, and floor at zero
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
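The question also asks for ramping back up to the next real value when the signal returns. The same idea works in reverse: number the NAs from the end of each run, carry the next real value backwards with na.locf(..., fromLast = TRUE), and take whichever ramp is higher at each point. A sketch, reusing the objects defined above:

# position counted from the end of each run of NAs
seqno_up <- ave(test_speed, rleid(is.na(test_speed)),
                FUN = function(x) rev(seq_along(x)))
# ramp up towards the next real value, again flooring at zero
ramp_up <- pmax(na.locf(test_speed, fromLast = TRUE) - seqno_up * ramp * Hz, 0)
result2 <- ifelse(is.na(test_speed), pmax(ramp_down, ramp_up), test_speed)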
I'm having some issues using the ddply command from the plyr package. I created a dataframe which looks like this one:
u v intensity season
24986 -1.97 -0.35 2.0 1
24987 -1.29 -1.53 2.0 1
24988 -0.94 -0.34 1.0 1
24989 -1.03 2.82 3.0 1
24990 1.37 3.76 4.0 1
24991 1.93 2.30 3.0 2
24992 3.83 -3.21 5.0 2
24993 0.52 -2.95 3.0 2
24994 3.06 -2.57 4.0 2
24995 2.57 -3.06 4.0 2
24996 0.34 -0.94 1.0 2
24997 0.87 4.92 5.0 3
24998 0.69 3.94 4.0 3
24999 4.60 3.86 6.0 3
I tried to use the function cumsum on the u and v values, but I don't get what I want. When I select a subset of my data corresponding to a season, for example:
x <- cumsum(mydata$u[56297:56704]*10.8)
y <- cumsum(mydata$v[56297:56704]*10.8)
...this works perfectly. The thing is that I have a huge dataset (67208 rows) with 92 seasons, and I'd like to make this function work on subsets of the data. So I tried this:
new <- ddply(mydata, .(mydata$seasons), summarize, x=c(0,cumsum(mydata$u*10.8)))
...and the result looks like this:
24986 1 NA
24987 1 NA
24988 1 NA
I found some questions related to this one on Stack Overflow and other websites, but none of them helped me deal with my problem. If someone has an idea, you're welcome ;)
Don't use your data.frame's name inside the plyr call; just reference the column name as though it were already defined (note the column is called season, not seasons, in your data):
ddply(mydata, .(season), summarise, x = c(0, cumsum(u * 10.8)))
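As an aside, plyr has largely been superseded by dplyr; if you are open to switching, the per-season cumulative sums can be written with mutate(), which keeps one row per observation. This sketch assumes the column is named season, as in your printout:

library(dplyr)

mydata %>%
  group_by(season) %>%
  mutate(x = cumsum(u * 10.8),
         y = cumsum(v * 10.8)) %>%
  ungroup()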