How to create unique vectors for large dataset - r

I am trying to find the Atkinson Index measure for individual countries that spans over 11,000 observations. I have the decile measures for each specific observation which I can create an individual vector ex. c(d1, d2,...d10) for each single observation and compute the Atkinson Index but I am sure there is a quicker way to do this across 11,000 observations. Is there any possible way I can direct R to create a unique vectors across all 11,000 observations that use the deciles that are specific to each individual observation?
I am still rather new to coding in R, but I have tried to see if I can create some kind of loop that would return a vector pertaining to the deciles that corresponding with each individual observation.
id2 <- c(3.86, 5.29, 6.38, 7.32, 8.38, 9.35, 10.82, 12.47, 14.90, 21.22)
atkinson(id2, epsilon = 1)
[1] 0.1079504
Here is what I get when type:
dput(head(data))
structure(list(id = c(1, 2, 3, 4, 5, 6), country = c("Afghanistan",
"Albania", "Albania", "Albania", "Albania", "Albania"), c3 = c("AFG",
"ALB", "ALB", "ALB", "ALB", "ALB"), d1 = c(NA, 0, 3.49, 3.48,
3.73, 3.66), d2 = c(NA, 5.29, 4.86, 4.92, 5.14, 5.19), d3 = c(NA,
6.38, 5.84, 5.98, 6.09, 6.14), d4 = c(NA, 7.32, 6.74, 6.92, 6.98,
7.03), d5 = c(NA, 8.38, 7.65, 7.99, 7.91, 8.08), d6 = c(NA, 9.35,
8.84, 9.04, 8.92, 9.26), d7 = c(NA, 10.82, 10.23, 10.37, 10.3,
10.52), d8 = c(NA, 12.47, 11.98, 12.13, 11.93, 12.29), d9 = c(NA,
14.9, 14.93, 14.83, 14.54, 14.89), d10 = c(NA, 21.22, 25.44,
24.34, 24.46, 22.93)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I can do this over 11,000 times but obviously that will take awhile, is there a way to construct R (a loop?) to do something along this lines for each individual observation?

Consider a row-wise calculation with apply to assign a new column to data frame. Underneath, as.vector() combines all decile points into a vector needed for atkinson().
data$atkinson_index <- apply(data[4:ncol(data)], MARGIN=1,
function(x) atkinson(as.vector(x), epsilon = 1)
)
data
Should NA pose a problem, wrap call in tryCatch
data$atkinson_index <- apply(data[4:ncol(data)], MARGIN=1,
function(x) tryCatch(atkinson(as.vector(x), epsilon = 1),
error = function(e) NA)
)
data

Related

How the wind rose varies by month: Package openair

I have date for 8 years. Sample of my data:
structure(list(Data = c("1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015",
"1/5/2015", "1/6/2015", "1/7/2015", "1/8/2015", "1/9/2015", "1/10/2015",
"1/11/2015", "1/12/2015", "1/13/2015", "1/14/2015", "1/15/2015",
"1/16/2015", "1/17/2015", "1/18/2015", "1/19/2015", "1/20/2015",
"1/21/2015", "1/22/2015", "1/23/2015", "1/24/2015", "1/25/2015",
"1/26/2015", "1/27/2015", "1/28/2015", "1/29/2015", "1/30/2015",
"1/31/2015"), no2 = c(3.56, 11.13, 11.84, 4.88, 6.16, 12.56,
18.99, 24.74, 10.81, 12.7, 6.08, 7.34, 16.88, 16.65, 15.81, 20.78,
15.03, 11.82, 15.18, 17, 15.21, 13.86, 10.28, 8.34, 11.89, 7.22,
15.44, 10.55, 8.19, 5.04, 14.65), ws = c(10.84, 3.71, 2.08, 4.59,
6.18, 2.97, 2.13, 1.22, 1.92, 2.07, 3.09, 4.75, 2.12, 1.8, 1.9,
1.79, 1.58, 1.86, 1.58, 1.47, 1.7, 2.6, 2.67, 3.21, 1.78, 4.58,
1.79, 3.1, 3.49, 6.15, 2.59), wd = c(90, 112.5, 112.5, 270, 90,
135, 112.5, 112.5, 270, 315, 270, 112.5, 112.5, 135, 135, 112.5,
292.5, 135, 270, 135, 112.5, 112.5, 270, 112.5, 112.5, 112.5,
112.5, 112.5, 270, 270, 270)), class = "data.frame", row.names = c(NA,
-31L))
library(openair)
windRose(nitrogen,
key = list(header="Wind Rose Acri", footer="wind speed",
plot.style = c("ticks", "border"),
fit = "all", height = 1,
space = "top"))
pollutionRose(nitrogen, pollutant = "no2")
I want to show how the wind rose varies by month. The same problem (Wind rose with ggplot (R)?) but tried realised by function from Openair package.
You could convert your Data column to a name called date with date format and specify type argument with "month". type according to documenation:
type determines how the data are split i.e. conditioned, and then
plotted. The default is will produce a single plot using the entire
data. Type can be one of the built-in types as detailed in cutData
e.g. “season”, “year”, “weekday” and so on. For example, type =
"season" will produce four plots --- one for each season.
It is also possible to choose type as another variable in the data
frame. If that variable is numeric, then the data will be split into
four quantiles (if possible) and labelled accordingly. If type is an
existing character or factor variable, then those categories/levels
will be used directly. This offers great flexibility for understanding
the variation of different variables and how they depend on one
another.
Type can be up length two e.g. type = c("season", "weekday") will
produce a 2x2 plot split by season and day of the week. Note, when two
types are provided the first forms the columns and the second the
rows.
Please note you only provided one month:
library(openair)
# add month column
nitrogen$date <- as.POSIXct(nitrogen$Data, format = '%m/%d/%Y')
windRose(nitrogen,
key = list(header="Wind Rose Acri", footer="wind speed",
plot.style = c("ticks", "border"),
fit = "all", height = 1,
space = "top"),
type = 'month')
Created on 2022-12-13 with reprex v2.0.2
Here is an example with build-in data with type = 'month':
library(openair)
windRose(mydata, type = "month")
Created on 2022-12-13 with reprex v2.0.2

Frequency table for intervals

I saved data into the object datos so I could calculate AF (absolute frequency) and RF(relative frequency) for a continuous variable in column V1. But I want to have the frequencies be in intervals.
I don't really know how to do it so I need your help. If anyone has any idea about how to do it, here is my code:
k is the number of intervals I'm using
and largo is the quantity of data I have.
read.table("datos.txt", header = FALSE)-> datos
largo<-length(datos$V1)
k<- (1+log2(largo))
k<-round(k,digits = 0)
vectordatos <- datos$v1
histograma<-hist(datos$V1,breaks=k)
FA<-table(datos$V1)
FR<-table(datos$V1)/largo
FA
FR
The datos object is as follows:
datos = structure(list(V1 = c(6.16, 5.83, 5.66, 3.63, 1.38, 9.64, 7.46,
5.34, 7.93, 8.5, 4.18, 5.18, 10.27, 5.41, 4.76, 4.67, 10.02,
7.1, 5.38, 8.55, 4.85, 8.28, 2.9, 7.18, 6.54, 5.66, 7.26, 6.45,
3.97, 6.55, 5.15, 7.83, 5.52, 7.21, 7.3, 6.19)), class = "data.frame", row .names = c(NA,
-36L))
You can use cut to create k intervals and table to represent the frequency per interval. You can use the following code:
table(cut(datos$V1,k))
Output:
(1.37,2.86] (2.86,4.34] (4.34,5.83] (5.83,7.31] (7.31,8.79] (8.79,10.3]
1 4 11 11 6 3

Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)

I'm having some issues cleaning up a dataset after I manually extracted the data online - I'm guessing these are encoding issues. I have an issue trying to remove the "U+00A0" in the "Athlete" column cels along with the operator brackets. I looked up the corresponding UTF-8 code and it's for "No-Break-Space". I'm also not sure how to replace the other UTF-8 characters to make the names legible - for e.g. getting U+008A to display as Š.
Subset of data
head2007decathlon <- structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
This is what I've tried so far to no success:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0",replacement="",x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>",replacement="",x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete<- enc2utf8(head2007decathlon$Athlete)
The following would remove the no break space.
head2007decathlon$Athlete <- gsub(pattern="<U\\+00A0>",replacement="",x=head2007decathlon$Athlete)
Not sure how to convert the other characters. One problem could be that the codes are not exactly in a format that R sees as UTF-8.
One example:
iconv('\u008A', from="UTF-8", to="LATIN1")
this seems to have an effect, contrary to trying to convert U+008A. Although
the output is:
[1] "\x8a"
not the character you want. Hope this helps somehow.

for loop to find threshold values between different data frames

I have 2 data frame with some matching columns (pollutants).
The first data frame contains the observations while the second one contains different thresholds for some pollutants.
Here a small subset of both data frames:
dput(df1)
structure(list(sample = structure(27:76, .Label = c("A_1", "A_2",
"A_LS", "A_PC", "A_PM", "B_1", "B1_1", "B1_2", "B1-8_PC", "B1-8_PM",
"B1_LS", "B1_PC", "B1_PM", "B_2", "B2_1", "B2_2", "B2-8_PC",
"B2-8_PM", "B2_LS", "B2_PC", "B2_PM", "B_LS", "B_PC", "B_PM",
"C_1", "C_2", "C386", "C387", "C388", "C389", "C390", "C391",
"C392", "C393", "C394", "C395", "C396", "C397", "C398", "C399",
"C400", "C401", "C402", "C403", "C404", "C405", "C406", "C407",
"C408", "C409", "C410", "C411", "C412", "C413", "C414", "C415",
"C416", "C417", "C418", "C419", "C420", "C421", "C422", "C423",
"C424", "C425", "C426", "C427", "C428", "C429", "C430", "C431",
"C432", "C433", "C434", "C435", "C436", "C437", "C438", "C439",
"C440", "C441", "C442", "C443", "C444", "C445", "C446", "C447",
"C448", "C449", "C450", "C451", "C452", "C453", "C454", "C455",
"C456", "C457", "C458", "C459", "C460", "C461", "C462", "C463",
"C464", "C465", "C466", "C467", "C468", "C469", "C470", "C471",
"C472", "C473", "C474", "C475", "C476", "C477", "C478", "C479",
"C480", "C481", "C482", "C483", "C484", "C485", "C486", "C487",
"C488", "C489", "C490", "C491", "C492", "C493", "C494", "C495",
"C496", "C497", "C498", "C499", "C500", "C501", "C502", "C503",
"C504", "C505", "C506", "C507", "C508", "C509", "C510", "C511",
"C512", "C513", "C514", "C515", "C516", "C517", "C518", "C519",
"C520", "C521", "C522", "C523", "C524", "C-8_PC", "C-8_PM", "D_1",
"D_2", "E_1", "E_2", "F_1", "F_2"), class = "factor"), As = c(9,
8.75, 13.5, 7.75, 7.6, 8.33, 8, 8.75, 7.4, 8.25, 8.17, 7.75,
7.6, 7.5, 7.2, 8, 7.83, 7.75, 7, 7.5, 8.17, 8.75, 6.67, 7, 5.83,
6.75, 5.6, 6.4, 6.2, 6.2, 6.2, 6.25, 7, 6, 6, 6.4, 6, 5.8, 5.6,
6, 5.8, 7.25, 8.8, 8.5, 8, 8.25, 8.25, 8.5, 8.25, 8.25), Al = c(30245,
38060, 36280, 24355, 27776, 35190, 38733.8, 36400, 29624, 33699.75,
32163.33, 30645.75, 31373, 26647.5, 19987.6, 32210, 27158, 24220.25,
18598.5, 23081.75, 29393, 26800.5, 22581.67, 29290, 29651.67,
20947.5, 19762.6, 23815, 32784.8, 20696.2, 26880.6, 25087.75,
19497.2, 21794, 32232, 24253.4, 20034, 21270, 22510, 15170.25,
8956.6, 21612.25, 35828, 30006.25, 27128.75, 25835, 31118.75,
35614.5, 37440.25, 33736.75), Hg = c(0.25, 0.35, 0.48, 1.03,
1.12, 0.2, 1.14, 0.4, 2, 0.48, 0.85, 0.18, 0.76, 0.4, 0.48, 0.35,
0.32, 0.33, 0.4, 0.13, 0.15, 0.13, 0.87, 0.12, 0.03, 0.33, 0.2,
0.22, 0.04, 0.16, 0.1, 0.18, 0.11, 0.08, 0.03, 0.06, 0.06, 0.1,
0.03, 0.07, 0.03, 0.1, 0.08, 0.11, 0.1, 0.13, 0.08, 0.12, 0.07,
0.09)), .Names = c("sample", "As", "Al", "Hg"), row.names = c(NA,
50L), class = "data.frame")
and
dput(df2)
structure(list(As = c(25L, 32L), Hg = c(0.4, 0.8), Cr = c(100L,
360L), Element = structure(c(1L, 3L), .Label = c("LCB", "LCB_pelite",
"LCL"), class = "factor")), .Names = c("As", "Hg", "Cr", "Element"
), row.names = c(NA, -2L), class = "data.frame")
Actually the original data frames are bigger, but this subset gives the idea.
What I want now is to put in a 3rd data frames the values of each element of the first df that exceed the threshold values contained in the second df.
Be aware that there are 2 different threshold values (for each element) in df2 and df2 has some element not matched in df1 (for example Cr).
I've tried to write a for loop but I was able to do that just for 1 element at a time:
for (i in df2$As) {
print(length(which(df1$As > i)))
}
I've also tried to use nested for loops but without success..
I'm pretty sure this does not look good, but I think it works. I added some extra lines to match only the elements found in both data frames, which in this case is only 1. It might ned some changes for your full data:
df1.2 <- rbind(df1, df1) #Duplicate the df1 to compare to each threshold value
df1.2 <- df1.2[order(df1.2$sample),] #Order by sample again
cols2 <- na.omit(match(colnames(df1), colnames(df2)))[[1]] #Get the columns of df2 which are in df1
cols1 <- na.omit(match(colnames(df2), colnames(df1)))[[1]] #Get the columns of df1 which are in df2
df2.2 <- df2[rep(1:2, nrow(df1)),cols2] #Replicates df2 the number of times to allow matching the thresholds to each sample, once for each threshold
exceeds <- df1.2[,cols1]>df2.2 #Make the comparions and return a boolean
sum(exceeds) #You will need colSums() for more than one column
With your sample data it's also not clear from the answer which elements ir refers to, but this shouldn't happen if more than one element matches and your result is a matrix.
Maybe there's a more elegant way without replicating the dataframes and having to worry about number of element matches.
df3=data.frame(Pollutant="Z",LCB=0,LCL=0,stringsAsFactors=FALSE)
for (p in names(df1)[-1]) {
if(p %in% names(df2)[1:(length(df2)-1)]) {
df3 = rbind(df3,c(p,sum(df1[p]>df2[[p]][1]),sum(df1[p]>df2[[p]][2])))
}
}
df3=df3[-1,]
df3
Update:
Ah, each new row is rbound as a character vector. To finish up:
str(df3)
df3$LCB=as.numeric(df3$LCB)
df3$LCL=as.numeric(df3$LCL)
str(df3)
How about this?
foo <- function(x, y) {
sapply(x, function(i) sum(y>i))
}
cols = c("As", "Hg")
mapply(foo, df2[cols], df1[cols])
# As Hg
# [1,] 0 10
# [2,] 0 6
Convert this to a data.frame if necessary.

how do you create a custom json output from an R function with dataframe and column names as arguments

I need to create and R function that would take a data frame and column names as arguments (there should be at least two column names in the argument list and maybe more). Then given the data frame, I need to create a json formatted output based the given column names. For example,
this is my df:
structure(list(DateTime = structure(1:8, .Label = c("8/24/2014 15:20",
"8/24/2014 15:55", "8/24/2014 16:04", "8/24/2014 16:18", "8/24/2014 16:27",
"8/24/2014 16:42", "8/24/2014 16:56", "8/24/2014 17:10"), class = "factor"),
Server1 = c(6.09, 4.54, 5.03, 4.93, 6.27, 4.59, 5.91, 4.53
), Server2 = c(5.7, 4.38, 4.52, 4.61, 4.18, 4.61, 4.37, 4.3
), Server3 = c(5.21, 5.33, 4.92, 5.56, 5.62, 6.73, 4.76,
4.59)), .Names = c("DateTime", "Server1", "Server2", "Server3"
), class = "data.frame", row.names = c(NA, -8L))
I need this function to return this output:
[{"name":"Server1","data":[[18/24/2014 15:20,6.09],[8/24/2014 15:55,4.54],[8/24/2014 16:04,5.03]]},
{"name":"Server2","data":[[18/24/2014 15:20,7.7],[8/24/2014 15:55,4.38],[8/24/2014 16:04,4.52]]},
{"name":"Server3","data":[[18/24/2014 15:20,5.21],[8/24/2014 15:55,5.33],[8/24/2014 16:04,4.92]]}]
Any ideas how I would start with this?
Assuming your data frame is named dd, then
library(rjson)
library(reshape2)
mm <- melt(dd)
ss <- split(mm, mm$variable)
poo <- unname(Map(function(n,x)
list(name=n, data=unname(lapply(split(x, 1:nrow(x)), function(x) {
list(x$DateTime, x$value)
}))), names(ss),ss))
cat(toJSON(poo))
And that gives
[{"name":"Server1","data":[["8/24/2014 15:20",6.09],["8/24/2014 15:55",4.54],["8/24/2014 16:04",5.03],["8/24/2014 16:18",4.93],["8/24/2014 16:27",6.27],["8/24/2014 16:42",4.59],["8/24/2014 16:56",5.91],["8/24/2014 17:10",4.53]]},
{"name":"Server2","data":[["8/24/2014 15:20",5.7],["8/24/2014 15:55",4.38],["8/24/2014 16:04",4.52],["8/24/2014 16:18",4.61],["8/24/2014 16:27",4.18],["8/24/2014 16:42",4.61],["8/24/2014 16:56",4.37],["8/24/2014 17:10",4.3]]},
{"name":"Server3","data":[["8/24/2014 15:20",5.21],["8/24/2014 15:55",5.33],["8/24/2014 16:04",4.92],["8/24/2014 16:18",5.56],["8/24/2014 16:27",5.62],["8/24/2014 16:42",6.73],["8/24/2014 16:56",4.76],["8/24/2014 17:10",4.59]]}]
which seems to match what you wanted.
It's not super pretty because you've really gone out of your way to reshape your data in a way that rsjon doesn't necessarily like.

Resources