I have a CSV file named "table_parameter". Please download it from here. The data look like this:
    time       avg.PM10         sill       range      nugget
1   2012030101 52.2692307692308 0.11054330  45574.072 0.0372612157
2   2012030102 55.3142857142857 0.20250974  87306.391 0.0483153769
3   2012030103 56.0380952380952 0.17711558  56806.827 0.0349567088
4   2012030104 55.9047619047619 0.16466350 104767.669 0.0307528346
.
.
.
25  2012030201 67.1047619047619 0.14349774  72755.326 0.0300378129
26  2012030202 71.6571428571429 0.11373430  72755.326 0.0320594776
27  2012030203 73.352380952381  0.13893530  72755.326 0.0311135434
28  2012030204 70.2095238095238 0.12642303  29594.037 0.0281416079
.
.
In my data frame there is a variable named time that contains hourly values from 01 March 2012 to 07 March 2012 in numeric form. For example, 01 March 2012, 1:00 a.m. is written as 2012030101, and so on.
From this dataset I want to subset (24*11) data frames, like the table below:
For example, for 1 a.m. (2012030101, 2012030201, ..., 2012030701) and for avg.PM10 <= 10, I want one data frame. You have probably noticed that for some of these data frames there will be no observations, but that is okay, because I will be working with a very large dataset.
I can do this subsetting manually by writing (24*11 = 264) lines of code like this!
table_par <- read.csv("table_parameter.csv")
times <- as.numeric(substr(table_par$time, 9, 10))

par_1am_0to10  <- subset(table_par, times == 1 & avg.PM10 <= 10)
par_1am_10to20 <- subset(table_par, times == 1 & avg.PM10 > 10 & avg.PM10 <= 20)
par_1am_20to30 <- subset(table_par, times == 1 & avg.PM10 > 20 & avg.PM10 <= 30)
.
.
.
par_24pm_80to90  <- subset(table_par, times == 24 & avg.PM10 > 80 & avg.PM10 <= 90)
par_24pm_90to100 <- subset(table_par, times == 24 & avg.PM10 > 90 & avg.PM10 <= 100)
par_24pm_100up   <- subset(table_par, times == 24 & avg.PM10 > 100)
But I understand this code is very inefficient. Is there any way to do it efficiently by using a loop?
FYI: in the future I want to draw some plots using these (24*11) datasets.
Update: after this subsetting, I want to plot boxplots using the range variable of every dataset. The problem is that I want to show all (24*11) boxplots of range in one plot, arranged like a matrix [as in the figure above]. If you have any further inquiries, please let me know. Thanks a lot in advance.
You can do this using some plyr, dplyr and tidyr magic:
library(tidyr)
library(dplyr)
# I am not loading plyr here because it interferes with dplyr; I only want it
# for the round_any function anyway

# Read data
dfData <- read.csv("table_parameter.csv")

dfData %>%
  # Extract the hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100, roundedPM.10)) %>%
  # Keep only the relevant columns
  select(hour, roundedPM.10) %>%
  # Count the number of occurrences per hour
  count(roundedPM.10, hour) %>%
  # Use spread (from tidyr) to transform it into wide format
  spread(hour, n)
If you plan on using ggplot2, you can forget about tidyr and the last line of the code in order to keep the data frame in long format; it will be easier to plot that way.
EDIT: after reading your comment, I realised I had misunderstood your question. This will give you a boxplot for each pair of hour and Avg.PM10 interval:
library(tidyr)
library(dplyr)
library(ggplot2)
# I am not loading plyr here because it interferes with dplyr; I only want it
# for the round_any function anyway

# Read data
dfData <- read.csv("C:/Users/pformont/Desktop/table_parameter.csv")

dfDataPlot <- dfData %>%
  # Extract the hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100, roundedPM.10)) %>%
  # Keep only the relevant columns
  select(roundedPM.10, hour, range)

# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) +
  geom_boxplot() +
  facet_grid(roundedPM.10 ~ .)
How about a double loop like this:
table_par <- read.csv("table_parameter.csv")
times <- as.numeric(substr(table_par$time, 9, 10))

# Create an empty data frame for the output
sub.df <- data.frame(name = NA, X = NA, time = NA, Avg.PM10 = NA,
                     sill = NA, range = NA, nugget = NA)[numeric(0), ]

t_list  <- seq(1, 24, 1)
PM_list <- seq(0, 100, 10)

for (t in t_list) {
  for (PM in PM_list) {
    PM2 <- PM + 10
    sub <- subset(table_par, times == t & Avg.PM10 > PM & Avg.PM10 <= PM2)
    if (length(sub$X) != 0) {  # avoid errors caused by empty subsets
      name <- paste("par_", t, "am_", PM, "to", PM2, sep = "")
      sub$name <- name
      sub.df <- rbind(sub.df, sub)
    }
  }
}

sub.df  # print the data frame
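If you would rather skip the explicit double loop, base R's split() builds all (24*11) subsets in one call. A minimal sketch, assuming the column is named avg.PM10 as in your subset() calls; empty hour/interval combinations are kept as empty data frames, which matches what you described:
table_par <- read.csv("table_parameter.csv")
times  <- as.numeric(substr(table_par$time, 9, 10))
# Bin avg.PM10 into (0,10], (10,20], ..., (90,100], (100,Inf)
pm_bin <- cut(table_par$avg.PM10,
              breaks = c(seq(0, 100, 10), Inf),
              labels = c(paste0(seq(0, 90, 10), "to", seq(10, 100, 10)), "100up"))
# One list element per (hour, interval) pair, e.g. subsets[["1.0to10"]]
subsets <- split(table_par, list(hour = times, pm = pm_bin))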
Related
I am trying to arrange my data. The CSV file that I load contains the results of 15 precincts for one locality. There are 150 rows because the names of the 10 candidates repeat for each of the 15 precincts.
My goal is to make the names of the 10 candidates into columns, without repeating their names, and with the results for each candidate as the values. I use the code below; however, I have to do it 15 times because I cut my data into intervals of 10 to extract the results of one precinct. It's the same for "binondov", except that I have to cut the data into intervals of 8 because there are 8 candidates for each precinct.
Is there a way to write my code as a loop? Thanks!
binondop1 <- binondop[1:10,]
binondop1a <- binondop1[order(binondop1[,2]),]
binondov1 <- binondov[1:8,]
binondov1a <- binondov1[order(binondov1[,2]),]
colnames(binondop1a) = colnames(binondov1a) =
c('X', 'Candidate', 'Party', 'Vote', 'Percentage')
binondo1 <- rbind(binondop1a, binondov1a)
binondo <- rbind(t(binondo1$Vote), t(binondo2$Vote),
t(binondo3$Vote), t(binondo4$Vote),
t(binondo5$Vote), t(binondo6$Vote),
t(binondo7$Vote), t(binondo8$Vote),
t(binondo9$Vote), t(binondo10$Vote),
t(binondo11$Vote), t(binondo12$Vote),
t(binondo13$Vote),t(binondo14$Vote),
t(binondo15$Vote))
colnames(binondo) <- c('Acosta', 'Aquino', 'DLReyes', 'EEjercito',
'Gordon', 'Madrigal', 'Perlas', 'Teodoro',
'Villanueva', 'Villar', 'Binay', 'Chipeco',
'Fernando', 'Legarda', 'Manzano', 'Roxas',
'Sonza', 'Yasay')
It's hard to say exactly without seeing a sample data set, but perhaps something like this will help get you where you need to go.
library(dplyr)
library(tidyr)
df <- data.frame(Candidate = c(rep('Acosta',3), rep('Aquino',3), rep('DLReyes',3)),
Party = c('R','R','R','L','L','L','D','D','D'),
Vote = rep(c('A','B','C'),3),
Percentage = c(5,4,2,6,8,3,1,3,2))
df2 <- df %>%
mutate(Candidate = paste0(Candidate, ' (', Party, ')')) %>%
select(-Party) %>%
spread(Candidate, Percentage)
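If you specifically want the loop you asked about, here is a sketch that mirrors your manual code, assuming binondop stacks 15 precincts of 10 rows each and binondov stacks 15 precincts of 8 rows each:
# Slice out each precinct, sort by candidate, and stack the vote rows
precincts <- lapply(1:15, function(i) {
  p <- binondop[((i - 1) * 10 + 1):(i * 10), ]  # 10 rows per precinct
  v <- binondov[((i - 1) * 8 + 1):(i * 8), ]    # 8 rows per precinct
  p <- p[order(p[, 2]), ]
  v <- v[order(v[, 2]), ]
  colnames(p) <- colnames(v) <- c('X', 'Candidate', 'Party', 'Vote', 'Percentage')
  rbind(p, v)
})
binondo <- do.call(rbind, lapply(precincts, function(d) t(d$Vote)))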
I have OHLC (Open/High/Low/Close) data, which can be obtained from any finance API.
I want to create a target indicator (-1, 0, 1) on which I will build a stock classification model.
To create this target variable, I need another indicator, log(tomorrow's CLOSE / today's CLOSE), which gives a value in (-Inf, Inf). I then want to create labels = c(-1, 0, 1) from breaks = c(-Inf, range_start, range_end, Inf) of log(tomorrow's CLOSE / today's CLOSE).
My first question is how to create this target variable without looking into future data: since the formula log(tomorrow's CLOSE / today's CLOSE) looks into the future, I want to shift the data frame back by one row, treat today as tomorrow and so on, and then calculate the target category (-1, 0, 1) based on the range_start, range_end and breaks I define.
My second question is how best to define these boundary values; I am taking them as -0.0015 and 0.0015 for now.
I need some comments and suggestions here, thanks.
library(dplyr)

masterDF_close <- masterDF %>% dplyr::select('Date', 'Close')

# Create a one-row matrix with as many columns as the data
temprow <- matrix(rep.int(NA, length(masterDF)), nrow = 1, ncol = length(masterDF))
# Make it a data.frame and give its columns the same names as the data
newrow <- data.frame(temprow)
colnames(newrow) <- colnames(masterDF)
# rbind the empty row on top of the full data
masterDF <- rbind(newrow, masterDF)

# Do the same for the close-only data, but append the empty row at the bottom
temprow2 <- matrix(rep.int(NA, length(masterDF_close)), nrow = 1, ncol = length(masterDF_close))
newrow2 <- data.frame(temprow2)
colnames(newrow2) <- colnames(masterDF_close)
masterDF_close <- rbind(masterDF_close, newrow2)

# Each row now has today's Close next to the following day's Close:
# shifting the data backwards treats today's Close as tomorrow's Close
masterDF['Close_unshifted'] <- masterDF_close$Close
masterDF['TargetIndicator'] <- log(masterDF$Close_unshifted / masterDF$Close)

masterDF <- masterDF[-1, ]
masterDF$TargetIndicator[is.na(masterDF$TargetIndicator)] <- 0

range_start <- -0.0015  # boundary values I am using for now
range_end   <-  0.0015

masterDF_ <- masterDF %>%
  mutate(category = cut(TargetIndicator,
                        breaks = c(-Inf, range_start, range_end, Inf),
                        labels = c(-1, 0, 1)))
These are the two operations I am doing in the code.
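For what it's worth, the row-padding can be avoided entirely with dplyr::lead. A minimal sketch of the same two operations, using the hypothetical cutoffs from the question:
library(dplyr)

range_start <- -0.0015  # hypothetical cutoffs, as in the question
range_end   <-  0.0015

masterDF_ <- masterDF %>%
  mutate(TargetIndicator = log(lead(Close) / Close),      # log(tomorrow / today)
         TargetIndicator = coalesce(TargetIndicator, 0),  # last row has no tomorrow
         category = cut(TargetIndicator,
                        breaks = c(-Inf, range_start, range_end, Inf),
                        labels = c(-1, 0, 1)))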
I am trying to calculate monthly means from daily values. My data has many missing values, and I want them to appear as NA in the output.
For example, this is the desired output:
"MM","YY","RR"
10,1961,NA
10,1962,NA
10,1963,NA
10,1964,NA
10,1965,NA
10,1966,NA
10,1967,NA
10,1968,NA
10,1969,NA
10,1970,NA
10,1971,14.8290322580645
10,1972,5.92903225806452
10,1973,7.10645161290323
10,1974,9.25806451612903
10,1975,6.13225806451613
10,1976,NA
10,1977,NA
10,1978,NA
10,1979,11.358064516129
10,1980,NA
10,1981,20.8354838709677
10,1982,NA
10,1983,NA
10,1984,7.4741935483871
10,1985,NA
10,1986,NA
10,1987,NA
10,1988,NA
10,1989,NA
10,1990,NA
10,1991,NA
10,1992,NA
10,1993,NA
10,1994,NA
10,1995,NA
10,1996,NA
10,1997,NA
10,1998,NA
10,1999,NA
10,2000,NA
10,2001,12.2548387096774
10,2002,7.19354838709677
10,2003,4.34193548387097
10,2004,8.09354838709677
10,2005,10.3354838709677
10,2006,5.49677419354839
10,2007,9.58709677419355
10,2008,NA
10,2009,NA
10,2010,17.4548387096774
The test data can be downloaded from this link:
Link to Data
I am using the aggregate function to calculate the mean. Below is my script:
library(plyr)

dat <- read.csv("test.csv", header = TRUE, sep = ",")
dat[dat == -999] <- NA
dat[dat == -888] <- 0

monthly_mean <- aggregate(RR ~ MM + YY, dat, mean)

# Filter October only
oct <- monthly_mean[which(monthly_mean$MM == 10), ]
dat2 <- as.data.frame(oct)

# monthly_mean <- ddply(dat, .(MM, DD), summarise, mean_r = mean(RR, na.rm = TRUE))

write.table(dat2, file = "test_oct.csv", sep = ",", col.names = TRUE, row.names = FALSE, na = "NA")
Problem: when I run this script, the missing years are removed as well.
I'll appreciate any suggestions on how to do this correctly in R.
You can retain the NA rows by changing the aggregate call to:
monthly_mean <- aggregate(RR ~ MM + YY, dat, mean, na.action = na.pass)
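A minimal check on toy data (hypothetical values): without na.action = na.pass the 1961 rows would be dropped entirely; with it, the year is kept and its mean stays NA.
dat <- data.frame(YY = rep(c(1961, 1962), each = 3),
                  MM = 10,
                  RR = c(NA, NA, NA, 1, 2, 3))
aggregate(RR ~ MM + YY, dat, mean, na.action = na.pass)
#   MM   YY RR
# 1 10 1961 NA
# 2 10 1962  2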
I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and that shouldn't be an issue to program. However, the lowest 2 HW and the lowest Quiz are dropped for each student. How could I program this in R? Note that the HW/Quiz dropped may be different for each student (i.e. Student A has HW2, HW5, Quiz2 dropped; Student B has HW4, HW8, Quiz1 dropped).
Here is a simpler solution. The sum_after_drop function takes a vector x, drops the i lowest scores, and sums up the remaining ones. We invoke this function for each row in the dataset. ddply is overkill for this job, but keeps things simple. You should be able to do this with apply, except that you would have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores.
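To make the snippet below self-contained, first build a toy dd (hypothetical scores; replace it with your real grade sheet):
dd <- data.frame(Name  = c("A", "B"),
                 HW1   = c(10, 4), HW2 = c(2, 9), HW3 = c(7, 8),
                 Quiz1 = c(9, 3),  Quiz2 = c(5, 7))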
library(plyr)

# Drop the i lowest scores in x, then sum the rest
sum_after_drop <- function(x, i){
  sum(sort(x)[-(1:i)])
}

nms <- names(dd)
dd2 <- ddply(dd, .(Name), function(d){
  hw = sum_after_drop(unlist(d[, grepl("HW", nms)]), 2)    # drop the 2 lowest HWs
  qz = sum_after_drop(unlist(d[, grepl("Quiz", nms)]), 1)  # drop the lowest quiz
  data.frame(hw = hw, qz = qz)
})
Here's a sketch of how you could approach it using the reshape2 package and base functions.
#sample data
set.seed(734)
dd<-data.frame(
Name=letters[1:20],
HW1=rpois(20,7),
HW2=rpois(20,7),
HW3=rpois(20,7),
Quiz1=rpois(20,15),
Quiz2=rpois(20,15),
Quiz3=rpois(20,15)
)
Now I convert it to long format and split apart the field names
require(reshape2)
mm <- melt(dd, "Name")
mm <- cbind(mm,
            colsplit(gsub("(\\w+)(\\d+)", "\\1:\\2", mm$variable, perl = TRUE), ":",
                     names = c("type", "number")))
Now I can use by() to get a data.frame for each name and do the rest of the calculations. Here I just drop the lowest homework and lowest quiz, and I give homework a weight of .2 and quizzes a weight of .8 (assuming all homeworks were worth 15 pts and quizzes 25 pts).
grades <- unclass(by(mm, mm$Name, function(x) {
  hw   <- tail(sort(x$value[x$type == "HW"]), -1)
  quiz <- tail(sort(x$value[x$type == "Quiz"]), -1)
  (sum(hw) * .2 + sum(quiz) * .8) / (length(hw) * 15 * .2 + length(quiz) * 25 * .8)
}))
attr(grades, "call") <- NULL  # get rid of crud from by()
grades
Let's check our work. Look at student "c":
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087
Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. it calculates the rank order of all of student 1's homeworks, etc.), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive, so you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars = "name", variable.name = "graded.name") %>%
  mutate(graded.type = factor(sub("\\d+", "", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score = rank(value, ties.method = "first")) %>%  # Rank all the scores
  filter((ranked.score > 2 & graded.type == "HW") |    # Ignore the lowest two HWs
         (ranked.score > 1 & graded.type == "Quiz") |  # Ignore the lowest quiz
         (graded.type == "Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %>%
  group_by(name, graded.type) %>%
  summarize(total = mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var="total")
final.grades
You technically could add the summarize(total = mean(value)) step to the grades.filtered data frame rather than making a separate grade.totals data frame; I separated them into multiple data frames for didactic reasons.
Here's a little piece of code I wrote to report variables with missing values from a data frame. I'm trying to think of a more elegant way to do this, one that perhaps returns a data.frame, but I'm stuck:
for (Var in names(airquality)) {
  missing <- sum(is.na(airquality[, Var]))
  if (missing > 0) {
    print(c(Var, missing))
  }
}
Edit: I'm dealing with data.frames with dozens to hundreds of variables, so it's key that we only report variables with missing values.
Just use sapply
> sapply(airquality, function(x) sum(is.na(x)))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
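Since the edit asks to report only the variables that actually have missing values, you can simply subset that result:
na_counts <- sapply(airquality, function(x) sum(is.na(x)))
na_counts[na_counts > 0]
#   Ozone Solar.R
#      37       7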
You could also use apply or colSums on the matrix created by is.na()
> apply(is.na(airquality),2,sum)
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
> colSums(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
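A closely related one-liner gives the proportion of missing values per column, since the mean of a logical vector counts each TRUE as 1:
> colMeans(is.na(airquality))
     Ozone    Solar.R       Wind       Temp      Month        Day
0.24183007 0.04575163 0.00000000 0.00000000 0.00000000 0.00000000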
My new favourite for (not too wide) data are the methods from the excellent naniar package. Not only do you get frequencies, but also patterns of missingness:
library(naniar)
library(UpSetR)
riskfactors %>%
as_shadow_upset() %>%
upset()
It's often useful to see where the missing values fall relative to the non-missing ones, which can be achieved by plotting a scatter plot with missings:
ggplot(airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
Or for categorical variables:
gg_miss_fct(x = riskfactors, fct = marital)
These examples are from the package vignette, which lists other interesting visualizations.
We can use map_df with purrr.
library(purrr)

# map_df applies the function to each column and returns a one-row tibble
map_df(airquality, function(x) sum(is.na(x)))
# A tibble: 1 × 6
# Ozone Solar.R Wind Temp Month Day
# <int> <int> <int> <int> <int> <int>
# 1 37 7 0 0 0 0
summary(airquality)
already gives you this information
The VIM package also offers some nice missing-data plots for data frames:
library("VIM")
aggr(airquality)
Another graphical alternative is the plot_missing function from the excellent DataExplorer package:
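The call itself is a one-liner:
library(DataExplorer)
plot_missing(airquality)  # bar chart of the share of missing values per feature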
The docs also point out that you can save the results for additional analysis with missing_data <- plot_missing(data).
More succinctly: sum(is.na(x[1]))
That is:
x[1]     looks at the first column
is.na()  is TRUE where a value is NA
sum()    counts TRUE as 1 and FALSE as 0
Another function that would help you look at missing data is df_status from the funModeling library:
library(funModeling)
iris.2 is the iris dataset with some added NAs. You can replace it with your dataset.
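For instance, a hypothetical iris.2 could be built like this:
# iris with a few NAs injected for illustration
iris.2 <- iris
set.seed(1)
iris.2[sample(nrow(iris.2), 10), "Sepal.Length"] <- NA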
df_status(iris.2)
This will give you the number and percentage of NAs in each column.
For one more graphical solution, the visdat package offers vis_miss:
library(visdat)
vis_miss(airquality)
Very similar to the Amelia output, with the small difference that it gives percentages of missing values out of the box.
I think the Amelia library does a nice job of handling missing data; it also includes a map for visualizing the missing rows.
install.packages("Amelia")
library(Amelia)
missmap(airquality)
You can also run the following code; it will return, for each row, a logical value indicating whether the row contains any NA:
row.has.na <- apply(training, 1, function(x){any(is.na(x))})
Another graphical and interactive way is to use the is.na10 function from the heatmaply library:
library(heatmaply)

heatmaply(is.na10(airquality), grid_gap = 1,
          showticklabels = c(TRUE, FALSE),
          k_col = 3, k_row = 3,
          margins = c(55, 30),
          colors = c("grey80", "grey20"))
It probably won't work well with large datasets, though.
A dplyr solution to get the count could be:
summarise_all(df, ~sum(is.na(.)))
Or to get a percentage:
summarise_all(df, ~(sum(is.na(.)) / nrow(df)))
It may also be worth noting that missing data can be ugly, inconsistent, and not always coded as NA, depending on the source or how it's handled when imported. The following function could be tweaked depending on your data and what you want to consider missing:
is_missing <- function(x){
  missing_strs <- c('', 'null', 'na', 'nan', 'inf', '-inf', '-9', 'unknown', 'missing')
  ifelse((is.na(x) | is.nan(x) | is.infinite(x)), TRUE,
         ifelse(trimws(tolower(x)) %in% missing_strs, TRUE, FALSE))
}
# sample ugly data
df <- data.frame(a = c(NA, '1', ' ', 'missing'),
                 b = c(0, 2, NaN, 4),
                 c = c('NA', 'b', '-9', 'null'),
                 d = 1:4,
                 e = c(1, Inf, -Inf, 0))
# counts:
> summarise_all(df, ~sum(is_missing(.)))
a b c d e
1 3 1 3 0 2
# percentage:
> summarise_all(df, ~(sum(is_missing(.) / nrow(df))))
a b c d e
1 0.75 0.25 0.75 0 0.5
If you want to do it for a particular column, you can also use this:
length(which(is.na(airquality[1])==T))
The ExPanDaR package's prepare_missing_values_graph function can be used to explore panel data:
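A minimal sketch, assuming the russell_3000 sample panel that ships with the package, with its period column as the time identifier:
library(ExPanDaR)

# Share of missing values per variable and period of the panel
prepare_missing_values_graph(russell_3000, ts_id = "period")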
For piping you could write:
# Counts
df %>% is.na() %>% colSums()
# % of missing rounded to 2 decimals
df %>% summarise_all(.funs = ~round(100*sum(is.na(.))/length(.),2))