I have constructed a dataset from the GSS data (https://gss.norc.org/), grouping the years into decades:
env_data <- select(gss, year, sex, degree, natenvir) %>% na.omit()
env_datadecades <- env_data %>%
mutate(decade=as.factor(ifelse(year<1980,
"70s",
ifelse(year>1980 & year<=1990,
"80s",
ifelse(year>1990 & year<2000, "90s", "00s")))))
I want to plot it with ggplot2 and facet_grid(), but the facets appear in the wrong order, so I reordered the factor levels the way I had seen it done elsewhere:
set.seed(6809)
env_datadecades$decade <- factor(env_datadecades$decade,
levels = c("Seventies", "Eighties", "Nineties", "Twothous"))
It worked the first time but when I try to run the code again I get NA for all data in decade. What is happening?
I made a simple dataset of years:
library(dplyr)
df <- data.frame(Years = sample(1970:2010, 20, replace = TRUE))
Then convert the years into the required factor like this:
df <- df %>%
mutate(Decades = case_when(Years < 1980 ~ "Seventies",
1980 <= Years & Years < 1990 ~ "Eighties",
1990 <= Years & Years < 2000 ~ "Nineties",
2000 <= Years ~ "TwoThousands"))
df$Decades <- factor(df$Decades, levels = c("Seventies", "Eighties", "Nineties", "TwoThousands"), ordered = TRUE)
and now try faceting.
I think the problem with your code is that the values you create in the first step ("70s", "80s", "90s", "00s") do not match the level names you pass to factor() in the second step ("Seventies", "Eighties", ...). Any value that is not among the given levels becomes NA. Stick to the same set of names in both places, and it should work.
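The effect is easy to reproduce on a small vector. The sketch below uses a made-up decade vector (not the GSS data) and only base R; the variable names mismatched and matched are illustrative:

```r
# Values as produced by the ifelse() step in the question
decade <- c("70s", "80s", "90s", "00s", "80s")

# Levels that match none of the existing values: every element becomes NA
mismatched <- factor(decade,
                     levels = c("Seventies", "Eighties", "Nineties", "Twothous"))
print(mismatched)

# Levels that match the existing values: nothing is lost, and the order
# given here is the order facet_grid() would use for the panels
matched <- factor(decade, levels = c("70s", "80s", "90s", "00s"))
print(matched)
```

So reordering only works when levels lists the values that are actually present in the column.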
Related
I am trying to make a new column in my data.frame based on another column.
My data frame is called dat.cp2 and one column in it has a certain year from 1990-2017
Here you can see how my data looks. The "ar" column states a year.
I need to make a new column called "TB" with periods. For example, period one is 1990-1996 and I want that period to be called "TB1", 1997-2003 is "TB2", etc. So for a person born in 1995 the new column says "TB1".
I tried:
dat.cp2 %>% mutate(TB =
case_when(ar <=1996 ~ "TB1",
ar >=1997&<=2003 ~ "TB2",
ar >=2004&<=2010 ~ "TB3",
ar >=2011 ~ "TB4")
But I get this error message:
Error: unexpected '<=' in:
" case_when(ar <=1996 ~ "TB1",
ar >=1997&<="
I have tried looking for answers but can't find any. Can anyone help?
The syntax &<= may be acceptable in some other languages, but in R each side of & must be a complete expression, so ar has to appear in both comparisons:
library(dplyr)
dat.cp2 %>%
mutate(TB =
case_when(ar <=1996 ~ "TB1",
ar >=1997 & ar <=2003 ~ "TB2",
ar >=2004 & ar <=2010 ~ "TB3",
ar >=2011 ~ "TB4"))
NOTE: There are many ways to simplify this; the code above is only meant to show where the mistake in the OP's code is.
You don't actually need the & because case_when() evaluates its conditions in order and uses the first one that matches. You can also finish with TRUE as a catch-all:
dat.cp2 %>%
mutate(
TB = case_when(ar <= 1996 ~ 'TB1',
ar <= 2003 ~ 'TB2',
ar <= 2010 ~ 'TB3',
TRUE ~ 'TB4')
)
You could also do:
dat.cp2 %>%
mutate(TB = cut(ar, breaks = c(1989,1996, 2003, 2010, 2017),
labels = c("TB1", "TB2","TB3","TB4")))
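To see which period each boundary year lands in, here is a small base R sketch with a made-up ar vector (the values are illustrative, not the OP's data). cut() uses right-closed intervals by default, so 1996 falls in TB1 and 1997 in TB2:

```r
# Made-up years covering every boundary of the four periods
ar <- c(1990, 1996, 1997, 2003, 2004, 2010, 2011, 2017)

# Intervals are (1989,1996], (1996,2003], (2003,2010], (2010,2017]
TB <- cut(ar,
          breaks = c(1989, 1996, 2003, 2010, 2017),
          labels = c("TB1", "TB2", "TB3", "TB4"))

print(data.frame(ar, TB))
```

Years above the last break would come out as NA; replacing 2017 with Inf is one way to leave the last period open-ended.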
I would like to select cases whose values in some variables are above the corresponding third quartile (Q3).
As my dataset is very large, I am going to use the airquality dataset that ships with R as an example.
df <- airquality[complete.cases(airquality),]
The objective is to filter by certain columns ('Ozone', 'Solar.R', 'Wind', 'Temp').
So far I have come up with this solution:
filtro_Ozone = df$Ozone>quantile(df$Ozone)[4]
filtro_Solar.R = df$Solar.R>quantile(df$Solar.R)[4]
filtro_Wind = df$Wind>quantile(df$Wind)[4]
filtro_Temp = df$Temp>quantile(df$Temp)[4]
df[filtro_Ozone & filtro_Solar.R & filtro_Wind & filtro_Temp,]
With which I obtain:
Ozone Solar.R Wind Temp Month Day
40 71 291 13.8 90 6 9
Is there a more elegant way to get this?
UPDATE: per the OP's updated request, you can use filter_at() from dplyr to filter on selected variables only. A row is kept only if the condition holds for every listed variable:
library(dplyr)
df <- airquality[complete.cases(airquality),]
filter_at(df, vars(Ozone, Solar.R, Wind, Temp), ~ . > quantile(., probs = 0.75))
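For comparison, the same filter can be sketched in base R without dplyr. This is only one possible way to write it, and the names cols, above_q3 and result are illustrative:

```r
df <- airquality[complete.cases(airquality), ]
cols <- c("Ozone", "Solar.R", "Wind", "Temp")

# One logical column per variable: TRUE where the value exceeds that
# variable's own third quartile
above_q3 <- sapply(df[cols], function(x) x > quantile(x, probs = 0.75))

# Keep the rows where all selected variables are above their Q3
result <- df[rowSums(above_q3) == length(cols), ]
print(result)
```

Adding or removing a variable then only means editing cols.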
I am trying to calculate monthly means from daily values. My data has many missing values, and I want the months affected by them to show up as NA in the output instead of being dropped.
For example this is the desired output:
"MM","YY","RR"
10,1961,NA
10,1962,NA
10,1963,NA
10,1964,NA
10,1965,NA
10,1966,NA
10,1967,NA
10,1968,NA
10,1969,NA
10,1970,NA
10,1971,14.8290322580645
10,1972,5.92903225806452
10,1973,7.10645161290323
10,1974,9.25806451612903
10,1975,6.13225806451613
10,1976,NA
10,1977,NA
10,1978,NA
10,1979,11.358064516129
10,1980,NA
10,1981,20.8354838709677
10,1982,NA
10,1983,NA
10,1984,7.4741935483871
10,1985,NA
10,1986,NA
10,1987,NA
10,1988,NA
10,1989,NA
10,1990,NA
10,1991,NA
10,1992,NA
10,1993,NA
10,1994,NA
10,1995,NA
10,1996,NA
10,1997,NA
10,1998,NA
10,1999,NA
10,2000,NA
10,2001,12.2548387096774
10,2002,7.19354838709677
10,2003,4.34193548387097
10,2004,8.09354838709677
10,2005,10.3354838709677
10,2006,5.49677419354839
10,2007,9.58709677419355
10,2008,NA
10,2009,NA
10,2010,17.4548387096774
The test data can be downloaded from this link:
Link to Data
I am using the aggregate function to calculate the mean
Below is my script:
library(plyr)
dat<- read.csv("test.csv",header=TRUE,sep=",")
dat[dat == -999]<- NA
dat[dat == -888]<- 0
monthly_mean<-aggregate(RR ~ MM + YY,dat,mean)
# Filter October only
oct<-monthly_mean[which(monthly_mean$MM == 10),]
dat2 <- as.data.frame(oct)
#monthly_mean <- ddply(dat,.(MM, DD), summarise, mean_r =
# mean(RR,na.rm=TRUE))
write.table(dat2,file="test_oct.csv",sep=",",col.names=T,row.names=F, na="NA")
Problems:
[1] When I ran this script, the missing years are also removed.
I'll appreciate any suggestions on how to do this correctly in R.
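You can retain the NA rows by passing na.action=na.pass to aggregate(). By default the formula method drops every row with a missing value before the means are computed, so a month with no valid observations disappears from the result: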
monthly_mean<-aggregate(RR ~ MM + YY,dat,mean,na.action=na.pass)
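The difference is visible on a tiny made-up data frame (the values below are illustrative and not taken from test.csv):

```r
# October 1970 has only missing values, October 1971 is complete
dat <- data.frame(
  MM = c(10, 10, 10, 10),
  YY = c(1970, 1970, 1971, 1971),
  RR = c(NA, NA, 14, 16)
)

# Default na.action (na.omit): rows with NA are removed first,
# so 1970 vanishes from the output
dropped <- aggregate(RR ~ MM + YY, dat, mean)
print(dropped)

# na.pass keeps the rows; mean() then returns NA for 1970
kept <- aggregate(RR ~ MM + YY, dat, mean, na.action = na.pass)
print(kept)
```

Note that with na.pass a month containing even a single NA also gets an NA mean, because mean() is called without na.rm = TRUE.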
My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT 96 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that equals 1 if an individual is attending school when they are 19 or 20 years old, e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1, etc. I've been attempting this using mutate in dplyr, but I'm new to R (and coding in general!), so I struggle to get anything other than an error from any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong. I changed your code a bit to add another column to the long-format data table. Here is what I did in the end:
df %>%
melt(id = c("ID", "DOB")) %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column, but wherever the condition does not hold the value will be NA.
As @Greg Snow mentions, this approach assumes that the column already exists and is initially equal to zero. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
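If it helps, the two steps can be collapsed into one by converting the logical condition directly to 0/1. A minimal sketch on a made-up three-row data frame (the rows are illustrative only):

```r
# Made-up data: only the first row meets both conditions
df <- data.frame(DOB = c(1988, 1988, 1975), ATT98 = c(1, 0, 1))

# TRUE/FALSE becomes 1/0, so no separate initialisation with 0 is needed
df$dummy <- as.integer(df$DOB == 1988 & df$ATT98 == 1)
print(df)
```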
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package can convert one into the other. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
df %>%
melt(id = "DOB") %>%
tbl_df() %>%
# Parentheses are required: %in% binds more tightly than -
mutate(dummy = ifelse((value - DOB) %in% c(19,20), 1, 0))
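Operator precedence matters in an expression like value - DOB %in% c(19, 20): R evaluates %in% before binary minus, so without parentheses the test is value - (DOB %in% c(19, 20)). That would explain a dummy that is 1 whenever the ATT value is non-zero. A small base R sketch with made-up vectors (wrong and right are illustrative names):

```r
# Made-up rows: ages at the recorded year are 19, 36, and "not attending" (0)
value <- c(1994, 1996, 0)
DOB   <- c(1975, 1960, 1976)

# Parsed as value - (DOB %in% c(19, 20)). No birth year is 19 or 20, so this
# is just `value`, and ifelse() treats every non-zero number as TRUE
wrong <- ifelse(value - DOB %in% c(19, 20), 1, 0)

# With parentheses the age is computed first and then checked
right <- ifelse((value - DOB) %in% c(19, 20), 1, 0)

print(rbind(wrong, right))
```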
@Warner shows a way to create the variable (or at least the 1s; the assumption is that the column has already been set to 0). Another approach is not to create a dummy variable explicitly, but to have it created for you in the model syntax (what you asked for is essentially an interaction). If you are running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
I have a csv file named "table_parameter". Please, download from here. Data look like this:
time avg.PM10 sill range nugget
1 2012030101 52.2692307692308 0.11054330 45574.072 0.0372612157
2 2012030102 55.3142857142857 0.20250974 87306.391 0.0483153769
3 2012030103 56.0380952380952 0.17711558 56806.827 0.0349567088
4 2012030104 55.9047619047619 0.16466350 104767.669 0.0307528346
.
.
.
25 2012030201 67.1047619047619 0.14349774 72755.326 0.0300378129
26 2012030202 71.6571428571429 0.11373430 72755.326 0.0320594776
27 2012030203 73.352380952381 0.13893530 72755.326 0.0311135434
28 2012030204 70.2095238095238 0.12642303 29594.037 0.0281416079
.
.
In my data frame there is a variable named time that contains hourly values from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
From this dataset I want to create (24*11) data frames, one per cell of the table below:
For example, for 1 a.m. (2012030101, 2012030201, ..., 2012030701) and avg.PM10<10, I want one data frame. Some of these data frames will probably contain no observations, but that is okay, because I will be working with a very large dataset.
I can do this subsetting manually by writing (24*11 = 264) lines of code like this:
table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))
par_1am_0to10 <-subset(table_par,times ==1 & avg.PM10<=10)
par_1am_10to20 <-subset(table_par,times ==1 & avg.PM10>10 & avg.PM10<=20)
par_1am_20to30 <-subset(table_par,times ==1 & avg.PM10>20 & avg.PM10<=30)
.
.
.
par_24pm_80to90 <-subset(table_par,times ==24 & avg.PM10>80 & avg.PM10<=90)
par_24pm_90to100 <-subset(table_par,times==24 & avg.PM10>90 & avg.PM10<=100)
par_24pm_100up <-subset(table_par,times ==24 & avg.PM10>100)
But I understand this code is very inefficient. Is there a way to do this efficiently with a loop?
FYI: in the future I want to draw some plots using these (24*11) datasets.
Update: After this subsetting, I want to draw boxplots of the range variable for every dataset. The problem is that I want to show all (24*11) boxplots [like above figure] in one plot, arranged like a matrix. If you have any further questions, please let me know. Thanks a lot in advance.
You can do this using some plyr, dplyr and tidyr magic :
library(tidyr)
library(dplyr)
# I am not loading plyr here because it interferes with dplyr; I only need it for the round_any function anyway
# Read data
dfData <- read.csv("table_parameter.csv")
dfData %>%
# Extract hour and compute the rounded Avg.PM10 using round_any
mutate(hour = as.numeric(substr(time, 9, 10)),
roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>%
# Keep only the relevant columns
select(hour, roundedPM.10) %>%
# Count the number of occurrences per hour
count(roundedPM.10, hour) %>%
# Use spread (from tidyr) to transform it into wide format
spread(hour, n)
If you plan on using ggplot2, you can forget about tidyr and the last line of the code in order to keep the data frame in long format; it will be easier to plot that way.
EDIT: After reading your comment, I realised I misunderstood your question. This will give you a boxplot for each combination of hour and Avg.PM10 interval:
library(tidyr)
library(dplyr)
library(ggplot2)
# I am not loading plyr here because it interferes with dplyr; I only need it
# for the round_any function anyway
# Read data
dfData <- read.csv("table_parameter.csv")
dfDataPlot <- dfData %>%
# Extract hour and compute the rounded Avg.PM10 using round_any
mutate(hour = as.numeric(substr(time, 9, 10)),
roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>%
# Keep only the relevant columns
select(roundedPM.10, hour, range)
# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) +
geom_boxplot() +
facet_grid(roundedPM.10~.)
How about a double loop like this:
table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))
#create empty dataframe for output
sub.df <- data.frame(name=NA, X=NA, time=NA,Avg.PM10=NA,sill=NA,range=NA,nugget=NA)[numeric(0), ]
t_list=seq(1,24,1)
PM_list=seq(0,100,10)
for (t in t_list){
  #t=t_list[1]
  for (PM in PM_list){
    #PM=PM_list[4]
    PM2=PM+10
    sub <-subset(table_par,times ==t & Avg.PM10>PM & Avg.PM10<=PM2)
    if (length(sub$X)!=0) { #to avoid errors because of empty sub
      name = paste("par_",t,"am_",PM,"to",PM2 , sep="")
      sub$name = name
      sub.df <- rbind(sub.df , sub)
    }
  }
}
sub.df #print data frame
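If separate data frames per hour and PM10 band are what is needed, a loop-free alternative is to let split() build them. The sketch below is base R only and runs on a few made-up rows standing in for table_parameter.csv; the column name Avg.PM10 follows the answers above, and the band labels are an assumption about how the subsets should be named:

```r
# Made-up rows standing in for table_parameter.csv (values are illustrative)
table_par <- data.frame(
  time     = c(2012030101, 2012030201, 2012030102, 2012030124),
  Avg.PM10 = c(5.2, 55.3, 104.7, 87.0),
  range    = c(45574, 87306, 56806, 29594)
)

# Hour of day from the last two digits of the time stamp
hour <- as.numeric(substr(table_par$time, 9, 10))

# Eleven bands: (0,10], (10,20], ..., (90,100], and everything above 100
pm_bin <- cut(table_par$Avg.PM10,
              breaks = c(seq(0, 100, by = 10), Inf),
              labels = c(paste0(seq(0, 90, 10), "to", seq(10, 100, 10)), "100up"))

# One data frame per hour/band combination; drop = TRUE skips empty ones
subsets <- split(table_par, list(hour = hour, pm = pm_bin), drop = TRUE)
print(names(subsets))
```

Each element is then available by name, for example subsets[["1.0to10"]], and the list can be passed to lapply() for plotting or further processing.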