My goal is to plot a map with each point representing the year of the highest measured value. So for that I need the year as one value and the Station Name as Row Name.
I get to the point where I get the year of the maximum value for each Station but don´t know how to get the station name as Row Name.
My example is the following:
set.seed(123)
df1<-data.frame(replicate(6,sample(0:200,2500,rep=TRUE)))
date_df1<-seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto<-cbind(date_df1, df1)
test_sto$date_df1<-as.Date(test_sto$date_df1)
test_sto<-test_sto%>% dplyr::mutate( year = lubridate::year(date_df1),
month = lubridate::month(date_df1),
day = lubridate::day(date_df1))
This is my Dataframe, i then applied the following steps:
To get all values above the treshold for each year and station:
test_year<-aggregate.data.frame(x=test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm=TRUE )
This works as it should, the nex is the following
m <- ncol(test_year)
Value <- rep(NA,m)
for (j in 2:m) {
idx<- which.max(test_year[,j])
Value[j] <- test_year[,1][idx]
}
test_test<-Value[2:m]
At the end of this, I get the following table:
x
1
1996
2
1996
3
1998
4
1996
5
1999
6
1999
But instead of the 1,2,3,4,5..I need there my Column Names (X1,X2,X3 etc.):
x
X1
1996
X2
1996
X3
1998
X4
1996
X5
1999
X6
1999
but this is the point where i´m struggeling.
I tried it with the following step:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But i´m just getting an error message saying:
in max(ncol(2:7)):
non not-missing Argument for max; give -Inf back<
Maybe someone knows a work around! Thanks in advance!
The 'test_test' is just a vector. Its magnitude characterized by length and is a one 1 dimensional object which doesn't have row.names attribute. But, we can have names attribute
names(test_test) <- colnames(test_year)[-1]
Related
I have 8 data frames and I want to create a variable for each of this data frame. I use a for a loop and the code I have used is given below:
year <- 2001
dflist <- list(bhps01, bhps02, bhps03, bhps04, bhps05, bhps06, bhps07, bhps08)
for (df in dflist){
df[["year"]] <- as.character(year)
assign()
year <- year + 1
}
bhps01,...,bhps08 are the data frame objects and year is a character variable. bhps01 is the data frame for year 2001, bhps02 is the data frame for year 2002 and so on.
Each data corresponds to a year, so bhps01 corresponds to year 2001, bhps corresponds to 2002 and so on. So, I want to create a year variable for each one of these data. So, year variable would be "2001" for bhps01 data, "2002" for bhps02 and so on.
The code runs fine but it does not create the variable year for either of the data frames except the local variable df.
Can someone please explain the error in the above code? Or is there an alternative of doing the same thing?
The syntax in the for loop is wrong. I am not entirely sure what you try to accomplish but let us try this
year = 2001
A = data.frame(a = c(1, 1), b = c(2, 2))
B = data.frame(a = c(1, 1), b = c(2, 2))
L = list(A, B)
for (i in seq_along(L)) {
L[[i]][, dim(L[[i]])[2] + 1] = as.character(rep(year,dim(L[[i]])[1]))
year = year + 1
}
with output
> L
[[1]]
a b V3
1 1 2 2001
2 1 2 2001
[[2]]
a b V3
1 1 2 2002
2 1 2 2002
That is what you intend as output, correct?
In order to change the column name to "year" you can do
L = lapply(L, function(x) {colnames(x)[3] = "year"; x})
You take a copy of the dataframe from the list, and add the variable "year" to it, but then do not assign it anywhere, which is why it is discarded (i.e. not stored in a variable). Here's a fix:
year <- 2001
dflist <- list(bhps01, bhps02, bhps03, bhps04, bhps05, bhps06, bhps07, bhps08)
counter <- 0
for (df in dflist){
counter <- counter + 1
df[["year"]] <- as.character(year)
dflist[[counter]] <- df
year <- year + 1
}
If you want the original dataframes to be edited, you could assign the result back on the rather then into the list. This is a bit of an indirect route, and notice the change in creating the dflist with names. We create the df, and then assign it to the original name. For example:
year <- 2001
dflist <- list(bhps01 = bhps01, bhps02 = bhps02, bhps03 = bhps03, bhps04 = bhps04, bhps05 = bhps05, bhps06 = bhps06, bhps07 = bhps07, bhps08 = bhps08)
counter <- 0
for (df in dflist){
counter <- counter + 1
df[["year"]] <- as.character(year)
dflist[[counter]] <- df
assign(names(dflist)[counter], df)
year <- year + 1
}
My dataset looks something like this
ID YOB ATT94 GRADE94 ATT96 GRADE96 ATT 96 .....
1 1975 1 12 0 NA
2 1985 1 3 1 5
3 1977 0 NA 0 NA
4 ......
(with ATTXX a dummy var. denoting attendance at school in year XX, GRADEXX denoting the school grade)
I'm trying to create a dummy variable that = 1 if an individual is attending school when they are 19/20 years old. e.g. if YOB = 1988 and ATT98 = 1 then the new variable = 1 etc. I've been attempting this using mutate in dplyr but I'm new to R (and coding in general!) so struggle to get anything other than an error any code I write.
Any help would be appreciated, thanks.
Edit:
So, I've just noticed that something has gone wrong, I changed your code a bit just to add another column to the long format data table. Here is what I did in the end:
df %>%
melt(id = c("ID", "DOB") %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
so it looks something like e.g.
ID YOB VARIABLE VALUE dummy
1 1979 ATT94 1994 1
1 1979 ATT96 1996 1
1 1979 ATT98 0 0
2 1976 ATT94 0 0
2 1976 ATT96 1996 1
2 1976 ATT98 1998 1
i.e. whenever the ATT variables take a value other than 0 the dummy = 1, even if they're not 19/20 years old. Any ideas what could be going wrong?
On my phone so I can't check this right now but try:
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Edit: The above approach will create the column but when the condition does not hold it will be equal to NA
As #Greg Snow mentions, this approach assumes that the column was already created and is equal to zero initially. So you can do the following to get your dummy variable:
df$dummy <- rep(0, nrow(df))
df$dummy[df$DOB==1988 & df$ATT98==1] <- 1
Welcome to the world of code! R's syntax can be tricky (even for experienced coders) and dplyr adds its own quirks. First off, it's useful when you ask questions to provide code that other people can run in order to be able to reproduce your data. You can learn more about that here.
Are you trying to create code that works for all possible values of DOB and ATTx? In other words, do you have a whole bunch of variables that start with ATT and you want to look at all of them? That format is called wide data, and R works much better with long data. Fortunately the reshape2 package does exactly that. The code below creates a dummy variable with a value of 1 for people who were in school when they were either 19 or 20 years old.
# Load libraries
library(dplyr)
library(reshape2)
# Create a sample dataset
ATT94 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT96 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
ATT98 <- runif(500, min = 0, max = 1) %>% round(digits = 0)
DOB <- rnorm(500, mean = 1977, sd = 5) %>% round(digits = 0)
df <- cbind(DOB, ATT94, ATT96, ATT98) %>% data.frame()
# Recode ATTx variables with the actual year
df$ATT94[df$ATT94==1] <- 1994
df$ATT96[df$ATT96==1] <- 1996
df$ATT98[df$ATT98==1] <- 1998
# Melt the data into a long format and perform requested analysis
df %>%
melt(id = "DOB") %>%
tbl_df() %>%
mutate(dummy = ifelse(value - DOB %in% c(19,20), 1, 0))
#Warner shows a way to create the variable (or at least the 1's the assumption is the column has already been set to 0). Another approach is to not explicitly create a dummy variable, but have it created for you in the model syntax (what you asked for is essentially an interaction). If running a regression, this would be something like:
fit <- lm( resp ~ I(DOB==1988):I(ATT98==1), data=df )
or
fit <- lm( resp ~ I( (DOB==1988) & (ATT98==1) ), data=df)
I have two columns in two data frames, where the longer one includes all elements of the other column. Now I want to delete elements in the longer column that do not overlap with the other, together with the corresponding row. I identified the "difference" using:
diff <- setdiff(gdp$country, tfpg$country)
and I tried to use two FOR loops to get this done:
for (i in 1:28) { for(j in 1:123) {if(diff[i] == gdp$country[j]) {gdp <- gdp[-c(j),]}}}
where 28 is the number of rows I want to delete (length of diff) and 123 is the length of the longer column. This does not work, the error message:
Error in if (diff[i] == gdp$country[j]) { :
missing value where TRUE/FALSE needed
So how do I fix this? Or is there a better way to do this?
Thank you very much.
I have a data frame called "gdp" here:
country wto y1990 y1991 y1992
Austria 1995 251540 260197 265644
Belgium 1995 322113 328017 333038
Cyprus 1995 14436 14537 15898
Denmark 1995 177089 179392 182936
Finland 1995 149584 140737 136058
France 1995 1804032 1822778 1851937
There are 123 rows.
I would like to delete rows with country names specified in another vector:
diff ["Austria","China",...,"Yemen"]
there is a better way! What you're describing is the equivalent of a left join, or inner join. But in R the way to achieve it is using the merge command:
## S3 method for class 'data.frame'
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"),
incomparables = NULL, ...)
In your case:
merge(gdp, tfpg, by = intersect('country', 'country'))
E.g.
x = data.frame(foo = c(1,2,3,4,5), bar=c("A","B","C","D","E"))
y = data.frame(baz = c(6,7,8,9), bar=c("A","C","E","F"))
z = merge(x,y,by=intersect('bar','bar'))
gives
bar foo baz
1 A 1 6
2 C 3 7
3 E 5 8
I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years +
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular values of the dummy.summary dataframe. But that clearly isn't happening and is only returning zero for each row, presumably because the starting value of 'x' is zero each time. I don't know if it is possible to access the index position of each value that sapply loops over, and use that to index my summary dataframe.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to acheive what I'm trying to do?
Thanks in anticipation.
I've just updated your output.year to output.year2 where instead of a value from 1 to 7 it gets a value of a year based on the cohort you have.
I've realised that the output information you want corresponds to the output.year, but the intake information you want corresponds to the cohort. So, I calculate them separately and then I join tables/information. This automatically creates empty (NA that I transform to 0) output info for 1998.
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
dummy.lookup$output[dummy.lookup$yr %in% 1:2] <- 0
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of intake
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
Hope it is not a too newbie question.
I am trying to subset rows from the GDP UK dataset that can be downloaded from here:
http://www.ons.gov.uk/ons/site-information/using-the-website/time-series/index.html
The dataframe looks more or less like that:
X ABMI
1 1948 283297
2 1949 293855
3 1950 304395
....
300 2013 Q2 381318
301 2013 Q3 384533
302 2013 Q4 387138
303 2014 Q1 390235
The thing is that for my analysis I only need the data for years 2004-2013 and I am interested in one result per year, so I wanted to get every fourth row from the dataset that lies between the 263 and 303 row.
On the basis of the following websites:
https://stat.ethz.ch/pipermail/r-help/2008-June/165634.html
(plus a few that i cannot quote due to the link limit)
I tried the following, each time getting some error message:
> GDPUKodd <- seq(GDPUKsubset[263:302,], by = 4)
Error in seq.default(GDPUKsubset[263:302, ], by = 4) :
argument 'from' musi mieæ d³ugoœæ 1
> OddGDPUK <- GDPUK[seq(263, 302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(263, 302, by = 4)) :
undefined columns selected
> OddGDPUKprim <- GDPUK[seq(263:302), by = 4]
Error in `[.data.frame`(GDPUK, seq(263:302), by = 4) :
unused argument (by = 4)
> OddGDPUK <- GDPUK[seq(from=263, to=302, by = 4)]
Error in `[.data.frame`(GDPUK, seq(from = 263, to = 302, by = 4)) :
undefined columns selected
> OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to=GDPUK[302,] by = 4)]
Error: unexpected symbol in "OddGDPUK <- GDPUK[seq(from=GDPUK[263,] to"
> GDPUK[seq(1,nrows(GDPUK),by=4),]
Error in seq.default(1, nrows(GDPUK), by = 4) :
could not find function "nrows"
To put a long story short: help!
Instead of trying to extract data based on row ids, you can use the subset function with appropriate filters based on the values.
For example if your data frame has a year column with values 1948...2014 and a quarter column with values Q1..Q4, then you can get the right subset with:
subset(data, year >= 2004 & year <= 2013 & quarter == 'Q1')
UDATE
I see your source data is dirty, with no proper year and quarter columns. You can clean it like this:
x <- read.csv('http://www.ons.gov.uk/ons/datasets-and-tables/downloads/csv.csv?dataset=pgdp&cdid=ABMI')
x$ABMI <- as.numeric(as.character(x$ABMI))
x$year <- as.numeric(gsub('[^0-9].*', '', x$X))
x$quarter <- gsub('[0-9]{4} (Q[1-4])', '\\1', x$X)
subset(x, year >= 2004 & year <= 2013 & quarter == 'Q1')
Your code GDPUK[seq(1,nrows(GDPUK),by=4),] actually works quite well for these purposes. The only thing you need to change is nrow for nrows.