My goal is to plot a map in which each point represents the year of the highest measured value. For that I need the year as the value and the station name as the row name.
I can get the year of the maximum value for each station, but I don't know how to get the station name as the row name.
My example is the following:
library(dplyr)  # provides %>% and mutate()
set.seed(123)
df1 <- data.frame(replicate(6, sample(0:200, 2500, replace = TRUE)))
date_df1 <- seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto <- cbind(date_df1, df1)
test_sto$date_df1 <- as.Date(test_sto$date_df1)
test_sto <- test_sto %>% dplyr::mutate(year = lubridate::year(date_df1),
                                       month = lubridate::month(date_df1),
                                       day = lubridate::day(date_df1))
This is my data frame. I then applied the following steps.
To count the values above the threshold for each year and station:
test_year<-aggregate.data.frame(x=test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm=TRUE )
This works as it should. The next step is the following:
m <- ncol(test_year)
Value <- rep(NA,m)
for (j in 2:m) {
idx<- which.max(test_year[,j])
Value[j] <- test_year[,1][idx]
}
test_test<-Value[2:m]
At the end of this, I get the following table:
  x
1 1996
2 1996
3 1998
4 1996
5 1999
6 1999
But instead of 1, 2, 3, 4, 5, ... I need my column names there (X1, X2, X3, etc.):
   x
X1 1996
X2 1996
X3 1998
X4 1996
X5 1999
X6 1999
This is the point where I'm struggling.
I tried the following:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But I just get a warning saying:
In max(ncol(2:7)) :
  no non-missing arguments to max; returning -Inf
Maybe someone knows a workaround! Thanks in advance!
test_test is just a vector: a one-dimensional object whose size is given by its length, so it has no row.names attribute. It can, however, have a names attribute:
names(test_test) <- colnames(test_year)[-1]
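As a minimal sketch of the same idea without the loop, sapply() keeps the column names, so the result is already a named vector. The object names max_year and year are only illustrative, and the year is taken with format() instead of lubridate to keep the example in base R:

```r
set.seed(123)
df1 <- data.frame(replicate(6, sample(0:200, 2500, replace = TRUE)))
dates <- seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
year <- as.integer(format(dates, "%Y"))

# number of values above the threshold per year and station
test_year <- aggregate(df1 > 120, by = list(year = year), FUN = sum)

# for each station column, pick the year with the highest count;
# sapply() carries the column names over to the result
max_year <- sapply(test_year[-1], function(col) test_year$year[which.max(col)])
max_year
```

If a one-column table is needed for the map, data.frame(x = max_year) uses those names as row names.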
I have data on individuals for whom I know the age (18-98) and the country they are from (1 = Germany, 2 = France, etc.).
I have another variable on which I want to see the effect of, e.g., 18-year-old people from Germany.
With dummy(data$age, sep='_') and dummy(data$sg2, sep='_') (sg2 is the country) I was able to create dummy variables for these two variables.
But in the regression, the output now shows the effect of every age and every country separately.
How can I combine 18-year-olds from Germany so that I see their joint effect on the other variable?
#Dummy age
x1 <- dummy(dat$age, sep='_')
#Dummy country
x2 <- dummy(dat$sg2, sep = '_')
fm <- lm(myvariable~x1+x2, data=dat)
summary(fm)
         Estimate Std. Error t value Pr(>|t|)
x1age_18 1.547691   0.567995   2.725 0.006437 **
x1age_19 1.632648   0.567939   2.875 0.004047 **
x2sg2_1  0.083239   0.030118   2.764 0.005717 **
x2sg2_2  0.056555   0.030655   1.845 0.065063 .
This is what I get, but how can I get x1age_18 and x2sg2_1 combined into one term?
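One possible way to look at this, sketched on made-up data: either build a single indicator for "age 18 and country 1", or let lm() create the interaction of the two factors. The data below is random and the column age18_ger is a hypothetical name; only age, sg2 and myvariable come from the question.

```r
set.seed(1)
# made-up data with the same column names as in the question
dat <- data.frame(
  age = sample(18:20, 200, replace = TRUE),
  sg2 = sample(1:2, 200, replace = TRUE)
)
dat$myvariable <- rnorm(200)

# option 1: one dummy that is 1 only for 18-year-olds from country 1
dat$age18_ger <- as.integer(dat$age == 18 & dat$sg2 == 1)
fm <- lm(myvariable ~ age18_ger, data = dat)

# option 2: treat both as factors and include their interaction
fm_int <- lm(myvariable ~ factor(age) * factor(sg2), data = dat)

coef(fm)
```

Option 1 compares that one group against everyone else, while option 2 estimates a separate term for every age/country combination, so which one fits depends on the question being asked.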
I hope you can help me with this problem: for my work I have to use R to analyze survey data. The data has a number of columns by which I want to group the data and then do some calculations, e.g. how many men and women work in a certain department, and then calculate the number and percentage for each group. For example: 42 people work in department A, of whom 30 are women and 12 are men; 70 people work in department B, of whom 26 are women and 44 are men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
  library(plyr)
  descriptive <- ddply( data, column_name,
    function(x){
      percentage_median_per_group(x, column_name)
      percentage_median_per_group(x, column_name2)
    }
  )
  print(data.frame(descriptive))
}
## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
  library(plyr)
  descriptive <- ddply( data, column_name3,
    function(x){
      c(
        N <- nrow(x[column_name3]), #number
        pct <- (N/nrow(data))*100 #percentage
        #TODO: median
      )
    }
  )
  return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
Now the data outputs like this:
Department  Sex  N   % per sex
A           f    30  71,4
            m    12  28,6
B           f    26  37,1
            m    44  62,9
But I want the output to look like this, so that the calculations are done and printed at each grouping level:
Department  N   % per department  Sex  N   % per sex
A           42  37,5              f    30  71,4
                                  m    12  28,6
B           70  62,5              f    26  37,1
                                  m    44  62,9
Does anyone have a suggestion for how I can do that? Ideally it would be dynamic, so that I can group by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would already be happy with the output in the example =)
Thanks!
EDIT
You can use this to generate example data:
n=100
sample_data = data.frame(department=sample(1:20,n,replace=TRUE), gender=sample(1:2,n,replace=TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (the number) and V2 for the percentage.
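A rough base R sketch of the desired layout, assuming the sample data above (with a fixed seed added here so it is reproducible). It counts each department/gender combination once and then derives the department totals and both percentages from those counts; the column names N_gender, N_dept, pct_dept and pct_gender are made up for the illustration:

```r
set.seed(42)
n <- 100
sample_data <- data.frame(department = sample(1:20, n, replace = TRUE),
                          gender = sample(1:2, n, replace = TRUE))

# one row per department/gender combination with its count
res <- as.data.frame(table(department = sample_data$department,
                           gender = sample_data$gender),
                     responseName = "N_gender")
res <- res[res$N_gender > 0, ]

# department totals, repeated on every row of that department
res$N_dept <- ave(res$N_gender, res$department, FUN = sum)
res$pct_dept <- 100 * res$N_dept / sum(res$N_gender)   # % of all respondents
res$pct_gender <- 100 * res$N_gender / res$N_dept      # % within department
res <- res[order(res$department, res$gender), ]
res
```

The department columns repeat on each gender row instead of being blanked out as in the wished-for printout. Adding further grouping columns would mean adding them to table() and repeating the ave() step for each level.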
I am stuck trying to compute the variance accounted for by trend (r-squared) for several groups within a single data set.
My data is structured like this:
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750 " , header = T)
and I am trying to return something like this:
STA R-sq
a   n1
b   n2
c   n3
where n# is the r-squared value for that station's data in the original set.
I have tried
fit <- lm(VALUE ~ YEAR + STA, data = x)
to model the yearly trend of VALUE for each individual station over the years for which data is available, within the master data set.
Any help would be greatly appreciated. I am really stumped on this one, and I know it is just a matter of familiarity with R.
To get r-squared for VALUE ~ YEAR for each group of STA, you can take this previous answer, modify it slightly and plug in your values:
library(plyr)
# assuming x is your data frame (make sure you don't have Hmisc loaded, it will interfere)
models_x <- dlply(x, "STA", function(df)
  summary(lm(VALUE ~ YEAR, data = df)))
# extract the r.squared values
rsqds <- ldply(1:length(models_x), function(x) models_x[[x]]$r.squared)
# give names to rows and col
rownames(rsqds) <- unique(x$STA)
colnames(rsqds) <- "rsq"
# have a look
rsqds
rsq
a 0.6286064
b 0.5450413
c 0.8806604
EDIT: following mnel's suggestion here are more efficient ways to get the r-squared values into a nice table (no need to add row and col names):
# starting with models_x from above
rsqds <- data.frame(rsq =sapply(models_x, '[[', 'r.squared'))
# starting with just the original data in x, this is great:
rsqds <- ddply(x, "STA", summarize, rsq = summary(lm(VALUE ~ YEAR))$r.squared)
STA rsq
1 a 0.6286064
2 b 0.5450413
3 c 0.8806604
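For comparison, the same per-station fit can be sketched in base R with split() and sapply(), with no extra packages. This is only an alternative illustration, not part of the plyr answer; it uses the data from the question:

```r
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750", header = TRUE)

# fit VALUE ~ YEAR separately for each station and keep the r-squared
rsq <- sapply(split(x, x$STA),
              function(df) summary(lm(VALUE ~ YEAR, data = df))$r.squared)
rsqds <- data.frame(STA = names(rsq), rsq = unname(rsq))
rsqds
```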
#first load the data.table package
library(data.table)
#transform your dataframe to a datatable (I'm using your example)
x<- as.data.table(x)
#calculate all the metrics needed (r^2, F-statistic and so on)
x[,list(r2=summary(lm(VALUE~YEAR))$r.squared ,
f=summary(lm(VALUE~YEAR))$fstatistic[1] ),by=STA]
STA r2 f
1: a 0.6286064 8.462807
2: b 0.5450413 5.990009
3: c 0.8806604 36.897258
There's only one r-squared value for that model, not three. Please edit your question.
# store the output
y <- summary( lm( VALUE ~ YEAR + STA , data = x ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
# or are you looking to run three separate
# lm() functions on 'a' 'b' and 'c' ..where this would be the first?
y <- summary( lm( VALUE ~ YEAR , data = x[ x$STA %in% 'a' , ] ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
The following is a toy problem that demonstrates my question.
I have a data frame that contains a bunch of employees; for each employee, it has a name, salary, gender and state.
aggregate(salary ~ state, data, FUN = mean) # Returns the average salary per state
aggregate(salary ~ state + gender, data, FUN = mean) # Avg salary per state/gender
What I actually need is a summary of the fraction of the total salary earned by women in each state.
aggregate(salary ~ state + gender, data, FUN = sum)
returns the total salary earned by women (and men) in each state, but what I really need is salary_w / salary_total on a per-state level. I could write a for-loop, but I am wondering whether there is some way to do this with aggregate.
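One way this could be done with aggregate alone, sketched on a small made-up data frame: aggregate the totals and the women's totals separately, then merge the two results by state. The data values, the gender codes "W"/"M" and the name dat1 are assumptions for the illustration.

```r
# hypothetical toy data; column names follow the question
dat1 <- data.frame(
  state  = c("A", "A", "A", "B", "B", "B"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(10, 20, 30, 5, 15, 30)
)

total <- aggregate(salary ~ state, data = dat1, FUN = sum)
women <- aggregate(salary ~ state, data = dat1[dat1$gender == "W", ], FUN = sum)

# merging by state keeps the rows aligned even if the order differs
res <- merge(total, women, by = "state", suffixes = c("_total", "_w"))
res$frac_w <- res$salary_w / res$salary_total
res
```

A state with no women at all would drop out of the merge; merge(..., all.x = TRUE) would keep it with NA instead.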
Another option would be using plyr. ddply() expects a data.frame as an input and will return a data.frame as an output. The second argument is how you want to split the data frame. The third argument is what we want to apply to the chunks, here we are using summarise to create a new data.frame from the existing data.frame.
library(plyr)
#Using the sample data from kohske's answer above
> ddply(d, .(state), summarise, ratio = sum(salary[gender == "Woman"]) / sum(salary))
state ratio
1 1 0.5789860
2 2 0.4530224
reshape or reshape2 would probably help here.
Here is a sample script:
library(reshape2) # from CRAN
# sample data
d <- data.frame(expand.grid(state=gl(2,2),gender=gl(2,1, labels=c("Men","Women"))),
                salary=runif(8))
# sum the salaries into one column per gender
d2 <- dcast(d, state~gender, sum, value.var="salary")
d2$frac <- d2$Women/(d2$Men+d2$Women)
The ave function is good for problems like this.
Data$ratio <- ave(Data$salary, Data$state, Data$gender, FUN=sum) /
ave(Data$salary, Data$state, FUN=sum)
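Note that ave returns one value per row of the data, so the ratio is repeated on every row of a state/gender group. A small sketch on made-up data (the values and the name dat1 are assumptions) shows how to reduce that to one row per state:

```r
# hypothetical toy data; column names follow the question
dat1 <- data.frame(
  state  = c("A", "A", "A", "B", "B", "B"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(10, 20, 30, 5, 15, 30)
)

# share of the state's total salary earned by the row's own gender
dat1$ratio <- ave(dat1$salary, dat1$state, dat1$gender, FUN = sum) /
              ave(dat1$salary, dat1$state, FUN = sum)

# keep one row per state for the women
women_ratio <- unique(dat1[dat1$gender == "W", c("state", "ratio")])
women_ratio
```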
Another solution is to use xtabs and prop.table:
prop.table(xtabs(salary ~ state + gender,data),margin=1)
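To make the xtabs approach concrete, here is a short sketch on made-up data (values and the name dat1 are assumptions). xtabs sums salary into a state-by-gender table, and prop.table with margin = 1 divides each row by its total, so every row adds up to 1:

```r
# hypothetical toy data; column names follow the question
dat1 <- data.frame(
  state  = c("A", "A", "A", "B", "B", "B"),
  gender = c("W", "M", "W", "W", "M", "M"),
  salary = c(10, 20, 30, 5, 15, 30)
)

tab <- xtabs(salary ~ state + gender, data = dat1)  # summed salary per cell
shares <- prop.table(tab, margin = 1)               # row-wise proportions
shares[, "W"]                                       # women's share per state
```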
It's generally not advisable to name your dataset "data", so I will change the problem slightly and call the dataset "dat1".
# salary sums per state, for women only and for everyone
female_sum <- with(subset(dat1, gender == "Female"), aggregate(salary, list(state = state), sum))
total_sum <- with(dat1, aggregate(salary, list(state = state), sum))
# element-wise division returns a vector; assumes both results list the same states in the same order
female_sum$x / total_sum$x
I think you are also using attach, and there are good reasons to reconsider that decision, despite what you might read in Crawley.
Since you want the results on a per-state basis, tapply might be what you want.
To illustrate, let's generate some arbitrary data to play with:
set.seed(349) # For replication
n <- 20000 # Sample size
gender <- sample(c('M', 'W'), size = n, replace = TRUE) # Random selection of gender
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI',
'ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN',
'MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH',
'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA',
'WV','WI','WY') # All US states plus DC
state <- sample(state, size = n, replace = TRUE) # Random selection of the states
state_index <- tapply(state, state) # Just for the data generation part ...
gender_index <- tapply(gender, gender)
# Generate salaries
salary <- runif(length(unique(state)))[state_index] # Make states different
salary <- salary + c(.02, -.02)[gender_index] # Make gender different
salary <- salary + log(50) + rnorm(n) # Add mean and error term
salary <- exp(salary) # The variable of interest
What you asked for, the sum of salaries for the women per state and the sum of total salaries per state:
salary_w <- tapply(salary[gender == 'W'], state[gender == 'W'], sum)
salary_total <- tapply(salary, state, sum)
Or if it is in a data-frame:
salary_w <- with(myData, tapply(salary[gender == 'W'], state[gender == 'W'], sum))
salary_total <- with(myData, tapply(salary, state, sum))
Then the answer is:
> salary_w / salary_total
       AK        AL        AR        AZ        CA        CO        CT        DC
0.4667424 0.4877013 0.4554831 0.4959573 0.5382478 0.5544388 0.5398104 0.4750799
       DE        FL        GA        HI        IA        ID        IL        IN
0.4684846 0.5365707 0.5457726 0.4788805 0.5409347 0.4596598 0.4765021 0.4873932
       KS        KY        LA        MA        MD        ME        MI        MN
0.5228247 0.4955802 0.5604342 0.5249406 0.4890297 0.4939574 0.4882687 0.5611435
       MO        MS        MT        NC        ND        NE        NH        NJ
0.5090843 0.5342312 0.5492702 0.4928284 0.5180169 0.5696885 0.4519603 0.4673822
       NM        NV        NY        OH        OK        OR        PA        RI
0.4391634 0.4380065 0.5366625 0.5362918 0.5613301 0.4583937 0.5022793 0.4523672
       SC        SD        TN        TX        UT        VA        VT        WA
0.4862358 0.4895377 0.5048047 0.4443220 0.4881062 0.4880047 0.5338397 0.5136393
       WI        WV        WY
0.4787588 0.5495602 0.5029816