aggregating a data frame over a column - r

I have a data frame, one of the columns representing years. Let's say
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c("10","11","12","13","14","15","16","17","18")
m2 <- c("20","30","40","50","60","70","80","90","100")
data <- data.frame(region,year,m1,m2)
I want to aggregate m1 by taking the 3-year average for each country, but I am unsure how to do that with a data frame. Any comment is highly appreciated.
Thanks in advance!

First, your m1 variable needs to be numeric. Convert it using as.numeric():
data$m1 <- as.numeric(as.character(data$m1))
Then, you can use aggregate like this:
aggregate(m1 ~ region, FUN = mean, data = data)
#   region m1
# 1  Italy 14
# 2 Norway 15
# 3  Spain 13
To avoid the awkward type conversion (as.numeric(as.character())), you should eliminate the quotes from the setup for m1 and m2:
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
Alternative approach using dplyr:
library(dplyr)
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
data <- data.frame(region,year,m1,m2)
data %>%
  group_by(region) %>%
  summarise(mean_m1 = mean(m1),
            mean_m2 = mean(m2))
#   region mean_m1 mean_m2
# 1  Italy      14      60
# 2 Norway      15      70
# 3  Spain      13      50
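For completeness, base R's aggregate() can also average both measures in one call by binding them on the left-hand side of the formula. This is a minimal sketch that assumes the numeric m1 and m2 vectors defined above; the name region_means is only illustrative.

```r
# Same toy data as above, with numeric measures
region <- c("Spain", "Italy", "Norway")
year <- c("2010","2011","2012","2010","2011","2012","2010","2011","2012")
m1 <- c(10,11,12,13,14,15,16,17,18)
m2 <- c(20,30,40,50,60,70,80,90,100)
data <- data.frame(region, year, m1, m2)

# cbind() on the left-hand side averages m1 and m2 per region in one call
region_means <- aggregate(cbind(m1, m2) ~ region, data = data, FUN = mean)
region_means
```

The result should hold the same per-region means as the two answers above, with one column per measure.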

Related

How to loop with two lists in R

I have a dataset with demographic information and with questions.
DF <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
                 Male = c(1,0,1,1,0,1,0,0,1,0),
                 Female = c(0,1,0,0,1,0,1,1,0,1),
                 Q1 = c(9,6,5,4,5,1,3,5,5,2),
                 Q2 = c(2,4,5,4,2,1,3,5,4,2),
                 Q3 = c(6,8,2,7,5,2,1,1,6,3))
I have two lists (made from column titles), one of demographic information (Males, Females, age group etc) and one of questions with their associated response.
Demographic <- c("Male", "Female", "Age_group_1", "Age_group_2", …)
Questions <- c("Q1", "Q2", "Q3", "Q4", …)
I need something along the lines of: if the value in a demographic column is equal to 1, then sum the scores in each separate question column. But I want to do this in a loop so that I have the separate question scores (~300) for all columns in the demographic list (~80). Plus I want to save the output. I have no idea how to do this and I'm getting into a loop of bad programming myself!
The end result should resemble this:
    M  F
Q1 24 21
Q2 16 16
Q3 23 18
I would be grateful for any help!
Thanks in advance.
UPDATE:
With help from a friend, I have found a workaround for my problem. How can this be made more efficient, though?
df.list <- list()
for (question in Questions) {
  # DF must be a data.table for this [, j, by] syntax to work
  question.df <- DF[, lapply(.SD, sum, na.rm = TRUE), by = question,
                    .SDcols = Demographic]
  df.list <- append(df.list, question.df)
}
list_new <- bind_cols(df.list, .id = "column_label")
library(tidyr)
library(dplyr)
df <- data.frame(
Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3)
)
df %>%
  mutate(sex = ifelse(Male == 1, "M", "F")) %>%
  select(-Male, -Female) %>%
  pivot_longer(cols = starts_with("Q"), names_to = "Q") %>%
  group_by(sex, Q) %>%
  summarise(value = sum(value)) %>%
  pivot_wider(names_from = sex)
gives:
  Q         F     M
  <chr> <dbl> <dbl>
1 Q1       21    24
2 Q2       16    16
3 Q3       18    23
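For the general case in the question (many 0/1 demographic columns and many question columns), a matrix cross-product avoids both the loop and the reshaping. This is a minimal sketch in base R, assuming every demographic column is coded 0/1 and there are no missing values; demographic, questions, and sums are illustrative names.

```r
df <- data.frame(
  Participant = 1:10,
  Male   = c(1,0,1,1,0,1,0,0,1,0),
  Female = c(0,1,0,0,1,0,1,1,0,1),
  Q1 = c(9,6,5,4,5,1,3,5,5,2),
  Q2 = c(2,4,5,4,2,1,3,5,4,2),
  Q3 = c(6,8,2,7,5,2,1,1,6,3)
)
demographic <- c("Male", "Female")
questions   <- c("Q1", "Q2", "Q3")

# crossprod(X, Y) computes t(X) %*% Y, so entry [q, d] is the sum of
# question q over the participants whose indicator d equals 1
sums <- crossprod(as.matrix(df[questions]), as.matrix(df[demographic]))
sums

# To save the result (the file name is arbitrary):
# write.csv(sums, "question_sums_by_demographic.csv")
```

With the example data the Q1 row holds 24 for Male and 21 for Female, matching the table above. Because the whole table comes from one matrix product, the same call covers ~300 questions and ~80 demographic columns without an explicit loop.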
Depending on what you want to do with the output, another approach is to use tables::tabular(), which can be used to generate additional statistics (e.g. percentages), as well as customizing row and column headings.
We'll generate a simple table using the data provided in the question.
df <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
df$sex <- ifelse(df$Male == 1,"M","F")
library(tables)
tabular((Q1 + Q2 + Q3)~Factor(sex)*(sum),data=df)
...and the output:
> tabular((Q1 + Q2 + Q3)~Factor(sex)*(sum),data=df)

    sex
    F   M
    sum sum
 Q1 21  24
 Q2 16  16
 Q3 18  23
Processing multiple demographic variables
In the comments to my answer a question was asked about how to use tabular() with more than one demographic variable.
We can use a combination of lapply(), paste(), and substitute() to build the correct formula expressions for tabular().
To illustrate the process, we will add a second demographic variable, Income, to the data frame listed above. Then we create a vector to represent the list of demographic variables for which we will generate tables. Finally, we use the vector with lapply() to produce the tables.
df <- data.frame(Participant = c(1,2,3,4,5,6,7,8,9,10),
Male = c(1,0,1,1,0,1,0,0,1,0),
Female = c(0,1,0,0,1,0,1,1,0,1),
Income = c(rep("low",5),rep("high",5)),
Q1 = c(9,6,5,4,5,1,3,5,5,2),
Q2 = c(2,4,5,4,2,1,3,5,4,2),
Q3 = c(6,8,2,7,5,2,1,1,6,3))
df$Sex <- ifelse(df$Male == 1,"M","F")
library(tables)
tabular((Q1 + Q2 + Q3)~Factor(Sex)*(sum),data=df)
demoVars <- c("Sex","Income")
lapply(demoVars,function(x){
# generate a formula expression including the column variable
# and use substitute() to render it correctly within tabular()
theExpr <- paste0("(Q1 + Q2 + Q3) ~ Factor(",x,")*(sum)")
tabular(substitute(theExpr),data=df)
})
...and the output:
> lapply(demoVars,function(x){
+ # generate a formula expression including the column variable
+ # and use substitute() to render it correctly within tabular()
+ theExpr <- paste0("(Q1 + Q2 + Q3) ~ Factor(",x,")*(sum)")
+ tabular(substitute(theExpr),data=df)
+ })
[[1]]

    Sex
    F   M
    sum sum
 Q1 21  24
 Q2 16  16
 Q3 18  23

[[2]]

    Income
    high low
    sum  sum
 Q1 16   29
 Q2 15   17
 Q3 13   28
Note that we can enhance the solution further by saving the tables to an output object and rendering them in a printer friendly format as needed.

R - Automatically split time series in equal parts

I am trying to build a regression model with calibration periods. For that I want to split my time series into 4 equal parts.
library(lubridate)
date_list = seq(ymd('2000-12-01'),ymd('2018-01-28'),by='day')
date_list = date_list[which(month(date_list) %in% c(12,1,2))]
testframe = as.data.frame(date_list)
testframe$values = seq (1, 120, length = nrow(testframe))
The testframe above is 18 seasons long and I want to divide it into 4 parts, meaning 2 periods of 4 winter seasons and 2 periods of 5 winter seasons.
My try was:
library(lubridate)
aj = year(testframe[1,1])
ej = year(testframe[nrow(testframe),1])
diff = ej - aj
But when I divide diff by 4, I get 4.5, whereas I need something like 4, 4, 5, 5 that I can use to extract the seasons. Any idea how to do that automatically?
You can start with something like this:
library(lubridate)
testframe$year_ <- year(testframe$date_list)
testframe$season <- getSeason(testframe$date_list)
If you're wondering about the origin of the getSeason() function, read this. Now you can split the dataset, keeping the seasons:
by4_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[1:4],]
by4_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[5:8],]
by5_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[9:13],]
by5_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[14:18],]
Now you can test it, for example:
table(by4_1$year_, by4_1$season)
       Fall Winter
  2000   14     17
  2001   14     76
  2002   14     76
  2003   14     76
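The 4, 4, 5, 5 split asked for in the question can also be computed rather than hard-coded. The sketch below uses base R only and assumes that a winter season is labelled by the year of its January, so December 2000 belongs to season 2001; season, sizes, period, and parts are illustrative names.

```r
# Rebuild the test frame from the question (December, January, February only)
date_list <- seq(as.Date("2000-12-01"), as.Date("2018-01-28"), by = "day")
mon <- as.integer(format(date_list, "%m"))
date_list <- date_list[mon %in% c(12, 1, 2)]
testframe <- data.frame(date_list)
testframe$values <- seq(1, 120, length.out = nrow(testframe))

# Label each day with its winter season: December counts towards the next year
yr  <- as.integer(format(testframe$date_list, "%Y"))
mon <- as.integer(format(testframe$date_list, "%m"))
testframe$season <- yr + (mon == 12)

# Distribute n seasons over k periods; the remainder goes to the last periods
seasons <- sort(unique(testframe$season))
n <- length(seasons)
k <- 4
sizes <- rep(n %/% k, k) + rep(c(0, 1), times = c(k - n %% k, n %% k))
period <- rep(seq_len(k), times = sizes)

# Attach the period to every row and split into a list of data frames
testframe$period <- period[match(testframe$season, seasons)]
parts <- split(testframe, testframe$period)
sizes
```

Here parts[["1"]] and parts[["2"]] each cover 4 winter seasons and parts[["3"]] and parts[["4"]] cover 5. Changing k or the date range adjusts the sizes without editing any indices by hand.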

How to use tidyr to organize multiple columns [duplicate]

This question already has answers here:
Reshaping data.frame from wide to long format
(8 answers)
Closed 3 years ago.
I'm having more problems with the tidyr package in R. I am doing an experiment that involves splitting the data frame into plot, plant, and leaf variables, and since I have a large data frame, I need to do this with code. I'm using RStudio and the tidyr package.
I need to organize a data frame from this:
library(readr)
library(tidyr)
library(dplyr)
plot <- c("101","101","101","101","101","102","102","102","102","102")
plant <- c("1","2","3","4","5","1","2","3","4","5")
leaf_1 <- c("100","100","100","100","100","100","100","100","100","100")
leaf_2 <- c("90","90","90","90","90","90","90","90","90","90")
leaf_3 <- c("80","80","80","80","80","80","80","80","80","80")
plot <- as.data.frame(plot)
plant <- as.data.frame(plant)
leaf_1 <- as.data.frame(leaf_1)
leaf_2 <- as.data.frame(leaf_2)
leaf_3 <- as.data.frame(leaf_3)
data <- cbind(plot, plant, leaf_1, leaf_2, leaf_3)
View(data)
Into this:
plot <- c("101","101","101", "101","101","101","101","101","101","101","101","101","101","101","101")
plant <- c("1","1","1","2","2","2","3","3","3","4","4","4","5","5","5")
leaf_number <- c("1","2","3","1","2","3","1","2","3","1","2","3","1","2","3")
score <- c("100","90","80","100","90","80","100","90","80","100","90","80","100","90","80")
plot <- as.data.frame(plot)
plant <- as.data.frame(plant)
leaf_number <- as.data.frame(leaf_number)
score <- as.data.frame(score)
example <- cbind(plot, plant, leaf_number, score)
View(example)
Here is what I have already tried:
data1 <- gather(data, leaf_number, score, -plot)
But it just doesn't gather the data frame into what I need. Any help is greatly appreciated, thanks so much everybody!
data <- data.frame(
  plot = c(101,101,101,101,101,102,102,102,102,102),
  plant = c(1,2,3,4,5,1,2,3,4,5),
  leaf_1 = c(100,100,100,100,100,100,100,100,100,100),
  leaf_2 = c(90,90,90,90,90,90,90,90,90,90),
  leaf_3 = c(80,80,80,80,80,80,80,80,80,80)
)
gather(data, leaf_number, score, -c(plot, plant))
#   plot plant leaf_number score
#1   101     1      leaf_1   100
#2   101     2      leaf_1   100
#3   101     3      leaf_1   100
#4   101     4      leaf_1   100
#5   101     5      leaf_1   100
#6   102     1      leaf_1   100
#7   102     2      leaf_1   100
#etc.
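In current tidyr releases gather() is superseded by pivot_longer(). As a hedged sketch of the same reshape, the call below also strips the leaf_ prefix so that leaf_number holds 1, 2, 3 as in the desired output; it assumes tidyr 1.0.0 or later, and the name long is illustrative.

```r
library(tidyr)

data <- data.frame(
  plot = c(101,101,101,101,101,102,102,102,102,102),
  plant = c(1,2,3,4,5,1,2,3,4,5),
  leaf_1 = rep(100, 10),
  leaf_2 = rep(90, 10),
  leaf_3 = rep(80, 10)
)

long <- pivot_longer(
  data,
  cols = starts_with("leaf_"),
  names_to = "leaf_number",
  names_prefix = "leaf_",   # drop the prefix, keeping only the number
  values_to = "score"
)
long
```

Unlike gather(), pivot_longer() keeps the rows of each plant together, which is the ordering shown in the question's target data frame. Note that leaf_number comes back as character; wrap it in as.integer() if a numeric column is needed.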

Plot one column in every matrix in a list of matrices in R

I'm quite new to programming as well as data analysis, please bear with me here. My data currently consists of a list of 14 matrices (lom), each corresponding to data from a country (with two-letter country codes).
Here is a full sample for Austria:
> lom["AT"]
$`AT`
Year AllKey AllSub SelKey SelSub
1 2000 1.622279 0.5334964 1.892894 0.8057591
2 2001 1.903745 0.5827514 2.291335 0.8295899
3 2002 1.646538 0.4873866 2.006873 0.7360566
4 2003 1.405250 0.8692641 2.105648 1.2711968
5 2004 1.511154 1.5091751 1.970236 1.9407666
6 2005 1.459177 0.6781008 1.808982 1.1362805
7 2006 1.604652 0.5038658 1.942126 0.7992008
8 2007 2.107326 0.9260200 2.683072 1.3302627
9 2008 1.969735 0.6178362 2.994758 1.2051339
10 2009 1.955768 0.7365529 2.896198 1.2272024
11 2010 2.476157 0.7952590 3.715950 1.5686643
12 2011 2.092459 0.4970011 2.766169 0.6476707
13 2012 1.913122 0.5338756 2.450942 0.6022315
14 2013 2.086200 0.6739412 2.786736 0.9211941
15 2014 2.579428 0.8424793 3.152541 1.0225888
16 2015 10.662568 5.8472436 9.769320 3.8840780
17 2016 11.088286 4.6504581 10.567789 3.2383420
18 2017 7.225053 1.7528594 6.747515 1.2781224
I'd like to get all 14 countries plotted against x = Year and y = each of the other variables, i.e. four plots with 14 lines each. Hence the requirement in the question title.
I keep coming up with impossibilities involving some combination of a for loop and some apply function, for example:
for (i in colnames(lom$anyCountry)) {
ggplot(lapply(lom, function(x) x[,1:14], aes(x=Year, y=i)
}
which apart from many other problems I can now see throws:
Error: data must be a data frame, or other object coercible by fortify(), not a list
which led me to combine the list of matrices into one big matrix, inspired by this:
bigDF <- do.call(rbind, lom)
I suppose I could restructure my data some other way, perhaps I'm missing some functionality that would help... probably both. I would appreciate any pointers as to how to achieve this as efficiently as possible.
Consider appending all matrix data into a master, single data frame with a country indicator that you can use for the color argument of line plots:
library(ggplot2)
# CREATE LARGE DATAFRAME FROM MATRIX LIST
lom_df <- do.call(rbind, lapply(lom, data.frame))
# CREATE country COLUMN FROM ROW NAMES
lom_df$country <- gsub("\\..*$", "", row.names(lom_df))
row.names(lom_df) <- NULL
# EXTRACT ALL FOUR Y COLUMN NAMES (MINUS Year AND country)
y_columns <- colnames(lom_df[2:(ncol(lom_df)-1)])
# PRODUCE LIST OF FOUR PLOTS EACH WITH COUNTRY LINES
plot_list <- lapply(y_columns, function(col)
  ggplot(lom_df, aes_string(x="Year", y=col, color="country")) +
    geom_line()
)
# OUTPUT EACH LIST
plot_list
This solution uses package ggplot2.
It has two steps, data preparation and plotting.
First of all the list must be transformed into one large data frame, with a column as an id column. I have searched SO for a function that does this but couldn't find one so here it goes.
rbindWithID <- function(x, id.name = "ID", sep = "."){
  if(is.null(names(x))) names(x) <- paste(id.name, seq_along(x), sep = sep)
  res <- lapply(names(x), function(nm){
    DF <- x[[nm]]
    DF[[id.name]] <- nm
    x[[nm]] <- cbind(DF[ncol(DF)], DF[-ncol(DF)])
    x[[nm]]
  })
  do.call(rbind, res)
}
lom_df <- rbindWithID(lom, "Country")
Now reshape the data frame from wide to long.
molten <- reshape2::melt(lom_df, id.vars = c("Country", "Year"))
Finally, plot it.
library(ggplot2)
ggplot(molten, aes(Year, value, colour = Country)) +
  geom_line() +
  facet_wrap(~ variable)
DATA.
set.seed(1234) # Make the results reproducible
lom <- lapply(1:4, function(i){
  data.frame(
    Year = 2000:2008,
    AllKey = runif(9, 1, 2),
    AllSub = runif(9, 0, 2),
    SelKey = runif(9, 1, 2),
    SelSub = runif(9, 0, 2)
  )
})
names(lom) <- c("AT", "DE", "FR", "PT")

r - ggplot multiple line graphs for each unique instance over time

The Problem
I am plotting a bunch of line plots on top of one another, but I only want to color about 10 of them, so that I can see how my 'targets' traveled over time while still viewing the mass of other lines behind them. An example would be 100 line graphs over time, where I color 5 or 10 specifically to discuss them with respect to the trend of the 90 other grayscale ones.
The following post has a pretty good image that I want to replicate, but with slightly more meat on the bones. I want MANY grayscale lines behind those 3, with those 3 highlighted cities in the foreground, per se.
My original data was in the following form:
# The unique identifier is a City-State combo,
# there can be the same cities in 1 state or many.
# Each state's year ranges from 1:35, but may not have
# all of the values available to us, but some are complete.
r1 <- c("city1" , "state1" , "year" , "population" , rnorm(11) , "2")
r2 <- c("city1" , "state2" , "year" , "population" , rnorm(11) , "3")
r3 <- c("city2" , "state1" , "year" , "population" , rnorm(11) , "2")
r4 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "1")
r5 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "7")
df <- data.frame(matrix(nrow = 5, ncol = 16))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- c("City", "State", "Year", "Population", 1:11, "Cluster")
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# City | State | Year | Population | ... 11 Variables ... | Cluster #
# ----------------------------------------------------------------------#
# Each row is a city instance with these features ... #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
But I thought it might be better to view the data differently, so I also have it in the following format. I am not sure which is better for this problem.
cols <- 1:35  # one name per column; 0:35 would give 36 names for 35 columns
rows <- c("unique_city1", "unique_city2","unique_city3","unique_city4","unique_city5")
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- cols
row.names(df) <- rows
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Year1 Year2 .......... Year 35 #
# UniqueCityState1 VAL NA .......... VAL #
# UniqueCityState2 VAL VAL .......... NA #
# . #
# . #
# . #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Prior Attempts
I have tried using melt to get the data into a format that ggplot will accept so I can plot each of these cities over time, but nothing has worked. I have also tried writing my own functions to loop through each unique city-state combination and stack ggplots, a topic with a fair amount of material available, but still without success. I am not sure how to find each of these unique city-state pairs and plot them over time using their cluster value, or any numeric value for that matter. Or maybe what I am seeking is not possible; I am not sure.
Thoughts?
EDIT: More information about data structure
> head(df)
city state year population stat1 stat2 stat3 stat4 stat5
1 BESSEMER 1 1 31509 0.3808436 0 0.63473928 2.8563268 9.5528262
2 BIRMINGHAM 1 1 282081 0.3119671 0 0.97489728 6.0266377 9.1321287
3 MOUNTAIN BROOK 1 1 18221 0.0000000 0 0.05488173 0.2744086 0.4390538
4 FAIRFIELD 1 1 12978 0.1541069 0 0.46232085 3.0050855 9.8628448
5 GARDENDALE 1 1 7828 0.2554931 0 0.00000000 0.7664793 1.2774655
6 LEEDS 1 1 7865 0.2542912 0 0.12714558 1.5257470 13.3502861
stat6 stat6 stat7 stat8 stat9 cluster
1 26.976419 53.54026 5.712654 0 0.2856327 9
2 35.670605 65.49183 11.982374 0 0.4963113 9
3 6.311399 21.40387 1.426925 0 0.1097635 3
4 21.266759 68.11527 11.480968 0 1.0787487 9
5 6.770567 23.24987 3.960143 0 0.0000000 3
6 24.157661 39.79657 4.450095 0 1.5257470 15
agg
1 99.93970
2 130.08675
3 30.02031
4 115.42611
5 36.28002
6 85.18754
Ultimately I need it in the form of unique cities as row.names, 1:35 as col.names, and the value inside each cell being agg if that year is present or NA if it isn't. Again, I am sure this is possible; I just can't find a good solution, and my current approach is unstable.
If I understand your question correctly, you want to plot all the lines in one color, and then plot a few lines with several different colors. You may use ggplot2, calling geom_line twice on two data frames. The first time plot all city data without mapping lines to color. The second time plot just the subset of your target city and mapping lines to color. You will need to re-organize your original data frame and subset the data frame for the target city. In the following code I used tidyr and dplyr to process the data frame.
### Set.seed to improve reproducibility
set.seed(123)
### Load package
library(tidyr)
library(dplyr)
library(ggplot2)
### Prepare example data frame
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- 1:35
df <- df %>% mutate(City = 1:5)
### Reorganize the data for plotting
df2 <- df %>%
gather(Year, Value, -City) %>%
mutate(Year = as.numeric(Year))
The gather function takes df as its first argument. It creates a key column called Year, which stores the year numbers; these are the names of every column in df except City. gather also creates a column called Value, which stores the numeric values from every column in df except City. Finally, the City column is not part of this process, so -City tells gather not to transform the data in the City column.
### Subset df2, select the city of interest
df3 <- df2 %>%
# In this example, assuming that City 2 and City 3 are of interest
filter(City %in% c(2, 3))
### Plot the data
ggplot(data = df2, aes(x = Year, y = Value, group = factor(City))) +
  # Plot all city data here in gray lines
  geom_line(size = 1, color = "gray") +
  # Plot target city data with colors
  geom_line(data = df3,
            aes(x = Year, y = Value, group = City, color = factor(City)),
            size = 2)
The resulting plot can be seen here: https://dl.dropboxusercontent.com/u/23652366/example_plot.png
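The wide layout described in the question's edit (one row per city, one column per year, agg in each cell, NA where a year is missing) can be sketched in base R with tapply(). The data below are made-up toy values, not the asker's data, and the four-year range stands in for 1:35; long, key, yrs, and wide are illustrative names.

```r
# Toy long-format data (hypothetical values); BESSEMER has no row for year 2
long <- data.frame(
  city  = c("BESSEMER", "BESSEMER", "LEEDS", "LEEDS", "LEEDS"),
  state = c(1, 1, 1, 1, 1),
  year  = c(1, 3, 1, 2, 3),
  agg   = c(99.9, 101.2, 85.2, 80.0, 90.5)
)

# City-state key, since the same city name can occur in several states
key <- paste(long$city, long$state, sep = "_")

# Declaring all year levels keeps a column for every year, even if unused
yrs <- factor(long$year, levels = 1:4)

# One cell per key/year pair; combinations with no data become NA
wide <- tapply(long$agg, list(key, yrs), FUN = function(v) v[1])
wide
```

The result is a matrix with the city-state keys as row names and the years as column names. For plotting with ggplot2 the long format is still the one to use, so this layout is mainly useful for inspection or export.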
