I have over 100 tables. Each ID has multiple columns (ID, Date, Days, Mass, Float, Date 2, Days 2, pH).
I split the IDs from the data frame and made them the names of the tables as shown below.
data = NN
ID <- paste0("", NN$ID)
SD<- split(NN,ID)
SD
Each of the IDs looks as follows:
> SD$`4469912`
# A tibble: 5 × 8
ID Date Days Mass Float `Date 2` `Days 2` pH
<dbl> <dttm> <dbl> <dbl> <dbl> <dttm> <dbl> <chr>
1 4469912 2022-05-24 00:00:00 0 440 16.9 NA 0 NA
2 4469912 2022-05-27 00:00:00 3 813 NA NA 0 NA
3 4469912 2022-06-02 00:00:00 9 930 NA NA 0 NA
4 4469912 2022-06-03 00:00:00 10 914. NA NA 0 NA
5 4469912 2022-06-06 00:00:00 13 944 NA NA 0 NA
I would like to convert each ID to its own Dataframe as shown below
`4469912`<- data.frame(SD$`4469912`)
AKA
`4469912`<- data.frame(SD[9])
The problem I am running into is running a loop to create each table as its own data frame. I would like to name the data frames to their corresponding ID. Something along the lines of the code below.
for (x in SD) {
names(SD[x]) <- data.frame(SD[x])
}
EDIT: I will add that the end goal is to select specific IDs and plot them on top of or against one another in ggplot, with each ID as its own geom_line. For example:
`4469912`<- data.frame(SD$`4469912`)
`4469822`<- data.frame(SD$`4469822`)
`4469222`<- data.frame(SD$`4469222`)
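ggplot(data = NULL, aes(x = Date, y = Mass)) +
# a fixed colour goes outside aes(); inside aes() it is treated as data, not as a colour name
geom_line(data = `4469912`, colour = "red") +
geom_line(data = `4469822`, colour = "blue") +
geom_line(data = `4469222`, colour = "green")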
That way, rather than plotting my entire original data frame, I can determine falloff or regression between the selected IDs instead of across all the data points, if that makes sense and/or is relevant.
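One possible approach, sketched under assumptions: keep the split list rather than creating 100+ separate objects, pick the IDs you want by name, and map colour to the ID so that each one gets its own line. The small NN below is a hypothetical stand-in for the real data, and the names wanted, selected and p are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for NN: three IDs with three dates each
NN <- data.frame(
  ID   = rep(c(4469912, 4469822, 4469222), each = 3),
  Date = rep(as.Date(c("2022-05-24", "2022-05-27", "2022-06-02")), times = 3),
  Mass = c(440, 813, 930, 400, 700, 900, 420, 760, 910)
)

# split() already returns a list named by ID; as.data.frame() drops any tibble class
SD <- lapply(split(NN, NN$ID), as.data.frame)

# Pick IDs by name and stack them back into one data frame
wanted   <- c("4469912", "4469822")
selected <- do.call(rbind, SD[wanted])

# One line per ID; colour is mapped to the ID instead of hard-coded per layer
p <- ggplot(selected, aes(x = Date, y = Mass, colour = factor(ID))) +
  geom_line()
```

If separate objects in the workspace are really needed, list2env(SD, envir = .GlobalEnv) would create one per list element, but indexing the list by name (SD[["4469912"]]) is usually easier to loop over.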
Say I have a data frame with columns for the date (DATE), the status of the employee (Status), and the number of employees with that status (n). How do I create a line chart that displays the change over time in the number of employees with each status, with each status as an individual line?
df %>%
ggplot() +
geom_line(aes(DATE, n, colour = Status))
When I use this, it returns a graph with exponential values on the y-axis.
Does anyone have any solutions? With there being only two different status values, I tried using pivot_wider() to make each a separate variable, but I am still stuck.
Assuming your data looks something like this:
df <- data.frame(Date = rep(as.Date(seq(18628, 18677), origin= "1970-01-01"), 2),
Status = rep(c("A", "B"), each = 50),
emp_nos = rpois(100, 100))
# A tibble: 100 x 3
Date Status emp_nos
<date> <chr> <int>
1 2021-01-01 A 83
2 2021-01-02 A 99
3 2021-01-03 A 104
4 2021-01-04 A 111
5 2021-01-05 A 102
6 2021-01-06 A 91
7 2021-01-07 A 96
8 2021-01-08 A 102
9 2021-01-09 A 97
10 2021-01-10 A 98
# ... with 90 more rows
You may create a chart like this:
df %>% ggplot() +
geom_line(aes(x=Date, y = emp_nos, linetype = Status, color = Status), size =1)
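If the "exponential values" in the question are simply axis labels in scientific notation (an assumption, since the plot itself is not shown), a sketch like the following might help. The data frame is hypothetical, and the as.numeric() step only matters if n was read in as character or factor.

```r
library(ggplot2)

# Hypothetical data with large counts
df <- data.frame(
  DATE   = rep(as.Date("2021-01-01") + 0:4, times = 2),
  Status = rep(c("A", "B"), each = 5),
  n      = c(120000, 125000, 130000, 128000, 131000,
             90000, 95000, 97000, 99000, 101000)
)

# Make sure n is numeric (it may have been imported as character or factor)
df$n <- as.numeric(as.character(df$n))

# label_comma() from the scales package formats axis labels as plain numbers
p <- ggplot(df, aes(DATE, n, colour = Status)) +
  geom_line() +
  scale_y_continuous(labels = scales::label_comma())
```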
I have a data frame of SPEI values. The first column contains the Date (month-year), and the second contains the SPEI values. I want to calculate two statistics (explained below) at two intervals:
20 years, i.e. 2021-2040, 2041-2060, 2061-2080, 2081-2100.
Each year, i.e. 2021, 2022, 2023 etc. till 2100.
The statistics are:
Drought frequency: Number of times SPEI < 0 in the specified period (20 years and 1 year respectively)
Drought Duration: Equal to the number of months between its start (included) and end month (not included) of the specified period. I am assuming a drought event starts when SPEI < 0.
I was wondering if there's a way to do that in R? It seems like an easy problem, but I don't know how to do it. Please help me out. Excel is taking too long. Thanks.
> head(test, 20)
Date spei-3
1 2021-01-01 NA
2 2021-02-01 NA
3 2021-03-01 -0.52133737
4 2021-04-01 -0.60047887
5 2021-05-01 0.56838399
6 2021-06-01 0.02285012
7 2021-07-01 0.26288462
8 2021-08-01 -0.14314685
9 2021-09-01 -0.73132256
10 2021-10-01 -1.23389220
11 2021-11-01 -1.15874943
12 2021-12-01 0.27954143
13 2022-01-01 1.14606657
14 2022-02-01 0.66872986
15 2022-03-01 -1.13758050
16 2022-04-01 -0.27861017
17 2022-05-01 0.99992395
18 2022-06-01 0.61024314
19 2022-07-01 -0.47450485
20 2022-08-01 -1.06682997
Edit:
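I would very much like to add some code, but I don't know where to start.
# read the workbook (assumes the readxl package; assigning only the path would leave test as a string)
test = readxl::read_excel("E:/drought.xlsx")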
#Extract year and month and add it as a column
test$Year = format(test$Date,"%Y")
test$Month = format(test$Date,"%B")
I don't know how to go on from here. I found that cumsum can help, but how do I select one year and then apply cumsum to it? I am not withholding code on purpose. I just don't know where or how to begin.
There are a couple of questions in the OP's post, so I will go through them step by step. You'll need dplyr and lubridate for this workflow.
First, we create some fake data to use:
library(lubridate)
library(dplyr)
#create example data
dd<- data.frame(Date = seq.Date(as.Date("2021-01-01"), as.Date("2100-12-01"), by = "month"),
spei = rnorm(960,0,2))
Once the grouping columns from the next step have been added, the data will look like this, similar to what you have above:
> head(dd)
Date spei year year_20 drought
1 2021-01-01 -6.85689789 2021 2021_2040 1
2 2021-02-01 -0.09292459 2021 2021_2040 1
3 2021-03-01 0.13715922 2021 2021_2040 0
4 2021-04-01 2.26805601 2021 2021_2040 0
5 2021-05-01 -0.47325008 2021 2021_2040 1
6 2021-06-01 0.37034138 2021 2021_2040 0
Then we can use lubridate and cut to create our yearly and 20-year variables to group by later and create a column drought signifying if spei was negative.
#create a column to group on by year and by 20-year
dd <- dd %>%
mutate(year = year(Date),
year_20 = cut(year, breaks = c(2020,2040,2060,2080, 2100), include.lowest = TRUE,
labels = c("2021_2040", "2041_2060", "2061_2080", "2081_2100"))) %>%
#column signifying if that month was a drought
mutate(drought = ifelse(spei<0,1,0))
Once we have that, we just use the group_by function to get frequency (or number of months with a drought) by year or 20-year period
#by year
dd %>%
group_by(year) %>%
summarise(year_freq = sum(drought)) %>%
ungroup()
# A tibble: 80 x 2
year year_freq
<dbl> <dbl>
1 2021 6
2 2022 4
3 2023 7
4 2024 6
5 2025 6
6 2026 7
#by 20-year group
dd %>%
group_by(year_20) %>%
summarise(year20_freq = sum(drought)) %>%
ungroup()
# A tibble: 4 x 2
year_20 year20_freq
<fct> <dbl>
1 2021_2040 125
2 2041_2060 121
3 2061_2080 121
4 2081_2100 132
Calculating drought duration is a bit more complicated. It involves:
1. identifying the first month of each drought
2. calculating the length of each drought
3. combining the information from 1 and 2
We can use lag to identify when a month changed from "no drought" to "drought". In this case we want an index of where the value in row i is different from that in row i-1.
# find index of where values change (dplyr::lag shifts the vector by one; stats::lag would not)
change.ind <- dd$drought != dplyr::lag(dd$drought)
#use index to find drought start
drought.start <- dd[change.ind & dd$drought == 1,]
This results in a subset of the initial dataset containing only the rows for the first month of each drought. Then we can use rle to calculate the length of each drought. rle calculates the length of every run of identical values, so we have to keep only those runs where the value is 1 (drought).
#calculate drought lengths
drought.lengths <- rle(dd$drought)
# we only want droughts (values = 1)
drought.lengths <- drought.lengths$lengths[drought.lengths$values==1]
Now we can combine these two pieces of information together. The first row is an NA because there is no value at i-1 to compare the lag to. It can be dropped, unless you want to include that data.
drought.dur <- cbind(drought.start, drought_length = drought.lengths)
head(drought.dur)
Date spei year year_20 drought drought_length
NA <NA> NA NA <NA> NA 2
5 2021-05-01 -0.47325008 2021 2021_2040 1 1
9 2021-09-01 -2.04564549 2021 2021_2040 1 1
11 2021-11-01 -1.04293866 2021 2021_2040 1 2
14 2022-02-01 -0.83759671 2022 2021_2040 1 1
17 2022-05-01 -0.07784316 2022 2021_2040 1 1
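As an alternative sketch (not the method above, just another way to reach the same result), the start month and length of each drought can both be read off a single rle call, which avoids the NA first row. The toy drought vector and the names run_end, run_start and drought_events are hypothetical.

```r
# Hypothetical toy series: 1 = drought month, 0 = no drought
dates   <- seq.Date(as.Date("2021-01-01"), by = "month", length.out = 8)
drought <- c(1, 1, 0, 0, 1, 0, 1, 1)

# Run-length encode the series: one entry per run of identical values
runs <- rle(drought)

# Position of the last and first month of every run
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only the runs that are droughts
is_drought <- runs$values == 1

drought_events <- data.frame(
  start  = dates[run_start[is_drought]],
  length = runs$lengths[is_drought]
)
```

Grouping drought_events by the year (or 20-year bin) of start would then give duration statistics per period, under the assumption that a drought is assigned to the period in which it begins.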
I have an issue when trying to use dplyr and ggplot2 to summarize data. I have a data set (an Excel file) that I imported:
df<-read.xlsx('sample.xlsx', sheet = 1)
With a sample of the data
date user vert aff browser clicks age rpc installs revenue Week Month Year
1 2017-10-25 2017-10-25 maps_1 appfocus1 Chrome 13 0 0.4436 37 5.7668 43 10 2017
2 2017-10-25 2017-10-25 maps_1 appfocus1 Chrome 1140 0 0.4436 2914 505.7040 43 10 2017
3 2017-10-25 2017-10-25 maps appfocus84 Chrome 2189 0 0.4436 7543 971.0404 43 10 2017
4 2017-10-25 2017-10-25 maps_1 appfocus1 Firefox 1 0 0.4436 6 0.4436 43 10 2017
5 2017-10-25 2017-10-25 maps_1 appfocus1 Firefox 123 0 0.4436 170 54.5628 43 10 2017
6 2017-10-25 2017-10-25 maps appfocus84 Firefox 331 0 0.4436 497 146.8316 43 10 2017
source
1 googlepartner
2 search
3 NULL
4 googlepartner
5 search
6 NULL
The code below takes a column "affiliate" and generates the sum of two fields grouped by that column. Then I create a calculated field per "affiliate":
UC10 <- filter(df, UCMonth == 10)
UC101 <- UC10 %>% group_by(affiliate) %>%
summarise_at(vars(revenue,installs),sum)%>%
mutate(RPI = revenue/installs)
And get the below data:
# A tibble: 2 x 4
affiliate revenue installs RPI
<chr> <dbl> <dbl> <dbl>
1 appfocus1 53603. 809580 0.0662
2 appfocus84 174479. 2768181 0.0630
Then I try to plot, by affiliate, the total RPI using ggplot2:
gcor <- ggplot(UC101, aes(x = affiliate, y = RPI)) +
geom_boxplot(color = "dark red")
My problem is the output of the graph: it does not show a full boxplot.
Can anyone help me understand why? This is really my first time using dplyr and ggplot2 together, so any help would be appreciated.
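A likely explanation, offered as a sketch rather than a definitive diagnosis: after summarising, UC101 holds a single RPI value per affiliate, and a boxplot drawn from one observation collapses to a single line. With one value per group, a bar per affiliate may be closer to what is wanted. The data frame below is a hypothetical copy of the summary table shown above.

```r
library(ggplot2)

# Hypothetical summary table shaped like UC101 above
UC101 <- data.frame(
  affiliate = c("appfocus1", "appfocus84"),
  RPI       = c(0.0662, 0.0630)
)

# One value per affiliate, so draw one bar per affiliate
p <- ggplot(UC101, aes(x = affiliate, y = RPI)) +
  geom_col(fill = "darkred")
```

A full boxplot would need several RPI values per affiliate, for example one per day or per browser, computed before plotting instead of a single total.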
year <- c(2000:2014)
group <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"C","C","C","C","C","C","C","C","C","C","C","C","C","C","C")
value <- sample(1:5, 45, replace=TRUE)
df <- data.frame(year,group,value)
df$value[df$value==1] <- NA
year group value
1 2000 A NA
2 2001 A 2
3 2002 A 2
...
11 2010 A 2
12 2011 A 3
13 2012 A 5
14 2013 A NA
15 2014 A 3
16 2000 B 2
17 2001 B 3
...
26 2010 B NA
27 2011 B 5
28 2012 B 4
29 2013 B 3
30 2014 B 5
31 2000 C 5
32 2001 C 4
33 2002 C 3
34 2003 C 4
...
44 2013 C 5
45 2014 C 3
Above is the sample dataframe for my question.
Each group (A, B or C) has a value for each year from 2000 to 2014, but in some years the value might be missing for some of the groups.
The graph I would like to plot is as below:
x-axis is year
y-axis is group (i.e. A, B & C should be shown on the y-axis)
the bar or line represents the value availability of each group
If the value is NA, then the bar would not show at that time point.
ggplot2 is preferred if possible.
Can anyone help?
Thank you.
I think my description is confusing. I am expecting a graph like below, BUT the x-axis would be year. And the bar or line represents the availability of the value for a given group across the year.
In the sample dataframe of group A, we have
2012 A 5
2013 A NA
2014 A 3
Then there should be nothing at the point of group A in 2013, and then a dot would be presented at the point of group A in 2014.
You can use geom_errorbar with no range (geom_errorbarh for horizontal bars). Then just subset for complete.cases (or !is.na(df$value)).
library(ggplot2)
set.seed(10)
year <- c(2000:2014)
group <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"C","C","C","C","C","C","C","C","C","C","C","C","C","C","C")
value <- sample(1:5, 45, replace=TRUE)
df <- data.frame(year,group,value)
df$value[df$value==1] <- NA
no_na_df <- df[complete.cases(df), ]
ggplot(no_na_df, aes(x=year, y = group)) +
geom_errorbarh(aes(xmax = year, xmin = year), size = 2)
Edit:
To get a continuous bar, you can use this slightly unappealing method. It is necessary to make a numeric representation of the group data, to give the bars a width. Thereafter, we can make the scale represent the variables as discrete again.
# since R 4.0, data.frame() keeps strings as character, so convert to factor first
df$group <- factor(df$group)
df$group_n <- as.numeric(df$group)
no_na_df <- df[complete.cases(df), ]
ggplot(no_na_df, aes(xmin=year-0.5, xmax=year+0.5, y = group_n)) +
geom_rect(aes(ymin = group_n-0.1, ymax = group_n+0.1)) +
scale_y_discrete(limits = levels(df$group))
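A different technique that may be simpler, shown here only as a sketch: geom_tile accepts a discrete y directly, so the numeric copy of group is not needed. Tiles are one year wide by default, so consecutive years join into a continuous bar and missing years leave a gap. The height of 0.2 is an arbitrary choice.

```r
library(ggplot2)

set.seed(10)
df <- data.frame(
  year  = rep(2000:2014, times = 3),
  group = rep(c("A", "B", "C"), each = 15),
  value = sample(1:5, 45, replace = TRUE)
)
df$value[df$value == 1] <- NA

# Keep only the rows where a value is available
no_na_df <- df[!is.na(df$value), ]

# One thin tile per available year; group stays discrete on the y-axis
p <- ggplot(no_na_df, aes(x = year, y = group)) +
  geom_tile(height = 0.2)
```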