I have a dataframe like this
X 2001,2002,2003
JAN NA,1,2
JUN NA,2,3
DEC 1,2,NA
I want to store the values in a vector so that I can generate a time series from it.
The intended output is ordered by year and then by month, with the NAs omitted:
output=c(1,1,2,2,2,3)
How can I do this?
You might go in this direction:
library(tidyverse)
dta <- tribble(
~X, ~"2001", ~"2002", ~"2003",
"JAN", NA, 1, 2,
"JUN", NA, 2, 3,
"DEC", 1, 2, NA)
dta %>%
pivot_longer(cols = '2001':'2003',
names_to = "year",
values_to = "val") %>%
arrange(year) %>%
filter(!is.na(val))
However, you need to ensure that the months are sorted correctly within each year.
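One way to make the month order explicit is to turn the month column into a factor whose levels follow the calendar, then sort by year and month before pulling out the values. This is only a sketch: it assumes the month labels are the upper-case three-letter abbreviations used in the question, and the `month` column is a name introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Same data as in the question; check.names = FALSE keeps the year names as-is
dta <- data.frame(
  X = c("JAN", "JUN", "DEC"),
  `2001` = c(NA, NA, 1),
  `2002` = c(1, 2, 2),
  `2003` = c(2, 3, NA),
  check.names = FALSE
)

output <- dta %>%
  pivot_longer(cols = -X, names_to = "year", values_to = "val") %>%
  # Calendar order instead of alphabetical order (assumes JAN, FEB, ... labels)
  mutate(month = factor(X, levels = toupper(month.abb))) %>%
  arrange(year, month) %>%
  filter(!is.na(val)) %>%
  pull(val)

output
```

With the factor in place, the result no longer depends on the row order of the input table.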
I can only find information on finding the max value for each row.
But I need the max value across multiple rows and columns, and the name of the column that contains it.
For example, if my dataset looks like:
data <- data.frame(Year = c(2001, 2002, 2003),
X = c(3, 2, 45),
Y = c(6, 20, 23),
Z = c(10, 4, 4))
I want my code to return "X" because 45 is the maximum.
I suppose one way to approach this is to turn your wide dataset into a long (tidy) table, filter for the max value, and then extract the corresponding column name.
library(tidyverse)
df <- read.table(text = "Year X Y Z
2001 3 6 10
2002 2 20 4
2003 45 23 4", header = TRUE)
df %>%
  pivot_longer(cols = c("X", "Y", "Z"), names_to = "column") %>%
  filter(max(value) == value) %>%
  pull(column)
# [1] "X"
And if you have a large number of columns, you can "pivot" your data from wide to long without specifying all the column names (as I do in the pivot_longer(...) command above) by running this instead:
df %>%
pivot_longer(cols = setdiff(names(.), "Year"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
A base R solution:
Assuming that you want to exclude the Year variable from this analysis:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 4, 5))
dat_ex_year <- dat[, !names(dat) %in% c("Year")]
names(dat_ex_year)[which(dat_ex_year == max(dat_ex_year), arr.ind = TRUE)[,2]]
which gives:
[1] "X"
EDIT: I slightly adjusted the code so that it returns all column names when the maximum value is found in several columns, e.g. with:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 45, 5))
the code gives:
[1] "X" "Y"
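The base R approach can be wrapped in a small function. The sketch below is one possible variant, not the answerer's exact code: `max_col_names` and its `exclude` argument are hypothetical names, `drop = FALSE` is added so that a single remaining column stays a data frame, and `unique()` is added so that a column name is not repeated when the maximum occurs more than once in the same column.

```r
# Hypothetical helper: names of the column(s) holding the overall maximum
max_col_names <- function(dat, exclude = "Year") {
  # drop = FALSE keeps a data frame even if only one column remains
  vals <- dat[, !names(dat) %in% exclude, drop = FALSE]
  # Work on a matrix so that which() can report row/column positions
  m <- as.matrix(vals)
  hits <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)
  # unique() avoids repeating a name when the maximum occurs twice in one column
  unique(colnames(m)[hits[, "col"]])
}

dat <- data.frame(Year = c(2000, 2001, 2002),
                  X = c(1, 45, 45),
                  Y = c(3, 45, 5))
max_col_names(dat)
```

Here the maximum 45 appears twice in X and once in Y, and each name is reported once.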
I am currently using station data for my research in R, and I need to count the number of missing/null values for each month. The data is currently in daily measurements, and the monthly total of missing values would let me trim certain months out if they are not useful.
CUM00078310_df %>%
dplyr::mutate(
Month=month(Date),
Mis = rowSums(is.na(.[,grepl("C",colnames(CUM00078310_df))]))
) %>%
group_by(Month) %>%
summarize(Sum=sum(Mis), Percentage=mean(Mis))
Here is an example. I'm not sure whether you want the data summarized or kept within the dataframe; if you don't want it summarized, omit the final two lines of code. With your data, add the month grouping variable to group_by(). If you only need the NAs, add filter(is.na(x)).
library(dplyr)
df <- data.frame(x = c(NA, 2, 5, 10, 15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
df <- df %>%
  group_by(x) %>%
  mutate(valueCount = n()) %>%   # number of rows per distinct value (NA is its own group)
  arrange(desc(valueCount)) %>%
  group_by(x, valueCount) %>%
  summarise()
df<-data.frame(x = c(NA,2,5,10,15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
Unsummarized example:
df <- df %>%
  group_by(x) %>%
  mutate(valueCount = n()) %>%
  arrange(desc(valueCount))
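For the question as asked (missing values per month), a sketch along the following lines might be closer to what is needed. The data frame `station` and its columns `Date` and `TMAX` are made up for illustration; with real station data, substitute the actual column names.

```r
library(dplyr)

# Toy daily data (hypothetical): 90 days, some measurements missing
station <- data.frame(
  Date = as.Date("2020-01-01") + 0:89,
  TMAX = c(rep(NA, 5), rep(20, 55), rep(NA, 10), rep(25, 20))
)

monthly_na <- station %>%
  # "%Y-%m" keeps the same month of different years apart
  mutate(Month = format(Date, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(
    n_days = n(),                          # days with a record in that month
    n_missing = sum(is.na(TMAX)),          # how many of them are NA
    pct_missing = 100 * mean(is.na(TMAX))  # share of NA values in percent
  )

monthly_na
```

Months with too many gaps could then be dropped by filtering on `pct_missing` and joining the result back to the daily data.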
Here's a simple example of what I'm looking for:
Before:
data.frame(
Name = c("pusheen", "pusheen", "puppy"),
Species = c("feline", "feline", "doggie"),
Activity = c("snacking", "napping", "playing"),
Start = c(1, 2, 3),
End = c(11, 12, 13)
)
After:
data.frame(
Name = c("pusheen", "puppy"),
Species = c("feline", "doggie"),
Activity1 = c("snacking", "playing"),
Start1 = c(1, 3),
End1 = c(11, 13),
Activity2 = c("napping", NA),
Start2 = c(2, NA),
End2 = c(12, NA)
)
How do I do this in R or Excel? Thanks!
This can be done using pivot_wider from the tidyr package.
library(tidyr)
library(dplyr)
library(magrittr)
df <- df %>%
group_by(Name) %>%
mutate(num = row_number()) %>% # Create a counter by group
ungroup() %>%
pivot_wider(
id_cols = c("Name", "Species"),
names_from = num,
values_from = c("Activity", "Start", "End"),
names_sep = "")
If you want the result ordered as in your sample output, we can add an additional select statement. I used str_sub from the stringr package to pull out the last character of each column name, and then sorted the names by that character. Because only the last character is used, this ordering works as long as no name has more than nine activities.
library(stringr)
df %>%
select(Name, Species, names(df)[order(str_sub(names(df), -1))])
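As an alternative to reordering the columns afterwards, `pivot_wider()` can produce the grouped order itself through its `names_vary` argument. The following is a self-contained sketch using the data from the question; it assumes tidyr 1.2.0 or later, where `names_vary` is available, and `wide` is a name chosen here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Name = c("pusheen", "pusheen", "puppy"),
  Species = c("feline", "feline", "doggie"),
  Activity = c("snacking", "napping", "playing"),
  Start = c(1, 2, 3),
  End = c(11, 12, 13)
)

wide <- df %>%
  group_by(Name) %>%
  mutate(num = row_number()) %>%  # counter within each Name
  ungroup() %>%
  pivot_wider(
    id_cols = c(Name, Species),
    names_from = num,
    values_from = c(Activity, Start, End),
    names_sep = "",
    names_vary = "slowest"  # keep Activity/Start/End together per counter value
  )

names(wide)
```

Since the order comes from the counter rather than from the last character of the name, this should not be limited to nine activities.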
I have a dataframe of 96074 obs. of 31 variables.
The first two variables are the id and the date, followed by 9 measurement columns (three different KPIs, each with three different time properties), and then various technical and geographical variables.
df <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d_1day_old = rnorm(9, 2, 1),
sum_i_1day_old = rnorm(9, 2, 1),
per_i_d_1day_old = rnorm(9, 0, 1),
sum_d_5days_old = rnorm(9, 0, 1),
sum_i_5days_old = rnorm(9, 0, 1),
per_i_d_5days_old = rnorm(9, 0, 1),
sum_d_15days_old = rnorm(9, 0, 1),
sum_i_15days_old = rnorm(9, 0, 1),
per_i_d_15days_old = rnorm(9, 0, 1)
)
I want to transform the data from wide to long, in order to make graphs with ggplot, using facets for example.
If I had a df with just one variable and its three time scans, I would have no problem using gather:
plotdf <- df %>%
gather(sum_d, value,
c(sum_d_1day_old, sum_d_5days_old, sum_d_15days_old),
factor_key = TRUE)
But having three different variables trips me up.
I would like to have this output:
plotdf <- data.frame(
id = rep(1:3, 3),
time = rep(as.Date('2009-01-01') + 0:2, each = 3),
sum_d = rep(c("sum_d_1day_old", "sum_d_5days_old", "sum_d_15days_old"), 3),
values_sum_d = rnorm(9, 2, 1),
sum_i = rep(c("sum_i_1day_old", "sum_i_5days_old", "sum_i_15days_old"), 3),
values_sum_i = rnorm(9, 2, 1),
per_i_d = rep(c("per_i_d_1day_old", "per_i_d_5days_old", "per_i_d_15days_old"), 3),
values_per_i_d = rnorm(9, 2, 1)
)
with id, sum_d, sum_i and per_i_d of class factor, time of class Date, and the values of class numeric. (I should add that I don't have negative measures in these variables.)
What I've tried to do:
# gather all of the variables into a single column
plotdf <- gather(df, key, value, sum_d_1day_old:per_i_d_15days_old, factor_key = TRUE)
# create a new column with the name of the KPI, without the time specification
plotdf$KPI <- paste(sapply(strsplit(as.character(plotdf$key), "_"), "[[", 1),
                    sapply(strsplit(as.character(plotdf$key), "_"), "[[", 2), sep = "_")
# create a new variable with the full name of the KPI, attaching the value at the end
plotdf %>% unite(value2, key, value) %>%
  # spread
  mutate(i = row_number()) %>% spread(KPI, value2) %>% select(-i)
But spread creates rows with NAs.
To replace them, at first I used
group_by(id, date) %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "down") %>%
fill(c(sum_d, sum_i, per_i_d), .direction = "up") %>%
But the problem is that there are already some measurements with NAs in the original df in the variable per_i_d (44 in total), so I lose that information.
I thought that I could replace the NAs in the original df with a dummy value and then replace the NAs back, but then I thought that there could be a more efficient solution for all of my problem.
After I replaced the NAs, my idea was to use slice(1) to select only the first row of each couple id/date, then do some manipulation with separate/unite to have the output I desired.
I actually did that, but then I remembered I had those aforementioned NAs in the original df.
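library(tidyverse)
df %>%
  gather(key, value, -id, -time) %>%
  # type: KPI name (note that this pattern shortens per_i_d to per_i)
  mutate(type = str_extract(key, '[a-z]+_[a-z]'),
         # age: the time specification, e.g. 1day_old
         age = str_extract(key, '[0-9]+[a-z]+_[a-z]+')) %>%
  select(-key) %>%
  spread(type, value)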
gives
id time age per_i sum_d sum_i
1 1 2009-01-01 15days_old 0.8132301 0.8888928 0.077532040
2 1 2009-01-01 1day_old -2.0993199 2.8817133 3.047894196
3 1 2009-01-01 5days_old -0.4626151 -1.0002926 0.327102000
4 1 2009-01-02 15days_old 0.4089618 -1.6868523 0.866412133
5 1 2009-01-02 1day_old 0.8181313 3.7118065 3.701018419
...
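The same reshaping can also be sketched with `pivot_longer()` and the special `".value"` name, which splits each column name into a KPI part and an age part in one step. This is an alternative to the gather/spread code above, not the answerer's method. It assumes every measurement column ends in something like `_1day_old` or `_5days_old`; the regular expression and the injected NA are illustrative choices.

```r
library(tidyr)

set.seed(1)
df <- data.frame(
  id = rep(1:3, 3),
  time = rep(as.Date("2009-01-01") + 0:2, each = 3),
  sum_d_1day_old = rnorm(9, 2, 1),
  sum_i_1day_old = rnorm(9, 2, 1),
  per_i_d_1day_old = rnorm(9, 0, 1),
  sum_d_5days_old = rnorm(9, 0, 1),
  sum_i_5days_old = rnorm(9, 0, 1),
  per_i_d_5days_old = rnorm(9, 0, 1),
  sum_d_15days_old = rnorm(9, 0, 1),
  sum_i_15days_old = rnorm(9, 0, 1),
  per_i_d_15days_old = rnorm(9, 0, 1)
)
df$per_i_d_5days_old[2] <- NA  # an NA in the original data that should survive

long <- pivot_longer(
  df,
  cols = -c(id, time),
  # first capture group becomes the output column name, second the age
  names_to = c(".value", "age"),
  names_pattern = "(.*)_(\\d+days?_old)"
)

head(long)
```

Each id/time pair yields one row per age, with `sum_d`, `sum_i` and `per_i_d` as separate value columns. No fill step is needed, so NAs that were already in the data stay where they were.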
EDIT: adding non-value columns to the dataframe:
df %>%
gather(key,value,-id,-time) %>%
mutate(type = str_extract(key,'[a-z]+_[a-z]'),
age = str_extract(key, '[0-9]+[a-z]+_[a-z]+'),
info = paste(age,type,sep = "_")) %>%
select(-key) %>%
gather(key,value,-id,-time,-age,-type) %>%
unite(dummy,type,key) %>%
spread(dummy,value)