This is my dataset
Group Status From To
Blue No 1994 2000
Red No 1994 1997
Red Yes 1998 2002
Yellow No 1994 2014
Yellow Yes 2015 2021
Purple No 1994 1997
I would like to remove the rows with Status = No, but only where they belong to a Group that appears more than once.
For instance, the groups Red and Yellow each have 2 rows, so I would like to drop the row with Status = No within these two groups. The final dataset should look like this:
Group Status From To
Blue No 1994 2000
Red Yes 1998 2002
Yellow Yes 2015 2021
Purple No 1994 1997
Any suggestions would be much appreciated. Thanks.
You can keep only the rows with Status == 'Yes' when the group has more than one row, and keep every row otherwise.
library(dplyr)
df %>%
group_by(Group) %>%
filter(if(n() > 1) Status == 'Yes' else TRUE) %>%
ungroup
# Group Status From To
# <chr> <chr> <int> <int>
#1 Blue No 1994 2000
#2 Red Yes 1998 2002
#3 Yellow Yes 2015 2021
#4 Purple No 1994 1997
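For this data, since 'Yes' sorts after 'No', we can also sort each group by descending Status and keep its first row (note that this reorders the rows by Group):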
df %>%
arrange(Group, desc(Status)) %>%
distinct(Group, .keep_all = TRUE)
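For comparison, here is a minimal base R sketch of the same filtering rule, assuming the same `df` as in this answer (rebuilt inline so the snippet runs on its own). The names `group_size` and `result` are only illustrative.

```r
# Same data as in the question, rebuilt so the snippet is self-contained
df <- data.frame(
  Group  = c("Blue", "Red", "Red", "Yellow", "Yellow", "Purple"),
  Status = c("No", "No", "Yes", "No", "Yes", "No"),
  From   = c(1994L, 1994L, 1998L, 1994L, 2015L, 1994L),
  To     = c(2000L, 1997L, 2002L, 2014L, 2021L, 1997L)
)

# Number of rows in each row's group, aligned with the rows of df
group_size <- ave(seq_along(df$Group), df$Group, FUN = length)

# Keep a row if its group has a single row, or if its Status is "Yes"
result <- df[group_size == 1 | df$Status == "Yes", ]
result
```

This version keeps the rows in their original order, which the arrange()/distinct() variant does not.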
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Group = c("Blue", "Red", "Red", "Yellow", "Yellow",
"Purple"), Status = c("No", "No", "Yes", "No", "Yes", "No"),
From = c(1994L, 1994L, 1998L, 1994L, 2015L, 1994L), To = c(2000L,
1997L, 2002L, 2014L, 2021L, 1997L)),
class = "data.frame", row.names = c(NA, -6L))
I have a table such as this one:
Year Type Value
1991 A 4945
1991 B 525
1991 C 764
1992 A 640
1992 B 3935
1992 D 49
1993 K 49
I would like to generate a new column that calculates the percentage of each type within each year. The types may change from year to year, and some years have only one type.
E.g., the first percentage should be 4945/(4945+525+764).
Any help would be very welcome. Thank you very much!
Group by 'Year' and compute the proportions of 'Value' within each group:
library(dplyr)
df1 %>%
group_by(Year) %>%
mutate(new = proportions(Value) * 100) %>%
ungroup
-output
# A tibble: 6 × 4
Year Type Value new
<int> <chr> <int> <dbl>
1 1991 A 4945 79.3
2 1991 B 525 8.42
3 1991 C 764 12.3
4 1992 A 640 13.8
5 1992 B 3935 85.1
6 1992 D 49 1.06
Or use base R with ave
df1$new <- with(df1, ave(Value, Year, FUN = proportions) * 100)
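As a quick sanity check of what `proportions()` computes within one group, the 1991 values can be worked by hand. This is only an illustrative sketch; the vector name `value_1991` is made up here. `proportions()` is available from R 4.0.1, and `prop.table()` does the same in older versions.

```r
# The three 1991 values from the question
value_1991 <- c(A = 4945, B = 525, C = 764)

# Manual calculation: each value divided by the group total
pct_manual <- value_1991 / sum(value_1991) * 100

# Same calculation using proportions()
pct_builtin <- proportions(value_1991) * 100

round(pct_manual, 2)
#     A     B     C
# 79.32  8.42 12.26
```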
data
df1 <- structure(list(Year = c(1991L, 1991L, 1991L, 1992L, 1992L, 1992L
), Type = c("A", "B", "C", "A", "B", "D"), Value = c(4945L, 525L,
764L, 640L, 3935L, 49L)), class = "data.frame", row.names = c(NA,
-6L))
How to add a 0 amount for source solar in year 1990 to the dataframe below? There's presently no value for solar in 1990.
Data:
year source amount
1990 coal 19203
1990 nuclear 2345
1991 coal 18490
1991 nuclear 2398
1991 solar 123
1992 ... ...
... ... ...
2019 ... ...
Code:
data <- read.csv('annual_generation.csv')
data$source <- as.factor(data$source)
This doesn't work but it's the general idea:
for(i in 1990:2019) {
for (j in data$source) {
if (!data[i][j])
data[i][j] = 0
}
}
Edit: Based on the answer below, this was the final solution:
data <- complete(data, YEAR, STATE, ENERGY.SOURCE,
fill = list(
GEN = 0,
TYPE.OF.PRODUCER = 'Total Electric Power Industry'))
YEAR STATE ENERGY.SOURCE TYPE.OF.PRODUCER GEN
<int><fct> <fct> <fct> <dbl>
1 1990 IL Coal Total Electric Power Industry 54966018
...
We can use complete from tidyr
library(tidyr)
complete(data, year, source, fill = list(amount = 0))
-output
# A tibble: 6 x 3
# year source amount
# <int> <chr> <dbl>
#1 1990 coal 19203
#2 1990 nuclear 2345
#3 1990 solar 0
#4 1991 coal 18490
#5 1991 nuclear 2398
#6 1991 solar 123
Also, if some 'year' values are missing from the data entirely, we can specify the full range:
complete(data, year = 1990:2019, source, fill = list(amount = 0))
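If tidyr is not available, roughly the same result can be sketched in base R by building every year/source combination with `expand.grid()` and left-joining the observed rows onto it. The names `gen_data`, `grid` and `filled` are hypothetical and used only for this illustration.

```r
# Same example rows as in this answer, under a different (hypothetical) name
gen_data <- data.frame(
  year   = c(1990L, 1990L, 1991L, 1991L, 1991L),
  source = c("coal", "nuclear", "coal", "nuclear", "solar"),
  amount = c(19203L, 2345L, 18490L, 2398L, 123L)
)

# Every combination of observed year and source
grid <- expand.grid(
  year   = unique(gen_data$year),
  source = unique(gen_data$source),
  stringsAsFactors = FALSE
)

# Left join: combinations with no observed row get NA in amount
filled <- merge(grid, gen_data, by = c("year", "source"), all.x = TRUE)

# Replace the missing amounts with 0 and sort for readability
filled$amount[is.na(filled$amount)] <- 0L
filled <- filled[order(filled$year, filled$source), ]
filled
```

To cover years that never appear in the data, `unique(gen_data$year)` could be swapped for an explicit range such as `1990:2019`.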
data
data <- structure(list(year = c(1990L, 1990L, 1991L, 1991L, 1991L),
source = c("coal",
"nuclear", "coal", "nuclear", "solar"), amount = c(19203L, 2345L,
18490L, 2398L, 123L)), class = "data.frame", row.names = c(NA,
-5L))
I've got a long dataframe like this:
year value town
2001 0.15 ny
2002 0.19 ny
2002 0.14 ca
2001 NA ny
2002 0.15 ny
2002 0.12 ca
2001 NA ny
2002 0.13 ny
2002 0.1 ca
I want to calculate a mean value per year and per town. Like this:
df %>% group_by(year, town) %>% summarise(mean_year = mean(value, na.rm=T))
However, I only want to summarise those town values which have more than 2 non-NA values. In the example above, I don't want to summarise year 2001 for ny because it only has 1 non-NA value.
So the output would be like this:
town year mean_year
ny 2001 NA
ny 2002 0.157
ca 2002 0.12
Try this, which returns the mean only when a group has at least 2 non-NA values and NA otherwise:
df %>% group_by(year, town) %>%
summarise(mean_year = ifelse(sum(!is.na(value)) >= 2, mean(value, na.rm = TRUE), NA))
# A tibble: 3 x 3
# Groups: year [2]
year town mean_year
<int> <chr> <dbl>
1 2001 ny NA
2 2002 ca 0.12
3 2002 ny 0.157
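The same thresholded mean can be sketched in base R with `aggregate()`. This is an assumption-laden illustration rather than part of the answer above: the helper `mean_if_enough` and its `min_n` argument are hypothetical names, and `na.action = na.pass` is needed so that rows with NA reach the helper instead of being dropped beforehand.

```r
# Same data as in the question
df <- data.frame(
  year  = c(2001L, 2002L, 2002L, 2001L, 2002L, 2002L, 2001L, 2002L, 2002L),
  value = c(0.15, 0.19, 0.14, NA, 0.15, 0.12, NA, 0.13, 0.1),
  town  = c("ny", "ny", "ca", "ny", "ny", "ca", "ny", "ny", "ca")
)

# Hypothetical helper: mean of x if it has at least min_n non-NA values, else NA
mean_if_enough <- function(x, min_n = 2) {
  if (sum(!is.na(x)) >= min_n) mean(x, na.rm = TRUE) else NA_real_
}

# na.pass keeps the NA rows so the helper can count them per group
res <- aggregate(value ~ year + town, data = df,
                 FUN = mean_if_enough, na.action = na.pass)
res
```

Raising `min_n` to 3 would implement a stricter "more than 2" rule; on this data both thresholds give the same result.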
dput
> dput(df)
structure(list(year = c(2001L, 2002L, 2002L, 2001L, 2002L, 2002L,
2001L, 2002L, 2002L), value = c(0.15, 0.19, 0.14, NA, 0.15, 0.12,
NA, 0.13, 0.1), town = c("ny", "ny", "ca", "ny", "ny", "ca",
"ny", "ny", "ca")), class = "data.frame", row.names = c(NA, -9L
))
I am trying to melt/stack/gather multiple specific columns of a dataframe into 2 columns, retaining all the others.
I have tried many, many answers on Stack Overflow without success (some below). I basically have a situation similar to this post:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
only with many more columns to retain and combine. It is important to mention that my year columns are factors, and that I have many more columns than the sample listed below, so I want to refer to columns by name, not by position.
>df
ID Code Country year.x value.x year.y value.y year.x.x value.x.x
1 A USA 2000 34.33422 2001 35.35241 2002 42.30042
1 A Spain 2000 34.71842 2001 39.82727 2002 43.22209
3 B USA 2000 35.98180 2001 37.70768 2002 44.40232
3 B Peru 2000 33.00000 2001 37.66468 2002 41.30232
4 C Argentina 2000 37.78005 2001 39.25627 2002 45.72927
4 C Peru 2000 40.52575 2001 40.55918 2002 46.62914
I tried using the pivot_longer in tidyr based on the post above which seemed very similar, which resulted in various errors depending on what I did:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = ".")
I also played with melt in reshape2 in various ways which either melted only the values columns or only the years columns. Such as:
new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")
I also tried dplyr gather based on other posts but I find it extremely difficult to understand the help page and posts.
To be clear what I am looking to achieve:
ID Code Country year value
1 A USA 2000 34.33422
1 A Spain 2000 34.71842
3 B USA 2000 35.98180
3 B Peru 2000 33.00000
4 C Argentina 2000 37.78005
4 C Peru 2000 40.52575
1 A USA 2001 35.35241
1 A Spain 2001 39.82727
3 B USA 2001 37.70768
3 B Peru 2001 37.66468
4 C Argentina 2001 39.25627
4 C Peru 2001 40.55918
1 A USA 2002 42.30042
etc.
I really appreciate the help here.
We can specify the names_pattern
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or use names_sep with an escaped dot. According to ?pivot_longer:
names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
This means a string passed to names_sep is interpreted as a regular expression, where an unescaped . matches any character rather than a literal dot. To match a literal dot, either escape it (\\.) or place it inside square brackets ([.]):
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "\\.")
# A tibble: 18 x 6
# ID Code Country group year value
# <int> <chr> <chr> <chr> <int> <dbl>
# 1 1 A USA x 2000 34.3
# 2 1 A USA y 2001 35.4
# 3 1 A USA z 2002 42.3
# 4 1 A Spain x 2000 34.7
# 5 1 A Spain y 2001 39.8
# 6 1 A Spain z 2002 43.2
# 7 3 B USA x 2000 36.0
# 8 3 B USA y 2001 37.7
# 9 3 B USA z 2002 44.4
#10 3 B Peru x 2000 33
#11 3 B Peru y 2001 37.7
#12 3 B Peru z 2002 41.3
#13 4 C Argentina x 2000 37.8
#14 4 C Argentina y 2001 39.3
#15 4 C Argentina z 2002 45.7
#16 4 C Peru x 2000 40.5
#17 4 C Peru y 2001 40.6
#18 4 C Peru z 2002 46.6
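The difference between an unescaped and an escaped dot can be seen in isolation with base R's `strsplit()`, which also treats its split argument as a regular expression by default. This is only a small illustration on two of the column names; the variable names are made up for the example.

```r
col_names <- c("year.x", "value.x")

# Unescaped dot: every character is a separator, so only empty pieces remain
unescaped <- strsplit(col_names, ".")

# Escaped dot, or dot inside a character class: split on the literal "."
escaped   <- strsplit(col_names, "\\.")
bracketed <- strsplit(col_names, "[.]")

# fixed = TRUE turns off regex interpretation altogether
literal   <- strsplit(col_names, ".", fixed = TRUE)

escaped
# [[1]]
# [1] "year" "x"
#
# [[2]]
# [1] "value" "x"
```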
Update
For the updated dataset
library(stringr)
df2 %>%
rename_at(vars(matches("year|value")), ~
str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),names_pattern = "(.*)\\.(.*)")
Or without the rename, use regex lookaround
df2 %>%
pivot_longer(cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = "(?<=year|value)\\.")
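To see what split is wanted for names such as year.x.x, here is a base R sketch that separates the measure (year or value) from the rest of the name at the first dot only. It uses `sub()` with capture groups rather than a lookbehind, and the names `nms`, `measure` and `group` are illustrative.

```r
nms <- c("year.x", "value.y", "year.x.x", "value.x.x")

# Capture the measure before the first dot and everything after it
pattern <- "^(year|value)\\.(.*)$"
measure <- sub(pattern, "\\1", nms)
group   <- sub(pattern, "\\2", nms)

data.frame(name = nms, measure = measure, group = group)
#        name measure group
# 1    year.x    year     x
# 2   value.y   value     y
# 3  year.x.x    year   x.x
# 4 value.x.x   value   x.x
```

The inner dot in x.x stays part of the group label, so year.x.x and value.x.x still pair up in the same output row.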
data
df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L,
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927,
46.62914)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L,
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232,
45.72927, 46.62914)), class = "data.frame", row.names = c(NA,
-6L))
I’m a beginner with R and appreciate all the help on this website. But I have been unable to locate a solution to a little problem...
I have 3 columns of data: SchoolName, Year, SATScore
There are many different school names, and for each school name, there is a “Year” which ranges from 2001-2012. (ex., JFK high school has 12 years of SAT data).
For each high school, I need to calculate the difference between SAT score in 2012 and SAT score in 2001.
A pivot table in Excel does this in a few minutes, but I’d like to learn how to do it in R.
Thanks in advance,
Paul
The answer will depend on the format of your data. If it looks like this
dat <- structure(list(shool = c("a", "a", "a", "b", "b", "b", "c", "c",
"c"), year = c(2001L, 2004L, 2012L, 2001L, 2005L, 2012L, 2001L,
2007L, 2012L), sat = c(12L, 45L, 5L, 6L, 8L, 9L, 44L, 55L, 5L
)), .Names = c("shool", "year", "sat"), class = "data.frame", row.names = c(NA,
-9L))
>dat
# shool year sat
#1 a 2001 12
#2 a 2004 45
#3 a 2012 5
#4 b 2001 6
#5 b 2005 8
#6 b 2012 9
#7 c 2001 44
#8 c 2007 55
#9 c 2012 5
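Then, provided every school has exactly one 2001 row and one 2012 row and the schools appear in the same order in both years, you can simply do: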
dat$sat[dat$year == 2012] - dat$sat[dat$year == 2001]
If things are not ordered so nicely, I suggest:
library(plyr)
ddply(dat, .(shool), summarise,
difference = sat[year == 2012] - sat[year == 2001] )
# shool difference
# 1 a -7
# 2 b 3
# 3 c -39
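The same per-school difference can also be sketched without plyr, using base R's `split()` and `sapply()`. This assumes the same `dat` as above (rebuilt inline, keeping its `shool` column name) and that each school has exactly one row for each of the two years.

```r
# Same data as above, rebuilt so the snippet is self-contained
dat <- data.frame(
  shool = rep(c("a", "b", "c"), each = 3),
  year  = c(2001L, 2004L, 2012L, 2001L, 2005L, 2012L, 2001L, 2007L, 2012L),
  sat   = c(12L, 45L, 5L, 6L, 8L, 9L, 44L, 55L, 5L)
)

# One data frame per school, then the 2012 score minus the 2001 score in each
diffs <- sapply(split(dat, dat$shool), function(d) {
  d$sat[d$year == 2012] - d$sat[d$year == 2001]
})
diffs
#   a   b   c
#  -7   3 -39
```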
I'm assuming your data is in a data frame called data, with the columns SchoolName, Year and SATScore from the question. You can do the following:
data2001 <- data.frame(SchoolName = data$SchoolName[data$Year == 2001], Score2001 = data$SATScore[data$Year == 2001])
data2012 <- data.frame(SchoolName = data$SchoolName[data$Year == 2012], Score2012 = data$SATScore[data$Year == 2012])
stats <- merge(data2001, data2012)
stats$Difference <- stats$Score2012 - stats$Score2001
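Here is a small worked run of the merge approach on made-up data. The school names and scores are hypothetical and only serve to show what happens when a school has no 2001 row: by default `merge()` drops such a school, while `all = TRUE` keeps it with NA.

```r
# Hypothetical scores; "Adams" has no 2001 row
scores <- data.frame(
  SchoolName = c("JFK", "JFK", "Lincoln", "Lincoln", "Adams"),
  Year       = c(2001L, 2012L, 2001L, 2012L, 2012L),
  SATScore   = c(1000L, 1100L, 950L, 900L, 1200L)
)

s2001 <- scores[scores$Year == 2001, c("SchoolName", "SATScore")]
s2012 <- scores[scores$Year == 2012, c("SchoolName", "SATScore")]

# all = TRUE keeps schools missing one of the two years
stats <- merge(s2001, s2012, by = "SchoolName", all = TRUE,
               suffixes = c("2001", "2012"))
stats$Difference <- stats$SATScore2012 - stats$SATScore2001
stats
```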