Sum rows by interval in a dataframe - R

I need help with a research project problem.
The coding problem is: I have a big data frame called FRAMETRUE, and I need to sum certain columns row by row into a new column that I will call Group1.
For example:
head(FRAMETRUE)
Municipalities 1989 1990 1991 1992 1993 1994 1995 1996 1997
A 3 3 5 2 3 4 2 5 3
B 7 1 2 4 5 0 4 8 9
C 10 15 1 3 2 NA 2 5 3
D 7 0 NA 5 3 6 4 5 5
E 5 1 2 4 0 3 5 4 2
I must sum, for each row, the values in the columns 1989 through 1995 into a new column called Group1. The column Group1 should be
Group1
22
23
and so on...
I know it must be something simple; I just don't get it. I'm still learning R.

Here's one way to do it in base R. The trick is using [ combined with rowSums():
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, 2:8], na.rm = TRUE)
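Here, columns 2 through 8 are the year columns 1989 to 1995 (column 1 holds the municipality names). If you prefer not to rely on column positions, here is a sketch that indexes by name instead, assuming the year columns are literally named "1989" through "1995":
# same result, selecting the year columns by name rather than by position
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, as.character(1989:1995)], na.rm = TRUE)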

A dplyr solution that allows you to refer to your columns by their names:
library(dplyr)
municipalities <- LETTERS[1:4]
year1989 <- sample(4)  # random example values, so your numbers will differ
year1990 <- sample(4)
year1991 <- sample(4)
df <- data.frame(municipalities, year1989, year1990, year1991)
# df
municipalities year1989 year1990 year1991
1 A 4 2 2
2 B 3 1 3
3 C 1 3 4
4 D 2 4 1
# Calculate row sums here
df <- mutate(df, Group1 = rowSums(select(df, year1989:year1991)))
# df
municipalities year1989 year1990 year1991 Group1
1 A 4 2 2 8
2 B 3 1 3 7
3 C 1 3 4 8
4 D 2 4 1 7

Remove duplicates based on some conditions

I have two datasets: D1 and D2. D2 is a left join between D1 and a larger dataset, which I will call D3. Although the key column of D2 has the same number of unique elements as D1, it has some duplicates that I want to get rid of based on certain conditions.
There are two problems:
1) There are some rows full of NA values, except for the key value, and these rows are very important to me.
2) There are some other rows which may or may not be duplicated but don't match my standard condition.
How can I remove these duplicates conditionally, based on a hierarchy?
Sample dataset:
ID Var
1 1
2 1
3 1
3 9
4 2
4 9
5 1
6 1
7 1
7 9
7 9
8 2
9
10 1
Expected dataset:
ID Var
1 1
2 1
3 1
4 2
5 1
6 1
7 1
8 2
9
10 1
duplicated does what you need.
dat[!duplicated(dat$ID),]
# ID Var
# 1 1 1
# 2 2 1
# 3 3 1
# 5 4 2
# 7 5 1
# 8 6 1
# 9 7 1
# 12 8 2
# 13 9 NA
# 14 10 1
As does something from the tidyverse:
library(dplyr)
dat %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
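A recent dplyr also has distinct(), which is a shorter spelling of the same "keep the first row per ID" logic; the .keep_all argument retains the remaining columns:
library(dplyr)
dat %>% distinct(ID, .keep_all = TRUE)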
And data.table ...
library(data.table)
as.data.table(dat)[!duplicated(ID), ]
Data:
dat <- read.table(header = TRUE, text = "
ID Var
1 1
2 1
3 1
3 9
4 2
4 9
5 1
6 1
7 1
7 9
7 9
8 2
9 NA
10 1")
Say we have the data.table below:
library(data.table)
df <- data.table(Name = c("JACK", "JOHN", "JACK", "ANNIE", "JOHN", "JACK"),
                 Amount = c(30, 10, 20, 24, 5, 1))
In this case I order by Name, which plays the role of your ID column. Once the rows are in the appropriate order, I take only the first row per group (note that the ordering has to be assigned back, otherwise it is lost):
df <- df[order(Name, Amount)]
df[, .SD[1], by = Name]
Output:
    Name Amount
1: ANNIE     24
2:  JACK      1
3:  JOHN      5
I hope this may help you.

Vectorized solution for using previous value in a column under some condition

This is probably easy, but beyond a clunky for loop I haven't been able to find a vectorized solution for this.
df <- tibble(a=c(1,2,3,4,3,2,5,6,9), b=c(1,2,3,4,4,4,5,6,9))
Column a should be monotonically non-decreasing, so that it ends up looking like column b. So, whenever the next value in a is smaller than the previous value in a, the previous value should be used instead.
Thanks!
We can use lag and fill from tidyverse
library(tidyverse)
df %>%
  mutate(b1 = replace(a, a < lag(a), NA)) %>%
  fill(b1)
# a b b1
# <dbl> <dbl> <dbl>
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 3 4 4
#6 2 4 4
#7 5 5 5
#8 6 6 6
#9 9 9 9
The logic: we replace values in a with NA wherever the previous value is greater than the current one, and then use fill to replace those NAs with the last non-NA value.
Using cummax() from base R:
df[["b1"]] <- cummax(df[["a"]])
> df
a b b1
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 3 4 4
6 2 4 4
7 5 5 5
8 6 6 6
9 9 9 9
Using more dplyr syntax:
df %>%
  mutate(b1 = cummax(a))
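cummax() is the direct tool here because the rule is exactly a running maximum. If the carry-forward rule were ever more complicated, purrr::accumulate() generalizes the same idea of folding a function over the vector while keeping intermediate results; a sketch using max just to show the pattern:
library(dplyr)
library(purrr)
# accumulate(a, max) returns the running maximum, element by element
df %>% mutate(b1 = accumulate(a, max))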

R For Loop with Certain conditions

I have a dataframe (surveillance) with many variables (villages, houses, weeks). I eventually want to do a time-series analysis.
Currently, each village has between 1 and 183 weeks, each of which has several houses associated with it. I need each village to have a single data point per week, so I need to sum up a third variable.
Example:
Village Week House Affect
A 3 7 12
B 6 3 0
C 6 2 2
A 3 9 1
A 5 8 0
A 5 2 8
C 7 19 0
C 7 2 1
I tried this and failed. How do I ask R to only sum observations with the same village and week value?
for (i in seq(along = surveillance)) {
  if (surveillance$village == surveillance$village & surveillance$week == surveillance$week) {
    surveillance$sumaffect <- sum(surveillance$affected)
  }
}
Thanks
No need for a loop. Use ddply or similar:
library(plyr)
Village = c("A", "B", "C", "A", "A", "A", "C", "C")
Week = c(3, 6, 6, 3, 5, 5, 7, 7)
Affect = c(12, 0, 2, 1, 0, 8, 0, 1)
df = data.frame(Village, Week, Affect)
View(df)
result = ddply(df, .(Village, Week), summarise, val = sum(Affect))
View(result)
DF:
Village Week Affect
1 A 3 12
2 B 6 0
3 C 6 2
4 A 3 1
5 A 5 0
6 A 5 8
7 C 7 0
8 C 7 1
Result:
Village Week val
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
The function aggregate will do what you need.
dfs <- ' Village Week House Affect
1 A 3 7 12
2 B 6 3 0
3 C 6 2 2
4 A 3 9 1
5 A 5 8 0
6 A 5 2 8
7 C 7 19 0
8 C 7 2 1
'
df <- read.table(text=dfs)
Then the aggregation
> aggregate(Affect ~ Village + Week, data = df, sum)
Village Week Affect
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
This is an example of a split-apply-combine strategy; if you find yourself doing this often, you should investigate dplyr (or plyr, its ancestor) or data.table as alternatives for quickly doing this sort of analysis.
EDIT: updated to use sum instead of mean
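For reference, the dplyr version of the same split-apply-combine step might look like the sketch below; the .groups = "drop" argument (available in dplyr 1.0+) returns an ungrouped result:
library(dplyr)
df %>%
  group_by(Village, Week) %>%
  summarise(val = sum(Affect), .groups = "drop")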

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new grouping variable using cumsum: whenever the difference between two consecutive numbers in group is not 0, a new group number is assigned. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
  group_by(foo) %>%
  mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
r <- rle(df$group)
df$aux <- rep(seq_along(r$values), times = r$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table v1.9.5. The devel version introduced the new functions rleid() and shift() (whose default type is "lag" and default fill is NA), which are useful for this.
library(data.table)
setDT(df)[, expected := value - shift(value), by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
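If you are on dplyr 1.1.0 or later, consecutive_id() builds the same run-length chunk id without a helper column or data.table; a sketch:
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0
df %>%
  group_by(chunk = consecutive_id(group)) %>%
  mutate(expected = value - lag(value)) %>%
  ungroup()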

reshape a dataframe R

I am facing a reshaping problem with a dataframe. The real one has many more rows and columns. Simplified, its structure looks like this:
rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2020 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16
I would like to end up with a dataframe that has a single row per value of "year", with the x1, x2, x3 values copied into subsequent columns, and each column named as a combination of the rowname and the x-variable. It should look like this:
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
2000 2 6 11 0 4 2 0 3 5
2010 2 6 11 0 0 0 4 1 8
2020 10 1 7 8 4 10 22 1 16
I thought of using repeated cbind() calls, but since I have to do this for thousands of rows and hundreds of columns, I hope there is a more direct way with the reshape package (with which I am not so familiar yet).
Thanks in advance!
First, I hope that rownames is a data.frame column and not the data.frame's rownames. Otherwise you'll encounter problems due to the non-uniqueness of the values.
I think your main problem is that your data.frame is not entirely molten:
library(reshape2)
dt <- melt(dt, id.vars = c("year", "rownames"))
head(dt)
year rownames variable value
1 2000 a x1 2
2 2000 b x1 0
3 2000 c x1 0
4 2010 a x1 2
...
dcast( dt, year ~ rownames + variable )
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
1 2000 2 6 11 0 4 2 0 3 5
2 2010 2 6 11 0 0 0 4 1 8
3 2020 10 1 7 8 4 10 22 1 16
EDIT:
As @spdickson points out, there is also an error in your data preventing a simple aggregation: combinations of year and rownames have to be unique, of course. Otherwise you need an aggregation function that determines the resulting value for non-unique combinations. So we assume that row 6 in your data should read c 2010 4 1 8.
You can try using reshape() from base R without having to melt your dataframe further:
df1 <- read.table(text="rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2010 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16",header=T,as.is=T)
reshape(df1,direction="wide",idvar="year",timevar="rownames")
# year x1.a x2.a x3.a x1.b x2.b x3.b x1.c x2.c x3.c
# 1 2000 2 6 11 0 4 2 0 3 5
# 4 2010 2 6 11 0 0 0 4 1 8
# 7 2020 10 1 7 8 4 10 22 1 16
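For completeness, if a recent tidyr is available, pivot_wider() produces the a_x1-style names directly via its names_glue argument; a sketch:
library(tidyr)
# names_glue controls how the new column names are assembled
pivot_wider(df1, id_cols = year, names_from = rownames,
            values_from = c(x1, x2, x3),
            names_glue = "{rownames}_{.value}")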
