R For Loop with Certain conditions - r

I have a dataframe (surveillance) with many variables (villages, houses, weeks). I want to eventually do a time-series analysis.
Currently each village has between 1 and 183 weeks, each of which has several houses associated with it. I need each village to have a single data point for each week, so I need to sum a third variable within each village and week.
Example:
Village Week House Affect
A 3 7 12
B 6 3 0
C 6 2 2
A 3 9 1
A 5 8 0
A 5 2 8
C 7 19 0
C 7 2 1
I tried this and failed. How do I ask R to only sum observations with the same village and week value?
for (i in seq(along=surveillance)) {
if (surveillance$village== surveillance$village& surveillance$week== surveillance$week)
{surveillance$sumaffect <- sum(surveillance$affected)}
}
Thanks

There is no need for a loop. Use ddply from the plyr package, or a similar function:
library(plyr)
Village = c("A","B","C","A","A","A","C","C")
Week = c(3,6,6,3,5,5,7,7)
Affect = c(12,0,2,1,0,8,0,1)
df = data.frame(Village,Week,Affect)
View(df)
result = ddply(df,.(Village,Week),summarise, val = sum(Affect))
View(result)
DF:
Village Week Affect
1 A 3 12
2 B 6 0
3 C 6 2
4 A 3 1
5 A 5 0
6 A 5 8
7 C 7 0
8 C 7 1
Result:
Village Week val
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
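If what you actually want is the column your loop was trying to create (a sumaffect value on every original row, rather than one row per village and week), base R's ave is one option. This is a minimal sketch using the example column names Village, Week and Affect; the real surveillance data frame apparently uses village, week and affected, so adjust the names accordingly.

```r
# Example data from the question (House omitted, since it does not enter the sum)
surveillance <- data.frame(
  Village = c("A", "B", "C", "A", "A", "A", "C", "C"),
  Week    = c(3, 6, 6, 3, 5, 5, 7, 7),
  Affect  = c(12, 0, 2, 1, 0, 8, 0, 1)
)

# ave() applies FUN within each Village/Week combination and returns
# a vector as long as the input, so every row keeps its group total
surveillance$sumaffect <- ave(surveillance$Affect,
                              surveillance$Village, surveillance$Week,
                              FUN = sum)
surveillance$sumaffect
# [1] 13  0  2 13  8  8  1  1
```

The data frame keeps all eight rows, which may or may not suit a time-series analysis; for one row per village and week, the ddply call above is the better fit.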

The function aggregate will do what you need.
dfs <- ' Village Week House Affect
1 A 3 7 12
2 B 6 3 0
3 C 6 2 2
4 A 3 9 1
5 A 5 8 0
6 A 5 2 8
7 C 7 19 0
8 C 7 2 1
'
df <- read.table(text=dfs)
Then the aggregation
> aggregate(Affect ~ Village + Week , data=df, sum)
Village Week Affect
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
This is an example of the split-apply-combine strategy. If you find yourself doing this often, you should look into the dplyr (or plyr, its ancestor) or data.table packages as alternatives for doing this sort of analysis quickly.
EDIT: updated to use sum instead of mean
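To make the split-apply-combine idea concrete, here is a rough sketch of the three steps done by hand in base R on the same example data. It only illustrates what aggregate does for you; it is not a recommendation to write it this way.

```r
df <- data.frame(
  Village = c("A", "B", "C", "A", "A", "A", "C", "C"),
  Week    = c(3, 6, 6, 3, 5, 5, 7, 7),
  Affect  = c(12, 0, 2, 1, 0, 8, 0, 1)
)

# Split: one small data frame per Village/Week combination
# (drop = TRUE skips combinations that never occur)
pieces <- split(df, list(df$Village, df$Week), drop = TRUE)

# Apply: reduce each piece to a single summary row
summaries <- lapply(pieces, function(p) {
  data.frame(Village = p$Village[1], Week = p$Week[1], Affect = sum(p$Affect))
})

# Combine: stack the summary rows back into one data frame
result <- do.call(rbind, summaries)
rownames(result) <- NULL
result
```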

Related

Transforming a looping factor variable into a sequence of numerics

I have a factor variable with 6 levels, which simplified looks like:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 2... 1 1 1 2 2... (with n = 78)
Note, that each number is repeated mostly but not always three times.
I need to transform this variable into the following pattern:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8...
where the count keeps ascending with each repetition of the 6 levels instead of starting over at 1.
Is there any way / any function that lets me do that?
Sorry for my bad description!
Assuming you have a numeric vector like the simplified version you posted, e.g. x = c(1,1,1,2,2,3,3,3,1,1,2,2), you can use this:
library(dplyr)
cumsum(x != lag(x, default = 0))
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
This compares each value with the previous one and adds 1 whenever they differ (the count starts at 1).
Maybe you can try rle, i.e.,
v <- rep(seq_along((v<-rle(x))$values),v$lengths)
Example with dummy data
x = c(1,1,1,2,2,3,3,3,4,4,5,6,1,1,2,2,3,3,3,4,4)
then we can get
> v
[1] 1 1 1 2 2 3 3 3 4 4 5 6 7 7 8 8 9 9
[19] 9 10 10
In base you can use diff and cumsum.
c(1, cumsum(diff(x)!=0)+1)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
Data:
x <- c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2)
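Note that the question mentions a factor, while the answers above work on a numeric vector. rle, for one, expects a plain vector, so a factor has to be converted first. A minimal sketch, assuming the variable is stored as a factor (called f here purely for illustration):

```r
# Hypothetical factor version of the example data
f <- factor(c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2))

# Integer codes are enough: only whether neighbouring values differ matters
x <- as.integer(f)

# One id per run of identical neighbouring values
runs <- rle(x)
run_id <- rep(seq_along(runs$values), runs$lengths)
run_id
#  [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
```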

Merge 2 rows with duplicated pair of values into a single row

I have the dataframe below, in which some pairs of rows have the same values in columns A and B: the 3rd and 4th rows (2 3) and the 7th and 8th rows (4 6).
master <- data.frame(A=c(1,1,2,2,3,3,4,4,5,5), B=c(1,2,3,3,4,5,6,6,7,8),C=c(5,2,5,7,7,5,7,9,7,8),D=c(1,2,5,3,7,5,9,6,7,0))
A B C D
1 1 1 5 1
2 1 2 2 2
3 2 3 5 5
4 2 3 7 3
5 3 4 7 7
6 3 5 5 5
7 4 6 7 9
8 4 6 9 6
9 5 7 7 7
10 5 8 8 0
I would like to merge each such pair of rows into one, separating the values of C and D with the pipe character |. The 2nd and 3rd lines, for example, would become:
A B C D
2 3 2|5 2|5
I think the combined pairs in your example are off by a row. Assuming that's the case, this is what you're looking for: we group by the columns whose duplicates we want to collapse, then use summarize_all with paste0 to combine the remaining values with a separator.
library(tidyverse)
master %>% group_by(A,B) %>% summarize_all(funs(paste0(., collapse="|")))
A B C D
<dbl> <dbl> <chr> <chr>
1 1 1 5 1
2 1 2 2 2
3 2 3 5|7 5|3
4 3 4 7 7
5 3 5 5 5
6 4 6 7|9 9|6
7 5 7 7 7
8 5 8 8 0
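funs() has been deprecated since dplyr 0.8.0, so on a recent version the same idea is usually written with across(). A sketch, assuming dplyr 1.0.0 or later is installed:

```r
library(dplyr)

master <- data.frame(A = c(1,1,2,2,3,3,4,4,5,5),
                     B = c(1,2,3,3,4,5,6,6,7,8),
                     C = c(5,2,5,7,7,5,7,9,7,8),
                     D = c(1,2,5,3,7,5,9,6,7,0))

# Collapse every non-grouping column (C and D) within each A/B pair
merged <- master %>%
  group_by(A, B) %>%
  summarise(across(everything(), ~ paste(.x, collapse = "|")),
            .groups = "drop")
merged
```

Inside a grouped summarise, everything() leaves out the grouping columns, so A and B stay numeric while C and D become character.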
We can do this in base R with aggregate
aggregate(.~ A + B, master, FUN = paste, collapse= '|')
# A B C D
#1 1 1 5 1
#2 1 2 2 2
#3 2 3 5|7 5|3
#4 3 4 7 7
#5 3 5 5 5
#6 4 6 7|9 9|6
#7 5 7 7 7
#8 5 8 8 0

Sum rows by interval Dataframe

I need help in a research project problem.
The problem is this: I have a big data frame called FRAMETRUE, and I need to sum certain columns row by row into a new column that I will call Group1.
For example:
head(FRAMETRUE)
Municipalities 1989 1990 1991 1992 1993 1994 1995 1996 1997
A 3 3 5 2 3 4 2 5 3
B 7 1 2 4 5 0 4 8 9
C 10 15 1 3 2 NA 2 5 3
D 7 0 NA 5 3 6 4 5 5
E 5 1 2 4 0 3 5 4 2
I must sum the values in each row from 1989 to 1995 into a new column called Group1. The column Group1 should look like this:
Group1
22
23
and so on...
I know it must be something simple; I just don't get it, as I'm still learning R.
Here's one way to do it in R. The trick is to use [ to select the columns and rowSums to add them up:
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, 2:8], na.rm = TRUE)
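Positions 2:8 correspond to the columns 1989 to 1995 in the example. If you would rather not count positions, you can index by name instead. A sketch, assuming the columns really are named 1989, 1990 and so on (reading the data with default settings would typically turn them into X1989 etc., hence check.names = FALSE here):

```r
# Rebuild the example from the question, keeping the year names as they are
FRAMETRUE <- read.table(text = "
Municipalities 1989 1990 1991 1992 1993 1994 1995 1996 1997
A 3 3 5 2 3 4 2 5 3
B 7 1 2 4 5 0 4 8 9
C 10 15 1 3 2 NA 2 5 3
D 7 0 NA 5 3 6 4 5 5
E 5 1 2 4 0 3 5 4 2
", header = TRUE, check.names = FALSE)

# Select the year columns by name and sum across each row, ignoring NAs
years <- as.character(1989:1995)
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, years], na.rm = TRUE)
FRAMETRUE$Group1
# [1] 22 23 33 25 20
```

With na.rm = TRUE a missing year simply contributes nothing to the total; drop that argument if a row with missing data should give NA instead.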
A dplyr solution that allows you to refer to your columns by their names:
library(dplyr)
municipalities <- LETTERS[1:4]
year1989 <- sample(4)
year1990 <- sample(4)
year1991 <- sample(4)
df <- data.frame(municipalities,year1989,year1990,year1991)
# df
municipalities year1989 year1990 year1991
1 A 4 2 2
2 B 3 1 3
3 C 1 3 4
4 D 2 4 1
# Calculate row sums here
df <- mutate(df, Group1 = rowSums(select(df, year1989:year1991)))
# df
municipalities year1989 year1990 year1991 Group1
1 A 4 2 2 8
2 B 3 1 3 7
3 C 1 3 4 8
4 D 2 4 1 7

Grouping cases with at least three variables in common in R

I want to group my dataset by multiple variables and then id those groups. I can id groups when I only group by one variable using dplyr with group_indices.
But I want to group cases by having the same value on at least one of a certain set of variables and then identify the group cases belong to. How to do this in R?
I have the following dataset
NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9
I want cases to be grouped when they have at least one variable of the three I listed (name, adress, phonenumber) in common.
Cases with most in common to each other should be grouped over cases that have the least in common.
So I want to create a grouping variable which gives cases the same value if they're in the same group.
You can assume the hierarchy of name>address>phone
NPI name adress phone org
1 1 1 1 1
2 1 1 1 1
3 2 2 2 2
4 2 3 3 2
5 3 4 4 3
6 3 4 5 3
7 4 5 6 4
8 5 6 6 4
9 6 7 7 5
10 7 8 8 6
11 1 9 9 1
In my real dataset I don't have numbers but names, actual addresses and phone numbers. So all the variables I'm working with are string variables.
Try this with dplyr:
library(dplyr)
df %>%
arrange(name, adress, phone) %>%
mutate(group = c(1, ifelse((name != lag(name)) & (adress != lag(adress)) & (phone != lag(phone)), 1, 0)[-1]),
group = cumsum(group)) %>%
arrange(NPI)
Result:
NPI name adress phone group
1 1 1 1 1 1
2 2 1 1 1 1
3 3 2 2 2 2
4 4 2 3 3 2
5 5 3 4 4 3
6 6 3 4 5 3
7 7 4 5 6 4
8 8 5 6 6 4
9 9 6 7 7 5
10 10 7 8 8 6
11 11 1 9 9 1
Note:
This works even if name, adress, and phone are all characters. As long as the id column (NPI) is numeric, the final data.frame will be in the correct order.
Data:
df = read.table(text = " NPI name adress phone
1 1 1 1
2 1 1 1
3 2 2 2
4 2 3 3
5 3 4 4
6 3 4 5
7 4 5 6
8 5 6 6
9 6 7 7
10 7 8 8
11 1 9 9 ", header = TRUE)
library(dplyr)
df = df %>% mutate_at(vars(-NPI), as.character)
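The approach above only compares neighbouring rows after sorting. If cases should end up in the same group whenever they are linked through any shared name, adress or phone (possibly via intermediate cases), one alternative is to treat this as a connected-components problem. The following is a rough base R sketch of that idea, not the method from the answer above; it ignores the name > address > phone hierarchy and simply treats any shared value as a link. The column org follows the question, and the helper vector label is a name chosen for illustration.

```r
df <- data.frame(
  NPI    = 1:11,
  name   = c(1, 1, 2, 2, 3, 3, 4, 5, 6, 7, 1),
  adress = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 9),
  phone  = c(1, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9)
)

# Start with every row in its own group
label <- seq_len(nrow(df))

# Repeatedly give all rows that share a value the smallest label among them,
# until nothing changes any more (labels only ever decrease, so this stops)
repeat {
  old <- label
  for (col in c("name", "adress", "phone")) {
    label <- ave(label, df[[col]], FUN = min)
  }
  if (all(label == old)) break
}

# Renumber the groups 1, 2, 3, ... in order of first appearance
df$org <- match(label, unique(label))
df$org
#  [1] 1 1 2 2 3 3 4 4 5 6 1
```

Because ave groups rows by equality of values, the same code should work when the three columns hold strings rather than numbers. On this example it reproduces the org column from the question.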

Summing two dataframes based on common value

I have a dataframe that looks like
day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3
and another like
day.of.week count
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13
I want to add the values from df1 to df2 based on day.of.week. I was trying to use ddply
total=ddply(merge(total, subtotal, all.x=TRUE,all.y=TRUE),
.(day.of.week), summarize, count=sum(count))
which almost works, but merge combines rows that have a shared value. For instance in the example above for day.of.week=5. Rather than being merged to two records each with count one, it is instead merged to one record of count one, so instead of total count of two I get a total count of one.
day.of.week count
1 0 3
2 0 17
3 1 6
4 2 1
5 3 1
6 4 1
7 4 5
8 5 1
9 6 3
10 6 13
There is no need to merge. You can simply do
ddply(rbind(d1, d2), .(day.of.week), summarize, sum_count = sum(count))
I have assumed that both data frames have identical column names day.of.week and count
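The reason merge loses a record is that, with no by argument, it joins on all columns the two data frames share, here both day.of.week and count. Rows that are identical in both data frames, such as day.of.week = 5 with count = 1, are therefore matched into a single row. Stacking the data frames with rbind avoids that. If you prefer to stay in base R, the same sum can be sketched with aggregate; d1 and d2 stand for the two data frames from the question:

```r
d1 <- data.frame(day.of.week = c(0, 3, 4, 5, 6),
                 count       = c(3, 1, 1, 1, 3))
d2 <- data.frame(day.of.week = 0:6,
                 count       = c(17, 6, 1, 1, 5, 1, 13))

# Stack the rows first, then sum count within each day.of.week
total <- aggregate(count ~ day.of.week, data = rbind(d1, d2), FUN = sum)
total$count
# [1] 20  6  1  2  6  2 16
```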
In addition to the suggestion Ben gave you about using merge, you could also do this simply using subsetting:
d1 <- read.table(textConnection(" day.of.week count
1 0 3
2 3 1
3 4 1
4 5 1
5 6 3"),sep="",header = TRUE)
d2 <- read.table(textConnection(" day.of.week count1
1 0 17
2 1 6
3 2 1
4 3 1
5 4 5
6 5 1
7 6 13"),sep = "",header = TRUE)
d2[match(d1[,1],d2[,1]),2] <- d2[match(d1[,1],d2[,1]),2] + d1[,2]
> d2
day.of.week count1
1 0 20
2 1 6
3 2 1
4 3 2
5 4 6
6 5 2
7 6 16
This assumes no repeated day.of.week rows, since match will return only the first match.
