Reshape the Columns of Data Frame in R - r

I have a data frame (for example)
Week Bags
4 5
6 3
10 5
13 7
18 5
23 1
30 9
31 9
32 4
33 7
35 1
38 2
42 2
47 2
The 'Week' column denotes the week number in a year and 'Bags' denotes the number of bags used by a small firm.
I want my data frame in the form of
Week Bags
1 0
2 0
3 0
4 5
5 0
6 3
7 0
8 0
9 0
10 5
and so on, in order to plot the weekly changes in number of bags.
I am sure this is a very silly question, but I could not find a way to do it. Please help.

You can create another dataset
df2 <- data.frame(Week= 1:max(df1$Week))
and then merge with the first dataset
res<- merge(df1, df2, all=TRUE)
res$Bags[is.na(res$Bags)] <- 0
head(res,10)
# Week Bags
#1 1 0
#2 2 0
#3 3 0
#4 4 5
#5 5 0
#6 6 3
#7 7 0
#8 8 0
#9 9 0
#10 10 5
Or using data.table
library(data.table)
res1 <- setDT(df1, key='Week')[J(Week = 1:max(Week))][is.na(Bags), Bags:=0][]
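For reference, the merge approach can be run end to end as below. This is only a sketch: df1 is rebuilt from the numbers in the question, and padding to 52 weeks is an assumption (use 1:max(df1$Week) to stop at the last observed week, as in the answer above).

```r
# Example data copied from the question
df1 <- data.frame(
  Week = c(4, 6, 10, 13, 18, 23, 30, 31, 32, 33, 35, 38, 42, 47),
  Bags = c(5, 3, 5, 7, 5, 1, 9, 9, 4, 7, 1, 2, 2, 2)
)

# Assumption: a 52-week year, so the plot covers the weeks after week 47 too
all_weeks <- data.frame(Week = 1:52)

# Keep every row of all_weeks; weeks missing from df1 get NA in Bags
res <- merge(all_weeks, df1, by = "Week", all.x = TRUE)

# Treat weeks with no record as zero bags
res$Bags[is.na(res$Bags)] <- 0

head(res, 10)
```

After this, plot(res$Week, res$Bags, type = "l") should draw the weekly series with the gaps shown as zeros.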

Related

replace values in row if it matches with last row in R

I have below data frame in R
df <- read.table(text = "
A B C D E
14 6 8 16 14
5 6 10 6 4
2 4 6 3 4
26 6 18 39 36
1 2 3 1 2
3 1 1 1 1
3 5 1 4 11
", header = TRUE)
If the values in the last two rows of a column are the same, I need to replace them with 0. Can anyone help me with this, if it is doable in R?
For example:
The values in the last two rows of column 1 are both 3, so I need to replace 3 by 0.
The same goes for column 3: its last two rows are both 1, so I need to replace 1 by 0.
You can compare the last two rows and replace the values in the columns where they match:
nr <- nrow(df)
df[(nr-1):nr, df[nr-1, ]==df[nr, ]] <- 0
df
# A B C D E
#1 14 6 8 16 14
#2 5 6 10 6 4
#3 2 4 6 3 4
#4 26 6 18 39 36
#5 1 2 3 1 2
#6 0 1 0 1 1
#7 0 5 0 4 11
One option is to loop through the columns, check whether the last two elements (tail(x, 2)) are duplicated, replace them with 0 if so or else return the column unchanged, and assign the output back to the dataset. The [] makes sure that the data frame structure stays intact.
df[] <- lapply(df, function(x) if(anyDuplicated(tail(x, 2))>0)
replace(x, c(length(x)-1, length(x)), 0) else x)
df
# A B C D E
#1 14 6 8 16 14
#2 5 6 10 6 4
#3 2 4 6 3 4
#4 26 6 18 39 36
#5 1 2 3 1 2
#6 0 1 0 1 1
#7 0 5 0 4 11
You could also do this:
r <- tail(df, 2)
r[,r[1,]==r[2,]] <- 0
df <- rbind(head(df, -2), r)
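The comparison that all three answers rely on can be made explicit by storing it in a variable first. The sketch below uses the data from the question; the name same is only illustrative.

```r
df <- read.table(text = "
A B C D E
14 6 8 16 14
5 6 10 6 4
2 4 6 3 4
26 6 18 39 36
1 2 3 1 2
3 1 1 1 1
3 5 1 4 11
", header = TRUE)

nr <- nrow(df)

# One logical value per column: TRUE where the last two rows agree
same <- unname(unlist(df[nr - 1, ]) == unlist(df[nr, ]))
same

# Zero out the last two rows, but only in the matching columns
df[(nr - 1):nr, same] <- 0
df
```

Printing same before the assignment is a quick way to check which columns will be overwritten.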

R For Loop with Certain conditions

I have a dataframe (surveillance) with many variables (villages, houses, weeks). I want to eventually do a time-series analysis.
Currently each village has between 1 and 183 weeks, and each week has several houses associated with it. I need each village to have a single data point for each week, so I need to sum up a third variable.
Example:
Village Week House Affect
A 3 7 12
B 6 3 0
C 6 2 2
A 3 9 1
A 5 8 0
A 5 2 8
C 7 19 0
C 7 2 1
I tried this and failed. How do I ask R to only sum observations with the same village and week value?
for (i in seq(along=surveillance)) {
if (surveillance$village== surveillance$village& surveillance$week== surveillance$week)
{surveillance$sumaffect <- sum(surveillance$affected)}
}
Thanks
No need for a loop. Use ddply or similar:
library(plyr)
Village = c("A","B","C","A","A","A","C","C")
Week = c(3,6,6,3,5,5,7,7)
Affect = c(12,0,2,1,0,8,0,1)
df = data.frame(Village,Week,Affect)
View(df)
result = ddply(df,.(Village,Week),summarise, val = sum(Affect))
View(result)
DF:
Village Week Affect
1 A 3 12
2 B 6 0
3 C 6 2
4 A 3 1
5 A 5 0
6 A 5 8
7 C 7 0
8 C 7 1
Result:
Village Week val
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
The function aggregate will do what you need.
dfs <- ' Village Week House Affect
1 A 3 7 12
2 B 6 3 0
3 C 6 2 2
4 A 3 9 1
5 A 5 8 0
6 A 5 2 8
7 C 7 19 0
8 C 7 2 1
'
df <- read.table(text=dfs)
Then the aggregation
> aggregate(Affect ~ Village + Week , data=df, sum)
Village Week Affect
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
This is an example of the split-apply-combine strategy; if you find yourself doing this often, you should look at dplyr (or plyr, its ancestor) or data.table as quicker ways to do this sort of analysis.
EDIT: updated to use sum instead of mean
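As a rough illustration of the dplyr route mentioned above, the same sum per village and week could look like this. It is a sketch that assumes dplyr 1.0 or later is installed; the data frame is rebuilt from the example in the question, and the name weekly is illustrative.

```r
library(dplyr)

# Example data from the question
surveillance <- data.frame(
  Village = c("A", "B", "C", "A", "A", "A", "C", "C"),
  Week    = c(3, 6, 6, 3, 5, 5, 7, 7),
  House   = c(7, 3, 2, 9, 8, 2, 19, 2),
  Affect  = c(12, 0, 2, 1, 0, 8, 0, 1)
)

# Split by village and week, sum Affect within each group, then drop the grouping
weekly <- surveillance %>%
  group_by(Village, Week) %>%
  summarise(Affect = sum(Affect), .groups = "drop")

weekly
```

The result has one row per Village/Week combination, which is the shape needed for a per-village time series.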

Flag first by-group in R data frame

I have a data frame which looks like this:
id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21
I'd like to identify a way to flag the first occurrence of id -- similar to first. and last. in SAS. I've tried the !duplicated function, but I need to actually append the "flag" column to my data frame since I'm running it through a loop later on. I'd like to get something like this:
id score first_ind
1 15 1
1 18 0
1 16 0
2 10 1
2 9 0
3 8 1
3 47 0
3 21 0
> df$first_ind <- as.numeric(!duplicated(df$id))
> df
id score first_ind
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
You can find the edges using diff, as long as the rows are already sorted by id.
x <- read.table(text = "id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21", header = TRUE)
x$first_id <- c(1, as.numeric(diff(x$id) != 0))  # 1 wherever id differs from the previous row
x
id score first_id
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
Using plyr:
library("plyr")
ddply(x,"id",transform,first=as.numeric(seq(length(score))==1))
or if you prefer dplyr:
library(dplyr)
x %>% group_by(id) %>%
  mutate(first = c(1, rep(0, n() - 1)))  # n() is the size of the current group
(although if you're operating completely in the plyr/dplyr framework you probably wouldn't need this flag variable anyway ...)
Another base R option:
df$first_ind <- ave(df$id, df$id, FUN = seq_along) == 1
df
# id score first_ind
#1 1 15 TRUE
#2 1 18 FALSE
#3 1 16 FALSE
#4 2 10 TRUE
#5 2 9 FALSE
#6 3 8 TRUE
#7 3 47 FALSE
#8 3 21 FALSE
This also works in case of unsorted ids. If you want 1/0 instead of T/F you can easily wrap it in as.integer(.).
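To see what "unsorted ids" means in practice, here is a small sketch on made-up data (the shuffled rows are hypothetical, not from the question). It compares the duplicated and ave flags with a diff-based flag, which only looks at neighbouring rows.

```r
# Hypothetical unsorted data: ids are interleaved rather than grouped
df <- data.frame(
  id    = c(1, 2, 1, 3, 2, 3),
  score = c(15, 10, 18, 8, 9, 47)
)

# Both of these flag only the first time each id is seen
df$first_dup <- as.integer(!duplicated(df$id))
df$first_ave <- as.integer(ave(df$id, df$id, FUN = seq_along) == 1)

# A diff-based flag marks every change of id, so it needs sorted input
df$first_diff <- c(1, as.numeric(diff(df$id) != 0))

df
```

On this input the first two columns agree, while the diff-based column flags every row because the id changes on each line.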

Delete single occurrences in longitudinal data

I am working with longitudinal data. I want to remove the observations of people that were only measured once (ids 5,7,9 below). How do I do this? Assume id is the unique identifier for people in the data set. Therefore, I would want to remove observations associated with ids 5,7, and 9. I've played with duplicated, unique, the table function, and the count function in plyr but haven't been successful. Example data below.
y<-sample(1:10, 20, replace=TRUE)
x<-sample(c(0,1),20, replace=TRUE)
id<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,7,8,8,8,9)
data<-data.frame(cbind(y,x,id))
You would have received immediate assistance had you tagged the post as R,data.frame
Here, the ! ("not") operator is used to drop the rows whose id matches one of the values in c(5,7,9):
> data[!data$id %in% c(5,7,9),]
y x id
1 3 0 1
2 2 1 1
3 3 0 1
4 9 0 2
5 9 0 2
6 1 0 2
7 9 0 3
8 7 0 3
9 4 0 3
10 9 1 4
11 7 0 4
12 8 1 4
14 4 1 6
15 1 0 6
17 2 0 8
18 8 0 8
19 2 0 8
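The ids above are typed in by hand. If they are not known in advance, one possible sketch is to count the rows per id and keep only ids that occur more than once. The seed is an assumption added purely for reproducibility, and n_obs and kept are illustrative names.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible

id <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,7,8,8,8,9)
data <- data.frame(
  y  = sample(1:10, 20, replace = TRUE),
  x  = sample(c(0, 1), 20, replace = TRUE),
  id = id
)

# Number of observations for each id, repeated on every row of that id
n_obs <- ave(data$id, data$id, FUN = length)

# Keep only the people who were measured more than once
kept <- data[n_obs > 1, ]

sort(unique(kept$id))
```

An equivalent filter using table would be data[data$id %in% names(which(table(data$id) > 1)), ].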

reshape a dataframe R

I am facing a reshaping problem with a dataframe. It has many more rows and columns. Simplified, its structure looks like this:
rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2020 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16
I would like to come out with a dataframe that has one single row for the variable "year", copy the x1, x2, x3 values in subsequent columns, and rename the columns with a combination between the rowname and the x-variable. It should look like this:
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
2000 2 6 11 0 4 2 0 3 5
2010 2 6 11 0 0 0 4 1 8
2020 10 1 7 8 4 10 22 1 16
I thought to use subsequent cbind() functions, but since I have to do it for thousands of rows and hundreds columns, I hope there is a more direct way with the reshape package (with which I am not so familiar yet)
Thanks in advance!
First, I hope that rownames is a data.frame column and not the data.frame's rownames. Otherwise you'll encounter problems due to the non-uniqueness of the values.
I think your main problem is, that your data.frame is not entirely molten:
library(reshape2)
dt <- melt( dt, id.vars=c("year", "rownames") )
head(dt)
year rownames variable value
1 2000 a x1 2
2 2000 b x1 0
3 2000 c x1 0
4 2010 a x1 2
...
dcast( dt, year ~ rownames + variable )
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
1 2000 2 6 11 0 4 2 0 3 5
2 2010 2 6 11 0 0 0 4 1 8
3 2020 10 1 7 8 4 10 22 1 16
EDIT:
As @spdickson points out, there is also an error in your data that prevents a simple reshape. Combinations of year and rowname have to be unique, of course. Otherwise you need an aggregation function that determines the resulting values for non-unique combinations. So we assume that row 6 in your data should read c 2010 4 1 8.
You can try using reshape() from base R without having to melt your dataframe further:
df1 <- read.table(text="rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2010 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16",header=T,as.is=T)
reshape(df1,direction="wide",idvar="year",timevar="rownames")
# year x1.a x2.a x3.a x1.b x2.b x3.b x1.c x2.c x3.c
# 1 2000 2 6 11 0 4 2 0 3 5
# 4 2010 2 6 11 0 0 0 4 1 8
# 7 2020 10 1 7 8 4 10 22 1 16
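reshape() names the new columns x1.a, x2.a and so on, while the question asked for a_x1, a_x2. One way to get there, sketched under the assumption that the measure columns all look like x followed by digits, is to swap the two parts of each name with sub(). The name wide is illustrative, and the data uses the corrected row 6.

```r
df1 <- read.table(text = "rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2010 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16", header = TRUE, as.is = TRUE)

wide <- reshape(df1, direction = "wide", idvar = "year", timevar = "rownames")

# Turn 'x1.a' into 'a_x1'; 'year' does not match the pattern and is left alone
names(wide) <- sub("^(x[0-9]+)\\.(.+)$", "\\2_\\1", names(wide))

names(wide)
```

If the real column names follow a different pattern, the regular expression would need to be adjusted to match.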

Resources