Reshape a dataframe in R

I am facing a reshaping problem with a dataframe. The real one has many more rows and columns; simplified, its structure looks like this:
rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2020 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16
I would like to end up with a dataframe that has a single row per "year" value, copies the x1, x2, x3 values into subsequent columns, and renames the columns as a combination of the rowname and the x-variable. It should look like this:
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
2000 2 6 11 0 4 2 0 3 5
2010 2 6 11 0 0 0 4 1 8
2020 10 1 7 8 4 10 22 1 16
I thought of using repeated cbind() calls, but since I have to do this for thousands of rows and hundreds of columns, I hope there is a more direct way with the reshape package (with which I am not so familiar yet).
Thanks in advance!

First, I hope that rownames is a data.frame column and not the data.frame's rownames. Otherwise you'll encounter problems due to the non-uniqueness of the values.
I think your main problem is that your data.frame is not entirely molten:
library(reshape2)
# dt is your original data.frame
dt <- melt(dt, id.vars = c("year", "rownames"))
head(dt)
year rownames variable value
1 2000 a x1 2
2 2000 b x1 0
3 2000 c x1 0
4 2010 a x1 2
...
dcast( dt, year ~ rownames + variable )
year a_x1 a_x2 a_x3 b_x1 b_x2 b_x3 c_x1 c_x2 c_x3
1 2000 2 6 11 0 4 2 0 3 5
2 2010 2 6 11 0 0 0 4 1 8
3 2020 10 1 7 8 4 10 22 1 16
EDIT:
As @spdickson points out, there is also an error in your data that prevents a simple reshape: combinations of year and rowname have to be unique, of course. Otherwise you need an aggregation function that determines the resulting value for each non-unique combination. So we assume that row 6 in your data should read c 2010 4 1 8.
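For completeness, the same wide reshape can also be sketched with tidyr's pivot_wider() (assuming tidyr is available; the column order may differ slightly from the dcast() output):

```r
library(tidyr)

# the corrected example data (row 6 reads c 2010 4 1 8)
df <- data.frame(
  rownames = rep(c("a", "b", "c"), times = 3),
  year     = rep(c(2000, 2010, 2020), each = 3),
  x1       = c(2, 0, 0, 2, 0, 4, 10, 8, 22),
  x2       = c(6, 4, 3, 6, 0, 1, 1, 4, 1),
  x3       = c(11, 2, 5, 11, 0, 8, 7, 10, 16)
)

# one row per year; column names built as <rowname>_<variable>
pivot_wider(df,
            id_cols     = year,
            names_from  = rownames,
            values_from = c(x1, x2, x3),
            names_glue  = "{rownames}_{.value}")
```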

You can try using reshape() from base R without having to melt your dataframe further:
df1 <- read.table(text="rownames year x1 x2 x3
a 2000 2 6 11
b 2000 0 4 2
c 2000 0 3 5
a 2010 2 6 11
b 2010 0 0 0
c 2010 4 1 8
a 2020 10 1 7
b 2020 8 4 10
c 2020 22 1 16",header=T,as.is=T)
reshape(df1,direction="wide",idvar="year",timevar="rownames")
# year x1.a x2.a x3.a x1.b x2.b x3.b x1.c x2.c x3.c
# 1 2000 2 6 11 0 4 2 0 3 5
# 4 2010 2 6 11 0 0 0 4 1 8
# 7 2020 10 1 7 8 4 10 22 1 16


Sum rows by interval Dataframe

I need help with a research project problem.
I have a big data frame called FRAMETRUE, and I need to sum certain columns row by row into a new column that I will call Group1.
For example:
head(FRAMETRUE)
Municipalities 1989 1990 1991 1992 1993 1994 1995 1996 1997
A 3 3 5 2 3 4 2 5 3
B 7 1 2 4 5 0 4 8 9
C 10 15 1 3 2 NA 2 5 3
D 7 0 NA 5 3 6 4 5 5
E 5 1 2 4 0 3 5 4 2
I must sum the values in the columns from 1989 to 1995 into a new column called Group1. The column Group1 should look like
Group1
22
23
and so on...
I know it must be something simple; I just don't get it. I'm still learning R.
If you are looking for an R solution, here's one way to do it: the trick is combining [ with rowSums():
FRAMETRUE$Group1 <- rowSums(FRAMETRUE[, 2:8], na.rm = TRUE)
A dplyr solution that allows you to refer to your columns by their names:
library(dplyr)
municipalities <- LETTERS[1:4]
year1989 <- sample(4)
year1990 <- sample(4)
year1991 <- sample(4)
df <- data.frame(municipalities,year1989,year1990,year1991)
# df
municipalities year1989 year1990 year1991
1 A 4 2 2
2 B 3 1 3
3 C 1 3 4
4 D 2 4 1
# Calculate row sums here
df <- mutate(df, Group1 = rowSums(select(df, year1989:year1991)))
# df
municipalities year1989 year1990 year1991 Group1
1 A 4 2 2 8
2 B 3 1 3 7
3 C 1 3 4 8
4 D 2 4 1 7
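In newer versions of dplyr (1.0+), the same row sum can be written with across() inside mutate() instead of select() — a sketch, using the values printed above:

```r
library(dplyr)

df <- data.frame(
  municipalities = LETTERS[1:4],
  year1989 = c(4, 3, 1, 2),
  year1990 = c(2, 1, 3, 4),
  year1991 = c(2, 3, 4, 1)
)

# across() selects the year columns by name range, like select() did
df <- mutate(df, Group1 = rowSums(across(year1989:year1991)))
```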

R For Loop with Certain conditions

I have a dataframe (surveillance) with many variables (villages, houses, weeks). I want to eventually do a time-series analysis.
Currently, each village has between 1 and 183 weeks, each of which has several houses associated with it. I need each village to have a single data point per week, so I need to sum up a third variable.
Example:
Village Week House Affect
A 3 7 12
B 6 3 0
C 6 2 2
A 3 9 1
A 5 8 0
A 5 2 8
C 7 19 0
C 7 2 1
I tried this and failed. How do I ask R to only sum observations with the same village and week value?
for (i in seq(along = surveillance)) {
  if (surveillance$village == surveillance$village & surveillance$week == surveillance$week) {
    surveillance$sumaffect <- sum(surveillance$affected)
  }
}
Thanks
No need for a loop. Use ddply() or similar:
library(plyr)
Village = c("A","B","C","A","A","A","C","C")
Week = c(3,6,6,3,5,5,7,7)
Affect = c(12,0,2,1,0,8,0,1)
df = data.frame(Village,Week,Affect)
View(df)
result = ddply(df,.(Village,Week),summarise, val = sum(Affect))
View(result)
DF:
Village Week Affect
1 A 3 12
2 B 6 0
3 C 6 2
4 A 3 1
5 A 5 0
6 A 5 8
7 C 7 0
8 C 7 1
Result:
Village Week val
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
The function aggregate will do what you need.
dfs <- ' Village Week House Affect
1 A 3 7 12
2 B 6 3 0
3 C 6 2 2
4 A 3 9 1
5 A 5 8 0
6 A 5 2 8
7 C 7 19 0
8 C 7 2 1
'
df <- read.table(text=dfs)
Then the aggregation
> aggregate(Affect ~ Village + Week , data=df, sum)
Village Week Affect
1 A 3 13
2 A 5 8
3 B 6 0
4 C 6 2
5 C 7 1
This is an example of a split-apply-combine strategy; if you find yourself doing this often, you should investigate the dplyr (or plyr, its ancestor) or data.table as alternatives to quickly doing this sort of analysis.
EDIT: updated to use sum instead of mean
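For example, the dplyr equivalent of the aggregate() call above would look roughly like this (a sketch, assuming dplyr ≥ 1.0 for the .groups argument):

```r
library(dplyr)

df <- data.frame(
  Village = c("A", "B", "C", "A", "A", "A", "C", "C"),
  Week    = c(3, 6, 6, 3, 5, 5, 7, 7),
  Affect  = c(12, 0, 2, 1, 0, 8, 0, 1)
)

# one row per Village/Week combination, summing Affect within each group
df %>%
  group_by(Village, Week) %>%
  summarise(Affect = sum(Affect), .groups = "drop")
```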

Reshape the Columns of Data Frame in R

I have a data frame (for example)
Week Bags
4 5
6 3
10 5
13 7
18 5
23 1
30 9
31 9
32 4
33 7
35 1
38 2
42 2
47 2
The 'Week' column denotes the week number in a year and 'Bags' denotes the number of bags used by a small firm.
I want my data frame in the form of
Week Bags
1 0
2 0
3 0
4 5
5 0
6 3
7 0
8 0
9 0
10 5
and so on, in order to plot the weekly changes in number of bags.
I am sure it is a very silly question, but I could not find any way. Please help in this direction.
You can create another dataset:
df2 <- data.frame(Week = 1:max(df1$Week))
and then merge it with the first dataset:
res <- merge(df1, df2, all = TRUE)
res$Bags[is.na(res$Bags)] <- 0
head(res,10)
# Week Bags
#1 1 0
#2 2 0
#3 3 0
#4 4 5
#5 5 0
#6 6 3
#7 7 0
#8 8 0
#9 9 0
#10 10 5
Or using data.table
library(data.table)
res1 <- setDT(df1, key='Week')[J(Week = 1:max(Week))][is.na(Bags), Bags:=0][]
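If the tidyverse is an option, tidyr's complete() expresses the same fill-in-the-gaps idea in one call — a sketch, assuming tidyr is installed:

```r
library(tidyr)

df1 <- data.frame(
  Week = c(4, 6, 10, 13, 18, 23, 30, 31, 32, 33, 35, 38, 42, 47),
  Bags = c(5, 3, 5, 7, 5, 1, 9, 9, 4, 7, 1, 2, 2, 2)
)

# expand Week to every value from 1 to the maximum; missing Bags become 0
complete(df1, Week = 1:max(Week), fill = list(Bags = 0))
```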

An efficient way to indicate multiple indicator variables per row with composite key? [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
How do I get a contingency table?
(6 answers)
Closed 4 years ago.
My indicator and value objects have composite keys that map to each other. Is there an efficient way to aggregate the values into the indicator object?
Given an "empty" indicator dataframe:
indicator <- data.frame(Id1=c(1,1,2,2,3,3,4,4), Id2=c(10,11,10,12,10,12,10,12),Ind_A=rep(0,8),Ind_B=rep(0,8))
Id1 Id2 Ind_A Ind_B
1 1 10 0 0
2 1 11 0 0
3 2 10 0 0
4 2 12 0 0
5 3 10 0 0
6 3 12 0 0
7 4 10 0 0
8 4 12 0 0
and a dataframe of values:
values <- data.frame(Id1=c(1,1,1,2,2,3,3,4,4,4), Id2=c(10,10,11,10,12,10,12,10,10,12), Indicators=c('Ind_A','Ind_B','Ind_A','Ind_B','Ind_A','Ind_A','Ind_A','Ind_A','Ind_B','Ind_A'))
Id1 Id2 Indicators
1 1 10 Ind_A
2 1 10 Ind_B
3 1 11 Ind_A
4 2 10 Ind_B
5 2 12 Ind_A
6 3 10 Ind_A
7 3 12 Ind_A
8 4 10 Ind_A
9 4 10 Ind_B
10 4 12 Ind_A
I want to end up with:
Id1 Id2 Ind_A Ind_B
1 10 1 1
1 11 1 0
2 10 0 1
2 12 1 0
3 10 1 0
3 12 1 0
4 10 1 1
4 12 1 0
You could use dcast to convert the "values" dataset from 'long' to 'wide' format.
library(reshape2)
dcast(values, Id1+Id2~Indicators, value.var='Indicators', length)
# Id1 Id2 Ind_A Ind_B
#1 1 10 1 1
#2 1 11 1 0
#3 2 10 0 1
#4 2 12 1 0
#5 3 10 1 0
#6 3 12 1 0
#7 4 10 1 1
#8 4 12 1 0
As shown above, you may not need to create a second dataset, but if you need to change the values in one dataset based on the values in the other:
indicator$Ind_A <- (do.call(paste, c(indicator[1:2], 'Ind_A')) %in%
do.call(paste, values))+0L
indicator$Ind_B <- (do.call(paste, c(indicator[1:2], 'Ind_B')) %in%
do.call(paste, values))+0L

Flag first by-group in R data frame

I have a data frame which looks like this:
id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21
I'd like to identify a way to flag the first occurrence of id -- similar to first. and last. in SAS. I've tried the !duplicated() function, but I need to actually append the "flag" column to my data frame since I'm running it through a loop later on. I'd like to get something like this:
id score first_ind
1 15 1
1 18 0
1 16 0
2 10 1
2 9 0
3 8 1
3 47 0
3 21 0
> df$first_ind <- as.numeric(!duplicated(df$id))
> df
id score first_ind
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
You can find the edges using diff(). Note this relies on the ids being sorted and increasing by exactly 1 at each group boundary; with arbitrary id values the flag would be the jump size rather than strictly 0/1.
x <- read.table(text = "id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21", header = TRUE)
x$first_id <- c(1, diff(x$id))
x
id score first_id
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
Using plyr:
library("plyr")
ddply(x,"id",transform,first=as.numeric(seq(length(score))==1))
or if you prefer dplyr:
x %>% group_by(id) %>%
  mutate(first = c(1, rep(0, n() - 1)))
(although if you're operating completely in the plyr/dplyr framework you probably wouldn't need this flag variable anyway ...)
Another base R option:
df$first_ind <- ave(df$id, df$id, FUN = seq_along) == 1
df
# id score first_ind
#1 1 15 TRUE
#2 1 18 FALSE
#3 1 16 FALSE
#4 2 10 TRUE
#5 2 9 FALSE
#6 3 8 TRUE
#7 3 47 FALSE
#8 3 21 FALSE
This also works in case of unsorted ids. If you want 1/0 instead of TRUE/FALSE, you can easily wrap it in as.integer().
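For larger data, a data.table version of the same flag — a sketch, assuming data.table is installed:

```r
library(data.table)

dt <- data.table(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3),
  score = c(15, 18, 16, 10, 9, 8, 47, 21)
)

# .N is the group size; the first row within each id gets 1, the rest 0
dt[, first_ind := as.integer(seq_len(.N) == 1L), by = id]
```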
