Reshaping data in R, skipping certain measured variables

I would like to reshape a data.frame that looks like this:
permno dte ttm var1 var2 var3
1 123 2012-01-01 20 1 10 100
2 123 2012-01-01 30 -1 10 100
3 124 2012-01-01 20 2 20 200
4 124 2012-01-01 30 -2 20 200
I would like to make my data.frame look the following way:
permno dte var1_20 var1_30 var2 var3
1 123 2012-01-01 1 -1 10 100
2 124 2012-01-01 2 -2 20 200
I have been attempting to do this with reshape2 package but I am unable to isolate var1 from the rest and keep getting var2_20 and var2_30 for example in the results. Does anyone know how to do this using the reshape2 package?
data.frame dput:
> dput(DF)
structure(list(permno = c(123L, 123L, 124L, 124L), dte = structure(c(1L,
1L, 1L, 1L), .Label = " 2012-01-01", class = "factor"), ttm = c(20L,
30L, 20L, 30L), var1 = c(1L, -1L, 2L, -2L), var2 = c(10L, 10L,
20L, 20L), var3 = c(100L, 100L, 200L, 200L)), .Names = c("permno",
"dte", "ttm", "var1", "var2", "var3"), class = "data.frame", row.names = c(NA,
-4L))
> dput(result)
structure(list(permno = 123:124, dte = structure(c(1L, 1L), .Label = " 2012-01-01", class = "factor"),
var1_20 = 1:2, var1_30 = c(-1L, -2L), var2 = c(10L, 20L),
var3 = c(100L, 200L)), .Names = c("permno", "dte", "var1_20",
"var1_30", "var2", "var3"), class = "data.frame", row.names = c(NA,
-2L))

Use a combination of merge, reshape, and unique as follows:
unique(merge(DF[-c(3:4)],
reshape(DF[1:4], direction = "wide",
idvar = c("permno", "dte"),
timevar="ttm")))
# permno dte var2 var3 var1.20 var1.30
# 1 123 2012-01-01 10 100 1 -1
# 3 124 2012-01-01 20 200 2 -2
Basically, you reshape only the columns that need to be reshaped, and drop those columns from the original dataset before merging. You'll end up with duplicated rows, so just wrap all of that in unique to get (almost) your desired output. You can rearrange the column order if required.
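As a sketch of the same idea that lands directly on the requested column names and order, the variant below uses only base R. It assumes that var2 and var3 are constant within each permno/dte pair. The names wide, const and result are illustrative, and the data frame is rebuilt by hand from the question's sample (with dte as a plain string).

```r
# Sample data rebuilt from the question
DF <- data.frame(
  permno = c(123L, 123L, 124L, 124L),
  dte    = "2012-01-01",
  ttm    = c(20L, 30L, 20L, 30L),
  var1   = c(1L, -1L, 2L, -2L),
  var2   = c(10L, 10L, 20L, 20L),
  var3   = c(100L, 100L, 200L, 200L)
)

# Reshape only the column that varies with ttm; sep = "_" gives var1_20, var1_30
wide <- reshape(DF[c("permno", "dte", "ttm", "var1")],
                direction = "wide",
                idvar = c("permno", "dte"),
                timevar = "ttm",
                sep = "_")

# Columns assumed constant within each id: keep one row per id
const <- unique(DF[c("permno", "dte", "var2", "var3")])

# Join the two pieces and put the columns in the requested order
result <- merge(wide, const, by = c("permno", "dte"))
result <- result[c("permno", "dte", "var1_20", "var1_30", "var2", "var3")]
print(result)
```

Deduplicating the constant columns before the merge avoids the duplicated rows, so the outer unique() is no longer needed.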

I'm feeling rather clever about this answer, but I strongly suspect that I've made too many assumptions about your data, in particular the constant nature of var2 and var3:
library(plyr)
library(reshape2)
ddply(DF,.(permno,dte,var2,var3),
function(x) { dcast(x,permno + dte + var2 + var3 ~ ttm,value.var = 'var1') })
permno dte var2 var3 20 30
1 123 2012-01-01 10 100 1 -1
2 124 2012-01-01 20 200 2 -2
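The assumption mentioned above (var2 and var3 constant within each permno/dte pair) can be checked before reshaping. A minimal base R sketch follows; id_cols, n_ids, n_combos and constant_within_id are illustrative names, and the data is rebuilt from the question's sample.

```r
DF <- data.frame(
  permno = c(123L, 123L, 124L, 124L),
  dte    = "2012-01-01",
  ttm    = c(20L, 30L, 20L, 30L),
  var1   = c(1L, -1L, 2L, -2L),
  var2   = c(10L, 10L, 20L, 20L),
  var3   = c(100L, 100L, 200L, 200L)
)

id_cols <- c("permno", "dte")

# Number of distinct ids
n_ids <- nrow(unique(DF[id_cols]))

# Number of distinct (id, var2, var3) combinations
n_combos <- nrow(unique(DF[c(id_cols, "var2", "var3")]))

# If both counts agree, var2 and var3 never change within an id
constant_within_id <- n_combos == n_ids
print(constant_within_id)
```

If the check is FALSE, putting var2 and var3 on the left-hand side of the casting formula would split an id over several output rows.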


Calculate rowMeans on a range of column (Variable number)

I want to calculate rowMeans of a range of column but I cannot give the hard-coded value for colnames (e.g c(C1,C3)) or range (e.g. C1:C3) as both names and range are variable. My df looks like:
> df
chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
1 123 abc 12 10.00 19 18.00 12 13.00 -14
2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
This is just a sample; in reality the columns range from MGW.1 to MGW.196 and so on. Instead of giving exact column names or an exact range, I want to pass the prefix of the column names and get the average of all columns sharing that prefix. Something like: MGW=rowMeans(df[,MGW.*]), HEL=rowMeans(df[,HEL.*])
So my final output should look like:
> df
chr name age MGW Hel
1 123 abc 12 10.00 19
2 234 bvf 24 13.29 13
3 376 bxc 17 -6.95 10
I know these values are not correct, but they are just to give you an idea. Secondly, I want to remove all rows from the data frame that contain only NA apart from the first 3 values.
Here is the dput for sample example:
> dput(df)
structure(list(chr = c(123L, 234L, 376L), name = structure(1:3, .Label = c("abc",
"bvf", "bxc"), class = "factor"), age = c(12L, 24L, 17L), MGW.1 = c(10,
-13.29, -6.95), MGW.2 = c(19L, 13L, 10L), MGW.3 = c(18, -3.02,
-18), HEL.1 = c(12L, 12L, 15L), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14L,
24L, -4L)), .Names = c("chr", "name", "age", "MGW.1", "MGW.2",
"MGW.3", "HEL.1", "HEL.2", "HEL.3"), class = "data.frame", row.names = c(NA,
-3L))
Firstly
I think you are looking for this to get the row means (the dot is escaped so that it matches a literal "." rather than any character):
df$mean.Hel <- rowMeans(df[, grep("^HEL\\.", names(df))])
And to delete the columns afterwards:
df[, grep("^HEL\\.", names(df))] <- NULL
Secondly
To delete rows which have only NA after the first three elements.
rows.delete <- which(rowSums(!is.na(df)[,4:ncol(df)]) == 0)
df <- df[!(1:nrow(df) %in% rows.delete),]
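The per-prefix approach above can be generalised so that no prefix is hardcoded. The sketch below is one way to do it in base R, assuming the first three columns are identifiers and every other column is named prefix.number; id_cols, prefixes, means and out are illustrative names.

```r
df <- data.frame(
  chr = c(123L, 234L, 376L), name = c("abc", "bvf", "bxc"), age = c(12L, 24L, 17L),
  MGW.1 = c(10, -13.29, -6.95), MGW.2 = c(19, 13, 10), MGW.3 = c(18, -3.02, -18),
  HEL.1 = c(12, 12, 15), HEL.2 = c(13, -0.12, 4), HEL.3 = c(-14, 24, -4)
)

id_cols <- 1:3

# Derive the prefixes ("MGW", "HEL") by stripping the trailing ".<number>"
prefixes <- unique(sub("\\.[0-9]+$", "", names(df)[-id_cols]))

# One column of row means per prefix; sapply names the columns after the prefixes
means <- sapply(prefixes, function(p) {
  cols <- grep(paste0("^", p, "\\.[0-9]+$"), names(df))
  rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
})

out <- cbind(df[id_cols], means)
print(out)
```

Using drop = FALSE keeps the selection a data frame even when a prefix matches a single column, which rowMeans requires.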
Here's an idea achieving your desired output without hardcoding variable names:
library(dplyr)
library(tidyr)
df %>%
# remove rows where all values are NA except the first 3 columns
filter(rowSums(is.na(.[4:length(.)])) != length(.) - 3) %>%
# gather the data in a tidy format
gather(key, value, -(chr:age)) %>%
# separate the key column into label and num allowing
# to regroup by variables without hardcoding them
separate(key, into = c("label", "num")) %>%
group_by(chr, name, age, label) %>%
# calculate the mean
summarise(mean = mean(value, na.rm = TRUE)) %>%
spread(label, mean)
I took the liberty to modify your initial data to show how the logic would fit special cases. For example, here we have a row (#4) where all values but the first 3 columns are NAs (according to your requirements, this row should be removed) and one where there is a mix of NAs and values (#5). In this case, I assumed we would like to have a result for MGW since there is a value at MGW.1:
# chr name age MGW.1 MGW.2 MGW.3 HEL.1 HEL.2 HEL.3
#1 123 abc 12 10.00 19 18.00 12 13.00 -14
#2 234 bvf 24 -13.29 13 -3.02 12 -0.12 24
#3 376 bxc 17 -6.95 10 -18.00 15 4.00 -4
#4 999 zzz 21 NA NA NA NA NA NA
#5 888 aaa 12 10.00 NA NA NA NA NA
Which gives:
#Source: local data frame [4 x 5]
#Groups: chr, name, age [4]
#
# chr name age HEL MGW
#* <int> <fctr> <int> <dbl> <dbl>
#1 123 abc 12 3.666667 15.666667
#2 234 bvf 24 11.960000 -1.103333
#3 376 bxc 17 5.000000 -4.983333
#4 888 aaa 12 NaN 10.000000
Data
df <- structure(list(chr = c(123L, 234L, 376L, 999L, 888L), name = structure(c(2L,
3L, 4L, 5L, 1L), .Label = c("aaa", "abc", "bvf", "bxc", "zzz"
), class = "factor"), age = c(12L, 24L, 17L, 21L, 12L), MGW.1 = c(10,
-13.29, -6.95, NA, 10), MGW.2 = c(19L, 13L, 10L, NA, NA), MGW.3 = c(18,
-3.02, -18, NA, NA), HEL.1 = c(12L, 12L, 15L, NA, NA), HEL.2 = c(13,
-0.12, 4, NA, NA), HEL.3 = c(-14L, 24L, -4L, NA, NA)), .Names = c("chr",
"name", "age", "MGW.1", "MGW.2", "MGW.3", "HEL.1", "HEL.2", "HEL.3"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
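The NaN reported for HEL in the row with chr 888 comes from taking the mean of a group in which every value is NA: with na.rm = TRUE nothing is left to average. A small base R illustration, with made-up variable names, including one possible way to turn such results into NA:

```r
# All HEL values are missing for that row
hel_values <- c(NA_real_, NA_real_, NA_real_)

# Removing the NAs leaves an empty vector, whose mean is NaN
m <- mean(hel_values, na.rm = TRUE)
print(m)

# Optional: report NA instead of NaN for empty groups
means <- c(3.666667, m)
means[is.nan(means)] <- NA
print(means)
```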

How to intersect values from two data frames with R

I would like to create a new column for a data frame with values from the intersection of a row and a column.
I have a data.frame called "time":
q 1 2 3 4 5
a 1 13 43 5 3
b 2 21 12 3353 34
c 3 21 312 123 343
d 4 123 213 123 35
e 4556 11 123 12 3
And another table, called "event":
q dt
a 1
b 3
c 4
d 2
e 1
I want to add another column called inter to the second table, filled with the values found at the intersection of the row given by q and the column given by dt in the first data.frame. So the result would be this:
q dt inter
a 1 1
b 3 12
c 4 123
d 2 123
e 1 4556
I have tried to use merge(event, time, by.x = "q", by.y = "dt"), but it generates an error saying they don't share the same id. I have also tried transposing the time data.frame to cross-reference the values, but without success.
library(reshape2)
merge(event, melt(time, id.vars = "q"),
by.x=c('q','dt'), by.y=c('q','variable'), all.x = TRUE)
Output:
q dt value
1 a 1 1
2 b 3 12
3 c 4 123
4 d 2 123
5 e 1 4556
Notes
We use the function melt from the package reshape2 to convert the data frame time from wide to long format. Then we merge (left outer join) the data frames event and the melted time by two columns (q and dt in event, q and variable in the melted time).
Data:
time <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), `1` = c(1L, 2L, 3L, 4L, 4556L), `2` = c(13L,
21L, 21L, 123L, 11L), `3` = c(43L, 12L, 312L, 213L, 123L), `4` = c(5L,
3353L, 123L, 123L, 12L), `5` = c(3L, 34L, 343L, 35L, 3L)), .Names = c("q",
"1", "2", "3", "4", "5"), class = "data.frame", row.names = c(NA,
-5L))
event <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), dt = c(1L, 3L, 4L, 2L, 1L)), .Names = c("q",
"dt"), class = "data.frame", row.names = c(NA, -5L))
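This may be a little clunky but it works:
# merge once, outside the loop; xx has columns q, 1..5, dt
xx=merge(time,event,by='q')
dt=xx$dt
inter=c()
for (i in 1:nrow(xx)) {
# column dt[i]+1 because column 1 of xx is q
z=xx[i,dt[i]+1]
inter=c(inter,z)
}
# cbind() coerces the factor q to its integer codes
final=cbind(xx[,1],dt,inter)
colnames(final)=c('q','dt','inter')
Hope it helps.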
Output:
q dt inter
[1,] 1 1 1
[2,] 2 3 12
[3,] 3 4 123
[4,] 4 2 123
[5,] 5 1 4556
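The row/column lookup described in the question can also be written without a loop or a reshape, by indexing a matrix with a two-column matrix of (row name, column name) pairs. This is a base R sketch of that alternative, not what either answer above does; the object m is an illustrative name and the data is rebuilt from the question.

```r
time <- data.frame(
  q   = c("a", "b", "c", "d", "e"),
  `1` = c(1L, 2L, 3L, 4L, 4556L),
  `2` = c(13L, 21L, 21L, 123L, 11L),
  `3` = c(43L, 12L, 312L, 213L, 123L),
  `4` = c(5L, 3353L, 123L, 123L, 12L),
  `5` = c(3L, 34L, 343L, 35L, 3L),
  check.names = FALSE          # keep "1".."5" as column names
)
event <- data.frame(q = c("a", "b", "c", "d", "e"), dt = c(1L, 3L, 4L, 2L, 1L))

# Numeric matrix of the values, with q as row names and dt values as column names
m <- as.matrix(time[-1])
rownames(m) <- as.character(time$q)

# Each row of the index matrix picks one cell: (q, dt)
event$inter <- m[cbind(as.character(event$q), as.character(event$dt))]
print(event)
```

Unlike the cbind() result above, this keeps q as labels rather than integer codes and returns a data frame.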

R: Extract list columns based on column names and patterns

I have a list (here only sample data)
my_list <- list(structure(list(sample = c(2L, 6L), data1 = c(56L, 78L),
data2 = c(59L, 27L), data3 = c(90L, 28L), data1namet = structure(c(1L,
1L), .Label = "Sam1", class = "factor"), data2namab = structure(c(1L,
1L), .Label = "Test2", class = "factor"), dataame = structure(c(1L,
1L), .Label = "Ex3", class = "factor"), ma = c("Jay", "Jay"
)), .Names = c("sample", "data1", "data2", "data3", "data1namet",
"data2namab", "dataame", "ma"), row.names = c(NA, -2L), class = "data.frame"),
structure(list(sample = c(12L, 13L, 17L), data1 = c(56L,
78L, 3L), data2 = c(59L, 27L, 2L), datest = structure(c(1L,
1L, 1L), .Label = "Exa9", class = "factor"), dattestr = structure(c(1L,
1L, 1L), .Label = "cz1", class = "factor"), add = c(2, 2,
2)), .Names = c("sample", "data1", "data2", "datest", "dattestr",
"add"), row.names = c(NA, -3L), class = "data.frame"))
my_list
[[1]]
sample data1 data2 data3 data1namet data2namab dataame ma
1 2 56 59 90 Sam1 Test2 Ex3 Jay
2 6 78 27 28 Sam1 Test2 Ex3 Jay
[[2]]
sample data1 data2 datest dattestr add
1 12 56 59 Exa9 cz1 2
2 13 78 27 Exa9 cz1 2
3 17 3 2 Exa9 cz1 2
I've got two problems:
I would like to extract columns in this list based on patterns of their column names, e.g. all columns which contain the word 'data' in their column name. I wasn't able to find a solution with grep.
I know how to extract one column based on their index number (see example below), but how could I do this selection directly based on the column name (not the column number)?
out <- lapply(my_list, `[`, 1) # extract "sample" column
Try
lapply(my_list, function(df) df[, grep("data", names(df), fixed = TRUE)] )
# [[1]]
# data1 data2 data3 data1namet data2namab dataame
# 1 56 59 90 Sam1 Test2 Ex3
# 2 78 27 28 Sam1 Test2 Ex3
#
# [[2]]
# data1 data2
# 1 56 59
# 2 78 27
# 3 3 2
lapply(my_list, "[", "sample")
# [[1]]
# sample
# 1 2
# 2 6
#
# [[2]]
# sample
# 1 12
# 2 13
# 3 17
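If the pattern needs to be narrower than a fixed substring, a regular expression can be used instead. The sketch below assumes the goal is to keep only columns named data followed by digits; pick and numbered are hypothetical names and the list is a trimmed version of the sample data.

```r
my_list <- list(
  data.frame(sample = c(2L, 6L), data1 = c(56L, 78L), data2 = c(59L, 27L),
             data1namet = "Sam1", ma = "Jay"),
  data.frame(sample = c(12L, 13L, 17L), data1 = c(56L, 78L, 3L),
             datest = "Exa9", add = 2)
)

# Select columns whose names match a regular expression;
# drop = FALSE keeps a data frame even when a single column matches
pick <- function(df, pattern) df[, grepl(pattern, names(df)), drop = FALSE]

# Only data1, data2, ... (not data1namet or datest)
numbered <- lapply(my_list, pick, pattern = "^data[0-9]+$")
print(numbered)
```

Without drop = FALSE, the second element would collapse to a plain vector because only one column matches there.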

Calculate column sums for each combination of two grouping variables

I have a dataset that looks something like this:
Type Age count1 count2 Year Pop1 Pop2 TypeDescrip
A 35 1 1 1990 30000 50000 alpha
A 35 3 1 1990 30000 50000 alpha
A 45 2 3 1990 20000 70000 alpha
B 45 2 1 1990 20000 70000 beta
B 45 4 5 1990 20000 70000 beta
I want to add the counts of the rows that are matching in the Type and Age columns. So ideally I would end up with a dataset that looks like this:
Type Age count1 count2 Year Pop1 Pop2 TypeDescrip
A 35 4 2 1990 30000 50000 alpha
A 45 2 3 1990 20000 70000 alpha
B 45 6 6 1990 20000 70000 beta
I've tried using nested duplicated() statements such as below:
typedup = duplicated(df$Type)
bothdup = duplicated(df[(typedup == TRUE),]$Age)
but this returns indices for which age or type are duplicated, not necessarily when one row has duplicates of both.
I've also tried tapply:
tapply(c(df$count1, df$count2), c(df$Age, df$Type), sum)
but this output is difficult to work with. I want to have a data.frame when I'm done.
I don't want to use a for-loop because my dataset is quite large.
Try
library(dplyr)
df1 %>%
group_by(Type, Age) %>%
summarise_each(funs(sum))
# Type Age count1 count2
#1 A 35 4 2
#2 A 45 2 3
#3 B 45 6 6
In the newer versions of dplyr
df1 %>%
group_by(Type, Age) %>%
summarise_all(sum)
Or using base R
aggregate(.~Type+Age, df1, FUN=sum)
# Type Age count1 count2
#1 A 35 4 2
#2 A 45 2 3
#3 B 45 6 6
Or
library(data.table)
setDT(df1)[, lapply(.SD, sum), .(Type, Age)]
# Type Age count1 count2
#1: A 35 4 2
#2: A 45 2 3
#3: B 45 6 6
Update
Based on the new dataset,
df2 %>%
group_by(Type, Age,Pop1, Pop2, TypeDescrip) %>%
summarise_each(funs(sum), matches('^count'))
# Type Age Pop1 Pop2 TypeDescrip count1 count2
#1 A 35 30000 50000 alpha 4 2
#2 A 45 20000 70000 beta 2 3
#3 B 45 20000 70000 beta 6 6
data
df1 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L,
35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L,
1L, 3L, 1L, 5L)), .Names = c("Type", "Age", "count1", "count2"
), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(Type = c("A", "A", "A", "B", "B"), Age = c(35L,
35L, 45L, 45L, 45L), count1 = c(1L, 3L, 2L, 2L, 4L), count2 = c(1L,
1L, 3L, 1L, 5L), Year = c(1990L, 1990L, 1990L, 1990L, 1990L),
Pop1 = c(30000L, 30000L, 20000L, 20000L, 20000L), Pop2 = c(50000L,
50000L, 70000L, 70000L, 70000L), TypeDescrip = c("alpha",
"alpha", "beta", "beta", "beta")), .Names = c("Type", "Age",
"count1", "count2", "Year", "Pop1", "Pop2", "TypeDescrip"),
class = "data.frame", row.names = c(NA, -5L))
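For the extended dataset, a base R counterpart to the grouped dplyr call in the update is aggregate with cbind() on the left-hand side of the formula, so that only the count columns are summed. This is a sketch under the assumption that Pop1, Pop2 and TypeDescrip should act as grouping columns, as in the update; sums is an illustrative name.

```r
df2 <- data.frame(
  Type = c("A", "A", "A", "B", "B"),
  Age = c(35L, 35L, 45L, 45L, 45L),
  count1 = c(1L, 3L, 2L, 2L, 4L),
  count2 = c(1L, 1L, 3L, 1L, 5L),
  Year = 1990L,
  Pop1 = c(30000L, 30000L, 20000L, 20000L, 20000L),
  Pop2 = c(50000L, 50000L, 70000L, 70000L, 70000L),
  TypeDescrip = c("alpha", "alpha", "beta", "beta", "beta")
)

# Sum count1 and count2 within each combination of the grouping columns;
# Year is left out of the formula and therefore dropped from the result
sums <- aggregate(cbind(count1, count2) ~ Type + Age + Pop1 + Pop2 + TypeDescrip,
                  data = df2, FUN = sum)
print(sums)
```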
@hannah you can also use SQL via the sqldf package:
library(sqldf)
sqldf("select
Type,Age,
sum(count1) as sum_count1,
sum(count2) as sum_count2
from
df
group by
Type,Age
")

Re-align column in data frame into multiple columns

I'm trying to change a data frame column (var3, in the example below) that has multiple values for factor levels of another variable (name, in the example below). I'd like var3 to be split into separate columns, one for each value, so that the factor levels in name do not repeat. My other variables (var1, var2) repeat where necessary to provide space for var3.
This is the kind of data I have:
df1 <- structure(list(name = structure(c(2L, 4L, 4L, 4L, 3L, 5L, 5L,
1L), .Label = c("fifth", "first", "fourth", "second", "third"
), class = "factor"), var1 = c(90L, 84L, 84L, 84L, 18L, 22L,
22L, 36L), var2 = c(301L, 336L, 336L, 336L, 412L, 296L, 296L,
357L), var3 = c(-0.582075925, -1.108889624, -1.014962009, -0.162309524,
-0.282309524, 0.563055819, -0.232075925, -0.773353424)), .Names = c("name",
"var1", "var2", "var3"), class = "data.frame", row.names = c(NA, -8L))
This is what I'd like:
df2 <- structure(list(name = structure(c(2L, 4L, 3L, 5L, 1L), .Label = c("fifth",
"first", "fourth", "second", "third"), class = "factor"), var1 = c(90L,
84L, 18L, 22L, 36L), var2 = c(301L, 336L, 412L, 296L, 357L),
var3 = c(-0.582075925, -1.108889624, -0.282309524, 0.563055819,
-0.773353424), var3.2 = c(NA, -1.014962009, NA, -0.232075925,
NA), var3.3 = c(NA, -0.162309524, NA, NA, NA)), .Names = c("name", "var1",
"var2", "var3", "var3.2", "var3.3"), class = "data.frame", row.names = c(NA, -5L))
I've looked at reshape and ddply, but can't get them to give me this output.
Here's a base solution:
> df1$seqnam <- ave(as.character(df1$name), df1$name, FUN=seq) # creates a "time" index
> reshape(df1, direction="wide", timevar="seqnam", idvar=c("name", "var1", "var2") )
name var1 var2 var3.1 var3.2 var3.3
1 first 90 301 -0.5820759 NA NA
2 second 84 336 -1.1088896 -1.0149620 -0.1623095
5 fourth 18 412 -0.2823095 NA NA
6 third 22 296 0.5630558 -0.2320759 NA
8 fifth 36 357 -0.7733534 NA NA
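To see what the "time" index in the first step looks like, the sketch below builds a within-group counter for the name column on its own. It uses seq_along on an integer vector rather than seq on a character vector, which is a small variation on the answer's call; idx is an illustrative name.

```r
name <- c("first", "second", "second", "second",
          "fourth", "third", "third", "fifth")

# For each distinct name, number its occurrences 1, 2, 3, ... in row order
idx <- ave(seq_along(name), name, FUN = seq_along)
print(idx)
# 1 1 2 3 1 1 2 1
```

reshape() then uses this counter as timevar, so the second and third occurrences of a name become var3.2 and var3.3.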
Alternatively, with ddply from the plyr package:
library(plyr)
ddply(df1, .(name), function(x) {
var3 <- data.frame(rbind(unique(x$var3)))
names(var3) <- paste0("var3.", 1:length(var3))
return(data.frame(name = unique(x$name), var1 = unique(x$var1),
var2 = unique(x$var2), var3))
})
name var1 var2 var3.1 var3.2 var3.3
1 fifth 36 357 -0.7733534 NA NA
2 first 90 301 -0.5820759 NA NA
3 fourth 18 412 -0.2823095 NA NA
4 second 84 336 -1.1088896 -1.0149620 -0.1623095
5 third 22 296 0.5630558 -0.2320759 NA
The function can be modified if you expect var1 and var2 to also contain multiple values.
