Transform a dataframe from long to wide in R, but with a date transformation

I have a dataframe like this (each "NUMBER" indicates a student):
NUMBER Gender Grade Date.Tested WI WR WZ
1 F 4 2014-02-18 6 9 10
1 F 3 2014-05-30 9 8 2
2 M 5 2013-05-02 7 9 15
2 M 4 2009-05-21 5 7 2
2 M 5 2010-04-29 9 1 4
I know I can use:
cook <- reshape(data, timevar= "?", idvar= c("NUMBER","Gender"), direction = "wide")
to change it into a wide format. However, instead of spreading by Date.Tested I want to number the tests (1st time, 2nd time, etc.) and keep the grade for each test.
What I want at the end is like this:
NUMBER Gender Grade1 Grade2 Grade3 WI1 WR1 WZ1 WI2 WR2 WZ2 WI3 WR3 WZ3
1 F 3 4 NA 9 8 2 6 9 10 NA NA NA
and similarly for the rest of the NUMBERs.
I have searched a lot but did not find an answer. Can someone help me with it?
Thank you very much!

Try
data$id <- with(data, ave(seq_along(NUMBER), NUMBER, FUN=seq_along))  # test occasion within each NUMBER
reshape(data, idvar=c('NUMBER', 'Gender'), timevar='id', direction='wide')
If you want the Date.Tested variable to be included in the 'idvar', and you only need the first value per group ('NUMBER'/'Gender'):
data$Date.Tested <- with(data, ave(Date.Tested, NUMBER,
FUN=function(x) head(x,1)))
reshape(data, idvar=c('NUMBER', 'Gender', 'Date.Tested'),
timevar='id', direction='wide')
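For reference, here is a quick run of that idea on the sample data above (a sketch; it assumes the data frame is called data, numbers the tests in row order since the desired output does not follow Date.Tested order, and drops the date; the wide column names come out as Grade.1, WI.1, etc. rather than Grade1, WI1):
data <- data.frame(
  NUMBER = c(1, 1, 2, 2, 2),
  Gender = c("F", "F", "M", "M", "M"),
  Grade  = c(4, 3, 5, 4, 5),
  Date.Tested = as.Date(c("2014-02-18", "2014-05-30", "2013-05-02",
                          "2009-05-21", "2010-04-29")),
  WI = c(6, 9, 7, 5, 9),
  WR = c(9, 8, 9, 7, 1),
  WZ = c(10, 2, 15, 2, 4))
# occasion index within each student, then drop the date and go wide
data$id <- with(data, ave(seq_along(NUMBER), NUMBER, FUN = seq_along))
reshape(data[, names(data) != "Date.Tested"],
        idvar = c("NUMBER", "Gender"), timevar = "id", direction = "wide")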

Related

Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using %in% and the fact that logicals coerce to 0/1. If an observation matches several lists, the last (highest-numbered) one is taken:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(seq_along(list_all), function(n) (dat$Obs %in% list_all[[n]]) * n)
dat$List <- apply(present, 1, FUN = max)   # highest matching list number per row
dat$List[dat$List == 0] <- NA              # no match in any list
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
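An alternative sketch along the same lines: stack() flattens the lists into a lookup table and match() pulls out the list number. Note that this keeps the first matching list rather than the last:
lookup <- stack(setNames(list_all, seq_along(list_all)))   # columns: values, ind (list number)
dat$List <- as.integer(as.character(lookup$ind[match(dat$Obs, lookup$values)]))
dat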

How to merge tables and fill the empty cells at the same time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How to merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
Assuming the first table has NA at every position where you want to use the second table's age values, you can use rbind and na.omit.
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
This gives what you're after (although unordered, and I assume you just forgot ID 5 in your desired output):
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two different data.frames and keep the columns, it's a different thing. You can use merge to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=T)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[grep('AGE\\.x', names(res))] <- 'AGE' # rename merged column AGE.x to AGE
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.
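A slightly safer variant of that hack (just a sketch): match explicitly on ID so the fill does not depend on row order, assuming every ID that is NA in x also appears in y:
na_rows <- is.na(x$AGE)
x$AGE[na_rows] <- y$AGE[match(x$ID[na_rows], y$ID)]
x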

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric.
I want to combine the rows with repeated names into one row, summing 2 of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply and have read a lot of the Q&As here, but can't seem to find a solution.
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
"Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)
The data looks like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 6 2
2 Hank 4 9 2
3 Tom 3 2 8
4 Jeff 3 5 6
5 Hank 4 7 4
6 Jeff 3 4 3
7 Jeff 3 8 3
8 Bill 6 2 3
9 Mark 2 9 1
I'd like to get it to look like this, summing columns 3 and 4 (data.Point.2 and data.Point.3) while leaving data.Point.1 alone:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 23 14
2 Hank 4 16 6
3 Tom 3 2 8
8 Bill 6 2 3
9 Mark 2 9 1
Any help would be great. Thanks!
Another solution, which is a bit more straightforward, uses the dplyr package:
library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>% # group the columns you want to "leave alone"
summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4
If you want to sum over all other columns except those you want to "leave alone", then replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum)).
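Spelled out, that variant is just the following (summarise_each/funs come from older dplyr releases; newer versions express the same idea with across(), but the grouping logic is identical):
library(dplyr)
data %>%
  group_by(Name, data.Point.1) %>%   # columns to leave alone
  summarise_each(funs(sum))          # sum every remaining column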
I'd do it this way using data.table:
setDT(data)[, c(data.Point.1 = data.Point.1[1L],
lapply(.SD, sum)), by=Name,
.SDcols = -"data.Point.1"]
# Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff 3 23 14
# 2: Hank 4 16 6
# 3: Tom 3 2 8
# 4: Bill 6 2 3
# 5: Mark 2 9 1
We group by Name, and for each group we take the first element of data.Point.1; for the rest of the columns, we compute the sum using the base function lapply, looping over the columns of .SD, which stands for Subset of Data. The columns in .SD are provided by .SDcols, from which we remove data.Point.1 so that all the other columns are passed to .SD.
Check the HTML vignettes for detailed info.
You could try
library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
# Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff 3 23 14
#2: Hank 4 16 6
#3: Tom 3 2 8
#4: Bill 6 2 3
#5: Mark 2 9 1
or using base R
data$Name <- factor(data$Name, levels=unique(data$Name))
res <- do.call(rbind,lapply(split(data, data$Name), function(x) {
x[3:4] <- colSums(x[3:4])
x[1,]} ))
Or using dplyr, you can use summarise_each to apply the function to multiple columns, then cbind the output with the summarise output for the single column:
library(dplyr)
res1 <- data %>%
group_by(Name) %>%
summarise(data.Point.1=data.Point.1[1L])
res2 <- data %>%
group_by(Name) %>%
summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
# Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff 3 23 14
#2 Hank 4 16 6
#3 Tom 3 2 8
#4 Bill 6 2 3
#5 Mark 2 9 1
EDIT
The data created and the data shown initially differed in the original post. After the edit to the OP's post (by @dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.

Aggregate dataframe subsets in R

I have the dataframe ds
CountyID ZipCode Value1 Value2 Value3 ... Value25
1 1 0 etc etc etc
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0
and would like to aggregate based on ds$ZipCode and set ds$CountyID equal to the primary county based on the highest ds$Value1. For the above example, it would look like this:
CountyID ZipCode Value1 Value2 Value3 ... Value25
2 1 4 etc etc etc
5 2 2
6 3 3
7 4 9
9 5 1
10 6 0
All the ValueX columns are the sum of that column grouped by ZipCode.
I've tried a bunch of different strategies over the last couple days, but none of them work. The best I've come up with is
#initialize the dataframe
ds_temp = data.frame()
#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {
sub <- subset(ds, ds$ZipCode == zip)
len <- length(sub)
maxIndex <- which.max(sub$Value1)
#do the aggregation
row <- aggregate(sub[3:27], FUN=sum, by=list(
CountyID = rep(sub$CountyID[maxIndex], len),
ZipCode = sub$ZipCode))
rbind(ds_temp, row)
}
ds <- ds_temp
I haven't been able to test this on the real data, but with dummy datasets (such as the one above) I keep getting the error "arguments must have the same length". I've messed around with rep() and fixed vectors (e.g. c(1,2,3,4)), but no matter what I do, the error persists. I also occasionally get an error to the effect of "cannot subset data of type 'closure'".
Any ideas? I've also tried messing around with data.frame(), ddply(), data.table(), dcast(), etc.
You can try this:
data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
Fully reproducible sample data:
df<-read.table(text="
CountyID ZipCode Value1
1 1 0
2 1 3
3 1 0
4 1 1
5 2 2
6 3 3
7 4 7
8 4 2
9 5 1
10 6 0", header=TRUE)
data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
CountyID = unlist(lapply(split(df, df$ZipCode),
function(x) x$CountyID[which.max(x$Value1)])))
# Group.1 x CountyID
#1 1 4 2
#2 2 2 5
#3 3 3 6
#4 4 9 7
#5 5 1 9
#6 6 0 10
In response to your comment on Frank's answer, you can preserve the column names by using the formula method in aggregate. Using Frank's data df, this would be
> cbind(aggregate(Value1 ~ ZipCode, df, sum),
CountyID = sapply(split(df, df$ZipCode), function(x) {
with(x, CountyID[Value1 == max(Value1)]) }))
# ZipCode Value1 CountyID
# 1 1 4 2
# 2 2 2 5
# 3 3 3 6
# 4 4 9 7
# 5 5 1 9
# 6 6 0 10
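For completeness, a dplyr sketch of the same idea (assuming the value columns are all named Value1 through Value25, and a dplyr version recent enough to have across()):
library(dplyr)
ds %>%
  group_by(ZipCode) %>%
  summarise(CountyID = CountyID[which.max(Value1)],  # primary county by highest Value1
            across(starts_with("Value"), sum))       # sum every Value column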

How to convert data.frame to (flat) matrix?

How can I convert the data.frame below to a matrix as given? The first two columns of the data.frame contain the row variables; all combinations of the other columns (except the one containing the values) determine the columns. Ideally, I'm looking for a solution that does not require further packages (so no reshape2 solution). Also, no ftable solution.
(df <- data.frame(c1=rep(c(1, 2), each=8), c2=rep(c(1, 2, 1, 2), each=4),
gr=rep(c(1, 2), 8), subgr=rep(c(1,2), 4, each=2), val=1:16) )
c1 c2 gr1.subgr1 gr1.subgr2 gr2.subgr1 gr2.subgr2
1 1 1 3 2 4
1 2 5 7 6 8
2 1 9 11 10 12
2 2 13 15 14 16
Use an interaction variable to construct the groups:
newdf <- reshape(df, idvar=1:2, direction="wide",
timevar=interaction(df$gr,df$subgr) ,
v.names="val",
drop=c("gr","subgr") )
names(newdf)[3:6] <- c("gr1.subgr1", "gr1.subgr2", "gr2.subgr1", "gr2.subgr2")
newdf
c1 c2 gr1.subgr1 gr1.subgr2 gr2.subgr1 gr2.subgr2
1 1 1 1 2 3 4
5 1 2 5 6 7 8
9 2 1 9 10 11 12
13 2 2 13 14 15 16
Alright - this looks like it mostly does what you want. From reading the help file, the following seemed like it should work:
reshape(df, idvar = c("c1", "c2"), timevar = c("gr", "subgr")
, direction = "wide")
c1 c2 val.c(1, 2, 1, 2) val.c(1, 1, 2, 2)
1 1 1 NA NA
5 1 2 NA NA
9 2 1 NA NA
13 2 2 NA NA
I can't fully explain why it shows up with NA values. However, maybe this bit from the help page explains:
timevar
the variable in long format that differentiates multiple records from the same
group or individual. If more than one record matches, the first will be taken.
I initially took that to mean that R would use its partial matching capabilities if there was an ambiguity in the column names you gave it, but maybe not? Next, I tried combining gr and subgr into a single column:
df$newcol <- with(df, paste("gr.", gr, "subgr.", subgr, sep = ""))
And let's try this again:
reshape(df, idvar = c("c1", "c2"), timevar = "newcol"
, direction = "wide", drop= c("gr","subgr"))
c1 c2 val.gr.1subgr.1 val.gr.2subgr.1 val.gr.1subgr.2 val.gr.2subgr.2
1 1 1 1 2 3 4
5 1 2 5 6 7 8
9 2 1 9 10 11 12
13 2 2 13 14 15 16
Presto! I can't explain or figure out how to make it not append val. to the column names, but I'll leave you to figure that out on your own. I'm sure it's on the help page somewhere. It also put the groups in a different order than you requested, but the data seems to be right.
FWIW, here's a solution with reshape2
> dcast(c1 + c2 ~ gr + subgr, data = df, value.var = "val")
c1 c2 1_1 1_2 2_1 2_2
1 1 1 1 3 2 4
2 1 2 5 7 6 8
3 2 1 9 11 10 12
4 2 2 13 15 14 16
Though you still have to clean up column names.
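Since the question asks for a matrix without extra packages, here is a minimal base-R sketch with tapply (assuming df as defined above); it returns an actual matrix keyed by the c1/c2 combination rather than keeping c1 and c2 as separate columns:
with(df, tapply(val,
                list(paste(c1, c2, sep = "."),            # row key: c1.c2
                     paste0("gr", gr, ".subgr", subgr)),   # column key: gr/subgr combination
                sum))   # each cell has exactly one value, so sum() just returns it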
