How to convert data.frame to (flat) matrix? - r

How can I convert the data.frame below to a matrix as given? the first two columns of the data.frame contain the row variables, all combinations of the other columns (except the one containing the values) determine the columns. Ideally, I'm looking for a solution that does not require further packages (so no reshape2 solution). Also, no ftable solution.
(df <- data.frame(c1=rep(c(1, 2), each=8), c2=rep(c(1, 2, 1, 2), each=4),
gr=rep(c(1, 2), 8), subgr=rep(c(1,2), 4, each=2), val=1:16) )
c1 c2 gr1.subgr1 gr1.subgr2 gr2.subgr1 gr2.subgr2
1 1 1 3 2 4
1 2 5 7 6 8
2 1 9 11 10 12
2 2 13 15 14 16

Use an interaction variable to construct the groups:
newdf <- reshape(df, idvar=1:2, direction="wide",
timevar=interaction(df$gr,df$subgr) ,
v.names="val",
drop=c("gr","subgr") )
names(newdf)[3:6] <- c("gr1.subgr1", "gr1.subgr2", "gr2.subgr1", "gr2.subgr2")
newdf
c1 c2 gr1.subgr1 gr1.subgr2 gr2.subgr1 gr2.subgr2
1 1 1 1 2 3 4
5 1 2 5 6 7 8
9 2 1 9 10 11 12
13 2 2 13 14 15 16

Alright - this looks like it does mostly what you want. From reading the help file, this seems like it should do what you want:
reshape(df, idvar = c("c1", "c2"), timevar = c("gr", "subgr")
, direction = "wide")
c1 c2 val.c(1, 2, 1, 2) val.c(1, 1, 2, 2)
1 1 1 NA NA
5 1 2 NA NA
9 2 1 NA NA
13 2 2 NA NA
I can't fully explain why it shows up with NA values. However, maybe this bit from the help page explains:
timevar
the variable in long format that differentiates multiple records from the same
group or individual. If more than one record matches, the first will be taken.
I initially took that to mean that R would use it's partial matching capabilities if there was an ambiguity in the column names you gave it, but maybe not? Next, I tried combining gr and subgr into a single column:
df$newcol <- with(df, paste("gr.", gr, "subgr.", subgr, sep = ""))
And let's try this again:
reshape(df, idvar = c("c1", "c2"), timevar = "newcol"
, direction = "wide", drop= c("gr","subgr"))
c1 c2 val.gr.1subgr.1 val.gr.2subgr.1 val.gr.1subgr.2 val.gr.2subgr.2
1 1 1 1 2 3 4
5 1 2 5 6 7 8
9 2 1 9 10 11 12
13 2 2 13 14 15 16
Presto! I can't explain or figure out how to make it not append val. to the column names, but I'll leave you to figure that out on your own. I'm sure it's on the help page somewhere. It also put the groups in a different order than you requested, but the data seems to be right.
FWIW, here's a solution with reshape2
> dcast(c1 + c2 ~ gr + subgr, data = df, value.var = "val")
c1 c2 1_1 1_2 2_1 2_2
1 1 1 1 3 2 4
2 1 2 5 7 6 8
3 2 1 9 11 10 12
4 2 2 13 15 14 16
Though you still have to clean up column names.

Related

How to rearrange a data in R

I have a long data list similar to the following one:
set.seed(9)
part_number<-sample(1:5,5,replace=TRUE)
Type<-sample( c("A","B","C"),5, replace=TRUE)
rank<-sample(1:20,5,replace=TRUE)
data<-data.frame(cbind(part_number,Type,rank))
data
part_number Type rank
1 2 A 3
2 1 B 1
3 2 B 18
4 2 C 7
5 3 C 10
I want to rearrange the data in the following way:
part_number A B C
1 1
2 3 18 7
3 10
I think I need to use the reshape library. But I am not sure.
libary(tidyr)
data %>% spread(Type,rank)
# part_number A B C
# 1 1 <NA> 1 <NA>
# 2 2 3 18 7
# 3 3 <NA> <NA> 10
You would go about doing the following:
data <- reshape(data, idvar = "part_number", timevar = "Type", direction = "wide")
data
To format it exactly as you asked, I would add in,
library(tidyverse)
data %>%
arrange(part_number) %>%
dplyr::select(part_number, A = rank.A, B = rank.B, C = rank.C)
If you however had a lot more columns to rename, I would use the gsub function to rename by pattern. In addition, since now the row names are messy,
rownames(data) <- c()
Let me know if this doesn't work or this wasn't what you had in mind.

Adding NA's where data is missing [duplicate]

This question already has an answer here:
Insert missing time rows into a dataframe
(1 answer)
Closed 5 years ago.
I have a dataset that look like the following
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
data.frame(id,cycle,value)
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 3 8
9 4 2 9
so basically there is a variable called id that identifies the sample, a variable called cycle which identifies the timepoint, and a variable called value that identifies the value at that timepoint.
As you see, sample 3 does not have cycle 2 data and sample 4 is missing cycle 1 and 3 data. What I want to know is there a way to run a command outside of a loop to get the data to place NA's where there is no data. So I would like for my dataset to look like the following:
> data.frame(id,cycle,value)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
I am able to solve this problem with a lot of loops and if statements but the code is extremely long and cumbersome (I have many more columns in my real dataset).
Also, the number of samples I have is very large so I need something that is generalizable.
Using merge and expand.grid, we can come up with a solution. expand.grid creates a data.frame with all combinations of the supplied vectors (so you'd supply it with the id and cycle variables). By merging to your original data (and using all.x = T, which is like a left join in SQL), we can fill in those rows with missing data in dat with NA.
id = c(1,1,1,2,2,2,3,3,4)
cycle = c(1,2,3,1,2,3,1,3,2)
value = 1:9
dat <- data.frame(id,cycle,value)
grid_dat <- expand.grid(id = 1:4,
cycle = 1:3)
# or you could do (HT #jogo):
# grid_dat <- expand.grid(id = unique(dat$id),
# cycle = unique(dat$cycle))
merge(x = grid_dat, y = dat, by = c('id','cycle'), all.x = T)
id cycle value
1 1 1 1
2 1 2 2
3 1 3 3
4 2 1 4
5 2 2 5
6 2 3 6
7 3 1 7
8 3 2 NA
9 3 3 8
10 4 1 NA
11 4 2 9
12 4 3 NA
A solution based on the package tidyverse.
library(tidyverse)
# Create example data frame
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 4)
cycle <- c(1, 2, 3, 1, 2, 3, 1, 3, 2)
value <- 1:9
dt <- data.frame(id, cycle, value)
# Complete the combination between id and cycle
dt2 <- dt %>% complete(id, cycle)
Here is a solution with data.table doing a cross join:
library("data.table")
d <- data.table(id = c(1,1,1,2,2,2,3,3,4), cycle = c(1,2,3,1,2,3,1,3,2), value = 1:9)
d[CJ(id=id, cycle=cycle, unique=TRUE), on=.(id,cycle)]

How to do iterations in R?

I'm operating with a dataset that contains the values of same variables at different points in time. In the example below I have the values of variables a and b at time points 1 and 2.
> set.seed(1)
> data <- data.frame(matrix(sample(16), ncol = 4))
> names(data) <- paste(rep(c("a", "b"), each = 2), 1:2, sep = "")
> data
a1 a2 b1 b2
1 5 3 14 13
2 6 10 1 8
3 9 11 2 4
4 12 15 7 16
Now, suppose I want to calculate a new variable for both time points so that it would contain the sum of a and b (instead of the NAs as in example below). Since my actual dataset contains about 15 different variables and 10 time points (so 150 columns), I want to automate this calculation of 10 new variables.
> data[, paste("ab", 1:2, sep = "")] <- NA
> data
a1 a2 b1 b2 ab1 ab2
1 5 3 14 13 NA NA
2 6 10 1 8 NA NA
3 9 11 2 4 NA NA
4 12 15 7 16 NA NA
I've previously used Stata where I could create a simple 'foreach' loop to do this. Something like below.
foreach t of numlist 1/2 {
generate ab`t' = a`t' + b`t'
}
But I've learned that using loops in R is not feasible, nor have I any idea how to loop over variable names like that in R.
So what would be the correct solution for my problem in R?
This will replicate the same foreach loop you used in Stata.
for(i in 1:2){
data[, paste("ab", i, sep="")] <-
data[,paste("a", i, sep="")] + data[, paste("b", i, sep="")]
}
The output looks like this:
> data
a1 a2 b1 b2 ab1 ab2
1 15 1 16 12 31 13
2 10 7 14 3 24 10
3 2 5 9 4 11 9
4 6 8 13 11 19 19
to do this the R way,
make use of some native iteration via a *apply function
use the built-in rowSums (as in #Sotos) answer
make use of assignment into the data.frame, that is `]`<-
all together
data[paste0('ab', 1:2)] <- sapply(1:2,
function(i)
rowSums(data[paste0(c('a', 'b'), i)]))
data
# a1 a2 b1 b2 ab1 ab2
# 1 5 3 14 13 19 16
# 2 6 10 1 8 7 18
# 3 9 11 2 4 11 15
# 4 12 15 7 16 19 31
ps, in a program use vapply instead, you'll need to provide an additional argument specifying the shape of the output but its safer and sometimes faster
You can do without iteration:
data$ab1 <- data$a1 + data$b1
data$ab2 <- data$a2 + data$b2
or
data <- transform(data, ab1=a1+b1, ab2=a2+b2)
BTW:
It is better not to name an object data because data= is often a parameter in functions.
Here is one way to do it. We iterate over the unique values of the column names and we calculate the rowSums when those unique values match the colname values.
sapply(unique(sub('\\D', '', names(data))),
function(i) rowSums(data[,grepl(i, sub('\\D', '', names(data)))]))
# 1 2
#[1,] 17 23
#[2,] 24 22
#[3,] 14 10
#[4,] 15 11

transform a dataframe from long to wide in r, but needs date transformation

I have a dataframe like this (each "NUMBER" indicate a student):
NUMBER Gender Grade Date.Tested WI WR WZ
1 F 4 2014-02-18 6 9 10
1 F 3 2014-05-30 9 8 2
2 M 5 2013-05-02 7 9 15
2 M 4 2009-05-21 5 7 2
2 M 5 2010-04-29 9 1 4
I know I can use:
cook <- reshape(data, timevar= "?", idvar= c("NUMBER","Gender"), direction = "wide")
to change it into a wide format. However, I want to remove the date.tested to the times (1st time, 2nd time...etc), and indicate the grade.
What I want at the end is like this:
NUMBER Gender Grade1 Grade 2 Grade 3 WI1 WR1 WZ1 WI2 WR2 WZ2 WI3 WR3 WZ3
1 F 3 4 NA 9 8 2 6 9 10 NA NA NA
and for the rest "NUMBER"s.
I have searched a lot but did not find an answer. Can someone help me with it?
Thank you very much!
Try
data$id <- with(data, ave(seq_along(NUMBER), NUMBER, FUN=seq_along))
reshape(data, idvar=c('NUMBER', 'Gender'), timevar='id', direction='wide')
If you want the Date.Tested variable to be included in the 'idvar' and you need only the 1st value for the group ('NUMBER' or 'GENDER')
data$Date.Tested <- with(data, ave(Date.Tested, NUMBER,
FUN=function(x) head(x,1)))
reshape(data, idvar=c('NUMBER', 'Gender', 'Date.Tested'),
timevar='id', direction='wide')

Read csv with two headers into a data.frame

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.
I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:
a b
x y z x y z
10 1 5 22 1 6
12 2 6 21 3 5
12 2 7 11 3 7
13 1 4 33 2 8
12 2 5 44 1 9
csv file below:
a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9
How can I get to a data.frame in R as shown below?
cell x y z
a 10 1 5
a 12 2 6
a 12 2 7
a 13 1 4
a 12 2 5
b 22 1 6
b 21 3 5
b 11 3 7
b 33 2 8
b 44 1 9
Use base R reshape():
temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
# time x y z id
# 1.0 0 10 1 5 1
# 2.0 0 12 2 6 2
# 3.0 0 12 2 7 3
# 4.0 0 13 1 4 4
# 5.0 0 12 2 5 5
# 1.1 1 22 1 6 1
# 2.1 1 21 3 5 2
# 3.1 1 11 3 7 3
# 4.1 1 33 2 8 4
# 5.1 1 44 1 9 5
Basically, you should just skip the first row, where there are the letters a-g every third column. Since the sub-column names are all the same, R will automatically append a grouping number after all of the columns after the third column; so we need to add a grouping number to the first three columns.
You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.
You can change the "time" variable to your "cell" variable as follows:
# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])
Then, drop the "time" column:
OUT$time = NULL
Update
To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE,
nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(temp$time, labels=GROUPS)
OUT$time = NULL

Resources