Exporting data of unequal length to CSV - r

I'm preprocessing some data from sensor-created files into the format required for external analysis (ultimately, it needs to be output as a CSV). The end goal is something like this:
1 C3 C4 Cz Pz AllSites 2 C3 C4 Cz Pz AllSites 3 C3 C4 Cz Pz AllSites
50:23.9 0 0 0 0 0 53:15.0 0 0 0 0 0 09:15.0 0 0 0 0 0
50:24.9 1 0 0 1 0 53:16.0 1 0 0 1 0 09:16.1 0 0 1 0 0
50:26.0 1 0 0 0 0 53:17.1 1 0 0 1 0 09:17.1 0 0 1 0 0
50:27.0 1 0 0 1 0 53:18.1 1 1 1 0 0 09:18.1 0 0 1 1 0
50:28.0 0 1 0 0 0 53:19.2 1 0 0 0 0 09:19.2 0 0 1 0 0
50:29.1 1 1 1 1 1 53:20.2 1 0 0 1 0 09:20.2 0 0 1 0 0
50:30.2 0 1 1 0 0 53:21.2 1 0 0 0 0 09:21.2 0 0 0 1 0
50:31.2 0 0 0 0 0 53:22.3 0 0 0 0 0 09:22.3 0 0 0 1 0
Each set of columns is data from one session. The only catch is that sessions are of unequal length (and thus each group has a different number of observations), so at the moment it's all in a list instead of a data frame. I have found a few different ways of exporting to CSV (e.g., this question), but they all involve converting to a data frame first. How do I export a list to CSV without converting it to a data frame first?
N.B.: I also found a bunch of questions about exporting a list of data frames to a series of CSV files, but for this application, all the data frames need to be in a single CSV.

Let's make some simple samples:
b1 = data.frame(C3=sample(c(0,1),8,TRUE),C4=sample(c(0,1),8,TRUE),Cz=sample(c(0,1),8,TRUE))
b2 = data.frame(C3=sample(c(0,1),3,TRUE),C4=sample(c(0,1),3,TRUE),Cz=sample(c(0,1),3,TRUE))
b3 = data.frame(C3=sample(c(0,1),8,TRUE),C4=sample(c(0,1),8,TRUE),Cz=sample(c(0,1),8,TRUE))
You can't just column-bind them and hope R pads out the shorter columns:
> cbind(b1,b2,b3)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 8, 3
So we need to paste them into a big enough data frame. Let's make one full of NAs to start:
b = data.frame(matrix(NA, ncol=ncol(b1)+ncol(b2)+ncol(b3), nrow=max(nrow(b1),nrow(b2),nrow(b3))))
dim(b)
[1] 8 9
Then this code puts each data frame in the right place, each one shifted a bit further along the columns:
> b[1:nrow(b1), 1:ncol(b1)] = b1
> b[1:nrow(b2), (1:ncol(b2)) + ncol(b1)] = b2
> b[1:nrow(b3), (1:ncol(b3)) + ncol(b1) + ncol(b2)] = b3
> b
X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 1 1 1 0 0 0 0 1
2 1 1 0 0 0 0 0 1 0
3 0 0 1 0 1 1 0 1 1
4 1 1 1 NA NA NA 1 1 1
5 0 0 0 NA NA NA 0 0 0
6 0 1 0 NA NA NA 1 0 1
7 0 0 0 NA NA NA 1 1 1
8 0 1 0 NA NA NA 1 1 1
Easy enough to generalise in a loop over a list (there's a sketch of that at the end of this answer). Now:
> write.csv(b,na="")
"","X1","X2","X3","X4","X5","X6","X7","X8","X9"
"1",1,1,1,1,0,0,0,0,1
"2",1,1,0,0,0,0,0,1,0
"3",0,0,1,0,1,1,0,1,1
"4",1,1,1,,,,1,1,1
"5",0,0,0,,,,0,0,0
"6",0,1,0,,,,1,0,1
"7",0,0,0,,,,1,1,1
"8",0,1,0,,,,1,1,1
Gives us those empty cells. You'll probably need to fiddle about a bit to get the column headers back and repeated, but that's easy enough...
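A rough sketch of that loop generalisation, assuming the sessions are collected in a list (here called sessions, a made-up name for this example). Note that cbind() on data frames keeps duplicate column names, so the headers come back repeated per session for free:
sessions <- list(b1, b2, b3)

pad_and_bind <- function(lst) {
  n <- max(sapply(lst, nrow))                # length of the longest session
  padded <- lapply(lst, function(x) {
    extra <- n - nrow(x)
    if (extra > 0) {
      filler <- as.data.frame(matrix(NA, nrow = extra, ncol = ncol(x)))
      names(filler) <- names(x)
      x <- rbind(x, filler)                  # pad the shorter sessions with NA rows
    }
    x
  })
  do.call(cbind, padded)                     # column-bind; headers repeat per session
}

write.csv(pad_and_bind(sessions), "sessions.csv", na = "", row.names = FALSE)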

Not sure if this is what you need... but it's a shot...
a <- data.frame(small=letters)
b <- data.frame(big=LETTERS)
l <- list(a=a, b=b)
sapply(names(l), function(x) write.csv(l[[x]], file = paste0(x, ".csv")))
# or maybe all in the same file...
sapply(names(l), function(x) write.table(l[[x]], file = "c.csv", append = TRUE))
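If everything really has to end up in one file, a slightly fuller sketch (still just stacking the frames one under the other, with each list element's name as a separator line) might be:
for (nm in names(l)) {
  cat(nm, "\n", file = "c.csv", append = TRUE)   # element name as a separator
  suppressWarnings(                              # write.table warns when appending a header
    write.table(l[[nm]], file = "c.csv", append = TRUE, sep = ",", row.names = FALSE)
  )
  cat("\n", file = "c.csv", append = TRUE)       # blank line between blocks
}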

A csv file is most often used to export data in tabular form; it maps naturally onto an R data.frame. list objects are far more general and offer a flexibility that a flat csv format often cannot capture.
In your case, sure, you have a list, but its components are data frames that (apparently) share the same structure (same number and names of columns). So it's pretty trivial to join them all into a single data frame; you only need an additional column that indicates the session. If mylist is your list, you can try:
mydf <- do.call(rbind, mylist)
elLength <- vapply(mylist, nrow, integer(1))
mydf$Session <- rep(seq_along(mylist), times = elLength)
In this way you end up with a single data frame, and you can recover each session through the Session column. You can then use write.csv to export it to a csv file.
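A minimal worked example of that, with two made-up session data frames just for illustration:
mylist <- list(
  data.frame(C3 = c(0, 1, 1), C4 = c(0, 0, 1)),   # hypothetical session 1
  data.frame(C3 = c(1, 0),    C4 = c(1, 1))       # hypothetical session 2
)
mydf <- do.call(rbind, mylist)
mydf$Session <- rep(seq_along(mylist), times = vapply(mylist, nrow, integer(1)))
mydf
#   C3 C4 Session
# 1  0  0       1
# 2  1  0       1
# 3  1  1       1
# 4  1  1       2
# 5  0  1       2
write.csv(mydf, "sessions_long.csv", row.names = FALSE)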

Related

Loading a matrix from a csv (dataframe) to R

I have my data in this format in a csv
state1 state2 state3
1 0 0
1 0 0
0 0 1
and I would like this in a matrix in R like this
1 0 0
1 0 0
0 0 1
I have tried loading the csv as a data frame and putting it into a matrix, but it does not give the desired result. Using this code, I get this output:
library(readr)   # read_csv() comes from the readr package
df <- read_csv('filepath')
m <- matrix(df)
matrix format output:
1 1 0
0 0 0
0 0 1
This essentially turns a column into a row instead of maintaining what was originally stored. I want the matrix to keep the same structure that was in my csv file.
You just need as.matrix(); it keeps the rows and columns exactly as they are in the csv, so no transpose is required:
m <- as.matrix(read_csv('filepath'))
m
## 1 0 0
## 1 0 0
## 0 0 1
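If you would rather stay in base R, the same idea works with read.csv():
m <- as.matrix(read.csv('filepath'))   # as.matrix() preserves the csv's row/column layout
m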

How can I create a new binary variable that categorizes people based on if they EVER had a certain response in the dataset?

I'm examining drug use over 5 years in about 100 people. I want to create a binary variable that indicates whether people could ever be considered drug users (0=never user, 1=user).
Below, 1 indicates drug use, 0 indicates none, and NA indicates missing data at that time. Here are some example cases:
0 0 0 1 1
0 1 0 1 1
NA 0 1 0 NA
NA 0 0 0 1
0 0 NA NA 0
Almost all of my cases have missing data for at least one time point.
I'm new to R so I'm really struggling to figure out how to create this new binary variable. Basically the code needs to scan all 5 time points to see if a "1" ever appears, and it needs to be able to handle NAs.
Any advice would be great!
We can use rowSums
df1$new <- +(rowSums(df1 == 1, na.rm = TRUE) > 0)
Assuming the individuals show up on the rows, and the years as columns:
d <- read.table(text="
0 0 0 1 1
0 1 0 1 1
NA 0 1 0 NA
NA 0 0 0 1
0 0 NA NA 0", header=F
)
d$true <- apply(d, 1, function(x) any(x == 1, na.rm = TRUE)) * 1
d
V1 V2 V3 V4 V5 true
1 0 0 0 1 1 1
2 0 1 0 1 1 1
3 NA 0 1 0 NA 1
4 NA 0 0 0 1 1
5 0 0 NA NA 0 0
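For comparison, the rowSums() approach from the first answer gives the same result on this sample data (new is just a made-up column name):
# rowSums() counts the 1s in each row (ignoring NAs); > 0 turns that into
# TRUE/FALSE, and the leading + coerces it to 1/0
d$new <- +(rowSums(d[, 1:5] == 1, na.rm = TRUE) > 0)
d$new
# [1] 1 1 1 1 0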

Selecting specific columns from dataset

I have a dataset which looks this this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want to have new data frame (df) which only contains columns which ends with 1.1, 2.1 i.e.
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
Here I only show a few columns, but the dataset actually contains more than 100 columns, so please provide a solution that works no matter how many columns the dataset has.
Thanks in advance.
I guess the pattern is that the column name ends in ".1"; you may need to adapt that to your data.
The data I am using:
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
Actually, this matches everything ending with "1" (the unescaped dot matches any character):
df <- original_data[which(grepl(".1$", names(original_data)))]
To match a literal ".1" at the end, you have to escape the dot:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both gave me the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0
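If you prefer to avoid regular expressions altogether, base R's endsWith() (or dplyr's ends_with() helper) does the same job; a sketch using the same data:
df <- original_data[endsWith(names(original_data), ".1")]

# or, assuming the dplyr package is installed:
# library(dplyr)
# df <- select(original_data, ends_with(".1"))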

r sum several columns by another column

I have a 39-column data frame (with upward of 100,000 rows) whose last columns look like this (the rest of the columns do not concern my question):
H3K27me3_gross_bin H3K4me3_gross_bin H3K4me1_gross_bin UtoP UtoM UPU UPP UPM UMU UMP UMM
cg00000029 3 3 6 1 1 0 0 0 0 0 0
cg00000321 6 1 5 1 0 0 1 0 0 0 0
cg00000363 6 1 1 1 0 1 0 0 0 0 0
cg00000622 1 2 1 0 0 0 0 0 0 0 0
cg00000714 2 5 6 1 0 0 0 0 0 0 0
cg00000734 2 6 2 0 0 0 0 0 0 0 0
I want to create a matrix that will:
a) for each of the first three columns (H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), count the number of rows in which each of the columns UPU, UPP and UPM equals 1
b) sum the columns UPU, UPP and UPM row-wise, again grouped by the first three columns
I came up with this incredibly cumbersome way of doing this:
UtoPFrac <- seq(6)
UtoPTotEvents <- seq(6)
for (j in 1:3) {
  y <- df[, 28 + j]
  for (i in 1:3) {
    UtoPFrac <- cbind(UtoPFrac,
                      tapply(df[which(is.na(y) == FALSE), 33 + i],
                             y[which(is.na(y) == FALSE)],
                             function(x) length(which(x == 1))))
  }
}
UtoPFrac <- UtoPFrac[, 2:10]
UtoPEvents <- cbind(rowSums(UtoPFrac[, 1:3]), rowSums(UtoPFrac[, 4:6]), rowSums(UtoPFrac[, 7:9]))
I am certain there is a more elegant way of doing this, probably by using aggregate() or ddply(), but I was unable to get it working.
I would appreciate any help doing this more efficiently.
Thanks in advance
Not tested:
library(plyr)
ddply(df, .(H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), summarize,
      UPUl = length(UPU[which(UPU == 1)]),
      UPPl = length(UPP[which(UPP == 1)]),
      UPMl = length(UPM[which(UPM == 1)]),
      mysum = sum(UPU + UPP + UPM))
P.S. If you dput the data and provide the expected output, I will test the above code
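An untested base-R sketch of the same aggregation, in case you prefer not to load plyr (it assumes UPU, UPP and UPM are 0/1 columns, so summing them counts the 1s):
agg <- aggregate(cbind(UPU, UPP, UPM) ~ H3K27me3_gross_bin + H3K4me3_gross_bin + H3K4me1_gross_bin,
                 data = df, FUN = sum)
agg$mysum <- agg$UPU + agg$UPP + agg$UPM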

model.matrix() with na.action=NULL?

I have a formula and a data frame, and I want to extract the model.matrix(). However, I need the resulting matrix to include the NAs that were found in the original dataset. If I were to use model.frame() to do this, I would simply pass it na.action=NULL. However, the output I need is of the model.matrix() format. Specifically, I need only the right-hand side variables, I need the output to be a matrix (not a data frame), and I need factors to be converted to a series of dummy variables.
I'm sure I could hack something together using loops or something, but I was wondering if anyone could suggest a cleaner and more efficient workaround. Thanks a lot for your time!
And here's an example:
dat <- data.frame(matrix(rnorm(20),5,4), gl(5,2))
dat[3,5] <- NA
names(dat) <- c(letters[1:4], 'fact')
ff <- a ~ b + fact
# This omits the row with a missing observation on the factor
model.matrix(ff, dat)
# This keeps the NA, but it gives me a data frame and does not dichotomize the factor
model.frame(ff, dat, na.action=NULL)
Here is what I would like to obtain:
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.7266086 0 0 0 0
2 1 -0.6088697 0 0 0 0
3 NA 0.4643360 NA NA NA NA
4 1 -1.1666248 1 0 0 0
5 1 -0.7577394 0 1 0 0
6 1 0.7266086 0 1 0 0
7 1 -0.6088697 0 0 1 0
8 1 0.4643360 0 0 1 0
9 1 -1.1666248 0 0 0 1
10 1 -0.7577394 0 0 0 1
Joris's suggestion works, but a quicker and cleaner way to do this is via the global na.action setting. The 'Pass' option achieves our goal of preserving NA's from the original dataset.
Option 1: Pass
The resulting matrix will contain NA's in the rows that have NA's in the original dataset.
options(na.action='na.pass')
model.matrix(ff, dat)
Option 2: Omit
Resulting matrix will skip rows containing NA's.
options(na.action='na.omit')
model.matrix(ff, dat)
Option 3: Fail
An error will occur if the original data contains NA's.
options(na.action='na.fail')
model.matrix(ff, dat)
Of course, always be careful when changing global options because they can alter behavior of other parts of your code. A cautious person might store the original setting with something like current.na.action <- options('na.action'), and then change it back after making the model.matrix.
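Something along these lines:
current.na.action <- options('na.action')   # remember the current setting
options(na.action = 'na.pass')
mm <- model.matrix(ff, dat)                 # NAs are now kept in the matrix
options(current.na.action)                  # restore the original setting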
Another way is to pass a model.frame built with na.action=na.pass as the second argument to model.matrix:
> model.matrix(ff, model.frame(~ ., dat, na.action=na.pass))
(Intercept) b fact2 fact3 fact4 fact5
1 1 -1.3560754 0 0 0 0
2 1 2.5476965 0 0 0 0
3 1 0.4635628 NA NA NA NA
4 1 -0.2871379 1 0 0 0
5 1 2.2684958 0 1 0 0
6 1 -1.3560754 0 1 0 0
7 1 2.5476965 0 0 1 0
8 1 0.4635628 0 0 1 0
9 1 -0.2871379 0 0 0 1
10 1 2.2684958 0 0 0 1
model.frame allows you to set the appropriate action for na.action which is maintained when model.matrix is called.
I half-stumbled across a simpler solution after looking at mattdevlin and Nathan Gould's answers:
model.matrix.lm(ff, dat, na.action = "na.pass")
model.matrix.default may not support the na.action argument, but model.matrix.lm does!
(I found model.matrix.lm from Rstudio's auto-complete suggestions — it appears to be the only non-default method for model.matrix if you haven't loaded any libraries that add others. Then I just guessed it might support the na.action argument.)
You can mess around a little with the model.matrix object, based on the rownames:
MM <- model.matrix(ff,dat)
MM <- MM[match(rownames(dat),rownames(MM)),]
MM[,"b"] <- dat$b
rownames(MM) <- rownames(dat)
which gives:
> MM
(Intercept) b fact2 fact3 fact4 fact5
1 1 0.9583010 0 0 0 0
2 1 0.3266986 0 0 0 0
3 NA 1.4992358 NA NA NA NA
4 1 1.2867461 1 0 0 0
5 1 0.5024700 0 1 0 0
6 1 0.9583010 0 1 0 0
7 1 0.3266986 0 0 1 0
8 1 1.4992358 0 0 1 0
9 1 1.2867461 0 0 0 1
10 1 0.5024700 0 0 0 1
Alternatively, you can use contrasts() to do the work for you. Constructing the matrix by hand would be :
cont <- contrasts(dat$fact)[as.numeric(dat$fact),]
colnames(cont) <- paste("fact",colnames(cont),sep="")
out <- cbind(1,dat$b,cont)
out[is.na(dat$fact),1] <- NA
colnames(out)[1:2]<- c("Intercept","b")
rownames(out) <- rownames(dat)
which gives:
> out
Intercept b fact2 fact3 fact4 fact5
1 1 0.2534288 0 0 0 0
2 1 0.2697760 0 0 0 0
3 NA -0.8236879 NA NA NA NA
4 1 -0.6053445 1 0 0 0
5 1 0.4608907 0 1 0 0
6 1 0.2534288 0 1 0 0
7 1 0.2697760 0 0 1 0
8 1 -0.8236879 0 0 1 0
9 1 -0.6053445 0 0 0 1
10 1 0.4608907 0 0 0 1
In any case, both methods can be incorporated in a function that can deal with more complex formulae; a rough sketch of the first one follows below. I leave the rest of the exercise to the reader (how I loathe that sentence when I meet it in a paper ;-) )
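A sketch of a wrapper around the rownames-matching method (rows dropped because of NAs simply come back as all-NA rows; re-filling individual columns, as done with b above, is left as a per-column step):
expand_model_matrix <- function(formula, data) {
  mm <- model.matrix(formula, data)
  out <- mm[match(rownames(data), rownames(mm)), , drop = FALSE]  # reinsert dropped rows as NA
  rownames(out) <- rownames(data)
  out
}
expand_model_matrix(ff, dat)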
