creating a list of dataframes - r

I have a dataframe that looks like this:
a <- as.data.frame(t(matrix(c('gr1','','','','gr2','','','','','gr3','','',
rep(1,12),rep(2,12)),ncol=3)))
a looks like:
 V1 V2 V3 V4  V5 V6 V7 V8 V9 V10 V11 V12
gr1          gr2             gr3
  1  1  1  1   1  1  1  1  1   1   1   1
  2  2  2  2   2  2  2  2  2   2   2   2
Columns V1-V4 belong to gr1, V5-V9 to gr2, and V10-V12 to gr3.
I would like to separate these groups (gr1-gr3) and their corresponding columns and put them all in a list that I can loop over later for some analysis. So the desired output is:
list1 = (gr1,gr2,gr3), where each of gr1, gr2, and gr3 is a dataframe with its corresponding columns.

We create a grouping variable ('grp') that increments at every non-blank ('') element of the first row. We then split the column names of 'a' by 'grp' into a list, use lapply to subset the matching columns while dropping the first row, and finally name the elements of 'lst' with the 'gr' values extracted from the first row of 'a'.
grp <- cumsum(as.character(unlist(a[1,]))!='')
# drop=FALSE keeps a group with a single column as a data.frame
lst <- lapply(split(names(a), grp), function(nm) a[-1, nm, drop=FALSE])
nm1 <- as.character(unlist(a[1,]))
names(lst) <- nm1[nzchar(nm1)]
NOTE: The columns in 'a' are factor class (character in R >= 4.0.0, where stringsAsFactors defaults to FALSE) because the second header ('gr') sits in the first row. To convert the columns of each data.frame in 'lst' to numeric:
lst <- lapply(lst, function(x) {
  x[] <- lapply(x, function(.x) as.numeric(as.character(.x)))
  x})
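The steps above can be combined into one self-contained sketch. It assumes the data is read with stringsAsFactors=FALSE so the columns are character rather than factor; the name 'hdr' is introduced here only for illustration.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps columns as character
a <- as.data.frame(t(matrix(c("gr1", "", "", "", "gr2", "", "", "", "", "gr3", "", "",
                              rep(1, 12), rep(2, 12)), ncol = 3)),
                   stringsAsFactors = FALSE)

# First row holds the group labels; a non-blank label starts a new group
hdr <- unlist(a[1, ], use.names = FALSE)
grp <- cumsum(nzchar(hdr))

# One data.frame per group, header row dropped, values converted to numeric
lst <- lapply(split(names(a), grp), function(nm) {
  out <- a[-1, nm, drop = FALSE]
  out[] <- lapply(out, as.numeric)
  out
})
names(lst) <- hdr[nzchar(hdr)]

# Loop over the groups, e.g. to count the columns in each
sapply(lst, ncol)
```

Each element of 'lst' can then be passed to lapply or a for loop for the later analysis.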


How do I merge two lists of mostly differing dataframes and bind the rows of those dataframes with the same name in R?

I have two different lists of dataframes. Some of the dataframes in the two lists have the same name, and others don't. When I merge the two lists, I need the dataframes with the same name to be merged rbind-style, and the ones that are unique to either list to remain as separate dataframes tacked on to the newly created merged list of dataframes.
list1 is likely to have more dataframes and more rows per dataframe than list2, since it will be the cumulatively bound list built up by a loop. list2 is the new result of each loop iteration, to be added to the cumulative list1.
Mock Example:
mydf1 <- data.frame(V1=1, V2=rep("A", 4))
mydf2 <- data.frame(V1=1, V2=rep("B", 3))
mydf3 <- data.frame(V1=1, V2=rep("C", 2))
mydf4 <- data.frame(V1=2, V2="A")
mydf5 <- data.frame(V1=3, V2="C")
mydf6 <- data.frame(V1=4, V2="D")
mydf7 <- data.frame(V1=7, V2="E")
list1 <- list(AA=mydf1, BB=mydf2, CC=mydf3)
list2 <- list(AA=mydf4, CC=mydf5, DD=mydf6, EE=mydf7)
Expected result:
$AA
V1 V2
1 1 A
2 1 A
3 1 A
4 1 A
1 2 A
$BB
V1 V2
1 1 B
2 1 B
3 1 B
$CC
V1 V2
1 1 C
2 1 C
1 3 C
$DD
V1 V2
1 4 D
$EE
V1 V2
1 7 E
I have tried the solutions here, but have not been able to get them to work properly.
This solution doesn't pair up the right dataframes and creates other odd combinations.
(m <- match(names(list2), names(list1), nomatch = 0L))
# [1] 1 1 2
Map(rbind, list1[m], list2)
And this one appears never to rbind the dataframes with the same names; all the dataframes just keep 1 row.
stackMe <- function(x) {
  a <- eval.parent(quote(names(X)))[substitute(x)[[3]]]
  rbind(list1[[a]], x)
}
lapply(list2, stackMe)
How can I merge two lists of dataframes so that dataframes with the same name have their rows appended (rbind), and the other, unique dataframes are just tacked on to the list?
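We can index list1 by the names of list2 and use Map. For names that are missing from list1 the index returns NULL, and rbind(NULL, df) is just df, so the new dataframes are simply added to the list.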
list1[names(list2)] <- Map(rbind, list1[names(list2)], list2)
list1
#$AA
# V1 V2
#1 1 A
#2 1 A
#3 1 A
#4 1 A
#5 2 A
#$BB
# V1 V2
#1 1 B
#2 1 B
#3 1 B
#$CC
# V1 V2
#1 1 C
#2 1 C
#3 3 C
#$DD
# V1 V2
#1 4 D
#$EE
# V1 V2
#1 7 E
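Since the question describes a cumulative list built up inside a loop, the same indexing idea can be wrapped in a small function. This is only a sketch: 'merge_df_lists' is a hypothetical helper name, and the loop over two batches stands in for the real loop.

```r
# Hypothetical helper (not from the original answer): append `new` onto `acc` by name
merge_df_lists <- function(acc, new) {
  # Names missing from `acc` index as NULL, and rbind(NULL, df) returns df
  acc[names(new)] <- Map(rbind, acc[names(new)], new)
  acc
}

list1 <- list(AA = data.frame(V1 = 1, V2 = rep("A", 4)),
              BB = data.frame(V1 = 1, V2 = rep("B", 3)))
list2 <- list(AA = data.frame(V1 = 2, V2 = "A"),
              DD = data.frame(V1 = 4, V2 = "D"))

# Simulate a loop that produces one new list of dataframes per iteration
acc <- list()
for (batch in list(list1, list2)) {
  acc <- merge_df_lists(acc, batch)
}
sapply(acc, nrow)
```

Starting from an empty list means the first iteration needs no special case.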

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed into:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the differences. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
  df2[,i] <- df[,i] - df[,i-1]
}
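For a frame with around 150 columns, the same de-accumulation can also be written without apply or an explicit loop by subtracting two shifted blocks of columns. This is a sketch that assumes every column other than 'id' is a cumulative numeric column; 'vars' and 'k' are names introduced here.

```r
x <- read.table(header = TRUE, text = "
id v1 v2 v3
1 4 5 9
2 1 1 4")

# All cumulative columns, in order (assumption: everything except 'id')
vars <- setdiff(names(x), "id")
k <- length(vars)

# Columns 2..k minus columns 1..(k-1), computed from the original values
x[vars[-1]] <- x[vars[-1]] - x[vars[-k]]
x
```

The right-hand side is evaluated in full before the assignment, so every difference uses the original cumulative values.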

Finding factors that correspond to more than one values

Suppose one has the following dataframe:
x=data.frame(c(1,1,2,2,2,3),c("A","A","B","B","B","B"))
names(x)=c("v1","v2")
x
v1 v2
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 3 B
In this dataframe, I want each label in v2 to correspond to a single value in v1. However, as one can see in this example, B corresponds to more than one value.
Is there an elegant and fast way to find which labels in v2 correspond to more than one value in v1?
Ideally, the result should show the values, which in our example should be c(2,3), as well as the row numbers, which in our example should be r=c(5,6).
Assuming that we want the row index of the last occurrence of each unique element of 'v1' within each 'v2' group, restricted to groups that have more than one unique element, we create a logical index with ave and use it to subset the rows of 'x'.
i1 <- with(x, ave(v1, v2, FUN = function(x)
    length(unique(x))>1 & !duplicated(x, fromLast=TRUE)))!=0
x[i1,]
# v1 v2
#5 2 B
#6 3 B
Or a faster option is data.table
library(data.table)
i1 <- setDT(x)[, .I[uniqueN(v1)>1 & !duplicated(v1, fromLast=TRUE)], v2]$V1
x[i1, 'v1', with = FALSE][, rn := i1][]
# v1 rn
#1: 2 5
#2: 3 6
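The same result can be reached in two explicit base R steps, which may be easier to follow: first count the distinct 'v1' values per label, then pick the last row of each (v2, v1) pair for the labels that have more than one. This is a sketch, and 'n_distinct', 'ambiguous' and 'rows' are names introduced here.

```r
x <- data.frame(v1 = c(1, 1, 2, 2, 2, 3),
                v2 = c("A", "A", "B", "B", "B", "B"))

# Number of distinct v1 values for each label in v2
n_distinct <- tapply(x$v1, x$v2, function(v) length(unique(v)))

# Labels that map to more than one value
ambiguous <- names(n_distinct)[n_distinct > 1]

# Last row of each (v2, v1) combination, restricted to the ambiguous labels
rows <- which(x$v2 %in% ambiguous &
              !duplicated(x[c("v2", "v1")], fromLast = TRUE))

ambiguous
rows
x$v1[rows]
```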

taking the sum of a TRUE/FALSE vector in r

I am analyzing SNP data for a fungus, and I am trying to impute the missing data by changing the Ns to the genotype of the more frequent allele (see below).
newdata is a matrix of my SNPs (rows) and fungal isolates (columns). The genotypes for each SNP are in the 0, 1, and N format, where N is a missing genotype, and that is why I am trying to impute the missing genotypes.
newdata_imputed=newdata
for (k in 1:nrow(newdata)){
  u=newdata[k,]
  x<-sum(u==0)
  y<-sum(u==1)
  all_freq=y/(x+y)
  if (all_freq<0.5){
    newdata_imputed[k,]=gsub("N",0,u)
  } else{newdata_imputed[k,]=gsub("N",1,u)}
  print(k)
}
However, I keep getting this error:
[1] 295
[1] 296
Error in if (all_freq < 0.5) { : missing value where TRUE/FALSE needed
The code clearly runs but stops after encountering a problem. Can someone please tell me what I am doing wrong? I am a newbie to R, and any advice would be greatly appreciated.
@akrun, the reason I used a for loop is that it is nested in another for loop. After using your code:
newdata=as.data.frame(newdata)
u=newdata
all_freq <- rowSums(u==1)/rowSums((u==1)|(u==0))
indx <- all_freq < 0.5
indx1 <- indx & !is.na(indx)
indx2 <- !indx & !is.na(indx)
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
newdata[] <- lapply(newdata, as.numeric)
I got weird values
newdata[1:10,1:10]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3
4 1 1 1 1 1 1 1 1 1 1
Where is the "3" coming from? I should only have 0 or 1.
We could do this using rowSums. As @bergant and @MatthewLundberg mentioned in the comments, if there are rows with no 0 or 1 elements, the calculation divides 0 by 0 and returns NaN, which is what makes the if condition fail. One way around this is to extend the logical condition with !is.na, i.e. keep only the elements that are not NA in addition to the previous condition.
#using `rowSums` to create the all_freq vector
all_freq <- rowSums(newdata==1)/rowSums((newdata==1)|(newdata==0))
#Create a logical index based on elements that are less than 0.5
indx <- all_freq < 0.5
#The NA elements can be changed to FALSE by adding another condition
indx1 <- indx & !is.na(indx)
#similarly for elements that are >= 0.5
indx2 <- !indx & !is.na(indx)
Now, we subset the rows of the 'newdata' with 'indx1', loop through the columns (lapply) and use gsub with pattern and replacement arguments and assign the output back to the subset of 'newdata'.
newdata[indx1,] <- lapply(newdata[indx1,], gsub, pattern='N', replacement=0)
Similarly, we can do the replacement for the rows where 'all_freq' is greater than or equal to 0.5
newdata[indx2,] <- lapply(newdata[indx2,], gsub, pattern='N', replacement=1)
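The gsub output columns are character class, which can be converted back to numeric (if needed). Note that if the columns are factors instead, as.numeric returns the integer level codes (1, 2, 3, ...) rather than the original values, so convert with as.numeric(as.character(.)) in that case.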
newdata[] <- lapply(newdata, as.numeric)
data
set.seed(24)
newdata <- as.data.frame(matrix(sample(c(0:1, "N"), 10*4, replace=TRUE),
ncol=4), stringsAsFactors=FALSE)
newdata[7,] <- 2
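If the data stays a character matrix, as in the question, the same imputation can be sketched without gsub by filling the "N" cells directly. 'impute_major' is a hypothetical helper name; rows with no 0/1 calls (where the frequency is NaN) are left untouched, which is an assumption about the desired behavior.

```r
# Hypothetical helper: replace "N" in each row with that row's more frequent genotype
impute_major <- function(m) {
  n0 <- rowSums(m == "0")
  n1 <- rowSums(m == "1")
  freq1 <- n1 / (n0 + n1)                  # NaN when a row has no 0/1 calls
  fill <- ifelse(freq1 < 0.5, "0", "1")    # NA for the NaN rows
  ok <- !is.na(freq1)
  # Cells to fill: "N" cells in rows whose frequency could be computed
  is_n <- m == "N" & ok[row(m)]
  m[is_n] <- fill[row(m)[is_n]]
  m
}

m <- matrix(c("0", "0", "N",
              "1", "N", "1",
              "N", "N", "N"), nrow = 3, byrow = TRUE)
out <- impute_major(m)
out
```

Ties (a frequency of exactly 0.5) are filled with "1", matching the else branch of the original loop.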

writing character array to a table

I want to transpose the output given by the last command and write it to a data.frame. I want that dataframe to have 2 columns: the first column will hold the column names and the second column will hold the data type of each column, one row per column. How could I achieve this? I tried a variety of things but didn't get what I am looking for.
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
smoke <- as.data.frame(smoke)
table1=sapply (smoke, class)
table1
You could also skip the table1 part and go straight from smoke to the desired result.
> data.frame(nm = names(smoke), cl = sapply(unname(smoke), class))
# nm cl
# 1 V1 numeric
# 2 V2 numeric
# 3 V3 numeric
You could try this:
data.frame(var.name = names(table1), var.class = table1, row.names=NULL)
# var.name var.class
#1 V1 numeric
#2 V2 numeric
#3 V3 numeric
You might be looking for the melt command.
library(reshape2)
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
smoke <- as.data.frame(smoke)
table1 <- sapply (smoke, class)
smoke.melt <- melt(smoke)
levels(smoke.melt$variable) <- table1
> smoke.melt
variable value
1 numeric 51
2 numeric 92
3 numeric 68
4 numeric 43
5 numeric 28
6 numeric 22
7 numeric 22
8 numeric 21
9 numeric 9
Just convert table1 to data.frame and adjust:
dd = data.frame(table1)
dd
table1
V1 numeric
V2 numeric
V3 numeric
dd$VarName = rownames(dd)
dd
table1 VarName
V1 numeric V1
V2 numeric V2
V3 numeric V3
dd = dd[,c(2,1)]
dd
VarName table1
V1 V1 numeric
V2 V2 numeric
V3 V3 numeric
names(dd)[2] = "type"
dd
VarName type
V1 V1 numeric
V2 V2 numeric
V3 V3 numeric
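The answers above rely on every column having exactly one class. Some columns (date-times, for example) report more than one class, in which case sapply(smoke, class) returns a list rather than a character vector. A sketch that handles this by collapsing the classes into one string is shown below; 'col_types' is a hypothetical helper name and the 'when' column is added only for illustration.

```r
# Hypothetical helper: one row per column, with all classes collapsed into one string
col_types <- function(df) {
  data.frame(
    name = names(df),
    type = vapply(df, function(col) paste(class(col), collapse = "/"), character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

smoke <- as.data.frame(matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9),
                              ncol = 3, byrow = TRUE))
smoke$when <- as.POSIXct("2020-01-01", tz = "UTC")  # a column with two classes

res <- col_types(smoke)
res
```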
