Turn different-sized rows into columns - r

I am reading in a data file with many rows, which can have different lengths, like so:
dataFile <- read.table("file.txt", as.is=TRUE);
The rows can be as follows:
1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2
I need the rows to be transformed into columns. I'll be then using the columns for a violin plot like so:
names(dataCol)[1] <- "x";
jpeg("violinplot.jpg", width = 1000, height = 1000);
do.call(vioplot,c(dataCol,))
dev.off()
I'm assuming there will be an empty string/placeholder for any column with fewer entries than the column with the maximum number of entries. How can it be done?

Use the fill = TRUE argument in read.table. Then to change rows to columns, use t to transpose. Using your data this would look like...
df <- read.table( text = "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2
" , header = FALSE , fill = TRUE )
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1 1 5 2 6 2 1 NA NA NA
#2 2 6 24 NA NA NA NA NA NA
#3 2 6 1 5 2 7 982 24 6
#4 25 2 NA NA NA NA NA NA NA
t(df)
# [,1] [,2] [,3] [,4]
#V1 1 2 2 25
#V2 5 6 6 2
#V3 2 24 1 NA
#V4 6 NA 5 NA
#V5 2 NA 2 NA
#V6 1 NA 7 NA
#V7 NA NA 982 NA
#V8 NA NA 24 NA
#V9 NA NA 6 NA
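One follow-up step the question hints at: the transposed columns keep their NA padding, and most plotting functions (including vioplot) won't accept NAs, so each column should be stripped back to a ragged vector first. A hedged sketch; the plotting call itself is left commented since it needs the vioplot package installed:

```r
df <- read.table(text = "1 5 2 6 2 1
2 6 24
2 6 1 5 2 7 982 24 6
25 2", header = FALSE, fill = TRUE)

# Transpose, then strip the NA padding column by column, leaving one
# NA-free vector per original row
cols <- lapply(as.data.frame(t(df)), function(x) x[!is.na(x)])
lengths(cols)  # 6 3 9 2

# A hypothetical plotting call would then be:
# do.call(vioplot::vioplot, unname(cols))
```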

EDIT: apparently read.table has a fill=TRUE option, which is WAYYYY easier than my answer.
I've never used vioplot before, and that seems like a weird way to make a function call (instead of something like vioplot(dataCol)), but I have worked with ragged arrays before, so I'll try that.
Have you read the data in yet? That tends to be the hardest part. The code below reads the above data from a file called temp.txt into a matrix called out2.
file <- "temp.txt"
dat <- readChar(file, file.info(file)$size)   # whole file as one string
split1 <- strsplit(dat, "\n")                 # split into lines
split2 <- strsplit(split1[[1]], " ")          # split each line into fields
n <- max(lengths(split2))                     # length of the longest row
tFun <- function(i){
  vect <- as.numeric(split2[[i]])
  length(vect) <- n                           # extending pads with NA
  vect
}
out2 <- sapply(seq_along(split2), tFun)
I'll try and explain what I've done: the first step is to read in every character via readChar. You then split the lines, then the elements within each line, to get the list split2, where each element of the list is a row of the input file.
From there you pad each row with NA out to the length of the longest one (extending a vector with length<- fills the new positions with NA), and sapply assembles the padded vectors into the matrix out2, one column per input row.
It's not pretty, but it works!
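A shorter route to the same matrix, for what it's worth: readLines already splits the file into lines, so the readChar step can be skipped entirely. A sketch under the same assumption of a space-separated file called temp.txt (recreated here so the snippet is self-contained):

```r
# Recreate the sample file so the sketch runs on its own
writeLines(c("1 5 2 6 2 1", "2 6 24",
             "2 6 1 5 2 7 982 24 6", "25 2"), "temp.txt")

lines  <- readLines("temp.txt")   # one string per row
split2 <- strsplit(lines, " ")    # fields within each row
n      <- max(lengths(split2))    # longest row

# Extending a vector's length pads it with NA, giving equal-length columns
out2 <- sapply(split2, function(v) {
  v <- as.numeric(v)
  length(v) <- n
  v
})
```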

Related

Can I iterate through a vector of variable names in an R For Loop?

Background:
I'm building a complex function which, in part, creates groups of heterogeneous size and assigns homogeneous values to the members of each group. To make writing the larger, more complex function simpler, I built this small function, which does what I described and works exactly as I need it to. To demonstrate its effects:
## Chunky4Loop.R | v1.0 - 2022.01.01
# A function that allows us to take a small number of values from an array of any size
# and paste them into a dataframe in chunks, which may or may not be equal in size
Chunky4Loop <- function(chunk,    # how many chunks to iterate through; should equal length(quantity) and length(input)
                        quantity, # the size of (number of rows in) each chunk
                        input) {  # the value each chunk should take
  LoopTracker <- 0
  output <- NA
  for (i in 1:chunk){
    if (i == 1){
      output[1:quantity[i]] <- rep(input[i], quantity[i])
      LoopTracker <- LoopTracker + quantity[i]
    }
    if (i > 1){
      output[(LoopTracker + 1):(LoopTracker + quantity[i])] <- rep(input[i], quantity[i])
      LoopTracker <- LoopTracker + quantity[i]
    }
  }
  rm(LoopTracker, i)
  return(output)
}
nGroups <- 3
nPeople <- c(7,3,1)
GroupName <- c("A","B","C")
rows <- 1:sum(nPeople)
cols <- c("Group", "ATK", "DMG")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# Group ATK DMG
#1 NA NA NA
#2 NA NA NA
#3 NA NA NA
#4 NA NA NA
#5 NA NA NA
#6 NA NA NA
#7 NA NA NA
#8 NA NA NA
#9 NA NA NA
#10 NA NA NA
#11 NA NA NA
df$Group <- Chunky4Loop(chunk = nGroups,
quantity = nPeople,
input = GroupName)
df
# Group ATK DMG
#1 A NA NA
#2 A NA NA
#3 A NA NA
#4 A NA NA
#5 A NA NA
#6 A NA NA
#7 A NA NA
#8 B NA NA
#9 B NA NA
#10 B NA NA
#11 C NA NA
However, I have to run it over many variables, which is doable. I'd like to simplify it even further, though, by creating a vector of input variables, output columns, and then entering these two meta-variables into a For Loop like this:
# I specify the original values contained within the arrays
ATK <- c(7,3,1)
DMG <- c(7,3,1)
# I create an array of the input variables I just specified ...
inputs <- c('ATK', 'DMG')
# Then I create an array of columns to output to ...
outputs <- c('df$ATK', 'df$DMG')
# Then I run the Chunky4Loop function to keep this all compacted
for (i in 1:length(inputs)){
  outputs[i] <- Chunky4Loop(chunk = nGroups,
                            quantity = nPeople,
                            input = inputs[i])
}
I would ideally get this:
Group ATK DMG
1 A 7 7
2 A 7 7
3 A 7 7
4 A 7 7
5 A 7 7
6 A 7 7
7 A 7 7
8 B 3 3
9 B 3 3
10 B 3 3
11 C 1 1
But R will not recognize these inputs or outputs, and instead produces the warning:
In outputs[i] <- Chunky4Loop(chunk = nGroups, quantity = nPeople, :
number of items to replace is not a multiple of replacement length
Can I make this work?
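A hedged sketch of one way to make this work: keep plain column names in the vector (not "df$..." strings), fetch each input vector by name with get(), and assign into the matching column with [[. To keep the sketch self-contained it uses rep(x, times = ...), which reproduces Chunky4Loop's chunk-wise fill, rather than redefining the function from the question:

```r
nGroups <- 3
nPeople <- c(7, 3, 1)
df <- data.frame(Group = rep(c("A", "B", "C"), times = nPeople),
                 ATK = NA, DMG = NA)

ATK <- c(7, 3, 1)
DMG <- c(7, 3, 1)
inputs <- c("ATK", "DMG")   # plain names, not "df$ATK"

for (nm in inputs) {
  # get(nm) looks up the vector named nm; df[[nm]] writes the column of
  # the same name, so no separate 'outputs' vector of strings is needed
  df[[nm]] <- rep(get(nm), times = nPeople)
}
df
```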

Appending csvs with different column quantities and spellings

Nothing too complicated: it would just be useful to use rbindlist on a large number of csvs where the column names change a little over time (minor spelling changes), the column order stays the same, and at some point two additional columns (which I don't really need) are added.
library(data.table)
csv1 <- data.table("apple" = 1:3, "orange" = 2:4, "dragonfruit" = 13:15)
csv2 <- data.table("appole" = 7:9, "orangina" = 6:8, "dragonificfruit" = 2:4, "pear" = 1:3)
l <- list(csv1, csv2)
When I run
csv_append <- rbindlist(l, fill=TRUE) #which also forces use.names=TRUE
it gives me a data.table with 7 columns
apple orange dragonfruit appole orangina dragonificfruit pear
1: 1 2 13 NA NA NA NA
2: 2 3 14 NA NA NA NA
3: 3 4 15 NA NA NA NA
4: NA NA NA 7 6 2 1
5: NA NA NA 8 7 3 2
6: NA NA NA 9 8 4 3
as opposed to what I want, which is:
V1 V2 V3 V4
1: 1 2 13 NA
2: 2 3 14 NA
3: 3 4 15 NA
4: 7 6 2 1
5: 8 7 3 2
6: 9 8 4 3
which I can use, even though I have to go through the extra step later of renaming the columns back to standard variable names.
If I instead try the default fill=FALSE and use.names=FALSE, it throws an error:
Error in rbindlist(l) :
Item 2 has 4 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
Is there a simple way to manage this, either by forcing fill=TRUE and use.names=FALSE somehow or by omitting the additional columns in the csvs that have them by specifying a vector of columns to append?
If we only need the first 3 columns, then drop the rest and bind as usual:
rbindlist(lapply(l, function(i) i[, 1:3]))
# apple orange dragonfruit
# 1: 1 2 13
# 2: 2 3 14
# 3: 3 4 15
# 4: 7 6 2
# 5: 8 7 3
# 6: 9 8 4
Another option, from the comments: read the files directly with fread, using select to keep only the first 3 columns, then bind:
rbindlist(lapply(filenames, fread, select = c(1:3)))
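If the goal really is the positional V1..V4 result shown in the question (keeping the extra column), one hedged workaround is to rename every table's columns positionally before binding, which makes the use.names matching a no-op:

```r
library(data.table)

csv1 <- data.table(apple = 1:3, orange = 2:4, dragonfruit = 13:15)
csv2 <- data.table(appole = 7:9, orangina = 6:8,
                   dragonificfruit = 2:4, pear = 1:3)
l <- list(csv1, csv2)

# copy() avoids renaming the originals by reference; paste0("V", seq_along(d))
# gives V1, V2, ... per table, so fill = TRUE only has to pad the missing V4
csv_append <- rbindlist(
  lapply(l, function(d) setnames(copy(d), paste0("V", seq_along(d)))),
  fill = TRUE
)
csv_append
```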
Here is an option with name matching, using phonetic from stringdist: extract the column names from the list of data.tables ('nmlist'), unlist them, group the names with phonetic and take the first name in each group, relist to the same structure as 'nmlist', use Map with setnames to rename each data.table, and then apply rbindlist.
library(stringdist)
library(data.table)
nmlist <- lapply(l, names)
nm1 <- unlist(nmlist)
rbindlist(Map(setnames, l, relist(ave(nm1, phonetic(nm1),
FUN = function(x) x[1]), skeleton = nmlist)), fill = TRUE)
Output:
# apple orange dragonfruit pear
#1: 1 2 13 NA
#2: 2 3 14 NA
#3: 3 4 15 NA
#4: 7 6 2 1
#5: 8 7 3 2
#6: 9 8 4 3

Data.table: rbind a list of data tables with unequal columns [duplicate]

This question already has answers here:
rbindlist data.tables with different number of columns
(1 answer)
Rbind with new columns and data.table
(5 answers)
Closed 4 years ago.
I have a list of data tables that are of unequal lengths. Some of the data tables have 35 columns and others have 36.
I have this line of code, but it generates an error
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify that so that it works properly?
Here's a minimal example of what you are trying to do.
No need to use any other package to do this. Just set fill=TRUE in rbindlist.
You can do this:
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
df3 <- rbindlist(list(df1, df2), fill = TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
If I understood your question correctly, there are really only two options for appending your data tables.
Option A: drop the extra variable from one of the datasets:
table$column_Name <- NULL
Option B: create the variable, filled with missing values, in the incomplete dataset:
table$column_Name <- NA
Then rbind as usual.
Try to use rbind.fill from package plyr:
Input data, 3 dataframes with different number of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

R- Perform operations on column and place result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns of data (L1, L2, L3) and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc., i.e. combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking of using match to find the appropriate original columns, with a for loop iterating over all of the columns in this search. So if the column I am attempting to fill is L1+L2, I would have something like:
apply(dataframe[, c(i, j)], 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text=...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., target_dfrm) form should be any safer, though.
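If eval(parse()) feels too loose, a hedged alternative is to split each new column name into its two operands and its operator and dispatch via get(), which only ever evaluates a known arithmetic function. This assumes the names are always of the form "<col><op><col>" with a single + or -:

```r
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)), c("L1", "L2", "L3"))
newcols <- c("L1+L2", "L1-L3")

for (nm in newcols) {
  op   <- sub("^.*([+-]).*$", "\\1", nm)  # the operator character
  cols <- strsplit(nm, "[+-]")[[1]]       # the two operand column names
  # get("+") / get("-") return the arithmetic functions themselves
  mydf[[nm]] <- get(op)(mydf[[cols[1]]], mydf[[cols[2]]])
}
mydf
```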

How does one merge dataframes by row name without adding a "Row.names" column?

If I have two data frames, such as:
df1 = data.frame(x=1:3,y=1:3,row.names=c('r1','r2','r3'))
df2 = data.frame(z=5:7,row.names=c('r5','r6','r7'))
which print as:
R> df1
   x y
r1 1 1
r2 2 2
r3 3 3
R> df2
   z
r5 5
r6 6
r7 7
I'd like to merge them by row names, keeping everything (so an outer join, or all = TRUE). This does it:
merged.df <- merge(df1,df2,all=T,by='row.names')
R> merged.df
Row.names x y z
1 r1 1 1 NA
2 r2 2 2 NA
3 r3 3 3 NA
4 r5 NA NA 5
5 r6 NA NA 6
6 r7 NA NA 7
but I want the input row names to be the row names in the output dataframe (merged.df).
I can do:
rownames(merged.df) <- merged.df[[1]]
merged.df <- merged.df[-1]
which works, but seems inelegant and hard to remember. Anyone know of a cleaner way?
Not sure if it's any easier to remember, but you can do it all in one step using transform.
transform(merge(df1,df2,by=0,all=TRUE), row.names=Row.names, Row.names=NULL)
# x y z
#r1 1 1 NA
#r2 2 2 NA
#r3 3 3 NA
#r5 NA NA 5
#r6 NA NA 6
#r7 NA NA 7
From the help of merge:
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
So it is clear that you can't avoid the Row.names column, at least when using merge. But to remove this column you can subset by name rather than by index. For example:
dd <- merge(df1, df2, by = 0, all = TRUE)  # by = 0 is easier to write than by = 'row.names',
                                           # and TRUE is cleaner than T
Then I use the Row.names column to restore the row names and drop it, like this:
res <- subset(dd, select = -c(Row.names))
rownames(res) <- dd[, 'Row.names']
res
    x  y  z
r1  1  1 NA
r2  2  2 NA
r3  3  3 NA
r5 NA NA  5
r6 NA NA  6
r7 NA NA  7
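For repeated use, the whole pattern can be wrapped in a tiny helper; the name merge_rn is made up here, and the function is just a sketch of the idiom above:

```r
# Hypothetical helper: outer-join two data frames on row names, keeping
# the names as row names instead of as a Row.names column
merge_rn <- function(x, y, ...) {
  m <- merge(x, y, by = 0, all = TRUE, ...)
  rownames(m) <- m$Row.names
  m$Row.names <- NULL
  m
}

df1 <- data.frame(x = 1:3, y = 1:3, row.names = c("r1", "r2", "r3"))
df2 <- data.frame(z = 5:7, row.names = c("r5", "r6", "r7"))
merge_rn(df1, df2)
```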
