Stack every n-th column of data.frame - r

I have a large excel sheet which I need to import into R. My issue is that the sheet contains many small tables which are separated by empty rows and columns. The table titles are also separated from the table itself by an empty row.
I would like to have a list of all data.frames on the sheet which are correctly named.
Edit: I have removed all empty rows and columns, so I am now left with many columns in which a new table begins every fourth column.
results[1:8,1:10]
c01 c02 c03 c04 c05 c06
1 response of AUT_E3CI to AUT_E3CI shocks <NA> <NA> <NA> response of AUT_E3CI to AUT_HICP shocks <NA>
2 <NA> lower bound median upper bound <NA> lower bound
3 1.0 1.0 1.0 1.0 1.0 0.0
4 2.0 -0.0017003281610081986 0.06962694180009907 0.11535402387039681 2.0 0.0320783162263377
5 3.0 -0.01232670095304385 0.08512933823225599 0.22707701169618283 3.0 0.023014339564983166
6 4.0 -0.09369238952198494 0.0904834548010296 0.47546694990618454 4.0 -0.05144787826814039
7 5.0 -0.2301596729710727 0.03155987208641617 0.6502910254692547 5.0 -0.11117305002762731
8 6.0 -0.4024596506843293 0.10607314703381547 1.3257182896467359 6.0 -0.47648920768332415
Now I would like to rbind all columns into a single data.frame of four columns, that is, stack all the tables on top of each other in one long table of four columns. I have tried a while-loop which repeatedly rbinds the second half of the columns below the first half:
while (ncol(results) > 4) {
  results <- rbind(results[, 1:ncol(results)/2],
                   results[, (ncol(results)/2) + 1:ncol(results)])
}
However, this gives me the following error:
Error in `[.data.frame`(results, , (ncol(results)/2) + 1:ncol(results)) :
undefined columns selected
Does anyone know how to fix the loop or knows an alternative approach?

I think you're starting with something similar to this:
dat <- structure(list(V1 = 11:13, V2 = 21:23, V3 = 31:33, V4 = 41:43, V5 = 51:53, V6 = 61:63, V7 = 71:73, V8 = 81:83, V9 = 91:93), class = "data.frame", row.names = c(NA, -3L))
dat
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 11 21 31 41 51 61 71 81 91
# 2 12 22 32 42 52 62 72 82 92
# 3 13 23 33 43 53 63 73 83 93
where you want to take every set of (say) 3 columns and stack them. Try this:
datspl <- split.default(dat, (seq_along(dat) - 1) %/% 3)
datspl <- lapply(datspl, `colnames<-`, colnames(datspl[[1]]))
newdat <- do.call(rbind, datspl)
newdat
# V1 V2 V3
# 0.1 11 21 31
# 0.2 12 22 32
# 0.3 13 23 33
# 1.1 41 51 61
# 1.2 42 52 62
# 1.3 43 53 63
# 2.1 71 81 91
# 2.2 72 82 92
# 2.3 73 83 93
This assumes that the names from the first three columns are appropriate for all other sets of 3 columns.
(Replace my 3 with 4 if you're doing every 4 columns, everything else should just work.)
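Applied to the question's results (a sketch, assuming every table occupies exactly four columns and the title/header rows have already been cleaned out):
# Stack the question's `results` in blocks of four columns
spl <- split.default(results, (seq_along(results) - 1) %/% 4)
spl <- lapply(spl, `colnames<-`, colnames(spl[[1]]))
stacked <- do.call(rbind, spl)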

You could use reshape in R:
For example, using the data provided by r2evans, we could do:
n_cols <- 3
reshape(dat, varying = matrix(names(dat), n_cols), dir = 'long')
time V1 V2 V3 id
1.1 1 11 21 31 1
2.1 1 12 22 32 2
3.1 1 13 23 33 3
1.2 2 41 51 61 1
2.2 2 42 52 62 2
3.2 2 43 53 63 3
1.3 3 71 81 91 1
2.3 3 72 82 92 2
3.3 3 73 83 93 3
Then get rid of the id and time columns if you do not need them.
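For instance, continuing the example above (a minimal sketch):
n_cols <- 3
long <- reshape(dat, varying = matrix(names(dat), n_cols), dir = 'long')
long[c("id", "time")] <- NULL  # drop the bookkeeping columns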

Related

How can I use a loop in R to create new, labeled variables and write them into a .csv?

I have a table with eighty columns and I want to create new columns by multiplying var1*var41, var1*var42, ..., var1*var80, then var2*var41, var2*var42, ..., var2*var80, and so on. How could I write a loop to multiply the columns and write the labeled products into a .csv? The result should have 1600 additional columns.
I took a stab at this with some fake data:
# Fake data (arbitrary 5 rows)
mtx <- sample(1:100, 5 * 80, replace = T)
dim(mtx) <- c(5,80)
colnames(mtx) <- paste0("V", 1:ncol(mtx)) # Name the original columns
mtx[1:5,1:5]
# V1 V2 V3 V4 V5
#[1,] 8 10 69 84 92
#[2,] 59 34 36 96 86
#[3,] 51 26 78 63 8
#[4,] 74 93 73 70 49
#[5,] 62 30 20 43 9
Using a for loop, one might try something like this:
v <- expand.grid(1:40,41:80) # all combos
v[c(1:3,1598:1600),]
# Var1 Var2
#1 1 41
#2 2 41
#3 3 41
#1598 38 80
#1599 39 80
#1600 40 80
# Initialize matrix for multiplication results
newcols <- matrix(NA, nrow = nrow(mtx), ncol = nrow(v))
# Run the for loop
for(i in 1:nrow(v)) newcols[,i] <- mtx[,v[i,1]] * mtx[,v[i,2]]
# save the names as "V1xV41" format with apply over rows (Margin = 1)
# meaning, for each row in v, paste "V" in front and "x" between
colnames(newcols) <- apply(v, MARGIN = 1, function(eachv) paste0("V", eachv, collapse="x"))
# combine the additional 1600 columns
tocsv <- cbind(mtx, newcols)
tocsv[,78:83] # just to view old and new columns
# V78 V79 V80 V1xV41 V2xV41 V3xV41
#[1,] 17 92 13 429 741 1079
#[2,] 70 94 1 4836 4464 5115
#[3,] 6 77 93 3740 1020 3468
#[4,] 88 34 26 486 258 66
#[5,] 48 77 61 873 4365 970
# Write it
write.csv(tocsv, "C:/Users/Evan Friedland/Documents/NEWFILENAME.csv")
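As an aside, the for loop can be replaced by a single vectorized multiplication, since a matrix can be indexed by a whole vector of column numbers at once (a sketch reusing mtx and v from above):
# Vectorized alternative: multiply all 1600 column pairs in one step
newcols2 <- mtx[, v[, 1]] * mtx[, v[, 2]]
colnames(newcols2) <- paste0("V", v[, 1], "xV", v[, 2])
identical(unname(newcols2), unname(newcols)) # should be TRUE: same result as the loop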

Transpose and rearrange rows in a matrix

I have several files with the following structure:
data <- matrix(c(1:100000), nrow=1000, ncol=100)
The first 500 rows are X coordinates and the final 500 rows are Y coordinates of several object contours. Row # 1 (X) and row 501 (Y) correspond to coordinates of the same object. I need to:
transpose the whole matrix and arrange it so that row 1 becomes column 1 and row 501 becomes column 2, giving paired x, y coordinates in contiguous columns. Row 2 and row 502 should then follow in columns 1 and 2 below the data of the previous object.
ideally, have an extra column with filename info.
thanks.
Simpler version:
Transpose the matrix, then create a vector with the column indices and subset with them:
mat <- matrix(1:100, nrow = 10)
mat2 <- t(mat)
cols <- unlist(lapply(1:(ncol(mat2)/2), function(i) c(i, i + ncol(mat2)/2)))
mat3 <- mat2[,cols]
Then just make it a dataframe as below.
You can subset pairs of rows separated by nrow/2, make them a 2-column matrix and then cbind them all:
df <- as.data.frame(do.call(cbind, lapply(1:(nrow(mat)/2), function(i) {
  matrix(mat[c(i, nrow(mat)/2 + i), ], ncol = 2, byrow = TRUE)
})))
Then just add the new column as necessary, since it's now a dataframe:
df$fname <- sample(letters, nrow(df), TRUE)
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 fname
# 1 1 6 2 7 3 8 4 9 5 10 a
# 2 11 16 12 17 13 18 14 19 15 20 e
# 3 21 26 22 27 23 28 24 29 25 30 e
# 4 31 36 32 37 33 38 34 39 35 40 o
# 5 41 46 42 47 43 48 44 49 45 50 y
# 6 51 56 52 57 53 58 54 59 55 60 q
# 7 61 66 62 67 63 68 64 69 65 70 v
# 8 71 76 72 77 73 78 74 79 75 80 b
# 9 81 86 82 87 83 88 84 89 85 90 v
# 10 91 96 92 97 93 98 94 99 95 100 y
What about
n <- 500
df <- data.frame(col1 = data[1:n, ],
                 col2 = data[(nrow(data) - 500):nrow(data), ],
                 fileinfo = "this is the name of the file...")
Try David's answer, but this way:
n <- 500
df <- data.frame(col1 = data[1:n, ],
col2 = data[(nrow(data) - (n-1)):nrow(data), ],
fileinfo = "this is the name of the file...")
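If you need the filename column across several files, the same transpose-and-reorder logic can be wrapped in a loop over file names (a sketch; the file names and the read.table call are assumptions about how your files are stored):
files <- c("contours1.txt", "contours2.txt")  # hypothetical file names
all_df <- do.call(rbind, lapply(files, function(f) {
  mat2 <- t(as.matrix(read.table(f)))  # transpose: rows become columns
  n <- ncol(mat2) / 2                  # number of objects (X/Y pairs)
  cols <- unlist(lapply(1:n, function(i) c(i, i + n)))
  out <- as.data.frame(mat2[, cols])   # interleave X and Y columns
  out$fname <- f                       # tag every row with its source file
  out
}))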

Subset Duplicated Values >10

I am looking at a data frame and trying to subset rows that have the same pressure value for more than 5 rows, or delete rows that do not have 5 duplicate pressure values...
File Turbidity Pressure
1 3.2 46
2 3.4 46
3 5.4 46
4 3.2 46
5 3.1 46
6 2.3 46
7 2.3 45
8 4.5 45
9 2.3 45
10 3.2 44
11 4.5 44
12 6.5 43
13 3.2 42
14 3.1 41
15 1.2 41
16 2.3 41
17 2.4 41
18 2.1 41
19 1.4 41
25 1.3 41
So basically I am trying to keep the rows that have a pressure of 46 or 41 and delete the rows in between. This is a small portion of my dataset; I just need code that will keep rows with 5 or more duplicate pressure values and delete the others.
Try
library(dplyr)
df %>% group_by(Pressure) %>% filter(n() >= 5)
Which gives:
#Source: local data frame [13 x 3]
#Groups: Pressure
#
# File Turbidity Pressure
#1 1 3.2 46
#2 2 3.4 46
#3 3 5.4 46
#4 4 3.2 46
#5 5 3.1 46
#6 6 2.3 46
#7 14 3.1 41
#8 15 1.2 41
#9 16 2.3 41
#10 17 2.4 41
#11 18 2.1 41
#12 19 1.4 41
#13 25 1.3 41
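A variant of the same idea, if you prefer the count as an explicit column (a sketch, assuming a dplyr version that provides add_count):
df %>%
  add_count(Pressure) %>%  # adds a column n with each group's size
  filter(n >= 5) %>%
  select(-n)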
Here's a data.table solution (relies crucially on Pressure not repeating itself later on):
library(data.table)
setDT(df)[,if(.N>=5) .SD,by=Pressure]
Addendum:
If you expect Pressure values to repeat later on, e.g.:
df <- data.frame(File = c(1:19, 25:28),
                 Pressure = rep(c(46:41, 46), c(6, 3, 2, 1, 1, 7, 3)))
Then you'll need to use rleid in order to keep only groups of at least 5 in a row (no gaps):
setDT(df)[,ct:=rleid(Pressure)][,if (.N>=5) .SD,by=ct]
Here is a solution using base R:
df <- data.frame(id=1:10, Pressure=c(rep(1,5),6:10))
p.counts <- table(df[,"Pressure"])
good.pressures <- as.numeric(names(p.counts))[p.counts>=5]
df.sub <- df[df[,"Pressure"]%in%good.pressures,]
Note that I'm using df as an example data set, so you can delete that first line of code and replace all instances of df with the name of your data.frame.
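If, as in the data.table addendum above, Pressure values can repeat later on, a base R analogue of the rleid idea uses rle to keep only uninterrupted runs of at least 5 (a sketch, using the original question's df):
# Keep only rows belonging to a run of >= 5 consecutive identical Pressure values
r <- rle(df$Pressure)
keep <- rep(r$lengths >= 5, times = r$lengths)
df.sub <- df[keep, ]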

Computing normalized Euclidean distance in R

The data frame I have is as follows:
Binning_data[1:4,]
person_id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 312 74 80 NA 87 90.0 85 88 98 96.5 99 94 95 90 90 93 106
2 316 NA NA 116 106 105.0 110 102 105 105.0 102 98 101 98 92 89 91
3 318 71 61 61 61 60.5 68 62 67 64.0 60 59 60 62 59 63 63
4 319 64 NA 80 80 83.0 84 87 83 85.0 88 87 95 74 70 63 83
I would like to compute the Euclidean distance of a given 'index_person_id' (say 312) with all the other person_id while omitting all NAs.
For example: the normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1, V2, V3) because at least one of the two rows has NAs there. It should just compute the Euclidean distance from the 4th bin to the 16th bin and divide by 13 (the number of non-empty bins).
The dimension of Binning_Data is 10000 x 17.
The output file should be of size 10000 x 2, with the first column being the person_id and the second column being the 'normalized Euclidean distance'.
I am currently using sapply for this purpose:
index_person <- binning_data[which(binning_data$person_id == index_person_id), ]
non_empty_index_person <- which(is.na(index_person[2:ncol(index_person)]) == FALSE)
distance[, 2] <- sapply(seq_along(binning_data$person_id), function(j) {
  compare_person <- binning_data[j, ]
  non_empty_compare_person <- which(is.na(compare_person[2:ncol(compare_person)]) == FALSE)
  non_empty <- intersect(non_empty_index_person, non_empty_compare_person)
  distance_temp <- (index_person[non_empty + 1] - compare_person[non_empty + 1])^2
  as.numeric(mean(distance_temp))
})
This seems to take a considerable amount of time. Is there a better way to do this?
If I run your code I get:
0.0000 146.0192 890.9000 200.8750
If you convert your data frame into a matrix and transpose it, you can subtract columns and then use na.rm=TRUE in mean to get the distances you want. This can be done over all columns at once with colMeans. Here it is for row II of your sample data:
> II = 1
> m = t(as.matrix(binning_data[,-1]))
> colMeans((m - m[,II])^2, na.rm=TRUE)
1 2 3 4
0.0000 146.0192 890.9000 200.8750
Your 10000x2 matrix is then (where here 10000==4):
> cbind(II,colMeans((m - m[,II])^2, na.rm=TRUE))
II
1 1 0.0000
2 1 146.0192
3 1 890.9000
4 1 200.8750
If you want to compute this for a given list of indices, loop over it, perhaps like this with an lapply and an rbind putting it all back together again as a data frame:
II = c(1,2,1,4,4)
do.call(rbind,lapply(II, function(i){data.frame(i,d=colMeans((m-m[,i])^2,na.rm=TRUE))}))
i d
1 1 0.0000
2 1 146.0192
3 1 890.9000
4 1 200.8750
11 2 146.0192
21 2 0.0000
31 2 1595.0179
41 2 456.7143
12 1 0.0000
22 1 146.0192
32 1 890.9000
42 1 200.8750
13 4 200.8750
23 4 456.7143
33 4 420.8833
43 4 0.0000
14 4 200.8750
24 4 456.7143
34 4 420.8833
44 4 0.0000
That's a data frame with 4 rows (one per person) for each entry of II, i.e. 4 x length(II) rows.
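Putting this together for the requested 10000 x 2 output (a sketch, assuming binning_data as above, with person_id in the first column and 312 as the index person):
m <- t(as.matrix(binning_data[, -1]))
idx <- which(binning_data$person_id == 312)  # position of the index person
out <- data.frame(person_id = binning_data$person_id,
                  norm_dist = colMeans((m - m[, idx])^2, na.rm = TRUE))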

findInterval() with varying intervals in data.table R

I asked this question a long time ago but haven't found the answer yet. I do not know if this is legit on Stack Overflow, but I am reposting it.
I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.
Reproducible example:
library(data.table)
set.seed(100)
DT <- data.table(year=2000:2009, month=1:10, price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)
I have just tried:
merging the DT and intervals.dt data.tables by month/year,
creating a new intervalsstring column by pasting all the V* columns into one string (not very elegant, I admit), and finally
splitting it back into a vector so I can use it in findInterval(), but the solution does not work for every row (!)
So, after:
DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]
I get
> DT
year month price intervalsstring interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
which is correct for the first rows, but not for the last (or other) rows.
For example, for row 3380 the price ~64.19 should be in the 5th interval and not the 7th. I guess my mistake is that in my last command findInterval relies only on the first row of intervalsstring.
Thank you!
Your main problem is that you just didn't do findInterval for each group. But I also don't see the point of making that large merged data.table, or the paste/strsplit business. This is what I would do:
DT[, interval := findInterval(price,
                              intervals.dt[.BY][, V1:V10]),
   by = .(year, month)][]
# year month price interval
# 1: 2000 1 30.776611 2
# 2: 2000 1 62.499648 6
# 3: 2000 1 53.581115 6
# 4: 2000 1 48.830599 5
# 5: 2000 1 33.066053 2
# ---
#3376: 2009 10 33.635924 1
#3377: 2009 10 38.993769 1
#3378: 2009 10 75.065820 7
#3379: 2009 10 6.277403 0
#3380: 2009 10 64.189162 5
Note that intervals.dt[.BY] is a keyed subset.
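For completeness, the OP's own strsplit attempt can also be repaired by making the call group-wise, so that each group uses its own breaks (a sketch on the merged DT with the intervalsstring column from the question):
# Group-wise fix: within each group, strsplit(...)[[1]] now picks that
# group's own breaks instead of always the first row's
DT[, interval := findInterval(price,
                              as.numeric(strsplit(intervalsstring[1], " ")[[1]])),
   by = intervalsstring]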
