Selecting column sequences and creating variables - r

I was wondering if there is a way to select specific columns via a sequence and create new variables from them.
For example, if I had 8 columns with n observations, how could I create 4 variables that each select 2 columns sequentially? My dataset is much larger than this: it has 1416 variables with 62 observations each (I have pasted a link to the spreadsheet below; the first column and row hold names). I would like to create new data frames from this, named as sites 1-12. So site 1 = df[,1:117]; site 2 = df[,119:237] etc.
I am planning on using this code for future datasets with even more variables, so some form of loop or sequence function would be very effective. Could anyone shed some light on how to achieve this?
https://www.dropbox.com/s/p1a5cu567lxntmw/MyData.csv?dl=0
Thank you in advance.
James
P.S. @nrussell, I have copied and pasted the output of the code you mentioned below; it continues as a series of numbers like those displayed.
dput(z[ , 1:10])
structure(list(1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0.0311410340342049,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0.0207444023791158, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0312971643732546,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0376287494579976, 0, 0, 0, 0, 0,
0, 0),......... 10 = c(0, 0, 0, 0, 0.119280313679916,
0, 0, 0.301029995663981, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.715681882079494,
0.136831816210901, 0, 0, 0, 0.0273663632421801, 0, 0, 0, 0.0547327264843602,
0, 0, 0, 0, 0.0231561535126139, 0, 0, 0.0903089986991944, 0,
0, 0.0752574989159953, 0.159368821233872, 0.0272640716982664,
0.0177076468037636, 0, 0, 0.120411998265592, 0, 0, 0, 0, 0.0322532138211408,
0.0250858329719984, 0, 0, 0, 0.119280313679916, 0, 0.172922500085254,
0.225772496747986, 0, 0, 0, 0.0954242509439325, 0)), .Names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "data.frame", row.names = c(NA,
-62L))

We could split the dataset ('df') with 1416 columns into 12 equal-sized chunks of 118 columns each by creating a grouping index with gl
lst <- setNames(lapply(split(1:ncol(df), as.numeric(gl(ncol(df), 118,
ncol(df)))), function(i) df[,i]), paste0('site', 1:12))
Or you can create the 'lst' without using the split
lst <- setNames(lapply(seq(1, ncol(df), by = 118),
function(i) df[i:(i+117)]), paste0('site', 1:12))
If we need to create 12 dataset objects in the global environment, list2env is an option (I would prefer to work within the 'lst' itself)
list2env(lst, envir=.GlobalEnv)
Using a small dataset ('df1') with '8' columns
lst1 <- setNames(lapply(split(1:ncol(df1), as.numeric(gl(ncol(df1),
2, ncol(df1)))), function(i) df1[,i]), paste0('site', 1:4))
list2env(lst1, envir=.GlobalEnv)
head(site1,3)
# V1 V2
#1 6 12
#2 4 7
#3 14 14
head(site4,3)
# V7 V8
#1 10 2
#2 5 4
#3 5 0
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8*10, replace=TRUE), ncol=8))
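
For future datasets where the number of columns or the chunk width differs, the same idea can be wrapped in a small helper. This is only a sketch: the function name split_cols and its arguments width and prefix are made up for illustration, and it assumes the sites occupy consecutive, equally wide blocks of columns (the last block may be narrower).

```r
# Hypothetical helper (not from the answer above): split a data frame into
# consecutive blocks of `width` columns and name them prefix1, prefix2, ...
split_cols <- function(df, width, prefix = "site") {
  # Block number for every column: 1, 1, ..., 2, 2, ...
  idx <- ceiling(seq_len(ncol(df)) / width)
  out <- lapply(split(seq_len(ncol(df)), idx),
                function(i) df[, i, drop = FALSE])
  setNames(out, paste0(prefix, seq_along(out)))
}

set.seed(24)
df1 <- as.data.frame(matrix(sample(0:20, 8 * 10, replace = TRUE), ncol = 8))

lst1 <- split_cols(df1, width = 2)
names(lst1)        # "site1" "site2" "site3" "site4"
names(lst1$site4)  # "V7" "V8"
```

Using drop = FALSE keeps each piece a data frame even when a block ends up with a single column.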

Related

How to assign first column as rownames in R? [duplicate]

I want to assign the first column as rownames of kirp.mut.
rownames(kirp.mut) <- kirp.mut[,1]
kirp.mut[,1] <- NULL
Traceback:
> rownames(kirp.mut) <- kirp.mut[,1]
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
In addition: Warning message:
Setting row names on a tibble is deprecated.
Dimensions:
> dim(kirp.mut)
[1] 283 8654
Class:
> class(kirp.mut)
[1] "tbl_df" "tbl" "data.frame"
typeof(kirp.mut)
[1] "list"
Data:
> dput(kirp.mut[1:10,1:10])
structure(list(sample_id = c("TCGA-2Z-A9J1-01A-11D-A382-10",
"TCGA-B9-A5W9-01A-11D-A28G-10", "TCGA-GL-A59R-01A-11D-A26P-10",
"TCGA-2Z-A9JM-01A-12D-A42J-10", "TCGA-A4-A57E-01A-11D-A26P-10",
"TCGA-BQ-7044-01A-11D-1961-08", "TCGA-HE-7130-01A-11D-1961-08",
"TCGA-UZ-A9Q0-01A-12D-A42J-10", "TCGA-HE-A5NI-01A-11D-A26P-10",
"TCGA-WN-A9G9-01A-12D-A36X-10"), NBPF1 = c(1, 0, 0, 0, 0, 0,
0, 0, 0, 0), CROCC = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0), SF3A3 = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 0), GUCA2A = c(1, 0, 0, 0, 0, 0, 0, 0,
0, 0), RAVER2 = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0), ACADM = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 0), PDE4DIP = c(1, 0, 0, 0, 0, 0, 0,
0, 0, 0), NUP210L = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0), NCF2 = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
A tibble is not meant to carry row names; setting them is deprecated, as the warning says. You could convert it to another format, such as a data frame, then assign row names. Alternatively, the tidyverse function column_to_rownames (from the tibble package) can be applied directly to your tibble without explicitly converting it first; it converts internally and returns a data.frame:
library(tidyverse) # attaches tibble, which provides column_to_rownames()
kirp.mut <- kirp.mut %>%
column_to_rownames(var = "sample_id")
See the technical documentation here on row names and tibbles
Convert to matrix, excluding 1st column, then assign rownames:
m <- as.matrix(kirp.mut[, -1])
rownames(m) <- kirp.mut$sample_id
Or to a dataframe
#convert tibble to data.frame, then add rownames
df <- as.data.frame(kirp.mut[, -1])
rownames(df) <- kirp.mut$sample_id
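
The original error is consistent with kirp.mut[,1] returning a one-column tibble rather than a plain vector, since tibbles do not drop dimensions under single-bracket indexing. Below is a minimal sketch of a defensive version, using a small made-up data frame in place of kirp.mut: it extracts the ID column with [[ (which yields a vector for both data frames and tibbles) and checks that the IDs are unique before using them as row names.

```r
# Toy stand-in for kirp.mut (values are invented for illustration)
mut <- data.frame(sample_id = c("s1", "s2", "s3"),
                  NBPF1 = c(1, 0, 0),
                  CROCC = c(0, 1, 0),
                  stringsAsFactors = FALSE)

ids <- mut[["sample_id"]]           # always a plain vector
stopifnot(anyDuplicated(ids) == 0)  # row names must be unique

# Drop the ID column, convert to a plain data frame, then set row names
out <- as.data.frame(mut[, names(mut) != "sample_id", drop = FALSE])
rownames(out) <- ids
out
```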

r: Manipulate data so that columns with same values combine in particular ways

I have a dataframe where each column is made up of zero along with one other number. For example:
I want to manipulate the dataframe so that columns that contain the same other number become one column where the value stays as the other number if the other number was present in every row, otherwise it turns to zero.
So for instance, I would want the dataframe above to look like
..1 ..2 ..3
1 2 3
0 2 0
0 0 0
1 0 0
The first row of the dataframe is 1 because the values were both 1 in the first row of the original. The second row of the first column is 0 because there were a 1 and a 0 in the row.
Here is some reproducible data:
structure(list(...1 = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ...2 = c(1, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0), ...3 = c(2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ...4 = c(3,
0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0,
0, 0, 0, 0, 0, 0), ...5 = c(3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), ...6 = c(3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
-28L), class = "data.frame")
Here is a possible solution in base R, where dat is the data frame you provide in the question. We find the nonzero value of each column (its maximum, assuming each column contains only zero and a single positive value). Then we loop through the groups of columns that share each value and apply all() to each row of the subsetted data frame to identify rows where every entry is nonzero. Multiplying the resulting logical vector by the value itself gives the desired column. Each such vector is stored in a list, and the list is bound into a data frame.
col_vals <- apply(dat, 2, max)
columns <- list()
for (val in unique(col_vals)) {
  # TRUE for rows where every column in this group is nonzero
  keep <- apply(dat[, col_vals == val, drop = FALSE] != 0, 1, all)
  columns[[length(columns) + 1]] <- val * keep
}
as.data.frame(do.call(cbind, columns))
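
As a cross-check, the same grouping can be sketched without an explicit loop by counting nonzero entries per row with rowSums. The toy data below reuses the first four rows of five of the question's columns under made-up names (a to e), and the same single-positive-value assumption applies.

```r
# Toy data: columns a,b hold 1; column c holds 2; columns d,e hold 3
dat <- data.frame(a = c(1, 1, 1, 1),
                  b = c(1, 0, 0, 1),
                  c = c(2, 2, 0, 0),
                  d = c(3, 0, 0, 3),
                  e = c(3, 3, 0, 0))

col_vals <- vapply(dat, max, numeric(1))  # the nonzero value of each column
vals <- unique(col_vals)

res <- sapply(vals, function(v) {
  grp <- dat[, col_vals == v, drop = FALSE]
  # keep v only where every column of the group is nonzero in that row
  v * (rowSums(grp != 0) == ncol(grp))
})
res <- as.data.frame(res)
res
#   V1 V2 V3
# 1  1  2  3
# 2  0  2  0
# 3  0  0  0
# 4  1  0  0
```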

Filtering dummy-variables to create an index

I'm trying to create an index in R and I have no idea where to start. I've been looking around but I just can't seem to find a way to do what I want.
I have several dummy variables (1, 0) that indicate whether someone is a member of an organization (1) or not (0). I would like to create an index indicating how many organizations a person is a member of.
That means I should somehow be able to filter and add up this information to create such an index.
I've never done anything like it. I've heard there are some easy ways to do it in SPSS, but I want to learn how to do it in R.
Does anyone have a tip on how I can do this?
If it is of any use, here is an example of my data:
dput(SK[1:10,])
structure(list(Woeltaetigkeit = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0), Menschenrechte = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Naturschutz = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), Buergerinitiative = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), Gewerkschaft = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0), ehem.Gewerkschaft = c(0, 1, 0, 1, 1, 0, 0, 0, 0, 1), Partei = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Sport = c(1, 0, 0, 1, 0, 1, 0, 0,
1, 1), Hobby = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), Gesundheit = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0), Eltern = c(0, 0, 0, 0, 0, 1, 1, 0,
1, 0), Senioren = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA,
10L), class = "data.frame")
I think all you need is the following, where my_data is your data frame (a desired output would help us understand exactly what you want):
rowSums(my_data)
output
> rowSums(my_data)
1 2 3 4 5 6 7 8 9 10
1 2 0 2 1 3 1 0 2 2
Edit: it's unclear to me whether the organisations or the people are on the rows. If I've made the wrong assumption, you can use colSums(my_data) to get the opposite.
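
A minimal sketch of storing the index as a new column, assuming people are on the rows. The toy data frame and the column name n_memberships are invented for illustration; restricting rowSums to the dummy columns keeps the new column (or any ID column) out of the count.

```r
# Toy data: three organisation dummies for four people
SK <- data.frame(Sport  = c(1, 0, 0, 1),
                 Eltern = c(0, 0, 1, 1),
                 Partei = c(0, 0, 0, 0))

org_cols <- c("Sport", "Eltern", "Partei")  # columns that enter the index

# Number of organisations each person belongs to; NAs are ignored
SK$n_memberships <- rowSums(SK[, org_cols], na.rm = TRUE)
SK$n_memberships
# [1] 1 0 1 2
```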

Replace character in a df for numeric vector in R

I would like to replace each character in a data frame with a specific numeric vector.
I have this df:
First Second Third
A C D
F R K
and I also have vectors like these
A = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
R = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
N = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
I have tried several times but I can't do it. Does anyone have some advice or idea?
An option would be to unlist the data frame (converting to character in case the columns are factors) and then use mget to return, in a list, the values of the objects with those names
lst1 <- mget(as.character(unlist(df)))
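
A related sketch, under the assumption that the encoding vectors can be kept in a named list instead of as separate objects: the lookup then does not depend on which environment the vectors live in, and the result can be stacked into a matrix. The short vectors and the two-column data frame are toy values; a letter without an entry in codes would come back as NULL.

```r
# Toy encoding vectors stored in a named list (shortened to length 3)
codes <- list(A = c(1, 0, 0),
              R = c(0, 1, 0),
              N = c(0, 0, 1))

df <- data.frame(First = c("A", "R"), Second = c("N", "A"),
                 stringsAsFactors = FALSE)

keys <- as.character(unlist(df))  # column-wise order: "A" "R" "N" "A"
lst1 <- codes[keys]               # one vector per cell of df

# Optionally stack the vectors into a matrix, one row per cell
mat <- do.call(rbind, lst1)
dim(mat)
# [1] 4 3
```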

Add consecutive elements of a vector until a value

I would like to calculate the minimum number of consecutive elements in a vector that, when added (consecutively), would reach a given value.
For example in the following vector
ev<-c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3.27, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 370.33, 1375.4,
1394.03, 1423.8, 1360, 1269.77, 1378.8, 1350.37, 1425.97, 1423.6,
1363.4, 1369.87, 1365.5, 1294.97, 1362.27, 1117.67, 1026.97,
1077.4, 1356.83, 565.23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 356.83,
973.5, 0, 240.43, 1232.07, 1440, 1329.67, 1096.87, 1331.37, 1305.03,
1328.03, 1246.03, 1182.3, 1054.53, 723.03, 1171.53, 1263.17,
1200.37, 1054.8, 971.4, 936.4, 968.57, 897.93, 1099.87, 876.43,
1095.47, 1132, 774.4, 1075.13, 982.57, 947.33, 1096.97, 929.83,
1246.9, 1398.2, 1063.83, 1223.73, 1174.37, 1248.5, 1171.63, 1280.57,
1183.33, 1016.23, 1082.1, 795.37, 900.83, 1159.2, 992.5, 967.3,
1440, 804.13, 418.17, 559.57, 563.87, 562.97, 1113.1, 954.87,
883.8, 1207.1, 1046.83, 995.77, 803.93, 1036.63, 946.9, 887.33,
727.97, 733.93, 979.2, 1176.8, 1241.3, 1435.6)
What is the minimum number of elements that, when added consecutively (in the order within the vector), would sum up to, let's say, 20000?
To be more clear, I need the following:
Start with ev[1] and add consecutively up to 20000. Record the number of elements you had to add in order to get to 20000 as r[1]. Then start with ev[2] and add until 20000, and so on. Record the number of elements you had to add until 20000 as r[2]. Do this for the entire length of ev. Then return min(r).
For example
j<-c(1, 2, 3, 5, 7, 9, 2)
I want the minimum number of elements that, when added consecutively, would give, let's say, >20. This should be 3 (5+7+9).
Thanks a lot
Well, I'll give it a shot: This one will find the length of the minimum sequence of numbers
that add up to or above max. It makes no claims to be fast, but it has O(2n) time complexity :-)
I made it return both the start index and the length.
f <- function(x, max=10) {
  s <- 0
  len <- Inf
  start <- 1
  j <- 1
  for (i in seq_along(x)) {
    s <- s + x[i]
    while (s >= max) {
      if (i-j+1 < len) {
        len <- i-j+1
        start <- j
      }
      s <- s - x[j]
      j <- j + 1
    }
  }
  list(start=start, length=len)
  # uncomment the line below if you don't need the start index...
  #len
}
r <- f(ev, 20000) # list(start=245, length=15)
sum(ev[seq(r$start, len=r$length)]) # 20275.42
# Test speed:
x <- sin(1:1e6)
system.time( r <- f(x, 1.9) ) # 1.54 secs
# Compiling the function makes it 9x faster...
g <- compiler::cmpfun(f)
system.time( r <- g(x, 1.9) ) # 0.17 secs
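
For checking results on small inputs, the procedure described in the question (record r[i] for every start position, then take the minimum) can be written down directly. This is a quadratic-time sketch meant only as a cross-check; the function name min_run_bruteforce is made up, and, like the function above, it treats "reaching" the target as sum >= target.

```r
# Brute-force cross-check: for every start position, find the first end
# position whose running sum reaches `target`, and keep the shortest run.
min_run_bruteforce <- function(x, target) {
  n <- length(x)
  cs <- c(0, cumsum(x))  # cs[k + 1] is sum(x[1:k])
  best <- Inf
  for (start in seq_len(n)) {
    # sums of x[start:end] for end = start, start + 1, ..., n
    sums <- cs[(start + 1):(n + 1)] - cs[start]
    hit <- which(sums >= target)
    if (length(hit) > 0) best <- min(best, hit[1])
  }
  best  # Inf if no run reaches the target
}

j <- c(1, 2, 3, 5, 7, 9, 2)
min_run_bruteforce(j, 20)
# [1] 3
```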
library(zoo) # Needed for rollapply
N <- 20000 # The desired sum we want to achieve
j <- 0
for(i in 1:length(ev)){
k <- rollapply(ev, i, sum)
j[i] <- max(k)
if(j[i] >= N){
break
}
}
i # contains how many consecutive elements you need to sum (15)
j[i] # contains the corresponding sum(20275.42)
Currently this doesn't tell you where the specific subset occurs in the vector but another use of rollapply could get you that information.
There are other ways to do it but if you have a really long vector this will break out of the loop so you don't calculate more than you need. The basic idea is to use rollapply to create a vector of the consecutive sums of length k and then find the maximum of that. If this is less than what we desire do the same thing for sums of length k+1. Repeat until we find a sum that is larger than the desired threshold.
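
To locate where the window occurs once its length is known, one option that avoids extra packages is to compute all window sums from a cumulative sum. This is a sketch rather than the rollapply-based approach mentioned above; the helper name window_sums is invented, and it assumes k is between 1 and length(x).

```r
# Sums of every window of k consecutive elements, via cumulative sums
window_sums <- function(x, k) {
  n <- length(x)
  cs <- c(0, cumsum(x))
  cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]
}

j <- c(1, 2, 3, 5, 7, 9, 2)
ws <- window_sums(j, 3)
ws
# [1]  6 10 15 21 18
which.max(ws)  # start index of the largest window of length 3
# [1] 4
```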
Edit:
This appears to be about 100x faster. I haven't compared it to Tommy's answer (which is probably faster than this), but this will provide a significant speedup compared to my original method.
Edit 2: Moving the [-n] and removing the suppressWarnings speeds this up quite a bit.
myfun <- function(ev, N){
  i <- 1
  n <- length(ev)
  j <- ev
  repeat{
    j <- (j[-n] + ev[-c(1:i)])
    i <- i+1
    n <- n-1
    if(max(j) >= N | i > length(ev)){
      break;
    }
  }
  return(i)
}
myfun(ev, 20000)
# And stealing the idea from Tommy gives a nice speedup as well
myfuncomp <- compiler::cmpfun(myfun)
myfuncomp(ev, 20000)
myfunc3 <- compiler::cmpfun(myfun, options = list(optimize = 3))
myfunc3(ev, 20000)
library(rbenchmark) # For testing
# If you have Tommy's functions loaded as f and g you can compare
benchmark(f(ev, 20000), g(ev, 20000), myfun(ev, 20000), myfuncomp(ev, 20000), myfunc3(ev, 20000))
Do you mean something like this?
> sum(ifelse(cumsum(ev)<=200000, 1, 0))
[1] 364
I think this may be a Traveling Salesman Problem in disguise unless you put in some more constraints. You cannot necessarily start at the max of ev and go out in either direction, since it may be a local non-dense maximum.
x=1:length(ev)
plot(x,ev)
lxy <- loess(ev~x )
lines(x, predict(lxy))
title(main="loess() fit of ev")
But in the region of the most dense values the values are fairly flat.
y=c(356.83,
973.5, 0, 240.43, 1232.07, 1440, 1329.67, 1096.87, 1331.37, 1305.03,
1328.03, 1246.03, 1182.3, 1054.53, 723.03, 1171.53, 1263.17,
1200.37, 1054.8, 971.4, 936.4, 968.57, 897.93, 1099.87, 876.43,
1095.47, 1132, 774.4, 1075.13, 982.57, 947.33, 1096.97, 929.83,
1246.9, 1398.2, 1063.83, 1223.73, 1174.37, 1248.5, 1171.63, 1280.57,
1183.33, 1016.23, 1082.1, 795.37, 900.83, 1159.2, 992.5, 967.3,
1440, 804.13, 418.17, 559.57, 563.87, 562.97, 1113.1, 954.87,
883.8, 1207.1, 1046.83, 995.77, 803.93, 1036.63, 946.9, 887.33,
727.97, 733.93, 979.2, 1176.8, 1241.3, 1435.6)
x=1:length(y) # define x only after y exists
lxyhi <- loess(y~x)
plot(x,y)
lines(x, predict(lxyhi))
