Selecting specific columns from dataset - r

I have a dataset which looks like this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want a new data frame (df) that contains only the columns whose names end with .1 (1.1, 2.1, 7.1), i.e.:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
I have shown only a few columns here, but the actual data has more than 100. Please suggest a solution that works for any number of columns.
Thanks in advance.

I guess the pattern is that the column name ends with ".1"; you may need to adapt the pattern if that is not quite right.
The data I am using:
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
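Actually, this matches every name ending with any single character followed by "1", because an unescaped "." in a regular expression matches any character: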
df <- original_data[which(grepl(".1$", names(original_data)))]
To match only names ending with a literal ".1", you have to escape the dot:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both give the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0
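The two patterns only differ when a column name ends in 1 without a dot in front of it. A minimal sketch, assuming a hypothetical extra column named X11 that is not in the original data:

```r
# Toy data frame; X11 is a hypothetical column added only to show the difference.
original_data <- data.frame(A = 3, X50_TT_1.0 = 1, X50_TT_1.1 = 0, X11 = 5)

# Unescaped dot: any character followed by "1" at the end of the name.
loose <- names(original_data)[grepl(".1$", names(original_data))]

# Escaped dot: a literal ".1" at the end of the name.
strict <- names(original_data)[grepl("\\.1$", names(original_data))]

print(loose)
print(strict)
```

If no regular expression is needed at all, endsWith(names(original_data), ".1") should select the same columns as the escaped pattern.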

Related

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each new variable is either 1 or 0:
jb$Finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to count the 1s in each row across a selected set of the columns I just made (using the code above) and store the result in another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result was 14 for every row. I was expecting something no greater than 9, since I only selected 9 variables. I don't want to delete the other variables like V1 and V2; I just want to sum some of the variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to count the 1s in each row, within the columns I select. How can I do that in code for a given set of variables/columns?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)
data <- matrix(rnorm(100, 100, 30), nrow = 10)
# recode to binary
data <- apply(data, 2, function(x) ifelse(x > 100, 1, 0))
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V", 1:5), paste0("X", 1:5))
as.data.frame(data) %>%
  mutate(total = select(., starts_with("V")) %>% rowSums)
...and the output, where total equals the sum of V1 to V5 and excludes X1 to X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
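In the question's code, sum() adds every value of every listed column into a single number, and mutate() then recycles that one number down the whole column, which is why every row showed 14. rowSums() adds within each row instead. A base R sketch, assuming a toy jb that borrows column names from the question but uses made-up values:

```r
# Toy stand-in for jb; the values are invented for illustration.
jb <- data.frame(
  V1            = c(1, 2, 2),
  Finances      = c(1, 0, 0),
  Exercise      = c(1, 1, 0),
  Volunteer     = c(1, 0, 0),
  Relationships = c(1, 0, 0),
  Laugh         = c(0, 1, 1)
)

# Name the columns to be summed; V1 is deliberately left out.
content_cols <- c("Finances", "Exercise", "Volunteer", "Relationships", "Laugh")

# One sum per row, computed only over the selected columns.
jb$sum.content <- rowSums(jb[, content_cols])

# For contrast: sum() collapses the same columns into one grand total.
grand_total <- sum(jb[, content_cols])

print(jb)
print(grand_total)
```

Keeping the column names in a character vector means the other variables stay in the data frame untouched.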

Match combinations of row values between 2 different data frames

I have a data.frame with 16 different combinations of 4 different cell markers
combinations_df
FITC Cy3 TX_RED Cy5
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 1 1 0 0
e 0 0 1 0
f 1 0 1 0
g 0 1 1 0
h 1 1 1 0
i 0 0 0 1
j 1 0 0 1
k 0 1 0 1
l 1 1 0 1
m 0 0 1 1
n 1 0 1 1
o 0 1 1 1
p 1 1 1 1
I have my "main" data.frame with 10 columns and thousands of rows.
> main_df
a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1 1 1 1 0 1 1 1 1
2 0 1 0 1 1 0 1 0 1 1
3 1 1 0 0 0 1 1 0 0 0
4 0 1 1 1 1 0 1 1 1 1
5 0 0 0 0 0 0 0 0 0 0
....
I want to use all the possible 16 combinations from combinations_df to compare with each row of main_df. Then I want to create a new vector to later cbind to main_df as column 11.
sample output
> phenotype
[1] "g" "i" "a" "p" "g"
I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.
Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.
EDIT: I forgot to mention that I want to compare combinations_df only to columns 3,5,7,9 from main_df. They have the same name, but it might not be that obvious.
EDIT: Changing the sample data output, since no "t" should be present
The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:
# phenotype FITC Cy3 TX_RED Cy5
#1 a 0 0 0 0
#2 b 1 0 0 0
#3 c 0 1 0 0
#4 d 1 1 0 0
# etc
dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.
library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))
# a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1 1 1 1 0 1 1 1 1 p
#2 0 1 0 1 1 0 1 0 1 1 o
#3 1 1 0 0 0 1 1 0 0 0 e
#4 0 1 1 1 1 0 1 1 1 1 p
#5 0 0 0 0 0 0 0 0 0 0 a
I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.
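The first step, turning the row names of combinations_df into an explicit phenotype column, can be done in base R. A minimal sketch, assuming combinations_df is rebuilt here with expand.grid() in the same row order as in the question:

```r
# Rebuild the 16 marker combinations; the first factor varies fastest,
# which reproduces the row order a, b, c, ... shown in the question.
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combinations_df) <- letters[1:16]

# Move the row names into a regular column so that a join can carry them along.
combinations_df <- data.frame(
  phenotype = rownames(combinations_df),
  combinations_df,
  row.names = NULL
)

head(combinations_df, 4)
```

With phenotype stored as a column, the left_join() call shown in the answer has something to attach to each row of main_df.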
It's not very elegant, but this method works just fine. There are no nested loops here, so it should run reasonably fast. You might try matching on the data frame rows directly and do away with the loops altogether, but this was the fastest way I could figure out. You might also look at the plyr or data.table packages, which are very powerful for this kind of thing.
main_text <- NULL
for (i in 1:length(main_df[, 1])) {
  main_text[i] <- paste(main_df[i, 3], main_df[i, 5], main_df[i, 7], main_df[i, 9], sep = "")
}
comb_text <- NULL
for (i in 1:length(combinations_df[, 1])) {
  comb_text[i] <- paste(combinations_df[i, 1], combinations_df[i, 2], combinations_df[i, 3], combinations_df[i, 4], sep = "")
}
rownames(combinations_df)[match(main_text, comb_text)]
How about something like this? My results are different from yours, as there is no "t" in combination_df. You could do it without assigning a new column if you wanted; the key column is mainly for illustrative purposes.
combination_df <- read.table("Documents/comb.txt.txt", header=T)
main_df <- read.table("Documents/main.txt", header=T)
main_df
combination_df
main_df$key <- do.call(paste0, main_df[,c(3,5,7,9)])
combination_df$key <- do.call(paste0, combination_df)
rownames(combination_df)[match(main_df$key, combination_df$key)]
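The key-matching idea can be tried without any input files. A self-contained sketch, assuming the five sample rows of main_df from the question and a combinations table rebuilt with expand.grid(); the marker columns are selected by name rather than by position:

```r
# The five sample rows of main_df shown in the question.
main_df <- read.table(header = TRUE, text = "
a b FITC d Cy3 f TX_RED h Cy5 j
0 1 1 1 1 0 1 1 1 1
0 1 0 1 1 0 1 0 1 1
1 1 0 0 0 1 1 0 0 0
0 1 1 1 1 0 1 1 1 1
0 0 0 0 0 0 0 0 0 0
")

# All 16 marker combinations, labelled a to p as in the question.
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
rownames(combinations_df) <- letters[1:16]

markers <- c("FITC", "Cy3", "TX_RED", "Cy5")

# Collapse the four marker columns of each table into one string key per row.
main_key <- do.call(paste0, unname(main_df[markers]))
comb_key <- do.call(paste0, unname(combinations_df[markers]))

# Look up each main_df key among the combination keys.
main_df$phenotype <- rownames(combinations_df)[match(main_key, comb_key)]

print(main_df)
```

Both paste0() and match() are vectorised, so no explicit loop over the rows of main_df is needed.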

counting the occurrences of a number and when it occurred in R data.frame and data.table

I have newly started to learn R, so my question may be utterly ridiculous. I have a data frame
data<- data.frame('number'=1:11, 'col1'=sample(10:20),'col2'=sample(10:20),'col3'=sample(10:20),'col4'=sample(10:20),'col5'=sample(10:20), 'date'= c('12-12-2014','12-11-2014','12-10-2014','12-09-2014', '12-08-2014','12-07-2014','12-06-2014','12-05-2014','12-04-2014', '12-04-2014', '12-03-2014') )
The number column is an 'id' column and the last column is a date.
I want to count the number of times each number occurs across columns 2:6 taken together (over the whole block of data, not per column), and also find when each occurrence happened.
I am stuck on the first part having tried the following using data.table:
count <- function(){
  i = 1
  DT <- data.table(data[2:6])
  for (i in 10:20){
    DT[, .N, by = i]
    i = i + 1
  }
}
which gives an error that I don't begin to understand
Error in `[.data.table`(DT, , .N, by = i) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (11)
Can someone help, please? I also need help with the second part, which I have not even attempted yet, i.e. associating a date or a row number with each occurrence of a number.
Perhaps you may want this
library(reshape2)
table(melt(data[,-1], id.var='date')[,-2])
# value
#date 10 11 12 13 14 15 16 17 18 19 20
# 12-03-2014 0 0 1 0 0 1 0 0 1 2 0
# 12-04-2014 2 0 0 2 2 0 1 0 1 1 1
# 12-05-2014 0 0 0 0 0 0 1 1 2 0 1
# 12-06-2014 1 1 0 0 0 1 0 1 0 0 1
# 12-07-2014 0 1 0 1 0 1 1 1 0 0 0
# 12-08-2014 1 1 0 0 1 0 0 1 1 0 0
# 12-09-2014 0 0 2 0 1 2 0 0 0 0 0
# 12-10-2014 0 0 1 1 0 0 1 0 0 1 1
# 12-11-2014 0 1 1 0 0 0 1 0 0 1 1
# 12-12-2014 1 1 0 1 1 0 0 1 0 0 0
Or if you need a data.table solution (from @Arun's comments)
library(data.table)
dcast.data.table(melt(setDT(data),
id="date", measure=2:6), date ~ value)

reverse lexicographic order after using expand.grid

I'm trying to generate the following matrix, based on a multinomial framework. For example, if I had three columns, I'd get:
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
But, I want many more columns. I know I can use expand.grid, like:
u <- list(0:1)
expand.grid(rep(u,3))
But, it returns what I want in the wrong order:
0 0 0
1 0 0
0 1 0
1 1 0
0 0 1
1 0 1
0 1 1
1 1 1
Any ideas? Thanks.
You can reorder your rows to match your expected output:
u <- list(0:1)
g <- expand.grid(rep(u,3))
g <- g[order(rowSums(g)), ]
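The same reordering extends to more columns. A short sketch, assuming 4 columns as an arbitrary choice; ties in order() keep their original position, so rows with equal sums stay in the order expand.grid produced them, which may or may not be the tie order wanted for wider matrices:

```r
n <- 4  # number of columns; 4 is an arbitrary choice for illustration

# All 2^n combinations of 0 and 1.
g <- expand.grid(rep(list(0:1), n))

# Sort rows by how many 1s they contain.
g <- g[order(rowSums(g)), ]
rownames(g) <- NULL

print(g)
```

Additional sort keys can be passed to order() after rowSums(g) if a different order within each group of equal sums is needed.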

How to exclude cases that do not repeat X times in R?

I have unbalanced longitudinal data in long format. I would like to exclude all cases that do not contain complete information, by which I mean all cases that do not repeat 8 times. Can someone help me find a solution?
Below is an example: I have three subjects (A, B, and C). I have 8 observations each for A and B, but only 2 for C. How can I delete the rows for C, given that it has fewer than 8 repeated measurements?
temp = scan()
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0
Any help?
Assuming your variable names are V1, V2... and so on, here's one approach:
temp[temp$V1 %in% names(which(table(temp$V1) == 8)), ]
The table(temp$V1) == 8 matches the values in the V1 column that have exactly 8 cases. The names(which(... part creates a basic character vector that we can match using %in%.
And another:
temp[ave(as.character(temp$V1), temp$V1, FUN = length) == "8", ]
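Both approaches can be checked on a smaller example. A sketch, assuming a hypothetical threshold required_n of 3 instead of 8 and a toy temp with made-up values, with ave() applied to row positions so that the group sizes stay numeric:

```r
required_n <- 3  # hypothetical threshold; the question uses 8

# Toy data: A and B have 3 rows each, C has only 1.
temp <- data.frame(
  V1 = c("A", "A", "A", "B", "B", "B", "C"),
  V2 = c(1, 1, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Approach 1: count rows per subject, then keep subjects with the required count.
counts <- table(temp$V1)
complete_ids <- names(which(counts == required_n))
by_table <- temp[temp$V1 %in% complete_ids, ]

# Approach 2: attach the group size to every row, then filter on it.
group_size <- ave(seq_along(temp$V1), temp$V1, FUN = length)
by_ave <- temp[group_size == required_n, ]

print(by_table)
```

Replacing == with >= would keep subjects that have at least required_n rows rather than exactly that many.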
Here's another approach:
temp <- read.table(text="
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
A 1 1 1 1
A 0 1 0 0
A 1 1 1 0
A 1 1 0 1
A 1 0 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
B 1 1 1 1
B 0 1 0 0
B 1 1 1 0
B 1 1 0 1
B 1 0 0 0
C 1 1 1 1
C 0 1 0 0", header=FALSE)
do.call(rbind,
        Filter(function(subgroup) nrow(subgroup) == 8,
               split(temp, temp[[1]])))
split breaks the data.frame up by its first column, then Filter drops the subgroups that don't have 8 rows. Finally, do.call(rbind, ...) collapses the remaining subgroups back into a single data.frame.
If the first column of temp is character (rather than factor, which you can verify with str(temp)) and the rows are ordered by subgroup, you could also do:
with(rle(temp[[1]]), temp[rep(lengths==8, times=lengths), ])
