I have a data frame which looks something like:
a b c
1 1 2 3
2 3 4 5
3 4 5 2
4 3 5 6
.....
and I'm trying to make some kind of selector function which will create a vector of values from the a, b, or c column based on a vector of column names, i.e. for the input vector:
c("a", "b", "c", "b", ...)
the output of the selector function should be:
c(1,4,2,5,...)
I can do it by looping through rows or with nested ifelse, but is there a better (more generic) way in case of more than a few columns?
We need a row/column index matrix to extract the values from the dataset, i.e.
df1[cbind(1:nrow(df1), match(v1, colnames(df1)))]
#[1] 1 4 2 5
data
v1 <- c('a','b','c','b')
df1 <- structure(list(a = c(1L, 3L, 4L, 3L), b = c(2L, 4L, 5L, 5L),
c = c(3L, 5L, 2L, 6L)), .Names = c("a", "b", "c"), class = "data.frame", row.names = c("1",
"2", "3", "4"))
I have a data frame with a column named Rooms which holds the number of rooms in the house. It has about 50,000+ rows, and checking it with str(df$Rooms) shows it is a factor with 44 levels. The column looks like this:
>str(df$Rooms)
Factor w/ 44 levels "","1","1+1","1+2",..: 20 32 23 27 28 29 27 23 26 24 ...
> df$Rooms
1+2
3
1+3
1+2
4
3
1+1
2
..
..
My question: is there any way, function, or library in R that can be used to evaluate these expressions, so that the column becomes something like this:
> df$Rooms
3
3
4
3
4
3
2
2
..
..
Thank you in advance~
We can use eval(parse(...)):
df$final_rooms <- sapply(as.character(df$Rooms), function(x) eval(parse(text = x)))
df
# Rooms final_rooms
#1 1+2 3
#2 3 3
#3 1+3 4
#4 1+2 3
#5 4 4
#6 3 3
#7 1+1 2
#8 2 2
data
df <- structure(list(Rooms = structure(c(2L, 5L, 3L, 2L, 6L, 5L, 1L,
4L), .Label = c("1+1", "1+2", "1+3", "2", "3", "4"), class = "factor")),
class = "data.frame", row.names = c(NA, -8L))
We can split on the + and sum after converting to numeric, without using eval(parse(...)), in base R:
df$final_rooms <- sapply(strsplit(as.character(df$Rooms) , "+",
fixed = TRUE), function(x) sum(as.numeric(x)))
Another option is to read the column with read.table into two columns and take rowSums, which is vectorized:
df$final_rooms <- rowSums(read.table(text = as.character(df$Rooms),
sep="+", header = FALSE, fill = TRUE), na.rm = TRUE)
df$final_rooms
#[1] 3 3 4 3 4 3 2 2
data
df <- structure(list(Rooms = structure(c(2L, 5L, 3L, 2L, 6L, 5L, 1L,
4L), .Label = c("1+1", "1+2", "1+3", "2", "3", "4"), class = "factor")),
class = "data.frame", row.names = c(NA, -8L))
I'm trying to calculate cumulative sum in rows with several variables.
This is my data as an example. I have 5 patient IDs and 4 condition variables. For each condition with a value between 1 and 3, the count should increase by 1.
ID<-c("a","b","c","d","e")
cond1<-as.factor(sample(x=1:7,size=5,replace=TRUE))
cond2<-as.factor(sample(x=1:7,size=5,replace=TRUE))
cond3<-as.factor(sample(x=1:7,size=5,replace=TRUE))
cond4<-as.factor(sample(x=1:7,size=5,replace=TRUE))
df<-data.frame(ID,cond1,cond2,cond3,cond4)
df
ID cond1 cond2 cond3 cond4
1 a 2 7 6 6
2 b 7 2 3 6
3 c 4 3 1 4
4 d 7 3 3 6
5 e 6 7 7 3
I used rowSums with the following statement. However, in the 2nd row, although cond2 is 2 and cond3 is 3, the result is 1 instead of 2. The 4th row has the same problem.
df$cumsum<-rowSums(df[,2:5]==c(1,2,3),na.rm=TRUE)
df
ID cond1 cond2 cond3 cond4 cumsum
1 a 2 7 6 6 0
2 b 7 2 3 6 1
3 c 4 3 1 4 1
4 d 7 3 3 6 1
5 e 6 7 7 3 0
How to make it cumulative? I would really appreciate all your help.
For comparison against more than one element, use %in%. But %in% works on a vector, so we loop through the columns with lapply/sapply and then take rowSums of the logical matrix. (The == in the question recycles c(1, 2, 3) element-wise across the data rather than testing membership, which is why the counts come out wrong.)
df$RSum <- rowSums(sapply(df[,2:5], `%in%`, 1:3))
df$RSum
#[1] 1 2 2 2 1
If the values were numeric, then we could also make use of >= and <=:
df$RSum <- rowSums(df[, 2:5] >=1 & df[, 2:5] <=3)
data
df <- structure(list(ID = c("a", "b", "c", "d", "e"), cond1 = c(2L,
7L, 4L, 7L, 6L), cond2 = c(7L, 2L, 3L, 3L, 7L), cond3 = c(6L,
3L, 1L, 3L, 7L), cond4 = c(6L, 6L, 4L, 6L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
I suggest you fix two problems with your data:
Your data is wide formatted instead of long formatted. Had your data been long formatted, your analysis would have been much simpler. This is especially true for plotting.
Your values for each condition are factors. That makes comparisons more difficult and might induce some difficult-to-spot errors. If you look at @akrun's answer carefully, you'll notice the values are integer (numeric).
That said, I propose a data.table solution:
# 1. load libraries and make df a data.table:
library(data.table)
setDT(df)
# 2. make the wide table a long one
melt(df, id.vars = "ID")
# 3. with a long table, count the number of conditions that are in the 1:3 range for each ID. Notice I chained the first command with this second one:
melt(df, id.vars = "ID")[, sum(value %in% 1:3), by = ID]
Which produces the result:
ID V1
1: a 1
2: b 2
3: c 2
4: d 2
5: e 1
You'll only need to run commands under 1 and 3 (2 has been chained into 3). See ?data.table for further details.
You can read more about wide vs long formats in Wikipedia and in Mike Wise's answer.
The data I used is the same as @akrun's:
df <- structure(list(ID = c("a", "b", "c", "d", "e"),
cond1 = c(2L, 7L, 4L, 7L, 6L),
cond2 = c(7L, 2L, 3L, 3L, 7L),
cond3 = c(6L, 3L, 1L, 3L, 7L),
cond4 = c(6L, 6L, 4L, 6L, 3L)),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5"))
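If data.table is not available, the same long-format computation can be sketched in base R with stack() and aggregate() (same data as above):

```r
df <- data.frame(ID = c("a", "b", "c", "d", "e"),
                 cond1 = c(2, 7, 4, 7, 6), cond2 = c(7, 2, 3, 3, 7),
                 cond3 = c(6, 3, 1, 3, 7), cond4 = c(6, 6, 4, 6, 3))

# stack() melts the condition columns column-by-column;
# rep() repeats the IDs in the matching order
long <- data.frame(ID = rep(df$ID, times = 4), stack(df[2:5]))
aggregate(values ~ ID, data = long, FUN = function(v) sum(v %in% 1:3))
#   ID values
# 1  a      1
# 2  b      2
# 3  c      2
# 4  d      2
# 5  e      1
```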
I'm trying to remove all the rows that have the same value in the "lan" column of my data frame but a different value in the "id" column (but not vice versa).
Using an example dataset:
require(dplyr)
t <- structure(list(id = c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L,
4L), lan = structure(c(1L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 1L,
7L), .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"),
value = c(0.22988498, 0.848989831, 0.538065821, 0.916571913,
0.304183372, 0.983348167, 0.356128559, 0.054102854, 0.400934593,
0.001026817, 0.488452667)), .Names = c("id", "lan", "value"
), class = "data.frame", row.names = c(NA, -11L))
t
I need to get rid of rows 1 and 10 because they have the same lan (a) but different id.
I've tried the following, without success:
a<-t[(!duplicated(t$id)),]
c<-a[duplicated(a$lan)|duplicated(a$lan, fromLast=TRUE),]
d<-t[!(t$lan %in% c$lan),]
Thanks for your help!
And an alternative using dplyr:
t2 <- t %>%
group_by(lan,id) %>%
summarise(value=sum(value)) %>%
group_by(lan) %>%
summarise(number=n()) %>%
filter(number>1) %>%
select(lan)
> t[!t$lan %in% t2$lan ,]
id lan value
2 2 b 0.84898983
3 2 c 0.53806582
4 3 d 0.91657191
5 3 d 0.30418337
6 4 e 0.98334817
7 4 e 0.35612856
8 4 e 0.05410285
9 4 f 0.40093459
11 4 g 0.48845267
You could use duplicated on "lan" to flag all lan values that occur more than once (indx1). Then use duplicated on both columns together ('id', 'lan') to flag id/lan combinations that occur only once (indx2). Rows where both are TRUE have a repeated lan under different ids, so negate that condition and subset.
indx1 <- with(t, duplicated(lan)|duplicated(lan,fromLast=TRUE))
indx2 <- !(duplicated(t[1:2])|duplicated(t[1:2],fromLast=TRUE))
t[!(indx1 & indx2),]
# id lan value
#2 2 b 0.84898983
#3 2 c 0.53806582
#4 3 d 0.91657191
#5 3 d 0.30418337
#6 4 e 0.98334817
#7 4 e 0.35612856
#8 4 e 0.05410285
#9 4 f 0.40093459
#11 4 g 0.48845267
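An alternative for the same filter is ave(), counting the distinct ids per lan and keeping only the lans used by a single id (a sketch on the example data, values column omitted):

```r
t <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4),
                lan = c("a", "b", "c", "d", "d", "e", "e", "e", "f", "a", "g"))

# number of distinct ids for each lan value
n_id <- ave(t$id, t$lan, FUN = function(x) length(unique(x)))
# keep rows whose lan is used by exactly one id (drops both "a" rows)
t[n_id == 1, ]
```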
Problem: Extraordinarily large dataset with dozens of columns. How to search a list of columns and all the rows within them, and if they match conditions, create a new column that adds a dichotomous variable to the row. Normally would use Excel, but size is too large.
Example
col1 col2 col3 col4
1 2 3 4
1 2 5 6
3 3 3 3
1 1 1 2
2 3 4 1
If any of these columns (col1-4) and any of the rows within match a list of numbers, say List: 1, 2, 3, then add a new column (col5) and add 1 if it matches, 0 if not. Repetition doesn't matter - the value returned is 1 if there is one or more occurrence of any of the list conditions.
Potential solution idea
For i in col1:col4, for j in row1:allrows, ifelse(row=list, col5=1, col5=0), next.
Thanks!
Maybe you need:
df$col5 <- (apply(df, 1, function(x)
  !any(!table(factor(x[x %in% v1], levels = v1))))) + 0L
df
# col1 col2 col3 col4 col5
#1 1 2 3 4 1
#2 1 2 5 6 0
#3 3 3 3 3 0
#4 1 1 1 2 0
#5 2 3 4 1 1
data
df <- structure(list(col1 = c(1L, 1L, 3L, 1L, 2L), col2 = c(2L, 2L,
3L, 1L, 3L), col3 = c(3L, 5L, 3L, 1L, 4L), col4 = c(4L, 6L, 3L,
2L, 1L)), .Names = c("col1", "col2", "col3", "col4"), class =
"data.frame", row.names = c(NA, -5L))
v1 <- 1:3
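An equivalent and arguably more readable per-row check is all(v1 %in% x), which is TRUE only when every value in v1 appears somewhere in the row (a sketch on the same data):

```r
df <- data.frame(col1 = c(1, 1, 3, 1, 2), col2 = c(2, 2, 3, 1, 3),
                 col3 = c(3, 5, 3, 1, 4), col4 = c(4, 6, 3, 2, 1))
v1 <- 1:3

# 1 if every value of v1 occurs in the row, 0 otherwise
df$col5 <- apply(df[1:4], 1, function(x) as.integer(all(v1 %in% x)))
df$col5
# [1] 1 0 0 0 1
```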
I have a data frame generated inside a for loop with this structure:
V1 V2 V3
1 a a 1
2 a b 3
3 a c 2
4 a d 1
5 a e 3
6 b a 3
7 b b 1
8 b c 8
9 b d 1
10 b e 1
11 c a 2
12 c b 8
The data is longer than this, but that's the idea: I want to transform it into a wide table (V1 by V2), where V3 is the value based on the (V1, V2) pair. I want to rearrange the data to look like this (the first column is the unique values of V1, the first row is the unique values of V2, and the cells between them come from V3):
a b c d e
a 1 3 2 1 3
b 3 1 8 1 1
c 2 8 2 8 2
d 1 1 5 7 2
e 3 5 9 5 3
Thanks in advance.
Reproducible example of yours:
df <- structure(list(V1 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), V2 = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L), .Label = c("a", "b", "c", "d", "e"), class = "factor"), V3 = c(1L, 3L, 2L, 1L, 3L, 3L, 1L, 8L, 1L, 1L, 2L, 8L)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
And compute a basic crosstable based on your variables:
> xtabs(V3~V1+V2, df)
V2
V1 a b c d e
a 1 3 2 1 3
b 3 1 8 1 1
c 2 8 0 0 0
I hope you meant this :)
If df is your data-frame, assuming a unique V3 is mapped to each V1,V2 combination, you can do it with
with(df, tapply(V3, list(V1,V2), identity))
Another method, perhaps slightly more baroque, for widening a data frame from a third column on the basis of the first two... agreeing with Chase that the OP has not given an unambiguous problem description:
df2 <- expand.grid(A=LETTERS[1:5], B=LETTERS[1:5])
df2$N <- 1:25
mtx <- outer(X=LETTERS[1:5],Y=LETTERS[1:5], FUN=function(x,y){
df2[intersect(which(df2$A==x), which(df2$B==y)), "N"] })
colnames(mtx)<-LETTERS[1:5]; rownames(mtx)<-LETTERS[1:5]
mtx
A B C D E
A 1 6 11 16 21
B 2 7 12 17 22
C 3 8 13 18 23
D 4 9 14 19 24
E 5 10 15 20 25
I'm sure there are many other strategies using reshape in base or dcast in reshape2.
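For instance, base reshape() can widen this too (a sketch on the reproducible data above; combinations absent from the data become NA rather than 0):

```r
df <- data.frame(
  V1 = rep(c("a", "b", "c"), c(5, 5, 2)),
  V2 = c("a", "b", "c", "d", "e", "a", "b", "c", "d", "e", "a", "b"),
  V3 = c(1, 3, 2, 1, 3, 3, 1, 8, 1, 1, 2, 8))

# idvar becomes the rows, timevar spreads into V3.<level> columns
wide <- reshape(df, direction = "wide", idvar = "V1", timevar = "V2")
rownames(wide) <- NULL
wide
#   V1 V3.a V3.b V3.c V3.d V3.e
# 1  a    1    3    2    1    3
# 2  b    3    1    8    1    1
# 3  c    2    8   NA   NA   NA
```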