Suppose I have a dataframe as follows,
Name value
A 0
A 1
A 2
A 3
B 0
B 0
B 3
C 5
I want the following output,
Name 0 0<X<2 2-4 5 and above
A 1 1 2 0
B 2 0 1 0
C 0 0 0 1
I want new columns to be created, each holding the count of rows that fall into that range. I tried reshape, but although it changes the structure, it does not produce the counts. Can anybody help me do this?
Thanks
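We can use cut to bin the values (assuming they are integers) and then dcast to count the rows per bin:
library(data.table)
# right-closed bins: (-1,0], (0,1], (1,4], (4,Inf]
setDT(df1)[, gr := cut(value, breaks=c(-1, 0, 1, 4, Inf),
labels=c(0, '0<X<2', '2-4', '5 and above'))]
dcast(df1, Name~gr, value.var="value", fun.aggregate=length)
# Name 0 0<X<2 2-4 5 and above
#1: A 1 1 2 0
#2: B 2 0 1 0
#3: C 0 0 0 1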
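For comparison, the same binning can be sketched in base R with cut() and table(). This assumes integer values as in the question; the data frame is rebuilt here so the snippet runs on its own, and the name counts is only illustrative.

```r
# Example data from the question
df1 <- data.frame(Name  = c("A", "A", "A", "A", "B", "B", "B", "C"),
                  value = c(0, 1, 2, 3, 0, 0, 3, 5))

# Right-closed bins for integer values: 0 | 1 | 2 to 4 | 5 and above
gr <- cut(df1$value, breaks = c(-1, 0, 1, 4, Inf),
          labels = c("0", "0<X<2", "2-4", "5 and above"))

# Cross-tabulate names against bins to get the counts
counts <- table(df1$Name, gr)
counts
```

Values below -1 would fall outside the first break and become NA, so the lowest break needs adjusting if negative values can occur.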
You can make a class column in your dataframe, then use table():
df <- read.table(text="Name value
A 0
A 1
A 2
A 3
B 0
B 0
B 3
C 5", header=T)
df[df$value==0, 'class'] <- "0"
df[df$value==1, 'class'] <- "0<X<2"
df[df$value>=2&df$value<=4, 'class'] <- "2-4"
df[df$value>4, 'class'] <- "5 and above"
table(df$Name, df$class)
0 0<X<2 2-4 5 and above
A 1 1 2 0
B 2 0 1 0
C 0 0 0 1
And just for fun:
# counts per range: 0, 0<X<2, 2-4, 5 and above
f <- function(x) c(sum(x == 0), sum(x > 0 & x < 2), sum(x >= 2 & x <= 4), sum(x >= 5))
t(sapply(split(df$value, df$Name), f))
[,1] [,2] [,3] [,4]
A 1 1 2 0
B 2 0 1 0
C 0 0 0 1
I have two sequences of data (with five variables in each sequence) that I want to combine accordingly into one using this rubric:
variable sequence 1 variable sequence 2 variable in combined sequence
0 0 1
0 1 2
1 0 3
1 1 4
Here are some example data:
set.seed(145)
mm <- matrix(0, 5, 10)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("s1_1", "s1_2", "s1_3", "s1_4", "s1_5", "s2_1", "s2_2", "s2_3", "s2_4", "s2_5")
> df
s1_1 s1_2 s1_3 s1_4 s1_5 s2_1 s2_2 s2_3 s2_4 s2_5
1 1 0 0 0 0 0 1 1 0 0
2 1 1 1 0 1 1 0 0 0 0
3 1 1 0 0 0 1 1 0 1 1
4 0 0 1 0 1 1 0 1 0 1
5 0 1 0 0 1 0 0 1 1 0
Here s1_1 represents variable 1 in sequence 1, s2_1 represents variable 1 in sequence 2, and so on. For example, in row 1, s1_1=1 and s2_1=0, so variable 1 in the combined sequence would be coded as 3. How do I do this in R?
Here's a way -
return_value <- function(x, y) {
dplyr::case_when(x == 0 & y == 0 ~ 1,
x == 0 & y == 1 ~ 2,
x == 1 & y == 0 ~ 3,
x == 1 & y == 1 ~ 4)
}
sapply(split.default(df, sub('.*_', '', names(df))), function(x)
return_value(x[[1]], x[[2]]))
# 1 2 3 4 5
#[1,] 3 2 2 1 1
#[2,] 4 3 3 1 3
#[3,] 4 4 1 2 2
#[4,] 2 1 4 1 4
#[5,] 1 3 2 2 3
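split.default splits the columns by variable number (the part of the name after the underscore), so each piece holds the matching column from sequence 1 and sequence 2. With sapply we then apply return_value to the two columns in each piece.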
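Because the rubric maps (0,0), (0,1), (1,0), (1,1) to 1, 2, 3, 4, the combined code is also 2 * s1 + s2 + 1. The following is a minimal sketch of that arithmetic shortcut, assuming every value is strictly 0 or 1; the data are typed in from the printed example, and the names s1, s2 and combined are illustrative.

```r
# Example data, copied row by row from the printed data frame
vals <- c(1,0,0,0,0, 0,1,1,0,0,
          1,1,1,0,1, 1,0,0,0,0,
          1,1,0,0,0, 1,1,0,1,1,
          0,0,1,0,1, 1,0,1,0,1,
          0,1,0,0,1, 0,0,1,1,0)
df <- as.data.frame(matrix(vals, nrow = 5, byrow = TRUE))
names(df) <- c(paste0("s1_", 1:5), paste0("s2_", 1:5))

# Pull out the two sequences as matrices with matching column order
s1 <- as.matrix(df[paste0("s1_", 1:5)])
s2 <- as.matrix(df[paste0("s2_", 1:5)])

# (0,0) -> 1, (0,1) -> 2, (1,0) -> 3, (1,1) -> 4
combined <- 2 * s1 + s2 + 1
colnames(combined) <- 1:5
combined
```

This avoids the per-pair function call, at the cost of silently producing other numbers if a value is not 0 or 1.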
In a larger dataset I would like to identify variables (dummies) coded as 1/2 and transform them to 0/1. They might contain missing values.
## test dataset:
df <- data.frame(
c(1,2,2,1,2,NA,NA,1,2,NA),
c(1,2,2,1,2,1,2,1,2,2),
c(0,1,NA,1,1,0,NA,1,NA,0),
c(0,1,1,1,1,0,0,1,1,0))
names(df) <- c("A","B","C","D")
Columns A and B should be transformed, C and D should remain the same.
## attempts:
df2 <- select(df, function(x) {x %in% c(1,2,NA)})
df2 <- sapply(df, function(x) {(x %in% c(1,2,NA))})
Once identified (which I could not achieve yet), I would like to transform these columns like this: 1 to 0, 2 to 1. In the end I aim to have df2 in the same dimensions as df.
Thank you in advance!!
In each column check if all non-NA values are in 1:2 and if so subtract 1 from each value.
library(dplyr)
df2 <- df %>%
mutate(across(where(~ all(na.omit(.) %in% 1:2)), ~ .x - 1))
or using only base R:
ok <- sapply(df, function(x) all(na.omit(x) %in% 1:2))
df2 <- replace(df, ok, df[ok] - 1)
A cautionary note: there is an ambiguity inherent in the question (fortunately none of the columns in the question are affected). If a column contains only 1, or only 1 and NA, we cannot know whether it is a 0:1 column or a 1:2 column. The code above treats such a column as 1:2 and subtracts 1 from it. If the following returns any columns, check them against what you know about the data and, where that default is wrong, handle them separately.
which(sapply(df, function(x) all(na.omit(x) == 1)))
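One way to handle the ambiguous case is to leave such columns untouched until they have been reviewed. The sketch below assumes that policy; column E is a hypothetical addition (only 1 and NA) included purely to trigger the check, and the names ambiguous and convert are illustrative.

```r
df <- data.frame(
  A = c(1,2,2,1,2,NA,NA,1,2,NA),
  B = c(1,2,2,1,2,1,2,1,2,2),
  C = c(0,1,NA,1,1,0,NA,1,NA,0),
  D = c(0,1,1,1,1,0,0,1,1,0))
# Hypothetical extra column containing only 1 and NA
df$E <- c(1, NA, 1, 1, NA, 1, 1, NA, 1, 1)

# Columns whose non-NA values are all 1 or 2
ok <- sapply(df, function(x) all(na.omit(x) %in% 1:2))
# Columns that could be either 0:1 or 1:2 coded
ambiguous <- sapply(df, function(x) all(na.omit(x) == 1))

# Convert only the unambiguous 1/2 columns; leave the rest as they are
convert <- ok & !ambiguous
df2 <- df
df2[convert] <- df[convert] - 1

names(which(ambiguous))  # columns to review by hand
```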
If you'd like to do it conditionally with the tidyverse:
library(dplyr)
df %>%
mutate(A = case_when(A==1 ~ 0,
A==2 ~ 1),
B = case_when(B==1 ~ 0,
B==2 ~ 1))
Output:
A B C D
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 1 1 1 1
3 1 1 NA 1
4 0 0 1 1
5 1 1 1 1
6 NA 0 0 0
7 NA 1 NA 0
8 0 0 1 1
9 1 1 NA 1
10 NA 1 0 0
You can just use ifelse (NA values stay NA):
df$A <- ifelse(df$A == 1, 0, 1)
df$B <- ifelse(df$B == 1, 0, 1)
A B C D
1 0 0 0 0
2 1 1 1 1
3 1 1 NA 1
4 0 0 1 1
5 1 1 1 1
6 NA 0 0 0
7 NA 1 NA 0
8 0 0 1 1
9 1 1 NA 1
10 NA 1 0 0
You can subtract 1, or test for equality with 2:
# Select columns that contain at least one 2 and no 0.
# Note: a column holding only 1 (and NA) is not selected by this test.
i <- colSums(df == 2, na.rm=TRUE) > 0 & colSums(df == 0, na.rm=TRUE) == 0
#i <- sapply(df, function(x) all(x[!is.na(x)] %in% 1:2)) #Alternative
df[, i] <- df[, i] - 1
#df[, i] <- +(df[, i] == 2) #Alternative
df
# A B C D
#1 0 0 0 0
#2 1 1 1 1
#3 1 1 NA 1
#4 0 0 1 1
#5 1 1 1 1
#6 NA 0 0 0
#7 NA 1 NA 0
#8 0 0 1 1
#9 1 1 NA 1
#10 NA 1 0 0
I have a data frame with negative values in one column, something like this:
df <- data.frame("a" = 1:6,"b"= -(5:10), "c" = rep(8:6,2))
a b c
1 1 -5 8
2 2 -6 7
3 3 -7 6
4 4 -8 8
5 5 -9 7
6 6 -10 6
I want to convert this to a data frame with no negative values in "b" keeping row totals unchanged. I can use column "a" only if "c" is not big enough to absorb the negative values in "b".
The end result should look like this
a b c
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
I feel that sapply could be used, but I don't know how.
You can use pmin and pmax to get the new values for a, b and c.
df$c <- df$c + pmin(0, df$b)
df$b <- pmax(0, df$b)
df$a <- df$a + pmin(0, df$c)
df$c <- pmax(0, df$c)
df
# a b c
#1 1 0 3
#2 2 0 1
#3 2 0 0
#4 4 0 0
#5 3 0 0
#6 2 0 0
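The pmin/pmax steps above can be wrapped in a small function and checked against the requirement that row totals stay unchanged. This is a sketch only; absorb_negative_b is a hypothetical helper name, and it assumes a and c together are large enough to absorb b (otherwise a would go negative).

```r
df <- data.frame(a = 1:6, b = -(5:10), c = rep(8:6, 2))

# Hypothetical helper: push negative b into c first, then any overflow into a
absorb_negative_b <- function(d) {
  d$c <- d$c + pmin(0, d$b)  # c absorbs the negative part of b
  d$b <- pmax(0, d$b)        # b is floored at 0
  d$a <- d$a + pmin(0, d$c)  # a absorbs whatever c could not
  d$c <- pmax(0, d$c)        # c is floored at 0
  d
}

res <- absorb_negative_b(df)
# Row totals must match the original data
stopifnot(all(rowSums(res) == rowSums(df)))
res
```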
You could use dplyr:
library(dplyr)
df %>%
mutate(total=rowSums(.)) %>%
rowwise() %>%
mutate(c=max(b+c, 0),
b=max(b,0),
a=total - c - b) %>%
select(-total)
which returns
# A tibble: 6 x 3
# Rowwise:
a b c
<dbl> <dbl> <dbl>
1 1 0 3
2 2 0 1
3 2 0 0
4 4 0 0
5 3 0 0
6 2 0 0
Here is a base R solution.
df2 <- df
df2$c <- df$c + df$b
df2$a <- ifelse(df2$c < 0, df2$a + df2$c, df2$a)
df2[df2 < 0 ] <- 0
df2
# a b c
# 1 1 0 3
# 2 2 0 1
# 3 2 0 0
# 4 4 0 0
# 5 3 0 0
# 6 2 0 0
In my data, I have 74 observations (rows) and 128 variables (columns), where each variable takes the value 0 or 1. In R, I am trying to write code that, for each row, finds the variables with value 1, randomly picks 80% of them, and changes their value from 1 to 0. I can calculate 80% of the number of 1s in each row, but I am not able to pick these variables in each row and change their value from 1 to 0.
data # data frame with 74 observations and 128 variables
row1 <- data[1,]
count1 <- length(which(data[1,] == 1)) # number of 1s in row 1
print(count1)
perform <- 80/100*count1 # 80% of count1
The code below works for one row:
test <- t(apply(data[1,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
If I specify all the rows, the code does not work:
test <- t(apply(data[1:74,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
Example of desired output:
original data frame
df
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1
When the code is applied to all three rows in df, the output should look like this (80% of the 1s in each row replaced by 0):
a b c d e f
1 1 0 0 0 1 0
2 0 0 1 0 0 0
3 0 1 1 0 0 0
Any suggestions?
Thank you,
Priya
A solution is to use apply row-wise and get the indices where the value is 1 using which. Afterwards, pick 80% of those indices using sample and replace the values at those positions with 0.
t(apply(df, 1, function(x){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# [1,] 0 0 0 1 0 0
# [2,] 0 0 0 1 0 0
# [3,] 0 0 1 0 0 1
# [4,] 0 1 0 0 0 0
# [5,] 0 1 0 0 0 0
# [6,] 1 0 0 0 0 0
# [7,] 0 0 0 0 0 1
# [8,] 0 0 1 0 0 0
# [9,] 0 0 1 0 1 0
# [10,] 0 0 0 0 0 1
Sample Data:
set.seed(1)
df <- data.frame(a = sample(c(0,1,1,1), 10, replace = TRUE),
b = sample(c(0,1,1,1), 10, replace = TRUE),
c = sample(c(0,1,1,1), 10, replace = TRUE),
d = sample(c(0,1,1,1), 10, replace = TRUE),
e = sample(c(0,1,1,1), 10, replace = TRUE),
f = sample(c(0,1,1,1), 10, replace = TRUE))
df
# a b c d e f
# 1 1 0 1 1 1 1
# 2 1 0 0 1 1 1
# 3 1 1 1 1 1 1
# 4 1 1 0 0 1 0
# 5 0 1 1 1 1 0
# 6 1 1 1 1 1 0
# 7 1 1 0 1 0 1
# 8 1 1 1 0 1 1
# 9 1 1 1 1 1 1
# 10 0 1 1 1 1 1
# Answer on OP's data
t(apply(df1, 1, function(x){
onesInX <- which(x==1)
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# 1 1 1 0 0 0 0 <- .8*6 = 4.8 => 4 have been converted to 0
# 2 0 0 0 1 0 0 <- .8*4 = 3.2 => 3 have been converted to 0
# 3 0 1 0 0 0 0 <- .8*5 = 4.0 => 4 have been converted to 0
# Data from OP
df1 <- read.table(text="
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1",
header = TRUE)
df1
# a b c d e f
# 1 1 1 1 1 1 1 <- No of 1 = 6
# 2 1 0 1 1 0 1 <- No of 1 = 4
# 3 1 1 1 0 1 1 <- No of 1 = 5
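The row-wise sampling above can also be written as a reusable function with a seed for reproducibility. This is a sketch under stated assumptions: zero_fraction_of_ones and its frac argument are hypothetical names, and positions are drawn with sample.int so that a single candidate position is never expanded to 1:n by sample().

```r
# Hypothetical helper: set a fraction of the 1s in each row to 0
zero_fraction_of_ones <- function(d, frac = 0.8) {
  m <- as.matrix(d)
  for (r in seq_len(nrow(m))) {
    ones <- which(m[r, ] == 1)          # positions of the 1s in this row
    k <- floor(length(ones) * frac)     # how many of them to zero out
    # Draw k positions among the 1s without replacement
    m[r, ones[sample.int(length(ones), k)]] <- 0
  }
  m
}

df1 <- read.table(text = "
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1",
header = TRUE)

set.seed(42)
res <- zero_fraction_of_ones(df1)
rowSums(res)  # 1s left per row: 6-4, 4-3 and 5-4
```

Which cells are zeroed depends on the seed, but the number of 1s left in each row does not.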
a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to convert the elements of q1 and q2 that are != 1 to 0, and I want to use only []. I believe all subsetting can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
But it doesn't work. What's the problem?
We create a numeric index of the column names that start with 'q' followed by digits ('nm1'), use it to subset the columns in 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
Make sure the columns are of character class by using stringsAsFactors = FALSE in the data.frame call.
The replacement above is based on a logical matrix (a[nm1] != 1), which may cause memory problems if the dataset is really big. In that case, it is better to loop through the columns and replace the values with 0:
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
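As to why the attempt in the question fails: grep returns column positions, so comparing its result with 1 tests the positions rather than the cell values. The sketch below illustrates this; the names idx, row_sel, wrong and right are illustrative, and the recycling of the short logical row index is written out explicitly with rep_len.

```r
a <- data.frame(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"), stringsAsFactors = FALSE)

idx <- grep("q\\d", colnames(a))  # column positions of q1 and q2
row_sel <- idx != 1               # compares positions with 1, not cell values

# A short logical row index is recycled, so it ends up picking every second row
sel_rows <- which(rep_len(row_sel, nrow(a)))
wrong <- a
wrong[sel_rows, idx] <- 0         # zeroes rows 2, 4, 6, ... regardless of content

# Testing the cells themselves gives a logical matrix of the same shape as the subset
right <- a
right[idx][right[idx] != 1] <- 0

head(wrong, 4)
head(right, 4)
```

In wrong, row 4 of q1 is overwritten although it held 1, and row 3 keeps "B"; in right, only the cells different from 1 are replaced.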
Just in case, if you know column names, you can use them for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C
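data.table approach:
library(data.table)
a<-data.table(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
# names of the columns that start with "q"
qcols <- grep("^q", colnames(a), value = TRUE)
# keep 1 where the value equals 1, otherwise 0 (columns become numeric)
a[, (qcols) := lapply(.SD, function(x) ifelse(x == 1, 1, 0)), .SDcols = qcols]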
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C