How to remove duplicate values from different rows per unique identifier? - r

I'm just starting to use R. I have a dataset with in the first column unique identifiers (1958 patients) and in columns 2-35 0's en 1's.
For example:
Patient A: 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 NA NA
I want to change this to:
Patient A: 0 1 0 1 0 1
Thanks in advance.

We can use tapply and grouping our variable based on whether it changes value or not, i.e.
tapply(x[!is.na(x)], cumsum(c(TRUE, diff(x[!is.na(x)]) != 0)), FUN = unique)
#1 2 3 4 5 6
#0 1 0 1 0 1

Based on your example, it is not clear whether NA's can also occur in the middle, and how you would want to deal with that situation (e.g. make 1 NA 1 to 1 1 (option 1) and hence combine the two 1's, or whether NA would mark a boundary and you would keep both 1's (option 2).
That determines at which point to remove NA's in the code.
You could use S4Vectors run length encoding, which would allow you to have more than just 0 and 1.
library(S4Vectors)
## create example data
set.seed(1)
x <- sample(c(0,1), (1958*34), replace=TRUE, prob=c(.4, .6))
x[sample(length(x), 200)] <- NA
x <- matrix(x, nrow=1958, ncol=34)
df <- data.frame(patient.id = paste0("P", seq_len(1958)), x, stringsAsFactors = FALSE)
## define function to remove NA values
# option 1
fun.NA.boundary <- function(x) {
a <- runValue(Rle(x))
a[!is.na(a)]
}
# option 2
fun.NA.remove <- function(x) runValue(Rle(x[!is.na(x)]))
## calculate results
# option 1
reslist <- apply(x[,-1], 1, function(y) fun.NA.boundary(y))
# option 2
reslist <- apply(x[,-1], 1, function(y) fun.NA.remove(y))
names(reslist) <- df$patient.id
head(reslist)
#> $P1
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#>
#> $P2
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#>
#> $P3
#> [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#>
#> $P4
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
#>
#> $P5
#> [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
#>
#> $P6
#> [1] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

Related

How to set a loop to assign lots of variables

I just started using R for a psych class, so please go easy on me. I watched a bunch of youtube videos on For loops, but none have answered my question. I have 4 data frames (A, B, C, D), each with 25 columns. I want to combine the nth column from each data frame together, and save them as an object, like so:
Q1 <- cbind(A[1], B[1], C[1], D[1])
Q2 <- cbind(A[2], B[2], C[2], D[2])
How can I set a loop to do this for all 25 so I don’t have to do it manually?
Thanks in advance
Each of my data frames looks like this (with column headings reflecting the letter of the data frame (i.e. B has QB1, QB2, etc.
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15
1 1 2 2 0 0 2 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
3 1 0 0 0 0 0 1 0 0 2 1 1 0 0 0
4 1 0 0 0 0 0 1 1 0 1 0 2 0 0 0
In order to do it in a for loop, you need to use assign() from baseR and eval_tidy(), sym() from rlang(). Basically, you will need to evaluate strings as variables.
Create simulation data
library(rlang)
nrows = 10
ncols = 25
df_names <- c("A","B","C","D")
for(df_name in df_names){
# assign value to a string as variable
assign(
df_name,
as.data.frame(
matrix(
data = sample(
c(0,1),
size = nrows * ncols,
replace = TRUE
),
ncol = 25
)
)
)
# rename columns
assign(
df_name,
setNames(eval_tidy(sym(df_name)),paste0("Q",df_name,1:ncols))
)
}
Show A
> head(A)
QA1 QA2 QA3 QA4 QA5 QA6 QA7 QA8 QA9 QA10 QA11 QA12 QA13 QA14 QA15 QA16 QA17 QA18 QA19 QA20 QA21 QA22 QA23 QA24 QA25
1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1
2 0 1 0 1 1 1 1 0 1 1 1 1 0 0 0 1 0 1 1 0 1 0 1 1 0
3 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
4 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1
5 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 1 1
6 1 1 0 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 1 1 0 1 1 0
To answer your question:
This should create 25 variables from Q1 to Q25:
# assign dataframes from Q1 to Q25
for(i in 1:25){
new_df_name <- paste0("Q",i)
# initialize Qi with the same number of rows as A,B,C,D ...
assign(
new_df_name,
data.frame(tmp = matrix(NA,nrow = rows))
)
# loop A,B,C,D ... and bind them
for(df_name in df_names){
assign(
new_df_name,
cbind(
eval_tidy(sym(new_df_name)),
eval_tidy(sym(df_name))[,i,drop = FALSE]
)
)
}
# drop tmp to clean up
assign(
new_df_name,
eval_tidy(sym(new_df_name))[,-1]
)
}
Show result:
> Q25
QA25 QB25 QC25 QD25
1 1 0 1 1
2 0 1 0 0
3 1 1 0 0
4 1 0 1 1
5 1 1 0 0
6 0 1 1 1
7 1 0 0 0
8 0 0 0 1
9 1 1 1 0
10 0 0 1 1
The codes should be much easier if you save results in a list using map(). The major complexity is from assigning values to separate variables.
You can combine some dplyr verbs in a for loop to combine the columns from each data set and assign them to 25 new objects.
# merge data, gather, split by var numbers, assign each df to environment
for (i in 1:25) {
df <- cbind(q1,q2,q3,q4) %>% mutate(id=row_number()) %>%
gather(k,v,-id) %>%
mutate(num=sub('A|B|C|D','',k)) %>%
filter(num==i) %>% select(-num) %>% spread(k,v)
assign(paste0('df',i),df)
}
ls(pattern = 'df')
[1] "df1" "df10" "df11" "df12" "df13" "df14" "df15" "df16" "df17" "df18" "df19" "df2"
[13] "df20" "df21" "df22" "df23" "df24" "df25" "df3" "df4" "df5" "df6" "df7" "df8"
[25] "df9"
Code to create initial 4 toy data frames.
# create four toy data frames
q1 <- data.frame(matrix(runif(100),ncol=25))
q2 <- data.frame(matrix(runif(100),ncol=25))
q3 <- data.frame(matrix(runif(100),ncol=25))
q4 <- data.frame(matrix(runif(100),ncol=25))
# set var names for each toy data
names(q1) <- sub('X','A',names(q1))
names(q2) <- sub('X','B',names(q2))
names(q3) <- sub('X','C',names(q3))
names(q4) <- sub('X','D',names(q4))

Change the value of variables that occur 80% of the times in each row, R

In my data, I have 74 observations (rows) and 128 variables (columns), where each variable takes either 0 or 1 as value. In R, I am trying to write a code, where I can find in each row, the variables that has 1 as value and calculate 80% of the times 1 appears in each row. Pick those variables that has 80% of the times value as 1 and change the value from 1 to 0. I could write code, where I can calculate the 80% of times, 1 appears in each row, but I am not able to pick these variables in each row and change their value from 1 to 0.
data# data frame with 74 observations and 128 variables
row1 <- data[1,]
count1 <- length(which(data[1,] == 1)) # #number of 1 in row 1
print(count1)
perform <- 80/100*count1# 80% of count1
Below code works for one row:
test <- t(apply(data[1,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
If specify all the rows, code is not working:
test <- t(apply(data[1:74,], 1, function(x,n){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
Example of desired output:
original data frame
df
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1
When the code is applied to all the three rows in df, output should like this in all the three rows (80% of 1 replaced as 0):
a b c d e f
1 1 0 0 0 1 0
2 0 0 1 0 0 0
3 0 1 1 0 0 0
Thanks
Any suggestions
Thank you
Priya
A solution is to use apply row-wise and get indices where value is 1 using which. Afterwards, pick 80% of those indices (with value as 1) using sample and replace those to '0`.
t(apply(df, 1, function(x){
onesInX <- which(x==1)
# Randomly select 80% of 1 and change to 0
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# [1,] 0 0 0 1 0 0
# [2,] 0 0 0 1 0 0
# [3,] 0 0 1 0 0 1
# [4,] 0 1 0 0 0 0
# [5,] 0 1 0 0 0 0
# [6,] 1 0 0 0 0 0
# [7,] 0 0 0 0 0 1
# [8,] 0 0 1 0 0 0
# [9,] 0 0 1 0 1 0
# [10,] 0 0 0 0 0 1
Sample Data:
set.seed(1)
df <- data.frame(a = sample(c(0,1,1,1), 10, replace = TRUE),
b = sample(c(0,1,1,1), 10, replace = TRUE),
c = sample(c(0,1,1,1), 10, replace = TRUE),
d = sample(c(0,1,1,1), 10, replace = TRUE),
e = sample(c(0,1,1,1), 10, replace = TRUE),
f = sample(c(0,1,1,1), 10, replace = TRUE))
df
# a b c d e f
# 1 1 0 1 1 1 1
# 2 1 0 0 1 1 1
# 3 1 1 1 1 1 1
# 4 1 1 0 0 1 0
# 5 0 1 1 1 1 0
# 6 1 1 1 1 1 0
# 7 1 1 0 1 0 1
# 8 1 1 1 0 1 1
# 9 1 1 1 1 1 1
# 10 0 1 1 1 1 1
# Answer on OP's data
t(apply(df1, 1, function(x){
onesInX <- which(x==1)
x[sample(onesInX, floor(length(onesInX)*.8))] <- 0
x
}))
# a b c d e f
# 1 1 1 0 0 0 0 <- .8*6 = 4.8 => 4 has been converted to 0
# 2 0 0 0 1 0 0 <- .8*5 = 4.0 => 4 has been converted to 0
# 3 0 1 0 0 0 0 <- .8*4 = 3.2 => 3 has been converted to 0
# Data from OP
df1 <- read.table(text="
a b c d e f
1 1 1 1 1 1 1
2 1 0 1 1 0 1
3 1 1 1 0 1 1",
header = TRUE)
df1
# a b c d e f
# 1 1 1 1 1 1 1 <- No of 1 = 6
# 2 1 0 1 1 0 1 <- No of 1 = 4
# 3 1 1 1 0 1 1 <- No of 1 = 5

Modify the column value by other columns in r

I have a CSV table (as a data frame). I want to modify a specific column value by other columns values.
I have prepared a code, but it doesn't work.
The data frame contains 1076 rows and 156 columns.
The formula have to be like this:
if (a[i,"0Q-state"] == "done" ) && (a[i,0Q-01] == NA)) a[i,0Q-01] = 0;
else a[i,0Q-01] = a[i,0Q-01];
but I don't know how can I do this in r.
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 NA
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 NA 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
sapply(c("0Q-01","0Q-02","0Q-03","0Q-04","0Q-05","0Q-06","0Q-07","0Q-08","0Q-09"),
function(y) {
dataset4[,y] <- sapply(c(1:1076), function(x)
ifelse (((is.na(dataset4[x,y])) && (dataset4[x,c("0Q-state")] == "done"))
,0, dataset4[x,y]))}
)
Output has to be:
>dataset4
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 1 NA 1 1
3: done 1 1 1 0 1 1 1 1 1
5: done 1 1 1 1 0 0 0 1 0
6: done 1 1 1 1 0 0 0 1 0
7: 1 1 NA 1 0 0 0 1 0
8: done 1 1 1 1 0 0 0 1 0
we could try:
df[rep(df[, 1] == "done", ncol(df)) & is.na(df)] <- 0
df
1 done 1 1 1 1 1 1 1 1 0
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 0 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
or using sapply():
myFunc <- function(x, y) ifelse(is.na(x) & y == "done", 1, x)
data.frame(df[, 1], sapply(df[, -1], myFunc, y = df[, 1]))
1 done 1 1 1 1 1 1 1 1 NA
2 1 1 1 1 1 1 NA 1 1
3 done 1 1 1 NA 1 1 1 1 1
4 done 1 1 1 1 0 0 0 1 0
5 done 1 1 1 1 0 0 0 1 0
6 1 1 NA 1 0 0 0 1 0
7 done 1 1 1 1 0 0 0 1 0
where you can always substitute df[, 1] with df[, "0Q-state"] and df[, -1] with df[, namesOfDummyVars]
The question has been tagged with data.table and the printed output of dataset4 suggests that dataset4 already is a data.table object.
Here are three variants in data.table syntax to replace NAs in rows which are marked as "done".
# create vector of names of columns to be changed
cols <- sprintf("0Q-%02i", 1:9)
# variant 1
dataset4[`0Q-state` == "done",
(cols) := lapply(.SD, function(x) replace(x, is.na(x), 0L)),
.SDcols = cols][]
0Q-state 0Q-01 0Q-02 0Q-03 0Q-04 0Q-05 0Q-06 0Q-07 0Q-08 0Q-09
1: done 1 1 1 1 1 1 1 1 0
2: 1 1 1 1 1 NA 1 1 NA
3: done 1 1 1 0 1 1 1 1 1
4: done 1 1 1 1 0 0 0 1 0
5: done 1 1 1 1 0 0 0 1 0
6: 1 NA 1 0 0 0 1 0 NA
7: done 1 1 1 1 0 0 0 1 0
or
# variant 2
lapply(cols, function(i) dataset4[`0Q-state` == "done" & is.na(get(i)), (i) := 0L])
dataset4
returning the same as above
or
# variant 3 --- data.table development version 1.10.5
for (i in cols)
set(dataset4, which(dataset4[, "0Q-state"] == "done" & is.na(dataset4[, ..i])), i, 0L)
dataset4

How to count the frequency in binary tranactions per column, and adding the result after last row in R?

I have a txt file (data5.txt):
1 0 1 0 0
1 1 1 0 0
0 0 1 0 0
1 1 1 0 1
0 0 0 0 1
0 0 1 1 1
1 0 0 0 0
1 1 1 1 1
0 1 0 0 1
1 1 0 0 0
I need to count the frequency of one's and zero's in each column
if the frequency of ones >= frequency of zero's then I will print 1 after the last row for that Colum
I'm new in R, but I tried this, and I got error:
Error in if (z >= d) data[n, i] = 1 else data[n, i] = 0 :
missing value where TRUE/FALSE needed
my code:
data<-read.table("data5.txt", sep="")
m =length(data)
d=length(data[,1])/2
n=length(data[,1])+1
for(i in 1:m)
{
z=sum(data[,i])
if (z>=d) data[n,i]=1 else data[n,i]=0
}
You may try this:
rbind(df, ifelse(colSums(df == 1) >= colSums(df == 0), 1, NA))
# V1 V2 V3 V4 V5
# 1 1 0 1 0 0
# 2 1 1 1 0 0
# 3 0 0 1 0 0
# 4 1 1 1 0 1
# 5 0 0 0 0 1
# 6 0 0 1 1 1
# 7 1 0 0 0 0
# 8 1 1 1 1 1
# 9 0 1 0 0 1
# 10 1 1 0 0 0
# 11 1 1 1 NA 1
Update, thanks to a nice suggestion from #Arun:
rbind(df, ifelse(colSums(df == 1) >= ceiling(nrow(df)/2), 1, NA)
or even:
rbind(df, ifelse(colSums(df == 1) >= nrow(df)/2, 1, NA)
Thanks to #SvenHohenstein.
Possibly I misinterpreted your intended results. If you want 0 when frequency of ones is not equal or larger than frequency of zero, then this suffice:
rbind(df, colSums(df) >= nrow(df) / 2)
Again, thanks to #SvenHohenstein for his useful comments!

Expand a single column to a wide/model matrix format

Suppose I have a column in a matrix or data.frame as follows:
df <- data.frame(col1=sample(letters[1:3], 10, TRUE))
I want to expand this out to multiple columns, one for each level in the column, with 0/1 entries indicating presence or absence of level for each row
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
for (i in 1:length(levels(df$col1))) {
curLetter <- levels(df$col1)[i]
newdf[which(df$col1 == curLetter), curLetter] <- 1
}
newdf
I know there's a simple clever solution to this, but I can't figure out what it is.
I've tried expand.grid on df, which returns itself as is. Similarly melt in the reshape2 package on df returned df as is. I've also tried reshape but it complains about incorrect dimensions or undefined columns.
Obviously, model.matrix is the most direct candidate here, but here, I'll present three alternatives: table, lapply, and dcast (the last one since this question is tagged reshape2.
table
table(sequence(nrow(df)), df$col1)
#
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
lapply
newdf <- data.frame(a=rep(0, 10), b=rep(0,10), c=rep(0,10))
newdf[] <- lapply(names(newdf), function(x)
{ newdf[[x]][df[,1] == x] <- 1; newdf[[x]] })
newdf
# a b c
# 1 1 0 0
# 2 0 1 0
# 3 0 1 0
# 4 0 0 1
# 5 1 0 0
# 6 0 0 1
# 7 0 0 1
# 8 0 1 0
# 9 0 1 0
# 10 1 0 0
dcast
library(reshape2)
dcast(df, sequence(nrow(df)) ~ df$col1, fun.aggregate=length, value.var = "col1")
# sequence(nrow(df)) a b c
# 1 1 1 0 0
# 2 2 0 1 0
# 3 3 0 1 0
# 4 4 0 0 1
# 5 5 1 0 0
# 6 6 0 0 1
# 7 7 0 0 1
# 8 8 0 1 0
# 9 9 0 1 0
# 10 10 1 0 0
It's very easy with model.matrix
model.matrix(~ df$col1 + 0)
The term + 0 means that the intercept is not included. Hence, you receive a dummy variable for each factor level.
The result:
df$col1a df$col1b df$col1c
1 0 0 1
2 0 1 0
3 0 0 1
4 1 0 0
5 0 1 0
6 1 0 0
7 1 0 0
8 0 1 0
9 1 0 0
10 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`df$col1`
[1] "contr.treatment"

Resources