Remove rows with >50% missing across certain columns in R

here is my data:
data <- data.frame(id=c(1,2,3,4,5),
ethnicity=c("asian",NA,NA,NA,"asian"),
age=c(34,NA,NA,NA,65),
a1=c(3,4,5,2,7),
a2=c("y","y","y",NA,NA),
a3=c("low", NA, "high", "med", NA),
a4=c("green", NA, "blue", "orange", NA))
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
5 asian 65 7 <NA> <NA> <NA>
I would like to remove rows that have >50% missing in columns a1 to a4.
I have tried the code below, but I am having trouble specifying the columns that this should apply to:
data[which(rowMeans(!is.na(data)) > 0.5), ] #This doesn't specify the column
miss2 <- c()
for(i in 1:nrow(data)) {
if(length(which(is.na(data[4:7,]))) >= 0.5*ncol(data)) miss2 <- append(miss2,4:7)
}
data1 <- data[-miss2,]
# I thought I specified the columns here, but I'm not getting the output I was hoping for (i.e. id 4 doesn't show up)
The code above looks at missing values in all columns. I would like to look only at the % of missing values in columns a1, a2, a3, a4. What I'm hoping to get is below:
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
Any help is appreciated, thank you!

You're really close. The main fixes are to restrict is.na() to the columns of interest (4 to 7) and to index with the logical vector directly; which() (a vector of indices) is not needed:
keep <- rowMeans(is.na(data[,4:7])) <= 0.5
keep
[1] TRUE TRUE TRUE TRUE FALSE
data[keep,]
id ethnicity age a1 a2 a3 a4
1 1 asian 34 3 y low green
2 2 <NA> NA 4 y <NA> <NA>
3 3 <NA> NA 5 y high blue
4 4 <NA> NA 2 <NA> med orange
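If the position of the columns is not fixed, the same idea can be wrapped in a small function that takes column names. This is a minimal sketch under that assumption; the name drop_sparse_rows and its arguments are made up for illustration and are not part of the answer above.

```r
data <- data.frame(
  id = c(1, 2, 3, 4, 5),
  ethnicity = c("asian", NA, NA, NA, "asian"),
  age = c(34, NA, NA, NA, 65),
  a1 = c(3, 4, 5, 2, 7),
  a2 = c("y", "y", "y", NA, NA),
  a3 = c("low", NA, "high", "med", NA),
  a4 = c("green", NA, "blue", "orange", NA)
)

# Hypothetical helper: drop rows whose share of NA in `cols` exceeds `max_missing`.
drop_sparse_rows <- function(df, cols, max_missing = 0.5) {
  # drop = FALSE keeps a data frame even when `cols` names a single column.
  na_share <- rowMeans(is.na(df[, cols, drop = FALSE]))
  df[na_share <= max_missing, , drop = FALSE]
}

kept <- drop_sparse_rows(data, c("a1", "a2", "a3", "a4"))
kept$id
```

Row 2 has exactly 50% missing in a1 to a4, so it is kept; only row 5 (75% missing) is dropped.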

Just for fun a dplyr approach:
Here we combine rowwise() with a comparison directly in filter(). For each row we count the NAs across a1:a4, divide by the number of columns, and keep the row if the proportion is <= 0.5.
Because c_across() combines the values of a row into one vector, the columns a1:a4 must first share a class, so a1 is converted to character:
library(dplyr)
data %>%
rowwise() %>%
mutate(a1 = as.character(a1)) %>%
filter(sum(is.na(c_across(a1:a4))) / length(c_across(a1:a4)) <= 0.5)
id ethnicity age a1 a2 a3 a4
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 asian 34 3 y low green
2 2 NA NA 4 y NA NA
3 3 NA NA 5 y high blue
4 4 NA NA 2 NA med orange

data[rowSums(is.na(data[, -c(1:3)])) / 4 <= .5, ]
#> id ethnicity age a1 a2 a3 a4
#> 1 1 asian 34 3 y low green
#> 2 2 <NA> NA 4 y <NA> <NA>
#> 3 3 <NA> NA 5 y high blue
#> 4 4 <NA> NA 2 <NA> med orange

Related

Extracting a numeric information align with ID from unstructured dataset in R

I am trying to extract score information for each ID and for each itemID. Here how my sample dataset looks like.
df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
"Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
"Item 1: Time in seconds","Item 1: Errors",
"Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
"Item 2: Time in seconds","Item 2: Errors"),
School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))
> df
Text_1 School.2 X_Elementary_School..3 School.4 Y_Elementary_School..2
1 Scoring Teacher: Bill: Teacher: John:
2 1 = Incorrect DC Name: X District DC Name: X District
3 Text1 Date (mm/dd/yyyy): 10/7/21 Date (mm/dd/yyyy): 11/7/21
4 Text2 Child Grade: K Child Grade: K
5 Text3 Student Study ID: 123-2222-2: Student Study ID: 112-1111-3:
6 Text4 <NA> <NA> <NA> <NA>
7 Demo 1: Color Naming <NA> <NA> 0 <NA>
8 Amarillo <NA> <NA> <NA> <NA>
9 Azul <NA> <NA> 1 <NA>
10 Verde <NA> <NA> <NA> <NA>
11 Azul <NA> <NA> <NA> <NA>
12 Demo 1: Errors 0 <NA> 0 <NA>
13 Item 1: Color naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
14 Amarillo 0 <NA> 0 <NA>
15 Azul 1 <NA> 1 <NA>
16 Verde <NA> <NA> <NA> <NA>
17 Azul <NA> <NA> <NA> <NA>
18 Item 1: Time in seconds <NA> <NA> 120 <NA>
19 Item 1: Errors 0 <NA> 0 <NA>
20 Item 2: Shape Naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
21 Cuadrado/Cuadro 0 <NA> <NA> <NA>
22 Cuadrado/Cuadro <NA> <NA> 1 <NA>
23 Círculo <NA> <NA> 0 <NA>
24 Estrella 1 <NA> 1 <NA>
25 Círculo 1 <NA> <NA> <NA>
26 Triángulo 0 <NA> 1 <NA>
27 Item 2: Time in seconds <NA> <NA> 110 <NA>
28 Item 2: Errors 0 <NA> 0 <NA>
This sample dataset is limited to two schools, two teachers, and two students.
In this step, I need to extract the student responses for each item.
Wherever the first column contains Item, I need to grab the rows from there. I need to find the rows and columns programmatically rather than by giving exact row and column numbers, since this will be run on multiple data files and each file holds different information. There is no need to grab the ..:Error part.
################################################################################
# ## 2-extract the score information here
# ## 1-grab item information from where "Item 1:.." starts
Here, how can I automate this part rather than using row numbers?
score<-df[c(7:11,13:17,20:26),c(seq(2,dim(df)[2],2))] # need to automate row and columns index here
score<-as.data.frame(t(score))
rownames(score)<-seq(1,nrow(score),1)
colnames(score)<-paste0('i',seq(1,ncol(score),1)) # assign col names for items
score<-apply(score,2,as.numeric) # only keep numeric columns
score<-as.data.frame(score)
score$total<-rowSums(score,na.rm=T); score # create a total score
> score
i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
Additionally, I need to add ID which I could not achieve here.
My desired output would be:
> score
ID i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 123-2222-2 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 112-1111-3 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5

R strsplit for uneven number of columns in a huge data set

I have a huge data set with about 200 columns and 25k+ rows, with the separator ';'. The columns are of an uneven number.
I read it in as a delimited txt file with df <- read.delim("~path/data.txt", sep=";", header = FALSE)
which reads nicely as a table.
My issue is that many of the rows are so long that in the txt file they often spill onto new lines, and R does not recognise that these lines continue the same row. As a result, the columns contain information that belongs elsewhere.
Each observation of data is a dbl.
I have created a new example below for ease of reading, therefore it is not possible to simply sort conditions into columns.
***EDIT: x, y and z contain spatial coordinates, but I have substituted them for their corresponding letters for ease of reading.
The data is X-profile data giving me coordinates of the centre point along a line, followed by offsets of 1m (up to 100m either side of 0, the centre line) in each column with its corresponding height ***
My data ends up looking something like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9]
[1] x y z 1 2 3 N/A N/A N/A
[2] x y z 1 2 3 4 5 6
[3] 7 8 9 10 N/A N/A N/A N/A N/A
[4] x y z 1 2 3 4 5 7
[5] 7 8 9 N/A N/A N/A N/A N/A N/A
[6] x y z 1 2 3 N/A N/A N/A
[7] x y z 1 2 3 4 5 N/A
And I'd like it to look like this:
[c1] [c2] [c3] [c4] [c5] [c6] [c7] [c8] [c9] [c10] [c11] [c12] [c13]
[1] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[2] x y z 1 2 3 4 5 6 7 8 9 10
[3] x y z 1 2 3 4 5 6 7 8 9 N/A
[4] x y z 1 2 3 N/A N/A N/A N/A N/A N/A N/A
[5] x y z 1 2 3 4 5 N/A N/A N/A N/A N/A
I have tried strsplit(as.character(df), split = "\n", fixed = TRUE) and it returns an error that it is not a character string. I have tried the same function with split = "\t" and split = "\r" and it returns the same error. Each attempt takes around half an hour to process so I was also wondering if there is a more efficient way to do this.
I hope I have explained clearly my aim.
EDIT
The text file is similar to the following example:
x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15
In some cases a number is split between the previous line and that below:
E.G.
101;102;103;10
4;105;106
This layout is exactly how it is being read into R.
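Use scan(), which skips empty lines by default, to read the file as a character vector (here one element per physical line, since the lines contain no whitespace). grep('^x', r) gives the positions of the lines that start a record, and findInterval() assigns every line to the most recent such start, so split() groups each record with its continuation lines. Pasting each group back together also repairs numbers that were split across two lines. After that it is the usual strsplit() on ';', padding every record to the longest length, and rbind.
r <- scan('foo.txt', what = character(), quiet = TRUE)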
r <- split(r, findInterval(seq_len(length(r)), grep('^x', r))) |>
lapply(paste, collapse='') |>
lapply(strsplit, ';') |>
lapply(el) |>
{\(.) lapply(., `length<-`, max(lengths(.)))}() |>
do.call(what=rbind) |>
as.data.frame()
r
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
# 1 x y z 1 2 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 2 x y z 1 2 3 4 5 6 7 8 9 10 <NA> <NA> <NA> <NA> <NA>
# 3 x y z 1 2 3 4 5 6 7 8 9 <NA> <NA> <NA> <NA> <NA> <NA>
# 4 x y z 1 2 3 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 5 x y z 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data:
writeLines(text='x;y;z;1;2;3
x;y;z;1;2;3;4;5;6;
7;8;9;10
x;y;z;1;2;3;4;5;6;
7;8;9
x;y;z;1;2;3;4
x;y;z;1;2;3;4;5;6;
7;8;9;10;11;12;13;
14;15', 'foo.txt')
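The least obvious step in the pipeline above is the padding with `length<-`. As a minimal sketch on made-up records (the parts list is illustrative, not the real data), extending a vector's length fills the new positions with NA, which is what lets rbind build a rectangular result:

```r
# Two records of different length, as strsplit() would return them.
parts <- list(c("x", "y", "z"), c("x", "y", "z", "1", "2"))

# Pad every record to the length of the longest one; new slots become NA.
width <- max(lengths(parts))
padded <- lapply(parts, `length<-`, width)

# Now all records have the same length and can be stacked row by row.
mat <- do.call(rbind, padded)
as.data.frame(mat)
```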
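Using data.table (this assumes df has already been read in as in the question, with columns named c1 to c9):
library(data.table)
dt <- data.table(df)
# start a new group at every row whose first column is "x"
dt[, grp := cumsum(c1 == "x")]
# join each record to its continuation row; c1 == 7 matches the example data only
dt <- merge(dt[c1 == "x"], dt[c1 == 7], by = "grp", all = TRUE)[, grp := NULL]
names(dt) <- paste0("c", 1:ncol(dt))
Resulting in: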
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18
1: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
2: x y z 1 2 3 4 5 6 7 8 9 10 NA NA NA NA NA
3: x y z 1 2 3 4 5 7 7 8 9 NA NA NA NA NA NA
4: x y z 1 2 3 NA NA NA NA NA NA NA NA NA NA NA NA
5: x y z 1 2 3 4 5 NA NA NA NA NA NA NA NA NA NA

Adding values to columns based on multiple conditions

I have 1 df as below
df <- data.frame(n1 = c(1,2,1,2,5,6,8,9,8,8),
n2 = c(100,1000,500,1,NA,NA,2,8,10,15),
n3 = c("a", "a", "a", NA, "b", "c",NA,NA,NA,NA),
n4 = c("red", "red", NA, NA, NA, NA,NA,NA,NA,NA))
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a <NA>
4 2 1 <NA> <NA>
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> <NA>
8 9 8 <NA> <NA>
9 8 10 <NA> <NA>
10 8 15 <NA> <NA>
First, please see my desired output
df
n1 n2 n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
5 5 NA b <NA>
6 6 NA c <NA>
7 8 2 <NA> red
8 9 8 <NA> red
9 8 10 <NA> red
10 8 15 <NA> red
I made this post before (Adding values to one columns based on conditions). However, I realized that I need to take one more column to solve my problem.
So, I would like to fill in red in n4 based on conditions coming from n1, n2, and n3. If n3 == "a", then every row whose n1 value is associated with a should get red in n4 (i.e. rows 3 and 4). At the same time, if a value of n1 also appears in n2 (e.g. 2), then that row of n4 should also get red. Further, 8 in column n1 is connected to the rest in the same way, so any further rows where n1 or n2 equals 8 are treated the same, and the step repeats. I hope this is clear; if not, I am happy to explain more. (It works like a zig-zag.)
-Note: tidyverse and baseR also welcomed to help me here.
Any suggestions for me please?
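You can try the code below if you are using igraph. It treats n1 and n2 as the two ends of an edge, splits the graph into connected components, and copies the non-missing n4 value to every edge of a component:
library(igraph)
res <- do.call(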
rbind,
lapply(
decompose(
graph_from_data_frame(replace(df, is.na(df), "NA"))
),
function(x) {
n4 <- E(x)$n4
if (!all(n4 == "NA")) {
E(x)$n4 <- unique(n4[n4 != "NA"])
}
get.data.frame(x)
}
)
)
dfout <- type.convert(
res[match(do.call(paste, df[1:2]), do.call(paste, res[1:2])), ],
as.is = TRUE
)
which gives
> dfout
from to n3 n4
1 1 100 a red
2 2 1000 a red
3 1 500 a red
4 2 1 <NA> red
9 5 NA b <NA>
10 6 NA c <NA>
5 8 2 <NA> red
6 9 8 <NA> red
7 8 10 <NA> red
8 8 15 <NA> red
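For readers without igraph, the same "connected rows share a label" idea can be sketched in base R by repeatedly spreading the label until nothing changes. This is only an illustration under stated assumptions: the helper name fill_connected is made up, it handles a single label, and like the igraph answer it links rows through shared n1/n2 values and does not use n3.

```r
df <- data.frame(
  n1 = c(1, 2, 1, 2, 5, 6, 8, 9, 8, 8),
  n2 = c(100, 1000, 500, 1, NA, NA, 2, 8, 10, 15),
  n3 = c("a", "a", "a", NA, "b", "c", NA, NA, NA, NA),
  n4 = c("red", "red", NA, NA, NA, NA, NA, NA, NA, NA)
)

# Hypothetical helper: spread one label through rows linked by shared n1/n2 values.
fill_connected <- function(df, label = "red") {
  tagged <- !is.na(df$n4) & df$n4 == label
  repeat {
    # All non-missing n1/n2 values that occur in an already tagged row.
    nodes <- unique(c(df$n1[tagged], df$n2[tagged]))
    nodes <- nodes[!is.na(nodes)]
    # A row joins the group when either of its values is a known node.
    grown <- tagged | df$n1 %in% nodes | df$n2 %in% nodes
    if (identical(grown, tagged)) break
    tagged <- grown
  }
  df$n4[tagged] <- label
  df
}

out <- fill_connected(df)
out$n4
```

Rows 5 and 6 stay NA because their n1 values never appear in a tagged row and their n2 values are missing.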

Combine two dataframes same/different names [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 3 years ago.
I have 2 data frames. I am trying to combine them by rows, keeping not only the columns with common names but also those with different names, and filling in NA where a value is not found.
I tried a normal rbind, but it requires the same column names.
Dataframes:
d1 <- data.frame(a=c('a1','a2','a3'), b = c("a51","a52","a53"), d = c(12,13,14))
d2 <- data.frame(a=c('a4','a5','a6'), g = c("a151","a152","a153"), k = c(122,123,124))
Expected Output:
a b d g k
1 a1 a51 12 <NA> NA
2 a2 a52 13 <NA> NA
3 a3 a53 14 <NA> NA
4 a4 <NA> NA a151 122
5 a5 <NA> NA a152 123
6 a6 <NA> NA a153 124
Here is an option with bind_rows
library(dplyr)
bind_rows(d1, d2)
# a b d g k
#1 a1 a51 12 <NA> NA
#2 a2 a52 13 <NA> NA
#3 a3 a53 14 <NA> NA
#4 a4 <NA> NA a151 122
#5 a5 <NA> NA a152 123
#6 a6 <NA> NA a153 124
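Or using rbindlist with fill = TRUE, so that columns are matched by name and missing ones are filled with NA
library(data.table)
rbindlist(list(d1, d2), fill = TRUE)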
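If adding a package is not an option, the same result can be approximated in base R by first giving both data frames the full set of columns. This is a rough sketch, not one of the answers above; the helper name pad_cols is made up, and it assumes R >= 4.0 where strings are not converted to factors.

```r
d1 <- data.frame(a = c("a1", "a2", "a3"), b = c("a51", "a52", "a53"), d = c(12, 13, 14))
d2 <- data.frame(a = c("a4", "a5", "a6"), g = c("a151", "a152", "a153"), k = c(122, 123, 124))

# Every column name that appears in either data frame, in order of appearance.
all_cols <- union(names(d1), names(d2))

# Hypothetical helper: add the missing columns as NA and put them in a common order.
pad_cols <- function(d, cols) {
  for (col in setdiff(cols, names(d))) d[[col]] <- NA
  d[cols]
}

combined <- rbind(pad_cols(d1, all_cols), pad_cols(d2, all_cols))
combined
```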

Transfer pivottable to another table in R

In my research I have a dataset of cancer patients with some clinical information like cancer stage and treatment etc. Each patient has one row in a table with this clinical information. In addition, each patient has, at one or several timepoints during the treatment, taken blood samples, depending on how long the patient has been followed at the clinic. The first sample is from the first visit and the second sample is from the second visit at the clinic, and so on.
In the table, there is a variable (ie. column) that is named Sample_Time_1, which is the time for the first sample. Sample_Time_2 has the time (date) for the second sample and so on.
However - the samples were analysed at the lab and I got the result in a pivottable, which means I have a table where each sample has one row and therefore the results from one patient is displayed on several rows.
For example, create two tables:
x <- c(1,2,2,3,3,3,3,4,5,6,6,6,6,7,8,9,9,10)
y <- as.Date(c("2011-05-17","2012-06-30","2012-08-11","2011-10-15","2011-11-25","2012-01-07","2012-02-15","2011-08-13","2012-02-03","2011-11-08","2011-12-21","2012-02-01","2012-03-12","2012-01-03","2012-04-20","2012-03-31","2012-05-10","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
z <- c(123,185,153,153,125,148,168,187,194,115,165,167,143,151,129,130,151,134)
Sheet_1 <- matrix(c(x,y,z), ncol=3, byrow=FALSE)
colnames(Sheet_1) <- c("ID","Sample_Time", "Sample_Value")
Sheet_1 <- as.data.frame(Sheet_1)
Sheet_1$Sample_Time <- y
x1 <- c(1,2,3,4,5,6,7,8,9,10)
x2 <- c(3,3,2,3,2,2,4,2,3,3)
x3 <- c(1,2,2,3,3,1,3,1,1,2)
x4 <- as.Date(c("2011-05-17","2012-06-30","2011-10-15","2011-08-13","2012-02-03","2011-11-08","2012-01-03","2012-04-20","2012-03-31","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
x5 <- as.Date(c(NA,"2012-08-11","2011-11-25",NA,NA,"2011-12-21",NA,NA,"2012-05-10",NA), format="%Y-%m-%d", origin="1960-01-01")
x6 <- as.Date(c(NA,NA,"2012-01-07",NA,NA,"2012-02-01",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
x7 <- as.Date(c(NA,NA,"2012-02-15",NA,NA,"2012-03-12",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
Sheet_2 <- as.data.frame(c(1:10))
colnames(Sheet_2) <- "ID"
Sheet_2$Stage <- x2
Sheet_2$Treatment <- x3
Sheet_2$Sample_Time_1 <- x4
Sheet_2$Sample_Time_2 <- x5
Sheet_2$Sample_Time_3 <- x6
Sheet_2$Sample_Time_4 <- x7
Sheet_2$Sample_Value_1 <- NA
Sheet_2$Sample_Value_2 <- NA
Sheet_2$Sample_Value_3 <- NA
Sheet_2$Sample_Value_4 <- NA
I would like to transfer the Sample_Value for the first date a sample was taken from a patient from Sheet_1 to Sheet_2$Sample_Value_1 and if there are more samples, I would like to transfer them to column "Sample_Value_2" and so on.
I have tried a double for-loop. For each patient (=ID) in Sheet_1 I run through Sheet_2, and if there is a match on ID, I use another for-loop to see if there is a match on a Sample_Time and insert (using if) the Sample_Value. However, I cannot get it to work, and I have a strong feeling there must be a better way.
Any suggestions?
Is this what you want:
Prepare Sheet_1 for reshaping from long to wide by introducing an extra column with unique ID for each blood sample per patient
Sheet_1$uniqid <- with(Sheet_1, ave(as.character(ID), ID, FUN = seq_along))
And with this, do the re-shaping
S_1 <- reshape( Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
which gives you
> S_1
ID Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2 Sample_Time.3
1 1 2011-05-17 123 <NA> NA <NA>
2 2 2012-06-30 185 2012-08-11 153 <NA>
4 3 2011-10-15 153 2011-11-25 125 2012-01-07
8 4 2011-08-13 187 <NA> NA <NA>
9 5 2012-02-03 194 <NA> NA <NA>
10 6 2011-11-08 115 2011-12-21 165 2012-02-01
14 7 2012-01-03 151 <NA> NA <NA>
15 8 2012-04-20 129 <NA> NA <NA>
16 9 2012-03-31 130 2012-05-10 151 <NA>
18 10 2011-12-15 134 <NA> NA <NA>
Sample_Value.3 Sample_Time.4 Sample_Value.4
1 NA <NA> NA
2 NA <NA> NA
4 148 2012-02-15 168
8 NA <NA> NA
9 NA <NA> NA
10 167 2012-03-12 143
14 NA <NA> NA
15 NA <NA> NA
16 NA <NA> NA
18 NA <NA> NA
The number after the dot in the colnames is the uniqid.
Now you can merge the relevant columns from Sheet_2
S_2 <- merge( Sheet_2[ 1:3 ], S_1, by = "ID" )
and the result should be what you are looking for:
> S_2
ID Stage Treatment Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2
1 1 3 1 2011-05-17 123 <NA> NA
2 2 3 2 2012-06-30 185 2012-08-11 153
3 3 2 2 2011-10-15 153 2011-11-25 125
4 4 3 3 2011-08-13 187 <NA> NA
5 5 2 3 2012-02-03 194 <NA> NA
6 6 2 1 2011-11-08 115 2011-12-21 165
7 7 4 3 2012-01-03 151 <NA> NA
8 8 2 1 2012-04-20 129 <NA> NA
9 9 3 1 2012-03-31 130 2012-05-10 151
10 10 3 2 2011-12-15 134 <NA> NA
Sample_Time.3 Sample_Value.3 Sample_Time.4 Sample_Value.4
1 <NA> NA <NA> NA
2 <NA> NA <NA> NA
3 2012-01-07 148 2012-02-15 168
4 <NA> NA <NA> NA
5 <NA> NA <NA> NA
6 2012-02-01 167 2012-03-12 143
7 <NA> NA <NA> NA
8 <NA> NA <NA> NA
9 <NA> NA <NA> NA
10 <NA> NA <NA> NA
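One assumption behind the numbering step at the start of this answer is that the rows of Sheet_1 are already in date order within each patient, which holds for the example data. If that is not guaranteed, sorting first keeps sample 1 as the earliest visit. A minimal sketch on made-up rows (the samples data frame is illustrative only):

```r
# Made-up long-format samples, deliberately out of date order.
samples <- data.frame(
  ID = c(2, 1, 2, 3, 3),
  Sample_Time = as.Date(c("2012-08-11", "2011-05-17", "2012-06-30",
                          "2011-11-25", "2011-10-15"))
)

# Sort by patient and date so the counter follows the visit order.
samples <- samples[order(samples$ID, samples$Sample_Time), ]

# ave() applies seq_along() within each ID, giving 1, 2, ... per patient.
samples$uniqid <- ave(samples$ID, samples$ID, FUN = seq_along)
samples
```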
