R - Replace values in a specific even column based on values from a specific odd column - Application to the whole dataframe

My data frame:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
For the pair A and qA (qA = quality of A): I want every value of A whose quality value is 1 or 3 to be replaced by NA.
The same applies to the pair B and qB.
The final data has to be like this:
desired_data <- data.frame(A = c(NA,5,6,NA,NA), qA = c(1,2,2,3,1), B = c(2,5,NA,NA,NA), qB = c(2,2,1,3,1))
My question is: how can I do that?
I have a big dataframe with about 90 columns, so I need code that works without typing out the column names.
To help, here is a piece of code that selects the columns whose names start with the letter "q":
data[,grep("^[q]", colnames(data))]

You could just do this...
data[,seq(1,ncol(data),2)][(data[,seq(2,ncol(data),2)]==1)|
(data[,seq(2,ncol(data),2)]==3)] <- NA
data
A qA B qB
1 NA 1 2 2
2 5 2 5 2
3 6 2 NA 1
4 NA 3 NA 3
5 NA 1 NA 1

One solution is to split the data into two tables and use vectorisation in base R:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
data
#> A qA B qB
#> 1 1 1 2 2
#> 2 5 2 5 2
#> 3 6 2 6 1
#> 4 8 3 8 3
#> 5 7 1 4 1
quality <- data[,grep("^[q]", colnames(data))]
data2 <- data[,setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#> A B
#> 1 NA 2
#> 2 5 5
#> 3 6 NA
#> 4 NA NA
#> 5 NA NA
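If the result has to stay in the original data frame rather than in a separate table, the same idea can be written back in place. This is only a minimal sketch, assuming every quality column is named "q" followed by the name of its value column (as in the example data); q_cols, val_cols, bad and vals are names chosen here for illustration.

```r
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1),
                   B = c(2,5,6,8,4), qB = c(2,2,1,3,1))

# Quality columns and the value columns they describe
# (assumption: quality column name = "q" + value column name)
q_cols   <- grep("^q", names(data), value = TRUE)
val_cols <- sub("^q", "", q_cols)
stopifnot(all(val_cols %in% names(data)))

# Logical matrix: TRUE where the quality flag is 1 or 3
bad <- data[q_cols] == 1 | data[q_cols] == 3

# Blank out the flagged values, then write the value columns back
vals <- data[val_cols]
vals[bad] <- NA
data[val_cols] <- vals
data
```

Pairing by name rather than by position means the columns do not have to alternate strictly, but a value column whose own name starts with "q" would break the assumption.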

Related

R - Merging rows with numerous NA values to another column

I would like to ask the R community for help with finding a solution for my data, where any consecutive row with numerous NA values is combined and put into a new column.
For example:
df <- data.frame(A= c(1,2,3,4,5,6), B=c(2, NA, NA, 5, NA, NA), C=c(1,2,NA,4,5,NA), D=c(3,NA,5,NA,NA,NA))
A B C D
1 1 2 1 3
2 2 NA 2 NA
3 3 NA NA 5
4 4 5 4 NA
5 5 NA 5 NA
6 6 NA NA NA
Must be transformed to this:
A B C D E
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
I would like to do the following:
Identify consecutive rows that have more than 1 NA value -> combine the entries from those consecutive rows into a single combined entry
Place the above combined entry in new column "E" on the prior row
This is quite complex (for me!) and I am wondering if anyone can offer any help with this. I have searched for some similar problems, but have been unable to find one that produces a similar desired output.
Thank you very much for your thoughts--
Using tidyr and dplyr:
Concatenate values for each row.
Keep the concatenated values only for rows with more than one NA.
Group each “good” row with all following “bad” rows.
Use a grouped summarize() to concatenate “bad” row values to a single string.
library(dplyr)
library(tidyr)

df %>%
  unite("E", everything(), remove = FALSE, sep = " ") %>%
  mutate(
    E = if_else(
      rowSums(across(!E, is.na)) > 1,
      E,
      ""
    ),
    new_row = cumsum(E == "")
  ) %>%
  group_by(new_row) %>%
  summarize(
    across(A:D, first),
    E = trimws(paste(E, collapse = " "))
  ) %>%
  select(!new_row)
# A tibble: 2 × 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 1 3 2 NA 2 NA 3 NA NA 5
2 4 5 4 NA 5 NA 5 NA 6 NA NA NA
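For comparison, the same four steps can be sketched in base R. This is an illustrative version, not part of the answer above; it assumes the NA entries are real NA values (not the string "NA"), and bad, grp, row_txt and result are names chosen here.

```r
df <- data.frame(A = c(1,2,3,4,5,6), B = c(2,NA,NA,5,NA,NA),
                 C = c(1,2,NA,4,5,NA), D = c(3,NA,5,NA,NA,NA))

# "Bad" rows have more than one NA
bad <- rowSums(is.na(df)) > 1

# Each "good" row starts a new group that also holds the bad rows after it
grp <- cumsum(!bad)

# Concatenate the values of every row into one string (NA becomes "NA")
row_txt <- do.call(paste, unname(as.list(df)))

# Collapse the bad rows of each group into a single string
E <- vapply(split(row_txt[bad], grp[bad]), paste, character(1), collapse = " ")

# Keep the good rows and attach the combined string of their group
result <- df[!bad, ]
result$E <- unname(E[as.character(grp[!bad])])
rownames(result) <- NULL
result
```

A good row with no bad rows after it gets NA in E, and bad rows that come before the first good row are dropped.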

How to automate renaming of columns in wide data using R

Consider the following data in the wide format
df<-data.frame("id"=c(1,2,3,4),
"ex"=c(1,0,0,1),
"aQL"=c(5,4,NA,6),
"bQL"=c(5,7,NA,9),
"cQL"=c(5,7,NA,9),
"bST"=c(3,7,8,9),
"cST"=c(8,7,5,3),
"aXY"=c(1,9,4,4),
"cXY"=c(5,3,1,4))
I want to keep the column (or variable) names "id" and "ex" and rename the remaining columns, e.g. "aQL", "bQL" and "cQL" as "QL.1", "QL.2" and "QL.3", respectively. The other columns, with names ending in "ST" and "XY", should be renamed in the same manner, also in the order .1, .2 and .3. Note that "aST" and "bXY" are missing from the data set, but I want them to be included as ST.1 and XY.2, each filled with NAs. The expected output would look like
df
id ex QL.1 QL.2 QL.3 ST.1 ST.2 ST.3 XY.1 XY.2 XY.3
1 1 1 5 5 5 NA 3 8 1 NA 5
2 2 0 4 7 7 NA 7 7 9 NA 3
3 3 0 NA NA NA NA 8 5 4 NA 1
4 4 1 6 9 9 NA 9 3 4 NA 4
The main data set has many variables, so I would like the renaming to be done in an automated manner. I tried the following code
renameCol <- function(x) {
setNames(x, paste0("QL.", seq_len(ncol(x))))
}
renameCol(df)
but it does not work as expected: it also renames "id" and "ex", which I want to keep, and it cannot handle several variable groups (i.e. QL, ST, XY). Any help is greatly appreciated.
I would suggest a tidyverse approach, which needs no custom function. Extract the first letter of each variable name as an id, then assign it a number with cur_group_id() so that the order is kept. Finally, use this number to build the new variable names and reshape back to wide format:
library(tidyverse)
#Data
df<-data.frame("id"=c(1,2,3,4),
"ex"=c(1,0,0,1),
"aQL"=c(5,4,NA,6),
"bQL"=c(5,7,NA,9),
"cQL"=c(5,7,NA,9),
"bST"=c(3,7,8,9),
"cST"=c(8,7,5,3),
"aXY"=c(1,9,4,4),
"cXY"=c(5,3,1,4))
#Reshape
df %>% pivot_longer(cols = -c(1,2)) %>%
  #Extract first letter as id
  mutate(id2=substring(name,1,1)) %>%
  #Create the number id
  group_by(id2) %>%
  mutate(id3=cur_group_id()) %>%
  #Clean name
  mutate(name=substring(name,2,nchar(name))) %>%
  #Create final var
  mutate(name2=paste0(name,'.',id3)) %>% ungroup() %>%
  dplyr::select(-c(name,id2,id3)) %>%
  #Format to wide
  pivot_wider(names_from = name2,values_from=value)
Output:
# A tibble: 4 x 9
id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4
In base R you could do:
names(df) <- sub("(\\d)([A-Z]{2})$","\\2.\\1", chartr("abc","123",names(df)))
df
id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
1 1 1 5 5 5 3 8 1 5
2 2 0 4 7 7 7 7 9 3
3 3 0 NA NA NA 8 5 4 1
4 4 1 6 9 9 9 3 4 4
If you need the NA columns:
names(df) <- sub("(\\d)([A-Z]{2})$","\\2.\\1", chartr("abc","123",names(df)))
a <- read.table(text=grep("\\.\\d",names(df),value = TRUE), sep=".")
b <- subset(aggregate(.~V1, a, function(x) setdiff(1:3,x)), V2>0)
df[do.call(paste, c(sep = ".", b))] <- NA
(df1 <- df[c(1, 2, order(names(df)[-(1:2)]) + 2)])
id ex QL.1 QL.2 QL.3 ST.1 ST.2 ST.3 XY.1 XY.2 XY.3
1 1 1 5 5 5 NA 3 8 1 NA 5
2 2 0 4 7 7 NA 7 7 9 NA 3
3 3 0 NA NA NA NA 8 5 4 NA 1
4 4 1 6 9 9 NA 9 3 4 NA 4
Another way you can try
colnames(df)[grepl("QL", colnames(df))] <- str_c("QL.", 1:3)
colnames(df)[grepl("ST", colnames(df))] <- str_c("ST.", 2:3)
colnames(df)[grepl("XY", colnames(df))] <- str_c("XY.", c(1,3))
# id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
# 1 1 1 5 5 5 3 8 1 5
# 2 2 0 4 7 7 7 7 9 3
# 3 3 0 NA NA NA 8 5 4 1
# 4 4 1 6 9 9 9 3 4 4
Here is a solution that uses regular expressions via the stringr package:
library(stringr)
df<-data.frame("id"=c(1,2,3,4),
"ex"=c(1,0,0,1),
"aQL"=c(5,4,NA,6),
"bQL"=c(5,7,NA,9),
"cQL"=c(5,7,NA,9),
"bST"=c(3,7,8,9),
"cST"=c(8,7,5,3),
"aXY"=c(1,9,4,4),
"cXY"=c(5,3,1,4))
renameCol <- function(x) {
col_names <- colnames(x)
index_ql <- str_detect(col_names,
"^[a-z]{1}QL")
index_st <- str_detect(col_names,
"^[a-z]{1}ST")
index_xy <- str_detect(col_names,
"^[a-z]{1}XY")
replace_fun <- function(x) {which(letters %in% x)}
col_names[index_ql] <- paste0("QL.", str_replace(substr(col_names[index_ql], 1, 1),
"[a-z]", replace_fun))
col_names[index_st] <- paste0("ST.", str_replace(substr(col_names[index_st], 1, 1),
"[a-z]", replace_fun))
col_names[index_xy] <- paste0("XY.", str_replace(substr(col_names[index_xy], 1, 1),
"[a-z]", replace_fun))
col_names
}
colnames(df) <- renameCol(df)
df
#> id ex QL.1 QL.2 QL.3 ST.2 ST.3 XY.1 XY.3
#> 1 1 1 5 5 5 3 8 1 5
#> 2 2 0 4 7 7 7 7 9 3
#> 3 3 0 NA NA NA 8 5 4 1
#> 4 4 1 6 9 9 9 3 4 4
Created on 2020-09-07 by the reprex package (v0.3.0)
Edit
The function above was adapted so that it takes the order into account.
Using base pattern matching with regmatches (the helper itself uses str_extract from stringr):
First define a function that does what you want on a single column name:
f = function(x){
beg <- str_extract(x,"[a-z](?=[A-Z]{2})")
num <- which(letters == beg)
output <- paste0(str_extract(x,"(?<=[a-z])[A-Z]{2}"),".",num)
return(output)
}
Here it extracts the lower-case letter when it is followed by two upper-case letters, finds its position in the alphabet, and pastes that number after the upper-case letters.
> f("cQL")
[1] "QL.3"
You can then use regmatches and regular expression directly on the name of your data frame:
m <- gregexpr("[a-z][A-Z]{2}", names(df),perl = T)
regmatches(names(df), m) <- lapply(regmatches(names(df), m), f)
> names(df)
[1] "id" "ex" "QL.1" "QL.2" "QL.3" "ST.2" "ST.3" "XY.1" "XY.3"
It solves only the renaming part, not the "including missing column numbers" part of your question.
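Both parts (renaming and adding the missing columns) can also be combined in one base R helper. The sketch below is only an illustration: rename_and_pad is a hypothetical name, and it assumes every column outside keep is named as one lower-case letter followed by a suffix such as "QL".

```r
df <- data.frame(id = c(1,2,3,4), ex = c(1,0,0,1),
                 aQL = c(5,4,NA,6), bQL = c(5,7,NA,9), cQL = c(5,7,NA,9),
                 bST = c(3,7,8,9), cST = c(8,7,5,3),
                 aXY = c(1,9,4,4), cXY = c(5,3,1,4))

# Hypothetical helper: rename "<letter><SUFFIX>" to "<SUFFIX>.<number>"
# and add all-NA columns for the combinations that are absent
rename_and_pad <- function(x, keep = c("id", "ex")) {
  is_var <- !(names(x) %in% keep)
  old    <- names(x)[is_var]
  num    <- match(substr(old, 1, 1), letters)  # "a" -> 1, "b" -> 2, ...
  suffix <- substring(old, 2)                  # "QL", "ST", "XY"
  names(x)[is_var] <- paste0(suffix, ".", num)

  # Full grid of expected names, ordered QL.1, QL.2, ..., XY.3
  grid   <- expand.grid(num = seq_len(max(num)), suffix = sort(unique(suffix)))
  wanted <- paste0(grid$suffix, ".", grid$num)

  # Add the columns that are missing, filled with NA
  missing_cols <- setdiff(wanted, names(x))
  if (length(missing_cols) > 0) x[missing_cols] <- NA

  x[c(keep, wanted)]
}

out <- rename_and_pad(df)
out
```

The number of columns per suffix is taken from the highest letter found in the data, so a group that is missing everywhere (say, all "d" columns) would not be added.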

How can I insert blank rows every 3 existing rows in a data frame?

After a web scraping process I get a data frame with the information I need; however, the final Excel format requires a blank row after every 3 rows. I have searched the web for help but have not found a solution yet.
With hypothetical data, the structure of my data frame is as follows:
mi_df <- data.frame(
"ID" = rep(1:3,c(3,3,3)),
"X" = as.character(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
"Y" = seq(1,18, by=2)
)
mi_df
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4 2 b 7
5 2 b 9
6 2 b 11
7 3 c 13
8 3 c 15
9 3 c 17
The result I hope for is something like this
ID X Y
1 1 a 1
2 1 a 3
3 1 a 5
4
5 2 b 7
6 2 b 9
7 2 b 11
8
9 3 c 13
10 3 c 15
11 3 c 17
If the indices of a data frame contain NA, then the output will have NA rows. So my goal is to create a vector like 1 2 3 NA 4 5 6 NA ... and set it as the indices of mi_df.
cut <- rep(1:(nrow(mi_df)/3), each = 3)
mi_df[sapply(split(1:nrow(mi_df), cut), c, NA), ]
# ID X Y
# 1 1 a 1
# 2 1 a 3
# 3 1 a 5
# NA NA <NA> NA
# 4 2 b 7
# 5 2 b 9
# 6 2 b 11
# NA.1 NA <NA> NA
# 7 3 c 13
# 8 3 c 15
# 9 3 c 17
# NA.2 NA <NA> NA
If nrow(mi_df) is not a multiple of 3, then the following is a general solution:
# Version 1
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(c, lapply(split(1:nrow(mi_df), cut), c, NA)), ]
# Version 2
cut <- rep(1:ceiling(nrow(mi_df)/3), each = 3, len = nrow(mi_df))
mi_df[Reduce(function(x, y) c(x, NA, y), split(1:nrow(mi_df), cut)), ]
Don't mind the NA in the output: some functions that write data to an Excel file have an optional argument that controls whether NA values are written as strings or left empty. E.g.
library(openxlsx)
write.xlsx(df, "test.xlsx", keepNA = FALSE) # defaults to FALSE
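The index-vector idea can be wrapped in a small helper that works for any group size and leaves no blank row at the very end. This is a minimal sketch; insert_blank_rows and its every argument are names chosen here, not an existing function.

```r
mi_df <- data.frame(ID = rep(1:3, c(3,3,3)),
                    X  = rep(c("a","b","c"), each = 3),
                    Y  = seq(1, 18, by = 2))

# Hypothetical helper: insert an all-NA row after every `every` rows
insert_blank_rows <- function(x, every = 3) {
  n   <- nrow(x)
  grp <- ceiling(seq_len(n) / every)   # 1 1 1 2 2 2 3 3 3 ...
  # Append NA to each block of row indices: 1 2 3 NA 4 5 6 NA ...
  idx <- unlist(lapply(split(seq_len(n), grp), c, NA), use.names = FALSE)
  idx <- head(idx, -1)                 # drop the trailing NA
  out <- x[idx, , drop = FALSE]        # NA indices become all-NA rows
  rownames(out) <- NULL
  out
}

res <- insert_blank_rows(mi_df)
res
```

The blank rows are still NA rows here; they only turn into empty cells when the file is written with an option such as keepNA = FALSE.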
tmp <- split(mi_df, rep(1:(nrow(mi_df) / 3), each = 3))
# or split(mi_df, ggplot2::cut_width(seq_len(nrow(mi_df)), 3, center = 2))
do.call(rbind, lapply(tmp, function(x) { x[4, ] <- NA; x }))
ID X Y
1.1 1 a 1
1.2 1 a 3
1.3 1 a 5
1.4 NA <NA> NA
2.4 2 b 7
2.5 2 b 9
2.6 2 b 11
2.4.1 NA <NA> NA
3.7 3 c 13
3.8 3 c 15
3.9 3 c 17
3.4 NA <NA> NA
You can make empty rows like you show by assigning an empty character vector ("") instead of NA, but this will convert your columns to character, and I wouldn't recommend it.
My recommendation is somewhat different from all the other answers: don't make a mess of your dataset inside R. Use the existing packages to write to designated rows in an Excel workbook. For example, with the package XLConnect, the method writeWorksheet (called from writeWorksheetToFile) includes these arguments:
object    The workbook to write to
data      Data to write
sheet     The name or index of the sheet to write to
startRow  Index of the first row to write to. The default is startRow = 1.
startCol  Index of the first column to write to. The default is startCol = 1.
So if you simply set up a loop that writes 3 rows of your data file at a time, then moves the row index down by 4 and writes the next 3 rows, etc., you're all set.
Here's one method.
Splits into list by ID, adds empty row, then binds list back into data frame.
mi_df2 <- do.call(rbind,Map(rbind,split(mi_df,mi_df$ID),rep("",3)))
rownames(mi_df2) <- NULL

keep NA and blanks rows in a data.frame

I have this dataset:
ID FARM WEIGHT
1 2 NA
2 2
3 3 57
4 4 58
5 7 NA
I want to select the rows where WEIGHT is blank or NA, so my data.frame should look like this:
ID FARM WEIGHT
1 2 NA
2 2
5 7 NA
I tried this code:
newfile <- dataset[!(is.na(dataset$WEIGHT) | dataset$WEIGHT != ''),]
but it doesn't work: I get an empty dataset.
I tried your code. Shouldn't you use dataset[is.na(dataset$WEIGHT) | dataset$WEIGHT=="",] instead? The following code works.
dataset <- data.frame(ID=1:5, FARM=c(2, 2, 3, 4, 7), WEIGHT=c(NA, "", "57", "58", NA) )
dataset[is.na(dataset$WEIGHT) | dataset$WEIGHT=="",]
# ID FARM WEIGHT
# 1 1 2 <NA>
# 2 2 2
# 5 5 7 <NA>
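The same filter can also be written with %in%, which treats NA as a value that can be matched and never returns NA itself. A minimal sketch on the same example data (keep is a name chosen here):

```r
dataset <- data.frame(ID = 1:5, FARM = c(2, 2, 3, 4, 7),
                      WEIGHT = c(NA, "", "57", "58", NA))

# TRUE for rows whose WEIGHT is NA or an empty string
keep <- dataset$WEIGHT %in% c(NA, "")
dataset[keep, ]
```

This only catches strings that are exactly empty; a cell holding a space (" ") would need to be trimmed first, for example with trimws().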
Just use (this catches the blanks only if they were read in as NA):
dt[!complete.cases(dt), ]
OR, to catch both NA and empty strings:
dt[rowSums(is.na(dt) | dt=="") > 0,]
Output-
ID FARM WEIGHT
1 1 2 NA
2 2 2 NA
5 5 7 NA
Note- If you want to read directly from file then you can also do-
dt<- read.csv("file.csv", na.strings=c("NA",""))

Create a counter in a for loop in R

I'm an inexperienced R user and I need to build something quite complicated.
My dataset looks like this :
dataset
a,b,c,d,e are different individuals.
I want to complete the D column as follows :
At the last line for each individual in the col A, D = sum(C)/(B-1).
Expected results should look like :
results
D4=sum(C2:C4)/(B4-1)=0.5
D6=sum(C5:C6)/(B6-1)=1, etc.
I attempted to deal with it with something like :
for(i in 2:NROW(dataset)){
dataset[i,4]<-ifelse(
(dataset[i,1]==data1[i-1,1]),sum(dataset[i,3])/(dataset[i,2]-1),NA
)
}
But it is obviously not sufficient, as it computes the D value for all the rows and not only the last for each individual, and it does not calculate the sum of C values for this individual.
And I really don't know how to figure it out. Do you guys have any advice ?
Many thanks.
If I understood your question correctly, then this is one approach to get to the desired result:
df <- data.frame(
A=c("a","a","a","b","b","c","c","c","d","e","e"),
B=c(3,3,3,2,2,3,3,3,1,2,2),
C=c(NA,1,0,NA,1,NA,0,1,NA,NA,0),
stringsAsFactors = FALSE)
for(i in 2:NROW(df)){
  df[i,4] <- ifelse(
    (df[i,1]!=df[i+1,1] | i == nrow(df)),
    sum(df[df$A == df[i,1],]$C, na.rm=TRUE)/(df[i,2]-1),
    NA
  )
}
This code results in the following table:
A B C V4
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0
The ifelse first tests if the individual of the current row of column A is different than the individual in the next row OR if it's the last row.
If it is the last row with this individual it takes the sum of column C (ignoring the NAs) of the rows with the individual present in column A divided by the value in column B minus one.
Otherwise it puts an NA in the fourth column.
Using dplyr you can try generating D for all rows and then remove where not required:
library(dplyr)
# dftest is assumed to hold the same data as df above
dftest %>%
  group_by(A,B) %>%
  dplyr::mutate(D = sum(C, na.rm=TRUE)/(B-1)) %>%
  dplyr::mutate(D = if_else(row_number()== n(), D, as.double(NA)))
which gives:
Source: local data frame [11 x 4]
Groups: A, B [5]
A B C D
<chr> <dbl> <dbl> <dbl>
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0
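The same result can be sketched in base R without a loop, using ave() for the per-individual sum and duplicated() to find the last row of each individual. This is an illustrative alternative on the example data from the first answer; total and is_last are names chosen here.

```r
df <- data.frame(
  A = c("a","a","a","b","b","c","c","c","d","e","e"),
  B = c(3,3,3,2,2,3,3,3,1,2,2),
  C = c(NA,1,0,NA,1,NA,0,1,NA,NA,0),
  stringsAsFactors = FALSE)

# Sum of C per individual (ignoring NA), repeated on every row of that individual
total <- ave(df$C, df$A, FUN = function(x) sum(x, na.rm = TRUE))

# TRUE only on the last row of each individual
is_last <- !duplicated(df$A, fromLast = TRUE)

# D = sum(C) / (B - 1) on the last row of each individual, NA elsewhere
df$D <- ifelse(is_last, total / (df$B - 1), NA)
df
```

As in the answers above, individual "d" gets NaN because B - 1 is 0 there.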
