How can I write a for loop for this? - r

I would like to write a for loop for the actions below. data is a data frame with multiple columns, and each column contains a list. I would like to replace all NULL values in each column's list with NA so that I can bind all the lists into a data frame. If there's a more efficient way to do this than a for loop, I would like to know as well. Thank you.
for (i in names(data)){
list1=sapply(data[,1], function(x) ifelse(x == "NULL", NA, x))
list1=as.data.frame(list1)
list2=sapply(data[,2], function(x) ifelse(x == "NULL", NA, x))
list2=as.data.frame(list2)
.
.
.
fulllist=as.data.frame(cbind(list1,list2,....))
fulllist = as.data.frame(t(fulllist))
}

We loop over the columns of data to find the list columns (the logical index 'i1'). We then use that index to loop over those columns and, within each column, loop over the elements of the list, assigning NA to the elements that are NULL.
i1 <- sapply(data, is.list)
data[i1] <- lapply(data[i1], function(x) {
i2 <- sapply(x, is.null)
x[i2] <- NA
x })
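As a small worked example of the approach above, the sketch below builds a toy data frame with two list columns (the column names id, a, and b are made up for illustration), swaps each NULL element for NA, and then flattens the list columns with unlist so the result is an ordinary data frame. It assumes every list element has length 0 or 1.

```r
# Toy data: 'a' and 'b' are list columns that contain NULL elements
data <- data.frame(id = 1:3)
data$a <- list(1, NULL, 3)
data$b <- list("x", "y", NULL)

# Find the list columns, then replace NULL elements with NA in each of them
i1 <- sapply(data, is.list)
data[i1] <- lapply(data[i1], function(x) {
  i2 <- sapply(x, is.null)
  x[i2] <- NA
  x
})

# With no NULLs left, each list column can be flattened to a plain vector
flat <- as.data.frame(lapply(data, unlist), stringsAsFactors = FALSE)
```

Without the NULL-to-NA step, unlist would silently drop the NULL elements and the columns would end up with different lengths.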

If you are indeed working with a data frame, you could skip splitting it into lists and recombining them into a data frame:
purrr::map_df(.x = data, .f = ~ stringr::str_replace(.x, 'NULL', NA_character_))
This takes the data frame data and applies str_replace to each column, replacing any string that matches NULL with the character version of NA. The output is also a data frame.
Here is an example:
library(purrr)
library(stringr)
df <- data.frame(
X1 = c('A', 'NULL', 'B'),
X2 = c('NULL', 'C', 'D'),
X3 = c('E', 'NULL', 'NULL')
)
purrr::map_df(.x = df, .f = ~ stringr::str_replace(.x, 'NULL', NA_character_))
# X1 X2 X3
# <chr> <chr> <chr>
# 1 A NA E
# 2 NA C NA
# 3 B D NA
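For plain character columns, a base R alternative that needs neither purrr nor stringr is to index the data frame with a logical matrix. This is a minimal sketch on the same toy data; unlike a pattern match, df == "NULL" only matches cells that are exactly the string NULL.

```r
df <- data.frame(
  X1 = c("A", "NULL", "B"),
  X2 = c("NULL", "C", "D"),
  X3 = c("E", "NULL", "NULL"),
  stringsAsFactors = FALSE
)

# df == "NULL" is a logical matrix with the same shape as df;
# every TRUE cell is overwritten with NA
df[df == "NULL"] <- NA
```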

Related

How to replace NA values with different values based on column in R dataframe?

I am trying to replace NA values by column with values predetermined from a vector. For example, I have vector containing the values (1,5,3) and a dataframe df, and want to replace all NA values from column one of df with 1, column two NA's with 5, and column three NA's with 3.
I tried a formula I saw:
df[is.na(df)] = vector
but it didn't work, failing with a "wrong length" error, even though the vector has as many elements as df has columns.
You can use which to get row/column index of NA values and replace it directly.
mat <- which(is.na(df), arr.ind = TRUE)
df[mat] <- vector[mat[, 2]]
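To make the indexing step concrete, here is the same idea on the sample data from this question, with the replacement vector named vec. With arr.ind = TRUE, which returns one row per NA holding its row and column position, so mat[, 2] is the column number of each NA and vec[mat[, 2]] picks the matching replacement.

```r
df <- data.frame(col1 = c(1, 3, NA), col2 = c(NA, 2, NA), col3 = c(2, NA, NA))
vec <- c(1, 5, 3)

# One row per NA cell: first column is the row index, second the column index
mat <- which(is.na(df), arr.ind = TRUE)

# Look up the replacement for each NA by its column number
df[mat] <- vec[mat[, 2]]
```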
We can use Map to pair each column of the dataset with the corresponding value in the vector and replace the NAs in that column in a single, concise step:
df[] <- Map(function(x, y) replace(x, is.na(x), y), df, vec)
df
# col1 col2 col3
#1 1 5 2
#2 3 2 3
#3 1 5 3
Or another option is to make the lengths the same and then use pmax
df[] <- pmax(as.matrix(df), is.na(df) * vec[col(df)], na.rm = TRUE)
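One caveat with the pmax approach is worth checking before relying on it: the non-NA cells are compared against 0, so it appears to be safe only when the data are non-negative. The sketch below uses a made-up data frame (df_neg) containing a negative value to show the effect.

```r
# Hypothetical data with a negative value in col1
df_neg <- data.frame(col1 = c(-1, NA), col2 = c(NA, 4))
vec <- c(1, 5)

m <- as.matrix(df_neg)
# 0 where the cell is not NA, the column's replacement value where it is NA
fill <- is.na(m) * vec[col(m)]
out <- pmax(m, fill, na.rm = TRUE)

# The NAs are filled as intended, but -1 has been raised to 0
out
```

The Map and replace variants only touch the NA cells, so they do not have this limitation.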
or another option with replace
df <- replace(df, is.na(df), rep(vec, colSums(is.na(df))))
NOTE: All the solutions above are one-liners.
Or using data.table with set
library(data.table)
setDT(df)
for(j in seq_along(df)) set(df, i = which(is.na(df[[j]])), j = j, value = vec[j])
data
df <- data.frame(col1 = c(1, 3, NA), col2 = c(NA, 2, NA), col3 = c(2, NA, NA))
vec <- c(1, 5, 3)

Is there a more robust rename alternative than select with triple exclamation mark?

I have the following simplified use-case and it fails, stating that a column isn't there, which is correct. In my real use-case, this tibble comes from parsing an XML, which may not include all the expected columns but a subset in a few cases, i.e., missing values:
library(tidyverse)
mapping <- c('a', 'b', 'c')
names(mapping) <- c('x', 'y', 'z')
tibble(a=c(1, 2, 3), b=c('m', 'm', 'm')) %>%
select(!!! mapping)
Error: Unknown column `c`
Run `rlang::last_error()` to see where the error occurred.
The error is correct. However, is there a way to robustify this so that c would be created with NAs automatically when missing?
UPDATE: I worked out the following simple solution outside tidyverse:
# create a data frame instead of a tibble
df <- data.frame(a=c(1, 2, 3), b=c('m', 'm', 'm'), stringsAsFactors = FALSE)
# fill in missing columns with NAs
df[,setdiff(mapping, colnames(df))] <- NA
# reorder columns to match the mapping
df <- df[, mapping]
df
# now is safe to rename using the mapping
as_tibble(df) %>%
select(!!! mapping)
I'm still interested in a modern solution using tidyverse and tibble.
An extension to your solution could be
library(tidyverse)
df[,setdiff(mapping, colnames(df))] <- NA
df %>% rename_all(~names(mapping))
# x y z
#1 1 m NA
#2 2 m NA
#3 3 m NA
Or another approach
map_dfc(setdiff(mapping, colnames(df)), ~df %>% mutate(!!.x := NA)) %>%
# reorder the columns to match the mapping (arrange() would sort rows instead)
select(all_of(unname(mapping))) %>%
rename_all(~names(mapping))
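The two steps used in these answers (add any missing columns as NA, then select and rename by the mapping) can also be wrapped in a small base R helper. This is only a sketch: the function name add_missing_and_rename is made up here, and it assumes mapping is a named character vector whose values are the old column names and whose names are the new ones.

```r
# Hypothetical helper: values of 'mapping' are old names, names(mapping) are new names
add_missing_and_rename <- function(dat, mapping) {
  # Create every expected column that is absent, filled with NA
  for (col in setdiff(mapping, names(dat))) {
    dat[[col]] <- NA
  }
  # Keep only the mapped columns, in mapping order, then rename them
  out <- dat[unname(mapping)]
  names(out) <- names(mapping)
  out
}

mapping <- c(x = "a", y = "b", z = "c")
dat <- data.frame(a = c(1, 2, 3), b = c("m", "m", "m"), stringsAsFactors = FALSE)
out <- add_missing_and_rename(dat, mapping)
```

Because the columns are selected by name before renaming, the result does not depend on the column order of the input.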
Here is a function I designed that may suit your needs. dat is the data frame you want to select columns from. mapping is a named vector containing the column names you want.
library(tidyverse)
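select_fun <- function(dat, mapping){
  # columns of dat that appear in mapping, and mapped columns missing from dat
  cols_same <- intersect(names(dat), mapping)
  cols_different <- setdiff(mapping, names(dat))
  if (length(cols_same) > 0){
    # keep the matching columns and rename them to the names of mapping
    dat2 <- dat %>%
      select(all_of(cols_same)) %>%
      set_names(names(mapping[mapping %in% cols_same]))
    if (length(cols_different) > 0){
      # add each missing column, filled with NA, under its new name
      for (col in cols_different){
        dat2[[names(mapping[mapping %in% col])]] <- NA
      }
    }
  } else {
    # no overlap between dat and mapping
    dat2 <- NA
  }
  return(dat2)
}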
Let's test the function.
Test 1: The original question
mapping <- c('a', 'b', 'c')
names(mapping) <- c('x', 'y', 'z')
dat <- tibble(a=c(1, 2, 3), b=c('m', 'm', 'm'))
select_fun(dat, mapping)
# # A tibble: 3 x 3
# x y z
# <dbl> <chr> <lgl>
# 1 1 m NA
# 2 2 m NA
# 3 3 m NA
Test 2: mapping contains fewer column names than dat
mapping <- c('a', 'b')
names(mapping) <- c('x', 'y')
dat <- tibble(a=c(1, 2, 3), b=c('m', 'm', 'm'), c=c('n', 'n', 'n'))
select_fun(dat, mapping)
# # A tibble: 3 x 2
# x y
# <dbl> <chr>
# 1 1 m
# 2 2 m
# 3 3 m
Test 3: No matches
mapping <- c('x', 'y', "z")
names(mapping) <- c('x', 'y', 'z')
dat <- tibble(a=c(1, 2, 3), b=c('m', 'm', 'm'))
select_fun(dat, mapping)
# [1] NA
Note that I am not sure what you want when mapping and the column names of dat have no matches. For now, the function returns NA in that case. You can easily change this by editing the final else branch. For example, the version below returns an empty data frame with the same column names as dat.
} else {
dat2 <- dat %>% slice(0)
}

Subset a dataframe by matching it to a list and include non-match value too in the output using R

I have a dataframe (myDF) that has 2 columns "A" and "B" and a function (myfunc) which takes a list as an input and if it finds a match in column "A" then it returns a new dataframe that is a subset of myDF containing the value match and the corresponding "B" column.
But I want the function to also return the non-matching value in column A and NULL string in column B.
myDF:
A B
1 11
2 22
3 33
myfunc:
myfunc <- function(x) {
  # column names are case-sensitive: myDF has "A" and "B"
  r <- with(myDF, myDF[A %in% x, c("A", "B")])
  return(data.frame(r))
}
Input: mylist = c(1,2,"E")
Expected Output:
A B
1 11
2 22
E NULL
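We create a logical index of the rows whose 'A' value is not in mylist and assign to those rows. Note that the assignment to 'A' relies on mylist having the same length and order as the rows of myDF, as it does in this example.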
i1 <- with(myDF, !A %in% mylist)
myDF$B[i1] <- "NULL"
myDF$A[i1] <- mylist[i1]
myDF
# A B
#1 1 11
#2 2 22
#3 E NULL
Note: By assigning a character string to 'B' column, it effectively changes the type from numeric to character. A better option would be to assign it to NA
myDF$B[i1] <- NA
Or
data.frame(A= mylist, B = myDF$B[match(mylist, myDF$A)])
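Since the question asked for a function, the match-based one-liner above can be wrapped as follows. This is a sketch under the question's assumptions; the name lookup_with_na is made up, and both sides are converted to character before matching so that the numeric column A can be compared with a mixed input such as c(1, 2, "E").

```r
myDF <- data.frame(A = c(1, 2, 3), B = c(11, 22, 33))

# Hypothetical wrapper: returns one row per key, with NA in B for unmatched keys
lookup_with_na <- function(keys, df = myDF) {
  # match() gives the row position of each key in df$A, or NA if absent
  idx <- match(as.character(keys), as.character(df$A))
  data.frame(A = keys, B = df$B[idx], stringsAsFactors = FALSE)
}

res <- lookup_with_na(c(1, 2, "E"))
```

Indexing df$B with an NA position yields NA, which is what produces the row for the unmatched key.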
This is a join operation, which can be done in base R with merge, if you make the list a data.frame first. The all.y = T argument includes rows of mylistDF with no matching rows in myDF in the output.
mylistDF <- data.frame(A = mylist, stringsAsFactors = F)
merge(myDF, mylistDF, by = 'A', all.y = T)
# A B
# 1 1 11
# 2 2 22
# 3 E NA
Since you tagged tidyr, here's a tidyverse solution (same output)
library(tidyverse)
mylistDF <- tibble(A = mylist)
myDF %>%
mutate_at('A', as.character) %>%
right_join(mylistDF, by = 'A')

assign row name while rbind a row in a data frame

I want to assign a rowname (A_B) while I rbind a row into a new dataframe (d).
The row is the result of the ratio of two rows of another data frame (df).
df <- data.frame(ID = c("A", "B" ),replicate(3,sample(1:100,2,rep=TRUE)))
d <- data.frame()
d<-rbind(d, df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4])
Actual output
X1 X2 X3
1 0.08 0.14 0.66
Expected output. Instead of rowname 1 I want A_B as result of A_B ratio
X1 X2 X3
A_B 0.08 0.14 0.66
It may be a naive solution, but have you tried giving the row name directly? Something like this:
rbind(d, your_name = (df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4]))
It works for me.
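Another way to get the same result, sketched here with fixed numbers instead of sample() so the output is reproducible, is to name the ratio row before binding it. The values are chosen to reproduce the ratios shown in the question.

```r
df <- data.frame(ID = c("A", "B"),
                 X1 = c(8, 100), X2 = c(7, 50), X3 = c(33, 50))

# Ratio of row A to row B, computed column by column
ratio <- df[df$ID == "A", 2:4] / df[df$ID == "B", 2:4]

# Name the row first; rbind keeps non-default row names
rownames(ratio) <- "A_B"
d <- data.frame()
d <- rbind(d, ratio)
```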
Updated the solution to address multiple rows.
As a workaround, you can get the desired row names as follows:
df <- data.frame(ID = c("A", "B" ),replicate(3, sample(1:100,8,rep=TRUE)))
# This is where you control what you like to see as the row names
rownames(df) <- make.names( paste("NAME", df[ ,"ID"]) , unique = TRUE)
d <- data.frame()
rbind(d, df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4], make.row.names = TRUE)
output
X1 X2 X3
NAME.A 0.8690476 1.1851852 2.40909091
NAME.A.1 1.8181818 0.8095238 1.01408451
NAME.A.2 0.8235294 5.4444444 2.50000000
NAME.A.3 1.4821429 1.8139535 0.05617978
You can also assign row names in a pipe by passing your rbind output to the replacement function `rownames<-` (this assumes magrittr's %>% is loaded):
rbind(d, df[df$ID == "A",2:4 ]/df[df$ID == "B", 2:4]) %>% `rownames<-`(c("Name you like"))
Just make sure the names you supply line up with the rows you intend to rename.

Changing Column Names in a List of Data Frames in R

Objective: change the column names of all the data frames in the global environment to the names in the following list.
So:
0) The Column names are:
colnames = c("USAF","WBAN","YR--MODAHRMN")
1) I have the following data.frames: df1, df2.
2) I put them in a list:
dfList <- list(df1,df2)
3) Loop through the list:
for (df in dfList){
colnames(df)=colnames
}
But this creates a new df with the column names that I need; it doesn't change the original column names in df1 and df2. Why? Could lapply be a solution? Thanks.
Can something like:
lapply(dfList, function(x) {colnames(dfList)=colnames})
work?
With lapply you can do it as follows.
Create sample data:
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(X = 1, Y = 2, Z = 3)
dfList <- list(df1,df2)
colnames <- c("USAF","WBAN","YR--MODAHRMN")
Then, lapply over the list using setNames and supply the vector of new column names as second argument to setNames:
lapply(dfList, setNames, colnames)
#[[1]]
# USAF WBAN YR--MODAHRMN
#1 1 2 3
#
#[[2]]
# USAF WBAN YR--MODAHRMN
#1 1 2 3
Edit
If you want to assign the data.frames back to the global environment, you can modify the code like this:
dfList <- list(df1 = df1, df2 = df2)
list2env(lapply(dfList, setNames, colnames), .GlobalEnv)
Just change your for-loop into an index for-loop like this:
Data
df1 <- data.frame(a=runif(5), b=runif(5), c=runif(5))
df2 <- data.frame(a=runif(5), b=runif(5), c=runif(5))
dflist <- list(df1,df2)
colnames = c("USAF","WBAN","YR--MODAHRMN")
Solution
for (i in seq_along(dflist)){
colnames(dflist[[i]]) <- colnames
}
Output
> dflist
[[1]]
USAF WBAN YR--MODAHRMN
1 0.8794153 0.7025747 0.2136040
2 0.8805788 0.8253530 0.5467952
3 0.1719539 0.5303908 0.5965716
4 0.9682567 0.5137464 0.4038919
5 0.3172674 0.1403439 0.1539121
[[2]]
USAF WBAN YR--MODAHRMN
1 0.20558383 0.62651334 0.4365940
2 0.43330717 0.85807280 0.2509677
3 0.32614750 0.70782919 0.6319263
4 0.02957656 0.46523151 0.2087086
5 0.58757198 0.09633181 0.6941896
With for (df in dfList), each iteration copies the current element into a new variable df, and the column names are changed on that copy, leaving the original list (dfList) untouched.
To make the for loop work, iterate over the indices of the list rather than over the data frames themselves:
for (i in seq_along(dfList)) {
  colnames(dfList[[i]]) <- colnames
}
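The difference between the two loops can be checked directly. In this sketch the vector of new names is called new_names (rather than colnames) to avoid confusion with the colnames function, and the list elements are named so they are easy to inspect.

```r
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(X = 1, Y = 2, Z = 3)
dfList <- list(df1 = df1, df2 = df2)
new_names <- c("USAF", "WBAN", "YR--MODAHRMN")

# Looping over the elements: 'df' is a copy, so dfList is not modified
for (df in dfList) {
  colnames(df) <- new_names
}
names_after_element_loop <- names(dfList$df1)

# Looping over the indices: the assignment targets the list element itself
for (i in seq_along(dfList)) {
  colnames(dfList[[i]]) <- new_names
}
names_after_index_loop <- names(dfList$df1)
```

Even after the second loop, the separate objects df1 and df2 in the global environment keep their old names, because the list holds copies of them; that is why list2env is needed to write the renamed data frames back.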
dfList <- lapply(dfList, `names<-`, colnames)
Create the sample data:
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(X = 1, Y = 2, Z = 3)
dfList <- list(df1,df2)
name <- c("USAF","WBAN","YR--MODAHRMN")
Then create a function to set the colnames:
res=lapply(dfList, function(x){colnames(x)=c(name);x})
[[1]]
USAF WBAN YR--MODAHRMN
1 1 2 3
[[2]]
USAF WBAN YR--MODAHRMN
1 1 2 3
A tidyverse solution with rename_with:
library(dplyr)
library(purrr)
map(dflist, ~ rename_with(., ~ colnames))
Or, if it's only for one column:
map(dflist, ~ rename(., new_col = old_col))
This also works with lapply:
lapply(dflist, rename_with, ~ colnames)
lapply(dflist, rename, new_col = old_col)
