Delete columns in an R loop - r

I have a dataframe where I want to replace the variables
age_1 with values of variable age1_corr_1 if age1_corr_1 is not NA
age_2 with values of variable age1_corr_2 if age1_corr_2 is not NA, ...,
age_n with values of variable age1_corr_n if age1_corr_n is not NA.
Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop but couldn't figure out how to delete the variables after. Any suggestion?
Sample data
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
The code that will change values of age_n based on age1_corr_n
for(i in 1:4){
cname1 <- paste0("age_",i)
cname2 <- paste0("age1_corr_",i)
y[,cname1] <- ifelse(!is.na(y[,cname2]), y[,cname2], y[,cname1])
}
The output I'd like to have is
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7

You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).
Here's the data you provided:
y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
"age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))
Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:
library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))
Alternatively, you could choose the pattern for the columns you DON'T want:
x <- select(y, -grep("_corr_", current_vars()))
This uses the following strategy:
* you can select for everything BUT a column or set of columns by adding a minus sign first.
* current_vars() is a helper function in dplyr that evaluates to all the variable names for the data (here, y)

Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.
library(magrittr)
y %>%
dplyr::mutate(
age1_corr_4 = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
age_1 = dplyr::coalesce(age1_corr_1, age_1),
age_2 = dplyr::coalesce(age1_corr_2, age_2),
age_3 = dplyr::coalesce(age1_corr_3, age_3),
age_4 = dplyr::coalesce(age1_corr_4, age_4)
) %>%
dplyr::select(
-age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
)
Produces
age_1 age_2 age_3 age_4
1 1 1 4 1
2 1 2 3 4
3 1 10 2 2
4 0 9 6 7
Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.

Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, make sure that the 4th column is correctly represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4)).
Like Dan Hall's response, one approach keeps the columns you want...
library(magrittr)
coalesce_corr1 <- function( index ) {
name_age <- paste0("age_" , index)
name_corr <- paste0("age1_corr_", index)
y %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
) %>%
dplyr::select(!!name_age)
}
1:4 %>%
purrr::map(coalesce_corr) %>%
dplyr::bind_cols()
...and the other drops the columns you don't want.
z <- y
coalesce_corr2 <- function( index ) {
name_age <- paste0( "age_" , index)
name_corr <- paste0( "age1_corr_", index)
z <<- z %>%
dplyr::mutate(
!!name_age := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
)
z[[name_corr]] <<- NULL
}
1:4 %>%
purrr::walk(coalesce_corr2)
z
I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.

Related

mutate(across) to generate multiple new columns in tidyverse

I usually have to perform equivalent calculations on a series of variables/columns that can be identified by their suffix (ranging, let's say from _a to _i) and save the result in new variables/columns. The calculations are equivalent, but vary between the variables used in the calculations. These again can be identified by the same suffix (_a to _i). So what I basically want to achieve is the following:
newvar_a = (oldvar1_a + oldvar2_a) - z
...
newvar_i = (oldvar1_i + oldvar2_i) - z
This is the farest I got:
mutate(across(c(oldvar1_a:oldvar1_i), ~ . - z, .names = "{col}_new"))
Thus, I'm able to "loop" over oldvar1_a to oldvar1_i, substract z from them and save the results in new columns named oldvar1_a_new to oldvar1_i_new. However, I'm not able to include oldvar2_a to oldvar2_i in the calculations as R won't loop over them. (Additionally, I'd still need to rename the new columns).
I found a way to achieve the result using a for-loop. However, this definitely doesn't look like the most efficient and straightforward way to do it:
for (i in letters[1:9]) {
oldvar1_x <- paste0("oldvar1_", i)
oldvar2_x <- paste0("oldvar2_", i)
newvar_x <- paste0("newvar_", i)
df <- df %>%
mutate(!!sym(newvar_x) := (!!sym(oldvar1_x) + !!sym(oldvar2_x)) - z)
}
Thus, I'd like to know whether/how to make mutate(across) loop over multiple columns that can be identified by suffixes (as in the example above)
In this case, you can use cur_data() and cur_column() to take advantage that we are wanting to sum together columns that have the same suffix but just need to swap out the numbers.
library(dplyr)
df <- data.frame(
oldvar1_a = 1:3,
oldvar2_a = 4:6,
oldvar1_i = 7:9,
oldvar2_i = 10:12,
z = c(1,10,20)
)
mutate(
df,
across(
starts_with("oldvar1"),
~ (.x + cur_data()[gsub("1", "2", cur_column())]) - z,
.names = "{col}_new"
)
)
#> oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar2_a oldvar2_i
#> 1 1 4 7 10 1 4 16
#> 2 2 5 8 11 10 -3 9
#> 3 3 6 9 12 20 -11 1
If you want to use with case_when, just make sure to index using [[, you can read more here.
df <- data.frame(
oldvar1_a = 1:3,
oldvar2_a = 4:6,
oldvar1_i = 7:9,
oldvar2_i = 10:12,
z = c(1,2,0)
)
mutate(
df,
across(
starts_with("oldvar1"),
~ case_when(
z == 1 ~ .x,
z == 2 ~ cur_data()[[gsub("1", "2", cur_column())]],
TRUE ~ NA_integer_
),
.names = "{col}_new"
)
)
#> oldvar1_a oldvar2_a oldvar1_i oldvar2_i z oldvar1_a_new oldvar1_i_new
#> 1 1 4 7 10 1 1 7
#> 2 2 5 8 11 2 5 11
#> 3 3 6 9 12 0 NA NA
There is a fairly straightforward way to do what I believe you are attempting to do.
# first lets create data
library(dplyr)
df <- data.frame(var1_a=runif(10, min = 128, max = 131),
var2_a=runif(10, min = 128, max = 131),
var1_b=runif(10, min = 128, max = 131),
var2_b=runif(10, min = 128, max = 131),
var1_c=runif(10, min = 128, max = 131),
var2_c=runif(10, min = 128, max = 131))
# taking a wild guess at what your z is
z <- 4
# initialize a list
fnl <- list()
# iterate over all your combo, put in list
for (i in letters[1:3])
{
dc <- df %>% select(ends_with(i))
i <- dc %>% mutate(a = rowSums(dc[1:ncol(dc)]) - z)
fnl <- append(fnl, i)
}
# convert to a dataframe/tibble
final <- bind_cols(fnl)
I left the column names sloppy assuming you had specific requirements here. You can convert this loop into a function and do the whole thin in a single step using purrr.

R creating a function using plyr revalue with multiple inputs

I am new to R and just learning the ropes so thanks in advance for any assistance you can provide.
I have a dataset that I am cleaning as a class project.
I have several sets of categorical data that I want to turn into specific numeric values.
I am repeating the same code format for different columns that I think would make a good function.
I would like to turn this:
# plyr using revalue
df$Area <- revalue(x = df$Area,
replace = c("rural" = 1,
"suburban" = 2,
"urban" = 3))
df$Area <- as.numeric(df$Area)
into this:
reval_3 <- function(data, columnX,
value1, num_val1,
value2, num_val2,
value3, num_val3) {
# plyr using revalue
data$columnX <- revalue(x = data$columnX,
replace = c(value1 = num_val1,
value2 = num_val2,
value3 = num_val3))
# set as numeric
data$columnX <- as.numeric(data$columnX)
# return dataset
return(data)
}
I get the following error:
The following `from` values were not present in `x`: value1, value2, value3
Error: Assigned data `as.numeric(data$columnX)` must be compatible with existing data.
x Existing data has 10000 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unknown or uninitialised column: `columnX`.
I've tried it with a single value1 where value1 <- c("rural" = 1, "suburban" = 2, "urban" = 3)
I know I can just:
df$Area <- as.numeric(as.factor(df$Area))
the data but I want specific values for each choice rather than R choosing.
Any assistance appreciated.
As already mentioned by #MartinGal in his comment, plyr is retired and the package authors themselves recommend using dplyr instead. See https://github.com/hadley/plyr.
Hence, one option to achieve your desired result would be to make use of dplyr::recode. Additionally if you want to write your function I would suggest to pass the values to recode and the replacements as vectors instead of passing each value and replacement as separate arguments:
library(dplyr)
set.seed(42)
df <- data.frame(
Area = sample(c("rural", "suburban", "urban"), 10, replace = TRUE)
)
recode_table <- c("rural" = 1, "suburban" = 2, "urban" = 3)
recode(df$Area, !!!recode_table)
#> [1] 1 1 1 1 2 2 2 1 3 3
reval_3 <- function(data, x, values, replacements) {
recode_table <- setNames(replacements, values)
data[[x]] <- recode(data[[x]], !!!recode_table)
data
}
df <- reval_3(df, "Area", c("rural", "suburban", "urban"), 1:3)
df
#> Area
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 2
#> 8 1
#> 9 3
#> 10 3
You can use case_when with across.
If the columns that you want to change are called col1, col2 you can do -
library(dplyr)
df <- df %>%
mutate(across(c(col1, col2), ~case_when(. == 'rural' ~ 1,
. == 'suburban' ~ 2,
. == 'urban' ~ 3)))
Based on your actual column names you can also pass starts_with, ends_with, range of columns A:Z in across.

Gather duplicate column sets into single columns

The problem of gathering multiple sets of columns was already addressed here: Gather multiple sets of columns, but in my case, the columns are not unique.
I have the following data:
input <- data.frame(
id = 1:2,
question = c("a", "b"),
points = 0,
max_points = c(3, 5),
question = c("c", "d"),
points = c(0, 20),
max_points = c(5, 20),
check.names = F,
stringsAsFactors = F
)
input
#> id question points max_points question points max_points
#> 1 1 a 0 3 c 0 5
#> 2 2 b 0 5 d 20 20
The first column is an id, then I have many repeated columns (the original dataset has 133 columns):
identifier for question
points given
maximum points
I would like to end up with this structure:
expected <- data.frame(
id = c(1, 2, 1, 2),
question = letters[1:4],
points = c(0, 0, 0, 20),
max_points = c(3, 5, 5, 20),
stringsAsFactors = F
)
expected
#> id question points max_points
#> 1 1 a 0 3
#> 2 2 b 0 5
#> 3 1 c 0 5
#> 4 2 d 20 20
I have tried several things:
tidyr::gather(input, key, val, -id)
reshape2::melt(input, id.vars = "id")
Both do not deliver the desired output. Furthermore, with more columns than shown here, gather doesn't work any more, because there are too many duplicate columns.
As a workaround I tried this:
# add numbers to make col headers "unique"
names(input) <- c("id", paste0(1:(length(names(input)) - 1), names(input)[-1]))
# gather, remove number, spread
input %>%
gather(key, val, -id) %>%
mutate(key = stringr::str_replace_all(key, "[:digit:]", "")) %>%
spread(key, val)
which gives an error: Duplicate identifiers for rows (3, 9), (4, 10), (1, 7), (2, 8)
This problem was already discussed here: Unexpected behavior with tidyr, but I don't know why/how I should add another identifier. Most likely this is not the main problem, because I probably should approach the whole thing differently.
How could I solve my problem, preferably with tidyr or base? I don't know how to use data.table, but in case there is a simple solution, I will settle for that too.
Try this:
do.call(rbind,
lapply(seq(2, ncol(input), 3), function(i){
input[, c(1, i:(i + 2))]
})
)
# id question points max_points
# 1 1 a 0 3
# 2 2 b 0 5
# 3 1 c 0 5
# 4 2 d 20 20
The idiomatic way to do this in data.table is pretty simple:
library(data.table)
setDT(input)
res = melt(
input,
id = "id",
meas = patterns("question", "^points$", "max_points"),
value.name = c("question", "points", "max_points")
)
id variable question points max_points
1: 1 1 a 0 3
2: 2 1 b 0 5
3: 1 2 c 0 5
4: 2 2 d 20 20
You get the extra column called "variable", but you can get rid of it with res[, variable := NULL] afterwards if desired.
Another way to accomplish the same goal without using lapply:
We start by grabbing all the columns for question, max_points, and points then we melt each one individually and cbind them all together.
library(reshape2)
questions <- input[,c(1,c(1:length(names(input)))[names(input)=="question"])]
points <- input[,c(1,c(1:length(names(input)))[names(input)=="points"])]
max_points <- input[,c(1,c(1:length(names(input)))[names(input)=="max_points"])]
questions_m <- melt(questions,id.vars=c("id"),value.name = "questions")[,c(1,3)]
points_m <- melt(points,id.vars=c("id"),value.name = "points")[,3,drop=FALSE]
max_points_m <- melt(max_points,id.vars=c("id"),value.name = "max_points")[,3, drop=FALSE]
res <- cbind(questions_m,points_m, max_points_m)
res
id questions points max_points
1 1 a 0 3
2 2 b 0 5
3 1 c 0 5
4 2 d 20 20
You might need to clarify how you want the ID column to be handled but perhaps something like this ?
runme <- function(word , dat){
grep( paste0("^" , word , "$") , names(dat))
}
l <- mapply( runme , unique(names(input)) , list(input) )
l2 <- as.data.frame(l)
output <- data.frame()
for (i in 1:nrow(l2)) output <- rbind( output , input[, as.numeric(l2[i,]) ])
Not sure how robust it is with respect to handling different numbers of repeated columns but it works for your test data and should work if you columns are repeated equal numbers of times.

Can't sort column with R

I want to create from the dataset a list that contains word and frequency of the word . I did it and saved into val named 'mylist'. now I want to sort the list according to the frequency of the word and to create barplot from the 10 words that have the higher frequency.
but I not succeeded to sort it. I tried many ways to change the type of 'mylist' to data.frame or date.table but still the column of the frequency stay a list.
To sumup I have the DT var that contains it is a list with 2 columns x-contains the words and type is character .
The 2 column is 'v' - that contains the frequency and it is a list.
I am not succeeding to sort it by the frequency.
please help me.
library(ggplot2)
libary(MASS)
#get the data
data.uri = "http://www.crowdflower.com/wp-content/uploads/2016/03/gender-classifier-DFE-791531.csv"
pwd = getwd()
data.file.name = "gender.csv"
data.file = paste0(pwd, "./", data.file.name)
download.file(data.uri, data.file)
data = read.csv(data.file.name)
#manipulate the data
data <- data[data$X_unit_id < 815719694,]
print(data$X_unit_id)
#get all female has white sidebar
female_colors <- subset(data, data$gender=="female")
female_colors$fav_number
#get all male fav_numbers
male_colors <- subset(data, data$gender=="male")
male_colors$fav_number
text_male = subset(data, data$gender=="male")
text_male = text_male$text
print(text_male[1])
print(length(text_male))
v <- text_male[1:length(text_male)]
print(v)
print (v[1])
count_of_list = 0;
x = list()
for ( i in v) {
# Merge the two lists.
x <- c(x,unlist(strsplit(i," ")))
}
count = 0;
mylist = list()
for (word in x){
for (xWord in x){
if (word == xWord)
count = count + 1;
}
key <- word
value <- count
mylist[[ key ]] <- value
count = 0;
}
libary(data.table)
require(data.table)
DT = data.table(x=c(names(mylist)),v=c(mylist))
DT
As suggested in comments, a reproducible example would be useful in creating an answer to help you. I will suggest a proposal anyway. Try to adapt this peocedure to your data.
Convert your list to a dataframe and use order:
df <- as.data.frame(your.data)
df <- data.frame(id = c("B", "A", "D", "C"), y = c(6, 8, 1, 5))
df
id y
1 B 6
2 A 8
3 D 1
4 C 5
df2 <- df[order(df$id), ]
df2
id y
2 A 8
1 B 6
4 C 5
3 D 1
It looks like you're using a cumbersome way to calculate the word counts, something like this is faster and simpler -
library(dplyr)
foo <- c("ant", "ant", "bat", "dog","egg","ant","bat")
bar <- rnorm(7, 5, 2)
df <- data.frame(foo, bar)
group_by(df, foo) %>% summarise(n = n()) %>% arrange(desc(n))
foo n
(fctr) (int)
1 ant 3
2 bat 2
3 dog 1
4 egg 1

Boxplot of pre-aggregated/grouped data in R

In R I want to create a boxplot over count data instead of raw data. So my table schema looks like
Value | Count
1 | 2
2 | 1
...
Instead of
Value
1
1
2
...
Where in the second case I could simply do boxplot(x)
I'm sure there's a way to do what you want with the already summarized data, but if not, you can abuse the fact that rep takes vectors:
> dat <- data.frame(Value = 1:5, Count = sample.int(5))
> dat
Value Count
1 1 1
2 2 3
3 3 4
4 4 2
5 5 5
> rep(dat$Value, dat$Count)
[1] 1 2 2 2 3 3 3 3 4 4 5 5 5 5 5
Simply wrap boxplot around that and you should get what you want. I'm sure there's a more efficient / better way to do that, but this should work for you.
I solved a similar issue recently by using the 'apply' function on each column of counts with the 'rep' function:
> datablock <- apply(countblock[-1], 2, function(x){rep(countblock$value, x)})
> boxplot(datablock)
...The above assumes that your values are in the first column and subsequent columns contain count data.
A combination of rep and data.frame can be used as an approach if another variable is needed for classification
Eg.
with(data.frame(v1=rep(data$v1,data$count),v2=(data$v2,data$count)),
boxplot(v1 ~ v2)
)
Toy data:
(besides Value and Count, I add a categorical variable Group)
set.seed(12345)
df <- data.frame(Value = sample(1:100, 100, replace = T),
Count = sample(1:10, 100, replace = T),
Group = sample(c("A", "B", "C"), 100, replace = T),
stringsAsFactors = F)
Use purrr::pmap and purrr::reduce to manipulate the data frame:
library(purrr)
data <- pmap(df, function(Value, Count, Group){
data.frame(x = rep(Value, Count),
y = rep(Group, Count))
}) %>% reduce(rbind)
boxplot(x ~ y, data = data)

Resources