dplyr mutate in R - add column as concat of columns

I have a problem with using mutate{dplyr} function with the aim of adding a new column to data frame. I want a new column to be of character type and to consist of "concat" of sorted words from other columns (which are of character type, too). For example, for the following data frame:
> library(datasets)
> states.df <- data.frame(name = as.character(state.name),
+ region = as.character(state.region),
+ division = as.character(state.division))
>
> head(states.df, 3)
name region division
1 Alabama South East South Central
2 Alaska West Pacific
3 Arizona West Mountain
I would like to get a new column with the following first element:
"Alabama_East South Central_South"
I tried this:
mutate(states.df,
concated_column = paste0(sort(name, region, division), collapse="_"))
But I received an error:
Error in sort(1:50, c(2L, 4L, 4L, 2L, 4L, 4L, 1L, 2L, 2L, 2L, 4L, 4L, :
'decreasing' must be a length-1 logical vector.
Did you intend to set 'partial'?
Thank you for any help in advance!

Use sep = instead of collapse =, and paste instead of paste0 (paste0 always pastes with an empty separator). It is also unclear why you call sort here.
library(dplyr)
states.df <- data.frame(name = as.character(state.name),
region = as.character(state.region),
division = as.character(state.division))
res = mutate(states.df,
concated_column = paste(name, region, division, sep = '_'))
As for the sorting: sort takes a single vector, so region is matched to its decreasing argument, which is what triggers the error. Maybe you want:
as.data.frame(lapply(states.df, sort))
This sorts each column, and creates a new data.frame with those columns.
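To make the difference between sep and collapse concrete, here is a minimal sketch on two short vectors. The vectors first and second are made up for illustration and are not part of the question's data.

```r
first <- c("Alabama", "Alaska")
second <- c("South", "West")

# sep joins the i-th elements of each argument: one result per row
by_row <- paste(first, second, sep = "_")

# collapse joins all elements of one vector into a single string
all_in_one <- paste(first, collapse = "_")

print(by_row)
print(all_in_one)
```

Inside mutate you normally want the first form, because the new column needs one value per row.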

Adding on to Paul's answer. If you want to sort the rows, you could try order. Here is an example:
res1 <- mutate(states.df,
concated_column = apply(states.df[order(name, region, division), ], 1,
function(x) paste0(x, collapse = "_")))
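Here order sorts the data.frame states.df by name and then breaks ties by region and division. Note that the pasted values come back in the sorted row order, so they only line up with the original rows if states.df is already in that order (as it is here, since state.name is alphabetical).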
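If the goal is what the question describes, that is, sorting the words within each row before joining them, a possible base R sketch is shown below. It assumes that "sorted" means the alphabetical order given by sort(), and it uses apply over rows rather than mutate.

```r
library(datasets)

states.df <- data.frame(name = as.character(state.name),
                        region = as.character(state.region),
                        division = as.character(state.division),
                        stringsAsFactors = FALSE)

# For each row: sort the three strings, then join them with "_"
states.df$concated_column <- apply(
  states.df[, c("name", "region", "division")], 1,
  function(x) paste(sort(x), collapse = "_")
)

head(states.df$concated_column, 3)
```

Here collapse is the right argument, because sort(x) returns a single vector per row that has to be reduced to one string. For the first row this gives the value requested in the question.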

Related

How can I apply case_when(mapply (adist, x, y) <= 3 ~ x, TRUE ~ y)) to columns of different length and order

Hi, I have been trying for a while to match two large columns of names, several of which have different spellings. So far I have written some code to practice on a smaller dataset:
examples%>% mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1, TRUE ~ example_2))
This creates a new column holding the name from example_1 if it is within an edit distance of 3 of example_2. However, it does not give the name from example_2 when this criterion is not met, which I need it to do.
This code also only compares names in the same row of each column, whereas I need it to work on a dataset with two columns of different lengths (so they cannot be put in the same order).
It also needs to skip the NAs in the shorter column of names (they are only there to pad it to the length of the other one).
Does anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
mutate(across(contains("example"),as.character)) %>%
mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1,
TRUE ~ example_2))
# example_1 example_2 new_ID
#1 sheilaovensnew sheilowansknew sheilowansknew
#2 sandramaymeres candramymars candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4 grarryfieldsred grarryfieldsred grarryfieldsred
#5 terrifrank terryfrenk terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"
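The question also asks about columns of different length and order, which the row-by-row mapply call does not cover. One possible sketch, not part of the answer above, is to call adist once on both vectors: it returns a matrix of edit distances between every pair, so the nearest candidate can be picked for each name. The vectors x and y below reuse the names from the dput output in a shuffled order with one NA added as padding; the threshold of 3 comes from the question.

```r
x <- c("sheilaovensnew", "sandramaymeres", "harroldfrankknight",
       "grarryfieldsred", "terrifrank")
y <- c("terryfrenk", " grarryfieldsred", "haroldfranrinight",
       "candramymars", "sheilowansknew", NA)

y <- y[!is.na(y)]                 # drop the padding NAs before matching

d <- adist(x, y)                  # length(x) by length(y) matrix of edit distances
best <- apply(d, 1, which.min)    # column index of the closest y for each x
best_dist <- d[cbind(seq_along(x), best)]

# Keep the nearest y only when it is within the threshold
matched <- ifelse(best_dist <= 3, y[best], NA_character_)

data.frame(x, nearest = y[best], distance = best_dist, matched)
```

For large inputs the full distance matrix can become big, so this is only practical when length(x) * length(y) fits comfortably in memory.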

Merging three factors so their dependent variable sums in R

Not sure if someone has answered this - I have searched, but so far nothing has worked for me. I have a very large dataset that I am trying to narrow. I need to combine three factors in my "PROG" variable ("Grad.2","Grad.3","Grad.H") so that they become a single variable ("Grad") where the dependent variable ("NUMBER") of each comparable set of values is summed.
ie.
YEAR = "92/93" AGE = "20-24" PROG = "Grad.2" NUMBER = "50"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.3" NUMBER = "25"
YEAR = "92/93" AGE = "20-24" PROG = "Grad.H" NUMBER = "2"
turns into
YEAR = "92/93" AGE = "20-24" PROG = "Grad" NUMBER = "77"
I want to then drop all other factors for PROG so that I can compare the enrollment rates for Grad without worrying about the other factors (which I deal with separately). So my active independent variables are YEAR and AGE, while the dependent variable is NUMBER.
I hope this shows my data adequately:
structure(list
(YEAR = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("92/93", "93/94", "94/95", "95/96", "96/97",
"97/98", "98/99", "99/00", "00/01", "01/02", "02/03", "03/04",
"04/05", "05/06", "06/07", "07/08", "08/09", "09/10", "10/11",
"11/12", "12/13", "13/14", "14/15", "15/16"), class = "factor"),
AGE = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L), .Label = c("1-19",
"20-24", "25-30", "31-34", "35-39", "40+", "NR", "T.Age"), class = c("ordered",
"factor")),
PROG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
19L, 19L, 19L), .Label = c("T.Prog", "Basic", "Career", "Grad.H",
"Grad2", "Grad3", "Grad2.Qual", "Grad3.Qual", "Health.Res",
"NoProg.Grad", "NoProg.Other", "NoProg.Und.Grad", "NoProg.NoCred",
"Other", "Post.Und.Grad", "Post.Career", "Pre-U", "Career.Qual",
"Und.Grad", "Und.Grad.Qual"), class = "factor"),
NUMBER = c(104997L,
347235L, 112644L, 38838L, 35949L, 50598L, 5484L, 104991L,
333807L, 76692L)), row.names = c(7936L, 7948L, 7960L, 7972L,
7984L, 7996L, 8008L, 10459L, 10471L, 10483L), class = "data.frame")
In terms of why I am using factors, I don't know how else I should enter the data. Factors made sense, and they were how R interpreted the raw data when I uploaded it.
I am working on the suggestions below. Not had success yet, but I am still learning how to get R to do what I want, and frequently mess up. Will respond to each of you as soon as I have a reasonable answer to give. (And once I stop banging my poor head on my desk... sigh)
If I understand your question correctly, this should do it.
I am assuming your data frame is named df:
library(tidyverse)
df %>%
mutate(PROG = ifelse(PROG %in% c("Grad2", "Grad3","Grad.H"),
"Grad",
NA)) %>% ##combines the 3 Grad variables into one
filter(!is.na(PROG)) %>% ##drops the other variables
group_by(YEAR, AGE) %>%
summarise(NUMBER = sum(NUMBER))
A slightly different approach: keep only the levels you want, drop the PROG column (because you want to treat those levels as one group), and sum NUMBER while grouping by all remaining variables. df is your data. The formula is passed positionally rather than as formula = because the name of that argument differs between R versions.
aggregate(NUMBER ~ .,
data = subset(df, PROG %in% c("Grad2", "Grad3", "Grad.H"), select = -PROG),
FUN = sum)
There are multiple ways to do this, but I agree with FScott that you are likely looking for the levels() function to rename the factor levels. Here is how I would do the second step of summing.
library(magrittr)
library(dplyr)
#do the renaming of the PROG variables here
#sum by PROG
df <- df %>%
group_by(PROG) %>% # you could add more variable names here to group by i.e. group_by(PROG, AGE, YEAR)
mutate(group.sum= sum(NUMBER))
This chunk makes a new column in df named group.sum, holding the sum of NUMBER within each group defined by group_by().
If you want to condense the data.frame further, so that the individual values in NUMBER are replaced with group.sum, there are again many ways to do this, but here is a simple one.
#condense df down
df$NUMBER <- df$group.sum # overwrite NUMBER with the group totals
df <- df[,-ncol(df)] # drop group.sum, the last column
df <- unique(df)
A side note: I wouldn't recommend doing the above chunk because you lose information in your data, and your data is tidier with just the extra column group.sum.
I think the levels() function is what you are looking for. From the manual:
## combine some levels
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
z
levels(z) <- c("fruit", "veg", "fruit")
z
I named your data temp. Assigning a replacement vector to levels() merges every level that receives the same new name, so the grouping can be applied to PROG directly. The vector must follow the order of levels(temp$PROG).
levels(temp$PROG) # check the order of the 20 levels first
temp$PROG2 <- temp$PROG
levels(temp$PROG2) <- c(rep("Other",3),rep("Grad",5),rep("Other",12))
temp
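Putting the two steps together (merge the levels, then sum), a small base R sketch could look like this. The four-row data frame is made up from the example values in the question (50, 25 and 2 for the Grad levels) plus one hypothetical "Basic" row, and it uses the level names that appear in the dput output ("Grad2", "Grad3", "Grad.H").

```r
df <- data.frame(
  YEAR = factor(rep("92/93", 4)),
  AGE = factor(rep("20-24", 4)),
  PROG = factor(c("Grad2", "Grad3", "Grad.H", "Basic")),
  NUMBER = c(50L, 25L, 2L, 10L)
)

grad_levels <- c("Grad2", "Grad3", "Grad.H")

# Give the three levels the same name; R merges them into one level
levels(df$PROG)[levels(df$PROG) %in% grad_levels] <- "Grad"

# Keep only the merged level and drop the now unused levels
grad <- droplevels(subset(df, PROG == "Grad"))

# Sum NUMBER within each YEAR / AGE / PROG combination
result <- aggregate(NUMBER ~ YEAR + AGE + PROG, data = grad, FUN = sum)
result
```

Selecting levels by name avoids having to count positions in a long vector of levels.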

renaming elements of a duplicated character vector in a data.table and then to reshape the data

In a column of a data.table, I have duplicated elements. I would like to rename these consecutively in each group so that I can then use the dcast function. Let me explain with the following example:
DT <- data.table(v1 = letters[c(1,rep(4,3),10,1,rep(7,3),1,10,12)],
v2=month.abb[1:12],
v3=c("AV",rep("SINGLE",3),"DAT","AV",rep("SINGLE",3),"AV","DAT","R"),
v4 = c(3L, 2L, 2L, 5L, 4L, 3L,6L,11L,1L,7L,12L,20L))
DT
# I Want to create a new variable,"newv" where the factor,"SINGLE" in the variable (v3)
# to be named sequentially with the first letter as 's' followed by an integer
## unsuccessful attempt
#DT[v3 == "SINGLE", newv:= for (i in 1:.N) {paste0('s',i)}, by="v1"]
# What I want is the following data table
DT1 <- data.table(v1 = letters[c(1,rep(4,3),10,1,rep(7,3),1,10,12)],
v2=month.abb[1:12],
v3=c("AV",rep("SINGLE",3),"DAT","AV",rep("SINGLE",3),"AV","DAT","R"),
v4 = c(3L, 2L, 2L, 5L, 4L, 3L,6L,11L,1L,7L,12L,20L),
newv=c("AV","s1","s2","s3","DAT","AV","s1","s2","s3","AV","DAT","R"))
DT1
# Now proceeding with dcast
dt2<-DT1[,`:=`('v2'=NULL,'v3'=NULL)] # to get the right shape in the final result
dcast.data.table(dt2,v1~newv,value.var="v4")
#Aggregate function missing, defaulting to 'length'
# Instead of the length, I would like to have the values from variable "v4"
# How do I get that?
Basically, I have not succeeded with my aims : (1) to rename the common factor "SINGLE" with a sequential string and (2) to reshape the data table to get the renamed strings, s1 to s3, as the column headers with the "v4" values appearing in the table.
I would appreciate any help that I can get. Either with data.table or dplyr/tidyr.
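One possible data.table sketch for both aims is shown below. The "Aggregate function missing" message appears because some (v1, newv) pairs occur more than once (v1 "a" has "AV" three times), so dcast has to be told how to combine them. Whether summing is appropriate is an assumption; the second option keeps every value instead by adding an occurrence counter, here called occ (a name chosen for this example).

```r
library(data.table)

DT <- data.table(v1 = letters[c(1, rep(4, 3), 10, 1, rep(7, 3), 1, 10, 12)],
                 v2 = month.abb[1:12],
                 v3 = c("AV", rep("SINGLE", 3), "DAT", "AV", rep("SINGLE", 3), "AV", "DAT", "R"),
                 v4 = c(3L, 2L, 2L, 5L, 4L, 3L, 6L, 11L, 1L, 7L, 12L, 20L))

# Start from v3, then overwrite only the "SINGLE" rows, numbering within each v1
DT[, newv := v3]
DT[v3 == "SINGLE", newv := paste0("s", seq_len(.N)), by = v1]

# Option 1: one row per v1, summing v4 where a (v1, newv) pair repeats
wide_sum <- dcast(DT, v1 ~ newv, value.var = "v4", fun.aggregate = sum)

# Option 2: keep every value by adding an occurrence counter to the row key
DT[, occ := seq_len(.N), by = .(v1, newv)]
wide_all <- dcast(DT, v1 + occ ~ newv, value.var = "v4")

wide_sum
wide_all
```

seq_len(.N) inside a grouped := replaces the for loop from the unsuccessful attempt: it yields 1, 2, 3 for the matching rows of each v1 group.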

arranging strings from one data frame based on another one

I have a data frame like this one
df1<- structure(list(V1 = structure(c(8L, 4L, 5L, 7L, 6L, 3L, 9L, 1L,
2L), .Label = c("A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4", "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920",
"C1P641;C1P640;A0A061AD21;G5EEV6", "O16276", "O16520-2", "O17323-2",
"O17395", "O17403", "Q22501;A0A061AE05"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))
My second data frame looks like this
df2<- structure(list(From = structure(c(12L, 10L, 11L, 8L, 7L, 1L,
9L, 15L, 2L, 5L, 13L, 3L, 16L, 6L, 4L, 14L), .Label = c("A0A061AD21",
"A0A061AE05", "A0A061AJ82", "A0A061AJK8", "A0A061AKW6", "A0A061AL89",
"C1P640", "C1P641", "G5EEV6", "O16276", "O17395", "O17403", "Q19219",
"Q21920", "Q22501", "Q7JLR4"), class = "factor"), To = structure(c(4L,
8L, 1L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 3L, 3L, 7L), .Label = c("aat-3",
"CELE_F08G5.3", "CELE_R11A8.7", "cpsf-2", "epi-1", "pps-1", "R11A8.7",
"ugt-61"), class = "factor")), .Names = c("From", "To"), class = "data.frame", row.names = c(NA,
-16L))
df2 is derived from df1, but some information has been added and some removed. I want to rebuild df2 so that it has the same rows as df1, and arrange the column named To accordingly.
So the output should look like this
From To
O17403 cpsf-2
O16276 ugt-61
O16520-2 -
O17395 aat-3
O17323-2 -
C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
Q22501;A0A061AE05 pps-1
A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7; R11A8.7
That is, O17403 is in df2 and is the only string in its row of df1, so it stays the same. O16276 is likewise the only string in its row of df1, so it also stays the same.
O16520-2 is in df1 but not in df2, so its entry in the To column is a hyphen.
The same applies to the rest. C1P641;C1P640;A0A061AD21;G5EEV6 are all in the same row of df1 and share the same To value, so they stay together as in df1 with a single epi-1.
Probably the best approach is to use df1 as a template and fill in To from df2: rows with a match in df2 get its To value, and rows without a match get only a hyphen.
It is very complicated and I could not even think of how to do it. I will appreciate any help.
To solve this I split the semicolon-delimited strings and created a nested for-for-if-if loop.
Here's the logic behind the loop, which runs against the data.frame of split strings (tmp):
Fix data classes (i.e. change factor to character to avoid conflicting level sets) and append a temporary To column to tmp.
For each row and column of tmp, check whether the cell contains a valid string for matching and has a matched value in df2$To; if not, go to the next iteration.
If it does, look at the matching value in To from df2 and check whether we already have that value in tmp$To (if so, go to the next iteration).
If there's a new matched value in df2$To, put it in the corresponding cell of tmp$To, appending it after any preceding matches with a semicolon if it is not the first match for that row.
df1$V1 <- as.character(df1$V1)
df2$From <- as.character(df2$From)
df2$To <- as.character(df2$To)
library(stringr)
tmp <- as.data.frame(str_split_fixed(df1$V1, ";",n=5), stringsAsFactors = FALSE)
tmp$To <- as.character(NA)
for(j in 1:nrow(tmp)){
for(i in 1:ncol(tmp)){
if(length(df2$To[df2$From == tmp[j,i]]) == 0 | is.null(tmp[j,i])){
next
} else if(length(df2$To[df2$From == tmp[j,i]] ) == 1 & !is.na(tmp[j,i])){
if(is.na(tmp$To[j]) | tmp$To[j] == df2$To[df2$From == tmp[j,i]]){
tmp$To[j] <- df2$To[df2$From == tmp[j,i] ]
} else{
tmp$To[j] <- paste(tmp$To[j],";",df2$To[df2$From == tmp[j,i] ], sep="")
}
} else{
next
}
}
}
df1 <- data.frame(From=df1$V1, To=tmp$To)
df1
From To
1 O17403 cpsf-2
2 O16276 ugt-61
3 O16520-2 <NA>
4 O17395 aat-3
5 O17323-2 <NA>
6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
7 Q22501;A0A061AE05 pps-1
8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
One way of doing this is to use the splitstackshape package (use cSplit). I converted the factors to character strings to simplify (and get rid of warnings).
library(dplyr)
library(data.table) # cSplit from 'splitstackshape' returns a 'data.table'.
library(splitstackshape)
### Remove the factors for convenience of manipulation
df1 <- df1 %>% mutate(From = as.character(V1))
df2 <- df2 %>% mutate(From = as.character(From), To = as.character(To))
### 'cSplit' will split on ';' and create a new row for each item. The
### original 'From' column is kept around as cSplit removes the split column.
### 'rn' (row number) is used for ordering later.
cSplit(df1 %>% mutate(rn = row_number(), From_temp = From),
"From_temp", sep = ";", direction = "long", drop = FALSE, type.convert = FALSE) %>%
left_join(df2, by = c(From_temp = 'From')) %>% # Join to 'df2' to get the 'To' column
group_by(From, rn) %>% # Group by original 'From' column.
summarise(To = paste(sort(unique(na.omit(To))), collapse = ';'), # Create 'To' by joining 'To' Values
To = ifelse(To=='', '-', To)) %>% # Set empty values to '-'
ungroup %>%
arrange(rn) %>% # Sort by original row number and
select(-rn) # remove 'rn' column.
## From To
## <chr> <chr>
## 1 O17403 cpsf-2
## 2 O16276 ugt-61
## 3 O16520-2 -
## 4 O17395 aat-3
## 5 O17323-2 -
## 6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
## 7 Q22501;A0A061AE05 pps-1
## 8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
## 9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
There may be a cleaner way to do this with dplyr that doesn't require splitstackshape.
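For comparison, here is a sketch that needs no extra packages: it turns df2 into a named lookup vector and resolves each semicolon-separated identifier with strsplit. The data below is a hand-copied subset of the question's df1 and df2 (four rows only), and the hyphen for unmatched rows follows the output requested in the question.

```r
df1 <- data.frame(
  V1 = c("O17403", "O16520-2",
         "C1P641;C1P640;A0A061AD21;G5EEV6",
         "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  From = c("O17403", "C1P641", "C1P640", "A0A061AD21", "G5EEV6",
           "A0A061AL89", "A0A061AJK8", "Q21920"),
  To = c("cpsf-2", "epi-1", "epi-1", "epi-1", "epi-1",
         "CELE_R11A8.7", "CELE_R11A8.7", "R11A8.7"),
  stringsAsFactors = FALSE
)

# Named vector: names are the From identifiers, values are the To labels
lookup <- setNames(df2$To, df2$From)

# One character vector of identifiers per row of df1
ids <- strsplit(df1$V1, ";", fixed = TRUE)

to <- vapply(ids, function(x) {
  hits <- unique(unname(lookup[x]))   # identifiers missing from df2 give NA
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) "-" else paste(hits, collapse = ";")
}, character(1))

result <- data.frame(From = df1$V1, To = to, stringsAsFactors = FALSE)
result
```

Unlike the sort(unique(...)) call in the cSplit version, this keeps the To values in the order in which their identifiers appear in the row.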

Using "apply" functions across multiple data frames

I'm having an issue using apply functions (which I assume is the right way to do the following) across multiple data frames.
Some example data (3 different data frames, but the problem I'm working on has upwards of 50):
biz <- data.frame(
country = c("england","canada","australia","usa"),
businesses = sample(1000:2500,4))
pop <- data.frame(
country = c("england","canada","australia","usa"),
population = sample(10000:20000,4))
restaurants <- data.frame(
country = c("england","canada","australia","usa"),
restaurants = sample(500:1000,4))
Here's what I ultimately want to do:
1) Sort each data frame from largest to smallest, according to the variable that's included
dataframe <- dataframe[order(dataframe$VARIABLE, decreasing = TRUE), ]
2) then create a vector variable that gives me the rank for each
dataframe$rank <- 1:nrow(dataframe)
3) Then create another data frame that has one column of the countries and the rank for each of the variables of interest as other columns. Something that would look like (rankings aren't real here):
country.rankings <- structure(list(country = structure(c(5L, 1L, 6L, 2L, 3L, 4L), .Label = c("brazil",
"canada", "england", "france", "ghana", "usa"), class = "factor"),
restaurants = 1:6, businesses = c(4L, 5L, 6L, 3L, 2L, 1L),
population = c(4L, 6L, 3L, 2L, 5L, 1L)), .Names = c("country",
"restaurants", "businesses", "population"), class = "data.frame", row.names = c(NA,
-6L))
So I'm guessing there's a way to put each of these data frames together into a list, something like:
lib <- c(biz, pop, restaurants)
And then do an lapply across that to 1) sort, 2)create the rank variable and 3) create the matrix or data frame of rankings for each variable (# of businesses, population size, # of restaurants) for each country. Problem I'm running into is that writing the lapply function to sort each data frame runs into issues when I try to order by the variable:
sort <- lapply(lib,
function(x){
x <- x[order(x[,2]),]
})
returns the error message:
Error in `[.default`(x, , 2) : incorrect number of dimensions
because I'm trying to apply column headings to a list. But how else would I tackle this problem when the variable names are different for every data frame (but keeping in mind that the country names are consistent)
(would also love to know how to use this using plyr)
Ideally I'd would recommend data.table for this.
However, here is a quick solution using data.frame
Try this:
Step1: Create a list of all data.frames
varList <- list(biz,pop,restaurants)
Step2: Combine all of them in one data.frame
temp <- varList[[1]]
for(i in 2:length(varList)) temp <- merge(temp,varList[[i]],by = "country")
Step3: Get ranks:
cbind(temp,apply(temp[,-1],2,rank))
You can remove the undesired columns if you want!!
cbind(temp[,1:2],apply(temp[,-1],2,rank))[,-2]
Hope this helps!!
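The error in the question comes from building the collection with c(): on data frames it concatenates their columns into one flat list, so lapply receives plain vectors and x[,2] fails. A minimal sketch of the difference, using two small made-up data frames instead of the sampled ones:

```r
biz <- data.frame(country = c("england", "canada"),
                  businesses = c(1500L, 2000L))
pop <- data.frame(country = c("england", "canada"),
                  population = c(12000L, 18000L))

flat <- c(biz, pop)        # plain list of 4 column vectors
nested <- list(biz, pop)   # list of 2 data frames

length(flat)
length(nested)

# With list(), each x is a data frame, so it can be ordered by its second column
sorted <- lapply(nested, function(x) x[order(x[[2]], decreasing = TRUE), ])
sorted
```

Indexing the second column by position (x[[2]]) is what lets one function handle data frames whose variable names differ.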
totaldatasets <- c('biz','pop','restaurants')
totaldatasetslist <- vector(mode = "list",length = length(totaldatasets))
for ( i in seq(length(totaldatasets)))
{
totaldatasetslist[[i]] <- get(totaldatasets[i])
}
totaldatasetslist2 <- lapply(
totaldatasetslist,
function(x)
{
# use x, the current data frame, rather than indexing the list with the leftover loop variable i
temp <- data.frame(
country = x[,1],
countryrank = rank(x[,2]) # rank(-x[,2]) would rank from largest to smallest
)
colnames(temp) <- c('country', colnames(x)[2])
return(temp)
}
)
Reduce(
merge,
totaldatasetslist2
)
Output: one row per country, with the columns businesses, population and restaurants holding each country's rank for that variable. The actual ranks depend on the random draws from sample(), so they will differ between runs.
