Replace NA with mode from categorical dataset in R

I have a dataset of 10 categorical variables that contain NA observations. I want to replace the NA values in each column with that column's mode. I plotted a histogram of each variable to find its most frequent value, so I already know which value should replace the NAs in each column.
I saw there was a related post, but I already know what values to replace. Here's the link: Replace mean or mode for missing values in R
Here's code to reproduce a dataset:
> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
Here's an example:
> #The head of the first five observations
> head(SmallStoredf, n=5)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 <NA> Male <NA> <NA> <NA> <NA> <NA>
2 45-54 Female <NA> <NA> <NA> <NA> <NA>
5 45-54 Female 75k-100k Married Yes Own 150k-200k
6 25-34 Male 75k-100k Married No Own 300k-350k
7 35-44 Female 125k-150k Married Yes Own 250k-300k
Occupation Education LengthofResidence
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
5 <NA> Completed High School 9 Years
6 <NA> Completed High School 11-15 years
7 <NA> Completed High School 2 Years
In this example, I want NAs in HomeOwnerStatus replaced with Own, HomeMarketValue with 350K-500K, and Occupation with Professional.
EDIT: I tried inputting the values, but got warnings about three of the columns.
> replacementVals <- c(Age = "45-54", Gender = "Male", HouseholdIncome = "50K-75K",
+ MaritalStatus = "Single", PresenceofChildren = "No",
+ HomeOwnerStatus = "Own", HomeMarketValue = "350K-500K",
+ Occupation = "Professional", Education = "Completed High School",
+ LengthofResidence = "11-15yrs")
> indx1 <- replacementVals[col(df2)][is.na(df2[,names(replacementVals)])]
> df2[is.na(df2[,names(replacementVals)])] <- indx1
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, thisvar, value = c("50K-75K", "50K-75K", :
invalid factor level, NA generated
#2: In `[<-.factor`(`*tmp*`, thisvar, value = c("350K-500K", "350K-500K", :
invalid factor level, NA generated
#3: In `[<-.factor`(`*tmp*`, thisvar, value = c("11-15yrs", "11-15yrs", :
invalid factor level, NA generated
Here's the output:
> head(SmallStoredf)
Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 45-54 Male <NA> Single No Own <NA>
2 45-54 Female <NA> Single No Own <NA>
5 45-54 Female 75k-100k Married Yes Own 150k-200k
6 25-34 Male 75k-100k Married No Own 300k-350k
7 35-44 Female 125k-150k Married Yes Own 250k-300k
8 55-64 Male 75k-100k Married No Own 150k-200k
Occupation Education LengthofResidence
1 Professional Completed High School <NA>
2 Professional Completed High School <NA>
5 Professional Completed High School 9 Years
6 Professional Completed High School 11-15 years
7 Professional Completed High School 2 Years
8 Professional Completed High School 16-19 years
Only NA values in some columns were replaced.
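The warnings say "invalid factor level", which suggests that the three affected columns are factors and that the replacement strings are not among their existing levels (for example "50K-75K" versus the lowercase "75k-100k" shown in the data). A minimal sketch of that diagnosis and one possible fix, using a made-up data frame `toy` rather than the real data:

```r
# Hypothetical toy data: a factor column whose levels do not include the replacement
toy <- data.frame(HouseholdIncome = factor(c(NA, "75k-100k", "125k-150k")))

# Assigning a string that is not an existing level yields NA (with a warning)
bad <- toy
bad$HouseholdIncome[is.na(bad$HouseholdIncome)] <- "50K-75K"
is.na(bad$HouseholdIncome[1])   # the NA is still there

# Converting factor columns to character first avoids the problem
fixed <- toy
is_fac <- vapply(fixed, is.factor, logical(1))
fixed[is_fac] <- lapply(fixed[is_fac], as.character)
fixed$HouseholdIncome[is.na(fixed$HouseholdIncome)] <- "50K-75K"
fixed
```

If the columns need to stay factors, spelling the replacement exactly like an existing level (or adding the new level first) should have the same effect.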

I amended your reproducible example a little bit, here's the setup
> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20),
stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
> dat[6,2]<-NA
#output
# x y
#1 a 1.511781168450847978590
#2 b 0.389843236411431093291
#3 b -0.621240580541803755210
#4 c -2.214699887177499881830
#5 <NA> 1.124930918143108193874
#6 c NA
#7 c -0.016190263098946087311
#8 b 0.943836210685299215051
#9 b 0.821221195098088552200
#10 <NA> 0.593901321217508826322
#11 a 0.918977371608218240873
#12 a 0.782136300731067102276
#13 c 0.074564983365190601328
#14 b -1.989351695863372793127
#15 <NA> 0.619825747894710232799
#16 b -0.056128739529000784558
#17 c -0.155795506705329295238
#18 c -1.470752383899274429169
#19 b -0.478150055108620353206
#20 c 0.417941560199702411005
Now define your replacement values, named by the columns whose NAs you want replaced:
replacementVals<-c(x="Xreplace", y="Yreplace")
and the next call replaces them all in one shot:
dat[is.na(dat[,names(replacementVals)])]<-replacementVals
# x y
#1 a 1.51178116845085
#2 b 0.389843236411431
#3 b -0.621240580541804
#4 c -2.2146998871775
#5 Xreplace 1.12493091814311
#6 c Yreplace
#7 c -0.0161902630989461
#8 b 0.943836210685299
#9 b 0.821221195098089
#10 Yreplace 0.593901321217509
#11 a 0.918977371608218
#12 a 0.782136300731067
#13 c 0.0745649833651906
#14 b -1.98935169586337
#15 Xreplace 0.61982574789471
#16 b -0.0561287395290008
#17 c -0.155795506705329
#18 c -1.47075238389927
#19 b -0.47815005510862
#20 c 0.417941560199702
But as akrun pointed out, and subsequently solved, this doesn't map well to your second data frame example: the replacement vector is simply recycled over the NA positions in column order, which is why row 10 above received Yreplace in column x. The fix below is taken straight from the comments they made (so either way they should probably get the check on this question).
We'll do the setup; I'm not going to print anything except the result.
HomeOwnerStatus = c(NA,NA,NA ,"Rent", "Rent" )
HomeMarketValue = c(NA,NA,NA, "350k", "350k")
Occupation = c(NA,NA,NA, NA, NA)
SmallStoreddf<-data.frame(HomeOwnerStatus,HomeMarketValue,Occupation, stringsAsFactors=FALSE)
replacementVals<-c("HomeOwnerStatus" = "Own", "HomeMarketValue"="350k", "Occupation"="Professional")
Then in two steps (which could be combined into one really long line) you go
#get the values that we will be replacing
indx1<-replacementVals[col(SmallStoreddf)][is.na(SmallStoreddf[, names(replacementVals)])]
#do the replacement
SmallStoreddf[is.na(SmallStoreddf[,names(replacementVals)])] <-indx1
# HomeOwnerStatus HomeMarketValue Occupation
#1 Own 350k Professional
#2 Own 350k Professional
#3 Own 350k Professional
#4 Rent 350k Professional
#5 Rent 350k Professional
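Because the one-shot assignment depends on the replacement vector lining up with the NA positions, a column-by-column loop can be easier to check. The following is only a sketch on the same toy data; the names `df_small` and `replacement_vals` are made up for illustration:

```r
df_small <- data.frame(
  HomeOwnerStatus = c(NA, NA, NA, "Rent", "Rent"),
  HomeMarketValue = c(NA, NA, NA, "350k", "350k"),
  Occupation      = c(NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
replacement_vals <- c(HomeOwnerStatus = "Own",
                      HomeMarketValue = "350k",
                      Occupation      = "Professional")

for (nm in names(replacement_vals)) {
  col <- as.character(df_small[[nm]])        # also covers factor or all-NA logical columns
  col[is.na(col)] <- replacement_vals[[nm]]  # each column gets only its own value
  df_small[[nm]] <- col
}
df_small
```

Each column is filled from the element of `replacement_vals` with the same name, so nothing is recycled across columns.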

Try: (Using your second example as it was a bit unclear when you showed two datasets)
indx <- which(is.na(SmallStoredf), arr.ind=TRUE)
SmallStoredf[indx] <- c("Own", "350K-500K", "Professional")[indx[,2]]
SmallStoredf
# HomeOwnerStatus HomeMarketValue Occupation
#1 Own 350K-500K Professional
#2 Own 350K-500K Professional
#3 Own 350K-500K Professional
#4 Rent 350k-500k Professional
#5 Rent 500k-1mm Professional

Upgrading comment.
If you want to replace the missing data with the most frequent category, note that several categories within a variable may be tied for the highest count. So in the code below, the replacements are randomly sampled from the categories that are most frequent.
# some example data with missing
set.seed(1)
dat <- data.frame(x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
w=rnorm(20),
z=sample(letters[1:3],20,TRUE),
stringsAsFactors=FALSE)
dat[c(5,10,15),1] <- NA
dat[c(3,7),2] <- NA
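# function to get replacement for missing
# sample is used to randomly select categories, allowing for the case
# when the maximum frequency is shared by more than one category
f <- function(x) {
  tab <- table(x)      # counts of the non-missing categories
  l <- sum(is.na(x))   # number of replacements needed
  sample(names(tab)[tab==max(tab)], l, TRUE)
}
# as we are using sample, set.seed before replacing
set.seed(1)
# dat[[i]] extracts the column as a vector; is.numeric(dat[i]) would test a
# one-column data frame and therefore always be FALSE
for(i in seq_along(dat)){
  if(!is.numeric(dat[[i]]))
    dat[[i]][is.na(dat[[i]])] <- f(dat[[i]])
}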
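If random tie-breaking is not wanted, a deterministic variant could always take the first of the most frequent categories. This is a sketch only; `first_mode` is a hypothetical helper name, and the tie rule (first level in sorted order) is an assumption, not part of the answer above:

```r
# Hypothetical helper: most frequent non-missing value;
# ties go to the first category in table()'s sorted order
first_mode <- function(x) {
  tab <- table(x)               # table() drops NA by default
  names(tab)[which.max(tab)]    # which.max() returns the first maximum
}

x <- c("a", "b", "b", "a", NA, "c")
x[is.na(x)] <- first_mode(x)
x
```

Here "a" and "b" are tied, so the missing value is filled with "a".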
A gentle warning: you should think carefully before imputing missing data this way. For example, income is often more likely to be missing for the highest and lowest categories, so this method may incorrectly impute an average wage. You should consider why each variable is missing and whether it is reasonable to assume the data are MCAR (missing completely at random) or MAR (missing at random). If so, I would then consider a more robust method of imputation (e.g. the mice package).

Related

Extracting time information from a raw dataset in R

I would like to extract ID, item information and time information from an unstructured dataset. Here is what my sample dataset looks like:
df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
"Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
"Item 1: Time in seconds","Item 1: Errors",
"Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
"Item 2: Time in seconds","Item 2: Errors"),
School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))
> df
Text_1 School.2 X_Elementary_School..3 School.4 Y_Elementary_School..2
1 Scoring Teacher: Bill: Teacher: John:
2 1 = Incorrect DC Name: X District DC Name: X District
3 Text1 Date (mm/dd/yyyy): 10/7/21 Date (mm/dd/yyyy): 11/7/21
4 Text2 Child Grade: K Child Grade: K
5 Text3 Student Study ID: 123-2222-2: Student Study ID: 112-1111-3:
6 Text4 <NA> <NA> <NA> <NA>
7 Demo 1: Color Naming <NA> <NA> 0 <NA>
8 Amarillo <NA> <NA> <NA> <NA>
9 Azul <NA> <NA> 1 <NA>
10 Verde <NA> <NA> <NA> <NA>
11 Azul <NA> <NA> <NA> <NA>
12 Demo 1: Errors 0 <NA> 0 <NA>
13 Item 1: Color naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
14 Amarillo 0 <NA> 0 <NA>
15 Azul 1 <NA> 1 <NA>
16 Verde <NA> <NA> <NA> <NA>
17 Azul <NA> <NA> <NA> <NA>
18 Item 1: Time in seconds <NA> <NA> 120 <NA>
19 Item 1: Errors 0 <NA> 0 <NA>
20 Item 2: Shape Naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
21 Cuadrado/Cuadro 0 <NA> <NA> <NA>
22 Cuadrado/Cuadro <NA> <NA> 1 <NA>
23 Círculo <NA> <NA> 0 <NA>
24 Estrella 1 <NA> 1 <NA>
25 Círculo 1 <NA> <NA> <NA>
26 Triángulo 0 <NA> 1 <NA>
27 Item 2: Time in seconds <NA> <NA> 110 <NA>
28 Item 2: Errors 0 <NA> 0 <NA>
Here, when the first column contains Item #: Time in seconds, I would like to keep the corresponding values under the School.2 and School.4 columns. The first student has empty cells for the Item 1 and Item 2 times, so those are NA. The second student has times of 120 and 110. Both students have two items in the example dataset.
The real dataset has many more columns, so I need something that works as a generalized loop.
My desired output would be:
> time
id itemid time
1 123-2222-2 Item 1 NA
2 123-2222-2 Item 2 NA
3 112-1111-3 Item 1 120
4 112-1111-3 Item 2 110
This is my attempt but I could not add the id yet.
time.data <- df %>%
filter(str_detect(Text_1, 'Time in seconds')) # %>%
# select(time = 4)
select_time_cols <- seq(from = 2, to = ncol(time.data), by = 2)
time <- time.data %>%
select(time = select_time_cols)
time.t<-as.data.frame(t(time))
rownames(time.t)<-seq(1,nrow(time.t),1)
colnames(time.t)<-paste0('item',seq(1,ncol(time.t),1))
time.t<-apply(time.t,2,as.numeric)
time.t<-as.data.frame(time.t)
> time.t
item1 item2
1 NA NA
2 120 110
I came up with a solution with your test dataset, though this isn't particularly robust or elegant, so there will likely be better options out there. The key thing I did was shift the text3 row over, so that the id and time were in the same column. I put notes in the code to illuminate my steps. Hopefully this points you in the right direction or prompts those with better coding skills than my own!
library(dplyr)
library(tidyr)
df2<-df%>% #Turning all to character, so that I can move the row without interfering with factor level
mutate_all(as.character)
df2[5, 2:(ncol(df2) - 1)] <- df2[5, 3:ncol(df2)] #shifting row 5, which is the "Text3" that has studentID to be the same column as the times
df3<-df2%>%
filter(grepl("Text3|Time in seconds", Text_1))%>% #removing unnecessary rows
mutate(type = case_when(grepl("Text", Text_1) ~ "itemid", #Relabeling the Text_1 column
grepl("Item 1", Text_1) ~ "1",
grepl("Item 2", Text_1) ~ "2"))%>%
select(grep("type|^School", names(.))) #only keeping needed columns
colnames(df3) <- df3[1,] #Taking the first row and making it the column names
df3 <- df3[-1, ] #removing row 1, since it was made into column names
df3%>%
tidyr::pivot_longer(-itemid, names_to = "id", values_to = "time")%>% #Making the data into longer format
select(id, itemid, time)%>% #relocating columns to match desired output
arrange(desc(id)) #sorting to match desired output
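For a version that loops over any number of students without shifting rows, one possibility is to locate the "Student Study ID:" label cells and read the ID from the column to their right. The sketch below uses a shortened, made-up copy of the layout (`raw`, `X_School` and `Y_School` are illustrative names) and assumes the ID always sits one column to the right of its label and the times sit in the label's own column:

```r
# Toy data mirroring the layout in the question (hypothetical, shortened)
raw <- data.frame(
  Text_1   = c("Text3", "Item 1: Color naming", "Item 1: Time in seconds",
               "Item 2: Shape Naming", "Item 2: Time in seconds"),
  School.2 = c("Student Study ID:", "1 = Incorrect responses", NA,
               "1 = Incorrect responses", NA),
  X_School = c("123-2222-2:", "Child response", NA, "Child response", NA),
  School.4 = c("Student Study ID:", "1 = Incorrect responses", "120",
               "1 = Incorrect responses", "110"),
  Y_School = c("112-1111-3:", "Child response", NA, "Child response", NA),
  stringsAsFactors = FALSE
)

id_row     <- which(raw$Text_1 == "Text3")          # row that holds the student IDs
time_rows  <- grep("Time in seconds", raw$Text_1)   # rows that hold the times
label_cols <- which(unlist(raw[id_row, ]) == "Student Study ID:")

time_long <- do.call(rbind, lapply(label_cols, function(j) {
  data.frame(
    id     = sub(":$", "", raw[id_row, j + 1]),       # ID is one column to the right
    itemid = sub(":.*$", "", raw$Text_1[time_rows]),  # "Item 1", "Item 2", ...
    time   = as.numeric(raw[time_rows, j]),           # times are in the label column
    stringsAsFactors = FALSE
  )
}))
rownames(time_long) <- NULL
time_long
```

On the toy data this gives one row per student and item, with NA times for the first student and 120 and 110 for the second.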

dplyr relative frequency within group

(hopefully) simplified
I asked farmers of two farm types (organic and conventional) to report whether each of two species (A, B) occurs (0/1) on their land.
So, I have
df<-data.frame(id=1:10,
farmtype=c(rep("org",4), rep("conv",6)),
spA=c(0,0,0,1,1,1,1,1,1,1),
spB=c(1,1,1,0,0,0,0,0,0,0)
)
And my question is pretty simple... In what percentage of organic or conventional farms do the species occur?
Desired solution:
sp A occurs in 25% of org farms and 100% of conv farms
sp B occurs in 75% of org farms and 0% of conv farms
None of the solutions outlined below achieve that.
**additional question**
All I want is a simple ggplot with the species on the x-axis and the percentage of detection on the y-axis (once for org and once for conv).
ggplot(df.melt)+
geom_bar(aes(x=species, fill=farmtype))
### but, of course, this should show the species detections, not just the farm types
janitor's tabyl is your friend. What you're calculating is "row"-percentages, but what you want is "col"-percentages. E.g.
library(janitor)
set.seed(1234)
df <- data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
species=sample(letters[1:4], 100, replace=T),
occ=sample(c("yes","no"),100, replace=T))
df |>
tabyl(species,farmtype) |>
adorn_percentages("col")
# species conventional organic
# a 0.2553191 0.2641509
# b 0.2765957 0.2452830
# c 0.2553191 0.1886792
# d 0.2127660 0.3018868
But you could also use your own approach. Group by farmtype in the second group_by and remember to save the dataframe. This would be easier to use with ggplot2 as it is already in a long format.
df <-
df %>%
group_by(species, farmtype) %>%
dplyr::summarise(count = n()) %>%
group_by(farmtype) %>%
dplyr::mutate(prop = count/sum(count))
df
# A tibble: 8 × 4
# Groups: farmtype [2]
# species farmtype count prop
# <chr> <chr> <int> <dbl>
# a conventional 12 0.255
# a organic 14 0.264
# b conventional 13 0.277
# b organic 13 0.245
# c conventional 12 0.255
# c organic 10 0.189
# d conventional 10 0.213
# d organic 16 0.302
df %>%
ggplot(aes(x = species, y = prop, fill = farmtype)) +
geom_col()
Update: A variant of second option also suggested by Isaac Bravo.
Here you can have another option using your approach:
df %>%
group_by(farmtype, species) %>%
summarize(n = n()) %>%
mutate(percentage = n/sum(n))
OUTPUT:
farmtype species n percentage
<chr> <chr> <int> <dbl>
1 conventional a 12 0.235
2 conventional b 12 0.235
3 conventional c 12 0.235
4 conventional d 15 0.294
5 organic a 16 0.327
6 organic b 9 0.184
7 organic c 14 0.286
8 organic d 10 0.204
If I understand the poster's first question correctly, the poster seeks the proportion of organic versus conventional farm types among farms that grew a given species. This can also be accomplished using the data.table package as follows.
First, the example data set is recreated by setting the seed.
set.seed(1234) ##setting seed for reproducible example
df<-data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
species=sample(letters[1:4], 100, replace=T),
occ=sample(c("yes","no"),100, replace=T))
require(data.table)
df = data.table(df)
Next, the "no" answers are filtered out because we are only interested in farms that reported growing the species in the "occ" column. We then count the occurrences of the species for each farm type. The column "N" gives the count.
#Filter out "no" answers because they shouldn't affect the result sought
#and count the number of farmtypes that reported each species
ans = df[occ == "yes",.N,by = .(farmtype,species)]
ans
# farmtype species N
#1: conventional a 8
#2: conventional c 8
#3: organic a 6
#4: conventional d 11
#5: organic d 5
#6: organic c 7
#7: organic b 4
#8: conventional b 6
The total occurrences of each species for either farm type are then counted. As a check for this result, each row for a given species should give the same species total.
#Total number of farms that reported the species
ans[,species_total := sum(N), by = species] #
ans
# farmtype species N species_total
#1: conventional a 8 14
#2: conventional c 8 15
#3: organic a 6 14
#4: conventional d 11 16
#5: organic d 5 16
#6: organic c 7 15
#7: organic b 4 10
#8: conventional b 6 10
Finally, the columns are combined to calculate the proportion of organic or conventional farms for each species that was reported. As a check against the result, the proportion of organic and the proportion of conventional for each species should sum to 1 because there are only two farm types.
##Calculate the proportion of each farm type reported for each species
ans[, proportion := N/species_total]
ans
# farmtype species N species_total proportion
#1: conventional a 8 14 0.5714286
#2: conventional c 8 15 0.5333333
#3: organic a 6 14 0.4285714
#4: conventional d 11 16 0.6875000
#5: organic d 5 16 0.3125000
#6: organic c 7 15 0.4666667
#7: organic b 4 10 0.4000000
#8: conventional b 6 10 0.6000000
##Gives the proportion of organic farms specifically
ans[farmtype == "organic"]
# farmtype species N species_total proportion
#1: organic a 6 14 0.4285714
#2: organic d 5 16 0.3125000
#3: organic c 7 15 0.4666667
#4: organic b 4 10 0.4000000
If, on the other hand, one wanted to calculate the fraction of each species to all species occurrences reported for organic or conventional farms, you could use this code:
ans = df[,.N, by = .(species, farmtype,occ)] ##count by species,farmtype, and occurrence
ans[, spf := sum(N), by = .(occ,farmtype)] ##spf is the total number of times an occurrence was reported for each type
ans[, prop := N/spf]
ans = ans[occ == "yes"] ##proportion of the given species to all species occurrences reported for each farm type
ans
# species farmtype occ N spf prop
#1: a conventional yes 8 33 0.2424242
#2: c conventional yes 8 33 0.2424242
#3: a organic yes 6 22 0.2727273
#4: d conventional yes 11 33 0.3333333
#5: d organic yes 5 22 0.2272727
#6: c organic yes 7 22 0.3181818
#7: b organic yes 4 22 0.1818182
#8: b conventional yes 6 33 0.1818182
This result means that, for example, conventional farmers reported species "a" about 24.2% of the times that they reported any species. The result can be verified by selecting a species and farmtype and calculating manually as a spot check.
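For the simplified data in the question itself (one 0/1 column per species), the share of farms in which a species occurs is just the mean of the indicator within each farm type. A base R sketch under that reading of the question; the names `occ_share` and `occ_long` are made up here:

```r
df <- data.frame(id = 1:10,
                 farmtype = c(rep("org", 4), rep("conv", 6)),
                 spA = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
                 spB = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0))

# The mean of a 0/1 indicator is the proportion of farms where the species occurs
occ_share <- aggregate(cbind(spA, spB) ~ farmtype, data = df, FUN = mean)
occ_share
# conv: spA = 1.00, spB = 0.00; org: spA = 0.25, spB = 0.75

# Long format (one row per farm type and species), convenient for plotting
occ_long <- data.frame(
  farmtype = rep(occ_share$farmtype, times = 2),
  species  = rep(c("spA", "spB"), each = nrow(occ_share)),
  share    = c(occ_share$spA, occ_share$spB)
)
occ_long
```

The long table has species, farm type and share in separate columns, which is the shape a bar chart with species on the x-axis and one bar per farm type would need.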

converting as factor for a list of data frames

I am trying to create a custom function to give labels to modified list of data frames. For example, I have a data frame like below.
df<-data.frame(
gender = c(1,2,1,2,1,2,1,2,2,2,2,1,1,2,2,2,2,1,1,1,1,1,2,1,2,1,2,2,2,1,2,1,2,1,2,1,2,2,2),
country = c(3,3,1,2,5,4,4,4,4,3,3,4,3,4,2,1,4,2,3,4,4,4,3,1,2,1,5,5,4,3,1,4,5,2,3,4,5,1,4),
Q1=c(1,1,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,1,1,1,NA,1,1,NA,NA,NA,NA,1,NA,NA,NA,NA,1,NA,1),
Q2=c(1,1,1,1,1,NA,NA,NA,NA,1,1,1,1,1,NA,NA,NA,1,1,1,NA,1,1,1,1,1,NA,NA,NA,1,1,1,1,1,1,1,NA,NA,NA),
Q3=c(1,1,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,1,1,1,NA,NA,NA,1,NA,NA,1,1,1,1,1,NA,NA,1),
Q4=c(1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
Q5=c(1,2,1,1,1,2,1,2,2,1,2,NA,1,1,2,2,2,1,1,1,2,NA,2,1,1,1,2,2,2,NA,1,2,2,1,1,1,2,2,2)
)
I understand your goal to be the following: You want to take a list of data frames (ldat). For each of the dataframes in the list (df, df2) you want to take some existing columns (Q1, Q2, Q3) and replicate them with new names in the same data frame (Q1_new, Q2_new, Q3_new). This you could achieve like this:
variables = c("Q1","Q2","Q3")
new_label =c("Q1_new","Q2_new","Q3_new")
newdfs <- lapply(ldat, FUN = function(x) {
x[,new_label] = x[,variables]
return(x)})
head(newdfs$ALL)
gender country Q1 Q2 Q3 Q4 Q5 cc2 Q1_new Q2_new Q3_new
1 Male USA Yes Available Partner Depends on sales Local 1 Yes Available Partner
2 female USA Yes Available Partner <NA> Overseas NA Yes Available Partner
3 Male CAN <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
4 female EU <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
5 Male UK <NA> Available <NA> <NA> Local 1 <NA> Available <NA>
6 female BR <NA> <NA> <NA> <NA> Overseas NA <NA> <NA> <NA>
Is this what you had in mind?
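If the goal is instead to turn the numeric codes into labelled factors in every data frame of the list, the same lapply pattern can be used. This is only a sketch: the list `ldat` below is a small stand-in, and the code-to-label mappings in `labels` are assumptions made for illustration:

```r
# Hypothetical list of data frames with numerically coded columns
ldat <- list(
  ALL = data.frame(gender = c(1, 2, 1), Q5 = c(1, 2, NA)),
  SUB = data.frame(gender = c(2, 2),    Q5 = c(2, 1))
)

# Assumed code-to-label mappings (illustrative only): code 1 is the first label, etc.
labels <- list(gender = c("Male", "female"),
               Q5     = c("Local", "Overseas"))

ldat_f <- lapply(ldat, function(d) {
  for (v in names(labels)) {
    d[[v]] <- factor(d[[v]],
                     levels = seq_along(labels[[v]]),
                     labels = labels[[v]])
  }
  d
})
str(ldat_f$ALL)
```

Missing codes stay NA, and every data frame in the list ends up with the same factor levels.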

Error in match.names(clabs, names(xi)) : names do not match previous names

I am creating a vector of 5 age values named boys_age, and likewise a vector named girls_age.
e.g.:
boys_age <- c(18,15,16,17,19)
girls_age<- c(16,14,18,17,15)
Then I want to combine the two vectors with rbind() into a data.frame that has two columns named group and age.
The values from boys_age and girls_age should be in the column age. The group column should hold the category values, boys/girls, to identify the source vector.
It's actually one of the most basic things to do in R:
data:
df1 <- data.frame(boys_age = c(18,15,16,17,19), girls_age = c(16,14,18,17,15))
code:
library(data.table)
melt(setDT(df1), variable.name = "group", value.name = "age", measure.vars = c("boys_age", "girls_age"))[,2:1][,group:=sub("_.*$","",group)][]
result:
# age group
# 1: 18 boys
# 2: 15 boys
# 3: 16 boys
# 4: 17 boys
# 5: 19 boys
# 6: 16 girls
# 7: 14 girls
# 8: 18 girls
# 9: 17 girls
#10: 15 girls
You seem to be keen on using ?rbind: (not practical though)
rbind(
cbind.data.frame(age = df1$boys_age, group = "boys"),
cbind.data.frame(age = df1$girls_age, group = "girls")
)
# age group
#1 18 boys
#2 15 boys
#3 16 boys
#4 17 boys
#5 19 boys
#6 16 girls
#7 14 girls
#8 18 girls
#9 17 girls
#10 15 girls
In the cbind part I'm making use of the recycling functionality R provides. Read about it.
I use cbind.data.frame because plain cbind would create a matrix, and the numeric ages would therefore be converted to character.
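The two points above, and one plausible source of the error in the title, can be checked directly. The question does not show the failing call, so the rbind of two data frames with different column names below is an assumption about what triggered it:

```r
boys_age  <- c(18, 15, 16, 17, 19)
girls_age <- c(16, 14, 18, 17, 15)

# Plain cbind() builds a matrix, so the numbers are coerced to character
m <- cbind(age = boys_age, group = "boys")
is.character(m)

# rbind() on data frames matches columns by name; differing names raise an error
res <- tryCatch(
  rbind(data.frame(boys_age = boys_age), data.frame(girls_age = girls_age)),
  error = function(e) conditionMessage(e)
)
res   # the "names do not match previous names" message

# With identical column names in both pieces, rbind() simply stacks them
ages <- rbind(data.frame(age = boys_age,  group = "boys"),
              data.frame(age = girls_age, group = "girls"))
str(ages)
```

In the last call the single string "boys" (or "girls") is recycled to the length of the age vector, and age stays numeric.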
Here is one option using stack
out <- stack(list(boys = boys_age, girls = girls_age))
out
# values ind
#1 18 boys
#2 15 boys
#3 16 boys
#4 17 boys
#5 19 boys
#6 16 girls
#7 14 girls
#8 18 girls
#9 17 girls
#10 15 girls
Now change names
names(out) <- c("age", "group")
out
# age group
#1 18 boys
#2 15 boys
#3 16 boys
#4 17 boys
#5 19 boys
#6 16 girls
#7 14 girls
#8 18 girls
#9 17 girls
#10 15 girls
You could also do the same in one line, thanks to @Sotos
setNames(stack(list(boys = boys_age, girls = girls_age)), c('age', 'group'))

R, create new column that consists of 1st column or if condition is met, a value from the 2nd/3rd column

a b c d
1 boiler maker <NA> <NA>
2 clerk assistant <NA> <NA>
3 senior machine setter <NA>
4 operated <NA> <NA> <NA>
5 consultant legal <NA> <NA>
How do I create a new column that takes the value in column 'a' unless any of the other columns contain either legal or assistant in which case it takes that value?
Here is a base-R solution. We use apply and any to test every column at once.
df$col <- as.character(df$a)
df$col[apply(df == "Legal",1,any)] <- "Legal"
df$col[apply(df == "assistant",1,any)] <- "assistant"
Try this:
library("dplyr")
df %>%
mutate(new=ifelse(b=="Legal" | c=="Legal" | d=="Legal", "Legal",
ifelse(b=="assistant" | c=="assistant" | d=="assistant", "assistant",
as.character(a))))
as.character is needed if the values are factors. If not, it's unnecessary.
A base R alternative to @scoa's answer:
indx <- apply(mydf == "Legal",1,any) + apply(mydf == "assistant",1,any)*2 + 1L
mydf$col <- c("a","Legal","Assistent")[indx]
or in one go:
mydf$col <- c("a","Legal","Assistent")[apply(mydf == "Legal",1,any) + apply(mydf == "assistant",1,any)*2 + 1L]
which gives:
> mydf
a b c d col
1 boiler maker <NA> <NA> a
2 clerk assistant <NA> <NA> Assistent
3 senior machine setter <NA> a
4 operated <NA> <NA> <NA> a
5 consultant Legal <NA> <NA> Legal
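Because the other columns contain NA, comparisons such as `b == "Legal"` return NA for many rows, which is easy to trip over. A sketch that treats NA cells as "no match" and keeps the value of column a otherwise; the data frame `jobs` is rebuilt from the question's table, and the lowercase spellings "legal" and "assistant" are taken from that table as an assumption:

```r
# Toy data following the question's table
jobs <- data.frame(
  a = c("boiler", "clerk", "senior", "operated", "consultant"),
  b = c("maker", "assistant", "machine", NA, "legal"),
  c = c(NA, NA, "setter", NA, NA),
  d = NA_character_,
  stringsAsFactors = FALSE
)

others <- jobs[c("b", "c", "d")]
# TRUE for rows where any of the other columns equals `word`; NA cells count as no match
has <- function(word) rowSums(others == word, na.rm = TRUE) > 0

jobs$new <- jobs$a                       # default: value from column a
jobs$new[has("legal")]     <- "legal"
jobs$new[has("assistant")] <- "assistant"
jobs
```

Rows 2 and 5 pick up "assistant" and "legal"; all other rows keep their value from column a.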
