Fill in empty values in column of dataframe by condition - r

I have the following data frame. I want to fill in the empty values in "product" based on the value of "code": 44 should be "shirt" and 90 should be "sweater".
What's the best way to do this? With a for loop?
data = data.frame("code" = c(44,78,21,90,100,44,90), "product" = c("","hat","shoe","","umbrella","",""))
> data
code product
1 44
2 78 hat
3 21 shoe
4 90
5 100 umbrella
6 44
7 90

Using dplyr, first convert the product variable from factor to character, then use case_when:
library(dplyr)
data %>%
mutate_if(is.factor, as.character) %>%
mutate(product = case_when(product == "" & code == 44 ~ "shirt",
product == "" & code == 90 ~ "sweater",
TRUE ~ product))
code product
1 44 shirt
2 78 hat
3 21 shoe
4 90 sweater
5 100 umbrella
6 44 shirt
7 90 sweater
Using base R, the same idea applies: first convert factors to character, then use ifelse
i <- sapply(data, is.factor)
data[i] <- lapply(data[i], as.character)
data$product[data$product == ""] <- ifelse(data$code[data$product == ""] == 44, "shirt", "sweater")
data
code product
1 44 shirt
2 78 hat
3 21 shoe
4 90 sweater
5 100 umbrella
6 44 shirt
7 90 sweater
Also worth noting: if you create the data frame with stringsAsFactors = FALSE (the default since R 4.0.0), the factor conversion becomes unnecessary.

You can use match and use the indices for subsetting.
i <- match(data$code, c(44, 90))
j <- !is.na(i)
data$product[j] <- c("shirt", "sweater")[i[j]]
data
# code product
#1 44 shirt
#2 78 hat
#3 21 shoe
#4 90 sweater
#5 100 umbrella
#6 44 shirt
#7 90 sweater
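The ifelse answer above hard-codes two outcomes, and the match answer overwrites every row whose code is 44 or 90, whether or not product is empty. A minimal sketch of a variant that fills only the empty entries from a named lookup vector is shown here; the names lookup, key and fill are illustrative and do not come from the original answers.

```r
# Example data from the question, with product stored as character
data <- data.frame(code = c(44, 78, 21, 90, 100, 44, 90),
                   product = c("", "hat", "shoe", "", "umbrella", "", ""),
                   stringsAsFactors = FALSE)

# Hypothetical lookup table: names are codes, values are products
lookup <- c("44" = "shirt", "90" = "sweater")

# Rows that are empty AND whose code is present in the lookup
key  <- as.character(data$code)
fill <- data$product == "" & key %in% names(lookup)

# Fill only those rows; rows that already have a product are untouched
data$product[fill] <- unname(lookup[key[fill]])
data
```

Supporting another code then only requires adding one element to lookup.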

Related

Using a function and mapply in R to create new columns that sums other columns

Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other:
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(kick_him,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. The function in theory should do the following
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
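The assignment is the problem. Inside kick_him, d <- ... only creates a local variable that is discarded when the function returns, and a string such as "martial_arts$total_boxing" cannot serve as an assignment target. Instead, use the bare column names ("total_boxing", etc.) on the left-hand side of the Map/mapply call. Because the OP already stored the 'martial_arts$' prefix in the Var1 column of 'd', we strip that prefix with sub and then do the assignment: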
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
Check the dataset now:
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0
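As an alternative, here is a minimal sketch with a plain loop over the suffixes. It assumes every activity has exactly one day_ and one evening_ column, as in the example data; the name sports is illustrative. Assigning with [[ inside the loop modifies martial_arts itself, which is exactly what the original function could not do.

```r
# Reproducible data from the question
martial_arts <- data.frame(
  gym_branch = c("downtown_a", "downtown_b", "uptown", "island"),
  day_boxing = c(5, 30, 25, 10), day_muaythai = c(34, 18, 20, 30),
  day_bjj = c(0, 0, 0, 0), day_judo = c(10, 0, 5, 0),
  evening_boxing = c(50, 45, 32, 40), evening_muaythai = c(50, 50, 45, 50),
  evening_bjj = c(60, 60, 55, 40), evening_judo = c(25, 15, 30, 0)
)

# Hypothetical helper vector: one entry per activity
sports <- c("boxing", "muaythai", "bjj", "judo")

for (s in sports) {
  # Build the column names as strings and index with [[ ]]
  martial_arts[[paste0("total_", s)]] <-
    martial_arts[[paste0("day_", s)]] + martial_arts[[paste0("evening_", s)]]
}

martial_arts[, paste0("total_", sports)]
```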

Eliminate cases based on multiple rows values

I have a base with the following information:
Edit: each row is an individual who lives in a house. Multiple individuals, each with a unique P_ID and an AGE, can live in the same house and share its H_ID. I'm looking for all the houses, with all of their individuals, where at least one person in the house is over 60. I hope that explains it better.
show(base)
H_ID P_ID AGE CONACT
1 10010000001 1001000000102 35 33
2 10010000001 1001000000103 12 31
3 10010000001 1001000000104 5 NA
4 10010000001 1001000000101 37 10
5 10010000002 1001000000206 5 NA
6 10010000002 1001000000205 10 NA
7 10010000002 1001000000204 18 31
8 10010000002 1001000000207 3 NA
9 10010000002 1001000000203 24 35
10 10010000002 1001000000202 43 33
11 10010000002 1001000000201 47 10
12 10010000003 1001000000302 26 33
13 10010000003 1001000000301 29 10
14 10010000004 1001000000401 56 32
15 10010000004 1001000000403 22 31
16 10010000004 1001000000402 49 10
17 10010000005 1001000000503 1 NA
18 10010000005 1001000000501 24 10
19 10010000005 1001000000502 23 10
20 10010000006 1001000000601 44 10
21 10010000007 1001000000701 69 32
I want a list with all the houses and all the individuals living there, on the condition that there's at least one person aged 60+. Here's a link to the data: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
And here's how I made the base:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
base<-data.frame(datos$ID_VIV, datos$ID_PERSONA, datos$EDAD, datos$CONACT)
base
Any help is much much appreciated, Thanks!
This can be done in two steps.
First, add a variable with the maximum age per household:
base$maxage <- ave(base$AGE, base$H_ID, FUN=max)
Then keep only the households whose maximum age is 60 or more:
base <- subset(base, maxage >= 60)
Or you could combine the two steps into one line. With the column names in your linked data:
> base <- subset(base, ave(base$datos.EDAD, base$datos.ID_VIV, FUN=max) >= 60)
> head(base)
datos.ID_VIV datos.ID_PERSONA datos.EDAD datos.CONACT
21 10010000007 1001000000701 69 32
22 10010000008 1001000000803 83 33
23 10010000008 1001000000802 47 33
24 10010000008 1001000000801 47 10
36 10010000012 1001000001204 4 NA
37 10010000012 1001000001203 2 NA
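To make the behaviour of ave concrete, here is a small sketch on toy data (the H_ID and AGE values are illustrative, not taken from the linked files). It also shows how the choice between >= 60 and > 60 changes the result when someone is exactly 60.

```r
# Illustrative data: three houses, one of which has a member aged exactly 60
base <- data.frame(H_ID = c(1, 1, 2, 2, 3),
                   AGE  = c(35, 12, 60, 30, 69))

# ave() returns the group maximum, repeated for every row of that group
max_age <- ave(base$AGE, base$H_ID, FUN = max)

at_least_60 <- base[max_age >= 60, ]  # keeps houses 2 and 3
over_60     <- base[max_age > 60, ]   # keeps only house 3

at_least_60
over_60
```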
Using dplyr, we can group_by H_ID and keep the houses where any AGE is greater than 60 (df here stands for the data frame called base in the question).
library(dplyr)
df %>% group_by(H_ID) %>% filter(any(AGE > 60))
Similarly with data.table
library(data.table)
setDT(df)[, .SD[any(AGE > 60)], H_ID]
To get a list of the houses with a tenant whose AGE is greater than 60, we can filter and build a vector of distinct H_IDs:
house_list <- base %>%
filter(AGE > 60) %>%
distinct(H_ID) %>%
pull(H_ID)
Then we can filter the original dataframe based on that house_list to remove any households that do not have someone over the age of 60.
house_df <- base %>%
filter(H_ID %in% house_list)
To then calculate the CON values we can filter out NA values in CONACT, group_by(H_ID) and summarize to find the number of individuals within each house that have a non-NA CONACT value.
CON_calcs <- house_df %>%
filter(!is.na(CONACT)) %>%
group_by(H_ID) %>%
summarize(Count = n())
And join that back into the house_df based on H_ID to include the newly calculated CON values, and I believe that should end with your desired result.
final_df <- left_join(house_df, CON_calcs, by = 'H_ID')
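For comparison, the same steps can be sketched in base R. This assumes that the "CON values" mean the number of household members with a non-NA CONACT, as in the summarize step above; the toy data and H_ID values are illustrative. Note one difference: a house with no non-NA CONACT gets a Count of 0 here, whereas the left_join version would give NA.

```r
# Illustrative data: house 6 has nobody over 60, houses 7 and 8 do
base <- data.frame(H_ID   = c(6, 7, 7, 8, 8),
                   AGE    = c(44, 69, 5, 83, 47),
                   CONACT = c(10, 32, NA, NA, NA))

# Houses with at least one person older than 60
house_list <- unique(base$H_ID[base$AGE > 60])

# Keep every individual living in those houses
house_df <- base[base$H_ID %in% house_list, ]

# Per-house count of non-NA CONACT values, repeated on each row
house_df$Count <- ave(as.integer(!is.na(house_df$CONACT)),
                      house_df$H_ID, FUN = sum)
house_df
```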

Mutate multiple columns using loop

I have a df like this :
Class_A Class_B
78 50
40 60
30 70
The result I want is
Class_A Class_B RankClass_A RankClass_B
78 50 1 3
40 60 2 2
30 70 3 1
Basically, I can create two or more cols by using mutate function. However, when I put it in a loop to create more cols the code does not work.
Here is my code
label<-c('RankClass_A',"RankClass_B")
for (i in 1:2){
for (k in 1:2){
mutate(df,label[i]=dense_rank(desc(df[k])
}
}
We can use mutate_all to create the 'Rank' columns
df %>%
mutate_all(funs(Rank = rank(-.)))
# Class_A Class_B Class_A_Rank Class_B_Rank
#1 78 50 1 3
#2 40 60 2 2
#3 30 70 3 1
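funs() has been deprecated in newer dplyr releases, so a sketch that avoids it may be useful. The version below uses only base R and produces the RankClass_ column names asked for in the question. The choice of ties.method = "min" is an assumption: it agrees with dense_rank only when there are no ties, as in this data.

```r
# Example data from the question
df <- data.frame(Class_A = c(78, 40, 30), Class_B = c(50, 60, 70))

# Rank each column in descending order (largest value gets rank 1)
rank_cols <- lapply(df, function(x) rank(-x, ties.method = "min"))

# Add the ranks as new columns named RankClass_A, RankClass_B, ...
df[paste0("Rank", names(df))] <- rank_cols
df
```

Because lapply runs over every column, the same three lines work unchanged when the data frame has more Class_ columns.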

Import in R with column headers across 3 rows. Replace missing with latest non-missing column

I need help importing data where my column header is split across 3 rows, with some header names implied. Here is what my xlsx file looks like
1 USA China
2 Dollars Volume Dollars Volume
3 Category Brand CY2016 CY2017 CY2016 CY2017 CY2016 CY_2017 CY2016 CY2017
4 Chocolate Snickers 100 120 15 18 100 80 20 22
5 Chocolate Twix 70 80 8 10 75 50 55 20
I would like to import the data into R, except I would like to retain the headers in rows 1 & 2. An added challenge is that some headers are implied. If a header is blank, I would like it to use the cell in the column to the left. An example of what I'd like it to import as.
1 Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016 China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
2 Chocolate Snickers 100 120 15 18 100 80 20 22
3 Chocolate Twix 70 80 8 10 75 50 55 20
My current method is to import the file, skipping rows 1 & 2, and then rename the columns based on known position. However, I was hoping code existed that would spare me this step. Thank you!!
I will assume that you have saved the xlsx data in .csv format, so it can be read in like this:
header <- read.csv("data.csv", header=FALSE, colClasses="character", nrows=3)
dat <- read.csv("data.csv", header=FALSE, skip=3)
The tricky part is the header. This function should do it:
construct_colnames <- function(header) {
f <- function(x) {
x <- as.character(x)
c("", x[!is.na(x) & x != ""])[cumsum(!is.na(x) & x != "") + 1]
}
res <- apply(header, 1, f)
res <- apply(res, 1, paste0, collapse="_")
sub("^_*", "", res)
}
colnames(dat) <- construct_colnames(header)
dat
Result:
Category Brand USA_Dollars_CY2016 USA_Dollars_CY2017 USA_Volume_CY2016 USA_Volume_CY2017 China_Dollars_CY2016
1 Chocolate Snickers 100 120 15 18 100
2 Chocolate Twix 70 80 8 10 75
China_Dollars_CY_2017 China_Volume_CY2016 China_Volume_CY2017
1 80 20 22
2 50 55 20
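The core of construct_colnames is the cumsum indexing trick that carries the last non-empty header to the right. A minimal sketch of that step on its own is shown here, with the three header rows typed in by hand instead of read from a file; the helper name fill_right is illustrative.

```r
# Replace each blank entry with the nearest non-blank entry to its left
fill_right <- function(x) {
  filled <- !is.na(x) & x != ""
  # cumsum(filled) counts how many non-blank values were seen so far;
  # the leading "" covers positions before the first non-blank value
  c("", x[filled])[cumsum(filled) + 1]
}

# Header rows as they appear in the question's spreadsheet
top    <- c("", "", "USA", "", "", "", "China", "", "", "")
middle <- c("", "", "Dollars", "", "Volume", "", "Dollars", "", "Volume", "")
bottom <- c("Category", "Brand", "CY2016", "CY2017", "CY2016", "CY2017",
            "CY2016", "CY_2017", "CY2016", "CY2017")

# Join the three levels and drop the leading underscores left by blanks
col_names <- sub("^_+", "", paste(fill_right(top), fill_right(middle), bottom, sep = "_"))
col_names
```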

Replacing rows of a column in a dataframe conditional to another column in R [duplicate]

Suppose I have a data frame (df) like this:
df:
header1 header2
------ -------
45 76
54 89
- 12
45 32
12 34
- 5
45 34
65 54
I want to get a data frame like this:
header1 header2
------ -------
45 76
54 89
- -
45 32
12 34
- -
45 34
65 54
Namely, I want to replace the values in the header2 column with "-" in every row where the header1 column has a "-" value.
How can I do that in R? I will be very glad for any help. Thanks a lot.
If both columns of your df are character vectors, you could do:
# You can convert your columns to character with
df[,1:2] <- lapply(df[,1:2], as.character)
df$header2[df$header1 == "-"] <- "-" # Replace values
> df
# header1 header2
#1 45 76
#2 54 89
#3 - -
#4 45 32
#5 12 34
#6 - -
#7 45 34
#8 65 54
I would suggest using dplyr, as it produces a readable workflow when working with data frames.
set.seed(1)
dta <- data.frame(colA = c(12,22,34,"-",23,"-"),
colB = round(runif(n = 6, min = 1, max = 100),0),
stringsAsFactors = FALSE)
library(dplyr)
# Keep colB where colA has a value; put "-" where colA is "-"
dta <- dta %>%
mutate(colB = ifelse(colA == "-", "-", colB))
This would give you the following results:
> head(dta)
colA colB
1 12 27
2 22 38
3 34 58
4 - -
5 23 21
6 - -
Side notes
This is a very flexible mechanism, but if the column classes matter (for example, factor columns), you can convert every column to character first with mutate_all(as.character) before applying any other transformations.

Resources