How to complete rownames in R? - r

I have imported a table that looks like this:
df <- data.frame(study=c("A", "", "", "B", "C", ""),
outcome=c("mortality", "mortality", "surgery", "mortality", "mortality", "surgery"),
time.point=c("30d", "1y", "10d", "1y", "5y", "20d"))
The 2nd and 3rd outcomes belong to study A; the 6th outcome belongs to study C.
My table contains many cases like this, with an irregular number of outcomes and time points per study.
How can I assign a descriptive name to each row that indicates the study, the outcome, and the time point?
I want it to look like this:
df_new <- data.frame(study=c("A", "", "", "B", "C", ""),
outcome=c("mortality", "mortality", "surgery", "mortality", "mortality", "surgery"),
time.point=c("30d", "1y", "10d", "1y", "5y", "20d"),
rowname=c("A_mortality_30d", "A_mortality_1y", "A_surgery_10d", "B_mortality_1y", "C_mortality_5y", "C_surgery_20d"))
Thank you so much!

Here is an approach that changes the empty strings to NA and then carries the last observed study forward with zoo::na.locf().
library( data.table ); library( zoo )
#make it a data.table
setDT(df)
#set empty strings as NA
df[ study == "", study := NA_character_ ]
#create new column
df[, rowname := paste( zoo::na.locf( study), outcome, time.point, sep = "_")][]
# study outcome time.point rowname
# 1: A mortality 30d A_mortality_30d
# 2: <NA> mortality 1y A_mortality_1y
# 3: <NA> surgery 10d A_surgery_10d
# 4: B mortality 1y B_mortality_1y
# 5: C mortality 5y C_mortality_5y
# 6: <NA> surgery 20d C_surgery_20d

Credit to Oliver: the first part is from him (he was faster).
Then you can use unite() from the tidyr package.
library(tidyr)
library(dplyr)
df1 <- df %>%
mutate(study = case_when(study == "" ~ NA_character_ ,
TRUE ~ study)) %>%
fill(study, .direction = 'down') %>%
unite(rowname, study, outcome, time.point, sep= "_", remove = FALSE)

You could do something like:
library(tidyverse)
df$rowname <- df %>% mutate(study = case_when(study == "" ~ NA_character_ ,
TRUE ~ study)) %>%
fill(study, .direction = 'down') %>%
(function(x)mapply(paste, sep = '_', study = x$study, outcome = x$outcome, time.point = x$time.point))
#alternative use rownames(df) <- ...
df
# study outcome time.point rowname
# 1 A mortality 30d A_mortality_30d
# 2 mortality 1y A_mortality_1y
# 3 surgery 10d A_surgery_10d
# 4 B mortality 1y B_mortality_1y
# 5 C mortality 5y C_mortality_5y
# 6 surgery 20d C_surgery_20d
Here I first replace the empty study values with NA_character_ so that fill() can carry the previous study down into them. Then I use mapply() to paste the values of the three columns together row by row. The mapply() call is wrapped in an anonymous function only because I want to use it within a pipe.

Base R solution: use grep to get the row numbers of the non-empty studies, count how many rows each study covers with diff, and then repeat each study name that many times with rep.
studies <- df[df$study != "", "study"]
reps <- diff(c(grep(".", df$study), nrow(df) +1))
rownames(df) <- paste(rep(studies, reps), df$outcome, df$time.point, sep="_")
> df
study outcome time.point
A_mortality_30d A mortality 30d
A_mortality_1y mortality 1y
A_surgery_10d surgery 10d
B_mortality_1y B mortality 1y
C_mortality_5y C mortality 5y
C_surgery_20d surgery 20d
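A variation on the base R idea above, sketched under the assumption that the first row always has a non-empty study. cumsum() over the "is non-empty" flag numbers the study blocks, and that number is then used to index the vector of study names. No packages are needed.

```r
df <- data.frame(
  study = c("A", "", "", "B", "C", ""),
  outcome = c("mortality", "mortality", "surgery", "mortality", "mortality", "surgery"),
  time.point = c("30d", "1y", "10d", "1y", "5y", "20d"),
  stringsAsFactors = FALSE
)

# TRUE where a new study starts
has_study <- df$study != ""
# Running count of study starts, one block number per row
block <- cumsum(has_study)
# Look up the study name for every row (assumes row 1 is non-empty)
filled <- df$study[has_study][block]

df$rowname <- paste(filled, df$outcome, df$time.point, sep = "_")
df$rowname
```

If the row names themselves are wanted instead of a column, the same vector can be assigned with rownames(df) <- ..., provided the combinations are unique.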

Related

Extract values according to the result

I have a dataframe that represents characteristics of people, such as occupation, gender, and telework use :
data = data.frame (profession = sample (c ("craftsman", "employee", "senior executive"), 10000, replace = TRUE), sex = sample (c ("M", "F"), 10000, replace = TRUE), en_teletjob = sample (c ("Yes", "No"), 10000, replace = TRUE))
I would like to create a new dataframe by extracting rows from "data" such that:
there are 20% men and 80% women,
there are 60% craftsmen, 20% employees, and 20% senior executives,
and 50% answer "Yes" to the use of telework.
Is it possible to do this in R?
Thank you
One approach is to combine apply() with table() and prop.table() to summarise the share of each level in every variable. Here is the code:
#Code
apply(data,2,function(x) prop.table(table(x)))
Output:
$profession
x
craftsman employee senior executive
0.3331 0.3315 0.3354
$sex
x
F M
0.4987 0.5013
$en_teletjob
x
No Yes
0.503 0.497
You can use lapply() to call proportions() on each variable. It returns a list object.
lapply(data, function(x) proportions(table(x)))
# $profession
# x
# craftsman employee senior executive
# 0.3336 0.3318 0.3346
#
# $sex
# x
# F M
# 0.5035 0.4965
#
# $en_teletjob
# x
# No Yes
# 0.4978 0.5022
Note: prop.table() is an earlier name of proportions(), retained for back-compatibility.
An option with the tidyverse would be to use adorn_percentages() from the janitor package.
Code:
library(purrr)
library(dplyr)
library(janitor)
map(names(data), ~data %>%
select(.x) %>%
count(!! rlang::sym(.x)) %>%
adorn_percentages(denominator = 'col'))
Output:
#[[1]]
# profession n
# craftsman 0.3302
# employee 0.3320
# senior executive 0.3378
#[[2]]
# sex n
# F 0.5108
# M 0.4892
#[[3]]
# en_teletjob n
# No 0.4981
# Yes 0.5019
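The snippets above report the observed shares; drawing rows that hit the requested shares is a separate step. A minimal base R sketch of that step, assuming the three targets are meant to hold independently of each other (so each profession/sex/telework cell gets the product of its three shares) and that an output size of 1000 rows is acceptable. The names target, n_out, and new_data are made up for illustration.

```r
set.seed(42)
data <- data.frame(
  profession = sample(c("craftsman", "employee", "senior executive"), 10000, replace = TRUE),
  sex = sample(c("M", "F"), 10000, replace = TRUE),
  en_teletjob = sample(c("Yes", "No"), 10000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Target shares from the question
p_prof <- c(craftsman = 0.6, employee = 0.2, "senior executive" = 0.2)
p_sex <- c(M = 0.2, F = 0.8)
p_tele <- c(Yes = 0.5, No = 0.5)
n_out <- 1000  # assumed size of the extracted dataframe

# One row per combination of levels, with the number of rows to draw from it
target <- expand.grid(profession = names(p_prof), sex = names(p_sex),
                      en_teletjob = names(p_tele), stringsAsFactors = FALSE)
target$n <- unname(round(n_out * p_prof[target$profession] *
                           p_sex[target$sex] * p_tele[target$en_teletjob]))

# Draw the required number of rows from each cell without replacement
picked <- unlist(lapply(seq_len(nrow(target)), function(k) {
  idx <- which(data$profession == target$profession[k] &
                 data$sex == target$sex[k] &
                 data$en_teletjob == target$en_teletjob[k])
  idx[sample.int(length(idx), target$n[k])]
}))
new_data <- data[picked, ]

lapply(new_data, function(x) proportions(table(x)))
```

The draw fails if a cell of the original data holds fewer rows than requested, so n_out has to be small enough for the rarest combination.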

R mutate variable to variable values from another observation, using a loop, an ifelse condition and subset (dplyr)

See my reproducible example and desired output below.
I want to create a new variable, where I combine variable values from other observations (rows), which I want to identify in a loop using subset. The condition of the subset is to be defined by the loop.
In example 1 subset(df, country == i) does not work, but doing it manually (in Ex.2) subset(df, country == 'US') works. I thought country == i and country == 'US' should be pretty much the same.
# create a df
country <- c('US', 'US', 'China', 'China')
Trump_virus <- c('Y', 'N' ,'Y', 'N')
cases <- c (1000, 2000, 4, 6)
df <- data.frame(country, Trump_virus, cases)
#################################################### Ex.1
for (i in df$country) {
print(i)
df <- df %>%
mutate(cases_corected = ifelse(
Trump_virus == 'Y'
,subset(df, Trump_virus == 'N' & country == i)$cases*1000
,'killer_virus'
))}
##
df$cases_corected
#################################################### Ex.2
for (i in df$country) {
print(i)
df <- df %>%
mutate(cases_corected = ifelse(
Trump_virus == 'Y'
,subset(df, Trump_virus == 'N' & country == 'US')$cases*1000
,'killer_virus'
))}
##
df$cases_corected
################################################### Desired output
> df$cases_corected
[1] "2e+06"
[2] "killer_virus"
[3] "6000"
[4] "killer_virus"
Here is a solution with dplyr.
Updated based on the change in desired output
df <- df %>%
mutate(country=toupper(country)) # to get same names for other variants of a country #e.g. China and china
# generating a dataset that has cases only for Trump_virus == "N"
df1<-df %>%
dplyr::filter(Trump_virus=="N") %>%
dplyr::mutate(ID= "Y",
cases_corected=cases*1e3) %>%
dplyr::select(-c(cases,Trump_virus))
# final merging
df<-df %>%
left_join(df1,by=c("country"="country","Trump_virus"="ID")) %>%
mutate(cases_corected=ifelse(is.na(cases_corected),'killer_virus',cases_corected))
df
country Trump_virus cases cases_corected
1 US Y 1000 2e+06
2 US N 2000 killer_virus
3 CHINA Y 4 6000
4 CHINA N 6 killer_virus
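As far as I can tell, Ex.1 fails because every pass of the loop recomputes the whole column, so only the last value of i (here 'China') survives, and ifelse() recycles that single looked-up value across all 'Y' rows. A loop-free sketch in base R, using match() to pair each row with the 'N' row of the same country. It assumes exactly one 'N' row per country; n_rows and n_cases are names introduced here for illustration.

```r
country <- c('US', 'US', 'China', 'China')
Trump_virus <- c('Y', 'N', 'Y', 'N')
cases <- c(1000, 2000, 4, 6)
df <- data.frame(country, Trump_virus, cases, stringsAsFactors = FALSE)

# Rows that carry the value to copy from
n_rows <- df[df$Trump_virus == 'N', ]
# For every row of df, the 'N' case count of the same country
n_cases <- n_rows$cases[match(df$country, n_rows$country)]

# Mixing numbers and text makes ifelse() return a character vector
df$cases_corected <- ifelse(df$Trump_virus == 'Y', n_cases * 1000, 'killer_virus')
df$cases_corected
```

Unlike the join-based answer, this keeps the original spelling of country, because nothing is converted to upper case.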

Replace NA in one row of data frame with values from other

I want to replace the NAs in one row with values from another row. Example data:
group <-c('A','A_old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac=data_frame(group,year1,year2,year3)
group <-c('A','A_old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
The original table is much larger, so referring to each element one by one and assigning values is not feasible.
Thanks!
To clarify: I need to refer to the rows by their group name, because the original table is big and I cannot work with just two rows. For example, in the table below I would like to replace the NAs in row 1 (group == A) with the values from row 5 (group == E). The data are here:
group <-c('A','B','C','D','E','F','G')
year1<- c(NA,'100',NA,'200','300',NA,NA)
year2<- c(NA,'100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
So I want to get:
group <-c('A','B','C','D','E','F','G')
year1<- c('300','100',NA,'200','300',NA,NA)
year2<- c('300','100',NA,'200','300','50','40')
year3<- c('20','100',10,'200','300','150','230')
data=data.frame(group,year1,year2,year3)
Other than using fill or na.locf, you could do:
# across() needs dplyr >= 1.0.0; it replaces the older mutate_at()/funs() idiom
datac %>%
  group_by(grp = gsub("_.*", "", group)) %>%
  mutate(across(contains("year"), ~ .x[!is.na(.x)])) %>%
  ungroup() %>%
  select(-grp)
Output:
# A tibble: 2 x 4
group year1 year2 year3
<chr> <chr> <chr> <chr>
1 A 20 40 20
2 A_old 20 40 230
For your second example, you could do:
# across() needs dplyr >= 1.0.0; it replaces the older mutate_at()/funs() idiom
data %>%
  mutate(across(
    contains("year"),
    ~ case_when(
      group == "A" & is.na(.x) ~ .x[group == "E"],
      TRUE ~ .x
    )
  ))
Output:
group year1 year2 year3
1 A 300 300 20
2 B 100 100 100
3 C <NA> <NA> 10
4 D 200 200 200
5 E 300 300 300
6 F <NA> 50 150
7 G <NA> 40 230
You can also add other conditions to case_when.
For instance, if you'd additionally like to replace C years with what is there for group D, you would add:
# across() needs dplyr >= 1.0.0; it replaces the older mutate_at()/funs() idiom
data %>%
  mutate(across(
    contains("year"),
    ~ case_when(
      group == "A" & is.na(.x) ~ .x[group == "E"],
      group == "C" & is.na(.x) ~ .x[group == "D"],
      TRUE ~ .x
    )
  ))
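The same replacement can also be sketched in base R for readers who prefer to address rows by group name without dplyr. fill_row_from() is a hypothetical helper written for this example, not an existing function, and it assumes every group name occurs exactly once.

```r
data <- data.frame(
  group = c("A", "B", "C", "D", "E", "F", "G"),
  year1 = c(NA, "100", NA, "200", "300", NA, NA),
  year2 = c(NA, "100", NA, "200", "300", "50", "40"),
  year3 = c("20", "100", "10", "200", "300", "150", "230"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill the NAs of row `to` with the values of row `from`
fill_row_from <- function(d, to, from, cols) {
  to_row <- which(d$group == to)
  from_row <- which(d$group == from)
  for (col in cols) {
    # Only overwrite cells that are missing in the target row
    if (is.na(d[to_row, col])) d[to_row, col] <- d[from_row, col]
  }
  d
}

year_cols <- grep("^year", names(data), value = TRUE)
data <- fill_row_from(data, to = "A", from = "E", cols = year_cols)
data <- fill_row_from(data, to = "C", from = "D", cols = year_cols)
data
```

Each additional pair of rows is one more call, which mirrors adding one more condition to case_when().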
After a very long evening and a headache from R, I managed to get this:
rm(list = ls())
group <-c('A','A old')
year1<- c(NA,'20')
year2<- c(NA,'40')
year3<- c('20','230')
datac=data_frame(group,year1,year2,year3)
group <-c('A','A old')
year1<- c('20','20')
year2<- c('40','40')
year3<- c('20','230')
finaldatac=data_frame(group,year1,year2,year3)
datac$group <- gsub(' ', '--', datac$group)
datact = t(datac)
colnames(datact) = datact[1, ]
datact = datact[-1, ]
datact[,"A"] <- ifelse(!is.na(datact[,"A"]), datact[,"A"] , datact[,"A--old"])
datactt=t(datact)
group = rownames(datactt)
datactt<-cbind(datactt, group)
rownames(datactt) <- c()
datactt <- as.data.frame(datactt)
sapply(datactt, class)
datactt <- data.frame(lapply(datactt, as.character), stringsAsFactors=FALSE)
datactt$group <- gsub('--', ' ', datactt$group)
Here datactt is (hopefully) the same as the finaldatac that I wanted. I am sure this can't be the best solution, and it is obviously not the prettiest. If anybody has something similar but shorter or more efficient, please post it; I would appreciate the answer.

Doing a ranged lookup with multiple variables in a matrix in R

I feel like I have a bit of a complicated problem (or at least for me it is!).
I have a table of prices which will need to be read from a csv which will look exactly like this:
V1 <- c("","Destination","Spain","Spain","Spain","Portugal","Portugal","Portugal","Italy","Italy","Italy")
V2 <- c("","Min_Duration",rep(c(1,3,6),3))
V3 <- c("","Max_Duration",rep(c(2,5,10),3))
V4 <- c("Full-board","Level_1",runif(9,100,200))
V5 <- c("Full-board","Level_2",runif(9,201,500))
V6 <- c("Full-board","Level_3",runif(9,501,1000))
V7 <- c("Half-board","Level_1",runif(9,100,200))
V8 <- c("Half-board","Level_2",runif(9,201,500))
V9 <- c("Half-board","Level_3",runif(9,501,1000))
Lookup_matrix <- as.data.frame(cbind(V1,V2,V3,V4,V5,V6,V7,V8))
The prices in the above table will of course come out a bit strange as they're completely random - but we can ignore that...
I also have a table like this:
Destination <- c("Spain", "Italy", "Portugal")
Duration <- c(2,4,8)
Level <- c(1,3,3)
Board <- c("Half-board","Half-board","Full-board")
Price <- "Empty"
Price_matrix <- as.data.frame(cbind(Destination,Duration,Level,Board,Price))
My question is - how do I populate the 'Price' column of the price matrix with the corresponding prices that can be found in the lookup matrix? Please note that the duration variable of the price matrix will have to fit into a range found between the 'Min_Duration' and 'Max_Duration' columns in the lookup matrix.
In Excel I would use an Index,Match formula. But I'm stumped with R.
Thanks in advance,
Dan
Here is a tidyverse possibility
First, please note that I rename your input objects; both Price_matrix and Lookup_matrix are data.frames (not matrices).
df1 <- Price_matrix
df2 <- Lookup_matrix
Next we need to fix the column names of df2 = Lookup_matrix.
# Fix column names
colnames(df2) <- gsub("^_", "", apply(df2[1:2, ], 2, paste0, collapse = "_"))
df2 <- df2[-(1:2), ]
We now essentially do a left join of df1 and df2. To bring df2 into a suitable format we reshape it from wide to long, extract the Board and Level for every Price value, and expand each entry to one row per Duration from Min_Duration to Max_Duration. Then we join by Destination, Duration, Level and Board.
Note that in your example V9 (Half-board, Level_3) is not included in the cbind() call that builds Lookup_matrix, so there is no Half-board Level 3 price; we therefore get Price = NA for the Italy entry.
library(tidyverse)
left_join(
df1 %>%
mutate_if(is.factor, as.character) %>%
select(-Price),
df2 %>%
mutate_if(is.factor, as.character) %>%
gather(key, Price, -Destination, -Min_Duration, -Max_Duration) %>%
separate(key, into = c("Board", "Level"), sep = "_", extra = "merge") %>%
mutate(Level = sub("Level_", "", Level)) %>%
rowwise() %>%
mutate(Duration = list(seq(as.numeric(Min_Duration), as.numeric(Max_Duration)))) %>%
unnest(Duration) %>%
select(-Min_Duration, -Max_Duration) %>%
mutate(Duration = as.character(Duration)))
#Joining, by = c("Destination", "Duration", "Level", "Board")
# Destination Duration Level Board Price
#1 Spain 2 1 Half-board 119.010942545719
#2 Italy 4 3 Half-board <NA>
#3 Portugal 8 3 Full-board 764.536124917446
Using data.table:
library(data.table)
nms = trimws(do.call(paste, transpose(Lookup_matrix[1:2, ])))# column names
cat(do.call(paste, c(collapse="\n", Lookup_matrix[-(1:2), ])), file = "mm.csv")
# Rewrite the data in the correct format. You do not have to.
# Just doing Lookup_matrix1 = setNames(Lookup_matrix[-(1:2),],nms) is enough
# but it will not have rectified the column classes.
Lookup_matrix1 = fread("mm.csv", col.names = nms)
melt(Lookup_matrix1, 1:3)[,
c("Board", "Level") := .(sub("[.]", "-", sub("\\.Leve.*", "", variable)), sub("\\D+", "", variable))][
Price_matrix[, -5], on=c("Destination", "Board", "Level", "Min_Duration <= Duration", "Max_Duration >= Duration")]
Destination Min_Duration Max_Duration variable value Board Level
1: Spain 2 2 Half.board.Level_1 105.2304 Half-board 1
2: Italy 4 4 <NA> NA Half-board 3
3: Portugal 8 8 Full.board.Level_3 536.5132 Full-board 3
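The core of both answers is a range condition, Min_Duration <= Duration <= Max_Duration, combined with exact matches on the other keys. A minimal base R sketch of just that step, assuming the lookup table has already been reshaped to long format. The prices are made-up fixed numbers rather than runif() draws, and lookup, requests, and find_price() are names introduced for illustration.

```r
# Long-format lookup table with made-up prices (hypothetical values)
lookup <- data.frame(
  Destination  = rep(c("Spain", "Portugal"), each = 3),
  Min_Duration = rep(c(1, 3, 6), times = 2),
  Max_Duration = rep(c(2, 5, 10), times = 2),
  Board        = "Half-board",
  Level        = 1,
  Price        = c(110, 120, 130, 140, 150, 160),
  stringsAsFactors = FALSE
)

requests <- data.frame(
  Destination = c("Spain", "Portugal", "Italy"),
  Duration    = c(2, 8, 4),
  Board       = "Half-board",
  Level       = 1,
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact match on three keys, range match on duration
find_price <- function(dest, dur, board, level) {
  hit <- lookup$Destination == dest & lookup$Board == board &
    lookup$Level == level &
    lookup$Min_Duration <= dur & lookup$Max_Duration >= dur
  # First matching row, or NA when nothing matches
  if (any(hit)) lookup$Price[which(hit)[1]] else NA_real_
}

requests$Price <- mapply(find_price, requests$Destination, requests$Duration,
                         requests$Board, requests$Level, USE.NAMES = FALSE)
requests
```

This is the closest analogue to an Excel INDEX/MATCH with several criteria; for large tables the non-equi join in data.table shown above will scale better than a row-by-row mapply().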

R for loop for calculating sums based on a data frame's different columns

My current data frame looks like this:
# Create sample data
my_df <- data.frame(seq(1, 100), rep(c("ind_1", "", "", ""), times = 25), rep(c("", "ind_2", "", ""), times = 25), rep(c("", "", "ind_3", ""), times = 25), rep(c("", "", "", "ind_4"), times = 25))
# Rename columns
names(my_df)[names(my_df)=="seq.1..100."] <- "value"
names(my_df)[names(my_df)=="rep.c..ind_1................times...25."] <- "ind_1"
names(my_df)[names(my_df)=="rep.c......ind_2............times...25."] <- "ind_2"
names(my_df)[names(my_df)=="rep.c..........ind_3........times...25."] <- "ind_3"
names(my_df)[names(my_df)=="rep.c..............ind_4....times...25."] <- "ind_4"
# Replace empty elements with NA
my_df[my_df==''] = NA
What I want to script is a rather simple for loop that calculates the sum of the value column for each of the four ind_* columns and prints the result.
So far my very meagre attempt has been:
# Create a vector with all individuals
individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")
# Calculate aggregates for each individual
for (i in individuals){
ind <- 1
sum_i <- aggregate(value~ind_1, data = my_df, sum)
print(paste("Individual", i, "possesses an aggregated value of", sum_i$value))
ind <- ind + 1
}
As you can see, I am struggling to write the command so that it calculates the sum for one column after another; the current code, naturally, only calculates the result for ind_1 each time. What needs to be changed in the aggregate command to achieve the desired result? (I'm a total beginner, but I thought of using indices to proceed from one column to the next.)
Assuming you want to sum value over the rows where each ind column matches the corresponding entry of your individuals vector:
individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")
for (i in 1:(ncol(my_df)-1)){
print(sum(my_df$value[which(my_df[,individuals[i]] == individuals[i])]))
}
Why do you want to use print() instead of storing the results in a separate vector?
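Storing the results could look like the sketch below, which keeps the sums in a named vector and prints from it afterwards. It rebuilds the sample data with NA in place of the empty strings and selects each column by name with [[ ]]; the name sums is introduced here for illustration.

```r
my_df <- data.frame(
  value = seq(1, 100),
  ind_1 = rep(c("ind_1", NA, NA, NA), times = 25),
  ind_2 = rep(c(NA, "ind_2", NA, NA), times = 25),
  ind_3 = rep(c(NA, NA, "ind_3", NA), times = 25),
  ind_4 = rep(c(NA, NA, NA, "ind_4"), times = 25),
  stringsAsFactors = FALSE
)

individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")

# One sum per individual, over the rows where that column is not NA
sums <- vapply(individuals, function(i) {
  as.numeric(sum(my_df$value[!is.na(my_df[[i]])]))
}, numeric(1))

# Print the stored results in the format of the original attempt
for (i in individuals) {
  print(paste("Individual", i, "possesses an aggregated value of", sums[[i]]))
}
```

The loop variable i is the column name itself, so no separate counter such as ind is needed.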
You can try tidyverse as well:
my_df %>%
gather(key, Inds, -value) %>%
filter(!is.na(Inds)) %>%
group_by(key) %>%
summarise(Sum=sum(value))
# A tibble: 4 x 2
key Sum
<chr> <int>
1 ind_1 1225
2 ind_2 1250
3 ind_3 1275
4 ind_4 1300
The idea is to make the data long using gather, filter out the NAs, then group by key and summarise the values.
An alternative that uses melt() from reshape2 together with base R's aggregate() would be:
library(reshape2)
my_df_long <- melt(my_df, id.vars = "value",value.name = "ID")
aggregate(value ~ ID, my_df_long, sum, na.rm = TRUE)
ID value
1 ind_1 1225
2 ind_2 1250
3 ind_3 1275
4 ind_4 1300
