Reshape with factors - r

I am trying to reshape a data frame that contains a factor and a numeric variable with the melt and cast procedure. The following data shows my problem:
library(reshape)
df <- as.data.frame(cbind(c(1,1,2,2,3,3),c(2000,2001,2001,2002,2000,2001),c(2,1,4,3,1,5)))
names(df) <- c("Id","Year","Var")
df$Fac <- interaction(c(1,1,1,0,0,0),c(0,0,0,1,1,1),drop=TRUE)
MData <- melt.data.frame(df,id=c("Year","Id"))
RSData <- cast(MData, Id ~ Year | ...)
The operation works, but the missing observations in RSData are not NAs as they should be, but rather strings (<NA> rather than NA):
$Var
Id 2000 2001 2002
1 1 2 1 <NA>
2 2 <NA> 4 3
3 3 1 5 <NA>
$Fac
Id 2000 2001 2002
1 1 1.0 1.0 <NA>
2 2 <NA> 1.0 0.1
3 3 0.1 0.1 <NA>
If, however, I disregard the factor, the NAs are normal NAs:
df <- as.data.frame(cbind(c(1,1,2,2,3,3),c(2000,2001,2001,2002,2000,2001),c(2,1,4,3,1,5)))
names(df) <- c("Id","Year","Var")
MData <- melt.data.frame(df,id=c("Year","Id"))
RSData <- cast(MData, Id ~ Year | ...)
The output becomes:
$Var
Id 2000 2001 2002
1 1 2 1 NA
2 2 NA 4 3
3 3 1 5 NA
The string NAs give me problems when I try to use my recast data. How do I get the correct NAs when I have a factor and numeric variables in the data frame I want to melt and recast?
Thanks,
M

I am confident that I have found the answer to my own question by reading the comments and the documentation over and over. Basically, the problem is that the melt.data.frame() method puts all the variable values into a single value column, and since that column has to hold both the factor levels and the numeric values, the numeric values are implicitly converted to strings.
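A quick check makes the coercion visible (a minimal sketch using the MData object from above; the exact printout may vary by reshape version):
str(MData$value)
# chr [1:12] "2" "1" "4" "3" "1" "5" "1.0" "1.0" "1.0" "0.1" "0.1" "0.1"
Because the value column is character, the missing cells that cast() produces are character too, which is why they print as <NA>.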
The only way around this I see is to reshape the numeric variables and the factors separately:
MDataNum = melt.data.frame(df[c("Id","Year","Var")],id=c("Year","Id"))
RSDataNum <- cast(MDataNum, Id ~ Year | ...)
MDataFac = melt.data.frame(df[c("Id","Year","Fac")],id=c("Year","Id"))
RSDataFac <- cast(MDataFac, Id ~ Year | ...)
The result becomes:
> RSDataNum
$Var
Id 2000 2001 2002
1 1 2 1 NA
2 2 NA 4 3
3 3 1 5 NA
> RSDataFac
$Fac
Id 2000 2001 2002
1 1 1.0 1.0 <NA>
2 2 <NA> 1.0 0.1
3 3 0.1 0.1 <NA>

Related

R: dynamically detect Excel column names formatted as dates (without df slicing)

I am trying to detect column names that come in Excel's date serial format:
library(openxlsx)
df <- read.xlsx('path/df.xlsx', sheet=1, detectDates = T)
which reads the data as follows:
# a b c 44197 44228 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
I tried to specify a fixed index slice and then transform those specific columns as follows:
names(df)[4:5] <- format(as.Date(as.numeric(names(df)[4:5]),
                                 origin = "1899-12-30"), "%m/%d/%Y")
This works well when the df is sliced for those specific columns. Unfortunately, the column indices could change, say from names(df)[4:5] to names(df)[2:3], and a fixed slice would then coerce NA values instead of producing dates.
data:
Note: with this construction data.frame() reads the column names as X44197 and X44228 (add check.names = FALSE to keep them numeric), while read.xlsx() reads them as 44197 and 44228
df <- data.frame(a=rep(1:5), b=rep(1:5), c=NA, "44197"=rep(1:5), '44228'=rep(1:5), d=rep(1:5))
Expected Output:
Note: this is the original Excel format for the columns above:
# a b c 01/01/2021 01/02/2021 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
How could I detect these Excel-format names directly and change them to dates without having to slice the dataframe?
We may need to get only those column names that are numbers:
i1 <- !is.na(as.integer(names(df)))
and then use
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
                                origin = "1899-12-30"), "%m/%d/%Y")
Or with dplyr
library(dplyr)
df %>%
  rename_with(~ format(as.Date(as.numeric(.),
                               origin = "1899-12-30"), "%m/%d/%Y"),
              matches('^\\d+$'))
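Putting it together, a minimal self-contained sketch (assumptions: check.names = FALSE keeps the numeric names when rebuilding the sample data, and suppressWarnings() quiets the expected NA-coercion warning from as.integer()):
df <- data.frame(a = rep(1:5), b = rep(1:5), c = NA,
                 "44197" = rep(1:5), "44228" = rep(1:5), d = rep(1:5),
                 check.names = FALSE)
i1 <- !is.na(suppressWarnings(as.integer(names(df))))
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
                                origin = "1899-12-30"), "%m/%d/%Y")
names(df)
# [1] "a" "b" "c" "01/01/2021" "02/01/2021" "d"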

Min and Max across multiple columns with NAs

For the following sample data dat, is there a way to calculate the row-wise min and max while handling NAs? My input is:
dat <- read.table(text = "ID Name PM TP2 Sigma
1 Tim 1 2 3
2 Sam 0 NA 1
3 Pam 2 1 NA
4 Ali 1 0 2
NA NA NA NA NA
6 Tim 2 0 7", header = TRUE)
My required output is:
ID Name PM TP2 Sigma Min Max
1 Tim 1 2 3 1 3
2 Sam 0 NA 1 0 1
3 Pam 2 1 NA 1 2
4 Ali 1 0 2 0 2
NA NA NA NA NA NA NA
6 Tim 2 0 7 0 7
My Effort
1- I have seen similar posts but none of them discusses the case where every entry in a row is NA, e.g., Get the min of two columns.
Based on this, I have tried pmin() and pmax(), but they do not work for me.
2- Another similar question is minimum (or maximum) value of each row across multiple columns; again, it does not need to handle NAs.
3- Lastly, this question minimum (or maximum) value of each row across multiple columns talks about NAs, but not about rows where every element is missing.
4- Also, some of the solutions require that the list of columns to include or exclude is typed out manually; my original data is quite wide, so I want a solution where I can refer to columns by number rather than by name.
Partial Solution
I have tried the following solution, but for the all-NA row the Min column ends up with Inf and the Max column with -Inf:
dat$min = apply(dat[, 3:5], 1, min, na.rm = TRUE)
dat$max = apply(dat[, 3:5], 1, max, na.rm = TRUE)
I can manually get rid of Inf by using something like:
dat$min[is.infinite(dat$min)] = NA
But I was wondering if there is a better way of achieving my desired outcome? Any advice would be greatly appreciated.
Thank you for your time.
You can use hablar's min_ and max_ functions, which return NA if all values are NA.
library(dplyr)
library(hablar)
dat %>%
  rowwise() %>%
  mutate(min = min_(c_across(PM:Sigma)),
         max = max_(c_across(PM:Sigma)))
You can also use this with apply:
cbind(dat, t(apply(dat[3:5], 1, function(x) c(min = min_(x), max = max_(x)))))
# ID Name PM TP2 Sigma min max
#1 1 Tim 1 2 3 1 3
#2 2 Sam 0 NA 1 0 1
#3 3 Pam 2 1 NA 1 2
#4 4 Ali 1 0 2 0 2
#5 NA <NA> NA NA NA NA NA
#6 6 Tim 2 0 7 0 7
The following solution works with the transform() function (note pmax() for the maximum, and na.rm = TRUE so that single NAs are skipped):
dat <- transform(dat, min = pmin(PM, TP2, Sigma, na.rm = TRUE))
dat <- transform(dat, max = pmax(PM, TP2, Sigma, na.rm = TRUE))
Without using the transform() function, the data seemed to get messed up. Also, the above command requires that all column names are written out explicitly. I do not understand why a short version like the following fails:
pmin(dat[, 3:5]) or
pmax(dat[, 3:5])
(It fails because pmin()/pmax() compare several vectors element-wise; a single data frame is just one argument, so there is nothing to compare it against. The do.call() answer below expands the columns into separate arguments.)
I am posting the only solution that I could come up with, in case someone else stumbles upon a similar issue.
I would use data.table for this task. I use rowSums() to count the NAs in each row and compare that count to the total number of columns. dat.new keeps only the rows that have at least one non-NA value, so na.rm = TRUE can then be used as usual.
I hope this little code helps you.
library(data.table)
#your data
dat <- read.table(text = "ID PM TP2 Sigma
1 1 2 3
2 0 NA 1
3 2 1 NA
4 1 0 2
NA NA NA NA
5 2 0 7", header = TRUE)
#generate data.table and add id
dat <- data.table(dat)
number.cols <- dim(dat)[2] #4
dat[,id:=c(1:dim(dat)[1])]
# > dat
# ID PM TP2 Sigma id
# 1: 1 1 2 3 1
# 2: 2 0 NA 1 2
# 3: 3 2 1 NA 3
# 4: 4 1 0 2 4
# 5: NA NA NA NA 5
# 6: 5 2 0 7 6
#use a new data.table to select all rows with at least one non-NA value
dat.new <- dat[rowSums(is.na(dat)) < number.cols,]
#restrict .SD to the value columns, otherwise ID would be included in min/max
dat.new[, MINv := min(.SD, na.rm = TRUE), by = id, .SDcols = c("PM", "TP2", "Sigma")]
dat.new[, MAXv := max(.SD, na.rm = TRUE), by = id, .SDcols = c("PM", "TP2", "Sigma")]
#if you need it merged to the old data (all.x = TRUE keeps the all-NA row)
dat <- merge(dat, dat.new[,.(id,MINv,MAXv)], by="id", all.x = TRUE)
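The merged result should then look like this (a sketch; MINv/MAXv stay NA for the all-NA row because of all.x = TRUE):
# id ID PM TP2 Sigma MINv MAXv
# 1: 1 1 1 2 3 1 3
# 2: 2 2 0 NA 1 0 1
# 3: 3 3 2 1 NA 1 2
# 4: 4 4 1 0 2 0 2
# 5: 5 NA NA NA NA NA NA
# 6: 6 5 2 0 7 0 7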
One way might be to use pmin and pmax with do.call:
dat$min <- do.call(pmin, c(dat[,c(3:5)], na.rm=TRUE))
dat$max <- do.call(pmax, c(dat[,c(3:5)], na.rm=TRUE))
dat
# ID Name PM TP2 Sigma min max
#1 1 Tim 1 2 3 1 3
#2 2 Sam 0 NA 1 0 1
#3 3 Pam 2 1 NA 1 2
#4 4 Ali 1 0 2 0 2
#5 NA <NA> NA NA NA NA NA
#6 6 Tim 2 0 7 0 7
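For completeness, a dependency-free base-R sketch of the same idea: guard the all-NA case explicitly so that min()/max() are never called on an empty set (columns 3:5 are PM, TP2 and Sigma):
dat$Min <- apply(dat[, 3:5], 1, function(x) if (all(is.na(x))) NA else min(x, na.rm = TRUE))
dat$Max <- apply(dat[, 3:5], 1, function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE))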

How to find whether at least one column satisfies a certain condition, with NAs

I have a dataframe with multiple columns: I need to identify those rows in which there is at least one outlier among some of the columns, but I do not know how to deal with NAs.
An example of dataframe (different from mine):
# X atq ME.BE.crsp X2
# 1 10 0.5 4
# NA 2 1.3 5
# 3 NA 5 2
# NA NA NA NA
# 2 4 NA 3
I'm doing the following:
data = data %>%
  mutate(outlier = as.numeric(atq > quantile(atq, 0.99, na.rm = T) |
                              atq < quantile(atq, 0.01, na.rm = T) |
                              ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = T) |
                              ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = T)))
My expected result is (I'm making up the outliers, the point is about NAs):
# X atq ME.BE.crsp X2 outlier
# 1 10 0.5 4 1
# NA 2 1.3 5 0
# 3 NA 5 2 0
# NA NA NA NA NA
# 2 4 NA 3 1
What I get instead is:
# X atq ME.BE.crsp X2 outlier
# 1 10 0.5 4 1
# NA 2 1.3 5 0
# 3 NA 5 2 NA
# NA NA NA NA NA
# 2 4 NA 3 NA
So, it seems that as soon as the comparison hits an NA in either data$atq or data$ME.BE.crsp, the whole expression becomes NA, while I would like it to consider the non-NA value and assign 0 or 1 based on that one.
Any suggestions? Thanks!
If both 'atq' and 'ME.BE.crsp' are NA it should return NA, so use a condition with case_when:
library(dplyr)
data %>%
  mutate(outlier = case_when(
    is.na(atq) & is.na(ME.BE.crsp) ~ NA_real_,
    TRUE ~ as.numeric((atq > quantile(atq, 0.99, na.rm = TRUE)) & !is.na(atq) |
                      (atq < quantile(atq, 0.01, na.rm = TRUE)) & !is.na(atq) |
                      (ME.BE.crsp > quantile(ME.BE.crsp, 0.99, na.rm = TRUE)) & !is.na(ME.BE.crsp) |
                      (ME.BE.crsp < quantile(ME.BE.crsp, 0.01, na.rm = TRUE)) & !is.na(ME.BE.crsp))
  ))
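A more compact equivalent (a sketch, not from the original answer; is_out() is a hypothetical helper): flag outliers per column, treat NA comparisons as FALSE via coalesce(), and return NA only when both inputs are NA:
library(dplyr)
# hypothetical helper: TRUE where x is an outlier, FALSE elsewhere (including NAs)
is_out <- function(x) {
  out <- x > quantile(x, 0.99, na.rm = TRUE) | x < quantile(x, 0.01, na.rm = TRUE)
  coalesce(out, FALSE)
}
data <- data %>%
  mutate(outlier = if_else(is.na(atq) & is.na(ME.BE.crsp), NA_real_,
                           as.numeric(is_out(atq) | is_out(ME.BE.crsp))))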

Create a counter in a for loop in R

I'm an inexperienced R user and I need to create something quite complicated.
My dataset looks like this (screenshot not reproduced; the df built in the answer below reconstructs columns A, B and C):
a,b,c,d,e are different individuals.
I want to complete the D column as follows:
At the last line for each individual in column A, D = sum(C)/(B-1).
Expected results should look like this (screenshot not reproduced):
D4=sum(C2:C4)/(B4-1)=0.5
D6=sum(C5:C6)/(B6-1)=1, etc.
I attempted to deal with it with something like :
for(i in 2:NROW(dataset)){
  dataset[i,4] <- ifelse(
    (dataset[i,1] == dataset[i-1,1]), sum(dataset[i,3]) / (dataset[i,2]-1), NA
  )
}
But it is obviously not sufficient: it computes a D value for every row instead of only the last row for each individual, and it does not sum the C values for that individual.
And I really don't know how to figure it out. Do you guys have any advice ?
Many thanks.
If I understood your question correctly, then this is one approach to get to the desired result:
df <- data.frame(
  A = c("a","a","a","b","b","c","c","c","d","e","e"),
  B = c(3,3,3,2,2,3,3,3,1,2,2),
  C = c(NA,1,0,NA,1,NA,0,1,NA,NA,0),
  stringsAsFactors = FALSE)

for(i in 2:NROW(df)){
  df[i,4] <- ifelse(
    (df[i,1] != df[i+1,1] | i == nrow(df)),
    sum(df[df$A == df[i,1],]$C, na.rm = TRUE) / (df[i,2]-1),
    NA
  )
}
This code results in the following table:
A B C V4
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0
The ifelse() first tests whether the individual in the current row of column A differs from the individual in the next row, OR whether this is the last row.
If it is the last row for this individual, it takes the sum of column C (ignoring NAs) over all rows where that individual appears in column A, divided by the value in column B minus one.
Otherwise it puts an NA in the fourth column.
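The same logic can also be written without a loop (a hedged base-R sketch, assuming the df built above): duplicated(..., fromLast = TRUE) marks the last row per individual, and ave() computes the per-individual sum of C:
last <- !duplicated(df$A, fromLast = TRUE)  # TRUE on the last row of each individual
df$D <- ifelse(last,
               ave(df$C, df$A, FUN = function(x) sum(x, na.rm = TRUE)) / (df$B - 1),
               NA)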
Using dplyr you can try generating D for all rows and then removing it where it is not required:
df %>%
  group_by(A, B) %>%
  dplyr::mutate(D = sum(C, na.rm = TRUE) / (B-1)) %>%
  dplyr::mutate(D = if_else(row_number() == n(), D, as.double(NA)))
which gives:
Source: local data frame [11 x 4]
Groups: A, B [5]
A B C D
<chr> <dbl> <dbl> <dbl>
1 a 3 NA NA
2 a 3 1 NA
3 a 3 0 0.5
4 b 2 NA NA
5 b 2 1 1.0
6 c 3 NA NA
7 c 3 0 NA
8 c 3 1 0.5
9 d 1 NA NaN
10 e 2 NA NA
11 e 2 0 0.0

Merge, cbind: How to merge better? [duplicate]

(This question already has answers at "R: Adding NAs into Data Frame"; it was closed as a duplicate 6 years ago.)
I want to merge multiple vectors into a data frame. Two variables, city and id, are used for matching the vectors to the data frame.
df <- data.frame(array(NA, dim =c(10*50, 2)))
names(df)<-c("city", "id")
df[,1]<-rep(1:50, each=10)
df[,2]<-rep(1:10, 50)
I created a data frame like this. Into this data frame I want to merge 50 vectors, each corresponding to one of 50 cities. The problem is that each city has only 6 observations, so each city will end up with 4 NAs.
To give you an example, city 1 data looks like this:
set.seed(1234)
cbind(city=1, id=sample(1:10,6), obs=rnorm(6))
I have 50 cities data and I want to merge them to one column in df. I have tried the following code:
for(i in 1:50){
  citydata <- cbind(city=i, id=sample(1:10,6), obs=rnorm(6))  # each city's data
  df <- merge(df, citydata, by=c("city","id"), all=TRUE)      # merge into df
}
But if I run this, the loop will show warnings like this:
In merge.data.frame(df, citydata, by = c("city", "id"), ... :
column names ‘obs.x’, ‘obs.y’ are duplicated in the result
and it will create 50 columns, instead of one long column.
How can I merge cbind(city=i,id=sample(1:10,6),obs=rnorm(6)) into df as one nice, long column? It seems neither cbind nor merge is the way to go.
In case there are 50 citydata frames (each with 6 rows), I can rbind them into one long data frame and use the data.table approach or the expand.grid+merge approach, as Philip and Jaap suggested.
I wonder if I can merge each citydata through a loop one by one, instead of rbinding them all and merging the result to df.
data.table is good for this:
library(data.table)
df <- data.table(df)
> df
city id
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 1 5
---
496: 50 6
497: 50 7
498: 50 8
499: 50 9
500: 50 10
I'm using CJ instead of your for loop to make some dummy data. CJ cross-joins each column against each value of each other column, so it makes a two-column table with each possible pair of values of city and id. The [,obs:=rnorm(.N)] command adds a third column that draws random values (without recycling them, as would happen if it were inside the CJ); .N means "number of rows of this table" in this context.
citydata <- CJ(city=1:50,id=1:6)[,obs:=rnorm(.N)]
> citydata
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
296: 50 2 0.30592659
297: 50 3 -0.44989646
298: 50 4 0.05359738
299: 50 5 -0.57494269
300: 50 6 0.09565473
setkey(df,city,id)
setkey(citydata,city,id)
As these two tables have the same key columns, the following looks up rows of df by the key columns in citydata, then defines obs in df by looking up the value in citydata. The resulting object is therefore the original df, but with obs defined wherever it was defined in citydata:
df[citydata,obs:=i.obs]
> df
city id obs
1: 1 1 0.19168335
2: 1 2 0.35753229
3: 1 3 1.35707865
4: 1 4 1.91871907
5: 1 5 -0.56961647
---
496: 50 6 0.09565473
497: 50 7 NA
498: 50 8 NA
499: 50 9 NA
500: 50 10 NA
In base R you can do this with a combination of expand.grid and merge:
citydata <- expand.grid(city=1:50,id=1:6)
citydata$obs <- rnorm(nrow(citydata))
res <- merge(df, citydata, by = c("city","id"), all.x = TRUE)
which gives:
> head(res,12)
city id obs
1: 1 1 -0.3121133
2: 1 2 -1.3554576
3: 1 3 -0.9056468
4: 1 4 -0.6511869
5: 1 5 -1.0447499
6: 1 6 1.5939187
7: 1 7 NA
8: 1 8 NA
9: 1 9 NA
10: 1 10 NA
11: 2 1 0.5423479
12: 2 2 -2.3663335
A similar approach with dplyr and tidyr:
library(dplyr)
library(tidyr)
res <- crossing(city = 1:50, id = 1:6) %>%
  mutate(obs = rnorm(n())) %>%
  right_join(., df, by = c("city","id"))
which gives:
> res
Source: local data frame [500 x 3]
city id obs
(int) (int) (dbl)
1 1 1 -0.5335660
2 1 2 1.0582001
3 1 3 -1.3888310
4 1 4 1.8519262
5 1 5 -0.9971686
6 1 6 1.3508046
7 1 7 NA
8 1 8 NA
9 1 9 NA
10 1 10 NA
.. ... ... ...
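As for the follow-up question (merging each citydata one by one in a loop, without rbind): a hedged sketch is to pre-allocate the obs column once and fill it by matched position, which sidesteps merge() and its duplicated obs.x/obs.y columns entirely:
df$obs <- NA
set.seed(1234)
for (i in 1:50) {
  citydata <- data.frame(city = i, id = sample(1:10, 6), obs = rnorm(6))
  # locate each (city, id) pair of citydata inside df and fill obs in place
  idx <- match(paste(citydata$city, citydata$id), paste(df$city, df$id))
  df$obs[idx] <- citydata$obs
}
This keeps df's shape (500 rows, one obs column) and avoids the duplicated obs.x/obs.y columns entirely.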
