How to drop the X prefix in column names after merge in R

I've merged two data frames by a common ID column, and the merge worked fine, but I am getting an X before each numeric column name.
How can I remove the X from each column header?
z <- merge(p, y, by = 'ID')
head(z)
ID x y V1 X198101 X198102 X198103 X198104 X198105 X198106
1 410320 -122.5417 37.75 NA 119.45 33.15 104.23 5.61 4.85 0
2 410321 -122.5000 37.75 NA 129.49 37.76 114.94 5.28 5.24 0
3 410322 -122.4583 37.75 NA 163.68 42.80 131.22 7.25 6.94 0
4 410323 -122.4167 37.75 NA 141.14 32.26 110.45 7.77 4.62 0
5 410324 -122.3750 37.75 NA 130.87 25.87 102.15 8.38 4.13 0
6 410325 -122.3333 37.75 NA 129.03 25.21 102.37 9.42 4.35 0
Thanks!

It is better for column names not to start with a number. By default, make.names (called via check.names when a data.frame is created or read in) adds an 'X' prefix to names that start with a number. To remove it, one option is sub:
names(z) <- sub("^X", "", names(z))
z
# ID x y V1 198101 198102 198103 198104 198105 198106
#1 410320 -122.5417 37.75 NA 119.45 33.15 104.23 5.61 4.85 0
#2 410321 -122.5000 37.75 NA 129.49 37.76 114.94 5.28 5.24 0
#3 410322 -122.4583 37.75 NA 163.68 42.80 131.22 7.25 6.94 0
#4 410323 -122.4167 37.75 NA 141.14 32.26 110.45 7.77 4.62 0
#5 410324 -122.3750 37.75 NA 130.87 25.87 102.15 8.38 4.13 0
#6 410325 -122.3333 37.75 NA 129.03 25.21 102.37 9.42 4.35 0
If we apply make.names to the cleaned names:
make.names(names(z))
#[1] "ID" "x" "y" "V1" "X198101" "X198102"
#[7] "X198103" "X198104" "X198105" "X198106"
The 'X' prefix is returned. So, in general, it is safer to have column names with a character prefix instead of just numbers. Also, to extract, say, the '198104' column, we need backticks:
z$198104
#Error: unexpected numeric constant in "z$198104"
z$`198104`
#[1] 5.61 5.28 7.25 7.77 8.38 9.42

This isn't actually caused by merge; it must be something earlier in your code. If it happens when you read in the data, try the check.names = FALSE option. merge itself leaves numeric column names intact:
a <- data.frame(a=1:3, b=4:6)
b <- data.frame(a=1:3, c=7:9)
names(b)[2] <- 2485
merge(a,b)
## a b 2485
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
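For example, a minimal sketch of reading data with numeric column names preserved (data.csv is a hypothetical file name):
# check.names = FALSE skips make.names(), so numeric column
# names are kept as-is instead of gaining an 'X' prefix
p <- read.csv("data.csv", check.names = FALSE)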

Related

na.omit removes all rows with NA in any column and not only in the specified columns

Considering the following object (data):
collection.number weight bill.length bill.height bill.width
1 XXXXX29985 11.5 16.10 4.07 6.74
2 XXXXX29986 11.6 17.43 4.17 6.39
3 XXXX391828 NA 14.85 4.02 7.19
4 XXXX328017 NA 16.92 3.38 NA
5 XX28024 NA 14.79 NA 6.00
6 XX28095 NA 15.80 4.17 6.54
I'm trying to remove all rows that have NA in any of the following columns: bill.length, bill.height and bill.width.
when I do:
data.filtered<-na.omit(data[3:5])
or
data.filtered<-na.omit(data, cols = c("bill.length","bill.height","bill.width"))
it removes all rows that have NA in the "weight" column and not only in the specified columns. What am I doing wrong? Is there an easier way to remove those rows?
Here, an easier option is complete.cases, which by its description takes
a sequence of vectors, matrices and data frames.
and returns a logical vector that can be used as a row index for subsetting the dataset:
data[complete.cases(data[3:5]),]
# collection.number weight bill.length bill.height bill.width
#1 XXXXX29985 11.5 16.10 4.07 6.74
#2 XXXXX29986 11.6 17.43 4.17 6.39
#3 XXXX391828 NA 14.85 4.02 7.19
#6 XX28095 NA 15.80 4.17 6.54
na.omit returns the dataset itself after removing rows with an NA in any of its columns. So when we apply na.omit to the data already subset by columns, it returns only the rows of that column subset, without the remaining columns.
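To see the two behaviours side by side, here is a minimal sketch that reconstructs the example data:
data <- data.frame(
  collection.number = c("XXXXX29985", "XXXXX29986", "XXXX391828",
                        "XXXX328017", "XX28024", "XX28095"),
  weight      = c(11.5, 11.6, NA, NA, NA, NA),
  bill.length = c(16.10, 17.43, 14.85, 16.92, 14.79, 15.80),
  bill.height = c(4.07, 4.17, 4.02, 3.38, NA, 4.17),
  bill.width  = c(6.74, 6.39, 7.19, NA, 6.00, 6.54)
)
# na.omit drops the right rows (4 and 5) but returns only columns 3:5
na.omit(data[3:5])
# complete.cases computes the row index on columns 3:5
# and applies it to the full data, keeping every column
data[complete.cases(data[3:5]), ]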

Taking inverse of certain rows in dataframe

I have a dataframe of market trades and need to multiply only the put returns by -1. I have the code for that, but can't figure out how to assign it back without affecting the calls.
Input df:
Date Type Stock_Open Stock_Close Stock_ROI
0 2016-04-27 Call 5.33 4.80 -0.099437
1 2016-06-03 Put 4.80 4.52 -0.058333
2 2016-06-30 Call 4.52 5.29 0.170354
3 2016-07-21 Put 5.29 4.84 -0.085066
4 2016-08-08 Call 4.84 5.35 0.105372
5 2016-08-25 Put 5.35 4.65 -0.130841
6 2016-09-21 Call 4.65 5.07 0.090323
7 2016-10-13 Put 5.07 4.12 -0.187377
8 2016-11-04 Call 4.12 4.79 0.162621
Code:
flipped_puts = trades_df[trades_df['Type']=='Put']['Stock_ROI']*-1
trades_df['Stock_ROI'] = flipped_puts
Output of flipped puts:
1 0.058333
3 0.085066
5 0.130841
7 0.187377
Output of the original DF after the assignment:
Date Type Stock_Open Stock_Close Stock_ROI
0 2016-04-27 Call 5.33 4.80 NaN
1 2016-06-03 Put 4.80 4.52 0.058333
2 2016-06-30 Call 4.52 5.29 NaN
3 2016-07-21 Put 5.29 4.84 0.085066
4 2016-08-08 Call 4.84 5.35 NaN
5 2016-08-25 Put 5.35 4.65 0.130841
6 2016-09-21 Call 4.65 5.07 NaN
7 2016-10-13 Put 5.07 4.12 0.187377
8 2016-11-04 Call 4.12 4.79 NaN
The assignment fails because flipped_puts contains only the Put rows, so assigning it back to trades_df['Stock_ROI'] aligns on the index and fills every Call row with NaN. To flip the sign in place for just the Put rows, try
trades_df.loc[trades_df.Type.eq('Put'), 'Stock_ROI'] *= -1
or
trades_df.update(trades_df.query('Type == "Put"').Stock_ROI.mul(-1))
Both update trades_df in place and leave the Call rows untouched.
We can also use data.table in R. Convert the 'data.frame' to a 'data.table' (setDT(trades_df)), specify the logical condition in 'i', multiply the 'Stock_ROI' by -1, and assign (:=) it to a new column; rows that don't match the condition are filled with NA.
library(data.table)
setDT(trades_df)[Type == 'Put', Stock_ROIN := Stock_ROI * -1][]
If we want to update the same column instead:
setDT(trades_df)[Type == 'Put', Stock_ROI := Stock_ROI * -1]
trades_df
# Date Type Stock_Open Stock_Close Stock_ROI
#1: 2016-04-27 Call 5.33 4.80 -0.099437
#2: 2016-06-03 Put 4.80 4.52 0.058333
#3: 2016-06-30 Call 4.52 5.29 0.170354
#4: 2016-07-21 Put 5.29 4.84 0.085066
#5: 2016-08-08 Call 4.84 5.35 0.105372
#6: 2016-08-25 Put 5.35 4.65 0.130841
#7: 2016-09-21 Call 4.65 5.07 0.090323
#8: 2016-10-13 Put 5.07 4.12 0.187377
#9: 2016-11-04 Call 4.12 4.79 0.162621
and if we also want to set the other rows to NA (starting again from the original data):
setDT(trades_df)[Type == 'Put', Stock_ROI := Stock_ROI * -1
                ][Type != 'Put', Stock_ROI := NA]
trades_df
# Date Type Stock_Open Stock_Close Stock_ROI
#1: 2016-04-27 Call 5.33 4.80 NA
#2: 2016-06-03 Put 4.80 4.52 0.058333
#3: 2016-06-30 Call 4.52 5.29 NA
#4: 2016-07-21 Put 5.29 4.84 0.085066
#5: 2016-08-08 Call 4.84 5.35 NA
#6: 2016-08-25 Put 5.35 4.65 0.130841
#7: 2016-09-21 Call 4.65 5.07 NA
#8: 2016-10-13 Put 5.07 4.12 0.187377
#9: 2016-11-04 Call 4.12 4.79 NA
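For comparison, a minimal base R sketch of the same in-place update (not part of the original answers):
# flip the sign only where Type is 'Put'; other rows are untouched
idx <- trades_df$Type == "Put"
trades_df$Stock_ROI[idx] <- -trades_df$Stock_ROI[idx]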

Clean a dataset, set rows to NA based on different columns and different values

My dataset looks like this:
I would like to clean it so that the entire row is set to NA whenever "QR" shows C:
SO4 PO4 LabConductivity LabPH Notes QR
1 0.131 0.00100 3.98 5.25 dmz B
2 0.109 0.00126 3.54 5.27 mz B
3 0.219 -0.5656 6.28 5.23 <NA> A
4 0.219 -0.5656 6.28 -5.66 <NA> C
5 0.219 -0.5656 6.28 5.23 <NA> C
I can do that with:
mydata[mydata$QR=="C",] <- NA
However, I would like to keep doing that for other variables, e.g. set the entire row to NA when LabPH is >6 or <0.
If I do the same thing again I get the following error:
Error in `[<-.data.frame`(`*tmp*`, mydata$LabPH > 5 | mydata$LabPH < 0, : missing values are not allowed in subscripted assignments of data frames
Is there another way of doing this? Is there an ignore-NA option for that case?
Or is there an entirely better way of doing this?
Thanks so much in advance
cheers
Sandra
Subsetting a data.frame misbehaves when the column used in the logical test contains NA: rows where the test evaluates to NA are returned as all-NA rows. For example, note the rows with LabPH = NA below:
> mydata[mydata$LabPH > 5.25,]
SO4 PO4 LabConductivity LabPH Notes QR
2 0.109 0.00126 3.54 5.27 mz B
NA NA NA NA NA <NA> <NA>
NA.1 NA NA NA NA <NA> <NA>
Wrapping the test in which(), as in mydata[which(mydata$LabPH > 5.25),], works because which() drops the NA results. Another way is to use !is.na() to exclude the NAs explicitly:
> new <- mydata[!is.na(mydata$LabPH)&mydata$LabPH > 5.25,]
> new
SO4 PO4 LabConductivity LabPH Notes QR
2 0.109 0.00126 3.54 5.27 mz B
For your assignment, you can likewise just add a which() around your logical test.
For example,
mydata[which(mydata$LabPH > 5.25),] <- NA
Isn't excluding the data just as good as replacing the entire row with NA? If so, considering your conditions (QR = "C", and LabPH between 0 and 6), here is a way to do that...
# Please note I added a random 6th row with LabPH = 7.0.
SO4 = c(0.131,0.109,0.219,0.219,0.219,0.21)
PO4 = c(0.00100,0.00126,-0.5656,-0.5656,-0.5656,-0.532)
LabConductivity = c(3.98, 3.54, 6.28, 6.28, 6.28,6.25)
LabPH = c(5.25,5.27,5.23,-5.66,5.23,7.0)
Notes = c("dmz","mz","<NA>","<NA>","<NA>","mz")
QR = c("B","B","A","C","C","B")
# create a data frame
df = data.frame(SO4,PO4,LabConductivity,LabPH,Notes,QR)
df
SO4 PO4 LabConductivity LabPH Notes QR
1 0.131 0.00100 3.98 5.25 dmz B
2 0.109 0.00126 3.54 5.27 mz B
3 0.219 -0.56560 6.28 5.23 <NA> A
4 0.219 -0.56560 6.28 -5.66 <NA> C
5 0.219 -0.56560 6.28 5.23 <NA> C
6 0.210 -0.53200 6.25 7.00 mz B
# Subset based on your condition
df[which((df$LabPH > 0 & df$LabPH < 6) & df$QR != "C"),]
# output
SO4 PO4 LabConductivity LabPH Notes QR
1 0.131 0.00100 3.98 5.25 dmz B
2 0.109 0.00126 3.54 5.27 mz B
3 0.219 -0.56560 6.28 5.23 <NA> A
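If you do want to keep the rows but blank them out instead, a sketch combining your conditions with the which() trick from above (so NAs in LabPH cannot break the assignment):
# rows to blank out: QR is "C", or LabPH outside (0, 6)
bad <- df$QR == "C" | df$LabPH <= 0 | df$LabPH >= 6
# which() drops NA test results, avoiding the
# "missing values are not allowed" error
df[which(bad), ] <- NA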

common data/sample in two dataframes in R

I'm trying to compare model-based forecasts from two different models. Model 2, however, requires more non-missing data and thus has more missing values (NA) than model 1.
I am wondering how I can quickly query both dataframes for non-missing values and identify the common sample. I used to work with Excel, where a function like
=IF(AND(ISVALUE(a1);ISVALUE(b1));then;else)
comes to mind, but I don't know how to do this properly in R.
This is my df from model 1: Every observation is clearly identified by id and time.
(The row numbers on the left are from my overall dataframe and are identical in both dataframes.)
> head(model1)
id time f1 f2 f3 f4 f5
9 1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
10 1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
11 1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
12 1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
13 1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
14 1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737
and this model 2:
> head(model2)
id time meanf1 meanf2 meanf3 meanf4 meanf5
9 1 1995 4.56 5.14 6.05 NA NA
10 1 1996 4.38 4.94 NA NA NA
11 1 1997 4.05 4.51 NA NA NA
12 1 1998 4.07 5.04 6.52 NA NA
13 1 1999 3.61 4.96 NA NA NA
14 1 2000 4.35 4.83 6.46 NA NA
Thank you for your help and hints.
The function complete.cases identifies the rows with non-missing data across all columns. Note that the pairs (f4, meanf4) and (f5, meanf5) have no common non-missing values in the sample data, so those merges come back with zero observations. Is this what you were looking for?
#Read Data
model1=read.table(text='id time f1 f2 f3 f4 f5
1 1995 16.351261 -1.856662 6.577671 10.7883178 22.5349438
1 1996 15.942914 -1.749530 2.894190 0.6058255 1.7057163
1 1997 24.187390 15.099166 14.275441 -4.9963831 -0.1866863
1 1998 3.101094 -10.455754 -9.674086 -9.8456466 -8.5525140
1 1999 33.562234 2.610512 -15.237620 -18.8095980 -17.6351989
1 2000 59.979666 -45.106093 -100.352866 -56.6137325 -32.0315737',header=TRUE)
model2=read.table(text=' id time meanf1 meanf2 meanf3 meanf4 meanf5
1 1995 4.56 5.14 6.05 NA NA
1 1996 4.38 4.94 NA NA NA
1 1997 4.05 4.51 NA NA NA
1 1998 4.07 5.04 6.52 NA NA
1 1999 3.61 4.96 NA NA NA
1 2000 4.35 4.83 6.46 NA NA',header=TRUE)
# column indices of f1..f5 are 3..7
# merge the data for each of f1..f5 and keep only the
# non-missing pairs using complete.cases()
DF_list <- lapply(3:7, function(x) {
  DF <- merge(model1[, c(1, 2, x)], model2[, c(1, 2, x)], by = c("id", "time"))
  DF[complete.cases(DF), ]
})
DF_list
#[[1]]
# id time f1 meanf1
#1 1 1995 16.351261 4.56
#2 1 1996 15.942914 4.38
#3 1 1997 24.187390 4.05
#4 1 1998 3.101094 4.07
#5 1 1999 33.562234 3.61
#6 1 2000 59.979666 4.35
#
#[[2]]
# id time f2 meanf2
#1 1 1995 -1.856662 5.14
#2 1 1996 -1.749530 4.94
#3 1 1997 15.099166 4.51
#4 1 1998 -10.455754 5.04
#5 1 1999 2.610512 4.96
#6 1 2000 -45.106093 4.83
#
#[[3]]
# id time f3 meanf3
#1 1 1995 6.577671 6.05
#4 1 1998 -9.674086 6.52
#6 1 2000 -100.352866 6.46
#
#[[4]]
#[1] id time f4 meanf4
#<0 rows> (or 0-length row.names)
#
#[[5]]
#[1] id time f5 meanf5
#<0 rows> (or 0-length row.names)
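If instead you want a single common sample across all the forecast columns at once, a minimal sketch (with this sample data the result is empty, since meanf4 and meanf5 are entirely NA):
both <- merge(model1, model2, by = c("id", "time"))
# keep only the rows that are complete in both models
both[complete.cases(both), ]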

selecting column N of matrix/dataframe where N is based on another vector

I have a dataframe X with several columns and want to select column N for each row, where N differs per row depending on some vector (in this example: the values in column sel).
A B C D sel
16/04/2012 NA -1.25 -1.25 0.25 1
17/04/2012 NA 20 21.25 17.25 1
18/04/2012 -5.25 -5.25 -5.75 -1 2
19/04/2012 -6 -6 -6.25 -12 2
20/04/2012 2.5 2.5 2.75 NA 2
23/04/2012 NA -12.25 -12 NA 2
24/04/2012 NA 7.25 7.5 7.25 2
25/04/2012 NA 17.5 17 18.25 4
26/04/2012 NA 9.5 10 11.5 4
27/04/2012 NA 2 1 -3.25 4
30/04/2012 NA -4.75 -4 -1 4
01/05/2012 NA 6.25 5.75 17 3
02/05/2012 NA -3 -2.75 -16 3
03/05/2012 NA -11.5 -11.5 -6.75 4
04/05/2012 NA -23.5 -23.75 -23 4
so I would end up with:
16/04/2012 NA
17/04/2012 NA
18/04/2012 -5.25
19/04/2012 -6
20/04/2012 2.5
23/04/2012 -12.25
24/04/2012 7.25
25/04/2012 18.25
26/04/2012 11.5
27/04/2012 -3.25
30/04/2012 -1
01/05/2012 5.75
02/05/2012 -2.75
03/05/2012 -6.75
04/05/2012 -23
X[,X$sel]
gave me a square matrix (nrow(X) rows by nrow(X) columns), not quite what I need.
Is there some sort of Excel-INDEX-type function I can use, maybe inside an apply function?
You can subset a data frame by passing a two-column matrix with row numbers in the first column and column numbers in the second column. So:
X[matrix(ncol = 2, c(1:nrow(X), X$sel))]
will give you a vector of the selected elements, which you can then build into whatever result data frame you're aiming for. Or just add it to the existing data frame like this:
X$selected_values <- X[matrix(ncol = 2, c(1:nrow(X), X$sel))]
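For the two-column result shown in the question, a short sketch (assuming the dates are the row names of X and sel holds valid column numbers):
# each row of the index matrix is (row number, column number)
picked <- X[cbind(seq_len(nrow(X)), X$sel)]
data.frame(date = rownames(X), value = picked)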
