R: Replace column names with row values except where cells equal NA

I have a data frame extracted from a database that contains different types of data (record types). The different record types have different column names, which occupy the first three rows (including the header). The data frame is meant to be used in Excel, where you can easily filter the data by choosing the correct record type.
Here is a small sample of my data frame, which in reality contains many more columns (59) and rows (34000).
sample <- data.frame(X01RecordType=c("01HL","01CA","HH","HH","HH","HL"), X02Quarter=c(NA,NA,2,2,2,1),X05Gear=c(NA,NA,"KRA","KRA","KRA",NA),X06SweepLngt=c(NA,NA,35,35,-9,-9),
X12Month=c("12SpecCodeType",NA,4,5,4,2), X13Day=c("13SpecCode",NA,26,5,25,160617), X22StatRec=c("22LngtCode","22CANoAtLngt","45G1",NA,NA,NA),X23Depth=c("23LngtClass","23IndWgt",41,NA,63,NA))
As you can see, the column names consist of an X, a number and then some text, e.g. X01RecordType. It would be very easy to replace the column names with the first row by using:
colnames(df) <- df[1,]
However, some of the cells in the first two rows contain NA values. An NA here indicates that the column name is the same for all record types, i.e. the current header applies, so I would like to keep it. What I would really like to do is replace the column names with the values of the first row (where the record type equals 01HL), except where that value is NA.
If possible, I would like to do this without any external packages. Cells within the data may also contain NA values, and I want to keep those rows, so filtering out everything containing NA is not an option unless it applies only to the first row. That is the approach I tried, but I can't figure out how to make it work.
I hope this is all the information required to help me out and thanks!

Another option, without a loop:
colnames(sample)[!is.na(sample[1,])] <- sample[1,][!is.na(sample[1,])]
sample[1:2,]
# 01HL X02Quarter X05Gear X06SweepLngt 12SpecCodeType 13SpecCode 22LngtCode
#1 01HL NA <NA> NA 12SpecCodeType 13SpecCode 22LngtCode
#2 01CA NA <NA> NA <NA> <NA> 22CANoAtLngt
# 23LngtClass
#1 23LngtClass
#2 23IndWgt

I suggest a simple loop:
for (i in seq_along(sample)) if (!is.na(sample[1, i])) colnames(sample)[i] <- as.character(sample[1, i])
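
The same rule (take the first-row value unless it is NA, otherwise keep the existing header) can also be written with ifelse. This is only a minimal sketch in base R on a cut-down version of the sample; the names df, first_row and new_names are made up for illustration.

```r
# Cut-down, hypothetical version of the sample data
df <- data.frame(
  X01RecordType = c("01HL", "01CA", "HH"),
  X02Quarter    = c(NA, NA, 2),
  X12Month      = c("12SpecCodeType", NA, 4),
  stringsAsFactors = FALSE
)

# First row as a character vector (NA stays NA; also works for factor columns)
first_row <- vapply(df[1, ], as.character, character(1))

# Keep the old header wherever the first row is NA
new_names <- unname(ifelse(is.na(first_row), colnames(df), first_row))
colnames(df) <- new_names
```

Because the replacement is computed for all columns at once, nothing is indexed on the left-hand side, which makes the intent a little easier to read than the subset-assignment version.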

Related

R: stacking up values from rows of a data frame

I started programming in R yesterday (literally), and I am having the following issue:
- I have a data frame containing R rows, and each row contains N values.
Rows are identified by the first and second fields, while the other N-2 are just numerical values or NA.
- Some rows have identical first fields and identical second fields, something like:
row 1: a,b, third_field, .. ,last_field
row 2: a,b, third_field, .. ,last_field
The rule is that usually the first row has some fields containing numbers and some containing NA, while the second row also contains both NA and numbers, but distributed differently.
What I am trying to do is merge the two rows (or records) according to these two rules:
1) if both rows have NA in a given field, I keep NA
2) if one of the two has a number, I use that value; if both rows contain the same value, I keep it too.
How do you do this without looping over each field of each row? (With 1M rows and tens of fields, it would finish maybe tomorrow.)
I do not know how to explain my problem better. I am sorry for the lengthy explanation; thanks a lot.
EDIT: it is better if I add an example. The following two lines
a,b,NA,NA,NA,1,2 ,NA
a,b,NA,3 ,NA,1,NA,NA
should become
a,b,NA,3 ,NA,1,2 ,NA
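
One way to express the two rules without an explicit loop is to group by the two key fields and take the first non-NA value per field. The following is only a sketch in base R; the column names k1, k2, v1..v6 and the helper first_non_na are assumptions, and aggregate may still be slow on 1M rows.

```r
# Hypothetical data mirroring the two example lines
df <- data.frame(
  k1 = c("a", "a"), k2 = c("b", "b"),
  v1 = c(NA_real_, NA_real_), v2 = c(NA, 3), v3 = c(NA_real_, NA_real_),
  v4 = c(1, 1), v5 = c(2, NA), v6 = c(NA_real_, NA_real_)
)

# First non-NA value of a vector, or NA if every value is NA
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA
}

# One output row per (k1, k2) pair; each value column is collapsed separately
merged <- aggregate(df[-(1:2)], by = df[1:2], FUN = first_non_na)
```

This assumes that when both rows hold a number, the numbers agree (as in the question); if they can differ, the first one wins.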

Replicating rows in a data frame using information in other columns

I have the following data frame NewTests:
The problem: I want to replicate the rows in the data frame using the information in the 'duration' column. In this case the replication factor is 1, but it can be anything: 2, 3, 4, etc. In the replicated rows I want to add a new column called 'Date', which contains information from the PromotionStartDate and PromotionEndDate columns.
E.g. in this case, the Date column should contain 2017-04-01 for the entries shown. But in another case, where duration is 2, PromotionStartDate = 2017-03-01 and PromotionEndDate = 2017-05-01, replicated row 1 should contain 2017-03-01 in the Date column and replicated row 2 should contain 2017-04-01.
I am trying to use the following solution to work out my problem:
library(splitstackshape)
library(dplyr)  # provides %>%, group_by() and mutate()
newConrtols=expandRows(NewTests,"duration",drop=FALSE)%>%
group_by(CustomerNumber,PromotionID,RewardAssigned,RunID,ModelID)%>%
mutate(Date=seq(as.Date(PromotionStartDate),as.Date(PromotionEndDate),by="month")[1:duration])
But this gives the error:
Error in mutate_impl(.data, dots) : 'from' must be of length 1
What am I doing wrong in the solution?

That is quite simple: you just select the right column by its index, as in the following example.
Let's say I have a data frame like yours.
a=c(1,2,3,4)
b=c("a","b","c","d")
t=data.frame(a,b)
Now, if I want the second row, I would type
t[2,]
And if I want the second column, I would type
t[,2]
If I want to copy that into a new third column, I would do
t[,3]=t[,2]
In your case, if you wanted to copy the dates from column PromotionDate to FinalDate (assuming here that they are columns 7 and 6), you could write this line: your_variable_with_the_dataframe[,6]=your_variable_with_the_dataframe[,7]
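
As for the error itself: seq() on dates accepts a single 'from' value, and inside a group PromotionStartDate is most likely a vector with several elements, hence "'from' must be of length 1". Below is a minimal base R sketch of the replication the question describes. The data are a made-up stand-in for NewTests with only the relevant columns, and idx, offset and expanded are illustrative names.

```r
# Hypothetical stand-in for NewTests
NewTests <- data.frame(
  PromotionID        = c("P1", "P2"),
  PromotionStartDate = as.Date(c("2017-04-01", "2017-03-01")),
  PromotionEndDate   = as.Date(c("2017-05-01", "2017-05-01")),
  duration           = c(1, 2)
)

# Repeat each row 'duration' times
idx <- rep(seq_len(nrow(NewTests)), times = NewTests$duration)
expanded <- NewTests[idx, ]

# Month offset within each original row: 0, 1, ..., duration - 1
offset <- sequence(NewTests$duration) - 1

# Call seq() once per row, so 'from' is always a single date
expanded$Date <- do.call(c, lapply(seq_len(nrow(expanded)), function(i) {
  seq(expanded$PromotionStartDate[i], by = "month",
      length.out = offset[i] + 1)[offset[i] + 1]
}))
```

For the duration-2 promotion this yields 2017-03-01 and 2017-04-01, matching the example in the question. PromotionEndDate is not used here because duration already fixes the number of months.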

How to combine several dataframes with different rows using R?

I have several text files, each containing 2 columns but a different number of rows. I would like to draw a plot using ggplot2 as explained in a linked example; however, that approach works for data frames with equal numbers of rows, and I couldn't reproduce it with data frames that have different numbers of rows.
Please let me know how I should combine these data frames (with different numbers of rows) using R.
case size
case1 129
case2 129
case3 130
case4 131
case5 132
case6 132
Thank you
It seems from the comments that you're actually trying to merge multiple columns and then plot each column individually. The problem, however, is that each of these columns has a different number of rows. Therefore you need to combine them based on some common variable (i.e. row names).
Using the examples from the link you provided:
df1 = data.frame(size=runif(300,300,1200))
#now adding an unequal column
df2 = data.frame(size=df1[c(1:275),])
Now merge the data frames based on row number. "all=TRUE" keeps all the values, "by=0" merges by row.names.
df.all=merge(df1$size,df2$size,by=0,all=TRUE)
#and to order the row names.
df.all=df.all[order(as.numeric(df.all[,1])),]
#finally, if you want to replace the NA values with 0
df.all[is.na(df.all)]=0
Does that get you the data.frame you want?
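
If the goal is a ggplot2 plot, a different technique from the merge above may be more convenient: stack the data frames into one long data frame with a label column, so that unequal row counts need no padding. This is only a sketch with made-up inputs; frames, long and the labels file1 and file2 are hypothetical.

```r
# Hypothetical inputs with different numbers of rows
df1 <- data.frame(size = c(129, 129, 130, 131, 132, 132))
df2 <- data.frame(size = c(129, 130, 131))

# Tag each data frame with a label, then stack them row-wise
frames <- list(file1 = df1, file2 = df2)
long <- do.call(rbind, lapply(names(frames), function(n) {
  data.frame(source = n, size = frames[[n]]$size)
}))
```

The source column can then be mapped to colour or facets, and no NA or 0 filler values end up in the plot.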

How to extract specific rows depending on part of the strings in one column in R

In R, I am trying to extract the rows that contain a specific string in one column.
The data structure is as follows:
ERC1 20679 14959 9770 RAB6-interacting protein 2 isoform
I want to extract the rows which have RAB6 in the last column. That column contains other words besides RAB6, so I cannot use column == "RAB6" to get them. It's just like a search function in Excel. Does anyone have any ideas?
Assuming that your data frame is df:
df[grep("^RAB6", df$column),]
If not all values start with RAB6, remove the ^.
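
A small self-contained illustration of the "contains" case, using grepl with fixed = TRUE so that RAB6 is matched as a literal substring anywhere in the text. The data frame and its column names (gene, description) are invented for this sketch.

```r
# Hypothetical data; only the first row comes from the question
df <- data.frame(
  gene = c("ERC1", "GENE2", "GENE3"),
  description = c("RAB6-interacting protein 2 isoform",
                  "kinase",
                  "putative RAB6 effector"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the description contains "RAB6" anywhere
hits <- df[grepl("RAB6", df$description, fixed = TRUE), ]
```

grepl returns one TRUE/FALSE per row, which also makes it easy to negate the filter with ! to keep the rows that do not match.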

extract columns that don't have a header or name in R

I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names ~ plot(data$V1, data$V2) but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data)=c("This","That","Other")
plot(data$This,data$That)
That's a better solution than using the column number: names are meaningful, and if your data changes to have a different number of columns, code that relies on column positions may break in several places. Give your data the correct names, and as long as you always refer to data$This your code will keep working.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it stems from matrix calculations. E.g., a 4x3 dimensional matrix has 4 rows and 3 columns. Thus when I want to select the 1st row of the third column, I could do something like matrix[1,3]
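
Positional indexing applied to the original task (first column against the other two) could look like the sketch below. The matrix m and its values are made up, and matplot from base graphics is just one way to draw two series against a common x.

```r
# Hypothetical 4 x 3 matrix without column names
m <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40,
              5, 6, 7, 8), nrow = 4)

x  <- m[, 1]    # all rows of the first column
ys <- m[, 2:3]  # all rows of the second and third columns

# One line per column of ys, plotted against x
matplot(x, ys, type = "l",
        xlab = "first column", ylab = "second and third columns")
```

The same bracket syntax works on a data frame, e.g. data[, 1] or data[[1]].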
