Getting stale values when using ifelse in a data frame - r

Hi, I am aggregating values from two columns into a final third column, based on priority: if a value in column 1 is missing or NA, I fall back to column 2.
df = data.frame(internal = c(1, 5, "", 6, "NA"), external = c("", 6, 8, 9, 10))
df
  internal external
1        1
2        5        6
3                 8
4        6        9
5       NA       10
df$final <- df$internal
df$final <- ifelse(df$final == "" | df$final == "NA", df$external, df$final)
df
  internal external final
1        1              2
2        5        6     3
3                 8     4
4        6        9     4
5       NA       10     2
How can the final values be 4 and 2 for rows 3 and 5 when the external values are 8 and 10? I don't know what's wrong, but these values don't make any sense to me.

The issue arises because R converts your character values to factors, and ifelse() then returns the factors' underlying integer codes rather than their labels.
Your code will work fine with
df = data.frame(internal = c(1, 5, "", 6, "NA"), external = c("", 6, 8, 9, 10), stringsAsFactors = FALSE)
PS: this hideous conversion to factors definitely belongs in the R Inferno, http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
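Note that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so this no longer bites on current versions. On older versions you can also convert the factor columns to character before comparing; a minimal sketch, using the df from the question:
df$internal <- as.character(df$internal)
df$external <- as.character(df$external)
df$final <- ifelse(df$internal == "" | df$internal == "NA", df$external, df$internal)
df$final
# [1] "1"  "5"  "8"  "6"  "10"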

Related

Sequence value in data frame column

I need some help writing R.
I need to check whether a specific column in a data frame is correctly ordered in ascending sequence, e.g.:
df$id | df$order | df$any
    3 |        1 | a
    4 |        2 | a
    7 |        3 | b
    1 |        4 | b
    2 |        6 | a
    9 |        5 | a   # select this row - out of sequence in df$order
    8 |        7 | a
I would like to select the rows that don't follow the ascending sequence. In the example above, that would be the row with df$id equal to 9, because in df$order the value 5 is found after the value 6.
Obs. 1: in df$order, the numbers range from 1 to N, where N is greater than 1.
Obs. 2: If possible, I would like to use core libraries to solve the problem.
Any questions, just ask in the comments.
Thanks in advance!
Using base R:
subset(df, c(0, diff(order)) < 0)
  id order any
6  9     5   a
and the complement:
subset(df, c(0, diff(order)) >= 0)
  id order any
1  3     1   a
2  4     2   a
3  7     3   b
4  1     4   b
5  2     6   a
7  8     7   a
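For clarity, diff(order) gives each value minus its predecessor, and padding with a leading 0 keeps the vector the same length as the data frame, so a negative entry flags a row whose order value dropped below the one before it. A runnable sketch with the example data from the question:
df <- data.frame(id    = c(3, 4, 7, 1, 2, 9, 8),
                 order = c(1, 2, 3, 4, 6, 5, 7),
                 any   = c("a", "a", "b", "b", "a", "a", "a"))
c(0, diff(df$order))   # 0 1 1 1 2 -1 2
subset(df, c(0, diff(order)) < 0)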

How to subtract the mean of each variable from the mean of a specific variable

I would like to subtract the mean of each variable from the mean of a variable named 'birds' and create a new data frame that will contain the results. In my real data frame I have hundreds of variables, so I would like to do it automatically. Any idea how to do so?
I tried with this line of code; without the mean function it works (on the same data frame), but I don't understand how to use mean as part of the code:
setNames(as.data.frame(cbind(g, mean(dat$birds) - mean(dat))), c(names(dat), paste0(names(dat), '_new')))
Here is my toy data frame.
dat <- read.table(text = " birds wolfs snakes
3 9 7
3 8 4
1 2 8
1 2 3
1 8 3
6 1 2
6 7 1
6 1 5
5 9 7
3 8 7
4 2 7
1 2 3
7 6 3
6 1 1
6 3 9
6 1 1 ",header = TRUE)
I hope I understood your question correctly.
This should create a new object, in this case just a named vector, where the mean of the "birds" column is subtracted from the means of the other columns. It works for a data frame of any size.
bird_mean <- mean(dat$birds)
dat2 <- colMeans(dat[-1]) - bird_mean
In the future, please provide a reproducible example (in your code, the object 'g' is not defined) and an example of the expected output, so that it is clear what you are trying to achieve.
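If instead you want the direction used in the question's own attempt (mean of birds minus each column's mean) as a one-row data frame, here is a minimal sketch, assuming dat as defined above; the object name res is just illustrative:
res <- as.data.frame(t(mean(dat$birds) - colMeans(dat)))
names(res) <- paste0(names(dat), '_new')
res
#   birds_new wolfs_new snakes_new
# 1         0   -0.3125     -0.375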

recursive replacement in R

I am trying to clean some data and would like to replace zeros with the value from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)]=temp[which(temp==0)-1]
it returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
which is what I was hoping for.
Is there a nice way of doing this without looping?
Your vectorized replacement fails on consecutive zeros because the replacement values are read from the original vector, so the second zero just copies the zero in front of it. The operation you want is called "Last Observation Carried Forward" (LOCF) and is usually used to fill data gaps. It's a common operation on time series and is therefore implemented in package zoo:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
library(zoo)
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
You could use essentially your same logic, except you'll want to apply it to the values vector that results from rle(), which collapses each run of repeated values, so consecutive zeros become a single zero:
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6
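A base-R alternative without extra packages is to index the non-zero values by a running count; a minimal sketch, assuming the series does not start with a zero:
temp <- c(1, 2, 4, 5, 0, 0, 6, 7)
keep <- temp != 0
idx  <- cumsum(keep)   # index of the most recent non-zero value at each position
temp[keep][idx]
# [1] 1 2 4 5 5 5 6 7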

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
 1     2
 1     2
 1     3
 1     4
 1    10
 2     9
 2     9
 2    12
 2    13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
 1     2     2
 1     2
 1     3
 1     4
 1    10
 2     9     9
 2     9
 2    12
 2    13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
  ID Value INDEX
1  1     2     2
2  1     2    NA
3  1     3    NA
4  1     4    NA
5  1    10    NA
6  2     9     9
7  2     9    NA
8  2    12    NA
9  2    13    NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value defined by the values of ID; for the example, it returns a vector of five 2s followed by four 9s. Since all values except the first in each subset should be NA, the replacement function "is.na<-" sets to NA all values at the logical index c(FALSE, !diff(ID)). This index is TRUE whenever the ID is identical to the preceding one.
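The two intermediate pieces can be inspected directly; a short sketch on the example data A:
ave(A$Value, A$ID, FUN = min)
# [1] 2 2 2 2 2 9 9 9 9
c(FALSE, !diff(A$ID))
# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE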
You're almost there. We just need to make a custom function instead of seq_along and to split value by ID (not ID by ID).
first_min <- function(x){
  nas <- rep(NA, length(x))
  nas[which.min(x)] <- min(x, na.rm = TRUE)
  nas
}
This function makes a vector of NAs and fills in the group minimum at the position where it first occurs (which, in this example, is the first row of each ID group).
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
##   ID Value INDEX
## 1  1     2     2
## 2  1     2    NA
## 3  1     3    NA
## 4  1     4    NA
## 5  1    10    NA
## 6  2     9     9
## 7  2     9    NA
## 8  2    12    NA
## 9  2    13    NA
You can achieve this with a tapply one-liner:
df$Start <- as.vector(unlist(tapply(df$Value, df$ID, FUN = function(x) c(min(x), rep("", length(x) - 1)))))
Note that this fills the remaining cells with "", so Start becomes a character column, and it assumes the rows are already grouped by ID.
I keep coming back to this question, and the above answers helped me greatly.
There is a basic solution for beginners too:
A$Start <- NA
A[!duplicated(A$ID),]$Start <- A[!duplicated(A$ID),]$Value
(This copies the Value of each group's first row, which equals the group minimum only when the data are sorted within groups, as in the example.)
Thanks.

Reordering (deleting/changing order) columns of data in data frame

I have two large data sets and I am attempting to reformat the older data set to put the questions in the same order as the newer data set (so that I can easily perform t-tests on each identical question to track significant changes over the 2 years between data sets). The new version both deleted and added questions when changing from the old version.
The way I've been attempting to do this, R keeps crashing due to, as best I can figure, vectors being too large. I'm not sure how they are getting to be this large, however! Below is what I am doing:
Both data sets have the same format. The original sets are 415 columns for the new and 418 for the old. I want to match the first approximately 158 columns of the new data set to the old. Each data set has column names q1-q415, and the data in each column is numerical 1-5 or NA. There are approximately 100 answers per question/column; the old data set has more respondents (140 rows in old vs 114 rows in new). An example is below (but keep in mind there are over 400 columns in the full set and over 100 rows!)
The following is an example of what data.old looks like. data.new looks the same only data.new has more Rows of number/na answers. Here I show questions 1 through 20 and the first 10 rows.
data.old = 418 columns (q1 though q418) x 140 rows
data.new = 415 columns (q1 through q415) x 114 rows
I need to match the first 170 COLUMNS of data.old to the first 157 COLUMNS of data.new
To do this, I will be deleting 17 columns from data.old (questions that were in the data.old questionnaire and deleted from the data.new questionnaire) but also adding 7 new columns to data.old (which will contain NAs... placeholders for where data.new introduced new questions that did not exist in the data.old questionnaire)
> data.old
 q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
  1  3  4  3  3  5  4  1 NA   4  NA   1   2  NA   5   4   3   2   3   1
  3  4  5  2  2  4 NA  1  3   2   5   2  NA   3   2   1   4   3   2  NA
  2 NA  2  3  2  1  4  3  5   1   2   3   4   3  NA  NA   2   1   2   5
  1  2  4  1  2  5  2  3  2   1   3  NA  NA   2   1   5   5  NA   2   3
  4  3 NA  2  1 NA  3  4  2   2   1   4   5   5  NA   3   2   3   4   1
  5  2  1  5  3  2  3  3 NA   2   1   5   4   3   4   5   3  NA   2  NA
 NA  2  4  1  5  5 NA NA  2  NA   1   3   3   3   4   4   5   5   3   1
  4  5  4  5  5  4  3  4  3   2   5  NA   2  NA   2   3   5   4   5   4
  2  2  3  4  1  5  5  3 NA   2   1   3   5   4  NA   2   3   4   3   2
  2  1  5  3 NA  2  3 NA  4   5   5   3   2  NA   2   3   1   3   2   4
So in the new set, some of the questions were deleted, some new ones were added, and some changed order, so I went through and created subsets of old data in the order that I would need to combine them again to match the new dataset. When a question does not exist in the old data set, I want to use the question in the new data set so that I can (theoretically) perform my t-tests in a big loop.
dataold.set1 <- dataold[1:16]
dataold.set2 <- dataold[18:19]
dataold.set3 <- dataold[21:23]
dataold.set4 <- dataold[25:26]
dataold.set5 <- dataold[30:33]
dataold.set6 <- dataold[35:36]
dataold.set7 <- dataold[38:39]
dataold.set8 <- dataold[41:42]
dataold.set9 <- dataold[44]
dataold.set10 <- dataold[46:47]
dataold.set11 <- dataold[49:54]
dataold.set12 <- datanew[43:49]
dataold.set13 <- dataold[62:85]
dataold.set14 <- dataold[87:90]
dataold.set15 <- datanew[78]
dataold.set16 <- dataold[91:142]
dataold.set17 <- dataold[149:161]
dataold.set18 <- dataold[55:61]
dataold.set19 <- dataold[163:170]
I then was attempting to put the columns back together into one set
I tried both
dataold.adjust <- merge(dataold.set1, dataold.set2)
dataold.adjust <- merge(dataold.adjust, dataold.set3)
dataold.adjust <- merge(dataold.adjust, dataold.set4)
and I also tried
dataold.adjust <- cbind(dataold.set1, dataold.set2, dataold.set3)
However, every time I try to perform these functions, R freezes, then crashes. I managed to get it to display an error once, and it said it could not work with a vector of 10 Mb, and then I got multiple errors involving over 1000 Mb vectors. I'm not really sure how my vectors are that large, when this is crashing out by set 3, which is only 23 columns of data in a table, and the data sets I'm normally using are over 400 columns in length.
Is there another way to do this that won't cause my program to crash from memory issues (and won't require me to type out the names of over 100 columns), or is there some element of the code I am missing that is creating a memory sink? I've been attempting to troubleshoot it and have spent an hour dealing with R crashing without any luck figuring out how to make this work.
Thanks for the assistance!
You're making tons of unnecessary copies of your data and then you're growing the final object (dataold.adjust). You just need a vector that orders the columns correctly:
cols1 <- c(1:16,18:19,21:23,25:26,30:33,35:36,38:39,41:42,44,46:47,49:54)
cols2 <- c(62:85,87:90)
cols3 <- c(91:142,149:161,55:61,163:170)
# merge old / new data by row names and add NAs for unmatched rows
dataold.adjust <- merge(data.old[, c(cols1, cols2, cols3)],
                        data.new[, c(43:49, 78)], by = "row.names", all = TRUE)
# put columns in desired order (offset by 1 for the Row.names column)
n1 <- length(cols1); n2 <- length(cols2); n3 <- length(cols3)
dataold.adjust <- dataold.adjust[, c(1:(1 + n1),              # Row.names + 1st cols from dataold
                                     1 + n1 + n2 + n3 + 1:7,  # 1st cols from datanew (q43-q49)
                                     1 + n1 + 1:n2,           # 2nd cols from dataold
                                     ncol(dataold.adjust),    # 2nd col from datanew (q78)
                                     1 + n1 + n2 + 1:n3)]     # 3rd cols from dataold
The last part is an absolute kludge, but I've hit my self-imposed time limit for SO today. :)
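For reference, merge(x, y, by = "row.names") puts the Row.names column first, then x's columns, then y's, which is where the offset-by-one arithmetic above comes from. A tiny sketch:
x <- data.frame(a = 1:2, b = 3:4)
y <- data.frame(c = 5:6)
merge(x, y, by = "row.names", all = TRUE)
#   Row.names a b c
# 1         1 1 3 5
# 2         2 2 4 6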
