I'm trying to do some data manipulation in R. I have 2 data frames, one is training data, the other testing data all the data is categorical and stored as factor variables.
There are some NA's in the data and I'm trying to convert them to "-1". When I do it for the training data, things go fine, but not for the test data.
Something changes the values during a loop I run but I can't figure out what.
Here's the before:
> class(catTrain1[,"Cat_111"])
[1] "factor"
> class(catTest1[,"Cat_111"])
[1] "factor"
> table(catTrain1[,"Cat_111"])
1 2
726 25
> table(catTest1[,"Cat_111"])
0 1 2
1 503 15
Here's the loop:
> for(i in 1:ncol(catTrain1)){
+ catTrain1[,i] <- as.factor(as.character(ifelse(is.na(catTrain1[,i]), "-1", catTrain1[,i])))
+ }
> for(i in 1:ncol(catTest1)){
+ catTest1[,i] <- as.factor(as.character(ifelse(is.na(catTest1[,i]), "-1", catTest1[,i])))
+ }
Here's the after:
> table(catTrain1[,"Cat_111"])
1 2
726 25
> table(catTest1[,"Cat_111"])
1 2 3
1 503 15
I've seen the shift up by one with character -> numeric conversions but I can't figure out why this is happening, especially for just one of the dataframes / loops.
Any suggestions?
The column names in your first set of calls to table are the levels of the factor. In the second set of calls to table, the column names are the level indexes. ifelse is pulling the indexes, not the levels. In your loops, move the as.character in around the final catTest1[,i] and catTrain1[,i].
Try this instead. (More r-like, vectorized) :
levels( catTest1[,"Cat_111"] ) <- c( catTest1[,"Cat_111"], "-1")
catTest1[,"Cat_111"][ is.na(catTest1[,"Cat_111"]) ] <- -1
Related
I have two datasets. They refer to the same data. However, one has string as answers to questions, and the other has the corresponding codes.
library(data.table)
dat_string <- fread("str_col1 str_col2 numerical_col
One Alot 1
Two Alittle 0")
dat_codes <- fread("code_col1 code_col2 numerical_col
0 3 1
1 5 0")
I would like, to combine both datasets, so that the levels get attached to the corresponding codes as labels, (see this example) for all string columns (in dat_string).
Please note that the column names can have any format and do not necessarily have the format from the example/
What would be the easiest way to do this?
Desired outcome:
dat_codes$code_col1 <- factor(dat_codes$code_col1, levels=c("0", "1"),
labels=c("One", "Two"))
attributes(dat_codes$code_col1)$levels
[1] "One" "Two"
If I understand your edit - you are saying that both tables are the same shape, with the same row order, it is just that one has labels and one has levels. If that is the case it should be even more straightforward than my original response:
code_cols <- which(sapply(dat_string, is.character))
for(j in code_cols) {
set(dat_codes, j = j, value = factor(
dat_codes[[j]],
levels = unique(dat_codes[[j]]),
labels = unique(dat_string[[j]])
)
)
}
dat_codes
# code_col1 code_col2 numerical_col
# 1: One Alot 1
# 2: Two Alittle 0
dat_codes$code_col1
# [1] One Two
# Levels: One Two
sapply(dat_codes, class)
# code_col1 code_col2 numerical_col
# "factor" "factor" "integer"
I am trying to pull the cell values from the StudyID column to the empty cells SigmaID column, but I am running into an odd issue with the output.
This is how my data looks before running commands.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44
ST04238 0 50
LM04829 1 24 LM23921
ST91124 0 89
ST29001 0 55
I tried accomplishing this by writing the syntax in three ways, because I wasn't sure if there is a problem with the way the logic was set up. All three produce the same output.
df$SigmaID <- ifelse(test = df$SigmaID != "", yes = df$SigmaID, no = df$StudyID)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df3$SigmaID)
df %>% mutate(SigmaID = ifelse(Gender == 0, df$StudyID, df$SigmaID)
Output: instead of pulling the values from from the StudyID column, it is populating one to four digit numbers.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44 5
ST04238 0 50 4908
LM04829 1 24 LM23921
ST91124 0 89 209
ST29001 0 55 4092
I have tried recoding the empty spaces to NA and then calling on NA in the logic, but this produced the same output as seen above. I'm wondering if it could have anything to do with variable type or variable attributes and something's off about how it's reading the characters in StudyID. Would appreciate feedback on this issue!
Here is how to do it:
df$SigmaID[df$SigmaID == ""] = df$StudyID[df$SigmaID == ""]
df[df$SigmaID == ""] selects only the rows where SigmaID==""
I also recommend using data.table instead of data.frame. It is faster and has some useful syntax features:
library(data.table)
setDT(df) # setDT converts a data.frame to a data.table
df[SigmaID=="",SigmaId:=StudyID]
Following up on this! As it turns out, default R converts string types into factors. There are a few ways of addressing the issue above.
i <- sapply[df, is.factor]
df[i] <- lapply(df[i], as.character)
Another method:
df <- read.csv("/insert file pathway here", stringAsFactors = FALSE)
This is what I found to be helpful! I'm sure there are additional methods of troubleshooting this as well.
I've figured out if I use as.character(df[x,y]) or as.<whatever>df[x,y] I can get/coerce what I need, every time from my data frames
What I cant seem to find/figure out is why. Details below.
When I access df[1,1] (or anything in column 1) I get
df[1,1]
[1] a
Levels: a b c
but when I access 1,3 it works fine
> df[1,3]
[1] 10
but then when I use as.character() it works.
> as.character(df[1,1])
[1] "a"
The data frame was built using this line
df = data.frame(names = c("a","b","c"), size = c(1,2,3),num = c(10,20,30) )
> df
names size num
1 a 1 10
2 b 2 20
3 c 3 30
But in this data frame
imp2met = read.csv('tomet.csv', header = TRUE, sep=",",dec='.')
> imp2met
unit mult ret
1 (yd) 0.9100 (m)
2 (in) 2.5200 (cm)
3 .....
I get these results for 1,3
> imp2met[1,3]
[1] (m)
Levels: (c) (cm) (cm^2) ....
>
> as.character(imp2met[1,3])
[1] "(m)"
So why the "random" results? Why do I need as.<whatever>() but only some of the time?
data.frame default is to convert character vectors to factors. You can change this with the argument stringsAsFactors=FALSE
Also, when you subset a dataframe using [, you can add the drop=FALSE argument to simplify the results in some cases.
Apologises for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subject whose first vist (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine, it makes a 'lookup' table where I can see which IDs have a their first visit within 180 days of their diagnosis, which I could then take my original test and go through and make an indicator variable, but I should be able to do this is one step I'd have thought.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)
I'm using the "by" function in R to chop up a data frame and apply a function to different parts, like this:
pairwise.compare <- function(x) {
Nright <- ...
Nwrong <- ...
Ntied <- ...
return(c(Nright=Nright, Nwrong=Nwrong, Ntied=Ntied))
}
Z.by <- by(rankings, INDICES=list(rankings$Rater, rankings$Class), FUN=pairwise.compare)
The result (Z.by) looks something like this:
: 4
: 357
Nright Nwrong Ntied
3 0 0
------------------------------------------------------------
: 8
: 357
NULL
------------------------------------------------------------
: 10
: 470
Nright Nwrong Ntied
3 4 1
------------------------------------------------------------
: 11
: 470
Nright Nwrong Ntied
12 4 1
What I would like is to have this result converted into a data frame (with the NULL entries not present) so it looks like this:
Rater Class Nright Nwrong Ntied
1 4 357 3 0 0
2 10 470 3 4 1
3 11 470 12 4 1
How do I do that?
The by function returns a list, so you can do something like this:
data.frame(do.call("rbind", by(x, column, mean)))
Consider using ddply in the plyr package instead of by. It handles the work of adding the column to your dataframe.
Old thread, but for anyone who searches for this topic:
analysis = by(...)
data.frame(t(vapply(analysis,unlist,unlist(analysis[[1]]))))
unlist() will take an element of a by() output (in this case, analysis) and express it as a named vector.
vapply() does unlist to all the elemnts of analysis and outputs the result. It requires a dummy argument to know the output type, which is what analysis[[1]] is there for. You may need to add a check that analysis is not empty if that will be possible.
Each output will be a column, so t() transposes it to the desired orientation where each analysis entry becomes a row.
This expands upon Shane's solution of using rbind() but also adds columns identifying groups and removes NULL groups - two features which were requested in the question. By using base package functions, no other dependencies are required, e.g., plyr.
simplify_by_output = function(by_output) {
null_ind = unlist(lapply(by_output, is.null)) # by() returns NULL for combinations of grouping variables for which there are no data. rbind() ignores those, so you have to keep track of them.
by_df = do.call(rbind, by_output) # Combine the results into a data frame.
return(cbind(expand.grid(dimnames(by_output))[!null_ind, ], by_df)) # Add columns identifying groups, discarding names of groups for which no data exist.
}
I would do
x = by(data, list(data$x, data$y), function(d) whatever(d))
array(x, dim(x), dimnames(x))