I'm working with the survival library in R. I used residuals() on a fitted survival model, which nicely outputs the residuals I want.
My question is about how R treats its row indexes.
Here is a simplified example; my goal is to merge the output back onto the original input.
Input data frame (row index - value):
1 - .2
2 - .4
3 - .6
4 - .8
Output (row index - value):
1 - .2X
2 - .4X
4 - .8X
The output is a subset of the input (some rows couldn't be processed). What I'd like is to join this new list back onto the original input so I can plot, run regressions, etc.
I don't know how to access the row indexes outside of the simple df[number] command. My current approach is a bit prehistoric: I write.table() the objects, which turns their row numbers into an actual printed column, then read the files back in and merge on this new key. I feel like there is a smarter way than writing the files out and reading them back in. Any suggestions?
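For concreteness, a rough sketch of that round trip (df and res stand in for my real data frame and residuals vector):
# write out with row names; they become a column with a blank header,
# which read.csv() then names "X"
write.csv(df, "df.csv")
write.csv(res, "res.csv")
df.in <- read.csv("df.csv")
res.in <- read.csv("res.csv")
merged <- merge(df.in, res.in, by = "X")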
I hope this isn't a duplicate, as I looked around and couldn't quite find a good explanation on row indices.
Edit:
I can add column or row names to a data frame, but doing so on a one-dimensional object (my output) just gives NULL. That one-dimensional object holds only a subset of the rows, and I can't access their indexes:
rownames(res)
NULL
Instead of first creating a new object (as proposed in the other answer), you can use merge directly.
Just write:
merge(df1, df2, by.x = 0, by.y = 0)
by.x = 0 refers to the row names of df1, and by.y = 0 to the row names of df2; the merge is performed using those as the link. (The name "row.names" works the same way as 0.)
You can create an appropriate data.frame object out of res:
res.df <- data.frame(names=names(res), res=res)
Then use this as one of the inputs to merge.
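A minimal sketch of the full round trip, assuming df is the original data frame and res is the named residuals vector:
# turn the named vector into a data frame, then left-join on df's row names;
# all.x = TRUE keeps rows of df that had no residual (they get NA)
res.df <- data.frame(names = names(res), res = res)
merged <- merge(df, res.df, by.x = "row.names", by.y = "names", all.x = TRUE)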
For the join you can use merge() from base R or join() from the plyr package.
Here is a question covering both:
How to join (merge) data frames (inner, outer, left, right)?
I find join() more intuitive: it follows SQL logic, and it also seems to perform better with large datasets.
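A quick sketch of the plyr route (df1, df2, and the id column are placeholders; type = "left" mimics merge(..., all.x = TRUE)):
library(plyr)
# keeps every row of df1, adding matching columns from df2
joined <- join(df1, df2, by = "id", type = "left")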
I am having a bit of trouble scripting R code that separates a data frame based on the values in one of its columns, without manually specifying a subset command for each value. Below is a reproducible example:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially, I am trying to separate the data frame (df) by the model names ("Model_A", "Model_B", and "Model_C") and store the pieces in new, separate data frames. I have been trying to use the following command:
df_test <- split(df, with(df, interaction(Model, Model)), drop = TRUE)
This command separates the data frame but stores the pieces in a list, and I don't know how to extract the list elements individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible, as I need the script to be dynamic and relative), or does anyone know how to use the command above to turn the list elements into individual data frames? Also, is it possible to name each data frame after its model?
I apologize for asking so many questions, but any help would be hugely appreciated. Thank you!
list2env(split(df, df$Model), envir = .GlobalEnv) will give you three data frames in your global environment, named after the models and containing the relevant rows.
> Model_A
Model Receptor.name X(m.) Y(m.) Z(m.) nox PM10 PM2.5
a Model_A R1 358723 171704 1 36.8185 4.02227 1.38895
b Model_A R2 358723 171704 2.6 36.4473 4.01161 1.37479
c Model_A R3 358723 171704 5 35.6154 3.80926 1.34301
That said, I would just keep the list of three data frames, using only dflist <- split(df, df$Model).
Why a list? Lists let you use lapply, a looping function that applies an operation to every list element. A quick example: say you want a frequency table of both PM variables for all three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now this only saves a few lines of code, but better yet, the output of lapply is again a list, which you can store in an object and reuse in further operations. That way your global environment contains only a few objects, each a list of similar things: data frames, tables, summaries, or even plots.
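A small sketch of that reuse (the object name is just for illustration):
# store all three frequency tables in one object, then pull one out by name
pm10_tabs <- lapply(dflist, function(x) table(x[["PM10"]]))
pm10_tabs$Model_A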
So I have two columns, and I need to add a third. This third column needs to have "A" for the first specified number of rows and "B" for the second specified number of rows.
I tried data_exercise_3["newcolumn"] <- (1:6) but it didn't work. Can someone tell me what I'm doing wrong, please?
Looks like you're having a problem with subsetting a data frame correctly. I'd recommend reviewing this concept before you proceed much further, either via a Coursera course or on a website like this UCLA R learning module on subsetting data frames. Subsetting is a crucial component of data wrangling with R, and you'll go much faster with a solid foundation of the basics!
You can assign values to a subset of a data frame by using [row, column] notation. Since your data frame is called data_exercise_3 and the column you'd like to assign values to is called 'newcolumn', then assuming you want the first 6 rows as 'A' and the next 3 as 'B', you could write it like this:
data_exercise_3[1:6,'newcolumn'] <- 'A'
data_exercise_3[7:9,'newcolumn'] <- 'B'
Alternatively, build the whole column in one call, repeating each label for the required number of rows (here 6 "A"s then 3 "B"s, matching the example above):
data_exercise_3$category <- c(rep("A", 6), rep("B", 3))
I want to merge 2 data frames (data1 and data2). Both initially contain around 35 million observations (around 2GB each).
I removed the duplicates from data2. I need to keep the duplicates in data1, as I wish to use them for further calculations per observation in data1.
I initially get the well documented error:
Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
As a solution (I looked at several topics, such as here, here, and here), I included allow.cartesian=TRUE, but now I run into memory issues. Also, on a subset it works, but it gives me more observations than I want (data1 ends up with 50 million observations, even though I specify all.x=TRUE).
My code is:
# Load data.table first
require(data.table)
# Remove duplicates before the merge
data2 <- unique(data2)
# Merge (left join on ID, keeping every row of data1)
data1 <- merge(data1, data2, by = "ID", all.x = TRUE, allow.cartesian = TRUE)
Any advice on how to merge these is very welcome.
In order to do a left join, the merge statement needs to understand which column you are using as the "key" or "index" for the join. If duplicate column names are used as the key/index, it doesn't know what to do and gives that error. Further, it needs to know what to do when joined columns share names with existing columns.
The solution is to temporarily rename the key/index column in your left (data1) dataset. As a general rule, having duplicate column names is "bad" in R because it confuses a lot of functions; many silently call make.unique() to de-duplicate column names and avoid that confusion.
If you have duplicate ID columns in data1, rename them with colnames(data1) <- make.unique(colnames(data1)), which keeps the first as ID and renames the rest to ID.1, ID.2, etc. Then do your merge (you can keep by.x = "ID", by.y = "ID"). By default, non-key columns that exist in both tables get .x and .y suffixes appended, although you can change these with the suffixes = option (see the merge help file for details).
Lastly, it's worth noting that the merge() method in the data.table package tends to be a lot faster than the base merge() function, with similar syntax. See page 47 of the data.table manual.
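If the extra rows come from duplicate ID values in data2 (unique(data2) only drops rows that are duplicated across all columns, not duplicate keys), here is a sketch of deduplicating on the key alone before a data.table left join; column names follow the question:
library(data.table)
setDT(data1)
setDT(data2)
# keep one row per ID, not just one row per fully duplicated record
data2 <- unique(data2, by = "ID")
# left join: every row of data1, with matching columns from data2 where available
data1 <- merge(data1, data2, by = "ID", all.x = TRUE)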
I am a naive user of R and am attempting to come to terms with the 'apply' family of functions, which I now need to use due to the complexity of my data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by interlaced rows of descriptive (character) data.
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then remove the trailing empty columns, make two new matching lists (one of data and one of characters), use reshape to produce a common column number, and then recombine the sets in each list. A simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set, and I can split the data frame into two sets from the interlacing rows, but I have been stumped by the syntax for writing a function that applies the following to the whole list (though lapply with seq_along should do it?).
Thus, for an individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop, but that would not use the capabilities of R effectively). Once I have done that, I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
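Here x == '' gives a logical matrix, colSums() counts the empty cells per column, and ! keeps only the columns whose count is zero. You can also apply your own all(x == "") test across the list, which drops only columns that are entirely empty:
lapply(myList, function(DF) DF[, !sapply(DF, function(x) all(x == ""))])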
I'm trying to merge two data frames based on a common field called "lookup" that I created. I created the data frames by subsetting the original data frame.
Each of the two newly created data frames has fewer than 10,000 rows. When I try to execute the merge, after a long wait both R and RStudio shut down, with R sometimes producing an error message stating:
Error in make.unique(as.character(rows)) :
promise already under evaluation: recursive default argument reference or earlier problems?
Below is my code. Is there any other way to pull the data over from the other data frame based on the common field, besides using the merge function? Any help is appreciated.
Also, do you have any thoughts as to why it might be shutting down and using up all the memory when the data size is so small?
# keep "lookup" in both subsets so the merge key exists in each
wmtdata <- datastep2[datastep2$Market.Type == "WMT",
                     c("Item", "Brand.Family", "Brand", "Unit.Size", "Pack.Size",
                       "Container", "UPC..int.", "X..Vol", "Unit.Vol", "Distribution",
                       "Market.Type", "Week.Ending", "MLK.Day", "Easter",
                       "Independence.Day", "Labor.Day", "Veterans.Day", "Thanksgiving",
                       "Christmas", "New.Years", "Year", "Month", "Week.Sequence",
                       "Price", "lookup")]
compdata <- datastep2[datastep2$Market.Type == "Rem Mkt",
                      c("Week.Ending", "UPC..int.", "X..Vol", "Unit.Vol", "Price", "lookup")]
colnames(compdata)[colnames(compdata)=="X..Vol"]<-"Comp Vol"
colnames(compdata)[colnames(compdata)=="Unit.Vol"]<-"Comp Unit Vol"
colnames(compdata)[colnames(compdata)=="Price"]<-"Comp Price"
combineddata <-merge(wmtdata, compdata, by="lookup")
Try join from the plyr package:
library(plyr)
combineddata <- join(wmtdata, compdata, by = "lookup")
With only 10,000 rows, the problem is unlikely to be the use of merge(...) instead of something else. Are the elements of the lookup column unique? Otherwise you get a cross-join.
Consider this trivial case:
df.1 <- data.frame(id=rep(1,10), x=rnorm(10))
df.2 <- data.frame(id=rep(1,10), y=rnorm(10))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 100
So two data frames with 10 rows each produce a merge with 100 rows, because the id is not unique.
Now consider:
df.1 <- data.frame(id=rep(1:10, each=40), x=rnorm(400))
df.2 <- data.frame(id=rep(1:10, each=50), y=rnorm(500))
z <- merge(df.1,df.2,by="id")
nrow(z)
# [1] 20000
In this example, df.1 has each id replicated 40 times, and in df.2 each id is replicated 50 times. Merge will produce one row for every pairing of an id in one data frame with the same id in the other, so 50 x 40 = 2,000 rows per id. Since there are 10 ids in this example, you get 20,000 rows. So your merge results can get very big very quickly if the id field (lookup in your case) is not unique.
Instead of using data frames, use the data.table package (see here for an intro). A data.table is like an indexed data frame. It has its own merge method that would probably work in this case.
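A minimal sketch of that route, assuming the wmtdata and compdata frames from the question and that lookup is the only join key:
library(data.table)
wmt <- as.data.table(wmtdata)
comp <- as.data.table(compdata)
# index both tables on the join key
setkey(wmt, lookup)
setkey(comp, lookup)
combineddata <- merge(wmt, comp, by = "lookup")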
Thank you all for the great help. Data tables are the way to go for me, as I think this was a memory issue ("lookup" values were duplicated across the data frames). While 8 GB of memory (~6 GB free) should be plenty, it was all used up during this process. Nevertheless, data tables worked just fine. I'm learning a lot from these boards.