Again, I apologize if this is a trivial question. I'm still new to R, but I am determined to learn. I have one original .csv file (let's call it Original.Data) that I am working with, and I am basically reorganizing that spreadsheet: I take each row, transpose it into columns, and then stack the transposed rows underneath each other using bind_rows.
I have created the code necessary to organize the first row how I wanted. Now I'm trying to create a loop to do this for all rows in the Original. Here is the code I created:
Rows 1:3 are basically headers and are eventually deleted, but not yet, because they serve an initial purpose. Only row 5 is necessary for the final spreadsheet.
New.Data <-data.frame(t(Original.Data[c(1:3,5),]))
colnames(New.Data) <-c("patid","Day","Week","SD")
New.Data[,1] <- New.Data[1,4]
New.Data$Day <- as.numeric(as.character(New.Data$Day))
New.Data$SD <- as.numeric(as.character(New.Data$SD))
New.Data$Week <- as.numeric(as.character(New.Data$Week))
New.Data <- New.Data[ which(New.Data$Day<=New.Data[3,4]),]
This last line appends the result to a base spreadsheet where I am collecting all of the data. "Base" was already created beforehand.
Final.Base <-bind_rows(Base,New.Data)
My idea is to name New.Data something like New.Data1 and have the loop create one New.Data for each row, then bind them all into Final.Base. So row 1 (or row 5, actually) becomes New.Data1, row 2 becomes New.Data2, etc.
Also, in my first line of code:
New.Data <-data.frame(t(Original.Data[c(1:3,5),]))
The only thing I believe that would need to change is the last number "5". It would need to increase by one in every iteration of the loop.
Thank you for all of your help.
I'm not totally sure what you want to do. A little bit of data would be good. But here is something that sounds like what you are asking. It takes a column and tacks it on to the next column and so forth to give you a vector of the stacked values:
# mtcars is a built-in example data set
out <- c()
for (i in seq_len(ncol(mtcars))) {
  intermediate <- mtcars[, i]      # pull out column i
  out <- c(out, intermediate)      # append it to the stacked vector
}
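The asker's own steps can also be wrapped in a function that takes the row number, so that only the "5" changes from one call to the next. The following is only a sketch: the toy Original.Data, its layout (patient id in the first column, cutoff day in the third), and the helper name reshape_row are assumptions made up for illustration, not taken from the real file.

```r
# Toy stand-in for Original.Data (hypothetical layout and values):
# row 1 = placeholder header, row 2 = Day, row 3 = Week, row 4 = unused,
# rows 5+ = one patient each (V1 = patient id, V3 = last day to keep).
Original.Data <- data.frame(
  V1 = c(NA, NA, NA, NA, 101, 102),
  V2 = c(NA, NA, NA, NA, 0, 0),
  V3 = c(NA, NA, NA, NA, 2, 3),
  V4 = c(0, 1, 1, NA, 5.1, 6.1),
  V5 = c(0, 2, 1, NA, 5.2, 6.2),
  V6 = c(0, 3, 1, NA, 5.3, 6.3)
)

# Hypothetical helper: the steps from the question, with the row number
# passed in as i instead of being hard-coded as 5.
reshape_row <- function(i) {
  New.Data <- data.frame(t(Original.Data[c(1:3, i), ]))
  colnames(New.Data) <- c("patid", "Day", "Week", "SD")
  New.Data[, 1] <- New.Data[1, 4]
  New.Data$Day <- as.numeric(as.character(New.Data$Day))
  New.Data$SD <- as.numeric(as.character(New.Data$SD))
  New.Data$Week <- as.numeric(as.character(New.Data$Week))
  New.Data[which(New.Data$Day <= New.Data[3, 4]), ]
}

# One reshaped piece per data row, collected in a list and bound once.
pieces <- lapply(5:nrow(Original.Data), reshape_row)
Final.Base <- do.call(rbind, pieces)
Final.Base
```

There is no need for separately named objects such as New.Data1 and New.Data2; the list holds them. Since the question already uses dplyr, bind_rows(pieces) should work in place of do.call(rbind, pieces).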
I'm in a very basic class that introduces R for genetic purposes. I'm encountering a rather peculiar problem in trying to follow the instructions given. Here is what I have along with the instructor's notes:
MangrovesRaw<-read.csv("C:/Users/esteb/Documents/PopGen/MangrovesSites.csv")
#i'm going to make a new dataframe now, with one column more than the mangrovesraw dataframe but the same number of rows.
View(MangrovesRaw)
Mangroves<-data.frame(matrix(nrow = 528, ncol = 23))
#next I want you to name the first column of Mangroves "pop"
colnames(Mangroves)<-c(col1="pop")
#i'm now assigning all values of that column to be 1
Mangroves$pop<-1
#assign the rest of the columns (2 to 23) to the entirety of the MangrovesRaw dataframe
#then change the names to match the mangroves raw names
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
I'm not really sure how to assign columns that haven't been named using the $ as we have in the past. A friend suggested I first run
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw
#X338 is the name of the first column from MangrovesRaw
But while this does transfer the data from MangrovesRaw, it comes at the cost of having my column names messed up with X338. added to every subsequent column. In an attempt to modify this I found the following "fix"
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw[,2]
#Mangroves$X338<-MangrovesRaw[,2:22]
#MangrovesRaw has 22 columns in total
While this transferred all the data I needed for the X338 column, it didn't transfer any data for the remaining 21 columns. The commented-out line just results in the same problem of having X338. show up in all my column names.
What am I doing wrong?
There are a few ways to solve this problem. It may be that your instructor wants it done a certain way, but here's one simple solution: just cbind() the Mangroves$pop column with the real data. Then the data and column names are already added.
Mangroves <- cbind(pop = Mangroves$pop, MangrovesRaw)  # naming the argument keeps the column called "pop"
Here's another way:
Mangroves[, 2:23] <- MangrovesRaw
colnames(Mangroves)[2:23] <- colnames(MangrovesRaw)
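To see that both approaches give the same result, here is a small sketch on made-up data. The toy data frame raw (3 rows, 2 columns) stands in for MangrovesRaw and its values are invented; only the column name X338 comes from the question.

```r
# Hypothetical stand-in for MangrovesRaw: 3 rows, 2 columns.
raw <- data.frame(X338 = 1:3, X339 = 4:6)

# Approach 1: bind a constant "pop" column onto the raw data.
by_cbind <- cbind(pop = 1, raw)

# Approach 2: create the empty frame first, then fill columns 2 onward.
by_fill <- data.frame(matrix(nrow = nrow(raw), ncol = ncol(raw) + 1))
colnames(by_fill)[1] <- "pop"
by_fill$pop <- 1
by_fill[, 2:3] <- raw                     # assign all remaining columns at once
colnames(by_fill)[2:3] <- colnames(raw)   # then copy the names over

by_cbind
by_fill
```

The key difference from the attempts in the question is that the whole block of columns is assigned with [, 2:23] rather than with $, which can only address a single column.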
I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and apply over all dataframes.
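cleanDF <- function(mydf) {
  # stop unless all three required columns are present
  needed <- c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name')
  if (!all(needed %in% names(mydf))) stop("Check data frame names")
  # keep rows where the tie value is 4 or more
  condition <- mydf[, 'AlterPair_B'] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}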
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
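The custom function cleanDF first checks that all the relevant column names are present. Then it defines the condition that 'AlterPair_B' is 4 or more. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' that holds all of the data frames; 'all_my_files' stands for a vector of your file paths. The do.call('rbind', ...) step stacks the cleaned data frames into one; use lapply(big_list, cleanDF) on its own if you want to keep them as a list.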
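Here is a small self-contained run of the same idea on made-up data, since the question asks for a list of cleaned data frames. The two toy edgelists and their names are invented for illustration; the column names are the ones from the question. This variant of cleanDF wraps the condition in which() so that rows with a missing tie value are dropped rather than returned as NA rows, which is an assumption about what is wanted.

```r
# Variant of cleanDF that also drops rows with a missing tie value.
cleanDF <- function(mydf) {
  needed <- c("AlterPair_B", "Alter.1.Name", "Alter.2.Name")
  if (!all(needed %in% names(mydf))) stop("Check data frame names")
  keep <- which(mydf[, "AlterPair_B"] >= 4)   # which() ignores NA
  mydf[keep, c("Alter.1.Name", "Alter.2.Name")]
}

# Two hypothetical respondents with different numbers of rows.
big_list <- list(
  r100291 = data.frame(Alter.1.Name = c("Ann", "Ann", "Bob"),
                       Alter.2.Name = c("Bob", "Cy", "Cy"),
                       AlterPair_B  = c(4, 2, 4),
                       stringsAsFactors = FALSE),
  r100292 = data.frame(Alter.1.Name = c("Dee", "Dee"),
                       Alter.2.Name = c("Eli", "Fay"),
                       AlterPair_B  = c(NA, 4),
                       stringsAsFactors = FALSE)
)

cleaned <- lapply(big_list, cleanDF)   # a list, one cleaned frame per respondent
cleaned
```

Because lapply keeps the names of its input, cleaned$r100291 and cleaned$r100292 can still be looked up by respondent.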
You haven't provided a reproducible example so it's hard to solve your problem. However, I don't want your questions to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <- "C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filename in list_of_csv_files) {
  input <- read.csv(paste0(path, filename), header = TRUE, stringsAsFactors = FALSE)
  # Do your calculations
  input_with_calculations <- input
  result <- rbind(result, input_with_calculations)
}
result
So I have two columns. I need to add a third column. However this third column needs to have A for the first amount of rows, and B for the second specified amount of rows.
I tried adding this data_exercise_3 ["newcolumn"] <- (1:6)
but it didn't work. Can someone tell me what I'm doing wrong please?
Looks like you're having a problem with subsetting a data frame correctly. I'd recommend reviewing this concept before you proceed much further, either via a Coursera course or on a website like this UCLA R learning module on subsetting data frames. Subsetting is a crucial component of data wrangling with R, and you'll go much faster with a solid foundation of the basics!
You can assign values to a subset of a data frame by using [row, column] notation. Since your data frame is called data_exercise_3 and the column you'd like to assign values to is called 'newcolumn', then assuming you want the first 6 rows as 'A' and the next 3 as 'B', you could write it like this:
data_exercise_3[1:6,'newcolumn'] <- 'A'
data_exercise_3[7:9,'newcolumn'] <- 'B'
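Alternatively, build the whole column in one step with rep(); the vector must have exactly as many elements as the data frame has rows (12 in this example):
data_exercise_3$category <- c(rep("A",6),rep("B",6))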
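Both styles can be tried on a small made-up data frame. The two existing columns and the 9-row size below are assumptions for illustration; the original data was not posted.

```r
# Hypothetical two-column data frame with 9 rows.
data_exercise_3 <- data.frame(x = 1:9, y = letters[1:9],
                              stringsAsFactors = FALSE)

# Style 1: fill the new column block by block with [row, column].
data_exercise_3[1:6, "newcolumn"] <- "A"
data_exercise_3[7:9, "newcolumn"] <- "B"

# Style 2: build the full-length vector first, then assign it once.
data_exercise_3$category <- rep(c("A", "B"), times = c(6, 3))

data_exercise_3
```

The attempt in the question, data_exercise_3["newcolumn"] <- (1:6), assigns the numbers 1 to 6 rather than a label, and with no row index it targets every row of the column, so it cannot produce "A" for only the first block of rows.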
I'm still relatively new to R, so pardon if what I'm saying lacks proper and explicit terminology.
I have a data set created from a UDP stream set up to collect data from one source and time from another source. The time is added asynchronously, meaning that one column contains information (Start, Stop) but the rest of such rows are blank. The link below may help explain what exactly this looks like in a data set.
https://googledrive.com/host/0Bw82Tt1bj-QRUkxwdGY2Qm5UOVk
I want to read where in column "MarkersOri" "Start" is located, and print "Start" in a new column "MarkerNew" exactly one row down. (Subsequently, I plan to delete these mostly blank rows and the "MarkersOri" column)
I have tried to implement if statements and the findInterval command, but I couldn't find anything on doing exactly what I want to do.
EDIT: Solved. Appreciate the help.
You want a new column with just 'Start' inside, but shifted one row down?
Maybe this might help:
# some data
df <- data.frame(col1 = LETTERS[1:10], stringsAsFactors=FALSE)
df$col1[5] <- 'Start'
# create empty column
df$newcol <- NA
# shift 'Start' by one row down, store it in the new column
df$newcol[which(df$col1 == 'Start') + 1] <- 'Start'
df
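A variant of the same idea builds a shifted copy of the column and compares against it, which also copes with a 'Start' in the very last row (where there is no next row to write to). The column names MarkersOri and MarkerNew come from the question; the toy values are made up.

```r
# Hypothetical data: markers arrive on otherwise blank rows.
df <- data.frame(MarkersOri = c("", "Start", "", "", "Stop", "Start"),
                 value = c(10, NA, 11, 12, NA, NA),
                 stringsAsFactors = FALSE)

# Shift the marker column down by one row: row k now holds row k-1's marker.
prev <- c(NA, head(df$MarkersOri, -1))

# Write "Start" wherever the previous row was a Start marker.
df$MarkerNew <- ifelse(!is.na(prev) & prev == "Start", "Start", NA_character_)

df
```

After this, the mostly blank marker rows can be dropped, for example with df[df$MarkersOri == "", ], and the MarkersOri column removed.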
I am trying to restructure an enormous dataframe (about 12,000 cases). In the old dataframe one person is one row with about 250 columns (e.g. Person 1, test A1, test A2, test B, ...). I want all the results of test A (there are 1 to 10 A's overall, and 24 items, A-Y) for that person in one column, so one person ends up with 24 columns and 10 rows. There is also a fixed part of the dataframe before the items A-Y start (personal information like age, gender etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still calculating, for nearly 24 hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  # dummy first row so that rbind has something to append to
  out <- data.frame(t(rep(0, (firstcol-1) + numcol)))
  names(out) <- names(data[0:(firstcol+numcol-1)])
  for (i in 1:nrow(data)) {
    fixdata <- data[i, 1:(firstcol-1)]
    for (j in seq(firstcol, (firstcol-1) + numcol*numsets, by = numcol)) {
      flexdata <- data[i, j:(j+numcol-1)]
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(data[0:(firstcol+numcol-1)])
      out <- rbind(out, tmp)
    }
  }
  out <- out[2:nrow(out), ]   # drop the dummy first row
  return(out)
}
Thanks in advance!
Idea why: you rbind to out in each iteration. This will take longer each iteration as out grows - so you have to expect more than linear growth in run time with increasing data sets.
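To make that concrete, here is one possible sketch in base R that avoids growing out row by row: it handles each set of columns for all persons at once, collects the blocks in a list, and calls rbind a single time. The function name restructure_blocks and the toy data are assumptions for illustration; the arguments mirror those of restructure in the question.

```r
# Hypothetical rewrite: one block per set (all persons at once), bound once.
restructure_blocks <- function(data, firstcol, numcol, numsets) {
  starts <- seq(firstcol, by = numcol, length.out = numsets)
  out_names <- names(data)[1:(firstcol + numcol - 1)]
  pieces <- vector("list", numsets)          # preallocated list of blocks
  for (k in seq_along(starts)) {
    j <- starts[k]
    block <- cbind(data[, 1:(firstcol - 1), drop = FALSE],
                   data[, j:(j + numcol - 1), drop = FALSE])
    names(block) <- out_names
    pieces[[k]] <- block
  }
  out <- do.call(rbind, pieces)              # a single rbind at the end
  # reorder so that each person's rows sit together, sets in order
  out[order(rep(seq_len(nrow(data)), times = numsets)), ]
}

# Toy data: 2 fixed columns, then 2 sets of 2 items each.
toy <- data.frame(id = 1:3, age = c(30, 40, 50),
                  a_1 = c(1, 2, 3), b_1 = c(4, 5, 6),
                  a_2 = c(7, 8, 9), b_2 = c(10, 11, 12))
long <- restructure_blocks(toy, firstcol = 3, numcol = 2, numsets = 2)
long
```

With this shape the loop runs numsets times instead of nrow(data) * numsets times, and no intermediate copy of a growing out is made.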
So, as Andrie suggests, you can look at melt.
Or you can do it with core R: stack.
Then you need to cbind the fixed part to the result yourself (you need to repeat the fixed columns with each = n.var.cols).
A third alternative would be array2df from package arrayhelpers.
I agree with the others, look into reshape2 and the plyr package, just want to add a little in another direction. Particularly melt, cast,dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As<-grep("^testA",names(yourdf))
# returns a vector with the column position of all testA1 through 10s.
Besides, if you 'spent' the two dimensions of a data.frame on test# and test type, there's obviously none left for the person. Sure, you identify them by an ID, that you could add an aesthetic to when plotting, but depending on what you want to do you might want to store them in a list. So you end up with a list of persons with a data.frame for every person. I am not sure what you are trying to do, but still hope this helps though.
Maybe you're not getting the plyr or other functions for reshaping the data component. How about something more direct and low level. If you currently just have one line that goes A1, A2, A3... A10, B1-B10, etc. then extract that lump of stuff from your data frame, I'm guessing columns 11-250, and then just make that section the shape you want and put them back together.
yDat <- data[, 11:250]
# unlist() turns the one-row data frame into a plain vector before reshaping
yDF <- lapply( 1:nrow(data), function(i) matrix(unlist(yDat[i,]), ncol = 24) )
yDF <- do.call(rbind, yDF) #combine the list of matrices returned above into one
yDF <- data.frame(yDF) #get it back into a data.frame
names(yDF) <- LETTERS[1:24] #might as well name the columns
That's the fastest way to get the bulk of your data in the shape you want. All the lapply function did was add dimension attributes to each row so that they were in the shape you wanted and then return them as a list, which was massaged with the subsequent rows. But now it doesn't have any of your ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times. Or you can use the convenience function merge to help with that. Make a common column that is already in your first 10 rows one of the columns of the new data.frame and then just merge them.
yInfo <- data[, 1:10]
ID <- yInfo$ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
And now you're done... mostly, you might want to make an extra column that names the new rows
newDat$condNum <- rep(1:10, nrow(newDat)/10)
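The whole sequence can be checked on a tiny made-up data set. In the sketch below there are 2 persons, 2 info columns, and 3 tests with 2 repetitions each (instead of 10 info columns and 24 tests with 10 repetitions), all invented for illustration. It assumes the column order A1, A2, B1, B2, ... described above, and it adds condNum before merging so that the labels do not depend on the row order merge returns.

```r
# Hypothetical wide data: ID and age, then A1, A2, B1, B2, C1, C2.
data <- data.frame(ID = c("p1", "p2"), age = c(30, 40),
                   A1 = c(1, 2), A2 = c(3, 4),
                   B1 = c(5, 6), B2 = c(7, 8),
                   C1 = c(9, 10), C2 = c(11, 12),
                   stringsAsFactors = FALSE)
n_rep  <- 2   # repetitions per test
n_test <- 3   # number of tests (A, B, C)

yDat <- data[, 3:8]
# One n_rep x n_test matrix per person; columns fill as A, B, C.
yList <- lapply(seq_len(nrow(data)),
                function(i) matrix(unlist(yDat[i, ]), ncol = n_test))
yDF <- data.frame(do.call(rbind, yList))
names(yDF) <- LETTERS[1:n_test]

# Attach the key and the repetition number, then merge the info back in.
yDF$ID <- rep(data$ID, each = n_rep)
yDF$condNum <- rep(seq_len(n_rep), times = nrow(data))
newDat <- merge(data[, 1:2], yDF, by = "ID")
newDat
```

Each person ends up with n_rep rows and one column per test, with the info columns repeated alongside.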
This will be very fast running code. Your data.frame really isn't that big at all and much of the above will execute in a couple of seconds.
This is how you should be thinking of data in R. Not that there aren't convenience functions to handle the bulk of this, but you should be doing things in ways that avoid looping as much as possible. Technically, what happened above only had one loop, the lapply used right at the start. It had very little in that loop as well (loops should be compact when you use them). You're writing scalar code, and that is very, very slow in R, even if you weren't also abusing memory by growing data while doing it. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which are one of your biggest problems.
(read this to better understand your problems in this code... you've made most of the big errors in there)