Reordering (deleting/changing order) columns of data in data frame - r

I have two large data sets and I am attempting to reformat the older data set to put the questions in the same order as the newer data set (so that I can easily perform t-tests on each identical question to track significant changes over the 2 years between data sets). The new version both deleted and added questions when changing from the old version.
The way I've been attempting to do this, R keeps crashing, as best I can figure because the vectors become too large. I'm not sure how they are getting that large, however! Below is what I am doing:
Both data sets have the same format. The originals have 415 columns for the new set and 418 for the old. I want to match the first approximately 158 columns of the new data set to the old. Each data set has column names q1-q415 (or q1-q418 for the old) and the data in each column is numerical 1-5 or NA. There are approximately 100 answers per question/column; the old data set has more respondents (140 rows in the old vs 114 rows in the new). An example is below (but keep in mind there are over 400 columns and over 100 rows in the full sets!)
The following is an example of what data.old looks like. data.new looks the same, except that data.new has fewer rows of number/NA answers. Here I show questions 1 through 20 and the first 10 rows.
data.old = 418 columns (q1 through q418) x 140 rows
data.new = 415 columns (q1 through q415) x 114 rows
I need to match the first 170 COLUMNS of data.old to the first 157 COLUMNS of data.new
To do this, I will be deleting 17 columns from data.old (questions that were in the data.old questionnaire and deleted from the data.new questionnaire) but also adding 7 new columns to data.old (which will contain NAs... placeholders for where data.new had new questions introduced that did not exist in the data.old questionnaire)
>data.old
q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14 q15 q16 q17 q18 q19 q20
1 3 4 3 3 5 4 1 NA 4 NA 1 2 NA 5 4 3 2 3 1
3 4 5 2 2 4 NA 1 3 2 5 2 NA 3 2 1 4 3 2 NA
2 NA 2 3 2 1 4 3 5 1 2 3 4 3 NA NA 2 1 2 5
1 2 4 1 2 5 2 3 2 1 3 NA NA 2 1 5 5 NA 2 3
4 3 NA 2 1 NA 3 4 2 2 1 4 5 5 NA 3 2 3 4 1
5 2 1 5 3 2 3 3 NA 2 1 5 4 3 4 5 3 NA 2 NA
NA 2 4 1 5 5 NA NA 2 NA 1 3 3 3 4 4 5 5 3 1
4 5 4 5 5 4 3 4 3 2 5 NA 2 NA 2 3 5 4 5 4
2 2 3 4 1 5 5 3 NA 2 1 3 5 4 NA 2 3 4 3 2
2 1 5 3 NA 2 3 NA 4 5 5 3 2 NA 2 3 1 3 2 4
So in the new set, some of the questions were deleted, some new ones were added, and some changed order. I went through and created subsets of the old data in the order I would need to combine them again to match the new data set. When a question does not exist in the old data set, I want to use the question from the new data set so that I can (theoretically) perform my t-tests in a big loop.
dataold.set1 <- dataold[1:16]
dataold.set2 <- dataold[18:19]
dataold.set3 <- dataold[21:23]
dataold.set4 <- dataold[25:26]
dataold.set5 <- dataold[30:33]
dataold.set6 <- dataold[35:36]
dataold.set7 <- dataold[38:39]
dataold.set8 <- dataold[41:42]
dataold.set9 <- dataold[44]
dataold.set10 <- dataold[46:47]
dataold.set11 <- dataold[49:54]
dataold.set12 <- datanew[43:49]
dataold.set13 <- dataold[62:85]
dataold.set14 <- dataold[87:90]
dataold.set15 <- datanew[78]
dataold.set16 <- dataold[91:142]
dataold.set17 <- dataold[149:161]
dataold.set18 <- dataold[55:61]
dataold.set19 <- dataold[163:170]
I then was attempting to put the columns back together into one set
I tried both
dataold.adjust <- merge(dataold.set1, dataold.set2)
dataold.adjust <- merge(dataold.adjust, dataold.set3)
dataold.adjust <- merge(dataold.adjust, dataold.set4)
and I also tried
dataold.adjust <- cbind(dataold.set1, dataold.set2, dataold.set3)
However, every time I try to perform these functions, R freezes, then crashes. I managed to get it to display an error once, and it said it could not work with a vector of 10 Mb, and then I got multiple errors involving over 1000 Mb vectors. I'm not really sure how my vectors are that large, when this is crashing out by set 3, which is only 23 columns of data in a table, and the data sets I'm normally using are over 400 columns in length.
Is there another way to do this that won't cause my program to crash and have memory issues (and won't require me typing out the column names of over 100 columns), or is there some element of code here that I am missing where I'm getting a memory sink? I've been attempting to troubleshoot it and have spent an hour dealing with R crashing without any luck figuring out how to make this work.
Thanks for the assistance!

You're making tons of unnecessary copies of your data and then you're growing the final object (dataold.adjust). The huge vectors come from merge(): your subsets share no columns to merge on, so each merge() call returns the Cartesian product of its two arguments and the row count explodes with every step. You just need a vector that orders the columns correctly:
cols1 <- c(1:16,18:19,21:23,25:26,30:33,35:36,38:39,41:42,44,46:47,49:54)
cols2 <- c(62:85,87:90)
cols3 <- c(91:142,149:161,55:61,163:170)
# merge old / new data by row and add NA for unmatched rows
dataold.adjust <- merge(data.old[,c(cols1,cols2,cols3)],
data.new[,c(43:49,78)], by="row.names", all=TRUE)
# put columns in desired order
n.old <- length(c(cols1, cols2, cols3))   # 150 columns kept from data.old
dataold.adjust <- dataold.adjust[, c(
  1 + seq_along(cols1),                                 # sets 1-11: cols1 from data.old
  1 + n.old + 1:7,                                      # set 12: data.new q43-q49
  1 + length(cols1) + seq_along(cols2),                 # sets 13-14: cols2 from data.old
  1 + n.old + 8,                                        # set 15: data.new q78
  1 + length(cols1) + length(cols2) + seq_along(cols3)  # sets 16-19: cols3 from data.old
)]   # the "1 +" skips the Row.names column that merge() adds
The last part is an absolute kludge, but I've hit my self-imposed time limit for SO today. :)
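Since the old and new surveys have different respondents, the rows never really correspond anyway, so another option is to follow the placeholder idea from the question and skip merge() entirely. This is only a rough sketch of my own (na.block is a hypothetical helper, not an existing function), reusing the cols1/cols2/cols3 vectors above:
# hypothetical helper: k columns of NA with n rows, named prefix1..prefixk
na.block <- function(n, k, prefix) {
  setNames(as.data.frame(matrix(NA_integer_, nrow = n, ncol = k)),
           paste0(prefix, seq_len(k)))
}
dataold.adjust <- cbind(
  data.old[, cols1],                       # sets 1-11 from data.old
  na.block(nrow(data.old), 7, "new.q"),    # set 12: NA placeholders for data.new q43-q49
  data.old[, cols2],                       # sets 13-14 from data.old
  na.block(nrow(data.old), 1, "new.q78_"), # set 15: NA placeholder for data.new q78
  data.old[, cols3]                        # sets 16-19 from data.old
)
cbind() on data frames with equal row counts makes a single copy, so it avoids both the repeated copies and the exploding merges.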

Related

Merge multiple data frames with partially matching rows

I have data frames with lists of elements such as names. The data frames contain different names, but most of them match. I'd like to combine all of them into one list in which I can see whether some names are missing from any of the data frames.
DATA sample for df1:
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 RH_Type-Function-S
6 6 RH_REFERENT-S
and for df2
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 UCZESTNIK
6 6 COACH
and expected result would be:
NAME. df1 df2
1 COACH NA 6
2 lh_Structure/Focus_C 4 4
3 lh_Structure/Focus_S 3 3
4 RH_REFERENT-S 6 NA
5 rh_Structure/Focus_C 2 2
6 rh_Structure/Focus_S 1 1
7 RH_Type-Function-S 5 NA
8 UCZESTNIK NA 5
I can do that with merge.data.frame(df1,df2,by = "x", all=T),
but then I can't do the same with more data frames of a similar structure. Any help would be appreciated.
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
# df1 df2
# COACH 0 1
# lh_Structure/Focus_C 1 1
# lh_Structure/Focus_S 1 1
# RH_REFERENT-S 1 0
# rh_Structure/Focus_C 1 1
# rh_Structure/Focus_S 1 1
# RH_Type-Function-S 1 0
# UCZESTNIK 0 1
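If you do want the wide layout from your expected result (one row per name with the position from each data frame), a rough sketch using Reduce() to merge any number of data frames by the name column (assuming each has the same X/x structure as df1 and df2):
dfs <- list(df1 = df1, df2 = df2)  # add further data frames to this list
# rename each position column (X) after its source so the merged columns stay distinct
dfs <- Map(function(d, nm) setNames(d[c("x", "X")], c("x", nm)), dfs, names(dfs))
merged <- Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), dfs)
merged  # one row per name; NA where that name is missing from a data frame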

creating adjacency network matrix (or list) from large csv dataset using igraph

I am trying to do network analysis in igraph but am having some issues with transforming the dataset I have into an edge list (with weights), given the differing number of columns.
The data set looks as follows (it is much larger, of course). The first column is the main operator ID (a main operator can also be a partner and vice versa, so the IDs stay the same in the adjacency). The challenge is that the number of partners varies (from 0 to 40).
IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4 .....
1 4 3 2 NA
2 3 1 NA NA
3 1 4 7 6
4 9 6 3 NA
.
.
My question is how to transform this into an undirected edge list with weights (just expressing interaction):
Id1 Id2 weight
1 2 2
1 3 2
1 4 1
2 3 1
3 4 2
. .
Does anyone have a tip on the best way to go about this? Many thanks in advance!
This is a classic reshaping task. You can use the reshape2 package for this.
text <- "IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 2 NA
2 3 NA NA NA
3 1 4 7 6
4 9 NA NA NA"
data <- read.delim(text = text, sep = "")
library(reshape2)
data_melt <- reshape2::melt(data, id.vars = "IdMain")
edgelist <- data_melt[!is.na(data_melt$value), c("IdMain", "value")]
head(edgelist, 4)
# IdMain value
# 1 1 4
# 2 2 3
# 3 3 1
# 4 4 9
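To get from there to the weighted, undirected edge list you asked for, one possible continuation (a sketch of my own, not part of the original answer) is to hand the melted pairs to igraph and let simplify() collapse duplicate and reciprocal pairs into a weight:
library(igraph)
# build an undirected graph from the (IdMain, value) pairs
g <- graph_from_data_frame(edgelist, directed = FALSE)
E(g)$weight <- 1                                         # each co-occurrence counts once
g <- simplify(g, edge.attr.comb = list(weight = "sum"))  # merge parallel edges, summing weights
igraph::as_data_frame(g, what = "edges")                 # Id1, Id2, weight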

Getting stale values on using ifelse in a dataframe

Hi, I am aggregating values from two columns and creating a final third column based on priorities: if values in column 1 are missing or NA, then I take column 2.
df=data.frame(internal=c(1,5,"",6,"NA"),external=c("",6,8,9,10))
df
  internal external
1        1
2        5        6
3                 8
4        6        9
5       NA       10
df$final <- df$internal
df$final <- ifelse((df$final=="" | df$final=="NA"),df$external,df$final)
df
  internal external final
1        1              2
2        5        6     3
3                 8     4
4        6        9     4
5       NA       10     2
How can the final value be 4 and 2 for rows 3 and 5 when the external values are 8 and 10? I don't know what's wrong, but these values don't make any sense to me.
The issue arises because data.frame() converts your character columns to factors, and when ifelse() picks a value from a factor it returns the underlying integer code rather than the label; that is where the 4 and 2 come from.
Your code will work fine with
df=data.frame(internal=c(1,5,"",6,"NA"),external=c("",6,8,9,10),stringsAsFactors = FALSE)
PS: this hideous conversion to factors should definitely belong to the R Inferno, http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
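If you cannot control how the data frame was created, a factor-proof sketch of the same logic (my addition, not from the original answer) is to compare on character values instead of on the factor columns:
# coerce to character so comparisons and results use the labels, not the factor codes
int <- as.character(df$internal)
ext <- as.character(df$external)
df$final <- ifelse(int == "" | int == "NA", ext, int)
df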

[R]: applying a function to columns based on conditional row position

I am attempting to find the number of observations by column in a data frame that meet a certain condition after the max for that column has been encountered.
Here is a highly simplified example:
fake.dat<-data.frame(samp1=c(5,6,7,5,4,5,10,5,6,7), samp2=c(2,3,4,6,7,9,2,3,7,8), samp3=c(2,3,4,11,7,9,2,3,7,8),samp4=c(5,6,7,5,4,12,10,5,6,7))
samp1 samp2 samp3 samp4
1 5 2 2 5
2 6 3 3 6
3 7 4 4 7
4 5 6 11 5
5 4 7 7 4
6 5 9 9 12
7 10 2 2 10
8 5 3 3 5
9 6 7 7 6
10 7 8 8 7
So, let's say I'm trying to find the number of observations per column that are greater than 5 after excluding all the observations in a column up to and including the row where the maximum for the column occurs.
Expected outcome:
samp1 samp2 samp3 samp4
2 2 4 3
I am able to get the answer I want by using nested for loops to exclude the observations I don't want.
newfake.dat<-data.frame()
for(j in 1:length(fake.dat)){
for(i in 1:nrow(fake.dat)){
ifelse(i>max.row[j],newfake.dat[i,j]<-fake.dat[i,j],"NA")
print(newfake.dat)
}}
This creates a new data frame on which I can run an easy apply function.
colcount<-apply(newfake.dat,2,function(x) (sum(x>5,na.rm=TRUE)))
V1 V2 V3 V4
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA 7 NA
6 NA NA 9 NA
7 NA 2 2 10
8 5 3 3 5
9 6 7 7 6
10 7 8 8 7
V1 V2 V3 V4
2 2 4 3
Which is all well and good for this tiny example dataset, but it is prohibitively slow on anything approaching the size of my real datasets, which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but the same number of rows) and it ran for at least 5 hours (I left it going when I left work for the day). Also, I don't really need the new data frame for anything other than to be able to run the apply function.
Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.
maxrow<-apply(fake.dat,2,function(x) which.max(x))
print(maxrow)
seq.att<-apply(fake.dat,2,function(x) {
sum(x[which(seq(1,nrow(fake.dat))==(maxrow)):nrow(fake.dat)]>5,na.rm=TRUE)})
Which kicks up four instances of this warning message:
1: In seq(1, nrow(fake.dat)) == (maxrow) :
longer object length is not a multiple of shorter object length
If I ignore the warning message and get the output anyway it doesn't give me the answer I expected:
samp1 samp2 samp3 samp4
2 3 3 3
I also tried using a while function which kept cycling so I stopped it (I misplaced the code I tried for this).
So far the most promising result has come from the nested for loops, but I know it's terribly inefficient and I'm hoping that there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!
Here is a way in dplyr to replicate the same process that you showed with base R
library(dplyr)
fake.dat %>%
summarise_each(funs(sum(.[(which.max(.)+1):n()]>5,
na.rm=TRUE)))
# samp1 samp2 samp3 samp4
#1 2 2 4 3
If you need it as two steps:
datNA <- fake.dat %>%
mutate_each(funs(replace(., seq_len(which.max(.)), NA)))
datNA %>%
summarise_each(funs(sum(.>5, na.rm=TRUE)))
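For reference, summarise_each() and funs() have since been superseded; a sketch of the same computation with current dplyr, using across() (my addition):
library(dplyr)
fake.dat %>%
  summarise(across(everything(),
                   ~ sum(.x[(which.max(.x) + 1):length(.x)] > 5, na.rm = TRUE)))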
Here's one approach using data.table:
library(data.table)
##
data <- data.frame(
samp1=c(5,6,7,5,4,5,10,5,6,7),
samp2=c(2,3,4,6,7,9,2,3,7,8),
samp3=c(2,3,4,11,7,9,2,3,7,8),
samp4=c(5,6,7,5,4,12,10,5,6,7))
##
Dt <- data.table(data)
##
R> Dt[,lapply(.SD,function(x){
y <- x[(which.max(x)+1):.N]
length(y[y>5])
})]
samp1 samp2 samp3 samp4
1: 2 2 4 3
A single-liner in base R:
vapply(fake.dat,function(x) sum(x[(which.max(x)+1):length(x)]>5),1L)
#samp1 samp2 samp3 samp4
# 2 2 4 3
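One caveat that applies to all of the answers above (my note, not from the original answers): if a column's maximum falls in the last row, (which.max(x)+1):length(x) counts backwards and quietly includes rows before the maximum. A slightly more defensive sketch of the base R version:
# count values above a threshold strictly after the (first) maximum
count_after_max <- function(x, threshold = 5) {
  after <- seq_along(x) > which.max(x)   # logical mask; all FALSE when the max is last
  sum(x[after] > threshold, na.rm = TRUE)
}
vapply(fake.dat, count_after_max, integer(1))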

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value defined by the values of ID. For the example, it returns a vector of five 2s followed by four 9s. Since all values except the first in each subset should be NA, the replacement function "is.na<-" sets values to NA at the positions given by the logical index c(FALSE, !diff(ID)). This index is TRUE wherever the ID is identical to the preceding one.
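To make that concrete, the two intermediate pieces for the example data are:
ave(A$Value, A$ID, FUN = min)
# [1] 2 2 2 2 2 9 9 9 9
c(FALSE, !diff(A$ID))
# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE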
You're almost there. We just need to make a custom function instead of seq_along and to split value by ID (not ID by ID).
first_min <- function(x){
nas <- rep(NA, length(x))
nas[which.min(x)] <- min(x, na.rm=TRUE)
nas
}
This function makes a vector of NAs and puts the group minimum at the position where it first occurs, which for this data is the first row of each ID group.
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
You can achieve this with a tapply one-liner
df$Start<-as.vector(unlist(tapply(df$Value,df$ID,FUN = function(x){ return (c(min(x),rep("",length(x)-1)))})))
I keep going back to this question and the above answers helped me greatly.
There is a basic solution for beginners too:
A$Start<-NA
A[!duplicated(A$ID),]$Start<-A[!duplicated(A$ID),]$Value
Thanks.
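For completeness, a grouped sketch with dplyr (my addition; it assumes the data frame is called A as in the question):
library(dplyr)
A %>%
  group_by(ID) %>%
  mutate(Start = replace(rep(NA_real_, n()), 1, min(Value))) %>%
  ungroup()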
