Optimal way to reshape a dataframe in R to have observations in columns

Given e.g. the Orange data set, I would like to arrange the observations in a matrix in which the measurements (circumference) taken on each tree are arranged in rows (for a total of 5 rows).
One unsatisfactory way of obtaining this result is as follows:
mat <- matrix(Orange[, 3], nrow = 5, ncol = 7, byrow = TRUE, dimnames = list(c(unique(Orange$Tree)), c(1:7)))

An alternative way would be to use the dcast() function from the data.table package.
This allows you to convert data from long to wide. In this case, I've created an ID to count the number of records per Tree.
In the reshaped data, Tree becomes our primary column and circumference is recorded in 7 unique columns (one for each age).
library(data.table)
Orange <- data.table(Orange)[, ID := seq_len(.N), by = Tree]
Orange2 <- dcast(
  data = Orange,
  formula = Tree ~ ID,
  value.var = "circumference")
Orange2
Tree 1 2 3 4 5 6 7
1: 3 30 51 75 108 115 139 140
2: 1 30 58 87 115 120 142 145
3: 5 30 49 81 125 142 174 177
4: 2 33 69 111 156 172 203 203
5: 4 32 62 112 167 179 209 214
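If you specifically need a matrix (as the question asks) rather than a data.table, the dcast result converts cleanly; a minimal sketch:
m <- as.matrix(as.data.frame(Orange2)[, -1])  # drop the Tree column, keep the 7 measurements
rownames(m) <- as.character(Orange2$Tree)     # label rows by tree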
EDIT (in response to additional comments/questions):
Technically the data is already ordered by Tree (as defined within the data). This is because the variable Tree is a factor variable with preset levels. To order it numerically, there are two options: (1) order by as.character(), or (2) re-level the variable.
Orange2[order(as.character(Tree)),]
Tree 1 2 3 4 5 6 7
1: 1 30 58 87 115 120 142 145
2: 2 33 69 111 156 172 203 203
3: 3 30 51 75 108 115 139 140
4: 4 32 62 112 167 179 209 214
5: 5 30 49 81 125 142 174 177
class(Orange$Tree)
[1] "ordered" "factor"
levels(Orange$Tree)
[1] "3" "1" "5" "2" "4"
Orange2[,Tree := factor(Tree, c("1","2","3","4","5"), ordered = FALSE)]
Orange2[order(Tree),]
Tree 1 2 3 4 5 6 7
1: 1 30 58 87 115 120 142 145
2: 2 33 69 111 156 172 203 203
3: 3 30 51 75 108 115 139 140
4: 4 32 62 112 167 179 209 214
5: 5 30 49 81 125 142 174 177

In base R, you could simply do:
aggregate(circumference ~ Tree, Orange, I)
If you don't want to have to reorder it afterwards: aggregate(circumference ~ as.character(Tree), Orange, I) (that will strip the factor ordering, so the groups sort as "1" through "5").
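Note that aggregate() here returns a data frame whose circumference column is itself a 5-by-7 matrix; a quick check (a sketch, starting from the original Orange data):
res <- aggregate(circumference ~ Tree, Orange, I)
dim(res$circumference)  # 5 7: one row per tree, one column per measurement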
Or, similar to @RyanF:
Orange$id <- sequence(rle(as.character(Orange$Tree))$lengths)
reshape(Orange[, -2],
        idvar = "Tree",
        timevar = "id",
        direction = "wide")
Output:
Tree circumference.1 circumference.2 circumference.3 circumference.4 circumference.5 circumference.6 circumference.7
1 1 30 58 87 115 120 142 145
8 2 33 69 111 156 172 203 203
15 3 30 51 75 108 115 139 140
22 4 32 62 112 167 179 209 214
29 5 30 49 81 125 142 174 177
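For completeness, the same long-to-wide reshape can also be written with tidyr's pivot_wider (a sketch not shown in the original answers; assumes the dplyr and tidyr packages are available, and starts from the original Orange data frame):
library(dplyr)
library(tidyr)
datasets::Orange %>%
  group_by(Tree) %>%
  mutate(id = row_number()) %>%  # measurement number within each tree
  ungroup() %>%
  pivot_wider(id_cols = Tree, names_from = id, values_from = circumference)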

Related

Split a list into sub-lists based on the condition that there is a gap of size n

I have a list of numbers that is ever increasing (i.e. monotonically increasing).
alist <- c(1:20, 50:70, 210:235, 240:250)
Call the difference from one number to the next n.
I'd like to automatically split the list wherever the difference between consecutive items is bigger than the threshold value n.
For example, with a threshold of n = 20, the particular list above should split itself into 3 datasets.
Calling which(diff(alist) >20) tells me where I should "cut" the data up, but for the life of me I cannot figure out the next step... I might be missing something very simple here.
The result should ideally become a list of lists, or a table (I don't mind either):
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
[[2]]
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65...
[[3]]
[1] 210 211 212 213...
We can use cumsum on a logical vector to create a grouping index for split:
unname(split(alist, cumsum(c(TRUE, diff(alist) > 20))))
#[[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#[[2]]
# [1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
#[[3]]
# [1] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 240 241 242 243 244 245 246 247 248
# [36] 249 250
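To see why this works, look at the intermediate grouping vector (derived from the same alist):
diff(alist) > 20                   # TRUE at exactly the two gaps larger than 20
cumsum(c(TRUE, diff(alist) > 20))  # 1 1 ... 1 2 2 ... 2 3 3 ... 3 (one group id per element)
split() then collects the elements group by group.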
If we need to use the which approach:
i1 <- which(diff(alist) > 20)
Map(function(i, j) alist[i:j], c(1, i1 + 1), c(i1, length(alist)))

Add sequence to each element of a vector

I have a vector as indicated below
x <- c(1,32,60,86,115,142,171,198)
I would like to create a sequence as seq(x[i],x[i]+2,by=1) for each element of the vector. The resulting vector should be
1,2,3,32,33,34,60,61,62,86,87,88.....
Is there a function similar to rep to do this? I'd appreciate your input.
We can use the vectorized rep; adding 0:2 then recycles it across each block of three:
rep(x, each = 3) + 0:2
#[1] 1 2 3 32 33 34 60 61 62 86 87 88 115 116 117 142 143
#[18] 144 171 172 173 198 199 200
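To see the recycling at work, a quick illustration:
rep(x, each = 3)[1:6]  # 1 1 1 32 32 32
rep_len(0:2, 6)        # 0 1 2 0 1 2 -- the pattern R recycles 0:2 into
So each run of three copies becomes x, x+1, x+2.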
You can use sapply to loop over every element of x, generate a sequence of numbers from each, and combine the results with c:
c(sapply(x, function(i) seq(i, i + 2)))
# [1] 1 2 3 32 33 34 60 61 62 86 87 88 115 116 117 142 143 144 171 172 173
# [22] 198 199 200

Filtering my R data frame is causing it to sort the data frame incorrectly

Consider the following two code snippets.
A:
library(dplyr) # assumed: arrange() below comes from dplyr (or plyr); the question doesn't show the library call
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl" )
gdp <- read.csv('./data/gdp.csv', header=F, skip=5, nrows=190) # Specify nrows, get correct answer
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl" )
education = read.csv('./data/education.csv')
mergedData <- merge(gdp, education, by.x='V1', by.y='CountryCode')
# No need to remove unranked countries because we specified nrows
# No need to convert V2 from factor to numeric
sortedMergedData = arrange(mergedData, -V2)
sortedMergedData[13,1] # Get KNA, correct answer
B:
library(dplyr) # assumed, as in snippet A
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl" )
gdp <- read.csv('./data/gdp.csv', header=F, skip=5) # Don't specify nrows, get incorrect answer
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl" )
education = read.csv('./data/education.csv')
mergedData <- merge(gdp, education, by.x='V1', by.y='CountryCode')
mergedData = mergedData[which(mergedData$V2 != ""),] # Remove unranked countries
mergedData$V2 = as.numeric(mergedData$V2) # make V2 a numeric column
sortedMergedData = arrange(mergedData, -V2)
sortedMergedData[13,1] # Get SRB, incorrect answer
I would think the two code snippets would be identical, except that in A you never add the unranked countries to your dataframe and in B you add them but then remove them. Why is the sorting different for these two code snippets?
The file downloads are from Coursera's Getting and Cleaning Data class (Quiz 3, Question 3).
Edit: To avoid security concerns, I've pasted the raw .csv files below
gdp.csv - http://pastebin.com/raw.php?i=4aRZwBRd
education.csv - http://pastebin.com/raw.php?i=0pbhDCSX
Edit2: The problem is occurring in the as.numeric step. For case B, here is mergedData$V2 before and after mergedData$V2 = as.numeric(mergedData$V2) is applied:
> mergedData$V2
[1] 161 105 60 125 32 26 133 172 12 27 68 162 25 140 128 59 76 93
[19] 138 111 69 169 149 96 7 153 113 167 117 165 11 20 36 2 99 98
[37] 121 30 182 166 81 67 102 51 4 183 33 72 48 64 38 159 13 103
[55] 85 43 155 5 185 109 6 114 86 148 175 176 110 42 178 77 160 37
[73] 108 71 139 58 16 10 46 22 47 122 40 9 116 92 3 50 87 145
[91] 120 189 178 15 146 56 136 83 168 171 70 163 84 74 94 82 62 147
[109] 141 132 164 14 188 135 129 137 151 130 118 154 127 152 34 123 144 39
[127] 126 18 23 107 55 66 44 89 49 41 187 115 24 61 45 97 54 52
[145] 8 142 19 73 119 35 174 157 100 88 186 150 63 80 21 158 173 65
[163] 124 156 31 143 91 170 184 101 79 17 190 95 106 53 78 1 75 180
[181] 29 57 177 181 90 28 112 104 134
194 Levels: .. Not available. 1 10 100 101 102 103 104 105 106 107 ... Note: Rankings include only those economies with confirmed GDP estimates. Figures in italics are for 2011 or 2010.
> mergedData$V2 = as.numeric(mergedData$V2)
> mergedData$V2
[1] 72 10 149 32 118 111 41 84 26 112 157 73 110 49 35 147 166 185
[19] 46 17 158 80 58 188 159 63 19 78 23 76 15 105 122 104 191 190
[37] 28 116 94 77 172 156 7 139 126 95 119 162 135 153 124 69 37 8
[55] 176 130 65 137 97 14 148 20 177 57 87 88 16 129 90 167 71 123
[73] 13 161 47 146 70 4 133 107 134 29 127 181 22 184 115 138 178 54
[91] 27 101 90 59 55 144 44 174 79 83 160 74 175 164 186 173 151 56
[109] 50 40 75 48 100 43 36 45 61 38 24 64 34 62 120 30 53 125
[127] 33 91 108 12 143 155 131 180 136 128 99 21 109 150 132 189 142 140
[145] 170 51 102 163 25 121 86 67 5 179 98 60 152 171 106 68 85 154
[163] 31 66 117 52 183 82 96 6 169 81 103 187 11 141 168 3 165 92
[181] 114 145 89 93 182 113 18 9 42
Can anyone explain why the numbers change when I apply as.numeric()?
The real reason for the different results lies in the second case: the full dataset has some footer notes, which read.csv also reads in, causing most of the columns to become class 'factor' because of the character elements in the footer. This could have been avoided either by
dropping the footer lines by setting nrows in read.csv (as snippet A does; note that skip only skips leading lines), or
using stringsAsFactors = FALSE in the read.csv call along with dropping those lines.
As for the changing numbers: calling as.numeric() on a factor returns the internal integer level codes, so the columns were effectively ordered based on the "levels" of the factor rather than the printed values.
If you have already read the files without dropping those lines, convert the columns to their respective classes. If it is a 'numeric' column, convert it with as.numeric(as.character(df$column)) or as.numeric(levels(df$column))[df$column] (the latter is slightly more efficient).
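To see the pitfall in isolation, a minimal illustration with made-up values (not from the question's data):
f <- factor(c("10", "2", "33"))  # character data read in as a factor
as.numeric(f)                    # 1 2 3 -- the internal level codes, not the values
as.numeric(as.character(f))      # 10 2 33 -- the actual values
as.numeric(levels(f))[f]         # 10 2 33 -- same result, slightly more efficient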

Return id numbers if missing over a set of variables

If I have a large database including an 'id' variable, I want to list all variables of interest and get back a list of the ids that are missing each particular variable.
#Fake Data:
set.seed(11100)
missdata <- data.frame(id = 1:1000, C1 = sample(c(1, NA), 1000, replace = TRUE, prob = c(.8, .2)), C2 = sample(c(1, NA), 1000, replace = TRUE, prob = c(.8, .2)))
names(missdata)<-c("id","v1","v2")
#One variable solution:
missdatatest<-subset(missdata, is.na(v1),select=id)
> missdatatest[1:10,]
[1] 5 30 44 47 48 49 57 65 68 74
#Looking to build a function...
FindMissings <- function(indata, varslist, printvar){
  printonevar <- function(var){
    missdatalist <- subset(indata, is.na(var), select = printvar)
    print(missdatalist)
  }
  lapply(vars, printonevar)
}
#Run function:
vars<-c("v1","v2")
FindMissings(missdata,vars,id)
#Error:
> FindMissings(missdata,vars,id)
Error in `[.data.frame`(x, r, vars, drop = drop) : undefined columns selected
Any help would be appreciated. I originally wrote a function to do this in SAS, and it works perfectly fine, but I'm trying to move a lot of my work into R.
There's no need for such a function. Just use lapply:
> lapply(missdata[-1], function(x) which(is.na(x)))
$v1
[1] 5 30 44 47 48 49 57 65 68 74 89 103 107 110 115 119 152 167
[19] 175 176 194 197 199 202 204 212 215 223 231 232 233 239 245 280 281 293...
<<SNIP>>
$v2
[1] 3 6 18 19 22 23 27 28 33 38 41 50 51 55 60 66 68 77
[19] 81 84 86 96 97 99 109 116 117 134 139 141 143 146 148 153 165 168...
<<SNIP>>
If you specifically wanted to return the values from your "id" column (not just the position of the NA values), you can modify the statement to be:
lapply(missdata[-1], function(x) missdata$id[which(is.na(x))])
If your concern is how to use this approach for specific variables, it's pretty straightforward:
vars <- c("v1","v2")
lapply(missdata[vars], function(x) which(is.na(x)))
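If you do still want a reusable function along the lines of the original attempt, here is a minimal sketch (my own rewrite, not from the original answers). It fixes the two bugs in the posted version: is.na() must be applied to the column itself rather than to its name, and the loop must run over the varslist argument instead of the global vars:
FindMissings <- function(indata, varslist, idvar = "id") {
  # for each requested variable, return the ids of the rows where it is NA
  lapply(setNames(varslist, varslist),
         function(v) indata[[idvar]][is.na(indata[[v]])])
}
FindMissings(missdata, c("v1", "v2"))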

Create a for loop which prints every number x with x %% 3 == 0 between 1 and 200

Like the title says, I need a for loop which will write every number from 1 to 200 that is evenly divisible by 3.
Every other method posted so far generates the 1:200 vector then throws away two thirds of it. What a waste. In an attempt to be eco-conscious, this method does not waste any electrons:
seq(3,200,by=3)
You don't need a for loop; use the which function instead, as in:
which(1:200 %% 3 == 0)
[1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81
[28] 84 87 90 93 96 99 102 105 108 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162
[55] 165 168 171 174 177 180 183 186 189 192 195 198
Two other alternatives:
c(1:200)[c(F, F, T)]
c(1:200)[1:200 %% 3 == 0]
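And since the title literally asks for a for loop, here is that version for completeness (a minimal sketch, not among the original answers):
for (x in 1:200) {
  if (x %% 3 == 0) print(x)  # print each multiple of 3
}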
