I'm unsure what to call this, so I'll try to describe the issue in layman's terms. I have a data frame that consists only of 0s and 1s. So for each individual, instead of having one column with a factor value (e.g. low price, 4 rooms), I have
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0
4 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0
How can I transform the dataset in R, so that I create new columns (#number of rooms) and give the position of the 1 (in the 4th column) a vhigh value?
I have multiple explanatory variables I need to do this for. The 21 columns represent 6 variables for 1000+ observations. The result should be something like this:
  PurchaseP NumberofRooms ...
1     vhigh             4
2       low             4
3     vhigh             1
4     vhigh             2
I only did it for the first 2 explanatory variables here, but it repeats like this for the rest; each explanatory variable has 3-4 possible factor values.
V1:V4 = purchase price, V5:V8 = number of rooms,V9:V11 = floors, and so on
In my head something like this could work
Create an if statement to give each 1 a value depending on its column position, e.g. if the value in V4 is 1, then name it "vhigh". Do this for each Vx.
Then combine the columns V1:V4, V5:V8, V9:V11 (depending on whether the variable has 3 or 4 possible factor/integer values) while ignoring the 0 values.
Would this work, or is there a simpler approach? How would one code this in R?
If the dataset contains a single 1 per row, this is a pretty simple problem.
Here is your data according to your picture (please edit your question to include code instead of a picture):
df = data.frame(r1 = 0, r2 = 1, r3 = 0)
rownames(df)<- 1
Then you simply have to sum the columns, using the room number as the weight:
df$room = df$r1*1 + df$r2 * 2 + df$r3 *3
You can use the function which() similar to
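# apply() with MARGIN = 1 iterates over rows; lapply(df, ...) would iterate over columns
apply(df, 1, function(x) { # now x is a row
  idx = which(x == 1)[1]   # position of the first 1 in this row
  return(idx)
})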
The interesting part is to use which(x == 1) on each row. This gives you a vector of all indices that contain a one. The first of those can be used in your case (assuming that you only have one 1 per row); otherwise, aggregation needs to be discussed. The resulting column can then be transformed into a factor by giving sensible names to the various indices.
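To make the "index of the 1, then a factor with sensible names" idea concrete, here is a minimal sketch for the first two variables of the question. It uses base R's max.col() in place of a row-wise which(), and it assumes exactly one 1 per group of columns. The data are rows 1-4 of V1..V8 from the question; the labels "med" and "high" for V2 and V3 are guesses, since the question only tells us that V1 means "low" and V4 means "vhigh".

```r
# Rows 1-4 of V1..V8 as shown in the question
df <- data.frame(
  V1 = c(0, 1, 0, 0), V2 = c(0, 0, 0, 0), V3 = c(0, 0, 0, 0), V4 = c(1, 0, 1, 1),
  V5 = c(0, 0, 1, 0), V6 = c(0, 0, 0, 1), V7 = c(0, 0, 0, 0), V8 = c(1, 1, 0, 0)
)

# One entry per explanatory variable: its columns and its level names.
# ASSUMPTION: "med" and "high" are made-up labels for V2 and V3.
groups <- list(
  PurchaseP     = list(cols = paste0("V", 1:4),
                       labels = c("low", "med", "high", "vhigh")),
  NumberofRooms = list(cols = paste0("V", 5:8),
                       labels = c("1", "2", "3", "4"))
)

# For each group, find the position of the 1 in every row and map it to a label
decoded <- as.data.frame(lapply(groups, function(g) {
  idx <- max.col(as.matrix(df[g$cols]), ties.method = "first")
  factor(g$labels[idx], levels = g$labels)
}))

decoded
```

Adding the remaining variables (floors and so on) should only require further entries in groups; rows with no 1 or with several 1s in a group would need separate handling.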
Related
Consider the following data set (named data).
library(DescTools)
v1 v2 v3 w1 w2 w3
1 0 0 0 1 0
0 1 0 0 0 1
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 0 1
0 0 1 1 0 0
My objective is to compute the contingency coefficient for every combination of (v1, v2, v3) and (w1, w2, w3), i.e. v1 & w1, v1 & w2, v1 & w3, etc., using a for loop. For example, the loop would do the following in the first iteration:
tab1 <- table(data$v1,data$w1)
c1 <- ContCoef(tab1)
Any help is highly appreciated!
results = list()
for (v_col in c("v1", "v2", "v3")) {
  for (w_col in c("w1", "w2", "w3")) {
    tab = table(data[[v_col]], data[[w_col]])
    # sep = "_" so the names match the lookups below (paste() defaults to a space)
    results[[paste(v_col, w_col, sep = "_")]] = ContCoef(tab)
  }
}
# view individual results
results[["v1_w2"]]
results[["v3_w1"]]
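If a 3 x 3 matrix is handier than a named list, the loop above can be rewritten with nested sapply() calls. The sketch below is self-contained and does not need DescTools: cont_coef() is a hypothetical helper that computes Pearson's contingency coefficient as sqrt(chi2 / (chi2 + n)) from chisq.test() without continuity correction. I would expect this to agree with ContCoef() at its defaults, but treat that as an assumption and check it on your own data.

```r
# The example data from the question
data <- data.frame(
  v1 = c(1, 0, 0, 1, 0, 0), v2 = c(0, 1, 0, 0, 1, 0), v3 = c(0, 0, 1, 0, 0, 1),
  w1 = c(0, 0, 1, 0, 0, 1), w2 = c(1, 0, 0, 1, 0, 0), w3 = c(0, 1, 0, 0, 1, 0)
)

# Hypothetical helper (not from DescTools): Pearson's C = sqrt(chi2 / (chi2 + n))
cont_coef <- function(x, y) {
  tab <- table(x, y)
  # chisq.test() warns about small expected counts on tiny tables; silence that
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (chi2 + sum(tab))))
}

v_cols <- c("v1", "v2", "v3")
w_cols <- c("w1", "w2", "w3")

# Rows are the v columns, columns are the w columns
coef_matrix <- sapply(w_cols, function(w) {
  sapply(v_cols, function(v) cont_coef(data[[v]], data[[w]]))
})

coef_matrix
```

Individual values are then available as coef_matrix["v1", "w2"] and so on.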
Suppose I have matrix D which consists of death counts per year by specific ages.
I want to fill this matrix with the appropriate death counts, which are stored in
the vector Age, but the following code gives me the wrong answer. How should I write the code without using a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
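You need to use table. Indexed assignment with duplicated indices keeps only the last assignment for each index, so repeated ages are counted just once; table counts each age first: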
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0
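A small variation, sketched here under the assumption that D can be a plain numeric matrix rather than a data frame: wrapping Age in factor(..., levels = Ages) makes table() return a count for every age on the grid, zeros included, so the counts can be added to whatever is already in the column instead of overwriting it.

```r
Years <- 2007:2017
Ages  <- 60:70

# Death-count grid, initialised to zero (a matrix here, not a data frame)
D <- matrix(0, nrow = length(Ages), ncol = length(Years),
            dimnames = list(as.character(Ages), as.character(Years)))

Age  <- c(60, 61, 62, 65, 65, 65, 68, 69, 60)
year <- 2010

# One count per age in Ages, with 0 for ages that do not occur in Age
counts <- table(factor(Age, levels = Ages))

# Add to the existing column, so repeated calls accumulate
D[, as.character(year)] <- D[, as.character(year)] + as.vector(counts)

D[, "2010"]
```

Because the update is additive, running the last assignment again with another Age vector for the same year accumulates the counts.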
I have the following object, model$data:
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this call to sample one of the column names:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
you don't need to generate a vector of column-names, only their indices. Keep it simple.
sample your col-indices from 1:ncol(df) instead of 1:nrow(df)
then put those column-indices on the RHS of the comma in df[, ...]
df[, sample(ncol(df), 1)]
the 1 is because you apparently want to take a sample of size 1.
one minor complication is that your dataframe is model$data[[1]], since your model$data looks like a list with one element which is a dataframe, rather than a plain dataframe. So first, assign df <- model$data[[1]]
finally, if you really really want the sampled column-name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df) [samp_col_idxs]
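Both answers can be checked on a toy object. The sketch below builds a stand-in for model (a list whose data element is itself a list holding one data frame; the values are made up) and shows why the original call fails: colnames() of a plain list is NULL, so sample() has nothing to draw from.

```r
# Stand-in for the real model object; the 0/1 values are illustrative only
model <- list(data = list(data.frame(
  Category1 = c(1, 1, 0), Category2 = c(0, 0, 0),
  Category3 = c(0, 1, 0), Category4 = c(0, 0, 0)
)))

# model$data is a list, not a data frame, so it has no column names
is.null(colnames(model$data))

# Unwrap the data frame first, then sample a column index and look up its name
df <- model$data[[1]]
set.seed(1)  # only to make the draw reproducible
samp_col_idx  <- sample(ncol(df), 1)
samp_col_name <- colnames(df)[samp_col_idx]
samp_col_name
```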
I am definitely not an R coder but am trying to stumble my way through this code. I have a dataframe that looks like this--with 200 rows (just 8 shown here).
Ind.ID V1 V2 V3 V4 V5 V6 V7 Captures
1 1 0 0 1 1 0 0 0 2
2 2 0 0 1 0 0 0 1 2
3 3 1 1 0 1 1 0 1 5
4 4 0 0 1 1 0 0 0 2
5 5 1 0 0 0 0 1 0 2
6 6 0 1 1 0 0 0 0 2
7 7 0 0 1 1 1 0 0 3
8 8 1 0 0 0 1 0 0 2
I am trying to sample from the Captures column (which is the sum of the row) and output the Ind.ID value. If there is a 0 in the Captures column, I want it to subtract 1 from i (i=i-1) and resample--to ensure that I get the correct number of samples. I also want to then subtract 1 from the sampled column (i.e., decrease the Captures value by 1 if it was sampled), and then resample. I am trying to get 400 samples (I think the current code will get me only 200, but I can't figure out how to get 400).
I want my output to be
23
45
197
64
.....
Here's my code:
sess1<-(numeric(200)) #create a place for output
for(i in 1:length(dep.pop$Captures)){
if(dep.pop[i,'Captures']!=0){ #if the value of Captures is not 0, sample and
sample(dep.pop$Captures, size=1, replace=TRUE) #want to resample the row if Captures >1
#code here to decrease the value of the sampled Captures column by 1. create new vector for resampling?
}
else {
if(dep.pop[i,'Captures']==0){ #if the value of Captures = 0
i<-i-1 #decrease the value of i by 1 to ensure 200 samples
sample(dep.pop$Captures, size=1, replace=TRUE) #and resample
}
#sess1<- #store the value from a different column (ID column) that represents the sampled row
}}
Thanks!
Assuming sum(dep.pop$Captures) is at least 400, the following code may meet your needs: it samples each individual ID at most as many times as its number of captures.
sample(rep(dep.pop$Ind.ID, times=dep.pop$Captures), size=400)
If you wish to sample with replacement (so you do not need to worry about the total number of captures) but still want to use the number of captures per individual id as sampling weights, then perhaps
sample(dep.pop$Ind.ID, size=400, replace=TRUE, prob=dep.pop$Captures)
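To see what the two calls do, here is a sketch on the 8 rows shown in the question (not the real 200-row dep.pop, so the sample size is reduced to 10 for the no-replacement case). Expanding the IDs with rep() puts each ID into the pool once per capture, which is what makes "decrease Captures by 1 after each draw" happen implicitly; rows with 0 captures never enter the pool.

```r
# The 8 rows shown in the question
dep.pop <- data.frame(Ind.ID = 1:8,
                      Captures = c(2, 2, 5, 2, 2, 2, 3, 2))

# Each ID appears once per capture, so the pool has sum(Captures) entries
pool <- rep(dep.pop$Ind.ID, times = dep.pop$Captures)

set.seed(42)  # only to make the draws reproducible

# Without replacement: an ID can be drawn at most Captures times
draws <- sample(pool, size = 10)

# With replacement: Captures acts only as a sampling weight
weighted <- sample(dep.pop$Ind.ID, size = 400, replace = TRUE,
                   prob = dep.pop$Captures)

table(draws)
```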
I have a data frame that I want to use to generate a design matrix.
>ct<-read.delim(filename, skip=0, as.is=TRUE, sep="\t", row.names = 1)
> ct
s2 s6 S10 S14 S3 S7 S11 S15 S4 S8 S12 S16
group 1 1 1 1 2 2 2 2 3 3 3 3
donor 1 2 3 4 1 2 3 4 1 2 3 4
>factotum<-apply(ct,1,as.factor) # to turn rows into factors.
>design <- model.matrix(~0 + factotum[,1] + factotum[,2])
Eventually, I'll generate a string and use as.formula() instead of hard coding the formula. Anyway, this works and produces a design matrix. It leaves a column out though.
>design
factotum[, 1]1 factotum[, 1]2 factotum[, 1]3 factotum[, 2]2 factotum[, 2]3 factotum[, 2]4
1 1 0 0 0 0 0
2 1 0 0 1 0 0
3 1 0 0 0 1 0
4 1 0 0 0 0 1
5 0 1 0 0 0 0
6 0 1 0 1 0 0
7 0 1 0 0 1 0
8 0 1 0 0 0 1
9 0 0 1 0 0 0
10 0 0 1 1 0 0
11 0 0 1 0 1 0
12 0 0 1 0 0 1
By my reasoning, the column names should be:
factotum[, 1]1 factotum[, 1]2 factotum[, 1]3, factotum[,2]1, factotum[, 2]2 factotum[, 2]3 factotum[, 2]4. These would be renamed as group1,group2,group3,donor1,donor2,donor3,donor4.
Which means that factotum[,2]1, or donor1, is missing. What am I doing that causes it to be missing? Any help would be appreciated.
Cheers
Ben.
There are several things here.
(1) apply(ct,1,as.factor) doesn't necessarily turn the rows into factors. Try str(factotum) and you'll see that it failed. I'm not sure what the fastest way is, but this should work:
factotum <- data.frame(lapply(data.frame(t(ct)), as.factor))
(2) Since you are working with factors, model.matrix creates dummy coding. In this case, donor has four values. If you are 2, then you get a 1 in the column factotum[,2]2. If you are 3 or 4, you get a 1 in their respective columns. So what if you are a 1? Well, that simply means that you are 0 in all three columns. In this way, you only need three columns to create four groups. The value 1 for donor is called the reference group here, which is the group with which the other groups are compared.
(3) So now the question is... Why doesn't group (or factotum[,1]) have only TWO columns? We could easily code three levels with two columns, right? Well... yes, this is exactly what happens when you use:
design <- model.matrix(~ factotum[,1] + factotum[,2])
However, since you specify that there is no intercept, you'll get an extra column for group.
(4) Usually you don't have to create the design matrix yourself. I'm not sure what function you want to use next, but in most cases the functions take care of it for you.
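Points (2) and (3) are easy to verify with named factors. The sketch below rebuilds the group and donor layout from the question directly (rather than reading ct from a file) and compares the column names with and without an intercept.

```r
# Same layout as the question: 3 groups x 4 donors, 12 samples
group <- factor(rep(1:3, each = 4))
donor <- factor(rep(1:4, times = 3))

# No intercept: the first factor keeps all of its levels,
# the second drops its reference level (donor1)
design_no_int <- model.matrix(~ 0 + group + donor)
colnames(design_no_int)

# With intercept: both factors drop their reference level
design_int <- model.matrix(~ group + donor)
colnames(design_int)
```

Using named factors in the formula also yields readable column names (group1, donor2, ...) without any renaming afterwards.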