Data frame to matrix in R


In R, I am using a for loop to iterate through a large data frame, trying to put the integer in the *i*th row, 7th column into a specific index in another matrix. The specific index corresponds to the index in the large data frame (again in the *i*th row, but the 2nd and 4th column instead). For example, say that my data frame has data_frame[1,2]=5, data_frame[1,4]=12, and data_frame[1,7]=375. I want to put 375 into my matrix in the index where the row has the name 5 and the column has name 12.
However, the problem (I think) is that when I do col_index=which(colnames(matrix)==data_frame[1,2]), it returns integer(0). The column name is technically 5, but I noticed it only works if I do col_index=which(colnames(matrix)=="5"). How can I make sure that (in my for loop) data_frame[i,2] corresponds to "5"?
My data is saved as "out". The matrix that I want to put the data in is called "m":
m=matrix(nrow=87,ncol=87)
fips=sprintf("%03d",seq(1,173,by=2))
colnames(m)=fips
rownames(m)=fips
m[1:40,1:40]
Next comes the loop, with the condition that the 3rd column is equal to 27:
for(i in 8:2446)
{
  if(out[i,3]==27)
  {
    out_col=out[i,4]
    out_row=out[i,2]
    moves=out[i,7]
    col_index=which(colnames(m)==paste(out_col))
    row_index=which(rownames(m)==paste(out_row))
    m[row_index,col_index]=moves
  }
}
Sorry for the lack of formatting. It is putting numbers in the matrix, but they aren't the right numbers, and I can't figure out what's wrong. Any help would be much appreciated!

There's a lot of complexity in your example, but it boils down to replacing values in mat, where the row name, column name, and new value are stored in out. Let's start with a reproducible example (it would have been helpful if you posted one!)
# Matrix to have values replaced
mat <- matrix(0, nrow=3, ncol=3)
rownames(mat) <- c("1", "2", "3")
colnames(mat) <- c("4", "5", "6")
mat
#   4 5 6
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
out <- data.frame(row=c(1, 3, 3), col=c(6, 5, 4), val=c(1, 4, -1))
out
#   row col val
# 1   1   6   1
# 2   3   5   4
# 3   3   4  -1
Now, doing the replacement is a one-liner:
mat[cbind(as.character(out$row), as.character(out$col))] <- out$val
mat
#    4 5 6
# 1  0 0 1
# 2  0 0 0
# 3 -1 4 0
Basically, we're indexing mat by a 2-column matrix, where each row of the indexing matrix is a row name and column name.
In your example, you appear to be excluding the first 7 rows of out, as well as any row where out[,3] does not equal 27. You could simply subset out based on these requirements with something like realout <- out[out[,3] == 27 & seq(nrow(out)) %in% 8:2446,] and then do the replacement with realout.
Note that one added benefit of doing the replacement in this way is that it will be much faster than using a for loop through the rows of out.
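One detail worth checking in the question's setup: the dimnames of m come from sprintf("%03d", ...), so they are zero-padded strings such as "005", and paste(5) yields "5", which would not match them. The sketch below is a minimal illustration under that assumption. The toy data frame, its column names and the 5x5 size are all made up; only the column positions (2, 3, 4, 7) follow the question.

```r
# Zero-padded names, as produced by the question's sprintf("%03d", ...) call
fips <- sprintf("%03d", seq(1, 9, by = 2))   # "001" "003" "005" "007" "009"
m <- matrix(NA_real_, nrow = 5, ncol = 5, dimnames = list(fips, fips))

# Hypothetical stand-in for `out`: column 2 = row code, column 3 = filter,
# column 4 = column code, column 7 = value to store
out <- data.frame(a = 0, from = c(5, 1, 9), state = c(27, 27, 13),
                  to = c(3, 7, 1), b = 0, c = 0, moves = c(375, 20, 99))

# Keep only the rows that satisfy the condition on the 3rd column
keep <- out[out[, 3] == 27, ]

# Format the numeric codes exactly like the dimnames before indexing
idx <- cbind(sprintf("%03d", keep[[2]]), sprintf("%03d", keep[[4]]))
m[idx] <- keep[[7]]

m["005", "003"]
```

If a formatted code is not among the dimnames, the assignment stops with a subscript error rather than silently writing to the wrong cell, which makes mismatches easier to spot than with which().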

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics, so there are four variables for each question. I want to specify that if Q135_L == 1, Q135_RT is left as it is, and otherwise it is coded as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify for the whole dataset that, if a variable ends in _L and equals 1, the matching variable ending in _RT should remain as it is, and otherwise the variable ending in _RT should be coded as NA?
I tried this, but it only returns NAs:
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)

If I understand your problem correctly, you have a data frame whose columns represent survey question variables. Each column name contains two identifiers, namely a survey question number (134, 135, etc.) and a variable suffix (L, RT, etc.). Because you provided no reproducible example, I tried to make a simplified example of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
#   Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1      2      3       2       3    1      1
# 2      3      1       3       2    4      4
# 3      1      1       3       2    4      3
# 4      3      1       3       3    2      1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
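recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  for (k in seq_along(charnum)) {
    # Columns for question k whose names end in _L and in _RT
    is_k <- grepl(charnum[k], colnames(yourdf))
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) & is_k]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) & is_k]
    # Skip numbers that do not match exactly one _L/_RT pair
    if (ncol(col_end_L_k) != 1 || ncol(col_end_R_k) != 1) next
    row_is_1 <- which(col_end_L_k == 1)
    # setdiff() also works when no row equals 1; x[-integer(0), ] would select nothing
    not_1 <- setdiff(seq_len(nrow(yourdf)), row_is_1)
    col_end_R_k[not_1, ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}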
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
Loops over each question number with for.
Uses grepl to identify the column that contains the selected number and has _L at the end of its name.
Does the same for _RT at the end of the column name.
Uses which to find the rows of the _L column that contain 1.
Keeps the values of the _RT column with the same question number in those rows, and changes the values in all other rows to NA.
The result:
recode(DF, 134:135)
#   Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1      2      3      NA      NA    1      1
# 2      3      1      NA       2    4      4
# 3      1      1       3       2    4      3
# 4      3      1      NA       3    2      1
Note that the Q_L1 column is not affected because _L in this column is not located at the end of the column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
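Since the question mentions names such as SG1_1 that carry no usable question number, an alternative is to pair columns purely by suffix. The following is only a sketch of that idea, not the answer's method. The function name recode_by_suffix and the toy columns are hypothetical, and it assumes every _RT column shares its prefix with exactly one _L column.

```r
# Pair every column ending in _L with the column of the same prefix ending in _RT
recode_by_suffix <- function(df) {
  l_cols <- grep("_L$", names(df), value = TRUE)
  for (l_col in l_cols) {
    rt_col <- sub("_L$", "_RT", l_col)
    if (!rt_col %in% names(df)) next   # no matching _RT column: nothing to do
    keep <- df[[l_col]] %in% 1         # FALSE for NA as well as for other values
    df[[rt_col]][!keep] <- NA
  }
  df
}

# Toy data (made up): two question stems plus a column that merely contains "_L"
DF2 <- data.frame(Q134_L = c(2, 3, 1, 3), Q134_RT = c(2, 3, 3, 3),
                  SG1_1_L = c(1, 1, 2, NA), SG1_1_RT = c(10, 20, 30, 40),
                  Q_L1 = c(1, 4, 4, 2))
res <- recode_by_suffix(DF2)
res
```

Because the pattern is anchored with $, a column like Q_L1 is never treated as an _L column, and no vector of question numbers has to be supplied.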

Looping through a column to make a new table in R

I want to make a table called Count_Table, and in it I'd like to count the number of 0s, 1s, and 5s when column "num" == 1, 2, 3, 4, etc.
For example, the code below will count the 0s, 1s, and 5s in column "Visited5" when num == "1". This is great, but I need to do this 34 more times since "num" goes from 1 to 35.
Count_Table <- table(SASS_data[num == "1"]$Visited5)
I am new to R and I don't know how to add 1 to the "num" and loop it until 35 so that the Count_Table includes the counts of 0,1,5 for all nums that exist (1-35). I am sorry if this is confusing and thank you for your help.

lapply will generate a list of tables that span the columns of a dataframe. E.g.,
tablist <- lapply(mtcars, table)
If your dataframe contains columns you want to exclude, you can do that by restricting the dataframe. E.g.,
tablist2 <- lapply(mtcars[, c(2, 4, 7)], table)

Answer
table works on multiple dimensions. Just pass both num and Visited5 as arguments. This also works if not all unique values of Visited5 are present in every level of num; those cells will simply be set to 0.
Example
SASS_data <- data.frame(
  num = rep(1:5, each = 5),
  Visited5 = sample(1:3, 25, replace = TRUE)
)
table(SASS_data$num, SASS_data$Visited5)
# (counts vary from run to run because sample() is random)
#   1 2 3
# 1 2 1 2
# 2 1 3 1
# 3 1 1 3
# 4 2 0 3
# 5 2 2 1
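To get exactly the layout the question asks for (one row per num from 1 to 35, one column each for 0, 1 and 5), both variables can be wrapped in factor() with explicit levels. This is a sketch on made-up data; only the names SASS_data, num, Visited5 and Count_Table come from the question.

```r
set.seed(1)  # only so the toy data is repeatable
SASS_data <- data.frame(
  num = rep(1:35, each = 4),
  Visited5 = sample(c(0, 1, 5), 35 * 4, replace = TRUE)
)

# Explicit levels guarantee a row for every num and a column for 0, 1 and 5,
# even when some combination never occurs in the data
Count_Table <- table(
  num = factor(SASS_data$num, levels = 1:35),
  Visited5 = factor(SASS_data$Visited5, levels = c(0, 1, 5))
)

dim(Count_Table)     # 35 rows, 3 columns
Count_Table["1", ]   # the counts the question computed by hand for num == 1
```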

Split integers based on a value in second column, assign new values, and, recombine into new dataset

In R, I have an n x 2 matrix of data containing all integers.
The first column indicates the size of an item. Some of these sizes were due to merging, so the second column indicates the number of items that went into that size (including 1); I am calling it the 'index'. The sum of the indices indicates how many items were actually in the original data.
I now need to create a new data set that splits any merged sizes back out according to the number in the index, resulting in an n x 2 matrix (with a new length n according to the total of the indices) and a second column of all 1's.
I need this split to happen in two ways.
"Homogeneously", where any merged size is spread over its index as evenly as possible. For instance, a size of 6 with an index of 3 would now be c(2,2,2). Importantly, all numbers have to be integers, so a size of 3 with an index of 2 should be something like c(1,2) or c(2,1). It can't be c(1.5,1.5).
"Heterogeneously", where the sizes are skewed so that 1 is assigned to every position in the index except one, which contains the remainder. For instance, a size of 6 with an index of 3 would now be c(1,1,4) or any ordering of 1, 1, and 4.
Below I am providing some sample data that gives an example of what I have, what I want, and what I have tried.
#Example data that I have
Y.have<-cbind(c(19,1,1,1,1,4,3,1,1,8),c(3,1,1,1,1,2,1,1,1,3))
The data show that three items went into the size of 19 in the first row, one item went into the size of 1 in the second row, and so on. Importantly, in these data there were originally 15 items (i.e. sum(Y.have[,2])), some of which got merged, so the final data will need to be of length 15.
What I want the data to look like is:
####Homogenous separation - split values evenly as possible
#' The value of 19 in row 1 is now a vector of c(6,6,7) (or any combination thereof, i.e. c(6,7,6) is fine) since the position in the second column is a 3
#' Rows 2-5 are unchanged since they have a 1 in the second column
#' The value of 4 in row 6 is now a vector of c(2,2) since the position of the second column is a 2
#' Rows 7-9 are unchanged since they have a 1 in the second column
#' The value of 8 in row 10 is now a vector of c(3,3,2) (or any combination thereof) since the position in the second column is a 3
Y.want.hom<-cbind(c(c(6,6,7),1,1,1,1,c(2,2),3,1,1,c(3,3,2)),c(rep(1,times=sum(Y.have[,2]))))
####Heterogenous separation - split values with as many singles as possible,
#' The value of 19 in row 1 is now a vector of c(1,1,17) (or any combination thereof, i.e. c(1,17,1) is fine) since the position in the second column is a 3
#' Rows 2-5 are unchanged since they have a 1 in the second column
#' The value of 4 in row 6 is now a vector of c(1,3) since the position of the second column is a 2
#' Rows 7-9 are unchanged since they have a 1 in the second column
#' The value of 8 in row 10 is now a vector of c(1,1,6) (or any combination thereof) since the position in the second column is a 3
Y.want.het<-cbind(c(c(1,1,17),1,1,1,1,c(1,3),3,1,1,c(1,1,6)),c(rep(1,times=sum(Y.have[,2]))))
Note that the positions of the integers in the final data don't matter since they will all have one index case.
I have tried splitting the data (split) according to the index. This creates a list with a length equal to the number of unique index values. I then iterated through the positions in that list and divided by the position.
a<-split(Y.have[,1],Y.have[,2]) #Split into a list according to the index
b<-list() #initiate new list
for (i in 1:length(a)){
b[[i]]<-a[[i]]/i #get homogenous values
b[[i]]<-rep(b[i],times=i) #repeat the values based on the number of indicies
}
Y.test<-cbind(unlist(b),rep(1,times=length(unlist(c)))) #create new dataset
This was a terrible approach. First, it will produce decimals. Second, the position in the list does not necessarily equal the index number (i.e. if there was no index of 2, the second position would be the next lowest index, but would divide by 2).
However, it at least allowed me to separate out the data by index, manipulate it, and recombine it to a proper length. I now need help in that middle part - manipulating the data for both homogeneous and heterogenous reassignment. I would prefer base r, but any approach would certainly be fine! Thank you in advance!

Here is one possible approach.
Create two functions for homogeneous and heterogeneous splits:
get_hom_ints <- function(M, N) {
  vec <- rep(floor(M/N), N)
  for (i in seq_len(M - sum(vec))) {
    vec[i] <- vec[i] + 1
  }
  vec
}
get_het_ints <- function(M, N) {
  vec <- rep(1, N)
  vec[1] <- M - sum(vec) + 1
  vec
}
Then use apply to go through each row of the matrix:
het_vec <- unlist(apply(Y.have, 1, function(x) get_het_ints(x[1], x[2])))
unname(cbind(het_vec, rep(1, length(het_vec))))
hom_vec <- unlist(apply(Y.have, 1, function(x) get_hom_ints(x[1], x[2])))
unname(cbind(hom_vec, rep(1, length(hom_vec))))
Output
(heterogeneous)
[,1] [,2]
[1,] 17 1
[2,] 1 1
[3,] 1 1
[4,] 1 1
[5,] 1 1
[6,] 1 1
[7,] 1 1
[8,] 3 1
[9,] 1 1
[10,] 3 1
[11,] 1 1
[12,] 1 1
[13,] 6 1
[14,] 1 1
[15,] 1 1
(homogeneous)
[,1] [,2]
[1,] 7 1
[2,] 6 1
[3,] 6 1
[4,] 1 1
[5,] 1 1
[6,] 1 1
[7,] 1 1
[8,] 2 1
[9,] 2 1
[10,] 3 1
[11,] 1 1
[12,] 1 1
[13,] 3 1
[14,] 3 1
[15,] 2 1
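The same two splits can also be sketched without an explicit loop or apply, using rep() and sequence(). This is an alternative formulation, not the answer's code. It assumes every size is at least as large as its index, and the names size, n and pos are introduced here for illustration.

```r
Y.have <- cbind(c(19,1,1,1,1,4,3,1,1,8), c(3,1,1,1,1,2,1,1,1,3))
size <- Y.have[, 1]
n    <- Y.have[, 2]

# Position of each output element within its original row: 1,2,3,1,1,...
pos <- sequence(n)

# Homogeneous: everyone gets the integer quotient, and the first
# (size %% n) positions of each group get one extra unit
hom <- rep(size %/% n, n) + (pos <= rep(size %% n, n))

# Heterogeneous: the first position takes whatever is left after
# giving 1 to each of the other positions
het <- ifelse(pos == 1, rep(size - n + 1, n), 1)

Y.hom <- cbind(hom, 1)
Y.het <- cbind(het, 1)
```

Both vectors have length sum(n) and sum to sum(size), which is a quick sanity check that nothing was lost in the split.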

The partitions package (library(partitions)) is designed for this type of requirement; check it out.
Apply the logic below to your code and it should work.
Example:
hom <- restrictedparts(19,3) #where 19 is Y.have[,1][1] and 3 is Y.have[,2][1] as per your data
print(hom[,ncol(hom)])
#output : 7 6 6
het <- Reduce(intersect, list(which(hom[2,1:ncol(hom)] %in% 1),which(hom[3,1:ncol(hom)] %in% 1)))
hom[,het]
#output : 17 1 1

One option would be to use integer division (%/%) and modulus (%%). It may not give the exact results you specified, i.e. 8 and 3 give (2,2,4) rather than (3,3,2), but it does generally do what you described.
Y.have<-cbind(c(19,1,1,1,1,4,3,1,1,8),c(3,1,1,1,1,2,1,1,1,3))
homoVec <- c()
for (i in 1:length(Y.have[,1])){
  if (Y.have[i,2] == 1) {
    a <- Y.have[i,1]
    homoVec <- append(homoVec, a)
  } else {
    quantNum <- Y.have[i,1]
    indexNum <- Y.have[i,2]
    b <- quantNum %/% indexNum
    c <- quantNum %% indexNum
    a <- c(rep(b, indexNum-1), b + c)
    homoVec <- append(homoVec, a)
  }
}
homoOut <- data.frame(homoVec, 1)
heteroVec <- c()
for (i in 1:length(Y.have[,1])){
  if (Y.have[i,2] == 1) {
    # Unmerged rows keep their original size (e.g. the 3 in row 7)
    a <- Y.have[i,1]
    heteroVec <- append(heteroVec, a)
  } else {
    quantNum <- Y.have[i,1]
    indexNum <- Y.have[i,2]
    firstNum <- quantNum - (indexNum - 1)
    a <- c(firstNum, rep(1, indexNum - 1))
    heteroVec <- append(heteroVec, a)
  }
}
heteroOut <- data.frame(heteroVec, 1)
If it is really important to have the math exactly as you described in your example then this should work.
homoVec <- c()
for (i in 1:length(Y.have[,1])){
  if (Y.have[i,2] == 1) {
    a <- Y.have[i,1]
    homoVec <- append(homoVec, a)
  } else {
    quantNum <- Y.have[i,1]
    indexNum <- Y.have[i,2]
    b <- round(quantNum/indexNum)
    roundSum <- b * (indexNum - 1)
    c <- quantNum - roundSum
    a <- c(rep(b, indexNum-1), c)
    homoVec <- append(homoVec, a)
  }
}
homoOut <- data.frame(homoVec, 1)

Check and replace column values in R dataframe

I have multiple files to read in using R. I iterate through the files in a loop, obtain dataframes and then try to change values of a particular column. Examples of the R dataframes are as follows:
df_A:
ID ZN
1 0
2 1
3 1
4 0
df_B:
ID ZN
1 2
2 1
3 1
4 2
As shown above, the column 'ZN' may have 0's and 1's in some dataframes and 1's and 2's in others. As I'm iterating through the files, I want to make changes only in the dataframes where column ZN has 1's and 2's, like this: 1 to 0 and 2 to 1. Dataframes with ZN values of 0's and 1's will be left unchanged.
My attempt did not work:
if(dataframe$ZN > 1){
dataframe$ZN<-recode(dataframe$ZN,"1=0;2=1")
}
else{
dataframe$ZN
}
Any solutions please?

One approach might be to decrement the value of ZN by one if we detect a single value of 2 anywhere in the column:
if (max(df_A$ZN) == 2) {
df_A$ZN = df_A$ZN - 1
}
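The attempt in the question fails partly because if() expects a single TRUE/FALSE, whereas dataframe$ZN > 1 is a whole vector. Reducing the column to one value first (with max() as above, or any()) avoids that. Below is a sketch of applying the same rule while iterating over several data frames. The helper name shift_zn and the list standing in for the files read in the loop are hypothetical.

```r
# Shift ZN down by one only when the column uses the 1/2 coding
shift_zn <- function(d) {
  if (any(d$ZN == 2, na.rm = TRUE)) {
    d$ZN <- d$ZN - 1
  }
  d
}

# Stand-ins for the data frames obtained from the files
df_A <- data.frame(ID = 1:4, ZN = c(0, 1, 1, 0))
df_B <- data.frame(ID = 1:4, ZN = c(2, 1, 1, 2))

fixed <- lapply(list(A = df_A, B = df_B), shift_zn)
fixed$B$ZN
```

Note that a 1/2-coded file that happens to contain only 1's is indistinguishable from a 0/1-coded file under this rule, so it would be left unchanged.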

If there are only two values, i.e. 0 and 1, then the following maps 0 to 2 and 1 to 1:
df_A$ZN <- (df_A$ZN==0) + 1
df_A$ZN
#[1] 2 1 1 2
Or, using case_when for multiple values:
library(dplyr)
df_A %>%
  mutate(ZN = case_when(ZN==0 ~2, TRUE ~ 1))

appending to a data frame row by row character formatting issue

I am trying to build a dataframe from the output of a mapply.
Here is one example of my output.
> out[1:9,1]
$statistic
X-squared
1311.404
$parameter
df
1
$p.value
[1] 1.879366e-287
$estimate
prop 1 prop 2
0.001680737 0.009517644
$null.value
NULL
$conf.int
[1] -1.000000000 -0.007153045
attr(,"conf.level")
[1] 0.95
$alternative
[1] "less"
$method
[1] "2-sample test for equality of proportions with continuity correction"
$data.name
[1] "members out of enrolled"
I want to put these values into a dataframe. I have 1684 rows in this matrix. I want a dataframe with 1684 rows.
I also have codes from outside of this data that I want to incorporate into the dataframe. These are strings from fwa$proc.
> out[,1]$p.value
[1] 1.879366e-287
> out[,1]$estimate[[1]]
[1] 0.001680737
> out[,1]$estimate[[2]]
[1] 0.009517644
> as.character(fwa$proc[1])
[1] "10022"
I have looked here for support for doing this. I am creating a dataframe first and then attempting to fill my dataframe from another dataframe row by row as such...
n<-1684
new.df <- data.frame(cpt=character(n), FFS_prop=numeric(n), PHN_prop=numeric(n)
, differnce=numeric(n), results=character(n), Null_HO = character(n), Alt_HA=character(n), stringsAsFactors=FALSE)
Here is the head.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
Now to fill data row by row...
for (i in 1:n) new.df[i, ] <- data.frame(cpt = toString(fwa$proc[i])
,FFS_prop=round(out[,i]$estimate[[1]],5)
,PHN_prop=round(out[,i]$estimate[[2]],5)
,differnce=round(out[,i]$estimate[[1]]-out[,i]$estimate[[2]],5)
,results=if(out[,i]$p.value <.05) {"Reject NUll"} else {"Fail to Reject Null"}
,Null_HO = toString('FFS = pHN')
,Alt_HA = toString('FFS < PHN')
)
Here is the head after the code runs.
> head(new.df)
cpt FFS_prop PHN_prop differnce results Null_HO Alt_HA
1 1 0.00168 0.00952 -0.00784 1 1 1
2 1 0.00033 0.00142 -0.00109 1 1 1
3 1 0.00239 0.01461 -0.01222 1 1 1
4 1 0.00135 0.00919 -0.00783 1 1 1
5 1 0.00008 0.00180 -0.00172 1 1 1
6 1 0.00036 0.00177 -0.00141 1 1 1
Please friends, why don't my strings make it into the data dataframe?
I have tried to put as.character() around them, toString() around them all for naught.
Wiser ones please advise.
Thanks.

You can either set options(stringsAsFactors=F) or set stringsAsFactors=F in the data.frame call inside your loop. (Since R 4.0.0, stringsAsFactors defaults to FALSE, so this conversion no longer happens by default.) The problem is that because you are building a new data.frame in each iteration, it doesn't know about the rules you've set on the data.frame that it is going to be added to later. So at the time of creation, it converts its character values to factors, which are stored as a unique integer for each observed character string. Since you are only adding one value, each factor has one level, so they are each coded as the integer 1.
Then when you go to do the assignment to the master data.frame, that integer 1 is converted to the character "1". So str(new.df) should show that your character columns are still characters; they just happen to contain the character "1" for each row.
Building data.frames row-by-row is always a messy process that should be avoided if at all possible. It's better to build your data column-wise and then build your data.frame at the end. You said that out was the result of using mapply on a prop.test, so I've created a sample:
out<-mapply(prop.test, replicate(10, rbinom(1, size = 100, prob = .5)), 100)
That gives something that matches your out, with only 10 columns, I believe. But then you can extract all the p-values with
apply(out, 2, '[[', "p.value")
and all of your FFS values with
apply(out, 2, function(x) x$estimate[[1]])
so your data.frame construction would look more like
new.df <- data.frame(cpt = fwa$proc,
                     FFS_prop = apply(out, 2, function(x) x$estimate[[1]]),
                     PHN_prop = apply(out, 2, function(x) x$estimate[[2]]),
                     pval = apply(out, 2, '[[', "p.value"),
                     Null_HO = 'FFS = pHN',
                     Alt_HA = 'FFS < PHN',
                     stringsAsFactors = FALSE)
# Derived columns (a stray empty argument between them would make transform() fail)
new.df <- transform(new.df,
                    differnce = FFS_prop - PHN_prop,
                    results = ifelse(pval < .05, "Reject NUll", "Fail to Reject Null"))
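For a version that runs end to end without the asker's data, the column-wise idea can be sketched on the simulated out from above. Two caveats: the simulated tests are one-sample, so only the first estimate exists and PHN_prop is left out here, and proc is a made-up stand-in for fwa$proc. vapply is used instead of apply purely to make the expected type of each extracted value explicit.

```r
set.seed(42)  # only so the toy data is repeatable
out <- mapply(prop.test, replicate(10, rbinom(1, size = 100, prob = .5)), 100)

# Hypothetical stand-in for the codes in fwa$proc
proc <- sprintf("%05d", 10021 + seq_len(ncol(out)))

# Build each column as a plain vector first ...
est1 <- vapply(seq_len(ncol(out)), function(i) out[, i]$estimate[[1]], numeric(1))
pval <- vapply(seq_len(ncol(out)), function(i) out[, i]$p.value, numeric(1))

# ... then assemble the data frame once, keeping strings as character
new.df <- data.frame(cpt = proc,
                     FFS_prop = round(est1, 5),
                     pval = pval,
                     stringsAsFactors = FALSE)
new.df$results <- ifelse(new.df$pval < .05, "Reject Null", "Fail to Reject Null")

str(new.df)
```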
