Check and replace column values in R dataframe

I have multiple files to read in using R. I iterate through the files in a loop, obtain dataframes and then try to change values of a particular column. Examples of the R dataframes are as follows:
df_A:
ID ZN
1 0
2 1
3 1
4 0
df_B:
ID ZN
1 2
2 1
3 1
4 2
As shown above, the column 'ZN' may contain 0's and 1's in some dataframes and 1's and 2's in others. As I iterate through the files, I want to change only the dataframes whose ZN column has 1's and 2's, like this: 1 to 0 and 2 to 1. Dataframes with ZN values of 0's and 1's should be left unchanged.
My attempt did not work:
if(dataframe$ZN > 1){
dataframe$ZN<-recode(dataframe$ZN,"1=0;2=1")
}
else{
dataframe$ZN
}
Any solutions please?

One approach might be to decrement the value of ZN by one if we detect a single value of 2 anywhere in the column:
if (max(df_A$ZN) == 2) {
df_A$ZN = df_A$ZN - 1
}
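A minimal sketch of how this check could be applied to every data frame in a loop, assuming the files have already been read into a list. The helper name normalize_zn and the toy data are made up for illustration. The original attempt fails because dataframe$ZN > 1 yields one logical value per row, while if() expects a single TRUE or FALSE; max() reduces the column to one value first.

```r
# Hypothetical helper: shift ZN from a 1/2 coding to 0/1, leave 0/1 data untouched
normalize_zn <- function(df) {
  if (max(df$ZN, na.rm = TRUE) == 2) {
    df$ZN <- df$ZN - 1
  }
  df
}

# Toy stand-ins for the data frames obtained from the files
df_A <- data.frame(ID = 1:4, ZN = c(0, 1, 1, 0))
df_B <- data.frame(ID = 1:4, ZN = c(2, 1, 1, 2))

frames <- lapply(list(A = df_A, B = df_B), normalize_zn)
frames$A$ZN  # 0 1 1 0 (unchanged)
frames$B$ZN  # 1 0 0 1 (shifted down by one)
```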

If there are only two values i.e. 0 and 1, then
df_A$ZN <- (df_A$ZN==0) + 1
df_A$ZN
#[1] 2 1 1 2
Or using case_when for multiple values
library(dplyr)
df_A %>%
mutate(ZN = case_when(ZN==0 ~2, TRUE ~ 1))

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics. So there are four variables for each question. I want to specify if Q135_L ==1 , leave Q135_RT as it is, otherwise code it as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify, for the whole dataset, that if a variable ending in _L equals 1, the matching variable ending in _RT should remain as it is, and otherwise be coded as NA?
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame whose columns represent survey question variables. Each column name contains two identifiers: a survey question number (134, 135, etc.) and a variable suffix (L, RT, etc.). Because you did not provide a reproducible example, I made a simplified version of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
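recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  all_rows <- seq_len(nrow(yourdf))
  for (k in seq_along(charnum)) {
    # Match the question number as a whole number, so that "1" does not match "Q134"
    is_k <- grepl(paste0("(^|[^0-9])", charnum[k], "([^0-9]|$)"), colnames(yourdf))
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) & is_k]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) & is_k]
    # Skip question numbers that have no _L or no _RT column
    if (ncol(col_end_L_k) == 0 || ncol(col_end_R_k) == 0) next
    row_is_1 <- which(col_end_L_k == 1)
    # setdiff() rather than -row_is_1: a negative empty index would select no rows
    col_end_R_k[setdiff(all_rows, row_is_1), ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}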
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
It loops over the question numbers with for.
It uses grepl to find the column whose name contains the current number and ends in _L.
It does the same for the column whose name ends in _RT.
It uses which to find the rows in which the _L column equals 1.
It keeps the values of the matching _RT column in those rows and sets the values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected because the _L in this column is not at the end of the column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
Your questnums are 1 to 200. Then use 1:200 or seq(200), so recode(DF, 1:200).
Your questnums are 1, 3, 134, 135. Then, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n).
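Since the question says the variable names are unrelated (Q135, SG1_1, and so on), a variant that needs no list of question numbers may also be useful. The following is only a sketch under the assumption that every _RT column shares its full prefix with its _L column; the function name mask_rt and the toy columns are hypothetical.

```r
# Hypothetical helper: for every column ending in _L, blank out the
# matching _RT column wherever the _L value is not 1 (NA counts as "not 1")
mask_rt <- function(df) {
  l_cols <- grep("_L$", names(df), value = TRUE)
  for (l in l_cols) {
    rt <- sub("_L$", "_RT", l)   # derive the partner column name
    if (rt %in% names(df)) {
      df[[rt]][!(df[[l]] %in% 1)] <- NA
    }
  }
  df
}

# Toy data with unrelated prefixes
toy <- data.frame(Q134_L   = c(2, 3, 1, 3),
                  Q134_RT  = c(2, 3, 3, 3),
                  SG1_1_L  = c(1, 1, 2, NA),
                  SG1_1_RT = c(10, 20, 30, 40),
                  Q_L1     = c(1, 4, 4, 2))
masked <- mask_rt(toy)
masked$Q134_RT   # NA NA  3 NA
masked$SG1_1_RT  # 10 20 NA NA
```

Columns such as Q_L1 are left alone because the pattern is anchored at the end of the name.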

Using row number to create a 0/1 column in R

I want to create a new column in my dataset for when 'death_code' contains an 'I' (could be I001-I100) then it would return a 1, otherwise it would return a 0
death_code
I099
E045
T054
I065
I022
I have used grepl to search for rows in a variable which contain 'I' and saved the row numbers
rows<-which(grepl('I', fulldata$deathcode))
However I now want to assign a 1 to these rows in a new column and I cannot workout how to do this.
This is what I anticipate the data to look like
death_code CVD_death
I099 1
E045 0
T054 0
I065 1
I022 1
Instead of using which, use as.integer on the grepl result - TRUE/FALSE will be converted to 1/0.
fulldata$CVD_death <- as.integer(grepl("I", fulldata$deathcode))
Alternately, you could do it with which by setting all values in the column to 0, and then setting the which values to 1:
fulldata$CVD_death <- 0
fulldata$CVD_death[which(grepl("I", fulldata$deathcode))] <- 1
Using stringr approach:
library(dplyr)
library(stringr)
df %>% mutate(CVD_death = case_when(str_detect(death_code, '^I\\d{3}') ~ 1, TRUE ~ 0))
# A tibble: 5 x 2
death_code CVD_death
<chr> <dbl>
1 I099 1
2 E045 0
3 T054 0
4 I065 1
5 I022 1
Another option is + to convert the logical to integer
fulldata$CVD_death <- +(grepl("I", fulldata$deathcode))
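The answers above differ in one detail: grepl("I", ...) flags an "I" anywhere in the code, whereas the anchored pattern only accepts codes that start with "I". A small sketch of the difference, assuming the column is called death_code as in the printed data; the code "T0I4" is invented purely to show the contrast.

```r
# Toy data; "T0I4" is a made-up code containing an "I" that is not at the start
fulldata <- data.frame(
  death_code = c("I099", "E045", "T054", "I065", "I022", "T0I4"),
  stringsAsFactors = FALSE
)

# Unanchored: any "I" in the string counts
fulldata$any_I <- as.integer(grepl("I", fulldata$death_code))

# Anchored: "I" followed by exactly three digits
fulldata$CVD_death <- as.integer(grepl("^I[0-9]{3}$", fulldata$death_code))

fulldata$any_I      # 1 0 0 1 1 1
fulldata$CVD_death  # 1 0 0 1 1 0
```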

Data frame to matrix in R

In R, I am using a for loop to iterate through a large data frame, trying to put the integer in the *i*th row, 7th column into a specific index in another matrix. The specific index corresponds to the index in the large data frame (again in the *i*th row, but the 2nd and 4th column instead). For example, say that my data frame has data_frame[1,2]=5, data_frame[1,4]=12, and data_frame[1,7]=375. I want to put 375 into my matrix in the index where the row has the name 5 and the column has name 12.
However, the problem (I think) is that when I do col_index=which(colnames(matrix)==data_frame[1,2]), it returns integer 0. The column name is technically 5, but I noticed it only works if I do col_index=which(colnames(matrix)=="5"). How can I make sure that (in my for loop) data_frame[i,2] corresponds to "5"?
The data is saved as "out". The matrix that I want to put the data in is called "m":
m=matrix(nrow=87,ncol=87)
fips=sprintf("%03d",seq(1,173,by=2))
colnames(m)=fips
rownames(m)=fips
m[1:40,1:40]
Next, the condition that the 3rd column is equal to 27
for(i in 8:2446)
{
if(out[i,3]==27)
{
out_col=out[i,4]
out_row=out[i,2]
moves=out[i,7]
col_index=which(colnames(m)==paste(out_col))
row_index=which(rownames(m)==paste(out_row))
m[row_index,col_index]=moves
}
}
Sorry for the lack of formatting. It is putting numbers in the matrix, but they aren't the right numbers, and I can't figure out what's wrong. Any help would be much appreciated!
There's a lot of complexity in your example, but it boils down to replacing values in mat, where the row name, column name, and new value are stored in out. Let's start with a reproducible example (it would have been helpful if you posted one!)
# Matrix to have values replaced
mat <- matrix(0, nrow=3, ncol=3)
rownames(mat) <- c("1", "2", "3")
colnames(mat) <- c("4", "5", "6")
mat
# 4 5 6
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
out <- data.frame(row=c(1, 3, 3), col=c(6, 5, 4), val=c(1, 4, -1))
out
# row col val
# 1 1 6 1
# 2 3 5 4
# 3 3 4 -1
Now, doing the replacement is a one-liner:
mat[cbind(as.character(out$row), as.character(out$col))] <- out$val
mat
# 4 5 6
# 1 0 0 1
# 2 0 0 0
# 3 -1 4 0
Basically, we're indexing mat by a 2-column matrix, where each row of the indexing matrix is a row name and column name.
In your example, you appear to be excluding the first 7 rows of out, as well as any row where out[,3] does not equal 27. You could simply subset out based on these requirements with something like realout <- out[out[,3] == 27 & seq(nrow(out)) %in% 8:2446,] and then do the replacement with realout.
Note that one added benefit of doing the replacement in this way is that it will be much faster than using a for loop through the rows of out.
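One detail from the question may also matter: the matrix names there are zero-padded with sprintf("%03d", ...), so a plain as.character() or paste() of the number 5 gives "5", which does not match the name "005". A sketch of the same matrix-indexing replacement with padded keys follows; the data frame moves_df and its column names are hypothetical stand-ins for the relevant columns of out.

```r
# Zero-padded names as in the question
fips <- sprintf("%03d", seq(1, 173, by = 2))
m <- matrix(NA_real_, nrow = length(fips), ncol = length(fips),
            dimnames = list(fips, fips))

# Hypothetical stand-in for columns 2, 4 and 7 of `out`
moves_df <- data.frame(row_code = c(5, 13),
                       col_code = c(11, 171),
                       moves    = c(375, 20))

# Pad the keys the same way as the dimnames before indexing
keys <- cbind(sprintf("%03d", moves_df$row_code),
              sprintf("%03d", moves_df$col_code))
m[keys] <- moves_df$moves

m["005", "011"]  # 375
m["013", "171"]  # 20
```

A key that is not among the row or column names raises a subscript error, which can help catch unexpected codes early.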

Subset based on granularity and average values

I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column. The subsets of the first column are based on a specified granularity. For example, for the following data frame, df, I want to calculate the average of the df$B values for each subset of df$A, with an increment (granularity) of 1 for each subset. The results should be in two new columns.
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5
Expected results:
newA newB
0 1.142857
1 2
2 4
This is a simple example. I'm not sure how to loop over the whole data frame and calculate the average of df$B.
I tried the following to subset, but couldn't figure out how to append the results and build the final result:
increment<-1
mx<-max(df$A)
i<-0
newDF<-data.frame()
while(i < mx){
tmp<-subset(df, (A >i & A< (i+increment)))
i<-i+increment
}
Not sure about the logic. But I'm sure there is a short way to do the required calculation. Any thoughts?
I would use findInterval to assign each A value to a subset and tapply to calculate the mean. (In your example a simple ceiling of each A value would be sufficient, too, but if your increment is different from 1 you need findInterval.)
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (optional: findInterval only requires the break points to be sorted)
df <- df[order(df$A), ]
## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000
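To get the result as the two columns newA and newB that the question asks for, one possible sketch uses floor() to compute the lower edge of each bin and aggregate() to average within bins. It assumes bins start at 0 and are closed on the left, which reproduces the expected numbers for this data.

```r
df <- data.frame(
  A = c(0.22096, 0.33489, 0.33655, 0.43953, 0.64933, 0.86668, 0.96932,
        1.09342, 1.58314, 1.88481, 2.07654, 2.34652, 2.79777),
  B = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 4, 3, 5)
)

increment <- 1

# Lower edge of the bin each A value falls into: [0, 1), [1, 2), ...
df$newA <- floor(df$A / increment) * increment

# Mean of B per bin, returned as a two-column data frame
res <- aggregate(B ~ newA, data = df, FUN = mean)
names(res)[2] <- "newB"
res
```

Changing increment (for example to 0.5) changes the bin width without further edits.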

Comparing two columns: logical- is value from column 1 also in column 2?

I'm pretty confused about how to go about this. Say I have two columns in a dataframe. One column is a numerical series in order (x), the other specifies some value from the first, or -1 (y). These are results from a matching experiment, where the goal is to see if multiple photos were taken of the same individual. In the example below, there are 10 photos, but only 6 unique individuals. In the y column, the corresponding x is reported if there is a match. y is -1 for no match (these might as well be NAs). If there are more than 2 photos per individual, the match # will be the most recent record (photos 1, 5 and 7 are the same individual below). The group is the time period in which the photo was taken (no matches within a group!). Hopefully I've got this example right:
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,2,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
I would like to create a new variable to name the unique individuals, and have a final dataset with a single row per individual (i.e. only have 6 rows instead of 10), that also includes the group information. I.e. if an individual is in all three groups, there could be a value of "111" or if just in the first and last group it would be "101". Any tips?
Thanks for asking about the resulting dataset. I realized my group explanation was bad based on the actual numbers I gave, so I changed the results slightly. Bonus would also be nice to have, but not critical.
name <- c(1,2,3,4,6,8)
group_history <- as.character(c('111','101','100','011','010','001'))
bonus <- as.character(c('1,5,7','2,9','3','4,10','6','8'))
results_I_want <- data.frame(name,group_history,bonus)
My word, more mistakes fixed above...
Using the (updated) example you gave
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(-1,-1,-1,-1,1,-1,1,-1,3,4)
group <- c(1,1,1,2,2,2,3,3,3,3)
DF <- data.frame(x,y,group)
Use x and y to create a mapping from higher numbers to the lower numbers that are the same person. Note that the names of the mapping are character strings, even though they consist of digits.
bottom.df <- DF[DF$y==-1,]
mapdown.df <- DF[DF$y!=-1,]
mapdown <- c(mapdown.df$y, bottom.df$x)
names(mapdown) <- c(mapdown.df$x, bottom.df$x)
We don't know how many times it might take to get everything down to the lowest number, so have to use a while loop.
oldx <- DF$x
newx <- mapdown[as.character(oldx)]
while(any(oldx != newx)) {
oldx = newx
newx = mapdown[as.character(oldx)]
}
The result is the set each photo belongs to, named by the lowest number in that set.
DF$id <- unname(newx)
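The loop is needed because a match can point to a photo that is itself a match. A small sketch with an invented chain (photo 7 matches 5, and 5 matches 1) shows the lookup being repeated until nothing changes; the function name resolve_ids and the toy vectors are made up for this illustration.

```r
# Toy data with a chain: 7 -> 5 -> 1
x <- 1:7
y <- c(-1, -1, -1, -1, 1, -1, 5)

# Each photo maps to its match, or to itself when there is no match
mapdown <- ifelse(y == -1, x, y)
names(mapdown) <- x

# Hypothetical helper: follow the mapping until every entry is stable
resolve_ids <- function(mapdown) {
  cur <- as.numeric(names(mapdown))
  repeat {
    nxt <- unname(mapdown[as.character(cur)])
    if (all(nxt == cur)) break
    cur <- nxt
  }
  cur
}

resolve_ids(mapdown)  # 1 2 3 4 1 6 1
```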
Getting the group membership is harder. Use reshape2 to convert the data into wide format (one column per group), where a column is "1" if the individual appears in that group and "0" if not.
library("reshape2")
wide <- dcast(DF, id~group, value.var="id",
fun.aggregate=function(x){if(length(x)>0){"1"}else{"0"}})
Finally, paste these "0"/"1" memberships together to get the grouping variable you described.
wide$grouping = apply(wide[,-1], 1, paste, collapse="")
The result:
> wide
id 1 2 3 grouping
1 1 1 1 1 111
2 2 1 0 0 100
3 3 1 0 1 101
4 4 0 1 1 011
5 6 0 1 0 010
6 8 0 0 1 001
No "bonus" yet.
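If adding a package dependency is not desired, the same membership strings can probably be built in base R from a contingency table. This is a sketch of an alternative, not the approach used above; it assumes DF already has the id column computed earlier, which is written out by hand here so the example is self-contained.

```r
# id values as produced by the mapping step for the example data
DF <- data.frame(x     = 1:10,
                 id    = c(1, 2, 3, 4, 1, 6, 1, 8, 3, 4),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# TRUE where an individual has at least one photo in a group
member <- table(DF$id, DF$group) > 0

# Collapse each row of TRUE/FALSE into a "101"-style string
grouping <- apply(member, 1, function(r) paste(as.integer(r), collapse = ""))
grouping
```

The names of grouping are the ids, so grouping["3"] looks up the history of individual 3.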
EDIT:
To get the bonus information, it helps to redo the mapping to keep everything. If you have a lot of cases, this could be slow.
Replace the oldx/newx part with:
iterx <- matrix(DF$x, ncol=1)
iterx <- cbind(iterx, mapdown[as.character(iterx[,1])])
while(any(iterx[,ncol(iterx)]!=iterx[,ncol(iterx)-1])) {
iterx <- cbind(iterx, mapdown[as.character(iterx[,ncol(iterx)])])
}
DF$id <- iterx[,ncol(iterx)]
To generate the bonus data, then you can use
bonus <- tapply(iterx[,1], iterx[,ncol(iterx)], paste, collapse=",")
wide$bonus <- bonus[as.character(wide$id)]
Which gives:
> wide
id 1 2 3 grouping bonus
1 1 1 1 1 111 1,5,7
2 2 1 0 0 100 2
3 3 1 0 1 101 3,9
4 4 0 1 1 011 4,10
5 6 0 1 0 010 6
6 8 0 0 1 001 8
Note this isn't same as your example output, but I don't think your example output is right (how can you have a grouping_history of "000"?)
EDIT:
Now it agrees.
Another solution for the bonus variable:
f_bonus <- function(data=DF){
# photos without a match define the individuals
data_a <- subset(data,y== -1,select=x)
data_a$pos <- seq(nrow(data_a))
# use the function argument here, not a global df
data_b <- subset(data,y!= -1,select=c(x,y))
data_b$pos <- match(data_b$y, data_a$x)
data_t <- rbind(data_a,data_b[-2])
data_t <- with(data_t,tapply(x,pos,paste,sep="",collapse=","))
return(data_t)
}
