How can I replace numeric values by characters in R?

I have a file like this.
"1" 10 2 0 0 0 0 0 0 0 0 0 0 0 4 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
"2" 10 3 6 17 11 15 8 17 14 1 42 21 22 15 9 9 17 12 9 16 4 8 12 29 23 11 0 0 0 0
"3" 10 4 39 39 14 33 16 23 37 21 29 22 46 26 16 26 21 22 21 10 16 3 10 14 20 12 6 0 0 0
"4" 100 18 0 0 0 1 0 0 0 0 0 0 2 0 0 1 0 2 8 5 2 1 2 4 9 6 4 3 0 0
.....................
What I want to do is replace the values from column 4 onwards with characters: if a value is between 0 and 10 it is replaced by the character 'a', if it is between 10 and 20 it is replaced by 'b', and so on.
For example, the output file will be of the form,
"1" 10 2 0 0 0 0 0 0 0 0 0 0 0 a 0 0 a 0 0 0 0 0 a 0 0 0 0 0 0 0
.............................
How can I do it in R? Is there some way I can automate the assignment of characters? Currently I am using two for loops and hardcoding the values for each range.
Edit: My approach:
> for ( i in 1:nrow(x) )
+ for ( j in j:ncol(x) )
+ {
+ if (x[i,j] < 10 && x[i,j] > 0 )
+ x[i,j] = a
+ else if ( x[i,j] < 20 && x[i,j] > 10 )
+ x[i,j] = b
+ }
The above is my approach. It shows an error in the conditions, and I know it will take a lot of time since it uses two for loops.

One possible solution is to create a dummy data set to match against, and then match all non zero values to it (assuming df is your data set)
matchData <- data.frame(lets = c(0, rep(letters, each = 10)),
nums = c(0, seq_len(length(letters)*10)))
df[, -seq_len(3)] <- sapply(df[, -seq_len(3)], function(x) matchData$lets[match(x, matchData$nums)])
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 1 10 2 0 0 0 0 0 0 0 0 0 0 0 a 0 0 a 0 0 0 0 0 a 0
# 2 2 10 3 a b b b a b b a e c c b a a b b a b a a b c
# 3 3 10 4 d d b d b c d c c c e c b c c c c a b a a b
# 4 4 100 18 0 0 0 a 0 0 0 0 0 0 a 0 0 a 0 a a a a a a a
# V26 V27 V28 V29 V30 V31
# 1 0 0 0 0 0 0
# 2 c b 0 0 0 0
# 3 b b a 0 0 0
# 4 a a a a 0 0

I think the following will be close, just a quick answer that hopefully helps you along. You'd have to apply over this method to do it for the entire dataframe. Also there's coercion that I didn't handle here, so when testing on a single row everything got coerced into a char.
The basic thought is that if you want 1-10 to correspond to "a" and 11-20 to correspond to "b", then we can get that by dividing the number by 10 and calling ceiling. 1-10 then maps to 1, 11-20 maps to 2, and so forth. letters[1] is "a", letters[2] is "b", etc., so we get the desired functionality.
#everything coerced to char, I know
testVect<-c("2", 10, 3, 6, 17, 11, 15, 8 ,17, 14, 1, 42, 21, 22, 15, 9, 9, 17, 12, 9, 16, 4, 8, 12 ,29, 23, 11, 0, 0 ,0 ,0)
testAfter4<-sapply(testVect[4:length(testVect)],
function(entry) {
ifelse(entry==0, 0, letters[ceiling(as.numeric(entry)/10)])
} )
#need to cast entry back to numeric as it was coerced to char when initializing testVect
testVect[4:length(testVect)]<-testAfter4
testVect
#[1] "2" "10" "3" "a" "b" "b" "b" "a" "b" "b" "a" "e" "c" "c" "b"
#[16] "a" "a" "b" "b" "a" "b" "a" "a" "b" "c" "c" "b" "0" "0" "0"
#[31] "0"
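To run the same idea over every column from the fourth onwards, one option is to index into a lookup vector instead of calling ifelse() element by element. This is only a sketch: the data frame df, its column names, and the helper to_letters are made up for illustration, and values above 260 would come back as NA because letters has only 26 entries.

```r
# Hypothetical data: three leading columns to keep, then the values to recode
df <- data.frame(id = c("1", "2"), V2 = c(10, 10), V3 = c(2, 3),
                 V4 = c(0, 6), V5 = c(4, 17), V6 = c(1, 42),
                 stringsAsFactors = FALSE)

# 0 stays "0", 1-10 -> "a", 11-20 -> "b", ...
# Prepending "0" to letters lets index 1 stand for a zero value
to_letters <- function(x) c("0", letters)[ceiling(x / 10) + 1]

recoded <- df
recoded[-(1:3)] <- lapply(df[-(1:3)], to_letters)
recoded
```

Because the lookup is vectorised, no explicit loop over rows or columns is needed, and the first three columns keep their original types.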

You can index into letters using an offset based on your value divided by 10 (without remainder):
mydat = c(10,2,0,19,20,19,0,0)
# Convert a number divided by 10 to its offset (hat tip to MrFlick for `letters`).
# This uses the cryptic-looking %/% operator for division without remainder.
char10 = letters[1+(mydat %/% 10)]
# Convert zeroes back to 0; if desired, restore the leading columns from the original data
char10[mydat==0] = 0
Output:
> char10
[1] "b" "a" "0" "b" "c" "b" "0" "0"
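The ranges can also be written down explicitly with base R's cut(), which none of the answers above use. The sketch below assumes right-closed bins (1-10 is "a", 11-20 is "b", and so on), which is only one reading of the question; the vector vals is made-up data. Note that the %/% version above puts 10 in "b", while this one puts it in "a".

```r
vals <- c(10, 2, 0, 19, 20, 21, 0, 42)

# Breaks at -Inf, 0, 10, 20, ..., 260
# (-Inf, 0] is labelled "0", (0, 10] is "a", (10, 20] is "b", ...
breaks <- c(-Inf, seq(0, 260, by = 10))
labels <- c("0", letters)

binned <- as.character(cut(vals, breaks = breaks, labels = labels))
binned
```

Changing the bin edges then only means editing breaks, rather than rewriting a chain of if/else conditions.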

Related

How to count different answers in R

I am analyzing a questionnaire and I've written the code below to count how many answers there are to each question. The questions are in columns and the answer is coded as a number, where 1=a, 2=b.
The main objective is to count how many times an answer was chosen, ignoring pattern to summarize the information.
DS is the data frame, containing questions Q_092 to Q_096. I have the code to change column names, but it expects a fixed number of columns.
Is there a prettier way to do it?
conta_respostas <- function (arr_resp) {
arr_resp[(is.na(arr_resp))]<-99
arr_result = c(
sum(arr_resp[(arr_resp=="1")])/1,
sum(arr_resp[(arr_resp=="2")])/2,
sum(arr_resp[(arr_resp=="3")])/3,
sum(arr_resp[(arr_resp=="4")])/4,
sum(arr_resp[(arr_resp=="5")])/5,
sum(arr_resp[(arr_resp=="6")])/6,
sum(arr_resp[(arr_resp=="7")])/7,
sum(arr_resp[(arr_resp=="8")])/8,
sum(arr_resp[(arr_resp=="9")])/9,
sum(arr_resp[(arr_resp=="10")])/10,
sum(arr_resp[(arr_resp=="99")])/99
)
}
adply(DS, 2, conta_respostas)
X1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Q_092 431 1987 5053 1388 0 0 0 0 0 0 36
2 Q_093 281 1489 5728 1336 0 0 0 0 0 0 61
3 Q_094 594 3380 4365 519 0 0 0 0 0 0 37
4 Q_095 89 216 5042 3511 0 0 0 0 0 0 37
5 Q_096 213 1764 5384 1511 0 0 0 0 0 0 23
Here is what it sounds like your data looks like:
DS <- data.frame(
'Q_092' = c(1, 3, 4, 5, 2, 99, 10),
'Q_093' = c(2, 5, 6, 2, 99, 1, 1),
'Q_094' = c(3, 5, 6, 2, 4, 7, 8),
'Q_095' = c(10, 5, 5, 6, 7, 8, 6),
'Q_096' = c(1, 3, 4, 5, 2, 99, 10)
)
DS
Q_092 Q_093 Q_094 Q_095 Q_096
1 1 2 3 10 1
2 3 5 5 5 3
3 4 6 6 5 4
4 5 2 2 6 5
5 2 99 4 7 2
6 99 1 7 8 99
7 10 1 8 6 10
Recreating your code:
library(plyr)
conta_respostas <- function (arr_resp) {
arr_resp[(is.na(arr_resp))]<-99
arr_result = c(
sum(arr_resp[(arr_resp=="1")])/1,
sum(arr_resp[(arr_resp=="2")])/2,
sum(arr_resp[(arr_resp=="3")])/3,
sum(arr_resp[(arr_resp=="4")])/4,
sum(arr_resp[(arr_resp=="5")])/5,
sum(arr_resp[(arr_resp=="6")])/6,
sum(arr_resp[(arr_resp=="7")])/7,
sum(arr_resp[(arr_resp=="8")])/8,
sum(arr_resp[(arr_resp=="9")])/9,
sum(arr_resp[(arr_resp=="10")])/10,
sum(arr_resp[(arr_resp=="99")])/99
)
}
adply(DS, 2, conta_respostas)
X1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Q_092 1 1 1 1 1 0 0 0 0 1 1
2 Q_093 2 2 0 0 1 1 0 0 0 0 1
3 Q_094 0 1 1 1 1 1 1 1 0 0 0
4 Q_095 0 0 0 0 2 2 1 1 0 1 0
5 Q_096 1 1 1 1 1 0 0 0 0 1 1
Without having to write that function, you can do something like this:
t(apply(DS, 2, function(x) table(factor(x, levels = c('1', '2', '3', '4', '5',
'6', '7', '8', '9', '10',
'99')))))
This will do the following:
Transform your data into factors with the levels given in levels =. Storing the data as a factor keeps the levels that no respondent chose in the table (with a count of 0) instead of dropping them.
This creates a table for each variable with a cell for each of the factor levels.
This function is applied over the five variable columns that you are interested in.
Finally, the output from the apply() function is transposed to match the output from your original output:
1 2 3 4 5 6 7 8 9 10 99
Q_092 1 1 1 1 1 0 0 0 0 1 1
Q_093 2 2 0 0 1 1 0 0 0 0 1
Q_094 0 1 1 1 1 1 1 1 0 0 0
Q_095 0 0 0 0 2 2 1 1 0 1 0
Q_096 1 1 1 1 1 0 0 0 0 1 1
One option would be to use the apply() function with FUN=table. The only issue here is that your tables may be of different lengths, thus the final result may not be combined row-by-row.
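If the set of answer codes is not known in advance, the levels can be taken from the data so that every column is tabulated against the same set, which also sidesteps the unequal-length problem. A minimal sketch, with a made-up DS and with NA counted in its own column rather than recoded to 99:

```r
# Hypothetical questionnaire data with some missing answers
DS <- data.frame(Q_092 = c(1, 3, 4, NA, 2),
                 Q_093 = c(2, 2, 1, 1, NA))

# All answer codes that occur anywhere in the data (sort() drops NA)
levs <- sort(unique(unlist(DS)))

# One row per question, one column per answer code, plus a final NA column
counts <- t(sapply(DS, function(x)
  table(factor(x, levels = levs), useNA = "always")))
counts
```

Since sapply() iterates over the columns of DS, this does not depend on how many questions there are.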

Transforming Binary data

I have a dataframe that consists only of 0s and 1s. So for each individual, instead of having one column with a factor value (e.g. low price, 4 rooms), I have
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0
4 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0
How can I transform the dataset in R, so that I create new columns (#number of rooms) and give the position of the 1 (in the 4th column) a vhigh value?
I have multiple explanatory variables I need to do this for. The 21 columns represent 6 variables for 1000+ observations. The result should be something like this:
PurchaseP. NumberofRooms ...
1. vhigh. 4
2. low. 4
3. vhigh. 1
4. vhigh. 2
I only did it for the first 2 explanatory variables here, but essentially it repeats like this, with each explanatory variable having 3-4 possible factor values.
V1:V4 = purchase price, V5:V8 = number of rooms,V9:V11 = floors, and so on
In my head something like this could work
Create an if statement to give each 1 a value depending on column position, e.g. if the value in V4 is 1 then name it "vhigh", and do this for each Vx.
Then combine the columns V1:V4, V5:V8, V9:V11 (depending on whether the variable has 3 or 4 possible factor/integer values) while ignoring 0 values.
Would this work, or is there a simpler approach? How would one code this in R?
Here is an approach that should work for you. I wrote a function, which will take as arguments your data.frame, the columns representing one of your variables of interest (e.g. purchase price is stored in columns 1 to 4), and the names of the levels you would like as a result. The function will then return the result you requested. You'll need to write this out for the 6 variables you are interested in.
I'll simulate some data and illustrate the approach.
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0,0,0,0,1), 2),
nrow = 4, byrow = TRUE))
df
#> X1 X2 X3 X4 X5 X6 X7 X8
#> 1 0 0 0 1 1 0 0 0
#> 2 1 0 0 0 0 0 0 1
#> 3 0 0 0 1 1 0 0 0
#> 4 1 0 0 0 0 0 0 1
We'll say that the first four columns are the purchase price in v.low to v.high, and the second four are the number of rooms (1:4). We'll write a function that takes this information as arguments and returns the result:
rangeToCol <- function(df, # Your data.frame
range, # the columns that encode the category of interest
lev.names # The names of the category levels
) {
tdf <- df[range]
lev.names[unlist(apply(tdf, 1, function(rw){which(rw==1)}))]
}
new.df <- data.frame(PurchaseP = rangeToCol(df, 1:4,
c('vlow','low','high','vhigh')),
NumberofRooms = rangeToCol(df, 5:8, c(1:4)))
new.df
#> PurchaseP NumberofRooms
#> 1 vhigh 1
#> 2 vlow 4
#> 3 vhigh 1
#> 4 vlow 4
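The same lookup can be written without apply() by using base R's max.col(), which returns the column index of the largest value in each row. This is a sketch under the assumption that every row has exactly one 1 within each group of columns (a row of all zeros would silently map to the first label); the helper name onehot_to_factor is made up for illustration.

```r
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0, 0,0,0,1), 2),
                        nrow = 4, byrow = TRUE))

# Hypothetical helper: position of the 1 in each row, mapped to a label
onehot_to_factor <- function(df, cols, labels) {
  idx <- max.col(as.matrix(df[cols]), ties.method = "first")
  factor(labels[idx], levels = labels)
}

price <- onehot_to_factor(df, 1:4, c("vlow", "low", "high", "vhigh"))
rooms <- onehot_to_factor(df, 5:8, 1:4)
data.frame(PurchaseP = price, NumberofRooms = rooms)
```

Returning a factor with all labels as levels keeps categories that happen not to occur in the data.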

Transforming dataframe into expanded matrix in r

Say I have the following dataframe:
dfx <- data.frame(Var1=c("A", "B", "C", "D", "B", "C", "D", "C", "D", "D"),
Var2=c("E", "E", "E", "E", "A", "A", "A", "B", "B", "C"),
Var1out = c(1,-1,-1,-1,1,-1,-1,1,-1,-1),
Var2out= c(-1,1,1,1,-1,1,1,-1,1,1))
dfx
Var1 Var2 Var1out Var2out
1 A E 1 -1
2 B E -1 1
3 C E -1 1
4 D E -1 1
5 B A 1 -1
6 C A -1 1
7 D A -1 1
8 C B 1 -1
9 D B -1 1
10 D C -1 1
What you see here are 10 rows that correspond to match-ups between players A, B, C, D and E. They play each other once and the winner of each match-up is denoted by a +1 and the loser of each match-up is denoted by a -1 (put into the respective column Player Var1 result in Var1out, Player Var2 result in Var2out).
Desired output.
I wish to transform this dataframe to this output matrix (the order of rows are not important to me, but as you can see each row refers to a unique match-up):
A B C D E
1 1 0 0 0 -1
2 0 -1 0 0 1
3 0 0 -1 0 1
4 0 0 0 -1 1
5 -1 1 0 0 0
6 1 0 -1 0 0
7 1 0 0 -1 0
8 0 -1 1 0 0
9 0 1 0 -1 0
10 0 0 1 -1 0
What I've done:
I managed to make this matrix in a roundabout way. As roundabout ways tend to be slow and less satisfactory, I was wondering if anyone can spot a better way.
I first made sure that my two columns containing players had factor levels that contained every possible player that ever occurs (you'll note for instance that player E never occurs in Var1).
# Making sure Var1 and Var2 have same factor levels
levs <- unique(c(as.character(dfx$Var1), as.character(dfx$Var2))) #get all possible levels; also works when the columns are character rather than factor
dfx$Var1 <- factor(dfx$Var1, levels=levs)
dfx$Var2 <- factor(dfx$Var2, levels=levs)
I next split the dataframe into two - one for Var1 and Var1out, and one for Var2 and Var2out:
library(dplyr)
temp.Var1 <- dfx %>% select(Var1, Var1out)
temp.Var2 <- dfx %>% select(Var2, Var2out)
Here I use model.matrix to expand columns by factor level:
mat.Var1<-with(temp.Var1, data.frame(model.matrix(~Var1+0)))
mat.Var2<-with(temp.Var2, data.frame(model.matrix(~Var2+0)))
I then replace for each row the column with a '1' indicating the presence of that factor, with the correct result and add these matrices:
mat1 <- apply(mat.Var1, 2, function(x) ifelse(x==1, x<-temp.Var1$Var1out, x<-0) )
mat2 <- apply(mat.Var2, 2, function(x) ifelse(x==1, x<-temp.Var2$Var2out, x<-0) )
matX <- mat1+mat2
matX
Var1A Var1B Var1C Var1D Var1E
1 1 0 0 0 -1
2 0 -1 0 0 1
3 0 0 -1 0 1
4 0 0 0 -1 1
5 -1 1 0 0 0
6 1 0 -1 0 0
7 1 0 0 -1 0
8 0 -1 1 0 0
9 0 1 0 -1 0
10 0 0 1 -1 0
Although this works, I have a sense that I am probably missing simpler solutions for this problem. Thanks.
Create an empty matrix and use matrix indexing to fill the relevant values in:
cols <- unique(unlist(dfx[1:2]))
M <- matrix(0, nrow = nrow(dfx), ncol = length(cols), dimnames = list(NULL, cols))
M[cbind(sequence(nrow(dfx)), match(dfx$Var1, cols))] <- dfx$Var1out
M[cbind(sequence(nrow(dfx)), match(dfx$Var2, cols))] <- dfx$Var2out
M
# A B C D E
# [1,] 1 0 0 0 -1
# [2,] 0 -1 0 0 1
# [3,] 0 0 -1 0 1
# [4,] 0 0 0 -1 1
# [5,] -1 1 0 0 0
# [6,] 1 0 -1 0 0
# [7,] 1 0 0 -1 0
# [8,] 0 -1 1 0 0
# [9,] 0 1 0 -1 0
# [10,] 0 0 1 -1 0
Another way is to use acast
library(reshape2)
#added `use.names=FALSE` from @Ananda Mahto's comments
dfy <- data.frame(Var=unlist(dfx[,1:2], use.names=FALSE),
VarOut=unlist(dfx[,3:4], use.names=FALSE), indx=1:nrow(dfx))
acast(dfy, indx~Var, value.var="VarOut", fill=0)
# A B C D E
#1 1 0 0 0 -1
#2 0 -1 0 0 1
#3 0 0 -1 0 1
#4 0 0 0 -1 1
#5 -1 1 0 0 0
#6 1 0 -1 0 0
#7 1 0 0 -1 0
#8 0 -1 1 0 0
#9 0 1 0 -1 0
#10 0 0 1 -1 0
Or use spread
library(tidyr)
spread(dfy,Var, VarOut , fill=0)[,-1]
# A B C D E
#1 1 0 0 0 -1
#2 0 -1 0 0 1
#3 0 0 -1 0 1
#4 0 0 0 -1 1
#5 -1 1 0 0 0
#6 1 0 -1 0 0
#7 1 0 0 -1 0
#8 0 -1 1 0 0
#9 0 1 0 -1 0
#10 0 0 1 -1 0
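For a version with no extra packages, the same long-format idea can be sketched with base R's xtabs(), which sums a value over each combination of factors and fills absent combinations with 0. The intermediate data frame long and its column names are made up for illustration, and stringsAsFactors = FALSE is set explicitly so the player columns are plain character vectors.

```r
dfx <- data.frame(Var1 = c("A", "B", "C", "D", "B", "C", "D", "C", "D", "D"),
                  Var2 = c("E", "E", "E", "E", "A", "A", "A", "B", "B", "C"),
                  Var1out = c(1, -1, -1, -1, 1, -1, -1, 1, -1, -1),
                  Var2out = c(-1, 1, 1, 1, -1, 1, 1, -1, 1, 1),
                  stringsAsFactors = FALSE)

# Stack both players of each match-up into one long table
long <- data.frame(match  = rep(seq_len(nrow(dfx)), 2),
                   player = c(dfx$Var1, dfx$Var2),
                   result = c(dfx$Var1out, dfx$Var2out))

# Sum of results per (match, player); pairs that never occur become 0
M <- xtabs(result ~ match + player, data = long)
M
```

Each row should contain exactly one +1 and one -1, so the row sums give a quick sanity check.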

Binning and Naming New Columns with Mean of Binned Columns

This probably has been asked already, but I could not find it. I have a data set, where column names are numbers, and row names are sample names (see below).
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0
I want to bin the columns, say every 4 columns, by summation, and then name the new columns with the mean of the binned columns. For the above table I would end up with:
"599.785" "599.816"
"A" 0 7
"B" 0 4
"C" 0 4
"D" 3 4
The new column names, 599.785 and 599.816, are average of the column names that were binned. I think something like cut would work for a vector of numbers, but I am not sure how to implement it for large data frames. Thanks for any help!
colnames <- c("599.773", "599.781", "599.789", "599.797",
"599.804", "599.812" ,"599.82" ,"599.828" )
mat <- matrix(scan(), nrow=4, byrow=TRUE)
0 0 0 0 0 2 1 4
0 0 0 0 0 1 0 3
0 0 0 0 2 1 0 1
3 0 0 0 3 1 0 0
colnames(mat)=colnames
rownames(mat) = LETTERS[1:4]
sRows <- function(mat, cols) rowSums(mat[, cols])
# each bin starts at column 1, 5, ... and covers the 4 columns base:(base+3)
starts <- seq(1, ncol(mat), by = 4)
sapply(starts, function(base) sRows(mat, base:(base+3)) )
[,1] [,2]
A 0 7
B 0 4
C 0 4
D 3 4
accum <- sapply(starts, function(base)
sRows(mat, base:(base+3)) )
colnames(accum) <- sapply(starts,
function(base)
mean(as.numeric(colnames(mat)[ base:(base+3)] )) )
accum
#-------
599.785 599.816
A 0 7
B 0 4
C 0 4
D 3 4
First of all, using numeric values as column names is not good practice.
Even so, here is a solution that gives the output the OP asked for.
## read data without checking names
dt <- read.table(text='
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0',header=TRUE, check.names =FALSE)
cols <- as.numeric(colnames(dt))
## create a grouping vector: first half of the columns vs. second half
ff <- rep(c(TRUE,FALSE),each=length(cols)/2)
## using tapply to group operations by ff
vals <- do.call(cbind,tapply(cols,ff,
function(x)
rowSums(dt[,paste0(x)])))
nn <- tapply(cols,ff,mean)
## names columns with means
colnames(vals) <- nn[colnames(vals)]
vals
599.816 599.785
A 7 0
B 4 0
C 4 0
D 4 3
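For an arbitrary bin width, the grouping can be generated from the column positions and the sums taken with base R's rowsum(). A minimal sketch, assuming the column names can be converted to numbers; the helper name bin_columns is made up, and if the number of columns is not a multiple of k the last bin simply contains fewer columns.

```r
mat <- matrix(c(0, 0, 0, 0, 0, 2, 1, 4,
                0, 0, 0, 0, 0, 1, 0, 3,
                0, 0, 0, 0, 2, 1, 0, 1,
                3, 0, 0, 0, 3, 1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4],
                              c("599.773", "599.781", "599.789", "599.797",
                                "599.804", "599.812", "599.82", "599.828")))

# Hypothetical helper: sum every k adjacent columns, name bins by the mean
bin_columns <- function(mat, k) {
  grp <- ceiling(seq_len(ncol(mat)) / k)   # 1,1,1,1,2,2,2,2 for k = 4
  sums <- t(rowsum(t(mat), grp))           # rowsum() adds up rows per group
  colnames(sums) <- as.character(tapply(as.numeric(colnames(mat)), grp, mean))
  sums
}

binned <- bin_columns(mat, 4)
binned
```

Unlike the TRUE/FALSE grouping above, the bins come out in their original left-to-right order.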

Splitting one column into multiple columns

I have a huge dataset in which there is one column including several values for each subject (row). Here is a simplified sample dataframe:
data <- data.frame(subject = c(1:8), sex = c(1, 2, 2, 1, 2, 1, 1, 2),
age = c(35, 29, 31, 46, 64, 57, 49, 58),
v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C"))
> data
subject sex age v1
1 1 1 35 2
2 2 2 29 0
3 3 2 31 3,5 # separated by a comma
4 4 1 46 2 1 # separated by a blank space
5 5 2 64 A,4
6 6 1 57 B,1,C
7 7 1 49 A and B,3
8 8 2 58 5, 6 A or C
I first want to remove the letters (A, B, A and B, …) in the fourth column (v1), and then split the fourth column into multiple columns just like this:
subject sex age x1 x2 x3 x4 x5 x6
1 1 1 35 0 1 0 0 0 0
2 2 2 29 0 0 0 0 0 0
3 3 2 31 0 0 1 0 1 0
4 4 1 46 1 1 0 0 0 0
5 5 2 64 0 0 0 1 0 0
6 6 1 57 1 0 0 0 0 0
7 7 1 49 0 0 1 0 0 0
8 8 2 58 0 0 0 0 1 1
where the 1st subject takes 1 at x2 because it takes 2 at v1 in the original dataset, the 3rd subject takes 1 at both x3 and x5 because it takes 3 and 5 at v1 in the original dataset, and so on.
I would appreciate any help on this question. Thanks a lot.
You can cbind this result to data[-4] and get what you need:
0+t(sapply(as.character(data$v1), function(line)
sapply(1:6, function(x) x %in% unlist(strsplit(line, split="\\s|\\,"))) ))
#----------------
[,1] [,2] [,3] [,4] [,5] [,6]
2 0 1 0 0 0 0
0 0 0 0 0 0 0
3,5 0 0 1 0 1 0
2 1 1 1 0 0 0 0
A,4 0 0 0 1 0 0
B,1,C 1 0 0 0 0 0
A and B,3 0 0 1 0 0 0
5, 6 A or C 0 0 0 0 1 1
One solution:
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), as.numeric)
m <- as.data.frame(t(sapply(r, function(x) {
y <- rep(0, 6)
y[x[!is.na(x)]] <- 1
y
})))
data <- cbind(data[, c("subject", "sex", "age")], m)
# subject sex age V1 V2 V3 V4 V5 V6
# 1 1 1 35 0 1 0 0 0 0
# 2 2 2 29 0 0 0 0 0 0
# 3 3 2 31 0 0 1 0 1 0
# 4 4 1 46 1 1 0 0 0 0
# 5 5 2 64 0 0 0 1 0 0
# 6 6 1 57 1 0 0 0 0 0
# 7 7 1 49 0 0 1 0 0 0
# 8 8 2 58 0 0 0 0 1 1
Following DWin's awesome solution, m could be modified as:
m <- as.data.frame(t(sapply(r, function(x) {
0 + 1:6 %in% x[!is.na(x)]
})))
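Instead of splitting on everything that is not a digit, the digits can be extracted directly with regmatches() and gregexpr(), which avoids the empty strings and NA values that splitting produces. This is a sketch that assumes the codes of interest are the integers 1 to 6, as in the sample data; the names nums and ind are made up for illustration.

```r
v1 <- c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C")

# Pull out every run of digits from each entry, ignoring letters and separators
nums <- lapply(regmatches(v1, gregexpr("[0-9]+", v1)), as.integer)

# One row per subject, one 0/1 indicator column per code 1..6
ind <- t(vapply(nums, function(x) as.integer(1:6 %in% x), integer(6)))
colnames(ind) <- paste0("x", 1:6)
ind
```

The result can be combined with the other columns using cbind(data[-4], ind), as in the first answer.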
