Problem statement:
I want to exclude from further analysis any columns that hold the same value in every cell, so I first need to find those columns.
I wrote the following code, which seems to work for the data frame test but not for the real data frame stpo.
library("dplyr")
library("purrr")
test_unique <- function(x)
{
return(length(unique(x)))
}
test <-data.frame(c1 = c("a", "a"), c2 = c(NA, NA), c3 = c(1,2), c4=c(NA, 4))
# I want to find the columns that have the same value throughout
res <- map(test[,c(names(test))], test_unique)
res
# But when I try to apply the same thing to the dataset below, it does not work.
# Not sure what the reason is. Is there a better way to do this? Perhaps using data.table? What am I doing wrong?
res2 <- map(stpo[,c(names(stpo))], test_unique)
res2
I am not exactly sure how to put the result of dput. I am putting this below (this is the dataframe stpo)
structure(list(stlnr = c(1L, 2L, 3L, 3L, 3L, 3L, 4L), stlkn = c(1L,
1L, 1L, 2L, 3L, 4L, 5L), stpoz = c(2L, 2L, 2L, 4L, 6L, 8L, 10L
), aennr = c(NA, NA, NA, NA, NA, NA, NA), vgknt = c(0L, 0L, 0L,
0L, 0L, 0L, 0L), idnrk = c("test_1", "test_1", "test_2", "test_3",
"test_3", "test_1", "test_2"), pswrk = c(NA, NA, NA, NA, NA,
NA, NA), meins = c("EA", "EA", "EA", "EA", "EA", "EA", "EA"),
menge = c(1, 14, 4, 4, 2, 2, 1), fmeng = c(NA, NA, NA, NA,
NA, NA, NA), ausch = c(0, 0, 0, 0, 0, 0, 0), avoau = c(0,
0, 0, 0, 0, 0, 0), netau = c(NA, NA, NA, NA, NA, NA, NA),
erskz = c(NA, NA, NA, NA, NA, NA, NA), rekri = c(NA, NA,
NA, NA, NA, NA, NA), rekrs = c(NA, NA, NA, NA, NA, NA, NA
), nlfzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), verti = c(NA, NA,
NA, NA, NA, NA, NA), alpos = c(NA, NA, NA, NA, NA, NA, NA
), ewahr = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), ekgrp = c(NA, NA,
NA, NA, NA, NA, NA), lifzt = c(0L, 0L, 0L, 0L, 0L, 0L, 0L
), lifnr = c(NA, NA, NA, NA, NA, NA, NA), roms1 = c(0, 0,
0, 0, 0, 0, 0), roms2 = c(0, 0, 0, 0, 0, 0, 0), roms3 = c(0,
0, 0, 0, 0, 0, 0), romen = c(0, 0, 0, 0, 0, 0, 0), rform = c(NA,
NA, NA, NA, NA, NA, NA), upskz = c(NA, NA, NA, NA, NA, NA,
NA), valkz = c(NA, NA, NA, NA, NA, NA, NA), matkl = c(NA,
NA, NA, NA, NA, NA, NA), webaz = c(0L, 0L, 0L, 0L, 0L, 0L,
0L), clobk = c(NA, NA, NA, NA, NA, NA, NA), lgort = c(NA,
NA, NA, NA, NA, NA, 14L), kzkup = c(NA, NA, NA, NA, NA, NA,
NA), dvnam = c(NA, NA, NA, NA, NA, NA, NA), dspst = c(NA,
NA, NA, NA, NA, NA, NA), alpst = c(NA, NA, NA, NA, NA, NA,
NA), alprf = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), alpgr = c(NA,
NA, NA, NA, NA, NA, NA), kstty = c(NA, NA, NA, NA, NA, NA,
NA), kstnr = c(NA, NA, NA, NA, NA, NA, NA), nlfzv = c(0L,
0L, 0L, 0L, 0L, 0L, 0L), nlfmv = c(NA, NA, NA, NA, NA, NA,
NA), idhis = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), idvar = c(NA,
NA, NA, NA, NA, NA, NA), itsob = c(NA, NA, NA, NA, NA, NA,
NA), cufactor = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), funcid = c(NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000022534c51ef0>)
The issue is that we are subsetting a data.table rather than a data.frame. Here, we need with = FALSE, as mentioned in ?data.table:
j - When with=TRUE (default), j is evaluated within the frame of the data.table; i.e., it sees column names as if they are variables.
stpo[,c(names(stpo))]
[1] "stlnr" "stlkn" "stpoz" "aennr" "vgknt" "idnrk" "pswrk" "meins" "menge" "fmeng" "ausch" "avoau" "netau" "erskz"
[15] "rekri" "rekrs" "nlfzt" "verti" "alpos" "ewahr" "ekgrp" "lifzt" "lifnr" "roms1" "roms2" "roms3" "romen" "rform"
[29] "upskz" "valkz" "matkl" "webaz" "clobk" "lgort" "kzkup" "dvnam" "dspst" "alpst" "alprf" "alpgr" "kstty" "kstnr"
[43] "nlfzv" "nlfmv" "idhis" "idvar" "itsob" "cufactor" "funcid"
Now, check the output of
stpo[,c(names(stpo)), with = FALSE]
stlnr stlkn stpoz aennr vgknt idnrk pswrk meins menge fmeng ausch avoau netau erskz rekri rekrs nlfzt verti alpos ewahr ekgrp lifzt lifnr roms1 roms2
1: 1 1 2 NA 0 test_1 NA EA 1 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
2: 2 1 2 NA 0 test_1 NA EA 14 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
3: 3 1 2 NA 0 test_2 NA EA 4 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
4: 3 2 4 NA 0 test_3 NA EA 4 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
5: 3 3 6 NA 0 test_3 NA EA 2 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
6: 3 4 8 NA 0 test_1 NA EA 2 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
7: 4 5 10 NA 0 test_2 NA EA 1 NA 0 0 NA NA NA NA 0 NA NA 0 NA 0 NA 0 0
roms3 romen rform upskz valkz matkl webaz clobk lgort kzkup dvnam dspst alpst alprf alpgr kstty kstnr nlfzv nlfmv idhis idvar itsob cufactor funcid
1: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
2: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
3: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
4: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
5: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
6: 0 0 NA NA NA NA 0 NA NA NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
7: 0 0 NA NA NA NA 0 NA 14 NA NA NA NA 0 NA NA NA 0 NA 0 NA NA 0 NA
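To see the difference on a small scale, here is a minimal sketch. The two-column table dt and the variable cols are made up for illustration and are not the OP's data. The `..` prefix at the end is another data.table way to select the columns named in a variable.

```r
library(data.table)

# Hypothetical toy table (not the OP's stpo)
dt <- data.table(c1 = c("a", "a"), c2 = c(1, 2))

# Default (with = TRUE): j is evaluated as an ordinary expression,
# so this returns the character vector of column names
res_default <- dt[, c(names(dt))]

# with = FALSE: j is treated as a vector of column names to select
res_cols <- dt[, c(names(dt)), with = FALSE]

# Same selection using the `..` prefix on a variable from the calling scope
cols <- names(dt)
res_dots <- dt[, ..cols]
```

Here res_default is a plain character vector, while res_cols and res_dots are data.tables containing both columns.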
Also, there is no need to do any subsetting if the whole columns are used, i.e. simply do
purrr::map(stpo, test_unique)
-output
$stlnr
[1] 4
$stlkn
[1] 5
$stpoz
[1] 5
...
...
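If only the names of the constant columns are needed, a typed variant of map may be more convenient than the list output. This is a sketch on the small test frame from the question; n_unique and constant_cols are names chosen here for illustration.

```r
library(purrr)

# The small example frame from the question
test <- data.frame(c1 = c("a", "a"), c2 = c(NA, NA), c3 = c(1, 2), c4 = c(NA, 4))

# One integer per column: the number of distinct values (NA counts as a value)
n_unique <- map_int(test, function(x) length(unique(x)))

# Names of the columns that hold a single value throughout
constant_cols <- names(n_unique)[n_unique == 1]
```

For this frame constant_cols contains c1 and c2. The same call should work on stpo, since a data.table is also a list of columns.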
Regarding the use of
stpo[,1:length(names(stpo))]
It works, but it seems to be a bug or a hackish workaround rather than the standard option (with = FALSE).
If we want to eliminate columns having a single value, we can use var (assuming all columns are numeric). Filter keeps only the columns for which var returns a nonzero, non-NA value
Filter(var, stpo)
stlnr stlkn stpoz menge
1: 1 1 2 1
2: 2 1 2 14
3: 3 1 2 4
4: 3 2 4 4
5: 3 3 6 2
6: 3 4 8 2
7: 4 5 10 1
Or change the function to return a logical value (this also handles non-numeric columns)
f1 <- function(x) length(unique(x)) > 1
Filter(f1, stpo)
-output
stlnr stlkn stpoz idnrk menge lgort
1: 1 1 2 test_1 1 NA
2: 2 1 2 test_1 14 NA
3: 3 1 2 test_2 4 NA
4: 3 2 4 test_3 4 NA
5: 3 3 6 test_3 2 NA
6: 3 4 8 test_1 2 NA
7: 4 5 10 test_2 1 14
Or use the data.table way of subsetting the columns
stpo[, .SD, .SDcols = f1]
stlnr stlkn stpoz idnrk menge lgort
1: 1 1 2 test_1 1 NA
2: 2 1 2 test_1 14 NA
3: 3 1 2 test_2 4 NA
4: 3 2 4 test_3 4 NA
5: 3 3 6 test_3 2 NA
6: 3 4 8 test_1 2 NA
7: 4 5 10 test_2 1 14
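Since the question loads dplyr, the same filtering can also be sketched with select() and where(), assuming dplyr 1.0.0 or later. The example uses the small test frame from the question and shows how the treatment of NA changes the result. kept and kept_no_na are illustrative names.

```r
library(dplyr)

# The small example frame from the question
test <- data.frame(c1 = c("a", "a"), c2 = c(NA, NA), c3 = c(1, 2), c4 = c(NA, 4))

# Keep columns with more than one distinct value; NA counts as a value here
kept <- select(test, where(function(x) n_distinct(x) > 1))

# Same idea, but NA is ignored when counting,
# so c4 (NA and 4) is treated as constant and dropped
kept_no_na <- select(test, where(function(x) n_distinct(x, na.rm = TRUE) > 1))
```

With the first rule, a column like lgort (all NA except one 14) is kept, as in the output above. With the second rule it would be dropped.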
Looks like I have taken a cue from what Arun wrote and modified the code like so:
res2 <- map(stpo[,1:length(names(stpo))], test_unique)
I have a table test whose NA values I would like to approximate based on linear interpolation between values that do exist.
For example, the second row plotted looks like this:
v1 <- unlist(test[2,])
plot(v1[!is.na(v1)], names(v1)[!is.na(v1)], type="l", add = TRUE)
How would one go about interpolating/approximating the NA values along the x-axis in this case? Any suggestions in base R or dplyr would be helpful.
test
variable 26500 30000 30100 30700 31600 33700 33800 33900 34000 34600 34800 35100 35200 35300
1 -20 NA 0 NA NA 10 20 NA NA NA 30 NA NA NA NA
2 -10 NA 0 NA NA NA 10 NA NA NA 20 NA NA NA 30
3 0 0 NA NA NA NA NA 10 NA NA NA 20 NA NA NA
4 24 NA NA NA 0 NA NA NA NA 10 NA NA NA 20 NA
5 40 NA NA 0 NA NA NA NA 10 NA NA NA 20 NA NA
6 55 NA NA 0 NA NA NA NA 10 NA NA NA 20 NA NA
35400 35600 35800 35900 36200 36300 36400 36700 36900 37000 37200 37800 37900 38000 38200
1 40 NA NA NA 50 NA NA NA NA NA 60 NA NA NA 70
2 NA NA NA 40 NA NA NA 50 NA NA NA 60 NA NA NA
3 NA 30 NA NA 40 NA NA NA 50 NA NA NA 60 NA NA
4 NA NA 30 NA NA 40 NA NA NA 50 NA NA NA 60 NA
5 NA NA 30 NA NA 40 NA NA NA 50 NA NA NA NA 60
6 NA NA NA 30 NA NA 40 NA NA 50 NA NA NA NA 60
38800 39000 39100 39200 39700 39800 39900 40000 40200 40600 40700 40800 41700 41800
1 NA NA NA 80 NA NA NA NA 90 NA NA NA 100 NA
2 70 NA NA NA 80 NA NA NA NA 90 NA NA 100 NA
3 70 NA NA NA NA 80 NA NA NA NA 90 NA 100 NA
4 NA 70 NA NA NA NA NA 80 NA NA NA 90 100 NA
5 NA NA 70 NA NA NA NA 80 NA NA NA 90 NA 100
6 NA 70 NA NA NA NA 80 NA NA NA NA 90 100 NA
Here is the sample data:
dput(test)
structure(list(variable = c(-20, -10, 0, 24, 40, 55), `26500` = c(NA,
NA, 0L, NA, NA, NA), `30000` = c(0L, 0L, NA, NA, NA, NA), `30100` = c(NA,
NA, NA, NA, 0L, 0L), `30700` = c(NA, NA, NA, 0L, NA, NA), `31600` = c(10L,
NA, NA, NA, NA, NA), `33700` = c(20L, 10L, NA, NA, NA, NA), `33800` = c(NA,
NA, 10L, NA, NA, NA), `33900` = c(NA, NA, NA, NA, 10L, 10L),
`34000` = c(NA, NA, NA, 10L, NA, NA), `34600` = c(30L, 20L,
NA, NA, NA, NA), `34800` = c(NA, NA, 20L, NA, NA, NA), `35100` = c(NA,
NA, NA, NA, 20L, 20L), `35200` = c(NA, NA, NA, 20L, NA, NA
), `35300` = c(NA, 30L, NA, NA, NA, NA), `35400` = c(40L,
NA, NA, NA, NA, NA), `35600` = c(NA, NA, 30L, NA, NA, NA),
`35800` = c(NA, NA, NA, 30L, 30L, NA), `35900` = c(NA, 40L,
NA, NA, NA, 30L), `36200` = c(50L, NA, 40L, NA, NA, NA),
`36300` = c(NA, NA, NA, 40L, 40L, NA), `36400` = c(NA, NA,
NA, NA, NA, 40L), `36700` = c(NA, 50L, NA, NA, NA, NA), `36900` = c(NA,
NA, 50L, NA, NA, NA), `37000` = c(NA, NA, NA, 50L, 50L, 50L
), `37200` = c(60L, NA, NA, NA, NA, NA), `37800` = c(NA,
60L, NA, NA, NA, NA), `37900` = c(NA, NA, 60L, NA, NA, NA
), `38000` = c(NA, NA, NA, 60L, NA, NA), `38200` = c(70L,
NA, NA, NA, 60L, 60L), `38800` = c(NA, 70L, 70L, NA, NA,
NA), `39000` = c(NA, NA, NA, 70L, NA, 70L), `39100` = c(NA,
NA, NA, NA, 70L, NA), `39200` = c(80L, NA, NA, NA, NA, NA
), `39700` = c(NA, 80L, NA, NA, NA, NA), `39800` = c(NA,
NA, 80L, NA, NA, NA), `39900` = c(NA, NA, NA, NA, NA, 80L
), `40000` = c(NA, NA, NA, 80L, 80L, NA), `40200` = c(90L,
NA, NA, NA, NA, NA), `40600` = c(NA, 90L, NA, NA, NA, NA),
`40700` = c(NA, NA, 90L, NA, NA, NA), `40800` = c(NA, NA,
NA, 90L, 90L, 90L), `41700` = c(100L, 100L, 100L, 100L, NA,
100L), `41800` = c(NA, NA, NA, NA, 100L, NA)), row.names = c(NA,
-6L), class = "data.frame")
We could use na.interp from forecast
library(forecast)
test[-1] <- t(apply(test[-1], 1, na.interp))
Or with na.approx from zoo
library(zoo)
test[-1] <- t(apply(test[-1], 1, na.approx, na.rm = FALSE))
then do the plotting
v1 <- unlist(test[2, -1])
plot(v1, names(v1), type = 'l')
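Since the question also asks for base R, here is a minimal sketch with approx(). One difference worth noting: as far as I can tell, na.approx applied to a plain row treats the values as equally spaced by position, whereas this version uses the numeric column names as the x positions. The small data frame df and the helper interp_row are made up for illustration.

```r
# Hypothetical two-row excerpt in the same layout as `test`
df <- data.frame(variable = c(-20, -10),
                 `30000` = c(0, 0),
                 `33700` = c(NA, 10),
                 `34600` = c(30, NA),
                 `35300` = c(40, 30),
                 check.names = FALSE)

# The column names double as x positions
x <- as.numeric(names(df)[-1])

# Linear interpolation of one row at the true x positions
interp_row <- function(y) {
  known <- !is.na(y)
  approx(x[known], y[known], xout = x)$y
}

df[-1] <- t(apply(df[-1], 1, interp_row))
```

With the default rule = 1, approx() leaves leading and trailing NA values untouched. Passing rule = 2 would carry the nearest known value outward instead.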
If you want to switch easily between different interpolation methods (or time series imputation methods in general) you can also use the imputeTS package.
For the requested solution this would be:
library("imputeTS")
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "linear"))
Switching to spline interpolation would look like this:
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "spline"))
Another option could be Stineman interpolation:
test[-1] <- t(apply(test[-1], 1, na_interpolation, option = "stine"))
Other imputation methods such as na_ma (moving average imputation) or na_kalman (Kalman smoothing on structural time series models) are also possible if you replace na_interpolation with the specific function (see the package README on GitHub for an overview of the imputation functions).
So, my challenge has been to convert a raw scale csv to a scored csv. Within numerous columns, the file has cells filled with "Strongly Agree" to "Strongly Disagree", 6 levels. These factors need to be converted to integers 5 to 0 respectively.
I have tried unsuccessfully to use sapply and to convert the table to a string. sapply works on a single vector, but it destroys the table structure.
Method 1:
dat$Col<-sapply(dat$Col,switch,'Strongly Disagree'=0,'Disagree'=1,'Slightly Disagree'=2,'Slightly Agree'=3,'Agree'=4, 'Strongly Agree'=5)
My second approach is to convert the csv into a string. When I examined the dput output, I saw the area I wanted to target that started with a .Label="","Strongly Agree"... Mistake. My changes did not result in a useful outcome.
My third approach came from the internet gods of destruction who seemed to express that gsub() might handle the string approach as well. Nope, again the underlying table structure was destroyed.
Method #3: Convert into a string and pattern match
dat <- textConnection("control/Surveys/StudyDat_1.csv")
#Score Scales
##"Strongly Agree"= 5
##"Agree"= 4
##"Strongly Disagree" = 0
#levels(dat$Col) <- gsub("Strongly Agree", "5", levels(dat$Col))
df<- gsub("Strongly Agree", "5",dat)
dat<-read.csv(textConnection(df),header=TRUE)
In the end, I want to replace ALL "Strongly Agree" values with 5 across numerous columns without destroying the retrievability of the data.
Maybe I used the wrong search string and you know the resource I need to address this problem. I would rather avoid ALL character vector approaches, as these would require labeling each column if you provide a code response. It needs to work across ALL COLUMNS.
Thanks
Data Sample Problem
structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z",
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z",
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z",
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z",
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z",
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = structure(c(NA,
NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA), .Label = c("", "Slightly Disagree"
), class = "factor"), B = structure(c(NA, NA, 2L, NA, 1L, NA,
NA, NA, 3L, NA, NA), .Label = c("", "Disagree", "Strongly Agree"
), class = "factor"), C = structure(c(NA, NA, 2L, NA, 1L, NA,
NA, NA, 3L, NA, NA), .Label = c("", "Agree", "Disagree"), class = "factor"),
D = structure(c(NA, NA, 2L, NA, 1L, NA, NA, NA, 2L, NA, NA
), .Label = c("", "Agree"), class = "factor"), E = structure(c(NA,
NA, 2L, NA, 1L, NA, NA, NA, 3L, NA, NA), .Label = c("", "Agree",
"Strongly Disagree"), class = "factor")), .Names = c("last_updated",
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-11L))
Data Sample Solution
df <- structure(list(last_updated = structure(c(3L, 1L, 7L, 2L, 10L, 6L, 8L, 9L, 7L, 5L, 4L), .Label = c("2016-05-13T12:53:56.704184Z",
"2016-05-13T12:54:09.273359Z", "2016-05-13T12:54:22.757251Z",
"2016-05-14T12:44:13.474992Z", "2016-05-14T12:44:31.736469Z",
"2016-05-16T16:45:10.623410Z", "2016-05-16T16:46:17.881402Z",
"2016-05-16T16:46:55.122257Z", "2016-05-16T16:47:14.160793Z",
"2016-05-24T02:26:04.770799Z"), class = "factor"), feedback = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), A = c(NA, NA, 2L, NA,
NA, NA, NA, NA, 2L, NA, NA), B = c(NA, NA, 1L, NA, NA, NA, NA,
NA, 5L, NA, NA), C = c(NA, NA, 4L, NA, NA, NA, NA, NA, 1L, NA,
NA), D = c(NA, NA, 4L, NA, NA, NA, NA, NA, 4L, NA, NA), E = c(NA,
NA, 4L, NA, NA, NA, NA, NA, 0L, NA, NA)), .Names = c("last_updated",
"feedback", "A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA, -11L))
We can use factor with the levels specified
nm1 <- c('Strongly Disagree', 'Disagree',
'Slightly Disagree','Slightly Agree','Agree', 'Strongly Agree')
factor(dat$col, levels = nm1,
labels = 0:5)
If there are multiple factor columns with the same levels, identify the factor columns ('i1'), loop through it with lapply and specify the levels and labels.
i1 <- sapply(dat, is.factor)
dat[i1] <- lapply(dat[i1], factor, levels = nm1, labels= 0:5)
Update
Using the OP's dput output
dat[-(1:2)] <- lapply(dat[-(1:2)], factor, levels = nm1, labels = 0:5)
dat
# last_updated feedback A B C D E
#1 2016-05-13T12:54:22.757251Z NA <NA> <NA> <NA> <NA> <NA>
#2 2016-05-13T12:53:56.704184Z NA <NA> <NA> <NA> <NA> <NA>
#3 2016-05-16T16:46:17.881402Z NA 2 1 4 4 4
#4 2016-05-13T12:54:09.273359Z NA <NA> <NA> <NA> <NA> <NA>
#5 2016-05-24T02:26:04.770799Z NA <NA> <NA> <NA> <NA> <NA>
#6 2016-05-16T16:45:10.623410Z NA <NA> <NA> <NA> <NA> <NA>
#7 2016-05-16T16:46:55.122257Z NA <NA> <NA> <NA> <NA> <NA>
#8 2016-05-16T16:47:14.160793Z NA <NA> <NA> <NA> <NA> <NA>
#9 2016-05-16T16:46:17.881402Z NA 2 5 1 4 0
#10 2016-05-14T12:44:31.736469Z NA <NA> <NA> <NA> <NA> <NA>
#11 2016-05-14T12:44:13.474992Z NA <NA> <NA> <NA> <NA> <NA>
Another option is set from data.table
library(data.table)
for(j in names(dat)[-(1:2)]){
set(dat, i = NULL, j= j, value = factor(dat[[j]], levels = nm1, labels = 0:5))
}
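Note that the columns produced this way are still factors whose labels happen to be "0" to "5". If true integer scores are needed, the conversion has to go through character, as this small sketch shows. The vector x is a made-up set of responses, and codes and scores are illustrative names.

```r
nm1 <- c('Strongly Disagree', 'Disagree', 'Slightly Disagree',
         'Slightly Agree', 'Agree', 'Strongly Agree')

# Hypothetical responses, including a missing one
x <- c('Agree', 'Strongly Disagree', NA, 'Strongly Agree')
f <- factor(x, levels = nm1, labels = 0:5)

# as.integer() on a factor returns the internal level codes (1 to 6)
codes <- as.integer(f)

# Going through character recovers the scores written in the labels (0 to 5)
scores <- as.integer(as.character(f))
```

Here codes is 5, 1, NA, 6, while scores is 4, 0, NA, 5.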
I would just match each target column vector into a precomputed character vector to get an integer index. You can subtract 1 afterward to change the range from 1:6 to 0:5.
## define desired value order, ascending
o <- c(
'Strongly Disagree',
'Disagree',
'Slightly Disagree',
'Slightly Agree',
'Agree',
'Strongly Agree'
);
## convert target columns
for (cn in names(df)[-(1:2)]) df[[cn]] <- match(as.character(df[[cn]]),o)-1L;
df;
## last_updated feedback A B C D E
## 1 2016-05-13T12:54:22.757251Z NA NA NA NA NA NA
## 2 2016-05-13T12:53:56.704184Z NA NA NA NA NA NA
## 3 2016-05-16T16:46:17.881402Z NA 2 1 4 4 4
## 4 2016-05-13T12:54:09.273359Z NA NA NA NA NA NA
## 5 2016-05-24T02:26:04.770799Z NA NA NA NA NA NA
## 6 2016-05-16T16:45:10.623410Z NA NA NA NA NA NA
## 7 2016-05-16T16:46:55.122257Z NA NA NA NA NA NA
## 8 2016-05-16T16:47:14.160793Z NA NA NA NA NA NA
## 9 2016-05-16T16:46:17.881402Z NA 2 5 1 4 0
## 10 2016-05-14T12:44:31.736469Z NA NA NA NA NA NA
## 11 2016-05-14T12:44:13.474992Z NA NA NA NA NA NA
Previous answers might meet your needs, but note that changing the labels of a factor isn't the same as changing a factor to an integer variable. One possibility would be to use ifelse (I've made a new data frame as the one you posted didn't actually have variables with these levels in it):
lev <- c('Strongly disagree', 'Disagree', 'Slightly disagree', 'Slightly agree', 'Agree', 'Strongly agree')
dta <- sample(lev, 55, replace = TRUE)
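dta <- data.frame(matrix(dta, nrow = 11), stringsAsFactors = TRUE)  # factor columns are needed for is.factor() below (the default is FALSE since R 4.0.0)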
names(dta) <- LETTERS[1:5]
f_to_int <- function(f) {
if (is.factor(f)){
ifelse(f == 'Strongly disagree', 0,
ifelse(f == 'Disagree', 1,
ifelse(f == 'Slightly disagree', 2,
ifelse(f == 'Slightly agree', 3,
ifelse(f == 'Agree', 4,
ifelse(f == 'Strongly agree', 5, f))))))
} else f
}
dta2 <- sapply(dta, f_to_int)
Note that this returns a matrix, but it is easily converted to a data frame if necessary.
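As an alternative to the nested ifelse calls, a named lookup vector gives the same mapping in one indexing step and keeps the data frame structure. This is only a sketch: the vector scores, the helper to_int and the small frame dta are illustrative and use the lower-case level spelling from the example above.

```r
# Named lookup: level text -> integer score
scores <- c('Strongly disagree' = 0L, 'Disagree' = 1L, 'Slightly disagree' = 2L,
            'Slightly agree' = 3L, 'Agree' = 4L, 'Strongly agree' = 5L)

# Hypothetical data with one character and one factor column
dta <- data.frame(A = c('Agree', 'Strongly disagree', NA),
                  B = factor(c('Disagree', 'Strongly agree', 'Slightly agree')))

# as.character() matters for factors: indexing by a factor would use its codes
to_int <- function(x) unname(scores[as.character(x)])

# Assigning into dta2[] keeps the result a data frame rather than a matrix
dta2 <- dta
dta2[] <- lapply(dta, to_int)
```

Values that are not in the lookup, including NA, come out as NA.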
I have a matrix and my objective is to find the maximum of each column and then to divide that number by the sum of all values in the row which contains the max of that column. In other words
max(y) / sum of values in the row where y is the max
How would I apply this formula to every column in R?
> the_matrix
Source: local data frame [20 x 10]
type 100 100F 100I 100X 101 102 1028P 103 103D
(fctr) (int) (int) (int) (int) (int) (int) (int) (int) (int)
1 0 NA NA NA NA NA NA NA NA NA
2 0A 2 NA NA NA NA NA NA NA NA
3 0B NA NA NA NA NA NA NA NA NA
4 0C NA NA NA NA NA NA NA NA NA
5 0E NA NA NA NA NA NA NA NA NA
6 0G NA NA NA NA NA NA NA NA NA
7 0O NA NA NA NA NA NA NA NA NA
8 0Z NA NA NA NA NA NA NA NA NA
9 1 2 NA NA NA NA NA NA NA NA
10 1A 3968 NA 214 26 4 289 8 56030 7484
11 1B 172 NA 107 NA NA 2 NA 372 3829
12 1C 584 NA 19 NA NA 1 NA 72951 363
13 1D 27 NA NA NA NA NA NA 365 22
14 1E 27944 16 68 NA NA NA 1 62 12
15 1F 1 NA 1 NA NA 1 NA 368 27
16 1G 4 NA NA NA NA NA NA 7 NA
17 1H 65 NA 6 21 1 6 3 714 59
18 1M NA NA NA NA NA NA NA 1 NA
19 1N NA NA NA NA NA NA NA NA NA
20 1Q NA NA NA NA NA NA NA NA NA
> dput(the_matrix)
structure(list(type = structure(1:20, .Label = c("0", "0A", "0B",
"0C", "0E", "0G", "0O", "0Z", "1", "1A", "1B", "1C", "1D", "1E",
"1F", "1G", "1H", "1M", "1N", "1Q", "1S", "1X", "1Z", "2", "2A",
"2B", "2C", "2D", "2E", "2F", "2G", "2H", "2I", "2J", "2M", "2S",
"2T", "2X", "2Z", "3", "3B", "3C", "3E", "4B", "5H", "8Z", "0H",
"1I", "1R", "2N", "3H", "5D", "0D", "1K", "1P", "1T", "1U", "1V",
"1W", "1Y", "2U", "3A", "4A", "5C", "7H", "9", "0F", "0T", "1J",
"2L", "0W", "2Q", "3G"), class = "factor"), `100` = c(NA, 2L,
NA, NA, NA, NA, NA, NA, 2L, 3968L, 172L, 584L, 27L, 27944L, 1L,
4L, 65L, NA, NA, NA), `100F` = c(NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 16L, NA, NA, NA, NA, NA, NA), `100I` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 214L, 107L, 19L, NA, 68L, 1L,
NA, 6L, NA, NA, NA), `100X` = c(NA, NA, NA, NA, NA, NA, NA, NA,
NA, 26L, NA, NA, NA, NA, NA, NA, 21L, NA, NA, NA), `101` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA, NA, NA, 1L,
NA, NA, NA), `102` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 289L,
2L, 1L, NA, NA, 1L, NA, 6L, NA, NA, NA), `1028P` = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, 8L, NA, NA, NA, 1L, NA, NA, 3L, NA,
NA, NA), `103` = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 56030L,
372L, 72951L, 365L, 62L, 368L, 7L, 714L, 1L, NA, NA), `103D` = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, 7484L, 3829L, 363L, 22L, 12L,
27L, NA, 59L, NA, NA, NA)), .Names = c("type", "100", "100F",
"100I", "100X", "101", "102", "1028P", "103", "103D"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -20L))
Going step-by-step:
# let's not call a data frame a matrix
real_matrix = as.matrix(the_matrix[, -1])
# max of each column
col_max = apply(real_matrix, 2, max, na.rm = TRUE)
# which row contains the max
col_which_max = apply(real_matrix, 2, which.max)
# row totals
row_total = rowSums(real_matrix, na.rm = TRUE)
# col max divided by row total for corresponding row
col_max / row_total[col_which_max]
Rounded to 3 decimals, this yields the following:
100 100F 100I 100X 101 102 1028P 103 103D
0.994 0.001 0.003 0.000 0.000 0.004 0.000 0.987 0.110
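The same steps can be checked by hand on a tiny matrix. The matrix m and the variable names below are made up for illustration. Note that a column consisting only of NA would break this approach, because which.max returns nothing for it; no column in the posted data is all NA.

```r
# Hypothetical 2 x 3 matrix with one missing value
m <- matrix(c(1, 8, NA,
              4, 2, 6),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("a", "b", "c")))

# Max of each column, ignoring NA: a = 4, b = 8, c = 6
col_max <- apply(m, 2, max, na.rm = TRUE)

# Row index of each column max (which.max skips NA): a = 2, b = 1, c = 2
row_of_max <- apply(m, 2, which.max)

# Row totals, ignoring NA: row 1 = 9, row 2 = 12
row_total <- rowSums(m, na.rm = TRUE)

# Column max divided by the total of the row that holds it
ratio <- col_max / row_total[row_of_max]
```

Here ratio works out to 4/12 for a, 8/9 for b and 6/12 for c.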
I've learned R by toying, and I'm starting to think that I'm abusing the tapply function. Are there better ways to do some of the following actions? Granted, they work, but as they get more complex I wonder if I'm losing out on better options. I'm looking for some criticism, here:
tapply(var1, list(fac1, fac2), mean, na.rm=T)
tapply(var1, fac1, sum, na.rm=T) / tapply(var2, fac1, sum, na.rm=T)
cumsum(tapply(var1, fac1, sum, na.rm=T)) / sum(var1)
Update: Here's some example data...
var1 var2 fac1 fac2
1 NA 275.54 10 (266,326]
2 NA 565.89 10 (552,818]
3 NA 815.41 6 (552,818]
4 NA 281.77 6 (266,326]
5 NA 640.24 NA (552,818]
6 NA 78.42 NA [78.4,266]
7 NA 1027.06 NA (818,1.55e+03]
8 NA 355.20 NA (326,552]
9 NA 464.52 NA (326,552]
10 NA 1397.11 10 (818,1.55e+03]
11 NA 229.82 NA [78.4,266]
12 NA 542.77 NA (326,552]
13 NA 829.32 NA (818,1.55e+03]
14 NA 284.78 NA (266,326]
15 NA 194.97 10 [78.4,266]
16 NA 672.55 8 (552,818]
17 NA 348.01 10 (326,552]
18 NA 1550.79 9 (818,1.55e+03]
19 101.98 101.98 4 [78.4,266]
20 NA 292.80 6 (266,326]
Update data dump:
structure(list(var1 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 101.98, NA), var2 = c(275.54,
565.89, 815.41, 281.77, 640.24, 78.42, 1027.06, 355.2, 464.52,
1397.11, 229.82, 542.77, 829.32, 284.78, 194.97, 672.55, 348.01,
1550.79, 101.98, 292.8), fac1 = c(10L, 10L, 6L, 6L, NA, NA, NA,
NA, NA, 10L, NA, NA, NA, NA, 10L, 8L, 10L, 9L, 4L, 6L), fac2 = structure(c(2L,
4L, 4L, 2L, 4L, 1L, 5L, 3L, 3L, 5L, 1L, 3L, 5L, 2L, 1L, 4L, 3L,
5L, 1L, 2L), .Label = c("[78.4,266]", "(266,326]", "(326,552]",
"(552,818]", "(818,1.55e+03]"), class = "factor")), .Names = c("var1",
"var2", "fac1", "fac2"), row.names = c(NA, -20L), class = "data.frame")
For part 1 I prefer aggregate because it keeps the data in a more R-like, one-observation-per-row format.
aggregate(var1, list(fac1, fac2), mean, na.rm=T)
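A sketch of how this might look with the formula interface of aggregate, which takes the variables from a data frame and returns named columns. The data frame d is made up for illustration. By default the formula method drops rows with NA before grouping, which is why the second call passes na.action = na.pass and lets sum handle the NA values itself.

```r
# Hypothetical data in the same shape as the question's example
d <- data.frame(var1 = c(1, 2, NA, 4),
                var2 = c(2, 2, 5, 8),
                fac1 = c("a", "a", "b", "b"),
                fac2 = c("x", "y", "x", "x"))

# Part 1: group means, one row per fac1/fac2 combination
means <- aggregate(var1 ~ fac1 + fac2, data = d, FUN = mean)

# Part 2: ratio of group sums; keep NA rows and let sum() drop the NA values
sums <- aggregate(cbind(var1, var2) ~ fac1, data = d, FUN = sum,
                  na.rm = TRUE, na.action = na.pass)
sums$ratio <- sums$var1 / sums$var2
```

Unlike the tapply version, group combinations that do not occur in the data simply have no row in the result instead of appearing as NA cells.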