Cannot compute a conditional mean in R

I have this small dataset:
long <- structure(list(score = c("mine_score", "your_score", "mine_score",
"your_score", "mine_score", "your_score"), points = c(53, 13.25,
17.5, 1.59090909090909, 48.5, 6.92857142857143)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
and when applying this formula:
mean(long[long$score == 'mine_score', "points"], na.rm = TRUE)
I got this warning, but cannot figure out why:
Warning message:
In mean.default(long[long$score == "mine_score", "points"], na.rm = TRUE) :
argument is not numeric or logical: returning NA
Does anyone know what causes this warning?
Thanks

Your query returns a one-column data frame (here, a tibble), which mean cannot average; mean needs a vector. You can use $ to extract the column as a vector before subsetting:
mean(long$points[long$score == 'mine_score'], na.rm = TRUE)
#[1] 39.66667
If you really want to stick with your original query, you can use [[1]] to return the first column of your dataframe as a vector:
mean(long[long$score == 'mine_score', "points"][[1]], na.rm = TRUE)

Your thinking was accurate when applied to data frames, but you're dealing with a tibble.
tibble::is_tibble(long)
# [1] TRUE
Unlike a plain data frame, a tibble does not drop to a vector when you select a single column with [. After converting to a data frame, your original code works:
long_df <- as.data.frame(long)
mean(long_df[long_df$score == 'mine_score', "points"], na.rm=TRUE)
# [1] 39.66667
That said, colMeans does accept the one-column tibble in this case:
colMeans(long[long$score == 'mine_score', "points"], na.rm=TRUE)
# points
# 39.66667
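The difference between a one-column data frame and a vector can be seen without tibble installed. The following is a minimal base R sketch, assuming a plain data frame named long_df as a stand-in for the tibble; drop = FALSE is used only to imitate the tibble behaviour of always returning a data frame from [.

```r
# Plain data frame standing in for the tibble `long` from the question
# (the name `long_df` and the use of drop = FALSE are illustrative assumptions).
long_df <- data.frame(
  score  = rep(c("mine_score", "your_score"), times = 3),
  points = c(53, 13.25, 17.5, 1.59090909090909, 48.5, 6.92857142857143)
)
rows <- long_df$score == "mine_score"

# drop = FALSE keeps a one-column data frame, which is what a tibble returns.
one_col <- long_df[rows, "points", drop = FALSE]
is.data.frame(one_col)  # mean() cannot average this object directly

# Either form below yields a plain numeric vector that mean() accepts.
as_vector_1 <- long_df$points[rows]
as_vector_2 <- one_col[["points"]]

mean(as_vector_1)
mean(as_vector_2)
```

Both calls give the same mean because $ and [[ always return the column itself rather than a data frame wrapped around it.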

Related

Recode monetary string values into new variable as numeric

First off - newbie with R so bear with me. I'm trying to recode string values as numeric. My problem is I have two different string patterns present in my values: "M" and "B" for 'million' and 'billion', respectively.
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"))
I've successfully knocked off the dollar sign and now have:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
fundsR = c("1.76M", "2B", "57M", "9.87B")
)
How can I recode these as numeric while retaining their respective monetary values? I've tried various if statements and for loops, with or without str_detect, pipe operators, case_when, mutate, etc., to isolate the values with "M" and the values with "B", convert them to numeric, and multiply to come up with the corresponding numeric value, all in a new column. This seemingly simple task turned out to be harder than I imagined, which I attribute to being a novice. At this point I'd like to start from scratch and see if anyone has any fresh ideas. My RStudio is a MESS.
Something like this would be nice:
df <- data.frame(funds = c("$1.76M", "$2B", "$57M", "$9.87B"),
fundsR = c("1.76M", "2B", "57M", "9.87B"),
fundsFinal = c(1760000, 2000000000, 57000000, 9870000000)
)
I'd really appreciate your input.
You could create a helper function f, and then apply it to the funds column:
library(dplyr)
library(stringr)
f <- function(x) {
curr = c("M"=1e6, "B" = 1e9)
val = str_remove(x,"\\$")
as.numeric(str_remove_all(val,"B|M"))*curr[str_extract(val, "B|M")]
}
df %>% mutate(fundsFinal = f(funds))
Output:
funds fundsFinal
1 $1.76M 1.76e+06
2 $2B 2.00e+09
3 $57M 5.70e+07
4 $9.87B 9.87e+09
Input:
df = structure(list(funds = c("$1.76M", "$2B", "$57M", "$9.87B")), class = "data.frame", row.names = c(NA,
-4L))
This works but I'm sure better solutions exist. Assuming funds is a character vector:
library(tidyverse)
options(scipen = 999)
df <- data.frame(funds = c('$1.76M', '$2B', '$57M', '$9.87B'))
df = df %>%
mutate( fundsFinal = ifelse(str_sub(funds,nchar(funds),-1) =='M',
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^6,
as.numeric(substr(funds, 2, nchar(funds) - 1))*10^9))
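For comparison, the same lookup-table idea can be sketched in base R with no packages. The function name parse_funds and the multiplier table are illustrative assumptions; the sketch also assumes every value ends in exactly one of "M" or "B".

```r
# Hypothetical helper: convert strings such as "$1.76M" to plain numbers.
parse_funds <- function(x) {
  multiplier <- c(M = 1e6, B = 1e9)           # suffix -> scale factor
  suffix <- substring(x, nchar(x))            # last character of each string
  number <- as.numeric(gsub("[$MB]", "", x))  # strip "$" and the suffix
  number * unname(multiplier[suffix])         # unknown suffixes become NA
}

funds <- c("$1.76M", "$2B", "$57M", "$9.87B")
funds_final <- parse_funds(funds)

# Print without scientific notation.
format(funds_final, scientific = FALSE)
```

Indexing the named multiplier vector by suffix avoids a branch per unit, so supporting another unit (say "K") would only need one more table entry and one more character in the gsub pattern.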

Select 1 column when DF has 2 similar column names in R

I have two problems. First, I have datasets in which two column names are similar. I want to select the first one and not use the second. The numeric part of each column name is the serial number of the sensor; it varies, and the columns can appear in any position.
How can I select the first of the two columns so I can plot it or use it in calculations?
How can I recover those long column names so I can use them? For example, how do I get "Depth_456" for use in depthmax2 without typing it in or making a subset named depth? The problem is the numeric value, which is the sensor's serial number and changes from instrument to instrument and dataset to dataset. I am trying to write generic code that works on all the different instruments.
My Data
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)
df2<-data.frame(sapply(df1, function(x) as.numeric(as.character(x))))
temp<- df2[grep("Temp", names(df2), value=TRUE)]
depth<- df2[grep("Depth", names(df2), value=TRUE)]
depthmax<- max(depth, na.rm = TRUE)
depthmax2<- max(df2$"Depth_456", na.rm = TRUE)
This doesn't work
depthmax2<- max(df2$grep("Depth", names(df2), value=TRUE), na.rm = TRUE)
We need [[ instead of $.
max(df2[[ grep("Depth", names(df2), value=TRUE)]], na.rm = TRUE)
#[1] 8
Or another option is startsWith
max(df2[[names(df2)[startsWith(names(df2), "Depth")]]], na.rm = TRUE)
#[1] 8
Also, max works on a vector. If there is more than one match, we may have to loop over the matching columns and get the max of each:
sapply(df2[ grep("Depth", names(df2), value=TRUE)], max, na.rm = TRUE)
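The first part of the question (use only the first of two similarly named columns) can be handled by taking the first element of the grep result. This is a small sketch on the question's df1; the anchored patterns "^Temp_" and "^Depth_" are an assumption about how the columns are named.

```r
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)

# All column names that start with "Temp_", in column order.
temp_cols <- grep("^Temp_", names(df1), value = TRUE)

# Keep only the first match, whatever its serial number is.
first_temp <- temp_cols[1]
temp_first <- df1[[first_temp]]

# The same pattern recovers the full depth column name without typing it.
depth_col <- grep("^Depth_", names(df1), value = TRUE)[1]
depth_max <- max(df1[[depth_col]], na.rm = TRUE)
```

Because [[ receives a single name, it returns a plain vector that can go straight into max, plot, or arithmetic.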

Using tidyverse to get raw_alpha values with the psych package from different datasets / databases

I spent a day looking for this answer and I'm close to giving up. I imagine it is a pretty simple situation, but I'd be glad of any help.
Let's say I have two datasets. The first holds the IDs of all students:
library(tidyverse)
library(psych)
ds_of_students <- data.frame(id=(1:4), school=c("public","private"))
The second one has all the results of a test. Let's say each column is an ID.
ds_of_results <- structure(list(i1 = c(1, 2, 4, 4),
i2 = c(3, 3, 2, 2),
i3 = c(2, 3, 3, 5),
i4 = c(4, 1, 3, 2)),
class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -4L))
Now I need to report a table of student IDs, grouped by school, together with their results (actually Cronbach's alpha, which is pretty common in psychology).
ds_of_students %>%
group_by(school) %>%
summarise(n=n(),
id = paste(id, collapse = ",")) %>%
mutate(item2=psych::alpha(ds_of_results[c(id)])$total[1])
I've got this message
Error in mutate_impl(.data, dots) :
Evaluation error: Columns `2,4`, `1,3` not found.
But When I run in the traditional fashion, it works
psych::alpha(ds_of_results[c(1,3)])$total[1]
I've tried to work with paste, noquote, gsub and strcol.
Please, run this code to have reproducible results. Thanks much!
library(tidyverse)
library(psych)
ds_of_students <- data.frame(id=(1:4), school=c("public","private"))
ds_of_results <- structure(list(i1 = c(1, 2, 4, 4),
i2 = c(3, 3, 2, 2),
i3 = c(2, 3, 3, 5),
i4 = c(4, 1, 3, 2)),
class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -4L))
ds_of_students %>%
group_by(school) %>%
summarise(n=n(),
id = paste(id, collapse = ",")) %>%
mutate(item2=psych::alpha(ds_of_results[c(id)])$total[1])
alpha(ds_of_results[c(1,3)])$total[1]
My desired output is something like this.
And just to give some context to my question, this is the real dataset, where I have to compute Cronbach's alpha for the items of each group.
get_alpha <- function(x) {
raw_alpha <-
psych::alpha(
ds_of_results[, ds_of_students[ds_of_students$school == x, 1]])$total[1]
ids <-
paste0(names(ds_of_results[, ds_of_students[ds_of_students$school == x, 1]]),
collapse = ",")
data.frame(
school = x,
id = ids,
raw_alpha = raw_alpha
)
}
map_df(sort(unique(as.character(ds_of_students$school))), get_alpha)
Result
school id raw_alpha
1 private i2,i4 0.00
2 public i1,i3 0.85
There were several issues in your code:
mutate uses variables within a data frame while psych::alpha needs entire data frames. So I don't think that you can get your alpha values with mutate
you use $total to extract one element of the list of data frames given by psych::alpha, but that does not work in a pipeline (the pipe does not handle lists and only works with data frames)
So basically, psych::alpha, which needs entire data frames as input and outputs a list of data frames does not play well with a classic dplyr wrangling workflow.
I'm not sure this is what you're looking for, but try this and tell me if you're getting the expected result. Replace your summarise call like this (also note the "unlist" in the mutate call):
ds_of_students %>%
group_by(school) %>%
summarise(id = list(id)) %>%
mutate(item2=psych::alpha(ds_of_results[unlist(id)])$total[1])
What I'm doing here is replacing your paste with a list, so that the numbers are retained as numbers, and can be passed to the subset call in the next step without a hitch. This will also work if id is a character, of course, assuming the column names in ds_of_results are the id's from ds_of_students. You need to pass it with unlist so that the subset gets it as a simple vector, rather than as a list with one vector element.
With your fake data, I get these warnings along with the result:
Some items ( i2 i4 ) were negatively correlated with the total scale and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
# A tibble: 2 x 3
school id item2
<fct> <list> <data.frame>
1 private <int [2]> -1
2 public <int [2]> -1
Warning messages:
1: In cor.smooth(r) : Matrix was not positive definite, smoothing was done
2: In psych::alpha(ds_of_results[unlist(id)]) :
Some items were negatively correlated with the total scale and probably
should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
3: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
4: In cor.smooth(R) : Matrix was not positive definite, smoothing was done
But that might just be a problem with the fake data itself, not the code.
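The per-school computation discussed above can also be sketched in base R, which makes it easier to see that each school needs its own subset of columns. This is only a sketch: cronbach_alpha is a hand-written helper (an assumption, not psych::alpha) that applies the standard raw alpha formula k/(k-1) * (1 - sum of item variances / variance of the total score), and ds_of_results is rebuilt as a plain data frame.

```r
ds_of_students <- data.frame(id = 1:4, school = c("public", "private"))
ds_of_results <- data.frame(
  i1 = c(1, 2, 4, 4),
  i2 = c(3, 3, 2, 2),
  i3 = c(2, 3, 3, 5),
  i4 = c(4, 1, 3, 2)
)

# Hypothetical helper: raw Cronbach's alpha for a data frame of items.
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_var  <- sum(vapply(items, var, numeric(1)))  # sum of item variances
  total_var <- var(rowSums(items))                  # variance of the sum score
  k / (k - 1) * (1 - item_var / total_var)
}

# One vector of ids per school, e.g. public -> c(1, 3), private -> c(2, 4).
ids_by_school <- split(ds_of_students$id, ds_of_students$school)

# Apply the helper to the columns that belong to each school separately.
alpha_by_school <- vapply(
  ids_by_school,
  function(ids) cronbach_alpha(ds_of_results[ids]),
  numeric(1)
)
round(alpha_by_school, 2)
```

On this toy data the sketch gives the 0.85 (public) and 0.00 (private) reported in the first answer. Keeping the ids as a list of integer vectors, rather than a pasted string, is what lets them index ds_of_results directly.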

Can I use the apply family to get a stat on each column of many dataframes

Good morning Stack Overflow,
Getting some statistics (whatever) on the columns of a dataframe might be done with the (s)apply function. I am wondering whether it could be possible to get such statistics on each column for each different dataframe using the apply family?
Number of missing values per column (1 dataframe):
dataf <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
sapply(dataf, function(x) {sum(is.na(x))})
I have thought about making a list of dataframes but the statistics is then conglomerated on the elements of the list (i.e. dataframe) although I want it to be calculated on the columns. Any idea?
Have a nice day,
Anthony
In general, it is a good idea to keep your dataframes in a list if you want to do similar things with them. For more information, see the excellent answer of @gregor to the question "How do I make a list of data frames?".
The comment of @missuse is correct. Tested on your example:
dataf <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
dataf2 <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
li <- list(dataf,dataf2)
lapply(li, function(x) sapply(x, function(y) sum(is.na(y))))
> lapply(li, function(x) sapply(x, function(y) sum(is.na(y))))
[[1]]
a b
0 1
[[2]]
a b
0 1
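If the list is named, sapply can collapse the per-dataframe results into one matrix with a column per data frame. A minimal sketch, assuming all data frames share the same column names; the list names first and second and the contents of dataf2 are made up for illustration.

```r
dataf  <- data.frame(a = 1:3, b = c(NA, 3:4), row.names = c("x", "y", "z"))
dataf2 <- data.frame(a = c(NA, NA, 1), b = 4:6, row.names = c("x", "y", "z"))

# Naming the list elements labels the columns of the result.
li <- list(first = dataf, second = dataf2)

# is.na() on a data frame gives a logical matrix; colSums() counts per column.
na_counts <- sapply(li, function(d) colSums(is.na(d)))
na_counts
```

Rows of na_counts are the original column names and columns are the data frames, so the statistic stays per column instead of being pooled over each list element.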

Weird behaviour by ordering a data frame

I have the following data frame that I want to order by the fifth column ("Distance").
When I try
df.order <- df[order(df[, 5]), ]
I always get the following error message:
Error in order(df[, 5]) : unimplemented type 'list' in 'orderVector1'
I don't know why R considers my data frame a list. Running is.data.frame(df) returns TRUE. I have to admit that is.list(df) also returns TRUE. Is it possible to force my data frame to be only a data frame and not a list?
Thanks for your help.
df <- structure(list(ID = list(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Latitude = list(50.7368, 50.7368, 50.7368, 50.7369, 50.7369, 50.737, 50.737, 50.7371, 50.7371, 50.7371),
Longitude = list(6.0873, 6.0873, 6.0873, 6.0872, 6.0872, 6.0872, 6.0872, 6.0872, 6.0872, 6.0872),
Elevation = list(269.26, 268.99, 268.73, 268.69, 268.14, 267.87, 267.61, 267.31, 267.21, 267.02),
Distance = list(119.4396, 119.4396, 119.4396, 121.199, 121.199, 117.5658, 117.5658, 114.9003, 114.9003, 114.9003),
RxPower = list(-52.6695443922406, -52.269130891243, -52.9735258244422, -52.2116571930007, -51.7784534281727, -52.7703448813654, -51.6558862949081, -52.2892907635308, -51.8322993596551, -52.4971436682333)),
.Names = c("ID", "Latitude", "Longitude", "Elevation", "Distance", "RxPower"),
row.names = c(NA, 10L), class = "data.frame")
Your data frame contains lists, not vectors. You can convert this data frame to the "classical" format using as.data.frame and unlist:
df2 <- as.data.frame(lapply(df, unlist))
Now, the new data frame could be sorted in the intended way:
df2[order(df2[, 5]), ]
Here is a small example that illustrates the problem:
df <- structure(list(ID = c(1, 2, 3, 4),
Latitude = c(50.7368, 50.7368, 50.7368, 50.7369),
Longitude = c(6.0873, 6.0873, 6.0873, 6.0872),
Elevation = c(269.26, 268.99, 268.73, 268.69),
Distance = c(119.4396, 119.4396, 119.4396, 121.199),
RxPower = c(-52.6695443922406, -52.269130891243, -52.9735258244422,
-52.2116571930007)),
.Names = c("ID", "Latitude", "Longitude", "Elevation", "Distance", "RxPower"),
row.names = c(NA, 4L), class = "data.frame")
Notice that list only occurs once. And all the values are wrapped by c(.) and not list(.). This is why doing sapply(df, class) on your data resulted in all columns having class list.
Now,
> sapply(df, class)
# ID Latitude Longitude Elevation Distance RxPower
# "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Now order works:
> df[order(df[,4]), ]
# ID Latitude Longitude Elevation Distance RxPower
# 4 4 50.7369 6.0872 268.69 121.1990 -52.21166
# 3 3 50.7368 6.0873 268.73 119.4396 -52.97353
# 2 2 50.7368 6.0873 268.99 119.4396 -52.26913
# 1 1 50.7368 6.0873 269.26 119.4396 -52.66954
This turns your data.frame of lists into a matrix:
mat <- sapply(df,unlist)
Now you can order it.
mat[order(mat[,5]),]
If all columns are of one type, e.g., numeric, a matrix often is preferable, because operations on matrices are faster than on data.frames. However, you can transform to a data.frame using as.data.frame(mat).
Btw, a data.frame is a special kind of list and thus is.list returns TRUE for every data.frame.
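A quick way to confirm the diagnosis is to check which columns are themselves lists before trying to sort. The sketch below uses a cut-down, three-row version of the question's data (an assumption for brevity) and then flattens the columns as the first answer suggests.

```r
# Three-row stand-in for the question's data frame: every column is a list.
df <- structure(
  list(ID = list(1, 2, 3),
       Distance = list(121.199, 117.5658, 119.4396)),
  row.names = c(NA, -3L), class = "data.frame"
)

# TRUE marks columns that order() cannot sort.
list_cols <- vapply(df, is.list, logical(1))
list_cols

# Flatten each list column into an atomic vector, then sort by Distance.
flat <- as.data.frame(lapply(df, unlist))
sorted <- flat[order(flat$Distance), ]
sorted
```

Note that unlist assumes each cell holds exactly one value; a cell with several values would change the column length and as.data.frame would fail.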
Ran across this same problem. This worked for me (maybe it might help someone else who is having the same problem and stumbled on this page).
I had a structure like:
lst <- list(row1 = list(col1="A",col2=1,col3="!"), row2 = list(col1="B",col2=2,col3="#"))
> lst
$row1
$row1$col1
[1] "A"
$row1$col2
[1] 1
$row1$col3
[1] "!"
$row2
$row2$col1
[1] "B"
$row2$col2
[1] 2
$row2$col3
[1] "#"
I was doing:
df <- as.data.frame(do.call(rbind, lst))
And I kept getting the same error you were getting when I tried to df[order(df$col1),]. Turns out I had to do:
df <- do.call(rbind.data.frame, lst)
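The difference between the two calls is visible in the column types. A short sketch on the same lst, where the names df_bad and df_good are introduced only for illustration:

```r
lst <- list(row1 = list(col1 = "A", col2 = 1, col3 = "!"),
            row2 = list(col1 = "B", col2 = 2, col3 = "#"))

# rbind() on plain lists builds a list-matrix, so every column stays a list.
df_bad <- as.data.frame(do.call(rbind, lst))
vapply(df_bad, is.list, logical(1))

# rbind.data.frame() binds the rows into ordinary atomic columns instead.
df_good <- do.call(rbind.data.frame, lst)
vapply(df_good, is.list, logical(1))

# Sorting by the first column now works.
df_good[order(df_good[[1]]), ]
```

Columns that are still lists are the ones that trigger the orderVector1 error, which is why only the second data frame can be sorted.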

Resources