Need to get total for column in R

I have written the code up to this point, but I have a column called Score in the rscores tibble whose values I need to add up to a total.
library(tidyverse)
responses <- read_csv("responses.csv")
qformats <- read_csv("qformats.csv")
scoring <- read_csv("scoring.csv")
rlong <- gather(responses, Question, Response, Q1:Q10)
rlong_16 <- filter(rlong, Id == 16)
rlong2 <- inner_join(rlong_16, qformats, by = "Question")
rscores <- inner_join(rlong2, scoring)
What line of code do I add next to get the total for this column? I have been scratching my head for hours. Any help is appreciated :)
> head(rscores)
# A tibble: 6 x 5
Id Question Response QFormat Score
<dbl> <chr> <chr> <chr> <dbl>
1 16 Q1 Slightly Disagree F 0
2 16 Q2 Definitely Agree R 0
3 16 Q3 Slightly Disagree R 1
4 16 Q4 Definitely Disagree R 1
5 16 Q5 Slightly Agree R 0
6 16 Q6 Slightly Agree R 0

colSums() is overkill if you just need the sum of one column, and it will give you an error if any other column in the tibble/data.frame is not convertible to numeric. In your case, there's at least one character (chr) column that can't be summed. Typically you'd use rowSums or colSums on a matrix rather than a data frame.
Just use the sum function on the one column: sum(rscores$Score). Best of luck.
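If you'd rather stay in the pipeline, a dplyr equivalent is a one-line summarise (a sketch; the grouped variant only matters once you score more than one Id):
library(dplyr)
# total of the Score column for this respondent
rscores %>% summarise(total = sum(Score))
# per-respondent totals, if you later score several Ids at once
rscores %>% group_by(Id) %>% summarise(total = sum(Score))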

Related

How do I change numeric values in a subset of columns in a R dataframe to other numeric values?

I have a dataset with currently 4 rows/subjects (more to come, as this is ongoing research) and 259 variables/columns. 240 variables of this dataset are ratings of fit ("How well does the following adjective match dimension X?") and 19 variables are sociodemographic.
For these 240 rating variables, my subjects could give a rating ranging from 1 ("fits very badly") to 7 ("fits very well"). Consequently, I have 240 variables with values from 1 to 7. I would like to change these numeric values as follows (the procedure being the same for all of the 240 columns):
1 should change to 0, 2 to 1/6, 3 to 2/6, 4 to 3/6, 5 to 4/6, 6 to 5/6 and 7 to 1. So no matter where in the 240 columns, a 1 should change to 0 and so on.
I have tried the following approaches:
Recode numeric values in R
In this post, it says that
x <- 1:10
# With recode function using backquotes as arguments
dplyr::recode(x, `2` = 20L, `4` = 40L)
# [1] 1 20 3 40 5 6 7 8 9 10
# With case_when function
dplyr::case_when(
  x %in% 2 ~ 20,
  x %in% 4 ~ 40,
  TRUE ~ as.numeric(x)
)
# [1] 1 20 3 40 5 6 7 8 9 10
Consequently, I tried this:
df = ds %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20)
%>% recode(.,`1`=0,`2`=-1/6,`3`=-2/6, `4`=3/6,`5`=4/6, `6`=5/6, `7`=1))
with AD01_01 etc. being the column names for the adjectives my subjects should rate. I also tried it without the ., after recode(, to no avail.
This code is flawed because it omits the 19 columns of sociodemographic data I want to keep in my dataset. Moreover, I get the error unexpected SPECIAL in "%>%".
I thought R might accept my selected columns via the pipe operator as the "x" in recode. Apparently, this is not the case. I also tried to read up on the R documentation of recode, but it made things much more confusing for me, as there were a lot of technical terms I don't understand.
As there is another option mentioned in the post, I also tried this:
df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (.,%in% 1~0,%in% 2~1/6,%in%3~2/6,%in%4~3/6,%in%5~4/6,%in%6~5/6,%in%7~1)
I thought I could give the output of the select function to the case_when function. Apparently, this is also not the case.
When I execute this command, I get
Error: unexpected SPECIAL in:
"df = df %>% select(AD01_01:AD01_20,AD02_01:AD02_20,AD03_01:AD03_20,AD04_01:AD04_20,AD05_01:AD05_20,AD06_01:AD06_20, AD09_01:AD09_20,AD10_01:AD10_20,AD11_01:AD11_20,AD12_01:AD12_20,AD13_01:AD13_20,AD14_01:AD14_20) %>% case_when (%in%"
Reading up on other possibilities, I found this
https://rstudio-education.github.io/hopr/modify.html
exemplary dataset:
head(dplyr::storms)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
# We decide that we want to recode all NAs to 9999.
storm <- storms
storm$ts_diameter[is.na(storm$ts_diameter)] <- 9999
summary(storm$ts_diameter)
ds$AD01_01:AD01_20[1(ds$AD01_01:AD01_20)] <- 0, ds$AD01_01:AD01_20[2(ds$AD01_01:AD01_20)] <- 1/6, ds$AD01_01:AD01_20[3(ds$AD01_01:AD01_20)] <- 2/6,
ds$AD01_01:AD01_20[4(ds$AD01_01:AD01_20)] <- 3/6, ds$AD01_01:AD01_20[5(ds$AD01_01:AD01_20)] <- 4/6, ds$AD01_01:AD01_20[6(ds$AD01_01:AD01_20)] <- 5/6,
ds$AD01_01:AD01_20[7(ds$AD01_01:AD01_20)] <- 1
My idea in this case was to use assignment for multiple columns at a time (this attempt just concerns 20 of my 240 columns), and it also didn't work. I got the error
could not find function ":<-", which is weird because I thought this was a basic command. The only noteworthy thing that might explain it is that I executed library(readr) and library(tidyverse) beforehand.
Disclaimer: I am an R newbie and have spent 2 hours trying to solve this issue. I would also like to know where I went wrong and why my code doesn't work.
How about using mutate(across())? For example, if all your "adjective rating" columns start with "AD", you can do something like this:
library(dplyr)
ds %>% mutate(across(starts_with("AD"), ~(.x-1)/6))
Explanation of where you went wrong with your code:
First, your select(...) %>% recode(...) was close. However, when you use select, you reduce ds to only the selected columns, so recoding those values and assigning the result to df leaves df without the demographic variables. (The unexpected SPECIAL error itself comes from starting a line with %>%: R treats the previous line as a complete expression, so a leading pipe on the next line is a syntax error; the pipe must end the line it continues from.)
Second, if you want to use recode you can, but you can't feed it an entire data frame/tibble, as you do when you pipe (%>%) the selected columns into it. Instead, you can apply recode() to each column through the .fns argument of across(), over the columns chosen in its .cols argument, like this:
ds %>%
  mutate(across(
    .cols = starts_with("AD"),
    .fns = ~recode(.x, `1` = 0, `2` = 1/6, `3` = 2/6, `4` = 3/6, `5` = 4/6, `6` = 5/6, `7` = 1)
  ))
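As a quick sanity check of the (x - 1)/6 shorthand, here is a minimal sketch on made-up data (the column names AD01_01, AD01_02 and age are hypothetical):
library(dplyr)
ds <- tibble(AD01_01 = 1:7,
             AD01_02 = 7:1,
             age     = c(21, 34, 28, 45, 39, 52, 30))
ds %>% mutate(across(starts_with("AD"), ~ (.x - 1) / 6))
# AD01_01 becomes 0, 1/6, 2/6, ..., 1, while age is left untouched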

Filtering with dataframe leads partially to NAs

I am measuring electric current (µA) over a certain time interval (s) for 4 different channels (chan_n) and this is how my data looks:
dat
s µA chan_n
<dbl> <dbl> <chr>
0.00 -0.03167860 1
0.02 -0.03136610 1
0.04 -0.03118490 1
0.06 -0.03094740 1
0.08 -0.03065360 1
0.10 -0.03047860 1
0.12 -0.03012230 1
0.14 -0.02995980 1
0.16 -0.02961610 1
... ... ...
My end goal is to get the current at a certain time after the peak value. Therefore I first get the timepoints at which the maximum appears for each channel:
BaselineTime <- dat %>%
  group_by(chan_n) %>%
  slice(which.max(µA)) %>%  # get max current values
  transmute(s = s + 30)     # add 30 to the timepoints at which the max value appears
chan_n s
<chr> <dbl>
1 539.84
2 540.00
3 539.82
4 539.80
But if I use BaselineTime to filter for my current values I get two NAs:
BaselineVal <- right_join(dat, BaselineTime, by = c("chan_n", "s"))
s µA chan_n
<dbl> <dbl> <chr>
540.00 0.00364974 2
539.80 0.00610948 4
539.84 NA 1
539.82 NA 3
I checked that the time values exist for channels 1 and 3, and they do. Also, if I create a data frame manually by hardcoding the time values and use it for filtering, it works just fine.
So why isn't it working? I would be very happy for any suggestions or explanations.
I think it might have something to do with the decimal places, as for channels 2 and 4 there is a 0 in the last decimal place.
Untested as the sample data isn't suitable for testing. I would try something like this:
dat %>%
  group_by(chan_n) %>%
  mutate(
    is_peak = row_number() == which.max(µA),
    post_peak = lag(is_peak, n = 30, default = FALSE)
  )
This will give a TRUE in the new post_peak column 30 rows after the peak, so you can trivially ... %>% filter(post_peak) or do whatever you need to with the result.
If you need more help than this, please share some data that illustrates the problem better, e.g., 10 rows each of 2 chan_n groups with the goal of finding the row 3 after the peak (and that row existing in the data).
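For what it's worth, the NAs in the join are very likely a floating-point artifact: s + 30 need not be bit-identical to the s values stored in dat, so the equality join silently misses (rounding both sides, e.g. round(s, 2), would also work around it). The row-offset approach above sidesteps the comparison entirely; just note that at a 0.02 s sampling interval, 30 s corresponds to n = 30 / 0.02 = 1500 rows, not 30. A minimal sketch on made-up data (two channels of three samples, looking 1 row past the peak, and writing uA for µA to keep it ASCII):
library(dplyr)
dat <- tibble(
  chan_n = rep(c("1", "2"), each = 3),
  s      = rep(c(0.00, 0.02, 0.04), times = 2),
  uA     = c(0.1, 0.5, 0.2, 0.3, 0.9, 0.4)  # both channels peak at s = 0.02
)
dat %>%
  group_by(chan_n) %>%
  mutate(
    is_peak   = row_number() == which.max(uA),
    post_peak = lag(is_peak, n = 1, default = FALSE)
  ) %>%
  filter(post_peak)
# returns the s = 0.04 row of each channel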

How can you weight data for splitting in R?

I want to split my data into a development and a validation set. Data should be split by ID. For around 30% of the individuals in my data I have rich observations; the remaining 70% have sparse data.
For my development set, I want to include all of the individuals with rich data (even if it might not be good practice to do so), and then fill up with individuals with sparse data. The validation set should not contain any rich data.
Some example data:
# A tibble: 6 x 4
ID CONC TIME RICH
<chr> <dbl> <dbl> <dbl>
1 A 55.0 1 1
2 A 52.6 2 1
3 A 50.2 3 1
4 A 47.9 4 1
5 E 40.7 2 0
6 E 38.3 2 0
I am aware of the sample() function, but I am at a loss at how to "randomly" split data with weights.
EDIT: All IDs have several observations, and so the randomization should be on the ID depending on RICH. An individual is assigned as having rich data if there are more than n observations.
EDIT 2: The 75%/25% split should be on IDs.
Here is one rough approach:
#Unique ID's
n <- unique(df$ID)
#Get all rich ID's
rich_set <- unique(df$ID[df$RICH == 1])
#count number of unique ID's in development set
development_n <- ceiling(length(n) * 0.75)
#select random Id's to complete development set
devel_ID <- sample(setdiff(n, rich_set), development_n - length(rich_set))
#Subset data
development_set <- subset(df, ID %in% c(rich_set, devel_ID))
validation_set <- subset(df, !ID %in% c(rich_set, devel_ID))
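A quick run on toy data shows the behaviour (a sketch; the IDs and the 30/70 rich/sparse mix are made up):
set.seed(42)
df <- data.frame(ID   = rep(LETTERS[1:10], each = 2),
                 CONC = runif(20, 10, 60),
                 TIME = rep(1:2, times = 10),
                 RICH = rep(c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0), each = 2))
n <- unique(df$ID)
rich_set <- unique(df$ID[df$RICH == 1])             # A, B, C
development_n <- ceiling(length(n) * 0.75)          # 8 of 10 IDs
devel_ID <- sample(setdiff(n, rich_set), development_n - length(rich_set))
development_set <- subset(df, ID %in% c(rich_set, devel_ID))
validation_set  <- subset(df, !ID %in% c(rich_set, devel_ID))
length(unique(development_set$ID))  # 8: all rich IDs plus 5 random sparse ones
length(unique(validation_set$ID))   # 2: sparse only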

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "Between" and a column in the "truth" data frame called "Match". For every value from possible that falls between a Start and Stop, I'd like a 1, otherwise a 0. For every row in "truth" that finds a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 1
2 32241.14 0
3 68138.72 1
4 111233.93 0
5 128395.28 0
6 146180.31 0
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars = c("start", "stop"), value.name = "times")[
  possible, on = "times", roll = TRUE
][, .(id = i.id, truthid = id, times, status = factor(variable, labels = c("in", "out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
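As an aside, since data.table 1.9.8 the same question can be answered with a non-equi join, which tests interval membership directly; a minimal sketch on the same toy tables:
library(data.table)
setDT(truth)
setDT(possible)
# keep only the times that fall inside some [start, stop] interval;
# truth's id stays as `id`, possible's appears as `i.id`
truth[possible, on = .(start <= times, stop >= times), nomatch = 0L]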
I'll post a solution that I'm pretty sure works like you want it to, in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data; next time, please provide this from your own data set in your post using dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(ID    = 1:100,
                    Start = sample(5:20, size = 100, replace = TRUE),
                    Stop  = sample(21:50, size = 100, replace = TRUE))
possible <- data.frame(Times = sample(1:15, size = 15, replace = FALSE))
After getting sample data to work with; the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
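If the loop becomes slow on large data, a vectorized base-R alternative (a sketch, assuming the Start/Stop/Times column names from the question) compares every time against every interval in one shot with outer():
# rows index possible$Times, columns index truth's intervals
inside <- outer(possible$Times, truth$Start, ">=") &
          outer(possible$Times, truth$Stop, "<=")
possible$betweenAny <- as.integer(rowSums(inside) > 0)  # time lies in at least one interval
truth$anyMatch      <- as.integer(colSums(inside) > 0)  # interval contains at least one time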

Check if a variable is time invariant in R

I tried to search for an answer to my question, but I could only find one for Stata (I am using R).
I am using a national survey to study which variables influence investment in a complementary pension (it is voluntary in my country).
The survey is conducted every two years and some individuals are interviewed more than once. I filtered the df, using the filter command, so that it keeps only the individuals present more than once. This is an example from the original survey, already filtered:
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 1
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
2008 4 1972 F 33000 0
2010 4 1972 F 35000 0
where id is the individual, y.b is year of birth, pens is a dummy which takes value 1 if the individual invests in a complementary pension form.
I wanted to run a FE regression, so I loaded the plm package and then set up the df like this:
df.p <- plm.data(df, c("id", "year"))
After this command, I expected that time-invariant variables would be dropped, but after running this regression:
pan1 <- plm(pens ~ woman + age + I(age^2) + high + medium + north + centre, model = "within", effect = "individual", data = df.p, na.action = na.omit)
(where woman is a variable which takes value 1 if the individual is a woman, high and medium refer to education level, and north and centre to geographical regions) and after the command summary(pan1), the variable woman is still present.
At this point I think there are some mistakes in the survey (for example, sex was not entered correctly and so it isn't the same for the same id), so I tried to find a way to check whether, for each id, sex is constant.
I tried this code, but I am sure it is not correct:
df$x <- ifelse(df$id==df$id & df$sex==df$sex,1,0)
The basic idea should be something like this:
df$x <- ifelse(df$id=="1" & df$sex=="F",1,0)
but I can't do it manually, since the df has up to 40k observations.
If you know another way to check whether a variable is constant in R, I would be glad to hear it.
Thank you in advance.
I think what you are trying to do is calculate the number of unique values of sex for each id. You are hoping it is 1, but any cases of 2 indicate a transcription error. The way to do this in R is
any(by(df$sex,df$id,function(x) length(unique(x))) > 1)
To break that down, the function length(unique(x)) tells you the number of different unique values in a vector. It's similar to levels for a factor (but not identical, since a factor can have levels not present).
The function by calculates the given function on each subset of df$sex according to df$id. In other words, it calculates length(unique(df$sex)) where df$id is 1, then 2, etc.
Lastly, any(... > 1) checks if any of the results are more than one. If they are, the result will be TRUE (and you can use which instead of any to find which ones). If everything is okay, the result will be FALSE.
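A self-contained run of that idea, using tapply (which returns a plain named vector, so the comparison is unambiguous); the toy data deliberately gives id 1 two different sexes:
df <- data.frame(id  = c(1, 2, 1, 2, 3, 3),
                 sex = c("F", "M", "M", "M", "M", "M"))
n_sex <- tapply(df$sex, df$id, function(x) length(unique(x)))
any(n_sex > 1)           # TRUE: at least one id is inconsistent
names(n_sex)[n_sex > 1]  # "1": the offending id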
We can try with dplyr
Example data:
df <- data.frame(year = c(2002, 2002, 2004, 2004, 2006, 2008, 2008, 2010),
                 id   = c(1, 2, 1, 2, 3, 3, 4, 4),
                 sex  = c("F", "M", "M", "M", "M", "M", "F", "F"))
Id 1 is both F and M
library(dplyr)
df %>% group_by(id) %>% summarise(sexes = length(unique(sex)))
# A tibble: 4 x 2
id sexes
<dbl> <int>
1 1 2
2 2 1
3 3 1
4 4 1
We can then filter:
df %>% group_by(id) %>% summarise(sexes = length(unique(sex))) %>% filter(sexes == 2)
# A tibble: 1 x 2
id sexes
<dbl> <int>
1 1 2
