How can I count and compare data over multiple columns in R?

I am working with a dataframe which contains text data that has been categorised and coded. Each numerical value from 1-12 represents a type of word.
I want to count the frequency of occurrence of each number (1 to 12) over 6 columns (pre1 to pre6) so I know how many types of words have been used. Could anyone please advise on how to do this?
My df is structured as such:

Would something like this work for you?
library(dplyr)
df <- data.frame(pre1 = sample(1:12, 10),
                 pre2 = sample(1:12, 10),
                 pre3 = sample(1:12, 10),
                 pre4 = sample(1:12, 10),
                 pre5 = sample(1:12, 10),
                 pre6 = sample(1:12, 10))
# counts each unique combination of values across the six columns
count <- count(df, pre1, pre2, pre3, pre4, pre5, pre6)

One solution is this:
library(tidyverse)
mtcars %>%
  select(cyl, am, gear) %>%   # select the columns of interest
  gather(column, number) %>%  # reshape to long format
  count(column, number)       # get counts of numbers for each column
# # A tibble: 8 x 3
# column number n
# <chr> <dbl> <int>
# 1 am 0 19
# 2 am 1 13
# 3 cyl 4 11
# 4 cyl 6 7
# 5 cyl 8 14
# 6 gear 3 15
# 7 gear 4 12
# 8 gear 5 5
In your case, `column` will get the values pre1, pre2, etc., `number` will get the values 1-12, and `n` will be the count of a specific number in a specific column.
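Applied to your data, the same pattern would be (a sketch, assuming the coded columns are named pre1 through pre6 as in your post):
library(tidyverse)
df %>%
  select(pre1:pre6) %>%       # the six coded columns
  gather(column, number) %>%  # reshape to long format
  count(column, number)       # frequency of each code per column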

It is not entirely clear from the question whether you want frequency tables for all of these columns together or for each column separately. In possible further questions you should also make clear whether those numbers are coded as numerics, as characters, or as factors (the output of str(pCat) is a good way to do that). For this particular question, it fortunately does not matter.
The answers I have already given in the comments are
table(unlist(pCat[,4:9]))
and
table(pCat$pre3)
As an extension of the latter, I shall also point to the comment by ANG, which boils down to
lapply(pCat[,4:9], table)
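If some of the twelve codes never occur, table() silently drops them; a small variant (a sketch, assuming the codes really are the integers 1 to 12) forces all levels to appear, with zeros for absent codes:
table(factor(unlist(pCat[, 4:9]), levels = 1:12))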
These are straightforward solutions in base R without any further unnecessary packages. The answers by JonGrub and AntoniosK are based on the tidyverse. There is no obvious need to import dplyr or tidyverse for this problem, but I guess the authors load those packages anyway whenever they use R, so it does not really impose any cost on them. Other great packages to base good answers on are data.table and sqldf. Those are good packages, and many people do a lot of things in one of them that could be done in base R. The packages promise to be clearer or faster, or to reuse knowledge you might already have. Nothing is wrong with that. However, I take your question as an indication that you are still learning R, and I would advise learning base R first, before you become distracted by learning special packages and DSLs.
People have been using base R for decades and they will continue to do so. Learning base R will not distract you from R, and the knowledge will still be worthwhile in decades. Whether the same can be said for the tidyverse or data.table, time will tell (although sqldf is probably also a solid investment in the future, maybe more so than R itself).

Related

Merge two rows in columns based on condition

I am new here, as well as to R, and I couldn't find any past queries that answered my following question, so I apologise if this has already been brought up before.
I am trying to merge the ID columns from two different datasets into one, but some of the IDs in the rows have been coded differently. I need to replace all the "LNZ_" IDs with "LNZ."; however, I cannot figure out how to go about doing this.
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Sex = c("M", "F", "F", "M", "F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Type = c("S", "S", "B", "B", "B"),
                   AGE = c(56, 75, 66, 64, 64))
The above is similar to the datasets that I have, only more scaled down. I hope this is somewhat clear, and any help would be appreciated.
Thanks!
The issue with merging is that your ID columns have different formatting for some of the entries which are supposed to be matched. Therefore you need to modify those values to match before performing the merge. In the examples you gave, the difference is between a period separator (.) and an underscore (_). If your real data has more complex issues, you may need to use different code to clean up those values.
However, once that is resolved, you can perform your merge easily. Here I've used the {tidyverse} packages to accomplish both steps in one pipe chain.
library(tidyverse)
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Sex = c("M", "F", "F", "M", "F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Type = c("S", "S", "B", "B", "B"),
                   AGE = c(56, 75, 66, 64, 64))
df_2 %>%
  mutate(ID_2 = str_replace(ID_2, "\\.", "_")) %>%  # normalise "." to "_"
  left_join(df_1, by = c("ID_2" = "ID_1"))
#> ID_2 Type AGE Sex
#> 1 LNZ_00001 S 56 M
#> 2 LNZ_00002 S 75 F
#> 3 LNZ_00003 B 66 F
#> 4 DFG00001 B 64 M
#> 5 CWD00001 B 64 F
Created on 2022-07-17 by the reprex package (v2.0.1)
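For comparison, a base R sketch of the same two steps (normalise the separator with sub(), then merge()) might look like this; all.x = TRUE mimics the left join:
# replace the first literal "." with "_" so the IDs line up, then merge
df_2$ID_2 <- sub(".", "_", df_2$ID_2, fixed = TRUE)
merge(df_2, df_1, by.x = "ID_2", by.y = "ID_1", all.x = TRUE)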

Combining/aggregating data in R

I feel like this is a really simple question, and I've looked in a lot of places to try to find an answer to it, but everything seems to be looking to do a lot more than what I want.
I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories: London and notLondon. How would I do this? I'd then want to be able to run an lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would be the combined category?
Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.
Thanks!
There are multiple parts to your question so let's take it step by step.
1.
For the first part, this is a great use of forcats::fct_collapse() (fct_collapse() lives in forcats, which is loaded with the tidyverse). See example here:
library(tidyverse)
set.seed(1)
d <- sample(letters[1:5], 20, replace = TRUE) %>% factor()
# original distribution
table(d)
#> d
#> a b c d e
#> 6 4 3 1 6
# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#> a other
#> 6 14
Created on 2022-02-10 by the reprex package (v2.0.1)
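For your concrete case, a sketch could look like this (assuming the column holding the origin is called location; substitute your actual column name):
library(forcats)
location <- factor(c("London", "Sussex", "Buckinghamshire", "London"))
fct_collapse(location, London = "London", other_level = "notLondon")
#> [1] London    notLondon notLondon London
#> Levels: London notLondon
The collapsed factor can then go straight into lm() as the grouping variable.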
2.
For the second part, you will have to clarify and share some data to get more help.
3.
Here you can use dplyr::summarize(n = n()) but you need to share some data to get an answer with your specific case.
However something like:
df %>% group_by(birth_year) %>% summarize(n = n())
will give you the number of observations with that birth year listed.
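If each participant contributes many rows, a sketch that counts each person once (assuming a participant identifier column, here called participant_id; adjust to your actual column name) would be:
df %>%
  distinct(participant_id, birth_year) %>%  # one row per participant
  count(birth_year)                         # participants per birth year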

How to assign unambiguous values for each row in a data frame based on values found in rows from another data frame using R?

I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and then assign a unique identifier to each row based on values found in a second data frame. Here is a toy example.
df1 <- data.frame(c(99443975, 558, 99009680, 99044573, 599, 99172478))
names(df1) <- "Building"
V1 <- c(558, 134917, 599, 120384)
V2 <- c(4400796, 14400095, 99044573, 4500481)
V3 <- c(NA, 99009680, 99340705, 99132792)
V4 <- c(NA, 99156365, NA, 99132794)
V5 <- c(NA, 99172478, NA, 99181273)
V6 <- c(NA, NA, NA, 99443975)
row_number <- 1:4
df2 <- data.frame(cbind(V1, V2, V3, V4, V5, V6, row_number))
The output I expect is what follows.
row_number_assigned<-c(4,1,2,3,3,2)
output<-data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind argument of the which function:
sapply(df1$Building,               # sends Building entries one by one
       function(inp) {
         which(inp == df2,         # find matching values
               arr.ind = TRUE)[1]  # return only the row, not the column
       })
[1] 4 1 2 3 3 2
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much less dangerous method of dataframe construction, which also uses fewer keystrokes, would be:
df2 <- data.frame(V1 = c(558, 134917, 599, 120384),
                  V2 = c(4400796, 14400095, 99044573, 4500481),
                  V3 = c(NA, 99009680, 99340705, 99132792),
                  V4 = c(NA, 99156365, NA, 99132794),
                  V5 = c(NA, 99172478, NA, 99181273),
                  V6 = c(NA, NA, NA, 99443975))
(It didn't cause coding errors this time, but if there were any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, can you somehow approach them gently and do their future students a favor by letting them know that cbind() will coerce all of its arguments to the "lowest common denominator"?
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2 %>%
              pivot_longer(-row_number) %>%  # stack V1-V6 into one value column
              select(-name),
            by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2
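An alternative base R sketch (on the same toy data) builds a long lookup from df2 and uses match(), which avoids comparing every building against the whole frame:
# unlist() stacks df2 column by column, so repeat row_number once per column
lookup_values <- unlist(df2[, 1:6])
lookup_rows   <- rep(df2$row_number, times = 6)
df1$row_number_assigned <- lookup_rows[match(df1$Building, lookup_values)]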

What is the best way to treat labelled variables imported with haven?

I have about 15 SPSS election studies files saved as .sav files. My group and I will be recoding about 10 variables for each study to run some logistic regressions.
I have used haven to import all the files, so it looks like all the variables are of the haven_labelled class.
I have always been a little confused about how to handle this class of variables; however, I have observed a lot of improved performance as the haven and labelled packages have been updated, so I'm inclined to keep using them as opposed to, e.g., rio or foreign.
But I want to get a sense of what best practices should be before we start this effort so we don't look back with regret.
Each study file has about 200 variables, with a mix of factors and numeric variables. But to start, I'm wondering how I should go about recoding the sex variable so that I end up with a variable male where 1 is male and 0 is not.
One thing I want to ask about is the car::Recode() way of recoding variables as opposed to the dplyr::recode() way. I personally find the dplyr::recode() syntax very clunky and the help documentation poor. I am also not sure about the best way to set missing values.
To be specific, I think I have three specific questions.
Question 1: is there a compelling reason to use dplyr::recode as opposed to car::Recode? My own answer is that car::Recode() looks sufficient and easy to use.
Question 2: Should I make a point of converting variables to factors or numeric or will I be OK, leaving variables as haven_labelled with updated value labels? I am concerned about this quote from the haven documentation about the labelled_class: ''This class provides few methods, as I expect you’ll coerce to a standard R class (e.g. a factor()) soon after importing''
However, maybe the haven_labelled class has been improved and is sufficiently different from the labelled class that it is no longer necessary to force conversion to other standard R classes.
Question 3: is there any advantage to setting missing values the labelled way (e.g. na_range(), na_values()) rather than with the car::Recode() method?
My inclination is that there are clear disadvantages to using the labelled methods and that I should stick with the car::Recode() method.
Thank you.
# FAKE DATA
library(labelled)
var1 <- labelled(rep(c(1, 5), 100), c(male = 1, female = 5))
var2 <- labelled(sample(c(1, 3, 5, 7, 8, 9), size = 200, replace = TRUE),
                 c('strongly agree' = 1, 'agree' = 3, 'disagree' = 5,
                   'strongly disagree' = 7, 'DK' = 8, 'refused' = 9))
# give variable labels
var_label(var1) <- 'Respondent\'s sex'
var_label(var2) <- 'free trade is a good thing'
df <- data.frame(var1 = var1, var2 = var2)
str(df)
#This works really well; and I really like this.
look_for(df, 'sex')
look_for(df, 'free trade')
#the Car way
df$male<-car::Recode(df$var1, "5=0")
#Check results
df$male
#value labels are still there, so would have to be removed or updated
as_factor(df$male)
#Remove value labels
val_labels(df$male)<-NULL
#Check
class(df$male) #left with a numeric variable
#The other car way, keeping and modifying value labels
df$male2<-car::Recode(df$var1, "5=0")
df$male2
val_label(df$male2, 0)<-c('female')
val_label(df$male2, 5)<-NULL
val_labels(df$male2)
#Check class
class(df$male2)
#Can run numeric functions on it
mean(df$male2)
#easily convert to factor
as_factor(df$male2)
#How to handle missing values
#The CAR way
#use car to set missing values to NA
df$free_trade<-Recode(df$var2, "8=NA; 9=NA")
#Check class
class(df$free_trade)
#can still run numeric functions on haven_labelled
mean(df$free_trade, na.rm=T)
#table
table(df$free_trade)
#did the na recode work?
table(is.na(df$free_trade))
#check value labels
val_labels(df$free_trade)
#set missing values the labelled way
table(df$var2)
na_values(df$var2)<-c(8,9)
#check
df$var2
#but a table function does not pick up 8 and 9 as missing
table(df$var2)
#this seems to not work very well
table(to_factor(df$var2))
to_factor(df$var2)
A bit late in the game, but still some answers:
Should I make a point of converting variables to factors or numeric or will I be OK, leaving variables as haven_labelled with updated value labels?
First, you need to understand that haven_labelled vectors are all of numeric type (i.e. they will be treated as continuous variables), which you can easily check with:
library(tidyverse)
df %>%
as_tibble() %>%
head()
which gives:
# A tibble: 6 x 2
var1 var2
<dbl+lbl> <dbl+lbl>
1 1 [male] 5 [disagree]
2 5 [female] 5 [disagree]
3 1 [male] 3 [agree]
4 5 [female] 5 [disagree]
5 1 [male] 7 [strongly disagree]
6 5 [female] 9 [refused]
The question of whether you should convert to a standard type probably depends on your analysis.
For simple frequency tables it's probably fine to leave as is, e.g.
df %>%
as_tibble() %>%
count(var1)
# A tibble: 2 x 2
var1 n
<dbl+lbl> <int>
1 1 [male] 100
2 5 [female] 100
However, for any analysis that is type sensitive (this already starts with calculating means, but also regression etc.), you definitely should convert your variables to a class appropriate for your analyses. Not doing so and treating everything as continuous will give you wrong results. Just think of a truly categorical variable like 1=Bus, 2=Car, 3=Bike that you'd throw into a linear regression.
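A minimal sketch of such conversions, using the fake data from the question (as_factor() and zap_labels() are from haven):
library(haven)
df$var1_f   <- as_factor(df$var1)   # categorical: labels become factor levels
df$var2_num <- zap_labels(df$var2)  # continuous: drop labels, keep numeric codes
# a type-sensitive analysis then behaves correctly, e.g.
# summary(lm(var2_num ~ var1_f, data = df))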
Is there a compelling reason to use dplyr::recode as opposed to car::Recode?
There is no right or wrong here. Personally, I have a preference for staying within the tidyverse, because it has easy recode functions, e.g. the recode you mentioned, but for more complex tasks you can also use if_else or case_when. And then you also have lots of functions to deal with missings, like replace_na or na_if or coalesce. The syntax of car::Recode isn't much different from dplyr's, so it's really mostly personal preference, I'd say.
The same is true for your question of whether you should use the functions from labelled or not. The labelled package indeed adds some very powerful functions for dealing with labelled vectors that go beyond what haven or the tidyverse offer, so IMO it's a good package to use.
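For the concrete male dummy from your question, a tidyverse sketch on the fake data might be:
library(dplyr)
library(haven)
df <- df %>%
  mutate(male = if_else(zap_labels(var1) == 1, 1, 0))  # 1 = male, 0 = not male
Whether you then keep male as plain numeric or re-attach value labels is, again, mostly a matter of taste.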

Timeseries average based on a defined time interval (bin)

Here is an example of my dataset. I want to calculate a binned average based on time (i.e., ts) every 10 seconds. Could you please provide some hints so that I can carry on?
In my case, I want to average time (ts) and Var every 10 seconds. For example, I will get one averaged value of Var and ts from 0 to 10 seconds, another averaged value of Var and ts from 11 to 20 seconds, etc.
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))
Which functions or libraries in R can I use for this task?
There are many ways to calculate a binned average: with base aggregate or by, with the packages dplyr or data.table, probably with zoo, and surely with other timeseries packages...
library(dplyr)
df %>%
  group_by(interval = round(ts / 10) * 10) %>%  # bin by rounding to the nearest 10
  summarize(Var_mean = mean(Var))
# A tibble: 11 x 2
interval Var_mean
<dbl> <dbl>
1 0 4.561653
2 10 6.544980
3 20 6.110336
4 30 4.288523
5 40 5.339249
6 50 6.811147
7 60 6.180795
8 70 4.920476
9 80 5.486937
10 90 5.284871
11 100 5.917074
That's the dplyr approach; see how it and data.table let you name the intermediate variables, which keeps the code clean and legible.
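For completeness, a base R sketch with cut() and aggregate(), binning into right-closed intervals (0,10], (10,20], ... as described in the question:
df$bin <- cut(df$ts, breaks = seq(0, 100, by = 10))
aggregate(Var ~ bin, data = df, FUN = mean)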
Assuming df in the question, convert to a zoo object and then aggregate.
The second argument of aggregate.zoo is a vector the same length as the time vector giving the new times that each original time is to be mapped to. The third argument is applied to all time series values whose times have been mapped to the same value. This mapping could be done in various ways but here we have chosen to map times (0, 10] to 10, (10, 20] to 20, etc. by using 10 * ceiling(time(z) / 10).
In light of some of the other comments in the answers, let me point out that, in contrast to using a data frame, there is significant simplification here. Firstly, the data has been reduced to one dimension (vs. 2 in a data.frame). Secondly, it is more conducive to the whole-object approach, whereas with data frames one needs to continually pick apart the object and work on those parts. Thirdly, one now has all the facilities of zoo to manipulate the time series: numerous NA-removal schemes, rolling functions, overloaded arithmetic operators, n-way merges, simple access to classic, lattice and ggplot2 graphics, a design which emphasizes consistency with base R making it easy to learn, extensive documentation including 5 vignettes plus help files with numerous examples, and likely very few bugs given the 14 years of development and widespread use.
library(zoo)
z <- read.zoo(df)
z10 <- aggregate(z, 10 * ceiling(time(z) / 10), mean)
giving:
> z10
10 20 30 40 50 60 70 80
5.629926 6.571754 5.519487 5.641534 5.309415 5.793066 4.890348 5.509859
90 100
4.539044 5.480596
(Note that the data in the question is not reproducible because it used random numbers without set.seed so if you try to repeat the above you won't get an identical answer.)
Now we could plot it, say, using any of these:
plot(z10)
library(lattice)
xyplot(z10)
library(ggplot2)
autoplot(z10)
In general, I agree with @smci: the dplyr and data.table approaches are the best here. Let me elaborate a bit further.
# the dplyr way
library(dplyr)
df %>%
  group_by(interval = ceiling(seq_along(ts) / 20)) %>%  # 20 rows = 10 s at 0.5 s spacing
  summarize(variable_mean = mean(Var))
# the data.table way
library(data.table)
dt <- data.table(df)
dt[, list(Var_mean = mean(Var)),
   by = list(interval = ceiling(seq_along(dt$ts) / 20))]
I would not go for the traditional time series solutions like ts, zoo or xts here. Their methods are more suited to regular frequencies like monthly or quarterly data. Apart from ts, they can also handle irregular frequencies and high-frequency data, but many methods, such as the print methods, don't work well or at least do not give you an advantage over data.table or data.frame.
As long as you're just aggregating and grouping, both data.table and dplyr are also likely faster in terms of performance. I'd guess data.table has the edge over dplyr in terms of speed, but you would have to benchmark/profile that, e.g. using microbenchmark, as sketched below. So if you're not working with a classic R time series format anyway, there's no reason to switch to those classes just for aggregating.
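A sketch of such a benchmark with microbenchmark, comparing the two grouped aggregations above (timings will of course depend on your data and machine):
library(microbenchmark)
library(dplyr)
library(data.table)
df <- data.frame(ts = seq(1, 100, by = 0.5), Var = runif(199, 1, 10))
dt <- data.table(df)
microbenchmark(
  dplyr = df %>%
    group_by(interval = ceiling(seq_along(ts) / 20)) %>%
    summarize(variable_mean = mean(Var)),
  data.table = dt[, list(Var_mean = mean(Var)),
                  by = list(interval = ceiling(seq_along(dt$ts) / 20))],
  times = 100
)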
