How to find correlation in a data set - r

I wish to find the correlation between trip duration and age in the data set below. I am calling cor(age, df$tripduration), but it returns NA. Could you please let me know how to compute the correlation? I derived "age" with the following syntax:
age <- (2017-as.numeric(df$birth.year))
and the trip duration (in seconds) is df$tripduration.
Below is the data. In the gender column, 1 means male and 2 means female.
tripduration birth.year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2

I think you are trying to subtract a data frame from a number, which would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
>[1] 0.08366848
# Sanity check: correlating with birth year instead of age only flips the sign
cor(df$tripduration, df$birth.year)
>[1] -0.08366848
By the way, please format the question with easily reproducible data that people can copy and paste straight into R. This actually helps you get an answer.
Based on the OP's comment, here is a new suggestion. Try deleting the rows that contain NA before computing the correlation.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
>[1] 0.1726607
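As an alternative to dropping rows by hand, cor() can skip incomplete pairs itself through its use argument. The sketch below is only an illustration: it reuses the 13 sample rows from the question and sets one birth year to NA on purpose (that missing value is an assumption, not part of the original data), so the resulting coefficient is not the OP's.

```r
# Sample rows copied from the question
df <- data.frame(
  tripduration = c(439, 186, 442, 170, 189, 494, 152, 537, 509, 157, 1080, 239, 344),
  birth.year   = c(1980, 1984, 1969, 1986, 1990, 1984, 1972, 1994, 1994, 1985, 1976, 1976, 1992),
  gender       = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
)
df$birth.year[3] <- NA  # hypothetical missing value, added for illustration

age <- 2017 - as.numeric(df$birth.year)

# Default behaviour: a single NA makes the whole result NA
r_default <- cor(age, df$tripduration)

# Let cor() drop the incomplete pairs
r_complete <- cor(age, df$tripduration, use = "complete.obs")

# Same result as filtering with complete.cases() first
keep <- complete.cases(df)
r_manual <- cor(age[keep], df$tripduration[keep])
```

Running sum(is.na(df$birth.year)) first is a quick way to confirm that missing values are really the cause of the NA.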

Related

rowwise multiplication of two different dataframes dplyr

I have two data frames and I want to multiply one column of the first (pop$Population) by values from the other, sometimes by the mean of a column or of a subset (here, for example, by the mean of Df$energy).
Since I want the results per year, I additionally need to multiply by 365 (days), and I need one result for each Year.
age<-c("6 Months","9 Months", "12 Months")
energy<-c(2.5, NA, 2.9)
Df<-data.frame(age,energy)
Age<-1
Year<-c(1990,1991,1993, 1994)
Population<-c(200,300,400, 250)
pop<-data.frame(Age, Year,Population)
pop:
Age Year Population
1 1 1990 200
2 1 1991 300
3 1 1993 400
4 1 1994 250
Df:
age energy
1 6 Months 2.5
2 9 Months NA
3 12 Months 2.9
This is what I tried, but I got an error:
pop$energy<-pop$Population%>%
rowwise()%>%
transmute("energy_year"= .%*% mean(Df$energy, na.rm = T)%*%365)
Error in UseMethod("rowwise") :
no applicable method for 'rowwise' applied to an object of class "c('double', 'numeric')"
I would like to end up with a data frame like this:
Age Year Population energy_year
1 1 1990 200 197100
2 1 1991 300 295650
3 1 1993 400 394200
4 1 1994 250 246375
pop$Population is a vector, not a data frame, hence the error: rowwise() only has methods for data frames.
For your use case the simplest thing to do is to mutate the whole data frame (after library(dplyr)):
pop %>% mutate(energy_year= Population * mean(Df$energy, na.rm = T) * 365)
This will give you the output:
Age Year Population energy_year
1 1 1990 200 197100
2 1 1991 300 295650
3 1 1993 400 394200
4 1 1994 250 246375
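For comparison, here is a minimal base R sketch of the same calculation, assuming the Df and pop data frames defined in the question. It relies on the fact that arithmetic in R is vectorised, so no row-wise step is needed; mean_energy is just an illustrative helper name.

```r
# Data frames as defined in the question
Df  <- data.frame(age = c("6 Months", "9 Months", "12 Months"),
                  energy = c(2.5, NA, 2.9))
pop <- data.frame(Age = 1,
                  Year = c(1990, 1991, 1993, 1994),
                  Population = c(200, 300, 400, 250))

# Mean daily energy, ignoring the missing value
mean_energy <- mean(Df$energy, na.rm = TRUE)

# Vectorised: each Population value is multiplied by the same scalars
pop$energy_year <- pop$Population * mean_energy * 365
```

Note that %*% in the original attempt is matrix multiplication; plain * is the element-wise operator wanted here.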

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2, dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
**visit.dx**
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <- dcast(
  setDT(long),
  id + visit.date + visit.id + bill.num ~ visit.dx,
  value.var = c("dx.code", "FY")
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and a new data frame full of 0s and 1s. There are no duplicate rows in the data frame, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extends dcast to accept multiple value.var columns and provides rowid(), so:
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
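The warning generally appears when some combination of the formula's left-hand and right-hand variables matches more than one row, so dcast has to aggregate. The base R sketch below, using a trimmed copy of the question's data, shows one way to check for that and what a per-id row counter (the idea behind rowid(id)) does about it. The formula id ~ Dx.num and the name row_in_id are illustrative assumptions, not something taken from the OP's code.

```r
# Trimmed copy of the data from the question
long <- data.frame(
  id         = c(1, 1, 2, 2, 3),
  visit.date = c("1/2/12", "3/4/12", "5/6/18", "5/6/18", "2/9/14"),
  dx.code    = c(409, 512, 488, 122, 923),
  Dx.num     = c(1, 1, 1, 2, 1)
)

# With a formula such as id ~ Dx.num, patient 1 has two rows in the same cell,
# which is exactly the situation that triggers the aggregation warning
dup_index <- anyDuplicated(long[c("id", "Dx.num")])

# A running counter within each id, analogous to data.table::rowid(id)
row_in_id <- ave(seq_along(long$id), long$id, FUN = seq_along)

# id plus the counter identifies every row uniquely, so nothing is aggregated
dup_after <- anyDuplicated(data.frame(id = long$id, k = row_in_id))
```

A nonzero dup_index means the chosen formula cannot place each row in its own cell; dup_after being 0 is why id ~ rowid(id) casts without the warning.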

Alter variable to lag by year

I have a data set I need to test for autocorrelation in a variable.
To do this, I want to first lag it by one period, to test that autocorrelation.
However, as the data is on US elections, it is only available in two-year intervals, i.e. 1968, 1970, 1972, etc.
As far as I know, I'll need to somehow alter the year variable so that it runs annually, so that I can lag the variable of interest by one period/year.
I assume that dplyr is helpful in some way, but I am not sure how.
Yes, dplyr has a helpful lag function that works well in these cases. Since you didn't provide sample data or the specific test that you want to perform, here is a simple example showing an approach you might take:
> df <- data.frame(year = seq(1968, 1978, 2), votes = sample(1000, 6))
> df
year votes
1 1968 565
2 1970 703
3 1972 761
4 1974 108
5 1976 107
6 1978 449
> dplyr::mutate(df, vote_diff = votes - dplyr::lag(votes))
year votes vote_diff
1 1968 565 NA
2 1970 703 138
3 1972 761 58
4 1974 108 -653
5 1976 107 -1
6 1978 449 342
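Because a lag operates on rows rather than calendar years, the year variable does not need to be altered as long as the rows are sorted by year and no election is missing. Here is a minimal base R sketch of a lag-1 (one election back) correlation, reusing the votes from the example output above; votes_lag1 and r_lag1 are illustrative names.

```r
# Votes taken from the example output above
df <- data.frame(year  = seq(1968, 1978, 2),
                 votes = c(565, 703, 761, 108, 107, 449))

# Make sure rows are in time order before lagging
df <- df[order(df$year), ]

# Shift votes down by one row: each election is paired with the previous one
df$votes_lag1 <- c(NA, head(df$votes, -1))

# Pearson correlation between votes and their one-period lag,
# skipping the first row, which has no predecessor
r_lag1 <- cor(df$votes, df$votes_lag1, use = "complete.obs")
```

For a fuller picture, stats::acf(df$votes, plot = FALSE) reports autocorrelations at several lags; its lag-1 estimate is computed slightly differently, so it will not match r_lag1 exactly.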

R- combine rows of a data frame to be unique by 3 columns

I have data frame looking like this:
> head(temp)
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 <NA> 36.4 <NA>
24339 2008253059 695 696 132 <NA> <NA>
72450 2008953178 527 528 <NA> 38.6 <NA>
72957 2008953178 527 528 123 <NA> <NA>
73976 2008965669 527 528 <NA> 36.2 <NA>
74504 2008965669 527 528 116 <NA> <NA>
The first and second rows are both for the same patient (same VisitIDCode) and the same interval (start 695, stop 696): one row holds the temperature and the other holds the heart rate. I want to combine these rows so that the result is one row that looks like:
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 132 36.4 <NA>
In other words, I want my data frame to be unique by combination of VisitIDCode, start and stop. This is a large dataframe with more columns that need to be combined.
What is the best way of doing this, ideally without a for loop?
Edit: I don't want to remove the NAs. If there are 2 rows each of which have one value and 2 NAs, I want to combine them to one row so it has two values and one NA. Like the example above.
nasim,
It's useful to create a reproducible example when posting questions. It makes it much easier to sort out how to help. I created a toy example here. Hopefully, that reproduces your issue:
> df <- data.frame(MRN = c(123,125,213,214),
+ VID = c(2008,2008,2011,2011),
+ start=c(695,695),
+ heart.rate = c(NA,112,NA,96),
+ temp = c(39.6,NA,37.4,NA))
> df
MRN VID start heart.rate temp
1 123 2008 695 NA 39.6
2 125 2008 695 112 NA
3 213 2011 695 NA 37.4
4 214 2011 695 96 NA
Here is a solution using dplyr:
> library(dplyr)
> df <- df %>%
+ group_by(VID) %>%
+ summarise(MRN = max(MRN,na.rm=T),
+ start=max(start,na.rm=T),
+ heart.rate=max(heart.rate,na.rm=T),
+ temp = max(temp,na.rm=T))
> df
# A tibble: 2 × 5
VID MRN start heart.rate temp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2008 125 695 112 39.6
2 2011 214 695 96 37.4
After I made sure all column classes were numeric (not factors) by defining the column classes while reading the data in, this worked for me:
CompleteCoxObs<-aggregate(x=CompleteCoxObs[c("stop","Value_EVS current weight kg CAL","Value_EVS hr heart rate NU EE0A","Value_EVS temp celsius CAL 113C")], by=list(VisitIDCode=CompleteCoxObs$VisitIDCode,start=CompleteCoxObs$start), max, na.rm = FALSE);
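A variation on the same idea, sketched in base R under some assumptions: instead of max, take the first non-missing value in each group, which also works for non-numeric columns and leaves NA where a group has no value at all. The short column names (hr, temp_c, weight) and the helper first_non_na are made up for this illustration; the values are copied from the question.

```r
# Four of the rows from the question, with shortened column names
temp <- data.frame(
  VisitIDCode = c(2008253059, 2008253059, 2008953178, 2008953178),
  start       = c(695, 695, 527, 527),
  stop        = c(696, 696, 528, 528),
  hr          = c(NA, 132, NA, 123),
  temp_c      = c(36.4, NA, 38.6, NA),
  weight      = c(NA_real_, NA_real_, NA_real_, NA_real_)
)

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# One row per VisitIDCode/start/stop combination
combined <- aggregate(
  x   = temp[c("hr", "temp_c", "weight")],
  by  = temp[c("VisitIDCode", "start", "stop")],
  FUN = first_non_na
)
```

If a group holds two different non-missing values in the same column, this keeps only the first one, so it is worth checking for such conflicts beforehand.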

Coding for the onset of an event in panel data in R

I was wondering if you could help me devise an effortless way to code this country-year event data that I'm using.
In the example below, each row corresponds with an ongoing event (that I will eventually fold into a broader panel data set, which is why it looks bare now). So, for example, country 29 had the onset of an event in 1920, which continued (and ended) in 1921. Country 23 had the onset of the event in 1921, which lasted until 1923. Country 35 had the onset of an event that occurred in 1921 and only in 1921, et cetera.
country year
29 1920
29 1921
23 1921
23 1922
23 1923
35 1921
64 1926
135 1928
135 1929
135 1930
135 1931
135 1932
135 1933
135 1934
120 1930
70 1932
What I want to do is create "onset" and "ongoing" variables. The "ongoing" variable in this sample data frame would be easy. Basically: Data$ongoing <- 1
I'm more interested in creating the "onset" variable. It would be coded as 1 if it marks the onset of the event for the given country. Basically, I want to create a variable that looks like this, given this example data.
country year onset
29 1920 1
29 1921 0
23 1921 1
23 1922 0
23 1923 0
35 1921 1
64 1926 1
135 1928 1
135 1929 0
135 1930 0
135 1931 0
135 1932 0
135 1933 0
135 1934 0
120 1930 1
70 1932 1
If you can think of effortless ways to do this in R (that minimizes the chances of human error when working with it in a spreadsheet program like Excel), I'd appreciate it. I did see this related question, but this person's data set doesn't look like mine and it may require a different approach.
Thanks. Reproducible code for this example data is below.
country <- c(29,29,23,23,23,35,64,135,135,135,135,135,135,135,120,70)
year <- c(1920,1921,1921,1922,1923,1921,1926,1928,1929,1930,1931,1932,1933,1934,1930,1932)
Data=data.frame(country=country,year=year)
summary(Data)
Data
This should work, even with multiple onsets per country:
Data$onset <- with(Data, ave(year, country, FUN = function(x)
as.integer(c(TRUE, tail(x, -1L) != head(x, -1L) + 1L))))
You could also do this:
library(data.table)
setDT(Data)[, onset := (min(country*year)/country == year) + 0L, country]
This could be very fast when you have a larger dataset.
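The two approaches differ when a country has more than one spell. The sketch below uses the example data plus one extra row (country 29 in 1925, a hypothetical second event added only for illustration) and assumes rows are sorted by year within each country. onset_gap follows the ave() approach above; onset_first is a base R equivalent of flagging only each country's earliest year, which is what the min-based expression amounts to.

```r
# Example data from the question, plus one hypothetical extra row (29, 1925)
country <- c(29, 29, 23, 23, 23, 35, 64, 135, 135, 135, 135, 135, 135, 135, 120, 70, 29)
year    <- c(1920, 1921, 1921, 1922, 1923, 1921, 1926, 1928, 1929, 1930,
             1931, 1932, 1933, 1934, 1930, 1932, 1925)
Data <- data.frame(country = country, year = year)

# Onset = first row of a country, or any year that does not directly
# follow the previous year for that country
Data$onset_gap <- with(Data, ave(year, country, FUN = function(x)
  as.integer(c(TRUE, tail(x, -1L) != head(x, -1L) + 1L))))

# Onset = only the earliest year observed for each country
Data$onset_first <- as.integer(Data$year == ave(Data$year, Data$country, FUN = min))
```

Here the extra 1925 row gets onset_gap = 1 but onset_first = 0, so the gap-based version is the one to use if events can recur.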
