I have two data frames, Test and User.
Test has 100,000 rows while User has 1,400,000 rows. I want to extract specific columns from the User data frame and merge them into the Test data frame, e.g. Income and Cat for every row in Test. Test contains repeated Names, and for each one I want a single value from User; I want to keep the Test file as-is, without removing duplicates.
For example, for Name A the Income is 100 and Cat is both M and L; since M occurs first, I need M.
> Test
Name Income Cat
A
B
C
D
...
> User
Name Cat Income
A    M   100
B    M   320
C    U   400
D    L   900
A    L   100
...
I used a for loop, but it takes a lot of time. I do not want to use the merge function.
for (i in 1:nrow(Test)) {
  Test[i, "Cat"]    <- User[which(User$Name == Test[i, "Name"]), "Cat"][1]
  Test[i, "Income"] <- User[which(User$Name == Test[i, "Name"]), "Income"][1]
}
I used merge as well, but then the overall row count for the Test file is more than 100k; it appends extra rows. I want a faster way that avoids both the for loop and merge. Can someone suggest any apply-family functions?
You can use match to find the first matching row (then vectorize the copying):
# Set up the data
User <- data.frame(User = c('A','B','C','D','A'),
                   Cat = c('M','M','U','L','L'),
                   Income = c(100, 320, 400, 900, 100))
Test <- data.frame(Name = c('A','B','C','D'))
Test$Income <- NA
Test$Cat <- NA
> Test
Name Income Cat
1 A NA NA
2 B NA NA
3 C NA NA
4 D NA NA
## Copy only the first match from User to Test
Test[, c("Income","Cat")] <- User[match(Test$Name, User$User), c("Income","Cat")]
> Test
Name Income Cat
1 A 100 M
2 B 320 M
3 C 400 U
4 D 900 L
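At the sizes mentioned in the question (1.4M rows in User), a data.table update-join is another fast option. A sketch, assuming the question's column names (Name, Cat, Income); the small tables here just mirror the example:

```r
library(data.table)

User <- data.table(Name = c("A", "B", "C", "D", "A"),
                   Cat = c("M", "M", "U", "L", "L"),
                   Income = c(100, 320, 400, 900, 100))
Test <- data.table(Name = c("A", "B", "C", "D"))

# Keep only the first row per Name, then join it onto Test by reference;
# every row of Test is preserved, duplicates included
firstUser <- unique(User, by = "Name")
Test[firstUser, on = "Name", `:=`(Income = i.Income, Cat = i.Cat)]
```

The `i.` prefix refers to columns of firstUser inside the join; since `:=` updates Test by reference, no copy of the 100k-row table is made.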
Using the dplyr package you can do something like this:
library(dplyr)
df %>% group_by(Name) %>% slice(1)
For your example, you get:
Original data frame:
df
Name Cat Income
1 A M 100
2 B M 320
3 C U 400
4 D L 900
5 A L 100
Picking first occurrence:
df %>% group_by(Name) %>% slice(1)
Source: local data frame [4 x 3]
Groups: Name [4]
Name Cat Income
(chr) (chr) (int)
1 A M 100
2 B M 320
3 C U 400
4 D L 900
Related
I have a dataframe that contains patients with a history of their diagnosis codes over the past 10 years; something like:
Patient_ID Diagnosis_Codes Diag_Code_Description
A 1 1:Hypertension
A 1 1:Hypertension
A 4 4:Diabetes
B 3 3:Depression
B 3 3:Depression
C 1 1:Hypertension
C 4 4:Diabetes
C 4 4:Diabetes
… … …
I want to make a dataframe that has one row per unique Patient_ID and a separate column for each diagnosis code, containing the frequency of that code's incidence for the patient, like the following table, but I don't know how to approach this task in R:
Patient_ID Diag1_freq Diag2_freq Diag3_freq Diag4_freq …
A 2 0 0 1 …
B 0 0 2 0 …
C 1 0 0 2 …
… … … … … …
The real data has almost 60,000 patients, and the diagnosis codes range between 1 and 999, so the result dataframe would have 60,000 rows and 999 columns. The Patient_IDs in the real dataset are numeric, not strings; I used "A", "B" and "C" to avoid confusion. I appreciate any help, and many thanks in advance.
You can use aggregate(), dplyr::group_by() %>% summarise(), or the data.table package; see the data.table documentation for more details.
Example of using dplyr:
a <- group_by(dataframe, Patient_ID)
This aggregates the data at the unique Patient_ID level.
b <- summarise(a,
Diag1_freq = length(Diagnosis_Codes[Diagnosis_Codes==1]),
Diag2_freq = ...
...)
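For this particular shape of problem (pure counts of code per patient), base R's table() builds the whole patient-by-code frequency matrix in one call; a minimal sketch on the example data. Setting the factor levels to 1:999 on the real data would produce all 999 columns, zeros included:

```r
theData <- data.frame(
  Patient_ID = c("A", "A", "A", "B", "B", "C", "C", "C"),
  Diagnosis_Codes = c(1, 1, 4, 3, 3, 1, 4, 4)
)

# rows = patients, columns = diagnosis codes, cells = frequencies;
# levels = 1:4 here for brevity, use 1:999 on the real data
freq <- table(theData$Patient_ID,
              factor(theData$Diagnosis_Codes, levels = 1:4))
freqDF <- as.data.frame.matrix(freq)   # convert to a plain data frame
```

freqDF then has one row per patient and one column per code, with zero counts filled in automatically.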
Here is an approach that uses the diagnosis values to create a new variable, then uses the dcast() function from the reshape2 package to cast the data.
rawData <- "Patient_ID Diagnosis_Codes Diag_Code_Description
A 1 1:Hypertension
A 1 1:Hypertension
A 4 4:Diabetes
B 3 3:Depression
B 3 3:Depression
C 1 1:Hypertension
C 4 4:Diabetes
C 4 4:Diabetes"
theData <- read.table(textConnection(rawData),header=TRUE)
library(reshape2)
theData$variable <- sprintf("diag%04d",theData$Diagnosis_Codes)
castData <- dcast(theData,Patient_ID ~ variable)
With no aggregation function supplied, dcast() defaults to length(), i.e. it counts occurrences. The output looks like this:
  Patient_ID diag0001 diag0003 diag0004
1          A        2        0        1
2          B        0        2        0
3          C        1        0        2
I have a data frame of the following type:
date ID1 ID2 sum
2017-1-5 1 a 200
2017-1-5 1 b 150
2017-1-5 2 a 300
2017-1-4 1 a 200
2017-1-4 1 b 120
2017-1-4 2 a 300
2017-1-3 1 b 150
I'm trying to compare column combinations across dates to see whether the sum values are equal. In the example above, I'd like the code to identify that the sum for the combination [ID1=1, ID2=b] differs between 2017-1-5 and 2017-1-4 (in my real data I have more than 2 ID categories and more than 2 dates).
I'd like my output to be a data frame which contains all the combinations that include (at least one) unequal results. In my example:
date ID1 ID2 sum
2017-1-5 1 b 150
2017-1-4 1 b 120
2017-1-3 1 b 150
I tried to solve it using loops, along the lines of "Is there an R function that applies a function to each pair of columns?", with no great success.
Your help will be appreciated.
Using dplyr, we can group_by_(.dots=paste0("ID",1:2)) and then see if the values are unique:
library(dplyr)
res <- df %>% group_by_(.dots=paste0("ID",1:2)) %>%
mutate(flag=(length(unique(sum))==1)) %>%
ungroup() %>% filter(flag==FALSE) %>% select(-flag)
The group_by_ allows you to group multiple ID columns easily. Just change 2 to however many ID columns (i.e., N) you have assuming that they are numbered consecutively from 1 to N. The column flag is created to indicate if all of the values are the same (i.e., number of unique values is 1). Then we filter for results for which flag==FALSE. This gives the desired result:
res
## A tibble: 3 x 4
## date ID1 ID2 sum
## <chr> <int> <chr> <int>
##1 2017-1-5 1 b 150
##2 2017-1-4 1 b 120
##3 2017-1-3 1 b 150
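group_by_() has since been deprecated in dplyr; the same idea can be written with across() plus n_distinct(), which also avoids the helper flag column. A sketch on the example data:

```r
library(dplyr)

df <- data.frame(
  date = c("2017-1-5", "2017-1-5", "2017-1-5",
           "2017-1-4", "2017-1-4", "2017-1-4", "2017-1-3"),
  ID1  = c(1, 1, 2, 1, 1, 2, 1),
  ID2  = c("a", "b", "a", "a", "b", "a", "b"),
  sum  = c(200, 150, 300, 200, 120, 300, 150),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(across(starts_with("ID"))) %>%  # picks up ID1, ID2, ID3, ...
  filter(n_distinct(sum) > 1) %>%          # keep groups whose sums differ
  ungroup()
```

across(starts_with("ID")) adapts automatically to however many ID columns exist, so no count needs to be hard-coded.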
This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,3,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique combinations of columns a, b, c, d that have a repeated value in column e.
For example:
a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 3
5 5 5 5 5 5
Rows 3 and 4 are identical across columns a through d (the combination 4 3 4 4), so only one instance of them should be returned, but they have 2 repeated values in column e. I want a count of those: the combination 4 3 4 4 has 2 repeated values in column e.
The expected output would be how many times a certain combination, such as 4 3 4 4, had repeated values in column e. So in this case it would be something like (the last column being the count):
a b c d e
4 3 4 4 2
Both R and SQL work, whatever does the job.
Again, see my comments above, but I believe the following gives you a start on your first question. First, create a "key" variable (here named key_abcd, using tidyr::unite to combine columns a, b, c, and d). Then count rows by key_abcd and e; the grouping is implicit in count().
library(tidyr)
library(dplyr)
df <- data.frame(a, b, c, d, e)
df %>%
unite(key_abcd, a, b, c, d) %>%
count(key_abcd, e)
# key_abcd e n
# (chr) (dbl) (int)
# 1 1_1_1_2 1 1
# 2 1_2_4_2 5 1
# 3 4_3_4_4 3 2
# 4 5_5_5_5 5 1
It appears from how you've worded the question that you are only interested in combinations occurring more than once; if so, you can add %>% filter(n > 1) to the above code.
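The same count is available in base R by pasting the key columns together and cross-tabulating against e; a sketch:

```r
a <- c(1, 1, 4, 4, 5); b <- c(1, 2, 3, 3, 5); c <- c(1, 4, 4, 4, 5)
d <- c(2, 2, 4, 4, 5); e <- c(1, 5, 3, 3, 5)
df <- data.frame(a, b, c, d, e)

# Build the combination key, then count (key, e) pairs
key <- with(df, paste(a, b, c, d, sep = "_"))
counts <- as.data.frame(table(key = key, e = df$e))
counts[counts$Freq > 1, ]   # combinations with a repeated value in e
```

Only the 4_3_4_4 combination survives the Freq > 1 filter, with a count of 2.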
In my data frame df I want to get the Id numbers satisfying the condition that the value for A is greater than the value for B. In the example I would want only Id = 2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA <- vector()
vecB <- vector()
vecId <- vector()
i <- 1
while (i <= dim(df)[1]) {
  if (df$Name[[i]] == "A") { vecA <- c(vecA, df$Value[i]) }
  if (df$Name[[i]] == "B") { vecB <- c(vecB, df$Value[i]) }
  if (vecA[i] > vecB[i]) { vecId <- c(vecId, df$Id[i]) }
  i <- i + 1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
The first answer works out well, for sure. I wanted to show regular subset operations as well, and came up with this, since you might want to check out some of the more recent R packages. If you had 3 groups to compare, that would be interesting. In the code below, exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
comp <- exp %>% filter(Name %in% c("A","B")) %>% group_by(Id) %>% filter(min_rank(Value)>1)
# If the whole row is needed
comp[which.max(comp$Value),]
# If not
comp[which.max(comp$Value),"Id"]
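A base-R alternative without reshaping: pull the A and B values into vectors named by Id and compare them positionally. This sketch assumes each Id has exactly one A row and one B row, as in the example:

```r
df <- data.frame(Id = c(1, 1, 1, 2, 2, 2),
                 Name = c("A", "B", "C", "A", "B", "C"),
                 Value = c(3, 5, 4, 7, 6, 8))

# Named vectors: names are Ids, values are the A (resp. B) values
a_vals <- with(df, setNames(Value[Name == "A"], Id[Name == "A"]))
b_vals <- with(df, setNames(Value[Name == "B"], Id[Name == "B"]))

# Ids whose A value exceeds their B value
ids <- as.numeric(names(a_vals)[a_vals > b_vals[names(a_vals)]])
```

Indexing b_vals by the names of a_vals keeps the comparison correct even if the rows are not sorted by Id.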
So I have a data frame with two columns, time and team.
df <- data.frame(time = rep(1:3, 3), team = LETTERS[rep(1:3, each = 3)])
>   time team
> 1    1    A
> 2    2    A
> 3    3    A
> 4    1    B
> 5    2    B
> 6    3    B
> 7    1    C
> 8    2    C
> 9    3    C
How do I split the data.frame by time and then merge it back together by time? Something like this:
>   time df.A df.B df.C
> 1    1    A    B    C
> 2    2    A    B    C
> 3    3    A    B    C
I figured out how to split the data.frame using split or dlply, but I haven't had any success using cbind or merge to put the data frame back together.
Also, the lengths of each (split) list element are different, so any help adding NAs into the mix would be greatly appreciated. Thanks.
You can use reshape for this:
> df$tmp <- df$team
> reshape(df, idvar='time', timevar='team', direction='wide')
time tmp.A tmp.B tmp.C
1 1 A B C
2 2 A B C
3 3 A B C
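In current tidyr, the same widening can be done with pivot_wider(); a sketch. Using the team column as both names_from and values_from duplicates it into one column per team, and any missing time/team combination becomes NA automatically:

```r
library(tidyr)

df <- data.frame(time = rep(1:3, 3), team = LETTERS[rep(1:3, each = 3)],
                 stringsAsFactors = FALSE)

# One column per team, prefixed "df." to match the desired output
wide <- pivot_wider(df, names_from = team, values_from = team,
                    names_prefix = "df.")
```

This avoids the temporary tmp column needed by reshape(), and handles unequal group lengths by filling with NA.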