This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I have been working on this for a while and still have not found an efficient way to handle it.
I have two data frames, both very large.
df1 looks like (one ID could have multiple prices):
ID Price
1 20
1 9
2 12
3 587
3 59
3 7
4 78
5 12
6 290
6 191
...
1000000 485
df2 looks like (one ID has only one location):
ID Location
1 USA
2 CAN
3 TWN
4 USA
5 CAN
6 AUS
...
100000 JAP
I want to create a new data frame that looks like this (adding Location to df1 based on ID):
ID Price Location
1 20 USA
1 9 USA
2 12 CAN
3 587 TWN
3 59 TWN
3 7 TWN
4 78 USA
5 12 CAN
6 290 AUS
6 191 AUS
...
1000000 485 JAP
I tried merge, but R gave me the error "negative length vectors are not allowed". Both data frames are huge: one has over 2M rows and the other over 0.6M.
I also tried lapply inside a loop, but that failed too. I cannot see how to handle this except with two nested loops, and two nested loops will take a long time.
We can do a join with data.table to efficiently create the column 'Location':
library(data.table)
setDT(df1)[df2, Location := Location, on = .(ID)]
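As a self-contained sketch on toy data (the values below are invented to mirror the question's df1/df2), the update-join looks like this:

```r
library(data.table)

# Toy stand-ins for the question's data frames
df1 <- data.frame(ID = c(1, 1, 2, 3), Price = c(20, 9, 12, 587))
df2 <- data.frame(ID = c(1, 2, 3), Location = c("USA", "CAN", "TWN"))

# Convert df1 to a data.table by reference, then for each row of df1
# look up the matching ID in df2 and copy its Location into a new
# column. The i. prefix makes explicit that the right-hand side comes
# from df2; no full merged copy of the tables is materialized.
setDT(df1)[df2, Location := i.Location, on = .(ID)]
```

Because the column is added by reference rather than by building a merged copy, this stays practical even at the 2M-row scale in the question.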
This question already has answers here:
Trying to merge multiple csv files in R
(10 answers)
How to combine multiple .csv files in R?
(1 answer)
Closed 9 months ago.
I have a bunch of large CSVs that all contain exactly the same columns, and I need to combine them into a single CSV, essentially appending the data from each data frame underneath the previous one. Like this:
Table 1

Prop_ID  State  Pasture  Soy  Corn
1        WI     20       45   75
2        MI     10       80   122

Table 2

Prop_ID  State  Pasture  Soy  Corn
3        MN     152      0    15
4        IL     0        10   99

Output table

Prop_ID  State  Pasture  Soy  Corn
1        WI     20       45   75
2        MI     10       80   122
3        MN     152      0    15
4        IL     0        10   99
I have more than two tables to combine like this; any help would be appreciated. Thanks!
A possible solution, in base R:
rbind(df1, df2)
#> Prop_ID State Pasture Soy Corn
#> 1 1 WI 20 45 75
#> 2 2 MI 10 80 122
#> 3 3 MN 152 0 15
#> 4 4 IL 0 10 99
Or using dplyr:
dplyr::bind_rows(df1, df2)
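One detail worth knowing for the "more than two tables" case: bind_rows() also accepts a list of data frames, so all the tables can be stacked in a single call. A minimal sketch with invented stand-in tables:

```r
library(dplyr)

# Invented stand-ins for the question's tables
t1 <- data.frame(Prop_ID = 1:2, State = c("WI", "MI"))
t2 <- data.frame(Prop_ID = 3:4, State = c("MN", "IL"))

# A list of data frames (e.g. built with lapply(files, read.csv))
# can be passed directly -- no need to name every table
combined <- bind_rows(list(t1, t2))
```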
Assuming all the csv files are in a single directory, and that these are the only files in that directory, this solution, using data.table, should work.
library(data.table)
setwd('<directory with your csv files>')
files <- list.files(pattern = '.+\\.csv$')
result <- rbindlist(lapply(files, fread))
list.files(...) returns a character vector of the file names in a given directory that match a pattern. Here we ask only for files whose names end in .csv.
fread(...) is data.table's very fast file reader. We apply it to each file name (lapply(files, fread)) to produce a list containing the contents of each CSV, then use rbindlist(...) to combine them row-wise.
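If data.table is not an option, the same pipeline can be sketched in base R (slower than fread/rbindlist on big files, but dependency-free). The scratch directory and file names below exist only for the demonstration:

```r
# Write two small CSVs into a scratch directory for the demo
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Prop_ID = 1:2, State = c("WI", "MI")),
          file.path(tmp, "t1.csv"), row.names = FALSE)
write.csv(data.frame(Prop_ID = 3:4, State = c("MN", "IL")),
          file.path(tmp, "t2.csv"), row.names = FALSE)

# Base-R version of the pipeline: list the CSVs, read each one,
# then stack them row-wise with do.call(rbind, ...)
files <- list.files(tmp, pattern = '\\.csv$', full.names = TRUE)
result <- do.call(rbind, lapply(files, read.csv))
```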
I'm new to R and I'm stuck.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. Original DF has more columns but they are not relevant for the example.
Person_ID  curriculum_ID  School_ID
1          100            10
2          100            10
2          200            10
3          300            12
4          100            10
4          200            12
Occupations
Not all graduates have jobs. Everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns, but they are not relevant here.
Person_ID  JOB_ID  JOB_Type
1          1223    1
3          3334    1
3          2122    0
3          7843    0
4          4522    0
4          1240    1
End result:
A new DF named "Result" containing all graduations from the first DF (Graduations), with columns added from the second DF (Occupations).
Note that person 2 is not in the Occupations DF; their graduation rows remain, but the added columns stay empty.
Note that person 3 has multiple jobs, so extra duplicate rows are added.
Note that person 4 has both multiple jobs and multiple graduations, so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID  Curriculum_ID  School_ID  JOB_ID  JOB_Type
1          100            10         1223    1
2          100            10
2          200            10
3          300            12         3334    1
3          300            12         2122    0
3          300            12         7843    0
4          100            10         4522    0
4          100            10         1240    1
4          200            12         4522    0
4          200            12         1240    1
For me, the most difficult part is how to make R add the extra duplicate rows. I looked around for an example or tutorial about something similar but could not find one; probably I did not use the right keywords.
I would be very grateful for examples of how to code this.
You can use merge like:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join
library(dplyr)
left_join(Graduations, Occupations)
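With no by = argument, left_join() guesses the key from all shared column names and prints a message; naming the key explicitly is safer when the frames might share other columns. A sketch, reusing a cut-down version of the data above:

```r
library(dplyr)

# Cut-down version of the Graduations/Occupations data above
Graduations <- read.table(header = TRUE, text = "Person_ID curriculum_ID School_ID
1 100 10
2 100 10
3 300 12")
Occupations <- read.table(header = TRUE, text = "Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0")

# Spell out the key: join only on Person_ID, keeping every
# graduation row and duplicating it once per matching job
Result <- left_join(Graduations, Occupations, by = "Person_ID")
```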
This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 4 years ago.
I am trying to prepend "0000" to the values in a column of a df.
I tried to use paste() in a loop, but it is very performance-heavy, as I have over 2,000,000 rows; it takes forever.
Is there a smarter, less performance-heavy way to do it?
#DF:
CUSTID VALUE
103 12
104 10
105 15
106 12
... ...
#Desired result:
#DF:
CUSTID VALUE
0000103 12
0000104 10
0000105 15
0000106 12
... ...
How can this be achieved?
paste is vectorized, so it works on a whole vector of values (i.e. a column in a data frame). The following should work:
DF <- data.frame(
CUSTID = 103:107,
VALUE = 13:17
)
DF$CUSTID <- paste0('0000', DF$CUSTID)
This should give you:
CUSTID VALUE
1 0000103 13
2 0000104 14
3 0000105 15
4 0000106 16
5 0000107 17
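One caveat: paste0('0000', ...) always prepends exactly four zeros, so IDs of different widths end up with different total lengths. If the real goal is a fixed-width ID (say, always 7 characters), base R's formatC() (or sprintf("%07d", ...)) pads each value to the same width:

```r
# IDs of varying width (invented example values)
DF <- data.frame(CUSTID = c(103, 1034, 10345), VALUE = 13:15)

# Pad every ID with leading zeros to a fixed width of 7,
# regardless of how many digits the original value has
DF$CUSTID <- formatC(DF$CUSTID, width = 7, format = "d", flag = "0")
DF$CUSTID
# [1] "0000103" "0001034" "0010345"
```

Like paste0, formatC and sprintf are vectorized, so they stay fast on millions of rows.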
This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 7 years ago.
I don't know how to word the title exactly, so I will just do my best to explain below... Sorry in advance for the .csv format.
I have the following example dataset:
print(data)
ID Tag Flowers
1 1 6871 1
2 2 6750 1
3 3 6859 1
4 4 6767 1
5 5 6747 1
6 6 6261 1
7 7 6750 1
8 8 6767 1
9 9 6812 1
10 10 6746 1
11 11 6496 4
12 12 6497 1
13 13 6495 4
14 14 6481 1
15 15 6485 1
Notice that in rows 2 and 7, the tag 6750 appears twice. I observed one flower on plant 6750 on two separate days, i.e. two flowers in its lifetime. Basically, I want to add up every flower recorded for tag 6750, tag 6767, etc. across ~100 rows. Each tag appears more than once, usually around 4 or 5 times.
I feel like I need to apply the unlist function here, but I'm a little bit lost as to how I should do so.
Without any extra packages, you can use the function aggregate():
res<-aggregate(data$Flowers, list(data$Tag), sum)
This calculates a sum of the values in Flowers column for every value in the Tag column.
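The same computation can be written with aggregate()'s formula interface, which keeps the original column names in the result instead of Group.1 and x. A sketch on a small invented subset of the data:

```r
# Small invented subset: one row per observation day
data <- data.frame(Tag     = c(6871, 6750, 6750, 6496),
                   Flowers = c(1, 1, 1, 4))

# Sum Flowers within each Tag; the formula interface names the
# result columns Tag and Flowers automatically
res <- aggregate(Flowers ~ Tag, data = data, FUN = sum)
```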
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 8 years ago.
What I want to do is compare the first column of two data frames, find the indices where the values match, and assign the corresponding element of the first data frame's second column to the second data frame.
Please see the example:
dataframeA:
  id  number
1  1      45
2  3      78
3  5      67
4 12      18
5  4      44
6  8      32
7 13      41

dataframeB:
  id
1  1
2  4
3 12
4  5
5  8
Output: dataframeB
  id  number
1  1      45
2  4      44
3 12      18
4  5      67
5  8      32
I used two for loops with an if to compare, but it is really slow since my data is really big. How can I speed it up?
for (i in 1:nrow(A)) {
  for (j in 1:nrow(B)) {
    if (A[i, 1] == B[j, 1]) {
      B[j, 2] <- A[i, 2]
    }
  }
}
Thank you in advance.
Try
library(dplyr)
left_join(dataframeB, dataframeA)
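A detail that matters for this question: base merge() sorts the result by the key by default, while left_join() keeps dataframeB's original row order, which is exactly the ordering shown in the expected output. A self-contained sketch with the question's values:

```r
library(dplyr)

# The question's data frames
dataframeA <- data.frame(id     = c(1, 3, 5, 12, 4, 8, 13),
                         number = c(45, 78, 67, 18, 44, 32, 41))
dataframeB <- data.frame(id = c(1, 4, 12, 5, 8))

# For each id in dataframeB, pull the matching number from
# dataframeA; rows stay in dataframeB's order
out <- left_join(dataframeB, dataframeA, by = "id")
```

(In base R, merge(dataframeB, dataframeA, all.x = TRUE, sort = FALSE) is close, but its row order is not guaranteed to match the left table's.)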