Merge two data frames in R based on email address

I have a data file that lists my subjects' responses by email address, and another file that pairs each subject's email with his/her subject ID. How do I replace all the emails in the main data file with their subject IDs?

One of the things that's great about R is the ease with which one can create a minimal, complete and verifiable example. For this question, it's a simple matter of generating some example data, reading it into R and working out a potential solution. We'll create a list of student email addresses, IDs, and a separate data set containing exam scores.
nameData <- "email ID
Alicia@gmail.com 1
Jane@aol.com 2
Thomas@msn.com 3
Henry@yale.edu 4
LaShawn@uga.edu 5
"
examData <- "email exam1 exam2 exam3
Alicia@gmail.com 98 77 87
Jane@aol.com 99 88 93
Thomas@msn.com 73 62 73
Henry@yale.edu 100 98 99
LaShawn@uga.edu 84 98 92"
names <- read.table(text=nameData,header=TRUE,stringsAsFactors=FALSE)
exams <- read.table(text=examData,header=TRUE,stringsAsFactors=FALSE)
# merge data and drop email, which is first column
mergedData <- merge(names,exams)[,-1]
mergedData[order(mergedData$ID),]
...and the output, sorted by ID:
> mergedData[order(mergedData$ID),]
ID exam1 exam2 exam3
1 1 98 77 87
3 2 99 88 93
5 3 73 62 73
2 4 100 98 99
4 5 84 98 92
>
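Note that merge() keeps only the emails found in both files and reorders the rows by email. If the original row order of the main data file matters, a lookup with match() is one alternative. The following is a minimal sketch, not part of the answer above; the data frames ids and responses and their example.com addresses are made up for illustration.

```r
# Hypothetical lookup table (email -> ID) and main data file
ids <- data.frame(email = c("alicia@example.com", "jane@example.com", "thomas@example.com"),
                  ID = c(1, 2, 3), stringsAsFactors = FALSE)
responses <- data.frame(email = c("thomas@example.com", "alicia@example.com", "jane@example.com"),
                        exam1 = c(73, 98, 99), stringsAsFactors = FALSE)

# match() gives, for each response email, its row position in the lookup table
responses$ID <- ids$ID[match(responses$email, ids$email)]

# Drop the email column and put ID first; row order is unchanged
responses <- responses[, c("ID", "exam1")]
responses
```

Emails that do not appear in the lookup table end up with an ID of NA, which makes unmatched subjects easy to spot.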

Related

Change selected data.frame columns using value in first row

I have created a table that shows how much time each person in a team has spent on tasks each month.
Empl_level team_member 2022/05 2022/06 2022/07 2022/08
0 department 117 69 73 30
1 Diana 108 108 113 184
1 Irina 90 63 56 40
2 Inga 77 56 74 30
3 Elina 23 35 58 79
However, one of the "team members" is actually the department. How can I create a new dataset in which the department's time is divided equally among the real team members?
Empl_level team_member 2022/05 2022/06
1 Diana 108+(117/4) 108+(69/4)
1 Irina 90+(117/4) 63+(69/4)
2 Inga 77+(117/4) etc.
3 Elina 23+(117/4)
Using data.table, something like the following could work:
library(data.table)
setDT(df)
df[, names(df)[-(1:2)] := lapply(.SD, function(x) {x + x[1]/4}), .SDcols = !1:2][-1]
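The [-1] at the end removes the first "department" row. Note that the divisor 4 is the hard-coded number of real team members, and x[1] only picks up the department's value because the department is the first row.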
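For comparison, here is a base R sketch of the same calculation that does not rely on the department being the first row or on a fixed team size. It is an alternative to the data.table answer, not a restatement of it; the helper names is_dept, month_cols, n_members and result are illustrative, and only the first two month columns from the question are reproduced.

```r
# Rebuild part of the example table; check.names = FALSE keeps "2022/05" as is
df <- data.frame(
  Empl_level = c(0, 1, 1, 2, 3),
  team_member = c("department", "Diana", "Irina", "Inga", "Elina"),
  `2022/05` = c(117, 108, 90, 77, 23),
  `2022/06` = c(69, 108, 63, 56, 35),
  check.names = FALSE
)

is_dept <- df$team_member == "department"
month_cols <- setdiff(names(df), c("Empl_level", "team_member"))
n_members <- sum(!is_dept)

# Keep only the real team members, then add an equal share of the
# department's hours to each of them, column by column
result <- df[!is_dept, ]
for (col in month_cols) {
  result[[col]] <- result[[col]] + df[[col]][is_dept] / n_members
}
result
```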

How to extend a hash with multiple values in R

So I understand that in R, a hash() is similar to a dictionary. I would like to extract specific values from my dataframe and put them into a hash.
The componentindex column is where I have my keys, and the cluster.index + UniqueFileSourcesCount columns contain my values. So for the same key I have multiple values, e.g. hash {91: [1,15],[22,99] etc..
So I would like to create a hash that contains each key with multiple values, but I'm not sure how to do that.
mini_df <- head(df,10) #using a small df
compID <- unique(mini_df$componentindex) #list with unique keys
h1 <- hash()
for (i in 1:length(mini_df)){
if(compID == mini_df[i,"componentindex"]){
h1 <- hash(mini_df[i,"componentindex"] ,c(mini_df[i,"cluster.index"],mini_df[i,"UniqueFileSourcesCount"]))
}
#h2 <- append(h2,h1)
}
If I print h1, I end up having only the last value:
<hash> containing 1 key-value pair(s).
91 : 42 5
Which I understand, since I don't append to this hash but overwrite it. I'm not sure how to append to or expand hashes in R, and I have not been able to find a solution yet.
mini_df:
UniqueFileSourcesCount cluster.index componentindex
1 15 1 91
2 15 10 -1
3 99 22 91
4 63 23 1675
5 12 25 91
6 6 27 91
7 50 37 91
8 5 42 91
9 2 43 -1
10 2 69 -1
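One possible approach, sketched here with a plain named list standing in for the hash package (a deliberate substitution, so no extra package is needed): build one (cluster.index, UniqueFileSourcesCount) pair per row and group the pairs by key with split(). The names pairs and by_key are illustrative.

```r
# The sample data from the question
mini_df <- data.frame(
  UniqueFileSourcesCount = c(15, 15, 99, 63, 12, 6, 50, 5, 2, 2),
  cluster.index = c(1, 10, 22, 23, 25, 27, 37, 42, 43, 69),
  componentindex = c(91, -1, 91, 1675, 91, 91, 91, 91, -1, -1)
)

# One c(cluster.index, UniqueFileSourcesCount) pair per row
pairs <- Map(c, mini_df$cluster.index, mini_df$UniqueFileSourcesCount)

# Group the pairs by key; names of the result are the keys as strings
by_key <- split(pairs, mini_df$componentindex)

names(by_key)       # the distinct keys
by_key[["91"]]      # every pair stored under key 91
```

Because each key maps to a list of pairs, nothing is overwritten when a key occurs more than once.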

How to sum column based on value in another column in two dataframes?

I am trying to create a limit order book and in one of the functions I want to return a list that sums the column 'size' for the ask dataframe and the bid dataframe in the limit order book.
The output should be...
$ask
oid price size
8 a 105 100
7 o 104 292
6 r 102 194
5 k 99 71
4 q 98 166
3 m 98 88
2 j 97 132
1 n 96 375
$bid
oid price size
1 b 95 100
2 l 95 29
3 p 94 87
4 s 91 102
Total volume: 318 1418
Where the input is...
oid,side,price,size
a,S,105,100
b,B,95,100
I have a function book.total_volumes <- function(book, path) { ... } that should return the total volumes.
I tried to use aggregate but struggled with the fact that the limit order book contains both an ask and a bid data frame.
I appreciate any help; I am clearly a complete beginner. Only here to learn :)
If there is anything more I can add to make this question clearer, feel free to leave a comment!
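A minimal sketch of one way to get the totals, assuming the book is a list with ask and bid data frames shaped like the output shown above. The helper total_volumes() is hypothetical and takes only the book, unlike the book.total_volumes(book, path) signature in the question.

```r
# The order book from the expected output, rebuilt as a list of two data frames
book <- list(
  ask = data.frame(oid = c("a", "o", "r", "k", "q", "m", "j", "n"),
                   price = c(105, 104, 102, 99, 98, 98, 97, 96),
                   size = c(100, 292, 194, 71, 166, 88, 132, 375)),
  bid = data.frame(oid = c("b", "l", "p", "s"),
                   price = c(95, 95, 94, 91),
                   size = c(100, 29, 87, 102))
)

# Hypothetical helper: sum the size column on each side of the book
total_volumes <- function(book) {
  c(bid = sum(book$bid$size), ask = sum(book$ask$size))
}

vols <- total_volumes(book)
cat("Total volume:", vols[["bid"]], vols[["ask"]], "\n")
```

Since the two sides are already separate data frames, a plain sum() per side is enough and aggregate is not needed.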

using map function to create a dataframe from google trends data

I'm relatively new to R. I have a list of words that I want to run through the gtrends function (from the gtrendsR package) to look at the Google search hits, and then create a tibble with dates as the index and the relevant hits for each word as columns. I'm struggling to do this using the map functions in purrr.
I started off trying to use a for loop, but I've been told to try map from the tidyverse instead. This is what I had so far:
library(gtrendsR)
words = c('cruise', 'plane', 'car')
for (i in words) {
rel_word_data = gtrends(i,geo= '', time = 'today 12-m')
iot <- data.frame()
iot[i] <- rel_word_data$interest_over_time$hits
}
I need the gtrends function to take one word at a time, otherwise it will give a value for hits that is adjusted for the popularity of the other words. So basically, I need gtrends to run on the first word in the list, obtain the hits column in the interest_over_time section, and add it to a final dataframe that contains a column for each word and the date as index.
I'm a bit lost on how to do this without a for loop.
Assuming the gtrends output is the same length for every keyword, you can do the following:
# Load packages
library(purrr)
library(gtrendsR)
# Generate a vector of keywords
words <- c('cruise', 'plane', 'car')
# Download data by iterating gtrends over the vector of keywords
# Extract the hits data and make it into a dataframe for each keyword
trends <- map(.x = words,
              ~ as.data.frame(gtrends(keyword = .x, time = 'now 1-H')$interest_over_time$hits)) %>%
  # Add the keywords as column names to the three dataframes
  map2(.x = .,
       .y = words,
       ~ set_names(.x, nm = .y)) %>%
  # Convert the list of three dataframes to a single dataframe
  map_dfc(~ data.frame(.x))
# Check data
head(trends)
#> cruise plane car
#> 1 50 75 84
#> 2 51 74 83
#> 3 100 67 81
#> 4 46 76 83
#> 5 48 77 84
#> 6 43 75 82
str(trends)
#> 'data.frame': 59 obs. of 3 variables:
#> $ cruise: int 50 51 100 46 48 43 48 53 43 50 ...
#> $ plane : int 75 74 67 76 77 75 73 80 70 79 ...
#> $ car : int 84 83 81 83 84 82 84 87 85 85 ...
Created on 2020-06-27 by the reprex package (v0.3.0)
You can use map to get all the data as a list and use reduce to combine the data.
library(purrr)
library(gtrendsR)
library(dplyr)
map(words, ~gtrends(.x, geo = '', time = 'today 12-m')$interest_over_time %>%
      dplyr::select(date, !!.x := hits)) %>%
  reduce(full_join, by = 'date')
# date cruise plane car
#1 2019-06-30 64 53 96
#2 2019-07-07 75 48 97
#3 2019-07-14 73 48 100
#4 2019-07-21 74 48 100
#5 2019-07-28 71 47 100
#6 2019-08-04 67 47 97
#7 2019-08-11 68 56 98
#.....
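The map-then-reduce pattern can also be tried offline in base R. In the sketch below, fake_trends() is a made-up stand-in for gtrends(word)$interest_over_time (its hits values are arbitrary, not real Google Trends data), lapply() plays the role of map, and Reduce() with merge(..., all = TRUE) plays the role of reduce(full_join).

```r
# Hypothetical stand-in for gtrends(word)$interest_over_time;
# the hits values are arbitrary and only depend on the word length
fake_trends <- function(word) {
  data.frame(date = as.Date("2019-06-30") + 7 * 0:2,
             hits = nchar(word) + 0:2)
}

words <- c("cruise", "plane", "car")

# One data frame per word, with the hits column renamed to the word
per_word <- lapply(words, function(w) {
  out <- fake_trends(w)
  names(out)[names(out) == "hits"] <- w
  out
})

# Join the per-word data frames on date, keeping all dates
trends <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), per_word)
trends
```

Joining on date rather than binding columns means the result stays aligned even if one keyword returns fewer rows than another.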

Avoid quotation marks in column and row names when using write.table [duplicate]

I have the following data in a file called "data.txt":
pid 1 2 4 15 18 20
1_at 100 200 89 189 299 788
2_at 8 78 33 89 90 99
3_xt 300 45 53 234 89 34
4_dx 49 34 88 8 9 15
The data is separated by tabs.
Now I want to extract some columns from that table, based on the contents of a CSV file called "vector.csv", which contains the following data:
18,1,4,20
So I wanted to end with a modified file "datamod.txt" separated with tabs that would be:
pid 18 1 4 20
1_at 299 100 89 788
2_at 90 8 33 99
3_xt 89 300 53 34
4_dx 9 49 88 15
I have made, with some help, the following code:
fileName="vector.csv"
con=file(fileName,open="r")
controlfile<-readLines(con)
controls<-controlfile[1]
controlins<-controlfile[2]
test<-paste("pid",controlins,sep=",")
test2<-c(strsplit(test,","))
test3<-c(do.call("rbind",test2))
df<-read.table("data.txt",header=T,check.names=F)
CC <- sapply(df, class)
CC[!names(CC) %in% test3] <- "NULL"
df <- read.table("data.txt", header=T, colClasses=CC,check.names=F)
df<-df[,test3]
write.table(df,"datamod.txt",row.names=FALSE,sep="\t")
The problem that I got is that my resulting file has the following format:
"pid" "18" "1" "4" "20"
"1_at" 299 100 89 788
"2_at" 90 8 33 99
"3_xt" 89 300 53 34
"4_dx" 9 49 88 15
The question I have is how to avoid those quotation "" marks that appear in my saved file, so that the data appears like I would like to.
Any help?
Thanks
To quote from the help file for write.table:
quote
a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
any character or factor columns will be surrounded by double quotes.
If a numeric vector, its elements are taken as the indices of columns
to quote. In both cases, row and column names are quoted if they are
written. If FALSE, nothing is quoted.
Therefore
write.table(df,"datamod.txt",row.names=FALSE,sep="\t", quote = FALSE)
should work nicely.
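As a quick check, the following sketch writes a small data frame to a temporary file with quote = FALSE and reads the raw lines back. The data frame is a cut-down version of the example data, and the temporary file stands in for "datamod.txt".

```r
# Cut-down version of the example data; check.names = FALSE keeps numeric headers
df <- data.frame(pid = c("1_at", "2_at"),
                 `18` = c(299, 90),
                 `1` = c(100, 8),
                 check.names = FALSE)

# Write tab-separated output without quotes around names or character values
out <- tempfile(fileext = ".txt")
write.table(df, out, row.names = FALSE, sep = "\t", quote = FALSE)

# Inspect the raw lines of the file
lines <- readLines(out)
print(lines)
```

Neither the header nor the pid values are wrapped in double quotes in the written file.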
