Multiple rows to single cell space delimited values in pandas with group by - r

I have a data set similar to df1 here
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'value': [67, 45, 7, 5, 9]})
id value
1 67
1 45
2 7
2 5
2 9
I want to bring it to this form, with all the values corresponding to an id in one cell, separated by spaces.
id values
1 67 45
2 7 5 9
Here is my code
df2 = pd.DataFrame(df1['id'].unique())
df2.columns = ['id']
df2['values'] = np.nan
for i in df2['id']:
    s = ''
    for k in df1[df1['id'] == i]['value']:
        s = s + ' ' + str(k)
    df2.loc[df2['id'] == i, 'values'] = s.lstrip()
print(df2)
Is there a more pythonic way of doing this? I have 70000 unique ids, and each id may have anywhere from 1 to 20 values.
I am using
Anaconda python 3.5
pandas 0.20.1
numpy 1.12.1
windows 10
Also, how can we replicate the same in R?

Convert the 'value' column from int to string, then perform a groupby on 'id' and apply the str.join function:
# Convert 'value' column to string.
df1['value'] = df1['value'].astype(str)
# Perform a groupby and apply a string join.
df1 = df1.groupby('id')['value'].apply(' '.join).reset_index()
The resulting output:
id value
0 1 67 45
1 2 7 5 9
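To make it clearer what the groupby and join are doing, here is a minimal plain-Python sketch of the same logic using only the standard library. The `rows` list and the `join_values_by_id` helper are made up for illustration; this is not a replacement for the pandas call above, just the same idea spelled out.

```python
from collections import defaultdict

# Hypothetical stand-in for df1: (id, value) pairs in row order.
rows = [(1, 67), (1, 45), (2, 7), (2, 5), (2, 9)]


def join_values_by_id(pairs, sep=" "):
    """Collect the values for each id, then join them into one string."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(str(value))  # str() plays the role of astype(str)
    return {key: sep.join(values) for key, values in grouped.items()}


result = join_values_by_id(rows)
print(result)  # {1: '67 45', 2: '7 5 9'}
```

The conversion to string has to happen before the join in both versions, because joining only works on strings.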

Here is how to do it in R. It is the same approach:
df = data.frame('id'=c(1,1,2,2,2),'value'=c(67,45,7,5,9))
aggregate(cbind(values=value)~id,
data = df,
FUN = function(x){paste(x,collapse=' ')})

Related

Replace value from updated dataset based on number of instances it appears in a second dataset

I have a simple 2-column dataset containing the variables cluster_size and index. Originally all values of index were assigned a value of 1. Subsequently, I received a second dataset containing only a few clusters where index should be updated with different integer values.
I simply need to replace the index value from the updated dataset. My specific issue is that the values for cluster_size can repeat multiple times, but I only need to replace it for the number of instances it appears in the updated dataset. For instance, in the example data below, the cluster_size value of 34 appears three times, but only once in the updated data with an index of 6. This means that only one of these three rows should update to 6 (doesn't matter which one).
Code to recreate a 20-row sample of the original data (have), the updated subset (updated), and the desired dataset (want) is below. The actual data has tens of thousands of rows. I've tried several merge and loop functions (all too pathetic to waste your time by posting here), but can't seem to find an elegant solution.
# Data with original index cases
set.seed(03151813)
have <- data.frame(clust_size=sample(1:50,20,replace=TRUE),index=rep(1,times=20))
have <- have[order(have$clust_size),]
# Updated data only contains clusters that need updating of index
updated <- data.frame(clust_size=c(30,34,42,44,44,46),
index=c(2,6,4,8,9,4))
# Desired dataset
want <- data.frame(clust_size=have$clust_size,
index=c(rep(1,times=9),2,1,6,
1,1,1,4,1,8,9,4))
Here is a base R approach. Add row numbers to have and updated for each clust_size. So the clust_size of 34 will have rows numbered consecutively 1, 2, and 3.
Then, you can merge the two together on both clust_size and row number. If you include all.x you will get all rows from the first data frame have.
Final step is to replace the missing NA values in your new index column with the original index.
have$rn <- with(have, ave(seq_along(clust_size), clust_size, FUN = seq_along))
updated$rn <- with(updated, ave(seq_along(clust_size), clust_size, FUN = seq_along))
want <- merge(have, updated, all.x = TRUE, by = c("clust_size", "rn"))
want$index.y <- ifelse(is.na(want$index.y), want$index.x, want$index.y)
want[, c("clust_size", "index.y")]
An alternate version using dplyr would be something like this:
library(dplyr)
have2 <- have %>%
group_by(clust_size) %>%
mutate(rn = row_number())
updated2 <- updated %>%
group_by(clust_size) %>%
mutate(rn = row_number())
left_join(have2, updated2, by = c("clust_size", "rn")) %>%
mutate(index.y = coalesce(index.y, index.x))
Output
clust_size index.y
1 1 1
2 5 1
3 8 1
4 10 1
5 16 1
6 20 1
7 22 1
8 27 1
9 29 1
10 30 2
11 30 1
12 34 6
13 34 1
14 34 1
15 35 1
16 42 4
17 43 1
18 44 8
19 44 9
20 46 4
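The row-numbering trick can also be shown in a few lines of plain Python. This is only a sketch of the idea on a small made-up sample (not the seeded data from the question), and `occurrence_numbers` is a hypothetical helper name: each cluster size is numbered by how many times it has appeared so far, and the update is then matched on the pair (cluster size, occurrence number).

```python
from collections import defaultdict


def occurrence_numbers(keys):
    """Number each key by how many times it has appeared so far (1-based)."""
    seen = defaultdict(int)
    numbered = []
    for key in keys:
        seen[key] += 1
        numbered.append((key, seen[key]))
    return numbered


# Small made-up example: (clust_size, index) pairs.
have = [(30, 1), (30, 1), (34, 1), (34, 1), (34, 1), (44, 1), (44, 1)]
updated = [(30, 2), (34, 6), (44, 8), (44, 9)]

# Lookup keyed on (clust_size, occurrence number) -> new index.
new_index = {
    key: idx
    for key, (_, idx) in zip(occurrence_numbers(c for c, _ in updated), updated)
}

# Left join: keep every row of `have`, falling back to its original index.
want = [
    (size, new_index.get(key, old))
    for key, (size, old) in zip(occurrence_numbers(c for c, _ in have), have)
]
print(want)
# [(30, 2), (30, 1), (34, 6), (34, 1), (34, 1), (44, 8), (44, 9)]
```

Because the occurrence number is part of the key, a cluster size that appears once in the update replaces exactly one of its rows, and one that appears twice replaces exactly two.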

Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
x*vals[i]
}
for (i in vals) {
df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But in the resulting data frame (df2), each Room column is multiplied by the same element of the vector (vals) for every id. What I want is for each Room column (per id) to be multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. Build a lookup data frame that pairs each id with one element of vals, join it to df on id, and then use mutate to scale the Room columns by the matched value.
Also, in the future I'd recommend setting a seed when asking questions with random data, as it makes it easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df<-data.frame(id=rep(1:10,each=10),
Room1=rnorm(100,0.4,0.5),
Room2=rnorm(100,0.3,0.5),
Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
# One row per id, so the join keeps df at 100 rows
other_df <- data.frame(id = 1:10,
val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
mutate(Room1 = Room1*val,
Room2 = Room2*val,
Room3 = Room3*val)
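The per-id multiplication can be sketched in plain Python as a lookup from id to multiplier. The data below is made up for illustration (it is not the random data from the question), and the assumption that the i-th distinct id is paired with the i-th element of `vals` is mine.

```python
# Made-up data: three ids and one multiplier per id.
rows = [
    {"id": 1, "Room1": 0.5, "Room2": 1.0},
    {"id": 1, "Room1": 2.0, "Room2": -1.0},
    {"id": 2, "Room1": 0.5, "Room2": 1.0},
    {"id": 3, "Room1": 1.5, "Room2": 0.0},
]
vals = [10, 20, 30]

# Assumption: pair the i-th distinct id (in sorted order) with vals[i].
ids = sorted({row["id"] for row in rows})
multiplier = dict(zip(ids, vals))

# Scale every Room column by the multiplier that belongs to the row's id.
scaled = [
    {k: (v if k == "id" else v * multiplier[row["id"]]) for k, v in row.items()}
    for row in rows
]
for row in scaled:
    print(row)
```

The lookup has exactly one entry per id, so the number of rows does not change; only the Room values do.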

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each of the data frames looks like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
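The reason for going through as.character() first is that R stores a factor as integer codes pointing into a vector of level labels, and as.numeric() on a factor returns those codes rather than the labels. That is why the attempts in the question produced small numbers instead of subject ids. Below is a toy Python model of this; the ids are made up, and the model assumes the levels were created from strings and therefore sorted as strings.

```python
# Toy model of an R factor: sorted string labels plus 1-based integer codes.
sids = ["5402", "25403", "980", "5402"]

levels = sorted(set(sids))                    # string sort: ['25403', '5402', '980']
codes = [levels.index(s) + 1 for s in sids]   # roughly what as.numeric(factor) returns

print(levels)  # ['25403', '5402', '980']
print(codes)   # [2, 1, 3, 2]

# Going through the labels first, like as.numeric(as.character(x)),
# recovers the real ids and sorts them as numbers.
numeric_ids = sorted({int(s) for s in sids})
print(numeric_ids)  # [980, 5402, 25403]
```

Sorting the codes orders the ids by their string labels, which puts 25403 before 5402; sorting the converted labels gives the numeric order the question asks for.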
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),exp2=sort(x = unique(y),decreasing = F))

How to find the group of rows of a data frame where an error occurs

I have a two-column data frame containing thousands of IDs, where each ID has hundreds of data rows; in other words, a data frame of about 6 million rows. I am grouping this data frame by ID (using either dplyr or data.table) and running a "tso" (outlier detection) function on each group. The problem is that after hours of computation it returns an error related to the ARIMA specification of one of the IDs. How can I identify the ID (or the row number) where my function is returning the error? (If I can detect it, I can remove that ID from the data frame.)
I tried to manually perform my function on subgroups of this dataframe however I cannot reach the erroneous ID because there are thousands of IDs so it takes me weeks to find them this way.
library(data.table)
library(tsoutliers)  # provides tso()
outlier.detection <- function(x, iter) {
  y <- as.ts(x)
  out2 <- tso(y, maxit.iloop = iter, tsmethod = "auto.arima", remove.method = "bottom-up", cval = 3)
  y[out2$outliers$ind] <- NA
  return(y)
}
df <- data.table(outlying1)
setkey(df, id)
test <- df[, list(new.weight = outlier.detection(weight, iter = 1)), by = id]
The above function finds the anomalies and replaces them with NAs. Here is an example:
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b 90
10 b 82
it will look like the following,
ID weight
1 a 50
2 a 50
3 a 51
4 a 51.5
5 a 52
6 b 80
7 b 81
8 b 81.5
9 b NA
10 b 82

Subset by first and last value per group

I have a data frame in R with two columns, temp and timeStamp. The temp values are recorded at regular intervals. A portion of the data frame looks like this:
I have to create a line chart showing changes in temp over time. As can be seen here, temp values remain the same for several timeStamp values. These repeated values increase the size of the data file, and I want to remove them. So the output should look like this:
Showing just the values where there is a change.
I cannot think of a way to get this done in R. Any input in the right direction would be really helpful.
Here's a dplyr solution:
library(dplyr)
# Toy data
df <- data.frame(time = seq(20), temp = c(rep(60, 5), rep(61, 7), rep(59, 3), rep(60, 5)))
# Now filter for the first and last rows and ones bracketing a temperature change
df %>% filter(temp!=lag(temp) | temp!=lead(temp) | time==min(time) | time==max(time))
time temp
1 1 60
2 5 60
3 6 61
4 12 61
5 13 59
6 15 59
7 16 60
8 20 60
If the data are grouped by a third column (id), just add group_by(id) %>% before the filtering step.
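The filter condition can be spelled out in plain Python as a sketch: a row is kept when its temperature differs from the previous row, differs from the next row, or when it is the first or last row. The function name `keep_change_points` is made up for illustration; the data matches the toy data above.

```python
def keep_change_points(times, temps):
    """Keep rows whose temp differs from the previous or next row, plus both ends."""
    kept = []
    last = len(temps) - 1
    for i, (t, temp) in enumerate(zip(times, temps)):
        is_end = i == 0 or i == last
        changed_from_prev = i > 0 and temp != temps[i - 1]   # like temp != lag(temp)
        changes_next = i < last and temp != temps[i + 1]     # like temp != lead(temp)
        if is_end or changed_from_prev or changes_next:
            kept.append((t, temp))
    return kept


times = list(range(1, 21))
temps = [60] * 5 + [61] * 7 + [59] * 3 + [60] * 5
result = keep_change_points(times, temps)
print(result)
# [(1, 60), (5, 60), (6, 61), (12, 61), (13, 59), (15, 59), (16, 60), (20, 60)]
```

Because each row is compared only with its neighbours, a temperature that comes back later (60 here) is treated as a new run, and its boundaries are kept.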
One option would be using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'temp', we subset the first and last observation (.SD[c(1L, .N)]) per each group. If there is only a single value per group, we take the row as such (else .SD).
library(data.table)
setDT(df1)[, if(.N>1) .SD[c(1L, .N)] else .SD, by =temp]
# temp val
#1: 22.50 1
#2: 22.50 4
#3: 22.37 5
#4: 22.42 6
#5: 22.42 7
Or a base R option with duplicated. We check the duplicated values in 'temp' (output is a logical vector), and also check the duplication from the reverse side (fromLast=TRUE). Use & to find the elements that are TRUE in both cases, negate (!) and subset the rows of 'df1'.
df1[!(duplicated(df1$temp) & duplicated(df1$temp,fromLast=TRUE)),]
# temp val
#1 22.50 1
#4 22.50 4
#5 22.37 5
#6 22.42 6
#7 22.42 7
data
df1 <- data.frame(temp=c(22.5, 22.5, 22.5, 22.5, 22.37,22.42, 22.42), val=1:7)
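One caveat worth checking against your own data: both the data.table and the duplicated approaches group by the temperature value itself, not by consecutive runs. On the sample above that makes no difference, but if a temperature recurs later in the series, only its very first and very last rows survive. The plain-Python sketch below mimics the duplicated-style filter on the toy series from the dplyr answer (60 recurs at the end); `first_last_per_value` is a hypothetical helper name.

```python
def first_last_per_value(temps):
    """Indices of the first and last row of each distinct value."""
    first, last = {}, {}
    for i, temp in enumerate(temps):
        first.setdefault(temp, i)  # remember only the first occurrence
        last[temp] = i             # overwrite until the last occurrence
    return sorted(set(first.values()) | set(last.values()))


temps = [60] * 5 + [61] * 7 + [59] * 3 + [60] * 5
kept_times = [i + 1 for i in first_last_per_value(temps)]
print(kept_times)  # [1, 6, 12, 13, 15, 20]
```

Times 5 and 16, where the series leaves 60 and later returns to it, are dropped here, so a line chart drawn from this subset would not show those two change points. If values can recur, comparing each row with its neighbours is the safer choice.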
